CHTC Fellows
Ben Staehle
Mentor(s):
Joe Bartowiak
Tracking server inventory and elevation
The CHTC maintains over 1,000 servers on the UW–Madison campus and across the country. Keeping track of server elevation (datacenter and rack location), serial numbers, and asset tags is a challenge that is always in need of improvement. This project will focus on taking existing data from the CHTC hardware monitoring system and automatically exporting it to other systems such as Google spreadsheets or ITAdvisor. After a successful summer, the student fellow will gain skills in Python, monitoring, and the Google Docs APIs.
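As a rough illustration of the export step, the sketch below turns inventory records into spreadsheet-ready CSV text sorted by physical location. The field names (`hostname`, `datacenter`, `rack`, `elevation`, `serial`, `asset_tag`) are hypothetical, since the monitoring system's schema is not described here, and CSV stands in for an upload to a spreadsheet service.

```python
import csv
import io

# Hypothetical column names; the real monitoring schema is not given in the text.
COLUMNS = ["hostname", "datacenter", "rack", "elevation", "serial", "asset_tag"]


def inventory_to_csv(records):
    """Render inventory records (dicts) as CSV text, sorted by physical location."""

    def location_key(rec):
        # Sort by datacenter, then rack, then numeric elevation within the rack.
        return (rec.get("datacenter", ""), rec.get("rack", ""), int(rec.get("elevation") or 0))

    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=COLUMNS, lineterminator="\n")
    writer.writeheader()
    for rec in sorted(records, key=location_key):
        # Missing fields become empty cells so gaps stay visible; extra fields are dropped.
        writer.writerow({col: rec.get(col, "") for col in COLUMNS})
    return buf.getvalue()
```

Leaving missing values as empty cells, rather than skipping the row, keeps incomplete records visible to whoever audits the sheet.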
Kristina Zhao
Mentor(s):
Emma Turetsky and Ian Ross
Integrating PyTorch and Pelican
PyTorch is one of the most popular machine learning frameworks. An important aspect of using it is the data engineering: how is input data fed into the model during training? Going from “tutorial scale” problems to cutting-edge research requires drastically different techniques around data handling.
For this project, we aim to better integrate Pelican into the PyTorch community in two ways: technical mechanisms (implementing the fsspec interface for Pelican) and documentation (tutorials and recipes for scaling PyTorch-based training using a combination of HTCondor and Pelican).
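To make the data-loading idea concrete, here is a minimal sketch of a map-style dataset that fetches one object per sample through a pluggable opener. It is an illustration only, not the project's design: the class name, the `opener` and `decode` parameters, and the example URLs are all hypothetical. In real use the opener might be something like `lambda url: fsspec.open(url, "rb")`, which is an assumption about how a Pelican fsspec backend would be wired in. The sketch avoids importing torch by relying only on the `__len__` and `__getitem__` protocol that map-style datasets follow.

```python
import io


class RemoteObjectDataset:
    """Map-style dataset: one training sample per remote object URL."""

    def __init__(self, urls, opener, decode):
        self.urls = list(urls)
        self.opener = opener  # callable: url -> context manager yielding a binary file
        self.decode = decode  # callable: bytes -> sample

    def __len__(self):
        return len(self.urls)

    def __getitem__(self, index):
        # Each access opens the object, reads it fully, and decodes it.
        with self.opener(self.urls[index]) as handle:
            return self.decode(handle.read())


# In-memory stand-in for remote storage, so the sketch runs without a network.
FAKE_STORE = {
    "pelican://example.org/ns/a.csv": b"1,2",
    "pelican://example.org/ns/b.csv": b"3",
}


def fake_opener(url):
    return io.BytesIO(FAKE_STORE[url])


def parse_ints(raw):
    return [int(field) for field in raw.decode().split(",")]


dataset = RemoteObjectDataset(sorted(FAKE_STORE), fake_opener, parse_ints)
```

Keeping the opener as a parameter means the same dataset class can be exercised against local files or an in-memory store before it is pointed at a data federation.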
Neha Talluri
Mentor(s):
Jason Patton
Where in the world am I
In PATh, an important part of the infrastructure is the “glidein”, a client that starts at a remote location and provides computational cycles for research. In the past, glideins have relied on configuration at remote locations to determine their location, but this often results in missing or incorrect information. This project will focus on enhancing glideins so that they can detect and report where they are running in the world, possibly including data like geolocation and institutional owner. After a successful summer, the student fellow will gain skills in Python, Bash, and layer 3 networking.
Patrick Brophy
Mentor(s):
Haoming Meng
Expanded Pelican Origin Monitoring
The Pelican origin service is responsible for exporting objects in the backend
storage to the data federation. As it is the “entry point” for the data, understanding
the load on the origin and its activities is key to keeping the federation healthy.
Pelican takes monitoring data from the web server component and feeds it into the popular
Prometheus software to store time series about the activity. This project would focus on:
- Implementing new monitoring probes to complement the existing information.
- Forwarding the raw, unsummarized data to an ElasticSearch database for further analysis.
- Designing visualizations to provide administrators with an overview of the origin’s activities.
- Implementing alerts when there are health issues with the origin.
After a successful summer, the student fellow will gain skills in using the Go language, the Prometheus monitoring system (and other Cloud Native technologies), and web design.
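As a small illustration of the probe and alert items above, the sketch below renders a counter in the Prometheus text exposition format and applies a simple error-rate threshold. It is written in Python for brevity, although the project itself targets Go. The metric name, the label, and the 5% threshold are hypothetical, and label values are assumed not to need escaping.

```python
def render_counter(name, help_text, samples):
    """Render a counter in the Prometheus text exposition format.

    samples maps a tuple of (label, value) pairs to a numeric count.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for labels, value in sorted(samples.items()):
        # Assumes label values contain no quotes, backslashes, or newlines.
        label_str = ",".join(f'{key}="{val}"' for key, val in labels)
        lines.append(f"{name}{{{label_str}}} {value}")
    return "\n".join(lines) + "\n"


def error_rate_alert(ok, failed, threshold=0.05):
    """Return True when the failed fraction exceeds the (hypothetical) threshold."""
    total = ok + failed
    return total > 0 and failed / total > threshold
```

In a real deployment, alert conditions of this kind would more likely be expressed as Prometheus alerting rules than evaluated in application code; the function here only shows the arithmetic.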
Pratham Patel
Mentor(s):
Brian Lin
Enhancing container image build system
Container images are a widely used technology to package and distribute software and services for use in systems such as Docker or Kubernetes. The PATh project builds hundreds of these images on a weekly basis, but the build system needs improvement to support more images and additional use cases. This project will focus on taking the existing system and adding configurable, per-image build options. After a successful summer, the student fellow will gain skills in Docker containers, GitHub Actions, and Bash.
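One way to picture per-image build options is as overrides layered on top of shared defaults. The sketch below is only an illustration under that assumption: the option names (`platforms`, `push`, `build_args`) are hypothetical and do not describe the existing build system.

```python
import copy


def resolve_build_options(defaults, overrides):
    """Merge per-image overrides onto shared defaults without mutating either."""
    unknown = set(overrides) - set(defaults)
    if unknown:
        # Reject misspelled option names instead of silently ignoring them.
        raise ValueError(f"unknown build options: {sorted(unknown)}")

    resolved = copy.deepcopy(defaults)
    for key, value in overrides.items():
        if isinstance(resolved[key], dict) and isinstance(value, dict):
            # Dictionary options (such as build arguments) are merged key by key.
            resolved[key].update(value)
        else:
            # Scalars and lists are replaced wholesale.
            resolved[key] = copy.deepcopy(value)
    return resolved
```

Failing on unknown keys is a deliberate choice here: with hundreds of images, a typo in one image's configuration is easier to catch at build time than after a wrong image has been published.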
Ryan Boone
Mentor(s):
Cole Bollig and Rachel Lombardi
Grid Exerciser
The OSPool is a very large, very dynamic, heterogeneous high-throughput system composed of execute points from dozens of campuses all over the United States. Sometimes something will go wrong at one of these many sites, on one network, or at one storage point, and it is difficult to determine where the problem is. This project proposes the design and construction of a “Grid Exerciser”, which intentionally sends sample jobs to targeted locations on the OSPool to verify correct operation and sufficient performance. The project will also have a reporting and visualization component so that the voluminous results can be understood by a human in a concise manner.
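The reporting side can be sketched as a per-site rollup of probe results. The example below is a minimal illustration, assuming each probe job yields a hypothetical `(site, succeeded, runtime_seconds)` tuple; the 90% success threshold is likewise an assumption, not a project requirement.

```python
from collections import defaultdict


def summarize_by_site(results, min_success=0.9):
    """Condense probe results into one row per site, worst success rate first.

    results is an iterable of (site, succeeded, runtime_seconds) tuples.
    """
    totals = defaultdict(lambda: {"runs": 0, "ok": 0, "runtime": 0.0})
    for site, succeeded, runtime in results:
        entry = totals[site]
        entry["runs"] += 1
        entry["ok"] += 1 if succeeded else 0
        entry["runtime"] += runtime

    report = []
    for site, entry in totals.items():
        rate = entry["ok"] / entry["runs"]
        report.append({
            "site": site,
            "runs": entry["runs"],
            "success_rate": rate,
            "mean_runtime": entry["runtime"] / entry["runs"],
            "flagged": rate < min_success,  # sites that need a human look
        })
    # Sort so the most troubled sites appear at the top of the report.
    report.sort(key=lambda row: (row["success_rate"], row["site"]))
    return report
```

Sorting by success rate puts the sites most likely to need attention first, which is one simple way to make a large volume of results readable at a glance.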
Thinh Nguyen
Mentor(s):
Justin Hiemstra
ML for failure classification in the OSPool
The OSPool runs hundreds of thousands of jobs every day on dozens of different sites, each unique in its own way. Naturally, there are many hundreds of failures, most of which the system works around, but with added latency to workflow completion. This project would attempt to automatically classify failures from job logs to detect common patterns and highlight places for humans to look to fix common failures with the most payoff. Students working on this project will gain experience applying ML techniques to real-world problems.
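Before any ML is applied, a rule-based baseline can already surface common patterns: mask the variable parts of each log line and count the resulting templates. The sketch below shows that baseline only; it is not an ML classifier, and the masking rules and sample messages are hypothetical rather than taken from real OSPool logs.

```python
import re
from collections import Counter

# Order matters: mask IP addresses before bare numbers so they are not split up.
_MASKS = [
    (re.compile(r"\b\d+\.\d+\.\d+\.\d+\b"), "<IP>"),
    (re.compile(r"/[\w./-]+"), "<PATH>"),
    (re.compile(r"\b\d+\b"), "<NUM>"),
]


def template(line):
    """Replace variable fields so lines describing the same failure compare equal."""
    for pattern, token in _MASKS:
        line = pattern.sub(token, line)
    return line.strip()


def common_failures(lines, top=3):
    """Return the most frequent failure templates with their counts."""
    return Counter(template(line) for line in lines).most_common(top)
```

Templates produced this way could also serve as input features or labels for a later learned classifier, which keeps the baseline useful even if it is eventually replaced.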
Wil Cram
Mentor(s):
Greg Thain
Schedd performance analysis for human
The condor_schedd is a single-threaded program, and when it is overloaded, it is difficult for administrators to understand why. There are some statistics about what it is doing, but there is no clear way to present this information to an administrator in a useful form. Students working on this project would build visualizations of complex data, and work with end users and facilitators to tune output for real-world human consumption.