CHTC Fellows

Ben Staehle

Joe Bartowiak

Tracking server inventory and elevation

The CHTC maintains over 1,000 servers on the UW–Madison campus and across the country. Keeping track of server elevation (datacenter and rack location), serial numbers, and asset tags is an ongoing challenge, and the current process has room for improvement. This project will focus on taking existing data from the CHTC hardware monitoring system and automatically exporting it to other systems such as Google spreadsheets or ITAdvisor. After a successful summer, the student fellow will gain skills in Python, monitoring, and the Google Docs APIs.

Kristina Zhao

Emma Turetsky and Ian Ross

Integrating PyTorch and Pelican

PyTorch is one of the most popular machine learning frameworks. An important aspect of using it is the data engineering: how is input data fed into the model during training? Going from “tutorial scale” problems to cutting-edge research requires drastically different techniques around data handling.

For this project, we aim to better integrate Pelican into the PyTorch community in two ways: a technical mechanism (implementing the fsspec interface for Pelican) and documentation (tutorials and recipes for scaling PyTorch-based training using a combination of HTCondor and Pelican).

Neha Talluri

Jason Patton

Where in the world am I

In PATh, an important part of the infrastructure is the “glidein”, a client that starts at a remote location and provides computational cycles for research. In the past, glideins have relied on configuration at remote locations to determine their location, but this often results in missing or incorrect information. This project will focus on enhancing glideins so that they can detect and report where in the world they are running, possibly including data like geolocation and institutional owner. After a successful summer, the student fellow will gain skills in Python, Bash, and layer 3 networking.

Patrick Brophy

Haoming Meng

Expanded Pelican Origin Monitoring

The Pelican origin service is responsible for exporting objects in the backend storage to the data federation. As it is the “entry point” for the data, understanding the load on the origin and its activities is key to keeping the federation healthy.
Pelican takes monitoring data from the web server component and feeds it into the popular Prometheus software to store time series about the activity. This project would focus on:

  • Implementing new monitoring probes to complement the existing information.
  • Forwarding the raw, unsummarized data to an Elasticsearch database for further analysis.
  • Designing visualizations to provide administrators with an overview of the origin’s activities.
  • Implementing alerts when there are health issues with the origin.

After a successful summer, the student fellow will gain skills in using the Go language, the Prometheus monitoring system (and other Cloud Native technologies), and web design.

Pratham Patel

Brian Lin

Enhancing container image build system

Container images are a widely used technology to package and distribute software and services for use in systems such as Docker or Kubernetes. The PATh project builds hundreds of these images on a weekly basis, but the build system needs improvement to support more images and additional use cases. This project will focus on taking the existing system and adding configurable, per-image build options. After a successful summer, the student fellow will gain skills in Docker containers, GitHub Actions, and Bash.

Ryan Boone

Cole Bollig and Rachel Lombardi

Grid Exerciser

The OSPool is a very large, very dynamic, heterogeneous high-throughput system composed of execute points from dozens of campuses all over the United States. Sometimes something will go wrong at one of these many sites, or on one network, or at one storage point, and it is difficult to determine where the problem is. This project proposes the design and construction of a “Grid Exerciser”, which intentionally sends sample jobs to targeted locations on the OSPool to verify correct operation and sufficient performance. The project will also have a reporting and visualization component so that a human can understand the voluminous results in a concise form.

Thinh Nguyen

Justin Hiemstra

ML for failure classification in the OSPool

The OSPool runs hundreds of thousands of jobs every day on dozens of different sites, each unique in its own way. Naturally, there are many hundreds of failures, most of which the system works around, but at the cost of added latency to workflow completion. This project would attempt to automatically classify failures from job logs, detecting common patterns and highlighting where human attention would fix the common failures with the most payoff. Students working on this project will gain experience applying ML techniques to real-world problems.
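
As a rough illustration of what "detecting common patterns" in job logs can mean, the sketch below groups failure messages by masking their variable parts (numbers, paths, IP addresses) and counting the resulting templates. This is a simple non-ML baseline, not the project's method; the masking rules, function names, and sample messages are all hypothetical and do not reflect actual OSPool log formats.

```python
import re
from collections import Counter

# Hypothetical masking rules: replace the variable parts of a message
# with placeholders so that similar failures collapse into one template.
# Order matters: IPs and paths are masked before bare numbers.
_MASKS = [
    (re.compile(r"\b\d+\.\d+\.\d+\.\d+\b"), "<IP>"),
    (re.compile(r"/\S+"), "<PATH>"),
    (re.compile(r"\b\d+\b"), "<NUM>"),
]


def template(line):
    """Return the message with variable fields replaced by placeholders."""
    for pattern, placeholder in _MASKS:
        line = pattern.sub(placeholder, line)
    return line.strip()


def rank_failures(lines):
    """Count messages per template, most frequent first."""
    return Counter(template(line) for line in lines).most_common()


# Invented sample messages, for illustration only.
sample = [
    "Job 1042 failed: cannot open /var/lib/data/in.txt",
    "Job 77 failed: cannot open /tmp/x",
    "Connection to 10.0.0.5 timed out after 30 seconds",
]

for message, count in rank_failures(sample):
    print(count, message)
```

A grouping like this could serve as a preprocessing step, with the resulting templates used as features for a learned classifier.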

Wil Cram

Greg Thain

Schedd performance analysis for humans

The condor_schedd is a single-threaded program, and when it is overloaded, it is difficult for administrators to understand why. Some statistics about what it is doing are available, but there is no clear way to present them usefully to an administrator. Students working on this project would build visualizations of complex data and work with end users and facilitators to tune the output for real-world human consumption.