Collage photos of current and previous CHTC interns.
Photo: Morgridge Institute for Research

Open projects for CHTC Fellows

This page lists software development, infrastructure services, and research facilitation projects for CHTC Fellow applicants to consider.

Updated October 30th, 2025

Software Development

Distributed Tracing and Log Aggregation for Pelican Request Lifecycle

In a distributed system with multiple services communicating with one another, a key challenge is correlating logging information from different services that handle a single job or client request. This project aims to design and implement a method for aggregating all logs generated during a client request by introducing a unique identifier that acts as a foreign key to link every log entry together. This focused approach will ensure administrators can precisely trace the path of a request through the system, identifying the services involved and pinpointing the exact location of errors or performance-related events recorded in the logs.

The primary objective of this project is to implement a system for auto-aggregation and tracing of request data across Pelican’s distributed architecture. The goal is to move beyond siloed log files so that administrators have a complete picture of job execution. The core solution involves determining how to aggregate the logging data and creating a unique identifier that is generated and propagated throughout the system, acting as a foreign key that links every log entry together. The fellow will be responsible for defining the tracing methodology, propagating the request ID through the application layers, and making the necessary adjustments to the Pelican code. The fellow will also develop client tooling that uses this trace ID for diagnostics and will learn to inject diagnostic information back into the result ad for retrospective analysis via HTCondor.
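As a rough illustration of the foreign-key idea (a minimal sketch only, not Pelican’s actual logging code; the logger name and log format below are hypothetical), a single identifier generated at the start of a request can be attached to every log record that request produces:

    import logging
    import uuid

    class TraceIdFilter(logging.Filter):
        """Attach a fixed trace_id attribute to every record that passes through."""
        def __init__(self, trace_id: str):
            super().__init__()
            self.trace_id = trace_id

        def filter(self, record: logging.LogRecord) -> bool:
            record.trace_id = self.trace_id
            return True

    def handle_request() -> None:
        trace_id = uuid.uuid4().hex                    # one ID per client request
        logger = logging.getLogger("pelican.client")   # hypothetical logger name
        logger.addFilter(TraceIdFilter(trace_id))
        logger.info("starting transfer")               # both entries now carry trace_id,
        logger.info("transfer complete")               # so they can be joined downstream

    if __name__ == "__main__":
        logging.basicConfig(
            level=logging.INFO,
            format="%(asctime)s %(name)s trace=%(trace_id)s %(message)s",
        )
        handle_request()

Once aggregated into a central store, entries from every service that recorded the same trace value can be retrieved with a single query.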

Questions the fellow will have to answer in the course of the project: How do we define the foreign key when one Pelican command could translate to multiple transfers or jobs? How best can we aggregate the logs into a searchable system? How can the system handle the continuously growing size of the logs?

By the end of the fellowship, the fellow will acquire a comprehensive understanding of distributed data systems and gain hands-on experience designing and implementing a tracing system for log correlation. They will be responsible for defining the auto-aggregation and tracing methodology using this unique identifier, and for propagating the request ID through all layers of the Pelican code. This work will include adjusting selected places in the Pelican code and developing client tooling that uses the trace ID. Additionally, the fellow will solidify their practical skills in Python and Go programming.

Project Objectives:

The project’s specific objectives are broken down to reflect both the high-level design and the necessary low-level implementation:

  • Implement UUID-based Tracing: Establish the methodology for UUID generation/propagation and use it as a foreign key for log correlation across all services (one possible propagation approach is sketched after this list).
  • Augment Service Logs: Adjust selected places in the Pelican code to ensure the UUID is consistently captured.
  • Develop Client Tooling: Create tools that run on the client or service hosts to leverage the UUID for direct log retrieval and diagnostics.
  • System Integration: Create a system for client-side request tracking that leverages the aggregated data.
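One common way to propagate such an identifier between services is to carry it in an HTTP request header. The sketch below is only an assumption for illustration; the header name and URL are placeholders, not part of Pelican’s actual protocol:

    import urllib.request
    import uuid

    # Generate the trace ID once, at the outermost client layer.
    trace_id = uuid.uuid4().hex

    # Attach it to the outgoing request; each downstream service would copy the
    # header into its own logs (and into any further requests it makes).
    req = urllib.request.Request(
        "https://example.org/api/object",      # placeholder URL
        headers={"X-Trace-Id": trace_id},      # placeholder header name
    )
    with urllib.request.urlopen(req) as resp:
        print(resp.status, "trace:", trace_id)

Client tooling can then use the same identifier to query the aggregated logs for exactly the entries belonging to that request.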

Prerequisite skills or education that would be good for the Fellow to have to work on the project:

  • Python and Golang, required
  • Linux/CLI, required
  • HTTP development, preferred
  • Distributed Computing, preferred
  • Git/GitHub/GitHub Actions, preferred
  • Docker/Kubernetes, preferred

Scaling ML Inference from Laptop to High-Throughput Computing

Transform how researchers deploy machine learning models for science at scale! Many scientists are able to prototype and test ML models on a small scale, but hit a wall when trying to move beyond their laptop to process full datasets: millions of satellite images, thousands of genomic samples, or massive text corpora. This fellowship tackles one of the most critical bottlenecks in computational research: bridging the gap between proof-of-concept and production-scale ML inference on high-throughput computing systems. As a fellow, you’ll start by exploring the landscape: working directly with researchers across domains to discover their inference workflows, identify the most effective tools and frameworks, and uncover common scaling patterns. You’ll then translate these insights into practical resources to simplify access to HTC infrastructure.

The ideal candidate will have:

  • Strong technical writing skills
  • Experience with Python and ML frameworks (PyTorch, TensorFlow, or scikit-learn)
  • Bonus: Familiarity with batch computing, containers, or ML deployment workflows

What You’ll Do:

  • Discover and document: Work with close CHTC collaborators in computer vision, genomics, NLP, and other fields to identify target use cases, optimal inference frameworks, and recurring challenges.
  • Build an inference engine pattern library: Create a comprehensive collection of reusable submit file templates, data staging strategies, and best-practice patterns for common scenarios (model comparison, parameter sweeps, batch processing).
  • Develop end-to-end tutorials: Create complete, domain-specific guides demonstrating the full journey from laptop to HTC-scale inference.

Project Objectives:

  • Through user feedback and conversations with CHTC collaborators, identify targets for high-throughput inference guides: As inference-driven scientific research grows, work with the CHTC user base (including close collaborators in computer vision on satellite or microscopy imagery, geological image classification, genomics variant calling, or NLP text mining) to understand the inference task and identify the most appropriate inference framework (such as NVIDIA’s Triton Inference Server).
  • Design a set of experiments: Measure inference performance, focusing on overall throughput, resource utilization, efficiency, and scalability.
  • Create end-to-end tutorial workflows: Develop at least two complete, domain-specific tutorials that demonstrate the full path from laptop-scale to HTC-scale inference, with emphasis on submit file patterns, data staging strategies, and common troubleshooting scenarios.
  • Develop submit file templates and pattern library: Build a collection of well-documented, modular submit file templates for common inference scenarios (single model/many data chunks, model comparison, parameter sweeps) that researchers can adapt rather than write from scratch.
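To make the “single model / many data chunks” pattern concrete, here is a hedged sketch of how a reusable template might be generated with plain Python. The executable name, file names, and resource requests are placeholders rather than recommended settings:

    from pathlib import Path

    # Placeholder HTCondor submit description: one job per data chunk, with the
    # same model file transferred to every job.
    SUBMIT_TEMPLATE = """\
    executable              = run_inference.sh
    arguments               = --model {model} --input $(chunk)
    transfer_input_files    = {model}, $(chunk)
    should_transfer_files   = YES
    request_cpus            = 1
    request_memory          = 4GB
    request_gpus            = 1
    log                     = inference.log
    output                  = out/$(Cluster)_$(Process).out
    error                   = err/$(Cluster)_$(Process).err
    queue chunk from chunks.txt
    """

    def write_submit(model: str, chunks: list[str]) -> None:
        """Write the submit file and the list of data chunks it iterates over."""
        Path("chunks.txt").write_text("\n".join(chunks) + "\n")
        Path("inference.sub").write_text(SUBMIT_TEMPLATE.format(model=model))

    if __name__ == "__main__":
        write_submit("model.pt", ["data/chunk_000.npz", "data/chunk_001.npz"])

A researcher adapting such a template would mainly change the executable, the resource requests, and the way chunks.txt is produced.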

Prerequisite skills or education:

  • Strong technical writing and documentation skills, required
  • Experience with Python and machine learning frameworks (PyTorch, TensorFlow, or scikit-learn), required
  • Prior experience with batch computing systems or distributed computing concepts, preferred
  • Familiarity with containerization (Docker/Apptainer), preferred
  • Experience with ML model deployment or inference workflows, preferred
  • Background in a research domain that uses ML inference at scale, preferred

Infrastructure Services

Characterizing Backfill Availability in Kubernetes

HTCondor can take advantage of the unused capacity (“backfill”) of shared compute resources, such as those in a Kubernetes cluster, by provisioning HTCondor worker nodes (“glideins”) on the unused capacity. As the lowest priority workload running on a cluster at any given time, glideins running on backfill resources may be evicted (i.e. killed) at any time for nearly any reason. An HTCondor job running on a glidein when that glidein is evicted will lose some or all of its progress and must be rescheduled and restarted, so ideally glideins should only accept jobs if they are unlikely to be evicted for a long enough time for jobs to finish and/or checkpoint their progress. By characterizing the lifecycles of glideins running on backfill resources (e.g. in a Kubernetes cluster), we may be able to improve glidein scheduling decisions and increase job throughput.

To begin addressing the problem of characterizing the lifetimes of backfill glideins, the fellow will conduct a study of the lifetimes of backfill workers (in this case, glideins that don’t actually run any HTCondor jobs) running in a Kubernetes cluster alongside simulated higher-priority (“foreground”) workloads. The fellow will develop a scheduling toolset that generates parameterizable synthetic foreground workloads in the Kubernetes cluster, and then a monitoring toolset that observes the lifecycles of backfill workers running alongside varied foreground workloads. Finally, the fellow will apply statistical analysis to the data gathered by the monitoring tool to characterize the expected lifetimes of backfill workers as a function of the foreground workload’s parameters.
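As a script-level sketch of the kind of parameterized foreground Job such a toolset might create (using the official kubernetes Python client; the image, namespace, and resource figures are arbitrary examples, and the eventual tool may instead be an operator as described below):

    from kubernetes import client, config

    def create_foreground_job(name: str, cpu: str, memory: str,
                              duration_s: int, replicas: int) -> None:
        """Create a synthetic foreground Job that holds cpu/memory for duration_s."""
        container = client.V1Container(
            name="busy",
            image="busybox",                       # placeholder image
            command=["sleep", str(duration_s)],    # duration parameter
            resources=client.V1ResourceRequirements(
                requests={"cpu": cpu, "memory": memory},
                limits={"cpu": cpu, "memory": memory},
            ),
        )
        job = client.V1Job(
            api_version="batch/v1",
            kind="Job",
            metadata=client.V1ObjectMeta(name=name, labels={"role": "foreground"}),
            spec=client.V1JobSpec(
                parallelism=replicas,              # replica-count parameter
                completions=replicas,
                template=client.V1PodTemplateSpec(
                    spec=client.V1PodSpec(containers=[container],
                                          restart_policy="Never"),
                ),
            ),
        )
        client.BatchV1Api().create_namespaced_job(namespace="default", body=job)

    if __name__ == "__main__":
        config.load_kube_config()                  # or load_incluster_config()
        create_foreground_job("fg-test-1", cpu="500m", memory="256Mi",
                              duration_s=600, replicas=4)

Varying cpu, memory, duration_s, and replicas over a schedule provides the parameterized foreground load against which backfill lifetimes can be measured.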

Project Objectives:

  • Survey the set of variables that define a typical Kubernetes workload, such as memory and CPU usage, duration, and replica count.
  • Design an algorithm for scheduling parameterizable Kubernetes workloads based on the selected variables.
  • Implement a Kubernetes operator that automatically generates synthetic Kubernetes workloads using the designed algorithm.
  • Design a method for collecting data on the lifetime of backfill Kubernetes workloads based on existing Kubernetes monitoring tools (a minimal watch-based approach is sketched after this list).
  • Deploy both the workload generation operator and backfill monitoring tool to a Kubernetes cluster.
  • Gather data on backfill workload lifecycles under varied foreground workload generation configurations.
  • Prepare an analysis of gathered data.
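For the data-collection side, one minimal approach (again only a sketch, assuming backfill pods carry a role=backfill label; a real deployment might instead rely on Prometheus or the Kubernetes events API) is to watch pod lifecycle events and record timestamps:

    from kubernetes import client, config, watch

    def record_backfill_lifetimes(namespace: str = "default") -> None:
        """Print start/end timestamps for pods labelled as backfill workloads."""
        config.load_kube_config()
        v1 = client.CoreV1Api()
        w = watch.Watch()
        for event in w.stream(v1.list_namespaced_pod, namespace=namespace,
                              label_selector="role=backfill"):
            pod = event["object"]
            phase = pod.status.phase
            if phase in ("Succeeded", "Failed"):
                start = pod.status.start_time
                cs = (pod.status.container_statuses or [None])[0]
                # container termination time, if the container has finished
                end = cs.state.terminated.finished_at if cs and cs.state.terminated else None
                print(pod.metadata.name, phase, start, end)

    if __name__ == "__main__":
        record_backfill_lifetimes()

Aggregating these start/end pairs across many foreground configurations yields the lifetime distributions to be analyzed.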

Prerequisite skills or education that would be good for the Fellow to have to work on the project:

  • Familiarity with Unix and Python, required
  • Familiarity with HTCondor, Kubernetes, and/or Go, preferred

Research Facilitation

Measuring the Throughput of HTC Workloads

CHTC’s High Throughput Computing (HTC) system supports hundreds of users and thousands of jobs each day. It is optimized for workloads, or sets of jobs, in which many jobs can run in parallel as computational capacity becomes available. This project aims to better understand the impact of workload size and requirements on overall throughput through empirical measurement of workloads in CHTC. A key component of the project will be developing tools to a) submit sample workloads and b) gather metrics about their performance. Once these tools are developed, they can be used to run experiments with different workload types.
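As one hedged illustration of what the metric-gathering side could look like (using the HTCondor Python bindings’ event-log reader; the log path and the two metrics shown are examples, not the project’s final metric set):

    import htcondor

    def queue_and_run_times(event_log: str) -> dict:
        """Compute per-job queue wait and run time (seconds) from an HTCondor event log."""
        submitted, started = {}, {}
        waits, runs = {}, {}
        for event in htcondor.JobEventLog(event_log).events(stop_after=0):
            job = (event.cluster, event.proc)
            if event.type == htcondor.JobEventType.SUBMIT:
                submitted[job] = event.timestamp
            elif event.type == htcondor.JobEventType.EXECUTE:
                started[job] = event.timestamp
                if job in submitted:
                    waits[job] = event.timestamp - submitted[job]
            elif event.type == htcondor.JobEventType.JOB_TERMINATED:
                if job in started:
                    runs[job] = event.timestamp - started[job]
        return {"queue_wait_s": waits, "run_time_s": runs}

    if __name__ == "__main__":
        print(queue_and_run_times("workload.log"))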

Project Objectives:

  • Develop a synthetic workload generation tool that uses job parameters to automatically generate test HTC workloads.
  • Identify a list of 5-6 core metrics for measuring HTC workload performance in collaboration with the Facilitation team.
  • Develop a tool that extracts these core metrics using HTCondor log files or other command line tools.
  • Use the above tools end to end to generate a test workload and measure its outcomes.

Prerequisite skills or education that would be good for the Fellow to have to work on the project:

  • Familiarity with Unix and Python
  • Familiarity with git
  • Familiarity with HTCondor

Questions: chtc-jobs@g-groups.wisc.edu