Open projects for CHTC Fellows
This page lists software development, infrastructure services, and research facilitation projects for CHTC Fellow applicants to consider.
Updated October 30th, 2025
Software Development
Distributed Tracing and Log Aggregation for Pelican Request Lifecycle
In a distributed system with multiple services communicating with one another, a key challenge is correlating logging information from different services that handle a single job or client request. This project aims to design and implement a method for aggregating all logs generated during a client request by introducing a unique identifier that acts as a foreign key to link every log entry together. This focused approach will ensure administrators can precisely trace the path of a request through the system, identifying the services involved and pinpointing the exact location of errors or performance-related events recorded in the logs.
The primary objective of this project is to implement a system for automatically aggregating and tracing request data across Pelican’s distributed architecture. The goal is to move beyond siloed log files so that administrators have a complete picture of job execution. The core solution involves determining how to aggregate the logging data and creating a unique identifier that is generated once and propagated throughout the system, acting as a foreign key that links every log entry together. The fellow will be responsible for defining the tracing methodology, propagating the request ID through all application layers, and making targeted adjustments to the Pelican code. The fellow will also develop client tooling that uses this trace ID for diagnostics and will learn to inject diagnostic information back into the job’s result ad for retrospective analysis via HTCondor.
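Pelican itself is written in Go, so the real changes would live in that codebase; the Python sketch below only illustrates the core idea behind the identifier: generate it once per request, forward it to downstream services (here via a made-up HTTP header, X-Trace-Id), and stamp it on every structured log line so entries can later be joined on that field. None of the names used here are Pelican's actual conventions.

```python
# A minimal illustration (not Pelican's real implementation) of trace-ID
# propagation: generate one UUID per request, forward it in an HTTP header,
# and stamp it on every structured log line so entries can be joined later.
import json
import logging
import uuid

import requests  # used only to show header propagation; the header name below is hypothetical


def new_trace_id() -> str:
    """Generate the unique identifier that links all log entries for one request."""
    return str(uuid.uuid4())


def log_with_trace(logger: logging.Logger, trace_id: str, message: str, **fields) -> None:
    """Emit a structured (JSON) log line carrying the trace ID as a foreign key."""
    logger.info(json.dumps({"trace_id": trace_id, "msg": message, **fields}))


def fetch_object(url: str, trace_id: str) -> bytes:
    """Forward the trace ID to the next service via a placeholder HTTP header."""
    resp = requests.get(url, headers={"X-Trace-Id": trace_id}, timeout=30)
    resp.raise_for_status()
    return resp.content


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    log = logging.getLogger("client")
    tid = new_trace_id()
    log_with_trace(log, tid, "starting transfer", url="https://example.org/data")
    # Each downstream service would read X-Trace-Id and include it in its own
    # logs, so every entry for this request can later be aggregated by trace_id.
```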
Questions the fellow will have to answer in the course of the project: How do we define the foreign key when one Pelican command could translate to multiple transfers or jobs? How best can we aggregate the logs into a searchable system? How can the system handle the continuously growing size of the logs?
By the end of the fellowship, the fellow will acquire a comprehensive understanding of distributed data systems and gain hands-on experience designing and implementing a tracing system for log correlation. They will be responsible for defining the auto-aggregation and tracing methodology built on this unique identifier and for propagating the request ID through all layers of the Pelican code. This work will include adjusting selected places in the Pelican code and developing client tooling that uses the trace ID. Additionally, the fellow will solidify their practical skills in Python and Go programming.
Project Objectives:
The project’s specific objectives are broken down to reflect both the high-level design and the necessary low-level implementation:
- Implement UUID-based Tracing: Establish the methodology for UUID generation/propagation and use it as a foreign key for log correlation across all services.
- Augment Service Logs: Adjust selected places in the Pelican code to ensure the UUID is consistently captured.
- Develop Client Tooling: Create tools that run on the client or service hosts to leverage the UUID for direct log retrieval and diagnostics (a minimal sketch follows this list).
- System Integration: Create a system for client-side request tracking that leverages the aggregated data.
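One possible shape for the client tooling objective is sketched below. It assumes, purely for illustration, that aggregated logs are available as JSON lines with a trace_id field; neither the file layout nor the field name is a defined Pelican interface.

```python
# Hypothetical diagnostic tool: pull every log entry for one trace ID out of an
# aggregated JSON-lines log file. The file layout and field name are assumptions.
import argparse
import json
import sys


def entries_for_trace(path: str, trace_id: str):
    """Yield parsed log records whose trace_id field matches."""
    with open(path) as fh:
        for line in fh:
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                continue  # skip unstructured lines
            if record.get("trace_id") == trace_id:
                yield record


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Show all log entries for a trace ID")
    parser.add_argument("logfile")
    parser.add_argument("trace_id")
    args = parser.parse_args()
    for rec in entries_for_trace(args.logfile, args.trace_id):
        json.dump(rec, sys.stdout)
        sys.stdout.write("\n")
```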
Prerequisite skills or education that would be good for the Fellow to have to work on the project:
- Python and Golang (required)
- Linux/CLI (required)
- HTTP development (preferred)
- Distributed Computing (preferred)
- Git/GitHub/GitHub Actions (preferred)
- Docker/Kubernetes (preferred)
Scaling ML Inference from Laptop to High-Throughput Computing
Transform how researchers deploy machine learning models for science at scale! Many scientists are able to prototype and test ML models on a small scale, but hit a wall when trying to move beyond their laptop to process full datasets: millions of satellite images, thousands of genomic samples, or massive text corpora. This fellowship tackles one of the most critical bottlenecks in computational research: bridging the gap between proof-of-concept and production-scale ML inference on high-throughput computing systems. As a fellow, you’ll start by exploring the landscape: working directly with researchers across domains to discover their inference workflows, identify the most effective tools and frameworks, and uncover common scaling patterns. You’ll then translate these insights into practical resources to simplify access to HTC infrastructure.
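As a rough illustration of the laptop-to-HTC transition, the sketch below shows the single-model, many-data-chunks pattern: the same small script runs in every job, each on a different chunk of the data. The scikit-learn/joblib model format, the file names, and the arguments are assumptions made for this example only.

```python
# Sketch of the "one script per data chunk" pattern for HTC-scale inference.
# File names, arguments, and the joblib/NumPy formats are illustrative only;
# any framework could be substituted.
import argparse

import joblib       # loads a serialized (e.g. scikit-learn) model
import numpy as np


def main():
    parser = argparse.ArgumentParser(description="Run inference on one data chunk")
    parser.add_argument("--model", required=True, help="path to serialized model")
    parser.add_argument("--chunk", required=True, help="path to .npy feature chunk")
    parser.add_argument("--out", required=True, help="where to write predictions")
    args = parser.parse_args()

    model = joblib.load(args.model)      # same model for every job
    features = np.load(args.chunk)       # each job gets a different chunk
    predictions = model.predict(features)
    np.save(args.out, predictions)       # results are gathered after all jobs finish


if __name__ == "__main__":
    main()
```

Each HTC job would invoke the script with a different --chunk argument, so throughput scales with the number of slots the pool can provide rather than with a single machine.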
The ideal candidate will have:
- Strong technical writing skills
- Experience with Python and ML frameworks (PyTorch, TensorFlow, or scikit-learn)
- Bonus: Familiarity with batch computing, containers, or ML deployment workflows
What You’ll Do:
- Discover and document: Work with close CHTC collaborators in computer vision, genomics, NLP, and other fields to identify target use cases, optimal inference frameworks, and recurring challenges
- Build an inference engine pattern library: Create a comprehensive collection of reusable submit file templates, data staging strategies, and best-practice patterns for common scenarios (model comparison, parameter sweeps, batch processing)
- Develop end-to-end tutorials: Create complete, domain-specific guides demonstrating the full journey from laptop to HTC-scale inference.
Project Objectives:
- Through user feedback and conversations with CHTC collaborators, identify targets for high-throughput inference guides: As inference-driven scientific research grows, work with the CHTC user base (including close collaborators in computer vision on satellite or microscopy imagery, geological image classification, genomics variant calling, or NLP text mining) to understand the inference task and identify the most appropriate inference framework (such as NVIDIA’s Triton inference server).
- Design a set of experiments: Measure inference performance, focusing on overall throughput, resource utilization, efficiency, and scalability.
- Create end-to-end tutorial workflows: Develop at least two complete, domain-specific tutorials that demonstrate the full path from laptop-scale to HTC-scale inference, with emphasis on submit file patterns, data staging strategies, and common troubleshooting scenarios
- Develop submit file templates and a pattern library: Build a collection of well-documented, modular submit file templates for common inference scenarios (single model/many data chunks, model comparison, parameter sweeps) that researchers can adapt rather than write from scratch (a minimal sketch follows this list)
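The sketch below shows what one entry in such a pattern library might look like: a Python template that writes an HTCondor submit file for the single-model, many-chunks scenario. The wrapper script run_inference.sh, the input file names, the resource requests, and the chunk count are placeholders a researcher would replace.

```python
# Illustrative generator for one entry in the pattern library: a
# "single model / many data chunks" submit file. The wrapper script,
# input files, resource requests, and chunk count are placeholders.
N_CHUNKS = 100

SUBMIT_TEMPLATE = """\
executable              = run_inference.sh
arguments               = --model model.pkl --chunk chunk_$(Process).npy --out preds_$(Process).npy
transfer_input_files    = model.pkl, chunk_$(Process).npy
should_transfer_files   = YES
when_to_transfer_output = ON_EXIT
request_cpus            = 1
request_memory          = 2GB
request_disk            = 2GB
log                     = inference.log
output                  = inference_$(Process).out
error                   = inference_$(Process).err
queue {n_chunks}
"""

if __name__ == "__main__":
    with open("inference.sub", "w") as fh:
        fh.write(SUBMIT_TEMPLATE.format(n_chunks=N_CHUNKS))
    print("Wrote inference.sub; submit with: condor_submit inference.sub")
```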
Prerequisite skills or education:
- Strong technical writing and documentation skills, required
- Experience with Python and machine learning frameworks (PyTorch, TensorFlow, or scikit-learn), required
- Prior experience with batch computing systems or distributed computing concepts, preferred
- Familiarity with containerization (Docker/Apptainer), preferred
- Experience with ML model deployment or inference workflows, preferred
- Background in a research domain that uses ML inference at scale, preferred
Infrastructure Services
No infrastructure services projects are currently available.
Research Facilitation
Measuring HTC Workload Performance
CHTC’s High Throughput Computing (HTC) system supports hundreds of users and thousands of jobs each day. It is optimized for workloads, or sets of jobs, where many jobs can run in parallel as computational capacity becomes available. This project aims to better understand the impact of workload size and requirements on overall throughput through empirical measurement of workloads in CHTC. A key component of the project will be developing tools to a) submit sample workloads and b) gather metrics about their performance. Once these tools are developed, they can be used to run experiments with different workload types.
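A minimal sketch of the first of these tools is shown below, assuming a recent version of the htcondor Python bindings and using /bin/sleep jobs as a stand-in synthetic workload; the job count, sleep duration, and resource requests are arbitrary placeholders.

```python
# Sketch of a synthetic workload generator, assuming a recent version of the
# htcondor Python bindings. Parameters are illustrative placeholders.
import htcondor


def submit_sleep_workload(n_jobs: int, sleep_seconds: int) -> int:
    """Submit n_jobs identical sleep jobs as a stand-in for a real workload."""
    description = htcondor.Submit({
        "executable": "/bin/sleep",
        "arguments": str(sleep_seconds),
        "request_cpus": "1",
        "request_memory": "128MB",
        "request_disk": "128MB",
        "log": "workload.log",            # shared event log, used later for metrics
        "output": "job_$(Process).out",
        "error": "job_$(Process).err",
    })
    schedd = htcondor.Schedd()
    result = schedd.submit(description, count=n_jobs)
    return result.cluster()


if __name__ == "__main__":
    cluster_id = submit_sleep_workload(n_jobs=50, sleep_seconds=300)
    print(f"Submitted cluster {cluster_id}")
```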
Project Objectives:
- Develop a synthetic workload generation tool that uses job parameters to automatically generate test HTC workloads.
- Identify a list of 5-6 core metrics for measuring HTC workload performance in collaboration with the Facilitation team.
- Develop a tool that extracts these core metrics using HTCondor log files or other command line tools (a minimal sketch follows this list).
- Use the above tools end-to-end to generate a test workload and measure its outcomes.
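As referenced in the objectives above, a minimal sketch of the metrics-extraction tool: it reads the shared event log written by the synthetic workload above and reports two simple example metrics (per-job queue wait and overall makespan), again assuming the htcondor Python bindings. The real metric set would be the 5-6 core metrics chosen with the Facilitation team.

```python
# Sketch of metric extraction from an HTCondor job event log, assuming the
# htcondor Python bindings; "workload.log" matches the synthetic workload above.
import htcondor
from htcondor import JobEventType


def summarize(log_path: str) -> dict:
    """Compute simple example metrics: per-job queue wait and overall makespan."""
    submit_times, execute_times, end_times = {}, {}, {}
    for event in htcondor.JobEventLog(log_path).events(stop_after=0):
        key = (event.cluster, event.proc)
        if event.type == JobEventType.SUBMIT:
            submit_times[key] = event.timestamp
        elif event.type == JobEventType.EXECUTE:
            execute_times.setdefault(key, event.timestamp)
        elif event.type == JobEventType.JOB_TERMINATED:
            end_times[key] = event.timestamp

    waits = [execute_times[k] - submit_times[k] for k in execute_times if k in submit_times]
    makespan = (max(end_times.values()) - min(submit_times.values())
                if submit_times and end_times else None)
    return {
        "jobs_submitted": len(submit_times),
        "jobs_completed": len(end_times),
        "mean_queue_wait_s": sum(waits) / len(waits) if waits else None,
        "makespan_s": makespan,
    }


if __name__ == "__main__":
    print(summarize("workload.log"))
```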
Prerequisite skills or education that would be good for the Fellow to have to work on the project:
- Familiarity with Unix and Python
- Familiarity with git
- Familiarity with HTCondor
Questions: chtc-jobs@g-groups.wisc.edu