Open projects for CHTC Fellows
This page lists software development, infrastructure services, and research facilitation projects for CHTC Fellow applicants to consider.
Applications are open!
To apply, send an email to chtc-jobs@g-groups.wisc.edu with the following information:
- A resume/CV (in PDF format) with contact information. Be sure to include your full name, email address, the name of your university or college, and your current or planned major and/or area of study.
- A cover letter that describes your interest in the Fellowship Program. For example, you may wish to expand on 3 or 4 topics from the following list: your background, skills, and strengths; what software, computing, or scientific topics appeal to you; previous research experience, if any; what you may want to pursue as a future career; and what benefits you would like to gain from this program. If a project from the list below already interests you, you can mention it here as well. However, it is not required to have a mentor or project finalized to submit an application. Successful applicants will be connected with mentors to select and define their projects in a second step following this application.
Updated January 15th, 2025
Software Development
Expanding Pelican with Globus Interoperability
Modern scientific research relies on data that is Findable, Accessible, Interoperable, and Reusable (FAIR) – a foundational yet complex requirement in today’s large-scale scientific research, where automation and fault tolerance must scale to manage hundreds, thousands, and even millions of individual computations.
The Pelican Platform is at the forefront of this ecosystem, and this project will focus on giving the fellow(s) an opportunity to advance FAIR data principles by developing a robust client that allows Pelican to interface with Globus, a leading service for secure, high-performance data transfers. This client will enable Pelican users to easily locate and access data across platforms, facilitating seamless data sharing and reuse between Pelican and Globus. Through this work, fellows will explore the technical and security aspects of cross-platform interoperability, learning to design solutions that make data more accessible and usable for global research efforts.
Project Objectives:
This project aims to:
- Introduce fellows to the FAIR data principles and the technical challenges of creating interoperable and accessible systems in a distributed computing environment
- Explore and implement secure data transfer protocols, especially as applied to Globus integrations
- Offer experience with cloud-based data systems, high-performance computing, and RESTful API development
By the end of the fellowship, fellows will gain hands-on experience in developing and testing components for distributed data systems, with potential exposure to technologies such as Golang/C++, REST APIs, OAuth/token authorization, XRootD, and Globus.
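To make the interoperability goal concrete, here is a minimal Python sketch of the kind of exchange the new client must handle: presenting a bearer token to the public Globus Transfer REST API and discovering endpoints. The OAuth2 flow that produces the token is out of scope here (the GLOBUS_TOKEN environment variable is assumed to already hold a transfer-scoped access token), and the production client would likely be written in Go or C++.

```python
# A hedged sketch, not the eventual Pelican client: query the public Globus
# Transfer REST API (v0.10) with a bearer token to discover endpoints.
# Assumes GLOBUS_TOKEN holds a valid, transfer-scoped OAuth2 access token.
import os

import requests

TRANSFER_API = "https://transfer.api.globus.org/v0.10"
headers = {"Authorization": f"Bearer {os.environ['GLOBUS_TOKEN']}"}

# Full-text search for endpoints (Globus collections) matching a keyword.
resp = requests.get(
    f"{TRANSFER_API}/endpoint_search",
    params={"filter_fulltext": "CHTC", "limit": 5},
    headers=headers,
    timeout=30,
)
resp.raise_for_status()

for endpoint in resp.json()["DATA"]:
    print(endpoint["id"], endpoint["display_name"])
```

A real integration would add token acquisition and refresh, retries, and the directory-listing and transfer operations Pelican needs on top of this.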
Prerequisite skills or education:
- C++ or Golang (required)
- Networking fundamentals (required)
- Security basics: OAuth, SSL/TLS (preferred)
- Git/GitHub/GitHub Actions (preferred)
- Linux/CLI (preferred)
- Docker/Kubernetes (preferred)
Turbocharge HTCondor with Smart Aggregation and Indexing
This is your opportunity to showcase your software engineering skills and contribute to HTCondor, an internationally recognized open-source software suite in use at hundreds of universities and government labs worldwide. You’ll design, code, test, and commit a new service within the HTCondor Software Suite to GitHub. If successful, your work will be deployed at numerous prominent institutions globally. Not only will you impress future employers by pointing to your GitHub contributions, but you’ll also feel proud knowing your efforts are helping translate computing power into research discoveries. HTCondor harnesses the computing power of thousands of servers at organizations like DreamWorks Animation, SpaceX, Boeing, Hubble Telescope Operations, and CERN, and has managed computations for international scientific collaborations that have resulted in two Nobel Prizes.
An HTCondor deployment can complete thousands of computing tasks daily. The information about these completed tasks is archived, but this archive is essentially a simple flat file. You will create a service (in either Python or C++) that monitors this archive and, when new data is added, updates a SQLite relational database with index and aggregate information. Additionally, you will design the database schema to ensure efficient data organization and retrieval. This will significantly speed up query responses for users seeking information about completed tasks.
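A minimal sketch of the core pattern follows, assuming a toy record format rather than HTCondor's real history format: remember a byte offset into the archive, parse only what was appended since the last run, and maintain an indexed job table plus an aggregate table in a single SQLite transaction. The schema and the parser are placeholders you would design and replace.

```python
# A hedged sketch of the archive-indexing service, not production code.
# It assumes a toy record format (one whitespace-separated record per line:
# "cluster proc owner exit_code wall_secs day"); the real service would
# parse HTCondor's actual history records.
import sqlite3

ARCHIVE = "history"          # hypothetical path to the flat archive file
DB = "history_index.db"

SCHEMA = """
CREATE TABLE IF NOT EXISTS jobs (
    cluster_id INTEGER,
    proc_id    INTEGER,
    owner      TEXT,
    exit_code  INTEGER,
    wall_secs  REAL,
    PRIMARY KEY (cluster_id, proc_id)
);
CREATE INDEX IF NOT EXISTS idx_jobs_owner ON jobs(owner);
CREATE TABLE IF NOT EXISTS daily_totals (
    day       TEXT PRIMARY KEY,
    jobs      INTEGER,
    wall_secs REAL
);
CREATE TABLE IF NOT EXISTS progress (
    id INTEGER PRIMARY KEY CHECK (id = 1),
    byte_offset INTEGER
);
"""

def parse_records(chunk):
    """Toy stand-in for real history-file parsing."""
    for line in chunk.splitlines():
        cluster, proc, owner, exit_code, wall, day = line.split()
        yield (int(cluster), int(proc), owner, int(exit_code),
               float(wall), day)

def update(conn):
    """Index and aggregate any records appended since the last run."""
    row = conn.execute("SELECT byte_offset FROM progress").fetchone()
    offset = row[0] if row else 0
    with open(ARCHIVE) as f:   # assumes the archive file exists
        f.seek(offset)
        chunk = f.read()
        new_offset = f.tell()
    with conn:  # one transaction keeps rows, aggregates, and offset in sync
        for cluster, proc, owner, exit_code, wall, day in parse_records(chunk):
            conn.execute("INSERT OR REPLACE INTO jobs VALUES (?,?,?,?,?)",
                         (cluster, proc, owner, exit_code, wall))
            conn.execute(
                "INSERT INTO daily_totals VALUES (?, 1, ?) "
                "ON CONFLICT(day) DO UPDATE SET jobs = jobs + 1, "
                "wall_secs = wall_secs + excluded.wall_secs",
                (day, wall))
        conn.execute("INSERT OR REPLACE INTO progress VALUES (1, ?)",
                     (new_offset,))

if __name__ == "__main__":
    conn = sqlite3.connect(DB)
    conn.executescript(SCHEMA)
    update(conn)
```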
Project Objectives:
- Acquire hands-on experience writing production-quality software and with the full software engineering lifecycle: design, coding, code reviews, regression testing, documenting, and working with a professional team
- Build an understanding of high throughput computing architecture and systems performance in the context of the HTCondor Software Suite
- Learn and practice crafting relational database schemas
- Learn and practice effectively navigating the API of an extremely popular and widely used database system (SQLite) from Python and/or C++
- Refine your abilities in public speaking, presentation, and communicating highly complex subject matter
Prerequisite Skills:
- Python or C++ (at least one required)
- Introductory exposure to SQL (preferred)
- Linux/CLI (preferred)
- Docker (preferred)
- Git/GitHub (preferred)
High Throughput Inference using CHTC and OSPool
Advancements in the predictive power of AI models have led to an expansion in their application and impact in scientific research. An increasing challenge accompanying this rise in utility (and complexity) is the desire to leverage these models for enormous volumes of predictions. This project aims to profile technologies and strategies for high throughput inference, using the computing resources available in CHTC and the OSPool to tackle these predictive tasks at enormous scale. By conducting experiments that measure the overall performance of these approaches, we aim to provide users with guidance and insight into how best to leverage powerful models in their research. These explorations will include strategies for efficient model and data movement, batch processing, and data splitting, and may extend to distributed inference, NVIDIA technologies (Triton, Multi-Process Service), and other state-of-the-art AI inference technologies.
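As a flavor of the baseline experiment, the sketch below times a toy PyTorch model at several batch sizes and reports predictions per second; a real benchmark would swap in the chosen model and dataset and also record resource utilization and efficiency.

```python
# A hedged sketch of one throughput experiment: time a toy model at several
# batch sizes and report predictions/second. The model, input size, and
# batch sizes are illustrative placeholders.
import time

import torch

model = torch.nn.Sequential(  # toy stand-in for the benchmark model
    torch.nn.Linear(512, 512), torch.nn.ReLU(), torch.nn.Linear(512, 10)
)
model.eval()

N = 8192  # total predictions per trial
for batch_size in (1, 32, 256, 1024):
    data = torch.randn(N, 512)
    start = time.perf_counter()
    with torch.no_grad():
        for i in range(0, N, batch_size):
            model(data[i:i + batch_size])
    elapsed = time.perf_counter() - start
    print(f"batch={batch_size:5d}  {N / elapsed:10.0f} preds/sec")
```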
Project Objectives:
- Define a benchmark inference task using a model and dataset
- Design a set of experiments to measure inference performance, focusing on overall throughput, resource utilization, efficiency, and scalability
- Profile variants of the inference task, comparing several approaches:
  - Baseline
  - Data movement via Pelican
  - Data splitting and distribution across multiple execution points
  - Leveraging external software or advanced runtime configuration (NVIDIA Triton, Multi-Process Service (MPS), Hugging Face’s Accelerate)
- Create documentation, tutorials, and guides to inform CHTC and OSPool users of recommended practices
Prerequisite skills or education:
- Familiarity with machine learning frameworks such as PyTorch or TensorFlow (preferred)
- Familiarity with HTCondor (preferred)
Expanded Pelican Cache Monitoring
Get ready to contribute to the national cyberinfrastructure: CHTC is a leader in the development and operation of nationally distributed, heterogeneous systems, an environment in which services owned and operated by many different organizations spanning vast geographical regions are brought together to deliver cohesive experiences for users. Perhaps the most pressing challenge in this environment is system monitoring: the task of understanding the health of both the overall system and its individual components. Key questions include defining what it means for a component to be “healthy” or “unhealthy” and detecting when a particular service requires human intervention. While monitoring requires both data collection and data presentation, the skills and interests of the fellow will guide which aspect becomes the focus of this project.
In particular, the fellow will start this activity by getting hands-on with the Pelican Platform (https://pelicanplatform.org). Early focus will be on expanded monitoring for Pelican’s “cache” component, which allows researchers to stage large volumes of scientific data close to the computational resources needed for analysis.
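To illustrate the data-collection half of monitoring, here is a minimal sketch of a custom Prometheus exporter: it publishes the disk usage of a hypothetical cache directory as a gauge that a Prometheus server can scrape. The path and port are assumptions; Pelican’s real cache metrics come from the platform and the XRootD stack.

```python
# A hedged sketch of the exporter pattern, not Pelican's actual monitoring:
# scan a (hypothetical) cache directory and expose its size as a Prometheus
# gauge on :9100/metrics.
import os
import time

from prometheus_client import Gauge, start_http_server

CACHE_DIR = "/var/cache/pelican"  # hypothetical path
cache_bytes = Gauge("pelican_cache_bytes_used",
                    "Bytes currently stored in the cache directory")

def scan(path):
    """Sum the sizes of all files under path."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            try:
                total += os.path.getsize(os.path.join(root, name))
            except OSError:
                pass  # file evicted mid-scan
    return total

if __name__ == "__main__":
    start_http_server(9100)  # metrics served at http://localhost:9100/metrics
    while True:
        cache_bytes.set(scan(CACHE_DIR))
        time.sleep(60)
```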
Project Objectives:
This project aims to:
- Provide fellows a direct look at the challenges facing distributed systems at national and international scales
- Develop an understanding of what it means for various components in a distributed system to be “healthy”
- Create monitoring solutions (data collection, data presentation) that allow administrators to quickly assess the state of these systems
By the end of the fellowship, the fellow will gain hands-on experience working with distributed systems, and depending on focus may have opportunities to work with any of Go, C++, Prometheus (and other Cloud Native tools), the XRootD software stack, and web design.
Prerequisite skills or education:
- C++ or Golang (required)
- Linux/CLI (required)
- React/NextJS (preferred)
- Git/GitHub/GitHub Actions (preferred)
- Docker/Kubernetes (preferred)
Pelican Passport: Where did we go and what did we do? — Client-side request tracking in Pelican
In a distributed system with multiple services communicating with one another, a key challenge is tracking each service that a job or request interacts with and reporting the success of those interactions. This project aims to design a method for monitoring which services are accessed by a client request and whether each interaction was successful. In addition to identifying where failures occur, this approach will help explain why certain successful tasks might exhibit different behaviors, such as performance slowdowns.
Depending on the fellow’s interests and focus, this project may involve exploring tracking mechanisms in distributed systems, classifying requests and jobs, and engaging in client-side programming and troubleshooting, possibly including Python API development. It may also encompass the collection of client-side data and determining effective ways to present this information to both system users and administrators.
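A toy sketch of the “passport” idea follows, with hypothetical service names and URLs: each call is wrapped so the client records where it went, whether the interaction succeeded, and how long it took.

```python
# A hedged sketch of client-side request tracking ("passport stamps"), not
# Pelican's actual client. Service names and URLs below are hypothetical.
import time
from dataclasses import dataclass, field

import requests

@dataclass
class Passport:
    stamps: list = field(default_factory=list)

    def visit(self, service, url):
        """Perform one GET and record the interaction, success or failure."""
        start = time.perf_counter()
        try:
            resp = requests.get(url, timeout=30)
            ok, status = resp.ok, resp.status_code
        except requests.RequestException as exc:
            resp, ok, status = None, False, type(exc).__name__
        self.stamps.append({
            "service": service,
            "url": url,
            "ok": ok,
            "status": status,
            "seconds": round(time.perf_counter() - start, 3),
        })
        return resp

passport = Passport()
passport.visit("director", "https://director.example.org/api/v1/object")
passport.visit("cache", "https://cache.example.org/data/object")
for stamp in passport.stamps:  # report exactly where the request went
    print(stamp)
```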
Project Objectives:
- Develop a method for error propagation across various clients accessing a distributed system.
- Create a system for client-side request tracking and monitoring.
- Enhance client-side monitoring and error understanding by enabling users to view exactly which services a job interacted with.
By the end of the fellowship, the fellow will gain a comprehensive understanding of distributed data systems and client-side monitoring, as well as insights into distributed and heterogeneous computing using HTCondor. They will also learn to implement methods for error propagation across the various clients used to access the system and to convey that information effectively to users. Additionally, the fellow will acquire hands-on experience in Python and Go programming.
Prerequisite skills or education:
- A willingness to tackle steep learning curves (required)
- Python or Golang (required)
- HTTP development (required)
- Distributed Computing (preferred)
Infrastructure Services
Monitoring CHTC
Support high throughput computing at UW-Madison and around the world by joining the CHTC Fellowship Program, where through this project you will build and refine your skills in IT observability using ubiquitous industry tools such as Prometheus and Grafana. As the home of the HTCondor Software Suite, CHTC offers an excellent opportunity to expand your portfolio and IT skill set. In a distributed high throughput system there are endless opportunities to collect and display metrics; however, discretion is the better part of observability. With the guidance of your mentor, you will strategically gather data with Prometheus and write alerts that notify System Administrators and Research Computing Facilitators of system issues. Additionally, you will craft dashboards with just enough information to allow for rapid assessment of both individual components and overall system health and performance. Observability tools and systems designed during this fellowship are ultimately intended to ship worldwide with the HTCondor Software Suite and via dashboard-sharing services such as Grafana.
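As one concrete starting point, the sketch below prototypes an alert expression against Prometheus’s HTTP query API before committing it to an alerting rule; the server address, expression, and threshold are illustrative assumptions.

```python
# A hedged sketch: evaluate a candidate alert expression via Prometheus's
# HTTP API (GET /api/v1/query). The server URL, expression, and 10%
# threshold are assumptions for illustration.
import requests

PROMETHEUS = "http://localhost:9090"   # assumed local Prometheus server
EXPR = "1 - avg(up)"                   # fraction of scrape targets down

resp = requests.get(f"{PROMETHEUS}/api/v1/query",
                    params={"query": EXPR}, timeout=10)
resp.raise_for_status()

for sample in resp.json()["data"]["result"]:
    frac_down = float(sample["value"][1])  # value is [timestamp, "string"]
    if frac_down > 0.10:
        print(f"ALERT: {frac_down:.0%} of targets are down")
    else:
        print(f"ok: {frac_down:.0%} of targets down")
```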
Project Objectives:
- Acquire hands-on experience with Prometheus Monitoring tools and Grafana by creating Prometheus metrics and alerts as well as Grafana Dashboards for both HTCondor and system exported metrics.
- Build an understanding of high throughput computing architecture and systems performance in the context of the HTCondor Software Suite.
- Learn observability and Linux best practices in a dynamic and collaborative environment.
- Refine your abilities in public speaking, presentation, and communicating highly complex subject matter.
Prerequisite Skills:
- Prior programming experience, required.
- Prior Linux experience, strongly preferred.
- Prior experience with Prometheus and Grafana (or similar), desired but not required.
Research Facilitation
Measuring Throughput in CHTC
CHTC’s High Throughput Computing (HTC) system supports hundreds of users and thousands of jobs each day. It is optimized for workloads, or sets of jobs, in which many jobs can run in parallel as computational capacity becomes available. This project aims to better understand the impact of workload size and requirements on overall throughput through empirical measurement of workloads in CHTC. A key component of the project will be developing tools to a) submit sample workloads and b) gather metrics about their performance. Once these tools are developed, they can be used to run experiments with different workload types.
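A minimal sketch of the submission half of such a tool appears below: it writes an HTCondor submit description for a batch of identical sleep jobs and hands it to condor_submit. The job count, sleep duration, and resource requests are illustrative placeholders, not tuned experiment values.

```python
# A hedged sketch of a synthetic workload generator: emit a submit
# description for n_jobs identical sleep jobs, then submit it. The resource
# requests and durations are placeholders for real experiment parameters.
import subprocess

def make_submit_file(path, n_jobs, sleep_secs):
    submit = f"""\
executable     = /bin/sleep
arguments      = {sleep_secs}
request_cpus   = 1
request_memory = 128MB
request_disk   = 64MB
log            = workload.log
queue {n_jobs}
"""
    with open(path, "w") as f:
        f.write(submit)

if __name__ == "__main__":
    make_submit_file("workload.sub", n_jobs=100, sleep_secs=300)
    subprocess.run(["condor_submit", "workload.sub"], check=True)
```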
Project Objectives:
- Develop a synthetic workload generation tool that uses job parameters to automatically generate test HTC workloads.
- Identify a list of 5-6 core metrics for measuring HTC workload performance in collaboration with the Facilitation team.
- Develop a tool that extracts these core metrics from HTCondor log files or other command-line tools (see the sketch after this list).
- Use the above tools end-to-end to generate a test workload and measure its outcomes.
- Run additional tests and benchmarks and write up a summary of findings.
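And a sketch of the measurement half, using the htcondor Python bindings to derive two example metrics (mean queue wait and workload makespan) from the job event log written by the workload above; the final set of 5-6 metrics would be chosen with the Facilitation team.

```python
# A hedged sketch of metric extraction: replay the HTCondor job event log
# and compute two illustrative metrics. Assumes "workload.log" was written
# by the synthetic workload sketched earlier.
import htcondor

submitted, started, ended = {}, {}, {}
for event in htcondor.JobEventLog("workload.log").events(stop_after=0):
    key = (event.cluster, event.proc)
    if event.type == htcondor.JobEventType.SUBMIT:
        submitted[key] = event.timestamp
    elif event.type == htcondor.JobEventType.EXECUTE:
        started[key] = event.timestamp
    elif event.type == htcondor.JobEventType.JOB_TERMINATED:
        ended[key] = event.timestamp

waits = [started[k] - submitted[k] for k in started if k in submitted]
if waits:
    print(f"{len(waits)} jobs started; mean queue wait "
          f"{sum(waits) / len(waits):.0f}s")
if submitted and ended:
    makespan = max(ended.values()) - min(submitted.values())
    print(f"workload makespan: {makespan}s")
```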
Prerequisite Skills:
- Familiarity with Unix and Python
- Familiarity with HTCondor (preferred)
Questions: chtc-jobs@g-groups.wisc.edu