Collage photos of current and previous CHTC interns.
Photo: Morgridge Institute for Research

Open projects for CHTC Fellows

This page lists software development, infrastructure services, and research facilitation projects for CHTC Fellow applicants to consider.

Updated February 2, 2023

Software Development

User interface and Jupyter widgets

Large

Historically, the user interface to HTCondor has been purely command-line. This presents a large hurdle for onboarding new users, who must first reach a command line via ssh and a terminal emulator and then learn to use it. Modern notebook interfaces such as Jupyter minimize the first problem by providing a web interface to the terminal. However, users still need to type commands to access the batch system. This project extends HTCondor by implementing Jupyter UI widgets that visualize users’ jobs within the system and let users interact with it further. After completing this project, the student will have Jupyter widgets to show in their portfolio, professional JavaScript experience, and an understanding of how a distributed high throughput system works.
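As a rough illustration of the visualization idea, the sketch below summarizes job states as an HTML table that Jupyter renders automatically through the `_repr_html_` display hook. It is not part of HTCondor: the `JobSummary` class is hypothetical, and the input is assumed to be a list of plain dictionaries standing in for the job ClassAds a real scheduler query would return. A real widget would likely use a JavaScript front end as well.

```python
from collections import Counter
from html import escape

# HTCondor JobStatus codes and their conventional labels.
JOB_STATUS = {1: "Idle", 2: "Running", 3: "Removed", 4: "Completed", 5: "Held"}


class JobSummary:
    """Hypothetical helper: renders job counts as an HTML table in a notebook."""

    def __init__(self, job_ads):
        # job_ads: iterable of dicts with an integer "JobStatus" key, used here
        # as a stand-in for ads returned by a real scheduler query.
        self.counts = Counter(
            JOB_STATUS.get(ad.get("JobStatus"), "Other") for ad in job_ads
        )

    def _repr_html_(self):
        # Jupyter calls _repr_html_ when the object is the last value in a cell.
        rows = "".join(
            f"<tr><td>{escape(label)}</td><td>{count}</td></tr>"
            for label, count in sorted(self.counts.items())
        )
        return f"<table><tr><th>Status</th><th>Jobs</th></tr>{rows}</table>"
```

Evaluating `JobSummary(ads)` as the last expression in a notebook cell would display the table without the user typing any batch-system commands.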

Benchmarking OSDF Performance via the OSPool

Large

The OSDF is the flagship instance of a Pelican data federation and is most often used via the OSPool, an HTCondor pool open to any US-based researcher or collaborator. In this project, we aim to design a test harness for benchmarking the OSDF, along with tools to analyze the performance of a test run.

The project would create a toolkit that submits workflows for testing the OSDF across multiple sites; we aim to test as many combinations of origins, caches, and clusters as possible. The same toolkit would then be used to analyze the logs of the run and create an HTML report of the performance and reliability of the OSDF. The ultimate goal is to run the testing nightly, giving the OSDF administrators a morning snapshot of the federation’s status.
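The analysis step might look something like the following sketch, which groups transfer records by origin, cache, and site and renders a small HTML table. The record fields (`origin`, `cache`, `site`, `ok`, `mbps`) and both function names are assumptions made for illustration; the real toolkit would define its own log format.

```python
from collections import defaultdict
from html import escape
from statistics import mean


def summarize(results):
    """Group transfer records by (origin, cache, site) and compute basic stats."""
    groups = defaultdict(list)
    for r in results:
        groups[(r["origin"], r["cache"], r["site"])].append(r)
    summary = {}
    for key, rows in groups.items():
        ok = [r for r in rows if r["ok"]]
        summary[key] = {
            "attempts": len(rows),
            "success_rate": len(ok) / len(rows),
            # Mean throughput over successful transfers only; None if all failed.
            "mean_mbps": mean(r["mbps"] for r in ok) if ok else None,
        }
    return summary


def to_html(summary):
    """Render the summary as a minimal HTML table."""
    header = ("<tr><th>Origin</th><th>Cache</th><th>Site</th>"
              "<th>Attempts</th><th>Success</th><th>Mean Mb/s</th></tr>")
    rows = []
    for (origin, cache, site), s in sorted(summary.items()):
        mbps = "n/a" if s["mean_mbps"] is None else f"{s['mean_mbps']:.1f}"
        cells = [origin, cache, site, str(s["attempts"]),
                 f"{s['success_rate']:.0%}", mbps]
        rows.append(
            "<tr>" + "".join(f"<td>{escape(c)}</td>" for c in cells) + "</tr>"
        )
    return "<table>" + header + "".join(rows) + "</table>"
```

Keeping reliability (success rate) separate from performance (throughput of successful transfers) avoids failed transfers dragging down the throughput figure.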

Database techniques and data architecture

Medium

The HTCondor scheduler stores historical information about completed and removed jobs in a single, flat, sequential file on disk with little indexing. As a result, querying this file for information about a single job is very slow. This project will add an index, perhaps using an off-the-shelf open source database or an on-disk hashing tool, to provide quick access to single entries in the history, and will use this index in the Python-based CLI tool. Students completing this project will gain real-world experience in database techniques and Python programming.
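One possible shape for such an index, sketched with SQLite from the Python standard library: store the byte offset of each job record keyed by cluster and process ID, then seek directly to it. The file layout assumed here (one `Name = value` attribute per line, with each record terminated by a line beginning with `***`) is a simplification for illustration, and `build_index` and `lookup` are hypothetical names, not part of HTCondor.

```python
import sqlite3


def build_index(history_path, db_path):
    """Record the byte offset of each job record, keyed by (cluster, proc)."""
    db = sqlite3.connect(db_path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS jobs ("
        "cluster INTEGER, proc INTEGER, byte_offset INTEGER, "
        "PRIMARY KEY (cluster, proc))"
    )
    start, ids = 0, {}
    with open(history_path, "rb") as f:
        while True:
            line = f.readline()
            if not line:
                break
            if line.startswith(b"***"):
                # Assumed record terminator; the next record starts after it.
                if "ClusterId" in ids and "ProcId" in ids:
                    db.execute(
                        "INSERT OR REPLACE INTO jobs VALUES (?, ?, ?)",
                        (ids["ClusterId"], ids["ProcId"], start),
                    )
                start, ids = f.tell(), {}
                continue
            name, _, value = line.decode(errors="replace").partition("=")
            if name.strip() in ("ClusterId", "ProcId"):
                ids[name.strip()] = int(value)
    db.commit()
    return db


def lookup(db, history_path, cluster, proc):
    """Seek directly to one job's record instead of scanning the whole file."""
    row = db.execute(
        "SELECT byte_offset FROM jobs WHERE cluster = ? AND proc = ?",
        (cluster, proc),
    ).fetchone()
    if row is None:
        return None
    lines = []
    with open(history_path, "rb") as f:
        f.seek(row[0])
        for line in f:
            if line.startswith(b"***"):
                break
            lines.append(line.decode(errors="replace").rstrip("\n"))
    return lines
```

The index stores only offsets, so the history file remains the single source of truth and the index can be rebuilt at any time.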

Integrating PyTorch and Pelican

Large

PyTorch is one of the most popular machine learning frameworks. An important aspect of using it is the data engineering: how is input data fed into the model during training?
Going from “tutorial scale” problems to cutting-edge research requires drastically different techniques around data handling.

For this project, we aim to better integrate Pelican into the PyTorch community, providing both technical mechanisms (implementing the fsspec interface for Pelican) and documentation (tutorials and recipes for scaling PyTorch-based training using a combination of HTCondor and Pelican).

Grid Exerciser

Large

The OSPool is a very large, very dynamic, heterogeneous high throughput system composed of execute points from dozens of campuses all over the United States. Sometimes something goes wrong at one of these many sites, on one network, or at one storage point, and it is difficult to determine where the problem is. This project proposes the design and construction of a “Grid Exerciser”, which intentionally sends sample jobs to targeted locations on the OSPool to verify correct operation and sufficient performance. The project will also have a reporting and visualization component so that a human can understand the voluminous results in a concise form.

Distributed training of ML models

Large

As machine learning models become larger and more complex, their computational needs expand beyond single-GPU and single-machine capabilities. This project aims to leverage distributed ML software (e.g., PyTorch Elastic) within the HTCondor Software Suite (HTCSS) to apply CHTC resources to these large computing tasks. Students will develop strategies, software components, and guides to enable researchers to distribute training tasks across multiple GPU nodes within the cluster. Students working on this project will develop valuable ML and distributed computing knowledge.

Synchronizing datasets via the OSDF

Large

The Pelican client tool provides a standalone mechanism to upload and download single files from a data federation. While the client works well, a typical use case is downloading an entire dataset, potentially tens of thousands of files or hundreds of terabytes. When moving a dataset, features like state management and resumption become important.

In this project, we will explore better ways to automate dataset movement to and from the OSDF, considering approaches like the HTCondor file transfer tool. The goal for the end of the summer is to move a 100 TB dataset using a single command.
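To make "state management and resumption" concrete, here is a minimal sketch of a resumable transfer loop that records completed files in a JSON state file. The `sync_dataset` function and the state-file format are hypothetical, and `transfer` is a caller-supplied function (for example, a wrapper around a client invocation) that is assumed to raise an exception on failure.

```python
import json
import os


def sync_dataset(files, transfer, state_path):
    """Transfer each file once, recording progress so a rerun can resume.

    Returns the list of files whose transfer failed in this run.
    """
    done = set()
    if os.path.exists(state_path):
        with open(state_path) as f:
            done = set(json.load(f))
    failed = []
    for name in files:
        if name in done:
            continue  # already transferred in an earlier run
        try:
            transfer(name)
        except Exception:
            failed.append(name)
            continue
        done.add(name)
        # Persist after every success so an interruption loses at most one file.
        # Write to a temporary file first so the state file is never half-written.
        tmp = state_path + ".tmp"
        with open(tmp, "w") as f:
            json.dump(sorted(done), f)
        os.replace(tmp, state_path)
    return failed
```

Rewriting the whole state file after each transfer is simple but would scale poorly to tens of thousands of files; an append-only log or a small database would be a more realistic choice at that size.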

Infrastructure Services

Tracking server inventory and elevation

Medium

The CHTC maintains over 1,000 servers on the UW–Madison campus and across the country. Keeping track of server elevation (datacenter and rack location), serial numbers, and asset tags is a persistent challenge. This project will focus on taking existing data from the CHTC hardware monitoring system and automatically exporting it to other systems such as Google spreadsheets or ITAdvisor. After a successful summer, the student fellow will have gained skills in Python, monitoring, and the Google Docs APIs.
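The export step could start from something as simple as the sketch below, which flattens inventory records into CSV text that a spreadsheet can import. The field names and the `inventory_to_csv` function are assumptions for illustration; the actual monitoring system's schema and any spreadsheet API calls are not shown.

```python
import csv
import io

# Hypothetical column set; the real monitoring data may use different names.
FIELDS = ["hostname", "datacenter", "rack", "elevation", "serial", "asset_tag"]


def inventory_to_csv(records):
    """Flatten inventory records into CSV text, leaving missing fields blank."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    for rec in sorted(records, key=lambda r: r.get("hostname", "")):
        # Keep only the known columns so unrelated monitoring data is dropped.
        writer.writerow({k: rec.get(k, "") for k in FIELDS})
    return buf.getvalue()
```

Blank cells make gaps in the inventory easy to spot once the data is in a spreadsheet.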

Research Facilitation

Developing software recipes

Medium

Implement a repository of Apptainer container recipes for common software packages. Create a self-serve tutorial on building containers that shows how to use the recipe repository and OSPool guides to build your own container.

Building training materials

Medium

Catalog existing training materials for the OSPool / OSDF. Archive outdated materials and develop new training materials for using Pelican. Run an online workshop to test the new materials.

Easy Code Profiling

Medium

Develop and publish recommendations for how researchers can better profile and understand their code. Run one or two focus groups to gather ideas and get feedback, then organize a short workshop to share the recommendations and provide hands-on examples.
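A hands-on example for Python users might resemble the following sketch, which wraps the standard library's `cProfile` and `pstats` modules to report where a function spends its time. The `profile_call` helper and the `slow_sum` workload are hypothetical and only illustrate the kind of recipe the recommendations could include.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, top=5, **kwargs):
    """Run func under cProfile and return (result, text report)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)
    buf = io.StringIO()
    stats = pstats.Stats(profiler, stream=buf)
    # Sort by cumulative time so the most expensive call paths appear first.
    stats.sort_stats("cumulative").print_stats(top)
    return result, buf.getvalue()


def slow_sum(n):
    # Deliberately naive example workload.
    return sum(i * i for i in range(n))
```

Calling `profile_call(slow_sum, 1_000_000)` returns the function's result together with a report listing the calls with the highest cumulative time.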

Questions: htcondor-jobs@cs.wisc.edu