Scale Beyond Local HTC Capacity
This guide provides an introduction to running jobs outside of CHTC: why using these resources is beneficial, what resources are available, and how to use them.
1. Why run on additional resources outside CHTC?
Running on other resources in addition to CHTC has one huge benefit: size! The UW Grid and OSG include thousands of computers in addition to what's already available in CHTC, including specialized hardware resources like GPUs. Most CHTC users who run on CHTC, the UW Grid, and the OSG can get more than 100,000 computer hours (more than 11 years of computing!) in a single day. Read on to learn more about these resources.
A. UW Grid
What we call the "UW Grid" is a collection of all the groups and centers on campus that run their own high throughput computing pool that uses HTCondor. Some of these groups include departments (Biochemistry, Statistics) or large physics projects (IceCube, CMS). Through agreements with these groups, jobs submitted in CHTC can opt into running on these other campus pools if there is space.
We call sending jobs to other pools on campus flocking.
B. UW-Madison’s OSG Pool
CHTC maintains an OSG pool for the campus community, which includes resources contributed by campuses, national labs, and other institutions across and beyond the US.
When you send jobs to other institutions in our OSG pool, we call that gliding.
2. Job Qualifications
Not all jobs will run well outside of CHTC. Because these jobs are running all over the campus or country, on computers that don't belong to us, they have two major requirements:
- Moderate Data Sizes: We can support input file sizes of up to 20 GB per file per job. This covers input files that would normally be transferred out of a /home directory or use SQUID, in addition to larger files up to 20 GB. Outputs per job can be of similar sizes. If your input or output files are larger than 1 GB, or you have any other questions about handling data on resources beyond CHTC, please contact us!
- Short or interruptible jobs: Your job can complete in under 10 hours: either it finishes in that amount of time, or it self-checkpoints at least that frequently. If you would like to implement self-checkpointing for a longer-running code, we are happy to provide resources and guidance.
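For jobs that need longer than this limit, HTCondor's self-checkpointing submit commands are one way to meet the second requirement. The fragment below is a minimal sketch, not a prescribed CHTC setup. It assumes an executable that saves its progress to a file and exits with a specific code each time it checkpoints. The file names (my_program.sh, checkpoint.dat) and the exit code value 85 are illustrative assumptions.

```
# Sketch only: my_program.sh and checkpoint.dat are hypothetical names.
executable = my_program.sh

# Exit code the program returns to signal "checkpoint written, restart me".
# 85 is an arbitrary value chosen for this example; it must match the program.
checkpoint_exit_code = 85

# File(s) holding the saved progress; HTCondor stores these at each checkpoint.
transfer_checkpoint_files = checkpoint.dat

should_transfer_files = YES
when_to_transfer_output = ON_EXIT
```

With this arrangement the program itself is responsible for resuming from checkpoint.dat when it finds one at startup, so an interruption costs only the work done since the last checkpoint.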
3. Submitting Jobs to Run Beyond CHTC
If your jobs meet the characteristics above and you would like to use either the UW Grid or OS Pool to run jobs, in addition to CHTC, you can add the following to your submit file:
+WantFlocking = true | Also send jobs to other HTCondor pools on campus (UW Grid). Good for jobs that are less than ~8 hours, on average, or checkpointing jobs. |
+WantGlideIn = true | Also send jobs to the OS Pool. Good for jobs that are less than ~8 hours, on average, or checkpointing jobs. |
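To show where these lines sit in practice, here is a minimal sketch of a complete submit file that opts in to both options. The executable name, file names, resource requests, and job count are placeholders chosen for illustration; only the two + lines come from this guide.

```
# Sketch of a submit file that opts in to both the UW Grid and the OS Pool.
# run_analysis.sh, the file names, and the resource requests are placeholders.
executable = run_analysis.sh
arguments = $(Process)

log = job_$(Cluster).log
output = job_$(Cluster)_$(Process).out
error = job_$(Cluster)_$(Process).err

should_transfer_files = YES
when_to_transfer_output = ON_EXIT

request_cpus = 1
request_memory = 2GB
request_disk = 2GB

# Opt in to running beyond CHTC
+WantFlocking = true
+WantGlideIn = true

queue 100
```

Either + line can be used on its own if you only want one of the two additional resources.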
To guarantee maximum efficiency, please do the following steps whenever submitting a new type of job to the UW Grid or OSG:
- Test Your Jobs: You should run a small test (anywhere from 10 to 100 jobs) outside CHTC before submitting your full workflow. To do this, take a job submission that you know runs successfully on CHTC, add the following option to the submit file, and submit the test jobs:
  requirements = (Poolname =!= "CHTC")
  (If your submit file already has a requirements = line, you can append the Poolname requirement by using a double ampersand (&&) followed by the additional requirement.)
- Troubleshooting: If your jobs don't run successfully on the UW Grid or OS Pool, please get in touch with a research computing facilitator.
- Scaling Up: Once you have tested your jobs and they seem to be running successfully, you are ready to submit a full batch of jobs that uses CHTC and the UW Grid/OS Pool. REMOVE the Poolname requirement from the test jobs but leave the +WantFlocking and +WantGlideIn lines.
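The test-then-scale-up steps above can be sketched as the relevant portion of a submit file. This is an illustration under stated assumptions: the pre-existing requirement on OpSysMajorVer is a hypothetical example of something your submit file might already contain, and the job counts are arbitrary.

```
# --- Test run: restrict jobs to pools other than CHTC ---

# Case 1: the submit file has no other requirements
requirements = (Poolname =!= "CHTC")

# Case 2: the submit file already has a requirement (hypothetical example);
# use this line INSTEAD of Case 1, joining the two conditions with &&
# requirements = (OpSysMajorVer == 8) && (Poolname =!= "CHTC")

+WantFlocking = true
+WantGlideIn = true

queue 10

# --- Full run: after the test succeeds ---
# Delete the Poolname condition (in Case 2, keep only the original requirement),
# leave the two + lines in place, and raise the queue count to the full batch.
```

Keeping the test and full-run versions otherwise identical means that any failure seen in the test can be attributed to running outside CHTC rather than to a change in the job itself.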