Use GPUs
Overview
GPUs (Graphics Processing Units) are a special kind of computer processor, optimized for running very large numbers of simple calculations in parallel, a pattern that suits many problems in image processing and machine learning. Well-crafted GPU programs for suitable applications can outperform CPU implementations by a factor of ten or more, but only when the program is written and designed explicitly to run on GPUs using special libraries like CUDA. For researchers whose problems are well-suited to GPU processing, it is possible to run jobs that use GPUs in CHTC. Read on to learn about:
- Available CHTC GPUs
- Submitting jobs that use GPUs in CHTC
- GPU capacity beyond the CHTC GPU Lab
- Using condor_status to explore GPUs
- Preparing software that uses GPUs
A. Available CHTC GPUs
1. GPU Lab
CHTC has a set of GPUs that are available to any CHTC user with an account on our high throughput computing (HTC) system via the CHTC GPU Lab, which includes job templates and a campus GPU community.
We expect that most, if not all, CHTC users running GPU jobs can and should use the capacity of the GPU Lab to run their work.
| Number of Servers | Names | GPUs / Server | GPU Type (`DeviceName`) | Hardware Generation (`Capability`) | GPU Memory (`GlobalMemoryMB`) |
|---|---|---|---|---|---|
| 2 | gpu2000, gpu2001 | 2 | Tesla P100-PCIE-16GB | 6.0 | 16GB |
| 4 | gpulab2000 - gpulab2003 | 8 | NVIDIA GeForce RTX 2080 Ti | 7.5 | 10GB |
| 2 | gpulab2004, gpulab2005 | 4 | NVIDIA A100-SXM4-40GB | 8.0 | 40GB |
| 9 | gpu2003 - gpu2011 | 4 | NVIDIA A100-SXM4-80GB | 8.0 | 80GB |
| 3 | gpu4000 - gpu4002 | 10 | NVIDIA L40 | 8.9 | 45GB |
| 1 | gpu4003 | 8 | NVIDIA H100 80GB HBM3 | 9.0 | 80GB |
Special GPU Lab Policies
Jobs running on GPU Lab servers have time limits and job number limits, which differ from the CHTC defaults that apply across the rest of the HTC system.

| Job type | Maximum runtime | Per-user limitation |
|---|---|---|
| Short | 12 hrs | 2/3 of CHTC GPU Lab GPUs |
| Medium | 24 hrs | 1/3 of CHTC GPU Lab GPUs |
| Long | 7 days | up to 4 GPUs in use |
A certain number of slots in the GPU Lab are reserved for interactive use. Interactive jobs that use GPU Lab servers are restricted to a single GPU and a 4-hour runtime.
2. Other Capacity
There is additional dedicated and backfill GPU capacity available in CHTC and beyond; see GPU capacity beyond the GPU Lab for details.
B. Submit Jobs Using GPUs in CHTC
1. Choose GPU-Related Submit File Options
The following options are needed in your HTCondor submit file in order to access the GPUs in the CHTC GPU Lab and beyond:
- Request GPUs (required): All jobs that use GPUs must request GPUs in their submit file, along with the usual requests for CPUs, memory, and disk:

  `request_gpus = 1`

- Request the CHTC GPU Lab: To use CHTC's shared-use GPUs, you need to opt in to the GPU Lab. To do so, add the following line to your submit file:

  `+WantGPULab = true`
- Indicate Job Type: We have categorized three "types" of GPU jobs, characterized in the table above. Indicate which job type you would like to submit by using the submit file option below:

  `+GPUJobLength = "short"    # Can also request "medium" or "long"`

  If you do not specify a job type, the "medium" job type will be used as the default. If your jobs will run in less than 12 hours, it is advantageous to indicate that they are "short" jobs, because you will be able to have more jobs running at once.
- Request Specific GPUs or CUDA Functionality (optional): If your software or code requires a certain "capability" of GPU (see table above) or a certain amount of GPU memory, you can request them with these submit file options (see the sketch after this list):

  To request a certain range of capabilities:

  `gpus_minimum_capability = <version>`
  `gpus_maximum_capability = <version>`

  To request a minimum amount of GPU memory:

  `gpus_minimum_memory = <quantity in MB>`

  More information on these commands can be found in the HTCondor manual.

  It may be tempting to add requirements for specific GPU servers or types of GPU cards. However, when possible, it is best to write your code so that it can run across GPU types and without needing the latest version of CUDA.
- Indicate Software or Data Requirements Using requirements: If your data is large enough to use our /staging data system (see more information here), or you are using modules or other software in our shared /software system, include the needed requirements.

- Indicate Shorter/Resumable Jobs: If your jobs are shorter than 4-6 hours, or have the ability to checkpoint at least that frequently, we highly recommend taking advantage of the additional GPU servers in CHTC that can run these kinds of jobs as backfill! Simply add the following option to your submit file:

  `+is_resumable = true`

  For more information about the servers that you can run on with this option, and what it means to run your jobs as "backfill", see the section below on Accessing Research Group GPUs.
- Complex GPU requirements: If your jobs have more complex requirements than the capability and memory options shown above, you can use a more general submit file option, `require_gpus`, to construct a complex, custom requirement. Contact the facilitators at chtc@cs.wisc.edu if you believe you need to use this option.
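The fragment below sketches how several of these optional settings fit together in one submit file; the capability range, memory floor, and backfill opt-in shown are illustrative assumptions, not recommendations:

```
# Illustrative submit file fragment; the specific values are assumptions.
request_gpus = 1
+WantGPULab = true
+GPUJobLength = "short"

# Only match GPUs with capability between 7.5 and 9.0 (see hardware table above)
gpus_minimum_capability = 7.5
gpus_maximum_capability = 9.0

# Only match GPUs with at least ~40GB of GPU memory (quantity is in MB)
gpus_minimum_memory = 40000

# Opt in to backfill capacity (for short or self-checkpointing jobs)
+is_resumable = true
```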
2. Sample Submit File
A sample submit file is shown below. There are also example submit files and job scripts in the GPU Job Templates repository in CHTC's GitHub organization.
```
# gpu-lab.sub
# sample submit file for GPU Lab jobs
universe = vanilla
log = job_$(Cluster)_$(Process).log
error = job_$(Cluster)_$(Process).err
output = job_$(Cluster)_$(Process).out

# Fill in with whatever executable you're using
executable = run_gpu_job.sh
#arguments =

should_transfer_files = YES
when_to_transfer_output = ON_EXIT
# Uncomment and add input files that are in /home
# transfer_input_files =

# Uncomment and add custom requirements
# requirements =

+WantGPULab = true
+GPUJobLength = "short"

request_gpus = 1
request_cpus = 1
request_memory = 1GB
request_disk = 1GB

queue 1
```
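Once the submit file and executable are in place, the job is submitted like any other HTCondor job:

```
[alice@submit]$ condor_submit gpu-lab.sub
```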
3. Notes
It is important to still request at least one CPU per job to do the processing that is not well-suited to the GPU.
Note that HTCondor will make sure your job has access to the GPU; it will set the environment variable `CUDA_VISIBLE_DEVICES` to indicate which GPU(s) your code should run on. This environment variable is read by CUDA to select the appropriate GPU(s). Your code should not modify this environment variable or manually select which GPU to run on, as this could result in two jobs sharing a GPU.
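As a minimal sketch (the program name is a placeholder), a job's executable script can log the GPU assignment without overriding HTCondor's choice:

```
#!/bin/bash
# run_gpu_job.sh -- minimal sketch; ./my_gpu_program is a hypothetical executable.
# HTCondor sets CUDA_VISIBLE_DEVICES; read it, but never overwrite it.
echo "GPU(s) assigned by HTCondor: ${CUDA_VISIBLE_DEVICES:-none}"
nvidia-smi          # show details of the visible GPU(s), if driver tools are present
./my_gpu_program    # CUDA will select the assigned GPU(s) automatically
```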
It is possible to request multiple GPUs. Before doing so, make sure your code can actually utilize multiple GPUs, and submit a test job to confirm success before submitting a bigger job. Also keep track of how long jobs are running versus waiting; the time you save by using multiple GPUs may not be worth the extra time the job will likely wait in the queue.
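If you do decide to scale up after testing, the request is a single submit file change, for example:

```
request_gpus = 2
```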
C. GPU Capacity Beyond the CHTC GPU Lab
The following resources are additional CHTC-accessible servers with GPUs. They do not have the special time limit policies or job limits of the GPU Lab. However, some of them are owned or prioritized by specific groups; the implications for job runtimes are noted in each section.
Note that all GPU jobs need to include the `request_gpus` option in their submit file, even if they are not using the GPU Lab.
1. Access Research Group GPUs
Certain GPU servers in CHTC are prioritized for the research groups that own them, but are available to run other jobs when not being used by their owners. When running on these servers, jobs forfeit our otherwise guaranteed runtime of 72 hours and have the potential to be interrupted. However, for shorter jobs or jobs that have implemented self-checkpointing, this is not a drawback, and allowing jobs to run on these additional servers opens up more capacity.
Therefore, these servers are a good fit for GPU jobs that run in a few hours or less, or that have implemented self-checkpointing (the capability to save progress to a file and restart from that progress). Use the `+is_resumable` option shown above in the list of submit file options.
2. Use the gzk Servers
These servers are similar to the GPU Lab servers, with two important differences for running GPU jobs:
- they do not have access to CHTC's large data /staging file system
- they do not have Docker capability
You do not need to do anything specific to allow jobs to run on these servers.
3. Using GPUs in the OS Pool and the UW Grid
CHTC, as a member of the OSG Consortium, can access GPUs that are available on the OS Pool. CHTC is also a member of a campus computing network called the UW Grid, where groups on campus share computing capacity, including access to idle GPUs.
See this guide to determine whether your jobs are good candidates for the UW Grid or OS Pool, and then get in touch with CHTC's Research Computing Facilitators to discuss details.
D. Using condor_status to explore CHTC GPUs
You can find out information about GPUs in CHTC through the `condor_status` command. All of our servers with GPUs have a `TotalGPUs` attribute that is greater than zero; thus we can query the pool to find GPU-enabled servers by running:

```
[alice@submit]$ condor_status -compact -constraint 'TotalGpus > 0'
```
To print out specific information about a GPU server and its GPUs, you can use the "auto-format" option for `condor_status` and the names of specific server attributes. In general, when querying attributes using `condor_status`, a `GPUs_` prefix needs to be added to the attribute name. For example, the tables at the top of the guide can be mostly recreated using the attributes `Machine`, `TotalGpus`, `GPUs_DeviceName`, and `GPUs_Capability`:

```
[alice@submit]$ condor_status -constraint 'Gpus > 0' \
    -af Machine TotalGpus GPUs_DeviceName GPUs_Capability
```
In addition, HTCondor tracks other GPU-related attributes for each server, including:
| Attribute | Explanation |
|---|---|
| `Gpus` | Number of GPUs in an individual job slot on a server (one server can be divided into slots to run multiple jobs). |
| `TotalGPUs` | The total number of GPUs on a server. |
| `(GPUs_)DeviceName` | The type of GPU card. |
| `(GPUs_)Capability` | Represents various capabilities of the GPU. Can be used as a proxy for the GPU card type when requiring a specific type of GPU. Wikipedia has a table showing the compute capability for specific GPU architectures and cards. More details on what the capability numbers mean can be found on the NVIDIA website. |
| `(GPUs_)DriverVersion` | Not the version of CUDA on the server or the NVIDIA driver version, but the maximum CUDA runtime version supported by the NVIDIA driver on the server. |
| `(GPUs_)GlobalMemoryMb` | Amount of memory available on the GPU card. |
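For instance, the following query (using attribute names from the table above) would list each GPU server along with the maximum CUDA runtime its driver supports and its GPU memory:

```
[alice@submit]$ condor_status -constraint 'TotalGpus > 0' \
    -af Machine GPUs_DriverVersion GPUs_GlobalMemoryMb
```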
E. Prepare Software Using GPUs
Before using GPUs in CHTC, you should ensure that the use of GPUs will actually help your program run faster. This means that the code or software you are using has the special programming required to use GPUs, and that your particular task will use this capability.
If this is the case, there are several ways to run GPU-enabled software in CHTC:
Machine Learning
For those using machine learning code specifically, we have a guide with more specific recommendations here: Run Machine Learning Jobs on HTC
1. Compiled Code
You can use our conventional methods of creating a portable installation of a software package (as in our R/Python guides) to run on GPUs. Most of our build servers and GPU servers have copies of the CUDA Runtime that can be used to compile code. To access these servers, submit an interactive job, following the instructions in our Build Job Guide or by submitting a GPU job submit file with the interactive flag for `condor_submit`. Once on a build or GPU server, see what CUDA versions are available by looking at the path `/usr/local/cuda-*`.
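As a sketch of that workflow (the execute server name and the CUDA versions shown are illustrative):

```
[alice@submit]$ condor_submit -i gpu-lab.sub    # interactive job using the sample submit file above
...
[alice@gpu2000]$ ls -d /usr/local/cuda-*        # list the installed CUDA toolkit versions
/usr/local/cuda-11.8  /usr/local/cuda-12.1
```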
Note that we strongly recommend software installation strategies that incorporate the CUDA runtime into the final installed code, so that jobs are able to run on servers even if a different version of the CUDA runtime is installed (or there’s no runtime at all!). For compiled code, look for flags that enable static linking or use one of the solutions listed below.
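For compiled CUDA code, a minimal example of static linking might look like the following (the file names and toolkit version are assumptions):

```
# Statically link the CUDA runtime into the binary so the job does not
# depend on the runtime version installed on the execute server.
[alice@gpu2000]$ /usr/local/cuda-12.1/bin/nvcc --cudart static my_kernel.cu -o my_program
```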
2. Docker
CHTC’s GPU servers have “nvidia-docker” installed, a specific version of Docker that integrates Docker containers with GPUs. If you can find or create a Docker image with your software that is based on the nvidia-docker container, you can use this to run your jobs in CHTC. See our Docker guide for how to use Docker in CHTC.
Currently we recommend using “nvidia/cuda” containers with a tag beginning with “12.1.1-devel” for best integration with our system.
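As a minimal sketch of that approach (the exact tag suffix and installed packages are assumptions; check Docker Hub for the available nvidia/cuda tags):

```
# Dockerfile sketch based on the recommended nvidia/cuda 12.1.1-devel image
FROM nvidia/cuda:12.1.1-devel-ubuntu22.04

# Add whatever your job needs on top of the CUDA base (illustrative)
RUN apt-get update && apt-get install -y --no-install-recommends \
        python3 python3-pip \
    && rm -rf /var/lib/apt/lists/*
```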