Use GPUs

Overview

GPUs (Graphics Processing Units) are a special kind of computer processor optimized for running very large numbers of simple calculations in parallel, an approach that often suits problems in image processing or machine learning. Well-crafted GPU programs for suitable applications can outperform implementations running on CPUs by a factor of ten or more, but only when the program is written and designed explicitly to run on GPUs using special libraries like CUDA. Researchers whose problems are well suited to GPU processing can run jobs that use GPUs in CHTC. The sections below cover the available GPUs, how to submit GPU jobs, GPU capacity beyond the GPU Lab, how to explore GPUs with condor_status, and how to prepare GPU software.

A. Available CHTC GPUs

1. GPU Lab

CHTC has a set of GPUs that are available for use by any CHTC user with an account on our high throughput computing (HTC) system via the CHTC GPU Lab, which includes templates and a campus GPU community.

We expect that most, if not all, CHTC users running GPU jobs will use the capacity of the GPU Lab to run their work.

Number of Servers | Names                   | GPUs / Server | GPU Type (DeviceName)       | Hardware Generation (Capability) | GPU Memory (GlobalMemoryMB)
2                 | gpu2000, gpu2001        | 2             | Tesla P100-PCIE-16GB        | 6.0                              | 16GB
4                 | gpulab2000 - gpulab2003 | 8             | NVIDIA GeForce RTX 2080 Ti  | 7.5                              | 10GB
2                 | gpulab2004, gpulab2005  | 4             | NVIDIA A100-SXM4-40GB       | 8.0                              | 40GB
9                 | gpu2003 - gpu2011       | 4             | NVIDIA A100-SXM4-80GB       | 8.0                              | 80GB
3                 | gpu4000 - gpu4002       | 10            | NVIDIA L40                  | 8.9                              | 45GB
1                 | gpu4003                 | 8             | NVIDIA H100 80GB HBM3       | 9.0                              | 80GB

Special GPU Lab Policies

Jobs running on GPU Lab servers have time limits and job number limits (differing from CHTC defaults across the rest of the HTC System).

Job type | Maximum runtime | Per-user limitation
Short    | 12 hrs          | 2/3 of CHTC GPU Lab GPUs
Medium   | 24 hrs          | 1/3 of CHTC GPU Lab GPUs
Long     | 7 days          | up to 4 GPUs in use

A certain number of slots in the GPU Lab are reserved for interactive use. Interactive jobs that use GPU Lab servers are restricted to a single GPU and a 4-hour runtime.

2. Other Capacity

There is additional dedicated and backfill GPU capacity available in CHTC and beyond; see GPU capacity beyond the GPU Lab for details.

B. Submit Jobs Using GPUs in CHTC

1. Choose GPU-Related Submit File Options

The following options are needed in your HTCondor submit file in order to access the GPUs in the CHTC GPU Lab and beyond:

  • Request GPUs (required): All jobs that use GPUs must request GPUs in their submit file (along with the usual requests for CPUs, memory, and disk).
    request_gpus = 1
    
  • Request the CHTC GPU Lab: To use CHTC’s shared-use GPUs, you need to opt in to the GPU Lab. To do so, add the following line to your submit file:
    +WantGPULab = true
    
  • Indicate Job Type: We have categorized three “types” of GPU jobs, characterized in the table above. Indicate which job type you would like to submit by using the submit file option below.
    +GPUJobLength = "short" 
    # Can also request "medium" or "long"
    

    If you do not specify a job type, the medium job type will be used as the default. If your jobs will run in less than 12 hours, it is advantageous to indicate that they are “short” jobs because you will be able to have more jobs running at once.

  • Request Specific GPUs or CUDA Functionality (optional): If your software or code requires a certain “capability” of GPU (see table above) or a certain amount of GPU memory, you can request them with these submit file options:

    To request a certain range of capabilities:

    gpus_minimum_capability = <version>
    gpus_maximum_capability = <version>
    

    To request a minimum amount of GPU memory:

    gpus_minimum_memory = <quantity in MB>
    

    More information on these commands can be found in the HTCondor manual.

    It may be tempting to add requirements for specific GPU servers or types of GPU cards. However, when possible, it is best to write your code so that it can run across GPU types and without needing the latest version of CUDA.

  • Indicate Software or Data Requirements Using requirements: If your data is large enough to use our /staging data system (see more information here), or you are using modules or other software in our shared /software system, include the needed requirements.

  • Indicate Shorter/Resumable Jobs: If your jobs are shorter than 4-6 hours, or can checkpoint at least that frequently, we highly recommend taking advantage of the additional GPU servers in CHTC that can run these kinds of jobs as backfill. Simply add the following option to your submit file:
    +is_resumable = true
    

    For more information about the servers that you can run on with this option, and what it means to run your jobs as “backfill”, see the section below on accessing research group GPUs.

  • Complex GPU Requirements: If your jobs have more complex requirements than the capability and memory options shown above, you can use the more general submit file option require_gpus to construct a custom requirement. Contact the facilitators at chtc@cs.wisc.edu if you believe you need to use this option.
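
To make the capability and memory options in the list above concrete, the sketch below writes a small set of GPU request lines to a file. The file name gpu-options.sub, the helper name write_gpu_options, and the numeric values (capability 7.5 or newer, at least 16000 MB of GPU memory) are assumptions chosen only for illustration, not CHTC recommendations; choose values from the GPU Lab table that match what your code actually needs.

```shell
#!/bin/bash
# Sketch: write example GPU request lines to a submit-file fragment.
# The file name and all numeric values are illustrative assumptions.

write_gpu_options() {
    # $1: path of the file to write
    cat > "$1" <<'EOF'
request_gpus = 1
+WantGPULab = true
+GPUJobLength = "short"

# Only match GPUs with compute capability 7.5 or newer (example value)
gpus_minimum_capability = 7.5
# Only match GPUs with at least 16000 MB of GPU memory (example value)
gpus_minimum_memory = 16000
EOF
}

write_gpu_options "${SUBMIT_FRAGMENT:-gpu-options.sub}"
```

Tighter requirements match fewer servers, so jobs with these lines may wait longer in the queue than jobs that can run on any GPU.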

2. Sample Submit File

A sample submit file is shown below. There are also example submit files and job scripts in this GPU Job Templates repository in CHTC’s GitHub organization.

# gpu-lab.sub
# sample submit file for GPU Lab jobs

universe = vanilla
log = job_$(Cluster)_$(Process).log
error = job_$(Cluster)_$(Process).err
output = job_$(Cluster)_$(Process).out

# Fill in with whatever executable you're using
executable = run_gpu_job.sh
#arguments = 

should_transfer_files = YES
when_to_transfer_output = ON_EXIT
# Uncomment and add input files that are in /home
# transfer_input_files = 

# Uncomment and add custom requirements
# requirements = 

+WantGPULab = true
+GPUJobLength = "short"

request_gpus = 1
request_cpus = 1
request_memory = 1GB
request_disk = 1GB

queue 1

3. Notes

It is important to still request at least one CPU per job to do the processing that is not well-suited to the GPU.

Note that HTCondor will make sure your job has access to the GPU; it will set the environment variable CUDA_VISIBLE_DEVICES to indicate which GPU(s) your code should run on. The environment variable will be read by CUDA to select the appropriate GPU(s). Your code should not modify this environment variable or manually select which GPU to run on, as this could result in two jobs sharing a GPU.
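
A job script can report which GPUs HTCondor assigned without changing the assignment. The sketch below is a hypothetical body for the run_gpu_job.sh executable named in the sample submit file; the helper name count_assigned_gpus is an assumption made for this example. It only reads CUDA_VISIBLE_DEVICES and never sets it.

```shell
#!/bin/bash
# Sketch of a job script that inspects, but does not modify,
# the GPU assignment made by HTCondor.

count_assigned_gpus() {
    # $1: a comma-separated device list, as found in CUDA_VISIBLE_DEVICES
    local devices="${1:-}"
    if [ -z "$devices" ]; then
        echo 0
        return
    fi
    # Split the list on commas and count the resulting fields.
    local IFS=','
    set -- $devices
    echo $#
}

echo "CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES:-<unset>}"
echo "GPUs assigned: $(count_assigned_gpus "${CUDA_VISIBLE_DEVICES:-}")"

# ... launch your GPU program here; CUDA reads the variable on its own ...
```

Writing these two lines to the job's output file makes it easier to confirm afterwards that the job received the number of GPUs it requested.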

It is possible to request multiple GPUs. Before doing so, make sure you’re using code that can utilize multiple GPUs, then submit a test job to confirm success before submitting a bigger job. Also keep track of how long jobs are running versus waiting; the time you save by using multiple GPUs may not be worth the extra time that the job will likely wait in the queue.

C. GPU Capacity Beyond the CHTC GPU Lab

The following resources are additional CHTC-accessible servers with GPUs. They do not have the special time limit policies or job limits of the GPU Lab. However, some of them are owned or prioritized by specific groups. The implications of this for job runtimes are noted in each section.

Note that all GPU jobs need to include the request_gpus option in their submit file, even if they are not using the GPU Lab.

1. Access Research Group GPUs

Certain GPU servers in CHTC are prioritized for the research groups that own them, but are available to run other jobs when not being used by their owners. When running on these servers, jobs forfeit our otherwise guaranteed runtime of 72 hours and may be interrupted. However, for shorter jobs or jobs that have implemented self-checkpointing, this is not a drawback, and allowing jobs to run on these additional servers opens up more capacity.

Therefore, these servers are a good fit for GPU jobs that run in a few hours or less, or have implemented self-checkpointing (the capability to save progress to a file and restart from that progress). Use the is_resumable option shown above in the list of submit file options.
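
The idea of saving progress to a file and restarting from it can be sketched as a simple loop. This is a minimal illustration only: the file name progress.txt, the step count, and the function name run_resumable are assumptions, the "work" is a placeholder, and how the progress file is preserved between interrupted runs is outside the scope of this sketch.

```shell
#!/bin/bash
# Sketch of a self-checkpointing job: progress is saved after every step,
# so an interrupted job restarts from the last completed step.

CHECKPOINT_FILE="${CHECKPOINT_FILE:-progress.txt}"
TOTAL_STEPS="${TOTAL_STEPS:-5}"

run_resumable() {
    local step=0
    # Resume from the last completed step if a checkpoint exists.
    if [ -f "$CHECKPOINT_FILE" ]; then
        step=$(cat "$CHECKPOINT_FILE")
    fi
    local start="$step"

    while [ "$step" -lt "$TOTAL_STEPS" ]; do
        # ... one unit of real GPU work would go here ...
        step=$((step + 1))
        # Write to a temporary file and rename it, so an interruption
        # never leaves a half-written checkpoint behind.
        echo "$step" > "$CHECKPOINT_FILE.tmp"
        mv "$CHECKPOINT_FILE.tmp" "$CHECKPOINT_FILE"
    done
    echo "started at step $start, finished at step $step"
}

run_resumable
```

Real programs usually checkpoint less often than every step; the interval is a trade-off between the time spent writing checkpoints and the amount of work lost on interruption.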

2. Use the gzk Servers

These servers are similar to the GPU Lab servers, with two important differences for running GPU jobs:

  • they do not have access to CHTC’s large data /staging file system
  • they do not have Docker capability

You do not need to do anything specific to allow jobs to run on these servers.

3. Using GPUs in CHTC’s OSG Pool and the UW Grid

CHTC, as a member of the OSG Consortium, can access GPUs that are available on the OS Pool. CHTC is also a member of a campus computing network called the UW Grid, where groups on campus share computing capacity, including access to idle GPUs.

See this guide to know whether your jobs are good candidates for the UW Grid or OS Pool and then get in touch with CHTC’s Research Computing Facilitators to discuss details.

D. Using condor_status to explore CHTC GPUs

You can find out information about GPUs in CHTC through the condor_status command. All of our servers with GPUs have a TotalGPUs attribute that is greater than zero; thus we can query the pool to find GPU-enabled servers by running:

[alice@submit]$ condor_status -compact -constraint 'TotalGpus > 0'

To print out specific information about a GPU server and its GPUs, you can use the “auto-format” option for condor_status and the names of specific server attributes. In general, when querying attributes using condor_status, a “GPUs_” prefix needs to be added to the attribute name. For example, the tables at the top of the guide can be mostly recreated using the attributes Machine, TotalGpus, GPUs_DeviceName and GPUs_Capability:

[alice@submit]$ condor_status -constraint 'Gpus > 0' \
				-af Machine TotalGpus GPUs_DeviceName GPUs_Capability

In addition, HTCondor tracks other GPU-related attributes for each server, including:

Attribute | Explanation
Gpus | Number of GPUs in an individual job slot on a server (one server can be divided into slots to run multiple jobs).
TotalGPUs | The total number of GPUs on a server.
(GPUs_)DeviceName | The type of GPU card.
(GPUs_)Capability | Represents various capabilities of the GPU. Can be used as a proxy for the GPU card type when requiring a specific type of GPU. Wikipedia has a table showing the compute capability for specific GPU architectures and cards. More details on what the capability numbers mean can be found on the NVIDIA website.
(GPUs_)DriverVersion | Not the version of CUDA on the server or the NVIDIA driver version, but the maximum CUDA runtime version supported by the NVIDIA driver on the server.
(GPUs_)GlobalMemoryMb | Amount of memory available on the GPU card.
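
The attributes above can also be combined in a constraint. The sketch below lists servers whose GPUs meet a minimum capability; the helper name gpu_constraint and the threshold 8.0 are assumptions for illustration, and the query relies on the “GPUs_”-prefixed attribute names described in this section being advertised by the servers you are querying.

```shell
#!/bin/bash
# Sketch: list GPU servers at or above a minimum compute capability.

gpu_constraint() {
    # $1: minimum capability, e.g. 8.0
    printf 'TotalGpus > 0 && GPUs_Capability >= %s' "$1"
}

MIN_CAPABILITY="${MIN_CAPABILITY:-8.0}"

if command -v condor_status >/dev/null 2>&1; then
    condor_status \
        -constraint "$(gpu_constraint "$MIN_CAPABILITY")" \
        -af Machine TotalGpus GPUs_DeviceName GPUs_Capability GPUs_GlobalMemoryMb
else
    echo "condor_status not found; run this on a CHTC submit server" >&2
fi
```

Comparing the output with your planned gpus_minimum_capability value gives a rough idea of how many servers a job could match.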

E. Prepare Software Using GPUs

Before using GPUs in CHTC you should ensure that the use of GPUs will actually help your program run faster. This means that the code or software you are using has the special programming required to use GPUs and that your particular task will use this capability.

If this is the case, there are several ways to run GPU-enabled software in CHTC:

Machine Learning
For those using machine learning code specifically, we have a guide with more specific recommendations here: Run Machine Learning Jobs on HTC

1. Compiled Code

You can use our conventional methods of creating a portable installation of a software package (as in our R/Python guides) to run on GPUs. Most of our build servers or GPU servers have copies of the CUDA Runtime that can be used to compile code. To access these servers, submit an interactive job, either by following the instructions in our Build Job Guide or by submitting a GPU job submit file with the interactive flag for condor_submit. Once on a build or GPU server, see what CUDA versions are available by looking at the path /usr/local/cuda-*.
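
Once logged in to a build or GPU server, the installed CUDA versions can be listed with a short loop. This is a sketch that assumes the installations follow the cuda-<version> directory naming under /usr/local; the function name list_cuda_installs is hypothetical.

```shell
#!/bin/bash
# Sketch: print the version suffix of each cuda-* directory.

list_cuda_installs() {
    # $1: directory to search (defaults to /usr/local)
    local base="${1:-/usr/local}"
    local dir
    local found=0
    for dir in "$base"/cuda-*; do
        # Skip the unexpanded pattern when nothing matches.
        [ -d "$dir" ] || continue
        echo "${dir##*/cuda-}"
        found=1
    done
    [ "$found" -eq 1 ] || echo "no cuda-* directories under $base" >&2
}

list_cuda_installs
```

An interactive session for running this can be started with condor_submit -i followed by the name of your GPU submit file.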

Note that we strongly recommend software installation strategies that incorporate the CUDA runtime into the final installed code, so that jobs are able to run on servers even if a different version of the CUDA runtime is installed (or there’s no runtime at all!). For compiled code, look for flags that enable static linking or use one of the solutions listed below.

2. Docker

CHTC’s GPU servers have “nvidia-docker” installed, a specific version of Docker that integrates Docker containers with GPUs. If you can find or create a Docker image with your software that is based on the nvidia-docker container, you can use this to run your jobs in CHTC. See our Docker guide for how to use Docker in CHTC.

Currently we recommend using “nvidia/cuda” containers with a tag beginning with “12.1.1-devel” for best integration with our system.
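
For illustration, the sketch below writes submit-file lines that name an nvidia/cuda image. Several details are assumptions: the full tag 12.1.1-devel-ubuntu22.04 is only one example of a tag beginning with “12.1.1-devel” and should be checked against the image registry, the file name docker-gpu.sub and helper name write_docker_gpu_options are hypothetical, and the universe = docker / docker_image syntax is standard HTCondor but may differ from what the CHTC Docker guide currently recommends.

```shell
#!/bin/bash
# Sketch: submit-file lines for running a GPU job inside a Docker container.
# The image tag and file name are illustrative assumptions.

write_docker_gpu_options() {
    # $1: path of the file to write
    cat > "$1" <<'EOF'
universe = docker
# Example tag only; confirm that it exists before using it
docker_image = nvidia/cuda:12.1.1-devel-ubuntu22.04

request_gpus = 1
+WantGPULab = true
+GPUJobLength = "short"
EOF
}

write_docker_gpu_options "${DOCKER_FRAGMENT:-docker-gpu.sub}"
```

In practice you would usually build your own image on top of an nvidia/cuda base image so that it also contains your software, and name that image instead.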