Going from Google Colab to CHTC's GPU Lab
This guide gives a step-by-step process for moving a notebook that uses GPUs on Google Colab to a job on CHTC that uses CHTC’s GPU Lab resources.
Go through the following steps to transition from Google Colab to CHTC’s GPU Lab:
- Get a script and package requirements from the Colab interface.
- Build and publish a Docker container that recreates the Colab software environment.
- Submit your job on CHTC, paying attention to hardware requirements.
Why use CHTC’s GPU Lab?
Google Colab has better support for interactive work in a notebook interface; however, the free version has a strict time limit on how long you can run calculations. If you want to run many GPU-based calculations or need more time, CHTC can provide a larger number of GPUs and a longer run time limit.
A. Get the Needed Packages From Colab
These steps are run from the Colab notebook.
Assume the notebook you’d like to run on CHTC’s system is already open.
- Export notebook as .py:
Export a clean copy of the notebook with a .py extension before it is altered. This can be done using File > Download > Download .py. Save this file to the local machine.
- Mount Google Drive and navigate to working directory:
In a new cell, run
from google.colab import drive
drive.mount('/content/drive/')
Navigate to the working directory of the notebook:
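The exact location depends on where the notebook lives in your Drive. As a minimal sketch, assuming Drive was mounted at /content/drive as above and using a hypothetical folder name (my_project) and a helper function invented for this example, a cell like the following changes into the directory and reports whether it worked:

```python
import os

# Hypothetical location; replace with the Drive folder that holds your notebook.
NOTEBOOK_DIR = "/content/drive/MyDrive/my_project"


def enter_notebook_dir(path):
    """Change into `path` if it exists; return True on success, False otherwise."""
    if not os.path.isdir(path):
        print(f"Directory not found: {path}")
        return False
    os.chdir(path)
    print(f"Now working in: {os.getcwd()}")
    return True


enter_notebook_dir(NOTEBOOK_DIR)
```

Changing directory with os.chdir affects the whole notebook session, so later cells that start with ! run from the same folder.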
- Convert the notebook to .py to run with pipreqs:
In a new cell, run
!jupyter nbconvert --to=python 'notebook_name.ipynb'
- Install + run pipreqs:
Install pipreqs and use it with the --use-local flag to generate a list of all Python packages and versions used by Colab in the notebook. These packages are determined by the program’s imports. In a new cell, run
!pip install pipreqs
!pipreqs --use-local
This will generate a file named requirements.txt. Double check that the file contains the expected packages. If there were any other .py files in the directory, their imports will also be included in requirements.txt.
- Download requirements.txt:
Download the requirements.txt generated in the previous step and save it to the local directory containing the converted .py file downloaded in step 1.
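One way to double check the contents of requirements.txt is to list each package with its pinned version. The sketch below assumes the file uses the package==version lines that pipreqs normally writes; the function name read_requirements is made up for this example, and it does not handle every syntax that pip accepts.

```python
from pathlib import Path


def read_requirements(path="requirements.txt"):
    """Return (package, version) pairs; version is None when no '==' pin is present."""
    entries = []
    for raw in Path(path).read_text().splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and surrounding whitespace
        if not line:
            continue
        if "==" in line:
            name, version = line.split("==", 1)
            entries.append((name.strip(), version.strip()))
        else:
            entries.append((line, None))
    return entries


# Print a short report if the file is in the current directory.
if Path("requirements.txt").exists():
    for name, version in read_requirements():
        print(f"{name:<20} {version or 'NOT PINNED'}")
```

Packages you do not recognize most likely come from other .py files in the same directory.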
B. Build a Docker Container
The rest of the instructions occur on the CHTC system.
If you don’t already have a Docker Hub account, create one on Docker Hub before starting this section.
Upload files to CHTC:
Upload the requirements.txt file and the Python .py script saved from the previous step to your home directory on the HTC system.
Create the Dockerfile:
Make sure the requirements.txt file has been uploaded to CHTC, then create a file called Dockerfile (there is no extension for this type of file) in the same directory that looks like this:
FROM nvidia/cuda:11.4.2-cudnn8-runtime-ubuntu20.04
ADD requirements.txt /
RUN apt-get update && apt-get -y install python3.7
RUN apt-get -y install python3-pip
RUN pip install -r /requirements.txt
Using the Nvidia base container ensures that CUDA and cuDNN, two libraries often used by popular ML frameworks, are available inside the container. If you know you require specific versions of CUDA or cuDNN, the base container can be adjusted accordingly. Additional base containers are published in the nvidia/cuda repository on Docker Hub.
Note that Python 3.7 is used in this example’s Dockerfile because, at the time of writing, Colab runs on 3.7. Future updates to Colab may require an updated version of Python.
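To see which Python version your own Colab session uses, you could run a short check in a notebook cell before writing the Dockerfile. This is only an illustrative sketch; the helper name python_minor_version is made up for this example.

```python
import sys


def python_minor_version(version_info=sys.version_info):
    """Return the interpreter version as 'major.minor', for example '3.7'."""
    return f"{version_info[0]}.{version_info[1]}"


# Report the version of the interpreter running this cell.
print(f"This notebook runs Python {python_minor_version()}")
```

If the reported version differs from 3.7, the Python package installed in the Dockerfile would need to change to match.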
Create a submit file:
Create a file called build.sub that looks like
# Software build file
universe = vanilla
log = interactive.log
# bring along the requirements
transfer_input_files = requirements.txt, Dockerfile
+IsBuildJob = true
requirements = (OpSysMajorVer =?= 8)
request_cpus = 1
request_memory = 4GB
request_disk = 16GB
queue
Some of the specified values, such as request_memory and request_disk, might have to be updated to reflect the particular memory and disk requirements of your container. Building containers can use a surprising amount of disk, so if your build job keeps exiting back to the submit node while the build is running, try increasing these values incrementally until the build succeeds. The amount of disk used to build the container may be significantly larger than the total size of the finished container.
Start an interactive job:
Submit the job to HTCondor using the -i flag to indicate that it will be an interactive build job:
condor_submit -i build.sub
Build the container with podman:
Once the interactive job has begun, use podman to log in to Docker Hub:
podman login docker.io
When prompted, enter the username and password associated with your Docker Hub account.
Once logged in, build the container:
podman build -t <dockerhub_user/container_name:tag> .
For example, if you’re building the first version of a PyTorch container and your Docker Hub username is chtc_user, your build command might look like
podman build -t chtc_user/pytorch:v1 .
Upload to Docker Hub:
Now find the image Hash ID associated with the container that was just built. Run podman images and copy the Hash ID of your container.
Push the container to Docker Hub:
podman push <Hash ID> <dockerhub_user/container_name:tag>
Continuing the example above with the username chtc_user, this might look like
podman push 123456 chtc_user/pytorch:v1
Once podman has finished uploading the container, the container should be ready for use by HTCondor.
Exit interactive job:
In the terminal, type exit. This will terminate the interactive job and return you to the submit node.
C. Submit a Job
Create job submit file:
Create a submit file called something like my_job.sub that looks like this:
universe = docker
docker_image = your_image
log = job_$(Cluster).log
error = job_$(Cluster)_$(Process).err
output = job_$(Cluster)_$(Process).out
executable = python_script.py
should_transfer_files = YES
when_to_transfer_output = ON_EXIT
#transfer_input_files =
request_gpus = 1
# list other GPU requirements, if needed
request_cpus = 1
request_memory = 1GB
request_disk = 1GB
queue 1
Fill in the Docker image name from the previous step and the name of your Python script. In the above example, a container called chtc_user/pytorch:v1 was pushed to Docker Hub, so the string chtc_user/pytorch:v1 would be included after docker_image = in place of your_image.
Add “shebang” to Python script
In your Python script, put a line such as the following (the first characters, #!, are known as a shebang) as the very first line of the script:
#!/usr/bin/env python3
This line indicates that you want the script to be run as a Python script.
Think about data!
As with other jobs on CHTC, think about the data requirements for your job. Jobs with larger requirements may need a larger value for the request_disk attribute, and if you intend to transfer the data from the submit node, you may need to do so using one of CHTC’s alternative data transfer methods, such as Squid. More information is available in CHTC’s documentation on large file transfers.
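To get a rough idea of how much disk your input data needs, you could add up the sizes of the files you plan to transfer. The sketch below is an illustration only: the function total_size_gb and the input names are hypothetical, and the job also needs room for the container's working files and any output it writes.

```python
import os


def total_size_gb(paths):
    """Sum the sizes of the files and directory trees in `paths`, in GB (10^9 bytes)."""
    total_bytes = 0
    for path in paths:
        if os.path.isfile(path):
            total_bytes += os.path.getsize(path)
        else:
            # Walk directories; paths that do not exist contribute nothing.
            for root, _dirs, files in os.walk(path):
                for name in files:
                    total_bytes += os.path.getsize(os.path.join(root, name))
    return total_bytes / 1e9


# Hypothetical input names; replace with the files your job transfers.
inputs = ["python_script.py", "data"]
print(f"Inputs total about {total_size_gb(inputs):.3f} GB")
```

A request_disk value comfortably above this total leaves space for intermediate and output files.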
Submit a test job, then submit the real thing.
Try submitting a scaled-down test job to ensure everything is set up correctly. When the test job runs successfully, submit the real job.
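A scaled-down test job can be as small as a script that reports whether a GPU is visible from inside the container. The sketch below assumes the nvidia-smi command is on the path inside the running job, which is not guaranteed for every image; the function name gpu_report is made up for this example. It deliberately avoids importing any ML framework, so it only tests the job setup, not your packages.

```python
#!/usr/bin/env python3
import shutil
import subprocess


def gpu_report():
    """Return the output of `nvidia-smi -L`, or None if the command is missing or fails."""
    if shutil.which("nvidia-smi") is None:
        return None
    result = subprocess.run(
        ["nvidia-smi", "-L"], capture_output=True, text=True, check=False
    )
    return result.stdout.strip() if result.returncode == 0 else None


report = gpu_report()
print(report if report else "No GPU visible to this job")
```

Used as the executable in the submit file, this script writes its message to the job's .out file, which you can check before submitting the real job.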