Going from Google Colab to CHTC's GPU Lab
This guide provides a step-by-step process for going from running a notebook that uses GPUs on Google Colab, to running it as a job in CHTC, using CHTC’s GPU Lab resources.
Overview
Go through the following steps to transition from Google Colab to CHTC’s GPU Lab:
- Get a script and package requirements from the Colab interface.
- Build and publish a docker container recreating the Colab software environment.
- Submit your job on CHTC, paying attention to hardware requirements.
Why use CHTC’s GPU Lab?
Google Colab has better support for interactive work in a notebook interface; however, the free version has a strict time limit on how long you can run calculations. If you want to run many GPU-based calculations or need more time, CHTC can provide a larger number of GPUs and a longer run time limit.
A. Get the Needed Packages From Colab
These steps are run from the Colab notebook.
Assume the notebook you’d like to run on CHTC’s system is already open.
- Export the notebook as a `.py` file:
  Export a clean copy of the notebook with a `.py` extension before it is altered. This can be done using File > Download > Download `.py`. Save this file to your local machine.
- Mount Google Drive and navigate to the working directory:
  In a new cell, run

  ```
  from google.colab import drive
  drive.mount('/content/drive/')
  ```

  Then navigate to the working directory of the notebook:

  ```
  %cd 'path/to/directory'
  ```

- Convert the notebook to `.py` to run with `pipreqs`:
  In a new cell, run

  ```
  !jupyter nbconvert --to=python 'notebook_name.ipynb'
  ```

- Install and run `pipreqs`:
  Install and use `pipreqs` with the `--use-local` flag to generate a list of all Python packages and versions used by Colab in the notebook. These packages are determined by the program's imports. In a new cell, run

  ```
  !pip install pipreqs
  !pipreqs --use-local
  ```

  This will generate a file named `requirements.txt`. Double-check that the file contains the expected packages. If there were any other `.py` files in the directory, their imports will also be included in `requirements.txt`.
- Download `requirements.txt`:
  Download the `requirements.txt` file generated in the previous step and save it to the local directory containing the converted `.py` file downloaded in step 1.
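As a point of reference, the generated `requirements.txt` for a simple PyTorch notebook might look something like the fragment below. The package names and version numbers here are hypothetical; yours will reflect whatever the notebook actually imports:

```
numpy==1.21.6
pandas==1.3.5
torch==1.11.0
```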
B. Build a Docker Container
The rest of the instructions occur on the CHTC system.
If you don’t already have a Docker Hub account, create one before starting this section: Docker Hub
- Upload files to CHTC:
  Upload the `requirements.txt` file and the `.py` script saved in the previous section to your home directory on the HTC system.
- Create the Dockerfile:
  In the same directory as `requirements.txt`, create a file called `Dockerfile` (there is no extension for this type of file) that looks like this:

  ```
  FROM nvidia/cuda:11.4.2-cudnn8-runtime-ubuntu20.04
  ADD requirements.txt /
  RUN apt-get update && apt-get -y install python3.7
  RUN apt-get -y install python3-pip
  RUN pip install -r /requirements.txt
  ```

  Using the Nvidia base container ensures that CUDA/cuDNN, two libraries often used by popular ML frameworks, are installed properly. If you know you require specific versions of CUDA/cuDNN, the base container can be adjusted accordingly. Additional containers can be found here.
  Note that Python 3.7 is used in this example's `Dockerfile` because, at the time of writing, Colab runs on Python 3.7. Future updates to Colab may require an updated version of Python.
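Since the Dockerfile should mirror the Colab runtime, it can help to record the runtime's versions before you build. The sketch below, which you could run in a Colab cell, prints the Python version using only the standard library; `runtime_summary` is a hypothetical helper, not part of any CHTC or Colab tooling:

```python
# Run in a Colab cell to record the Python version your Dockerfile should match.
import platform
import sys

def runtime_summary():
    """Return a one-line description of the current Python runtime."""
    return f"Python {platform.python_version()} on {sys.platform}"

print(runtime_summary())
```

If the notebook uses an ML framework, the framework itself usually reports the CUDA version it was built against (for example, PyTorch exposes `torch.version.cuda`), which can guide the choice of Nvidia base image.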
- Create a submit file:
  Create a file called `build.sub` that looks like:

  ```
  # Software build file
  universe = vanilla
  log = interactive.log
  # bring along the requirements
  transfer_input_files = requirements.txt, Dockerfile
  +IsBuildJob = true
  requirements = (OpSysMajorVer =?= 8)
  request_cpus = 1
  request_memory = 4GB
  request_disk = 16GB
  queue
  ```

  Some of the specified values, such as `request_memory` and `request_disk`, might have to be updated to reflect the particular memory and disk requirements of your container. Building containers can use a surprising amount of disk, so if your build job keeps exiting back to the submit node while the build is running, try increasing these values incrementally until the build succeeds. The disk needed to build a container may be significantly larger than the total size of the finished container.
- Start an interactive job:
  Submit the job to HTCondor using the `-i` flag to indicate that it is an interactive build job:

  ```
  condor_submit -i build.sub
  ```
- Build the container with `podman`:
  Once the interactive job has begun, use `podman` to log in to Docker Hub:

  ```
  podman login docker.io
  ```

  When prompted, enter the username and password associated with your Docker Hub account.
  Once logged in, build the container:

  ```
  podman build -t <dockerhub_user/container_name:tag> .
  ```

  For example, if you're building the first version of a PyTorch container and your Docker Hub username is `chtc_user`, your build command might look like:

  ```
  podman build -t chtc_user/pytorch:v1 .
  ```
- Upload to Docker Hub:
  Find the image hash ID of the container that was just built: run

  ```
  podman images
  ```

  and copy the hash ID of your container. Then push the container to docker.io:

  ```
  podman push <Hash ID> <dockerhub_user/container_name:tag>
  ```

  Again, for `chtc_user`, this might look like:

  ```
  podman push 123456 chtc_user/pytorch:v1
  ```

  Once podman has finished uploading, the container should be ready for use by HTCondor.
- Exit the interactive job:
  In the terminal, type `exit`. This will terminate the interactive job and return you to the submit node.
C. Submit a Job
- Create a job submit file:
  Create a submit file called something like `my_job.sub` that looks like this:

  ```
  universe = docker
  docker_image = your_image
  log = job_$(Cluster).log
  error = job_$(Cluster)_$(Process).err
  output = job_$(Cluster)_$(Process).out
  executable = python_script.py
  should_transfer_files = YES
  when_to_transfer_output = ON_EXIT
  #transfer_input_files =
  request_gpus = 1
  # list other GPU requirements, if needed
  request_cpus = 1
  request_memory = 1GB
  request_disk = 1GB
  queue 1
  ```

  Fill in the Docker image name from the previous step and the name of your Python script. In the example above, a container called `chtc_user/pytorch:v1` was pushed to Docker Hub, so the string `chtc_user/pytorch:v1` would be used as the value of the `docker_image` attribute.
- Add a "shebang" to the Python script:
  Put the following line (the first two characters, `#!`, are known as a shebang) as the very first line of your Python script:

  ```
  #!/usr/bin/env python3
  ```

  This line indicates that the script should be run with the Python interpreter.
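Before running the real workload, the script you submit can be a minimal smoke test of the environment. The sketch below is a hypothetical example using only the standard library; it reports the Python version and the value of `CUDA_VISIBLE_DEVICES`, which HTCondor typically sets for GPU jobs. Framework-specific checks such as `torch.cuda.is_available()` could be added if your container includes the framework:

```python
#!/usr/bin/env python3
# Hypothetical smoke-test script for a CHTC GPU job.
import os
import sys

def describe_environment():
    """Collect basic facts about the job's runtime environment."""
    return {
        "python_version": sys.version.split()[0],
        # HTCondor typically exports CUDA_VISIBLE_DEVICES for GPU jobs.
        "visible_gpus": os.environ.get("CUDA_VISIBLE_DEVICES", "<not set>"),
    }

if __name__ == "__main__":
    for key, value in describe_environment().items():
        print(f"{key}: {value}")
```

The printed values land in the job's `.out` file, so a quick look there confirms the container and GPU assignment are working before you commit to a long run.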
- Think about data!
  As with other jobs on CHTC, think about the data requirements of your job. Jobs with larger data requirements may need larger values for the `request_memory` and `request_disk` attributes, and if you intend to transfer the data from the submit node, you may need to use one of CHTC's alternative data transfer methods, such as Squid. More information about large file transfers can be found here.
- Submit a test job, then submit the real thing:
  Try submitting a scaled-down test job first to ensure everything is set up correctly. Once the test job runs successfully, submit the real job.
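One way to keep the test and production runs identical except for scale is to make the workload size a command-line argument, which can then be set on the submit file's `arguments` line. The sketch below is a hypothetical pattern; `run` stands in for your real computation:

```python
#!/usr/bin/env python3
# Hypothetical pattern: the same script serves as test and production job,
# differing only in the --iterations argument passed via the submit file.
import argparse

def run(iterations):
    """Stand-in for the real GPU workload; sums the loop indices."""
    total = 0
    for i in range(iterations):
        total += i
    return total

def main(argv=None):
    parser = argparse.ArgumentParser()
    # Small default so a bare invocation acts as a quick test run.
    parser.add_argument("--iterations", type=int, default=10)
    args = parser.parse_args(argv)
    return run(args.iterations)

if __name__ == "__main__":
    print(main())
```

In `my_job.sub`, a test submission might then set a small value such as `arguments = --iterations 10`, while the production submission uses a much larger one.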