Going from Google Colab to CHTC's GPU Lab

This guide walks through moving a notebook that uses GPUs on Google Colab to a job on CHTC that runs on CHTC’s GPU Lab resources.

Overview

Go through the following steps to transition from Google Colab to CHTC’s GPU Lab:

Get a script and package requirements from the Colab interface.
Build and publish a docker container recreating the Colab software environment.
Submit your job on CHTC, paying attention to hardware requirements.

Why use CHTC’s GPU Lab?

Google Colab has better support for interactive work in a notebook interface; however, the free version has a strict time limit on how long you can run calculations. If you want to run many GPU-based calculations or need more time, CHTC can provide a larger number of GPUs and a longer run time limit.

A. Get the Needed Packages From Colab

These steps are run from the Colab notebook.

Assume the notebook you’d like to run on CHTC’s system is already open.

Export notebook as .py file:
Before altering the notebook, export a clean copy with a .py extension using File > Download > Download .py. Save this file to your local machine.
Mount Google Drive and navigate to working directory:
In a new cell, run
```
 from google.colab import drive
 drive.mount('/content/drive/')
```
Navigate to the working directory of the notebook:
```
 %cd 'path/to/directory'
```
Convert the notebook to .py to run with pipreqs:
In a new cell, run
```
 !jupyter nbconvert --to=python 'notebook_name.ipynb'
```
Install and run pipreqs:
Install pipreqs and run it with the --use-local flag to generate a list of the Python packages, with versions, that the notebook uses on Colab. The list is determined from the program’s imports. In a new cell, run
```
 !pip install pipreqs
 !pipreqs --use-local
```
This will generate a file named requirements.txt. Double check that the file contains the expected packages. If there were any other .py files in the directory, their imports will also be included in requirements.txt.
Download requirements.txt:
Download requirements.txt generated in the previous step and save it to the local directory containing the converted .py file downloaded in step 1.
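
Before leaving Colab, it can help to record the Python version (needed for the Dockerfile in section B) and to confirm that every entry in requirements.txt carries a pinned version. The following is a minimal sketch rather than part of the original workflow: the helper names python_version_tag and parse_requirements are hypothetical, and the parser assumes the simple name==version lines that pipreqs writes by default.

```python
import sys


def python_version_tag():
    """Return the running interpreter's version as 'major.minor'."""
    return f"{sys.version_info.major}.{sys.version_info.minor}"


def parse_requirements(text):
    """Split requirements.txt content into pinned and unpinned entries.

    Assumes simple 'name==version' lines (an assumption about the file's
    contents); blank lines and '#' comments are skipped.
    """
    pinned, unpinned = {}, []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        name, sep, version = line.partition("==")
        if sep and version.strip():
            pinned[name.strip()] = version.strip()
        else:
            unpinned.append(line)
    return pinned, unpinned
```

In a Colab cell, this could be used as `pinned, unpinned = parse_requirements(open('requirements.txt').read())`. Any entries returned in `unpinned` are worth checking by hand before building the container.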

B. Build a Docker Container

The rest of the instructions are carried out on the CHTC system.

If you don’t already have a Docker Hub account, create one on Docker Hub before starting this section.

Upload files to CHTC:

Upload the requirements.txt file and the Python .py script saved in section A to your home directory on the HTC system.
Create the Dockerfile:

In the same directory as requirements.txt on CHTC, create a file called Dockerfile (this type of file has no extension) that looks like this:
```
 FROM nvidia/cuda:11.4.2-cudnn8-runtime-ubuntu20.04
 ADD requirements.txt /
 RUN apt-get update && apt-get install -y python3.7
 RUN apt-get -y install python3-pip
 RUN pip install -r /requirements.txt
```
Using the Nvidia base container ensures that CUDA and cuDNN, two libraries used by many popular ML frameworks, are installed and available. If you know you need specific versions of CUDA or cuDNN, adjust the base container accordingly. Additional base containers are listed under nvidia/cuda on Docker Hub.

Note that Python 3.7 is used in this example’s Dockerfile because at the time of writing, Colab runs on 3.7. Future updates to Colab may require an updated version of Python.
Create a submit file:

Create a file called build.sub that looks like
```
 # Software build file

 universe = vanilla
 log = interactive.log

 # bring along the requirements
 transfer_input_files = requirements.txt, Dockerfile

 +IsBuildJob = true
 requirements = (OpSysMajorVer =?= 8)
 request_cpus = 1
 request_memory = 4GB
 request_disk = 16GB

 queue
```
Some of the specified values, such as request_memory and request_disk, might have to be updated to reflect the memory and disk requirements of your container. Building containers can use a surprising amount of disk, so if your build job keeps exiting back to the submit node while the build is running, try increasing these values incrementally until the build succeeds. The amount of disk used to build the container may be significantly larger than the size of the finished container.
Start an interactive job:

Submit the job to HTCondor using the -i flag to indicate that it will be an interactive build job: condor_submit -i build.sub
Build the container with podman:

Once the interactive job has begun, use podman to log in to Docker Hub:
```
 podman login docker.io
```
When prompted, enter the username and password associated with your Docker Hub account.

Once logged in, build the container:
```
 podman build -t <dockerhub_user/container_name:tag> .
```
For example, if you’re building the first version of a PyTorch container and your Docker Hub username is chtc_user, your build command might look like
```
 podman build -t chtc_user/pytorch:v1 .
```
Upload to Docker Hub:

Find the image ID (hash) of the container that was just built: run podman images and copy the value in the IMAGE ID column for your container.

Push the container to Docker Hub:
```
 podman push <Hash ID> <dockerhub_user/container_name:tag>
```
Again, for chtc_user, this might look like podman push 123456 chtc_user/pytorch:v1

Once podman has finished uploading the container, the container should be ready for use by HTCondor.
Exit interactive job:

In the terminal, type exit. This will terminate the interactive job and return you to the submit node.
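
The same image name has to appear in the podman build command and, later, in the docker_image line of the job submit file, so a typo in either place causes trouble. The sketch below is one way to compose and sanity-check that name once. It is illustrative only: the function names are hypothetical, and the regular expressions are a simplified approximation of the image naming rules (lowercase repository names, restricted tag characters), not the full grammar.

```python
import re

# Simplified pattern for one lowercase path component (an approximation).
_COMPONENT = re.compile(r"[a-z0-9]+(?:[._-][a-z0-9]+)*")
# Tags: letters, digits, '_', '.', '-'; must not start with '.' or '-'.
_TAG = re.compile(r"[A-Za-z0-9_][A-Za-z0-9_.-]{0,127}")


def image_reference(user, name, tag):
    """Return 'user/name:tag' after basic validation of each part."""
    for label, value in (("user", user), ("name", name)):
        if not _COMPONENT.fullmatch(value):
            raise ValueError(f"invalid {label}: {value!r}")
    if not _TAG.fullmatch(tag):
        raise ValueError(f"invalid tag: {tag!r}")
    return f"{user}/{name}:{tag}"


def build_command(reference):
    """Argument list for the podman build step shown above."""
    return ["podman", "build", "-t", reference, "."]


def submit_file_line(reference):
    """The matching docker_image line for the job submit file."""
    return f"docker_image = {reference}"


ref = image_reference("chtc_user", "pytorch", "v1")
print(" ".join(build_command(ref)))
print(submit_file_line(ref))
```

Generating both strings from one value keeps the build step and the submit file in agreement.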

C. Submit a Job

Create job submit file:

Create a submit file called something like my_job.sub that looks like this:

```
 universe = docker
 docker_image = your_image
 log = job_$(Cluster).log
 error = job_$(Cluster)_$(Process).err
 output = job_$(Cluster)_$(Process).out

 executable = python_script.py

 should_transfer_files = YES
 when_to_transfer_output = ON_EXIT
 #transfer_input_files =

 request_gpus = 1
 # list other GPU requirements, if needed

 request_cpus = 1
 request_memory = 1GB
 request_disk = 1GB

 queue 1
```

Fill in the Docker image name from the previous section and the name of your Python script. In the above example, a container called chtc_user/pytorch:v1 was pushed to Docker Hub, so the string chtc_user/pytorch:v1 would go after the docker_image attribute.

Add a “shebang” to the Python script:

Put the following line at the very top of your Python script (its first two characters, #!, are known as a shebang):
```
 #!/usr/bin/env python3
```
This line tells the system to run the script with the Python 3 interpreter.
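
If you would rather not edit the exported script by hand, the shebang can be added programmatically. This is a minimal sketch with a hypothetical helper name, ensure_shebang; it leaves the text unchanged when any shebang is already present, even one that differs from the line above.

```python
SHEBANG = "#!/usr/bin/env python3"


def ensure_shebang(source):
    """Return the script text with a shebang as its first line.

    If the text already starts with '#!', it is returned unchanged.
    """
    if source.startswith("#!"):
        return source
    return SHEBANG + "\n" + source
```

To apply it, read the script, pass its text through `ensure_shebang`, and write the result back to the same file.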
Think about data!

As with other jobs on CHTC, think about the data requirements for your job. Jobs with larger data requirements may need larger values for the request_memory and request_disk attributes. If you intend to transfer the data from the submit node, you may need to use one of CHTC’s alternative data transfer methods, such as Squid. See CHTC’s documentation on large file transfers for more information.
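
One way to get a starting point for request_disk is to total the size of the input files before submitting. The sketch below does only that; the function names are hypothetical, and the figure it reports covers inputs alone, so output files and any temporary files the job writes still need to be added on top. It reports sizes in units of 1024**3 bytes, which is an assumption about how you want to read the number rather than a statement about CHTC policy.

```python
import os


def total_size_bytes(paths):
    """Sum the sizes of the given files; directories are walked recursively."""
    total = 0
    for path in paths:
        if os.path.isdir(path):
            for root, _dirs, files in os.walk(path):
                for filename in files:
                    total += os.path.getsize(os.path.join(root, filename))
        else:
            total += os.path.getsize(path)
    return total


def format_gb(num_bytes):
    """Format a byte count in units of 1024**3 bytes, two decimal places."""
    return f"{num_bytes / 1024**3:.2f} GB"
```

For example, `format_gb(total_size_bytes(["data/", "python_script.py"]))` gives the combined input size for a hypothetical data directory and the job script.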
Submit a test job, then submit the real thing.

Try submitting a scaled-down test job to ensure everything is set up correctly. When the test job runs successfully, submit the real job.
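
A scaled-down test job can be as simple as a script that reports what it sees inside the container. The following is a sketch under assumptions, not a CHTC-provided script: it assumes the nvidia-smi tool may be on the PATH inside the job and that the CUDA_VISIBLE_DEVICES environment variable may be set, and it reports None for either if not. The function name environment_report is hypothetical.

```python
#!/usr/bin/env python3
import os
import platform
import shutil
import subprocess


def environment_report():
    """Collect basic facts about the job environment as a dictionary."""
    report = {
        "python_version": platform.python_version(),
        "hostname": platform.node(),
        # Assumption: this variable may or may not be set for the job.
        "cuda_visible_devices": os.environ.get("CUDA_VISIBLE_DEVICES"),
        "nvidia_smi": None,
    }
    smi = shutil.which("nvidia-smi")
    if smi:
        # 'nvidia-smi -L' lists the GPUs visible to this process.
        result = subprocess.run([smi, "-L"], capture_output=True, text=True)
        report["nvidia_smi"] = result.stdout.strip() or result.stderr.strip()
    return report


for key, value in environment_report().items():
    print(f"{key}: {value}")
```

Used as the executable in the submit file, this writes its report to the job’s output file, which makes it easy to confirm the Python version and GPU visibility before submitting the real job.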