Running Python Jobs on CHTC

ACTION REQUIRED: As of September 29th, the HTC system’s default operating system will transition to CentOS Stream 8. This may impact users who use Python in their jobs. For more information, see the HTC Operating System Transition guide.

To best understand the below information, you should already have an understanding of:

Overview

CHTC provides several copies of Python that can be used to run Python code in jobs. See our list of supported versions here: CHTC Supported Python

This guide details the steps needed to:

  1. Create a portable copy of your Python packages
  2. Write a script that uses Python and your packages
  3. Submit jobs

If you want to use conda to manage your Python package dependencies, read this guide as background material, then read our guide on using conda.

CHTC-Provided Python Installations

CHTC provides a pre-built copy of the following versions of Python:

Building on CentOS 7 Linux (Soon to be Phased Out)

Python version Name of Python installation file
Python 2.7 python27.tar.gz
Python 3.6 python36.tar.gz
Python 3.7 python37.tar.gz
Python 3.8 python38.tar.gz
Python 3.9 python39.tar.gz
Python version Name of Python installation file
Python 3.7 python37.tar.gz
Python 3.8 python38.tar.gz
Python 3.9 python39.tar.gz
Python 3.10 python310.tar.gz

If you need a specific version of Python not shown above, contact the Research Computing Facilitators to see if we can build it for you; if we can’t, we can send you instructions for how to build your own copy of Python or use a Docker container for running your jobs.

1. Adding Python Packages

If your code uses specific Python packages (like numpy, matplotlib, sci-kit learn, etc) follow the directions below to download and prepare the packages you need for job submission. If your job does not require any extra Python packages, skip to parts 2 and 3.

You are going to start an interactive job that runs on the HTC build servers and that downloads a copy of Python. You will then install your packages to a folder and zip those files to return to the submit server.

These instructions are primarily about adding packages to a fresh install of Python; if you want to add packages to a pre-existing package folder, there will be notes below in boxes like this one.

Preliminary Step: Choose a Linux Version to Build On

As of August 2022, the newest hardware in the HTC system is running a newer version of Linux, CentOS Stream 8. A limited amount of our older hardware is still running CentOS 7 but these machines will be upgraded to the new operating system in the near future. More information about this transition can be found in the HTC Operating System Transition guide.

There are two approaches to running on our pool:

  • compile on CentOS 8, run on CentOS8: This is the recommended option for (1) all new users and for (2) existing users who have tested their CentOS 7 builds and determined they are not compatabile with CentOS 8 machines. By choosing this option, you will have access to the vast majority of the HTC system’s capacity.

  • compile on CentOS 7, run on both versions of CentOS: This is a temporary option available to users who previously compiled their software on CHTC’s CentOS 7 machines. In some cases, it is possible to use the same software, library, and packages on both CentOS 7 and CentOS Stream 8 machines. Existing users who compiled their software on CentOS 7 machines will need to (1) test their jobs to ensure they run successfully on CentOS Stream 8 machines and (2) plan for the phasing out of CentOS 7 machines (expected Fall 2022).

A. Submit an Interactive Job

Create the following special submit file on the submit server, calling it something like build.sub. **Make sure that you choose the appropriate Python tar.gz file and requirements if you want to build on CentOS 7 versus CentOS Stream 8. **

# Python build file

universe = vanilla
log = interactive.log

# Choose a version of Python from the tables above
# If building on CentOS 7 (To be Phased Out)
# transfer_input_files = http://proxy.chtc.wisc.edu/SQUID/chtc/python##.tar.gz

# If building on CentOS 8 (Recommended)
transfer_input_files = http://proxy.chtc.wisc.edu/SQUID/chtc/el8/python##.tar.gz

+IsBuildJob = true
# Indicate which version of Linux (CentOS) you want to build your packages on
requirements = (OpSysMajorVer =?= 8)
request_cpus = 1
request_memory = 4GB
request_disk = 2GB

queue

If you want to add packages to a pre-existing package directory, add the tar.gz file with the packages to the transfer_input_files line:

transfer_input_files = http://proxy.chtc.wisc.edu/SQUID/chtc/el8/python##.tar.gz, packages.tar.gz

Once this submit file is created, you will start the interactive job by running the following command:

[alice@submit]$ condor_submit -i build.sub

It may take a few minutes for the build job to start.

B. Install the Packages

Once the interactive build job starts, you should see the Python that you specified inside the working directory:

[alice@build]$ ls -lh
-rw-r--r-- 1 alice alice  78M Mar 26 12:24 python##.tar.gz
drwx------ 2 alice alice 4.0K Mar 26 12:24 tmp
drwx------ 3 alice alice 4.0K Mar 26 12:24 var

We'll now unzip the copy of Python and set the PATH variable to reference that version of Python:

[alice@build]$ tar -xzf python##.tar.gz
[alice@build]$ export PATH=$PWD/python/bin:$PATH

To make sure that your setup worked, try running:

[alice@build]$ python3 --version

You can also try running this command to make sure the copy of python that is now active is the one you just installed:

[alice@build]$ which python3

The command above should return a path that includes the prefix /var/lib/condor/, indicating that it is installed in your job's working directory.

If you're using Python 2, use python2 instead of python3 above (and in what follows). The output should match the version number that you want to be using!

If you brought along your own package directory, un-tar it here and skip the directory creation step below.

First, create a directory to put your packages into and then add that directory to our list of enviornmental variables that can be used by HTCondor to find the package directory:

[alice@build]$ mkdir packages
[alice@build]$ export PYTHONPATH=$PWD/packages

You can choose what name to use for this directory -- if you have different sets of packages that you use for different jobs, you could use a more descriptive name than "packages"

To install the Python packages run the following command:

[alice@build]$ python3 -m pip install --target=$PWD/packages package1 package2 etc.

Replace package1 package2 with the names of packages you want to install. pip should download all dependent packages and install them. Certain packages may take longer than others.

If you have difficulties installing a package, we recommend you upgrade pip and then try reinstalling your packages:

[alice@build]$ python3 -m pip install --upgrade pip
[alice@build]$ python3 -m pip install --target=$PWD/packages package1 package2 etc.

C. Finish Up

Right now, if we exit the interactive job, nothing will be transferred back because we haven't created any new files in the working directory, just sub-directories. In order to transfer back our installation, we will need to compress it into a tarball file - not only will HTCondor then transfer back the file, it is generally easier to transfer a single, compressed tarball file than an uncompressed set of directories.

Run the following command to create your own tarball of your packages:

[alice@build]$ tar -czf packages.tar.gz packages/

Again, you can use a different name for the tar.gz file, if you want.

We now have our packages bundled and ready for CHTC! You can now exit the interactive job and the tar.gz file with your Python packages will return to the submit server with you (this sometimes takes a few extra seconds after exiting).

[alice@build]$ exit 

2. Creating a Script

In order to use CHTC's copy of Python and the packages you have prepared in an HTCondor job, we will need to write a script that unpacks both Python and the packages and then runs our Python code. We will use this script as as the executable of our HTCondor submit file.

A sample script appears below. After the first line, the lines starting with hash marks are comments . You should replace "my_script.py" with the name of the script you would like to run, and modify the Python version numbers to be the same as what you used above to install your packages.

#!/bin/bash

# untar your Python installation. Make sure you are using the right version!
tar -xzf python##.tar.gz
# (optional) if you have a set of packages (created in Part 1), untar them also
tar -xzf packages.tar.gz

# make sure the script will use your Python installation, 
# and the working directory as its home location
export PATH=$PWD/python/bin:$PATH
export PYTHONPATH=$PWD/packages
export HOME=$PWD

# run your script
python3 my_script.py

If you have additional commands you would like to be run within the job, you can add them to this base script. Once your script does what you would like, give it executable permissions by running:

[alice@submit] chmod +x run_python.sh

Arguments in Python

To pass arguments to an R script within a job, you'll need to use the following syntax in your main executable script, in place of the generic command above:

python3 my_script.py $1 $2

Here, $1 and $2 are the first and second arguments passed to the bash script from the submit file (see below), which are then sent on to the Python script. For more (or fewer) arguments, simply add more (or fewer) arguments and numbers.

In addition, your Python script will need to be able to accept arguments from the command line. There is an explanation of how to do this in this Software Carpentry lesson.

3. Submitting Jobs

A sample submit file can be found in our hello world example page. You should make the following changes in order to run Python jobs:

  • What version of Linux can you run on?
    • If you compiled your packages on CentOS7, you can try running your jobs on servers that have either version of Linux. This requires an additional requirement, shown in this guide.
    • If you compiled your packages on CentOS Stream 8, your jobs should ONLY run on servers using CentOS Stream 8. This will be the default as of September 29. Before then, make sure to opt into using that operating system, as described in the same guide linked above.
  • Your executable should be the script that you wrote above.

    executable = run_py.sh
    
  • Modify the CPU/memory request lines. Test a few jobs for disk space/memory usage in order to make sure your requests for a large batch are accurate!
    Disk space and memory usage can be found in the log file after the job completes.
  • Change transfer_input_files to include the python tar file, packages, script, and any other needed files. If you used our CentOS 7 version of Python, this may look like:
    transfer_input_files = http://proxy.chtc.wisc.edu/SQUID/chtc/python##.tar.gz, packages.tar.gz, my_script.py
    

    If you used our CentOS Stream 8 version of Python, this may look like:

    transfer_input_files = http://proxy.chtc.wisc.edu/SQUID/chtc/el8/python##.tar.gz, packages.tar.gz, my_script.py
    
  • If your script takes arguments (see the box from the previous section), include those in the arguments line:

    arguments = value1 value2
    

How big is your package tarball?

If your package tarball is larger than 100 MB, you should NOT transfer the tarball using transfer_input_files. Instead, you should use CHTC's web proxy, squid. In order to request space on squid, email the research computing facilitators at chtc@cs.wisc.edu.