Center for High Throughput Computing

Running Python Jobs on CHTC

To best understand the below information, users should already have an understanding of:

Overview

Many CHTC users have Python programs requiring Python versions that are not installed on CHTC's high throughput system, which includes the CHTC Pool, the UW Grid (flocking) and the Open Science Grid (GlideIn). Instead, you get to choose the version of Python you want, and bring it along with your jobs.

This guide details the steps needed to:

  1. Build a Python installation for use in your jobs
  2. Write a script that unpacks your Python installation and runs your Python code
  3. Submit jobs

1. Building a Python Installation

To run Python jobs, you will first need to build a Python installation for your jobs to use.

A. Get the Python Version You Need

Before starting, locate the version of Python that you want to use from python.org. Transfer or download the appropriate source .tgz file to the submit server.
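
As a rough illustration of this step, the sketch below builds the download URL for a source release, assuming python.org keeps its usual /ftp/python/<version>/Python-<version>.tgz layout. The function name python_source_url and the version 3.4.3 are placeholders chosen for this example; they are not part of CHTC's instructions.

```shell
#!/bin/bash
# Hypothetical helper: print the python.org URL of a source tarball.
# Assumes the release follows the /ftp/python/<version>/ layout.
python_source_url() {
    local version="$1"
    echo "https://www.python.org/ftp/python/${version}/Python-${version}.tgz"
}

# Example, run on the submit server (3.4.3 is only an example version):
#   wget "$(python_source_url 3.4.3)"
python_source_url 3.4.3
```

If the URL does not resolve for your version, copy the link to the .tgz file from the python.org downloads page instead.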

Using Distributions

Instead of installing Python from source, it is also possible to create a Python installation using a Python distribution. Examples include Anaconda and miniconda (from Continuum Analytics) and Enthought Canopy (from Enthought). The only change to the instructions below will be the source file (the distribution's install script, instead of source code) and the exact commands required to create a local installation (Step 3 below). Otherwise, the process is nearly identical - install the Python distribution locally and create a tarball of the installed directory.

One major drawback of using a distribution is the size of the installation - the full Anaconda distribution is over 300 MB, whereas a Python installation from source with a few packages is less than 40 MB.

B. Create a Python Installation in an Interactive Job

Because building a Python installation can be computationally intensive, it should not be performed on the submit server. Instead, create your installation on a dedicated build server by using an interactive job. An interactive job is essentially a job without an executable; you run the commands yourself (in this case, to install Python). As with a regular HTCondor job, once you finish the installation on the build server, the output files (here, your Python installation) will be transferred back to the submit server so that you can use them to submit your jobs.

  1. Submit an Interactive Build Job

    Instructions for submitting an interactive build job are here: http://chtc.cs.wisc.edu/inter-submit.shtml
    Note that you should replace source_code.tar.gz with the name of the Python source tarball that you downloaded. If you downloaded additional source code for modules in part A, you should list those in the transfer_input_files line as well.

    Submit the interactive job and wait for it to start.
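
For orientation, an interactive build submit file might look roughly like the sketch below. The file name build.sub, the log name, and the resource requests are assumptions made for illustration; the linked instructions remain the authoritative template and may require additional CHTC-specific lines.

```shell
#!/bin/bash
# Write a minimal submit file for an interactive build job.
# File names and resource requests are illustrative assumptions.
cat > build.sub <<'EOF'
# build.sub: interactive job for compiling Python
universe = vanilla
log = build.log

# Transfer the Python source tarball (replace with your file name)
should_transfer_files = YES
when_to_transfer_output = ON_EXIT
transfer_input_files = python_source.tgz

request_cpus = 1
request_memory = 2GB
request_disk = 2GB

queue
EOF

# Start the interactive session from the submit server:
#   condor_submit -i build.sub
```

The -i flag asks condor_submit for an interactive session rather than running an executable.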

  2. Prepare the Installation Directory

    Once the interactive job starts, create a directory for the installation, which can be done with the mkdir command:

    [alice@build]$ mkdir python

    Next, untar the source code that you transferred over. In the command below, replace python_source.tgz with the name of your Python tarball.

    [alice@build]$ tar -xzf python_source.tgz

  3. Install Python

    To install Python, we will run a configuration script that includes an option to set the installation location. We will set the location to the directory we created above, and then complete the installation by running make.

    Move into the untarred Python source directory (it should be named something like "Python-#.#").

    [alice@build]$ cd Python-#.#

    From that directory, type the following commands to compile and install Python to the directory you created in step 2:

    [alice@build]$ ./configure --prefix=$(pwd)/../python
    [alice@build]$ make
    [alice@build]$ make install
    

  4. Check the Installation

    Once these commands have finished executing, move back into the main working directory.

    [alice@build]$ cd ..

    Then, check the contents of your python directory. It should look like this:

    [alice@build]$ ls python
    bin  include  lib  share

    Finally, make sure you have a python executable. Run:

    [alice@build]$ ls python/bin

    You should see something like this:

    2to3              idle3    pydoc3     python3.4-config   pyvenv
    2to3-3.4          idle3.4  pydoc3.4   python3.4m         pyvenv-3.4
    easy_install-3.4  pip3.4   python3    python3.4m-config  
    f2py3.4           pip3     python3.4  python3-config
    

    The number of items may vary, depending on which version of Python you used. If you do not see a plain python executable (there is none in the listing above), do the following:

    [alice@build]$ cp python/bin/python3 python/bin/python
    Replace "python3" with "python2", if that's the version you've installed.
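
The same check can be scripted. The sketch below is one possible way to do it, not part of the official procedure; the function name ensure_plain_python is made up for this example. It copies python3 (or python2) to python only when the plain name is missing.

```shell
#!/bin/bash
# Hypothetical helper: make sure <bindir>/python exists by copying
# python3 or python2 when the plain name is missing.
ensure_plain_python() {
    local bindir="$1"
    if [ -x "$bindir/python" ]; then
        echo "python already present"
        return 0
    fi
    local candidate
    for candidate in python3 python2; do
        if [ -x "$bindir/$candidate" ]; then
            cp "$bindir/$candidate" "$bindir/python"
            echo "copied $candidate to python"
            return 0
        fi
    done
    echo "no python executable found in $bindir" >&2
    return 1
}

# Usage inside the interactive job:
#   ensure_plain_python python/bin
```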

  5. Install Modules

    If you are installing any additional modules, do so now:

    1. Set your PATH variable to include your Python installation:

      [alice@build]$ export PATH=$(pwd)/python/bin:$PATH 

    2. Install pip, a Python package manager. Go to the pip documentation page and follow the directions under "Installing with get-pip.py". You can download the get-pip.py script by copying the link to the script and then typing:

      [alice@build]$ wget http://link.to.get-pip.py

    3. Run the downloaded script with your new Python (python get-pip.py). Then, for each module needed by your code, run:

      [alice@build]$ pip install module_name
      pip should download all dependent packages and install them. Certain modules may take longer than others.
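
Before packaging the installation, it can help to confirm that each module actually imports. The sketch below is one way to do that under the assumption that your interpreter lives at python/bin/python; the function name check_modules and the module names in the example are placeholders.

```shell
#!/bin/bash
# Hypothetical helper: report which modules the given interpreter can import.
# Usage: check_modules <python-executable> <module> [<module> ...]
check_modules() {
    local py="$1"
    shift
    local missing=0
    local mod
    for mod in "$@"; do
        if "$py" -c "import $mod" >/dev/null 2>&1; then
            echo "ok: $mod"
        else
            echo "MISSING: $mod"
            missing=1
        fi
    done
    return "$missing"
}

# Example (numpy and scipy stand in for your own modules):
#   check_modules python/bin/python numpy scipy
```

Note that the name you import can differ from the name you pass to pip (for example, the scikit-learn package is imported as sklearn), so pass the import names here.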

  6. Exit the Interactive Job

    Right now, if we exit the interactive job, nothing will be transferred back, because we have created only sub-directories in the working directory, not new files. To transfer our installation back, we need to compress it into a single tarball. HTCondor will then transfer the file back, and a single compressed tarball is generally easier to move around than an uncompressed set of directories.

    Run the following command to create your own tarball of the installation:

    [alice@build]$ tar -czvf python.tar.gz python/
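
Before exiting, you may want to confirm that the tarball is usable. The sketch below, with the made-up function name check_tarball, assumes the layout used in this guide (a top-level python/ directory containing bin/python). It also prints the size in whole megabytes, which matters for the 100 MB transfer threshold discussed in section 3.

```shell
#!/bin/bash
# Hypothetical helper: confirm the tarball contains python/bin/python
# and print its size in whole megabytes.
check_tarball() {
    local tarball="$1"
    # List the archive and look for the interpreter entry by exact name
    if ! tar -tzf "$tarball" 2>/dev/null | grep -x 'python/bin/python' >/dev/null; then
        echo "python/bin/python not found in $tarball" >&2
        return 1
    fi
    local bytes
    bytes=$(wc -c < "$tarball")
    echo "$tarball looks complete: $(( bytes / 1024 / 1024 )) MB"
}

# Usage inside the interactive job:
#   check_tarball python.tar.gz
```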

    The installation is complete! You can now exit the interactive job and the tarball of your Python installation will return to the submit server with you.

    [alice@build]$ exit 

2. Creating a Script

We now have a python.tar.gz file that contains our entire Python installation. In order to use this installation in our HTCondor jobs, we will need to write a script that unpacks our Python installation and then runs our Python code. We will use this script as the executable of our HTCondor submit file.

A sample script appears below. After the first line, the lines starting with hash marks are comments. You should replace "my_script.py" with the name of the script you would like to run.

#!/bin/bash

# untar your Python installation
tar -xzf python.tar.gz
# make sure the script will use your Python installation,
# and the working directory as its home location
export PATH=$(pwd)/python/bin:$PATH
mkdir home
export HOME=$(pwd)/home
# run your script
python my_script.py

If you have additional commands you would like to run within the job, you can add them to this base script. Once your script does what you would like, save it (as run_python.sh in this example) and give it executable permissions by running:

[alice@submit]$ chmod +x run_python.sh
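
If you want the wrapper to be a little more defensive, one possible variant is sketched below. It is an illustration rather than the official CHTC script: it stops at the first failing command and forwards any arguments to your Python code. The snippet writes the variant to run_python.sh, so run it only if you are happy to overwrite a file of that name.

```shell
#!/bin/bash
# Write a more defensive variant of the wrapper script to run_python.sh.
cat > run_python.sh <<'EOF'
#!/bin/bash
# Stop at the first failing command so errors are not silently ignored
set -e

# Unpack the Python installation shipped with the job
tar -xzf python.tar.gz

# Use the unpacked Python and a scratch HOME inside the job directory
export PATH="$(pwd)/python/bin:$PATH"
mkdir -p home
export HOME="$(pwd)/home"

# Run the script, forwarding any arguments given to this wrapper
python my_script.py "$@"
EOF
chmod +x run_python.sh
```

With this variant, values placed on the arguments line of an HTCondor submit file reach my_script.py as command-line arguments.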

3. Submitting Jobs

A sample submit file can be found in our hello world example page. You should make the following changes in order to run Python jobs:
  • Your executable should be the script that you wrote above.
  • Change transfer_input_files to include your Python installation tarball (python.tar.gz), your Python scripts, and any input files your job needs.
  • How big is your installation tarball?

    If your installation tarball is larger than 100 MB, you should NOT transfer the tarball using transfer_input_files. Instead, you should use CHTC's web proxy, squid. In order to request space on squid, email the research computing facilitators at chtc@cs.wisc.edu.

  • Modify the CPU/memory request lines. Test a few jobs for disk space/memory usage in order to make sure your requests for a large batch are accurate! Disk space and memory usage can be found in the log file after the job completes.
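
Putting these changes together, a submit file for a tarball under 100 MB might look roughly like the sketch below. The file names (python_job.sub and the log, output, and error files) and the resource requests are assumptions for illustration; start from the hello world example and adjust the requests after testing a few jobs.

```shell
#!/bin/bash
# Write an example submit file; every name and number is a placeholder.
cat > python_job.sub <<'EOF'
# python_job.sub: run my_script.py with a transferred Python installation
universe = vanilla
executable = run_python.sh

log = python_job.log
output = python_job.out
error = python_job.err

should_transfer_files = YES
when_to_transfer_output = ON_EXIT
transfer_input_files = python.tar.gz, my_script.py

# Starting guesses; adjust after checking the log of a few test jobs
request_cpus = 1
request_memory = 1GB
request_disk = 1GB

queue
EOF

# Submit from the submit server:
#   condor_submit python_job.sub
```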