Learning About Your Jobs Using condor_q

The condor_q command can be used for much more than just checking on whether your jobs are running or not! Read on to learn how you can use condor_q to answer many common questions about running jobs.

Create a Portable Python Installation with Miniconda

The Anaconda/Miniconda distribution of Python is a common tool for installing and managing Python-based software and other tools. This guide describes how to use Miniconda to create a Python environment for use in CHTC jobs.

Overview

When should you use Miniconda as an installation method in CHTC?

  • Your software has specific conda-centric installation instructions.
  • The above is true and the software has a lot of dependencies.
  • You mainly use Python to do your work.

Notes on terminology:

  • conda is a Python package manager and package ecosystem that exists in parallel with pip and PyPI.
  • Miniconda is a slim Python distribution, containing the minimum set of packages necessary for a Python installation that can use conda.
  • Anaconda is a pre-built scientific Python distribution based on Miniconda that has many useful scientific packages pre-installed.

To create the smallest, most portable Python installation possible, we recommend starting with Miniconda and installing only the packages you actually require.

There are two ways to create a Miniconda installation on CHTC’s HTC system. The first is to create your installation environment on the submit server and send a zipped version to your jobs. The other option is to install Miniconda inside each job. The first option is more efficient, especially for complex installations, but there may be rare situations where installing with each job is better. We recommend trying the pre-installation option first. If it doesn’t work, discuss the second option with a facilitator.

This guide also discusses how to “pin” your conda environment to create a more consistent and reproducible environment with specified versions of packages.

Option 1: Pre-Install Miniconda and Transfer to Jobs

In this approach, we will create an entire software installation inside Miniconda and then use a tool called conda pack to package it up for running jobs.

1. Create a Miniconda installation

On the submit server, download the latest Linux miniconda installer and run it.

[alice@submit]$ wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
[alice@submit]$ sh Miniconda3-latest-Linux-x86_64.sh

Accept the license agreement and the default options. At the end, the installer asks whether you want to “initialize Miniconda3 by running conda init?” The default is no; in that case, you “activate” Miniconda by running the eval command printed by the installer. If you choose “no”, save that command so you can reactivate the Miniconda installation whenever you need it in the future.
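That activation command will look something like the line below (this is only an illustration; use the exact command printed by your installer, since the path depends on where you installed Miniconda):

# the path below assumes Miniconda was installed in alice's home directory
[alice@submit]$ eval "$(/home/alice/miniconda3/bin/conda shell.bash hook)"
(base)[alice@submit]$ 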

2. Create a conda “environment” with your software

(If you are using an environment.yml file as described later, you should instead create the environment from your environment.yml file. If you don’t have an environment.yml file to work with, follow the install instructions in this section. We recommend switching to the environment.yml method of creating environments once you understand the “manual” method presented here.)

Make sure you’ve activated the base Miniconda environment. Your prompt should look like this:

(base)[alice@submit]$ 

To create an environment, use the conda create command and then activate the environment:

(base)[alice@submit]$ conda create -n env-name
(base)[alice@submit]$ conda activate env-name

Then, run the conda install command to install the packages and software you want to include in the installation. The exact command to use is often listed in the software’s installation instructions (e.g. QIIME 2, PyTorch).

(env-name)[alice@submit]$ conda install pkg1 pkg2

Some conda packages are only available via specific conda channels, which serve as repositories for hosting and managing packages. If conda is unable to locate the requested packages using the example above, you may need to have conda search other channels. More details are available at https://docs.conda.io/projects/conda/en/latest/user-guide/concepts/channels.html.
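For example, a package hosted on the conda-forge channel could be installed like this (pkg1 is a placeholder; use the channels and package names your software's instructions call for):

# -c adds a channel to search; pkg1 is a placeholder package name
(env-name)[alice@submit]$ conda install -c conda-forge pkg1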

Packages may also be installed via pip, but you should only do this when there is no conda package available.

Once everything is installed, deactivate the environment to go back to the Miniconda “base” environment.

(env-name)[alice@submit]$ conda deactivate

For example, if you wanted to create an installation with pandas and matplotlib and call the environment py-data-sci, you would use this sequence of commands:

(base)[alice@submit]$ conda create -n py-data-sci
(base)[alice@submit]$ conda activate py-data-sci
(py-data-sci)[alice@submit]$ conda install pandas matplotlib
(py-data-sci)[alice@submit]$ conda deactivate
(base)[alice@submit]$ 

More about Miniconda

See the official conda documentation for more information on creating and managing environments with conda.

3. Create Software Package

Make sure that your job’s Miniconda environment is created, but deactivated, so that you’re in the “base” Miniconda environment:

(base)[alice@submit]$ 

Then, run this command to install the conda pack tool:

(base)[alice@submit]$ conda install -c conda-forge conda-pack

Enter y when it asks you to install.

Finally, use conda pack to create a zipped tar.gz file of your environment (substitute the name of your conda environment where you see env-name), set the proper permissions for this file using chmod, and check the size of the final tarball:

(base)[alice@submit]$ conda pack -n env-name --dest-prefix='$ENVDIR'
(base)[alice@submit]$ chmod 644 env-name.tar.gz
(base)[alice@submit]$ ls -sh env-name.tar.gz

When this step finishes, you should see a file in your current directory named env-name.tar.gz.

4. Check Size of Conda Environment Tar Archive

The tar archive, env-name.tar.gz, created in the previous step will be used as input for subsequent job submissions. As with all job input files, you should check the size of this conda environment file. If it is larger than 100MB, you should NOT transfer the tarball using transfer_input_files. Instead, plan to use either CHTC’s web proxy (SQUID) or the large data filesystem (staging). Please contact a research computing facilitator at chtc@cs.wisc.edu to determine the best option for your jobs.

More information is available at File Availability with Squid Web Proxy and Managing Large Data in HTC Jobs.

5. Create a Job Executable

The job will need to go through a few steps to use this “packed” conda environment; first, setting the PATH, then unzipping the environment, then activating it, and finally running whatever program you like. The script below is an example of what is needed (customize as indicated to match your choices above).

#!/bin/bash

# have job exit if any command returns with non-zero exit status (aka failure)
set -e

# replace env-name on the right hand side of this line with the name of your conda environment
ENVNAME=env-name
# if you need the environment directory to be named something other than the environment name, change this line
export ENVDIR=$ENVNAME

# these lines handle setting up the environment; you shouldn't have to modify them
export PATH
mkdir $ENVDIR
tar -xzf $ENVNAME.tar.gz -C $ENVDIR
. $ENVDIR/bin/activate

# modify this line to run your desired Python script and any other work you need to do
python3 hello.py

6. Submit Jobs

In your submit file, make sure to have the following:

  • Your executable should be the bash script you created in step 5.
  • Remember to transfer your Python script and the environment tar.gz file via transfer_input_files. Since the tar.gz file will almost certainly be larger than 100MB, please email us about different tools for delivering the installation to your jobs, likely our SQUID web proxy. A sketch of what the submit file might look like is shown below.
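The following is only a sketch; the executable name, script name, and resource requests are placeholders to adjust for your own work:

# run_env.sh is a placeholder name for the bash script from step 5
executable = run_env.sh
log = job.log
output = job.out
error = job.err

# hello.py is a placeholder for your Python script; if env-name.tar.gz is
# larger than 100MB, deliver it via SQUID or staging instead of listing it here
transfer_input_files = hello.py, env-name.tar.gz

request_cpus = 1
request_memory = 2GB
request_disk = 4GB

queue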

Option 2: Install Miniconda Inside Each Job

In this approach, rather than copying the Miniconda installation with each job, we will copy the Miniconda installer and install a new copy of Miniconda with each job.

Do not use this installation method unless directed to do so by a facilitator!

1. Download the Miniconda Installer and Test Installation

If you haven’t already, download the latest Miniconda installer for Linux from the Miniconda website and place it in your home directory on a CHTC submit server.

We strongly recommend testing the installation steps for your particular program or packages - either on your own computer or by following the directions above - before trying to submit the installation as part of a job to CHTC.

2. Create an Executable Script

Our plan here is to run the Miniconda installer inside the job, build an environment with the needed packages, and then run our desired script or program. The following script should work verbatim, except that you will need to change the conda install step to list the packages you need (or follow your program's own installation instructions). See below for instructions on using an environment.yml environment specification instead of “manually” listing packages in your job script.

#!/bin/bash

# have job exit if any command returns with non-zero exit status (aka failure)
set -e

# installation steps for Miniconda
# replace [installer] with the installer file you transferred,
# e.g. Miniconda3-latest-Linux-x86_64.sh
export HOME=$PWD
export PATH
sh [installer] -b -p $PWD/miniconda3
export PATH=$PWD/miniconda3/bin:$PATH

# install packages; -y answers "yes" to the install prompt, since jobs run non-interactively
conda install -y numpy matplotlib

# modify this line to run your desired Python script
python3 hello.py

3. Submit File

In your submit file, include the executable you wrote (as described above) and in transfer_input_files include the Miniconda installer and any other scripts or data files you want to include with the job:

executable = run_with_conda.sh
arguments = myscript.py

transfer_input_files = Miniconda3-latest-Linux-x86_64.sh, myscript.py, other_input.file

Specifying Exact Dependency Versions

An important part of improving reproducibility and consistency between runs is to ensure that you use the correct/expected versions of your dependencies.

When you run a command like conda install numpy, conda tries to install the most recent version of numpy. For example, numpy version 1.18.2 was released on March 17, 2020. To install exactly this version of numpy, you would run conda install numpy=1.18.2 (the same works for pip, if you replace = with ==). We recommend installing with an explicit version to make sure you have exactly the version of a package that you want. This is often called “pinning” or “locking” the version of the package.
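For example (these commands just restate the versions mentioned above):

# pin a conda package to an exact version
(env-name)[alice@submit]$ conda install numpy=1.18.2

# the pip equivalent uses '==' instead of '='
(env-name)[alice@submit]$ pip install numpy==1.18.2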

If you want a record of what is installed in your environment, or want to reproduce your environment on another computer, conda can create a file, usually called environment.yml, that describes the exact versions of all of the packages you have installed in an environment. This file can be re-used by a different conda command to recreate that exact environment on another computer.

To create an environment.yml file from your currently-activated environment, run

[alice@submit]$ conda env export > environment.yml

This environment.yml will pin the exact version of every dependency in your environment. This can sometimes be problematic if you are moving between platforms because a package version may not be available on some other platform, causing an “unsatisfiable dependency” or “inconsistent environment” error. A much less strict pinning is

[alice@submit]$ conda env export --from-history > environment.yml

which only lists packages that you installed manually, and does not pin their versions unless you yourself pinned them during installation. If you need an intermediate solution, it is also possible to manually edit environment.yml files; see the conda environment documentation for more details about the format and what is possible. In general, exact environment specifications are simply not guaranteed to be transferable between platforms (e.g., between Windows and Linux). We strongly recommend using the strictest possible pinning available to you.
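As a rough illustration, a small hand-edited environment.yml might look like the file written below (the environment name, channels, packages, and versions are only examples; you can use any text editor, shown here with a shell here-document):

(base)[alice@submit]$ cat > environment.yml << 'EOF'
name: py-data-sci
channels:
  - conda-forge
  - defaults
dependencies:
  - python=3.9
  - pandas
  - matplotlib=3.5
EOF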

To create an environment from an environment.yml file, run

[alice@submit]$ conda env create -f environment.yml

By default, the name of the environment will be whatever the name of the source environment was; you can change the name by adding a -n <name> option to the conda env create command.
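For example, to create the environment under a different name than the one recorded in the file:

[alice@submit]$ conda env create -f environment.yml -n new-env-name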

If you use a source control system like git, we recommend checking your environment.yml file into source control and making sure to recreate it when you make changes to your environment. Putting your environment under source control gives you a way to track how it changes along with your own code.

If you are developing software on your local computer for eventual use on the CHTC pool, your workflow might look like this:

  1. Set up a conda environment for local development and install packages as desired (e.g., conda create -n science; conda activate science; conda install numpy).
  2. Once you are ready to run on the CHTC pool, create an environment.yml file from your local environment (e.g., conda env export > environment.yml).
  3. Move your environment.yml file from your local computer to the submit machine and create an environment from it (e.g., conda env create -f environment.yml), then pack it for use in your jobs, as per Create Software Package.
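Put together, those steps might look like the following (the environment name science and package numpy are just the examples from the list above):

# on your local computer: create the environment, install packages, and export it
$ conda create -n science
$ conda activate science
(science)$ conda install numpy
(science)$ conda env export > environment.yml

# copy environment.yml to the submit server, then recreate and pack the environment there
(base)[alice@submit]$ conda env create -f environment.yml
(base)[alice@submit]$ conda pack -n science --dest-prefix='$ENVDIR'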

More information on conda environments can be found in their documentation.

Summary

  • condor_q: Show my jobs that have been submitted on this server.
    Useful options:
    • -nobatch: Display output in the older format, with one line per job process. Since HTCondor 8.6.0 (installed in July 2016), the default display is a compact mode with one line per batch of jobs.
    • -all: Show all the jobs submitted on the submit server.
    • -hold: Show only jobs in the "on hold" state, along with the reason each was held. Held jobs are those that encountered an error and could not finish; an action from the user is usually needed to resolve the problem.
    • -better-analyze JobId: Analyze a specific job and show the reason why it is in its current state.
    • -run: Show your running jobs and related information, such as how long they have been running and on which machine.
    • -dag: Organize condor_q output by DAG.
    • -long JobId: Show all information related to that job.
    • -af Attr1 Attr2 ...: List specific attributes of jobs, using autoformat.

Examples and Further Explanation

1. Default condor_q output

As of July 19, 2016, the default condor_q output will show a single user's jobs, grouped in "batches", as shown below:

[alice@submit]$ condor_q
OWNER   BATCH_NAME        SUBMITTED   DONE   RUN    IDLE   HOLD  TOTAL JOB_IDS
alice   CMD: sb          6/22 13:05      _     32      _      _      _ 14297940.23-99
alice   DAG: 14306351    6/22 13:47     27    113     65      _    205 14306411.0 ...
alice   CMD: job.sh      6/22 13:56      _      _     12      _      _ 14308195.6-58
alice   DAG: 14361197    6/22 16:04    995      1      _      _   1000 14367836.0

HTCondor will automatically group jobs into "batches" for this display. However, it's also possible for you to specify groups of jobs as a "batch" yourself. You can either:

  • Add the following line to your submit file:

     batch_name = "CoolJobs" 
    
  • Use the -batch-name option with condor_submit:

    [alice@submit]$ condor_submit submit_file.sub -batch-name CoolJobs
    

Either option will create a batch of jobs with the label "CoolJobs".

2. View individual jobs.

To display more detailed condor_q output (where each job is listed on a separate line), use the -nobatch flag together with a batch name or any other grouping constraint (a ClusterId or a "-constraint" expression - see below for more on constraints).

Looking at a batch of jobs with the same ClusterId would look like this:

[alice@submit]$ condor_q -nobatch 195

 ID     OWNER    SUBMITTED     RUN_TIME ST PRI SIZE CMD
195.10  alice    6/22 13:00   0+00:00:00 H  0    0.0 job.sh
195.14  alice    6/22 13:00   0+00:01:44 R  0    0.0 job.sh
195.16  alice    6/22 13:00   0+00:00:26 R  0    0.0 job.sh
195.39  alice    6/22 13:00   0+00:00:05 R  0    0.0 job.sh
195.40  alice    6/22 13:00   0+00:00:00 I  0    0.0 job.sh
195.41  alice    6/22 13:00   0+00:00:00 I  0    0.0 job.sh
195.53  alice    6/22 13:00   0+00:00:00 I  0    0.0 job.sh
195.57  alice    6/22 13:00   0+00:00:00 I  0    0.0 job.sh
195.58  alice    6/22 13:00   0+00:00:00 I  0    0.0 job.sh

9 jobs; 0 completed, 0 removed, 5 idle, 3 running, 1 held, 0 suspended

This was the default view for condor_q from January 2016 until July 2016.

3. View jobs from all users.

By default, condor_q will just show you information about your jobs. To get information about all jobs in the queue, type:

[alice@submit]$ condor_q -all

This will show a list of all job batches in the queue. To see a list of all jobs (individually, not in batches) for all users, combine the -all and -nobatch options with condor_q. This was the default view for condor_q before January 2016.
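For example:

[alice@submit]$ condor_q -all -nobatch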

4. Determine why jobs are on hold.

If your jobs have gone on hold, you can see the hold reason by running:

[alice@submit]$ condor_q -hold

or

[alice@submit]$ condor_q -hold JobId 

The first will show you the hold reasons for all of your jobs that are on hold; the second will show you the hold reason for a specific job. The hold reason is sometimes cut off; try the following to see the entire hold reason:

[alice@submit]$ condor_q -hold -af HoldReason

If you aren't sure what your hold reason means, email chtc@cs.wisc.edu.

5. Find out why jobs are idle

condor_q has an option to describe why a job hasn't matched and started running. Find the JobId of a job that hasn't started running yet and use the following command:

$ condor_q -better-analyze JobId

After a minute or so, this command should print out some information about why your job isn't matching and starting. This information is not always easy to understand, so please email us with the output of this command if you have questions about what it means.

6. Find out where jobs are running.

To see which computers your jobs are running on, use:

[alice@submit]$ condor_q -nobatch -run
428.0   alice        6/22  17:27   0+00:07:17 slot1_12@e313.chtc.wisc.edu
428.1   alice        6/22  17:27   0+00:07:11 slot1_8@e376.chtc.wisc.edu
428.2   alice        6/22  17:27   0+00:07:16 slot1_15@e451.chtc.wisc.edu
428.3   alice        6/22  17:27   0+00:07:16 slot1_17@e277.chtc.wisc.edu
428.5   alice        6/22  17:27   0+00:07:16 slot1_9@e351.chtc.wisc.edu
428.7   alice        6/22  17:27   0+00:07:16 slot1_1@e373.chtc.wisc.edu
428.8   alice        6/22  17:27   0+00:07:16 slot1_5@e264.chtc.wisc.edu

7. View jobs by DAG.

If you have submitted multiple DAGs to the queue, it can be hard to tell which jobs belong to which DAG. The -dag option to condor_q will sort your queue output by DAG:

[alice@submit]$ condor_q -nobatch -dag
 ID      OWNER/NODENAME   SUBMITTED     RUN_TIME ST PRI SIZE CMD               
460.0   alice        11/18 16:51   0+00:00:17 R  0   0.3  condor_dagman -p 0
462.0    |-0           11/18 16:51   0+00:00:00 I  0   0.0  print.sh
463.0    |-1           11/18 16:51   0+00:00:00 I  0   0.0  print.sh
464.0    |-2           11/18 16:51   0+00:00:00 I  0   0.0  print.sh
461.0   alice        11/18 16:51   0+00:00:09 R  0   0.3  condor_dagman -p 0
465.0    |-0           11/18 16:51   0+00:00:00 I  0   0.0  print.sh
466.0    |-1           11/18 16:51   0+00:00:00 I  0   0.0  print.sh
467.0    |-2           11/18 16:51   0+00:00:00 I  0   0.0  print.sh

8 jobs; 0 completed, 0 removed, 6 idle, 2 running, 0 held, 0 suspended

8. View all details about a job.

Each job you submit has a series of attributes that are tracked by HTCondor. You can see the full set of attributes for a single job by using the "long" option for condor_q like so:

[alice@submit]$ condor_q -l JobId 
...
Iwd = "/home/alice/analysis/39909"
JobPrio = 0
RequestCpus = 1
JobStatus = 1
ClusterId = 19997268
JobUniverse = 5
RequestDisk = 10485760
RequestMemory = 4096
DAGManJobId = 19448402
...

Attributes that are often useful for checking on jobs are:

  • Iwd: the job's submission directory on the submit node
  • UserLog: the log file for a job
  • RequestMemory, RequestDisk: how much memory and disk you've requested per job
  • MemoryUsage: how much memory the job has used so far
  • JobStatus: numerical code indicating whether a job is idle, running, or held
  • HoldReason: why a job is on hold
  • DAGManJobId: for jobs managed by a DAG, this is the JobId of the parent DAG

9. View specific details about a job using auto-format

If you would like to see specific attributes (see above) for a job or group of jobs, you can use the "auto-format" (-af) option to condor_q which will print out only the attributes you name for a single job or group of jobs.

For example, if I would like to see the amount of memory and disk I've requested for all of my jobs, and how much memory is currently being used, I can run:

[alice@submit]$ condor_q -af RequestMemory RequestDisk MemoryUsage
1 325 undefined
1 325 undefined
2000 1000 245
2000 1000 220
2000 1000 245

10. Constraining the output of condor_q.

If you would like to find jobs that meet certain conditions, you can use condor_q's "constraint" option. For example, suppose you want to find all of the jobs associated with the DAGMan Job ID "234567". You can search using:

[alice@submit]$ condor_q -constraint "DAGManJobId == 234567" 

To use a name (for example, a batch name) as a constraint, you'll need to use multiple sets of quotation marks:

[alice@submit]$ condor_q -constraint 'JobBatchName == "MyJobs"'

One common use of constraints is to find all jobs that are running, held, or idle. To do this, use a constraint with the JobStatus attribute and the appropriate status number - the status codes can be found in Appendix A of the HTCondor Manual.
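For example, using the standard JobStatus codes (1 is idle, 2 is running, 5 is held), you can list just your running jobs with:

[alice@submit]$ condor_q -constraint "JobStatus == 2"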

Remember condor_q -hold from before? In the background, the -hold option is constraining the list of jobs to jobs that are on hold (using the JobStatus attribute) and then printing out the HoldReason attribute. Try running:

[alice@submit]$ condor_q -constraint "JobStatus == 5" -af ClusterId ProcId HoldReason

You should see something very similar to running condor_q -hold!

11. Remove a held job from the queue

To remove a job held in the queue, run:

[alice@submit]$ condor_rm <JobID>

This will remove the job from the queue. Once you have made changes to allow the job to run successfully, the job can be resubmitted using condor_submit.
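For example, reusing the job IDs from example 2 above, you can remove a single held job by its full ID or every job in a cluster at once:

# remove one specific job
[alice@submit]$ condor_rm 195.10

# remove every job in cluster 195
[alice@submit]$ condor_rm 195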


This page takes some of its content and formatting from this HTCondor reference page.