Running Your First CHTC Jobs
So, you have an account on a submit node,
and you are ready to run your first job in the CHTC.
As we described in Our Approach,
the CHTC is a collection of distributed resources.
The magic that enables you to run jobs on these resources is software called HTCondor,
developed at UW-Madison.
1. Let's first do, and then ask why
Rather than having you read a bunch of stuff beforehand,
let's just run some jobs so you can see what happens,
and we'll provide some additional discussion along the way.
We are going to run the traditional 'hello world' program with a CHTC twist.
In order to demonstrate the distributed resource nature of the CHTC,
we will produce a 'Hello CHTC' message 3 times, where each time is its own job.
Since you are not directly invoking the execution of each job,
you need to tell HTCondor how to run the jobs for you.
The information needed is placed into a submit file, which
defines variables that describe the set of jobs.
Note: You must be logged into an HTCondor submit machine
for the following example to work.
1. Copy the highlighted text below,
and paste it into a file called
hello-chtc.sub, the submit file,
in your home directory on the submit machine.
# My very first HTCondor submit file
# Specify the HTCondor Universe (vanilla is the default and is used
# for almost all jobs), the desired name of the HTCondor log file,
# and the desired name of the standard error file.
# Wherever you see $(Cluster), HTCondor will insert the queue number
# assigned to this set of jobs at the time of submission.
universe = vanilla
log = hello-chtc_$(Cluster).log
error = hello-chtc_$(Cluster)_$(Process).err
# Specify your executable (single binary or a script that runs several
# commands), arguments, and a file for HTCondor to store standard
# output (or "screen output").
# $(Process) will be an integer number for each job, starting with "0"
# and increasing for the relevant number of jobs.
executable = hello-chtc.sh
arguments = $(Process)
output = hello-chtc_$(Cluster)_$(Process).out
# Specify that HTCondor should transfer files to and from the
# computer where each job runs. The last of these lines *would* be
# used if there were any other files needed for the executable to run.
should_transfer_files = YES
when_to_transfer_output = ON_EXIT
# transfer_input_files = file1,/absolute/pathto/file2,etc
# Tell HTCondor what amount of compute resources
# each job will need on the computer where it runs.
request_cpus = 1
request_memory = 1GB
request_disk = 1MB
# Tell HTCondor to run 3 instances of our job:
queue 3
2. Now, create the executable that we specified above: copy the text
below and paste it into a file called
hello-chtc.sh, also in your home directory on the submit machine.
#!/bin/bash
# My very first CHTC job
echo "Hello CHTC from Job $1 running on `whoami`@`hostname`"
When HTCondor runs this executable,
it will pass the $(Process) value for each job and
will insert that value for "$1", above.
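To see how the argument reaches the script, the executable above can be tried by hand before submitting anything. The following is a minimal sketch, not part of the CHTC workflow itself: it assumes a scratch directory created with mktemp, adds a `#!/bin/bash` first line to the script, and uses a plain loop to stand in for HTCondor supplying the $(Process) value.

```shell
# Recreate the executable in a scratch directory so nothing in your
# home directory is overwritten.
workdir=$(mktemp -d)
cat > "$workdir/hello-chtc.sh" <<'EOF'
#!/bin/bash
# My very first CHTC job
echo "Hello CHTC from Job $1 running on `whoami`@`hostname`"
EOF

# Stand in for HTCondor: pass 0, 1, and 2 as the first argument, the same
# values $(Process) takes for three queued jobs.
for process in 0 1 2; do
    bash "$workdir/hello-chtc.sh" "$process"
done
```

Each line of output should name a different job number, while the user and host are those of the machine where you ran the loop rather than an execute node.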
3. Now, submit your job to the queue using
condor_submit:
[alice@submit]$ condor_submit hello-chtc.sub
The condor_submit command actually submits your jobs to HTCondor.
If all goes well, you will see output from the
condor_submit command that appears as:
3 job(s) submitted to cluster 845638.
4. To check on the status of your jobs, run the following command:
[alice@submit]$ condor_q
(If you want to see everyone's jobs, use
condor_q -all.)
The output of
condor_q should look like this:
-- Schedd: submit-5.chtc.wisc.edu : <22.214.171.124:9618?...
OWNER  BATCH_NAME          SUBMITTED    DONE  RUN  IDLE  TOTAL  JOB_IDS
alice  CMD: hello-chtc.sh  7/20 18:41   _     _    3     _      845638.0-2
3 jobs; 0 completed, 0 removed, 3 idle, 0 running, 0 held, 0 suspended
You can run the
condor_q command periodically to see the progress of your jobs.
condor_q shows jobs grouped into batches by batch name (if provided),
or executable name. To show all of your jobs on individual lines, add the
-nobatch option. For more details on this option, and other options to
condor_q, see our condor_q guide.
5. When your jobs complete after a few minutes, they'll leave the queue.
(If your jobs go on hold and you usually use a Windows laptop
or desktop, please see
this page for a potential diagnosis and solution.)
If you list your home directory with the command
ls -l, you should see something like:
[alice@submit]$ ls -l
-rwxrwxr-x 1 user user 92 Mar 12 12:45 hello-chtc.sh
-rw------- 1 user user 17 Mar 12 12:48 hello-chtc_845638_0.out
-rw------- 1 user user 17 Mar 12 12:48 hello-chtc_845638_1.out
-rw------- 1 user user 17 Mar 12 12:48 hello-chtc_845638_2.out
-rw------- 1 user user 0 Mar 12 12:48 hello-chtc_845638_0.err
-rw------- 1 user user 0 Mar 12 12:48 hello-chtc_845638_1.err
-rw------- 1 user user 0 Mar 12 12:48 hello-chtc_845638_2.err
-rw-rw-r-- 1 user user 3180 Mar 12 12:48 hello-chtc_845638.log
-rw-rw-r-- 1 user user 580 Mar 12 12:40 hello-chtc.sub
Useful information is provided in the user log and the output files.
HTCondor creates a transaction log of everything that happens to your jobs.
Looking at the log file is very useful for
debugging problems that may arise.
An excerpt from
hello-chtc_845638.log
produced due to the submission of the 3 jobs will look something like this:
000 (845638.000.000) 03/12 12:46:29 Job submitted from host: <126.96.36.199:9618?sock=5235_1ed5_2>
000 (845638.001.000) 03/12 12:46:29 Job submitted from host: <188.8.131.52:9618?sock=5235_1ed5_2>
000 (845638.002.000) 03/12 12:46:29 Job submitted from host: <184.108.40.206:9618?sock=5235_1ed5_2>
001 (845638.000.000) 03/12 12:48:06 Job executing on host: <220.127.116.11:49163>
005 (845638.000.000) 03/12 12:48:06 Job terminated.
(1) Normal termination (return value 0)
Usr 0 00:00:00, Sys 0 00:00:00 - Run Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Total Remote Usage
Usr 0 00:00:00, Sys 0 00:00:00 - Total Local Usage
17 - Run Bytes Sent By Job
92 - Run Bytes Received By Job
17 - Total Bytes Sent By Job
92 - Total Bytes Received By Job
Partitionable Resources : Usage Request Allocated
Cpus : 1 1
Disk (KB) : 12 1000 26703078
Memory (MB) : 0 1000 1000
And, if you look at one of the output files,
you should see something like this:
Hello CHTC from Job 0 running on firstname.lastname@example.org
Congratulations. You've run your first jobs in the CHTC!
2. What Else?
A. Removing Jobs
To remove a specific job, specify the job ID number from the queue (format:
Cluster.Process). Example:
[alice@submit]$ condor_rm 845638.0
You can even remove all of the jobs of the same cluster by specifying
only the Cluster value for that batch.
To remove all of your jobs:
[alice@submit]$ condor_rm $USER
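The single-job and whole-cluster forms described above differ only in whether the Process part is given. The sketch below shows both side by side. The `run` helper and the `DRY_RUN` variable are hypothetical, added here so the commands are printed rather than executed; on a submit node you would type the condor_rm commands directly.

```shell
# Hypothetical dry-run helper: prints each command instead of running it.
# On a submit node, set DRY_RUN=0 to execute the commands for real.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

cluster=845638   # Cluster value reported by condor_submit

run condor_rm "${cluster}.0"   # one job: Cluster.Process
run condor_rm "${cluster}"     # every job in that cluster
```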
B. The Importance of Testing
1. Examining Job Success. Within the log file, you can see
information about the completion of each job, including a system
error code (as seen in "return value 0"). You can use this code, as well
as information in your ".err" file and other output files, to determine what
issues your job(s) may have had, if any.
2. Determining Memory and Disk Requirements. The log file also
indicates how much memory and disk each job used, so that you
can first test a few jobs before submitting many more with more accurate
request values. When you request too little, your jobs will be "evicted"
from the computer they're running on, and HTCondor will have to try to
rerun them (maybe many times) until it requests enough for you.
When you request too much, your jobs may not match to as many available
"slots" as they could otherwise, and your overall throughput will suffer
in that case as well.
3. Determining Run Time. Depending on how long each of your jobs
runs (determined by examining when the job began executing and when it
completed), you can send your jobs to even more computers than are in the
CHTC Pool (where your jobs will run, by default). Refer to the table below
for tips on how to send your jobs to the rest of the UW Grid and to the
national Open Science Grid.
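The first two checks above (the return value, and disk and memory usage compared with the request) can be read out of the log with a short script. This is a rough sketch under stated assumptions, not a CHTC-provided tool: the `summarize_log` function is a hypothetical name, the sample log lines are copied from the excerpt earlier on this page, and the parsing assumes the Usage, Request, and Allocated columns are all present on the Disk and Memory lines.

```shell
# Sample log lines copied from the excerpt above; on a submit node, point
# "log" at your real file (for example hello-chtc_845638.log) instead.
log=$(mktemp)
cat > "$log" <<'EOF'
005 (845638.000.000) 03/12 12:48:06 Job terminated.
    (1) Normal termination (return value 0)
    Partitionable Resources :    Usage  Request Allocated
       Cpus                 :                 1         1
       Disk (KB)            :       12     1000  26703078
       Memory (MB)          :        0     1000      1000
EOF

# Print one summary line per terminated job: job ID, return value, and the
# usage/request pairs for disk and memory.
summarize_log() {
    awk '
        /^005 /         { job = $2; gsub(/[()]/, "", job) }
        /return value/  { rv = $NF; sub(/\)/, "", rv) }
        /Disk \(KB\)/   { disk = $(NF-2) "/" $(NF-1) " KB" }
        /Memory \(MB\)/ { print job, "return=" rv, "disk=" disk, "memory=" $(NF-2) "/" $(NF-1) " MB" }
    ' "$1"
}

summarize_log "$log"
```

For the sample lines this prints `845638.000.000 return=0 disk=12/1000 KB memory=0/1000 MB`, which reads as usage/request. A usage figure close to or above the request suggests raising the request before submitting more jobs.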
C. Getting the Right Resources
Be sure to always add or modify the following lines in your submit files, as appropriate,
and after running a few tests.
|Submit file entry|Resources your jobs will run on|
|request_cpus = cpus|Matches each job to a computer "slot" with at least this many CPU cores.|
|request_disk = kilobytes|Matches each job to a slot with at least this much disk space, in units of KB.|
|request_memory = megabytes|Matches each job to a slot with at least this much memory (RAM), in units of MB.|
|+WantFlocking = true|Also send jobs to other HTCondor Pools on campus (UW Grid). Good for jobs that are less than ~8 hours, or checkpoint at least that frequently.|
|+WantGlideIn = true|Also send jobs to the Open Science Grid (OSG). Good for jobs that are less than ~8 hours (or checkpoint at least that frequently), and have been tested for portability. (Contact Us for more details).|
D. Now, time for a little homework
To get the most out of the CHTC,
you will want to have a good understanding of how HTCondor works.
The full HTCondor manual is comprehensive,
but the links below guide you to the most important sections to read
in order to get started.
You can always dig into more details as you become more experienced.
Now you are ready for some real work
Ok, you have the basics!
This should be enough background to get you started using the CHTC for the
real problems you came to us for.
Remember, we are here to help.
Don't hesitate to contact us at
email@example.com with questions.