Troubleshooting Condor Jobs

This page lists tips and tricks for troubleshooting jobs submitted to HTCondor at CHTC. You can either jump to the section below that describes your problem, or, if you are not sure where to begin, see the section below on how to email us.

Emailing Us

It can be hard to figure out why jobs aren't working! Don't hesitate to email chtc@cs.wisc.edu with questions. We can help you most effectively if your email to us contains the following information:

  • Tell us which submit server you log into
  • Describe your problem:
    • What are you trying to do?
    • What did you expect to see?
    • What was different than what you expected?
    • What error messages have you received (if any)?
  • Attach any relevant files (output, error, log, submit files, scripts) or tell us the directory on the submit server where we can find these files.

My job isn't starting.

condor_q has an option to describe why a job hasn't matched and started running. Find the JobId of a job that hasn't started running yet and use the following command:

$ condor_q -better-analyze JobId

After a minute or so, this command should print out some information about why your job isn't matching and starting. This information is not always easy to understand, so please email us with the output of this command if you have questions about what it means.
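
For example, you could list your jobs with condor_q, note the ID of one that is still idle, and then analyze it. The ID 16173120.0 below is only an illustration; substitute your own job's cluster.process ID:

$ condor_q
$ condor_q -better-analyze 16173120.0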

My job is on hold.

You can determine why your job is on hold by using condor_q. See this guide for details. Once you have the job's hold reason, see the list below for common hold reasons and their solutions (an example command for viewing hold reasons appears after the list).
  • "Disk quota exceeded": Output files can't be returned to the submit node if you have reached your quota. See this page for instructions on managing your quota.
  • "Job has gone over memory limit of X": Look at the resource usage table in your log files - are you requesting enough memory for your jobs?
  • "errno=8: 'Exec format error'": Often, this is a Windows/Linux compatibility issue.
  • "Job failed to complete in 72 hrs": The job hit the 72-hour run time limit and was placed on hold. See the next section for ways to investigate jobs that run longer than expected.
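
One way to see a job's hold reason on the command line is condor_q's -hold option; the job ID below is only an illustration:

$ condor_q -hold 16173120.0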

My job is running longer than I expect.

To log into a running job, you can use the condor_ssh_to_job command with the job's ID:

$ condor_ssh_to_job JobId

This will log you into the running job so you can look at its working directory. Typing exit will log you out of the job without causing it to stop running. Jobs that are running on the UW Grid or Open Science Grid may be inaccessible using condor_ssh_to_job.
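
For example, a quick check of a running job might look like the following, where the job ID is only an illustration and the last two commands are typed inside the remote session:

$ condor_ssh_to_job 16173120.0
$ ls -l
$ exit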

My job ran and completed, but I think something went wrong.

For jobs that have started and then completed, see the following sections for troubleshooting tips.

Normal Submit Files

If your submit file includes the lines:

log = logfile.log
output = outfile.out
error = errfile.err

HTCondor will produce these files and they can be used for troubleshooting. Here is the kind of information you can find in each file.
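
If you queue many jobs from a single submit file, you may find it helpful to give each job its own files by using HTCondor's $(Cluster) and $(Process) macros in the filenames; the names below are just one possible pattern:

log = job_$(Cluster)_$(Process).log
output = job_$(Cluster)_$(Process).out
error = job_$(Cluster)_$(Process).err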

Log

The log file contains information that HTCondor tracks for each job, including when it was submitted, started, and stopped. It also describes resource use, and where the job ran.

  • When jobs were submitted, started, and stopped:
    000 (16173120.000.000) 03/16 09:50:48 Job submitted from host: 
        <128.104.101.92:9618?addrs=128.104.101.92-9618
    001 (16173120.000.000) 03/16 09:53:10 Job executing on host: 
        <128.105.244.92:9618?addrs=128.105.244.92-9618&noUDP&sock=7150_4f71_3>
    005 (16173120.000.000) 03/16 09:58:12 Job terminated.
  • Resources used
    Partitionable Resources :    Usage  Request Allocated
    Cpus                 :                 1         1
    Disk (KB)            :       15  1048576  11053994
    Memory (MB)          :        1   102400    102400
  • Exit status
    (1) Normal termination (return value 0)
    A return value of "0" is normal; non-zero values indicate an error.
  • Where the job ran:
    Job executing on host: <128.105.244.92:
    You can get the "name" of the machine where a job ran by running the command host, followed by the 4-part IP address. Using the above example, this would look like:
    $ host 128.105.244.92

Output

The "output" file contains any information that would normally have been printed to the terminal if you were running the script/program yourself from the command line. It can be useful for figuring out what went right/wrong inside your program, or simply to measure a job's progress.

Tips for troubleshooting: If you have jobs that are failing and your job's main executable is a script, you can add the following information to help debug (a short example script appears after this list):
  • Use a command that prints the hostname, or name of the machine where the job is running. In a bash script, this is simply the command hostname.
  • Print out the contents of the working directory (in bash, this is ls -l).
  • Add "print" statements where appropriate in your code, to determine what your program is doing as it runs.
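
Putting these tips together, a minimal debugging wrapper script might look like the sketch below; myprogram and input.dat are placeholders for your own program and input file:

#!/bin/bash
# Report where the job is running and what files are in its working directory
hostname
ls -l
# Run the real work; replace this line with your own command
./myprogram input.dat
echo "myprogram exited with code $?"
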
Error

The error file contains any errors that would normally have been printed to the terminal if you were running the script/program yourself from the command line. This is a good place to look first if you think your code triggered an error.

DAGs

A DAG will automatically create its own output and log files in its submission directory, in addition to any log/output/error files created by the submitted jobs. The most useful file is typically the name_of_dag.dagman.out file. This file contains information about when various jobs were submitted by the DAG, and it keeps a running tally of completed/running/idle jobs (look near the bottom of the file for the latest update):

03/09/16 11:36:16  Done     Pre   Queued    Post   Ready   Un-Ready   Failed
03/09/16 11:36:16   ===     ===      ===     ===     ===        ===      ===
03/09/16 11:36:16     1       0        4       0      10          0        0
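
Because DAGMan appends a new status table each time it updates, the latest counts are at the end of the file. One way to see them is:

$ tail name_of_dag.dagman.out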

ChtcRun

If you use our "ChtcRun" tools to submit jobs, see this guide for troubleshooting tips.