Center for High Throughput Computing

Troubleshooting Condor Jobs

This page lists tips for troubleshooting jobs submitted in CHTC using HTCondor. Jump to the section below that describes your problem. If you are not sure where to begin, see the "Emailing Us" section for how to contact us, or follow our directions for interactive testing.

Potential issues
  1. My job isn't starting.
  2. My job is on hold.
  3. I want to check on a running job.
  4. My job ran and completed, but I think something went wrong.
  5. I need to troubleshoot a DAG.

Emailing Us

It can be hard to figure out why jobs aren't working! Don't hesitate to email chtc@cs.wisc.edu with questions. We can help you most effectively if your email to us contains the following information:

  • Tell us which submit server you log into
  • Describe your problem:
    • What are you trying to do?
    • What did you expect to see?
    • What was different than what you expected?
    • What error messages have you received (if any)?
  • Attach any relevant files (output, error, log, submit files, scripts) or tell us the directory on the submit server where we can find these files.

1. My job isn't starting.

condor_q has an option to describe why a job hasn't matched and started running. Find the JobId of a job that hasn't started running yet and use the following command:

[alice@submit]$ condor_q -better-analyze JobId

After a minute or so, this command should print out some information about why your job isn't matching and starting. This information is not always easy to understand, so please email us with the output of this command if you have questions about what it means.
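If you plan to email us the analysis, it can help to save it to a file first. The following is a minimal sketch: the function name save_match_analysis and the output file name are illustrative choices, not part of HTCondor.

```shell
# Save the match analysis for one job to a text file that can be attached
# to an email. The function name and file name are hypothetical examples.
save_match_analysis() {
    # $1 is the JobId, for example 16173120.0
    condor_q -better-analyze "$1" > "analyze_$1.txt" 2>&1
    # Print the name of the file that was written.
    echo "analyze_$1.txt"
}
```

For example, save_match_analysis 16173120.0 would write the analysis to analyze_16173120.0.txt in the current directory.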

2. My job is on hold.

If your jobs have gone on hold, you can see the hold reason by running:

[alice@submit]$ condor_q -hold -af HoldReason

The list below has some common hold reasons and their solutions.

  • "Disk quota exceeded": Output files can't be returned to the submit node if you have reached your quota. See our quota guide for instructions on managing your quota.
  • "Job has gone over memory limit of X": Look at the resource usage table in your log files - are you requesting enough memory for your jobs?
  • "errno=8: 'Exec format error'": Often, this is a Windows/Linux compatibility issue, addressed in this guide. It can also happen if your executable is a bash script and doesn't have this exact header on the first line:
    #!/bin/bash
  • "Job failed to complete in 72 hrs": If your job ran past our 3 day time limit, it goes on hold and all progress is lost. Get in touch with the CHTC facilitators for strategies to avoid this issue.
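When many jobs are on hold, it can be easier to see which reasons occur most often. The sketch below pipes the command shown above through standard text tools; the function name summarize_hold_reasons is a hypothetical helper, not an HTCondor command.

```shell
# Count how many held jobs share each hold reason, most common first.
# Uses the same condor_q options shown above; the function name is
# an illustrative choice.
summarize_hold_reasons() {
    condor_q -hold -af HoldReason | sort | uniq -c | sort -rn
}
```

Each output line starts with a count, followed by the hold reason that the count applies to.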

3. I want to check on a running job.

To log into a running job, you can use the condor_ssh_to_job command with the job's ID.

[alice@submit]$ condor_ssh_to_job JobId

This will log you into the running job, so you can look at its working directory. Typing exit will log you out of the job without causing it to stop running. Jobs that are running on the UW Grid or Open Science Grid may be inaccessible using condor_ssh_to_job.

4. My job ran and completed, but I think something went wrong.

A. Test Interactively

One way to see why a job didn't work is to run a sample job as an interactive test. We have a guide for doing this here.

B. Check Support Files

If your submit file includes the lines:

log = logfile.log
output = outfile.out
error = errfile.err

HTCondor will produce these files and they can be used for troubleshooting. Here is the kind of information you can find in each file.

Output

The "output" file contains any information that would normally have been printed to the terminal if you were running the script/program yourself from the command line. It can be useful for figuring out what went right/wrong inside your program, or simply to measure a job's progress.

Tips for troubleshooting: If you have jobs that are failing and your job's main executable is a script, you can add the following information to help debug:
  • Use a command that prints the hostname, or name of the machine where the job is running. In a bash script, this is simply the command hostname.
  • Print out the contents of the working directory (in bash, this is ls -l).
  • Add "print" statements where appropriate in your code, to determine what your program is doing as it runs.
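The tips above could be combined in a bash executable roughly as follows. This is a minimal sketch: the function name print_job_debug_info and the echo messages are illustrative, and the commented placeholder is where your own program would run.

```shell
#!/bin/bash
# Print basic debugging information to the job's "output" file.
# The function name is a hypothetical example.
print_job_debug_info() {
    echo "Running on host: $(hostname)"
    echo "Working directory: $(pwd)"
    echo "Contents of working directory:"
    ls -l
}

print_job_debug_info
echo "Starting main program"
# ... run your own program here ...
echo "Main program finished"
```

Calling print_job_debug_info a second time at the end of the script also shows which files the program created.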

Error

The error file contains any errors that would normally have been printed to the terminal if you were running the script/program yourself from the command line. This is a good place to look first if you think your code triggered an error.

Log

The log file contains information that HTCondor tracks for each job, including when it was submitted, started, and stopped. It also describes resource use, and where the job ran.

  • When jobs were submitted, started, and stopped:
    000 (16173120.000.000) 03/16 09:50:48 Job submitted from host: 
        <128.104.101.92:9618?addrs=128.104.101.92-9618
    001 (16173120.000.000) 03/16 09:53:10 Job executing on host: 
        <128.105.244.92:9618?addrs=128.105.244.92-9618&noUDP&sock=7150_4f71_3>
    005 (16173120.000.000) 03/16 09:58:12 Job terminated.
  • Resources used
    Partitionable Resources :    Usage  Request Allocated
       Cpus                 :                 1         1
       Disk (KB)            :       15  1048576  11053994
       Memory (MB)          :        1   102400    102400
  • Exit status
    (1) Normal termination (return value 0)
    A return value of "0" is normal; non-zero values indicate an error.
  • Where the job ran:
    Job executing on host: <128.105.244.92:
    You can get the "name" of the machine where a job ran by running the command host, followed by the 4-part IP address. Using the above example, this would look like:
    [alice@submit]$ host 128.105.244.92
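If a log file covers many jobs, standard text tools can pull out the lines described above. The sketch below is one possible approach: the function names are hypothetical, and job_host_ips assumes that each "Job executing on host" event and its address appear on a single line in the log file.

```shell
# Print the return value recorded for each normally terminated job
# in an HTCondor log file ($1 is the path to the log file).
job_return_values() {
    sed -n 's/.*(return value \([0-9][0-9]*\)).*/\1/p' "$1"
}

# Print the IP address from each "Job executing on host" line.
# Assumption: the event text and the address are on the same line.
job_host_ips() {
    sed -n 's/.*Job executing on host: <\([0-9.]*\):.*/\1/p' "$1"
}
```

For example, job_return_values logfile.log prints one number per terminated job, so any non-zero value stands out. The addresses printed by job_host_ips can be passed to the host command shown above.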

C. Troubleshooting Common Errors

Here are some common errors that may come up while testing your job:

  • Is your executable not running?
    • Try adding executable permissions:
      [alice@submit]$ chmod +x my_exec
    • If your executable is a shell script (or another kind of script), make sure you have the appropriate header, like:
      #!/bin/bash
      as the very first line of your script.
  • What does the error message say?
    • One common error is that the script can't find a program or library. If this is the case, you can email the research computing facilitators at chtc@cs.wisc.edu or search for the error yourself and try to fix it in the interactive session.
  • Are you missing a file you need?
    • If you are in an interactive job and the file you need is on the submit server, you can copy it into the directory using scp:
      [alice@e000 dir_0]$ scp alice@submit.chtc.wisc.edu:/home/alice/dir/input_file .
      Change "alice" to your username and "submit.chtc.wisc.edu" to the submit node you normally use. The path after the colon should be the full path to the file you want to transfer. Once you exit the interactive session, don't forget to add this file to your list of transfer_input_files.
    • If you aren't testing in an interactive job, add the file to transfer_input_files in your submit file and resubmit.
    • If the file is on Gluster, either copy the file from Gluster to the directory (if testing interactively) or add a copy statement to the job's executable and try running again. If it works, make sure the file is copied to the working directory and then removed during the job by adding those commands to the job's executable.
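The first two checks above (executable permissions and the header line) can be scripted before you submit. This is a minimal sketch for script executables only; the function name check_executable and its messages are hypothetical.

```shell
# Check a script for the two problems described above: missing execute
# permission and a first line that does not start with "#!".
# Intended for scripts only; compiled programs have no "#!" line.
check_executable() {
    status=0
    if [ ! -x "$1" ]; then
        echo "$1: not executable (try: chmod +x $1)"
        status=1
    fi
    # Read only the first line of the file.
    first_line=""
    IFS= read -r first_line < "$1" || true
    case "$first_line" in
        '#!'*) ;;
        *)
            echo "$1: first line does not start with #!"
            status=1
            ;;
    esac
    return "$status"
}
```

Running check_executable my_exec prints nothing and returns 0 when both checks pass.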

5. Troubleshooting DAGs

A. Test the Pieces

The first step to troubleshooting a DAG is to test the component pieces by either running them manually or running a small test DAG.

B. DAG Support Files

A DAG will automatically create its own output and log files in its submission directory, in addition to any log/output/error files created by the submitted jobs. The most useful file is typically the name_of_dag.dagman.out file. This file contains information about when various jobs were submitted by the DAG, and it keeps a running tally of completed/running/idle jobs (look near the bottom of the file for the latest update):

03/09/16 11:36:16  Done     Pre   Queued    Post   Ready   Un-Ready   Failed
03/09/16 11:36:16   ===     ===      ===     ===     ===        ===      ===
03/09/16 11:36:16     1       0        4       0      10          0        0
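Rather than scrolling to the bottom of the file, you can print only the most recent tally. The sketch below assumes the tally always appears as the three-line block shown above and that your grep supports the -A option; the function name latest_dag_tally is a hypothetical helper.

```shell
# Print the most recent job tally (header, separator, and counts)
# from a .dagman.out file ($1 is the path to the file).
latest_dag_tally() {
    # grep prints every tally block; tail keeps the last three lines.
    grep -A 2 'Done  *Pre  *Queued  *Post' "$1" | tail -n 3
}
```

For example, latest_dag_tally name_of_dag.dagman.out prints the header row, the separator row, and the latest counts.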