Simple Example of a DAGMan Workflow

This guide walks you step-by-step through the construction and submission of a simple DAGMan workflow. We recommend this guide if you are interested in automating your job submissions.

For the full details on various DAGMan features, see the HTCondor manual pages:

HTCondor’s DAGMan Documentation

1. Introduction

Consider the case of two HTCondor jobs that use the submit files A.sub and B.sub. Let’s say that the job submitted with A.sub generates an output file (output.txt) that the job submitted with B.sub will analyze. To run this workflow manually, we would:

Submit the first HTCondor job with condor_submit A.sub.
Wait for the first HTCondor job to complete successfully.
Submit the second HTCondor job with condor_submit B.sub.
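The three manual steps above can be sketched as a small shell function. This is only an illustration of the manual approach, not of how DAGMan works internally. The function name run_manual_workflow is hypothetical, and the sketch assumes that A.sub writes its job event log to A.log (as the example submit file later in this guide does). As far as we know, condor_wait only reports that the job has left the queue, not whether it succeeded, so the sketch also checks that output.txt exists before submitting the second job.

```shell
#!/bin/bash

# Hypothetical helper illustrating the manual two-job workflow.
# Assumes A.sub logs to A.log and that job A produces output.txt.
run_manual_workflow() {
    # Step 1: submit the first job.
    condor_submit A.sub || return 1

    # Step 2: block until the job recorded in A.log has left the queue.
    condor_wait A.log || return 1

    # condor_wait does not tell us whether the job succeeded,
    # so confirm that the expected output file was produced.
    if [ ! -f output.txt ]; then
        echo "output.txt not found; not submitting B.sub" >&2
        return 1
    fi

    # Step 3: submit the second job.
    condor_submit B.sub
}
```

Calling run_manual_workflow from the directory that holds A.sub and B.sub would keep your terminal occupied until the first job finishes, which is exactly the inconvenience that DAGMan removes.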

If the first HTCondor job using A.sub is fairly short, then manually running this workflow is not a big deal. But if the first HTCondor job takes a long time to complete (for example, it runs for several hours or has to wait for special resources), this can be very inconvenient. Instead, we can use DAGMan to automatically submit B.sub once the first HTCondor job using A.sub has completed successfully. This guide walks through the process of creating such a DAGMan workflow.

2. Structure of the DAG

In this scenario, our workflow could be described as a DAG consisting of two nodes (A.sub and B.sub) connected by a single edge (output.txt). To represent this relationship, we will define nodes A and B - corresponding to A.sub and B.sub, respectively - and connect them with an arrow pointing from A to B, as in this figure:

Node A with arrow pointing to Node B

In order to use DAGMan to run this workflow, we need to communicate this structure to DAGMan via the .dag input file.

3. The Minimal DAG Input File

Let’s call the input file simple.dag. At minimum, the simple.dag input file contains:

# simple.dag

# Define the DAG jobs
JOB A A.sub
JOB B B.sub

# Define the connections
PARENT A CHILD B

In a DAGMan input file, a node is defined using the JOB keyword, followed by the name of the node and the name of the corresponding submit file. In this case, we have created a node named A and instructed DAGMan to use the submit file A.sub for executing that node. We have similarly created node B and instructed DAGMan to use the submit file B.sub. (While there is no requirement that the name of the node match the name of the corresponding submit file, it is convenient to use a consistent naming scheme.)

To connect the nodes, we use the PARENT .. CHILD .. syntax. Since node B requires that node A has completed successfully, we say that node A is the PARENT while node B is the CHILD. Note that we do not need to define why node B is dependent on node A, only that it is.
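Before submitting, it can be useful to confirm that every node named in a PARENT .. CHILD .. line was actually declared with a JOB line. The following is a minimal sketch of such a check. The function check_dag_nodes is a hypothetical helper and not part of HTCondor, and it assumes the uppercase keywords and simple one-line syntax used in this guide.

```shell
#!/bin/bash

# Hypothetical helper: report nodes that appear in PARENT/CHILD lines
# but were never declared with a JOB line.
# Assumes uppercase keywords, as written in this guide.
check_dag_nodes() {
    awk '
        # Remember every node declared as: JOB <name> <submit file>
        $1 == "JOB" { declared[$2] = 1 }

        # Remember every node named in: PARENT <names> CHILD <names>
        $1 == "PARENT" {
            for (i = 2; i <= NF; i++)
                if ($i != "CHILD") used[$i] = 1
        }

        # After reading the whole file, report undeclared nodes.
        END {
            status = 0
            for (n in used)
                if (!(n in declared)) {
                    print "undeclared node: " n
                    status = 1
                }
            exit status
        }
    ' "$1"
}
```

For example, check_dag_nodes simple.dag prints nothing and returns success for the file above, while a DAG file that references a node without a matching JOB line produces an "undeclared node" message and a non-zero exit status.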

4. The Submit Files

Now let’s define simple examples of the submit files A.sub and B.sub.

Node A

First, the submit file A.sub uses the executable A.sh, which will generate the file called output.txt. We have explicitly told HTCondor to transfer back this file by using the transfer_output_files command.

# A.sub

executable = A.sh

log = A.log
output = A.out
error = A.err

transfer_output_files = output.txt

request_cpus = 1
request_memory = 1GB
request_disk = 1GB

queue

The executable file simply saves the hostname of the machine running the script:

#!/bin/bash

# A.sh
hostname > output.txt

sleep 1m  # so we can see the job in "running" status

Node B

Second, the submit file B.sub uses the executable B.sh to print a message using the contents of the output.txt file generated by A.sh. We have explicitly told HTCondor to transfer output.txt as an input file for this job, using the transfer_input_files command. Thus we have finally defined the “edge” that connects nodes A and B: the use of output.txt.

# B.sub

executable = B.sh

log = B.log
output = B.out
error = B.err

transfer_input_files = output.txt

request_cpus = 1
request_memory = 1GB
request_disk = 1GB

queue

The executable file contains the command for printing the desired message, which will be printed to B.out.

#!/bin/bash

# B.sh
echo "The previous job was executed on the following machine:"
cat output.txt

sleep 1m  # so we can see the job in "running" status

The directory structure

Based on the contents of simple.dag, DAGMan expects the submit files A.sub and B.sub to be in the same directory as simple.dag. The submit files in turn expect A.sh and B.sh to be in the same directory as A.sub and B.sub. Thus, we have the following directory structure:

DAG_simple/
|-- A.sh
|-- A.sub
|-- B.sh
|-- B.sub
|-- simple.dag

It is possible to organize each job into its own directory, but for now we will use this simple, flat organization.
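As a quick pre-submission check of this flat layout, a sketch like the following confirms that all five files are present. The function check_dag_files is a hypothetical helper introduced here for illustration; the file names are the ones used in this guide.

```shell
#!/bin/bash

# Hypothetical helper: confirm that the five files of the flat layout
# exist in the given directory (default: the current directory).
check_dag_files() {
    local dir="${1:-.}"
    local missing=0
    local f
    for f in simple.dag A.sub B.sub A.sh B.sh; do
        if [ ! -f "$dir/$f" ]; then
            echo "missing: $dir/$f" >&2
            missing=1
        fi
    done
    # Return 0 only if every file was found.
    return "$missing"
}
```

Running check_dag_files from inside DAG_simple/ (or check_dag_files DAG_simple from its parent directory) returns success only when every file is in place.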

5. Running the Simple DAG

To run the DAG workflow described by simple.dag, we use the HTCondor command condor_submit_dag:

condor_submit_dag simple.dag

The DAGMan utility will then parse the input file and generate an assortment of related files that it will use for monitoring and managing your workflow. Here is the output of running the above command:

[user@login DAG_simple]$ condor_submit_dag simple.dag

Loading classad userMap 'checkpoint_destination_map' ts=1699037029 from /etc/condor/checkpoint-destination-mapfile
-----------------------------------------------------------------------
File for submitting this DAG to HTCondor           : simple.dag.condor.sub
Log of DAGMan debugging messages                 : simple.dag.dagman.out
Log of HTCondor library output                     : simple.dag.lib.out
Log of HTCondor library error messages             : simple.dag.lib.err
Log of the life of condor_dagman itself          : simple.dag.dagman.log

Submitting job(s).
1 job(s) submitted to cluster 562265.
-----------------------------------------------------------------------

The output shows the list of standard files that are created with every DAG submission along with brief descriptions. A couple of additional files, some of them temporary, will be created during the lifetime of the DAG.

6. Monitoring the Simple DAG

You can see the status of the DAG in your queue just like with any other HTCondor job submission.

[user@login DAG_simple]$ condor_q

-- Schedd: ap2002.chtc.wisc.edu : <128.105.68.92:9618?... @ 12/14/23 11:26:51
OWNER       BATCH_NAME           SUBMITTED   DONE   RUN    IDLE  TOTAL JOB_IDS
user        simple.dag+562265  12/14 11:26      _      _      1      2 562279.0

There are a few things to note about the condor_q output:

The BATCH_NAME for the DAGMan job is the name of the input DAG file, simple.dag, plus the Job ID of the DAGMan scheduler job (562265 in this case): simple.dag+562265.
The total number of jobs for simple.dag+562265 corresponds to the total number of nodes in the DAG (2).
Only 1 node is listed as “Idle”, meaning that DAGMan has only submitted 1 job so far. This is consistent with the fact that node A has to complete before DAGMan can submit the job for node B.
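To cross-check the TOTAL column against the DAG input file, you can count its JOB lines. This is a minimal sketch; count_dag_nodes is a hypothetical helper, and it assumes uppercase JOB keywords and that each submit file queues a single job, as in this guide.

```shell
#!/bin/bash

# Hypothetical helper: count the nodes declared with JOB lines
# in a DAG input file (assumes the uppercase keyword used in this guide).
count_dag_nodes() {
    awk '$1 == "JOB" { n++ } END { print n + 0 }' "$1"
}
```

For the simple.dag file in this guide, count_dag_nodes simple.dag should report the same number that condor_q shows under TOTAL.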

Note that if you are very quick to run your condor_q command after running your condor_submit_dag command, then you may see only the DAGMan scheduler job. It may take a few seconds for DAGMan to start up and submit the HTCondor job associated with the first node.

To see more detailed information about the DAG workflow, use condor_q -dag -nob. For example:

[user@login DAG_simple]$ condor_q -dag -nob

-- Schedd: ap2002.chtc.wisc.edu : <128.105.68.92:9618?... @ 12/14/23 11:27:03
 ID        OWNER/NODENAME      SUBMITTED     RUN_TIME ST PRI SIZE CMD
562265.0   user                12/14 11:26   0+00:00:37 R  0    0.5 condor_dagman -p 0 -f -l . -Loc
562279.0    |-A                12/14 11:26   0+00:00:00 I  0    0.0 A.sh

In this case, the first entry is the DAGMan scheduler job that you created when you first submitted the DAG. The following entries correspond to the nodes whose jobs are currently in the queue. Nodes that have not yet been submitted by DAGMan or that have completed and thus left the queue will not show up in your condor_q output.

7. Wrapping Up

After waiting enough time, this simple DAG workflow should complete without any issues. But of course, that will not be the case for every DAG, especially as you start to create your own. DAGMan has many more features for managing and submitting DAG workflows, including handling errors, combining DAG workflows, and restarting failed DAG workflows.
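Once the DAG has left the queue, one way to confirm that both nodes did their work is to compare B.out against output.txt. The sketch below assumes the A.sh and B.sh scripts from this guide, with output.txt and B.out returned to the submit directory. The function check_dag_results is a hypothetical helper, not an HTCondor tool.

```shell
#!/bin/bash

# Hypothetical helper: verify that B.out contains the message printed by B.sh
# followed by the contents of output.txt written by A.sh.
check_dag_results() {
    local dir="${1:-.}"
    local expected

    # Both files must exist before they can be compared.
    [ -f "$dir/output.txt" ] && [ -f "$dir/B.out" ] || return 1

    # Rebuild what B.sh is expected to have printed.
    expected=$(printf '%s\n' "The previous job was executed on the following machine:"; cat "$dir/output.txt")

    # Succeed only if B.out matches the expected text.
    [ "$(cat "$dir/B.out")" = "$expected" ]
}
```

Running check_dag_results in DAG_simple/ returns success when the output of node B matches what node A produced. If it fails, simple.dag.dagman.out is a good first place to look.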

For now, we recommend that you continue exploring DAGMan by going through our Intermediate DAGMan Tutorial. There is also our guide Overview: Submit Workflows with HTCondor’s DAGMan, which contains links to more resources in the More Resources section.

Finally, the definitive guide to DAGMan and DAG workflows is HTCondor’s DAGMan Documentation.