Stage files specific to the job into correctly-named directories
within ChtcRun.
Make a directory
within ChtcRun
to house the project's input files and executables.
For purposes of this example, assume that the directory has been named
proj1data.
Do not name this directory
data1:
an example MATLAB job is distributed with the scripts,
and its directory is already called
data1.
Within proj1data, create one directory
for each job to be run.
Within each of these run-specific directories,
place the files used for that specific job,
such as input files and parameter files.
If the work does not require any per-job input files,
these directories remain empty.
These directories can be given almost any name,
but the names may not contain whitespace characters, notably spaces.
In addition, the name shared is reserved:
it is used for the directory of files shared among job submissions.
Each of these directories will create an individual job within
the single DAG.
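The naming rules above can be checked before staging any data. The following is a minimal sketch, not part of the ChtcRun scripts: the function name check_run_dir_names is hypothetical, and it tests only the two rules stated here (the project directory is not called data1, and no run-specific directory name contains whitespace).

```shell
# Hypothetical helper: verify the naming rules for a project data directory.
# Usage: check_run_dir_names proj1data
check_run_dir_names() {
    data_dir=$1
    status=0

    # The name data1 is already taken by the distributed MATLAB example.
    if [ "$(basename "$data_dir")" = "data1" ]; then
        echo "error: the project directory may not be named data1" >&2
        status=1
    fi

    # Directory names may not contain whitespace characters.
    for d in "$data_dir"/*/; do
        [ -d "$d" ] || continue
        name=$(basename "$d")
        case "$name" in
            *[[:space:]]*)
                echo "error: directory name contains whitespace: '$name'" >&2
                status=1
                ;;
        esac
    done

    return "$status"
}
```

The function returns a non-zero status if either rule is violated, so it can be used as a guard before running anything else.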
When the job starts,
its current working directory on the machine where the job executes
will contain copies of all files placed in the run-specific directory
on the submit machine,
as well as copies of all files placed in the
shared directory on the submit machine.
Input files, shared libraries, and all executables shared by
all jobs go into a directory with the name
shared.
Create the directory shared within
proj1data.
Place the executable and any input files common to the
set of jobs that will be submitted into directory shared.
The shared directory will not become an individual job.
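To preview the set of files a job will find in its working directory, one option is to merge the shared directory and one run-specific directory into a scratch directory on the submit machine. This is only a local approximation for testing, not a description of how HTCondor transfers files. The function name preview_sandbox is hypothetical, and the choice to let run-specific files overwrite shared files of the same name is an assumption, since the behavior in that case is not specified here.

```shell
# Hypothetical helper: build a local approximation of a job's working directory.
# Usage: preview_sandbox proj1data wi
# Prints the path of the scratch directory it created.
preview_sandbox() {
    data_dir=$1
    run_dir=$2

    sandbox=$(mktemp -d) || return 1

    # Copy the shared files first, then the per-job files.
    # Assumption: on a name clash, the per-job copy wins.
    cp -R "$data_dir/shared/." "$sandbox/" || return 1
    cp -R "$data_dir/$run_dir/." "$sandbox/" || return 1

    echo "$sandbox"
}
```

Running the executable by hand inside the printed directory is a quick way to catch a missing input file before submitting the whole DAG.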
For example, you might want to run several jobs on data you have for
several midwestern states, where there is one job per state.
Each state has its own input data file,
called state.dat.
There is also a shared input data file of US-wide averages for comparison,
called us.dat,
as well as a single executable (program).
For each job submission, assume this program
compare-states
will compare contents of us.dat to
state.dat.
Given this, a portion of the directory hierarchy might look like this:
proj1data/
    shared/
        us.dat
        compare-states
    wi/
        state.dat
    il/
        state.dat
    mn/
        state.dat
    ia/
        state.dat
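The directory skeleton for this example could be created as follows. This is a minimal sketch, assuming the commands are run with ChtcRun as the current working directory; the source paths in the commented-out cp lines are placeholders, since where the data files and executable originally live is not specified.

```shell
# Create the shared directory and one directory per state (one per job).
mkdir -p proj1data/shared
for state in wi il mn ia; do
    mkdir -p "proj1data/$state"
done

# Placeholder copy commands; replace /path/to with the real locations.
# cp /path/to/us.dat /path/to/compare-states proj1data/shared/
# cp /path/to/wi/state.dat proj1data/wi/

# List the resulting directories for a quick visual check.
find proj1data -type d | sort
```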
With a current working directory of ChtcRun,
run mkdag
with a job-specific set of command line arguments.
The output from running mkdag
will be a directory to hold eventual output from running
the project's job(s),
as well as control files:
a DAG input file describing the DAG
(called mydag.dag)
and an HTCondor submit description file for each of the node jobs within
the DAG.
One of the command line arguments to
mkdag
will specify the name of this directory.
The command line arguments to mkdag are:
--cmdtorun=NameOfJobExecutable
This required argument identifies which file within the shared
directory should be executed.
It should be only the base name of the file
(for example, compare-states);
do not include directory information.
--data=DirectoryName
This required argument identifies the relative path to and
name of the directory that contains the
shared directory.
The path is specified as relative to the directory containing the
mkdag script.
The string used instead of
DirectoryName
for this presented example would be
proj1data.
--dagdir=DirectoryName
This required argument identifies where the output from your
jobs will be placed.
It gives the relative path to, and the name of,
the directory within
ChtcRun
that will be created to hold the directories and files
produced by the mkdag script.
Do not create this directory before running
mkdag.
The directory will be created by mkdag,
and it will contain one subdirectory for each of the jobs within the DAG.
--pattern=SubString
This required argument helps identify if a job ran successfully.
For a variety of reasons,
we cannot necessarily trust the return code from MATLAB or R
to tell us if the job was successful.
This argument defines a substring of the name of a file that
each HTCondor job within the DAG creates as output.
A job is considered successful if at least one
file whose name contains SubString
was created by that job.
This check is equivalent to running
ls *SubString*
and confirming that at least one file is listed.
--parg=ArgumentString
This optional argument identifies a command line argument that
is to be passed to each invocation (as an HTCondor job) of the executable.
Give this argument multiple times to define more than one
command line argument.
--type
The --type argument is required,
and must be set to one of three values.
--type=Matlab
For MATLAB jobs.
It ensures that necessary MATLAB supporting libraries are made
available to the job.
--type=R
For R jobs.
It ensures that the R runtime environment is made available to the job.
--type=Other
For jobs that are neither MATLAB nor R.
Jobs of type Other do not have any libraries or runtime
environment automatically provided.
--noosg
An optional argument that identifies the job as one which is only to run
on CHTC resources.
MATLAB and R jobs run on both OSG and CHTC resources by default.
--osg
This optional argument indicates that a job with
--type=Other should run on both OSG and CHTC resources.
By default, such jobs use only CHTC resources.
--version=RVersionNumber
This argument is required for R jobs.
It specifies the version of R needed.
Possible values are the same as they were for building R:
- sl5-R-2.10.1 (version R-2.10.1 on Scientific Linux 5)
- sl5-R-2.13.1 (version R-2.13.1 on Scientific Linux 5)
- sl5-R-2.14.0 (version R-2.14.0 on Scientific Linux 5)
--memory=MbytesNeeded
An optional argument that identifies how much memory each job
within the DAG should be allocated in order to usefully run
(its resident set size).
Specify the value in integer units of Mbytes.
When this optional argument is not given,
a default of 1 Gbyte of memory is requested for each job.
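The success check described for --pattern can be reproduced by hand on a job's output directory. The following is a minimal sketch of an equivalent test, not the code the ChtcRun scripts actually use; the function name job_succeeded is hypothetical.

```shell
# Hypothetical helper: succeed if at least one file in the given directory
# has a name containing the given substring (like: ls *SubString*).
# Usage: job_succeeded path/to/job/output/dir output.dat
job_succeeded() {
    job_dir=$1
    substring=$2

    for f in "$job_dir"/*"$substring"*; do
        # If nothing matched, the pattern stays unexpanded and -e fails.
        [ -e "$f" ] && return 0
    done
    return 1
}
```

Because the substring is quoted inside the pattern, it is matched literally; only the surrounding asterisks act as wildcards.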
For example, assume that the compare-states program
is a compiled MATLAB program that takes two arguments:
the files to compare.
It produces a file called output.dat.
The command line for mkdag:
./mkdag --cmdtorun=compare-states --data=proj1data \
--dagdir=proj1output --pattern=output.dat --parg=us.dat \
--parg=state.dat --type=Matlab
(The backslashes above are used to break a long command across three lines. You can omit the backslashes and enter your command as a single line.)
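Several of the requirements on the mkdag arguments can be verified before the script is run. The following is a minimal sketch, assuming ChtcRun is the current working directory; the function name preflight_mkdag is hypothetical and is not part of the distributed scripts. It checks only what is stated above: the --cmdtorun value is a base name, that file exists in the shared directory under the --data directory, and the --dagdir directory does not yet exist.

```shell
# Hypothetical helper: sanity-check values before passing them to mkdag.
# Usage: preflight_mkdag compare-states proj1data proj1output
preflight_mkdag() {
    cmd=$1      # value intended for --cmdtorun
    data=$2     # value intended for --data
    dagdir=$3   # value intended for --dagdir

    # --cmdtorun takes only a base name, with no directory information.
    case "$cmd" in
        */*)
            echo "error: cmdtorun must be a base name: $cmd" >&2
            return 1
            ;;
    esac

    # The executable must be present in the shared directory.
    if [ ! -f "$data/shared/$cmd" ]; then
        echo "error: executable not found: $data/shared/$cmd" >&2
        return 1
    fi

    # mkdag creates the dagdir itself; it should not exist beforehand.
    if [ -e "$dagdir" ]; then
        echo "error: dagdir already exists: $dagdir" >&2
        return 1
    fi

    return 0
}
```

For the example above, the check could guard the real command: run preflight_mkdag compare-states proj1data proj1output first, and invoke ./mkdag only if it returns success.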