Small Input and Output File Availability Via HTCondor
Which Option is the Best for Your Files?
Input Sizes | Output Sizes | Link to Guide | File Location | How to Transfer | Availability, Security |
---|---|---|---|---|---|
0 - 100 MB per file, up to 500 MB per job | 0 - 5 GB per job | Small Input/Output File Transfer via HTCondor | /home | submit file; filename in transfer_input_files | CHTC, UW Grid, and OSG; works for your jobs |
100 MB - 1 GB per repeatedly-used file | Not available for output | Large Input File Availability Via Squid | /squid | submit file; http link in transfer_input_files | CHTC, UW Grid, and OSG; files are made *publicly-readable* via an HTTP address |
100 MB - TBs per job-specific file; repeatedly-used files > 1GB | 4 GB - TBs per job | Large Input and Output File Availability Via Staging | /staging | job executable; copy or move within the job | a portion of CHTC; accessible only to your jobs |
HTCondor File Transfer
Due to the distributed configuration of the CHTC HTC pool, more often than not, your jobs will need to bring along a copy (i.e. transfer a copy) of data, code, packages, software, etc. from the submit server where the job is submitted to the execute node where the job will run. This requirement applies to any and all files that are needed to successfully execute and complete your job.
Any output generated by your jobs is written to the execute node on which the job ran. To get access to your output files, a copy of the output must be transferred back to a user-accessible location such as the submit server.
The mechanism that you use for file transfers will depend on the size of the individual input and output files of your jobs. This guide specifically describes input and output file transfer for input files <100MB in size (and <500MB of total input file transfer) and output files <4GB in size using the standard solution built into HTCondor job scheduling. More information about file transfer on a system without a shared filesystem is available in the HTCondor manual.
Applicability
- Intended use: Good for delivering any type of data to jobs, but with file-size limitations (see below). Remember that you can/should split up a large input file into many smaller files for cases where each job only needs a portion of the data. By default, the submit file executable, output, error, and log files are ALWAYS transferred.
- Advantages: HTCondor file transfer is robust and is available on ANY of CHTC's accessible HTC resources, including the UW Grid of campus pools and the OS Pool.
- Data Security: Files transferred with HTCondor transfer are owned by the job and protected by user permissions in the CHTC pool. When signaling your jobs to run on the UW Grid (Flocking) or the OS Pool (Glidein), your files will exist on someone else's computer only for the duration of each job. Please feel free to email us if you have data security concerns regarding HTCondor file transfer, as encryption options are available.
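The advice above about splitting a large input file into smaller per-job files can be sketched in bash. This is only a minimal illustration, assuming a line-oriented text file; the function name split_input and the chunk_ prefix used in the note below are hypothetical and not part of any CHTC tooling.

```bash
#!/bin/bash
# Hypothetical helper: split a line-oriented input file into smaller chunks
# so that each job can be given only the portion of the data it needs.
# Usage: split_input INPUT_FILE LINES_PER_CHUNK OUTPUT_PREFIX
split_input() {
    local input_file="$1"
    local lines_per_chunk="$2"
    local prefix="$3"
    # split names the chunks PREFIXaa, PREFIXab, PREFIXac, ...
    split -l "$lines_per_chunk" "$input_file" "$prefix"
}
```

For example, calling split_input all_samples.txt 1000 chunk_ (with a made-up file name) would write chunk_aa, chunk_ab, and so on, and each job could then list a single chunk in transfer_input_files.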
Transferring Input Files
To have HTCondor transfer small (<100MB) input files needed by your job, include the following attributes in your CHTC HTCondor submit files:
# my job submit file
should_transfer_files = YES
when_to_transfer_output = ON_EXIT
transfer_input_files = file1, ../file2, /home/username/file3, dir1, dir2/
... other submit file details ...
By default, the submit file executable, output, and error files are ALWAYS transferred.
Important Considerations
- DO NOT use transfer_input_files for files within /staging; for files in /squid, only http links (e.g. http://proxy.chtc.wisc.edu/SQUID/username/file) should be used instead of direct file paths. These policies are in place to prevent severe performance issues for your jobs and those of other users. Jobs should never be submitted from within /squid or /staging.
- HTCondor's file transfer can cause issues for submit server performance when too many jobs are transferring too much data at the same time. Therefore, HTCondor file transfer is only good for input files up to ~20 MB per file IF the number of concurrently-queued jobs will be 100 or greater. Even when individual files are small, there are issues when the total amount of input data per job approaches 500 MB. For cases beyond these limitations, one of our other CHTC file delivery methods should be used. Remember that creating a tar.gz file of directories and files can give your input and output data a useful amount of compression.
- Files and directories to be transferred should be given as a comma-separated list, each with a path relative to the submit directory or with an absolute path, as shown above for file3. The submit file executable is automatically transferred and does not need to be listed in transfer_input_files.
- All files that are transferred to a job will appear within the top of the working directory of the job, regardless of how they are arranged within directories on the submit server.
- A whole directory and its contents will be transferred when listed without the trailing forward slash ("/") after the directory name. When a directory is listed with the trailing forward slash ("/") after the directory name, only the directory contents will be transferred. Care should be taken when transferring whole directories so that only the files needed by your jobs will be transferred. Generally, we recommend creating a tar.gz file of directories and files to be used as job inputs. This will help streamline the process of input file transfer and help speed up transfer times by reducing the overall size of files that will be transferred.
- Jobs will be placed on hold by HTCondor if any of the listed files or directories do not exist, for example because of a typo in a name or path.
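The recommendation to bundle inputs into a tar.gz file, together with the size guidance above, can be sketched as a small bash helper run on the submit server before submitting. This is an illustrative sketch, not a CHTC-provided tool: the function name pack_inputs is hypothetical, and treating 1 MB as 1024 * 1024 bytes is an assumption.

```bash
#!/bin/bash
# Hypothetical helper: bundle input files/directories into one tar.gz and
# warn if the archive is larger than a chosen limit (in MB).
# Usage: pack_inputs ARCHIVE LIMIT_MB PATH [PATH...]
pack_inputs() {
    local archive="$1"
    local limit_mb="$2"
    shift 2
    # -c create, -z gzip-compress, -f write to the named archive file
    tar -czf "$archive" "$@" || return 1
    local size_bytes
    size_bytes=$(( $(wc -c < "$archive") ))
    # Assumption: 1 MB is counted as 1024 * 1024 bytes
    if [ "$size_bytes" -gt $(( limit_mb * 1024 * 1024 )) ]; then
        echo "WARNING: $archive is $size_bytes bytes, over ${limit_mb} MB" >&2
        return 2
    fi
    echo "$archive: $size_bytes bytes"
}
```

A call such as pack_inputs inputs.tar.gz 100 dir1 file1 would produce a single archive to list in transfer_input_files; the job's executable script would then unpack it with tar -xzf inputs.tar.gz before using the files.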
Transferring Output Files
All of your HTCondor submit files should have the following attributes:
# my job submit file
should_transfer_files = YES
when_to_transfer_output = ON_EXIT
when_to_transfer_output = ON_EXIT will instruct HTCondor to automatically transfer ALL new or modified files in the top-level directory of the job (where it ran on the execute server) back to the job’s initial directory on the submit server. Please note: this behavior only applies to files in the job’s top-level working directory, meaning HTCondor will ignore any files created in subdirectories of the job’s main working directory. Several options exist for modifying this default output file transfer behavior - see below for some examples.
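To preview which files fall under the rule described above, a job script could list the top-level files that are newer than a marker file created when the job starts. This sketch only mimics the rule as described in this guide and is not how HTCondor itself decides what to transfer; the function name list_transfer_candidates and the marker file are hypothetical.

```bash
#!/bin/bash
# Hypothetical helper: list regular files directly inside JOB_DIR that were
# created or modified after MARKER_FILE. Files in subdirectories are not
# listed, mirroring the top-level-only rule described above.
# Usage: list_transfer_candidates JOB_DIR MARKER_FILE
list_transfer_candidates() {
    local job_dir="$1"
    local marker="$2"
    # -maxdepth 1 keeps find from descending into subdirectories;
    # -newer selects files modified more recently than the marker
    find "$job_dir" -maxdepth 1 -type f -newer "$marker" | sort
}
```

In a job script, the marker could be created with touch .job_start as the first command, and list_transfer_candidates . .job_start called just before the script exits.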
Only individual output files <4GB should be transferred back to your home directory using HTCondor’s default behavior described here. Large output files >4GB should instead use CHTC’s large data filesystem, called staging; more information is available at Managing Large Data in HTC Jobs. To help reduce output file sizes and speed up file transfer times, we recommend creating a tar.gz file of all desired output before job completion (and also deleting the “un-tar'd” files so they are not also transferred back); see our example below.
Group Multiple Output Files For Convenience
If your jobs will generate multiple output files, we recommend combining all output into a compressed tar archive for convenience, particularly when transferring your results to your local computer from the submit server. To create a compressed tar archive, include commands in your bash executable script to create a new subdirectory, move all of the output to this new subdirectory, and create a tar archive. For example:
#!/bin/bash
# various commands needed to run your job
# create output tar archive
mkdir my_output
mv my_job_output.csv my_job_output.svg my_output/
tar -czf my_job.output.tar.gz my_output/
The example above will create a file called my_job.output.tar.gz that contains all the output that was moved to my_output. Be sure to create my_job.output.tar.gz in the top-level directory of where your job executes and HTCondor will automatically transfer this tar archive back to your /home directory.
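A slightly more defensive variant of the archiving step removes the original output directory only after tar has succeeded, so a failed archive does not cost you the results. This is a sketch under that assumption; the function name archive_and_clean is hypothetical.

```bash
#!/bin/bash
# Hypothetical helper: archive an output directory, then delete the
# directory only if the archive was written successfully.
# Usage: archive_and_clean OUTPUT_DIR ARCHIVE_NAME
archive_and_clean() {
    local out_dir="$1"
    local archive="$2"
    # rm runs only when tar exits with status 0
    tar -czf "$archive" "$out_dir" && rm -rf -- "$out_dir"
}
```

With the names from the example above, the last line of the job script would become archive_and_clean my_output my_job.output.tar.gz.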
Select Specific Output Files to Transfer to /home
As described above, HTCondor will transfer ALL new or modified files in the top-level directory of the job (where it ran on the execute server) back to the job’s initial directory on the submit server. If your jobs will produce multiple output files but you only need to retain a subset of them, we recommend deleting the unneeded output files or moving them to a subdirectory as a step in the bash executable script of your job. Only the output files that remain in the top-level directory will be transferred back to your /home directory. This will help keep ample space free in your /home directory on the submit server and help prevent you from exceeding the disk quota.
For jobs that use large input files from /staging, you must include steps in your bash script to either remove these files or move them to a subdirectory before the job terminates. Otherwise, these large files will be transferred back to your /home directory. For more details, please see Managing Large Data in HTC Jobs.
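The cleanup steps described in this section could be sketched as two small bash functions called at the end of the job's executable script. All names here are hypothetical, including the not_transferred subdirectory; the sketch relies only on the rule stated above that files in subdirectories are not transferred back.

```bash
#!/bin/bash
# Hypothetical helper: delete local copies of large input files (for
# example, files copied from /staging) so they are not transferred back.
# Usage: remove_staging_copies FILE [FILE...]
remove_staging_copies() {
    rm -f -- "$@"
}

# Hypothetical helper: move unwanted output files into a subdirectory,
# which keeps them out of the job's top-level directory.
# Usage: discard_outputs FILE [FILE...]
discard_outputs() {
    local discard_dir="not_transferred"  # hypothetical directory name
    local f
    mkdir -p "$discard_dir"
    for f in "$@"; do
        if [ -e "$f" ]; then
            mv -- "$f" "$discard_dir/"
        fi
    done
}
```

For instance, a script might end with remove_staging_copies large_input.dat followed by discard_outputs scratch.tmp, leaving only the files you want in the top-level directory.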
In cases where a bash script is not used as the executable of your job and you wish to have only specific output files transferred back, please contact us.
Get Additional Options For Managing Job Output
Several options exist for managing output file transfers back to your /home directory and we encourage you to get in touch with us at chtc@cs.wisc.edu to help identify the best solution for your needs.
Request a Quota Change
If you find that you are in need of more space in your /home directory to handle the number of jobs that you want to run, please see our Request a Quota Change guide.