Small Input and Output File Availability Via HTCondor

Which Option is the Best for Your Files?


Input Sizes | Output Sizes | Link to Guide | File Location | How to Transfer | Availability, Security
--- | --- | --- | --- | --- | ---
0 - 100 MB per file, up to 500 MB per job | 0 - 5 GB per job | Small Input/Output File Transfer via HTCondor | /home | submit file; filename in transfer_input_files | CHTC, UW Grid, and OSG; works for your jobs
100 MB - 1 GB per repeatedly-used file | Not available for output | Large Input File Availability Via Squid | /squid | submit file; http link in transfer_input_files | CHTC, UW Grid, and OSG; files are made *publicly-readable* via an HTTP address
100 MB - TBs per job-specific file; repeatedly-used files > 1GB | 4 GB - TBs per job | Large Input and Output File Availability Via Staging | /staging | job executable; copy or move within the job | a portion of CHTC; accessible only to your jobs

HTCondor File Transfer

Due to the distributed configuration of the CHTC HTC pool, more often than not, your jobs will need to bring along a copy (i.e. transfer a copy) of data, code, packages, software, etc. from the submit server where the job is submitted to the execute node where the job will run. This requirement applies to any and all files that are needed to successfully execute and complete your job.

Any output generated by your jobs is written to the execute node on which the job ran. To get access to your output files, a copy of the output must be transferred back to a user-accessible location such as the submit server.

The mechanism that you use for file transfers will depend on the size of the individual input and output files of your jobs. This guide specifically describes input and output file transfer for input files <100MB in size (and <500MB of total input file transfer) and output files <4GB in size using the standard solution built into HTCondor job scheduling. More information about file transfer on a system without a shared filesystem is available in the HTCondor manual.

Applicability

  • Intended use:
    Good for delivering any type of data to jobs, but with file-size limitations (see below). Remember that you can/should split up a large input file into many smaller files for cases where each job only needs a portion of the data. By default, the submit file executable, output, error, and log files are ALWAYS transferred.

  • Advantages:
    HTCondor file transfer is robust and is available on ANY of CHTC's accessible HTC resources including the UW Grid of campus pools, and the OS Pool.

  • Data Security:
    Files transferred with HTCondor transfer are owned by the job and protected by user permissions in the CHTC pool. When signaling your jobs to run on the UW Grid (Flocking) or the OS Pool (Glidein), your files will exist on someone else's computer only for the duration of each job. Please feel free to email us if you have data security concerns regarding HTCondor file transfer, as encryption options are available.
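
The advice above about splitting a large input file can be sketched as follows. This is a minimal illustration, assuming a line-oriented text file; the function name split_input and the file name all_samples.csv are hypothetical, and the right chunk size depends on your data.

```shell
#!/bin/bash
# Sketch: split one large, line-oriented input file into smaller pieces
# so that each job only transfers the piece it needs.
# split_input and the file names below are hypothetical examples.

split_input() {
    local input_file="$1"       # large input file, e.g. all_samples.csv
    local lines_per_chunk="$2"  # number of lines each piece should hold
    local prefix="$3"           # prefix for the pieces, e.g. chunk_

    # -l splits by line count; pieces are named <prefix>aa, <prefix>ab, ...
    split -l "$lines_per_chunk" "$input_file" "$prefix"
}

# Example usage, run on the submit server before submitting jobs:
#   split_input all_samples.csv 1000 chunk_
```

Each resulting piece can then be listed as the input file of a separate job, which keeps the per-job transfer small.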

Transferring Input Files

To have HTCondor transfer small (<100MB) input files needed by your job, include the following attributes in your CHTC HTCondor submit files:

# my job submit file

should_transfer_files = YES
when_to_transfer_output = ON_EXIT
transfer_input_files = file1, ../file2, /home/username/file3, dir1, dir2/

# ... other submit file details ...

By default, the submit file executable, output, and error files are ALWAYS transferred.

Important Considerations

  • DO NOT use transfer_input_files for files within /staging; for files in /squid only http links (e.g. http://proxy.chtc.wisc.edu/SQUID/username/file) should be used instead of direct file paths. These policies are in place to prevent severe performance issues for your jobs and those of other users. Jobs should never be submitted from within /squid or /staging.

  • HTCondor's file transfer can cause issues for submit server performance when too many jobs are transferring too much data at the same time. Therefore, HTCondor file transfer is only good for input files up to ~20 MB per file IF the number of concurrently-queued jobs will be 100 or greater. Even when individual files are small, there are issues when the total amount of input data per-job approaches 500 MB. For cases beyond these limitations, one of our other CHTC file delivery methods should be used. Remember that creating a tar.gz file of directories and files can give your input and output data a useful amount of compression.

  • Comma-separated files and directories to-be-transferred should be listed with a path relative to the submit directory, or can be listed with the absolute path(s), as shown above for file3. The submit file executable is automatically transferred and does not need to be listed in transfer_input_files.

  • All files that are transferred to a job will appear within the top of the working directory of the job, regardless of how they are arranged within directories on the submit server.

  • A whole directory and its contents will be transferred when listed without the trailing forward slash ("/") after the directory name. When a directory is listed with the trailing forward slash ("/") after the directory name, only the directory contents will be transferred. Care should be taken when transferring whole directories so that only the files needed by your jobs will be transferred. Generally, we recommend creating a tar.gz file of directories and files to be used as job inputs - this will help streamline the process of input file transfer and help speed up transfer times by reducing the overall size of files that will be transferred.

  • Jobs will be placed on hold by HTCondor if any of the listed files or directories do not exist, for example because of a typo in a file name or path.

  • Learn more about HTCondor input files transfer.
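
The tar.gz recommendation and the per-file size limit above can be combined into a quick check on the submit server before submitting. The following is a sketch only: the function names are hypothetical, and treating 100 MB as 100 * 1024 * 1024 bytes is an assumption rather than a CHTC definition.

```shell
#!/bin/bash
# Sketch: bundle an input directory into a tar.gz file and compare its size
# with the ~100 MB per-file guideline for HTCondor file transfer.
# Function names and the byte value of the limit are assumptions.

MAX_FILE_BYTES=$((100 * 1024 * 1024))

# Create a compressed archive of a directory (relative paths inside the archive).
bundle_inputs() {
    local input_dir="$1"   # directory to bundle, e.g. dir1
    local archive="$2"     # archive to create, e.g. dir1.tar.gz
    tar -czf "$archive" -C "$(dirname "$input_dir")" "$(basename "$input_dir")"
}

# Report whether a file is within the limit; returns non-zero if it is too large.
check_input_size() {
    local file="$1"
    local limit="${2:-$MAX_FILE_BYTES}"   # optional override of the limit, in bytes
    local size
    size=$(wc -c < "$file" | tr -d ' ')
    if [ "$size" -gt "$limit" ]; then
        echo "TOO LARGE for HTCondor file transfer: $file ($size bytes)"
        return 1
    fi
    echo "OK: $file ($size bytes)"
}

# Example usage:
#   bundle_inputs dir1 dir1.tar.gz
#   check_input_size dir1.tar.gz
```

If the check fails, one of the other file delivery methods from the table at the top of this guide is likely a better fit. Remember that the job's executable script then needs to unpack the archive (for example with tar -xzf dir1.tar.gz) before using the files.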

Transferring Output Files

All of your HTCondor submit files should have the following attributes:

# my job submit file

should_transfer_files = YES
when_to_transfer_output = ON_EXIT

when_to_transfer_output = ON_EXIT will instruct HTCondor to automatically transfer ALL new or modified files in the top level directory of the job (where it ran on the execute server), back to the job’s initial directory on the submit server. Please note: this behavior only applies to files in the job’s top-level working directory, meaning HTCondor will ignore any files created in subdirectories of the job’s main working directory. Several options exist for modifying this default output file transfer behavior - see below for some examples.

Only individual output files <4GB should be transferred back to your home directory using HTCondor’s default behavior described here. Large output files >4GB should instead use CHTC’s large data filesystem, called staging; more information is available at Managing Large Data in HTC Jobs. To help reduce output file sizes and speed up file transfer times, we recommend creating a tar.gz file of all desired output before job completion (and also deleting the “un-tar'd” files so they are not transferred back as well); see our example below.

Group Multiple Output Files For Convenience

If your jobs will generate multiple output files, we recommend combining all output into a compressed tar archive for convenience, particularly when transferring your results to your local computer from the submit server. To create a compressed tar archive, include commands in your bash executable script to create a new subdirectory, move all of the output to this new subdirectory, and create a tar archive. For example:

#! /bin/bash

# various commands needed to run your job

# create output tar archive
mkdir my_output
mv my_job_output.csv my_job_output.svg my_output/
tar -czf my_job.output.tar.gz my_output/

The example above will create a file called my_job.output.tar.gz that contains all the output that was moved to my_output. Be sure to create my_job.output.tar.gz in the top-level directory of where your job executes and HTCondor will automatically transfer this tar archive back to your /home directory.
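
Once the archive is back on the submit server (or on your local computer), it can be inspected and unpacked. The sketch below assumes the archive name used in the example above; the function name unpack_output and the destination directory are hypothetical.

```shell
#!/bin/bash
# Sketch: list and then extract a job's output archive.
# unpack_output and the "results" directory are hypothetical names.

unpack_output() {
    local archive="$1"     # e.g. my_job.output.tar.gz
    local dest="${2:-.}"   # where to extract; defaults to the current directory

    mkdir -p "$dest"
    tar -tzf "$archive"              # list the contents without extracting
    tar -xzf "$archive" -C "$dest"   # extract into the destination directory
}

# Example usage:
#   unpack_output my_job.output.tar.gz results
```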

Select Specific Output Files to Transfer to /home

As described above, HTCondor will transfer ALL new or modified files in the top level directory of the job (where it ran on the execute server), back to the job’s initial directory on the submit server. If your jobs will produce multiple output files but you only need to retain a subset of these output files, we recommend deleting the unrequired output files or moving them to a subdirectory as a step in the bash executable script of your job - only the output files that remain in the top-level directory will be transferred back to your /home directory. This will help keep ample space free and available on your /home directory on the submit server and help prevent you from exceeding the disk quota.
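
A minimal sketch of the "move them to a subdirectory" approach is shown below. It relies on the default behavior described above, where only files left in the top-level directory are transferred back. The function name and all file names are hypothetical.

```shell
#!/bin/bash
# Sketch: move output files that should NOT be transferred back into a
# subdirectory at the end of the job's executable script.
# exclude_from_transfer and the file names are hypothetical examples.

exclude_from_transfer() {
    local holding_dir="$1"   # subdirectory for files that should stay behind
    shift                    # the remaining arguments are the files to move

    mkdir -p "$holding_dir"
    local f
    for f in "$@"; do
        # Skip names that do not exist so a missing file does not cause an error.
        if [ -e "$f" ]; then
            mv -- "$f" "$holding_dir"/
        fi
    done
}

# Example usage as the last step of the job script:
#   exclude_from_transfer not_transferred scratch.tmp debug.log
```

Deleting the files with rm instead works equally well when they are not needed for anything else during the job.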

For jobs that use large input files from /staging, you must include steps in your bash script to either remove these files or move them to a subdirectory before the job terminates. Otherwise, these large files will be transferred back to your /home directory. For more details, please see Managing Large Data in HTC Jobs.
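
One way to structure this inside a job script is sketched below: copy the file in, run the command that needs it, and remove the copy afterwards. The function name, the /staging path, and the command in the usage line are hypothetical; see Managing Large Data in HTC Jobs for the supported workflow.

```shell
#!/bin/bash
# Sketch: copy a large input file from /staging into the job's working
# directory, run a command that uses it, then remove the local copy so it
# is not transferred back to /home.
# run_with_staged_input and the paths shown are hypothetical.

run_with_staged_input() {
    local staged_file="$1"   # e.g. /staging/username/large_input.dat
    shift                    # the remaining arguments are the command to run
    local local_copy
    local_copy="$(basename "$staged_file")"

    cp "$staged_file" "./$local_copy" || return 1

    "$@"                     # run the command that needs the input file
    local status=$?

    rm -f "./$local_copy"    # remove the copy before the job terminates
    return "$status"
}

# Example usage inside the job's executable script:
#   run_with_staged_input /staging/username/large_input.dat ./my_program large_input.dat
```

The copy is removed even when the command fails, and the command's exit status is passed through so that a failure remains visible.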

In cases where a bash script is not used as the executable of your job and you wish to have only specific output files transferred back, please contact us.

Get Additional Options For Managing Job Output

Several options exist for managing output file transfers back to your /home directory and we encourage you to get in touch with us at chtc@cs.wisc.edu to help identify the best solution for your needs.

Request a Quota Change

If you find that you are in need of more space in your /home directory to handle the number of jobs that you want to run, please see our Request a Quota Change guide.