| ||Meant For||Input/Software File Sizes||Output File Sizes||Available for Jobs Running ...||File Security||Special Considerations|
|HTCondor file transfer||basic file delivery and return; see size limits at right||0 - 100 MB per file; <500 MB total per job||0 - 4 GB total||in CHTC, UW Grid, and OSG||available to your jobs, on CHTC and beyond||DO NOT USE for files in /mnt/gluster OR /squid|
|SQUID Web Proxy||large input or software shared by many jobs||100 MB - 1 GB per shared file||N/A||in CHTC, UW Grid, and OSG||files will be world-readable||large files unique to individual jobs are better for Gluster|
|Gluster File Share||largest software, input, and output||100 MB - TBs per unique file; 1GB - TBs per shared file||4 GB - TBs||in only a portion of CHTC||accessible ONLY to your jobs on specific CHTC servers||special submit "Requirements"|
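As a rough illustration of how the input size ranges in the table might be applied to a single file, here is a minimal sketch. The function name and its arguments are hypothetical, 1 GB is treated as 1024 MB, and the per-job total for file transfer is not checked.

```shell
#!/bin/bash
# Sketch: suggest a delivery method for ONE input or software file, following
# the size ranges in the table above. The function name and arguments are
# hypothetical; the <500 MB per-job total for file transfer is NOT checked.
#   $1 = file size in MB (integer)
#   $2 = "shared" if many jobs use the same file, "unique" otherwise
suggest_input_method() {
    local size_mb=$1 kind=$2
    if [ "$size_mb" -le 100 ]; then
        echo "HTCondor file transfer"
    elif [ "$kind" = "shared" ] && [ "$size_mb" -le 1024 ]; then
        echo "SQUID Web Proxy"
    else
        echo "Gluster File Share"
    fi
}
```

For example, `suggest_input_method 500 unique` prints "Gluster File Share", because a 500 MB file that is unique to one job falls outside the file transfer and SQUID ranges.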
- Intended Use: Gluster is a staging area for input files, output files, and software that are too large for file transfer or SQUID. Files and software within Gluster are available only to jobs running in the CHTC Pool.
- Access to Gluster: Access is granted upon request to firstname.lastname@example.org. As with all CHTC file space, users should minimize the amount of data in their own directory within /mnt/gluster, and should remove files from /mnt/gluster regularly.
- Access outside of CHTC? Jobs relying on Gluster will only run in CHTC's HTCondor Pool, and data in Gluster is not accessible to HTCondor jobs running outside of this pool. To avoid depending on Gluster, it is always better to split large input files if each job needs only some of the input data.
- Not all files ... Gluster should ONLY be used for files and software that cannot use another file availability option. Files placed in /mnt/gluster should NOT be listed in the submit file (for example, as the "executable", "output", "error", or "log" files, or in "transfer_input_files").
- Capacity: Each user is allowed a certain amount of space in Gluster, though we can increase this space for special requests sent to email@example.com.
- Data Security: Files placed in Gluster are owned by the user, and only the user's own jobs can access these files (unless the user specifically modifies unix file permissions to make certain files available to other users).
This guide assumes that:
- you already have an account on a CHTC-administered submit server
- you already have a user directory in the HTC Gluster system, after requesting it by email to firstname.lastname@example.org
You should also be familiar with:
- using the command line to navigate directories, create/edit/copy/move/delete files and directories, and run intended programs (aka "executables")
- CHTC's Intro to Running HTCondor Jobs
- CHTC's guide for File Availability Options
1. Staging files within Gluster:
After obtaining a user directory within /mnt/gluster on a CHTC-administered submit server, copy relevant files directly into this user directory from your own computer.
If using a Linux or Mac computer, use the scp command:
$ scp large.file email@example.com:/mnt/gluster/username/
If using a Windows computer:
Using a file transfer application, like WinSCP, directly drag the large
file from its location on your computer to a location within
/mnt/gluster/username/ on the submit server.
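Because each user's space in Gluster is limited, it can help to check how much a directory holds after staging files. The following is a minimal sketch to run on the submit server; the function name is hypothetical.

```shell
#!/bin/bash
# Sketch: report how many MB a directory uses. The function name is
# hypothetical.
#   $1 = directory to measure (e.g. /mnt/gluster/username)
dir_usage_mb() {
    # du -s prints one summary line; -m reports the size in 1 MB blocks.
    # cut keeps only the number (the first tab-separated field).
    du -sm "$1" | cut -f1
}
```

A call such as `dir_usage_mb /mnt/gluster/username` would print a single number of megabytes.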
2. Submit File Implications
In order to properly submit jobs using Gluster, please make sure to do the following:
- ONLY submit Gluster-dependent jobs from within your home directory (/home/username), and NOT from within /mnt/gluster.
- Do NOT list any /mnt/gluster files in any of the submit file lines, including: executable, log, output, error, transfer_input_files. Rather, your job's ENTIRE interaction with files in /mnt/gluster needs to occur WITHIN your executable, when it runs.
- Request an adequate amount of disk space with "request_disk", which pertains only to the maximum amount of data within the job working directory on the execute node, and not to files that only ever exist within /mnt/gluster.
- Use a submit file "Requirements" line to ensure your jobs only run on execute servers that have access to Gluster.
See the example submit file below, which would be submitted from within the user's home directory:
    ### Example submit file for a single Gluster-dependent job
    # Files for the below lines MUST all be somewhere within /home/username,
    # and not within /mnt/gluster/username
    log = myprogram.log
    executable = /home/username/run_myprogram.sh
    output = $(Cluster).out
    error = $(Cluster).err
    transfer_input_files = myprogram
    # IMPORTANT! Require execute servers that have Gluster:
    Requirements = (Target.HasGluster == true)
    # Make sure to still include lines like "request_memory", "request_disk", "request_cpus", etc.
    queue
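One way to catch a violation of the "no /mnt/gluster paths in the submit file" rule before submitting is a quick text check. This is only a sketch: the function name is hypothetical, and it checks just the five submit file lines named in this guide.

```shell
#!/bin/bash
# Sketch: report submit file lines that must not mention /mnt/gluster.
# The function name is hypothetical; only the five submit commands named in
# this guide are checked, case-insensitively. Comment lines are ignored
# because they do not start with one of those command names.
#   $1 = path to the submit file
check_submit_file() {
    local submit_file=$1
    # grep prints any offending lines and exits 0 when it finds at least one.
    if grep -Ei '^[[:space:]]*(executable|log|output|error|transfer_input_files)[[:space:]]*=.*/mnt/gluster' "$submit_file"; then
        echo "ERROR: the lines above refer to /mnt/gluster" >&2
        return 1
    fi
    return 0
}
```

Running `check_submit_file myjob.sub && condor_submit myjob.sub` (with `myjob.sub` standing in for your own submit file) would then submit only when the check passes.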
3. Using Gluster-staged Files and Software for Jobs
As stated in #2, all interaction with files and software in /mnt/gluster should occur within your job's main executable, when it runs. Therefore, there are two options for jobs depending on Gluster-staged software (larger than a few GB) and input (see table above).
A. Option 1: Copy files from Gluster into the working directory (best, if possible)
The recommended method is to copy input or software into the working directory of the job, and use it from there, being careful to remove such files from the working directory before the completion of the job (so that they're not copied back to the submit server as perceived output). An example is below:
    #!/bin/bash
    #
    # First, copy the large file from /mnt/gluster into the working directory:
    cp /mnt/gluster/username/large_input.txt ./
    #
    # Command for myprogram, which will use the file from the working directory
    ./myprogram large_input.txt myoutput.txt
    #
    # Before the script exits, make sure to remove the large file from the working directory
    rm large_input.txt
    #
    # END
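A possible variation on the script above stops early if the copy fails, and removes the copied file even when the program itself fails, so that the large input is never returned as output. This is a sketch only; the function name and its arguments are hypothetical.

```shell
#!/bin/bash
# Sketch of Option 1 with cleanup that also runs if the program fails.
# The function name and its arguments are hypothetical.
#   $1 = directory holding the staged file (e.g. /mnt/gluster/username)
#   $2 = name of the large file
#   remaining arguments = the command to run in the working directory
run_with_staged_input() {
    local src_dir=$1 file_name=$2
    shift 2
    # Stop early if the copy fails (for example, the file is missing).
    cp "$src_dir/$file_name" ./ || return 1
    # Run the command and remember its exit code.
    "$@"
    local rc=$?
    # Always remove the copy so it is not transferred back as output.
    rm -f "./$file_name"
    return $rc
}
```

Inside a job executable this might be called as `run_with_staged_input /mnt/gluster/username large_input.txt ./myprogram large_input.txt myoutput.txt`, and the function returns the exit code of the program.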
B. Option 2: Refer to files and software DIRECTLY from their location in Gluster (LAST RESORT!)
If your software will ONLY work if it remains in the same location where it was first installed, and there are barriers to installing it within the working directory of every job (install-on-the-fly), you may instead have your jobs use these files or software from where they are located in /mnt/gluster. Note that this method performs poorly, and should be avoided if at all possible.
Essentially, your job executable should refer to Gluster-located files and software using the absolute path (e.g. /mnt/gluster/username/large_input.txt).
Example, if your job executable is a unix shell (bash) script:
    #!/bin/bash
    #
    # script to run myprogram,
    # which reads in a large file directly from Gluster
    ./myprogram /mnt/gluster/username/large_input.txt myoutput.txt
    #
    # END
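When a job reads directly from Gluster, it may be worth confirming that the path is readable before starting the program, so that a problem produces a clear error message. A minimal sketch, with a hypothetical function name:

```shell
#!/bin/bash
# Sketch: fail fast when a Gluster path cannot be read from the execute
# server. The function name is hypothetical.
#   $1 = absolute path to check
require_readable() {
    local path=$1
    if [ ! -r "$path" ]; then
        # HOSTNAME is set by bash; fall back to a generic label if it is empty.
        echo "ERROR: cannot read $path on ${HOSTNAME:-this server}" >&2
        return 1
    fi
}
```

In the script above, a line such as `require_readable /mnt/gluster/username/large_input.txt || exit 1` could be placed before the `./myprogram` command.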
4. Writing large output files to Gluster from within a job
As stated in #2, all interaction with files in /mnt/gluster should occur within your designated "executable", when it runs. The guidance below covers writing files to Gluster from steps within your executable, as well as a consideration for large standard output.
A. Write output files to the working directory, then move these to Gluster
It can be detrimental to the Gluster filesystem (and cause your jobs to run more slowly) to write files directly to Gluster from within a job. Instead, have your executable write the file to a location within the working directory, and then make sure to move this large file to Gluster (or copy to Gluster, and then remove from the working directory), so that it's not transferred back to the home directory, like all other "new" files in the working directory will be.
Example, if executable is a shell script:
    #!/bin/bash
    #
    # Command to save output to the working directory:
    ./myprogram myinput.txt large_output.txt
    #
    # Move large output to Gluster:
    mv large_output.txt /mnt/gluster/username/large_output.txt
    #
    # END
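The "copy to Gluster, and then remove from the working directory" alternative mentioned above can be sketched with a verification step in between, so that the original is deleted only once the copy is known to be complete. The function name is hypothetical.

```shell
#!/bin/bash
# Sketch: copy an output file to a destination directory, verify the copy,
# and only then delete the original. The function name is hypothetical.
#   $1 = output file in the working directory
#   $2 = destination directory (e.g. /mnt/gluster/username)
stage_out() {
    local file=$1 dest_dir=$2
    local target="$dest_dir/$(basename "$file")"
    # cmp -s exits 0 only when the two files are byte-for-byte identical.
    if cp "$file" "$target" && cmp -s "$file" "$target"; then
        rm -f "$file"
    else
        echo "ERROR: could not stage $file to $dest_dir" >&2
        return 1
    fi
}
```

Note that when `stage_out` fails, the large file is still in the working directory and would be transferred back to the submit server, so the calling script needs to decide whether to delete it before exiting.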
B. ALSO consider: Large standard output ("screen output") produced by your jobs
In some instances, your software may produce very large standard output (what would typically be output to the command screen, if you ran the command for yourself, instead of having HTCondor do it). Because such standard output from your software will usually be captured by HTCondor in the submit file "output" file, this "output" file WILL still be transferred by HTCondor back to your home directory on the submit server, which may be very bad for you and others, if that captured standard output is very large.
In these cases, it is useful to redirect the standard output of commands in your executable to a file in the working directory, and then move it into Gluster at the end of the job.
Example, if "myprogram" produces very large standard output, and is run from a script (bash) executable:
    #!/bin/bash
    #
    # script to run myprogram,
    # redirecting large standard output to a file in the working directory:
    ./myprogram myinput.txt myoutput.txt > large_std.out
    #
    # move large files to Gluster so they're not copied to the submit server:
    mv large_std.out /mnt/gluster/username/large_std.out
    # END
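If it is useful to still see the end of the program's standard output in the HTCondor "output" file, one possible approach is to redirect the full output to a file and echo only its last lines. This is a sketch under assumptions: the function name, the `SUMMARY_LINES` variable, and the 20-line default are all hypothetical.

```shell
#!/bin/bash
# Sketch: send a command's full standard output to a file in the working
# directory and print only its last lines, so the HTCondor "output" file
# stays small. The function name, the SUMMARY_LINES variable, and the
# 20-line default are hypothetical.
#   $1 = file to hold the full standard output
#   remaining arguments = the command to run
run_capturing_stdout() {
    local out_file=$1
    shift
    "$@" > "$out_file"
    local rc=$?
    # Only this short tail reaches the job's real standard output.
    tail -n "${SUMMARY_LINES:-20}" "$out_file"
    return $rc
}
```

The full file (for example `large_std.out`) would still need to be moved into Gluster before the script exits, as in the example above.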
5. Removing files from Gluster after jobs complete
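Similar to the procedures for transferring files into Gluster, you can directly copy files out of Gluster using the command line (scp) or file-transfer applications like WinSCP. After copying, remove the files from /mnt/gluster, since users are expected to clean out this location regularly.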
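To help find candidates for cleanup, files that have not been modified recently can be listed with `find`. A minimal sketch; the function name and the 30-day default are hypothetical, not a CHTC policy.

```shell
#!/bin/bash
# Sketch: list regular files under a directory that were last modified more
# than N days ago. The function name and the 30-day default are hypothetical.
#   $1 = directory to search (e.g. /mnt/gluster/username)
#   $2 = age threshold in days (optional)
list_old_files() {
    local dir=$1 days=${2:-30}
    find "$dir" -type f -mtime +"$days"
}
```

Review the list before deleting anything, and copy out any files you still need first.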
For all user support, questions, and comments: firstname.lastname@example.org