|Method||Meant For||Input/Software File Sizes||Output File Sizes||Available for Jobs Running ...||File Security||Special Considerations|
|HTCondor file transfer||basic file delivery and return; see size limits at right||0 - 100 MB per file; <500 MB total per job||0 - 4 GB total||in CHTC, UW Grid, and OSG||available to your jobs, on CHTC and beyond||DO NOT USE for files in /mnt/gluster OR /squid|
|SQUID Web Proxy||large input or software shared by many jobs||100 MB - 1 GB per shared file||N/A||in CHTC, UW Grid, and OSG||files will be world-readable||large files unique to individual jobs are better for Gluster|
|Gluster File Share||largest software, input, and output||100 MB - TBs per unique file; 1 GB - TBs per shared file||4 GB - TBs||in only a portion of CHTC||accessible ONLY to your jobs on specific CHTC servers||special submit "Requirements"|
- Policies and Intended Use
- Practices for Files within Gluster
- Submit File Implications
- Using Gluster-staged Files and Software for Jobs
- Copying large output files to Gluster from within a job
- Removing files from Gluster after jobs complete
- Checking your Quota, Data Use, and File Counts in Gluster
1. Policies and Intended Use:
- Intended Use:
- Gluster is a staging area for input files, output files, and software that are individually too large for file transfer or SQUID. Files and software within Gluster will be available only to jobs running in the CHTC Pool, and only in a portion of the pool.
- Access to Gluster:
- is granted upon request to email@example.com and consultation with a Research Computing Facilitator.
- Access outside of CHTC? Jobs relying on Gluster will only run in CHTC's HTCondor Pool, and data in Gluster is not accessible to HTCondor jobs running outside of this pool.
- Not all files ... Gluster should ONLY be used for individual data and software files that are larger than the values in the table, above. Files placed in /mnt/gluster should NEVER be listed in the submit file (for example, as the "executable", "output", "error", or "log" files, or for files listed in "transfer_input_files"). Only files in the user's home directory should ever be listed in these lines of the submit file.
- Capacity: Each user is allowed a certain amount of space in Gluster, though we can increase this space upon special request to firstname.lastname@example.org
- Data Security:
- Files placed in Gluster are owned by the user, and only the user's own jobs can access these files (unless the user specifically modifies unix file permissions to make certain files available for other users).
- Data Cleanup:
- As for all CHTC file space, data should be removed from /mnt/gluster AS SOON AS IT IS NO LONGER NEEDED FOR ACTIVELY-RUNNING JOBS. Even if it will be used in the future, it should be deleted from Gluster and copied back at a later date.
- Allowed Data Amounts:
- As of July 25, 2017, initial quotas of 10 GB and 100 files have been implemented for every folder within Gluster. A request to email@example.com and consultation with an RCF will be necessary to increase this quota.
- KEEP COPIES:
- of ANY and ALL data or software in Gluster in another, non-CHTC location. The HTC Gluster is not backed up and is prone to data loss over time. CHTC staff reserve the right to remove any data from the HTC Gluster (or any CHTC file system) at any time.
- USERS VIOLATING ANY OF THE POLICIES (also below) IN THIS GUIDE WILL HAVE THEIR GLUSTER ACCESS AND/OR CHTC ACCOUNT REVOKED UNTIL CORRECTIVE MEASURES ARE TAKEN. CHTC STAFF RESERVE THE RIGHT TO REMOVE ANY PROBLEMATIC USER DATA AT ANY TIME IN ORDER TO PRESERVE PERFORMANCE.
Before using Gluster, make sure that:
- you already have an account on a CHTC-administered submit server
- you already have a user directory in the HTC Gluster system, after requesting it via email discussion with firstname.lastname@example.org
You should also be familiar with:
- Using the command line to: navigate directories, create/edit/copy/move/delete files and directories, and run intended programs (aka "executables").
- CHTC's Intro to Running HTCondor Jobs
- CHTC's guide for Typical File Transfer
2. Practices for Files within Gluster:
Data in Gluster should be stored in as few files as possible (ideally, one file per job), and should be used by a job only after being copied from /mnt/gluster into the job working directory (see #4, below). Similarly, large output requiring Gluster should first be written to the job working directory, then compressed into a single file before being copied to /mnt/gluster at the end of the job (see #5, below).
To prepare job-specific data that is large enough to require Gluster and exists as multiple files or directories (or a directory of multiple files), first create a compressed tar package before placing the file in /mnt/gluster (either before submitting jobs, or within jobs before moving output to /mnt/gluster). For example:
$ tar -czvf job_package.tar.gz file_or_dir
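Before staging a package, it can help to confirm that it really contains what you expect. The following is a minimal sketch, not an official CHTC procedure: it builds a small sample directory in a temporary location, packages it, and lists the archive contents without extracting them. The directory name job_input and the file names inside it are hypothetical stand-ins for your own data.

```shell
#!/bin/bash
# Sketch: bundle a directory of several files into one compressed tar file,
# then list the archive to confirm its contents before staging it.
# "job_input" and the part_*.txt names are hypothetical stand-ins.
set -euo pipefail

workdir=$(mktemp -d)          # scratch area so the example touches nothing else
cd "$workdir"

# Create a sample directory with several files (stand-in for real job data).
mkdir job_input
for i in 1 2 3; do
    echo "sample data $i" > "job_input/part_$i.txt"
done

# Package the whole directory as a single compressed file.
tar -czf job_package.tar.gz job_input

# List the archive without extracting it, and count the entries that are
# not directories (directory entries end with "/").
tar -tzf job_package.tar.gz
file_count=$(tar -tzf job_package.tar.gz | grep -vc '/$')
echo "files in archive: $file_count"
```

Only the single job_package.tar.gz file would then be copied into /mnt/gluster, which keeps the file count in Gluster low.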
Movement of data into/out of /mnt/gluster before and after jobs should only be performed via CHTC's transfer server, as below, and not via a CHTC submit server.
After obtaining a user directory within /mnt/gluster and an account on the transfer server, copy relevant files directly into this user directory from your own computer:
Using the scp command on your own Linux or Mac computer:
$ scp large.file username@transfer.chtc.wisc.edu:/mnt/gluster/username/
If using a Windows computer:
Using a file transfer application, like WinSCP, directly drag the large file from its location on your computer to a location within /mnt/gluster/username/ on transfer.chtc.wisc.edu.
3. Submit File Implications
In order to properly submit jobs using Gluster, always do the following:
- ONLY submit Gluster-dependent jobs from within your home directory (/home/username), and NEVER from within /mnt/gluster.
- Do NOT list any /mnt/gluster files in any of the submit file lines, including: executable, log, output, error, transfer_input_files. Rather, your job's ENTIRE interaction with files in /mnt/gluster needs to occur WITHIN each job's executable, when it runs within the job.
- Request an adequate amount of disk space with "request_disk", to include the total amount of input data that each job will copy into the job working directory from /mnt/gluster, and any output that will be created in the job working directory.
- Make sure to use a submit file "Requirements" line so that your jobs only run on execute servers that have access to Gluster.
See the below submit file, as an example, which would be submitted from within the user's home directory:
    ### Example submit file for a single Gluster-dependent job
    # Files for the below lines MUST all be somewhere within /home/username,
    # and not within /mnt/gluster/username
    log = myprogram.log
    executable = run_myprogram.sh
    output = $(Cluster).out
    error = $(Cluster).err
    transfer_input_files = myprogram

    # IMPORTANT! Require execute servers that have Gluster:
    Requirements = (Target.HasGluster == true)

    # Make sure to still include lines like "request_memory", "request_disk", "request_cpus", etc.
    queue
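One way to pick a "request_disk" value is to add up the compressed archive, its unpacked contents, and any expected output, since the job holds all of these at once. The sketch below is only an illustration under assumptions: it builds a small stand-in archive in a temporary directory (a real one would already sit in /mnt/gluster), and expected_output_bytes is a hypothetical placeholder for your own estimate. The result is printed in KB because HTCondor reads a bare request_disk number as KiB.

```shell
#!/bin/bash
# Sketch: estimate the disk a job needs for one Gluster-staged archive.
set -euo pipefail

workdir=$(mktemp -d)
cd "$workdir"

# Build a small stand-in archive (a real one would already be in /mnt/gluster).
head -c 200000 /dev/zero > large_input.txt
tar -czf large_input.tar.gz large_input.txt

# Bytes on disk for the compressed copy the job will hold.
compressed_bytes=$(wc -c < large_input.tar.gz)
# Bytes of the decompressed tar stream, which is close to the extracted size.
unpacked_bytes=$(gzip -dc large_input.tar.gz | wc -c)

# The job holds both copies until the archive is deleted; add expected output too.
expected_output_bytes=0       # hypothetical: replace with your own estimate
total_kb=$(( (compressed_bytes + unpacked_bytes + expected_output_bytes + 1023) / 1024 ))
echo "suggested minimum request_disk (KB): $total_kb"
```

Treat the printed number as a lower bound and round up generously, since temporary files created by your program are not counted here.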
4. Using Gluster-staged Files and Software for Jobs
As stated in #3, all interaction with files and software in /mnt/gluster should occur within your job's main executable, when it runs. Therefore, there are two options for jobs depending on Gluster-staged software (larger than a few GB) and input (larger than 100 MB per file).
A. Copy files from Gluster into the working directory, from within the job
The recommended method is to have your job executable copy input or software from /mnt/gluster into the working directory of the job, and use it from there, being careful to remove such files from the working directory before the completion of the job (so that they're not copied back to the submit server as perceived output). An example is below:
    #!/bin/bash
    #
    # First, copy the compressed tar file from /mnt/gluster into the working directory,
    # and un-tar it to reveal your large input file(s) or directories:
    cp /mnt/gluster/username/large_input.tar.gz ./
    tar -xzvf large_input.tar.gz
    #
    # Command for myprogram, which will use files from the working directory
    ./myprogram large_input.txt myoutput.txt
    #
    # Before the script exits, make sure to remove the file(s) from the working directory
    rm large_input.tar.gz large_input.txt
    #
    # END
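If the program fails partway through, the final rm in a script like the one above never runs, and the large input would be returned to the submit server as output. A possible variation, sketched here under assumptions, registers the cleanup with a bash trap so it runs on any exit. To keep the sketch runnable anywhere, STAGING_DIR and JOB_DIR are temporary directories standing in for /mnt/gluster/username and the job working directory, and a wc command stands in for myprogram.

```shell
#!/bin/bash
# Sketch: the copy / untar / run / cleanup pattern with basic error handling.
# STAGING_DIR stands in for /mnt/gluster/username; JOB_DIR for the job working directory.
set -euo pipefail

STAGING_DIR=$(mktemp -d)
JOB_DIR=$(mktemp -d)

# Prepare a stand-in staged archive (already present in Gluster in a real workflow).
( cd "$STAGING_DIR" \
    && echo "input data" > large_input.txt \
    && tar -czf large_input.tar.gz large_input.txt \
    && rm large_input.txt )

cd "$JOB_DIR"

# Remove the large inputs on any exit so they are never returned as output.
cleanup() { rm -f large_input.tar.gz large_input.txt; }
trap cleanup EXIT

# Copy the archive into the working directory and unpack it there.
cp "$STAGING_DIR/large_input.tar.gz" ./
tar -xzf large_input.tar.gz

# Stand-in for "./myprogram large_input.txt myoutput.txt"
wc -c < large_input.txt > myoutput.txt
echo "job step finished; small output is in myoutput.txt"
```

With set -e, a failed cp or tar stops the script immediately, and the trap still removes whatever was copied.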
B. Software that is too large for or doesn't work with file transfer
If your software will ONLY work if it remains in the same location where it was first installed, and there are barriers to installing it within the working directory of every job (install-on-the-fly), please contact us for assistance. DO NOT PLACE SOFTWARE IN GLUSTER WITHOUT PERMISSION AND INPUT FROM CHTC STAFF: firstname.lastname@example.org
5. Copying large output files to Gluster from within a job
As stated in #3, all interaction with files in /mnt/gluster should occur within your designated "executable", when it runs. Therefore, there is a recommended approach for having steps within your executable place files in Gluster, as well as a consideration for large standard output.
A. Write output files to the working directory, then move these to Gluster
If your jobs write any data directly to Gluster from within a job, your jobs will run slower AND will cause Gluster to be slower for other users. Instead, have your executable write the file to a location within the working directory, and then make sure to move this large file to Gluster (or copy to Gluster, and then remove from the working directory), so that it's not transferred back to the home directory, as all other "new" files in the working directory will be.
Example, if executable is a shell script:
    #!/bin/bash
    #
    # Command to save output to the working directory:
    ./myprogram myinput.txt large_output.txt
    #
    # Tar and copy output to Gluster, then delete from the job working directory:
    tar -czvf large_output.tar.gz large_output.txt
    mv large_output.tar.gz /mnt/gluster/username/
    rm large_output.txt
    #
    # END
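Because Gluster is described above as prone to data loss, you may want to confirm that the staged archive matches the local one before the job ends. The following is a minimal sketch of that idea, assuming the POSIX cksum utility is available on the execute server. STAGING_DIR and JOB_DIR are temporary stand-ins for /mnt/gluster/username and the job working directory, and an echo stands in for myprogram.

```shell
#!/bin/bash
# Sketch: tar the output, copy it to the staging area, and compare checksums.
# STAGING_DIR stands in for /mnt/gluster/username; JOB_DIR for the job working directory.
set -euo pipefail

STAGING_DIR=$(mktemp -d)
JOB_DIR=$(mktemp -d)
cd "$JOB_DIR"

# Stand-in for "./myprogram myinput.txt large_output.txt"
echo "results" > large_output.txt

# Compress to a single file and copy it to the staging area.
tar -czf large_output.tar.gz large_output.txt
cp large_output.tar.gz "$STAGING_DIR/"

# Read each file on stdin so the checksum line does not include a file name.
local_sum=$(cksum < large_output.tar.gz)
staged_sum=$(cksum < "$STAGING_DIR/large_output.tar.gz")

staged_ok=yes
[ "$local_sum" = "$staged_sum" ] || staged_ok=no

# Remove the large local files either way, so they are not transferred back.
rm -f large_output.txt large_output.tar.gz
echo "staged copy verified: $staged_ok"

# Report failure through the exit code so the job can be identified and rerun.
if [ "$staged_ok" = no ]; then
    exit 1
fi
```

The local files are removed even when the check fails, on the assumption that rerunning the job is preferable to sending a large file back to the submit server.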
B. ALSO consider: Large standard output ("screen output") produced by your jobs
In some instances, your software may produce very large standard output (what would typically be output to the command screen, if you ran the command for yourself, instead of having HTCondor do it). Because such standard output from your software will usually be captured by HTCondor in the submit file "output" file, this "output" file WILL still be transferred by HTCondor back to your home directory on the submit server, which may be very bad for you and others, if that captured standard output is very large.
In these cases, it is useful to redirect the standard output of commands in your executable to a file in the working directory, and then move it into Gluster at the end of the job.
Example, if "myprogram" produces very large standard output, and is run from a script (bash) executable:
    #!/bin/bash
    #
    # script to run myprogram,
    #
    # redirecting large standard output to a file in the working directory:
    ./myprogram myinput.txt myoutput.txt > large_std.out
    #
    # tar and move large files to Gluster so they're not copied to the submit server:
    tar -czvf large_stdout.tar.gz large_std.out
    cp large_stdout.tar.gz /mnt/gluster/username/
    rm large_std.out large_stdout.tar.gz
    # END
6. Removing files from Gluster after jobs complete
Similar to the procedures for transferring files into Gluster, you can directly copy files out of Gluster using the command line or file-transfer applications like WinSCP.
7. Checking your Quota, Data Use, and File Counts in Gluster
To check your total data usage and quota, run df -h for your Gluster directory. Example:
$ df -h /mnt/gluster/alice
To check data usage and file counts, run ncdu from within the directory you'd like to query. Example:
$ cd /mnt/gluster/alice
$ ncdu
When ncdu has finished running, the output will give you a total file count and allow you to navigate between subdirectories for even more details. Type q when you're ready to exit the output viewer. More info here: https://lintut.com/ncdu-check-disk-usage/
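For a quick non-interactive check, the standard find and du commands can report the same two numbers. This is a rough sketch rather than the way the quota is actually enforced: the limits are the initial quota values stated in this guide, target_dir is a temporary stand-in for /mnt/gluster/username, and it assumes only regular files count toward the file limit.

```shell
#!/bin/bash
# Sketch: compare a directory's file count and size with the initial quota.
set -euo pipefail

target_dir=$(mktemp -d)       # in practice: /mnt/gluster/username
touch "$target_dir/a.tar.gz" "$target_dir/b.tar.gz"   # stand-in files

file_limit=100                          # initial file-count quota from this guide
size_limit_kb=$((10 * 1024 * 1024))     # 10 GB expressed in KB

# Count regular files; the arithmetic expansion strips any padding from wc.
file_count=$(( $(find "$target_dir" -type f | wc -l) ))
# Total size in KB; du prints "<size><TAB><path>".
used_kb=$(du -sk "$target_dir" | cut -f1)

echo "files: $file_count of $file_limit"
echo "space: $used_kb KB of $size_limit_kb KB"

if [ "$file_count" -gt "$file_limit" ] || [ "$used_kb" -gt "$size_limit_kb" ]; then
    echo "over quota: remove or repackage files" >&2
fi
```

If the file count is the problem rather than the size, repackaging many small files into one tar file is usually the simplest fix.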
For all user support, questions, and comments: email@example.com