
Using Gluster for Large Data and Software

CHTC maintains a Gluster file share, which should only be used for files or software that are too large for HTCondor file transfer or SQUID. The below guide discusses how to use CHTC's Gluster option for jobs that use or produce very large files.

Which Option is the Best for Your Files?


HTCondor file transfer
  • Meant for: basic file delivery and return (see size limits below)
  • Input/software file sizes: 0 - 100 MB per file; less than 500 MB total per job
  • Output file sizes: 0 - 4 GB total
  • Available for jobs running: in CHTC, the UW Grid, and the OSG
  • File security: available to your jobs, on CHTC and beyond
  • Special considerations: DO NOT USE for files in /mnt/gluster OR /squid

SQUID Web Proxy
  • Meant for: large input or software shared by many jobs
  • Input/software file sizes: 100 MB - 1 GB per shared file
  • Output file sizes: N/A
  • Available for jobs running: in CHTC, the UW Grid, and the OSG
  • File security: files will be world-readable
  • Special considerations: large files unique to individual jobs are better suited to Gluster

Gluster File Share
  • Meant for: largest software, input, and output
  • Input/software file sizes: 100 MB - TBs per unique file; 1 GB - TBs per shared file
  • Output file sizes: 4 GB - TBs
  • Available for jobs running: in only a portion of CHTC
  • File security: accessible ONLY to your jobs, on specific CHTC servers
  • Special considerations: requires a special "Requirements" line in the submit file

Applicability

Intended Use:
Gluster is a staging area for input files, output files, and software that are too large for HTCondor file transfer or SQUID. Files and software within Gluster are available only to jobs running in the CHTC Pool.
Access to Gluster:
Access is granted upon request to chtc@cs.wisc.edu. As with all CHTC file space, users should minimize the amount of data in their own directory within /mnt/gluster, and data should be removed from /mnt/gluster AS SOON AS IT IS NO LONGER NEEDED FOR ACTIVELY-RUNNING JOBS. You can always copy data back into Gluster at a later date.
Limitations:
  • Access outside of CHTC? Jobs relying on Gluster will only run in CHTC's HTCondor Pool, and data in Gluster is not accessible to HTCondor jobs running outside of this pool. It is always better to split large input files if only some of the input data is needed by each job, in order to avoid depending on Gluster.
  • Not all files ... Gluster should ONLY be used for files and software that cannot leverage another data accessibility mode. Files placed in /mnt/gluster should NOT be listed in the submit file (for example, as the "executable", "output", "error", or "log" files, or for files listed in "transfer_input_files").
  • Capacity: Each user is allowed a certain amount of space in Gluster, though we can increase this space for special requests to chtc@cs.wisc.edu.
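To stay within your allotted space, it helps to check how much your Gluster directory is using before and after staging files. A minimal sketch, wrapped in a shell function only for illustration (in practice you would simply run `du -sh /mnt/gluster/username/` on the submit server):

```shell
#!/bin/bash
# Report the total size of a directory in human-readable form.
# In a real session: gluster_usage /mnt/gluster/username/
gluster_usage () {
    du -sh "$1"
}
```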
Data Security:
Files placed in Gluster are owned by the user, and only the user's own jobs can access these files (unless the user specifically modifies unix file permissions to make certain files available for other users).
The examples and information in the below guide are useful ONLY if:
  • you already have an account on a CHTC-administered submit server
  • you already have a user directory in the HTC Gluster system, after requesting it via email to chtc@cs.wisc.edu
To best understand the below information, users should already be familiar with:
  1. Using the command-line to: navigate directories, create/edit/copy/move/delete files and directories, and run intended programs (aka "executables").
  2. CHTC's Intro to Running HTCondor Jobs
  3. CHTC's guide for File Availability Options

1. Staging files within Gluster:

After obtaining a user directory within /mnt/gluster on a CHTC-administered submit server, copy relevant files directly into this user directory from your own computer:

Example scp command on your own Linux or Mac computer:

$ scp large.file username@submit-5.chtc.wisc.edu:/mnt/gluster/username/

If using a Windows computer:

Using a file transfer application, like WinSCP, directly drag the large file from its location on your computer to a location within /mnt/gluster/username/ on the submit server.

2. Submit File Implications

In order to properly submit jobs using Gluster, please make sure to do the following:

  • ONLY submit Gluster-dependent jobs from within your home directory (/home/username), and NOT from within /mnt/gluster.
  • Do NOT list any /mnt/gluster files in any of the submit file lines, including: executable, log, output, error, transfer_input_files. Rather, your job's ENTIRE interaction with files in /mnt/gluster needs to occur WITHIN each job's executable, when it runs within the job.
  • Request an adequate amount of disk space with "request_disk", accounting for the total input data the job will copy into its working directory from /mnt/gluster and any large output the job will create in the working directory before copying it into /mnt/gluster.
  • Make sure to use a submit file "Requirements" line so that your jobs only run on execute servers that have access to Gluster.

See the below submit file, as an example, which would be submitted from within the user's home directory:

### Example submit file for a single Gluster-dependent job
# Files for the below lines MUST all be somewhere within /home/username,
# and not within /mnt/gluster/username

log = myprogram.log
executable = /home/username/run_myprogram.sh
output = $(Cluster).out
error = $(Cluster).err
transfer_input_files = myprogram

# IMPORTANT! Require execute servers that have Gluster:
Requirements = (Target.HasGluster == true)

# Make sure to still include lines like "request_memory", "request_disk", "request_cpus", etc. 

queue

3. Using Gluster-staged Files and Software for Jobs

As stated in #2, all interaction with files and software in /mnt/gluster should occur within your job's main executable, when it runs. Therefore, there are two options for jobs depending on Gluster-staged software (larger than a few GB) and input (see table above).

A. Option 1: Copy files from Gluster into the working directory (best, if possible)

The recommended method is to copy input or software into the working directory of the job, and use it from there, being careful to remove such files from the working directory before the completion of the job (so that they're not copied back to the submit server as perceived output). An example is below:

#!/bin/bash
#
# First, copy the large file from /mnt/gluster into the working directory:
cp /mnt/gluster/username/large_input.txt ./
#
# Command for myprogram, which will use the file from the working directory
./myprogram large_input.txt myoutput.txt
#
# Before the script exits, make sure to remove the large file from the working directory
rm large_input.txt
#
# END
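If the copy from Gluster fails (for example, because the file was removed too early), the script above would run myprogram without its input. A defensive sketch of the same pattern, with a helper name chosen here only for illustration, exits immediately on a failed copy so the failure shows up in the job's exit status:

```shell
#!/bin/bash
# stage_in: copy a file from Gluster (or any absolute path) into the
# job's working directory, exiting nonzero if the copy fails.
stage_in () {
    cp "$1" ./ || { echo "ERROR: failed to copy $1" >&2; exit 1; }
}

# In a real job script this would be used as, e.g.:
#   stage_in /mnt/gluster/username/large_input.txt
#   ./myprogram large_input.txt myoutput.txt
#   rm large_input.txt
```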

B. Option 2: Refer to files and software DIRECTLY from their location in Gluster (LAST RESORT!)

If your software will ONLY work if it remains in the same location where it was first installed and there are barriers to installing it within the working directory of every job (install-on-the-fly), you may instead have your jobs use these files or software from where they are located in /mnt/gluster. Note that this method performs poorly, and should be avoided if at all possible.

Essentially, your job executable should refer to Gluster-located files and software using the absolute path (e.g. /mnt/gluster/username/large_input.txt).

Example, if your job executable is a unix shell (bash) script:

#!/bin/bash
#
# script to run myprogram, 
# which reads in a large file directly from Gluster
./myprogram /mnt/gluster/username/large_input.txt myoutput.txt
#
# END

4. Writing large output files to Gluster from within a job

As stated in #2, all interaction with files in /mnt/gluster should occur within your designated "executable", when it runs. Therefore, there are two options for having steps within your executable write files to Gluster, as well as a consideration for large standard output.

A. Write output files to the working directory, then move these to Gluster

Writing files directly to Gluster from within a job can be detrimental to the Gluster filesystem (and cause your jobs to run more slowly). Instead, have your executable write the file to a location within the working directory, and then move this large file to Gluster (or copy it to Gluster and remove it from the working directory) so that it is not transferred back to your home directory, as all other "new" files in the working directory will be.

Example, if executable is a shell script:

#!/bin/bash
# 
# Command to save output to the working directory:
./myprogram myinput.txt large_output.txt
#
# Move large output to Gluster:
mv large_output.txt /mnt/gluster/username/large_output.txt
#
# END
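A move into Gluster can itself fail (for example, if the destination directory name is mistyped). This hedged variant of the step above, with an illustrative helper name, makes such a failure visible in the job's exit status rather than losing the output silently:

```shell
#!/bin/bash
# stage_out: move a large output file into a destination directory,
# exiting nonzero if the move fails.
stage_out () {
    mv "$1" "$2"/ || { echo "ERROR: failed to move $1 to $2" >&2; exit 1; }
}

# In a real job script this would be used as, e.g.:
#   stage_out large_output.txt /mnt/gluster/username
```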

B. ALSO consider: Large standard output ("screen output") produced by your jobs

In some instances, your software may produce very large standard output (what would typically be printed to the screen if you ran the command yourself, instead of having HTCondor run it). HTCondor captures this standard output in the submit file's "output" file, which is transferred back to your home directory on the submit server; if that captured output is very large, the transfer can cause problems for you and other users.

In these cases, it is useful to redirect the standard output of commands in your executable to a file in the working directory, and then move it into Gluster at the end of the job.

Example, if "myprogram" produces very large standard output, and is run from a script (bash) executable:

#!/bin/bash
#
# script to run myprogram,
# 
# redirecting large standard output to a file in the working directory:
./myprogram myinput.txt myoutput.txt > large_std.out
# 
# move large files to Gluster so they're not copied to the submit server:
mv large_std.out /mnt/gluster/username/large_std.out
# END
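Some programs also write heavily to standard error, which HTCondor would likewise capture in the submit file's "error" file and return to the submit server. One way to keep both streams out of the returned files is to redirect both to the working directory; a sketch (the wrapper name is illustrative, not a CHTC convention):

```shell
#!/bin/bash
# run_quietly: run a command with its standard output and standard
# error redirected to files in the working directory.
run_quietly () {
    "$@" > large_std.out 2> large_std.err
}

# In a real job script:
#   run_quietly ./myprogram myinput.txt myoutput.txt
#   mv large_std.out large_std.err /mnt/gluster/username/
```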

5. Removing files from Gluster after jobs complete

Similar to the procedure for transferring files into Gluster, you can directly copy files out of Gluster using command-line scp or a file-transfer application like WinSCP. For example, on your own Linux or Mac computer:

$ scp username@submit-5.chtc.wisc.edu:/mnt/gluster/username/large.file ./