Submitting High Memory Jobs on CHTC's HTC System
The examples and information in the below guide are useful ONLY if:
- you already have an account on a CHTC-administered submit server
- you already have a directory in the HTC Gluster system, after requesting it
via email discussion to firstname.lastname@example.org
- files needed by your jobs are larger than 4GB
- your software uses a lot of memory
To best understand the below information, users should already be familiar with:
- Using the command-line to: navigate directories, create/edit/copy/move/delete
files and directories, and run intended programs (aka "executables").
- CHTC's Intro to Running HTCondor Jobs
- CHTC's guides for In-job File Handling and File Delivery Methods
- CHTC's guides for using Gluster to handle large data and software installation.
High Memory Jobs
What kind of work are we talking about? A high-memory job is one that
requires a significant amount of memory (also known as RAM), usually over
120 GB and up to 1-2 TB. Tasks like genome assembly are often high in
memory needs, and the amount of memory is typically a function of the
total amount of input data.
Therefore, high memory jobs typically have large input and output files.
In the following guide, we cover typical implementations of high-memory software,
as well as the handling of large input and output files needed for high-memory work.
The best option for high memory work is to install your software into your
Gluster location, somewhere within
When your jobs run, they'll be able to access software you've installed
With certain software, you may need to use special arguments to
install to a local or "custom" location - check your software's
documentation. See the list below for more details on common software installs:
- Trinity: http://trinityrnaseq.github.io/#installation
- Mothur: http://www.mothur.org/wiki/Installation
- Velvet: https://www.ebi.ac.uk/~zerbino/velvet/ (see the Manual)
- If you don't see your software in this list, but have successfully installed
it in Gluster, let us know! We're happy to add it to this list.
Certain software may depend on Python, which is not installed on CHTC
computers. In this case, you will first need to install Python to your Gluster
directory. Searching for "Python local installation" or "Python installation
to home directory" will provide instructions for a local (instead of computer
or system-wide) installation. Once you've installed Python to Gluster,
you can then install the specific software you need based on that
local Python installation.
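As a rough sketch of what a local installation looks like when building Python from source: the usual `./configure --prefix=...`, `make`, `make install` sequence places Python under a directory you choose instead of a system-wide location. The function name `install_python_local` and both example paths are hypothetical placeholders, and the sketch assumes a compiler and `make` are available on the machine where you build.

```shell
#!/bin/bash
# Sketch: build Python from an unpacked source directory into a prefix you own.
# install_python_local and both example paths are hypothetical placeholders.

# install_python_local SOURCE_DIR PREFIX
install_python_local() {
    local src="$1" prefix="$2"
    (
        # Work in a subshell so the caller's working directory is unchanged.
        cd "$src" || exit 1
        # --prefix selects the install location instead of a system-wide one.
        ./configure --prefix="$prefix" || exit 1
        make || exit 1
        make install || exit 1
    )
}

# Example call (commented out; adjust both paths to your own locations):
# install_python_local "$HOME/Python-source" /mnt/gluster/username/python
# export PATH=/mnt/gluster/username/python/bin:$PATH
```

After a build like this, the interpreter lives under the `bin` directory of the chosen prefix, which is why the example adds that directory to the PATH.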
Special Considerations - "Executable" script
As described in our Gluster guides for software and
large files, to run
software installed in Gluster you will need to write
a script that runs your software commands for you and that serves
as the submit file "executable". Things to note are:
- One way to run the software installed in gluster is to add it to your
PATH, using the command:
Once you have included your software in the PATH, you can run it using normal command line syntax:
velvetg -exp_cov 21 -ins_length_long 40000
- Wherever you reference input files that are located in Gluster,
use the full Gluster path, like so:
velveth -fasta -short /mnt/gluster/username/path/to/fasta
Alternatively, some programs require you to set the input directory that
contains all your input files. In this case, set the input directory
using its full Gluster path in the same way.
- Where does your output go? If there is a way to set the output directory,
or location of output file, try to do so, setting that location in Gluster.
velvetg /mnt/gluster/username/path/to/outdir -exp_cov 21 -ins_length_long 40000
- In the rare case that your software is not able to read from or write to
/mnt/gluster directly, you will have to add commands that:
- copy the input files from Gluster to the working
directory on the execute node
- run your software
- remove input files from the working directory
- move output files back to Gluster
- Standard output for high-memory programs can be many gigabytes when saved
in a file. Condor saves this information automatically, but since it is so
large, it should be saved on Gluster instead. Add a redirect to the end of
your command that will send standard output to a file in Gluster.
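For the rare case in the list above where software cannot read or write Gluster directly, the four steps (copy in, run, clean up, move out) can be sketched as a small wrapper function. This is only an illustration: `run_with_staging`, the `GLUSTER_DIR` variable, and the file names in the usage comment are hypothetical, and the sketch assumes plain file names with no subdirectories.

```shell
#!/bin/bash
# Sketch of a wrapper for software that cannot read/write Gluster directly.
# GLUSTER_DIR, run_with_staging, and all file names are hypothetical.

GLUSTER_DIR="${GLUSTER_DIR:-/mnt/gluster/username}"

# run_with_staging INPUT_FILE OUTPUT_FILE COMMAND [ARGS...]
# Copies INPUT_FILE from Gluster into the working directory, runs COMMAND,
# removes the local input copy, and moves OUTPUT_FILE back to Gluster.
run_with_staging() {
    local input="$1" output="$2"
    shift 2

    # 1. Copy the input file from Gluster to the job's working directory.
    cp "$GLUSTER_DIR/$input" . || return 1

    # 2. Run the software, sending its standard output to a file in Gluster.
    "$@" > "$GLUSTER_DIR/${output}.stdout.txt"
    local status=$?

    # 3. Remove the local input copy from the working directory.
    rm -f "$(basename "$input")"

    # 4. Move the output file back to Gluster, if the software produced one.
    if [ -e "$output" ]; then
        mv "$output" "$GLUSTER_DIR/"
    fi
    return "$status"
}

# Example use (commented out; myprogram and the file names are placeholders):
# run_with_staging reads.fa result.txt myprogram reads.fa
```

The cleanup steps run even when the command fails, so large input copies are not left behind in the working directory, and the command's exit status is still returned to the caller.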
Altogether, a sample script may look something like this
(perhaps called "run_Trinity.sh"):
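#!/bin/bash
# Add the /mnt/gluster software location to the job's PATH
# (placeholder path; substitute the directory where you installed the software)
export PATH=/mnt/gluster/username/path/to/software/bin:$PATH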
# Run software command, referencing input files from gluster and
# redirecting "stdout" to a file in gluster. Backslashes are line continuation.
Trinity --seqType fq --left /mnt/gluster/username/path/to/reads_1.fq \
--right /mnt/gluster/username/path/to/reads_2.fq --CPU 6 --max_memory \
20G > /mnt/gluster/username/trinity_stdout.txt
# Trinity will write output to the working directory by default,
# so when the job finishes, it needs to be moved back to gluster
mv trinity_out_dir/Trinity.fasta /mnt/gluster/username/
The script that contains these commands should be located in your
/home/username directory on the submit node.
It will be the executable in your submit file.
Special Considerations - Submit File
Your submit file should also be located in
The submit file should be fairly simple - important points to note are:
Altogether, a sample submit file may look something like this:
### Example submit file for a single Gluster-dependent job
universe = vanilla
# Files for the below lines will all be somewhere within /home/username,
# and not within /mnt/gluster/username
log = run_myprogram.log
executable = /home/username/run_Trinity.sh
output = $(Cluster).out
error = $(Cluster).err
transfer_input_files = myprogram
# Require execute servers that have Gluster:
Requirements = (Target.HasGluster == true)
# The below line needs to be set to "YES"
should_transfer_files = YES
# Make sure to still include "request" lines:
request_memory = 200GB
request_disk = 100GB
request_cpus = 6
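Since the submit file still needs its "request" lines, one option is a quick check before submitting. The sketch below is an assumption-laden illustration, not an HTCondor feature: `check_requests` is a hypothetical helper and `run_Trinity.sub` is a placeholder file name.

```shell
#!/bin/bash
# Sketch: confirm a submit file has the three "request" lines before submitting.
# check_requests is a hypothetical helper, not part of HTCondor.

# check_requests SUBMIT_FILE
# Prints any missing request_* line and returns non-zero if one is absent.
check_requests() {
    local submit_file="$1" missing=0 key
    for key in request_memory request_disk request_cpus; do
        # Match "key = value" at the start of a line, ignoring case and spacing.
        if ! grep -Eiq "^[[:space:]]*${key}[[:space:]]*=" "$submit_file"; then
            echo "missing: $key"
            missing=1
        fi
    done
    return "$missing"
}

# Example use (commented out; run_Trinity.sub is a placeholder file name):
# check_requests run_Trinity.sub && condor_submit run_Trinity.sub
```

Lines that are commented out with `#` in the submit file are not counted, because the pattern only matches a request line that begins the line.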
Our high memory machines have the following specs:
Amount of memory | Number of CPUs | Local disk space on machine
Consult with Facilitators
If you are unsure how to run high-memory jobs on CHTC, in particular,
if you're not sure if everything in this guide applies to you, get in touch
with a research computing facilitator by emailing email@example.com.
We are here to help you get your work done as efficiently as possible
and can advise when your process might be slightly different
than what we've outlined here.