Submitting High Memory Jobs on CHTC's HTC System

The examples and information in the guide below are useful ONLY if:
  • you already have an account on a CHTC-administered submit server
  • you already have a directory in the HTC Gluster system, requested by emailing chtc@cs.wisc.edu
  • files needed by your jobs are larger than 4 GB
  • your software uses a lot of memory
To best understand the below information, users should already be familiar with:
  1. Using the command-line to: navigate directories, create/edit/copy/move/delete files and directories, and run intended programs (aka "executables").
  2. CHTC's Intro to Running HTCondor Jobs
  3. CHTC's guides for In-job File Handling and File Delivery Methods
  4. CHTC's guides for using Gluster to handle large data and software installation.

High Memory Jobs

What kind of work are we talking about? A high-memory job is one that requires a significant amount of memory (also known as RAM), usually over 120 GB and up to 1-2 TB. Tasks like genome assembly are often high in memory needs, and the amount of memory is typically a function of the total amount of input data.

Therefore, high memory jobs typically have large input and output files. In the following guide, we cover typical implementations of high-memory software, as well as the handling of large input and output files needed for high-memory work.

Software

The best option for high memory work is to install your software into your Gluster location, somewhere within /mnt/gluster/username/. When your jobs run, they'll be able to access software you've installed within Gluster.

With certain software, you may need to use special arguments to install to a local or "custom" location - check your software's documentation. See the list below for more details on common software installs; a generic installation sketch follows the list:

  • Trinity: http://trinityrnaseq.github.io/#installation
  • Mothur: http://www.mothur.org/wiki/Installation
  • Velvet: https://www.ebi.ac.uk/~zerbino/velvet/ (see the Manual)
  • If you don't see your software in this list, but have successfully installed it in Gluster, let us know! We're happy to add it to this list.
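
As a rough, hedged sketch only: for software that uses a standard configure/make build, a local install into Gluster might look like the following, where "username", the software name, and the version number are all placeholders and the exact steps depend on your software's own documentation.

#!/bin/bash
# Hypothetical sketch: installing configure/make-style software into Gluster.
# "mysoftware-1.0" is a placeholder; follow your software's documentation.
cd /mnt/gluster/username
tar -xzf mysoftware-1.0.tar.gz
cd mysoftware-1.0
# Install to a "custom" location within Gluster rather than a system-wide path:
./configure --prefix=/mnt/gluster/username/mysoftware-1.0-install
make
make install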

Certain software may depend on Python, which is not installed on CHTC computers. In this case, you will first need to install Python to your Gluster directory. Searching for "Python local installation" or "Python installation to home directory" will provide instructions for a local (instead of computer or system-wide) installation. Once you've installed Python to Gluster, you can then install the specific software you need based on that local Python installation.
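
As a rough sketch only (the version number, download URL, and paths are placeholders; check python.org for current releases), a local Python installation into Gluster might look like:

#!/bin/bash
# Hypothetical sketch of a from-source, local Python install into Gluster.
cd /mnt/gluster/username
wget https://www.python.org/ftp/python/2.7.10/Python-2.7.10.tgz
tar -xzf Python-2.7.10.tgz
cd Python-2.7.10
./configure --prefix=/mnt/gluster/username/python
make
make install
# Put this local Python first on the PATH so later software installs use it:
export PATH=/mnt/gluster/username/python/bin:$PATH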

Special Considerations - "Executable" script

As described in our Gluster guides for software and large files, to run the software installed in Gluster, you will need to write a script that runs your software commands for you and that will serve as the submit file "executable". Things to note are:

  • One way to run the software installed in gluster is to add it to your PATH, using the command:
    export PATH=/mnt/gluster/username/path/to/software-directory:$PATH
    Once you have included your software in the PATH, you can run it using normal command line syntax:
    velvetg -exp_cov 21 -ins_length_long 40000
  • Wherever your input files are located in Gluster, reference them using the full Gluster path, like so:
    velveth -fasta -short /mnt/gluster/username/path/to/fasta
    Alternatively, some programs require you to set the input directory that contains all your input files. In this case, set the input directory in the same way:
    set.dir(input=/mnt/gluster/username/path/to/inputdir)
  • Where does your output go? If there is a way to set the output directory or the location of the output file, do so, pointing that location to Gluster.
    set.dir(output=/mnt/gluster/username/path/to/outputdir)
    velvetg /mnt/gluster/username/path/to/outdir -exp_cov 21 -ins_length_long 40000
  • In the rare case that your software is not able to read/write to a common location like /mnt/gluster, you will have to add commands that, as needed (a short sketch of this pattern appears after this list):
    • copy the input files from Gluster to the working directory on the execute node
    • run your software
    • remove input files from the working directory
    • move output files back to Gluster
  • Standard output from high-memory programs can amount to many gigabytes when saved to a file. HTCondor saves this output automatically, but since it is so large, it should be written to Gluster instead. Add a redirect to the end of your command to send standard output to a file in Gluster.
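
For the rare copy-in/copy-out case above, a minimal sketch (all file and program names are placeholders) might look like the following; the full example further below shows the more common approach of referencing Gluster paths directly.

#!/bin/bash
# Hypothetical sketch for software that cannot read/write /mnt/gluster directly.
# 1. Copy input from Gluster into the job's working directory on the execute node:
cp /mnt/gluster/username/path/to/big_input.dat ./
# 2. Run the software against the local copy:
./my_program --input big_input.dat --output results.out
# 3. Remove the input copy so it is not transferred back by HTCondor:
rm big_input.dat
# 4. Move the output back to Gluster:
mv results.out /mnt/gluster/username/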

Altogether, a sample script may look something like this (perhaps called "run_Trinity.sh"):

#!/bin/bash
# Add the /mnt/gluster software location to the job's PATH
export PATH=/mnt/gluster/username/trinityrnaseq-2.0.6:$PATH
#
# Run software command, referencing input files from gluster and 
# redirecting "stdout" to a file in gluster. Backslashes are line continuation.
Trinity --seqType fq --left /mnt/gluster/username/path/to/reads_1.fq \
  --right /mnt/gluster/username/path/to/reads_2.fq --CPU 6 --max_memory 20G \
  > /mnt/gluster/username/trinity_stdout.txt
#
# Trinity will write output to the working directory by default, 
# so when the job finishes, it needs to be moved back to gluster
mv trinity_out_dir/Trinity.fasta /mnt/gluster/username/
### END

The script that contains these commands should be located in your /home/username directory on the submit server. It will be the executable in your submit file.
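
If the script was created without execute permission, you can add it on the submit server before submitting; for example, using the script above:

chmod +x /home/username/run_Trinity.sh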

Special Considerations - Submit File

Your submit file should also be located in /home/username. The submit file should be fairly simple - important points to note are:

  • ONLY submit Gluster-dependent jobs (with condor_submit commands) from within your home directory (/home/username), and NOT from within /mnt/gluster.
  • Do NOT list any /mnt/gluster files in any of the submit file lines, including: executable, log, output, error, transfer_input_files. Rather, your job's ENTIRE interaction with files in /mnt/gluster needs to occur WITHIN your executable, when it runs.
  • Even though your job will internally use files in Gluster (not HTCondor file transfer), it is still important to have "should_transfer_files = YES" (the default) in the submit file so that the job will run at all, because the "executable", "log", "output", and "error" files still use HTCondor file transfer, as usual.
  • In order for jobs to run on CHTC servers that have access to the /mnt/gluster location, require execute servers marked with "HasGluster" using a "Requirements" line in the submit file.
    requirements = (Target.HasGluster == true)
  • Your "request_disk" will only pertain to the maximum amount of data your job will ever have within the job working directory on the execute node, and not to files that only ever exist within /mnt/gluster.
  • If you're not sure how much memory to request, first do some small tests with a subset of your input data, and make a generous estimate based on those tests, the experience of your colleagues, and/or estimates of memory usage from the online documentation for your particular software (see the example command after this list).
  • In order to run on a dedicated "high-memory" server, request over 120 GB of memory with "request_memory" in the submit file. However, if your job doesn't need quite that much memory, it's good to request less, as doing so will allow your job(s) to run on more servers, since CHTC has hundreds of servers with less than 120 GB of memory.
  • Typically, your software command will include an option to indicate the number of CPUs or "processors" to be used, which should be the same number that you indicate with "request_cpus". A value of "16" or less is best, as larger values may mean that your jobs wait longer in the queue.
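
To make that estimate from a completed test job (as noted in the list above), you can check what the job actually used once it has left the queue. This is a hedged example; "1234567" is a placeholder cluster number, and both values are reported in megabytes:

# Report requested vs. actual memory (MB) for a finished test job:
condor_history 1234567 -af RequestMemory MemoryUsage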

Altogether, a sample submit file may look something like this:

### Example submit file for a single Gluster-dependent job
universe = vanilla
# Files for the below lines will all be somewhere within /home/username,
# and not within /mnt/gluster/username
log = run_Trinity.log
executable = /home/username/run_Trinity.sh
output = $(Cluster).out
error = $(Cluster).err
# (Use transfer_input_files only for small files in /home/username, if any;
# this Trinity example reads its input directly from Gluster.)
# Require execute servers that have Gluster:
Requirements = (Target.HasGluster == true)
# The below line needs to be set to "YES"
should_transfer_files = YES
# Make sure to still include "request" lines:
request_memory = 200GB
request_disk = 100GB
request_cpus = 6
queue
### END
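
Once both the executable script and the submit file are in your home directory, submit as usual from /home/username (not from /mnt/gluster); the submit file name below is a placeholder:

cd /home/username
condor_submit highmem_trinity.sub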

Our high memory machines have the following specs:

Machine name            Amount of memory   Number of CPUs   Local disk space on machine
mem1.chtc.wisc.edu      1 TB               80               1 TB
mem2.chtc.wisc.edu      2 TB               80
wid-003.chtc.wisc.edu   512 GB             16               2.5 TB

Testing

  • If possible, try using a small subset of data in a test job. Not only will this give you a chance to try out the submit file syntax and make sure your job runs, but it can help you estimate how much memory and/or disk you will need for a job using your full data.
  • Use interactive jobs to test commands that end up in your "executable" script. You can use your normal submit file for an interactive job - just submit it using the -i flag with condor_submit:
     $ condor_submit -i submit.file
    This should open a bash session on an execute machine, which will allow you to test your commands interactively.

Consult with Facilitators

If you are unsure how to run high-memory jobs on CHTC, or if you're not sure whether everything in this guide applies to you, get in touch with a research computing facilitator by emailing chtc@cs.wisc.edu. We are here to help you get your work done as efficiently as possible, and we can advise when your process might be slightly different from what we've outlined here.