Submitting High Memory Jobs on CHTC's HTC System

The examples and information in the below guide are useful ONLY if:
  • you already have an account on a CHTC-administered submit server
  • you already have a directory in the HTC Gluster system, after requesting it via email discussion to chtc@cs.wisc.edu
  • files needed by your jobs are larger than 4GB
  • your software uses a lot of memory
To best understand the below information, users should already be familiar with:
  1. Using the command-line to: navigate directories, create/edit/copy/move/delete files and directories, and run intended programs (aka "executables").
  2. CHTC's Intro to Running HTCondor Jobs
  3. CHTC's guides for In-job File Handling and File Delivery Methods
  4. CHTC's guides for using Gluster to handle large data and software installation.

High Memory Jobs

What kind of work are we talking about? A high-memory job is one that requires a significant amount of memory (also known as RAM), usually over 120 GB and up to 1-4 TB. Tasks like genome assembly are often high in memory needs, and the amount of memory is typically a function of the total amount of input data.

Therefore, high memory jobs typically have large input and output files. In the following guide, we cover typical implementations of high-memory software, as well as the handling of large input and output files needed for high-memory work.

Software

The best option for high-memory work is to create a portable installation of your software that is run after being copied to the working directory of your job. If you are using Gluster to stage large input or output files (in compliance with all policies for using Gluster), you should not install your software in your Gluster location. If your software must run from the location where it is initially installed, your jobs will nearly always run fastest when the software is installed to the working directory as part of the job.
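
For software built with a standard "configure/make" process, creating such a portable installation might look roughly like the sketch below (run once, for example, inside an interactive build job). The program name "mytool", its version, and the configure options are placeholders, not specific to any real package; follow your own software's build instructions.

#!/bin/bash
# Hypothetical build sketch: compile "mytool" into a self-contained directory,
# then package that directory so a job can unpack and run it from its working directory.
tar -xzf mytool-1.0.tar.gz
cd mytool-1.0
./configure --prefix=$PWD/../mytool-1.0-installed
make
make install
cd ..
# Create the tarball that the job's executable script will unpack at run time
tar -czf mytool-1.0-installed.tar.gz mytool-1.0-installed
### END

Note that some programs hard-code the installation path at build time; if yours does, check its documentation for a relocatable or "portable" build option.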

See the list below for more details on common software installs:

  • Trinity: http://trinityrnaseq.github.io/#installation
  • Mothur: http://www.mothur.org/wiki/Installation
  • Velvet: https://www.ebi.ac.uk/~zerbino/velvet/ (see the Manual)
  • If you don't see your software in this list, but have successfully installed it for use in CHTC, let us know! We're happy to add it to this list.
If you have any questions about making sure that your software will work in your jobs, please get in touch with a Research Computing Facilitator by emailing chtc@cs.wisc.edu.

Special Considerations - "Executable" script

As described in our Gluster guides for software and large files, you will need to write a script that runs your software commands for you and handles any files staged in Gluster; this script will serve as the submit file "executable". Things to note are:

  • If you have large files that are located in the Gluster filesystem, or have output that will be large enough to need Gluster, you will need to add commands that, as needed:
    • copy the input files from Gluster to the working directory on the execute node
    • set up and run your software
    • remove input files from the working directory
    • tar output files into a single output tar file (.tar.gz)
    • copy the output tar file to Gluster and remove it from the working directory
  • Standard output from high-memory programs can amount to many gigabytes when saved to a file. HTCondor saves this output automatically (as the submit file's "output" file), but since it is so large, it should be sent to Gluster instead. Add a redirect to the end of your software command that sends standard output to a file in the working directory, then use a later command in your executable script to copy that file to your Gluster location and remove it from the job working directory.

Altogether, a sample script may look something like this (perhaps called "run_Trinity.sh"):

#!/bin/bash
# Copy input data from Gluster into the working directory of the job
cp /mnt/gluster/username/reads.tar.gz ./
tar -xzvf reads.tar.gz
rm reads.tar.gz
# Set up the software installation in the job working directory, and add it to the job's PATH
tar -xzvf trinityrnaseq-2.0.6-installed.tar.gz
rm trinityrnaseq-2.0.6-installed.tar.gz
export PATH=$(pwd)/trinityrnaseq-2.0.6:$PATH
#
# Run the software command, referencing input files in the working directory and
# redirecting "stdout" to a file that will be copied to Gluster at the end of
# the job. Backslashes are line continuations.
Trinity --seqType fq --left reads_1.fq \
--right reads_2.fq --CPU 6 --max_memory \
20G > trinity_stdout.txt
#
# Trinity will write output to the working directory by default,
# so when the job finishes, it needs to be moved back to Gluster
tar -czvf trinity_out_dir.tar.gz trinity_out_dir
cp trinity_out_dir.tar.gz trinity_stdout.txt /mnt/gluster/username/
rm reads_*.fq trinity_out_dir.tar.gz trinity_stdout.txt
### END

The script that contains these commands should be located in your /home/username directory on the submit node. It will be the executable in your submit file.
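
As a quick sanity check before submitting, you can confirm from the submit server that the script is present in your home directory and is executable; the path below assumes the example script name used above.

chmod +x /home/username/run_Trinity.sh
ls -l /home/username/run_Trinity.sh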

Special Considerations - Submit File

Your submit file should also be located in /home/username. The submit file should be fairly simple - important points to note are:

  • ONLY submit Gluster-dependent jobs (with condor_submit commands) from within your home directory (/home/username), and NOT from within /mnt/gluster.
  • Do NOT list any /mnt/gluster files in any of the submit file lines, including: executable, log, output, error, transfer_input_files. Rather, your job's ENTIRE interaction with files in /mnt/gluster needs to occur WITHIN your executable, when it runs.
  • Even though your job will internally use files in Gluster (perhaps without using HTCondor file transfer for input), it is still important to have "should_transfer_files = YES" (the default) in the submit file so that the job will run AT ALL, because the "executable", "log", "output", and "error" files still use HTCondor file transfer, as usual.
  • In order for jobs to run on CHTC servers that have access to the /mnt/gluster location, require execute servers marked with "HasGluster" using a "Requirements" line in the submit file.
    requirements = (Target.HasGluster == true)
  • Your "request_disk" value will need to reflect the maximum amount of data your job will ever have within the job working directory on the execute node, including all input and output files (both will take up space in the working directory before some of them are removed at the end of the job).
  • If you're not sure how much memory to request, first do some small tests with a subset of your input data, and make a generous estimate, based on the tests, as well as the experience of your colleagues and/or estimates of memory usage from the online documentation for your particular software.
  • In order to run on a dedicated "high-memory" server, request over 120 GB of memory with "request_memory" in the submit file. However, if your job doesn't need quite that much memory, it's good to request less, as doing so will allow your job(s) to run on more servers, since CHTC has hundreds of servers with less than 120 GB of memory.
  • Typically, your software command will include an option to indicate the number of CPUs or "processors" to be used, which should be the same number that you indicate with "request_cpus". A value of "16" or less is best, as larger values may mean that your jobs wait longer in the queue.

Altogether, a sample submit file may look something like this:

### Example submit file for a single Gluster-dependent job
universe = vanilla
# Files for the below lines will all be somewhere within /home/username,
# and not within /mnt/gluster/username
log = run_myprogram.log
executable = /home/username/run_Trinity.sh
output = $(Cluster).out
error = $(Cluster).err
transfer_input_files = trinityrnaseq-2.0.6-installed.tar.gz
# Require execute servers that have Gluster:
Requirements = (Target.HasGluster == true)
# The below line needs to be set to "YES"
should_transfer_files = YES
# Make sure to still include "request" lines:
request_memory = 200GB
request_disk = 100GB
request_cpus = 6
queue
### END
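
With the executable script and the submit file both in your home directory on the submit server, the job is then submitted and monitored with the usual HTCondor commands. The submit file name below is just an example.

$ condor_submit trinity.sub
$ condor_q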

Our high memory machines have the following specs:

Machine name            Amount of memory   Number of CPUs   Local disk space on machine
mem1.chtc.wisc.edu      1 TB               80               1 TB
mem2.chtc.wisc.edu      2 TB               80               1 TB
mem3.chtc.wisc.edu      4 TB               80               6 TB
wid-003.chtc.wisc.edu   512 GB             16               2.5 TB
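
If you would like to see what is currently advertised, a condor_status query along these lines (run from the submit server) should list machines that have Gluster access and more than 120 GB of memory. This is only a sketch: TotalMemory and TotalCpus are standard machine ClassAd attributes (TotalMemory is reported in MB), but the exact output depends on the pool at the time you run it.

$ condor_status -constraint 'HasGluster && TotalMemory > 120000' \
    -autoformat Machine TotalMemory TotalCpus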

Testing

  • Before running a full-size high-memory job, make sure to use a small subset of data in a test job. Not only will this give you a chance to try out the submit file syntax and make sure your job runs, but it can also help you estimate how much memory and/or disk you will need for a job using your full data (see the example after this list).
  • Use interactive jobs to test commands that end up in your "executable" script. You can use your normal submit file for an interactive job - just submit it using the -i flag with condor_submit:
     $ condor_submit -i submit.file
    This should open a bash session on an execute machine, which will allow you to test your commands interactively.
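
After a test job finishes, one way to check how much memory and disk it actually used is to query its history record, as sketched below. MemoryUsage and DiskUsage are the job ClassAd attributes for measured usage (reported in MB and KB, respectively), and <cluster>.<process> is a placeholder for your job's ID; the resource usage summary that HTCondor writes at the end of the job's log file reports similar numbers.

$ condor_history <cluster>.<process> -af MemoryUsage DiskUsage RequestMemory RequestDisk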

Consult with Facilitators

If you are unsure how to run high-memory jobs on CHTC, in particular, if you're not sure if everything in this guide applies to you, get in touch with a research computing facilitator by emailing chtc@cs.wisc.edu. We are here to help you get your work done as efficiently as possible and can advise when your process might be slightly different than what we've outlined here.