Running R Jobs on CHTC

To best understand the below information, users should already have an understanding of:

Overview

Many CHTC users have R programs that require R versions or specialized packages that are not installed on CHTC's high throughput system, which includes the CHTC Pool, the UW Grid (flocking), and the Open Science Grid (GlideIn). To run such jobs, you can build your own version of R with the packages you need and use it within your jobs.

This guide details the steps needed to:

  1. Build an R installation for use in your jobs
  2. Write a script that unpacks your R installation and runs your R code
  3. Submit jobs

1. Building an R Installation

To run R jobs, you will first need to build a portable R installation that will later go along with each of your R jobs.

A. Get the Source Code for the R Version You Want

Before running any commands on CHTC, use a browser to get the source code for your desired version of R from CRAN. (See the note below on supported versions.) Under "Source Code for all Platforms", find the R-#.#.#.tar.gz file for your desired version of R and download it to your computer. Then copy it to the submit server.

Version Notes

To use R version 3.3.0 or higher, you will need to compile and run your jobs on CHTC's new CentOS 7 servers. There is information in both our compiling guide and on this page about accessing CentOS 7 to compile and run jobs.

If you can use an earlier version of R (3.2.# or earlier), you will be able to compile on our Scientific Linux 6 build server and run on both Scientific Linux 6 and CentOS 7, accessing the most capacity.
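
The version rule above can be written as a small shell check. This is only a sketch: the function name r_build_platform is hypothetical, and it assumes GNU sort, whose -V option compares version strings field by field.

```shell
#!/bin/bash
# Print the build platform that the version notes above suggest for a
# given R version. The function name is made up for this example.
r_build_platform() {
    local version="$1" lowest
    # sort -V orders version strings numerically, so 3.10.0 sorts after 3.3.0
    lowest=$(printf '%s\n' "$version" "3.3.0" | sort -V | head -n 1)
    if [ "$lowest" = "3.3.0" ]; then
        # the requested version is 3.3.0 or higher
        echo "CentOS 7"
    else
        echo "Scientific Linux 6"
    fi
}

# Example: r_build_platform 3.2.5
```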

B. Create a Portable R Installation in an Interactive Job

Before getting started, make sure you know which major packages your code uses (anything loaded with the library() command).

Because creating an R installation can be computationally intensive, it should not be performed on the submit server. Instead, you will create your installation on a CHTC build server by using an interactive job. An interactive job is essentially a job without an executable: you run the commands yourself (in this case, to install R). As with a regular HTCondor job, once you finish your R installation on the build server, the output files (your completed portable R installation) will be transferred back to the submit server so that you can use the installation in later jobs.

  1. Submit an Interactive Build Job

    Instructions for submitting an interactive build job are here: http://chtc.cs.wisc.edu/inter-submit.shtml
    Follow Step 2 of that guide, replacing the contents of the "transfer_input_files" line with the name of the R source tarball that you downloaded.

    Submit the interactive job and wait for it to start (this is Step 3 of the guide above).
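
If you prefer not to edit the submit file by hand, the "transfer_input_files" change can be scripted. The sketch below assumes the submit file from the linked guide already contains a transfer_input_files line; the function name set_input_tarball and the file name build.sub are hypothetical. The linked guide remains the authority on how to submit the job; condor_submit -i is HTCondor's general option for interactive jobs.

```shell
#!/bin/bash
# Point the transfer_input_files line of a submit file at the R source tarball.
# Usage: set_input_tarball <submit-file> <tarball-name>
set_input_tarball() {
    local submit_file="$1" tarball="$2"
    # Replace the whole existing line; every other line is left untouched.
    sed "s|^transfer_input_files.*|transfer_input_files = ${tarball}|" \
        "$submit_file" > "$submit_file.tmp" &&
        mv "$submit_file.tmp" "$submit_file"
}

# Example (hypothetical file names):
#   set_input_tarball build.sub R-3.x.x.tar.gz
#   condor_submit -i build.sub
```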

  2. Install R

    Once the interactive job starts, we can install R. To install R, we will run a configuration script that includes an option to set the installation location. We will set the location to our current directory, and then complete the installation by running make. (In what follows, R-3.x.x should always be replaced by the name/version of the R code that you chose in Part A.)

    Un-tar and move into the untarred R source directory:

    [alice@build]$ tar -xzf R-3.x.x.tar.gz
    [alice@build]$ cd R-3.x.x

    From that directory, run the following commands. The middle one (make) may take a while.

    [alice@build]$ ./configure --prefix=$(pwd)
    [alice@build]$ make
    [alice@build]$ make install
    
    After the last command finishes, move back to the main working directory:
    [alice@build]$ cd ..
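
Before installing packages, it can help to confirm that "make install" actually produced an R launcher. The helper below is a sketch under stated assumptions: the name find_r_home is hypothetical, and the lib/R fallback is a guess for systems that do not use lib64 (this guide only describes lib64).

```shell
#!/bin/bash
# Report where the freshly built R launcher ended up, or fail if it is missing.
# Usage: find_r_home <source-dir>     e.g. find_r_home R-3.x.x
find_r_home() {
    local src_dir="$1" candidate
    # lib64 is what this guide expects; lib is checked as a fallback (assumption).
    for candidate in "$src_dir/lib64/R" "$src_dir/lib/R"; do
        if [ -x "$candidate/bin/R" ]; then
            echo "$candidate"
            return 0
        fi
    done
    echo "No R launcher found under $src_dir; check the make output for errors." >&2
    return 1
}
```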

  3. Install Packages

    The installation steps above should have generated an R installation in the lib64 subdirectory of the installation directory. We can start R by typing the path to that installation, like so:

    [alice@build]$ R-3.x.x/lib64/R/bin/R 

    This should open up an R console, which is how we're going to install any extra R libraries. Install each of the library packages your code needs by using R's install.packages command:

    > install.packages('package_name')

    You only need to install the major packages needed by your code; if you install a package that depends on other packages, those will automatically be installed.

    The first time you run install.packages, you will be prompted to choose a "CRAN mirror", which is the server R downloads packages from. Choose any http (not https!) option.

    Once you've installed all the packages, type quit() to exit the R session. You don't need to save the workspace.
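
As an alternative to the interactive console, packages can be installed in one non-interactive command by calling Rscript with an explicit repos value, which skips the mirror prompt. This is a sketch, not part of the original procedure: the function name install_r_packages is hypothetical, and the mirror URL is only an example of an http mirror.

```shell
#!/bin/bash
# Install one or more packages into a specific R installation without
# opening the R console.
# Usage: install_r_packages <R-home-dir> <package> [<package> ...]
install_r_packages() {
    local r_home="$1" pkgs="" p
    shift
    if [ "$#" -eq 0 ]; then
        echo "install_r_packages: no package names given" >&2
        return 1
    fi
    # Build a quoted, comma-separated list for R's c(...) call.
    for p in "$@"; do
        pkgs="${pkgs:+$pkgs, }\"$p\""
    done
    # Example http mirror (assumption); any http CRAN mirror would do.
    "$r_home/bin/Rscript" -e \
        "install.packages(c($pkgs), repos = \"http://cran.r-project.org\")"
}

# Example (package names are placeholders):
#   install_r_packages R-3.x.x/lib64/R package_one package_two
```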

  4. Edit the R executable

    Once you've added the packages you need, you need to edit the R executable that you used in the previous section. You can do this with a command line text editor - this example uses the nano text editor:

    [alice@build]$ nano R-3.x.x/lib64/R/bin/R
    The above will open up the main R executable. Near the top of the file, find the line that sets R_HOME_DIR and change it from something like:
    R_HOME_DIR=/var/lib/condor/execute/slot1/dir_554715/R-3.1.0/lib64/R
    to
    R_HOME_DIR=$(pwd)/R
    Save and close the file. (In nano, this will be CTRL-O, followed by CTRL-X.)
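
The same edit can be made without a text editor. The sketch below rewrites any line that begins with R_HOME_DIR= and leaves the rest of the file alone; the function name make_r_relocatable is hypothetical, and it assumes the line to change starts at the beginning of a line, as in the example above.

```shell
#!/bin/bash
# Rewrite the R_HOME_DIR line of the R launcher so the path is resolved
# at run time instead of pointing at the build directory.
# Usage: make_r_relocatable <path-to-launcher>
make_r_relocatable() {
    local launcher="$1"
    # Single quotes keep $(pwd) literal: it must be evaluated when a job
    # runs the launcher, not now on the build server.
    sed 's|^R_HOME_DIR=.*|R_HOME_DIR=$(pwd)/R|' "$launcher" > "$launcher.tmp" &&
        # Copy the contents back so the launcher keeps its executable bit.
        cat "$launcher.tmp" > "$launcher" &&
        rm -f "$launcher.tmp"
}

# Example: make_r_relocatable R-3.x.x/lib64/R/bin/R
```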

  5. Exit the Interactive Job

    Right now, if we exit the interactive job, nothing will be transferred back because we haven't created any new files in the working directory, just sub-directories. In order to transfer back our installation, we will need to compress it into a tarball - not only will HTCondor then transfer back the resulting file, it is generally easier to transfer a single, compressed tarball file than an uncompressed set of directories.

    Move the directory with your R installation to the main working directory:

    [alice@build]$ mv R-3.x.x/lib64/R ./

    Run the following command to create your own tarball of the installation:

    [alice@build]$ tar -czvf R.tar.gz R/
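
Since the tarball is the only thing that comes back from the interactive job, it is worth checking its contents before exiting. The helper below is a sketch (the name check_r_tarball is hypothetical); it looks for the R/bin/R entry that the job script will later rely on.

```shell
#!/bin/bash
# Confirm that a tarball contains the R launcher at the path jobs will expect.
# Usage: check_r_tarball R.tar.gz
check_r_tarball() {
    local tarball="$1"
    # tar -tzf lists the archive contents without extracting anything.
    if tar -tzf "$tarball" | grep -qx 'R/bin/R'; then
        echo "$tarball looks complete"
    else
        echo "$tarball has no R/bin/R entry; re-check the mv and tar steps" >&2
        return 1
    fi
}
```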

    The installation is complete! You can now exit the interactive job and your R installation tarball will return to the submit server.

    [alice@build]$ exit 

2. Creating a Script

We now have an R.tar.gz file that contains our entire R installation. In order to use this installation in our HTCondor jobs, we will need to write a script that unpacks our R installation and then runs our R code. We will use this script as the executable of our HTCondor submit file.

A sample script appears below. After the first line, the lines starting with hash marks are comments.

#!/bin/bash

# untar your R installation
tar -xzf R.tar.gz

# make sure the script will use your R installation
export PATH=$(pwd)/R/bin:$PATH

# run R, with the name of your R script
R CMD BATCH myscript.R

If you have additional commands you would like to be run within the job, you can add them to this base script.
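
One possible addition is stricter error handling. The sketch below writes a variant of the sample script to a file; the file name run_R.sh is hypothetical, and "set -e" is an optional extra that is not part of the sample above. It makes the job stop as soon as a step fails, for example if the tarball did not arrive.

```shell
#!/bin/bash
# Write a stricter variant of the sample wrapper script to run_R.sh
# (hypothetical file name) and make it executable.
cat > run_R.sh <<'EOF'
#!/bin/bash
# stop at the first failing command, so a failed untar is not masked
set -e

# untar your R installation
tar -xzf R.tar.gz

# make sure the script will use your R installation
export PATH="$(pwd)/R/bin:$PATH"

# run R, with the name of your R script; R CMD BATCH writes its
# output to myscript.Rout
R CMD BATCH myscript.R
EOF
chmod +x run_R.sh
```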

Arguments in R

To pass arguments to an R script within a job, you'll need to use the following syntax in your main executable script, in place of the generic command above:

R CMD BATCH "--args argname=$1 argname=$2" myscript.R
Here, $1 and $2 are the first and second arguments passed to the bash script from the submit file (see below), which are then sent on to the R script. For more (or fewer) arguments, simply add more (or fewer) argument names and numbers.

In addition, your R script will need to be able to accept arguments from the command line. There is sample code for doing this on this r-bloggers.com page.
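
As a rough illustration of the R side, the sketch below writes a small R script that reads name=value pairs passed after --args. The file name args_demo.R and the variable names are hypothetical; commandArgs(trailingOnly = TRUE) is base R and returns only the arguments that follow --args.

```shell
#!/bin/bash
# Write a small R script (hypothetical name: args_demo.R) that reads the
# values passed after --args on the R CMD BATCH command line.
cat > args_demo.R <<'EOF'
# keep only the arguments that follow --args
args <- commandArgs(trailingOnly = TRUE)

# each argument looks like name=value; split on the first "="
arg_names  <- sub("=.*$", "", args)
arg_values <- sub("^[^=]*=", "", args)
params <- setNames(as.list(arg_values), arg_names)

# values arrive as character strings; convert them where needed,
# for example with as.numeric()
print(params)
EOF
```

Running R CMD BATCH with '--args alpha=1 beta=2' and args_demo.R would then make "alpha" and "beta" available as entries of params, with the printed result written to args_demo.Rout.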

3. Submitting Jobs

The submit file you use for submitting your R jobs will be different from the one you created in Part 1 for building your R installation. You'll want to create a new submit file; a good starting point is the sample submit file on our hello world example page. You should make the following changes in order to run R jobs:
  • Your executable should be the script that you wrote above.
  • Change transfer_input_files to include your R installation tarball (R.tar.gz), your R scripts, and any input files your job needs.
  • How big is your installation tarball?

    If your installation tarball is larger than 100 MB, you should NOT transfer the tarball using transfer_input_files. Instead, you should use CHTC's web proxy, squid. In order to request space on squid, email the research computing facilitators at chtc@cs.wisc.edu.

  • If your script takes arguments (see "Arguments in R" in the previous section), include those in the arguments line:
    arguments = value1 value2
  • Include the requirements line below to request the operating system of the server your interactive job ran on. The example requests Scientific Linux 6; if you built R on CentOS 7, set OpSysMajorVer to 7 instead.
    requirements = (OpSys == "LINUX") && (OpSysMajorVer == 6)
    
  • Test a few jobs for disk space/memory usage in order to make sure your requests for a large batch are accurate! Disk space and memory usage can be found in the log file after jobs complete.
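
Putting the changes above together, a submit file might look like the sketch below, written here with a shell here-document. Everything specific in it is an assumption for illustration: the file names (r_job.sub, run_R.sh, myscript.R), the log and output names, and the resource requests, which are placeholders to be adjusted after testing. The hello world example page remains the reference for the base submit file.

```shell
#!/bin/bash
# Write a sample submit file for one R job. All file names and resource
# values below are placeholders, not CHTC recommendations.
cat > r_job.sub <<'EOF'
universe = vanilla
log = r_job_$(Cluster).log
output = r_job_$(Cluster)_$(Process).out
error = r_job_$(Cluster)_$(Process).err

# the wrapper script from section 2, and the arguments it forwards to R
executable = run_R.sh
arguments = value1 value2

# R installation, R script, and any input files the job needs
should_transfer_files = YES
when_to_transfer_output = ON_EXIT
transfer_input_files = R.tar.gz, myscript.R

# match the operating system of the build server (use 7 for CentOS 7)
requirements = (OpSys == "LINUX") && (OpSysMajorVer == 6)

# placeholder requests; adjust after testing a few jobs
request_cpus = 1
request_memory = 1GB
request_disk = 1GB

queue
EOF
```

The file can then be submitted with condor_submit r_job.sub. If the tarball is larger than 100 MB, R.tar.gz would be left out of transfer_input_files as described above.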