6  Array Submission of Multiple Independent R Jobs

Consider the need to generate random draws from a normal distribution across varying sample sizes and means.

\[N \in \{250,\ 500,\ 750\}, \qquad \mu \in \{0,\ 1.5\}\]

6.1 Sample Job Script

sim_job.R

# Collect the command line arguments that follow the script name.
args = commandArgs(trailingOnly = TRUE)
# args[1] holds the literal "--args" flag passed on the command line,
# so the simulation parameters start at args[2].

# Extract the parameters and cast from character to numeric
rnorm(n = as.numeric(args[2]), mean = as.numeric(args[3]))
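The indexing can be previewed in the shell: when `--args` occupies the first position, the numeric parameters land in positions 2 and 3, which is why the R script reads args[2] and args[3]. A minimal sketch:

```shell
#!/bin/sh
# Simulate the argument vector the R script receives:
# "--args" first, then the two simulation parameters.
set -- --args 250 1.5

# Positions 2 and 3 hold N and mu, mirroring args[2] / args[3] in R.
echo "N=$2 mu=$3"
```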

Download a copy onto the cluster with:

wget https://hpc.thecoatlessprofessor.com/slurm/scripts/sim_job.R

chmod +x sim_job.R

6.2 Sample Parameter Inputs

inputs.txt

250 0
500 0
750 0
250 1.5
500 1.5
750 1.5
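Each line of inputs.txt becomes one array task, so the file's line count must match the --array range in the launch script (1-6 here). A quick sanity check, recreating the file above for illustration:

```shell
#!/bin/sh
# Recreate inputs.txt with the six parameter rows shown above.
printf '%s\n' "250 0" "500 0" "750 0" "250 1.5" "500 1.5" "750 1.5" > inputs.txt

# The number of parameter lines should equal the array size (6).
wc -l < inputs.txt
```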

Download a copy onto the cluster with:

# Download a pre-made inputs.txt onto the cluster
wget https://hpc.thecoatlessprofessor.com/slurm/scripts/inputs.txt

Note: Parameters are best generated using expand.grid().

N_vals = c(250, 500, 750)
mu_vals = c(0, 1.5)

sim_frame = expand.grid(N = N_vals, mu = mu_vals)
sim_frame
#     N  mu
# 1 250 0.0
# 2 500 0.0
# 3 750 0.0
# 4 250 1.5
# 5 500 1.5
# 6 750 1.5

Write the simulation parameter configuration to inputs.txt with:

write.table(sim_frame, file = "inputs.txt", 
            col.names = FALSE, row.names = FALSE)
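The same grid can be written without R, e.g. with a pair of shell loops; this sketch varies N fastest to match the ordering produced by expand.grid():

```shell
#!/bin/sh
# Build inputs.txt: the outer loop walks the means, the inner loop the
# sample sizes, so N varies fastest (matching expand.grid's ordering).
: > inputs.txt
for mu in 0 1.5; do
  for N in 250 500 750; do
    printf '%s %s\n' "$N" "$mu" >> inputs.txt
  done
done
```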

6.3 Array Job Launch

sim_array_launch.slurm

#!/bin/bash

## Describe requirements for computing ----

## Name the job to ID it in squeue -u $USER
#SBATCH --job-name=myjobarray

## Send email on any change in job status (NONE, BEGIN, END, FAIL, ALL)
## Note: To be notified for each task in the array, use: ALL,ARRAY_TASKS
#SBATCH --mail-type=ALL

## Email address of where the notification should be sent.
#SBATCH --mail-user=netid@illinois.edu

## Amount of time the job should run
## Note: specified in hour:min:sec, e.g. 01:30:00 is a 1 hour and 30 min job.
#SBATCH --time=00:10:00
## Request a single task (each array element runs as one task)
#SBATCH --ntasks=1
## Specify number of CPU cores for parallel jobs
## Note: Leave at 1 if not running in parallel.
#SBATCH --cpus-per-task=1
## Request a maximum amount of RAM per CPU core
## Note: For memory intensive work, set to a higher amount of ram.
#SBATCH --mem-per-cpu=5gb

## Standard output and error log
#SBATCH --output=myjobarray_%A-%a.out
# Array range
#SBATCH --array=1-6

## Setup computing environment for job ----

## Create a directory for the data output based on the SLURM_ARRAY_JOB_ID
mkdir -p ${SLURM_SUBMIT_DIR}/${SLURM_ARRAY_JOB_ID}

## Switch directory into job ID (puts all output here)
cd ${SLURM_SUBMIT_DIR}/${SLURM_ARRAY_JOB_ID}

## Run simulation ----

## Load a pre-set version of R
module load R/3.6.2

## Grab the line of the input file matching this task's array index
## and store it in a shell variable named "PARAMS"
PARAMS=$(sed -n "${SLURM_ARRAY_TASK_ID}p" "${HOME}/inputs.txt")

## Run R script non-interactively (no .Rout file, unlike R CMD BATCH);
## console output is captured in this task's Slurm .out file
Rscript $HOME/sim_job.R --args $PARAMS

Download a copy and run it on the cluster with:

# Download script file
wget https://hpc.thecoatlessprofessor.com/slurm/scripts/sim_array_launch.slurm

# Queue the job on the Cluster
sbatch sim_array_launch.slurm
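The per-task parameter lookup in the launch script can be exercised locally without Slurm by setting SLURM_ARRAY_TASK_ID by hand (a hypothetical value below); `sed -n "${SLURM_ARRAY_TASK_ID}p"` prints only that line of the file:

```shell
#!/bin/sh
# Recreate inputs.txt and pretend Slurm assigned this task index 4.
printf '%s\n' "250 0" "500 0" "750 0" "250 1.5" "500 1.5" "750 1.5" > inputs.txt
SLURM_ARRAY_TASK_ID=4

# Same extraction as the launch script: print only line 4.
PARAMS=$(sed -n "${SLURM_ARRAY_TASK_ID}p" inputs.txt)
echo "$PARAMS"
```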

Note: In the output filename, %A is replaced by the value of the SLURM_ARRAY_JOB_ID environment variable and %a by the value of SLURM_ARRAY_TASK_ID. SLURM_ARRAY_JOB_ID is the number assigned to the job in the queue, while SLURM_ARRAY_TASK_ID identifies the individual task within the array; in this example, it takes on values from 1 to 6.
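For instance, if Slurm assigned the array the (made-up) job ID 12345, task 3 would write its log under the pattern myjobarray_%A-%a.out as follows:

```shell
#!/bin/sh
# Hypothetical IDs: Slurm substitutes %A and %a with these values.
SLURM_ARRAY_JOB_ID=12345
SLURM_ARRAY_TASK_ID=3

# The pattern myjobarray_%A-%a.out therefore resolves to:
echo "myjobarray_${SLURM_ARRAY_JOB_ID}-${SLURM_ARRAY_TASK_ID}.out"
```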