Question

I want to get myscript.R to run on a cluster slave node using a job scheduler (specifically, PBS).

Currently, I submit an R script to a slave node using the following command:

qsub -S /bin/bash -p -1 -cwd -pe mpich 1 -j y -o output.log ./myscript.R

Are there functions in R that would allow me to run myscript.R on the head node and send individual tasks to the slave nodes? Something like:

foreach(i = c('file1.csv', 'file2.csv'), pbsoptions = list()) %do% read.csv(i)

Update: an alternative to the qsub command above, as pointed out by @Josh, is to remove #!/usr/bin/Rscript from the first line of myscript.R and call it directly:

qsub -S /usr/bin/Rscript -p -1 -cwd -pe mpich 1 -j y -o output.log myscript.R

Solution

If you want to submit jobs from within an R script, I suggest that you look at the "BatchJobs" package. Here is a quote from the DESCRIPTION file:

Provides Map, Reduce and Filter variants to generate jobs on batch computing systems like PBS/Torque, LSF, SLURM and Sun Grid Engine.

BatchJobs appears to be more sophisticated than previous, similar packages, such as Rsge and Rlsf. There are functions for registering, submitting, and retrieving the results of jobs. Here's a simple example:

library(BatchJobs)
reg <- makeRegistry(id='test')
batchMap(reg, sqrt, x=1:10)
submitJobs(reg)
y <- loadResults(reg)

You need to configure BatchJobs to use your batch queueing system. The submitJobs "resources" argument can be used to request appropriate resources for the jobs.
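For example, on a PBS/Torque system a per-job resource request might look something like the sketch below; the names inside the list must match the placeholders in your cluster's BatchJobs template, so walltime and memory here are only illustrative assumptions.

# Request roughly 1 hour of walltime and 2 GB of memory per job
# (the exact resource names depend on your PBS/Torque template).
submitJobs(reg, resources = list(walltime = 3600, memory = 2048))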

This approach is very useful if your cluster doesn't allow very long-running jobs, or if it severely restricts the number of long-running jobs. BatchJobs allows you to get around those restrictions by breaking your work up into multiple jobs while hiding most of the work associated with doing that manually.
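Applied to the example from the question, a minimal sketch (assuming file1.csv and file2.csv live on a filesystem shared with the compute nodes) could look like this:

library(BatchJobs)
reg <- makeRegistry(id = 'readcsv')
# One job per input file; each read.csv call runs on a compute node
batchMap(reg, read.csv, c('file1.csv', 'file2.csv'))
submitJobs(reg)
waitForJobs(reg)               # block until the scheduler reports the jobs done
csv.list <- loadResults(reg)   # a list of data frames, one per file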

Documentation and examples are available at the project website.

OTHER TIPS

For most of our work, we run multiple R sessions in parallel using qsub instead.

If it is for multiple files, I normally do:

while read infile rest
do
qsub -v infile=$infile call_r.sh 
done < list_of_infiles.txt

call_r.sh:

...
R --vanilla -f analyse_file.R $infile
...

analyse_file.R:

args <- commandArgs()
infile <- args[5]
outfile <- paste(infile, ".out", sep = "")
...

Then I combine all the output afterwards...
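In R, that final combining step could look something like the sketch below; it assumes every job wrote an <infile>.out file containing a header line and a whitespace-delimited table, which is just an assumption about the output format:

# Collect every .out file produced by the jobs and bind them together
outfiles <- list.files(pattern = '\\.out$')
combined <- do.call(rbind, lapply(outfiles, read.table, header = TRUE))
write.table(combined, 'all_results.txt', row.names = FALSE)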

The R package Rsge allows job submission to SGE managed clusters. It basically saves the required environment to disk, builds job submission scripts, executes them via qsub and then collates the results and returns them to you.

Because it basically wraps calls to qsub, it should work with PBS too (although since I don't know PBS, I can't guarantee it). You can alter the qsub command and the options it uses by changing the Rsge-associated global options (prefixed with sge. in the options() output).
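For example, you could inspect those options from a running R session and then override any of them with options(); the option name in the comment below is purely hypothetical, not a documented Rsge setting:

library(Rsge)
# List every Rsge-related option currently set (all names are prefixed with 'sge.')
sge.opts <- options()[grep('^sge\\.', names(options()))]
str(sge.opts)
# Any of them can then be overridden, e.g. (hypothetical option name):
# options(sge.qsub.options = '-q batch -l walltime=01:00:00')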

It is no longer on CRAN, but it is available from GitHub: https://github.com/bodepd/Rsge, although it doesn't look like it's maintained any more.

To use it, use one of the apply-type functions supplied with the package: sge.apply, sge.parRapply, sge.parCapply, sge.parLapply and sge.parSapply, which are parallel equivalents to apply, rapply, rapply(t(x),…), lapply and sapply respectively. In addition to the standard parameters passed to the non-parallel functions, a couple of other parameters are needed:

njobs:             Number of parallel jobs to use

global.savelist:   Character vector giving the names of variables
                   from  the global environment that should be imported.

function.savelist: Character vector giving the variables to save from
                   the local environment.

packages:          List of library packages to be loaded by each worker process
                   before computation is started.

The two savelist parameters and the packages parameter basically specify what variables, functions and packages should be loaded into the new instances of R running on the cluster machines before your code is executed. The different components of X (either list items or data.frame rows/columns) are divided between njobs different jobs and submitted as a job array to SGE. Each node starts an instance of R, loads the specified variables, functions and packages, executes the code, and saves the results to a temporary file. The controlling R instance checks when the jobs are complete, loads the data from the temporary files and joins the results back together to get the final result.

For example, computing a statistic on a random sample of a gene list:

library(Rsge)
library(some.bioc.library)

gene.list <- read.delim("gene.list.tsv")

compute.sample <- function(gene.list) {
   # Draw a random sample of 1000 genes and compute the statistic on it
   gene.list.sample <- sample(gene.list, 1000)
   statistic <- some.slow.bioc.function(gene.list.sample)
   return(statistic)
}

results <- sge.parSapply(1:10000, function(x) compute.sample(gene.list),
                         njobs = 100,
                         global.savelist = c("gene.list"),
                         function.savelist = c("compute.sample"),
                         packages = c("some.bioc.library"))
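The same pattern could in principle cover the CSV example from the original question; this is only a sketch and assumes sge.parLapply takes its X and FUN arguments in the same order as lapply, with the extra Rsge parameters described above:

# Read the two CSV files in two separate jobs (the files must be on shared storage)
results <- sge.parLapply(c('file1.csv', 'file2.csv'), read.csv, njobs = 2)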

If you would like to send tasks to the slave nodes as you go along in a script running on the head node, I believe your options are the following:

  1. Pre-allocate all slave nodes and keep them on standby when they are not needed (as I suggested in my original answer below).
  2. Launch new jobs when the slave nodes are needed and have them save their results to disk. Put the main process on hold until the slaves have completed their tasks and then assemble their output files.

Option 2 is definitely possible but will take a lot longer to implement (I've actually done it myself several times). @pallevillesen's answer is pretty much spot on.
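A bare-bones sketch of option 2, run from an R session on the head node and reusing the call_r.sh and .out naming convention from the answer above (the file names and the one-minute polling interval are illustrative):

infiles  <- c('file1.csv', 'file2.csv')
outfiles <- paste(infiles, '.out', sep = '')

# Submit one job per input file from within the head-node R session
for (f in infiles) {
  system(paste('qsub -v infile=', f, ' call_r.sh', sep = ''))
}

# Put the main process on hold until every expected output file appears
while (!all(file.exists(outfiles))) {
  Sys.sleep(60)
}

# Assemble the slaves' output files
results <- lapply(outfiles, read.table, header = TRUE)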

Original answer, based on a misinterpretation of the question

I have never worked with PBS myself, but it appears that you can use it to submit MPI jobs. You might need to load an MPI module before executing the R script, by sending a shell script along these lines to qsub:

#!/bin/bash
#PBS -N my_job
#PBS -l cput=10:00:00,ncpus=4,mem=2gb

module load openmpi
module load R
R -f myscript.R

You should then be able to use doSNOW to execute your foreach loop in parallel:

n.slaves <- 4

library(doSNOW)
cl <- makeMPIcluster(n.slaves)
registerDoSNOW(cl)

foreach(i = c('file1.csv', 'file2.csv')) %dopar% read.csv(i)
stopCluster(cl)  # shut the MPI workers down when finished
Licensed under: CC-BY-SA with attribution