Question

My university runs a condor computing grid (compute nodes are running Linux), and I'd like to use it for running simulations in R. The problem is that only some of the machines on the grid have R installed. So far I see two options, but I don't know how to implement either one, so I hope you'll help me (keeping in mind that I'm not a sysadmin and can't do much to change the setup of the compute nodes):

1) Put a check in the ClassAds that go out with my condor submit file to require that jobs run only on nodes that have /usr/bin/R (a sketch of what I imagine this looks like follows the list).

2) Package R and all of its dependencies into a self-contained directory that can be sent out to the compute nodes and against which my simulation can be run. I've tried for several hours to do this, but the Linux version of R (unlike the OSX and Windows versions) seems to run against libraries that are distributed across the filesystem, and I can't think of a practical way to gather them all into a location where R can find them.
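
For (1), from what I can tell, the cleanest route would be for a grid admin to advertise a custom machine attribute on the R-equipped nodes. The sketch below is hypothetical: HAS_R is a name I made up, and it would have to be published via the startd configuration, which I can't do myself.

# On each R-equipped node, an admin would add to the condor config:
#   HAS_R = True
#   STARTD_ATTRS = $(STARTD_ATTRS), HAS_R
# My submit file could then require it:
requirements = (Target.OpSys == "LINUX") && (HAS_R =?= True)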

Any ideas? Thanks in advance.


Solution

What eventually worked for me was proposed solution (1). Here's how I implemented it in my worker shell script and my condor submit file.

Here's the shell script. The important change was to test whether R is installed on the compute node via if [ -f /usr/bin/R ]. If R is found, the script runs the simulation and finishes with an exit status of 0; if R is not found, it exits immediately with status 1 (that's the meaning of the exit 0 and exit 1 lines).

#!/bin/bash
mkdir -p output
if [ -f /usr/bin/R ]
then
    # Run the simulation script that matches this node's architecture.
    if uname -m | grep -q '64'
    then
        Rscript code/simulations-x86_64.r "$@"
    else
        Rscript code/simulations-i386.r "$@"
    fi

    # Bundle the CSV results so condor transfers them back in one file.
    tar -zcvf "output/output-$1-$2.tgz" output/*.csv
    exit 0
else
    # No R on this node: exit non-zero so the job gets re-queued.
    exit 1
fi

Now the condor submit file. The crucial change was the second-to-last line (on_exit_remove = (ExitBySignal == False) && (ExitCode == 0)). Condor evaluates this expression when each job exits: if the exit code is not zero (i.e. if R wasn't found on the compute node), the expression is False and the job is put back into the queue to be re-run on another node. Otherwise, the job is considered finished and is removed from the queue.

universe = vanilla
log = logs/log_$(Cluster)_$(Process).log
error = logs/err_$(Cluster)_$(Process).err
output = logs/out_$(Cluster)_$(Process).out
executable = condor/worker.sh
arguments = $(Cluster) $(Process)
requirements = (Target.OpSys=="LINUX" && regexp("stat", Machine))
should_transfer_files = YES
when_to_transfer_output = ON_EXIT_OR_EVICT
transfer_input_files = code, R-libs, condor, seeds.csv
transfer_output_files = output
notification = Never
on_exit_remove = (ExitBySignal == False) && (ExitCode == 0)
queue 1800
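
With the worker script and submit file in place, submission is the usual routine (the file name condor/sim.sub is hypothetical), and condor_q shows failed jobs cycling back to idle until they land on a node with R:

condor_submit condor/sim.sub
condor_q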

OTHER TIPS

Wow, OK, this was harder than I thought. Let's start with proposed solution (2):

At hadley's suggestion, I used Renv to install R to a known local directory (also using R-build to build R-2.15.2). Unfortunately, this local installation still relied on the system-wide libraries from locations like /usr/lib.
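
For anyone retracing this step, here is a minimal hand-rolled sketch of what Renv/R-build automates (I'm omitting the exact Renv commands; the prefix is illustrative), plus the ldd check that reveals the dependence on system-wide libraries:

# Build R 2.15.2 from source into a private prefix (mirrors what
# Renv/R-build did for me; paths are illustrative).
wget https://cran.r-project.org/src/base/R-2/R-2.15.2.tar.gz
tar -xzf R-2.15.2.tar.gz
cd R-2.15.2
./configure --prefix="$HOME/R-local"
make && make install
# The interpreter still resolves shared libraries from /usr/lib
# (the R_HOME subdirectory may be lib/ or lib64/ depending on platform):
ldd "$HOME/R-local/lib/R/bin/exec/R"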

MvG suggested pulling the local R installation out of sage. Sage comes packaged with local copies of all the necessary system libraries, and that method would probably work for most people in my situation. However, the R bundled with sage was older than I needed: my code relies on a few R packages that are compatible only with R >= 2.15.

So I took all of the libraries from sage's lib directory and copied them into the R-2.15.2 install from Renv. This would have worked, but some machines on my university's condor grid must have an odd architecture, because about 1 in 10 jobs came back with errors about using the wrong version of libc.so. At that point I abandoned proposed solution (2) and moved on to proposed solution (1).
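
For completeness, this is roughly how the copied libraries were wired up at run time. The wrapper below is a reconstruction, not my exact script (though R-libs matches the transfer_input_files line in the submit file above):

# Point the dynamic linker at the bundled library copies before
# launching the private R build (reconstruction of attempt (2)).
export LD_LIBRARY_PATH="$PWD/R-libs:$LD_LIBRARY_PATH"
./R-2.15.2/bin/Rscript code/simulations-x86_64.r "$@"
# Fragile: libc.so is tightly coupled to the host's dynamic loader
# (ld-linux), so nodes with a mismatched loader fail with
# wrong-version libc errors.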
