Will RStan run on a supercomputer?

https://stackoverflow.com/questions/12848168

r
stan

06-07-2021

Question

Stan is new Bayesian analysis software by Gelman et al.

RStan is, I am guessing, a way to call Stan from within R.

Will Stan / RStan run on a supercomputer with a Linux operating system, and if so, can it take advantage of the supercomputer's multiple processors? I have been told that WinBUGS will not run on a Linux machine and/or cannot take advantage of a supercomputer's multiple processors.

I am looking for a way to speed up Bayesian analyses - from weeks to days / hours.

Solution

Stan and rstan should run on any Linux, Mac, or Windows system that supports the dependencies. We have not tested on BSD or Oracle, but we expect them to work with either the g++ or clang compilers (although not the Oracle compilers).

There is no explicitly parallel code in Stan or rstan, but neither is there any code that prevents the binary from being executed by several processes simultaneously. For example, if you use Stan from the command line in a bash shell, you could do something like

./my_model --data=my_data.dump --seed=12345 --chain_id=1 --samples=samples_1.csv &
./my_model --data=my_data.dump --seed=12345 --chain_id=2 --samples=samples_2.csv &

and so forth for as many chains as you like. It is important to use the same seed but different chain_id when executing in parallel.
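
The two commands above generalize to a loop. The following is a minimal sketch, assuming a POSIX-compatible shell and a compiled model that accepts the same flags as the commands above (other Stan versions may use a different command-line syntax). The function name run_chains and its argument order are illustrative and not part of Stan.

```shell
# Hypothetical helper: run_chains MODEL DATA SEED N_CHAINS
# Starts N_CHAINS copies of the compiled model in the background, each with
# the same seed but its own chain_id and output file, then waits for all.
run_chains() {
    model=$1; data=$2; seed=$3; n_chains=$4
    pids=""
    chain=1
    while [ "$chain" -le "$n_chains" ]; do
        # Same seed for every chain, different chain_id, as advised above.
        "$model" --data="$data" --seed="$seed" --chain_id="$chain" \
                 --samples="samples_${chain}.csv" &
        pids="$pids $!"
        chain=$((chain + 1))
    done
    # Wait for every chain and remember whether any of them failed.
    rc=0
    for pid in $pids; do
        wait "$pid" || rc=1
    done
    return "$rc"
}
```

With this defined, `run_chains ./my_model my_data.dump 12345 4` would write samples_1.csv through samples_4.csv and return a nonzero status if any chain failed.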

If you are using the rstan package, you can call the main stan() function using any of the parallel engines supported by R and your operating system. Again, it is best to pass the same seed and different chain_id. As of rstan v1.0.3 (not released yet), there is a function called sflist2stanfit() that takes a list of stanfit objects that may have been generated in parallel and combines them into a single stanfit object for analysis.

For more information, there is a thread devoted to parallel execution at

https://groups.google.com/d/topic/stan-users/3goteHAsJGs/discussion

Other tips

I wrote that I would post what I learned.

The university Supercomputing Center believes that RStan will run on their machines. However, I must apply for an account, which might take some time, so it will be a while before I know for certain that RStan runs on those machines. For what it is worth, the formal name of their facility is the 'Arctic Region Supercomputing Center'.

I had trouble installing RStan on my desktop and had to get OIT assistance. So, here are the steps I used and the code used by the OIT gentleman. I have a Windows 7 Professional operating system.

1. I had to use R 2.15.1.
2. I installed R in the directory 'C:\R\R-2.15.1' so there would be no spaces in the directory name.
3. I had to install Rtools.
4. I installed Rtools in the directory 'C:\Rtools'.
5. Make sure that Rtools appears in the path so that R can locate the C++ compiler in Rtools. To check, go to: Computer, Properties, Advanced System Settings, Environment Variables, Path. I think the path should include both 'c:\Rtools\bin' and 'c:\Rtools\gcc-4.6.3\bin'.
6. Open R.
7. Here is the R code to type (this code appears here: http://code.google.com/p/stan/wiki/RStanGettingStarted):

install.packages('inline')
install.packages('Rcpp')
install.packages('RcppEigen')
options(repos = c(getOption("repos"), rstan = "http://wiki.stan.googlecode.com/git/R"))
install.packages('rstan', type = 'source')
library(rstan)

Then I ran the eight schools example from here:

http://code.google.com/p/stan/wiki/RStanGettingStarted

Last week I had been trying to install Stan using the instructions in the PDF file 'stan-reference-1.0.2' instead of the instructions at the above link.

I hope this helps others. If and when I learn whether RStan definitely will run on the Supercomputing Center machines I will post here what I learn.

I have not uninstalled Stan to test the above procedure. Hopefully I did not make any errors in the above steps.

Here's a concrete parallelization function that takes source code as text:

library(rstan)
library(parallel)

# Compile the model once, then run `chains` chains in parallel and merge them.
# Note: mclapply() relies on forking, so mc.cores > 1 is not supported on Windows.
parallel_stan <- function(code, data, cores = detectCores(), chains = 8, iter = 2000, seed = 1234) {
    cat("parallel_stan: cores=", cores, ", chains=", chains, ", iter=", iter, ", seed=", seed, "\n", sep = "")
    cat("--- Step 1: compile the model (and run it once, very briefly, ignoring its output)\n")
    f1 <- stan(model_code = code, data = data, iter = 1, seed = seed, chains = 1, chain_id = 1)
    cat("--- Step 2: run more chains in parallel\n")
    # Passing the same seed to all chains follows example(sflist2stanfit):
    # use the same seed but a different chain_id for each parallel chain.
    sflist <- mclapply(
        seq_len(chains),
        function(i) stan(fit = f1, data = data, iter = iter, seed = seed, chains = 1, chain_id = i),
        mc.cores = cores
    )
    cat("--- Finished.\n")
    return(sflist2stanfit(sflist))
}

The first hit on an RSeek search (for: Rstan gelman) yielded this after following one link:

https://github.com/stan-dev/rstan/wiki/RStan-Getting-Started

It's not yet on CRAN.

This is a general comment about Bayesian MCMC calculations.

Supercomputers typically run server-class rather than desktop-class processors. Stan and other MCMC programs are almost always strictly serial on a per-chain basis, i.e. you will rarely get any speedup for a single chain from having more than one processor. We have a small cluster with a dual-Xeon-class server and a couple of regular desktop machines as workstations. The Core i7 processors in the workstations are typically about 40% faster than the server for real-world calculations, as long as you stay within their 16GB RAM limit.

The fastest machines for doing these sorts of calculations will likely be over-clocked custom game machines with a water-cooled CPU.

That said, you can of course run different chains in parallel, as has been pointed out above.
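
Because each chain is serial, the useful number of simultaneous chains is bounded by the number of cores available. As a rough sketch, assuming a POSIX-compatible shell on Linux or macOS, a helper like the following could pick that number. The name chains_to_run is hypothetical, and on a shared cluster the cores granted by the job scheduler may differ from what the machine reports.

```shell
# Hypothetical helper: chains_to_run REQUESTED
# Prints how many chains to run at once: the requested number, capped at the
# number of online cores, since running more serial chains than cores only
# makes them compete for CPU time.
chains_to_run() {
    requested=$1
    # getconf _NPROCESSORS_ONLN reports online cores on Linux and macOS;
    # fall back to 1 if it is unavailable.
    cores=$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1)
    if [ "$requested" -lt "$cores" ]; then
        echo "$requested"
    else
        echo "$cores"
    fi
}
```

For example, `chains_to_run 8` prints 8 on a machine with at least eight cores and the core count otherwise.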

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow