Statistics with R-Project, Grid Engine, and Open MPI (2012-07-13)
This article demonstrates how to use the R statistics environment with Grid Engine.
Since I recently generated some plots using R, I took a closer look at what R supports in terms of cluster and parallel computing, especially what can be done with Grid Engine.
There is a really good technical paper about parallel computing packages for R from the LMU (State-of-the-art in Parallel Computing with R).
So let's install the Rmpi (for connectivity with OpenMPI), snow (Simple Network of Workstations), and Rsge packages. I assume you have a running (Univa) Grid Engine cluster. If you don't have one and your cluster is small, you can simply download a free version limited to 48 cores from Univa (www.univa.com). With the GUI installer your cluster will be set up in just a few minutes. I used a Univa Grid Engine 8.1 pre-release for this.
To also exploit MPI capabilities in R you first need an MPI installation. Just download OpenMPI from www.openmpi.org (I took version 1.6). Assuming you have a compiler (gcc/g++/...), installing OpenMPI is pretty simple. Untar the package and run configure. For built-in Grid Engine support you must add --with-sge, and in order to work with Rmpi you need to pass --enable-shared and --enable-static. The prefix is the directory where OpenMPI is going to be installed.
./configure --prefix=/usr/local --enable-shared --enable-static --with-sge
Build and install.
make all install
Now install R (the R-base package or similar) from the repository of your Linux distribution. Once it is installed, set LD_LIBRARY_PATH to /usr/local/lib, otherwise the MPI libraries are not found.
export LD_LIBRARY_PATH=/usr/local/lib
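The export above replaces whatever LD_LIBRARY_PATH contained before. A minimal sketch of a more careful variant follows, assuming the /usr/local prefix used in the configure call. The function name prepend_lib_path is made up for this example. The ompi_info check at the end only runs if the tool is on your PATH; it should list a gridengine component when OpenMPI was built with --with-sge.

```shell
# Hypothetical helper: prepend a directory to LD_LIBRARY_PATH
# without discarding entries that are already set.
prepend_lib_path() {
    dir="$1"
    case ":${LD_LIBRARY_PATH:-}:" in
        *":$dir:"*)
            ;;  # already present, nothing to do
        *)
            LD_LIBRARY_PATH="$dir${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
            ;;
    esac
    export LD_LIBRARY_PATH
}

# /usr/local/lib matches --prefix=/usr/local from the configure step.
prepend_lib_path /usr/local/lib

# Optional check for Grid Engine support in the OpenMPI build.
if command -v ompi_info >/dev/null 2>&1; then
    ompi_info | grep -i gridengine || echo "no gridengine component found"
else
    echo "ompi_info not on PATH, skipping check"
fi
```

Put the function into your shell profile if you want the setting to survive a new login.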
Call R on the command line and install the Rmpi package.
(within R)
> install.packages("Rmpi")
To load Rmpi when it is not already loaded (for example after an R restart), type:
> if (!is.loaded("mpi_initialize")){
library("Rmpi")
}
Spawn the slaves:
> mpi.spawn.Rslaves()
And run the common hello world of Rmpi:
> mpi.remote.exec(paste("I am",mpi.comm.rank(),"of",mpi.comm.size()))
That's working! You can now make use of the Rmpi package.
To install the snow package I had to download an older version because my R is a little outdated.
http://cran.r-project.org/src/contrib/Archive/snow/
Download the package:
wget http://cran.r-project.org/src/contrib/Archive/snow/snow_0.3-3.tar.gz
Install it on the command line:
R CMD INSTALL snow_0.3-3.tar.gz
Now it is time to install the Rsge package.
Be sure that you can run qsub before starting R. If that is not possible, source your $SGE_ROOT/default/common/settings.sh file. Start R and load the Rmpi and snow libraries first (library("Rmpi") and library("snow"), see above). Now install the Rsge package:
> install.packages("Rsge")
(Load the library with library("Rsge") when necessary.)
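The qsub check mentioned above can be done in the shell before R is started. This is a rough sketch, not part of Rsge: the function name ensure_qsub is made up, and the cell name "default" is taken from the settings.sh path in the text (SGE_CELL overrides it if set).

```shell
# Hypothetical helper: make sure qsub is reachable before starting R.
ensure_qsub() {
    if command -v qsub >/dev/null 2>&1; then
        return 0
    fi
    # Fall back to sourcing the Grid Engine settings file.
    settings="${SGE_ROOT:-}/${SGE_CELL:-default}/common/settings.sh"
    if [ -r "$settings" ]; then
        . "$settings"
    fi
    command -v qsub >/dev/null 2>&1
}

if ensure_qsub; then
    echo "qsub found: $(command -v qsub)"
else
    echo "qsub not found; check SGE_ROOT and settings.sh" >&2
fi
```

Run it in the same shell you start R from, because the environment set by settings.sh is inherited by the R session.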
After this is done on each compute host in your cluster, you can submit Grid Engine jobs from within R. The following example generates the numbers from 0.1 to 2.5 (c(1:25)/10), applies the exp function to each of them, and returns the results as a vector, as parSapply does. This is done by sending the task as a job to Grid Engine.
> sge.parSapply(c(1:25)/10, function(x) exp(x))
Completed storing environment to disk
Submitting 1 jobs...
All jobs completed
[1] 1.105171 1.221403 1.349859 1.491825 1.648721 1.822119 2.013753
[8] 2.225541 2.459603 2.718282 3.004166 3.320117 3.669297 4.055200
[15] 4.481689 4.953032 5.473947 6.049647 6.685894 7.389056 8.166170
[22] 9.025013 9.974182 11.023176 12.182494
If you want to execute the same as above but split into 10 different Grid Engine jobs, try the following:
> sge.parSapply(c(1:25)/10, function(x) exp(x), njobs=10)
Completed storing environment to disk
Submitting 10 jobs...
All jobs completed
[1] 1.105171 1.221403 1.349859 1.491825 1.648721 1.822119 2.013753
[8] 2.225541 2.459603 2.718282 3.004166 3.320117 3.669297 4.055200
[15] 4.481689 4.953032 5.473947 6.049647 6.685894 7.389056 8.166170
[22] 9.025013 9.974182 11.023176 12.182494
Now you have an MPI and Univa Grid Engine enabled R environment!
When you have a Grid Engine installation where not all hosts have R installed, your jobs might not execute successfully. In that case you can change the internal submission parameters of Rsge through its options.
> getOption("sge.qsub.options")
[1] "-cwd"
You can see that the internal "qsub" call uses just -cwd as its command line parameter. You can add a specific host or any other submission parameter that the "qsub" of your Univa Grid Engine installation supports.
To force all jobs to run on host "u1010", set sge.qsub.options in R as follows:
> options(sge.qsub.options="-cwd -l h=u1010")
Of course you can also request a specific queue by adding -q <yourqueue>, or request a specific core binding, for example with -binding linear:1. Changing the submission parameters lets you direct your R computations to a subset of your cluster with powerful compute hosts, while the interactive R session itself is started with "qrsh" on highly responsive interactive nodes.
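Before putting an option string into sge.qsub.options, it can help to try it against qsub directly. The sketch below assumes the host name u1010 from the example above. The helper name build_qsub_opts is made up for illustration. The validation step uses qsub's verify mode (-w v), which reports whether the request could be scheduled without submitting a job, and it only runs if qsub is available.

```shell
# Hypothetical helper: assemble a qsub option string.
# $1 = host name (may be empty), $2 = queue name (may be empty)
build_qsub_opts() {
    opts="-cwd"
    if [ -n "$1" ]; then
        opts="$opts -l h=$1"
    fi
    if [ -n "$2" ]; then
        opts="$opts -q $2"
    fi
    echo "$opts"
}

opts=$(build_qsub_opts u1010 "")

# Print the line to paste into the R session.
echo "options(sge.qsub.options=\"$opts\")"

# Validate the request without submitting anything (only if qsub exists).
if command -v qsub >/dev/null 2>&1; then
    # $opts is intentionally unquoted so it splits into separate arguments.
    qsub -w v $opts -b y hostname || echo "request did not validate"
fi
```

If the validation fails, the options would also fail for the jobs that Rsge submits, so it is cheaper to find that out here.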