
Virtual Laboratory for e-Science
HOWTO: Adding custom R packages to the PoC Environment

by Jan Just Keijser (janjust@nikhef.nl)

Introduction

The R toolkit is part of the PoC distribution and has been compiled without any external R packages. Some applications might require a custom R package to be loaded. This document serves as a tutorial on how to build and add a custom R package and how to submit a job on the VL-e PoC environment that uses (requires) the package.

For this tutorial the package RMySQL was chosen, but the same approach applies to other R packages. A large repository of R packages is available from the Comprehensive R Archive Network (CRAN).

This HOWTO applies to R 2.4.0, as found in the PoC R2 distribution. See the PoC R1 version of this HOWTO for instructions on how to do this for R 2.2.0.

Step 1: Download the package

The RMySQL package can be downloaded from this webpage.
At the time of writing the latest version was 0.6-0.
A list of archived versions can be found here.

The website indicates that RMySQL depends on DBI, so that package needs to be downloaded as well. This tutorial uses DBI 0.2-3.
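
The download can also be scripted. The sketch below only builds the download URLs; it assumes the usual CRAN archive layout (src/contrib/Archive/<package>/<package>_<version>.tar.gz) and the main CRAN site as mirror. Neither the mirror nor the helper name cran_archive_url comes from this HOWTO, and a particular version may not be present in the archive.

```shell
#!/bin/bash
# ASSUMPTION: mirror URL and archive layout are illustrative, not taken from this HOWTO.
CRAN_MIRROR="${CRAN_MIRROR:-https://cran.r-project.org}"

# Print the archive URL of a source tarball for a given package and version.
cran_archive_url() {
    local pkg="$1"
    local version="$2"
    echo "${CRAN_MIRROR}/src/contrib/Archive/${pkg}/${pkg}_${version}.tar.gz"
}

# Usage (hypothetical; needs network access and wget):
#   wget "$(cran_archive_url DBI 0.2-3)"
#   wget "$(cran_archive_url RMySQL 0.6-0)"
```

The tarballs should end up in the directory where they are unpacked in the next step.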

Step 2: Building the packages

After downloading the packages we build and install them on a system on which the VL-e PoC distribution is installed. This is done in a regular user's home directory.
The RMySQL package requires the MySQL client library libmysqlclient.so. This file is part of the MySQL-shared RPM on CentOS/Scientific Linux 3 and of the mysql RPM on CentOS/Scientific Linux 4. The build also needs the MySQL header file mysql.h, which the configure output further down locates in /usr/include/mysql.
On RHEL4 the library can be installed using
  # yum install mysql
For older RHEL3 systems, which will soon no longer be supported, use
  # apt-get install MySQL-shared
Next, we build the software in the user's home directory.
First, we set up the directories and unpack the source tarballs:
  mkdir ~/src
  mkdir ~/R
  cd ~/src
  tar xzvf DBI_0.2-3.tar.gz
  tar xzvf RMySQL_0.6-0.tar.gz
Then we build the R packages:
  R CMD INSTALL --no-docs -l ~/R DBI
  R CMD INSTALL --no-docs -l ~/R RMySQL
which should result in output similar to
* Installing *source* package 'DBI' ...
** R
** inst
** save image
[1] TRUE
<snip>
** building package indices ...
* DONE (DBI)

* Installing *source* package 'RMySQL' ...
checking for gcc... gcc
checking for C compiler default output file name... a.out
checking whether the C compiler works... yes
checking whether we are cross compiling... no
checking for suffix of executables...
checking for suffix of object files... o
checking whether we are using the GNU C compiler... yes
checking whether gcc accepts -g... yes
checking for gcc option to accept ANSI C... none needed
checking how to run the C preprocessor... gcc -E
checking for compress in -lz... yes
checking for getopt_long in -lc... yes
checking for mysql_init in -lmysqlclient... no
checking for egrep... grep -E
checking for ANSI C header files... yes
checking for sys/types.h... yes
checking for sys/stat.h... yes
checking for stdlib.h... yes
checking for string.h... yes
checking for memory.h... yes
checking for strings.h... yes
checking for inttypes.h... yes
checking for stdint.h... yes
checking for unistd.h... yes
checking mysql.h usability... no
checking mysql.h presence... no
checking for mysql.h... no
checking for mysql_init in -lmysqlclient... no
checking for mysql_init in -lmysqlclient... no
checking for mysql_init in -lmysqlclient... no
checking for mysql_init in -lmysqlclient... yes
             mysqlclient found in -L/usr/lib/mysql
checking /usr/local/include/mysql/mysql.h usability... no
checking /usr/local/include/mysql/mysql.h presence... no
checking for /usr/local/include/mysql/mysql.h... no
checking /usr/include/mysql/mysql.h usability... yes
checking /usr/include/mysql/mysql.h presence... yes
checking for /usr/include/mysql/mysql.h... yes
configure: creating ./config.status
config.status: creating src/Makevars
** libs
gcc -I/opt/vl-e/r_2.4/lib/R/include -I/usr/include/mysql -I/usr/local/include
  -fpic  -O2 -g -march=i386 -mcpu=i686 -c RS-DBI.c -o RS-DBI.o
gcc -I/opt/vl-e/r_2.4/lib/R/include -I/usr/include/mysql -I/usr/local/include
  -fpic  -O2 -g -march=i386 -mcpu=i686 -c RS-MySQL.c -o RS-MySQL.o
gcc -shared -L/usr/local/lib -o RMySQL.so RS-DBI.o RS-MySQL.o 
  -L/usr/lib/mysql -lmysqlclient -lz  -L/opt/vl-e/r_2.4/lib/R/lib -lR
** R
** inst
** preparing package for lazy loading
Loading required package: DBI
Creating a new generic function for "format" in "RMySQL"
Creating a new generic function for "print" in "RMySQL"
<snip>
** building package indices ...
* DONE (RMySQL)
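
A quick check that both packages really ended up in the private library can save a failed grid job later. The following is a minimal sketch; it relies only on the fact that an installed R package has a DESCRIPTION file in its top-level directory. The function name check_r_packages is made up for this example.

```shell
#!/bin/bash
# Report for each named package whether it is present in the given library directory.
# Returns non-zero if at least one package is missing.
check_r_packages() {
    local libdir="$1"
    shift
    local status=0
    local pkg
    for pkg in "$@"; do
        if [ -f "${libdir}/${pkg}/DESCRIPTION" ]; then
            echo "ok: ${pkg}"
        else
            echo "missing: ${pkg}"
            status=1
        fi
    done
    return "$status"
}

# Usage:
#   check_r_packages ~/R DBI RMySQL
```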

Step 3: Building the installation packages

To use the packages on the grid, we need to create a distribution tarball that we can send along with the grid job. To do this, we strip some unnecessary files from the build we have just made:

  cd ~/R
  rm R.css
  cd DBI
  rm -rf NEWS TODO doc man
  cd ../RMySQL
  rm -rf NEWS README* THANKS TODO WindowsPath.txt INSTALL INSTALL.win
  rm -rf doc gnu man newFunctionNames.txt
Next, we add any external dependencies. For the RMySQL package we need the libmysqlclient.so.14 file (see RHEL4 sample above). We cannot assume that this file is present on the worker nodes where our grid job will run, so we add the file to our installation package:
  cp /usr/lib/mysql/libmysqlclient.so.14 ~/R/RMySQL/libs
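
For other packages the external dependencies may not be known in advance. One way to find them is to inspect the package's shared object with ldd. The sketch below filters ldd output for library names matching a pattern and prints the resolved paths; the helper name resolved_lib_paths is an assumption of this example, and the exact ldd output format can differ between systems.

```shell
#!/bin/bash
# Read `ldd` output on stdin and print the resolved path of every library
# whose name matches the given pattern. Expected line format:
#   libmysqlclient.so.14 => /usr/lib/mysql/libmysqlclient.so.14 (0x...)
resolved_lib_paths() {
    awk -v pat="$1" '$1 ~ pat && $2 == "=>" && $3 ~ /^\// { print $3 }'
}

# Usage:
#   ldd ~/R/RMySQL/libs/RMySQL.so | resolved_lib_paths mysqlclient
```

Libraries that are part of the base system on every worker node do not need to be bundled; only those that cannot be assumed present do.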

And finally we create a distribution tarball:
  cd
  tar czvf RMySQL-libs.tar.gz R
The resulting file can be downloaded here.
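
Before shipping the tarball it can be useful to confirm that it contains the files the job depends on. This is a minimal sketch using the tar file listing; the function name tarball_contains and the variable TARBALL are made up for this example.

```shell
#!/bin/bash
# Succeed only if every listed path appears in the tarball's file listing.
# Returns 2 if the tarball cannot be read, 1 if a path is missing.
tarball_contains() {
    local tarball="$1"
    shift
    local listing
    local path
    listing=$(tar tzf "$tarball") || return 2
    for path in "$@"; do
        if ! printf '%s\n' "$listing" | grep -qxF "$path"; then
            echo "not in tarball: $path"
            return 1
        fi
    done
    return 0
}

# Usage (TARBALL is the file created with tar above):
#   tarball_contains "$TARBALL" R/DBI/DESCRIPTION \
#       R/RMySQL/libs/RMySQL.so R/RMySQL/libs/libmysqlclient.so.14
```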

Step 4: Submitting a grid job with the custom R package

In order to use our custom R package we need to send the package tarball along with the rest of our grid job. An InputSandbox can contain a few megabytes and our tarball is only a few hundred kilobytes. If the input sandbox were to become too large then we would have to resort to using a VO_SW directory, but that is outside the scope of this tutorial.

To add our custom R package the following .jdl file is used:

  Executable = "R.sh";
  StdOutput = "std.out";
  StdError = "std.err";
  InputSandbox = {"R.sh", "RMySQL-libs.tar.gz", "Rtest.R"};
  OutputSandbox = {"std.out", "std.err"};
Alternatively you can download the file directly here.
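
Since the input sandbox is limited to a few megabytes, it may help to add up the sizes of the sandbox files before submitting. The sketch below prints the total in bytes and fails if a file is missing. The function name sandbox_size_bytes is an assumption of this example, and no exact size limit is implied; compare the result with the limit that applies to your site.

```shell
#!/bin/bash
# Print the combined size in bytes of the files destined for the InputSandbox.
# Returns non-zero (with a message on stderr) if any file does not exist.
sandbox_size_bytes() {
    local total=0
    local f
    local size
    for f in "$@"; do
        [ -f "$f" ] || { echo "missing: $f" >&2; return 1; }
        size=$(wc -c < "$f")
        total=$((total + size))
    done
    echo "$total"
}

# Usage:
#   sandbox_size_bytes R.sh RMySQL-libs.tar.gz Rtest.R
```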

The R.jdl file names R.sh as the executable. The R.sh script unpacks the tarball containing our custom R package, sets up a few environment variables and then runs the R test script:
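  #!/bin/bash

  # Unpack the custom R library into the job's working directory
  tar xzf RMySQL-libs.tar.gz
  # Let R find the unpacked packages
  export R_LIBS="$PWD/R"
  # Let the dynamic linker find the bundled libmysqlclient, keeping any existing path
  export LD_LIBRARY_PATH="$PWD/R/RMySQL/libs${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
  R --no-save < Rtest.R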
Alternatively you can download the file directly here.

The Rtest.R script is a very simple test script that verifies that we can connect to a MySQL database:
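  # Load the custom package; library() stops with an error if it cannot be found
  library(RMySQL)
  # Replace the placeholder credentials with those of your own database
  con <- dbConnect(MySQL(), user="db-user",
                            password="db-passwd",
                            dbname="db-name",
                            host="db-host")
  dbGetQuery(con, "select * from mod_users where username='janjust'")
  # Close the connection when done
  dbDisconnect(con)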
Alternatively you can download the file directly here.

The Rtest.R script is run like any other R script. Note that the --no-save option makes sure that the script finishes automatically, without any question about saving the workspace.
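
Once the job output has been retrieved, std.out and std.err can be scanned for signs that the R script failed. The sketch below assumes R ran with English messages, in which case a failing script produces lines starting with "Error" and, in non-interactive mode, "Execution halted". The function name r_output_has_error is made up for this example.

```shell
#!/bin/bash
# Succeed if any of the given files contains a line starting with an R error marker.
# ASSUMPTION: English messages; a missing file counts as "no error found".
r_output_has_error() {
    grep -Eq '^(Error|Execution halted)' "$@"
}

# Usage:
#   if r_output_has_error std.out std.err; then
#       echo "the R script failed; inspect std.err"
#   fi
```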

Troubleshooting

During the development of this tutorial several issues showed up. Here are a few tips and tricks on how to troubleshoot such issues:
