
HPC with R: The Basics

26 May 2017 · CPOL · 16 min read
We’ll begin with a little history and some basics that everyone who picks up R and cares about performance ought to know.

This article is in the Product Showcase section for our sponsors at CodeProject. These articles are intended to provide you with information on products and services that we consider useful and of value to developers.

Click here to register and download your free 30-day trial of Intel® Parallel Studio XE

Drew Schmidt, Graduate Research Assistant, University of Tennessee, Knoxville

Some say R is not fit for high-performance computing (HPC). Others say, "No … wait … you’re actually serious? R in HPC?"

However, the world is changing. Data analytics is the new hip, cool thing. And whether you see tangible benefits in data science, or just dollar signs, the fact is that R excels here. The HPC landscape is changing to better accommodate data analytics applications and users. Therefore, R is the natural candidate on which to focus your attention due to its overwhelming popularity. Fortunately, many have already taken up the call to embed R in HPC environments.

I’m making some assumptions about my audience here. I’m guessing you may not know much about R, but are at least curious. Maybe this curiosity is being driven by an application need, a fear of missing out (all the cool kids are programming in R), or maybe you have clients increasingly asking you about R solutions. Whatever your motivations, welcome. I’m glad to have you. I’m also assuming that you’re otherwise fairly well versed in HPC sorts of things. You know what a compiler is, and Intel® compilers are your favorites.

Given these assumptions, this article will be a bit different. I want to introduce you to the basics and introduce some of the competing ideas in the R landscape without trying too hard to pick sides. Maybe it’s more exciting to talk about distributed computing with R on thousands of nodes and getting interactive speeds on terabytes of data―and we can do that. But given the relative obscurity of R in HPC circles, I feel compelled to take some time learning to crawl before we fire ourselves out of a cannon.

We’ll begin with a little history and some basics that everyone who picks up R and cares about performance ought to know. We’ll spend a little more time talking about integrating compiled code into R and then close off with a discussion of parallel computing. It’s a bit of a whirlwind, and this article won’t make you an expert on any one topic. But the hope is to give you enough to get you started working with R more seriously.


R can trace its roots back to 1976, when John Chambers, then of Bell Labs, began working on S. S was originally designed as an interactive interface to a bunch of Fortran code. And, try as you might, you just can’t get rid of Fortran, so that’s also pretty much how R works today. R itself was released in the early 1990s, created by Ross Ihaka and Robert Gentleman as a free S. Strictly speaking, R is a dialect of the S language, which is a very snooty way of saying that there’s a lot of S code written in the ’80s that still runs in R.

R is a strange language. One of my favorite examples to demonstrate this is:

typeof(1)
## [1] "double"

typeof(2)
## [1] "double"

# 1:2 is the vector of numbers "1 2"
typeof(1:2)
## [1] "integer"
It sort of makes sense if you stare at it long enough. It’s pretty reasonable to assume that 1:2 is probably going to be an index of some kind. But it’s still pretty weird. In general, : is a bit unpredictable. If you do "1":"2", then it will return the vector of ints 1 and 2. So it just does an ASCII conversion with chars to ints, right? Except that "A":"B" errors. And : doesn’t always produce integers; you can do 1.5:2.5, for example.
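These quirks follow directly from R's coercion rules for `:`, and you can verify them all at the prompt. A quick sketch:

```r
## ":" coerces numeric-looking characters to numbers ...
typeof("1":"2")   # "integer" -- "1" and "2" become the ints 1 and 2

## ... but errors on non-numeric characters:
## "A":"B"  =>  Error in "A":"B" : NA/NaN argument

## And ":" does not always produce integers:
1.5:2.5           # the double vector 1.5 2.5
typeof(1.5:2.5)   # "double"
```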

R has a very vibrant community, boasting over 10,000 contributed packages on its Comprehensive R Archive Network (CRAN). These packages represent everything from hardcore numerical computing, to cutting-edge statistics and data science methodology, to building an interactive data analysis website (it’s called Shiny and it’s amazing). And R’s package infrastructure is unparalleled as far as I’m concerned. It’s about as "it just works" a thing as you will ever find. So for all the Python* fans who’ve been giggling and nodding "so true" up until this point, now might be a good time to explain how a bunch of statisticians who you think have no idea what they’re doing somehow managed to create the only packaging framework that isn’t abject misery to use.

Now, when I said R was popular, I wasn’t kidding. In 2016, the IEEE Spectrum programming language rankings placed R in the number five spot, beating out C# and JavaScript*. What is especially interesting about this is that the rankings are of programming languages. And even the people who love R will tell you that it’s a terrible programming language. R is just so good for data analysis that people are willing to overlook all of its peculiarities to see the really beautiful gem hiding underneath.

Said another way, R is a bit like Jack Sparrow from the Pirates of the Caribbean films: it may be the worst (language) you’ve ever heard of ... but you have heard of it.

Free Improvements

American comedian W. C. Fields once said, "The laziest man I ever met put popcorn in his pancakes so they would turn over by themselves." I suspect this apocryphal man Fields speaks of would have made a fantastic engineer. After all, why work hard when we can let others do the hard work for us?

In R, there are a few ways to engage in this kind of giant-shoulder-standing. First―and this should hardly come as any surprise―if you compile R with good compilers, you can expect to see some nontrivial performance gains. R is written in C, Fortran, and R, so using Intel’s icc and ifort on Intel® hardware is a good place to start. And, lucky you, Intel has a very nice article on how to build R with Intel compilers.

That is a strong first step in getting good performance out of R. But what about all that R code making up base R? R has had a bytecode compiler since version 2.13.0, and it compiles R internals for 2.14.0 and later. For code that you write, you have traditionally needed to go somewhat out of your way to use the bytecode compiler. However, in version 3.4.0 (due for release shortly at the time of this writing), R will include a JIT, making many of the old recommendations for using the compiler fairly moot.

Now, it’s worth pointing out that the bytecode compiler is not nearly as nice as a real compiler. If your code has bad design, like unnecessarily computing something, then it’s still going to be there; the computation is just executed in its bytecode form. It’s also not an R-to-C translator or anything like that. It does best on loop-heavy code (not counting implicit loops), and (from a performance standpoint) tends to do next to nothing otherwise. I have seen it improve a loop body on the order of 10 percent, and I have seen it affect the performance by 0.01 percent. But hey, it doesn’t take any work on your part, so we’ll take what we can get.
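As a sketch of what explicitly invoking the compiler looks like: `cmpfun()` and `enableJIT()` come from the bundled compiler package, and on R 3.4.0+ you would normally just let the JIT handle this (here we disable it so the comparison is visible). The deliberately loop-heavy function below is the compiler's best case:

```r
library(compiler)
enableJIT(0)  # turn off the JIT (R >= 3.4) so we can compare by hand

# Loop-heavy, C-looking R code: the bytecode compiler's favorite meal
colsums_loop <- function(x) {
  s <- numeric(ncol(x))
  for (j in seq_len(ncol(x)))
    for (i in seq_len(nrow(x)))
      s[j] <- s[j] + x[i, j]
  s
}

colsums_bc <- cmpfun(colsums_loop)  # explicitly byte-compile it

x <- matrix(rnorm(2e5), ncol=20)
system.time(colsums_loop(x))  # interpreted
system.time(colsums_bc(x))    # byte-compiled; typically somewhat faster
```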

These improvements are all fine and will definitely help with the runtime of your R code, but they won’t blow your socks off. Now, if you’re looking to buy a new pair of socks, then you can get really impressive performance gains by choosing good LAPACK and BLAS libraries. These are the de facto standard numerical libraries for matrix operations, and R uses them to power its low-level linear algebra and most of its statistical operations. Ironically, perhaps the most important operation in statistics, linear regression, does not use LAPACK. Instead, it uses a highly modified version of LINPACK. No, not the benchmark that runs on supercomputers. I’m talking about the ’70s predecessor to LAPACK. The reasons for this are a bit complicated, but there are reasons. So your fancy tuned LAPACK won’t help with linear regression, but it can still take advantage of good level-one BLAS.

R ships with the so-called "reference" BLAS, which is bone-achingly slow. The above example notwithstanding, if you link R with good BLAS and LAPACK implementations, then you can expect to see significant performance improvements. And, as luck would have it, Intel has a very high-quality implementation in the Intel® Math Kernel Library (Intel® MKL). Microsoft offers a distribution of R, which, to my understanding, is R compiled with Intel compilers and shipped with Intel MKL. They call it Microsoft R Open, and it is freely available. They also maintain a detailed collection of benchmarks demonstrating the power of Intel MKL. Or, if you prefer, you can follow Intel’s own documentation for linking R with Intel MKL.
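To see what your own R is linked against, and to feel the difference a tuned BLAS makes, here is a quick sketch. (In R 3.4.0 and later, `sessionInfo()` reports the paths of the linked BLAS and LAPACK libraries; on older versions you can inspect the shared libraries by hand.)

```r
# Which BLAS/LAPACK is this R linked against?
# (sessionInfo() prints the library paths in R >= 3.4.0.)
sessionInfo()

# A BLAS-bound operation: dgemm, reached via R's %*% operator.
# With the reference BLAS this is dramatically slower than with
# a tuned implementation like Intel MKL or OpenBLAS.
n <- 1000
a <- matrix(rnorm(n * n), n)
system.time(b <- a %*% a)
```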

And for those of you with all the newest, fanciest toys: yes, this applies to MIC accelerators as well. A lot of the early work on this comes from the fine folks at the Texas Advanced Computing Center (TACC), who have done quite a bit of experimenting with using Intel MKL Automatic Offload. To say that things work well is a bit of an understatement, and R users with a few Intel® Xeon Phi processors lying around should seriously consider trying this for themselves. If you aren’t sure where to start, Intel has also produced a very handy guide to help you with exactly this kind of thing.

Leveraging Compiled Code

One of the interesting revolutions happening in the R world today is the increasing use of C++ in R packages. Most of the credit for this belongs to Dirk Eddelbuettel and Romain Francois, who created the Rcpp package. Rcpp makes it significantly easier to incorporate C++ code into an R analysis pipeline. Yes, somehow they managed to convince a bunch of statisticians who thought Python was too complicated to program in C++. I’m just as amazed as you are. But however they managed to pull it off, we are all the beneficiaries. This means that CRAN packages are only getting faster and consuming less memory. And those using the Intel compilers stand to benefit the most from this revolution, because, as we discussed earlier: better compiler, faster code.

Now, for those who prefer vanilla C, R has a first-class C API. In fact, it’s on this foundation that Rcpp is built; although I think it’s fair to say that Rcpp goes to much greater lengths. But getting back to the C API, this is also convenient for those who can tolerate Fortran and are willing to write a C wrapper and don’t want to bring the C++ linker to the party. This API is mostly documented in the Writing R Extensions manual, which is an indispensable resource for anyone working with R. If you go this route, however, you may find yourself poking around R’s header files to answer the questions the manual leaves open.

For the Rcpp route, you install it as you would any other R package, namely:

install.packages("Rcpp")
For a quick example of the power of this package, let’s take a look at the "numerical Hello World": the Monte Carlo integration to find π that you’ve seen a million times before. In R, you might write something like this:

mcpi <- function(n)
{
  r <- 0L
  for (i in 1:n){
    u <- runif(1)
    v <- runif(1)
    if (u*u + v*v <= 1)
      r <- r + 1L
  }
  4*r/n
}

Now when I say "you might write," I am again assuming you’re not that familiar with R. Probably no experienced R user would ever write such a thing. One could reasonably argue that it looks a bit like C. The best advice I could give to anyone who ever inherits an R codebase for the purposes of making it faster is: the more it looks like C, the worse it will run in R, but the easier it is to convert to C/C++. The inverse is also true, in that the more it looks like idiomatic R, the better it tends to run, but the harder it is to convert. A more natural R solution is the following vectorized gibberish:

mcpi_vec <- function(n)
{
  x <- matrix(runif(n * 2), ncol=2)
  r <- sum(rowSums(x^2) <= 1)
  4*r/n
}

Like in every other high-level language, the use of vectorization will improve the runtime performance, but also gobble up a lot more RAM. So instead, let’s forget all this R business and just write a version in C++:

#include <Rcpp.h>

// [[Rcpp::export]]
double mcpi_rcpp(const int n)
{
  int r = 0;
  for (int i=0; i<n; i++){
    double u = R::runif(0, 1);
    double v = R::runif(0, 1);
    if (u*u + v*v <= 1)
      r++;
  }

  return (double) 4.*r/n;
}
Now, except for that mysterious Rcpp::export bit, that probably looks like very readable C++. Well, it turns out that the mysterious bit will handle the generation of all of the "boilerplate" code. In R, there are no scalars (hey, I told you it was a weird language), only vectors of length 1. So, behind the scenes, Rcpp is actually handling this mental overhead for you and, in its wrapper, will create a length 1 double vector for you. As a general rule, we can play this game with integers and doubles, and vectors of these basic types. More complicated things involve more complications. But hey, that’s pretty nice, right?

To compile/link/load and generate the various boilerplates, we need only call sourceCpp() to make the function immediately available to R (assuming the code above is saved in a file, say mcpi.cpp):

Rcpp::sourceCpp("mcpi.cpp")
mcpi_rcpp(10000)

## [1] 3.1456

Eagle-eyed readers may be wondering, "Isn’t ‘10000’ here a double?" And you’d be correct in thinking so, because it is. We could demand an integer by calling with 10000L―that’s an ordinary 32-bit integer, mind you―but Rcpp will automatically handle type conversions for you. It’s actually handling the conversion exactly as the R code versions are. This has fairly obvious pros and cons, but it’s the approach they took, and is worth noting and being aware of.

We can easily compare the performance of the three using the rbenchmark or microbenchmark packages. I happen to prefer rbenchmark, which goes something like:


library(rbenchmark)

n <- 100000
cols <- c("test", "replications", "elapsed", "relative")
benchmark(mcpi(n), mcpi_vec(n), mcpi_rcpp(n), columns=cols)
##           test replications elapsed relative
## 1      mcpi(n)          100  49.901  214.167
## 2  mcpi_vec(n)          100   1.307    5.609
## 3 mcpi_rcpp(n)          100   0.233    1.000

And hey, that’s pretty good! Now, of course, this opens up opportunities for things like OpenMP* or Intel® Threading Building Blocks (Intel® TBB). But speaking of parallelism ...

Parallel Programming

Since version 2.14.0, R ships with the parallel package. This allows for pretty simple task-level parallelism by offering two separate APIs, one using sockets, and one using the OS fork. The reason for the two interfaces is one part historical, in that they are derived from the older contributed packages, multicore and snow. But the desire to keep both is probably best explained by R core’s desire to support all platforms, even Windows* (which lacks fork). On a non-Windows platform, the function of interest is mclapply(), and it’s the multicore lapply()―so named because it applies a function and returns a list. Here R flexes some of its functional programming muscles:

lapply(my_data, my_function)
parallel::mclapply(my_data, my_function)

The data can be an index or convoluted list of very large, complex objects. So long as the supplied function can handle the inputs, it’ll work.
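As a small sketch tying this back to our Monte Carlo example: split the samples into chunks, hand each chunk to a forked worker, and combine the hit counts at the end. (On Windows, where there is no fork, mc.cores must be 1; the chunk sizes here are just for illustration.)

```r
library(parallel)

# Each task draws n points and returns its own hit count.
hits <- function(n) sum(runif(n)^2 + runif(n)^2 <= 1)

# Four tasks of 250,000 samples each, one forked worker per task.
counts <- mclapply(rep(250000, 4), hits, mc.cores = 4)

4 * sum(unlist(counts)) / 1e6   # estimate of pi
```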

Now that’s one of the two officially supported interfaces. The other is more complicated and generally only used by Windows programmers. This creates a bit of a rift for R users. This is purely my own opinion, but I feel that R users don’t really like having multiple options as much as your regular programmer working in another language. They want one good way to do things and for that to be the end of the discussion. It’s to this end that (ironically) several projects have emerged to try to unify all of the disparate interfaces. These include the older and more established foreach package, as well as the newer BiocParallel from the Bioconductor project.

You might wonder what the big deal is about having two separate interfaces. Well, in fact, there are many more packages that enable parallelism in R. The HPC Task View is a good resource to discover the many options.

If it sounds to you like most of the focus and interest has been on shared-memory parallelism, you’d be right. The R community in general has been a bit resistant to caring much about performance until relatively recently. I think this is largely because the R mind-space is still dominated by the statistics side of data science. Most big data problems in statistics, frankly, aren’t. You can still manage to get a lot done by just downsampling your data and using classical statistics techniques. It’s not the right tool for every job, but it certainly has its place―and if anything, these tools are underappreciated.

But none of that involves supercomputers, so forget that nonsense. Let’s talk about MPI. The Rmpi package dates all the way back to 2002. More recently, the Programming with Big Data in R (pbdR) project has been developing packages for doing large-scale computing with R in supercomputing environments. Now, full disclosure: I work on that project. As such my opinions on it are naturally biased. But I think we have a few interesting things to show you.

We maintain quite a few packages, and generally try to bring the best of HPC to R for data analysis and profiling. For the sake of brevity, let’s just focus on direct MPI programming. We’ll briefly compare Rmpi and our package, pbdMPI. First, and this is a big one, Rmpi can be used interactively, but pbdMPI cannot. This is because pbdMPI is designed to be used exclusively in single program, multiple data (SPMD) style, whereas Rmpi is really meant to work in a manager/worker style. If you’ve ever submitted a batch job on a cluster using MPI, you were almost certainly writing in SPMD. It’s one of those ideas that’s so intuitive, if you’ve never heard of it before, you’ll be surprised it even has a name. So for those coming to R from the HPC world, pbdMPI should feel right at home.

Beyond the programming style, there are some serious differences between the APIs of the two packages. In Rmpi, you need to specify the type of the data. For example:

mpi.allreduce(x, type=1) # int
mpi.allreduce(x, type=2) # double

Remember our type example from the beginning? In pbdMPI, we use a lot of R’s object-oriented facilities to try to automatically handle these and other low-level details:

allreduce(x) # type determined automatically
It’s a small example but a good demonstration of our philosophy. We think the HPC community does great work, but we also think HPC tools are too hard to use for most people and should be made simpler.

For a slightly more substantive example, let’s take a quick look at parallel "Hello World." Now, we mentioned that pbdMPI has to be used in batch mode. The downside is that R users have trouble thinking in terms of batch rather than interactive processing. The upside is that it plays well with all of the HPC things like resource managers and job schedulers that you already know about. Say we wanted to run our "Hello World" example:


library(pbdMPI)
init()

comm.print(paste("Hello from rank", comm.rank(), "of", comm.size()), all.rank=TRUE)

finalize()


We just need to make the appropriate call to mpirun (or your system’s equivalent):

mpirun -np 2 Rscript hello_world.r

Which gives the output you would expect:

[1] "Hello from rank 0 of 2"
[1] "Hello from rank 1 of 2"


There’s a famous saying in statistics circles, attributed to George Box: All models are wrong, but some are useful. Well, I posit that all programming languages are bad, but some are useful. R is perhaps the ultimate expression of this idea. After all, there has to be something to it if it’s held up for 40 years (counting S) and is currently ranked fifth among programming languages. And while R has a reputation for being slow, there are definitely strategies to mitigate this. Use a good compiler. Good BLAS and LAPACK will improve the performance of many data science operations. Embedding compiled kernels in your R analysis pipeline can greatly enhance performance. And, when in doubt, throw more cores at your problem.

Try Intel® Compilers, part of Intel® Parallel Studio XE


This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)
