Background
Running R code in parallel can substantially speed up your work. Parallelization lets you run multiple processes simultaneously, rather than iterating over a list one element at a time or running a single process at a time. Thankfully, running R code in parallel is relatively simple using the parallel package. This package provides parallelized versions of sapply, lapply, and apply (parSapply, parLapply, and parApply).
Parallelizing code works best when you need to call a function or perform an operation on each element of a list or vector, and the result for any one element has no impact on the evaluation of any other element. This could be running a large number of models across different elements of a list, scraping data from many webpages, or a host of other activities.
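As a minimal sketch of this kind of independent, element-by-element work, the snippet below fits one linear model per group with parLapply, the parallel counterpart of lapply. The mtcars data set, the grouping by cyl, and the helper name fit_one are illustrative assumptions, not part of the original example.

```r
library(parallel)

# split a built-in data set into a list of data frames, one per cylinder count
groups <- split(mtcars, mtcars$cyl)

# hypothetical helper: fit a model on one group and return its slope;
# each call uses only its own data frame, so the calls are independent
fit_one <- function(df) {
  fit <- lm(mpg ~ wt, data = df)
  unname(coef(fit)["wt"])
}

# run the fits on two worker processes
cl <- makeCluster(2)
slopes <- parLapply(cl, groups, fit_one)
stopCluster(cl)

# the same computation done sequentially, for comparison
slopes_seq <- lapply(groups, fit_one)
```

Because no fit depends on any other, the order in which the workers finish does not matter; the results come back in the same order as the input list.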
Testing for Primes in Parallel
In the example below, we’re going to use the parallel package to loop over 1 million integers to test whether each of them is a prime (or not). If you were doing this without the parallel package, you might try to speed up the operation by using sapply (rather than a for loop). This is fine, but the drawback is that sapply will only be able to test each number in the set one at a time. Using the parallelized version of sapply, called parSapply in the parallel package, we can test multiple numbers simultaneously for primality.
# load parallel package
library(parallel)

# define function to test whether a number is prime
is_prime <- function(num) {

  # if input equals 2 or 3, then we know it's prime
  if (num == 2 || num == 3) return(TRUE)

  # if input equals 1, then we know it's not prime
  if (num == 1) return(FALSE)

  # else if num is greater than 2
  # and divisible by 2, then it can't be prime
  if (num %% 2 == 0) return(FALSE)

  # else use trial division to figure out
  # what factors, if any, input has

  # get square root of num, rounded down
  root <- floor(sqrt(num))

  # try to divide each odd number from 3 up to root
  # into num; if any leave a remainder of zero,
  # then we know num is not prime
  # (the guard avoids seq() counting downwards when root < 3)
  if (root >= 3) {
    for (elt in seq(3, root, by = 2)) {
      if (num %% elt == 0) return(FALSE)
    }
  }

  # otherwise, num has no divisors except 1 and itself
  # thus, num must be prime
  return(TRUE)
}

# get random sample of 1 million integers from integers between 1 and
# 10 million
# set seed so the random sample will be the same every time
set.seed(2)
sample_numbers <- sample(10000000, 1000000)

# do a couple checks of function
is_prime(17)  # 17 is prime
is_prime(323) # 323 = 17 * 19; not prime

# create cluster object
cl <- makeCluster(3)

# test each number in sample_numbers for primality
results <- parSapply(cl, sample_numbers, is_prime)

# close cluster object
stopCluster(cl)
The main piece of the code above is this:
# create cluster object
cl <- makeCluster(3)

# test each number in sample_numbers for primality
results <- parSapply(cl, sample_numbers, is_prime)

# close cluster object
stopCluster(cl)
The makeCluster function creates a cluster of R engines to run code in parallel. In other words, calling makeCluster creates multiple instances of R. Passing the number 3 as input to this function means three separate instances of R will be created. If you’re running on Windows, you can see these instances by looking at running processes in the Task Manager.
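If you would rather confirm this from inside R, one possible check is to ask each worker for its process ID. The sketch below uses clusterCall from the parallel package together with base R's Sys.getpid; the variable names are illustrative.

```r
library(parallel)

cl <- makeCluster(3)

# clusterCall runs the given function once on every worker;
# here each worker reports its own process ID
worker_pids <- unlist(clusterCall(cl, Sys.getpid))

# the session that created the cluster has its own, different process ID
main_pid <- Sys.getpid()

print(worker_pids)
print(main_pid)

stopCluster(cl)
```

Three distinct worker IDs, none equal to the ID of the main session, indicate that the cluster really consists of three separate R processes.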
After this cluster is created, we call parSapply, which works almost exactly like sapply, except that instead of looping over each element in the vector, sample_numbers, one at a time, it uses the cluster of R instances to test multiple numbers in the vector for primality simultaneously. As you’ll see a little bit later, this saves a nice chunk of time.
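To illustrate the "works almost exactly like sapply" point, here is a small sketch that runs the same function through both and compares the results. The function count_divisors is a hypothetical stand-in for any self-contained function, and the input is kept small so the check runs quickly.

```r
library(parallel)

# hypothetical stand-in for a self-contained function applied per element
count_divisors <- function(n) sum(n %% seq_len(n) == 0)

x <- 1:200

# parallel version on two worker processes
cl <- makeCluster(2)
par_result <- parSapply(cl, x, count_divisors)
stopCluster(cl)

# sequential version
seq_result <- sapply(x, count_divisors)

# compare the two result vectors, including their order
print(identical(par_result, seq_result))
```

The only change to the calling code is the extra cluster argument; the results are returned in the order of the input vector, regardless of which worker handled which element.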
Once our operation is done, we close the cluster object using the stopCluster function. This is important to do each time you use the parallel package; otherwise you could end up with lots of R instances on your machine.
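One way to make sure the cluster is always closed, even if the parallel call fails with an error, is to wrap the work in a function and register stopCluster with on.exit. This is only a sketch; the wrapper name par_sapply_safely and its arguments are hypothetical.

```r
library(parallel)

# hypothetical wrapper: applies FUN to each element of x on n_workers
# R processes and always shuts the cluster down afterwards
par_sapply_safely <- function(x, FUN, n_workers = 2) {
  cl <- makeCluster(n_workers)
  # on.exit runs when the function returns or stops with an error
  on.exit(stopCluster(cl), add = TRUE)
  parSapply(cl, x, FUN)
}

squares <- par_sapply_safely(1:10, function(v) v^2)
print(squares)
```

With this pattern there is no separate stopCluster call to forget, which helps avoid leftover R instances.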
How fast is running R code in parallel?
Alright, so let’s test how much time we can save by parallelizing our code. We’ll start by running the same is_prime function above on the same vector of 1 million integers using regular sapply, with no parallelization. We will time the run by calling R’s built-in function proc.time before and after we run sapply; this gives us a time stamp at the start of the code run and at the end, so we can subtract these to see how much time it took for our code to run.
start <- proc.time()
results <- sapply(sample_numbers, is_prime)
end <- proc.time()

print(end - start) # 125.34
So the code takes 125.34 seconds to run. Now let’s run the same test in parallel with parSapply, using a cluster of two R processes.
start <- proc.time()
cl <- makeCluster(2)
results <- parSapply(cl, sample_numbers, is_prime)
stopCluster(cl)
end <- proc.time()

print(end - start) # 70.01
As you can see, using just two cores has cut the run time to 70.01 seconds, roughly 1.8 times faster than the sequential run! What if we use three cores, like in our initial example?
start <- proc.time()
cl <- makeCluster(3)
results <- parSapply(cl, sample_numbers, is_prime)
stopCluster(cl)
end <- proc.time()

print(end - start) # 47.81
Using three cores runs our process in 47.81 seconds, roughly 2.6 times faster than regular sapply. The exact amount of time you’ll save using parallelization will vary depending upon what operations you’re performing, and on the processor speed of the machine you’re working on, but in general, parallelization can noticeably reduce the run time of this kind of workload. Keep in mind that creating a cluster of R processes, as well as merging together results from those instances, does take some amount of time. This means that parallelizing code over a small list or vector may not be worth it if the computation involved is not very intensive. However, for a large vector of numbers like the one above, parallelization helps immensely.
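To get a feel for this overhead on your own machine, you could time a deliberately tiny job both ways. The sketch below uses system.time (a base R alternative to the proc.time pattern above) and a trivial hypothetical function, cheap_fun; no timings are shown because they will differ from machine to machine.

```r
library(parallel)

small_input <- 1:100
cheap_fun <- function(v) v + 1  # trivial work per element

# time the sequential version
t_seq <- system.time(
  seq_result <- sapply(small_input, cheap_fun)
)[["elapsed"]]

# time the parallel version, including cluster start-up and shutdown
t_par <- system.time({
  cl <- makeCluster(2)
  par_result <- parSapply(cl, small_input, cheap_fun)
  stopCluster(cl)
})[["elapsed"]]

cat("sequential:", t_seq, "seconds\n")
cat("parallel:  ", t_par, "seconds\n")
```

For a job this small, the time spent starting and stopping the worker processes is likely to outweigh any gain from splitting up the work.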
How many parallelized instances should we use?
Above, we tested using 2 and 3 cores, respectively. But why not some other number? The number of R processes you should use is related to the number of cores on your machine. A good rule of thumb is not to exceed this number. There are exceptions, but often, creating more processes than cores will end up slowing down a computation rather than speeding it up. This has to do with how an operating system schedules processes: with more processes than cores, the processes have to take turns. For a more detailed explanation, see this link.
To figure out how many cores your machine has, you can run the detectCores function from the parallel package:
detectCores()
You may also want to balance the number of cores you use with other computations or applications you have running on your machine simultaneously.
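One common convention, shown here only as a sketch and not as a rule from the parallel package, is to leave one core free for everything else running on the machine. The variable names n_cores and n_workers are illustrative.

```r
library(parallel)

n_cores <- detectCores()

# detectCores() can return NA when the count cannot be determined,
# so fall back to a single core in that case
if (is.na(n_cores)) n_cores <- 1

# leave one core free for other work, but always use at least one worker
n_workers <- max(1, n_cores - 1)

print(n_workers)
```

The resulting n_workers can then be passed to makeCluster in place of a hard-coded 2 or 3.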
That’s the end for this post. Have fun coding!
Originally posted on TheAutomatic.net blog.