Multilanguage Monte Carlo

Python

Statistical Computing Languages, RSS

[Mike Croucher](http://www.walkingrandomly.com/) [@WalkingRandomly](http://twitter.com/walkingrandomly)

University of Manchester

Presented by Neil D. Lawrence on 21st November 2014

Multilanguage Monte Carlo

We want to estimate the integral

$$ \int^5_{-5} \frac{\exp(-u^2/2)}{\sqrt{2\pi}}\text{d}u $$

using a simple Monte Carlo rejection algorithm. The function is the standard normal distribution, so the integral is approximately equal to 1.
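The scaling factor in the code below comes straight from the geometry of the rejection scheme: we scatter $N$ points uniformly in the box $[-5,5]\times[0,0.5]$, count the fraction that land under the curve, and multiply by the box area:

$$ \hat{I} = \frac{\text{hits}}{N}\times(10\times 0.5) = \frac{\text{hits}}{N}\times 5 $$

This is the origin of the factor (0.5*10) that appears in every implementation that follows.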

Here, we use the IPython notebook (soon to be called Project Jupyter) to demonstrate possible solutions in several languages. Once you have this notebook set up correctly on your machine, all of the code below is editable and executable from within the web browser.

In [4]:
# rerun Fernando's script to load in css
%run talktools

Python

A first attempt at a solution using pure Python might look like this

In [16]:
import random
from math import exp, sqrt, pi

def f(u):
    return(exp(-u**2/2)/sqrt(2*pi))

def loopy_monte(N):
    hits = 0
    for count in range(N):
        x = random.uniform(-5,5)
        y = random.uniform(0,0.5)
        hits = hits + (y < f(x))
        
    estimate = hits / float(N) * (0.5*10)
    return(estimate)

Let's run the calculation for 1 million iterations

In [17]:
loopy_monte(10**6)
Out[17]:
1.001425

Import Libraries

Note the import statements. Unlike languages such as MATLAB or R, Python has few functions built into its global namespace at startup. This means that even simple functions such as sqrt need to be explicitly imported from a module. In the code above, we use the modules random and math, which are part of the Python standard library, the set of modules included with every Python installation.

It is possible to import everything from a module in one hit. Instead of

from math import exp,sqrt,pi

we might have done

from math import *

This idiom is commonly used but is considered bad practice; Python programmers would say that it is not Pythonic. The principal objection is that it pollutes your namespace with many names, most of which the user is probably unaware of. If you do several module imports in this manner, there is a good chance that names from one module will silently shadow those of a previous one.
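To see the shadowing in action, consider importing everything from two modules that both define a sqrt (this uses numpy purely as an example of a second module):

```python
from numpy import *   # brings in numpy's sqrt, which accepts arrays
from math import *    # silently shadows it: sqrt is now math.sqrt, scalars only

sqrt(4)               # fine: math.sqrt handles a scalar
# sqrt([1, 4, 9]) now raises a TypeError, although numpy's sqrt would accept it
```

Reversing the order of the two imports would leave numpy's sqrt in place instead; with `import *` the winner depends entirely on import order.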

Timing It

Let's time our function using IPython's %timeit magic

In [18]:
%timeit loopy_monte(10**6)
1 loops, best of 3: 3.61 s per loop

This might not seem too bad for 1 million iterations but we can do significantly better. As with many high-level languages, loops are very expensive in Python. One approach to making things faster is to vectorise our code. That is, we operate on entire vectors at once rather than on scalars. This can be achieved in Python using the numpy module.

Numpy is not part of the Python standard library and needs to be installed separately. It is, however, the de facto standard for numerical computing in Python and is widely used. If you use a Python distribution such as Anaconda or Enthought, numpy will be available to import out of the box.

Vectorised Code

In [13]:
import numpy as np

def numpy_f(u):
    return(np.exp(-u**2/2)/np.sqrt(2*np.pi))

def numpy_monte(N):
    x = np.random.uniform(-5,5,N)
    y = np.random.uniform(0,0.5,N)
    hits = np.sum(y<numpy_f(x))

    estimate = hits / float(N) * (0.5*10)
    return(estimate)
In [14]:
numpy_monte(10**6)
Out[14]:
0.99990499999999993

Note that we've had to do some extra work here to make vectorisation possible. In particular, we've had to redefine our function f as a version that uses the numpy implementations of exp and sqrt. The extra work is worth it, however, since we get significantly faster runtimes.

In [15]:
%timeit numpy_monte(10**6)
10 loops, best of 3: 79.4 ms per loop

On my machine, I got times of 2.02 seconds for the loopy version and 70.5 milliseconds for the numpy version -- a speed-up of almost 29 times.

Vectorisation is nothing special. We are still performing a loop but it's now implicit. Rather than perform the loop in Python, which is slow, we generate our data and ship it out to C or Fortran routines, where loops are cheap. This is the basis of vectorisation in many other languages such as R and MATLAB.
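A toy sketch of the same point: both forms below compute the same squares, but the second runs its loop inside numpy's compiled code rather than in the Python interpreter.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])

# explicit loop, executed step by step by the Python interpreter
loop_squares = [float(xi)**2 for xi in x]

# vectorised form: the same loop, but performed inside compiled C code
vec_squares = x**2

assert loop_squares == vec_squares.tolist()
```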

Parallel Python

There are several ways to go parallel in Python. Here, I use the IPython parallel infrastructure.

First, you need to go to the Clusters tab of the IPython dashboard and start a cluster with your required number of engines. I only have a dual-core laptop, so I'll use 2. The following code will not work if you omit this step.

In [19]:
from IPython import parallel

clients = parallel.Client()
clients.block = True
print(clients.ids)

dview = clients[:] # use all engines
[0, 1]

Whenever you run stochastic code in parallel, you need to think carefully about random number generators. The numpy random number generator uses the Mersenne Twister algorithm, which isn't ideal for parallel work but is good enough in many instances. Here, we give each engine its own seed to work with. This does not guarantee non-overlapping sub-streams, but the probability that we are going to have an issue is extremely small. For our purposes, it's almost certainly fine.
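The seeding idea can be sketched outside the parallel machinery. numpy's RandomState wraps a self-contained Mersenne Twister, so each engine can own an independent generator object:

```python
import numpy as np

# one self-contained generator per engine, each with its own seed
streams = [np.random.RandomState(seed) for seed in range(2)]

draws = [rng.uniform(-5, 5, size=3) for rng in streams]

# the same seed always reproduces the same draws...
assert np.allclose(draws[0], np.random.RandomState(0).uniform(-5, 5, size=3))
# ...while different seeds make overlapping streams extremely unlikely
assert not np.allclose(draws[0], draws[1])
```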

In [20]:
import numpy as np

def numpy_f(u):
    return(np.exp(-u**2/2)/np.sqrt(2*np.pi))

# Redefine numpy_monte to take a seed argument
def numpy_monte(N,seed):
    np.random.seed(seed)
    x = np.random.uniform(-5,5,N)
    y = np.random.uniform(0,0.5,N)
    hits = np.sum(y<numpy_f(x))

    estimate = hits / float(N) * (0.5*10)
    return(estimate)
In [22]:
# First, we do this in serial
# Set up
cores = 2

# This contains the number of iterations to be performed by each engine.
# 10 times the number we did earlier. When working in parallel, you need to ensure that each engine
# has sufficient work to do so that communication overheads don't swamp the benefits
N = [10**7]*cores

seeds = range(0,cores) # A list of seeds, one for each engine

# We use the map function to map numpy_monte over the lists created above
# The total number of iterations will be cores * N
%timeit estimate = sum(list(map(numpy_monte,N,seeds)))/cores
estimate = sum(list(map(numpy_monte,N,seeds)))/cores

# Display the estimate
estimate
1 loops, best of 3: 1.4 s per loop
Out[22]:
0.99941950000000002

Now we do it in parallel using the engines created earlier. First, we need to define our functions on the engines; so far, we've only defined them locally. We use the %%px magic for this.

In [23]:
%%px
#Define all functions on the engines

import numpy as np

def numpy_f(u):
    return(np.exp(-u**2/2)/np.sqrt(2*np.pi))

# Redefine numpy_monte to take a seed argument
def numpy_monte(N,seed):
    np.random.seed(seed)
    x = np.random.uniform(-5,5,N)
    y = np.random.uniform(0,0.5,N)
    hits = np.sum(y<numpy_f(x))

    estimate = hits / float(N) * (0.5*10)
    return(estimate)
In [24]:
#Do the work in parallel using the dview we created earlier
%timeit estimate = sum(dview.map(numpy_monte,N,seeds))/cores

#Print the result
estimate
1 loops, best of 3: 1.12 s per loop
Out[24]:
0.99941950000000002

Two points to note here:

  • The parallel and serial results are identical and exactly reproducible, thanks to the way we seeded the random number generators.
  • We don't quite get an N-times speed-up with N cores, because of communication overheads.
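For comparison, the same map pattern can be written with the standard library's multiprocessing module instead of IPython engines -- a sketch using the same per-worker seeding scheme:

```python
from multiprocessing import Pool

import numpy as np

def numpy_monte(args):
    # Pool.map passes a single argument, so N and the seed are bundled in a tuple
    N, seed = args
    np.random.seed(seed)
    x = np.random.uniform(-5, 5, N)
    y = np.random.uniform(0, 0.5, N)
    hits = np.sum(y < np.exp(-x**2 / 2) / np.sqrt(2 * np.pi))
    return hits / float(N) * (0.5 * 10)

if __name__ == "__main__":
    cores = 2
    # one (N, seed) pair per worker process
    with Pool(cores) as pool:
        estimates = pool.map(numpy_monte, [(10**6, seed) for seed in range(cores)])
    print(sum(estimates) / cores)
```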

MATLAB Bridge

Although not supported by the default IPython install, it is possible to include MATLAB code within an IPython notebook. For this to work, you need a licensed version of MATLAB on your machine and a copy of python-matlab-bridge installed:

https://github.com/arokem/python-matlab-bridge

First, the magic:

In [1]:
%load_ext pymatbridge
Starting MATLAB on ZMQ socket ipc:///tmp/pymatbridge
Send 'exit' command to kill the server
.........MATLAB started and connected!

If the above command was successful, you can use the cell magic %%matlab to let IPython know that you'd like to execute the code in MATLAB.

In [2]:
%%matlab 
a = linspace(-4,4,100);
b = sin(a);
plot(a,b)

MATLAB Solution

Armed with our new MATLAB magic, let's try a loopy solution in MATLAB.

In [3]:
%%matlab
format long

N = 10^6;
hits = 0;
tic
for count = 1:N
    x = rand() * 10 - 5;
    y = rand() * 0.5;
    hits = hits + ( y < (exp(-x^2/2) / sqrt(2*pi)) ); 
end
estimate = hits / N * (0.5*10)
toc
estimate =

   0.998805000000000

Elapsed time is 5.081431 seconds.

Avoiding the Bridge Overhead

That's much slower than I expected! The problem is that most of the time is, in fact, overhead from the MATLAB-Python bridge. This is rather unfair on MATLAB, so let's create an .m file containing our code and run that instead.

In [4]:
%%writefile loopy_monte.m
function estimate = loopy_monte(N)

hits = 0;
tic
for count = 1:N
    x = rand() * 10 - 5;
    y = rand() * 0.5;
    hits = hits + ( y < (exp(-x^2/2) / sqrt(2*pi)) ); 
end
estimate = hits / N * (0.5*10);
toc
Overwriting loopy_monte.m

This allows us to run the MATLAB code from within our IPython notebook with a minimum of overhead.

In [5]:
from pymatbridge import Matlab

# Start a MATLAB session
mlab = Matlab()
mlab.start()

# Change MATLAB's working directory to match IPython's current working directory
current_dir = %pwd
mlab.run_code('cd ' + current_dir)

# Run the code in MATLAB and print the results
results = mlab.run_code('loopy_monte(10^6)')
print(results['content']['stdout'])

# Stop the MATLAB session
mlab.stop()
Starting MATLAB on ZMQ socket ipc:///tmp/pymatbridge
Send 'exit' command to kill the server
.MATLAB started and connected!
Elapsed time is 0.253124 seconds.

ans =

   0.999335000000000


MATLAB closed
Out[5]:
True

The time shown above matches what I get when I run the code directly in MATLAB, so the Python-MATLAB bridge overhead is now rather small.

As with other interpreted languages, it is not considered good practice to solve this type of problem using loops in MATLAB. However, MATLAB punishes you much less for using loops than Python or R does: the loopy MATLAB version is over 10 times faster than the loopy Python version on my machine. This is because MATLAB has a high-quality JIT (Just-In-Time) compiler, whereas standard Python and R do not. Julia gains a big advantage here by pre-compiling code.

Despite the JIT compiler, it is often still faster to convert loops to vectorised code in MATLAB, and that is the case here.

In [6]:
%%writefile vector_monte.m
function estimate = vector_monte(N)
% A vectorised version of our monte carlo code
tic
x = rand(1,N) * 10 - 5;
y = rand(1,N) * 0.5;
hits = ( sum(y < (exp(-x.^2/2) / sqrt(2*pi))) ); 
estimate = hits / N * (0.5*10);
toc
Overwriting vector_monte.m
In [60]:
from pymatbridge import Matlab

# Start a MATLAB session
mlab = Matlab()
mlab.start()

# Change MATLAB's working directory to match IPython's current working directory
current_dir = %pwd
mlab.run_code('cd ' + current_dir)

# Run the code in MATLAB and print the results
results = mlab.run_code('vector_monte(10^6)')
print(results['content']['stdout'])

# Stop the MATLAB session
mlab.stop()
Starting MATLAB on ZMQ socket ipc:///tmp/pymatbridge
Send 'exit' command to kill the server
.MATLAB started and connected!
Elapsed time is 0.064763 seconds.

ans =

    1.0001


MATLAB closed
Out[60]:
True

This is around four times as fast as the loopy version on my machine, and around the same speed as the Python/numpy version.

Parallel MATLAB

One of the issues with straight vectorisation is that it uses a lot of memory. For example, if you wanted to evaluate $5\times10^8$ iterations, the x vector alone would use almost 4 GB -- pretty much all the memory available on a low-end laptop.

The solution is to mix vectorisation and for-loops

In [7]:
%%writefile serial_monte.m
function estimate = serial_monte(N,jobs)
tic
estimate = 0;
for i=1:jobs
    x = rand(1,N) * 10 - 5;
    y = rand(1,N) * 0.5;
    hits = ( sum(y < (exp(-x.^2/2) / sqrt(2*pi))) ); 
    estimate = estimate + (hits / N * (0.5*10));
end
estimate = estimate / jobs
toc
Overwriting serial_monte.m

Now I can split my $5\times10^8$ iterations into 100 smaller vectorised jobs of $5\times10^6$ each. Memory is no longer an issue.

In [8]:
from pymatbridge import Matlab

# Start a MATLAB session
mlab = Matlab()
mlab.start()

# Change MATLAB's working directory to match IPython's current working directory
current_dir = %pwd
mlab.run_code('cd ' + current_dir)

# Run the code in MATLAB and print the results
results = mlab.run_code('serial_monte(5*10^6,100)')
print(results['content']['stdout'])

# Stop the MATLAB session
mlab.stop()
Starting MATLAB on ZMQ socket ipc:///tmp/pymatbridge
Send 'exit' command to kill the server
.MATLAB started and connected!

estimate =

    1.0000

Elapsed time is 23.162530 seconds.

ans =

    1.0000


MATLAB closed
Out[8]:
True
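The same chunking trick carries straight over to the numpy version, should memory become tight there too. A self-contained sketch:

```python
import numpy as np

def chunked_monte(N, jobs, seed=0):
    # run `jobs` vectorised chunks of N samples each, so only one
    # chunk's worth of memory is live at any time
    np.random.seed(seed)
    estimate = 0.0
    for _ in range(jobs):
        x = np.random.uniform(-5, 5, N)
        y = np.random.uniform(0, 0.5, N)
        hits = np.sum(y < np.exp(-x**2 / 2) / np.sqrt(2 * np.pi))
        estimate += hits / float(N) * (0.5 * 10)
    return estimate / jobs

chunked_monte(10**5, 10)  # 10^6 samples total, but only 10^5-element arrays at a time
```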

Converting to parallel MATLAB is easy in this instance, as long as we have access to the Parallel Computing Toolbox: we just type three extra letters and turn for into parfor.

In [9]:
%%writefile par_monte.m
function estimate = par_monte(N,jobs)
tic
estimate = 0;
parfor i=1:jobs
    x = rand(1,N) * 10 - 5;
    y = rand(1,N) * 0.5;
    hits = ( sum(y < (exp(-x.^2/2) / sqrt(2*pi))) ); 
    estimate = estimate + (hits / N * (0.5*10));
end
estimate = estimate / jobs
toc
Overwriting par_monte.m
In [10]:
from pymatbridge import Matlab

# Start a MATLAB session
mlab = Matlab()
mlab.start()

# Change MATLAB's working directory to match IPython's current working directory
current_dir = %pwd
mlab.run_code('cd ' + current_dir)
# Open the parallel pool
mlab.run_code('parpool local')

# Run the code in MATLAB and print the results
results = mlab.run_code('par_monte(5*10^6,100)')
print(results['content']['stdout'])

# Stop the MATLAB session
mlab.stop()
Starting MATLAB on ZMQ socket ipc:///tmp/pymatbridge
Send 'exit' command to kill the server
.....MATLAB started and connected!

estimate =

    1.0000

Elapsed time is 20.719257 seconds.

ans =

    1.0000


MATLAB closed
Out[10]:
True

A few things to note here:

  • There isn't much speed-up in this instance -- much less than a factor of 2. This is probably because communication overhead is killing the benefit of multiple cores. We could try adjusting the size of each job or adding cores to improve this.
  • The Parallel Computing Toolbox is not part of core MATLAB; it costs extra.
  • The default random number generator in MATLAB is Mersenne Twister. However, inside a parfor loop this silently changes to a different random number generator that has better properties for parallel work. You should bear this in mind when you need your work to be reproducible or when comparing with other languages that do use Mersenne Twister.

R

It is also possible to run R code in the IPython notebook. To do this, you need the latest copy of R installed on your system along with the rpy2 module. I installed rpy2 with the following command in a terminal window:

pip install rpy2

Once this has been done, we need to load the rmagic extension:

In [62]:
%load_ext rmagic
The rmagic extension is already loaded. To reload it, use:
  %reload_ext rmagic

Any cell that starts with %%R will have its code run in an R session. Here is the code that Colin used to plot the function we want to integrate:

In [65]:
%%R

f = function(u) exp(-u^2/2)/sqrt(2*pi)
x = seq(-5, 5, length.out = 100)
plot(x, f(x), type="l", ylim=c(0, 0.5))
abline(h=0, lty=4, col=4);abline(h=0.5, lty=3, col=4); abline(v=c(-5, 5), lty=4, col=4)

R Solution

Here is the R solution using loops. As with Python and MATLAB, loops are not recommended in R. This code is based on the original version written by Colin (downloaded 19th November 2014), but changed so that N is 10^6 rather than 10^5 to keep it on par with our other examples.

In [95]:
%%R -o estimate
# start the clock
start_time = proc.time()
f = function(u) exp(-u^2/2)/sqrt(2*pi)

simulate_pt = function(i){
    x = runif(1,-5,5); y = runif(1,0,0.5);
    y < f(x)
}

N = 10^6; hits = 0
for(i in 1:N)
 hits = hits + simulate_pt() 
(estimate = hits/N*(0.5*10))

#Stop the clock
print(proc.time() - start_time)
print(estimate)
   user  system elapsed 
 11.533   0.124  11.661 
[1] 0.9953

On my machine, this took over 11.6 seconds! Almost 6 times slower than the equivalent loop in Python and around 45 times slower than the equivalent loop in MATLAB. R really punishes you for using loops.

As with MATLAB, there is a risk that all we are measuring here is the overhead of the connection between R and IPython. Although there is some overhead, it appears to be much less of an issue.

Let's write the above code to a file and execute it in R via the shell.

In [91]:
%%writefile loop_monte.R
# start the clock
start_time = proc.time()
f = function(u) exp(-u^2/2)/sqrt(2*pi)

simulate_pt = function(i){
    x = runif(1,-5,5); y = runif(1,0,0.5);
    y < f(x)
}

N = 10^6; hits = 0
for(i in 1:N)
 hits = hits + simulate_pt() 
(estimate = hits/N*(0.5*10))

#Stop the clock
print(proc.time() - start_time)
print(estimate)
Overwriting loop_monte.R
In [92]:
!R --vanilla < loop_monte.R
R version 3.1.2 (2014-10-31) -- "Pumpkin Helmet"
Copyright (C) 2014 The R Foundation for Statistical Computing
Platform: x86_64-apple-darwin13.4.0 (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

  Natural language support but running in an English locale

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

> # start the clock
> start_time = proc.time()
> f = function(u) exp(-u^2/2)/sqrt(2*pi)
> 
> simulate_pt = function(i){
+     x = runif(1,-5,5); y = runif(1,0,0.5);
+     y < f(x)
+ }
> 
> N = 10^6; hits = 0
> for(i in 1:N)
+  hits = hits + simulate_pt() 
> (estimate = hits/N*(0.5*10))
[1] 0.996925
> 
> #Stop the clock
> print(proc.time() - start_time)
   user  system elapsed 
 10.180   0.050  10.303 
> print(estimate)
[1] 0.996925
> 

So, 11.68 seconds via the R connector and 10.3 seconds if run directly in the shell. Either way, it's the slowest solution so far.

Vectorisation is pretty much essential in R. Here's the vectorised version given by Colin. Although the R connector is nice, I'll run it in the shell to give a better idea of the speed difference.

In [98]:
%%writefile vector_monte.R
start_time = proc.time()
N=10^6
f = function(u) exp(-u^2/2)/sqrt(2*pi)
estimate = sum(runif(N, 0, 0.5) < f(runif(N, -5, 5)))/N*(0.5*10)

print(proc.time() - start_time)
print(estimate)
Overwriting vector_monte.R
In [100]:
!R --vanilla < vector_monte.R
R version 3.1.2 (2014-10-31) -- "Pumpkin Helmet"
Copyright (C) 2014 The R Foundation for Statistical Computing
Platform: x86_64-apple-darwin13.4.0 (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

  Natural language support but running in an English locale

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

> start_time = proc.time()
> N=10^6
> f = function(u) exp(-u^2/2)/sqrt(2*pi)
> estimate = sum(runif(N, 0, 0.5) < f(runif(N, -5, 5)))/N*(0.5*10)
> 
> print(proc.time() - start_time)
   user  system elapsed 
  0.246   0.013   0.260 
> print(estimate)
[1] 0.999495
> 

On my machine, this comes out at 0.26 seconds, making it the slowest vectorised solution we've seen so far.