Multilanguage Monte Carlo

Python

Statistical Computing Languages, RSS

We want to estimate the integral

$$ \int^5_{-5} \frac{\exp(-u^2/2)}{\sqrt{2\pi}}\,\text{d}u $$

using a simple Monte Carlo rejection algorithm: we scatter points uniformly over the bounding box $[-5,5]\times[0,0.5]$ and multiply the fraction that land under the curve by the box's area, $10 \times 0.5 = 5$. The integrand is the standard normal density, so the integral is approximately equal to 1.
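Before simulating anything, it is worth noting that this particular target has a closed form: the integral equals $\operatorname{erf}(5/\sqrt{2})$, which Python's standard `math` module can evaluate directly. A quick sanity check:

```python
from math import erf, sqrt

# P(-5 < Z < 5) for a standard normal Z, via the error function
exact = erf(5 / sqrt(2))
print(exact)  # differs from 1 by less than one part in a million
```

Any Monte Carlo estimate below can be compared against this value.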

Here, we use the IPython notebook (soon to be part of Project Jupyter) to demonstrate possible solutions in several languages. Once you have this notebook set up correctly on your machine, all of the code below is editable and executable from within the web browser.

In [4]:

```
# rerun Fernando's script to load in css
%run talktools
```

A first attempt at a solution using pure Python might look like this

In [16]:

```
import random
from math import exp, sqrt, pi

def f(u):
    return(exp(-u**2/2)/sqrt(2*pi))

def loopy_monte(N):
    hits = 0
    for count in range(N):
        x = random.uniform(-5,5)
        y = random.uniform(0,0.5)
        hits = hits + (y < f(x))
    estimate = hits / float(N) * (0.5*10)
    return(estimate)
```

Let's run the calculation for 1 million iterations

In [17]:

```
loopy_monte(10**6)
```

Out[17]:

Note the **import** statements. Unlike languages such as `MATLAB` or `R`, Python has few functions built into its global namespace at startup. This means that even simple functions such as **sqrt** need to be explicitly imported from a module. In the above code, we use the modules **random** and **math**, which are part of the standard Python library: the set of modules that is included with every Python installation.

It is possible to import everything from a module in one hit. Instead of

`from math import exp,sqrt,pi`

we might have done

`from math import *`

This idiom is commonly used but is considered bad practice. Python programmers would say that it is not *Pythonic*. The principal objection to this practice is that it pollutes your namespace with many functions, many of which the user is probably unaware of. If you import multiple modules in this manner, there is a good chance that function names from one module will overwrite those of a previous module.

Let's time our function using IPython's `%timeit` magic

In [18]:

```
%timeit loopy_monte(10**6)
```

This might not seem too bad for 1 million iterations but we can do significantly better. As with many high-level languages, loops are very expensive in Python. One approach to making things faster is to **vectorise** our code. That is, we operate on entire vectors at once rather than on scalars. This can be achieved in Python using the **numpy** module.

Numpy is not part of the Python standard library and needs to be installed separately. It is, however, considered the de facto standard for numerical computing in Python and is widely used. If you use a Python distribution such as Anaconda or Enthought, numpy will be available to import out of the box.

In [13]:

```
import numpy as np

def numpy_f(u):
    return(np.exp(-u**2/2)/np.sqrt(2*np.pi))

def numpy_monte(N):
    x = np.random.uniform(-5,5,N)
    y = np.random.uniform(0,0.5,N)
    hits = np.sum(y < numpy_f(x))
    # float(N), not np.float(N): np.float has been removed from modern numpy
    estimate = hits / float(N) * (0.5*10)
    return(estimate)
```

In [14]:

```
numpy_monte(10**6)
```

Out[14]:

Note that we've had to do some extra work here in order to make vectorisation possible. In particular, we've had to redefine our function **f** as a version that uses the numpy versions of **exp** and **sqrt**. This extra work is worth it, however, since we get significantly faster runtimes.

In [15]:

```
%timeit numpy_monte(10**6)
```

On my machine, I got times of **2.02 seconds** for the loopy version and **70.5 milliseconds** for the numpy version: a speed-up of almost 29 times.

Vectorisation is nothing special. We are still performing a loop but it's now implicit. Rather than perform the loop in Python, which is slow, we generate our data and ship it out to C or Fortran routines, where loops are cheap. This is the basis of vectorisation in many other languages such as R and MATLAB.
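The equivalence is easy to verify on a small example: the vectorised expression computes exactly what the scalar loop does, with the iteration pushed into compiled code.

```python
import numpy as np
from math import exp, sqrt, pi

x = np.linspace(-5, 5, 11)

# Scalar loop: one Python-level function evaluation per element
loop = [exp(-u**2 / 2) / sqrt(2 * pi) for u in x]

# Vectorised: one call; the loop happens inside numpy's compiled routines
vec = np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)

print(np.allclose(loop, vec))  # True
```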

There are several ways to go parallel in Python. Here, I use the IPython parallel infrastructure.

First, you need to go to the Clusters tab of the IPython dashboard and start a cluster with your required number of engines. I only have a dual-core laptop, so I'll use 2. The following code will not work if you omit this step.

In [19]:

```
from IPython import parallel
clients = parallel.Client()
clients.block = True
print(clients.ids)
dview = clients[:] # use all engines
```

Whenever you run stochastic code in parallel, you need to think carefully about random number generators. Numpy's random number generator uses the Mersenne Twister algorithm, which isn't ideal for parallel work but is good enough in many instances. Here, we are going to give each engine its own seed. This does not guarantee non-overlapping sub-streams, but the probability of a problem is extremely small. For our purposes, it's probably going to be fine.
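As an aside, numpy 1.17 and later offer `SeedSequence.spawn`, which derives statistically independent child streams from one root seed and sidesteps the overlap question; this API postdates the notebook, but a sketch of per-engine seeding with it might look like:

```python
import numpy as np

# One independent Generator per engine, all derived from a single root seed
root = np.random.SeedSequence(12345)
rngs = [np.random.default_rng(s) for s in root.spawn(2)]

def numpy_monte_rng(N, rng):
    x = rng.uniform(-5, 5, N)
    y = rng.uniform(0, 0.5, N)
    hits = np.sum(y < np.exp(-x**2 / 2) / np.sqrt(2 * np.pi))
    return hits / N * (0.5 * 10)

# Average the per-stream estimates; reproducible for a fixed root seed
estimate = sum(numpy_monte_rng(10**5, rng) for rng in rngs) / 2
print(estimate)
```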

In [20]:

```
import numpy as np

def numpy_f(u):
    return(np.exp(-u**2/2)/np.sqrt(2*np.pi))

# Redefine numpy_monte to take a seed argument
def numpy_monte(N,seed):
    np.random.seed(seed)
    x = np.random.uniform(-5,5,N)
    y = np.random.uniform(0,0.5,N)
    hits = np.sum(y < numpy_f(x))
    estimate = hits / float(N) * (0.5*10)
    return(estimate)
```

In [22]:

```
# First, we do this in serial
# Set up
cores = 2
# The number of iterations to be performed by each engine: 10 times the
# number we did earlier. When working in parallel, you need to ensure that
# each engine has sufficient work to do so that communication overheads
# don't swamp the benefits
N = [10**7]*cores
seeds = range(0,cores)  # A list of seeds, one for each engine
# We use the map function to map numpy_monte over the lists created above.
# The total number of iterations will be cores * N
%timeit estimate = sum(list(map(numpy_monte,N,seeds)))/cores
estimate = sum(list(map(numpy_monte,N,seeds)))/cores
# Display the estimate
estimate
```

Out[22]:

Now we do it in parallel using the engines created earlier. First, we need to define our functions on the engines; so far, we've only defined them locally. We use the `%%px` magic for this.

In [23]:

```
%%px
# Define all functions on the engines
import numpy as np

def numpy_f(u):
    return(np.exp(-u**2/2)/np.sqrt(2*np.pi))

# Redefine numpy_monte to take a seed argument
def numpy_monte(N,seed):
    np.random.seed(seed)
    x = np.random.uniform(-5,5,N)
    y = np.random.uniform(0,0.5,N)
    hits = np.sum(y < numpy_f(x))
    estimate = hits / float(N) * (0.5*10)
    return(estimate)
```

In [24]:

```
#Do the work in parallel using the dview we created earlier
%timeit estimate = sum(dview.map(numpy_monte,N,seeds))/cores
#Print the result
estimate
```

Out[24]:

Two points to note here:

- The parallel and serial results are identical and exactly reproducible, thanks to the way we seeded the random number generator.
- We don't quite get an N-times speed-up with N cores. This is because of communication overheads.

Although not supported by the default IPython install, it is possible to include MATLAB code within an IPython notebook. In order for this to work, you need a licensed version of MATLAB on your machine and a copy of the python-matlab-bridge installed:

https://github.com/arokem/python-matlab-bridge

First, the magic:

In [1]:

```
%load_ext pymatbridge
```

If the above command was successful, you can use the cell magic `%%matlab` to let IPython know that you'd like to execute the code in MATLAB.

In [2]:

```
%%matlab
a = linspace(-4,4,100);
b = sin(a);
plot(a,b)
```

Armed with our new MATLAB magic, let's try a loopy solution in MATLAB.

In [3]:

```
%%matlab
format long
N = 10^6;
hits = 0;
tic
for count = 1:N
    x = rand() * 10 - 5;
    y = rand() * 0.5;
    hits = hits + ( y < (exp(-x^2/2) / sqrt(2*pi)) );
end
estimate = hits / N * (0.5*10)
toc
```

That's much slower than I expected it to be! The problem here is that most of the time is, in fact, overhead of the `MATLAB`-Python bridge. This is rather unfair on `MATLAB`, so let's create an `.m` file containing our code and run that instead

In [4]:

```
%%writefile loopy_monte.m
function estimate = loopy_monte(N)
hits = 0;
tic
for count = 1:N
    x = rand() * 10 - 5;
    y = rand() * 0.5;
    hits = hits + ( y < (exp(-x^2/2) / sqrt(2*pi)) );
end
estimate = hits / N * (0.5*10);
toc
```

This allows us to run the `MATLAB` code from within our IPython notebook with the minimum of overhead

In [5]:

```
from pymatbridge import Matlab
# Start a MATLAB session
mlab = Matlab()
mlab.start()
# Change MATLAB's working directory to match IPython's current working directory
current_dir = %pwd
mlab.run_code('cd ' + current_dir)
# Run the code in MATLAB and print the results
results = mlab.run_code('loopy_monte(10^6)')
print(results['content']['stdout'])
# Stop the MATLAB session
mlab.stop()
```

Out[5]:

The time shown above matches what I get when I run the code directly in `MATLAB`, so the Python-`MATLAB`-bridge overhead is now rather small.

As with other interpreted languages, it is not considered good practice to solve this type of problem using loops in `MATLAB`. However, `MATLAB` punishes you a lot less for using loops than Python or `R` does. The loopy `MATLAB` version is over 10 times faster than the loopy Python version on my machine. This is because `MATLAB` has a very high-quality JIT (Just-In-Time) compiler, whereas *standard* Python and `R` do not. Julia gains a *big* advantage here by pre-compiling code.

Despite the JIT compiler, it is often still faster to convert from loops to vectorised code in `MATLAB`, and that is the case here

In [6]:

```
%%writefile vector_monte.m
function estimate = vector_monte(N)
% A vectorised version of our monte carlo code
tic
x = rand(1,N) * 10 - 5;
y = rand(1,N) * 0.5;
hits = ( sum(y < (exp(-x.^2/2) / sqrt(2*pi))) );
estimate = hits / N * (0.5*10);
toc
```

In [60]:

```
from pymatbridge import Matlab
# Start a MATLAB session
mlab = Matlab()
mlab.start()
# Change MATLAB's working directory to match IPython's current working directory
current_dir = %pwd
mlab.run_code('cd ' + current_dir)
# Run the code in MATLAB and print the results
results = mlab.run_code('vector_monte(10^6)')
print(results['content']['stdout'])
# Stop the MATLAB session
mlab.stop()
```

Out[60]:

This is almost twice as fast as the loopy version on my machine and is around the same speed as the Python/Numpy version.

One of the issues with straight vectorisation is that it uses a lot of memory. For example, if you wanted to run $5\times10^8$ iterations, the x vector alone would use almost 4 GB: pretty much all the memory available on a low-end laptop.
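The arithmetic behind that figure: a numpy `float64` occupies 8 bytes, so the x vector alone comes to roughly 3.7 GiB.

```python
# 5 * 10**8 double-precision draws at 8 bytes each
n = 5 * 10**8
gib = n * 8 / 2**30
print(round(gib, 2))  # about 3.73 GiB
```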

The solution is to mix vectorisation and for-loops.

In [7]:

```
%%writefile serial_monte.m
% Note: the function name must match the file name, serial_monte.m
function estimate = serial_monte(N,jobs)
tic
estimate = 0;
for i=1:jobs
    x = rand(1,N) * 10 - 5;
    y = rand(1,N) * 0.5;
    hits = ( sum(y < (exp(-x.^2/2) / sqrt(2*pi))) );
    estimate = estimate + (hits / N * (0.5*10));
end
estimate = estimate / jobs
toc
```

Now I can split my $5\times10^8$ iterations into 100 smaller vectorised jobs of $5\times10^6$ each. Memory is no longer an issue.

In [8]:

```
from pymatbridge import Matlab
# Start a MATLAB session
mlab = Matlab()
mlab.start()
# Change MATLAB's working directory to match IPython's current working directory
current_dir = %pwd
mlab.run_code('cd ' + current_dir)
# Run the code in MATLAB and print the results
results = mlab.run_code('serial_monte(5*10^6,100)')
print(results['content']['stdout'])
# Stop the MATLAB session
mlab.stop()
```

Out[8]:

Converting to parallel MATLAB is easy in this instance, as long as we have access to the Parallel Computing Toolbox. We just type three letters and turn `for` into `parfor`

In [9]:

```
%%writefile par_monte.m
function estimate = par_monte(N,jobs)
tic
estimate = 0;
parfor i=1:jobs
x = rand(1,N) * 10 - 5;
y = rand(1,N) * 0.5;
hits = ( sum(y < (exp(-x.^2/2) / sqrt(2*pi))) );
estimate = estimate + (hits / N * (0.5*10));
end
estimate = estimate / jobs
toc
```

In [10]:

```
from pymatbridge import Matlab
# Start a MATLAB session
mlab = Matlab()
mlab.start()
# Change MATLAB's working directory to match IPython's current working directory
current_dir = %pwd
mlab.run_code('cd ' + current_dir)
# Open the parallel pool
mlab.run_code('parpool local')
# Run the code in MATLAB and print the results
results = mlab.run_code('par_monte(5*10^6,100)')
print(results['content']['stdout'])
# Stop the MATLAB session
mlab.stop()
```

Out[10]:

A couple of things to note here:

- There isn't much speed-up in this instance: much less than a factor of 2. This is probably because communication overhead is killing the benefit of multiple cores. We could try fiddling with the size of each job or adding cores to improve this.
- The Parallel Computing Toolbox is not part of core `MATLAB`; it costs extra.
- The default random number generator in `MATLAB` is Mersenne Twister. However, inside a `parfor` loop this silently changes to a different random number generator that has better properties for parallel work. You should bear this in mind when you need your work to be reproducible or if you are comparing with other languages that do use Mersenne Twister.

It is also possible to run `R` code in the `IPython` notebook. To do this, you need the latest copy of `R` installed on your system along with the `rpy2` module. I installed `rpy2` with the following command in a terminal window

`pip install rpy2`

Once this has been done, we need to load the rmagic extension

In [62]:

```
%load_ext rmagic
```

Any cell that we start with `%%R` will have its code run in an R session. Here is the code that Colin used to plot the function we want to integrate

In [65]:

```
%%R
f = function(u) exp(-u^2/2)/sqrt(2*pi)
x = seq(-5, 5, length.out = 100)
plot(x, f(x), type="l", ylim=c(0, 0.5))
abline(h=0, lty=4, col=4);abline(h=0.5, lty=3, col=4); abline(v=c(-5, 5), lty=4, col=4)
```

Here is the R solution using loops. As with Python and MATLAB, loops are not recommended in R. This code is based on the original version written by Colin (downloaded 19th November 2014), but changed so that N is 10^6 rather than 10^5 to keep it on par with our other examples.

In [95]:

```
%%R -o estimate
# Start the clock
start_time = proc.time()
f = function(u) exp(-u^2/2)/sqrt(2*pi)
simulate_pt = function(i){
    x = runif(1,-5,5); y = runif(1,0,0.5);
    y < f(x)
}
N = 10^6; hits = 0
for(i in 1:N)
    hits = hits + simulate_pt(i)
(estimate = hits/N*(0.5*10))
# Stop the clock
print(proc.time() - start_time)
print(estimate)
```

On my machine, this took over 11.6 seconds! Almost 6 times slower than the equivalent loop in Python and over 70 times slower than the equivalent loop in MATLAB. R seems to **really** punish you for using loops.

As with MATLAB, there is a risk that all we are measuring here is the overhead on the connection between R and IPython. Although there is some overhead, it appears to be much less of an issue.

Let's write the above code to a file and execute it in R via the shell.

In [91]:

```
%%writefile loop_monte.R
# Start the clock
start_time = proc.time()
f = function(u) exp(-u^2/2)/sqrt(2*pi)
simulate_pt = function(i){
    x = runif(1,-5,5); y = runif(1,0,0.5);
    y < f(x)
}
N = 10^6; hits = 0
for(i in 1:N)
    hits = hits + simulate_pt(i)
(estimate = hits/N*(0.5*10))
# Stop the clock
print(proc.time() - start_time)
print(estimate)
```

In [92]:

```
!R --vanilla < loop_monte.R
```

So, 11.68 seconds via the R connector and 10.3 seconds if run directly in the shell. Either way, it's the slowest solution so far.

Vectorisation is pretty much essential in R. Here's the vectorised version given by Colin. Although the R connector is nice, I'll run it in the shell to give a better idea of the speed difference.

In [98]:

```
%%writefile vector_monte.R
start_time = proc.time()
N=10^6
f = function(u) exp(-u^2/2)/sqrt(2*pi)
estimate = sum(runif(N, 0, 0.5) < f(runif(N, -5, 5)))/N*(0.5*10)
print(proc.time() - start_time)
print(estimate)
```

In [100]:

```
!R --vanilla < vector_monte.R
```

On my machine, this comes out at 0.26 seconds, making it the slowest vectorised solution we've seen so far.