Adapted from http://dml.riken.jp/~rob/.

Python's `multiprocessing` is a built-in module for process-based concurrent computing.

In [1]:

```
import multiprocessing
import os
import time
import numpy
def task(args):
    return (os.getpid(), args)
```

In [2]:

```
task("test")
```

Out[2]:

In [3]:

```
N_core = multiprocessing.cpu_count()
print("This system has {} cores".format(N_core))
pool = multiprocessing.Pool(processes=3)
```

In [4]:

```
pool.map(task, [1, 2, 3, 4, 5, 6, 7, 8])
```

Out[4]:

The `multiprocessing` module is very useful for embarrassingly parallel tasks that do not need to communicate with each other, other than when sending the initial data to the pool of worker processes and when collecting the results.
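As a minimal sketch of this pattern (the function name `square` and the pool size are illustrative, not from the original), a `Pool` can map a CPU-bound function over a list of inputs and collect the results in order:

```python
import multiprocessing

def square(x):
    # a simple CPU-bound task
    return x * x

if __name__ == "__main__":
    # the with-statement shuts down the worker processes when done
    with multiprocessing.Pool(processes=4) as pool:
        results = pool.map(square, range(8))
    print(results)  # [0, 1, 4, 9, 16, 25, 36, 49]
```

Note that `pool.map` preserves the order of the inputs even though the tasks run in separate processes.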

IPython includes a very interesting and versatile parallel computing environment, which is very easy to use. It builds on the concept of IPython engines and controllers, which one can connect to and submit tasks to. To get started with this framework for parallel computing, one first has to start up an IPython cluster of engines. The easiest way to do this is to use the `ipcluster` command,

```
$ ipcluster start -n 4
```

or, alternatively, from the "Clusters" tab on the IPython notebook dashboard page. This starts 4 IPython engines on the current host, which is useful for multicore systems. It is also possible to set up IPython clusters that span **many nodes in a computing cluster**. For more information about possible use cases, see the official documentation, Using IPython for parallel computing.

To use the IPython cluster in our Python programs or notebooks, we start by creating an instance of `IPython.parallel.Client`:

In [5]:

```
from IPython.parallel import Client
```

In [6]:

```
cli = Client() # if this fails, the cluster isn't started
```

Using the `ids` attribute we can retrieve a list of IDs for the IPython engines in the cluster:

In [7]:

```
cli.ids
```

Out[7]:

Each of these engines is ready to execute tasks. We can selectively run code on individual engines:

In [8]:

```
def getpid():
    """
    return the unique ID of the current process
    """
    import os
    return os.getpid()
```

In [9]:

```
# first try it on the notebook process
getpid()
```

Out[9]:

In [10]:

```
# run it on one of the engines
cli[0].apply_sync(getpid)
```

Out[10]:

In [11]:

```
# run it on ALL of the engines at the same time
cli[:].apply_sync(getpid)
```

Out[11]:

We can use this cluster of IPython engines to execute tasks in parallel. The easiest way to dispatch a function to different engines is to define the function with the decorator:

```
@view.parallel(block=True)
```

Here, `view` is the engine pool to which we want to dispatch the function (task). Once our function is defined this way, we can dispatch it to the engines using the `map` method of the resulting object (in Python, a decorator is a language construct which automatically wraps the function in another function or a class).
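Since the decorator-plus-`map` shape may be unfamiliar, here is a toy, purely serial stand-in (the names `parallel_like` and `double` are made up for illustration, and no engines are involved) that mimics what such a decorator does: it replaces the function with an object exposing a `map` method.

```python
def parallel_like(func):
    # a toy stand-in for view.parallel: wrap func in an object
    # that exposes a map() method (applied serially here, not in parallel)
    class Wrapped:
        def map(self, sequence):
            return [func(x) for x in sequence]
    return Wrapped()

@parallel_like
def double(x):
    return 2 * x

# double is now a Wrapped instance; its map method applies
# the original function to each element of the sequence
print(double.map([1, 2, 3]))  # [2, 4, 6]
```

The real `view.parallel` decorator works analogously, except that `map` distributes the calls across the engines.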

To see how all this works, let's look at an example:

In [12]:

```
dview = cli[:]
```

In [13]:

```
@dview.parallel(block=True)
def dummy_task(delay):
    """
    a dummy task that takes 'delay' seconds to finish
    """
    import os
    import time
    t0 = time.time()
    pid = os.getpid()
    time.sleep(delay)
    t1 = time.time()
    return [pid, t0, t1]
```

In [14]:

```
# generate random delay times for dummy tasks
delay_times = numpy.random.rand(4)
```

To apply `dummy_task` to the random delay-time data, we use the `map` method of `dummy_task`:

In [15]:

```
dummy_task.map(delay_times)
```

Out[15]:

In [16]:

```
from matplotlib import pyplot as plt
%matplotlib inline
def visualize_tasks(results):
    res = numpy.array(results)
    fig, ax = plt.subplots(figsize=(10, res.shape[1]))
    yticks = []
    yticklabels = []
    tmin = min(res[:, 1])
    for n, pid in enumerate(numpy.unique(res[:, 0])):
        yticks.append(n)
        yticklabels.append(str(pid))
        for m in numpy.where(res[:, 0] == pid)[0]:
            ax.add_patch(plt.Rectangle((res[m, 1] - tmin, n - 0.25),
                                       res[m, 2] - res[m, 1], 0.5,
                                       color="green", alpha=0.5))
    ax.set_ylim(-.5, n + .5)
    ax.set_xlim(0, max(res[:, 2]) - tmin + 0.)
    ax.set_yticks(yticks)
    ax.set_yticklabels(yticklabels)
    ax.set_ylabel("PID")
    ax.set_xlabel("seconds")
```

In [17]:

```
delay_times = numpy.random.rand(64)
result = dummy_task.map(delay_times)
visualize_tasks(result)
```

That's a nice and easy parallelization! We can see that we utilize all four engines quite well. But one shortcoming is that the tasks are not load balanced, so one engine might be idle while others still have more tasks to work on. However, the IPython parallel environment provides a number of alternative "views" of the engine cluster, and there is a view that provides load balancing as well (above we have used the "direct view", which is why we called it "dview").

To obtain a load-balanced view we simply use the `load_balanced_view` method of the engine cluster client instance `cli`:

In [18]:

```
lbview = cli.load_balanced_view()
```

In [19]:

```
@lbview.parallel(block=True)
def dummy_task_load_balanced(delay):
    """
    a dummy task that takes 'delay' seconds to finish
    """
    import os
    import time
    t0 = time.time()
    pid = os.getpid()
    time.sleep(delay)
    t1 = time.time()
    return [pid, t0, t1]
```

In [20]:

```
result = dummy_task_load_balanced.map(delay_times)
visualize_tasks(result)
```

When more communication between processes is required, sophisticated solutions such as MPI and OpenMP are often needed. MPI is a **process**-based parallel processing library/protocol that can be used in Python programs through the `mpi4py` package. MPI is best used when you need parallelization because you are running out of memory: for example, a simulation whose problem size is so large that the data no longer fits into the memory of a single node, while the operations you perform on the data are comparatively fast. In this case you probably want to start one MPI process on each node, thereby making maximum use of the available memory while limiting communication to the bare minimum.

To use the `mpi4py` package we include `MPI` from `mpi4py`:

```
from mpi4py import MPI
```

An MPI Python program must be started using the `mpirun -n N` command, where `N` is the number of processes that should be included in the process group.

Note that the IPython parallel environment also has support for MPI.

In MPI for Python, `Comm` is the base class of communicators. Two predefined intracommunicator instances are available: `COMM_SELF` and `COMM_WORLD`. From them, new communicators can be created as needed.

The number of processes in a communicator and the rank of the calling process can be obtained with the methods `Get_size()` and `Get_rank()`, respectively.

Point-to-point communication is a fundamental capability of message passing systems. This mechanism enables the transmission of data between a pair of processes, one side *sending*, the other *receiving*. MPI provides a set of send and receive functions allowing the communication of *typed data* with an associated *tag*.

MPI provides basic send and receive functions that are blocking. In MPI for Python, the `Send()`, `Recv()` and `Sendrecv()` methods of communicator objects provide support for blocking point-to-point communication of memory buffers. The methods `send()`, `recv()` and `sendrecv()` can communicate generic Python objects.
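The lowercase methods can handle arbitrary objects because they serialize them with `pickle` under the hood. The following local sketch (no MPI involved, the dictionary contents are illustrative) shows that round trip:

```python
import pickle

# what send() does under the hood: serialize the object to bytes...
obj = {"rank": 0, "values": [1.0, 2.0, 3.0]}
blob = pickle.dumps(obj)

# ...and what recv() does on the receiving side: rebuild the object
restored = pickle.loads(blob)
print(restored == obj)  # True
```

This convenience comes at a cost: pickling adds overhead, which is why the buffer-based uppercase methods are preferred for large numeric arrays.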

In [21]:

```
%%file mpitest.py
from mpi4py import MPI
comm = MPI.COMM_WORLD
rank = comm.Get_rank()
if rank == 0:
    data = [1.0, 2.0, 3.0, 4.0]
    comm.send(data, dest=1, tag=11)
elif rank == 1:
    data = comm.recv(source=0, tag=11)
print("rank = {} : data = {}".format(rank, data))
```

In [22]:

```
!mpirun -n 2 python mpitest.py
```

In [23]:

```
%%file mpi-numpy-array.py
from mpi4py import MPI
import numpy
comm = MPI.COMM_WORLD
rank = comm.Get_rank()
if rank == 0:
    data = numpy.random.rand(10)
    comm.Send(data, dest=1, tag=13)
elif rank == 1:
    # initialize receiving buffer
    data = numpy.empty(10, dtype=numpy.float64)
    comm.Recv(data, source=0, tag=13)
print("rank = {} : data = {}".format(rank, data))
```

In [24]:

```
!mpirun -n 2 python mpi-numpy-array.py
```

The following uses the **Allgather** mechanism, which can be described as:

"the block of data sent from the jth process is received by every process and placed in the jth block of the receiving buffer."

In [25]:

```
# prepare some random data
import numpy
N = 16
A = numpy.random.rand(N, N)
numpy.save("random-matrix.npy", A)
x = numpy.random.rand(N)
numpy.save("random-vector.npy", x)
```

In [26]:

```
%%file mpi-matrix-vector.py
from mpi4py import MPI
import numpy
comm = MPI.COMM_WORLD
rank = comm.Get_rank()
def matvec(A, x):
    m = A.shape[0] // comm.Get_size()  # rows per process
    p = rank * m
    y_part = numpy.dot(A[p:(p + m)], x)  # partial multiplication
    y = numpy.zeros_like(x)  # initialize receiving buffer
    comm.Allgather(y_part, y)
    # comm.Allgather([y_part, MPI.DOUBLE], [y, MPI.DOUBLE])  # explicit typing
    return y

A = numpy.load("random-matrix.npy")
x = numpy.load("random-vector.npy")
y_mpi = matvec(A, x)

if rank == 0:
    y = numpy.dot(A, x)
    print(numpy.allclose(y, y_mpi))
```

In [27]:

```
!mpirun -n 4 python mpi-matrix-vector.py
```

The following uses the **Allreduce** mechanism, which combines values from all processes and distributes the result back to all processes.

In [28]:

```
# prepare some random data
import numpy
N = 4096
x = numpy.random.rand(N)
numpy.save("random-vector.npy", x)
```

In [29]:

```
%%file mpi-psum.py
from mpi4py import MPI
import numpy as np
comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()
def psum(a):
    m = len(a) // size  # elements per process
    locsum = np.sum(a[rank*m:(rank+1)*m])
    rcvBuf = np.array(0.0, 'd')
    comm.Allreduce([locsum, MPI.DOUBLE], [rcvBuf, MPI.DOUBLE], op=MPI.SUM)
    return rcvBuf

a = np.load("random-vector.npy")
s = psum(a)

if rank == 0:
    print("sum = {}, numpy sum = {}".format(s, a.sum()))
```

In [30]:

```
!mpirun -n 4 python mpi-psum.py
```