Parallelization

Multiprocessing

Python's multiprocessing is a standard-library module for process-based parallel computing.

In [1]:
import multiprocessing
import os
import time
import numpy

def task(args):
    return (os.getpid(), args)
In [2]:
task("test")
Out[2]:
(92371, 'test')
In [3]:
N_core = multiprocessing.cpu_count()
print("This system has {} cores".format(N_core))
pool = multiprocessing.Pool(processes=3)
This system has 8 cores
In [4]:
pool.map(task, [1, 2, 3, 4, 5, 6, 7, 8])
Out[4]:
[(92380, 1),
 (92381, 2),
 (92382, 3),
 (92380, 4),
 (92381, 5),
 (92382, 6),
 (92380, 7),
 (92381, 8)]

The multiprocessing module is very useful for highly parallel tasks that do not need to communicate with each other, apart from when the initial data is sent to the pool of processes and when the results are collected.
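
As a minimal sketch of this pattern (heavy_computation is just a made-up, CPU-bound example function), each input is handed to a worker process once and the results are only collected at the end:

import multiprocessing

def heavy_computation(x):
    # stand-in for a CPU-bound task that needs no communication with other tasks
    return sum(i * i for i in range(x))

if __name__ == "__main__":
    pool = multiprocessing.Pool(processes=4)
    results = pool.map(heavy_computation, [10**5, 10**6, 10**7])
    pool.close()
    pool.join()
    print(results)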

IPython parallelism

IPython includes a very interesting and versatile parallel computing environment, which is very easy to use. It builds on the concept of IPython engines and controllers, which one can connect to and submit tasks to. To get started using this framework for parallel computing, one first has to start up an IPython cluster of engines. The easiest way to do this is to use the ipcluster command:

$ ipcluster start -n 4

Or, alternatively, from the "Clusters" tab on the IPython notebook dashboard page. This will start 4 IPython engines on the current host, which is useful for multicore systems. It is also possible to set up IPython clusters that span many nodes in a computing cluster. For more information about possible use cases, see the official documentation Using IPython for parallel computing.
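
If the engines have been configured under a separate IPython profile (the profile name below is just an example), the same command can be pointed at that profile:

$ ipcluster start -n 4 --profile=myprofile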

To use the IPython cluster in our Python programs or notebooks, we start by creating an instance of IPython.parallel.Client:

In [5]:
from IPython.parallel import Client
In [6]:
cli = Client() # if this fails, the cluster isn't started

Using the 'ids' attribute we can retrieve a list of ids for the IPython engines in the cluster:

In [7]:
cli.ids
Out[7]:
[0, 1, 2, 3]

Each of these engines is ready to execute tasks. We can selectively run code on individual engines:

In [8]:
def getpid():
    """
    return the unique ID of the current process
    """
    import os
    return os.getpid()
In [9]:
# first try it on the notebook process
getpid()
Out[9]:
92371
In [10]:
# run it on one of the engines
cli[0].apply_sync(getpid)
Out[10]:
91086
In [11]:
# run it on ALL of the engines at the same time
cli[:].apply_sync(getpid)
Out[11]:
[91086, 91087, 91088, 91095]

We can use this cluster of IPython engines to execute tasks in parallel. The easiest way to dispatch a function to different engines is to define the function with the decorator:

@view.parallel(block=True)

Here, view is the engine pool to which we want to dispatch the function (task). Once our function is defined this way, we can dispatch it to the engines using the map method of the resulting object (in Python, a decorator is a language construct which automatically wraps the function into another function or a class).
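
As a side note on the decorator syntax itself, the following plain-Python sketch (unrelated to IPython; the names logged and add are made up) shows what a decorator does: writing @logged above the definition of add is equivalent to reassigning add = logged(add).

def logged(f):
    # return a wrapper that prints a message before calling f
    def wrapper(*args, **kwargs):
        print("calling {}".format(f.__name__))
        return f(*args, **kwargs)
    return wrapper

@logged
def add(a, b):
    return a + b

add(1, 2)  # prints "calling add" and returns 3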

To see how all this works, let's look at an example:

In [12]:
dview = cli[:]
In [13]:
@dview.parallel(block=True)
def dummy_task(delay):
    """
    a dummy task that takes 'delay' seconds to finish
    """
    import os
    import time
    t0 = time.time()
    pid = os.getpid()
    time.sleep(delay)
    t1 = time.time()    
    return [pid, t0, t1]
In [14]:
# generate random delay times for dummy tasks
delay_times = numpy.random.rand(4)

Now, to map the function dummy_task to the random delay times, we use the map method of dummy_task:

In [15]:
dummy_task.map(delay_times)
Out[15]:
[[91086, 1417481820.137354, 1417481820.703643],
 [91087, 1417481820.138661, 1417481820.506246],
 [91088, 1417481820.139798, 1417481820.450693],
 [91095, 1417481820.140885, 1417481821.022963]]

Let's do the same thing again with many more tasks and visualize how these tasks are executed on different IPython engines:

In [16]:
from matplotlib import pyplot as plt
%matplotlib inline

def visualize_tasks(results):
    res = numpy.array(results)
    fig, ax = plt.subplots(figsize=(10, res.shape[1]))

    yticks = []
    yticklabels = []
    tmin = min(res[:, 1])  # earliest start time, used as origin of the time axis
    # one row in the plot per engine process, identified by its PID
    for n, pid in enumerate(numpy.unique(res[:, 0])):
        yticks.append(n)
        yticklabels.append(str(pid))
        # draw one rectangle per task run on this engine,
        # spanning from its start time to its end time
        for m in numpy.where(res[:, 0] == pid)[0]:
            ax.add_patch(plt.Rectangle((res[m, 1] - tmin, n - 0.25),
                                       res[m, 2] - res[m, 1], 0.5,
                                       color="green", alpha=0.5))

    ax.set_ylim(-.5, n + .5)
    ax.set_xlim(0, max(res[:, 2]) - tmin)
    ax.set_yticks(yticks)
    ax.set_yticklabels(yticklabels)
    ax.set_ylabel("PID")
    ax.set_xlabel("seconds")
In [17]:
delay_times = numpy.random.rand(64)
result = dummy_task.map(delay_times)
visualize_tasks(result)

That's nice and easy parallelization! We can see that all four engines are used quite well. But one shortcoming is that the tasks are not load balanced, so one engine might be idle while others still have more tasks to work on. However, the IPython parallel environment provides a number of alternative "views" of the engine cluster, and there is a view that provides load balancing as well (above we have used the "direct view", which is why we called it "dview").

To obtain a load-balanced view we simply use the load_balanced_view method of the engine cluster client instance cli:

In [18]:
lbview = cli.load_balanced_view()
In [19]:
@lbview.parallel(block=True)
def dummy_task_load_balanced(delay):
    """ 
    a dummy task that takes 'delay' seconds to finish
    """
    import os
    import time
    t0 = time.time()
    pid = os.getpid()
    time.sleep(delay)
    t1 = time.time()
    return [pid, t0, t1]
In [20]:
result = dummy_task_load_balanced.map(delay_times)
visualize_tasks(result)

In the example above we can see that the engine cluster is used a bit more efficiently, and the time to completion is shorter than in the previous example.

Further reading

There are many other ways to use the IPython parallel environment. The official documentation has a nice guide.

MPI

When more communication between processes is required, more sophisticated solutions such as MPI and OpenMP are often needed. MPI is a process-based parallel processing library and protocol that can be used in Python programs through the mpi4py package. MPI is best used when you need parallelization because you are running out of memory: for example, the problem size of a simulation is so large that the data no longer fits into the memory of a single node, while the operations you perform on the data are relatively fast, so you do not need more computational power. In this case you probably want to start one MPI process on each node, thereby making maximum use of the available memory while limiting communication to the bare minimum.

To use the mpi4py package we import MPI from mpi4py:

from mpi4py import MPI

An MPI Python program must be started using the mpirun -n N command, where N is the number of processes to include in the process group.
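
For example, to launch a (hypothetical) script my_mpi_program.py on four processes:

$ mpirun -n 4 python my_mpi_program.py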

Note that the IPython parallel environment also has support for MPI.

Example 1: Passing generic Python objects

In MPI for Python, Comm is the base class of communicators. Two predefined intracommunicator instances are available: COMM_SELF and COMM_WORLD. From them, new communicators can be created as needed.

The number of processes in a communicator and the rank of the calling process can be obtained with the methods Get_size() and Get_rank(), respectively.
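
For instance, a minimal mpi4py script along these lines (launched with mpirun as above; the exact file name does not matter) simply reports the size of COMM_WORLD and the rank of each process:

from mpi4py import MPI

comm = MPI.COMM_WORLD
size = comm.Get_size()   # total number of processes in the communicator
rank = comm.Get_rank()   # rank of the calling process, in the range 0 .. size-1

print("process {} of {}".format(rank, size))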

Point-to-point communication is a fundamental capability of message passing systems. This mechanism enables the transmission of data between a pair of processes, one side sending and the other receiving. MPI provides a set of send and receive functions that allow the communication of typed data with an associated tag.

The basic send and receive functions provided by MPI are blocking. In MPI for Python, the Send(), Recv() and Sendrecv() methods of communicator objects provide support for blocking point-to-point communication of memory buffers, while the lowercase methods send(), recv() and sendrecv() can communicate generic Python objects.

In [21]:
%%file mpitest.py

from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

if rank == 0:
    data = [1.0, 2.0, 3.0, 4.0]
    comm.send(data, dest=1, tag=11)
elif rank == 1:
    data = comm.recv(source=0, tag=11)

print("rank = {} : data = {}".format(rank, data))
Overwriting mpitest.py
In [22]:
!mpirun -n 2 python mpitest.py
rank = 0 : data = [1.0, 2.0, 3.0, 4.0]
rank = 1 : data = [1.0, 2.0, 3.0, 4.0]

Example 2: Passing numpy objects

In [23]:
%%file mpi-numpy-array.py

from mpi4py import MPI
import numpy

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

if rank == 0:
    data = numpy.random.rand(10)
    comm.Send(data, dest=1, tag=13)
elif rank == 1:
    # initialize receiving object
    data = numpy.empty(10, dtype=numpy.float64)
    comm.Recv(data, source=0, tag=13)
    
print("rank = {} : data = {}".format(rank, data))
Overwriting mpi-numpy-array.py
In [24]:
!mpirun -n 2 python mpi-numpy-array.py
rank = 0 : data = [ 0.88419591  0.30159437  0.13022219  0.51819353  0.42310447  0.46366055
  0.55816726  0.61192431  0.21297446  0.34444343]
rank = 1 : data = [ 0.88419591  0.30159437  0.13022219  0.51819353  0.42310447  0.46366055
  0.55816726  0.61192431  0.21297446  0.34444343]

Example 3: Matrix-vector multiplication

The following uses the Allgather mechanism, which can be described as:

"the block of data sent from the jth process is received by every process and placed in the jth block of the receiving buffer."

In [25]:
# prepare some random data
import numpy
N = 16
A = numpy.random.rand(N, N)
numpy.save("random-matrix.npy", A)
x = numpy.random.rand(N)
numpy.save("random-vector.npy", x)
In [26]:
%%file mpi-matrix-vector.py

from mpi4py import MPI
import numpy

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

def matvec(A, x):
    m = A.shape[0] // comm.Get_size()  # number of rows handled by each process
    p = rank * m                       # first row handled by this process
    y_part = numpy.dot(A[p:(p+m)], x)  # partial multiplication
    y = numpy.zeros_like(x)            # initialize receiving object
    comm.Allgather(y_part, y)
    #comm.Allgather([y_part, MPI.DOUBLE], [y, MPI.DOUBLE]) # explicit typing
    return y

A = numpy.load("random-matrix.npy")
x = numpy.load("random-vector.npy")
y_mpi = matvec(A, x)

if rank == 0:
    y = numpy.dot(A, x)
    print(numpy.allclose(y, y_mpi))
Overwriting mpi-matrix-vector.py
In [27]:
!mpirun -n 4 python mpi-matrix-vector.py
True

Example 4: Vector sum

The following uses the Allreduce mechanism, which combines values from all processes and distributes the result back to all processes.

In [28]:
# prepare some random data
import numpy
N = 4096
x = numpy.random.rand(N)
numpy.save("random-vector.npy", x)
In [29]:
%%file mpi-psum.py

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

def psum(a):
    m = len(a) // size                      # number of elements per process
    locsum = np.sum(a[rank*m:(rank+1)*m])   # local partial sum
    rcvBuf = np.array(0.0, 'd')             # receive buffer for the global sum
    comm.Allreduce([locsum, MPI.DOUBLE], [rcvBuf, MPI.DOUBLE], op=MPI.SUM)
    return rcvBuf

a = np.load("random-vector.npy")
s = psum(a)

if MPI.COMM_WORLD.Get_rank() == 0:
    print "sum =", s, ", numpy sum =", a.sum()
Overwriting mpi-psum.py
In [30]:
!mpirun -n 4 python mpi-psum.py
sum = 2059.18689425 , numpy sum = 2059.18689425

Further reading