Notebook

Part 3: Parallel programming with Cython¶

Cython is an extension of the Python language which provides improved performance similar to compiled C-code.

Cython is sometimes described as a hybrid-style language mixing Python and C, using C-like static type definitions to provide the Cython compiler with enough extra information to produce highly-performant code, offering performance similar to traditional compiled C code.

In this mini-tutorial, we are going to look at a particular subset of the Cython project that provides parallel programming support, using the cython.parallel module.

A brief crash course in Cython¶

Before we look at the cython.parallel module, however, we should cover some basic Cython first to get a feel of how it looks and works compared to pure Python and C code!

Here is a programming example favourite - a function that computes the n-th Fibonacci number, taking n as its input argument.

Here it is in Python:

In [1]:

def fib(n):
    a = 0.0
    b = 1.0
    for i in range(n):
        a, b = a + b, a
    return a

All Python code is also valid Cython code, and has the same behaviour whether it is run through the Python interpreter, or compiled with the Cython compiler. Using the Cython compiler on the above code will not have much effect on its performance however, but here is where the special Cython syntax comes in.

As Cython is a kind of Python-C hybrid, let's consider what the above code would look like translated into pure C code.

/* C version of the fibonacci calculation*/
double fib(int n)
{
    int i;
    double a = 0.0, b = 1.0, tmp;
    for (i=0; i<n; ++i)
    {
        tmp = a;
        a = a + b;
        b = tmp;
    }
    return a;
}

In the C version, we have to define the types of variables we use (e.g. int, double), in contrast to the way Python infers the types dynamically at runtime. We also have the usual for-loop style, curly braces, and semi-colon syntax not present in Python.

In the Cython version, we can blend the static types of the C code with more Python-like syntax.

Cython version of the Fibonacci function:

Note on running the examples in Jupyter notebook (or IPython)¶

When using Cython in a Jupyter notebook, you need the following line added at the start of a Cython session:

%load_ext Cython

In [2]:

%load_ext Cython

And this line needs to be added for every notebook cell that contains Cython code:

%%cython

In [3]:

%%cython
def fib(int n):
    cdef int i
    cdef double a = 0.0
    cdef double b = 1.0
    for i in range(n):
        a, b = a + b, a
    return a

In Cython we can use cdef to define static variables, just like in C. But note we are still using the Python-syntax for for loops and function definitions. Think of Cython as a superset of the Python language, giving us some 'extra' Python syntax and keywords that we can use to help speed up our code and make it more C-like.

We have not yet used any parallelism, but when compiled with the Cython compiler this code will offer significant speed up at runtime compared to the dynamically interpreted pure-Python version.

If we were to compile this cython code and run it, we might expect something like at least an order of magnitude speed up compared to the native Python version.

Running the examples as standalone scripts¶

It is quicker and easier to follow the Cython examples using the Jupyter/IPython "cellmagic" commands. But if you want to run these examples as standalone scripts, there are a few extra steps to using Cython. You will need to create:

A .pyx file which contains any Cython code.
A setup.py script to build your Cython extension
A separate .py script which is used to import the Cython extension module and call any functions defined in it.

I recommend spending 5 minutes reading the first part of the introductory Cython tutorial from the Cython documentation https://cython.readthedocs.io/en/latest/src/tutorial/cython_tutorial.html

Parallelism in Cython¶

Cython alone may offer sufficient performance gains for an application written in Python. However, since we are here to look at parallel programming in Python, let's look at the cython.parallel module.

The prange function within this module can be used to iterate through a for loop where each iteration can be executed in parallel:

In [4]:

%%cython
from cython.parallel import prange

# First declare the variables we are going to use with cdefs:
cdef int i
cdef int n = 30
cdef int sum = 0

# Use prange instead of range
for i in prange(n, nogil=True):
    sum += i

print(sum)

The prange function takes extra arguments in addition to the number of items to iterate over, n. Here we have passed the argument nogil=True. This tells Cython we can safely release Python's Global Interpreter Lock for this section of the code in the for loop. Python's normal restriction on thread-based parallelism will be relaxed for the duration of the for loop. Each iteration is therefore free to be computer in parallel, exploiting multiple CPU cores if they are available on the system.

Sometimes when we are debugging code, it's useful to be able to print out the values of variables. Let's try printing out the current loop iteration within a prange loop. What happens if you run this code snippet?:

In [ ]:

%%cython
#Thread ID
from cython.parallel import prange

cdef int i
cdef int sum = 0

# Use prange instead of range
for i in prange(4, nogil=True):
    sum += i
    print("Current loop iter:", i)

Yikes! Cython did not like that...

Cython has another restriction when you are using the parallel functionality combined with turning off the GIL (i.e. when nogil=True) - you may not manipulate pure Python objects anymore within the parallel block. (Since Python objects rely on the presence of the GIL).

sum += i is fine because we are calling code that can be easily compiled into pure C code with Cython, as we have declared the types of i and sum using cdef.

But we have a problem here because we would like to use the print function in Python, which, being another Python object, requires the use of the GIL.

So instead, Cython provides a way of using GIL-free C functions from the C standard library. Instead of the Python print() function, we would have to use the printf() function from C.

This can be accessed by importing it like so, and then adding it within the body of our prange loop:

In [6]:

%%cython
#Thread ID
from cython.parallel import prange
# Give me some C functions!
from libc.stdio cimport printf

cdef int i
cdef int sum = 0

# Use prange instead of range
for i in prange(4, nogil=True):
    sum += i
    printf("Current loop iter: %d\n", i)

You will notice that this doesn't print out directly to the Jupyter notebook. Instead, because we are manipulating a lower level C function, the output goes to the terminal where we launched the jupyter notebook from. Have a look in the terminal and check the output. It should be someting like this:

Current loop iter: 0
Current loop iter: 1
Current loop iter: 2
Current loop iter: 3

Success! (Hopefully) - We can now use lower level C functions when we need to perform operations such as print inside a nogil block, such as prange

The Cython parallel module uses the OpenMP library for parallelisation. If you have not come across OpenMP before, do not worry at this stage - most of main features of the cython.parallel module can be used without an in-depth knowledge of OpenMP. (https://www.openmp.org/)

When threads are created to execute the code within the prange block, it might be useful to know which thread is working at a time. We can get this information with the cython.parallel.threadid function or the OpenMP-native function omp_get_thread_num().

To use the native OpenMP function call we use a special import statement called cimport to say we are importing some C-library code.

`with nogil`¶

We introduce a new bit of Cython syntax here: with nogil:. This is a standard Python with statement, but following it with nogil says we want to construct a with block that will be free of restrictions imposed by the GIL. However, remember it is up to us, the programmer, to make sure we are using no-GIL-safe functions, otherwise Cython will give us error messages!

Here's a simple block that just prints out the number of available threads and each thread ID number:

In [7]:

%%cython --compile-args=-fopenmp --link-args=-fopenmp
from cython.parallel cimport parallel
from libc.stdio cimport printf
cimport openmp

# We need a variable to store the thread id
cdef int num_threads
cdef int thread_id

with nogil, parallel():
    num_threads = openmp.omp_get_num_threads()
    printf("Number of threads: %d\n", num_threads)
    thread_id = openmp.omp_get_thread_num()
    printf("Thread ID: %d\n", thread_id)

*Note for jupyter notebook users: Notice how we needed to give our Jupyter notebook some more compilation information at the top of the code cell:*

%%cython --compile-args=-fopenmp --link-args=-fopenmp

This tells the cython compiler (which runs automatically when the code-cell is run in Jupyter) that we need to compile with the OpenMP library.

For users wanting to run 'standalone' Python/Cython scripts (outwith Jupyter/Ipython) you will need to modify the setup.py file as instructed here: https://cython.readthedocs.io/en/latest/src/userguide/parallelism.html?highlight=openmp#compiling

After running the above code, the output (in the terminal, because we are using C-level printf() functions.) Should be something like:

Number of threads: 4
Thread ID: 0
Number of threads: 4
Thread ID: 1
Number of threads: 4
Thread ID: 2
Number of threads: 4
Thread ID: 3

Note that the ouptut is not guaranteed to appear in the above order - the threads will execute the printf functions in a seemingly random order.

More options with `prange` and OpenMP¶

The Julia set¶

A classic example of parallel programming is calculating the Julia set (for making pretty pictures of fractals...) It is an embarrasingly-parallel CPU-bound computation, ideal for speeding up with Cython-OpenMP threads.

Here is the Cython code used to calculate the Julia set. We are not going to go through what every line of the code does. I shall leave it as an "exercise for the reader" to determine how it works, if you are so inclined.

(This would be the .pyx file if you were compiling this outwith the Jupyter notebook/IPython environment.)

In [8]:

%%cython
# julia.pyx
from cython cimport boundscheck, wraparound
from cython.parallel cimport prange

import numpy as np

cdef inline double norm2(double complex z) nogil:
    return z.real * z.real + z.imag * z.imag


cdef int escape(double complex z,
                double complex c,
                double z_max,
                int n_max) nogil:

    cdef:
        int i = 0
        double z_max2 = z_max * z_max

    while norm2(z) < z_max2 and i < n_max:
        z = z * z + c
        i += 1

    return i


@boundscheck(False)
@wraparound(False)
def calc_julia(int resolution, double complex c,
               double bound=1.5, double z_max=4.0, int n_max=1000):

    cdef:
        double step = 2.0 * bound / resolution
        int i, j
        double complex z
        double real, imag
        int[:, ::1] counts

    counts = np.zeros((resolution+1, resolution+1), dtype=np.int32)

    for i in prange(resolution + 1, nogil=True,
                    schedule='static', chunksize=1):
        real = -bound + i * step
        for j in range(resolution + 1):
            imag = -bound + j * step
            z = real + imag * 1j
            counts[i,j] = escape(z, c, z_max, n_max)

    return np.asarray(counts)

@boundscheck(False)
@wraparound(False)
def julia_fraction(int[:,::1] counts, int maxval=1000):
    cdef:
        int total = 0
        int i, j, N, M
    N = counts.shape[0]; M = counts.shape[1]

    for i in prange(N, nogil=True):
        for j in range(M):
            if counts[i,j] == maxval:
                total += 1
    return total / float(counts.size)

And this would be the code than calls the Cython code above. If you were doing this with a separate compilation script, you would need to add import julia or whatever you called your Cython .pyx extension module.

In [9]:

# julia.py
import numpy as np
from time import clock
#import matplotlib.pyplot as plt

t1 = clock()
jl = calc_julia(2000, (0.322 + 0.05j))
print("time:", clock() - t1)

print("julia fraction:", julia_fraction(jl))

# To plot a nice fractal - uncomment these lines
#plt.imshow(np.log(jl))
#plt.show()

time: 4.604583
julia fraction: 0.23719749320741929

The serial version of the code is as so (no pranges, mainly):

In [10]:

%%cython
# The original, non-parallelized version.

from cython cimport boundscheck, wraparound

import numpy as np

cdef inline double norm2(double complex z) nogil:
    return z.real * z.real + z.imag * z.imag


cdef int escape(double complex z,
                double complex c,
                double z_max,
                int n_max) nogil:

    cdef:
        int i = 0
        double z_max2 = z_max * z_max

    while norm2(z) < z_max2 and i < n_max:
        z = z * z + c
        i += 1

    return i


@boundscheck(False)
@wraparound(False)
def calc_julia(int resolution, double complex c,
               double bound=1.5, double z_max=4.0, int n_max=1000):

    cdef:
        double step = 2.0 * bound / resolution
        int i, j
        double complex z
        double real, imag
        int[:, ::1] counts

    counts = np.zeros((resolution+1, resolution+1), dtype=np.int32)

    for i in range(resolution + 1):
        real = -bound + i * step
        for j in range(resolution + 1):
            imag = -bound + j * step
            z = real + imag * 1j
            counts[i,j] = escape(z, c, z_max, n_max)

    return np.asarray(counts)


@boundscheck(False)
@wraparound(False)
def julia_fraction(int[:,::1] counts, int maxval=1000):
    cdef:
        int total = 0
        int i, j, N, M
    N = counts.shape[0]; M = counts.shape[1]

    for i in range(N):
        for j in range(M):
            if counts[i,j] == maxval:
                total += 1
    return total / float(counts.size)

In [11]:

import numpy as np
from time import clock
#import matplotlib.pyplot as plt

#t1 = clock()
jl = calc_julia(2000, (0.322 + 0.05j))
print("time:", clock() - t1)

print("julia fraction:", julia_fraction(jl))

# To plot a nice fractal - uncomment these lines
#plt.imshow(np.log(jl))
#plt.show()

time: 9.230494
julia fraction: 0.23719749320741929

On my laptop I got roughly a 2x speed up with the parallel version on four CPU cores. (Note that we have not compared this to the pure Python version at any stage - both of these examples are using Cython.)

Cython compiler directives¶

We are using two new decorators, which are not parallelism-specific, but useful potential performance enhancers:

@boundscheck=False - do not check that array indices are within the bounds of the size of the array

@wraparound=False - do not allow negative (wraparound) python-style array indexing, e.g. A[-1] etc.

Further information on the available directives are here: https://github.com/cython/cython/wiki/enhancements-compilerdirectives

OpenMP scheduling directives¶

You might also have noticed we have used some extra arguments in the prange method:

schedule='static', chunksize=1

These arguments are the same as the OpenMP loop scheduling directives (If you are already familiar with OpenMP). They allow finer grained control of how work is distributed among OpenMP threads. Static implies that the work is divided up into equal 'chunks' and does not change over the course of the computation. Other options allow the chunksize to vary during runtime (e.g. guided, dynamic). When using a fixed, static chunksize, the chunksize parameter can be set explicitly.

Scheduling keywords are explained in more detail here:

https://github.com/cython/cython/wiki/enhancements-compilerdirectives

Summary and taking it further¶

Cython is a relatively mature Python technology (compared to say numba or mpi4py), providing an opportunity to do thread-based parallel programming when we can guarantee not to use Python objects in our code, but instead translate it to 'lower level' Cython code.
Cython's cython.parallel module provides an interface to OpenMP - a type of shared memory parallel programming model suited to multicore CPU systems. (I.e. single-node on a cluster, multicore desktop/laptop computer). Much of the OpenMP API can be accessed through this module.
However, the user may still stick to the basics of functions like prange and with nogil... to access simple parallelism techniques without needing an indepth knowledge of OpenMP.
Cython requires more programming effort time compared to something like numba, but offers finer grained control, interoperability with C and C++ code, and access to some of the power of the OpenMP library. (We have not covered interoperability with C code, but other tutorials and resources on this are available.

Tutorials that take it further¶

https://cython.readthedocs.io/en/latest/src/userguide/parallelism.html

https://software.intel.com/en-us/articles/thread-parallelism-in-cython