Cython is an extension of the Python language which provides improved performance similar to compiled C-code.
Cython is sometimes described as a hybrid-style language mixing Python and C, using C-like static type definitions to provide the Cython compiler with enough extra information to produce highly-performant code, offering performance similar to traditional compiled C code.
In this mini-tutorial, we are going to look at a particular subset of the Cython project that provides parallel programming support, using the cython.parallel
module.
Before we look at the cython.parallel
module, however, we should cover some basic Cython first to get a feel of how it looks and works compared to pure Python and C code!
Here is a programming example favourite - a function that computes the n-th Fibonacci number, taking n
as its input argument.
Here it is in Python:
def fib(n):
a = 0.0
b = 1.0
for i in range(n):
a, b = a + b, a
return a
All Python code is also valid Cython code, and has the same behaviour whether it is run through the Python interpreter, or compiled with the Cython compiler. Using the Cython compiler on the above code will not have much effect on its performance however, but here is where the special Cython syntax comes in.
As Cython is a kind of Python-C hybrid, let's consider what the above code would look like translated into pure C code.
/* C version of the fibonacci calculation*/
double fib(int n)
{
int i;
double a = 0.0, b = 1.0, tmp;
for (i=0; i<n; ++i)
{
tmp = a;
a = a + b;
b = tmp;
}
return a;
}
In the C version, we have to define the types of variables we use (e.g. int
, double
), in contrast to the way Python infers the types dynamically at runtime. We also have the usual for-loop style, curly braces, and semi-colon syntax not present in Python.
In the Cython version, we can blend the static types of the C code with more Python-like syntax.
Cython version of the Fibonacci function:
When using Cython in a Jupyter notebook, you need the following line added at the start of a Cython session:
%load_ext Cython
%load_ext Cython
And this line needs to be added for every notebook cell that contains Cython code:
%%cython
%%cython
def fib(int n):
cdef int i
cdef double a = 0.0
cdef double b = 1.0
for i in range(n):
a, b = a + b, a
return a
In Cython we can use cdef
to define static variables, just like in C. But note we are still using the Python-syntax for for
loops and function definitions. Think of Cython as a superset of the Python language, giving us some 'extra' Python syntax and keywords that we can use to help speed up our code and make it more C-like.
We have not yet used any parallelism, but when compiled with the Cython compiler this code will offer significant speed up at runtime compared to the dynamically interpreted pure-Python version.
If we were to compile this cython code and run it, we might expect something like at least an order of magnitude speed up compared to the native Python version.
It is quicker and easier to follow the Cython examples using the Jupyter/IPython "cellmagic" commands. But if you want to run these examples as standalone scripts, there are a few extra steps to using Cython. You will need to create:
.pyx
file which contains any Cython code.setup.py
script to build your Cython extension.py
script which is used to import
the Cython extension module and call any functions defined in it.I recommend spending 5 minutes reading the first part of the introductory Cython tutorial from the Cython documentation https://cython.readthedocs.io/en/latest/src/tutorial/cython_tutorial.html
Cython alone may offer sufficient performance gains for an application written in Python. However, since we are here to look at parallel programming in Python, let's look at the cython.parallel
module.
The prange
function within this module can be used to iterate through a for loop where each iteration can be executed in parallel:
%%cython
from cython.parallel import prange
# First declare the variables we are going to use with cdefs:
cdef int i
cdef int n = 30
cdef int sum = 0
# Use prange instead of range
for i in prange(n, nogil=True):
sum += i
print(sum)
435
The prange
function takes extra arguments in addition to the number of items to iterate over, n
. Here we have passed the argument nogil=True
. This tells Cython we can safely release Python's Global Interpreter Lock for this section of the code in the for loop. Python's normal restriction on thread-based parallelism will be relaxed for the duration of the for loop. Each iteration is therefore free to be computer in parallel, exploiting multiple CPU cores if they are available on the system.
Sometimes when we are debugging code, it's useful to be able to print out the values of variables. Let's try printing out the current loop iteration within a prange
loop. What happens if you run this code snippet?:
%%cython
#Thread ID
from cython.parallel import prange
cdef int i
cdef int sum = 0
# Use prange instead of range
for i in prange(4, nogil=True):
sum += i
print("Current loop iter:", i)
Yikes! Cython did not like that...
Cython has another restriction when you are using the parallel functionality combined with turning off the GIL (i.e. when nogil=True
) - you may not manipulate pure Python objects anymore within the parallel block. (Since Python objects rely on the presence of the GIL).
sum += i
is fine because we are calling code that can be easily compiled into pure C code with Cython, as we have declared the types of i
and sum
using cdef
.
But we have a problem here because we would like to use the print
function in Python, which, being another Python object, requires the use of the GIL.
So instead, Cython provides a way of using GIL-free C functions from the C standard library. Instead of the Python print()
function, we would have to use the printf()
function from C.
This can be accessed by importing it like so, and then adding it within the body of our prange loop:
%%cython
#Thread ID
from cython.parallel import prange
# Give me some C functions!
from libc.stdio cimport printf
cdef int i
cdef int sum = 0
# Use prange instead of range
for i in prange(4, nogil=True):
sum += i
printf("Current loop iter: %d\n", i)
You will notice that this doesn't print out directly to the Jupyter notebook. Instead, because we are manipulating a lower level C function, the output goes to the terminal where we launched the jupyter notebook from. Have a look in the terminal and check the output. It should be someting like this:
Current loop iter: 0
Current loop iter: 1
Current loop iter: 2
Current loop iter: 3
Success! (Hopefully) - We can now use lower level C functions when we need to perform operations such as print
inside a nogil
block, such as prange
The Cython parallel
module uses the OpenMP library for parallelisation. If you have not come across OpenMP before, do not worry at this stage - most of main features of the cython.parallel module can be used without an in-depth knowledge of OpenMP. (https://www.openmp.org/)
When threads are created to execute the code within the prange block, it might be useful to know which thread is working at a time. We can get this information with the cython.parallel.threadid
function or the OpenMP-native function omp_get_thread_num()
.
To use the native OpenMP function call we use a special import statement called cimport
to say we are importing some C-library code.
with nogil
¶We introduce a new bit of Cython syntax here: with nogil:
. This is a standard Python with
statement, but following it with nogil
says we want to construct a with
block that will be free of restrictions imposed by the GIL. However, remember it is up to us, the programmer, to make sure we are using no-GIL-safe functions, otherwise Cython will give us error messages!
Here's a simple block that just prints out the number of available threads and each thread ID number:
%%cython --compile-args=-fopenmp --link-args=-fopenmp
from cython.parallel cimport parallel
from libc.stdio cimport printf
cimport openmp
# We need a variable to store the thread id
cdef int num_threads
cdef int thread_id
with nogil, parallel():
num_threads = openmp.omp_get_num_threads()
printf("Number of threads: %d\n", num_threads)
thread_id = openmp.omp_get_thread_num()
printf("Thread ID: %d\n", thread_id)
*Note for jupyter notebook users: Notice how we needed to give our Jupyter notebook some more compilation information at the top of the code cell:*
%%cython --compile-args=-fopenmp --link-args=-fopenmp
This tells the cython compiler (which runs automatically when the code-cell is run in Jupyter) that we need to compile with the OpenMP library.
For users wanting to run 'standalone' Python/Cython scripts (outwith Jupyter/Ipython) you will need to modify the setup.py
file as instructed here: https://cython.readthedocs.io/en/latest/src/userguide/parallelism.html?highlight=openmp#compiling
After running the above code, the output (in the terminal, because we are using C-level printf()
functions.) Should be something like:
Number of threads: 4
Thread ID: 0
Number of threads: 4
Thread ID: 1
Number of threads: 4
Thread ID: 2
Number of threads: 4
Thread ID: 3
Note that the ouptut is not guaranteed to appear in the above order - the threads will execute the printf functions in a seemingly random order.
prange
and OpenMP¶A classic example of parallel programming is calculating the Julia set (for making pretty pictures of fractals...) It is an embarrasingly-parallel CPU-bound computation, ideal for speeding up with Cython-OpenMP threads.
Here is the Cython code used to calculate the Julia set. We are not going to go through what every line of the code does. I shall leave it as an "exercise for the reader" to determine how it works, if you are so inclined.
(This would be the .pyx
file if you were compiling this outwith the Jupyter notebook/IPython environment.)
%%cython
# julia.pyx
from cython cimport boundscheck, wraparound
from cython.parallel cimport prange
import numpy as np
cdef inline double norm2(double complex z) nogil:
return z.real * z.real + z.imag * z.imag
cdef int escape(double complex z,
double complex c,
double z_max,
int n_max) nogil:
cdef:
int i = 0
double z_max2 = z_max * z_max
while norm2(z) < z_max2 and i < n_max:
z = z * z + c
i += 1
return i
@boundscheck(False)
@wraparound(False)
def calc_julia(int resolution, double complex c,
double bound=1.5, double z_max=4.0, int n_max=1000):
cdef:
double step = 2.0 * bound / resolution
int i, j
double complex z
double real, imag
int[:, ::1] counts
counts = np.zeros((resolution+1, resolution+1), dtype=np.int32)
for i in prange(resolution + 1, nogil=True,
schedule='static', chunksize=1):
real = -bound + i * step
for j in range(resolution + 1):
imag = -bound + j * step
z = real + imag * 1j
counts[i,j] = escape(z, c, z_max, n_max)
return np.asarray(counts)
@boundscheck(False)
@wraparound(False)
def julia_fraction(int[:,::1] counts, int maxval=1000):
cdef:
int total = 0
int i, j, N, M
N = counts.shape[0]; M = counts.shape[1]
for i in prange(N, nogil=True):
for j in range(M):
if counts[i,j] == maxval:
total += 1
return total / float(counts.size)
And this would be the code than calls the Cython code above. If you were doing this with a separate compilation script, you would need to add import julia
or whatever you called your Cython .pyx
extension module.
# julia.py
import numpy as np
from time import clock
#import matplotlib.pyplot as plt
t1 = clock()
jl = calc_julia(2000, (0.322 + 0.05j))
print("time:", clock() - t1)
print("julia fraction:", julia_fraction(jl))
# To plot a nice fractal - uncomment these lines
#plt.imshow(np.log(jl))
#plt.show()
time: 4.604583 julia fraction: 0.23719749320741929
The serial version of the code is as so (no prange
s, mainly):
%%cython
# The original, non-parallelized version.
from cython cimport boundscheck, wraparound
import numpy as np
cdef inline double norm2(double complex z) nogil:
return z.real * z.real + z.imag * z.imag
cdef int escape(double complex z,
double complex c,
double z_max,
int n_max) nogil:
cdef:
int i = 0
double z_max2 = z_max * z_max
while norm2(z) < z_max2 and i < n_max:
z = z * z + c
i += 1
return i
@boundscheck(False)
@wraparound(False)
def calc_julia(int resolution, double complex c,
double bound=1.5, double z_max=4.0, int n_max=1000):
cdef:
double step = 2.0 * bound / resolution
int i, j
double complex z
double real, imag
int[:, ::1] counts
counts = np.zeros((resolution+1, resolution+1), dtype=np.int32)
for i in range(resolution + 1):
real = -bound + i * step
for j in range(resolution + 1):
imag = -bound + j * step
z = real + imag * 1j
counts[i,j] = escape(z, c, z_max, n_max)
return np.asarray(counts)
@boundscheck(False)
@wraparound(False)
def julia_fraction(int[:,::1] counts, int maxval=1000):
cdef:
int total = 0
int i, j, N, M
N = counts.shape[0]; M = counts.shape[1]
for i in range(N):
for j in range(M):
if counts[i,j] == maxval:
total += 1
return total / float(counts.size)
import numpy as np
from time import clock
#import matplotlib.pyplot as plt
#t1 = clock()
jl = calc_julia(2000, (0.322 + 0.05j))
print("time:", clock() - t1)
print("julia fraction:", julia_fraction(jl))
# To plot a nice fractal - uncomment these lines
#plt.imshow(np.log(jl))
#plt.show()
time: 9.230494 julia fraction: 0.23719749320741929
On my laptop I got roughly a 2x speed up with the parallel version on four CPU cores. (Note that we have not compared this to the pure Python version at any stage - both of these examples are using Cython.)
We are using two new decorators, which are not parallelism-specific, but useful potential performance enhancers:
@boundscheck=False
- do not check that array indices are within the bounds of the size of the array
@wraparound=False
- do not allow negative (wraparound) python-style array indexing, e.g. A[-1]
etc.
Further information on the available directives are here: https://github.com/cython/cython/wiki/enhancements-compilerdirectives
You might also have noticed we have used some extra arguments in the prange
method:
schedule='static', chunksize=1
These arguments are the same as the OpenMP loop scheduling directives (If you are already familiar with OpenMP). They allow finer grained control of how work is distributed among OpenMP threads. Static implies that the work is divided up into equal 'chunks' and does not change over the course of the computation. Other options allow the chunksize to vary during runtime (e.g. guided, dynamic
). When using a fixed, static
chunksize, the chunksize
parameter can be set explicitly.
Scheduling keywords are explained in more detail here:
https://github.com/cython/cython/wiki/enhancements-compilerdirectives
Cython is a relatively mature Python technology (compared to say numba or mpi4py), providing an opportunity to do thread-based parallel programming when we can guarantee not to use Python objects in our code, but instead translate it to 'lower level' Cython code.
Cython's cython.parallel
module provides an interface to OpenMP - a type of shared memory parallel programming model suited to multicore CPU systems. (I.e. single-node on a cluster, multicore desktop/laptop computer). Much of the OpenMP API can be accessed through this module.
However, the user may still stick to the basics of functions like prange
and with nogil...
to access simple parallelism techniques without needing an indepth knowledge of OpenMP.
Cython requires more programming effort time compared to something like numba
, but offers finer grained control, interoperability with C and C++ code, and access to some of the power of the OpenMP library. (We have not covered interoperability with C code, but other tutorials and resources on this are available.
https://cython.readthedocs.io/en/latest/src/userguide/parallelism.html
https://software.intel.com/en-us/articles/thread-parallelism-in-cython