Hello! Welcome to the Hitchhiker's Guide to Parallelism with Python. This tutorial/workshop is designed to give an overview of some of the many options for doing parallel programming with Python. What do we mean by parallel programming? We are taking a very broad view of the definition of parallelism here, and using it to mean any programmatic means of doing computations side-by-side, at the same time, on multiple processors or CPU cores. We include both thread-based and process-based parallelism here. We're not going to spend a long time discussing abstract definitions of parallelism or concurrency, but instead take a practical, hands-on approach to experimenting with different Python modules that allow us to exploit the multi-core, multi-threaded, multi-CPU world that we work in.
This workshop covers a few select topics in the world of Python parallel programming. It can't possibly cover them all in a short workshop, but I hope it will show you that Python programming is not restricted to single-threaded, serial applications, and that there are many approaches to parallel programming with Python.
The workshop is made up of four 'mini-tutorials', each one about 15-20 minutes long. They each cover a different Python library that supports parallel programming in some way. These four topics are:
You are welcome to cover them in a different order if you like (for example, if you are already familiar with one or more of the modules).
After the workshop is finished, we will collate topics and feedback discussed in the free discussion session at the end, and add them to this page:
The traditional thinking on Python is that it does not support parallel programming very well, and that 'proper' parallel programming is best left to the 'heavy-duty' languages such as Fortran or C++, where libraries or standards such as OpenMP and MPI can be utilised. For large-scale, massively parallel applications, this is perhaps still the case. But with the rich variety of libraries and packages that have been developed outside the core Python language, parallel programming is now much better supported in Python, despite one fundamental limitation of Python's implementation.
The most common implementation of Python (i.e. the interpreter or executable that runs your Python code) is called CPython. CPython is the de facto standard implementation of Python (though others, such as PyPy, are available). It is the implementation most commonly bundled with operating systems, and the one provided through the Anaconda scientific Python distribution.
CPython does not support the use of threads well, even though threads are fundamental to a large proportion of parallel programming approaches. This is because it was written on the assumption that an individual Python program is serial: only a single thread executes the code at any one time. This prevents memory corruption from multiple threads trying to read, write, or delete memory at the same time. Contrast this with languages such as C and Fortran, which have no such restriction.
CPython implements something called the Global Interpreter Lock (GIL), which protects access to Python objects by preventing multiple threads from executing Python bytecode through the Python interpreter at once.
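To see the effect of the GIL in practice, here is a minimal sketch that times a CPU-bound, pure-Python loop run twice in a row and then run in two threads. The function names (`count_down`, `time_serial`, `time_threaded`) and the loop size are made up for illustration. On a standard CPython build you would typically expect the threaded version to take about as long as the serial one, but exact timings depend on your machine and Python version.

```python
import threading
import time


def count_down(n):
    """Pure-Python, CPU-bound busy loop (illustrative workload)."""
    while n > 0:
        n -= 1


def time_serial(n, repeats=2):
    """Run the workload `repeats` times, one after the other."""
    start = time.perf_counter()
    for _ in range(repeats):
        count_down(n)
    return time.perf_counter() - start


def time_threaded(n, repeats=2):
    """Run the workload in `repeats` threads at the same time."""
    threads = [
        threading.Thread(target=count_down, args=(n,)) for _ in range(repeats)
    ]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start


if __name__ == "__main__":
    n = 2_000_000  # arbitrary size; increase it if the timings are too short
    print(f"serial:   {time_serial(n):.3f} s")
    print(f"threaded: {time_threaded(n):.3f} s")
```

Threads can still be useful in Python for work that spends most of its time waiting (for example on disk or network I/O), because the GIL is released while a thread is blocked.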
Because of this initial design decision in CPython, subsequent developments in Python have come to rely on this feature (the GIL) being present, so removing it in future versions of Python is unlikely to happen any time soon. However, that has not stopped the Python community from developing creative ways of getting round this limitation of the GIL, and we will look at a few of these solutions in this workshop.
The GIL "problem" is an interesting topic that comes up frequently in the Python community. If you are interested in reading more about it, this is a good starting place: https://wiki.python.org/moin/GlobalInterpreterLock
Parallel approaches to Python are normally based around running multiple instances of the Python interpreter, each with its own copy of the code being run and each with its own separate GIL. These processes are managed at the operating system level, with each process performing part of the shared work. However, there is a high cost to communicating and sharing data between processes, compared to true multi-threaded programs, where a single process can contain multiple threads reading and writing the same data (when safe to do so, of course). The 'multiple process' approach is used by the aptly named multiprocessing module, covered in Part 1. (And in a sort-of-similar way by the mpi4py package in Part 3.)
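As a small taste of the 'multiple process' approach, the sketch below uses a `multiprocessing.Pool` to share a list of inputs between two worker processes. The `square` function and the choice of two workers are illustrative assumptions, not taken from the Part 1 material.

```python
from multiprocessing import Pool


def square(x):
    """Work done by each worker process (illustrative example)."""
    return x * x


if __name__ == "__main__":
    numbers = list(range(10))
    # Each worker is a separate Python interpreter with its own GIL.
    with Pool(processes=2) as pool:
        # map() splits the inputs across the workers and returns
        # the results in the same order as the inputs.
        results = pool.map(square, numbers)
    print(results)
```

The `if __name__ == "__main__":` guard matters on platforms where worker processes are started by re-importing the main script: without it, each worker would try to create its own pool.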
Other approaches involve 'dropping out' of Python to C or Fortran code, which supports multi-threading (through external libraries such as OpenMP), doing the parallel computation in the compiled code, then returning the results to Python. Developments such as the numba library (Part 2) and Cython extensions (Part 4) make this easier to do with Python, and even automate some of the parallelisation-finding tasks that would be required for traditional parallel programming in C, C++, or Fortran.
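To give a flavour of this approach, here is a hedged sketch using numba's `njit` decorator with `parallel=True` and `prange`, which asks numba to compile the function and spread the loop across threads outside the GIL. The function `sum_of_squares` is a made-up example, and the `try`/`except` fallback is an assumption added so that the sketch still runs (serially, in plain Python) if numba is not installed.

```python
try:
    from numba import njit, prange
except ImportError:
    # Assumed fallback: no-op decorator and serial range, so the
    # sketch still runs without numba (just without the speed-up).
    def njit(*args, **kwargs):
        def decorator(func):
            return func
        return decorator

    prange = range


@njit(parallel=True)
def sum_of_squares(n):
    """Sum i*i for i in [0, n); the loop is a parallel reduction under numba."""
    total = 0.0
    for i in prange(n):
        total += float(i) * float(i)
    return total


if __name__ == "__main__":
    # The first call includes compilation time when numba is present.
    print(sum_of_squares(1_000_000))
```

The first call to a numba-compiled function is slower than later calls because it includes the compilation step, so time the second call if you want to compare against plain Python.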
Each mini-tutorial is presented as a Jupyter notebook. If you are familiar with Jupyter notebooks, I recommend using a fresh one for each tutorial, and typing up your own notes or observations alongside the code examples. If you have not used a Jupyter notebook before, they are fairly easy to get started with. Just start one from a terminal using:
jupyter notebook
Then open a new Python 3 notebook. The documentation for using notebooks is here: https://jupyter-notebook.readthedocs.io/en/stable/notebook.html
An IPython console is equally good for these tutorials; use whichever you prefer. You should be able to start an IPython console with:
ipython
Check that it shows the correct version of Python after it starts up (Python 3).
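If you would rather check the version from inside the console or a notebook cell, a quick sketch using the standard `sys` module is shown here (the helper name `running_python3` is made up for illustration):

```python
import sys


def running_python3():
    """Return True if the current interpreter is Python 3."""
    return sys.version_info.major == 3


print(sys.version)
print("Python 3:", running_python3())
```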
Finally, you may use an editor and run the scripts/programs manually from the console. However, in some of the tutorials this will mean a few extra steps, as they rely on some of the interactive features of IPython/Jupyter notebooks. We will do our best to support whichever method you choose to work through the workshop materials.
I wanted the focus of this workshop to be experimentation and exploration - not a formal tutorial with set exercises and strict time limits (though obviously we are limited to the wall-clock time of the workshop schedule).
You are strongly encouraged to play around with writing your own code and to try breaking the examples once you have gone through them the first time. At the end of most of the tutorials, there will be links to further examples and documentation to support this. Please feel free to experiment!
It seems like there is always another Python parallel programming library that I haven't heard of before, so please feel free to add suggestions in the discussion session at the end (10-15 mins). I will try to collate these contributions into a webpage that I will provide a link to after the workshop is over, which hopefully will become a handy reference for parallel programming in Python.