#!/usr/bin/env python
# coding: utf-8

# # Numba vs. Cython: Take 2

# *This notebook first appeared as a*
# [*post*](http://jakevdp.github.io/blog/2012/08/24/numba-vs-cython/)
# *by Jake Vanderplas on the blog*
# [*Pythonic Perambulations*](http://jakevdp.github.io)

# Last summer I wrote a [post](http://jakevdp.github.io/blog/2012/08/24/numba-vs-cython/)
# comparing the performance of [Numba](http://numba.pydata.org/) and [Cython](http://cython.org/)
# for optimizing array-based computation. Since posting, the page has received thousands of hits
# and generated a number of interesting discussions.
# But in the meantime, the Numba package has come a long way in both its interface and its
# performance.
#
# Here I want to revisit those timing comparisons with a more recent Numba release, using the newer
# and more convenient ``autojit`` syntax, and also add a few additional benchmarks for
# completeness. I've also written this post entirely within an IPython notebook, so it can be
# easily downloaded and modified.
#
# As before, I'll use a **pairwise distance** function. This will take an array representing
# ``M`` points in ``N`` dimensions, and return the ``M x M`` matrix of pairwise distances.
# This is a nice test function for a few reasons. First of all, it's a very clean and
# well-defined test. Second of all, it illustrates the kind of array-based operation that
# is common in statistics, data mining, and machine learning. Third, it is a function that
# results in large memory consumption if the standard numpy broadcasting approach is used
# (it requires a temporary array containing ``M * M * N`` elements), making it a good
# candidate for an alternate approach.

# We'll start by defining the array which we'll use for the benchmarks: one thousand points in
# three dimensions.

# In[1]:

import numpy as np
X = np.random.random((1000, 3))


# ## Numpy Function With Broadcasting

# We'll start with a typical numpy broadcasting approach to this problem. Numpy
# broadcasting is an abstraction that allows loops over array indices to be
# executed in compiled C. For many applications, this is extremely fast and efficient.
# Unfortunately, there is a problem with broadcasting approaches that comes up here:
# it ends up allocating hidden temporary arrays which can eat up memory and cause
# computational overhead. Nevertheless, it's a good comparison to have. The function
# looks like this:

# In[2]:

def pairwise_numpy(X):
    return np.sqrt(((X[:, None, :] - X) ** 2).sum(-1))

get_ipython().run_line_magic('timeit', 'pairwise_numpy(X)')


# ## Pure Python Function

# A loop-based solution avoids the overhead associated with temporary arrays,
# and can be written like this:

# In[3]:

def pairwise_python(X):
    M = X.shape[0]
    N = X.shape[1]
    D = np.empty((M, M), dtype=np.float64)
    for i in range(M):
        for j in range(M):
            d = 0.0
            for k in range(N):
                tmp = X[i, k] - X[j, k]
                d += tmp * tmp
            D[i, j] = np.sqrt(d)
    return D

get_ipython().run_line_magic('timeit', 'pairwise_python(X)')


# As we see, it is over 100 times slower than the numpy broadcasting approach!
# This is due to Python's dynamic type checking, which can drastically slow down
# nested loops. With these two solutions, we're left with a tradeoff between
# efficiency of computation and efficiency of memory usage.
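# To make the memory side of that tradeoff concrete, here's a quick
# back-of-the-envelope check (an illustrative sketch, not one of the
# benchmarks): the broadcasting version materializes the full
# ``M x M x N`` difference array before reducing over the last axis.

# In[ ]:

# Illustrative: size of the hidden temporary (X[:, None, :] - X)
# allocated by the broadcasting approach.
M, N = X.shape
temp_bytes = M * M * N * X.itemsize
print("broadcasting temporary: {0:.0f} MB".format(temp_bytes / 1e6))
# ~24 MB for M = 1000, N = 3, and it grows quadratically with M.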
# This is where tools like Numba and Cython become vital.
#
# I should note that there exist alternative Python interpreters which improve
# on the computational inefficiency of the Python runtime, one of which is
# the popular [PyPy](http://pypy.org/) project.
# PyPy is extremely interesting. However,
# it's currently all but useless for scientific applications,
# because it does not support NumPy, and by extension cannot run
# code based on SciPy, scikit-learn, matplotlib, or virtually any other
# package that makes Python a useful tool for scientific computing.
# For that reason, I won't consider PyPy here.

# ## Numba Wrapper

# [Numba](http://numba.pydata.org/) is an LLVM-based compiler for Python code, which
# allows code written in Python to be converted to highly efficient compiled code
# at run-time. Due to its dependencies, compiling it can be a challenge. To experiment
# with Numba, I recommend using a local installation of [Anaconda](https://store.continuum.io/),
# the free cross-platform Python distribution which includes Numba and all its prerequisites
# within a single easy-to-install package.
#
# Numba is extremely simple to use. We just wrap our Python function with ``autojit`` (JIT stands
# for "just-in-time" compilation) to automatically create an efficient, compiled version of the
# function:

# In[4]:

from numba import double
from numba.decorators import jit, autojit

pairwise_numba = autojit(pairwise_python)

get_ipython().run_line_magic('timeit', 'pairwise_numba(X)')


# Adding this simple wrapper speeds up our execution by a factor of over 1400!
# For those keeping track, this is about 50% faster than the version of Numba that
# I tested last August on the same machine.

# ## Optimized Cython Function

# [Cython](http://cython.org) is another package which is built to convert Python-like statements
# into compiled code. The language is actually a superset of Python which acts as a sort of
# hybrid between Python and C. By adding type annotations to Python code and running
# it through the Cython compiler, we obtain fast compiled code. Here is a
# highly-optimized Cython version of the pairwise distance function, which we compile
# using IPython's Cython magic:

# In[5]:

get_ipython().run_line_magic('load_ext', 'cythonmagic')


# In[6]:

get_ipython().run_cell_magic('cython', '', '\nimport numpy as np\ncimport cython\nfrom libc.math cimport sqrt\n\n@cython.boundscheck(False)\n@cython.wraparound(False)\ndef pairwise_cython(double[:, ::1] X):\n    cdef int M = X.shape[0]\n    cdef int N = X.shape[1]\n    cdef double tmp, d\n    cdef double[:, ::1] D = np.empty((M, M), dtype=np.float64)\n    for i in range(M):\n        for j in range(M):\n            d = 0.0\n            for k in range(N):\n                tmp = X[i, k] - X[j, k]\n                d += tmp * tmp\n            D[i, j] = sqrt(d)\n    return np.asarray(D)\n')


# In[7]:

get_ipython().run_line_magic('timeit', 'pairwise_cython(X)')


# The Cython version, despite all the optimization, is a few percent *slower* than
# the result of the simple Numba decorator! I should emphasize here that I have
# years of experience with Cython, and in this function I've used every Cython
# optimization there is
# (if any Cython super-experts are out there and would like to correct me
# on that, please let me know in the blog comment thread!). By comparison, the Numba
# version is a simple, unadorned wrapper around plainly-written Python code.
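# Before moving on, a quick sanity check (a minimal sketch, added for
# completeness): the numpy, pure Python, Numba, and Cython implementations
# should all agree to within floating-point precision.

# In[ ]:

# Illustrative: confirm that all four implementations compute the
# same distance matrix.
D_ref = pairwise_numpy(X)
for func in (pairwise_python, pairwise_numba, pairwise_cython):
    assert np.allclose(func(X), D_ref)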
# ## Fortran/F2Py

# Another option for fast computation is to write a Fortran function directly, and use
# the ``f2py`` package to interface with the function. We can write the function
# as follows:

# In[8]:

get_ipython().run_cell_magic('file', 'pairwise_fort.f', '\n      subroutine pairwise_fort(X,D,m,n)\n      integer :: n,m\n      double precision, intent(in) :: X(m,n)\n      double precision, intent(out) :: D(m,m)\n      integer :: i,j,k\n      double precision :: r\n      do i = 1,m\n         do j = 1,m\n            r = 0\n            do k = 1,n\n               r = r + (X(i,k) - X(j,k)) * (X(i,k) - X(j,k))\n            end do\n            D(i,j) = sqrt(r)\n         end do\n      end do\n      end subroutine pairwise_fort\n')


# We can then use the shell interface to compile the Fortran function. In order
# to hide the output of this operation, we direct it into ``/dev/null`` (note: I
# tested this on Linux, and it may have to be modified for Mac or Windows).

# In[9]:

# Compile the Fortran with f2py.
# We'll direct the output into /dev/null so it doesn't fill the screen
get_ipython().system('f2py -c pairwise_fort.f -m pairwise_fort > /dev/null')


# We can import the resulting code into Python to time the execution
# of the function. To make sure we're being fair, we'll first convert
# the test array to Fortran ordering, so that no conversion needs to
# happen in the background:

# In[10]:

from pairwise_fort import pairwise_fort
XF = np.asarray(X, order='F')
get_ipython().run_line_magic('timeit', 'pairwise_fort(XF)')


# The result is nearly a factor of two slower than the Cython and Numba versions.
#
# Now, I should note here that I am most definitely **not** an expert on Fortran, so
# there may very well be optimizations missing from the above code. If you see any
# obvious problems here, please let me know in the blog comments.

# ## Scipy Pairwise Distances

# Because pairwise distance computations are such a common application in scientific
# computing, both Scipy and scikit-learn have optimized routines to compute them.
# The Scipy version is a Python wrapper of C code, and can be called as follows:

# In[11]:

from scipy.spatial.distance import cdist
get_ipython().run_line_magic('timeit', 'cdist(X, X)')


# ``cdist`` is about 50% slower than Numba.

# ## Scikit-learn Pairwise Distances

# Scikit-learn contains the ``euclidean_distances`` function, which works on sparse
# matrices as well as numpy arrays, and is implemented in Cython:

# In[12]:

from sklearn.metrics import euclidean_distances
get_ipython().run_line_magic('timeit', 'euclidean_distances(X, X)')


# ``euclidean_distances`` is several times slower than the Numba pairwise function
# on dense arrays.

# ## Comparing the Results

# Out of all the above pairwise distance methods, unadorned Numba is the clear winner,
# with highly-optimized Cython coming in a close second. Both beat out the other
# options by a large amount.
#
# As a summary of the results, we'll create a bar chart to visualize the timings:
#
# *Edit: I changed the "fortran" label to "fortran/f2py" to make clear that this
# is not raw Fortran.*

# In[13]:

get_ipython().run_line_magic('pylab', 'inline')


# In[14]:

labels = ['python\nloop', 'numpy\nbroadc.', 'sklearn', 'fortran/\nf2py',
          'scipy', 'cython', 'numba']
timings = [13.4, 0.111, 0.0356, 0.0167, 0.0129, 0.00987, 0.00912]
x = np.arange(len(labels))

ax = plt.axes(xticks=x, yscale='log')
ax.bar(x - 0.3, timings, width=0.6, alpha=0.4, bottom=1E-6)

ax.grid()
ax.set_xlim(-0.5, len(labels) - 0.5)
ax.set_ylim(1E-3, 1E2)
ax.xaxis.set_major_formatter(plt.FuncFormatter(lambda i, loc: labels[int(i)]))
ax.set_ylabel('time (s)')
ax.set_title("Pairwise Distance Timings")


# Note that this is log-scaled, so the vertical space between two
# grid lines indicates a factor of 10 difference in computation time!
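# For reference, the timings above can be turned into speedup factors
# relative to the pure Python loop (a quick illustrative calculation
# using the same ``labels`` and ``timings`` lists):

# In[ ]:

# Illustrative: speedup of each approach over the pure Python loop.
for label, t in zip(labels, timings):
    print('{0:>15s}: {1:8.1f}x'.format(label.replace('\n', ' '), timings[0] / t))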
# When I compared Cython and Numba last August, I found that Cython was about
# 30% faster than Numba. Since then, Numba has had a few more releases, and both
# the interface and the performance have improved. On
# top of being much easier to use (i.e. automatic type inference by ``autojit``),
# it's now about 50% faster, and is even a few percent faster than the Cython option.
#
# And though I've seen similar things for months, I'm still incredibly impressed
# by the results enabled by Numba: *a single function decorator results in a
# 1300x speedup of simple Python code.*
# I'm becoming more and more convinced that Numba is
# the future of fast scientific computing in Python.

# *This post was written entirely as an IPython notebook.*
# *The full notebook can be downloaded*
# [*here*](http://jakevdp.github.io/downloads/notebooks/NumbaCython.ipynb),
# *or viewed statically on*
# [*nbviewer*](http://nbviewer.ipython.org/url/jakevdp.github.io/downloads/notebooks/NumbaCython.ipynb)