Numba vs. Parakeet vs. Cython

This notebook is derived from a blog post by Jake Vanderplas on the blog Pythonic Perambulations and updated by Olivier Grisel to add Parakeet.

In [1]:
import numpy as np

X = np.random.random((1000, 3))
X_wide = np.random.random((1000, 100))

Numpy Function With Broadcasting

In [2]:
def pairwise_numpy(X):
    return np.sqrt(((X[:, None, :] - X) ** 2).sum(-1))
%timeit pairwise_numpy(X)
10 loops, best of 3: 64.7 ms per loop

Pure Python Function

In [3]:
def pairwise_python(X):
    M = X.shape[0]
    N = X.shape[1]
    D = np.empty((M, M), dtype=np.float)
    for i in range(M):
        for j in range(M):
            d = 0.0
            for k in range(N):
                tmp = X[i, k] - X[j, k]
                d += tmp * tmp
            D[i, j] = np.sqrt(d)
    return D
In [4]:
%timeit pairwise_python(X)
1 loops, best of 3: 9.51 s per loop

Alternative python / numpy implementation closer to the parakeet example from the examples folder of its git repo to be fair.

In [5]:
def pairwise_python2(X):
    n_samples = X.shape[0]
    result = np.zeros((n_samples, n_samples), dtype=X.dtype)
    for i in xrange(X.shape[0]):
        for j in xrange(X.shape[0]):
            result[i, j] = np.sqrt(np.sum((X[i, :] - X[j, :]) ** 2))
    return result
In [6]:
%timeit pairwise_python2(X)
1 loops, best of 3: 18.2 s per loop
In [7]:
#np.allclose(pairwise_python(X), pairwise_python2(X))

Numba Wrapper

Note: I did not use master as I get a TypeError: 'numba.numbawrapper.NumbaCompiledWrapper' object is not callable when calling it.

In [8]:
import numba

In [9]:
from numba import double
from numba.decorators import jit, autojit

pairwise_numba = autojit(pairwise_python)

%timeit pairwise_numba(X)
1 loops, best of 3: 6.72 ms per loop
In [10]:
%timeit pairwise_numba(X_wide)
10 loops, best of 3: 97.3 ms per loop
In [11]:
pairwise_numba2 = autojit(pairwise_python2)

%timeit pairwise_numba2(X)
1 loops, best of 3: 13.9 s per loop

Parakeet Wrapper

Parakeet is installed from the master branch of the git repo on Jul. 3 2013

In [12]:
from parakeet import jit

pairwise_parakeet = jit(pairwise_python)

%timeit pairwise_parakeet(X)
100 loops, best of 3: 12.3 ms per loop
In [13]:
%timeit pairwise_parakeet(X_wide)
10 loops, best of 3: 101 ms per loop
In [14]:
pairwise_parakeet2 = jit(pairwise_python2)
%timeit pairwise_parakeet2(X)
1 loops, best of 3: 13 ms per loop
In [15]:
%timeit pairwise_parakeet2(X_wide)
10 loops, best of 3: 103 ms per loop
In [16]:
np.allclose(pairwise_parakeet(X), pairwise_parakeet2(X))
In [17]:
np.allclose(pairwise_parakeet(X_wide), pairwise_parakeet2(X_wide))

Optimized Cython Function

In [18]:
!cython --version
Cython version 0.19.1
In [19]:
%load_ext cythonmagic
In [20]:

import numpy as np
cimport cython
from libc.math cimport sqrt

def pairwise_cython(double[:, ::1] X):
    cdef int M = X.shape[0]
    cdef int N = X.shape[1]
    cdef double tmp, d
    cdef double[:, ::1] D = np.empty((M, M), dtype=np.float64)
    for i in range(M):
        for j in range(M):
            d = 0.0
            for k in range(N):
                tmp = X[i, k] - X[j, k]
                d += tmp * tmp
            D[i, j] = sqrt(d)
    return np.asarray(D)
In [21]:
%timeit pairwise_cython(X)
100 loops, best of 3: 6.57 ms per loop
In [22]:
%timeit pairwise_cython(X_wide)
10 loops, best of 3: 95.5 ms per loop


In [23]:
%%file pairwise_fortran.f

      subroutine pairwise_fortran(X,D,m,n)
          integer :: n,m
          double precision, intent(in) :: X(m,n)
          double precision, intent(out) :: D(m,m) 
          integer :: i,j,k
          double precision :: r 
          do i = 1,m 
              do j = 1,m 
                  r = 0
                  do k = 1,n 
                      r = r + (X(i,k) - X(j,k)) * (X(i,k) - X(j,k)) 
                  end do 
                  D(i,j) = sqrt(r) 
              end do 
          end do 
      end subroutine pairwise_fortran
Overwriting pairwise_fortran.f
In [24]:
# Compile the Fortran with f2py.
# We'll direct the output into /dev/null so it doesn't fill the screen
!f2py -c pairwise_fortran.f -m pairwise_fortran > /dev/null
In [25]:
from pairwise_fortran import pairwise_fortran
XF = np.asarray(X, order='F')
%timeit pairwise_fortran(XF)
100 loops, best of 3: 10.8 ms per loop
In [26]:
XF_wide = np.asarray(X_wide, order='F')
%timeit pairwise_fortran(XF_wide)
10 loops, best of 3: 111 ms per loop

Scipy Pairwise Distances

In [27]:
from scipy.spatial.distance import cdist
%timeit cdist(X, X)
100 loops, best of 3: 7.37 ms per loop
In [28]:
%timeit cdist(X_wide, X_wide)
10 loops, best of 3: 97.6 ms per loop

Scikit-learn Pairwise Distances

In [29]:
from sklearn.metrics import euclidean_distances
%timeit euclidean_distances(X, X)
100 loops, best of 3: 16.2 ms per loop
In [30]:
%timeit euclidean_distances(X_wide, X_wide)
10 loops, best of 3: 22.4 ms per loop

Remarks and analysis

  • This was run on a macbook air 2012 2Ghz Core i7 with the default system blas implementation (no MKL) for numpy
  • Some of the timings vary quite a lot from Jake's original post.
  • Numba seems to be able to go twice faster than Parakeet when n_features is small (e.g. 3 in Jake's original setting)
  • Numba fails to optimize the python version that uses the numpy notation to compute distances on pairs of rows
  • Maybe calling numba nopython=True would catch this but I did not understand how to add this option and make the first example work so I am not sure how to use that option correctly
  • Parakeet is almost as fast as numba when n_features grows to more realistic sizes (e.g. 100)
  • Parakeet can work as efficiently with the numpy row slice expression without any issue which allow for a more natural and concise syntax.
  • Blas (as used in the scikit-learn implementation) is still a killer as soon as all the dimensions are not small (note: the scikit-learn implementation can be less numerically stable though)