# Numba vs. Parakeet vs. Cython¶

This notebook is derived from a blog post by Jake Vanderplas on the blog Pythonic Perambulations and updated by Olivier Grisel to add Parakeet.

In [1]:
import numpy as np

X = np.random.random((1000, 3))
X_wide = np.random.random((1000, 100))


In [2]:
def pairwise_numpy(X):
return np.sqrt(((X[:, None, :] - X) ** 2).sum(-1))
%timeit pairwise_numpy(X)

10 loops, best of 3: 64.7 ms per loop


## Pure Python Function¶

In [3]:
def pairwise_python(X):
M = X.shape[0]
N = X.shape[1]
D = np.empty((M, M), dtype=np.float)
for i in range(M):
for j in range(M):
d = 0.0
for k in range(N):
tmp = X[i, k] - X[j, k]
d += tmp * tmp
D[i, j] = np.sqrt(d)
return D

In [4]:
%timeit pairwise_python(X)

1 loops, best of 3: 9.51 s per loop


Alternative python / numpy implementation closer to the parakeet example from the examples folder of its git repo to be fair.

In [5]:
def pairwise_python2(X):
n_samples = X.shape[0]
result = np.zeros((n_samples, n_samples), dtype=X.dtype)
for i in xrange(X.shape[0]):
for j in xrange(X.shape[0]):
result[i, j] = np.sqrt(np.sum((X[i, :] - X[j, :]) ** 2))
return result

In [6]:
%timeit pairwise_python2(X)

1 loops, best of 3: 18.2 s per loop

In [7]:
#np.allclose(pairwise_python(X), pairwise_python2(X))


## Numba Wrapper¶

Note: I did not use master as I get a TypeError: 'numba.numbawrapper.NumbaCompiledWrapper' object is not callable when calling it.

In [8]:
import numba

numba.__version__

Out[8]:
'0.9.0'
In [9]:
from numba import double
from numba.decorators import jit, autojit

pairwise_numba = autojit(pairwise_python)

%timeit pairwise_numba(X)

1 loops, best of 3: 6.72 ms per loop

In [10]:
%timeit pairwise_numba(X_wide)

10 loops, best of 3: 97.3 ms per loop

In [11]:
pairwise_numba2 = autojit(pairwise_python2)

%timeit pairwise_numba2(X)

1 loops, best of 3: 13.9 s per loop


## Parakeet Wrapper¶

Parakeet is installed from the master branch of the git repo on Jul. 3 2013

In [12]:
from parakeet import jit

pairwise_parakeet = jit(pairwise_python)

%timeit pairwise_parakeet(X)

100 loops, best of 3: 12.3 ms per loop

In [13]:
%timeit pairwise_parakeet(X_wide)

10 loops, best of 3: 101 ms per loop

In [14]:
pairwise_parakeet2 = jit(pairwise_python2)
%timeit pairwise_parakeet2(X)

1 loops, best of 3: 13 ms per loop

In [15]:
%timeit pairwise_parakeet2(X_wide)

10 loops, best of 3: 103 ms per loop

In [16]:
np.allclose(pairwise_parakeet(X), pairwise_parakeet2(X))

Out[16]:
True
In [17]:
np.allclose(pairwise_parakeet(X_wide), pairwise_parakeet2(X_wide))

Out[17]:
True

## Optimized Cython Function¶

In [18]:
!cython --version

Cython version 0.19.1

In [19]:
%load_ext cythonmagic

In [20]:
%%cython

import numpy as np
cimport cython
from libc.math cimport sqrt

@cython.boundscheck(False)
@cython.wraparound(False)
def pairwise_cython(double[:, ::1] X):
cdef int M = X.shape[0]
cdef int N = X.shape[1]
cdef double tmp, d
cdef double[:, ::1] D = np.empty((M, M), dtype=np.float64)
for i in range(M):
for j in range(M):
d = 0.0
for k in range(N):
tmp = X[i, k] - X[j, k]
d += tmp * tmp
D[i, j] = sqrt(d)
return np.asarray(D)

In [21]:
%timeit pairwise_cython(X)

100 loops, best of 3: 6.57 ms per loop

In [22]:
%timeit pairwise_cython(X_wide)

10 loops, best of 3: 95.5 ms per loop


## Fortran/F2Py¶

In [23]:
%%file pairwise_fortran.f

subroutine pairwise_fortran(X,D,m,n)
integer :: n,m
double precision, intent(in) :: X(m,n)
double precision, intent(out) :: D(m,m)
integer :: i,j,k
double precision :: r
do i = 1,m
do j = 1,m
r = 0
do k = 1,n
r = r + (X(i,k) - X(j,k)) * (X(i,k) - X(j,k))
end do
D(i,j) = sqrt(r)
end do
end do
end subroutine pairwise_fortran

Overwriting pairwise_fortran.f

In [24]:
# Compile the Fortran with f2py.
# We'll direct the output into /dev/null so it doesn't fill the screen
!f2py -c pairwise_fortran.f -m pairwise_fortran > /dev/null

In [25]:
from pairwise_fortran import pairwise_fortran
XF = np.asarray(X, order='F')
%timeit pairwise_fortran(XF)

100 loops, best of 3: 10.8 ms per loop

In [26]:
XF_wide = np.asarray(X_wide, order='F')
%timeit pairwise_fortran(XF_wide)

10 loops, best of 3: 111 ms per loop


## Scipy Pairwise Distances¶

In [27]:
from scipy.spatial.distance import cdist
%timeit cdist(X, X)

100 loops, best of 3: 7.37 ms per loop

In [28]:
%timeit cdist(X_wide, X_wide)

10 loops, best of 3: 97.6 ms per loop


## Scikit-learn Pairwise Distances¶

In [29]:
from sklearn.metrics import euclidean_distances
%timeit euclidean_distances(X, X)

100 loops, best of 3: 16.2 ms per loop

In [30]:
%timeit euclidean_distances(X_wide, X_wide)

10 loops, best of 3: 22.4 ms per loop


## Remarks and analysis¶

• This was run on a macbook air 2012 2Ghz Core i7 with the default system blas implementation (no MKL) for numpy
• Some of the timings vary quite a lot from Jake's original post.
• Numba seems to be able to go twice faster than Parakeet when n_features is small (e.g. 3 in Jake's original setting)
• Numba fails to optimize the python version that uses the numpy notation to compute distances on pairs of rows
• Maybe calling numba nopython=True would catch this but I did not understand how to add this option and make the first example work so I am not sure how to use that option correctly
• Parakeet is almost as fast as numba when n_features grows to more realistic sizes (e.g. 100)
• Parakeet can work as efficiently with the numpy row slice expression without any issue which allow for a more natural and concise syntax.
• Blas (as used in the scikit-learn implementation) is still a killer as soon as all the dimensions are not small (note: the scikit-learn implementation can be less numerically stable though)