# A Python Performance Optimization Exercise¶

## Author Jean-François Puget¶

A revisit of the Python micro benchmarks available at https://github.com/JuliaLang/julia/blob/master/test/perf/micro/perf.py

The code used in this notebook is discussed at length in Python Meets Julia Micro Performance

Timings below are for Anaconda 64 bits with Python 3.5.1 and Numba 0.23.1 running on a Windows laptop (Lenovo W520). Timings on another machine with another Python version can be quite different.

First, let's load useful extensions and import useful modules

In [1]:
%load_ext Cython

import random
import numpy as np
from numba import jit
import sys

if sys.version_info < (3,):
range = xrange


# Fibonacci¶

### Naive code¶

The base line is gthe one used in Julia benchmarks. It is a straightforward, frecursive implementation.

In [2]:
def fib(n):
if n<2:
return n
return fib(n-1)+fib(n-2)


Sanity check.

In [3]:
assert fib(20) == 6765


### Cython¶

Let's try various versions.

In [4]:
%%cython
def fib_cython(n):
if n<2:
return n
return fib_cython(n-1)+fib_cython(n-2)

cpdef inline fib_cython_inline(n):
if n<2:
return n
return fib_cython_inline(n-1)+fib_cython_inline(n-2)

cpdef long fib_cython_type(long n):
if n<2:
return n
return fib_cython_type(n-1)+fib_cython_type(n-2)

cpdef long fib_cython_type_inline(long n):
if n<2:
return n
return fib_cython_type_inline(n-1)+fib_cython_type_inline(n-2)


Sanity checks

In [5]:
assert fib_cython(20) == 6765
assert fib_cython_inline(20) == 6765
assert fib_cython_type(20) == 6765
assert fib_cython_type_inline(20) == 6765


### Numba¶

Let's try three versions, with and without type.

In [6]:
from numba import int32, int64
@jit
def fib_numba(n):
if n<2:
return n
return fib_numba(n-1)+fib_numba(n-2)

@jit(int32(int32))
def fib_numba_type32(n):
if n<2:
return n
return fib_numba_type32(n-1)+fib_numba_type32(n-2)

@jit(int64(int64))
def fib_numba_type64(n):
if n<2:
return n
return fib_numba_type64(n-1)+fib_numba_type64(n-2)


Sanity check

In [7]:
assert fib_numba(20) == 6765
assert fib_numba_type32(20) == 6765
assert fib_numba_type64(20) == 6765


### Using floats¶

In [8]:
def ffib(n):
if n<2.0:
return n
return ffib(n-1)+ffib(n-2)


### Timings¶

In [9]:
%timeit fib(20)
%timeit fib_cython(20)
%timeit fib_cython_inline(20)
%timeit fib_cython_type(20)
%timeit fib_cython_type_inline(20)
%timeit fib_numba(20)
%timeit fib_numba_type32(20)
%timeit fib_numba_type64(20)
%timeit ffib(20.0)

100 loops, best of 3: 3.77 ms per loop
1000 loops, best of 3: 1.47 ms per loop
1000 loops, best of 3: 759 µs per loop
10000 loops, best of 3: 24.4 µs per loop
10000 loops, best of 3: 24.6 µs per loop
100 loops, best of 3: 4.89 ms per loop
100 loops, best of 3: 5 ms per loop
100 loops, best of 3: 5.01 ms per loop
100 loops, best of 3: 5.26 ms per loop


Let's try a larger one.

In [10]:
n = 30
%timeit fib(n)
%timeit fib_cython(n)
%timeit fib_cython_inline(n)
%timeit fib_cython_type(n)
%timeit fib_cython_type_inline(n)
%timeit fib_numba(n)
%timeit ffib(n*1.0)

1 loops, best of 3: 442 ms per loop
10 loops, best of 3: 178 ms per loop
10 loops, best of 3: 94.2 ms per loop
100 loops, best of 3: 3.05 ms per loop
100 loops, best of 3: 3.01 ms per loop
1 loops, best of 3: 600 ms per loop
1 loops, best of 3: 648 ms per loop


## Sequential Fibonacci¶

The recursive call is not the best way to compute Fibonacci numbers. It is better to accumulate them as follows.

### Functools¶

In [51]:
from functools import lru_cache as cache

@cache(maxsize=None)
def fib_cache(n):
if n<2:
return n
return fib_cache(n-1)+fib_cache(n-2)


### Plain Python¶

In [11]:
def fib_seq(n):
if n < 2:
return n
a,b = 1,0
for i in range(n-1):
a,b = a+b,a
return a


### Numba¶

In [12]:
@jit
def fib_seq_numba(n):
if n < 2:
return n
a,b = 1,0
for i in range(n-1):
a,b = a+b,a
return a


### Cython¶

In [13]:
%%cython

def fib_seq_cython(n):
if n < 2:
return n
a,b = 1,0
for i in range(n-1):
a,b = a+b,a
return a

cpdef long fib_seq_cython_type(long n):
if n < 2:
return n
cdef long a,b
a,b = 1,0
for i in range(n-1):
a,b = a+b,a
return a


### Sanity checks¶

In [14]:
assert fib_seq(20) == 6765
assert fib_seq_cython(20) == 6765
assert fib_seq_cython_type(20) == 6765
assert fib_seq_numba(20) == 6765


### Timings¶

In [52]:
%timeit fib_cache(20)
%timeit fib_seq(20)
%timeit fib_seq_cython(20)
%timeit fib_seq_cython_type(20)
%timeit fib_seq_numba(20)

The slowest run took 188.04 times longer than the fastest. This could mean that an intermediate result is being cached
10000000 loops, best of 3: 127 ns per loop
1000000 loops, best of 3: 1.81 µs per loop
The slowest run took 4.46 times longer than the fastest. This could mean that an intermediate result is being cached
1000000 loops, best of 3: 959 ns per loop
The slowest run took 15.66 times longer than the fastest. This could mean that an intermediate result is being cached
10000000 loops, best of 3: 82 ns per loop
The slowest run took 31.77 times longer than the fastest. This could mean that an intermediate result is being cached
1000000 loops, best of 3: 216 ns per loop


### Arbitrary precision check¶

In [16]:
fib_seq(100)

Out[16]:
354224848179261915075

# Quicksort¶

### Pure Python¶

Code used in Julia benchmarks

In [17]:
def qsort_kernel(a, lo, hi):
i = lo
j = hi
while i < hi:
pivot = a[(lo+hi) // 2]
while i <= j:
while a[i] < pivot:
i += 1
while a[j] > pivot:
j -= 1
if i <= j:
a[i], a[j] = a[j], a[i]
i += 1
j -= 1
if lo < j:
qsort_kernel(a, lo, j)
lo = i
j = hi
return a

def benchmark_qsort():
lst = [ random.random() for i in range(1,5000) ]
qsort_kernel(lst, 0, len(lst)-1)


### Numba¶

In [18]:
@jit
def qsort_kernel_numba(a, lo, hi):
i = lo
j = hi
while i < hi:
pivot = a[(lo+hi) // 2]
while i <= j:
while a[i] < pivot:
i += 1
while a[j] > pivot:
j -= 1
if i <= j:
a[i], a[j] = a[j], a[i]
i += 1
j -= 1
if lo < j:
qsort_kernel_numba(a, lo, j)
lo = i
j = hi
return a

def benchmark_qsort_numba():
lst = [ random.random() for i in range(1,5000) ]
qsort_kernel_numba(lst, 0, len(lst)-1)


With Numpy

In [19]:
@jit
def qsort_kernel_numba_numpy(a, lo, hi):
i = lo
j = hi
while i < hi:
pivot = a[(lo+hi) // 2]
while i <= j:
while a[i] < pivot:
i += 1
while a[j] > pivot:
j -= 1
if i <= j:
a[i], a[j] = a[j], a[i]
i += 1
j -= 1
if lo < j:
qsort_kernel_numba_numpy(a, lo, j)
lo = i
j = hi
return a

def benchmark_qsort_numba_numpy():
lst = np.random.rand(5000)
qsort_kernel_numba(lst, 0, len(lst)-1)


### Cython¶

In [20]:
%%cython
import numpy as np
import cython

@cython.boundscheck(False)
@cython.wraparound(False)
cdef double[:] \
qsort_kernel_cython_numpy_type(double[:] a, \
long lo, \
long hi):
cdef:
long i, j
double pivot
i = lo
j = hi
while i < hi:
pivot = a[(lo+hi) // 2]
while i <= j:
while a[i] < pivot:
i += 1
while a[j] > pivot:
j -= 1
if i <= j:
a[i], a[j] = a[j], a[i]
i += 1
j -= 1
if lo < j:
qsort_kernel_cython_numpy_type(a, lo, j)
lo = i
j = hi
return a

def benchmark_qsort_numpy_cython():
lst = np.random.rand(5000)
qsort_kernel_cython_numpy_type(lst, 0, len(lst)-1)

In [21]:
%timeit benchmark_qsort()
%timeit benchmark_qsort_numba()
%timeit benchmark_qsort_numba_numpy()
%timeit benchmark_qsort_numpy_cython()

10 loops, best of 3: 17.7 ms per loop
The slowest run took 4.06 times longer than the fastest. This could mean that an intermediate result is being cached
1 loops, best of 3: 79.4 ms per loop
The slowest run took 12.62 times longer than the fastest. This could mean that an intermediate result is being cached
1 loops, best of 3: 20.9 ms per loop
1000 loops, best of 3: 772 µs per loop


### Built-in sort¶

In [22]:
def benchmark_sort():
lst = [ random.random() for i in range(1,5000) ]
lst.sort()

def benchmark_sort_numpy():
lst = np.random.rand(5000)
np.sort(lst)

In [23]:
%timeit benchmark_sort()
%timeit benchmark_sort_numpy()

1000 loops, best of 3: 2 ms per loop
1000 loops, best of 3: 306 µs per loop


# Random mat stat¶

### Baseline¶

Used in Julia benchmarks

In [24]:
def randmatstat(t):
n = 5
v = np.zeros(t)
w = np.zeros(t)
for i in range(1,t):
a = np.random.randn(n, n)
b = np.random.randn(n, n)
c = np.random.randn(n, n)
d = np.random.randn(n, n)
P = np.matrix(np.hstack((a, b, c, d)))
Q = np.matrix(np.vstack((np.hstack((a, b)), np.hstack((c, d)))))
v[i] = np.trace(np.linalg.matrix_power(np.transpose(P)*P, 4))
w[i] = np.trace(np.linalg.matrix_power(np.transpose(Q)*Q, 4))
return (np.std(v)/np.mean(v), np.std(w)/np.mean(w))


### Speeding it up¶

In [25]:
def my_power(a):
a = np.dot(a.T,a)
a = np.dot(a,a)
a = np.dot(a,a)
return a

def randmatstat_fast(t):
n = 5
v = np.zeros(t)
w = np.zeros(t)
for i in range(1,t):
a = np.random.randn(n, n)
b = np.random.randn(n, n)
c = np.random.randn(n, n)
d = np.random.randn(n, n)
P = np.hstack((a, b, c, d))
Q = np.vstack((np.hstack((a, b)), np.hstack((c, d))))
v[i] = np.trace(my_power(P))
w[i] = np.trace(my_power(Q))
return (np.std(v)/np.mean(v), np.std(w)/np.mean(w))


### Timings¶

In [26]:
%timeit randmatstat(1000)
%timeit randmatstat_fast(1000)

10 loops, best of 3: 160 ms per loop
10 loops, best of 3: 83.2 ms per loop


# Randmatmul¶

In [27]:
def randmatmul(n):
A = np.random.rand(n,n)
B = np.random.rand(n,n)
return np.dot(A,B)

In [28]:
%timeit randmatmul(1000)

10 loops, best of 3: 83.7 ms per loop


# Mandelbrot¶

### Baseline¶

The baseline used in Julia benchmarks

In [29]:
def mandel(z):
maxiter = 80
c = z
for n in range(maxiter):
if abs(z) > 2:
return n
z = z*z + c
return maxiter

def mandelperf():
r1 = np.linspace(-2.0, 0.5, 26)
r2 = np.linspace(-1.0, 1.0, 21)
return [mandel(complex(r, i)) for r in r1 for i in r2]

In [30]:
assert sum(mandelperf()) == 14791


### Numba¶

In [31]:
@jit
def mandel_numba(z):
maxiter = 80
c = z
for n in range(maxiter):
if abs(z) > 2:
return n
z = z*z + c
return maxiter

def mandelperf_numba():
r1 = np.linspace(-2.0, 0.5, 26)
r2 = np.linspace(-1.0, 1.0, 21)
c3 = [complex(r,i) for r in r1 for i in r2]
return [mandel_numba(c) for c in c3]

In [32]:
assert sum(mandelperf_numba()) == 14791


### Cython¶

In [33]:
%%cython

import numpy as np
cimport numpy as np

cdef long mandel_cython_type(np.complex z):
cdef long maxiter, n
cdef np.complex c
maxiter = 80
c = z
for n in range(maxiter):
if abs(z) > 2:
return n
z = z*z + c
return maxiter

cpdef mandelperf_cython_type():
cdef double[:] r1,r2
r1 = np.linspace(-2.0, 0.5, 26)
r2 = np.linspace(-1.0, 1.0, 21)
return [mandel_cython_type(complex(r, i)) for r in r1 for i in r2]

In [34]:
assert sum(mandelperf_cython_type()) == 14791

In [35]:
%timeit mandelperf()
%timeit mandelperf_cython_type()
%timeit mandelperf_numba()

100 loops, best of 3: 6.57 ms per loop
100 loops, best of 3: 3.6 ms per loop
1000 loops, best of 3: 481 µs per loop


### Profiling¶

In [36]:
%lprun -s -f mandelperf_numba mandelperf_numba()


### Numpy¶

In [53]:
@jit
def mandel_numba(z):
maxiter = 80
c = z
for n in range(maxiter):
if abs(z) > 2:
return n
z = z*z + c
return maxiter

@jit
def mandelperf_numba_mesh():
width = 26
height = 21
r1 = np.linspace(-2.0, 0.5, width)
r2 = np.linspace(-1.0, 1.0, height)
mandel_set = np.empty((width,height), dtype=int)
for i in range(width):
for j in range(height):
mandel_set[i,j] = mandel_numba(r1[i] + 1j*r2[j])
return mandel_set

In [54]:
assert np.sum(mandelperf_numba_mesh()) == 14791

In [55]:
%timeit mandelperf_numba_mesh()

10000 loops, best of 3: 126 µs per loop


An even faster code is presented in How To Quickly Compute The Mandelbrot Set In Python

# Pisum¶

In [43]:
def pisum(t):
sum = 0.0
for j in range(1, 501):
sum = 0.0
for k in range(1, t+1):
sum += 1.0/(k*k)
return sum

@jit
def pisum_numba(t):
sum = 0.0
for j in range(1, 501):
sum = 0.0
for k in range(1, t+1):
sum += 1.0/(k*k)
return sum

In [44]:
%%cython

import cython

@cython.boundscheck(False)
@cython.wraparound(False)
cpdef double pisum_cython(int t):
cdef:
double sum = 0.0
int j,k
for j in range(1, 501):
sum = 0.0
for k in range(1, t+1):
sum += 1.0/(k*k)
return sum

In [45]:
assert abs(pisum(10000)-1.644834071848065) < 1e-6
assert abs(pisum_numba(10000)-1.644834071848065) < 1e-6
assert abs(pisum_cython(10000)-1.644834071848065) < 1e-6

In [46]:
%timeit pisum(10000)
%timeit pisum_numba(10000)
%timeit pisum_cython(10000)

1 loops, best of 3: 926 ms per loop
100 loops, best of 3: 20.4 ms per loop
10 loops, best of 3: 38.2 ms per loop


## Parse int¶

In [19]:
def parse_int():
for i in range(1,1000):
n = random.randint(0,2**32-1)
s = hex(n)

# required for Python 2.7 on 64 bits machine
if s[-1]=='L':
s = s[0:-1]

m = int(s,16)
assert m == n

def parse_int_numpy():
for i in range(1,1000):
n = np.random.randint(0,2**31-1)
s = hex(n)

# required for Python 2.7 on 64 bits machine
if s[-1]=='L':
s = s[0:-1]

m = int(s,16)
assert m == n

def parse_int_opt():
for i in range(1,1000):
n = random.randint(0,2**32-1)
s = hex(n)

# required for Python 2.7 on 64 bits machine
if s.endswith('L'):
s = s[0:-1]

m = int(s,16)
assert m == n

def parse_int1():
for i in range(1,1000):
n = random.randint(0,2**32-1)
s = hex(n)
m = int(s,16)
assert m == n

def parse_int1_numpy():
for i in range(1,1000):
n = np.random.randint(0,2**31-1)
s = hex(n)
m = int(s,16)
assert m == n

@jit
def parse_numba():
for i in range(1,1000):
n = random.randint(0,2**32-1)
s = hex(n)
m = int(s,16)
assert m == n

@jit
def parse_numba_numpy():
for i in range(1,1000):
n = np.random.randint(0,2**31-1)
s = hex(n)
m = int(s,16)
assert m == n


Int in C are up to 2^31-1, hence we must use this as the limit when using Cython or Numpy

In [21]:
%%cython
import random
import cython
cimport cython
import numpy as np
cimport numpy as np

@cython.boundscheck(False)
@cython.wraparound(False)
cpdef parse_int_cython():
cdef:
int i,n,m
for i in range(1,1000):
n = random.randint(0,2**31-1)
m = int(hex(n),16)
assert m == n

@cython.boundscheck(False)
@cython.wraparound(False)
cpdef parse_int_cython_numpy():
cdef:
int i,n,m
for i in range(1,1000):
n = np.random.randint(0,2**31-1)
m = int(hex(n),16)
assert m == n

In [22]:
%timeit parse_int()
%timeit parse_int_numpy()
%timeit parse_int_opt()
%timeit parse_int1()
%timeit parse_int1_numpy()
%timeit parse_numba()
%timeit parse_numba_numpy()
%timeit parse_int_cython()
%timeit parse_int_cython_numpy()

100 loops, best of 3: 3.55 ms per loop
1000 loops, best of 3: 1.21 ms per loop
100 loops, best of 3: 3.72 ms per loop
100 loops, best of 3: 3.32 ms per loop
1000 loops, best of 3: 1.05 ms per loop
100 loops, best of 3: 3.63 ms per loop
1000 loops, best of 3: 1.13 ms per loop
100 loops, best of 3: 3.07 ms per loop
1000 loops, best of 3: 817 µs per loop


### Vectorized operation¶

In [61]:
%lprun -s -f parse_int parse_int()

In [25]:
import numpy as np
def parse_int_vec():
n = np.random.randint(2^31-1,size=1000)
for i in range(1,1000):
ni = n[i]
s = hex(ni)
m = int(s,16)
assert m == ni

@jit
def parse_int_vec_numba():
n = np.random.randint(2^31-1,size=1000)
for i in range(1,1000):
ni = n[i]
s = hex(ni)
m = int(s,16)
assert m == ni

In [26]:
vhex = np.vectorize(hex)
vint = np.vectorize(int)

def parse_int_numpy():
n = np.random.randint(0,2**31-1,1000)
s = vhex(n)
m = vint(s,16)
np.all(m == n)
return s

@jit
def parse_int_numpy_numba():
n = np.random.randint(0,2**31-1,1000)
s = vhex(n)
m = vint(s,16)
np.all(m == n)

In [27]:
%%cython
import numpy as np
import cython
cimport numpy
cimport cython

@cython.boundscheck(False)
@cython.wraparound(False)
cpdef parse_int_vec_cython():
cdef:
int i,m
int[:] n
n = np.random.randint(0,2**31-1,1000)
for i in range(1,1000):
m = int(hex(n[i]),16)
assert m == n[i]

In [28]:
%timeit parse_int_vec()
%timeit parse_int_vec_numba()
%timeit parse_int_numpy()
%timeit parse_int_numpy_numba()
%timeit parse_int_vec_cython()

1000 loops, best of 3: 781 µs per loop
The slowest run took 195.93 times longer than the fastest. This could mean that an intermediate result is being cached
1000 loops, best of 3: 824 µs per loop
1000 loops, best of 3: 733 µs per loop
The slowest run took 104.18 times longer than the fastest. This could mean that an intermediate result is being cached
1000 loops, best of 3: 739 µs per loop
1000 loops, best of 3: 471 µs per loop


That's it!