For this lecture you will want to install some additional Python packages (if they are not already installed):

!conda install -y numba numexpr cython snakeviz

High Performance Python

Python is slower than compiled languages for a variety of reasons:

Python is Dynamically Typed rather than Statically Typed.

What this means is that the types of variables are not known until the program actually executes. For example, the difference between a C variable and a Python variable is summarized by this diagram:

For a variable in C, the compiler knows the type by its very definition. For a variable in Python, all you know at the time the program executes is that it's some sort of Python object.

So if you write the following in C:

int a = 1;
int b = 2;
int c = a + b;

the C compiler knows from the start that a and b are integers: they simply can't be anything else! With this knowledge, it can call the routine which adds two integers, returning another integer which is just a simple value in memory. As a rough schematic, the sequence of events looks like this:

C Addition

  1. Assign 1 to a
  2. Assign 2 to b
  3. call binary_add<int, int>(a, b)
  4. Assign the result to c

The equivalent code in Python looks like this:

a = 1
b = 2
c = a + b

Here the interpreter knows only that 1 and 2 are objects, but not what type of objects they are. The interpreter must inspect PyObject_HEAD for each variable to find the type information, and then call the appropriate addition routine for the two types. Finally, it must create and initialize a new Python object to hold the return value. The sequence of events looks roughly like this:

Python Addition

  1. Assign 1 to a
    • Set a->PyObject_HEAD->typecode to integer
    • Set a->val = 1
  2. Assign 2 to b
    • Set b->PyObject_HEAD->typecode to integer
    • Set b->val = 2
  3. call binary_add(a, b)
    • find typecode in a->PyObject_HEAD
    • a is an integer; value is a->val
    • find typecode in b->PyObject_HEAD
    • b is an integer; value is b->val
    • call binary_add<int, int>(a->val, b->val)
    • result of this is result, and is an integer.
  4. Create a Python object c
    • set c->PyObject_HEAD->typecode to integer
    • set c->val to result

The dynamic typing means that there are a lot more steps involved with any operation. This is a primary reason that Python is slow compared to C for operations on numerical data.
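
To see this dispatch machinery at work, we can disassemble a trivial addition into the bytecode the interpreter executes. The single addition opcode (named BINARY_ADD or BINARY_OP, depending on the Python version) hides all of the type inspection described above:

In [ ]:
import dis

def add(a, b):
    return a + b

# One opcode for the addition; the type dispatch happens inside the interpreter
dis.dis(add)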

Python is interpreted rather than compiled.

We saw above one difference between interpreted and compiled code. A smart compiler can look ahead and optimize for repeated or unneeded operations, which can result in speed-ups. Compiler optimization is its own beast, and I'm personally not qualified to say much about it, so I'll stop there.

Python's object model can lead to inefficient memory access

We saw above the extra type info layer when moving from a C integer to a Python integer. Now imagine you have many such integers and want to do some sort of batch operation on them. In Python you might use the standard List object, while in C you would likely use some sort of buffer-based array.

A NumPy array in its simplest form is a Python object built around a C array. That is, it has a pointer to a contiguous data buffer of values. A Python list, on the other hand, has a pointer to a contiguous buffer of pointers, each of which points to a Python object which in turn has references to its data (in this case, integers). This is a schematic of what the two might look like:

You can see that if you're doing some operation which steps through data in sequence, the numpy layout will be much more efficient than the Python layout, both in the cost of storage and the cost of access.
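
As a quick, illustrative check of this claim, we can time a simple reduction over a Python list and over the equivalent NumPy array; the exact numbers will vary by machine, but the array version is typically much faster:

In [ ]:
import numpy as np

n = 1000000
py_list = list(range(n))
np_array = np.arange(n)

# Summing the list walks a buffer of pointers to boxed Python integers;
# summing the array loops over a contiguous buffer of machine integers.
%timeit sum(py_list)
%timeit np_array.sum()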

Speeding up statistical computations in Python

In the age of "big data" and sophisticated Bayesian and statistical learning algorithms, many are interested in optimizing the performance of the high-level languages that we use to analyse data.

NumPy gets us part of the way there on Python:

  • Storage of multidimensional data
  • Efficient data access
  • Efficient in-memory storage
  • Fast methods and functions for data manipulation

For many applications, this is sufficient to drastically improve performance. However, there is plenty of scope for improving Python's performance in situations where speed matters.

Pure Python and Python with NumPy are not particularly fast. Below are some recent performance benchmarks comparing several computing languages (taken directly from the Julia website):

| benchmark | Fortran (gcc 5.1.1) | Julia (0.4.0) | Python (3.4.3) | R (3.2.2) | Matlab (R2015b) | Octave (4.0.0) | Mathematica (10.2.0) | JavaScript (V8 3.28.71.19) | Go (go1.5) | LuaJIT (gsl-shell 2.3.1) | Java (1.8.0_45) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| fib | 0.70 | 2.11 | 77.76 | 533.52 | 26.89 | 9324.35 | 118.53 | 3.36 | 1.86 | 1.71 | 1.21 |
| parse_int | 5.05 | 1.45 | 17.02 | 45.73 | 802.52 | 9581.44 | 15.02 | 6.06 | 1.20 | 5.77 | 3.35 |
| quicksort | 1.31 | 1.15 | 32.89 | 264.54 | 4.92 | 1866.01 | 43.23 | 2.70 | 1.29 | 2.03 | 2.60 |
| mandel | 0.81 | 0.79 | 15.32 | 53.16 | 7.58 | 451.81 | 5.13 | 0.66 | 1.11 | 0.67 | 1.35 |
| pi_sum | 1.00 | 1.00 | 21.99 | 9.56 | 1.00 | 299.31 | 1.69 | 1.01 | 1.00 | 1.00 | 1.00 |
| rand_mat_stat | 1.45 | 1.66 | 17.93 | 14.56 | 14.52 | 30.93 | 5.95 | 2.30 | 2.96 | 3.27 | 3.92 |
| rand_mat_mul | 3.48 | 1.02 | 1.14 | 1.57 | 1.12 | 1.12 | 1.30 | 15.07 | 1.42 | 1.16 | 2.36 |

Table: benchmark times relative to C (smaller is better, C performance = 1.0).

So, while fast relative to some scientific computing choices (e.g. R, Matlab), Python sometimes needs to be tweaked in order to make it a competitive choice for implementing modern statistical methods. We will cover several approachable ways of improving the performance of Python.

Profiling

Before you barrel ahead and prematurely optimize your Python code, it is important to understand why and where your code is slow. This is achieved by systematically accounting for the resources that your code is using, such as memory, CPU time or data transfer. This process is broadly referred to as Profiling, and it allows you to identify where the performance bottlenecks in your code lie.

Here, we will concentrate on optimizing performance for CPU-bound problems.

There are a number of tools to help you profile your code.

time

For those of you on UNIX platforms, the built-in utility time can be used to assess how long your code takes to run.

In [ ]:
!time python ../examples/abc.py

The output from time can be interpreted as:

  • real: elapsed (wall) time
  • user: time spent in your code
  • sys: time spent in system (kernel) functions

The last two quantities account for the CPU time used by your program. Any remaining real time is typically spent waiting for input/output (I/O), such as reads from disk or a network connection.

Python also has a time module, whose time function is more rudimentary: it simply returns the current time, in seconds since the Epoch (January 1, 1970).

In [ ]:
import time
time.time()

We can use this for profiling by differencing the times before and after running some code of interest:

In [ ]:
import numpy as np
start_time = time.time()
np.product(range(1, 100000))
end_time = time.time()

end_time - start_time

Note, however, that it does not provide a breakdown of where the code spends its time.

IPython magic: %timeit, %run and %prun

IPython has three built-in "magic" functions that are useful for profiling your code.

The %timeit magic executes a Python statement or expression in a loop to estimate how long a single call takes. Additionally, it repeats the loop a number of times and reports the best result.

As an example, consider a Python implementation of the trapezoidal rule, a method from numerical analysis for approximating a definite integral. Specifically, it allows us to approximate:

$$\int_a^b f(x) dx$$

using the approximation:

$$\int_a^b f(x) dx \approx (b-a) \frac{f(b) + f(a)}{2}$$

Rather than use a single interval for this estimate, we break the interval down into $n$ subintervals, to obtain a more accurate approximation.

In [ ]:
def f(x):
    return 2*x*x + 3*x + 1
      
def trapez(a, b, n):
    h = (b-a)/float(n) 
    sumy = 0
    x=a
    
    for i in range(n):
        x += h
        sumy += f(x)
    sumy += 0.5*(f(a) + f(b))
    return sumy*h
In [ ]:
trapez(1, 5, 10000)

To confirm that this works, we can compare this to the symbolic solution, using Sympy:

In [ ]:
import sympy as sym

xs = sym.symbols('xs')

fx = 2*xs*xs + 3*xs + 1

ifx = sym.integrate(fx, (xs, 1, 5))
ifx.evalf()
In [ ]:
%timeit trapez(1, 5, 10000)

%timeit tries to pick suitable values for the number of loops and repeats; these values can be overridden by specifying -n and -r values, respectively.
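
For example, to force 100 loops and 5 repeats (values chosen arbitrarily here):

In [ ]:
%timeit -n 100 -r 5 trapez(1, 5, 10000)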

The %run command with a -p option allows you to run complete programs under the control of the Python profiler. It writes the output to the help pane, which opens at the bottom of the page.

In [ ]:
# This code redirects pager output to a regular cell
from IPython.core import page
page.page = print
In [ ]:
%run -p ../examples/abc.py

The profiling information includes the following information:

  • ncalls: number of calls to function
  • tottime: total time spent in the given function (excluding time in calls to sub-functions)
  • percall: time per call
  • cumtime: cumulative time spent in this and all subfunctions

We can see that most of the time in this example is spent inside of core NumPy functions and methods.

The %prun command does a similar job for single Python expressions (like function calls).

In [ ]:
%prun trapez(2, 6, 100000)

For even more fine-grained profiling information, we can use a line profiler to see how long it takes each line of a function to run.

In [ ]:
!conda install -y line_profiler pprofile
In [ ]:
!pprofile ../examples/bisection.py

This output makes it clear that the biggest cost is in the repeated calling of the function $f$ for which the root is being found. If we could improve the speed of this function, it would be the easiest single way of improving the performance of the code.
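
Having installed line_profiler above, we can also use its IPython magic interactively. A minimal sketch, using the trapez function from earlier (the -f flag names the function whose lines should be timed):

In [ ]:
%load_ext line_profiler

%lprun -f trapez trapez(1, 5, 10000)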

Snakeviz

SnakeViz is a browser based graphical viewer for the output of Python’s cProfile module. Though it is ostensibly a command-line tool, SnakeViz includes IPython line and cell magics for going straight from code to a visualization.

In [ ]:
%load_ext snakeviz

Let's load the ABC example from the source, so that we can run it in the notebook.

In [ ]:
%load ../examples/abc.py
In [ ]:
%snakeviz abc(y=np.random.normal(4, 2, 50), N=20, epsilon=[0.2, 0.8]).mean(0)

The default SnakeViz visualization displays profiles as a sunburst that represent function calls as a series of nested arcs, expanding outward. Thus, the root function is represented as a circle at the center, with the functions it calls displayed as arcs wrapping around it, and so forth. The amount of time spent inside a function is represented by the angular width of the arc. Hence, an arc that wraps most of the way around the circle represents a function that is taking up most of the time of its calling function, while a narrow arc represents a function that is using relatively little time.

Functions don't just spend time calling other functions; they also have their own internal time. SnakeViz shows this by putting a special child on each node that represents internal time. Only functions that call other functions will have this; functions that make no calls are entirely internal time.

SnakeViz is useful for a global overview of the runtime of your project because it may uncover functions that do not seem (at first) worth optimizing because of their short individual runtime, but that are called so often from other functions that their cumulative runtime is significant.

Figure: example SnakeViz sunburst visualization.

Exercise: Profiling EM

Use one or more of the above methods to profile your expectation maximization code, and locate any bottlenecks.

In [ ]:
# Write your answer here

Speeding up Python by being Pythonic

When you have decided that your code is unacceptably slow, and have gone through the process of profiling to see if and where your program is experiencing a bottleneck, it can be easy to jump ahead and try speeding it up using external tools. There are several packages that will certainly improve Python's performance (and we will introduce some of them later), but the first place to look for better performance is in refactoring your implementation of whichever algorithm you happen to be using.

Effective Python programming involves applying particular Python idioms effectively; these are idiomatic expressions that may be unfamiliar if you are coming from another language, but when used appropriately they can make your code more readable, faster, or both. You have seen some of these already -- for example, the list comprehension as a means for succinctly implementing a for loop.

Comprehensions

In [ ]:
def do_math(x):
    return 3 + x**3
In [ ]:
%%timeit
squares = []
for i in range(1000):
    squares.append(do_math(i))
In [ ]:
%timeit squares = [do_math(i) for i in range(1000)]

Here, not only is the list comprehension easier to write and read, it is also slightly faster.

Generators

When you are dealing with a large number of elements that you do not need all at once, you can also consider another Python construct we have already seen: the generator. For example, if we enclose the comprehension in parentheses instead of square brackets, we get a generator expression object:

In [ ]:
(i**2 for i in range(int(1e20)))

Now, rather than materializing all of those elements in memory, we can produce values one at a time, as needed:

In [ ]:
squares = (i**2 for i in range(int(1e10)))
next(squares)
In [ ]:
next(squares)

Built-in functions

Before you go about coding your own functions, make sure that the functionality isn't already provided by a built-in function. These are typically highly optimized, and written in C! Here is a list of built-in functions.
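
For example, a hand-rolled loop for finding a maximum is usually slower than the built-in max (an illustrative comparison; exact timings will vary):

In [ ]:
values = list(np.random.randn(10000))

def my_max(seq):
    # A pure-Python reimplementation of the built-in max
    largest = seq[0]
    for v in seq[1:]:
        if v > largest:
            largest = v
    return largest

%timeit my_max(values)
%timeit max(values)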

String concatenation

Just as you should avoid growing lists or arrays by concatenation or appending, iterating over strings and concatenating them manually is very inefficient. For example, let's say we want to concatenate a list of strings into a single string:

In [ ]:
words = ["Six",
"days",
"in",
"to",
"what",
"should",
"be",
"a",
"greatest",
"two",
"months",
"of",
"my",
"life",
"and",
"it’s",
"turned",
"in",
"to",
"a",
"nightmare"]

One might be tempted to code the following:

In [ ]:
%%timeit
sentence = ""
for word in words:
    sentence += word

However, this is inefficient; since strings are immutable in Python, every + operation creates a new string and copies the old content. Instead, we can use the string method join, which is not only faster, but more flexible. Here, we would like to separate the words by spaces, which is easily done:

In [ ]:
' '.join(words)
In [ ]:
%timeit ' '.join(words)

Avoid loops

As we have seen, for loops in Python are slow. Wherever possible, avoid looping by using alternative strategies or vectorized operations. For example, say we wanted to return the common elements between two arrays. We might naively loop over both lists, comparing them elementwise to return their intersection:

In [ ]:
list1 = np.random.choice(np.arange(20), size=10)
list2 = np.random.choice(np.arange(20), size=10)

def common_elements(a, b):
    for i in a:
        for j in b:
            if i==j:
                yield i
In [ ]:
list(common_elements(list1, list2))

However, this involves two Python for loops and a conditional statement. Instead, we can use set operations on the built-in set type provided by Python:

In [ ]:
set(list1) & set(list2)
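
Since list1 and list2 are already NumPy arrays, another option is NumPy's own set routine, np.intersect1d, which returns the sorted unique elements common to both arrays:

In [ ]:
np.intersect1d(list1, list2)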

Use NumPy

Often, considerable performance gains can be achieved by replacing Python data structures and functions with corresponding NumPy versions. NumPy provides a high-performance multidimensional array object, and tools for working with these arrays.

This example, borrowed from NumPy creator Travis Oliphant, solves Laplace's equation over a 2-d rectangular grid using a simple iterative method. The code finds a two-dimensional function $u$ for which $\nabla^2 u = 0$, given some fixed boundary conditions.

In [ ]:
dx = 0.1
dy = 0.1
dx2 = dx*dx
dy2 = dy*dy

def py_update(u):
    nx, ny = u.shape
    for i in range(1,nx-1):
        for j in range(1, ny-1):
            u[i,j] = ((u[i+1, j] + u[i-1, j]) * dy2 +
                      (u[i, j+1] + u[i, j-1]) * dx2) / (2*(dx2+dy2))
In [ ]:
def calc(N, Niter=100, func=py_update, args=()):
    u = np.zeros([N, N])
    u[0] = 1
    for i in range(Niter):
        func(u,*args)
    return u

This code takes a very long time to run in order to converge to the correct solution. For a 100x100 grid, visually-indistinguishable convergence occurs after about 8000 iterations.

In [ ]:
%timeit calc(10)

Using NumPy, we can speed this code up significantly by using slicing and vectorized (automatic looping) calculations that replace the explicit loops in the Python-only solution.

In [ ]:
def num_update(u):
    u[1:-1,1:-1] = ((u[2:,1:-1]+u[:-2,1:-1])*dy2 + 
                    (u[1:-1,2:] + u[1:-1,:-2])*dx2) / (2*(dx2+dy2))
In [ ]:
%timeit calc(10, func=num_update)

Such speed-ups are not uncommon when using NumPy to replace Python loops where the inner loop is doing simple math on basic data-types.

Fast array expression evaluation with numexpr

numexpr allows array expressions to be evaluated far faster than what can be achieved in Python using NumPy arrays alone. numexpr parses a string expression, then optimizes and compiles it on the fly, using a virtual machine that includes a just-in-time (JIT) compiler. In addition, numexpr offers direct support for parallel multi-threaded computation, since it bypasses Python's global interpreter lock.

Python's global interpreter lock (GIL) ensures that only one thread runs in the interpreter at once. This simplifies many of the low-level activities, such as memory management, and allows for co-operative multi-tasking. But, since the currently-running thread holds onto the interpreter, it makes multi-core parallelization difficult.

Part of the reason Python can be slow for array calculations is that it creates temporary arrays to store intermediate results from array element calculations, which wastes memory and cache. numexpr handles such calculations in manageable chunks, which accelerates computation.

The speedup over NumPy by using numexpr can be as high as 20x, but is typically in the range of 2-4x.

Example: Computing a polynomial

In [ ]:
import numpy as np

x = np.linspace(-1, 1, int(1e7))

0.25*x**3 + 0.75*x**2 - 1.5*x - 2
In [ ]:
%timeit 0.25*x**3 + 0.75*x**2 - 1.5*x - 2
In [ ]:
import numexpr as ne

ne.set_num_threads(1)
In [ ]:
ne.evaluate('0.25*x**3 + 0.75*x**2 - 1.5*x - 2')
In [ ]:
%timeit ne.evaluate('0.25*x**3 + 0.75*x**2 - 1.5*x - 2')

numexpr actually expands the polynomial terms so that it does not need to use a transcendental function.

We can achieve further gains in performance by multithreading the calculations:

In [ ]:
ne.set_num_threads(4)
In [ ]:
%timeit ne.evaluate('0.25*x**3 + 0.75*x**2 - 1.5*x - 2')

Since the performance of processors has outpaced that of memory over the past several decades, the CPU spends a lot of time waiting for data to arrive from memory; this is known as the processor-memory performance gap.

Figure: the processor-memory performance gap (graph courtesy of http://www.techdesignforums.com).

CPU caches help make up for this difference, but they are only effective when the data are laid out in memory so that they can be reused. numexpr exploits this by moving contiguous blocks of data from memory into the CPU cache and reusing them as much as possible within the cache, giving the CPU faster access to the data.

Limitations

numexpr only implements element-wise operations. So, an expression such as a * b is computed internally as:

for i in range(N):
    c[i] = a[i] * b[i]

Similarly, it cannot reference other elements of an array within the same expression, so a calculation like the following cannot be written as a single numexpr expression:

for i in range(N):
    c[i] = a[i-1] + a[i] * b[i]
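
One workaround (shown here as an illustrative sketch, not part of the original examples) is to pre-shift the array with NumPy slicing and pass the aligned views to numexpr as separate variables:

In [ ]:
import numpy as np
import numexpr as ne

a = np.random.rand(1000000)
b = np.random.rand(1000000)

# Pre-compute the shifted view in NumPy, then let numexpr evaluate the
# element-wise part of the expression on the aligned arrays.
c = ne.evaluate('a_prev + a_curr * b_curr',
                local_dict={'a_prev': a[:-1],
                            'a_curr': a[1:],
                            'b_curr': b[1:]})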

Cython

Python developers typically address performance constraints by building Python extensions that wrap code written in other languages (for example, SciPy contains more lines of C/C++/Fortran than of Python). However, programming against the Python/C API directly is not straightforward for most users.

Cython is a language that allows Python programmers to write fast code without having to write C/C++/Fortran directly. It looks much like Python code, but with type declarations. Cython code is translated to C (or C++), which is then compiled to create a Python extension that we can import and use.

Using Cython, we can achieve speedups of several orders of magnitude, often approaching the speed of hand-coded C. In addition, Cython is compatible with core scientific programming tools like NumPy and IPython.

Cython has built-in support for multicore processing.
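
As a minimal, illustrative sketch (not one of this lecture's examples), a parallel reduction could be written with prange from cython.parallel; this assumes the extension is compiled with OpenMP enabled (e.g. by passing -fopenmp to the compiler and linker):

from cython.parallel import prange

def parallel_sum(double[:] x):
    # Sum a 1-d typed memoryview across multiple threads
    cdef double total = 0
    cdef Py_ssize_t i
    # prange releases the GIL and distributes iterations across threads;
    # Cython treats the += on total as a reduction
    for i in prange(x.shape[0], nogil=True):
        total += x[i]
    return total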

Cython is used to varying degrees by other packages in the Python scientific stack, such as pandas, sympy, scikit-learn and SciPy.

Example: Numerical integration

Recall from above the function trapez for performing numerical integration using the trapezoidal rule. As a benchmark, let's time the execution of the pure-Python version of the trapezoidal rule:

In [ ]:
%timeit trapez(1, 5, 1000)

Perhaps the easiest way to use Cython is via the IPython cython magic, which allows us to compile and run Cython code interactively:

In [ ]:
%load_ext cython
In [ ]:
%%cython

def f(x):
    return 2*x*x + 3*x + 1

def trapez2(a, b, n):
    h = (b-a)/float(n) 
    sumy = 0
    x=a
    for i in range(n):
        x += h
        sumy += f(x)
    sumy += 0.5*(f(a) + f(b))
    return sumy*h

The Cython magic is doing a lot of work for you: it compiles the code into an extension module, and loads it into the notebook. This allows us to ignore all of the compilation details of building Cython extensions.

If we run trapez2, we can see a reasonable speedup simply by compiling it, unchanged, using Cython.

In [ ]:
%timeit trapez2(1, 5, 1000)

Under the hood, several things are happening in order to deliver this improved performance. The Cython source code is translated into C source code by cython. Then, this C source is compiled, using the appropriate compiler, flags and associated library files (if any), into a Python extension. This extension is then loaded by IPython into the current session.

Figure: the Cython compilation workflow.

C extensions can also be compiled manually, using a setup file. Here is an example for an extension called dist within a package called probability:

from distutils.core import setup
from distutils.extension import Extension
from Cython.Distutils import build_ext

import numpy as np

setup(
    cmdclass = {'build_ext': build_ext},
    ext_modules = [Extension("dist", ["probability/src/dist.pyx"], include_dirs=[np.get_include()])]
)

It mainly uses machinery from distutils, a core Python package that manages the build process.
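
Assuming the script above is saved as setup.py at the top level of the package, the extension would typically be built in place with:

python setup.py build_ext --inplace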

If we look at the trapez2 function compared to the pure Python trapez, we see that the Cython version appears to be a lower-level function:

In [ ]:
trapez?
In [ ]:
trapez2?

To get a closer look at where Cython is improving our unchanged Python code, we can add an --annotate flag to the %%cython magic declaration:

In [ ]:
%%cython --annotate

def f(x):
    return 2*x*x + 3*x + 1
      
def trapez2(a, b, n):
    h = (b-a)/float(n) 
    sumy = 0
    x=a
    for i in range(n):
        x += h
        sumy += f(x)
    sumy += 0.5*(f(a) + f(b))
    return sumy*h

In the above, the line color indicates the "typedness" of the code: yellower lines are closer to Python, and therefore require calls to the Python C API, while whiter lines indicate code that is closer to pure C, requiring few, if any, Python API calls.

If you click on a line, it unravels to show you the C code that results from the call to cython.

The goal in speeding up code with Cython is to turn as many lines to white as we can. The easiest way to do this is to add type declarations to the Python code:

In [ ]:
%%cython --annotate

# Add type to argument
def ff(double x):
    return 2*x*x + 3*x + 1

# Add types to arguments
def trapez3(double a, double b, int n):
    # Declare types of variables
    cdef double h, x, sumy
    cdef int i
    h = (b-a)/float(n) 
    sumy = 0
    x=a
    for i in range(n):
        x += h
        sumy += ff(x)
    sumy += 0.5*(ff(a) + ff(b))
    return sumy*h
In [ ]:
%timeit trapez3(1, 5, 1000)

This gives us a considerable speedup. The next thing we might try is to inline the polynomial function. By inlining, we mean asking the compiler to perform an inline expansion of the function; that is, the compiler inserts the body of the function at each place it is called, rather than generating a function call.

We do three things to the specification of ff:

  • change def to cdef
  • add a return type to the function
  • add an inline keyword
In [ ]:
%%cython --annotate

cdef inline double ff(double x):
    return 2*x*x + 3*x + 1

cpdef trapez4(double a, double b, int n):
    cdef double h, x, sumy
    cdef int i
    h = (b-a)/float(n) 
    sumy = 0
    x=a
    for i in range(n):
        x += h
        sumy += ff(x)
    sumy += 0.5*(ff(a) + ff(b))
    return sumy*h

The cdef keyword declares a C object. Everything that follows it is therefore specified in terms of C; we are essentially writing C, but using a subset of Python's syntax rules. So, when we declare a function with cdef, as we did for ff, it is a C function and is not callable from Python.

cpdef is a hybrid declaration that creates both a C interface and a Python interface to the function.

Let's see how this performs.

In [ ]:
%timeit trapez4(1, 5, 1000)

Woof! That's a big speedup, and there's not much yellow left in the annotated code.

If you would like a very simple way of injecting types into your code with Cython, without modifying any of the code itself, you can use the @cython.locals decorator. Note that this does not give as large a speedup as we have just achieved.

In [ ]:
%%cython
import cython

@cython.locals(x=cython.double)
def f(x):
    return 2*x*x + 3*x + 1
     
@cython.locals(a=cython.double, b=cython.double, n=cython.int,
               h=cython.double, sumy=cython.double, i=cython.int,
               x=cython.double, func=cython.double)
def trapez5(a, b, n):
    h = (b-a)/float(n) 
    sumy = 0
    x=a
    
    for i in range(n):
        x += h
        sumy += f(x)
    sumy += 0.5*(f(a) + f(b))
    return sumy*h
In [ ]:
%timeit trapez5(1, 5, 1000)

If you can stand to look at it, you can peek at all the C code that is generated by Cython just to optimize this short function.

In [ ]:
%load ../examples/trapezoid.c

For convenience, running Cython from IPython is the preferred way of using the language. However, if we have some legacy C/C++ code that we wish to use in Python, we can write a Cython wrapper and call cython from the terminal.

Here is the C code:

In [ ]:
%load ../examples/fact.h

And here is the Cython wrapper. Cython code is stored in files with a .pyx extension.

In [ ]:
%load ../examples/fact.c
In [ ]:
%load ../examples/fact.pyx
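
The loaded file contents are not reproduced here, but a wrapper of this kind might look roughly like the following sketch (hypothetical; it assumes fact.h declares a C function long fact(long)):

# fact.pyx -- illustrative sketch, not the actual file
cdef extern from "fact.h":
    # Expose the C routine under an alternate Cython name so that it
    # does not clash with the Python-level wrapper below
    long c_fact "fact"(long n)

def fact(long n):
    """Python-callable wrapper around the C fact() routine."""
    return c_fact(n)
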
In [ ]:
!cython ../examples/fact.pyx

Now we can compile the extension.

In [ ]:
import os
os.chdir('../examples')

!gcc -Wall -fno-strict-aliasing -static -undefined dynamic_lookup \
    -bundle -arch x86_64 \
    -I/Users/fonnescj/anaconda3/include/python3.5m \
    -o fact.so fact.c
    
os.chdir('../notebooks')
In [ ]:
!cp ../examples/fact.so .
In [ ]:
import fact

fact.fact(5)

Using lists and arrays in Cython

The above example used only scalar variables. When we have vector-valued data, we need to declare the appropriate types. Here's a simple example, using a function that calculates the Euclidean distance between two arrays:

In [ ]:
def euclidean(x, y):
    x = np.array(x)
    y = np.array(y)
    return np.sqrt(((x - y) ** 2).sum())
In [ ]:
%timeit euclidean(np.random.randn(10), np.random.randn(10))

In order to get a speedup under Cython, we need to iterate over the elements of each passed array, and aggregate them manually.

In [ ]:
%%cython --annotate

import cython
cimport numpy as np
from libc.math cimport sqrt

@cython.boundscheck(False)
@cython.wraparound(False)
def euclidean2(np.ndarray[np.float64_t, ndim=1] x, 
               np.ndarray[np.float64_t, ndim=1] y):
    cdef: 
        double diff
        int i
    diff = 0
    for i in range(x.shape[0]):
        diff += (x[i] - y[i])**2
    return sqrt(diff)
In [ ]:
%timeit euclidean2(np.random.randn(10), np.random.randn(10))

The decorators applied to euclidean2 are compiler directives that alter the behavior of Cython code. Setting boundscheck to False removes bounds checking on indexing operations, so it is up to us to ensure that we never index arrays with values that are out of bounds. Setting wraparound to False disables Python-style negative indexing. While these directives may increase the speed of our code, they can be dangerous; if we do not index our arrays properly, they may cause segmentation faults or data corruption.

The full set of compiler directives is described in the Cython docs.
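
Directives can also be set for an entire module, rather than per function, by placing a special comment at the very top of the .pyx file (or %%cython cell); for example:

# cython: boundscheck=False, wraparound=False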

Here is the same code using lists instead of NumPy arrays:

In [ ]:
%%cython --annotate

from libc.math cimport sqrt

def euclidean3(list x, list y):
    cdef: 
        double diff
        int i
    diff = 0
    for i in range(len(x)):
        diff += (x[i] - y[i])**2
    return sqrt(diff)
In [ ]:
%timeit euclidean3(np.random.randn(10).tolist(), np.random.randn(10).tolist())

pyximport

If we have some Cython source code, we can use pyximport to directly import it as if it were a Python module.

In [ ]:
import pyximport

# Move source into current directory
!cp ../examples/trapezoid.pyx .

# Allow it to use Python's import mechanism
pyximport.install()

from trapezoid import trapez as trapez_pyx
trapez_pyx(1, 10, 10) 

In other words, it treats .pyx files as if they were .py files. This includes detecting changes in the source file and recompiling it as necessary before importing.

Benchmark example: Gibbs sampling

Let's see if we can use Cython to speed up MCMC.

Gibbs sampler for function:

$$f(x,y) = x^2 \exp(-xy^2 - y^2 + 2y - 4x)$$

using conditional distributions:

$$x \mid y \sim \text{Gamma}(3,\, y^2 + 4)$$

$$y \mid x \sim \text{Normal}\left(\frac{1}{1+x}, \frac{1}{2(1+x)}\right)$$

Here is the pure Python implementation:

In [ ]:
from numpy import zeros, random, sqrt
gamma = random.gamma
normal = random.normal

def pygibbs(N=20000, thin=200):
    mat = zeros((N,2))
    x,y = mat[0]
    for i in range(N):
        for j in range(thin):
            x = gamma(3, 1.0/(y**2 + 4))  # NumPy's gamma takes (shape, scale); scale = 1/rate
            y = normal(1./(x+1), 1./sqrt(2*(x+1)))
        mat[i] = x,y

    return mat
In [ ]:
%timeit pygibbs(1000, 10)

Unchanged, compiling this code with Cython results in a slight improvement in speed.

In [ ]:
%%cython

from numpy import zeros, random, sqrt
gamma = random.gamma
normal = random.normal

def cygibbs(N=20000, thin=200):
    mat = zeros((N,2))
    x,y = mat[0]
    for i in range(N):
        for j in range(thin):
            x = gamma(3, 1.0/(y**2 + 4))  # NumPy's gamma takes (shape, scale); scale = 1/rate
            y = normal(1./(x+1), 1./sqrt(2*(x+1)))
        mat[i] = x,y

    return mat
In [ ]:
%timeit cygibbs(1000, 10)

Now, for some type declarations:

In [ ]:
%%cython

from numpy import zeros, random, sqrt
from numpy cimport *
gamma = random.gamma
normal = random.normal

def cygibbs2(int N=20000, int thin=200):
    cdef: 
        ndarray[float64_t, ndim=2] mat = zeros((N,2))
        float64_t x,y = 0
        int i,j
    for i in range(N):
        for j in range(thin):
            x = gamma(3, 1.0/(y**2 + 4))  # NumPy's gamma takes (shape, scale); scale = 1/rate
            y = normal(1./(x+1), 1./sqrt(2*(x+1)))
        mat[i] = x,y

    return mat
In [ ]:
%timeit cygibbs2(1000, 10)

A full-blown "Cythonization" involves using GSL's random number generators, and giving Cython a few more instructions:

In [ ]:
# If you are using Homebrew on OS X, you can install GSL using "brew"
!brew install gsl
In [ ]:
%%cython -lm -lgsl -lgslcblas

cimport cython
import numpy as np
from numpy cimport *

cdef extern from "math.h":
    double sqrt(double) 
  
cdef extern from "gsl/gsl_rng.h":
    ctypedef struct gsl_rng_type
    ctypedef struct gsl_rng

    gsl_rng_type *gsl_rng_mt19937
    gsl_rng *gsl_rng_alloc(gsl_rng_type * T) nogil
  
cdef extern from "gsl/gsl_randist.h":
    double gamma "gsl_ran_gamma"(gsl_rng * r,double,double)
    double gaussian "gsl_ran_gaussian"(gsl_rng * r,double)
  
cdef gsl_rng *r = gsl_rng_alloc(gsl_rng_mt19937)

@cython.wraparound(False)
@cython.boundscheck(False)
def gibbs(int N=20000,int thin=500):
    cdef: 
        double x=0
        double y=0
        int i, j
        ndarray[float64_t, ndim=2] samples

    samples = np.empty((N,2))
    for i from 0 <= i < N:
        for j from 0 <= j < thin:
            x = gamma(r, 3, 1.0/(y*y+4))
            # gsl_ran_gaussian is zero-mean, so add the conditional mean 1/(1+x)
            y = 1.0/(x+1) + gaussian(r, 1.0/sqrt(2*(x+1)))
        samples[i,0] = x
        samples[i,1] = y
    return samples
In [ ]:
%timeit gibbs(1000, 10)

Exercise

Try using Cython to improve the performance of the gradient descent algorithm from our optimization lecture:

In [ ]:
from scipy import optimize

def gradient_descent(x0, f, f_prime, adapt=False):
    x_i, y_i = x0
    all_x_i = list()
    all_y_i = list()
    all_f_i = list()

    for i in range(1, 100):
        all_x_i.append(x_i)
        all_y_i.append(y_i)
        all_f_i.append(f([x_i, y_i]))
        dx_i, dy_i = f_prime(np.asarray([x_i, y_i]))
        if adapt:
            # Compute a step size using a line_search
            step = optimize.line_search(f, f_prime,
                                np.r_[x_i, y_i], -np.r_[dx_i, dy_i],
                                np.r_[dx_i, dy_i], c2=.05)
            step = step[0]
        else:
            step = 1
        x_i += -step*dx_i
        y_i += -step*dy_i
        if np.abs(all_f_i[-1]) < 1e-16:
            break
    return all_x_i, all_y_i, all_f_i

Here is a sample function to optimize. Recall from Section 3 that it returns both the quadratic function and its gradient.

In [ ]:
def quad(epsilon, ndim=2):
    def f(x):
        x = np.asarray(x)
        y = x.copy()
        y *= np.power(epsilon, np.arange(ndim))
        return .33*np.sum(y**2)

    def f_prime(x):
        x = np.asarray(x)
        y = x.copy()
        scaling = np.power(epsilon, np.arange(ndim))
        y *= scaling
        return .33*2*scaling*y

    return f, f_prime
In [ ]:
x0, y0 = 1.6, 1.1

f, f_prime = quad(0.8)
%timeit gd_x_i, gd_y_i, gd_f_i = gradient_descent([x0, y0], f, f_prime)
In [ ]:
# Write answer here

Numba

Cython precompiles parts of Python code before running. Another approach is just-in-time (JIT) compilation. Numba is a compiler that runs Python code through the LLVM compiler infrastructure to produce optimized machine code for fast execution. Numba does not require a C/C++ compiler on your machine.

Numba's primary API is a decorator.

The @jit decorator runs the decorated function through bytecode analysis and the function arguments through a type inference engine, and generates an intermediate representation of your code, which is then passed to LLVM for compilation to machine code.

In [ ]:
from numba import jit
In [ ]:
@jit
def nfibonacci(size):
    F = np.empty(size, 'int')
    a, b = 0, 1
    for i in range(size):
        F[i] = a
        a, b = b, a + b
    return F
In [ ]:
nfibonacci(50)

Numba is able to compile separate specializations depending on the input types.
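
We can inspect which specializations have been compiled so far via the dispatcher object's signatures attribute (compilation happens lazily, so this list grows as the function is called with new argument types):

In [ ]:
nfibonacci.signatures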

If you want fine-grained control over types chosen by the compiler, you can tell Numba the function signature (types) to expect.

In [ ]:
from numba import int32

@jit(int32[:](int32))
def nfibonacci(size):
    F = np.empty(size, np.int32)  # match the declared int32[:] return type
    a, b = 0, 1
    for i in range(size):
        F[i] = a
        a, b = b, a + b
    return F
In [ ]:
nfibonacci(50)

Compilation is deferred until the first function execution. Numba will infer the argument types at call time, and generate optimized code based on this information.

In [ ]:
def pairwise_python(X):
    M = X.shape[0]
    N = X.shape[1]
    D = np.empty((M, M), dtype=np.float64)
    for i in range(M):
        for j in range(M):
            d = 0.0
            for k in range(N):
                tmp = X[i, k] - X[j, k]
                d += tmp * tmp
            D[i, j] = np.sqrt(d)
    return D

X = np.random.random((1000, 3))

%timeit pairwise_python(X)
In [ ]:
@jit
def npairwise_python(X):
    M = X.shape[0]
    N = X.shape[1]
    D = np.empty((M, M), dtype=np.float64)
    for i in range(M):
        for j in range(M):
            d = 0.0
            for k in range(N):
                tmp = X[i, k] - X[j, k]
                d += tmp * tmp
            D[i, j] = np.sqrt(d)
    return D
In [ ]:
%timeit npairwise_python(X)

Numba-compiled functions can call other compiled functions. In some situations, the optimizer may even inline the called function in the generated machine code.

In [ ]:
def square(x):
    return x ** 2

def hypot(x, y):
    return np.sqrt(square(x) + square(y))
In [ ]:
%timeit hypot(10, 8)
In [ ]:
@jit
def nsquare(x):
    return x ** 2

@jit
def nhypot(x, y):
    return np.sqrt(nsquare(x) + nsquare(y))
In [ ]:
%timeit nhypot(10, 8)

Numba can compile most NumPy functions, as well as generators.

Numba does not compile things like lists, sets, dictionaries (tuples are compiled), comprehensions, and string operations, so there will be no speedup for these.

As with all performance tools, the best strategy is not to apply the @jit decorator all over your code, but to use Python's profiling tools to identify "hotspots" in your program, and selectively apply @jit.
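
When applying @jit to a hotspot, it can also help to request nopython mode, which makes Numba raise an error rather than silently falling back to slow object mode when it meets unsupported code. Here is a small illustrative sketch (not part of the lecture's examples):

In [ ]:
from numba import jit

@jit(nopython=True)
def running_mean(x):
    # Cumulative mean of a 1-d array, written with an explicit loop
    total = 0.0
    out = np.empty_like(x)
    for i in range(x.shape[0]):
        total += x[i]
        out[i] = total / (i + 1)
    return out

running_mean(np.random.randn(1000))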

Exercise: EM speedup

Use the method of your choice to speed up the execution of the EM algorithm that was implemented in the previous lecture.

In [ ]:
# Write your answer here