#!/usr/bin/env python
# coding: utf-8

# [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/fonnesbeck/Bios8366/blob/master/notebooks/Section3_3-High-Performance-Python.ipynb)
# 
# # High Performance Python
# 
# Python is slower than compiled languages for a variety of reasons:
# 
# ### Python is Dynamically Typed rather than Statically Typed.
# 
# What this means is that at the time the program executes, the interpreter doesn't know the type of the variables that are defined. For example, the difference between a C variable and a Python variable is summarized by this diagram:
# 
# ![](http://jakevdp.github.io/images/cint_vs_pyint.png)
# 
# For a variable in C, the compiler knows the type by its very definition. For a variable in Python, all you know at the time the program executes is that it's some sort of Python object.
# 
# So if you write the following in C:
# 
# ```C
# int a = 1;
# int b = 2;
# int c = a + b;
# ```
# 
# the C compiler knows from the start that a and b are integers: they simply can't be anything else! With this knowledge, it can call the routine which adds two integers, returning another integer which is just a simple value in memory. As a rough schematic, the sequence of events looks like this:
# 
# **C Addition**
# 
# 1. Assign 1 to a
# 2. Assign 2 to b
# 3. call binary_add(a, b)
# 4. Assign the result to c
# 
# The equivalent code in Python looks like this:
# 
# ```python
# a = 1
# b = 2
# c = a + b
# ```
# 
# here the interpreter knows only that 1 and 2 are objects, but not what type of object they are. So the interpreter must inspect PyObject_HEAD for each variable to find the type information, and then call the appropriate summation routine for the two types. Finally, it must create and initialize a new Python object to hold the return value. The sequence of events looks roughly like this:
# 
# **Python Addition**
# 
# 1. Assign 1 to a
#     - Set a->PyObject_HEAD->typecode to integer
#     - Set a->val = 1
# 2. Assign 2 to b
#     - Set b->PyObject_HEAD->typecode to integer
#     - Set b->val = 2
# 3. call binary_add(a, b)
#     - find typecode in a->PyObject_HEAD
#     - a is an integer; value is a->val
#     - find typecode in b->PyObject_HEAD
#     - b is an integer; value is b->val
#     - call binary_add(a->val, b->val)
#     - result of this is result, and is an integer.
# 4. Create a Python object c
#     - set c->PyObject_HEAD->typecode to integer
#     - set c->val to result
# 
# Dynamic typing means that there are a lot more steps involved in any operation. This is a primary reason that Python is slow compared to C for operations on numerical data.
# 
# ### Python is interpreted rather than compiled.
# 
# We saw above one difference between interpreted and compiled code. A smart compiler can look ahead and optimize for repeated or unneeded operations, which can result in speed-ups. Compiler optimization is its own beast, and I'm personally not qualified to say much about it, so I'll stop there.
# 
# ### Python's object model can lead to inefficient memory access
# 
# We saw above the extra type info layer when moving from a C integer to a Python integer. Now imagine you have many such integers and want to do some sort of batch operation on them. In Python you might use the standard list object, while in C you would likely use some sort of buffer-based array.
# 
# A NumPy array in its simplest form is a Python object built around a C array. That is, it has a pointer to a contiguous data buffer of values.
# A Python list, on the other hand, has a pointer to a contiguous buffer of pointers, each of which points to a Python object which in turn has references to its data (in this case, integers). This is a schematic of what the two might look like:
# 
# ![](http://jakevdp.github.io/images/array_vs_list.png)
# 
# You can see that if you're doing some operation which steps through data in sequence, the NumPy layout will be much more efficient than the Python layout, both in the cost of storage and the cost of access.

# ## Speeding up statistical computations in Python
# 
# In the age of "big data" and sophisticated Bayesian and statistical learning algorithms, many are interested in optimizing the performance of the high-level languages that we use to analyse data.
# 
# [NumPy](http://numpy.scipy.org/) gets us part of the way there on Python:
# 
# * Storage of multidimensional data
# * Efficient data access
# * Efficient in-memory storage
# * Fast methods and functions for data manipulation
# 
# For many applications, this is sufficient to drastically improve performance. However, there is plenty of scope for improving Python's performance in situations where speed matters.
# 
# Pure Python and Python with NumPy are not particularly fast. Below are some recent performance benchmarks comparing several computing languages (taken directly from the [Julia website](http://julialang.org)):
# 
# ![benchmarks](images/benchmarks.svg)
# 
# So, while fast relative to some scientific computing choices (*e.g.* R, Matlab), Python sometimes needs to be tweaked in order to make it a competitive choice for implementing modern statistical methods. We will cover several approachable ways of improving the performance of Python.
# 
# > One does have to be careful when interpreting benchmarks. They involve many assumptions about how algorithms are implemented that may not be applicable to real-world applications in a particular language.
# 
# ## Profiling
# 
# Before you barrel ahead and prematurely optimize your Python code, it is important to understand **why** and **where** your code is slow. This is achieved by systematically accounting for the resources that your code is using, such as memory, CPU time or data transfer. This process is broadly referred to as ***profiling***, and it allows you to identify where the performance bottlenecks in your code lie.
# 
# Here, we will concentrate on optimizing performance for **CPU-bound** problems.
# 
# There are a number of tools to help you profile your code.
# 
# ### `time`
# 
# For those of you on UNIX platforms, the built-in utility `time` can be used to assess how long your code takes to run.

# In[ ]:


get_ipython().system('time python ../examples/abc.py')


# The output from `time` can be interpreted as:
# 
# * `real`: elapsed (wall) time
# * `user`: time spent in your code
# * `sys`: time spent in system (kernel) functions
# 
# The last two quantities account for the cycles actually used by your program; the remaining `real` time is often due to waiting for information either from disk or a network connection (I/O).
# 
# Python also has a `time` module (and function of the same name) that is more rudimentary; it simply returns the current time, in seconds since the Epoch (1/1/1970).
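# The demonstration below differences `time.time()` values. One caveat worth noting: when timing short intervals it is generally better to use `time.perf_counter()`, which reads the highest-resolution monotonic clock available rather than the wall clock. A minimal sketch of the same pattern:
# 
# ```python
# import time
# 
# start = time.perf_counter()          # high-resolution, monotonic timer
# sum(i * i for i in range(1000000))   # some code of interest
# elapsed = time.perf_counter() - start
# ```
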
# In[ ]:


import time

time.time()


# We can use this for profiling by differencing the times before and after running some code of interest:

# In[ ]:


import numpy as np

start_time = time.time()
np.product(range(1, 100000))
end_time = time.time()

end_time - start_time


# Note, however, that it does not provide a breakdown of where the code spends its time.
# 
# ### IPython magic: `%timeit`, `%run` and `%prun`
# 
# IPython has three built-in "magic" functions that are useful for profiling your code.
# 
# The `%timeit` magic executes a Python statement or expression in a loop to see how long we expect it to take for any given call. Additionally, it repeats the loop a number of times and reports a summary of the results.
# 
# As an example, consider a Python implementation of the **trapezoidal rule**, a method from numerical analysis for approximating a definite integral. Specifically, it allows us to approximate:
# 
# $$\int_a^b f(x) dx$$
# 
# using the approximation:
# 
# $$\int_a^b f(x) dx \approx (b-a) \frac{f(b) + f(a)}{2}$$
# 
# Rather than use a single interval for this estimate, we break the interval down into $n$ subintervals, to obtain a more accurate approximation.

# In[ ]:


def f(x):
    return 2*x*x + 3*x + 1

def trapez(a, b, n):
    h = (b-a)/n
    sumy = 0
    x = a
    for i in range(n):
        x += h
        sumy += f(x)
    sumy += 0.5*(f(a) + f(b))
    return sumy*h


# In[ ]:


trapez(1, 5, 10000)


# To confirm that this works, we can compare this to the symbolic solution, using SymPy:

# In[ ]:


import sympy as sym

xs = sym.symbols('xs')
fx = 2*xs*xs + 3*xs + 1
ifx = sym.integrate(fx, (xs, 1, 5))
ifx.evalf()


# In[ ]:


get_ipython().run_line_magic('timeit', '-n 50 -r 5 trapez(1, 5, 10000)')


# `%timeit` tries to pick suitable values for the number of loops and repeats; these values can be overridden by specifying `-n` and `-r` values, respectively.
# 
# The `%run` command with a `-p` option allows you to run complete programs under the control of the Python profiler. It writes the output to the help pane, which opens at the bottom of the page.

# In[ ]:


# This code redirects pager output to a regular cell
from IPython.core import page
page.page = print


# In[ ]:


get_ipython().run_line_magic('run', '-p ../examples/abc.py')


# The profiling output includes the following information:
# 
# * `ncalls`: number of calls to the function
# * `tottime`: total time spent in the given function (excluding time in calls to sub-functions)
# * `percall`: time per call
# * `cumtime`: cumulative time spent in this and all subfunctions
# 
# We can see that most of the time in this example is spent inside of core NumPy functions and methods.
# 
# The `%prun` command does a similar job for single Python expressions (like function calls).

# In[ ]:


get_ipython().run_line_magic('prun', 'trapez(2, 6, 100000)')


# For even more fine-grained profiling information, we can use a line profiler to see how long it takes each line of a function to run.

# In[ ]:


get_ipython().system('pprofile ../examples/bisection.py')


# This output makes it clear that the biggest cost is in the repeated calling of the function $f$ for which the root is being found. If we could improve the speed of this function, it would be the easiest single way of improving the performance of the code.
# 
# ### SnakeViz
# 
# SnakeViz is a browser-based graphical viewer for the output of Python's cProfile module. Though it is ostensibly a command-line tool, SnakeViz includes IPython line and cell magics for going straight from code to a visualization.
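# Since SnakeViz visualizes the output of `cProfile`, the command-line workflow is to dump a profile to a file and then point `snakeviz` at it; a sketch (assuming `snakeviz` is installed, the commands are run from the `examples` directory, and the output filename is arbitrary):
# 
#     python -m cProfile -o abc.prof abc.py
#     snakeviz abc.prof
# 
# Here we will use the IPython magics instead.
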
# In[ ]:


get_ipython().run_line_magic('load_ext', 'snakeviz')


# Let's load the ABC example from the source, so that we can run it in the notebook.

# In[ ]:


get_ipython().run_line_magic('load', '../examples/abc.py')


# In[ ]:


get_ipython().run_line_magic('snakeviz', 'abc(y=np.random.normal(4, 2, 50), N=20, epsilon=[0.2, 0.8]).mean(0)')


# The default SnakeViz visualization displays profiles as a *sunburst* that represents function calls as a series of nested arcs, expanding outward. Thus, the root function is represented as a circle at the center, with the functions it calls displayed as arcs wrapping around it, and so forth. The amount of time spent inside a function is represented by the angular width of the arc. Hence, an arc that wraps most of the way around the circle represents a function that is taking up most of the time of its calling function, while a narrow arc represents a function that is using relatively little time.
# 
# Functions don't just spend time calling other functions; they also have their own internal time. SnakeViz shows this by putting a special child on each node that represents internal time. Only functions that call other functions will have this; functions that make no calls are entirely internal time.
# 
# SnakeViz is useful for a global overview of the runtime of your project because it might uncover functions that do not seem (at first) to be worth optimizing because of their short runtime, but are in fact called from other functions so that their total runtime is significant.
# 
# ![snakeviz](images/snakeviz.png)

# ### Exercise: Profiling EM
# 
# Use one or more of the above methods to profile your expectation maximization code, and locate any bottlenecks.

# In[ ]:


# Write your answer here


# ## Speeding up Python by being Pythonic
# 
# When you have decided that your code is unacceptably slow, and have gone through the process of profiling to see if and where your program is experiencing a bottleneck, it can be easy to jump ahead and try speeding it up using external tools. There are several packages that will certainly improve Python's performance (and we will introduce some of them later), but the first place to look for better performance is in **refactoring** your implementation of whichever algorithm you happen to be using.
# 
# Effective Python programming involves applying particular Python **idioms** effectively; these are idiomatic expressions that may be unfamiliar if you are coming from another language, but when used appropriately they can make your code more readable, faster, or both. You have seen some of these already -- for example, the **list comprehension** as a means for succinctly implementing a `for` loop.
# 
# ### Comprehensions

# In[ ]:


def do_math(x):
    return 3 + x**3


# In[ ]:


get_ipython().run_cell_magic('timeit', '', 'squares = np.empty(1000000)\nfor i in range(1000000):\n    squares[i] = do_math(i)\n')


# In[ ]:


get_ipython().run_line_magic('timeit', 'squares = [do_math(i) for i in range(1000000)]')


# Here, not only is the list comprehension easier to write and read, it is also slightly faster.
# 
# ### Generators
# 
# When you are dealing with a large number of elements that you do not need all at once, you can also consider another Python expression we have already seen, a **generator**.
# For example, if we enclose the comprehension in parentheses instead of square brackets, we get a **generator expression** object:

# In[ ]:


(i**2 for i in range(int(1e20)))


# Now, rather than storing all $10^{20}$ elements in memory, we can produce values as needed:

# In[ ]:


squares = (i**2 for i in range(int(1e10)))

next(squares)


# In[ ]:


next(squares)


# ### Built-in functions
# 
# Before you go about coding your own functions, make sure that what you need isn't already provided as a **built-in function**. These are typically highly optimized, and written in C! [Here is a list of built-in functions](https://docs.python.org/3/library/functions.html).
# 
# ### String concatenation
# 
# Just as you should avoid growing lists or arrays by concatenation or appending, iterating over strings and concatenating them manually is very inefficient. For example, let's say we want to concatenate a list of strings into a single string:

# In[ ]:


words = ["Six", "days", "in", "to", "what", "should", "be", "a", "greatest", "two", "months", "of", "my",
         "life", "and", "it’s", "turned", "in", "to", "a", "nightmare"]


# One might be tempted to code the following:

# In[ ]:


get_ipython().run_cell_magic('timeit', '', 'sentence = ""\nfor word in words:\n    sentence += " " + word\n')


# In[ ]:


sentence = ""
for word in words:
    sentence += " " + word
sentence


# However, this is inefficient; since strings are immutable in Python, every `+` operation involves creating a new string and copying the old content. Instead, we can use the string method `join`, which is not only faster, but more flexible. Here, we would like to separate the words by spaces, which is easily done:

# In[ ]:


' '.join(words)


# In[ ]:


get_ipython().run_line_magic('timeit', "' '.join(words)")


# ### Avoid loops
# 
# As we have seen, `for` loops in Python are slow. Wherever possible, avoid looping by using alternative strategies or vectorized operations. For example, say we wanted to return the common elements between two arrays. We might naively loop over both arrays, comparing them elementwise to return their intersection:

# In[ ]:


np.random.seed(1)
list1 = np.random.choice(np.arange(20), replace=False, size=10)
list2 = np.random.choice(np.arange(20), replace=False, size=10)

def common_elements(a, b):
    for i in a:
        for j in b:
            if i==j:
                yield i


# In[ ]:


get_ipython().run_line_magic('timeit', 'list(common_elements(list1, list2))')


# However, this involves two Python `for` loops and a conditional statement. Instead, we can use set operations on the built-in `set` type provided by Python:

# In[ ]:


get_ipython().run_line_magic('timeit', 'set(list1) & set(list2)')


# ### Use NumPy
# 
# Often, considerable performance gains can be achieved by replacing Python data structures and functions with corresponding NumPy versions. NumPy provides a high-performance multidimensional array object, and tools for working with these arrays.
# 
# This example, borrowed from NumPy creator Travis Oliphant, solves Laplace's equation over a 2-d rectangular grid using a simple iterative method. The code finds a two-dimensional function $u$ where $\nabla^2 u = 0$, given some fixed boundary conditions.
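# Discretizing the Laplacian with centered differences and solving for the central point gives the update rule that both implementations below apply repeatedly over the interior grid points:
# 
# $$u_{i,j} \leftarrow \frac{(u_{i+1,j} + u_{i-1,j})\,\Delta y^2 + (u_{i,j+1} + u_{i,j-1})\,\Delta x^2}{2(\Delta x^2 + \Delta y^2)}$$
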
# In[ ]:


dx = 0.1
dy = 0.1
dx2 = dx*dx
dy2 = dy*dy

def py_update(u):
    nx, ny = u.shape
    for i in range(1, nx-1):
        for j in range(1, ny-1):
            u[i,j] = ((u[i+1, j] + u[i-1, j]) * dy2 +
                      (u[i, j+1] + u[i, j-1]) * dx2) / (2*(dx2+dy2))


# In[ ]:


def calc(N, Niter=100, func=py_update, args=()):
    u = np.zeros([N, N])
    u[0] = 1
    for i in range(Niter):
        func(u, *args)
    return u


# This code takes a very long time to converge to the correct solution. For a 100x100 grid, visually-indistinguishable convergence occurs after about 8000 iterations.

# In[ ]:


get_ipython().run_line_magic('timeit', 'calc(10)')


# Using NumPy, we can speed this code up significantly by using slicing and vectorized (automatic looping) calculations that replace the explicit loops in the Python-only solution.

# In[ ]:


def num_update(u):
    u[1:-1, 1:-1] = ((u[2:, 1:-1] + u[:-2, 1:-1])*dy2 +
                     (u[1:-1, 2:] + u[1:-1, :-2])*dx2) / (2*(dx2+dy2))


# In[ ]:


get_ipython().run_line_magic('timeit', 'calc(10, func=num_update)')


# Such speed-ups are not uncommon when using NumPy to replace Python loops where the inner loop is doing simple math on basic data-types.

# ## Memoize repeated function calls
# 
# The Fibonacci sequence benchmark cited in the table above from the Julia website uses the following implementation:

# In[ ]:


def fib(n):
    if n < 2:
        return n
    return fib(n-1) + fib(n-2)


# This is a simple recursive function, and as such, its running time grows quickly with the size of `n` passed to it.

# In[ ]:


fib(5)


# In[ ]:


fib(20)


# In[ ]:


fib(40)


# Notice how long the last call took. We can use the `timeit` magic to quantify the speed of this naive implementation.

# In[ ]:


get_ipython().run_line_magic('timeit', 'fib(20)')


# The first thing you should notice is that, being a recursive function, the same values will tend to be calculated over and over again. What if, instead of re-calculating values that have already been computed, we store them so that they can be retrieved the next time they are needed? Almost always, looking up a value stored in an array or hash table is going to be faster than performing an expensive calculation.
# 
# This optimization technique is called **memoization**.
# 
# Here is a memoization function that is implemented as a **decorator**.

# In[ ]:


def memoize(f):
    memo = {}
    def func(arg):
        if arg not in memo:
            memo[arg] = f(arg)
        return memo[arg]
    return func


# We can simply annotate our function with this decorator, and run it the same way as the original function.

# In[ ]:


@memoize
def fib_memo(n):
    if n < 2:
        return n
    return fib_memo(n - 1) + fib_memo(n - 2)


# In[ ]:


fib_memo(20)


# In[ ]:


get_ipython().run_line_magic('timeit', 'fib_memo(20)')


# This is a massive speedup.
# 
# The built-in `functools` library includes an efficient memoizing decorator called `lru_cache` (LRU: Least Recently Used).

# In[ ]:


from functools import lru_cache

@lru_cache(maxsize=None)
def fib_memo(n):
    if n < 2:
        return n
    return fib_memo(n - 1) + fib_memo(n - 2)


# The only constraint on `lru_cache` is that the arguments to the function must be hashable.

# ## Fast array expression evaluation with `numexpr`
# 
# [`numexpr`](http://code.google.com/p/numexpr/) allows array expressions to be evaluated far faster than what can be achieved in Python using NumPy arrays. `numexpr` parses a string expression and optimizes and compiles the code on the fly, using a virtual machine that includes a [Just-in-time (JIT) compiler](http://en.wikipedia.org/wiki/Just-in-time_compilation).
# In addition, `numexpr` offers direct support for parallel multi-threaded computations, since Python's global interpreter lock is bypassed.
# 
# > Python's global interpreter lock (GIL) ensures that only one thread runs in the interpreter at once. This simplifies many of the low-level activities, such as memory management, and allows for co-operative multi-tasking. But, since the currently-running thread holds onto the interpreter, it makes multi-core parallelization difficult.
# 
# Part of the reason Python can be slow for array calculations is that it creates temporary arrays to store intermediate results from array element calculations, which wastes memory and cache. `numexpr` handles such calculations in manageable chunks, which accelerates computation.
# 
# The speedup over NumPy by using `numexpr` can be as high as 20x, but is typically in the range of 2-4x.
# 
# ### Example: Computing a polynomial

# In[ ]:


import numpy as np

x = np.linspace(-1, 1, 1000000)

0.25*x**3 + 0.75*x**2 - 1.5*x - 2


# In[ ]:


get_ipython().run_line_magic('timeit', '0.25*x**3 + 0.75*x**2 - 1.5*x - 2')


# In[ ]:


import numexpr as ne

ne.set_num_threads(1)


# In[ ]:


ne.evaluate('0.25*x**3 + 0.75*x**2 - 1.5*x - 2')


# In[ ]:


get_ipython().run_line_magic('timeit', "ne.evaluate('0.25*x**3 + 0.75*x**2 - 1.5*x - 2')")


# `numexpr` actually expands the polynomial terms so that it does not need to use a transcendental function.
# 
# We can achieve further gains in performance by multithreading the calculations:

# In[ ]:


ne.set_num_threads(4)


# In[ ]:


get_ipython().run_line_magic('timeit', "ne.evaluate('0.25*x**3 + 0.75*x**2 - 1.5*x - 2')")


# Since the performance of processors has outpaced that of memory over the past several decades, the CPU spends a lot of time waiting for data to arrive from memory; this is the ***processor-memory performance gap***.
# 
# ![performance gap](http://www.techdesignforums.com/practice/files/2013/02/tdf-snps-ARMcc-feb13-fig1lg.jpg)
# (graph courtesy http://www.techdesignforums.com)
# 
# CPU caches are often used to make up for this difference, and they are most effective when the data are laid out in memory so that the cache can be exploited. `numexpr` does this by moving contiguous blocks of data from memory to the CPU cache, reusing them as much as possible within the cache to more quickly give the CPU access to the data.
# 
# ### Limitations
# 
# `numexpr` only implements element-wise operations. So, `a * b` becomes:
# 
#     for i in range(N):
#         c[i] = a[i] * b[i]
# 
# Similarly, it cannot reference other elements of an array within the same expression, so a computation like the following cannot be expressed:
# 
#     for i in range(N):
#         c[i] = a[i-1] + a[i] * b[i]

# ## Cython
# 
# Python developers typically solve performance constraints by building Python extensions that wrap code written in other languages (for example, SciPy contains more lines of C/C++/Fortran than Python). However, programming with the Python/C API is not straightforward for most users.
# 
# Cython is a language that allows Python programmers to write fast code without having to write C/C++/Fortran directly. It looks much like Python code, but with type declarations. Cython code is translated to C (or C++ or others), which is then compiled to create a Python extension that we can import and use.
# 
# Using Cython, we can achieve speedups of several orders of magnitude, sometimes approaching (or even exceeding) hand-coded C. In addition, Cython is compatible with core scientific programming tools like NumPy and IPython.
# 
# Cython has built-in support for multicore processing.
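# For example, `cython.parallel.prange` distributes the iterations of a loop across OpenMP threads once the GIL has been released. A minimal sketch, separate from the examples below (it assumes a compiler with OpenMP support, which is why the extra compile and link flags are passed):
# 
# ```python
# %%cython --compile-args=-fopenmp --link-args=-fopenmp
# from cython.parallel import prange
# 
# def parallel_sum(double[:] x):
#     cdef double total = 0
#     cdef Py_ssize_t i
#     # prange releases the GIL and splits the iterations across threads;
#     # the in-place += on total is compiled as a parallel reduction
#     for i in prange(x.shape[0], nogil=True):
#         total += x[i]
#     return total
# ```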
# 
# Cython is used to varying degrees by other packages in the Python scientific stack, such as pandas, sympy, scikit-learn and SciPy.
# 
# ### Example: Numerical integration
# 
# Recall from above the function `trapez` for performing numerical integration using the trapezoidal rule. As a benchmark, let's time the execution of the pure-Python version of the trapezoidal rule:

# In[ ]:


get_ipython().run_line_magic('timeit', 'trapez(1, 5, 1000)')


# Perhaps the easiest way to use Cython is via the IPython `%%cython` magic, which allows us to run Cython interactively:

# In[ ]:


get_ipython().run_line_magic('load_ext', 'cython')


# In[ ]:


get_ipython().run_cell_magic('cython', '', '\ndef f(x):\n    return 2*x*x + 3*x + 1\n\ndef trapez2(a, b, n):\n    h = (b-a)/float(n)\n    sumy = 0\n    x = a\n    for i in range(n):\n        x += h\n        sumy += f(x)\n    sumy += 0.5*(f(a) + f(b))\n    return sumy*h\n')


# The Cython magic is doing a lot of work for you: it compiles the code into an extension module, and loads it into the notebook. This allows us to ignore all of the compilation details of building Cython extensions.
# 
# If we run `trapez2`, we can see a reasonable speedup simply by compiling it, unchanged, using Cython.

# In[ ]:


get_ipython().run_line_magic('timeit', 'trapez2(1, 5, 1000)')


# Under the hood, several things are happening in order to deliver this improved performance. The Cython source code is translated into C source code by `cython`. Then, this C source is compiled, using the appropriate compiler, flags and associated library files (if any), into a Python extension. This extension is then loaded by IPython into the current session.
# 
# ![cython flow](images/cython.png)
# 
# C extensions can also be compiled manually, using a setup file. Here is an example for an extension called `dist` within a package called `probability`:
# 
# ```python
# from distutils.core import setup
# from distutils.extension import Extension
# from Cython.Distutils import build_ext
# 
# import numpy as np
# 
# setup(
#     cmdclass = {'build_ext': build_ext},
#     ext_modules = [Extension("dist", ["probability/src/dist.pyx"], include_dirs=[np.get_include()])]
# )
# ```
# 
# It mainly uses machinery from a core Python package, `distutils`, that manages the build process.
# 
# To get a closer look at where Cython is improving our unchanged Python code, we can add an `--annotate` flag to the `%%cython` magic declaration:

# In[ ]:


get_ipython().run_cell_magic('cython', '--annotate -3', '\ndef f(x):\n    return 2*x*x + 3*x + 1\n\ndef trapez2(a, b, n):\n    h = (b-a)/float(n)\n    sumy = 0\n    x = a\n    for i in range(n):\n        x += h\n        sumy += f(x)\n    sumy += 0.5*(f(a) + f(b))\n    return sumy*h\n')


# In the above, the line color indicates the "typedness" of the code, where yellower lines are closer to Python, and therefore require calls to the Python C API, while whiter lines indicate code that is closer to pure C, hence requiring few, if any, Python API calls.
# 
# If you click on a line, it unravels to show you the C code that results from the call to `cython`.
# 
# The goal in speeding up code with Cython is to turn as many lines to white as we can.
# The easiest way to do this is to add type declarations to the Python code:

# In[ ]:


get_ipython().run_cell_magic('cython', '--annotate -3', '\n# Add type to argument\ndef ff(double x):\n    return 2*x*x + 3*x + 1\n\n# Add types to arguments\ndef trapez3(double a, double b, int n):\n    # Declare types of variables\n    cdef double h, x, sumy\n    cdef int i\n    h = (b-a)/float(n)\n    sumy = 0\n    x = a\n    for i in range(n):\n        x += h\n        sumy += ff(x)\n    sumy += 0.5*(ff(a) + ff(b))\n    return sumy*h\n')


# In[ ]:


get_ipython().run_line_magic('timeit', 'trapez3(1, 5, 1000)')


# This gives us a considerable speedup. The next thing we might try is to *inline* the polynomial function. By inlining, we mean that we ask the compiler to perform an inline expansion of the function; that is, the body of the function is inserted at each place where the function is called, rather than being invoked through a function call.
# 
# We do three things to the specification of `ff`:
# 
# * change `def` to `cdef`
# * add a return type to the function
# * add an `inline` keyword

# In[ ]:


get_ipython().run_cell_magic('cython', '--annotate -3', '\ncdef inline double ff(double x):\n    return 2*x*x + 3*x + 1\n\ncpdef trapez4(double a, double b, int n):\n    cdef double h, x, sumy\n    cdef int i\n    h = (b-a)/n\n    sumy = 0\n    x = a\n    for i in range(n):\n        x += h\n        sumy += ff(x)\n    sumy += 0.5*(ff(a) + ff(b))\n    return sumy*h\n')


# The `cdef` keyword declares a C object. Everything that follows it is therefore specified in terms of C; we are essentially writing C, but using a subset of Python's syntax rules. So, when we declare a function with `cdef`, as we do for `ff`, it is a C function, and is not available to you from Python.
# 
# `cpdef` is a hybrid declaration that creates both a C interface and a Python interface to the function.
# 
# Let's see how this performs.

# In[ ]:


get_ipython().run_line_magic('timeit', 'trapez4(1, 5, 1000)')


# Woof! That's a big speedup, and there's not much yellow left in the annotated code.
# 
# If you would like a very simple way of injecting types into your code with Cython, without modifying any of the code itself, you can use the `@cython.locals` decorator. Note that this does not yield as large a speedup as the one we have just achieved.

# In[ ]:


get_ipython().run_cell_magic('cython', '-3', 'import cython\n\n@cython.locals(x=cython.double)\ndef f(x):\n    return 2*x*x + 3*x + 1\n\n@cython.locals(a=cython.double, b=cython.double, n=cython.int,\n               h=cython.double, sumy=cython.double, i=cython.int,\n               x=cython.double, func=cython.double)\ndef trapez5(a, b, n):\n    h = (b-a)/float(n)\n    sumy = 0\n    x = a\n\n    for i in range(n):\n        x += h\n        sumy += f(x)\n    sumy += 0.5*(f(a) + f(b))\n    return sumy*h\n')


# In[ ]:


get_ipython().run_line_magic('timeit', 'trapez5(1, 5, 1000)')


# If you can stand to look at it, you can peek at all the C code that is generated by Cython just to optimize this short function.

# In[ ]:


get_ipython().run_line_magic('load', '../examples/trapezoid.c')


# Because of its convenience, running Cython from IPython is a preferred way of using the language. However, if we have some **legacy C/C++ code** that we wish to use in Python, we can do that by writing a wrapper and calling `cython` from the terminal.
# 
# Here is the C code:

# In[ ]:


get_ipython().run_line_magic('load', '../examples/fact.h')


# And here is the Cython wrapper. Cython code is stored in files with a `.pyx` extension.

# In[ ]:


get_ipython().run_line_magic('load', '../examples/fact.pyx')


# In[ ]:


get_ipython().system('cython ../examples/fact.pyx')


# Now we can compile the extension.
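# The cell below invokes `gcc` directly, with platform-specific flags and a hard-coded Python version in the include path. A more portable sketch (assuming `setuptools` and Cython are installed) is to let `cythonize` work out those details through a minimal `setup.py`:
# 
# ```python
# from setuptools import setup
# from Cython.Build import cythonize
# 
# # build the extension in place with:  python setup.py build_ext --inplace
# setup(ext_modules=cythonize("fact.pyx", language_level=3))
# ```
# 
# For completeness, here is the manual compilation:
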
# In[ ]:


import os, sys

os.chdir('../examples')

if sys.platform == "darwin":
    get_ipython().system('gcc -Wall -fno-strict-aliasing -static -undefined dynamic_lookup -bundle -arch x86_64 -I$CONDA_PREFIX/include/python3.9 -o fact.so fact.c')
else:
    get_ipython().system('gcc -shared -pthread -fPIC -fwrapv -O2 -Wall -fno-strict-aliasing -I$CONDA_PREFIX/include/python3.9 -o fact.so fact.c')

os.chdir('../notebooks')


# In[ ]:


get_ipython().system('cp ../examples/fact.so .')


# In[ ]:


import fact

fact.fact(5)


# ### Using lists and arrays in Cython
# 
# The above example used only scalar variables. When we have vector-valued data, we need to declare the appropriate types. Here's a simple example, using a function that calculates the Euclidean distance between two arrays:

# In[ ]:


def euclidean(x, y):
    x = np.array(x)
    y = np.array(y)
    return np.sqrt(((x - y) ** 2).sum())


# In[ ]:


get_ipython().run_line_magic('timeit', 'euclidean(np.random.randn(10), np.random.randn(10))')


# In order to get a speedup under Cython, we need to iterate over the elements of each passed array, and aggregate them manually.

# In[ ]:


get_ipython().run_line_magic('load_ext', 'Cython')


# In[ ]:


get_ipython().run_cell_magic('cython', '--annotate', '\nimport cython\ncimport numpy as np\nfrom libc.math cimport sqrt\n\n@cython.boundscheck(False)\n@cython.wraparound(False)\ndef euclidean2(np.ndarray[np.float64_t, ndim=1] x,\n               np.ndarray[np.float64_t, ndim=1] y):\n    cdef:\n        double diff\n        int i\n    diff = 0\n    for i in range(x.shape[0]):\n        diff += (x[i] - y[i])**2\n    return sqrt(diff)\n')


# In[ ]:


get_ipython().run_line_magic('timeit', 'euclidean2(np.random.randn(10), np.random.randn(10))')


# The decorators applied to `euclidean2` are **compiler directives** that alter the behavior of Cython code. Setting `boundscheck` to False removes boundary checking for indexing operations, forcing us to ensure that we do not try to index arrays using index values that are out of bounds. When we set `wraparound` to False, Cython will not support negative indexing the way Python does. While these directives may increase the speed of our code, they can be dangerous; if we do not ensure that we index our arrays properly, they may cause segmentation faults or data corruption.
# 
# The full set of compiler directives is described in the [Cython docs](http://docs.cython.org/src/reference/compilation.html#compiler-directives).
# 
# Here is the same code using lists instead of NumPy arrays:

# In[ ]:


get_ipython().run_cell_magic('cython', '--annotate', '\nfrom libc.math cimport sqrt\n\ndef euclidean3(list x, list y):\n    cdef:\n        double diff\n        int i\n    diff = 0\n    for i in range(len(x)):\n        diff += (x[i] - y[i])**2\n    return sqrt(diff)\n')


# In[ ]:


get_ipython().run_line_magic('timeit', 'euclidean3(np.random.randn(10).tolist(), np.random.randn(10).tolist())')


# ### `pyximport`
# 
# If we have some Cython source code, we can use `pyximport` to directly import it as if it were a Python module.

# In[ ]:


import pyximport

# Move source into current directory
get_ipython().system('cp ../examples/trapezoid.pyx .')

# Allow it to use Python's import mechanism
pyximport.install(language_level=3)

from trapezoid import trapez as trapez_pyx

trapez_pyx(1, 10, 10)


# In other words, it treats `.pyx` files as if they were `.py` files. This includes detecting changes in the source file, if any, and recompiling it as necessary before importing.

# ### Benchmark example: Gibbs sampling
# 
# Let's see if we can use Cython to speed up MCMC.
# 
# Gibbs sampler for the function:
# 
# $$f(x,y) = x^2 \exp(-xy^2 - y^2 + 2y - 4x)$$
# 
# using the conditional distributions:
# 
# $$x|y \sim \text{Gamma}(3, y^2 + 4)$$
# $$y|x \sim \text{Normal}\left(\frac{1}{1+x}, \frac{1}{2(1+x)}\right)$$
# 
# Here is the pure Python implementation:

# In[ ]:


from numpy import zeros, random, sqrt

gamma = random.gamma
normal = random.normal

def pygibbs(N=20000, thin=200):
    mat = zeros((N, 2))
    x, y = mat[0]
    for i in range(N):
        for j in range(thin):
            x = gamma(3, y**2 + 4)
            y = normal(1./(x+1), 1./sqrt(2*(x+1)))
        mat[i] = x, y

    return mat


# In[ ]:


get_ipython().run_line_magic('timeit', 'pygibbs(1000, 10)')


# Unchanged, compiling this code with Cython results in a slight improvement in speed.

# In[ ]:


get_ipython().run_cell_magic('cython', '', '\nfrom numpy import zeros, random, sqrt\ngamma = random.gamma\nnormal = random.normal\n\ndef cygibbs(N=20000, thin=200):\n    mat = zeros((N,2))\n    x,y = mat[0]\n    for i in range(N):\n        for j in range(thin):\n            x = gamma(3, y**2 + 4)\n            y = normal(1./(x+1), 1./sqrt(2*(x+1)))\n        mat[i] = x,y\n\n    return mat\n')


# In[ ]:


get_ipython().run_line_magic('timeit', 'cygibbs(1000, 10)')


# Now, for some type declarations:

# In[ ]:


get_ipython().run_cell_magic('cython', '', '\nfrom numpy import zeros, random, sqrt\nfrom numpy cimport *\ngamma = random.gamma\nnormal = random.normal\n\ndef cygibbs2(int N=20000, int thin=200):\n    cdef:\n        ndarray[float64_t, ndim=2] mat = zeros((N,2))\n        float64_t x,y = 0\n        int i,j\n    for i in range(N):\n        for j in range(thin):\n            x = gamma(3, y**2 + 4)\n            y = normal(1./(x+1), 1./sqrt(2*(x+1)))\n        mat[i] = x,y\n\n    return mat\n')


# In[ ]:


get_ipython().run_line_magic('timeit', 'cygibbs2(1000, 10)')


# A full-blown "Cythonization" involves using GSL's random number generators, and giving Cython a few more instructions:
# 
# If you are using Homebrew on OS X, you can install GSL using `brew`:
# 
#     !brew install gsl

# In[ ]:


get_ipython().run_cell_magic('cython', '-lm -lgsl -lgslcblas', '\ncimport cython\nimport numpy as np\nfrom numpy cimport *\n\ncdef extern from "math.h":\n    double sqrt(double)\n\ncdef extern from "gsl/gsl_rng.h":\n    ctypedef struct gsl_rng_type\n    ctypedef struct gsl_rng\n\n    gsl_rng_type *gsl_rng_mt19937\n    gsl_rng *gsl_rng_alloc(gsl_rng_type * T) nogil\n\ncdef extern from "gsl/gsl_randist.h":\n    double gamma "gsl_ran_gamma"(gsl_rng * r,double,double)\n    double gaussian "gsl_ran_gaussian"(gsl_rng * r,double)\n\ncdef gsl_rng *r = gsl_rng_alloc(gsl_rng_mt19937)\n\n@cython.wraparound(False)\n@cython.boundscheck(False)\ndef gibbs(int N=20000,int thin=500):\n    cdef:\n        double x=0\n        double y=0\n        int i, j\n        ndarray[float64_t, ndim=2] samples\n\n    samples = np.empty((N,thin))\n    for i from 0 <= i < N:\n        for j from 0 <= j < thin:\n            x = gamma(r,3,1.0/(y*y+4))\n            y = gaussian(r,1.0/sqrt(x+1))\n        samples[i,0] = x\n        samples[i,1] = y\n    return samples\n')


# In[ ]:


get_ipython().run_line_magic('timeit', 'gibbs(1000, 10)')


# ## Numba
# 
# Cython precompiles parts of Python code before running. Another approach is **Just-in-Time (JIT)** compilation. Numba is a compiler that runs Python code through an LLVM compiler to produce optimized machine code for fast execution. Numba does not require a C/C++ compiler on your machine.
# 
# Numba's core API is a **decorator**.
# 
# The `@jit` decorator runs the decorated function through bytecode analysis and the function arguments through a type inference engine, and generates an intermediate representation of your code, which is then passed to LLVM for compilation to machine code.
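# One practical note before the examples below: in older Numba releases, plain `@jit` could silently fall back to a much slower "object mode" when type inference failed, whereas `@numba.njit` (shorthand for `@jit(nopython=True)`) raises an error instead; in recent Numba versions nopython mode is the default. A minimal sketch:
# 
# ```python
# import numpy as np
# from numba import njit
# 
# @njit   # compile in nopython mode; fail loudly rather than silently running slowly
# def summed_squares(x):
#     total = 0.0
#     for i in range(x.shape[0]):
#         total += x[i] ** 2
#     return total
# 
# summed_squares(np.arange(10.0))   # the first call triggers compilation
# ```
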
# In[ ]:


from numba import jit

import warnings
warnings.filterwarnings("ignore")


# In[ ]:


def nfibonacci(size):
    F = np.empty(size, 'int')
    a, b = 0, 1
    for i in range(size):
        F[i] = a
        a, b = b, a + b
    return F


# In[ ]:


get_ipython().run_line_magic('timeit', 'nfibonacci(50)')


# In[ ]:


@jit
def nfibonacci(size):
    F = np.empty(size, 'int')
    a, b = 0, 1
    for i in range(size):
        F[i] = a
        a, b = b, a + b
    return F


# In[ ]:


get_ipython().run_line_magic('timeit', 'nfibonacci(50)')


# Numba is able to compile separate specializations depending on the input types. If you want fine-grained control over the types chosen by the compiler, you can tell Numba the function signature (types) to expect.

# In[ ]:


from numba import int32

@jit(int32[:](int32))
def nfibonacci(size):
    F = np.empty(size, 'int')
    a, b = 0, 1
    for i in range(size):
        F[i] = a
        a, b = b, a + b
    return F


# In[ ]:


get_ipython().run_line_magic('timeit', 'nfibonacci(50)')


# Compilation is deferred until the first function execution. Numba will infer the argument types at call time, and generate optimized code based on this information.

# In[ ]:


def pairwise_python(X):
    M = X.shape[0]
    N = X.shape[1]
    D = np.empty((M, M), dtype=float)
    for i in range(M):
        for j in range(M):
            d = 0.0
            for k in range(N):
                tmp = X[i, k] - X[j, k]
                d += tmp * tmp
            D[i, j] = np.sqrt(d)
    return D

X = np.random.random((1000, 3))

get_ipython().run_line_magic('timeit', 'pairwise_python(X)')


# In[ ]:


@jit
def npairwise_python(X):
    M = X.shape[0]
    N = X.shape[1]
    D = np.empty((M, M), dtype=float)
    for i in range(M):
        for j in range(M):
            d = 0.0
            for k in range(N):
                tmp = X[i, k] - X[j, k]
                d += tmp * tmp
            D[i, j] = np.sqrt(d)
    return D

get_ipython().run_line_magic('timeit', 'npairwise_python(X)')


# Numba-compiled functions can call other compiled functions. In some situations, the optimizer may even inline the called function in the generated machine code.

# In[ ]:


def square(x):
    return x ** 2

def hypot(x, y):
    return np.sqrt(square(x) + square(y))


# In[ ]:


get_ipython().run_line_magic('timeit', 'hypot(10, 8)')


# In[ ]:


@jit
def nsquare(x):
    return x ** 2

@jit
def nhypot(x, y):
    return np.sqrt(nsquare(x) + nsquare(y))


# In[ ]:


get_ipython().run_line_magic('timeit', 'nhypot(10, 8)')


# Numba **can** compile *most* NumPy functions, as well as generators.
# 
# Numba **cannot** compile things like lists, sets, dictionaries (tuples are compiled), comprehensions, and string operations, so there will be no speedup for these.
# 
# As with all performance tools, the best strategy is not to apply the `@jit` decorator all over your code, but to use Python's profiling tools to identify "hotspots" in your program, and selectively apply `@jit`.

# ### Exercise: Fixed-point algorithm speedup
# 
# Use the method of your choice to speed up the execution of the fixed-point algorithm that was implemented in the optimization lecture (you will find it in the `examples` directory).

# In[ ]:


# Write your answer here


# ---
# ## References
# 
# VanderPlas, J. (2014) [Why Python is Slow: Looking Under the Hood](https://jakevdp.github.io/blog/2014/05/09/why-python-is-slow/)
# 
# VanderPlas, J. (2015) [Optimizing Python in the Real World: NumPy, Numba, and the NUFFT](https://jakevdp.github.io/blog/2015/02/24/optimizing-python-with-numpy-and-numba/)
# 
# [A guide to analyzing Python performance](http://www.huyng.com/posts/python-performance-analysis/)
# 
# [Kurt Smith's Cython tutorial from SciPy 2013](https://www.youtube.com/watch?v=JKCjsRDffXo)