Numba is a Python package designed to increase the performance of Python applications that rely heavily on NumPy code for computation. Numba is a 'just-in-time' (JIT) compiler for NumPy code: it works by compiling your NumPy code at runtime into optimised machine code.
Numba works best on NumPy code that can be encapsulated neatly in separate, minimal functions, such as operations on arrays or code with loops. Numba is most commonly used through Python decorators, which are placed above the function you wish to optimise via just-in-time compilation.
You will likely see performance gains in most NumPy code by using Numba with its default options and the commonly applied @jit decorator. However, this decorator by itself does not cause Numba to run computations in parallel (the initial speed-up comes simply from having "jitted" code). To exploit Numba's parallelism features we have to go beyond the basics, but first let's look at some simple examples of jitted code.
The most commonly used Numba feature is the @jit decorator. This decorator is placed above a function like so:
from numba import jit
import numpy as np
import time

x = np.arange(1000000).reshape(1000, 1000)

@jit(nopython=True)  # Set "nopython" mode for best performance
def go_fast(a):  # Function is compiled to machine code when called the first time
    trace = 0
    for i in range(a.shape[0]):  # Numba likes loops
        trace += np.tanh(a[i, i])  # Numba likes NumPy functions
    return a + trace  # Numba likes NumPy broadcasting

# You can optionally run the function the first time to compile it
result = go_fast(x)
nopython=True
The nopython=True mode ensures that the Numba-compiled code does not use any of the Python C-API. It requires that the native types of all values can be inferred, and that no new Python objects are created in the function. If these criteria cannot be met, a warning is issued and Numba falls back to a less optimised mode.
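As a quick sketch of when nopython mode cannot be used: Numba cannot infer the type of an arbitrary Python class instance, so compilation fails at the first call (the class and function names below are made up for illustration):

from numba import jit

class Point:  # a plain Python class; Numba cannot infer its type
    def __init__(self, x):
        self.x = x

@jit(nopython=True)
def get_x(p):
    return p.x

try:
    get_x(Point(1.0))  # the first call triggers compilation, which fails here
except Exception as e:
    print("nopython compilation failed:", type(e).__name__)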
Now we are going to time our function. If you are not using a Jupyter notebook, you can uncomment the timing measurements below, or use a profiling tool of your choice. (I am using the built-in feature of Jupyter notebooks that allows us to time the execution of a notebook cell: %%timeit.)
%%timeit
# If you are not using a jupyter notebook, you can uncomment the timing measurements below:
#t1 = time.time()
result = go_fast(x)
#t2 = time.time()
#delta_t = t2 - t1
#print("Time taken: {}".format(delta_t))
349 ms ± 10.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
And for comparison, the non-numba version:
import numpy as np
import time

x = np.arange(1000000).reshape(1000, 1000)

def go_slow(a):  # Function is run as standard Python/NumPy code
    trace = 0
    for i in range(a.shape[0]):
        trace += np.tanh(a[i, i])
    return a + trace
%%timeit
# If you are not using a jupyter notebook, you can uncomment the timing measurements below:
#t1 = time.time()
result = go_slow(x)
#t2 = time.time()
#delta_t = t2 - t1
#print("Time taken: {}".format(delta_t))
397 ms ± 2.95 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Adding the Numba @jit decorator gives us a fairly minor speed-up when the plain NumPy code is jitted.
Update: some people running this code have reported that the non-Numba version is actually faster. We think this may be due to updates in recent versions of NumPy that optimise the basic NumPy code better than previously. (If you have any insights of your own into this, please let us know!)
Let's see if we can now get better performance using the parallel options in Numba.
To make use of Numba's auto-parallelisation, we can use the parallel keyword in the @jit decorator. Adding the parallel keyword tells Numba to attempt to find regions of the code that can be parallelised, such as operations that are independent across loop iterations. To use this, we pass an additional keyword argument to the @jit decorator: @jit(parallel=True).
parallel=True
With auto-parallelisation, Numba attempts to identify operations in a user program that can be readily parallelised, and fuses adjacent ones together to form one or more kernels that are automatically run in parallel. The process is fully automated, without modifications to the user program (except for the decorator syntax itself, which is placed above the function definition).
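For instance, adjacent element-wise array expressions are exactly the kind of operations Numba can fuse into a single parallel kernel. A minimal sketch (the function name is made up for illustration):

from numba import jit
import numpy as np

@jit(nopython=True, parallel=True)
def hypot_fused(a, b):
    # the element-wise multiply, add and sqrt can be fused
    # into one kernel and run across multiple threads
    return np.sqrt(a * a + b * b)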
A further keyword argument is also used in these Numba examples:
nopython=True
A Numba compilation mode that generates code that does not access the Python C API. This compilation mode produces the highest-performance code, but requires that the native types of all values in the function can be inferred and that no new objects are allocated. Unless otherwise instructed, the @jit decorator will automatically fall back to object mode if nopython mode cannot be used.
from numba import jit
import numpy as np
import time

x = np.arange(100000000).reshape(10000, 10000)

@jit(nopython=True, parallel=True)  # Set "nopython" mode for best performance
def go_fast_parallel(a):  # Function is compiled to machine code when called the first time
    trace = 0
    for i in range(a.shape[0]):  # Numba likes loops
        trace += np.tanh(a[i, i])  # Numba likes NumPy functions
    return a + trace  # Numba likes NumPy broadcasting

# You can optionally run the function the first time to compile it
result = go_fast_parallel(x)
%%timeit
result = go_fast_parallel(x)
198 ms ± 6.72 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Adding the argument parallel=True to the Numba @jit decorator gives us a much better speed-up. The results from my (4-core) laptop were:
parallel=True: 198 ms
Try it on your own machine and see if you get comparable results.
We won't go into the full details of how Numba works in this mini-tutorial, but in essence it uses clever heuristics to determine whether a loop or other construct in a function can be parallelised. This means you may not always get a speed-up from the parallel=True argument, as Numba's internal logic may decide that a loop cannot be parallelised, or is not worth parallelising. The aim of the Numba module is to make parallelisation easy for the end user (by just adding a decorator with a few keyword arguments), at the expense of hiding many of the details of what is going on internally.
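If you want to see what Numba actually parallelised, recent versions of Numba expose a parallel_diagnostics method on the compiled function. Assuming the function has been called at least once (so it has been compiled):

# prints a report of the loops and operations Numba fused and parallelised
go_fast_parallel.parallel_diagnostics(level=1)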
Numba will automatically try to detect loops that can be parallelised with parallel=True, but it also gives you the option of creating parallel loops manually. This functionality is provided by the numba.prange function.
The prange function can be used in place of the standard Python range function, in a for loop for example. This tells Numba that the loop is suitable for parallelisation. However, it is up to the user to determine whether the loop can be safely parallelised, i.e. the iterations of the loop should be computable independently of each other.
When using Numba's prange, a reduction is inferred automatically if a variable is being updated by a binary function/operator (i.e. +, -, /, *). In other words, you do not have to explicitly sum up the separate computations from each parallel task; Numba will do this for you.
So for example if we had a function that had a loop like this:
def standard_range_test(A):  # 'A' would be a 1D numpy array in this example
    s = 0
    for i in range(A.shape[0]):
        s += A[i]
    return s
We could use the numba.prange function to explicitly parallelise the loop:
from numba import jit, prange

@jit(nopython=True, parallel=True)
def prange_test(A):  # 'A' would be a 1D numpy array in this example
    s = 0
    for i in prange(A.shape[0]):
        s += A[i]
    return s
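As a quick check (the array and its size here are arbitrary), the parallel reduction should agree with NumPy's own sum, up to floating-point rounding:

import numpy as np

A = np.arange(1_000_000, dtype=np.float64)
print(prange_test(A))  # parallel reduction
print(A.sum())         # same result from plain NumPy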
Using a two-dimensional array, the setup would be similar:
from numba import jit, prange
import numpy as np

@jit(nopython=True, parallel=True)
def two_d_array_reduction_prod(n):
    shp = (13, 17)
    result1 = 2 * np.ones(shp, np.int_)
    tmp = 2 * np.ones_like(result1)
    for i in prange(n):
        result1 *= tmp
    return result1
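Each of the n iterations multiplies every element of result1 by 2, so as a quick sanity check the result should be 2 * 2**n in every cell:

print(two_d_array_reduction_prod(4))  # every element should be 2 * 2**4 = 32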
In general, when using Numba it is easier to begin by trying the auto-parallelisation feature of the @jit(parallel=True) decorator, rather than manually using the prange function. However, prange is sometimes a useful feature to have when Numba cannot automatically determine whether a loop can be parallelised.
The Numba documentation goes into more detail about when parallelisation can and cannot be inferred by Numba: https://numba.pydata.org/numba-doc/dev/user/parallel.html#supported-operations
Numba works well with non-vectorised NumPy code that iterates over many items. This doesn't mean that you should rewrite existing, vectorised NumPy code to use explicit for loops: any speed-up gained from Numba is likely to be cancelled out by the removal of the vectorised code. (Though, as always, you should try it first and profile the results.)
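As a rough illustration of this point (a sketch to profile on your own machine; the function names are made up, and a 1D array is assumed), both functions below compute the same thing, and the jitted loop is unlikely to beat the already-vectorised NumPy version by much, if at all:

import numpy as np
from numba import jit

def vectorised(a):
    # plain NumPy: the work already runs in optimised C loops
    return np.sin(a) + np.cos(a)

@jit(nopython=True)
def looped(a):
    # explicit-loop rewrite of the same computation, jitted by Numba
    out = np.empty_like(a)
    for i in range(a.shape[0]):
        out[i] = np.sin(a[i]) + np.cos(a[i])
    return out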
Numba is a useful tool if you are using NumPy and want to quickly exploit thread parallelism on multi-core CPUs. Because it only requires simple decorators and decorator keyword arguments, Numba can give certain hotspots in your code a quick boost from parallelisation, with relatively little effort involved.
I want to stress that there is much more to Numba than just the @jit decorator! I've only covered this to give an idea of what Numba can do. If you want to explore Numba's features in more depth, I suggest heading to the Numba website (https://numba.pydata.org/), which has many more examples and documentation.
Numba works best when your problem is described by some or all of the following criteria: heavy use of NumPy arrays and functions, computation dominated by loops, and code that can be encapsulated in separate, minimal functions.
Another thing to verify in your tests with Numba is whether or not the compiler was able to fully translate the function to nopython mode. You can add the nopython=True option to the @jit decorator to raise an exception if nopython mode is not possible.