Numba is a Python package designed to increase the performance of Python applications that rely heavily on NumPy code for computation. Numba is a 'just-in-time' (JIT) compiler for NumPy code: it works by compiling your NumPy code at runtime into optimised machine code.
Numba works best on NumPy code that can be encapsulated neatly in separate, minimal functions, such as operations on arrays or code with loops. Numba is most commonly used through Python decorators, which are placed above the function you wish to optimise via just-in-time compilation.
You will likely see performance gains in most NumPy code by using Numba with its default options and the commonly applied @jit decorator. However, this decorator by itself does not cause Numba to run computations in parallel (the initial speed-up comes simply from having "jitted" code). To exploit Numba's parallelism features we have to go beyond the basics, but first let's look at some simple examples of jitted code.
The most commonly used Numba feature is the @jit decorator. This decorator is placed above a function like so:
from numba import jit
import numpy as np
import time

x = np.arange(1000000).reshape(1000, 1000)

@jit(nopython=True)  # Set "nopython" mode for best performance
def go_fast(a):  # Function is compiled to machine code when called the first time
    trace = 0
    for i in range(a.shape[0]):  # Numba likes loops
        trace += np.tanh(a[i, i])  # Numba likes NumPy functions
    return a + trace  # Numba likes NumPy broadcasting

# You can optionally run the function the first time to compile it
result = go_fast(x)
nopython=True
The nopython=True mode ensures that the Numba-compiled code does not use any of the Python C-API. It requires that the native types of all values can be inferred, and that no new Python objects are created in the function. If these criteria cannot be met, a warning is issued and Numba falls back to a less optimised mode.
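As a quick sketch of when nopython mode cannot be used: Numba cannot infer the type of an arbitrary Python class instance, so compilation fails at the first call (the class and function names below are made up for illustration):

from numba import jit

class Point:  # a plain Python class; Numba cannot infer its type
    def __init__(self, x):
        self.x = x

@jit(nopython=True)
def get_x(p):
    return p.x

try:
    get_x(Point(1.0))  # the first call triggers compilation, which fails here
except Exception as e:
    print("nopython compilation failed:", type(e).__name__)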
Now we are going to time our function. If you are not using a Jupyter notebook, you can uncomment the timing measurements below, or use a profiling tool of your choice. (I am using the built-in feature of Jupyter notebooks that allows us to time the execution of a notebook cell: %%timeit.)
%%timeit
# If you are not using a jupyter notebook, you can uncomment the timing measurements below:
#t1 = time.time()
result = go_fast(x)
#t2 = time.time()
#delta_t = t2 - t1
#print("Time taken: {}".format(delta_t))
349 ms ± 10.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
And for comparison, the non-numba version:
import numpy as np
import time

x = np.arange(1000000).reshape(1000, 1000)

def go_slow(a):  # Function is run as standard Python/NumPy code
    trace = 0
    for i in range(a.shape[0]):
        trace += np.tanh(a[i, i])
    return a + trace
%%timeit
# If you are not using a jupyter notebook, you can uncomment the timing measurements below:
#t1 = time.time()
result = go_slow(x)
#t2 = time.time()
#delta_t = t2 - t1
#print("Time taken: {}".format(delta_t))
397 ms ± 2.95 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Adding the Numba @jit decorator gives us a fairly minor speed-up when the plain NumPy code is jitted.
Update: some people running this code have reported that the non-Numba version is actually faster. We think this may be due to updates in recent versions of NumPy that optimise the basic NumPy code better than previously. (If you have any insights of your own into this, please let us know!)
Let's see if we can now get better performance using the parallel options in Numba.
To make use of Numba's auto-parallelisation, we can use the parallel keyword in the @jit decorator. Adding the parallel keyword tells Numba to attempt to find regions of the code that can be parallelised, such as operations that are independent across loop iterations. To use this, we pass an additional keyword argument to the @jit decorator: @jit(parallel=True).
parallel=True
With auto-parallelisation, Numba attempts to identify operations in a user program that can be readily parallelised, and fuses adjacent ones together to form one or more kernels that are automatically run in parallel. The process is fully automated, without modifications to the user program (except for the decorator syntax itself, which is placed above the function definition).
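For instance, adjacent element-wise array expressions are exactly the kind of operations Numba can fuse into a single parallel kernel. A minimal sketch (the function name is made up for illustration):

from numba import jit
import numpy as np

@jit(nopython=True, parallel=True)
def hypot_fused(a, b):
    # the element-wise multiply, add and sqrt can be fused
    # into one kernel and run across multiple threads
    return np.sqrt(a * a + b * b)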
A further keyword argument is also used in these Numba examples:
nopython=True
A Numba compilation mode that generates code that does not access the Python C API. This compilation mode produces the highest-performance code, but requires that the native types of all values in the function can be inferred and that no new objects are allocated. Unless otherwise instructed, the @jit decorator will automatically fall back to object mode if nopython mode cannot be used.
from numba import jit
import numpy as np
import time

x = np.arange(100000000).reshape(10000, 10000)

@jit(nopython=True, parallel=True)  # Set "nopython" mode for best performance
def go_fast_parallel(a):  # Function is compiled to machine code when called the first time
    trace = 0
    for i in range(a.shape[0]):  # Numba likes loops
        trace += np.tanh(a[i, i])  # Numba likes NumPy functions
    return a + trace  # Numba likes NumPy broadcasting

# You can optionally run the function the first time to compile it
result = go_fast_parallel(x)
%%timeit
result = go_fast_parallel(x)
198 ms ± 6.72 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Adding the argument parallel=True to the Numba @jit decorator gives us a much better speed-up. The results from my (4-core) laptop were:
parallel=True: 198 ms
Try it on your own machine and see if you get comparable results.
We won't go into the full details of how Numba works in this mini-tutorial, but in essence it uses clever heuristics to determine whether a loop or other construct in a function can be parallelised. This means you may not always get a speed-up from the parallel=True argument, as Numba's internal logic may decide that a loop cannot be parallelised, or is not worth parallelising. The aim of the Numba module is to make parallelisation easy for the end user (by just adding a decorator with a few keyword arguments), at the expense of hiding many of the details of what is going on internally.
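If you want to see what Numba actually parallelised, recent versions of Numba expose a parallel_diagnostics method on the compiled function. Assuming the function has been called at least once (so it has been compiled):

# prints a report of the loops and operations Numba fused and parallelised
go_fast_parallel.parallel_diagnostics(level=1)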
Numba will automatically try to detect loops that can be parallelised with parallel=True, but it also gives you the option of creating parallel loops manually. This functionality is provided by the numba.prange function.
The prange function can be used in place of the standard Python range function, in a for loop for example. This tells Numba that the loop is suitable for parallelisation. However, it is up to the user to determine whether the loop can be safely parallelised, i.e. the iterations of the loop should be computable independently of each other.
When using Numba's prange, a reduction is inferred automatically if a variable is being updated by a binary function/operator (i.e. +, -, /, *). In other words, you do not have to explicitly sum up the separate computations from each parallel task; Numba will do this for you.
So for example if we had a function that had a loop like this:
def standard_range_test(A):  # 'A' would be a 1D numpy array in this example
    s = 0
    for i in range(A.shape[0]):
        s += A[i]
    return s
We could use the numba.prange function to explicitly parallelise the loop:
from numba import jit, prange

@jit(nopython=True, parallel=True)
def prange_test(A):  # 'A' would be a 1D numpy array in this example
    s = 0
    for i in prange(A.shape[0]):
        s += A[i]
    return s
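As a quick check (the array and its size here are arbitrary), the parallel reduction should agree with NumPy's own sum, up to floating-point rounding:

import numpy as np

A = np.arange(1_000_000, dtype=np.float64)
print(prange_test(A))  # parallel reduction
print(A.sum())         # same result from plain NumPy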
Using a two-dimensional array, the setup would be similar:
from numba import jit, prange
import numpy as np

@jit(nopython=True, parallel=True)
def two_d_array_reduction_prod(n):
    shp = (13, 17)
    result1 = 2 * np.ones(shp, np.int_)
    tmp = 2 * np.ones_like(result1)
    for i in prange(n):
        result1 *= tmp
    return result1
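Each of the n iterations multiplies every element of result1 by 2, so as a quick sanity check the result should be 2 * 2**n in every cell:

print(two_d_array_reduction_prod(4))  # every element should be 2 * 2**4 = 32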
In general, when using Numba it is easier to begin by trying the auto-parallelisation feature of the @jit(parallel=True) decorator, rather than manually using the prange function. However, prange is sometimes a useful feature to have when Numba cannot automatically determine whether a loop can be parallelised.
The Numba documentation goes into more detail about when parallelisation can and cannot be inferred by Numba: https://numba.pydata.org/numba-doc/dev/user/parallel.html#supported-operations
Numba works well with non-vectorised NumPy code that iterates over many items. This doesn't mean that you should rewrite existing, vectorised NumPy code to use explicit for loops: any speed-up gained from Numba is likely to be cancelled out by the removal of the vectorised code. (Though, as always, you should try it first and profile the results.)
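As a rough illustration of this point (a sketch to profile on your own machine; the function names are made up, and a 1D array is assumed), both functions below compute the same thing, and the jitted loop is unlikely to beat the already-vectorised NumPy version by much, if at all:

import numpy as np
from numba import jit

def vectorised(a):
    # plain NumPy: the work already runs in optimised C loops
    return np.sin(a) + np.cos(a)

@jit(nopython=True)
def looped(a):
    # explicit-loop rewrite of the same computation, jitted by Numba
    out = np.empty_like(a)
    for i in range(a.shape[0]):
        out[i] = np.sin(a[i]) + np.cos(a[i])
    return out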
Numba is a useful tool if you are using NumPy and want to quickly exploit thread parallelism on multi-core CPUs. Because it only requires simple decorators and decorator keyword arguments, Numba can give certain hotspots in your code a quick boost from parallelisation, with relatively little effort involved.
I want to stress that there is much more to Numba than just the @jit decorator! I've only covered this to give an idea of what Numba can do. If you want to explore Numba's features in more depth, I suggest heading to the Numba website (https://numba.pydata.org/), which has many more examples and documentation.
Numba works best when your problem is described by some or all of the following criteria: heavy use of NumPy arrays and functions, computation dominated by loops, and code that can be encapsulated in separate, minimal functions.
Another thing to verify in your tests with Numba is whether or not the compiler was able to fully translate the function to nopython mode. You can add the nopython=True option to the @jit decorator to raise an exception if nopython mode is not possible.