A Year of Numba (the year 2018!)

A lot changed in Numba throughout 2018, and this notebook aims to show some of the highlights. A key driver of many of the changes made this year has been direct response to feedback from the Numba user community. The core development team is very grateful for such excellent feedback and would like to thank everyone who has contributed.

For reference, online chat help is available on Gitter, and issues for bugs/questions/feature requests are welcome on GitHub.

Updates to Numba's dependencies

  • Numba now supports Python 2.7 and Python 3.4-3.7. Typically one release cycle is all that is required to adopt a new 3.x Python version.
  • Numba (via llvmlite) is now backed by LLVM 7.0. Three LLVM releases have been used to back Numba this year, 5, 6 and 7. Again, it is typical for only a single release cycle to be required to adopt a new LLVM version.
  • CUDA 8.0 is now the minimum supported CUDA version.

New NumPy function support:

These are not demonstrated individually in this notebook (though a small combined taste follows the list), but it is worth mentioning that on the order of 30 additional NumPy functions/ndarray methods (or additional kwarg support) have been added to Numba this year. As always, a full list of supported NumPy functionality can be found in the documentation. The features new for 2018 are:

  • numpy.ndarray methods:

    • numpy.ndarray.conj
    • numpy.ndarray.conjugate
    • numpy.ndarray.argsort
    • numpy.ndarray.dot
    • numpy.ndarray.transpose
  • NumPy functions:

    • numpy.ascontiguousarray
    • numpy.percentile
    • numpy.convolve
    • numpy.corrcoef
    • numpy.correlate
    • numpy.cov
    • numpy.dtype
    • numpy.ediff1d
    • numpy.fill_diagonal
    • numpy.nancumprod
    • numpy.nancumsum
    • numpy.nanpercentile
    • numpy.partition
    • numpy.reshape
    • numpy.transpose
    • numpy.tri
    • numpy.tril
    • numpy.triu
    • numpy.unique
    • numpy.vander
    • numpy.random.randint
    • numpy.random.permutation
  • Updates to NumPy function kwargs:

    • np.searchsorted (side kwarg available)
    • np.argsort (kind kwarg with quicksort and mergesort available)
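
As a quick combined taste, here is a minimal sketch (an illustration added for this write-up, not part of the original feature work) exercising a few of the newly supported items: np.unique, np.tril and the side kwarg of np.searchsorted.

In [ ]:
import numpy as np
from numba import njit

@njit
def numpy_taste(a, mat):
    uniq = np.unique(a)                           # newly supported
    idx = np.searchsorted(uniq, 3, side='right')  # new `side` kwarg
    lower = np.tril(mat)                          # newly supported
    return uniq, idx, lower

numpy_taste(np.array([3, 1, 2, 3, 1]), np.ones((3, 3)))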

Updates to CUDA:

A large number of CUDA bugs have been fixed, and the following enhancements made:

  • Support for FMA and faster float64 atomics on suitable hardware.
  • Records in const memory.
  • Improved datetime dtype support.
  • The addition of the __cuda_array_interface__ member following the NumPy array interface specification to allow Numba to consume externally defined device arrays. This was undertaken in coordination with the CuPy project to test out the concept and to enable the use of CuPy GPU arrays (see the sketch after this list).
  • Support for the NumPy transpose API.
  • Permit direct assignment to DeviceNDArrays.
  • Support added for a larger selection of bit manipulation intrinsics and also SELP.
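
To illustrate the __cuda_array_interface__ point above, here is a minimal sketch (an illustration added for this write-up, not part of the original notebook) that passes a CuPy device array straight to a Numba CUDA kernel. It assumes a CUDA-capable GPU and that the cupy package is installed (it is not listed in the prerequisites below).

In [ ]:
import numpy as np
from numba import cuda

if cuda.is_available():
    import cupy as cp  # extra dependency, assumed installed for this sketch

    @cuda.jit
    def gpu_add_one(x):
        i = cuda.grid(1)
        if i < x.size:
            x[i] += 1

    d_arr = cp.arange(10.0)                # a CuPy array living on the GPU
    gpu_add_one.forall(d_arr.size)(d_arr)  # consumed via __cuda_array_interface__, zero copy
    np.testing.assert_equal(cp.asnumpy(d_arr), np.arange(10.0) + 1)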

The Demonstration

Prerequisites

This demonstration needs at least the following:

  • Python 2.7 or >= 3.4
  • NumPy >= 1.11
  • llvmlite >= 0.27.0
  • Numba >= 0.42.0
  • SciPy >= 0.16
  • progressbar2

All of these can be obtained via the Anaconda Python distribution using the conda package manager (pip also works!):

$ conda create -n demo_env python=3 numpy scipy numba progressbar2

If you can, try to download this notebook and run it locally. However, if you are running this on Binder, performance results may be a bit dubious as, understandably, the hardware isn't state-of-the-art and only a couple of cores are available. Further, the GPU examples won't work on Binder as there are no GPUs available. Here's what Numba detects about the hardware on which this is running:

In [ ]:
!numba -s

Next, import the most commonly used modules/functions.

In [ ]:
import numpy as np
import numba
from numba import jit, njit, prange, config, generated_jit

New Numba/Python interaction

A number of significant improvements were made in the interaction between Numba and Python.

Passing jitted functions as arguments

Numba now allows the passing of jitted functions (and containers of jitted functions) as arguments to other jitted functions.

In [ ]:
add_one = njit(lambda x: x + 1)
div_two = njit(lambda x: x / 2)

@njit
def apply_funcs(a, f1, f2):   
    return f1(a) + f2(a)

A = np.arange(10)
apply_funcs(A, add_one, div_two)

Lists can now contain NumPy arrays, lists, etc.

Numba's list handling gained support for containing reference-counted types, like NumPy arrays and lists. This permits a more natural programming style, as it's common for users to want to pass a list of array arguments to a function. Note: lists still only support homogeneous data types, i.e. everything in a list must be the same type, such as all items being NumPy arrays. Here's an example:

In [ ]:
a_list = [(1, 2), (3, 4), (5, 6)]

@njit
def list_conv(alist):
    new_list = []
    for item in alist:
        new_list.append(np.ones(item))
    return new_list

list_conv(a_list)

Running object mode blocks inside nopython mode

Experimental support was added for executing an arbitrary block of code in object mode inside a nopython mode function. This is really useful if you want to, e.g., occasionally update an image with results or show a progress bar from within a loop.

In [ ]:
import progressbar
import time
from numba import objmode

n = 100
with progressbar.ProgressBar(max_value=n) as thebar:
    @njit
    def foo(): # this is a nopython mode function
        x = y = 0
        for i in range(n):
            x += np.sqrt(np.cos(n) ** 2 + np.sin(n) ** 2)
            # but this block jumps into object mode; `j` is defined in object
            # mode, so we need to tell nopython mode its type so it can be used
            # outside this block in nopython mode
            with objmode(j='int64'): 
                thebar.update(i)
                time.sleep(0.05)
                j = i + 10 # j is defined in object mode
            y += j
        return x, y
    print(foo())

Unicode strings

Initial support for Unicode strings has been implemented for Python versions >= 3.4. Support for fundamental string operations has been added, as well as support for strings as arguments and return values. The next release of Numba will contain performance updates and additional features to further enhance string support.

In [ ]:
if config.PYVERSION >= (3, 4): # Only supported in Python >= 3.4
    
    @njit
    def strings_demo(str1, str2, str3):
        # strings, ---^  ---^   ---^
        # as arguments are now supported!
        
        # defining strings in compiled code also works
        def1 = 'numba is '
        
        # as do unicode strings
        def2 = '🐍⚡'
        
        # also string concatenation 
        print(str1 + str2)
        
        # comparison operations
        x = True
        x |= (str1 == str2)
        x |= (str1 < str2)
        x |= (str1 <= str2)
        x |= (str1 > str2)
        x |= (str1 >= str2)
        
        # {starts,ends}with
        x |= (str1.startswith(str3))
        x |= (str2.endswith(str3))
        
        # len()
        print(len(str1), len(def2), len(str3))
        
        # str.find()
        print(str2.find(str3))
        
        # in
        x |= (str3 in str2)
        
        # slicing
        print(str2[1:], str1[:1])
        
        # and finally, strings can also be returned
        return '\nnum' + str1[1::-1] + def1[5:] + def2
    
    
    # run the demo
    print(strings_demo('abc', 'zba', 'a'))

Exceptions with tracebacks relating to Python source.

Tracebacks from exceptions raised in jitted code now contain a synthesized stack frame pointing at the location where the exception was raised. The stack frame is based on the Python source from which the code was compiled, so it looks like a CPython traceback but is coming from compiled code! This makes it easier to use exceptions in nopython mode, as it is now possible to find out the location from which they were raised. Try commenting/uncommenting the @njit decorator and rerunning!

In [ ]:
import traceback
@njit
def raise_exception(x):
    if x == 0:
        raise Exception('raised x==0. Also, exception arguments are correctly handled', 123, 4j)

try:
    raise_exception(0)
except Exception:
    traceback.print_exc()

Usability

Numba now supports the use of a per-project configuration file to permanently set behaviours typically set via the NUMBA_* family of environment variables (this requires the pyyaml package to be present). Much effort has gone into improving error reporting and the general usability of Numba; this includes highlighted error messages (requires the colorama package and a color scheme to be set) and performance tips documentation.
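
For example, a minimal sketch of such a configuration file (assuming the .numba_config.yaml file name and key format described in the documentation, where each key is the corresponding NUMBA_* variable name lowercased and without the prefix), placed in the directory the application is run from:

# contents of .numba_config.yaml
color_scheme: dark_bg    # equivalent to setting NUMBA_COLOR_SCHEME=dark_bg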

In 2018 Numba's primary exception handling mechanism was rewritten based on user feedback to make tracebacks much shorter and more user friendly. Improvements include:

  • Better error messages, often including diagnostic output.
  • Identification of the offending source code: it is now printed with line numbers and a caret pointing to the offending line.
  • Categorized help messages referring to documentation that may be useful.

For example, this is an invalid use of the add built-in. Note the display of the argument types, the declaration of the signatures Numba knows about, the offending source printed and pointed to, and the help message:

In [ ]:
@njit
def f(x, y):
    return x + y

try:
    f(1, (2,)) # invalid
except:
    traceback.print_exc()

This is an example of a type unification failure, a problem commonly encountered by users. Note that the two locations causing the problem are highlighted and the types are listed:

In [ ]:
@njit
def f(x):
    # Numba cannot statically determine if the return type is tuple or int
    if x > 10:
        return (1,)
    else:
        return 1
try:
    f(0)
except:
    traceback.print_exc()

In this example a list of unknown type is used. Note that Numba identifies this and shows a specialized help message.

In [ ]:
from numba import jit
@jit(nopython=True)
def f(x):
    tmp = [] # defined empty
    return (tmp, x) # ERROR: the type of `tmp` is unknown
try:
    f(1)
except:
    traceback.print_exc()

Extending Numba

The Numba extension API has gained the ability to operate more easily with functions from Cython modules through the use of numba.extending.get_cython_function_address to obtain function addresses for direct use with ctypes.CFUNCTYPE.

In [ ]:
import ctypes
import scipy.special
from numba.extending import get_cython_function_address

addr = get_cython_function_address("scipy.special.cython_special", "j0")
functype = ctypes.CFUNCTYPE(ctypes.c_double, ctypes.c_double)
bind_j0 = functype(addr)

The function bind_j0 can now be used inside jitted functions, for example:

In [ ]:
@njit
def double_j0(x):
    return 2 * bind_j0(x)

val = 0.5

assert double_j0(val) == 2 * scipy.special.j0(val)

Literal value support

Numba 0.41.0 introduced a significant change to the typing system that aims to clean up the use of constants. This change takes the form of support for type-specific literal values in the type inference mechanism. Typing now makes two passes: the first with anything that is a constant and can be expressed as a literal set as such (integers, strings, slices and make_function are implemented as literals at present), the second with the standard types used for the constants. This, for example, permits value-based dispatch as demonstrated below, but also opens up a lot of future possibilities surrounding typing which were inaccessible prior to this change.

In [ ]:
@generated_jit
def myoverload(arg):
    literal_val = getattr(arg, 'literal_value', None)
    if literal_val is not None:
        if literal_val == 100:
            def impl_1(arg):
                return 'dispatched: impl_1(literal, value 100)'
            return impl_1
        else:
            def impl_2(arg):
                return 'dispatched: impl_2(literal, value not 100)'
            return impl_2
    else:
        def impl_3(arg):
            return 'dispatched: impl_3(non-literal type)'
        return impl_3

@njit
def example(x):
    print(myoverload(100))         # literal value 100, dispatches impl_1
    print(myoverload(99))          # literal value 99, dispatches impl_2
    a = 50 + 25 + 2 * 10 + 15 // 3 # `a` is const expr value 100
    print(myoverload(a))           # `a` has literal value 100, dispatches impl_1
    b = 50 * x                     # `b` non-literal, it's an intp type
    print(myoverload(b))           # `b` non-literal intp, has no value, dispatches impl_3

example(2)

ParallelAccelerator

Following the introduction of ParallelAccelerator technology into Numba in mid-2017, steady improvements were made throughout 2018, including the following...

The thread pool implementation used by Numba for automatic multithreading is now configurable to use Intel's Threading Building Blocks (TBB), OpenMP, or the old "workqueue" implementation (TBB is likely to become the preferred default in a future release); documentation is here. For the purposes of this example, TBB will be used (note that the tbb package must be installed for this threading layer to be available):

In [ ]:
numba.config.THREADING_LAYER = 'tbb'

Now define a demonstration kernel:

In [ ]:
@njit(parallel=True)
def parallel_demo(x):
    n = x.shape[0]
    a = np.sin(x)
    b = np.cos(a * a)
    acc = 0
    for i in prange(n - 2):
        for j in prange(n - 1):
            acc += b[i] + b[j + 1]
    return acc

parallel_demo(np.arange(10))

The threading layer used for execution can easily be inspected as follows...

In [ ]:
numba.threading_layer()

Documentation on ensuring thread and fork-safety whilst using Numba is available here. It's now both possible and safe to fork, spawn, use threads, and use Numba's compiler and threading backend simultaneously...

In [ ]:
import multiprocessing as mp
from threading import Thread, get_ident

nthreads = 4

# this function is going to be compiled
def function(n):
    x = np.dot( 3 + 5j * np.ones((n, n)), 2 + 1j * np.ones((n, n)))
    return np.linalg.norm(x + np.arange(n) + n)

# this work is done by each thread, concurrent compilation and execution
def thread_work(results, n):
    compiled_function = njit(parallel=True, nogil=True)(function)
    # pointless extra work to stress the backend
    [compiled_function(n) for _ in range(10)]
    # return one result key'd by thread id
    results[get_ident()] = compiled_function(n)

# this work is done by each process
def process_work(n):   
    # spin up some threads to do the thread_work
    results = dict()
    tpool = [Thread(target=thread_work, args=(results, n)) for _ in range(nthreads)]
    [t.start() for t in tpool]
    [t.join() for t in tpool]
    # reduce the per thread results
    acc = 0
    for v in results.values():
        acc += v
    return acc

# This starts a process pool of 4 processes and maps the process work to it.
# Each process then starts 4 threads.
# Each thread compiles a function for use in the parallel backend and then repeatedly runs it
# before returning a single result.
# Each process then reduces the results of all of its threads and returns that result.
p = mp.Pool(4)
print(p.map(process_work, [100, 200, 300, 400]))

Parallel support for np.arange and np.linspace, as well as np.mean, np.std and np.var, has been added. For example:

In [ ]:
@njit(parallel=True)
def parallel_new_function_support(n):
    a = np.arange(0, n, 1, np.float64)
    b = np.linspace(0, 1, n)
    m = np.mean(a)
    s = np.std(b)
    v = np.var(a + b)
    return m + s + v

parallel_new_function_support(5)

Parallel loops now allow arrays as reduction variables, for example:

In [ ]:
@njit(parallel=True)
def parallel_array_reduction(n):
    y = np.arange(n)
    x = np.ones_like(y)
    for i in prange(10):
        y += x
    return y

parallel_array_reduction(5)

Having got ParallelAccelerator working well and achieving great performance, the question users asked most often was "what did it do?!". Numba 0.41.0 added Parallel Diagnostics functionality to address this: calling the parallel_diagnostics member function on a function compiled with parallel=True shows the optimizations performed by ParallelAccelerator. The documentation for this feature is here and includes a guide for interpreting the output. For example, obtain the Parallel Diagnostics with verbosity level 4 for the parallel_demo function above:

In [ ]:
parallel_demo.parallel_diagnostics(level=4)

More general Diagnostics

Support was added for profiling Numba-compiled functions in Intel VTune: simply set the NUMBA_ENABLE_PROFILING environment variable and ask VTune to profile the execution.
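
For example, the environment variable can be set for a single run from the shell before pointing VTune at that process (a minimal sketch; the script name is a placeholder):

$ NUMBA_ENABLE_PROFILING=1 python my_numba_script.py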

Further, as a result of community feedback, the Numba dispatcher's inspect_types() method now supports the kwarg pretty, which if set to True will produce ANSI/HTML output showing the annotated types when invoked from IPython/Jupyter Notebook respectively. Green highlighting shows compiled loops, and yellow shows interpreted code. For example:

In [ ]:
class clazz(object):

    def __init__(self, arr):
        self._arr = arr

    @property
    def arr(self):
        return self._arr

    @arr.setter
    def arr(self, value):
        self._arr = value

@jit # use of the class instance `c` prevents nopython mode compilation.
def foo(a, c):

    c.arr += 12

    for i in range(len(a)):
        a[i] = np.sqrt(i) + 7
        if i % 8 + 1 > 4:
            a[i] -= np.pi


    return a

A = np.arange(100.)
class_inst = clazz(11)
foo(A, class_inst)
foo.inspect_types(pretty=True)

New hardware support

Support for the ppc64le, aarch64 (64-bit ARMv8) and armv7l (32-bit little-endian ARMv7; yes, Numba now works on a Raspberry Pi!) architectures has been added.

Intel SVML

Further considerable improvements in vectorization are now available as Numba supports Intel's short vector math library (SVML). Try it out with conda install -c numba icc_rt. This primarily permits the vectorization of fundamental math functions and provides lower-precision variants for use in fastmath mode.

In [ ]:
def demo_svml(n):
    ret = np.arange(n)
    return np.sqrt(np.sin(ret) ** 2 + np.cos(ret) ** 2)

demo_svml_njit = njit(demo_svml)
demo_svml_fastmath = njit(fastmath=True)(demo_svml)

count = 10000
%timeit -o -r 10 -n 1000 demo_svml(count)
%timeit -o -r 10 -n 1000 demo_svml_njit(count)
%timeit -o -r 10 -n 1000 demo_svml_fastmath(count);

AMD ROCm

2018 saw the addition of a new GPU backend: AMD's ROCm (Radeon Open Compute). Kernels for ROCm-supported AMD GPUs can now be compiled using the ROCm driver on Linux. Documentation is here, with information about prerequisites here. The kernel launch syntax and programming model are very similar to those found in Numba's CUDA support, and ufuncs work exactly the same way. For example:

In [ ]:
from numba import roc
from numba import cuda
from numba import vectorize, float64

# this is to handle thread model indexing differences between CUDA and ROCm
global_id_func = {roc: lambda : roc.get_global_id(0),
                  cuda: lambda : cuda.threadIdx.x + cuda.blockIdx.x * cuda.blockDim.x}

# Same code, different GPU backends !
for gpu_target in ['roc', 'cuda']:
    backend = getattr(numba, gpu_target)
    if backend.is_available():

        @backend.jit(device=True)
        def inner(a, b):
            return a + b

        index = backend.jit(device=True)(global_id_func[backend])

        @backend.jit
        def outer(A, B):
            i = index()
            if i < A.size:
                A[i] = inner(A[i], B[i])

        A = np.arange(10)
        Aorig = A.copy()
        B = np.arange(10)

        outer.forall(A.size)(A, B)
        assert not np.all(Aorig == A)
        np.testing.assert_equal(Aorig + B, A)

        sig = [float64(float64, float64)]

        @vectorize(sig, target=gpu_target)
        def vector_add(a, b):
            return a + b

        a = np.arange(100.)
        b = np.arange(100.) * 3.14
        c = vector_add(a, b)
        np.testing.assert_equal(c, a + b)