A lot changed in Numba throughout 2018, this notebook aims to show some of the highlights. A key driver of a large number of the changes made this year has been direct response to feedback provided by the Numba users community. The core development team are very grateful for such excellent feedback and would like to thank everyone who has contributed.

For reference, online help over IRC is available on Gitter and issues for bugs/questions/feature requests are welcome on Github.

- Numba now supports Python 2.7 and Python 3.4-3.7. Typically one release cycle is all that is required to adopt a new 3.x Python version.
- Numba (via llvmlite) is now backed by LLVM 7.0. Three LLVM releases have been used to back Numba this year, 5, 6 and 7. Again, it is typical for only a single release cycle to be required to adopt a new LLVM version.
- CUDA 8.0 is now the minimum supported CUDA version.

These are not going to be demonstrated in this notebook, but it is worth mentioning that somewhere in the order of an additional 30 new NumPy functions/`ndarray`

methods (or additional kwarg support) have been added to Numba this year. As always a full list of supported NumPy functionality can be found in the documentation. The new for 2018 features are:

`numpy.ndarray`

methods:`numpy.ndarray.conj`

`numpy.ndarray.conjugate`

`numpy.ndarray.argsort`

`numpy.ndarray.dot`

`numpy.ndarray.transpose`

NumPy functions:

`numpy.ascontiguousarray`

`numpy.percentile`

`numpy.convolve`

`numpy.corrcoef`

`numpy.correlate`

`numpy.cov`

`numpy.dtype`

`numpy.ediff1d`

`numpy.fill_diagonal`

`numpy.nancumprod`

`numpy.nancumsum`

`numpy.nanpercentile`

`numpy.partition`

`numpy.reshape`

`numpy.transpose`

`numpy.tri`

`numpy.tril`

`numpy.triu`

`numpy.unique`

`numpy.vander`

`numpy.random.randint`

`numpy.random.permutation`

Updates to NumPy function kwarg:

`np.searchsorted`

(`side`

kwarg available)`np.argsort`

(`kind`

kwarg with`quicksort`

and`mergesort`

available.)

A large number of CUDA bugs have been fixed, and the following enhancements made:

- Support for FMA and faster float64 atomics on suitable hardware.
- Records in const memory
- Improved datatime dtype support.
- The addition of the
`__cuda_array_interface__`

member following the NumPy array interface specification to allow Numba to consume externally defined device arrays. This was undertaken in correspondence with CuPy to test out the concept and be able to use CuPy GPU arrays. - Support for the NumPY tranpose API
- Permit direct assignment to to DeviceNDArrays
- Support added for a larger selection of bit manipulation intrinsics and also SELP.

This demonstration needs at least the following:

- Python 2.7 or >= 3.4
- NumPy >= 1.11
- llvmlite >= 0.27.0
- Numba >= 0.42.0
- SciPy >= 0.16
- progressbar2

All of which can be obtained via the Anaconda Python distribution via the `conda`

package manager (pip also works!):

`$ conda create -n demo_env python=3 numpy scipy numba progressbar2`

If you can, try and download this notebook and run it locally. However, if you are running this on binder, performance results may be a bit dubious as understandably the hardware isn't state-of-the-art and there are only a couple of cores available. Further, the GPU examples won't work on binder as there's no GPUs available. Here's what Numba detects about the hardware on which this is running:

In [ ]:

```
!numba -s
```

Next, import the most commonly used modules/functions.

In [ ]:

```
import numpy as np
import numba
from numba import jit, njit, prange, config, generated_jit
```

In [ ]:

```
add_one = njit(lambda x: x + 1)
div_two = njit(lambda x: x / 2)
@njit
def apply_funcs(a, f1, f2):
return f1(a) + f2(a)
A = np.arange(10)
apply_funcs(A, add_one, div_two)
```

Numba's list handling gained support for containing reference-counted types, like NumPy arrays and `list`

. This permits a more natural programming style as it's often common for users to want to pass a list of array arguments to a function. Note: list still only support homogeneous data types, i.e. everything in a list must be the same type, like all items being NumPy arrays. Here's an example:

In [ ]:

```
a_list = [(1, 2), (3, 4), (5, 6)]
@njit
def list_conv(alist):
new_list = []
for item in alist:
new_list.append(np.ones(item))
return new_list
list_conv(a_list)
```

`object mode`

blocks inside `nopython`

mode¶Experimental support was added for executing an arbitrary block of code in `object mode`

inside a `nopython`

mode function. This is really useful if you want to e.g. occasionally update an image with results or show a progress bar from with in a loop.

In [ ]:

```
import progressbar
import time
from numba import objmode
n = 100
with progressbar.ProgressBar(max_value=n) as thebar:
@njit
def foo(): # this is a nopython mode function
x = y = 0
for i in range(n):
x += np.sqrt(np.cos(n) ** 2 + np.sin(n) ** 2)
# but this block jumps into object mode j is defined in object mode,
# so we need to tell `nopython` mode its type so it can be used
# outside this block in nopython mode
with objmode(j='int64'):
thebar.update(i)
time.sleep(0.05)
j = i + 10 # j is defined in object mode
y += j
return x, y
print(foo())
```

Initial support for Unicode strings has been implemented for Python versions >= 3.4. Support for fundamental string operations has been added as well as support for strings as arguments and return value. The next release of Numba will contain performance updates and additional features to further enhance string support.

In [ ]:

```
if config.PYVERSION > (3, 4): # Only supported in Python >= 3.4
@njit
def strings_demo(str1, str2, str3):
# strings, ---^ ---^ ---^
# as arguments are now supported!
# defining strings in compiled code also works
def1 = 'numba is '
# as do unicode strings
def2 = '🐍⚡'
# also string concatenation
print(str1 + str2)
# comparison operations
x = True
x |= (str1 == str2)
x |= (str1 < str2)
x |= (str1 <= str2)
x |= (str1 > str2)
x |= (str1 >= str2)
# {starts,ends}with
x |= (str1.startswith(str3))
x |= (str2.endswith(str3))
# len()
print(len(str1), len(def2), len(str3))
# str.find()
print(str2.find(str3))
# in
x |= (str3 in str2)
# slicing
print(str2[1:], str1[:1])
# and finally, strings can also be returned
return '\nnum' + str1[1::-1] + def1[5:] + def2
# run the demo
print(strings_demo('abc', 'zba', 'a'))
```

Tracebacks from exceptions raised in jitted code now contain a synthesized stack frame containing the location where the exception was raised. The stack frame is based on the Python source from which is was compiled, it looks like a CPython traceback, but is coming from compiled code! This makes it easier to use exceptions in `nopython`

mode as it is now possible to find out the location from which they were raised. Try commenting/uncommenting the `@njit`

decorator and rerunning!

In [ ]:

```
import traceback
@njit
def raise_exception(x):
if x == 0:
raise Exception('raised x==0. Also, exception arguments are correctly handled', 123, 4j)
try:
raise_exception(0)
except Exception:
traceback.print_exc()
```

Numba now supports the use of a per-project configuration file to permanently set behaviours typically set via `NUMBA_*`

family environment variables (this requires the `pyyaml`

package to be present). Much effort has gone into improving error reporting and the general usability of Numba. This includes highlighted error messages and performance tips documentation (requires the `colorama`

package and a color scheme to be set).

In 2018 Numba's primary exception handling mechanism was rewritten based on user feedback to make tracebacks much shorter and more user friendly. Improvements include:

- Better error messages, often including diagnostic output.
- Identification of the offending source code, this is now printed out with line numbers and a caret pointing to the offending line.
- Categorized help messages referring to documentation that may be useful.

For example, this is an invalid use of the `add`

built-in, note the argument type display, the declaration of signatures Numba knows, the offending source is printed and pointed to and a help message appears:

In [ ]:

```
@njit
def f(x, y):
return x + y
try:
f(1, (2,)) # invalid
except:
traceback.print_exc()
```

This is an example of a type unification failure, a problem commonly encountered by users, note that the two locations that cause the problem are highlighted and the types are listed:

In [ ]:

```
@njit
def f(x):
# Numba cannot statically determine if the return type is tuple or int
if x > 10:
return (1,)
else:
return 1
try:
f(0)
except:
traceback.print_exc()
```

In this example a list of an unknown type is used, note that Numba identifies this and has a specialized help message.

In [ ]:

```
from numba import jit
@jit(nopython=True)
def f(x):
tmp = [] # defined empty
return (tmp, x) # ERROR: the type of `tmp` is unknown
try:
f(1)
except:
traceback.print_exc()
```

The Numba extension API has gained the ability operate more easily with functions from Cython modules through the use of `numba.extending.get_cython_function_address`

to obtain function addresses for direct use in `ctypes.CFUNCTYPE`

.

In [ ]:

```
import ctypes
import scipy
from numba.extending import get_cython_function_address
addr = get_cython_function_address("scipy.special.cython_special", "j0")
functype = ctypes.CFUNCTYPE(ctypes.c_double, ctypes.c_double)
bind_j0 = functype(addr)
```

The function `bind_j0`

can now be used inside jitted functions, for example:

In [ ]:

```
@njit
def double_j0(x):
return 2 * bind_j0(x)
val = 0.5
assert double_j0(val) == 2 * scipy.special.j0(0.5)
```

Numba 0.41.0 has a significant change made to the typing system that aims to clean up the use of constants. This change takes the form of support for type specific literal values in the type inference mechanism. During typing two passes are now made, the first with anything which is a constant and can expressed as a literal set as such (integers, strings, slices and `make_function`

are implemented as literals at present), the second with the standard types used for the constants. This, for example, permits value based dispatch as demonstrated below, but also opens up a lot of future possibilities surrounding typing which were inaccessible prior to this change.

In [ ]:

```
@generated_jit
def myoverload(arg):
literal_val = getattr(arg, 'literal_value', None)
if literal_val is not None:
if literal_val == 100:
def impl_1(arg):
return 'dispatched: impl_1(literal, value 100)'
return impl_1
else:
def impl_2(arg):
return 'dispatched: impl_2(literal, value not 100)'
return impl_2
else:
def impl_3(arg):
return 'dispatched: impl_3(non-literal type)'
return impl_3
@njit
def example(x):
print(myoverload(100)) # literal value 100, dispatches impl_1
print(myoverload(99)) # literal value 99, dispatches impl_2
a = 50 + 25 + 2 * 10 + 15 // 3 # `a` is const expr value 100
print(myoverload(a)) # `a` has literal value 100, dispatches impl_1
b = 50 * x # `b` non-literal, it's an intp type
print(myoverload(b)) # `b` non-literal intp, has no value, dispatches impl_3
example(2)
```

Following the introduction of ParallelAccelerator technology into Numba in mid-2017, steady improvments were made throughout 2018, including the following...

The thread pool implementation used by Numba for automatic multithreading is configurable to use Intel's Thread Building Blocks (TBB), OpenMP, or the old "workqueue" implementation. (TBB is likely to become the preferred default in a future release.) documentation is here. For the purposes of this example, TBB will be used:

In [ ]:

```
numba.config.THREADING_LAYER = 'tbb'
```

Now define a demonstration kernel:

In [ ]:

```
@njit(parallel=True)
def parallel_demo(x):
n = x.shape[0]
a = np.sin(x)
b = np.cos(a * a)
acc = 0
for i in prange(n - 2):
for j in prange(n - 1):
acc += b[i] + b[j + 1]
return acc
parallel_demo(np.arange(10))
```

The threading layer used for execution can easily be inspected as follows...

In [ ]:

```
numba.threading_layer()
```

Documentation on ensuring thread and fork-safety whilst using Numba is available here. It's now both possible and safe to fork, spawn, use threads, and use Numba's compiler and threading backend simulataneously...

In [ ]:

```
import multiprocessing as mp
from threading import Thread, get_ident
nthreads = 4
# this function is going to be compiled
def function(n):
x = np.dot( 3 + 5j * np.ones((n, n)), 2 + 1j * np.ones((n, n)))
return np.linalg.norm(x + np.arange(n) + n)
# this work is done by each thread, concurrent compilation and execution
def thread_work(results, n):
compiled_function = njit(parallel=True, nogil=True)(function)
# pointless extra work to stress the backend
[compiled_function(n) for _ in range(10)]
# return one result key'd by thread id
results[get_ident()] = compiled_function(n)
# this work is done by each process
def process_work(n):
# spin up some threads to do the thread_work
results = dict()
tpool = [Thread(target=thread_work, args=(results, n)) for _ in range(nthreads)]
[t.start() for t in tpool]
[t.join() for t in tpool]
# reduce the per thread results
acc = 0
for v in results.values():
acc += v
return acc
# This starts a process pool of 4 processes and maps the process work to it.
# Each process then starts 4 threads.
# Each thread compile a function for use in the parallel backend and then repeatedly runs it
# before returning a single result.
# The process then reduces the results all of its threads and returns that result.
p = mp.Pool(4)
print(p.map(process_work, [100, 200, 300, 400]))
```

Parallel support for `np.arange`

and `np.linspace`

, also `np.mean`

, `np.std`

and `np.var`

have been added, for example:

In [ ]:

```
@njit(parallel=True)
def parallel_new_function_support(n):
a = np.arange(0, n, 1, np.float64)
b = np.linspace(0, 1, n)
m = np.mean(a)
s = np.std(b)
v = np.var(a + b)
return m + s + v
parallel_new_function_support(5)
```

Parallel loops now allow arrays as reduction variables, for example:

In [ ]:

```
@njit(parallel=True)
def parallel_array_reduction(n):
y = np.arange(n)
x = np.ones_like(y)
for i in prange(10):
y += x
return y
parallel_array_reduction(5)
```

Having got ParallelAccelerator working well and achieving great performance, the question asked most often by users was **"what did it do?!"**. Numba 0.41.0 added Parallel Diagnostics functionality to address this, calling the `parallel_diagnostics`

member function on a function compiled with `parallel=True`

set shows the optimizations done by ParallelAccelerator. The documentation for this feature is here and includes a guide for interpreting the output given. For example, obtaining the Parallel Diagnostics with verbosity level `4`

on the above function.

In [ ]:

```
parallel_demo.parallel_diagnostics(level=4)
```

Support was added for profiling Numba-compiled functions in Intel VTune, simply set the `NUMBA_ENABLE_PROFILING`

environment variable and ask Vtune to profile the execution.

Further, as a result of community feedback, the Numba dispatcher `inspect_types()`

method now supports the kwarg `pretty`

which if set to `True`

will produce ANSI/HTML output, showing the annotated types, when invoked from ipython/jupyter-notebook respectively. Green highlighting shows compiled loops, and yellow interpreted code. For example:

In [ ]:

```
class clazz(object):
def __init__(self, arr):
self._arr = arr
@property
def arr(self):
return self._arr
@arr.setter
def arr(self, value):
self._arr = value
@jit # use of class, `c`, prevents `nopython` mode compilation.
def foo(a, c):
c.arr += 12
for i in range(len(a)):
a[i] = np.sqrt(i) + 7
if i % 8 + 1 > 4:
a[i] -= np.pi
return a
A = np.arange(100.)
class_inst = clazz(11)
foo(A, class_inst)
foo.inspect_types(pretty=True)
```

Support for the `ppc64le`

, `aarch64`

(64bit ARMv8) and `armv7l`

(ARMv7 little endian. Yes Numba now works on a Raspberry Pi!) architectures have been added.

Further considerable improvements in vectorization were made available as Numba now supports Intel's short vector math library (SVML). Try it out with `conda install -c numba icc_rt`

. This primarily permits the vectorization of fundamental math functions and has variants with lesser precision for use in `fastmath`

mode.

In [ ]:

```
def demo_svml(n):
ret = np.arange(n)
return np.sqrt(np.cos(ret) ** 2 + np.cos(ret) ** 2)
demo_svml_njit = njit(demo_svml)
demo_svml_fastmath = njit(fastmath=True)(demo_svml)
count = 10000
%timeit -o -r 10 -n 1000 demo_svml(count)
%timeit -o -r 10 -n 1000 demo_svml_njit(count)
%timeit -o -r 10 -n 1000 demo_svml_fastmath(count);
```

2018 saw the addition of a new GPU backend, AMD's ROCm (Radeon Open Compute). Kernels for ROCm supported AMD GPUs can now be compiled using the ROCm driver on Linux. Documentation is here with information about prerequistites here. The kernel launch syntax and programming model is very similar to that found in Numba's CUDA support, and ufuncs work exactly the same way. For example

In [ ]:

```
from numba import roc
from numba import cuda
from numba import vectorize, float64
# this is to handle thread model indexing differences between CUDA and ROCm
global_id_func = {roc: lambda : roc.get_global_id(0),
cuda: lambda : cuda.threadIdx.x + cuda.blockIdx.x * cuda.blockDim.x}
# Same code, different GPU backends !
for gpu_target in ['roc', 'cuda']:
backend = getattr(numba, gpu_target)
if backend.is_available():
@backend.jit(device=True)
def inner(a, b):
return a + b
index = backend.jit(device=True)(global_id_func[backend])
@backend.jit
def outer(A, B):
i = index()
if i < A.size:
A[i] = inner(A[i], B[i])
A = np.arange(10)
Aorig = A.copy()
B = np.arange(10)
outer.forall(A.size)(A, B)
assert not np.all(Aorig == A)
np.testing.assert_equal(Aorig + B, A)
sig = [float64(float64, float64)]
@vectorize(sig, target=gpu_target)
def vector_add(a, b):
return a + b
a = np.arange(100.)
b = np.arange(100.) * 3.14
c = vector_add(a, b)
np.testing.assert_equal(c, a + b)
```