numba
"We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%. A good programmer will not be lulled into complacency by such reasoning, he will be wise to look carefully at the critical code; but only after that code has been identified."

Donald Knuth
How can numba help?

If you have numpy arrays with your data (remember, pandas uses numpy under the covers), numba makes it easy to write simple, fast functions that work with that data.

numba is an open-source JIT (Just-In-Time) compiler for Python functions. From the types of the arguments, numba can often generate a specialized, fast, machine-code implementation at runtime, and it is designed to work well with numpy arrays.
numba features

numba supports numpy arrays and a large subset of numpy operations. Installation is easy, and conda is no longer required: pip install numba (wheels for Windows/Linux/OSX are available, no need to compile anything).
How numba works

numba modes of compilation

numba has two modes of compilation, nopython mode and object mode (described in more detail below). In nopython mode, numba supports a substantial subset of Python and numpy, including the following (a small sketch follows the list):

- control flow: if, else, while, for, range
- numpy arrays, int, float, complex, booleans, and tuples
- numpy dtypes: int, float, complex, datetime64, timedelta64
- numpy array methods such as sum, prod, max, min, etc.
- calls to other nopython mode compiled functions
- calls to ctypes or cffi wrapped external functions
- built-in types: int, bool, float, complex, bytes
- standard library modules: array (limited support), cmath, collections, ctypes, enum, math, operator, functools, random
- cffi: similarly to ctypes, numba is able to call into cffi-declared external functions
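To make this concrete, here is a minimal sketch (not from the original) of a function that stays entirely within the nopython-mode subset: plain control flow, scalar math, numpy indexing, and a tuple return value.

import numpy as np
from numba import jit

@jit(nopython=True)
def min_max_norm(a):
    # find the minimum and maximum with plain Python control flow
    lo, hi = a[0], a[0]
    for i in range(a.shape[0]):
        if a[i] < lo:
            lo = a[i]
        if a[i] > hi:
            hi = a[i]
    # rescale into [0, 1] using direct element access
    out = np.empty_like(a)
    for i in range(a.shape[0]):
        out[i] = (a[i] - lo) / (hi - lo)
    return out, (lo, hi)  # tuples are supported as return values

normed, bounds = min_max_norm(np.random.rand(10))
print(bounds)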
numpy features

numba integrates seamlessly with numpy, whose arrays provide an efficient storage method for homogeneous sets of data, and whose dtypes provide type information useful when compiling.

- numba understands calls to numpy ufuncs and is able to generate equivalent native code for many of them.
- numpy arrays are directly supported in numba. Access to numpy arrays is very efficient, as indexing is lowered to direct memory accesses when possible.
- numba is able to generate ufuncs and gufuncs. This means that it is possible to implement ufuncs and gufuncs within Python, getting speeds comparable to that of ufuncs/gufuncs implemented in C extension modules using the numpy C API.
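For example, a jitted function can mix explicit loops over array elements with calls to numpy ufuncs; a small sketch (assuming a 1-D float array as input):

import numpy as np
from numba import jit

@jit(nopython=True)
def rms(a):
    # indexing is lowered to direct memory accesses
    acc = 0.0
    for i in range(a.shape[0]):
        acc += a[i] * a[i]
    # np.sqrt is a numpy ufunc that numba lowers to native code
    return np.sqrt(acc / a.shape[0])

print(rms(np.linspace(0.0, 1.0, 11)))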
numpy arrays and dtype objects

numpy's main object is the homogeneous multidimensional array. It is a table of elements, all of the same type, indexed by a tuple of positive integers. In numpy, dimensions are called axes, and the number of axes is the rank.

A data type object (an instance of the numpy.dtype class) describes how the bytes in the fixed-size block of memory corresponding to an array item should be interpreted. It describes the type of the data (integer, float, Python object, etc.), the size of the data (how many bytes are in e.g. the integer), the byte order, and other parameters that further describe the data if it is e.g. a sub-array or an aggregate of other data types.

>>> import numpy as np
>>> a = np.arange(15).reshape(3, 5)
>>> a
array([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14]])
>>> a.shape
(3, 5)
>>> a.ndim
2
>>> a.dtype.name
'int64'
>>> np.ones((2, 3, 4), dtype=np.int16)  # dtype can also be specified
array([[[1, 1, 1, 1],
        [1, 1, 1, 1],
        [1, 1, 1, 1]],

       [[1, 1, 1, 1],
        [1, 1, 1, 1],
        [1, 1, 1, 1]]], dtype=int16)
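To make the dtype description concrete, here is a short inspection session (the field names pos and id are made up for illustration):

>>> dt = np.dtype(np.int16)
>>> dt.itemsize, dt.name
(2, 'int16')
>>> rec = np.dtype([('pos', np.float64, (2,)), ('id', np.int32)])  # a structured dtype
>>> rec.itemsize
20
>>> rec.names
('pos', 'id')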
Compiling Python code with @jit

In its basic, lazy compilation mode, numba will infer the argument types at call time, and generate optimized code based on this information. numba will also be able to compile separate specializations depending on the input types.

from numba import jit

# lazy compilation: the signature is inferred at the first call
@jit
def f(x, y):
    return x + y
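Calling the lazily compiled function with different argument types triggers separate specializations, which the dispatcher records; a quick check (the signatures attribute is part of numba's dispatcher API):

f(1, 2)      # first call with integers compiles an integer specialization
f(1.0, 2.0)  # floats trigger a separate float specialization
print(f.signatures)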
You can also tell numba up front which signature you are expecting, in which case the function is compiled eagerly. Here, int32(int32, int32) is the function's signature. In this case, the corresponding specialization will be compiled by the @jit decorator, and no other specialization will be allowed. This is useful if you want fine-grained control over the types chosen by the compiler (for example, to use single-precision floats).

from numba import jit, int32

# eager compilation: the signature is fixed up front
@jit(int32(int32, int32))
def f(x, y):
    return x + y
The @jit decorator accepts several keyword options (a combined sketch follows the list):

- nopython: numba has two compilation modes, nopython mode and object mode. The former produces much faster code, but has limitations that can force numba to fall back to the latter. To prevent numba from falling back, and instead raise an error, pass nopython=True.
- nogil: whenever numba optimizes Python code to native code that only works on native types and variables (rather than Python objects), it is no longer necessary to hold Python's global interpreter lock (GIL). numba will release the GIL when entering such a compiled function if you pass nogil=True.
- parallel: enables an experimental feature that automatically parallelizes (and performs other optimizations for) those operations in the function known to have parallel semantics. This feature is enabled by passing parallel=True and must be used in conjunction with nopython=True.
- cache: to avoid compilation time each time you invoke a Python program, you can instruct numba to write the result of function compilation into a file-based cache. This is done by passing cache=True.
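Here is a minimal sketch (not from the original) combining these options; prange, imported from numba, marks a loop as having parallel semantics:

import numpy as np
from numba import jit, prange

@jit(nopython=True, nogil=True, cache=True)
def dot(a, b):
    # nopython code, GIL released while running, compilation result cached on disk
    s = 0.0
    for i in range(a.shape[0]):
        s += a[i] * b[i]
    return s

@jit(nopython=True, parallel=True)
def scaled_sum(a):
    # prange lets numba parallelize the loop (including this reduction)
    s = 0.0
    for i in prange(a.shape[0]):
        s += 2.0 * a[i]
    return s

a = np.random.rand(1_000_000)
print(dot(a, a), scaled_sum(a))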
Creating numpy universal functions with @vectorize

numba's vectorize allows Python functions taking scalar input arguments to be used as numpy ufuncs. Creating a traditional numpy ufunc is not the most straightforward process and involves writing some C code; numba makes this easy. Using @vectorize(), numba can compile a pure Python function into a ufunc that operates over numpy arrays as fast as traditional ufuncs written in C.

With @vectorize, you write your function as operating over input scalars, rather than arrays. numba will generate the surrounding loop (or kernel) allowing efficient iteration over the actual inputs.

from numba import vectorize, float64
import numpy as np
@vectorize([float64(float64, float64)])
def f(x, y):
return x + y
a = np.arange(6)
print(f(a, a))

a = np.linspace(0, 1, 6)
print(f(a, a))
[ 0.  2.  4.  6.  8. 10.]
[0.  0.4 0.8 1.2 1.6 2. ]
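Because f is a genuine numpy ufunc, the usual ufunc machinery (broadcasting, reduce, accumulate) comes for free; continuing the example above:

print(f.reduce(a))       # reduction with the ufunc, here equivalent to a.sum()
print(f(a, a[:, None]))  # broadcasting a (6,) against a (6, 1) array gives a (6, 6) result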
@jitclass

numba supports code generation for classes via the @jitclass decorator. A class can be marked for optimization using this decorator along with a specification of the types of each field. We call the resulting class object a jitclass.

All methods of a jitclass are compiled into nopython functions. The data of a jitclass instance is allocated on the heap as a C-compatible structure, so that any compiled functions can have direct access to the underlying data, bypassing the interpreter.

import numpy as np
from numba import jitclass # import the decorator
from numba import int32, float32 # import the types
spec = [
('value', int32), # a simple scalar field
('array', float32[:]), # an array field
]
@jitclass(spec)
class Bag(object):
    def __init__(self, value):
        self.value = value
        self.array = np.zeros(value, dtype=np.float32)

    @property
    def size(self):
        return self.array.size

    def increment(self, val):
        for i in range(self.size):
            self.array[i] = val
        return self.array
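A jitclass instance can then be used from regular Python code (or from other nopython functions); a quick check:

mybag = Bag(21)
print(mybag.value)             # 21
print(mybag.size)              # 21
print(mybag.increment(3)[:3])  # [3. 3. 3.]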
@generated_jit: sometimes you want to write a function that has different implementations depending on its input types. The @generated_jit decorator allows the user to control the selection of a specialization at compile time, while retaining the runtime execution speed of a JIT function.
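For instance, here is (essentially) the example from the numba documentation: a type-aware "is missing" check whose implementation is selected from the argument's numba type:

import numpy as np
from numba import generated_jit, types

@generated_jit(nopython=True)
def is_missing(x):
    if isinstance(x, types.Float):
        return lambda x: np.isnan(x)   # floats: NaN means missing
    elif isinstance(x, (types.NPDatetime, types.NPTimedelta)):
        missing = x('NaT')
        return lambda x: x == missing  # datetimes/timedeltas: NaT means missing
    else:
        return lambda x: False         # other types are never missing

print(is_missing(np.nan), is_missing(3.0))  # True False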
@cfunc: interfacing with some native libraries (for example, libraries written in C or C++) can necessitate writing native callbacks to provide business logic to the library. The @cfunc decorator creates a compiled function callable from foreign C code, using the signature of your choice.
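A minimal example, adapted from the numba documentation; the compiled callback is also callable from Python through its ctypes attribute:

from numba import cfunc

@cfunc("float64(float64, float64)")
def add(x, y):
    return x + y

print(add.ctypes(4.0, 5.0))  # 9.0; add.address holds the raw function pointer for C code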
@jit(parallel=True): setting the parallel option for @jit enables an experimental numba feature that attempts to automatically parallelize and perform other optimizations on (part of) a function (see the prange sketch above).

CUDA: numba's CUDA support exposes facilities to declare and manage the CUDA hierarchy of threads. The facilities are largely similar to those exposed by NVIDIA's CUDA C language.
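A tiny kernel sketch, adapted from the numba documentation (it requires a CUDA-capable GPU and the CUDA toolkit to run):

import numpy as np
from numba import cuda

@cuda.jit
def increment_by_one(an_array):
    pos = cuda.grid(1)       # absolute index of this thread within the grid
    if pos < an_array.size:  # the grid is usually larger than the data
        an_array[pos] += 1

arr = np.zeros(1024, dtype=np.float32)
threadsperblock = 32
blockspergrid = (arr.size + threadsperblock - 1) // threadsperblock
increment_by_one[blockspergrid, threadsperblock](arr)
print(arr[:4])  # [1. 1. 1. 1.]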
Let's see what kind of speed-ups numba buys us on a few simple functions, starting with a simple loop:

from numba import jit
def psum(x):
res = 0
for i in range(x):
res += i ** 2 + 1 + i
return res
nsum = jit(psum)
t1 = %timeit -c -o psum(1000)
%timeit -c psum(100000)
t2 = %timeit -c -o nsum(1000)
%timeit -c nsum(100000000)
print("Speed-up factor: {:.1f}x".format(int(t1.average / t2.average)))
457 µs ± 19.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
46.4 ms ± 1.21 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
168 ns ± 5.99 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
165 ns ± 0.0791 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
Speed-up factor: 2715.0x

Note that the compiled function takes essentially the same time for n=1000 and n=100000000: LLVM recognizes that this particular loop has a closed-form result and optimizes it away, so the speed-up factor here is flattering rather than representative.
from numba import jit
def fib(n):
a, b = 0, 1
for i in range(n):
a, b = b, a + b
return a
nfib = jit(fib, nopython=True)
t1 = %timeit -c -o fib(1000)
t2 = %timeit -c -o nfib(1000)
print("Speed-up factor: {:.1f}x".format(t1.average / t2.average))
71.6 µs ± 1.85 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
918 ns ± 11.8 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
Speed-up factor: 78.0x
For comparison, here is the same function compiled with Cython:

%load_ext Cython
%%cython
def cfib(int n):
cdef int i, a, b
a, b = 0, 1
for i in range(n):
a, b = b, a + b
return a
t3 = %timeit -c -o cfib(1000)
print("Speed-up factor (cython vs numba): {:.1f}x".format(t2.average / t3.average))
645 ns ± 1.52 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
Speed-up factor (cython vs numba): 1.4x
A more substantial example: rendering the Mandelbrot set, first in pure Python, then compiled with numba.

from timeit import default_timer as timer
from matplotlib.pylab import imshow, jet, show, ion
import numpy as np
from numba import jit
def mandel(x, y, max_iters):
"""
Given the real and imaginary parts of a complex number,
determine if it is a candidate for membership in the Mandelbrot
set given a fixed number of iterations.
"""
i = 0
c = complex(x,y)
z = 0.0j
for i in range(max_iters):
z = z * z + c
if (z.real * z.real + z.imag * z.imag) >= 4:
return i
return 255
def create_fractal(min_x, max_x, min_y, max_y, image, iters):
height = image.shape[0]
width = image.shape[1]
pixel_size_x = (max_x - min_x) / width
pixel_size_y = (max_y - min_y) / height
for x in range(width):
real = min_x + x * pixel_size_x
for y in range(height):
imag = min_y + y * pixel_size_y
color = mandel(real, imag, iters)
image[y, x] = color
return image
image = np.zeros((500 * 2, 750 * 2), dtype=np.uint8)
t1 = %timeit -c -o -r 3 create_fractal(-2.0, 1.0, -1.0, 1.0, image, 20) # without numba
image = np.zeros((500 * 2, 750 * 2), dtype=np.uint8)
create_fractal, mandel = jit(create_fractal), jit(mandel)
t2 = %timeit -c -o create_fractal(-2.0, 1.0, -1.0, 1.0, image, 20) # with numba
print("Speed-up factor: {:.1f}x".format(t1.average / t2.average))
imshow(image)
show()
5.82 s ± 88.8 ms per loop (mean ± std. dev. of 3 runs, 1 loop each)
47.9 ms ± 403 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
Speed-up factor: 121.5x
Finally, a small Monte Carlo case study: particles perform random walks on a lattice and are removed when they hit a trap at the center. First, the pure Python version:

import numpy as np
RHO_0 = 0.25 # initial particle density
MCS = 10000 # Monte Carlo steps
RUNS = 10
DIM_0, DIM_1 = 81, 81
def init_lattice(rho=RHO_0, dim0=DIM_0, dim1=DIM_1):
"""Place particles on lattice with density rho."""
lat = np.zeros(shape=(dim0, dim1), dtype=np.int32)
for i in range(lat.shape[0]):
for j in range(lat.shape[1]):
if np.random.random() <= rho:
lat[i, j] = 1
return lat
def execute_mcs(lat):
"""Move each particle on the lattice once in a random direction."""
lat_t0 = np.copy(lat) # lattice at time t0
for i in range(lat_t0.shape[0]):
for j in range(lat_t0.shape[1]):
if lat_t0[i, j] == 1: # we found a particle
r = np.random.randint(4)
i1, j1 = i, j
# move particle up/down/left/right
if r == 0: i1 += 1
if r == 1: i1 -= 1
if r == 2: j1 += 1
if r == 3: j1 -= 1
# boundary conditions
if i1 < 0: i1 = 0
if i1 > lat.shape[0] - 1: i1 = lat.shape[0] - 1
if j1 < 0: j1 = 0
if j1 > lat.shape[1] - 1: j1 = lat.shape[1] - 1
# check trap
if i1 == lat.shape[0] // 2 and j1 == lat.shape[1] // 2:
# we hit the center (trap), remove the particle
lat[i, j] = 0
elif not lat[i1, j1]:
# new position is empty, move particle
lat[i, j] = 0
lat[i1, j1] = 1
def calc_rho(lat):
return np.sum(lat) / lat.size
def execute_simulation(runs=RUNS, mcs=MCS, rho_0=RHO_0):
rho_t = np.zeros(mcs, dtype=np.float64)
for r in range(runs):
lat = init_lattice()
for m in range(mcs):
rho_t[m] += calc_rho(lat)
execute_mcs(lat)
for i in range(rho_t.shape[0]):
rho_t[i] /= (runs * rho_0)
return rho_t
t1 = %timeit -c -o execute_simulation(runs=1, mcs=1000)
7.37 s ± 109 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Now the same code, with each function compiled with @jit(nopython=True):

from numba import jit
import numpy as np
import matplotlib.pyplot as plt
RHO_0 = 0.25 # initial particle density
MCS = 10000 # Monte Carlo steps
RUNS = 10
DIM_0, DIM_1 = 81, 81
@jit(nopython=True)
def init_lattice(rho=RHO_0, dim0=DIM_0, dim1=DIM_1):
"""Place particles on lattice with density rho."""
lat = np.zeros(shape=(dim0, dim1), dtype=np.int32)
for i in range(lat.shape[0]):
for j in range(lat.shape[1]):
if np.random.random() <= rho:
lat[i, j] = 1
return lat
@jit(nopython=True)
def execute_mcs(lat):
"""Move each particle on the lattice once in a random direction."""
lat_t0 = np.copy(lat) # lattice at time t0
for i in range(lat_t0.shape[0]):
for j in range(lat_t0.shape[1]):
if lat_t0[i, j] == 1: # we found a particle
r = np.random.randint(4)
i1, j1 = i, j
# move particle up/down/left/right
if r == 0: i1 += 1
if r == 1: i1 -= 1
if r == 2: j1 += 1
if r == 3: j1 -= 1
# boundary conditions
if i1 < 0: i1 = 0
if i1 > lat.shape[0] - 1: i1 = lat.shape[0] - 1
if j1 < 0: j1 = 0
if j1 > lat.shape[1] - 1: j1 = lat.shape[1] - 1
# check trap
if i1 == lat.shape[0] // 2 and j1 == lat.shape[1] // 2:
# we hit the center (trap), remove the particle
lat[i, j] = 0
elif not lat[i1, j1]:
# new position is empty, move particle
lat[i, j] = 0
lat[i1, j1] = 1
@jit(nopython=True)
def calc_rho(lat):
return np.sum(lat) / lat.size
@jit(nopython=True)
def execute_simulation(runs=RUNS, mcs=MCS, rho_0=RHO_0):
rho_t = np.zeros(mcs, dtype=np.float32)
for r in range(runs):
lat = init_lattice()
for m in range(mcs):
rho_t[m] += calc_rho(lat)
execute_mcs(lat)
for i in range(rho_t.shape[0]):
rho_t[i] /= (runs * rho_0)
return rho_t
t2 = %timeit -c -o execute_simulation(runs=1, mcs=1000)
print("Speed-up factor: {:.1f}x".format(t1.average / t2.average))
rho_t = execute_simulation()
plt.plot(rho_t)
plt.yscale('log')
plt.show()
59.4 ms ± 1.43 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Speed-up factor: 124.1x