Notebook

GTC 2019 Numba Tutorial Notebook 2: CuPy and Numba on the GPU¶

In the previous notebook, we saw how NumPy and Numba could be used for array math on the CPU. Array operations are very amenable to execution on a massively parallel GPU. We will not go into the CUDA programming model too much in this tutorial, but the most important thing to remember is that the GPU hardware is designed for data parallelism. Maximum throughput is achieved when you are computing the same operations on many different elements at once.

What is CuPy?¶

Simply put: CuPy is NumPy, but for the GPU. Preferred Networks created CuPy as the GPU backend for their deep learning library, Chainer, but it also works great as a standalone NumPy-like GPU array library. If you know NumPy, CuPy is a very easy way to get started on the GPU.

Just like NumPy, CuPy offers 3 basic things:

A multidimensional array object, but stored in GPU memory.
A ufunc system that follows broadcast rules, but executes in parallel on the GPU.
A large library of array functions already implemented with CUDA.

In [ ]:

import numpy as np
import cupy as cp

CuPy arrays look just like NumPy arrays:

In [ ]:

ary = cp.arange(10).reshape((2,5))
print(repr(ary))
print(ary.dtype)
print(ary.shape)
print(ary.strides)

This array is in the GPU memory of the default GPU (device 0). We can see this by inspecting the special device attribute:

In [ ]:

ary.device

We can move data from the CPU to the GPU using the cp.asarray() function:

In [ ]:

ary_cpu = np.arange(10)
ary_gpu = cp.asarray(ary_cpu)
print('cpu:', ary_cpu)
print('gpu:', ary_gpu)
print(ary_gpu.device)

Note that when we print the contents of a GPU array, CuPy is copying the data from the GPU back to the CPU so it can print the results.

If we are done with the data on the GPU, we can convert it back to a NumPy array on the CPU with the cp.asnumpy() function:

In [ ]:

ary_cpu_returned = cp.asnumpy(ary_gpu)
print(repr(ary_cpu_returned))
print(type(ary_cpu_returned))

GPU Array Math¶

Most of the NumPy methods are supported in CuPy with identical function names and arguments:

In [ ]:

print(ary_gpu * 2)
print(cp.exp(-0.5 * ary_gpu**2))
print(cp.linalg.norm(ary_gpu))
print(cp.random.normal(loc=5, scale=2.0, size=10))

You may notice a slight pause when you run these functions the first time. This is because CuPy has to compile the CUDA functions on the fly, and then cache them to disk for reuse in the future.

That's pretty much it! CuPy is very easy to use and has excellent documentation, which you should become familiar with.

Before we get into GPU performance measurement, let's switch gears back to Numba.

When would I want Numba on the GPU?¶

Similar to NumPy, Numba can be useful to use with CuPy when you want to:

Combine several operations together for greater efficiency.
Implement custom algorithms that are not easily described by combining CuPy functions.

Numba's compiler pipeline for transforming Python functions to machine code can be used to generate CUDA functions which can be used standalone or with CuPy. There are two basic approaches supported by Numba:

ufuncs/gufuncs (subject of the rest of this notebook)
CUDA Python kernels (subject of next notebook)

Making new ufuncs for the GPU¶

Numba has the ability to create compiled ufuncs. You implement a scalar function of all the inputs, and Numba will figure out the broadcast rules for you. Generating a ufunc that uses CUDA requires giving an explicit type signature and setting the target attribute:

In [ ]:

from numba import vectorize

@vectorize(['int64(int64, int64)'], target='cuda')
def add_ufunc(x, y):
    return x + y

In [ ]:

a = np.array([1, 2, 3, 4])
b = np.array([10, 20, 30, 40])
b_col = b[:, np.newaxis] # b as column array
c = np.arange(4*4).reshape((4,4))

print('a+b:\n', add_ufunc(a, b))
print()
print('b_col + c:\n', add_ufunc(b_col, c))

A lot of things just happened! Numba automatically:

Compiled a CUDA kernel to execute the ufunc operation in parallel over all the input elements.
Allocated GPU memory for the inputs and the output.
Copied the input data to the GPU.
Executed the CUDA kernel with the correct kernel dimensions given the input sizes.
Copied the result back from the GPU to the CPU.
Returned the result as a NumPy array on the host.

This is very convenient for testing, but copying data back and forth between the CPU and GPU can be slow and hurt performance. In the next tutorial notebook, you'll learn about device management, memory allocation, and using CuPy arrays with Numba.

You might be wondering how fast our simple example is on the GPU? Let's see:

In [ ]:

%timeit np.add(b_col, c)   # NumPy on CPU

In [ ]:

%timeit add_ufunc(b_col, c) # Numba on GPU

Wow, the GPU is a lot slower than the CPU. What happened??

This is to be expected because we have (deliberately) misused the GPU in several ways in this example:

Our inputs are too small: the GPU achieves performance through parallelism, operating on thousands of values at once. Our test inputs have only 4 and 16 integers, respectively. We need a much larger array to even keep the GPU busy.
Our calculation is too simple: Sending a calculation to the GPU involves quite a bit of overhead compared to calling a function on the CPU. If our calculation does not involve enough math operations (often called "arithmetic intensity"), then the GPU will spend most of its time waiting for data to move around.
We copy the data to and from the GPU: While including the copy time can be realistic for a single function, often we want to run several GPU operations in sequence. In those cases, it makes sense to send data to the GPU and keep it there until all of our processing is complete.
Our data types are larger than necessary: Our example uses int64 when we probably don't need it. Scalar code using data types that are 32 and 64-bit run basically the same speed on the CPU, but 64-bit data types have a significant performance cost on the GPU. Basic arithmetic on 64-bit floats can be anywhere from 2x (Pascal-architecture Tesla) to 24x (Maxwell-architecture GeForce) slower than 32-bit floats. NumPy defaults to 64-bit data types when creating arrays, so it is important to set the dtype attribute or use the ndarray.astype() method to pick 32-bit types when you need them.

Given the above, let's try an example that is faster on the GPU:

In [ ]:

import math  # Note that for the CUDA target, we need to use the scalar functions from the math module, not NumPy

SQRT_2PI = np.float32((2*math.pi)**0.5)  # Precompute this constant as a float32.  Numba will inline it at compile time.

@vectorize(['float32(float32, float32, float32)'], target='cuda')
def gaussian_pdf(x, mean, sigma):
    '''Compute the value of a Gaussian probability density function at x with given mean and sigma.'''
    return math.exp(-0.5 * ((x - mean) / sigma)**2) / (sigma * SQRT_2PI)

In [ ]:

# Evaluate the Gaussian distribution PDF a million times!
x = np.random.uniform(-3, 3, size=1000000).astype(np.float32)
mean = np.float32(0.0)
sigma = np.float32(1.0)

# Quick test
gaussian_pdf(x[0], 0.0, 1.0)

In [ ]:

import scipy.stats # for definition of gaussian distribution
norm_pdf = scipy.stats.norm
%timeit norm_pdf.pdf(x, loc=mean, scale=sigma)

In [ ]:

%timeit gaussian_pdf(x, mean, sigma)

That's a pretty large improvement, even including the overhead of copying all the data to and from the GPU. Ufuncs that use special functions (exp, sin, cos, etc) on large float32 data sets run especially well on the GPU.

CUDA Device Functions¶

Ufuncs are great, but you should not have to cram all of your logic into a single function body. You can also create normal functions that are only called from other functions running on the GPU. (These are similar to CUDA C functions defined with __device__.)

Device functions are created with the numba.cuda.jit decorator:

In [ ]:

from numba import cuda

@cuda.jit(device=True)
def polar_to_cartesian(rho, theta):
    x = rho * math.cos(theta)
    y = rho * math.sin(theta)
    return x, y  # This is Python, so let's return a tuple

@vectorize(['float32(float32, float32, float32, float32)'], target='cuda')
def polar_distance(rho1, theta1, rho2, theta2):
    x1, y1 = polar_to_cartesian(rho1, theta1)
    x2, y2 = polar_to_cartesian(rho2, theta2)
    
    return ((x1 - x2)**2 + (y1 - y2)**2)**0.5

In [ ]:

n = 1000000
rho1 = np.random.uniform(0.5, 1.5, size=n).astype(np.float32)
theta1 = np.random.uniform(-np.pi, np.pi, size=n).astype(np.float32)
rho2 = np.random.uniform(0.5, 1.5, size=n).astype(np.float32)
theta2 = np.random.uniform(-np.pi, np.pi, size=n).astype(np.float32)

In [ ]:

polar_distance(rho1, theta1, rho2, theta2)

Note that the CUDA compiler aggressively inlines device functions, so there is generally no overhead for function calls. Similarly, the "tuple" returned by polar_to_cartesian is not actually created as a Python object, but represented temporarily as a struct, which is then optimized away by the compiler.

We can compare this to doing the same thing on the CPU, still using Numba:

In [ ]:

import numba

@numba.jit
def polar_to_cartesian_cpu(rho, theta):
    x = rho * math.cos(theta)
    y = rho * math.sin(theta)
    return x, y  # This is Python, so let's return a tuple

@vectorize(['float32(float32, float32, float32, float32)'])  # default target is CPU
def polar_distance_cpu(rho1, theta1, rho2, theta2):
    x1, y1 = polar_to_cartesian_cpu(rho1, theta1)
    x2, y2 = polar_to_cartesian_cpu(rho2, theta2)
    
    return ((x1 - x2)**2 + (y1 - y2)**2)**0.5

np.testing.assert_allclose(polar_distance(rho1, theta1, rho2, theta2),
                           polar_distance_cpu(rho1, theta1, rho2, theta2),
                           rtol=1e-7, atol=5e-7)

In [ ]:

%timeit polar_distance_cpu(rho1, theta1, rho2, theta2)
%timeit polar_distance(rho1, theta1, rho2, theta2)

Not a bad speedup, and we're still doing quite a few GPU memory allocations and data copies.

Allowed Python on the GPU¶

Compared to Numba on the CPU (which is already limited), Numba on the GPU has more limitations. Supported Python includes:

if/elif/else
while and for loops
Basic math operators
Selected functions from the math and cmath modules
Tuples

See the Numba manual for more details.

Exercise¶

Let's build a "zero suppression" function. A common operation when working with waveforms is to force all samples values below a certain absolute magnitude to be zero, as a way to eliminate low amplitude noise. Let's make some sample data:

In [ ]:

# Hacking up a noisy pulse train
%matplotlib inline
from matplotlib import pyplot as plt

n = 100000
noise = np.random.normal(size=n) * 3
pulses = np.maximum(np.sin(np.arange(n) / (n / 23)) - 0.3, 0.0)
waveform = ((pulses * 300) + noise).astype(np.int16)
plt.plot(waveform)

Now try filling in body of this ufunc:

In [ ]:

@vectorize(['int16(int16, int16)'], target='cuda')
def zero_suppress(waveform_value, threshold):
    ### Replace this implementation with yours
    result = waveform_value
    ###
    return result

In [ ]:

# the noise on the baseline should disappear when zero_suppress is implemented properly
plt.plot(zero_suppress(waveform, 15.0))

In [ ]: