In the previous notebook, we saw how NumPy and Numba could be used for array math on the CPU. Array operations are very amenable to execution on a massively parallel GPU. We will not go into the CUDA programming model too much in this tutorial, but the most important thing to remember is that the GPU hardware is designed for data parallelism. Maximum throughput is achieved when you are computing the same operations on many different elements at once.
Simply put: CuPy is NumPy, but for the GPU. Preferred Networks created CuPy as the GPU backend for their deep learning library, Chainer, but it also works great as a standalone NumPy-like GPU array library. If you know NumPy, CuPy is a very easy way to get started on the GPU.
Just like NumPy, CuPy offers 3 basic things:
import numpy as np
import cupy as cp
CuPy arrays look just like NumPy arrays:
ary = cp.arange(10).reshape((2,5))
print(repr(ary))
print(ary.dtype)
print(ary.shape)
print(ary.strides)
This array is in the GPU memory of the default GPU (device 0). We can see this by inspecting the special device
attribute:
ary.device
We can move data from the CPU to the GPU using the cp.asarray()
function:
ary_cpu = np.arange(10)
ary_gpu = cp.asarray(ary_cpu)
print('cpu:', ary_cpu)
print('gpu:', ary_gpu)
print(ary_gpu.device)
Note that when we print the contents of a GPU array, CuPy is copying the data from the GPU back to the CPU so it can print the results.
If we are done with the data on the GPU, we can convert it back to a NumPy array on the CPU with the cp.asnumpy()
function:
ary_cpu_returned = cp.asnumpy(ary_gpu)
print(repr(ary_cpu_returned))
print(type(ary_cpu_returned))
Most of the NumPy methods are supported in CuPy with identical function names and arguments:
print(ary_gpu * 2)
print(cp.exp(-0.5 * ary_gpu**2))
print(cp.linalg.norm(ary_gpu))
print(cp.random.normal(loc=5, scale=2.0, size=10))
You may notice a slight pause when you run these functions the first time. This is because CuPy has to compile the CUDA functions on the fly, and then cache them to disk for reuse in the future.
That's pretty much it! CuPy is very easy to use and has excellent documentation, which you should become familiar with.
Before we get into GPU performance measurement, let's switch gears back to Numba.
Similar to NumPy, Numba can be useful to use with CuPy when you want to:
Numba's compiler pipeline for transforming Python functions to machine code can be used to generate CUDA functions which can be used standalone or with CuPy. There are two basic approaches supported by Numba:
Numba has the ability to create compiled ufuncs. You implement a scalar function of all the inputs, and Numba will figure out the broadcast rules for you. Generating a ufunc that uses CUDA requires giving an explicit type signature and setting the target
attribute:
from numba import vectorize
@vectorize(['int64(int64, int64)'], target='cuda')
def add_ufunc(x, y):
return x + y
a = np.array([1, 2, 3, 4])
b = np.array([10, 20, 30, 40])
b_col = b[:, np.newaxis] # b as column array
c = np.arange(4*4).reshape((4,4))
print('a+b:\n', add_ufunc(a, b))
print()
print('b_col + c:\n', add_ufunc(b_col, c))
A lot of things just happened! Numba automatically:
This is very convenient for testing, but copying data back and forth between the CPU and GPU can be slow and hurt performance. In the next tutorial notebook, you'll learn about device management, memory allocation, and using CuPy arrays with Numba.
You might be wondering how fast our simple example is on the GPU? Let's see:
%timeit np.add(b_col, c) # NumPy on CPU
%timeit add_ufunc(b_col, c) # Numba on GPU
Wow, the GPU is a lot slower than the CPU. What happened??
This is to be expected because we have (deliberately) misused the GPU in several ways in this example:
int64
when we probably don't need it. Scalar code using data types that are 32 and 64-bit run basically the same speed on the CPU, but 64-bit data types have a significant performance cost on the GPU. Basic arithmetic on 64-bit floats can be anywhere from 2x (Pascal-architecture Tesla) to 24x (Maxwell-architecture GeForce) slower than 32-bit floats. NumPy defaults to 64-bit data types when creating arrays, so it is important to set the dtype
attribute or use the ndarray.astype()
method to pick 32-bit types when you need them.Given the above, let's try an example that is faster on the GPU:
import math # Note that for the CUDA target, we need to use the scalar functions from the math module, not NumPy
SQRT_2PI = np.float32((2*math.pi)**0.5) # Precompute this constant as a float32. Numba will inline it at compile time.
@vectorize(['float32(float32, float32, float32)'], target='cuda')
def gaussian_pdf(x, mean, sigma):
'''Compute the value of a Gaussian probability density function at x with given mean and sigma.'''
return math.exp(-0.5 * ((x - mean) / sigma)**2) / (sigma * SQRT_2PI)
# Evaluate the Gaussian distribution PDF a million times!
x = np.random.uniform(-3, 3, size=1000000).astype(np.float32)
mean = np.float32(0.0)
sigma = np.float32(1.0)
# Quick test
gaussian_pdf(x[0], 0.0, 1.0)
import scipy.stats # for definition of gaussian distribution
norm_pdf = scipy.stats.norm
%timeit norm_pdf.pdf(x, loc=mean, scale=sigma)
%timeit gaussian_pdf(x, mean, sigma)
That's a pretty large improvement, even including the overhead of copying all the data to and from the GPU. Ufuncs that use special functions (exp
, sin
, cos
, etc) on large float32
data sets run especially well on the GPU.
Ufuncs are great, but you should not have to cram all of your logic into a single function body. You can also create normal functions that are only called from other functions running on the GPU. (These are similar to CUDA C functions defined with __device__
.)
Device functions are created with the numba.cuda.jit
decorator:
from numba import cuda
@cuda.jit(device=True)
def polar_to_cartesian(rho, theta):
x = rho * math.cos(theta)
y = rho * math.sin(theta)
return x, y # This is Python, so let's return a tuple
@vectorize(['float32(float32, float32, float32, float32)'], target='cuda')
def polar_distance(rho1, theta1, rho2, theta2):
x1, y1 = polar_to_cartesian(rho1, theta1)
x2, y2 = polar_to_cartesian(rho2, theta2)
return ((x1 - x2)**2 + (y1 - y2)**2)**0.5
n = 1000000
rho1 = np.random.uniform(0.5, 1.5, size=n).astype(np.float32)
theta1 = np.random.uniform(-np.pi, np.pi, size=n).astype(np.float32)
rho2 = np.random.uniform(0.5, 1.5, size=n).astype(np.float32)
theta2 = np.random.uniform(-np.pi, np.pi, size=n).astype(np.float32)
polar_distance(rho1, theta1, rho2, theta2)
Note that the CUDA compiler aggressively inlines device functions, so there is generally no overhead for function calls. Similarly, the "tuple" returned by polar_to_cartesian
is not actually created as a Python object, but represented temporarily as a struct, which is then optimized away by the compiler.
We can compare this to doing the same thing on the CPU, still using Numba:
import numba
@numba.jit
def polar_to_cartesian_cpu(rho, theta):
x = rho * math.cos(theta)
y = rho * math.sin(theta)
return x, y # This is Python, so let's return a tuple
@vectorize(['float32(float32, float32, float32, float32)']) # default target is CPU
def polar_distance_cpu(rho1, theta1, rho2, theta2):
x1, y1 = polar_to_cartesian_cpu(rho1, theta1)
x2, y2 = polar_to_cartesian_cpu(rho2, theta2)
return ((x1 - x2)**2 + (y1 - y2)**2)**0.5
np.testing.assert_allclose(polar_distance(rho1, theta1, rho2, theta2),
polar_distance_cpu(rho1, theta1, rho2, theta2),
rtol=1e-7, atol=5e-7)
%timeit polar_distance_cpu(rho1, theta1, rho2, theta2)
%timeit polar_distance(rho1, theta1, rho2, theta2)
Not a bad speedup, and we're still doing quite a few GPU memory allocations and data copies.
Compared to Numba on the CPU (which is already limited), Numba on the GPU has more limitations. Supported Python includes:
if
/elif
/else
while
and for
loopsmath
and cmath
modulesSee the Numba manual for more details.
Let's build a "zero suppression" function. A common operation when working with waveforms is to force all samples values below a certain absolute magnitude to be zero, as a way to eliminate low amplitude noise. Let's make some sample data:
# Hacking up a noisy pulse train
%matplotlib inline
from matplotlib import pyplot as plt
n = 100000
noise = np.random.normal(size=n) * 3
pulses = np.maximum(np.sin(np.arange(n) / (n / 23)) - 0.3, 0.0)
waveform = ((pulses * 300) + noise).astype(np.int16)
plt.plot(waveform)
Now try filling in body of this ufunc:
@vectorize(['int16(int16, int16)'], target='cuda')
def zero_suppress(waveform_value, threshold):
### Replace this implementation with yours
result = waveform_value
###
return result
# the noise on the baseline should disappear when zero_suppress is implemented properly
plt.plot(zero_suppress(waveform, 15.0))