Introduction to Python GPU Programming with Numba and NumbaPro

In [3]:
# Misc. import
from __future__ import print_function
from IPython.display import Image 
%matplotlib inline

Numba

  • Open-source, BSD licensed
  • Basic CUDA GPU JIT compilation
  • OpenCL support coming
In [5]:
import numba
print("numba", numba.__version__)
numba 0.22.1

Accelerate

  • Commercial
  • Contains CUDA library bindings: cuRAND, cuBLAS, cuFFT, cuSparse
In [6]:
import accelerate
print("accelerate", accelerate.__version__)
accelerate 2.0.0

The CUDA GPU

  • A massively parallel processor (many cores)
    • runs hundreds to thousands of threads, and more
  • optimized for data throughput
    • Simple (shallow) cache hierarchy
    • Best with manual caching!
    • Cache memory is called shared memory and it is addressable
  • CPU is latency optimized
    • Deep cache hierarchy
    • L1, L2, L3 caches
  • GPU execution model is different
  • GPU forces you to think and program in parallel
In [8]:
# Get all the imports we need
import numba.cuda
import numpy as np
import math

my_gpu = numba.cuda.get_current_device()
print("Running on GPU:", my_gpu.name)
cores_per_capability = {
    1: 8,
    2: 32,
    3: 192,
    5: 128
}
cc = my_gpu.compute_capability
print("Compute capability: ", "%d.%d" % cc, "(Numba requires >= 2.0)")
majorcc = cc[0]
print("Number of streaming multiprocessor:", my_gpu.MULTIPROCESSOR_COUNT)
cores_per_multiprocessor = cores_per_capability[majorcc]
print("Number of cores per mutliprocessor:", cores_per_multiprocessor)
total_cores = cores_per_multiprocessor * my_gpu.MULTIPROCESSOR_COUNT
print("Number of cores on GPU:", total_cores)
Running on GPU: b'GeForce GTX 970'
Compute capability:  5.2 (Numba requires >= 2.0)
Number of streaming multiprocessors: 13
Number of cores per multiprocessor: 128
Number of cores on GPU: 1664

High-level Array-Oriented Style

  • Use NumPy array as a unit of computation
  • Use NumPy universal functions (ufuncs) as an abstraction of computation and scheduling
  • ufuncs are elementwise functions
  • If you use NumPy, you are using ufuncs
In [10]:
print(np.sin, "is of type", type(np.sin))
print(np.add, "is of type", type(np.add))
<ufunc 'sin'> is of type <class 'numpy.ufunc'>
<ufunc 'add'> is of type <class 'numpy.ufunc'>

Vectorize

  • Generates a ufunc from a Python function
  • Converts a scalar function into an elementwise array function
  • Numba provides CPU and GPU support
In [11]:
# CPU version
@numba.vectorize(['float32(float32, float32)',
                  'float64(float64, float64)'], target='cpu')
def cpu_sincos(x, y):
    return math.sin(x) * math.cos(y)

# CUDA version
@numba.vectorize(['float32(float32, float32)',
                  'float64(float64, float64)'], target='cuda')
def gpu_sincos(x, y):
    return math.sin(x) * math.cos(y)
@numba.vectorize(<list of signatures>, target=<'cpu', 'cuda'>)
  • A ufunc can be overloaded to work on multiple type signatures (see the sketch below)
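
As a quick illustration of the dispatch (a sketch reusing the cpu_sincos ufunc defined above; the array names are illustrative):

    xs32 = np.linspace(0, 1, 10).astype(np.float32)
    xs64 = np.linspace(0, 1, 10)                 # float64 by default

    print(cpu_sincos(xs32, xs32).dtype)          # the float32 signature is used
    print(cpu_sincos(xs64, xs64).dtype)          # the float64 signature is used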

Test it out

  • 2 input arrays
  • 1 output array
  • 1 million doubles (8 MB) per array
  • Total 24 MB of data
In [12]:
# Generate data
n = 1000000
x = np.linspace(0, np.pi, n)
y = np.linspace(0, np.pi, n)

# Check result
np_ans = np.sin(x) * np.cos(y)
nb_cpu_ans = cpu_sincos(x, y)
nb_gpu_ans = gpu_sincos(x, y)

print("CPU vectorize correct: ", np.allclose(nb_cpu_ans, np_ans))
print("GPU vectorize correct: ", np.allclose(nb_gpu_ans, np_ans))
CPU vectorize correct:  True
GPU vectorize correct:  True

Benchmark

In [15]:
print("NumPy")
%timeit np.sin(x) * np.cos(y)

print("CPU vectorize")
%timeit cpu_sincos(x, y)

print("GPU vectorize")
%timeit gpu_sincos(x, y)

# Optional cleanup 
del x, y
NumPy
10 loops, best of 3: 44 ms per loop
CPU vectorize
10 loops, best of 3: 37.7 ms per loop
GPU vectorize
100 loops, best of 3: 4.81 ms per loop
  • CPU vectorize time is similar to pure NumPy time because sin() and cos() calls dominate the time.
  • GPU vectorize is a lot faster

Behind the scenes

Automatic memory transfer

  • NumPy arrays are automatically transferred (a sketch contrasting this with manual transfer follows below)
    • CPU -> GPU
    • GPU -> CPU

Automatic work scheduling

  • The work is distributed across all the threads on the GPU
  • The GPU hardware handles the scheduling

Automatic GPU memory management

  • GPU memory is tied to object lifetime
  • freed automatically
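
A minimal sketch contrasting the automatic path with explicit transfers (assumes the gpu_sincos ufunc defined above; the transfer helpers are covered in more detail later in the notebook):

    x = np.linspace(0, np.pi, 1000000)
    y = np.linspace(0, np.pi, 1000000)

    # Automatic: the arrays are copied host -> device, the ufunc runs on the
    # GPU, and the result is copied back, all behind the scenes.
    ans = gpu_sincos(x, y)

    # Manual: keep the data on the GPU and copy back only when needed.
    d_x = numba.cuda.to_device(x)
    d_y = numba.cuda.to_device(y)
    d_ans = numba.cuda.device_array_like(x)
    gpu_sincos(d_x, d_y, out=d_ans)       # no host <-> device traffic here
    ans = d_ans.copy_to_host()            # explicit copy back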

Another Vectorize Example

In [16]:
@numba.vectorize(['float32(float32, float32, float32, float32)'])
def cpu_powers(p, q, r, s):
    return math.sqrt(p ** 2 + q ** 3 + r ** 4 + s ** 5)

@numba.vectorize(['float32(float32, float32, float32, float32)'], target='cuda')
def gpu_powers(p, q, r, s):
    return math.sqrt(p ** 2 + q ** 3 + r ** 4 + s ** 5)
In [18]:
# Generate data
n = 5000000
p = np.random.random(n).astype(np.float32)
q = np.random.random(n).astype(np.float32)
r = np.random.random(n).astype(np.float32)
s = np.random.random(n).astype(np.float32)

# Check results
np_ans = np.sqrt(p ** 2 + q ** 3 + r ** 4 + s ** 5)
cpu_ans = cpu_powers(p, q, r, s)
gpu_ans = gpu_powers(p, q, r, s)
print("CPU vectorize correct", np.allclose(np_ans, cpu_ans))
print("GPU vectorize correct", np.allclose(np_ans, gpu_ans))
CPU vectorize correct True
GPU vectorize correct True

Benchmark

In [19]:
print("NumPy")
%timeit np.sqrt(p ** 2 + q ** 3 + r ** 4 + s ** 5)
print("CPU vectorize")
%timeit cpu_powers(p, q, r, s)
print("GPU vectorize")
%timeit gpu_powers(p, q, r, s)

# Optional cleanup 
del p, q, r, s
NumPy
1 loops, best of 3: 1e+03 ms per loop
CPU vectorize
10 loops, best of 3: 20.8 ms per loop
GPU vectorize
100 loops, best of 3: 16.4 ms per loop
  • NumPy is slower than CPU vectorize (likely due to the allocation of temporary arrays for the intermediate results)
  • The CPU version is slower than the GPU version because the GPU runs many more cores in parallel, even though copying the data to the card and back is costly

Generalized Universal Function (guvectorize)

  • Vectorize is limited to scalar arguments in the core function
  • GUVectorize accepts array arguments
In [20]:
@numba.guvectorize(['void(float32[:,:], float32[:,:], float32[:,:])'],
                      '(m, n),(n, p)->(m, p)', target='cuda')
def batch_matrix_mult(a, b, c):
    for i in range(c.shape[0]):
        for j in range(c.shape[1]):
            tmp = 0
            for n in range(a.shape[1]):
                 tmp += a[i, n] * b[n, j]
            c[i, j] = tmp
@numba.guvectorize(<list of function signatures>, <layout string describing the shape signature>, target=<'cpu', 'cuda'>)
  • The layout string '(m, n),(n, p)->(m, p)' declares the core dimensions; extra leading dimensions are treated as the batch and looped over automatically
In [24]:
a = np.arange(1.0, 10.0, dtype=np.float32).reshape(3,3)
b = np.arange(1.0, 10.0, dtype=np.float32).reshape(3,3)

# Use the builtin matrix_multiply in NumPy for CPU test
import numpy.core.umath_tests as ut

# Check result
print('NumPy result')
np_ans = ut.matrix_multiply(a, b)
print(np_ans)

print('Numba GPU result')
gpu_ans = batch_matrix_mult(a, b)
print(gpu_ans)

assert np.allclose(np_ans, gpu_ans)
NumPy result
[[  30.   36.   42.]
 [  66.   81.   96.]
 [ 102.  126.  150.]]
Numba GPU result
[[  30.   36.   42.]
 [  66.   81.   96.]
 [ 102.  126.  150.]]

Test it out

  • Batch-multiply two sets of 4 million 2x2 matrices
In [25]:
n = 4000000
dim = 2
a = np.random.random(n * dim * dim).astype(np.float32).reshape(n, dim, dim)
b = np.random.random(n * dim * dim).astype(np.float32).reshape(n, dim, dim)

print('NumPy time')
%timeit ut.matrix_multiply(a, b)

print('Numba GPU time')
%timeit batch_matrix_mult(a, b)
NumPy time
10 loops, best of 3: 81.3 ms per loop
Numba GPU time
10 loops, best of 3: 87.9 ms per loop
  • GPU time seems to be similar to CPU time

Manually Transfer the data to the GPU

  • This will let us see the actual compute time without the CPU<->GPU transfer
In [26]:
dc = numba.cuda.device_array_like(a)
da = numba.cuda.to_device(a)
db = numba.cuda.to_device(b)
  • numba.cuda.device_array_like allocates an uninitialized GPU array with the type and shape of another array
    • similar to numpy.empty_like(a)
  • numba.cuda.to_device creates a GPU copy of the CPU array
  • the device array's copy_to_host() method (used later) copies the result back to the CPU

Pure compute time on the GPU

In [27]:
def check_pure_compute_time(da, db, dc):
    batch_matrix_mult(da, db, out=dc)
    numba.cuda.synchronize()   # ensure the call has completed
    
%timeit check_pure_compute_time(da, db, dc)
del da, db, dc
100 loops, best of 3: 3.35 ms per loop
  • The actual compute time is a lot faster
  • The difference is PCI-Express transfer overhead

Tips

If you have a sequence of ufuncs to apply, pin the data on the GPU with a manual transfer, for example:
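
A sketch of this pattern (the two small ufuncs, gpu_scale and gpu_offset, are made up for illustration; the point is one copy in and one copy out):

    @numba.vectorize(['float32(float32)'], target='cuda')
    def gpu_scale(x):
        return 2.0 * x

    @numba.vectorize(['float32(float32)'], target='cuda')
    def gpu_offset(x):
        return x + 1.0

    data = np.arange(1000000, dtype=np.float32)

    d_data = numba.cuda.to_device(data)          # one host -> device copy
    d_tmp = numba.cuda.device_array_like(data)   # scratch buffer on the GPU

    gpu_scale(d_data, out=d_tmp)                 # result stays on the GPU
    gpu_offset(d_tmp, out=d_tmp)                 # still on the GPU
    result = d_tmp.copy_to_host()                # one device -> host copy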

Accelerate CUDA Libraries Bindings

  • Access to CUDA libraries
  • Works seamlessly with NumPy
    • auto memory transfer
    • managed memory
  • cuBLAS: CUDA version of BLAS
  • cuSparse: CUDA sparse matrix support
  • cuFFT: FFT on CUDA
  • cuRAND: random number generation on CUDA

An Example with CUDA Libs

Convolution on the GPU

Import cuFFT

In [28]:
import accelerate.cuda.fft as cufft

Misc. imports

In [29]:
from scipy.signal import fftconvolve
from scipy import misc, ndimage
from matplotlib import pyplot as plt
from timeit import default_timer as timer

Build elementwise complex array multiplication CUDA function

In [30]:
@numba.vectorize(['complex64(complex64, complex64)'], target='cuda')
def vmult(a, b):
    return a * b

Prepare image and filter

In [31]:
image = misc.face(gray=True).astype(np.float32)

laplacian_pts = '''
-4 -1 0 -1 -4
-1 2 3 2 -1
0 3 4 3 0
-1 2 3 2 -1
-4 -1 0 -1 -4
'''.split()

laplacian = np.array(laplacian_pts, dtype=np.float32).reshape(5, 5)

response = np.zeros_like(image)
response[:5, :5] = laplacian

print("Image size: %s" % (image.shape,))
Image size: (768, 1024)

Convolution on the CPU

In [33]:
ts = timer()  # Start Timer
cvimage_cpu = fftconvolve(image, laplacian, mode='same')
te = timer()  # Stop Timer
print('CPU: %.2fs' % (te - ts))
CPU: 0.05s

Convolution on the GPU

In [34]:
image_complex = image.astype(np.complex64)
response_complex = response.astype(np.complex64)
In [35]:
# Trigger initialization of the cuFFT system.
# This takes significant time for a small dataset,
# so we exclude it from the timing below.
cufft.FFTPlan(shape=image.shape, itype=np.complex64, otype=np.complex64)
Out[35]:
<accelerate.cuda.fft.api.FFTPlan at 0x7fa4827e85c0>
In [37]:
ts = timer()     # Start timer

d_image_complex = numba.cuda.to_device(image_complex)
d_response_complex = numba.cuda.to_device(response_complex)

# Forward FFT
cufft.fft_inplace(d_image_complex)
cufft.fft_inplace(d_response_complex)

# Multiply the image with the filter (elementwise, in the frequency domain)
vmult(d_image_complex, d_response_complex, out=d_image_complex)

# Inverse FFT
cufft.ifft_inplace(d_image_complex)

cvimage_gpu = d_image_complex.copy_to_host().real / np.prod(image.shape)

te = timer()   # Stop timer

print('GPU: %.2fs' % (te - ts))
GPU: 0.01s
In [38]:
plt.title('Original')
plt.imshow(image, cmap=plt.cm.gray)
plt.axis('off')
plt.show()
In [39]:
plt.title('CPU')
plt.imshow(cvimage_cpu, cmap=plt.cm.gray)
plt.axis('off')
plt.show()
In [40]:
plt.title('GPU')
plt.imshow(cvimage_gpu, cmap=plt.cm.gray)
plt.axis('off')
plt.show()

Low-Level Approach: @numba.cuda.jit

  • Numba can generate CUDA kernels with the @numba.cuda.jit decorator
  • The decorated function follows the CUDA execution model

CUDA Execution Model

  • Kernel functions
    • visible to the host CPU
    • cannot return any value
      • use an output argument instead
    • is launched over a grid
  • Grid
    • a group of blocks
    • 1D, 2D, 3D
  • Blocks
    • a group of threads
    • 1D, 2D, 3D
  • Every thread executes the same kernel
    • each thread uses the grid, block, and thread coordinate system to determine its ID
In [41]:
Image(url='http://docs.nvidia.com/cuda/cuda-c-programming-guide/graphics/grid-of-thread-blocks.png')
Out[41]:

Compiling a CUDA Kernel

In [42]:
from numba import cuda

@numba.cuda.jit("void(float32[:], float32[:], float32[:])")
def vadd(arr_a, arr_b, arr_out):
    tx = cuda.threadIdx.x
    bx = cuda.blockIdx.x
    bw = cuda.blockDim.x    # number of threads per block
    i = tx + bx * bw
    if i >= arr_out.size:
        return
    arr_out[i] = arr_a[i] + arr_b[i]

Code Explained

Define a CUDA kernel with three 1D float32 arrays as arguments

@numba.cuda.jit("void(float32[:], float32[:], float32[:])")
def vadd(arr_a, arr_b, arr_out):

Map the thread and block coordinates to a global position

    tx = cuda.threadIdx.x   # thread label (along x dimension)
    bx = cuda.blockIdx.x    # block label (along x dimension)
    bw = cuda.blockDim.x    # number of threads in each block (along x dimension)
    i = tx + bx * bw        # flattened linear address for each thread

or simplified to:

    i = cuda.grid(1)

Ensure global position is within array size

    if i >= arr_out.size:
        return

The actual work

    arr_out[i] = arr_a[i] + arr_b[i]
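
Putting the pieces together, a sketch of the same kernel written with cuda.grid(1) (equivalent to vadd above; vadd_grid is just an illustrative name):

    @numba.cuda.jit("void(float32[:], float32[:], float32[:])")
    def vadd_grid(arr_a, arr_b, arr_out):
        i = cuda.grid(1)             # flattened global thread index
        if i >= arr_out.size:        # guard the extra threads in the last block
            return
        arr_out[i] = arr_a[i] + arr_b[i]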

Launch kernel

Prepare data

In [43]:
n = 100
a = np.arange(n, dtype=np.float32)
b = np.arange(n, dtype=np.float32)
c = np.empty_like(a)                 # Must prepare the output array to hold the result

Calculate thread, block count

  • The thread count is set to the warp size of the GPU
    • The warp size is similar to the SIMD vector width on the CPU
    • Performance tip: set the thread count to a multiple of the warp size
  • The block count is ceil(n / thread_ct)

Note: This will launch more threads than there are elements in the array, which is why the kernel checks i against arr_out.size and returns early.

In [44]:
thread_ct = my_gpu.WARP_SIZE
block_ct = int(math.ceil(float(n) / thread_ct))

print("Threads per block:", thread_ct)
print("Block per grid:", block_ct)
Threads per block: 32
Block per grid: 4

Launch kernel

The kernel function object uses __getitem__ (indexing notation) to configure the grid and block dimensions:

    kernel_function[griddim, blockdim](*args)
  • griddim
    • Number of blocks per grid (grid dimension)
    • type: an int for 1D, or a tuple of 1, 2, or 3 ints for a 1D, 2D, or 3D grid respectively
  • blockdim
    • Number of threads per block (block dimension)
    • type: an int for 1D, or a tuple of 1, 2, or 3 ints for a 1D, 2D, or 3D block respectively (a 2D launch is sketched after the 1D example below)
In [45]:
vadd[block_ct, thread_ct](a, b, c)    # Last argument is the output array in this case
print(c)
[   0.    2.    4.    6.    8.   10.   12.   14.   16.   18.   20.   22.
   24.   26.   28.   30.   32.   34.   36.   38.   40.   42.   44.   46.
   48.   50.   52.   54.   56.   58.   60.   62.   64.   66.   68.   70.
   72.   74.   76.   78.   80.   82.   84.   86.   88.   90.   92.   94.
   96.   98.  100.  102.  104.  106.  108.  110.  112.  114.  116.  118.
  120.  122.  124.  126.  128.  130.  132.  134.  136.  138.  140.  142.
  144.  146.  148.  150.  152.  154.  156.  158.  160.  162.  164.  166.
  168.  170.  172.  174.  176.  178.  180.  182.  184.  186.  188.  190.
  192.  194.  196.  198.]
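
The launch above is one-dimensional. For 2D problems, griddim and blockdim are given as tuples; a small hypothetical sketch (the fill2d kernel and its sizes are not from the notebook):

    @numba.cuda.jit("void(float32[:,:])")
    def fill2d(out):
        x, y = cuda.grid(2)                      # global (x, y) coordinates
        if x < out.shape[1] and y < out.shape[0]:
            out[y, x] = x + y

    arr2d = np.zeros((300, 400), dtype=np.float32)
    blockdim2d = 16, 16                          # threads per block (x, y)
    griddim2d = (int(math.ceil(arr2d.shape[1] / 16.0)),
                 int(math.ceil(arr2d.shape[0] / 16.0)))   # blocks per grid (x, y)

    fill2d[griddim2d, blockdim2d](arr2d)         # arr2d is copied back automatically
    print(arr2d[299, 399])                       # expect 698.0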

Example: Matrix-Matrix Multiplication

  • Shows manual caching with shared memory
  • Not the best matrix-matrix multiplication implementation

Prepare constants

In [46]:
from numba import float32

bpg = 150                      # blocks per grid (per dimension)
tpb = 32                       # threads per block (per dimension)
n = bpg * tpb                  # matrix size: n x n
shared_mem_size = (tpb, tpb)   # one tile per block
griddim = bpg, bpg
blockdim = tpb, tpb

Naive version

In [47]:
Image(url="http://docs.nvidia.com/cuda/cuda-c-programming-guide/graphics/matrix-multiplication-without-shared-memory.png")
Out[47]:
In [48]:
@numba.cuda.jit("void(float32[:,:], float32[:,:], float32[:,:])")
def naive_matrix_mult(A, B, C):
    x, y = cuda.grid(2)
    if x >= n or y >= n:
        return

    # Each thread computes one element of C, reading a full row of A
    # and a full column of B from global memory
    C[y, x] = 0
    for i in range(n):
        C[y, x] += A[y, i] * B[i, x]

Optimized version (shared memory + blocking)

In [49]:
Image(url="http://docs.nvidia.com/cuda/cuda-c-programming-guide/graphics/matrix-multiplication-with-shared-memory.png")
Out[49]:
In [50]:
@numba.cuda.jit("void(float32[:,:], float32[:,:], float32[:,:])")
def optimized_matrix_mult(A, B, C):
    # Declare shared memory
    sA = cuda.shared.array(shape=shared_mem_size, dtype=float32)
    sB = cuda.shared.array(shape=shared_mem_size, dtype=float32)
    
    tx = cuda.threadIdx.x
    ty = cuda.threadIdx.y
    x, y = cuda.grid(2)

    acc = 0
    for i in range(bpg):
        if x < n and y < n:
            # Preload one tile of A and one tile of B into shared memory
            sA[ty, tx] = A[y, tx + i * tpb]
            sB[ty, tx] = B[ty + i * tpb, x]

        # Synchronize all threads in the block
        cuda.syncthreads()

        if x < n and y < n:
            # Compute product
            for j in range(tpb):
                acc += sA[ty, j] * sB[j, tx]

        # Wait until all threads finish the computation
        cuda.syncthreads()

    if x < n and y < n:
        C[y, x] = acc

Prepare data

In [51]:
# Prepare data on the CPU
A = np.array(np.random.random((n, n)), dtype=np.float32)
B = np.array(np.random.random((n, n)), dtype=np.float32)

print("N = %d x %d" % (n, n))

# Prepare data on the GPU
dA = cuda.to_device(A)
dB = cuda.to_device(B)
dC = cuda.device_array_like(A)
N = 4800 x 4800

Benchmark

In [52]:
# Time the unoptimized version
s = timer()
naive_matrix_mult[griddim, blockdim](dA, dB, dC)
numba.cuda.synchronize()
e = timer()
unopt_ans = dC.copy_to_host()
tcuda_unopt = e - s

# Time the optimized version
s = timer()
optimized_matrix_mult[griddim, blockdim](dA, dB, dC)
numba.cuda.synchronize()
e = timer()
opt_ans = dC.copy_to_host()
tcuda_opt = e - s

Result

In [53]:
assert np.allclose(unopt_ans, opt_ans)
print("Without shared memory:", "%.2f" % tcuda_unopt, "s")
print("With shared memory:", "%.2f" % tcuda_opt, "s")
Without shared memory: 3.75 s
With shared memory: 4.23 s

Summary

  • Numba
    • open-source CUDA GPU support
    • CUDA kernel @numba.cuda.jit
    • vectorize @numba.vectorize
    • guvectorize @numba.guvectorize
  • Accelerate
    • CUDA libraries accelerate.cuda.*