This is one of the 100 recipes of the IPython Cookbook, the definitive guide to high-performance scientific computing and data science in Python.

5.7. Writing massively parallel code for NVIDIA graphics cards (GPUs) with CUDA

Let's import PyCUDA.

In [ ]:
import pycuda.driver as cuda
import pycuda.autoinit
from pycuda.compiler import SourceModule
import numpy as np

Now, we create the NumPy array that will contain the fractal. We can use np.empty() (uninitialized values) because the kernel will write every element.

In [ ]:
size = 200
iterations = 100
col = np.empty((size, size), dtype=np.int32)

We allocate memory for this array on the GPU.

In [ ]:
col_gpu = cuda.mem_alloc(col.nbytes)
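As a quick sanity check on the size of this buffer (a CPU-side sketch, NumPy only): col.nbytes is the number of bytes that cuda.mem_alloc reserves on the device.

```python
import numpy as np

size = 200
col = np.empty((size, size), dtype=np.int32)

# Each int32 element occupies 4 bytes, so the device buffer
# allocated with cuda.mem_alloc(col.nbytes) holds
# size * size * 4 bytes.
print(col.nbytes)  # 200 * 200 * 4 = 160000 bytes
```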

We write the CUDA kernel in a string. The mandelbrot() function accepts the figure size, the number of iterations, and a pointer to the memory buffer as arguments. For each pixel, it stores in the col buffer the index of the last iteration before the point escapes.

In [ ]:
code = """
__global__ void mandelbrot(int size,
                           int iterations,
                           int *col)
{
    // Get the row and column index of the current thread.
    int i = blockIdx.y * blockDim.y + threadIdx.y;
    int j = blockIdx.x * blockDim.x + threadIdx.x;
    int index = i * size + j;
    
    // Declare and initialize the variables.
    double cx, cy;
    double z0, z1, z0_tmp, z0_2, z1_2;
    cx = -2.0 + (double)j / size * 3;
    cy = -1.5 + (double)i / size * 3;

    // Main loop.
    z0 = z1 = 0.0;
    for (int n = 0; n < iterations; n++)
    {
        z0_2 = z0 * z0;
        z1_2 = z1 * z1;
        // Escape test, with a generous threshold: |z|^2 <= 100.
        if (z0_2 + z1_2 <= 100)
        {
            // Need to update z0 and z1 in parallel.
            z0_tmp = z0_2 - z1_2 + cx;
            z1 = 2 * z0 * z1 + cy;
            z0 = z0_tmp;
            col[index] = n;
        }
        else break;
    }
}
"""

Now, we compile the CUDA program.

In [ ]:
prg = SourceModule(code)
mandelbrot = prg.get_function("mandelbrot")

We define the block size and the grid size, specifying how the threads will be parallelized with respect to the data.

In [ ]:
block_size = 10
block = (block_size, block_size, 1)
grid = (size // block_size, size // block_size, 1)
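As a check that this launch configuration covers the whole image (simple arithmetic, assuming size is a multiple of block_size, as it is here):

```python
size = 200
block_size = 10
block = (block_size, block_size, 1)
grid = (size // block_size, size // block_size, 1)

# Threads per block times number of blocks must equal the
# pixel count, since each thread handles exactly one pixel.
threads_per_block = block[0] * block[1] * block[2]  # 10 * 10 * 1 = 100
n_blocks = grid[0] * grid[1] * grid[2]              # 20 * 20 * 1 = 400
print(threads_per_block * n_blocks == size * size)  # True
```

Each thread then recovers its pixel as (i, j) = (blockIdx.y * blockDim.y + threadIdx.y, blockIdx.x * blockDim.x + threadIdx.x), exactly the index computation at the top of the kernel.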

We call the compiled function.

In [ ]:
mandelbrot(np.int32(size), np.int32(iterations), col_gpu,
           block=block, grid=grid)

Once the function has completed, we copy the contents of the CUDA buffer back to the NumPy array col.

In [ ]:
cuda.memcpy_dtoh(col, col_gpu)

Let's display the fractal.

In [ ]:
import matplotlib.pyplot as plt
%matplotlib inline
plt.imshow(np.log(col), cmap=plt.cm.hot);
plt.xticks([]);
plt.yticks([]);

Let's evaluate the time taken by this function.

In [ ]:
%%timeit col_gpu = cuda.mem_alloc(col.nbytes); cuda.memcpy_htod(col_gpu, col)
mandelbrot(np.int32(size), np.int32(iterations), col_gpu,
           block=block, grid=grid)
cuda.memcpy_dtoh(col, col_gpu)

You'll find all the explanations, figures, references, and much more in the book (to be released later this summer).

IPython Cookbook, by Cyrille Rossant, Packt Publishing, 2014 (500 pages).