Bempp OpenCL performance benchmarks

This is a test notebook to benchmark Bempp performance on different devices. The figures reported here were obtained under Ubuntu 20.04 Linux on an Intel Core i9-9980HK 8-core CPU with a base clock of 2.4 GHz and a maximum turbo frequency of 5 GHz. The GPU device is an NVIDIA Quadro RTX 3000 with 6 GB RAM.

As OpenCL CPU drivers we test both POCL (version 1.5) and the Intel OpenCL CPU driver, each with default vectorization options.

We benchmark the following operator types:

  • Boundary Operators:

    • Laplace single and double layer boundary operator
    • Helmholtz single and double layer boundary operator
    • Maxwell electric and magnetic field boundary operator
  • Domain Potential Operators:

    • Laplace single and double layer potential operator
    • Helmholtz single and double layer potential operator
    • Maxwell electric and magnetic field domain potential operator

We test all operators in single and double precision. Note that the GPU is significantly slower in double precision than in single precision, as the double-precision results below confirm.

As mesh we use a uniformly refined sphere with 8192 elements. As wavenumber in the Helmholtz and Maxwell tests we use the value $k=1.0$; the choice of wavenumber has no effect on the performance. As scalar function spaces we use spaces of continuous $P1$ functions. For Maxwell we use RWG functions of order 0.
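For reference, the element count follows from the refinement level: `regular_sphere(n)` starts from an octahedral base mesh of 8 triangles and quadruples the element count with each refinement, so level 5 gives the 8192 elements quoted above. A small sketch (plain Python, no Bempp required) of this count:

```python
# Element count of bempp.api.shapes.regular_sphere(level): the base mesh
# has 8 triangles and each refinement level quadruples the element count.
def regular_sphere_elements(level):
    return 8 * 4 ** level

print(regular_sphere_elements(5))  # 8192, the mesh size used in this notebook
```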

General Setup

In this section we define the general objects that we need in all benchmarks.

In [1]:
import bempp.api
import numpy as np
import pandas as pd
In [2]:
grid = bempp.api.shapes.regular_sphere(5)
p1_space = bempp.api.function_space(grid, "P", 1)
rwg_space = bempp.api.function_space(grid, "RWG", 0)
snc_space = bempp.api.function_space(grid, "SNC", 0)
In [3]:
def benchmark_boundary_operator(operator, precision):
    """Benchmark assembly of a boundary operator with the given precision."""
    result = %timeit -o -r 2 -n 2 operator(precision).weak_form()
    return result.best

def benchmark_potential_operator(operator, fun, precision):
    """Benchmark application of a potential operator with the given precision."""
    result = %timeit -o -r 2 -n 2 res = operator(precision) @ fun
    return result.best
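The helpers above rely on the IPython `%timeit` magic. For completeness, a minimal sketch of an equivalent helper using only the standard-library `timeit` module (the workload below is a placeholder, not a Bempp call), useful when running these benchmarks outside a notebook:

```python
import timeit

def benchmark_callable(fn, repeat=2, number=2):
    """Return the best per-call time of fn, mimicking %timeit -o -r 2 -n 2."""
    times = timeit.repeat(fn, repeat=repeat, number=number)
    return min(times) / number

# Placeholder workload standing in for e.g. a weak-form assembly.
best = benchmark_callable(lambda: sum(range(100_000)))
print(f"{best:.6f} s per call")
```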

Boundary Operators

We first define the boundary operators that we want to test, wrapping each in a factory that depends only on a precision argument.

In [4]:
from bempp.api.operators import boundary

k = 1.0

operators = [
    lambda precision: boundary.laplace.single_layer(p1_space, p1_space, p1_space, precision=precision),
    lambda precision: boundary.laplace.double_layer(p1_space, p1_space, p1_space, precision=precision),
    lambda precision: boundary.helmholtz.single_layer(p1_space, p1_space, p1_space, k, precision=precision),
    lambda precision: boundary.helmholtz.double_layer(p1_space, p1_space, p1_space, k, precision=precision),
    lambda precision: boundary.maxwell.electric_field(rwg_space, rwg_space, snc_space, k, precision=precision),
    lambda precision: boundary.maxwell.magnetic_field(rwg_space, rwg_space, snc_space, k, precision=precision),
]
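As a design note, the lambda factories above delay operator construction so that each benchmark run builds the operator fresh for the requested precision. The same pattern can be written with `functools.partial`; a sketch with a dummy stand-in constructor (`make_operator` is hypothetical, standing in for a call such as `boundary.laplace.single_layer`):

```python
from functools import partial

# Hypothetical stand-in for a Bempp operator constructor; it simply records
# its arguments so the factory pattern can be demonstrated without Bempp.
def make_operator(domain, range_, dual, precision=None):
    return (domain, range_, dual, precision)

# Fix everything except the precision, mirroring the lambdas above.
op_factory = partial(make_operator, "p1", "p1", "p1")
print(op_factory(precision="single"))  # ('p1', 'p1', 'p1', 'single')
```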

We set up a Pandas DataFrame to conveniently store the different timings.

In [5]:
driver_labels = ['Portable Computing Language', 'Intel(R) OpenCL']
precision_labels = ['single', 'double']

operator_labels = [
    "laplace single layer bnd",
    "laplace double layer bnd",
    "helmholtz single layer bnd",
    "helmholtz double layer bnd",
    "maxwell electric bnd",
    "maxwell magnetic bnd",
]
df = pd.DataFrame(index=operator_labels, columns=pd.MultiIndex.from_product([driver_labels, precision_labels]))

We assemble each operator once to make sure that all kernels are compiled before timing.

In [6]:
for precision in ['single', 'double']:
    if precision == 'single':
        bempp.api.VECTORIZATION_MODE = 'vec16'
    else:
        bempp.api.VECTORIZATION_MODE = 'vec8'
    for driver_name in driver_labels:
        bempp.api.set_default_cpu_device_by_name(driver_name)
        for op in operators:
            op(precision).weak_form()
/home/betcke/miniconda3/envs/dev/lib/python3.8/site-packages/pyopencl/__init__.py:267: CompilerWarning: Non-empty compiler output encountered. Set the environment variable PYOPENCL_COMPILER_OUTPUT=1 to see more.
  warn("Non-empty compiler output encountered. Set the "

Now let's run the actual benchmarks.

In [7]:
for precision in ['single', 'double']:
    if precision == 'single':
        bempp.api.VECTORIZATION_MODE = 'vec16'
    else:
        bempp.api.VECTORIZATION_MODE = 'vec8'
    for driver_name in driver_labels:
        print(f"Driver: {driver_name}")
        bempp.api.set_default_cpu_device_by_name(driver_name)
        for label, op in zip(operator_labels, operators):
            df.loc[label, (driver_name, precision)] = benchmark_boundary_operator(op, precision)
Driver: Portable Computing Language
748 ms ± 23 ms per loop (mean ± std. dev. of 2 runs, 2 loops each)
784 ms ± 615 µs per loop (mean ± std. dev. of 2 runs, 2 loops each)
2.8 s ± 36.2 ms per loop (mean ± std. dev. of 2 runs, 2 loops each)
2.96 s ± 20.2 ms per loop (mean ± std. dev. of 2 runs, 2 loops each)
3.87 s ± 6 ms per loop (mean ± std. dev. of 2 runs, 2 loops each)
4.37 s ± 30.5 ms per loop (mean ± std. dev. of 2 runs, 2 loops each)
Driver: Intel(R) OpenCL
619 ms ± 2.25 ms per loop (mean ± std. dev. of 2 runs, 2 loops each)
660 ms ± 1.56 ms per loop (mean ± std. dev. of 2 runs, 2 loops each)
1.26 s ± 7.74 ms per loop (mean ± std. dev. of 2 runs, 2 loops each)
1.3 s ± 3.02 ms per loop (mean ± std. dev. of 2 runs, 2 loops each)
2.59 s ± 25.2 ms per loop (mean ± std. dev. of 2 runs, 2 loops each)
2.95 s ± 47.6 ms per loop (mean ± std. dev. of 2 runs, 2 loops each)
Driver: Portable Computing Language
1.35 s ± 8.83 ms per loop (mean ± std. dev. of 2 runs, 2 loops each)
1.39 s ± 1.76 ms per loop (mean ± std. dev. of 2 runs, 2 loops each)
4.75 s ± 2.76 ms per loop (mean ± std. dev. of 2 runs, 2 loops each)
4.91 s ± 1.59 ms per loop (mean ± std. dev. of 2 runs, 2 loops each)
6.96 s ± 51.1 ms per loop (mean ± std. dev. of 2 runs, 2 loops each)
7.7 s ± 15 ms per loop (mean ± std. dev. of 2 runs, 2 loops each)
Driver: Intel(R) OpenCL
1.31 s ± 33.4 ms per loop (mean ± std. dev. of 2 runs, 2 loops each)
1.35 s ± 869 µs per loop (mean ± std. dev. of 2 runs, 2 loops each)
2.19 s ± 15.1 ms per loop (mean ± std. dev. of 2 runs, 2 loops each)
2.36 s ± 31.3 ms per loop (mean ± std. dev. of 2 runs, 2 loops each)
5.15 s ± 11.8 ms per loop (mean ± std. dev. of 2 runs, 2 loops each)
5.86 s ± 64.3 ms per loop (mean ± std. dev. of 2 runs, 2 loops each)
In [8]:
results_boundary_operators = df

Potential Operators

We evaluate the potentials at 10000 evaluation points. The points are normalised to lie on the unit sphere. As the function to which the potentials are applied we use a simple constant grid function.

In [9]:
npoints = 10000
rand = np.random.RandomState(0)
points = rand.randn(3, npoints)
points /= np.linalg.norm(points, axis=0)
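As a quick sanity check, the cell above can be reproduced with plain NumPy to confirm that the normalisation places every evaluation point on the unit sphere:

```python
import numpy as np

# Reproduce the point generation from the cell above and verify the radii.
rand = np.random.RandomState(0)
points = rand.randn(3, 10000)
points /= np.linalg.norm(points, axis=0)

radii = np.linalg.norm(points, axis=0)
print(radii.min(), radii.max())  # both 1 up to floating-point rounding
```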

Let us define the operators and functions.

In [10]:
from bempp.api.operators import potential

k = 1.0

operators = [
    lambda precision: potential.laplace.single_layer(p1_space, points, precision=precision),
    lambda precision: potential.laplace.double_layer(p1_space, points, precision=precision),
    lambda precision: potential.helmholtz.single_layer(p1_space, points, k, precision=precision),
    lambda precision: potential.helmholtz.double_layer(p1_space, points, k, precision=precision),
    lambda precision: potential.maxwell.electric_field(rwg_space, points, k, precision=precision),
    lambda precision: potential.maxwell.magnetic_field(rwg_space, points, k, precision=precision),
]

functions = [
    bempp.api.GridFunction.from_ones(p1_space),
    bempp.api.GridFunction.from_ones(p1_space),
    bempp.api.GridFunction.from_ones(p1_space),
    bempp.api.GridFunction.from_ones(p1_space),
    bempp.api.GridFunction.from_ones(rwg_space),
    bempp.api.GridFunction.from_ones(rwg_space),
]

We assemble each operator once to compile all functions.

In [11]:
for precision in ['single', 'double']:
    if precision == 'single':
        bempp.api.VECTORIZATION_MODE = 'vec16'
    else:
        bempp.api.VECTORIZATION_MODE = 'vec8'
    for driver_name in driver_labels:
        bempp.api.set_default_cpu_device_by_name(driver_name)
        for op, fun in zip(operators, functions):
            res = op(precision) @ fun

Let's create the data structure to store the results.

In [12]:
driver_labels = ['Portable Computing Language', 'Intel(R) OpenCL', 'NVIDIA CUDA']
precision_labels = ['single', 'double']

operator_labels = [
    "laplace single layer pot",
    "laplace double layer pot",
    "helmholtz single layer pot",
    "helmholtz double layer pot",
    "maxwell electric pot",
    "maxwell magnetic pot",
]
df = pd.DataFrame(index=operator_labels, columns=pd.MultiIndex.from_product([driver_labels, precision_labels]))

Finally, we run the actual tests.

In [13]:
bempp.api.set_default_gpu_device_by_name('NVIDIA CUDA')

for precision in ['single', 'double']:
    if precision == 'single':
        bempp.api.VECTORIZATION_MODE = 'vec16'
    else:
        bempp.api.VECTORIZATION_MODE = 'vec8'
    for driver_name in driver_labels:
        print(f"Driver: {driver_name}")
        if driver_name == 'NVIDIA CUDA':
            bempp.api.POTENTIAL_OPERATOR_DEVICE_TYPE = 'gpu'
            bempp.api.VECTORIZATION_MODE = 'novec'
        else:
            bempp.api.POTENTIAL_OPERATOR_DEVICE_TYPE = 'cpu'
            bempp.api.set_default_cpu_device_by_name(driver_name)
        for label, op, fun in zip(operator_labels, operators, functions):
            df.loc[label, (driver_name, precision)] = benchmark_potential_operator(op, fun, precision)
Driver: Portable Computing Language
91.9 ms ± 3.32 ms per loop (mean ± std. dev. of 2 runs, 2 loops each)
114 ms ± 619 µs per loop (mean ± std. dev. of 2 runs, 2 loops each)
435 ms ± 1.26 ms per loop (mean ± std. dev. of 2 runs, 2 loops each)
455 ms ± 734 µs per loop (mean ± std. dev. of 2 runs, 2 loops each)
1.03 s ± 961 µs per loop (mean ± std. dev. of 2 runs, 2 loops each)
999 ms ± 7.38 ms per loop (mean ± std. dev. of 2 runs, 2 loops each)
Driver: Intel(R) OpenCL
95.2 ms ± 4.22 ms per loop (mean ± std. dev. of 2 runs, 2 loops each)
106 ms ± 2.82 ms per loop (mean ± std. dev. of 2 runs, 2 loops each)
172 ms ± 1.16 ms per loop (mean ± std. dev. of 2 runs, 2 loops each)
198 ms ± 5.2 ms per loop (mean ± std. dev. of 2 runs, 2 loops each)
382 ms ± 5.03 ms per loop (mean ± std. dev. of 2 runs, 2 loops each)
351 ms ± 2.34 ms per loop (mean ± std. dev. of 2 runs, 2 loops each)
Driver: NVIDIA CUDA
25.9 ms ± 4.28 ms per loop (mean ± std. dev. of 2 runs, 2 loops each)
22.4 ms ± 255 µs per loop (mean ± std. dev. of 2 runs, 2 loops each)
34.9 ms ± 1.67 ms per loop (mean ± std. dev. of 2 runs, 2 loops each)
36.2 ms ± 340 µs per loop (mean ± std. dev. of 2 runs, 2 loops each)
71.5 ms ± 394 µs per loop (mean ± std. dev. of 2 runs, 2 loops each)
71 ms ± 18.2 µs per loop (mean ± std. dev. of 2 runs, 2 loops each)
Driver: Portable Computing Language
149 ms ± 28.2 ms per loop (mean ± std. dev. of 2 runs, 2 loops each)
170 ms ± 7.11 ms per loop (mean ± std. dev. of 2 runs, 2 loops each)
797 ms ± 916 µs per loop (mean ± std. dev. of 2 runs, 2 loops each)
841 ms ± 195 µs per loop (mean ± std. dev. of 2 runs, 2 loops each)
2.3 s ± 7.4 ms per loop (mean ± std. dev. of 2 runs, 2 loops each)
2.22 s ± 2.97 ms per loop (mean ± std. dev. of 2 runs, 2 loops each)
Driver: Intel(R) OpenCL
190 ms ± 2.53 ms per loop (mean ± std. dev. of 2 runs, 2 loops each)
218 ms ± 64.4 µs per loop (mean ± std. dev. of 2 runs, 2 loops each)
319 ms ± 1.77 ms per loop (mean ± std. dev. of 2 runs, 2 loops each)
358 ms ± 541 µs per loop (mean ± std. dev. of 2 runs, 2 loops each)
754 ms ± 6.19 ms per loop (mean ± std. dev. of 2 runs, 2 loops each)
655 ms ± 1.32 ms per loop (mean ± std. dev. of 2 runs, 2 loops each)
Driver: NVIDIA CUDA
449 ms ± 27.5 ms per loop (mean ± std. dev. of 2 runs, 2 loops each)
468 ms ± 45.5 µs per loop (mean ± std. dev. of 2 runs, 2 loops each)
750 ms ± 67.6 µs per loop (mean ± std. dev. of 2 runs, 2 loops each)
818 ms ± 1.53 ms per loop (mean ± std. dev. of 2 runs, 2 loops each)
1.53 s ± 64.7 µs per loop (mean ± std. dev. of 2 runs, 2 loops each)
1.45 s ± 1.42 ms per loop (mean ± std. dev. of 2 runs, 2 loops each)
In [14]:
results_potential_operators = df

Output

In [15]:
print(results_boundary_operators)
print(results_potential_operators)
                           Portable Computing Language            \
                                                single    double   
laplace single layer bnd                      0.725336  1.344582   
laplace double layer bnd                      0.783209  1.390851   
helmholtz single layer bnd                    2.760039  4.743433   
helmholtz double layer bnd                    2.943111  4.907221   
maxwell electric bnd                          3.861363  6.908861   
maxwell magnetic bnd                          4.338312  7.688866   

                           Intel(R) OpenCL            
                                    single    double  
laplace single layer bnd          0.617227  1.276647  
laplace double layer bnd          0.658904  1.353705  
helmholtz single layer bnd        1.249008  2.174586  
helmholtz double layer bnd        1.296146   2.32856  
maxwell electric bnd              2.567333  5.136648  
maxwell magnetic bnd              2.901476  5.794071  
                           Portable Computing Language            \
                                                single    double   
laplace single layer pot                      0.088567  0.121174   
laplace double layer pot                      0.113249  0.162764   
helmholtz single layer pot                    0.433934  0.795617   
helmholtz double layer pot                    0.453831  0.840373   
maxwell electric pot                          1.029883  2.297465   
maxwell magnetic pot                          0.991923  2.220875   

                           Intel(R) OpenCL           NVIDIA CUDA            
                                    single    double      single    double  
laplace single layer pot          0.091016  0.186997     0.02161  0.421118  
laplace double layer pot          0.102928  0.218029    0.022098  0.467704  
helmholtz single layer pot        0.170673  0.316907    0.033239   0.74997  
helmholtz double layer pot        0.192861  0.357478    0.035876  0.816843  
maxwell electric pot               0.37675  0.747511     0.07108  1.533835  
maxwell magnetic pot               0.34823  0.653525     0.07103  1.450303
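From tables shaped like the ones above one can directly read off, e.g., the single- versus double-precision slowdown per driver. A sketch of that post-processing on a small made-up DataFrame with the same MultiIndex column layout (the timing values here are illustrative, not the measured results):

```python
import pandas as pd

# Illustrative results table with the same (driver, precision) column layout
# as the benchmark DataFrames above; the numbers are made up.
drivers = ["Portable Computing Language", "Intel(R) OpenCL"]
precisions = ["single", "double"]
cols = pd.MultiIndex.from_product([drivers, precisions])
df = pd.DataFrame(
    [[0.73, 1.34, 0.62, 1.28], [0.78, 1.39, 0.66, 1.35]],
    index=["laplace single layer bnd", "laplace double layer bnd"],
    columns=cols,
)

# Double- over single-precision runtime ratio for each driver.
for d in drivers:
    ratio = df[(d, "double")] / df[(d, "single")]
    print(d, ratio.round(2).to_dict())
```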