# Bempp OpenCL performance benchmarks

This is a test notebook to benchmark Bempp performance on different devices. The figures reported here were obtained under Ubuntu 20.04 Linux on an Intel Core i9-9980HK 8-core CPU with a base clock of 2.4 GHz and a maximum turbo frequency of 5 GHz. The GPU device is an NVIDIA Quadro RTX 3000 with 6 GB of RAM.

As OpenCL CPU drivers we test both POCL (version 1.5) and the Intel OpenCL CPU driver, each with default vectorization options.

We are benchmarking the following operator types:

- Boundary operators:
  - Laplace single and double layer boundary operator
  - Helmholtz single and double layer boundary operator
  - Maxwell electric and magnetic field boundary operator
- Domain potential operators:
  - Laplace single and double layer potential operator
  - Helmholtz single and double layer potential operator
  - Maxwell electric and magnetic field domain potential operator

We test all operators in single and double precision. On the GPU we only run single-precision tests, since this device is significantly slower in double precision.
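The precision choice matters for memory traffic as well as arithmetic throughput: a single-precision value occupies half the storage of a double-precision one, which is also why the vectorization width is switched between the two cases below. A quick illustration:

```python
import numpy as np

# Storage per value for the two precisions used in the benchmarks.
print(np.dtype(np.float32).itemsize)  # 4 bytes per single-precision value
print(np.dtype(np.float64).itemsize)  # 8 bytes per double-precision value
```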

As mesh we use a uniformly refined sphere with 8192 elements. As wavenumber in the Helmholtz and Maxwell tests we use $k=1.0$; this value has no effect on the performance. As scalar function space we use the space of continuous $P1$ functions. For Maxwell we use RWG functions of order 0.
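The 8192-element figure is consistent with refinement level 5, assuming `regular_sphere(n)` starts from a coarse sphere of 8 triangles and splits every triangle into four at each refinement level:

```python
# Element count of the refined sphere mesh: starting from 8 triangles,
# each refinement level quadruples the number of elements.
def sphere_elements(refinement_level):
    return 8 * 4 ** refinement_level

print(sphere_elements(5))  # 8192
```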

## General Setup

In this section we define the general objects that we need in all benchmarks.

In [1]:
import bempp.api
import numpy as np
import pandas as pd

In [2]:
grid = bempp.api.shapes.regular_sphere(5)
p1_space = bempp.api.function_space(grid, "P", 1)
rwg_space = bempp.api.function_space(grid, "RWG", 0)
snc_space = bempp.api.function_space(grid, "SNC", 0)

In [3]:
def benchmark_boundary_operator(operator, precision):
    """Benchmark a boundary operator with the given precision."""
    result = %timeit -o -r 2 -n 2 operator(precision).weak_form()
    return result.best

def benchmark_potential_operator(operator, fun, precision):
    """Benchmark a potential operator with the given precision."""
    result = %timeit -o -r 2 -n 2 operator(precision) @ fun
    return result.best
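The `%timeit -o` magic only exists inside IPython; `result.best` is the best per-loop time across the timed runs. Outside a notebook the same measurement can be approximated with the standard-library `timeit` module (a sketch, with a toy workload standing in for the operator assembly):

```python
import timeit

def benchmark(stmt, repeat=2, number=2):
    """Best per-call time over `repeat` runs of `number` calls each,
    mirroring what `%timeit -o -r 2 -n 2` reports via `result.best`."""
    times = timeit.repeat(stmt, repeat=repeat, number=number)
    return min(times) / number

# Toy workload standing in for operator(precision).weak_form().
best = benchmark(lambda: sum(range(1000)))
print(f"best per-call time: {best:.2e} s")
```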


## Boundary Operators

We first define the boundary operators that we want to test. We make them dependent only on a precision argument.

In [4]:
from bempp.api.operators import boundary

k = 1.0

operators = [
    lambda precision: boundary.laplace.single_layer(p1_space, p1_space, p1_space, precision=precision),
    lambda precision: boundary.laplace.double_layer(p1_space, p1_space, p1_space, precision=precision),
    lambda precision: boundary.helmholtz.single_layer(p1_space, p1_space, p1_space, k, precision=precision),
    lambda precision: boundary.helmholtz.double_layer(p1_space, p1_space, p1_space, k, precision=precision),
    lambda precision: boundary.maxwell.electric_field(rwg_space, rwg_space, snc_space, k, precision=precision),
    lambda precision: boundary.maxwell.magnetic_field(rwg_space, rwg_space, snc_space, k, precision=precision),
]


We set up a Pandas data frame to conveniently store the different timings.

In [5]:
driver_labels = ['Portable Computing Language', 'Intel(R) OpenCL']
precision_labels = ['single', 'double']

operator_labels = [
    "laplace single layer bnd",
    "laplace double layer bnd",
    "helmholtz single layer bnd",
    "helmholtz double layer bnd",
    "maxwell electric bnd",
    "maxwell magnetic bnd",
]
df = pd.DataFrame(index=operator_labels, columns=pd.MultiIndex.from_product([driver_labels, precision_labels]))
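The frame gets a two-level column index of (driver, precision) pairs, so each timing is addressed with a column tuple. A minimal standalone sketch of the same pattern (with one illustrative value):

```python
import pandas as pd

drivers = ['Portable Computing Language', 'Intel(R) OpenCL']
precisions = ['single', 'double']
rows = ['laplace single layer bnd']

# MultiIndex.from_product builds every (driver, precision) combination.
df = pd.DataFrame(index=rows, columns=pd.MultiIndex.from_product([drivers, precisions]))

# Individual cells are addressed with a (driver, precision) column tuple.
df.loc['laplace single layer bnd', ('Intel(R) OpenCL', 'single')] = 0.617
print(df)
```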


We assemble each operator once to make sure that all just-in-time compiled kernels are built before timing.

In [6]:
for precision in ['single', 'double']:
    # Use wider SIMD vectorization for single precision.
    if precision == 'single':
        bempp.api.VECTORIZATION_MODE = 'vec16'
    else:
        bempp.api.VECTORIZATION_MODE = 'vec8'
    for driver_name in driver_labels:
        bempp.api.set_default_cpu_device_by_name(driver_name)
        for op in operators:
            op(precision).weak_form()



Now let's run the actual benchmarks.

In [7]:
for precision in ['single', 'double']:
    if precision == 'single':
        bempp.api.VECTORIZATION_MODE = 'vec16'
    else:
        bempp.api.VECTORIZATION_MODE = 'vec8'
    for driver_name in driver_labels:
        print(f"Driver: {driver_name}")
        bempp.api.set_default_cpu_device_by_name(driver_name)
        for label, op in zip(operator_labels, operators):
            df.loc[label, (driver_name, precision)] = benchmark_boundary_operator(op, precision)

Driver: Portable Computing Language
748 ms ± 23 ms per loop (mean ± std. dev. of 2 runs, 2 loops each)
784 ms ± 615 µs per loop (mean ± std. dev. of 2 runs, 2 loops each)
2.8 s ± 36.2 ms per loop (mean ± std. dev. of 2 runs, 2 loops each)
2.96 s ± 20.2 ms per loop (mean ± std. dev. of 2 runs, 2 loops each)
3.87 s ± 6 ms per loop (mean ± std. dev. of 2 runs, 2 loops each)
4.37 s ± 30.5 ms per loop (mean ± std. dev. of 2 runs, 2 loops each)
Driver: Intel(R) OpenCL
619 ms ± 2.25 ms per loop (mean ± std. dev. of 2 runs, 2 loops each)
660 ms ± 1.56 ms per loop (mean ± std. dev. of 2 runs, 2 loops each)
1.26 s ± 7.74 ms per loop (mean ± std. dev. of 2 runs, 2 loops each)
1.3 s ± 3.02 ms per loop (mean ± std. dev. of 2 runs, 2 loops each)
2.59 s ± 25.2 ms per loop (mean ± std. dev. of 2 runs, 2 loops each)
2.95 s ± 47.6 ms per loop (mean ± std. dev. of 2 runs, 2 loops each)
Driver: Portable Computing Language
1.35 s ± 8.83 ms per loop (mean ± std. dev. of 2 runs, 2 loops each)
1.39 s ± 1.76 ms per loop (mean ± std. dev. of 2 runs, 2 loops each)
4.75 s ± 2.76 ms per loop (mean ± std. dev. of 2 runs, 2 loops each)
4.91 s ± 1.59 ms per loop (mean ± std. dev. of 2 runs, 2 loops each)
6.96 s ± 51.1 ms per loop (mean ± std. dev. of 2 runs, 2 loops each)
7.7 s ± 15 ms per loop (mean ± std. dev. of 2 runs, 2 loops each)
Driver: Intel(R) OpenCL
1.31 s ± 33.4 ms per loop (mean ± std. dev. of 2 runs, 2 loops each)
1.35 s ± 869 µs per loop (mean ± std. dev. of 2 runs, 2 loops each)
2.19 s ± 15.1 ms per loop (mean ± std. dev. of 2 runs, 2 loops each)
2.36 s ± 31.3 ms per loop (mean ± std. dev. of 2 runs, 2 loops each)
5.15 s ± 11.8 ms per loop (mean ± std. dev. of 2 runs, 2 loops each)
5.86 s ± 64.3 ms per loop (mean ± std. dev. of 2 runs, 2 loops each)

In [8]:
results_boundary_operators = df


## Potential Operators

We evaluate the potentials at 10000 evaluation points, normalised to lie on a sphere of radius 0.5. As input we use a simple constant grid function.

In [9]:
npoints = 10000
rand = np.random.RandomState(0)
points = rand.randn(3, npoints)
# Scale each point onto the sphere of radius 0.5.
points *= 0.5 / np.linalg.norm(points, axis=0)
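The scaling works column-wise: each column of the `(3, npoints)` array is one point, and dividing by the column norms (then multiplying by the target radius) places every point on the radius-0.5 sphere. A standalone sanity check of that broadcasting pattern:

```python
import numpy as np

npoints = 10000
rand = np.random.RandomState(0)
points = rand.randn(3, npoints)

# norm(..., axis=0) gives one norm per column; broadcasting scales
# every point onto the sphere of radius 0.5.
points *= 0.5 / np.linalg.norm(points, axis=0)

radii = np.linalg.norm(points, axis=0)
print(radii.min(), radii.max())  # all radii equal 0.5 up to rounding
```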


Let us define the operators and functions.

In [10]:
from bempp.api.operators import potential

k = 1.0

operators = [
    lambda precision: potential.laplace.single_layer(p1_space, points, precision=precision),
    lambda precision: potential.laplace.double_layer(p1_space, points, precision=precision),
    lambda precision: potential.helmholtz.single_layer(p1_space, points, k, precision=precision),
    lambda precision: potential.helmholtz.double_layer(p1_space, points, k, precision=precision),
    lambda precision: potential.maxwell.electric_field(rwg_space, points, k, precision=precision),
    lambda precision: potential.maxwell.magnetic_field(rwg_space, points, k, precision=precision),
]

functions = [
    bempp.api.GridFunction.from_ones(p1_space),
    bempp.api.GridFunction.from_ones(p1_space),
    bempp.api.GridFunction.from_ones(p1_space),
    bempp.api.GridFunction.from_ones(p1_space),
    bempp.api.GridFunction.from_ones(rwg_space),
    bempp.api.GridFunction.from_ones(rwg_space),
]


We apply each operator once to make sure that all kernels are compiled before timing.

In [11]:
for precision in ['single', 'double']:
    if precision == 'single':
        bempp.api.VECTORIZATION_MODE = 'vec16'
    else:
        bempp.api.VECTORIZATION_MODE = 'vec8'
    for driver_name in driver_labels:
        bempp.api.set_default_cpu_device_by_name(driver_name)
        for op, fun in zip(operators, functions):
            res = op(precision) @ fun


Let's create the data structure to store the results.

In [12]:
driver_labels = ['Portable Computing Language', 'Intel(R) OpenCL', 'NVIDIA CUDA']
precision_labels = ['single', 'double']

operator_labels = [
    "laplace single layer pot",
    "laplace double layer pot",
    "helmholtz single layer pot",
    "helmholtz double layer pot",
    "maxwell electric pot",
    "maxwell magnetic pot",
]
df = pd.DataFrame(index=operator_labels, columns=pd.MultiIndex.from_product([driver_labels, precision_labels]))


Finally, we run the actual tests.

In [13]:
bempp.api.set_default_gpu_device_by_name('NVIDIA CUDA')

for precision in ['single', 'double']:
    if precision == 'single':
        bempp.api.VECTORIZATION_MODE = 'vec16'
    else:
        bempp.api.VECTORIZATION_MODE = 'vec8'
    for driver_name in driver_labels:
        print(f"Driver: {driver_name}")
        if driver_name == 'NVIDIA CUDA':
            bempp.api.POTENTIAL_OPERATOR_DEVICE_TYPE = 'gpu'
            bempp.api.VECTORIZATION_MODE = 'novec'
        else:
            bempp.api.POTENTIAL_OPERATOR_DEVICE_TYPE = 'cpu'
            bempp.api.set_default_cpu_device_by_name(driver_name)
        for label, op, fun in zip(operator_labels, operators, functions):
            df.loc[label, (driver_name, precision)] = benchmark_potential_operator(op, fun, precision)

Driver: Portable Computing Language
91.9 ms ± 3.32 ms per loop (mean ± std. dev. of 2 runs, 2 loops each)
114 ms ± 619 µs per loop (mean ± std. dev. of 2 runs, 2 loops each)
435 ms ± 1.26 ms per loop (mean ± std. dev. of 2 runs, 2 loops each)
455 ms ± 734 µs per loop (mean ± std. dev. of 2 runs, 2 loops each)
1.03 s ± 961 µs per loop (mean ± std. dev. of 2 runs, 2 loops each)
999 ms ± 7.38 ms per loop (mean ± std. dev. of 2 runs, 2 loops each)
Driver: Intel(R) OpenCL
95.2 ms ± 4.22 ms per loop (mean ± std. dev. of 2 runs, 2 loops each)
106 ms ± 2.82 ms per loop (mean ± std. dev. of 2 runs, 2 loops each)
172 ms ± 1.16 ms per loop (mean ± std. dev. of 2 runs, 2 loops each)
198 ms ± 5.2 ms per loop (mean ± std. dev. of 2 runs, 2 loops each)
382 ms ± 5.03 ms per loop (mean ± std. dev. of 2 runs, 2 loops each)
351 ms ± 2.34 ms per loop (mean ± std. dev. of 2 runs, 2 loops each)
Driver: NVIDIA CUDA
25.9 ms ± 4.28 ms per loop (mean ± std. dev. of 2 runs, 2 loops each)
22.4 ms ± 255 µs per loop (mean ± std. dev. of 2 runs, 2 loops each)
34.9 ms ± 1.67 ms per loop (mean ± std. dev. of 2 runs, 2 loops each)
36.2 ms ± 340 µs per loop (mean ± std. dev. of 2 runs, 2 loops each)
71.5 ms ± 394 µs per loop (mean ± std. dev. of 2 runs, 2 loops each)
71 ms ± 18.2 µs per loop (mean ± std. dev. of 2 runs, 2 loops each)
Driver: Portable Computing Language
149 ms ± 28.2 ms per loop (mean ± std. dev. of 2 runs, 2 loops each)
170 ms ± 7.11 ms per loop (mean ± std. dev. of 2 runs, 2 loops each)
797 ms ± 916 µs per loop (mean ± std. dev. of 2 runs, 2 loops each)
841 ms ± 195 µs per loop (mean ± std. dev. of 2 runs, 2 loops each)
2.3 s ± 7.4 ms per loop (mean ± std. dev. of 2 runs, 2 loops each)
2.22 s ± 2.97 ms per loop (mean ± std. dev. of 2 runs, 2 loops each)
Driver: Intel(R) OpenCL
190 ms ± 2.53 ms per loop (mean ± std. dev. of 2 runs, 2 loops each)
218 ms ± 64.4 µs per loop (mean ± std. dev. of 2 runs, 2 loops each)
319 ms ± 1.77 ms per loop (mean ± std. dev. of 2 runs, 2 loops each)
358 ms ± 541 µs per loop (mean ± std. dev. of 2 runs, 2 loops each)
754 ms ± 6.19 ms per loop (mean ± std. dev. of 2 runs, 2 loops each)
655 ms ± 1.32 ms per loop (mean ± std. dev. of 2 runs, 2 loops each)
Driver: NVIDIA CUDA
449 ms ± 27.5 ms per loop (mean ± std. dev. of 2 runs, 2 loops each)
468 ms ± 45.5 µs per loop (mean ± std. dev. of 2 runs, 2 loops each)
750 ms ± 67.6 µs per loop (mean ± std. dev. of 2 runs, 2 loops each)
818 ms ± 1.53 ms per loop (mean ± std. dev. of 2 runs, 2 loops each)
1.53 s ± 64.7 µs per loop (mean ± std. dev. of 2 runs, 2 loops each)
1.45 s ± 1.42 ms per loop (mean ± std. dev. of 2 runs, 2 loops each)

In [14]:
results_potential_operators = df


## Output

In [15]:
print(results_boundary_operators)
print(results_potential_operators)

Boundary operator timings in seconds (best run; POCL = Portable Computing Language, Intel = Intel(R) OpenCL):

| operator | POCL single | POCL double | Intel single | Intel double |
|---|---|---|---|---|
| laplace single layer bnd | 0.725336 | 1.344582 | 0.617227 | 1.276647 |
| laplace double layer bnd | 0.783209 | 1.390851 | 0.658904 | 1.353705 |
| helmholtz single layer bnd | 2.760039 | 4.743433 | 1.249008 | 2.174586 |
| helmholtz double layer bnd | 2.943111 | 4.907221 | 1.296146 | 2.328560 |
| maxwell electric bnd | 3.861363 | 6.908861 | 2.567333 | 5.136648 |
| maxwell magnetic bnd | 4.338312 | 7.688866 | 2.901476 | 5.794071 |

Potential operator timings in seconds (best run; CUDA = NVIDIA CUDA):

| operator | POCL single | POCL double | Intel single | Intel double | CUDA single | CUDA double |
|---|---|---|---|---|---|---|
| laplace single layer pot | 0.088567 | 0.121174 | 0.091016 | 0.186997 | 0.021610 | 0.421118 |
| laplace double layer pot | 0.113249 | 0.162764 | 0.102928 | 0.218029 | 0.022098 | 0.467704 |
| helmholtz single layer pot | 0.433934 | 0.795617 | 0.170673 | 0.316907 | 0.033239 | 0.749970 |
| helmholtz double layer pot | 0.453831 | 0.840373 | 0.192861 | 0.357478 | 0.035876 | 0.816843 |
| maxwell electric pot | 1.029883 | 2.297465 | 0.376750 | 0.747511 | 0.071080 | 1.533835 |
| maxwell magnetic pot | 0.991923 | 2.220875 | 0.348230 | 0.653525 | 0.071030 | 1.450303 |
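The single-precision GPU advantage can be read off directly from these numbers; a small sketch computing the ratios for the Maxwell electric field potential (values copied from the benchmark output above):

```python
# Single-precision Maxwell electric field potential timings in seconds,
# copied from the benchmark output above.
timings = {
    'Portable Computing Language': 1.029883,
    'Intel(R) OpenCL': 0.376750,
    'NVIDIA CUDA': 0.071080,
}

gpu_time = timings['NVIDIA CUDA']
for driver, t in timings.items():
    # A ratio above 1 means the GPU run was that many times faster.
    print(f"{driver}: {t / gpu_time:.1f}x the GPU time")
```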