This is a test notebook to benchmark Bempp performance on different devices. The figures reported here were obtained under Ubuntu 20.04 Linux on an Intel Core i9-9980HK 8-core CPU with a base clock of 2.4 GHz and a maximum turbo frequency of 5 GHz. The GPU device is an NVIDIA Quadro RTX 3000 with 6 GB RAM.
As OpenCL CPU drivers we test both POCL (version 1.5) and the Intel OpenCL CPU driver, both with default vectorization options.
We are benchmarking the following operator types.
Boundary operators:
- Laplace single-layer and double-layer
- Helmholtz single-layer and double-layer
- Maxwell electric field and magnetic field
Domain potential operators:
- Laplace single-layer and double-layer
- Helmholtz single-layer and double-layer
- Maxwell electric field and magnetic field
We test all operators in single and double precision. On the GPU we only perform single-precision tests, since the device is significantly slower in double precision.
As the mesh we use a uniform sphere with 8192 elements. As wavenumber in the Helmholtz and Maxwell tests we use the value $k=1.0$; its value has no effect on the performance. As scalar function spaces we use spaces of continuous $P1$ functions. For Maxwell we use RWG functions of order 0.
In this section we define the general objects that we need in all benchmarks.
import bempp.api
import numpy as np
import pandas as pd
grid = bempp.api.shapes.regular_sphere(5)
p1_space = bempp.api.function_space(grid, "P", 1)
rwg_space = bempp.api.function_space(grid, "RWG", 0)
snc_space = bempp.api.function_space(grid, "SNC", 0)
def benchmark_boundary_operator(operator, precision):
    """Benchmark an operator with the given precision."""
    result = %timeit -o -r 2 -n 2 operator(precision).weak_form()
    return result.best

def benchmark_potential_operator(operator, fun, precision):
    """Benchmark an operator with the given precision."""
    result = %timeit -o -r 2 -n 2 operator(precision) @ fun
    return result.best
We first define the boundary operators that we want to test. We make them dependent only on a precision argument.
from bempp.api.operators import boundary
k = 1.0
operators = [
    lambda precision: boundary.laplace.single_layer(p1_space, p1_space, p1_space, precision=precision),
    lambda precision: boundary.laplace.double_layer(p1_space, p1_space, p1_space, precision=precision),
    lambda precision: boundary.helmholtz.single_layer(p1_space, p1_space, p1_space, k, precision=precision),
    lambda precision: boundary.helmholtz.double_layer(p1_space, p1_space, p1_space, k, precision=precision),
    lambda precision: boundary.maxwell.electric_field(rwg_space, rwg_space, snc_space, k, precision=precision),
    lambda precision: boundary.maxwell.magnetic_field(rwg_space, rwg_space, snc_space, k, precision=precision),
]
We set up a Pandas data frame to conveniently store the different timings.
driver_labels = ['Portable Computing Language', 'Intel(R) OpenCL']
precision_labels = ['single', 'double']
operator_labels = [
    "laplace single layer bnd",
    "laplace double layer bnd",
    "helmholtz single layer bnd",
    "helmholtz double layer bnd",
    "maxwell electric bnd",
    "maxwell magnetic bnd",
]
df = pd.DataFrame(index=operator_labels, columns=pd.MultiIndex.from_product([driver_labels, precision_labels]))
We assemble each operator once to make sure that all Numba functions are compiled.
for precision in ['single', 'double']:
    if precision == 'single':
        bempp.api.VECTORIZATION_MODE = 'vec16'
    else:
        bempp.api.VECTORIZATION_MODE = 'vec8'
    for driver_name in driver_labels:
        bempp.api.set_default_cpu_device_by_name(driver_name)
        for op in operators:
            op(precision).weak_form()
Now let's run the actual benchmarks.
for precision in ['single', 'double']:
    if precision == 'single':
        bempp.api.VECTORIZATION_MODE = 'vec16'
    else:
        bempp.api.VECTORIZATION_MODE = 'vec8'
    for driver_name in driver_labels:
        print(f"Driver: {driver_name}")
        bempp.api.set_default_cpu_device_by_name(driver_name)
        for label, op in zip(operator_labels, operators):
            df.loc[label, (driver_name, precision)] = benchmark_boundary_operator(op, precision)
Driver: Portable Computing Language
748 ms ± 23 ms per loop (mean ± std. dev. of 2 runs, 2 loops each)
784 ms ± 615 µs per loop (mean ± std. dev. of 2 runs, 2 loops each)
2.8 s ± 36.2 ms per loop (mean ± std. dev. of 2 runs, 2 loops each)
2.96 s ± 20.2 ms per loop (mean ± std. dev. of 2 runs, 2 loops each)
3.87 s ± 6 ms per loop (mean ± std. dev. of 2 runs, 2 loops each)
4.37 s ± 30.5 ms per loop (mean ± std. dev. of 2 runs, 2 loops each)
Driver: Intel(R) OpenCL
619 ms ± 2.25 ms per loop (mean ± std. dev. of 2 runs, 2 loops each)
660 ms ± 1.56 ms per loop (mean ± std. dev. of 2 runs, 2 loops each)
1.26 s ± 7.74 ms per loop (mean ± std. dev. of 2 runs, 2 loops each)
1.3 s ± 3.02 ms per loop (mean ± std. dev. of 2 runs, 2 loops each)
2.59 s ± 25.2 ms per loop (mean ± std. dev. of 2 runs, 2 loops each)
2.95 s ± 47.6 ms per loop (mean ± std. dev. of 2 runs, 2 loops each)
Driver: Portable Computing Language
1.35 s ± 8.83 ms per loop (mean ± std. dev. of 2 runs, 2 loops each)
1.39 s ± 1.76 ms per loop (mean ± std. dev. of 2 runs, 2 loops each)
4.75 s ± 2.76 ms per loop (mean ± std. dev. of 2 runs, 2 loops each)
4.91 s ± 1.59 ms per loop (mean ± std. dev. of 2 runs, 2 loops each)
6.96 s ± 51.1 ms per loop (mean ± std. dev. of 2 runs, 2 loops each)
7.7 s ± 15 ms per loop (mean ± std. dev. of 2 runs, 2 loops each)
Driver: Intel(R) OpenCL
1.31 s ± 33.4 ms per loop (mean ± std. dev. of 2 runs, 2 loops each)
1.35 s ± 869 µs per loop (mean ± std. dev. of 2 runs, 2 loops each)
2.19 s ± 15.1 ms per loop (mean ± std. dev. of 2 runs, 2 loops each)
2.36 s ± 31.3 ms per loop (mean ± std. dev. of 2 runs, 2 loops each)
5.15 s ± 11.8 ms per loop (mean ± std. dev. of 2 runs, 2 loops each)
5.86 s ± 64.3 ms per loop (mean ± std. dev. of 2 runs, 2 loops each)
results_boundary_operators = df
We are going to evaluate the potentials at 10000 evaluation points. The points are normalised to lie on a sphere with radius 0.5. As evaluation function we use a simple constant function.
npoints = 10000
rand = np.random.RandomState(0)
points = rand.randn(3, npoints)
points *= 0.5 / np.linalg.norm(points, axis=0)  # scale onto the sphere of radius 0.5
Let us define the operators and functions.
from bempp.api.operators import potential
k = 1.0
operators = [
    lambda precision: potential.laplace.single_layer(p1_space, points, precision=precision),
    lambda precision: potential.laplace.double_layer(p1_space, points, precision=precision),
    lambda precision: potential.helmholtz.single_layer(p1_space, points, k, precision=precision),
    lambda precision: potential.helmholtz.double_layer(p1_space, points, k, precision=precision),
    lambda precision: potential.maxwell.electric_field(rwg_space, points, k, precision=precision),
    lambda precision: potential.maxwell.magnetic_field(rwg_space, points, k, precision=precision),
]
functions = [
    bempp.api.GridFunction.from_ones(p1_space),
    bempp.api.GridFunction.from_ones(p1_space),
    bempp.api.GridFunction.from_ones(p1_space),
    bempp.api.GridFunction.from_ones(p1_space),
    bempp.api.GridFunction.from_ones(rwg_space),
    bempp.api.GridFunction.from_ones(rwg_space),
]
We assemble each operator once to compile all functions.
for precision in ['single', 'double']:
    if precision == 'single':
        bempp.api.VECTORIZATION_MODE = 'vec16'
    else:
        bempp.api.VECTORIZATION_MODE = 'vec8'
    for driver_name in driver_labels:
        bempp.api.set_default_cpu_device_by_name(driver_name)
        for op, fun in zip(operators, functions):
            res = op(precision) @ fun
Let's create the data structure to store the results.
driver_labels = ['Portable Computing Language', 'Intel(R) OpenCL', 'NVIDIA CUDA']
precision_labels = ['single', 'double']
operator_labels = [
    "laplace single layer pot",
    "laplace double layer pot",
    "helmholtz single layer pot",
    "helmholtz double layer pot",
    "maxwell electric pot",
    "maxwell magnetic pot",
]
df = pd.DataFrame(index=operator_labels, columns=pd.MultiIndex.from_product([driver_labels, precision_labels]))
Finally, we run the actual tests.
bempp.api.set_default_gpu_device_by_name('NVIDIA CUDA')
for precision in ['single', 'double']:
    if precision == 'single':
        bempp.api.VECTORIZATION_MODE = 'vec16'
    else:
        bempp.api.VECTORIZATION_MODE = 'vec8'
    for driver_name in driver_labels:
        print(f"Driver: {driver_name}")
        if driver_name == 'NVIDIA CUDA':
            bempp.api.POTENTIAL_OPERATOR_DEVICE_TYPE = 'gpu'
            bempp.api.VECTORIZATION_MODE = 'novec'
        else:
            bempp.api.POTENTIAL_OPERATOR_DEVICE_TYPE = 'cpu'
            bempp.api.set_default_cpu_device_by_name(driver_name)
        for label, op, fun in zip(operator_labels, operators, functions):
            df.loc[label, (driver_name, precision)] = benchmark_potential_operator(op, fun, precision)
Driver: Portable Computing Language
91.9 ms ± 3.32 ms per loop (mean ± std. dev. of 2 runs, 2 loops each)
114 ms ± 619 µs per loop (mean ± std. dev. of 2 runs, 2 loops each)
435 ms ± 1.26 ms per loop (mean ± std. dev. of 2 runs, 2 loops each)
455 ms ± 734 µs per loop (mean ± std. dev. of 2 runs, 2 loops each)
1.03 s ± 961 µs per loop (mean ± std. dev. of 2 runs, 2 loops each)
999 ms ± 7.38 ms per loop (mean ± std. dev. of 2 runs, 2 loops each)
Driver: Intel(R) OpenCL
95.2 ms ± 4.22 ms per loop (mean ± std. dev. of 2 runs, 2 loops each)
106 ms ± 2.82 ms per loop (mean ± std. dev. of 2 runs, 2 loops each)
172 ms ± 1.16 ms per loop (mean ± std. dev. of 2 runs, 2 loops each)
198 ms ± 5.2 ms per loop (mean ± std. dev. of 2 runs, 2 loops each)
382 ms ± 5.03 ms per loop (mean ± std. dev. of 2 runs, 2 loops each)
351 ms ± 2.34 ms per loop (mean ± std. dev. of 2 runs, 2 loops each)
Driver: NVIDIA CUDA
25.9 ms ± 4.28 ms per loop (mean ± std. dev. of 2 runs, 2 loops each)
22.4 ms ± 255 µs per loop (mean ± std. dev. of 2 runs, 2 loops each)
34.9 ms ± 1.67 ms per loop (mean ± std. dev. of 2 runs, 2 loops each)
36.2 ms ± 340 µs per loop (mean ± std. dev. of 2 runs, 2 loops each)
71.5 ms ± 394 µs per loop (mean ± std. dev. of 2 runs, 2 loops each)
71 ms ± 18.2 µs per loop (mean ± std. dev. of 2 runs, 2 loops each)
Driver: Portable Computing Language
149 ms ± 28.2 ms per loop (mean ± std. dev. of 2 runs, 2 loops each)
170 ms ± 7.11 ms per loop (mean ± std. dev. of 2 runs, 2 loops each)
797 ms ± 916 µs per loop (mean ± std. dev. of 2 runs, 2 loops each)
841 ms ± 195 µs per loop (mean ± std. dev. of 2 runs, 2 loops each)
2.3 s ± 7.4 ms per loop (mean ± std. dev. of 2 runs, 2 loops each)
2.22 s ± 2.97 ms per loop (mean ± std. dev. of 2 runs, 2 loops each)
Driver: Intel(R) OpenCL
190 ms ± 2.53 ms per loop (mean ± std. dev. of 2 runs, 2 loops each)
218 ms ± 64.4 µs per loop (mean ± std. dev. of 2 runs, 2 loops each)
319 ms ± 1.77 ms per loop (mean ± std. dev. of 2 runs, 2 loops each)
358 ms ± 541 µs per loop (mean ± std. dev. of 2 runs, 2 loops each)
754 ms ± 6.19 ms per loop (mean ± std. dev. of 2 runs, 2 loops each)
655 ms ± 1.32 ms per loop (mean ± std. dev. of 2 runs, 2 loops each)
Driver: NVIDIA CUDA
449 ms ± 27.5 ms per loop (mean ± std. dev. of 2 runs, 2 loops each)
468 ms ± 45.5 µs per loop (mean ± std. dev. of 2 runs, 2 loops each)
750 ms ± 67.6 µs per loop (mean ± std. dev. of 2 runs, 2 loops each)
818 ms ± 1.53 ms per loop (mean ± std. dev. of 2 runs, 2 loops each)
1.53 s ± 64.7 µs per loop (mean ± std. dev. of 2 runs, 2 loops each)
1.45 s ± 1.42 ms per loop (mean ± std. dev. of 2 runs, 2 loops each)
results_potential_operators = df
print(results_boundary_operators)
print(results_potential_operators)
Boundary operators (times in seconds):

                            Portable Computing Language       Intel(R) OpenCL
                                 single    double             single    double
laplace single layer bnd       0.725336  1.344582           0.617227  1.276647
laplace double layer bnd       0.783209  1.390851           0.658904  1.353705
helmholtz single layer bnd     2.760039  4.743433           1.249008  2.174586
helmholtz double layer bnd     2.943111  4.907221           1.296146  2.328560
maxwell electric bnd           3.861363  6.908861           2.567333  5.136648
maxwell magnetic bnd           4.338312  7.688866           2.901476  5.794071

Potential operators (times in seconds):

                            Portable Computing Language       Intel(R) OpenCL           NVIDIA CUDA
                                 single    double             single    double        single    double
laplace single layer pot       0.088567  0.121174           0.091016  0.186997      0.021610  0.421118
laplace double layer pot       0.113249  0.162764           0.102928  0.218029      0.022098  0.467704
helmholtz single layer pot     0.433934  0.795617           0.170673  0.316907      0.033239  0.749970
helmholtz double layer pot     0.453831  0.840373           0.192861  0.357478      0.035876  0.816843
maxwell electric pot           1.029883  2.297465           0.376750  0.747511      0.071080  1.533835
maxwell magnetic pot           0.991923  2.220875           0.348230  0.653525      0.071030  1.450303