DaCe is a Python library that enables optimizing code with ease, from running on a single core to a full supercomputer. With the power of data-centric transformations, it can automatically map code for CPUs, GPUs, and FPGAs.
Let's get started with DaCe by importing it:
import dace
A data-centric program can be generated from several general-purpose and domain-specific languages. Our main frontend, however, is Python/numpy. To define a program, we take an existing function on numpy arrays and decorate it with @dace.program:
@dace.program
def getstarted(A):
    return A + A
Running our dace program, we will see several outputs and a prompt. These are the available transformations we can apply. For the first step, we opt to apply none (press Enter) and proceed with compilation and running:
import numpy as np
a = np.random.rand(2, 3)
a
array([[0.74867876, 0.85403223, 0.16573784], [0.71994615, 0.29855314, 0.21483992]])
getstarted(a)
array([[1.49735752, 1.70806445, 0.33147568], [1.4398923 , 0.59710627, 0.42967985]])
The results are, as expected, 2*A.
Now, let's inspect the intermediate representation of the data-centric program, its Stateful Dataflow Multigraph (SDFG):
getstarted.to_sdfg(a)
You can drag the handle at the bottom right to make the SDFG frame larger.
Notice the following four elements in the graph:
- States: the control-flow part of the SDFG. In the case of A+A, we see only one state encompassing the computation.
- Arrays and memlets: the array nodes are data containers (here, numpy ndarrays), and the edges represent data that is moved throughout the state. Hovering over a memlet will show more information about the subset being moved.
- Tasklets: the nodes that contain the computation itself (here, the addition).
- Maps: scopes whose contents are executed in parallel over a range (here, 2*3 times). This creates parametric parallelism in the graph, and maps can be nested in each other for efficient parallelization and distribution of work.
Unfortunately (or fortunately in some cases), this graph is specialized for a specific size of array (as given to it), and will not work on other sizes. To compile a program that works with general sizes, we'll need to use symbolic sizes.
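As a plain-Python picture of the two points above (the 2*3 parallel iterations and the specialization to one array size), consider the following sketch. It is not what DaCe generates: add_self_2x3, ROWS, and COLS are hypothetical names used only for illustration, and nested lists stand in for numpy arrays.

```python
from itertools import product

# Sizes are hard-coded, mirroring an SDFG specialized for a 2x3 input.
ROWS, COLS = 2, 3


def add_self_2x3(a):
    """Compute a + a for a 2x3 nested list, one 'tasklet' call per index."""
    if len(a) != ROWS or any(len(row) != COLS for row in a):
        raise ValueError(f"expected a {ROWS}x{COLS} input")
    out = [[0.0] * COLS for _ in range(ROWS)]
    # The map: 2*3 independent iterations, so they could run in any order
    # or in parallel. The loop body plays the role of the tasklet.
    for i, j in product(range(ROWS), range(COLS)):
        out[i][j] = a[i][j] + a[i][j]
    return out


result = add_self_2x3([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
print(result)
```

Passing an input of any other shape raises a ValueError, which is the plain-Python analogue of the specialized graph not working on other sizes.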
DaCe includes a symbolic math engine (extending SymPy) to support symbolic expressions for sizes, ranges, accesses, and more.
Any number of symbols can be used throughout a computation. Defining a symbol is as easy as calling:
N = dace.symbol('N')
which we can now use in any computation or definition. For example, annotating the types of our function from above will yield a version that works with any size:
@dace.program
def getstarted_sym(A: dace.float64[N, 2*N]):
    return A + A
getstarted_sym.to_sdfg()
If we compile this code, any array that can match a size of Nx2N will be automatically used to infer the value of N and invoke the function:
getstarted_sym(np.random.rand(100, 200))
array([[1.63216549, 1.26522381, 0.21606686, ..., 0.56988572, 1.12572538, 1.72701877], [0.3829452 , 1.52386969, 0.82165197, ..., 1.3105662 , 1.19336786, 1.43671993], [1.55277426, 1.50918516, 1.30665626, ..., 1.06562809, 1.53069088, 1.10071159], ..., [0.60629736, 1.73240929, 1.26797782, ..., 1.72034476, 1.56691557, 0.22283613], [1.96245486, 1.60559508, 0.02009914, ..., 1.40944583, 1.44560312, 0.37804927], [1.17875002, 0.96963921, 0.28278902, ..., 1.56747976, 0.4616313 , 0.94999278]])
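The inference step above can be pictured with a small plain-Python sketch. This is only an illustration under stated assumptions: infer_n is a hypothetical helper, not part of DaCe, and it handles just the (N, 2*N) annotation used here, whereas DaCe's own mechanism is more general.

```python
def infer_n(shape):
    """Return N such that shape == (N, 2*N), or raise if no such N exists."""
    if len(shape) != 2:
        raise ValueError("expected a 2-dimensional shape")
    n, cols = shape
    if cols != 2 * n:
        raise ValueError(f"shape {shape} does not match (N, 2*N)")
    return n


# The shape used in the call above.
inferred = infer_n((100, 200))
print(inferred)
```

A shape such as (2, 3) does not fit the annotation, so the sketch raises a ValueError instead of guessing a value for N.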
Given our symbolic SDFG, we would not like to recompile it every time. Thus, we can pre-compile the graph into an .so/.dll file:
csdfg = getstarted_sym.compile()
A compiled SDFG, however, has to be invoked like an SDFG, with keyword arguments only:
b = csdfg(A=np.random.rand(10,20), N=np.int32(10))
We can now see the performance of the code on large arrays vs. numpy:
tester = np.random.rand(2000, 4000)
%timeit tester + tester
12 ms ± 143 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit csdfg(A=tester, N=np.int32(2000))
3.86 ms ± 271 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
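The %timeit magic is only available in IPython and Jupyter. Outside of a notebook, a comparable measurement can be sketched with the standard timeit module. The helper best_ms is a hypothetical name, and it reports the best run rather than the mean and standard deviation that %timeit prints; the workload is a stand-in so that the sketch runs without numpy or DaCe installed.

```python
import timeit


def best_ms(fn, repeat=7, number=100):
    """Return the best per-call time of fn() in milliseconds."""
    # timeit.repeat returns one total time (in seconds) per repetition,
    # each covering `number` calls of fn.
    totals = timeit.repeat(fn, repeat=repeat, number=number)
    return min(totals) / number * 1e3


# Stand-in workload; replace the lambda with the call to be measured.
data = list(range(10_000))
elapsed = best_ms(lambda: sum(data), repeat=3, number=10)
print(f"{elapsed:.4f} ms per call")
```

With the objects defined above, the same helper could be called as best_ms(lambda: tester + tester) and best_ms(lambda: csdfg(A=tester, N=np.int32(2000))).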
One can specify explicit dataflow in dace using the for i in dace.map[begin:end]: syntax, as well as define tasklets manually using with dace.tasklet:. Here is a real-world example (Scattering Self-Energies) with an 8-dimensional parallel computation:
# Declaration of symbolic variables
Nkz, NE, Nqz, Nw, N3D, NA, NB, Norb = (
    dace.symbol(name)
    for name in ['Nkz', 'NE', 'Nqz', 'Nw', 'N3D', 'NA', 'NB', 'Norb'])

@dace.program
def sse_sigma(neigh_idx: dace.int32[NA, NB],
              dH: dace.complex128[NA, NB, N3D, Norb, Norb],
              G: dace.complex128[Nkz, NE, NA, Norb, Norb],
              D: dace.complex128[Nqz, Nw, NA, NB, N3D, N3D],
              Sigma: dace.complex128[Nkz, NE, NA, Norb, Norb]):
    # Declaration of Map scope
    for k, E, q, w, i, j, a, b in dace.map[0:Nkz, 0:NE, 0:Nqz, 0:Nw,
                                           0:N3D, 0:N3D, 0:NA, 0:NB]:
        dHG = G[k - q, E - w, neigh_idx[a, b]] @ dH[a, b, i]
        dHD = dH[a, b, j] * D[q, w, a, b, i, j]
        Sigma[k, E, a] += dHG @ dHD
sse_sigma.to_sdfg()
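To get a feel for the size of this 8-dimensional iteration space, the following plain-Python sketch enumerates the same index tuples with itertools.product and counts how many iterations accumulate into each Sigma[k, E, a] entry. The symbol values are small assumed stand-ins chosen for illustration, and the sketch only counts iterations; it does not perform the matrix products.

```python
from itertools import product
from math import prod

# Small assumed values for the symbols; real runs would use larger sizes.
sizes = dict(Nkz=2, NE=2, Nqz=2, Nw=2, N3D=3, NA=2, NB=2)

# Same order as the map indices: k, E, q, w, i, j, a, b.
ranges = [range(sizes[s]) for s in
          ("Nkz", "NE", "Nqz", "Nw", "N3D", "N3D", "NA", "NB")]

# Count how many iterations accumulate into each Sigma[k, E, a] entry.
contributions = {}
for k, E, q, w, i, j, a, b in product(*ranges):
    contributions[k, E, a] = contributions.get((k, E, a), 0) + 1

total = sum(contributions.values())
expected_total = prod(len(r) for r in ranges)
print(total, expected_total, len(contributions))
```

Every Sigma[k, E, a] entry receives Nqz * Nw * N3D * N3D * NB contributions, so the += in the map body accumulates results from many iterations into the same output location.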