Step 1: Markup all subroutines, functions
To call a subroutine from code running on the GPU, you must compile a version for the GPU. With OpenACC, this is done with the routine
directive. Our integrator makes use of many linear algebra packages which Maestro has a local copy of. One of the first tasks was to mark these up as sequential routines. For example,
subroutine dgesl (a,lda,n,ipvt,b,job)
!$acc routine seq
!$acc routine(daxpy) seq
!$acc routine(vddot) seq
integer lda,n,ipvt(*),job
double precision a(lda,n),b(*)
!
! dgesl solves the double precision system
! a * x = b or trans(a) * x = b
! using the factors computed by dgeco or dgefa.
! ...
end subroutine
Notice that we also tell the compiler that routines daxpy
and vddot
have a GPU version. This is required because we do not access these routines through a module but instead link them directly.
Thirty routines were marked up in this way:
# %load oac_routines.txt
Microphysics/screening/screen.f90-subroutine screenz (t,d,z1,z2,a1,a2,ymass,aion,zion,nion,scfac, dscfacdt)
Microphysics/screening/screen.f90: !$acc routine seq
--
Util/LINPACK/dgesl.f- subroutine dgesl (a,lda,n,ipvt,b,job)
Util/LINPACK/dgesl.f:!$acc routine seq
Util/LINPACK/dgesl.f:!$acc routine(daxpy) seq
Util/LINPACK/dgesl.f:!$acc routine(vddot) seq
--
Util/LINPACK/dgefa.f- subroutine dgefa (a,lda,n,ipvt,info)
Util/LINPACK/dgefa.f:!$acc routine seq
Util/LINPACK/dgefa.f:!$acc routine(idamax) seq
Util/LINPACK/dgefa.f:!$acc routine(dscal) seq
Util/LINPACK/dgefa.f:!$acc routine(daxpy) seq
--
Util/BLAS/idamax.f- INTEGER FUNCTION IDAMAX(N,DX,INCX)
Util/BLAS/idamax.f:!$acc routine seq
--
Util/BLAS/vddot.f- double precision function vddot (n,dx,incx,dy,incy)
Util/BLAS/vddot.f:!$acc routine seq
--
Util/BLAS/dscal.f- SUBROUTINE DSCAL(N,DA,DX,INCX)
Util/BLAS/dscal.f:!$acc routine seq
--
Util/BLAS/daxpy.f- SUBROUTINE DAXPY(N,DA,DX,INCX,DY,INCY)
Util/BLAS/daxpy.f:!$acc routine seq
--
Util/VBDF/dev/oac_vbdf/test_react/bdf.f90- subroutine bdf_advance(ts, neq, npt, y0, t0, y1, t1, dt0, reset, reuse, ierr, initial_call)
Util/VBDF/dev/oac_vbdf/test_react/bdf.f90: !$acc routine seq
--
Util/VBDF/dev/oac_vbdf/test_react/bdf.f90- subroutine bdf_update(ts)
Util/VBDF/dev/oac_vbdf/test_react/bdf.f90: !$acc routine seq
--
Util/VBDF/dev/oac_vbdf/test_react/bdf.f90- subroutine bdf_predict(ts)
Util/VBDF/dev/oac_vbdf/test_react/bdf.f90: !$acc routine seq
--
Util/VBDF/dev/oac_vbdf/test_react/bdf.f90- subroutine bdf_solve(ts)
Util/VBDF/dev/oac_vbdf/test_react/bdf.f90: !$acc routine seq
Util/VBDF/dev/oac_vbdf/test_react/bdf.f90: !$acc routine(dgefa) seq
Util/VBDF/dev/oac_vbdf/test_react/bdf.f90: !$acc routine(dgesl) seq
--
Util/VBDF/dev/oac_vbdf/test_react/bdf.f90- subroutine bdf_check(ts, retry, err)
Util/VBDF/dev/oac_vbdf/test_react/bdf.f90: !$acc routine seq
--
Util/VBDF/dev/oac_vbdf/test_react/bdf.f90- subroutine bdf_correct(ts)
Util/VBDF/dev/oac_vbdf/test_react/bdf.f90: !$acc routine seq
--
Util/VBDF/dev/oac_vbdf/test_react/bdf.f90- subroutine bdf_adjust(ts)
Util/VBDF/dev/oac_vbdf/test_react/bdf.f90: !$acc routine seq
--
Util/VBDF/dev/oac_vbdf/test_react/bdf.f90- subroutine bdf_reset(ts, y0, dt, reuse)
Util/VBDF/dev/oac_vbdf/test_react/bdf.f90: !$acc routine seq
--
Util/VBDF/dev/oac_vbdf/test_react/bdf.f90- subroutine rescale_timestep(ts, eta_in, force)
Util/VBDF/dev/oac_vbdf/test_react/bdf.f90: !$acc routine seq
--
Util/VBDF/dev/oac_vbdf/test_react/bdf.f90- subroutine decrease_order(ts)
Util/VBDF/dev/oac_vbdf/test_react/bdf.f90: !$acc routine seq
--
Util/VBDF/dev/oac_vbdf/test_react/bdf.f90- subroutine increase_order(ts)
Util/VBDF/dev/oac_vbdf/test_react/bdf.f90: !$acc routine seq
--
Util/VBDF/dev/oac_vbdf/test_react/bdf.f90- function alpha0(k) result(a0)
Util/VBDF/dev/oac_vbdf/test_react/bdf.f90: !$acc routine seq
--
Util/VBDF/dev/oac_vbdf/test_react/bdf.f90- function alphahat0(k, h) result(a0)
Util/VBDF/dev/oac_vbdf/test_react/bdf.f90: !$acc routine seq
--
Util/VBDF/dev/oac_vbdf/test_react/bdf.f90- function xi_star_inv(k, h) result(xii)
Util/VBDF/dev/oac_vbdf/test_react/bdf.f90: !$acc routine seq
--
Util/VBDF/dev/oac_vbdf/test_react/bdf.f90- function xi_j(h, j) result(xi)
Util/VBDF/dev/oac_vbdf/test_react/bdf.f90: !$acc routine seq
--
Util/VBDF/dev/oac_vbdf/test_react/bdf.f90- subroutine ewts(ts)
Util/VBDF/dev/oac_vbdf/test_react/bdf.f90: !$acc routine seq
--
Util/VBDF/dev/oac_vbdf/test_react/bdf.f90- function norm(y, ewt) result(r)
Util/VBDF/dev/oac_vbdf/test_react/bdf.f90: !$acc routine seq
--
Util/VBDF/dev/oac_vbdf/test_react/bdf.f90- subroutine eye_r(A)
Util/VBDF/dev/oac_vbdf/test_react/bdf.f90: !$acc routine seq
--
Util/VBDF/dev/oac_vbdf/test_react/bdf.f90- recursive function factorial(n) result(r)
Util/VBDF/dev/oac_vbdf/test_react/bdf.f90: !$acc routine seq
--
Util/VBDF/dev/oac_vbdf/test_react/bdf.f90- subroutine eoshift_local(arr, sh, shifted_arr)
Util/VBDF/dev/oac_vbdf/test_react/bdf.f90: !$acc routine seq
--
Util/VBDF/dev/oac_vbdf/test_react/bdf.f90- function minloc(arr) result(ret)
Util/VBDF/dev/oac_vbdf/test_react/bdf.f90: !$acc routine seq
--
Util/VBDF/dev/oac_vbdf/ignition_simple_bdf/f_rhs.f90- subroutine f_rhs_vec(neq, npt, y, t, yd, upar)
Util/VBDF/dev/oac_vbdf/ignition_simple_bdf/f_rhs.f90: !$acc routine seq
Util/VBDF/dev/oac_vbdf/ignition_simple_bdf/f_rhs.f90: !$acc routine(screenz) seq
--
Util/VBDF/dev/oac_vbdf/ignition_simple_bdf/f_rhs.f90- subroutine jac_vec(neq, npt, y, t, pd, upar)
Util/VBDF/dev/oac_vbdf/ignition_simple_bdf/f_rhs.f90: !$acc routine seq
--
Step 2: Markup a computationally expensive loop over hydro cells
One of the balances one must attempt to strike with GPU coding is you must give it a lot of computation to keep it busy. Though they're not like conventional cores, the GPUs in Titan (Kepler K20x) have 2688 CUDA cores. To keep all of these busy, you want to get as much of your code onto the GPU as possible. Of course you want a computationally intensive portion of code, but you also must put enough of the code onto the GPU that you do not lose all performance gains to data transfer costs.
The strategy I've taken for the moment is to markup all routines as sequential and to markup a huge loop over hydro cells that calls the initial sequential routine bdf_advance
. The markup looks like this:
!$acc parallel loop gang vector present(dens, temp, eos_cp, eos_dhdX, &
!$acc Xin, Xout, rho_omegadot, rho_Hnuc, ebin, ts) &
!$acc private(ierr, y, y0, y1) reduction(+:ierr_tot)
do i = 1, npt
! ...
call bdf_advance(ts(i), NEQ, bdf_npt, y0, t0, y1, t1, &
DT0, RESET, REUSE, ierr, .true.)
! ...
end do
Here we tell the compiler we want this loop to be run in parallel on the GPU. The gang
and vector
options represent the heirarchy of parallelism we want to use on the GPU. To understand this, it's helpful to take a look at the card:
Here's a zoom-in on an SMX unit:
There is a rough correlation between the heirarchy of parallelism on the hardware and that exposed in the OpenACC standard. My understanding is it's not necessarily rigidly followed by compilers if they think they can be clever, but it's helpful for thinking about how your code is being run.
gang
parallelism is across SMX units, of which the K20x's have 14 active. For initial development, I do not recommend trying to think of what's called worker
parallelism. This is parallelism over the warp schedulers in an SMX. Rather, initially I prefer to focus on vector
parallelism, which is parallelism over the CUDA cores in an SMX if you don't use worker
parallelism and over the CUDA cores in a warp if you do. Thus, in the loop I showed above, we parallelize over the SMX's and the CUDA cores within each SMX by giving the loop directive gang
and vector
options.
Note that our directive also includes data statements present
and private
:
!$acc parallel loop gang vector present(dens, temp, eos_cp, eos_dhdX, &
!$acc Xin, Xout, rho_omegadot, rho_Hnuc, ebin, ts) &
!$acc private(ierr, y, y0, y1) reduction(+:ierr_tot)
do i = 1, npt
! ...
call bdf_advance(ts(i), NEQ, bdf_npt, y0, t0, y1, t1, &
DT0, RESET, REUSE, ierr, .true.)
! ...
end do
present
tells the compiler that the listed data is already present on the GPU, while private
tells the compiler that each thread on the GPU should have its own copy of the listed data. We also see a reduction
, which will sum up the value of ierr
from all cores and return the summed up value to the host/CPU.
Step 3: Get the data to the GPU
GPU's do not have access to the CPU/host's memory (e.g. the RAM and cache memory that all of your computer's cores have access to). Thus, we must explicitly put any data we need into the GPU's memory. This is acheived with data
directives in OpenACC.
Often in our codes, and other production science codes, we make use of data in modules. This global data can be put on the GPU with the declare
directive. Here's an example of this from my work accelerating the reactions:
module network
use bl_types
implicit none
! ...
real(kind=dp_t), save :: aion(nspec), zion(nspec), ebin(nspec)
!$acc declare create(aion(:), zion(:), ebin(:))
! ...
contains
subroutine network_init()
! ...
aion(ic12_) = 12.0_dp_t
aion(io16_) = 16.0_dp_t
aion(img24_) = 24.0_dp_t
zion(ic12_) = 6.0_dp_t
zion(io16_) = 8.0_dp_t
zion(img24_) = 12.0_dp_t
ebin(ic12_) = -7.4103097e18_dp_t ! 92.16294 MeV
ebin(io16_) = -7.6959672e18_dp_t ! 127.62093 MeV
ebin(img24_) = -7.9704080e18_dp_t ! 198.2579 MeV
!$acc update device(aion(:), zion(:), ebin(:))
! ...
end subroutine network_init
! ...
end module network
Here we had a module array that we need on the GPU. A combination of a declare create
and update device
directives makes sure this data is accessible on the GPU and contains the same data as the host. Note that this ability to use global data is relatively new, introduced in the OpenACC 2.0 standard.
A major difficulty in working with GPUs is dealing with user-defined types (similar to struct
s in C, and to a lesser extent classes in object-oriented languages). It is exceptionally good software engineering practice to make use of such types and they're plentiful in BoxLib/Maestro. However, types that include dynamic memory (in Fortran, this means types that have the alloctable
or pointer
attribute) are currently poorly handled by compiler implementations of OpenACC. There are plans in the next major update of the standard, 3.0, to incorporate deep copy functionality that should rid us of many of the these difficulties. For now, howerver, we must manually deep copy non-trivial user-defined types.
In this case, our integrator uses a timestepper type called bdf_ts
. Our loop over hydro cells includes an array of these types ts(:)
. Before getting to this loop, I deep copy each over to the GPU. It looks like this:
!$acc enter data create(ts)
do i = 1, npt
! we need the specific heat at constant pressure and dhdX |_p. Take
! T, rho, Xin as input
eos_state%rho = dens(i)
eos_state%T = temp(i)
eos_state%xn(:) = Xin(:,i)
call eos(eos_input_rt, eos_state, .false.)
eos_cp(i) = eos_state%cp
eos_dhdX(:,i) = eos_state%dhdX(:)
! Build the bdf_ts time-stepper object
call bdf_ts_build(ts(i), NEQ, bdf_npt, rtol, atol, MAX_ORDER, upar)
!Now we update all non-dynamic data members, meaning those that aren't
!allocatables, pointers, etc. They can be arrays as long as they're
!static. To make my previous note more concrete, if we did something
!like `update device(ts(i))` it would overwrite the device's pointer
!address with that of the host, according to PGI/NVIDIA consults.
!Updating individual members avoids this.
!$acc update device( &
!$acc ts(i)%neq, &
!$acc ts(i)%npt, &
!$acc ts(i)%max_order, &
!$acc ts(i)%max_steps, &
!$acc ts(i)%max_iters, &
!$acc ts(i)%verbose, &
!$acc ts(i)%dt_min, &
!$acc ts(i)%eta_min, &
!$acc ts(i)%eta_max, &
!$acc ts(i)%eta_thresh, &
!$acc ts(i)%max_j_age, &
!$acc ts(i)%max_p_age, &
!$acc ts(i)%debug, &
!$acc ts(i)%dump_unit, &
!$acc ts(i)%t, &
!$acc ts(i)%t1, &
!$acc ts(i)%dt, &
!$acc ts(i)%dt_nwt, &
!$acc ts(i)%k, &
!$acc ts(i)%n, &
!$acc ts(i)%j_age, &
!$acc ts(i)%p_age, &
!$acc ts(i)%k_age, &
!$acc ts(i)%tq, &
!$acc ts(i)%tq2save, &
!$acc ts(i)%temp_data, &
!$acc ts(i)%refactor, &
!$acc ts(i)%nfe, &
!$acc ts(i)%nje, &
!$acc ts(i)%nlu, &
!$acc ts(i)%nit, &
!$acc ts(i)%nse, &
!$acc ts(i)%ncse, &
!$acc ts(i)%ncit, &
!$acc ts(i)%ncdtmin)
!Now it's time to deal with dynamic data. At the moment, they only
!exist as pointers on the device. For PGI at least, doing a copyin on
!dynamic data serves to create, allocate, and then attach each dynamic
!data member to the corresponding pointer in ts(i).
!$acc enter data copyin( &
!$acc ts(i)%rtol, &
!$acc ts(i)%atol, &
!$acc ts(i)%J, &
!$acc ts(i)%P, &
!$acc ts(i)%z(:,:,0:), &
!$acc ts(i)%z0(:,:,0:), &
!$acc ts(i)%h(0:), &
!$acc ts(i)%l(0:), &
!$acc ts(i)%shift(0:), &
!$acc ts(i)%upar, &
!$acc ts(i)%y, &
!$acc ts(i)%yd, &
!$acc ts(i)%rhs, &
!$acc ts(i)%e, &
!$acc ts(i)%e1, &
!$acc ts(i)%ewt, &
!$acc ts(i)%b, &
!$acc ts(i)%ipvt, &
!$acc ts(i)%A(0:,0:))
end do
Here I make use of the enter data
directive, which allows you to put data on the GPU in one part of your code and close that data out in another. First, I create the array of bdf_ts
types, ts(:)
. I then build these object on the host and update the static data while doings a copyin
for the dynamic data. For PGI, this copyin
serves to 1) create, 2) allocate, and 3) attach each dynamic data member to the appropriate type on the GPU.
Later, there's an analgous loop in which I retrieve any data I might need on the host and destroy all of the data on the GPU.
We've seen how I get global data and a non-trivial derived type onto the GPU. Finally, before my loop I have a data statement that gets other needed variables onto the GPU:
!$acc data &
!$acc copyin(dens(:), temp(:), eos_cp(:), eos_dhdX(:,:), Xin(:,:)) &
!$acc copyout(Xout(:,:), rho_omegadot(:,:), rho_Hnuc(:))
I copyin
data that need only be read on the GPU and not returned to the host. I copyout
data needed on the host for subsequent computation. Note that in practice you need only do this for non-scalar data like arrays. By default, scalar data used in an OpenACC compute region will be copied over and readily available.
Hidden Step 0/4: Rewrite your code to satisfy GPU restrictions
GPUs are not CPUs. A CPU core is very sophisticated and able to handle complex code flows. GPU cores are plentiful, but dumb. What this means is that although I would love to tell you the three-step narrative I laid out above is sufficient for getting your code running on GPUs, I can't. Were you to execute this series of steps your code almost certainly wouldn't compile. So either before (step 0) or, more realistically, after (step 4) you must modify your code so that any portion that needs to run on the GPU doesn't violate GPU restrictions. Briefly, the primary restrictions I had to rewrite to avoid are:
No allocate
statements on the GPU
This includes implicit ones! In practice, this means you should avoid statements that implicitly require a temporary variable which the compiler will be allocating behind the scenes. So, no this:
ts%l = ts%l + eoshift_local(ts%l, -1) / xi_j(ts%h, j)
Generally, I've found that using Fortran's concise array notation on GPUs leads to errors. I replace all instances of it with explicit loops.
subroutines or functions cannot be passed as arguments to GPU routines
Many intrinsic functions may not be available on the GPU (e.g. sum
, eoshift
, max
, etc). Some are, some aren't. If code is giving you trouble and it contains an intrinsic, that's likely the problem.
No I/O! You can't print *, 'Hey, I am on a GPU'