#!/usr/bin/env python # coding: utf-8 # # Accelerating Maestro's Reactions with OpenACC # **Step 1: Markup all subroutines, functions** # # To call a subroutine from code running on the GPU, you must compile a version for the GPU. With OpenACC, this is done with the `routine` directive. Our integrator makes use of many linear algebra packages which Maestro has a local copy of. One of the first tasks was to mark these up as sequential routines. For example, # # ```fortran # subroutine dgesl (a,lda,n,ipvt,b,job) # !$acc routine seq # !$acc routine(daxpy) seq # !$acc routine(vddot) seq # integer lda,n,ipvt(*),job # double precision a(lda,n),b(*) # ! # ! dgesl solves the double precision system # ! a * x = b or trans(a) * x = b # ! using the factors computed by dgeco or dgefa. # ! ... # end subroutine # ``` # # Notice that we also tell the compiler that routines `daxpy` and `vddot` have a GPU version. This is required because we do not access these routines through a module but instead link them directly. # # Thirty routines were marked up in this way: # In[ ]: # %load oac_routines.txt Microphysics/screening/screen.f90-subroutine screenz (t,d,z1,z2,a1,a2,ymass,aion,zion,nion,scfac, dscfacdt) Microphysics/screening/screen.f90: !$acc routine seq -- Util/LINPACK/dgesl.f- subroutine dgesl (a,lda,n,ipvt,b,job) Util/LINPACK/dgesl.f:!$acc routine seq Util/LINPACK/dgesl.f:!$acc routine(daxpy) seq Util/LINPACK/dgesl.f:!$acc routine(vddot) seq -- Util/LINPACK/dgefa.f- subroutine dgefa (a,lda,n,ipvt,info) Util/LINPACK/dgefa.f:!$acc routine seq Util/LINPACK/dgefa.f:!$acc routine(idamax) seq Util/LINPACK/dgefa.f:!$acc routine(dscal) seq Util/LINPACK/dgefa.f:!$acc routine(daxpy) seq -- Util/BLAS/idamax.f- INTEGER FUNCTION IDAMAX(N,DX,INCX) Util/BLAS/idamax.f:!$acc routine seq -- Util/BLAS/vddot.f- double precision function vddot (n,dx,incx,dy,incy) Util/BLAS/vddot.f:!$acc routine seq -- Util/BLAS/dscal.f- SUBROUTINE DSCAL(N,DA,DX,INCX) Util/BLAS/dscal.f:!$acc routine seq -- Util/BLAS/daxpy.f- SUBROUTINE DAXPY(N,DA,DX,INCX,DY,INCY) Util/BLAS/daxpy.f:!$acc routine seq -- Util/VBDF/dev/oac_vbdf/test_react/bdf.f90- subroutine bdf_advance(ts, neq, npt, y0, t0, y1, t1, dt0, reset, reuse, ierr, initial_call) Util/VBDF/dev/oac_vbdf/test_react/bdf.f90: !$acc routine seq -- Util/VBDF/dev/oac_vbdf/test_react/bdf.f90- subroutine bdf_update(ts) Util/VBDF/dev/oac_vbdf/test_react/bdf.f90: !$acc routine seq -- Util/VBDF/dev/oac_vbdf/test_react/bdf.f90- subroutine bdf_predict(ts) Util/VBDF/dev/oac_vbdf/test_react/bdf.f90: !$acc routine seq -- Util/VBDF/dev/oac_vbdf/test_react/bdf.f90- subroutine bdf_solve(ts) Util/VBDF/dev/oac_vbdf/test_react/bdf.f90: !$acc routine seq Util/VBDF/dev/oac_vbdf/test_react/bdf.f90: !$acc routine(dgefa) seq Util/VBDF/dev/oac_vbdf/test_react/bdf.f90: !$acc routine(dgesl) seq -- Util/VBDF/dev/oac_vbdf/test_react/bdf.f90- subroutine bdf_check(ts, retry, err) Util/VBDF/dev/oac_vbdf/test_react/bdf.f90: !$acc routine seq -- Util/VBDF/dev/oac_vbdf/test_react/bdf.f90- subroutine bdf_correct(ts) Util/VBDF/dev/oac_vbdf/test_react/bdf.f90: !$acc routine seq -- Util/VBDF/dev/oac_vbdf/test_react/bdf.f90- subroutine bdf_adjust(ts) Util/VBDF/dev/oac_vbdf/test_react/bdf.f90: !$acc routine seq -- Util/VBDF/dev/oac_vbdf/test_react/bdf.f90- subroutine bdf_reset(ts, y0, dt, reuse) Util/VBDF/dev/oac_vbdf/test_react/bdf.f90: !$acc routine seq -- Util/VBDF/dev/oac_vbdf/test_react/bdf.f90- subroutine rescale_timestep(ts, eta_in, force) Util/VBDF/dev/oac_vbdf/test_react/bdf.f90: !$acc routine seq -- 
Util/VBDF/dev/oac_vbdf/test_react/bdf.f90- subroutine decrease_order(ts)
Util/VBDF/dev/oac_vbdf/test_react/bdf.f90: !$acc routine seq
--
Util/VBDF/dev/oac_vbdf/test_react/bdf.f90- subroutine increase_order(ts)
Util/VBDF/dev/oac_vbdf/test_react/bdf.f90: !$acc routine seq
--
Util/VBDF/dev/oac_vbdf/test_react/bdf.f90- function alpha0(k) result(a0)
Util/VBDF/dev/oac_vbdf/test_react/bdf.f90: !$acc routine seq
--
Util/VBDF/dev/oac_vbdf/test_react/bdf.f90- function alphahat0(k, h) result(a0)
Util/VBDF/dev/oac_vbdf/test_react/bdf.f90: !$acc routine seq
--
Util/VBDF/dev/oac_vbdf/test_react/bdf.f90- function xi_star_inv(k, h) result(xii)
Util/VBDF/dev/oac_vbdf/test_react/bdf.f90: !$acc routine seq
--
Util/VBDF/dev/oac_vbdf/test_react/bdf.f90- function xi_j(h, j) result(xi)
Util/VBDF/dev/oac_vbdf/test_react/bdf.f90: !$acc routine seq
--
Util/VBDF/dev/oac_vbdf/test_react/bdf.f90- subroutine ewts(ts)
Util/VBDF/dev/oac_vbdf/test_react/bdf.f90: !$acc routine seq
--
Util/VBDF/dev/oac_vbdf/test_react/bdf.f90- function norm(y, ewt) result(r)
Util/VBDF/dev/oac_vbdf/test_react/bdf.f90: !$acc routine seq
--
Util/VBDF/dev/oac_vbdf/test_react/bdf.f90- subroutine eye_r(A)
Util/VBDF/dev/oac_vbdf/test_react/bdf.f90: !$acc routine seq
--
Util/VBDF/dev/oac_vbdf/test_react/bdf.f90- recursive function factorial(n) result(r)
Util/VBDF/dev/oac_vbdf/test_react/bdf.f90: !$acc routine seq
--
Util/VBDF/dev/oac_vbdf/test_react/bdf.f90- subroutine eoshift_local(arr, sh, shifted_arr)
Util/VBDF/dev/oac_vbdf/test_react/bdf.f90: !$acc routine seq
--
Util/VBDF/dev/oac_vbdf/test_react/bdf.f90- function minloc(arr) result(ret)
Util/VBDF/dev/oac_vbdf/test_react/bdf.f90: !$acc routine seq
--
Util/VBDF/dev/oac_vbdf/ignition_simple_bdf/f_rhs.f90- subroutine f_rhs_vec(neq, npt, y, t, yd, upar)
Util/VBDF/dev/oac_vbdf/ignition_simple_bdf/f_rhs.f90: !$acc routine seq
Util/VBDF/dev/oac_vbdf/ignition_simple_bdf/f_rhs.f90: !$acc routine(screenz) seq
--
Util/VBDF/dev/oac_vbdf/ignition_simple_bdf/f_rhs.f90- subroutine jac_vec(neq, npt, y, t, pd, upar)
Util/VBDF/dev/oac_vbdf/ignition_simple_bdf/f_rhs.f90: !$acc routine seq
--

# **Step 2: Markup a computationally expensive loop over hydro cells**
# 
# One of the balances you must strike with GPU coding is giving the device *a lot* of computation to keep it busy. Though they're not like conventional CPU cores, the GPUs in Titan (Kepler K20x) have 2688 CUDA cores. To keep all of these busy, you want to get as much of your code onto the GPU as possible. Of course you want a computationally intensive portion of code, but you must also put enough of the code onto the GPU that you do not lose all of your performance gains to data transfer costs.
# 
# The strategy I've taken for the moment is to mark up all routines as sequential and to mark up a huge loop over hydro cells that calls the initial sequential routine `bdf_advance`. The markup looks like this:
# 
# ```fortran
# !$acc parallel loop gang vector present(dens, temp, eos_cp, eos_dhdX, &
# !$acc Xin, Xout, rho_omegadot, rho_Hnuc, ebin, ts) &
# !$acc private(ierr, y, y0, y1) reduction(+:ierr_tot)
# do i = 1, npt
#    ! ...
#    call bdf_advance(ts(i), NEQ, bdf_npt, y0, t0, y1, t1, &
#                     DT0, RESET, REUSE, ierr, .true.)
#    ! ...
# end do
# ```
# 
# Here we tell the compiler we want this loop to be run in parallel on the GPU. The `gang` and `vector` options represent the hierarchy of parallelism we want to use on the GPU.
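# For reference, OpenACC also lets you size these levels of parallelism explicitly with the `num_gangs`, `num_workers`, and `vector_length` clauses. Below is a hypothetical variant of the loop above; the specific values (14 gangs, vector length 128) are illustrative assumptions, not tuned settings from the actual Maestro runs.
# 
# ```fortran
# ! Illustrative only: explicitly sizing the gang/vector hierarchy.
# ! The clause values here are assumptions, not settings from the real code.
# !$acc parallel loop gang vector num_gangs(14) vector_length(128) &
# !$acc present(dens, temp, eos_cp, eos_dhdX, &
# !$acc Xin, Xout, rho_omegadot, rho_Hnuc, ebin, ts) &
# !$acc private(ierr, y, y0, y1) reduction(+:ierr_tot)
# do i = 1, npt
#    ! ... same loop body as above ...
# end do
# ```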
# To understand this hierarchy, it's helpful to take a look at the card:
# 
# ![title](GK110-Tesla-K20-Block-Diagram-635x449.jpg)
# 
# Here's a zoom-in on an SMX unit:
# 
# ![title](GK110-SMX-Unit-635x712.jpg)
# 
# There is a rough correlation between the hierarchy of parallelism on the hardware and that exposed in the OpenACC standard. My understanding is that compilers don't necessarily follow it rigidly if they think they can be clever, but it's helpful for thinking about how your code is being run.
# 
# `gang` parallelism is across SMX units, of which the K20x has 14 active. For initial development, I do not recommend trying to think about what's called `worker` parallelism, which is parallelism over the warp schedulers in an SMX. Rather, I initially prefer to focus on `vector` parallelism, which is parallelism over the CUDA cores in an SMX if you *don't* use `worker` parallelism and over the CUDA cores in a warp if you do. Thus, in the loop I showed above, we parallelize over the SMXs and the CUDA cores within each SMX by giving the loop directive the `gang` and `vector` options.
# 
# Note that our directive also includes the data clauses `present` and `private`:
# 
# ```fortran
# !$acc parallel loop gang vector present(dens, temp, eos_cp, eos_dhdX, &
# !$acc Xin, Xout, rho_omegadot, rho_Hnuc, ebin, ts) &
# !$acc private(ierr, y, y0, y1) reduction(+:ierr_tot)
# do i = 1, npt
#    ! ...
#    call bdf_advance(ts(i), NEQ, bdf_npt, y0, t0, y1, t1, &
#                     DT0, RESET, REUSE, ierr, .true.)
#    ! ...
# end do
# ```
# 
# `present` tells the compiler that the listed data is already present on the GPU, while `private` tells the compiler that each thread on the GPU should have its own copy of the listed data. We also see a `reduction`, which sums the values of `ierr` from all cores into `ierr_tot` and returns the total to the host/CPU.
# 
# **Step 3: Get the data to the GPU**
# 
# GPUs do not have access to the CPU/host's memory (e.g. the RAM and cache that all of your computer's cores share). Thus, we must explicitly put any data we need into the GPU's memory. This is achieved with `data` directives in OpenACC.
# 
# Often in our codes, and in other production science codes, we make use of data in modules. This global data can be put on the GPU with the `declare` directive. Here's an example of this from my work accelerating the reactions:
# 
# ```fortran
# module network
#   use bl_types
#   implicit none
#   ! ...
#   real(kind=dp_t), save :: aion(nspec), zion(nspec), ebin(nspec)
#   !$acc declare create(aion(:), zion(:), ebin(:))
#   ! ...
# contains
#   subroutine network_init()
#     ! ...
#     aion(ic12_) = 12.0_dp_t
#     aion(io16_) = 16.0_dp_t
#     aion(img24_) = 24.0_dp_t
# 
#     zion(ic12_) = 6.0_dp_t
#     zion(io16_) = 8.0_dp_t
#     zion(img24_) = 12.0_dp_t
# 
#     ebin(ic12_) = -7.4103097e18_dp_t ! 92.16294 MeV
#     ebin(io16_) = -7.6959672e18_dp_t ! 127.62093 MeV
#     ebin(img24_) = -7.9704080e18_dp_t ! 198.2579 MeV
# 
#     !$acc update device(aion(:), zion(:), ebin(:))
#     ! ...
#   end subroutine network_init
#   ! ...
# end module network
# ```
# 
# Here we have module arrays that we need on the GPU. The combination of the `declare create` and `update device` directives makes sure this data is accessible on the GPU and contains the same values as on the host. Note that this ability to use global data is relatively new, introduced in the OpenACC 2.0 standard.
# 
# A major difficulty in working with GPUs is dealing with user-defined types (similar to `struct`s in C, and to a lesser extent classes in object-oriented languages). It is exceptionally good software engineering practice to make use of such types, and they're plentiful in BoxLib/Maestro. However, types that include dynamic memory (in Fortran, this means types with components that have the `allocatable` or `pointer` attribute) are currently poorly handled by compiler implementations of OpenACC. There are plans in the next major update of the standard, 3.0, to incorporate deep-copy functionality that should rid us of many of these difficulties. For now, however, we must manually deep copy non-trivial user-defined types.
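# To make the problem concrete, here is a simplified, hypothetical sketch of such a type, mixing static members with dynamically allocated ones. The member names echo a few of the real timestepper members that appear in the directives below, but this is only an illustration, not the actual type definition.
# 
# ```fortran
# module sketch_ts_module
#   use bl_types
#   implicit none
# 
#   ! Hypothetical, simplified illustration of a timestepper-like type.
#   type :: sketch_ts
#      ! Static members (scalars, fixed-size arrays): these can be moved to
#      ! the device with a simple update on each member.
#      integer          :: neq, npt, max_order
#      real(kind=dp_t)  :: t, dt
#      ! Dynamic members: until they are explicitly created and attached on
#      ! the device, these are just host pointers, so a naive update of the
#      ! whole type would not do what we want.
#      real(kind=dp_t), allocatable :: y(:,:), rtol(:), atol(:)
#   end type sketch_ts
# end module sketch_ts_module
# ```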
# In this case, our integrator uses a timestepper type called `bdf_ts`. Our loop over hydro cells includes an array of these types, `ts(:)`. Before getting to this loop, I deep copy each one over to the GPU. It looks like this:
# 
# ```fortran
# !$acc enter data create(ts)
# do i = 1, npt
#    ! we need the specific heat at constant pressure and dhdX |_p. Take
#    ! T, rho, Xin as input
#    eos_state%rho = dens(i)
#    eos_state%T = temp(i)
#    eos_state%xn(:) = Xin(:,i)
#    call eos(eos_input_rt, eos_state, .false.)
#    eos_cp(i) = eos_state%cp
#    eos_dhdX(:,i) = eos_state%dhdX(:)
# 
#    ! Build the bdf_ts time-stepper object
#    call bdf_ts_build(ts(i), NEQ, bdf_npt, rtol, atol, MAX_ORDER, upar)
# 
#    !Now we update all non-dynamic data members, meaning those that aren't
#    !allocatables, pointers, etc. They can be arrays as long as they're
#    !static. To make my previous note more concrete, if we did something
#    !like `update device(ts(i))` it would overwrite the device's pointer
#    !address with that of the host, according to PGI/NVIDIA consults.
#    !Updating individual members avoids this.
# 
#    !$acc update device( &
#    !$acc ts(i)%neq, &
#    !$acc ts(i)%npt, &
#    !$acc ts(i)%max_order, &
#    !$acc ts(i)%max_steps, &
#    !$acc ts(i)%max_iters, &
#    !$acc ts(i)%verbose, &
#    !$acc ts(i)%dt_min, &
#    !$acc ts(i)%eta_min, &
#    !$acc ts(i)%eta_max, &
#    !$acc ts(i)%eta_thresh, &
#    !$acc ts(i)%max_j_age, &
#    !$acc ts(i)%max_p_age, &
#    !$acc ts(i)%debug, &
#    !$acc ts(i)%dump_unit, &
#    !$acc ts(i)%t, &
#    !$acc ts(i)%t1, &
#    !$acc ts(i)%dt, &
#    !$acc ts(i)%dt_nwt, &
#    !$acc ts(i)%k, &
#    !$acc ts(i)%n, &
#    !$acc ts(i)%j_age, &
#    !$acc ts(i)%p_age, &
#    !$acc ts(i)%k_age, &
#    !$acc ts(i)%tq, &
#    !$acc ts(i)%tq2save, &
#    !$acc ts(i)%temp_data, &
#    !$acc ts(i)%refactor, &
#    !$acc ts(i)%nfe, &
#    !$acc ts(i)%nje, &
#    !$acc ts(i)%nlu, &
#    !$acc ts(i)%nit, &
#    !$acc ts(i)%nse, &
#    !$acc ts(i)%ncse, &
#    !$acc ts(i)%ncit, &
#    !$acc ts(i)%ncdtmin)
# 
#    !Now it's time to deal with dynamic data. At the moment, they only
#    !exist as pointers on the device. For PGI at least, doing a copyin on
#    !dynamic data serves to create, allocate, and then attach each dynamic
#    !data member to the corresponding pointer in ts(i).
#    !$acc enter data copyin( &
#    !$acc ts(i)%rtol, &
#    !$acc ts(i)%atol, &
#    !$acc ts(i)%J, &
#    !$acc ts(i)%P, &
#    !$acc ts(i)%z(:,:,0:), &
#    !$acc ts(i)%z0(:,:,0:), &
#    !$acc ts(i)%h(0:), &
#    !$acc ts(i)%l(0:), &
#    !$acc ts(i)%shift(0:), &
#    !$acc ts(i)%upar, &
#    !$acc ts(i)%y, &
#    !$acc ts(i)%yd, &
#    !$acc ts(i)%rhs, &
#    !$acc ts(i)%e, &
#    !$acc ts(i)%e1, &
#    !$acc ts(i)%ewt, &
#    !$acc ts(i)%b, &
#    !$acc ts(i)%ipvt, &
#    !$acc ts(i)%A(0:,0:))
# end do
# ```
# 
# Here I make use of the `enter data` directive, which allows you to put data on the GPU in one part of your code and close that data out in another. First, I create the array of `bdf_ts` types, `ts(:)`. I then build these objects on the host and update the static data while doing a `copyin` for the dynamic data. For PGI, this `copyin` serves to 1) create, 2) allocate, and 3) attach each dynamic data member to the appropriate type on the GPU.
# 
# Later, there's an analogous loop in which I retrieve any data I might need on the host and destroy all of the data on the GPU.
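# I won't reproduce that teardown loop here, but a sketch of what it might look like follows. This is an illustration rather than the actual Maestro code: it assumes the only members we still want back on the host are `y` and `yd`, and it simply deletes the rest (the member names follow the `copyin` list above).
# 
# ```fortran
# do i = 1, npt
#    ! Copy back results we still need on the host ...
#    !$acc exit data copyout(ts(i)%y, ts(i)%yd)
#    ! ... and delete the remaining dynamic members from the device.
#    !$acc exit data delete( &
#    !$acc ts(i)%rtol, ts(i)%atol, ts(i)%J, ts(i)%P, &
#    !$acc ts(i)%z, ts(i)%z0, ts(i)%h, ts(i)%l, ts(i)%shift, &
#    !$acc ts(i)%upar, ts(i)%rhs, ts(i)%e, ts(i)%e1, &
#    !$acc ts(i)%ewt, ts(i)%b, ts(i)%ipvt, ts(i)%A)
# end do
# ! Finally, remove the array of bdf_ts types itself.
# !$acc exit data delete(ts)
# ```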
# We've seen how I get global data and a non-trivial derived type onto the GPU. Finally, before my loop I have a `data` directive that gets other needed variables onto the GPU:
# 
# ```fortran
# !$acc data &
# !$acc copyin(dens(:), temp(:), eos_cp(:), eos_dhdX(:,:), Xin(:,:)) &
# !$acc copyout(Xout(:,:), rho_omegadot(:,:), rho_Hnuc(:))
# ```
# 
# I `copyin` data that need only be read on the GPU and not returned to the host. I `copyout` data needed on the host for subsequent computation. Note that in practice you need only do this for non-scalar data like arrays. By default, scalar data used in an OpenACC compute region will be copied over and readily available.
# 
# **Hidden Step 0/4: Rewrite your code to satisfy GPU restrictions**
# 
# GPUs are not CPUs. A CPU core is very sophisticated and able to handle complex code flows. GPU cores are plentiful, but dumb. What this means is that although I would love to tell you the three-step narrative I laid out above is sufficient for getting your code running on GPUs, I can't. Were you to carry out only those steps, your code almost certainly wouldn't compile. So either before (step 0) or, more realistically, after (step 4) you must modify your code so that any portion that needs to run on the GPU doesn't violate GPU restrictions. Briefly, the primary restrictions I had to rewrite code to avoid are:
# 
# * No `allocate` statements on the GPU
# 
#   * This includes implicit ones! In practice, this means you should avoid statements that implicitly require a temporary array, which the compiler will allocate behind the scenes. So, not this:
# ```fortran
# ts%l = ts%l + eoshift_local(ts%l, -1) / xi_j(ts%h, j)
# ```
# 
#   * Generally, I've found that using Fortran's concise array notation on GPUs leads to errors. I replace all instances of it with explicit loops (see the sketch after this list).
# 
# * Subroutines and functions cannot be passed as arguments to GPU routines
# 
# * Many intrinsic functions may not be available on the GPU (e.g. `sum`, `eoshift`, `max`). Some are, some aren't. If code is giving you trouble and it contains an intrinsic, that's likely the problem.
# 
# * No I/O! You can't `print *, 'Hey, I am on a GPU'`
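# As an example of the kind of rewrite these restrictions force, here is a sketch of how the offending statement above might become an explicit loop, so that no compiler-generated temporary array is needed. It is hypothetical: it assumes `ts%l` is indexed `0:ts%max_order`, that `eoshift_local(ts%l, -1)` shifts elements toward higher indices and fills the vacated low end with zero (as the intrinsic `eoshift` would), and it omits the declarations of the loop index `m` and the scalar `xi`.
# 
# ```fortran
# ! Hypothetical explicit-loop version of
# !   ts%l = ts%l + eoshift_local(ts%l, -1) / xi_j(ts%h, j)
# xi = xi_j(ts%h, j)
# ! Walk downward so ts%l(m-1) still holds its old value when we use it.
# do m = ts%max_order, 1, -1
#    ts%l(m) = ts%l(m) + ts%l(m-1) / xi
# end do
# ! ts%l(0) is unchanged: the element shifted in at the low end is zero.
# ```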