Julia's generic way of implementing algorithms often makes it easy to explore different storage schemes, elevated or reduced precision, or acceleration hardware such as GPUs. In this workbook I want to present a few illustrative examples, showing how little effort is needed to give these things a try in Julia. In one example I will also show how to track performance by profiling and how to understand what should be done to improve the algorithm at hand.

First we install some packages for linear algebra and for performing benchmarks:

In [ ]:

```
import Pkg; Pkg.add(["GenericLinearAlgebra", "SparseArrays", "Profile", "ProfileView", "BenchmarkTools", "PProf"])
using LinearAlgebra
using SparseArrays
```

For dense and sparse arrays, all important linear algebra routines are available in the `LinearAlgebra` standard library. This includes common tasks such as

- `qr` (also pivoted)
- `cholesky` (also pivoted)
- `eigen`, `eigvals`, `eigvecs` (compute eigenpairs, values, vectors)
- `factorize` (for computing matrix factorisations)

All these methods are implemented both for generic matrices and specialised for specific kinds. For example, `factorize` is intended to compute a clever factorisation for solving linear systems. What it does depends on the matrix properties:

In [ ]:

```
# Random real matrix -> will do an LU
A = randn(10, 10)
@show typeof(factorize(A))
# Real-symmetric matrix -> will do a Bunch-Kaufman
Am = Symmetric(A + A')
@show typeof(factorize(Am))
# Symmetric tridiagonal -> will do an LDLt
Am = SymTridiagonal(A + A')
@show typeof(factorize(Am))
# Random sparse matrix -> will do sparse LU
S = sprandn(50, 50, 0.3)
@show typeof(factorize(S))
# ... and so on ...
```

They all share a common interface, so that an algorithm like

In [ ]:

```
function solve_many(A, xs)
    F = factorize(A)
    [F \ rhs for rhs in xs]
end
```

will automatically work for both sparse and dense arrays, and is furthermore independent of the floating-point type.
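As a quick illustrative check (the cell below redefines `solve_many` so it is self-contained; the matrices are just random examples), the very same code runs for dense `Float64`, dense `Float32` and sparse input, and the result precision follows the input type:

In [ ]:

```
using LinearAlgebra
using SparseArrays

function solve_many(A, xs)
    F = factorize(A)
    [F \ rhs for rhs in xs]
end

A  = randn(10, 10) + 10I          # well-conditioned dense matrix
S  = sprandn(10, 10, 0.3) + 10I   # sparse matrix
xs = [randn(10) for _ in 1:3]

sols_dense  = solve_many(A, xs)                                  # dense LU path
sols_single = solve_many(Float32.(A), [Float32.(x) for x in xs]) # single precision
sols_sparse = solve_many(S, xs)                                  # sparse LU path
eltype(sols_single[1])
```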

Let's try this in a more realistic algorithm. If we leave efficiency aside for now, a simple Davidson algorithm can be implemented quite concisely:

In [ ]:

```
using LinearAlgebra
function davidson(A, SS::AbstractArray; maxiter=100, prec=I,
                  tol=20size(A,2)*eps(eltype(A)),
                  maxsubspace=8size(SS, 2))
    m = size(SS, 2)
    for i in 1:maxiter
        Ass = A * SS
        # Use eigen specialised for Hermitian matrices
        rvals, rvecs = eigen(Hermitian(SS' * Ass))
        rvals = rvals[1:m]
        rvecs = rvecs[:, 1:m]
        Ax = Ass * rvecs
        R = Ax - SS * rvecs * Diagonal(rvals)
        if norm(R) < tol
            return rvals, SS * rvecs
        end
        println(i, " ", size(SS, 2), " ", norm(R))
        # Use QR to orthogonalise the subspace.
        if size(SS, 2) + m > maxsubspace
            SS = typeof(R)(qr(hcat(SS * rvecs, prec * R)).Q)
        else
            SS = typeof(R)(qr(hcat(SS, prec * R)).Q)
        end
    end
    error("not converged.")
end
```

We can easily test:

In [ ]:

```
nev = 2
A = randn(20, 20); A = A + A' + I;
# Generate two random orthogonal guess vectors
x0 = randn(size(A, 2), nev)
x0 = Array(qr(x0).Q)
# Run the problem
davidson(A, x0)
```

From this, a mixed-precision scheme can simply be written down:

In [ ]:

```
using GenericLinearAlgebra # (Slow) generic fallbacks for eigen and qr
λ, v = davidson(Matrix{Float16}(A), Float16.(x0), tol=1)
println()
λ, v = davidson(Matrix{Float32}(A), v, tol=1e-3)
println()
λ, v = davidson(Matrix{Float64}(A), v, tol=1e-13)
println()
λ, v = davidson(Matrix{BigFloat}(A), v, tol=1e-25)
λ
```

We can also use sparse arrays with this algorithm:

In [ ]:

```
nev = 2
spA = sprandn(100, 100, 0.3); spA = spA + spA' + 2I
spx0 = randn(size(spA, 2), nev)
spx0 = Array(qr(spx0).Q)
davidson(spA, spx0, tol=1e-6)
```

Or GPU arrays; see also davidson_cuda.jl:

In [ ]:

```
# Use the GPU
using CuArrays
using CuArrays.CUSOLVER
function LinearAlgebra.eigen(A::Hermitian{T, CuArray{T, 2}}) where {T <: Real}
    CUSOLVER.syevd!('V', 'U', A.data)
end
davidson(cu(A), cu(x0))
```

but out of the box the performance is actually not that good, because our naive algorithm does a lot of copying and elementwise access.

Let's see if we can detect the performance issues and identify places for improvement. For this we will use Julia's built-in `Profile` package in combination with two graphical viewers, `ProfileView` and `PProf`.

In [ ]:

```
using Profile
using ProfileView
using PProf
```

In [ ]:

```
# Setup of the problem:
nev = 2
A = randn(20, 20); A = A + A' + I;
x0 = randn(size(A, 2), nev)
x0 = Array(qr(x0).Q)
```

In [ ]:

```
# The first run triggers compilation and its profile should be ignored;
# the second run gives the actual picture.
@profview davidson(A, x0)
@profview davidson(A, x0)
```

In [ ]:

```
# The first run triggers compilation and its profile should be ignored;
# the second run gives the actual picture.
@pprof davidson(A, x0)
@pprof davidson(A, x0)
```

So it seems line 27 in the davidson implementation is where basically all the time is spent. This line is

```
SS = typeof(R)(qr(hcat(SS, prec * R)).Q)
```

which basically consists of a QR factorisation, the `hcat` and the conversion of the QR result to `typeof(R)` (a plain `Array` in this case). Let us try to get an idea of the time each of these steps needs.

We will use the `BenchmarkTools` package for this. It measures the time one (or multiple) Julia statements take to execute by running them *repeatedly* and collecting statistics.
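As a minimal usage sketch (the vector `v` below is just an example): when a benchmarked expression references global variables, BenchmarkTools recommends interpolating them with `$`, so that the measurement reflects the operation itself rather than the overhead of accessing an untyped global.

In [ ]:

```
using BenchmarkTools

v = randn(1000)
# $v interpolates the global into the benchmark expression
s = @btime sum($v)
```

`@btime` prints the minimum time and returns the value of the expression, so `s` holds `sum(v)` afterwards.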

In [ ]:

```
using BenchmarkTools
```

In our example `SS` is at most 20x16 and `prec * R` is 20x2, so let's use that:

In [ ]:

```
SS = randn(20, 16)
precR = randn(20, 2)
# Interpolate globals with $ so we measure the operations, not global access
catted = @btime hcat($SS, $precR)
fac = @btime qr($catted)
@btime Array($fac.Q);
```

In agreement with the profiling result, the QR dominates, but the unpacking of the Q factor also takes quite some time. This suggests that the main improvement to the algorithm would be to use a different way to orthogonalise the subspace.
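One possible direction, sketched here under the assumption that `SS` is kept orthonormal between iterations (the function name `ortho_append` is made up for illustration): orthogonalise only the few new directions against the existing subspace by projection, followed by a QR of the small new block, instead of a full QR of the whole subspace at every iteration.

In [ ]:

```
using LinearAlgebra

# Sketch: project the new directions R out of the span of SS (twice,
# for numerical stability), then QR-factorise only the small new block.
function ortho_append(SS::AbstractMatrix, R::AbstractMatrix)
    V = R - SS * (SS' * R)
    V = V - SS * (SS' * V)   # second projection pass
    hcat(SS, Matrix(qr(V).Q))
end

SS    = Matrix(qr(randn(20, 16)).Q)  # orthonormal 20x16 subspace
R     = randn(20, 2)                 # two new directions
SSnew = ortho_append(SS, R)
opnorm(SSnew' * SSnew - I)           # close to zero: still orthonormal
```

This factorises only a 20x2 block per step rather than the full 20x18 matrix, avoiding the costly unpacking of a large Q factor.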