Linear Algebra, Precision and Profiling

Julia's generic way of implementing algorithms often makes it easy to explore different storage schemes, elevated or reduced precision, or to try acceleration hardware like a GPU. In this workbook I want to present a few examples illustrating how little effort is needed to give these things a try in Julia. In one example I will also show how to track performance by profiling and how to understand what should be done to improve the algorithm at hand.

First we install some packages for linear algebra and for running some benchmarks ...

In [ ]:
import Pkg; Pkg.add(["GenericLinearAlgebra", "SparseArrays", "Profile", "ProfileView", "BenchmarkTools", "PProf"])
using LinearAlgebra
using SparseArrays

Linear Algebra

For dense and sparse arrays, all important linear algebra routines are available in the LinearAlgebra standard library. This includes common tasks such as

  • qr (also pivoted)
  • cholesky (also pivoted)
  • eigen, eigvals, eigvecs (compute eigenpairs, values, vectors)
  • factorize (for computing matrix factorisations)
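
As a quick illustration of a few of the routines listed above, here is a minimal sketch (the matrix B is made up purely for demonstration purposes):

In [ ]:
# Hypothetical demonstration matrix: symmetric positive definite
B = randn(5, 5); B = B * B' + 5I

qr(B)                   # QR factorisation
cholesky(Symmetric(B))  # Cholesky factorisation (B is positive definite)
eigvals(Symmetric(B))   # eigenvalues via the symmetric eigensolver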

All these methods are implemented both for generic matrices and in specialised forms for particular matrix types. For example, factorize is intended to compute a clever factorisation for solving linear systems; what it does depends on the matrix properties:

In [ ]:
# Random real matrix -> will do an LU
A = randn(10, 10)
@show typeof(factorize(A))

# Real-symmetric matrix ->  will do a Bunch-Kaufman
Am = Symmetric(A + A')
@show typeof(factorize(Am))

# Symmetric tridiagonal -> will do a LDLt
Am = SymTridiagonal(A + A')
@show typeof(factorize(Am))

# Random sparse matrix -> will do sparse LU
S = sprandn(50, 50, 0.3)
@show typeof(factorize(S))

# ... and so on ...

They all share a common interface, such that an algorithm like

In [ ]:
function solve_many(A, xs)
    F = factorize(A)
    [F \ rhs for rhs in xs]
end

will automatically work for both sparse and dense arrays and is furthermore independent of the floating-point type.
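
Assuming the cell defining solve_many above has been run, a quick check of this claim could look as follows (a hypothetical sketch; the matrices and right-hand sides are made up for illustration), calling solve_many with dense Float64, dense Float32 and sparse matrices:

In [ ]:
# Hypothetical example data; the shift by 10I keeps the sparse matrix non-singular.
xs = [randn(10) for _ in 1:3]
Adense  = randn(10, 10)
Asparse = sprandn(10, 10, 0.3) + 10I

solve_many(Adense, xs)                                    # dense LU
solve_many(Float32.(Adense), [Float32.(x) for x in xs])   # single precision
solve_many(Asparse, xs)                                    # sparse LU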

More details on these and further routines can be found in the documentation of the LinearAlgebra standard library.

Use case: A generic Davidson

Let's try this out on a more realistic algorithm. The Davidson method is an iterative scheme for computing a few extremal eigenpairs of a large Hermitian matrix. If we leave efficiency aside for now, a simple Davidson algorithm can be implemented quite concisely:

In [ ]:
using LinearAlgebra

function davidson(A, SS::AbstractArray; maxiter=100, prec=I,
                  tol=20size(A,2)*eps(eltype(A)),
                  maxsubspace=8size(SS, 2))
    m = size(SS, 2)
    for i in 1:maxiter
        Ass = A * SS

        # Use eigen specialised for Hermitian matrices
        rvals, rvecs = eigen(Hermitian(SS' * Ass))
        rvals = rvals[1:m]
        rvecs = rvecs[:, 1:m]
        Ax = Ass * rvecs

        R = Ax - SS * rvecs * Diagonal(rvals)
        if norm(R) < tol
            return rvals, SS * rvecs
        end

        println(i, "  ", size(SS, 2), "  ", norm(R))

        # Use QR to orthogonalise the subspace.
        if size(SS, 2) + m > maxsubspace
            SS = typeof(R)(qr(hcat(SS * rvecs, prec * R)).Q)
        else
            SS = typeof(R)(qr(hcat(SS, prec * R)).Q)
        end
    end
    error("not converged.")
end

We can easily test this:

In [ ]:
nev = 2
A = randn(20, 20); A = A + A' + I;

# Generate two random orthogonal guess vectors
x0 = randn(size(A, 2), nev)
x0 = Array(qr(x0).Q)

# Run the problem
davidson(A, x0)

From this, a mixed-precision scheme can simply be written down:

In [ ]:
using GenericLinearAlgebra  # (Slow) generic fallbacks for eigen and qr

λ, v = davidson(Matrix{Float16}(A), Float16.(x0), tol=1)
println()
λ, v = davidson(Matrix{Float32}(A), v, tol=1e-3)
println()
λ, v = davidson(Matrix{Float64}(A), v, tol=1e-13)
println()
λ, v = davidson(Matrix{BigFloat}(A), v, tol=1e-25)
λ

We can also use sparse arrays with this:

In [ ]:
nev = 2
spA = sprandn(100, 100, 0.3); spA = spA + spA' + 2I
spx0 = randn(size(spA, 2), nev)
spx0 = Array(qr(spx0).Q)

davidson(spA, spx0, tol=1e-6)

Or GPU arrays, see also davidson_cuda.jl:

In [ ]:
# Use the GPU
using CuArrays
using CuArrays.CUSOLVER

function LinearAlgebra.eigen(A::Hermitian{T, CuArray{T, 2}}) where {T <: Real}
    CUSOLVER.syevd!('V', 'U', A.data)
end

davidson(cu(A), cu(x0))  # reuse the random guess from above, moved to the GPU

but out of the box the performance is actually not that great, because our naive algorithm does a lot of copying and elementwise access.

Profiling and suggestions for improvement

Let's see if we can detect the performance issues and suggest places for improvement. For this we will use Julia's built-in Profile package in combination with two graphical viewers, ProfileView and PProf.

In [ ]:
using Profile
using ProfileView
using PProf
In [ ]:
# Setup of the problem:
nev = 2
A = randn(20, 20); A = A + A' + I;
x0 = randn(size(A, 2), nev)
x0 = Array(qr(x0).Q)
In [ ]:
# First run compiles everything ... ignore its profile
@profview davidson(A, x0)
# The second run gives the profile we actually want to look at
@profview davidson(A, x0)
In [ ]:
# Again: the first run compiles, the second run gives the actual profile
@pprof davidson(A, x0)
@pprof davidson(A, x0)

So it seems line 27 in the davidson implementation is where basically all the time is spent. This line is

SS = typeof(R)(qr(hcat(SS, prec * R)).Q)

which basically consists of a QR factorisation, the hcat and the conversion of the QR result to typeof(R) (a plain Array in this case). Let us try to get an idea of the time each of these steps needs.

We will use the BenchmarkTools package for this. It measures the time one (or more) Julia statements take to execute. To do so it runs them repeatedly and collects some statistics.

In [ ]:
using BenchmarkTools
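
As a minimal warm-up sketch (the vector v below is made up for illustration): @btime reports the minimum runtime over many samples, the $ interpolation avoids also measuring the access to non-constant global variables, and @benchmark prints the full timing statistics.

In [ ]:
# Hypothetical warm-up example for BenchmarkTools
v = randn(1000)
@btime sum($v)        # minimum time over many samples
@benchmark sum($v)    # full statistics (allocations, timing histogram)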

In our example SS is at most 20×16 and prec * R is 20×2, so let's use matrices of that size:

In [ ]:
SS = randn(20, 16)
precR = randn(20, 2)

catted = @btime hcat($SS, $precR)   # interpolate with $ to avoid timing global-variable access
fac = @btime qr($catted)
@btime Array($(fac.Q));

In agreement with the profiling results, the QR factorisation dominates, but the unpacking of the Q factor into a dense array also takes quite some time. This suggests that the main thing to improve in this algorithm is the way the subspace is orthogonalised.
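
One possible direction, shown here only as a rough sketch (the helper ortho_against is hypothetical, not part of the workbook, and assumes the columns of SS are already orthonormal): orthogonalise just the new directions against SS with a Gram-Schmidt step and run the small QR only over the new block, which avoids the full QR of the enlarged subspace and the unpacking of its large Q factor.

In [ ]:
# Sketch of a cheaper subspace orthogonalisation (hypothetical, untuned)
using LinearAlgebra

function ortho_against(SS, R)
    R = copy(R)
    for j in 1:size(SS, 2)
        s = view(SS, :, j)
        for k in 1:size(R, 2)
            r = view(R, :, k)
            r .-= dot(s, r) .* s   # remove the component along the j-th basis vector
        end
    end
    hcat(SS, Array(qr(R).Q))       # small QR only over the new columns
end

@btime ortho_against($SS, $precR);  # timing only: the cost does not depend on SS being orthonormal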