You cannot learn too much linear algebra. – Benedict Gross
In this lecture, we examine the structure of matrices and linear operators (e.g., dense, sparse, symmetric, tridiagonal, banded) and discuss how it can be exploited to radically increase the performance of solving large problems.
We build on applications discussed in previous lectures: linear algebra, orthogonal projections, and Markov chains.
The methods in this section are called direct methods, and they are qualitatively similar to performing Gaussian elimination to factor matrices and solve systems of equations. In iterative methods and sparsity we examine a different approach, using iterative algorithms, where matrices can be thought of more generally as linear operators.
The list of specialized packages for these tasks is enormous and growing, but some of the important organizations to look at are JuliaMatrices, JuliaSparse, and JuliaMath.
NOTE: You may wish to review multiple-dispatch and generic programming in introduction to types, and consider further study on generic programming.
The theme of this lecture, and of numerical linear algebra in general, comes down to three principles:
Identify the structure of a matrix (e.g. symmetric, sparse, diagonal) in order to use specialized algorithms.
Do not lose that structure by applying the wrong operations at the wrong time (e.g. letting a sparse matrix become dense).
Understand the computational complexity of each algorithm, given the structure of its inputs.
using LinearAlgebra, Statistics, BenchmarkTools, SparseArrays, Random
Random.seed!(42); # seed random numbers for reproducibility
Ask yourself whether the following are computationally expensive operations as the matrix size increases:
Multiplying two matrices?
Solving a linear system of equations?
Finding the eigenvalues of a matrix?
As the goal of this section is to move towards numerical methods with large systems, we need to understand how well algorithms scale with the size of matrices/vectors/etc. This is known as computational complexity. As we saw in the answer to the questions above, the algorithm - and hence the computational complexity - changes based on matrix structure.
This notion of complexity can apply at various levels, such as the number of significant digits for basic mathematical operations, the amount of memory and storage required, or the amount of time taken - but we will typically focus on time-complexity.
For time-complexity, the size $ N $ is usually the dimensionality of the problem, although occasionally the key will be the number of non-zeros in the matrix or width of bands. For our applications, time-complexity is best thought of as the number of floating point operations (e.g. add, multiply, etc.) required.
Complexity of algorithms is typically written in Big O notation which provides bounds on the scaling.
Formally, if the number of operations required for a problem size $ N $ is $ f(N) $, we can write this as $ f(N) = O(g(N)) $ for some $ g(N) $ - typically a polynomial.
The interpretation is that there exist constants $ M $ and $ N_0 $ such that
$$ f(N) \leq M g(N), \text{ for } N > N_0 $$
For example, the complexity of finding an LU decomposition of a dense matrix is $ O(N^3) $, which should be read as: there is a constant such that, eventually, the number of floating point operations required to decompose an $ N\times N $ matrix grows cubically.
Keep in mind that these are asymptotic results intended to help understand the scaling of the problem, and the constant can matter for a given fixed size.
For example, the number of operations required for an LU decomposition of a dense $ N \times N $ matrix is $ f(N) = \frac{2}{3} N^3 $, ignoring the $ N^2 $ and lower-order terms. Other methods of solving a linear system may have different constants of proportionality, even if they have the same scaling, $ O(N^3) $.
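As a rough, machine-dependent check of this scaling (a small sketch, not part of the lecture's benchmarks), time the factorization for a few doublings of $ N $ and confirm that the time grows roughly eightfold each time:
lu(rand(10, 10)) # warm up compilation before timing
for N in (200, 400, 800)
    A = rand(N, N)
    println("N = $N: ", @elapsed(lu(A)), " seconds") # expect roughly 8x growth per doubling of N
end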
When combining algorithms, you will sometimes need to think through how the combination changes the overall complexity. For example, if you factorize a matrix once at a cost of $ O(N^3) $ and then solve against $ K $ right-hand sides at $ O(N^2) $ each, the total cost is $ O(N^3 + K N^2) $.
With this, we have an important word of caution: dense matrix-multiplication is an expensive operation for unstructured matrices, and the basic version is $ O(N^3) $.
Of course, modern libraries use highly tuned and carefully written algorithms to multiply matrices and exploit the computer architecture, memory cache, etc., but this simply lowers the constant of proportionality; the operation remains $ O(N^3) $.
A consequence is that, since many algorithms require matrix-matrix multiplication, it is often not possible to go below that order without further matrix structure.
That is, changing the constant of proportionality for a given size can help, but in order to achieve better scaling you need to identify matrix structure (e.g. tridiagonal, sparse) and ensure your operations do not lose it.
As a first example of a structured matrix, consider sparse arrays.
A = sprand(10, 10, 0.45) # random sparse 10x10, 45 percent filled with non-zeros
@show nnz(A) # counts the number of non-zeros
invA = sparse(inv(Array(A))) # Julia won't invert sparse so convert to dense with Array.
@show nnz(invA);
nnz(A) = 47 nnz(invA) = 100
This increase from less than 50 to 100 percent dense demonstrates that significant sparsity can be lost when calculating an inverse.
The results can be even more extreme. Consider a tridiagonal matrix of size $ N \times N $ that might come out of a Markov Chain or discretization of a diffusion process,
N = 5
A = Tridiagonal([fill(0.1, N-2); 0.2], fill(0.8, N), [0.2; fill(0.1, N-2);])
5×5 Tridiagonal{Float64,Array{Float64,1}}: 0.8 0.2 ⋅ ⋅ ⋅ 0.1 0.8 0.1 ⋅ ⋅ ⋅ 0.1 0.8 0.1 ⋅ ⋅ ⋅ 0.1 0.8 0.1 ⋅ ⋅ ⋅ 0.2 0.8
The number of non-zeros here is approximately $ 3 N $ - linear in the size - which scales well to huge matrices into the millions or billions.
But consider the inverse
inv(A)
5×5 Array{Float64,2}: 1.29099 -0.327957 0.0416667 -0.00537634 0.000672043 -0.163978 1.31183 -0.166667 0.0215054 -0.00268817 0.0208333 -0.166667 1.29167 -0.166667 0.0208333 -0.00268817 0.0215054 -0.166667 1.31183 -0.163978 0.000672043 -0.00537634 0.0416667 -0.327957 1.29099
Now, the matrix is fully dense and has $ N^2 $ nonzeros.
This also applies to the $ A' A $ operation in the normal equations of linear-least squares.
A = sprand(20, 21, 0.3)
@show nnz(A)/20^2
@show nnz(A'*A)/21^2;
nnz(A) / 20 ^ 2 = 0.2825 nnz(A' * A) / 21 ^ 2 = 0.800453514739229
While there is some variation based on the random numbers drawn, we see that a roughly 30 percent dense matrix becomes almost fully dense after the product is taken.
Sparsity/Structure is not just for storage: While we have been emphasizing counting the non-zeros as a useful heuristic, the primary reason to maintain structure and sparsity is not for using less memory to store the matrices.
Size can sometimes become important (e.g. a 1 million by 1 million tridiagonal matrix needs to store about 3 million numbers (i.e., about 24MB of memory), whereas a dense one requires 1 trillion numbers (i.e., about 8TB of memory)).
But, as we will see, the main purpose of considering sparsity and matrix structure is that it enables specialized algorithms, which typically have a lower computational order than unstructured dense, or even unstructured sparse, operations.
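As a smaller-scale illustration of that storage difference (a quick sketch; exact byte counts will vary slightly), compare the memory footprint of a tridiagonal matrix with its dense equivalent:
N = 1_000
A = Tridiagonal(fill(0.1, N-1), fill(0.8, N), fill(0.1, N-1))
Base.summarysize(A) # roughly 3N Float64s, on the order of 24 KB
Base.summarysize(Array(A)) # N^2 Float64s, about 8 MB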
First, create a convenient function for benchmarking, which displays the matrix type.
using BenchmarkTools
function benchmark_solve(A, b)
println("A\\b for typeof(A) = $(string(typeof(A)))")
@btime $A \ $b
end
benchmark_solve (generic function with 1 method)
Then, take away structure to see the impact on performance,
N = 1000
b = rand(N)
A = Tridiagonal([fill(0.1, N-2); 0.2], fill(0.8, N), [0.2; fill(0.1, N-2);])
A_sparse = sparse(A)
A_dense = Array(A)
# benchmark solution to system A x = b
benchmark_solve(A, b)
benchmark_solve(A_sparse, b)
benchmark_solve(A_dense, b);
A\b for typeof(A) = Tridiagonal{Float64,Array{Float64,1}} 28.199 μs (9 allocations: 47.75 KiB) A\b for typeof(A) = SparseMatrixCSC{Float64,Int64} 789.899 μs (69 allocations: 1.06 MiB) A\b for typeof(A) = Array{Float64,2} 25.725 ms (5 allocations: 7.65 MiB)
This example shows what is at stake: using a structured tridiagonal matrix may be 10-20x faster than using a sparse matrix, which is in turn 100x faster than using a dense matrix.
In fact, the difference becomes more extreme as the matrices grow. Solving a tridiagonal system is $ O(N) $, while solving a dense system without any structure is $ O(N^3) $. The complexity of a sparse solution is more complicated, and scales in part with nnz(A), i.e. the number of non-zeros.
While we write matrix multiplications in our algebra with abandon, in practice the operation scales very poorly without any matrix structure.
Matrix multiplication is so important to modern computers that the constant of proportionality has been radically reduced by well-written packages, but the order is still $ O(N^3) $ in practice.
Sparse matrix multiplication, on the other hand, is $ O(N M_A M_B) $, where $ M_A $ is the number of non-zeros per row of $ A $ and $ M_B $ is the number of non-zeros per column of $ B $.
By the rules of computational order, that means any algorithm requiring a matrix multiplication of dense matrices requires at least $ O(N^3) $ operations.
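To see the difference in practice, here is a small, illustrative benchmark (timings will vary with the sparsity draw and the machine) comparing a sparse-sparse product with the dense product of the same data:
A = sprand(1000, 1000, 0.01)
B = sprand(1000, 1000, 0.01)
@btime $A * $B # sparse product: cost tied to the non-zeros
@btime $(Array(A)) * $(Array(B)); # unstructured dense product, O(N^3)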
The other important question is the structure of the resulting matrix. For example, consider multiplying an upper triangular matrix by a lower triangular matrix.
N = 5
U = UpperTriangular(rand(N,N))
5×5 UpperTriangular{Float64,Array{Float64,2}}: 0.299976 0.176934 0.0608682 0.20465 0.409653 ⋅ 0.523923 0.127154 0.512531 0.235328 ⋅ ⋅ 0.600588 0.682868 0.330638 ⋅ ⋅ ⋅ 0.345419 0.0312986 ⋅ ⋅ ⋅ ⋅ 0.471043
L = U'
5×5 Adjoint{Float64,UpperTriangular{Float64,Array{Float64,2}}}: 0.299976 0.0 0.0 0.0 0.0 0.176934 0.523923 0.0 0.0 0.0 0.0608682 0.127154 0.600588 0.0 0.0 0.20465 0.512531 0.682868 0.345419 0.0 0.409653 0.235328 0.330638 0.0312986 0.471043
But the product is fully dense (e.g. think of a Cholesky factor multiplied by its transpose to produce a covariance matrix).
L * U
5×5 Array{Float64,2}: 0.0899855 0.0530758 0.018259 0.0613901 0.122886 0.0530758 0.305801 0.0773883 0.304736 0.195775 0.018259 0.0773883 0.380579 0.487749 0.253435 0.0613901 0.304736 0.487749 0.890193 0.441042 0.122886 0.195775 0.253435 0.441042 0.555378
On the other hand, a tridiagonal times a diagonal is still a tridiagonal - and can use specialized $ O(N) $ algorithms.
A = Tridiagonal([fill(0.1, N-2); 0.2], fill(0.8, N), [0.2; fill(0.1, N-2);])
D = Diagonal(rand(N))
D * A
5×5 Tridiagonal{Float64,Array{Float64,1}}: 0.0156225 0.00390564 ⋅ ⋅ ⋅ 0.0436677 0.349342 0.0436677 ⋅ ⋅ ⋅ 0.0213158 0.170526 0.0213158 ⋅ ⋅ ⋅ 0.00790566 0.0632453 0.00790566 ⋅ ⋅ ⋅ 0.19686 0.787442
When you tell a numerical analyst you are solving a linear system using direct methods, their first question is “which factorization?”.
Just as you can factorize a number (e.g. $ 6 = 3 \times 2 $), you can factorize a matrix as the product of other, more convenient matrices (e.g. $ A = L U $ or $ A = Q R $, where $ L, U, Q, $ and $ R $ have properties such as being triangular or orthogonal).
On paper, since the Invertible Matrix Theorem tells us a unique solution is equivalent to $ A $ being invertible, we often write the solution to $ A x = b $ as
$$ x = A^{-1} b $$
What if we do not (directly) use a factorization?
Take a simple linear system of a dense matrix,
N = 4
A = rand(N,N)
b = rand(N)
4-element Array{Float64,1}: 0.5682240701809245 0.40245385575255055 0.1825995192132288 0.06160128039631019
On paper, we solve $ A x = b $ by inverting the matrix,
x = inv(A) * b
4-element Array{Float64,1}: -0.03390698404076775 0.7988200873225004 0.9963711951331815 -0.927635209850046
As we will see throughout, inverting matrices should be used for theory, not for code. The classic advice that you should never invert a matrix may be slightly exaggerated, but is generally good advice.
Solving a system by inverting a matrix is always a little slower, potentially less accurate, and will often lose crucial sparsity compared to using a factorization. Moreover, the methods used by libraries to invert matrices typically calculate the same factorizations used for solving systems of equations.
Even if you need to solve a system with the same matrix multiple times, you are better off factoring the matrix and using the solver rather than calculating an inverse.
N = 100
A = rand(N,N)
M = 30
B = rand(N,M)
function solve_inverting(A, B)
A_inv = inv(A)
X = similar(B)
for i in 1:size(B,2)
X[:,i] = A_inv * B[:,i]
end
return X
end
function solve_factoring(A, B)
X = similar(B)
A = factorize(A)
for i in 1:size(B,2)
X[:,i] = A \ B[:,i]
end
return X
end
@btime solve_inverting($A, $B)
@btime solve_factoring($A, $B)
# even better, use built-in feature for multiple RHS
@btime $A \ $B;
6.394 ms (68 allocations: 205.28 KiB) 2.650 ms (96 allocations: 155.59 KiB) 2.439 ms (6 allocations: 102.63 KiB)
Some matrices are already in a convenient form and require no further factoring.
For example, consider solving a system with an UpperTriangular matrix,
b = [1.0, 2.0, 3.0]
U = UpperTriangular([1.0 2.0 3.0; 0.0 5.0 6.0; 0.0 0.0 9.0])
3×3 UpperTriangular{Float64,Array{Float64,2}}: 1.0 2.0 3.0 ⋅ 5.0 6.0 ⋅ ⋅ 9.0
This system is especially easy to solve using back-substitution. In particular, $ x_3 = b_3 / U_{33}, x_2 = (b_2 - x_3 U_{23})/U_{22} $, etc.
U \ b
3-element Array{Float64,1}: 0.0 0.0 0.3333333333333333
A LowerTriangular matrix has similar properties and can be solved with forward-substitution.
The computational order of back-substitution and forward-substitution is $ O(N^2) $ for dense matrices. In fact, the triangular structure is often a target of factorizations in order to exploit those efficient $ O(N^2) $ algorithms.
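To make the algorithm concrete, here is a minimal (unoptimized) back-substitution sketch; it is only illustrative, since the built-in triangular solve is what the \ operator actually calls:
function back_substitution(U, b)
    N = length(b)
    x = zeros(N)
    for i in N:-1:1 # work upwards from the last row
        x[i] = (b[i] - dot(U[i, i+1:end], x[i+1:end])) / U[i, i]
    end
    return x
end
back_substitution(U, b) ≈ U \ b # true for the U and b defined above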
The $ LU $ decomposition finds a lower triangular matrix $ L $ and an upper triangular matrix $ U $ such that $ L U = A $.
For a general dense matrix without any other structure (i.e. not known to be symmetric, tridiagonal, etc.) this is the standard approach to solve a system and exploit the speed of backward and forward substitution using the factorization.
The computational order of the LU decomposition itself for a dense matrix is $ O(N^3) $ - the same as Gaussian elimination - but it tends to have a better constant term than other methods (e.g. half the number of operations of the QR decomposition). For structured or sparse matrices, that order drops.
We can see which algorithm Julia will use for the \ operator by looking at the factorize function for a given matrix.
N = 4
A = rand(N,N)
b = rand(N)
Af = factorize(A) # chooses the right factorization, LU here
LU{Float64,Array{Float64,2}} L factor: 4×4 Array{Float64,2}: 1.0 0.0 0.0 0.0 0.563082 1.0 0.0 0.0 0.730109 0.912509 1.0 0.0 0.114765 0.227879 0.115228 1.0 U factor: 4×4 Array{Float64,2}: 0.79794 0.28972 0.765939 0.496278 0.0 0.82524 0.23962 -0.130989 0.0 0.0 -0.447888 0.374303 0.0 0.0 0.0 0.725264
In this case, it provides an $ L $ and $ U $ factorization (with pivoting).
With the factorization complete, we can solve for different right-hand sides b.
Af \ b
4-element Array{Float64,1}: -0.4984260549573151 -0.11835721499695559 1.5055538550184813 0.07694455957797534
b2 = rand(N)
Af \ b2
4-element Array{Float64,1}: -0.6456780666059366 -0.26015157376547593 1.1168895662966312 0.5405293106660054
The decomposition also includes a permutation matrix $ P $ such that $ P A = L U $.
Af.P * A ≈ Af.L * Af.U
true
We can also directly calculate an LU decomposition without pivoting,
L, U = lu(A, Val(false)) # the Val(false) provides solution without permutation matrices
LU{Float64,Array{Float64,2}} L factor: 4×4 Array{Float64,2}: 1.0 0.0 0.0 0.0 0.730109 1.0 0.0 0.0 0.563082 1.09588 1.0 0.0 0.114765 0.249728 0.122733 1.0 U factor: 4×4 Array{Float64,2}: 0.79794 0.28972 0.765939 0.496278 0.0 0.753039 -0.229233 0.254774 0.0 0.0 0.490832 -0.410191 0.0 0.0 0.0 0.725264
And we can verify the decomposition
A ≈ L * U
true
To see roughly how the solver works, note that we can write the problem $ A x = b $ as $ L U x = b $. Let $ U x = y $, which breaks the problem into two sub-problems.
$$ \begin{aligned} L y &= b\\ U x &= y \end{aligned} $$
As we saw above, this is the solution to two triangular systems, which can be efficiently done with back or forward substitution in $ O(N^2) $ operations.
To demonstrate this, first solve the lower-triangular system for $ y $,
y = L \ b
4-element Array{Float64,1}: 0.759344042755733 -0.41464678155905965 0.707411438334498 0.05580508465599854
x = U \ y
x ≈ A \ b # Check identical
true
The LU decomposition also has specialized algorithms for structured matrices, such as a Tridiagonal
N = 1000
b = rand(N)
A = Tridiagonal([fill(0.1, N-2); 0.2], fill(0.8, N), [0.2; fill(0.1, N-2);])
factorize(A) |> typeof
LU{Float64,Tridiagonal{Float64,Array{Float64,1}}}
This factorization is the key to the performance of A \ b in this case. For tridiagonal matrices, the LU decomposition is $ O(N) $.
Finally, just as a dense matrix without any structure uses an LU decomposition to solve a system, so do the sparse solvers.
A_sparse = sparse(A)
factorize(A_sparse) |> typeof # dropping the tridiagonal structure to just become sparse
SuiteSparse.UMFPACK.UmfpackLU{Float64,Int64}
benchmark_solve(A, b)
benchmark_solve(A_sparse, b);
A\b for typeof(A) = Tridiagonal{Float64,Array{Float64,1}} 28.200 μs (9 allocations: 47.75 KiB) A\b for typeof(A) = SparseMatrixCSC{Float64,Int64} 788.499 μs (69 allocations: 1.06 MiB)
With sparsity, the computational order is related to the number of non-zeros rather than the size of the matrix itself.
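As an illustrative (and machine-dependent) sketch of that point, compare two sparse systems of identical size but different numbers of non-zeros:
N = 1000
b = rand(N)
A_sparser = sprand(N, N, 0.001) + I # about 0.1 percent of entries non-zero
A_denser = sprand(N, N, 0.005) + I # about 0.5 percent of entries non-zero
@show nnz(A_sparser), nnz(A_denser)
@btime $A_sparser \ $b
@btime $A_denser \ $b;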
For real, symmetric, positive definite matrices, a Cholesky decomposition is a specialized example of the LU decomposition where $ L = U' $.
The Cholesky is directly useful on its own (e.g. Classical Control with Linear Algebra), but it is also an efficient factorization for solving symmetric positive definite systems.
As always, symmetry allows specialized algorithms.
N = 500
B = rand(N,N)
A_dense = B' * B # an easy way to generate a symmetric positive definite matrix
A = Symmetric(A_dense) # flags the matrix as symmetric
factorize(A) |> typeof
BunchKaufman{Float64,Array{Float64,2}}
Here, the factorization of $ A $ is Bunch-Kaufman rather than Cholesky, because Julia doesn't know that the matrix is positive definite. We can manually factorize with a Cholesky instead,
cholesky(A) |> typeof
Cholesky{Float64,Array{Float64,2}}
Benchmarking,
b = rand(N)
cholesky(A) \ b # use the factorization to solve
benchmark_solve(A, b)
benchmark_solve(A_dense, b)
@btime cholesky($A, check=false) \ $b;
A\b for typeof(A) = Symmetric{Float64,Array{Float64,2}} 23.439 ms (8 allocations: 2.16 MiB) A\b for typeof(A) = Array{Float64,2} 13.218 ms (5 allocations: 1.92 MiB) 4.215 ms (7 allocations: 1.91 MiB)
Previously, we learned about applications of the QR decomposition to solving the linear least squares problem.
While in principle, the solution to least-squares
$$ \min_x \| Ax -b \|^2 $$
is $ x = (A'A)^{-1}A'b $, in practice note that $ A'A $ becomes dense and calculating the inverse is rarely a good idea.
The QR decomposition is a decomposition $ A = Q R $ where $ Q $ is an orthogonal matrix (i.e. $ Q'Q = Q Q' = I $) and $ R $ is an upper triangular matrix.
In the previous derivation, we showed that, given the decomposition, the least squares problem can be written as the solution to
$$ R x = Q' b $$
where, as discussed above, the upper-triangular structure of $ R $ means it can be solved easily with back substitution.
The \ operator will solve the linear least squares problem whenever the given A is rectangular.
N = 10
M = 3
x_true = rand(3)
A = rand(N,M) .+ randn(N)
b = rand(N)
x = A \ b
3-element Array{Float64,1}: 0.4011747124872587 0.07361080010718472 -0.23478068012724587
To manually use the QR decomposition in solving linear least squares:
Af = qr(A)
Q = Af.Q
R = [Af.R; zeros(N - M, M)] # Stack with zeros
@show Q * R ≈ A
x = R \ Q'*b # simplified QR solution for least squares
Q * R ≈ A = true
3-element Array{Float64,1}: 0.4011747124872585 0.07361080010718468 -0.23478068012724573
This stacks the R with zeros, but the more specialized algorithm would not multiply directly in that way.
In some cases, if an LU is not available for a particular matrix structure, the QR factorization can also be used to solve systems of equations (i.e. not just LLS). This tends to be about 2x slower than the LU, but is of the same computational order.
Deriving the approach: we can now use the inverse, since the system is square and we assumed $ A $ is non-singular.
$$ \begin{aligned} A x &= b\\ Q R x &= b\\ Q^{-1} Q R x &= Q^{-1} b\\ R x &= Q' b \end{aligned} $$
where the last step uses that $ Q^{-1} = Q' $ for an orthogonal matrix.
Given the decomposition, the solution for dense matrices is of computational order $ O(N^2) $. To see this, note that forming $ Q' b $ is a matrix-vector product ($ O(N^2) $), and solving $ R x = Q' b $ by back substitution is also $ O(N^2) $.
In all cases, the order would drop depending on the sparsity pattern of the matrix (and corresponding decomposition). A key benefit of a QR decomposition is that it tends to maintain sparsity.
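For example (a small sketch; the exact factorization type printed depends on the Julia and SuiteSparse versions), calling qr on a sparse matrix dispatches to a sparse QR routine rather than forming a dense factorization:
A_s = sprand(100, 100, 0.05)
qr(A_s) |> typeof # a SuiteSparse sparse QR object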
Without implementing the full process, you can form a QR factorization with qr and then use it to solve a system.
N = 5
A = rand(N,N)
b = rand(N)
@show A \ b
@show qr(A) \ b;
A \ b = [-1.4780409419445582, 2.0987575263439306, -0.6857071090150311, -0.16849538664184538, 2.012803045177841] qr(A) \ b = [-1.478040941944558, 2.09875752634393, -0.6857071090150318, -0.16849538664184394, 2.012803045177841]
A spectral decomposition, also known as an eigendecomposition, finds all of the eigenvectors and eigenvalues to decompose a square matrix A such that
$$ A = Q \Lambda Q^{-1} $$
where $ Q $ is a matrix made of the eigenvectors of $ A $ as columns, and $ \Lambda $ is a diagonal matrix of the eigenvalues. Only square, diagonalizable matrices have an eigendecomposition (where a matrix is not diagonalizable if it does not have a full set of linearly independent eigenvectors).
In Julia, whenever you ask for a full set of eigenvectors and eigenvalues, it will find them through this decomposition using an algorithm appropriate for the matrix type. For example, symmetric, hermitian, or tridiagonal matrices have their own algorithms.
To see this in operation,
A = Symmetric(rand(5, 5)) # symmetric matrices have real eigenvectors/eigenvalues
A_eig = eigen(A)
Λ = Diagonal(A_eig.values)
Q = A_eig.vectors
norm(Q * Λ * inv(Q) - A)
3.664849993902364e-15
Keep in mind that a real matrix may have complex eigenvalues and eigenvectors, so if you check Q * Λ * inv(Q) - A the result may not be exactly zero, or even exactly real, due to numerical inaccuracy.
In the previous lecture on discrete time Markov chains, we saw that the transition probability between state $ x $ and state $ y $ was summarized by the matrix $ P(x, y) := \mathbb P \{ X_{t+1} = y \,|\, X_t = x \} $.
As a brief introduction to continuous time processes, consider the same state-space as in the discrete case: $ S $ a finite set with $ n $ elements $ \{x_1, \ldots, x_n\} $.
A Markov chain $ \{X_t\} $ on $ S $ is a sequence of random variables on $ S $ that have the Markov property
In continuous time, the Markov Property is more complicated, but intuitively is the same as the discrete time case.
That is, knowing the current state is enough to know probabilities for future states. Or, for realizations $ x(\tau)\in S, \tau \leq t $,
$$ \mathbb P \{ X(t+s) = y \,|\, X(t) = x, X(\tau) = x(\tau) \text{ for } 0 \leq \tau \leq t \} = \mathbb P \{ X(t+s) = y \,|\, X(t) = x\} $$
Heuristically, consider a time period $ t $ and a small step forward $ \Delta $. Then the probability to transition from state $ i $ to state $ j $ is
$$ \mathbb P \{ X(t + \Delta) = j \,|\, X(t) = i \} = \begin{cases} q_{ij} \Delta + o(\Delta) & i \neq j\\ 1 + q_{ii} \Delta + o(\Delta) & i = j \end{cases} $$
where $ q_{ij} $ are parameters governing the transition process, and $ o(\Delta) $ is little-o notation. That is, $ \lim_{\Delta\to 0} o(\Delta)/\Delta = 0 $.
Just as in the discrete case, we can summarize these parameters with an $ N \times N $ matrix, $ Q \in R^{N\times N} $.
Recall that in the discrete case every element is weakly positive and every row must sum to one. In continuous time, instead, the rows of $ Q $ sum to zero, where the diagonal contains the negative of the total intensity of jumping out of the current state. That is,
$$ q_{ij} \geq 0 \text{ for } i \neq j, \quad q_{ii} = -\sum_{j \neq i} q_{ij} $$
The $ Q $ matrix is called the intensity matrix, or the infinitesimal generator of the Markov chain. For example,
$$ Q = \begin{bmatrix} -0.1 & 0.1 & 0 & 0 & 0 & 0\\ 0.1 &-0.2 & 0.1 & 0 & 0 & 0\\ 0 & 0.1 & -0.2 & 0.1 & 0 & 0\\ 0 & 0 & 0.1 & -0.2 & 0.1 & 0\\ 0 & 0 & 0 & 0.1 & -0.2 & 0.1\\ 0 & 0 & 0 & 0 & 0.1 & -0.1\\ \end{bmatrix} $$
In that example, transitions only occur between adjacent states with the same intensity (except for a "bouncing" back of the bottom and top states).
Implementing the $ Q $ using its tridiagonal structure
using LinearAlgebra
α = 0.1
N = 6
Q = Tridiagonal(fill(α, N-1), [-α; fill(-2α, N-2); -α], fill(α, N-1))
6×6 Tridiagonal{Float64,Array{Float64,1}}: -0.1 0.1 ⋅ ⋅ ⋅ ⋅ 0.1 -0.2 0.1 ⋅ ⋅ ⋅ ⋅ 0.1 -0.2 0.1 ⋅ ⋅ ⋅ ⋅ 0.1 -0.2 0.1 ⋅ ⋅ ⋅ ⋅ 0.1 -0.2 0.1 ⋅ ⋅ ⋅ ⋅ 0.1 -0.1
Here we can use the Tridiagonal type to exploit the structure of the problem.
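As a quick sanity check on the conditions above (a small verification sketch using the Q just defined), the rows of the intensity matrix sum to zero:
sum(Q, dims = 2) # each row sums to (numerically) zero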
Consider a simple payoff vector $ r $ associated with each state, and a discount rate $ ρ $. Then we can solve for the expected present discounted value in a similar way to the discrete time case.
$$ \rho v = r + Q v $$
or rearranging slightly, solving the linear system
$$ (\rho I - Q) v = r $$
For our example, exploiting the tridiagonal structure,
r = range(0.0, 10.0, length=N)
ρ = 0.05
A = ρ * I - Q
6×6 Tridiagonal{Float64,Array{Float64,1}}: 0.15 -0.1 ⋅ ⋅ ⋅ ⋅ -0.1 0.25 -0.1 ⋅ ⋅ ⋅ ⋅ -0.1 0.25 -0.1 ⋅ ⋅ ⋅ ⋅ -0.1 0.25 -0.1 ⋅ ⋅ ⋅ ⋅ -0.1 0.25 -0.1 ⋅ ⋅ ⋅ ⋅ -0.1 0.15
Note that this $ A $ matrix is maintaining the tridiagonal structure of the problem, which leads to an efficient solution to the linear problem.
v = A \ r
6-element Array{Float64,1}: 38.15384615384615 57.23076923076923 84.92307692307693 115.07692307692311 142.76923076923077 161.84615384615384
The $ Q $ matrix is also used to calculate the evolution of the Markov chain, in direct analogy to the $ ψ_{t+k} = ψ_t P^k $ evolution with the transition matrix $ P $ in the discrete case.
In the continuous case, this becomes the system of linear differential equations
$$ \dot{ψ}(t) = Q(t)^T ψ(t) $$
given the initial condition $ \psi(0) $, where the intensity matrix $ Q(t) $ is allowed to vary with time. In the simplest case of a constant $ Q $ matrix, this is a constant-coefficient system of linear ODEs with coefficient matrix $ Q^T $.
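In the constant-$ Q $ case, the solution can be written explicitly as $ \psi(t) = e^{Q^T t} \psi(0) $. A small sketch (not part of the lecture's code) using the dense matrix exponential from LinearAlgebra:
t = 10.0
ψ_0 = [1.0; fill(0.0, N-1)] # start in the first state
ψ_t = exp(Matrix(Q') * t) * ψ_0 # exp of a dense square matrix is the matrix exponential
@show sum(ψ_t); # probability mass is preserved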
If a stationary equilibrium exists, note that $ \dot{ψ}(t) = 0 $, so the stationary solution $ ψ^{*} $ needs to fulfill
$$ 0 = Q^T ψ^{*} $$
Notice that this is of the form $ 0 ψ^{*} = Q^T ψ^{*} $, and hence is equivalent to finding the eigenvector associated with the $ \lambda = 0 $ eigenvalue.
With our example, we can calculate all of the eigenvalues and eigenvectors
λ, vecs = eigen(Array(Q'))
Eigen{Float64,Float64,Array{Float64,2},Array{Float64,1}} eigenvalues: 6-element Array{Float64,1}: -0.3732050807568874 -0.29999999999999993 -0.19999999999999998 -0.09999999999999995 -0.026794919243112274 0.0 eigenvectors: 6×6 Array{Float64,2}: -0.149429 -0.288675 0.408248 0.5 -0.557678 0.408248 0.408248 0.57735 -0.408248 1.38778e-16 -0.408248 0.408248 -0.557678 -0.288675 -0.408248 -0.5 -0.149429 0.408248 0.557678 -0.288675 0.408248 -0.5 0.149429 0.408248 -0.408248 0.57735 0.408248 7.63278e-16 0.408248 0.408248 0.149429 -0.288675 -0.408248 0.5 0.557678 0.408248
Indeed, there is a $ \lambda = 0 $ eigenvalue, which is associated with the last column of the eigenvector matrix. To turn it into a probability distribution, we need to normalize it.
vecs[:,N] ./ sum(vecs[:,N])
6-element Array{Float64,1}: 0.16666666666666657 0.16666666666666657 0.1666666666666667 0.16666666666666682 0.16666666666666685 0.16666666666666663
A frequent case in discretized models is dealing with Markov chains with multiple “spatial” dimensions (e.g. wealth and income).
After discretizing a process to create a Markov chain, you can always take the Cartesian product of the sets of states in order to enumerate them as a single state variable.
To see this, consider states $ i $ and $ j $ governed by infinitesimal generators $ Q $ and $ A $, respectively.
function markov_chain_product(Q, A)
M = size(Q, 1)
N = size(A, 1)
Q = sparse(Q)
Qs = blockdiag(fill(Q, N)...) # create diagonal blocks of every operator
As = kron(A, sparse(I(M)))
return As + Qs
end
α = 0.1
N = 4
Q = Tridiagonal(fill(α, N-1), [-α; fill(-2α, N-2); -α], fill(α, N-1))
A = sparse([-0.1 0.1
0.2 -0.2])
M = size(A,1)
L = markov_chain_product(Q, A)
L |> Matrix # display as a dense matrix
8×8 Array{Float64,2}: -0.2 0.1 0.0 0.0 0.1 0.0 0.0 0.0 0.1 -0.3 0.1 0.0 0.0 0.1 0.0 0.0 0.0 0.1 -0.3 0.1 0.0 0.0 0.1 0.0 0.0 0.0 0.1 -0.2 0.0 0.0 0.0 0.1 0.2 0.0 0.0 0.0 -0.3 0.1 0.0 0.0 0.0 0.2 0.0 0.0 0.1 -0.4 0.1 0.0 0.0 0.0 0.2 0.0 0.0 0.1 -0.4 0.1 0.0 0.0 0.0 0.2 0.0 0.0 0.1 -0.3
This provides the combined Markov chain for the $ (i,j) $ process. To see the sparsity pattern,
using Plots
spy(L, markersize = 10)
For a simple dynamic pricing problem, suppose the payoff of being in state $ (i,j) $ is $ r_{ij} = i + 2j $.
r = [i + 2.0j for i in 1:N, j in 1:M]
r = vec(r) # vectorize it since stacked in same order
8-element Array{Float64,1}: 3.0 4.0 5.0 6.0 5.0 6.0 7.0 8.0
Solving the equation $ \rho v = r + L v $
ρ = 0.05
v = (ρ * I - L) \ r
reshape(v, N, M)
4×2 Array{Float64,2}: 87.8992 93.6134 96.1345 101.849 106.723 112.437 114.958 120.672
The reshape helps to rearrange it back to being two-dimensional.
To find the stationary distribution, we calculate the eigenvalues and choose the eigenvector associated with $ \lambda=0 $.
L_eig = eigen(Matrix(L'))
@assert norm(L_eig.values[end]) < 1E-10
ψ = L_eig.vectors[:,end]
ψ = ψ / sum(ψ)
8-element Array{Float64,1}: 0.16666666666666669 0.1666666666666665 0.16666666666666669 0.1666666666666668 0.08333333333333327 0.08333333333333344 0.08333333333333331 0.0833333333333334
We can reshape this to be two dimensional if it is helpful for visualization
reshape(ψ, N, size(A,1))
4×2 Array{Float64,2}: 0.166667 0.0833333 0.166667 0.0833333 0.166667 0.0833333 0.166667 0.0833333
As with the discrete time Markov chains, a key question is whether CTMCs are reducible, i.e. states communicate. The problem is isomorphic to determining if the directed graph of the Markov chain is strongly connected.
using LightGraphs
α = 0.1
N = 6
Q = Tridiagonal(fill(α, N-1), [-α; fill(-2α, N-2); -α], fill(α, N-1))
6×6 Tridiagonal{Float64,Array{Float64,1}}: -0.1 0.1 ⋅ ⋅ ⋅ ⋅ 0.1 -0.2 0.1 ⋅ ⋅ ⋅ ⋅ 0.1 -0.2 0.1 ⋅ ⋅ ⋅ ⋅ 0.1 -0.2 0.1 ⋅ ⋅ ⋅ ⋅ 0.1 -0.2 0.1 ⋅ ⋅ ⋅ ⋅ 0.1 -0.1
We can convert the intensity matrix into a directed graph and check whether it is strongly connected (i.e. whether the chain is irreducible).
Q_graph = DiGraph(Q)
@show is_strongly_connected(Q_graph); # i.e. can follow directional edges to get to every state
is_strongly_connected(Q_graph) = true
Alternatively, consider an example of a reducible Markov chain, where states $ 1 $ and $ 2 $ cannot jump to state $ 3 $.
Q = [-0.2 0.2 0
0.2 -0.2 0
0.2 0.6 -0.2]
Q_graph = DiGraph(Q)
@show is_strongly_connected(Q_graph);
is_strongly_connected(Q_graph) = false
A tridiagonal matrix has 3 non-zero diagonals: the main diagonal, the first sub-diagonal (i.e. the one below the main diagonal), and the first super-diagonal (i.e. the one above the main diagonal).
This is a special case of a more general type called a banded matrix, where the number of sub and super-diagonals can be greater than 1. The total width of main, sub-, and super-diagonals is called the bandwidth. For example, a tridiagonal matrix has a bandwidth of 3.
An $ N \times N $ banded matrix with bandwidth $ P $ has about $ N P $ non-zeros in its sparsity pattern.
These can be created directly as a dense matrix with diagm
diagm(1 => [1,2,3], -1 => [4,5,6])
4×4 Array{Int64,2}: 0 1 0 0 4 0 2 0 0 5 0 3 0 0 6 0
Or as a sparse matrix,
spdiagm(1 => [1,2,3], -1 => [4,5,6])
4×4 SparseMatrixCSC{Int64,Int64} with 6 stored entries: [2, 1] = 4 [1, 2] = 1 [3, 2] = 5 [2, 3] = 2 [4, 3] = 6 [3, 4] = 3
Or, directly using BandedMatrices.jl
using BandedMatrices
BandedMatrix(1 => [1,2,3], -1 => [4,5,6])
4×4 BandedMatrix{Int64,Array{Int64,2},Base.OneTo{Int64}}: 0 1 ⋅ ⋅ 4 0 2 ⋅ ⋅ 5 0 3 ⋅ ⋅ 6 0
There is also a convenience function for generating random banded matrices
A = brand(7, 7, 3, 1) # 7x7 matrix, 3 subdiagonals, 1 superdiagonal
7×7 BandedMatrix{Float64,Array{Float64,2},Base.OneTo{Int64}}: 0.909192 0.999804 ⋅ ⋅ ⋅ ⋅ ⋅ 0.365279 0.231345 0.449552 ⋅ ⋅ ⋅ ⋅ 0.719595 0.32635 0.374328 0.815217 ⋅ ⋅ ⋅ 0.441325 0.486003 0.386271 0.770824 0.976365 ⋅ ⋅ ⋅ 0.438857 0.424792 0.624999 0.177116 0.135212 ⋅ ⋅ ⋅ 0.754709 0.155209 0.916274 0.0422574 0.0167215 ⋅ ⋅ ⋅ 0.41038 0.13815 0.163309 0.528277
And, of course, specialized algorithms will be used to exploit the structure when solving linear systems. In particular, the complexity of a solve is related to $ O(N P_L P_U) $ for lower and upper bandwidths $ P_L $ and $ P_U $.
@show factorize(Symmetric(A)) |> typeof
A \ rand(7)
factorize(Symmetric(A)) |> typeof = LDLt{Float64,Symmetric{Float64,BandedMatrix{Float64,Array{Float64,2},Base.OneTo{Int64}}}}
7-element Array{Float64,1}: -2.0656030361729125 2.093360259272618 2.793643054381428 0.05325581046848979 -1.1075022789389772 -8.08705215511709 3.829925059373694
Recall the famous quote from Knuth: “97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%”. The most common example of premature optimization is trying to use your own mental model of a compiler while writing your code, worried about the efficiency of code and (usually incorrectly) second-guessing the compiler.
Concretely, the lessons in this section are:
Write clear code first, and only optimize once you have identified a genuine performance problem.
Benchmarking (e.g. @btime) and profiling are the tools to figure out performance bottlenecks. If 99% of computing time is spent in one small function, then there is no point optimizing anything else.
Only then rewrite the critical pieces, ideally by relying on an existing high-performance library.
You will rarely get to the last of these steps.
However, there is also a corollary: “don’t pessimize prematurely”. That is, don’t make choices that lead to poor performance without any tradeoff in improved code clarity. For example, writing your own algorithms when a high performance algorithm exists in a package or Julia itself, or lazily making a matrix dense and carelessly dropping its structure.
Numerical analysts sometimes refer to the lowest level of code for basic operations (e.g. a dot product, matrix-matrix product, convolutions) as kernels.
That sort of code is difficult to write, and performance depends on the characteristics of the underlying hardware such as the instruction set available on the particular CPU, the size of the CPU cache, and the layout of arrays in memory.
Typically, these operations are written in a BLAS library, organized into different levels. The levels roughly correspond to the computational order of the operations: BLAS Level 1 covers $ O(N) $ operations such as inner products, Level 2 covers $ O(N^2) $ operations such as matrix-vector products, and Level 3 covers $ O(N^3) $ operations such as general matrix-matrix products.
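As a small illustration (a sketch using wrappers from the LinearAlgebra standard library; the set of convenience wrappers can vary slightly across Julia versions), you can call BLAS kernels directly:
x, y = rand(1000), rand(1000)
BLAS.dot(x, y) # Level 1: O(N) inner product
A, B = rand(500, 500), rand(500, 500)
C = BLAS.gemm('N', 'N', 1.0, A, B) # Level 3: O(N^3) general matrix-matrix multiply
C ≈ A * B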
An example of a BLAS library is OpenBLAS, used by default in Julia, or Intel MKL, used in Matlab (and in Julia if the MKL.jl package is installed).
On top of BLAS are LAPACK operations, which are higher level kernels, such as matrix factorizations and eigenvalue algorithms, and are often in the same libraries (e.g. MKL has both BLAS and LAPACK functionality).
The details of these packages are not especially relevant, but if you are talking about performance, people will inevitably start discussing these different packages and kernels, so it is worth recognizing the names.
There is a practical performance issue which may influence your code: since memory in a CPU is linear, dense matrices need to be stored by either stacking columns (called column-major order) or stacking rows (row-major order).
The reason this matters is that code performs better when it works through contiguous chunks of memory, and this becomes especially important with large matrices due to the interaction with the CPU cache. Choosing the wrong order when there is no benefit in code clarity is an example of premature pessimization. The performance difference can be orders of magnitude in some cases, and negligible in others.
One option is to use functions that let the compiler choose the most efficient way to traverse memory. If you need to choose the looping order yourself, you might want to experiment with swapping whether you go through columns or rows first. Other times, let Julia decide: enumerate and eachindex will choose the right approach.
Julia, Fortran, and Matlab all use column-major order while C/C++ and Python use row-major order. This means that if you find an algorithm written for C/C++/Python you will sometimes need to make small changes if performance is an issue.
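A small, illustrative benchmark of this point (the function names here are just for illustration): iterating down columns in the innermost loop touches contiguous memory, while iterating across rows does not.
function sum_by_columns(A) # inner loop over rows: contiguous access in column-major order
    s = 0.0
    for j in axes(A, 2), i in axes(A, 1)
        s += A[i, j]
    end
    return s
end
function sum_by_rows(A) # inner loop over columns: strided access
    s = 0.0
    for i in axes(A, 1), j in axes(A, 2)
        s += A[i, j]
    end
    return s
end
A = rand(1000, 1000)
@btime sum_by_columns($A)
@btime sum_by_rows($A);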
While we have usually not considered optimizing code for performance (and focused on the choice of algorithms instead), when matrices and vectors become large we need to be more careful.
The most important thing to avoid is excess allocations, which usually occur due to the use of temporary vectors and matrices when they are not necessary. Sometimes those extra temporaries can cause enormous degradations in performance.
However, caution is suggested since allocations are never relevant for scalar values, and allocating temporaries can sometimes even produce faster code for small matrices/vectors, since it can lead to better cache locality.
To see this, a convenient tool is the benchmarking provided by BenchmarkTools.
using BenchmarkTools
A = rand(10,10)
B = rand(10,10)
C = similar(A)
function f!(C, A, B)
D = A*B
C .= D .+ 1
end
@btime f!($C, $A, $B)
565.405 ns (1 allocation: 896 bytes)
10×10 Array{Float64,2}: 3.00017 3.30146 3.19421 2.931 … 2.86516 3.45795 3.21823 3.56728 2.89356 3.80696 3.9359 3.08818 2.99012 3.91756 3.96194 3.40309 3.6578 4.14182 4.19339 3.57138 3.26475 4.093 4.39952 4.12536 3.01479 4.01799 4.3451 3.16811 3.35607 3.71766 4.18068 3.43241 2.51483 2.71889 2.85436 2.52423 2.2663 3.0724 2.99511 3.15074 4.69762 4.59687 4.63039 3.99239 … 3.62851 4.64178 4.50716 4.0128 3.19921 3.89976 3.7035 3.37346 3.1357 3.71877 4.38951 3.98701 3.74569 4.80772 4.5111 3.80346 3.54253 4.48217 4.33268 3.89957 2.74908 4.21825 4.01952 3.31269 2.81493 3.87101 4.41809 3.37978 2.17474 2.02398 2.66357 1.88612 1.94634 2.32106 2.2074 2.15534
The ! on f! is an informal way to indicate that the function mutates its arguments, and by convention the first argument (C here) holds the modified values.
In the f! function, notice that D is a temporary matrix which is created and then used to update C. But since C is modified directly, there is no need to create the temporary D matrix at all.
This is an example of where an inplace version of the matrix multiplication can help avoid the allocation.
function f2!(C, A, B)
mul!(C, A, B) # in place multiplication
C .+= 1
end
A = rand(10,10)
B = rand(10,10)
C = similar(A)
@btime f!($C, $A, $B)
@btime f2!($C, $A, $B)
555.135 ns (1 allocation: 896 bytes) 498.964 ns (0 allocations: 0 bytes)
10×10 Array{Float64,2}: 4.73813 3.77289 4.383 3.43855 … 3.83811 2.57423 3.26083 3.42563 3.65135 3.03401 3.30964 2.9681 3.06587 2.09178 2.50549 3.09057 4.18424 3.39865 4.01303 2.76115 3.61969 2.45743 2.83093 3.1237 3.51628 2.8582 2.66244 2.81682 2.81641 1.85632 2.65815 2.71383 5.97982 4.57594 4.96922 4.31472 4.91866 2.93451 4.02817 4.25554 3.83973 2.98239 3.41323 2.29676 … 3.98253 2.41441 2.75848 2.83409 3.01026 2.68221 2.87022 1.97828 2.99349 1.9412 2.40943 2.48763 5.29417 4.30378 4.562 4.15332 4.42037 2.69636 3.68781 3.85869 4.68153 3.62809 4.19926 3.56273 3.97236 2.59404 2.95793 3.62902 2.59148 2.75541 3.42386 1.86871 2.59298 2.22461 2.25589 2.20528
Note in the benchmarking output that f2! is non-allocating, as it uses the pre-allocated C variable directly.
Another example of this is solving linear equations, where for large systems you may pre-allocate and reuse the solution vector.
A = rand(10,10)
y = rand(10)
z = A \ y # creates temporary
A = factorize(A) # inplace requires factorization
x = similar(y) # pre-allocate
ldiv!(x, A, y) # inplace left divide, using factorization
10-element Array{Float64,1}: 0.07108718396179413 -0.08329715154614271 -0.15580851293529174 0.6975734336112906 -0.5530004580356502 0.18264128633539528 0.7251824686591718 0.5075902097757793 -0.6383442788622705 -0.443019398823186
However, if you benchmark carefully, you will see that this is sometimes slower. Avoiding allocations is not always a good idea - and worrying about it prior to benchmarking is premature optimization.
There are a variety of other non-allocating versions of functions. For example,
A = rand(10,10)
B = similar(A)
transpose!(B, A) # non-allocating version of B = transpose(A)
10×10 Array{Float64,2}: 0.70827 0.411295 0.0834197 … 0.11247 0.904854 0.827702 0.154587 0.448786 0.935975 0.998019 0.481155 0.218838 0.584614 0.519837 0.918883 0.336193 0.993505 0.947485 0.644861 0.0350973 0.410407 0.00477567 0.442824 0.836457 0.308395 0.336602 0.95752 0.0847486 0.0680403 0.578927 0.137693 0.411517 0.826035 … 0.617513 0.487702 0.115422 0.95813 0.431378 0.169727 0.131022 0.602005 0.31347 0.276449 0.794457 0.69505 0.123424 0.812738 0.695384 0.810153 0.135141 0.791019 0.992133 0.16448 0.745776 0.957567 0.186042 0.650625 0.710451 0.977151 0.426741
Finally, a common source of unnecessary allocations is taking slices or portions of matrices. For example, the following allocates a new vector B and copies the values.
A = rand(5,5)
B = A[2,:] # extract a vector
5-element Array{Float64,1}: 0.7731783572897324 0.9146937700753357 0.7094278236445719 0.15601518309763196 0.6214939938853177
To see that B is a copy rather than a view of A, note that
A[2,1] = 100.0
@show A[2,1]
@show B[1];
A[2, 1] = 100.0 B[1] = 0.7731783572897324
Instead of allocating a new vector, you can take a view of the matrix, which provides an appropriate AbstractArray type that doesn't allocate new memory, using the @view macro.
A = rand(5,5)
B = @view A[2,:] # does not copy the data
A[2,1] = 100.0
@show A[2,1]
@show B[1];
A[2, 1] = 100.0 B[1] = 100.0
But, again, you will often find that using @view leads to slower code. Benchmark instead, and generally rely on views for large matrices and for contiguous chunks of memory (e.g. a column rather than a row).
This exercise provides practice in writing low-level routines (i.e. "kernels"), and will hopefully convince you to leave low-level code to the experts.
The formula for matrix multiplication is deceptively simple. For example, with the product of square matrices $ C = A B $ of size $ N \times N $, the $ i,j $ element of $ C $ is
$$ C_{ij} = \sum_{k=1}^N A_{ik} B_{kj} $$
Alternatively, you can take a row $ A_{i,:} $ and column $ B_{:, j} $ and use an inner product
$$ C_{ij} = A_{i,:} \cdot B_{:,j} $$
Note that the inner product in a discrete space is simply a sum, and has the same complexity as the sum (i.e. $ O(N) $ operations).
For a dense matrix without any structure, this also makes it clear why the complexity is $ O(N^3) $: you need to evaluate it for $ N^2 $ elements in the matrix and do an $ O(N) $ operation each time.
For this exercise, implement matrix multiplication yourself and compare performance in a few permutations.
Use the built-in function in Julia (i.e. C = A * B or, for a better comparison, the in-place version mul!(C, A, B), which works with pre-allocated data).
Loop over each $ C_{ij} $ by the row first (i.e. the i index) and use a for loop for the inner product.
Loop over each $ C_{ij} $ by the column first (i.e. the j index) and use a for loop for the inner product.
Do the same but use the dot product instead of the sum.
Choose your best implementation of these and, for a few different sizes (N=10, N=1000, etc.), compare the ratio of its performance to that of the built-in BLAS library.
A few more hints:
You can just use random matrices (e.g. A = rand(N, N)).
For all of them, pre-allocate the C matrix beforehand with C = similar(A) or something equivalent.
To compare performance, use the @btime macro to time it. Remember to escape globals if necessary (e.g. @btime f($A) rather than @btime f(A)).
Here we will calculate the evolution of the pdf $ \psi_t $ of a discrete-time Markov chain, given the initial condition $ \psi_0 $.
Start with a simple symmetric tridiagonal matrix
N = 100
A = Tridiagonal([fill(0.1, N-2); 0.2], fill(0.8, N), [0.2; fill(0.1, N-2)])
A_adjoint = A';
Pick some large $ T $ and use the initial condition $ \psi_0 = \begin{bmatrix} 1 & 0 & \ldots & 0\end{bmatrix} $. Calculate the sequence of distributions by iterating the map $ \psi_{t+1} = A' \psi_t $, and benchmark it against calculating $ (A')^T \psi_0 $ directly, taking the matrix power with A_adjoint^T, which uses specialized algorithms faster and more accurate than repeated matrix multiplication (but with the same computational order).
Note: The algorithm used in Julia to take matrix powers depends on the matrix structure, as always. In the symmetric case, it can use an eigendecomposition, whereas with a general dense matrix it uses squaring and scaling.
With the same setup as Exercise 2a, do an eigendecomposition of A_adjoint. That is, use eigen to factorize the adjoint $ A' = Q \Lambda Q^{-1} $, where $ Q $ is the matrix of eigenvectors and $ \Lambda $ the diagonal matrix of eigenvalues. Calculate $ Q^{-1} $ from the results.
Use the factored matrix to calculate the sequence of $ \psi_t = (A')^t \psi_0 $ using the relationship
$$ \psi_t = Q \Lambda^t Q^{-1} \psi_0 $$
where matrix powers of diagonal matrices are simply the element-wise power of each element.
Benchmark the speed of calculating the sequence of $ \psi_t $ up to T = 2N using this method. In principle, the factorization and easy calculation of the power should give you benefits compared to simply iterating the map as we did in Exercise 2a. Explain why it does or does not, using the computational order of each approach.