A successful family of methods, usually referred to as Conjugate Gradient-like algorithms, or Krylov subspace methods, can be viewed as Galerkin or least-squares methods applied to a linear system Ax=b. This view is different from the standard approaches to deriving the classical Conjugate Gradient method in the literature. Nevertheless, the fundamental ideas of least squares and Galerkin approximations can be used to derive the most popular and successful methods for linear systems, and this is the topic of the present chapter. Such a view may increase the general understanding of variational methods and their applicability.
Given a linear system
and a start vector x0, we want to construct an iterative solution method that produces approximations x1,x2,…, which hopefully converge to the exact solution x. In iteration no. k we seek an approximation
where qj∈Rn are known vectors
and cj are constants to be determined. To be specific, let q0,…,qk be basis vectors for Vk+1:
The associated inner product (⋅,⋅) is here the standard Euclidean inner product on Rn.
The corresponding error in the equation Ax=b, the residual, becomes
Galerkin's method requires the residual rk+1 to be orthogonal to the space Vk+1. This statement is equivalent to the residual being orthogonal to each basis vector:
Inserting the expression for rk+1 in (3) gives a linear system for cj:
The idea of the least-squares method is to minimize the square of the norm of the residual with respect to the free parameters c0,…,ck. That is, we minimize (rk+1,rk+1):
Since ∂rk+1/∂ci=−Aqi, this approach leads to the following linear system:
To obtain a complete algorithm, we need to establish a rule to update the basis B={q0,…,qk} for the next iteration. That is, we need to compute a new basis vector qk+1∈Vk+2 such that
is a basis for the space Vk+2 that is used in the next iteration. The present family of methods applies the Krylov space, where Vk is defined as
Some frequent names of the associated iterative methods are therefore Krylov subspace iterations, Krylov projection methods, or simply Krylov methods. It is a fact that Vk⊂Vk+1 and that r0, Ar0, …, Ak−1r0 are linearly independent vectors.
A potential formula for updating qk+1, such that qk+1∈Vk+2, is
(Since rk+1 involves Aqk, and qk∈Vk+1, multiplying by A raises the dimension of the Krylov space by 1, so Aqk∈Vk+2.) The free parameters βj can be used to enforce desirable orthogonality properties of q0,…,qk+1. For example, it is convenient to require that the coefficient matrices in the linear systems for c0,…,ck are diagonal. Otherwise, we must solve a (k+1)×(k+1) linear system in each iteration. If k should approach n, the systems for the coefficients ci are of the same size as our original system Ax=b! A diagonal matrix, however, ensures an efficient closed form solution for c0,…,ck.
To obtain a diagonal coefficient matrix, we require in Galerkin's method that
whereas in the least-squares method we require
We can define the inner product
provided A is symmetric and positive definite. Another useful inner product is
These inner products will be referred to as the A product, with the associated A norm, and the ATA product, with the associated ATA norm.
The orthogonality condition on the qi vectors is then ⟨qk+1,qi⟩=0 in the Galerkin method and [qk+1,qi]=0 in the least-squares method, where i runs from 0 to k. A standard Gram-Schmidt process can be used to construct qk+1 orthogonal to q0,…,qk. This leads to the determination of the constants β0,…,βk as
for i=0,…,k.
The orthogonality condition on the basis vectors qi leads to the following solution for c0,…,ck:
In iteration k, (rk,qi)=0 and (rk,Aqi)=0, for i=0,…,k−1, in the Galerkin and least squares case, respectively. Hence, ci=0, for i=0,…,k−1. In other words,
When A is symmetric and positive definite, one can show that also βi=0, for i=0,…,k−1, in both the Galerkin and least-squares methods. This means that xk and qk+1 can be updated using only qk and not the previous q0,…,qk−1 vectors. This property of course has dramatic effects on the storage requirements of the algorithms as the number of iterations increases.
For the suggested algorithms to work, we must require that the denominators in (13) and (14) do not vanish. This is always fulfilled for the least-squares method, while a (positive or negative) definite matrix A avoids break-down of the Galerkin-based iteration (provided qi≠0).
The Galerkin solution method for linear systems was originally devised as a direct method in the 1950s. After n iterations the exact solution is found in exact arithmetic, but at a higher cost than that of Gaussian elimination. The method therefore received little attention until researchers discovered, in the beginning of the 1970s, that it could produce a good approximation to x in k≪n iterations. The method, called the Conjugate Gradient method, has since attracted considerable interest as an iterative scheme for solving linear systems arising from PDEs discretized such that the coefficient matrix becomes sparse.
Finally, we mention how to terminate the iteration. The simplest criterion is ||rk+1||≤ϵr, where ϵr is a small prescribed tolerance. Sometimes it is more appropriate to use a relative residual, ||rk+1||/||r0||≤ϵr. Termination criteria for Conjugate Gradient-like methods are a subject of their own.
In the algorithm below, we have summarized the computational steps in the least-squares method. Notice that we update the residual recursively instead of using rk=b−Axk in each iteration since we then avoid a possibly expensive matrix-vector product.
given a start vector x0, compute r0=b−Ax0 and set q0=r0.
for k=0,1,2,… until termination criteria are fulfilled:
a. ck=(rk,Aqk)/[qk,qk]
b. xk+1=xk+ckqk
c. rk+1=rk−ckAqk
d. if A is symmetric then
   * βk=[rk+1,qk]/[qk,qk]
   * qk+1=rk+1−βkqk
e. else
   * βj=[rk+1,qj]/[qj,qj], j=0,…,k
   * qk+1=rk+1−∑_{j=0}^{k}βjqj
The algorithm above is just a summary of the steps in the derivation of the least-squares method and should not be directly used for practical computations without further developments.
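To make the steps concrete, here is a minimal Python sketch of the symmetric branch of the algorithm above. It is a demonstration only, not an optimized solver; the function name and the NumPy-based setup are our own choices for illustration, and A is assumed to be a NumPy array or SciPy sparse matrix.

```python
import numpy as np

def least_squares_symmetric(A, b, x0, eps=1e-8, max_iter=1000):
    """Sketch of the least-squares (conjugate residual) iteration for
    symmetric A: only the previous search direction q_k is needed."""
    x = np.array(x0, dtype=float)
    r = b - A @ x                      # r0 = b - A x0
    q = r.copy()                       # q0 = r0
    Aq = A @ q
    r0_norm = np.linalg.norm(r)
    for k in range(max_iter):
        c = (r @ Aq) / (Aq @ Aq)       # c_k = (r_k, A q_k)/[q_k, q_k]
        x = x + c * q                  # x_{k+1} = x_k + c_k q_k
        r = r - c * Aq                 # r_{k+1} = r_k - c_k A q_k
        if np.linalg.norm(r) <= eps * r0_norm:   # relative residual test
            break
        Ar = A @ r
        beta = (Ar @ Aq) / (Aq @ Aq)   # beta_k = [r_{k+1}, q_k]/[q_k, q_k]
        q = r - beta * q               # q_{k+1} = r_{k+1} - beta_k q_k
        Aq = Ar - beta * Aq            # A q_{k+1} without an extra product
    return x
```

The recursive update of Aqk+1 avoids a second matrix-vector product per iteration; a production code would add safeguards against breakdown and more careful termination criteria.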
When A is nonsymmetric, the storage requirements of q0,…,qk may be prohibitively large. It has become a standard trick to either truncate or restart the algorithm. In the latter case one restarts the algorithm every (K+1)-th step, i.e., one aborts the iteration and starts the algorithm again with x0=xK. The other alternative is to truncate the sum ∑_{j=0}^{k}βjqj and use only the last K vectors:
Both the restarted and truncated versions of the algorithm require storage of only K+1 basis vectors qk−K,…,qk. The basis vectors are also often called search direction vectors. The truncated version of the least-squares algorithm above is known in the literature as Orthomin(K), and its restarted counterpart as the Generalized Conjugate Residual method, abbreviated GCR. These methods are closely related to the Generalized Minimum Residual method, GMRES, or GMRES(K) when restarted with K search direction vectors. When A is symmetric, the method is known under the name Conjugate Residuals.
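In practice one typically relies on a library implementation of these methods. As an example, SciPy's restarted GMRES exposes the number of stored search directions through its restart parameter; the sketch below uses an assumed nonsymmetric tridiagonal test matrix for illustration.

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

# Assumed nonsymmetric sparse test problem.
n = 100
A = sp.diags([-1.2, 2.0, -0.8], [-1, 0, 1], shape=(n, n), format='csr')
b = np.ones(n)

# GMRES(K): restart after storing K = 20 search direction vectors.
x, info = spla.gmres(A, b, restart=20)
print(info)   # 0 signals convergence
```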
In the case of Galerkin's method, we assume that A is symmetric and positive definite. The resulting computational procedure is the famous Conjugate Gradient method. Since A is symmetric, the recursive update of qk+1 needs only the previous search direction vector qk, that is, βj=0 for j<k.
Given a start vector x0, compute r0=b−Ax0 and set q0=r0.
for k=0,1,2,… until termination criteria are fulfilled:
a. ck=(rk,qk)/⟨qk,qk⟩
b. xk+1=xk+ckqk
c. rk+1=rk−ckAqk
d. βk=⟨rk+1,qk⟩/⟨qk,qk⟩
e. qk+1=rk+1−βkqk
The previous remark that the listed algorithm is just a summary of the steps in the solution procedure, and not an efficient algorithm that should be implemented in its present form, applies here as well. In general, we recommend relying on a high-quality linear algebra library that offers an implementation of the Conjugate Gradient method.
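For instance, SciPy provides such an implementation. A minimal usage sketch, with an assumed SPD test matrix (the 1D Poisson matrix tridiag(−1, 2, −1)), could look as follows.

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

# Assumed SPD test matrix: the 1D Poisson matrix tridiag(-1, 2, -1).
n = 100
A = sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(n, n), format='csr')
b = np.ones(n)

x, info = spla.cg(A, b)            # info = 0 means convergence
print(np.linalg.norm(b - A @ x))   # residual norm as a sanity check
```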
The computational nature of Conjugate gradient-like methods.
Looking at the Galerkin and least-squares algorithms above, one notices that the matrix A is only used in matrix-vector products. This means that it is sufficient to store only the nonzero entries of A. The rest of the algorithms consists of vector operations of the type y←ax+y, the slightly more general variant q←ax+y, as well as inner products.
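To illustrate that only the action of A on a vector is needed, the sketch below wraps a matrix-free application of the 1D Poisson matrix (an assumed example) in scipy.sparse.linalg.LinearOperator and hands it to the CG solver; the matrix itself is never stored.

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

n = 100

def poisson_matvec(u):
    # Apply tridiag(-1, 2, -1) to u without ever storing the matrix.
    v = 2.0 * u
    v[:-1] -= u[1:]
    v[1:] -= u[:-1]
    return v

A = LinearOperator((n, n), matvec=poisson_matvec, dtype=float)
b = np.ones(n)
x, info = cg(A, b)
```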
Let us define the error ek=x−xk. Multiplying this equation by A leads to the well-known relation between the error and the residual for linear systems:
Using rk=Aek we can reformulate the Galerkin and least-squares methods in terms of the error. The Galerkin method can then be written
For the least-squares method we obtain
This means that
In other words, the error is A-orthogonal to the space Vk+1 in the Galerkin method, whereas the error is ATA-orthogonal to Vk+1 in the least-squares method. When the error is orthogonal to a space, we find the best approximation in the associated norm to the solution in that space. Specifically here, it means that for a symmetric and positive definite A, the Conjugate gradient method finds the optimal adjustment in Vk+1 of the vector xk (in the A-norm) in the update for xk+1. Similarly, the least squares formulation finds the optimal adjustment in Vk+1 measured in the ATA-norm.
Many Conjugate Gradient-like methods were developed in the 1980s and 1990s, and some of the most popular methods do not fit directly into the framework presented here. The theoretical similarities between the methods are covered in the literature, to which we refer for algorithms and practical comments related to widespread methods such as the SYMMLQ method (for symmetric indefinite systems), the Generalized Minimal Residual (GMRES) method, the BiConjugate Gradient (BiCG) method, the Quasi-Minimal Residual (QMR) method, and the BiConjugate Gradient Stabilized (BiCGStab) method. When A is symmetric and positive definite, the Conjugate Gradient method is the optimal choice with respect to computational efficiency, but when A is nonsymmetric, the performance of the methods is strongly problem dependent, and one needs to experiment.
where κ is the ratio of the largest and smallest eigenvalue of A. The quantity κ is commonly referred to as the spectral condition number. Common finite element and finite difference discretizations of Poisson-like PDEs lead to κ∼h−2, where h denotes the mesh size. This implies that the Conjugate Gradient method converges slowly in PDE problems with fine meshes, as the number of iterations is proportional to h−1.
To speed up the Conjugate Gradient method, we should manipulate the eigenvalue distribution. For instance, we could reduce the condition number κ. This can be achieved by so-called preconditioning. Instead of applying the iterative method to the system Ax=b, we multiply by a matrix M−1 and apply the iterative method to the mathematically equivalent system
The aim now is to construct a nonsingular preconditioning matrix or algorithm such that M−1A has a more favorable condition number than A. We remark that we use the notation M−1 here to indicate that it should resemble the inverse of the matrix A.
For increased flexibility we can write M−1=CLCR and transform the system according to
where CL is the left and CR is the right preconditioner. If the original coefficient matrix A is symmetric and positive definite, CL=CTR leads to preservation of these properties in the transformed system. This is important when applying the Conjugate Gradient method to the preconditioned linear system. Even if A and M are symmetric and positive definite, M−1A does not necessarily inherit these properties. It appears that for practical purposes one can express the iterative algorithms such that it is sufficient to work with a single preconditioning matrix M only. We shall therefore speak of preconditioning in terms of the left preconditioner M in the following.
Optimal convergence for the Conjugate Gradient method is achieved when the coefficient matrix M−1A equals the identity matrix I and only one iteration is required. In the algorithm we need to perform matrix-vector products M−1Au for an arbitrary u∈Rn. This means that we have to solve a linear system with M as coefficient matrix in each iteration since we implement the product y=M−1Au in a two step fashion: First we compute v=Au and then we solve the linear system My=v for y. The optimal choice M=A therefore involves the solution of Ay=v in each iteration, which is a problem of the same complexity as our original system Ax=b. The strategy must hence be to compute an M−1≈A−1 such that the algorithmic operations involved in the inversion of M are cheap.
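As a small illustration of this two-step product, the sketch below assumes M is given as a sparse matrix and is factorized once with scipy.sparse.linalg.splu, so that the system My=v can be solved cheaply in every iteration; the function name is our own.

```python
import scipy.sparse.linalg as spla

def preconditioned_matvec(A, M):
    """Return a function u -> M^{-1} A u, computed in two steps."""
    M_lu = spla.splu(M.tocsc())     # factorize M once (sparse LU)
    def matvec(u):
        v = A @ u                   # step 1: v = A u
        return M_lu.solve(v)        # step 2: solve M y = v
    return matvec
```

For the choice M=A this reproduces the exact solve discussed above, which is precisely why cheaper approximations of A are sought.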
The preceding discussion motivates the following demands on the preconditioning matrix M:
M−1 should be a good approximation to A−1,
M−1 should be inexpensive to compute,
M−1 should be sparse in order to minimize storage requirements,
linear systems with M as coefficient matrix must be efficiently solved.
Regarding the last property, such systems must be solved in O(n) operations, that is, a complexity of the same order as the vector updates in the Conjugate Gradient-like algorithms. These four properties are contradictory and some sort of compromise must be sought.
The simplest possible iterative method for solving Ax=b is
Applying this method to the preconditioned system M−1Ax=M−1b results in the scheme
which is nothing but the formula for classical iterative methods such as the Jacobi method, the Gauss-Seidel method, SOR (Successive Over-Relaxation), and SSOR (Symmetric Successive Over-Relaxation). This motivates choosing M as one iteration of these classical methods. In particular, with A split as A=D+L+U into its diagonal part D, strictly lower triangular part L, and strictly upper triangular part U, these methods provide simple formulas for M:
Jacobi preconditioning: M=D.
Gauss-Seidel preconditioning: M=D+L.
SOR preconditioning: M=ω−1D+L.
SSOR preconditioning: M=(2−ω)−1(ω−1D+L)(ω−1D)−1(ω−1D+U)
Turning our attention to the four requirements of the preconditioning matrix, we realize that the suggested M matrices do not demand additional storage, that linear systems with M as coefficient matrix are solved efficiently in O(n) operations, and that M needs no initial computation. The only questionable property is how well M approximates A, and that is the weak point of using classical iterative methods as preconditioners.
The Conjugate Gradient method can only utilize the Jacobi and SSOR preconditioners among the classical iterative methods, because the M matrix in these cases is of the form M−1=CLCTL, which is necessary to ensure that the coefficient matrix of the preconditioned system is symmetric and positive definite. For certain PDEs, like the Poisson equation, it can be shown that the SSOR preconditioner reduces the condition number by an order of magnitude, i.e., from O(h−2) to O(h−1), provided we use the optimal choice of the relaxation parameter ω. According to experiments, however, the performance of the SSOR preconditioned Conjugate Gradient method is not very sensitive to the choice of ω.
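As a concrete example, a Jacobi preconditioner can be passed to SciPy's CG solver as an operator that divides by the diagonal of A. This is a sketch only; the SPD test matrix is assumed, and SciPy expects the M argument to act as an approximation of A−1.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import LinearOperator, cg

# Assumed SPD test matrix.
n = 100
A = sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(n, n), format='csr')
b = np.ones(n)

# Jacobi preconditioning: M = D, so applying M^{-1} is division by diag(A).
d = A.diagonal()
M = LinearOperator((n, n), matvec=lambda u: u / d, dtype=float)

x, info = cg(A, b, M=M)
```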
Imagine that we choose M=A and solve systems My=v by a direct method. Such methods typically first compute the LU factorization M=ˉLˉU and thereafter perform two triangular solves. The lower and upper triangular factors ˉL and ˉU are computed from a Gaussian elimination procedure. Unfortunately, ˉL and ˉU contain nonzero values, so-called fill-in, in many locations where the original matrix A contains zeroes. This decreased sparsity of ˉL and ˉU increases both the storage requirements and the computational efforts related to solving systems My=v. An idea to improve the situation is to compute sparse versions of the factors ˉL and ˉU. This is achieved by performing Gaussian elimination, but neglecting the fill-in (!). In this way, we can compute approximate factors ˆL and ˆU that become as sparse as A. The storage requirements are hence only doubled by introducing a preconditioner, and the triangular solves become an O(n) operation since the number of nonzeroes in the ˆL and ˆU matrices (and A) is O(n) when the underlying PDE is discretized by finite difference or finite element methods. We call M=ˆLˆU an incomplete LU factorization preconditioner, often just referred to as the ILU preconditioner.
Instead of throwing away all fill-in entries, we can add them to the
main diagonal. This yields the modified incomplete LU factorization
(MILU). One can also allow a certain amount of fill-in
to improve the quality of the preconditioner, at the expense of
more storage for the factors. Libraries offering ILU preconditioners
will then often have a tolerance for how small a fill-in element in
the Gaussian elimination must be to be dropped, and a parameter that
governs the maximum number of nonzeros per row in the factors.
A popular ILU implementation is the open source C code
SuperLU. This preconditioner is available to Python programmers through the
scipy.sparse.linalg.spilu
function.
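A typical usage pattern, sketched below for an assumed nonsymmetric sparse test matrix, is to wrap the triangular solves with the incomplete factors in a LinearOperator and pass it as a preconditioner to a Krylov solver such as GMRES; the drop_tol and fill_factor arguments control the dropping rules mentioned above.

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

# Assumed nonsymmetric sparse test matrix (spilu prefers CSC format).
n = 100
A = sp.diags([-1.2, 2.0, -0.8], [-1, 0, 1], shape=(n, n), format='csc')
b = np.ones(n)

# Incomplete LU factorization with a drop tolerance and a bound on fill-in.
ilu = spla.spilu(A, drop_tol=1e-4, fill_factor=10)

# The preconditioner applies M^{-1}v via two sparse triangular solves.
M = spla.LinearOperator((n, n), matvec=ilu.solve)

x, info = spla.gmres(A, b, M=M)
```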
The general algorithm for MILU preconditioning follows the steps of traditional exact Gaussian elimination, except that we restrict the computations to the nonzero entries in A. The factors ˆL and ˆU can be stored directly in the sparsity structure of A, that is, the algorithm overwrites a copy M of A with its MILU factorization. The steps in the MILU factorization are listed below.
Given a sparsity pattern as an index set I, copy Mi,j←Ai,j, i,j=1,…,n
for k=1,2,…,n
a. for i=k+1,…,n
   * if (i,k)∈I then
     a. Mi,k←Mi,k/Mk,k
   * else
     a. Mi,k=0
   * r=Mi,k
   * for j=k+1,…,n
     a. if j=i then
        * Mi,j←Mi,j−rMk,j+∑_{p=k+1}^{n}(Mi,p−rMk,p)
     b. else
        * if (i,j)∈I then
          a. Mi,j←Mi,j−rMk,j
        * else
          a. Mi,j=0
We also remark here that the algorithm above needs careful refinement before it should be implemented in a code. For example, one will not run through a series of (i,j) indices and test for each of them if (i,j)∈I. Instead one should run more directly through the sparsity structure of A.
The above mentioned preconditioners are general purpose preconditioners that may be applied to any linear system and may speed up the solution process significantly. For linear systems arising from PDE problems, however, there are often more efficient preconditioners that exploit the fact that the matrices come from discretized PDEs. These preconditioners often enable the solution of linear systems involving millions of unknowns in a matter of seconds or less on a regular desktop computer. This means that solving these huge linear systems often takes less time than printing them to file. Moreover, these preconditioners may be used on parallel computers in a scalable fashion, such that simulations involving billions of degrees of freedom become feasible, e.g., through cloud services or supercomputing centers.
The most efficient preconditioners are the so-called multilevel preconditioners (e.g., multigrid and domain decomposition), which in particular for Poisson-type problems yield an error reduction of about a factor of 10 per iteration and hence a typical iteration count of 5-10 for the iterative method. These preconditioners utilize the fact that different parts of the solution can be localized or represented on a coarser resolution during the computations, before being glued together to form a close match to the true solution. While a proper description of these methods is beyond the scope here, we mention that black-box preconditioners are implemented in open source frameworks such as PETSc, Hypre, and Trilinos, and these frameworks are used by most available finite element software (such as FEniCS).
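For readers who want to try a multilevel preconditioner directly from Python without installing one of these frameworks, the PyAMG package (our own suggestion, not mentioned above) provides algebraic multigrid that can be used as a preconditioner for SciPy's CG solver; the sketch below assumes PyAMG is installed and uses its built-in 2D Poisson test matrix.

```python
import numpy as np
import scipy.sparse.linalg as spla
import pyamg   # third-party algebraic multigrid package (assumed installed)

# Assumed test problem: 2D Poisson matrix on a 100 x 100 grid.
A = pyamg.gallery.poisson((100, 100), format='csr')
b = np.ones(A.shape[0])

# Build a classical (Ruge-Stuben) AMG hierarchy and use it as preconditioner.
ml = pyamg.ruge_stuben_solver(A)
M = ml.aspreconditioner()

x, info = spla.cg(A, b, M=M)
```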
While these multilevel preconditioners are available and efficient for PDEs similar to the Poisson problem, there is currently no black-box solver that handles general systems of PDEs. We mention, however, a promising approach for tackling some systems of PDEs, namely the so-called operator preconditioning framework. This framework utilizes tools from functional analysis to deduce appropriate block preconditioners in a systematic construction, with the blocks typically composed of, for instance, multilevel preconditioners. The framework has successfully extended the use of Poisson-type multilevel preconditioners to many systems of PDEs.