Lecture 13: Iterative methods

Syllabus

Week 1: Matrices, vectors, matrix/vector norms, scalar products & unitary matrices
Week 2: TAs-week (Strassen, FFT, a bit of SVD)
Week 3: Matrix ranks, singular value decomposition, linear systems, eigenvalues
Week 4: Matrix decompositions: QR, LU, SVD + test + structured matrices start
Week 5: Iterative methods, preconditioners, matrix functions

Recap of the previous lecture

  • Krylov subspaces
  • Iterative methods: Conjugate Gradient (CG), Generalized Minimal Residual Method (GMRES), BiCGStab
  • Definition of preconditioners
  • Jacobi, Gauss-Seidel as preconditioners

Today lecture

Today we will continue to talk about iterative methods.

We will consider:

  • Concept of Incomplete LU/Incomplete Cholesky
  • A brief note about algebraic multigrid (no details, because we need PDEs for the general multigrid concept)
  • Iterative methods for other NLA problems (SVD, eigenvalue problems)
  • Other than sparse: Toeplitz matrices, FFT, circulant matrices.

Problem setup (recall)

Just to recall: we are solving

$$Ax = f,$$

where $A$ is an $n \times n$ sparse matrix (i.e. the number of non-zero elements is smaller than the $N^2)$,

and the matrix-by-vector product can be computed in $\mathcal{O}(N)$ operations.

We solve $$A P^{-1} y = f,$$

and $P$ is such that $A \approx P$ and $AP^{-1}$ has better spectra for the iterative method.

Simple method

General idea is to select a class $\mathcal{P}$ of matrices with which we can easily solve linear systems and then

find a matrix $P \in \mathcal{P}$ that minimizes

$$P = \arg \min_{P \in \mathcal{P}}\Vert A - P \Vert $$
  1. Diagonal matrices give Jacobi preconditioners
  2. Triangular matrices give Gauss-Seidel preconditioner

Any other ideas for the class of the matrices?

Remember the Gaussian elimination

Let us remember the basic method for solving linear systems:

Decompose the matrix $A$ in the form

$$A = P_1 L U P^{\top}_2, $$

where $P_1$ and $P_2$ are certain permutation matrices (which do the pivoting).

The most natural idea is to use sparse $L$ and $U$.

It is not possible without fill-in growth for example for matrices, coming from 2D/3D Partial Differential equations (PDEs).

What to do?

Incomplete LU

Suppose you want to eliminate a variable $x_1$,

and the equations have the form

$$5 x_1 + x_4 + x_{10} = 1, \quad 3 x_1 + x_4 + x_8 = 0, \ldots,$$

and in all other equations $x_1$ is not present.

After the elimination, only $x_{10}$ will enter additionally to the second equation (new fill-in).

$$x_4 + x_8 + 3(1 - x_4 - x_{10})/5 = 0$$

In the Incomplete $LU$ case (actually, ILU(0)) we just throw away the new fill-in.

Incomplete-LU: formal definition

We run the usual LU-decomposition cycle, but avoid inserting non-zeros other than the initial non-zero pattern.

L = np.zeros((n, n)) U = np.zeros((n, n)) for k in xrange(n): #Eliminate one row L[k, k] = 1 for i in xrange(k+1, n): L[i, k] = a[i, k] / a[k, k] for j in xrange(k+1, n): a[i, j] = a[i, j] - L[i, k] * a[k, j] #New fill-ins appear here for j in xrange(k, n): U[k, j] = a[k, j]

ILU(k)

Youseef Saad (who is the author of GMRES) also had a seminal paper on the Incomplete LU decomposition

Saad, Yousef (1996), Iterative methods for sparse linear systems

And he proposed ILU(k) method.

ILU(k), continued

The idea of ILU(k) is very instructive and is based on the connection between sparse matrices and graphs.

Suppose you have an $n \times n$ matrix $A$ and a corresponding adjacency graph.

Then we eliminate one variable (vertex) and get a smaller system of size $(n-1) \times (n-1)$.

What is the graph of this system, when a new edge appear?

LU & graphs

The new edge can appear only between vertices that had common neighbour: it means, that they are second-order neigbours. This is also a sparsity pattern of the matrix $$A^2.$$

The ILU(k) idea is to leave only the elements in $L$ and $U$ that are $k$-order neighbours.

The ILU(2) is very efficient, but for some reason completely abandoned (i.e. there is no implementation in MATLAB and scipy).

There is an original Sparsekit software by Saad, which works quite well.

ILU Thresholded (ILUT)

A much more popular approach is based on the so-called thresholded LU.

You do the standard Gaussian elimination with fill-ins, but either:

Throw away elements that are smaller than threshold, and/or control the amount of non-zeros you are allowed to store.

The smaller is the threshold, the better is the preconditioner, but more memory it takes.

Symmetric positive definite case

In the SPD case, instead of incomplete LU you should use Incomplete Cholesky, which is twice faster and consumes twice less memory.

Both ILUT and Ichol are implemented in scipy and you can try (nothing quite fancy, but it works).

Second-order LU preconditioners

There are more efficient (but much less popular due to the absense of open-source implementations) second-order LU factorization proposed by I. Kaporin

The idea is to approximate the matrix in the form

$$A \approx U_2 U^{\top}_2 + U^{\top}_2 R_2 + R^{\top}_2 U_2,$$

which is just the expansion of the $UU^{\top}$ with respect to the perturbation of $U$.

where $U_1$ and $U_2$ are low-triangular and sparse, whereare $R_2$ is small with respect to the drop tolerance parameter.

What else?

Besides the ILU/IC approaches, there is a very efficient approach, called algebraic multigrid (AMG).

It was first proposed by Ruge and Stuben in 1987.

It is probably the only black-box method that is asymptotically optimal for 2D/3D problems

(at least, for elliptic problems).

The idea comes from the geometric multigrid.

Multigrid: just the basic idea

The basic idea of the multigrid is to consider instead of one problem problems on coarser meshes.

Then we solve for a system on a coarse mesh, and interpolate it to the finer mesh.

Geometric multigrid uses the information about the meshes, operators, discretization..

AMG method tries to guess it from the matrix.

To know more, enroll in the FastPDE course this Spring :)

Summary of this part

The ILU has three flavours: ILU(k), ILUT, ILU2. They are different, but based on the approximation of the matrix in a simpler form to use as a preconditioners.

Now we will briefly discuss iterative methods for other NLA problems.

Iterative methods for other NLA problems

Up to now, we only talked about the iterative methods for linear systems.

There are other important large-scale problems:

  1. (Partial) Eigenvalue problem: $Ax_k = \lambda_k x_k.$
  2. (Partial) SVD problem: $A v_k = \sigma_k u_k, \quad A^* u_k = \sigma_k v_k$.

It is extremely difficult to compute all eigenvalues/singular values of the matrix

(singular vectors / eigenvectors are not sparse, so we need to store them all!).

But it is possible to solve partial eigenvalue problems.

The SVD follows from the symmetric eigenvalue problem.

Can we do better, than the power method

Recall the power method:

$$x_{k+1} = A x_k.$$

Can we do better, than the power method (which converges to the largest eigenvalue?)

Yes, we can do better

We can use the full Krylov subspace, and construct the orthogonal basis in the subspace

$$x \in K_n(A, f)$$

to approximate the eigenvalues.

This gives rise to the Arnoldi method (for Hermitian matrices it reduces to the Lanczos method)

Arnoldi method

The Arnold method has a simple structure (see the pseudocode), which just modified Gram-Schmidt applied to the Krylov subspace

q[k] = A * q[k-1], for j in range(k-1): h[j, k-1] = (q[j], q[k]) q[k] -= h[j, k-1] * q[j] h[k, k-1] = || q[k] || q[k] = q[k]/h[k, k-1]

Arnoldi method, cont

The Arnoldi method generates an orthogonal basis of the form

$$A Q_n = \widetilde{H}_n Q_{n+1},$$

and $H_n = Q^*_n A Q_n$ is upper Hessenberg matrix.

The idea of using Arnoldi method as an eigenvalue algorithm is that we compute the eigenvalues of $A$

by looking at the eigenvalues of the orthogonal projection $H_n$ (called Ritz values).

Since $H_n$ is small we can use standard software to compute its eigenvalues.

Typically, Ritz values converge to extreme (large & small) eigenvalues of $A$.

Symmetric case: Lanczos method

In the Hermitian case, $A = A^*$ the matrix

$$H = Q^* A Q$$

becomes tridiagonals, and it gives short three-term recurrence relations for the vectors:

the famous Lanczos method.

We only need to store last vectors and the tridiagonal matrix.

Practical issues and stability

The Lanczos vectors may loose orthogonality during the process due to floating-point errors, thus all practical implementations of it use restarts.

A very good introduction to the topic is given in the book of Golub and Van-Loan (Matrix Computations).

Preconditioning of eigenvalue problems.

The convergence of eigenvalues is still governed by the condition number, and the number of iterations can be large.

Thus, instead of Krylov subspaces, we can use Rational Krylov subspaces, which require auxiliary linear solvers.

Simplest Rational Krylov method: Inverse iteration

The simplest Rational Krylov method is the inverse iteration:

$$x_{k+1} = A^{-1} x_k.$$

It converges to the smallest eigenvector (as the power method with the inverse matrix).

Each step requires one linear solve.

We can factorize $A$ once, and then apply inverse!

Rayleigh quotient iteration

The RQI-iteration has the form

$$x_{k+1} = (A - \lambda_k I)^{-1} x_k, \quad \lambda_k = ( A x_k, x_k)/(x_k, x_k).$$

For SPD matrices it has cubic convergence, but each time we have to construct preconditioners for the different matrices.

If the factorization of $(A - \lambda I)$ is not feasible, Jacobi-Davidson method is typically the method of choice.

Jacobi correction

Jacobi not only presented the way to solve the eigenvalue problem by Jacobi rotations, but also proposed an iterative procedure. Let $u_j$ be the current approximation, and $t$ the correction:

$$A(u_j + t) = \lambda (u_j + t),$$

and we look for the correction $t \perp u_j$ (new orthogonal vector).

Then, the parallel part has the form

$$u_j u^*_j A (u_j + t) = \lambda u_j,$$

which simplifies to

$$\theta_j + u_j u^* _j A t = \lambda.$$

The orthogonal component is

$$( I - u_j u^*_j) A (u_j + t) = (I - u_j u^*_j) \lambda (u_j + t),$$

which is equivalent to

$$ (I - u_j u^*_j) (A - \lambda I) t = (I - u_j u^*_j) (- A u_j + \lambda u_j) = - (I - u_j u^*_j) A u_j = - (A - \theta_j I) u_j = -r_j. $$

$r_j$ is the residue.

Since $(I - u_j u^*j) t = t$, we can rewrite this equation in the form

Jacobi-Davidson correction

$$ (I - u_j u^*_j) (A - \lambda I) (I - u_j u^*_j) t = -r_j.$$

Now we replace $\lambda$ by $\theta_j$, and get

$$ (I - u_j u^*_j) (A - \theta_j I) (I - u_j u^*_j) t = -r_j. $$

Since $r_j \perp u_j$ this equation is consistent, if $(A - \theta_j I)$ is non-singular.

If this equation is solved exactly, we will get Rayleigh Quotient iteration!

If we want many eigenvectors, we just compute partial Schur decomposition:

$$A Q_k = Q_k T_k, $$

and then want to update $Q_k$ by one vector added to $Q_k$. We just use instead of $A$ the matrix $(I - Q_k Q^*_k) A (I - Q_k Q^*_k)$.

Jacobi-Davidson: summary

The correction equation can be solved only roughly, and JD method is often the fastest.

Software

  • The ARPack is the most widely used (it powers scipy sparse eigensolver)
  • The PRIMME is the best from my experience (it employs dynamic switching between different methods, including those we have not talked about like LOBPCG.)
  • PROPACK works well for the SVD.

Summary for this part

  • Iterative methods (Arnoldi, Lanczos, Jacobi-Davidson) for eigenvalue problems
  • There is a software for using them
  • There is a lot of technical issues hidden (restarts, spurious eigenvalues, stability)

Other structured matrices

  • Up to now, we discussed preconditioning only for sparse matrices
  • But iterative methods work well for any matrices that have fast black-box matrix-by-vector product
  • Important class of matrices are Toeplitz matrices (and Hankel matrices)

We will briefly discuss them, with more to follow on Thursday.

Toeplitz matrices: definition

A matrix is called Toeplitz if its elements are defined as

$$a_{ij} = t_{i - j}.$$

A Toeplitz matrix is completely defined by its first column and first row (i.e., $2n-1$ parameters).

It can also be multiplied by vector fast.

It has a lot of application in signal processing

Fast Toeplitz matrix-by-vector product

In order to get fast matrix-by-vector product, consider a special class of Toeplitz matrices:

Circulant matrices.

A matrix is called circulant, if

$$a_{ij} = c_{\mathrm{mod}(i - j, n)}.$$

Spectral theorem for circulant matrices

Theorem:

Any circulant matrix can be represented in the form

$$C = F \Lambda F^*,$$

where $F$ is the Fourier matrix with the elements

$$F_{kl} = \frac{1}{\sqrt{n}} w^{kl}, \quad k, l = 0, \ldots, n-1, \quad w = e^{\frac{2 \pi i}{n}},$$

and matrix $\Lambda$ is the diagonal matrix and

$$\lambda = F^* c, $$

where $c$ is the first column of the circulant matrix.

FFT

The matrix-by-vector product $F x$ can be computed in $\mathcal{O}(n \log n)$ operations using Fast Fourier Transform (FFT).

Fast Toeplitz-by-vector product by circulant embedding

$$ C = \begin{bmatrix} T & T_2 \\ T_1 & T \end{bmatrix} $$
                       and then
                       $$ 
                           C \begin{bmatrix} q\\ 0 \end{bmatrix} = \begin{bmatrix} Tq \\0 \end{bmatrix}.
                       $$

Linear system solution with Toeplitz matrices

How to solve linear systems with Toeplitz matrices?

Since we have fast matrix-by-vector product, methods we can use iterative methods.

We can also use direct solvers, but what is the structure of the inverse of such matrix?

Discuss it next time!

Take home message

  • ILU
  • Iterative eigensolvers
  • Toeplitz matrices, FFT, circulant

Next lecture

  • More on Toeplitz matrices (direct and iterative methods, block case).
  • Back to matrices: matrix functions

Questions?

In [1]:
from IPython.core.display import HTML
def css_styling():
    styles = open("./styles/custom.css", "r").read()
    return HTML(styles)
css_styling()
Out[1]: