#!/usr/bin/env python
# coding: utf-8

# # Lecture 13: Iterative methods

# ## Syllabus
# **Week 1:** Matrices, vectors, matrix/vector norms, scalar products & unitary matrices
# **Week 2:** TAs-week (Strassen, FFT, a bit of SVD)
# **Week 3:** Matrix ranks, singular value decomposition, linear systems, eigenvalues
# **Week 4:** Matrix decompositions: QR, LU, SVD + test + structured matrices start
# **Week 5:** Iterative methods, preconditioners, matrix functions

# ## Recap of the previous lecture
# - Krylov subspaces
# - Iterative methods: Conjugate Gradient (CG), Generalized Minimal Residual Method (GMRES), BiCGStab
# - Definition of preconditioners
# - Jacobi, Gauss-Seidel as preconditioners

# ## Today's lecture
#
# Today we will continue to talk about iterative methods.
#
# We will consider:
#
# - The concept of incomplete LU / incomplete Cholesky factorizations
# - A brief note about algebraic multigrid (no details, because we need PDEs for the general multigrid concept)
# - Iterative methods for other NLA problems (SVD, eigenvalue problems)
# - Other structured matrices besides sparse ones: Toeplitz matrices, FFT, circulant matrices

# ## Problem setup (recall)
#
# Just to recall: we are solving
#
# $$Ax = f,$$
#
# where $A$ is an $n \times n$ **sparse matrix** (i.e. the number of non-zero elements is much smaller than $n^2$),
#
# and the **matrix-by-vector product** can be computed in $\mathcal{O}(n)$ operations.
#
# Instead, we solve the preconditioned system
#
# $$A P^{-1} y = f, \quad x = P^{-1} y,$$
#
# where $P$ is such that $P \approx A$, linear systems with $P$ are easy to solve, and $A P^{-1}$ has a **better spectrum** for the iterative method.

# ## Simple method
# The general idea is to select a class $\mathcal{P}$ of matrices with which we can **easily solve** linear systems, and then
#
# find the matrix $P \in \mathcal{P}$ closest to $A$:
#
# $$P = \arg \min_{P \in \mathcal{P}} \Vert A - P \Vert. $$
#
# 1. Diagonal matrices give the **Jacobi preconditioner**
# 2. Triangular matrices give the **Gauss-Seidel** preconditioner
#
# Any other ideas for the class of matrices?

# ## Remember the Gaussian elimination
# Let us recall the basic method for solving linear systems:
#
# Decompose the matrix $A$ in the form
#
# $$A = P_1 L U P^{\top}_2, $$
#
# where $P_1$ and $P_2$ are certain **permutation** matrices (which do the pivoting).
#
# The most natural idea is to use **sparse** $L$ and $U$.
#
# This is not possible without **fill-in** growth, for example for matrices coming from 2D/3D partial differential equations (PDEs).
#
# What to do?

# ## Incomplete LU
#
# Suppose you want to eliminate the variable $x_1$,
#
# and the equations have the form
#
# $$5 x_1 + x_4 + x_{10} = 1, \quad 3 x_1 + x_4 + x_8 = 0, \ldots,$$
#
# and $x_1$ is not present in any other equation.
#
# After the elimination, only $x_{10}$ will additionally enter the second equation (new fill-in):
#
# $$x_4 + x_8 + 3(1 - x_4 - x_{10})/5 = 0.$$
#
# In the incomplete $LU$ case (actually, ILU(0)) we simply throw away the **new fill-in**.
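# To make the fill-in issue (and the benefit of throwing the fill-in away) concrete, here is a minimal sketch with a 2D Laplacian: the exact sparse LU factors are much denser than $A$, while SciPy's thresholded incomplete LU (`spilu`, close in spirit to the ILUT variant discussed below) keeps the factors sparse and already works well as a preconditioner for GMRES. The grid size and drop tolerance are arbitrary illustration values.
#
# ```
# import numpy as np
# import scipy.sparse as sp
# import scipy.sparse.linalg as spla
#
# # 2D Laplacian on a 64 x 64 grid -- a typical sparse "PDE" matrix (illustration values)
# n = 64
# d1 = sp.diags([-1., 2., -1.], [-1, 0, 1], shape=(n, n))
# A = (sp.kron(sp.eye(n), d1) + sp.kron(d1, sp.eye(n))).tocsc()
# b = np.ones(A.shape[0])
#
# # Exact sparse LU: note how much denser the factors are (fill-in growth)
# lu = spla.splu(A)
# print("nnz(A) =", A.nnz, " nnz(L) + nnz(U) =", lu.L.nnz + lu.U.nnz)
#
# # Thresholded incomplete LU used as a preconditioner
# ilu = spla.spilu(A, drop_tol=1e-4, fill_factor=10)
# M = spla.LinearOperator(A.shape, matvec=ilu.solve)
# print("nnz of incomplete factors =", ilu.L.nnz + ilu.U.nnz)
#
# # Count GMRES iterations without and with the ILU preconditioner
# counts = []
# for prec in [None, M]:
#     res = []
#     spla.gmres(A, b, M=prec, restart=50, callback=lambda r: res.append(r))
#     counts.append(len(res))
# print("GMRES iterations (no preconditioner / ILU preconditioner):", counts)
# ```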
# ## Incomplete-LU: formal definition
#
# We run the usual LU-decomposition cycle, but avoid inserting non-zeros **other** than those in the initial non-zero pattern.
#
# ```
# import numpy as np
#
# def ilu0_dense(a):
#     """ILU(0) of a dense array a: updates are applied only inside the initial non-zero pattern."""
#     a = a.copy()
#     n = a.shape[0]
#     pattern = (a != 0)                 # the initial non-zero pattern is kept fixed
#     L = np.eye(n)
#     U = np.zeros((n, n))
#     for k in range(n):                 # eliminate one row
#         for i in range(k + 1, n):
#             if not pattern[i, k]:
#                 continue
#             L[i, k] = a[i, k] / a[k, k]
#             for j in range(k + 1, n):
#                 if pattern[i, j]:      # new fill-ins would appear here; we simply skip them
#                     a[i, j] -= L[i, k] * a[k, j]
#         U[k, k:] = a[k, k:]
#     return L, U
# ```

# ## ILU(k)
#
# Yousef Saad (who is also the author of GMRES) wrote a seminal book covering the **incomplete LU** decomposition:
#
# Saad, Yousef (1996), Iterative methods for sparse linear systems.
#
# There the **ILU(k)** method is described in detail.

# ## ILU(k), continued
#
# The idea of ILU(k) is very instructive and is based on the connection between sparse matrices and graphs.
#
# Suppose you have an $n \times n$ matrix $A$ and a corresponding adjacency graph.
#
# Then we eliminate one variable (vertex) and get a smaller system of size $(n-1) \times (n-1)$.
#
# What is the graph of this system, i.e. when does a **new edge appear**?

# ## LU & graphs
#
# A new edge can appear only between vertices that had a common neighbour: this means that they are second-order neighbours. This is also the sparsity pattern of the matrix $$A^2.$$
#
# The **ILU(k)** idea is to keep in $L$ and $U$ only the elements that correspond to at most $k$-th order neighbours.
#
# ILU(2) is very efficient, but for some reason it has been almost completely abandoned (e.g. there is no implementation in MATLAB or SciPy).
#
# There is the original SPARSKIT software by Saad, which works quite well.

# ## Thresholded ILU (ILUT)
# A much more popular approach is based on the so-called **thresholded LU**.
#
# You do the standard Gaussian elimination with fill-ins, but either:
#
# throw away elements that are smaller than a threshold, and/or control the amount of non-zeros you are allowed to store.
#
# The smaller the threshold, the better the preconditioner, but the more memory it takes.

# ## Symmetric positive definite case
#
# In the SPD case, instead of an incomplete LU you should use the incomplete Cholesky factorization, which is roughly twice as fast and uses half the memory.
#
# ILUT is available in SciPy (and incomplete Cholesky, e.g., in MATLAB), so you can try it (nothing fancy, but it works).

# ## Second-order LU preconditioners
#
# There is a more efficient (but much less popular, due to the absence of open-source implementations) **second-order** LU factorization [proposed by I. Kaporin](http://www.researchgate.net/profile/I_Kaporin/publication/242940993_High_quality_preconditioning_of_a_general_symmetric_positive_definite_matrix_based_on_its_UTU__UTR__RTU-decomposition/links/53f72ad90cf2888a74976f54.pdf).
#
# The idea is to approximate the matrix in the form
#
# $$A \approx U^{\top}_2 U_2 + U^{\top}_2 R_2 + R^{\top}_2 U_2,$$
#
# which is just the expansion of $U^{\top} U$ with respect to a perturbation of $U$,
#
# where $U_2$ is upper triangular and sparse, whereas $R_2$ is small with respect to the drop tolerance parameter.

# ## What else?
# Besides the ILU/IC approaches, there is a very efficient approach called **algebraic multigrid (AMG)**.
#
# It was first proposed by Ruge and Stuben in 1987.
#
# It is probably the only **black-box** method that is asymptotically optimal for 2D/3D problems
#
# (at least, for elliptic problems).
#
# The idea comes from the **geometric multigrid**.

# ## Multigrid: just the basic idea
# The basic idea of multigrid is to consider, instead of one problem, a hierarchy of problems on **coarser meshes**.
#
# Then we solve the system on a coarse mesh and interpolate the solution to the finer mesh.
#
# Geometric multigrid uses the information about the meshes, operators and discretization;
#
# the AMG method tries to guess it from the **matrix** alone.
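# For a quick taste of AMG as a black-box solver, here is a minimal sketch using the PyAMG package (this assumes PyAMG is installed; it provides, among others, the classical Ruge-Stuben construction). The grid size and tolerance are arbitrary illustration values.
#
# ```
# import numpy as np
# import pyamg                                           # assumes the PyAMG package is available
#
# A = pyamg.gallery.poisson((200, 200), format='csr')   # 2D Laplacian with 40 000 unknowns
# b = np.random.randn(A.shape[0])
#
# ml = pyamg.ruge_stuben_solver(A)                       # build the classical (Ruge-Stuben) AMG hierarchy
# residuals = []
# x = ml.solve(b, tol=1e-8, residuals=residuals)
#
# print("AMG cycles:", len(residuals))
# print("relative residual:", np.linalg.norm(b - A @ x) / np.linalg.norm(b))
# ```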
# To know more, enroll in the FastPDE course this Spring :)

# ## Summary of this part
#
# ILU comes in several flavours: ILU(k), ILUT, ILU2. They are different, but all are based on approximating the matrix by a simpler one that can be used as a preconditioner.
#
# Now we will briefly discuss iterative methods for other NLA problems.

# ## Iterative methods for other NLA problems
#
# Up to now, we only talked about iterative methods for **linear systems**.
#
# There are other important large-scale problems:
#
# 1. (Partial) eigenvalue problem: $Ax_k = \lambda_k x_k.$
# 2. (Partial) SVD problem: $A v_k = \sigma_k u_k, \quad A^* u_k = \sigma_k v_k$.
#
# **It is extremely difficult to compute all eigenvalues/singular values of the matrix**
#
# (singular vectors / eigenvectors are not sparse, so we would need to store them all!).
#
# But it is possible to solve **partial eigenvalue problems**.
#
# The SVD follows from the symmetric eigenvalue problem.

# ## Can we do better than the power method?
#
# Recall the power method:
#
# $$x_{k+1} = \frac{A x_k}{\Vert A x_k \Vert}.$$
#
# Can we do better than the power method (which converges to the eigenvector of the largest-in-magnitude eigenvalue)?

# ## Yes, we can do better
#
# We can use the full Krylov subspace and construct an orthogonal basis of the subspace
#
# $$x \in K_n(A, f)$$
#
# to approximate the eigenvalues.
#
# This gives rise to the **Arnoldi method** (for Hermitian matrices it reduces to the **Lanczos method**).

# ## Arnoldi method
#
# The Arnoldi method has a simple structure (see the pseudocode for the $k$-th step below), which is just modified Gram-Schmidt applied to the Krylov subspace:
#
# ```
# q[k] = A @ q[k-1]
# for j in range(k):
#     h[j, k-1] = (q[j], q[k])
#     q[k] -= h[j, k-1] * q[j]
# h[k, k-1] = norm(q[k])
# q[k] = q[k] / h[k, k-1]
# ```

# ## Arnoldi method, continued
#
# The Arnoldi method generates an orthogonal basis via the relation
#
# $$A Q_n = Q_{n+1} \widetilde{H}_n,$$
#
# where $H_n = Q^*_n A Q_n$ is an **upper Hessenberg matrix**.
#
# The idea of using the Arnoldi method as an **eigenvalue algorithm** is that we approximate the eigenvalues of $A$
#
# by the eigenvalues of the **orthogonal projection** $H_n$ (called **Ritz values**).
#
# Since $H_n$ is small, we can use standard software to compute its eigenvalues.
#
# Typically, Ritz values converge to the **extreme** (largest & smallest) eigenvalues of $A$.

# ## Symmetric case: Lanczos method
#
# In the Hermitian case, $A = A^*$, the matrix
#
# $$H = Q^* A Q$$
#
# becomes tridiagonal, which gives a short three-term recurrence relation for the vectors:
#
# the famous Lanczos method.
#
# We only need to store the last few vectors and the tridiagonal matrix.

# ## Practical issues and stability
#
# The Lanczos vectors may lose orthogonality during the process due to floating-point errors, thus all practical implementations use **restarts** (and/or reorthogonalization).
#
# A very good introduction to the topic is given in the book by **Golub and Van Loan (Matrix Computations)**.

# ## Preconditioning of eigenvalue problems
#
# The convergence of the eigenvalues is still governed by the condition number, so the number of iterations can be large.
#
# Thus, instead of Krylov subspaces, we can use **rational Krylov subspaces**, which require auxiliary linear solvers.

# ## Simplest rational Krylov method: inverse iteration
#
# The simplest rational Krylov method is **inverse iteration**:
#
# $$x_{k+1} = \frac{A^{-1} x_k}{\Vert A^{-1} x_k \Vert}.$$
#
# It converges to the eigenvector of the smallest-in-magnitude eigenvalue (it is the power method applied to the inverse matrix).
#
# Each step requires one linear solve.
#
# We can factorize $A$ once, and then only apply the factors at every step!
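# A minimal sketch of this idea (the matrix here is a 1D Laplacian, chosen only for illustration; its smallest eigenvalue is known analytically, which makes the result easy to check):
#
# ```
# import numpy as np
# import scipy.sparse as sp
# import scipy.sparse.linalg as spla
#
# n = 200
# A = sp.diags([-1., 2., -1.], [-1, 0, 1], shape=(n, n), format='csc')   # 1D Laplacian (illustration)
#
# lu = spla.splu(A)                  # factorize once ...
# x = np.random.randn(n)
# for _ in range(50):
#     x = lu.solve(x)                # ... and reuse the factors at every step
#     x /= np.linalg.norm(x)
#
# lam = x @ (A @ x)                  # Rayleigh quotient: approximation to the smallest eigenvalue
# print(lam, 4 * np.sin(np.pi / (2 * (n + 1)))**2)   # exact smallest eigenvalue of this Laplacian
# ```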
# ## Rayleigh quotient iteration
#
# The Rayleigh quotient iteration (RQI) has the form
#
# $$x_{k+1} = (A - \lambda_k I)^{-1} x_k, \quad \lambda_k = \frac{(A x_k, x_k)}{(x_k, x_k)}.$$
#
# For SPD matrices it has **cubic convergence**, but at each step we have to factorize (or construct a preconditioner for) a different shifted matrix.
#
# If the factorization of $(A - \lambda I)$ is not feasible, the **Jacobi-Davidson** method is typically the method of choice.

# ## Jacobi correction
#
# Jacobi not only presented the way to solve the eigenvalue problem by Jacobi rotations, but also proposed an iterative procedure. Let $u_j$ be the current (normalized) approximation with Rayleigh quotient $\theta_j = u^*_j A u_j$, and $t$ the correction:
#
# $$A(u_j + t) = \lambda (u_j + t),$$
#
# and we look for a correction $t \perp u_j$ (a new orthogonal direction).
#
# Then, the component parallel to $u_j$ has the form
#
# $$u_j u^*_j A (u_j + t) = \lambda u_j,$$
#
# which simplifies to
#
# $$\theta_j + u^*_j A t = \lambda.$$
#
# The orthogonal component is
#
# $$( I - u_j u^*_j) A (u_j + t) = (I - u_j u^*_j) \lambda (u_j + t),$$
#
# which is equivalent to
#
# $$
# (I - u_j u^*_j) (A - \lambda I) t = (I - u_j u^*_j) (- A u_j + \lambda u_j) = - (I - u_j u^*_j) A u_j = - (A - \theta_j I) u_j = -r_j,
# $$
#
# where $r_j$ is the residual.
#
# Since $(I - u_j u^*_j) t = t$, we can rewrite this equation in the form

# ## Jacobi-Davidson correction
#
# $$ (I - u_j u^*_j) (A - \lambda I) (I - u_j u^*_j) t = -r_j.$$
#
# Now we replace $\lambda$ by $\theta_j$ and get
#
# $$
# (I - u_j u^*_j) (A - \theta_j I) (I - u_j u^*_j) t = -r_j.
# $$
#
# Since $r_j \perp u_j$, this equation is consistent if $(A - \theta_j I)$ is non-singular.
#
# If this equation is solved exactly, we get the Rayleigh quotient iteration!
#
# If we want many eigenvectors, we compute a **partial Schur decomposition:**
#
# $$A Q_k = Q_k T_k, $$
#
# and then want to update $Q_k$ by adding one more vector. We then simply use the matrix $(I - Q_k Q^*_k) A (I - Q_k Q^*_k)$ instead of $A$.

# ## Jacobi-Davidson: summary
#
# The correction equation only needs to be solved approximately, and the JD method is often the fastest.

# ## Software
#
# - [ARPACK](http://www.caam.rice.edu/software/ARPACK/) is the most widely used package (it powers the scipy sparse eigensolver)
# - [PRIMME](https://github.com/primme/primme) is the best in my experience (it employs dynamic switching between different methods, including those we have not talked about, like LOBPCG)
# - [PROPACK](http://sun.stanford.edu/~rmunk/PROPACK/) works well for the SVD

# ## Summary for this part
#
# - Iterative methods (Arnoldi, Lanczos, Jacobi-Davidson) for eigenvalue problems
# - There is software for using them
# - There are a lot of technical issues hidden (restarts, spurious eigenvalues, stability)

# ## Other structured matrices
#
# - Up to now, we discussed preconditioning only for **sparse matrices**
# - But iterative methods work well for any matrix that has a fast black-box matrix-by-vector product
# - An important class of such matrices are **Toeplitz matrices** (and **Hankel matrices**)
#
# We will briefly discuss them here, with more to follow on Thursday.

# ## Toeplitz matrices: definition
#
# A matrix is called **Toeplitz** if its elements are defined as
#
# $$a_{ij} = t_{i - j}.$$
#
# A Toeplitz matrix is completely defined by its first column and first row (i.e., by $2n-1$ parameters).
#
# It can also be multiplied by a vector fast (see the sketch below; the derivation via circulant matrices follows).
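# As a preview, here is a minimal sketch of the fast product (the circulant embedding and the FFT used here are derived on the next slides; the helper name `toeplitz_matvec` and the random test data are just for illustration):
#
# ```
# import numpy as np
# from scipy.linalg import toeplitz
#
# def toeplitz_matvec(c, r, x):
#     """O(n log n) product of the Toeplitz matrix with first column c and first row r by x,
#     via embedding into a 2n x 2n circulant matrix and the FFT."""
#     n = len(x)
#     col = np.concatenate([c, [0.], r[:0:-1]])              # first column of the circulant embedding
#     y = np.fft.ifft(np.fft.fft(col) * np.fft.fft(np.concatenate([x, np.zeros(n)])))
#     return y[:n].real                                      # the top block of the product gives T @ x
#
# n = 1000
# c = np.random.randn(n)                                     # t_0, t_1, ..., t_{n-1}
# r = np.concatenate([[c[0]], np.random.randn(n - 1)])       # t_0, t_{-1}, ..., t_{-(n-1)}
# x = np.random.randn(n)
#
# T = toeplitz(c, r)                                         # dense Toeplitz matrix, only for checking
# print(np.linalg.norm(T @ x - toeplitz_matvec(c, r, x)))    # should be ~1e-12
# ```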
# Toeplitz matrices have many applications in **signal processing**.

# ## Fast Toeplitz matrix-by-vector product
#
# In order to get a fast matrix-by-vector product, consider a special class of **Toeplitz matrices:**
#
# **circulant matrices**.
#
# A matrix is called **circulant** if
#
# $$a_{ij} = c_{\mathrm{mod}(i - j, n)}.$$

# ## Spectral theorem for circulant matrices
#
# **Theorem:**
#
# Any circulant matrix can be represented in the form
#
# $$C = F \Lambda F^*,$$
#
# where $F$ is the **Fourier matrix** with the elements
#
# $$F_{kl} = \frac{1}{\sqrt{n}} w^{kl}, \quad k, l = 0, \ldots, n-1, \quad w = e^{\frac{2 \pi i}{n}},$$
#
# and $\Lambda = \mathrm{diag}(\lambda)$ is the diagonal matrix with
#
# $$\lambda = \sqrt{n}\, F^* c, $$
#
# where $c$ is the first column of the circulant matrix (i.e., $\lambda$ is the discrete Fourier transform of $c$).

# ## FFT
#
# The matrix-by-vector product $F x$ can be computed in $\mathcal{O}(n \log n)$ operations using the Fast Fourier Transform (FFT).

# ## Fast Toeplitz-by-vector product by circulant embedding
#
# A Toeplitz matrix $T$ can be embedded into a $2n \times 2n$ circulant matrix
#
# $$ C = \begin{bmatrix} T & T_2 \\
# T_1 & T \end{bmatrix}, $$
#
# and then
#
# $$
# C \begin{bmatrix} q \\ 0 \end{bmatrix} = \begin{bmatrix} Tq \\ T_1 q \end{bmatrix},
# $$
#
# so the first $n$ components of the circulant product (computed via the FFT) give $Tq$.

# ## Linear system solution with Toeplitz matrices
#
# How to solve linear systems with Toeplitz matrices?
#
# Since we have a fast matrix-by-vector product, we can use iterative methods.
#
# We can also use direct solvers, but what is **the structure of the inverse** of such a matrix?
#
# We will discuss it next time!

# ## Take home message
# - ILU
# - Iterative eigensolvers
# - Toeplitz matrices, FFT, circulant matrices

# ## Next lecture
# - More on Toeplitz matrices (direct and iterative methods, block case).
# - Back to matrices: matrix functions
#
# Questions?

# In[1]:

from IPython.core.display import HTML
def css_styling():
    styles = open("./styles/custom.css", "r").read()
    return HTML(styles)
css_styling()