**Week 1:** Intro week, floating point, vector norms, matrix multiplication

- Concept of floating point
- Basic vector norms ($p$-norm, Manhattan distance, 2-norm, Chebyshev norm)
- A short demo on $L_1$-norm minimization
- Concept of forward/backward error

Today we will talk about:

- Matrices
- Matrix multiplication
- Matrix norms, operator norms
- Unitary matrices, unitarily invariant norms
- Concept of block algorithms for NLA: why and how.
- Complexity of matrix multiplication

Consider composition of two linear operators:

- $y = Bx$
- $z = Ay$

Then, $z = Ay = A B x = C x$, where $C$ is the **matrix-by-matrix product**.
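This identity can be checked numerically (a quick NumPy sketch; the shapes are chosen just for illustration):

```python
import numpy as np

# Composition of two linear operators: applying B, then A,
# is the same as applying the single matrix C = AB.
rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4))
B = rng.standard_normal((4, 5))
x = rng.standard_normal(5)

z1 = A @ (B @ x)   # apply B, then A
z2 = (A @ B) @ x   # apply the product matrix C = AB
print(np.allclose(z1, z2))  # True
```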

A product of an $n \times k$ matrix $A$ and a $k \times m$ matrix $B$ is an $n \times m$ matrix $C$ with the elements

$$
c_{ij} = \sum_{s=1}^k a_{is} b_{sj}, \quad i = 1, \ldots, n, \quad j = 1, \ldots, m
$$

The complexity of the naive algorithm for multiplying two $n \times n$ matrices is $\mathcal{O}(n^3)$ (more generally, $\mathcal{O}(nkm)$).

Matrix-by-matrix product is the **core** for almost all efficient algorithms in linear algebra.

Basically, all NLA algorithms are reduced to a sequence of matrix-by-matrix products, so an efficient implementation of MM speeds up all of them by the same factor.

However, implementing MM is not easy at all!

Is it easy to multiply a matrix by a matrix?

The answer is: **no**, if you want it to run as fast as possible on the computers at hand.

Let us do a short demo and compare `np.dot()` (which in my case uses MKL under the hood) with a hand-written matrix-by-matrix routine in Python, and also with its compiled versions (Numba below; a Cython variant is also possible and gives a very short introduction to Cython).

In [24]:

```
import numpy as np

def matmul(a, b):
    n = a.shape[0]
    k = a.shape[1]
    m = b.shape[1]
    c = np.zeros((n, m))
    # Transposing b first would cost only O(n^2) operations
    # and improve the memory locality of the innermost loop.
    for i in range(n):  # the triple loop costs O(n^3) operations
        for j in range(m):
            for s in range(k):
                c[i, j] += a[i, s] * b[s, j]
    return c
```

In [22]:

```
%reload_ext cythonmagic
```

In [26]:

```
import numpy as np
from numba import jit

@jit
def numba_matmul(a, b):
    n = a.shape[0]
    k = a.shape[1]
    m = b.shape[1]
    c = np.zeros((n, m))
    for i in range(n):
        for j in range(m):
            for s in range(k):
                c[i, j] += a[i, s] * b[s, j]
    return c
```

Then we just compare computational times.

Guess the answer.

In [27]:

```
n = 100
a = np.random.randn(n, n)
b = np.random.randn(n, n)
%timeit c = numba_matmul(a, b)
#%timeit cf = cython_matmul(a, b)
%timeit c = np.dot(a, b)
```

Why is it so?

There are two important issues:

- Computers are more and more parallel (multicore, graphics processing units)
- The memory pyramid: there is a whole hierarchy of levels

Fast memory is small, bigger memory is slow.

- Data fits into the fast memory: load all data, compute.
- Data does not fit into the fast memory: load data by chunks, compute, load again.

We need to reduce the number of read/write operations!

This is typically achieved in efficient implementations of the BLAS libraries, one of which (Intel MKL) we now use.

Basic Linear Algebra Subprograms (**BLAS**) have three levels:

- BLAS-1, operations like $c = a + b$
- BLAS-2, operations like matrix-by-vector product
- BLAS-3, matrix-by-matrix product

What is the principal difference between them?

The main difference is the number of operations vs. the number of input data!

- BLAS-1: $\mathcal{O}(n)$ data, $\mathcal{O}(n)$ operations
- BLAS-2: $\mathcal{O}(n^2)$ data, $\mathcal{O}(n^2)$ operations
- BLAS-3: $\mathcal{O}(n^2)$ data, $\mathcal{O}(n^3)$ operations
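The difference can be made concrete by computing the flops-to-data ratio (arithmetic intensity) for each level; a minimal sketch, with approximate constants:

```python
# Arithmetic intensity: floating-point operations per element of data moved,
# for vectors/matrices of size n (constants are approximate).
n = 2000
levels = {
    "BLAS-1 (y = a + b)": (n,        2 * n),      # n adds, ~2n inputs
    "BLAS-2 (y = A @ x)": (2 * n**2, n**2 + n),   # 2n^2 flops, ~n^2 data
    "BLAS-3 (C = A @ B)": (2 * n**3, 3 * n**2),   # 2n^3 flops, ~3n^2 data
}
for name, (flops, data) in levels.items():
    print(f"{name}: {flops / data:.1f} flops per element moved")
```

Only the BLAS-3 intensity grows with $n$, so only matrix-by-matrix products can keep the arithmetic units busy while data is being moved.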

**Remark**: the quest for an $\mathcal{O}(n^2)$ matrix-by-matrix multiplication algorithm is not over yet.

Strassen's algorithm gives $\mathcal{O}(n^{\log_2 7}) \approx \mathcal{O}(n^{2.81})$.

The current world record is about $\mathcal{O}(n^{2.37})$.

The constants are unfortunately too big to make these algorithms practical!
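For illustration, one level of Strassen's recursion can be written down directly (a sketch; `strassen_step` is a hypothetical helper, and a practical implementation would recurse and switch to BLAS for small blocks):

```python
import numpy as np

def strassen_step(A, B):
    # One level of Strassen's recursion for even-sized square matrices:
    # 7 block multiplications instead of the naive 8.
    n = A.shape[0] // 2
    A11, A12, A21, A22 = A[:n, :n], A[:n, n:], A[n:, :n], A[n:, n:]
    B11, B12, B21, B22 = B[:n, :n], B[:n, n:], B[n:, :n], B[n:, n:]
    M1 = (A11 + A22) @ (B11 + B22)
    M2 = (A21 + A22) @ B11
    M3 = A11 @ (B12 - B22)
    M4 = A22 @ (B21 - B11)
    M5 = (A11 + A12) @ B22
    M6 = (A21 - A11) @ (B11 + B12)
    M7 = (A12 - A22) @ (B21 + B22)
    C = np.empty_like(A)
    C[:n, :n] = M1 + M4 - M5 + M7
    C[:n, n:] = M3 + M5
    C[n:, :n] = M2 + M4
    C[n:, n:] = M1 - M2 + M3 + M6
    return C

A = np.random.randn(8, 8)
B = np.random.randn(8, 8)
print(np.allclose(strassen_step(A, B), A @ B))  # True
```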

How can we use the memory hierarchy?

Break the matrix into blocks! ($2 \times 2$ is an **illustration**)

$ A = \begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix}$, $B = \begin{bmatrix} B_{11} & B_{12} \\ B_{21} & B_{22} \end{bmatrix}$

Then,

$AB$ = $\begin{bmatrix}A_{11} B_{11} + A_{12} B_{21} & A_{11} B_{12} + A_{12} B_{22} \\ A_{21} B_{11} + A_{22} B_{21} & A_{21} B_{12} + A_{22} B_{22}\end{bmatrix}.$

If $A_{11}, B_{11}$ and their product fit into the cache memory (which is 1024 Kb for the Haswell Intel Chip), then we load them only once into the memory.

The number of read/writes is reduced by a factor $\sqrt{M}$, where $M$ is the cache size.

- Have to do linear algebra in terms of blocks!
- So, you cannot even do Gaussian elimination as usual (or you suffer a 10x performance loss)

Blocking also has a deep connection with parallel computations.

Consider adding two vectors:
$$ c = a + b$$
and we have two processors.

How fast can we go?

Of course, not more than twice as fast.

In [2]:

```
## This demo requires Anaconda distribution to be installed
import mkl
import numpy as np
n = 1000000
a = np.random.randn(n)
mkl.set_num_threads(1)
%timeit a + a
mkl.set_num_threads(2)
%timeit a + a
```

In [3]:

```
## This demo requires Anaconda distribution to be installed
import mkl
n = 500
a = np.random.randn(n, n)
mkl.set_num_threads(1)
%timeit a.dot(a)
mkl.set_num_threads(4)
%timeit a.dot(a)
```

Typically, two cases are distinguished:

- Shared memory (e.g., multicore on every desktop/smartphone)
- Distributed memory (each processor has its own memory and can send information through a network)

In both cases, the efficiency is governed by the **memory bandwidth**:

for BLAS-1 and BLAS-2 routines (like the sum of two vectors), reads/writes take all the time.

For BLAS-3 routines, a much more noticeable speedup can be obtained.

For large-scale clusters (>100 000 cores, see the Top500 list), scaling is still achievable.

A new direction in NLA is **communication-avoiding** algorithms (cf. Hadoop): you have many computing nodes, but communication between them is very slow and limited.

This requires **absolutely different algorithms**.

This can be an interesting **project** (e.g., doing NLA in a cloud).

- MM is the core of NLA. You have to think in block terms, if you want high efficiency
- This is all about computer memory hierarchy
- The $\mathcal{O}(n^{2 + \epsilon})$ complexity hypothesis is neither proven nor disproven yet.

Now we go to matrix norms.

In [14]:

```
from IPython.core.display import HTML

def css_styling():
    styles = open("./styles/custom.css", "r").read()
    return HTML(styles)

css_styling()
```
