Lecture 3.1: Matrix multiplication part 1

Syllabus

Week 1: Intro week, floating point, vector norms, matrix multiplication

Recap of the previous lecture

  • Concept of floating point
  • Basic vector norms ($p$-norm, Manhattan distance, 2-norm, Chebyshev norm)
  • A short demo on $L_1$-norm minimization
  • Concept of forward/backward error

Today's lecture

Today we will talk about:

  • Matrices
  • Matrix multiplication
  • Matrix norms, operator norms
  • Unitary matrices, unitarily invariant norms
  • Concept of block algorithms for NLA: why and how
  • Complexity of matrix multiplication

Matrix-by-matrix product

Consider composition of two linear operators:

  1. $y = Bx$
  2. $z = Ay$

Then, $z = Ay = A B x = C x$, where $C = AB$ is the matrix-by-matrix product.

A product of an $n \times k$ matrix $A$ and a $k \times m$ matrix $B$ is an $n \times m$ matrix $C$ with the elements
$$ c_{ij} = \sum_{s=1}^k a_{is} b_{sj}, \quad i = 1, \ldots, n, \quad j = 1, \ldots, m $$
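To see the composition property in action, here is a quick NumPy check (not part of the original notebook) that applying $B$ and then $A$ gives the same result as applying $C = AB$:

import numpy as np

n, k, m = 3, 4, 5
A = np.random.randn(n, k)
B = np.random.randn(k, m)
x = np.random.randn(m)

z1 = A.dot(B.dot(x))  # apply B first, then A
C = A.dot(B)          # form the product once
z2 = C.dot(x)

print(np.allclose(z1, z2))  # True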

Complexity of MM

Complexity of the naive algorithm for MM of two $n \times n$ matrices is $\mathcal{O}(n^3)$: each of the $n^2$ entries of $C$ is a sum of $n$ products, i.e. about $2n^3$ floating point operations in total.

Matrix-by-matrix product is the core for almost all efficient algorithms in linear algebra.

Basically, all NLA algorithms are reduced to a sequence of matrix-by-matrix products,

so an efficient implementation of MM speeds up these numerical algorithms by the same factor.

However, implementing MM is not easy at all!

Efficient implementation for MM

Is it easy to multiply a matrix by a matrix?

The answer is: no, if you want it to run as fast as possible

on the computers at hand.

Demo

Let us do a short demo and compare the np.dot() procedure, which in my case uses MKL, with a hand-written matrix-by-matrix routine in pure Python and a JIT-compiled version of the same routine using Numba.

In [24]:
import numpy as np
def matmul(a, b):
    n = a.shape[0]
    k = a.shape[1]
    m = b.shape[1]
    c = np.zeros((n, m))
    # Transposing b first would cost only O(n^2) operations
    # and make the innermost access pattern more cache-friendly
    for i in range(n):  # the triple loop costs O(n^3) operations
        for j in range(m):
            for s in range(k):
                c[i, j] += a[i, s] * b[s, j]
    return c
In [26]:
import numpy as np
from numba import jit
@jit  # Numba compiles this function to machine code on the first call
def numba_matmul(a, b):
    n = a.shape[0]
    k = a.shape[1]
    m = b.shape[1]
    c = np.zeros((n, m))
    for i in range(n):
        for j in range(m):
            for s in range(k):
                c[i, j] += a[i, s] * b[s, j]
    return c
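
As a quick sanity check (not part of the original notebook), we can verify that both hand-written routines agree with np.dot:

a = np.random.randn(4, 4)
b = np.random.randn(4, 4)
print(np.allclose(matmul(a, b), np.dot(a, b)))        # True
print(np.allclose(numba_matmul(a, b), np.dot(a, b)))  # True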

Then we just compare computational times.

Guess the answer.

In [27]:
n = 100
a = np.random.randn(n, n)
b = np.random.randn(n, n)
%timeit c = numba_matmul(a, b)
%timeit c = np.dot(a, b)
The slowest run took 294.61 times longer than the fastest. This could mean that an intermediate result is being cached 
1 loops, best of 3: 952 µs per loop
10000 loops, best of 3: 112 µs per loop

Why is it so?
There are two important issues:

  • Computers are more and more parallel (multicore, graphics processing units)
  • The memory pyramid: there is a whole hierarchy of levels

Memory architecture

Fast memory is small, bigger memory is slow.

  • Data fits into the fast memory:
    load all data, compute
  • Data does not fit into the fast memory:
    load data by chunks, compute, load again

We need to reduce the number of read/write operations!

This is typically achieved in efficient implementations of the BLAS library, one of which (Intel MKL) we are using now.

BLAS

The Basic Linear Algebra Subprograms (BLAS) come in three levels:

  1. BLAS-1, operations like $c = a + b$
  2. BLAS-2, operations like matrix-by-vector product
  3. BLAS-3, matrix-by-matrix product

What is the principal difference between them?

The main difference is the number of operations vs. the number of input data!

  1. BLAS-1: $\mathcal{O}(n)$ data, $\mathcal{O}(n)$ operations
  2. BLAS-2: $\mathcal{O}(n^2)$ data, $\mathcal{O}(n^2)$ operations
  3. BLAS-3: $\mathcal{O}(n^2)$ data, $\mathcal{O}(n^3)$ operations
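
A minimal sketch (not from the original notebook, and assuming NumPy is linked against an optimized BLAS, as it is here with MKL) illustrating why this ratio matters: for the same $\mathcal{O}(n^2)$ amount of data, BLAS-3 achieves a much higher flop rate, because every element is reused $\mathcal{O}(n)$ times:

import time
import numpy as np

n = 2000
x = np.random.randn(n * n)  # same data volume as an n x n matrix
y = np.random.randn(n * n)
A = np.random.randn(n, n)
B = np.random.randn(n, n)

t = time.time()
x + y                       # BLAS-1: ~n^2 flops on ~n^2 data
t1 = time.time() - t

t = time.time()
A.dot(B)                    # BLAS-3: ~2 n^3 flops on ~n^2 data
t3 = time.time() - t

print("BLAS-1: %.1f GFLOP/s" % (n * n / t1 / 1e9))
print("BLAS-3: %.1f GFLOP/s" % (2.0 * n ** 3 / t3 / 1e9))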

Remark: the quest for an $\mathcal{O}(n^2)$ matrix-by-matrix multiplication algorithm is not over yet.

Strassen's algorithm gives $\mathcal{O}(n^{\log_2 7}) \approx \mathcal{O}(n^{2.807})$.

The world record is $\mathcal{O}(n^{2.37\ldots})$, due to Le Gall (2014).

The constant is unfortunately too big to make it practical!
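
For illustration, here is a minimal recursive sketch of Strassen's algorithm (not from the original notebook; it assumes square matrices whose size is a power of two and falls back to np.dot for small blocks):

import numpy as np

def strassen(A, B, leaf=64):
    # 7 multiplications of half-size blocks instead of 8,
    # giving O(n^{log2 7}) = O(n^{2.807...}) complexity
    n = A.shape[0]
    if n <= leaf:  # switch to ordinary MM for small blocks
        return A.dot(B)
    h = n // 2
    A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
    B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
    M1 = strassen(A11 + A22, B11 + B22, leaf)
    M2 = strassen(A21 + A22, B11, leaf)
    M3 = strassen(A11, B12 - B22, leaf)
    M4 = strassen(A22, B21 - B11, leaf)
    M5 = strassen(A11 + A12, B22, leaf)
    M6 = strassen(A21 - A11, B11 + B12, leaf)
    M7 = strassen(A12 - A22, B21 + B22, leaf)
    C = np.empty((n, n))
    C[:h, :h] = M1 + M4 - M5 + M7
    C[:h, h:] = M3 + M5
    C[h:, :h] = M2 + M4
    C[h:, h:] = M1 - M2 + M3 + M6
    return C

A = np.random.randn(256, 256)
B = np.random.randn(256, 256)
print(np.allclose(strassen(A, B), A.dot(B)))  # True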

Memory hierarchy

How can we use the memory hierarchy?

Break the matrix into blocks! ($2 \times 2$ is an illustration)

$ A = \begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix}$, $B = \begin{bmatrix} B_{11} & B_{12} \\ B_{21} & B_{22} \end{bmatrix}$

Then,

$AB$ = $\begin{bmatrix}A_{11} B_{11} + A_{12} B_{21} & A_{11} B_{12} + A_{12} B_{22} \\ A_{21} B_{11} + A_{22} B_{21} & A_{21} B_{12} + A_{22} B_{22}\end{bmatrix}.$

If $A_{11}$, $B_{11}$ and their product fit into the cache memory (e.g., 1024 KB on an Intel Haswell chip), then we need to load them from slow memory only once.
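
Here is a toy sketch of blocked multiplication in NumPy (not from the original notebook; the block size bs is a hypothetical tuning parameter, chosen so that three $bs \times bs$ blocks fit into cache):

import numpy as np

def blocked_matmul(a, b, bs=64):
    # Multiply block by block; each small product reuses its
    # operands many times while they stay in fast memory
    n, k = a.shape
    m = b.shape[1]
    c = np.zeros((n, m))
    for i in range(0, n, bs):
        for s in range(0, k, bs):
            for j in range(0, m, bs):
                c[i:i+bs, j:j+bs] += a[i:i+bs, s:s+bs].dot(b[s:s+bs, j:j+bs])
    return c

a = np.random.randn(300, 200)
b = np.random.randn(200, 250)
print(np.allclose(blocked_matmul(a, b), a.dot(b)))  # True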

Key point

The number of read/writes is reduced by a factor of $\sqrt{M}$, where $M$ is the cache size: with $b \times b$ blocks chosen so that three blocks fit into cache ($3b^2 \approx M$), each block multiplication does $\mathcal{O}(b^3)$ operations per $\mathcal{O}(b^2)$ memory transfers, i.e. $\mathcal{O}(b) = \mathcal{O}(\sqrt{M})$ operations per transfer.

  • We have to do linear algebra in terms of blocks!
  • So, you cannot even do Gaussian elimination as usual (or you suffer a ~10x performance loss)

Parallelization

Blocking also has a deep connection with parallel computations.
Consider adding two vectors: $$ c = a + b$$ and suppose we have two processors.

How fast can we go?

Of course, no more than twice as fast.

In [2]:
## This demo requires Anaconda distribution to be installed
import mkl
import numpy as np
n = 1000000
a = np.random.randn(n)
mkl.set_num_threads(1)
%timeit a + a
mkl.set_num_threads(2)
%timeit a + a
100 loops, best of 3: 2.82 ms per loop
100 loops, best of 3: 2.82 ms per loop
In [3]:
## This demo requires Anaconda distribution to be installed
import mkl
n = 500
a = np.random.randn(n, n)
mkl.set_num_threads(1)
%timeit a.dot(a)
mkl.set_num_threads(4)
%timeit a.dot(a)
100 loops, best of 3: 10.4 ms per loop
100 loops, best of 3: 5.73 ms per loop

Typically, two cases are distinguished:

  1. Shared memory (e.g., multicore processors in every desktop/smartphone)
  2. Distributed memory (each processor has its own memory and can send information through a network)

In both cases, the efficiency is governed by the

memory bandwidth:

for BLAS-1 and BLAS-2 routines (like the sum of two vectors), reads and writes take all the time.

For BLAS-3 routines, a much more noticeable speedup can be obtained.

For large-scale clusters (>100 000 cores, see the Top500 list), scaling is still possible.

Communication-avoiding algorithms

A new direction in NLA is communication-avoiding algorithms (e.g., for Hadoop-style clusters), where you have many computing nodes but very slow communication with limited capabilities.

This requires absolutely different algorithms.

This can be an interesting project (e.g., doing NLA in the cloud).

Summary of MM part

  • MM is the core of NLA. You have to think in terms of blocks if you want high efficiency
  • This is all about the computer memory hierarchy
  • The $\mathcal{O}(n^{2 + \epsilon})$ complexity hypothesis is neither proven nor disproven yet

Now we go to matrix norms.
