Material for a UC Irvine course offered by the Department of Physics and Astronomy.
Content is maintained on github and distributed under a BSD3 license.
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()
import numpy as np
import pandas as pd
Most practical algorithms of ML can be decomposed into small steps where the calculations are expressed with linear algebra, i.e., linear combinations of scalars, vectors and matrices.
For example, a neural network can be built from layers that each calculate $$ \mathbf{x}_\text{out} = \max(0, W \mathbf{x}_\text{in} + \mathbf{b}) \; , $$ where $W$ is a matrix, and boldface symbols represent vectors. In typical applications, $\mathbf{x}_\text{out}$ and $\mathbf{x}_\text{in}$ are derived from data while $W$ and $\mathbf{b}$ are considered model parameters. (This expression is not strictly linear: why?)
The python numeric and list types can represent arbitrary scalars, vectors, and matrices, but are designed for flexibility instead of efficiency.
Numpy is instead optimized for the special case where all list elements are numeric values of the same type, which can be organized and accessed very efficiently in memory using a specialized array type with lots of nice features. One downside of this approach is that most of the builtin math functions are duplicated (e.g., math.sin and np.sin) to work with numpy arrays.
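For example, np.sin works element-wise on an entire array, while math.sin only accepts a single scalar value:

```python
import math
import numpy as np

angles = np.array([0.0, np.pi / 2, np.pi])

# np.sin applies element-wise to the whole array.
print(np.sin(angles))

# math.sin only accepts a single scalar, so an array raises an error.
try:
    math.sin(angles)
except TypeError as e:
    print(e)
```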
EXERCISE: Complete the function below using numpy to evaluate the neural-network layer defined above:
def xout(W, xin, b):
    # Add your solution here
    return 0
SOLUTION:
def xout(W, xin, b):
    return np.maximum(0, W.dot(xin) + b)
We frequently use $\mathbf{r} = (x, y, z)$ in physics to represent an arbitrary position in three (continuous) dimensions.
In numpy, we cannot represent an arbitrary position but can easily represent a specific position, for example:
rvec = np.array([0.1, -0.2, 0.3])
However, rvec
has only one (discrete) dimension, which we use to access its three array elements with indices 0,1,2:
rvec[0], rvec[1], rvec[2]
(0.1, -0.2, 0.3)
Note how we use the term dimension differently in these two cases!
All numpy arrays have a shape
property that specifies the range of indices allowed for each of their (discrete) dimensions:
rvec.shape
(3,)
rvec.ndim
1
Compare with a matrix represented in numpy:
matrix = np.identity(3)
print(matrix)
[[1. 0. 0.]
 [0. 1. 0.]
 [0. 0. 1.]]
matrix[1, 0], matrix[1, 1]
(0.0, 1.0)
matrix.shape
(3, 3)
matrix.ndim
2
Numpy supports arrays with any (finite) number of (discrete) dimensions. The general name for these arrays is a tensor (so, scalars, vectors and matrices are tensors). For example:
tensor = np.ones((2, 3, 4))
print(tensor)
[[[1. 1. 1. 1.]
  [1. 1. 1. 1.]
  [1. 1. 1. 1.]]

 [[1. 1. 1. 1.]
  [1. 1. 1. 1.]
  [1. 1. 1. 1.]]]
tensor[0, 0, 0], tensor[1, 2, 3]
(1.0, 1.0)
tensor.shape
(2, 3, 4)
tensor.ndim
3
Tensors are also used in physics: for example, the tensor expression $g^{il} \Gamma^m_{ki} x^k$ arises in contravariant derivatives in General Relativity. What are the (numpy) dimensions of $g$, $\Gamma$ and $x$ in this expression? Note that numpy tensors do not make any distinction between upper and lower indices.
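Contractions like this can be written directly in numpy with np.einsum, which sums over any repeated index labels. A sketch using random stand-in tensors (the names g, Gamma and x are just the symbols from the expression above, filled with arbitrary values):

```python
import numpy as np

gen = np.random.RandomState(seed=123)
n = 4  # spacetime dimension
g = gen.normal(size=(n, n))         # g^{il}: two indices
Gamma = gen.normal(size=(n, n, n))  # Gamma^m_{ki}: three indices
x = gen.normal(size=n)              # x^k: one index

# Sum over the repeated indices i and k, leaving the free indices l and m.
result = np.einsum('il,mki,k->lm', g, Gamma, x)
print(result.shape)  # (4, 4)
```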
The numpy dimension is sometimes also referred to as the rank, but note that array rank is similar to but subtly different from linear algebra rank.
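For example, a 3x3 matrix always has numpy "rank" (ndim) 2, but its linear-algebra rank depends on its values:

```python
import numpy as np

# A singular 3x3 matrix: the second row is twice the first.
singular = np.array([[1., 2., 3.],
                     [2., 4., 6.],
                     [0., 0., 1.]])

print(singular.ndim)                          # numpy "rank": 2
print(np.linalg.matrix_rank(singular))        # linear-algebra rank: 2
print(np.linalg.matrix_rank(np.identity(3)))  # full rank: 3
```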
The most common ways you will create new arrays are:
# Regular sequence of values
shape = (3, 4)
c1 = np.zeros(shape)
c2 = np.ones(shape)
c3 = np.full(shape, -1)
c4 = np.arange(12)
# Reproducible "random" numbers
gen = np.random.RandomState(seed=123)
r1 = gen.uniform(size=shape)
r2 = gen.normal(loc=-1, scale=2, size=shape)
# Calculated as function of existing array.
f1 = r1 * np.sin(r2) ** c3
All the values contained within a tensor have the same data type, which you can inspect:
c1.dtype, c4.dtype
(dtype('float64'), dtype('int64'))
EXERCISE: Try to guess the dtype
of c3
, r2
and f1
, then check your answer. Deep learning often uses smaller (32 or 16 bit) float data types: what advantages and disadvantages might this have?
SOLUTION: The zeros
and ones
functions default to float64
, but full
uses the type of the provided constant value. Integers are automatically promoted to floats in mixed expressions.
c3.dtype, r2.dtype, f1.dtype
(dtype('int64'), dtype('float64'), dtype('float64'))
Smaller floats allow more efficient use of limited (GPU) memory and faster calculations, at the cost of some accuracy. Since the training of a deep neural network is inherently noisy, this is generally a good tradeoff.
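Both effects are easy to see directly: a float32 array uses half the memory of a float64 array, but carries far fewer significant digits:

```python
import numpy as np

x64 = np.ones(1000, dtype=np.float64)
x32 = x64.astype(np.float32)

# Half the memory...
print(x64.nbytes, x32.nbytes)  # 8000 4000

# ...but fewer significant digits (roughly 16 vs 7 decimal digits).
print(np.finfo(np.float64).eps)  # ~2.2e-16
print(np.finfo(np.float32).eps)  # ~1.2e-07
```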
c4.reshape(c1.shape)
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])
EXERCISE: Predict the result of c4.reshape(2, 3, 2)
then check your answer.
c4.reshape(2, 3, 2)
array([[[ 0,  1],
        [ 2,  3],
        [ 4,  5]],

       [[ 6,  7],
        [ 8,  9],
        [10, 11]]])
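A useful shortcut is to give one dimension as -1, which numpy then infers from the total number of elements:

```python
import numpy as np

c4 = np.arange(12)

# -1 means "infer this dimension": 12 elements / 4 columns = 3 rows.
print(c4.reshape(-1, 4).shape)  # (3, 4)

# Reshape to a single column, a pattern that is handy for broadcasting.
print(c4.reshape(-1, 1).shape)  # (12, 1)
```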
The real power of tensor computing comes from expressions like this:
# Add one to each element.
c4 + 1
array([ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12])
# Scale each column of the 3x4 ones matrix by a different value.
np.ones(shape=(3, 4)) * np.arange(4)
array([[0., 1., 2., 3.],
       [0., 1., 2., 3.],
       [0., 1., 2., 3.]])
The results are not surprising in these examples, but something non-trivial is going on behind the scenes to make this work since we are combining tensors with different shapes. This is called broadcasting and has specific rules for how to handle less obvious cases.
Broadcasting serves two purposes: it lets you write concise expressions without explicit python loops, and it avoids actually copying the smaller array out to the larger shape in memory.
Not all expressions can be automatically broadcast, even if they seem to make sense. For example:
# Scale each row of the 3x4 ones matrix by a different value.
try:
    np.ones(shape=(3, 4)) * np.arange(3)
except ValueError as e:
    print(e)
operands could not be broadcast together with shapes (3,4) (3,)
However, you can usually reshape the inputs to get the desired result:
np.ones(shape=(3, 4)) * np.arange(3).reshape(3, 1)
array([[0., 0., 0., 0.],
       [1., 1., 1., 1.],
       [2., 2., 2., 2.]])
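An equivalent way to add the length-1 dimension is to index with np.newaxis (an alias for None), which avoids hard-coding the length 3:

```python
import numpy as np

col = np.arange(3)[:, np.newaxis]  # same as np.arange(3).reshape(3, 1)
print(col.shape)  # (3, 1)
print(np.ones(shape=(3, 4)) * col)
```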
Another useful trick is to use keepdims=True
with reducing functions, e.g.
print(np.ones((4, 3)).sum(axis=1))
print(np.ones((4, 3)).sum(axis=1, keepdims=True))
[3. 3. 3. 3.]
[[3.]
 [3.]
 [3.]
 [3.]]
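This matters whenever you want to combine the reduced result with the original array. A small sketch that normalizes each row of a matrix to sum to one:

```python
import numpy as np

gen = np.random.RandomState(seed=123)
x = gen.uniform(size=(4, 3))

# keepdims=True gives shape (4, 1), which broadcasts across each row;
# without it, the shape (4,) would not broadcast against (4, 3).
normalized = x / x.sum(axis=1, keepdims=True)
print(normalized.sum(axis=1))  # [1. 1. 1. 1.]
```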
To experiment with broadcasting rules, define a function to try broadcasting two arbitrary tensor shapes:
def broadcast(shape1, shape2):
    array1 = np.ones(shape1)
    array2 = np.ones(shape2)
    try:
        array12 = array1 + array2
        print('shapes {} {} broadcast to {}'.format(shape1, shape2, array12.shape))
    except ValueError as e:
        print(e)
broadcast((1, 3), (3,))
shapes (1, 3) (3,) broadcast to (1, 3)
broadcast((1, 2), (3,))
operands could not be broadcast together with shapes (1,2) (3,)
EXERCISE: Predict the results of the following then check your answers:
broadcast((3, 1, 2), (3, 2))
broadcast((2, 1, 3), (3, 2))
broadcast((3,), (2, 1))
broadcast((3,), (1, 2))
broadcast((3,), (1, 3))
broadcast((3, 1, 2), (3, 2))
broadcast((2, 1, 3), (3, 2))
broadcast((3,), (2, 1))
broadcast((3,), (1, 2))
broadcast((3,), (1, 3))
shapes (3, 1, 2) (3, 2) broadcast to (3, 3, 2)
operands could not be broadcast together with shapes (2,1,3) (3,2)
shapes (3,) (2, 1) broadcast to (2, 3)
operands could not be broadcast together with shapes (3,) (1,2)
shapes (3,) (1, 3) broadcast to (1, 3)
Numpy is an example of a framework for tensor computing that is widely supported and requires no special hardware. However, it still offers significant performance improvements by eliminating explicit python loops and using memory efficiently.
For example, let's calculate the opening angle separation between two unit vectors, each specified with (lat, lon) angles in radians (or RA,DEC for astronomers, as implemented here). The Haversine formula is a good way to calculate this quantity.
Generate a large number of random unit vectors for benchmarking (are these uniformly distributed on the sphere?)
def generate(N, seed=123):
    gen = np.random.RandomState(seed=seed)
    lats = gen.uniform(low=-np.pi / 2, high=+np.pi / 2, size=N)
    lons = gen.uniform(low=0, high=2 * np.pi, size=N)
    plt.plot(lons, lats, '.')
    return lats, lons
lats, lons = generate(N=1000)
Use explicit python loops to calculate the (square) matrix of separation angles between all pairs of unit vectors:
def separation_matrix_loops():
    # Allocate memory for the matrix.
    N = len(lats)
    matrix = np.empty((N, N))
    for i, (lat1, lon1) in enumerate(zip(lats, lons)):
        for j, (lat2, lon2) in enumerate(zip(lats, lons)):
            # Evaluate the Haversine formula for matrix element [i, j].
            matrix[i, j] = 2 * np.arcsin(np.sqrt(
                np.sin(0.5 * (lat2 - lat1)) ** 2 +
                np.cos(lat1) * np.cos(lat2) * np.sin(0.5 * (lon2 - lon1)) ** 2))
    return matrix
%time S1 = separation_matrix_loops()
CPU times: user 8.72 s, sys: 39.2 ms, total: 8.76 s Wall time: 8.79 s
Now calculate the same separations using numpy implicit loops:
def separation_matrix_numpy():
    lat1, lat2 = lats, lats.reshape(-1, 1)
    lon1, lon2 = lons, lons.reshape(-1, 1)
    return 2 * np.arcsin(np.sqrt(
        np.sin(0.5 * (lat2 - lat1)) ** 2 +
        np.cos(lat1) * np.cos(lat2) * np.sin(0.5 * (lon2 - lon1)) ** 2))
%time S2 = separation_matrix_numpy()
CPU times: user 56.8 ms, sys: 16.1 ms, total: 72.9 ms Wall time: 71.7 ms
Check that both calculations give the same results:
np.allclose(S1, S2)
True
Since this is so much faster, increase the amount of computation (and memory) 100x for a better benchmark:
lats, lons = generate(N=10000)
%time S2 = separation_matrix_numpy()
CPU times: user 5.26 s, sys: 1.57 s, total: 6.83 s Wall time: 6.85 s
Therefore using implicit numpy loops speeds up the calculation by a factor of roughly 8.8 / 0.072 ≈ 120. Since we are using the efficient numpy arrays in both cases, the speedup is entirely due to eliminating the explicit python loops!
Machine learning relies heavily on frameworks that copy the successful numpy design for tensor computing, while adding some important new features:
Unlike numpy, the default type in these frameworks is usually a 32-bit float, rather than a 64-bit float.
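In numpy the default is float64, so you must request 32-bit floats explicitly when you want to match the usual ML default:

```python
import numpy as np

print(np.ones(3).dtype)                    # float64 (the numpy default)
print(np.ones(3, dtype=np.float32).dtype)  # float32 (the usual ML default)
```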
The two most popular tensor computing frameworks for machine learning today are PyTorch and TensorFlow. Both are large open-source projects, primarily developed by Facebook (PyTorch) and Google (TensorFlow). These frameworks were originally quite different, with PyTorch preferred for research and TensorFlow preferred for large-scale deployment, but they are gradually converging towards a similar feature set.
Below, we repeat our calculation of the separation matrix with both of these frameworks. You will notice that the new features come with some additional complexity.
import torch
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f'Using device: {device}.')
lons_pt = torch.tensor(lons, device=device)
lats_pt = torch.tensor(lats, device=device)
Using device: cpu.
def separation_matrix_torch():
    lat1, lat2 = lats_pt, lats_pt.reshape(-1, 1)
    lon1, lon2 = lons_pt, lons_pt.reshape(-1, 1)
    return 2 * torch.asin(torch.sqrt(
        torch.sin(0.5 * (lat2 - lat1)) ** 2 +
        torch.cos(lat1) * torch.cos(lat2) * torch.sin(0.5 * (lon2 - lon1)) ** 2))
%time S3 = separation_matrix_torch()
CPU times: user 4.98 s, sys: 5.5 s, total: 10.5 s Wall time: 2.35 s
np.allclose(S2, S3.cpu().numpy())
True
import tensorflow as tf
device = 'GPU:0' if tf.config.list_physical_devices('GPU') else 'CPU:0'
print(f'Using device: {device}.')
with tf.device(device):
    lons_tf = tf.constant(lons)
    lats_tf = tf.constant(lats)
Using device: CPU:0.
def separation_matrix_tensorflow():
    lat1, lat2 = lats_tf, tf.reshape(lats_tf, [-1, 1])
    lon1, lon2 = lons_tf, tf.reshape(lons_tf, [-1, 1])
    return 2 * tf.asin(tf.sqrt(
        tf.sin(0.5 * (lat2 - lat1)) ** 2 +
        tf.cos(lat1) * tf.cos(lat2) * tf.sin(0.5 * (lon2 - lon1)) ** 2))
%time S4 = separation_matrix_tensorflow()
CPU times: user 20.4 s, sys: 9.31 s, total: 29.7 s Wall time: 3.54 s
np.allclose(S2, S4.numpy())
True
Tensor computing can be sped up significantly (10-100x) using hardware that is optimized to perform tensor computing by distributing simple calculations ("kernels") across many independent processors ("cores") running in parallel.
The original driver for such hardware was to accelerate the 3D geometry calculations required to render real time 3D graphics, leading to the first Graphics Processing Units (GPUs) in the 1990s. More recently, GPUs have been adopted for purely numerical calculations, with no display attached, leading to the development of specialized programming languages such as CUDA and OpenCL.
Currently, one vendor (Nvidia) dominates the use of GPUs for ML with its proprietary CUDA language. Google has also introduced an even more specialized TPU architecture.
The table below shows some benchmarks for the separation matrix problem, running on different hardware with different frameworks. The speed ups obtained using PyTorch and TensorFlow with a GPU are typical. The two frameworks provide comparable GPU performance overall, but can differ on specific problems.
| Test | Laptop | Server (GPU) | Colab (CPU) | Colab (GPU) |
| --- | --- | --- | --- | --- |
| numpy | 2.08s | 1.17s | 10.5s | 10.3s |
| torch | 7.32s | 48.7ms | --- | --- |
| tensorflow | --- | --- | 9.11s | 246ms |
| ratio | 3.5 | 1 / 24 | 0.87 | 1 / 41 |
To benefit from this hardware, you can either add a GPU to a linux server, or use a cloud computing platform.
Cloud computing is the easiest way to get started. There are some free options, but generally you have to "pay as you go" to do a useful amount of work. Some good starting points are:
Note: this is not a complete list, and pricing and capabilities are rapidly changing.
If you are considering building your own GPU server, start here. A single server can host 4 GPUs. Here is a single water-cooled RTX 2080 Ti GPU installed in my office:
In addition to hardware acceleration, a key feature of tensor computing frameworks for ML is their ability to automate the calculation of derivatives, which then enable efficient and accurate gradient-based optimization algorithms.
In general, a derivative can be implemented in software in three ways: numerically (with finite differences), symbolically (with computer algebra), or with automatic differentiation (AD), which is what these frameworks provide.
As a concrete example, calculate the (un-normalized) Gaussian distribution $$ y(x) = e^{-x^2} $$ in PyTorch:
x = torch.linspace(-5, 5, 20, requires_grad=True)
y = torch.exp(-x ** 2)
y
tensor([1.3888e-11, 2.0328e-09, 1.7098e-07, 8.2640e-06, 2.2952e-04, 3.6632e-03, 3.3596e-02, 1.7705e-01, 5.3619e-01, 9.3309e-01, 9.3309e-01, 5.3619e-01, 1.7705e-01, 3.3596e-02, 3.6632e-03, 2.2952e-04, 8.2640e-06, 1.7098e-07, 2.0328e-09, 1.3888e-11], grad_fn=<ExpBackward>)
We specify requires_grad=True
to enable AD for all tensors that depend on x
(so just y
in this case). To calculate partial derivatives ("gradients") of y
wrt x
, use:
y.backward(torch.ones_like(y))
The tensor x.grad
now contains $y'(x)$ at each value of x
:
x.grad
tensor([ 1.3888e-10, 1.8188e-08, 1.3498e-06, 5.6543e-05, 1.3288e-03, 1.7352e-02, 1.2377e-01, 4.6593e-01, 8.4661e-01, 4.9110e-01, -4.9110e-01, -8.4661e-01, -4.6593e-01, -1.2377e-01, -1.7352e-02, -1.3288e-03, -5.6543e-05, -1.3498e-06, -1.8188e-08, -1.3888e-10])
x_n = x.detach().numpy()
yp_n = x.grad.detach().numpy()
y_n = y.detach().numpy()
plt.plot(x_n, y_n, 'o--', label='$y(x)$')
plt.plot(x_n, yp_n, 'o:', label=r'$y^\prime(x)$')
plt.legend();
Note that these derivatives are calculated to full machine precision and not affected by the coarse spacing in $x$.
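Compare with a numerical finite-difference estimate of the same derivative, whose accuracy is limited by the grid spacing:

```python
import numpy as np

x = np.linspace(-5, 5, 20)
y = np.exp(-x ** 2)

# Finite-difference estimate of y'(x) on the same coarse grid.
yp_fd = np.gradient(y, x)

# Exact derivative for comparison: y'(x) = -2 x exp(-x^2).
yp_exact = -2 * x * np.exp(-x ** 2)

# The worst-case error is set by the grid spacing, not machine precision.
print(np.max(np.abs(yp_fd - yp_exact)))
```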
Jax is a relatively new framework for automatic differentiation (developed by google but independent of tensorflow) that relies on "just-in-time" compilation and is designed for ML research.
Although TensorFlow and PyTorch are both similar to numpy, they have different APIs so you are forced to choose one to take advantage of their unique features. However, for many calculations they are interchangeable, and a new ecosystem of higher-level APIs is growing to support this. For example, check out:
Neither of these packages are included in the MLS conda environment, but I encourage you to experiment with them if you want to write framework-independent tensor code.