3.1 Linear Regression¶

• Regression
• the task of predicting a real-valued target $y$ given a data point $x$.

3.1.1 Basic Elements of Linear Regression¶

• Prediction can be expressed as a linear combination of the input features.
• Linear Model

• Example: estimating the price of a house $$\mathrm{price} = w_{\mathrm{area}} \cdot \mathrm{area} + w_{\mathrm{age}} \cdot \mathrm{age} + b$$
• General Form

• In the case of $d$ variables $$\hat{y} = w_1 \cdot x_1 + ... + w_d \cdot x_d + b$$
$$\hat{y} = \mathbf{w}^\top \mathbf{x} + b$$
• We'll try to find the weight vector $\mathbf{w}$ and bias term $b$ that approximately associate data points $\mathbf{x}_i$ with their corresponding labels $y_i$.
• For a collection of data points $\mathbf{X}$ the predictions $\hat{\mathbf{y}}$ can be expressed via the matrix-vector product $${\hat{\mathbf{y}}} = \mathbf{X} \mathbf{w} + b$$

• Model parameters: $\mathbf{w}$, $b$
• Training Data

• ‘features’ or 'covariates'
• The input variables (e.g., area and age in the house example) used to predict the label
• $n$: the number of samples that we collect.
• Each sample (indexed as $i$) is described by $x^{(i)} = [x_1^{(i)}, x_2^{(i)}]$, and the label is $y^{(i)}$.
• Loss Function

• Square Loss for a data sample $$l^{(i)}(\mathbf{w}, b) = \frac{1}{2} \left(\hat{y}^{(i)} - y^{(i)}\right)^2,$$

• the smaller the error, the closer the predicted price is to the actual price
• To measure the quality of a model on the entire dataset, we can simply average the losses on the training set.$$L(\mathbf{w}, b) =\frac{1}{n}\sum_{i=1}^n l^{(i)}(\mathbf{w}, b) =\frac{1}{n} \sum_{i=1}^n \frac{1}{2}\left(\mathbf{w}^\top \mathbf{x}^{(i)} + b - y^{(i)}\right)^2.$$
• In model training, we want to find a set of model parameters, represented by $\mathbf{w}^*$, $b^*$, that can minimize the average loss of training samples: $$\mathbf{w}^*, b^* = \operatorname*{argmin}_{\mathbf{w}, b}\ L(\mathbf{w}, b).$$
• Optimization Algorithm

• The mini-batch stochastic gradient descent

• In each iteration, we randomly and uniformly sample a mini-batch $\mathcal{B}$ consisting of a fixed number of training data samples.
• We then compute the derivative (gradient) of the average loss on the mini-batch with respect to the model parameters.
• This result is used to update the parameters in the direction that reduces the loss (a toy numeric sketch follows at the end of this subsection): $$\begin{aligned} \mathbf{w} &\leftarrow \mathbf{w} - \frac{\eta}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} \partial_{\mathbf{w}} l^{(i)}(\mathbf{w}, b) = \mathbf{w} - \frac{\eta}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} \mathbf{x}^{(i)} \left(\mathbf{w}^\top \mathbf{x}^{(i)} + b - y^{(i)}\right) \\ b &\leftarrow b - \frac{\eta}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} \partial_b l^{(i)}(\mathbf{w}, b) = b - \frac{\eta}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} \left(\mathbf{w}^\top \mathbf{x}^{(i)} + b - y^{(i)}\right). \end{aligned}$$

• $|\mathcal{B}|$: the number of samples (batch size) in each mini-batch
• $\eta$: learning rate
• hyper-parameters

• $|\mathcal{B}|$, $\eta$
• They are set somewhat manually and are typically not learned through model training.
• Model prediction (or Model inference)
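• A toy numeric sketch of the mini-batch SGD update above (made-up values for X, y, w, b, and the learning rate eta; the full from-scratch implementation follows in Section 3.2):

from mxnet import nd

# Made-up mini-batch: 4 examples with 2 features each (illustrative values only).
X = nd.array([[1.0, 2.0], [3.0, 1.0], [0.5, 4.0], [2.0, 2.0]])
y = nd.array([5.0, 8.0, 3.0, 6.0])
w = nd.zeros(2)
b = 0.0
eta = 0.1                                # learning rate

y_hat = nd.dot(X, w) + b                 # predictions for the mini-batch
err = y_hat - y                          # prediction errors
grad_w = nd.dot(X.T, err) / X.shape[0]   # average gradient of the squared loss w.r.t. w
grad_b = err.mean()                      # average gradient w.r.t. b
w = w - eta * grad_w                     # one gradient-descent update
b = b - eta * grad_b.asscalar()
print(w, b)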

3.1.2 From Linear Regression to Deep Networks¶

• Neural Network Diagram

• a neural network diagram to represent the linear regression model

• $d$: feature dimension (the number of inputs)

• A Detour to Biology

• Dendrites: input terminals
• Nucleus: CPU
• Axon: output wire
• Axon terminals (output terminals) are connected to other neurons via synapses.
• Vectorization for Speed
• Vectorizing code is a good way of getting order-of-magnitude speedups.
In [1]:
from mxnet import nd
from time import time

a = nd.ones(shape=10000)
b = nd.ones(shape=10000)

• 1) add them one coordinate at a time using a for loop.
In [2]:
start = time()
c = nd.zeros(shape=10000)
for i in range(10000):
    c[i] = a[i] + b[i]
print(time() - start)

0.8771929740905762

• 2) add the vectors directly:
In [3]:
start = time()
d = a + b
print(time() - start)

0.00016832351684570312


3.1.3 The Normal Distribution and Squared Loss¶

• Maximum Likelihood Principle
• The notion of choosing parameters that maximize the likelihood of the observed data
• its estimators are usually called Maximum Likelihood Estimators (MLE).
• minimize the Negative Log-Likelihood
• Maximum likelihood estimation in a linear model with additive Gaussian noise is equivalent to linear regression with squared loss, as sketched below.
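• Sketch of why: assuming $y = \mathbf{w}^\top \mathbf{x} + b + \epsilon$ with $\epsilon \sim \mathcal{N}(0, \sigma^2)$, the likelihood of one observation is $$p(y\mid\mathbf{x}) = \frac{1}{\sqrt{2 \pi \sigma^2}} \exp\left(-\frac{\left(y - \mathbf{w}^\top \mathbf{x} - b\right)^2}{2 \sigma^2}\right)$$ and the negative log-likelihood of the data is $$-\log p(Y\mid X) = \sum_{i=1}^n \frac{1}{2\sigma^2}\left(y^{(i)} - \mathbf{w}^\top \mathbf{x}^{(i)} - b\right)^2 + \frac{n}{2}\log\left(2\pi\sigma^2\right)$$
• The second term does not depend on $\mathbf{w}$ or $b$, so minimizing the negative log-likelihood is the same as minimizing the squared loss.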

3.2 Linear regression implementation from scratch¶

In [4]:
%matplotlib inline
from IPython import display
from matplotlib import pyplot as plt
from mxnet import autograd, nd
import random


3.2.1 Generating Data Sets¶

• The randomly generated feature matrix $\mathbf{X}\in \mathbb{R}^{1000 \times 2}$ contains 1000 examples, each with 2 features,
• The actual weight $\mathbf{w} = [2, -3.4]^\top$ and bias $b = 4.2$ of the linear regression model
• A random noise term $\epsilon$
• It obeys a normal distribution with a mean of 0 and a standard deviation of 0.01 ($\epsilon \sim \mathcal{N}(0, 0.01^2)$). $$\mathbf{y}= \mathbf{X} \mathbf{w} + b + \boldsymbol{\epsilon}$$
In [5]:
num_inputs = 2
num_examples = 1000
true_w = nd.array([2, -3.4])
true_b = 4.2

features = nd.random.normal(scale=1, shape=(num_examples, num_inputs)) # scale --> standard deviation

labels = nd.dot(features, true_w) + true_b

labels += nd.random.normal(scale=0.01, shape=labels.shape)

In [6]:
print(features[0])
print(labels[0])

[1.1630787 0.4838046]
<NDArray 2 @cpu(0)>

[4.879625]
<NDArray 1 @cpu(0)>

• By generating a scatter plot of the second feature features[:, 1] against the labels, we can clearly observe the linear correlation between the two.
• For future plotting, we only need to call gluonbook.set_figsize() to display figures as vector graphics (SVG) and set their size.
In [7]:
def use_svg_display():
    # Displayed in vector graphics.
    display.set_matplotlib_formats('svg')

def set_figsize(figsize=(5, 3)):
    use_svg_display()
    # Set the size of the graph to be plotted.
    plt.rcParams['figure.figsize'] = figsize

set_figsize()
plt.scatter(features[:, 1].asnumpy(), labels.asnumpy(), 1);


3.2.2 Reading Data¶

• Here we define a function data_iter that returns the features and labels of batch_size random examples.
• This function has been saved in the gluonbook package for future use.
In [30]:
def data_iter(batch_size, features, labels):
    num_examples = len(features)
    indices = list(range(num_examples))
    random.shuffle(indices)  # The examples are read at random, in no particular order.
    for i in range(0, num_examples, batch_size):
        j = nd.array(indices[i: min(i + batch_size, num_examples)])
        # The "take" function returns the elements corresponding to the indices.
        yield features.take(j), labels.take(j)

In [31]:
batch_size = 10
iterator = data_iter(batch_size, features, labels)
X_batch, y_batch = next(iterator)
print(X_batch, y_batch)

[[ 0.5226451  -0.36821297]
[ 1.4648029  -0.9737964 ]
[ 0.2238931  -0.22365536]
[-0.39304093  0.45682105]
[-0.15107684  0.95578885]
[-0.1967979  -0.13781767]
[-0.5158308  -1.1073749 ]
[ 0.6277698   0.08966792]
[-1.2582556   1.9790834 ]
[-0.0842585  -0.73394597]]
<NDArray 10x2 @cpu(0)>
[ 6.5130043 10.44841    5.4141636  1.867064   0.6560518  4.27661
6.926715   5.157996  -5.051523   6.529941 ]
<NDArray 10 @cpu(0)>

In [32]:
for X_batch, y_batch in iterator:
    print(X_batch, y_batch)
    break

[[ 2.2771406   0.37218785]
[-1.4932036   0.13177635]
[-0.83469146  0.6173635 ]
[ 0.9580142   0.39661676]
[-0.6929967   1.6447209 ]
[ 0.11793987 -2.2566102 ]
[ 0.28372657 -0.530157  ]
[-1.1158241   0.2747542 ]
[-2.1107721   0.04706239]
[-0.34981632 -0.13172452]]
<NDArray 10x2 @cpu(0)>
[ 7.4963803   0.7656811   0.44722196  4.7559056  -2.7936707  12.117811
6.5841184   1.0266986  -0.16563828  3.9547756 ]
<NDArray 10 @cpu(0)>


3.2.3 Initialize Model Parameters¶

• Weights are initialized as random draws from a normal distribution with mean 0 and standard deviation 0.01.
• The bias b is set to zero.
In [33]:
w = nd.random.normal(scale=0.01, shape=(num_inputs, 1))
b = nd.zeros(shape=(1,))

• We’ll update each parameter, w and b, in the direction that reduces the loss.
• In order for autograd to know that it needs to set up the appropriate data structures, track changes, etc., we need to attach gradients explicitly.
In [34]:
w.attach_grad()
b.attach_grad()


3.2.4 Define the Model¶

In [35]:
def linreg(X, w, b):
    return nd.dot(X, w) + b    # return value's shape: (batch_size, 1)


3.2.5 Define the Loss Function¶

In [36]:
def squared_loss(y_hat, y):
    return (y_hat - y.reshape(y_hat.shape)) ** 2 / 2


3.2.6 Define the Optimization Algorithm¶

• We’ll solve this problem using mini-batch stochastic gradient descent sgd rather than the analytical closed-form solution.
• At each step, we’ll estimate the gradient of the loss with respect to our parameters, using one batch randomly drawn from our dataset.
• Then, we’ll update our parameters a small amount in the direction that reduces the loss.
• Here, the gradient calculated by the automatic differentiation module is the sum over the batch of examples, so we divide by the batch size to obtain the average gradient.
In [37]:
def sgd(params, lr, batch_size):
    for param in params:
        param[:] = param - lr * param.grad / batch_size


3.2.7 Training¶

• Training Procedure
• Initialize parameters $(\mathbf{w}, b)$
• Repeat until done
• Compute loss $l(\mathbf{x}^{(i)}, y^{(i)}, \mathbf{w}, b)$
• Compute gradient $\mathbf{g} \leftarrow \partial_{(\mathbf{w},b)} \frac{1}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} l(\mathbf{x}^{(i)}, y^{(i)}, \mathbf{w}, b)$
• Update parameters $(\mathbf{w}, b) \leftarrow (\mathbf{w}, b) - \eta \mathbf{g}$
• Loss shape
• Since we previously set batch_size to 10, the loss shape for each small batch is (10, 1).
• Running l.backward() will sum the elements in l to obtain a scalar, and then calculate the gradient of that scalar with respect to the model parameters.
• num_epochs and lr are both hyper-parameters and are set to 3 and 0.03, respectively.
In [38]:
lr = 0.03               # learning rate
num_epochs = 3          # number of epochs (passes over the dataset)
net = linreg            # our fancy linear model
loss = squared_loss     # 0.5 (y-y')^2

for epoch in range(num_epochs):
    for X, y in data_iter(batch_size, features, labels):
        with autograd.record():
            l = loss(net(X, w, b), y)  # minibatch loss in X and y
        l.backward()                   # compute gradient of l with respect to [w, b]
        sgd([w, b], lr, batch_size)    # update parameters [w, b] using their gradient
    train_l = loss(net(features, w, b), labels)
    print('epoch {0}, loss {1:.8f}'.format(epoch + 1, float(train_l.mean().asnumpy())))

epoch 1, loss 0.03515049
epoch 2, loss 0.00011977
epoch 3, loss 0.00004835

In [39]:
print('Error in estimating w', true_w - w.reshape(true_w.shape))
print('Error in estimating b', true_b - b)

Error in estimating w
[2.7418137e-04 3.6954880e-05]
<NDArray 2 @cpu(0)>
Error in estimating b
[-1.7642975e-05]
<NDArray 1 @cpu(0)>


3.3 Gluon Implementation of Linear Regression¶

3.3.1 Generating Data Sets¶

In [68]:
from mxnet import autograd, nd

num_inputs = 2
num_examples = 1000
true_w = nd.array([2, -3.4])
true_b = 4.2

features = nd.random.normal(scale=1, shape=(num_examples, num_inputs))

labels = nd.dot(features, true_w) + true_b

labels += nd.random.normal(scale=0.01, shape=labels.shape)


3.3.2 Reading Data¶

In [69]:
from mxnet.gluon import data as gdata

batch_size = 10

# Combining the features and labels of the training data.
dataset = gdata.ArrayDataset(features, labels)
# Randomly reading mini-batches.
data_iter = gdata.DataLoader(dataset, batch_size, shuffle=True)


In [70]:
for X, y in data_iter:
    print(X, y)
    break

[[-1.6311898   0.17929219]
[-0.60608387  1.2890214 ]
[ 1.2066218  -0.23281197]
[-1.6739345  -0.19565606]
[-0.6975202   0.1806888 ]
[ 1.3205358  -1.0833067 ]
[-0.23225537  1.2591201 ]
[ 1.4396188   1.9778014 ]
[-2.8449814  -0.765359  ]
[-0.7729255   0.5642861 ]]
<NDArray 10x2 @cpu(0)>
[ 0.34095168 -1.3967377   7.4045095   1.5087711   2.2066317  10.523467
-0.5347519   0.36069041  1.1080252   0.7397361 ]
<NDArray 10 @cpu(0)>


3.3.3 Define the Model¶

• Gluon provides a large number of predefined layers, which allow us to focus on which layers are used to construct the model rather than on their implementation.
• We will first define a model variable net
• In Gluon, a Sequential instance can be regarded as a container that concatenates the various layers in sequence.
• We need a single layer.
• The layer is fully connected since it connects all inputs with all outputs by means of a matrix-vector multiplication.
• In Gluon, the fully connected layer is referred to as a Dense instance.
• Since we only want to generate a single scalar output, we set that number to 1.
• We do not need to specify the input shape for each layer, such as the number of linear regression inputs.
• When net(X) is executed later, the model will automatically infer the number of inputs to each layer.
In [71]:
from mxnet.gluon import nn
net = nn.Sequential()
net.add(nn.Dense(1))   # a single fully connected (Dense) layer with 1 output


3.3.4 Initialize Model Parameters¶

In [72]:
from mxnet import init
net.initialize(init.Normal(sigma=0.01))

• We are initializing parameters for a network without yet knowing how many dimensions the input will have.
• The real initialization is deferred until the first time that data is sent through the network.
• Since the parameters have not been initialized yet, we cannot manipulate them yet; the small check below illustrates this.
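• A small sketch (assuming the net defined above): accessing the weights before the first forward pass should raise a deferred-initialization error.

try:
    print(net[0].weight.data())
except Exception as e:
    # Parameters are not allocated until data first flows through the network.
    print('Not initialized yet:', type(e).__name__)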

3.3.5 Define the Loss Function¶

In [73]:
from mxnet.gluon import loss as gloss
loss = gloss.L2Loss()


3.3.6 Define the Optimization Algorithm¶

In [74]:
from mxnet import gluon
trainer = gluon.Trainer(params=net.collect_params(),
                        optimizer='sgd',
                        optimizer_params={'learning_rate': 0.03})


3.3.7 Training¶

• We don’t have to
• individually allocate parameters,
• define our loss function by hand, or
• implement mini-batch stochastic gradient descent ourselves.
• We need to tell trainer.step the batch size (batch_size) so the gradient is averaged correctly over the mini-batch.
In [75]:
num_epochs = 3
for epoch in range(num_epochs):
    for X, y in data_iter:
        with autograd.record():
            l = loss(net(X), y)
        l.backward()
        trainer.step(batch_size)
    train_l = loss(net(features), labels)
    print('epoch {0}, loss {1:.8f}'.format(epoch + 1, float(train_l.mean().asnumpy())))

epoch 1, loss 0.04303914
epoch 2, loss 0.00016977
epoch 3, loss 0.00004908

• We get the layer we need from the net and access its weight (weight) and bias (bias).
In [76]:
w = net[0].weight.data()
print('Error in estimating w', true_w.reshape(w.shape) - w)

b = net[0].bias.data()
print('Error in estimating b', true_b - b)

Error in estimating w
[[ 0.00129068 -0.00103784]]
<NDArray 1x2 @cpu(0)>
Error in estimating b
[0.00063229]
<NDArray 1 @cpu(0)>


3.4 Softmax Regression¶

3.4.1 Classification Problems¶

• The input image has a height and width of 2 pixels and the color is grayscale.
• We record the four pixels in the image as $x_1, x_2, x_3, x_4$.
• The actual labels of the images in the training data set are "cat", "chicken" or "dog"
• $y$ is viewed as a three-dimensional vector (one hot encoding) via $$y \in \{(1, 0, 0), (0, 1, 0), (0, 0, 1)\}$$
• $(1,0,0)$: "cat"
• $(0,1,0)$: "chicken"
• $(0,0,1)$: "dog"
• Network Architecture
• There are 4 features and 3 output animal categories $$\begin{aligned} o_1 &= x_1 w_{11} + x_2 w_{21} + x_3 w_{31} + x_4 w_{41} + b_1,\\ o_2 &= x_1 w_{12} + x_2 w_{22} + x_3 w_{32} + x_4 w_{42} + b_2,\\ o_3 &= x_1 w_{13} + x_2 w_{23} + x_3 w_{33} + x_4 w_{43} + b_3. \end{aligned}$$
• The neural network diagram below depicts the calculation above.
• Softmax Operation

• In vector form we arrive at $\mathbf{o} = \mathbf{W} \mathbf{x} + \mathbf{b}$.
• We can use a simple approach to treat the output value $o_i$ as the confidence level of the prediction category $i$.
• The output of softmax regression is subjected to a nonlinearity which ensures that the sum over all outcomes always adds up to 1 and that none of the terms is ever negative. $$\hat{\mathbf{y}} = \mathrm{softmax}(\mathbf{o}) \text{ where } \hat{y}_i = \frac{\exp(o_i)}{\sum_j \exp(o_j)}$$
• We can still find the most likely class by $$\hat{\imath}(\mathbf{o}) = \operatorname*{argmax}_i o_i = \operatorname*{argmax}_i \hat y_i$$
• Summarizing it all in vector notation $${\hat{\mathbf{y}}}^{(i)} = \mathrm{softmax}({\mathbf{o}}^{(i)}) = \mathrm{softmax}(\mathbf{W} {\mathbf{x}}^{(i)} + {\mathbf{b}})$$
• Vectorization for Minibatches

• Assume that we are given a mini-batch $\mathbf{X}$ of examples with dimensionality $d$ and batch size $n$.
• Assume that we have $q$ categories (outputs)
• Then the minibatch features $\mathbf{X}$ are in $\mathbb{R}^{n \times d}$, weights $\mathbf{W} \in \mathbb{R}^{d \times q}$ and the bias satisfies $\mathbf{b} \in \mathbb{R}^q$.
$$\begin{aligned} \mathbf{O} &= \mathbf{X} \mathbf{W} + \mathbf{b} \\ \hat{\mathbf{Y}} & = \mathrm{softmax}(\mathbf{O}) \end{aligned}$$
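• A minimal nd sketch of these two equations, using made-up shapes ($n=4$ examples, $d=5$ features, $q=3$ categories):

from mxnet import nd

n, d, q = 4, 5, 3
X = nd.random.normal(shape=(n, d))                 # mini-batch of features
W = nd.random.normal(scale=0.01, shape=(d, q))
b = nd.zeros(q)

O = nd.dot(X, W) + b                               # linear part, shape (n, q)
O_exp = O.exp()
Y_hat = O_exp / O_exp.sum(axis=1, keepdims=True)   # softmax: each row sums to 1 (broadcasting)
print(Y_hat.sum(axis=1))                           # every entry should be (approximately) 1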

3.4.2 Loss Function¶

• Log-Likelihood

• The softmax function allows us to compare the estimates with reality, simply by checking how well it predicted what we observe. $$p(Y|X) = \prod_{i=1}^n p(y^{(i)}|x^{(i)}) = \prod_{i=1}^n \prod_{j}( {\hat{y}_j}^{(i)})^{y_{j}^{(i)}}$$ and thus the Loss Function $L$ (negative log-likelihood) $$L = -\log p(Y|X) = \sum_{i=1}^n -\log p(y^{(i)}|x^{(i)}) = - \sum_{i=1}^n \sum_{j} y_{j}^{(i)} \log {\hat{y}_j}^{(i)}$$ [NOTE] $\log {\hat{y}_j}^{(i)} \leq 0$, so that $L \geq 0$
• $L$ is minimized if we correctly predict $y$ with certainty, i.e. ${\hat{y}_j}^{(i)} = 1$ for the correct label.
• Softmax and Derivatives

• we dropped the superscript $(i)$ to avoid notation clutter

$$l = -\sum_j y_j \log \hat{y}_j = -\sum_j y_j \log \frac{\exp(o_j)}{\sum_k \exp(o_k)} = \sum_j y_j \log \sum_k \exp(o_k) - \sum_j y_j o_j = \log \sum_k \exp(o_k) - \sum_j y_j o_j$$
• consider the loss function's derivative with respect to $o_j$

$$\partial_{o_j} l = \frac{\exp(o_j)}{\sum_k \exp(o_k)} - y_j = \mathrm{softmax}(\mathbf{o})_j - y_j = {\hat{y}_j} - y_j$$
• Gradient is the difference between the observation $y$ and estimate $\hat{y}$ --> it is very similar to what we saw in regression
• This fact makes computing gradients a lot easier in practice.
• Cross-Entropy Loss $$l(\mathbf{y}, \hat{\mathbf{y}}) = - \sum_j y_j \log \hat{y}_j$$
• It is one of the most commonly used loss functions for multiclass classification (a minimal sketch follows).
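• A minimal sketch with made-up values, using the usual shortcut: since $\mathbf{y}$ is one-hot, $-\sum_j y_j \log \hat{y}_j$ reduces to the negative log-probability assigned to the true class.

from mxnet import nd

y_hat = nd.array([[0.1, 0.3, 0.6],   # predicted probabilities for 2 examples
                  [0.3, 0.2, 0.5]])
y = nd.array([0, 2], dtype='int32')  # true class indices

l = -nd.pick(y_hat, y).log()         # per-example cross-entropy loss
print(l)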

3.4.3 Information Theory Basics¶

• Entropy
• Kullback Leibler Divergence
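• For reference, the standard definitions for discrete distributions $p$ and $q$: $$H(p) = -\sum_j p_j \log p_j, \qquad D(p\|q) = \sum_j p_j \log \frac{p_j}{q_j}$$
• Cross-entropy satisfies $H(p, q) = -\sum_j p_j \log q_j = H(p) + D(p\|q)$, so minimizing the cross-entropy loss (with $p$ the true label distribution and $q = \hat{\mathbf{y}}$) also minimizes the KL divergence between the two.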

3.4.4 Model Prediction and Evaluation¶

• Normally, we use the category with the highest predicted probability as the output category.
• We use accuracy to evaluate the model’s performance.
• Accuracy is the ratio between the number of correct predictions and the total number of predictions (a minimal sketch follows).
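• A minimal nd sketch of the accuracy computation (made-up values):

from mxnet import nd

y_hat = nd.array([[0.1, 0.3, 0.6],   # predicted probabilities for 2 examples
                  [0.3, 0.5, 0.2]])
y = nd.array([2, 2])                 # true class indices

# Predicted category = index of the largest probability; compare with the labels.
acc = (y_hat.argmax(axis=1) == y).mean().asscalar()
print(acc)                           # 0.5: one of the two predictions is correct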