3.1 Linear Regression¶

• Regression
• the task of predicting a real-valued target $y$ given a data point $x$.

3.1.1 Basic Elements of Linear Regression¶

• Prediction can be expressed as a linear combination of the input features.
• Linear Model

• Example: estimating the price of a house $$\mathrm{price} = w_{\mathrm{area}} \cdot \mathrm{area} + w_{\mathrm{age}} \cdot \mathrm{age} + b$$
• General Form

• In the case of $d$ variables $$\hat{y} = w_1 \cdot x_1 + ... + w_d \cdot x_d + b$$
$$\hat{y} = \mathbf{w}^\top \mathbf{x} + b$$
• We'll try to find the weight vector $\mathbf{w}$ and bias term $b$ that approximately associate data points $\mathbf{x}_i$ with their corresponding labels $y_i$.
• For a collection of data points $\mathbf{X}$ the predictions $\hat{\mathbf{y}}$ can be expressed via the matrix-vector product $${\hat{\mathbf{y}}} = \mathbf{X} \mathbf{w} + b$$

• Model parameters: $\mathbf{w}$, $b$
• Training Data

• ‘features’ or 'covariates'
• The input variables (e.g., area and age in the house example) used to predict the label
• $n$: the number of samples that we collect.
• Each sample (indexed as $i$) is described by $x^{(i)} = [x_1^{(i)}, x_2^{(i)}]$, and the label is $y^{(i)}$.
• Loss Function

• Square Loss for a data sample $$l^{(i)}(\mathbf{w}, b) = \frac{1}{2} \left(\hat{y}^{(i)} - y^{(i)}\right)^2,$$

• the smaller the error, the closer the predicted price is to the actual price
• To measure the quality of a model on the entire dataset, we can simply average the losses on the training set.$$L(\mathbf{w}, b) =\frac{1}{n}\sum_{i=1}^n l^{(i)}(\mathbf{w}, b) =\frac{1}{n} \sum_{i=1}^n \frac{1}{2}\left(\mathbf{w}^\top \mathbf{x}^{(i)} + b - y^{(i)}\right)^2.$$
• In model training, we want to find a set of model parameters, represented by $\mathbf{w}^*$, $b^*$, that can minimize the average loss of training samples: $$\mathbf{w}^*, b^* = \operatorname*{argmin}_{\mathbf{w}, b}\ L(\mathbf{w}, b).$$
• Optimization Algorithm

• The mini-batch stochastic gradient descent

• In each iteration, we randomly and uniformly sample a mini-batch $\mathcal{B}$ consisting of a fixed number of training data samples.
• We then compute the derivative (gradient) of the average loss on the mini-batch with respect to the model parameters.
• This result is used to update the parameters in the direction that reduces the loss (a toy numeric sketch follows at the end of this subsection): $$\begin{aligned} \mathbf{w} &\leftarrow \mathbf{w} - \frac{\eta}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} \partial_{\mathbf{w}} l^{(i)}(\mathbf{w}, b) = \mathbf{w} - \frac{\eta}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} \mathbf{x}^{(i)} \left(\mathbf{w}^\top \mathbf{x}^{(i)} + b - y^{(i)}\right) \\ b &\leftarrow b - \frac{\eta}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} \partial_b l^{(i)}(\mathbf{w}, b) = b - \frac{\eta}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} \left(\mathbf{w}^\top \mathbf{x}^{(i)} + b - y^{(i)}\right). \end{aligned}$$

• $|\mathcal{B}|$: the number of samples (batch size) in each mini-batch
• $\eta$: learning rate
• hyper-parameters

• $|\mathcal{B}|$, $\eta$
• They are set somewhat manually and are typically not learned through model training.
• Model prediction (or Model inference)
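• A toy numeric sketch of the mini-batch SGD update above (made-up values for X, y, w, b, and the learning rate eta; the full from-scratch implementation follows in Section 3.2):

from mxnet import nd

# Made-up mini-batch: 4 examples with 2 features each (illustrative values only).
X = nd.array([[1.0, 2.0], [3.0, 1.0], [0.5, 4.0], [2.0, 2.0]])
y = nd.array([5.0, 8.0, 3.0, 6.0])
w = nd.zeros(2)
b = 0.0
eta = 0.1                                # learning rate

y_hat = nd.dot(X, w) + b                 # predictions for the mini-batch
err = y_hat - y                          # prediction errors
grad_w = nd.dot(X.T, err) / X.shape[0]   # average gradient of the squared loss w.r.t. w
grad_b = err.mean()                      # average gradient w.r.t. b
w = w - eta * grad_w                     # one gradient-descent update
b = b - eta * grad_b.asscalar()
print(w, b)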

3.1.2 From Linear Regression to Deep Networks¶

• Neural Network Diagram

• a neural network diagram to represent the linear regression model

• $d$: feature dimension (the number of inputs)

• A Detour to Biology

• Dendrites: input terminals
• Nucleus: CPU
• Axon: output wire
• Axon terminals (output terminals) are connected to other neurons via synapses.
• Vectorization for Speed
• Vectorizing code is a good way of getting order-of-magnitude speedups.
In [1]:
from mxnet import nd
from time import time

a = nd.ones(shape=10000)
b = nd.ones(shape=10000)

• 1) add them one coordinate at a time using a for loop.
In [2]:
start = time()
c = nd.zeros(shape=10000)
for i in range(10000):
    c[i] = a[i] + b[i]
print(time() - start)

0.8771929740905762

• 2) add the vectors directly:
In [3]:
start = time()
d = a + b
print(time() - start)

0.00016832351684570312


3.1.3 The Normal Distribution and Squared Loss¶

• Maximum Likelihood Principle
• The notion of choosing parameters that maximize the likelihood of the observed data
• its estimators are usually called Maximum Likelihood Estimators (MLE).
• minimize the Negative Log-Likelihood
• Maximum likelihood estimation in a linear model with additive Gaussian noise is equivalent to linear regression with squared loss, as sketched below.
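• Sketch of why: assuming $y = \mathbf{w}^\top \mathbf{x} + b + \epsilon$ with $\epsilon \sim \mathcal{N}(0, \sigma^2)$, the likelihood of one observation is $$p(y\mid\mathbf{x}) = \frac{1}{\sqrt{2 \pi \sigma^2}} \exp\left(-\frac{\left(y - \mathbf{w}^\top \mathbf{x} - b\right)^2}{2 \sigma^2}\right)$$ and the negative log-likelihood of the data is $$-\log p(Y\mid X) = \sum_{i=1}^n \frac{1}{2\sigma^2}\left(y^{(i)} - \mathbf{w}^\top \mathbf{x}^{(i)} - b\right)^2 + \frac{n}{2}\log\left(2\pi\sigma^2\right)$$
• The second term does not depend on $\mathbf{w}$ or $b$, so minimizing the negative log-likelihood is the same as minimizing the squared loss.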

3.2 Linear regression implementation from scratch¶

In [4]:
%matplotlib inline
from IPython import display
from matplotlib import pyplot as plt
from mxnet import autograd, nd
import random


3.2.1 Generating Data Sets¶

• The randomly generated feature matrix $\mathbf{X}\in \mathbb{R}^{1000 \times 2}$ contains 1000 examples, each with 2 features,
• The actual weight $\mathbf{w} = [2, -3.4]^\top$ and bias $b = 4.2$ of the linear regression model
• A random noise term $\epsilon$
• It obeys a normal distribution with a mean of 0 and a standard deviation of 0.01 ($\epsilon \sim \mathcal{N}(0, 0.01^2)$). $$\mathbf{y}= \mathbf{X} \mathbf{w} + b + \boldsymbol{\epsilon}$$
In [5]:
num_inputs = 2
num_examples = 1000
true_w = nd.array([2, -3.4])
true_b = 4.2

features = nd.random.normal(scale=1, shape=(num_examples, num_inputs)) # scale --> standard deviation

labels = nd.dot(features, true_w) + true_b

labels += nd.random.normal(scale=0.01, shape=labels.shape)

In [6]:
print(features[0])
print(labels[0])

[1.1630787 0.4838046]
<NDArray 2 @cpu(0)>

[4.879625]
<NDArray 1 @cpu(0)>

• By generating a scatter plot of the second feature features[:, 1] against the labels, we can clearly observe the linear correlation between the two.
• For future plotting, we only need to call gluonbook.set_figsize() to display figures as vector graphics (SVG) and set their size.
In [7]:
def use_svg_display():
    # Displayed in vector graphics.
    display.set_matplotlib_formats('svg')

def set_figsize(figsize=(5, 3)):
    use_svg_display()
    # Set the size of the graph to be plotted.
    plt.rcParams['figure.figsize'] = figsize

set_figsize()
plt.scatter(features[:, 1].asnumpy(), labels.asnumpy(), 1);


3.2.2 Reading Data¶

• Here we define a function data_iter that returns the features and labels of batch_size random examples.
• This function has been saved in the gluonbook package for future use.
In [30]:
def data_iter(batch_size, features, labels):
    num_examples = len(features)
    indices = list(range(num_examples))
    random.shuffle(indices)  # The examples are read at random, in no particular order.
    for i in range(0, num_examples, batch_size):
        j = nd.array(indices[i: min(i + batch_size, num_examples)])
        # The "take" function returns the elements corresponding to the indices.
        yield features.take(j), labels.take(j)

In [31]:
batch_size = 10
iterator = data_iter(batch_size, features, labels)
X_batch, y_batch = next(iterator)
print(X_batch, y_batch)

[[ 0.5226451  -0.36821297]
[ 1.4648029  -0.9737964 ]
[ 0.2238931  -0.22365536]
[-0.39304093  0.45682105]
[-0.15107684  0.95578885]
[-0.1967979  -0.13781767]
[-0.5158308  -1.1073749 ]
[ 0.6277698   0.08966792]
[-1.2582556   1.9790834 ]
[-0.0842585  -0.73394597]]
<NDArray 10x2 @cpu(0)>
[ 6.5130043 10.44841    5.4141636  1.867064   0.6560518  4.27661
6.926715   5.157996  -5.051523   6.529941 ]
<NDArray 10 @cpu(0)>

In [32]:
for X_batch, y_batch in iterator:
    print(X_batch, y_batch)
    break

[[ 2.2771406   0.37218785]
[-1.4932036   0.13177635]
[-0.83469146  0.6173635 ]
[ 0.9580142   0.39661676]
[-0.6929967   1.6447209 ]
[ 0.11793987 -2.2566102 ]
[ 0.28372657 -0.530157  ]
[-1.1158241   0.2747542 ]
[-2.1107721   0.04706239]
[-0.34981632 -0.13172452]]
<NDArray 10x2 @cpu(0)>
[ 7.4963803   0.7656811   0.44722196  4.7559056  -2.7936707  12.117811
6.5841184   1.0266986  -0.16563828  3.9547756 ]
<NDArray 10 @cpu(0)>


3.2.3 Initialize Model Parameters¶

• Weights are initialized as random draws from a normal distribution with mean 0 and standard deviation 0.01.
• The bias b is set to zero.
In [33]:
w = nd.random.normal(scale=0.01, shape=(num_inputs, 1))
b = nd.zeros(shape=(1,))

• We’ll update each parameter, w and b, in the direction that reduces the loss.
• In order for autograd to know that it needs to set up the appropriate data structures, track changes, etc., we need to attach gradients explicitly.
In [34]:
w.attach_grad()
b.attach_grad()


3.2.4 Define the Model¶

In [35]:
def linreg(X, w, b):
    return nd.dot(X, w) + b    # return value's shape: (batch_size, 1)


3.2.5 Define the Loss Function¶

In [36]:
def squared_loss(y_hat, y):
    return (y_hat - y.reshape(y_hat.shape)) ** 2 / 2


3.2.6 Define the Optimization Algorithm¶

• We’ll solve this problem using mini-batch stochastic gradient descent sgd rather than the analytical closed-form solution.
• At each step, we’ll estimate the gradient of the loss with respect to our parameters, using one batch randomly drawn from our dataset.
• Then, we’ll update our parameters a small amount in the direction that reduces the loss.
• Here, the gradient calculated by the automatic differentiation module is the sum over the batch of examples, so we divide by the batch size to obtain the average gradient.
In [37]:
def sgd(params, lr, batch_size):
    for param in params:
        param[:] = param - lr * param.grad / batch_size


3.2.7 Training¶

• Training Procedure
• Initialize parameters $(\mathbf{w}, b)$
• Repeat until done
• Compute loss $l(\mathbf{x}^{(i)}, y^{(i)}, \mathbf{w}, b)$
• Compute gradient $\mathbf{g} \leftarrow \partial_{(\mathbf{w},b)} \frac{1}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} l(\mathbf{x}^{(i)}, y^{(i)}, \mathbf{w}, b)$
• Update parameters $(\mathbf{w}, b) \leftarrow (\mathbf{w}, b) - \eta \mathbf{g}$
• Loss shape
• Since we previously set batch_size to 10, the loss shape for each small batch is (10, 1).
• Running l.backward() will sum the elements in l to obtain a scalar, and then calculate the gradient of that scalar with respect to the model parameters.
• num_epochs and lr are both hyper-parameters and are set to 3 and 0.03, respectively.
In [38]:
lr = 0.03               # learning rate
num_epochs = 3          # number of epochs (passes over the dataset)
net = linreg            # our fancy linear model
loss = squared_loss     # 0.5 (y-y')^2

for epoch in range(num_epochs):
    for X, y in data_iter(batch_size, features, labels):
        with autograd.record():
            l = loss(net(X, w, b), y)  # minibatch loss in X and y
        l.backward()                   # compute gradient of l with respect to [w, b]
        sgd([w, b], lr, batch_size)    # update parameters [w, b] using their gradient
    train_l = loss(net(features, w, b), labels)
    print('epoch {0}, loss {1:.8f}'.format(epoch + 1, float(train_l.mean().asnumpy())))

epoch 1, loss 0.03515049
epoch 2, loss 0.00011977
epoch 3, loss 0.00004835

In [39]:
print('Error in estimating w', true_w - w.reshape(true_w.shape))
print('Error in estimating b', true_b - b)

Error in estimating w
[2.7418137e-04 3.6954880e-05]
<NDArray 2 @cpu(0)>
Error in estimating b
[-1.7642975e-05]
<NDArray 1 @cpu(0)>


3.3 Gluon Implementation of Linear Regression¶

3.3.1 Generating Data Sets¶

In [68]:
from mxnet import autograd, nd

num_inputs = 2
num_examples = 1000
true_w = nd.array([2, -3.4])
true_b = 4.2

features = nd.random.normal(scale=1, shape=(num_examples, num_inputs))

labels = nd.dot(features, true_w) + true_b

labels += nd.random.normal(scale=0.01, shape=labels.shape)


3.3.2 Reading Data¶

In [69]:
from mxnet.gluon import data as gdata

batch_size = 10

# Combining the features and labels of the training data.
dataset = gdata.ArrayDataset(features, labels)
# Randomly reading mini-batches.
data_iter = gdata.DataLoader(dataset, batch_size, shuffle=True)


In [70]:
for X, y in data_iter:
    print(X, y)
    break

[[-1.6311898   0.17929219]
[-0.60608387  1.2890214 ]
[ 1.2066218  -0.23281197]
[-1.6739345  -0.19565606]
[-0.6975202   0.1806888 ]
[ 1.3205358  -1.0833067 ]
[-0.23225537  1.2591201 ]
[ 1.4396188   1.9778014 ]
[-2.8449814  -0.765359  ]
[-0.7729255   0.5642861 ]]
<NDArray 10x2 @cpu(0)>
[ 0.34095168 -1.3967377   7.4045095   1.5087711   2.2066317  10.523467
-0.5347519   0.36069041  1.1080252   0.7397361 ]
<NDArray 10 @cpu(0)>


3.3.3 Define the Model¶

• Gluon provides a large number of predefined layers, which allow us to focus on which layers are used to construct the model rather than on their implementation.
• We will first define a model variable net
• In Gluon, a Sequential instance can be regarded as a container that concatenates the various layers in sequence.
• We need a single layer.
• The layer is fully connected since it connects all inputs with all outputs by means of a matrix-vector multiplication.
• In Gluon, the fully connected layer is referred to as a Dense instance.
• Since we only want to generate a single scalar output, we set that number to 1.
• We do not need to specify the input shape for each layer, such as the number of linear regression inputs.
• When net(X) is executed later, the model will automatically infer the number of inputs to each layer.
In [71]:
from mxnet.gluon import nn
net = nn.Sequential()
net.add(nn.Dense(1))   # a single fully connected (Dense) layer with 1 output


3.3.4 Initialize Model Parameters¶

In [72]:
from mxnet import init
net.initialize(init.Normal(sigma=0.01))

• We are initializing parameters for a network without yet knowing how many dimensions the input will have.
• The real initialization is deferred until the first time that data is sent through the network.
• Since the parameters have not been initialized yet, we cannot manipulate them yet; the small check below illustrates this.
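• A small sketch (assuming the net defined above): accessing the weights before the first forward pass should raise a deferred-initialization error.

try:
    print(net[0].weight.data())
except Exception as e:
    # Parameters are not allocated until data first flows through the network.
    print('Not initialized yet:', type(e).__name__)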

3.3.5 Define the Loss Function¶

In [73]:
from mxnet.gluon import loss as gloss
loss = gloss.L2Loss()


3.3.6 Define the Optimization Algorithm¶

In [74]:
from mxnet import gluon
trainer = gluon.Trainer(params=net.collect_params(),
                        optimizer='sgd',
                        optimizer_params={'learning_rate': 0.03})


3.3.7 Training¶

• We don’t have to
• individually allocate parameters,
• define our loss function by hand, or
• implement mini-batch stochastic gradient descent ourselves.
• We need to tell trainer.step the batch size (batch_size) so the gradient is averaged correctly over the mini-batch.
In [75]:
num_epochs = 3
for epoch in range(num_epochs):
    for X, y in data_iter:
        with autograd.record():
            l = loss(net(X), y)
        l.backward()
        trainer.step(batch_size)
    train_l = loss(net(features), labels)
    print('epoch {0}, loss {1:.8f}'.format(epoch + 1, float(train_l.mean().asnumpy())))

epoch 1, loss 0.04303914
epoch 2, loss 0.00016977
epoch 3, loss 0.00004908

• We get the layer we need from the net and access its weight (weight) and bias (bias).
In [76]:
w = net[0].weight.data()
print('Error in estimating w', true_w.reshape(w.shape) - w)

b = net[0].bias.data()
print('Error in estimating b', true_b - b)

Error in estimating w
[[ 0.00129068 -0.00103784]]
<NDArray 1x2 @cpu(0)>
Error in estimating b
[0.00063229]
<NDArray 1 @cpu(0)>


3.4 Softmax Regression¶

3.4.1 Classification Problems¶

• The input image has a height and width of 2 pixels and the color is grayscale.
• We record the four pixels in the image as $x_1, x_2, x_3, x_4$.
• The actual labels of the images in the training data set are "cat", "chicken" or "dog"
• $y$ is viewed as a three-dimensional vector (one hot encoding) via $$y \in \{(1, 0, 0), (0, 1, 0), (0, 0, 1)\}$$
• $(1,0,0)$: "cat"
• $(0,1,0)$: "chicken"
• $(0,0,1)$: "dog"
• Network Architecture
• There are 4 features and 3 output animal categories $$\begin{aligned} o_1 &= x_1 w_{11} + x_2 w_{21} + x_3 w_{31} + x_4 w_{41} + b_1,\\ o_2 &= x_1 w_{12} + x_2 w_{22} + x_3 w_{32} + x_4 w_{42} + b_2,\\ o_3 &= x_1 w_{13} + x_2 w_{23} + x_3 w_{33} + x_4 w_{43} + b_3. \end{aligned}$$
• The neural network diagram below depicts the calculation above.
• Softmax Operation

• In vector form we arrive at $\mathbf{o} = \mathbf{W} \mathbf{x} + \mathbf{b}$.
• We can use a simple approach to treat the output value $o_i$ as the confidence level of the prediction category $i$.
• The output of softmax regression is subjected to a nonlinearity which ensures that the sum over all outcomes always adds up to 1 and that none of the terms is ever negative. $$\hat{\mathbf{y}} = \mathrm{softmax}(\mathbf{o}) \text{ where } \hat{y}_i = \frac{\exp(o_i)}{\sum_j \exp(o_j)}$$
• We can still find the most likely class by $$\hat{\imath}(\mathbf{o}) = \operatorname*{argmax}_i o_i = \operatorname*{argmax}_i \hat y_i$$
• Summarizing it all in vector notation $${\hat{\mathbf{y}}}^{(i)} = \mathrm{softmax}({\mathbf{o}}^{(i)}) = \mathrm{softmax}(\mathbf{W} {\mathbf{x}}^{(i)} + {\mathbf{b}})$$
• Vectorization for Minibatches

• Assume that we are given a mini-batch $\mathbf{X}$ of examples with dimensionality $d$ and batch size $n$.
• Assume that we have $q$ categories (outputs)
• Then the minibatch features $\mathbf{X}$ are in $\mathbb{R}^{n \times d}$, weights $\mathbf{W} \in \mathbb{R}^{d \times q}$ and the bias satisfies $\mathbf{b} \in \mathbb{R}^q$.
$$\begin{aligned} \mathbf{O} &= \mathbf{X} \mathbf{W} + \mathbf{b} \\ \hat{\mathbf{Y}} & = \mathrm{softmax}(\mathbf{O}) \end{aligned}$$
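• A minimal nd sketch of these two equations, using made-up shapes ($n=4$ examples, $d=5$ features, $q=3$ categories):

from mxnet import nd

n, d, q = 4, 5, 3
X = nd.random.normal(shape=(n, d))                 # mini-batch of features
W = nd.random.normal(scale=0.01, shape=(d, q))
b = nd.zeros(q)

O = nd.dot(X, W) + b                               # linear part, shape (n, q)
O_exp = O.exp()
Y_hat = O_exp / O_exp.sum(axis=1, keepdims=True)   # softmax: each row sums to 1 (broadcasting)
print(Y_hat.sum(axis=1))                           # every entry should be (approximately) 1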

3.4.2 Loss Function¶

• Log-Likelihood

• The softmax function allows us to compare the estimates with reality, simply by checking how well it predicted what we observe. $$p(Y|X) = \prod_{i=1}^n p(y^{(i)}|x^{(i)}) = \prod_{i=1}^n \prod_{j}( {\hat{y}_j}^{(i)})^{y_{j}^{(i)}}$$ and thus the Loss Function $L$ (negative log-likelihood) $$L = -\log p(Y|X) = \sum_{i=1}^n -\log p(y^{(i)}|x^{(i)}) = - \sum_{i=1}^n \sum_{j} y_{j}^{(i)} \log {\hat{y}_j}^{(i)}$$ [NOTE] $\log {\hat{y}_j}^{(i)} \leq 0$, so that $L \geq 0$
• $L$ is minimized if we correctly predict $y$ with certainty, i.e. ${\hat{y}_j}^{(i)} = 1$ for the correct label.
• Softmax and Derivatives

• we dropped the superscript $(i)$ to avoid notation clutter

$$l = -\sum_j y_j \log \hat{y}_j = -\sum_j y_j \log \frac{\exp(o_j)}{\sum_k \exp(o_k)} = \sum_j y_j \log \sum_k \exp(o_k) - \sum_j y_j o_j = \log \sum_k \exp(o_k) - \sum_j y_j o_j$$
• consider the loss function's derivative with respect to $o_j$

$$\partial_{o_j} l = \frac{\exp(o_j)}{\sum_k \exp(o_k)} - y_j = \mathrm{softmax}(\mathbf{o})_j - y_j = {\hat{y}_j} - y_j$$
• Gradient is the difference between the observation $y$ and estimate $\hat{y}$ --> it is very similar to what we saw in regression
• This fact makes computing gradients a lot easier in practice.
• Cross-Entropy Loss $$l(\mathbf{y}, \hat{\mathbf{y}}) = - \sum_j y_j \log \hat{y}_j$$
• It is one of the most commonly used loss functions for multiclass classification (a minimal sketch follows).
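• A minimal sketch with made-up values, using the usual shortcut: since $\mathbf{y}$ is one-hot, $-\sum_j y_j \log \hat{y}_j$ reduces to the negative log-probability assigned to the true class.

from mxnet import nd

y_hat = nd.array([[0.1, 0.3, 0.6],   # predicted probabilities for 2 examples
                  [0.3, 0.2, 0.5]])
y = nd.array([0, 2], dtype='int32')  # true class indices

l = -nd.pick(y_hat, y).log()         # per-example cross-entropy loss
print(l)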

3.4.3 Information Theory Basics¶

• Entropy
• Kullback Leibler Divergence
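• For reference, the standard definitions for discrete distributions $p$ and $q$: $$H(p) = -\sum_j p_j \log p_j, \qquad D(p\|q) = \sum_j p_j \log \frac{p_j}{q_j}$$
• Cross-entropy satisfies $H(p, q) = -\sum_j p_j \log q_j = H(p) + D(p\|q)$, so minimizing the cross-entropy loss (with $p$ the true label distribution and $q = \hat{\mathbf{y}}$) also minimizes the KL divergence between the two.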

3.4.4 Model Prediction and Evaluation¶

• Normally, we use the category with the highest predicted probability as the output category.
• We use accuracy to evaluate the model’s performance.
• Accuracy is the ratio between the number of correct predictions and the total number of predictions (a minimal sketch follows).
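• A minimal nd sketch of the accuracy computation (made-up values):

from mxnet import nd

y_hat = nd.array([[0.1, 0.3, 0.6],   # predicted probabilities for 2 examples
                  [0.3, 0.5, 0.2]])
y = nd.array([2, 2])                 # true class indices

# Predicted category = index of the largest probability; compare with the labels.
acc = (y_hat.argmax(axis=1) == y).mean().asscalar()
print(acc)                           # 0.5: one of the two predictions is correct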