3.1 Linear Regression

  • Regression
    • the task of predicting a real-valued target $y$ given a data point $x$.

3.1.1 Basic Elements of Linear Regression

  • Prediction can be expressed as a linear combination of the input features.
  • Linear Model

    • Example: estimating the price of a house $$\mathrm{price} = w_{\mathrm{area}} \cdot \mathrm{area} + w_{\mathrm{age}} \cdot \mathrm{age} + b$$
    • General Form

      • In the case of $d$ variables $$\hat{y} = w_1 \cdot x_1 + ... + w_d \cdot x_d + b$$
        $$\hat{y} = \mathbf{w}^\top \mathbf{x} + b$$
      • We want to find the weight vector $\mathbf{w}$ and bias term $b$ that approximately map data points $\mathbf{x}^{(i)}$ to their corresponding labels $y^{(i)}$.
      • For a collection of data points $\mathbf{X}$ the predictions $\hat{\mathbf{y}}$ can be expressed via the matrix-vector product $${\hat{\mathbf{y}}} = \mathbf{X} \mathbf{w} + b$$

        • Model parameters: $\mathbf{w}$, $b$
  • Training Data

    • 'features' or 'covariates'
      • The inputs used to predict the label (area and age in the house-price example)
    • $n$: the number of samples that we collect.
    • Each sample (indexed as $i$) is described by $x^{(i)} = [x_1^{(i)}, x_2^{(i)}]$, and the label is $y^{(i)}$.
  • Loss Function

    • Square Loss for a data sample $$l^{(i)}(\mathbf{w}, b) = \frac{1}{2} \left(\hat{y}^{(i)} - y^{(i)}\right)^2,$$

      • the smaller the error, the closer the predicted price is to the actual price
    • To measure the quality of a model on the entire dataset, we can simply average the losses on the training set.$$L(\mathbf{w}, b) =\frac{1}{n}\sum_{i=1}^n l^{(i)}(\mathbf{w}, b) =\frac{1}{n} \sum_{i=1}^n \frac{1}{2}\left(\mathbf{w}^\top \mathbf{x}^{(i)} + b - y^{(i)}\right)^2.$$
    • In model training, we want to find a set of model parameters, represented by $\mathbf{w}^*$, $b^*$, that can minimize the average loss of training samples: $$\mathbf{w}^*, b^* = \operatorname*{argmin}_{\mathbf{w}, b}\ L(\mathbf{w}, b).$$
  • Optimization Algorithm

    • Mini-batch stochastic gradient descent (SGD)

      • In each iteration, we randomly and uniformly sample a mini-batch $\mathcal{B}$ consisting of a fixed number of training data samples.
      • We then compute the derivative (gradient) of the average loss on the mini-batch with respect to the model parameters.
      • This result is used to move the parameters in the direction that reduces the loss (a minimal numeric sketch of one such update appears at the end of this subsection).$$ \begin{aligned} \mathbf{w} &\leftarrow \mathbf{w} - \frac{\eta}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} \partial_{\mathbf{w}} l^{(i)}(\mathbf{w}, b) = \mathbf{w} - \frac{\eta}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} \mathbf{x}^{(i)} \left(\mathbf{w}^\top \mathbf{x}^{(i)} + b - y^{(i)}\right) \\ b &\leftarrow b - \frac{\eta}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} \partial_b l^{(i)}(\mathbf{w}, b) = b - \frac{\eta}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} \left(\mathbf{w}^\top \mathbf{x}^{(i)} + b - y^{(i)}\right). \end{aligned} $$

        • $|\mathcal{B}|$: the number of samples (batch size) in each mini-batch
        • $\eta$: learning rate
      • hyper-parameters

        • $|\mathcal{B}|$, $\eta$
        • They are set somewhat manually and are typically not learned through model training.
  • Model prediction (or Model inference)
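  • A minimal numeric sketch of one mini-batch SGD update using the formulas above (toy values; the full implementation follows in Section 3.2):
from mxnet import nd

# Hypothetical mini-batch: |B| = 2 samples with d = 2 features each.
X = nd.array([[1.0, 2.0], [3.0, 4.0]])
y = nd.array([5.0, 6.0])
w = nd.zeros(2)               # initial weights
b = nd.zeros(1)               # initial bias
eta = 0.03                    # learning rate

err = nd.dot(X, w) + b - y                     # w^T x^(i) + b - y^(i) for each sample
loss = 0.5 * (err ** 2).mean()                 # average squared loss on the mini-batch
w = w - eta / X.shape[0] * nd.dot(X.T, err)    # gradient step for w
b = b - eta / X.shape[0] * err.sum()           # gradient step for b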

3.1.2 From Linear Regression to Deep Networks

  • Neural Network Diagram

    • a neural network diagram to represent the linear regression model

    • $d$: feature dimension (the number of inputs)

  • A Detour to Biology

    • Dendrites: input terminals
    • Nucleus: CPU
    • Axon: output wire
    • Axon terminals (output terminals) are connected to other neurons via synapses.
  • Vectorization for Speed
    • Vectorizing code is a good way to get order-of-magnitude speedups.
In [1]:
from mxnet import nd
from time import time

a = nd.ones(shape=10000)
b = nd.ones(shape=10000)
  • 1) add them one coordinate at a time using a for loop.
In [2]:
start = time()
c = nd.zeros(shape=10000)
for i in range(10000):
    c[i] = a[i] + b[i]
print(time() - start)
0.8771929740905762
  • 2) add the vectors directly:
In [3]:
start = time() 
d = a + b
print(time() - start)
0.00016832351684570312

3.1.3 The Normal Distribution and Squared Loss

  • Maximum Likelihood Principle
    • The principle of choosing the parameters that maximize the likelihood of the observed data
    • its estimators are usually called Maximum Likelihood Estimators (MLE).
    • minimize the Negative Log-Likelihood
  • The maximum likelihood in a linear model with additive Gaussian noise is equivalent to linear regression with squared loss.
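    • Sketch of why this holds (standard derivation, assuming additive noise $\epsilon \sim \mathcal{N}(0, \sigma^2)$ so that $y = \mathbf{w}^\top \mathbf{x} + b + \epsilon$): $$ p(y|\mathbf{x}) = \frac{1}{\sqrt{2 \pi \sigma^2}} \exp\left(-\frac{\left(y - \mathbf{w}^\top \mathbf{x} - b\right)^2}{2 \sigma^2}\right) $$ $$ -\log p(Y|X) = \sum_{i=1}^n \left[ \frac{1}{2} \log\left(2 \pi \sigma^2\right) + \frac{1}{2 \sigma^2} \left(y^{(i)} - \mathbf{w}^\top \mathbf{x}^{(i)} - b\right)^2 \right] $$
      • The first term does not depend on $\mathbf{w}$ or $b$, so minimizing the negative log-likelihood over $\mathbf{w}$, $b$ is exactly minimizing the squared loss.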

3.2 Linear regression implementation from scratch

In [4]:
%matplotlib inline
from IPython import display
from matplotlib import pyplot as plt 
from mxnet import autograd, nd 
import random

3.2.1 Generating Data Sets

  • The randomly generated feature matrix $\mathbf{X}\in \mathbb{R}^{1000 \times 2}$ (1000 examples, 2 features),
  • The actual weight $\mathbf{w} = [2, -3.4]^\top$ and bias $b = 4.2$ of the linear regression model
  • A random noise term $\epsilon$
    • It obeys a normal distribution with a mean of 0 and a standard deviation of 0.01 ($\epsilon \sim \mathcal{N}(0, 0.01^2)$). $$\mathbf{y}= \mathbf{X} \mathbf{w} + b + \boldsymbol{\epsilon}$$
In [5]:
num_inputs = 2
num_examples = 1000
true_w = nd.array([2, -3.4])
true_b = 4.2

features = nd.random.normal(scale=1, shape=(num_examples, num_inputs)) # scale --> standard deviation

labels = nd.dot(features, true_w) + true_b

labels += nd.random.normal(scale=0.01, shape=labels.shape)
In [6]:
print(features[0])
print(labels[0])
[1.1630787 0.4838046]
<NDArray 2 @cpu(0)>

[4.879625]
<NDArray 1 @cpu(0)>
  • By generating a scatter plot of the second feature features[:, 1] against the labels, we can clearly observe the linear correlation between the two.
  • For future plotting, we only need to call gluonbook.set_figsize() to output plots as vector graphics and to set the figure size.
In [7]:
def use_svg_display():
    # Displayed in vector graphics.
    display.set_matplotlib_formats('svg')

def set_figsize(figsize=(5, 3)):
    use_svg_display()
    # Set the size of the graph to be plotted.
    plt.rcParams['figure.figsize'] = figsize

set_figsize()
plt.scatter(features[:, 1].asnumpy(), labels.asnumpy(), 1);

3.2.2 Reading Data

  • Here we define a function data_iter that yields random mini-batches of batch_size examples, returning their features and labels.
    • This function has been saved in the gluonbook package for future use.
In [30]:
def data_iter(batch_size, features, labels):
    num_examples = len(features)
    indices = list(range(num_examples))
    random.shuffle(indices)  # The examples are read at random, in no particular order.
    for i in range(0, num_examples, batch_size):
        j = nd.array(indices[i: min(i + batch_size, num_examples)])
        yield features.take(j), labels.take(j)
        # The “take” function will then return the corresponding element based on the indices.
In [31]:
batch_size = 10
iterator = data_iter(batch_size, features, labels)
X_batch, y_batch = next(iterator)
print(X_batch, y_batch)
[[ 0.5226451  -0.36821297]
 [ 1.4648029  -0.9737964 ]
 [ 0.2238931  -0.22365536]
 [-0.39304093  0.45682105]
 [-0.15107684  0.95578885]
 [-0.1967979  -0.13781767]
 [-0.5158308  -1.1073749 ]
 [ 0.6277698   0.08966792]
 [-1.2582556   1.9790834 ]
 [-0.0842585  -0.73394597]]
<NDArray 10x2 @cpu(0)> 
[ 6.5130043 10.44841    5.4141636  1.867064   0.6560518  4.27661
  6.926715   5.157996  -5.051523   6.529941 ]
<NDArray 10 @cpu(0)>
In [32]:
for X_batch, y_batch in iterator:
    print(X_batch, y_batch)
    break
[[ 2.2771406   0.37218785]
 [-1.4932036   0.13177635]
 [-0.83469146  0.6173635 ]
 [ 0.9580142   0.39661676]
 [-0.6929967   1.6447209 ]
 [ 0.11793987 -2.2566102 ]
 [ 0.28372657 -0.530157  ]
 [-1.1158241   0.2747542 ]
 [-2.1107721   0.04706239]
 [-0.34981632 -0.13172452]]
<NDArray 10x2 @cpu(0)> 
[ 7.4963803   0.7656811   0.44722196  4.7559056  -2.7936707  12.117811
  6.5841184   1.0266986  -0.16563828  3.9547756 ]
<NDArray 10 @cpu(0)>

3.2.3 Initialize Model Parameters

  • Weights are initialized with random numbers drawn from a normal distribution with mean 0 and standard deviation 0.01,
  • Bias b is set to zero.
In [33]:
w = nd.random.normal(scale=0.01, shape=(num_inputs, 1))
b = nd.zeros(shape=(1,))
  • We’ll update each parameter, w and b, in the direction that reduces the loss.
  • In order for autograd to know that it needs to set up the appropriate data structures, track changes, etc., we need to attach gradients explicitly.
In [34]:
w.attach_grad()
b.attach_grad()

3.2.4 Define the Model

In [35]:
def linreg(X, w, b): 
    return nd.dot(X, w) + b    # return value's shape: (number of examples, 1), e.g. (10, 1) for a mini-batch

3.2.5 Define the Loss Function

In [36]:
def squared_loss(y_hat, y): 
    # Reshape y to y_hat's shape (batch_size, 1) so the subtraction is elementwise
    # rather than broadcast into a (batch_size, batch_size) matrix.
    return (y_hat - y.reshape(y_hat.shape)) ** 2 / 2

3.2.6 Define the Optimization Algorithm

  • We'll solve this problem by mini-batch stochastic gradient descent (sgd) instead of the analytical closed-form solution.
  • At each step, we’ll estimate the gradient of the loss with respect to our parameters, using one batch randomly drawn from our dataset.
  • Then, we’ll update our parameters a small amount in the direction that reduces the loss.
  • Here, the gradient calculated by the automatic differentiation module is the sum of gradients over the batch of examples, so we divide by batch_size to obtain the average.
In [37]:
def sgd(params, lr, batch_size): 
    for param in params:
        # Update in place ([:]) so that each NDArray keeps its attached gradient buffer.
        param[:] = param - lr * param.grad / batch_size

3.2.7 Training

  • Training Procedure
    • Initialize parameters $(\mathbf{w}, b)$
    • Repeat until done
      • Compute loss $l(\mathbf{x}^{(i)}, y^{(i)}, \mathbf{w}, b)$
      • Compute gradient $\mathbf{g} \leftarrow \partial_{(\mathbf{w},b)} \frac{1}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} l(\mathbf{x}^{(i)}, y^{(i)}, \mathbf{w}, b)$
      • Update parameters $(\mathbf{w}, b) \leftarrow (\mathbf{w}, b) - \eta \mathbf{g}$
  • Loss shape
    • Since we previously set batch_size to 10, the loss shape for each small batch is (10, 1).
  • Running l.backward() sums the elements of l into a scalar and then computes the gradient of that scalar with respect to the model parameters.
  • num_epochs and lr are both hyper-parameters and are set to 3 and 0.03, respectively.
In [38]:
lr = 0.03               # learning rate
num_epochs = 3          # number of passes over the dataset (epochs)
net = linreg            # our fancy linear model
loss = squared_loss     # 0.5 (y-y')^2

for epoch in range(num_epochs):
    for X, y in data_iter(batch_size, features, labels):
        with autograd.record():
            l = loss(net(X, w, b), y)  # minibatch loss in X and y
        l.backward()                   # compute gradient, w.grad and b.grad, on l with respect to [w, b]
        sgd([w, b], lr, batch_size)    # update parameters [w, b] using their gradient
    train_l = loss(net(features, w, b), labels)
    print('epoch {0}, loss {1:.8f}'.format(epoch + 1, float(train_l.mean().asnumpy())))
epoch 1, loss 0.03515049
epoch 2, loss 0.00011977
epoch 3, loss 0.00004835
In [39]:
print('Error in estimating w', true_w - w.reshape(true_w.shape))
print('Error in estimating b', true_b - b)
Error in estimating w 
[2.7418137e-04 3.6954880e-05]
<NDArray 2 @cpu(0)>
Error in estimating b 
[-1.7642975e-05]
<NDArray 1 @cpu(0)>

3.3 Gluon Implementation of Linear Regression

3.3.1 Generating Data Sets

In [68]:
from mxnet import autograd, nd

num_inputs = 2
num_examples = 1000
true_w = nd.array([2, -3.4])
true_b = 4.2

features = nd.random.normal(scale=1, shape=(num_examples, num_inputs))

labels = nd.dot(features, true_w) + true_b

labels += nd.random.normal(scale=0.01, shape=labels.shape)

3.3.2 Reading Data

In [69]:
from mxnet.gluon import data as gdata

batch_size = 10

# Combining the features and labels of the training data. 
dataset = gdata.ArrayDataset(features, labels)

# Randomly reading mini-batches.
data_iter = gdata.DataLoader(dataset, batch_size, shuffle=True)
In [70]:
for X, y in data_iter:
    print(X, y)
    break
[[-1.6311898   0.17929219]
 [-0.60608387  1.2890214 ]
 [ 1.2066218  -0.23281197]
 [-1.6739345  -0.19565606]
 [-0.6975202   0.1806888 ]
 [ 1.3205358  -1.0833067 ]
 [-0.23225537  1.2591201 ]
 [ 1.4396188   1.9778014 ]
 [-2.8449814  -0.765359  ]
 [-0.7729255   0.5642861 ]]
<NDArray 10x2 @cpu(0)> 
[ 0.34095168 -1.3967377   7.4045095   1.5087711   2.2066317  10.523467
 -0.5347519   0.36069041  1.1080252   0.7397361 ]
<NDArray 10 @cpu(0)>

3.3.3 Define the Model

  • Gluon provides a large number of predefined layers, which allows us to focus on which layers to use when constructing the model rather than on their implementation.
  • We will first define a model variable net
  • In Gluon, a Sequential instance can be regarded as a container that concatenates the various layers in sequence.
  • We need a single layer.
    • The layer is fully connected since it connects all inputs with all outputs by means of a matrix-vector multiplication.
    • In Gluon, the fully connected layer is referred to as a Dense instance.
    • Since we only want to generate a single scalar output, we set the number of outputs of the Dense layer to 1.
    • We do not need to specify the input shape for each layer, such as the number of linear regression inputs.
      • When the net(X) is executed later, the model will automatically infer the number of inputs in each layer.
In [71]:
from mxnet.gluon import nn
net = nn.Sequential()
net.add(nn.Dense(1))

3.3.4 Initialize Model Parameters

In [72]:
from mxnet import init
net.initialize(init.Normal(sigma=0.01))
  • We are initializing parameters for a network whose input dimension we have not yet specified.
  • The real initialization is deferred until the first time data is sent through the network.
    • Since the parameters have not been initialized yet, we would not be able to manipulate them yet.

3.3.5 Define the Loss Function

In [73]:
from mxnet.gluon import loss as gloss
loss = gloss.L2Loss()

3.3.6 Define the Optimization Algorithm

In [74]:
from mxnet import gluon
trainer = gluon.Trainer(
    params = net.collect_params(), 
    optimizer = 'sgd', 
    optimizer_params = {
        'learning_rate': 0.03
    }
)

3.3.7 Training

  • We don’t have to
    • individually allocate parameters,
    • define our loss function,
    • implement stochastic gradient descent.
  • We need to tell trainer.step the batch size (batch_size), so that the gradient is normalized by the number of examples in the mini-batch.
In [75]:
num_epochs = 3
for epoch in range(num_epochs):
    for X, y in data_iter: 
        with autograd.record(): 
            l = loss(net(X), y)
        l.backward()
        trainer.step(batch_size)
    train_l = loss(net(features), labels)
    print('epoch {0}, loss {1:.8f}'.format(epoch + 1, float(train_l.mean().asnumpy())))
epoch 1, loss 0.04303914
epoch 2, loss 0.00016977
epoch 3, loss 0.00004908
  • We get the layer we need from the net and access its weight (weight) and bias (bias).
In [76]:
w = net[0].weight.data()
print('Error in estimating w', true_w.reshape(w.shape) - w) 

b = net[0].bias.data()
print('Error in estimating b', true_b - b)
Error in estimating w 
[[ 0.00129068 -0.00103784]]
<NDArray 1x2 @cpu(0)>
Error in estimating b 
[0.00063229]
<NDArray 1 @cpu(0)>

3.4 Softmax Regression

3.4.1 Classification Problems

  • The input image has a height and width of 2 pixels and the color is grayscale.
    • We record the four pixels in the image as $x_1, x_2, x_3, x_4$.
  • The actual labels of the images in the training data set are "cat", "chicken" or "dog"
    • $y$ is viewed as a three-dimensional vector (one-hot encoding) via $$y \in \{(1, 0, 0), (0, 1, 0), (0, 0, 1)\}$$
    • $(1,0,0)$: "cat"
    • $(0,1,0)$: "chicken"
    • $(0,0,1)$: "dog"
  • Network Architecture
    • There are 4 features and 3 output animal categories $$ \begin{aligned} o_1 &= x_1 w_{11} + x_2 w_{21} + x_3 w_{31} + x_4 w_{41} + b_1,\\ o_2 &= x_1 w_{12} + x_2 w_{22} + x_3 w_{32} + x_4 w_{42} + b_2,\\ o_3 &= x_1 w_{13} + x_2 w_{23} + x_3 w_{33} + x_4 w_{43} + b_3. \end{aligned} $$
    • The neural network diagram below depicts the calculation above.
  • Softmax Operation

    • In vector form we arrive at $\mathbf{o} = \mathbf{W} \mathbf{x} + \mathbf{b}$.
    • A simple approach is to treat the output value $o_i$ as the confidence level for prediction category $i$.
    • The output of softmax regression is subjected to a nonlinearity which ensures that the sum over all outcomes always adds up to 1 and that none of the terms is ever negative. $$ \hat{\mathbf{y}} = \mathrm{softmax}(\mathbf{o}) \text{ where } \hat{y}_i = \frac{\exp(o_i)}{\sum_j \exp(o_j)} $$
    • We can still find the most likely class by $$ \hat{\imath}(\mathbf{o}) = \operatorname*{argmax}_i o_i = \operatorname*{argmax}_i \hat y_i $$
    • Summarizing it all in vector notation $${\hat{\mathbf{y}}}^{(i)} = \mathrm{softmax}({\mathbf{o}}^{(i)}) = \mathrm{softmax}(\mathbf{W} {\mathbf{x}}^{(i)} + {\mathbf{b}})$$
  • Vectorization for Minibatches

    • Assume that we are given a mini-batch $\mathbf{X}$ of examples with dimensionality $d$ and batch size $n$.
    • Assume that we have q categories (outputs)
    • Then the minibatch features $\mathbf{X}$ are in $\mathbb{R}^{n \times d}$, weights $\mathbf{W} \in \mathbb{R}^{d \times q}$ and the bias satisfies $\mathbf{b} \in \mathbb{R}^q$.
$$ \begin{aligned} \mathbf{O} &= \mathbf{X} \mathbf{W} + \mathbf{b} \\ \hat{\mathbf{Y}} & = \mathrm{softmax}(\mathbf{O}) \end{aligned} $$
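    • A minimal sketch of the softmax operation on a minibatch of scores (the values of O below are made up for illustration):
from mxnet import nd

def softmax(O):
    # Exponentiate every score, then normalize each row so that it sums to 1.
    O_exp = O.exp()
    partition = O_exp.sum(axis=1, keepdims=True)
    return O_exp / partition                  # broadcasting divides each row by its row sum

O = nd.array([[1.0, 2.0, 3.0], [1.0, 1.0, 1.0]])   # n = 2 examples, q = 3 categories
Y_hat = softmax(O)
print(Y_hat, Y_hat.sum(axis=1))               # each row is non-negative and sums to 1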

3.4.2 Loss Function

  • Log-Likelihood

    • The softmax function allows us to compare the estimates with reality, simply by checking how well the model predicted what we actually observed. $$ p(Y|X) = \prod_{i=1}^n p(y^{(i)}|x^{(i)}) = \prod_{i=1}^n \prod_{j}( {\hat{y}_j}^{(i)})^{y_{j}^{(i)}} $$ and thus the loss function $L$ (the negative log-likelihood) $$L = -\log p(Y|X) = \sum_{i=1}^n -\log p(y^{(i)}|x^{(i)}) = - \sum_{i=1}^n \sum_{j} y_{j}^{(i)} \log {\hat{y}_j}^{(i)} $$ [NOTE] $\log {\hat{y}_j}^{(i)} \leq 0$, so that $L \geq 0$
    • $L$ is minimized if we correctly predict $y$ with certainty, i.e. ${\hat{y}_j}^{(i)} = 1$ for the correct label.
  • Softmax and Derivatives

    • We drop the superscript $(i)$ to avoid notational clutter

      $$ l = -\sum_j y_j \log \hat{y}_j = -\sum_j y_j \log \frac{\exp(o_j)}{\sum_k \exp(o_k)} = \sum_j y_j \log \sum_k \exp(o_k) - \sum_j y_j o_j = \log \sum_k \exp(o_k) - \sum_j y_j o_j $$
    • Consider the loss function's derivative with respect to $o_j$

      $$ \partial_{o_j} l = \frac{\exp(o_j)}{\sum_k \exp(o_k)} - y_j = \mathrm{softmax}(\mathbf{o})_j - y_j = {\hat{y}_j} - y_j $$
    • The gradient is the difference between the observation $y$ and the estimate $\hat{y}$, which is very similar to what we saw in regression
      • This fact makes computing gradients a lot easier in practice.
  • Cross-Entropy Loss $$ l(\mathbf{y}, \hat{\mathbf{y}}) = - \sum_j y_j \log \hat{y}_j $$
    • It is one of the most commonly used ones for multiclass classification.
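    • A small sketch (made-up values) that computes the cross-entropy loss for one example and checks the derivative $\partial_{o_j} l = \hat{y}_j - y_j$ with autograd:
from mxnet import autograd, nd

y = nd.array([[0, 1, 0]])             # one-hot label
o = nd.array([[1.0, 2.0, 0.5]])       # scores for one example
o.attach_grad()
with autograd.record():
    o_exp = o.exp()
    y_hat = o_exp / o_exp.sum(axis=1, keepdims=True)   # softmax
    l = -(y * y_hat.log()).sum()                       # cross-entropy loss
l.backward()
print(o.grad)                         # equals softmax(o) - y, as derived above
print(y_hat - y)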

3.4.3 Information Theory Basics

  • Entropy
  • Kullback Leibler Divergence
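  • The standard definitions behind these terms (stated here for reference): $$ H(p) = -\sum_j p_j \log p_j, \qquad D_{\mathrm{KL}}(p\|q) = \sum_j p_j \log \frac{p_j}{q_j} $$
    • Cross-entropy $H(p, q) = -\sum_j p_j \log q_j = H(p) + D_{\mathrm{KL}}(p\|q)$, so minimizing the cross-entropy loss against the label distribution also minimizes the KL divergence to it.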

3.4.4 Model Prediction and Evaluation

  • Normally, we use the category with the highest predicted probability as the output category.
  • We use accuracy to evaluate the model's performance.
    • The ratio between the number of correct predictions and the total number of predictions.
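  • A minimal sketch of this accuracy computation (hypothetical predictions and labels):
from mxnet import nd

y_hat = nd.array([[0.1, 0.7, 0.2], [0.6, 0.3, 0.1]])   # predicted probabilities
y = nd.array([1, 2])                                   # true category indices
acc = (y_hat.argmax(axis=1) == y.astype('float32')).mean().asscalar()
print(acc)   # 0.5: the first prediction matches its label, the second does not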