#!/usr/bin/env python
# coding: utf-8

# # 3.1 Linear Regression
# - Regression
#   - the task of predicting a real-valued target $y$ given a data point $x$.

# ## 3.1.1 Basic Elements of Linear Regression
# - Prediction can be expressed as a *linear* combination of the input features.
# - Linear Model
#   - Example: estimating the price of a house
# $$\mathrm{price} = w_{\mathrm{area}} \cdot \mathrm{area} + w_{\mathrm{age}} \cdot \mathrm{age} + b$$
#   - General Form
#     - In the case of $d$ variables
# $$\hat{y} = w_1 \cdot x_1 + ... + w_d \cdot x_d + b$$
#     - or, in vector form,
# $$\hat{y} = \mathbf{w}^\top \mathbf{x} + b$$
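# - As a quick illustration (not part of the original notes), the sketch below evaluates this prediction with the NDArray API; the weights $[2, -3.4]^\top$ and bias $4.2$ are the same "true" parameters used later in Section 3.2, and the data point is made up.

# In[ ]:


from mxnet import nd

w_demo = nd.array([2, -3.4])   # illustrative weight vector
b_demo = 4.2                   # illustrative bias
x_demo = nd.array([1.0, 2.0])  # a made-up data point with d = 2 features

# Linear prediction: y_hat = w^T x + b
y_hat_demo = nd.dot(w_demo, x_demo) + b_demo
print(y_hat_demo)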
# - We'll try to find the weight vector $\mathbf{w}$ and bias term $b$ that approximately associate data points $\mathbf{x}_i$ with their corresponding labels $y_i$.
# - For a collection of data points $\mathbf{X}$ the predictions $\hat{\mathbf{y}}$ can be expressed via the matrix-vector product
# $$\hat{\mathbf{y}} = \mathbf{X} \mathbf{w} + b$$
#
# - ***Model parameters***: $\mathbf{w}$, $b$
#
# - Training Data
#   - 'features' or 'covariates'
#     - the two factors used to predict the label
#   - $n$: the number of samples that we collect.
#   - Each sample (indexed as $i$) is described by $\mathbf{x}^{(i)} = [x_1^{(i)}, x_2^{(i)}]$, and its label is $y^{(i)}$.
#
# - Loss Function
#   - Squared loss for a single data sample
# $$l^{(i)}(\mathbf{w}, b) = \frac{1}{2} \left(\hat{y}^{(i)} - y^{(i)}\right)^2,$$
#   - the smaller the error, the closer the predicted price is to the actual price
#   - To measure the quality of a model on the entire dataset, we can simply average the losses on the training set.
# $$L(\mathbf{w}, b) = \frac{1}{n}\sum_{i=1}^n l^{(i)}(\mathbf{w}, b) = \frac{1}{n} \sum_{i=1}^n \frac{1}{2}\left(\mathbf{w}^\top \mathbf{x}^{(i)} + b - y^{(i)}\right)^2.$$
#   - In model training, we want to find a set of model parameters, represented by $\mathbf{w}^*$, $b^*$, that minimize the average loss over the training samples:
# $$\mathbf{w}^*, b^* = \operatorname*{argmin}_{\mathbf{w}, b}\ L(\mathbf{w}, b).$$
#
# - Optimization Algorithm
#   - ***Mini-batch stochastic gradient descent***
#     - In each iteration, we randomly and uniformly sample a mini-batch $\mathcal{B}$ consisting of a fixed number of training data samples.
#     - We then compute the derivative (gradient) of the average loss on the mini-batch with respect to the model parameters.
#     - This result is used to change the parameters in the direction that lowers the loss.
# $$ \begin{aligned} \mathbf{w} &\leftarrow \mathbf{w} - \frac{\eta}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} \partial_{\mathbf{w}} l^{(i)}(\mathbf{w}, b) = \mathbf{w} - \frac{\eta}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} \mathbf{x}^{(i)} \left(\mathbf{w}^\top \mathbf{x}^{(i)} + b - y^{(i)}\right) \\ b &\leftarrow b - \frac{\eta}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} \partial_b l^{(i)}(\mathbf{w}, b) = b - \frac{\eta}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} \left(\mathbf{w}^\top \mathbf{x}^{(i)} + b - y^{(i)}\right). \end{aligned} $$
#     - $|\mathcal{B}|$: the number of samples (batch size) in each mini-batch
#     - $\eta$: learning rate
#
#   - hyper-parameters
#     - $|\mathcal{B}|$, $\eta$
#     - They are set manually and are typically not learned through model training.
#
# - Model prediction (or model inference)

# ## 3.1.2 From Linear Regression to Deep Networks
# - Neural Network Diagram
#   - a neural network diagram to represent the linear regression model
# ![](https://github.com/diveintodeeplearning/d2l-en/raw/master/img/singleneuron.svg?sanitize=true)
#
#   - $d$: feature dimension (the number of inputs)
#
# - A Detour to Biology
# ![](https://github.com/diveintodeeplearning/d2l-en/raw/master/img/Neuron.svg?sanitize=true)
#
#   - ***Dendrites***: input terminals
#   - ***Nucleus***: CPU
#   - ***Axon***: output wire
#   - Axon terminals (output terminals) are connected to other neurons via ***synapses***.
# - Vectorization for Speed
#   - Vectorizing code is a good way of getting order-of-magnitude speedups.

# In[1]:


from mxnet import nd
from time import time

a = nd.ones(shape=10000)
b = nd.ones(shape=10000)

# - 1) add them one coordinate at a time using a for loop.
# In[2]:


start = time()
c = nd.zeros(shape=10000)
for i in range(10000):
    c[i] = a[i] + b[i]
print(time() - start)

# - 2) add the vectors directly:

# In[3]:


start = time()
d = a + b
print(time() - start)

# ## 3.1.3 The Normal Distribution and Squared Loss
# - ***Maximum Likelihood Principle***
#   - the notion of maximizing the likelihood of the data subject to the parameters
#   - its estimators are usually called ***Maximum Likelihood Estimators (MLE)***.
#   - In practice, we minimize the negative log-likelihood.
#   - Maximum likelihood estimation in a linear model with additive Gaussian noise is equivalent to linear regression with squared loss.

# # 3.2 Linear Regression Implementation from Scratch

# In[4]:


get_ipython().run_line_magic('matplotlib', 'inline')
from IPython import display
from matplotlib import pyplot as plt
from mxnet import autograd, nd
import random

# ## 3.2.1 Generating Data Sets
# - The randomly generated feature matrix $\mathbf{X} \in \mathbb{R}^{1000 \times 2}$
# - The actual weight $\mathbf{w} = [2, -3.4]^\top$ and bias $b = 4.2$ of the linear regression model
# - A random noise term $\epsilon$
#   - It obeys a normal distribution with a mean of 0 and a standard deviation of 0.01 ($\epsilon \sim \mathcal{N}(0, 0.01^2)$).
# $$\mathbf{y} = \mathbf{X} \mathbf{w} + b + \boldsymbol{\epsilon}$$

# In[5]:


num_inputs = 2
num_examples = 1000
true_w = nd.array([2, -3.4])
true_b = 4.2
features = nd.random.normal(scale=1, shape=(num_examples, num_inputs))  # scale --> standard deviation
labels = nd.dot(features, true_w) + true_b
labels += nd.random.normal(scale=0.01, shape=labels.shape)

# In[6]:


print(features[0])
print(labels[0])

# - By generating a scatter plot of the second feature against the labels, we can clearly observe the linear correlation between the two.
# - For future plotting, we only need to call `gluonbook.set_figsize()` to switch to vector-graphics output and set the figure size.

# In[7]:


def use_svg_display():
    # Display in vector graphics.
    display.set_matplotlib_formats('svg')

def set_figsize(figsize=(5, 3)):
    use_svg_display()
    # Set the size of the graph to be plotted.
    plt.rcParams['figure.figsize'] = figsize

set_figsize()
plt.scatter(features[:, 1].asnumpy(), labels.asnumpy(), 1);

# ## 3.2.2 Reading Data
# - Here we define a function `data_iter` that returns the features and labels of `batch_size` (batch size) randomly chosen examples.
# - This function has been saved in the `gluonbook` package for future use.

# In[30]:


def data_iter(batch_size, features, labels):
    num_examples = len(features)
    indices = list(range(num_examples))
    random.shuffle(indices)  # The examples are read at random, in no particular order.
    for i in range(0, num_examples, batch_size):
        j = nd.array(indices[i: min(i + batch_size, num_examples)])
        # The "take" function returns the corresponding elements based on the indices.
        yield features.take(j), labels.take(j)

# In[31]:


batch_size = 10

iterator = data_iter(batch_size, features, labels)
X_batch, y_batch = next(iterator)
print(X_batch, y_batch)

# In[32]:


for X_batch, y_batch in iterator:
    print(X_batch, y_batch)
    break

# ## 3.2.3 Initialize Model Parameters
# - Weights are initialized as normal random numbers with a mean of 0 and a standard deviation of 0.01.
# - The bias $b$ is set to zero.

# In[33]:


w = nd.random.normal(scale=0.01, shape=(num_inputs, 1))
b = nd.zeros(shape=(1,))

# - We'll update each parameter, w and b, in the direction that reduces the loss.
# - In order for `autograd` to know that it needs to set up the appropriate data structures, track changes, etc., we need to attach gradients explicitly.

# In[34]:


w.attach_grad()
b.attach_grad()

# ## 3.2.4 Define the Model

# In[35]:


def linreg(X, w, b):
    return nd.dot(X, w) + b  # return value's shape: (10, 1)

# ## 3.2.5 Define the Loss Function

# In[36]:


def squared_loss(y_hat, y):
    return (y_hat - y.reshape(y_hat.shape)) ** 2 / 2

# ## 3.2.6 Define the Optimization Algorithm
# - We'll solve this problem with stochastic gradient descent (`sgd`) instead of the analytical closed-form solution.
# - At each step, we'll estimate the gradient of the loss with respect to our parameters, using one batch randomly drawn from our dataset.
# - Then, we'll update our parameters by a small amount in the direction that reduces the loss.
# - Here, the gradient calculated by the automatic differentiation module is the gradient sum over a batch of examples, so we divide by the batch size.

# In[37]:


def sgd(params, lr, batch_size):
    for param in params:
        param[:] = param - lr * param.grad / batch_size

# ## 3.2.7 Training
# - Training Procedure
#   - Initialize parameters $(\mathbf{w}, b)$
#   - Repeat until done
#     - Compute loss $l(\mathbf{x}^{(i)}, y^{(i)}, \mathbf{w}, b)$
#     - Compute gradient $\mathbf{g} \leftarrow \partial_{(\mathbf{w},b)} \frac{1}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} l(\mathbf{x}^{(i)}, y^{(i)}, \mathbf{w}, b)$
#     - Update parameters $(\mathbf{w}, b) \leftarrow (\mathbf{w}, b) - \eta \mathbf{g}$
# - Loss shape
#   - Since we previously set `batch_size` to 10, the loss shape for each mini-batch is (10, 1).
#   - Running `l.backward()` sums the elements in `l` into a scalar and then computes the gradient of that scalar with respect to the model parameters.
# - `num_epochs` and `lr` are both hyper-parameters and are set to 3 and 0.03, respectively.

# In[38]:


lr = 0.03            # learning rate
num_epochs = 3       # number of iterations
net = linreg         # our fancy linear model
loss = squared_loss  # 0.5 * (y - y')^2

for epoch in range(num_epochs):
    for X, y in data_iter(batch_size, features, labels):
        with autograd.record():
            l = loss(net(X, w, b), y)  # minibatch loss in X and y
        l.backward()  # compute gradients, w.grad and b.grad, of l with respect to [w, b]
        sgd([w, b], lr, batch_size)  # update parameters [w, b] using their gradients
    train_l = loss(net(features, w, b), labels)
    print('epoch {0}, loss {1:.8f}'.format(epoch + 1, float(train_l.mean().asnumpy())))

# In[39]:


print('Error in estimating w', true_w - w.reshape(true_w.shape))
print('Error in estimating b', true_b - b)

# # 3.3 Gluon Implementation of Linear Regression

# ## 3.3.1 Generating Data Sets

# In[68]:


from mxnet import autograd, nd

num_inputs = 2
num_examples = 1000
true_w = nd.array([2, -3.4])
true_b = 4.2
features = nd.random.normal(scale=1, shape=(num_examples, num_inputs))
labels = nd.dot(features, true_w) + true_b
labels += nd.random.normal(scale=0.01, shape=labels.shape)

# ## 3.3.2 Reading Data

# In[69]:


from mxnet.gluon import data as gdata

batch_size = 10
# Combine the features and labels of the training data.
dataset = gdata.ArrayDataset(features, labels)
# Randomly read mini-batches.
data_iter = gdata.DataLoader(dataset, batch_size, shuffle=True)

# In[70]:


for X, y in data_iter:
    print(X, y)
    break

# ## 3.3.3 Define the Model
# - Gluon provides a large number of predefined layers, which allow us to ***focus especially on the layers used to construct the model*** rather than having to focus on the implementation.
# - We will first define a model variable `net`.
# - In Gluon, a `Sequential` instance can be regarded as a container that concatenates the various layers in sequence.
# - We need a single layer.
#   - The layer is fully connected since it connects all inputs with all outputs by means of a matrix-vector multiplication.
# ![](https://github.com/diveintodeeplearning/d2l-en/raw/master/img/singleneuron.svg?sanitize=true)
#   - In Gluon, the fully connected layer is referred to as a `Dense` instance.
#   - Since we only want to generate a single scalar output, we set the number of outputs to 1.
# - We do not need to specify the input shape for each layer, such as the number of linear regression inputs.
#   - When `net(X)` is executed later, the model will automatically infer the number of inputs to each layer.

# In[71]:


from mxnet.gluon import nn

net = nn.Sequential()
net.add(nn.Dense(1))

# ## 3.3.4 Initialize Model Parameters

# In[72]:


from mxnet import init

net.initialize(init.Normal(sigma=0.01))

# - We are initializing parameters for a network without having told it yet how many dimensions the input will have.
# - The real initialization is deferred until the first time that data is sent through the network.
# - Since the parameters have not been initialized yet, we cannot manipulate them yet.

# ## 3.3.5 Define the Loss Function

# In[73]:


from mxnet.gluon import loss as gloss

loss = gloss.L2Loss()

# ## 3.3.6 Define the Optimization Algorithm

# In[74]:


from mxnet import gluon

trainer = gluon.Trainer(
    params=net.collect_params(),
    optimizer='sgd',
    optimizer_params={'learning_rate': 0.03}
)

# ## 3.3.7 Training
# - We don't have to
#   - individually allocate parameters,
#   - define our loss function,
#   - implement stochastic gradient descent.
# - We need to tell `trainer.step` about the amount of data (i.e., `batch_size`).

# In[75]:


num_epochs = 3
for epoch in range(num_epochs):
    for X, y in data_iter:
        with autograd.record():
            l = loss(net(X), y)
        l.backward()
        trainer.step(batch_size)
    train_l = loss(net(features), labels)
    print('epoch {0}, loss {1:.8f}'.format(epoch + 1, float(train_l.mean().asnumpy())))

# - We get the layer we need from `net` and access its weight and bias.

# In[76]:


w = net[0].weight.data()
print('Error in estimating w', true_w.reshape(w.shape) - w)
b = net[0].bias.data()
print('Error in estimating b', true_b - b)

# # 3.4 Softmax Regression

# ## 3.4.1 Classification Problems
# - The input image has a height and width of 2 pixels and the color is grayscale.
# - We record the four pixels in the image as $x_1, x_2, x_3, x_4$.
# - The actual labels of the images in the training data set are "cat", "chicken" or "dog".
# - $y$ is viewed as a three-dimensional vector (one-hot encoding) via
# $$y \in \{(1, 0, 0), (0, 1, 0), (0, 0, 1)\}$$
#   - $(1,0,0)$: "cat"
#   - $(0,1,0)$: "chicken"
#   - $(0,0,1)$: "dog"
#
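# - As a quick illustration (not part of the original notes; the label values below are made up), integer class indices can be turned into these one-hot vectors with `nd.one_hot`.

# In[ ]:


from mxnet import nd

# Hypothetical integer labels for three images: 0 = "cat", 1 = "chicken", 2 = "dog".
y_indices = nd.array([0, 2, 1])
# One-hot encode into three-dimensional vectors, e.g. 0 -> (1, 0, 0).
y_one_hot = nd.one_hot(y_indices, 3)
print(y_one_hot)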
# - Network Architecture
#   - There are 4 features and 3 output animal categories
# $$ \begin{aligned} o_1 &= x_1 w_{11} + x_2 w_{21} + x_3 w_{31} + x_4 w_{41} + b_1,\\ o_2 &= x_1 w_{12} + x_2 w_{22} + x_3 w_{32} + x_4 w_{42} + b_2,\\ o_3 &= x_1 w_{13} + x_2 w_{23} + x_3 w_{33} + x_4 w_{43} + b_3. \end{aligned} $$
#
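# - As a small numeric sketch (not part of the original notes; all values are made up), the three outputs above can be computed at once by treating $\mathbf{x}$ as a row vector and multiplying by a weight matrix $\mathbf{W} \in \mathbb{R}^{4 \times 3}$:

# In[ ]:


from mxnet import nd

x_demo = nd.array([[0.1, 0.2, 0.3, 0.4]])  # one image flattened into 4 pixel features, shape (1, 4)
W_demo = nd.array([[0.1, 0.2, 0.3],        # made-up weights; column j holds w_{1j}, ..., w_{4j}
                   [0.4, 0.5, 0.6],
                   [0.7, 0.8, 0.9],
                   [1.0, 1.1, 1.2]])
b_demo = nd.array([0.1, 0.2, 0.3])         # made-up biases b_1, b_2, b_3

o_demo = nd.dot(x_demo, W_demo) + b_demo   # o_1, o_2, o_3 from the equations above, shape (1, 3)
print(o_demo)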
# - The neural network diagram below depicts the calculation above.
# ![](https://github.com/diveintodeeplearning/d2l-en/raw/master/img/softmaxreg.svg?sanitize=true)
# - Softmax Operation
#   - In vector form we arrive at $\mathbf{o} = \mathbf{W} \mathbf{x} + \mathbf{b}$.
#   - A simple approach would be to treat the output value $o_i$ as the ***confidence level*** of prediction category $i$.
#   - The output of softmax regression is subjected to a nonlinearity which ensures that the sum over all outcomes always adds up to 1 and that none of the terms is ever negative.
# $$ \hat{\mathbf{y}} = \mathrm{softmax}(\mathbf{o}) \text{ where } \hat{y}_i = \frac{\exp(o_i)}{\sum_j \exp(o_j)} $$
#   - We can still find the most likely class by
# $$ \hat{\imath}(\mathbf{o}) = \operatorname*{argmax}_i o_i = \operatorname*{argmax}_i \hat y_i $$
#   - Summarizing it all in vector notation
# $$ {\hat{\mathbf{y}}}^{(i)} = \mathrm{softmax}({\mathbf{o}}^{(i)}) = \mathrm{softmax}(\mathbf{W} {\mathbf{x}}^{(i)} + {\mathbf{b}}) $$
#
# - Vectorization for Minibatches
#   - Assume that we are given a mini-batch $\mathbf{X}$ of examples with dimensionality $d$ and batch size $n$.
#   - Assume that we have $q$ categories (outputs).
#   - Then the minibatch features $\mathbf{X}$ are in $\mathbb{R}^{n \times d}$, the weights satisfy $\mathbf{W} \in \mathbb{R}^{d \times q}$, and the bias satisfies $\mathbf{b} \in \mathbb{R}^q$.
# $$ \begin{aligned} \mathbf{O} &= \mathbf{X} \mathbf{W} + \mathbf{b} \\ \hat{\mathbf{Y}} &= \mathrm{softmax}(\mathbf{O}) \end{aligned} $$
#

# ## 3.4.2 Loss Function
# - Log-Likelihood
#   - The softmax function allows us to compare the estimates with reality, simply by checking how well it predicted what we observed.
# $$ p(Y|X) = \prod_{i=1}^n p(y^{(i)}|x^{(i)}) = \prod_{i=1}^n \prod_{j} ({\hat{y}_j}^{(i)})^{y_{j}^{(i)}} $$
#   - and thus the loss function $L$ (***negative log-likelihood***)
# $$ L = -\log p(Y|X) = \sum_{i=1}^n -\log p(y^{(i)}|x^{(i)}) = - \sum_{i=1}^n \sum_{j} y_{j}^{(i)} \log {\hat{y}_j}^{(i)} $$
#   - [NOTE] $\log {\hat{y}_j}^{(i)} \leq 0$, so that $L \geq 0$.
#   - $L$ is minimized if we correctly predict $y$ with ***certainty***, i.e. ${\hat{y}_j}^{(i)} = 1$ for the correct label.
#
# - Softmax and Derivatives
#   - we drop the superscript $(i)$ to avoid notation clutter

# $$ l = -\sum_j y_j \log \hat{y}_j = -\sum_j y_j \log \frac{\exp(o_j)}{\sum_k \exp(o_k)} = \sum_j y_j \log \sum_k \exp(o_k) - \sum_j y_j o_j = \log \sum_k \exp(o_k) - \sum_j y_j o_j $$
# - consider the loss function's derivative with respect to $o_j$

# $$ \partial_{o_j} l = \frac{\exp(o_j)}{\sum_k \exp(o_k)} - y_j = \mathrm{softmax}(\mathbf{o})_j - y_j = \hat{y}_j - y_j $$
# - The gradient is the difference between the observation $y$ and the estimate $\hat{y}$, very similar to what we saw in regression.
# - This fact makes computing gradients a lot easier in practice.
#
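# - As a quick numerical check (not part of the original notes; the logits and label below are made up), `autograd` confirms that the gradient of the loss with respect to $\mathbf{o}$ equals $\mathrm{softmax}(\mathbf{o}) - \mathbf{y}$:

# In[ ]:


from mxnet import autograd, nd

o = nd.array([1.0, -2.0, 0.5])  # made-up logits for one example
y = nd.array([0.0, 1.0, 0.0])   # made-up one-hot label

o.attach_grad()
with autograd.record():
    y_hat = nd.softmax(o)
    l = -nd.sum(y * nd.log(y_hat))  # the loss l defined above for a single example
l.backward()

print(o.grad)             # gradient computed by autograd
print(nd.softmax(o) - y)  # softmax(o) - y, which should match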
# - Cross-Entropy Loss
# $$ l(\mathbf{y}, \hat{\mathbf{y}}) = - \sum_j y_j \log \hat{y}_j $$
#   - It is one of the most commonly used loss functions for multiclass classification.

# ## 3.4.3 Information Theory Basics
# - Entropy
# - Kullback-Leibler Divergence

# ## 3.4.4 Model Prediction and Evaluation
# - Normally, we use the category with the highest predicted probability as the output category.
# - We use accuracy to evaluate the model's performance.
#   - the ratio between the number of correct predictions and the total number of predictions
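# - As a minimal sketch (not part of the original notes; the probabilities and labels are made up), accuracy can be computed by comparing the most likely predicted category with the true category index:

# In[ ]:


from mxnet import nd

# Hypothetical predicted probabilities for 3 examples over 3 categories, and the true category indices.
y_hat = nd.array([[0.1, 0.3, 0.6],
                  [0.3, 0.2, 0.5],
                  [0.8, 0.1, 0.1]])
y = nd.array([2, 1, 0])

predictions = y_hat.argmax(axis=1)               # category with the highest predicted probability
accuracy = (predictions == y).mean().asscalar()  # fraction of correct predictions
print(accuracy)  # 2 of the 3 predictions are correct here, so about 0.667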