$\newcommand{\xv}{\mathbf{x}} \newcommand{\Xv}{\mathbf{X}} \newcommand{\yv}{\mathbf{y}} \newcommand{\zv}{\mathbf{z}} \newcommand{\av}{\mathbf{a}} \newcommand{\Wv}{\mathbf{W}} \newcommand{\wv}{\mathbf{w}} \newcommand{\tv}{\mathbf{t}} \newcommand{\Tv}{\mathbf{T}} \newcommand{\muv}{\boldsymbol{\mu}} \newcommand{\sigmav}{\boldsymbol{\sigma}} \newcommand{\phiv}{\boldsymbol{\phi}} \newcommand{\Phiv}{\boldsymbol{\Phi}} \newcommand{\Sigmav}{\boldsymbol{\Sigma}} \newcommand{\Lambdav}{\boldsymbol{\Lambda}} \newcommand{\half}{\frac{1}{2}} \newcommand{\argmax}[1]{\underset{#1}{\operatorname{argmax}}} \newcommand{\argmin}[1]{\underset{#1}{\operatorname{argmin}}} \newcommand{\count}[2]{\underset{#1}{\overset{#2}{\operatorname{\#}}}} $

Sample-by-Sample Linear Regression

Also referred to as sequential, on-line, or stochastic gradient descent training.

Remember how we started deriving the expression for the weights that minimized the sum of squared errors of a linear model?

With $g$ being an affine (linear + constant) function of $x$,

$$ g(\xv;\wv) = w_0 + w_1 x_1 + w_2 x_2 + \cdots + w_D x_D = \xv^T \wv $$

having parameters $\wv = (w_0, w_1, w_2, \ldots, w_D)$ and with a constant 1 prepended to $\xv$ so that $\xv^T \wv$ includes $w_0$, we derived the solution to

$$ \begin{align*} \wv_{\mbox{best}} &= \argmin{\wv} \sum_{n=1}^N (t_n - g(\xv_n , \wv))^2\\ & = \argmin{\wv} \sum_{n=1}^N (t_n - \xv_n^T \wv)^2 \end{align*} $$

We did this by rewriting the above summation as a matrix expression, taking its derivative with respect to $\wv$, setting the derivative equal to zero, and solving for $\wv$.

$$ \wv = (\Xv^T \Xv)^{-1} \Xv^T \Tv $$
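As a quick check of this closed-form solution, here is a minimal sketch on synthetic data. The data, the seed, and the variable names are assumptions made only for illustration. `np.linalg.lstsq` is shown next to the explicit formula because it solves the same least-squares problem without forming an explicit inverse.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, chosen only for reproducibility

# Hypothetical data: N samples, one input, targets from a noisy line t = 2 - 0.1 x
N = 50
X = rng.uniform(0, 10, (N, 1))
T = 2 - 0.1 * X + rng.normal(0, 0.1, (N, 1))

# Prepend the constant-1 column so that w_0 acts as the intercept
X1 = np.insert(X, 0, 1, axis=1)

# Normal equations, written as in the formula above
w_normal = np.linalg.inv(X1.T @ X1) @ X1.T @ T

# Same least-squares problem solved without an explicit inverse
w_lstsq, *_ = np.linalg.lstsq(X1, T, rcond=None)

print(w_normal.ravel(), w_lstsq.ravel())
```

Both approaches need all of the samples in memory at once, which is the cost the sequential algorithm avoids.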

But what if you have thousands or millions of samples? $\Xv$ and $\Tv$ can be quite large. To avoid dealing with matrix operations on huge matrices, we can derive a sequential algorithm for finding $\wv$ by using the fact that the derivative of a sum is the sum of the derivatives. We will now express this derivative as a gradient, which is a vector or matrix of derivatives.

$$ \begin{align*} g(\xv_n, \wv) &= w_0 + w_1 x_{n,1} + w_2 x_{n,2} + \cdots + w_D x_{n,D} = \xv_n^T \wv\\ E(\Xv, \Tv, \wv) &= \sum_{n=1}^N (t_n - g(\xv_n, \wv))^2\\ \nabla_\wv E(\Xv, \Tv, \wv) &= \nabla_\wv \left ( \sum_{n=1}^N (t_n - g(\xv_n, \wv))^2 \right )\\ &= \sum_{n=1}^N \nabla_\wv (t_n - g(\xv_n, \wv))^2\\ &= \sum_{n=1}^N 2 (t_n - g(\xv_n, \wv)) \nabla_\wv (t_n - g(\xv_n, \wv)) \\ &= \sum_{n=1}^N 2 (t_n - g(\xv_n, \wv)) (-1) \nabla_\wv g(\xv_n, \wv) \\ &= \sum_{n=1}^N 2 (t_n - g(\xv_n, \wv)) (-1) \nabla_\wv (\xv_n^T \wv) \\ &= \sum_{n=1}^N 2 (t_n - g(\xv_n, \wv)) (-1) \xv_n \\ &= -2 \sum_{n=1}^N (t_n - g(\xv_n, \wv)) \xv_n \\ \end{align*} $$
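The final expression for the gradient can be checked numerically. The sketch below compares it with central finite differences on random data. The function names `error` and `gradient`, the data sizes, and the seed are assumptions for illustration only.

```python
import numpy as np

def error(X1, T, w):
    """Sum of squared errors E(X, T, w) for a single-output linear model."""
    return np.sum((T - X1 @ w) ** 2)

def gradient(X1, T, w):
    """Analytic gradient from the derivation: -2 * sum_n (t_n - x_n^T w) x_n."""
    return -2 * X1.T @ (T - X1 @ w)

rng = np.random.default_rng(1)
X1 = np.insert(rng.uniform(0, 10, (20, 2)), 0, 1, axis=1)  # 20 samples, D = 2
T = rng.normal(size=(20, 1))
w = rng.normal(size=(3, 1))

# Central finite differences, one weight at a time
eps = 1e-6
numeric = np.zeros_like(w)
for i in range(w.shape[0]):
    step = np.zeros_like(w)
    step[i] = eps
    numeric[i] = (error(X1, T, w + step) - error(X1, T, w - step)) / (2 * eps)

analytic = gradient(X1, T, w)
print(np.max(np.abs(numeric - analytic)))
```

The printed difference should be tiny relative to the size of the gradient, since $E$ is quadratic in $\wv$ and central differences are exact for quadratics up to rounding error.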

Instead of summing over all $N$ samples, what if we just update $\wv$ after each sample based on the gradient of $E$ for that sample? The gradient for a sample $n$ can be considered as a limited, or noisy, sample of the true gradient. Thus, we can take a small step in the direction of the negative gradient to try to bring a current guess at the weight vector, $\wv^{(k)}$, on iteration $k$ to a new value, $\wv^{(k+1)}$, on iteration $k+1$ that is closer to a value that reduces the overall error. This kind of update is called "stochastic approximation".

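$$ \begin{align*} \wv^{(k+1)} &= \wv^{(k)} - \rho \, (-2) (t_n - g(\xv_n, \wv^{(k)})) \xv_n\\ &= \wv^{(k)} + \rho (t_n - g(\xv_n, \wv^{(k)})) \xv_n \end{align*} $$

Here $\rho$ is a small positive step size, or learning rate. The constant factor of 2 has been absorbed into $\rho$ in the second line.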

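For this sequential algorithm to converge, $\rho$ must decrease with each iteration, not too fast but not too slow. The standard stochastic approximation conditions make this precise: with $\rho_k$ the step size on iteration $k$, we need $\sum_k \rho_k = \infty$ and $\sum_k \rho_k^2 < \infty$, which holds, for example, when $\rho_k$ is proportional to $1/k$.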
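One simple way to shrink the step size is a schedule that decays roughly like $1/k$. The sketch below applies the sample-by-sample update with such a schedule and compares the result with the least-squares solution. The schedule form, the parameters `rho0` and `tau`, the data, and the number of epochs are all assumptions chosen for illustration, not tuned values.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical data from a noisy line t = 2 - 0.1 x
N = 200
X = rng.uniform(0, 10, (N, 1))
T = 2 - 0.1 * X + rng.normal(0, 0.1, (N, 1))
X1 = np.insert(X, 0, 1, axis=1)

w = np.zeros((2, 1))
rho0, tau = 0.01, 2000.0   # assumed schedule parameters, not tuned
k = 0
for epoch in range(200):
    for n in rng.permutation(N):          # visit samples in random order
        rho = rho0 / (1 + k / tau)        # step size shrinks roughly like 1/k
        x_n = X1[n:n+1, :].T              # (2, 1) column vector
        error_n = T[n, 0] - (x_n.T @ w)[0, 0]
        w += rho * error_n * x_n          # update using sample n only
        k += 1

# Least-squares solution on the same data, for comparison
w_ls, *_ = np.linalg.lstsq(X1, T, rcond=None)
print(w.ravel(), w_ls.ravel())
```

Only one sample is touched per update, so the full data matrix never enters a matrix product during training.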

This algorithm is called the least mean squares (LMS) algorithm, developed by Widrow and Hoff. It is now often referred to as the "stochastic gradient descent" algorithm, or SGD.

If we have two output variables, like mpg and horsepower, then $t_n$ is no longer a scalar. How do we deal with that? Well, to predict two variables, we need two linear models. We can do this by changing $\wv$ from a single column matrix to a two-column matrix. The first column could contain the weights used to predict mpg, and the second column could contain weights to predict horsepower. Now our linear model is

$$ g(\xv_n, \wv) = \xv_n^T \wv$$

Hmm, no change here! This is the beauty of using matrix math. The input vector $\xv_n$ is dotted with each of the two columns of $\wv$, resulting in two values, or a two-component vector, giving the predictions for mpg and horsepower.

What changes do we need to make to the SGD update formula? What else must we modify, other than $\wv$? For each sample, $n$, we must specify two target values, for mpg and horsepower. So $t_n$ is no longer a scalar, but now has two values in a vector, or $\tv_n$. To update the weights $\wv$ we must multiply each error by each input component. This does sound like a double loop. Well, in the last equation above we already used matrix math and numpy broadcasting once in

$$ \begin{align*} \wv^{(k+1)} &= \wv^{(k)} + \rho \; (t_n - g(\xv_n, \wv)) \; \xv_n \end{align*} $$

to remove the loop over all of the components in $\wv$ and $\xv_n$. Now we will use broadcasting again to remove a loop over target components, in $\tv_n$. We must take care to make sure the matrices are of the right shape in the matrix operations, and that the resulting matrix is the correct shape for $\wv$. Here we follow the convention that vectors are column vectors.

$$ \begin{align*} \wv^{(k+1)} &= \wv^{(k)} + \rho \; \xv_n \; (\tv_n^T - g(\xv_n, \wv)) \end{align*} $$

Let's see, $\rho$ is a scalar, $\xv_n$ is $(D+1)\times 1$, a column vector with $D+1$ components (counting the constant 1), $\tv_n$ is $K\times 1$ if we have $K$ outputs, so $\tv_n^T$ is $1\times K$, and $g(\xv_n, \wv) = \xv_n^T \wv$ is also $1\times K$. Stringing these dimensions together in the calculation gives us $((D+1)\times 1) (1\times K)$, which results in $(D+1)\times K$, exactly the correct shape for our weight matrix $\wv$!
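The shape bookkeeping can be confirmed directly in numpy. This is a small sketch with assumed sizes ($D = 3$ inputs, $K = 2$ outputs) and placeholder values. Only the shapes matter here.

```python
import numpy as np

D, K = 3, 2                      # assumed sizes: 3 inputs, 2 outputs
rho = 0.01

x_n = np.ones((D + 1, 1))        # column vector, constant 1 in the first row
t_n = np.zeros((K, 1))           # column vector of K targets
w = np.zeros((D + 1, K))         # one column of weights per output

prediction = x_n.T @ w           # (1, D+1) @ (D+1, K) -> (1, K)
error = t_n.T - prediction       # (1, K)
update = rho * x_n @ error       # (D+1, 1) @ (1, K) -> (D+1, K)

print(prediction.shape, error.shape, update.shape, w.shape)
```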

In Python, the update to the weight matrix for the $n^{th}$ sample is just

 w += rho * X1[n:n+1, :].T * (T[n:n+1, :] - predicted)

The long, boring, non-matrix way to update each element of w would look like

 nOutputs = T.shape[1]
 nInputs = X1.shape[1]
 for k in range(nOutputs):
     for i in range(nInputs):
         w[i, k] += rho * X1[n, i] * (T[n, k] - predicted[0, k])

So many lines of code can lead to more bugs!!
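To see that the two forms compute the same thing, the following sketch applies both to the same random data and compares the results. The sizes, the seed, and the names `w0`, `w_vec`, and `w_loop` are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
N, D, K = 5, 3, 2                              # assumed sizes for illustration
X1 = np.insert(rng.normal(size=(N, D)), 0, 1, axis=1)
T = rng.normal(size=(N, K))
w0 = rng.normal(size=(D + 1, K))
rho, n = 0.01, 2                               # update using sample n = 2

predicted = X1[n:n+1, :] @ w0                  # shape (1, K)

# Broadcast version: (D+1, 1) * (1, K) -> (D+1, K)
w_vec = w0 + rho * X1[n:n+1, :].T * (T[n:n+1, :] - predicted)

# Element-by-element version
w_loop = w0.copy()
for k in range(K):
    for i in range(D + 1):
        w_loop[i, k] += rho * X1[n, i] * (T[n, k] - predicted[0, k])

print(np.allclose(w_vec, w_loop))
```

Note that `predicted` is computed once, before either update, so both versions use the same prediction for every element of the weight matrix.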

Let's animate the progress down the error function, following the negative gradient.

In [ ]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from IPython.display import display, clear_output
In [ ]:
n_samples = 100
X = np.random.uniform(0, 10, (n_samples, 1))
T = 2 - 0.1 * X + np.random.normal(0, 0.1, (n_samples,1)) + 0.05 * (X - 6)**2 # Change 0.05 to 0 for a purely linear cloud; 0.05 makes the cloud nonlinear

X1 = np.insert(X, 0, 1, axis=1)

X1.shape, T.shape
In [ ]:
def run(rho, n_epochs, stepsPerFrame=10):

    # Initialize weights to all zeros
    # For this demonstration, we will have one variable input. With the constant 1 input, we have 2 weights.
    w = np.zeros((2,1))

    # Collect the weights after each update in a list for later plotting. 
    # This is not part of the training algorithm
    ws = [w.copy()]

    # Create a bunch of x values for plotting
    xs = np.linspace(0, 10, 100).reshape((-1,1))
    xs1 = np.insert(xs, 0, 1, axis=1)

    fig = plt.figure(figsize=(8, 8))

    # For each pass (one epoch) through all samples ...
    for epoch in range(n_epochs):
        # For each sample ...
        for n in range(n_samples):
            # Calculate prediction using current model, w.
            #    n:n+1 is used instead of n to preserve the 2-dimensional matrix structure
            predicted = X1[n:n+1,:] @ w
            # Update w using negative gradient of error for nth sample
            w += rho * X1[n:n+1, :].T * (T[n:n+1, :] - predicted)
            # Add new w to our list of past w values for plotting
            ws.append(w.copy())

            if n % stepsPerFrame == 0:
                # Clear the previous frame before redrawing
                fig.clf()

                # Plot the X and T data.
                plt.subplot(2, 1, 1)
                plt.plot(X, T, 'o', alpha=0.6, label='Data')
                plt.plot(X[n,0], T[n], 'ko', ms=10, label='Last Trained Sample')

                # Plot the output of our linear model for a range of x values
                plt.plot(xs, xs1 @ w, 'r-', linewidth=5, label='Model')
                plt.legend(loc='upper right')
                plt.xlim(0, 10)
                plt.ylim(0, 5)

                # In second panel plot the weights versus the number of updates
                plt.subplot(2, 1, 2)
                plt.plot(np.array(ws)[:, :, 0])
                plt.xlim(0, n_epochs * n_samples)
                plt.ylim(-1, 3)
                plt.legend(('$w_0$', '$w_1$'))

                # Replace the displayed figure with the updated frame
                clear_output(wait=True)
                display(fig)

    clear_output(wait=True)
    return w
In [ ]:
run(0.01, n_epochs=1, stepsPerFrame=1)
In [ ]:
run(0.01, n_epochs=20, stepsPerFrame=10)