Machine Learning

2. Linear Regression

Regression$\renewcommand{\vec}[1]{\boldsymbol{#1}}$

  • We are already familiar with regression problems.
  • We have a dataset $\Psi=\left\{\left<\vec{x}^{(1)},\vec{y}^{(1)}\right>,\ldots,\left<\vec{x}^{(i)},\vec{y}^{(i)}\right>,\ldots,\left<\vec{x}^{(m)},\vec{y}^{(m)}\right>\right\}$ of noisy samples drawn from an unknown function $\vec{F}:\mathcal{D}\rightarrow\mathcal{I}$.
  • In the regression case $\mathcal{D}\subseteq\mathbb{R}^n$ and $\mathcal{I}\subseteq\mathbb{R}^k$.

We can assume that $k=1$, that the noise is Gaussian, and that $\Psi$ is finite.

Why we use the Gaussian noise assumption

We will assume that the target variable is described by

$$y = \vec{F}(\vec{x},\vec{w}) + \varepsilon$$

where:

  • $\vec{F}(\vec{x},\vec{w})$ is a (possibly undefined and/or unknown) function of $\vec{x}$ and $\vec{w}$ and
  • $\varepsilon\sim\mathcal{N}(0,\sigma^2_\varepsilon)$ is a Gaussian-distributed noise component.
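To make this noise model concrete, here is a minimal generative sketch (not part of the original notebook; f_true, w_true and sigma_eps are illustrative assumptions):

import numpy as np

def f_true(x, w):
    # hypothetical underlying function F(x, w): here simply a linear map
    return x.dot(w)

w_true = np.array([-2.0, 9.0])                  # assumed "true" weights
sigma_eps = 1.0                                 # assumed noise standard deviation

x_demo = np.random.uniform(0.0, 4.0, (20, 2))   # 20 random 2-D inputs
eps = sigma_eps*np.random.randn(20)             # Gaussian noise ~ N(0, sigma_eps^2)
y_demo = f_true(x_demo, w_true) + eps           # noisy targets: y = F(x, w) + eps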

Gaussian Noise?

The derivations provided below all assume Gaussian noise on the target data. Is this a good assumption? In many cases, yes. The argument hinges on the Central Limit Theorem, which basically says that the sum of many independent random variables behaves like a Gaussian-distributed random variable.

The noise term in this model, $\varepsilon$, can be thought of as the sum of the effects of features not included in the model function, $\vec{F}(\vec{x},\vec{w})$. If these unaccounted-for features are themselves independent random variables and there are many of them, the Central Limit Theorem suggests that a Gaussian noise model is appropriate.

It is possible that there are only a few unaccounted-for features, or that there is genuinely non-Gaussian noise in our measurements, e.g. sensor shot noise, which often follows a Poisson distribution. In such cases, the Gaussian assumption is no longer valid.
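To illustrate the Central Limit Theorem argument, here is a small simulation sketch (not part of the original example; the number of summands n_unmodelled and the uniform distribution are arbitrary choices): the sum of many independent, clearly non-Gaussian variables already looks approximately Gaussian.

import numpy as np
import matplotlib.pyplot as plt

n_unmodelled = 50                         # assumed number of unaccounted-for features
sums = np.random.uniform(-1, 1, (10000, n_unmodelled)).sum(axis=1)

plt.hist(sums, bins=50, density=True)     # bell-shaped despite uniform summands
plt.title('Sum of %d independent uniform variables' % n_unmodelled)
plt.xlabel('sum'); plt.ylabel('density');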

An Example

Training dataset

In [4]:
N = 5                                              # Size of the data set
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0]])  # Inputs, shape: N x 1
y = np.array([10.5, 5.0, 3.0, 2.5, 1.0])           # Outputs, shape: N
In [5]:
X.T
Out[5]:
array([[ 0.,  1.,  2.,  3.,  4.]])
In [6]:
y
Out[6]:
array([ 10.5,   5. ,   3. ,   2.5,   1. ])

Test dataset

In [7]:
N_test = 100
X_test = np.linspace(0.0, 4.0, N_test).reshape((N_test, 1))
y_test = np.linspace(7.0, -5.0, N_test)+ 2*np.random.randn(N_test)

Plotting the training and test datasets

In [8]:
plt.plot(X[:, 0], y, 'bo', label='Training')
plt.plot(X_test[:, 0], y_test, "g.", label='Test')
plt.xlabel('$X$'); plt.ylabel('$y$'); plt.legend(frameon=True);

Linear Regression Model

  • We can write our constraints down as a linear model:
$$ y^{(i)} + \varepsilon^{(i)} = \vec{x}^{(i)}\cdot \vec{w} + b, \forall i \in \{1,\ldots,m\}\,, $$

where $\vec{w}$ is a weight vector, $b$ is a bias term, and $\varepsilon^{(i)} \sim \mathcal{N}(0, \sigma^2_\varepsilon)$.

  • Weights are parameters of the linear model that can be tuned to fit the training data better.

A shorter version of this expression is the following matrix equation (the bias $b$ is absorbed into $\vec{w}$ as an extra weight via the bias trick shown below):

$$\vec{y} + \vec{\varepsilon} = \vec{X} \cdot \vec{w},\ \vec{X} \in \mathbb{R}^{m \times n},\ \vec{y} \in \mathbb{R}^m,\ \vec{\varepsilon} \in \mathbb{R}^m,\ \vec{w} \in \mathbb{R}^n,$$

where the $i$-th row of $\vec{X}$ is $\vec{x}^{(i)}$ and the $j$-th entry of the vector $\vec{y}$ is $y^{(j)}$.

  • It might not be possible to find a weight vector that satisfies the constraints perfectly.
  • Instead, we can minimize the sum of squared errors (SSE):
$$\hat{\vec{w}} = \text{arg}\,\text{min}_\vec{w} \dfrac{1}{2}\left\|\vec{X}\vec{w} - \vec{y}\right\|^2_2,$$

where $\left\|\cdot\right\|_2$ denotes the Euclidean norm (its generalization to matrices is called the Frobenius norm). The sketch below checks this objective numerically on the toy data.
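As a quick numerical check (a sketch, not one of the original notebook cells; w_candidate is an arbitrary weight vector), the matrix form of the SSE agrees with the per-example sum on the toy X and y defined above:

w_candidate = np.array([-2.0])                       # arbitrary candidate weight
sse_matrix = 0.5*np.linalg.norm(X.dot(w_candidate) - y)**2
sse_sum = 0.5*sum((X[i].dot(w_candidate) - y[i])**2 for i in range(N))
print(sse_matrix, sse_sum)                           # both print the same value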

Learning Weights

There are two ways to adjust the weights of a linear model (which includes linear models with nonlinear features):

 Normal equations (analytical solution)

  • Requires the inversion of an $n \times n$ matrix ($\vec{X}^\intercal\vec{X}$).
  • It does not scale well with the number of features.

Gradient descent (iterative solution)

  • Requires computing the gradient over all $m$ examples in each step.
  • It does not scale well with the number of examples.

In addition, we can add a constraint to the objective function to penalize large weights: Tikhonov regularization (forward pointer).
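As a sketch of that forward pointer (not used elsewhere in this notebook; the function name ridge_normal_equations and the penalty weight lam are illustrative), the Tikhonov-regularized normal equations add a penalty term $\lambda\vec{I}$ before inverting:

def ridge_normal_equations(X, y, lam):
    'Tikhonov/ridge solution: w = (X^T X + lam*I)^(-1) X^T y.'
    n = X.shape[1]
    return np.linalg.solve(X.T.dot(X) + lam*np.eye(n), X.T.dot(y))

# e.g. ridge_normal_equations(X_bias, y, 0.1), once X_bias is defined below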

Normal Equations

Solving $\vec{X} \cdot \vec{w} = \vec{y}$ means solving $$ \left\{ \begin{array}{rcrcl} x^{(1)}_1 w_1 & + x^{(1)}_2 w_2 & + \cdots & + x^{(1)}_n w_n & = y^{(1)} \\ x^{(2)}_1 w_1 & + x^{(2)}_2 w_2 & + \cdots & + x^{(2)}_n w_n & = y^{(2)} \\ x^{(3)}_1 w_1 & + x^{(3)}_2 w_2 & + \cdots & + x^{(3)}_n w_n & = y^{(3)} \\ \vdots & \vdots & \ddots & \vdots & = \vdots \\ x^{(m)}_1 w_1 & + x^{(m)}_2 w_2 & + \cdots & + x^{(m)}_n w_n & = y^{(m)} \end{array} \right.\,. $$

  • Frequently, solving the equation $\vec{X} \cdot \vec{w} = \vec{y}$ directly for $\vec{w}$ by inversion is not viable.
  • $\vec{X}$ is not necessarily a square matrix, so $\vec{w} = \vec{X}^{-1} \vec{y}$ usually cannot be computed.
  • This implies that it is usually impossible to find an exact solution.
  • Such systems, with more equations than unknowns ($m > n$), are overdetermined.

 The linear least squares solution

  • Linear least squares is the problem of approximately solving an overdetermined system of linear equations.
  • The best approximation is defined as that which minimizes the sum of squared differences between the data values and their corresponding modeled values.

That is, the solution of

$$\vec{X}^\intercal\vec{X} \hat{\boldsymbol{w}} = \vec{X}^\intercal\vec{y}$$

for $\hat{\boldsymbol{w}}$, i.e.,

$$\hat{\vec{w}} = (\vec{X}^\intercal\vec{X})^{-1}\vec{X}^\intercal\vec{y}.$$
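For reference, the normal equations follow directly from setting the gradient of the SSE objective (the same gradient derived in the Gradient of the Linear Model section below) to zero:

$$\nabla_{\vec{w}}\, \tfrac{1}{2}\left\|\vec{X}\vec{w} - \vec{y}\right\|^2_2 = \vec{X}^\intercal(\vec{X}\vec{w} - \vec{y}) = \vec{0} \quad\Longrightarrow\quad \vec{X}^\intercal\vec{X}\,\hat{\vec{w}} = \vec{X}^\intercal\vec{y}\,.$$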
In [9]:
def normal_equations(X, y):
    # Direct solution: w = np.linalg.inv(X.T.dot(X)).dot(X.T).dot(y)
    # ... or we can use the Moore-Penrose pseudoinverse, which is usually
    # more numerically stable:
    w = np.linalg.pinv(X).dot(y)
    # ... or we can use the least-squares solver:
    # w = np.linalg.lstsq(X, y, rcond=None)[0]
    return w

The bias trick

Add a constant bias feature (a column of ones) to each row/instance, so that the bias term $b$ becomes an extra weight:

In [10]:
X_bias = np.hstack((X, np.ones((N, 1))))
X_test_bias = np.hstack((X_test, np.ones((N_test, 1))))
In [11]:
X.T
Out[11]:
array([[ 0.,  1.,  2.,  3.,  4.]])
In [12]:
X_bias.T
Out[12]:
array([[ 0.,  1.,  2.,  3.,  4.],
       [ 1.,  1.,  1.,  1.,  1.]])
In [13]:
w_norm = normal_equations(X_bias, y)
In [14]:
plt.plot(X_bias[:, 0], y, "o")
plt.plot(X_bias[:, 0], X_bias.dot(w_norm), "m-", linewidth=2)
plt.xlabel("x"); plt.ylabel("y"); plt.title("Model: $y = %.2f x + %.2f$" % tuple(w_norm));

We can now solve linear regression problems at will!

Homework 1

Test the other forms of the normal equations (the commented-out alternatives in normal_equations above); a possible scaffold is sketched after this list.

  • Check the numerical stability of each approach and
  • how long each takes to compute.
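One possible way to set up the comparison (a sketch only, not a complete solution): wrap each variant in a small timing helper and compare residuals; the helper names and the random problem size below are arbitrary choices.

import time

def timed(solver, X, y):
    'Runs one solver and returns (weights, elapsed seconds).'
    t0 = time.perf_counter()
    w = solver(X, y)
    return w, time.perf_counter() - t0

solvers = {
    'inv':   lambda X, y: np.linalg.inv(X.T.dot(X)).dot(X.T).dot(y),
    'pinv':  lambda X, y: np.linalg.pinv(X).dot(y),
    'lstsq': lambda X, y: np.linalg.lstsq(X, y, rcond=None)[0],
}

X_rand = np.random.randn(2000, 200)     # arbitrary problem size
y_rand = np.random.randn(2000)

for name, solver in solvers.items():
    w_s, elapsed = timed(solver, X_rand, y_rand)
    residual = np.linalg.norm(X_rand.dot(w_s) - y_rand)
    print('%-6s %.4f s   residual: %.4f' % (name, elapsed, residual))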

Computational cost and complexity

Think of this problem:

  • inputs: 100 color images of size 256x256 pixels
  • output: radius of the ball which is at the center of the image
  • Is the direct solution via normal equations viable?

Estimating problem tractability

Assuming that each pixel is represented by three bytes defining the RGB values:

In [15]:
n_pixels = 256 * 256
memory_size_xtx = n_pixels * n_pixels * 3 * 8
print("Required memory: %d GB" % (memory_size_xtx / 2**30))
Required memory: 96 GB

Matrix inversion is a procedure that requires $O(n^{2.373})$ operations (using fast matrix-multiplication algorithms; the textbook Gauss-Jordan approach is $O(n^3)$).
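As a rough back-of-the-envelope addition (treating every RGB channel value as a feature, which is an assumption of this sketch), the operation count for inverting the corresponding $n \times n$ Gram matrix would be:

n_features = 256 * 256 * 3             # one feature per RGB channel value
ops = n_features ** 2.373              # optimistic theoretical bound quoted above
print('Roughly %.1e operations' % ops)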

Gradient Descent

In cases where the training dataset is very large or the data arrive in a stream, a direct solution using the normal equations may not be possible. An alternative approach is gradient descent (in its batch form below, or as stochastic gradient descent for streaming data).

Iteratively update the weights, using the gradient as a reference.

Each component $\Delta w_i$ of the update $\Delta \vec{w}(t)$ is minus the scaled derivative of the error function $E(\vec{X}, \vec{y};\vec{w}(t))$ with respect to the weight $w_i$:

$$ \Delta w_i = - \alpha\frac{\partial E(\vec{X}, \vec{y};\vec{w})}{\partial w_i}\,, $$

where:

  • $\alpha$ is a hyperparameter called the learning rate. It has to be set manually, usually lies within $\left(0, 1\right)$, and typical values are $10^{-1}$, $10^{-2}$, $10^{-3}$, ...

Update $\vec{w}$ incrementally in every iteration $t$ as:

$$ \vec{w}(t+1) = \vec{w}(t) + \Delta \vec{w}(t)\,, $$

Forward pointer

  • Gradient descent actually is a more general rule.
  • It can be applied to any minimization problem if the gradient can be computed.
  • For example, artificial neural networks, support vector machines, $k$-means, etc.
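As an illustration of this forward pointer (a sketch with an arbitrarily chosen example function, not part of the regression material), the same update rule minimizes a simple quadratic $f(\vec{w}) = (w_1-3)^2 + (w_2+1)^2$ once its gradient is supplied:

def f_grad(w):
    # gradient of f(w) = (w[0]-3)^2 + (w[1]+1)^2
    return np.array([2*(w[0] - 3), 2*(w[1] + 1)])

w_f = np.zeros(2)
for _ in range(200):
    w_f = w_f - 0.1*f_grad(w_f)        # same update rule, different gradient
print(w_f)                             # approaches the minimizer (3, -1)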

Gradient of the Linear Model

Using the sum of squared errors (SSE) as the error function,

$$E(\vec{X}, \vec{y};\vec{w}) = \frac{1}{2}\left\|\vec{X} \cdot \vec{w} - \vec{y}\right\|^2_2\,.$$

The gradient becomes

$$\nabla \vec{w} = \nabla_{\vec{w}} E(\vec{X}, \vec{y};\vec{w}) = \vec{X}^\intercal \cdot (\vec{X} \cdot \vec{w} - \vec{y})\,.$$

Implementing gradient descent

In [16]:
def error(X, y, w):
    return 0.5*np.linalg.norm(X.dot(w) - y)**2
In [17]:
def linear_regression_gradient(X, y, w):
    return X.T.dot(X.dot(w)-y)
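A quick sanity check (a sketch added here, with an arbitrary test point w_check and step size h) that the analytical gradient matches a central finite-difference approximation of error:

w_check = np.array([1.0, -1.0])        # arbitrary test point
h = 1e-6
numeric = np.zeros_like(w_check)
for j in range(len(w_check)):
    e = np.zeros_like(w_check)
    e[j] = h
    numeric[j] = (error(X_bias, y, w_check + e) - error(X_bias, y, w_check - e)) / (2*h)
print(numeric)
print(linear_regression_gradient(X_bias, y, w_check))   # should be nearly identical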

The learning loop

Initialize weights to small random values

In [18]:
w = np.random.random(2)/10000
In [19]:
alpha = 0.05
In [20]:
for i in range(50):
    # print('Iteration', i+1, 'weights', w, 'error', error(X_bias, y, w))
    w = w - alpha*linear_regression_gradient(X_bias, y, w)
print('w_norm:', w_norm, 'w grad. desc.', w)
w_norm: [-2.15  8.7 ] w grad. desc. [-2.089  8.526]

Visualizing gradient descent

Implementing the gradient descent loop as a function.

In [22]:
def gradient_descent(X, y, w_0, alpha, max_iters):
    'Returns the values of the weights as learning took place.'
    w = np.array(w_0, dtype=np.float64)
    w_hist = np.zeros(shape=(max_iters+1, w.shape[0]))
    w_hist[0] = w
    for i in range(max_iters):
        # use the X passed as an argument, not the global X_bias
        delta_weights = -alpha*linear_regression_gradient(X, y, w)
        w += delta_weights
        w_hist[i+1] = w
    return w_hist
In [23]:
w_0 = [-3,2] # we fix the initial weights to make it more illustrative
alpha = 0.05
max_iters = 25
In [24]:
w_hist = gradient_descent(X_bias, y, w_0, alpha, max_iters)
In [26]:
plot_hist_contour(X_bias, y, w_hist, w_norm, title='end='+str(w_hist[-1]), show_legend=True)

What is the impact of the learning rate ($\alpha$)?

In [27]:
def alphas_study(alphas):
    fig = plt.figure(figsize=(11,7))
    for i,alpha in enumerate(alphas):
        ax = fig.add_subplot(2,3,i+1)
        w_hist = gradient_descent(X_bias, y , w_0, alpha, max_iters)
        plot_hist_contour(X_bias, y, w_hist, w_norm, ax=ax, title='$\\alpha='+str(alpha)+'$') 
    plt.legend(scatterpoints=1, ncol=3, bbox_to_anchor=(-0.2,-0.2), frameon=True);
    plt.tight_layout()

Impact of learning rate

In [28]:
alphas = np.linspace(0.02,0.07,6)
alphas
Out[28]:
array([ 0.02,  0.03,  0.04,  0.05,  0.06,  0.07])
In [29]:
alphas_study(alphas)

Improving GD by adding a momentum term

Idea: adding an inertial momentum term to the process can:

  • accelerate convergence when moving across a plateau, or
  • damp oscillations.

Update $\vec{w}$ incrementally in every iteration $t$ as:

$$\Delta \vec{w}(t) = -\alpha \nabla_{\vec{w}} E\left(\vec{X}, \vec{y};\vec{w}(t)\right) + \beta\, \Delta \vec{w}(t-1), \qquad \vec{w}(t+1) = \vec{w}(t) + \Delta \vec{w}(t),$$

where $\beta\in\mathbb{R}^+$ is known as the momentum rate (this matches the implementation below, which reuses the full previous update as the momentum term).

In [30]:
def gradient_descent_with_momentum(X, y, w_0, alpha, beta, max_iters):
    w = np.array(w_0, dtype=np.float64)
    w_hist = np.zeros(shape=(max_iters+1, w.shape[0]))
    w_hist[0] = w
    omega = np.zeros_like(w)
    for i in range(max_iters): 
        delta_weights = -alpha*linear_regression_gradient(X, y, w) + beta*omega
        omega = delta_weights
        w += delta_weights
        w_hist[i+1] = w
    return w_hist
In [31]:
alpha = 0.05
beta = 0.5
max_iters = 25
In [32]:
w_hist = gradient_descent(X_bias, y, (-3,2), alpha, max_iters)
w_hist_mom = gradient_descent_with_momentum(X_bias,y, (-3,2), alpha, beta, max_iters)
In [34]:
comparison_plot()