03 - Linear Regression

by Alejandro Correa Bahnsen & Iván Torroledo

version 1.3, June 2018

Part of the class Applied Deep Learning

This notebook is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License. Special thanks go to Kevin Markham.

In [1]:
import pandas as pd
import numpy as np

%matplotlib inline
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
from matplotlib import cm
import itertools
plt.style.use('ggplot')
In [2]:
print(plt.style.available)
['seaborn-pastel', 'seaborn-dark-palette', 'seaborn-darkgrid', '_classic_test', 'bmh', 'seaborn-muted', 'fast', 'fivethirtyeight', 'seaborn-dark', 'ggplot', 'seaborn-bright', 'seaborn-talk', 'dark_background', 'seaborn-poster', 'seaborn-notebook', 'seaborn-ticks', 'seaborn-deep', 'seaborn-white', 'seaborn', 'Solarize_Light2', 'classic', 'grayscale', 'seaborn-whitegrid', 'seaborn-colorblind', 'seaborn-paper']
In [3]:
plt.style.use('fivethirtyeight')
In [4]:
# Load the Portland housing dataset
import zipfile
with zipfile.ZipFile('../datasets/houses_portland.csv.zip', 'r') as z:
    f = z.open('houses_portland.csv')
    data = pd.read_csv(f)
data.head()
Out[4]:
   area  bedroom   price
0  2104        3  399900
1  1600        3  329900
2  2400        3  369000
3  1416        2  232000
4  3000        4  539900
In [5]:
data.columns
Out[5]:
Index(['area', 'bedroom', ' price'], dtype='object')
In [6]:
y = data[' price'].values
X = data['area'].values
plt.scatter(X, y)
plt.xlabel('Area')
plt.ylabel('Price')
Out[6]:
Text(0,0.5,'Price')

Normalize data

$$ x \leftarrow \frac{x - \overline{x}}{\sigma_x} $$

In [7]:
y_mean, y_std = y.mean(), y.std()
X_mean, X_std = X.mean(), X.std()

y = (y - y_mean)/ y_std
X = (X - X_mean)/ X_std

plt.scatter(X, y)
plt.xlabel('Area')
plt.ylabel('Price')
Out[7]:
Text(0,0.5,'Price')
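
Since both X and y are standardized, anything the model predicts is in standardized units. As an aside, a minimal sketch of mapping a prediction back to the original price scale, assuming the y_mean and y_std saved above (to_original_scale is a hypothetical helper, not part of the original notebook):

# Invert the standardization to recover a price in the original units
def to_original_scale(y_norm, mean, std):
    return y_norm * std + mean

# Example: a standardized prediction of 0.5 mapped back to dollars
# to_original_scale(0.5, y_mean, y_std)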

Form of linear regression

$$h_\beta(x) = \beta_0 + \beta_1x_1 + \beta_2x_2 + ... + \beta_nx_n$$

  • $h_\beta(x)$ is the response
  • $\beta_0$ is the intercept
  • $\beta_1$ is the coefficient for $x_1$ (the first feature)
  • $\beta_n$ is the coefficient for $x_n$ (the nth feature)

The $\beta$ values are called the model coefficients:

  • These values are estimated (or "learned") during the model fitting process using the least squares criterion.
  • Specifically, we find the line (mathematically) that minimizes the sum of squared residuals (or "sum of squared errors").
  • And once we've learned these coefficients, we can use the model to predict the response; a small worked example follows below.
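
As a made-up illustration (these numbers are not from the dataset): with coefficients $\beta = (\beta_0, \beta_1, \beta_2) = (1,\ 0.5,\ -2)$ and features $x = (x_1, x_2) = (2,\ 1)$,

$$h_\beta(x) = 1 + 0.5 \cdot 2 - 2 \cdot 1 = 0$$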

Estimating coefficients

In the diagram above:

  • The black dots are the observed values of x and y.
  • The blue line is our least squares line (a sketch of how it can be computed directly is shown after this list).
  • The red lines are the residuals, which are the vertical distances between the observed values and the least squares line.
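
Because the standardized X and y are already in memory, here is a minimal sketch of obtaining that least squares line directly with NumPy; it jumps ahead of the cost-function route developed below and is only a cross-check, not part of the original flow:

# Least squares estimate of (beta_0, beta_1) on the standardized data.
# np.linalg.lstsq solves min ||A beta - y||^2 for beta.
A = np.c_[np.ones_like(X), X]           # design matrix: intercept column + area
beta_hat = np.linalg.lstsq(A, y, rcond=None)[0]
print(beta_hat)                         # beta_0 should be ~0 after standardizing both variables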

Cost function

The goal becomes to estimate the parameters $\beta$ that minimize the sum of squared residuals:

$$J(\beta_0, \beta_1)=\frac{1}{2n}\sum_{i=1}^n (h_\beta(x_i)-y_i)^2$$

In [8]:
# create X and y
n_samples = X.shape[0]
X_ = np.c_[np.ones(n_samples), X]

Let's suppose the following betas:

In [9]:
beta_ini = np.array([-1, 1])
In [10]:
# Hypothesis function: h_beta(x) = x . beta (x already includes the intercept column)
def lr_h(beta, x):
    return np.dot(beta, x.T)
In [11]:
# scatter plot
plt.scatter(X, y,c='b')

# Plot the linear regression
x = np.c_[np.ones(2), [X.min(), X.max()]]
plt.plot(x[:, 1], lr_h(beta_ini, x), 'r', lw=5)
plt.xlabel('Area')
plt.ylabel('Price')
Out[11]:
Text(0,0.5,'Price')

Let's calculate the cost of such a regression:

In [12]:
# Cost function: J(beta) = 1/(2n) * sum_i (h_beta(x_i) - y_i)^2
def lr_cost_func(beta, x, y):
    # Loop over samples (can be vectorized)
    res = 0
    for i in range(x.shape[0]):
        res += (lr_h(beta, x[i, :]) - y[i]) ** 2
    res *= 1 / (2 * x.shape[0])
    return res
lr_cost_func(beta_ini, X_, y)
Out[12]:
0.6450124071218747
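
As the comment inside lr_cost_func notes, the loop can be vectorized. A sketch of an equivalent NumPy version (it should return the same value as the loop above; lr_cost_func_vec is not part of the original notebook):

# Vectorized cost: J(beta) = 1/(2n) * ||X beta - y||^2
def lr_cost_func_vec(beta, x, y):
    errors = np.dot(x, beta) - y
    return np.dot(errors, errors) / (2 * x.shape[0])

lr_cost_func_vec(beta_ini, X_, y)   # expected to match lr_cost_func(beta_ini, X_, y)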

Understanding the cost function

Let's see what the cost function looks like for different values of $\beta$.

In [13]:
beta0 = np.arange(-15, 20, 1)

beta1 = 2
In [14]:
cost_func=[]
for beta_0 in beta0:
    cost_func.append(lr_cost_func(np.array([beta_0, beta1]), X_, y) )

plt.plot(beta0, cost_func)
plt.xlabel('beta_0')
plt.ylabel('J(beta)')
Out[14]:
Text(0,0.5,'J(beta)')
In [15]:
beta0 = 0
beta1 = np.arange(-15, 20, 1)
In [16]:
cost_func=[]
for beta_1 in beta1:
    cost_func.append(lr_cost_func(np.array([beta0, beta_1]), X_, y) )

plt.plot(beta1, cost_func)
plt.xlabel('beta_1')
plt.ylabel('J(beta)')
Out[16]:
Text(0,0.5,'J(beta)')

Analyzing both coefficients at the same time

In [17]:
beta0 = np.arange(-5, 7, 0.2)
beta1 = np.arange(-5, 7, 0.2)
In [18]:
cost_func = pd.DataFrame(index=beta0, columns=beta1, dtype=float)

for beta_0 in beta0:
    for beta_1 in beta1:
        cost_func.loc[beta_0, beta_1] = lr_cost_func(np.array([beta_0, beta_1]), X_, y)   
In [19]:
betas = np.transpose([np.tile(beta0, beta1.shape[0]), np.repeat(beta1, beta0.shape[0])])
fig = plt.figure(figsize=(10, 10))
ax = fig.add_subplot(111, projection='3d')
ax.plot_trisurf(betas[:, 0], betas[:, 1], cost_func.T.values.flatten(), cmap=cm.jet, linewidth=0.1)
ax.set_xlabel('beta_0')
ax.set_ylabel('beta_1')
ax.set_zlabel('J(beta)')
plt.show()

The cost surface can also be visualized as a contour plot

In [20]:
contour_levels = [0, 0.1, 0.25, 0.5, 0.75, 1, 1.5, 2, 3, 4, 5, 7, 10, 12, 15, 20]
plt.contour(beta0, beta1, cost_func.T.values, contour_levels)
plt.xlabel('beta_0')
plt.ylabel('beta_1')
Out[20]:
Text(0,0.5,'beta_1')

Let's see where different values of the betas land on the contour plot

In [21]:
betas = np.array([[0, 0],
                 [-1, -1],
                 [-5, 5],
                 [3, -2]])
In [22]:
plt.style.use('seaborn-notebook')
In [23]:
for beta in betas:
    print('\n\nLinear Regression with betas ', beta)
    f, (ax1, ax2) = plt.subplots(1,2, figsize=(12, 6))
    ax2.contour(beta0, beta1, cost_func.T.values, contour_levels)
    ax2.set_xlabel('beta_0')
    ax2.set_ylabel('beta_1')
    ax2.scatter(beta[0], beta[1],c='b', s=50)

    # scatter plot
    ax1.scatter(X, y,c='b')

    # Plot the linear regression
    x = np.c_[np.ones(2), [X.min(), X.max()]]
    ax1.plot(x[:, 1], lr_h(beta, x), 'r', lw=5)
    ax1.set_xlabel('Area')
    ax1.set_ylabel('Price')
    plt.show()
Linear Regression with betas  [0 0]
Linear Regression with betas  [-1 -1]
Linear Regression with betas  [-5  5]
Linear Regression with betas  [ 3 -2]
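
To close the loop on the contour plots, a short sketch that reads off the grid point with the lowest cost from the cost_func table built above and compares it with a direct least squares fit (np.linalg.lstsq is not part of the original notebook and is used here only as a cross-check):

# Grid point with the lowest cost on the (beta_0, beta_1) surface
min_pos = np.unravel_index(np.argmin(cost_func.values), cost_func.shape)
beta_grid = (cost_func.index[min_pos[0]], cost_func.columns[min_pos[1]])
print('grid minimum near      ', beta_grid)

# Closed-form least squares solution for comparison
beta_lstsq = np.linalg.lstsq(X_, y, rcond=None)[0]
print('least squares solution ', beta_lstsq)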