Linear Regression¶

Linear regression is a method that allows us to study the statistical relationship between independent/predictor variables and dependent/response/outcome variables. It is simple to use and interpreting a linear regression model is straightforward. There are two variants:

Simple linear regression
- One predictor variable, often denoted $x$
- One response variable, often denoted $y$
Multiple linear regression
- Two or more predictor variables
- One dependent variable

In [1]:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

Simple Linear Regression¶

Simple linear regression assumes that there is a linear relationship between two continuous variables; the predictor variable $x$ and the response variable $y$. This can be described as:

\begin{equation*} y \thickapprox \beta_0 + \beta_1 x \end{equation*}

where

$\thickapprox$ can be read as "is approximately modeled as".
$\beta_0$ is called the intercept e.g. if $x=0$ then $y=\beta_0$
$\beta_1$ is called the slope (sometimes referred to as the coefficient). This basically describes the slope of the regression line e.g. if $x$ increases by 1 unit then $y$ increases by $\beta_1$ units.

Together, $\beta_0$ and $\beta_1$ are called the model coefficients or parameters.

Learning the Model Coefficients¶

Since these coefficients are unknown to begin with, we use our data to estimate or learn them using least squares criteria. Essentially, we find the best fitting line by minimising the sum of squared errors/residuals. The errors or residuals are the differences between the observed value and the estimated value. We square the difference to avoid negative values. The residual sum of squares (RSS) is defined as: \begin{equation*} RSS = \sum_{i=1}^n{(y_i - \hat{y}_i)^2} \end{equation*} where

$y_i$ is the observed value
$\hat{y}_i$ is the estimated/fitted value

We can illustrate this with a plot:

In [2]:

from sklearn.linear_model import LinearRegression

# Create artificial data
X = np.array([   1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
Y = np.array([-1.1, 4, 1, 6, 4, 2, 8, 5, 12, 7])

# Create a model
lrm = LinearRegression()
lrm.fit(X.reshape(-1, 1), Y)
model_line = lrm.intercept_ + X * lrm.coef_[0]

fig, ax = plt.subplots(figsize=(8,4))

# Plot the data
ax.plot(X, Y, 'go') 

# Plot the regression line
ax.plot(X, model_line, 'bo-') # 

# Plot the residuals
ax.vlines(x=X, ymin=np.minimum(Y, model_line), ymax=np.maximum(Y, model_line), color='red')

Out[2]:

<matplotlib.collections.LineCollection at 0x1ff8332deb8>

The green dots represent our artificial data i.e. the observed data. The red vertical lines indicate the errors or the residuals. The blue line represent the regression line that best fit the observed data. This line is found by minimising the residuals.

Multiple Linear Regression¶

Typically, we have more than one predictor variable (features) in our data. In such instances, we use Multiple Linear Regression:

\begin{equation*} y \thickapprox \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_n x_n \end{equation*}

Explore the Boston House Prices Dataset¶

In [3]:

from sklearn.datasets import load_boston
boston = load_boston()

In [4]:

boston.keys()

Out[4]:

dict_keys(['data', 'target', 'feature_names', 'DESCR'])

In [5]:

print(boston['DESCR'])

Boston House Prices dataset
===========================

Notes
------
Data Set Characteristics:  

    :Number of Instances: 506 

    :Number of Attributes: 13 numeric/categorical predictive
    
    :Median Value (attribute 14) is usually the target

    :Attribute Information (in order):
        - CRIM     per capita crime rate by town
        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
        - INDUS    proportion of non-retail business acres per town
        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
        - NOX      nitric oxides concentration (parts per 10 million)
        - RM       average number of rooms per dwelling
        - AGE      proportion of owner-occupied units built prior to 1940
        - DIS      weighted distances to five Boston employment centres
        - RAD      index of accessibility to radial highways
        - TAX      full-value property-tax rate per $10,000
        - PTRATIO  pupil-teacher ratio by town
        - B        1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
        - LSTAT    % lower status of the population
        - MEDV     Median value of owner-occupied homes in $1000's

    :Missing Attribute Values: None

    :Creator: Harrison, D. and Rubinfeld, D.L.

This is a copy of UCI ML housing dataset.
http://archive.ics.uci.edu/ml/datasets/Housing


This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University.

The Boston house-price data of Harrison, D. and Rubinfeld, D.L. 'Hedonic
prices and the demand for clean air', J. Environ. Economics & Management,
vol.5, 81-102, 1978.   Used in Belsley, Kuh & Welsch, 'Regression diagnostics
...', Wiley, 1980.   N.B. Various transformations are used in the table on
pages 244-261 of the latter.

The Boston house-price data has been used in many machine learning papers that address regression
problems.   
     
**References**

   - Belsley, Kuh & Welsch, 'Regression diagnostics: Identifying Influential Data and Sources of Collinearity', Wiley, 1980. 244-261.
   - Quinlan,R. (1993). Combining Instance-Based and Model-Based Learning. In Proceedings on the Tenth International Conference of Machine Learning, 236-243, University of Massachusetts, Amherst. Morgan Kaufmann.
   - many more! (see http://archive.ics.uci.edu/ml/datasets/Housing)

In [6]:

df = pd.DataFrame(data=boston.data, columns=boston['feature_names'])

# Add a column for the target feature
df['MEDV'] = boston['target'] 

df.head()

Out[6]:

	CRIM	ZN	INDUS	NOX	RM	AGE	DIS	RAD	TAX	PTRATIO	B	LSTAT	MEDV
0	0.00632	18.0	2.31	0.538	6.575	65.2	4.0900	1.0	296.0	15.3	396.90	4.98	24.0
1	0.02731	0.0	7.07	0.469	6.421	78.9	4.9671	2.0	242.0	17.8	396.90	9.14	21.6
2	0.02729	0.0	7.07	0.469	7.185	61.1	4.9671	2.0	242.0	17.8	392.83	4.03	34.7
3	0.03237	0.0	2.18	0.458	6.998	45.8	6.0622	3.0	222.0	18.7	394.63	2.94	33.4
4	0.06905	0.0	2.18	0.458	7.147	54.2	6.0622	3.0	222.0	18.7	396.90	5.33	36.2

In [7]:

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 14 columns):
CRIM       506 non-null float64
ZN         506 non-null float64
INDUS      506 non-null float64
CHAS       506 non-null float64
NOX        506 non-null float64
RM         506 non-null float64
AGE        506 non-null float64
DIS        506 non-null float64
RAD        506 non-null float64
TAX        506 non-null float64
PTRATIO    506 non-null float64
B          506 non-null float64
LSTAT      506 non-null float64
MEDV       506 non-null float64
dtypes: float64(14)
memory usage: 55.4 KB

Check data distributions.

In [8]:

df.describe().transpose()

Out[8]:

	count	mean	std	min	25%	50%	75%	max
CRIM	506.0	3.593761	8.596783	0.00632	0.082045	0.25651	3.647423	88.9762
ZN	506.0	11.363636	23.322453	0.00000	0.000000	0.00000	12.500000	100.0000
INDUS	506.0	11.136779	6.860353	0.46000	5.190000	9.69000	18.100000	27.7400
CHAS	506.0	0.069170	0.253994	0.00000	0.000000	0.00000	0.000000	1.0000
NOX	506.0	0.554695	0.115878	0.38500	0.449000	0.53800	0.624000	0.8710
RM	506.0	6.284634	0.702617	3.56100	5.885500	6.20850	6.623500	8.7800
AGE	506.0	68.574901	28.148861	2.90000	45.025000	77.50000	94.075000	100.0000
DIS	506.0	3.795043	2.105710	1.12960	2.100175	3.20745	5.188425	12.1265
RAD	506.0	9.549407	8.707259	1.00000	4.000000	5.00000	24.000000	24.0000
TAX	506.0	408.237154	168.537116	187.00000	279.000000	330.00000	666.000000	711.0000
PTRATIO	506.0	18.455534	2.164946	12.60000	17.400000	19.05000	20.200000	22.0000
B	506.0	356.674032	91.294864	0.32000	375.377500	391.44000	396.225000	396.9000
LSTAT	506.0	12.653063	7.141062	1.73000	6.950000	11.36000	16.955000	37.9700
MEDV	506.0	22.532806	9.197104	5.00000	17.025000	21.20000	25.000000	50.0000

We can get check if we have missing values in the dataset:

In [9]:

df.isnull().any()

Out[9]:

CRIM       False
ZN         False
INDUS      False
CHAS       False
NOX        False
RM         False
AGE        False
DIS        False
RAD        False
TAX        False
PTRATIO    False
B          False
LSTAT      False
MEDV       False
dtype: bool

No missing data. If any of our features contained missing values, we could use a heatmap to get a quick overview of where there missing data are and a rough picture on how much data is missing.

In [10]:

sns.heatmap(df.isnull(), cbar=False, yticklabels=False)

Out[10]:

<matplotlib.axes._subplots.AxesSubplot at 0x1ff84676d68>

Categorical features¶

Next step is to identify which features are categorical:

In [11]:

# Features with number of unique values
unq = { 
    column : df[column].nunique()
    for column in df.columns
}
unq

Out[11]:

{'AGE': 356,
 'B': 357,
 'CHAS': 2,
 'CRIM': 504,
 'DIS': 412,
 'INDUS': 76,
 'LSTAT': 455,
 'MEDV': 229,
 'NOX': 81,
 'PTRATIO': 46,
 'RAD': 9,
 'RM': 446,
 'TAX': 66,
 'ZN': 26}

In [12]:

import operator
sorted(unq.items(), key=operator.itemgetter(1))

Out[12]:

[('CHAS', 2),
 ('RAD', 9),
 ('ZN', 26),
 ('PTRATIO', 46),
 ('TAX', 66),
 ('INDUS', 76),
 ('NOX', 81),
 ('MEDV', 229),
 ('AGE', 356),
 ('B', 357),
 ('DIS', 412),
 ('RM', 446),
 ('LSTAT', 455),
 ('CRIM', 504)]

In [13]:

sorted(df['RAD'].unique())

Out[13]:

[1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 24.0]

We see that CHAS is a Boolean feature. RAD (index of accessibility to radial highways) is also categorical since it has 9 distinct values. Let us fix them:

In [14]:

df['CHAS'] = df['CHAS'].astype('category')
df['RAD'] = df['RAD'].astype('category')

In [15]:

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 14 columns):
CRIM       506 non-null float64
ZN         506 non-null float64
INDUS      506 non-null float64
CHAS       506 non-null category
NOX        506 non-null float64
RM         506 non-null float64
AGE        506 non-null float64
DIS        506 non-null float64
RAD        506 non-null category
TAX        506 non-null float64
PTRATIO    506 non-null float64
B          506 non-null float64
LSTAT      506 non-null float64
MEDV       506 non-null float64
dtypes: category(2), float64(12)
memory usage: 49.0 KB

Since CHAS is a binary feature, we can use Point-Biserial to compute correlation between CHAS and MEDV:

In [16]:

from scipy import stats
stats.pointbiserialr(df['CHAS'], df['MEDV'])

Out[16]:

PointbiserialrResult(correlation=0.17526017719029846, pvalue=7.3906231705208155e-05)

Feature Selection¶

Before we can train a linear regression model, we must decide on which feature to pick as the independent variable. We must find a good predictor variable for the dependent variable. One way to find such a candidate is to create a heat map of the correlations between each of the features. Pearson's correlation can tell us as one variable increses, whether another variable increases (positive correlation), decreases (negative correlation), or stays the same (no correlation). Read more on Cross Validated.

In [17]:

plt.figure(figsize = (10,6))
sns.heatmap(df.corr(), annot=True, linewidths=1)

Out[17]:

<matplotlib.axes._subplots.AxesSubplot at 0x1ff8473fb70>

There is a strong correlation between our target feature MEDV (the house price) and RM which is the average number of rooms per house. There are also negative correlations with LSTAT and PTRATIO. This means that as one of the variables (LSTAT or PTRATIO) increases then our dependent variable (MEDV) tends to decrease and the other way around.

For now, we focus on the relationship between MEDV and RM. Let us create a linear model plot of these two attributes:

In [18]:

sns.lmplot(data=df, x='MEDV', y='RM')

Out[18]:

<seaborn.axisgrid.FacetGrid at 0x1ff8336c9b0>

From the plot, it looks like there is a good linear fit since the error bars are not that big.

Train a Model¶

Before we can train a model, it is important to split the data into training set and testing set to avoid overfitting. Overfitting occurs when our model becomes too complex which may yield high accuracy on the training data but does not generalise well.

Note that the train_test_split() method simply partitions data randomly. This means that we may get another coefficient if we run the method again. This means that the accuracy of our testing can vary depending on the random split. Therefore, a better approach would be to apply k-fold cross-validation. However, cross-validation may not work well when the data is not uniformly distributed.

In [19]:

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

In [20]:

X = df['RM']
y = df['MEDV']

# Partition the data into training set and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)

lrm = LinearRegression()
lrm.fit(X_train.values.reshape(-1, 1), y_train)

Out[20]:

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

In [21]:

intercept = lrm.intercept_
coefficient = lrm.coef_[0]
print("Intercept:   {0}\nCoefficient: {1}".format(intercept, coefficient))

Intercept:   -31.06739403994239
Coefficient: 8.567551750983585

It is straightforward to interpret the coefficient:

In [22]:

print('A unit increase in the RM feature is associated with a {0} units increase in the house price.'.format(coefficient))

A unit increase in the RM feature is associated with a 8.567551750983585 units increase in the house price.

Generate Predictions¶

Now we use the fitted model to perform predictions on our test set.

In [23]:

predicted = lrm.predict(X_test.values.reshape(-1, 1))

Let us check how far off the predicted values are from the observed values. We can visualise this via scatter plot. We can also plot a dashed line that indicates where perfect prediction values would lie.

In [24]:

fig, ax = plt.subplots()
ax.scatter(y_test, predicted)
ax.plot([y.min(), y.max()], [y.min(), y.max()], 'k--', linewidth=3)
ax.set_xlabel('Observed')
ax.set_ylabel('Predicted')

Out[24]:

<matplotlib.text.Text at 0x1ff848a4470>

Create a histogram of our residuals/errors. If the residuals follow a normal distribution, then we have selected a correct model for the data set. Otherwise, we may have to reconsider the choice of linear regression model.

In [25]:

sns.distplot(y_test - predicted);

Evaluate the Model¶

Next, we evaluate the performance of our model. The sci-learn library implements several metric functions:

Mean Absolute Error is the mean of the absolute value of the errors

\begin{equation*} \frac{1}{n} \sum_{i=1}^n \mid y_i - \hat{y}_i \mid \end{equation*}

Median Absolute Error is interesting because it is robust to outliers.

Explained Variance Score

\begin{equation*} 1- \frac{Var(y - \hat{y})}{Var(y)} \end{equation*}

Coefficient of Determination ($R^2$) is often used as a tool for comparing different models. Although we want to achieve an $R^2$ value close to 1, there is no guarantee that a model with such high value will be good because $R^2$ is susceptible to overfitting.

Mean Squared Error tends to be useful with real-world data because it takes "larger" errors into account

\begin{equation*} \frac{1}{n} \sum_{i=1}^n (y_i - \hat{y}_i )^2 \end{equation*}

Root Mean Squared Error

In [26]:

from sklearn import metrics

print('Mean Absolute Error:      ', metrics.mean_absolute_error(y_test, predicted))
print('Median Absolute Error:    ', metrics.median_absolute_error(y_test, predicted))
print('Explained Variance Score: ', metrics.explained_variance_score(y_test, predicted))
print('Coefficient of Determination (R^2): ', metrics.r2_score(y_test, predicted))
print('Mean Squared Error:       ', metrics.mean_squared_error(y_test, predicted))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, predicted)))

Mean Absolute Error:       4.25158533333
Median Absolute Error:     2.83791229667
Explained Variance Score:  0.629424590611
Coefficient of Determination (R^2):  0.623482296308
Mean Squared Error:        34.5555063494
Root Mean Squared Error: 5.87839317751

Common Issues¶

There are many potential challenges that arises when we learn a linear regression model using a particular data set. Here are few common issues:

Collinearity
Outliers

Collinearity¶

We say that we have collinearity in our data set when two or more predictor variables are correlated with one another. These variables are called collinear variables. One collinear variable can predict another variable with high accuracy. In the left plot of Figure 3.14 shows that there are no collinearity between Age and Limit. However, there is a high collinearity between Rating and Limit as shown in the right plot.

Collinearity can be a problem because it can be difficult to differentiate the effects that each of the collinear variables has on the dependent variables. Another issue is that model predictions tend to be less precise if there is high level of collinearity in our training set. Therefore, if two predictor variables are highly correlated then we should exclude one of them from our data set. Alternative approach is to use Ridge Regression .

Outliers¶

Outliers ...