Downloaded and adapted from NbViewer
Adapted from Chapter 3 of An Introduction to Statistical Learning
statsmodels : statistical models and tests
pandas : tools for data manipulation and analysis
matplotlib : plotting and visualization library
scikit-learn : widely used machine-learning library
# imports
import pandas as pd
import matplotlib.pyplot as plt
# this allows plots to appear directly in the notebook
%matplotlib inline
#figsize = (16,8)
figsize = (8,5)
Let's take a look at some data, ask some questions about that data, and then use linear regression to answer those questions!
# read data into a DataFrame
#data = pd.read_csv('http://www-bcf.usc.edu/~gareth/ISL/Advertising.csv', index_col=0)
data = pd.read_csv('Advertising.csv', index_col=0)
data.head()
#data[199:200]
| | TV | Radio | Newspaper | Sales |
|---|---|---|---|---|
| 1 | 230.1 | 37.8 | 69.2 | 22.1 |
| 2 | 44.5 | 39.3 | 45.1 | 10.4 |
| 3 | 17.2 | 45.9 | 69.3 | 9.3 |
| 4 | 151.5 | 41.3 | 58.5 | 18.5 |
| 5 | 180.8 | 10.8 | 58.4 | 12.9 |
What are the features? TV, Radio, and Newspaper: advertising budgets, in thousands of dollars.
What is the response? Sales: the number of widgets sold, in thousands.
# print the shape of the DataFrame
data.shape
(200, 4)
There are 200 observations, and thus 200 markets in the dataset.
# visualize the relationship between the features and the response using scatterplots
fig, axs = plt.subplots(1, 3, sharey=True)
data.plot(kind='scatter', x='TV', y='Sales', ax=axs[0], figsize= figsize)
data.plot(kind='scatter', x='Radio', y='Sales', ax=axs[1])
data.plot(kind='scatter', x='Newspaper', y='Sales', ax=axs[2])
Let's pretend you work for the company that manufactures and markets this widget. The company might ask you the following: On the basis of this data, how should we spend our advertising money in the future?
This general question might lead you to more specific questions:

1. Is there a relationship between ads and sales?
2. How strong is that relationship?
3. Which ad types contribute to sales?
4. What is the effect of each ad type on sales?
5. Given ad spending in a particular market, can sales be predicted?
We will explore these questions below!
Simple linear regression is an approach for predicting a quantitative response using a single feature (or "predictor" or "input variable"). The model takes the following form:
$$\boxed{Y = \beta_0 + \beta_1X + \xi}$$

What does each term represent?
Together, $\beta_0$ and $\beta_1$ are called the model coefficients. To answer the questions, we must "estimate" (or "learn") the values of these coefficients. And once we've learned these coefficients, we can use them to predict Sales!
Generally speaking, coefficients are estimated using the least squares criterion: we find the line that minimizes the sum of squared residuals (also called the "sum of squared errors"):
What elements are present in the diagram?
How do the model coefficients relate to the least squares line?
Predictions at the points of the design are $$\hat y_i = \hat\beta_0 + \hat\beta_1 X_i$$
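These estimates have a closed form: $\hat\beta_1 = \sum_i (x_i-\bar x)(y_i-\bar y) \,/\, \sum_i (x_i-\bar x)^2$ and $\hat\beta_0 = \bar y - \hat\beta_1 \bar x$. A minimal numpy sketch on noise-free toy data (so the fit recovers the generating coefficients exactly):

```python
import numpy as np

# toy data generated from y = 7 + 0.05 * x with no noise,
# so least squares recovers the coefficients exactly
x = np.array([10.0, 50.0, 100.0, 200.0, 300.0])
y = 7.0 + 0.05 * x

# closed-form least squares estimates for simple linear regression
beta1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
beta0 = y.mean() - beta1 * x.mean()

print(beta0, beta1)  # recovers 7.0 and 0.05
```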
Let's use Statsmodels to estimate the model coefficients for the advertising data:
# this is the standard import if you're using "formula notation" (similar to R)
import statsmodels.formula.api as smf
# create a fitted model in one line
lm = smf.ols(formula='Sales ~ TV', data=data).fit()
# print the coefficients
lm.params
Intercept    7.032594
TV           0.047537
dtype: float64
How do we interpret the TV coefficient $\beta_1$?
(Note that if an increase in TV ad spending was associated with a decrease in sales, $\beta_1$ would be negative)
Let's say that there was a new market where the TV advertising spend was $50,000. What would we predict for the Sales in that market?
$$\hat y = \hat\beta_0 + \hat\beta_1x$$

$$\hat y = 7.032594 + 0.047537 \times 50$$

# manually calculate the prediction
sales = 1000*(7.032594 + 0.047537*50)
print("expected Sales = {} widgets".format(sales))
expected Sales = 9409.444 widgets
# use the model to make predictions on a new value
X_new = pd.DataFrame({'TV' : [50]})
lm.predict(X_new)
array([ 9.40942557])
Let's make predictions for the smallest and largest observed values of x, and then use the predicted values to plot the least squares line:
# create a DataFrame with the minimum and maximum values of TV
X_new = pd.DataFrame({'TV': [data.TV.min(), data.TV.max()]})
#X_new.head()
# make predictions for those x values and store them
preds = lm.predict(X_new)
#preds
# first, plot the observed data
data.plot(kind='scatter', x='TV', y='Sales')
# then, plot the least squares line
plt.plot(X_new, preds, c='red', linewidth=2)
Statsmodels calculates 95% confidence intervals for the model coefficients. If we assume that the model $Y=\beta_0+\beta_1 X +\xi$ holds, then intervals constructed this way contain the "true" coefficients $\beta_0$ and $\beta_1$ about 95% of the time.
# print the confidence intervals for the model coefficients
lm.conf_int()
| | 0 | 1 |
|---|---|---|
| Intercept | 6.129719 | 7.935468 |
| TV | 0.042231 | 0.052843 |
We want to study the following test problem: the null hypothesis $H_0: \beta_1 = 0$ (there is no relationship between TV ads and Sales) against the alternative $H_1: \beta_1 \neq 0$.

How do we test this hypothesis? Intuitively, we reject the null (and thus believe the alternative) if the 95% confidence interval does not include zero. Relatedly, the p-value is the probability of observing a test statistic at least as extreme as the one computed from the data, assuming the null hypothesis (a coefficient of zero) is true:
# print the p-values for the model coefficients
lm.pvalues
Intercept    1.406300e-35
TV           1.467390e-42
dtype: float64
If the 95% confidence interval includes zero, the p-value for that coefficient will be greater than 0.05. If the 95% confidence interval does not include zero, the p-value will be less than 0.05. Thus, a p-value less than 0.05 is one way to decide whether there is likely a relationship between the feature and the response. (Again, using 0.05 as the cutoff is just a convention.)
In this case, the p-value for TV is far less than 0.05, so we confidently reject $H_0$ and conclude that there is very likely a relationship between TV ads and Sales.
(Note that we generally ignore the p-value for the intercept).
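Concretely, the two-sided p-value comes from the Student $t$ distribution. A small sketch, assuming `scipy` is available (the `t_stat` values below are illustrative, not taken from the fitted model):

```python
from scipy import stats

def two_sided_p(t_stat, df):
    # probability, under H0, of a t statistic at least as extreme as t_stat
    return 2 * stats.t.sf(abs(t_stat), df)

print(two_sided_p(0.0, 198))  # t = 0 gives p = 1: no evidence against H0
print(two_sided_p(3.0, 198))  # larger |t| gives a smaller p-value
```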
The most common way to evaluate the overall fit of a linear model is by the R-squared value. R-squared is the proportion of variance explained, meaning the proportion of variance in the observed data that is explained by the model, or the reduction in error over the null model. (The null model just predicts the mean of the observed response, and thus it has an intercept and no slope.)
$$R^2=\frac{\|\hat y - \bar{Y}_n\|_2^2}{\|Y-\bar{Y}_n\|_2^2}$$

where $\hat y$ is the least squares prediction vector and $\bar{Y}_n$ is the mean of the observed responses.
R-squared is between 0 and 1, and higher is better because it means that more variance is explained by the model. Here's an example of what R-squared "looks like":
You can see that the blue line explains some of the variance in the data (R-squared=0.54), the green line explains more of the variance (R-squared=0.64), and the red line fits the training data even more closely (R-squared=0.66). (Does the red line look like it's overfitting?)
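The formula above can be checked by hand. A sketch on synthetic data, which also verifies the equivalent form $R^2 = 1 - RSS/TSS$ (valid for least squares fits with an intercept):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 100.0, 50)
y = 7.0 + 0.05 * x + rng.normal(0.0, 1.0, 50)  # synthetic noisy response

# simple least squares fit (closed form)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x

# R^2 as explained variance over total variance...
r2 = np.sum((y_hat - y.mean()) ** 2) / np.sum((y - y.mean()) ** 2)
# ...which equals 1 - RSS/TSS for a model with an intercept
r2_alt = 1.0 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)
print(r2)
```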
Let's calculate the R-squared value for our simple linear model:
# print the R-squared value for the model
lm.rsquared
0.61187505085007099
Is that a "good" R-squared value? It's hard to say. The threshold for a good R-squared value depends widely on the domain. Therefore, it's most useful as a tool for comparing different models.
Simple linear regression can easily be extended to include multiple features. This is called multiple linear regression:
$$Y = \beta_0 + \beta_1X^{(1)} + \ldots + \beta_k X^{(k)} + \xi$$

Each $X^{(j)}$ represents a different feature (the coordinates of the design $X=(X^{(1)}, \ldots, X^{(k)})$), and each feature has its own coefficient. In this case:
$$\boxed{Y = \beta_0 + \beta_1 \times TV + \beta_2 \times Radio + \beta_3 \times Newspaper +\xi}$$

Let's use Statsmodels to estimate these coefficients:
# create a fitted model with all three features
lm = smf.ols(formula='Sales ~ TV + Radio + Newspaper', data=data).fit()
# print the coefficients
lm.params
Intercept    2.938889
TV           0.045765
Radio        0.188530
Newspaper   -0.001037
dtype: float64
How do we interpret these coefficients? For a given amount of Radio and Newspaper ad spending, an increase of $1000 in TV ad spending is associated with an increase in Sales of 45.765 widgets.
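Under the hood, this fit solves a least squares problem for a design matrix with an intercept column. A sketch with `np.linalg.lstsq` on synthetic data (the generating coefficients 2.9, 0.046, 0.19, 0.0 are made up to mimic the fitted values above):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.uniform(0.0, 100.0, size=(200, 3))  # stand-ins for TV, Radio, Newspaper
y = (2.9 + 0.046 * X[:, 0] + 0.19 * X[:, 1] + 0.0 * X[:, 2]
     + rng.normal(0.0, 0.5, 200))

# prepend a column of ones for the intercept, then solve min ||A b - y||^2
A = np.column_stack([np.ones(len(X)), X])
coefs, *_ = np.linalg.lstsq(A, y, rcond=None)
print(coefs)  # approximately [2.9, 0.046, 0.19, 0.0]
```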
A lot of the information we have been reviewing piece-by-piece is available in the model summary output:
# print a summary of the fitted model
lm.summary()
Dep. Variable: | Sales | R-squared: | 0.897 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 0.896 |
Method: | Least Squares | F-statistic: | 570.3 |
Date: | Sun, 27 Sep 2015 | Prob (F-statistic): | 1.58e-96 |
Time: | 20:32:01 | Log-Likelihood: | -386.18 |
No. Observations: | 200 | AIC: | 780.4 |
Df Residuals: | 196 | BIC: | 793.6 |
Df Model: | 3 | ||
Covariance Type: | nonrobust |
coef | std err | t | P>|t| | [95.0% Conf. Int.] | |
---|---|---|---|---|---|
Intercept | 2.9389 | 0.312 | 9.422 | 0.000 | 2.324 3.554 |
TV | 0.0458 | 0.001 | 32.809 | 0.000 | 0.043 0.049 |
Radio | 0.1885 | 0.009 | 21.893 | 0.000 | 0.172 0.206 |
Newspaper | -0.0010 | 0.006 | -0.177 | 0.860 | -0.013 0.011 |
Omnibus: | 60.414 | Durbin-Watson: | 2.084 |
---|---|---|---|
Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 151.241 |
Skew: | -1.327 | Prob(JB): | 1.44e-33 |
Kurtosis: | 6.332 | Cond. No. | 454. |
What are a few key things we learn from this output?
How do I decide which features to include in a linear model? Here's one idea: try different combinations of features, and keep the model with the highest R-squared.
What are the drawbacks to this approach?
# only include TV and Radio in the model
lm = smf.ols(formula='Sales ~ TV + Radio', data=data).fit()
lm.rsquared
0.89719426108289568
# add Newspaper to the model (which we believe has no association with Sales)
lm = smf.ols(formula='Sales ~ TV + Radio + Newspaper', data=data).fit()
lm.rsquared
0.89721063817895219
R-squared will always increase as you add more features to the model, even if they are unrelated to the response. Thus, selecting the model with the highest R-squared is not a reliable approach for choosing the best linear model.
There is an alternative to R-squared called adjusted R-squared that penalizes model complexity (to control for overfitting), but it generally under-penalizes complexity.
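For reference, adjusted R-squared is computed as $\bar R^2 = 1 - (1-R^2)\frac{n-1}{n-p-1}$, where $n$ is the number of observations and $p$ the number of features. A small sketch, plugged with the full-model values from above:

```python
def adjusted_r2(r2, n, p):
    # penalize the number of features p relative to the sample size n
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)

# full model: R^2 = 0.89721, n = 200 observations, p = 3 features
print(adjusted_r2(0.89721, 200, 3))  # about 0.896, matching the summary table
```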
So is there a better approach to feature selection? Cross-validation. It provides a more reliable estimate of out-of-sample error, and thus is a better way to choose which of your models will best generalize to out-of-sample data. There is extensive functionality for cross-validation in scikit-learn, including automated methods for searching different sets of parameters and different models. Importantly, cross-validation can be applied to any model, whereas the methods described above only apply to linear models.
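A minimal cross-validation sketch with scikit-learn, on synthetic data standing in for the advertising features (the coefficients used to generate `y` are made up):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.uniform(0.0, 100.0, size=(200, 2))  # stand-ins for TV and Radio
y = 3.0 + 0.05 * X[:, 0] + 0.19 * X[:, 1] + rng.normal(0.0, 1.0, 200)

# 5-fold cross-validated R^2: an estimate of out-of-sample fit
scores = cross_val_score(LinearRegression(), X, y, cv=5, scoring='r2')
print(scores.mean())
```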
# print a summary of the fitted model that includes only TV and Radio in the model
lm = smf.ols(formula='Sales ~ TV + Radio', data=data).fit()
lm.rsquared
lm.summary()
Dep. Variable: | Sales | R-squared: | 0.897 |
---|---|---|---|
Model: | OLS | Adj. R-squared: | 0.896 |
Method: | Least Squares | F-statistic: | 859.6 |
Date: | Fri, 25 Sep 2015 | Prob (F-statistic): | 4.83e-98 |
Time: | 15:39:26 | Log-Likelihood: | -386.20 |
No. Observations: | 200 | AIC: | 778.4 |
Df Residuals: | 197 | BIC: | 788.3 |
Df Model: | 2 | ||
Covariance Type: | nonrobust |
coef | std err | t | P>|t| | [95.0% Conf. Int.] | |
---|---|---|---|---|---|
Intercept | 2.9211 | 0.294 | 9.919 | 0.000 | 2.340 3.502 |
TV | 0.0458 | 0.001 | 32.909 | 0.000 | 0.043 0.048 |
Radio | 0.1880 | 0.008 | 23.382 | 0.000 | 0.172 0.204 |
Omnibus: | 60.022 | Durbin-Watson: | 2.081 |
---|---|---|---|
Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 148.679 |
Skew: | -1.323 | Prob(JB): | 5.19e-33 |
Kurtosis: | 6.292 | Cond. No. | 425. |
Let's redo some of the Statsmodels code above in scikit-learn:
# create X and y
feature_cols = ['TV', 'Radio', 'Newspaper']
X = data[feature_cols]
y = data.Sales
# follow the usual sklearn pattern: import, instantiate, fit
from sklearn.linear_model import LinearRegression
lm = LinearRegression()
lm.fit(X, y)
# print intercept and coefficients
print(lm.intercept_)
print(lm.coef_)
2.93888936946
[ 0.04576465  0.18853002 -0.00103749]
# pair the feature names with the coefficients (zip returns an iterator in Python 3)
list(zip(feature_cols, lm.coef_))
[('TV', 0.045764645455397615), ('Radio', 0.18853001691820442), ('Newspaper', -0.001037493042476266)]
# predict for a new observation (note the nested list: predict expects a 2D array)
lm.predict([[100, 25, 25]])
12.202667011892375
# calculate the R-squared
lm.score(X, y)
0.89721063817895219
Note that p-values and confidence intervals are not (easily) accessible through scikit-learn.