Machine Learning: Regression Basics


This notebook covers:

  • Linear Regression (OLS)
  • Multiple Regression
  • Best Practices

Data Sources:

  • Advertising, Credit, and Auto datasets from An Introduction to Statistical Learning (ISL)

TO DO:
  • Improve flow

Getting Started

Pulled directly from Wikipedia:

In statistics, regression analysis is a supervised learning statistical process for estimating the relationships among variables. It includes many techniques for modeling and analyzing several variables, when the focus is on the relationship between a dependent variable and one or more independent variables. More specifically, regression analysis helps one understand how the typical value of the dependent variable (or 'criterion variable') changes when any one of the independent variables is varied, while the other independent variables are held fixed.

Regression analysis is widely used for prediction and forecasting, where its use has substantial overlap with the field of machine learning. Regression analysis is also used to understand which among the independent variables are related to the dependent variable, and to explore the forms of these relationships. In restricted circumstances, regression analysis can be used to infer causal relationships between the independent and dependent variables. However this can lead to illusions or false relationships, so caution is advisable; for example, correlation does not imply causation.

Let's grab some data and get going.

In [5]:
import pandas as pd
import numpy as np

# Advertising Sales data
df = pd.read_csv('http://www-bcf.usc.edu/~gareth/ISL/Advertising.csv', index_col=0)
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 200 entries, 1 to 200
Data columns (total 4 columns):
TV           200 non-null float64
Radio        200 non-null float64
Newspaper    200 non-null float64
Sales        200 non-null float64
dtypes: float64(4)
In [166]:
df.describe()
Out[166]:
               TV       Radio   Newspaper       Sales
count  200.000000  200.000000  200.000000  200.000000
mean   147.042500   23.264000   30.554000   14.022500
std     85.854236   14.846809   21.778621    5.217457
min      0.700000    0.000000    0.300000    1.600000
25%     74.375000    9.975000   12.750000   10.375000
50%    149.750000   22.900000   25.750000   12.900000
75%    218.825000   36.525000   45.100000   17.400000
max    296.400000   49.600000  114.000000   27.000000

8 rows × 4 columns

So, what types of questions could we answer with regards to this dataset?

  • Is there a relationship between the types of advertising, and sales?
  • How strong is this relationship?
  • What variables contribute to more sales?
  • How well can we predict sales?
  • Is there synergy among variables that contributes to the amount of sales?

I will not be going over all the theory of regression, but I will demonstrate a workflow which checks assumptions and describes the required steps and considerations.

The first step is to split our data into training and testing data.

In [171]:
from sklearn.cross_validation import train_test_split
# (in newer scikit-learn this function lives in sklearn.model_selection)
# Help for train_test_split
#train_test_split?

x = df[['TV', 'Radio', 'Newspaper']]
y = df['Sales']

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.33, random_state=123)

Interpreting significance, influence, and the coefficients:

Now let's get a sense of how our independent variables are related to the dependent variable (Sales). We do this by examining the p-values and coefficients of univariate regressions between each predictor and the dependent variable (DV).

The questions we want to ask are:

  • "Which variables are significant?"
  • "What type of relationship do the predictors have with the DV? (positive / negative)"
  • "What is the strength of this relationship"
  • "Does this make intuitive sense?"

The last point is particularly important. As with any modeling, sound intuition about the problem matters as much as the dataset on hand. For example, if the data suggested that increases in a predictor such as smoking were negatively related to lung cancer, it would likely not be satisfactory to include smoking in the model.

Is there a potential relationship between the predictors and the response variable?

In [172]:
from sklearn.feature_selection import f_regression
#f_regression?
# Perform univariate regressions; f_regression returns (F statistics, p-values)
p_values = f_regression(x_train, y_train)
p_values
Out[172]:
(array([ 201.74678914,   50.60657905,    7.68597105]),
 array([  2.29293230e-28,   6.44463252e-11,   6.37042164e-03]))

Great! All of our predictors are significant (p < 0.05) in a univariate regression, and the F values are all well above 1, which suggests that each predictor is related to sales. This is not often the case, and has only happened because this dataset is tailored for machine learning examples.
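To make this output easier to read, each statistic can be paired with its column name; a small sketch using the same f_regression call as above:

# Pair each predictor with its F statistic and p-value for readability
F_vals, p_vals = f_regression(x_train, y_train)
for name, F, p in zip(x.columns, F_vals, p_vals):
    print name, round(F, 2), p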

Now we can take a look at the coefficients. We will build a customized function that runs several linear models to do this.

What do the coefficients tell us?

In [174]:
from sklearn.linear_model import LinearRegression

def coefs(x_train, y_train, x_test, y_test):
    lr = LinearRegression(fit_intercept=True)
    
    feature_names = df.columns
    coef_list = []
    
    for col in xrange(x_train.shape[1]):
        # Iterate over the feature columns.
        # LinearRegression expects a 2D input for x,
        # so we add a new axis and slice out one column at a time
        x = x_train[:, np.newaxis]
        x = x[:, :, col]
        
        lr.fit(x, y_train)
        coef_list.append([feature_names[col], lr.coef_[0]])
        
    return coef_list
        
coefs(x_train, y_train, x_test, y_test)
Out[174]:
[['TV', 0.046984295187230886],
 ['Radio', 0.18498775661674768],
 ['Newspaper', 0.05350821976042866]]

While we could begin to interpret the magnitude of these coefficients, the important thing at this stage is their direction. If any variable here intuitively disagreed with a known reality in terms of direction, we would want to exclude it from the model. However, keep in mind that many times we do not know the expected behaviour, in which case we should not remove any variables on this basis.

  • Here, we see that as expected, all three types of advertising have a positive effect on sales.

In this notebook, we will not interpret the magnitudes of the coefficients beyond this. The problem is similar to one we encountered before in classification: we are mixing units across predictors, so it would not make sense (in most cases) to compare coefficients directly. If the predictors shared the same unit, say dollars of advertising spent, you could begin exploring the magnitude of the relationships with some confidence.
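If you do want to compare magnitudes across channels, one common workaround (a sketch, not part of the original analysis) is to standardize the predictors first, so each coefficient reflects the change in sales per one standard deviation of spending:

from sklearn.linear_model import LinearRegression

# Standardize predictors to z-scores so coefficient magnitudes are comparable
x_std = (x - x.mean()) / x.std()
lr_std = LinearRegression(fit_intercept=True).fit(x_std, y)
print zip(x.columns, lr_std.coef_)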

Assessing Model Accuracy

There are two metrics typically used to assess model fit for linear regression:

  • Residual standard error (RSE)
  • R^2

The RSE is a measure of the typical difference between the predicted y and the observed y (an estimate of the standard deviation of the residuals), while R^2 measures predictive power as the proportion of variance explained.
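As a point of reference, both quantities can be computed directly from a vector of predictions; here is a minimal sketch (the array names y_obs and y_pred are placeholders, not variables defined above):

import numpy as np

def rse_and_r2(y_obs, y_pred, n_coefs=2):
    # Residual and total sums of squares
    rss = np.sum((y_obs - y_pred) ** 2)
    tss = np.sum((y_obs - np.mean(y_obs)) ** 2)
    # RSE divides RSS by the residual degrees of freedom
    # (n - 2 for simple regression: one predictor plus an intercept)
    rse = np.sqrt(rss / (len(y_obs) - n_coefs))
    r2 = 1.0 - rss / tss
    return rse, r2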

Let's modify our function from above to include these new metrics.

In [176]:
def models(x_train, y_train, x_test, y_test):
    
    lr = LinearRegression(fit_intercept=True)
    
    feature_names = df.columns
    
    coef_list = []
    error_list = []
    r2_list = []
    feature_list = []
    
    for col in xrange(len(feature_names)-1):
        # Iterate through the predictor columns (the last column is Sales).
        # LinearRegression expects a 2D input for x,
        # so we add a new axis and slice out one column at a time
        xtrain = x_train[:, np.newaxis]
        xtrain = xtrain[:, :, col]
        
        xtest = x_test[:, np.newaxis]
        xtest = xtest[:, :, col]
        
        lr.fit(xtrain, y_train)
        
        # Squared sum of the test-set residuals
        # (a rough error measure, not the textbook RSE defined above)
        error_list.append(np.sum(lr.predict(xtest) - y_test)**2)
        r2_list.append(lr.score(xtest, y_test))
        coef_list.append(lr.coef_[0])
        feature_list.append(feature_names[col])
        
    out = zip(feature_list, r2_list, error_list, coef_list)
    
    return out

model_outs = models(x_train, y_train, x_test, y_test)
model_outs
Out[176]:
[('TV', 0.62450810246495947, 326.33971180270174, 0.046984295187230886),
 ('Radio', 0.43882257063230956, 79.574614380832173, 0.18498775661674768),
 ('Newspaper', 0.038253869337874224, 830.59166804447102, 0.05350821976042866)]

Now we have a pretty good sense of how our predictors relate to sales. For example, TV advertising alone can explain ~62% of the variation in the test data, yet its error measure is still rather large, which means individual predictions from TV spending alone can deviate quite a bit (possibly due to some outliers).

Let's use this information to build a multiple regression model that improves our R^2 value, thus our future predictive power.

Switching python packages:

scikit-learn is best for doing machine learning with emphasis on predictive modeling, often with large and sparse data.

statsmodels on the other hand is better at "traditional" statistics and econometrics, with much stronger emphasis on parameter estimation and (statistical) testing. statsmodels will help us understand the basics more clearly before we move onto more complex regression techniques.

Before we get into the modeling, we should check whether our dependent variable, Sales, is roughly normally distributed. (Strictly speaking, the normality assumption in linear regression applies to the residuals rather than to the raw response, but a heavily skewed response is a useful warning sign.)

In [177]:
# Plot the results thus far
from ggplot import *
# Enable inline plotting
%matplotlib inline

qplot(y)+geom_histogram()
Out[177]:
<ggplot: (281636577)>

This is probably as close to normal as we can get (despite the right skew). Under circumstances with very non-normal data, you can transform the response to coerce it towards a normal distribution; we will cover transformations another day.

There is a general misconception surrounding statistical analyses that require a normality assumption: it is not very useful to test for normality with a formal test such as Shapiro-Wilk or Anderson-Darling. For small datasets, even large deviations from normality go undetected, while for large datasets, even the smallest departure from normality leads to a rejected null. <a href = "http://www.r-bloggers.com/normality-tests-don’t-do-what-you-think-they-do/"> See more here</a>.
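For illustration only (the caveat above still applies), such a test is a one-liner with scipy:

from scipy import stats

# Shapiro-Wilk test on the response: returns the W statistic and a p-value
W, p_value = stats.shapiro(y)
print W, p_value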

In [218]:
import statsmodels.api as sm

# Recall our data
x = df[['TV', 'Radio', 'Newspaper']]
y = df['Sales']
In [226]:
# Add intercept
X = sm.add_constant(x)
model = sm.OLS(y, X).fit()
print model.summary2()
            Results: Ordinary least squares
=========================================================
Model:              OLS       R-squared:          0.897
Dependent Variable: Sales     Adj. R-squared:     0.896
No. Observations:   200       F-statistic:        570.3
Df Model:           3         Prob (F-statistic): <0.0001
Df Residuals:       196
---------------------------------------------------------
             Coef.  Std.Err.       t      P>|t|
---------------------------------------------------------
const        2.939    0.3119     9.42    <0.0001
TV           0.046    0.0014    32.81    <0.0001
Radio        0.189    0.0086    21.89    <0.0001
Newspaper   -0.001    0.0059    -0.18     0.8599
=========================================================

As you can see, statsmodels provides a much more descriptive output, and is well suited for these basic statistics.

Interpreting the output and variable selection:

This is a very important part! Here is what you need to know.

First, we can examine the model's F-statistic and its p-value. If the F-statistic is substantially greater than 1, we can conclude that at least one of our variables is related to sales, and we can continue with our analysis. There are special cases when n is small, or when we have more predictors than observations, but we will cover these later.

Now, we are more concerned about which variables are related, and how they are related.

Below, we can see the coefficients and p-values of the individual independent variables. These p-values judge whether each individual predictor is related to the dependent variable after adjusting for the other predictors. Simply put, each one measures the partial effect of adding that particular variable to the model in the presence of the other variables.

What we are really talking about here are the considerations for variable selection, since we want to keep only the predictors that actually contribute to predicting the dependent variable. While there are more sophisticated selection criteria (e.g. AIC, BIC), we will explore the basics here.

The two most basic methods are:

  • Forward Selection. Begin with a null model: an intercept and no predictors. Fit univariate regressions for all available predictors and add the predictor with the lowest RSS to the model. The next variable added is the one that results in the lowest RSS for the new two-variable model, and so on until some stopping rule is reached.
  • Backward Selection. Start with all variables in the model, then remove the variable with the largest p-value (the least significant). Refit the model and repeat the process until all remaining variables have p below some defined threshold (often 0.05 or 0.1).

There are also ways to combine these two techniques, called mixed selection, which is how I prefer to do it. Starting with no variables in the model, add the variable that provides the best fit (highest R^2 from the univariate regressions). We then continue adding variables in order of next-highest R^2, as long as their p-values in the multiple regression model remain significant.

Lastly, we need some way to judge overall model performance. Recall that RSE and R^2 are the most common measures of model fit. An R^2 of 1 would indicate that the predictor variables explain all the variation in the dependent variable (not realistic in practice). We can fold this into our variable selection rule: only include variables that also increase R^2 by at least some threshold (perhaps 0.5% or 1%).
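To make the forward/mixed idea concrete, here is a rough sketch of forward selection with an R^2-gain stopping rule (the helper name and the threshold are illustrative choices, not an established API):

import statsmodels.api as sm

def forward_select(data, response, min_r2_gain=0.005):
    remaining = [c for c in data.columns if c != response]
    selected = []
    best_r2 = 0.0
    while remaining:
        # Try adding each remaining predictor and keep the best improvement
        scores = []
        for cand in remaining:
            X = sm.add_constant(data[selected + [cand]])
            scores.append((sm.OLS(data[response], X).fit().rsquared, cand))
        r2, cand = max(scores)
        if r2 - best_r2 < min_r2_gain:
            break  # stop once the R^2 gain falls below the threshold
        selected.append(cand)
        remaining.remove(cand)
        best_r2 = r2
    return selected, best_r2

# e.g. forward_select(df, 'Sales') is expected, with the default threshold,
# to pick TV and Radio but not Newspaper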


So what does all this mean for our model?

Notice that when we include all three advertising types in the model, newspaper changes from significant in the univariate regression to insignificant in the multiple regression. We can therefore conclude that newspaper has no additional influence on sales once TV and radio are accounted for. This is a common situation when using real-life data.

Consider the correlation between newspaper and radio:

In [186]:
from scipy.stats.stats import pearsonr

pearsonr(X['Radio'], X['Newspaper'])
Out[186]:
(0.35410375076117517, 2.688835078719089e-07)

As we can see, these two variables are correlated. There is a tendency to spend more on newspaper advertising in the same markets where more is spent on radio advertising. So newspaper appears to be related to sales, but it is only piggy-backing on the effect of radio advertising.

The classic example of this effect is the relationship between ice cream sales and shark attacks, which taken at face value suggests that closing down ice cream parlours would reduce incidents. In reality, the increase in shark attacks is driven by rising temperatures, which bring more people to the beach, where they also buy ice cream.

Let's exclude newspaper from our model.

In [353]:
df = pd.read_csv('http://www-bcf.usc.edu/~gareth/ISL/Advertising.csv', index_col=0)

x = df[['TV', 'Radio']]
y = df['Sales']

X = sm.add_constant(x)

model = sm.OLS(y, X).fit()
# print model error
print model.mse_resid
print model.summary2()
2.82697451811
            Results: Ordinary least squares
======================================================
Model:              OLS   AIC:                778.3941
Dependent Variable: Sales BIC:                788.2891
No. Observations:   200   Log-Likelihood:     -386.20 
Df Model:           2     F-statistic:        859.6   
Df Residuals:       197   Prob (F-statistic): 4.83e-98
R-squared:          0.897 Scale:              2.8270  
Adj. R-squared:     0.896                             
------------------------------------------------------
          Coef.  Std.Err.    t    P>|t|  [0.025 0.975]
------------------------------------------------------
const     2.9211   0.2945  9.9192 0.0000 2.3403 3.5019
TV        0.0458   0.0014 32.9087 0.0000 0.0430 0.0485
Radio     0.1880   0.0080 23.3824 0.0000 0.1721 0.2038
------------------------------------------------------
Omnibus:          60.022   Durbin-Watson:      2.081  
Prob(Omnibus):    0.000    Jarque-Bera (JB):   148.679
Skew:             -1.323   Prob(JB):           0.000  
Kurtosis:         6.292    Condition No.:      425    
======================================================


Great! As you can see, we really did not lose any predictive power by removing newspaper. However, there could still be problems that cannot be diagnosed from these numbers or tests alone. One of the more robust and simple ways to assess the quality of our model is to plot the residuals, the error terms computed as the difference between the observed and predicted values.

For linear regression, we expect the residuals to be evenly distributed around zero.

In [284]:
qplot(y=model.resid, x = xrange(len(y)))+geom_point() \
+ geom_smooth(size=2,color='blue', se=False)
Out[284]:
<ggplot: (297961477)>

We can see a negative skew in these residuals, which indicates a few observations where the model badly over-predicts sales. Further, we can see a very subtle negative-to-positive-to-negative trend. This suggests a non-linear pattern whereby two or more variables interact with some synergistic effect. In other words, combining TV and Radio advertising can sometimes boost sales more than spending the same amount on a single medium.

We will discuss this in more detail later.

Prediction:

Once we have fit a satisfactory multiple regression model, we can begin to predict the response y from new values of the predictors. We should, however, attach an interval to quantify the uncertainty of our prediction of sales: a confidence interval for the average sales at a given level of TV and Radio spending, or a (wider) prediction interval for the sales of an individual market.
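A minimal sketch of getting those intervals from the TV + Radio model fitted above, assuming a reasonably recent statsmodels release (get_prediction and summary_frame are not available in very old versions):

import pandas as pd
import statsmodels.api as sm

# Hypothetical new spending levels (same units as the training data)
new_spend = pd.DataFrame([[100.0, 20.0], [250.0, 40.0]], columns=['TV', 'Radio'])
new_X = sm.add_constant(new_spend)

intervals = model.get_prediction(new_X).summary_frame(alpha=0.05)
# mean_ci_lower/upper: confidence interval for the average sales at this spend
# obs_ci_lower/upper:  wider prediction interval for an individual market
print intervals[['mean', 'mean_ci_lower', 'mean_ci_upper',
                 'obs_ci_lower', 'obs_ci_upper']]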

Other Regression Considerations:

What we have discussed thus far are very basic regression techniques. In the real world we encounter many different types of data, and non-linear relationships.

Dealing with Qualitative Variables:

In [6]:
url = 'http://www-bcf.usc.edu/~gareth/ISL/Credit.csv'

df = pd.read_csv(url)
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 400 entries, 0 to 399
Data columns (total 12 columns):
Unnamed: 0    400 non-null int64
Income        400 non-null float64
Limit         400 non-null int64
Rating        400 non-null int64
Cards         400 non-null int64
Age           400 non-null int64
Education     400 non-null int64
Gender        400 non-null object
Student       400 non-null object
Married       400 non-null object
Ethnicity     400 non-null object
Balance       400 non-null int64
dtypes: float64(1), int64(7), object(4)

In this dataset we have several qualitative variables, along with some quantitative. Let's see how this data is coded.

In [316]:
df.head()
Out[316]:
   Unnamed: 0   Income  Limit  Rating  Cards  Age  Education  Gender Student Married  Ethnicity  Balance
0           1   14.891   3606     283      2   34         11    Male      No     Yes  Caucasian      333
1           2  106.025   6645     483      3   82         15  Female     Yes     Yes      Asian      903
2           3  104.593   7075     514      4   71         11    Male      No      No      Asian      580
3           4  148.924   9504     681      3   36         11  Female      No      No      Asian      964
4           5   55.882   4897     357      2   68         16    Male      No     Yes  Caucasian      331

5 rows × 12 columns

We can see that the qualitative variables (also known as factors) are coded as text strings. Factors such as Gender, Student and Married have only two levels, for example male or female. To incorporate these into a regression model, we can create a dummy variable that takes on two values based on the two levels of the factor. To convert Gender to a dummy variable we simply create a new variable:

  • Gender = 1 if person is male, 0 if person is female.

In the regression model, if a person is female we simply do not add a coefficient: being female is represented as the absence of being male, so the Gender variable can be modeled by investigating only whether a person is male or not male. In this example, being female is the baseline condition, which is absorbed into β0, the regression intercept.

yi = β0 + β1 + εi if the person is Male

yi = β0 + εi if the person is Female

Now β0 can be interpreted as the average credit balance among females, β0 + β1 as the average credit balance among males, and β1 as the average difference in credit balance between males and females.

Coding males as 1 is arbitrary; you could have chosen females to be 1 instead. The only difference is in how you interpret the coefficients.

In [317]:
# Create new dummy variable: 1 = Male, 0 = Female
df['Males'] = 1
df.loc[df['Gender'] == 'Female', 'Males'] = 0

x = df[['Males']]
y = df['Balance']

X = sm.add_constant(x)
model = sm.OLS(y, X).fit()
# print model error
print 'p-values:'
print model.pvalues
print 'R^2:'
print model.rsquared_adj
p-values:
const    3.312981e-47
Males    6.685161e-01
dtype: float64
R^2:
-0.00205027122403

As we can see, there is no significant relationship between gender and credit balance.

Qualitative predictors with more than two levels:

Consider the variable Ethnicity, which has three levels. A single dummy variable cannot represent all the possible values, so instead we create two dummy variables:

  1. Asian: if Asian = 1, otherwise 0
  2. Caucasian: if Caucasian = 1, otherwise 0

Notice we do not need a third dummy variable, because African American is accounted for by the baseline condition (the intercept), similar to how female was in the previous example.

yi = β0 + β1 + εi if the person is Asian

yi = β0 + β2 + εi if the person is Caucasian

yi = β0 + εi if the person is African American

Now β0 is the average credit balance for African Americans, β1 is the difference in average balance between Asians and African Americans, and β2 is the difference in average balance between Caucasians and African Americans.

Thus we can now think of this as comparing the baseline condition β0 to each of the other conditions.

In [318]:
# Create dummy variables easily in pandas!
x = pd.get_dummies(df['Ethnicity'])
# Remove the African American variable
x = x[['Asian', 'Caucasian']]
x.head()
Out[318]:
   Asian  Caucasian
0      0          1
1      1          0
2      1          0
3      1          0
4      0          1

5 rows × 2 columns

In [352]:
X = sm.add_constant(x)
model = sm.OLS(y, X).fit()
model.summary()
Out[352]:
OLS Regression Results
Dep. Variable:    Balance        R-squared:           0.000
Model:            OLS            Adj. R-squared:     -0.005
Method:           Least Squares  F-statistic:         0.0434
No. Observations: 400            Prob (F-statistic):  0.957
Df Residuals:     397            Df Model:            2
                coef    std err        t      P>|t|
const         531.00     46.32     11.464     0.000
Asian         -18.69     65.02     -0.287     0.774
Caucasian     -12.50     56.68     -0.221     0.826

We can see that the baseline credit balance for African Americans is ~$531. It is also estimated that Asians carry ~$18 less debt than African Americans, and Caucasians ~$12 less than African Americans.

However, it turns out that this factor is not significant in our model of credit balance: the individual p-values are insignificant and the F-statistic is no larger than we would expect under the null, so we fail to reject the null hypothesis of no relationship between ethnicity and balance.

Non-Linear effects:

While a linear regression model often produces good results, the assumptions that the relationship between the predictors and the response is both additive and linear are often violated. The additive assumption means that the effect of changes in a predictor Xj on the response Y is independent of the values of the other predictors, while the linear assumption means that the change in the response per unit change in a predictor is constant.

There are, however, some basic ways to relax some of these assumptions:

Removing the Additive Assumption:

Recall in the Advertising sales data we concluded that TV and Radio have an effect on Sales. Under additive linear assumptions, the effect of TV on sales is independent of the amount spent on Radio advertising.

However, it may be the case that spending money on Radio advertising increases the effectiveness of TV advertising.

From a marketing perspective it could mean that splitting the advertising funds between Radio and TV could be more effective than spending it all on TV.

In [354]:
# import formula api as alias smf
import statsmodels.formula.api as smf

# formula: response ~ predictors; TV * Radio expands to TV + Radio + TV:Radio,
# so the explicit TV and Radio terms here are redundant but harmless
model = smf.ols(formula='Sales ~ 1 + TV + Radio + TV * Radio', data=df).fit()
model.summary()
Out[354]:
                         OLS Regression Results
==============================================================================
Dep. Variable:     Sales              R-squared:             0.968
Model:             OLS                Adj. R-squared:        0.967
Method:            Least Squares      F-statistic:           1963.
Date:              Mon, 26 May 2014   Prob (F-statistic):    6.68e-146
Time:              19:15:19           Log-Likelihood:        -270.14
No. Observations:  200                AIC:                   548.3
Df Residuals:      196                BIC:                   561.5
Df Model:          3
------------------------------------------------------------------------------
               coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept    6.7502      0.248     27.233      0.000         6.261     7.239
TV           0.0191      0.002     12.699      0.000         0.016     0.022
Radio        0.0289      0.009      3.241      0.001         0.011     0.046
TV:Radio     0.0011   5.24e-05     20.727      0.000         0.001     0.001
------------------------------------------------------------------------------
Omnibus:           128.132            Durbin-Watson:         2.224
Prob(Omnibus):       0.000            Jarque-Bera (JB):      1183.719
Skew:               -2.323            Prob(JB):              9.09e-258
Kurtosis:           13.975            Cond. No.:             1.80e+04
==============================================================================

It turns out we were right! The synergistic (interaction) effect increased our R^2 by about 7 percentage points, from 0.897 to 0.968. There may be cases where the individual effects become insignificant after adding an interaction term; it is still good practice to keep those individual terms in the model, because leaving them out tends to alter the meaning of the interaction.

Interactions are an important consideration for qualitative variables as well. Recall the Credit Card dataset: it is likely that changes in income have different effects on students vs. non-students, so we can multiply income by the student dummy variable. In terms of the model fit, this means the slope of balance on income is allowed to differ between students and non-students. We might eventually conclude that, as income increases, non-students accumulate larger credit balances than students do.
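A quick sketch of that interaction on the Credit data (assuming df still holds Credit.csv as loaded above; the variable name is illustrative):

import statsmodels.formula.api as smf

# Income * Student expands to Income + Student + Income:Student
credit_model = smf.ols(formula='Balance ~ Income * Student', data=df).fit()
# The Income:Student[T.Yes] coefficient is how the income slope differs for students
print credit_model.params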

Non-Linear Relationships:

In some cases, the true relationship between the predictors and the response could be non-linear. We can extend the linear model to accommodate non-linear effects using polynomial regression. There are, however, more advanced techniques, discussed later, that may be better suited to this type of problem.

More on polynomial fitting can be found in my other notebook <a target = '_self' href='http://rmdk.ca/machine-learning-curve-fitting/'>here</a>.

In [27]:
url = 'http://www-bcf.usc.edu/~gareth/ISL/Auto.csv'
df = pd.read_csv(url)

df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 397 entries, 0 to 396
Data columns (total 9 columns):
mpg             397 non-null float64
cylinders       397 non-null int64
displacement    397 non-null float64
horsepower      397 non-null object
weight          397 non-null int64
acceleration    397 non-null float64
year            397 non-null int64
origin          397 non-null int64
name            397 non-null object
dtypes: float64(3), int64(4), object(2)
In [30]:
# Drop missing horsepower values ('?') and convert the column to integer
df2 = df[df['horsepower']!='?'].copy()
df2['horsepower'] = df2['horsepower'].astype(int)
# formula: response ~ predictor + predictor
# I() makes sure the arithmetic is evaluated before the formula is parsed

model1 = smf.ols(formula='mpg ~ 1 + horsepower', data=df2).fit()
print model1.rsquared_adj
model2 = smf.ols(formula='mpg ~ 1 + horsepower + I(horsepower**2)', data=df2).fit()
print model2.rsquared_adj
0.604937868807
0.685952650207

Potential Problems:

These are some things to look out for while performing linear regression. Some of them we have already covered.

  1. Non-Linearity of the response-predictor relationships.
  2. Correlation of the error terms.
  3. Non-constant variance of error terms.
  4. Outliers.
  5. High-leverage points.
  6. Multicollinearity.

Non-Linear Relationships:

Linear regression assumes that there is a linear relationship between the predictors and response. If the actual relationship is very non-linear, then most of the conclusions that we draw from the model are less robust, and model prediction accuracy would be diminished.

Residual plots are the first step in testing this assumption. Below we plot residuals vs. predicted values for mpg ~ horsepower and for mpg ~ horsepower + horsepower^2. The quadratic version shows much less pattern in the residuals, which suggests that the quadratic term improves the model. If the relationship looks non-linear, a simple approach is to use non-linear transformations of the predictors; we will discuss more advanced techniques later.

Non-constant variance of error terms:

Non-constant variance means there is a distinct trend in the variance of the errors, for example variance that increases with the value of the response. This can also be seen in a residual vs. fitted plot, which takes on a funnel shape under this condition, called heteroscedasticity. In this case, a simple fix is often to transform the response variable towards a more normal distribution. If that still doesn't work, you may have to explore weighted least squares regression, which weights each observation by the inverse of its error variance.

In [337]:
qplot(y=model1.resid, x=model1.fittedvalues)+geom_point()+geom_smooth()
Out[337]:
<ggplot: (299288717)>
In [336]:
qplot(y=model2.resid, x=model2.fittedvalues)+geom_point()+geom_smooth()
Out[336]:
<ggplot: (299286513)>
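If the second plot had shown a funnel shape instead, a concave transformation of the response is a common first fix; a hedged sketch on the same Auto data:

import numpy as np
import statsmodels.formula.api as smf

# Model log(mpg) instead of mpg to dampen growing variance in the residuals
log_model = smf.ols(formula='np.log(mpg) ~ horsepower + I(horsepower**2)',
                    data=df2).fit()
print log_model.rsquared_adj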


Correlation of error terms:

One of the assumptions of linear regression is that the error terms (residuals) are uncorrelated. Important quantities such as standard errors, p-values and R^2 all depend on this assumption. Time series data often have correlated error terms, and there are many ways to deal with this; the Durbin-Watson statistic reported in the statsmodels summaries above is one quick check.

For us, a simple plot of the residuals in observation order can reveal any noticeable trend. These errors are fairly evenly distributed around 0, which is good. If they were not, this can sometimes be remedied by transforming the dependent variable towards a more normal distribution, but dependent error terms are more often due to experimental design and data collection. For example, a study predicting individuals' heights from their weights could be skewed if many of the study participants belong to the same family or follow the same diet.

In [340]:
qplot(y=model2.resid, x=xrange(len(model2.resid)))+geom_point()+geom_smooth()
Out[340]:
<ggplot: (300335109)>


Outliers:

An outlier is any record whose observed value is far from the value the model predicts. Outliers often arise from incorrect data recording. There are many ways to identify outliers, and removing them from the model will often improve predictive power and reduce error. However, we have to be careful to determine whether a record is actually an erroneous outlier or just a rare occurrence in the data, which we should not remove. Outliers can also indicate model deficiency; perhaps a missing predictor could account for the unexplained response.

In [344]:
# Using statsmodels: studentized residuals with Bonferroni-adjusted p-values
# provide a formal outlier test on the fitted model
model.outlier_test()

High Leverage Points:

Leverage points are unusual values of the predictor variables, rather than of the response. Removing high-leverage points often has a large effect on the model fit. Below is an influence plot for the quadratic mpg model, which takes both outliers and leverage into account. We can see several points that deviate from the average influence; again, it is up to our discretion whether to remove these points, based on our knowledge of the data and its source.

In [365]:
fig = sm.graphics.influence_plot(model2, criterion="cooks")
print fig
Figure(480x320)

Collinearity:

This occurs when two or more predictors are closely related to each other. Collinearity inflates the standard errors of the regression coefficients, which may cause us to improperly fail to reject the null hypothesis (since the standard error affects the t-value and p-value). The probability of detecting a non-zero coefficient (a relationship) is therefore reduced by collinearity, so we may miss a relationship that actually exists. Simply put, the importance of a variable may be masked by collinearity.

We can check for collinearity using a correlation matrix of the predictors, which we can use to eliminate variables during model fitting. If two variables are highly correlated, it is usual practice to remove the predictor with the lower univariate R^2 against the response. However, multicollinearity can also exist among three or more variables that are not strongly correlated pairwise. We use the variance inflation factor (VIF) to address this: for each predictor, the VIF is the ratio of the variance of its coefficient in the full model to the variance of its coefficient in a model containing only that predictor. The smallest possible value for VIF is 1, and as a rule of thumb, we should investigate (or remove) variables whose VIF exceeds 5.

An alternative solution is to combine the correlated variables into a single feature. For example, one could take the average of standardized versions of two predictors to form a new index altogether. The construction of such a variable is more of an art than a science, and will be explored in a future lesson.
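As a toy illustration of that idea using the Auto data from above (the index and the choice of variables are arbitrary, not a recommendation):

# Average the z-scores of two highly correlated predictors into one 'size' index
displacement_z = (df2['displacement'] - df2['displacement'].mean()) / df2['displacement'].std()
weight_z = (df2['weight'] - df2['weight'].mean()) / df2['weight'].std()
size_index = (displacement_z + weight_z) / 2.0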

In [82]:
# Grab mpg and the main engine/size predictors
data_to_check = df2.iloc[:, 0:-4]
# Check pairwise correlations
print data_to_check.corr()
                   mpg  cylinders  displacement  horsepower    weight
mpg           1.000000  -0.777618     -0.805127   -0.778427 -0.832244
cylinders    -0.777618   1.000000      0.950823    0.842983  0.897527
displacement -0.805127   0.950823      1.000000    0.897257  0.932994
horsepower   -0.778427   0.842983      0.897257    1.000000  0.864538
weight       -0.832244   0.897527      0.932994    0.864538  1.000000

[5 rows x 5 columns]

It turns out that VIF is not well implemented in Python for basic OLS regression. I will update when it is properly supported in statsmodels or scikit-learn.
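For reference, newer statsmodels releases do include a basic helper in statsmodels.stats.outliers_influence; a minimal sketch, assuming your version has it:

import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# VIF for each predictor of mpg (skip the constant column)
predictors = sm.add_constant(df2[['cylinders', 'displacement', 'horsepower', 'weight']])
for i, name in enumerate(predictors.columns):
    if name != 'const':
        print name, variance_inflation_factor(predictors.values, i)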
