These notes follow the Data Analyst Nanodegree webcast on multicollinearity in linear regression, given by Charlie and Stephen on Tuesday 16 June 2015.

Run the code locally (or on your own dataset) to investigate multicollinearity yourself. It's a good idea to keep track of the features you include, the R^2 value, and the multicollinearity issues that you observe.

In [1]:
#Import the useful data science packages!
import numpy as np
import pandas as pd
import statsmodels.api as sm
In [2]:
#Load the data
prosper = pd.read_csv('/Users/charlie/Downloads/prosperLoanData.csv')
# you can download the dataset from https://docs.google.com/document/d/1w7KhqotVi5eoKE3I_AZHbsxdr-NmcWsLTIiZrpxWx4w/pub and then
# run all this locally.
In [ ]:
prosper.columns
In [53]:
# Normalisation function used to ensure that each numerical variable has mean = 0 
# and standard deviation = 1. Does the same as the function in Lesson 3 of Intro to DS.
def normalise(data):
    mean = data.mean()
    stdev = data.std()
    return (data - mean)/stdev
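
As a quick sanity check (a minimal sketch on a made-up Series, not part of the webcast notebook), the normalised values should come back with mean approximately 0 and standard deviation 1:

example = pd.Series([2000.0, 3500.0, 5000.0, 12000.0])
print(normalise(example).mean())   # approximately 0.0 (up to floating-point error)
print(normalise(example).std())    # 1.0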
In [54]:
# Choose some of the many columns from the dataset. We're going to attempt to predict the 
# 'LoanOriginalAmount' from some of the other data.
prosper = prosper[['CreditScoreRangeLower','StatedMonthlyIncome', \
                   'IsBorrowerHomeowner', 'CreditScoreRangeUpper',\
                   'EmploymentStatus','Term','BorrowerRate','LenderYield',\
                   'LoanOriginalAmount']]
In [65]:
# Select just the numerical variables; we'll normalise these and create dummy
# variables from the categorical variables.
numerical_variables = ['CreditScoreRangeLower','StatedMonthlyIncome',\
                       'Term','CreditScoreRangeUpper','BorrowerRate',\
                       'LenderYield','LoanOriginalAmount']
In [77]:
#just remove rows with missing data and any duplicates, for simplicity!
prosper.dropna(inplace = True)
prosper.drop_duplicates(inplace = True)

#choose the numerical variables from prosper, remove the target to create features
features = prosper[numerical_variables].drop(['LoanOriginalAmount'],axis = 1)
#normalising numerical features improves the conditioning of the fit and makes the 
# coefficients comparable (don't normalise the dummy variables though, that's generally a bad idea!)
features = normalise(features)

#create dataframes of homeowner and employment dummies 
home_dum = pd.get_dummies(prosper.IsBorrowerHomeowner,prefix="homeowner")
job_dum = pd.get_dummies(prosper.EmploymentStatus,prefix = "job")
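
To see what get_dummies is doing, here's a minimal sketch on a hypothetical status column (not the Prosper data): each distinct category becomes its own 0/1 indicator column.

status = pd.Series(['Employed', 'Retired', 'Employed', 'Part-time'])
pd.get_dummies(status, prefix='job')
#    job_Employed  job_Part-time  job_Retired
# 0             1              0            0
# 1             0              0            1
# 2             1              0            0
# 3             0              1            0
# (the exact dtype of the 0/1 values depends on your pandas version)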

Interact with the following cell to adjust your model. Comment or uncomment lines to include or exclude terms, or use

features.drop(['ColumnName'], axis=1, inplace=True)

to drop a column.
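
For example, CreditScoreRangeLower and CreditScoreRangeUpper are the two ends of the same credit score band, so they move together almost perfectly; one illustrative adjustment (the column names come from the dataset, but treat the choice as an example rather than the webcast's prescribed model) is:

features.drop(['CreditScoreRangeUpper'], axis=1, inplace=True)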

In [ ]:
# uncomment to add a constant column
#features = sm.add_constant(features)

# uncomment these to add the dummy variables to the features
#features = features.join(job_dum)
#features = features.join(home_dum)

# uncomment these to drop a single dummy variable from each full set, avoiding the
# "dummy variable trap": with a constant included, a full set of dummies is perfectly
# collinear (but only do this if you've previously added them!)
#features.drop(['job_Employed'],axis=1,inplace=True)
#features.drop(['homeowner_True'],axis = 1,inplace=True)

# set the target values to fit the linear regression model
values = prosper.LoanOriginalAmount
In [ ]:
# Watch out for strongly correlated features!
features.corr()
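
Pairwise correlations only catch collinearity between two features at a time. A complementary check (a sketch using statsmodels' variance_inflation_factor, not part of the original webcast code) is to compute a variance inflation factor for each column; VIFs well above roughly 5-10 are a common warning sign that a feature is nearly a linear combination of the others.

from statsmodels.stats.outliers_influence import variance_inflation_factor

X = features.astype(float).values
vifs = pd.Series([variance_inflation_factor(X, i) for i in range(X.shape[1])],
                 index=features.columns)
print(vifs)   # ignore the VIF of 'const' if you added a constant column; it isn't meaningful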
In [78]:
# create, fit and summarise the model
# check out the coefficients and the condition number to look for multicollinearity

# A good resource for understanding all of this summary output can be found in the excellent
# online statistics textbook here: http://work.thaslwanter.at/Stats/html/statsModels.html#linear-regression-analysis-with-python
sm.OLS(values,features).fit().summary()
Out[78]:
                            OLS Regression Results                            
==============================================================================
Dep. Variable:     LoanOriginalAmount   R-squared:                       0.298
Model:                            OLS   Adj. R-squared:                  0.298
Method:                 Least Squares   F-statistic:                     3824.
Date:                Tue, 16 Jun 2015   Prob (F-statistic):               0.00
Time:                        18:03:43   Log-Likelihood:            -1.0772e+06
No. Observations:              107893   AIC:                         2.154e+06
Df Residuals:                  107880   BIC:                         2.155e+06
Df Model:                          12
Covariance Type:            nonrobust
=========================================================================================
                            coef    std err          t      P>|t|      [95.0% Conf. Int.]
-----------------------------------------------------------------------------------------
const                  9746.5002     27.336    356.541      0.000      9692.922  9800.079
CreditScoreRangeLower   644.4733     20.641     31.224      0.000       604.018   684.928
StatedMonthlyIncome     813.8288     16.268     50.025      0.000       781.943   845.715
Term                   1690.2776     16.863    100.236      0.000      1657.227  1723.329
LenderYield           -1630.9804     18.663    -87.392      0.000     -1667.559 -1594.401
job_Full-time         -2155.2679     41.243    -52.258      0.000     -2236.103 -2074.433
job_Not available     -1902.1674     81.420    -23.362      0.000     -2061.750 -1742.585
job_Not employed      -2156.2083    198.716    -10.851      0.000     -2545.689 -1766.727
job_Other             -1702.3283     88.973    -19.133      0.000     -1876.714 -1527.943
job_Part-time         -3515.9629    161.795    -21.731      0.000     -3833.079 -3198.847
job_Retired           -3358.6644    187.640    -17.899      0.000     -3726.436 -2990.892
job_Self-employed      -720.5137     71.666    -10.054      0.000      -860.977  -580.050
homeowner_False       -1080.1679     33.811    -31.947      0.000     -1146.437 -1013.898
==============================================================================
Omnibus:                   29859.330   Durbin-Watson:                   2.007
Prob(Omnibus):                 0.000   Jarque-Bera (JB):          1384223.160
Skew:                          0.572   Prob(JB):                         0.00
Kurtosis:                     20.510   Cond. No.                         15.7
==============================================================================
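
The condition number reported above (Cond. No. 15.7) is comfortably small, which suggests no severe multicollinearity in this particular set of normalised features and dummies. As a rough cross-check (a sketch, not part of the webcast code), you can compute the condition number of the design matrix yourself; values in the hundreds or thousands are the usual red flag:

print(np.linalg.cond(features.astype(float).values))   # should be close to the Cond. No. in the summary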
In [ ]: