Pulled directly from Wikipedia:
In statistics, regression analysis is a supervised learning statistical process for estimating the relationships among variables. It includes many techniques for modeling and analyzing several variables, when the focus is on the relationship between a dependent variable and one or more independent variables. More specifically, regression analysis helps one understand how the typical value of the dependent variable (or 'criterion variable') changes when any one of the independent variables is varied, while the other independent variables are held fixed.
Regression analysis is widely used for prediction and forecasting, where its use has substantial overlap with the field of machine learning. Regression analysis is also used to understand which among the independent variables are related to the dependent variable, and to explore the forms of these relationships. In restricted circumstances, regression analysis can be used to infer causal relationships between the independent and dependent variables. However this can lead to illusions or false relationships, so caution is advisable; for example, correlation does not imply causation.
Let's grab some data and get going.
import pandas as pd
import numpy as np

# Advertising sales data
df = pd.read_csv('http://www-bcf.usc.edu/~gareth/ISL/Advertising.csv', index_col=0)
print df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 200 entries, 1 to 200
Data columns (total 4 columns):
TV           200 non-null float64
Radio        200 non-null float64
Newspaper    200 non-null float64
Sales        200 non-null float64
dtypes: float64(4)
None
So, what types of questions could we answer with regards to this dataset?
I will not be going over all the theory of regression, but I will demonstrate a workflow which checks assumptions and describes the required steps and considerations.
The first step is to split our data into training and testing data.
from sklearn.cross_validation import train_test_split

# Help for train_test_split
# train_test_split?

x = df[['TV', 'Radio', 'Newspaper']]
y = df['Sales']

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.33, random_state=123)
Now let's get a sense of how our independent variables are related to the dependent variable (Sales). We do this by examining the p-values and coefficients of univariate regressions between each predictor and the dependent variable (DV).
The questions we want to ask are: (1) is each predictor significantly related to Sales, (2) what are the direction and magnitude of each relationship, and (3) does that relationship agree with our intuition about the data?
The last point is particularly important. As is the case with any modeling, sound intuition about your data takes paramount importance over the dataset on hand. For example, if the data suggested that increases in a predictor variable, smoking, were negatively related to lung cancer, it would likely not be satisfactory to include smoking in the model.
from sklearn.feature_selection import f_regression
# f_regression?

# Univariate regression of each predictor against Sales.
# f_regression returns the F statistics and the corresponding p-values.
F_values, p_values = f_regression(x_train, y_train)
F_values, p_values
(array([ 201.74678914,   50.60657905,    7.68597105]),
 array([  2.29293230e-28,   6.44463252e-11,   6.37042164e-03]))
Great! All of our predictors are significant (p < 0.05) in a univariate regression. The F values are also > 1, which suggests that all the predictors are related to Sales. This is not often the case, and has only happened because this dataset is tailored for machine learning analysis.
Now we can take a look at the coefficients. We will build a customized function that runs several linear models to do this.
from sklearn.linear_model import LinearRegression

def coefs(x_train, y_train, x_test, y_test):
    lr = LinearRegression(fit_intercept=True)
    feature_names = df.columns
    coef_list = []
    # Iterate through the predictor columns.
    # LinearRegression expects a 2D input for x,
    # so we add a new blank axis and slice out one column at a time.
    for col in xrange(x_train.shape[1]):
        x = x_train[:, np.newaxis]
        x = x[:, :, col]
        lr.fit(x, y_train)
        coef_list.append([feature_names[col], lr.coef_[0]])
    return coef_list

coefs(x_train, y_train, x_test, y_test)
[['TV', 0.046984295187230886],
 ['Radio', 0.18498775661674768],
 ['Newspaper', 0.05350821976042866]]
While we can begin to interpret the magnitude of these coefficients, the important thing at this stage is directionality. If there were any variables here which intuitively disagreed with a known reality in terms of direction, we would want to exclude them from the model. However, keep in mind that many times we do not know the expected behaviour, so we should not remove any variables in that case.
In this notebook, we will not get into interpreting magnitudes of the coefficients beyond this. The problem is similar to one we have experienced before in classification whereby we are currently mixing our units and it would not make sense (in most cases) to interpret the coefficients this way. If the predictors shared the same unit, say in dollars of advertising spent, you could begin exploring the magnitude of the relationships with some confidence.
There are two metrics typically used to assess model fit for linear regression: the residual standard error (RSE) and R^2.
The RSE is a measure of the average difference between the predicted y and our observed y, while R^2 is a measure of predictive power, measuring the proportion of explained variance.
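To make these metrics concrete, here is a minimal sketch of how they could be computed by hand for a single-predictor model. It assumes the train/test split from above, and that x_train and x_test are NumPy arrays (as the older train_test_split returns); if they are still DataFrames, use x_train[['TV']] instead.

from sklearn.linear_model import LinearRegression

# Fit a univariate model on the first column (TV)
lr = LinearRegression(fit_intercept=True)
lr.fit(x_train[:, [0]], y_train)

# Residual sum of squares on the test set
residuals = y_test - lr.predict(x_test[:, [0]])
rss = np.sum(residuals ** 2)

# RSE: roughly the average size of a residual,
# adjusted for the two estimated parameters (intercept and slope)
n = len(y_test)
rse = np.sqrt(rss / (n - 2))

# R^2: the proportion of variance in y explained by the model
tss = np.sum((y_test - np.mean(y_test)) ** 2)
r2 = 1 - rss / tss

print 'RSE:', rse
print 'R^2:', r2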
Let's modify our function from above to include these new metrics.
def models(x_train, y_train, x_test, y_test):
    lr = LinearRegression(fit_intercept=True)
    feature_names = df.columns
    coef_list = []
    error_list = []
    r2_list = []
    feature_list = []
    # Iterate through the predictor columns.
    # LinearRegression expects a 2D input for x,
    # so we add a new blank axis and slice out one column at a time.
    for col in xrange(len(feature_names) - 1):
        xtrain = x_train[:, np.newaxis]
        xtrain = xtrain[:, :, col]
        xtest = x_test[:, np.newaxis]
        xtest = xtest[:, :, col]
        lr.fit(xtrain, y_train)
        # Error: residual sum of squares on the test set
        error_list.append(np.sum((lr.predict(xtest) - y_test) ** 2))
        # R^2 on the test set
        r2_list.append(lr.score(xtest, y_test))
        coef_list.append(lr.coef_[0])
        feature_list.append(feature_names[col])
    out = zip(feature_list, r2_list, error_list, coef_list)
    return out

model_outs = models(x_train, y_train, x_test, y_test)
model_outs
[('TV', 0.62450810246495947, 326.33971180270174, 0.046984295187230886),
 ('Radio', 0.43882257063230956, 79.574614380832173, 0.18498775661674768),
 ('Newspaper', 0.038253869337874224, 830.59166804447102, 0.05350821976042866)]
Now we have a pretty good sense of how our predictors relate to Sales. For example, TV advertising can explain ~62% of the variation in the data, yet its error is rather high, which means that the deviation of this predictor can be quite variable (possibly due to some outliers).
Let's use this information to build a multiple regression model that improves our R^2 value, and thus our future predictive power.
scikit-learn is best for doing machine learning with emphasis on predictive modeling, often with large and sparse data.
statsmodels on the other hand is better at "traditional" statistics and econometrics, with much stronger emphasis on parameter estimation and (statistical) testing.
statsmodels will help us understand the basics more clearly before we move onto more complex regression techniques.
Before we get into the modeling, we need to check if our dependent variable
sales is normally distributed, and thus satisfies the assumptions of our regression.
# Plot the results thus far
from ggplot import *

# Enable inline plotting
%matplotlib inline

qplot(y) + geom_histogram()
This is probably as close to normal as we can get (despite the right skew). Under circumstances with very non-normal data, you can transform the data to coerce it to be normally distributed; we will cover that another day.
There is a general misconception surrounding statistical analyses that require a normality assumption. It is not very useful to test for normality using a statistical test like Anderson-Darling. This is because for small datasets, even large deviations from normality are not detected, and for large datasets, even the smallest departure from normality will lead to a rejected null. <a href = "http://www.r-bloggers.com/normality-tests-don’t-do-what-you-think-they-do/">See more here</a>.
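For reference, here is roughly what such a test looks like in SciPy, applied to our Sales variable; this is only a sketch, and as noted above the result should not be treated as decisive.

from scipy.stats import anderson

# Anderson-Darling test of Sales against the normal distribution
stat, critical_values, significance_levels = anderson(y, dist='norm')
print 'Test statistic:', stat
print 'Critical values:', critical_values
print 'Significance levels (%):', significance_levels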
import statsmodels.api as sm

# Recall our data
x = df[['TV', 'Radio', 'Newspaper']]
y = df['Sales']

# Add intercept
X = sm.add_constant(x)

model = sm.OLS(y, X).fit()
print model.summary2()
                 Results: Ordinary least squares
=================================================================
Model:               OLS     AIC:                     778.3941
Dependent Variable:  Sales   BIC:                     788.2891
No. Observations:    200     Log-Likelihood:          -386.20
Df Model:            2       F-statistic:             859.6
Df Residuals:        197     Prob (F-statistic):      4.83e-98
R-squared:           0.897   Scale:                   2.8270
Adj. R-squared:      0.896
-----------------------------------------------------------------
          Coef.   Std.Err.     t       P>|t|    [0.025    0.975]
-----------------------------------------------------------------
const     2.9211    0.2945    9.9192   0.0000   2.3403    3.5019
TV        0.0458    0.0014   32.9087   0.0000   0.0430    0.0485
Radio     0.1880    0.0080   23.3824   0.0000   0.1721    0.2038
-----------------------------------------------------------------
Omnibus:          60.022     Durbin-Watson:        2.081
Prob(Omnibus):    0.000      Jarque-Bera (JB):     148.679
Skew:             -1.323     Prob(JB):             0.000
Kurtosis:         6.292      Condition No.:        425
=================================================================
As you can see, statsmodels provides a much more descriptive output, and it is well suited for these basic statistics.
This is a very important part! Here is what you need to know.
First, we can examine the model's F statistic and p-value. If our F statistic is greater than 1, we can conclude that at least one of our variables is related to Sales, and we can continue with our analysis. There are special cases when n is small, or when we have more predictors than observations, but we will cover these later.
Now, we are more concerned about which variables are related, and how they are related.
Below, we can see the p-values of the individual independent variables. These p-values judge whether each individual predictor is related to the dependent variable, after adjusting for the other predictors. Simply put, each measures the partial effect of adding that particular variable to the model, in the presence of the other variables.
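These quantities are available directly on the fitted results object; a quick sketch, assuming model is the OLS fit from above.

# Overall F statistic and its p-value
print 'F statistic:', model.fvalue
print 'Prob (F-statistic):', model.f_pvalue

# Per-coefficient p-values: the partial effect of each predictor,
# adjusting for the others
print model.pvalues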
What we are talking about here are the considerations for variable selection, since we want only the predictors that actually influence the prediction of our dependent variable. While there are more sophisticated methods of variable selection (AIC, BIC), we will explore the basics here.
The two most basic methods are:

Forward selection: start with no variables, then add the variable that results in the lowest RSS to the model. The next variable added will be the one that results in the lowest RSS for the new 2-variable model. This is continued until some stopping rule.

Backward selection: start with all variables in the model and remove the one with the largest p-value (the least significant). Refit the model and repeat the process until all remaining variables have p < some defined threshold (often < 0.05 or 0.1).
There are also ways to combine these two techniques, which is how I prefer to do it. This is called mixed selection. Starting with no variables in the model, add the variable that provides the best fit (highest R^2 from the univariate regression). We then continue adding variables by next-highest R^2, as long as their p-values in the multiple regression model remain significant.
Lastly, we need some way to judge overall model performance. Recall that RSE and R^2 are the most common measures of model fit. An R^2 of 1 would indicate that the predictor variables explain all the variation of the dependent variable (not realistic to find this in nature). We can add this idea to our variable selection technique: we will only include variables which also increase R^2 by at least some threshold (perhaps 0.5% or 1%).
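As a rough illustration of the p-value-based (backward) approach described above, here is a sketch that repeatedly drops the least significant predictor; the helper function and the 0.05 threshold are illustrative choices, not a standard library routine.

import statsmodels.api as sm

def backward_select(y, x, threshold=0.05):
    # Drop the least significant predictor until all p-values pass the threshold
    predictors = list(x.columns)
    while predictors:
        X = sm.add_constant(x[predictors])
        fit = sm.OLS(y, X).fit()
        # p-values of the predictors only (skip the intercept)
        pvals = fit.pvalues.drop('const')
        worst = pvals.idxmax()
        if pvals[worst] < threshold:
            return fit, predictors
        predictors.remove(worst)
    return None, []

fit, kept = backward_select(df['Sales'], df[['TV', 'Radio', 'Newspaper']])
print kept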
Notice that when we include all three advertising types in the model, newspaper changes from significant in our univariate regression to insignificant in the multiple regression. We can therefore conclude that newspaper has no influence (positive or negative) on sales. This is a common situation when using real-life data.
Consider the correlation between Radio and Newspaper advertising:
from scipy.stats import pearsonr

pearsonr(X['Radio'], X['Newspaper'])
As we can see, these two variables are correlated. This means that there is a tendency to spend more on newspaper advertising in the same markets where more is spent on radio advertising. What this means is that newspaper appears to be related to sales, but it is only piggy-backing on the effect of radio.
The classic example of this effect is the relationship between ice cream sales and shark attacks, which suggests that closing down ice cream parlours would reduce incidents. However, the increase in shark attacks is driven by increasing temperatures, which bring more people to the beach, who also happen to purchase ice cream.
Let's exclude newspaper from our model.
df = pd.read_csv('http://www-bcf.usc.edu/~gareth/ISL/Advertising.csv', index_col=0)

x = df[['TV', 'Radio']]
y = df['Sales']
X = sm.add_constant(x)

model = sm.OLS(y, X).fit()

# Print the mean squared error of the residuals
print model.mse_resid
print model.summary2()
2.82697451811

                 Results: Ordinary least squares
=================================================================
Model:               OLS     AIC:                     778.3941
Dependent Variable:  Sales   BIC:                     788.2891
No. Observations:    200     Log-Likelihood:          -386.20
Df Model:            2       F-statistic:             859.6
Df Residuals:        197     Prob (F-statistic):      4.83e-98
R-squared:           0.897   Scale:                   2.8270
Adj. R-squared:      0.896
-----------------------------------------------------------------
          Coef.   Std.Err.     t       P>|t|    [0.025    0.975]
-----------------------------------------------------------------
const     2.9211    0.2945    9.9192   0.0000   2.3403    3.5019
TV        0.0458    0.0014   32.9087   0.0000   0.0430    0.0485
Radio     0.1880    0.0080   23.3824   0.0000   0.1721    0.2038
-----------------------------------------------------------------
Omnibus:          60.022     Durbin-Watson:        2.081
Prob(Omnibus):    0.000      Jarque-Bera (JB):     148.679
Skew:             -1.323     Prob(JB):             0.000
Kurtosis:         6.292      Condition No.:        425
=================================================================
Our model fit is essentially unchanged after excluding newspaper. However, there could also be some problems that cannot be diagnosed using these numbers or tests. One of the more robust and simple ways to assess the quality of our model is to plot the residuals, which are the error terms, computed as the difference between the predicted and observed data.
For linear regression, we expect that the residuals (the prediction errors) are evenly distributed around zero.
qplot(y=model.resid, x=xrange(len(y))) + geom_point() + \
    geom_smooth(size=2, color='blue', se=False)
We can see that there is a negative skew on these residuals, which suggests that there is some bias towards under-prediction in the model. Further, we can see a very subtle negative-to-positive-to-negative trend. This suggests a non-linear pattern whereby two or more variables are interacting with some synergistic effects. In other words, sometimes combining TV and Radio advertising can boost sales more than using any single medium.
We will discuss this more later.
Once we have fit a satisfactory multiple regression model, we can quickly begin to predict the response y using new values for the predictors. Yet, we should utilize a confidence interval to quantify the uncertainty of our prediction of sales. Using this method we can predict that the true sales will be within some 95% confidence interval given any new values of the predictors.
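Here is a minimal sketch of how this could be done with statsmodels, assuming a reasonably recent version in which the fitted results expose get_prediction; the new spending values below are made up purely for illustration.

# Hypothetical new markets: TV and Radio spend
new_data = pd.DataFrame([[100.0, 20.0], [200.0, 40.0]], columns=['TV', 'Radio'])
new_X = sm.add_constant(new_data)

# Point predictions plus 95% intervals for the mean response
# and for individual observations
pred = model.get_prediction(new_X)
print pred.summary_frame(alpha=0.05)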
What we have discussed thus far are very basic regression techniques. In the real world we have many different types of data sources, and non-linear relationships.
url = 'http://www-bcf.usc.edu/~gareth/ISL/Credit.csv'
df = pd.read_csv(url)
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 400 entries, 0 to 399
Data columns (total 12 columns):
Unnamed: 0    400 non-null int64
Income        400 non-null float64
Limit         400 non-null int64
Rating        400 non-null int64
Cards         400 non-null int64
Age           400 non-null int64
Education     400 non-null int64
Gender        400 non-null object
Student       400 non-null object
Married       400 non-null object
Ethnicity     400 non-null object
Balance       400 non-null int64
dtypes: float64(1), int64(7), object(4)
In this dataset we have several qualitative variables, along with some quantitative. Let's see how this data is coded.
(df.head() preview omitted: 5 rows × 12 columns)
We can see that the data for the qualitative variables (also known as factors) are coded as text strings. Factors such as Gender and Married only have two levels, for example: male and female. To incorporate these into a regression model, we can create a dummy variable that takes on two values based on the two levels of the factor. To convert Gender to a dummy variable we simply create a new variable: Gender = 1 if the person is male, 0 if the person is female.
In the regression model, if a person is female we simply do not compute a coefficient, since in our model being female is represented as the absence of being male; thus the Gender variable can be modeled by only investigating the properties of either being male, or not being male. In this example, being female is considered the baseline condition, which is associated with β0, the regression intercept.
yi = β0 + β1 + εi if the person is male
yi = β0 + εi if the person is female
Now β0 can be interpreted as the average credit balance among females, β0 + β1 as the average credit balance among males, and β1 as the average difference in credit balance between males and females.
Coding males as 1 is arbitrary; you could have chosen females to be equal to 1. The only difference is in how you interpret the coefficients.
# Create a new dummy variable: 1 if male, 0 if female
df['Males'] = 1
df.loc[df['Gender'] == 'Female', 'Males'] = 0

x = df[['Males']]
y = df['Balance']
X = sm.add_constant(x)
model = sm.OLS(y, X).fit()

print 'p-values:'
print model.pvalues
print 'R^2:'
print model.rsquared_adj
p-values:
const    3.312981e-47
Males    6.685161e-01
dtype: float64
R^2:
-0.00205027122403
As we can see, there is thus far no significant relationship between gender and credit balance.
Consider the variable Ethnicity, which has three levels. A single dummy variable cannot represent all the possible values. Instead we create two dummy variables.
Notice we do not need a third dummy variable, because African American is accounted for by the baseline condition (intercept), similar to how female was in the previous example.
yi = β0 + β1 + εi if the person is Asian
yi = β0 + β2 + εi if the person is Caucasian
yi = β0 + εi if the person is African American
Now β0 is the average credit balance for African Americans, β1 is the difference in the average balance between Asians and African Americans, and β2 is the difference in the average balance between Caucasians and African Americans.
Thus we can now think about this as comparing the baseline condition β0 to the other conditions.
# Create dummy variables easily in pandas!
x = pd.get_dummies(df['Ethnicity'])

# Remove the African American variable
x = x[['Asian', 'Caucasian']]
x.head()
(x.head() preview omitted: 5 rows × 2 columns)
X = sm.add_constant(x)
model = sm.OLS(y, X).fit()
model.summary()
We can see that the baseline credit balance for African Americans is 531. It is also estimated that Asians carry ~$18 less debt than African Americans, and Caucasians ~$12 less debt than African Americans. However, it turns out that this factor is not significant in our model of credit balance, since the p-values are insignificant and the F statistic is close to 1, which suggests that there is no relationship between the variables (we fail to reject the null hypothesis).
While a linear regression model often may produce good results, the assumptions that the relationship between the predictors and the response is both additive and linear are often violated. The additive assumption means that the effect of changes in a predictor Xj on the response Y is independent of the values of the other predictors, while the linear assumption means that the change in the response per unit change in the predictor is constant.
There are, however, some basic ways to relax these assumptions: adding interaction terms (relaxing the additive assumption) and adding polynomial terms (relaxing the linear assumption).
Recall that in the Advertising sales data we concluded that both TV and Radio have an effect on Sales. Under the additive linear assumptions, the effect of TV on sales is independent of the amount spent on Radio. However, it may be the case that spending money on Radio advertising increases the effectiveness of TV advertising. From a marketing perspective, it could mean that splitting the advertising funds between Radio and TV could be more effective than spending it all on one medium.
# Import the formula API as alias smf
import statsmodels.formula.api as smf

# Reload the Advertising data (df currently holds the Credit data)
df = pd.read_csv('http://www-bcf.usc.edu/~gareth/ISL/Advertising.csv', index_col=0)

# formula: response ~ predictor + predictor
model = smf.ols(formula='Sales ~ 1 + TV + Radio + TV * Radio', data=df).fit()
model.summary()
(model.summary() output omitted; Prob (F-statistic): 6.68e-146)
It turns out we were right! The synergistic effect increased our R^2 by about 6%. There may be examples where the individual effects are insignificant after adding an interaction term; however, it is still good practice to keep these individual (main effect) terms in the model, because leaving them out tends to alter the meaning of the interaction.
Interaction tends to be an important consideration for qualitative variables as well. Recall the Credit Card dataset. It is likely that changes in income would have different effects on students versus non-students. Thus we can multiply income with the student dummy variable. In terms of the model fit, this means that the slopes of the lines for students and non-students will be different when estimating credit balance using income. We could eventually conclude that non-students may have larger credit balances as income increases, compared to individuals that are students.
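A sketch of what that interaction model could look like with the formula API, reloading the Credit data so the cell stands on its own; patsy dummy-codes the Yes/No Student column automatically, and Income * Student expands to the two main effects plus the interaction.

credit = pd.read_csv('http://www-bcf.usc.edu/~gareth/ISL/Credit.csv', index_col=0)

# Income * Student expands to Income + Student + Income:Student,
# so the main effects stay in the model alongside the interaction
interaction_model = smf.ols(formula='Balance ~ Income * Student', data=credit).fit()
print interaction_model.params
print interaction_model.pvalues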
In some cases, the true relationship between the predictors and the response variable could be non-linear. We can extend the linear model to accommodate non-linear effects using polynomial regression. There are, however, other more advanced techniques that will be discussed later that may be better suited to this type of application.
More on polynomial fitting can be found in my other notebook <a target = '_self' href='http://rmdk.ca/machine-learning-curve-fitting/'>here</a>.
url = 'http://www-bcf.usc.edu/~gareth/ISL/Auto.csv'
df = pd.read_csv(url)
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 397 entries, 0 to 396
Data columns (total 9 columns):
mpg             397 non-null float64
cylinders       397 non-null int64
displacement    397 non-null float64
horsepower      397 non-null object
weight          397 non-null int64
acceleration    397 non-null float64
year            397 non-null int64
origin          397 non-null int64
name            397 non-null object
dtypes: float64(3), int64(4), object(2)
# Make horsepower into an integer column
# (drop rows where horsepower is missing, coded as '?')
df2 = df[df['horsepower'] != '?'].copy()
df2['horsepower'] = df2['horsepower'].astype(int)

# formula: response ~ predictor + predictor
# I() makes sure the operation is done before fitting the model
model1 = smf.ols(formula='mpg ~ 1 + horsepower', data=df2).fit()
print model1.rsquared_adj

model2 = smf.ols(formula='mpg ~ 1 + horsepower + I(horsepower**2)', data=df2).fit()
print model2.rsquared_adj
These are some things to look out for while performing linear regression. Some of this we have already covered.
Linear regression assumes that there is a linear relationship between the predictors and response. If the actual relationship is very non-linear, then most of the conclusions that we draw from the model are less robust, and model prediction accuracy would be diminished.
Residual plots are the first step for testing this assumption. Here we plot residuals vs. predicted values for mpg ~ horsepower and mpg ~ horsepower + horsepower^2. We can see that the quadratic version has much less pattern in the residuals, which suggests that the quadratic fit improves the model. If it looks like the predictors are non-linear, a simple approach is to use non-linear transformations. We will discuss more advanced techniques later.
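A sketch of those residual plots, assuming model1 and model2 from the cell above; plain matplotlib is used here rather than the ggplot syntax from earlier.

import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 2, figsize=(10, 4))

# Residuals vs. fitted values for the linear fit
axes[0].scatter(model1.fittedvalues, model1.resid)
axes[0].set_title('mpg ~ horsepower')

# Residuals vs. fitted values for the quadratic fit
axes[1].scatter(model2.fittedvalues, model2.resid)
axes[1].set_title('mpg ~ horsepower + horsepower^2')

for ax in axes:
    ax.axhline(0, color='grey')
    ax.set_xlabel('Fitted values')
    ax.set_ylabel('Residuals')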
Non-constant variance of error terms:
Non-constant variance occurs when there is a distinct trend in the spread of the errors, for example when the variance increases with the value of the response. This can also be viewed with a residual vs. fitted plot: the resulting plot will take on a funnel shape under this condition, called heteroscedasticity. In this case, a simple fix is often transforming the response variable towards a more normal distribution. If this still doesn't work, you may have to explore weighted least squares regression, which weights the model by the inverse of the error variances.
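Here is a sketch of both remedies on the advertising data, reloading it so the cell stands alone; the log transform and this particular choice of weights (assuming the error variance grows with the fitted value) are illustrative heuristics, not the only options.

adv = pd.read_csv('http://www-bcf.usc.edu/~gareth/ISL/Advertising.csv', index_col=0)

# Remedy 1: transform the response to stabilise the variance
log_fit = smf.ols(formula='np.log(Sales) ~ TV + Radio', data=adv).fit()

# Remedy 2: weighted least squares, weighting each observation by the
# inverse of a rough estimate of its error variance (here assumed
# proportional to the squared fitted value from a plain OLS fit)
ols_fit = smf.ols(formula='Sales ~ TV + Radio', data=adv).fit()
weights = 1.0 / (ols_fit.fittedvalues ** 2)
wls_fit = smf.wls(formula='Sales ~ TV + Radio', data=adv, weights=weights).fit()
print wls_fit.params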
One of the assumptions of linear regression is that the error terms (residuals) are uncorrelated. Important indicators such as p-values, standard errors and R^2 all depend on this assumption. Time series data often has correlated error terms, and there are many ways to deal with it.
For us, a simple plot of the residuals can reveal if there is any noticeable trend. Our errors are fairly evenly distributed around 0, which is good. If they are not, this can often be remedied by transforming our dependent variable to a more normal distribution. However, sources of dependent error terms are often due to experimental design and data sources/collection. For example, a study predicting individuals' heights from their weights could be skewed if many of the study members are of the same family, or are on the same diet.
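Beyond eyeballing the residual plot, the Durbin-Watson statistic already shown in the regression summaries can be computed directly; values near 2 suggest little autocorrelation. A small sketch, assuming model is a fitted OLS results object.

from statsmodels.stats.stattools import durbin_watson

# Roughly 2 means uncorrelated residuals; values toward 0 or 4
# indicate positive or negative autocorrelation respectively
print durbin_watson(model.resid)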
An outlier is any record whose predicted value is far off from the observed value. Outliers often arise from incorrect data recording. There are many ways to identify outliers, and removing them from the model will often result in an improvement in predictive power and reduced error. However, we have to be careful to understand if a record is actually an outlier (error) or just a rare occurrence in the data, which we should not remove. Outliers can also indicate model deficiency; perhaps a missing predictor could account for the unexpected response.
# Using statsmodels
model.outlier_test()
High Leverage Points:
Leverage points are unusual values in the predictor variables, rather than the response variable. Removing high leverage points often has a great effect on the model fit. Here is an influence plot for one of our fitted models; this plot takes into account both outliers and leverage. We can see several points here that seem to deviate from the average influence, yet again it is up to our discretion whether we choose to remove these points or not, based on our knowledge of the data and its source.
fig = sm.graphics.influence_plot(model2, criterion="cooks")
print fig
Collinearity:
This occurs when two or more predictors are closely related to each other. Collinearity will cause the standard errors to grow, since it reduces the accuracy of the regression coefficient estimates, which may cause us to improperly fail to reject the null hypothesis (since the standard error affects the p-value). The probability of detecting a non-zero coefficient (a relationship) is reduced by collinearity, thus we may not find a relationship that exists. Simply speaking, the importance of a variable may be missed due to collinearity.
We can test for collinearity using a correlation matrix between the predictors, which we can use to eliminate variables within the model fitting process. If two variables are highly correlated, it is usual practice to remove the predictor with the lower R^2. However, multicollinearity exists when there is correlation between three or more variables. We use the variance inflation factor (VIF) to address this issue. This statistic is the ratio of the variance of a coefficient when fit in the full model to its variance when fit in a model on its own. The smallest possible value for VIF is 1, and as a rule of thumb, we should remove variables that have a VIF value that exceeds 5.
An alternative solution is to combine the correlated variables into a single feature. For example, one could take the average of standardized versions of two predictors in order to make a new index altogether. The construction of this new variable is more of an art than a science, and will be explored in a future lesson.
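A sketch of what that could look like for two of the correlated Auto predictors; the 'size_index' name is invented for illustration, and averaging z-scores is just one simple choice.

# Standardize two highly correlated predictors and average them
# into a single (hypothetical) size index
autos = df2.copy()
z_disp = (autos['displacement'] - autos['displacement'].mean()) / autos['displacement'].std()
z_weight = (autos['weight'] - autos['weight'].mean()) / autos['weight'].std()
autos['size_index'] = (z_disp + z_weight) / 2.0

model3 = smf.ols(formula='mpg ~ horsepower + size_index', data=autos).fit()
print model3.rsquared_adj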
# Grab a few columns
data_to_check = df2.iloc[:, 0:-4]

# Check the correlation between predictors
print data_to_check.corr()
                   mpg  cylinders  displacement  horsepower    weight
mpg           1.000000  -0.777618     -0.805127   -0.778427 -0.832244
cylinders    -0.777618   1.000000      0.950823    0.842983  0.897527
displacement -0.805127   0.950823      1.000000    0.897257  0.932994
horsepower   -0.778427   0.842983      0.897257    1.000000  0.864538
weight       -0.832244   0.897527      0.932994    0.864538  1.000000

[5 rows x 5 columns]
It turns out that VIF is not well implemented in Python for basic OLS regression. I will update this when it is properly supported in statsmodels or scikit-learn.
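For what it's worth, more recent statsmodels releases do ship a variance_inflation_factor helper; here is a sketch of how it could be used on the Auto predictors, assuming such a version is installed.

from statsmodels.stats.outliers_influence import variance_inflation_factor

predictors = df2[['cylinders', 'displacement', 'horsepower', 'weight']]
exog = sm.add_constant(predictors)

# VIF for each predictor (index 0 is the constant, so start at 1)
for i, name in enumerate(predictors.columns):
    print name, variance_inflation_factor(exog.values, i + 1)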