This is a series of notebooks to document my learning, and hopefully to help others learn machine learning. I would love suggestions / corrections / feedback for these notebooks.

Visit my webpage for more notebooks like this.

Email me: [email protected]

I'd love if you shared this

In [6]:

```
social()
```

Out[6]:

- Linear Regression OLS
- Multiple Regression
- Best Practices

- Wikipedia (introduction content)

- Improve flow

Pulled directly from wikipedia:

In statistics, regression analysis is a **supervised learning** statistical process for estimating the relationships among variables. It includes many techniques for modeling and analyzing several variables, when the focus is on the relationship between a dependent variable and one or more independent variables. More specifically, regression analysis helps one understand how the typical value of the dependent variable (or 'criterion variable') changes when any one of the independent variables is varied, while the other independent variables are held fixed.

Regression analysis is widely used for prediction and forecasting, where its use has substantial overlap with the field of machine learning. Regression analysis is also used to understand which among the independent variables are related to the dependent variable, and to explore the forms of these relationships. In restricted circumstances, regression analysis can be used to infer causal relationships between the independent and dependent variables. However this can lead to illusions or false relationships, so caution is advisable; for example, correlation does not imply causation.

Let's grab some data and get going.

In [5]:

```
import pandas as pd
import numpy as np
# Advertising Sales data
df = pd.read_csv('http://www-bcf.usc.edu/~gareth/ISL/Advertising.csv', index_col=0)
print df.info()
```

In [166]:

```
df.describe()
```

Out[166]:

So, what types of questions could we answer with regards to this dataset?

*Is there a relationship between the types of advertising, and sales?**How strong is this relationship?**What variables contribute to more sales?**How well can we predict sales?**Is there synergy among variables that contributes to the amount of sales?*

**I will not be going over all the theory of regression, but I will demonstrate a workflow which checks assumptions and describes the required steps and considerations.**

The first step is to split our data into training and testing data.

In [171]:

```
from sklearn.cross_validation import train_test_split
#Help for train test spit
#train_test_split?
x = df[['TV', 'Radio', 'Newspaper']]
y = df['Sales']
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.33, random_state=123)
```

Now lets get a sense about how our independent variables are related to the dependent variable (`Sales`

). We do this by examining the p values and coefficients of univariate regressions between each predictor and dependent variable (DV).

The questions we want to ask are:

- "Which variables are significant?"
- "What type of relationship do the predictors have with the DV? (positive / negative)"
- "What is the strength of this relationship"
- "Does this make intuitive sense?"

The last point is particularly important. As is the case with any modeling, accurate intuition about your data takes paramount importance over the dataset on hand. For example, if the data revealed that a increases in a predictor variable, `smoking`

, was found to be negatively related to lung cancer, it would likely not be satisfactory to include `smoking`

in the model.

In [172]:

```
from sklearn.feature_selection import f_regression
#f_regression?
# Perform univariate regression and report p values
p_values = f_regression(x_train, y_train)
p_values
```

Out[172]:

Great! All of our predictors are significant ( `p < 0.05`

) in a univariate regression. The `F`

values are also > 1 which suggests that all the predictors are related to sales. This is not often the case and has only happened because this dataset is tailored for machine learning analysis.

Now we can take a look at the coefficients. We will build a customized function that runs several linear models to do this.

In [174]:

```
def coefs(x_train, y_train, x_test, y_test):
lr = LinearRegression(fit_intercept=True)
feature_names = df.columns
coef_list = []
for col in xrange(len(x_train[1])):
# Iterate through columns,
# LinearRegression expects 2D inputs for x
# So we add a new blank axis
x = x_train[:, np.newaxis]
x = x[:, :, col]
lr.fit(x, y_train)
coef_list.append([feature_names[col], lr.coef_[0]])
return coef_list
coefs(x_train, y_train, x_test, y_test)
```

Out[174]:

While we can begin to interpret the magnitude of these coefficients, the important thing at this stage is directionality. If there were any variables here which intuitively disagreed with a *known* reality in terms of direction, we would want to exclude them from the model. However, keep in mind that many times we do not know the expected behaviour, so we should not remove any variables in that case.

- Here, we see that as expected, all three types of advertising have a positive effect on sales.

In this notebook, we will not get into interpreting magnitudes of the coefficients beyond this. The problem is similar to one we have experienced before in classification whereby we are currently mixing our units and it would not make sense (in most cases) to interpret the coefficients this way. If the predictors shared the same unit, say in dollars of advertising spent, you could begin exploring the magnitude of the relationships with some confidence.

There are two metrics typically used to asses model fit for linear regression:

- Residual standard error (RSE)
- R^2

The RSE is a measure of the average difference between the predicted y and our observed y, while R^2 is a measure of predictive power by measuring the proportion of explained variance.

Let's modify our function from above to include these new metrics.

In [176]:

```
def models(x_train, y_train, x_test, y_test):
lr = LinearRegression(fit_intercept=True)
feature_names = df.columns
coef_list = []
error_list = []
r2_list = []
feature_list = []
for col in xrange(len(feature_names)-1):
# Iterate through columns,
# LinearRegression expects 2D inputs for x
# So we add a new blank axis
xtrain = x_train[:, np.newaxis]
xtrain = xtrain[:, :, col]
xtest = x_test[:, np.newaxis]
xtest = xtest[:, :, col]
lr.fit(xtrain, y_train)
# Compute RSE
error_list.append(np.sum(lr.predict(xtest) - y_test)**2)
r2_list.append(lr.score(xtest, y_test))
coef_list.append(lr.coef_[0])
feature_list.append(feature_names[col])
out = zip(feature_list, r2_list, error_list, coef_list, )
return out
model_outs = models(x_train, y_train, x_test, y_test)
model_outs
```

Out[176]:

Now we have a pretty good sense of how our predictors interact with `sales`

. For example, `TV`

advertising can explain ~62% of the variation in the data, yet the `RSE`

error is rather high, which means that the deviation of this parameter can be quite variable (possibly due to some outliers).

Let's use this information to build a multiple regression model that improves our R^2 value, thus our future predictive power.

scikit-learn is best for doing machine learning with emphasis on predictive modeling, often with large and sparse data.

`statsmodels`

on the other hand is better at "traditional" statistics and econometrics, with much stronger emphasis on parameter estimation and (statistical) testing. `statsmodels`

will help us understand the basics more clearly before we move onto more complex regression techniques.

Before we get into the modeling, we need to check if our dependent variable `sales`

is normally distributed, and thus satisfies the assumptions of our regression.

In [177]:

```
# Plot the results thus far
from ggplot import *
# Enable inline plotting
%matplotlib inline
qplot(y)+geom_histogram()
```

Out[177]:

This is probably as close to normal as we can get (despite the right skew), however, under circumstances with very non-normal data, you can transform the data to coerce it to be normally distributed. We will do that another day apparently.

There is generally a misconception surrounding statistical analysis that requires a normality assumption. It is not very useful to test for normality using a statistical test like `Shapiro's`

or `Anderson Darling`

. This is because for small datasets, even large deviances from normality are not detected, and for large datasets, even the smallest departure from normality will lead to a rejected null. <a href = "http://www.r-bloggers.com/normality-tests-don’t-do-what-you-think-they-do/"> See more here</a>.

In [218]:

```
import statsmodels.api as sm
# Recall our data
x = df[['TV', 'Radio', 'Newspaper']]
y = df['Sales']
```

In [226]:

```
# Add intercept
X = sm.add_constant(x)
model = sm.OLS(y, X).fit()
print model.summary2()
```

As you can see, `statsmodels`

provides a much more descriptive output, and it well suited for these basic statistic stats.

This is a very important part! Here is what you need to know.

First, we can examine the model's F statistic and p-value. If our F statistic is greater than 1 we can conclude that *at least* one of our variables is related to housing price and we can continue with our analysis. There are special cases when `n`

is small, or when we have more predictors than `n`

, but we will cover this later.

Now, we are more concerned about **which** variables are related, and how they are related.

Below, we can see the `coefficients`

and `p-values`

of the individual independent variables. These `p-values`

judge whether each individual predictor is related to the dependent variable, **after** adjusting for the other predictors. Simply put, it measures the *partial effect* of adding that particular variable to the model, in the presence of other variables.

What we are talking about here are the considerations for *variable selection*. Since we want only the predictors that actually influence the prediction of our dependent variable. While there are very complex methods of variable selection (`AIC BIC`

), we will explore the basics here.

The two most basic methods are:

**Forward Selection**. Begin with a null model of an intercept and no predictors. Fit univariate regressions for all available predictors. Add the predictor with the lowest`RSS`

to the model. The next variables added will be the one that results in the lowest RSS for the new 2-variable model. This is continued until some stopping rule.

**Backward Selection**. Start will all variables in the model, then remove the variable with the largest`p-value`

(the least significant). Refit the model and repeat the processes until all remaining variables have`p`

< some defined threshold (often < 0.05 or 0.1).

There are also ways to combine these two techniques, which is how I prefer to do it. This is called **mixed selection**. Starting with no variables in the model, add the variable that provides the best fit (highest `R^2`

from the univariate regression). When then continue adding variables by next highest `R^2`

as long as their `p-value`

in the multiple regression model are significant.

Lastly, we need some way to judge overall model performance. Recall that `RSE`

and `R^2`

are the most common measures of model fit. An `R^2`

of `1`

would indicate that the predictor variables explain all the variation of the dependent variable (not realistic to find this in nature). We can add this idea to our variable selection technique. We will only include variables which also increase `R^2`

by at least some threshold (perhaps 0.5% or 1%).

Notice that when we include all three advertising types into the model that `newspaper`

changes from significant in our univariate regression to insignificant in multiple regression. We can therefor conclude that newspaper has no influence (positive or negative) on sales. This is a common situation when using real-life data.

Consider the correlation between `newspaper`

and `radio`

:

In [186]:

```
from scipy.stats.stats import pearsonr
pearsonr(X['Radio'], X['Newspaper'])
```

Out[186]:

As we can see, these two variables correlated. This means that there is a tendency to spend more on `newspaper`

advertising in the same markets where more is spent on `radio`

advertising. What this means is that `newspaper`

appears to be related to `sales`

, but it is only piggy-backing on the affect of `radio`

advertising.

The classic example of this effect is the relationship between ice cream sales and shark attacks, which suggests that closing down ice cream parlours would reduce incidents. However, the increased shark attacks is facilitated by increasing temperatures, which drives more people to the beach, who also purchase ice cream.

Let's exclude newspaper from our model.

In [353]:

```
df = pd.read_csv('http://www-bcf.usc.edu/~gareth/ISL/Advertising.csv', index_col=0)
x = df[['TV', 'Radio']]
y = df['Sales']
X = sm.add_constant(x)
model = sm.OLS(y, X).fit()
# print model error
print model.mse_resid
print model.summary2()
```

Great! As you can see, we really did not lose any predictive power by removing

`newspaper`

. However, there could also be some problems that cannot be diagnosed using these numbers or tests. One of the more robust and simple ways to assess the quality of our model is to plot the residuals, which is the error term, computed from the difference of the predicted and observed data.
For linear regression, we expect that the residuals (predicted values) are evenly distributed around zero.

In [284]:

```
qplot(y=model.resid, x = xrange(len(y)))+geom_point() \
+ geom_smooth(size=2,color='blue', se=False)
```

Out[284]:

We can see that there is a negative skew on these residuals, which suggests that there is some bias towards under prediction in the model. Further we can see a very subtle negative-to-positive-to-negative trend. This suggests a non-linear pattern whereby two or more variables are interacting with some synergistic effects. In other words, sometimes combining `TV`

and `Radio`

advertising can work together to boost sales more than using any single medium.

We will discuss this in more later.

Once we have fit a satisfactory multiple regression model, we can quickly begin to predict the response of `y`

using new values for the predictors. Yet, we should utilize a confidence interval to quantify the uncertainty of out prediction of sales. Using this method we can predict that the true `sales`

will be within some 95% confidence interval given any `TV`

or `Radio`

spending.

What we have discussed thus far is very basic regression techniques. In the real world we have many **different types of data sources**, and **non-linear relationships**.

In [6]:

```
url = 'http://www-bcf.usc.edu/~gareth/ISL/Credit.csv'
df = pd.read_csv(url)
df.info()
```

In this dataset we have several qualitative variables, along with some quantitative. Let's see how this data is coded.

In [316]:

```
df.head()
```

Out[316]:

We can see that the data for the qualitative variables (also known as factors) are coded as text strings. Factors such as `Gender`

, `Student`

and `Married`

only have two `levels`

, for example: `male`

or `female`

. To incorporate these into a regression model, we can create a `dummy variable`

that takes on two values based on the two levels of the factor. To convert `Gender`

to a dummy variable we simple create a new variable:

`Gender`

=**1**if person is male,**0**if person is female.

In the regression model, if a person is female we simply do not compute a coefficient, since in our model, being female is represented as an absence of being male, thus the `Gender`

variable can be modeled by only investigating the properties of either being male, or not being male. In this example, being female is considered the `baseline`

condition, which is associated with *B*0, or the regression intercept.

β0+β1+εi if the person is Male

β0+εi if the person is Female

Now β0 can be interpreted as average credit score among females, β0 + β1 as the average credit score among males, and β1 as the average difference in credit balance between males and females.

Coding males at 1 is arbitrary, you could have chosen females to be equal to 1; the only difference is in how you interpret the coefficients.

In [317]:

```
# Create new dummy variable
df['Males'] = 1
df['Males'][df['Gender']=='Female'] = 0
x = df[['Males']]
y = df['Balance']
X = sm.add_constant(x)
model = sm.OLS(y, X).fit()
# print model error
print 'p-values:'
print model.pvalues
print 'R^2:'
print model.rsquared_adj
```

As we can see, thus far there is no significant relationship between being male or female.

Considered the variable `Ethnicity`

, which has three levels. A single dummy variable cannot represent all the possible values. Instead we create two dummy variables.

- Asian: if Asian = 1, otherwise 0
- Caucasian: if Caucasian = 1, otherwise 0

Notice we do not need a third dummy variable, because African American is accounted for by the baseline condition (intercept) similar to how females was in the previous example.

β0+β1+εi if the person is Asian

β0+β2+εi if the person is Caucasian

β0+εi if the person is African American

Now β0 is the average credit balance for African Americans, β1 is the difference in the average balance between Asian and African Americans, and β2 is the difference in the average balance between Caucasian and African American's.

Thus we now can think about this as comparing the baseline condition β0 to the other conditions.

In [318]:

```
# Create dummy variables easily in pandas!
x = pd.core.reshape.get_dummies(df['Ethnicity'])
# Remove the African American variable
x = x[['Asian', 'Caucasian']]
x.head()
```

Out[318]:

In [352]:

```
X = sm.add_constant(x)
model = sm.OLS(y, X).fit()
model.summary()
```

Out[352]:

We can see that the baseline credit balance for `African Americans`

is 531. It is also estimated that `Asians`

have **~\$18** less debt than `African Americans`

and `Caucasians`

will have **~\$12** less debt than `African Americans`

.

However, it turns out that this factor is not significant in our model of credit balance, since the `p-values`

are insignificant and the `F`

statistic is close to `1`

, which suggests that there is no relationship between the variables (accepts the null hypothesis).

While a linear regression model often may produce good results, the assumptions that the relationship between the predictors and the response is both *additive* and *linear* is often violated. The additive assumption means that the effect of changes in predictor *X**j* on the response ** Y** is independent of the values of the other predictors. While the linear assumption means that change in the response variable is constant per unit of the predictor variable.

There are however, some basic ways to relax some of these assumptions:

Recall in the Advertising sales data we concluded that `TV`

and `Radio`

have an effect on `Sales`

. Under additive linear assumptions, the effect of `TV`

on sales is independent of the amount spent on `Radio`

advertising.

However, it may be the the case that spending money on `Radio`

advertising increases the effectiveness of `TV`

.

From a marketing perspective it could mean that splitting the advertising funds between `Radio`

and `TV`

could be more effective than spending it all on `TV`

.

In [354]:

```
# import formula api as alias smf
import statsmodels.formula.api as smf
# formula: response ~ predictor + predictor
model = smf.ols(formula='Sales ~ 1 + TV + Radio + TV * Radio', data=df).fit()
model.summary()
```

Out[354]:

It turns out we were right! The synergistic effect increased our `R^2`

by about **6%**. There may be examples where the individual effects may be insignificant after adding an interaction term, however it is still good practice to keep in these individual terms. This is because leaving them out tends to alter the meaning of the interaction.

Interaction tends to be an important consideration for qualitative variables as well. Recall the Credit Card dataset. It is likely that changes in `income`

would have different effects on `students`

vs. `non-students`

. Thus we can multiply `income`

with the student dummy variable. In terms of the model fit, this means that the slopes of the lines between `students`

and `non-students`

will be different when estimating `credit balance`

using `income`

. We could eventually conclude that `non-students`

may have larger `credit balances`

as `income`

increases, compared to individuals that are `students`

.

In some cases, the true relatioship between the predictors and the response variable could be non-linear. We can extend the linear model to accomodate non-linear effects using **Polynomial Regression**. There are however, other more advanced techniques that will be discussed later that may be better suited to this type of application.

More on polynomial fitting can be found in my other notebook <a target = '_self' href='http://rmdk.ca/machine-learning-curve-fitting/'>here</a>.

In [27]:

```
url = 'http://www-bcf.usc.edu/~gareth/ISL/Auto.csv'
df = pd.read_csv(url)
df.info()
```

In [30]:

```
# Make horsepower into integer
df2 = df[df['horsepower']!='?']
df2['horsepower'] = df2['horsepower'].astype(int)
# formula: response ~ predictor + predictor
# I makes sure operation is done before fitting the model
model1 = smf.ols(formula='mpg ~ 1 + horsepower', data=df2).fit()
print model1.rsquared_adj
model2 = smf.ols(formula='mpg ~ 1 + horsepower + I(horsepower**2)', data=df2).fit()
print model2.rsquared_adj
```

These are some things to keep out for while performing linear regression. Some of this we have already covered.

- Non-Linearity of the response-predictor relationships.
- Correlation of the error terms.
- Non-constant variance of error terms.
- Outliers.
- High -leverage points.
- Multicollinearity.

**Non-Linear Relationships:**

Linear regression assumes that there is a linear relationship between the predictors and response. If the actual relationship is very non-linear, then most of the conclusions that we draw from the model are less robust, and model prediction accuracy would be diminished.

Residual plots are the first step for testing this assumption. Here we plot residuals vs. predicted values for `mpg ~ horsepower`

and `mpg ~ horsepower + horsepower^2`

. We can see that the quadratic version has much less pattern in the residuals, which suggests that the quadratic fit improves the model. If it looks like the predictors are non-linear, a simple approach is to use non-linear transformations. We will discuss more advanced techniques.

**Non-constant variance of error terms:**

Non-constant variance is when there is a distinct trend in the variance, whereby the variance may increase with the value of the response. This can also be viewed with a `residual vs fitted plot`

. The resulting plot will take on a *funnel shape* under this condition, called `heteroscedasticity`

. In this case, a simple fix is often transforming the response variable to assume a more normal distribution. If this still doesn't work, you may have to explore weighted least squares regression, which proportionally weights the model by the inverse of the variances.

In [337]:

```
qplot(y=model1.resid, x=model1.fittedvalues)+geom_point()+geom_smooth()
```

Out[337]:

In [336]:

```
qplot(y=model2.resid, x=model2.fittedvalues)+geom_point()+geom_smooth()
```

Out[336]:

One of the assumptions of linear regression is that the error terms (residuals) are uncorrelated. Important indicators:`p-values`

`standard errors`

and `R^2`

all depend on this assumption. Time series data often has correlated error terms, and there are many ways to deal with it.

For us, a simple plot of the residuals can reveal if there is any noticeable trend. These errors are fairly evenly distributed around 0, thats good. If they are not, this can often be remedied by transforming our dependent variable to a more normal distribution. However, sources of dependent error terms are often due to experimental design and data sources/collection. For example: a study of predicting individuals heights are predicted from weight could be skewed if many of the study members are of the same family, or are on the same diet.

In [340]:

```
qplot(y=model2.resid, x=xrange(len(y)))+geom_point()+geom_smooth()
```

Out[340]:

An outlier is a any record whose predicted value is far-off from the observed value. Outliers often arise from incorrect data recording. There are many ways to identify outliers, and removing them from the model will often result in an improvement in predictive power and reduced error. However, we have to be careful to understand if a record is actually an outlier (error) or just a rare occurrence in the data, which we should not remove. Outliers can also indicate model deficiency; perhaps a missing predictor could account for the missing response.

In [344]:

```
# Using stats models
model.outlier_test()
```

**High Leverage Points:**

Leverage points are unusual values in predictor variables, rather than response variables. Removing high leverage points often has a great effect on the model fit. Here is an influence plot of the advertising data model. This plot takes into account outliers and high leverage. We can see several points here that seem to deviate from the average influence, yet again it is up to our discretion if we choose to remove these points or not based on our knowledge of the data, and it's source.

In [365]:

```
fig = sm.graphics.influence_plot(model2, criterion="cooks")
print fig
```

**Collinearity:**

This occurs when two or more predictors are closely related to each other. Collinearity will cause the standard error to grow, since it reduces the accuracy of the regression coefficients, which may cause us to improperly reject the null hypothesis (since the standard error effects the `t-value`

and `p-value`

). The probability of detecting a non-zero coefficient ( a relationship) is reduced by collinearity, thus we may not find a relationship that exists. Simply speaking, the importance of a variable may be missed due to collinearity.

We can test collinearity using a correlation matrix between the predictors, which we can use to eliminate variables within the model fitting process. If two variables are highly correlated, it is usual practise to remove the predictor with the lower R^2. However, **multicollinearity** exists when here is correlation between three or more variables. We use the `variance inflation factor`

(VIF) to address this issue. This statistic shows us the ratio of the variation to the coefficients. The smallest value for VIF is 1, and as a rule of thumb, we should remove variables that have a VIF value that exceeds 5.

An alternative solution is to combine the correlated variables into a single feature. For example, one could take the average standardized versions of two predictors in order to make a new index all together. The construction of this new variable is more of an art than a science, and will be explored in a future lesson.

In [82]:

```
# Grab a few columns
data_to_check = df2.ix[:, 0:-4]
# Check correlation and VIF
print data_to_check.corr()
```

It turns out that VIF is not well implimented in python for basic OLS regression. I will update when it is properly supported in statsmodels or scikit learn.

In [5]:

```
from IPython.core.display import HTML
def css_styling():
styles = open("/users/ryankelly/desktop/custom_notebook.css", "r").read()
return HTML(styles)
css_styling()
```

Out[5]:

In [3]:

```
def social():
code = """
<a style='float:left; margin-right:5px;' href="https://twitter.com/share" class="twitter-share-button" data-text="Check this out" data-via="Ryanmdk">Tweet</a>
<script>!function(d,s,id){var js,fjs=d.getElementsByTagName(s)[0],p=/^http:/.test(d.location)?'http':'https';if(!d.getElementById(id)){js=d.createElement(s);js.id=id;js.src=p+'://platform.twitter.com/widgets.js';fjs.parentNode.insertBefore(js,fjs);}}(document, 'script', 'twitter-wjs');</script>
<a style='float:left; margin-right:5px;' href="https://twitter.com/Ryanmdk" class="twitter-follow-button" data-show-count="false">Follow @Ryanmdk</a>
<script>!function(d,s,id){var js,fjs=d.getElementsByTagName(s)[0],p=/^http:/.test(d.location)?'http':'https';if(!d.getElementById(id)){js=d.createElement(s);js.id=id;js.src=p+'://platform.twitter.com/widgets.js';fjs.parentNode.insertBefore(js,fjs);}}(document, 'script', 'twitter-wjs');</script>
<a style='float:left; margin-right:5px;'target='_parent' href="http://www.reddit.com/submit" onclick="window.location = 'http://www.reddit.com/submit?url=' + encodeURIComponent(window.location); return false"> <img src="http://www.reddit.com/static/spreddit7.gif" alt="submit to reddit" border="0" /> </a>
<script src="//platform.linkedin.com/in.js" type="text/javascript">
lang: en_US
</script>
<script type="IN/Share"></script>
"""
return HTML(code)
```