#!/usr/bin/env python
# coding: utf-8

# In[1]:


get_ipython().run_line_magic('matplotlib', 'inline')


# ---
# title: Robust Regression with t-Distributed Residuals
# tags: Statistics, PyMC
# ---
#
# Ordinary least squares (OLS) is, without a doubt, the most well-known linear regression model. Despite its wide applicability, it often gives undesirable results when the data deviate from its underlying normal model. In particular, it is quite sensitive to outliers in the data. In this post, we will illustrate this sensitivity and then show that changing the error distribution results in a more robust regression model.
#
# We will use one of the data sets from [Anscombe's quartet](http://en.wikipedia.org/wiki/Anscombe%27s_quartet) to illustrate these concepts. Anscombe's quartet is a well-known group of four data sets that illustrates the importance of exploratory data analysis and visualization. In particular, we will use the third data set from the quartet, which is shown below.

# In[2]:


from matplotlib import pyplot as plt
import numpy as np
import pandas as pd
import pymc3
from scipy import stats
import seaborn as sns
from statsmodels import api as sm


# In[3]:


x = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5])
y = np.array([7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73])


# In[4]:


X_MIN, X_MAX = x.min() - 1, x.max() + 1


# In[5]:


fig, ax = plt.subplots()

ax.scatter(x, y);
ax.set_xlim(X_MIN, X_MAX);


# It is quite clear that this data set exhibits a highly linear relationship, with one outlier (when `x = 13`). Below, we show the results of two OLS models fit to this data. One is fit to all of the data, and the other is fit to the data with the outlier removed.

# In[6]:


X = sm.add_constant(x)


# In[7]:


ols_result = sm.OLS(y, X).fit()


# In[8]:


x_no_outlier = x[x != 13]
X_no_outlier = X[x != 13]
y_no_outlier = y[x != 13]


# In[9]:


no_outlier_result = sm.OLS(y_no_outlier, X_no_outlier).fit()


# In[10]:


xs = np.linspace(X_MIN, X_MAX, 20)
XS = sm.add_constant(xs)


# In[11]:


fig, ax = plt.subplots()

ax.scatter(x, y);
ax.plot(xs, ols_result.predict(XS), label='OLS model (full)');
ax.plot(xs, no_outlier_result.predict(XS), label='OLS model (without outlier)');
ax.set_xlim(X_MIN, X_MAX);
ax.legend(loc='upper left');


# One of the ways the OLS estimator can be derived is by minimizing the mean squared error (MSE) of the model on the training data. Below we show the MSE of both of these models on the full data set and on the data set without the outlier.

# In[12]:


def mse(actual, predicted):
    return ((actual - predicted)**2).mean()


# In[13]:


mse_df = pd.DataFrame({
    'Full data set': [
        mse(y, ols_result.predict(X)),
        mse(y, no_outlier_result.predict(X))
    ],
    'Data set without outlier': [
        mse(y_no_outlier, ols_result.predict(X_no_outlier)),
        mse(y_no_outlier, no_outlier_result.predict(X_no_outlier))
    ]
},
    index=['Full model', 'No outlier model']
)
mse_df = mse_df[mse_df.columns[::-1]]


# In[14]:


mse_df


# By simple visual inspection, we suspect that without the outlier (`x = 13`), the relationship between `x` and `y` is (almost) perfectly linear. This suspicion is confirmed by the no outlier model's very small MSE on the reduced data set.

# #### Toward robust regression

# Unfortunately, we will usually have many more than eleven points in our data set. In reality, outliers may be more difficult to detect visually, and they may be harder to exclude manually. We would like a model that performs reasonably well, even in the presence of outliers.
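#
# As a quick way to quantify how strongly the single outlier pulls the fit, we can compare the coefficients of the two OLS models above (a small check using the results objects already defined; the row labels are added here just for readability).

# In[ ]:


# Compare the fitted intercept and slope of the two OLS models; the outlier
# at x = 13 pulls both coefficients away from the otherwise nearly exact
# linear relationship among the remaining points.
pd.DataFrame({'Full model': ols_result.params,
              'No outlier model': no_outlier_result.params},
             index=['intercept', 'slope'])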
#
# Before we define such a robust regression model, it will be helpful to consider the OLS model mathematically. In the OLS model, we have that
#
# $$y_i = \vec{\beta} \cdot \vec{x}_i + \varepsilon_i.$$
#
# Here, $y_i$ is the observation corresponding to the feature vector $\vec{x}_i$, $\vec{\beta}$ is the vector of regression coefficients, and $\varepsilon_i$ is noise. In the OLS model, the noise terms are independent and identically normally distributed. It is the properties of these normally distributed errors that make OLS susceptible to outliers.
#
# The normal distribution is well known to have [thin tails](http://en.wikipedia.org/wiki/Fat-tailed_distribution). That is, it assigns relatively little probability to observations far away from the mean. Students of basic statistics are quite familiar with the fact that approximately 95% of the mass of the normal distribution lies within two standard deviations of the mean.
#
# We obtain a more robust regression model by choosing an error distribution with fatter tails; a common choice is [Student's t-distribution](https://www.google.com/webhp?sourceid=chrome-instant&ion=1&espv=2&ie=UTF-8#q=student%27s%20t-distribution).
#
# Below we define this model using [`pymc3`](https://github.com/pymc-devs/pymc3).

# In[15]:


with pymc3.Model() as model:
    # Regression coefficients
    alpha = pymc3.Uniform('alpha', -100, 100)
    beta = pymc3.Uniform('beta', -100, 100)

    # Expected value
    y_hat = alpha + beta * x

    # Observations with t-distributed error
    y_obs = pymc3.T('y_obs', nu=5, mu=y_hat, observed=y)


# Here we have given our t-distributed residuals five degrees of freedom. Customarily, these models use [four, five, or six degrees of freedom](http://en.wikipedia.org/wiki/Robust_regression#Parametric_alternatives). It is important that the number of degrees of freedom, $\nu$, be relatively small, because as $\nu \to \infty$, the t-distribution converges to the normal distribution.
#
# We now fit this model using [no-U-turn sampling](http://arxiv.org/abs/1111.4246).

# In[16]:


with model:
    step = pymc3.NUTS()
    trace_ = pymc3.sample(3000, step)

burn = 1000
thin = 2
trace = trace_[burn::thin]


# The following plots summarize the posterior distributions of the regression intercept ($\alpha$) and slope ($\beta$).

# In[17]:


pymc3.plots.traceplot(trace);


# We now plot the robust model along with the previous models.

# In[18]:


alpha = trace['alpha'].mean()
beta = trace['beta'].mean()


# In[19]:


fig, ax = plt.subplots()

ax.scatter(x, y);
ax.plot(xs, ols_result.predict(XS), label='OLS model (full)');
ax.plot(xs, no_outlier_result.predict(XS), label='OLS model (without outlier)');
ax.plot(xs, alpha + beta * xs, label='Robust model');
ax.set_xlim(X_MIN, X_MAX);
ax.legend(loc='upper left');


# We see right away that, although the robust model has not completely captured the linear relationship between the non-outlier points, it is much less biased by the outlier than the OLS model fit to the full data set. Below we compare the MSE of this model to that of the previous models.

# In[20]:


robust_mse = pd.Series([mse(y, alpha + beta * x),
                        mse(y_no_outlier, alpha + beta * x_no_outlier)],
                       index=['Full data set', 'Data set without outlier'])
robust_mse.name = 'Robust model'

mse_df = mse_df.append(robust_mse)


# In[21]:


mse_df


# On the data set without the outlier, the robust model has a significantly larger MSE than the no outlier model, but, importantly, its MSE is an order of magnitude smaller than that of the full model.
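#
# The robustness here comes directly from the tail behavior of the two error distributions. As a rough sanity check of that intuition, we can compare how much probability a standard normal distribution and a standard t-distribution with five degrees of freedom assign to values more than three units from zero, using `scipy.stats`, which we imported above.

# In[ ]:


# Two-sided tail probability P(|error| > 3) under a standard normal
# distribution and under Student's t-distribution with nu = 5; the
# t-distribution places roughly ten times as much probability on such
# large residuals, so a single extreme point influences the fit far less.
normal_tail = 2 * stats.norm.sf(3)
t_tail = 2 * stats.t.sf(3, df=5)

normal_tail, t_tail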