That title deserves an explanation.
This note will look at the Theil Sen estimator for robust regression. I'm going to use the UCI Machine Learning abalone dataset to compare this technique with Ordinary Least Squares.
Robust techniques are designed to mitigate the impact of outliers in regression models. My introduction to Theil Sen was through this paper which looks at stock valuations.
The Theil Sen estimator was originally defined for univariate data. It estimates the regression slope as the median of the individual slopes fitted to every pair of data points: with n data points there are n*(n-1)/2 pairwise slopes, and the final estimate is simply their median.
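To make that concrete, here is a minimal sketch of the univariate estimator on some toy data. This is my own illustration, not the implementation used later, and the intercept rule (median of y - slope*x) is just one common convention.

import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(0, 0.5, size=50)
y[:3] += 30                            # a few gross outliers

# One slope per pair of points: n*(n-1)/2 of them
i, j = np.triu_indices(len(x), k=1)
pairwise_slopes = (y[j] - y[i]) / (x[j] - x[i])

slope = np.median(pairwise_slopes)     # Theil Sen slope = median of the pairwise slopes
intercept = np.median(y - slope * x)   # one common choice of intercept
print(slope, intercept)                # with only a few contaminated points, the slope should stay close to 2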
That concept is easy enough to wrap your head around. This blog on the SAS website has some nice visuals to help.
Theil Sen has been extended to apply to multivariate data by this paper. Scikit Learn has an implementation of this, using the spatial median which is a generalization of the median to multiple dimensions.
The concept of a spatial median is a little tougher to grasp. Running some data through both the Theil Sen and OLS estimators may help build our intuition here.
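Before that, a rough illustration of the spatial (geometric) median itself. This is not how scikit-learn computes it internally; it's just a simple sketch using Weiszfeld's algorithm, which repeatedly re-weights points by the inverse of their distance to the current estimate.

import numpy as np

def spatial_median(points, n_iter=100, eps=1e-8):
    """Approximate the spatial (geometric) median with Weiszfeld iterations."""
    est = points.mean(axis=0)                    # start from the ordinary mean
    for _ in range(n_iter):
        dist = np.linalg.norm(points - est, axis=1)
        w = 1.0 / np.maximum(dist, eps)          # inverse-distance weights
        est = (w[:, None] * points).sum(axis=0) / w.sum()
    return est

pts = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [10.0, 10.0]])  # one outlying point
print(pts.mean(axis=0))      # the mean gets pulled towards the outlier
print(spatial_median(pts))   # the spatial median stays near the bulk of the points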
Let's get to it.
Data is from the UCI Machine Learning repository and contains the age of abalone and their physical measurements. Hence the title.
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn import linear_model
from sklearn.metrics import mean_squared_error, r2_score
import statsmodels.api as sm
cols = ['sex','length','diam','height','whole','shucked','viscera','shell','rings']
df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/abalone/abalone.data', names=cols)
df
| | sex | length | diam | height | whole | shucked | viscera | shell | rings |
---|---|---|---|---|---|---|---|---|---|
0 | M | 0.455 | 0.365 | 0.095 | 0.5140 | 0.2245 | 0.1010 | 0.1500 | 15 |
1 | M | 0.350 | 0.265 | 0.090 | 0.2255 | 0.0995 | 0.0485 | 0.0700 | 7 |
2 | F | 0.530 | 0.420 | 0.135 | 0.6770 | 0.2565 | 0.1415 | 0.2100 | 9 |
3 | M | 0.440 | 0.365 | 0.125 | 0.5160 | 0.2155 | 0.1140 | 0.1550 | 10 |
4 | I | 0.330 | 0.255 | 0.080 | 0.2050 | 0.0895 | 0.0395 | 0.0550 | 7 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
4172 | F | 0.565 | 0.450 | 0.165 | 0.8870 | 0.3700 | 0.2390 | 0.2490 | 11 |
4173 | M | 0.590 | 0.440 | 0.135 | 0.9660 | 0.4390 | 0.2145 | 0.2605 | 10 |
4174 | M | 0.600 | 0.475 | 0.205 | 1.1760 | 0.5255 | 0.2875 | 0.3080 | 9 |
4175 | F | 0.625 | 0.485 | 0.150 | 1.0945 | 0.5310 | 0.2610 | 0.2960 | 10 |
4176 | M | 0.710 | 0.555 | 0.195 | 1.9485 | 0.9455 | 0.3765 | 0.4950 | 12 |
4177 rows × 9 columns
df.dtypes
sex         object
length     float64
diam       float64
height     float64
whole      float64
shucked    float64
viscera    float64
shell      float64
rings        int64
dtype: object
Drop the categorical sex attribute. We are investigating a regression problem here. Yes, we could have encoded it as a dummy variable or fit a mixed effects model (a quick sketch of the dummy encoding is below), but I'm going to keep it simple.
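For reference, the dummy-variable alternative would be a one-liner with pandas. This is purely illustrative and isn't used anywhere below.

# Hypothetical alternative (not used below): one-hot encode sex instead of dropping it
df_dummies = pd.get_dummies(df, columns=['sex'], drop_first=True)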
df.drop(columns='sex', inplace=True)
df
| | length | diam | height | whole | shucked | viscera | shell | rings |
---|---|---|---|---|---|---|---|---|
0 | 0.455 | 0.365 | 0.095 | 0.5140 | 0.2245 | 0.1010 | 0.1500 | 15 |
1 | 0.350 | 0.265 | 0.090 | 0.2255 | 0.0995 | 0.0485 | 0.0700 | 7 |
2 | 0.530 | 0.420 | 0.135 | 0.6770 | 0.2565 | 0.1415 | 0.2100 | 9 |
3 | 0.440 | 0.365 | 0.125 | 0.5160 | 0.2155 | 0.1140 | 0.1550 | 10 |
4 | 0.330 | 0.255 | 0.080 | 0.2050 | 0.0895 | 0.0395 | 0.0550 | 7 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
4172 | 0.565 | 0.450 | 0.165 | 0.8870 | 0.3700 | 0.2390 | 0.2490 | 11 |
4173 | 0.590 | 0.440 | 0.135 | 0.9660 | 0.4390 | 0.2145 | 0.2605 | 10 |
4174 | 0.600 | 0.475 | 0.205 | 1.1760 | 0.5255 | 0.2875 | 0.3080 | 9 |
4175 | 0.625 | 0.485 | 0.150 | 1.0945 | 0.5310 | 0.2610 | 0.2960 | 10 |
4176 | 0.710 | 0.555 | 0.195 | 1.9485 | 0.9455 | 0.3765 | 0.4950 | 12 |
4177 rows × 8 columns
The rings
variable is what we are trying to predict. This is a proxy for the age of the abalone.
Don't get too caught up on the data and regression problem at this stage. I just need a real world dataset to investigate the differences between robust and OLS methodologies.
x = df.iloc[:, 0:7]
y = df.iloc[:, 7:8]
Histograms of each feature.
sns.FacetGrid(
data=df.melt(var_name='Attribute', value_name='value'),
col='Attribute',
col_wrap=3,
sharex=False,
sharey=False) \
.map(sns.histplot, 'value')
And scatter plots of each feature and the dependent variable.
sns.FacetGrid(
data=df.melt(var_name='Attribute', value_name='value', value_vars=['length', 'diam', 'height', 'whole', 'shucked', 'viscera', 'shell'], ignore_index=False).join(y),
col='Attribute',
col_wrap=3,
sharex=False,
sharey=False) \
.map(sns.scatterplot, 'value', 'rings')
With the exception of the height attribute, there are no obvious outliers.
I'm using statsmodels
for the OLS estimation. It produces a nice summary result.
x = sm.add_constant(x)
ols = sm.OLS(y,x).fit()
sm_pred = ols.predict(x)
print(ols.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:                  rings   R-squared:                       0.528
Model:                            OLS   Adj. R-squared:                  0.527
Method:                 Least Squares   F-statistic:                     665.2
Date:                Mon, 23 Aug 2021   Prob (F-statistic):               0.00
Time:                        20:13:27   Log-Likelihood:                -9250.0
No. Observations:                4177   AIC:                         1.852e+04
Df Residuals:                    4169   BIC:                         1.857e+04
Df Model:                           7
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          2.9852      0.269     11.092      0.000       2.458       3.513
length        -1.5719      1.825     -0.861      0.389      -5.149       2.006
diam          13.3609      2.237      5.972      0.000       8.975      17.747
height        11.8261      1.548      7.639      0.000       8.791      14.861
whole          9.2474      0.733     12.622      0.000       7.811      10.684
shucked      -20.2139      0.823    -24.552      0.000     -21.828     -18.600
viscera       -9.8297      1.304     -7.538      0.000     -12.386      -7.273
shell          8.5762      1.137      7.545      0.000       6.348      10.805
==============================================================================
Omnibus:                      933.799   Durbin-Watson:                   1.387
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             2602.745
Skew:                           1.174   Prob(JB):                         0.00
Kurtosis:                       6.072   Cond. No.                         131.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
# Create regression object
ts = linear_model.TheilSenRegressor()
# Fit the model
ts.fit(x, np.ravel(y))
# Print the intercept, co-efficients & R-squared
for i in zip(x.columns.tolist(), ts.coef_):
print(i[0].ljust(10),' :', '{:>9.2f}'.format(i[1]))
print('R-squared'.ljust(10),' :', '{:>9.2f}'.format(ts.score(x, np.ravel(y))))
const       :      2.08
length      :     -6.41
diam        :     11.16
height      :     23.40
whole       :     13.17
shucked     :    -24.88
viscera     :    -13.46
shell       :      7.02
R-squared   :      0.51
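A side note on scikit-learn's TheilSenRegressor: with this many rows it does not enumerate every possible subset of points. It works on at most max_subpopulation (10,000 by default) randomly drawn subsets, so the fit has a stochastic element unless random_state is fixed. Something along these lines (an illustrative configuration, not what was run above) makes the fit reproducible.

# Illustrative only: pin the random seed (and optionally the subset budget) so
# repeated Theil Sen fits give identical coefficients.
ts_repro = linear_model.TheilSenRegressor(random_state=42, max_subpopulation=10_000)
ts_repro.fit(x, np.ravel(y))
print(ts_repro.breakdown_)   # approximate breakdown point of the fitted estimator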
Let's compare the regression coefficients and model fit.
ts_coef = zip(x.columns.tolist(), ts.coef_)
ts_coef = pd.DataFrame(ts_coef, columns = ['Variable', 'TS coef'])
ols_coef = ols.params.to_frame(name='OLS coef').rename_axis('Variable')
ols_coef.reset_index(inplace=True)
coef = ts_coef.merge(ols_coef, on='Variable')
coef['Diff in coef'] = coef['TS coef'] - coef['OLS coef']
coef.round(2)
| | Variable | TS coef | OLS coef | Diff in coef |
---|---|---|---|---|
0 | const | 2.08 | 2.99 | -0.91 |
1 | length | -6.41 | -1.57 | -4.84 |
2 | diam | 11.16 | 13.36 | -2.20 |
3 | height | 23.40 | 11.83 | 11.57 |
4 | whole | 13.17 | 9.25 | 3.92 |
5 | shucked | -24.88 | -20.21 | -4.67 |
6 | viscera | -13.46 | -9.83 | -3.63 |
7 | shell | 7.02 | 8.58 | -1.56 |
The fitted models are similar, with the signs of the coefficients in agreement.
We can see how the OLS estimate of the height coefficient is dragged down by the outliers. The Theil Sen approach, by virtue of the estimation methodology explained above, places less weight on these outliers. This is to be expected with a robust technique.
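As a quick aside (my own check, not part of the analysis above), one way to probe this is to fit both estimators on the height attribute alone, the column with the visible outliers, and compare the slopes.

# Illustrative univariate comparison on height, the attribute with obvious outliers
h = df[['height']].values
r = np.ravel(y)
ols_h = linear_model.LinearRegression().fit(h, r)
ts_h = linear_model.TheilSenRegressor(random_state=0).fit(h, r)
print('OLS slope      :', ols_h.coef_[0])
print('Theil Sen slope:', ts_h.coef_[0])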
The r-squared values are similar, at around 0.5.
Next, let's insert some artificial outliers and see how each estimator copes: the features of the first 100 rows are scaled up by a factor of five, and those of the last 101 rows are scaled down by the same factor.
# Note: this binds df1 to the same object as df rather than copying it
# (df.copy() would give an independent copy), so df is modified below as well.
df1 = df
# Artificial outliers: scale the features of the first 100 rows up by 5x
# and the last 101 rows down by 5x; the rings column is left untouched.
df1.iloc[0:100, 0:7] = df1.iloc[0:100, 0:7].multiply(5)
df1.iloc[4076:4177, 0:7] = df1.iloc[4076:4177, 0:7].div(5)
df1.iloc[:, 0:7]
| | length | diam | height | whole | shucked | viscera | shell |
---|---|---|---|---|---|---|---|
0 | 2.275 | 1.825 | 0.475 | 2.5700 | 1.1225 | 0.5050 | 0.7500 |
1 | 1.750 | 1.325 | 0.450 | 1.1275 | 0.4975 | 0.2425 | 0.3500 |
2 | 2.650 | 2.100 | 0.675 | 3.3850 | 1.2825 | 0.7075 | 1.0500 |
3 | 2.200 | 1.825 | 0.625 | 2.5800 | 1.0775 | 0.5700 | 0.7750 |
4 | 1.650 | 1.275 | 0.400 | 1.0250 | 0.4475 | 0.1975 | 0.2750 |
... | ... | ... | ... | ... | ... | ... | ... |
4172 | 0.113 | 0.090 | 0.033 | 0.1774 | 0.0740 | 0.0478 | 0.0498 |
4173 | 0.118 | 0.088 | 0.027 | 0.1932 | 0.0878 | 0.0429 | 0.0521 |
4174 | 0.120 | 0.095 | 0.041 | 0.2352 | 0.1051 | 0.0575 | 0.0616 |
4175 | 0.125 | 0.097 | 0.030 | 0.2189 | 0.1062 | 0.0522 | 0.0592 |
4176 | 0.142 | 0.111 | 0.039 | 0.3897 | 0.1891 | 0.0753 | 0.0990 |
4177 rows × 7 columns
x1 = df1.iloc[:, 0:7]
y1 = df1.iloc[:, 7:8]
The plots below replicate those shown above, except using the amended data containing artificial outliers.
sns.FacetGrid(
data=df1.melt(var_name='Attribute', value_name='value'),
col='Attribute',
col_wrap=3,
sharex=False,
sharey=False) \
.map(sns.histplot, 'value')
sns.FacetGrid(
data=df1.melt(var_name='Attribute', value_name='value', value_vars=['length', 'diam', 'height', 'whole', 'shucked', 'viscera', 'shell'], ignore_index=False).join(y1),
col='Attribute',
col_wrap=3,
sharex=False,
sharey=False) \
.map(sns.scatterplot, 'value', 'rings')
Naturally, the data becomes a lot more skewed, and the scatter plots highlight the artificial outliers very clearly.
x1 = sm.add_constant(x1)
ols_o = sm.OLS(y1,x1).fit()
sm_pred= ols_o.predict(x1)
print(ols_o.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:                  rings   R-squared:                       0.334
Model:                            OLS   Adj. R-squared:                  0.333
Method:                 Least Squares   F-statistic:                     298.8
Date:                Mon, 23 Aug 2021   Prob (F-statistic):               0.00
Time:                        20:13:35   Log-Likelihood:                -9967.2
No. Observations:                4177   AIC:                         1.995e+04
Df Residuals:                    4169   BIC:                         2.000e+04
Df Model:                           7
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          9.5868      0.084    114.707      0.000       9.423       9.751
length       -12.1602      1.733     -7.018      0.000     -15.557      -8.763
diam           6.8125      2.292      2.972      0.003       2.319      11.307
height         6.8685      1.711      4.014      0.000       3.514      10.223
whole          5.6037      0.755      7.418      0.000       4.123       7.085
shucked      -10.1362      0.872    -11.629      0.000     -11.845      -8.427
viscera        1.3970      1.227      1.139      0.255      -1.008       3.802
shell          7.0807      1.184      5.978      0.000       4.759       9.403
==============================================================================
Omnibus:                      729.658   Durbin-Watson:                   1.183
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             9803.518
Skew:                           0.417   Prob(JB):                         0.00
Kurtosis:                      10.459   Cond. No.                         124.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
# Create regression object
ts_o = linear_model.TheilSenRegressor()
# Fit the model
ts_o.fit(x1, np.ravel(y1))
# Print the intercept, co-efficients & R-squared
for i in zip(x1.columns.tolist(), ts_o.coef_):
print(i[0].ljust(10),' :', '{:>9.2f}'.format(i[1]))
print('R-squared'.ljust(10),' :', '{:>9.2f}'.format(ts_o.score(x1, np.ravel(y1))))
const       :      2.76
length      :     -9.97
diam        :     11.31
height      :     19.35
whole       :     12.91
shucked     :    -23.39
viscera     :    -12.50
shell       :      9.30
R-squared   :     -0.62
ts_coef = zip(x1.columns.tolist(), ts_o.coef_)
ts_coef = pd.DataFrame(ts_coef, columns = ['Variable', 'TS coef (outliers)'])
ols_coef = ols_o.params.to_frame(name='OLS coef (outliers)').rename_axis('Variable')
ols_coef.reset_index(inplace=True)
coef_o = ts_coef.merge(ols_coef, on='Variable')
coef = coef.merge(coef_o, on='Variable')
coef['Diff in coef (outliers)'] = coef['TS coef (outliers)'] - coef['OLS coef (outliers)']
coef['TS coef (diff)'] = coef['TS coef'] - coef['TS coef (outliers)']
coef['OLS coef (diff)'] = coef['OLS coef'] - coef['OLS coef (outliers)']
coef.round(2)
| | Variable | TS coef | OLS coef | Diff in coef | TS coef (outliers) | OLS coef (outliers) | Diff in coef (outliers) | TS coef (diff) | OLS coef (diff) |
---|---|---|---|---|---|---|---|---|---|
0 | const | 2.08 | 2.99 | -0.91 | 2.76 | 9.59 | -6.83 | -0.68 | -6.60 |
1 | length | -6.41 | -1.57 | -4.84 | -9.97 | -12.16 | 2.19 | 3.56 | 10.59 |
2 | diam | 11.16 | 13.36 | -2.20 | 11.31 | 6.81 | 4.50 | -0.15 | 6.55 |
3 | height | 23.40 | 11.83 | 11.57 | 19.35 | 6.87 | 12.48 | 4.05 | 4.96 |
4 | whole | 13.17 | 9.25 | 3.92 | 12.91 | 5.60 | 7.31 | 0.26 | 3.64 |
5 | shucked | -24.88 | -20.21 | -4.67 | -23.39 | -10.14 | -13.26 | -1.49 | -10.08 |
6 | viscera | -13.46 | -9.83 | -3.63 | -12.50 | 1.40 | -13.89 | -0.96 | -11.23 |
7 | shell | 7.02 | 8.58 | -1.56 | 9.30 | 7.08 | 2.22 | -2.28 | 1.50 |
The r-squared is now much lower for the OLS model, falling from about 0.53 on the original data to 0.33 on the data with the artificial outliers inserted. This makes sense given that the outliers increase the variance of the data.
The differences in coefficients are substantial: for example, viscera has increased from -9.8 to 1.4, and length has fallen from -1.6 to -12.2. Interestingly, shell has changed by a much smaller amount, from 8.6 to 7.1.
Why is this? Why has the sign of the coefficient on viscera changed?
It appears there is severe multicollinearity in the original dataset, which gives rise to spurious or unstable coefficients: a small(ish) change to the data produces large changes in the estimates. Going back to the original data reveals an average pairwise correlation of around 0.9.
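One way to quantify this, which wasn't part of the analysis above, is with variance inflation factors. A rough sketch using statsmodels on a fresh copy of the raw data:

# Illustrative check of multicollinearity via variance inflation factors (VIF);
# values well above ~10 indicate heavily collinear predictors.
from statsmodels.stats.outliers_influence import variance_inflation_factor

raw = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/abalone/abalone.data', names=cols)
feats = sm.add_constant(raw[['length', 'diam', 'height', 'whole', 'shucked', 'viscera', 'shell']])
for idx, col in enumerate(feats.columns[1:], start=1):
    print(col.ljust(10), round(variance_inflation_factor(feats.values, idx), 1))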
The Theil Sen model coefficients are much more stable and have not changed sign. The largest change is in the height coefficient, which has moved from 23.4 to 19.4, with length moving from -6.4 to -10.0. These shifts are modest compared with the swings in the OLS coefficients.
The r-squared for the Theil Sen model with outliers has decreased to -0.62 (yes, negative). This means the model's predictions are, in aggregate, worse than simply predicting the baseline, the average of the y values. The result makes sense in the context of how Theil Sen derives its coefficients (remember, it is built from medians of slopes across the data points): the outliers carry little weight in the fit, so the fitted plane misses them badly and they generate very large prediction errors.
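To see why the score can go negative, recall that the score here is R-squared = 1 - SS_res / SS_tot, so any model whose squared errors exceed those of the constant predict-the-mean baseline scores below zero. A quick hand calculation as a sanity check (this just mirrors what ts_o.score does):

# R-squared by hand: 1 - SS_res / SS_tot. It is negative whenever the model's
# squared errors exceed those of simply predicting the mean of y.
resid = np.ravel(y1) - ts_o.predict(x1)
ss_res = np.sum(resid ** 2)
ss_tot = np.sum((np.ravel(y1) - np.ravel(y1).mean()) ** 2)
print(1 - ss_res / ss_tot)   # should reproduce ts_o.score(x1, np.ravel(y1))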
Below, the r-squared is calculated separately for the data points that are the artificial outliers and for those that are not. The values are roughly -29.6 and 0.51 respectively.
It should be noted that by virtue of the method of inserting outliers, the correlation structure of the data has been preserved.
This has been a quick little piece of analysis. What have we learned?
# r-squared - OLS with outliers
ols_o.rsquared
0.3340788540256908
# r-squared - TS with outliers
ts_o.score(x1, np.ravel(y1))
-0.6166942787878218
# r-squared - TS on ALL data points
ts_o.score(x1, y1)
-0.6166942787878218
# r-squared - TS on data points that ARE outliers
ts_o.score(x1.iloc[np.r_[0:100,4076:4176]], y1.iloc[np.r_[0:100,4076:4176]])
-29.564854211258634
# r-squared - TS on data points that ARE NOT outliers
ts_o.score(x1.iloc[np.r_[101:4075]], y1.iloc[np.r_[101:4075]])
0.5081364119559657
df_corr = df.iloc[:, 0:7].corr()
df_corr
| | length | diam | height | whole | shucked | viscera | shell |
---|---|---|---|---|---|---|---|
length | 1.000000 | 0.997471 | 0.962519 | 0.861553 | 0.827044 | 0.850503 | 0.868906 |
diam | 0.997471 | 1.000000 | 0.964844 | 0.871345 | 0.835328 | 0.858122 | 0.879775 |
height | 0.962519 | 0.964844 | 1.000000 | 0.869090 | 0.827839 | 0.857939 | 0.880715 |
whole | 0.861553 | 0.871345 | 0.869090 | 1.000000 | 0.977039 | 0.973434 | 0.971785 |
shucked | 0.827044 | 0.835328 | 0.827839 | 0.977039 | 1.000000 | 0.955973 | 0.915816 |
viscera | 0.850503 | 0.858122 | 0.857939 | 0.973434 | 0.955973 | 1.000000 | 0.928095 |
shell | 0.868906 | 0.879775 | 0.880715 | 0.971785 | 0.915816 | 0.928095 | 1.000000 |
# Average of the off-diagonal (upper-triangle) pairwise correlations
np.nanmean(df_corr.values[np.triu_indices_from(df_corr.values,1)])
0.901673180385633
df_corr_o = df1.iloc[:, 0:7].corr()
# The same average pairwise correlation for the data with the artificial outliers
np.nanmean(df_corr_o.values[np.triu_indices_from(df_corr_o.values,1)])
0.901673180385633
# Attributes available on the fitted scikit-learn Theil Sen estimator
dir(ts_o)
['__abstractmethods__', '__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getstate__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__setstate__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_abc_impl', '_check_subparams', '_decision_function', '_estimator_type', '_get_param_names', '_get_tags', '_more_tags', '_preprocess_data', '_set_intercept', 'breakdown_', 'coef_', 'copy_X', 'fit', 'fit_intercept', 'get_params', 'intercept_', 'max_iter', 'max_subpopulation', 'n_iter_', 'n_jobs', 'n_subpopulation_', 'n_subsamples', 'predict', 'random_state', 'score', 'set_params', 'tol', 'verbose']
# Attributes available on the fitted statsmodels OLS results object
dir(ols_o)
['HC0_se', 'HC1_se', 'HC2_se', 'HC3_se', '_HCCM', '__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_cache', '_data_attr', '_get_robustcov_results', '_is_nested', '_wexog_singular_values', 'aic', 'bic', 'bse', 'centered_tss', 'compare_f_test', 'compare_lm_test', 'compare_lr_test', 'condition_number', 'conf_int', 'conf_int_el', 'cov_HC0', 'cov_HC1', 'cov_HC2', 'cov_HC3', 'cov_kwds', 'cov_params', 'cov_type', 'df_model', 'df_resid', 'diagn', 'eigenvals', 'el_test', 'ess', 'f_pvalue', 'f_test', 'fittedvalues', 'fvalue', 'get_influence', 'get_prediction', 'get_robustcov_results', 'initialize', 'k_constant', 'llf', 'load', 'model', 'mse_model', 'mse_resid', 'mse_total', 'nobs', 'normalized_cov_params', 'outlier_test', 'params', 'predict', 'pvalues', 'remove_data', 'resid', 'resid_pearson', 'rsquared', 'rsquared_adj', 'save', 'scale', 'ssr', 'summary', 'summary2', 't_test', 't_test_pairwise', 'tvalues', 'uncentered_tss', 'use_t', 'wald_test', 'wald_test_terms', 'wresid']