Statsmodels regularized linear regression (lasso)

This notebook demonstrates how to fit a linear regression model using L1 regularization (the "lasso").
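
As a quick orientation (and up to the exact scaling convention used by a particular implementation), the lasso estimates the coefficients by minimizing a penalized least squares criterion of the form

$$\hat{\beta} = \underset{\beta}{\operatorname{argmin}} \; \frac{1}{2n} \sum_{i=1}^{n} \left(y_i - x_i'\beta\right)^2 + \alpha \sum_{j=1}^{p} |\beta_j|,$$

where $\alpha \ge 0$ is the penalty weight. Larger values of $\alpha$ shrink more of the coefficients exactly to zero, producing the increasingly sparse fits shown later in this notebook.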

In [1]:
import pandas as pd
import numpy as np
import statsmodels.api as sm

The data consist of measurements made on people sampled from a population that is at high risk for developing diabetes. The data are here:

http://vincentarelbundock.github.io/Rdatasets/csv/MASS/Pima.te.csv

We will be regressing serum glucose on several factors that may influence it.
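
The next cell assumes that Pima.te.csv has already been downloaded into the working directory. If it has not, one way to obtain it (a minimal sketch, assuming an internet connection) is to let pandas read the URL directly and save a local copy:

# Fetch the Pima.te data from the Rdatasets repository and save a local copy.
import pandas as pd

url = "http://vincentarelbundock.github.io/Rdatasets/csv/MASS/Pima.te.csv"
pd.read_csv(url).to_csv("Pima.te.csv", index=False)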

In [2]:
data = pd.read_csv("Pima.te.csv")
print data.columns
Index([u'Unnamed: 0', u'npreg', u'glu', u'bp', u'skin', u'bmi', u'ped', u'age', u'type'], dtype='object')

The glucose values are slightly right-skewed, so, following convention, we log transform them.

In [3]:
data["log_glu"] = np.log(data["glu"])

Below we will be fitting models with interactions. In general, the effect of an interaction is easier to interpret if we center or standardize its component variables before forming the product that defines it. The next cell standardizes all of the variables that will be used below.

In [4]:
data["type"] = 1.0*(data["type"] == "Yes")

for vname in "type", "age", "bmi", "skin", "log_glu", "bp":
    data[vname] = (data[vname] - data[vname].mean()) / data[vname].std()

In the previous cell, we standardized each individual variable. We will also want to standardize the interaction products. When using the lasso, it is more straightforward to standardize all variables (including the outcome variable). This has two benefits. First, it means we don't need to include an intercept term, which would otherwise need to be penalized differently from the other coefficients. Second, it means that we can use the same penalty weight for all of the covariates. The following cell creates all the interaction products and standardizes them.

In [5]:
vnames = ["age", "bmi", "skin", "bp", "type"]

for j1,v1 in enumerate(vnames):
    for j2,v2 in enumerate(vnames[0:j1]):
        x = data[v1] * data[v2]
        x = (x - x.mean()) / x.std()
        data[v1 + "_x_" + v2] = x

We save the data so that we can compare with R below.

In [6]:
data.to_csv("dibetes.csv", float_format="%.4f")

We start with a big model that contains the main effects and all interactions with the type variable. This model is fit using unregularized OLS.

In [7]:
print data.columns
fml = "log_glu ~ 0 + bp + skin + age + bmi + type + type_x_bmi + type_x_age + type_x_skin + type_x_bp"
mod = sm.OLS.from_formula(fml, data)
rslt = mod.fit()
print rslt.summary()
Index([u'Unnamed: 0', u'npreg', u'glu', u'bp', u'skin', u'bmi', u'ped', u'age', u'type', u'log_glu', u'bmi_x_age', u'skin_x_age', u'skin_x_bmi', u'bp_x_age', u'bp_x_bmi', u'bp_x_skin', u'type_x_age', u'type_x_bmi', u'type_x_skin', u'type_x_bp'], dtype='object')
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                log_glu   R-squared:                       0.286
Model:                            OLS   Adj. R-squared:                  0.266
Method:                 Least Squares   F-statistic:                     14.39
Date:                Mon, 08 Sep 2014   Prob (F-statistic):           1.43e-19
Time:                        04:00:51   Log-Likelihood:                -414.61
No. Observations:                 332   AIC:                             847.2
Df Residuals:                     323   BIC:                             881.5
Df Model:                           9                                         
Covariance Type:            nonrobust                                         
===============================================================================
                  coef    std err          t      P>|t|      [95.0% Conf. Int.]
-------------------------------------------------------------------------------
bp              0.0597      0.053      1.125      0.261        -0.045     0.164
skin            0.0508      0.064      0.789      0.431        -0.076     0.177
age             0.0642      0.053      1.204      0.229        -0.041     0.169
bmi             0.0725      0.068      1.073      0.284        -0.060     0.205
type            0.4581      0.056      8.197      0.000         0.348     0.568
type_x_bmi     -0.0097      0.065     -0.149      0.882        -0.137     0.118
type_x_age     -0.0254      0.053     -0.474      0.636        -0.131     0.080
type_x_skin    -0.0656      0.063     -1.048      0.295        -0.189     0.057
type_x_bp       0.0160      0.053      0.302      0.763        -0.088     0.121
==============================================================================
Omnibus:                        5.838   Durbin-Watson:                   2.069
Prob(Omnibus):                  0.054   Jarque-Bera (JB):                3.839
Skew:                          -0.086   Prob(JB):                        0.147
Kurtosis:                       2.502   Cond. No.                         2.82
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Next we fit a sequence of models using the lasso. We collect the parameter estimates and display them as a table. This is a crude way to obtain the "solution path". In the near future, we will provide a function to do this for you.

In [8]:
mat = []
alphas = np.arange(0, 0.26, 0.05)
for alpha in alphas:
    mod = sm.OLS.from_formula(fml, data)
    rslt = mod.fit_regularized(alpha=alpha)
    mat.append(rslt.params)
mat = pd.DataFrame(mat, index=[str(x) for x in alphas])
print mat.T
                  0.0      0.05       0.1      0.15       0.2     0.25
bp           0.059741  0.034172  0.009133  0.000000  0.000000  0.00000
skin         0.050784  0.034924  0.010779  0.000000  0.000000  0.00000
age          0.064180  0.046337  0.016808  0.000000  0.000000  0.00000
bmi          0.072482  0.054131  0.037967  0.008797  0.000000  0.00000
type         0.458066  0.410491  0.382537  0.350764  0.303381  0.25323
type_x_bmi  -0.009687  0.000000  0.000000  0.000000  0.000000  0.00000
type_x_age  -0.025357  0.000000  0.000000  0.000000  0.000000  0.00000
type_x_skin -0.065568 -0.009761  0.000000  0.000000  0.000000  0.00000
type_x_bp    0.016021  0.000000  0.000000  0.000000  0.000000  0.00000
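
The coefficient table above is easier to read as a picture. Here is a minimal sketch (assuming matplotlib is available) that plots each coefficient's trajectory against the penalty weight:

import matplotlib.pyplot as plt

# Plot the coarse solution path: one line per coefficient, as a function of alpha.
plt.figure()
for vname in mat.columns:
    plt.plot(alphas, mat[vname], label=vname)
plt.xlabel("Penalty weight (alpha)")
plt.ylabel("Coefficient estimate")
plt.legend(loc="upper right", fontsize="small")
plt.show()

Coefficients whose lines hit zero and stay there are the ones the lasso has dropped at that level of penalization.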

Here is R code that performs the same analysis using the glmnet library:

library(glmnet)

# The data set created above.
data = read.csv("diabetes.csv")

x = cbind(data$bp, data$skin, data$age, data$bmi, data$type, data$type_x_bmi, data$type_x_age,
    data$type_x_skin, data$type_x_bp)
y = data$log_glu

model = glmnet(x, y, lambda=c(0, 0.05, 0.1, 0.15, 0.2, 0.25))

print(model$beta[,6])
print(model$beta[,1])

The results below agree with what we obtained above for alpha=0 and alpha=0.25, respectively (note that glmnet sorts the supplied lambda values into decreasing order, so column 6 of model$beta holds the unpenalized fit and column 1 holds the fit for lambda=0.25):

          V1           V2           V3           V4           V5           V6 
0.059751856  0.050783199  0.064140463  0.072454663  0.458089343 -0.009592389 
          V7           V8           V9 
-0.025364554 -0.065635870  0.016002218 
       V1        V2        V3        V4        V5        V6        V7        V8 
0.0000000 0.0000000 0.0000000 0.0000000 0.2536029 0.0000000 0.0000000 0.0000000 
       V9 
0.0000000