L1 regularized logistic regression

Key ideas: L1 regularization, logistic regression, simulation study

L1 penalized regression is an approach that can be used to select variables in a regression model. Some of the regression models implemented in Statsmodels have a fit_regularized method that can be used in place of the usual fit method to obtain a fit with L1 regularization.

In this notebook, we demonstrate L1 penalized logistic regression using the Logit class provided in the discrete_model module (there is another implementation of logistic regression in the genmod module that does not currently have L1 regularization).
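
Concretely, writing loglike(beta) for the model log-likelihood, the L1 fit minimizes a penalized criterion of roughly the form

    -loglike(beta) + alpha * sum_j |beta_j|

(the exact scaling convention, e.g. whether the log-likelihood is divided by the number of observations, varies between implementations). The absolute-value penalty is what forces some coefficient estimates to be exactly zero.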

We start with a few import statements:

In [1]:
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.discrete.discrete_model import Logit

Next we simulate data that follow a standard logistic regression data generating model. We specify a model in which most of the variables have no relationship with the response; the coefficients of these variables should be shrunk to zero by the L1 procedure if there is enough data.

In [2]:
n = 200  # sample size
p = 10   # number of covariates
exog = np.random.normal(size=(n, p))
# Only the first three covariates carry signal.
lpred = exog[:, 0] + 0.5*exog[:, 1] - 0.5*exog[:, 2]
prob = 1 / (1 + np.exp(-lpred))  # logistic link
endog = 1*(np.random.uniform(size=n) <= prob)  # binary response
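
As a quick sanity check (our addition, not part of the original analysis), the overall event rate should be near one half, since the linear predictor is centered at zero:

# The linear predictor has mean zero, so roughly half the responses are events.
print(endog.mean())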

First, for comparison, we fit a traditional logistic regression without regularization:

In [3]:
mod = Logit(endog, exog)
rslt1 = mod.fit()
print(rslt1.summary())
Optimization terminated successfully.
         Current function value: 0.576504
         Iterations 5
                           Logit Regression Results                           
==============================================================================
Dep. Variable:                      y   No. Observations:                  200
Model:                          Logit   Df Residuals:                      190
Method:                           MLE   Df Model:                            9
Date:                Fri, 18 Jul 2014   Pseudo R-squ.:                  0.1595
Time:                        19:13:41   Log-Likelihood:                -115.30
converged:                       True   LL-Null:                       -137.19
                                        LLR p-value:                 1.556e-06
==============================================================================
                 coef    std err          z      P>|z|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
x1             0.9861      0.186      5.288      0.000         0.621     1.352
x2             0.4116      0.160      2.576      0.010         0.098     0.725
x3            -0.0370      0.154     -0.241      0.810        -0.339     0.265
x4             0.3095      0.165      1.876      0.061        -0.014     0.633
x5             0.1902      0.159      1.198      0.231        -0.121     0.501
x6            -0.1029      0.171     -0.601      0.548        -0.438     0.232
x7            -0.0634      0.156     -0.406      0.685        -0.370     0.243
x8             0.1059      0.169      0.625      0.532        -0.226     0.438
x9            -0.0159      0.156     -0.102      0.919        -0.321     0.290
x10            0.2284      0.153      1.488      0.137        -0.072     0.529
==============================================================================
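
For reference (a comparison we add here; it was not part of the original notebook), we can line up the estimated coefficients against the true values used in the simulation:

# True coefficients: (1, 0.5, -0.5, 0, ..., 0). Columns: truth, MLE estimate.
true_params = np.r_[1.0, 0.5, -0.5, np.zeros(7)]
print(np.column_stack((true_params, rslt1.params)))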

To fit a regularized model, we must specify a tuning parameter, denoted alpha, that determines the strength of the penalty. Greater values of alpha result in fitted models with a greater number of coefficients that are exactly zero. Here we arbitrarily set alpha to 10:

In [4]:
rslt2 = mod.fit_regularized(alpha=10, disp=False)
print(rslt2.summary())
                           Logit Regression Results                           
==============================================================================
Dep. Variable:                      y   No. Observations:                  200
Model:                          Logit   Df Residuals:                      197
Method:                           MLE   Df Model:                            2
Date:                Fri, 18 Jul 2014   Pseudo R-squ.:                  0.1114
Time:                        19:13:41   Log-Likelihood:                -121.91
converged:                       True   LL-Null:                       -137.19
                                        LLR p-value:                 2.316e-07
==============================================================================
                 coef    std err          z      P>|z|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
x1             0.6058      0.158      3.831      0.000         0.296     0.916
x2             0.1708      0.141      1.207      0.227        -0.106     0.448
x3                  0        nan        nan        nan           nan       nan
x4                  0        nan        nan        nan           nan       nan
x5                  0        nan        nan        nan           nan       nan
x6                  0        nan        nan        nan           nan       nan
x7                  0        nan        nan        nan           nan       nan
x8                  0        nan        nan        nan           nan       nan
x9                  0        nan        nan        nan           nan       nan
x10            0.0287      0.137      0.209      0.835        -0.240     0.298
==============================================================================
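
In practice, alpha is usually chosen in a data-driven way rather than arbitrarily. As a minimal sketch (our addition; the 50/50 split, the candidate grid, and the variable names are illustrative, and statsmodels does not select alpha for us), we could pick alpha by out-of-sample log-likelihood on a holdout set:

# Illustrative holdout selection of alpha: fit on the first half of the data,
# score each candidate penalty by validation log-likelihood on the second half.
n_train = n // 2
mod_train = Logit(endog[:n_train], exog[:n_train])
best_alpha, best_ll = None, -np.inf
for a in np.linspace(0, 50, 26):
    rslt = mod_train.fit_regularized(alpha=a, disp=False)
    lp = np.dot(exog[n_train:], rslt.params)
    pr = 1 / (1 + np.exp(-lp))
    y = endog[n_train:]
    ll = np.sum(y*np.log(pr) + (1 - y)*np.log(1 - pr))
    if ll > best_ll:
        best_alpha, best_ll = a, ll
print(best_alpha)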

The solution path is the set of all coefficient estimates, expressed as a function of the penalty parameter. Here we (roughly) obtain the solution path by fitting the regularized regression with a sequence of penalty weights.

In [5]:
# Fit the model over a grid of 50 penalty weights between 0 and 50.
alpha = np.linspace(0, 50, 50)
params = []
for a in alpha:
    rslt2 = mod.fit_regularized(alpha=a, disp=False)
    params.append(rslt2.params)
params = np.asarray(params)  # rows: penalty weights, columns: coefficients

Here is a plot of the solution paths for all 10 covariates. The covariates with nonzero coefficients in the data generating model are plotted in color; the paths for the remaining coefficients are plotted in grey.

In [6]:
plt.clf()
cmap = {0: "red", 1: "green", 2: "blue"}
for k in range(10):
    # Grey for the null covariates, color for the three true signals.
    color = cmap.get(k, "grey")
    plt.plot(alpha, params[:, k], color=color)
plt.xlabel("Penalty weight", size=17)
plt.ylabel("Coefficient", size=17)
Out[6]:
<matplotlib.text.Text at 0x4b08610>