*Key ideas:* L1 regularization, logistic regression, simulation study

In this notebook we demonstrate L1 penalized logistic regression (the "logistic lasso") using several data sets relating to the utilization of medical services.

Here are the usual import statements:

In [1]:

```
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from statsmodels.discrete.discrete_model import Logit
```

In [2]:

```
data = pd.read_csv("http://vincentarelbundock.github.io/Rdatasets/csv/Ecdat/Doctor.csv")
data["children x access"] = data["children"] * data["access"]
data["children x health"] = data["children"] * data["health"]
data["access x health"] = data["access"] * data["health"]
```

In [3]:

```
vnames = ["children", "access", "health", "children x access", "children x health", "access x health"]
exog = data[vnames]
exog -= exog.mean(0)
exog /= exog.std(0)
```
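Standardizing the columns matters here because the same penalty weight is applied to every coefficient; without putting the covariates on a common scale, the penalty would bite hardest on whichever variables happen to have small numeric ranges. A quick check of what the centering and scaling above accomplishes (illustrative data, not the Doctor data set):

```python
import numpy as np
import pandas as pd

# Two columns on very different scales.
df = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0],
                   "y": [100.0, 300.0, 200.0, 400.0]})
df -= df.mean(0)
df /= df.std(0)

# After standardizing, every column has mean 0 and (sample) std 1.
print(df.mean(0))
print(df.std(0))
```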

Next we compute the *solution paths*, which are the paths followed by the regression coefficients as the penalty weight is varied. To do this, we specify a sequence of tuning parameter values.
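For intuition about what a solution path looks like: in a one-variable problem with an orthonormal design, the L1-penalized estimate is the unpenalized estimate soft-thresholded by the penalty, so the path declines linearly toward zero and then sticks at exactly zero once the penalty is large enough. A hand-rolled sketch of this special case (plain numpy, not statsmodels):

```python
import numpy as np

beta_ols = 2.5                        # hypothetical unpenalized estimate
penalties = np.linspace(0.0, 4.0, 9)  # grid of penalty weights

# Lasso solution for an orthonormal design: soft-threshold the OLS estimate.
path = np.sign(beta_ols) * np.maximum(np.abs(beta_ols) - penalties, 0.0)

# The coefficient shrinks linearly, then is exactly 0 past penalty = 2.5.
print(path)
```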

In [4]:

```
avals = np.linspace(1, 100, 50)
```

In [5]:

```
exog = np.concatenate((np.ones((exog.shape[0], 1)), exog), axis=1)
vnames.insert(0, "Intercept")
params = {}
for tr in range(1, 5):
    pr = []
    endog = 1*(data["doctor"] >= tr)
    for a in avals:
        alpha = a * np.ones(exog.shape[1], dtype=np.float64)
        alpha[0] = 0  # Don't penalize the intercept
        mod = Logit(endog, exog)
        rslt = mod.fit_regularized(alpha=alpha, disp=False)
        pr.append(rslt.params)
    params[tr] = np.asarray(pr)
```
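The line `endog = 1*(data["doctor"] >= tr)` dichotomizes the visit count at each threshold. The idiom is worth noting: comparing an array to a scalar yields booleans, and multiplying by 1 converts them to 0/1 integers (toy data here, not the Doctor data set):

```python
import numpy as np

visits = np.array([0, 1, 2, 5])

# True/False comparison results become 0/1 after multiplying by 1.
endog = 1 * (visits >= 2)
print(endog)  # [0 0 1 1]
```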

Next we have a helper function that plots the solution paths for a given dichotomization threshold.

In [6]:

```
def make_plot(tr):
    plt.figure(figsize=(10, 5))
    plt.clf()
    plt.axes([0.1, 0.1, 0.67, 0.8])
    pr = params[tr]
    ag = []
    for k in range(pr.shape[1]):
        # Cycle through the three line styles every seven variables.
        v, = plt.plot(avals, pr[:, k], linestyle=["-", "--", ":"][(k // 7) % 3])
        ag.append(v)
    leg = plt.figlegend(ag, vnames, "center right")
    leg.draw_frame(False)
    plt.xlabel("Penalty weight", size=17)
    plt.ylabel("Coefficient", size=17)
    plt.title(">=%d visit%s/year" % (tr, "s" if tr > 1 else ""), size=17)
    plt.grid(True)
```

In [7]:

```
make_plot(1)
make_plot(2)
make_plot(3)
```

In [8]:

```
data = pd.read_csv("http://vincentarelbundock.github.io/Rdatasets/csv/Ecdat/DoctorContacts.csv")
endog = data["mdu"]
vnames = list(data.columns)
vnames.remove("mdu")
vnames.remove("Unnamed: 0")
exog = pd.DataFrame(data[vnames])
```

We need to explicitly convert the boolean variables to numbers:

In [9]:

```
for k in range(exog.shape[1]):
    if exog.iloc[:, k].dtype == bool:
        exog.iloc[:, k] = exog.iloc[:, k].astype(np.int32)
```

In [10]:

```
exog["health"] = pd.Categorical(exog["health"]).codes
exog["sex"] = pd.Categorical(exog["sex"]).codes
exog = exog.apply(pd.to_numeric)
exog -= exog.mean(0)
exog /= exog.std(0)
exog["Intercept"] = 1.
```
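As an aside, `pd.Categorical.from_array(...).labels` is the old pandas spelling of this encoding; in current pandas the same integer codes are obtained from `pd.Categorical(...).codes`, assigned in sorted category order. A small illustration with toy values (not the actual health levels):

```python
import pandas as pd

health = pd.Categorical(["good", "poor", "good", "fair"])

# Categories are stored in sorted order, and codes index into them.
print(list(health.categories))  # ['fair', 'good', 'poor']
print(list(health.codes))       # [1, 2, 1, 0]
```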

In [11]:

```
endog1 = 1*(endog > 1)
avals = np.linspace(1, 2500, 50)
pr = []
for a in avals:
    mod = Logit(endog1, exog)
    # Note: a scalar alpha applies the penalty to every column,
    # including the intercept column added above.
    rslt = mod.fit_regularized(alpha=a, disp=False)
    pr.append(rslt.params)
pr = np.asarray(pr)
params = {2: pr}
```

Here is a plot of the solution paths:

In [12]:

```
vnames = exog.columns
make_plot(2)
```