The goal of this lab is to explore the effect of regularization on the coefficients and accuracy of logistic regression models for a toy (wine) dataset.
import numpy as np
import pandas as pd
np.random.seed(999)
from sklearn.linear_model import LinearRegression, LogisticRegression, Lasso, Ridge
from sklearn.ensemble import RandomForestRegressor
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, precision_score, recall_score, log_loss, mean_absolute_error
import matplotlib.pyplot as plt
import matplotlib as mpl
%config InlineBackend.figure_format = 'retina'
from pandas.api.types import is_numeric_dtype
def normalize(X):
    for colname in X.columns:
        if is_numeric_dtype(X[colname]):
            u = np.mean(X[colname])
            s = np.std(X[colname])
            X[colname] = (X[colname] - u) / s
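As a quick sanity check of what normalize() does, here is a minimal sketch on a small synthetic frame (not the wine data): after z-scoring, each numeric column should have mean ≈ 0 and standard deviation ≈ 1, and non-numeric columns are left alone.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Small synthetic frame; 'b' is non-numeric and should be left untouched.
demo = pd.DataFrame({'a': [1.0, 2.0, 3.0, 4.0], 'b': ['x', 'y', 'x', 'y']})

# Same in-place z-score idea as normalize() above (np.std uses ddof=0).
for colname in demo.columns:
    if is_numeric_dtype(demo[colname]):
        demo[colname] = (demo[colname] - np.mean(demo[colname])) / np.std(demo[colname])

print(np.mean(demo['a']), np.std(demo['a']))  # ~0.0 and ~1.0
```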
wine = load_wine()
df_wine = pd.DataFrame(data=wine.data, columns=wine.feature_names)
df_wine['y'] = wine.target
df_wine = df_wine[df_wine['y']<2] # Only do 2-class problem {0,1}
X = df_wine.drop('y', axis=1)
y = df_wine['y']
print(f"{len(X)} records for classes {{0,1}} from {len(wine.data)} records")
X.head(2)
130 records for classes {0,1} from 178 records
| | alcohol | malic_acid | ash | alcalinity_of_ash | magnesium | total_phenols | flavanoids | nonflavanoid_phenols | proanthocyanins | color_intensity | hue | od280/od315_of_diluted_wines | proline |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 14.23 | 1.71 | 2.43 | 15.6 | 127.0 | 2.80 | 3.06 | 0.28 | 2.29 | 5.64 | 1.04 | 3.92 | 1065.0 |
| 1 | 13.20 | 1.78 | 2.14 | 11.2 | 100.0 | 2.65 | 2.76 | 0.26 | 1.28 | 4.38 | 1.05 | 3.40 | 1050.0 |
1. Split off a validation/test set
Select a seed and random state known to yield poor validation set accuracy. (Depending on the split, scores will fluctuate, particularly in the presence of outliers.)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=True, random_state=999)
2. Train a non-regularized logistic regression model
Use parameters: penalty='none', solver='lbfgs', max_iter=200. Train on X_train, y_train.
lg = ...
# train model
lg = LogisticRegression(penalty='none', solver='lbfgs', max_iter=200)
lg.fit(X_train, y_train)
3. Compare metrics for training and test set
See model assessment for details on the log loss metric; briefly, a score of zero means "no loss" (i.e., a perfect score), and the loss grows without bound from there, depending on how bad the model is. Log loss penalizes models that are confident in the wrong answer.
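To make the "confident in the wrong answer" penalty concrete, here is a tiny hand computation on toy probabilities (not the wine model), checked against sklearn's log_loss, which averages the per-sample losses $-[y \log p + (1-y)\log(1-p)]$:

```python
import numpy as np
from sklearn.metrics import log_loss

# Two toy samples: true labels 1 and 0, but the model says P(y=1)=0.99 for both,
# so it is confident-correct on the first and confident-wrong on the second.
y_true = [1, 0]
y_prob = [0.99, 0.99]

per_sample = [-np.log(0.99),   # correct and confident: tiny loss (~0.01)
              -np.log(0.01)]   # wrong and confident: huge loss (~4.61)

# log_loss reports the mean of the per-sample losses.
assert np.isclose(log_loss(y_true, y_prob), np.mean(per_sample))
print(log_loss(y_true, y_prob))  # ~2.31
```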
print(f"Accuracy: Train {100*lg.score(X_train, y_train):.0f}%, Test {100*lg.score(X_test, y_test):.0f}%")
y_proba_train = lg.predict_proba(X_train)[:,1]
y_proba_test = lg.predict_proba(X_test)[:,1]
print(f"Log loss: Train {log_loss(y_train, y_proba_train):.2f}, Test {log_loss(y_test, y_proba_test):.2f}")
Q. What are some possible reasons for the difference in training and testing scores?
4. Extract $\beta_1, ..., \beta_p$ and count how many are close to 0
Note: `sum(np.abs(x) < 1e-5)` is a decent way to count values of x that are close to zero but not necessarily zero. There is also `numpy.isclose()`, but its default tolerance is too strict (it requires numbers to be really close to zero) for this exercise.
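A quick illustration of the difference on a toy array (assumed values, just for intuition): the 1e-5 threshold counts small-but-nonzero entries, while `np.isclose` compared against 0 effectively falls back to its strict `atol=1e-8` default, because its relative tolerance contributes nothing when the target is zero.

```python
import numpy as np

x = np.array([0.0, 3e-6, 1e-3, 2.5])

# Threshold test: counts anything with magnitude below 1e-5 as "zero".
print(sum(np.abs(x) < 1e-5))   # 2  (the exact 0.0 and 3e-6)

# np.isclose uses |a-b| <= atol + rtol*|b|; with b=0 only atol=1e-8 remains,
# so 3e-6 is NOT considered close to zero.
print(sum(np.isclose(x, 0)))   # 1  (only the exact 0.0)
```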
lg_beta = ...
...
lg_beta = lg.coef_[0]
sum(np.abs(lg_beta) < 1e-5)  # how many close to 0?
5. Plot the coefficient index versus the value
The following function is a handy way to plot the coefficients.
def plotbeta(beta, which, yrange=(-20_000, 20_000), fontsize=10, xlabel=True, ylabel=True, tick_format='{x:.1f}', ax=None):
    if ax is None:
        fig, ax = plt.subplots(1, 1, figsize=(4.5, 2.5))
    ax.bar(range(len(beta)), beta)
    if xlabel:
        ax.set_xlabel("Coefficient $\\beta_i$ for $i \\geq 1$", fontsize=fontsize)
    if ylabel:
        ax.set_ylabel("Coefficient value", fontsize=fontsize)
    if yrange is not None:
        ax.set_ylim(*yrange)
    ax.tick_params(axis='both', which='major', labelsize=fontsize)
    ax.yaxis.set_major_formatter(mpl.ticker.StrMethodFormatter(tick_format))  # format this axes, not plt.gca(), so it works inside subplots
    ax.set_title(f"{which} $\\overline{{\\beta}}$={np.mean(beta):.2f}, $\\sigma(\\beta)$={np.std(beta):.2f}", fontsize=fontsize)
plotbeta(lg_beta, "Non-regularized", yrange=None)
#plt.tight_layout(); plt.savefig("wine-ols.png",dpi=150,bbox_inches=0)
6. Normalize the variables, retrain, check how many coefficients are close to 0
normalize(X_train)
normalize(X_test)
lg = LogisticRegression(penalty='none', solver='lbfgs')
lg.fit(X_train, y_train)
lg_beta = lg.coef_[0]
sum(np.abs(lg_beta) < 1e-5) # how many close to 0?
7. Compare metrics for training and test set
print(f"Accuracy: Train {100*lg.score(X_train, y_train):.0f}%, Test {100*lg.score(X_test, y_test):.0f}%")
y_proba_train = lg.predict_proba(X_train)[:,1]
y_proba_test = lg.predict_proba(X_test)[:,1]
print(f"Log loss: Train {log_loss(y_train, y_proba_train):.2f}, Test {log_loss(y_test, y_proba_test):.2f}")
Q. Without the max_iter argument set above 100, the logistic regression model does not converge on a solution for the unnormalized data. Why does the normalized data converge faster?
Q. Why does the test set score improve just by normalizing variables?
8. Plot the coefficient index versus the value again after normalizing variables
plotbeta(lg_beta, "Non-regularized normed", yrange=None)
#plt.tight_layout(); plt.savefig("wine-ols-norm.png",dpi=150,bbox_inches=0)
Q. Why is the scale of the coefficients different after normalizing the variables?
Q. Why does the shape of the coefficient graph look different for the normalized data?
"""
sklearn says LogisticRegression arg C is "Inverse of regularization strength...
smaller values specify stronger regularization"
"""
lmbda=.1
lg = LogisticRegression(C=1/lmbda, penalty='l1', solver='liblinear', max_iter=1000)
lg.fit(X_train, y_train)
lg_beta = lg.coef_[0]
sum(np.abs(lg_beta) < 1e-5) # how many close to 0?
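To see the C = 1/lambda relationship in action without depending on the wine split, here is a sketch on synthetic data (make_classification is a stand-in for X_train, y_train): raising lambda (lowering C) shrinks the total coefficient magnitude.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the wine data (the real lab uses X_train, y_train).
X_demo, y_demo = make_classification(n_samples=200, n_features=13,
                                     n_informative=5, random_state=0)

norms = []
for lmbda in [0.01, 1, 100]:
    lg = LogisticRegression(C=1/lmbda, penalty='l1', solver='liblinear',
                            max_iter=1000)
    lg.fit(X_demo, y_demo)
    norms.append(np.abs(lg.coef_[0]).sum())

# Larger lambda (smaller C) => stronger penalty => smaller total magnitude.
print(norms)
assert norms[0] > norms[2]
```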
Q. What does it mean for our model when 4 coefficients go to zero?
2. Compare metrics for training and test set
print(f"Accuracy: Train {100*lg.score(X_train, y_train):.0f}%, Test {100*lg.score(X_test, y_test):.0f}%")
y_proba_train = lg.predict_proba(X_train)[:,1]
y_proba_test = lg.predict_proba(X_test)[:,1]
print(f"Log loss: Train {log_loss(y_train, y_proba_train):.2f}, Test {log_loss(y_test, y_proba_test):.2f}")
3. Plot the coefficient index versus the value
plotbeta(lg_beta, "Lasso", yrange=None, tick_format='{x:.1f}')
#plt.tight_layout(); plt.savefig("wine-l1.png",dpi=150,bbox_inches=0)
Q. Compare the L1-regularized coefficients with the coefficients of the non-regularized model trained on normalized data.
Q. Compare the accuracy of the L1-regularized model to that of the non-regularized model trained on normalized data.
The goal of the following code snippets is to help you visualize how the $\lambda$ regularization parameter affects model parameters and associated training and testing accuracy. There are a number of important questions to answer following the code snippets.
fig,axes = plt.subplots(1,6,figsize=(11,1.8), sharey=True)
for i,lmbda in enumerate([1e-5,.01,.1, 1, 10, 100]):
lg = LogisticRegression(C=1/lmbda, penalty='l1', solver='liblinear', max_iter=1000)
lg.fit(X_train, y_train)
accur = lg.score(X_train, y_train)
accurt = lg.score(X_test, y_test)
y_proba_train = lg.predict_proba(X_train)[:,1]
y_proba_test = lg.predict_proba(X_test)[:,1]
print(f"lambda={lmbda:5}: Zeros {sum(np.abs(lg.coef_[0]) < 1e-5):3d}: Accuracy: Train {100*lg.score(X_train, y_train):3.0f}%, Test {100*lg.score(X_test, y_test):3.0f}%; Log loss: Train {log_loss(y_train, y_proba_train):.2f}, Test {log_loss(y_test, y_proba_test):.2f}")
plotbeta(lg.coef_[0], f"$\\lambda={lmbda}$\n", ax=axes[i], fontsize=9, xlabel=False, ylabel=i==0, yrange=None)
Q. Describe what is happening to the L1 coefficients as we increase lambda.
Q. Why does the training accuracy go down (and the training log loss go up) as we increase lambda?
Q. What $\lambda$ value would you choose for regularization?
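One way to answer is to automate the sweep and pick the lambda with the lowest held-out log loss. A sketch on synthetic data (in the lab you would reuse X_train/X_test instead of this stand-in split):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the wine data and its train/test split.
X_d, y_d = make_classification(n_samples=300, n_features=13, n_informative=5,
                               random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X_d, y_d, test_size=0.3,
                                          random_state=0)

best = None
for lmbda in [1e-5, .01, .1, 1, 10, 100]:
    lg = LogisticRegression(C=1/lmbda, penalty='l1', solver='liblinear',
                            max_iter=1000).fit(X_tr, y_tr)
    loss = log_loss(y_va, lg.predict_proba(X_va)[:, 1])
    if best is None or loss < best[1]:
        best = (lmbda, loss)

print(f"best lambda by validation log loss: {best[0]} (loss {best[1]:.2f})")
```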
lmbda=0.01
lg = LogisticRegression(C=1/lmbda, penalty='l2', solver='liblinear', max_iter=1000)
lg.fit(X_train, y_train)
lg_beta = lg.coef_[0]
sum(np.abs(lg_beta) < 1e-5) # how many close to 0?
2. Compare metrics for training and test set
print(f"Accuracy: Train {100*lg.score(X_train, y_train):.0f}%, Test {100*lg.score(X_test, y_test):.0f}%")
y_proba_train = lg.predict_proba(X_train)[:,1]
y_proba_test = lg.predict_proba(X_test)[:,1]
print(f"Log loss: Train {log_loss(y_train, y_proba_train):.2f}, Test {log_loss(y_test, y_proba_test):.2f}")
3. Plot the coefficient index versus the value
plotbeta(lg_beta, "Ridge", yrange=None, tick_format='{x:.1f}')
#plt.tight_layout(); plt.savefig("wine-l2.png",dpi=150,bbox_inches=0)
The goal of the following code snippets is to help you visualize how the $\lambda$ regularization parameter affects model parameters and associated training and testing accuracy. There are a number of important questions to answer following the code snippets.
fig,axes = plt.subplots(1,6,figsize=(11,1.8), sharey=True)
for i,lmbda in enumerate([1e-5,.01,.1, 1, 10, 100]):
lg = LogisticRegression(C=1/lmbda, penalty='l2', solver='liblinear', max_iter=1000)
lg.fit(X_train, y_train)
accur = lg.score(X_train, y_train)
accurt = lg.score(X_test, y_test)
y_proba_train = lg.predict_proba(X_train)[:,1]
y_proba_test = lg.predict_proba(X_test)[:,1]
print(f"lambda={lmbda:5}: Zeros {sum(np.abs(lg.coef_[0]) < 1e-5):3d}: Accuracy: Train {100*lg.score(X_train, y_train):3.0f}%, Test {100*lg.score(X_test, y_test):3.0f}%; Log loss: Train {log_loss(y_train, y_proba_train):.2f}, Test {log_loss(y_test, y_proba_test):.2f}")
plotbeta(lg.coef_[0], f"$\\lambda={lmbda}$\n", ax=axes[i], fontsize=9, xlabel=False, ylabel=i==0, yrange=None)
Q. Describe what is happening to the L2 coefficients as we increase lambda.
Q. Characterize how the magnitudes of L1 and L2 coefficients differ.
Q. Which regularization method would you choose for this data set?
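To check that characterization empirically, here is a sketch on synthetic data (again a stand-in for the wine split) counting near-zero coefficients under each penalty at the same lambda = 1: L1 typically zeroes out some features, while L2 only shrinks them toward zero.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in; only 5 of 13 features are informative.
X_d, y_d = make_classification(n_samples=200, n_features=13, n_informative=5,
                               random_state=1)

zeros = {}
for penalty in ['l1', 'l2']:
    lg = LogisticRegression(C=1.0, penalty=penalty, solver='liblinear',
                            max_iter=1000).fit(X_d, y_d)
    zeros[penalty] = int(sum(np.abs(lg.coef_[0]) < 1e-5))

# L1 can produce exact zeros (sparse solutions); L2 generally does not.
print(zeros)
assert zeros['l1'] >= zeros['l2']
```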