The goal of this lab is to explore the effect of regularization on the coefficients and accuracy of linear regression models for a toy (Ames housing) dataset.
import numpy as np
import pandas as pd
np.random.seed(999)
from sklearn.linear_model import LinearRegression, LogisticRegression, Lasso, Ridge
from sklearn.ensemble import RandomForestRegressor
from sklearn.datasets import load_wine, load_boston
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
import matplotlib.pyplot as plt
import matplotlib as mpl
%config InlineBackend.figure_format = 'retina'
df_ames = pd.read_csv("ames.csv") # in same directory as this notebook
df_ames.head(2)
Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | ... | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 60 | RL | 65.0 | 8450 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2008 | WD | Normal | 208500 |
1 | 2 | 20 | RL | 80.0 | 9600 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 5 | 2007 | WD | Normal | 181500 |
2 rows × 81 columns
# ignore columns with missing values for this exercise
cols_with_missing = df_ames.columns[df_ames.isnull().any()]
cols = set(df_ames.columns) - set(cols_with_missing)
X = df_ames[cols]
X = X.drop('LotArea', axis=1) # has some outliers, let's drop for stability in this lab
X = X.drop('SalePrice', axis=1)
y = df_ames['SalePrice']
X = pd.get_dummies(X) # dummy encode categorical variables
X.shape
(1460, 215)
1. Split off validation/test set and train unregularized linear regression model
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=True, random_state=999)
lm = LinearRegression()
lm.fit(X_train, y_train)
LinearRegression()
2. Compare the R^2 on training and test set
print(f"{lm.score(X_train, y_train):.2f} R^2 on training set")
print(f"{lm.score(X_test, y_test):.2f} R^2 on test set")
0.92 R^2 on training set -5800585.57 R^2 on test set
Q. Why is the testing score much worse?
3. Predict $\overline{y}$ computed from training set instead of using the model; what is the R^2? on test set?
R^2 should give 0 as the neutral value when the prediction is no better and no worse than simply predicting the average of the training set.
y_pred = np.full(shape=(len(y_test),1), fill_value=np.mean(y_train)) # get vector of y_train bar
r2_score(y_pred, y_test) # how well do we predict y_test using y_train bar?
0.0
4. Extract $\beta_1, ..., \beta_p$ and count how many are close to 0
Note: sum(np.abs(x) < 1e-5)
is a decent way to check for values of x
close to zero but not necessarily zero. There is also numpy.isclose()
but that is too strict (requires numbers to be really close to zero) for this exercise.
lm_beta = lm.coef_
sum(np.abs(lm_beta) < 1e-5) # how many close to 0?
1
5. Plot the coefficient index versus the value
R^2 should give 0 as the neutral value when the prediction is no better and no worse than simply predicting the average of the training set. The coefficients should look something like the following images, depending on which training samples are chosen:
(There is also some randomness probably due to how the sklearn regression model deletes collinear columns to prevent the symbolic solution that finds coefficients from failing.)
The following function is a handy way to plot the coefficients.
def plotbeta(beta, which, yrange=(-20_000, 20_000),fontsize=10, xlabel=True, ylabel=True, ax=None):
if ax is None:
fig,ax = plt.subplots(1,1,figsize=(4,2.5))
ax.bar(range(len(beta)),beta)
if xlabel:
ax.set_xlabel("Coefficient $\\beta_i$ for $i \\geq 1$", fontsize=fontsize)
if ylabel:
ax.set_ylabel("Coefficient value", fontsize=fontsize)
if yrange is not None:
ax.set_ylim(*yrange)
ax.tick_params(axis='both', which='major', labelsize=fontsize)
plt.gca().yaxis.set_major_formatter(mpl.ticker.StrMethodFormatter('{x:.0f}'))
ax.set_title(f"{which} $\\overline{{\\beta}}$={np.mean(beta):.0f}, $\\sigma(\\beta)$={np.std(beta):.1f}", fontsize=fontsize)
plotbeta(lm_beta, "OLS", yrange=None)
#plt.tight_layout(); plt.savefig("ames-ols-2.png",dpi=150,bbox_inches=0)
Both L1 and L2 regularization require normalizing variables; sklearn has an option to do this as part of the training process, but let's do it explicitly as a way to get started.
from pandas.api.types import is_numeric_dtype
def normalize(X):
for colname in X.columns:
if is_numeric_dtype(X[colname]):
u = np.mean(X[colname])
s = np.std(X[colname])
X[colname] = (X[colname] - u) / s
cols_with_missing = df_ames.columns[df_ames.isnull().any()]
cols = set(df_ames.columns) - set(cols_with_missing)
X = df_ames[cols].drop('SalePrice', axis=1)
y = df_ames['SalePrice']
normalize(X) # do this before getting dummy variables
X = pd.get_dummies(X) # make sure to normalize before you get the dummy variables
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=True, random_state=999)
X_train.head(3) # These values should be normalized
OverallQual | 3SsnPorch | MoSold | WoodDeckSF | FullBath | LowQualFinSF | MSSubClass | BsmtFinSF1 | BsmtUnfSF | LotArea | ... | LandContour_Bnk | LandContour_HLS | LandContour_Low | LandContour_Lvl | RoofStyle_Flat | RoofStyle_Gable | RoofStyle_Gambrel | RoofStyle_Hip | RoofStyle_Mansard | RoofStyle_Shed | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1441 | -0.071836 | -0.116339 | -0.489110 | 0.437009 | -1.026041 | -0.120242 | 1.492282 | 0.555685 | -0.942327 | -0.610435 | ... | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 |
533 | -3.688413 | -0.116339 | -1.969111 | -0.752176 | -1.026041 | -0.120242 | -0.872563 | -0.973018 | -1.284176 | -0.552908 | ... | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
1078 | -0.071836 | -0.116339 | -0.489110 | 0.365179 | -1.026041 | -0.120242 | 1.492282 | 0.478921 | -0.863090 | -0.609533 | ... | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 |
3 rows × 216 columns
# Note that the sklearn model constructors call the parameter `alpha` not `lmbda`.
lmbda = 5
1. Train an L1 (lasso) linear regression model using lmbda
lasso = ...
# fit model to X,y
lasso = Lasso(alpha=lmbda, tol=.1) lasso.fit(X_train, y_train)
2. Examine the training and testing scores
Note that the sklearn model constructors call the parameter alpha
not lmbda
.
print(f"{lasso.score(X_train, y_train):.2f} R^2 on training set")
print(f"{lasso.score(X_test, y_test):.2f} R^2 on test set")
Q. Why is the testing score much worse than the training score?
Q. Why is the training score about the same for OLS and regularized regression?
3. Count how many $\beta_{1..p}$ coefficients are (close to) zero
sum(...) # how many close to 0?
sum(np.abs(lasso.coef_) < 1e-5) # how many close to 0?
4. Plot the $\beta_{1..p}$ coefficients using plotbeta
plotbeta(lasso.coef_, "Lasso $\lambda="+str(lmbda)+"$: ")
1. Train an L2 (Ridge) linear regression model using lmbda
Note that the sklearn model constructors call the parameter alpha
not lmbda
.
ridge = ...
# fit to X,y
ridge = Ridge(alpha=lmbda) ridge.fit(X_train, y_train)
2. Examine the training and testing scores
print(f"{ridge.score(X_train, y_train):.2f} R^2 on training set")
print(f"{ridge.score(X_test, y_test):.2f} R^2 on test set")
Q. Why are the R^2 scores for L2 similar to L1?
3. Count how many $\beta_{1..p}$ coefficients are (close to) zero
sum(np.abs(ridge.coef_) < 1e-5) # how many close to 0?
4. Plot the $\beta_{1..p}$ coefficients using plotbeta
plotbeta(ridge.coef_, "Ridge $\lambda="+str(lmbda)+"$: ")
Q. Compare the standard deviation of the coefficients for L1 and L2. What do you notice visually?
The goal of the following code snippets is to help you visualize how the $\lambda$ regularization parameter affects model parameters and associated training and testing scores. There are a number of important questions to answer following the code snippets.
fig,axes = plt.subplots(1,7,figsize=(13.5,1.8), sharey=True)
for i,lmbda in enumerate([1,10,100,200,500,1000,2000]):
lasso = Lasso(alpha=lmbda, tol=.1)
lasso.fit(X_train, y_train)
r2 = lasso.score(X_train, y_train)
r2t = lasso.score(X_test, y_test)
print(f"lambda={lmbda:4d}: Zeros {sum(np.abs(lasso.coef_) < 1e-5):3d}: train R^2 {r2:.2f} test R^2 {r2t:.2f}")
plotbeta(lasso.coef_, f"$\\lambda={lmbda}$\n", ax=axes[i], fontsize=9, xlabel=False, ylabel=i==0)
Q. Why does the training R^2 score go down as we increase lambda?
Q. Why does the L1 testing R^2 score go up we increase lambda?
Q. Describe what is happening to the L1 coefficients as we increase lambda?
fig,axes = plt.subplots(1,7,figsize=(13.5,1.8), sharey=True)
for i,lmbda in enumerate([1,10,100,200,500,1000,2000]):
ridge = Ridge(alpha=lmbda, tol=.1)
ridge.fit(X_train, y_train)
r2 = ridge.score(X_train, y_train)
r2t = ridge.score(X_test, y_test)
print(f"lambda={lmbda:4d}: Zeros {sum(np.abs(ridge.coef_) < 1e-5)}: train R^2 {r2:.2f} test R^2 {r2t:.2f}")
plotbeta(ridge.coef_, f"$\\lambda={lmbda}$\n", ax=axes[i], fontsize=9, xlabel=False, ylabel=i==0)
Q. Why does the training R^2 score go down as we increase lambda?
Q. Describe what is happening to the L2 coefficients as we increase lambda?
Q. Characterize how the magnitudes of L1 and L2 coefficients differ.