This simple dataset contains information about insect, fish, and bird species and whether or not they can fly:
| Name | Class | Can fly? |
|---|---|---|
| Pileated woodpecker | Birds | Yes |
| Emu | Birds | No |
| Northern cardinal | Birds | Yes |
| Blacktip shark | Cartilaginous fishes | No |
| Bluntnose stingray | Cartilaginous fishes | No |
| Black drum | Bony fishes | No |
| Florida carpenter ant | Insects | No |
| Periodical cicada | Insects | Yes |
| Luna moth | Insects | Yes |
Your task: Develop a model to classify whether or not an animal can fly, based on information available in the dataset.
Does this model make any mistakes? If so, can we improve it?
Aha! That model classifies each training example perfectly!
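One way a model can classify every training example perfectly is to simply memorize the table. The sketch below (entirely my own construction, not the workshop's answer) shows why "perfect on the training set" is not the same as "useful":

```python
# A "model" that memorizes the training data: a lookup table keyed by name.
train = {
    "Pileated woodpecker": True, "Emu": False, "Northern cardinal": True,
    "Blacktip shark": False, "Bluntnose stingray": False, "Black drum": False,
    "Florida carpenter ant": False, "Periodical cicada": True, "Luna moth": True,
}

def can_fly(name):
    # Perfect on every training example...
    if name in train:
        return train[name]
    # ...but it has no idea what to do with an unseen animal.
    raise KeyError(f"never saw {name!r} during training")

accuracy = sum(can_fly(n) == y for n, y in train.items()) / len(train)
print(accuracy)  # 1.0 on the training set
```

This is overfitting in its purest form: the model's "complexity" (one rule per animal) matches the size of the training set, and it generalizes to nothing.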
For this lesson, we will focus on two widely used regularization methods: L1 and L2 regularization. Both of these methods represent model complexity as a function of the model's feature weights.
Reminder: The general linear regression model looks like this:
$$ y = w_0 + w_1 x_1 + w_2 x_2 + \ldots + w_k x_k $$

The L1 regularization penalty is:
$$L_1\text{ }regularization\text{ }penalty = \lambda\sum_{i=1}^k |w_i|$$

import numpy as np
weights = [-0.5, -0.2, 0.5, 0.7, 1.0, 2.5]
The L2 regularization penalty is:
$$L_2\text{ }regularization\text{ }penalty = \lambda\sum_{i=1}^k w_i^2$$
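Using the `weights` list defined above, each penalty is one line of NumPy. (The value $\lambda = 0.1$ here is an arbitrary choice for illustration.)

```python
import numpy as np

weights = np.array([-0.5, -0.2, 0.5, 0.7, 1.0, 2.5])
lam = 0.1  # arbitrary lambda, chosen only for illustration

l1_penalty = lam * np.sum(np.abs(weights))  # 0.1 * 5.4  = 0.54
l2_penalty = lam * np.sum(weights ** 2)     # 0.1 * 8.28 = 0.828
print(l1_penalty, l2_penalty)
```

Notice how the large weight (2.5) dominates the L2 penalty (6.25 of the 8.28 before scaling), while the L1 penalty treats each unit of weight equally. This is why L2 pushes hard against large weights, whereas L1 tends to zero out small ones.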
Recall that the usual loss function for linear regression is the mean square error:
$$ MSE = \frac{1}{n} \sum_{i=1}^n (y_i - (w_0 + w_1 x_{i,1} + w_2 x_{i,2} + \ldots + w_k x_{i,k}))^2 $$

To add L1 regularization, we want to minimize:
$$ MSE + \lambda\sum_{i=1}^k |w_i|$$

Let's analyze a dataset called `regularization.csv`, which you can find in the `nb-datasets` folder.
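Before turning to that file, the combined objective can be checked by hand. This is a sketch on a tiny invented dataset (every number is made up for illustration):

```python
import numpy as np

# Tiny invented dataset: y is roughly 2x + 1 with a little noise.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.1, 2.9, 5.2, 6.8])

w0, w1 = 1.0, 2.0  # a candidate intercept and slope
lam = 0.5          # regularization strength (arbitrary)

pred = w0 + w1 * x
mse_val = np.mean((y - pred) ** 2)
l1_objective = mse_val + lam * abs(w1)  # the intercept w0 is not penalized
print(mse_val, l1_objective)
```

Note that the sum in the penalty starts at $i = 1$: by convention the intercept $w_0$ is excluded, since penalizing it would bias predictions toward zero rather than reduce model complexity.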
import pandas as pd
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error as mse
Let's try using regularization on a real dataset. We'll again use the iris dataset that you've already seen in previous lessons. We might not have time for this example during the workshop, and if not, I encourage you to explore it on your own.
idata = pd.read_csv('../nb-datasets/iris_dataset.csv')
idata['species'] = idata['species'].astype('category')
# Convert the categorical variable "species" to 1-hot encoding (AKA "dummy variables"),
# but eliminate the first dummy variable because it is collinear with the other two
# and does not provide any additional information.
idata_enc = pd.get_dummies(idata, drop_first=True)
# Separate the x and y values.
x = idata_enc.drop(columns='petal_length')
y = idata_enc['petal_length']
# Split the train and test sets.
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25)
# See what we have.
idata_enc.head()
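The fitting code itself isn't shown in this excerpt. Here is a sketch of what it might look like; I use scikit-learn's built-in iris loader (with renamed columns) as a stand-in for `iris_dataset.csv`, and the `alpha` values are arbitrary starting points:

```python
import pandas as pd
from sklearn.datasets import load_iris  # stand-in for the workshop's CSV file
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error as mse

iris = load_iris(as_frame=True)
idata = iris.frame.rename(columns={
    "sepal length (cm)": "sepal_length", "sepal width (cm)": "sepal_width",
    "petal length (cm)": "petal_length", "petal width (cm)": "petal_width",
    "target": "species",
})
idata_enc = pd.get_dummies(idata.astype({"species": "category"}), drop_first=True)
x = idata_enc.drop(columns="petal_length")
y = idata_enc["petal_length"]
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25,
                                                    random_state=0)

# Ridge applies the L2 penalty; Lasso applies the L1 penalty.
for name, model in [("OLS", LinearRegression()),
                    ("Ridge (L2)", Ridge(alpha=1.0)),
                    ("Lasso (L1)", Lasso(alpha=0.1))]:
    model.fit(x_train, y_train)
    print(f"{name}: test MSE = {mse(y_test, model.predict(x_test)):.4f}")
```

In scikit-learn, the constructor parameter `alpha` plays the role of $\lambda$ in the formulas above.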
Try experimenting with the value of `alpha`/$\lambda$ in the code above, for both L1 regularization and L2 regularization. As you do so, consider these questions: