from matplotlib import pyplot as plt
import numpy as np
import pandas as pd
import geopandas as gpd
np.random.seed(42)
%matplotlib inline
plt.rcParams['figure.figsize'] = (10,6)
Nov 14, 2019
Today, we'll walk through an end-to-end regression example to predict Philadelphia's housing prices
Given your training set of data, which model parameters best represent the observed data?
The cost function measures the difference between the model's predictions and the observed data
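For example, a common choice of cost function for regression is the mean squared error (MSE). A minimal sketch (with hypothetical y_true/y_pred arrays) looks like:
import numpy as np

def mean_squared_error_cost(y_true, y_pred):
    # Average of the squared differences between predictions and observations
    return np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)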
In scikit-learn, you will call the fit() method on your algorithm.
Key goal: how can we do this in a way that ensures the model is as generalizable as possible?
Or: "garbage in, garbage out"
Common issues:
Common to use 80% of data for your training set and 20% for your test set
For more information, see the scikit-learn docs
We'll load data compiled from two data sources:
data = pd.read_csv("./data/gdp_vs_satisfaction.csv")
data.head()
import hvplot.pandas
data.hvplot.scatter(
x="gdp_per_capita",
y="life_satisfaction",
hover_cols=["Country"],
ylim=(4, 9),
xlim=(1e3, 1.1e5),
)
A simple model with only two parameters: $\theta_1$ and $\theta_2$
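Written out, the linear model with these two parameters is $\text{life satisfaction} = \theta_1 + \theta_2 \times \text{GDP per capita}$, where $\theta_1$ plays the role of the intercept and $\theta_2$ the slope.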
Use the LinearRegression model object from scikit-learn.
This is not really machine learning; it simply finds the ordinary least squares (OLS) fit to the data.
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model
# our input features (in this case we only have 1)
X = data['gdp_per_capita'].values[:, np.newaxis]
# the labels (values we are trying to predict)
y = data['life_satisfaction'].values
Note: scikit-learn expects the features to be a 2D array with shape (number of observations, number of features). We explicitly add the second axis with np.newaxis.
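As a quick (purely illustrative) check, you can compare the shapes:
print(data['gdp_per_capita'].values.shape)  # 1D: (number of observations,)
print(X.shape)                              # 2D: (number of observations, 1)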
Now, fit the model using the model.fit(X, y) syntax.
This will "train" our model, using an optimization algorithm to identify the bestfit parameters.
model.fit(X, y)
intercept = model.intercept_
slope = model.coef_[0]
print(f"bestfit intercept = {intercept}")
print(f"bestfit slope = {slope}")
Note: In this case, our model is the same as ordinary least squares, and no actual optimization is performed, since an exact solution exists.
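For reference, the exact solution is given by the normal equation, $\hat{\boldsymbol{\theta}} = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y}$ (assuming a column of ones is included in $\mathbf{X}$ for the intercept), so no iterative optimization is needed.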
The model object also provides a score() function to evaluate the fit. Note: you must call the fit() function before calling the score() function.
Rsq = model.score(X, y)
Rsq
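The score returned here is the coefficient of determination, $R^2 = 1 - \sum_i (y_i - \hat{y}_i)^2 \,/\, \sum_i (y_i - \bar{y})^2$, where $\hat{y}_i$ are the predicted values and $\bar{y}$ is the mean of the observed values. A value of 1 indicates a perfect fit.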
Use the predict() function to predict new values.
# The values we want to predict (ranging from our min to max GDP per capita)
gdp_pred = np.linspace(1e3, 1.1e5, 100)
# Sklearn needs the second axis!
X_pred = gdp_pred[:, np.newaxis]
y_pred = model.predict(X_pred)
with plt.style.context("fivethirtyeight"):
fig, ax = plt.subplots(figsize=(10, 6))
# Plot the predicted values
ax.plot(X_pred, y_pred, label="Predicted values", color="#666666")
# Training data
ax.scatter(
data["gdp_per_capita"],
data["life_satisfaction"],
label="Training data",
s=100,
zorder=10,
color="#f40000",
)
ax.legend()
ax.set_xlabel("GDP Per Capita")
ax.set_ylabel("Life Satisfaction")
Scikit learn provides a utility function to split our input data:
from sklearn.model_selection import train_test_split
# I'll use a 70/30% split
train_set, test_set = train_test_split(data, test_size=0.3, random_state=42)
These are new DataFrame objects, with lengths determined by the split percentage:
print("size of full dataset = ", len(data))
print("size of training dataset = ", len(train_set))
print("size of test dataset = ", len(test_set))
Now, make our feature and label arrays:
# Features
X_train = train_set['gdp_per_capita'].values[:, np.newaxis]
X_test = test_set['gdp_per_capita'].values[:, np.newaxis]
# Labels
y_train = train_set['life_satisfaction'].values
y_test = test_set['life_satisfaction'].values
Use the StandardScaler to scale the GDP per capita:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
# Scale the training features
X_train_scaled = scaler.fit_transform(X_train)
# Scale the test features
# NOTE: use transform() here (not fit_transform), so the test data is scaled
# using the parameters learned from the training set
X_test_scaled = scaler.transform(X_test)
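As an optional sanity check, the scaled training features should have roughly zero mean and unit standard deviation, since the scaler was fit on the training set:
print("means = ", X_train_scaled.mean(axis=0))
print("std devs = ", X_train_scaled.std(axis=0))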
Now, let's fit on the training set and evaluate on the test set
model.fit(X_train_scaled, y_train)
model.score(X_test_scaled, y_test)
Unsurprisingly, our fit gets worse when we test on unseen data. Our accuracy was artificially inflated the first time, since we trained and tested on the same data.
We'll use scikit-learn's PolynomialFeatures to add new polynomial features derived from the GDP per capita.
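As a quick illustration (with a made-up toy feature, not part of the analysis), a degree-3 expansion of a single column produces the bias term, $x$, $x^2$, and $x^3$:
from sklearn.preprocessing import PolynomialFeatures
import numpy as np

x_toy = np.array([[2.0], [3.0]])  # one feature, two observations
PolynomialFeatures(degree=3).fit_transform(x_toy)
# array([[ 1.,  2.,  4.,  8.],
#        [ 1.,  3.,  9., 27.]])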
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(degree=3)
# Training
X_train_scaled_poly = poly.fit_transform(scaler.fit_transform(X_train))
# Test
# NOTE: transform only, re-using the scaler and poly fit on the training set
X_test_scaled_poly = poly.transform(scaler.transform(X_test))
model.fit(X_train_scaled_poly, y_train)
model.score(X_test_scaled_poly, y_test)
The accuracy improved!
We can turn our preprocessing steps into a Pipeline object using the make_pipeline() function.
from sklearn.pipeline import make_pipeline
pipe = make_pipeline(StandardScaler(), PolynomialFeatures(degree=3))
pipe
Individual steps can be accessed via their names in a dict-like fashion:
# Step 1
pipe['standardscaler']
# Step 2
pipe['polynomialfeatures']
Let's apply this pipeline to our predicted GDP values for our plot:
# Fit the pipeline on the training features, then transform the prediction grid
pipe.fit(X_train)
y_pred = model.predict(pipe.transform(X_pred))
with plt.style.context("fivethirtyeight"):
fig, ax = plt.subplots(figsize=(10, 6))
# Plot the predicted values
    y_pred = model.predict(pipe.transform(X_pred))
ax.plot(X_pred, y_pred, label="Predicted values", color="#666666")
# Training data
ax.scatter(
data["gdp_per_capita"],
data["life_satisfaction"],
label="Training data",
s=100,
zorder=10,
color="#f40000",
)
ax.legend()
ax.set_xlabel("GDP Per Capita")
ax.set_ylabel("Life Satisfaction")
The additional polynomial features introduced some curvature and improved the fit!
with plt.style.context("fivethirtyeight"):
fig, ax = plt.subplots(figsize=(10, 6))
# Training data
ax.scatter(
data["gdp_per_capita"],
data["life_satisfaction"],
label="Training data",
s=100,
zorder=10,
color="#666666",
)
# Plot the predicted values
for degree in [3, 5, 10]:
print(f"degree = {degree}")
        # Create our pipeline
p = make_pipeline(StandardScaler(), PolynomialFeatures(degree=degree))
# Fit the model on the training set
model.fit(p.fit_transform(X_train), y_train)
# Evaluate on the training set
training_score = model.score(p.fit_transform(X_train), y_train)
print(f"Training Score = {training_score}")
# Evaluate on the test set
        test_score = model.score(p.transform(X_test), y_test)
print(f"Test Score = {test_score}")
# Plot
        y_pred = model.predict(p.transform(X_pred))
ax.plot(X_pred, y_pred, label=f"Degree = {degree}")
print()
ax.legend(ncol=2, loc=0)
ax.set_ylim(4, 9)
ax.set_xlabel("GDP Per Capita")
ax.set_ylabel("Life Satisfaction")
As we increase the polynomial degree, two things happen: the training score keeps improving, while the test score gets worse. This is the classic case of overfitting: our model does not generalize well at all.
Ridge adds regularization to the linear regression least squares model. Remember, regularization penalizes large parameter values and complex fits.
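For reference, Ridge regression minimizes the usual sum of squared errors plus an L2 penalty on the coefficients, $J(\boldsymbol{\theta}) = \sum_i (y_i - \hat{y}_i)^2 + \alpha \sum_j \theta_j^2$. Larger values of the alpha hyperparameter mean stronger regularization, and alpha = 0 recovers the ordinary least squares fit.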
from sklearn.linear_model import Ridge
Let's gain some intuition:
# Create a pre-processing pipeline
pipe = make_pipeline(StandardScaler(), PolynomialFeatures(degree=3))
# Setup and fit a linear model (with scaled features)
linear = LinearRegression()
scaler = StandardScaler()
linear.fit(scaler.fit_transform(X_train), y_train)
with plt.style.context("fivethirtyeight"):
fig, ax = plt.subplots(figsize=(10, 6))
# The data
ax.scatter(
data["gdp_per_capita"],
data["life_satisfaction"],
label="Training data",
s=100,
zorder=10,
color="#666666",
)
## Linear fit results
print("Linear fit")
training_score = linear.score(scaler.fit_transform(X_train), y_train)
print(f"Training Score = {training_score}")
    test_score = linear.score(scaler.transform(X_test), y_test)
print(f"Test Score = {test_score}")
print()
# The linear fit
ax.plot(
X_pred,
        linear.predict(scaler.transform(X_pred)),
color="k",
label="Linear fit",
)
# Plot the predicted values for each alpha
for alpha in [0, 10, 100, 1e5]:
print(f"alpha = {alpha}")
        # Create our Ridge model with this alpha
ridge = Ridge(alpha=alpha)
# Fit the model on the training set
ridge.fit(pipe.fit_transform(X_train), y_train)
# Evaluate on the training set
training_score = ridge.score(pipe.fit_transform(X_train), y_train)
print(f"Training Score = {training_score}")
# Evaluate on the test set
        test_score = ridge.score(pipe.transform(X_test), y_test)
print(f"Test Score = {test_score}")
# Plot
        y_pred = ridge.predict(pipe.transform(X_pred))
ax.plot(X_pred, y_pred, label=f"alpha = {alpha}")
print()
ax.legend(ncol=2, loc=0)
ax.set_ylim(4, 9)
ax.set_xlabel("GDP Per Capita")
ax.set_ylabel("Life Satisfaction")
Note: with alpha = 0, the Ridge model is identical to the LinearRegression() fit with the polynomial features.
More feature engineering!
In this case, I've done the hard work for you and pulled additional country properties from the OECD's website.
data2 = pd.read_csv("./data/gdp_vs_satisfaction_more_features.csv")
data2.head()
We'll move beyond simple linear regression and see if we can get a better fit.
A decision tree learns decision rules from the input features:
For a specific corner of the input feature space, the decision tree predicts an output target value
Decision trees can be very deep with many nodes — this will lead to overfitting your dataset!
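As a minimal sketch (not part of the original analysis, and re-using the GDP-only X_train/X_test split from above), limiting max_depth is one simple way to control how deep a tree grows and reduce overfitting:
from sklearn.tree import DecisionTreeRegressor

for depth in [None, 2]:
    tree = DecisionTreeRegressor(max_depth=depth, random_state=42)
    tree.fit(X_train, y_train)
    print(f"max_depth={depth}: train R^2 = {tree.score(X_train, y_train):.2f}, "
          f"test R^2 = {tree.score(X_test, y_test):.2f}")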
This is an example of ensemble learning: combining multiple estimators to achieve better overall accuracy than any one model could achieve
from sklearn.ensemble import RandomForestRegressor
Let's split our data into training and test sets again:
# Split the data 70/30
train_set, test_set = train_test_split(data2, test_size=0.3, random_state=42)
# the target labels
y_train = train_set["life_satisfaction"].values
y_test = test_set["life_satisfaction"].values
# The features
feature_cols = [col for col in data2.columns if col not in ["life_satisfaction", "Country"]]
X_train = train_set[feature_cols].values
X_test = test_set[feature_cols].values
import seaborn as sns
sns.heatmap(
train_set[feature_cols].corr(), cmap="coolwarm", annot=True, vmin=-1, vmax=1
)
Establish a baseline with a linear model:
p = make_pipeline(StandardScaler())
# Fit a linear model
print("Linear regression")
linear = LinearRegression()
linear.fit(p.fit_transform(X_train), y_train)
# Print the training score
training_score = linear.score(p.fit_transform(X_train), y_train)
print(f"Training Score = {training_score}")
# Print the test score
test_score = linear.score(p.transform(X_test), y_test)
print(f"Test Score = {test_score}")
# Fit a random forest
print("Random forest")
forest = RandomForestRegressor(n_estimators=100, max_depth=2)
forest.fit(p.fit_transform(X_train), y_train)
# Print the training score
training_score = forest.score(p.fit_transform(X_train), y_train)
print(f"Training Score = {training_score}")
# Print the test score
test_score = forest.score(p.transform(X_test), y_test)
print(f"Test Score = {test_score}")
Because random forests are an ensemble method with multiple estimators, the algorithm can learn which features help improve the fit the most.
The feature importances are stored in the feature_importances_ attribute (available after calling fit()).
# Make the dataframe
importance = pd.DataFrame(
{"Feature": feature_cols, "Importance": forest.feature_importances_}
)
# Plot
importance.hvplot.barh(
x="Feature", y="Importance", title="Does Money Make You Happier?"
)
The cross_val_score function will automatically partition the training set into k folds, fit the model to each training subset, and return the score for each partition.
from sklearn.model_selection import cross_val_score
cross_val_score?
# Make a linear pipeline
linear = make_pipeline(StandardScaler(), LinearRegression())
# Run the 3-fold cross validation
scores = cross_val_score(
linear,
X_train,
y_train,
cv=3,
)
# Report
print("R^2 scores = ", scores)
print("Scores mean = ", scores.mean())
print("Score std dev = ", scores.std())
# Make a random forest pipeline
forest = make_pipeline(
StandardScaler(), RandomForestRegressor(n_estimators=100, random_state=42)
)
# Run the 3-fold cross validation
scores = cross_val_score(
forest,
X_train,
y_train,
cv=3,
)
# Report
print("R^2 scores = ", scores)
print("Scores mean = ", scores.mean())
print("Score std dev = ", scores.std())
n_estimators is a model hyperparameter: it is set by the user, rather than learned from the data (via the fit() method). This is where cross-validation becomes very important.
from sklearn.model_selection import GridSearchCV
Let's do a search over the n_estimators parameter and the max_depth parameter:
# Create our regression pipeline
pipe = make_pipeline(StandardScaler(), RandomForestRegressor(random_state=42))
pipe
# Make the grid of parameters to search
# NOTE: you must prepend the name of the pipeline step, followed by two underscores
model_name = "randomforestregressor"
param_grid = {
f"{model_name}__n_estimators": [5, 10, 15, 20, 30, 50, 100],
f"{model_name}__max_depth": [2, 5, 7, 9, 13, 21, 33, 51],
}
param_grid
# Create the grid and use 3-fold CV
grid = GridSearchCV(pipe, param_grid, cv=3)
# Run the search
grid.fit(X_train, y_train)
# The best estimator
grid.best_estimator_
# The best hyper parameters
grid.best_params_
We'll define a helper utility function to calculate the accuracy in terms of the mean absolute percent error
def evaluate(model, X_test, y_test):
"""
Given a model and test features/targets, print out the
mean absolute error and accuracy
"""
# Make the predictions
predictions = model.predict(X_test)
# Absolute error
errors = abs(predictions - y_test)
avg_error = np.mean(errors)
# Mean absolute percentage error
mape = 100 * np.mean(errors / y_test)
# Accuracy
accuracy = 100 - mape
print("Model Performance")
print(f"Average Absolute Error: {avg_error:0.4f}")
print(f"Accuracy = {accuracy:0.2f}%.")
return accuracy
# Setup the pipeline
linear = make_pipeline(StandardScaler(), LinearRegression())
# Fit on train set
linear.fit(X_train, y_train)
# Evaluate on test set
evaluate(linear, X_test, y_test)
# Initialize the pipeline
base_model = make_pipeline(StandardScaler(), RandomForestRegressor(random_state=42))
# Fit the training set
base_model.fit(X_train, y_train)
# Evaluate on the test set
base_accuracy = evaluate(base_model, X_test, y_test)
# Evaluate the best random forest model
best_random = grid.best_estimator_
random_accuracy = evaluate(best_random, X_test, y_test)
# What's the improvement?
improvement = 100 * (random_accuracy - base_accuracy) / base_accuracy
print(f'Improvement of {improvement:0.2f}%.')
Recap: use cross_val_score for cross-validation and GridSearchCV to optimize hyperparameters.
In this part, we'll use a random forest model and housing data from the Office of Property Assessment to predict residential sale prices in Philadelphia
Note: We'll focus on the first two components in this analysis (and in assignment #7)
Too often, these models perpetuate inequality: low-value homes are over-assessed and high-value homes are under-assessed
Let's download data for properties in Philadelphia that had their last sale during 2018.
Sources:
import carto2gpd
# the CARTO API url
carto_url = "https://phl.carto.com/api/v2/sql"
# The table name
table_name = "opa_properties_public"
# Only pull 2018 sales for single family residential properties
where = "sale_date >= '2018-01-01' and sale_date < '2019-01-01' and category_code_description = 'Single Family'"
# Run the query
salesRaw = carto2gpd.get(carto_url, table_name, where=where)
salesRaw.head()
len(salesRaw)
import missingno as msno
# We'll visualize the first half of columns and then the second half
ncol = len(salesRaw.columns)
fields1 = salesRaw.columns[:ncol//2]
fields2 = salesRaw.columns[ncol//2:]
# The first half of columns
msno.bar(salesRaw[fields1]);
# The second half of columns
msno.bar(salesRaw[fields2]);
# The feature columns we want to use
cols = [
"sale_price",
"total_livable_area",
"total_area",
"garage_spaces",
"fireplaces",
"number_of_bathrooms",
"number_of_bedrooms",
"number_stories",
"exterior_condition",
"zip_code",
]
# Trim to these columns and remove NaNs
sales = salesRaw[cols].dropna()
# Trim zip code to only the first five digits
sales['zip_code'] = sales['zip_code'].astype(str).str.slice(0, 5)
# Trim very low and very high sales
valid = (sales['sale_price'] > 3000) & (sales['sale_price'] < 1e6)
sales = sales.loc[valid]
len(sales)
# Split the data 70/30
train_set, test_set = train_test_split(sales, test_size=0.3, random_state=42)
# the target labels
y_train = np.log(train_set["sale_price"])
y_test = np.log(test_set["sale_price"])
# The features
feature_cols = [
"total_livable_area",
"total_area",
"garage_spaces",
"fireplaces",
"number_of_bathrooms",
"number_of_bedrooms",
"number_stories",
]
X_train = train_set[feature_cols].values
X_test = test_set[feature_cols].values
# Make a random forest pipeline
forest = make_pipeline(
StandardScaler(), RandomForestRegressor(n_estimators=100, random_state=42)
)
# Run the 10-fold cross validation
scores = cross_val_score(
forest,
X_train,
y_train,
cv=10,
)
# Report
print("R^2 scores = ", scores)
print("Scores mean = ", scores.mean())
print("Score std dev = ", scores.std())
# Fit on the training data
forest.fit(X_train, y_train)
# What's the test score?
forest.score(X_test, y_test)
The model appears to generalize reasonably well! (Although we should also optimize the hyperparameters to see if we can find additional improvements.)
# Extract the regressor from the pipeline
regressor = forest["randomforestregressor"]
# Create the data frame of importances
importance = pd.DataFrame(
{"Feature": feature_cols, "Importance": regressor.feature_importances_}
)
importance.hvplot.barh(x="Feature", y="Importance")
The OneHotEncoder object is a preprocessor that will perform the vectorization step.
The ColumnTransformer object will help us apply different transformers to numerical and categorical columns.
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
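As a quick illustration (with a made-up toy column, not part of the analysis), the encoder turns each category into its own 0/1 column:
toy = pd.DataFrame({"zip_code": ["19104", "19106", "19104"]})
ohe_toy = OneHotEncoder()
ohe_toy.fit_transform(toy).toarray()
# array([[1., 0.],
#        [0., 1.],
#        [1., 0.]])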
# Numerical columns
num_cols = [
"total_livable_area",
"total_area",
"garage_spaces",
"fireplaces",
"number_of_bathrooms",
"number_of_bedrooms",
"number_stories",
]
# Categorical columns
cat_cols = ["exterior_condition", "zip_code"]
OneHotEncoder?
# Set up the column transformer with two transformers
# Scale the numerical columns and one-hot
preprocessor = ColumnTransformer(
transformers=[
("num", StandardScaler(), num_cols),
("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
]
)
Note: the handle_unknown='ignore' parameter ensures that if categories show up in our test set that were not present in our training set, no error will be raised.
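A minimal illustration (with made-up categories) of what handle_unknown='ignore' does: an unseen category simply encodes to all zeros at transform time instead of raising an error.
enc = OneHotEncoder(handle_unknown="ignore")
enc.fit([["A"], ["B"]])            # the "training" categories are A and B
enc.transform([["C"]]).toarray()   # "C" was never seen --> array([[0., 0.]])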
Initialize the pipeline object, using the column transformer and the random forest regressor
# Initialize the pipeline
# NOTE: only use 10 estimators here so it will run in a reasonable time
regr = make_pipeline(
preprocessor, RandomForestRegressor(n_estimators=10, random_state=42)
)
Now, let's fit the model.
Important:
You must pass in the full training set DataFrame. We told scikit-learn which columns to extract by name in the ColumnTransformer, so it needs the DataFrame with named columns.
# Fit the training set
regr.fit(train_set, y_train);
# What's the test score?
regr.score(test_set, y_test)
Takeaway: neighborhood based effects play a crucial role in determining housing prices.
Note: to fully validate the model we should run $k$-fold cross validation and optimize hyperparameters of the model as well...
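For example, a sketch of how that might look with the same tools used above (not run here, since it can be slow):
# Cross-validate the full pipeline on the training set
scores = cross_val_score(regr, train_set, y_train, cv=5)
print("R^2 scores = ", scores)

# Or search over hyperparameters of the random forest step
param_grid = {"randomforestregressor__n_estimators": [10, 50, 100]}
grid = GridSearchCV(regr, param_grid, cv=5)
grid.fit(train_set, y_train)
grid.best_params_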
But first, we need to know the column names! The one-hot encoder created a column for each category type...
# The column transformer...
preprocessor
# The steps in the column transformer
preprocessor.named_transformers_
# The one-hot step
ohe = preprocessor.named_transformers_['cat']
ohe
# One column for each category type!
ohe_cols = ohe.get_feature_names()
ohe_cols
# Full list of columns is numerical + one-hot
features = num_cols + list(ohe_cols)
features
regressor = regr["randomforestregressor"]
# Create the dataframe with importances
importance = pd.DataFrame(
{"Feature": features, "Importance": regressor.feature_importances_}
)
# Sort by importance and get the top 30
# SORT IN DESCENDING ORDER
importance = importance.sort_values("Importance", ascending=False).iloc[:30]
# Plot
importance.hvplot.barh(
x="Feature", y="Importance", height=700, flip_yaxis=True
)
There is certainly much more that can be done to try to improve the accuracy of the model on the test set. Some things to try/explore: