from matplotlib import pyplot as plt
import numpy as np
import pandas as pd
import geopandas as gpd
np.random.seed(42)
%matplotlib inline
plt.rcParams['figure.figsize'] = (10,6)
Nov 14, 2019
Today, we'll walk through an end-to-end regression example to predict Philadelphia's housing prices
Given your training set of data, which model parameters best represent the observed data?
The cost function measures the difference between the model's predictions and the observed data
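For example, a common choice of cost function for regression is the mean squared error (MSE). A minimal sketch (with hypothetical y_true/y_pred arrays) looks like:
import numpy as np

def mean_squared_error_cost(y_true, y_pred):
    # Average of the squared differences between predictions and observations
    return np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)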
In scikit-learn, you will call the fit() method on your algorithm.
Key goal: how can we do this in a way that ensures the model is as generalizable as possible?
Or: "garbage in, garbage out"
Common issues:
Common to use 80% of data for your training set and 20% for your test set
For more information, see the scikit-learn docs
We'll load data compiled from two data sources:
data = pd.read_csv("./data/gdp_vs_satisfaction.csv")
data.head()
import hvplot.pandas
data.hvplot.scatter(
x="gdp_per_capita",
y="life_satisfaction",
hover_cols=["Country"],
ylim=(4, 9),
xlim=(1e3, 1.1e5),
)
A simple model with only two parameters: $\theta_1$ and $\theta_2$
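Written out, the linear model with these two parameters is $\text{life satisfaction} = \theta_1 + \theta_2 \times \text{GDP per capita}$, where $\theta_1$ plays the role of the intercept and $\theta_2$ the slope.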
Use the LinearRegression model object from scikit-learn.
This is not really machine learning; it simply finds the ordinary least squares (OLS) fit to the data.
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model
# our input features (in this case we only have 1)
X = data['gdp_per_capita'].values[:, np.newaxis]
# the labels (values we are trying to predict)
y = data['life_satisfaction'].values
Note: scikit-learn expects the features to be a 2D array with shape (number of observations, number of features). We explicitly add the second axis with np.newaxis.
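As a quick (purely illustrative) check, you can compare the shapes:
print(data['gdp_per_capita'].values.shape)  # 1D: (number of observations,)
print(X.shape)                              # 2D: (number of observations, 1)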
Now, fit the model using the model.fit(X, y) syntax.
This will "train" our model, using an optimization algorithm to identify the bestfit parameters.
model.fit(X, y)
intercept = model.intercept_
slope = model.coef_[0]
print(f"bestfit intercept = {intercept}")
print(f"bestfit slope = {slope}")
Note: In this case, our model is the same as ordinary least squares, and no actual optimization is performed, since an exact solution exists.
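For reference, the exact solution is given by the normal equation, $\hat{\boldsymbol{\theta}} = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y}$ (assuming a column of ones is included in $\mathbf{X}$ for the intercept), so no iterative optimization is needed.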
The model object also provides a score() function to evaluate the fit. Note: you must call the fit() function before calling the score() function.
Rsq = model.score(X, y)
Rsq
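The score returned here is the coefficient of determination, $R^2 = 1 - \sum_i (y_i - \hat{y}_i)^2 \,/\, \sum_i (y_i - \bar{y})^2$, where $\hat{y}_i$ are the predicted values and $\bar{y}$ is the mean of the observed values. A value of 1 indicates a perfect fit.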
Use the predict() function to predict new values.
# The values we want to predict (ranging from our min to max GDP per capita)
gdp_pred = np.linspace(1e3, 1.1e5, 100)
# Sklearn needs the second axis!
X_pred = gdp_pred[:, np.newaxis]
y_pred = model.predict(X_pred)
with plt.style.context("fivethirtyeight"):
fig, ax = plt.subplots(figsize=(10, 6))
# Plot the predicted values
ax.plot(X_pred, y_pred, label="Predicted values", color="#666666")
# Training data
ax.scatter(
data["gdp_per_capita"],
data["life_satisfaction"],
label="Training data",
s=100,
zorder=10,
color="#f40000",
)
ax.legend()
ax.set_xlabel("GDP Per Capita")
ax.set_ylabel("Life Satisfaction")
Scikit learn provides a utility function to split our input data:
from sklearn.model_selection import train_test_split
# I'll use a 70/30% split
train_set, test_set = train_test_split(data, test_size=0.3, random_state=42)
These are new DataFrame objects, with lengths determined by the split percentage:
print("size of full dataset = ", len(data))
print("size of training dataset = ", len(train_set))
print("size of test dataset = ", len(test_set))
Now, make our feature and label arrays:
# Features
X_train = train_set['gdp_per_capita'].values[:, np.newaxis]
X_test = test_set['gdp_per_capita'].values[:, np.newaxis]
# Labels
y_train = train_set['life_satisfaction'].values
y_test = test_set['life_satisfaction'].values
Use the StandardScaler to scale the GDP per capita:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
# Scale the training features
X_train_scaled = scaler.fit_transform(X_train)
# Scale the test features
# NOTE: use transform() here (not fit_transform), so the test data is scaled
# using the parameters learned from the training set
X_test_scaled = scaler.transform(X_test)
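As an optional sanity check, the scaled training features should have roughly zero mean and unit standard deviation, since the scaler was fit on the training set:
print("means = ", X_train_scaled.mean(axis=0))
print("std devs = ", X_train_scaled.std(axis=0))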
Now, let's fit on the training set and evaluate on the test set
model.fit(X_train_scaled, y_train)
model.score(X_test_scaled, y_test)
Unsurprisingly, our fit gets worse when we test on unseen data. Our accuracy was artificially inflated the first time, since we trained and tested on the same data.
We'll use scikit-learn's PolynomialFeatures to add new polynomial features derived from the GDP per capita.
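As a quick illustration (with a made-up toy feature, not part of the analysis), a degree-3 expansion of a single column produces the bias term, $x$, $x^2$, and $x^3$:
from sklearn.preprocessing import PolynomialFeatures
import numpy as np

x_toy = np.array([[2.0], [3.0]])  # one feature, two observations
PolynomialFeatures(degree=3).fit_transform(x_toy)
# array([[ 1.,  2.,  4.,  8.],
#        [ 1.,  3.,  9., 27.]])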
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(degree=3)
# Training
X_train_scaled_poly = poly.fit_transform(scaler.fit_transform(X_train))
# Test
# NOTE: transform only, re-using the scaler and poly fit on the training set
X_test_scaled_poly = poly.transform(scaler.transform(X_test))
model.fit(X_train_scaled_poly, y_train)
model.score(X_test_scaled_poly, y_test)
The accuracy improved!
We can turn our preprocessing steps into a Pipeline object using the make_pipeline() function.
from sklearn.pipeline import make_pipeline
pipe = make_pipeline(StandardScaler(), PolynomialFeatures(degree=3))
pipe
Individual steps can be accessed via their names in a dict-like fashion:
# Step 1
pipe['standardscaler']
# Step 2
pipe['polynomialfeatures']
Let's apply this pipeline to our predicted GDP values for our plot:
# Fit the pipeline on the training features, then transform the prediction grid
pipe.fit(X_train)
y_pred = model.predict(pipe.transform(X_pred))
with plt.style.context("fivethirtyeight"):
fig, ax = plt.subplots(figsize=(10, 6))
# Plot the predicted values
    y_pred = model.predict(pipe.transform(X_pred))
ax.plot(X_pred, y_pred, label="Predicted values", color="#666666")
# Training data
ax.scatter(
data["gdp_per_capita"],
data["life_satisfaction"],
label="Training data",
s=100,
zorder=10,
color="#f40000",
)
ax.legend()
ax.set_xlabel("GDP Per Capita")
ax.set_ylabel("Life Satisfaction")
The additional polynomial features introduced some curvature and improved the fit!
with plt.style.context("fivethirtyeight"):
fig, ax = plt.subplots(figsize=(10, 6))
# Training data
ax.scatter(
data["gdp_per_capita"],
data["life_satisfaction"],
label="Training data",
s=100,
zorder=10,
color="#666666",
)
# Plot the predicted values
for degree in [3, 5, 10]:
print(f"degree = {degree}")
        # Create our pipeline
p = make_pipeline(StandardScaler(), PolynomialFeatures(degree=degree))
# Fit the model on the training set
model.fit(p.fit_transform(X_train), y_train)
# Evaluate on the training set
training_score = model.score(p.fit_transform(X_train), y_train)
print(f"Training Score = {training_score}")
# Evaluate on the test set
        test_score = model.score(p.transform(X_test), y_test)
print(f"Test Score = {test_score}")
# Plot
        y_pred = model.predict(p.transform(X_pred))
ax.plot(X_pred, y_pred, label=f"Degree = {degree}")
print()
ax.legend(ncol=2, loc=0)
ax.set_ylim(4, 9)
ax.set_xlabel("GDP Per Capita")
ax.set_ylabel("Life Satisfaction")
As we increase the polynomial degree, two things happen: the training score keeps improving, while the test score gets worse. This is the classic case of overfitting: our model does not generalize well at all.
Ridge adds regularization to the linear regression least squares model. Remember, regularization penalizes large parameter values and complex fits.
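For reference, Ridge regression minimizes the usual sum of squared errors plus an L2 penalty on the coefficients, $J(\boldsymbol{\theta}) = \sum_i (y_i - \hat{y}_i)^2 + \alpha \sum_j \theta_j^2$. Larger values of the alpha hyperparameter mean stronger regularization, and alpha = 0 recovers the ordinary least squares fit.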
from sklearn.linear_model import Ridge
Let's gain some intuition:
# Create a pre-processing pipeline
pipe = make_pipeline(StandardScaler(), PolynomialFeatures(degree=3))
# Setup and fit a linear model (with scaled features)
linear = LinearRegression()
scaler = StandardScaler()
linear.fit(scaler.fit_transform(X_train), y_train)
with plt.style.context("fivethirtyeight"):
fig, ax = plt.subplots(figsize=(10, 6))
# The data
ax.scatter(
data["gdp_per_capita"],
data["life_satisfaction"],
label="Training data",
s=100,
zorder=10,
color="#666666",
)
## Linear fit results
print("Linear fit")
training_score = linear.score(scaler.fit_transform(X_train), y_train)
print(f"Training Score = {training_score}")
    test_score = linear.score(scaler.transform(X_test), y_test)
print(f"Test Score = {test_score}")
print()
# The linear fit
ax.plot(
X_pred,
        linear.predict(scaler.transform(X_pred)),
color="k",
label="Linear fit",
)
# Plot the predicted values for each alpha
for alpha in [0, 10, 100, 1e5]:
print(f"alpha = {alpha}")
        # Create our Ridge model with this alpha
ridge = Ridge(alpha=alpha)
# Fit the model on the training set
ridge.fit(pipe.fit_transform(X_train), y_train)
# Evaluate on the training set
training_score = ridge.score(pipe.fit_transform(X_train), y_train)
print(f"Training Score = {training_score}")
# Evaluate on the test set
        test_score = ridge.score(pipe.transform(X_test), y_test)
print(f"Test Score = {test_score}")
# Plot
        y_pred = ridge.predict(pipe.transform(X_pred))
ax.plot(X_pred, y_pred, label=f"alpha = {alpha}")
print()
ax.legend(ncol=2, loc=0)
ax.set_ylim(4, 9)
ax.set_xlabel("GDP Per Capita")
ax.set_ylabel("Life Satisfaction")
Note: with alpha = 0, the Ridge model is identical to the LinearRegression() fit with the polynomial features.
More feature engineering!
In this case, I've done the hard work for you and pulled additional country properties from the OECD's website.
data2 = pd.read_csv("./data/gdp_vs_satisfaction_more_features.csv")
data2.head()
We'll move beyond simple linear regression and see if we can get a better fit.
A decision tree learns decision rules from the input features:
For a specific corner of the input feature space, the decision tree predicts an output target value
Decision trees can be very deep with many nodes — this will lead to overfitting your dataset!
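As a minimal sketch (not part of the original analysis, and re-using the GDP-only X_train/X_test split from above), limiting max_depth is one simple way to control how deep a tree grows and reduce overfitting:
from sklearn.tree import DecisionTreeRegressor

for depth in [None, 2]:
    tree = DecisionTreeRegressor(max_depth=depth, random_state=42)
    tree.fit(X_train, y_train)
    print(f"max_depth={depth}: train R^2 = {tree.score(X_train, y_train):.2f}, "
          f"test R^2 = {tree.score(X_test, y_test):.2f}")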
This is an example of ensemble learning: combining multiple estimators to achieve better overall accuracy than any one model could achieve
from sklearn.ensemble import RandomForestRegressor
Let's split our data into training and test sets again:
# Split the data 70/30
train_set, test_set = train_test_split(data2, test_size=0.3, random_state=42)
# the target labels
y_train = train_set["life_satisfaction"].values
y_test = test_set["life_satisfaction"].values
# The features
feature_cols = [col for col in data2.columns if col not in ["life_satisfaction", "Country"]]
X_train = train_set[feature_cols].values
X_test = test_set[feature_cols].values
import seaborn as sns
sns.heatmap(
train_set[feature_cols].corr(), cmap="coolwarm", annot=True, vmin=-1, vmax=1
)
Establish a baseline with a linear model:
p = make_pipeline(StandardScaler())
# Fit a linear model
print("Linear regression")
linear = LinearRegression()
linear.fit(p.fit_transform(X_train), y_train)
# Print the training score
training_score = linear.score(p.fit_transform(X_train), y_train)
print(f"Training Score = {training_score}")
# Print the test score
test_score = linear.score(p.transform(X_test), y_test)
print(f"Test Score = {test_score}")
# Fit a random forest
print("Random forest")
forest = RandomForestRegressor(n_estimators=100, max_depth=2)
forest.fit(p.fit_transform(X_train), y_train)
# Print the training score
training_score = forest.score(p.fit_transform(X_train), y_train)
print(f"Training Score = {training_score}")
# Print the test score
test_score = forest.score(p.transform(X_test), y_test)
print(f"Test Score = {test_score}")
Because random forests are an ensemble method with multiple estimators, the algorithm can learn which features help improve the fit the most.
The feature importances are stored in the feature_importances_ attribute (available after calling fit()).
# Make the dataframe
importance = pd.DataFrame(
{"Feature": feature_cols, "Importance": forest.feature_importances_}
)
# Plot
importance.hvplot.barh(
x="Feature", y="Importance", title="Does Money Make You Happier?"
)
The cross_val_score function will automatically partition the training set into k folds, fit the model to each training subset, and return the score for each partition.
from sklearn.model_selection import cross_val_score
cross_val_score?
# Make a linear pipeline
linear = make_pipeline(StandardScaler(), LinearRegression())
# Run the 3-fold cross validation
scores = cross_val_score(
linear,
X_train,
y_train,
cv=3,
)
# Report
print("R^2 scores = ", scores)
print("Scores mean = ", scores.mean())
print("Score std dev = ", scores.std())
# Make a random forest pipeline
forest = make_pipeline(
StandardScaler(), RandomForestRegressor(n_estimators=100, random_state=42)
)
# Run the 3-fold cross validation
scores = cross_val_score(
forest,
X_train,
y_train,
cv=3,
)
# Report
print("R^2 scores = ", scores)
print("Scores mean = ", scores.mean())
print("Score std dev = ", scores.std())
n_estimators is a model hyperparameter: it is set by the user, rather than learned from the data (via the fit() method). This is where cross-validation becomes very important.
from sklearn.model_selection import GridSearchCV
Let's do a search over the n_estimators parameter and the max_depth parameter:
# Create our regression pipeline
pipe = make_pipeline(StandardScaler(), RandomForestRegressor(random_state=42))
pipe
# Make the grid of parameters to search
# NOTE: you must prepend the name of the pipeline step, followed by two underscores
model_name = "randomforestregressor"
param_grid = {
f"{model_name}__n_estimators": [5, 10, 15, 20, 30, 50, 100],
f"{model_name}__max_depth": [2, 5, 7, 9, 13, 21, 33, 51],
}
param_grid
# Create the grid and use 3-fold CV
grid = GridSearchCV(pipe, param_grid, cv=3)
# Run the search
grid.fit(X_train, y_train)
# The best estimator
grid.best_estimator_
# The best hyper parameters
grid.best_params_
We'll define a helper utility function to calculate the accuracy in terms of the mean absolute percent error
def evaluate(model, X_test, y_test):
"""
Given a model and test features/targets, print out the
mean absolute error and accuracy
"""
# Make the predictions
predictions = model.predict(X_test)
# Absolute error
errors = abs(predictions - y_test)
avg_error = np.mean(errors)
# Mean absolute percentage error
mape = 100 * np.mean(errors / y_test)
# Accuracy
accuracy = 100 - mape
print("Model Performance")
print(f"Average Absolute Error: {avg_error:0.4f}")
print(f"Accuracy = {accuracy:0.2f}%.")
return accuracy
# Setup the pipeline
linear = make_pipeline(StandardScaler(), LinearRegression())
# Fit on train set
linear.fit(X_train, y_train)
# Evaluate on test set
evaluate(linear, X_test, y_test)
# Initialize the pipeline
base_model = make_pipeline(StandardScaler(), RandomForestRegressor(random_state=42))
# Fit the training set
base_model.fit(X_train, y_train)
# Evaluate on the test set
base_accuracy = evaluate(base_model, X_test, y_test)
# Evaluate the best random forest model
best_random = grid.best_estimator_
random_accuracy = evaluate(best_random, X_test, y_test)
# What's the improvement?
improvement = 100 * (random_accuracy - base_accuracy) / base_accuracy
print(f'Improvement of {improvement:0.2f}%.')
Recap: use cross_val_score for cross-validation and GridSearchCV to optimize hyperparameters.
In this part, we'll use a random forest model and housing data from the Office of Property Assessment to predict residential sale prices in Philadelphia
Note: We'll focus on the first two components in this analysis (and in assignment #7)
Too often, these models perpetuate inequality: low-value homes are over-assessed and high-value homes are under-assessed
Let's download data for properties in Philadelphia that had their last sale during 2018.
Sources:
import carto2gpd
# the CARTO API url
carto_url = "https://phl.carto.com/api/v2/sql"
# The table name
table_name = "opa_properties_public"
# Only pull 2018 sales for single family residential properties
where = "sale_date >= '2018-01-01' and sale_date < '2019-01-01' and category_code_description = 'Single Family'"
# Run the query
salesRaw = carto2gpd.get(carto_url, table_name, where=where)
salesRaw.head()
len(salesRaw)
import missingno as msno
# We'll visualize the first half of columns and then the second half
ncol = len(salesRaw.columns)
fields1 = salesRaw.columns[:ncol//2]
fields2 = salesRaw.columns[ncol//2:]
# The first half of columns
msno.bar(salesRaw[fields1]);
# The second half of columns
msno.bar(salesRaw[fields2]);
# The feature columns we want to use
cols = [
"sale_price",
"total_livable_area",
"total_area",
"garage_spaces",
"fireplaces",
"number_of_bathrooms",
"number_of_bedrooms",
"number_stories",
"exterior_condition",
"zip_code",
]
# Trim to these columns and remove NaNs
sales = salesRaw[cols].dropna()
# Trim zip code to only the first five digits
sales['zip_code'] = sales['zip_code'].astype(str).str.slice(0, 5)
# Trim very low and very high sales
valid = (sales['sale_price'] > 3000) & (sales['sale_price'] < 1e6)
sales = sales.loc[valid]
len(sales)
# Split the data 70/30
train_set, test_set = train_test_split(sales, test_size=0.3, random_state=42)
# the target labels
y_train = np.log(train_set["sale_price"])
y_test = np.log(test_set["sale_price"])
# The features
feature_cols = [
"total_livable_area",
"total_area",
"garage_spaces",
"fireplaces",
"number_of_bathrooms",
"number_of_bedrooms",
"number_stories",
]
X_train = train_set[feature_cols].values
X_test = test_set[feature_cols].values
# Make a random forest pipeline
forest = make_pipeline(
StandardScaler(), RandomForestRegressor(n_estimators=100, random_state=42)
)
# Run the 10-fold cross validation
scores = cross_val_score(
forest,
X_train,
y_train,
cv=10,
)
# Report
print("R^2 scores = ", scores)
print("Scores mean = ", scores.mean())
print("Score std dev = ", scores.std())
# Fit on the training data
forest.fit(X_train, y_train)
# What's the test score?
forest.score(X_test, y_test)
The model appears to generalize reasonably well! (Although we should also optimize the hyperparameters to see if we can find additional improvements.)
# Extract the regressor from the pipeline
regressor = forest["randomforestregressor"]
# Create the data frame of importances
importance = pd.DataFrame(
{"Feature": feature_cols, "Importance": regressor.feature_importances_}
)
importance.hvplot.barh(x="Feature", y="Importance")
The OneHotEncoder object is a preprocessor that will perform the vectorization step.
The ColumnTransformer object will help us apply different transformers to numerical and categorical columns.
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
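As a quick illustration (with a made-up toy column, not part of the analysis), the encoder turns each category into its own 0/1 column:
toy = pd.DataFrame({"zip_code": ["19104", "19106", "19104"]})
ohe_toy = OneHotEncoder()
ohe_toy.fit_transform(toy).toarray()
# array([[1., 0.],
#        [0., 1.],
#        [1., 0.]])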
# Numerical columns
num_cols = [
"total_livable_area",
"total_area",
"garage_spaces",
"fireplaces",
"number_of_bathrooms",
"number_of_bedrooms",
"number_stories",
]
# Categorical columns
cat_cols = ["exterior_condition", "zip_code"]
OneHotEncoder?
# Set up the column transformer with two transformers
# Scale the numerical columns and one-hot
preprocessor = ColumnTransformer(
transformers=[
("num", StandardScaler(), num_cols),
("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
]
)
Note: the handle_unknown='ignore' parameter ensures that if categories show up in our test set that were not present in our training set, no error will be raised.
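A minimal illustration (with made-up categories) of what handle_unknown='ignore' does: an unseen category simply encodes to all zeros at transform time instead of raising an error.
enc = OneHotEncoder(handle_unknown="ignore")
enc.fit([["A"], ["B"]])            # the "training" categories are A and B
enc.transform([["C"]]).toarray()   # "C" was never seen --> array([[0., 0.]])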
Initialize the pipeline object, using the column transformer and the random forest regressor
# Initialize the pipeline
# NOTE: only use 10 estimators here so it will run in a reasonable time
regr = make_pipeline(
preprocessor, RandomForestRegressor(n_estimators=10, random_state=42)
)
Now, let's fit the model.
Important:
You must pass in the full training set DataFrame. We told scikit-learn which columns to extract by name in the ColumnTransformer, so it needs the DataFrame with named columns.
# Fit the training set
regr.fit(train_set, y_train);
# What's the test score?
regr.score(test_set, y_test)
Takeaway: neighborhood based effects play a crucial role in determining housing prices.
Note: to fully validate the model we should run $k$-fold cross validation and optimize hyperparameters of the model as well...
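For example, a sketch of how that might look with the same tools used above (not run here, since it can be slow):
# Cross-validate the full pipeline on the training set
scores = cross_val_score(regr, train_set, y_train, cv=5)
print("R^2 scores = ", scores)

# Or search over hyperparameters of the random forest step
param_grid = {"randomforestregressor__n_estimators": [10, 50, 100]}
grid = GridSearchCV(regr, param_grid, cv=5)
grid.fit(train_set, y_train)
grid.best_params_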
But first, we need to know the column names! The one-hot encoder created a column for each category type...
# The column transformer...
preprocessor
# The steps in the column transformer
preprocessor.named_transformers_
# The one-hot step
ohe = preprocessor.named_transformers_['cat']
ohe
# One column for each category type!
ohe_cols = ohe.get_feature_names()
ohe_cols
# Full list of columns is numerical + one-hot
features = num_cols + list(ohe_cols)
features
regressor = regr["randomforestregressor"]
# Create the dataframe with importances
importance = pd.DataFrame(
{"Feature": features, "Importance": regressor.feature_importances_}
)
# Sort by importance and get the top 30
# SORT IN DESCENDING ORDER
importance = importance.sort_values("Importance", ascending=False).iloc[:30]
# Plot
importance.hvplot.barh(
x="Feature", y="Importance", height=700, flip_yaxis=True
)
There is certainly much more that can be done to try to improve the accuracy of the model on the test set. Some things to try/explore: