This notebook is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
%matplotlib inline
from pathlib import Path
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, RepeatedKFold
Set a random seed for notebook reproducibility.
np.random.seed(862604190)
Source: https://www.itl.nist.gov/div898/strd/lls/data/Filip.shtml
Last class, we took a look at the Filip dataset. The model that generated this dataset was a polynomial summation, $y = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3 + \cdots$.
filip_csv_path = Path("../../data/nist/filip.csv")
filip_df = pd.read_csv(filip_csv_path)
The data looks as follows:
fig, ax = plt.subplots(dpi=90)
sns.scatterplot(x="x", y="y", data=filip_df, ax=ax);
Last time, we used the statsmodels package to fit the formula $y = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3$ to the data. I glossed over the details of how statsmodels works because, while it is a useful package to know, it is not the main one I want us to use. Instead, we will learn how to perform the same type of fit using the powerful scikit-learn package.
What is scikit-learn?
scikit-learn is the premier machine learning package for Python, similar to what caret is for the R language. If you use Python and plan to do anything with machine learning, you will be using this package, so it's worth learning how to navigate it, even for simple tools such as linear regression.
So what kinds of problems should you use it for? The scikit-learn documentation provides a nice overview:
In general, a learning problem considers a set of n samples of data and then tries to predict properties of unknown data. If each sample is more than a single number and, for instance, a multi-dimensional entry (aka multivariate data), it is said to have several attributes or features.
Learning problems fall into a few categories:
supervised learning, in which the data comes with additional attributes that we want to predict [...] This problem can be either:
classification: samples belong to two or more classes and we want to learn from already labeled data how to predict the class of unlabeled data. [...]
regression: if the desired output consists of one or more continuous variables, then the task is called regression. [...]
unsupervised learning, in which the training data consists of a set of input vectors x without any corresponding target values.
So what can scikit-learn do? A lot! Many of the steps in fitting any given model follow a similar workflow, so it has been possible to bring many models and techniques under the single umbrella of scikit-learn. A quick skim of the User Guide table of contents (linked below) gives you an idea of what's available.
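To give a sense of that shared workflow, here is a minimal sketch of my own (not taken from the scikit-learn documentation) showing that two very different regressors expose the same instantiate/fit/predict interface; the toy data and the choice of RandomForestRegressor are purely for illustration:
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

# Toy regression data: y = 2x + noise (purely for illustration)
X_toy = np.random.uniform(-1, 1, size=(50, 1))
y_toy = 2 * X_toy.ravel() + np.random.normal(scale=0.1, size=50)

# Two very different models, one workflow: instantiate, fit, predict
for model in [LinearRegression(), RandomForestRegressor(n_estimators=10, random_state=0)]:
    model.fit(X_toy, y_toy)
    print(type(model).__name__, model.predict([[0.5]]))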
The official documentation is a great place to start. There are several links in the documentation that you can consult:
Official user guide: http://scikit-learn.org/stable/user_guide.html
Official tutorials: http://scikit-learn.org/stable/tutorial/index.html
Official examples: http://scikit-learn.org/stable/auto_examples/index.html
External resources, videos and talks: http://scikit-learn.org/stable/presentations.html
The tutorial titled "A tutorial on statistical-learning for scientific data processing" is worth a look: http://scikit-learn.org/stable/tutorial/statistical_inference/index.html#stat-learn-tut-index.
Jake VanderPlas puts out a lot of useful tutorials and materials on scientific computing with Python, so the following are worth a look:
Tutorial: scikit-learn - Machine Learning in Python with Contributor Jake VanderPlas: https://youtu.be/cHZONQ2-x7I
Scikit-learn tutorials for the Scipy 2013 conference by Gaël Varoquaux, Jake Vanderplas, and Olivier Grisel
Videos: Part 1, http://conference.scipy.org/scipy2013/tutorial_detail.php?id=107, and Part 2, http://conference.scipy.org/scipy2013/tutorial_detail.php?id=111
Notebooks for tutorial on GitHub: https://github.com/jakevdp/sklearn_scipy2013
An interactive version of scikit-learn's "choosing the right estimator" flowchart is available here: http://scikit-learn.org/stable/tutorial/machine_learning_map/index.html
To ease into the world of scikit-learn, we start by reproducing what we did last time: using ordinary least-squares regression to fit the formula $y = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3$ to the Filip dataset. First, in order to use scikit-learn, you need to have it installed. The environment.yaml file I distributed earlier in the semester should have taken care of this for you, but, just in case it hasn't, it is available through Anaconda with the command:
conda install scikit-learn
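If you want to double-check that the installation worked, one quick way (not part of the original setup instructions) is to import the package and print its version number:
import sklearn

# Prints the installed scikit-learn version string
print(sklearn.__version__)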
Then, at the top of your notebook, you will need to import the following:
from sklearn.linear_model import LinearRegression
Unlike with statsmodels, where we could write down a formula, in scikit-learn we instead need to create columns that represent the different powers in the polynomial. First, let's make a copy of the Filip data frame so that we can always go back to the beginning if necessary:
filip_df_poly = filip_df.copy()
Next, we need to create the columns. For the above fit, we need columns that represent $x^2$ and $x^3$. Let's set up a loop to automate creating them.
for n in np.arange(2, 4):
    filip_df_poly[f"x**{n}"] = filip_df_poly["x"] ** n
Looking at the first few rows in the data frame lets us confirm the above loop worked as expected:
filip_df_poly.head()
| | y | x | x**2 | x**3 |
|---|---|---|---|---|
| 0 | 0.8116 | -6.860121 | 47.061259 | -322.845927 |
| 1 | 0.9072 | -4.324130 | 18.698101 | -80.853019 |
| 2 | 0.9052 | -4.358625 | 18.997612 | -82.803469 |
| 3 | 0.9039 | -4.358427 | 18.995884 | -82.792168 |
| 4 | 0.8053 | -6.955852 | 48.383882 | -336.551143 |
To perform linear regression, we instantiate the LinearRegression object as follows:
lm = LinearRegression()
To create a fit, we use the lm.fit() method, which takes two arguments, X and y. X takes the independent (predictor) variables we are fitting with (the powers of x in this example). y takes the dependent (response) variable we are trying to predict with the regression (the variable y in this example). We'll specify the response and predictor variables as follows:
response_var = "y"
predictor_vars = ["x", "x**2", "x**3"]
We now use these in lm.fit() as follows:
filip_fit = lm.fit(X=filip_df_poly[predictor_vars], y=filip_df_poly[response_var])
The fitted coefficients (the parameters $\beta_0$, $\beta_1$, $\beta_2$, and $\beta_3$) are:
pd.DataFrame({
    "coefficient": ["beta_0", "beta_1", "beta_2", "beta_3"],
    "value": np.concatenate([[filip_fit.intercept_], filip_fit.coef_]),
})
| | coefficient | value |
|---|---|---|
| 0 | beta_0 | 0.390271 |
| 1 | beta_1 | -0.303364 |
| 2 | beta_2 | -0.053719 |
| 3 | beta_3 | -0.002726 |
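To make it clear how scikit-learn stores these parameters, here is a small sanity check of my own (not from the original notes): the prediction at a single value of x is just $\beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3$, which we can build by hand from intercept_ and coef_ and compare against predict():
x0 = -5.0  # an arbitrary x value inside the range of the data

# intercept_ holds beta_0; coef_ holds [beta_1, beta_2, beta_3]
manual_prediction = filip_fit.intercept_ + np.sum(
    filip_fit.coef_ * np.array([x0, x0**2, x0**3])
)

# The same point, run through predict(); the two numbers should agree
sklearn_prediction = filip_fit.predict(
    pd.DataFrame({"x": [x0], "x**2": [x0**2], "x**3": [x0**3]})
)[0]

print(manual_prediction, sklearn_prediction)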
We use the predict method to generate a model curve that we can compare with the data:
filip_predict_df = pd.DataFrame({
    "x": np.arange(-9, -3, 0.01),
    "x**2": np.arange(-9, -3, 0.01) ** 2,
    "x**3": np.arange(-9, -3, 0.01) ** 3,
})
filip_predict_df["y"] = filip_fit.predict(filip_predict_df)
Visualizing both on the same plot, we get:
fig, ax = plt.subplots(dpi=90)
sns.scatterplot(x="x", y="y", data=filip_df, ax=ax)
sns.lineplot(x="x", y="y", data=filip_predict_df, ax=ax);
Comparing with the statsmodels fit from last time, we see that we obtain the same result.
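If you want to check that claim yourself, a quick sketch of roughly what we did last time (assuming the statsmodels formula interface from last class) is:
import statsmodels.formula.api as smf

# Same cubic fit written as a formula; I() lets us use Python
# arithmetic, such as powers of x, inside the formula
smf_filip_fit = smf.ols("y ~ x + I(x**2) + I(x**3)", data=filip_df).fit()

# These parameters should match the scikit-learn coefficients above
smf_filip_fit.params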
We've seen all this already. Let's take a look at how the fit to the Filip dataset changes as we increase the number of terms in the polynomial expansion. To do this, we'll first need to add even more columns to the data frame so that we can perform a range of fits:
for n in np.arange(4, 16):
    filip_df_poly[f"x**{n}"] = filip_df_poly["x"] ** n
Now we have far more columns in the data frame:
filip_df_poly.head()
| | y | x | x**2 | x**3 | x**4 | x**5 | x**6 | x**7 | x**8 | x**9 | x**10 | x**11 | x**12 | x**13 | x**14 | x**15 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.8116 | -6.860121 | 47.061259 | -322.845927 | 2214.762094 | -15193.535763 | 104229.492448 | -715026.920996 | 4.905171e+06 | -3.365007e+07 | 2.308435e+08 | -1.583615e+09 | 1.086379e+10 | -7.452689e+10 | 5.112635e+11 | -3.507329e+12 |
| 1 | 0.9072 | -4.324130 | 18.698101 | -80.853019 | 349.618968 | -1511.797883 | 6537.210647 | -28267.748970 | 1.222334e+05 | -5.285532e+05 | 2.285533e+06 | -9.882941e+06 | 4.273512e+07 | -1.847922e+08 | 7.990656e+08 | -3.455264e+09 |
| 2 | 0.9052 | -4.358625 | 18.997612 | -82.803469 | 360.909276 | -1573.068212 | 6856.414522 | -29884.540122 | 1.302555e+05 | -5.677349e+05 | 2.474544e+06 | -1.078561e+07 | 4.701042e+07 | -2.049008e+08 | 8.930857e+08 | -3.892626e+09 |
| 3 | 0.9039 | -4.358427 | 18.995884 | -82.792168 | 360.843598 | -1572.710389 | 6854.543023 | -29875.023648 | 1.302081e+05 | -5.675025e+05 | 2.473418e+06 | -1.078021e+07 | 4.698476e+07 | -2.047796e+08 | 8.925170e+08 | -3.889970e+09 |
| 4 | 0.8053 | -6.955852 | 48.383882 | -336.551143 | 2341.000068 | -16283.650894 | 113266.671807 | -787866.248553 | 5.480281e+06 | -3.812003e+07 | 2.651573e+08 | -1.844395e+09 | 1.282934e+10 | -8.923899e+10 | 6.207332e+11 | -4.317729e+12 |
We set up a loop to create a series of fits:
response_var = "y"
predictor_vars = ["x"]
filip_polyfit_predict_df = pd.DataFrame({
    "x": np.arange(-8.90, -2.9, 0.01),
})
for n in range(1, 16):
    if n > 1:
        predictor_vars.append(f"x**{n}")
        filip_polyfit_predict_df[f"x**{n}"] = \
            np.arange(-8.90, -2.9, 0.01) ** n
    filip_polyfit = lm.fit(
        X=filip_df_poly[predictor_vars],
        y=filip_df_poly[response_var],
    )
    filip_polyfit_predict_df[f"{n}"] = \
        filip_polyfit.predict(filip_polyfit_predict_df[predictor_vars])
Now let's reshape the data frame using melt. Instead of having 15 prediction columns named 1 through 15, we'll have a column called Degree that specifies the polynomial degree and a column y_predicted that contains the predicted value of y.
filip_polyfit_predict_df_reshaped = filip_polyfit_predict_df \
    .melt(id_vars=predictor_vars,
          value_vars=[f"{n}" for n in range(1, 16)],
          var_name="Degree",
          value_name="y_predicted")
Let's now view the results in a series of faceted windows.
g = sns.FacetGrid(
    filip_polyfit_predict_df_reshaped,
    col="Degree",
    col_wrap=4,
    col_order=[f"{n}" for n in range(1, 16)],
    hue_kws={"color": ["k"]},
    sharex=True,
    sharey=True,
)
# Plots the predicted values as black lines
g = g.map(plt.plot, "x", "y_predicted")
# Plots the data points within each facet
for idx in range(15):
    sns.scatterplot(x="x", y="y", data=filip_df_poly, ax=g.axes[idx])
g.fig.set_dpi(150)
g.fig.set_size_inches(6, 6)
An alternative way to plot the panels, which is a little more involved but offers more control, is as follows:
fig, ax = plt.subplots(nrows=4, ncols=4, figsize=(8, 8), dpi=120,
                       sharex=True, sharey=True)
poly_degree = 0
for nrow in range(4):
    for ncol in range(4):
        poly_degree += 1
        if poly_degree < 16:
            sns.scatterplot(
                x="x",
                y="y",
                data=filip_df_poly, ax=ax[nrow][ncol]
            )
            sns.lineplot(
                x="x",
                y=f"{poly_degree}",
                data=filip_polyfit_predict_df,
                color="black",
                ax=ax[nrow][ncol]
            )
            ax[nrow][ncol].set_title(f"Degree {poly_degree}")
            ax[nrow][ncol].set_ylabel("y")
plt.tight_layout();
So... which model is best? How can we even tell?
How should we compare and rank models?
This is what model selection is about: computing scores and measures of performance for different models and then selecting the best choice.
Cross-validation is a method for comparing relative model performance using only the training data.
Cross-validation refers to the procedure of removing portions of a dataset, training the model on the remaining portion, and then trying to predict the values of the data you removed.
A popular flavor of cross-validation (especially among data scientists) is called k-fold cross-validation.
Basic idea: estimate how robust your model is by systematically holding out different chunks (the "folds") of the dataset, repeating the fitting process on the rest, and then testing the model's predictive power on the held-out folds (a bare-bones worked example follows this list).
In 5-fold (k=5) cross-validation, for example, each fold acts in turn as the testing set, with the remaining k−1 folds used to train the model.
Fit the model, predict the values in the testing set, then calculate a score, such as the mean-squared prediction error (MSE).
The MSE gives an estimate of how well the model works as a predictor.
The MSE is general-purpose and allows you to compare models of many types.
For linear regression, you can also use model-specific measures, such as $R^2$.
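To make the procedure concrete before handing it off to scikit-learn, here is a bare-bones sketch of k-fold cross-validation written out by hand (my own illustration, using the cubic model and the tools we already have loaded):
k = 5  # number of folds, purely for illustration

# Shuffle the row indices, then split them into k roughly equal folds
shuffled_indices = np.random.permutation(len(filip_df_poly))
folds = np.array_split(shuffled_indices, k)

fold_mse = []
for i in range(k):
    test_idx = folds[i]
    train_idx = np.concatenate([folds[j] for j in range(k) if j != i])

    # Train the cubic model on the k - 1 remaining folds
    fold_fit = LinearRegression().fit(
        X=filip_df_poly.iloc[train_idx][["x", "x**2", "x**3"]],
        y=filip_df_poly.iloc[train_idx]["y"],
    )

    # Predict the held-out fold and record its mean-squared error
    predictions = fold_fit.predict(filip_df_poly.iloc[test_idx][["x", "x**2", "x**3"]])
    fold_mse.append(np.mean((filip_df_poly.iloc[test_idx]["y"] - predictions) ** 2))

np.mean(fold_mse)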
Cross-validation in scikit-learn
To use cross-validation, you will need to import a couple of additional tools from scikit-learn:
from sklearn.model_selection import cross_val_score, RepeatedKFold
RepeatedKFold is an implementation of k-fold cross-validation that allows you to repeat the procedure multiple times, reshuffling the data for each repetition. This gives you better statistical averaging, which is particularly useful when the dataset is not very large.
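As a quick illustration (not in the original notes), a RepeatedKFold with n_splits=10 and n_repeats=100 generates 10 × 100 = 1000 distinct train/test splits, which you can confirm with get_n_splits():
demo_rkf = RepeatedKFold(n_splits=10, n_repeats=100, random_state=0)

# 10 folds per repeat times 100 repeats = 1000 train/test splits
demo_rkf.get_n_splits(filip_df_poly)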
cross_val_score is the function that takes the model, the data, and a cross-validation splitter as input, runs the fitting procedure on each split, and computes a specified score for each one.
A list of scores you can compute during a cross-validation run: http://scikit-learn.org/stable/modules/model_evaluation.html#the-scoring-parameter-defining-model-evaluation-rules
Let's modify our loop to run a series of k-fold cross-validations for our different polynomial formulas.
response_var = "y"
predictor_vars = ["x"]
filip_poly_cv_scores = {
    "n": [],
    "mse": [],
    "mse_sd": [],
    "r**2": [],
    "r**2_sd": [],
}
for n in range(1, 16):
    if n > 1:
        predictor_vars.append(f"x**{n}")
    rkf = RepeatedKFold(
        n_splits=10,
        n_repeats=100,
        random_state=int(np.round(np.random.uniform(0, 2**31), decimals=0)),
    )
    # Cross-validated mean-squared error score
    mse_cv_score = cross_val_score(
        lm,
        filip_df_poly[predictor_vars],
        filip_df_poly[response_var],
        scoring="neg_mean_squared_error",
        cv=rkf,
        n_jobs=-1,  # Use all processors during cross-validation run
    )
    # Cross-validated R**2 score
    r2_cv_score = cross_val_score(
        lm,
        filip_df_poly[predictor_vars],
        filip_df_poly[response_var],
        scoring="r2",
        cv=rkf,
        n_jobs=-1,  # Use all processors during cross-validation run
    )
    filip_poly_cv_scores["n"].append(n)
    filip_poly_cv_scores["mse"].append(-np.mean(mse_cv_score))  # Get rid of negative sign
    filip_poly_cv_scores["mse_sd"].append(np.std(mse_cv_score))
    filip_poly_cv_scores["r**2"].append(np.mean(r2_cv_score))
    filip_poly_cv_scores["r**2_sd"].append(np.std(r2_cv_score))
# Convert dictionary to data frame
filip_poly_cv_scores_df = pd.DataFrame(filip_poly_cv_scores)
Let's see how the various models perform:
filip_poly_cv_scores_df
| | n | mse | mse_sd | r**2 | r**2_sd |
|---|---|---|---|---|---|
| 0 | 1 | 0.000387 | 0.000120 | 0.815206 | 0.245704 |
| 1 | 2 | 0.000299 | 0.000115 | 0.860383 | 0.147108 |
| 2 | 3 | 0.000221 | 0.000082 | 0.887993 | 0.210553 |
| 3 | 4 | 0.000093 | 0.000035 | 0.959409 | 0.033963 |
| 4 | 5 | 0.000102 | 0.000071 | 0.955175 | 0.044956 |
| 5 | 6 | 0.000040 | 0.000025 | 0.981716 | 0.022905 |
| 6 | 7 | 0.000044 | 0.000062 | 0.975425 | 0.132341 |
| 7 | 8 | 0.000035 | 0.000078 | 0.984792 | 0.036522 |
| 8 | 9 | 0.000030 | 0.000126 | 0.986491 | 0.061791 |
| 9 | 10 | 0.000028 | 0.000098 | 0.988350 | 0.038857 |
| 10 | 11 | 0.000016 | 0.000022 | 0.992845 | 0.011417 |
| 11 | 12 | 0.000012 | 0.000008 | 0.994672 | 0.005599 |
| 12 | 13 | 0.000012 | 0.000012 | 0.994456 | 0.007132 |
| 13 | 14 | 0.000016 | 0.000027 | 0.992566 | 0.032214 |
| 14 | 15 | 0.000016 | 0.000024 | 0.993008 | 0.009973 |
Let's visualize the relative performance:
fig, ax = plt.subplots(nrows=2, dpi=120, figsize=(4, 6), sharex=True)
sns.scatterplot(x="n", y="mse", data=filip_poly_cv_scores_df, ax=ax[0])
sns.lineplot(x="n", y="mse", data=filip_poly_cv_scores_df, ax=ax[0])
ax[0].axvline(x=10, linestyle="--", color="k") # "True" polynomial degree
ax[0].set_ylim([-0.0001, 0.0005])
ax[0].set_xlabel(r"Polynomial degree")
ax[0].set_ylabel(r"Mean-squared error")
sns.scatterplot(x="n", y="r**2", data=filip_poly_cv_scores_df, ax=ax[1])
sns.lineplot(x="n", y="r**2", data=filip_poly_cv_scores_df, ax=ax[1])
ax[1].axvline(x=10, linestyle="--", color="k") # "True" polynomial degree
ax[1].set_ylim([0.80, 1.05])
ax[1].set_xlabel(r"Polynomial degree")
ax[1].set_ylabel(r"$R^{2}$")
fig.tight_layout();
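Finally, if you want to read a "best" degree off the table programmatically (a small convenience that is not part of the original notes), one option is to pick the degree with the lowest cross-validated mean-squared error:
# Polynomial degree whose cross-validated MSE is smallest
best_n = filip_poly_cv_scores_df.loc[filip_poly_cv_scores_df["mse"].idxmin(), "n"]
best_n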