#!/usr/bin/env python
# coding: utf-8

# # Overfitting II
#
# Last time, we saw a theoretical example of *overfitting*, in which we fit a machine learning model that perfectly fit the data it saw, but performed extremely poorly on fresh, unseen data. In this lecture, we'll observe overfitting in a more practical context, using the Titanic data set again. We'll then begin to study *validation* techniques for finding models with "just the right amount" of flexibility.

# In[3]:


import numpy as np
from matplotlib import pyplot as plt
import pandas as pd


# In[4]:


# assumes that you have run the function retrieve_data()
# from "Introduction to ML in Practice" in ML_3.ipynb
titanic = pd.read_csv("data.csv")
titanic


# Recall that we diagnosed overfitting by testing our model against some new data. In this case, we don't have any more data. So, what we can do instead is *hold out* some data that we won't let our model see at first. This holdout data is called the *validation* or *testing* data, depending on the use to which we put it. In contrast, the data that we allow our model to see is called the *training* data.
#
# `sklearn` provides a convenient function for partitioning our data into training and holdout sets called `train_test_split`. The default and generally most useful behavior is to randomly select rows of the data frame for each set.

# In[5]:


from sklearn.model_selection import train_test_split

np.random.seed(1234)
train, test = train_test_split(titanic, test_size = 0.3) # hold out 30% of the data
train.shape, test.shape


# Now we have two data frames. As you may recall from a previous lecture, we need to do some data cleaning and split each of them into predictor variables `X` and target variables `y`.

# In[6]:


from sklearn import preprocessing

def prep_titanic_data(data_df):
    df = data_df.copy()

    # convert male/female to 1/0
    le = preprocessing.LabelEncoder()
    df['Sex'] = le.fit_transform(df['Sex'])

    # don't need the name column
    df = df.drop(['Name'], axis = 1)

    # split into X and y
    X = df.drop(['Survived'], axis = 1)
    y = df['Survived']

    return (X, y)


# In[7]:


X_train, y_train = prep_titanic_data(train)
X_test, y_test = prep_titanic_data(test)


# Now we're able to train our model on the `train` data and then evaluate its performance on the `test` data. This will help us to diagnose and avoid overfitting.
#
# Let's try using the decision tree classifier again. As you may remember, the `DecisionTreeClassifier()` class takes an argument `max_depth` that governs how many layers of decisions the tree is allowed to make. Larger `max_depth` values correspond to more complicated trees. In this way, `max_depth` is a model complexity parameter, similar to the `degree` when we did polynomial regression.
#
# For example, with a small `max_depth`, the model's scores on the training and test data are relatively close.

# In[10]:


from sklearn import tree

T = tree.DecisionTreeClassifier(max_depth = 3)
T.fit(X_train, y_train)
T.score(X_train, y_train), T.score(X_test, y_test)


# On the other hand, if we use a much higher `max_depth`, we can achieve a substantially better score on the training data, but our performance on the test data does not improve by much, and might even suffer.

# In[11]:


T = tree.DecisionTreeClassifier(max_depth = 20)
T.fit(X_train, y_train)
T.score(X_train, y_train), T.score(X_test, y_test)


# That looks like overfitting! The model achieves a near-perfect score on the training data, but a much lower one on the test data.
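# As a rough way to see how much extra flexibility the deep tree is actually using, we can inspect its size. The cell below is a minimal sketch: it refits the depth-3 and depth-20 trees from above and compares them with the standard `get_depth()` and `get_n_leaves()` methods of `DecisionTreeClassifier`. A tree with many more leaves is carving the training data into many tiny pieces, which is one way to memorize it.

# In[ ]:


# refit the two trees and compare how large they actually grow
shallow = tree.DecisionTreeClassifier(max_depth = 3).fit(X_train, y_train)
deep = tree.DecisionTreeClassifier(max_depth = 20).fit(X_train, y_train)

(shallow.get_depth(), shallow.get_n_leaves()), (deep.get_depth(), deep.get_n_leaves())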
# In[12]:


fig, ax = plt.subplots(1, figsize = (10, 7))

for d in range(1, 30):
    T = tree.DecisionTreeClassifier(max_depth = d)
    T.fit(X_train, y_train)
    ax.scatter(d, T.score(X_train, y_train), color = "black")
    ax.scatter(d, T.score(X_test, y_test), color = "firebrick")

ax.set(xlabel = "Complexity (depth)", ylabel = "Performance (score)")


# Observe that the training score (black) always increases, while the test score (red) tops out around 83% and then even begins to trail off slightly. It looks like the optimal depth might be around 5-7 or so, but there's some random noise that prevents us from determining exactly what the optimal depth is.
#
# Increasing performance on the training set combined with decreasing performance on the test set is the trademark of overfitting.
#
# This noise reflects the fact that we took a single, random subset of the data for testing. In a more systematic experiment, we would draw many different subsets of the data for each value of depth and average over them. This is what *cross-validation* does, and we'll talk about it in the next lecture.
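# As a small preview of that idea, `sklearn` provides `cross_val_score`, which refits a model on several different train/validation splits of the same data and returns one score per split. The cell below is a minimal sketch: the choice of a depth-5 tree and of 5 folds is arbitrary, and we simply average the returned scores instead of relying on a single random split.

# In[ ]:


from sklearn.model_selection import cross_val_score

# score a depth-5 tree on 5 different train/validation splits of the training data
T = tree.DecisionTreeClassifier(max_depth = 5)
scores = cross_val_score(T, X_train, y_train, cv = 5)
scores, scores.mean()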