#!/usr/bin/env python
# coding: utf-8

# # Overfitting II
#
# Last time, we saw a theoretical example of *overfitting*, in which we fit a machine learning model that perfectly fit the data it saw, but performed extremely poorly on fresh, unseen data. In this lecture, we'll observe overfitting in a more practical context, using the Titanic data set again. We'll then begin to study *validation* techniques for finding models with "just the right amount" of flexibility.

# In[3]:


import numpy as np
from matplotlib import pyplot as plt
import pandas as pd


# In[4]:


# assumes that you have run the function retrieve_data()
# from "Introduction to ML in Practice" in ML_3.ipynb
titanic = pd.read_csv("data.csv")
titanic


# Recall that we diagnosed overfitting by testing our model against some new data. In this case, we don't have any more data. So, what we can do instead is *hold out* some data that we won't let our model see at first. This holdout data is called the *validation* or *testing* data, depending on the use to which we put it. In contrast, the data that we allow our model to see is called the *training* data.
#
# `sklearn` provides a convenient function for partitioning our data into training and holdout sets called `train_test_split`. The default and generally most useful behavior is to randomly select rows of the data frame for each set.

# In[5]:


from sklearn.model_selection import train_test_split

np.random.seed(1234)
train, test = train_test_split(titanic, test_size = 0.3) # hold out 30% of the data
train.shape, test.shape


# Now we have two data frames. As you may recall from a previous lecture, we need to do some data cleaning and split each of them into predictor variables `X` and target variables `y`.

# In[6]:


from sklearn import preprocessing

def prep_titanic_data(data_df):
    df = data_df.copy()

    # convert male/female to 1/0
    le = preprocessing.LabelEncoder()
    df['Sex'] = le.fit_transform(df['Sex'])

    # don't need the name column
    df = df.drop(['Name'], axis = 1)

    # split into X and y
    X = df.drop(['Survived'], axis = 1)
    y = df['Survived']

    return (X, y)


# In[7]:


X_train, y_train = prep_titanic_data(train)
X_test, y_test = prep_titanic_data(test)


# Now we're able to train our model on the `train` data and then evaluate its performance on the `test` data. This will help us to diagnose and avoid overfitting.
#
# Let's try using the decision tree classifier again. As you may remember, the `DecisionTreeClassifier()` class takes an argument `max_depth` that governs how many layers of decisions the tree is allowed to make. Larger `max_depth` values correspond to more complicated trees. In this way, `max_depth` is a model complexity parameter, similar to the `degree` when we did polynomial regression.
#
# For example, with a small `max_depth`, the model's scores on the training and test data are relatively close.

# In[10]:


from sklearn import tree

T = tree.DecisionTreeClassifier(max_depth = 3)
T.fit(X_train, y_train)
T.score(X_train, y_train), T.score(X_test, y_test)


# On the other hand, if we use a much higher `max_depth`, we can achieve a substantially better score on the training data, but our performance on the test data does not improve by much, and might even suffer.

# In[11]:


T = tree.DecisionTreeClassifier(max_depth = 20)
T.fit(X_train, y_train)
T.score(X_train, y_train), T.score(X_test, y_test)


# That looks like overfitting! The model achieves a near-perfect score on the training data, but a much lower one on the test data.
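# As a rough way to see how much extra flexibility the deep tree is actually using, we can inspect its size. The cell below is a minimal sketch: it refits the depth-3 and depth-20 trees from above and compares them with the standard `get_depth()` and `get_n_leaves()` methods of `DecisionTreeClassifier`. A tree with many more leaves is carving the training data into many tiny pieces, which is one way to memorize it.

# In[ ]:


# refit the two trees and compare how large they actually grow
shallow = tree.DecisionTreeClassifier(max_depth = 3).fit(X_train, y_train)
deep = tree.DecisionTreeClassifier(max_depth = 20).fit(X_train, y_train)

(shallow.get_depth(), shallow.get_n_leaves()), (deep.get_depth(), deep.get_n_leaves())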
# In[12]:


fig, ax = plt.subplots(1, figsize = (10, 7))

for d in range(1, 30):
    T = tree.DecisionTreeClassifier(max_depth = d)
    T.fit(X_train, y_train)
    ax.scatter(d, T.score(X_train, y_train), color = "black")
    ax.scatter(d, T.score(X_test, y_test), color = "firebrick")

ax.set(xlabel = "Complexity (depth)", ylabel = "Performance (score)")


# Observe that the training score (black) always increases, while the test score (red) tops out around 83% and then even begins to trail off slightly. It looks like the optimal depth might be around 5-7 or so, but there's some random noise that prevents us from determining exactly what the optimal depth is.
#
# Increasing performance on the training set combined with decreasing performance on the test set is the trademark of overfitting.
#
# This noise reflects the fact that we took a single, random subset of the data for testing. In a more systematic experiment, we would draw many different subsets of the data for each value of depth and average over them. This is what *cross-validation* does, and we'll talk about it in the next lecture.
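# As a small preview of that idea, `sklearn` provides `cross_val_score`, which refits a model on several different train/validation splits of the same data and returns one score per split. The cell below is a minimal sketch: the choice of a depth-5 tree and of 5 folds is arbitrary, and we simply average the returned scores instead of relying on a single random split.

# In[ ]:


from sklearn.model_selection import cross_val_score

# score a depth-5 tree on 5 different train/validation splits of the training data
T = tree.DecisionTreeClassifier(max_depth = 5)
scores = cross_val_score(T, X_train, y_train, cv = 5)
scores, scores.mean()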