Variance-Bias Tradeoff

Validation Methods Module in Python

© Kaixin Wang, Winter 2020

Data Preprocessing

Libraries Import

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

Data Import

In [2]:
# read in the csv file
train = pd.read_csv("football_train.csv", index_col = 0)
In [3]:
# check if the file has been correctly read in
train.head()
Out[3]:
Yards OffensivePlays TurnOversLost FumblesLost FirstDowns PassesCompleted PassesAttempted YardsGainedPassing InterceptionsThrown RushingAttempts ... OppPassesAttempted OppYardsGainedPassing OppInterceptionsThrown OppRushingAttempts OppYardsGainedRushing OppPenaltiesCommitedByTeam OppPenaltiesInYards OppFirstDownsByPenalty OppNumberOfDrives Wins
ID
1 5340 1000 31 11 275 298 519 3340 20 437 ... 549 4092 9 442 1727 87 686 28 189 4
2 6138 1014 25 14 342 328 476 3784 11 508 ... 558 4031 18 392 1650 104 861 34 179 12
3 5362 955 19 13 288 306 494 3361 6 431 ... 521 3216 19 444 2107 79 676 18 177 10
4 5728 1004 17 8 324 334 505 3923 9 467 ... 646 3976 21 344 1477 116 1063 31 180 12
5 4788 994 26 9 269 328 586 3419 17 358 ... 535 3832 12 545 2256 101 820 37 191 2

5 rows × 31 columns

In [4]:
train.shape
Out[4]:
(380, 31)
In [5]:
# standardize the predictor columns (all columns except the response, Wins)
n = train.shape[1]
for i in range(n - 1):
    train.iloc[:, i] = (train.iloc[:, i] - np.mean(train.iloc[:, i])) / np.std(train.iloc[:, i])
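
The same standardization can also be written without an explicit loop. A vectorized sketch (not part of the original notebook) that matches the loop above, using the population standard deviation (ddof = 0) to agree with np.std:

# vectorized alternative: standardize every column except the response Wins
predictors = train.columns[:-1]
train[predictors] = (train[predictors] - train[predictors].mean()) / train[predictors].std(ddof = 0)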
In [6]:
train.head()
Out[6]:
Yards OffensivePlays TurnOversLost FumblesLost FirstDowns PassesCompleted PassesAttempted YardsGainedPassing InterceptionsThrown RushingAttempts ... OppPassesAttempted OppYardsGainedPassing OppInterceptionsThrown OppRushingAttempts OppYardsGainedRushing OppPenaltiesCommitedByTeam OppPenaltiesInYards OppFirstDownsByPenalty OppNumberOfDrives Wins
ID
1 -0.067486 -0.275791 0.797903 0.138608 -0.874948 -0.619963 -0.305416 -0.325343 1.009883 -0.041968 ... 0.178948 1.276131 -1.386068 0.172824 -0.262938 -1.136733 -1.345573 0.161351 0.268029 4
2 1.198873 0.018357 -0.084356 0.965899 0.907859 -0.009847 -1.013489 0.384790 -0.836082 1.296434 ... 0.382406 1.132749 0.485370 -0.909461 -0.521518 0.077035 0.009290 0.978862 -0.740398 12
3 -0.032574 -1.221267 -0.966615 0.690135 -0.529030 -0.457266 -0.717086 -0.291755 -1.861618 -0.155072 ... -0.454033 -0.782929 0.693308 0.216115 1.013175 -1.707917 -1.422994 -1.201168 -0.942083 10
4 0.548237 -0.191749 -1.260702 -0.688684 0.428896 0.112176 -0.535951 0.607106 -1.246296 0.523554 ... 2.371773 1.003470 1.109183 -1.948455 -1.102485 0.933812 1.573190 0.570106 -0.639555 12
5 -0.943463 -0.401855 0.062687 -0.412920 -1.034602 -0.009847 0.797860 -0.198990 0.394561 -1.531175 ... -0.137542 0.664994 -0.762256 2.402332 1.513546 -0.137159 -0.308135 1.387617 0.469715 2

5 rows × 31 columns

Correlation matrix and correlation heatmap

In [7]:
train.corr().iloc[:, n-1] 
Out[7]:
Yards                         0.545436
OffensivePlays                0.275029
TurnOversLost                -0.531927
FumblesLost                  -0.329564
FirstDowns                    0.527665
PassesCompleted               0.141868
PassesAttempted              -0.091427
YardsGainedPassing            0.363565
InterceptionsThrown          -0.496850
RushingAttempts               0.449488
YardsGainedRushing            0.342025
PenaltiesCommitedByTeam      -0.171125
PenaltiesInYards             -0.138735
FirstDownsByPenalty           0.172522
NumberOfDrives               -0.126591
OppYards                     -0.361141
OppOffensivePlays            -0.214024
OppTurnOversLost              0.441954
OppFumblesLost                0.194505
OppFirstDowns                -0.307572
OppPassesCompleted            0.090255
OppPassesAttempted            0.367913
OppYardsGainedPassing        -0.117004
OppInterceptionsThrown        0.452108
OppRushingAttempts           -0.631873
OppYardsGainedRushing        -0.464723
OppPenaltiesCommitedByTeam    0.044104
OppPenaltiesInYards           0.056435
OppFirstDownsByPenalty       -0.064709
OppNumberOfDrives            -0.136681
Wins                          1.000000
Name: Wins, dtype: float64
In [8]:
plt.figure(figsize=(16,16))
sns.heatmap(train.corr(), annot = True)
plt.xlim([0, n])
plt.ylim([0, n])
plt.title("correlation heatmap: numeric variables")
plt.show()
In [9]:
# select the predictors whose absolute correlation with Wins exceeds 0.35
corr_with_wins = train.corr().iloc[:, n-1]
indices = []
for i in range(n-1):
    if abs(corr_with_wins.iloc[i]) > 0.35:
        indices.append(i)
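
For readability, these positional indices can be mapped back to their column names; a quick illustrative check (not part of the original notebook):

# look up the names of the predictors whose absolute correlation with Wins exceeds 0.35
print(list(train.columns[indices]))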

Validation Methods

Train-Test-Split Method

Suppose that we want to predict the variable Wins of each team based on the predictors in the dataframe.

We could use the entire dataset to build the model, but then we would have no measure of how well the model predicts on future/unseen data.

Therefore, to assess the predictive ability of the model on new data, we split the dataset into a training set and a testing set:

  • training set: used to build the model (i.e., to estimate the coefficients of the model)
  • testing set: used to evaluate how good the model is (i.e., to check whether the model overfits the observed data)

Since there are 380 observations in the data frame train, we can use a 70%/30% split, where 70% of the observations form the training set and the remaining 30% form the testing set.

There are two methods to split the dataset:

  1. using the train_test_split function from the scikit-learn module
  2. using the sample function from the random module

train_test_split() function

In [10]:
# import the function from module
from sklearn.model_selection import train_test_split
In [11]:
X = train.iloc[:, range(0, n-1)]
y = train.iloc[:, n-1]
In [12]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 42)
print("size of the training set is %d, size of the testing set is %d." %(X_train.shape[0], X_test.shape[0]))
size of the training set is 266, size of the testing set is 114.

sample() function

Note: this method takes more steps than using the train_test_split() function.

In [13]:
import random
start = 0
end = X.shape[0]
size = X.shape[0]
# obtain the indices of training set
train_indices = random.sample(range(start, end), int(size * 0.7))
In [14]:
# to obtain the indices for testing set
test_indices = list(set(range(start, end)) - set(train_indices))
In [15]:
train_X = X.iloc[train_indices, :]
train_y = y.iloc[train_indices]
test_X = X.iloc[test_indices, :]
test_y = y.iloc[test_indices]
In [16]:
print("size of the training set is %d, size of the testing set is %d." %(train_X.shape[0], test_X.shape[0]))
size of the training set is 266, size of the testing set is 114.
In [17]:
# check that no index appears in both the training and the testing sets
print(any(i in train_indices for i in test_indices))
print(any(i in test_indices for i in train_indices))
False
False

Variance-Bias Tradeoff

  • Variance: the variability of the predicted values, i.e., how much the predictions change when the model is fit to different training data.
  • Bias: the expected difference between the predicted value and the actual value at new values of x.
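
These two quantities are connected through the standard decomposition of the expected squared prediction error at a new point $x$: $E[(y - \hat{f}(x))^2] = \text{Bias}[\hat{f}(x)]^2 + \text{Var}[\hat{f}(x)] + \sigma^2$, where $\sigma^2$ is the irreducible error. Increasing model complexity typically lowers the bias but raises the variance, which is the tradeoff illustrated below.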
In [18]:
import statsmodels.api as sm # another package for linear model
import statsmodels.formula.api as smf
In [19]:
# create a dataframe for both the training and the testing
df_train = pd.concat([X_train, y_train], axis = 1) 
# linear regression model using statsmodel
model1 = smf.ols("Wins ~ Yards + TurnOversLost + FirstDowns + InterceptionsThrown + OppYardsGainedRushing", data = df_train)
results = model1.fit()
print(results.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                   Wins   R-squared:                       0.542
Model:                            OLS   Adj. R-squared:                  0.533
Method:                 Least Squares   F-statistic:                     61.58
Date:                Mon, 10 Feb 2020   Prob (F-statistic):           3.54e-42
Time:                        12:30:17   Log-Likelihood:                -579.86
No. Observations:                 266   AIC:                             1172.
Df Residuals:                     260   BIC:                             1193.
Df Model:                           5                                         
Covariance Type:            nonrobust                                         
=========================================================================================
                            coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------------
Intercept                 8.1851      0.133     61.483      0.000       7.923       8.447
Yards                     0.5655      0.342      1.653      0.100      -0.108       1.239
TurnOversLost            -0.5980      0.259     -2.308      0.022      -1.108      -0.088
FirstDowns                0.6908      0.351      1.970      0.050       0.000       1.381
InterceptionsThrown      -0.4586      0.249     -1.843      0.067      -0.949       0.031
OppYardsGainedRushing    -0.9874      0.133     -7.397      0.000      -1.250      -0.725
==============================================================================
Omnibus:                        1.232   Durbin-Watson:                   1.848
Prob(Omnibus):                  0.540   Jarque-Bera (JB):                1.281
Skew:                          -0.100   Prob(JB):                        0.527
Kurtosis:                       2.725   Cond. No.                         5.88
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
In [20]:
model2 = smf.ols("Wins ~ Yards + TurnOversLost + FirstDowns", data = df_train)
results2 = model2.fit()
print(results2.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                   Wins   R-squared:                       0.439
Model:                            OLS   Adj. R-squared:                  0.432
Method:                 Least Squares   F-statistic:                     68.27
Date:                Mon, 10 Feb 2020   Prob (F-statistic):           1.19e-32
Time:                        12:30:17   Log-Likelihood:                -606.96
No. Observations:                 266   AIC:                             1222.
Df Residuals:                     262   BIC:                             1236.
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
=================================================================================
                    coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------
Intercept         8.1642      0.147     55.612      0.000       7.875       8.453
Yards             0.6789      0.377      1.802      0.073      -0.063       1.421
TurnOversLost    -1.0858      0.155     -7.018      0.000      -1.390      -0.781
FirstDowns        0.8202      0.386      2.124      0.035       0.060       1.581
==============================================================================
Omnibus:                        2.060   Durbin-Watson:                   1.911
Prob(Omnibus):                  0.357   Jarque-Bera (JB):                2.030
Skew:                          -0.153   Prob(JB):                        0.362
Kurtosis:                       2.700   Cond. No.                         5.21
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
In [21]:
# true test set
test_x = pd.read_csv("football_test.csv", index_col = 0)
test_y = pd.read_csv("football_answers.csv")
In [22]:
# standardize the features
n = test_x.shape[1]
for i in range(n - 1):
    test_x.iloc[:, i] = (test_x.iloc[:, i] - np.mean(test_x.iloc[:, i])) / np.std(test_x.iloc[:, i])
In [23]:
from sklearn.linear_model import LinearRegression
In [24]:
# column positions of Yards, TurnOversLost, FirstDowns, InterceptionsThrown, OppYardsGainedRushing (model 1)
model1_indices = [0, 2, 4, 8, 25]
# column positions of Yards, TurnOversLost, FirstDowns (model 2)
model2_indices = [0, 2, 4]
In [25]:
# refit the two models with scikit-learn and predict on every observation in the original train data
model1 = LinearRegression().fit(X_train.iloc[:, model1_indices], y_train)
pred1 = model1.predict(X.iloc[:, model1_indices])
model2 = LinearRegression().fit(X_train.iloc[:, model2_indices], y_train)
pred2 = model2.predict(X.iloc[:, model2_indices])
In [26]:
model1.coef_
Out[26]:
array([ 0.56553607, -0.59798115,  0.69075096, -0.4586361 , -0.98742599])
In [27]:
model2.coef_
Out[27]:
array([ 0.67893791, -1.08579519,  0.82022425])
In [28]:
nrows = X.shape[0]
plt.figure(figsize = (16, 6))
# left panel: model 1 (5 predictors)
plt.subplot(1, 2, 1)
plt.scatter(y, pred1)
plt.xlabel("true values")
plt.ylabel("predicted values")
# right panel: model 2 (3 predictors)
plt.subplot(1, 2, 2)
plt.scatter(y, pred2)
plt.xlabel("true values")
plt.ylabel("predicted values")
plt.show()
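
To put a number on the visual comparison above, one could also compute a simple error measure for the two sets of predictions. A minimal sketch (not part of the original notebook) using mean_squared_error from scikit-learn:

from sklearn.metrics import mean_squared_error

# root mean squared error of each model's predictions on the train data (illustrative)
rmse1 = np.sqrt(mean_squared_error(y, pred1))
rmse2 = np.sqrt(mean_squared_error(y, pred2))
print("RMSE of model 1: %.3f, RMSE of model 2: %.3f" % (rmse1, rmse2))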

Now we will fit multiple linear regression models of different complexities on the training set and examine how they perform on the test set:

Building Linear Regression Models with different complexities

In [29]:
indices
Out[29]:
[0, 2, 4, 7, 8, 9, 15, 17, 21, 23, 24, 25]
In [30]:
# to generate all possible combinations of the indices
import itertools
In [31]:
def createMLRModel(train_X, train_y, test_X, indices, complexity):
    '''
    This function creates a Multiple Linear Regression Model based on the input 
    dataframes, indices of the predictors, and the complexity of the model.
    train_X: training set of the predictors
    train_y: training set of the outcome
    test_X: testing set of the predictors
    indices: column positions of the candidate predictors
    complexity: number of predictors used in each model
    return: predictions - a list of arrays, one per fitted model, containing that model's predictions on test_X
    '''
    predictions = []
    for subset in itertools.combinations(indices, complexity):
        model = LinearRegression().fit(train_X.iloc[:, list(subset)], train_y)
        pred = model.predict(test_X.iloc[:, list(subset)])
        predictions.append(pred.reshape(1,-1))
    return predictions

Create regression models and obtain the predictions

In [32]:
predictions = {}
for i in range(1, 10):
    predictions[i] = createMLRModel(X_train, y_train, test_x, indices, i)
    print("%d MLR models built with complexity = %d" %(len(predictions[i]), i))
12 MLR models built with complexity = 1
66 MLR models built with complexity = 2
220 MLR models built with complexity = 3
495 MLR models built with complexity = 4
792 MLR models built with complexity = 5
924 MLR models built with complexity = 6
792 MLR models built with complexity = 7
495 MLR models built with complexity = 8
220 MLR models built with complexity = 9

Calculate the average value of predictions at each complexity level

In [33]:
averages = {}
for i in range(1, 10):
    averages[i] = list(np.mean(predictions[i], axis = 0)[0])

Calculate the bias of predictions at each complexity level

In [34]:
test_y = test_y.loc[:, "Wins"]
In [35]:
biases = {}
for i in range(1, 10):
    biases[i] = np.mean(np.abs(averages[i] - test_y))

Calculate the average value of variance in predictions at each complexity level

In [36]:
variances = {}
for i in range(1, 10):
    # variance of the predictions across models at each test point, averaged over the test points
    variances[i] = np.mean(np.var(predictions[i], axis = 0))
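
In symbols, writing $\hat{y}_{m,j}$ for the prediction of model $m$ at test observation $j$ (with $M$ models and $n$ test observations at a given complexity level), the bias and variance computed above are estimates of $\text{bias} \approx \frac{1}{n}\sum_{j=1}^{n} \left| \frac{1}{M}\sum_{m=1}^{M}\hat{y}_{m,j} - y_j \right|$ and $\text{variance} \approx \frac{1}{n}\sum_{j=1}^{n} \operatorname{Var}_m(\hat{y}_{m,j})$.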

Visualize the decrease in bias as the complexity level increases

In [37]:
nrows = test_y.shape[0]
plt.figure(figsize = (16, 9))
plt.subplots_adjust(hspace = 0.5)
for i in range(9):
    plt.subplot(3, 3, i+1)
    plt.scatter(test_y, averages[i+1])
    plt.xlabel("true values")
    plt.ylabel("average predictions")
    if i == 0:
        plt.title("SLR model (1 feature)")
    else:
        plt.title("MLR with %d features" %(i+1))

Bias vs. model complexity

In [38]:
plt.plot(list(biases.keys()), list(biases.values()),"-o")
plt.xlabel("number of features")
plt.ylabel("bias in the predictions")
plt.show()

Variance vs. model complexity

In [39]:
plt.plot(list(variances.keys()), list(variances.values()),"-o", color = "orange")
plt.xlabel("number of features")
plt.ylabel("variance in the predictions")
plt.show()

Bias vs. Variance as model complexity increases

In [40]:
plt.plot(list(biases.keys()), list(biases.values()),"-o")
plt.plot(list(variances.keys()), list(variances.values()),"-o")
plt.xlabel("number of features")
plt.ylabel("bias or variance")
plt.legend(["bias", "variance"])
plt.title("bias vs. variance tradeoff")
plt.show()

Upcoming modules:

Tree-based Methods

  • Regression Trees and Classification Trees
  • Boosting-based Trees:
    • XGBoost
    • LightGBM
  • Bagging-based Trees:
    • Random Forest

Evaluation Criteria

  • MSE and RMSE
  • $R^2$ and $R^2_{\text{adj}}$
  • AIC, BIC and $C_p$

© Kaixin Wang, updated February 2020