import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# read in the csv file
# (index_col=0 uses the file's first column as the row index)
train = pd.read_csv("football_train.csv", index_col = 0)
# check if the file has been correctly read in
# (bare expressions below display their value in a notebook cell)
train.head()
train.shape
# standardize the numeric columns
# All columns except the last are z-scored; the last column is the outcome
# (Wins, used as the regression target below) and is left on its original scale.
n = train.shape[1]
feature_cols = train.columns[:n - 1]
# Vectorized z-score in one pass instead of a per-column Python loop.
# ddof=0 matches np.std (population standard deviation), which the
# original loop used.
train[feature_cols] = (train[feature_cols] - train[feature_cols].mean()) / train[feature_cols].std(ddof=0)
train.head()
# correlations of every numeric column with the last column (Wins)
train.corr().iloc[:, n-1]
plt.figure(figsize=(16,16))
sns.heatmap(train.corr(), annot = True)
# NOTE(review): forcing the axis limits looks like the workaround for the
# matplotlib release that clipped the top/bottom rows of seaborn heatmaps;
# it also reverses the default y-axis direction — confirm this is intended
plt.xlim([0, n])
plt.ylim([0, n])
plt.title("correlation heatmap: numeric variables")
plt.show()
# select the positions of the predictors whose absolute correlation with the
# outcome (last column, Wins) exceeds 0.35
# The correlation matrix is computed once, not once per loop iteration as in
# the original (train.corr() is O(columns^2) and was inside the loop body).
corr_with_wins = train.corr().iloc[:, n-1]
indices = [i for i in range(n-1) if abs(corr_with_wins.iloc[i]) > 0.35]
Suppose that we want to predict the variable Wins of each team based on the predictors in the dataframe.
We could use the entire dataset for fitting, but then we would have no measure of how well the model predicts on future/unseen data.
Therefore, to assess the predictive ability of the model, we split the dataset into a training set and a testing set.
Since there are 380 observations in the data frame train, we can use a 70%/30% split, where 70% of the entries serve as the training set and the other 30% as the testing set.
There are two methods to split the dataset:
1. the train_test_split function in the scikit-learn module
2. the sample function in the random module
# import the function from module
from sklearn.model_selection import train_test_split
# predictors are every column but the last; the outcome is the last column (Wins)
X = train.iloc[:, :n - 1]
y = train.iloc[:, n - 1]
# hold out 30% of the rows as the testing set; the fixed seed makes the
# split reproducible across runs
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 42)
print("size of the training set is %d, size of the testing set is %d." %(X_train.shape[0], X_test.shape[0]))
Note: the alternative method below takes relatively more steps than using the train_test_split() function.
import random
# sample 70% of the row positions (without replacement) for the training set
start = 0
end = X.shape[0]
size = X.shape[0]
# obtain the indices of training set
train_indices = random.sample(range(start, end), int(size * 0.7))
# to obtain the indices for testing set: every position not sampled above
test_indices = list(set(range(start, end)) - set(train_indices))
train_X = X.iloc[train_indices, :]
train_y = y.iloc[train_indices]
test_X = X.iloc[test_indices, :]
test_y = y.iloc[test_indices]
print("size of the training set is %d, size of the testing set is %d." %(train_X.shape[0], test_X.shape[0]))
# check if there is any duplicates in two dataframes
# (the original used `test_indices in train_indices`, which asks whether the
# whole list is an *element* of the other list — always False here regardless
# of overlap; the real check is whether the index sets intersect)
print(bool(set(test_indices) & set(train_indices)))
import statsmodels.api as sm # another package for linear model
# NOTE(review): `sm` is never used in the visible code; only `smf` is
import statsmodels.formula.api as smf
# re-attach the outcome column to the training predictors so the formula
# interface can reference "Wins" by name
df_train = pd.concat([X_train, y_train], axis = 1)
# linear regression model using statsmodel
# model 1: five predictors
model1 = smf.ols("Wins ~ Yards + TurnOversLost + FirstDowns + InterceptionsThrown + OppYardsGainedRushing", data = df_train)
results = model1.fit()
print(results.summary())
# model 2: a smaller model keeping only three of the predictors
model2 = smf.ols("Wins ~ Yards + TurnOversLost + FirstDowns", data = df_train)
results2 = model2.fit()
print(results2.summary())
# true test set
test_x = pd.read_csv("football_test.csv", index_col = 0)
test_y = pd.read_csv("football_answers.csv")
# standardize the features
# NOTE(review): this standardizes the test set with its *own* column
# means/stds; for honest evaluation the training set's statistics are
# usually applied to the test set — confirm this is intended
n = test_x.shape[1]
# NOTE(review): range(n - 1) mirrors the training loop (which skipped the
# outcome column), but test_x appears to contain predictors only, so the
# last feature column may be left unstandardized — verify against the csv
for i in range(n - 1):
    test_x.iloc[:, i] = (test_x.iloc[:, i] - np.mean(test_x.iloc[:, i])) / np.std(test_x.iloc[:, i])
from sklearn.linear_model import LinearRegression
# hard-coded column positions of the predictors used by the two models
# (presumably matching the five / three variables named in the statsmodels
# formulas above — verify the position-to-name mapping)
model1_indices = [0, 2, 4, 8, 25]
model2_indices = [0, 2, 4]
model1 = LinearRegression().fit(X_train.iloc[:, model1_indices], y_train)
# NOTE(review): predictions are made on X, the *full* standardized training
# dataframe (both splits), not on the held-out split or the true test set —
# confirm this is intended
pred1 = model1.predict(X.iloc[:, model1_indices])
model2 = LinearRegression().fit(X_train.iloc[:, model2_indices], y_train)
pred2 = model2.predict(X.iloc[:, model2_indices])
# fitted coefficients (displayed in a notebook cell)
model1.coef_
model2.coef_
# side-by-side scatter plots of true vs. predicted Wins for the two models
# (the original also computed `nrows = X.shape[0]`, which was never read
# anywhere in the file, so it is dropped here)
plt.figure(figsize = (16, 6))
plt.subplot(1, 2, 1)
plt.scatter(y, pred1)
plt.xlabel("true values")
plt.ylabel("predicted values")
plt.subplot(1, 2, 2)
plt.scatter(y, pred2)
plt.xlabel("true values")
plt.ylabel("predicted values")
plt.show()
Now we will fit multiple models on the training set and evaluate their performance on the true test set:
# display the positions of the predictors selected by the correlation filter
indices
# to generate all possible combinations of the indices
import itertools
def createMLRModel(train_X, train_y, test_X, indices, complexity):
    '''
    Fit one Multiple Linear Regression model for every way of choosing
    `complexity` predictors out of `indices`, and collect each model's
    predictions on the testing set.

    train_X: training set of the predictors
    train_y: training set of the outcome
    test_X:  testing set of the predictors
    indices: candidate column positions of the predictors
    complexity: number of predictors used in each model

    return: list with one (1, n_test) prediction array per fitted model
    '''
    # each prediction vector is reshaped to a row so that later code can
    # stack and reduce the collection along axis 0
    return [
        LinearRegression()
        .fit(train_X.iloc[:, list(chosen)], train_y)
        .predict(test_X.iloc[:, list(chosen)])
        .reshape(1, -1)
        for chosen in itertools.combinations(indices, complexity)
    ]
Create regression models and obtain the predictions
# fit every candidate model at complexity levels 1 through 9 and keep the
# resulting prediction lists keyed by complexity
predictions = {}
for complexity in range(1, 10):
    preds = createMLRModel(X_train, y_train, test_x, indices, complexity)
    predictions[complexity] = preds
    print("%d MLR models built with complexity = %d" %(len(preds), complexity))
Calculate the average value of predictions at each complexity level
# average the per-model predictions at each complexity level
# (axis=0 averages over models; [0] drops the leading length-1 axis left by
# the reshape in createMLRModel)
averages = {}
for level in range(1, 10):
    mean_prediction = np.mean(predictions[level], axis = 0)
    averages[level] = list(mean_prediction[0])
Calculate the bias of predictions at each complexity level
# keep only the Wins column of the answers as a Series
test_y = test_y.loc[:, "Wins"]
# bias at each complexity level: mean absolute deviation of the averaged
# predictions from the true Wins values
biases = {level: np.mean(np.abs(averages[level] - test_y)) for level in range(1, 10)}
Calculate the average variance of the predictions at each complexity level
# model variance at each complexity level
variances = {}
for i in range(1, 10):
    # variance across the ensemble of models at each test point (axis=0),
    # averaged over test points. The original np.var(predictions[i]) had no
    # axis argument, so it flattened across models AND test points, folding
    # the spread between test points into the "variance" — that spread is
    # not model variance.
    variances[i] = np.mean(np.var(predictions[i], axis = 0))
Visualize the decrease in bias as the complexity level increases
# 3x3 grid of scatter plots: true Wins vs. average predictions, one panel
# per complexity level
# (the original's `nrows = test_y.shape[0]` was never read and is dropped)
plt.figure(figsize = (16, 9))
plt.subplots_adjust(hspace = 0.5)
for i in range(9):
    plt.subplot(3, 3, i+1)
    plt.scatter(test_y, averages[i+1])
    plt.xlabel("true values")
    plt.ylabel("average predictions")
    if i == 0:
        plt.title("SLR model (1 feature)")
    else:
        plt.title("MLR with %d features" %(i+1))
# plt.show() was missing here: when run as a script, the next plt.plot call
# would otherwise draw its line into the 9th subplot of this figure
plt.show()
Bias vs. model complexity
# bias as a function of the number of features
complexity_levels = list(biases.keys())
bias_values = list(biases.values())
plt.plot(complexity_levels, bias_values, "-o")
plt.xlabel("number of features")
plt.ylabel("bias in the predictions")
plt.show()
Variance vs. model complexity
# variance as a function of the number of features
complexity_levels = list(variances.keys())
variance_values = list(variances.values())
plt.plot(complexity_levels, variance_values, "-o", color = "orange")
plt.xlabel("number of features")
plt.ylabel("variance in the predictions")
plt.show()
Bias vs. Variance as model complexity increases
# overlay the bias and variance curves to show the tradeoff as model
# complexity grows (default color cycle: first line bias, second variance)
for curve in (biases, variances):
    plt.plot(list(curve.keys()), list(curve.values()), "-o")
plt.xlabel("number of features")
plt.ylabel("bias or variance")
plt.legend(["bias", "variance"])
plt.title("bias vs. variance tradeoff")
plt.show()
© Kaixin Wang, updated February 2020