Notebook

K-Fold Cross Validation¶

Testing accuracy for just once doesn't account for the variance in the data and might give misleading results. K-Fold validation randomly selects one of $k$ parts of the data set then tests the accuracy on the same. After required number of iterations, the accuracy is averaged

In [1]:

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

df = pd.read_csv('Social_Network_Ads.csv')
X = df.iloc[:, 2:4]   # Using 1:2 as indices will give us np array of dim (10, 1)
y = df.iloc[:, 4]

df.head()

Out[1]:

	User ID	Gender	Age	EstimatedSalary
0	15624510	Male	19	19000
1	15810944	Male	35	20000
2	15668575	Female	26	43000
3	15603246	Female	27	57000
4	15804002	Male	19	76000

In [2]:

# Scale
from sklearn.preprocessing import StandardScaler
X_sca = StandardScaler()
X = X_sca.fit_transform(X)

In [5]:

from __future__ import division
from sklearn.model_selection import KFold
from sklearn.metrics import accuracy_score
from sklearn.svm import SVC


kfold_cv = KFold(n_splits=10)
correct = 0
total = 0
for train_indices, test_indices in kfold_cv.split(X):
    X_train, X_test, y_train, y_test = X[train_indices], X[test_indices], \
                                        y[train_indices], y[test_indices]
    clf = SVC(kernel='linear', random_state=0).fit(X_train, y_train)
    correct += accuracy_score(y_test, clf.predict(X_test))
    total += 1
print("Accuracy: {0:.2f}".format(correct/total))
    

Accuracy: 0.82

In [6]:

from sklearn.svm import SVC #support vector classifier
clf = SVC(kernel='linear', random_state=0).fit(X_train, y_train)

In [10]:

# applying k-fold cross validation
from sklearn.model_selection import cross_val_score
accuracies = cross_val_score(clf, X_train, y_train, cv=10)
print accuracies
print accuracies.mean()
print accuracies.std()

[ 0.90322581  0.90322581  0.77419355  0.87096774  0.77419355  0.86206897
  0.82758621  0.68965517  0.79310345  0.89655172]
0.829477196885
0.0671935884472

Leave one out cross validation¶

Another type of cross validation is leave one out cross validation. Out of the $n$ samples, one of them is left out and the model is trained on other samples. When K in KFold validation is equal to the number of samples then K-Fold validation is same as leave one out cross validation

In [1]:

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

df = pd.read_csv('Social_Network_Ads.csv')
X = df.iloc[:, 2:4]   # Using 1:2 as indices will give us np array of dim (10, 1)
y = df.iloc[:, 4]

df.head()

Out[1]:

	User ID	Gender	Age	EstimatedSalary
0	15624510	Male	19	19000
1	15810944	Male	35	20000
2	15668575	Female	26	43000
3	15603246	Female	27	57000
4	15804002	Male	19	76000

In [2]:

# Scale
from sklearn.preprocessing import StandardScaler
X_sca = StandardScaler()
X = X_sca.fit_transform(X)

In [7]:

from __future__ import division
from sklearn.model_selection import LeaveOneOut
from sklearn.metrics import accuracy_score
from sklearn.svm import SVC

loo_cv = LeaveOneOut()
correct = 0
total = 0
for train_indices, test_indices in loo_cv.split(X):
#     uncomment these lines to print splits
#     print("Train Indices: {}...".format(train_indices[:4]))
#     print("Test Indices: {}...".format(test_indices[:4]))
#     print("Training SVC model using this configuration")
    X_train, X_test, y_train, y_test = X[train_indices], X[test_indices], \
                                        y[train_indices], y[test_indices]
    clf = SVC(kernel='linear', random_state=0).fit(X_train, y_train)
    correct += accuracy_score(y_test, clf.predict(X_test))
    total += 1
print("Accuracy: {0:.2f}".format(correct/total))

Accuracy: 0.84

Stratified KFold¶

Kfold validation does not preserve the split of the output variable while splitting the data in k-folds. Imagine training a Naive Bayes classifier using KFold validation using 10 samples where 5 are positive and 5 are negative. Since KFold randomly selects the split imagine splitting it in an unfortunate way -- 1 split contains all positive samples and 1 contains all negative. Naive Bayes classifier will calculate the prior probabilities and find it to be 100% i.e. the model will think the output is always positive which is obviously wrong. To tackle this scenario we use Stratified split, what it would essentially do is preserve the split in the original dataset in training set, that is, if the original dataset has 50% positive and 50% negative outputs then the training set will also have 50% positive and 50% negative outputs.

In [6]:

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

df = pd.read_csv('Social_Network_Ads.csv')
X = df.iloc[:, 2:4]   # Using 1:2 as indices will give us np array of dim (10, 1)
y = df.iloc[:, 4]

df.head()

Out[6]:

	User ID	Gender	Age	EstimatedSalary
0	15624510	Male	19	19000
1	15810944	Male	35	20000
2	15668575	Female	26	43000
3	15603246	Female	27	57000
4	15804002	Male	19	76000

In [7]:

# Scale
from sklearn.preprocessing import StandardScaler
X_sca = StandardScaler()
X = X_sca.fit_transform(X)

from __future__ import division from sklearn.model_selection import StratifiedKFold from sklearn.metrics import accuracy_score from sklearn.svm import SVC strat_cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0) correct = 0 total = 0 for train_indices, test_indices in strat_cv.split(X, y): # uncomment these lines to print splits # print("Train Indices: {}...".format(train_indices[:4])) # print("Test Indices: {}...".format(test_indices[:4])) # print("Training SVC model using this configuration") X_train, X_test, y_train, y_test = X[train_indices], X[test_indices], \ y[train_indices], y[test_indices] clf = SVC(kernel='linear', random_state=0).fit(X_train, y_train) correct += accuracy_score(y_test, clf.predict(X_test)) total += 1 print("Accuracy: {0:.2f}".format(correct/total))

Validating Time Series data¶

Time series data is data associated with a time frame, for instance stock prices. The motivation is to predict stock price for future given the data from previous data. If we were to use any splitting techniques from above we would end up predicting past from future (due to random nature from splitting) which shouldn't be permitted, we should always predict future from past. This can be achieved using TimeSeriesSplit

In [23]:

from sklearn.model_selection import TimeSeriesSplit
import numpy as np

X = np.random.rand(10, 2)
y = np.random.rand(10)
print(X)
print(y)

[[ 0.08485204  0.84689345]
 [ 0.02834187  0.68234029]
 [ 0.36309891  0.07100943]
 [ 0.66955444  0.88070583]
 [ 0.28241451  0.56733126]
 [ 0.30521588  0.73973179]
 [ 0.0566575   0.96430919]
 [ 0.53957399  0.05946202]
 [ 0.11530205  0.16625273]
 [ 0.89429006  0.83914383]]
[ 0.97006781  0.81953045  0.50522986  0.88384404  0.30715333  0.9750431
  0.68943093  0.74947717  0.93600522  0.33118984]

In [24]:

tss = TimeSeriesSplit(n_splits=7)

for train_indices, test_indices in tss.split(X):
    print("Train indices: {0} Test indices: {1}".format(train_indices, test_indices))

Train indices: [0 1 2] Test indices: [3]
Train indices: [0 1 2 3] Test indices: [4]
Train indices: [0 1 2 3 4] Test indices: [5]
Train indices: [0 1 2 3 4 5] Test indices: [6]
Train indices: [0 1 2 3 4 5 6] Test indices: [7]
Train indices: [0 1 2 3 4 5 6 7] Test indices: [8]
Train indices: [0 1 2 3 4 5 6 7 8] Test indices: [9]