This notebook explains how to generate K-folds for cross-validation with scikit-learn,
so that machine learning models can be evaluated on out-of-sample data.
It works with an OpenML dataset of 10108 observations and 69 columns, and the prediction task is who pays for a respondent's internet access.
This tutorial uses:
from sklearn.datasets import fetch_openml
import pandas as pd
from sklearn.model_selection import KFold
The data comes from OpenML and is loaded with fetch_openml from the Python package sklearn.datasets.
data = fetch_openml(name='kdd_internet_usage', as_frame=True)
df = data.frame
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10108 entries, 0 to 10107
Data columns (total 69 columns):
 #  Column                                    Non-Null Count  Dtype
--  ------                                    --------------  -----
 0  Actual_Time                               10108 non-null  category
 1  Age                                       10108 non-null  category
 2  Community_Building                        10108 non-null  category
 3  Community_Membership_Family               10108 non-null  category
 4  Community_Membership_Hobbies              10108 non-null  category
 5  Community_Membership_None                 10108 non-null  category
 6  Community_Membership_Other                10108 non-null  category
 7  Community_Membership_Political            10108 non-null  category
 8  Community_Membership_Professional         10108 non-null  category
 9  Community_Membership_Religious            10108 non-null  category
10  Community_Membership_Support              10108 non-null  category
11  Country                                   10108 non-null  category
12  Disability_Cognitive                      10108 non-null  category
13  Disability_Hearing                        10108 non-null  category
14  Disability_Motor                          10108 non-null  category
15  Disability_Not_Impaired                   10108 non-null  category
16  Disability_Not_Say                        10108 non-null  category
17  Disability_Vision                         10108 non-null  category
18  Education_Attainment                      10108 non-null  category
19  Falsification_of_Information              10108 non-null  category
20  Gender                                    10108 non-null  category
21  Household_Income                          10108 non-null  category
22  How_You_Heard_About_Survey_Banner         10108 non-null  category
23  How_You_Heard_About_Survey_Friend         10108 non-null  category
24  How_You_Heard_About_Survey_Mailing_List   10108 non-null  category
25  How_You_Heard_About_Survey_Others         10108 non-null  category
26  How_You_Heard_About_Survey_Printed_Media  10108 non-null  category
27  How_You_Heard_About_Survey_Remebered      10108 non-null  category
28  How_You_Heard_About_Survey_Search_Engine  10108 non-null  category
29  How_You_Heard_About_Survey_Usenet_News    10108 non-null  category
30  How_You_Heard_About_Survey_WWW_Page       10108 non-null  category
31  Major_Geographical_Location               10108 non-null  category
32  Major_Occupation                          10108 non-null  category
33  Marital_Status                            10108 non-null  category
34  Most_Import_Issue_Facing_the_Internet     10108 non-null  category
35  Opinions_on_Censorship                    10108 non-null  category
36  Primary_Computing_Platform                 7409 non-null  category
37  Primary_Language                          10108 non-null  category
38  Primary_Place_of_WWW_Access               10108 non-null  category
39  Race                                      10108 non-null  category
40  Not_Purchasing_Bad_experience             10108 non-null  category
41  Not_Purchasing_Bad_press                  10108 non-null  category
42  Not_Purchasing_Cant_find                  10108 non-null  category
43  Not_Purchasing_Company_policy             10108 non-null  category
44  Not_Purchasing_Easier_locally             10108 non-null  category
45  Not_Purchasing_Enough_info                10108 non-null  category
46  Not_Purchasing_Judge_quality              10108 non-null  category
47  Not_Purchasing_Never_tried                10108 non-null  category
48  Not_Purchasing_No_credit                  10108 non-null  category
49  Not_Purchasing_Not_applicable             10108 non-null  category
50  Not_Purchasing_Not_option                 10108 non-null  category
51  Not_Purchasing_Other                      10108 non-null  category
52  Not_Purchasing_Prefer_people              10108 non-null  category
53  Not_Purchasing_Privacy                    10108 non-null  category
54  Not_Purchasing_Receipt                    10108 non-null  category
55  Not_Purchasing_Security                   10108 non-null  category
56  Not_Purchasing_Too_complicated            10108 non-null  category
57  Not_Purchasing_Uncomfortable              10108 non-null  category
58  Not_Purchasing_Unfamiliar_vendor          10108 non-null  category
59  Registered_to_Vote                        10108 non-null  category
60  Sexual_Preference                         10108 non-null  category
61  Web_Ordering                              10108 non-null  category
62  Web_Page_Creation                         10108 non-null  category
63  Who_Pays_for_Access_Dont_Know             10108 non-null  category
64  Who_Pays_for_Access_Other                 10108 non-null  category
65  Who_Pays_for_Access_Parents               10108 non-null  category
66  Who_Pays_for_Access_School                10108 non-null  category
67  Who_Pays_for_Access_Self                  10108 non-null  category
68  Who_Pays_for_Access_Work                  10108 non-null  category
dtypes: category(69)
memory usage: 715.7 KB
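Split the data into a target and features.
The other Who_Pays_for_Access_* columns record alternative answers to the same question, so they leak information about the target. Drop them from the features.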
target = 'Who_Pays_for_Access_Work'
y = df[target]
X = data.data.drop(columns=[
    'Who_Pays_for_Access_Dont_Know',
    'Who_Pays_for_Access_Other',
    'Who_Pays_for_Access_Parents',
    'Who_Pays_for_Access_School',
    'Who_Pays_for_Access_Self',
])
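As a rough illustration of the same target/feature split, the sketch below selects the leakage columns by their shared name prefix instead of listing them by hand. The toy DataFrame, its values, and the names `toy`, `prefix`, `related_cols`, `X_toy`, and `y_toy` are made up for this example; only the column names mirror the real dataset.

```python
import pandas as pd

# Toy stand-in for the survey data (values are invented for illustration).
toy = pd.DataFrame({
    "Age": ["21", "35", "48", "29"],
    "Who_Pays_for_Access_Self": ["1", "0", "0", "1"],
    "Who_Pays_for_Access_School": ["0", "0", "1", "0"],
    "Who_Pays_for_Access_Work": ["0", "1", "0", "0"],
})

target = "Who_Pays_for_Access_Work"
prefix = "Who_Pays_for_Access_"  # assumed shared prefix of target and leakage columns

# Every column that answers the "who pays" question, including the target itself.
related_cols = [c for c in toy.columns if c.startswith(prefix)]

y_toy = toy[target]
X_toy = toy.drop(columns=related_cols)

# No "who pays" column should remain among the features.
print(list(X_toy.columns))
```

Selecting by prefix avoids missing a column when the list is typed out manually, but it assumes that every column with that prefix really is related to the target.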
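Scikit-learn's KFold splits the data into k folds (n_splits, default of 5) that can be used for cross-validation during machine learning training. By default the folds are consecutive blocks of rows. With shuffle=True the rows are shuffled before splitting, and random_state makes that shuffle reproducible. The example here uses 10 shuffled folds.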
kf = KFold(n_splits=10, random_state=1066, shuffle=True)
for train_index, test_index in kf.split(X):
    print("Train:", train_index, "Test:", test_index)
    # kf.split yields positional indices, so select rows with .iloc
    X_train = X.iloc[train_index, :]
    y_train = y.iloc[train_index]
    X_test = X.iloc[test_index, :]
    y_test = y.iloc[test_index]
Train: [ 0 1 2 ... 10105 10106 10107] Test: [ 9 52 80 ... 10092 10102 10103]
Train: [ 0 1 2 ... 10105 10106 10107] Test: [ 16 20 21 ... 10069 10079 10101]
Train: [ 0 1 2 ... 10105 10106 10107] Test: [ 4 12 22 ... 10066 10074 10076]
Train: [ 0 1 2 ... 10105 10106 10107] Test: [ 13 25 34 ... 10073 10075 10100]
Train: [ 0 1 2 ... 10105 10106 10107] Test: [ 3 6 7 ... 10093 10095 10104]
Train: [ 1 3 4 ... 10104 10105 10107] Test: [ 0 2 18 ... 10045 10096 10106]
Train: [ 0 1 2 ... 10105 10106 10107] Test: [ 8 11 14 ... 10067 10084 10086]
Train: [ 0 1 2 ... 10105 10106 10107] Test: [ 10 30 31 ... 10083 10085 10098]
Train: [ 0 2 3 ... 10103 10104 10106] Test: [ 1 5 19 ... 10097 10105 10107]
Train: [ 0 1 2 ... 10105 10106 10107] Test: [ 15 32 39 ... 10081 10094 10099]
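The loop above only builds the train and test subsets. A minimal sketch of how those folds might feed an out-of-sample evaluation follows. It is not part of the original notebook: the synthetic data from make_classification and the LogisticRegression model are assumptions chosen so the example runs without downloading anything, and the names `X_demo`, `y_demo`, `scores`, and `test_counts` are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

# Synthetic stand-in data (assumption: replaces the OpenML survey data).
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)

kf = KFold(n_splits=10, shuffle=True, random_state=1066)

scores = []                                      # one accuracy value per fold
test_counts = np.zeros(len(X_demo), dtype=int)   # times each row lands in a test fold

for train_index, test_index in kf.split(X_demo):
    # Fit on the training rows only, then score on the held-out rows.
    model = LogisticRegression(max_iter=1000)
    model.fit(X_demo[train_index], y_demo[train_index])
    scores.append(model.score(X_demo[test_index], y_demo[test_index]))
    test_counts[test_index] += 1

print("Per-fold accuracy:", np.round(scores, 3))
print("Mean accuracy:", np.mean(scores))
# Each row should be held out exactly once across the 10 folds.
print("Every row tested once:", bool((test_counts == 1).all()))
```

Averaging the per-fold scores gives a single out-of-sample estimate. For the survey data, the categorical features would first need encoding, which this sketch leaves out.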