This notebook explains how to use scikit-learn's univariate feature selection methods to select the top N features and the top P% features with the mutual information statistic.
This notebook works with an OpenML dataset of 10108 observations and 69 columns; the task is to predict who pays for internet access.
This tutorial uses:
import pandas as pd
from sklearn.datasets import fetch_openml
import category_encoders as ce
from sklearn.feature_selection import SelectKBest, SelectPercentile, mutual_info_classif
The data comes from OpenML and is loaded with fetch_openml from sklearn.datasets.
data = fetch_openml(name='kdd_internet_usage')
df = data.frame
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10108 entries, 0 to 10107
Data columns (total 69 columns):
 #   Column                                    Non-Null Count  Dtype
---  ------                                    --------------  -----
 0   Actual_Time                               10108 non-null  category
 1   Age                                       10108 non-null  category
 2   Community_Building                        10108 non-null  category
 3   Community_Membership_Family               10108 non-null  category
 4   Community_Membership_Hobbies              10108 non-null  category
 5   Community_Membership_None                 10108 non-null  category
 6   Community_Membership_Other                10108 non-null  category
 7   Community_Membership_Political            10108 non-null  category
 8   Community_Membership_Professional         10108 non-null  category
 9   Community_Membership_Religious            10108 non-null  category
 10  Community_Membership_Support              10108 non-null  category
 11  Country                                   10108 non-null  category
 12  Disability_Cognitive                      10108 non-null  category
 13  Disability_Hearing                        10108 non-null  category
 14  Disability_Motor                          10108 non-null  category
 15  Disability_Not_Impaired                   10108 non-null  category
 16  Disability_Not_Say                        10108 non-null  category
 17  Disability_Vision                         10108 non-null  category
 18  Education_Attainment                      10108 non-null  category
 19  Falsification_of_Information              10108 non-null  category
 20  Gender                                    10108 non-null  category
 21  Household_Income                          10108 non-null  category
 22  How_You_Heard_About_Survey_Banner         10108 non-null  category
 23  How_You_Heard_About_Survey_Friend         10108 non-null  category
 24  How_You_Heard_About_Survey_Mailing_List   10108 non-null  category
 25  How_You_Heard_About_Survey_Others         10108 non-null  category
 26  How_You_Heard_About_Survey_Printed_Media  10108 non-null  category
 27  How_You_Heard_About_Survey_Remebered      10108 non-null  category
 28  How_You_Heard_About_Survey_Search_Engine  10108 non-null  category
 29  How_You_Heard_About_Survey_Usenet_News    10108 non-null  category
 30  How_You_Heard_About_Survey_WWW_Page       10108 non-null  category
 31  Major_Geographical_Location               10108 non-null  category
 32  Major_Occupation                          10108 non-null  category
 33  Marital_Status                            10108 non-null  category
 34  Most_Import_Issue_Facing_the_Internet     10108 non-null  category
 35  Opinions_on_Censorship                    10108 non-null  category
 36  Primary_Computing_Platform                7409 non-null   category
 37  Primary_Language                          10108 non-null  category
 38  Primary_Place_of_WWW_Access               10108 non-null  category
 39  Race                                      10108 non-null  category
 40  Not_Purchasing_Bad_experience             10108 non-null  category
 41  Not_Purchasing_Bad_press                  10108 non-null  category
 42  Not_Purchasing_Cant_find                  10108 non-null  category
 43  Not_Purchasing_Company_policy             10108 non-null  category
 44  Not_Purchasing_Easier_locally             10108 non-null  category
 45  Not_Purchasing_Enough_info                10108 non-null  category
 46  Not_Purchasing_Judge_quality              10108 non-null  category
 47  Not_Purchasing_Never_tried                10108 non-null  category
 48  Not_Purchasing_No_credit                  10108 non-null  category
 49  Not_Purchasing_Not_applicable             10108 non-null  category
 50  Not_Purchasing_Not_option                 10108 non-null  category
 51  Not_Purchasing_Other                      10108 non-null  category
 52  Not_Purchasing_Prefer_people              10108 non-null  category
 53  Not_Purchasing_Privacy                    10108 non-null  category
 54  Not_Purchasing_Receipt                    10108 non-null  category
 55  Not_Purchasing_Security                   10108 non-null  category
 56  Not_Purchasing_Too_complicated            10108 non-null  category
 57  Not_Purchasing_Uncomfortable              10108 non-null  category
 58  Not_Purchasing_Unfamiliar_vendor          10108 non-null  category
 59  Registered_to_Vote                        10108 non-null  category
 60  Sexual_Preference                         10108 non-null  category
 61  Web_Ordering                              10108 non-null  category
 62  Web_Page_Creation                         10108 non-null  category
 63  Who_Pays_for_Access_Dont_Know             10108 non-null  category
 64  Who_Pays_for_Access_Other                 10108 non-null  category
 65  Who_Pays_for_Access_Parents               10108 non-null  category
 66  Who_Pays_for_Access_School                10108 non-null  category
 67  Who_Pays_for_Access_Self                  10108 non-null  category
 68  Who_Pays_for_Access_Work                  10108 non-null  category
dtypes: category(69)
memory usage: 715.7 KB
Split the data into target and features.
Drop the other Who_Pays_for_Access_* columns, since these alternative payment options would leak the target.
target = 'Who_Pays_for_Access_Work'
y = df[target]
X_cat = data.data.drop(columns=['Who_Pays_for_Access_Dont_Know',
'Who_Pays_for_Access_Other', 'Who_Pays_for_Access_Parents',
'Who_Pays_for_Access_School', 'Who_Pays_for_Access_Self'])
Encode the categorical variables prior to feature selection.
encoder = ce.LeaveOneOutEncoder(return_df=True)
X = encoder.fit_transform(X_cat, y)
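To make the encoding step less of a black box, here is a hand-rolled sketch of the general idea behind leave-one-out target encoding: each row's category is replaced by the mean of the target over the other rows in the same category. The toy frame and its column names are hypothetical, and this is only an illustration of the idea; the category_encoders implementation may differ in details such as how singleton or unseen categories are handled.

```python
import pandas as pd

# Toy data; the column names are hypothetical and not from the OpenML dataset.
toy = pd.DataFrame({
    "color": ["red", "red", "red", "blue", "blue"],
    "target": [1, 0, 1, 0, 1],
})

# Per-row sum and count of the target within each category.
grouped = toy.groupby("color")["target"]
category_sum = grouped.transform("sum")
category_count = grouped.transform("count")

# Mean of the target over the *other* rows that share the category.
# A category with a single row would divide by zero here; this sketch ignores that case.
toy["color_loo"] = (category_sum - toy["target"]) / (category_count - 1)
print(toy)
```

Because the row's own target value is excluded from its encoding, the encoded feature is less prone to leaking the label than a plain per-category mean.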
Top N features
Start with 63 features after dropping target leakage features.
X.shape
(10108, 63)
Select the top 20 features.
Note that mutual_info_classif is used because this is a classification problem. For a regression problem, use mutual_info_regression instead.
selector = SelectKBest(mutual_info_classif, k=20)
X_reduced = selector.fit_transform(X, y)
X_reduced.shape
(10108, 20)
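For the regression case mentioned in the note above, the same pattern applies with mutual_info_regression as the score function. The sketch below uses a synthetic dataset from make_regression as a stand-in (an assumption, since this notebook's data is a classification problem), and the variable names are illustrative.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, mutual_info_regression

# Synthetic regression data: 10 features, 3 of which drive the target.
X_reg, y_reg = make_regression(
    n_samples=300, n_features=10, n_informative=3, random_state=0
)

# Same selector as before, but scored with the regression variant.
selector_reg = SelectKBest(mutual_info_regression, k=3)
X_reg_reduced = selector_reg.fit_transform(X_reg, y_reg)
print(X_reg_reduced.shape)  # (300, 3)
```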
The selector's get_support method can be used to generate the list of features that were kept.
cols = selector.get_support(indices=True)
selected_columns = X.iloc[:,cols].columns.tolist()
selected_columns
['Community_Membership_Family', 'Community_Membership_None', 'Community_Membership_Political', 'Community_Membership_Religious', 'Community_Membership_Support', 'Disability_Cognitive', 'Disability_Hearing', 'Disability_Vision', 'How_You_Heard_About_Survey_Banner', 'How_You_Heard_About_Survey_Mailing_List', 'How_You_Heard_About_Survey_Printed_Media', 'How_You_Heard_About_Survey_Remebered', 'How_You_Heard_About_Survey_Search_Engine', 'How_You_Heard_About_Survey_Usenet_News', 'Race', 'Not_Purchasing_Bad_press', 'Not_Purchasing_Cant_find', 'Not_Purchasing_Enough_info', 'Not_Purchasing_Never_tried', 'Not_Purchasing_Prefer_people']
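Beyond the list of kept columns, it can help to see the scores that drove the selection. A fitted selector exposes them in its scores_ attribute. The sketch below, on synthetic data with hypothetical column names, ranks the features by score and uses the boolean mask returned by get_support() to recover the kept names; functools.partial is used here only to pass a fixed random_state to mutual_info_classif.

```python
from functools import partial

import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# Synthetic stand-in data with hypothetical column names.
X_arr, y_toy = make_classification(
    n_samples=500, n_features=8, n_informative=3, n_redundant=0, random_state=0
)
X_toy = pd.DataFrame(X_arr, columns=[f"feature_{i}" for i in range(8)])

# Fix the random_state so the mutual information scores are repeatable.
score_func = partial(mutual_info_classif, random_state=0)
toy_selector = SelectKBest(score_func, k=3).fit(X_toy, y_toy)

# Rank every feature by its mutual information score.
scores = pd.Series(toy_selector.scores_, index=X_toy.columns)
print(scores.sort_values(ascending=False))

# The boolean mask from get_support() picks out the kept column names.
kept = X_toy.columns[toy_selector.get_support()].tolist()
print(kept)
```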
Top P% features
Select the top 25% of features.
selector = SelectPercentile(mutual_info_classif, percentile=25)
X_reduced = selector.fit_transform(X, y)
X_reduced.shape
(10108, 15)
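As a small check on how the percentile maps to a column count, the sketch below runs SelectPercentile on a synthetic dataset with 20 features, where 25% corresponds to 5 columns. The data and variable names are assumptions for illustration, and random_state is fixed through functools.partial only to make the run repeatable.

```python
from functools import partial

from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectPercentile, mutual_info_classif

# Synthetic stand-in data: 20 features in total.
X_toy, y_toy = make_classification(
    n_samples=400, n_features=20, n_informative=5, n_redundant=0, random_state=0
)

# Keep the top 25% of features by mutual information score.
pct_selector = SelectPercentile(
    partial(mutual_info_classif, random_state=0), percentile=25
)
X_toy_reduced = pct_selector.fit_transform(X_toy, y_toy)
print(X_toy_reduced.shape)  # (400, 5)
```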
Again, use the get_support method to generate the list of features that were kept.
cols = selector.get_support(indices=True)
selected_columns = X.iloc[:,cols].columns.tolist()
selected_columns
['Community_Building', 'Community_Membership_Political', 'Community_Membership_Religious', 'Community_Membership_Support', 'Disability_Cognitive', 'Disability_Hearing', 'Disability_Motor', 'Disability_Vision', 'How_You_Heard_About_Survey_Banner', 'How_You_Heard_About_Survey_Printed_Media', 'Not_Purchasing_Bad_press', 'Not_Purchasing_Company_policy', 'Not_Purchasing_No_credit', 'Not_Purchasing_Prefer_people', 'Sexual_Preference']
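The top 25% list contains features, such as Community_Building, that do not appear in the top 20 list. One likely reason is that mutual_info_classif adds a small amount of random noise to continuous features, so its scores can change between runs unless random_state is set. The sketch below, on synthetic data with illustrative names, shows one way to fix the seed with functools.partial so that repeated fits give the same scores and the same selection.

```python
from functools import partial

import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# Synthetic stand-in data.
X_toy, y_toy = make_classification(
    n_samples=400, n_features=12, n_informative=4, n_redundant=0, random_state=0
)

# Bind a fixed seed to the score function; the seed value itself is arbitrary.
seeded_mi = partial(mutual_info_classif, random_state=42)

first = SelectKBest(seeded_mi, k=4).fit(X_toy, y_toy)
second = SelectKBest(seeded_mi, k=4).fit(X_toy, y_toy)

# With the same seed, both the scores and the selected features match.
print(np.array_equal(first.scores_, second.scores_))
print(np.array_equal(first.get_support(), second.get_support()))
```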