"Feature engineering and feature selection are among the most important elements of data analysis and machine learning."
However good your data is, it always contains some useful features that help you solve the problem and some noisy features that are useless for your prediction model. Noisy features are dangerous because they can lead to overfitting; conversely, removing them from the dataset can improve the quality of your model on hold-out data.
If a dataset has hundreds or thousands of features, fitting an estimator, cross-validation, or other computations can take a lot of time. Of course, we can use PCA to reduce the dimensionality of the data, but sometimes that is not an option for the business task at hand: it is impossible to explain to the business how changing a "new PCA feature" would help them reach their goals. This is called the "interpretation problem", and feature selection is useful in such cases.
OK, we have a good dataset. But we added many custom features to beat a Kaggle baseline, and now we are looking for a way to tell which features improve the quality metric and which do not. Hmm... We could use L1 regularization, which moves some weights towards 0. But what if we could fit the estimator with different subsets of features, adding a feature if the quality metric increases and removing one if the metric decreases, with a solution that takes only a couple of lines of code?
Mlxtend SequentialFeatureSelector is what we need!
**Every time we try to select the best features :)**
Well, the informal intro is coming to an end. It's time to go through some formal theory.
Please install these libraries if you don't have them on your system:
#!pip install pandas
#!pip install mlxtend
#!pip install scikit-learn
#!pip install matplotlib
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, KFold, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.feature_selection import RFE
from mlxtend.feature_selection import SequentialFeatureSelector
from mlxtend.plotting import plot_sequential_feature_selection as plot_sfs
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
Mlxtend's SequentialFeatureSelector is a greedy search algorithm used to reduce an initial d-dimensional feature space to a k-dimensional feature subspace, where k < d.
There are 4 different flavors of SFAs available via the SequentialFeatureSelector:
In the "forward" algorithm, we start with no features in our subset and, on each iteration, add the one feature that maximizes the quality metric. Conversely, the "backward" algorithm starts with the full set of features and, on each iteration, removes the one feature whose removal maximizes the quality of the model.
The floating variants, SFFS and SBFS, can be considered extensions of the simpler SFS and SBS algorithms. The floating algorithms have an additional exclusion or inclusion step, so features can be removed once they have been included (or added back once they have been excluded), which lets a larger number of feature subset combinations be sampled.
Let's take a look at each of them.
**Sequential Forward Selection (SFS)**

Input: the set of all features Y = {y1, y2, ..., yd}
Output: Xk = {xj | j = 1, 2, ..., k; xj ∈ Y}, where k = (0, 1, 2, ..., d)
Initialization: X0 = ∅, k = 0
Step 1 (inclusion): x+ = arg max J(Xk + x), where x ∈ Y − Xk
Xk+1 = Xk + x+
k = k + 1
Go to Step 1
Termination: k = p
We add features to the feature subset Xk until the subset of size k contains the number of desired features p that we specified a priori.
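The SFS loop above can be sketched in a few lines of plain Python. This is an illustrative toy, not mlxtend's actual implementation; the criterion function J here is a made-up example that rewards subsets overlapping {"a", "c"}, just to show the greedy inclusion mechanics.

```python
def sfs(features, J, p):
    """Greedy forward selection: grow a subset until it contains p features."""
    selected = []                # X_0 = empty set
    remaining = list(features)   # Y - X_k
    while len(selected) < p:     # termination criterion: k = p
        # inclusion step: pick x+ that maximizes J(X_k + x)
        best = max(remaining, key=lambda x: J(selected + [x]))
        selected.append(best)
        remaining.remove(best)
    return selected

# toy criterion: subsets overlapping {"a", "c"} score higher,
# with a tiny penalty per feature to break ties toward smaller subsets
def J(subset):
    return len(set(subset) & {"a", "c"}) - 0.01 * len(subset)

print(sfs(["a", "b", "c", "d"], J, 2))  # → ['a', 'c']
```

In the real SequentialFeatureSelector, J is the cross-validated score of the estimator on the candidate subset.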
**Sequential Backward Selection (SBS)**

Input: the set of all features Y = {y1, y2, ..., yd}
Output: Xk = {xj | j = 1, 2, ..., k; xj ∈ Y}, where k = (0, 1, 2, ..., d)
Initialization: X0 = Y, k = d
Step 1 (exclusion): x− = arg max J(Xk − x), where x ∈ Xk
Xk−1 = Xk − x−
k = k − 1
Go to Step 1
Termination: k = p
We remove features from the feature subset Xk until the subset of size k contains the number of desired features p that we specified a priori.
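The backward variant can be sketched the same way (again a toy, not mlxtend's code; J is a hypothetical criterion that rewards subsets overlapping {"a", "c"}):

```python
def sbs(features, J, p):
    """Greedy backward selection: shrink the full set until p features remain."""
    selected = list(features)    # X_0 = Y
    while len(selected) > p:     # termination criterion: k = p
        # exclusion step: remove x- whose removal maximizes J(X_k - x)
        worst = max(selected, key=lambda x: J([f for f in selected if f != x]))
        selected.remove(worst)
    return selected

# toy criterion: subsets overlapping {"a", "c"} score higher
def J(subset):
    return len(set(subset) & {"a", "c"}) - 0.01 * len(subset)

print(sbs(["a", "b", "c", "d"], J, 2))  # → ['a', 'c']
```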
**Sequential Forward Floating Selection (SFFS)**

Input: the set of all features Y = {y1, y2, ..., yd}
Output: Xk = {xj | j = 1, 2, ..., k; xj ∈ Y}, where k = (0, 1, 2, ..., d)
Initialization: X0 = ∅, k = 0
Step 1 (inclusion): x+ = arg max J(Xk + x), where x ∈ Y − Xk
Xk+1 = Xk + x+
k = k + 1
In step 1, we include the feature from the feature space that leads to the best performance increase for our feature subset (assessed by the criterion function). Then we go to step 2.
Step 2 (conditional exclusion): x− = arg max J(Xk − x), where x ∈ Xk
if J(Xk − x−) > J(Xk):
Xk−1 = Xk − x−
k = k − 1
In step 2, we remove a feature only if the resulting subset gains an increase in performance. If k = 2 or an improvement cannot be made (i.e., no such feature x− can be found), go back to step 1; otherwise, repeat this step.
Steps 1 and 2 are repeated until the termination criterion is reached.
Termination: k = p
We add features to the feature subset Xk until the subset of size k contains the number of desired features p that we specified a priori.
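The two-step floating loop can also be sketched in plain Python. This is an illustrative toy, not mlxtend's implementation: to keep the sketch from cycling, step 2 never excludes the feature that step 1 just added, and J is a hypothetical criterion rewarding subsets that overlap {"a", "c"}.

```python
def sffs(features, J, p):
    """Forward selection with a conditional (floating) exclusion step."""
    selected, remaining = [], list(features)
    while len(selected) < p:
        # Step 1 (inclusion): add the feature that maximizes J
        best = max(remaining, key=lambda x: J(selected + [x]))
        selected.append(best)
        remaining.remove(best)
        # Step 2 (conditional exclusion): drop a previously added feature
        # while doing so improves J; never drop the feature just added
        # (a simple guard against cycling in this toy version)
        while len(selected) > 2:
            candidates = [x for x in selected if x != best]
            worst = max(candidates, key=lambda x: J([f for f in selected if f != x]))
            trial = [f for f in selected if f != worst]
            if J(trial) > J(selected):
                selected = trial
                remaining.append(worst)
            else:
                break
    return selected

# toy criterion: subsets overlapping {"a", "c"} score higher
def J(subset):
    return len(set(subset) & {"a", "c"}) - 0.01 * len(subset)

print(sffs(["a", "b", "c", "d"], J, 3))  # → ['a', 'c', 'b']
```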
**Sequential Backward Floating Selection (SBFS)**

Input: the set of all features Y = {y1, y2, ..., yd}
Output: Xk = {xj | j = 1, 2, ..., k; xj ∈ Y}, where k = (0, 1, 2, ..., d)
Initialization: X0 = Y, k = d
Step 1 (exclusion): x− = arg max J(Xk − x), where x ∈ Xk
Xk−1 = Xk − x−
k = k − 1
In step 1, we exclude the feature whose removal leads to the best performance of the remaining subset. Then we go to step 2.
Step 2 (conditional inclusion): x+ = arg max J(Xk + x), where x ∈ Y − Xk
if J(Xk + x+) > J(Xk):
Xk+1 = Xk + x+
k = k + 1
In step 2, we search for features that improve the classifier performance if they are added back to the feature subset. If such features exist, we add the feature x+ for which the performance improvement is maximized. If k = 2 or an improvement cannot be made (i.e., no such feature x+ can be found), go back to step 1; otherwise, repeat this step.
Termination: k = p
We remove features from the feature subset Xk until the subset of size k contains the number of desired features p that we specified a priori.
Let's take a look at the documentation and parameters of the SFS object.
SequentialFeatureSelector(estimator, k_features=1, forward=True, floating=False, verbose=0, scoring=None, cv=5, n_jobs=1, pre_dispatch='2*n_jobs', clone_estimator=True)
Sequential Feature Selection for Classification and Regression.
Parameters
estimator : scikit-learn classifier or regressor
k_features : int or tuple or str (default: 1)
Number of features to select, where k_features < the full feature set. A tuple containing a min and max value can be provided, and the SFS will return any feature combination between min and max that scored highest in cross-validation. For example, the tuple (1, 4) will return any combination of 1 up to 4 features instead of a fixed number of features k. A string argument "best" or "parsimonious" can also be passed. If "best" is provided, the feature selector will return the feature subset with the best cross-validation performance. If "parsimonious" is provided, the smallest feature subset that is within one standard error of the best cross-validation performance will be selected.
forward : bool (default: True)
Forward selection if True, backward selection otherwise
floating : bool (default: False)
Adds a conditional exclusion/inclusion if True.
verbose : int (default: 0)
Level of verbosity to use in logging. If 0, no output; if 1, the number of features in the current set; if 2, detailed logging including timestamp and CV scores at each step.
scoring : str, callable, or None (default: None)
If None (default), uses 'accuracy' for sklearn classifiers and 'r2' for sklearn regressors. If str, uses a sklearn scoring metric string identifier, for example {accuracy, f1, precision, recall, roc_auc} for classifiers, {'mean_absolute_error', 'mean_squared_error'/'neg_mean_squared_error', 'median_absolute_error', 'r2'} for regressors.
cv : int (default: 5)
Integer or iterable yielding train, test splits. If cv is an integer and estimator is a classifier (or y consists of integer class labels) stratified k-fold. Otherwise regular k-fold cross-validation is performed. No cross-validation if cv is None, False, or 0.
n_jobs : int (default: 1)
The number of CPUs to use for evaluating different feature subsets in parallel. -1 means 'all CPUs'.
pre_dispatch : int, or string (default: '2*n_jobs')
Controls the number of jobs that get dispatched during parallel execution if n_jobs > 1 or n_jobs=-1. Reducing this number can be useful to avoid an explosion of memory consumption when more jobs get dispatched than CPUs can process. This parameter can be: None, in which case all the jobs are immediately created and spawned. Use this for lightweight and fast-running jobs, to avoid delays due to on-demand spawning of the jobs An int, giving the exact number of total jobs that are spawned A string, giving an expression as a function of n_jobs, as in 2*n_jobs
clone_estimator : bool (default: True)
Clones estimator if True; works with the original estimator instance if False. Set to False if the estimator doesn't implement scikit-learn's set_params and get_params methods. In addition, it is required to set cv=0, and n_jobs=1.
In this article we will use the toy sklearn dataset "breast_cancer" (a binary classification task). Let's load the dataset and take a look at the data.
from sklearn.datasets import load_breast_cancer
data = load_breast_cancer()
df, y = pd.DataFrame(data=data.data, columns = data.feature_names), data.target
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 569 entries, 0 to 568
Data columns (total 30 columns):
mean radius                569 non-null float64
mean texture               569 non-null float64
mean perimeter             569 non-null float64
mean area                  569 non-null float64
mean smoothness            569 non-null float64
mean compactness           569 non-null float64
mean concavity             569 non-null float64
mean concave points        569 non-null float64
mean symmetry              569 non-null float64
mean fractal dimension     569 non-null float64
radius error               569 non-null float64
texture error              569 non-null float64
perimeter error            569 non-null float64
area error                 569 non-null float64
smoothness error           569 non-null float64
compactness error          569 non-null float64
concavity error            569 non-null float64
concave points error       569 non-null float64
symmetry error             569 non-null float64
fractal dimension error    569 non-null float64
worst radius               569 non-null float64
worst texture              569 non-null float64
worst perimeter            569 non-null float64
worst area                 569 non-null float64
worst smoothness           569 non-null float64
worst compactness          569 non-null float64
worst concavity            569 non-null float64
worst concave points       569 non-null float64
worst symmetry             569 non-null float64
worst fractal dimension    569 non-null float64
dtypes: float64(30)
memory usage: 133.4 KB
columns = df.columns
df.head()
mean radius | mean texture | mean perimeter | mean area | mean smoothness | mean compactness | mean concavity | mean concave points | mean symmetry | mean fractal dimension | ... | worst radius | worst texture | worst perimeter | worst area | worst smoothness | worst compactness | worst concavity | worst concave points | worst symmetry | worst fractal dimension | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 17.99 | 10.38 | 122.80 | 1001.0 | 0.11840 | 0.27760 | 0.3001 | 0.14710 | 0.2419 | 0.07871 | ... | 25.38 | 17.33 | 184.60 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.11890 |
1 | 20.57 | 17.77 | 132.90 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | 0.1812 | 0.05667 | ... | 24.99 | 23.41 | 158.80 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.1860 | 0.2750 | 0.08902 |
2 | 19.69 | 21.25 | 130.00 | 1203.0 | 0.10960 | 0.15990 | 0.1974 | 0.12790 | 0.2069 | 0.05999 | ... | 23.57 | 25.53 | 152.50 | 1709.0 | 0.1444 | 0.4245 | 0.4504 | 0.2430 | 0.3613 | 0.08758 |
3 | 11.42 | 20.38 | 77.58 | 386.1 | 0.14250 | 0.28390 | 0.2414 | 0.10520 | 0.2597 | 0.09744 | ... | 14.91 | 26.50 | 98.87 | 567.7 | 0.2098 | 0.8663 | 0.6869 | 0.2575 | 0.6638 | 0.17300 |
4 | 20.29 | 14.34 | 135.10 | 1297.0 | 0.10030 | 0.13280 | 0.1980 | 0.10430 | 0.1809 | 0.05883 | ... | 22.54 | 16.67 | 152.20 | 1575.0 | 0.1374 | 0.2050 | 0.4000 | 0.1625 | 0.2364 | 0.07678 |
5 rows × 30 columns
sum(y) / len(y)
0.62741652021089633
There are 30 non-null float features and 569 examples. 62.7% of the examples have class 1 and 37.3% have class 0. The classes are not heavily skewed, so the accuracy metric is suitable for us.
We will use LogisticRegression as our base algorithm. First, scale the data.
scaler = StandardScaler()
df_scaled = scaler.fit_transform(df)
Then initialize a cross-validation object with 5 folds and a fixed random_state:
cv = KFold(n_splits=5, shuffle=True, random_state=17)
Check the CV results without parameter tuning:
logit = LogisticRegression()
cross_val_score(logit, df_scaled, y, cv = cv).mean()
0.98417947523676441
The CV result is 0.984. It will be our baseline. Let's try to beat it with feature selection.
We will use Sequential Backward Selection, so set the forward and floating parameters to False.
Setting the k_features parameter to the tuple (1, 30) tells the selector to consider every subset size from 1 to 30 and have fit_transform return the subset with the best CV score.
logit = LogisticRegression()
sbs = SequentialFeatureSelector(logit,
                                k_features=(1, 30),
                                forward=False,
                                floating=False,
                                verbose=2,
                                scoring='accuracy',
                                cv=cv)
The log shows the CV score on each iteration. The best quality is achieved with subsets of 15 features and of 17 to 24 features.
X_sbs = sbs.fit_transform(df_scaled, y, custom_feature_names=columns)
[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s [Parallel(n_jobs=1)]: Done 30 out of 30 | elapsed: 1.3s finished [2018-12-12 00:08:17] Features: 29/1 -- score: 0.985933861202[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s [Parallel(n_jobs=1)]: Done 29 out of 29 | elapsed: 1.2s finished [2018-12-12 00:08:19] Features: 28/1 -- score: 0.985933861202[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s [Parallel(n_jobs=1)]: Done 28 out of 28 | elapsed: 1.1s finished [2018-12-12 00:08:20] Features: 27/1 -- score: 0.985933861202[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s [Parallel(n_jobs=1)]: Done 27 out of 27 | elapsed: 1.1s finished [2018-12-12 00:08:21] Features: 26/1 -- score: 0.985933861202[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s [Parallel(n_jobs=1)]: Done 26 out of 26 | elapsed: 1.0s finished [2018-12-12 00:08:22] Features: 25/1 -- score: 0.985933861202[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s [Parallel(n_jobs=1)]: Done 25 out of 25 | elapsed: 0.9s finished [2018-12-12 00:08:23] Features: 24/1 -- score: 0.987703772706[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s [Parallel(n_jobs=1)]: Done 24 out of 24 | elapsed: 0.8s finished [2018-12-12 00:08:24] Features: 23/1 -- score: 0.987703772706[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s [Parallel(n_jobs=1)]: Done 23 out of 23 | elapsed: 0.8s finished [2018-12-12 00:08:25] Features: 22/1 -- score: 0.987703772706[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s [Parallel(n_jobs=1)]: Done 22 out of 22 | elapsed: 0.7s finished [2018-12-12 00:08:26] Features: 21/1 -- score: 0.987703772706[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s [Parallel(n_jobs=1)]: Done 21 out of 21 | elapsed: 0.7s finished [2018-12-12 00:08:26] Features: 20/1 -- score: 0.987703772706[Parallel(n_jobs=1)]: Done 1 out of 1 | 
elapsed: 0.0s remaining: 0.0s [Parallel(n_jobs=1)]: Done 20 out of 20 | elapsed: 0.6s finished [2018-12-12 00:08:27] Features: 19/1 -- score: 0.987703772706[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s [Parallel(n_jobs=1)]: Done 19 out of 19 | elapsed: 0.5s finished [2018-12-12 00:08:28] Features: 18/1 -- score: 0.987703772706[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s [Parallel(n_jobs=1)]: Done 18 out of 18 | elapsed: 0.5s finished [2018-12-12 00:08:28] Features: 17/1 -- score: 0.987703772706[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s [Parallel(n_jobs=1)]: Done 17 out of 17 | elapsed: 0.5s finished [2018-12-12 00:08:29] Features: 16/1 -- score: 0.985949386741[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s [Parallel(n_jobs=1)]: Done 16 out of 16 | elapsed: 0.4s finished [2018-12-12 00:08:29] Features: 15/1 -- score: 0.987703772706[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s [Parallel(n_jobs=1)]: Done 15 out of 15 | elapsed: 0.4s finished [2018-12-12 00:08:30] Features: 14/1 -- score: 0.985933861202[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s [Parallel(n_jobs=1)]: Done 14 out of 14 | elapsed: 0.3s finished [2018-12-12 00:08:30] Features: 13/1 -- score: 0.982425089272[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s [Parallel(n_jobs=1)]: Done 13 out of 13 | elapsed: 0.3s finished [2018-12-12 00:08:31] Features: 12/1 -- score: 0.980655177767[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s [Parallel(n_jobs=1)]: Done 12 out of 12 | elapsed: 0.2s finished [2018-12-12 00:08:31] Features: 11/1 -- score: 0.980655177767[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s [Parallel(n_jobs=1)]: Done 11 out of 11 | elapsed: 0.2s finished [2018-12-12 00:08:31] Features: 10/1 -- score: 0.982440614811[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s 
[Parallel(n_jobs=1)]: Done 10 out of 10 | elapsed: 0.2s finished [2018-12-12 00:08:32] Features: 9/1 -- score: 0.978916317342[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s [Parallel(n_jobs=1)]: Done 9 out of 9 | elapsed: 0.1s finished [2018-12-12 00:08:32] Features: 8/1 -- score: 0.978900791803[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s [Parallel(n_jobs=1)]: Done 8 out of 8 | elapsed: 0.1s finished [2018-12-12 00:08:32] Features: 7/1 -- score: 0.980639652228[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s [Parallel(n_jobs=1)]: Done 7 out of 7 | elapsed: 0.1s finished [2018-12-12 00:08:32] Features: 6/1 -- score: 0.978885266263[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s [Parallel(n_jobs=1)]: Done 6 out of 6 | elapsed: 0.0s finished [2018-12-12 00:08:32] Features: 5/1 -- score: 0.977146405838[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s [Parallel(n_jobs=1)]: Done 5 out of 5 | elapsed: 0.0s finished [2018-12-12 00:08:32] Features: 4/1 -- score: 0.975376494333[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s [Parallel(n_jobs=1)]: Done 4 out of 4 | elapsed: 0.0s finished [2018-12-12 00:08:32] Features: 3/1 -- score: 0.963080267039[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s [Parallel(n_jobs=1)]: Done 3 out of 3 | elapsed: 0.0s finished [2018-12-12 00:08:33] Features: 2/1 -- score: 0.954261760596[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s [Parallel(n_jobs=1)]: Done 2 out of 2 | elapsed: 0.0s finished [2018-12-12 00:08:33] Features: 1/1 -- score: 0.920835274026
Plotting results:
plot_sfs(sbs.get_metric_dict(), kind='std_dev');
SBS returns the subset of the dataframe with the optimal k features:
X_sbs.shape
(569, 24)
Here is the subset of selected feature names:
sbs.k_feature_names_
('mean radius', 'mean texture', 'mean area', 'mean smoothness', 'mean concavity', 'mean concave points', 'mean symmetry', 'mean fractal dimension', 'radius error', 'texture error', 'area error', 'smoothness error', 'compactness error', 'concave points error', 'symmetry error', 'fractal dimension error', 'worst radius', 'worst texture', 'worst area', 'worst smoothness', 'worst concavity', 'worst concave points', 'worst symmetry', 'worst fractal dimension')
'The best quality is {} with {} features in dataset'.format(sbs.k_score_, len(sbs.k_feature_idx_))
'The best quality is 0.9877037727061015 with 24 features in dataset'
Quality is increased! *0.984 -> 0.988*
Save the scores to a dict, then try the other SFS algorithms.
sbs_dict = dict()
for i in sbs.subsets_.values():
    sbs_dict[len(i['feature_names'])] = i['avg_score']
Now let's try Sequential Forward Selection, so set the forward parameter to True.
logit = LogisticRegression()
sfs = SequentialFeatureSelector(logit,
                                k_features=(1, 30),
                                forward=True,
                                floating=False,
                                verbose=2,
                                scoring='accuracy',
                                cv=cv)
X_sfs = sfs.fit_transform(df_scaled, y)
[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s [Parallel(n_jobs=1)]: Done 30 out of 30 | elapsed: 0.5s finished [2018-12-12 00:08:34] Features: 1/30 -- score: 0.920835274026[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s [Parallel(n_jobs=1)]: Done 29 out of 29 | elapsed: 0.5s finished [2018-12-12 00:08:35] Features: 2/30 -- score: 0.954261760596[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s [Parallel(n_jobs=1)]: Done 28 out of 28 | elapsed: 0.5s finished [2018-12-12 00:08:36] Features: 3/30 -- score: 0.966589038969[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s [Parallel(n_jobs=1)]: Done 27 out of 27 | elapsed: 0.5s finished [2018-12-12 00:08:36] Features: 4/30 -- score: 0.971867722403[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s [Parallel(n_jobs=1)]: Done 26 out of 26 | elapsed: 0.5s finished [2018-12-12 00:08:37] Features: 5/30 -- score: 0.975376494333[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s [Parallel(n_jobs=1)]: Done 25 out of 25 | elapsed: 0.5s finished [2018-12-12 00:08:37] Features: 6/30 -- score: 0.977115354759[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s [Parallel(n_jobs=1)]: Done 24 out of 24 | elapsed: 0.5s finished [2018-12-12 00:08:38] Features: 7/30 -- score: 0.980655177767[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s [Parallel(n_jobs=1)]: Done 23 out of 23 | elapsed: 0.5s finished [2018-12-12 00:08:39] Features: 8/30 -- score: 0.980655177767[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s [Parallel(n_jobs=1)]: Done 22 out of 22 | elapsed: 0.5s finished [2018-12-12 00:08:39] Features: 9/30 -- score: 0.978900791803[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s [Parallel(n_jobs=1)]: Done 21 out of 21 | elapsed: 0.5s finished [2018-12-12 00:08:40] Features: 10/30 -- score: 0.980655177767[Parallel(n_jobs=1)]: Done 1 out of 1 
| elapsed: 0.0s remaining: 0.0s [Parallel(n_jobs=1)]: Done 20 out of 20 | elapsed: 0.4s finished [2018-12-12 00:08:40] Features: 11/30 -- score: 0.980655177767[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s [Parallel(n_jobs=1)]: Done 19 out of 19 | elapsed: 0.4s finished [2018-12-12 00:08:41] Features: 12/30 -- score: 0.980655177767[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s [Parallel(n_jobs=1)]: Done 18 out of 18 | elapsed: 0.4s finished [2018-12-12 00:08:41] Features: 13/30 -- score: 0.980655177767[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s [Parallel(n_jobs=1)]: Done 17 out of 17 | elapsed: 0.4s finished [2018-12-12 00:08:42] Features: 14/30 -- score: 0.980655177767[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s [Parallel(n_jobs=1)]: Done 16 out of 16 | elapsed: 0.4s finished [2018-12-12 00:08:42] Features: 15/30 -- score: 0.980655177767[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s [Parallel(n_jobs=1)]: Done 15 out of 15 | elapsed: 0.4s finished [2018-12-12 00:08:43] Features: 16/30 -- score: 0.980655177767[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s [Parallel(n_jobs=1)]: Done 14 out of 14 | elapsed: 0.4s finished [2018-12-12 00:08:43] Features: 17/30 -- score: 0.980655177767[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s [Parallel(n_jobs=1)]: Done 13 out of 13 | elapsed: 0.3s finished [2018-12-12 00:08:44] Features: 18/30 -- score: 0.978900791803[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s [Parallel(n_jobs=1)]: Done 12 out of 12 | elapsed: 0.3s finished [2018-12-12 00:08:44] Features: 19/30 -- score: 0.980655177767[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s [Parallel(n_jobs=1)]: Done 11 out of 11 | elapsed: 0.3s finished [2018-12-12 00:08:45] Features: 20/30 -- score: 0.982409563732[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 
0.0s [Parallel(n_jobs=1)]: Done 10 out of 10 | elapsed: 0.3s finished [2018-12-12 00:08:45] Features: 21/30 -- score: 0.982409563732[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s [Parallel(n_jobs=1)]: Done 9 out of 9 | elapsed: 0.3s finished [2018-12-12 00:08:45] Features: 22/30 -- score: 0.980655177767[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s [Parallel(n_jobs=1)]: Done 8 out of 8 | elapsed: 0.2s finished [2018-12-12 00:08:46] Features: 23/30 -- score: 0.980655177767[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s [Parallel(n_jobs=1)]: Done 7 out of 7 | elapsed: 0.2s finished [2018-12-12 00:08:46] Features: 24/30 -- score: 0.980655177767[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s [Parallel(n_jobs=1)]: Done 6 out of 6 | elapsed: 0.2s finished [2018-12-12 00:08:46] Features: 25/30 -- score: 0.980655177767[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s [Parallel(n_jobs=1)]: Done 5 out of 5 | elapsed: 0.1s finished [2018-12-12 00:08:46] Features: 26/30 -- score: 0.984163949697[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s [Parallel(n_jobs=1)]: Done 4 out of 4 | elapsed: 0.1s finished [2018-12-12 00:08:47] Features: 27/30 -- score: 0.982409563732[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s [Parallel(n_jobs=1)]: Done 3 out of 3 | elapsed: 0.0s finished [2018-12-12 00:08:47] Features: 28/30 -- score: 0.982409563732[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s [Parallel(n_jobs=1)]: Done 2 out of 2 | elapsed: 0.0s finished [2018-12-12 00:08:47] Features: 29/30 -- score: 0.984163949697[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s [Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s finished [2018-12-12 00:08:47] Features: 30/30 -- score: 0.984179475237
Plotting results:
plot_sfs(sfs.get_metric_dict(), kind='std_dev');
'The best quality is {} with {} features in dataset'.format(sfs.k_score_, len(sfs.k_feature_idx_))
'The best quality is 0.9841794752367644 with 30 features in dataset'
Now the quality is equal to our baseline.
Why is the quality of SFS worse than that of SBS?
In the "forward" algorithm, the first iteration selects a single feature and fits the estimator with it alone. Obviously, finding "the best" feature by fitting on a one-dimensional dataset is not very effective. Moreover, plain Sequential Forward Selection cannot remove a feature once it has been added.
Let's try to find the "bad feature" that was added to our subset and the iteration at which this happened.
sbs_feat = set(sbs.subsets_[24]['feature_idx'])  # best feature set of the SBS algorithm
for i in range(1, 30):
    sfs_feat = set(sfs.subsets_[i]['feature_idx'])  # feature set at each iteration of the SFS algorithm
    if len([x for x in sfs_feat if x not in sbs_feat]) > 0:
        print('We add "bad feature" # {} on {} iteration stage'.format(sfs_feat - sbs_feat, i))
        break
We add "bad feature" # {5} on 8 iteration stage
Save the results on each iteration too:
sfs_dict = dict()
for i in sfs.subsets_.values():
    sfs_dict[len(i['feature_names'])] = i['avg_score']
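Once both runs are stored in the same {n_features: avg_score} shape, comparing them per subset size is a simple dict operation. The numbers below are made-up placeholders to show the pattern, not the actual scores from this article:

```python
# Hypothetical per-size CV scores for two selectors (made-up numbers)
sbs_dict = {1: 0.92, 2: 0.95, 3: 0.97}
sfs_dict = {1: 0.92, 2: 0.94, 3: 0.96}

# For each subset size, which algorithm scored higher (or a tie)?
comparison = {
    k: ("SBS" if sbs_dict[k] > sfs_dict[k]
        else "SFS" if sfs_dict[k] > sbs_dict[k]
        else "tie")
    for k in sorted(sbs_dict.keys() & sfs_dict.keys())
}
print(comparison)  # → {1: 'tie', 2: 'SBS', 3: 'SBS'}
```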
Now let's try Sequential Forward Floating Selection, so set the floating parameter to True. The additional step on each iteration can remove a feature that has become useless after others were added.
logit = LogisticRegression()
sffs = SequentialFeatureSelector(logit,
                                 k_features=(1, 30),
                                 forward=True,
                                 floating=True,
                                 verbose=2,
                                 scoring='accuracy',
                                 cv=cv)
X_sffs = sffs.fit_transform(df_scaled, y)
[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s [Parallel(n_jobs=1)]: Done 30 out of 30 | elapsed: 0.5s finished [2018-12-12 00:08:49] Features: 1/30 -- score: 0.920835274026[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s [Parallel(n_jobs=1)]: Done 29 out of 29 | elapsed: 0.5s finished [Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s [Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s finished [2018-12-12 00:08:49] Features: 2/30 -- score: 0.954261760596[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s [Parallel(n_jobs=1)]: Done 28 out of 28 | elapsed: 0.5s finished [Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s [Parallel(n_jobs=1)]: Done 2 out of 2 | elapsed: 0.0s finished [2018-12-12 00:08:50] Features: 3/30 -- score: 0.966589038969[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s [Parallel(n_jobs=1)]: Done 27 out of 27 | elapsed: 0.5s finished [Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s [Parallel(n_jobs=1)]: Done 3 out of 3 | elapsed: 0.0s finished [2018-12-12 00:08:50] Features: 4/30 -- score: 0.971867722403[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s [Parallel(n_jobs=1)]: Done 26 out of 26 | elapsed: 0.5s finished [Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s [Parallel(n_jobs=1)]: Done 5 out of 5 | elapsed: 0.0s finished [2018-12-12 00:08:51] Features: 5/30 -- score: 0.975376494333[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s [Parallel(n_jobs=1)]: Done 25 out of 25 | elapsed: 0.5s finished [Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s [Parallel(n_jobs=1)]: Done 5 out of 5 | elapsed: 0.0s finished [2018-12-12 00:08:52] Features: 6/30 -- score: 0.977115354759[Parallel(n_jobs=1)]: Done 1 out of 1 | elapsed: 0.0s remaining: 0.0s [Parallel(n_jobs=1)]: Done 24 out of 24 | elapsed: 0.5s finished [Parallel(n_jobs=1)]: 
[verbose log truncated] [2018-12-12 00:08:53] Features: 7/30 -- score: 0.980655177767 ... [2018-12-12 00:09:29] Features: 30/30 -- score: 0.984179475237
Plotting results:
plot_sfs(sffs.get_metric_dict(), kind='std_dev');
'The best quality is {} with {} features in dataset'.format(sffs.k_score_, len(sffs.k_feature_idx_))
'The best quality is 0.9859338612016767 with 25 features in dataset'
The quality is a little higher than the SFS one, but SBS is still the best algorithm so far. Let's save the results to a dict and try the last implementation, Sequential Backward Floating Selection.
sffs_dict = dict()
for i in sffs.subsets_.values():
    sffs_dict[len(i['feature_names'])] = i['avg_score']
logit = LogisticRegression()
sbfs = SequentialFeatureSelector(logit,
                                 k_features=(1, 30),
                                 forward=False,
                                 floating=True,
                                 verbose=2,
                                 scoring='accuracy',
                                 cv=cv)
X_sbfs = sbfs.fit_transform(df_scaled, y)
[verbose log truncated] [2018-12-12 00:09:32] Features: 29/1 -- score: 0.985933861202 ... [2018-12-12 00:10:10] Features: 1/1 -- score: 0.915572116131
Plotting results:
plot_sfs(sbfs.get_metric_dict(), kind='std_dev');
'The best quality is {} with {} features in dataset'.format(sbfs.k_score_, len(sbfs.k_feature_idx_))
'The best quality is 0.9877037727061015 with 24 features in dataset'
The quality of the SBS and SBFS algorithms is equal in our example, but on other datasets the floating variant can improve the score.
sbfs_dict = dict()
for i in sbfs.subsets_.values():
    sbfs_dict[len(i['feature_names'])] = i['avg_score']
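Under the hood, the floating step is what separates SBFS from plain SBS: after each greedy exclusion, the algorithm checks whether re-adding a previously removed feature would beat the best score recorded for the larger subset size. Here is a minimal sketch of that idea, not mlxtend's exact implementation, run on just the first 10 features of the breast cancer data to keep it fast:

```python
# Minimal sketch of sequential backward floating selection (SBFS).
# Illustrative only: the loop is simplified compared with mlxtend's code.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler


def score_fn(X, y, features):
    """Mean CV accuracy of logistic regression on a feature subset."""
    model = LogisticRegression(max_iter=1000)
    return cross_val_score(model, X[:, sorted(features)], y, cv=3).mean()


def backward_floating(X, y, k_min):
    n = X.shape[1]
    current = set(range(n))
    best = {n: score_fn(X, y, current)}  # best score seen per subset size
    while len(current) > k_min:
        # Exclusion step: drop the feature whose removal hurts the score least
        score, worst = max((score_fn(X, y, current - {f}), f) for f in current)
        current.remove(worst)
        best[len(current)] = max(best.get(len(current), -1.0), score)
        # Floating step: re-add a removed feature if that strictly beats
        # the recorded best score for the larger subset size
        for f in set(range(n)) - current:
            trial = current | {f}
            trial_score = score_fn(X, y, trial)
            if trial_score > best[len(trial)]:
                current, best[len(trial)] = trial, trial_score
                break
    return best


X, y = load_breast_cancer(return_X_y=True)
X = StandardScaler().fit_transform(X)[:, :10]  # first 10 features for speed
best_scores = backward_floating(X, y, k_min=7)
print({k: round(v, 4) for k, v in sorted(best_scores.items())})
```

The strict-improvement condition in the floating step guarantees termination: a feature is only re-added when it raises the best score recorded for that subset size, and there are finitely many subsets.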
Let's try some other feature selection and dimensionality reduction algorithms
from sklearn.decomposition import PCA  # missing from the imports above

dict_pca = dict()
for i in range(1, 31):
    pca = PCA(n_components=i)
    df_pca = pca.fit_transform(df_scaled)  # PCA is unsupervised, y is not needed
    logit = LogisticRegression()
    score = cross_val_score(logit, df_pca, y, cv=cv).mean()
    dict_pca[i] = score
'The best quality is {} with {} features in dataset'.format(max(dict_pca.values()), max(dict_pca, key=dict_pca.get))
'The best quality is 0.9841794752367644 with 18 features in dataset'
The accuracy metric is lower on the PCA-transformed dataset
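Instead of looping over every component count, the cumulative explained variance ratio is a common guide for choosing `n_components`. A small sketch, assuming the scaled breast cancer data used in this tutorial:

```python
# Pick the number of PCA components from the cumulative explained variance.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)  # PCA is variance-based, so scale first

pca = PCA().fit(X_scaled)
cumvar = np.cumsum(pca.explained_variance_ratio_)
# Smallest component count that explains at least 95% of the variance
n_95 = int(np.argmax(cumvar >= 0.95)) + 1
print(n_95, round(cumvar[n_95 - 1], 3))
```

Note this only tells you how much variance is retained, not how much predictive signal: a component count chosen this way can still score worse than a selected feature subset, as we see here.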
dict_rfe = dict()
for i in range(1, 31):
    logit = LogisticRegression()
    rfe = RFE(logit, n_features_to_select=i)
    df_rfe = rfe.fit_transform(df_scaled, y)
    score = cross_val_score(logit, df_rfe, y, cv=cv).mean()
    dict_rfe[i] = score
'The best quality is {} with {} features in dataset'.format(max(dict_rfe.values()), max(dict_rfe, key=dict_rfe.get))
'The best quality is 0.9841794752367644 with 24 features in dataset'
RFE quality is lower too. RFE is computationally less complex: it uses the feature weight coefficients (e.g., of linear models) or feature importances (of tree-based algorithms) to eliminate features recursively, whereas SFS algorithms eliminate (or add) features based on a user-defined classifier/regressor performance metric.
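To see which features RFE actually kept, the fitted selector exposes `support_` (a boolean mask over the columns) and `ranking_` (1 means selected; larger numbers were eliminated earlier). A quick sketch on the same 30-feature data:

```python
# Inspect which features RFE keeps and in what elimination order.
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
X_scaled = StandardScaler().fit_transform(data.data)

rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=10)
rfe.fit(X_scaled, data.target)

# Names of the surviving features (support_ marks the kept columns)
kept = [name for name, keep in zip(data.feature_names, rfe.support_) if keep]
print(len(kept), kept[:3])
```

This makes RFE results easy to interpret for business users, unlike PCA components.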
Comparing the CV scores of all algorithms
pd.DataFrame(data=[pd.Series(dict_rfe), pd.Series(dict_pca), pd.Series(sbs_dict)], index=['RFE', 'PCA', 'SBS']).T
 | RFE | PCA | SBS |
---|---|---|---|
1 | 0.920835 | 0.913818 | 0.920835 |
2 | 0.950815 | 0.950753 | 0.954262 |
3 | 0.949045 | 0.947291 | 0.963080 |
4 | 0.945521 | 0.968374 | 0.975376 |
5 | 0.970144 | 0.973591 | 0.977146 |
6 | 0.968374 | 0.975392 | 0.978885 |
7 | 0.968359 | 0.975392 | 0.980640 |
8 | 0.970129 | 0.978885 | 0.978901 |
9 | 0.966605 | 0.982425 | 0.978916 |
10 | 0.968390 | 0.980655 | 0.982441 |
11 | 0.975408 | 0.978901 | 0.980655 |
12 | 0.973653 | 0.975392 | 0.980655 |
13 | 0.971883 | 0.975376 | 0.982425 |
14 | 0.971883 | 0.982410 | 0.985934 |
15 | 0.975376 | 0.982410 | 0.987704 |
16 | 0.977146 | 0.982410 | 0.985949 |
17 | 0.980655 | 0.980655 | 0.987704 |
18 | 0.980655 | 0.984179 | 0.987704 |
19 | 0.982410 | 0.982425 | 0.987704 |
20 | 0.982410 | 0.982425 | 0.987704 |
21 | 0.982410 | 0.984179 | 0.987704 |
22 | 0.982410 | 0.982425 | 0.987704 |
23 | 0.982425 | 0.982425 | 0.987704 |
24 | 0.984179 | 0.984179 | 0.987704 |
25 | 0.984179 | 0.984179 | 0.985934 |
26 | 0.984179 | 0.984179 | 0.985934 |
27 | 0.984179 | 0.984179 | 0.985934 |
28 | 0.984179 | 0.984179 | 0.985934 |
29 | 0.984179 | 0.984179 | 0.985934 |
30 | 0.984179 | 0.984179 | 0.984179 |
We got the maximum score with SBS, and with as few as 15 features in the subset. RFE is worse for any number of features, and PCA wins only with 9 features in the subset. Neither RFE nor PCA could find a subset of features that scores higher than the full dataset does; SBS managed to do it.
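One way to condense the comparison table: for each method, take its best CV score and the smallest subset size that reaches it. A sketch with a few rows copied from the table above; in the notebook you would pass the full `dict_rfe`, `dict_pca` and `sbs_dict` instead:

```python
# Best score per method and the smallest feature count that reaches it.
# The dicts hold a few rows copied from the comparison table above.
import pandas as pd

scores = pd.DataFrame({
    'RFE': {15: 0.975376, 19: 0.982410, 24: 0.984179},
    'PCA': {9: 0.982425, 18: 0.984179, 30: 0.984179},
    'SBS': {15: 0.987704, 24: 0.987704, 30: 0.984179},
}).sort_index()

summary = pd.DataFrame({
    'best_score': scores.max(),
    'n_features': scores.idxmax(),  # idxmax returns the first (smallest) index
})
print(summary)
```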
**How we will choose features after this tutorial :)**
Of course, RFE, PCA and SBS solve slightly different tasks. It's important to know how and when to apply each instrument. And even more important is to have an inquiring mind and test the craziest hypotheses :)
In this tutorial we learned something new about feature selection and saw how SequentialFeatureSelector from the Mlxtend library works: it makes it very easy to select among newly generated features and boost the model's quality. Then we compared it with other feature selection and dimensionality reduction algorithms.
Beginner data scientists often pay little attention to feature selection and try many different models instead. But feature selection can boost the model's score a lot. It's nearly impossible to reach the top on Kaggle without stacking, xgboost, careful feature engineering and selection of the best features.
Save the best, delete the rest! That's all, folks!
Official Mlxtend Docs https://rasbt.github.io/mlxtend/user_guide/feature_selection/SequentialFeatureSelector