Data Description: The data relates to direct marketing campaigns of a Portuguese banking institution. The marketing campaigns were based on phone calls. Often, more than one contact with the same client was required in order to assess whether the product (bank term deposit) would be subscribed ('yes') or not ('no').
Domain: Banking
Context: Leveraging customer information is paramount for most businesses. In the case of a bank, attributes of customers like the ones mentioned below can be crucial in strategizing a marketing campaign when launching a new product.
Attribute Information
age: age at the time of call
job: type of job
marital: marital status
education: education background at the time of call
default: has credit in default?
balance: average yearly balance, in euros (numeric)
housing: has housing loan?
loan: has personal loan?
contact: contact communication type
day: last contact day of the month (1-31)
month: last contact month of year ('jan', 'feb', 'mar', ..., 'nov', 'dec')
duration: last contact duration, in seconds (numeric). Important note: this attribute highly affects the output target (e.g., if duration = 0 then Target = 'no'). Yet, the duration is not known before a call is performed. Also, after the end of the call Target is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.
campaign: number of contacts performed during this campaign and for this client (includes last contact)
pdays: number of days that passed by after the client was last contacted from a previous campaign
previous: number of contacts performed before this campaign and for this client
poutcome: outcome of the previous marketing campaign
target: has the client subscribed a term deposit? ('yes', 'no')
Learning Outcomes
# Basic packages
import pandas as pd, numpy as np, matplotlib.pyplot as plt, seaborn as sns
from scipy import stats; from scipy.stats import zscore, norm, randint
import matplotlib.style as style; style.use('fivethirtyeight')
import plotly.express as px
%matplotlib inline
# Impute and Encode
from sklearn.preprocessing import LabelEncoder
from impyute.imputation.cs import mice
# Modelling - LR, KNN, NB, Metrics
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, f1_score, recall_score, precision_score
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier, BaggingClassifier, VotingClassifier
from sklearn.model_selection import train_test_split, GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.dummy import DummyClassifier
from sklearn.metrics import make_scorer
# Oversampling
from imblearn.over_sampling import SMOTE
# Suppress warnings
import warnings; warnings.filterwarnings('ignore')
# Visualize Tree
from sklearn.tree import export_graphviz
from IPython.display import Image
from os import system
# Display settings
pd.options.display.max_rows = 10000
pd.options.display.max_columns = 10000
random_state = 42
np.random.seed(random_state)
# Reading the data as dataframe and print the first five rows
bank = pd.read_csv('bank-full.csv')
bank.head()
| | age | job | marital | education | default | balance | housing | loan | contact | day | month | duration | campaign | pdays | previous | poutcome | Target |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 58 | management | married | tertiary | no | 2143 | yes | no | unknown | 5 | may | 261 | 1 | -1 | 0 | unknown | no |
1 | 44 | technician | single | secondary | no | 29 | yes | no | unknown | 5 | may | 151 | 1 | -1 | 0 | unknown | no |
2 | 33 | entrepreneur | married | secondary | no | 2 | yes | yes | unknown | 5 | may | 76 | 1 | -1 | 0 | unknown | no |
3 | 47 | blue-collar | married | unknown | no | 1506 | yes | no | unknown | 5 | may | 92 | 1 | -1 | 0 | unknown | no |
4 | 33 | unknown | single | unknown | no | 1 | no | no | unknown | 5 | may | 198 | 1 | -1 | 0 | unknown | no |
# Get info of the dataframe columns
bank.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45211 entries, 0 to 45210
Data columns (total 17 columns):
age          45211 non-null int64
job          45211 non-null object
marital      45211 non-null object
education    45211 non-null object
default      45211 non-null object
balance      45211 non-null int64
housing      45211 non-null object
loan         45211 non-null object
contact      45211 non-null object
day          45211 non-null int64
month        45211 non-null object
duration     45211 non-null int64
campaign     45211 non-null int64
pdays        45211 non-null int64
previous     45211 non-null int64
poutcome     45211 non-null object
Target       45211 non-null object
dtypes: int64(7), object(10)
memory usage: 5.9+ MB
Performing exploratory data analysis on the bank dataset. Below are some of the steps performed:
... object columns
... (job, marital, education, default, housing, loan, contact, day, month, poutcome)
... (age, balance, duration, campaign, pdays, previous)
... (job, marital, education, default, housing, loan, contact, day, month, poutcome, Target) to float for MICE training. Creating multiple imputations, as opposed to a single imputation, to complete datasets accounts for the statistical uncertainty in the imputations. The MICE algorithm works by running multiple regression models, and each missing value is modeled conditionally depending on the observed (non-missing) values.
... Target. Drop columns based on these.
bank.describe(include = 'all').T
| | count | unique | top | freq | mean | std | min | 25% | 50% | 75% | max |
---|---|---|---|---|---|---|---|---|---|---|---|
age | 45211 | NaN | NaN | NaN | 40.9362 | 10.6188 | 18 | 33 | 39 | 48 | 95 |
job | 45211 | 12 | blue-collar | 9732 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
marital | 45211 | 3 | married | 27214 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
education | 45211 | 4 | secondary | 23202 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
default | 45211 | 2 | no | 44396 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
balance | 45211 | NaN | NaN | NaN | 1362.27 | 3044.77 | -8019 | 72 | 448 | 1428 | 102127 |
housing | 45211 | 2 | yes | 25130 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
loan | 45211 | 2 | no | 37967 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
contact | 45211 | 3 | cellular | 29285 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
day | 45211 | NaN | NaN | NaN | 15.8064 | 8.32248 | 1 | 8 | 16 | 21 | 31 |
month | 45211 | 12 | may | 13766 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
duration | 45211 | NaN | NaN | NaN | 258.163 | 257.528 | 0 | 103 | 180 | 319 | 4918 |
campaign | 45211 | NaN | NaN | NaN | 2.76384 | 3.09802 | 1 | 1 | 2 | 3 | 63 |
pdays | 45211 | NaN | NaN | NaN | 40.1978 | 100.129 | -1 | -1 | -1 | -1 | 871 |
previous | 45211 | NaN | NaN | NaN | 0.580323 | 2.30344 | 0 | 0 | 0 | 0 | 275 |
poutcome | 45211 | 4 | unknown | 36959 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
Target | 45211 | 2 | no | 39922 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
columns = bank.loc[:, bank.dtypes == 'object'].columns.tolist()
for cols in columns:
    print(f'Unique values for {cols} is \n{bank[cols].unique()}\n')
Unique values for job is ['management' 'technician' 'entrepreneur' 'blue-collar' 'unknown' 'retired' 'admin.' 'services' 'self-employed' 'unemployed' 'housemaid' 'student']
Unique values for marital is ['married' 'single' 'divorced']
Unique values for education is ['tertiary' 'secondary' 'unknown' 'primary']
Unique values for default is ['no' 'yes']
Unique values for housing is ['yes' 'no']
Unique values for loan is ['no' 'yes']
Unique values for contact is ['unknown' 'cellular' 'telephone']
Unique values for month is ['may' 'jun' 'jul' 'aug' 'oct' 'nov' 'dec' 'jan' 'feb' 'mar' 'apr' 'sep']
Unique values for poutcome is ['unknown' 'failure' 'other' 'success']
Unique values for Target is ['no' 'yes']
Categorical
job: Nominal. Includes type of job. 'blue-collar' is the most frequently occurring value in the data.
marital: Nominal. Most of the clients in the dataset are married.
education: Ordinal. Most of the clients have secondary level education.
default: Binary. Most of the clients don't have credit in default.
housing: Binary. Most of the clients have a housing loan.
loan: Binary. Most of the clients don't have a personal loan.
Numerical
age: Continuous, ratio (has true zero, technically). Whether it's discrete or continuous depends on whether it is measured to the nearest year or not. At present, it appears to be discrete. Min age in the dataset is 18 and max is 95.
balance: Continuous, ratio. The range of average yearly balance is very wide, from -8019 euros to 102127 euros.
Categorical
contact: Nominal. Includes communication type with the client; the most frequently used communication mode is cellular.
day: Ordinal. Includes last contact day of the month.
month: Ordinal. Includes last contact month of the year.
Numerical
duration: Continuous, interval. Includes last contact duration in seconds. Min value is 0 and max value is 4918. It would be important to check whether a higher call duration leads to more subscriptions.
campaign: Discrete, interval. Min number of contacts performed during this campaign is 1, which is also the 25th percentile, and max is 63.
Categorical
poutcome: Nominal. Includes outcome of the previous marketing campaign. Most occurring value is 'unknown'.
Numerical
pdays: Continuous, interval. Min number of days that passed by after the client was last contacted from a previous campaign is -1, which may be a dummy value for the cases where the client wasn't contacted, and max is 871.
previous: Discrete, ratio. Min number of contacts performed before this campaign is 0 and max is 275.
Target: Binary. Most occurring value is 'no', i.e. cases where the client didn't subscribe to the term deposit.
Descriptive statistics for the numerical variables (age, balance, duration, campaign, pdays, previous)
age: Q1 to Q3 range is 33 to 48. Since the mean is slightly greater than the median, age appears right (positively) skewed.
balance: Q1 to Q3 range is 72 to 1428. Since the mean is greater than the median, balance appears right (positively) skewed.
duration: Q1 to Q3 range is 103 to 319. Since the mean is greater than the median, duration appears right (positively) skewed.
campaign: Q1 to Q3 range is 1 to 3. Since the mean is greater than the median, campaign appears right (positively) skewed.
pdays: At least 75% of the values are -1, which is a dummy value. It needs a further check without considering the -1 value.
previous: At least 75% of the values are 0, possibly cases where the client wasn't contacted before. It needs further checks.
display(bank['Target'].value_counts(), bank['Target'].value_counts(normalize = True)*100)
no     39922
yes     5289
Name: Target, dtype: int64
no     88.30152
yes    11.69848
Name: Target, dtype: float64
Out of 45211 cases, only 5289 (about 11.70%) are cases where the client has subscribed to the term deposit, so the target classes are imbalanced.
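To make the consequence of this imbalance concrete, the short sketch below uses only the two counts printed above and works out how a trivial model that always predicts 'no' would score. The variable names are illustrative and not part of the notebook.

```python
# Counts taken from the value_counts output above
n_no, n_yes = 39922, 5289
n_total = n_no + n_yes

# A trivial model that always predicts 'no' gets every 'no' case right
baseline_accuracy = n_no / n_total
# ...but it never identifies a single subscriber
baseline_recall_yes = 0 / n_yes

print(f'Accuracy of always predicting "no": {baseline_accuracy:.4f}')
print(f'Recall on "yes" for that model: {baseline_recall_yes:.4f}')
```

Because such a baseline already reaches roughly 88% accuracy, metrics such as recall, precision and F1 on the 'yes' class are likely to be more informative than accuracy alone.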
# Replace values in some of the categorical columns
replace_values = {'education': {'unknown': -1, 'primary': 1, 'secondary': 2, 'tertiary': 3}, 'Target': {'no': 0, 'yes': 1},
'default': {'no': 0, 'yes': 1}, 'housing': {'no': 0, 'yes': 1}, 'loan': {'no': 0, 'yes': 1},
'month': {'jan': 1, 'feb': 2, 'mar': 3, 'apr': 4, 'may': 5, 'jun': 6,
'jul': 7, 'aug': 8, 'sep': 9, 'oct': 10, 'nov': 11, 'dec': 12}}
bank = bank.replace(replace_values)
# Convert columns to categorical types
columns.extend(['day'])
for cols in columns:
    bank[cols] = bank[cols].astype('category')
# Functions that will help us with EDA plots
def odp_plots(df, col):
    f, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize = (15, 7.2))
    # Boxplot to check outliers
    sns.boxplot(x = col, data = df, ax = ax1, orient = 'v', color = 'darkslategrey')
    # Distribution plot with outliers
    sns.distplot(df[col], ax = ax2, color = 'teal', fit = norm).set_title(f'Distribution of {col} with outliers')
    # Clipping outliers to the 1st and 99th percentiles, but in a new dataframe
    lowerbound, upperbound = np.percentile(df[col], [1, 99])
    y = pd.DataFrame(np.clip(df[col], lowerbound, upperbound))
    # Distribution plot without outliers
    sns.distplot(y[col], ax = ax3, color = 'tab:orange', fit = norm).set_title(f'Distribution of {col} without outliers')
    kwargs = {'fontsize':14, 'color':'black'}
    ax1.set_title(col + ' Boxplot Analysis', **kwargs)
    ax1.set_xlabel('Box', **kwargs)
    ax1.set_ylabel(col + ' Values', **kwargs)
    return plt.show()
def target_plot(df, col, target = 'Target'):
    fig = plt.figure(figsize = (15, 7.2))
    # Distribution for 'Target' -- didn't subscribe, considering outliers
    ax = fig.add_subplot(121)
    sns.distplot(df[(df[target] == 0)][col], color = 'c',
                 ax = ax).set_title(f'{col.capitalize()} for Term Deposit - Didn\'t subscribe')
    # Distribution for 'Target' -- subscribed, considering outliers
    ax = fig.add_subplot(122)
    sns.distplot(df[(df[target] == 1)][col], color = 'b',
                 ax = ax).set_title(f'{col.capitalize()} for Term Deposit - Subscribed')
    return plt.show()
def target_count(df, col1, col2):
    fig = plt.figure(figsize = (15, 7.2))
    ax = fig.add_subplot(121)
    sns.countplot(x = col1, data = df, palette = ['tab:blue', 'tab:cyan'], ax = ax, orient = 'v',
                  hue = 'Target').set_title(col1.capitalize() + ' count plot by Target',
                                            fontsize = 13)
    plt.legend(labels = ['Didn\'t Subscribe', 'Subscribed'])
    plt.xticks(rotation = 90)
    ax = fig.add_subplot(122)
    sns.countplot(x = col2, data = df, palette = ['tab:blue', 'tab:cyan'], ax = ax, orient = 'v',
                  hue = 'Target').set_title(col2.capitalize() + ' count plot by Target',
                                            fontsize = 13)
    plt.legend(labels = ['Didn\'t Subscribe', 'Subscribed'])
    plt.xticks(rotation = 90)
    return plt.show()
Looking at one feature at a time to understand how the values are distributed, checking outliers, and checking the relation of the column with the Target column (bivariate).
# Subscribe and didn't subscribe for categorical columns
target_count(bank, 'job', 'marital')
target_count(bank, 'education', 'default')
target_count(bank, 'housing', 'loan')
target_count(bank, 'contact', 'day')
target_count(bank, 'month', 'poutcome')
# Outlier, distribution for 'age' column
Q3 = bank['age'].quantile(0.75)
Q1 = bank['age'].quantile(0.25)
IQR = Q3 - Q1
print('Age column', '--'*55)
display(bank.loc[(bank['age'] < (Q1 - 1.5 * IQR)) | (bank['age'] > (Q3 + 1.5 * IQR))].head())
odp_plots(bank, 'age')
# Distribution of 'age' by 'Target'
target_plot(bank, 'age')
Age column --------------------------------------------------------------------------------------------------------------
| | age | job | marital | education | default | balance | housing | loan | contact | day | month | duration | campaign | pdays | previous | poutcome | Target |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
29158 | 83 | retired | married | 1 | 0 | 425 | 0 | 0 | telephone | 2 | 2 | 912 | 1 | -1 | 0 | unknown | 0 |
29261 | 75 | retired | divorced | 1 | 0 | 46 | 0 | 0 | cellular | 2 | 2 | 294 | 1 | -1 | 0 | unknown | 0 |
29263 | 75 | retired | married | 1 | 0 | 3324 | 0 | 0 | cellular | 2 | 2 | 149 | 1 | -1 | 0 | unknown | 0 |
29322 | 83 | retired | married | 3 | 0 | 6236 | 0 | 0 | cellular | 2 | 2 | 283 | 2 | -1 | 0 | unknown | 0 |
29865 | 75 | retired | divorced | 1 | 0 | 3881 | 1 | 0 | cellular | 4 | 2 | 136 | 3 | -1 | 0 | unknown | 1 |
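The same IQR rule is applied to each numerical column in turn. As a sketch, it could be wrapped in a small helper; the function name `iqr_outliers`, the parameter `k`, and the toy data are assumptions for illustration and do not appear in the notebook.

```python
import pandas as pd

def iqr_outliers(df, col, k=1.5):
    """Return rows of df whose value in col lies outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = df[col].quantile([0.25, 0.75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return df.loc[(df[col] < lower) | (df[col] > upper)]

# Toy data, made up for illustration (not taken from bank-full.csv)
toy = pd.DataFrame({'age': [30, 32, 35, 38, 40, 41, 45, 95]})
outliers = iqr_outliers(toy, 'age')
print(outliers)
```

With a helper like this, each per-column cell would reduce to something like `display(iqr_outliers(bank, 'age').head())`.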
# Outlier, distribution for 'balance' column
Q3 = bank['balance'].quantile(0.75)
Q1 = bank['balance'].quantile(0.25)
IQR = Q3 - Q1
print('Balance column', '--'*55)
display(bank.loc[(bank['balance'] < (Q1 - 1.5 * IQR)) | (bank['balance'] > (Q3 + 1.5 * IQR))].head())
odp_plots(bank, 'balance')
# Distribution of 'balance' by 'Target'
target_plot(bank, 'balance')
Balance column --------------------------------------------------------------------------------------------------------------
| | age | job | marital | education | default | balance | housing | loan | contact | day | month | duration | campaign | pdays | previous | poutcome | Target |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
34 | 51 | management | married | 3 | 0 | 10635 | 1 | 0 | unknown | 5 | 5 | 336 | 1 | -1 | 0 | unknown | 0 |
65 | 51 | management | married | 3 | 0 | 6530 | 1 | 0 | unknown | 5 | 5 | 91 | 1 | -1 | 0 | unknown | 0 |
69 | 35 | blue-collar | single | 2 | 0 | 12223 | 1 | 1 | unknown | 5 | 5 | 177 | 1 | -1 | 0 | unknown | 0 |
70 | 57 | blue-collar | married | 2 | 0 | 5935 | 1 | 1 | unknown | 5 | 5 | 258 | 1 | -1 | 0 | unknown | 0 |
186 | 40 | services | divorced | -1 | 0 | 4384 | 1 | 0 | unknown | 5 | 5 | 315 | 1 | -1 | 0 | unknown | 0 |
# Outlier, distribution for 'duration' column
Q3 = bank['duration'].quantile(0.75)
Q1 = bank['duration'].quantile(0.25)
IQR = Q3 - Q1
print('Duration column', '--'*54)
display(bank.loc[(bank['duration'] < (Q1 - 1.5 * IQR)) | (bank['duration'] > (Q3 + 1.5 * IQR))].head())
odp_plots(bank, 'duration')
# Distribution of 'duration' by 'Target'
target_plot(bank, 'duration')
Duration column ------------------------------------------------------------------------------------------------------------
| | age | job | marital | education | default | balance | housing | loan | contact | day | month | duration | campaign | pdays | previous | poutcome | Target |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
37 | 53 | technician | married | 2 | 0 | -3 | 0 | 0 | unknown | 5 | 5 | 1666 | 1 | -1 | 0 | unknown | 0 |
43 | 54 | retired | married | 2 | 0 | 529 | 1 | 0 | unknown | 5 | 5 | 1492 | 1 | -1 | 0 | unknown | 0 |
53 | 42 | admin. | single | 2 | 0 | -76 | 1 | 0 | unknown | 5 | 5 | 787 | 1 | -1 | 0 | unknown | 0 |
59 | 46 | services | married | 1 | 0 | 179 | 1 | 0 | unknown | 5 | 5 | 1778 | 1 | -1 | 0 | unknown | 0 |
61 | 53 | technician | divorced | 2 | 0 | 989 | 1 | 0 | unknown | 5 | 5 | 812 | 1 | -1 | 0 | unknown | 0 |
# Outlier, distribution for 'campaign' column
Q3 = bank['campaign'].quantile(0.75)
Q1 = bank['campaign'].quantile(0.25)
IQR = Q3 - Q1
print('Campaign column', '--'*54)
display(bank.loc[(bank['campaign'] < (Q1 - 1.5 * IQR)) | (bank['campaign'] > (Q3 + 1.5 * IQR))].head())
odp_plots(bank, 'campaign')
# Distribution of 'campaign' by 'Target'
target_plot(bank, 'campaign')
Campaign column ------------------------------------------------------------------------------------------------------------
| | age | job | marital | education | default | balance | housing | loan | contact | day | month | duration | campaign | pdays | previous | poutcome | Target |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
758 | 59 | services | married | 2 | 0 | 307 | 1 | 1 | unknown | 6 | 5 | 250 | 7 | -1 | 0 | unknown | 0 |
780 | 30 | admin. | married | 2 | 0 | 4 | 0 | 0 | unknown | 7 | 5 | 172 | 8 | -1 | 0 | unknown | 0 |
906 | 27 | services | single | 2 | 0 | 0 | 1 | 0 | unknown | 7 | 5 | 388 | 7 | -1 | 0 | unknown | 0 |
1103 | 52 | technician | married | -1 | 0 | 133 | 1 | 0 | unknown | 7 | 5 | 253 | 8 | -1 | 0 | unknown | 0 |
1105 | 43 | admin. | married | 3 | 0 | 1924 | 1 | 0 | unknown | 7 | 5 | 244 | 7 | -1 | 0 | unknown | 0 |
# Outlier, distribution for 'pdays' column
Q3 = bank['pdays'].quantile(0.75)
Q1 = bank['pdays'].quantile(0.25)
IQR = Q3 - Q1
print('Pdays column', '--'*55)
display(bank.loc[(bank['pdays'] < (Q1 - 1.5 * IQR)) | (bank['pdays'] > (Q3 + 1.5 * IQR))].head())
# Check outlier in 'pdays', not considering -1
pdays = bank.loc[bank['pdays'] > 0, ['pdays', 'Target']]
pdays = pd.DataFrame(pdays, columns = ['pdays', 'Target'])
odp_plots(pdays, 'pdays')
# Distribution of 'pdays' by 'Target', not considering -1
target_plot(pdays, 'pdays')
Pdays column --------------------------------------------------------------------------------------------------------------
| | age | job | marital | education | default | balance | housing | loan | contact | day | month | duration | campaign | pdays | previous | poutcome | Target |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
24060 | 33 | admin. | married | 3 | 0 | 882 | 0 | 0 | telephone | 21 | 10 | 39 | 1 | 151 | 3 | failure | 0 |
24062 | 42 | admin. | single | 2 | 0 | -247 | 1 | 1 | telephone | 21 | 10 | 519 | 1 | 166 | 1 | other | 1 |
24064 | 33 | services | married | 2 | 0 | 3444 | 1 | 0 | telephone | 21 | 10 | 144 | 1 | 91 | 4 | failure | 1 |
24072 | 36 | management | married | 3 | 0 | 2415 | 1 | 0 | telephone | 22 | 10 | 73 | 1 | 86 | 4 | other | 0 |
24077 | 36 | management | married | 3 | 0 | 0 | 1 | 0 | telephone | 23 | 10 | 140 | 1 | 143 | 3 | failure | 1 |
# Outlier, distribution for 'previous' column
Q3 = bank['previous'].quantile(0.75)
Q1 = bank['previous'].quantile(0.25)
IQR = Q3 - Q1
print('Previous column', '--'*54)
display(bank.loc[(bank['previous'] < (Q1 - 1.5 * IQR)) | (bank['previous'] > (Q3 + 1.5 * IQR))].head())
odp_plots(bank, 'previous')
# Distribution of 'previous' by 'Target'
target_plot(bank, 'previous')
Previous column ------------------------------------------------------------------------------------------------------------
| | age | job | marital | education | default | balance | housing | loan | contact | day | month | duration | campaign | pdays | previous | poutcome | Target |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
24060 | 33 | admin. | married | 3 | 0 | 882 | 0 | 0 | telephone | 21 | 10 | 39 | 1 | 151 | 3 | failure | 0 |
24062 | 42 | admin. | single | 2 | 0 | -247 | 1 | 1 | telephone | 21 | 10 | 519 | 1 | 166 | 1 | other | 1 |
24064 | 33 | services | married | 2 | 0 | 3444 | 1 | 0 | telephone | 21 | 10 | 144 | 1 | 91 | 4 | failure | 1 |
24072 | 36 | management | married | 3 | 0 | 2415 | 1 | 0 | telephone | 22 | 10 | 73 | 1 | 86 | 4 | other | 0 |
24077 | 36 | management | married | 3 | 0 | 0 | 1 | 0 | telephone | 23 | 10 | 140 | 1 | 143 | 3 | failure | 1 |
print('Categorical Columns: \n{}'.format(list(bank.select_dtypes('category').columns)))
print('\nNumerical Columns: \n{}'.format(list(bank.select_dtypes(exclude = 'category').columns)))
Categorical Columns:
['job', 'marital', 'education', 'default', 'housing', 'loan', 'contact', 'day', 'month', 'poutcome', 'Target']

Numerical Columns:
['age', 'balance', 'duration', 'campaign', 'pdays', 'previous']
# Replacing outliers with NaN, with upper and lower percentile values being 99 and 1, respectively
bank_nulls = bank.copy(deep = True)
columns = ['age', 'balance', 'duration', 'campaign', 'pdays', 'previous']
for col in columns:
    upper_lim = np.percentile(bank_nulls[col].values, 99)
    lower_lim = np.percentile(bank_nulls[col].values, 1)
    bank_nulls.loc[(bank_nulls[col] > upper_lim), col] = np.nan
    bank_nulls.loc[(bank_nulls[col] < lower_lim), col] = np.nan
print('Columns for which outliers were removed with upper and lower percentile values: \n', columns)
Columns for which outliers were removed with upper and lower percentile values: ['age', 'balance', 'duration', 'campaign', 'pdays', 'previous']
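The step above does not drop rows; it sets values outside the 1st to 99th percentile range to NaN so that they can be imputed later. A minimal sketch of that masking on toy data follows; the helper name `mask_outside_percentiles` and the toy series are assumptions for illustration.

```python
import numpy as np
import pandas as pd

def mask_outside_percentiles(s, lower_pct=1, upper_pct=99):
    """Return a copy of s with values outside the given percentiles set to NaN."""
    lower, upper = np.percentile(s.dropna(), [lower_pct, upper_pct])
    # Series.where keeps values where the condition holds and puts NaN elsewhere
    return s.where((s >= lower) & (s <= upper))

# Toy series with the values 1..200 (not taken from bank-full.csv)
toy = pd.Series(np.arange(1, 201, dtype=float))
masked = mask_outside_percentiles(toy)
n_masked = int(masked.isna().sum())
print(f'{n_masked} of {len(toy)} values were set to NaN')
```

Keeping the rows and masking only the extreme cells preserves the other features of those clients, at the cost of relying on the imputation step to fill the gaps sensibly.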
# # Frequency encoding of 'job' column, this would create too many columns with sparse distribution
# columns = ['job']#, 'marital', 'contact', 'poutcome']
# for col in columns:
# counts = bank_nulls[col].value_counts().index.tolist()
# encoding = bank_nulls.groupby(col).size()
# encoding = encoding/len(bank_nulls)
# bank_nulls[col] = bank_nulls[col].map(encoding)
# print([counts, bank_nulls[col].value_counts().index.tolist()], '\n')
# pd.get_dummies
cols_to_transform = ['job', 'marital', 'contact', 'poutcome']
bank_nulls = pd.get_dummies(bank_nulls, columns = cols_to_transform) #, drop_first = True)
print('Got dummies for \n', cols_to_transform)
bank_nulls.info()
Got dummies for ['job', 'marital', 'contact', 'poutcome']
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45211 entries, 0 to 45210
Data columns (total 35 columns):
age                  44473 non-null float64
education            45211 non-null category
default              45211 non-null category
balance              44308 non-null float64
housing              45211 non-null category
loan                 45211 non-null category
day                  45211 non-null category
month                45211 non-null category
duration             44341 non-null float64
campaign             44760 non-null float64
pdays                44826 non-null float64
previous             44758 non-null float64
Target               45211 non-null category
job_admin.           45211 non-null uint8
job_blue-collar      45211 non-null uint8
job_entrepreneur     45211 non-null uint8
job_housemaid        45211 non-null uint8
job_management       45211 non-null uint8
job_retired          45211 non-null uint8
job_self-employed    45211 non-null uint8
job_services         45211 non-null uint8
job_student          45211 non-null uint8
job_technician       45211 non-null uint8
job_unemployed       45211 non-null uint8
job_unknown          45211 non-null uint8
marital_divorced     45211 non-null uint8
marital_married      45211 non-null uint8
marital_single       45211 non-null uint8
contact_cellular     45211 non-null uint8
contact_telephone    45211 non-null uint8
contact_unknown      45211 non-null uint8
poutcome_failure     45211 non-null uint8
poutcome_other       45211 non-null uint8
poutcome_success     45211 non-null uint8
poutcome_unknown     45211 non-null uint8
dtypes: category(7), float64(6), uint8(22)
memory usage: 3.3 MB
# Convert categorical columns to float to get them ready for MICE
columns = ['education', 'default', 'housing', 'loan', 'day', 'month', 'Target']
for col in columns:
    bank_nulls[col] = bank_nulls[col].astype('float')
... np.nan in the earlier step
# start the MICE training
bank_imputed = mice(bank_nulls.values)
bank_imputed = pd.DataFrame(bank_imputed, columns = bank_nulls.columns)
display(bank.describe(include = 'all').T, bank_imputed.describe(include = 'all').T)
| | count | unique | top | freq | mean | std | min | 25% | 50% | 75% | max |
---|---|---|---|---|---|---|---|---|---|---|---|
age | 45211 | NaN | NaN | NaN | 40.9362 | 10.6188 | 18 | 33 | 39 | 48 | 95 |
job | 45211 | 12 | blue-collar | 9732 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
marital | 45211 | 3 | married | 27214 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
education | 45211 | 4 | 2 | 23202 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
default | 45211 | 2 | 0 | 44396 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
balance | 45211 | NaN | NaN | NaN | 1362.27 | 3044.77 | -8019 | 72 | 448 | 1428 | 102127 |
housing | 45211 | 2 | 1 | 25130 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
loan | 45211 | 2 | 0 | 37967 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
contact | 45211 | 3 | cellular | 29285 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
day | 45211 | 31 | 20 | 2752 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
month | 45211 | 12 | 5 | 13766 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
duration | 45211 | NaN | NaN | NaN | 258.163 | 257.528 | 0 | 103 | 180 | 319 | 4918 |
campaign | 45211 | NaN | NaN | NaN | 2.76384 | 3.09802 | 1 | 1 | 2 | 3 | 63 |
pdays | 45211 | NaN | NaN | NaN | 40.1978 | 100.129 | -1 | -1 | -1 | -1 | 871 |
previous | 45211 | NaN | NaN | NaN | 0.580323 | 2.30344 | 0 | 0 | 0 | 0 | 275 |
poutcome | 45211 | 4 | unknown | 36959 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
Target | 45211 | 2 | 0 | 39922 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| | count | mean | std | min | 25% | 50% | 75% | max |
---|---|---|---|---|---|---|---|---|
age | 45211.0 | 40.836870 | 10.073690 | 23.000000 | 33.0 | 39.0 | 48.0 | 71.743841 |
education | 45211.0 | 2.019442 | 0.902795 | -1.000000 | 2.0 | 2.0 | 3.0 | 3.000000 |
default | 45211.0 | 0.018027 | 0.133049 | 0.000000 | 0.0 | 0.0 | 0.0 | 1.000000 |
balance | 45211.0 | 1174.932363 | 1898.534988 | -812.502754 | 81.0 | 467.0 | 1402.0 | 13164.000000 |
housing | 45211.0 | 0.555838 | 0.496878 | 0.000000 | 0.0 | 1.0 | 1.0 | 1.000000 |
loan | 45211.0 | 0.160226 | 0.366820 | 0.000000 | 0.0 | 0.0 | 0.0 | 1.000000 |
day | 45211.0 | 15.806419 | 8.322476 | 1.000000 | 8.0 | 16.0 | 21.0 | 31.000000 |
month | 45211.0 | 6.144655 | 2.408034 | 1.000000 | 5.0 | 6.0 | 8.0 | 12.000000 |
duration | 45211.0 | 247.428930 | 211.290370 | 11.000000 | 106.0 | 183.0 | 316.0 | 1269.000000 |
campaign | 45211.0 | 2.562222 | 2.214906 | 1.000000 | 1.0 | 2.0 | 3.0 | 16.000000 |
pdays | 45211.0 | 37.922341 | 92.399489 | -47.832018 | -1.0 | -1.0 | -1.0 | 370.000000 |
previous | 45211.0 | 0.461333 | 1.208702 | 0.000000 | 0.0 | 0.0 | 0.0 | 8.000000 |
Target | 45211.0 | 0.116985 | 0.321406 | 0.000000 | 0.0 | 0.0 | 0.0 | 1.000000 |
job_admin. | 45211.0 | 0.114375 | 0.318269 | 0.000000 | 0.0 | 0.0 | 0.0 | 1.000000 |
job_blue-collar | 45211.0 | 0.215257 | 0.411005 | 0.000000 | 0.0 | 0.0 | 0.0 | 1.000000 |
job_entrepreneur | 45211.0 | 0.032890 | 0.178351 | 0.000000 | 0.0 | 0.0 | 0.0 | 1.000000 |
job_housemaid | 45211.0 | 0.027427 | 0.163326 | 0.000000 | 0.0 | 0.0 | 0.0 | 1.000000 |
job_management | 45211.0 | 0.209197 | 0.406740 | 0.000000 | 0.0 | 0.0 | 0.0 | 1.000000 |
job_retired | 45211.0 | 0.050076 | 0.218105 | 0.000000 | 0.0 | 0.0 | 0.0 | 1.000000 |
job_self-employed | 45211.0 | 0.034925 | 0.183592 | 0.000000 | 0.0 | 0.0 | 0.0 | 1.000000 |
job_services | 45211.0 | 0.091880 | 0.288860 | 0.000000 | 0.0 | 0.0 | 0.0 | 1.000000 |
job_student | 45211.0 | 0.020747 | 0.142538 | 0.000000 | 0.0 | 0.0 | 0.0 | 1.000000 |
job_technician | 45211.0 | 0.168034 | 0.373901 | 0.000000 | 0.0 | 0.0 | 0.0 | 1.000000 |
job_unemployed | 45211.0 | 0.028820 | 0.167303 | 0.000000 | 0.0 | 0.0 | 0.0 | 1.000000 |
job_unknown | 45211.0 | 0.006370 | 0.079559 | 0.000000 | 0.0 | 0.0 | 0.0 | 1.000000 |
marital_divorced | 45211.0 | 0.115171 | 0.319232 | 0.000000 | 0.0 | 0.0 | 0.0 | 1.000000 |
marital_married | 45211.0 | 0.601933 | 0.489505 | 0.000000 | 0.0 | 1.0 | 1.0 | 1.000000 |
marital_single | 45211.0 | 0.282896 | 0.450411 | 0.000000 | 0.0 | 0.0 | 1.0 | 1.000000 |
contact_cellular | 45211.0 | 0.647741 | 0.477680 | 0.000000 | 0.0 | 1.0 | 1.0 | 1.000000 |
contact_telephone | 45211.0 | 0.064276 | 0.245247 | 0.000000 | 0.0 | 0.0 | 0.0 | 1.000000 |
contact_unknown | 45211.0 | 0.287983 | 0.452828 | 0.000000 | 0.0 | 0.0 | 1.0 | 1.000000 |
poutcome_failure | 45211.0 | 0.108403 | 0.310892 | 0.000000 | 0.0 | 0.0 | 0.0 | 1.000000 |
poutcome_other | 45211.0 | 0.040698 | 0.197592 | 0.000000 | 0.0 | 0.0 | 0.0 | 1.000000 |
poutcome_success | 45211.0 | 0.033421 | 0.179735 | 0.000000 | 0.0 | 0.0 | 0.0 | 1.000000 |
poutcome_unknown | 45211.0 | 0.817478 | 0.386278 | 0.000000 | 1.0 | 1.0 | 1.0 | 1.000000 |
Column | Before MICE | After MICE |
---|---|---|
age | Range of Q1 to Q3 is 33-48. Mean > Median, right (positively) skewed | Range of Q1 to Q3 is unchanged; because of the change in min and max values there's a slight reduction in mean; right skewed |
balance | Range of Q1 to Q3 is 72-1428. Mean > Median, skewed towards right (positively) | Range of Q1 to Q3 is 81-1402, reduction in mean, right skewed |
duration | Range of Q1 to Q3 is 103-319. Mean > Median, right (positively) skewed | Range of Q1 to Q3 is 106-316, right skewed |
campaign | Range of Q1 to Q3 is 1-3. Mean > Median, right (positively) skewed | Unchanged range and skewness |
pdays | At least 75% of data values are -1 | Unchanged |
previous | At least 75% of data values are 0 | Unchanged |
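To illustrate the idea behind the chained-equations imputation used above, here is a deliberately simplified sketch in plain NumPy. It is not the `impyute` implementation: it produces a single deterministic imputation with ordinary least squares, whereas MICE proper draws several imputed datasets. The function name `chained_regression_impute` and the toy data are assumptions for illustration.

```python
import numpy as np

def chained_regression_impute(X, n_iter=10):
    """Toy chained-equations imputer: fill NaNs column by column with linear regression."""
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    filled = X.copy()
    # Start from column means so every regression has complete inputs
    col_means = np.nanmean(X, axis=0)
    filled[missing] = np.take(col_means, np.where(missing)[1])
    for _ in range(n_iter):
        for j in range(X.shape[1]):
            rows = missing[:, j]
            if not rows.any():
                continue
            # Regress column j on all other columns (plus an intercept)
            others = np.delete(filled, j, axis=1)
            A = np.column_stack([np.ones(len(others)), others])
            # Fit only on rows where column j was actually observed
            coef, *_ = np.linalg.lstsq(A[~rows], filled[~rows, j], rcond=None)
            # Overwrite the missing entries with the model's predictions
            filled[rows, j] = A[rows] @ coef
    return filled

# Toy data with an exact linear relation between the two columns
rng = np.random.default_rng(0)
x = rng.normal(size=50)
X_full = np.column_stack([x, 2 * x + 1])
X_missing = X_full.copy()
X_missing[3, 1] = np.nan
X_filled = chained_regression_impute(X_missing)
print(X_full[3, 1], X_filled[3, 1])
```

Since the regressions are unconstrained, imputed values can fall outside the plausible range of a column or be non-integer, which is consistent with the non-integer age maximum and the pdays minimum below -1 in the summary above.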
# Checking whether count of 0 in previous is equal to count of -1 in pdays
display(bank_imputed.loc[bank_imputed['previous'] == 0, 'previous'].value_counts().sum(),
bank_imputed.loc[bank_imputed['pdays'] == -1, 'pdays'].value_counts().sum())
36954
36954
Count of 0 in previous is equal to count of -1 in pdays column, we might replace -1 in pdays with 0 to account for cases where the client wasn't contacted previously. Checking correlation between variables and target next...
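One possible way to carry out the replacement suggested above is sketched below on toy data. The column name `contacted_before` is hypothetical and not part of the notebook; it is added here so that "never contacted" stays distinguishable from a genuinely small number of days.

```python
import pandas as pd

# Toy rows, made up for illustration (not taken from bank-full.csv)
toy = pd.DataFrame({'pdays': [-1, 151, -1, 91], 'previous': [0, 3, 0, 4]})

# Flag clients that were contacted in an earlier campaign (hypothetical column)
toy['contacted_before'] = (toy['pdays'] != -1).astype(int)

# Replace the -1 sentinel with 0, as suggested in the text
toy['pdays'] = toy['pdays'].replace(-1, 0)
print(toy)
```

Keeping the flag is a design choice: after the replacement, a value of 0 in pdays on its own no longer says whether the client was never contacted or contacted very recently.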
Checking relationship between two or more variables. Includes correlation and scatterplot matrix, checking relation between two variables and Target.
sns.pairplot(bank_imputed[['age', 'education', 'default', 'balance', 'housing', 'loan', 'day', 'month',
'duration', 'campaign', 'pdays', 'previous', 'Target']], hue = 'Target')
# Correlation matrix for all variables
corr = bank_imputed.corr()
mask = np.zeros_like(corr, dtype = bool)
mask[np.triu_indices_from(mask)] = True
f, ax = plt.subplots(figsize = (11, 9))
cmap = sns.diverging_palette(220, 10, as_cmap = True)
sns.heatmap(corr, mask = mask, cmap = cmap, square = True, linewidths = .5, cbar_kws = {"shrink": .5})#, annot = True)
ax.set_title('Correlation Matrix of Data')
# Filter for correlation value greater than 0.8
sort = corr.abs().unstack()
sort = sort.sort_values(kind = "quicksort", ascending = False)
sort[(sort > 0.8) & (sort < 1)]
pdays             poutcome_unknown    0.891235
poutcome_unknown  pdays               0.891235
contact_unknown   contact_cellular    0.862398
contact_cellular  contact_unknown     0.862398
previous          poutcome_unknown    0.806952
poutcome_unknown  previous            0.806952
dtype: float64
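Every pair appears twice in the output above because the correlation matrix is symmetric. A small sketch that lists each pair once by keeping only the strict upper triangle (the helper name `high_corr_pairs` and the toy data are assumptions for illustration):

```python
import numpy as np
import pandas as pd

def high_corr_pairs(frame, threshold=0.8):
    """Return each column pair once, with |correlation| above threshold."""
    corr = frame.corr().abs()
    # Keep the strict upper triangle so (a, b) and (b, a) are not both listed
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    pairs = upper.stack()
    return pairs[pairs > threshold].sort_values(ascending=False)

# Toy data: 'b' is a noisy copy of 'a', 'c' is independent
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({'a': a,
                    'b': a + rng.normal(scale=0.1, size=200),
                    'c': rng.normal(size=200)})
pairs = high_corr_pairs(toy)
```

Called as `high_corr_pairs(bank_imputed)`, this should give three rows instead of six for the 0.8 threshold used here.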
# Absolute correlation of independent variables with 'Target' i.e. the target variable
absCorrwithDep = []
allVars = bank_imputed.drop('Target', axis = 1).columns
for var in allVars:
    absCorrwithDep.append(abs(bank_imputed['Target'].corr(bank_imputed[var])))
display(pd.DataFrame([allVars, absCorrwithDep], index = ['Variable', 'Correlation']).T.\
        sort_values('Correlation', ascending = False))
Index | Variable | Correlation |
---|---|---|
8 | duration | 0.398107 |
32 | poutcome_success | 0.306788 |
33 | poutcome_unknown | 0.167051 |
11 | previous | 0.153341 |
29 | contact_unknown | 0.150935 |
4 | housing | 0.139173 |
27 | contact_cellular | 0.135873 |
10 | pdays | 0.0865931 |
17 | job_retired | 0.0792453 |
3 | balance | 0.0769227 |
20 | job_student | 0.076897 |
9 | campaign | 0.0754512 |
13 | job_blue-collar | 0.0720831 |
5 | loan | 0.068185 |
26 | marital_single | 0.0635258 |
25 | marital_married | 0.0602604 |
1 | education | 0.0416343 |
16 | job_management | 0.0329188 |
31 | poutcome_other | 0.031955 |
6 | day | 0.0283478 |
19 | job_services | 0.0278639 |
2 | default | 0.022419 |
22 | job_unemployed | 0.0203899 |
14 | job_entrepreneur | 0.0196623 |
7 | month | 0.018717 |
15 | job_housemaid | 0.0151949 |
28 | contact_telephone | 0.0140425 |
0 | age | 0.0134745 |
30 | poutcome_failure | 0.00988545 |
21 | job_technician | 0.00896982 |
12 | job_admin. | 0.00563747 |
24 | marital_divorced | 0.00277237 |
18 | job_self-employed | 0.000855031 |
23 | job_unknown | 0.000266748 |
The following column pairs are correlated with each other: poutcome_unknown and pdays; contact_unknown and contact_cellular; poutcome_unknown and previous; marital_married and marital_single; poutcome_unknown and poutcome_failure; pdays and poutcome_failure; previous and pdays; poutcome_failure and previous. Only the first three pairs exceed the 0.8 threshold used above.
duration (0.40), poutcome_success (0.31), poutcome_unknown (0.17) and previous (0.15) are the few columns with a relatively strong correlation with the Target column.
#bank_imputed.drop(['pdays', 'contact_cellular'], axis = 1, inplace = True) #, 'previous', 'marital_married', 'poutcome_failure'
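One possible way to act on the correlated pairs, sketched under assumptions: for each pair above a threshold, drop the member that is less correlated with the target. The helper `drop_correlated`, the rule it applies and the toy data are hypothetical; the notebook itself leaves the drop commented out.

```python
import numpy as np
import pandas as pd

def drop_correlated(frame, target, threshold=0.8):
    """For each highly correlated feature pair, drop the feature
    with the weaker absolute correlation to the target."""
    features = frame.drop(columns=[target])
    corr = features.corr().abs()
    with_target = features.corrwith(frame[target]).abs()
    dropped = []
    cols = list(features.columns)
    for i, first in enumerate(cols):
        for second in cols[i + 1:]:
            if first in dropped or second in dropped:
                continue
            if corr.loc[first, second] > threshold:
                weaker = first if with_target[first] < with_target[second] else second
                dropped.append(weaker)
    return frame.drop(columns=dropped), dropped

# Toy data: x2 nearly duplicates x1, x3 is independent noise
rng = np.random.default_rng(0)
x1 = rng.normal(size=300)
toy = pd.DataFrame({'x1': x1,
                    'x2': x1 + rng.normal(scale=0.05, size=300),
                    'x3': rng.normal(size=300)})
toy['Target'] = (x1 + rng.normal(scale=0.5, size=300) > 0).astype(int)
reduced, dropped = drop_correlated(toy, 'Target')
```

This is a greedy rule, so the result can depend on column order; it is meant as a starting point rather than a definitive selection method.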
# Creating age groups
bank_imputed.loc[(bank_imputed['age'] < 30), 'age_group'] = 20
bank_imputed.loc[(bank_imputed['age'] >= 30) & (bank_imputed['age'] < 40), 'age_group'] = 30
bank_imputed.loc[(bank_imputed['age'] >= 40) & (bank_imputed['age'] < 50), 'age_group'] = 40
bank_imputed.loc[(bank_imputed['age'] >= 50) & (bank_imputed['age'] < 60), 'age_group'] = 50
bank_imputed.loc[(bank_imputed['age'] >= 60), 'age_group'] = 60
# Check relationship between balance and age group by Target
fig = plt.figure(figsize = (15, 7.2))
ax = sns.boxplot(x = 'age_group', y = 'balance', hue = 'Target', palette = 'afmhot', data = bank_imputed)
ax.set_title('Relationship between balance and age group by Target')
# Check relationship between campaign and age group by Target
fig = plt.figure(figsize = (15, 7.2))
ax = sns.boxplot(x = 'age_group', y = 'campaign', hue = 'Target', palette = 'afmhot', data = bank_imputed)
ax.set_title('Relationship between campaign and age group by Target')
# bank_imputed.drop(['age_group'], axis = 1, inplace = True)
Created age_group and checked its relation with balance and Target: it appears that the higher the balance range, the greater the chance that the client subscribes to the term deposit, irrespective of age group. It also appears that clients in age group 50 have the highest range of balance.
Then checked the relation between campaign, age group and Target: it appears that clients in age groups 20 and 60 received fewer campaign contacts.
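The age_group column built above with five `.loc` assignments could also be produced in one step with `pd.cut`. A minimal sketch on a toy series (the `ages` values are illustrative; the bin edges mirror the conditions used above):

```python
import numpy as np
import pandas as pd

ages = pd.Series([18, 29, 30, 39, 40, 49, 50, 59, 60, 95], name='age')

# Left-closed bins reproduce the conditions above: <30, [30, 40), ..., >=60
bins = [-np.inf, 30, 40, 50, 60, np.inf]
labels = [20, 30, 40, 50, 60]
age_group = pd.cut(ages, bins=bins, labels=labels, right=False).astype(int)
```

Setting `right=False` matters here: with the default right-closed bins, an age of exactly 30 would fall into group 20.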
# Separating dependent and independent variables
X = bank_imputed.drop(['Target'], axis = 1)
y = bank_imputed['Target']
# Splitting the data into training and test set in the ratio of 70:30 respectively
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = random_state)
dummy = DummyClassifier(strategy = 'most_frequent', random_state = random_state)
dummy.fit(X_train, y_train)
y_pred = dummy.predict(X_test)
accuracy_ = accuracy_score(y_test, y_pred)
pre_s = precision_score(y_test, y_pred, average = 'binary', pos_label = 1)
re_s = recall_score(y_test, y_pred, average = 'binary', pos_label = 1)
f1_s = f1_score(y_test, y_pred, average = 'binary', pos_label = 1)
pre_m = precision_score(y_test, y_pred, average = 'macro')
re_m = recall_score(y_test, y_pred, average = 'macro')
f1_m = f1_score(y_test, y_pred, average = 'macro')
print('Training Score: ', dummy.score(X_train, y_train).round(3))
print('Test Score: ', dummy.score(X_test, y_test).round(3))
print('Accuracy: ', accuracy_.round(3))
print('Precision Score - Subscribe: ', pre_s.round(3))
print('Recall Score - Subscribe: ', re_s.round(3))
print('f1 Score - Subscribe: ', f1_s.round(3))
print('Precision Score - Macro: ', pre_m.round(3))
print('Recall Score - Macro: ', re_m.round(3))
print('f1 Score - Macro: ', f1_m.round(3))
df = pd.DataFrame([accuracy_.round(3), pre_s.round(3), pre_m.round(3), re_s.round(3),
re_m.round(3), f1_s.round(3), f1_m.round(3)], columns = ['Baseline Model']).T
df.columns = ['Accuracy', 'Precision_Subscribe', 'Precision_Macro',
'Recall_Subscribe', 'Recall_Macro', 'f1_Subscribe', 'f1_Macro']
df
Training Score: 0.883
Test Score: 0.882
Accuracy: 0.882
Precision Score - Subscribe: 0.0
Recall Score - Subscribe: 0.0
f1 Score - Subscribe: 0.0
Precision Score - Macro: 0.441
Recall Score - Macro: 0.5
f1 Score - Macro: 0.469
Model | Accuracy | Precision_Subscribe | Precision_Macro | Recall_Subscribe | Recall_Macro | f1_Subscribe | f1_Macro |
---|---|---|---|---|---|---|---|
Baseline Model | 0.882 | 0.0 | 0.441 | 0.0 | 0.5 | 0.0 | 0.469 |
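The baseline row shows why accuracy alone is misleading on imbalanced data: always predicting the majority class gives high accuracy but zero recall for subscribers, and the macro scores are simply the average of one good class and one empty class. A small sketch with toy labels (the 88/12 split is an illustrative assumption, not the notebook's data):

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Toy labels: 88 'no' (0) and 12 'yes' (1)
y_true = np.array([0] * 88 + [1] * 12)
y_pred = np.zeros_like(y_true)   # what a most_frequent baseline predicts

acc = accuracy_score(y_true, y_pred)
recall_yes = recall_score(y_true, y_pred, pos_label=1, zero_division=0)
# Macro recall averages class 0 (recall 1.0) and class 1 (recall 0.0)
recall_macro = recall_score(y_true, y_pred, average='macro', zero_division=0)
# Macro precision averages class 0 (0.88) and class 1 (defined as 0 here)
precision_macro = precision_score(y_true, y_pred, average='macro', zero_division=0)
```

The same arithmetic explains the table: a macro precision of 0.441 is half of the 0.882 accuracy, and macro recall is exactly 0.5.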
# Helper function for making prediction and evaluating scores
def train_and_predict(n_splits, base_model, X, y, name, subscribe = 1, oversampling = False):
    features = X.columns
    X = np.array(X)
    y = np.array(y)
    folds = list(StratifiedKFold(n_splits = n_splits, shuffle = True, random_state = random_state).split(X, y))
    train_pred = np.zeros((X.shape[0], len(base_model)))
    accuracy = []
    precision_subscribe = []
    recall_subscribe = []
    f1_subscribe = []
    precision_macro = []
    recall_macro = []
    f1_macro = []
    for i, clf in enumerate(base_model):
        for j, (train, test) in enumerate(folds):
            # Creating train and test sets
            X_train = X[train]
            y_train = y[train]
            X_test = X[test]
            y_test = y[test]
            if oversampling:
                # Resample the training fold only; the test fold keeps its original class balance
                sm = SMOTE(random_state = random_state, sampling_strategy = 'minority')
                X_train_res, y_train_res = sm.fit_resample(X_train, y_train)
                # fit the model
                clf.fit(X_train_res, y_train_res)
                # Get predictions
                y_true, y_pred = y_test, clf.predict(X_test)
                # Evaluate train and test scores
                train_ = clf.score(X_train_res, y_train_res)
                test_ = clf.score(X_test, y_test)
            else:
                # fit the model
                clf.fit(X_train, y_train)
                # Get predictions
                y_true, y_pred = y_test, clf.predict(X_test)
                # Evaluate train and test scores
                train_ = clf.score(X_train, y_train)
                test_ = clf.score(X_test, y_test)
            # Other scores
            accuracy_ = accuracy_score(y_true, y_pred).round(3)
            precision_b = precision_score(y_true, y_pred, average = 'binary', pos_label = subscribe).round(3)
            recall_b = recall_score(y_true, y_pred, average = 'binary', pos_label = subscribe).round(3)
            f1_b = f1_score(y_true, y_pred, average = 'binary', pos_label = subscribe).round(3)
            precision_m = precision_score(y_true, y_pred, average = 'macro').round(3)
            recall_m = recall_score(y_true, y_pred, average = 'macro').round(3)
            f1_m = f1_score(y_true, y_pred, average = 'macro').round(3)
            print(f'Model- {name.capitalize()} and CV- {j}')
            print('-'*20)
            print('Training Score: {0:.3f}'.format(train_))
            print('Test Score: {0:.3f}'.format(test_))
            print('Accuracy Score: {0:.3f}'.format(accuracy_))
            print('Precision Score - Subscribe: {0:.3f}'.format(precision_b))
            print('Recall Score - Subscribe: {0:.3f}'.format(recall_b))
            print('f1 Score - Subscribe: {0:.3f}'.format(f1_b))
            print('Precision Score - Macro: {0:.3f}'.format(precision_m))
            print('Recall Score - Macro: {0:.3f}'.format(recall_m))
            print('f1 Score - Macro: {0:.3f}'.format(f1_m))
            print('\n')
            ## Appending scores
            accuracy.append(accuracy_)
            precision_subscribe.append(precision_b)
            recall_subscribe.append(recall_b)
            f1_subscribe.append(f1_b)
            precision_macro.append(precision_m)
            recall_macro.append(recall_m)
            f1_macro.append(f1_m)
    # Creating a dataframe of scores
    df = pd.DataFrame([np.mean(accuracy).round(3), np.mean(precision_subscribe).round(3),
                       np.mean(precision_macro).round(3), np.mean(recall_subscribe).round(3),
                       np.mean(recall_macro).round(3), np.mean(f1_subscribe).round(3),
                       np.mean(f1_macro).round(3)], columns = [name]).T
    df.columns = ['Accuracy', 'Precision_Subscribe', 'Precision_Macro',
                  'Recall_Subscribe', 'Recall_Macro', 'f1_Subscribe', 'f1_Macro']
    return df
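The helper oversamples inside each fold, on the training part only, so that synthetic rows never reach the fold used for scoring. A stripped-down sketch of that pattern on synthetic data; note that `random_oversample` is a hypothetical stand-in that duplicates minority rows at random, which is simpler than the SMOTE interpolation used in the helper:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import StratifiedKFold

def random_oversample(X, y, seed=0):
    """Duplicate minority rows at random until both classes have equal counts."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    minority = classes[np.argmin(counts)]
    extra = rng.choice(np.flatnonzero(y == minority),
                       size=counts.max() - counts.min(), replace=True)
    return np.vstack([X, X[extra]]), np.concatenate([y, y[extra]])

# Synthetic, imbalanced stand-in for the bank data
X, y = make_classification(n_samples=600, weights=[0.9, 0.1], random_state=0)

recalls = []
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train, test in skf.split(X, y):
    # Resample the training fold only
    X_res, y_res = random_oversample(X[train], y[train])
    clf = LogisticRegression(max_iter=1000).fit(X_res, y_res)
    # Score on the untouched test fold, which keeps its real class mix
    recalls.append(recall_score(y[test], clf.predict(X[test])))
```

Oversampling before splitting would instead place copies (or near-copies) of the same minority rows on both sides of the split and tend to inflate the scores.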
# Separating dependent and independent variables
from sklearn.preprocessing import RobustScaler
X = bank_imputed.drop(['Target'], axis = 1)
y = bank_imputed['Target']
# Applying RobustScaler to make it less prone to outliers
features = X.columns
scaler = RobustScaler()
X = pd.DataFrame(scaler.fit_transform(X), columns = features)
# Scaling the independent variables
Xs = X.apply(zscore)
display(X.shape, Xs.shape, y.shape)
(45211, 35)
(45211, 35)
(45211,)
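The scalers above are fitted on the full data set before cross-validation, so each test fold contributes to the scaling statistics. A hedged alternative, sketched on synthetic data, is to put the scaler in a pipeline so that it is refitted on the training part of every fold (the toy data and the choice of `cross_validate` are assumptions, not the notebook's setup):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler

# Synthetic, imbalanced stand-in for the bank data
X_toy, y_toy = make_classification(n_samples=500, weights=[0.85, 0.15], random_state=0)

# The scaler is fitted inside each training fold, then applied to the test fold
model = make_pipeline(RobustScaler(), LogisticRegression(max_iter=1000))
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_validate(model, X_toy, y_toy, cv=cv,
                        scoring=['accuracy', 'recall', 'f1'])
```

With 45211 rows the effect of the leakage is probably small, but the pipeline form costs little and keeps the evaluation clean.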
Logistic Regression, k-Nearest Neighbor and Naive Bayes. Oversampling the ones with better accuracy and recall score for subscribe.
# LR model without hyperparameter tuning
LR = LogisticRegression()
base_model = [LR]
n_splits = 5
df1 = train_and_predict(n_splits, base_model, X, y, 'Logistic Regression Without Hyperparameter Tuning')
df = pd.concat([df, df1])
df
Model- Logistic regression without hyperparameter tuning and CV- 0
--------------------
Training Score: 0.899
Test Score: 0.897
Accuracy Score: 0.897
Precision Score - Subscribe: 0.605
Recall Score - Subscribe: 0.337
f1 Score - Subscribe: 0.433
Precision Score - Macro: 0.761
Recall Score - Macro: 0.654
f1 Score - Macro: 0.688

Model- Logistic regression without hyperparameter tuning and CV- 1
--------------------
Training Score: 0.898
Test Score: 0.900
Accuracy Score: 0.900
Precision Score - Subscribe: 0.639
Recall Score - Subscribe: 0.335
f1 Score - Subscribe: 0.439
Precision Score - Macro: 0.778
Recall Score - Macro: 0.655
f1 Score - Macro: 0.692

Model- Logistic regression without hyperparameter tuning and CV- 2
--------------------
Training Score: 0.899
Test Score: 0.897
Accuracy Score: 0.897
Precision Score - Subscribe: 0.614
Recall Score - Subscribe: 0.319
f1 Score - Subscribe: 0.419
Precision Score - Macro: 0.764
Recall Score - Macro: 0.646
f1 Score - Macro: 0.681

Model- Logistic regression without hyperparameter tuning and CV- 3
--------------------
Training Score: 0.898
Test Score: 0.901
Accuracy Score: 0.901
Precision Score - Subscribe: 0.645
Recall Score - Subscribe: 0.341
f1 Score - Subscribe: 0.446
Precision Score - Macro: 0.781
Recall Score - Macro: 0.658
f1 Score - Macro: 0.696

Model- Logistic regression without hyperparameter tuning and CV- 4
--------------------
Training Score: 0.899
Test Score: 0.898
Accuracy Score: 0.898
Precision Score - Subscribe: 0.627
Recall Score - Subscribe: 0.325
f1 Score - Subscribe: 0.428
Precision Score - Macro: 0.771
Recall Score - Macro: 0.649
f1 Score - Macro: 0.686

Model | Accuracy | Precision_Subscribe | Precision_Macro | Recall_Subscribe | Recall_Macro | f1_Subscribe | f1_Macro |
---|---|---|---|---|---|---|---|
Baseline Model | 0.882 | 0.000 | 0.441 | 0.000 | 0.500 | 0.000 | 0.469 |
Logistic Regression Without Hyperparameter Tuning | 0.899 | 0.626 | 0.771 | 0.331 | 0.652 | 0.433 | 0.689 |
# LR with hyperparameter tuning
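# liblinear supports both the 'l1' and 'l2' penalties searched below; it is the solver
# the 'warn' default resolved to in the scikit-learn release that produced this output
LR = LogisticRegression(solver = 'liblinear', n_jobs = -1, random_state = random_state)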
params = {'penalty': ['l1', 'l2'], 'C': [0.001, 0.01, 0.1, 1, 10, 100], 'max_iter': [100, 110, 120, 130, 140]}
scoring = {'Recall': make_scorer(recall_score), 'f1_score': make_scorer(f1_score)}
skf = StratifiedKFold(n_splits = 10, shuffle = True, random_state = random_state)
LR_hyper = GridSearchCV(LR, param_grid = params, n_jobs = -1, cv = skf, scoring = scoring, refit = 'f1_score')
LR_hyper.fit(X_train, y_train)
print(LR_hyper.best_estimator_)
print(LR_hyper.best_params_)
LogisticRegression(C=100, class_weight=None, dual=False, fit_intercept=True, intercept_scaling=1, l1_ratio=None, max_iter=100, multi_class='warn', n_jobs=-1, penalty='l2', random_state=42, solver='warn', tol=0.0001, verbose=0, warm_start=False)
{'C': 100, 'max_iter': 100, 'penalty': 'l2'}
# LR model with hyperparameter tuning
# The 'warn' placeholders printed above are not valid arguments; solver = 'liblinear' is what they resolved to
LR_Hyper = LogisticRegression(C = 100, class_weight = None, dual = False, fit_intercept = True,
                              intercept_scaling = 1, max_iter = 100,
                              n_jobs = -1, penalty = 'l2', random_state = 42,
                              solver = 'liblinear', tol = 0.0001, verbose = 0, warm_start = False)
base_model = [LR_Hyper]
n_splits = 5
df1 = train_and_predict(n_splits, base_model, X, y, 'Logistic Regression With Hyperparameter Tuning')
df = pd.concat([df, df1])
df
Model- Logistic regression with hyperparameter tuning and CV- 0
--------------------
Training Score: 0.899
Test Score: 0.897
Accuracy Score: 0.897
Precision Score - Subscribe: 0.604
Recall Score - Subscribe: 0.337
f1 Score - Subscribe: 0.433
Precision Score - Macro: 0.761
Recall Score - Macro: 0.654
f1 Score - Macro: 0.688

Model- Logistic regression with hyperparameter tuning and CV- 1
--------------------
Training Score: 0.898
Test Score: 0.900
Accuracy Score: 0.900
Precision Score - Subscribe: 0.640
Recall Score - Subscribe: 0.336
f1 Score - Subscribe: 0.440
Precision Score - Macro: 0.778
Recall Score - Macro: 0.655
f1 Score - Macro: 0.693

Model- Logistic regression with hyperparameter tuning and CV- 2
--------------------
Training Score: 0.899
Test Score: 0.897
Accuracy Score: 0.897
Precision Score - Subscribe: 0.615
Recall Score - Subscribe: 0.319
f1 Score - Subscribe: 0.420
Precision Score - Macro: 0.765
Recall Score - Macro: 0.646
f1 Score - Macro: 0.682

Model- Logistic regression with hyperparameter tuning and CV- 3
--------------------
Training Score: 0.898
Test Score: 0.901
Accuracy Score: 0.901
Precision Score - Subscribe: 0.644
Recall Score - Subscribe: 0.340
f1 Score - Subscribe: 0.445
Precision Score - Macro: 0.781
Recall Score - Macro: 0.658
f1 Score - Macro: 0.695

Model- Logistic regression with hyperparameter tuning and CV- 4
--------------------
Training Score: 0.898
Test Score: 0.898
Accuracy Score: 0.898
Precision Score - Subscribe: 0.627
Recall Score - Subscribe: 0.325
f1 Score - Subscribe: 0.428
Precision Score - Macro: 0.771
Recall Score - Macro: 0.650
f1 Score - Macro: 0.686

Model | Accuracy | Precision_Subscribe | Precision_Macro | Recall_Subscribe | Recall_Macro | f1_Subscribe | f1_Macro |
---|---|---|---|---|---|---|---|
Baseline Model | 0.882 | 0.000 | 0.441 | 0.000 | 0.500 | 0.000 | 0.469 |
Logistic Regression Without Hyperparameter Tuning | 0.899 | 0.626 | 0.771 | 0.331 | 0.652 | 0.433 | 0.689 |
Logistic Regression With Hyperparameter Tuning | 0.899 | 0.626 | 0.771 | 0.331 | 0.653 | 0.433 | 0.689 |
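Tuned and untuned logistic regression score almost identically in the table above. One way to see how much each parameter combination matters is to look at `cv_results_` rather than only `best_params_`. A minimal sketch on synthetic data with a reduced grid (the toy data, the grid values and the 3-fold split are assumptions chosen to keep the example fast):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, make_scorer, recall_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic, imbalanced stand-in for the bank data
X_toy, y_toy = make_classification(n_samples=400, weights=[0.85, 0.15], random_state=0)

grid = GridSearchCV(
    LogisticRegression(solver='liblinear', random_state=0),
    param_grid={'penalty': ['l1', 'l2'], 'C': [0.01, 1, 100]},
    scoring={'Recall': make_scorer(recall_score), 'f1_score': make_scorer(f1_score)},
    refit='f1_score',   # with several scorers, refit names the one used to pick the best model
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=0),
)
grid.fit(X_toy, y_toy)

# One row per parameter combination, best f1 first
results = pd.DataFrame(grid.cv_results_)[
    ['param_penalty', 'param_C', 'mean_test_Recall', 'mean_test_f1_score', 'rank_test_f1_score']
].sort_values('rank_test_f1_score')
```

If the top few rows of such a table differ only in the third decimal, the search has little to offer over the defaults, which would be consistent with the near-identical rows above.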
# KNN Model after scaling the features without hyperparameter tuning
kNN = KNeighborsClassifier()
base_model = [kNN]
n_splits = 5
df1 = train_and_predict(n_splits, base_model, Xs, y, 'k-Nearest Neighbor Scaled Without Hyperparameter Tuning')
df = pd.concat([df, df1])
df
Model- K-nearest neighbor scaled without hyperparameter tuning and CV- 0
--------------------
Training Score: 0.916
Test Score: 0.892
Accuracy Score: 0.892
Precision Score - Subscribe: 0.575
Recall Score - Subscribe: 0.308
f1 Score - Subscribe: 0.401
Precision Score - Macro: 0.744
Recall Score - Macro: 0.639
f1 Score - Macro: 0.671

Model- K-nearest neighbor scaled without hyperparameter tuning and CV- 1
--------------------
Training Score: 0.916
Test Score: 0.889
Accuracy Score: 0.889
Precision Score - Subscribe: 0.545
Recall Score - Subscribe: 0.312
f1 Score - Subscribe: 0.397
Precision Score - Macro: 0.730
Recall Score - Macro: 0.639
f1 Score - Macro: 0.668

Model- K-nearest neighbor scaled without hyperparameter tuning and CV- 2
--------------------
Training Score: 0.916
Test Score: 0.890
Accuracy Score: 0.890
Precision Score - Subscribe: 0.562
Recall Score - Subscribe: 0.286
f1 Score - Subscribe: 0.379
Precision Score - Macro: 0.737
Recall Score - Macro: 0.628
f1 Score - Macro: 0.660

Model- K-nearest neighbor scaled without hyperparameter tuning and CV- 3
--------------------
Training Score: 0.916
Test Score: 0.892
Accuracy Score: 0.892
Precision Score - Subscribe: 0.571
Recall Score - Subscribe: 0.305
f1 Score - Subscribe: 0.398
Precision Score - Macro: 0.742
Recall Score - Macro: 0.637
f1 Score - Macro: 0.669

Model- K-nearest neighbor scaled without hyperparameter tuning and CV- 4
--------------------
Training Score: 0.917
Test Score: 0.892
Accuracy Score: 0.892
Precision Score - Subscribe: 0.578
Recall Score - Subscribe: 0.285
f1 Score - Subscribe: 0.381
Precision Score - Macro: 0.745
Recall Score - Macro: 0.629
f1 Score - Macro: 0.661

Model | Accuracy | Precision_Subscribe | Precision_Macro | Recall_Subscribe | Recall_Macro | f1_Subscribe | f1_Macro |
---|---|---|---|---|---|---|---|
Baseline Model | 0.882 | 0.000 | 0.441 | 0.000 | 0.500 | 0.000 | 0.469 |
Logistic Regression Without Hyperparameter Tuning | 0.899 | 0.626 | 0.771 | 0.331 | 0.652 | 0.433 | 0.689 |
Logistic Regression With Hyperparameter Tuning | 0.899 | 0.626 | 0.771 | 0.331 | 0.653 | 0.433 | 0.689 |
k-Nearest Neighbor Scaled Without Hyperparameter Tuning | 0.891 | 0.566 | 0.740 | 0.299 | 0.634 | 0.391 | 0.666 |
# Choosing a K Value
error_rate = {}
weights = ['uniform', 'distance']
for w in weights:
    print(w)
    rate = []
    for i in range(1, 40):
        knn = KNeighborsClassifier(n_neighbors = i, weights = w)
        knn.fit(X_train, y_train)
        y_pred = knn.predict(X_test)
        rate.append(np.mean(y_pred != y_test))
    plt.figure(figsize = (15, 7.2))
    plt.plot(range(1, 40), rate, color = 'blue', linestyle = 'dashed', marker = 'o',
             markerfacecolor = 'red', markersize = 10)
    plt.title('Error Rate vs K Value')
    plt.xlabel('K')
    plt.ylabel('Error Rate')
    plt.show()
uniform
distance
# KNN with hyperparameter tuning
kNN = KNeighborsClassifier(n_jobs = -1)
params = {'n_neighbors': list(range(3, 40, 2)), 'weights': ['uniform', 'distance']}
scoring = {'Recall': make_scorer(recall_score), 'f1_score': make_scorer(f1_score)}
skf = StratifiedKFold(n_splits = 3, shuffle = True, random_state = random_state)
kNN_hyper = GridSearchCV(kNN, param_grid = params, n_jobs = -1, cv = skf, scoring = scoring, refit = 'f1_score')
kNN_hyper.fit(X_train, y_train)
print(kNN_hyper.best_estimator_)
print(kNN_hyper.best_params_)
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski', metric_params=None, n_jobs=-1, n_neighbors=3, p=2, weights='distance')
{'n_neighbors': 3, 'weights': 'distance'}
# KNN with hyperparameter tuning
kNN_hyper = KNeighborsClassifier(algorithm = 'auto', leaf_size = 30, metric = 'minkowski', metric_params = None,
n_jobs = -1, n_neighbors = 3, p = 2, weights = 'distance')
base_model = [kNN_hyper]
n_splits = 5
df1 = train_and_predict(n_splits, base_model, Xs, y, 'k-Nearest Neighbor Scaled With Hyperparameter Tuning')
df = pd.concat([df, df1])
df
Model- K-nearest neighbor scaled with hyperparameter tuning and CV- 0
--------------------
Training Score: 1.000
Test Score: 0.890
Accuracy Score: 0.890
Precision Score - Subscribe: 0.544
Recall Score - Subscribe: 0.350
f1 Score - Subscribe: 0.426
Precision Score - Macro: 0.731
Recall Score - Macro: 0.655
f1 Score - Macro: 0.682

Model- K-nearest neighbor scaled with hyperparameter tuning and CV- 1
--------------------
Training Score: 1.000
Test Score: 0.885
Accuracy Score: 0.885
Precision Score - Subscribe: 0.514
Recall Score - Subscribe: 0.348
f1 Score - Subscribe: 0.415
Precision Score - Macro: 0.716
Recall Score - Macro: 0.652
f1 Score - Macro: 0.676

Model- K-nearest neighbor scaled with hyperparameter tuning and CV- 2
--------------------
Training Score: 1.000
Test Score: 0.886
Accuracy Score: 0.886
Precision Score - Subscribe: 0.518
Recall Score - Subscribe: 0.332
f1 Score - Subscribe: 0.405
Precision Score - Macro: 0.717
Recall Score - Macro: 0.645
f1 Score - Macro: 0.671

Model- K-nearest neighbor scaled with hyperparameter tuning and CV- 3
--------------------
Training Score: 1.000
Test Score: 0.888
Accuracy Score: 0.888
Precision Score - Subscribe: 0.534
Recall Score - Subscribe: 0.338
f1 Score - Subscribe: 0.414
Precision Score - Macro: 0.725
Recall Score - Macro: 0.650
f1 Score - Macro: 0.676

Model- K-nearest neighbor scaled with hyperparameter tuning and CV- 4
--------------------
Training Score: 1.000
Test Score: 0.887
Accuracy Score: 0.887
Precision Score - Subscribe: 0.524
Recall Score - Subscribe: 0.333
f1 Score - Subscribe: 0.407
Precision Score - Macro: 0.720
Recall Score - Macro: 0.646
f1 Score - Macro: 0.672

Model | Accuracy | Precision_Subscribe | Precision_Macro | Recall_Subscribe | Recall_Macro | f1_Subscribe | f1_Macro |
---|---|---|---|---|---|---|---|
Baseline Model | 0.882 | 0.000 | 0.441 | 0.000 | 0.500 | 0.000 | 0.469 |
Logistic Regression Without Hyperparameter Tuning | 0.899 | 0.626 | 0.771 | 0.331 | 0.652 | 0.433 | 0.689 |
Logistic Regression With Hyperparameter Tuning | 0.899 | 0.626 | 0.771 | 0.331 | 0.653 | 0.433 | 0.689 |
k-Nearest Neighbor Scaled Without Hyperparameter Tuning | 0.891 | 0.566 | 0.740 | 0.299 | 0.634 | 0.391 | 0.666 |
k-Nearest Neighbor Scaled With Hyperparameter Tuning | 0.887 | 0.527 | 0.722 | 0.340 | 0.650 | 0.413 | 0.675 |
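The training score of 1.000 for the tuned kNN is an artefact of weights = 'distance' rather than a sign of a better model: when scoring on the training data, each point is its own nearest neighbour at distance zero, so its own label decides the vote. A small sketch on synthetic data (the toy data and the split are assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data with some label noise, so a perfect score is not trivially expected
X_toy, y_toy = make_classification(n_samples=500, flip_y=0.1, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.3,
                                          random_state=0, stratify=y_toy)

scores = {}
for w in ['uniform', 'distance']:
    knn = KNeighborsClassifier(n_neighbors=3, weights=w).fit(X_tr, y_tr)
    # (training accuracy, test accuracy)
    scores[w] = (knn.score(X_tr, y_tr), knn.score(X_te, y_te))
```

For distance-weighted kNN the training score is therefore uninformative, and only the test-fold metrics should be compared across models.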
# Naive Bayes Model
NB = GaussianNB()
base_model = [NB]
n_splits = 5
df1 = train_and_predict(n_splits, base_model, X, y, 'Naive Bayes Classifier')
df = pd.concat([df, df1])
df
Model- Naive bayes classifier and CV- 0
--------------------
Training Score: 0.813
Test Score: 0.808
Accuracy Score: 0.808
Precision Score - Subscribe: 0.312
Recall Score - Subscribe: 0.533
f1 Score - Subscribe: 0.394
Precision Score - Macro: 0.622
Recall Score - Macro: 0.689
f1 Score - Macro: 0.640

Model- Naive bayes classifier and CV- 1
--------------------
Training Score: 0.815
Test Score: 0.812
Accuracy Score: 0.812
Precision Score - Subscribe: 0.311
Recall Score - Subscribe: 0.501
f1 Score - Subscribe: 0.384
Precision Score - Macro: 0.619
Recall Score - Macro: 0.677
f1 Score - Macro: 0.636

Model- Naive bayes classifier and CV- 2
--------------------
Training Score: 0.818
Test Score: 0.819
Accuracy Score: 0.819
Precision Score - Subscribe: 0.324
Recall Score - Subscribe: 0.505
f1 Score - Subscribe: 0.395
Precision Score - Macro: 0.627
Recall Score - Macro: 0.683
f1 Score - Macro: 0.644

Model- Naive bayes classifier and CV- 3
--------------------
Training Score: 0.814
Test Score: 0.821
Accuracy Score: 0.821
Precision Score - Subscribe: 0.334
Recall Score - Subscribe: 0.539
f1 Score - Subscribe: 0.413
Precision Score - Macro: 0.634
Recall Score - Macro: 0.698
f1 Score - Macro: 0.653

Model- Naive bayes classifier and CV- 4
--------------------
Training Score: 0.817
Test Score: 0.816
Accuracy Score: 0.816
Precision Score - Subscribe: 0.323
Recall Score - Subscribe: 0.525
f1 Score - Subscribe: 0.400
Precision Score - Macro: 0.627
Recall Score - Macro: 0.690
f1 Score - Macro: 0.646

Model | Accuracy | Precision_Subscribe | Precision_Macro | Recall_Subscribe | Recall_Macro | f1_Subscribe | f1_Macro |
---|---|---|---|---|---|---|---|
Baseline Model | 0.882 | 0.000 | 0.441 | 0.000 | 0.500 | 0.000 | 0.469 |
Logistic Regression Without Hyperparameter Tuning | 0.899 | 0.626 | 0.771 | 0.331 | 0.652 | 0.433 | 0.689 |
Logistic Regression With Hyperparameter Tuning | 0.899 | 0.626 | 0.771 | 0.331 | 0.653 | 0.433 | 0.689 |
k-Nearest Neighbor Scaled Without Hyperparameter Tuning | 0.891 | 0.566 | 0.740 | 0.299 | 0.634 | 0.391 | 0.666 |
k-Nearest Neighbor Scaled With Hyperparameter Tuning | 0.887 | 0.527 | 0.722 | 0.340 | 0.650 | 0.413 | 0.675 |
Naive Bayes Classifier | 0.815 | 0.321 | 0.626 | 0.521 | 0.687 | 0.397 | 0.644 |
# Naive Bayes with oversampling
NB_over = GaussianNB()
base_model = [NB_over]
n_splits = 5
df1 = train_and_predict(n_splits, base_model, X, y, 'Naive Bayes, Oversampled',
oversampling = True)
df = pd.concat([df, df1])
df
Model- Naive bayes, oversampled and CV- 0
--------------------
Training Score: 0.734
Test Score: 0.709
Accuracy Score: 0.709
Precision Score - Subscribe: 0.246
Recall Score - Subscribe: 0.722
f1 Score - Subscribe: 0.367
Precision Score - Macro: 0.598
Recall Score - Macro: 0.715
f1 Score - Macro: 0.589

Model- Naive bayes, oversampled and CV- 1
--------------------
Training Score: 0.733
Test Score: 0.713
Accuracy Score: 0.713
Precision Score - Subscribe: 0.244
Recall Score - Subscribe: 0.692
f1 Score - Subscribe: 0.361
Precision Score - Macro: 0.595
Recall Score - Macro: 0.704
f1 Score - Macro: 0.588

Model- Naive bayes, oversampled and CV- 2
--------------------
Training Score: 0.734
Test Score: 0.729
Accuracy Score: 0.729
Precision Score - Subscribe: 0.259
Recall Score - Subscribe: 0.709
f1 Score - Subscribe: 0.380
Precision Score - Macro: 0.605
Recall Score - Macro: 0.720
f1 Score - Macro: 0.603

Model- Naive bayes, oversampled and CV- 3
--------------------
Training Score: 0.733
Test Score: 0.717
Accuracy Score: 0.717
Precision Score - Subscribe: 0.251
Recall Score - Subscribe: 0.718
f1 Score - Subscribe: 0.373
Precision Score - Macro: 0.601
Recall Score - Macro: 0.718
f1 Score - Macro: 0.595

Model- Naive bayes, oversampled and CV- 4
--------------------
Training Score: 0.734
Test Score: 0.715
Accuracy Score: 0.715
Precision Score - Subscribe: 0.249
Recall Score - Subscribe: 0.715
f1 Score - Subscribe: 0.370
Precision Score - Macro: 0.600
Recall Score - Macro: 0.715
f1 Score - Macro: 0.593

Model | Accuracy | Precision_Subscribe | Precision_Macro | Recall_Subscribe | Recall_Macro | f1_Subscribe | f1_Macro |
---|---|---|---|---|---|---|---|
Baseline Model | 0.882 | 0.000 | 0.441 | 0.000 | 0.500 | 0.000 | 0.469 |
Logistic Regression Without Hyperparameter Tuning | 0.899 | 0.626 | 0.771 | 0.331 | 0.652 | 0.433 | 0.689 |
Logistic Regression With Hyperparameter Tuning | 0.899 | 0.626 | 0.771 | 0.331 | 0.653 | 0.433 | 0.689 |
k-Nearest Neighbor Scaled Without Hyperparameter Tuning | 0.891 | 0.566 | 0.740 | 0.299 | 0.634 | 0.391 | 0.666 |
k-Nearest Neighbor Scaled With Hyperparameter Tuning | 0.887 | 0.527 | 0.722 | 0.340 | 0.650 | 0.413 | 0.675 |
Naive Bayes Classifier | 0.815 | 0.321 | 0.626 | 0.521 | 0.687 | 0.397 | 0.644 |
Naive Bayes, Oversampled | 0.717 | 0.250 | 0.600 | 0.711 | 0.714 | 0.370 | 0.594 |
# LR model with oversampling
# solver = 'liblinear' supports the 'l1' penalty and replaces the invalid 'warn' placeholder
LR_over = LogisticRegression(C = 1, class_weight = None, dual = False, fit_intercept = True,
                             intercept_scaling = 1, max_iter = 100,
                             n_jobs = -1, penalty = 'l1', random_state = 42,
                             solver = 'liblinear', tol = 0.0001, verbose = 0, warm_start = False)
base_model = [LR_over]
n_splits = 5
df1 = train_and_predict(n_splits, base_model, X, y, 'Logistic Regression, Oversampled With Hyperparameter Tuning',
                        oversampling = True)
df = pd.concat([df, df1])
df
Model- Logistic regression, oversampled with hyperparameter tuning and CV- 0
--------------------
Training Score: 0.824
Test Score: 0.823
Accuracy Score: 0.823
Precision Score - Subscribe: 0.378
Recall Score - Subscribe: 0.790
f1 Score - Subscribe: 0.511
Precision Score - Macro: 0.673
Recall Score - Macro: 0.809
f1 Score - Macro: 0.702

Model- Logistic regression, oversampled with hyperparameter tuning and CV- 1
--------------------
Training Score: 0.824
Test Score: 0.832
Accuracy Score: 0.832
Precision Score - Subscribe: 0.391
Recall Score - Subscribe: 0.774
f1 Score - Subscribe: 0.520
Precision Score - Macro: 0.678
Recall Score - Macro: 0.807
f1 Score - Macro: 0.709

Model- Logistic regression, oversampled with hyperparameter tuning and CV- 2
--------------------
Training Score: 0.823
Test Score: 0.831
Accuracy Score: 0.831
Precision Score - Subscribe: 0.390
Recall Score - Subscribe: 0.787
f1 Score - Subscribe: 0.521
Precision Score - Macro: 0.679
Recall Score - Macro: 0.812
f1 Score - Macro: 0.709

Model- Logistic regression, oversampled with hyperparameter tuning and CV- 3
--------------------
Training Score: 0.823
Test Score: 0.833
Accuracy Score: 0.833
Precision Score - Subscribe: 0.393
Recall Score - Subscribe: 0.786
f1 Score - Subscribe: 0.524
Precision Score - Macro: 0.680
Recall Score - Macro: 0.813
f1 Score - Macro: 0.711

Model- Logistic regression, oversampled with hyperparameter tuning and CV- 4
--------------------
Training Score: 0.823
Test Score: 0.823
Accuracy Score: 0.823
Precision Score - Subscribe: 0.376
Recall Score - Subscribe: 0.783
f1 Score - Subscribe: 0.508
Precision Score - Macro: 0.671
Recall Score - Macro: 0.806
f1 Score - Macro: 0.700

Model | Accuracy | Precision_Subscribe | Precision_Macro | Recall_Subscribe | Recall_Macro | f1_Subscribe | f1_Macro |
---|---|---|---|---|---|---|---|
Baseline Model | 0.882 | 0.000 | 0.441 | 0.000 | 0.500 | 0.000 | 0.469 |
Logistic Regression Without Hyperparameter Tuning | 0.899 | 0.626 | 0.771 | 0.331 | 0.652 | 0.433 | 0.689 |
Logistic Regression With Hyperparameter Tuning | 0.899 | 0.626 | 0.771 | 0.331 | 0.653 | 0.433 | 0.689 |
k-Nearest Neighbor Scaled Without Hyperparameter Tuning | 0.891 | 0.566 | 0.740 | 0.299 | 0.634 | 0.391 | 0.666 |
k-Nearest Neighbor Scaled With Hyperparameter Tuning | 0.887 | 0.527 | 0.722 | 0.340 | 0.650 | 0.413 | 0.675 |
Naive Bayes Classifier | 0.815 | 0.321 | 0.626 | 0.521 | 0.687 | 0.397 | 0.644 |
Naive Bayes, Oversampled | 0.717 | 0.250 | 0.600 | 0.711 | 0.714 | 0.370 | 0.594 |
Logistic Regression, Oversampled With Hyperparameter Tuning | 0.828 | 0.386 | 0.676 | 0.784 | 0.809 | 0.517 | 0.706 |
Decision Tree Classifier, Bagging Classifier, AdaBoost Classifier, Gradient Boosting Classifier and Random Forest Classifier. Oversampling the ones with higher accuracy and better recall for subscribe.
# Decision Tree Classifier
DT = DecisionTreeClassifier(random_state = random_state)
base_model = [DT]
n_splits = 5
df1 = train_and_predict(n_splits, base_model, X, y, 'Decision Tree Classifier')
df = pd.concat([df, df1])
df
Model- Decision tree classifier and CV- 0
--------------------
Training Score: 1.000
Test Score: 0.874
Accuracy Score: 0.874
Precision Score - Subscribe: 0.464
Recall Score - Subscribe: 0.483
f1 Score - Subscribe: 0.473
Precision Score - Macro: 0.698
Recall Score - Macro: 0.705
f1 Score - Macro: 0.701

Model- Decision tree classifier and CV- 1
--------------------
Training Score: 1.000
Test Score: 0.876
Accuracy Score: 0.876
Precision Score - Subscribe: 0.471
Recall Score - Subscribe: 0.495
f1 Score - Subscribe: 0.483
Precision Score - Macro: 0.702
Recall Score - Macro: 0.711
f1 Score - Macro: 0.706

Model- Decision tree classifier and CV- 2
--------------------
Training Score: 1.000
Test Score: 0.871
Accuracy Score: 0.871
Precision Score - Subscribe: 0.449
Recall Score - Subscribe: 0.460
f1 Score - Subscribe: 0.455
Precision Score - Macro: 0.689
Recall Score - Macro: 0.693
f1 Score - Macro: 0.691

Model- Decision tree classifier and CV- 3
--------------------
Training Score: 1.000
Test Score: 0.879
Accuracy Score: 0.879
Precision Score - Subscribe: 0.482
Recall Score - Subscribe: 0.484
f1 Score - Subscribe: 0.483
Precision Score - Macro: 0.707
Recall Score - Macro: 0.707
f1 Score - Macro: 0.707

Model- Decision tree classifier and CV- 4
--------------------
Training Score: 1.000
Test Score: 0.875
Accuracy Score: 0.875
Precision Score - Subscribe: 0.466
Recall Score - Subscribe: 0.467
f1 Score - Subscribe: 0.466
Precision Score - Macro: 0.698
Recall Score - Macro: 0.698
f1 Score - Macro: 0.698

Model | Accuracy | Precision_Subscribe | Precision_Macro | Recall_Subscribe | Recall_Macro | f1_Subscribe | f1_Macro |
---|---|---|---|---|---|---|---|
Baseline Model | 0.882 | 0.000 | 0.441 | 0.000 | 0.500 | 0.000 | 0.469 |
Logistic Regression Without Hyperparameter Tuning | 0.899 | 0.626 | 0.771 | 0.331 | 0.652 | 0.433 | 0.689 |
Logistic Regression With Hyperparameter Tuning | 0.899 | 0.626 | 0.771 | 0.331 | 0.653 | 0.433 | 0.689 |
k-Nearest Neighbor Scaled Without Hyperparameter Tuning | 0.891 | 0.566 | 0.740 | 0.299 | 0.634 | 0.391 | 0.666 |
k-Nearest Neighbor Scaled With Hyperparameter Tuning | 0.887 | 0.527 | 0.722 | 0.340 | 0.650 | 0.413 | 0.675 |
Naive Bayes Classifier | 0.815 | 0.321 | 0.626 | 0.521 | 0.687 | 0.397 | 0.644 |
Naive Bayes, Oversampled | 0.717 | 0.250 | 0.600 | 0.711 | 0.714 | 0.370 | 0.594 |
Logistic Regression, Oversampled With Hyperparameter Tuning | 0.828 | 0.386 | 0.676 | 0.784 | 0.809 | 0.517 | 0.706 |
Decision Tree Classifier | 0.875 | 0.466 | 0.699 | 0.478 | 0.703 | 0.472 | 0.701 |
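The unpruned tree above scores 1.000 on every training fold but about 0.875 on the test folds, which is the usual signature of overfitting; capping `max_depth` gives up some training fit in exchange for a smaller gap. The following is a minimal sketch of that comparison on synthetic data, not the notebook's own code: `X_demo`, `y_demo` and the `make_classification` settings are assumptions chosen only to mimic a roughly 88/12 class split.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in for the bank data (assumption, not the real dataset).
X_demo, y_demo = make_classification(n_samples=2000, n_features=10, n_informative=4,
                                     weights=[0.88, 0.12], flip_y=0.05, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3,
                                          stratify=y_demo, random_state=42)

# Compare a depth-limited tree with a fully grown one (max_depth=None).
scores = {}
for depth in [3, None]:
    tree = DecisionTreeClassifier(max_depth=depth, random_state=42).fit(X_tr, y_tr)
    scores[depth] = (tree.score(X_tr, y_tr), tree.score(X_te, y_te))
    print('max_depth=%s  train=%.3f  test=%.3f' % (depth, scores[depth][0], scores[depth][1]))
```

A large difference between the two numbers for `max_depth=None` is the pattern to look for; the exact values depend on the data.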
# Decision Tree Classifier with hyperparameter tuning
dt_hyper = DecisionTreeClassifier(max_depth = 3, random_state = random_state)
base_model = [dt_hyper]
n_splits = 5
df1 = train_and_predict(n_splits, base_model, X, y, 'Decision Tree Classifier - Reducing Max Depth')
df = pd.concat([df, df1])
df
Model- Decision tree classifier - reducing max depth and CV- 0 -------------------- Training Score: 0.899 Test Score: 0.898 Accuracy Score: 0.898 Precision Score - Subscribe: 0.634 Recall Score - Subscribe: 0.300 f1 Score - Subscribe: 0.407 Precision Score - Macro: 0.774 Recall Score - Macro: 0.638 f1 Score - Macro: 0.676
Model- Decision tree classifier - reducing max depth and CV- 1 -------------------- Training Score: 0.899 Test Score: 0.901 Accuracy Score: 0.901 Precision Score - Subscribe: 0.698 Recall Score - Subscribe: 0.270 f1 Score - Subscribe: 0.390 Precision Score - Macro: 0.804 Recall Score - Macro: 0.627 f1 Score - Macro: 0.668
Model- Decision tree classifier - reducing max depth and CV- 2 -------------------- Training Score: 0.900 Test Score: 0.897 Accuracy Score: 0.897 Precision Score - Subscribe: 0.624 Recall Score - Subscribe: 0.292 f1 Score - Subscribe: 0.398 Precision Score - Macro: 0.768 Recall Score - Macro: 0.634 f1 Score - Macro: 0.671
Model- Decision tree classifier - reducing max depth and CV- 3 -------------------- Training Score: 0.899 Test Score: 0.901 Accuracy Score: 0.901 Precision Score - Subscribe: 0.666 Recall Score - Subscribe: 0.307 f1 Score - Subscribe: 0.420 Precision Score - Macro: 0.790 Recall Score - Macro: 0.643 f1 Score - Macro: 0.683
Model- Decision tree classifier - reducing max depth and CV- 4 -------------------- Training Score: 0.900 Test Score: 0.899 Accuracy Score: 0.899 Precision Score - Subscribe: 0.680 Recall Score - Subscribe: 0.255 f1 Score - Subscribe: 0.371 Precision Score - Macro: 0.795 Recall Score - Macro: 0.620 f1 Score - Macro: 0.658
Model | Accuracy | Precision_Subscribe | Precision_Macro | Recall_Subscribe | Recall_Macro | f1_Subscribe | f1_Macro |
---|---|---|---|---|---|---|---|
Baseline Model | 0.882 | 0.000 | 0.441 | 0.000 | 0.500 | 0.000 | 0.469 |
Logistic Regression Without Hyperparameter Tuning | 0.899 | 0.626 | 0.771 | 0.331 | 0.652 | 0.433 | 0.689 |
Logistic Regression With Hyperparameter Tuning | 0.899 | 0.626 | 0.771 | 0.331 | 0.653 | 0.433 | 0.689 |
k-Nearest Neighbor Scaled Without Hyperparameter Tuning | 0.891 | 0.566 | 0.740 | 0.299 | 0.634 | 0.391 | 0.666 |
k-Nearest Neighbor Scaled With Hyperparameter Tuning | 0.887 | 0.527 | 0.722 | 0.340 | 0.650 | 0.413 | 0.675 |
Naive Bayes Classifier | 0.815 | 0.321 | 0.626 | 0.521 | 0.687 | 0.397 | 0.644 |
Naive Bayes, Oversampled | 0.717 | 0.250 | 0.600 | 0.711 | 0.714 | 0.370 | 0.594 |
Logistic Regression, Oversampled With Hyperparameter Tuning | 0.828 | 0.386 | 0.676 | 0.784 | 0.809 | 0.517 | 0.706 |
Decision Tree Classifier | 0.875 | 0.466 | 0.699 | 0.478 | 0.703 | 0.472 | 0.701 |
Decision Tree Classifier - Reducing Max Depth | 0.899 | 0.660 | 0.786 | 0.285 | 0.632 | 0.397 | 0.671 |
# Fit the depth-3 tree on the full data and render it with Graphviz
dt_hyper = DecisionTreeClassifier(max_depth = 3, random_state = random_state)
dt_hyper.fit(X, y)
# export_graphviz writes decisiontree.dot itself, so no separate open()/close() is needed
export_graphviz(dt_hyper, out_file = 'decisiontree.dot', feature_names = X.columns,
                class_names = ['No', 'Yes'], rounded = True, proportion = False, filled = True)
retCode = system('dot -Tpng decisiontree.dot -o decisiontree.png')
if retCode > 0:
    print('system command returning error: ' + str(retCode))
else:
    display(Image('decisiontree.png'))
print('Feature Importance for Decision Tree Classifier ', '--'*38)
# Sort once, ascending, so the most important feature is drawn at the top of the bar chart
feature_importances = pd.DataFrame(dt_hyper.feature_importances_, index = X.columns,
                                   columns = ['Importance']).sort_values('Importance', ascending = True)
feature_importances.plot(kind = 'barh', figsize = (15, 7.2))
Feature Importance for Decision Tree Classifier --------------------------------------------------------------------------------
[Figure: horizontal bar chart of feature importances for the depth-3 decision tree]
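`feature_importances_` on a fitted tree reports impurity-based importances, normalised to sum to 1. As a cross-check, permutation importance measures how much the score drops when a single column is shuffled. A minimal sketch on synthetic data follows; the names `X_demo`, `y_demo` and the generated dataset are assumptions for illustration, not the bank data.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.inspection import permutation_importance
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with named columns (assumption, not the real dataset).
X_arr, y_demo = make_classification(n_samples=1000, n_features=6, n_informative=3,
                                    random_state=42)
X_demo = pd.DataFrame(X_arr, columns=['f%d' % i for i in range(6)])

tree = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X_demo, y_demo)

# Impurity-based importances, as plotted in the notebook.
impurity_imp = pd.Series(tree.feature_importances_, index=X_demo.columns)

# Permutation importances, computed on the training data here for brevity.
perm = permutation_importance(tree, X_demo, y_demo, n_repeats=5, random_state=42)
perm_imp = pd.Series(perm.importances_mean, index=X_demo.columns)

comparison = pd.DataFrame({'Impurity': impurity_imp, 'Permutation': perm_imp})
print(comparison.sort_values('Impurity', ascending=False))
```

If the two columns rank the features very differently, the impurity-based chart is worth treating with some caution.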
# Bagging Classifier
bgcl = BaggingClassifier(base_estimator = DecisionTreeClassifier(max_depth = 3, random_state = random_state),
                         n_estimators = 50, random_state = random_state)
base_model = [bgcl]
n_splits = 5
df1 = train_and_predict(n_splits, base_model, X, y, 'Bagging Classifier')
df = pd.concat([df, df1])
df
Model- Bagging classifier and CV- 0 -------------------- Training Score: 0.900 Test Score: 0.898 Accuracy Score: 0.898 Precision Score - Subscribe: 0.648 Recall Score - Subscribe: 0.289 f1 Score - Subscribe: 0.400 Precision Score - Macro: 0.780 Recall Score - Macro: 0.634 f1 Score - Macro: 0.672
Model- Bagging classifier and CV- 1 -------------------- Training Score: 0.900 Test Score: 0.900 Accuracy Score: 0.900 Precision Score - Subscribe: 0.678 Recall Score - Subscribe: 0.284 f1 Score - Subscribe: 0.401 Precision Score - Macro: 0.795 Recall Score - Macro: 0.633 f1 Score - Macro: 0.673
Model- Bagging classifier and CV- 2 -------------------- Training Score: 0.900 Test Score: 0.897 Accuracy Score: 0.897 Precision Score - Subscribe: 0.629 Recall Score - Subscribe: 0.283 f1 Score - Subscribe: 0.390 Precision Score - Macro: 0.770 Recall Score - Macro: 0.630 f1 Score - Macro: 0.667
Model- Bagging classifier and CV- 3 -------------------- Training Score: 0.900 Test Score: 0.900 Accuracy Score: 0.900 Precision Score - Subscribe: 0.677 Recall Score - Subscribe: 0.284 f1 Score - Subscribe: 0.400 Precision Score - Macro: 0.795 Recall Score - Macro: 0.633 f1 Score - Macro: 0.673
Model- Bagging classifier and CV- 4 -------------------- Training Score: 0.900 Test Score: 0.899 Accuracy Score: 0.899 Precision Score - Subscribe: 0.668 Recall Score - Subscribe: 0.272 f1 Score - Subscribe: 0.387 Precision Score - Macro: 0.789 Recall Score - Macro: 0.627 f1 Score - Macro: 0.666
Model | Accuracy | Precision_Subscribe | Precision_Macro | Recall_Subscribe | Recall_Macro | f1_Subscribe | f1_Macro |
---|---|---|---|---|---|---|---|
Baseline Model | 0.882 | 0.000 | 0.441 | 0.000 | 0.500 | 0.000 | 0.469 |
Logistic Regression Without Hyperparameter Tuning | 0.899 | 0.626 | 0.771 | 0.331 | 0.652 | 0.433 | 0.689 |
Logistic Regression With Hyperparameter Tuning | 0.899 | 0.626 | 0.771 | 0.331 | 0.653 | 0.433 | 0.689 |
k-Nearest Neighbor Scaled Without Hyperparameter Tuning | 0.891 | 0.566 | 0.740 | 0.299 | 0.634 | 0.391 | 0.666 |
k-Nearest Neighbor Scaled With Hyperparameter Tuning | 0.887 | 0.527 | 0.722 | 0.340 | 0.650 | 0.413 | 0.675 |
Naive Bayes Classifier | 0.815 | 0.321 | 0.626 | 0.521 | 0.687 | 0.397 | 0.644 |
Naive Bayes, Oversampled | 0.717 | 0.250 | 0.600 | 0.711 | 0.714 | 0.370 | 0.594 |
Logistic Regression, Oversampled With Hyperparameter Tuning | 0.828 | 0.386 | 0.676 | 0.784 | 0.809 | 0.517 | 0.706 |
Decision Tree Classifier | 0.875 | 0.466 | 0.699 | 0.478 | 0.703 | 0.472 | 0.701 |
Decision Tree Classifier - Reducing Max Depth | 0.899 | 0.660 | 0.786 | 0.285 | 0.632 | 0.397 | 0.671 |
Bagging Classifier | 0.899 | 0.660 | 0.786 | 0.282 | 0.631 | 0.396 | 0.670 |
# AdaBoost Classifier
abcl = AdaBoostClassifier(n_estimators = 10, random_state = random_state)
base_model = [abcl]
n_splits = 5
df1 = train_and_predict(n_splits, base_model, X, y, 'AdaBoost Classifier')
df = pd.concat([df, df1])
df
Model- Adaboost classifier and CV- 0 -------------------- Training Score: 0.891 Test Score: 0.889 Accuracy Score: 0.889 Precision Score - Subscribe: 0.536 Recall Score - Subscribe: 0.350 f1 Score - Subscribe: 0.423 Precision Score - Macro: 0.727 Recall Score - Macro: 0.655 f1 Score - Macro: 0.681
Model- Adaboost classifier and CV- 1 -------------------- Training Score: 0.896 Test Score: 0.897 Accuracy Score: 0.897 Precision Score - Subscribe: 0.609 Recall Score - Subscribe: 0.335 f1 Score - Subscribe: 0.432 Precision Score - Macro: 0.763 Recall Score - Macro: 0.653 f1 Score - Macro: 0.688
Model- Adaboost classifier and CV- 2 -------------------- Training Score: 0.890 Test Score: 0.889 Accuracy Score: 0.889 Precision Score - Subscribe: 0.534 Recall Score - Subscribe: 0.391 f1 Score - Subscribe: 0.451 Precision Score - Macro: 0.728 Recall Score - Macro: 0.673 f1 Score - Macro: 0.695
Model- Adaboost classifier and CV- 3 -------------------- Training Score: 0.890 Test Score: 0.889 Accuracy Score: 0.889 Precision Score - Subscribe: 0.534 Recall Score - Subscribe: 0.390 f1 Score - Subscribe: 0.451 Precision Score - Macro: 0.728 Recall Score - Macro: 0.673 f1 Score - Macro: 0.694
Model- Adaboost classifier and CV- 4 -------------------- Training Score: 0.891 Test Score: 0.892 Accuracy Score: 0.892 Precision Score - Subscribe: 0.556 Recall Score - Subscribe: 0.379 f1 Score - Subscribe: 0.451 Precision Score - Macro: 0.739 Recall Score - Macro: 0.670 f1 Score - Macro: 0.696
Model | Accuracy | Precision_Subscribe | Precision_Macro | Recall_Subscribe | Recall_Macro | f1_Subscribe | f1_Macro |
---|---|---|---|---|---|---|---|
Baseline Model | 0.882 | 0.000 | 0.441 | 0.000 | 0.500 | 0.000 | 0.469 |
Logistic Regression Without Hyperparameter Tuning | 0.899 | 0.626 | 0.771 | 0.331 | 0.652 | 0.433 | 0.689 |
Logistic Regression With Hyperparameter Tuning | 0.899 | 0.626 | 0.771 | 0.331 | 0.653 | 0.433 | 0.689 |
k-Nearest Neighbor Scaled Without Hyperparameter Tuning | 0.891 | 0.566 | 0.740 | 0.299 | 0.634 | 0.391 | 0.666 |
k-Nearest Neighbor Scaled With Hyperparameter Tuning | 0.887 | 0.527 | 0.722 | 0.340 | 0.650 | 0.413 | 0.675 |
Naive Bayes Classifier | 0.815 | 0.321 | 0.626 | 0.521 | 0.687 | 0.397 | 0.644 |
Naive Bayes, Oversampled | 0.717 | 0.250 | 0.600 | 0.711 | 0.714 | 0.370 | 0.594 |
Logistic Regression, Oversampled With Hyperparameter Tuning | 0.828 | 0.386 | 0.676 | 0.784 | 0.809 | 0.517 | 0.706 |
Decision Tree Classifier | 0.875 | 0.466 | 0.699 | 0.478 | 0.703 | 0.472 | 0.701 |
Decision Tree Classifier - Reducing Max Depth | 0.899 | 0.660 | 0.786 | 0.285 | 0.632 | 0.397 | 0.671 |
Bagging Classifier | 0.899 | 0.660 | 0.786 | 0.282 | 0.631 | 0.396 | 0.670 |
AdaBoost Classifier | 0.891 | 0.554 | 0.737 | 0.369 | 0.665 | 0.442 | 0.691 |
# Gradient Boosting Classifier
gbcl = GradientBoostingClassifier(n_estimators = 50, random_state = random_state)
base_model = [gbcl]
n_splits = 5
df1 = train_and_predict(n_splits, base_model, X, y, 'Gradient Boosting Classifier')
df = pd.concat([df, df1])
df
Model- Gradient boosting classifier and CV- 0 -------------------- Training Score: 0.905 Test Score: 0.899 Accuracy Score: 0.899 Precision Score - Subscribe: 0.636 Recall Score - Subscribe: 0.325 f1 Score - Subscribe: 0.430 Precision Score - Macro: 0.776 Recall Score - Macro: 0.650 f1 Score - Macro: 0.688
Model- Gradient boosting classifier and CV- 1 -------------------- Training Score: 0.904 Test Score: 0.902 Accuracy Score: 0.902 Precision Score - Subscribe: 0.667 Recall Score - Subscribe: 0.319 f1 Score - Subscribe: 0.431 Precision Score - Macro: 0.791 Recall Score - Macro: 0.649 f1 Score - Macro: 0.689
Model- Gradient boosting classifier and CV- 2 -------------------- Training Score: 0.904 Test Score: 0.901 Accuracy Score: 0.901 Precision Score - Subscribe: 0.668 Recall Score - Subscribe: 0.312 f1 Score - Subscribe: 0.425 Precision Score - Macro: 0.791 Recall Score - Macro: 0.646 f1 Score - Macro: 0.686
Model- Gradient boosting classifier and CV- 3 -------------------- Training Score: 0.903 Test Score: 0.904 Accuracy Score: 0.904 Precision Score - Subscribe: 0.698 Recall Score - Subscribe: 0.317 f1 Score - Subscribe: 0.436 Precision Score - Macro: 0.807 Recall Score - Macro: 0.649 f1 Score - Macro: 0.692
Model- Gradient boosting classifier and CV- 4 -------------------- Training Score: 0.905 Test Score: 0.903 Accuracy Score: 0.903 Precision Score - Subscribe: 0.678 Recall Score - Subscribe: 0.321 f1 Score - Subscribe: 0.435 Precision Score - Macro: 0.797 Recall Score - Macro: 0.650 f1 Score - Macro: 0.691
Model | Accuracy | Precision_Subscribe | Precision_Macro | Recall_Subscribe | Recall_Macro | f1_Subscribe | f1_Macro |
---|---|---|---|---|---|---|---|
Baseline Model | 0.882 | 0.000 | 0.441 | 0.000 | 0.500 | 0.000 | 0.469 |
Logistic Regression Without Hyperparameter Tuning | 0.899 | 0.626 | 0.771 | 0.331 | 0.652 | 0.433 | 0.689 |
Logistic Regression With Hyperparameter Tuning | 0.899 | 0.626 | 0.771 | 0.331 | 0.653 | 0.433 | 0.689 |
k-Nearest Neighbor Scaled Without Hyperparameter Tuning | 0.891 | 0.566 | 0.740 | 0.299 | 0.634 | 0.391 | 0.666 |
k-Nearest Neighbor Scaled With Hyperparameter Tuning | 0.887 | 0.527 | 0.722 | 0.340 | 0.650 | 0.413 | 0.675 |
Naive Bayes Classifier | 0.815 | 0.321 | 0.626 | 0.521 | 0.687 | 0.397 | 0.644 |
Naive Bayes, Oversampled | 0.717 | 0.250 | 0.600 | 0.711 | 0.714 | 0.370 | 0.594 |
Logistic Regression, Oversampled With Hyperparameter Tuning | 0.828 | 0.386 | 0.676 | 0.784 | 0.809 | 0.517 | 0.706 |
Decision Tree Classifier | 0.875 | 0.466 | 0.699 | 0.478 | 0.703 | 0.472 | 0.701 |
Decision Tree Classifier - Reducing Max Depth | 0.899 | 0.660 | 0.786 | 0.285 | 0.632 | 0.397 | 0.671 |
Bagging Classifier | 0.899 | 0.660 | 0.786 | 0.282 | 0.631 | 0.396 | 0.670 |
AdaBoost Classifier | 0.891 | 0.554 | 0.737 | 0.369 | 0.665 | 0.442 | 0.691 |
Gradient Boosting Classifier | 0.902 | 0.669 | 0.792 | 0.319 | 0.649 | 0.431 | 0.689 |
# AdaBoost Classifier, Oversampled
abcl_over = AdaBoostClassifier(n_estimators = 15, random_state = random_state, learning_rate = 0.3)
base_model = [abcl_over]
n_splits = 5
df1 = train_and_predict(n_splits, base_model, X, y, 'AdaBoost Classifier, Oversampled', oversampling = True)
df = pd.concat([df, df1])
df
Model- Adaboost classifier, oversampled and CV- 0 -------------------- Training Score: 0.838 Test Score: 0.793 Accuracy Score: 0.793 Precision Score - Subscribe: 0.335 Recall Score - Subscribe: 0.783 f1 Score - Subscribe: 0.470 Precision Score - Macro: 0.650 Recall Score - Macro: 0.789 f1 Score - Macro: 0.671
Model- Adaboost classifier, oversampled and CV- 1 -------------------- Training Score: 0.824 Test Score: 0.820 Accuracy Score: 0.820 Precision Score - Subscribe: 0.368 Recall Score - Subscribe: 0.752 f1 Score - Subscribe: 0.494 Precision Score - Macro: 0.665 Recall Score - Macro: 0.790 f1 Score - Macro: 0.692
Model- Adaboost classifier, oversampled and CV- 2 -------------------- Training Score: 0.838 Test Score: 0.805 Accuracy Score: 0.805 Precision Score - Subscribe: 0.353 Recall Score - Subscribe: 0.806 f1 Score - Subscribe: 0.491 Precision Score - Macro: 0.661 Recall Score - Macro: 0.805 f1 Score - Macro: 0.685
Model- Adaboost classifier, oversampled and CV- 3 -------------------- Training Score: 0.837 Test Score: 0.823 Accuracy Score: 0.823 Precision Score - Subscribe: 0.375 Recall Score - Subscribe: 0.769 f1 Score - Subscribe: 0.504 Precision Score - Macro: 0.670 Recall Score - Macro: 0.800 f1 Score - Macro: 0.698
Model- Adaboost classifier, oversampled and CV- 4 -------------------- Training Score: 0.845 Test Score: 0.812 Accuracy Score: 0.812 Precision Score - Subscribe: 0.363 Recall Score - Subscribe: 0.805 f1 Score - Subscribe: 0.500 Precision Score - Macro: 0.666 Recall Score - Macro: 0.809 f1 Score - Macro: 0.692
Model | Accuracy | Precision_Subscribe | Precision_Macro | Recall_Subscribe | Recall_Macro | f1_Subscribe | f1_Macro |
---|---|---|---|---|---|---|---|
Baseline Model | 0.882 | 0.000 | 0.441 | 0.000 | 0.500 | 0.000 | 0.469 |
Logistic Regression Without Hyperparameter Tuning | 0.899 | 0.626 | 0.771 | 0.331 | 0.652 | 0.433 | 0.689 |
Logistic Regression With Hyperparameter Tuning | 0.899 | 0.626 | 0.771 | 0.331 | 0.653 | 0.433 | 0.689 |
k-Nearest Neighbor Scaled Without Hyperparameter Tuning | 0.891 | 0.566 | 0.740 | 0.299 | 0.634 | 0.391 | 0.666 |
k-Nearest Neighbor Scaled With Hyperparameter Tuning | 0.887 | 0.527 | 0.722 | 0.340 | 0.650 | 0.413 | 0.675 |
Naive Bayes Classifier | 0.815 | 0.321 | 0.626 | 0.521 | 0.687 | 0.397 | 0.644 |
Naive Bayes, Oversampled | 0.717 | 0.250 | 0.600 | 0.711 | 0.714 | 0.370 | 0.594 |
Logistic Regression, Oversampled With Hyperparameter Tuning | 0.828 | 0.386 | 0.676 | 0.784 | 0.809 | 0.517 | 0.706 |
Decision Tree Classifier | 0.875 | 0.466 | 0.699 | 0.478 | 0.703 | 0.472 | 0.701 |
Decision Tree Classifier - Reducing Max Depth | 0.899 | 0.660 | 0.786 | 0.285 | 0.632 | 0.397 | 0.671 |
Bagging Classifier | 0.899 | 0.660 | 0.786 | 0.282 | 0.631 | 0.396 | 0.670 |
AdaBoost Classifier | 0.891 | 0.554 | 0.737 | 0.369 | 0.665 | 0.442 | 0.691 |
Gradient Boosting Classifier | 0.902 | 0.669 | 0.792 | 0.319 | 0.649 | 0.431 | 0.689 |
AdaBoost Classifier, Oversampled | 0.811 | 0.359 | 0.662 | 0.783 | 0.799 | 0.492 | 0.688 |
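In the table above, the oversampled AdaBoost run reaches a subscribe recall of 0.783 against 0.369 for the plain run, at the cost of accuracy (0.811 against 0.891). The implementation of `train_and_predict` is not shown in this section, so the following is only a sketch of one common way to oversample inside cross-validation: the minority class is resampled in the training fold only, and each test fold keeps its original class balance. The synthetic data and the helper `oversample_minority` are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import recall_score
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils import resample

# Synthetic stand-in for the bank data: class 1 is the minority (assumption).
X_demo, y_demo = make_classification(n_samples=2000, n_features=10,
                                     weights=[0.88, 0.12], random_state=42)

def oversample_minority(X_train, y_train, random_state=42):
    """Duplicate minority-class rows at random until both classes have equal size."""
    minority = y_train == 1
    n_major = int((~minority).sum())
    X_up, y_up = resample(X_train[minority], y_train[minority], replace=True,
                          n_samples=n_major, random_state=random_state)
    X_bal = np.vstack([X_train[~minority], X_up])
    y_bal = np.concatenate([y_train[~minority], y_up])
    return X_bal, y_bal

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
fold_recalls = []
for train_idx, test_idx in skf.split(X_demo, y_demo):
    # Resample the training fold only; the test fold is left untouched.
    X_bal, y_bal = oversample_minority(X_demo[train_idx], y_demo[train_idx])
    model = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X_bal, y_bal)
    fold_recalls.append(recall_score(y_demo[test_idx], model.predict(X_demo[test_idx])))

print('Mean recall (positive class): %.3f' % np.mean(fold_recalls))
```

Resampling before the split would let copies of the same row land in both training and test folds, which tends to inflate the test scores.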
# Random Forest Classifier
rfc = RandomForestClassifier(n_jobs = -1, random_state = random_state)
base_model = [rfc]
n_splits = 5
df1 = train_and_predict(n_splits, base_model, X, y, 'Random Forest Classifier')
df = pd.concat([df, df1])
df
Model- Random forest classifier and CV- 0 -------------------- Training Score: 0.991 Test Score: 0.898 Accuracy Score: 0.898 Precision Score - Subscribe: 0.623 Recall Score - Subscribe: 0.323 f1 Score - Subscribe: 0.426 Precision Score - Macro: 0.769 Recall Score - Macro: 0.649 f1 Score - Macro: 0.685
Model- Random forest classifier and CV- 1 -------------------- Training Score: 0.992 Test Score: 0.896 Accuracy Score: 0.896 Precision Score - Subscribe: 0.604 Recall Score - Subscribe: 0.313 f1 Score - Subscribe: 0.412 Precision Score - Macro: 0.759 Recall Score - Macro: 0.643 f1 Score - Macro: 0.677
Model- Random forest classifier and CV- 2 -------------------- Training Score: 0.991 Test Score: 0.894 Accuracy Score: 0.894 Precision Score - Subscribe: 0.600 Recall Score - Subscribe: 0.292 f1 Score - Subscribe: 0.393 Precision Score - Macro: 0.756 Recall Score - Macro: 0.633 f1 Score - Macro: 0.668
Model- Random forest classifier and CV- 3 -------------------- Training Score: 0.991 Test Score: 0.900 Accuracy Score: 0.900 Precision Score - Subscribe: 0.648 Recall Score - Subscribe: 0.325 f1 Score - Subscribe: 0.433 Precision Score - Macro: 0.782 Recall Score - Macro: 0.651 f1 Score - Macro: 0.689
Model- Random forest classifier and CV- 4 -------------------- Training Score: 0.992 Test Score: 0.896 Accuracy Score: 0.896 Precision Score - Subscribe: 0.622 Recall Score - Subscribe: 0.289 f1 Score - Subscribe: 0.394 Precision Score - Macro: 0.767 Recall Score - Macro: 0.633 f1 Score - Macro: 0.669
Model | Accuracy | Precision_Subscribe | Precision_Macro | Recall_Subscribe | Recall_Macro | f1_Subscribe | f1_Macro |
---|---|---|---|---|---|---|---|
Baseline Model | 0.882 | 0.000 | 0.441 | 0.000 | 0.500 | 0.000 | 0.469 |
Logistic Regression Without Hyperparameter Tuning | 0.899 | 0.626 | 0.771 | 0.331 | 0.652 | 0.433 | 0.689 |
Logistic Regression With Hyperparameter Tuning | 0.899 | 0.626 | 0.771 | 0.331 | 0.653 | 0.433 | 0.689 |
k-Nearest Neighbor Scaled Without Hyperparameter Tuning | 0.891 | 0.566 | 0.740 | 0.299 | 0.634 | 0.391 | 0.666 |
k-Nearest Neighbor Scaled With Hyperparameter Tuning | 0.887 | 0.527 | 0.722 | 0.340 | 0.650 | 0.413 | 0.675 |
Naive Bayes Classifier | 0.815 | 0.321 | 0.626 | 0.521 | 0.687 | 0.397 | 0.644 |
Naive Bayes, Oversampled | 0.717 | 0.250 | 0.600 | 0.711 | 0.714 | 0.370 | 0.594 |
Logistic Regression, Oversampled With Hyperparameter Tuning | 0.828 | 0.386 | 0.676 | 0.784 | 0.809 | 0.517 | 0.706 |
Decision Tree Classifier | 0.875 | 0.466 | 0.699 | 0.478 | 0.703 | 0.472 | 0.701 |
Decision Tree Classifier - Reducing Max Depth | 0.899 | 0.660 | 0.786 | 0.285 | 0.632 | 0.397 | 0.671 |
Bagging Classifier | 0.899 | 0.660 | 0.786 | 0.282 | 0.631 | 0.396 | 0.670 |
AdaBoost Classifier | 0.891 | 0.554 | 0.737 | 0.369 | 0.665 | 0.442 | 0.691 |
Gradient Boosting Classifier | 0.902 | 0.669 | 0.792 | 0.319 | 0.649 | 0.431 | 0.689 |
AdaBoost Classifier, Oversampled | 0.811 | 0.359 | 0.662 | 0.783 | 0.799 | 0.492 | 0.688 |
Random Forest Classifier | 0.897 | 0.619 | 0.767 | 0.308 | 0.642 | 0.412 | 0.678 |
# Random Forest Classifier with hyperparameter tuning
rfc = RandomForestClassifier(n_jobs = -1, random_state = random_state)
params = {'n_estimators' : [10, 20, 30, 50, 75, 100], 'max_depth': [1, 2, 3, 5, 7, 10]}
scoring = {'Recall': make_scorer(recall_score), 'f1_score': make_scorer(f1_score)}
skf = StratifiedKFold(n_splits = 3, shuffle = True, random_state = random_state)
rfc_grid = GridSearchCV(rfc, param_grid = params, n_jobs = -1, cv = skf, scoring = scoring, refit = 'f1_score')
rfc_grid.fit(X, y)
print(rfc_grid.best_estimator_)
print(rfc_grid.best_params_)
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini', max_depth=10, max_features='auto', max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, n_estimators=20, n_jobs=-1, oob_score=False, random_state=42, verbose=0, warm_start=False)
{'max_depth': 10, 'n_estimators': 20}
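Because `scoring` holds two metrics, `refit = 'f1_score'` tells `GridSearchCV` which one selects `best_estimator_`; the per-candidate means of both metrics remain available in `cv_results_`. A minimal sketch of inspecting them follows, on synthetic data with a deliberately small grid (both are assumptions made to keep the run short, and `X_demo`, `y_demo`, `grid` and `results` are illustrative names).

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, make_scorer, recall_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic, imbalanced stand-in data (assumption, not the bank dataset).
X_demo, y_demo = make_classification(n_samples=600, n_features=8,
                                     weights=[0.85, 0.15], random_state=42)

scoring = {'Recall': make_scorer(recall_score), 'f1_score': make_scorer(f1_score)}
skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
grid = GridSearchCV(RandomForestClassifier(random_state=42),
                    param_grid={'n_estimators': [10, 20], 'max_depth': [3, 10]},
                    cv=skf, scoring=scoring, refit='f1_score')
grid.fit(X_demo, y_demo)

# With multi-metric scoring, cv_results_ keys are suffixed with the scorer names.
results = pd.DataFrame(grid.cv_results_)[['param_max_depth', 'param_n_estimators',
                                          'mean_test_Recall', 'mean_test_f1_score',
                                          'rank_test_f1_score']]
print(results.sort_values('rank_test_f1_score'))
print(grid.best_params_)
```

Looking at `mean_test_Recall` next to `mean_test_f1_score` shows whether the candidate chosen on f1 is also competitive on recall.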
# Random Forest Classifier with hyperparameter tuning
rfc_hyper = RandomForestClassifier(bootstrap = True, class_weight = None, criterion = 'gini', max_depth = 10,
                                   max_features = 'auto', max_leaf_nodes = None, min_impurity_decrease = 0.0,
                                   min_impurity_split = None, min_samples_leaf = 1, min_samples_split = 2,
                                   min_weight_fraction_leaf = 0.0, n_estimators = 20, n_jobs = -1,
                                   oob_score = False, random_state = 42, verbose = 0, warm_start = False)
base_model = [rfc_hyper]
n_splits = 5
df1 = train_and_predict(n_splits, base_model, X, y, 'Random Forest Classifier With Hyperparameter Tuning')
df = pd.concat([df, df1])
df
Model- Random forest classifier with hyperparameter tuning and CV- 0 -------------------- Training Score: 0.912 Test Score: 0.897 Accuracy Score: 0.897 Precision Score - Subscribe: 0.687 Recall Score - Subscribe: 0.220 f1 Score - Subscribe: 0.334 Precision Score - Macro: 0.796 Recall Score - Macro: 0.603 f1 Score - Macro: 0.639
Model- Random forest classifier with hyperparameter tuning and CV- 1 -------------------- Training Score: 0.911 Test Score: 0.898 Accuracy Score: 0.898 Precision Score - Subscribe: 0.733 Recall Score - Subscribe: 0.205 f1 Score - Subscribe: 0.321 Precision Score - Macro: 0.818 Recall Score - Macro: 0.598 f1 Score - Macro: 0.633
Model- Random forest classifier with hyperparameter tuning and CV- 2 -------------------- Training Score: 0.912 Test Score: 0.895 Accuracy Score: 0.895 Precision Score - Subscribe: 0.693 Recall Score - Subscribe: 0.192 f1 Score - Subscribe: 0.301 Precision Score - Macro: 0.798 Recall Score - Macro: 0.590 f1 Score - Macro: 0.622
Model- Random forest classifier with hyperparameter tuning and CV- 3 -------------------- Training Score: 0.911 Test Score: 0.902 Accuracy Score: 0.902 Precision Score - Subscribe: 0.755 Recall Score - Subscribe: 0.236 f1 Score - Subscribe: 0.360 Precision Score - Macro: 0.831 Recall Score - Macro: 0.613 f1 Score - Macro: 0.653
Model- Random forest classifier with hyperparameter tuning and CV- 4 -------------------- Training Score: 0.911 Test Score: 0.899 Accuracy Score: 0.899 Precision Score - Subscribe: 0.739 Recall Score - Subscribe: 0.215 f1 Score - Subscribe: 0.333 Precision Score - Macro: 0.822 Recall Score - Macro: 0.602 f1 Score - Macro: 0.639
Model | Accuracy | Precision_Subscribe | Precision_Macro | Recall_Subscribe | Recall_Macro | f1_Subscribe | f1_Macro |
---|---|---|---|---|---|---|---|
Baseline Model | 0.882 | 0.000 | 0.441 | 0.000 | 0.500 | 0.000 | 0.469 |
Logistic Regression Without Hyperparameter Tuning | 0.899 | 0.626 | 0.771 | 0.331 | 0.652 | 0.433 | 0.689 |
Logistic Regression With Hyperparameter Tuning | 0.899 | 0.626 | 0.771 | 0.331 | 0.653 | 0.433 | 0.689 |
k-Nearest Neighbor Scaled Without Hyperparameter Tuning | 0.891 | 0.566 | 0.740 | 0.299 | 0.634 | 0.391 | 0.666 |
k-Nearest Neighbor Scaled With Hyperparameter Tuning | 0.887 | 0.527 | 0.722 | 0.340 | 0.650 | 0.413 | 0.675 |
Naive Bayes Classifier | 0.815 | 0.321 | 0.626 | 0.521 | 0.687 | 0.397 | 0.644 |
Naive Bayes, Oversampled | 0.717 | 0.250 | 0.600 | 0.711 | 0.714 | 0.370 | 0.594 |
Logistic Regression, Oversampled With Hyperparameter Tuning | 0.828 | 0.386 | 0.676 | 0.784 | 0.809 | 0.517 | 0.706 |
Decision Tree Classifier | 0.875 | 0.466 | 0.699 | 0.478 | 0.703 | 0.472 | 0.701 |
Decision Tree Classifier - Reducing Max Depth | 0.899 | 0.660 | 0.786 | 0.285 | 0.632 | 0.397 | 0.671 |
Bagging Classifier | 0.899 | 0.660 | 0.786 | 0.282 | 0.631 | 0.396 | 0.670 |
AdaBoost Classifier | 0.891 | 0.554 | 0.737 | 0.369 | 0.665 | 0.442 | 0.691 |
Gradient Boosting Classifier | 0.902 | 0.669 | 0.792 | 0.319 | 0.649 | 0.431 | 0.689 |
AdaBoost Classifier, Oversampled | 0.811 | 0.359 | 0.662 | 0.783 | 0.799 | 0.492 | 0.688 |
Random Forest Classifier | 0.897 | 0.619 | 0.767 | 0.308 | 0.642 | 0.412 | 0.678 |
Random Forest Classifier With Hyperparameter Tuning | 0.898 | 0.721 | 0.813 | 0.214 | 0.601 | 0.330 | 0.637 |
# Random Forest Classifier with hyperparameter tuning, Oversampled
rfc_over = RandomForestClassifier(bootstrap = True, class_weight = None, criterion = 'gini', max_depth = 10,
                                  max_features = 'auto', max_leaf_nodes = None, min_impurity_decrease = 0.0,
                                  min_impurity_split = None, min_samples_leaf = 1, min_samples_split = 2,
                                  min_weight_fraction_leaf = 0.0, n_estimators = 20, n_jobs = -1,
                                  oob_score = False, random_state = 42, verbose = 0, warm_start = False)
base_model = [rfc_over]
n_splits = 5
df1 = train_and_predict(n_splits, base_model, X, y,
                        'Random Forest Classifier, Oversampled With Hyperparameter Tuning',
                        oversampling = True)
df = pd.concat([df, df1])
df
Model- Random forest classifier, oversampled with hyperparameter tuning and CV- 0 -------------------- Training Score: 0.913 Test Score: 0.854 Accuracy Score: 0.854 Precision Score - Subscribe: 0.429 Recall Score - Subscribe: 0.750 f1 Score - Subscribe: 0.546 Precision Score - Macro: 0.696 Recall Score - Macro: 0.809 f1 Score - Macro: 0.729
Model- Random forest classifier, oversampled with hyperparameter tuning and CV- 1 -------------------- Training Score: 0.916 Test Score: 0.857 Accuracy Score: 0.857 Precision Score - Subscribe: 0.433 Recall Score - Subscribe: 0.720 f1 Score - Subscribe: 0.541 Precision Score - Macro: 0.696 Recall Score - Macro: 0.798 f1 Score - Macro: 0.728
Model- Random forest classifier, oversampled with hyperparameter tuning and CV- 2 -------------------- Training Score: 0.915 Test Score: 0.853 Accuracy Score: 0.853 Precision Score - Subscribe: 0.425 Recall Score - Subscribe: 0.720 f1 Score - Subscribe: 0.534 Precision Score - Macro: 0.692 Recall Score - Macro: 0.795 f1 Score - Macro: 0.724
Model- Random forest classifier, oversampled with hyperparameter tuning and CV- 3 -------------------- Training Score: 0.917 Test Score: 0.864 Accuracy Score: 0.864 Precision Score - Subscribe: 0.450 Recall Score - Subscribe: 0.743 f1 Score - Subscribe: 0.560 Precision Score - Macro: 0.706 Recall Score - Macro: 0.811 f1 Score - Macro: 0.740
Model- Random forest classifier, oversampled with hyperparameter tuning and CV- 4 -------------------- Training Score: 0.914 Test Score: 0.857 Accuracy Score: 0.857 Precision Score - Subscribe: 0.435 Recall Score - Subscribe: 0.744 f1 Score - Subscribe: 0.549 Precision Score - Macro: 0.699 Recall Score - Macro: 0.808 f1 Score - Macro: 0.732
Model | Accuracy | Precision_Subscribe | Precision_Macro | Recall_Subscribe | Recall_Macro | f1_Subscribe | f1_Macro |
---|---|---|---|---|---|---|---|
Baseline Model | 0.882 | 0.000 | 0.441 | 0.000 | 0.500 | 0.000 | 0.469 |
Logistic Regression Without Hyperparameter Tuning | 0.899 | 0.626 | 0.771 | 0.331 | 0.652 | 0.433 | 0.689 |
Logistic Regression With Hyperparameter Tuning | 0.899 | 0.626 | 0.771 | 0.331 | 0.653 | 0.433 | 0.689 |
k-Nearest Neighbor Scaled Without Hyperparameter Tuning | 0.891 | 0.566 | 0.740 | 0.299 | 0.634 | 0.391 | 0.666 |
k-Nearest Neighbor Scaled With Hyperparameter Tuning | 0.887 | 0.527 | 0.722 | 0.340 | 0.650 | 0.413 | 0.675 |
Naive Bayes Classifier | 0.815 | 0.321 | 0.626 | 0.521 | 0.687 | 0.397 | 0.644 |
Naive Bayes, Oversampled | 0.717 | 0.250 | 0.600 | 0.711 | 0.714 | 0.370 | 0.594 |
Logistic Regression, Oversampled With Hyperparameter Tuning | 0.828 | 0.386 | 0.676 | 0.784 | 0.809 | 0.517 | 0.706 |
Decision Tree Classifier | 0.875 | 0.466 | 0.699 | 0.478 | 0.703 | 0.472 | 0.701 |
Decision Tree Classifier - Reducing Max Depth | 0.899 | 0.660 | 0.786 | 0.285 | 0.632 | 0.397 | 0.671 |
Bagging Classifier | 0.899 | 0.660 | 0.786 | 0.282 | 0.631 | 0.396 | 0.670 |
AdaBoost Classifier | 0.891 | 0.554 | 0.737 | 0.369 | 0.665 | 0.442 | 0.691 |
Gradient Boosting Classifier | 0.902 | 0.669 | 0.792 | 0.319 | 0.649 | 0.431 | 0.689 |
AdaBoost Classifier, Oversampled | 0.811 | 0.359 | 0.662 | 0.783 | 0.799 | 0.492 | 0.688 |
Random Forest Classifier | 0.897 | 0.619 | 0.767 | 0.308 | 0.642 | 0.412 | 0.678 |
Random Forest Classifier With Hyperparameter Tuning | 0.898 | 0.721 | 0.813 | 0.214 | 0.601 | 0.330 | 0.637 |
Random Forest Classifier, Oversampled With Hyperparameter Tuning | 0.857 | 0.434 | 0.698 | 0.735 | 0.804 | 0.546 | 0.731 |
# Random Forest Classifier with hyperparameter tuning, Oversampled -- Reducing Max Depth
rfc_over = RandomForestClassifier(bootstrap = True, class_weight = None, criterion = 'gini', max_depth = 3,
                                  max_features = 'auto', max_leaf_nodes = None, min_impurity_decrease = 0.0,
                                  min_impurity_split = None, min_samples_leaf = 1, min_samples_split = 2,
                                  min_weight_fraction_leaf = 0.0, n_estimators = 50, n_jobs = -1,
                                  oob_score = False, random_state = 42, verbose = 0, warm_start = False)
base_model = [rfc_over]
n_splits = 5
df1 = train_and_predict(n_splits, base_model, X, y,
                        'Random Forest Classifier, Oversampled With Hyperparameter Tuning - Reducing Max Depth',
                        oversampling = True)
df = pd.concat([df, df1])
df
Model- Random forest classifier, oversampled with hyperparameter tuning - reducing max depth and CV- 0 -------------------- Training Score: 0.850 Test Score: 0.804 Accuracy Score: 0.804 Precision Score - Subscribe: 0.347 Recall Score - Subscribe: 0.766 f1 Score - Subscribe: 0.478 Precision Score - Macro: 0.655 Recall Score - Macro: 0.787 f1 Score - Macro: 0.679 Model- Random forest classifier, oversampled with hyperparameter tuning - reducing max depth and CV- 1 -------------------- Training Score: 0.842 Test Score: 0.819 Accuracy Score: 0.819 Precision Score - Subscribe: 0.371 Recall Score - Subscribe: 0.790 f1 Score - Subscribe: 0.505 Precision Score - Macro: 0.669 Recall Score - Macro: 0.806 f1 Score - Macro: 0.697 Model- Random forest classifier, oversampled with hyperparameter tuning - reducing max depth and CV- 2 -------------------- Training Score: 0.842 Test Score: 0.812 Accuracy Score: 0.812 Precision Score - Subscribe: 0.359 Recall Score - Subscribe: 0.773 f1 Score - Subscribe: 0.490 Precision Score - Macro: 0.662 Recall Score - Macro: 0.795 f1 Score - Macro: 0.687 Model- Random forest classifier, oversampled with hyperparameter tuning - reducing max depth and CV- 3 -------------------- Training Score: 0.842 Test Score: 0.810 Accuracy Score: 0.810 Precision Score - Subscribe: 0.356 Recall Score - Subscribe: 0.771 f1 Score - Subscribe: 0.487 Precision Score - Macro: 0.660 Recall Score - Macro: 0.793 f1 Score - Macro: 0.685 Model- Random forest classifier, oversampled with hyperparameter tuning - reducing max depth and CV- 4 -------------------- Training Score: 0.848 Test Score: 0.813 Accuracy Score: 0.813 Precision Score - Subscribe: 0.364 Recall Score - Subscribe: 0.801 f1 Score - Subscribe: 0.501 Precision Score - Macro: 0.667 Recall Score - Macro: 0.808 f1 Score - Macro: 0.693
Model | Accuracy | Precision_Subscribe | Precision_Macro | Recall_Subscribe | Recall_Macro | f1_Subscribe | f1_Macro |
---|---|---|---|---|---|---|---|
Baseline Model | 0.882 | 0.000 | 0.441 | 0.000 | 0.500 | 0.000 | 0.469 |
Logistic Regression Without Hyperparameter Tuning | 0.899 | 0.626 | 0.771 | 0.331 | 0.652 | 0.433 | 0.689 |
Logistic Regression With Hyperparameter Tuning | 0.899 | 0.626 | 0.771 | 0.331 | 0.653 | 0.433 | 0.689 |
k-Nearest Neighbor Scaled Without Hyperparameter Tuning | 0.891 | 0.566 | 0.740 | 0.299 | 0.634 | 0.391 | 0.666 |
k-Nearest Neighbor Scaled With Hyperparameter Tuning | 0.887 | 0.527 | 0.722 | 0.340 | 0.650 | 0.413 | 0.675 |
Naive Bayes Classifier | 0.815 | 0.321 | 0.626 | 0.521 | 0.687 | 0.397 | 0.644 |
Naive Bayes, Oversampled | 0.717 | 0.250 | 0.600 | 0.711 | 0.714 | 0.370 | 0.594 |
Logistic Regression, Oversampled With Hyperparameter Tuning | 0.828 | 0.386 | 0.676 | 0.784 | 0.809 | 0.517 | 0.706 |
Decision Tree Classifier | 0.875 | 0.466 | 0.699 | 0.478 | 0.703 | 0.472 | 0.701 |
Decision Tree Classifier - Reducing Max Depth | 0.899 | 0.660 | 0.786 | 0.285 | 0.632 | 0.397 | 0.671 |
Bagging Classifier | 0.899 | 0.660 | 0.786 | 0.282 | 0.631 | 0.396 | 0.670 |
AdaBoost Classifier | 0.891 | 0.554 | 0.737 | 0.369 | 0.665 | 0.442 | 0.691 |
Gradient Boosting Classifier | 0.902 | 0.669 | 0.792 | 0.319 | 0.649 | 0.431 | 0.689 |
AdaBoost Classifier, Oversampled | 0.811 | 0.359 | 0.662 | 0.783 | 0.799 | 0.492 | 0.688 |
Random Forest Classifier | 0.897 | 0.619 | 0.767 | 0.308 | 0.642 | 0.412 | 0.678 |
Random Forest Classifier With Hyperparameter Tuning | 0.898 | 0.721 | 0.813 | 0.214 | 0.601 | 0.330 | 0.637 |
Random Forest Classifier, Oversampled With Hyperparameter Tuning | 0.857 | 0.434 | 0.698 | 0.735 | 0.804 | 0.546 | 0.731 |
Random Forest Classifier, Oversampled With Hyperparameter Tuning - Reducing Max Depth | 0.812 | 0.359 | 0.663 | 0.780 | 0.798 | 0.492 | 0.688 |
# Depth-limited random forest. The `min_impurity_split` argument and the `max_features = 'auto'` value
# were removed from recent scikit-learn releases; 'sqrt' is what 'auto' meant for classifiers.
rfc_over = RandomForestClassifier(bootstrap = True, class_weight = None, criterion = 'gini', max_depth = 3,
                                  max_features = 'sqrt', max_leaf_nodes = None, min_impurity_decrease = 0.0,
                                  min_samples_leaf = 1, min_samples_split = 2,
                                  min_weight_fraction_leaf = 0.0, n_estimators = 50, n_jobs = -1,
                                  oob_score = False, random_state = 42, verbose = 0, warm_start = False)
rfc_over.fit(X, y)
# Export the first tree of the forest to Graphviz format, then render it as a PNG.
# Feature names are taken from X, the frame the model was fitted on.
with open('random_forest.dot', 'w') as random_forest_tree:
    export_graphviz(rfc_over.estimators_[0], out_file = random_forest_tree, feature_names = list(X.columns),
                    class_names = ['No', 'Yes'], rounded = True, proportion = False, filled = True)
retCode = system("dot -Tpng random_forest.dot -o random_forest.png")
if retCode > 0:
    print("system command returning error: " + str(retCode))
else:
    display(Image("random_forest.png"))
print('Feature Importance for Random Forest Classifier ', '--'*38)
feature_importances = pd.DataFrame(rfc_over.feature_importances_, index = X.columns,
                                   columns = ['Importance']).sort_values('Importance', ascending = True)
# The frame is already sorted, so it can be plotted directly
feature_importances.plot(kind = 'barh', figsize = (15, 7.2))
Feature Importance for Random Forest Classifier ----------------------------------------------------------------------------
print('Conditional Formatting on the scores dataframe ', '--'*39)
display(df.style.background_gradient(cmap = sns.light_palette('green', as_cmap = True)))
Conditional Formatting on the scores dataframe ------------------------------------------------------------------------------
Model | Accuracy | Precision_Subscribe | Precision_Macro | Recall_Subscribe | Recall_Macro | f1_Subscribe | f1_Macro |
---|---|---|---|---|---|---|---|
Baseline Model | 0.882 | 0 | 0.441 | 0 | 0.5 | 0 | 0.469 |
Logistic Regression Without Hyperparameter Tuning | 0.899 | 0.626 | 0.771 | 0.331 | 0.652 | 0.433 | 0.689 |
Logistic Regression With Hyperparameter Tuning | 0.899 | 0.626 | 0.771 | 0.331 | 0.653 | 0.433 | 0.689 |
k-Nearest Neighbor Scaled Without Hyperparameter Tuning | 0.891 | 0.566 | 0.74 | 0.299 | 0.634 | 0.391 | 0.666 |
k-Nearest Neighbor Scaled With Hyperparameter Tuning | 0.887 | 0.527 | 0.722 | 0.34 | 0.65 | 0.413 | 0.675 |
Naive Bayes Classifier | 0.815 | 0.321 | 0.626 | 0.521 | 0.687 | 0.397 | 0.644 |
Naive Bayes, Oversampled | 0.717 | 0.25 | 0.6 | 0.711 | 0.714 | 0.37 | 0.594 |
Logistic Regression, Oversampled With Hyperparameter Tuning | 0.828 | 0.386 | 0.676 | 0.784 | 0.809 | 0.517 | 0.706 |
Decision Tree Classifier | 0.875 | 0.466 | 0.699 | 0.478 | 0.703 | 0.472 | 0.701 |
Decision Tree Classifier - Reducing Max Depth | 0.899 | 0.66 | 0.786 | 0.285 | 0.632 | 0.397 | 0.671 |
Bagging Classifier | 0.899 | 0.66 | 0.786 | 0.282 | 0.631 | 0.396 | 0.67 |
AdaBoost Classifier | 0.891 | 0.554 | 0.737 | 0.369 | 0.665 | 0.442 | 0.691 |
Gradient Boosting Classifier | 0.902 | 0.669 | 0.792 | 0.319 | 0.649 | 0.431 | 0.689 |
AdaBoost Classifier, Oversampled | 0.811 | 0.359 | 0.662 | 0.783 | 0.799 | 0.492 | 0.688 |
Random Forest Classifier | 0.897 | 0.619 | 0.767 | 0.308 | 0.642 | 0.412 | 0.678 |
Random Forest Classifier With Hyperparameter Tuning | 0.898 | 0.721 | 0.813 | 0.214 | 0.601 | 0.33 | 0.637 |
Random Forest Classifier, Oversampled With Hyperparameter Tuning | 0.857 | 0.434 | 0.698 | 0.735 | 0.804 | 0.546 | 0.731 |
Random Forest Classifier, Oversampled With Hyperparameter Tuning - Reducing Max Depth | 0.812 | 0.359 | 0.663 | 0.78 | 0.798 | 0.492 | 0.688 |
# One horizontal bar chart per metric, with models sorted by score
for i, types in enumerate(df.columns):
    temp = df[types]
    plt.figure(i, figsize = (15, 7.2))
    temp.sort_values(ascending = True).plot(kind = 'barh')
    plt.title(f'{types.capitalize()} Scores')
    plt.show()
The classification goal is to predict whether a client will subscribe (yes/no) to a term deposit.
Most ML models work best when the classes are present in roughly equal proportions, because they are designed to maximize accuracy and reduce error and therefore do not take the class distribution into account. In our dataset, only 11.7% of the clients subscribed to a term deposit (class 'yes', i.e. 1), whereas about 88.3% did not (class 'no', i.e. 0).
A DummyClassifier baseline model gave an accuracy of 88.2% in our case, with zero recall and precision for the minority class, i.e. the clients who subscribed to a term deposit. In such cases, performance measures such as precision, recall and f1-score are more informative than accuracy alone. These metrics can also be calculated specifically for the minority (positive) class.
The confusion matrix for class 1 (Subscribed) would look like:
Actual / Predicted | Predicted: 0 (Not Subscribed) | Predicted: 1 (Subscribed) |
---|---|---|
Actual: 0 (Not Subscribed) | True Negatives | False Positives |
Actual: 1 (Subscribed) | False Negatives | True Positives |
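The four cells of the confusion matrix are enough to derive precision, recall and f1 for class 1. The following is a minimal sketch in plain Python, not the notebook's own scoring code: the helper names `confusion_counts` and `precision_recall_f1` are hypothetical, and the label lists are made-up counts (882 vs 118) chosen only to mimic a majority-class baseline with 88.2% accuracy.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tn, fp, fn, tp) with `positive` treated as the class of interest."""
    tn = fp = fn = tp = 0
    for actual, predicted in zip(y_true, y_pred):
        if actual == positive and predicted == positive:
            tp += 1
        elif actual == positive:
            fn += 1
        elif predicted == positive:
            fp += 1
        else:
            tn += 1
    return tn, fp, fn, tp


def precision_recall_f1(fp, fn, tp):
    """Precision, recall and f1 for the positive class; 0.0 when undefined."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Assumption: illustrative labels only, not the real dataset.
y_true = [0] * 882 + [1] * 118
y_pred = [0] * len(y_true)  # a baseline that always predicts "not subscribed"

tn, fp, fn, tp = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)
precision, recall, f1 = precision_recall_f1(fp, fn, tp)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
```

With every prediction set to class 0 there are no true positives, so accuracy stays high while precision, recall and f1 for the subscribe class all collapse to zero, which is the pattern seen in the Baseline Model row.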
In our case, recall holds more importance than precision. We therefore chose recall, particularly for class 1, together with accuracy as the evaluation metrics. It is also important to check how the model behaves on the training and test scores across the cross-validation sets.
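One way to compare training and test behaviour across the cross-validation sets is to look at the per-fold gap between the two scores. The sketch below uses the five fold scores printed earlier for the depth-limited, oversampled random forest; the variable names are illustrative and not taken from the notebook.

```python
from statistics import mean

# Scores copied from the per-fold output of the depth-limited, oversampled random forest
train_scores = [0.850, 0.842, 0.842, 0.842, 0.848]
test_scores = [0.804, 0.819, 0.812, 0.810, 0.813]

# A large or erratic train-minus-test gap would hint at overfitting
gaps = [train - test for train, test in zip(train_scores, test_scores)]

mean_train = mean(train_scores)
mean_test = mean(test_scores)

for fold, gap in enumerate(gaps):
    print(f"fold {fold}: gap = {gap:.3f}")
print(f"mean train = {mean_train:.3f}, mean test = {mean_test:.3f}")
```

The mean test score rounds to the accuracy reported for this model in the scores table, and the gap stays below 0.05 in every fold.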
Modeling was divided into two phases. In the first phase we applied standard models (with and without hyperparameter tuning, wherever applicable): Logistic Regression, k-Nearest Neighbor and Naive Bayes classifiers. In the second phase we applied tree-based and ensemble techniques: Decision Tree, Bagging, AdaBoost, Gradient Boosting and Random Forest classifiers. The models with higher accuracy and better recall for the subscribe class were then trained again on oversampled data.
Oversampling is one of the common ways to tackle the issue of imbalanced data. It refers to various methods that aim to increase the number of instances of the underrepresented class in the dataset. Of these methods, we chose the Synthetic Minority Over-sampling Technique (SMOTE). SMOTE’s main advantage over naive random over-sampling is that it creates synthetic observations instead of reusing existing ones, so the classifier is less likely to overfit.
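The core idea behind SMOTE, interpolating between a minority sample and one of its nearest minority neighbours, can be sketched in a few lines of plain Python. This is a simplified illustration, not the notebook's oversampling call (which is not shown in this section) and not a library implementation; the function name `smote_sketch`, its parameters and the toy points are all assumptions made for the example.

```python
import random


def smote_sketch(minority, n_new, k=3, seed=42):
    """Create n_new synthetic points from a list of minority-class points.

    Each synthetic point lies on the line segment between a randomly chosen
    minority point and one of its k nearest minority neighbours.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority points by squared Euclidean distance to `base`
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        # Random position along the segment from `base` towards `neighbour`
        gap = rng.random()
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, neighbour)))
    return synthetic


# Assumption: four made-up two-feature minority samples
minority = [(1.0, 2.0), (1.5, 1.8), (2.0, 2.2), (1.2, 2.5)]
new_points = smote_sketch(minority, n_new=6)

for point in new_points:
    print(tuple(round(value, 3) for value in point))
```

Because every synthetic point is a blend of two real minority points, it stays inside the region the minority class already occupies, whereas naive random over-sampling would only duplicate existing rows.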
In the first phase (Standard machine learning models vs baseline model),
In the second phase (Ensemble models vs baseline model),