Author: Pratik Sharma ¶

Project 3 - Ensemble Techniques¶

Data Description: The data is related with direct marketing campaigns of a Portuguese banking institution. The marketing campaigns were based on phone calls. Often, more than one contact to the same client was required, in order to access if the product (bank term deposit) would be ('yes') or not ('no') subscribed.

Domain: Banking

Context: Leveraging customer information is paramount for most businesses. In the case of a bank, attributes of customers like the ones mentioned below can be crucial in strategizing a marketing campaign when launching a new product.

Attribute Information

age: age at the time of call
job: type of job
marital: marital status
education: education background at the time of call
default: has credit in default?
balance: average yearly balance, in euros (numeric)
housing: has housing loan?
loan: has personal loan?
contact: contact communication type
day: last contact day of the month (1 -31)
month: last contact month of year ('jan', 'feb', 'mar', ..., 'nov', 'dec')
duration: last contact duration, in seconds (numeric). Important note: this attribute highly affects the output target (e.g., if duration = 0 then Target = 'no'). Yet, the duration is not known before a call is performed. Also, after the end of the call Target is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.
campaign: number of contacts performed during this campaign and for this client (includes last contact)
pdays: number of days that passed by after the client was last contacted from a previous campaign
previous: number of contacts performed before this campaign and for this client
poutcome: outcome of the previous marketing campaign
target: has the client subscribed a term deposit? ('yes', 'no')

Learning Outcomes

Exploratory Data Analysis
Preparing the data to train a model
Training and making predictions using an Ensemble Model
Tuning an Ensemble model

Table of Contents¶

Import Packages
Reading the data as a dataframe and print the first five rows
Get info of the dataframe columns
- Observation 1 - Dataset shape
Exploratory Data Analysis
Modelling
Comparing model results
Conclusion

Import Packages¶

In [1]:

# Basic packages
import pandas as pd, numpy as np, matplotlib.pyplot as plt, seaborn as sns
from scipy import stats; from scipy.stats import zscore, norm, randint
import matplotlib.style as style; style.use('fivethirtyeight')
import plotly.express as px
%matplotlib inline

# Impute and Encode
from sklearn.preprocessing import LabelEncoder
from impyute.imputation.cs import mice

# Modelling - LR, KNN, NB, Metrics
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, f1_score, recall_score, precision_score
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier, BaggingClassifier, VotingClassifier
from sklearn.model_selection import train_test_split, GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.dummy import DummyClassifier
from sklearn.metrics import make_scorer

# Oversampling
from imblearn.over_sampling import SMOTE

# Suppress warnings
import warnings; warnings.filterwarnings('ignore')

# Visualize Tree
from sklearn.tree import export_graphviz
from IPython.display import Image
from os import system

# Display settings
pd.options.display.max_rows = 10000
pd.options.display.max_columns = 10000

random_state = 42
np.random.seed(random_state)

Reading the data as a dataframe and print the first five rows¶

In [2]:

# Reading the data as dataframe and print the first five rows
bank = pd.read_csv('bank-full.csv')
bank.head()

Out[2]:

	age	job	marital	education	default	balance	housing	loan	contact	day	month	duration	campaign	pdays	poutcome	Target
0	58	management	married	tertiary	no	2143	yes	no	unknown	5	may	261	1	-1	unknown	no
1	44	technician	single	secondary	no	29	yes	no	unknown	5	may	151	1	-1	unknown	no
2	33	entrepreneur	married	secondary	no	2	yes	yes	unknown	5	may	76	1	-1	unknown	no
3	47	blue-collar	married	unknown	no	1506	yes	no	unknown	5	may	92	1	-1	unknown	no
4	33	unknown	single	unknown	no	1	no	no	unknown	5	may	198	1	-1	unknown	no

Get info of the dataframe columns¶

In [3]:

# Get info of the dataframe columns
bank.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45211 entries, 0 to 45210
Data columns (total 17 columns):
age          45211 non-null int64
job          45211 non-null object
marital      45211 non-null object
education    45211 non-null object
default      45211 non-null object
balance      45211 non-null int64
housing      45211 non-null object
loan         45211 non-null object
contact      45211 non-null object
day          45211 non-null int64
month        45211 non-null object
duration     45211 non-null int64
campaign     45211 non-null int64
pdays        45211 non-null int64
previous     45211 non-null int64
poutcome     45211 non-null object
Target       45211 non-null object
dtypes: int64(7), object(10)
memory usage: 5.9+ MB

Observation 1 - Dataset shape¶

Dataset has 45211 rows and 17 columns, with no missing values.

Exploratory Data Analysis¶

Performing exploratory data analysis on the bank dataset. Below are some of the steps performed:

Get descriptive statistics including five point summary
Check unique values in object columns
Check distribution of Target column
Count plot by Target for categorical columns (job, marital, education, default, housing, loan, contact, day, month, poutcome)
Check outliers in each numerical columna (age, balance, duration, campaign, pdays, previous)
Distribution of numerical columns (with and without outliers)
Distribution of numerical columns for Target values (subscribed and didn`t subscribed to term deposit)
Fill outliers with upper and lower percentile values being 99 and 1, respectively by nan values
Frequency encoding for categorical columns with string values
Convert type of categorical columns (job, marital, education, default, housing, loan, contact, day, month, poutcome, Target) to float for MICE training. Creating multiple imputations, as opposed to single imputations to complete datasets, accounts for the statistical uncertainty in the imputations. MICE algorithms works by running multiple regression models and each missing value is modeled conditionally depeding on the observed (non-missing) values.
Get correlation matrix and check absolute correlation of independent variables with Target. Drop columns based on these.
Create age groups and check if there is relation with balance and target, campaign and target.

Five point summary of numerical attributes and check unique values in 'object' columns¶

In [4]:

bank.describe(include = 'all').T

Out[4]:

	count	unique	top	freq	mean	std	min	25%	50%	75%	max
age	45211	NaN	NaN	NaN	40.9362	10.6188	18	33	39	48	95
job	45211	12	blue-collar	9732	NaN	NaN	NaN	NaN	NaN	NaN	NaN
marital	45211	3	married	27214	NaN	NaN	NaN	NaN	NaN	NaN	NaN
education	45211	4	secondary	23202	NaN	NaN	NaN	NaN	NaN	NaN	NaN
default	45211	2	no	44396	NaN	NaN	NaN	NaN	NaN	NaN	NaN
balance	45211	NaN	NaN	NaN	1362.27	3044.77	-8019	72	448	1428	102127
housing	45211	2	yes	25130	NaN	NaN	NaN	NaN	NaN	NaN	NaN
loan	45211	2	no	37967	NaN	NaN	NaN	NaN	NaN	NaN	NaN
contact	45211	3	cellular	29285	NaN	NaN	NaN	NaN	NaN	NaN	NaN
day	45211	NaN	NaN	NaN	15.8064	8.32248	1	8	16	21	31
month	45211	12	may	13766	NaN	NaN	NaN	NaN	NaN	NaN	NaN
duration	45211	NaN	NaN	NaN	258.163	257.528	0	103	180	319	4918
campaign	45211	NaN	NaN	NaN	2.76384	3.09802	1	1	2	3	63
pdays	45211	NaN	NaN	NaN	40.1978	100.129	-1	-1	-1	-1	871
previous	45211	NaN	NaN	NaN	0.580323	2.30344	0	0	0	0	275
poutcome	45211	4	unknown	36959	NaN	NaN	NaN	NaN	NaN	NaN	NaN
Target	45211	2	no	39922	NaN	NaN	NaN	NaN	NaN	NaN	NaN

In [5]:

columns = bank.loc[:, bank.dtypes == 'object'].columns.tolist()
for cols in columns:
    print(f'Unique values for {cols} is \n{bank[cols].unique()}\n')

Unique values for job is 
['management' 'technician' 'entrepreneur' 'blue-collar' 'unknown'
 'retired' 'admin.' 'services' 'self-employed' 'unemployed' 'housemaid'
 'student']

Unique values for marital is 
['married' 'single' 'divorced']

Unique values for education is 
['tertiary' 'secondary' 'unknown' 'primary']

Unique values for default is 
['no' 'yes']

Unique values for housing is 
['yes' 'no']

Unique values for loan is 
['no' 'yes']

Unique values for contact is 
['unknown' 'cellular' 'telephone']

Unique values for month is 
['may' 'jun' 'jul' 'aug' 'oct' 'nov' 'dec' 'jan' 'feb' 'mar' 'apr' 'sep']

Unique values for poutcome is 
['unknown' 'failure' 'other' 'success']

Unique values for Target is 
['no' 'yes']

Observation 2 - information on the type of variable and min-max values¶

Client info¶

Categorical
- job: Nominal. Includes type of job. 'blue-collar' is the most frequently occurring in the data.
- marital: Nominal. Most of the clients are married in the dataset we have.
- education: Ordinal. Most of the clients have secondary level education.
- default: Binary. Most of clients don't have credit in default.
- housing: Binary. Most of the clients have housing loan.
- loan: Binary. Most of the clients don't have personal loan.
Numerical
- age: Continuous, ratio (has true zero, technically). Whether it's discrete or continuous depends on whether they are measured to the nearest year or not. At present, it seems it's discrete. Min age in the dataset being 18 and max being 95.
- balance: Continuous, ratio. Range of average yearly balance is very wide from -8019 euros to 102127 euros.

Last contact info¶

Categorical
- contact: Nominal. Includes communication type with the client, most frequently use communication mode is cellular.
- day: Ordinal. Includes last contact day of the month.
- month: Ordinal. Includes last contact month of the year.
Numerical
- duration: Continuous, interval. Includes last contact duration in seconds. Min value being 0 and max value being 4918. It would be important to check is higher duration of call leading to more subscription.

This campaign info¶

Numerical
- campaign: Discrete, interval. Min number of contacts performed during this campaign being 1 and is also represents about 25% of the value and max being 63.

Previous campaign info¶

Categorical
- poutcome: Nominal. Includes outcome of the previous marketing campaign. Most occuring value being 'unknown'.
Numerical
- pdays: Continuous, interval. Min number of days that passed by after the client was last contacted from a previous campaign being -1 which may be dummy value for the cases where client wasn't contacted and max days being 63.
- previous: Discrete, ratio. Min number of contacts performed before this campaign is 0 and max being 275.

Target¶

Categorical
- Target: Binary. Most occurring value being 'no' i.e. cases where the client didn't subscribe to the term deposit.

Observation 3 - Descriptive statistics for the numerical variables¶

Descriptive statistics for the numerical variables (age, balance, duration, campaign, pdays, previous)

age: Range of Q1 to Q3 is between 33 to 48. Since mean is slightly greater than median, we can say that age is right (positively) skewed.
balance: Range of Q1 to Q3 is between 72 to 1428. Since mean is greater than median, we can say that balance is skewed towards right (positively).
duration: Range of Q1 to Q3 is between 103 to 319. Since mean is greater than median, we can say that duration is right (positively) skewed.
campaign: Range of Q1 to Q3 is between 1 to 3. Since mean is greater than median, we can say that campaign is right (positively) skewed.
pdays: 75% of data values are around -1 which is a dummy value. It needs further check without considering the -1 value.
previous: 75% of data values are around 0 which is a dummy value, maybe cases where client wasn't contacted. It needs further checks.

Checking the distribution of target variable¶

In [6]:

display(bank['Target'].value_counts(), bank['Target'].value_counts(normalize = True)*100)

no     39922
yes     5289
Name: Target, dtype: int64

no     88.30152
yes    11.69848
Name: Target, dtype: float64

Observation 4 - Distribution of target variable¶

Out of 45211 cases, only 5289 (=11.69%) are the cases where the client has subscribed to the term deposit.

In [7]:

# Replace values in some of the categorical columns
replace_values = {'education': {'unknown': -1, 'primary': 1, 'secondary': 2, 'tertiary': 3}, 'Target': {'no': 0, 'yes': 1},
                  'default': {'no': 0, 'yes': 1}, 'housing': {'no': 0, 'yes': 1}, 'loan': {'no': 0, 'yes': 1},
                  'month': {'jan': 1, 'feb': 2, 'mar': 3, 'apr': 4, 'may': 5, 'jun': 6, 
                            'jul': 7, 'aug': 8, 'sep': 9, 'oct': 10, 'nov': 11, 'dec': 12}}

bank = bank.replace(replace_values)

In [8]:

# Convert columns to categorical types
columns.extend(['day'])
for cols in columns:
    bank[cols] = bank[cols].astype('category')

In [9]:

# Functions that will help us with EDA plot
def odp_plots(df, col):
    f,(ax1, ax2, ax3) = plt.subplots(1, 3, figsize = (15, 7.2))
    
    # Boxplot to check outliers
    sns.boxplot(x = col, data = df, ax = ax1, orient = 'v', color = 'darkslategrey')
    
    # Distribution plot with outliers
    sns.distplot(df[col], ax = ax2, color = 'teal', fit = norm).set_title(f'Distribution of {col} with outliers')
    
    # Removing outliers, but in a new dataframe
    upperbound, lowerbound = np.percentile(df[col], [1, 99])
    y = pd.DataFrame(np.clip(df[col], upperbound, lowerbound))
    
    # Distribution plot without outliers
    sns.distplot(y[col], ax = ax3, color = 'tab:orange', fit = norm).set_title(f'Distribution of {col} without outliers')
    
    kwargs = {'fontsize':14, 'color':'black'}
    ax1.set_title(col + ' Boxplot Analysis', **kwargs)
    ax1.set_xlabel('Box', **kwargs)
    ax1.set_ylabel(col + ' Values', **kwargs)

    return plt.show()
    
def target_plot(df, col, target = 'Target'):
    fig = plt.figure(figsize = (15, 7.2))
    # Distribution for 'Target' -- didn't subscribed, considering outliers   
    ax = fig.add_subplot(121)
    sns.distplot(df[(df[target] == 0)][col], color = 'c', 
                 ax = ax).set_title(f'{col.capitalize()} for Term Desposit - Didn\'t subscribed')

    # Distribution for 'Target' -- Subscribed, considering outliers
    ax= fig.add_subplot(122)
    sns.distplot(df[(df[target] == 1)][col], color = 'b', 
             ax = ax).set_title(f'{col.capitalize()} for Term Desposit - Subscribed')
    return plt.show()

def target_count(df, col1, col2):
    fig = plt.figure(figsize = (15, 7.2))
    ax = fig.add_subplot(121)
    sns.countplot(x = col1, data = df, palette = ['tab:blue', 'tab:cyan'], ax = ax, orient = 'v',
                  hue = 'Target').set_title(col1.capitalize() +' count plot by Target', 
                                                                      fontsize = 13)
    plt.legend(labels = ['Didn\'t Subcribed', 'Subcribed'])
    plt.xticks(rotation = 90)
    
    ax = fig.add_subplot(122)
    sns.countplot(x = col2, data = df, palette = ['tab:blue', 'tab:cyan'], ax = ax, orient = 'v', 
                  hue = 'Target').set_title(col2.capitalize() +' coount plot by Target', 
                                                                      fontsize = 13)
    plt.legend(labels = ['Didn\'t Subcribed', 'Subcribed'])
    plt.xticks(rotation = 90)
    return plt.show()

Univariate and Bivariate Visualization¶

Looking at one feature at a time to understand how are the values distributed, checking outliers, checking relation of the column with Target column (bi).

In [10]:

# Subscribe and didn't subscribe for categorical columns
target_count(bank, 'job', 'marital')
target_count(bank, 'education', 'default')
target_count(bank, 'housing', 'loan')
target_count(bank, 'contact', 'day')
target_count(bank, 'month', 'poutcome')

Observation 5 - Comments from categorical columns¶

Management have a subscription rate of ~25 percent followed by technician.
Married and single clients are more likely to subscribe then divorced clients
Clients with education of secondary followed by tertiary are more likely to subscribe to term deposits
Most of the clients don't have credit in default and their subscription rate is higher then people with default
Cellular communication type have higher subscription rate
Most of the subscription were made in May and August

Check outlier and distribution for numerical columns and also plot it's relation with target variable¶

In [11]:

# Outlier, distribution for 'age' column
Q3 = bank['age'].quantile(0.75)
Q1 = bank['age'].quantile(0.25)
IQR = Q3 - Q1

print('Age column', '--'*55)
display(bank.loc[(bank['age'] < (Q1 - 1.5 * IQR)) | (bank['age'] > (Q3 + 1.5 * IQR))].head())

odp_plots(bank, 'age')

# Distribution of 'age' by 'Target'
target_plot(bank, 'age')

Age column --------------------------------------------------------------------------------------------------------------

	age	job	marital	education	balance	housing	contact	day	month	duration	campaign	pdays	poutcome	Target
29158	83	retired	married	1	425	0	telephone	2	2	912	1	-1	unknown	0
29261	75	retired	divorced	1	46	0	cellular	2	2	294	1	-1	unknown	0
29263	75	retired	married	1	3324	0	cellular	2	2	149	1	-1	unknown	0
29322	83	retired	married	3	6236	0	cellular	2	2	283	2	-1	unknown	0
29865	75	retired	divorced	1	3881	1	cellular	4	2	136	3	-1	unknown	1

In [12]:

# Outlier, distribution for 'balance' column
Q3 = bank['balance'].quantile(0.75)
Q1 = bank['balance'].quantile(0.25)
IQR = Q3 - Q1
print('Balance column', '--'*55)
display(bank.loc[(bank['balance'] < (Q1 - 1.5 * IQR)) | (bank['balance'] > (Q3 + 1.5 * IQR))].head())

odp_plots(bank, 'balance')

# Distribution of 'balance' by 'Target'
target_plot(bank, 'balance')

Balance column --------------------------------------------------------------------------------------------------------------

	age	job	marital	education	balance	housing	loan	contact	day	month	duration	campaign	pdays	poutcome
34	51	management	married	3	10635	1	0	unknown	5	5	336	1	-1	unknown
65	51	management	married	3	6530	1	0	unknown	5	5	91	1	-1	unknown
69	35	blue-collar	single	2	12223	1	1	unknown	5	5	177	1	-1	unknown
70	57	blue-collar	married	2	5935	1	1	unknown	5	5	258	1	-1	unknown
186	40	services	divorced	-1	4384	1	0	unknown	5	5	315	1	-1	unknown

In [13]:

# Outlier, distribution for 'duration' column
Q3 = bank['duration'].quantile(0.75)
Q1 = bank['duration'].quantile(0.25)
IQR = Q3 - Q1

print('Duration column', '--'*54)
display(bank.loc[(bank['duration'] < (Q1 - 1.5 * IQR)) | (bank['duration'] > (Q3 + 1.5 * IQR))].head())

odp_plots(bank, 'duration')

# Distribution of 'duration' by 'Target'
target_plot(bank, 'duration')

Duration column ------------------------------------------------------------------------------------------------------------

	age	job	marital	education	balance	housing	contact	day	month	duration	campaign	pdays	poutcome
37	53	technician	married	2	-3	0	unknown	5	5	1666	1	-1	unknown
43	54	retired	married	2	529	1	unknown	5	5	1492	1	-1	unknown
53	42	admin.	single	2	-76	1	unknown	5	5	787	1	-1	unknown
59	46	services	married	1	179	1	unknown	5	5	1778	1	-1	unknown
61	53	technician	divorced	2	989	1	unknown	5	5	812	1	-1	unknown

In [14]:

# Outlier, distribution for 'campaign' column
Q3 = bank['campaign'].quantile(0.75)
Q1 = bank['campaign'].quantile(0.25)
IQR = Q3 - Q1

print('Campaign column', '--'*54)
display(bank.loc[(bank['campaign'] < (Q1 - 1.5 * IQR)) | (bank['campaign'] > (Q3 + 1.5 * IQR))].head())

odp_plots(bank, 'campaign')

# Distribution of 'campaign' by 'Target'
target_plot(bank, 'campaign')

Campaign column ------------------------------------------------------------------------------------------------------------

	age	job	marital	education	balance	housing	loan	contact	day	month	duration	campaign	pdays	poutcome
758	59	services	married	2	307	1	1	unknown	6	5	250	7	-1	unknown
780	30	admin.	married	2	4	0	0	unknown	7	5	172	8	-1	unknown
906	27	services	single	2	0	1	0	unknown	7	5	388	7	-1	unknown
1103	52	technician	married	-1	133	1	0	unknown	7	5	253	8	-1	unknown
1105	43	admin.	married	3	1924	1	0	unknown	7	5	244	7	-1	unknown

In [15]:

# Outlier, distribution for 'pdays' column
Q3 = bank['pdays'].quantile(0.75)
Q1 = bank['pdays'].quantile(0.25)
IQR = Q3 - Q1

print('Pdays column', '--'*55)
display(bank.loc[(bank['pdays'] < (Q1 - 1.5 * IQR)) | (bank['pdays'] > (Q3 + 1.5 * IQR))].head())

# Check outlier in 'pdays', not considering -1
pdays = bank.loc[bank['pdays'] > 0, ['pdays', 'Target']]
pdays = pd.DataFrame(pdays, columns = ['pdays', 'Target'])
odp_plots(pdays, 'pdays')

# Distribution of 'pdays' by 'Target', not considering -1
target_plot(pdays, 'pdays')

Pdays column --------------------------------------------------------------------------------------------------------------

	age	job	marital	education	balance	housing	loan	contact	day	month	duration	campaign	pdays	previous	poutcome	Target
24060	33	admin.	married	3	882	0	0	telephone	21	10	39	1	151	3	failure	0
24062	42	admin.	single	2	-247	1	1	telephone	21	10	519	1	166	1	other	1
24064	33	services	married	2	3444	1	0	telephone	21	10	144	1	91	4	failure	1
24072	36	management	married	3	2415	1	0	telephone	22	10	73	1	86	4	other	0
24077	36	management	married	3	0	1	0	telephone	23	10	140	1	143	3	failure	1

In [16]:

# Outlier, distribution and probability plot for 'previous' column
Q3 = bank['previous'].quantile(0.75)
Q1 = bank['previous'].quantile(0.25)
IQR = Q3 - Q1

print('Previous column', '--'*54)
display(bank.loc[(bank['previous'] < (Q1 - 1.5 * IQR)) | (bank['previous'] > (Q3 + 1.5 * IQR))].head())

odp_plots(bank, 'previous')

# Distribution of 'previous' by 'Target'
target_plot(bank, 'previous')

Previous column ------------------------------------------------------------------------------------------------------------

	age	job	marital	education	balance	housing	loan	contact	day	month	duration	campaign	pdays	previous	poutcome	Target
24060	33	admin.	married	3	882	0	0	telephone	21	10	39	1	151	3	failure	0
24062	42	admin.	single	2	-247	1	1	telephone	21	10	519	1	166	1	other	1
24064	33	services	married	2	3444	1	0	telephone	21	10	144	1	91	4	failure	1
24072	36	management	married	3	2415	1	0	telephone	22	10	73	1	86	4	other	0
24077	36	management	married	3	0	1	0	telephone	23	10	140	1	143	3	failure	1

Observation 6 - Comments from numerical columns¶

Used quantile method to check outliers in numerical column. It appears that there are outliers in each of the numerical columns.
It appears that removing outliers below 25% percentile and above 75% percentile will bring the age and pdays columns to almost normal distribution.

Print categorical and numerical columns list, remove outliers with upper and lower percentile values being 99 and 1, respectively and get dummies¶

In [17]:

print('Categorical Columns: \n{}'.format(list(bank.select_dtypes('category').columns)))
print('\nNumerical Columns: \n{}'.format(list(bank.select_dtypes(exclude = 'category').columns)))

Categorical Columns: 
['job', 'marital', 'education', 'default', 'housing', 'loan', 'contact', 'day', 'month', 'poutcome', 'Target']

Numerical Columns: 
['age', 'balance', 'duration', 'campaign', 'pdays', 'previous']

In [18]:

# Removing outliers with upper and lower percentile values being 99 and 1, respectively
bank_nulls = bank.copy(deep = True)
columns = ['age', 'balance', 'duration', 'campaign', 'pdays', 'previous']

for col in columns:
    upper_lim = np.percentile(bank_nulls[col].values, 99)
    lower_lim = np.percentile(bank_nulls[col].values, 1)
    bank_nulls.loc[(bank_nulls[col] > upper_lim), col] = np.nan
    bank_nulls.loc[(bank_nulls[col] < lower_lim), col] = np.nan

print('Column for which outliers where removed with upper and lower percentile values: \n', columns)

Column for which outliers where removed with upper and lower percentile values: 
 ['age', 'balance', 'duration', 'campaign', 'pdays', 'previous']

In [19]:

# # Frequency encoding of 'job' column, this would creating too many columns with sparse distribution
# columns = ['job']#, 'marital', 'contact', 'poutcome']

# for col in columns:
#     counts = bank_nulls[col].value_counts().index.tolist()
#     encoding = bank_nulls.groupby(col).size()
#     encoding = encoding/len(bank_nulls)
#     bank_nulls[col] = bank_nulls[col].map(encoding)
#     print([counts, bank_nulls[col].value_counts().index.tolist()], '\n')

In [20]:

# pd.get_dummies
cols_to_transform = ['job', 'marital', 'contact', 'poutcome']
bank_nulls = pd.get_dummies(bank_nulls, columns = cols_to_transform) #, drop_first = True)

print('Got dummies for \n', cols_to_transform)
bank_nulls.info()

Got dummies for 
 ['job', 'marital', 'contact', 'poutcome']
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45211 entries, 0 to 45210
Data columns (total 35 columns):
age                  44473 non-null float64
education            45211 non-null category
default              45211 non-null category
balance              44308 non-null float64
housing              45211 non-null category
loan                 45211 non-null category
day                  45211 non-null category
month                45211 non-null category
duration             44341 non-null float64
campaign             44760 non-null float64
pdays                44826 non-null float64
previous             44758 non-null float64
Target               45211 non-null category
job_admin.           45211 non-null uint8
job_blue-collar      45211 non-null uint8
job_entrepreneur     45211 non-null uint8
job_housemaid        45211 non-null uint8
job_management       45211 non-null uint8
job_retired          45211 non-null uint8
job_self-employed    45211 non-null uint8
job_services         45211 non-null uint8
job_student          45211 non-null uint8
job_technician       45211 non-null uint8
job_unemployed       45211 non-null uint8
job_unknown          45211 non-null uint8
marital_divorced     45211 non-null uint8
marital_married      45211 non-null uint8
marital_single       45211 non-null uint8
contact_cellular     45211 non-null uint8
contact_telephone    45211 non-null uint8
contact_unknown      45211 non-null uint8
poutcome_failure     45211 non-null uint8
poutcome_other       45211 non-null uint8
poutcome_success     45211 non-null uint8
poutcome_unknown     45211 non-null uint8
dtypes: category(7), float64(6), uint8(22)
memory usage: 3.3 MB

In [21]:

# Convert 'astype' of categorical columns to integer for getting it ready for MICE
columns = ['education', 'default', 'housing', 'loan', 'day', 'month', 'Target']
for col in columns:
    bank_nulls[col] = bank_nulls[col].astype('float')

Use MICE imputer to handled outliers that were filled with `np.nan` in the earlier step¶

In [22]:

# start the MICE training
bank_imputed = mice(bank_nulls.values)
bank_imputed = pd.DataFrame(bank_imputed, columns = bank_nulls.columns)

display(bank.describe(include = 'all').T, bank_imputed.describe(include = 'all').T)

	count	unique	top	freq	mean	std	min	25%	50%	75%	max
age	45211	NaN	NaN	NaN	40.9362	10.6188	18	33	39	48	95
job	45211	12	blue-collar	9732	NaN	NaN	NaN	NaN	NaN	NaN	NaN
marital	45211	3	married	27214	NaN	NaN	NaN	NaN	NaN	NaN	NaN
education	45211	4	2	23202	NaN	NaN	NaN	NaN	NaN	NaN	NaN
default	45211	2	0	44396	NaN	NaN	NaN	NaN	NaN	NaN	NaN
balance	45211	NaN	NaN	NaN	1362.27	3044.77	-8019	72	448	1428	102127
housing	45211	2	1	25130	NaN	NaN	NaN	NaN	NaN	NaN	NaN
loan	45211	2	0	37967	NaN	NaN	NaN	NaN	NaN	NaN	NaN
contact	45211	3	cellular	29285	NaN	NaN	NaN	NaN	NaN	NaN	NaN
day	45211	31	20	2752	NaN	NaN	NaN	NaN	NaN	NaN	NaN
month	45211	12	5	13766	NaN	NaN	NaN	NaN	NaN	NaN	NaN
duration	45211	NaN	NaN	NaN	258.163	257.528	0	103	180	319	4918
campaign	45211	NaN	NaN	NaN	2.76384	3.09802	1	1	2	3	63
pdays	45211	NaN	NaN	NaN	40.1978	100.129	-1	-1	-1	-1	871
previous	45211	NaN	NaN	NaN	0.580323	2.30344	0	0	0	0	275
poutcome	45211	4	unknown	36959	NaN	NaN	NaN	NaN	NaN	NaN	NaN
Target	45211	2	0	39922	NaN	NaN	NaN	NaN	NaN	NaN	NaN

	count	mean	std	min	25%	50%	75%	max
age	45211.0	40.836870	10.073690	23.000000	33.0	39.0	48.0	71.743841
education	45211.0	2.019442	0.902795	-1.000000	2.0	2.0	3.0	3.000000
default	45211.0	0.018027	0.133049	0.000000	0.0	0.0	0.0	1.000000
balance	45211.0	1174.932363	1898.534988	-812.502754	81.0	467.0	1402.0	13164.000000
housing	45211.0	0.555838	0.496878	0.000000	0.0	1.0	1.0	1.000000
loan	45211.0	0.160226	0.366820	0.000000	0.0	0.0	0.0	1.000000
day	45211.0	15.806419	8.322476	1.000000	8.0	16.0	21.0	31.000000
month	45211.0	6.144655	2.408034	1.000000	5.0	6.0	8.0	12.000000
duration	45211.0	247.428930	211.290370	11.000000	106.0	183.0	316.0	1269.000000
campaign	45211.0	2.562222	2.214906	1.000000	1.0	2.0	3.0	16.000000
pdays	45211.0	37.922341	92.399489	-47.832018	-1.0	-1.0	-1.0	370.000000
previous	45211.0	0.461333	1.208702	0.000000	0.0	0.0	0.0	8.000000
Target	45211.0	0.116985	0.321406	0.000000	0.0	0.0	0.0	1.000000
job_admin.	45211.0	0.114375	0.318269	0.000000	0.0	0.0	0.0	1.000000
job_blue-collar	45211.0	0.215257	0.411005	0.000000	0.0	0.0	0.0	1.000000
job_entrepreneur	45211.0	0.032890	0.178351	0.000000	0.0	0.0	0.0	1.000000
job_housemaid	45211.0	0.027427	0.163326	0.000000	0.0	0.0	0.0	1.000000
job_management	45211.0	0.209197	0.406740	0.000000	0.0	0.0	0.0	1.000000
job_retired	45211.0	0.050076	0.218105	0.000000	0.0	0.0	0.0	1.000000
job_self-employed	45211.0	0.034925	0.183592	0.000000	0.0	0.0	0.0	1.000000
job_services	45211.0	0.091880	0.288860	0.000000	0.0	0.0	0.0	1.000000
job_student	45211.0	0.020747	0.142538	0.000000	0.0	0.0	0.0	1.000000
job_technician	45211.0	0.168034	0.373901	0.000000	0.0	0.0	0.0	1.000000
job_unemployed	45211.0	0.028820	0.167303	0.000000	0.0	0.0	0.0	1.000000
job_unknown	45211.0	0.006370	0.079559	0.000000	0.0	0.0	0.0	1.000000
marital_divorced	45211.0	0.115171	0.319232	0.000000	0.0	0.0	0.0	1.000000
marital_married	45211.0	0.601933	0.489505	0.000000	0.0	1.0	1.0	1.000000
marital_single	45211.0	0.282896	0.450411	0.000000	0.0	0.0	1.0	1.000000
contact_cellular	45211.0	0.647741	0.477680	0.000000	0.0	1.0	1.0	1.000000
contact_telephone	45211.0	0.064276	0.245247	0.000000	0.0	0.0	0.0	1.000000
contact_unknown	45211.0	0.287983	0.452828	0.000000	0.0	0.0	1.0	1.000000
poutcome_failure	45211.0	0.108403	0.310892	0.000000	0.0	0.0	0.0	1.000000
poutcome_other	45211.0	0.040698	0.197592	0.000000	0.0	0.0	0.0	1.000000
poutcome_success	45211.0	0.033421	0.179735	0.000000	0.0	0.0	0.0	1.000000
poutcome_unknown	45211.0	0.817478	0.386278	0.000000	1.0	1.0	1.0	1.000000

Observation 7 - Observation after MICE¶

Column	Before MICE	After MICE
`age`	Range of Q1 to Q3 is 33-48. Mean > Median, right (positively) skewed	Range of Q1 to Q3 is unchanged, because of change in min and max values there's a slight reduction is mean, right skewed
`balance`	Range of Q1 to Q3 is 72-1428. Mean > Median, skewed towards right (positively)	Range of Q1 to Q3 is 81 to 1402, reduction in mean, right skewed
`duration`	Range of Q1 to Q3 is 103-319. Mean > Median, right (positively) skewed	Range of Q1 to Q3 is 106-316, right skewed
`campaign`	Range of Q1 to Q3 is 1-3. Mean > Median, right (positively) skewed	Unchanged range and skewness
`pdays`	75% of data values are around -1	Unchanged
`previous`	75% of data values are around 0	Unchanged

Checking whether count of 0 in previous column is equal to count of -1 in pdays column¶

In [23]:

# Checking whether count of 0 in previous is equal to count of -1 in pdays
display(bank_imputed.loc[bank_imputed['previous'] == 0, 'previous'].value_counts().sum(), 
        bank_imputed.loc[bank_imputed['pdays'] == -1, 'pdays'].value_counts().sum())

Observation 8 - pdays and previous¶

Count of 0 in previous is equal to count of -1 in pdays column, we might replace -1 in pdays with 0 to account for cases where the client wasn't contacted previously. Checking correlation between variables and target next...

Multivariate visualization¶

Checking relationship between two or more variables. Includes correlation and scatterplot matrix, checking relation between two variables and Target.

Check scattermatrix¶

In [24]:

sns.pairplot(bank_imputed[['age', 'education', 'default', 'balance', 'housing', 'loan', 'day', 'month', 
                           'duration', 'campaign', 'pdays', 'previous', 'Target']], hue = 'Target')

Out[24]:

<seaborn.axisgrid.PairGrid at 0x2c2e9c16bc8>

Correlation matrix¶

In [25]:

# Correlation matrix for all variables
corr = bank_imputed.corr()

mask = np.zeros_like(corr, dtype = np.bool)
mask[np.triu_indices_from(mask)] = True

f, ax = plt.subplots(figsize = (11, 9))

cmap = sns.diverging_palette(220, 10, as_cmap = True)

sns.heatmap(corr, mask = mask, cmap = cmap, square = True, linewidths = .5, cbar_kws = {"shrink": .5})#, annot = True)
ax.set_title('Correlation Matrix of Data')

Out[25]:

Text(0.5, 1, 'Correlation Matrix of Data')

In [26]:

# Filter for correlation value greater than 0.8
sort = corr.abs().unstack()
sort = sort.sort_values(kind = "quicksort", ascending = False)
sort[(sort > 0.8) & (sort < 1)]

Out[26]:

pdays             poutcome_unknown    0.891235
poutcome_unknown  pdays               0.891235
contact_unknown   contact_cellular    0.862398
contact_cellular  contact_unknown     0.862398
previous          poutcome_unknown    0.806952
poutcome_unknown  previous            0.806952
dtype: float64

In [27]:

# Absolute correlation of independent variables with 'Target' i.e. the target variable
absCorrwithDep = []
allVars = bank_imputed.drop('Target', axis = 1).columns

for var in allVars:
    absCorrwithDep.append(abs(bank_imputed['Target'].corr(bank_imputed[var])))

display(pd.DataFrame([allVars, absCorrwithDep], index = ['Variable', 'Correlation']).T.\
        sort_values('Correlation', ascending = False))

	Variable	Correlation
8	duration	0.398107
32	poutcome_success	0.306788
33	poutcome_unknown	0.167051
11	previous	0.153341
29	contact_unknown	0.150935
4	housing	0.139173
27	contact_cellular	0.135873
10	pdays	0.0865931
17	job_retired	0.0792453
3	balance	0.0769227
20	job_student	0.076897
9	campaign	0.0754512
13	job_blue-collar	0.0720831
5	loan	0.068185
26	marital_single	0.0635258
25	marital_married	0.0602604
1	education	0.0416343
16	job_management	0.0329188
31	poutcome_other	0.031955
6	day	0.0283478
19	job_services	0.0278639
2	default	0.022419
22	job_unemployed	0.0203899
14	job_entrepreneur	0.0196623
7	month	0.018717
15	job_housemaid	0.0151949
28	contact_telephone	0.0140425
0	age	0.0134745
30	poutcome_failure	0.00988545
21	job_technician	0.00896982
12	job_admin.	0.00563747
24	marital_divorced	0.00277237
18	job_self-employed	0.000855031
23	job_unknown	0.000266748

Observation 9 - Correlation Matrix¶

poutcome_unknown and pdays; contact_unknown and contact_cellular; poutcome_unknown and previous; marital_married and marital_single; poutcome_unknown and poutcome_failure; pdays and poutcome_failure; previous and pdays; poutcome_failure and previous columns are correlated with each other.
duration, poutcome_success, poutcome_unknown and previous are few columns which have a relatively strong correlation with Target column.

In [28]:

#bank_imputed.drop(['pdays', 'contact_cellular'], axis = 1, inplace = True) #, 'previous', 'marital_married', 'poutcome_failure'

Creating age groups and check relation with balance and target; also with campaign and target¶

In [29]:

# Creating age groups
bank_imputed.loc[(bank_imputed['age'] < 30), 'age_group'] = 20
bank_imputed.loc[(bank_imputed['age'] >= 30) & (bank_imputed['age'] < 40), 'age_group'] = 30
bank_imputed.loc[(bank_imputed['age'] >= 40) & (bank_imputed['age'] < 50), 'age_group'] = 40
bank_imputed.loc[(bank_imputed['age'] >= 50) & (bank_imputed['age'] < 60), 'age_group'] = 50
bank_imputed.loc[(bank_imputed['age'] >= 60), 'age_group'] = 60

In [30]:

# Check relationship between balance and age group by Target
fig = plt.figure(figsize = (15, 7.2))
ax = sns.boxplot(x = 'age_group', y = 'balance', hue = 'Target', palette = 'afmhot', data = bank_imputed)
ax.set_title('Relationship between balance and age group by Target')

Out[30]:

Text(0.5, 1.0, 'Relationship between balance and age group by Target')

In [31]:

# Check relationship between campaign and age group by Target
fig = plt.figure(figsize = (15, 7.2))
ax = sns.boxplot(x = 'age_group', y = 'campaign', hue = 'Target', palette = 'afmhot', data = bank_imputed)
ax.set_title('Relationship between campaign and age group by Target')

Out[31]:

Text(0.5, 1.0, 'Relationship between campaign and age group by Target')

In [32]:

# bank_imputed.drop(['age_group'], axis = 1, inplace = True)

Observation 10 - Comments¶

Created age_group and checked it's relation with balance and target and it appears that higher the balance range more are the chances that the client would subscribe to the term deposit irrespective of age group. It also appears that clients within age group 50 have the highest range of balance.
Then checked relation between campaign, age group and target and it appears that campaigns for client with age group 20 and 60 are less.

Modelling¶

Create a baseline model
Use different classification models (Logistic, K-NN and Naïve Bayes) to predict will the client subscribe to term deposit.
Training and making predictions using an Ensemble Model.

Dummy Classifier -- Baseline Model¶

In [33]:

# Separating dependent and independent variables
X = bank_imputed.drop(['Target'], axis = 1)
y = bank_imputed['Target']

# Splitting the data into training and test set in the ratio of 70:30 respectively
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = random_state)

dummy = DummyClassifier(strategy = 'most_frequent', random_state = random_state)
dummy.fit(X_train, y_train)
y_pred = dummy.predict(X_test)

accuracy_ = accuracy_score(y_test, y_pred)
pre_s = precision_score(y_test, y_pred, average = 'binary', pos_label = 1)
re_s = recall_score(y_test, y_pred, average = 'binary', pos_label = 1)
f1_s = f1_score(y_test, y_pred, average = 'binary', pos_label = 1)

pre_m = precision_score(y_test, y_pred, average = 'macro')
re_m = recall_score(y_test, y_pred, average = 'macro')
f1_m = f1_score(y_test, y_pred, average = 'macro')

print('Training Score: ', dummy.score(X_train, y_train).round(3))
print('Test Score: ', dummy.score(X_test, y_test).round(3))

print('Accuracy: ', accuracy_.round(3))
print('Precision Score - Subscribe: ', pre_s.round(3))
print('Recall Score - Subscribe: ', re_s.round(3))
print('f1 Score - Subscribe: ', f1_s.round(3))

print('Precision Score - Macro: ', pre_m.round(3))
print('Recall Score - Macro: ', re_m.round(3))
print('f1 Score - Macro: ', f1_m.round(3))

df = pd.DataFrame([accuracy_.round(3), pre_s.round(3), pre_m.round(3), re_s.round(3), 
                   re_m.round(3), f1_s.round(3), f1_m.round(3)], columns = ['Baseline Model']).T
df.columns = ['Accuracy', 'Precision_Subscribe', 'Precision_Macro',
              'Recall_Subscribe', 'Recall_Macro', 'f1_Subscribe', 'f1_Macro']
df

Training Score:  0.883
Test Score:  0.882
Accuracy:  0.882
Precision Score - Subscribe:  0.0
Recall Score - Subscribe:  0.0
f1 Score - Subscribe:  0.0
Precision Score - Macro:  0.441
Recall Score - Macro:  0.5
f1 Score - Macro:  0.469

Out[33]:

	Accuracy	Precision_Subscribe	Precision_Macro	Recall_Subscribe	Recall_Macro	f1_Subscribe	f1_Macro
Baseline Model	0.882	0.0	0.441	0.0	0.5	0.0	0.469

In [34]:

# Helper function for making prediction and evaluating scores
def train_and_predict(n_splits, base_model, X, y, name, subscribe = 1, oversampling = False):
    features = X.columns
    X = np.array(X)
    y = np.array(y)
    
    folds = list(StratifiedKFold(n_splits = n_splits, shuffle = True, random_state = random_state).split(X, y))
    
    train_pred = np.zeros((X.shape[0], len(base_model)))
    
    accuracy = []

    precision_subscribe = []
    recall_subscribe = []
    f1_subscribe = []
    
    precision_macro = []
    recall_macro = []
    f1_macro = []
    
    for i, clf in enumerate(base_model):
        for j, (train, test) in enumerate(folds):
            
            # Creating train and test sets
            X_train = X[train]
            y_train = y[train]
            X_test = X[test]
            y_test = y[test]
            
            if oversampling:
                sm = SMOTE(random_state = random_state, sampling_strategy = 'minority')
                X_train_res, y_train_res = sm.fit_sample(X_train, y_train)
           
            # fit the model
                clf.fit(X_train_res, y_train_res)

            # Get predictions
                y_true, y_pred = y_test, clf.predict(X_test)

            # Evaluate train and test scores
                train_ = clf.score(X_train_res, y_train_res)
                test_ = clf.score(X_test, y_test)
            
            else:
            
            # fit the model
                clf.fit(X_train, y_train)

            # Get predictions
                y_true, y_pred = y_test, clf.predict(X_test)

            # Evaluate train and test scores
                train_ = clf.score(X_train, y_train)
                test_ = clf.score(X_test, y_test)
                      
            # Other scores
            accuracy_ = accuracy_score(y_true, y_pred).round(3)
            
            precision_b = precision_score(y_true, y_pred, average = 'binary', pos_label = subscribe).round(3)
            recall_b = recall_score(y_true, y_pred, average = 'binary', pos_label = subscribe).round(3)
            f1_b = f1_score(y_true, y_pred, average = 'binary', pos_label = subscribe).round(3)
            
            precision_m = precision_score(y_true, y_pred, average = 'macro').round(3)
            recall_m = recall_score(y_true, y_pred, average = 'macro').round(3)
            f1_m = f1_score(y_true, y_pred, average = 'macro').round(3)
            
            print(f'Model- {name.capitalize()} and CV- {j}')
            print('-'*20)
            print('Training Score: {0:.3f}'.format(train_))
            print('Test Score: {0:.3f}'.format(test_))
            
            print('Accuracy Score: {0:.3f}'.format(accuracy_))
            
            print('Precision Score - Subscribe: {0:.3f}'.format(precision_b))
            print('Recall Score - Subscribe: {0:.3f}'.format(recall_b))
            print('f1 Score - Subscribe: {0:.3f}'.format(f1_b))
            
            print('Precision Score - Macro: {0:.3f}'.format(precision_m))
            print('Recall Score - Macro: {0:.3f}'.format(recall_m))
            print('f1 Score - Macro: {0:.3f}'.format(f1_m))
            print('\n')
            
            ## Appending scores   
            accuracy.append(accuracy_)
            precision_subscribe.append(precision_b)
            recall_subscribe.append(recall_b)
            f1_subscribe.append(f1_b)
            precision_macro.append(precision_m)
            recall_macro.append(recall_m)
            f1_macro.append(f1_m)
                       
            # Creating a dataframe of scores
            df = pd.DataFrame([np.mean(accuracy).round(3), np.mean(precision_subscribe).round(3), 
                               np.mean(precision_macro).round(3), np.mean(recall_subscribe).round(3), 
                               np.mean(recall_macro).round(3), np.mean(f1_subscribe).round(3), 
                               np.mean(f1_macro).round(3)], columns = [name]).T
            df.columns = ['Accuracy', 'Precision_Subscribe', 'Precision_Macro',
                          'Recall_Subscribe', 'Recall_Macro', 'f1_Subscribe', 'f1_Macro']
            
    return df

In [35]:

# Separating dependent and independent variables
from sklearn.preprocessing import RobustScaler
X = bank_imputed.drop(['Target'], axis = 1)
y = bank_imputed['Target']

# Applying RobustScaler to make it less prone to outliers
features = X.columns
scaler = RobustScaler()
X = pd.DataFrame(scaler.fit_transform(X), columns = features)

# Scaling the independent variables
Xs = X.apply(zscore)

display(X.shape, Xs.shape, y.shape)

(45211, 35)

(45211, 35)

(45211,)

Logistic Regression, kNN and Naive Bayes¶

Oversampling the one with better accuracy and recall score for subscribe

Logistic Regression¶

In [36]:

# LR model without hyperparameter tuning
LR = LogisticRegression()
base_model = [LR]
n_splits = 5
df1 = train_and_predict(n_splits, base_model, X, y, 'Logistic Regression Without Hyperparameter Tuning')
df = df.append(df1)
df

Model- Logistic regression without hyperparameter tuning and CV- 0
--------------------
Training Score: 0.899
Test Score: 0.897
Accuracy Score: 0.897
Precision Score - Subscribe: 0.605
Recall Score - Subscribe: 0.337
f1 Score - Subscribe: 0.433
Precision Score - Macro: 0.761
Recall Score - Macro: 0.654
f1 Score - Macro: 0.688


Model- Logistic regression without hyperparameter tuning and CV- 1
--------------------
Training Score: 0.898
Test Score: 0.900
Accuracy Score: 0.900
Precision Score - Subscribe: 0.639
Recall Score - Subscribe: 0.335
f1 Score - Subscribe: 0.439
Precision Score - Macro: 0.778
Recall Score - Macro: 0.655
f1 Score - Macro: 0.692


Model- Logistic regression without hyperparameter tuning and CV- 2
--------------------
Training Score: 0.899
Test Score: 0.897
Accuracy Score: 0.897
Precision Score - Subscribe: 0.614
Recall Score - Subscribe: 0.319
f1 Score - Subscribe: 0.419
Precision Score - Macro: 0.764
Recall Score - Macro: 0.646
f1 Score - Macro: 0.681


Model- Logistic regression without hyperparameter tuning and CV- 3
--------------------
Training Score: 0.898
Test Score: 0.901
Accuracy Score: 0.901
Precision Score - Subscribe: 0.645
Recall Score - Subscribe: 0.341
f1 Score - Subscribe: 0.446
Precision Score - Macro: 0.781
Recall Score - Macro: 0.658
f1 Score - Macro: 0.696


Model- Logistic regression without hyperparameter tuning and CV- 4
--------------------
Training Score: 0.899
Test Score: 0.898
Accuracy Score: 0.898
Precision Score - Subscribe: 0.627
Recall Score - Subscribe: 0.325
f1 Score - Subscribe: 0.428
Precision Score - Macro: 0.771
Recall Score - Macro: 0.649
f1 Score - Macro: 0.686

Out[36]:

	Accuracy	Precision_Subscribe	Precision_Macro	Recall_Subscribe	Recall_Macro	f1_Subscribe	f1_Macro
Baseline Model	0.882	0.000	0.441	0.000	0.500	0.000	0.469
Logistic Regression Without Hyperparameter Tuning	0.899	0.626	0.771	0.331	0.652	0.433	0.689

In [37]:

# LR with hyperparameter tuning
LR = LogisticRegression(n_jobs = -1, random_state = random_state)

params = {'penalty': ['l1', 'l2'], 'C': [0.001, 0.01, 0.1, 1, 10, 100], 'max_iter': [100, 110, 120, 130, 140]}
scoring = {'Recall': make_scorer(recall_score), 'f1_score': make_scorer(f1_score)}

skf = StratifiedKFold(n_splits = 10, shuffle = True, random_state = random_state)
LR_hyper = GridSearchCV(LR, param_grid = params, n_jobs = -1, cv = skf, scoring = scoring, refit = 'f1_score')

LR_hyper.fit(X_train, y_train)
print(LR_hyper.best_estimator_)
print(LR_hyper.best_params_)

LogisticRegression(C=100, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=-1, penalty='l2', random_state=42,
                   solver='warn', tol=0.0001, verbose=0, warm_start=False)
{'C': 100, 'max_iter': 100, 'penalty': 'l2'}

In [38]:

# LR model with hyperparameter tuning
LR_Hyper = LogisticRegression(C = 100, class_weight = None, dual = False, fit_intercept = True, 
                              intercept_scaling = 1, l1_ratio = None, max_iter = 100,
                              multi_class = 'warn', n_jobs = -1, penalty = 'l2', random_state = 42,
                              solver = 'warn', tol = 0.0001, verbose = 0, warm_start = False)
base_model = [LR_Hyper]
n_splits = 5
df1 = train_and_predict(n_splits, base_model, X, y, 'Logistic Regression With Hyperparameter Tuning')
df = df.append(df1)
df

Model- Logistic regression with hyperparameter tuning and CV- 0
--------------------
Training Score: 0.899
Test Score: 0.897
Accuracy Score: 0.897
Precision Score - Subscribe: 0.604
Recall Score - Subscribe: 0.337
f1 Score - Subscribe: 0.433
Precision Score - Macro: 0.761
Recall Score - Macro: 0.654
f1 Score - Macro: 0.688


Model- Logistic regression with hyperparameter tuning and CV- 1
--------------------
Training Score: 0.898
Test Score: 0.900
Accuracy Score: 0.900
Precision Score - Subscribe: 0.640
Recall Score - Subscribe: 0.336
f1 Score - Subscribe: 0.440
Precision Score - Macro: 0.778
Recall Score - Macro: 0.655
f1 Score - Macro: 0.693


Model- Logistic regression with hyperparameter tuning and CV- 2
--------------------
Training Score: 0.899
Test Score: 0.897
Accuracy Score: 0.897
Precision Score - Subscribe: 0.615
Recall Score - Subscribe: 0.319
f1 Score - Subscribe: 0.420
Precision Score - Macro: 0.765
Recall Score - Macro: 0.646
f1 Score - Macro: 0.682


Model- Logistic regression with hyperparameter tuning and CV- 3
--------------------
Training Score: 0.898
Test Score: 0.901
Accuracy Score: 0.901
Precision Score - Subscribe: 0.644
Recall Score - Subscribe: 0.340
f1 Score - Subscribe: 0.445
Precision Score - Macro: 0.781
Recall Score - Macro: 0.658
f1 Score - Macro: 0.695


Model- Logistic regression with hyperparameter tuning and CV- 4
--------------------
Training Score: 0.898
Test Score: 0.898
Accuracy Score: 0.898
Precision Score - Subscribe: 0.627
Recall Score - Subscribe: 0.325
f1 Score - Subscribe: 0.428
Precision Score - Macro: 0.771
Recall Score - Macro: 0.650
f1 Score - Macro: 0.686

Out[38]:

	Accuracy	Precision_Subscribe	Precision_Macro	Recall_Subscribe	Recall_Macro	f1_Subscribe	f1_Macro
Baseline Model	0.882	0.000	0.441	0.000	0.500	0.000	0.469
Logistic Regression Without Hyperparameter Tuning	0.899	0.626	0.771	0.331	0.652	0.433	0.689
Logistic Regression With Hyperparameter Tuning	0.899	0.626	0.771	0.331	0.653	0.433	0.689

k-Nearest Neighbor Classifier¶

In [39]:

# KNN Model after scaling the features without hyperparameter tuning
kNN = KNeighborsClassifier()
base_model = [kNN]
n_splits = 5
df1 = train_and_predict(n_splits, base_model, Xs, y, 'k-Nearest Neighbor Scaled Without Hyperparameter Tuning')
df = df.append(df1)
df

Model- K-nearest neighbor scaled without hyperparameter tuning and CV- 0
--------------------
Training Score: 0.916
Test Score: 0.892
Accuracy Score: 0.892
Precision Score - Subscribe: 0.575
Recall Score - Subscribe: 0.308
f1 Score - Subscribe: 0.401
Precision Score - Macro: 0.744
Recall Score - Macro: 0.639
f1 Score - Macro: 0.671


Model- K-nearest neighbor scaled without hyperparameter tuning and CV- 1
--------------------
Training Score: 0.916
Test Score: 0.889
Accuracy Score: 0.889
Precision Score - Subscribe: 0.545
Recall Score - Subscribe: 0.312
f1 Score - Subscribe: 0.397
Precision Score - Macro: 0.730
Recall Score - Macro: 0.639
f1 Score - Macro: 0.668


Model- K-nearest neighbor scaled without hyperparameter tuning and CV- 2
--------------------
Training Score: 0.916
Test Score: 0.890
Accuracy Score: 0.890
Precision Score - Subscribe: 0.562
Recall Score - Subscribe: 0.286
f1 Score - Subscribe: 0.379
Precision Score - Macro: 0.737
Recall Score - Macro: 0.628
f1 Score - Macro: 0.660


Model- K-nearest neighbor scaled without hyperparameter tuning and CV- 3
--------------------
Training Score: 0.916
Test Score: 0.892
Accuracy Score: 0.892
Precision Score - Subscribe: 0.571
Recall Score - Subscribe: 0.305
f1 Score - Subscribe: 0.398
Precision Score - Macro: 0.742
Recall Score - Macro: 0.637
f1 Score - Macro: 0.669


Model- K-nearest neighbor scaled without hyperparameter tuning and CV- 4
--------------------
Training Score: 0.917
Test Score: 0.892
Accuracy Score: 0.892
Precision Score - Subscribe: 0.578
Recall Score - Subscribe: 0.285
f1 Score - Subscribe: 0.381
Precision Score - Macro: 0.745
Recall Score - Macro: 0.629
f1 Score - Macro: 0.661

Out[39]:

	Accuracy	Precision_Subscribe	Precision_Macro	Recall_Subscribe	Recall_Macro	f1_Subscribe	f1_Macro
Baseline Model	0.882	0.000	0.441	0.000	0.500	0.000	0.469
Logistic Regression Without Hyperparameter Tuning	0.899	0.626	0.771	0.331	0.652	0.433	0.689
Logistic Regression With Hyperparameter Tuning	0.899	0.626	0.771	0.331	0.653	0.433	0.689
k-Nearest Neighbor Scaled Without Hyperparameter Tuning	0.891	0.566	0.740	0.299	0.634	0.391	0.666

In [40]:

# Choosing a K Value
error_rate = {}
weights = ['uniform', 'distance']
for w in weights:
    print(w)
    rate = []
    for i in range(1, 40):
        knn = KNeighborsClassifier(n_neighbors = i, weights = w)
        knn.fit(X_train, y_train)
        y_pred = knn.predict(X_test)
        rate.append(np.mean(y_pred != y_test))
    plt.figure(figsize = (15, 7.2))
    plt.plot(range(1, 40), rate, color = 'blue', linestyle = 'dashed', marker = 'o', 
             markerfacecolor = 'red', markersize = 10)
    plt.title('Error Rate vs K Value')
    plt.xlabel('K')
    plt.ylabel('Error Rate')
    plt.show()

uniform

distance

In [41]:

# KNN with hyperparameter tuning
kNN = KNeighborsClassifier(n_jobs = -1)

params = {'n_neighbors': list(range(3, 40, 2)), 'weights': ['uniform', 'distance']}

scoring = {'Recall': make_scorer(recall_score), 'f1_score': make_scorer(f1_score)}

skf = StratifiedKFold(n_splits = 3, shuffle = True, random_state = random_state)

kNN_hyper = GridSearchCV(kNN, param_grid = params, n_jobs = -1, cv = skf, scoring = scoring, refit = 'f1_score')

kNN_hyper.fit(X_train, y_train)
print(kNN_hyper.best_estimator_)
print(kNN_hyper.best_params_)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=-1, n_neighbors=3, p=2,
                     weights='distance')
{'n_neighbors': 3, 'weights': 'distance'}

In [42]:

# KNN with hyperparameter tuning
kNN_hyper = KNeighborsClassifier(algorithm = 'auto', leaf_size = 30, metric = 'minkowski', metric_params = None, 
                                 n_jobs = -1, n_neighbors = 3, p = 2, weights = 'distance')
base_model = [kNN_hyper]
n_splits = 5
df1 = train_and_predict(n_splits, base_model, Xs, y, 'k-Nearest Neighbor Scaled With Hyperparameter Tuning')
df = df.append(df1)
df

Model- K-nearest neighbor scaled with hyperparameter tuning and CV- 0
--------------------
Training Score: 1.000
Test Score: 0.890
Accuracy Score: 0.890
Precision Score - Subscribe: 0.544
Recall Score - Subscribe: 0.350
f1 Score - Subscribe: 0.426
Precision Score - Macro: 0.731
Recall Score - Macro: 0.655
f1 Score - Macro: 0.682


Model- K-nearest neighbor scaled with hyperparameter tuning and CV- 1
--------------------
Training Score: 1.000
Test Score: 0.885
Accuracy Score: 0.885
Precision Score - Subscribe: 0.514
Recall Score - Subscribe: 0.348
f1 Score - Subscribe: 0.415
Precision Score - Macro: 0.716
Recall Score - Macro: 0.652
f1 Score - Macro: 0.676


Model- K-nearest neighbor scaled with hyperparameter tuning and CV- 2
--------------------
Training Score: 1.000
Test Score: 0.886
Accuracy Score: 0.886
Precision Score - Subscribe: 0.518
Recall Score - Subscribe: 0.332
f1 Score - Subscribe: 0.405
Precision Score - Macro: 0.717
Recall Score - Macro: 0.645
f1 Score - Macro: 0.671


Model- K-nearest neighbor scaled with hyperparameter tuning and CV- 3
--------------------
Training Score: 1.000
Test Score: 0.888
Accuracy Score: 0.888
Precision Score - Subscribe: 0.534
Recall Score - Subscribe: 0.338
f1 Score - Subscribe: 0.414
Precision Score - Macro: 0.725
Recall Score - Macro: 0.650
f1 Score - Macro: 0.676


Model- K-nearest neighbor scaled with hyperparameter tuning and CV- 4
--------------------
Training Score: 1.000
Test Score: 0.887
Accuracy Score: 0.887
Precision Score - Subscribe: 0.524
Recall Score - Subscribe: 0.333
f1 Score - Subscribe: 0.407
Precision Score - Macro: 0.720
Recall Score - Macro: 0.646
f1 Score - Macro: 0.672

Out[42]:

	Accuracy	Precision_Subscribe	Precision_Macro	Recall_Subscribe	Recall_Macro	f1_Subscribe	f1_Macro
Baseline Model	0.882	0.000	0.441	0.000	0.500	0.000	0.469
Logistic Regression Without Hyperparameter Tuning	0.899	0.626	0.771	0.331	0.652	0.433	0.689
Logistic Regression With Hyperparameter Tuning	0.899	0.626	0.771	0.331	0.653	0.433	0.689
k-Nearest Neighbor Scaled Without Hyperparameter Tuning	0.891	0.566	0.740	0.299	0.634	0.391	0.666
k-Nearest Neighbor Scaled With Hyperparameter Tuning	0.887	0.527	0.722	0.340	0.650	0.413	0.675

Naive Bayes Classifier¶

In [43]:

# Naive Bayes Model
NB = GaussianNB()
base_model = [NB]
n_splits = 5
df1 = train_and_predict(n_splits, base_model, X, y, 'Naive Bayes Classifier')
df = df.append(df1)
df

Model- Naive bayes classifier and CV- 0
--------------------
Training Score: 0.813
Test Score: 0.808
Accuracy Score: 0.808
Precision Score - Subscribe: 0.312
Recall Score - Subscribe: 0.533
f1 Score - Subscribe: 0.394
Precision Score - Macro: 0.622
Recall Score - Macro: 0.689
f1 Score - Macro: 0.640


Model- Naive bayes classifier and CV- 1
--------------------
Training Score: 0.815
Test Score: 0.812
Accuracy Score: 0.812
Precision Score - Subscribe: 0.311
Recall Score - Subscribe: 0.501
f1 Score - Subscribe: 0.384
Precision Score - Macro: 0.619
Recall Score - Macro: 0.677
f1 Score - Macro: 0.636


Model- Naive bayes classifier and CV- 2
--------------------
Training Score: 0.818
Test Score: 0.819
Accuracy Score: 0.819
Precision Score - Subscribe: 0.324
Recall Score - Subscribe: 0.505
f1 Score - Subscribe: 0.395
Precision Score - Macro: 0.627
Recall Score - Macro: 0.683
f1 Score - Macro: 0.644


Model- Naive bayes classifier and CV- 3
--------------------
Training Score: 0.814
Test Score: 0.821
Accuracy Score: 0.821
Precision Score - Subscribe: 0.334
Recall Score - Subscribe: 0.539
f1 Score - Subscribe: 0.413
Precision Score - Macro: 0.634
Recall Score - Macro: 0.698
f1 Score - Macro: 0.653


Model- Naive bayes classifier and CV- 4
--------------------
Training Score: 0.817
Test Score: 0.816
Accuracy Score: 0.816
Precision Score - Subscribe: 0.323
Recall Score - Subscribe: 0.525
f1 Score - Subscribe: 0.400
Precision Score - Macro: 0.627
Recall Score - Macro: 0.690
f1 Score - Macro: 0.646

Out[43]:

	Accuracy	Precision_Subscribe	Precision_Macro	Recall_Subscribe	Recall_Macro	f1_Subscribe	f1_Macro
Baseline Model	0.882	0.000	0.441	0.000	0.500	0.000	0.469
Logistic Regression Without Hyperparameter Tuning	0.899	0.626	0.771	0.331	0.652	0.433	0.689
Logistic Regression With Hyperparameter Tuning	0.899	0.626	0.771	0.331	0.653	0.433	0.689
k-Nearest Neighbor Scaled Without Hyperparameter Tuning	0.891	0.566	0.740	0.299	0.634	0.391	0.666
k-Nearest Neighbor Scaled With Hyperparameter Tuning	0.887	0.527	0.722	0.340	0.650	0.413	0.675
Naive Bayes Classifier	0.815	0.321	0.626	0.521	0.687	0.397	0.644

Oversampling and Naive Bayes¶

In [44]:

# Naive Bayes with oversampling
NB_over = GaussianNB()
base_model = [NB_over]
n_splits = 5
df1 = train_and_predict(n_splits, base_model, X, y, 'Naive Bayes, Oversampled', 
                       oversampling = True)
df = df.append(df1)
df

Model- Naive bayes, oversampled and CV- 0
--------------------
Training Score: 0.734
Test Score: 0.709
Accuracy Score: 0.709
Precision Score - Subscribe: 0.246
Recall Score - Subscribe: 0.722
f1 Score - Subscribe: 0.367
Precision Score - Macro: 0.598
Recall Score - Macro: 0.715
f1 Score - Macro: 0.589


Model- Naive bayes, oversampled and CV- 1
--------------------
Training Score: 0.733
Test Score: 0.713
Accuracy Score: 0.713
Precision Score - Subscribe: 0.244
Recall Score - Subscribe: 0.692
f1 Score - Subscribe: 0.361
Precision Score - Macro: 0.595
Recall Score - Macro: 0.704
f1 Score - Macro: 0.588


Model- Naive bayes, oversampled and CV- 2
--------------------
Training Score: 0.734
Test Score: 0.729
Accuracy Score: 0.729
Precision Score - Subscribe: 0.259
Recall Score - Subscribe: 0.709
f1 Score - Subscribe: 0.380
Precision Score - Macro: 0.605
Recall Score - Macro: 0.720
f1 Score - Macro: 0.603


Model- Naive bayes, oversampled and CV- 3
--------------------
Training Score: 0.733
Test Score: 0.717
Accuracy Score: 0.717
Precision Score - Subscribe: 0.251
Recall Score - Subscribe: 0.718
f1 Score - Subscribe: 0.373
Precision Score - Macro: 0.601
Recall Score - Macro: 0.718
f1 Score - Macro: 0.595


Model- Naive bayes, oversampled and CV- 4
--------------------
Training Score: 0.734
Test Score: 0.715
Accuracy Score: 0.715
Precision Score - Subscribe: 0.249
Recall Score - Subscribe: 0.715
f1 Score - Subscribe: 0.370
Precision Score - Macro: 0.600
Recall Score - Macro: 0.715
f1 Score - Macro: 0.593

Out[44]:

	Accuracy	Precision_Subscribe	Precision_Macro	Recall_Subscribe	Recall_Macro	f1_Subscribe	f1_Macro
Baseline Model	0.882	0.000	0.441	0.000	0.500	0.000	0.469
Logistic Regression Without Hyperparameter Tuning	0.899	0.626	0.771	0.331	0.652	0.433	0.689
Logistic Regression With Hyperparameter Tuning	0.899	0.626	0.771	0.331	0.653	0.433	0.689
k-Nearest Neighbor Scaled Without Hyperparameter Tuning	0.891	0.566	0.740	0.299	0.634	0.391	0.666
k-Nearest Neighbor Scaled With Hyperparameter Tuning	0.887	0.527	0.722	0.340	0.650	0.413	0.675
Naive Bayes Classifier	0.815	0.321	0.626	0.521	0.687	0.397	0.644
Naive Bayes, Oversampled	0.717	0.250	0.600	0.711	0.714	0.370	0.594

Oversampling and Logistic Regression¶

In [45]:

# LR model with oversampling
LR_over = LogisticRegression(C = 1, class_weight = None, dual = False, fit_intercept = True, 
                              intercept_scaling = 1, l1_ratio = None, max_iter = 100,
                              multi_class = 'warn', n_jobs = -1, penalty = 'l1', random_state = 42,
                              solver = 'warn', tol = 0.0001, verbose = 0, warm_start = False)
base_model = [LR_over]
n_splits = 5
df1 = train_and_predict(n_splits, base_model, X, y, 'Logistic Regression, Oversampled With Hyperparameter Tuning', 
                       oversampling = True)
df = df.append(df1)
df

Model- Logistic regression, oversampled with hyperparameter tuning and CV- 0
--------------------
Training Score: 0.824
Test Score: 0.823
Accuracy Score: 0.823
Precision Score - Subscribe: 0.378
Recall Score - Subscribe: 0.790
f1 Score - Subscribe: 0.511
Precision Score - Macro: 0.673
Recall Score - Macro: 0.809
f1 Score - Macro: 0.702


Model- Logistic regression, oversampled with hyperparameter tuning and CV- 1
--------------------
Training Score: 0.824
Test Score: 0.832
Accuracy Score: 0.832
Precision Score - Subscribe: 0.391
Recall Score - Subscribe: 0.774
f1 Score - Subscribe: 0.520
Precision Score - Macro: 0.678
Recall Score - Macro: 0.807
f1 Score - Macro: 0.709


Model- Logistic regression, oversampled with hyperparameter tuning and CV- 2
--------------------
Training Score: 0.823
Test Score: 0.831
Accuracy Score: 0.831
Precision Score - Subscribe: 0.390
Recall Score - Subscribe: 0.787
f1 Score - Subscribe: 0.521
Precision Score - Macro: 0.679
Recall Score - Macro: 0.812
f1 Score - Macro: 0.709


Model- Logistic regression, oversampled with hyperparameter tuning and CV- 3
--------------------
Training Score: 0.823
Test Score: 0.833
Accuracy Score: 0.833
Precision Score - Subscribe: 0.393
Recall Score - Subscribe: 0.786
f1 Score - Subscribe: 0.524
Precision Score - Macro: 0.680
Recall Score - Macro: 0.813
f1 Score - Macro: 0.711


Model- Logistic regression, oversampled with hyperparameter tuning and CV- 4
--------------------
Training Score: 0.823
Test Score: 0.823
Accuracy Score: 0.823
Precision Score - Subscribe: 0.376
Recall Score - Subscribe: 0.783
f1 Score - Subscribe: 0.508
Precision Score - Macro: 0.671
Recall Score - Macro: 0.806
f1 Score - Macro: 0.700

Out[45]:

	Accuracy	Precision_Subscribe	Precision_Macro	Recall_Subscribe	Recall_Macro	f1_Subscribe	f1_Macro
Baseline Model	0.882	0.000	0.441	0.000	0.500	0.000	0.469
Logistic Regression Without Hyperparameter Tuning	0.899	0.626	0.771	0.331	0.652	0.433	0.689
Logistic Regression With Hyperparameter Tuning	0.899	0.626	0.771	0.331	0.653	0.433	0.689
k-Nearest Neighbor Scaled Without Hyperparameter Tuning	0.891	0.566	0.740	0.299	0.634	0.391	0.666
k-Nearest Neighbor Scaled With Hyperparameter Tuning	0.887	0.527	0.722	0.340	0.650	0.413	0.675
Naive Bayes Classifier	0.815	0.321	0.626	0.521	0.687	0.397	0.644
Naive Bayes, Oversampled	0.717	0.250	0.600	0.711	0.714	0.370	0.594
Logistic Regression, Oversampled With Hyperparameter Tuning	0.828	0.386	0.676	0.784	0.809	0.517	0.706

Ensemble Techniques¶

Decision Tree Classifier, Bagging Classifier, AdaBoost Classifier, Gradient Boosting Classifier and Random Forest Classifier. Oversampling the ones with higher accuracy and better recall for subscribe.

Decision Tree Classifier¶

In [46]:

# Decision Tree Classifier
DT = DecisionTreeClassifier(random_state = random_state)
base_model = [DT]
n_splits = 5
df1 = train_and_predict(n_splits, base_model, X, y, 'Decision Tree Classifier')
df = df.append(df1)
df

Model- Decision tree classifier and CV- 0
--------------------
Training Score: 1.000
Test Score: 0.874
Accuracy Score: 0.874
Precision Score - Subscribe: 0.464
Recall Score - Subscribe: 0.483
f1 Score - Subscribe: 0.473
Precision Score - Macro: 0.698
Recall Score - Macro: 0.705
f1 Score - Macro: 0.701


Model- Decision tree classifier and CV- 1
--------------------
Training Score: 1.000
Test Score: 0.876
Accuracy Score: 0.876
Precision Score - Subscribe: 0.471
Recall Score - Subscribe: 0.495
f1 Score - Subscribe: 0.483
Precision Score - Macro: 0.702
Recall Score - Macro: 0.711
f1 Score - Macro: 0.706


Model- Decision tree classifier and CV- 2
--------------------
Training Score: 1.000
Test Score: 0.871
Accuracy Score: 0.871
Precision Score - Subscribe: 0.449
Recall Score - Subscribe: 0.460
f1 Score - Subscribe: 0.455
Precision Score - Macro: 0.689
Recall Score - Macro: 0.693
f1 Score - Macro: 0.691


Model- Decision tree classifier and CV- 3
--------------------
Training Score: 1.000
Test Score: 0.879
Accuracy Score: 0.879
Precision Score - Subscribe: 0.482
Recall Score - Subscribe: 0.484
f1 Score - Subscribe: 0.483
Precision Score - Macro: 0.707
Recall Score - Macro: 0.707
f1 Score - Macro: 0.707


Model- Decision tree classifier and CV- 4
--------------------
Training Score: 1.000
Test Score: 0.875
Accuracy Score: 0.875
Precision Score - Subscribe: 0.466
Recall Score - Subscribe: 0.467
f1 Score - Subscribe: 0.466
Precision Score - Macro: 0.698
Recall Score - Macro: 0.698
f1 Score - Macro: 0.698

Out[46]:

	Accuracy	Precision_Subscribe	Precision_Macro	Recall_Subscribe	Recall_Macro	f1_Subscribe	f1_Macro
Baseline Model	0.882	0.000	0.441	0.000	0.500	0.000	0.469
Logistic Regression Without Hyperparameter Tuning	0.899	0.626	0.771	0.331	0.652	0.433	0.689
Logistic Regression With Hyperparameter Tuning	0.899	0.626	0.771	0.331	0.653	0.433	0.689
k-Nearest Neighbor Scaled Without Hyperparameter Tuning	0.891	0.566	0.740	0.299	0.634	0.391	0.666
k-Nearest Neighbor Scaled With Hyperparameter Tuning	0.887	0.527	0.722	0.340	0.650	0.413	0.675
Naive Bayes Classifier	0.815	0.321	0.626	0.521	0.687	0.397	0.644
Naive Bayes, Oversampled	0.717	0.250	0.600	0.711	0.714	0.370	0.594
Logistic Regression, Oversampled With Hyperparameter Tuning	0.828	0.386	0.676	0.784	0.809	0.517	0.706
Decision Tree Classifier	0.875	0.466	0.699	0.478	0.703	0.472	0.701

In [47]:

# Decision Tree Classifier with hyperparameter tuning
dt_hyper = DecisionTreeClassifier(max_depth = 3, random_state = random_state)
base_model = [dt_hyper]
n_splits = 5
df1 = train_and_predict(n_splits, base_model, X, y, 'Decision Tree Classifier - Reducing Max Depth')
df = df.append(df1)
df

Model- Decision tree classifier - reducing max depth and CV- 0
--------------------
Training Score: 0.899
Test Score: 0.898
Accuracy Score: 0.898
Precision Score - Subscribe: 0.634
Recall Score - Subscribe: 0.300
f1 Score - Subscribe: 0.407
Precision Score - Macro: 0.774
Recall Score - Macro: 0.638
f1 Score - Macro: 0.676


Model- Decision tree classifier - reducing max depth and CV- 1
--------------------
Training Score: 0.899
Test Score: 0.901
Accuracy Score: 0.901
Precision Score - Subscribe: 0.698
Recall Score - Subscribe: 0.270
f1 Score - Subscribe: 0.390
Precision Score - Macro: 0.804
Recall Score - Macro: 0.627
f1 Score - Macro: 0.668


Model- Decision tree classifier - reducing max depth and CV- 2
--------------------
Training Score: 0.900
Test Score: 0.897
Accuracy Score: 0.897
Precision Score - Subscribe: 0.624
Recall Score - Subscribe: 0.292
f1 Score - Subscribe: 0.398
Precision Score - Macro: 0.768
Recall Score - Macro: 0.634
f1 Score - Macro: 0.671


Model- Decision tree classifier - reducing max depth and CV- 3
--------------------
Training Score: 0.899
Test Score: 0.901
Accuracy Score: 0.901
Precision Score - Subscribe: 0.666
Recall Score - Subscribe: 0.307
f1 Score - Subscribe: 0.420
Precision Score - Macro: 0.790
Recall Score - Macro: 0.643
f1 Score - Macro: 0.683


Model- Decision tree classifier - reducing max depth and CV- 4
--------------------
Training Score: 0.900
Test Score: 0.899
Accuracy Score: 0.899
Precision Score - Subscribe: 0.680
Recall Score - Subscribe: 0.255
f1 Score - Subscribe: 0.371
Precision Score - Macro: 0.795
Recall Score - Macro: 0.620
f1 Score - Macro: 0.658

Out[47]:

	Accuracy	Precision_Subscribe	Precision_Macro	Recall_Subscribe	Recall_Macro	f1_Subscribe	f1_Macro
Baseline Model	0.882	0.000	0.441	0.000	0.500	0.000	0.469
Logistic Regression Without Hyperparameter Tuning	0.899	0.626	0.771	0.331	0.652	0.433	0.689
Logistic Regression With Hyperparameter Tuning	0.899	0.626	0.771	0.331	0.653	0.433	0.689
k-Nearest Neighbor Scaled Without Hyperparameter Tuning	0.891	0.566	0.740	0.299	0.634	0.391	0.666
k-Nearest Neighbor Scaled With Hyperparameter Tuning	0.887	0.527	0.722	0.340	0.650	0.413	0.675
Naive Bayes Classifier	0.815	0.321	0.626	0.521	0.687	0.397	0.644
Naive Bayes, Oversampled	0.717	0.250	0.600	0.711	0.714	0.370	0.594
Logistic Regression, Oversampled With Hyperparameter Tuning	0.828	0.386	0.676	0.784	0.809	0.517	0.706
Decision Tree Classifier	0.875	0.466	0.699	0.478	0.703	0.472	0.701
Decision Tree Classifier - Reducing Max Depth	0.899	0.660	0.786	0.285	0.632	0.397	0.671

In [48]:

dt_hyper = DecisionTreeClassifier(max_depth = 3, random_state = random_state)
dt_hyper.fit(X, y)
decisiontree = open('decisiontree.dot','w')
dot_data = export_graphviz(dt_hyper, out_file = 'decisiontree.dot', feature_names = X.columns,
    class_names = ['No', 'Yes'], rounded = True, proportion = False, filled = True)
decisiontree.close()

retCode = system('dot -Tpng decisiontree.dot -o decisiontree.png')
if(retCode>0):
    print('system command returning error: '+str(retCode))
else:
    display(Image('decisiontree.png'))

In [49]:

print('Feature Importance for Decision Tree Classifier ', '--'*38)
feature_importances = pd.DataFrame(dt_hyper.feature_importances_, index = X.columns, 
                                   columns=['Importance']).sort_values('Importance', ascending = True)
feature_importances.sort_values(by = 'Importance', ascending = True).plot(kind = 'barh', figsize = (15, 7.2))

Feature Importance for Decision Tree Classifier  --------------------------------------------------------------------------------

Out[49]:

<matplotlib.axes._subplots.AxesSubplot at 0x2c2805a6e08>

Bagging, AdaBoost, Gradient Boosting Classifier¶

In [50]:

# Bagging Classifier
bgcl = BaggingClassifier(base_estimator = DecisionTreeClassifier(max_depth = 3, random_state = random_state), 
                         n_estimators = 50, random_state = random_state)
base_model = [bgcl]
n_splits = 5
df1 = train_and_predict(n_splits, base_model, X, y, 'Bagging Classifier')
df = df.append(df1)
df

Model- Bagging classifier and CV- 0
--------------------
Training Score: 0.900
Test Score: 0.898
Accuracy Score: 0.898
Precision Score - Subscribe: 0.648
Recall Score - Subscribe: 0.289
f1 Score - Subscribe: 0.400
Precision Score - Macro: 0.780
Recall Score - Macro: 0.634
f1 Score - Macro: 0.672


Model- Bagging classifier and CV- 1
--------------------
Training Score: 0.900
Test Score: 0.900
Accuracy Score: 0.900
Precision Score - Subscribe: 0.678
Recall Score - Subscribe: 0.284
f1 Score - Subscribe: 0.401
Precision Score - Macro: 0.795
Recall Score - Macro: 0.633
f1 Score - Macro: 0.673


Model- Bagging classifier and CV- 2
--------------------
Training Score: 0.900
Test Score: 0.897
Accuracy Score: 0.897
Precision Score - Subscribe: 0.629
Recall Score - Subscribe: 0.283
f1 Score - Subscribe: 0.390
Precision Score - Macro: 0.770
Recall Score - Macro: 0.630
f1 Score - Macro: 0.667


Model- Bagging classifier and CV- 3
--------------------
Training Score: 0.900
Test Score: 0.900
Accuracy Score: 0.900
Precision Score - Subscribe: 0.677
Recall Score - Subscribe: 0.284
f1 Score - Subscribe: 0.400
Precision Score - Macro: 0.795
Recall Score - Macro: 0.633
f1 Score - Macro: 0.673


Model- Bagging classifier and CV- 4
--------------------
Training Score: 0.900
Test Score: 0.899
Accuracy Score: 0.899
Precision Score - Subscribe: 0.668
Recall Score - Subscribe: 0.272
f1 Score - Subscribe: 0.387
Precision Score - Macro: 0.789
Recall Score - Macro: 0.627
f1 Score - Macro: 0.666

Out[50]:

	Accuracy	Precision_Subscribe	Precision_Macro	Recall_Subscribe	Recall_Macro	f1_Subscribe	f1_Macro
Baseline Model	0.882	0.000	0.441	0.000	0.500	0.000	0.469
Logistic Regression Without Hyperparameter Tuning	0.899	0.626	0.771	0.331	0.652	0.433	0.689
Logistic Regression With Hyperparameter Tuning	0.899	0.626	0.771	0.331	0.653	0.433	0.689
k-Nearest Neighbor Scaled Without Hyperparameter Tuning	0.891	0.566	0.740	0.299	0.634	0.391	0.666
k-Nearest Neighbor Scaled With Hyperparameter Tuning	0.887	0.527	0.722	0.340	0.650	0.413	0.675
Naive Bayes Classifier	0.815	0.321	0.626	0.521	0.687	0.397	0.644
Naive Bayes, Oversampled	0.717	0.250	0.600	0.711	0.714	0.370	0.594
Logistic Regression, Oversampled With Hyperparameter Tuning	0.828	0.386	0.676	0.784	0.809	0.517	0.706
Decision Tree Classifier	0.875	0.466	0.699	0.478	0.703	0.472	0.701
Decision Tree Classifier - Reducing Max Depth	0.899	0.660	0.786	0.285	0.632	0.397	0.671
Bagging Classifier	0.899	0.660	0.786	0.282	0.631	0.396	0.670

In [51]:

# AdaBoost Classifier
abcl = AdaBoostClassifier(n_estimators = 10, random_state = random_state)
base_model = [abcl]
n_splits = 5
df1 = train_and_predict(n_splits, base_model, X, y, 'AdaBoost Classifier')
df = df.append(df1)
df

Model- Adaboost classifier and CV- 0
--------------------
Training Score: 0.891
Test Score: 0.889
Accuracy Score: 0.889
Precision Score - Subscribe: 0.536
Recall Score - Subscribe: 0.350
f1 Score - Subscribe: 0.423
Precision Score - Macro: 0.727
Recall Score - Macro: 0.655
f1 Score - Macro: 0.681


Model- Adaboost classifier and CV- 1
--------------------
Training Score: 0.896
Test Score: 0.897
Accuracy Score: 0.897
Precision Score - Subscribe: 0.609
Recall Score - Subscribe: 0.335
f1 Score - Subscribe: 0.432
Precision Score - Macro: 0.763
Recall Score - Macro: 0.653
f1 Score - Macro: 0.688


Model- Adaboost classifier and CV- 2
--------------------
Training Score: 0.890
Test Score: 0.889
Accuracy Score: 0.889
Precision Score - Subscribe: 0.534
Recall Score - Subscribe: 0.391
f1 Score - Subscribe: 0.451
Precision Score - Macro: 0.728
Recall Score - Macro: 0.673
f1 Score - Macro: 0.695


Model- Adaboost classifier and CV- 3
--------------------
Training Score: 0.890
Test Score: 0.889
Accuracy Score: 0.889
Precision Score - Subscribe: 0.534
Recall Score - Subscribe: 0.390
f1 Score - Subscribe: 0.451
Precision Score - Macro: 0.728
Recall Score - Macro: 0.673
f1 Score - Macro: 0.694


Model- Adaboost classifier and CV- 4
--------------------
Training Score: 0.891
Test Score: 0.892
Accuracy Score: 0.892
Precision Score - Subscribe: 0.556
Recall Score - Subscribe: 0.379
f1 Score - Subscribe: 0.451
Precision Score - Macro: 0.739
Recall Score - Macro: 0.670
f1 Score - Macro: 0.696

Out[51]:

	Accuracy	Precision_Subscribe	Precision_Macro	Recall_Subscribe	Recall_Macro	f1_Subscribe	f1_Macro
Baseline Model	0.882	0.000	0.441	0.000	0.500	0.000	0.469
Logistic Regression Without Hyperparameter Tuning	0.899	0.626	0.771	0.331	0.652	0.433	0.689
Logistic Regression With Hyperparameter Tuning	0.899	0.626	0.771	0.331	0.653	0.433	0.689
k-Nearest Neighbor Scaled Without Hyperparameter Tuning	0.891	0.566	0.740	0.299	0.634	0.391	0.666
k-Nearest Neighbor Scaled With Hyperparameter Tuning	0.887	0.527	0.722	0.340	0.650	0.413	0.675
Naive Bayes Classifier	0.815	0.321	0.626	0.521	0.687	0.397	0.644
Naive Bayes, Oversampled	0.717	0.250	0.600	0.711	0.714	0.370	0.594
Logistic Regression, Oversampled With Hyperparameter Tuning	0.828	0.386	0.676	0.784	0.809	0.517	0.706
Decision Tree Classifier	0.875	0.466	0.699	0.478	0.703	0.472	0.701
Decision Tree Classifier - Reducing Max Depth	0.899	0.660	0.786	0.285	0.632	0.397	0.671
Bagging Classifier	0.899	0.660	0.786	0.282	0.631	0.396	0.670
AdaBoost Classifier	0.891	0.554	0.737	0.369	0.665	0.442	0.691

In [52]:

# Gradient Boosting Classifier
gbcl = GradientBoostingClassifier(n_estimators = 50, random_state = random_state)
base_model = [gbcl]
n_splits = 5
df1 = train_and_predict(n_splits, base_model, X, y, 'Gradient Boosting Classifier')
df = df.append(df1)
df

Model- Gradient boosting classifier and CV- 0
--------------------
Training Score: 0.905
Test Score: 0.899
Accuracy Score: 0.899
Precision Score - Subscribe: 0.636
Recall Score - Subscribe: 0.325
f1 Score - Subscribe: 0.430
Precision Score - Macro: 0.776
Recall Score - Macro: 0.650
f1 Score - Macro: 0.688


Model- Gradient boosting classifier and CV- 1
--------------------
Training Score: 0.904
Test Score: 0.902
Accuracy Score: 0.902
Precision Score - Subscribe: 0.667
Recall Score - Subscribe: 0.319
f1 Score - Subscribe: 0.431
Precision Score - Macro: 0.791
Recall Score - Macro: 0.649
f1 Score - Macro: 0.689


Model- Gradient boosting classifier and CV- 2
--------------------
Training Score: 0.904
Test Score: 0.901
Accuracy Score: 0.901
Precision Score - Subscribe: 0.668
Recall Score - Subscribe: 0.312
f1 Score - Subscribe: 0.425
Precision Score - Macro: 0.791
Recall Score - Macro: 0.646
f1 Score - Macro: 0.686


Model- Gradient boosting classifier and CV- 3
--------------------
Training Score: 0.903
Test Score: 0.904
Accuracy Score: 0.904
Precision Score - Subscribe: 0.698
Recall Score - Subscribe: 0.317
f1 Score - Subscribe: 0.436
Precision Score - Macro: 0.807
Recall Score - Macro: 0.649
f1 Score - Macro: 0.692


Model- Gradient boosting classifier and CV- 4
--------------------
Training Score: 0.905
Test Score: 0.903
Accuracy Score: 0.903
Precision Score - Subscribe: 0.678
Recall Score - Subscribe: 0.321
f1 Score - Subscribe: 0.435
Precision Score - Macro: 0.797
Recall Score - Macro: 0.650
f1 Score - Macro: 0.691

Out[52]:

	Accuracy	Precision_Subscribe	Precision_Macro	Recall_Subscribe	Recall_Macro	f1_Subscribe	f1_Macro
Baseline Model	0.882	0.000	0.441	0.000	0.500	0.000	0.469
Logistic Regression Without Hyperparameter Tuning	0.899	0.626	0.771	0.331	0.652	0.433	0.689
Logistic Regression With Hyperparameter Tuning	0.899	0.626	0.771	0.331	0.653	0.433	0.689
k-Nearest Neighbor Scaled Without Hyperparameter Tuning	0.891	0.566	0.740	0.299	0.634	0.391	0.666
k-Nearest Neighbor Scaled With Hyperparameter Tuning	0.887	0.527	0.722	0.340	0.650	0.413	0.675
Naive Bayes Classifier	0.815	0.321	0.626	0.521	0.687	0.397	0.644
Naive Bayes, Oversampled	0.717	0.250	0.600	0.711	0.714	0.370	0.594
Logistic Regression, Oversampled With Hyperparameter Tuning	0.828	0.386	0.676	0.784	0.809	0.517	0.706
Decision Tree Classifier	0.875	0.466	0.699	0.478	0.703	0.472	0.701
Decision Tree Classifier - Reducing Max Depth	0.899	0.660	0.786	0.285	0.632	0.397	0.671
Bagging Classifier	0.899	0.660	0.786	0.282	0.631	0.396	0.670
AdaBoost Classifier	0.891	0.554	0.737	0.369	0.665	0.442	0.691
Gradient Boosting Classifier	0.902	0.669	0.792	0.319	0.649	0.431	0.689

Oversampling and AdaBoost Classifier¶

In [53]:

abcl_over = AdaBoostClassifier(n_estimators = 15, random_state = random_state, learning_rate = 0.3)
base_model = [abcl_over]
n_splits = 5
df1 = train_and_predict(n_splits, base_model, X, y, 'AdaBoost Classifier, Oversampled', oversampling = True)
df = df.append(df1)
df

Model- Adaboost classifier, oversampled and CV- 0
--------------------
Training Score: 0.838
Test Score: 0.793
Accuracy Score: 0.793
Precision Score - Subscribe: 0.335
Recall Score - Subscribe: 0.783
f1 Score - Subscribe: 0.470
Precision Score - Macro: 0.650
Recall Score - Macro: 0.789
f1 Score - Macro: 0.671


Model- Adaboost classifier, oversampled and CV- 1
--------------------
Training Score: 0.824
Test Score: 0.820
Accuracy Score: 0.820
Precision Score - Subscribe: 0.368
Recall Score - Subscribe: 0.752
f1 Score - Subscribe: 0.494
Precision Score - Macro: 0.665
Recall Score - Macro: 0.790
f1 Score - Macro: 0.692


Model- Adaboost classifier, oversampled and CV- 2
--------------------
Training Score: 0.838
Test Score: 0.805
Accuracy Score: 0.805
Precision Score - Subscribe: 0.353
Recall Score - Subscribe: 0.806
f1 Score - Subscribe: 0.491
Precision Score - Macro: 0.661
Recall Score - Macro: 0.805
f1 Score - Macro: 0.685


Model- Adaboost classifier, oversampled and CV- 3
--------------------
Training Score: 0.837
Test Score: 0.823
Accuracy Score: 0.823
Precision Score - Subscribe: 0.375
Recall Score - Subscribe: 0.769
f1 Score - Subscribe: 0.504
Precision Score - Macro: 0.670
Recall Score - Macro: 0.800
f1 Score - Macro: 0.698


Model- Adaboost classifier, oversampled and CV- 4
--------------------
Training Score: 0.845
Test Score: 0.812
Accuracy Score: 0.812
Precision Score - Subscribe: 0.363
Recall Score - Subscribe: 0.805
f1 Score - Subscribe: 0.500
Precision Score - Macro: 0.666
Recall Score - Macro: 0.809
f1 Score - Macro: 0.692

Out[53]:

	Accuracy	Precision_Subscribe	Precision_Macro	Recall_Subscribe	Recall_Macro	f1_Subscribe	f1_Macro
Baseline Model	0.882	0.000	0.441	0.000	0.500	0.000	0.469
Logistic Regression Without Hyperparameter Tuning	0.899	0.626	0.771	0.331	0.652	0.433	0.689
Logistic Regression With Hyperparameter Tuning	0.899	0.626	0.771	0.331	0.653	0.433	0.689
k-Nearest Neighbor Scaled Without Hyperparameter Tuning	0.891	0.566	0.740	0.299	0.634	0.391	0.666
k-Nearest Neighbor Scaled With Hyperparameter Tuning	0.887	0.527	0.722	0.340	0.650	0.413	0.675
Naive Bayes Classifier	0.815	0.321	0.626	0.521	0.687	0.397	0.644
Naive Bayes, Oversampled	0.717	0.250	0.600	0.711	0.714	0.370	0.594
Logistic Regression, Oversampled With Hyperparameter Tuning	0.828	0.386	0.676	0.784	0.809	0.517	0.706
Decision Tree Classifier	0.875	0.466	0.699	0.478	0.703	0.472	0.701
Decision Tree Classifier - Reducing Max Depth	0.899	0.660	0.786	0.285	0.632	0.397	0.671
Bagging Classifier	0.899	0.660	0.786	0.282	0.631	0.396	0.670
AdaBoost Classifier	0.891	0.554	0.737	0.369	0.665	0.442	0.691
Gradient Boosting Classifier	0.902	0.669	0.792	0.319	0.649	0.431	0.689
AdaBoost Classifier, Oversampled	0.811	0.359	0.662	0.783	0.799	0.492	0.688

Random Forest Classifier¶

In [54]:

# Random Forest Classifier
rfc = RandomForestClassifier(n_jobs = -1, random_state = random_state)
base_model = [rfc]
n_splits = 5
df1 = train_and_predict(n_splits, base_model, X, y, 'Random Forest Classifier')
df = df.append(df1)
df

Model- Random forest classifier and CV- 0
--------------------
Training Score: 0.991
Test Score: 0.898
Accuracy Score: 0.898
Precision Score - Subscribe: 0.623
Recall Score - Subscribe: 0.323
f1 Score - Subscribe: 0.426
Precision Score - Macro: 0.769
Recall Score - Macro: 0.649
f1 Score - Macro: 0.685


Model- Random forest classifier and CV- 1
--------------------
Training Score: 0.992
Test Score: 0.896
Accuracy Score: 0.896
Precision Score - Subscribe: 0.604
Recall Score - Subscribe: 0.313
f1 Score - Subscribe: 0.412
Precision Score - Macro: 0.759
Recall Score - Macro: 0.643
f1 Score - Macro: 0.677


Model- Random forest classifier and CV- 2
--------------------
Training Score: 0.991
Test Score: 0.894
Accuracy Score: 0.894
Precision Score - Subscribe: 0.600
Recall Score - Subscribe: 0.292
f1 Score - Subscribe: 0.393
Precision Score - Macro: 0.756
Recall Score - Macro: 0.633
f1 Score - Macro: 0.668


Model- Random forest classifier and CV- 3
--------------------
Training Score: 0.991
Test Score: 0.900
Accuracy Score: 0.900
Precision Score - Subscribe: 0.648
Recall Score - Subscribe: 0.325
f1 Score - Subscribe: 0.433
Precision Score - Macro: 0.782
Recall Score - Macro: 0.651
f1 Score - Macro: 0.689


Model- Random forest classifier and CV- 4
--------------------
Training Score: 0.992
Test Score: 0.896
Accuracy Score: 0.896
Precision Score - Subscribe: 0.622
Recall Score - Subscribe: 0.289
f1 Score - Subscribe: 0.394
Precision Score - Macro: 0.767
Recall Score - Macro: 0.633
f1 Score - Macro: 0.669

Out[54]:

	Accuracy	Precision_Subscribe	Precision_Macro	Recall_Subscribe	Recall_Macro	f1_Subscribe	f1_Macro
Baseline Model	0.882	0.000	0.441	0.000	0.500	0.000	0.469
Logistic Regression Without Hyperparameter Tuning	0.899	0.626	0.771	0.331	0.652	0.433	0.689
Logistic Regression With Hyperparameter Tuning	0.899	0.626	0.771	0.331	0.653	0.433	0.689
k-Nearest Neighbor Scaled Without Hyperparameter Tuning	0.891	0.566	0.740	0.299	0.634	0.391	0.666
k-Nearest Neighbor Scaled With Hyperparameter Tuning	0.887	0.527	0.722	0.340	0.650	0.413	0.675
Naive Bayes Classifier	0.815	0.321	0.626	0.521	0.687	0.397	0.644
Naive Bayes, Oversampled	0.717	0.250	0.600	0.711	0.714	0.370	0.594
Logistic Regression, Oversampled With Hyperparameter Tuning	0.828	0.386	0.676	0.784	0.809	0.517	0.706
Decision Tree Classifier	0.875	0.466	0.699	0.478	0.703	0.472	0.701
Decision Tree Classifier - Reducing Max Depth	0.899	0.660	0.786	0.285	0.632	0.397	0.671
Bagging Classifier	0.899	0.660	0.786	0.282	0.631	0.396	0.670
AdaBoost Classifier	0.891	0.554	0.737	0.369	0.665	0.442	0.691
Gradient Boosting Classifier	0.902	0.669	0.792	0.319	0.649	0.431	0.689
AdaBoost Classifier, Oversampled	0.811	0.359	0.662	0.783	0.799	0.492	0.688
Random Forest Classifier	0.897	0.619	0.767	0.308	0.642	0.412	0.678

In [55]:

# Random Forest Classifier with hyperparameter tuning
rfc = RandomForestClassifier(n_jobs = -1, random_state = random_state)
params = {'n_estimators' : [10, 20, 30, 50, 75, 100], 'max_depth': [1, 2, 3, 5, 7, 10]}

scoring = {'Recall': make_scorer(recall_score), 'f1_score': make_scorer(f1_score)}

skf = StratifiedKFold(n_splits = 3, shuffle = True, random_state = random_state)

rfc_grid = GridSearchCV(rfc, param_grid = params, n_jobs = -1, cv = skf, scoring = scoring, refit = 'f1_score')
rfc_grid.fit(X, y)

print(rfc_grid.best_estimator_)
print(rfc_grid.best_params_)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=10, max_features='auto', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=20, n_jobs=-1,
                       oob_score=False, random_state=42, verbose=0,
                       warm_start=False)
{'max_depth': 10, 'n_estimators': 20}

In [56]:

# Random Forest Classifier with hyperparameter tuning
rfc_hyper = RandomForestClassifier(bootstrap = True, class_weight = None, criterion = 'gini', max_depth = 10, 
                                   max_features = 'auto', max_leaf_nodes = None, min_impurity_decrease = 0.0,
                                   min_impurity_split = None, min_samples_leaf = 1, min_samples_split = 2,
                                   min_weight_fraction_leaf = 0.0, n_estimators = 20, n_jobs = -1, 
                                   oob_score = False, random_state = 42, verbose = 0, warm_start = False)
base_model = [rfc_hyper]
n_splits = 5
df1 = train_and_predict(n_splits, base_model, X, y, 'Random Forest Classifier With Hyperparameter Tuning')
df = df.append(df1)
df

Model- Random forest classifier with hyperparameter tuning and CV- 0
--------------------
Training Score: 0.912
Test Score: 0.897
Accuracy Score: 0.897
Precision Score - Subscribe: 0.687
Recall Score - Subscribe: 0.220
f1 Score - Subscribe: 0.334
Precision Score - Macro: 0.796
Recall Score - Macro: 0.603
f1 Score - Macro: 0.639


Model- Random forest classifier with hyperparameter tuning and CV- 1
--------------------
Training Score: 0.911
Test Score: 0.898
Accuracy Score: 0.898
Precision Score - Subscribe: 0.733
Recall Score - Subscribe: 0.205
f1 Score - Subscribe: 0.321
Precision Score - Macro: 0.818
Recall Score - Macro: 0.598
f1 Score - Macro: 0.633


Model- Random forest classifier with hyperparameter tuning and CV- 2
--------------------
Training Score: 0.912
Test Score: 0.895
Accuracy Score: 0.895
Precision Score - Subscribe: 0.693
Recall Score - Subscribe: 0.192
f1 Score - Subscribe: 0.301
Precision Score - Macro: 0.798
Recall Score - Macro: 0.590
f1 Score - Macro: 0.622


Model- Random forest classifier with hyperparameter tuning and CV- 3
--------------------
Training Score: 0.911
Test Score: 0.902
Accuracy Score: 0.902
Precision Score - Subscribe: 0.755
Recall Score - Subscribe: 0.236
f1 Score - Subscribe: 0.360
Precision Score - Macro: 0.831
Recall Score - Macro: 0.613
f1 Score - Macro: 0.653


Model- Random forest classifier with hyperparameter tuning and CV- 4
--------------------
Training Score: 0.911
Test Score: 0.899
Accuracy Score: 0.899
Precision Score - Subscribe: 0.739
Recall Score - Subscribe: 0.215
f1 Score - Subscribe: 0.333
Precision Score - Macro: 0.822
Recall Score - Macro: 0.602
f1 Score - Macro: 0.639

Out[56]:

	Accuracy	Precision_Subscribe	Precision_Macro	Recall_Subscribe	Recall_Macro	f1_Subscribe	f1_Macro
Baseline Model	0.882	0.000	0.441	0.000	0.500	0.000	0.469
Logistic Regression Without Hyperparameter Tuning	0.899	0.626	0.771	0.331	0.652	0.433	0.689
Logistic Regression With Hyperparameter Tuning	0.899	0.626	0.771	0.331	0.653	0.433	0.689
k-Nearest Neighbor Scaled Without Hyperparameter Tuning	0.891	0.566	0.740	0.299	0.634	0.391	0.666
k-Nearest Neighbor Scaled With Hyperparameter Tuning	0.887	0.527	0.722	0.340	0.650	0.413	0.675
Naive Bayes Classifier	0.815	0.321	0.626	0.521	0.687	0.397	0.644
Naive Bayes, Oversampled	0.717	0.250	0.600	0.711	0.714	0.370	0.594
Logistic Regression, Oversampled With Hyperparameter Tuning	0.828	0.386	0.676	0.784	0.809	0.517	0.706
Decision Tree Classifier	0.875	0.466	0.699	0.478	0.703	0.472	0.701
Decision Tree Classifier - Reducing Max Depth	0.899	0.660	0.786	0.285	0.632	0.397	0.671
Bagging Classifier	0.899	0.660	0.786	0.282	0.631	0.396	0.670
AdaBoost Classifier	0.891	0.554	0.737	0.369	0.665	0.442	0.691
Gradient Boosting Classifier	0.902	0.669	0.792	0.319	0.649	0.431	0.689
AdaBoost Classifier, Oversampled	0.811	0.359	0.662	0.783	0.799	0.492	0.688
Random Forest Classifier	0.897	0.619	0.767	0.308	0.642	0.412	0.678
Random Forest Classifier With Hyperparameter Tuning	0.898	0.721	0.813	0.214	0.601	0.330	0.637

Oversampling and Random Forest Classifier¶

In [57]:

# Random Forest Classifier with hyperparameter tuning, Oversampled
rfc_over = RandomForestClassifier(bootstrap = True, class_weight = None, criterion = 'gini', max_depth = 10, 
                                   max_features = 'auto', max_leaf_nodes = None, min_impurity_decrease = 0.0,
                                   min_impurity_split = None, min_samples_leaf = 1, min_samples_split = 2,
                                   min_weight_fraction_leaf = 0.0, n_estimators = 20, n_jobs = -1, 
                                   oob_score = False, random_state = 42, verbose = 0, warm_start = False)
base_model = [rfc_over]
n_splits = 5
df1 = train_and_predict(n_splits, base_model, X, y, 
                        'Random Forest Classifier, Oversampled With Hyperparameter Tuning',
                        oversampling = True)
df = df.append(df1)
df

Model- Random forest classifier, oversampled with hyperparameter tuning and CV- 0
--------------------
Training Score: 0.913
Test Score: 0.854
Accuracy Score: 0.854
Precision Score - Subscribe: 0.429
Recall Score - Subscribe: 0.750
f1 Score - Subscribe: 0.546
Precision Score - Macro: 0.696
Recall Score - Macro: 0.809
f1 Score - Macro: 0.729


Model- Random forest classifier, oversampled with hyperparameter tuning and CV- 1
--------------------
Training Score: 0.916
Test Score: 0.857
Accuracy Score: 0.857
Precision Score - Subscribe: 0.433
Recall Score - Subscribe: 0.720
f1 Score - Subscribe: 0.541
Precision Score - Macro: 0.696
Recall Score - Macro: 0.798
f1 Score - Macro: 0.728


Model- Random forest classifier, oversampled with hyperparameter tuning and CV- 2
--------------------
Training Score: 0.915
Test Score: 0.853
Accuracy Score: 0.853
Precision Score - Subscribe: 0.425
Recall Score - Subscribe: 0.720
f1 Score - Subscribe: 0.534
Precision Score - Macro: 0.692
Recall Score - Macro: 0.795
f1 Score - Macro: 0.724


Model- Random forest classifier, oversampled with hyperparameter tuning and CV- 3
--------------------
Training Score: 0.917
Test Score: 0.864
Accuracy Score: 0.864
Precision Score - Subscribe: 0.450
Recall Score - Subscribe: 0.743
f1 Score - Subscribe: 0.560
Precision Score - Macro: 0.706
Recall Score - Macro: 0.811
f1 Score - Macro: 0.740


Model- Random forest classifier, oversampled with hyperparameter tuning and CV- 4
--------------------
Training Score: 0.914
Test Score: 0.857
Accuracy Score: 0.857
Precision Score - Subscribe: 0.435
Recall Score - Subscribe: 0.744
f1 Score - Subscribe: 0.549
Precision Score - Macro: 0.699
Recall Score - Macro: 0.808
f1 Score - Macro: 0.732

Out[57]:

	Accuracy	Precision_Subscribe	Precision_Macro	Recall_Subscribe	Recall_Macro	f1_Subscribe	f1_Macro
Baseline Model	0.882	0.000	0.441	0.000	0.500	0.000	0.469
Logistic Regression Without Hyperparameter Tuning	0.899	0.626	0.771	0.331	0.652	0.433	0.689
Logistic Regression With Hyperparameter Tuning	0.899	0.626	0.771	0.331	0.653	0.433	0.689
k-Nearest Neighbor Scaled Without Hyperparameter Tuning	0.891	0.566	0.740	0.299	0.634	0.391	0.666
k-Nearest Neighbor Scaled With Hyperparameter Tuning	0.887	0.527	0.722	0.340	0.650	0.413	0.675
Naive Bayes Classifier	0.815	0.321	0.626	0.521	0.687	0.397	0.644
Naive Bayes, Oversampled	0.717	0.250	0.600	0.711	0.714	0.370	0.594
Logistic Regression, Oversampled With Hyperparameter Tuning	0.828	0.386	0.676	0.784	0.809	0.517	0.706
Decision Tree Classifier	0.875	0.466	0.699	0.478	0.703	0.472	0.701
Decision Tree Classifier - Reducing Max Depth	0.899	0.660	0.786	0.285	0.632	0.397	0.671
Bagging Classifier	0.899	0.660	0.786	0.282	0.631	0.396	0.670
AdaBoost Classifier	0.891	0.554	0.737	0.369	0.665	0.442	0.691
Gradient Boosting Classifier	0.902	0.669	0.792	0.319	0.649	0.431	0.689
AdaBoost Classifier, Oversampled	0.811	0.359	0.662	0.783	0.799	0.492	0.688
Random Forest Classifier	0.897	0.619	0.767	0.308	0.642	0.412	0.678
Random Forest Classifier With Hyperparameter Tuning	0.898	0.721	0.813	0.214	0.601	0.330	0.637
Random Forest Classifier, Oversampled With Hyperparameter Tuning	0.857	0.434	0.698	0.735	0.804	0.546	0.731

In [58]:

# Random Forest Classifier with hyperparameter tuning, Oversampled -- Reducing Max Depth
rfc_over = RandomForestClassifier(bootstrap = True, class_weight = None, criterion = 'gini', max_depth = 3, 
                                   max_features = 'auto', max_leaf_nodes = None, min_impurity_decrease = 0.0,
                                   min_impurity_split = None, min_samples_leaf = 1, min_samples_split = 2,
                                   min_weight_fraction_leaf = 0.0, n_estimators = 50, n_jobs = -1, 
                                   oob_score = False, random_state = 42, verbose = 0, warm_start = False)
base_model = [rfc_over]
n_splits = 5
df1 = train_and_predict(n_splits, base_model, X, y, 
                        'Random Forest Classifier, Oversampled With Hyperparameter Tuning - Reducing Max Depth',
                        oversampling = True)
df = df.append(df1)
df

Model- Random forest classifier, oversampled with hyperparameter tuning - reducing max depth and CV- 0
--------------------
Training Score: 0.850
Test Score: 0.804
Accuracy Score: 0.804
Precision Score - Subscribe: 0.347
Recall Score - Subscribe: 0.766
f1 Score - Subscribe: 0.478
Precision Score - Macro: 0.655
Recall Score - Macro: 0.787
f1 Score - Macro: 0.679


Model- Random forest classifier, oversampled with hyperparameter tuning - reducing max depth and CV- 1
--------------------
Training Score: 0.842
Test Score: 0.819
Accuracy Score: 0.819
Precision Score - Subscribe: 0.371
Recall Score - Subscribe: 0.790
f1 Score - Subscribe: 0.505
Precision Score - Macro: 0.669
Recall Score - Macro: 0.806
f1 Score - Macro: 0.697


Model- Random forest classifier, oversampled with hyperparameter tuning - reducing max depth and CV- 2
--------------------
Training Score: 0.842
Test Score: 0.812
Accuracy Score: 0.812
Precision Score - Subscribe: 0.359
Recall Score - Subscribe: 0.773
f1 Score - Subscribe: 0.490
Precision Score - Macro: 0.662
Recall Score - Macro: 0.795
f1 Score - Macro: 0.687


Model- Random forest classifier, oversampled with hyperparameter tuning - reducing max depth and CV- 3
--------------------
Training Score: 0.842
Test Score: 0.810
Accuracy Score: 0.810
Precision Score - Subscribe: 0.356
Recall Score - Subscribe: 0.771
f1 Score - Subscribe: 0.487
Precision Score - Macro: 0.660
Recall Score - Macro: 0.793
f1 Score - Macro: 0.685


Model- Random forest classifier, oversampled with hyperparameter tuning - reducing max depth and CV- 4
--------------------
Training Score: 0.848
Test Score: 0.813
Accuracy Score: 0.813
Precision Score - Subscribe: 0.364
Recall Score - Subscribe: 0.801
f1 Score - Subscribe: 0.501
Precision Score - Macro: 0.667
Recall Score - Macro: 0.808
f1 Score - Macro: 0.693

Out[58]:

	Accuracy	Precision_Subscribe	Precision_Macro	Recall_Subscribe	Recall_Macro	f1_Subscribe	f1_Macro
Baseline Model	0.882	0.000	0.441	0.000	0.500	0.000	0.469
Logistic Regression Without Hyperparameter Tuning	0.899	0.626	0.771	0.331	0.652	0.433	0.689
Logistic Regression With Hyperparameter Tuning	0.899	0.626	0.771	0.331	0.653	0.433	0.689
k-Nearest Neighbor Scaled Without Hyperparameter Tuning	0.891	0.566	0.740	0.299	0.634	0.391	0.666
k-Nearest Neighbor Scaled With Hyperparameter Tuning	0.887	0.527	0.722	0.340	0.650	0.413	0.675
Naive Bayes Classifier	0.815	0.321	0.626	0.521	0.687	0.397	0.644
Naive Bayes, Oversampled	0.717	0.250	0.600	0.711	0.714	0.370	0.594
Logistic Regression, Oversampled With Hyperparameter Tuning	0.828	0.386	0.676	0.784	0.809	0.517	0.706
Decision Tree Classifier	0.875	0.466	0.699	0.478	0.703	0.472	0.701
Decision Tree Classifier - Reducing Max Depth	0.899	0.660	0.786	0.285	0.632	0.397	0.671
Bagging Classifier	0.899	0.660	0.786	0.282	0.631	0.396	0.670
AdaBoost Classifier	0.891	0.554	0.737	0.369	0.665	0.442	0.691
Gradient Boosting Classifier	0.902	0.669	0.792	0.319	0.649	0.431	0.689
AdaBoost Classifier, Oversampled	0.811	0.359	0.662	0.783	0.799	0.492	0.688
Random Forest Classifier	0.897	0.619	0.767	0.308	0.642	0.412	0.678
Random Forest Classifier With Hyperparameter Tuning	0.898	0.721	0.813	0.214	0.601	0.330	0.637
Random Forest Classifier, Oversampled With Hyperparameter Tuning	0.857	0.434	0.698	0.735	0.804	0.546	0.731
Random Forest Classifier, Oversampled With Hyperparameter Tuning - Reducing Max Depth	0.812	0.359	0.663	0.780	0.798	0.492	0.688

In [59]:

rfc_over = RandomForestClassifier(bootstrap = True, class_weight = None, criterion = 'gini', max_depth = 3, 
                                   max_features = 'auto', max_leaf_nodes = None, min_impurity_decrease = 0.0,
                                   min_impurity_split = None, min_samples_leaf = 1, min_samples_split = 2,
                                   min_weight_fraction_leaf = 0.0, n_estimators = 50, n_jobs = -1, 
                                   oob_score = False, random_state = 42, verbose = 0, warm_start = False)
rfc_over.fit(X, y)

random_forest_tree = open('random_forest.dot','w')
dot_data = export_graphviz(rfc_over.estimators_[0], out_file = random_forest_tree, feature_names = list(X_train), class_names = ['No', 'Yes'], rounded = True, proportion = False, filled = True)
random_forest_tree.close()

retCode = system("dot -Tpng random_forest.dot -o random_forest.png")
if(retCode>0):
    print("system command returning error: "+str(retCode))
else:
    display(Image("random_forest.png"))

In [60]:

print('Feature Importance for Random Forest Classifier ', '--'*38)
feature_importances = pd.DataFrame(rfc_over.feature_importances_, index = X.columns, 
                                   columns=['Importance']).sort_values('Importance', ascending = True)
feature_importances.sort_values(by = 'Importance', ascending = True).plot(kind = 'barh', figsize = (15, 7.2))

Feature Importance for Random Forest Classifier  ----------------------------------------------------------------------------

Out[60]:

<matplotlib.axes._subplots.AxesSubplot at 0x2c2814f1cc8>

Comparing model results¶

In [61]:

print('Conditional Formatting on the scores dataframe ', '--'*39)
display(df.style.background_gradient(cmap = sns.light_palette('green', as_cmap = True)))

Conditional Formatting on the scores dataframe  ------------------------------------------------------------------------------

	Accuracy	Precision_Subscribe	Precision_Macro	Recall_Subscribe	Recall_Macro	f1_Subscribe	f1_Macro
Baseline Model	0.882	0	0.441	0	0.5	0	0.469
Logistic Regression Without Hyperparameter Tuning	0.899	0.626	0.771	0.331	0.652	0.433	0.689
Logistic Regression With Hyperparameter Tuning	0.899	0.626	0.771	0.331	0.653	0.433	0.689
k-Nearest Neighbor Scaled Without Hyperparameter Tuning	0.891	0.566	0.74	0.299	0.634	0.391	0.666
k-Nearest Neighbor Scaled With Hyperparameter Tuning	0.887	0.527	0.722	0.34	0.65	0.413	0.675
Naive Bayes Classifier	0.815	0.321	0.626	0.521	0.687	0.397	0.644
Naive Bayes, Oversampled	0.717	0.25	0.6	0.711	0.714	0.37	0.594
Logistic Regression, Oversampled With Hyperparameter Tuning	0.828	0.386	0.676	0.784	0.809	0.517	0.706
Decision Tree Classifier	0.875	0.466	0.699	0.478	0.703	0.472	0.701
Decision Tree Classifier - Reducing Max Depth	0.899	0.66	0.786	0.285	0.632	0.397	0.671
Bagging Classifier	0.899	0.66	0.786	0.282	0.631	0.396	0.67
AdaBoost Classifier	0.891	0.554	0.737	0.369	0.665	0.442	0.691
Gradient Boosting Classifier	0.902	0.669	0.792	0.319	0.649	0.431	0.689
AdaBoost Classifier, Oversampled	0.811	0.359	0.662	0.783	0.799	0.492	0.688
Random Forest Classifier	0.897	0.619	0.767	0.308	0.642	0.412	0.678
Random Forest Classifier With Hyperparameter Tuning	0.898	0.721	0.813	0.214	0.601	0.33	0.637
Random Forest Classifier, Oversampled With Hyperparameter Tuning	0.857	0.434	0.698	0.735	0.804	0.546	0.731
Random Forest Classifier, Oversampled With Hyperparameter Tuning - Reducing Max Depth	0.812	0.359	0.663	0.78	0.798	0.492	0.688

In [62]:

for i, types in enumerate(df.columns):
    temp = df[types]
    plt.figure(i, figsize = (15, 7.2))
    temp.sort_values(ascending = True).plot(kind = 'barh')
    plt.title(f'{types.capitalize()} Scores')
    plt.show()

Conclusion¶

The classification goal is to predict if the client will subscribe (yes/no) a term deposit.

Most of the ML models works best when the number of classes are in equal proportion since they are designed to maximize accuracy and reduce error. Thus, they do not take into account the class distribution / proportion or balance of classes. In our dataset, the clients subscribing to term deposit (class 'yes' i.e. 1) is 11.7% whereas those about 88.3% of the clients didn't subscribe (class 'no' i.e. 0) to the term deposit.

Building a DummyClassifier, baseline model, in our case gave an accuracy of 88.2% with zero recall and precision for predicting minority class i.e. where the client subscribed to term deposits. In this cases, important performance measures such as precision, recall, and f1-score would be helpful. We can also calculate this metrics for the minority, positive, class.

Precision: When it predicts the positive result, how often is it correct? i.e. limit the number of false positives.
Recall: When it is actually the positive result, how often does it predict correctly? i.e. limit the number of false negatives.
f1-score: Harmonic mean of precision and recall.

The confusion matrix for class 1 (Subscribed) would look like:

	Predicted: 0 (Not Subscribed)	Predicted: 1 (Subscribed)
Actual: 0 (Not Subscribed)	True Negatives	False Positives
Actual: 1 (Subscribed)	False Negatives	True Positives

Precision would tell us cases where actually the client hadn't subscribed to the term deposit but we predicted it as subscribed.
Recall would tell us cases where actually the client had subscribed to the term deposit but we predicted it as didn't subscribe.

In our case, it would be recall that would hold more importance then precision. So choosing recall particularly for class 1 and accuracy as as evaluation metric. Also important would be how is model behaving over the training and test scores across the cross validation sets.

Modeling was sub-divided in two phases, in the first phase we applied standard models (with and without the hyperparameter tuning wherever applicable) such as Logistic Regression, k-Nearest Neighbor and Naive Bayes classifiers. In second phase apply ensemble techniques such as Decision Tree, Bagging, AdaBoost, Gradient Boosting and Random Forest classifiers. Oversampling the ones with higher accuracy and better recall for subscribe.

Oversampling, which is one of common ways to tackle the issue of imbalanced data. Over-sampling refers to various methods that aim to increase the number of instances from the underrepresented class in the data set. Out of the various methods, we chose Synthetic Minority Over-Sampling Technique (SMOTE). SMOTE’s main advantage compared to traditional random naive over-sampling is that by creating synthetic observations instead of reusing existing observations, classifier is less likely to overfit.

In the first phase (Standard machine learning models vs baseline model),

Best recall for minority class in Naive Bayes of 52.1%
However better accuracy of 89.9% was observed in logistic regression with hyperparameter tuning
Oversampling both, recall of Naive Bayes increases but accuracy drops significantly whereas oversampled logistic regression with hyperparameter tuned model out of which we found a significant improvement of 78.4% in the recall for subscribe against the baseline model, however the accuracy dropped by 5.4%. But again, it's recall that's more important here.

In the second phase (Ensemble models vs baseline model),

Decision tree classifier gives the best recall score for subscribe class but is also prone to overfitting across the cross validation set. Reducing the max depth did solve the problem of overfitting, but resulted in lower recall subscribe.
AdaBoost classifier gave the best recall when compared with decision tree reduced max depth, bagging and gradient boosting classifier. Thus, oversampled that which gave a recall of 78.3% for subscribe (yes) class. Best till now among ensembles.
Tried different methods with random forest i.e. hyperparameter tuning, oversampling and even reducing max depth.. but recall score didn't improve when compared with AdaBoost classifier when oversampled.
Thus two better models from this phase are AdaBoost classifier when oversampled with a recall of 78.3%, accuracy of 81.1% and Random Forest classifier when hyperparameter tuned and oversampled, max depth reduced with recall of 78% and accuracy of 81.2%
AdaBoost's recall for subscribe is an improvement of 78.3% from the baseline model while the accuracy did reduce. It's a trade off we have to deal with.

Author: Pratik Sharma¶