The primary targets of this practice:
Let's do it!
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
# filter warnings
import warnings
warnings.filterwarnings('ignore')
from sklearn.exceptions import DataConversionWarning
warnings.filterwarnings(action='ignore')
%matplotlib inline
sns.set_style('darkgrid')
The problem is to predict the survivals in titanic.
As the data is already given in the competition, all we need to do is to read them.
gender_submission = pd.read_csv('titanic/gender_submission.csv')
test_df = pd.read_csv('titanic/test.csv')
train_df = pd.read_csv('titanic/train.csv')
print(train_df.info())
train_df.sample(3)
<class 'pandas.core.frame.DataFrame'> RangeIndex: 891 entries, 0 to 890 Data columns (total 12 columns): PassengerId 891 non-null int64 Survived 891 non-null int64 Pclass 891 non-null int64 Name 891 non-null object Sex 891 non-null object Age 714 non-null float64 SibSp 891 non-null int64 Parch 891 non-null int64 Ticket 891 non-null object Fare 891 non-null float64 Cabin 204 non-null object Embarked 889 non-null object dtypes: float64(2), int64(5), object(5) memory usage: 83.6+ KB None
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
681 | 682 | 1 | 1 | Hassab, Mr. Hammad | male | 27.0 | 0 | 0 | PC 17572 | 76.7292 | D49 | C |
321 | 322 | 0 | 3 | Danoff, Mr. Yoto | male | 27.0 | 0 | 0 | 349219 | 7.8958 | NaN | S |
68 | 69 | 1 | 3 | Andersson, Miss. Erna Alexandra | female | 17.0 | 4 | 2 | 3101281 | 7.9250 | NaN | S |
train_df.describe(include='all')
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 891.000000 | 891.000000 | 891.000000 | 891 | 891 | 714.000000 | 891.000000 | 891.000000 | 891 | 891.000000 | 204 | 889 |
unique | NaN | NaN | NaN | 891 | 2 | NaN | NaN | NaN | 681 | NaN | 147 | 3 |
top | NaN | NaN | NaN | Stephenson, Mrs. Walter Bertram (Martha Eustis) | male | NaN | NaN | NaN | 347082 | NaN | B96 B98 | S |
freq | NaN | NaN | NaN | 1 | 577 | NaN | NaN | NaN | 7 | NaN | 4 | 644 |
mean | 446.000000 | 0.383838 | 2.308642 | NaN | NaN | 29.699118 | 0.523008 | 0.381594 | NaN | 32.204208 | NaN | NaN |
std | 257.353842 | 0.486592 | 0.836071 | NaN | NaN | 14.526497 | 1.102743 | 0.806057 | NaN | 49.693429 | NaN | NaN |
min | 1.000000 | 0.000000 | 1.000000 | NaN | NaN | 0.420000 | 0.000000 | 0.000000 | NaN | 0.000000 | NaN | NaN |
25% | 223.500000 | 0.000000 | 2.000000 | NaN | NaN | 20.125000 | 0.000000 | 0.000000 | NaN | 7.910400 | NaN | NaN |
50% | 446.000000 | 0.000000 | 3.000000 | NaN | NaN | 28.000000 | 0.000000 | 0.000000 | NaN | 14.454200 | NaN | NaN |
75% | 668.500000 | 1.000000 | 3.000000 | NaN | NaN | 38.000000 | 1.000000 | 0.000000 | NaN | 31.000000 | NaN | NaN |
max | 891.000000 | 1.000000 | 3.000000 | NaN | NaN | 80.000000 | 8.000000 | 6.000000 | NaN | 512.329200 | NaN | NaN |
Understanding the data:
Column Name | Data Type | Comment |
---|---|---|
Survived | binary nominal | it's the training label for our estimator |
PassengerId | nominal | it's a random number which is not helpful in the analysis |
Name | nominal | it shows information like gender and title |
Sex | nominal | need to be converted |
Embarked | nominal | the place people embarked to the ship.need to be converted |
Ticket | nominal | seems not useful |
Cabin | nominal | lot's of missing data |
Pclass | ordinal | as stated in kaggle, it stands for ticket class |
Age | continuous quantitative datatypes | |
Fare | continuous quantitative datatypes | |
SibSp | discrete quantitative datatypes | number of siblings and spouses |
Parch | discrete quantitative datatypes | number of parents and children |
To see if there's any aberrant or non-acceptable data inputs.
# first, we create a copy of the original training dataset
# as we are now preparing for EDA so we call it df_EDA
df_EDA = train_df.copy()
# The ID is just a random number
df_EDA.drop(['PassengerId'],axis=1,inplace=True)
Check for null values or missing data.
# missing data
df_EDA.isnull().sum()[df_EDA.isnull().sum() != 0]
Age 177 Cabin 687 Embarked 2 dtype: int64
Age
# stats info
print(df_EDA.Age.describe())
# shape of its distribution
f, axes = plt.subplots(1,2,figsize=(12,4))
sns.boxplot(df_EDA.Age.dropna(),ax=axes[0])
sns.distplot(df_EDA.Age.dropna(),ax=axes[1])
count 714.000000 mean 29.699118 std 14.526497 min 0.420000 25% 20.125000 50% 28.000000 75% 38.000000 max 80.000000 Name: Age, dtype: float64
<matplotlib.axes._subplots.AxesSubplot at 0x13333811240>
From the data and the graphs above, we can know that its a slightly right-skewed normal distribution. Taking the median to fill all the null value should be fine, but we want more accuracy. First, let's see what it's like to fill with median. Wei: expectation maximization.
# fill age with median, save df for comparison
test_df = df_EDA.Age.fillna(df_EDA.Age.median())
What if we look into the feature correlations and fill the median according to different situations?
# convert sex into categories
df_EDA['Sex_Code'] = df_EDA['Sex'].astype('category').cat.codes
# show correlation heatmaps
df_EDA.corr().style.background_gradient(cmap='coolwarm')
Survived | Pclass | Age | SibSp | Parch | Fare | Sex_Code | |
---|---|---|---|---|---|---|---|
Survived | 1 | -0.338481 | -0.0772211 | -0.0353225 | 0.0816294 | 0.257307 | -0.543351 |
Pclass | -0.338481 | 1 | -0.369226 | 0.0830814 | 0.0184427 | -0.5495 | 0.1319 |
Age | -0.0772211 | -0.369226 | 1 | -0.308247 | -0.189119 | 0.0960667 | 0.0932536 |
SibSp | -0.0353225 | 0.0830814 | -0.308247 | 1 | 0.414838 | 0.159651 | -0.114631 |
Parch | 0.0816294 | 0.0184427 | -0.189119 | 0.414838 | 1 | 0.216225 | -0.245489 |
Fare | 0.257307 | -0.5495 | 0.0960667 | 0.159651 | 0.216225 | 1 | -0.182333 |
Sex_Code | -0.543351 | 0.1319 | 0.0932536 | -0.114631 | -0.245489 | -0.182333 | 1 |
The most correlated columns for age is "Pclass", "SibSp" and "Parch". I'm going to plot and see if that's intuitively ture.
sns.catplot(data=df_EDA,x='Pclass', y='Age', hue='Sex',
row='SibSp', col='Parch',kind='box',
row_order=[0,1,2],col_order=[0,1,2])
<seaborn.axisgrid.FacetGrid at 0x13333c31160>
We are gonna calculate median age for different sex, pclass, sibch and parch and fill NA with them.
# Fill NA with a complex method
for ind in df_EDA[df_EDA.Age.isnull()].index:
# filters
filter1 = df_EDA['Sex'] == df_EDA.loc[ind, 'Sex']
filter2 = df_EDA['Pclass'] == df_EDA.loc[ind, 'Pclass']
filter3 = df_EDA['SibSp'] == df_EDA.loc[ind, 'SibSp']
filter4 = df_EDA['Parch'] == df_EDA.loc[ind, 'Parch']
fill_value = df_EDA[filter1][filter2][filter3][filter4]['Age'].median()
# if filter result is nan, we fill with the global median
if pd.isna(fill_value):
fill_value = df_EDA['Age'].median()
# fill in values
df_EDA.loc[ind, 'Age'] = fill_value
# stats info
print(df_EDA.Age.describe(),"\n")
print("The kurtosis of the complex method is: {}".format(df_EDA.Age.kurtosis()))
print("The kurtosis of the global median method is: {}".format(test_df.kurtosis()))
# shape of its distribution
f, axes = plt.subplots(2,2,figsize=(12,8))
sns.boxplot(df_EDA.Age.dropna(),ax=axes[0,0])
sns.distplot(df_EDA.Age.dropna(),ax=axes[0,1])
sns.boxplot(test_df,ax=axes[1,0],color='Orange')
sns.distplot(test_df,ax=axes[1,1],color='Orange')
axes[0,0].set_title('Box Plot')
axes[0,1].set_title('Dist Plot')
axes[0,0].set_ylabel('More complex method')
axes[1,0].set_ylabel('Fill with median')
count 891.000000 mean 29.387957 std 13.467945 min 0.420000 25% 22.000000 50% 27.000000 75% 36.000000 max 80.000000 Name: Age, dtype: float64 The kurtosis of the complex method is: 0.5874999132251677 The kurtosis of the global median method is: 0.9938710163801736
Text(0, 0.5, 'Fill with median')
It can be clearly seen that the complex method is much better. It create less outliers, reduce the kurtosis and makes more sense intuitively.
Cabin
df_with_cabin = df_EDA[df_EDA.Cabin.notnull()]
# Let us see the survival rate
rate_of_survival_cabin = df_with_cabin.Survived.mean()
print("Survival Rate:","{:.0%}".format(rate_of_survival_cabin))
df_with_cabin['Cabin'] = df_with_cabin.Cabin.apply(lambda x: x[0])
sns.barplot(x='Cabin',y='Survived',data=df_with_cabin)
Survival Rate: 67%
<matplotlib.axes._subplots.AxesSubplot at 0x13334c4dac8>
df_EDA['Cabin'] = (df_EDA.Cabin
.apply(lambda x: x[0] if pd.notna(x) else 'Z')
.apply(lambda x: 'Y' if x in ['C','E','D','B','F'] else x)
.apply(lambda x: 'X' if x in ['G','A','T'] else x))
print(a.value_counts())
sns.barplot(x=a, y=train_df.Survived)
A survival rate of 66% is really high comparing to the overall rate! Remember that we have the overall survival rate of 38%.
We already know that ticket fare and pclass are the most important factor, so I guess that only higher class people get their Cabin data recorded in the system. To prove the hypothesis, we will plot two figures with one containing all the data and another one containing only the data with Cabin info. We will see the differences in the plots below.
A little practice on the seaborn plot here =)
# style and background
# plt.style.use('seaborn-notebook')
sns.set_style("white")
# figure and sizes
f, axes = plt.subplots(1,2,figsize=(14,4))
# all data
a1 = sns.countplot('Pclass',hue='Survived',data=df_EDA,ax=axes[0],palette='muted')
# data with Cabin available
a2 = sns.countplot('Pclass',hue='Survived',data=df_with_cabin,ax=axes[1],palette='muted')
# title
a1.set_title('All')
a2.set_title('Cabin Available')
# annotation and text
a1.annotate('See how many death here?\nMany Class 3 records here.',
(1.5,350),
(0.0,220),
arrowprops=dict(arrowstyle = 'simple'))
a2.text(.5,50,
'Pclass 2 and 3 only show few records')
# style
sns.despine(left=True, bottom=False)
Hypothesis proved. I can dig deeper into it but let's now focus on the topic. This column seems to have a high correlation with the fare and pclass and it has many null values. To use that we need more digging and efforts, so we drop that.
df_EDA.drop(['Cabin'], axis=1, inplace=True)
# double check
df_EDA.columns
Index(['Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Ticket', 'Fare', 'Embarked', 'Sex_Code'], dtype='object')
Embarked
We got 2 missing value in the column Embarked. We fill it with the most common value in this column.
print('# of NA in Embarked: ',df_EDA.Embarked.isnull().sum())
df_EDA.Embarked.value_counts()
df_EDA.Embarked.fillna(df_EDA['Embarked'].mode()[0],inplace=True)
print('Are all the NAs has been filled: ', df_EDA.Embarked.notnull().all())
# of NA in Embarked: 2 Are all the NAs has been filled: True
# split the Name column twice to get the title
df_EDA['Title'] = df_EDA.Name.apply(lambda x: x.split(', ')[1].split('. ')[0])
g = sns.countplot(df_EDA['Title'])
g = plt.setp(g.get_xticklabels(), rotation=45)
# Convert to categorical values Title
df_EDA["Title"] = df_EDA["Title"].replace(['Lady', 'the Countess','Countess','Capt', 'Col','Don', 'Dr', 'Major', 'Rev', 'Sir', 'Jonkheer', 'Dona'], 'Rare')
df_EDA["Title"] = df_EDA["Title"].map({"Master":"Master", "Miss":"Female", "Ms" : "Female" , "Mme":"Female", "Mlle":"Female", "Mrs":"Female", "Mr":"Male", "Rare":"Rare"})
# plot
g = sns.countplot(df_EDA["Title"])
g = g.set_xticklabels(["Master","Miss/Ms/Mme/Mlle/Mrs","Mr","Rare"])
sns.despine()
# create a column of family size and based on it, create another
df_EDA['FamilySize'] = df_EDA['Parch'] + df_EDA['SibSp'] + 1
df_EDA['IsAlone'] = df_EDA['FamilySize'].apply(lambda x: 1 if x == 1 else 0)
df_EDA.head()
# create bins
df_EDA['FareBin'] = pd.qcut(df_EDA['Fare'], 4)
df_EDA['AgeBin'] = pd.cut(df_EDA['Age'].astype(int), 5)
# copy it
df_for_model = df_EDA.copy()
from sklearn.preprocessing import LabelEncoder
# transform the interval/category into code, for EDA purpose
# using pandas astype category because according to Dark417, this performs better in the model
df_EDA['FareBin_Code'] = df_EDA.FareBin.astype('category').cat.codes
df_EDA['AgeBin_Code'] = df_EDA.AgeBin.astype('category').cat.codes
df_EDA['Embarked_Code'] = df_EDA['Embarked'].astype('category').cat.codes
# converting categorical data into one-hot columns
import patsy as pts
y, X = pts.dmatrices('Survived ~ Pclass + C(Sex) + Age + SibSp + Parch + Fare + ' +
'C(Embarked) + C(Title) + C(Cabin) + FamilySize + IsAlone', data=df_for_model,
return_type='dataframe')
pd.concat([X,y]).info()
from sklearn.model_selection import train_test_split
# train test split
# X_train, X_test, y_train, y_test = \
# train_test_split(X, y, test_size=0.2, random_state=42)
# I choose to use kfold cv method, so the train test split above is not used anymore
# To keep the rest of the codes unchanged, I'll reassign the variables.
X_train = X
y_train = y
<class 'pandas.core.frame.DataFrame'> Int64Index: 1782 entries, 0 to 890 Data columns (total 15 columns): Age 891 non-null float64 C(Embarked)[T.Q] 891 non-null float64 C(Embarked)[T.S] 891 non-null float64 C(Sex)[T.male] 891 non-null float64 C(Title)[T.Male] 891 non-null float64 C(Title)[T.Master] 891 non-null float64 C(Title)[T.Rare] 891 non-null float64 FamilySize 891 non-null float64 Fare 891 non-null float64 Intercept 891 non-null float64 IsAlone 891 non-null float64 Parch 891 non-null float64 Pclass 891 non-null float64 SibSp 891 non-null float64 Survived 891 non-null float64 dtypes: float64(15) memory usage: 222.8 KB
# Fare
sns.distplot(df_for_model.Fare)
print("Skewness: {}".format(df_for_model.Fare.skew()))
print("kurtosis: {}".format(df_for_model.Fare.kurtosis()))
Skewness: 4.787316519674893 kurtosis: 33.39814088089868
df_for_model['Fare'] = df_for_model.Fare.apply(lambda x: np.log(x) if x > 0 else 0)
sns.distplot(df_for_model.Fare)
print("Skewness: {}".format(df_for_model.Fare.skew()))
print("kurtosis: {}".format(df_for_model.Fare.kurtosis()))
Skewness: 0.44310881405404506 kurtosis: 0.641225603709215
Take a quick look of the pairplot and see what we can get.
As pairplot will not show categorical information, we are looking at numerical data.
g = sns.pairplot(df_EDA, hue='Survived', plot_kws={'alpha':0.5,'s':13},
vars=['Sex_Code','Pclass','Fare','Age',\
'FamilySize', 'IsAlone'])
# sns.catplot(kind='')
print(df_EDA.Survived.value_counts(normalize=True).apply(lambda x: '{:.0%}'.format(x)))
0 62% 1 38% Name: Survived, dtype: object
Diagonal:
Upper/lower plots: can't see any thing.
Lastly, remember that we have 38% total survival rate.
# I wrote a helper code to display each numeric column as a distplot.
# This is just to understand the distribution of the data.
import helper
tem_df = df_EDA.drop(['Parch', 'SibSp', 'FareBin_Code', 'AgeBin_Code', 'Title'],axis=1)
helper.distplot_w_stats(tem_df)
tem_df = pd.DataFrame()
tem_df['x'] = pd.cut(df_EDA['Fare'],200).apply(lambda x: round(int(x.mid),0))
tem_df['hue'] = df_EDA['Sex']
# plot
pal = dict(male="#6495ED", female="#F08080")
plt.subplots(figsize=(14,6))
ax = sns.countplot(x='x',hue='hue',data=tem_df,palette=pal)
plt.legend(loc=1)
ax.set_xlim([-.5,30])
ax.set_ylim([1,150])
sns.despine()
sns.set_style('white')
a = sns.jointplot(df_EDA['Fare'], df_EDA['Pclass'], kind="kde", xlim=(-50,150))
sns.despine()
pal = dict(male="#6495ED", female="#F08080")
fig, a = plt.subplots(2,3,figsize=(16,9))
ax0 = sns.distplot(df_EDA.Fare, ax=a[0,0])
ax1 = sns.boxplot(df_EDA.Fare, ax=a[0,1],palette="Oranges")
ax2 = sns.barplot(df_EDA['FareBin'].apply(lambda x:int(x.mid)), df_EDA['Survived'], palette='Oranges',ax=a[0,2], alpha=0.5)
sns.lineplot([x for x in range(-1,6)], 0.38, color='red',ax=a[0,2],alpha=.8)
ax0.set_title('Distribution')
ax1.set_title('Distribution')
ax2.set_title('Survival ratio by fare')
ax2.set_xlim(-1,4)
ax2.annotate('38% survival',
(0,.38),
(-.9,.8),
arrowprops=dict(arrowstyle = 'simple'))
a1 = sns.distplot(df_EDA.Pclass,ax=a[1,0])
a2 = sns.countplot('Pclass', hue='Survived', data=df_EDA, palette="BrBG",ax=a[1,1])
a3 = sns.pointplot('Pclass', 'Survived', data=df_EDA, palette=pal,hue='Sex',ax=a[1,2])
sns.lineplot([x for x in range(-1,4)], 0.38, color='red',ax=a[1,2],alpha=.8)
a1.set_title('Distribution')
a2.set_title('Count')
a3.set_title('Survival Ratio')
a3.set_xlim(-1,3)
sns.despine()
print(df_EDA.Fare.describe())
count 891.000000 mean 32.204208 std 49.693429 min 0.000000 25% 7.910400 50% 14.454200 75% 31.000000 max 512.329200 Name: Fare, dtype: float64
Fare Price
Most people paid 7.91 ~ 31.00 for the ticket, while some paid a lot of money like 512.32... We can see there are some outliers in the box plot. From the stats data showed above, we know that the mean value is influenced by the outliers, and the std is quite high possibly for the same reason.
In the graph, we have 4 groups of bins, and each of them has exactly the same number as each other because we cut the fare using Quantile-based discretization function.
As expected, we found that the more money paid for the ticket, the higher survival rate it will be.
Pclass
Expected. Same reason as fare price.
Sex
Gentle man. My greatest respect.
Variables: Age, Pclass, Survived, Sex
pal = dict(male="#6495ED", female="#F08080")
# Show the survival proability as a function of age and sex
g = sns.lmplot(x="Age", y="Survived", row = 'Pclass', col="Sex", hue="Sex", data=df_EDA, aspect=2,
palette=pal, y_jitter=.02, logistic=True)
g.set(xlim=(0, 80), ylim=(-.05, 1.05))
<seaborn.axisgrid.FacetGrid at 0x133384b1400>
Regression Plot
g = sns.factorplot(x="Pclass", y="Age", hue="Survived", col="Sex",
data=df_EDA, kind="violin", split=True,
bw=0.05, size=7, aspect=.9, s=7)
Violin plots
This is another way to display a 4 dimensional chart. While the lmplots emphases on the relationships between x and y, violin plots focuses on showing the distribution of the survivors.
This is a tricky one and I don't wanna go too deep on that, but let's try it anyway. From the numbers, we can know the sequence of booking and something more, but I really don't think this will contribute much on the analysis.
Instead, let's see what's the survival rate for different strings included in the ticket.
# replacing all digit with empty strings, and show the result
def replace_digit(x):
result = ""
for i in x:
if not i.isdigit():
result = "".join([result,i])
return result.strip()
# save the result in a variable
converted_ticket = train_df['Ticket'].apply(replace_digit)
# create a slicer including top 5 values of the above variable
slicer = train_df['Ticket'].apply(replace_digit).value_counts().index[:5]
j = []
k = []
for i in slicer:
print('Testing for {}'.format(i))
print('Number of data: {}'.format(sum(converted_ticket == i)))
j.append(i)
try:
survival_rate = train_df[converted_ticket == i].Survived.value_counts(normalize=True).loc[1]
print('P(Survival) = {:.0%}'.format(survival_rate))
k.append(survival_rate)
except:
print('The rate of survival is 0')
k.append(0)
print('-'*20)
f, axis = plt.subplots(figsize=(10,4))
a1 = sns.barplot(j,k,palette='GnBu_d')
sns.lineplot(range(-1,len(j) + 1),[.38 for j in range(len(j) + 2)],color='red')
a1.set_xlabel('String in Ticket')
a1.set_ylabel('Survival Rate')
a1.set_xlim(-1,5)
sns.despine()
Testing for Number of data: 661 P(Survival) = 38% -------------------- Testing for PC Number of data: 60 P(Survival) = 65% -------------------- Testing for C.A. Number of data: 27 P(Survival) = 48% -------------------- Testing for A/ Number of data: 13 P(Survival) = 8% -------------------- Testing for STON/O . Number of data: 12 P(Survival) = 42% --------------------
As we already know that the rate of survival for the entire training set is 38%, the first result proves that again. The second result shows a 65% of survival rate. Is that merely a coincidence? Does PC stands for premium class? We should see correlations between different columns later.
The rest of the results cannot prove anything due to lack of data.
# we convert the ticket to another column
# will use it later
converted_ticket = converted_ticket.apply(lambda x: 1 if x == 'PC ' else 0)
Cabin_transformed = train_df.Cabin.apply(lambda x: replace_digit(str(x)))
Cabin_counts = train_df.Cabin.apply(lambda x: replace_digit(str(x))).value_counts()
Cabin_filter = Cabin_counts > 10
Cabin_transformed = Cabin_transformed.fillna('nan').apply(lambda x: x if Cabin_filter[x] else 'misc')
a = sns.barplot(Cabin_transformed,train_df.Survived, palette='Blues')
sns.lineplot(range(-1,8),[.38 for x in range(9)],color='red')
a.set_xlim(-1,7)
sns.despine()
Remember, we only have 200+ number of data in Cabin, which only takes 25% of the total.
Very interestingly, We can see the survival rates for Cabin with letters in them all exceed the overall fatality rate. The rest of records that have no Cabin data with them have a lower than overall survival rate - 30%.
print('*'*35)
print('Survival Rate wrt Embarked')
print(df_EDA.pivot_table(values='Survived',columns=['Embarked']))
print('*'*35)
print('*'*35)
print('Embarked Count')
print(df_EDA.Embarked.value_counts())
print('*'*35)
g = sns.FacetGrid(df_EDA, col = 'Embarked', height =4)
g.map(sns.pointplot,'FareBin', 'Survived', 'Sex', palette='muted')
sns.despine()
*********************************** Survival Rate wrt Embarked Embarked C Q S Survived 0.553571 0.38961 0.339009 *********************************** *********************************** Embarked Count S 646 C 168 Q 77 Name: Embarked, dtype: int64 ***********************************
corr_data = (df_EDA.corr()
.drop(['Survived'])['Survived']
.apply(abs)
.sort_values(ascending=True)
)
(corr_data
.apply(lambda x: '{:.2%}'.format(x))
.to_frame('Correlation with Survival')
)
Correlation with Survival | |
---|---|
FamilySize | 1.66% |
AgeBin_Code | 3.49% |
SibSp | 3.53% |
Age | 6.52% |
Parch | 8.16% |
Embarked_Code | 16.77% |
IsAlone | 20.34% |
Fare | 25.73% |
FareBin_Code | 29.94% |
Pclass | 33.85% |
Sex_Code | 54.34% |
sns.set(style = 'white')
plt.subplots(figsize=(12,4))
ax = sns.barplot(corr_data.values,
corr_data.index,
orient='h',
palette='RdBu')
ax.set_title('Featrue Correlation with Survival')
sns.despine()
sns.set(style = 'darkgrid')
cmap = sns.diverging_palette(220, 10, as_cmap=True)
mask = np.zeros_like(df_EDA.corr(), dtype=np.bool)
mask[np.triu_indices_from(mask)] = True
f, ax = plt.subplots(figsize=(11, 9))
sns.heatmap(df_EDA.corr(),
center=0,
cmap = cmap,
square=True,
ax=ax,
mask = mask,
annot=True,
linewidths=0.1,
linecolor='white',
annot_kws={'fontsize':12},
cbar_kws={"shrink": .5})
<matplotlib.axes._subplots.AxesSubplot at 0x1333b07a898>
X_train.columns
Index(['Intercept', 'C(Sex)[T.male]', 'C(Embarked)[T.Q]', 'C(Embarked)[T.S]', 'C(Title)[T.Male]', 'C(Title)[T.Master]', 'C(Title)[T.Rare]', 'Pclass', 'Age', 'SibSp', 'Parch', 'Fare', 'FamilySize', 'IsAlone'], dtype='object')
# clean names
for i in [X_train]:
i.columns = ['Intercept', 'C(Sex)T.male', 'C(Embarked)T.Q', 'C(Embarked)T.S',
'C(Title)T.Male', 'C(Title)T.Master', 'C(Title)T.Rare',
'Pclass', 'Age', 'SibSp', 'Parch', 'Fare',
'FamilySize', 'IsAlone']
from xgboost import XGBClassifier
from sklearn import ensemble, linear_model, svm, naive_bayes, discriminant_analysis, neighbors, tree
from sklearn.preprocessing import normalize
from sklearn.model_selection import GridSearchCV, cross_val_score, StratifiedKFold, learning_curve
import time
# models
classifers = {'Random Rorest': ensemble.RandomForestClassifier(), # Ensembling Method
'GradientBoosting': ensemble.GradientBoostingClassifier(),
'XGB': XGBClassifier(),
# Linear model
'LogisticRegression': linear_model.LogisticRegressionCV(),
'SGD': linear_model.SGDClassifier(),
'Perception': linear_model.Perceptron(),
# Naive bayes
'BernoulliNB': naive_bayes.BernoulliNB(),
'GaussianNB': naive_bayes.GaussianNB(),
'SVM': svm.SVC(probability=True),
# neighboors
'KNN': neighbors.KNeighborsClassifier(),
# Trees
'Decision Tree': tree.DecisionTreeClassifier(),
'Discriminant Analysis': discriminant_analysis.LinearDiscriminantAnalysis()}
# create a loop for models to fit and generate scores
classifier_name = []
classifier_score = []
classifier_time = []
classifier_std = []
X_train_norm = normalize(X_train)
kfold = StratifiedKFold(n_splits=10)
for C in classifers:
start = time.time() # time start
classifier = classifers[C] #initializing
cv_scores = cross_val_score(classifier, X_train_norm, y = y_train, scoring = "accuracy", cv = kfold, n_jobs=4) #scoring
score = cv_scores.mean()
standard_deviation = cv_scores.std()
end = time.time() # time end
timespent = end - start
classifier_name.append(C)
classifier_score.append(score)
classifier_std.append(standard_deviation)
classifier_time.append(timespent)
number_of_C = len(classifers.keys())
# transform the lists to dataframes (this is not necessary)
temp = np.hstack((np.array(classifier_score).reshape(number_of_C,1),
np.array(classifier_std).reshape(number_of_C,1),
np.array(classifier_time).reshape(number_of_C,1)))
score_plot = pd.DataFrame(index=classifier_name,data=temp,columns=['Score','Std','Time']).sort_values('Score',ascending=True)
tem_df = score_plot.copy()
tem_df['Score'] = (tem_df
.loc[:,['Score']]
.apply(lambda x: "{:.2%}".format(x[0]),axis=1))
print(tem_df.sort_values('Score',ascending=False))
# Barplot for scores and times
sns.set(style = 'white')
f, axes = plt.subplots(2,1,figsize=(12,8))
ax1 = sns.barplot(score_plot['Score'],
score_plot.index,palette='RdBu',
ax=axes[0])
ax1.set_xlim(.65,.85)
score_plot.sort_values('Time',ascending=False,inplace=True)
ax2 = sns.barplot(score_plot['Time'],
score_plot.index,
palette='YlGn',
ax=axes[1])
sns.despine()
Score Std Time XGB 82.72% 0.036745 0.856709 GradientBoosting 81.93% 0.031963 0.421788 LogisticRegression 80.93% 0.032503 0.736032 Discriminant Analysis 80.60% 0.032006 0.046864 Random Rorest 79.91% 0.036953 0.979242 BernoulliNB 79.01% 0.037752 0.021942 Decision Tree 76.43% 0.044866 0.032922 KNN 74.65% 0.034153 0.034907 GaussianNB 73.76% 0.054647 0.022939 SGD 70.17% 0.076305 0.038896 SVM 68.82% 0.058704 0.346635 Perception 54.10% 0.160845 0.024933
Accuracy All models were using the default method to learn, so the accuracy is not maximized.
Discriminant Analysis > Logistic Regression > GB > Bayes > XGB > Random Forest > KNN > Decision Tree
How about normalization?
We used a normalization process in the last cell. Now, we are going to model it without norming the data and see what we have.
# models
classifers = {'Random Rorest': ensemble.RandomForestClassifier(), # Ensembling Method
'GradientBoosting': ensemble.GradientBoostingClassifier(),
'XGB': XGBClassifier(),
# Linear model
'LogisticRegression': linear_model.LogisticRegressionCV(),
'SGD': linear_model.SGDClassifier(),
'Perception': linear_model.Perceptron(),
# Naive bayes
'BernoulliNB': naive_bayes.BernoulliNB(),
'GaussianNB': naive_bayes.GaussianNB(),
'SVM': svm.SVC(probability=True),
# neighboors
'KNN': neighbors.KNeighborsClassifier(),
# Trees
'Decision Tree': tree.DecisionTreeClassifier(),
'Discriminant Analysis': discriminant_analysis.LinearDiscriminantAnalysis()}
# create a loop for models to fit and generate scores
classifier_name1 = []
classifier_score1 = []
classifier_time1 = []
classifier_std1 = []
for C in classifers:
start = time.time() # time start
classifier = classifers[C] #initializing
cv_scores = cross_val_score(classifier, X_train, y = y_train, scoring = "accuracy", cv = kfold, n_jobs=4) #scoring
score = cv_scores.mean()
standard_deviation = cv_scores.std()
end = time.time() # time end
timespent = end - start
classifier_name1.append(C)
classifier_score1.append(score)
classifier_time1.append(timespent)
classifier_std1.append(standard_deviation)
number_of_C = len(classifers.keys())
# transform the lists to dataframes (this is not necessary)
temp1 = np.hstack((np.array(classifier_score1).reshape(number_of_C,1),
np.array(classifier_std1).reshape(number_of_C,1),
np.array(classifier_time1).reshape(number_of_C,1)))
score_diff = (pd.DataFrame(index=classifier_name1,data= temp - temp1 ,columns=['Score','Std','Time'])
.sort_values('Score',ascending=True))
score_plot = (pd.DataFrame(index=classifier_name1,data= temp1 ,columns=['Score','Std','Time'])
.sort_values('Score',ascending=True))
tem_df = score_plot.copy()
tem_df['Score'] = (tem_df
.loc[:,['Score']]
.apply(lambda x: "{:.2%}".format(x[0]),axis=1))
print(tem_df.sort_values('Score',ascending=False))
# Barplot for scores and times
sns.set_style('whitegrid')
f, axes = plt.subplots(2,1,figsize=(12,8))
ax1 = sns.barplot(score_diff['Score'],
score_diff.index,palette='RdBu',
ax=axes[0]
)
ax1.set_title('Normed score - Not normed score')
ax2 = sns.barplot(score_plot['Score'],
score_plot.index,palette='RdBu',
ax=axes[1])
ax2.set_xlim(.65,.85)
sns.despine(left=True,bottom=True)
Score Std Time Discriminant Analysis 83.28% 0.032208 0.032912 GradientBoosting 82.94% 0.035958 0.308062 XGB 82.72% 0.040501 0.402932 LogisticRegression 81.93% 0.024556 1.202883 GaussianNB 80.14% 0.033692 0.029920 Random Rorest 79.92% 0.050365 0.074357 BernoulliNB 79.01% 0.037752 0.028922 Decision Tree 77.91% 0.050469 0.031914 SVM 74.65% 0.035100 0.501659 KNN 72.97% 0.030658 0.045878 SGD 66.20% 0.086785 0.030917 Perception 60.20% 0.138871 0.034908
I am so confused. Why modeling on normalized dataset returns a lower score? But look at that SGD model, normalization increase its accuracy by over 10%! KNN got a 5% boost in the normed data. Interesting.
Support from Wei: SVM, Random forest and XGB might not be sensitive to feature normalize. That's because they don't have the process of "weights x features". SGD is based on some random selections, so it is not stable especially in this small dataset. As to why the result turns out that way, there might be two answers:
In this section, I will only tune features for the algorithms that perform great in the last section, which are:
# Naive Bayes Parameters tunning
NBC = naive_bayes.BernoulliNB()
## Search grid for optimal parameters
nb_param_grid = {
'alpha': range(1,100,1),
'fit_prior': [True,False]}
gsNBC = GridSearchCV(NBC,param_grid = nb_param_grid, cv=kfold, scoring="accuracy", n_jobs= 4, verbose = 1)
gsNBC.fit(X_train,y_train)
# this is the best classifier
NBC_best = gsNBC.best_estimator_
# Best score
print(gsNBC.best_score_)
print(gsNBC.best_params_)
Fitting 10 folds for each of 198 candidates, totalling 1980 fits
[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers. [Parallel(n_jobs=4)]: Done 1096 tasks | elapsed: 1.5s
0.792368125701459 {'alpha': 74, 'fit_prior': True}
[Parallel(n_jobs=4)]: Done 1980 out of 1980 | elapsed: 2.6s finished
# RFC Parameters tunning
RFC = ensemble.RandomForestClassifier()
## Search grid for optimal parameters
rf_param_grid = {"max_depth": [None],
"max_features": [1, 3, 10],
"min_samples_split": [2, 3, 10],
"min_samples_leaf": [1, 3, 10],
"bootstrap": [False],
"n_estimators" :[100,300],
"criterion": ["gini"]}
gsRFC = GridSearchCV(RFC,param_grid = rf_param_grid, cv=kfold, scoring="accuracy", n_jobs= 4, verbose = 1)
gsRFC.fit(X_train,y_train)
# this is the best classifier
RFC_best = gsRFC.best_estimator_
# Best score
print(gsRFC.best_score_)
print(gsRFC.best_params_)
Fitting 10 folds for each of 54 candidates, totalling 540 fits
[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers. [Parallel(n_jobs=4)]: Done 76 tasks | elapsed: 4.8s [Parallel(n_jobs=4)]: Done 376 tasks | elapsed: 23.7s [Parallel(n_jobs=4)]: Done 540 out of 540 | elapsed: 38.4s finished
0.8383838383838383
# XGB Parameters tunning, reference
# https://www.analyticsvidhya.com/blog/2016/03/complete-guide-parameter-tuning-xgboost-with-codes-python/
xgb = XGBClassifier()
## Search grid for optimal parameters
xgb_param_grid = {
'max_depth':[4,6],
'min_child_weight':[4,6],
'gamma':[i/10.0 for i in range(0,5,2)],
'subsample':[i/10.0 for i in range(6,10,2)],
'colsample_bytree':[i/10.0 for i in range(6,10,2)],
'reg_alpha':[1e-2, 0.1, 1],
}
gsXGBC = GridSearchCV(xgb,param_grid = xgb_param_grid, cv=kfold, scoring="accuracy", n_jobs= 4, verbose = 1)
gsXGBC.fit(X_train,y_train)
# this is the best classifier
XGBC_best = gsXGBC.best_estimator_
# Best score
print(gsXGBC.best_score_)
print(gsXGBC.best_params_)
Fitting 10 folds for each of 144 candidates, totalling 1440 fits
[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers. [Parallel(n_jobs=4)]: Done 76 tasks | elapsed: 2.6s [Parallel(n_jobs=4)]: Done 376 tasks | elapsed: 12.5s [Parallel(n_jobs=4)]: Done 876 tasks | elapsed: 29.6s [Parallel(n_jobs=4)]: Done 1440 out of 1440 | elapsed: 50.3s finished
0.8439955106621774
# Discriminant Analysis
# DA Parameters tunning
DAC = discriminant_analysis.LinearDiscriminantAnalysis()
## Search grid for optimal parameters
dac_param_grid = {
'solver':['svd','lsqr'],
'n_components':[i for i in range(0,12,1)],
}
gsDAC = GridSearchCV(DAC,param_grid = dac_param_grid, cv=kfold, scoring="accuracy", n_jobs= 4, verbose = 1)
gsDAC.fit(X_train,y_train)
# this is the best classifier
DAC_best = gsDAC.best_estimator_
# Best score
print(gsDAC.best_score_)
print(gsDAC.best_params_)
Fitting 10 folds for each of 24 candidates, totalling 240 fits
[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
0.8327721661054994 {'n_components': 0, 'solver': 'svd'}
[Parallel(n_jobs=4)]: Done 240 out of 240 | elapsed: 0.4s finished
# Logistic Regression
# LRC tunning
LRC = linear_model.LogisticRegression()
## Search grid for optimal parameters
lrc_param_grid = {
'C': [0.001,0.01,0.1,1,10,100,1000],
'solver':['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga'],
# "penalty":["l1","l2"]
}
gsLRC = GridSearchCV(LRC,param_grid = lrc_param_grid, cv=kfold, scoring="accuracy", n_jobs= 4, verbose = 1)
gsLRC.fit(X_train,y_train)
# this is the best classifier
LRC_best = gsLRC.best_estimator_
# Best score
print(gsLRC.best_score_)
print(gsLRC.best_params_)
Fitting 10 folds for each of 35 candidates, totalling 350 fits
[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
0.8260381593714927 {'C': 10, 'solver': 'newton-cg'}
[Parallel(n_jobs=4)]: Done 350 out of 350 | elapsed: 4.5s finished
# Gradient Boosting, reference:
# https://www.analyticsvidhya.com/blog/2016/02/complete-guide-parameter-tuning-gradient-boosting-gbm-python/
# GB tunning
GBC = ensemble.gradient_boosting.GradientBoostingClassifier()
## Search grid for optimal parameters
gbc_param_grid = {
'learning_rate': [0.1, 0.05, 0.01],
'max_depth': [4,5,6],
'min_samples_leaf': [31],
'n_estimators':[30,70],
'max_features':range(4,14,4),
'subsample':[0.6,0.75,0.9]
}
gsGBC = GridSearchCV(GBC ,param_grid = gbc_param_grid, cv=kfold, scoring="accuracy", n_jobs= 4, verbose = 1)
gsGBC.fit(X_train,y_train)
# this is the best classifier
GBC_best = gsGBC.best_estimator_
# Best score
print(gsGBC.best_score_)
print(gsGBC.best_params_)
Fitting 10 folds for each of 162 candidates, totalling 1620 fits
[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers. [Parallel(n_jobs=4)]: Done 348 tasks | elapsed: 4.4s
0.8439955106621774 {'learning_rate': 0.05, 'max_depth': 6, 'max_features': 8, 'min_samples_leaf': 31, 'n_estimators': 70, 'subsample': 0.75}
[Parallel(n_jobs=4)]: Done 1620 out of 1620 | elapsed: 22.0s finished
# I decide to use the simple train test split in feature importance.
# That's easier for me to plot the figures.
X_train1, X_test1, y_train1, y_test1 = \
train_test_split(X, y, test_size=0.2, random_state=42)
for i in [X_train1, X_test1]:
i.columns = ['Intercept', 'C(Sex)T.male', 'C(Embarked)T.Q', 'C(Embarked)T.S',
'C(Title)T.Male', 'C(Title)T.Master', 'C(Title)T.Rare',
'Pclass', 'Age', 'SibSp', 'Parch', 'Fare',
'FamilySize', 'IsAlone']
from sklearn.metrics import confusion_matrix, classification_report
def plot_FI(clf, model, model_name):
'''
clf for classifier
model:
1. linear
2. tree
'''
# model
clf = clf
clf.fit(X_train1, y_train1)
# preparing feature importance score
empty_df = X_train1.copy().drop(X_train1.index)
if model == 'tree':
empty_df.loc[0,] = clf.feature_importances_
elif model == 'linear':
empty_df.loc[0,] = abs(clf.coef_)
total = empty_df.loc[0,].sum()
empty_df = empty_df.apply(lambda x:x/total)
plot_df = empty_df.transpose().sort_values(0,ascending=False)
plot_df.columns = [model_name]
# ploting feature importance
plt.subplots(figsize=(14,4))
sns.barplot(plot_df[model_name],plot_df.index)
pred = clf.predict(X_test1)
print("The overall score is: {:.2%}\n".format(clf.score(X_test1,y_test1)))
print(classification_report(y_test1,pred))
# Precision and recall
def inverse_of_oneandzero(x):
if x == 1:
return 0
elif x == 0:
return 1
else:
return np.nan
result = pred.reshape(pred.shape[0],1) == y_test1
tp_1 = result[y_test1 == 1]
fn_1 = result[y_test1 == 1]['Survived'].apply(inverse_of_oneandzero).to_frame()
tn_1 = result[y_test1 == 0]
fp_1 = result[y_test1 == 0]['Survived'].apply(inverse_of_oneandzero).to_frame()
tp_0 = result[y_test1 == 0]
fn_0 = result[y_test1 == 0]['Survived'].apply(inverse_of_oneandzero).to_frame()
tn_0 = result[y_test1 == 1]
fp_0 = result[y_test1 == 1]['Survived'].apply(inverse_of_oneandzero).to_frame()
precision_1 = tp_1.sum() / (tp_1.sum() + fp_1.sum())
precision_0 = tp_0.sum() / (tp_0.sum() + fp_0.sum())
recall_1 = tp_1.sum() / (tp_1.sum() + fn_1.sum())
recall_0 = tp_0.sum() / (tp_0.sum() + fn_0.sum())
return plot_df, [precision_0[0], precision_1[0], recall_0[0], recall_1[0]]
# XGB
XGB_df, XGB_pr = plot_FI(XGBC_best,'tree','XGB')
The overall score is: 82.12% precision recall f1-score support 0.0 0.84 0.86 0.85 105 1.0 0.79 0.77 0.78 74 micro avg 0.82 0.82 0.82 179 macro avg 0.82 0.81 0.81 179 weighted avg 0.82 0.82 0.82 179
Seeing the plot, I don't think this algorithm performs well because it shows high importance on Sex and very low importance for others.
# gradient boosting
GB_df, GB_pr = plot_FI(GBC_best,'tree','GradientBoosting')
The overall score is: 81.01% precision recall f1-score support 0.0 0.82 0.87 0.84 105 1.0 0.79 0.73 0.76 74 micro avg 0.81 0.81 0.81 179 macro avg 0.81 0.80 0.80 179 weighted avg 0.81 0.81 0.81 179
# Random Forest
RF_df, RF_pr = plot_FI(RFC_best,'tree','Random Forest')
The overall score is: 82.68% precision recall f1-score support 0.0 0.84 0.88 0.86 105 1.0 0.81 0.76 0.78 74 micro avg 0.83 0.83 0.83 179 macro avg 0.82 0.82 0.82 179 weighted avg 0.83 0.83 0.83 179
I like this! It finally takes Fare and Age into consideration! However, the result is lower than the above. Terrible prediction on those survived. This algorithm sacrifices recall score on survived to get a high overall result. That's not cool.
Maybe, it is all about sex? WTF!
# logistic regression
logistic_df, logistic_pr = plot_FI(LRC_best,'linear', 'Logistic regression')
The overall score is: 82.68% precision recall f1-score support 0.0 0.84 0.87 0.85 105 1.0 0.80 0.77 0.79 74 micro avg 0.83 0.83 0.83 179 macro avg 0.82 0.82 0.82 179 weighted avg 0.83 0.83 0.83 179
Finally, we have something good!
# Naive Bayes
NB_df, NB_pr = plot_FI(NBC_best,'linear', 'Naive Bayes')
The overall score is: 80.45% precision recall f1-score support 0.0 0.84 0.82 0.83 105 1.0 0.75 0.78 0.77 74 micro avg 0.80 0.80 0.80 179 macro avg 0.80 0.80 0.80 179 weighted avg 0.81 0.80 0.80 179
# Discriminative Analysis
DA_df, DA_pr = plot_FI(DAC_best,'linear', 'Discriminative Analysis')
The overall score is: 82.68% precision recall f1-score support 0.0 0.84 0.87 0.85 105 1.0 0.80 0.77 0.79 74 micro avg 0.83 0.83 0.83 179 macro avg 0.82 0.82 0.82 179 weighted avg 0.83 0.83 0.83 179
Feature Importance
Very interestingly, fare and age, the most important features, was considered irrelevant to NB's model. However, it did come up with a generally great score.
from IPython.display import HTML
feature_importance_df = pd.concat([XGB_df, GB_df, RF_df, logistic_df, NB_df, DA_df],axis=1)
for col in feature_importance_df.columns:
feature_importance_df[col] = feature_importance_df[col].apply(lambda x: round(x,4))
feature_importance_df['Total'] = feature_importance_df.apply(lambda x: round(sum(x),4),axis=1)
feature_importance_df.sort_values('Total',ascending=False,inplace=True)
feature_importance_df.drop('Total',axis=1).style.background_gradient()
XGB | GradientBoosting | Random Forest | Logistic regression | Naive Bayes | Discriminative Analysis | |
---|---|---|---|---|---|---|
C(Sex)T.male | 0.3513 | 0.2591 | 0.2751 | 0.2219 | 0.0888 | 0.2219 |
C(Title)T.Male | 0.2256 | 0.2875 | 0.2954 | 0.1125 | 0.1026 | 0.1125 |
C(Title)T.Master | 0 | 0 | 0.01 | 0.2644 | 0.1412 | 0.2644 |
Pclass | 0.1026 | 0.142 | 0.1162 | 0.0933 | 0.0183 | 0.0933 |
C(Title)T.Rare | 0 | 0 | 0.0069 | 0.092 | 0.1542 | 0.092 |
Fare | 0.0349 | 0.1343 | 0.1172 | 0.0003 | 0.0186 | 0.0003 |
FamilySize | 0.0546 | 0.0408 | 0.0599 | 0.0369 | 0.0183 | 0.0369 |
SibSp | 0.053 | 0.0251 | 0.025 | 0.0311 | 0.081 | 0.0311 |
C(Embarked)T.Q | 0.0254 | 0.0001 | 0.0031 | 0.0377 | 0.1372 | 0.0377 |
IsAlone | 0.0464 | 0.0041 | 0.0095 | 0.0555 | 0.0662 | 0.0555 |
C(Embarked)T.S | 0.0446 | 0.0179 | 0.0139 | 0.0475 | 0.0469 | 0.0475 |
Age | 0.0344 | 0.0828 | 0.0585 | 0.001 | 0.0183 | 0.001 |
Parch | 0.0272 | 0.0063 | 0.0092 | 0.0058 | 0.09 | 0.0058 |
Intercept | 0 | 0 | 0 | 0 | 0.0183 | 0 |
feature_importance_df.drop('Total',axis=1).plot(kind='bar', figsize=(18,6),width=.8)
<matplotlib.axes._subplots.AxesSubplot at 0x133480a6240>
Result Correlations
pred_corr = pd.DataFrame({'XGB': XGBC_best.predict(X_test1),
'GBC': GBC_best.predict(X_test1),
'RFC': RFC_best.predict(X_test1),
'NBC': NBC_best.predict(X_test1),
'DAC': DAC_best.predict(X_test1),
'LRC': LRC_best.predict(X_test1)}).corr()
sns.heatmap(pred_corr,
cmap='coolwarm',
annot=True)
<matplotlib.axes._subplots.AxesSubplot at 0x1334f27cbe0>
pr_dic = {'XGB_pr': XGB_pr,
'GB_pr': GB_pr,
'RF_pr': RF_pr,
'logistic_pr': logistic_pr,
'NB_pr': NB_pr,
'DA_pr': DA_pr}
precision_recall_df = pd.DataFrame(columns=['precision_0','precision_1','recall_0','recall_1'])
for key in pr_dic:
precision_recall_df.loc[key,:] = pr_dic[key]
Precision and Recall
precision_recall_df.sort_values('recall_1',ascending=False,inplace=True)
print(precision_recall_df)
precision_recall_df.transpose().plot(kind='bar',ylim=[0.65,0.92],figsize=(15,4),width=.6)
precision_0 precision_1 recall_0 recall_1 NB_pr 0.843137 0.753247 0.819048 0.783784 XGB_pr 0.841121 0.791667 0.857143 0.77027 logistic_pr 0.842593 0.802817 0.866667 0.77027 DA_pr 0.842593 0.802817 0.866667 0.77027 RF_pr 0.836364 0.811594 0.87619 0.756757 GB_pr 0.81982 0.794118 0.866667 0.72973
<matplotlib.axes._subplots.AxesSubplot at 0x13348a834e0>
SUMMARY
Different focuses among models
Features
Models
Precision and Recall
from sklearn.ensemble import VotingClassifier
classifiers = { 'XGB': XGBC_best,
'GBC': GBC_best,
'RFC': RFC_best,
'NBC': NBC_best,
'DAC': DAC_best,
'LRC': LRC_best}
voting_clf = VotingClassifier(estimators=[('XGB', XGBC_best),
('GBC', GBC_best),
('RFC', RFC_best),
('NBC', NBC_best),
('DAC', DAC_best),
('LRC', LRC_best)], voting='soft')
classifiers['Voting'] = voting_clf
final_score_df = pd.DataFrame(columns=['Score'])
dc_of_scores = {}
for n_loop in range(30):
X_train2, X_test2, y_train2, y_test2 = \
train_test_split(X, y, test_size=0.20)
for clf in classifiers:
classifiers[clf].fit(X_train2,y_train2)
sc = classifiers[clf].score(X_test2,y_test2)
if clf not in dc_of_scores.keys():
dc_of_scores[clf] = [sc]
elif clf in dc_of_scores.keys():
dc_of_scores[clf].append(sc)
else:
print('what?')
sc_df = (pd.DataFrame(dc_of_scores)
.apply(np.mean)
.to_frame(name='Average Score')
.sort_values('Average Score',ascending=False))
print(sc_df)
g = sns.barplot(x=sc_df.index,y=sc_df['Average Score'])
g.set_ylim(bottom=.82,top=.84)
sns.despine()
Average Score RFC 0.834451 Voting 0.834451 DAC 0.832216 GBC 0.831471 XGB 0.831099 LRC 0.827188 NBC 0.789199
sc_df = (pd.DataFrame(dc_of_scores)
.apply(np.std)
.to_frame(name='std')
.sort_values('std',ascending=False))
print(sc_df)
g = sns.barplot(x=sc_df.index,y=sc_df['std'])
sns.despine()
std NBC 0.031895 Voting 0.028649 LRC 0.027819 DAC 0.027463 RFC 0.027273 GBC 0.025482 XGB 0.025425