Before fitting a logistic regression model to classify who is likely to survive, we have to examine the dataset using insights from EDA as well as other statistical methods. Logistic regression is a supervised learning technique.
The training and test datasets contain data that cannot be used directly, due to issues including (but not limited to) missing values, sparse columns, and categorical variables that require encoding.
Let us examine sparse columns by computing the ratio of NaNs to total values. The describe() method of a DataFrame reports the count, mean, standard deviation, and quartiles (ignoring NaNs) for float/integer columns only.
train_data.describe()
       PassengerId    Survived      Pclass         Age       SibSp       Parch        Fare
count   891.000000  891.000000  891.000000  714.000000  891.000000  891.000000  891.000000
mean    446.000000    0.383838    2.308642   29.699118    0.523008    0.381594   32.204208
std     257.353842    0.486592    0.836071   14.526497    1.102743    0.806057   49.693429
min       1.000000    0.000000    1.000000    0.420000    0.000000    0.000000    0.000000
25%     223.500000    0.000000    2.000000   20.125000    0.000000    0.000000    7.910400
50%     446.000000    0.000000    3.000000   28.000000    0.000000    0.000000   14.454200
75%     668.500000    1.000000    3.000000   38.000000    1.000000    0.000000   31.000000
max     891.000000    1.000000    3.000000   80.000000    8.000000    6.000000  512.329200

The count row shows how many non-NaN values exist in each column.
We can see that the Age column has 714 entries, i.e., 891 - 714 = 177 missing values, which is 177/891 ≈ 0.20, or about 20% missing. If this percentage were small, we could simply drop those rows when fitting a logistic regression model. There are various methods to fill in missing values, but before discussing how to fix the sparsity of the Age column, let us examine the other columns as well.
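The per-column missing ratio computed above for Age can be obtained for all columns in one call with isnull().mean(). A minimal sketch on a toy frame (hypothetical values standing in for train_data):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for train_data (hypothetical values).
df = pd.DataFrame({
    "Age":   [22.0, np.nan, 26.0, np.nan, 35.0],
    "Fare":  [7.25, 71.28, 7.92, 53.10, 8.05],
    "Cabin": [np.nan, "C85", np.nan, "C123", np.nan],
})

# isnull() marks NaNs; mean() turns each True/False column into a ratio.
missing_ratio = df.isnull().mean()
print(missing_ratio)  # Age 0.4, Fare 0.0, Cabin 0.6
```

On the real Titanic training frame this immediately reveals Age, Cabin, and Embarked as the sparse columns.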
We can see that PassengerId, Name, and Ticket are unique to each person and hence will not serve as modeling features. Logistic regression, like any supervised or unsupervised learning method, needs to find patterns in the dataset; this is a necessary condition for an algorithm to make sense of the available data by mathematically recording those patterns. Hence, IDs and names are usually not useful for modeling; they serve to identify the person after a recommendation, prediction, or classification has been made. As we shall see later, however, the Name column can still help us fill in other columns (for example, imputing Age from titles), thereby improving the overall dataset.
The Cabin column is very sparse: only 204 of 891 rows (about 23%) have a cabin recorded, which you can check with train_data.Cabin.notnull().sum(). (len(train_data.Cabin.unique()) returns 148, the number of distinct cabin values including NaN.) When data is this sparse, we can ignore the column for modeling in a first iteration. Later, to improve the fit, this column can be investigated more deeply to extract additional information.
The Embarked column records the port where each passenger boarded. It has very little sparsity: train_data.Embarked.notnull().sum() returns 889 out of 891 rows, which is nearly all the data. Hence, it can be useful for modeling.
This is a column we created ourselves by splitting age into the bands Child, Adult, Senior, and Unknown. We have to determine how many Unknown entries there are so that we can build better models. Since this variable depends directly on Age, fixing the sparsity of Age fixes this column as well.
The Fare column has no sparsity and is complete.
# Here is the distplot used to generate the Age plot. Modify ind_var to plot Fare instead.
import pandas as pd
import numpy as np
import seaborn as sns
train_data = pd.read_csv("https://raw.githubusercontent.com/agconti/kaggle-titanic/master/data/train.csv")
test_data = pd.read_csv("https://raw.githubusercontent.com/agconti/kaggle-titanic/master/data/test.csv")
ind_var = train_data[train_data['Age'].notnull()].Age
fare_plot = sns.distplot(ind_var)
You don't need to look for NaNs as Fare is a complete set.
ind_var = train_data['Fare']
fare_plot = sns.distplot(ind_var)
Dummy variables are numerical values used in place of categorical variables. They are needed to convert a finite set of unique strings (or numbers) into an array of indicator values that uniquely represents each category. The columns 'Sex' and 'Embarked' are categorical variables with string content.
The Sex column has the entries male or female. It can be mapped into two columns, Male and Female, containing 1s and 0s to indicate the value in each row. Since each row is either male or female (the categories are mutually exclusive), the columns can only take the following combinations:

Male | Female
---|---
0 | 1
1 | 0
After this transformation, the logistic regression model can 'mathematically model' whether a row refers to a male or a female without reading the string content of the Sex column. What happens if we eliminate one of the columns, say Male, as below?

Female
---
0
1
We can still uniquely determine whether the subject is female, where 0 represents absence and 1 represents presence. Dropping one column from a set of dummy variables not only reduces modeling complexity (fewer columns convey the same information) but is also necessary for mathematical reasons: if all the dummy columns are included, they sum to a constant and the design matrix becomes singular (the 'dummy variable trap'), which causes errors during classification.
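pandas can drop the redundant dummy column for us. A minimal sketch on a hypothetical Sex series, using get_dummies with drop_first=True:

```python
import pandas as pd

# Hypothetical Sex column.
sex = pd.Series(["male", "female", "female", "male"], name="Sex")

# Full dummies: the two columns are perfectly collinear (they sum to 1).
full = pd.get_dummies(sex)

# drop_first=True keeps a single indicator, avoiding the dummy variable trap.
reduced = pd.get_dummies(sex, drop_first=True)
print(list(full.columns), list(reduced.columns))  # ['female', 'male'] ['male']
```

Either keeping one of the two columns manually or passing drop_first=True achieves the same reduced encoding.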
Imputation refers to substituting estimates for missing values in the data. It is an important step: the model can train better as more data becomes available after imputation. There are many well-known imputation methods, and sometimes, through analysis and EDA, we can design custom imputation rules that provide the best statistical estimates for the missing values. This reduces sparsity in the dataset. Let us investigate various ways to impute the sparse columns of the Titanic dataset.
To impute the Age column, we can use the Name information: determine how many names contain the titles Mr., Mrs., Miss., and Master., and use the mean age of each group wherever the age is missing. Here are the estimates for each category:
miss_est = train_data[train_data['Name'].str.contains('Miss. ')].Age.mean()
21.773972602739725
master_est = train_data[train_data['Name'].str.contains('Master. ')].Age.mean()
4.5741666666666667
mrs_est = train_data[train_data['Name'].str.contains('Mrs. ')].Age.mean()
35.898148148148145
mr_est = train_data[train_data['Name'].str.contains('Mr. ')].Age.mean()
32.332089552238806
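As an aside, the four separate str.contains lookups can also be done in one pass by extracting the title with a regular expression. A sketch on a hypothetical sample of names:

```python
import pandas as pd

# Hypothetical sample of the Name column.
names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
    "Heikkinen, Miss. Laina",
    "Palsson, Master. Gosta Leonard",
])

# The title sits between the comma and the following period.
titles = names.str.extract(r",\s*([A-Za-z]+)\.", expand=False)
print(titles.tolist())  # ['Mr', 'Mrs', 'Miss', 'Master']
```

On the real frame, grouping Age by this extracted title (e.g., train_data.groupby(titles).Age.mean()) would give all the estimates at once.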
The above estimates can be improved further by considering the Parch (parents/children aboard) column, since names containing Master and Miss include a subset of children (Master and Miss refer to unmarried passengers). Here are estimates that take these rules into consideration:
girl_child_est = train_data[train_data['Name'].str.contains('Miss. ') & (train_data['Parch'] == 1)].Age.mean()    # 3.696
boy_child_est = train_data[train_data['Name'].str.contains('Master. ') & (train_data['Parch'] == 1)].Age.mean()   # 12.0
woman_adult_est = train_data[train_data['Name'].str.contains('Miss. ') & (train_data['Parch'] == 0)].Age.mean()   # 27.763
man_adult_est = train_data[train_data['Name'].str.contains('Master. ') & (train_data['Parch'] == 1)].Age.mean()   # 12.0
woman_married_est = train_data[train_data['Name'].str.contains('Mrs. ')].Age.mean()                               # 35.898
man_married_est = train_data[train_data['Name'].str.contains('Mr. ')].Age.mean()                                  # 32.332
We shall use the above estimates in an imputation function built on the same rules. The math module is imported because we need to check for NaNs outside of DataFrame methods.
import math
def impute_age(row):
    # Access values by column label so the same function works for both
    # train_data and test_data, whose column positions differ.
    if math.isnan(row['Age']):
        if ('Miss. ' in row['Name']) and (row['Parch'] == 1):
            return girl_child_est
        elif ('Master. ' in row['Name']) and (row['Parch'] == 1):
            return boy_child_est
        elif ('Miss. ' in row['Name']) and (row['Parch'] == 0):
            return woman_adult_est
        elif 'Mrs. ' in row['Name']:
            return woman_married_est
        else:
            return man_married_est
    else:
        return row['Age']
DataFrames have an apply method that applies a function either to each element or to each row. To operate on rows instead of elements, pass axis=1 to apply:
train_data['Imputed_Age'] = train_data.apply(impute_age, axis=1)
test_data['Imputed_Age'] = test_data.apply(impute_age, axis=1)
As shown in the code section below, apply with axis=1 operates on every single row; i.e., each row is passed to impute_age, which returns an estimated age when the actual one is missing.
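Accessing row values by column label (rather than by position) keeps such a function robust when column order differs, as it does between the train and test frames. A minimal standalone sketch with a hypothetical fallback estimate:

```python
import math
import pandas as pd

# Tiny frame with a missing Age (hypothetical rows).
df = pd.DataFrame({
    "Name": ["A, Mr. X", "B, Mrs. Y"],
    "Age": [float("nan"), 38.0],
})

DEFAULT_EST = 30.0  # hypothetical fallback estimate

def fill_age(row):
    # Under axis=1, each row is a Series indexed by column name.
    if math.isnan(row["Age"]):
        return DEFAULT_EST
    return row["Age"]

df["Imputed_Age"] = df.apply(fill_age, axis=1)
print(df["Imputed_Age"].tolist())  # [30.0, 38.0]
```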
import math
import statsmodels.api as sm
from sklearn.metrics import roc_curve, roc_auc_score
girl_child_est = train_data[train_data['Name'].str.contains('Miss. ') & (train_data['Parch'] == 1)].Age.mean()
boy_child_est = train_data[train_data['Name'].str.contains('Master. ') & (train_data['Parch'] == 1)].Age.mean()
woman_adult_est = train_data[train_data['Name'].str.contains('Miss. ') & (train_data['Parch'] == 0)].Age.mean()
man_adult_est = train_data[train_data['Name'].str.contains('Master. ') & (train_data['Parch'] == 1)].Age.mean()
woman_married_est = train_data[train_data['Name'].str.contains('Mrs. ')].Age.mean()
man_married_est = train_data[train_data['Name'].str.contains('Mr. ')].Age.mean()
# Modify and uncomment the code below to impute the age.
#def impute_age(row):
#    if ('Miss. ' in row['Name']) and (row['Parch'] == 1):
#        return girl_child_est
#    elif ('Master. ' in row['Name']) and (row['Parch'] == 1):
#        return boy_child_est
#    elif ('Miss. ' in row['Name']) and (row['Parch'] == 0):
#        return woman_adult_est
#    elif 'Mrs. ' in row['Name']:
#        return woman_married_est
#    else:
#        return man_married_est
#train_data['Imputed_Age'] = train_data.apply(impute_age, axis=1)
Use train_data.columns (or train_data.head()) to see the column names; access a row's values by label (e.g., row['Age']) and test for missing values with math.isnan(...).
def impute_age(row):
    # Access values by column label so the same function works for both
    # train_data and test_data, whose column positions differ.
    if math.isnan(row['Age']):
        if ('Miss. ' in row['Name']) and (row['Parch'] == 1):
            return girl_child_est
        elif ('Master. ' in row['Name']) and (row['Parch'] == 1):
            return boy_child_est
        elif ('Miss. ' in row['Name']) and (row['Parch'] == 0):
            return woman_adult_est
        elif 'Mrs. ' in row['Name']:
            return woman_married_est
        else:
            return man_married_est
    else:
        return row['Age']
train_data['Imputed_Age'] = train_data.apply(impute_age, axis=1)
test_data['Imputed_Age'] = test_data.apply(impute_age, axis=1)
imputed_age = train_data['Imputed_Age']
train_data.head()
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | Imputed_Age |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S | 22.0 |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C | 38.0 |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S | 26.0 |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S | 35.0 |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S | 35.0 |
Categorical variables are variables whose values belong to a finite set; for example, the day of the week (0-6) or Sex (male/female). Categorical variables such as the port where a person embarked and the sex need to be split into separate columns so that the logistic regression model can use them. To do so, we use dummy variables, which encode each categorical variable as a set of 0s and 1s, where 0 indicates absence and 1 indicates presence of the feature. The logistic regression model can then tune to this dataset. Here, we use get_dummies to prepare the dataset.
train_embarked = pd.get_dummies(train_data['Embarked'])
train_sex = pd.get_dummies(train_data['Sex'])
train_data = train_data.join([train_embarked, train_sex])
test_embarked = pd.get_dummies(test_data['Embarked'])
test_sex = pd.get_dummies(test_data['Sex'])
test_data = test_data.join([test_embarked, test_sex])
train_data['Imputed_Age'] = train_data.apply(impute_age, axis=1)
test_data['Imputed_Age'] = test_data.apply(impute_age, axis=1)
train_data.head()
ROC is short for Receiver Operating Characteristic. We need to choose a classification threshold that best separates the estimated classes. Before we proceed to ROC curves, we need to understand the confusion matrix, which is used to analyze a classifier's performance. Suppose we are trying to predict class A; then:
| | Class A | Not Class A |
|---|---|---|
| "Class A" prediction | True Positive | False Positive |
| "Not Class A" prediction | False Negative | True Negative |
Let us use ROC functions from sklearn:
from sklearn.metrics import roc_curve, roc_auc_score
# y_pred holds the model's predicted probabilities (computed in the fitting step below).
roc_survival = roc_curve(train_data['Survived'], y_pred)
ROC plots show the True Positive Rate against the False Positive Rate. We therefore want the curve to hug the upper-left corner (the top and left axes) as closely as possible to maximize performance.
Seaborn builds on matplotlib: matplotlib is the basic visualization tool, and seaborn adds higher-level functions and nicer default styles that plain matplotlib plots can also use.
import matplotlib.pyplot as plt
sns.set_style("whitegrid")
plt.plot(roc_survival[0], roc_survival[1])
plt.show()
The curve above is how a typical ROC curve looks, and you should quickly be able to recognize one. To improve the fit, we would push the curve toward the top-left corner, minimizing the false positive rate while maximizing the true positive rate.
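The whole ROC curve can also be summarized by a single number, the area under the curve (AUC), where 1.0 is perfect ranking and 0.5 is chance. A minimal sketch on hypothetical scores using the same sklearn functions:

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# Hypothetical labels and predicted probabilities.
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

fpr, tpr, thresholds = roc_curve(y_true, y_score)
auc = roc_auc_score(y_true, y_score)
print(auc)  # 0.75: three of the four positive/negative pairs are ranked correctly
```

Applying roc_auc_score(train_data['Survived'], y_pred) to the fitted model's probabilities gives the corresponding summary for this lesson's classifier.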
Train the logistic regression model with input features in the same order as they appear in the dataframe columns. Use the features you think are best suited and get predictions on the training data.
from sklearn.metrics import roc_curve, roc_auc_score
# You don't want this to join twice if you are attempting this lesson multiple times.
try:
    train_embarked = pd.get_dummies(train_data['Embarked'])
    train_sex = pd.get_dummies(train_data['Sex'])
    train_data = train_data.join([train_embarked, train_sex])
    test_embarked = pd.get_dummies(test_data['Embarked'])
    test_sex = pd.get_dummies(test_data['Sex'])
    test_data = test_data.join([test_embarked, test_sex])
except Exception:
    # join raises if the dummy columns already exist from a previous run.
    print("The dummy columns have already been joined.")
# Modify the features list to include the relevant features and plot the roc curve.
features = ['Fare']
log_model = sm.Logit(train_data['Survived'], train_data[features]).fit()
y_pred = log_model.predict(train_data[features])
Optimization terminated successfully. Current function value: 0.689550 Iterations 4
Include all features and generate the ROC curve.
features = ['Pclass', 'Imputed_Age', 'SibSp', 'Parch', 'Fare', 'C', 'Q', 'female']
log_model = sm.Logit(train_data['Survived'], train_data[features]).fit()
y_pred = log_model.predict(train_data[features])
roc_survival = roc_curve(train_data['Survived'], y_pred)
import matplotlib.pyplot as plt
sns.set_style("whitegrid")
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
roc_plot = plt.plot(roc_survival[0], roc_survival[1])
roc_plot
Optimization terminated successfully. Current function value: 0.450489 Iterations 6
[<matplotlib.lines.Line2D at 0x24aeceea9b0>]
Here we shall learn how to perform modeling using scikit-learn.
from sklearn import metrics
from sklearn.linear_model import LogisticRegression
Instantiate a logistic regression model as:
log_sci_model = LogisticRegression()
Train the model with 'Fare':
log_sci_model = log_sci_model.fit(train_data[['Fare']], train_data['Survived'])  # sklearn expects a 2D feature array, hence the double brackets
Measure the performance of the trained model over the training set:
log_sci_model.score(train_data[['Fare']], train_data['Survived'])
0.66554433221099885
Train the model with all possible features.
from sklearn import metrics
from sklearn.linear_model import LogisticRegression
# Modify the code below to include all possible features.
features = ['Fare']
log_sci_model = LogisticRegression()
log_sci_model = log_sci_model.fit(train_data[features], train_data['Survived'])
log_sci_model.score(train_data[features], train_data['Survived'])
0.66554433221099885
Change the features list to include all relevant features:
features = ['Pclass', 'Imputed_Age', 'SibSp', 'Parch', 'Fare', 'C', 'Q', 'female']
log_sci_model = LogisticRegression()
log_sci_model = log_sci_model.fit(train_data[features], train_data['Survived'])
log_score = log_sci_model.score(train_data[features], train_data['Survived'])
print(log_score)
0.796857463524
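Note that scoring on the training set tends to overestimate performance. A held-out evaluation is safer; here is a sketch on synthetic data (hypothetical, standing in for the Titanic features, since the test set here lacks Survived labels and a split of the training data would be used in practice):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Synthetic, linearly separable data standing in for the Titanic features.
rng = np.random.RandomState(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

# Hold out 25% of the rows for evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

model = LogisticRegression().fit(X_train, y_train)
train_score = model.score(X_train, y_train)
test_score = model.score(X_test, y_test)
print(train_score, test_score)  # both should be high on this easy dataset
```

The gap between train_score and test_score is a quick check for overfitting.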