This is a notebook to analyze and predict survival rates for passengers on board the Titanic. As stated on kaggle.com:
On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 (∼ 68%) passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class.
In this challenge, we ask you to complete the analysis of what sorts of people were likely to survive. In particular, we ask you to apply the tools of machine learning to predict which passengers survived the tragedy.
It is your job to predict if a passenger survived the sinking of the Titanic or not.
For each PassengerId in the test set, you must predict a 0 or 1 value for the Survived variable.
Your score is the percentage of passengers you correctly predict.
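In other words, the metric is plain classification accuracy. As a minimal sketch (with hypothetical prediction and truth arrays):
import numpy as np
# accuracy = fraction of passengers predicted correctly (hypothetical arrays)
predictions = np.array([0, 1, 1, 0])
truth = np.array([0, 1, 0, 0])
accuracy = (predictions == truth).mean()  # 0.75 for these arrays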
Both a training dataset (used to train the machine learning algorithm) and a test dataset (used to test the algorithm) have been provided. The following notebooks on kaggle were instrumental in developing and implementing this notebook:
Last edited: 2017-10-12 22:54:13
...
# first import relevant libraries
# data management and mathematical functions
import pandas as pd
import numpy as np
import random as rnd
# visualization tools
import seaborn as sns
import matplotlib.pyplot as plt
# machine learning algorithms
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC, LinearSVC
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
# now define different plotting functions
# function to plot a near continuous variable in our dataset against another variable
def plot_distribution(df, var, target, **kwargs):
    row = kwargs.get('row', None)
    col = kwargs.get('col', None)
    xmax = kwargs.get('xmax', None)  # optional upper x-axis limit; defaults to the variable's maximum
    facet = sns.FacetGrid(df, hue=target, aspect=4, row=row, col=col)
    facet.map(sns.kdeplot, var, shade=True)
    facet.set(xlim=(0, xmax if xmax is not None else df[var].max()))
    facet.add_legend()
# function to plot a categorical variable in our dataset against another variable
def plot_categories(df, cat, target, **kwargs):
    order = kwargs.get('order', None)
    facet = sns.FacetGrid(df)
    facet.map(sns.barplot, cat, target, order=order)
# function to create a correlation plot of our features
def plot_correlation_map(df):
    corr = df.corr()
    fig, ax = plt.subplots(figsize=(10, 10))
    cscheme = sns.diverging_palette(220, 10, as_cmap=True)
    sns.heatmap(corr, cmap=cscheme, square=True, ax=ax, cbar_kws={'shrink': .9},
                annot=True, annot_kws={'fontsize': 12})
# read in training and testing data
train_df = pd.read_csv("./data/train.csv")
test_df = pd.read_csv("./data/test.csv")
# combine train/test into a new dataframe
full_df = pd.concat([train_df, test_df], ignore_index=True)
print (train_df.shape)
print (train_df.columns.values)
train_df.head(4)
(891, 12)
['PassengerId' 'Survived' 'Pclass' 'Name' 'Sex' 'Age' 'SibSp' 'Parch' 'Ticket' 'Fare' 'Cabin' 'Embarked']
|   | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
train_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB
From the above we also see that we have missing data for the Age, Cabin, and Embarked variables. What do these different variables actually mean?
| Variable Name | Meaning |
|---|---|
| PassengerId | unique numeric identifier |
| Survived | whether the passenger survived or not |
| Pclass | ticket class |
| Name | passenger name |
| Sex | passenger sex |
| Age | passenger age |
| SibSp | number of siblings/spouses aboard the ship |
| Parch | number of parents/children aboard the ship |
| Ticket | ticket number |
| Fare | passenger fare (in British pounds) |
| Cabin | cabin number |
| Embarked | port of embarkation |
We then classify these variables as categorical or numerical. For a review of categorical and numerical variables, see Appendix 2.1.
Categorical Variables
Nominal: Survived, Sex, and Embarked
Ordinal: Pclass
Numerical Variables
Continuous: Age and Fare
Discrete: SibSp and Parch
train_df.describe(percentiles=[0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.99], include = [np.number])
|   | PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare |
|---|---|---|---|---|---|---|---|
| count | 891.000000 | 891.000000 | 891.000000 | 714.000000 | 891.000000 | 891.000000 | 891.000000 |
| mean | 446.000000 | 0.383838 | 2.308642 | 29.699118 | 0.523008 | 0.381594 | 32.204208 |
| std | 257.353842 | 0.486592 | 0.836071 | 14.526497 | 1.102743 | 0.806057 | 49.693429 |
| min | 1.000000 | 0.000000 | 1.000000 | 0.420000 | 0.000000 | 0.000000 | 0.000000 |
| 10% | 90.000000 | 0.000000 | 1.000000 | 14.000000 | 0.000000 | 0.000000 | 7.550000 |
| 20% | 179.000000 | 0.000000 | 1.000000 | 19.000000 | 0.000000 | 0.000000 | 7.854200 |
| 30% | 268.000000 | 0.000000 | 2.000000 | 22.000000 | 0.000000 | 0.000000 | 8.050000 |
| 40% | 357.000000 | 0.000000 | 2.000000 | 25.000000 | 0.000000 | 0.000000 | 10.500000 |
| 50% | 446.000000 | 0.000000 | 3.000000 | 28.000000 | 0.000000 | 0.000000 | 14.454200 |
| 60% | 535.000000 | 0.000000 | 3.000000 | 31.800000 | 0.000000 | 0.000000 | 21.679200 |
| 70% | 624.000000 | 1.000000 | 3.000000 | 36.000000 | 1.000000 | 0.000000 | 27.000000 |
| 80% | 713.000000 | 1.000000 | 3.000000 | 41.000000 | 1.000000 | 1.000000 | 39.687500 |
| 90% | 802.000000 | 1.000000 | 3.000000 | 50.000000 | 1.000000 | 2.000000 | 77.958300 |
| 99% | 882.100000 | 1.000000 | 3.000000 | 65.870000 | 5.000000 | 4.000000 | 249.006220 |
| max | 891.000000 | 1.000000 | 3.000000 | 80.000000 | 8.000000 | 6.000000 | 512.329200 |
To include the rest of the variables (the objects), we modify our .describe() command to include all the objects, i.e., np.object. We could also have used include = ['O'].
train_df.describe(include = [np.object])
|   | Name | Sex | Ticket | Cabin | Embarked |
|---|---|---|---|---|---|
| count | 891 | 891 | 891 | 204 | 889 |
| unique | 891 | 2 | 681 | 147 | 3 |
| top | Watson, Mr. Ennis Hastings | male | 1601 | C23 C25 C27 | S |
| freq | 1 | 577 | 7 | 4 | 644 |
full_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 12 columns):
Age            1046 non-null float64
Cabin           295 non-null object
Embarked       1307 non-null object
Fare           1308 non-null float64
Name           1309 non-null object
Parch          1309 non-null int64
PassengerId    1309 non-null int64
Pclass         1309 non-null int64
Sex            1309 non-null object
SibSp          1309 non-null int64
Survived        891 non-null float64
Ticket         1309 non-null object
dtypes: float64(3), int64(4), object(5)
memory usage: 122.8+ KB
sns.countplot(x='Sex', data = full_df)
plt.show()
plot_categories(train_df , cat = 'Sex' , target = 'Survived', order = train_df.Sex.unique())
plt.show()
plot_distribution (train_df, var = 'Age' , target = 'Survived', row = 'Sex')
plt.show()
full_df.Sex.isnull().sum()
0
# replace sex string with a 0 or 1
gender_dict = {'female' : 1 , 'male' : 0}
full_df.Sex = full_df.Sex.map (gender_dict).astype(int)
sns.countplot (x='Embarked', data=full_df)
plt.show()
plot_categories(train_df , cat = 'Embarked' , target = 'Survived', order=train_df.Embarked.unique() )
plt.show()
full_df.Embarked.isnull().sum()
2
# impute the two missing values with the most common port, 'S'
full_df.Embarked = full_df.Embarked.fillna('S')
# replace embarked label with a 1 or 2 or 3
embarked_dict = {'S' : 1 , 'C' : 2, 'Q' : 3}
full_df.Embarked = full_df.Embarked.map (embarked_dict).astype(int)
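Note that mapping S/C/Q to 1/2/3 imposes an artificial ordering on a nominal variable. Tree-based models are largely indifferent to this, but for linear models a one-hot encoding is often preferable. A sketch of that alternative (it would replace the .map() call above):
# sketch: one-hot encode the port instead of using ordinal codes
embarked_dummies = pd.get_dummies(full_df.Embarked, prefix='Embarked')
full_df = pd.concat([full_df.drop('Embarked', axis=1), embarked_dummies], axis=1)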
sns.countplot (x='Pclass', data=full_df)
plt.show()
plot_categories( train_df , cat = 'Pclass' , target = 'Survived' )
plt.show()
plot_distribution (train_df, var = 'Fare' , target = 'Survived' , row = 'Pclass' )
plt.show()
full_df.Pclass.isnull().sum()
0
sns.distplot(full_df.Age.dropna(), hist=True)
plt.show()
plot_distribution (train_df, var='Age', target='Survived')
plt.show()
full_df.Age.isnull().sum()
263
age_mean = full_df.Age.mean()
age_std = full_df.Age.std()
print ('Mean age is %f' % age_mean)
print ('Age standard deviation %f' % age_std)
Mean age is 29.881138
Age standard deviation 14.413493
age_null = full_df.Age.isnull().sum()
# draw random ages uniformly from (mean - std, mean + std) for the missing entries
missing_ages = np.random.uniform(low=age_mean - age_std, high=age_mean + age_std, size=age_null)
# and add to the full dataset (using .loc avoids the chained-assignment SettingWithCopyWarning)
full_df.loc[full_df.Age.isnull(), 'Age'] = missing_ages
sns.distplot (full_df.Age, hist=True)
plt.show()
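Drawing missing ages uniformly from mean ± one standard deviation keeps the overall distribution roughly right but ignores information we already have about each passenger. A common refinement, shown here only as a sketch (it assumes the Title column we build later in this notebook, and is not what we use below), is to impute the median age per title:
# sketch: impute each missing age with the median age of passengers sharing the same title
# (assumes full_df already has the Title column created further down)
median_by_title = full_df.groupby('Title').Age.transform('median')
full_df.Age = full_df.Age.fillna(median_by_title)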
fig, (axis1,axis2) = plt.subplots(1,2,figsize=(10,5))
fig1 = sns.countplot(x='Parch', data = full_df, ax=axis1)
fig2 = sns.countplot(x='SibSp', data = full_df, ax=axis2)
plt.tight_layout()
plt.show()
plot_categories(train_df , cat = 'Parch' , target = 'Survived')
plot_categories(train_df , cat = 'SibSp' , target = 'Survived')
plt.show()
print (full_df.Parch.isnull().sum())
full_df.SibSp.isnull().sum()
0
0
# family size = siblings/spouses + parents/children + the passenger themselves
full_df['FamilySize'] = full_df.Parch + full_df.SibSp + 1
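A common companion feature, shown as a sketch and not used further here, collapses FamilySize into a flag for passengers travelling alone, which often separates survival rates more cleanly than the raw count:
# sketch: binary flag for passengers with no family aboard
full_df['IsAlone'] = (full_df.FamilySize == 1).astype(int)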
full_df.Name.head(10)
0                              Braund, Mr. Owen Harris
1    Cumings, Mrs. John Bradley (Florence Briggs Th...
2                               Heikkinen, Miss. Laina
3         Futrelle, Mrs. Jacques Heath (Lily May Peel)
4                             Allen, Mr. William Henry
5                                     Moran, Mr. James
6                              McCarthy, Mr. Timothy J
7                       Palsson, Master. Gosta Leonard
8    Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)
9                  Nasser, Mrs. Nicholas (Adele Achem)
Name: Name, dtype: object
full_df.Name.isnull().sum()
0
# extract the title from each name: split on ',' and take the second piece,
# then split on '.' and take the first piece; strip() removes surrounding whitespace
full_df['Title'] = full_df.Name.map(lambda name: name.split(',')[1].split('.')[0].strip())
# print unique titles
full_df.Title.unique()
array(['Mr', 'Mrs', 'Miss', 'Master', 'Don', 'Rev', 'Dr', 'Mme', 'Ms', 'Major', 'Lady', 'Sir', 'Mlle', 'Col', 'Capt', 'the Countess', 'Jonkheer', 'Dona'], dtype=object)
full_df.Title.value_counts()
Mr              757
Miss            260
Mrs             197
Master           61
Rev               8
Dr                8
Col               4
Mlle              2
Major             2
Ms                2
Lady              1
Don               1
Mme               1
Sir               1
Dona              1
Jonkheer          1
Capt              1
the Countess      1
Name: Title, dtype: int64
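Before collapsing the rare titles, it is worth checking how survival varies by title on the training rows; a quick sketch (run while Title is still a string):
# sketch: survival rate per title, training rows only (rows 0-890 carry Survived)
print(full_df.loc[0:890].groupby('Title').Survived.mean().sort_values(ascending=False))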
full_df.Title = full_df.Title.replace (['Capt', 'Col', 'Major', 'Jonkheer', 'Don', 'Sir', 'Dr',
'Rev', 'the Countess', 'Dona', 'Lady'], 'Rare')
full_df.Title = full_df.Title.replace ('Mme', 'Mrs')
full_df.Title = full_df.Title.replace ('Mlle', 'Ms')
full_df.Title = full_df.Title.replace ('Miss', 'Ms')
# replace titles with numbers
title_dict = {'Rare' : 1 , 'Mrs' : 2, 'Ms' : 3, 'Mr' : 4, 'Master': 5}
full_df.Title = full_df.Title.map (title_dict).astype(int)
sns.distplot(full_df.Fare.dropna(), hist=True)
plt.show()
plot_distribution (train_df, var = 'Fare' , target = 'Survived')
plt.show()
full_df.Fare.isnull().sum()
1
print ('Mean fare is %f' % train_df.Fare.mean())
print ('Mode fare is %f' % train_df.Fare.mode()[0])
print ('Median fare is %f' % train_df.Fare.median())
Mean fare is 32.204208
Mode fare is 8.050000
Median fare is 14.454200
full_df.Fare = full_df.Fare.fillna(train_df.Fare.median())
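Fares vary strongly with ticket class, so a slightly finer alternative (a sketch, not what we used above) is to impute the median fare within the passenger's Pclass:
# sketch: impute missing fares with the median fare of the same passenger class
full_df.Fare = full_df.Fare.fillna(full_df.groupby('Pclass').Fare.transform('median'))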
full_df.Ticket.head()
0           A/5 21171
1            PC 17599
2    STON/O2. 3101282
3              113803
4              373450
Name: Ticket, dtype: object
A ticket is either a bare number (e.g., 113803), a prefix of one or two tokens followed by a number (e.g., A/5 21171 or STON/O2. 3101282), or the literal string LINE, which carries no number at all. For our feature, we extract the number for the first 3 possibilities and map the string LINE to the number -1.
tick = pd.DataFrame()
tick['num'] = full_df.Ticket.map(lambda ticket: ticket.split(' '))
# the number is the last token when the ticket has a one- or two-token prefix, otherwise the only token
full_df['TickNum'] = tick.num.map(lambda term: term[1] if len(term) == 2 else term[2] if len(term) == 3 else term[0])
full_df['TickNum'] = full_df.TickNum.map(lambda term: -1 if term == 'LINE' else term).astype(int)
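A quick spot check that the extraction behaves as intended:
# compare raw tickets against the extracted numbers
print(full_df[['Ticket', 'TickNum']].head())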
drop_elements = ['Cabin', 'Name', 'PassengerId', 'Ticket', 'Parch', 'SibSp']
engineered_df = full_df.drop (drop_elements, axis = 1)
engineered_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 9 columns):
Age           1309 non-null float64
Embarked      1309 non-null int64
Fare          1309 non-null float64
Pclass        1309 non-null int64
Sex           1309 non-null int64
Survived      891 non-null float64
FamilySize    1309 non-null int64
Title         1309 non-null int64
TickNum       1309 non-null int64
dtypes: float64(3), int64(6)
memory usage: 92.1 KB
plot_correlation_map (engineered_df.loc[0:890])
plt.show()
# create subtraining data from the training dataset (.loc slicing is end-inclusive: rows 0-800)
subtrain_x = engineered_df.loc[0:800]
subtrain_x = subtrain_x.drop('Survived', axis=1)
# known survival outcomes for the subtraining data
subtrain_y = engineered_df.Survived.loc[0:800]
# create subvalidation data from the training dataset (rows 801-890)
subvalid_x = engineered_df.loc[801:890]
subvalid_x = subvalid_x.drop('Survived', axis=1)
# known survival outcomes for the subvalidation data
subvalid_y = engineered_df.Survived.loc[801:890]
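This split simply takes the first 801 labelled rows for training and the last 90 for validation; since the rows are in no meaningful order this is workable, but a randomized split is the more robust habit. A sketch using scikit-learn (same engineered_df, hypothetical random_state):
from sklearn.model_selection import train_test_split
# sketch: randomized 90/10 split of the labelled rows
labelled = engineered_df.loc[0:890]
subtrain_x, subvalid_x, subtrain_y, subvalid_y = train_test_split(
    labelled.drop('Survived', axis=1), labelled.Survived, test_size=0.1, random_state=42)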
model = LogisticRegression()
model.fit( subtrain_x , subtrain_y )
print("Fit score is %f" % model.score( subtrain_x , subtrain_y ))
print("Validation score is %f" % model.score( subvalid_x, subvalid_y))
Fit score is 0.679151
Validation score is 0.666667
RF_results = pd.DataFrame(columns=['estimators', 'training_score', 'validation_score'], index=range(0, 149))
for estimators in range(1, 150):
    model = RandomForestClassifier(n_estimators=estimators)
    model.fit(subtrain_x, subtrain_y)
    RF_results.loc[estimators-1, 'estimators'] = estimators
    RF_results.loc[estimators-1, 'training_score'] = model.score(subtrain_x, subtrain_y)
    RF_results.loc[estimators-1, 'validation_score'] = model.score(subvalid_x, subvalid_y)
plt.scatter(RF_results.estimators, RF_results.training_score, s=40, label='training')
plt.scatter(RF_results.estimators, RF_results.validation_score, s=40, marker='D', c='green', label='validation')
# print the best training score and the lowest number of estimators that achieves it
train_scores = RF_results.training_score.astype(float)
print("Best Random Forest result %f, first achieved with %d estimators" % (train_scores.max(), train_scores.idxmax() + 1))
plt.legend()
plt.show()
model = SVC()
model.fit( subtrain_x , subtrain_y )
print("Fit score is %f" % model.score( subtrain_x , subtrain_y ))
print("Validation score is %f" % model.score( subvalid_x, subvalid_y))
Fit score is 0.998752
Validation score is 0.644444
model = GradientBoostingClassifier()
model.fit( subtrain_x , subtrain_y )
print("Fit score is %f" % model.score( subtrain_x , subtrain_y ))
print("Validation score is %f" % model.score( subvalid_x, subvalid_y))
Fit score is 0.913858
Validation score is 0.877778
KNN_results = pd.DataFrame(columns=['dots', 'training_score', 'validation_score'], index=range(0, 10))
for dots in range(1, 11):
    model = KNeighborsClassifier(n_neighbors=dots)
    model.fit(subtrain_x, subtrain_y)
    KNN_results.loc[dots-1, 'dots'] = dots
    KNN_results.loc[dots-1, 'training_score'] = model.score(subtrain_x, subtrain_y)
    KNN_results.loc[dots-1, 'validation_score'] = model.score(subvalid_x, subvalid_y)
plt.scatter(KNN_results.dots, KNN_results.training_score, s=40, label='training')
plt.scatter(KNN_results.dots, KNN_results.validation_score, s=40, marker='D', c='green', label='validation')
# print the best validation score and the number of neighbors that first achieves it
valid_scores = KNN_results.validation_score.astype(float)
print("Best KNN validation score %f with %d neighbors" % (valid_scores.max(), valid_scores.idxmax() + 1))
plt.legend()
plt.show()
model = GaussianNB()
model.fit( subtrain_x , subtrain_y )
print("Fit score is %f" % model.score( subtrain_x , subtrain_y ))
print("Validation score is %f" % model.score( subvalid_x, subvalid_y))
Fit score is 0.661673
Validation score is 0.700000
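Each model above is judged on a single 90-passenger holdout, which is a noisy estimate. A sketch of how one might compare the candidates with 5-fold cross-validation on all the labelled rows instead:
from sklearn.model_selection import cross_val_score
# sketch: mean 5-fold CV accuracy for each candidate model
candidates = {'LogisticRegression': LogisticRegression(),
              'SVC': SVC(),
              'GradientBoosting': GradientBoostingClassifier(),
              'RandomForest': RandomForestClassifier(n_estimators=100),
              'GaussianNB': GaussianNB()}
X = engineered_df.loc[0:890].drop('Survived', axis=1)
y = engineered_df.Survived.loc[0:890]
for name, clf in candidates.items():
    scores = cross_val_score(clf, X, y, cv=5)
    print('%s: %.3f (+/- %.3f)' % (name, scores.mean(), scores.std()))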
# full training data
train_x = engineered_df.loc[0:890]
train_x = train_x.drop('Survived', axis=1)
# known survival outcomes for the full training data
train_y = train_df.Survived
# passengers whose survival we want to predict (the test set)
test_x = engineered_df.loc[891:]
test_x = test_x.drop('Survived', axis=1)
model = GradientBoostingClassifier()
model.fit( train_x , train_y )
print("Training Fit score is %f" % model.score( train_x , train_y ))
Training Fit score is 0.909091
passenger_id = full_df.PassengerId[891:]
test_y = model.predict(test_x)
output = pd.DataFrame({'PassengerId': passenger_id, 'Survived': test_y})
print(output.shape)
print(output.head())
output.to_csv('predictions_titanic.csv', index=False)
(418, 2)
     PassengerId  Survived
891          892         0
892          893         1
893          894         0
894          895         0
895          896         1
Categorical variables are qualitative and take on only a limited number of values. There are 3 types, nominal, ordinal, and interval:
Nominal: the values have no intrinsic order (e.g., Sex, Embarked).
Ordinal: the values have a meaningful order, but the gaps between them are not defined (e.g., Pclass).
Interval: the values are ordered and the differences between them are meaningful.
Numerical variables are quantitative and have numbers as their values. There are 2 types, continuous and discrete:
Continuous: can take any value within a range (e.g., Age, Fare).
Discrete: can take only particular values, typically counts (e.g., SibSp, Parch).
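In pandas terms, these distinctions map loosely onto dtypes; a small illustration with hypothetical series:
import pandas as pd
# nominal: unordered categories (ports of embarkation)
ports = pd.Series(['S', 'C', 'Q', 'S'], dtype='category')
# ordinal: categories with a declared order (ticket class, 3rd < 2nd < 1st)
pclass = pd.Categorical([3, 1, 2], categories=[3, 2, 1], ordered=True)
# continuous: floats (fares); discrete: integer counts (siblings/spouses aboard)
fares = pd.Series([7.25, 71.2833])
sibsp = pd.Series([1, 0, 3])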