This is the writeup for titanicdeath.com. Besides having a truly fantastic name, this little website I created tells you whether you would have been likely to die on the Titanic, given a small amount of personal information.
The first question you are probably asking is: why did I die? Or why did I live?
The biggest determining factor in whether you lived or died is your gender. Women were much more likely than men to survive the sinking of the Titanic. If you have ever heard the phrase 'women and children first', you will have an intuitive understanding of why this is.
The next biggest determinant of survival is the class you travelled in: passengers in first class were more likely to survive. This makes some sense, since first class passengers often receive benefits, and one of those benefits on a sinking ship may have been easier access to lifeboats. Third class passengers were below deck and may not have had the means to escape.
Age is another important factor: the younger you were, the more likely it is that you survived. It seems young people and children had some level of priority on lifeboats.
Whether you were travelling alone or with others on the Titanic helps to determine whether you passed away tragically or miraculously survived. This is why titanicdeath.com asks if you are married or have siblings. Travellers who were alone or who were travelling with large numbers of relatives were more likely to have died. Travellers who were alone may have had to wait, or may have had lower priority on lifeboats (compared with women and children). Many large families were in third class, where death rates were high.
If you'd like a little more evidence, proceed below.
# Load the Titanic training data (the logistic regression model is loaded further down)
import pandas as pd
data_df = pd.read_csv('../titanicdeath/static/train.csv')
The 'sex' input has the biggest effect on survival.
You can see below that the survival rate for women is about 74%, but for men it is only about 19%.
data_df[['Sex', 'Survived']].groupby(['Sex'], as_index=False).mean().sort_values(by='Survived')
| | Sex | Survived |
|---|---|---|
| 1 | male | 0.188908 |
| 0 | female | 0.742038 |
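Why does taking the mean give a survival rate? 'Survived' is coded as 1 for passengers who lived and 0 for those who died, so the mean of the column within each group is simply the fraction of that group who survived. A minimal sketch on a tiny made-up table (the six rows are hypothetical, not taken from train.csv):

```python
import pandas as pd

# Hypothetical toy data: 1 = survived, 0 = died
toy_df = pd.DataFrame({
    'Sex': ['female', 'female', 'female', 'male', 'male', 'male'],
    'Survived': [1, 1, 0, 0, 0, 1],
})

# Mean of a 0/1 column per group = share of that group that survived
toy_rates = toy_df.groupby('Sex')['Survived'].mean()
print(toy_rates)
```

In this toy table two of the three women and one of the three men survive, so the group means are two thirds and one third.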
The 'class' input has the next biggest effect on survival.
In our data the class of a passenger is named 'Pclass'. You can see below that survival rates are highest for first class (63%), followed by second class (47%) and third class (24%).
data_df[['Pclass', 'Survived']].groupby(['Pclass'], as_index=False).mean().sort_values(by='Survived', ascending=False)
| | Pclass | Survived |
|---|---|---|
| 0 | 1 | 0.629630 |
| 1 | 2 | 0.472826 |
| 2 | 3 | 0.242363 |
The age input affects your chances of survival as well.
Passengers aged 0 to 16 had the highest chance of survival (55%). Everyone older had a lower chance of surviving the sinking of the Titanic, and passengers over 64 fared worst (9%).
data_df['age_group'] = pd.cut(data_df['Age'], 5)
data_df[['age_group', 'Survived']].groupby(['age_group'], as_index=False).mean().sort_values(by='age_group', ascending=True)
| | age_group | Survived |
|---|---|---|
| 0 | (0.34, 16.336] | 0.550000 |
| 1 | (16.336, 32.252] | 0.369942 |
| 2 | (32.252, 48.168] | 0.404255 |
| 3 | (48.168, 64.084] | 0.434783 |
| 4 | (64.084, 80] | 0.090909 |
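The odd-looking bin edges come from `pd.cut(data_df['Age'], 5)`, which splits the observed age range into five equal-width bins. The arithmetic can be sketched in plain Python, assuming the youngest passenger with a recorded age is 0.42 years old and the oldest is 80 (values assumed here because they are consistent with the table above). With the default settings, pandas also lowers the first edge by 0.1% of the range so that the minimum falls inside the first bin:

```python
# Assumed age range (values chosen to be consistent with the bins shown above)
min_age, max_age = 0.42, 80.0
n_bins = 5

# Five equal-width bins need six evenly spaced edges
width = (max_age - min_age) / n_bins
edges = [min_age + i * width for i in range(n_bins + 1)]

# pandas lowers the first edge by 0.1% of the range so min_age is included
edges[0] -= 0.001 * (max_age - min_age)

rounded_edges = [round(e, 3) for e in edges]
print(rounded_edges)
```

Because the bins are equal in width rather than equal in passenger count, the oldest group contains far fewer people than the others, so its survival rate is the least reliable of the five.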
Whether you are travelling alone or with family can affect your chances of survival.
In our data the number of people you are travelling with is represented by the 'companions' column, the sum of 'SibSp' (siblings and spouses aboard) and 'Parch' (parents and children aboard). If you travelled with 1, 2, or 3 companions your chances of survival were decent (roughly 55-72%). If you travelled alone your chances went down significantly, to 30%. If you travelled with 4 or more companions your chances were about as bad or worse.
data_df['companions'] = data_df['SibSp'] + data_df['Parch']
data_df[['companions', 'Survived']].groupby(['companions'], as_index=False).mean().sort_values(by='Survived', ascending=False)
| | companions | Survived |
|---|---|---|
| 3 | 3 | 0.724138 |
| 2 | 2 | 0.578431 |
| 1 | 1 | 0.552795 |
| 6 | 6 | 0.333333 |
| 0 | 0 | 0.303538 |
| 4 | 4 | 0.200000 |
| 5 | 5 | 0.136364 |
| 7 | 7 | 0.000000 |
| 8 | 10 | 0.000000 |
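The model input further down uses an `is_alone` flag rather than the raw companion count. One plausible way to derive such a flag, sketched here as an assumption rather than the exact preprocessing behind the site, is to mark passengers who have zero companions (the sample rows are made up):

```python
import pandas as pd

# Hypothetical sample rows: SibSp = siblings/spouses aboard, Parch = parents/children aboard
sample_df = pd.DataFrame({'SibSp': [0, 1, 0, 4], 'Parch': [0, 0, 2, 2]})

sample_df['companions'] = sample_df['SibSp'] + sample_df['Parch']
# Assumed encoding: 1 if travelling with no family, otherwise 0
sample_df['is_alone'] = (sample_df['companions'] == 0).astype(int)
print(sample_df)
```

A single yes/no flag cannot capture the poor outcomes of very large families, so a model could reasonably keep the companion count (or a bucketed version of it) as well.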
In the next section we will load a model (logistic regression) that I have trained to take an individual passenger's information and return the probability of survival.
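As a rough sketch of what a logistic regression does internally: it multiplies each input by a learned weight, adds an intercept, and squashes the total through the logistic (sigmoid) function to get a probability between 0 and 1. The weights and intercept below are invented for illustration and are not the values stored in logreg.pkl; only two of the eight inputs are used to keep the example short:

```python
import math

def sigmoid(z):
    """Map any real number to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def survival_probability(features, weights, intercept):
    """Weighted sum of the inputs, passed through the sigmoid."""
    z = intercept + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)

# Hypothetical weights for (passenger_class, sex); not the real model's values
example_weights = [-1.0, 2.0]
example_intercept = 0.5

p_male = survival_probability([3, 0], example_weights, example_intercept)
p_female = survival_probability([3, 1], example_weights, example_intercept)
print(p_male, p_female)
```

A positive weight pushes the probability up as the input grows and a negative weight pushes it down, which is why the sign and size of each learned weight tell you how an input affects survival.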
# sklearn.externals.joblib was removed in scikit-learn 0.23; import joblib directly
import joblib
logreg = joblib.load('../titanicdeath/static/logreg.pkl')
age = 2
fare = 0
embarkation = 2
title = 1
is_alone = 1
age_class = 6
This next bit of code takes a passenger's input values and tells us the probability that the passenger survives. The number shown is our probability of survival: 7.8% here if we are a man (sex=0) in third class (passenger_class=3).
passenger_class = 3
sex = 0
passenger_input = pd.DataFrame([[passenger_class, sex, age, fare, embarkation, title, is_alone, age_class]])
pred = logreg.predict_proba(passenger_input)
pred[0][1]
0.078280156772079126
If we change the sex from male to female (sex=1) we see a huge improvement in our chances of survival (now 43%).
passenger_class = 3
sex = 1
passenger_input = pd.DataFrame([[passenger_class, sex, age, fare, embarkation, title, is_alone, age_class]])
pred = logreg.predict_proba(passenger_input)
pred[0][1]
0.43427750040904095
If we change the class from third class to first class (passenger_class=1) we improve the odds of survival even further, to 77%.
passenger_class = 1
sex = 1
passenger_input = pd.DataFrame([[passenger_class, sex, age, fare, embarkation, title, is_alone, age_class]])
pred = logreg.predict_proba(passenger_input)
pred[0][1]
0.77444683956853377
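One way to read these numbers: in a logistic regression, changing a single input while holding the others fixed shifts the log-odds of survival by that input's coefficient times the change. Assuming the loaded model is a plain logistic regression over the eight inputs (which the writeup suggests but does not show), the effect of sex can be backed out from the two third-class predictions above:

```python
import math

def log_odds(p):
    """Convert a probability into log-odds (the logit)."""
    return math.log(p / (1.0 - p))

# Probabilities reported above for third class, all other inputs unchanged
p_male = 0.078280156772079126
p_female = 0.43427750040904095

# Implied coefficient on 'sex', assuming a plain logistic regression
sex_effect = log_odds(p_female) - log_odds(p_male)
odds_ratio = math.exp(sex_effect)
print(round(sex_effect, 2), round(odds_ratio, 1))
```

Under that assumption, being female multiplies the odds of survival by roughly nine, whatever the other inputs are. Note that this is a ratio of odds, not of probabilities, which is why 7.8% becomes 43% rather than nine times 7.8%.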
This was intended to be a simple exploration of the Titanic dataset. If you would like to explore this data yourself, the Kaggle competition page and a really nice tutorial are here:
https://www.kaggle.com/c/titanic
https://www.kaggle.com/startupsci/titanic/titanic-data-science-solutions
If you are interested in machine learning and want a very good overview of how an algorithm like logistic regression works, I highly recommend you check out the first 3 lectures of Andrew Ng's Coursera course:
https://www.coursera.org/learn/machine-learning
You can check out some other stuff I have done here: