mlcourse.ai – Open Machine Learning Course

Author: Yury Kashnitsky. This material is subject to the terms and conditions of the Creative Commons CC BY-NC-SA 4.0 license. Free use is permitted for any non-commercial purpose.

Topic 2. Visual data analysis

Practice. Analyzing "Titanic" passengers

Fill in the missing code ("You code here"). No need to select answers in a webform.

Kaggle competition: "Titanic: Machine Learning from Disaster".

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
sns.set()
import matplotlib.pyplot as plt

Read data

In [2]:
train_df = pd.read_csv("../../data/titanic_train.csv", 
                       index_col='PassengerId') 
In [3]:
train_df.head(2)
Out[3]:
             Survived  Pclass  Name                                               Sex     Age   SibSp  Parch  Ticket     Fare     Cabin  Embarked
PassengerId
1            0         3       Braund, Mr. Owen Harris                            male    22.0  1      0      A/5 21171  7.2500   NaN    S
2            1         1       Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0  1      0      PC 17599   71.2833  C85    C
In [4]:
train_df.describe(include='all')
Out[4]:
        Survived    Pclass      Name                        Sex   Age         SibSp       Parch       Ticket  Fare        Cabin    Embarked
count   891.000000  891.000000  891                         891   714.000000  891.000000  891.000000  891     891.000000  204      889
unique  NaN         NaN         891                         2     NaN         NaN         NaN         681     NaN         147      3
top     NaN         NaN         Molson, Mr. Harry Markland  male  NaN         NaN         NaN         347082  NaN         B96 B98  S
freq    NaN         NaN         1                           577   NaN         NaN         NaN         7       NaN         4        644
mean    0.383838    2.308642    NaN                         NaN   29.699118   0.523008    0.381594    NaN     32.204208   NaN      NaN
std     0.486592    0.836071    NaN                         NaN   14.526497   1.102743    0.806057    NaN     49.693429   NaN      NaN
min     0.000000    1.000000    NaN                         NaN   0.420000    0.000000    0.000000    NaN     0.000000    NaN      NaN
25%     0.000000    2.000000    NaN                         NaN   20.125000   0.000000    0.000000    NaN     7.910400    NaN      NaN
50%     0.000000    3.000000    NaN                         NaN   28.000000   0.000000    0.000000    NaN     14.454200   NaN      NaN
75%     1.000000    3.000000    NaN                         NaN   38.000000   1.000000    0.000000    NaN     31.000000   NaN      NaN
max     1.000000    3.000000    NaN                         NaN   80.000000   8.000000    6.000000    NaN     512.329200  NaN      NaN
In [5]:
train_df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 891 entries, 1 to 891
Data columns (total 11 columns):
Survived    891 non-null int64
Pclass      891 non-null int64
Name        891 non-null object
Sex         891 non-null object
Age         714 non-null float64
SibSp       891 non-null int64
Parch       891 non-null int64
Ticket      891 non-null object
Fare        891 non-null float64
Cabin       204 non-null object
Embarked    889 non-null object
dtypes: float64(2), int64(4), object(5)
memory usage: 83.5+ KB

Let's drop the Cabin column, and then drop all rows with missing values.

In [6]:
train_df = train_df.drop('Cabin', axis=1).dropna()
In [7]:
train_df.shape
Out[7]:
(712, 10)

1. Build a single figure showing scatter plots for each pair of the features Age, Fare, SibSp, Parch and Survived (use scatter_matrix from Pandas or pairplot from Seaborn).

In [8]:
# You code here

2. How does ticket price (Fare) depend on Pclass? Build a boxplot.

In [9]:
# You code here
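
A minimal sketch of one way to draw this box plot with Seaborn follows. The function name plot_fare_by_class, the figure size and the title are assumptions made for illustration; only the column names Pclass and Fare come from the data.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns


def plot_fare_by_class(df: pd.DataFrame) -> plt.Axes:
    """Box plot of Fare for each passenger class (Pclass)."""
    _, ax = plt.subplots(figsize=(8, 5))
    # orient="v" makes Pclass the grouping axis even though it is numeric.
    sns.boxplot(x="Pclass", y="Fare", data=df, orient="v", ax=ax)
    ax.set_title("Fare by passenger class")
    return ax
```

In the notebook this would be called as plot_fare_by_class(train_df).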

3. Let's build the same plot, but restrict Fare to values below the 95% quantile of the original column (to drop the outliers that make the plot harder to read).

In [10]:
# You code here
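
The filtering step can be sketched as follows: compute the 95% quantile of Fare once, keep only the rows strictly below it, and redraw the box plot. The names drop_fare_outliers and plot_fare_by_class_trimmed are hypothetical helpers, and the default q=0.95 simply mirrors the task statement.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns


def drop_fare_outliers(df: pd.DataFrame, q: float = 0.95) -> pd.DataFrame:
    """Keep rows whose Fare is strictly below the q-quantile of Fare."""
    threshold = df["Fare"].quantile(q)
    return df[df["Fare"] < threshold]


def plot_fare_by_class_trimmed(df: pd.DataFrame, q: float = 0.95) -> plt.Axes:
    """Box plot of Fare by Pclass after removing the largest fares."""
    trimmed = drop_fare_outliers(df, q)
    _, ax = plt.subplots(figsize=(8, 5))
    sns.boxplot(x="Pclass", y="Fare", data=trimmed, orient="v", ax=ax)
    ax.set_title(f"Fare by passenger class (Fare below the {q:.0%} quantile)")
    return ax
```

In the notebook this would be called as plot_fare_by_class_trimmed(train_df). Note that the threshold is computed on the full Fare column, not separately per class.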

4. How does the percentage of surviving passengers depend on passengers' gender? Depict it with Seaborn.countplot, using the hue argument.

In [11]:
# You code here
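
A possible sketch for this task is given below. countplot shows counts rather than percentages, so the sketch also includes a small helper that computes the share of survivors per gender. Both function names are illustrative assumptions; the column names Sex and Survived come from the data.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns


def survival_rate_by_sex(df: pd.DataFrame) -> pd.Series:
    """Share of survivors per gender (mean of the 0/1 Survived column)."""
    return df.groupby("Sex")["Survived"].mean()


def plot_survival_by_sex(df: pd.DataFrame) -> plt.Axes:
    """Count of passengers per gender, split by survival via hue."""
    _, ax = plt.subplots(figsize=(6, 4))
    sns.countplot(x="Sex", hue="Survived", data=df, ax=ax)
    ax.set_title("Survival counts by gender")
    return ax
```

In the notebook this would be called as plot_survival_by_sex(train_df), with survival_rate_by_sex(train_df) giving the corresponding proportions.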

5. How does the distribution of ticket prices differ between those who survived and those who didn't? Depict it with Seaborn.boxplot.

In [12]:
# You code here
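
One way this comparison might be drawn is sketched below, together with a helper that reports the median fare per group as a numeric companion to the plot. The names plot_fare_by_survival and median_fare_by_survival are hypothetical.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns


def median_fare_by_survival(df: pd.DataFrame) -> pd.Series:
    """Median Fare for Survived == 0 and Survived == 1."""
    return df.groupby("Survived")["Fare"].median()


def plot_fare_by_survival(df: pd.DataFrame) -> plt.Axes:
    """Box plot of Fare for non-survivors (0) and survivors (1)."""
    _, ax = plt.subplots(figsize=(6, 4))
    sns.boxplot(x="Survived", y="Fare", data=df, orient="v", ax=ax)
    ax.set_title("Fare by survival outcome")
    return ax
```

In the notebook this would be called as plot_fare_by_survival(train_df). Because a few very large fares stretch the axis, the plot may be easier to read after the same quantile-based trimming used in task 3.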

6. How does survival depend on passengers' age? Verify (graphically) an assumption that youngsters (< 30 y.o.) survived more frequently than old people (> 55 y.o.).

In [13]:
# You code here
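
A sketch of one possible check follows: label each passenger as young (Age < 30) or old (Age > 55), discard everyone in between, then compare the two groups both as survival rates and as a count plot. The AgeGroup column, the group labels and all function names are assumptions introduced for illustration.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

YOUNG, OLD = "young (<30)", "old (>55)"


def add_age_group(df: pd.DataFrame) -> pd.DataFrame:
    """Return only young and old passengers, with an AgeGroup column."""
    group = pd.Series("other", index=df.index)
    group[df["Age"] < 30] = YOUNG
    group[df["Age"] > 55] = OLD
    out = df.copy()
    out["AgeGroup"] = group
    return out[out["AgeGroup"] != "other"]


def survival_rate_by_age_group(df: pd.DataFrame) -> pd.Series:
    """Share of survivors among young and among old passengers."""
    return add_age_group(df).groupby("AgeGroup")["Survived"].mean()


def plot_survival_by_age_group(df: pd.DataFrame) -> plt.Axes:
    """Survival counts for the two age groups, split via hue."""
    _, ax = plt.subplots(figsize=(6, 4))
    sns.countplot(x="AgeGroup", hue="Survived", data=add_age_group(df),
                  order=[YOUNG, OLD], ax=ax)
    ax.set_title("Survival counts: young vs. old passengers")
    return ax
```

In the notebook this would be called as plot_survival_by_age_group(train_df). Since the two groups differ in size, the rates from survival_rate_by_age_group(train_df) are the fairer basis for judging the assumption.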

Useful resources