mlcourse.ai – Open Machine Learning Course

Author: Yury Kashnitsky. This material is subject to the terms and conditions of the Creative Commons CC BY-NC-SA 4.0 license. Free use is permitted for any non-commercial purpose.

Topic 2. Visual data analysis

Practice. Analyzing "Titanic" passengers. Solution

Fill in the missing code ("Your code here"). No need to select answers in a webform.

Kaggle competition: "Titanic: Machine Learning from Disaster".

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Apply seaborn's default plot styling
sns.set()
# SVG graphics are sharper and more legible
%config InlineBackend.figure_format = 'svg'

Read data

In [2]:
train_df = pd.read_csv("../../data/titanic_train.csv", 
                       index_col='PassengerId') 
In [3]:
train_df.head(2)
Out[3]:
             Survived  Pclass                                               Name     Sex   Age  SibSp  Parch     Ticket     Fare  Cabin  Embarked
PassengerId
1                   0       3                            Braund, Mr. Owen Harris    male  22.0      1      0  A/5 21171   7.2500    NaN         S
2                   1       1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1      0   PC 17599  71.2833    C85         C
In [4]:
train_df.describe(include='all')
Out[4]:
          Survived      Pclass                        Name   Sex         Age       SibSp       Parch  Ticket        Fare        Cabin  Embarked
count   891.000000  891.000000                         891   891  714.000000  891.000000  891.000000     891  891.000000          204       889
unique         NaN         NaN                         891     2         NaN         NaN         NaN     681         NaN          147         3
top            NaN         NaN  Aks, Mrs. Sam (Leah Rosen)  male         NaN         NaN         NaN  347082         NaN  C23 C25 C27         S
freq           NaN         NaN                           1   577         NaN         NaN         NaN       7         NaN            4       644
mean      0.383838    2.308642                         NaN   NaN   29.699118    0.523008    0.381594     NaN   32.204208          NaN       NaN
std       0.486592    0.836071                         NaN   NaN   14.526497    1.102743    0.806057     NaN   49.693429          NaN       NaN
min       0.000000    1.000000                         NaN   NaN    0.420000    0.000000    0.000000     NaN    0.000000          NaN       NaN
25%       0.000000    2.000000                         NaN   NaN   20.125000    0.000000    0.000000     NaN    7.910400          NaN       NaN
50%       0.000000    3.000000                         NaN   NaN   28.000000    0.000000    0.000000     NaN   14.454200          NaN       NaN
75%       1.000000    3.000000                         NaN   NaN   38.000000    1.000000    0.000000     NaN   31.000000          NaN       NaN
max       1.000000    3.000000                         NaN   NaN   80.000000    8.000000    6.000000     NaN  512.329200          NaN       NaN
In [5]:
train_df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 891 entries, 1 to 891
Data columns (total 11 columns):
Survived    891 non-null int64
Pclass      891 non-null int64
Name        891 non-null object
Sex         891 non-null object
Age         714 non-null float64
SibSp       891 non-null int64
Parch       891 non-null int64
Ticket      891 non-null object
Fare        891 non-null float64
Cabin       204 non-null object
Embarked    889 non-null object
dtypes: float64(2), int64(4), object(5)
memory usage: 83.5+ KB

Let's drop the Cabin column, and then drop all rows with missing values.

In [6]:
train_df = train_df.drop('Cabin', axis=1).dropna()
In [7]:
train_df.shape
Out[7]:
(712, 10)
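
As a quick sanity check on this kind of cleaning step, it can help to count the missing values per column before dropping anything. The sketch below is only an illustration: `toy_df` is a tiny hand-made DataFrame with hypothetical values (not the Titanic data), used so that the example runs without the CSV file. The same calls should apply to train_df.

```python
import numpy as np
import pandas as pd

# Toy stand-in for train_df: hypothetical values, NOT the real Titanic data
toy_df = pd.DataFrame({
    'Age': [22.0, np.nan, 26.0, 35.0],
    'Cabin': [np.nan, 'C85', np.nan, np.nan],
    'Embarked': ['S', 'C', np.nan, 'S'],
})

# Number of missing values in each column
missing_counts = toy_df.isna().sum()
print(missing_counts)

# Drop the sparsely filled column first, then any row that still has a gap
cleaned = toy_df.drop('Cabin', axis=1).dropna()
print(cleaned.shape)
```

The order matters: Cabin has only 204 non-null values in the training set, so calling dropna() before dropping that column would leave at most 204 rows.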

1. Plot pairwise scatter plots for the features Age, Fare, SibSp, Parch, and Survived (use scatter_matrix from pandas or pairplot from seaborn).

In [8]:
sns.pairplot(train_df[['Survived', 'Age', 'Fare', 'SibSp', 'Parch']]);
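
The task also mentions scatter_matrix from pandas as an alternative. A minimal sketch of that route is given below, assuming a recent pandas version where the function lives in `pandas.plotting`. The DataFrame `toy_df` holds randomly generated placeholder values (not the Titanic data) so that the example is self-contained; in the notebook, the same column selection of train_df used in the pairplot call could be passed instead.

```python
import numpy as np
import pandas as pd
from pandas.plotting import scatter_matrix

# Synthetic stand-in for a few numeric columns: random values, NOT real data
rng = np.random.default_rng(0)
toy_df = pd.DataFrame({
    'Survived': rng.integers(0, 2, size=50),
    'Age': rng.uniform(1, 80, size=50),
    'Fare': rng.exponential(30, size=50),
})

# Draws a grid of pairwise scatter plots with histograms on the diagonal
# and returns a 2-D array of matplotlib Axes, one per pair of columns
axes = scatter_matrix(toy_df, figsize=(8, 8), diagonal='hist')
print(axes.shape)
```

Both functions produce the same kind of grid; pairplot applies the seaborn styling set earlier, while scatter_matrix needs only pandas and matplotlib.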