import pandas as pd
import warnings
warnings.filterwarnings("ignore")
titanic = pd.read_csv("train.csv")
titanic.head()
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
titanic.columns
Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'], dtype='object')
# Drop the Name and Ticket columns
titanic = titanic.drop(columns=["Name", "Ticket"])
titanic.columns
Index(['PassengerId', 'Survived', 'Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Cabin', 'Embarked'], dtype='object')
# Count the missing values in each column
titanic.isnull().sum()
PassengerId      0
Survived         0
Pclass           0
Sex              0
Age            177
SibSp            0
Parch            0
Fare             0
Cabin          687
Embarked         2
dtype: int64
# Drop every row that contains at least one missing value
titanic = titanic.dropna()
titanic.isnull().sum()
PassengerId    0
Survived       0
Pclass         0
Sex            0
Age            0
SibSp          0
Parch          0
Fare           0
Cabin          0
Embarked       0
dtype: int64
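Because dropna() with no arguments discards a row as soon as any of its columns is missing, a sparsely filled column such as Cabin can remove most of the data. The sketch below uses a small made-up frame (not the Titanic data) to compare that default with the subset parameter, which limits the check to the columns we name.

```python
import pandas as pd

# Hypothetical toy data, invented for illustration only.
toy = pd.DataFrame({
    "Age": [22.0, None, 26.0, 35.0],
    "Cabin": [None, "C85", None, "C123"],
    "Fare": [7.25, 71.28, 7.93, 53.10],
})

# Default behaviour: a row is dropped if ANY column is missing.
all_complete = toy.dropna()

# Only require Age to be present; missing Cabin values are tolerated.
age_complete = toy.dropna(subset=["Age"])

rows_default = len(all_complete)
rows_subset = len(age_complete)
print(rows_default, rows_subset)
```

Whether the stricter or the narrower rule is appropriate depends on which columns the later plots actually use.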
To get acquainted with seaborn, we'll start with a familiar plot: the histogram.
Under the hood, seaborn creates a histogram using matplotlib, scales the axis values, and styles it. In addition, seaborn uses a technique called kernel density estimation, or KDE for short, to create a smoothed line chart over the histogram. If you're interested in learning about how KDE works, you can read more on Wikipedia.
What you need to know for now is that the resulting line is a smoothed version of the histogram, called a kernel density plot. When viewing a histogram, our visual processing systems already tend to smooth the bars into a continuous line, and the kernel density plot draws that line explicitly. Kernel density plots are especially helpful when we're comparing distributions, which we'll explore later in this mission.
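For readers who want a feel for the idea without the theory, here is a minimal pure-Python sketch of a Gaussian kernel density estimate. It is not seaborn's implementation: the function name gaussian_kde, the sample ages, and the hand-picked bandwidth are all assumptions made for illustration (seaborn chooses a bandwidth automatically).

```python
import math

def gaussian_kde(samples, bandwidth):
    """Return a function that estimates the density of `samples` at a point x."""
    n = len(samples)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        # Add up one Gaussian bump per sample, each centred on that sample.
        return norm * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
        )

    return density

# Hypothetical ages, invented for illustration only.
ages = [22.0, 38.0, 26.0, 35.0, 35.0, 54.0, 2.0, 27.0]
density = gaussian_kde(ages, bandwidth=5.0)  # bandwidth picked by hand

# Evaluate the curve on a grid from -50 to 150 and sum the area under it.
step = 0.5
grid = [-50.0 + step * i for i in range(int(200 / step) + 1)]
area = sum(density(x) * step for x in grid)
print(round(area, 3))
```

A larger bandwidth widens each bump and gives a smoother curve; a smaller one follows the individual samples more closely.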
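We can generate a histogram of the Fare column with a kernel density curve drawn over it using the seaborn.histplot() function with kde set to True. Older tutorials use seaborn.distplot() for this, but that function has been deprecated since seaborn 0.11:
import seaborn as sns
import matplotlib.pyplot as plt
# Draw a histogram of the Fare column with a KDE curve on top.
# stat="density" scales the bars so the total area is 1, as distplot() did.
sns.histplot(titanic["Fare"], kde=True, stat="density")
plt.show()
# Do the same for the Age column
sns.histplot(titanic["Age"], kde=True, stat="density")
plt.show()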
While having both the histogram and the kernel density plot is useful when we want to explore the data, it can be overwhelming for someone who's trying to understand the distribution. To generate just the kernel density plot, we use the seaborn.kdeplot() function:
# Generate a kernel density plot
sns.kdeplot(titanic["Age"])
plt.show()
While the distribution of data is displayed in a smoother fashion, it's now more difficult to visually estimate the area under the curve using just the line chart. When we also had the histogram, the bars provided a way to understand and compare proportions visually.
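When an exact proportion matters more than a visual impression, the area under a Gaussian kernel density curve between two values can be computed directly, because each sample contributes one normal-shaped bump. The sketch below is an illustration under assumptions: the helper kde_proportion, the sample ages, and the bandwidth are made up, and seaborn itself does not expose this calculation.

```python
from statistics import NormalDist

def kde_proportion(samples, bandwidth, low, high):
    """Area under a Gaussian KDE of `samples` between `low` and `high`."""
    std_normal = NormalDist()
    # The part of each bump that falls inside [low, high] is a difference
    # of two standard normal CDF values.
    total = sum(
        std_normal.cdf((high - s) / bandwidth) - std_normal.cdf((low - s) / bandwidth)
        for s in samples
    )
    return total / len(samples)

# Hypothetical ages, invented for illustration only.
ages = [22.0, 38.0, 26.0, 35.0, 35.0, 54.0, 2.0, 27.0]

share_20_to_40 = kde_proportion(ages, bandwidth=5.0, low=20.0, high=40.0)
share_everything = kde_proportion(ages, bandwidth=5.0, low=-1000.0, high=1000.0)
print(round(share_20_to_40, 3), round(share_everything, 3))
```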
To bring back some of the ability to easily compare proportions, we can shade the area under the line using a single color. When calling the seaborn.kdeplot() function, we can shade the area under the line by setting the fill parameter to True (seaborn versions before 0.11 called this parameter shade, which is now deprecated).
sns.kdeplot(titanic["Age"], fill=True)
plt.xlabel("Age")
plt.show()
The default seaborn style sheet gets some things right, like hiding axis ticks, and some things wrong, like displaying the coordinate grid and keeping all of the axis spines. We can use the seaborn.set_style() function to change the seaborn style sheet. Seaborn comes with five style sheets: darkgrid, whitegrid, dark, white, and ticks.
The default seaborn theme uses the "darkgrid" style.
If we change the style sheet using this method, all future plots will match that style in your current session. This means you need to set the style before generating the plot.
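As a small sketch of how the style setting behaves, the snippet below switches styles and inspects whether the coordinate grid is enabled. It draws no plots. The variable names are only illustrative, and the with-block form is shown as an option for cases where a style should apply to a few plots rather than the whole session.

```python
import seaborn as sns

# Set the style for every plot created from here on.
sns.set_style("whitegrid")
grid_after_whitegrid = sns.axes_style()["axes.grid"]

sns.set_style("white")
grid_after_white = sns.axes_style()["axes.grid"]

# Apply a style only to plots created inside the with-block.
with sns.axes_style("darkgrid"):
    grid_inside_block = sns.axes_style()["axes.grid"]

# Outside the block, the "white" style set earlier is active again.
grid_after_block = sns.axes_style()["axes.grid"]

print(grid_after_whitegrid, grid_after_white, grid_inside_block, grid_after_block)
```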
To remove the axis spines for the top and right axes, we call the seaborn.despine() function after drawing the plot.
By default, only the top and right axes will be despined, or have their spines removed. To despine the other two axes, we need to set the left and bottom parameters to True.
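The difference between the default call and the fully despined version can be checked on two plain matplotlib axes. This is a sketch on made-up data points. The Agg backend line is only there so the code runs without a display and can be dropped in a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed in a notebook
import matplotlib.pyplot as plt
import seaborn as sns

fig, (ax_default, ax_all) = plt.subplots(1, 2)
for ax in (ax_default, ax_all):
    ax.plot([0, 1, 2], [0, 1, 4])  # arbitrary points, just to have a plot

# Default call: only the top and right spines are hidden.
sns.despine(ax=ax_default)

# Hide the left and bottom spines as well.
sns.despine(ax=ax_all, left=True, bottom=True)

sides = ("top", "right", "left", "bottom")
visible_default = {side: ax_default.spines[side].get_visible() for side in sides}
visible_all = {side: ax_all.spines[side].get_visible() for side in sides}
print(visible_default)
print(visible_all)
```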
# Use the "white" style sheet, which hides the coordinate grid and sets the background color to white.
sns.set_style("white")
sns.kdeplot(titanic["Age"], fill=True)
# Despine all of the axes.
sns.despine(left=True, bottom=True)
plt.xlabel("Age")
plt.show()
In seaborn, we can create a small multiple by specifying the conditioning criteria and the type of data visualization we want. For example, we can visualize the differences in age distributions between passengers who survived and those who didn't by creating a pair of kernel density plots. One kernel density plot would visualize the distribution of values in the "Age" column where Survived equalled 0 and the other would visualize the distribution of values in the "Age" column where Survived equalled 1.
Here's what those plots look like:
# Condition on unique values of the "Survived" column.
# (The height parameter was called size before seaborn 0.9.)
g = sns.FacetGrid(titanic, col="Survived", height=6)
# For each subset of values, generate a kernel density plot of the "Age" column.
g.map(sns.kdeplot, "Age", fill=True)
The function that's passed into FacetGrid.map() has to be a plotting function that draws onto the current axes, such as a matplotlib or seaborn function. For example, we can map matplotlib histograms to the grid:
g = sns.FacetGrid(titanic, col="Survived", height=6)
g.map(plt.hist, "Age")
# Condition on unique values of the "Pclass" column.
g = sns.FacetGrid(titanic, col="Pclass", height=6)
# For each subset of values, generate a kernel density plot of the "Age" column.
g.map(sns.kdeplot, "Age", fill=True)
# Remove all the spines
sns.despine(left=True, bottom=True)
plt.show()
When subsetting data using two conditions, the rows in the grid represent one condition while the columns represent another. We can express a third condition by drawing multiple plots on the same subplot in the grid and coloring them differently.
Thankfully, we can add such a condition just by setting the hue parameter to a column name from the dataframe.
Let's condition on the Survived, Pclass, and Sex columns together and see what this grid of plots looks like.
g = sns.FacetGrid(titanic, col="Survived", row="Pclass", hue="Sex", height=3)
g.map(sns.kdeplot, "Age", fill=True)
# Add a legend so the two hue colors can be told apart.
g.add_legend()
sns.despine(left=True, bottom=True)
plt.show()