Complete seaborn tutorial

pokemon

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
# from beakerx import *
warnings.filterwarnings('ignore')
%matplotlib inline
color = sns.color_palette()

The dataset used here is taken from Kaggle: https://www.kaggle.com/sekarmg/pokemon

In [2]:
data = pd.read_csv('data/pokemon/pokemon.csv')
In [3]:
data.head()
Out[3]:
   #           Name  Type 1  Type 2  HP  Attack  Defense  Sp. Atk  Sp. Def  Speed  Generation  Legendary
0  1      Bulbasaur   Grass  Poison  45      49       49       65       65     45           1      False
1  2        Ivysaur   Grass  Poison  60      62       63       80       80     60           1      False
2  3       Venusaur   Grass  Poison  80      82       83      100      100     80           1      False
3  4  Mega Venusaur   Grass  Poison  80     100      123      122      120     80           1      False
4  5     Charmander    Fire     NaN  39      52       43       60       50     65           1      False
In [4]:
data.columns
Out[4]:
Index(['#', 'Name', 'Type 1', 'Type 2', 'HP', 'Attack', 'Defense', 'Sp. Atk',
       'Sp. Def', 'Speed', 'Generation', 'Legendary'],
      dtype='object')
In [279]:
data.shape
Out[279]:
(800, 12)
In [190]:
data.isnull().sum()
Out[190]:
#               0
Name            1
Type 1          0
Type 2        386
HP              0
Attack          0
Defense         0
Sp. Atk         0
Sp. Def         0
Speed           0
Generation      0
Legendary       0
dtype: int64
In [191]:
data.nunique()
Out[191]:
#             800
Name          799
Type 1         18
Type 2         18
HP             94
Attack        111
Defense       103
Sp. Atk       105
Sp. Def        92
Speed         108
Generation      6
Legendary       2
dtype: int64
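
The null counts above show that Type 2 is missing for 386 of the 800 rows and Name for one. Below is a minimal sketch of one way to handle this before plotting, using a tiny made-up frame in place of the CSV. The "None" label and the decision to drop the unnamed row are assumptions, not something the dataset prescribes.

```python
import numpy as np
import pandas as pd

# Tiny illustrative frame (made-up rows) standing in for pokemon.csv.
sample = pd.DataFrame({
    "Name": ["Bulbasaur", "Charmander", None, "Squirtle"],
    "Type 1": ["Grass", "Fire", "Fire", "Water"],
    "Type 2": ["Poison", np.nan, np.nan, np.nan],
})

# Single-type Pokemon have no second type, so label them explicitly
# instead of leaving NaN (the "None" label is an arbitrary choice).
cleaned = sample.copy()
cleaned["Type 2"] = cleaned["Type 2"].fillna("None")

# Drop rows without a name; whether that is appropriate depends on the analysis.
cleaned = cleaned.dropna(subset=["Name"])

print(cleaned.isnull().sum())
```

With the real data, the same two calls would be applied to data instead of sample.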

lmplot() and regplot()

We'll look at the relationship between two continuous variables, Attack and Defense. We can also distinguish the points by a categorical variable, here whether the Pokemon is legendary or not. lmplot() accepts a hue argument for this; regplot() draws a single regression on one Axes and has no hue option.

In [413]:
# plt.figure(figsize=(14,6))
sns.set_style('whitegrid')
sns.lmplot(
    x="Attack",
    y="Defense",
    data=data,
    fit_reg=False,
    hue='Legendary',
    palette="Set1")
Out[413]:
<seaborn.axisgrid.FacetGrid at 0x1a704f7780>

We can clearly see that legendary Pokemon tend to have both high Attack and high Defense.
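
The visual impression can be checked numerically by comparing group averages. Here is a minimal sketch on made-up rows (the values are illustrative, not taken from the dataset); with the real data the same groupby call would be applied to data.

```python
import pandas as pd

# Made-up rows for illustration only; replace `sample` with `data` for the real CSV.
sample = pd.DataFrame({
    "Attack":    [49, 62, 82, 110, 150, 134],
    "Defense":   [49, 63, 83, 90, 140, 110],
    "Legendary": [False, False, False, True, True, True],
})

# Mean Attack and Defense per Legendary group (rows are sorted False, True).
group_means = sample.groupby("Legendary")[["Attack", "Defense"]].mean()
print(group_means)
```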

In [233]:
sns.set_style('darkgrid')  # changes the background of the plot
plt.figure(figsize=(14, 6))
sns.regplot(
    x="Attack", y="Defense", data=data,
    fit_reg=True)  # fit_reg=True fits and draws a regression line
Out[233]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a3e363828>

The relationship between Attack and Defense seems to be roughly linear, but there are a few outliers too.
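
With fit_reg=True, regplot() fits an ordinary least-squares line by default. The sketch below reproduces that kind of fit with NumPy on synthetic data (the generated values and the 0.5 slope are assumptions for illustration) and reports the Pearson correlation as a rough measure of how linear the relationship is.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for Attack and Defense (not the real data).
attack = rng.uniform(20, 180, size=200)
defense = 0.5 * attack + 30 + rng.normal(0, 15, size=200)

# Degree-1 least-squares fit: defense ~ slope * attack + intercept.
slope, intercept = np.polyfit(attack, defense, deg=1)

# Pearson correlation coefficient between the two variables.
r = np.corrcoef(attack, defense)[0, 1]

print(f"slope={slope:.2f}, intercept={intercept:.2f}, r={r:.2f}")
```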

We can make faceted plots that split the data into one panel per level of another categorical variable, Generation in this case.

In [252]:
sns.set_style('whitegrid')
# lmplot() creates its own figure, so a preceding plt.figure() call only adds an empty one.
# `height` is the facet height in inches (it was named `size` before seaborn 0.9).
sns.lmplot(
    x="Attack",
    y="Defense",
    data=data,
    fit_reg=False,
    hue='Legendary',
    col="Generation",
    aspect=0.4,
    height=10)
Out[252]:
<seaborn.axisgrid.FacetGrid at 0x1a44f4da90>
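
In a faceted lmplot(), col produces one panel per level of the variable, and each panel is height inches tall and height * aspect inches wide. Here is a small sketch on synthetic data (the column values are made up, and the height keyword assumes seaborn 0.9 or later).

```python
import numpy as np
import pandas as pd
import seaborn as sns

rng = np.random.default_rng(1)
n = 90

# Synthetic frame with the same column names as the tutorial's data.
sample = pd.DataFrame({
    "Attack": rng.uniform(20, 180, size=n),
    "Defense": rng.uniform(20, 180, size=n),
    "Generation": np.repeat([1, 2, 3], n // 3),
    "Legendary": rng.random(n) < 0.2,
})

grid = sns.lmplot(
    x="Attack", y="Defense", data=sample,
    fit_reg=False, hue="Legendary",
    col="Generation",   # one panel per generation
    height=4,           # panel height in inches
    aspect=0.8)         # panel width = height * aspect

# FacetGrid.axes is a 2-D array of matplotlib Axes: (rows, columns).
print(grid.axes.shape)
```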

We can also plot a continuous variable against a categorical column. Below we look at the relationship between Speed and Legendary status.

In [238]:
plt.figure(figsize=(14, 6))
sns.set_style('whitegrid')
sns.regplot(x="Legendary", y="Speed", data=data)
Out[238]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a3eb0e828>

One issue with this plot is that we cannot see the distribution of Speed within each Legendary group, because the points overlap along two vertical lines. This can be fixed with the x_jitter option.

In [344]:
plt.figure(figsize=(14, 6))
sns.set_style("ticks")
sns.regplot(x="Legendary", y="Speed", data=data, x_jitter=0.3)
Out[344]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a57737cf8>
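
Conceptually, x_jitter adds uniform random noise to the x positions for display only; the regression is still fitted on the original values. Here is a NumPy sketch of the idea (it mirrors the documented behaviour rather than seaborn's exact internals).

```python
import numpy as np

rng = np.random.default_rng(0)

# Legendary encoded as 0/1: every point sits on one of two vertical lines.
legendary = np.array([0, 0, 0, 1, 1, 0, 1, 0], dtype=float)

x_jitter = 0.3
# Shift each point by a random offset in [-x_jitter, x_jitter].
jittered = legendary + rng.uniform(-x_jitter, x_jitter, size=legendary.size)

print(np.round(jittered, 2))
```

Because the offsets never exceed 0.3, the two groups stay visually separate while individual points become distinguishable.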

We can also fit a logistic regression, which suits a binary outcome such as Legendary. This option requires statsmodels to be installed.

In [253]:
plt.figure(figsize=(14, 6))
sns.set_style("ticks")
sns.regplot(x="Attack", y="Legendary", data=data, logistic=True)
Out[253]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a46146080>
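
logistic=True models the probability that Legendary is true as a sigmoid function of Attack; seaborn delegates the fit to statsmodels. To show what is being estimated, here is a self-contained NumPy sketch that fits the same kind of model with Newton's method on synthetic data. The data-generating coefficients are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: higher Attack makes Legendary more likely.
attack = rng.uniform(20, 180, size=300)
true_p = 1 / (1 + np.exp(-(0.05 * attack - 6)))
legendary = (rng.random(300) < true_p).astype(float)

# Design matrix: intercept plus standardised Attack (for numerical stability).
mean, std = attack.mean(), attack.std()
X = np.column_stack([np.ones_like(attack), (attack - mean) / std])

beta = np.zeros(2)
for _ in range(25):
    p = 1 / (1 + np.exp(-X @ beta))        # current predicted probabilities
    gradient = X.T @ (legendary - p)       # gradient of the log-likelihood
    hessian = X.T @ (X * (p * (1 - p))[:, None])
    beta += np.linalg.solve(hessian, gradient)   # Newton update


def predict(attack_value):
    """Predicted probability of Legendary for a given Attack value."""
    z = beta[0] + beta[1] * (attack_value - mean) / std
    return 1 / (1 + np.exp(-z))


p_low, p_high = predict(50), predict(170)
print(f"P(legendary | Attack=50)={p_low:.3f}, P(legendary | Attack=170)={p_high:.3f}")
```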

distplot() and kdeplot()

This family of plots is useful for inspecting the distribution of a single variable.

In [347]:
plt.figure(figsize=(12, 6))
ax = sns.distplot(data['Attack'], kde=False)
ax.set_title('Attack')
Out[347]:
Text(0.5,1,'Attack')

The distribution of Attack seems to be close to normal.

The kde=True option overlays a kernel density estimate computed with a Gaussian kernel.

In [346]:
plt.figure(figsize=(12, 6))
ax = sns.distplot(
    data['Defense'], kde=True,
    norm_hist=False)  # norm_hist=True shows a density rather than a count (implied when kde=True)
ax.set_title('Defense')
plt.show()

Defense seems to have thinner tails, with values more concentrated around the mean.
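
A Gaussian kernel density estimate, such as the curve drawn by kde=True, places a small normal curve on every observation and averages them. The sketch below computes one by hand with NumPy on synthetic values. The bandwidth uses Scott's rule, which is an assumption about a reasonable default rather than a statement of what distplot() does internally.

```python
import numpy as np

rng = np.random.default_rng(0)
values = rng.normal(75, 30, size=500)   # synthetic stand-in for a stat column

# Scott's rule of thumb for the bandwidth.
bandwidth = values.std(ddof=1) * values.size ** (-1 / 5)

# Evaluation grid that extends a little beyond the data range.
grid = np.linspace(values.min() - 4 * bandwidth, values.max() + 4 * bandwidth, 400)

# Average of Gaussian bumps centred on each observation.
z = (grid[:, None] - values[None, :]) / bandwidth
density = np.exp(-0.5 * z ** 2).sum(axis=1) / (values.size * bandwidth * np.sqrt(2 * np.pi))

# A density should integrate to roughly 1 (simple Riemann sum).
dx = grid[1] - grid[0]
area = (density * dx).sum()
print(round(area, 3))
```

A smaller bandwidth gives a bumpier curve that follows individual points; a larger one gives a smoother curve that may hide real structure.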

In [348]:
plt.figure(figsize=(12, 6))
ax = sns.distplot(data['Speed'], rug=True)
ax.set_title('Speed')
plt.show()

We can also just use kdeplot() if we are only interested in the density function

In [349]:
plt.figure(figsize=(12, 6))
ax = sns.kdeplot(data['HP'], shade=True, color='g')
ax.set_title('HP')
plt.show()

Other ways to visualize distributions are stripplot() and boxplot().

In [260]:
plt.figure(figsize=(12, 6))
sns.stripplot(
    y='HP', data=data, jitter=0.1,
    color='g')  # jitter spreads the points sideways so they overlap less
Out[260]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a47b59080>
In [261]:
plt.figure(figsize=(12, 6))
sns.boxplot(y='Speed', data=data, width=.6)
Out[261]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a47f457f0>
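
A box plot summarises a column by its quartiles: the box spans the first to the third quartile, the line inside marks the median, and by default the whiskers reach the most extreme points within 1.5 times the interquartile range of the box. The numbers behind the picture can be computed directly (synthetic values again, not the real Speed column).

```python
import numpy as np

rng = np.random.default_rng(0)
speed = rng.normal(68, 29, size=500)   # synthetic stand-in for the Speed column

q1, median, q3 = np.percentile(speed, [25, 50, 75])
iqr = q3 - q1

# Points outside these fences are drawn individually as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = speed[(speed < lower_fence) | (speed > upper_fence)]

print(f"Q1={q1:.1f}, median={median:.1f}, Q3={q3:.1f}, outliers={outliers.size}")
```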

jointplot()

Another way to draw a scatter plot is jointplot(), which also shows the marginal distribution of each variable along the axes.

In [265]:
# jointplot() creates its own figure, so plt.figure() is not needed here.
sns.jointplot(x='HP', y='Speed', data=data)
Out[265]:
<seaborn.axisgrid.JointGrid at 0x1a48b015f8>

jointplot() can draw the joint distribution in other ways through the kind argument.

In [266]:
sns.jointplot(x='HP', y='Speed', data=data, kind='kde')
Out[266]:
<seaborn.axisgrid.JointGrid at 0x1a49132940>

In the above plot we can see 2 prominent regions of high density

In [267]:
sns.jointplot(x='HP', y='Speed', data=data, kind='hex')
Out[267]:
<seaborn.axisgrid.JointGrid at 0x1a48f857f0>

pairplot()

To see the relationships between all pairwise combinations of variables, we can use pairplot().

In [268]:
sns.pairplot(data)
Out[268]:
<seaborn.axisgrid.PairGrid at 0x1a4915ea58>