Michaël Defferrard, PhD student, EPFL LTS2
Data visualization is a key aspect of exploratory data analysis. During this exercise we'll build increasingly complex visualizations by replicating plots. Try to reproduce not only the lines but also the axis labels, legends, and titles.
Data visualization is both an art and a science: it should combine aesthetic form with functionality.
To start slowly, let's make a static line plot from some time series. Reproduce the plots below using first matplotlib, then pandas.
Hint: to plot with pandas, you first need to create a DataFrame, pandas' tabular data format.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
# Random time series.
n = 1000
rs = np.random.RandomState(42)
data = rs.randn(n, 4).cumsum(axis=0)
plt.figure(figsize=(15,5))
plt.plot(data[:, 0], label='A')
plt.plot(data[:, 1], '.-k', label='B')
plt.plot(data[:, 2], '--m', label='C')
plt.plot(data[:, 3], ':', label='D')
plt.legend(loc='upper left')
plt.xticks(range(0, 1000, 50))
plt.ylabel('Value')
plt.xlabel('Day')
plt.grid()
idx = pd.date_range('1/1/2000', periods=n)
df = pd.DataFrame(data, index=idx, columns=list('ABCD'))
df.plot(figsize=(15,5));
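Since the goal is to reproduce the axis labels and titles as well, it may help to know that `DataFrame.plot()` returns the matplotlib Axes it drew on. The sketch below illustrates this on a shorter random walk; the name `small_df`, the title, and the line styles are illustrative choices, not part of the exercise.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Shorter random walk, built like the data above (illustrative only).
rs = np.random.RandomState(42)
values = rs.randn(100, 2).cumsum(axis=0)
idx = pd.date_range('2000-01-01', periods=100)
small_df = pd.DataFrame(values, index=idx, columns=['A', 'B'])

# DataFrame.plot() returns the matplotlib Axes it drew on, so labels,
# title and grid can be set with the usual Axes methods.
ax = small_df.plot(figsize=(15, 5), style=['-', '--'])
ax.set_xlabel('Day')
ax.set_ylabel('Value')
ax.set_title('Random walks')
ax.grid(True)
```

The same Axes methods apply to the full `df` from the exercise.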
Categorical data is best represented by bar or pie charts. Reproduce the plots below using the object-oriented API of matplotlib, which is recommended for programming.
Question: What are the pros and cons of each plot?
Tip: the matplotlib gallery is a convenient starting point.
data = [10, 40, 25, 15, 10]
categories = list('ABCDE')
fig, axes = plt.subplots(1, 2, figsize=(15, 5))
axes[1].pie(data, explode=[0,.1,0,0,0], labels=categories, autopct='%1.1f%%', startangle=90)
axes[1].axis('equal')
pos = range(len(data))
axes[0].bar(pos, data, align='center')
axes[0].set_xticks(pos)
axes[0].set_xticklabels(categories)
axes[0].set_xlabel('Category')
axes[0].set_title('Allotment');
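One way to approach the question: a bar chart encodes the absolute values, whereas a pie chart encodes each category's share of the total. As a minimal sketch, the percentages printed on the wedges by `autopct` can be computed by hand as follows (the variable `shares` is introduced here for illustration).

```python
import numpy as np

data = np.array([10, 40, 25, 15, 10])
categories = list('ABCDE')

# Share of each category in percent of the total: this is the quantity
# that autopct='%1.1f%%' formats on each pie wedge.
shares = 100 * data / data.sum()

for category, value, share in zip(categories, data, shares):
    print(f'{category}: value={value}, share={share:.1f}%')
```

In this dataset the values happen to sum to 100, so values and shares coincide; with other data the pie would hide the absolute magnitudes that the bar chart shows.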
A frequency plot is a graph that shows the pattern in a set of data by plotting how often particular values of a measure occur. It often takes the form of a histogram or a box plot.
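To make the two forms concrete, here is a small sketch of the numbers behind them, computed with NumPy on synthetic data (the data and variable names are assumptions for illustration, not the iris dataset): a histogram counts values per bin, while the box of a box plot is drawn from the quartiles.

```python
import numpy as np

rs = np.random.RandomState(42)
values = rs.randn(1000)  # synthetic measurements, for illustration only

# Histogram: count how many values fall into each of 10 equal-width bins.
counts, bin_edges = np.histogram(values, bins=10)

# Box plot: the box spans the first to the third quartile,
# with a line at the median.
q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1  # interquartile range, i.e. the height of the box

print(counts)
print(q1, median, q3, iqr)
```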
Reproduce the plots with seaborn, ggplot, and altair, three libraries which provide a high-level declarative syntax for statistical visualization as well as a convenient interface to pandas.
Hint: look at seaborn's distplot() and boxplot().
import seaborn as sns
import os
df = sns.load_dataset('iris', data_home=os.path.join('..', 'data'))
fig, axes = plt.subplots(1, 2, figsize=(15, 5))
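# Note: distplot() is deprecated since seaborn 0.11; histplot(..., kde=True) is its replacement.
g = sns.distplot(df['petal_width'], kde=True, rug=False, ax=axes[0])
g.set(title='Distribution of petal width')
# Pass x and y by keyword: recent seaborn versions no longer accept them positionally.
g = sns.boxplot(x='species', y='petal_width', data=df, ax=axes[1])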
g.set(title='Distribution of petal width by species');
import ggplot
ggplot.ggplot(df, ggplot.aes(x='petal_width', fill='species')) + \
ggplot.geom_histogram() + \
ggplot.ggtitle('Distribution of Petal Width by Species')
import altair
altair.Chart(df).mark_bar(opacity=.75).encode(
x=altair.X('petal_width', bin=altair.Bin(maxbins=30)),
y='count(*)',
color=altair.Color('species')
)
Scatter plots are widely used to assess the correlation between two variables. Pair plots extend this idea by displaying the pairwise relations between all variables in a dataset.
Use the seaborn pairplot() function to analyze how separable the iris dataset is.
sns.pairplot(df, hue="species");
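As a numerical complement to the visual impression given by scatter plots, the pairwise Pearson correlations can be computed with pandas. The sketch below uses a synthetic DataFrame named `toy` (an assumption for illustration, not the iris data) with one column that depends on `x` and one that does not.

```python
import numpy as np
import pandas as pd

rs = np.random.RandomState(42)
x = rs.randn(200)
toy = pd.DataFrame({
    'x': x,
    'y_linear': 2 * x + 0.5 * rs.randn(200),  # strongly related to x
    'y_noise': rs.randn(200),                 # unrelated to x
})

# Matrix of pairwise Pearson correlation coefficients, one row and one
# column per variable, mirroring the grid layout of a pair plot.
corr = toy.corr()
print(corr.round(2))
```

Calling `df.corr(numeric_only=True)` on the iris DataFrame should give the analogous matrix for the four measurements, assuming a pandas version that supports the `numeric_only` argument.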
Humans can only comprehend up to three spatial dimensions, plus a few more encoded as e.g. color or size, so dimensionality reduction is often needed to explore high-dimensional datasets. Analyze how separable the iris dataset is by visualizing it in a 2D scatter plot after reduction from 4 to 2 dimensions with two popular methods: PCA and t-SNE.
Hint: look at seaborn's swarmplot().
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
pca = PCA(n_components=2)
X = pca.fit_transform(df.values[:, :4])
df['pca1'] = X[:, 0]
df['pca2'] = X[:, 1]
tsne = TSNE(n_components=2)
X = tsne.fit_transform(df.values[:, :4])
df['tsne1'] = X[:, 0]
df['tsne2'] = X[:, 1]
fig, axes = plt.subplots(1, 2, figsize=(15, 5))
sns.swarmplot(x='pca1', y='pca2', data=df, hue='species', ax=axes[0])
sns.swarmplot(x='tsne1', y='tsne2', data=df, hue='species', ax=axes[1]);
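To give an idea of what the reduction from 4 to 2 dimensions does in the PCA case, here is a minimal NumPy sketch: center the data, take its singular value decomposition, and project onto the two leading right singular vectors. The data is a synthetic stand-in for four features, and this is an illustration of the principle rather than a description of scikit-learn's exact implementation (the result can differ by the sign of each component, for instance).

```python
import numpy as np

rs = np.random.RandomState(42)
# Synthetic stand-in for 4 features with very different variances.
X = rs.randn(150, 4) * [3.0, 2.0, 0.5, 0.1]

# 1. Center each feature.
Xc = X - X.mean(axis=0)

# 2. SVD of the centered data; the rows of Vt are the principal directions.
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

# 3. Project onto the first two principal directions.
X2 = Xc @ Vt[:2].T

# Fraction of the total variance carried by each component.
explained = s**2 / np.sum(s**2)
print(X2.shape, explained.round(3))
```

If the first two entries of `explained` sum to nearly 1, little information is lost by looking at the 2D projection only.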
For interactive visualization, look at bokeh (we used it during the data exploration exercise) or VisPy.