Intro to Plotting¶

Sneak peek:¶

In [1]:

%matplotlib inline
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

pd.options.display.max_rows = 10
sns.set(style='ticks', context='talk')
plt.rcParams['figure.figsize'] = (12, 6)

In [ ]:

df = pd.read_csv('data/beer_subset.csv.gz', parse_dates=['time'], compression='gzip')
review_cols = [c for c in df.columns if c.startswith('review')]
df.head()

In [ ]:

fig, ax = plt.subplots(figsize=(5, 10))
sns.countplot(hue='kind', y='stars', data=(df[review_cols]
                                           .stack()
                                           .rename_axis(['record', 'kind'])
                                           .rename('stars')
                                           .reset_index()),
              ax=ax, order=np.arange(0, 5.5, .5))
sns.despine()

Matplotlib¶

Tons of features
"Low-level" library

Check out the tutorials

In [ ]:

from IPython import display
display.HTML('<iframe src="http://matplotlib.org/users/beginner.html" height=500 width=1024></iframe>')

In [ ]:

%matplotlib inline
import matplotlib.pyplot as plt

In [ ]:

A single series is interpreted as y values, so x is just the index...
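
A minimal sketch of this behavior, with made-up values: only one list is passed to plt.plot, so matplotlib fills in x as 0, 1, 2, 3.

```python
import matplotlib.pyplot as plt

# Only y values are given, so the x values default to the positions 0..3.
lines = plt.plot([1, 4, 9, 16])
plt.ylabel('some numbers')
```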

In [ ]:

For every x, y pair of arguments, there is an optional third argument which is the format string that indicates the color and line type of the plot.
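
As a small illustration (the sine and cosine data are just placeholders), two x, y pairs can each carry their own format string: 'r--' asks for a red dashed line and 'bo' for blue circle markers.

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 2 * np.pi, 50)

# Each (x, y) pair is followed by its own format string:
# 'r--' = red dashed line, 'bo' = blue circles with no connecting line.
lines = plt.plot(x, np.sin(x), 'r--', x, np.cos(x), 'bo')
```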

In [ ]:

To work on plots in more detail, it's useful to store the Axes object (usually named ax)
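
One common way to do this, sketched with placeholder data, is plt.subplots(), which returns the figure together with the object that holds the plot. Labels and titles are then set on that stored object.

```python
import matplotlib.pyplot as plt

# plt.subplots() returns the Figure and, by default, a single Axes.
fig, ax = plt.subplots()
ax.plot([1, 2, 3, 4], [1, 4, 9, 16])

# Later tweaks go through the stored object rather than the implicit "current" plot.
ax.set_xlabel('x')
ax.set_ylabel('x squared')
ax.set_title('Working with a stored Axes')
```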

In [ ]:

Lots of keyword properties...
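
A few of these keywords in one call, as a rough sketch. The data and the specific values are arbitrary; the point is only that line appearance can be controlled by name.

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 10, 30)
fig, ax = plt.subplots()

# Keyword arguments control color, line style, width, markers and transparency.
line, = ax.plot(x, np.sqrt(x),
                color='green', linestyle=':', linewidth=3,
                marker='s', markersize=6, alpha=0.7,
                label='sqrt(x)')
ax.legend(loc='upper left')
```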

In [ ]:

Overlaying plots¶
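
A minimal sketch of overlaying, assuming the goal is several lines on one set of axes: repeated plot calls on the same Axes draw on top of each other.

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 2 * np.pi, 100)
fig, ax = plt.subplots()

# Each call adds another line to the same Axes.
ax.plot(x, np.sin(x), label='sin')
ax.plot(x, np.cos(x), label='cos')
ax.legend()
```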

In [ ]:

Multiple plots¶
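
One possible way to get several plots in a single figure is plt.subplots with nrows and ncols; it then returns an array of Axes. The layout and data below are only an example.

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 2 * np.pi, 100)

# One row, two columns; sharey=True keeps the y-axis limits identical.
fig, axes = plt.subplots(nrows=1, ncols=2, sharey=True, figsize=(10, 4))
axes[0].plot(x, np.sin(x))
axes[0].set_title('sin')
axes[1].plot(x, np.cos(x))
axes[1].set_title('cos')
fig.tight_layout()
```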

In [ ]:

Types of axes¶
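
The original cell is not shown, so this sketch assumes "types of axes" means alternative scales and projections. It draws one plot with a logarithmic y-axis and one on polar axes.

```python
import numpy as np
import matplotlib.pyplot as plt

fig = plt.figure(figsize=(10, 4))

# Left: ordinary Cartesian axes with a logarithmic y scale.
x = np.linspace(0, 5, 50)
ax_log = fig.add_subplot(1, 2, 1)
ax_log.plot(x, np.exp(x))
ax_log.set_yscale('log')

# Right: polar axes, where the first argument is the angle in radians.
theta = np.linspace(0, 2 * np.pi, 100)
ax_polar = fig.add_subplot(1, 2, 2, projection='polar')
ax_polar.plot(theta, 1 + np.cos(theta))
```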

In [ ]:

The best way to learn is the gallery

In [ ]:

display.HTML('<iframe src="http://matplotlib.org/gallery.html" height=500 width=1024></iframe>')

A handful of examples¶

Scatter plots and "bubble charts"

In [ ]:

n = 20
x = np.random.normal(size=n)
y = np.random.normal(size=n)
c = np.random.uniform(size=n)
s = np.random.randint(100, size=n)
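
A sketch of how these arrays might be drawn as a bubble chart: c sets the color of each point and s its area. The data setup is repeated so the block runs on its own; the colormap and alpha are arbitrary choices.

```python
import numpy as np
import matplotlib.pyplot as plt

n = 20
x = np.random.normal(size=n)
y = np.random.normal(size=n)
c = np.random.uniform(size=n)
s = np.random.randint(100, size=n)

fig, ax = plt.subplots()
# c maps each point to a color, s gives the marker area in points squared.
points = ax.scatter(x, y, c=c, s=s, cmap='viridis', alpha=0.6)
fig.colorbar(points, ax=ax)
```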

In [ ]:

Bar charts¶

In [ ]:

people = ['Annie', 'Brian', 'Chelsea', 'Derek', 'Elise']
performance = 3 + 10 * np.random.rand(len(people))
error = np.random.rand(len(people))
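
One way these arrays could be plotted is a horizontal bar chart with error bars. The data setup is repeated so the block is self-contained; the axis label is a placeholder.

```python
import numpy as np
import matplotlib.pyplot as plt

people = ['Annie', 'Brian', 'Chelsea', 'Derek', 'Elise']
performance = 3 + 10 * np.random.rand(len(people))
error = np.random.rand(len(people))

fig, ax = plt.subplots()
y_pos = np.arange(len(people))

# One horizontal bar per person; xerr draws the error bars.
bars = ax.barh(y_pos, performance, xerr=error, align='center', alpha=0.5)
ax.set_yticks(y_pos)
ax.set_yticklabels(people)
ax.set_xlabel('Performance')
```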

In [ ]:

Plotting with Pandas¶

matplotlib is a relatively low-level plotting package. By design, it makes very few assumptions about what constitutes good layout, but it gives the user a lot of flexibility to completely customize the look of the output.

On the other hand, Pandas includes methods for DataFrame and Series objects that are relatively high-level, and that make reasonable assumptions about how the plot should look.

In [ ]:

normals = pd.Series(np.random.normal(size=10))
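
A minimal sketch of the high-level interface: calling plot on the Series draws a line chart against its index, with no explicit matplotlib calls.

```python
import numpy as np
import pandas as pd

normals = pd.Series(np.random.normal(size=10))

# Series.plot() draws the values against the index and returns the Axes.
ax = normals.plot()
```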

In [ ]:

Similarly, for a DataFrame:

In [ ]:

variables = pd.DataFrame({'normal': np.random.normal(size=100), 
                          'gamma': np.random.gamma(1, size=100), 
                          'poisson': np.random.poisson(size=100)})

In [ ]:

Pandas plotting methods return matplotlib Axes objects (an array of them when several subplots are drawn), so the result can be customized further:
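
For example (a sketch that rebuilds the variables frame so it runs alone; plotting the cumulative sum is just one choice), the returned object accepts the usual matplotlib calls:

```python
import numpy as np
import pandas as pd

variables = pd.DataFrame({'normal': np.random.normal(size=100),
                          'gamma': np.random.gamma(1, size=100),
                          'poisson': np.random.poisson(size=100)})

# One line per column; the return value is a regular matplotlib Axes.
ax = variables.cumsum().plot()
ax.set_ylabel('cumulative sum')
ax.set_title('Customized through the returned Axes')
```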

In [ ]:

As an illustration of the high-level nature of Pandas plots, we can split multiple series into subplots with a single argument for plot:
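
The argument in question is subplots=True. A sketch with the same kind of random data, rebuilt here so the block is self-contained:

```python
import numpy as np
import pandas as pd

variables = pd.DataFrame({'normal': np.random.normal(size=100),
                          'gamma': np.random.gamma(1, size=100),
                          'poisson': np.random.poisson(size=100)})

# subplots=True gives each column its own Axes, returned as an array.
axes = variables.cumsum().plot(subplots=True)
```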

In [ ]:

Or, we could use a secondary y-axis:

(Note that "friends don't let friends use two y-axes", but we're just showing some examples here...)
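
A sketch of the secondary axis, assuming the normal column is the one moved to the right-hand scale. The axis labels are placeholders.

```python
import numpy as np
import pandas as pd

variables = pd.DataFrame({'normal': np.random.normal(size=100),
                          'gamma': np.random.gamma(1, size=100),
                          'poisson': np.random.poisson(size=100)})

# Columns listed in secondary_y are drawn against a second y-axis on the right.
ax = variables.cumsum().plot(secondary_y=['normal'])
ax.set_ylabel('gamma, poisson')
ax.right_ax.set_ylabel('normal')
```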

In [ ]:

If we would like a little more control, we can use matplotlib's subplots function directly, and manually assign plots to its axes:
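
A sketch of the manual approach: create the grid with plt.subplots, then hand each Axes to a Pandas plot call through the ax argument. The figure size and titles are arbitrary.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

variables = pd.DataFrame({'normal': np.random.normal(size=100),
                          'gamma': np.random.gamma(1, size=100),
                          'poisson': np.random.poisson(size=100)})

fig, axes = plt.subplots(nrows=1, ncols=3, figsize=(12, 4))

# Assign each column to a specific Axes via the ax keyword.
for ax, column in zip(axes, variables.columns):
    variables[column].cumsum().plot(ax=ax)
    ax.set_title(column)
```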

In [ ]:

Bar plots¶

Bar plots are useful for displaying and comparing measurable quantities, such as counts or volumes. In Pandas, we just use the plot method with a kind='bar' argument.

For this series of examples, let's load up the Titanic dataset:

In [ ]:

titanic = pd.read_excel("data/titanic.xls", "titanic")
titanic.head()
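
The data file is not available here, so the sketch below uses a tiny hand-made stand-in. The column names pclass and survived are assumptions about the spreadsheet, and the values are invented. It counts survivors per passenger class and draws them with kind='bar'.

```python
import pandas as pd

# Hypothetical stand-in for the Titanic table (invented values).
passengers = pd.DataFrame({'pclass':   [1, 1, 2, 2, 3, 3, 3, 3],
                           'survived': [1, 1, 1, 0, 1, 0, 0, 0]})

# Number of survivors in each class, drawn as one bar per class.
survivors = passengers.groupby('pclass')['survived'].sum()
ax = survivors.plot(kind='bar')
ax.set_ylabel('survivors')
```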

In [ ]:

Or if we wanted to see survival rate instead:
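
Since survived is assumed to be a 0/1 column, the rate is simply the group mean. Again a sketch on an invented stand-in table, not the real dataset:

```python
import pandas as pd

# Hypothetical stand-in for the Titanic table (invented values).
passengers = pd.DataFrame({'pclass':   [1, 1, 2, 2, 3, 3, 3, 3],
                           'survived': [1, 1, 1, 0, 1, 0, 0, 0]})

# The mean of a 0/1 column is the fraction of ones, i.e. the survival rate.
rates = passengers.groupby('pclass')['survived'].mean()
ax = rates.plot(kind='bar')
ax.set_ylabel('survival rate')
```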

In [ ]:

Histograms¶

Frequently it is useful to look at the distribution of data before you analyze it. Histograms are a sort of bar graph that display the frequencies of binned data values; hence, the y-axis is always some measure of frequency, either raw counts or scaled proportions.

For instance, fare distributions aboard the Titanic:
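
A sketch with simulated fares standing in for the real column (the gamma parameters are arbitrary and the name fare is an assumption). Series.hist bins the values and plots the counts.

```python
import numpy as np
import pandas as pd

# Simulated, right-skewed "fares"; not the real Titanic data.
rng = np.random.default_rng(0)
fares = pd.Series(rng.gamma(2.0, 15.0, size=200), name='fare')

# 20 equal-width bins; the y-axis shows raw counts per bin.
ax = fares.hist(bins=20, grid=False)
ax.set_xlabel('fare')
ax.set_ylabel('count')
```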

In [ ]:

Boxplots¶

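A different way of visualizing the distribution of data is the boxplot, which displays common quantiles: the quartiles (the box and its median line) plus whiskers. Whisker conventions vary; some plots draw them at the lower and upper 5 percent values, while matplotlib's default extends them to the most extreme points within 1.5 times the interquartile range of the box.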
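
A sketch of a grouped boxplot on simulated data. The column names fare and pclass are assumptions, and the values are random draws rather than the real dataset.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Simulated fares for three passenger classes (not the real data).
rng = np.random.default_rng(0)
passengers = pd.DataFrame({
    'pclass': np.repeat([1, 2, 3], 30),
    'fare': np.concatenate([rng.gamma(4, 20, 30),
                            rng.gamma(3, 7, 30),
                            rng.gamma(2, 6, 30)]),
})

# One box per class, drawn on an Axes we created ourselves.
fig, ax = plt.subplots()
passengers.boxplot(column='fare', by='pclass', ax=ax, grid=False)
```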

In [ ]:

One way to add additional information to a boxplot is to overlay the actual data; this is generally most suitable with small- or moderate-sized data series.
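
One way to do the overlay, sketched on simulated data with assumed column names: draw the boxes with matplotlib, then plot each group's raw values at its box position with a little horizontal jitter so the points do not sit on one vertical line.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Simulated fares for three passenger classes (not the real data).
rng = np.random.default_rng(0)
passengers = pd.DataFrame({
    'pclass': np.repeat([1, 2, 3], 30),
    'fare': np.concatenate([rng.gamma(4, 20, 30),
                            rng.gamma(3, 7, 30),
                            rng.gamma(2, 6, 30)]),
})
groups = [grp['fare'].to_numpy() for _, grp in passengers.groupby('pclass')]

fig, ax = plt.subplots()
ax.boxplot(groups)  # boxes are placed at x = 1, 2, 3

# Overlay the raw values, jittered around each box position.
for position, values in enumerate(groups, start=1):
    x_jitter = rng.normal(position, 0.04, size=len(values))
    ax.plot(x_jitter, values, 'r.', alpha=0.4)
```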

In [ ]:

Scatter plots¶

In [ ]:

df.head()

In [ ]:

jittered_df = df[review_cols] + (np.random.rand(*df[review_cols].shape) - 0.5)
jittered_df.head()
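
Ratings sit on a half-point grid, so an unjittered scatter plot would stack many reviews on the same few points. The sketch below uses simulated ratings; the column names review_aroma and review_palate are assumptions about the beer data, and the jitter matches the uniform noise used above.

```python
import numpy as np
import pandas as pd

# Simulated half-point ratings from 1.0 to 5.0 (not the real reviews).
rng = np.random.default_rng(0)
ratings = pd.DataFrame({
    'review_aroma': rng.integers(2, 11, size=300) / 2,
    'review_palate': rng.integers(2, 11, size=300) / 2,
})

# Add uniform noise in [-0.5, 0.5) so overlapping points become visible.
jittered = ratings + (rng.random(ratings.shape) - 0.5)
ax = jittered.plot(kind='scatter', x='review_aroma', y='review_palate', alpha=0.4)
```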

Lots more info on Pandas plotting in the docs ¶

Seaborn ¶

High-level interface for matplotlib

In [ ]:

Seaborn's axes-level functions also return matplotlib Axes objects...

In [ ]:

ggplot ¶

Another high-level matplotlib library, but this time mimicking R's ggplot

In [ ]:

from ggplot import *
ggplot(diamonds, aes(x='carat', y='price', color='cut')) +\
    geom_point() +\
    scale_color_brewer(type='diverging', palette=4) +\
    xlab("Carats") + ylab("Price") + ggtitle("Diamonds")

In [ ]:

Bokeh ¶

In [2]:

from bokeh.io import push_notebook, show, output_notebook
from bokeh.layouts import row
from bokeh.plotting import figure
from bokeh.palettes import brewer
output_notebook()

N = 20
categories = ['y' + str(x) for x in range(10)]
data = {}
data['x'] = np.arange(N)
for cat in categories:
    data[cat] = np.random.randint(10, 100, size=N)

df = pd.DataFrame(data)
df = df.set_index(['x'])

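def stacked(df, categories):
    # Build one closed polygon per category: the previous running total
    # (reversed) joined to the new running total.
    areas = dict()
    last = np.zeros(len(df[categories[0]]))
    for cat in categories:
        top = last + df[cat]  # renamed from `next` to avoid shadowing the built-in
        areas[cat] = np.hstack((last[::-1], top))
        last = top
    return areas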

areas = stacked(df, categories)

colors = brewer["Spectral"][len(areas)]

x2 = np.hstack((data['x'][::-1], data['x']))

p = figure(x_range=(0, 19), y_range=(0, 800))
p.grid.minor_grid_line_color = '#eeeeee'

p.patches([x2] * len(areas), [areas[cat] for cat in categories],
          color=colors, alpha=0.8, line_color=None)

show(p, notebook_handle=True)
push_notebook()

So many plotting libraries!¶

In [ ]:

display.HTML('<iframe src="https://dansaber.wordpress.com/2016/10/02/a-dramatic-tour-through-pythons-data-visualization-landscape-including-ggplot-and-altair/" width=1024 height=500></iframe>')

Exercise 6 - "Choose your own adventure" workshop¶

- Grab the data of your choice
  - Can't think of anything? Try GHDx (the Global Health Data Exchange)
- Load it into a Pandas DataFrame
- Compute some summary statistics
- Create some cool plots

References¶

Slide materials inspired by and adapted from Chris Fonnesbeck and Tom Augspurger
