%matplotlib inline
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
pd.options.display.max_rows = 10
sns.set(style='ticks', context='talk')
plt.rcParams['figure.figsize'] = (12, 6)
df = pd.read_csv('data/beer_subset.csv.gz', parse_dates=['time'], compression='gzip')
review_cols = [c for c in df.columns if c[0:6] == 'review']
df.head()
fig, ax = plt.subplots(figsize=(5, 10))
sns.countplot(hue='kind', y='stars',
              data=(df[review_cols]
                    .stack()
                    .rename_axis(['record', 'kind'])
                    .rename('stars')
                    .reset_index()),
              ax=ax, order=np.arange(0, 5.5, .5))
sns.despine()
Check out the tutorials
from IPython import display
display.HTML('<iframe src="http://matplotlib.org/users/beginner.html" height=500 width=1024>')
A single series is interpreted as y values, so x is just the index...
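As a minimal sketch of this behaviour (the numbers are arbitrary), passing one list to plot treats it as the y values and uses the positions 0, 1, 2, ... as x:

```python
import matplotlib.pyplot as plt

# Only y values are passed, so matplotlib fills in x = 0, 1, 2, 3.
lines = plt.plot([1, 4, 9, 16])
plt.ylabel('some numbers')

# The x data that matplotlib generated for us
x_used = list(lines[0].get_xdata())
```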
For every x, y pair of arguments, there is an optional third argument which is the format string that indicates the color and line type of the plot.
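A small illustration with made-up data: two x, y pairs in one call, each followed by its own format string.

```python
import matplotlib.pyplot as plt

x = [1, 2, 3, 4]
# 'ro'  -> red circles with no connecting line
# 'b--' -> blue dashed line
markers, dashed = plt.plot(x, [1, 4, 9, 16], 'ro',
                           x, [1, 2, 3, 4], 'b--')
```

The format string packs color, marker and line style into one short code; the same settings can also be given as separate keyword arguments.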
To work on plots in more detail, it's useful to store the Axes object (often loosely called the "axis") in a variable
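One common way to do this, shown here as a sketch with an arbitrary sine curve, is to create the figure with plt.subplots and keep the returned objects:

```python
import matplotlib.pyplot as plt
import numpy as np

# plt.subplots returns the figure and the object we draw on
fig, ax = plt.subplots()

t = np.linspace(0, 2 * np.pi, 100)
ax.plot(t, np.sin(t), label='sin(t)')

# Labels, title and legend are all set through the stored object
ax.set_xlabel('t')
ax.set_ylabel('amplitude')
ax.set_title('Working through the stored object')
ax.legend()
```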
Lots of keyword properties...
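A few of the more common line properties, with values picked only for illustration:

```python
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
# Each keyword controls one aspect of how the line is drawn
(line,) = ax.plot([0, 1, 2, 3], [1, 3, 2, 4],
                  color='purple', linewidth=3, linestyle='--',
                  marker='o', markersize=10, alpha=0.7)
```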
The best way to learn is the gallery
display.HTML('<iframe src="http://matplotlib.org/gallery.html" height=500 width=1024>')
Scatter plots and "bubble charts"
n = 20
x = np.random.normal(size=n)
y = np.random.normal(size=n)
c = np.random.uniform(size=n)
s = np.random.randint(100, size=n)
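A sketch of how these arrays could be drawn as a bubble chart: c drives the color and s the marker area. The alpha value and the colorbar are choices made here, not something the data requires.

```python
import matplotlib.pyplot as plt
import numpy as np

n = 20
x = np.random.normal(size=n)
y = np.random.normal(size=n)
c = np.random.uniform(size=n)        # mapped to colors through the colormap
s = np.random.randint(100, size=n)   # marker areas, in points squared

fig, ax = plt.subplots()
points = ax.scatter(x, y, c=c, s=s, alpha=0.5)
fig.colorbar(points, ax=ax, label='c')
```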
people = ['Annie', 'Brian', 'Chelsea', 'Derek', 'Elise']
performance = 3 + 10 * np.random.rand(len(people))
error = np.random.rand(len(people))
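One plausible use of these three variables is a horizontal bar chart with error bars. The sketch below assumes that intent; y_pos is a helper name introduced here.

```python
import matplotlib.pyplot as plt
import numpy as np

people = ['Annie', 'Brian', 'Chelsea', 'Derek', 'Elise']
performance = 3 + 10 * np.random.rand(len(people))
error = np.random.rand(len(people))

y_pos = np.arange(len(people))  # one bar position per person (helper name)

fig, ax = plt.subplots()
bars = ax.barh(y_pos, performance, xerr=error, align='center', alpha=0.4)
ax.set_yticks(y_pos)
ax.set_yticklabels(people)
ax.set_xlabel('Performance')
```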
matplotlib is a relatively low-level plotting package compared to others. By design, it makes very few assumptions about what constitutes good layout, but it has a lot of flexibility, allowing the user to completely customize the look of the output.
On the other hand, Pandas includes methods for DataFrame and Series objects that are relatively high-level, and that make reasonable assumptions about how the plot should look.
normals = pd.Series(np.random.normal(size=10))
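For example, calling plot on the Series draws a line plot against its index:

```python
import numpy as np
import pandas as pd

normals = pd.Series(np.random.normal(size=10))

# Series.plot draws the values against the index
ax = normals.plot()
```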
Similarly, for a DataFrame:
variables = pd.DataFrame({'normal': np.random.normal(size=100),
'gamma': np.random.gamma(1, size=100),
'poisson': np.random.poisson(size=100)})
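Calling plot on the DataFrame draws one line per column and adds a legend:

```python
import numpy as np
import pandas as pd

variables = pd.DataFrame({'normal': np.random.normal(size=100),
                          'gamma': np.random.gamma(1, size=100),
                          'poisson': np.random.poisson(size=100)})

# One line per column, labelled with the column name
ax = variables.plot()
```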
All Pandas plotting commands return matplotlib Axes objects:
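A short check of this, and one way to keep customizing the plot through the returned object (the label and reference line are arbitrary additions):

```python
import matplotlib.axes
import numpy as np
import pandas as pd

ax = pd.Series(np.random.normal(size=10)).cumsum().plot()

# The return value is a regular matplotlib object...
returned_axes = isinstance(ax, matplotlib.axes.Axes)

# ...so the usual matplotlib methods work on it
ax.set_ylabel('cumulative sum')
ax.axhline(0, color='k', linewidth=1)
```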
As an illustration of the high-level nature of Pandas plots, we can split multiple series into subplots with a single argument for plot:
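The argument in question is subplots=True. A sketch with the same three random columns (the figsize is an arbitrary choice):

```python
import numpy as np
import pandas as pd

variables = pd.DataFrame({'normal': np.random.normal(size=100),
                          'gamma': np.random.gamma(1, size=100),
                          'poisson': np.random.poisson(size=100)})

# subplots=True gives every column its own panel;
# the return value is an array with one entry per panel
axes = variables.plot(subplots=True, figsize=(8, 8))
```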
Or, we could use a secondary y-axis:
(Note that "friends don't let friends use two y-axes", but we're just showing some examples here...)
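A sketch using the secondary_y argument, which takes the columns to draw against the right-hand axis. Plotting cumulative sums is a choice made here to get smoother lines.

```python
import numpy as np
import pandas as pd

variables = pd.DataFrame({'normal': np.random.normal(size=100),
                          'gamma': np.random.gamma(1, size=100),
                          'poisson': np.random.poisson(size=100)})

# 'normal' is drawn against the right-hand axis, the rest against the left
ax = variables.cumsum().plot(secondary_y=['normal'])
ax.set_ylabel('gamma / poisson')
ax.right_ax.set_ylabel('normal')
```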
If we would like a little more control, we can use matplotlib's subplots function directly, and manually assign plots to its axes:
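A minimal sketch of that approach, with a one-row, three-column layout chosen for illustration:

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

variables = pd.DataFrame({'normal': np.random.normal(size=100),
                          'gamma': np.random.gamma(1, size=100),
                          'poisson': np.random.poisson(size=100)})

fig, axes = plt.subplots(nrows=1, ncols=3, figsize=(12, 4))

# Hand each column its own panel through the ax argument
for ax, name in zip(axes, variables.columns):
    variables[name].cumsum().plot(ax=ax)
    ax.set_title(name)

axes[0].set_ylabel('cumulative sum')
```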
Bar plots are useful for displaying and comparing measurable quantities, such as counts or volumes. In Pandas, we just use the plot method with a kind='bar' argument.
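A tiny sketch with made-up counts (the labels and numbers are purely illustrative):

```python
import pandas as pd

# Hypothetical counts, only to show the call
counts = pd.Series([3, 7, 5], index=['a', 'b', 'c'])

# kind='bar' draws one vertical bar per index entry
ax = counts.plot(kind='bar')
ax.set_ylabel('count')
```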
For this series of examples, let's load up the Titanic dataset:
titanic = pd.read_excel("data/titanic.xls", "titanic")
titanic.head()
Or if we wanted to see survival rate instead:
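The sketch below does not read the Titanic file. It uses a tiny made-up table, and the column names pclass and survived are assumptions about how such data might be laid out. The mean of a 0/1 column within each group gives the rate for that group.

```python
import pandas as pd

# Made-up stand-in data, NOT the real Titanic values;
# the column names are assumptions for illustration
toy = pd.DataFrame({'pclass':   [1, 1, 2, 2, 3, 3, 3, 3],
                    'survived': [1, 1, 1, 0, 0, 0, 1, 0]})

# The mean of a 0/1 indicator within each class is the survival rate
rate = toy.groupby('pclass')['survived'].mean()

ax = rate.plot(kind='bar')
ax.set_ylabel('survival rate')
```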
Frequently it is useful to look at the distribution of data before you analyze it. Histograms are a sort of bar graph that displays relative frequencies of data values; hence, the y-axis is always some measure of frequency. This can either be raw counts of values or scaled proportions.
For instance, fare distributions aboard the Titanic:
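To show the two y-axis scalings side by side without depending on the data file, this sketch draws a synthetic, right-skewed sample as a stand-in for fares (the values are random, not real fares). The left panel shows raw counts and the right panel uses density=True so that the bar areas sum to one.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in for a fare column, NOT real data
fares = pd.Series(np.random.gamma(2, 15, size=500))

fig, (ax_counts, ax_density) = plt.subplots(1, 2, figsize=(12, 4))

# Raw counts per bin
fares.hist(ax=ax_counts, bins=30, grid=False)
ax_counts.set_ylabel('count')

# Scaled so the total area of the bars is 1
fares.hist(ax=ax_density, bins=30, grid=False, density=True)
ax_density.set_ylabel('density')
```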
A different way of visualizing the distribution of data is the boxplot, which is a display of common quantiles; these are typically the quartiles and the lower and upper 5 percent values.
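A sketch on random data. Note that matplotlib's default whiskers extend to 1.5 times the interquartile range rather than to fixed percentiles; passing whis=(5, 95) asks for whiskers at the 5th and 95th percentiles, which matches the description above.

```python
import matplotlib.pyplot as plt
import numpy as np

data = [np.random.normal(size=100), np.random.gamma(1, size=100)]

fig, ax = plt.subplots()
# whis=(5, 95) places the whiskers at the 5th and 95th percentiles
parts = ax.boxplot(data, whis=(5, 95))

# Boxes are drawn at x = 1, 2, ... by default
ax.set_xticks([1, 2])
ax.set_xticklabels(['normal', 'gamma'])
```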
One way to add additional information to a boxplot is to overlay the actual data; this is generally most suitable with small- or moderate-sized data series.
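A sketch of the overlay on random data: each group's points are drawn at the box position plus a little horizontal jitter so they do not pile up on one vertical line. The group means and the jitter width of 0.04 are arbitrary choices.

```python
import matplotlib.pyplot as plt
import numpy as np

# Three small random groups, only for illustration
groups = [np.random.normal(loc, 1, size=30) for loc in (0, 1, 2)]

fig, ax = plt.subplots()
ax.boxplot(groups)  # boxes sit at x = 1, 2, 3

overlays = []
for position, values in enumerate(groups, start=1):
    # Small horizontal jitter around the box position
    x_jitter = np.random.normal(position, 0.04, size=len(values))
    (dots,) = ax.plot(x_jitter, values, 'r.', alpha=0.4)
    overlays.append(dots)
```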
df.head()
jittered_df = df[review_cols] + (np.random.rand(*df[review_cols].shape) - 0.5)
jittered_df.head()
Seaborn also returns matplotlib axis objects...
from ggplot import *
ggplot(diamonds, aes(x='carat', y='price', color='cut')) +\
geom_point() +\
scale_color_brewer(type='diverging', palette=4) +\
xlab("Carats") + ylab("Price") + ggtitle("Diamonds")
from bokeh.io import push_notebook, show, output_notebook
from bokeh.layouts import row
from bokeh.plotting import figure
from bokeh.palettes import brewer
output_notebook()
N = 20
categories = ['y' + str(x) for x in range(10)]
data = {}
data['x'] = np.arange(N)
for cat in categories:
    data[cat] = np.random.randint(10, 100, size=N)
df = pd.DataFrame(data)
df = df.set_index(['x'])
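def stacked(df, categories):
    # Build one closed polygon per category: the previous running total
    # (reversed) followed by the new running total.
    areas = dict()
    last = np.zeros(len(df[categories[0]]))
    for cat in categories:
        upper = last + df[cat]
        areas[cat] = np.hstack((last[::-1], upper))
        last = upper
    return areas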
areas = stacked(df, categories)
colors = brewer["Spectral"][len(areas)]
x2 = np.hstack((data['x'][::-1], data['x']))
p = figure(x_range=(0, 19), y_range=(0, 800))
p.grid.minor_grid_line_color = '#eeeeee'
p.patches([x2] * len(areas), [areas[cat] for cat in categories],
color=colors, alpha=0.8, line_color=None)
show(p, notebook_handle=True)
push_notebook()
display.HTML('<iframe src="https://dansaber.wordpress.com/2016/10/02/a-dramatic-tour-through-pythons-data-visualization-landscape-including-ggplot-and-altair/" width=1024 height=500>')
Slide materials inspired by and adapted from Chris Fonnesbeck and Tom Augspurger