# Dotplots¶

Key ideas: Dotplots. graphing, gas prices

This notebook shows how a dotplot (sometimes called a "forest plot") can be used to visualize mean gasoline prices and their variation in the US states, in three consecutive years. The data come from the United States Energy Information agency, and are available here.

In [1]:
import pandas as pd
from statsmodels.graphics.dotplots import dotplot
import numpy as np
import urllib2
import StringIO


The data are in an Excel file. Pandas does not seem to be able to obtain an Excel file from a URL and process it directly into a dataframe, so we first read the entire Excel file into a string buffer, then let Pandas read from the string buffer. Alternatively, download the file using the link above, and let Pandas read the data from the local file.

In [2]:
data_url = "http://www.eia.gov/dnav/pet/xls/PET_PRI_ALLMG_A_EPM0_PTC_DPGAL_M.xls"
xlds = StringIO.StringIO(xld)
data = pd.read_excel(xlds, "Data 1", skiprows=2)


The column names are very long, so we process them to include only the state name, which appears at the beginning of all the relevant column names.

In [3]:
def tr(x):
xs = x.split()
# Two-word state names
if xs[0].lower() in ("new", "north", "south", "rhode", "west"):
return xs[0] + " " + xs[1]
# One-word state names
else:
return xs[0]

data = data.rename(columns=tr)


Here we remove the non-state columns and do some further cleanup of the data.

In [4]:
data = data.rename(columns=tr)
for nm in ("East", "Gulf", "Midwest", "Rocky", "Lower", "U.S.",
"Central", "New England", "District", "West Coast"):
del data[nm]

state_names = list(data.columns)
state_names.remove("Date")
state_names.sort()


Now we create a data frame containing the mean gas price and the minimum and maximum gas prices for each state in each year.

In [5]:
dfs = []
for year in 2008,2009,2010:

ii = [i for i in data.index if str(year) in str(data.loc[i, "Date"])]
mn = data.loc[ii,state_names].mean(0)
ucd = data.loc[ii,state_names].max(0) - mn
lcd = mn - data.loc[ii,state_names].min(0)

df1 = pd.DataFrame({"Gas": mn, "ival1": lcd, "ival2": ucd,
"State name": mn.index})
df1["Year"] = year
dfs.append(df1)

df = pd.concat(dfs, axis=0, ignore_index=True)

Out[5]:
Gas State name ival1 ival2 Year
0 2.768417 Alabama 1.578417 0.758583 2008
1 3.357667 Alaska 1.096667 0.839333 2008
2 2.791500 Arizona 1.533500 0.845500 2008
3 2.642727 Arkansas 1.516727 0.840273 2008
4 2.899333 California 1.622333 0.968667 2008

5 rows × 5 columns

Here is the code we use to make a dotplot of these data. The intervals represent the range of prices within a year in a given state.

In [6]:
fig = plt.figure(figsize=(8,25))
ax = plt.axes([0.1, 0.1, 0.87, 0.88])
dotplot(points=df["Gas"], intervals=(df[["ival1", "ival2"]]), ax=ax,
lines=df["State name"], styles=df["Year"], stacked=True, striped=True)
handles, labels = ax.get_legend_handles_labels()
leg = plt.figlegend(handles, labels, "upper center",
leg.draw_frame(False)
plt.xlim(0, 4)

Out[6]:
(0, 4)

Above the states are sorted alphabetically. It would also be interesting to sort them according to their mean gas price. To do this, we need to get the state names in the order that sorts by mean gas price.

In [7]:
gas = df[["Gas", "State name"]].groupby("State name")
gas = gas["Gas"].agg(np.mean)
ii = np.argsort(gas)
line_order = [gas.index[i] for i in ii]


Here is the dotplot sorted by mean gas price.

In [8]:
fig = plt.figure(figsize=(8,25))
ax = plt.axes([0.1, 0.1, 0.87, 0.88])
dotplot(points=df["Gas"], intervals=(df[["ival1", "ival2"]]), ax=ax,
lines=df["State name"], styles=df["Year"], stacked=True, striped=True,
line_order=line_order)
handles, labels = ax.get_legend_handles_labels()
leg = plt.figlegend(handles, labels, "upper center",

(0, 4)