Dotplots

Key ideas: Dotplots. graphing, gas prices

This notebook shows how a dotplot (sometimes called a "forest plot") can be used to visualize mean gasoline prices and their variation in the US states, in three consecutive years. The data come from the United States Energy Information agency, and are available here.

We start with the usual import statements.

In [1]:
import pandas as pd
from statsmodels.graphics.dotplots import dotplot
import numpy as np
import urllib2
import StringIO

The data are in an Excel file. Pandas does not seem to be able to obtain an Excel file from a URL and process it directly into a dataframe, so we first read the entire Excel file into a string buffer, then let Pandas read from the string buffer. Alternatively, download the file using the link above, and let Pandas read the data from the local file.

In [2]:
data_url = "http://www.eia.gov/dnav/pet/xls/PET_PRI_ALLMG_A_EPM0_PTC_DPGAL_M.xls"
xld = urllib2.urlopen(data_url).read()
xlds = StringIO.StringIO(xld)
data = pd.read_excel(xlds, "Data 1", skiprows=2)

The column names are very long, so we process them to include only the state name, which appears at the beginning of all the relevant column names.

In [3]:
def tr(x): 
    xs = x.split() 
    # Two-word state names
    if xs[0].lower() in ("new", "north", "south", "rhode", "west"):
        return xs[0] + " " + xs[1]
    # One-word state names
    else: 
        return xs[0] 

data = data.rename(columns=tr) 

Here we remove the non-state columns and do some further cleanup of the data.

In [4]:
data = data.rename(columns=tr)
for nm in ("East", "Gulf", "Midwest", "Rocky", "Lower", "U.S.", 
            "Central", "New England", "District", "West Coast"): 
    del data[nm] 
        
state_names = list(data.columns) 
state_names.remove("Date")
state_names.sort()

Now we create a data frame containing the mean gas price and the minimum and maximum gas prices for each state in each year.

In [5]:
dfs = []
for year in 2008,2009,2010:
     
    ii = [i for i in data.index if str(year) in str(data.loc[i, "Date"])] 
    mn = data.loc[ii,state_names].mean(0)
    ucd = data.loc[ii,state_names].max(0) - mn 
    lcd = mn - data.loc[ii,state_names].min(0) 
 
    df1 = pd.DataFrame({"Gas": mn, "ival1": lcd, "ival2": ucd, 
                        "State name": mn.index}) 
    df1["Year"] = year 
    dfs.append(df1)
       
df = pd.concat(dfs, axis=0, ignore_index=True)
df.head()
Out[5]:
Gas State name ival1 ival2 Year
0 2.768417 Alabama 1.578417 0.758583 2008
1 3.357667 Alaska 1.096667 0.839333 2008
2 2.791500 Arizona 1.533500 0.845500 2008
3 2.642727 Arkansas 1.516727 0.840273 2008
4 2.899333 California 1.622333 0.968667 2008

5 rows × 5 columns

Here is the code we use to make a dotplot of these data. The intervals represent the range of prices within a year in a given state.

In [6]:
fig = plt.figure(figsize=(8,25))
ax = plt.axes([0.1, 0.1, 0.87, 0.88])
fig.add_axes(ax)
dotplot(points=df["Gas"], intervals=(df[["ival1", "ival2"]]), ax=ax, 
    lines=df["State name"], styles=df["Year"], stacked=True, striped=True)
handles, labels = ax.get_legend_handles_labels() 
leg = plt.figlegend(handles, labels, "upper center",
        ncol=3, numpoints=1, handletextpad=0.0001) 
leg.draw_frame(False)
ax.set_xlabel("Gasoline price (dollars/gallon)", labelpad=10, size=17) 
plt.xlim(0, 4)
Out[6]:
(0, 4)

Above the states are sorted alphabetically. It would also be interesting to sort them according to their mean gas price. To do this, we need to get the state names in the order that sorts by mean gas price.

In [7]:
gas = df[["Gas", "State name"]].groupby("State name")
gas = gas["Gas"].agg(np.mean)
ii = np.argsort(gas)
line_order = [gas.index[i] for i in ii]

Here is the dotplot sorted by mean gas price.

In [8]:
fig = plt.figure(figsize=(8,25))
ax = plt.axes([0.1, 0.1, 0.87, 0.88])
fig.add_axes(ax)
dotplot(points=df["Gas"], intervals=(df[["ival1", "ival2"]]), ax=ax, 
    lines=df["State name"], styles=df["Year"], stacked=True, striped=True,
    line_order=line_order)
handles, labels = ax.get_legend_handles_labels() 
leg = plt.figlegend(handles, labels, "upper center",
        ncol=3, numpoints=1, handletextpad=0.0001) 
leg.draw_frame(False)
ax.set_xlabel("Gasoline price (dollars/gallon)", labelpad=10, size=17) 
plt.xlim(0, 4)
Out[8]:
(0, 4)

We can make a few observations based on these graphs: in every state, gas prices were highest in 2008, lowest in 2009, and intermediate in 2010. There was very large within-state variation in gas prices in 2008, and fairly large variation in 2009, but much less within-state variation in 2010 (the within-state variation is shown by the grey "error bars"). The between-state variation within a year is reflected in the horizontal scatter of the points of a given color. This variation is not very large, with the exception of around five states at the high end of the range, particularly Alaska and Hawaii. The distribution is quite concentrated at the low end of the price range.