Dot plots

Key ideas: Dot plots, forest plots, graphics, World Bank data, gross domestic product

A dot plot (or forest plot) displays a small dataset as one labeled line per item, with the values marked as points along each line. It is useful when the identity of each point must be immediately apparent from the graph. Dot plots are especially useful for converting a table to a graph: most tables that appear in research documents can be usefully presented as dot plots.

In this notebook we make a dot plot of GDP per capita in various countries in two different years (2000 and 2012).

In [1]:
from statsmodels.graphics.dotplots import dot_plot
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

The data are available from the World Bank's website. Here we work with a slightly cleaned-up version of the data.

In [2]:
data = pd.read_csv("wb_gdp.csv")
print(data.head())
  Country name Country code  2012 population     Region  2000 GDP  2012 GDP
0  Afghanistan          AFG         29824536       Asia       NaN       687
1       Angola          AGO         20820525     Africa       656      5482
2    Argentina          ARG         41086927  S America      7701     11573
3    Australia          AUS         22683600    Oceania     21678     67556
4      Belgium          BEL         11142157     Europe     22697     43372

Plots of the entire dataset are large, so we first create a smaller dataset to use for some simpler demonstrations. At the bottom of this notebook we create a dot plot of the entire dataset.

In [3]:
# Copy the slice so that columns can be added to it later without touching data
small = data.iloc[0:10,:].copy()

First we create a simple dot plot of the 2000 GDP values:

In [4]:
fig = dot_plot(points=small["2000 GDP"])
plt.xlabel("GDP", size=16)
Out[4]:
<matplotlib.text.Text at 0x7f524005e490>
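The file wb_gdp.csv may not be at hand, so here is a minimal sketch of the same call on a few values typed in directly from the table printed above. It assumes statsmodels and Matplotlib are installed; the Agg backend and the variable names are choices made for this illustration only.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import pandas as pd
from statsmodels.graphics.dotplots import dot_plot

# 2000 GDP values for four countries, copied from the table above
gdp_2000 = pd.Series([656, 7701, 21678, 22697],
                     index=["Angola", "Argentina", "Australia", "Belgium"])

# One line per value; dot_plot returns the Matplotlib figure it drew on
fig = dot_plot(points=gdp_2000)
ax = fig.axes[0]
ax.set_xlabel("GDP", size=16)
```

Calling fig.savefig with a file name would write the plot to disk.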

We can also have the dot plot be drawn vertically.

In [5]:
fig = dot_plot(points=small["2000 GDP"], horizontal=False)
plt.ylabel("GDP", size=17)
Out[5]:
<matplotlib.text.Text at 0x7f5241596110>

The graph is much more informative if we label the lines. Note that Afghanistan does not have a 2000 GDP value so its line is blank (use the Pandas dropna method to plot only countries with data on all variables).

In [6]:
fig = dot_plot(points=small["2000 GDP"], lines=small["Country name"])
plt.xlabel("GDP", size=17)
Out[6]:
<matplotlib.text.Text at 0x7f523a5dbcd0>
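As a sketch of the dropna approach mentioned above, the following uses a small data frame typed in from the first five rows of the table (the name demo is ours, not part of the dataset). Passing subset restricts the check to the 2000 GDP column, so a country is dropped only when that particular value is missing.

```python
import numpy as np
import pandas as pd

# First five rows of the table printed earlier in this notebook
demo = pd.DataFrame({
    "Country name": ["Afghanistan", "Angola", "Argentina", "Australia", "Belgium"],
    "2000 GDP": [np.nan, 656, 7701, 21678, 22697],
    "2012 GDP": [687, 5482, 11573, 67556, 43372],
})

# Keep only the countries that have a 2000 GDP value
complete = demo.dropna(subset=["2000 GDP"])
print(complete["Country name"].tolist())
```

Afghanistan is removed, so a dot plot of complete would have no blank line.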

If we want to specify the order of the lines, we can use the line_order argument. Here we order the lines by population:

In [7]:
# Row positions sorted by decreasing population
ii = np.argsort(-small["2012 population"].values)
countries = [small["Country name"].iloc[i] for i in ii]
fig = dot_plot(points=small["2000 GDP"], lines=small["Country name"], line_order=countries)
plt.xlabel("GDP", size=17)
Out[7]:
<matplotlib.text.Text at 0x7f5238abd690>
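An equivalent way to build the list passed to line_order is to sort the data frame itself. The sketch below does this for five countries typed in from the table above; demo and order are illustrative names.

```python
import pandas as pd

# Country names and 2012 populations from the table printed earlier
demo = pd.DataFrame({
    "Country name": ["Afghanistan", "Angola", "Argentina", "Australia", "Belgium"],
    "2012 population": [29824536, 20820525, 41086927, 22683600, 11142157],
})

# Sort by decreasing population and keep the names in that order
order = demo.sort_values("2012 population", ascending=False)["Country name"].tolist()
print(order)
```

The resulting list starts with Argentina and ends with Belgium, and could be passed as line_order in the same way as countries.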

We can place labels on either or both sides of the lines using the split_names argument. This requires us to construct new line labels containing a delimiter (here the delimiter is ":"). The part of the label to the left of the delimiter appears on the left side of the dot plot, and the part to the right of the delimiter appears on the right side. Here we put the country name in the left margin and the population in the right margin.

In [8]:
# Build "name:population" labels, one per row
small["Country_pop"] = ["%s:%d" % (name, pop) for name, pop in zip(small["Country name"], small["2012 population"])]
fig = dot_plot(points=small["2000 GDP"], lines=small["Country_pop"], split_names=":")
plt.xlabel("GDP", size=17)
Out[8]:
<matplotlib.text.Text at 0x7f5238a79d10>
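The delimited labels can also be built with pandas string concatenation. This sketch uses values typed in from the table above and additionally checks that the delimiter does not already occur in a country name, since that would presumably make the split ambiguous (this check is our own precaution, not something dot_plot requires).

```python
import pandas as pd

demo = pd.DataFrame({
    "Country name": ["Afghanistan", "Angola", "Argentina"],
    "2012 population": [29824536, 20820525, 41086927],
})

delim = ":"
# The delimiter should not appear inside the names themselves
has_delim = demo["Country name"].str.contains(delim, regex=False).any()

# "name:population" labels, one per row
labels = demo["Country name"] + delim + demo["2012 population"].astype(str)

# What would appear in the left and right margins for the first line
left, right = labels.iloc[0].split(delim)
print(left, right)
```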

To create a dot plot with multiple points per line, we need to reorganize the dataset so that each value to be plotted is in its own row of the dataset.

In [9]:
data1 = pd.melt(data, value_vars=["2000 GDP", "2012 GDP"], id_vars=["Country name", "2012 population", "Region"], value_name="GDP",
                 var_name="Year")
data1["Year"] = [int(x.split()[0]) for x in data1["Year"]]
print(data1.head())

# Recreate the small dataset in a way that gives us all data from a limited
# set of countries
ii = data1["Country name"].isin(countries[0:10])
small1 = data1.loc[ii,:]
  Country name  2012 population     Region  Year    GDP
0  Afghanistan         29824536       Asia  2000    NaN
1       Angola         20820525     Africa  2000    656
2    Argentina         41086927  S America  2000   7701
3    Australia         22683600    Oceania  2000  21678
4      Belgium         11142157     Europe  2000  22697
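To see what the reshaping does, here is a sketch of the same melt on a two-country table typed in from the data above (wide and long_df are illustrative names). Each country contributes one row per year after melting.

```python
import numpy as np
import pandas as pd

# Wide format: one row per country, one column per year
wide = pd.DataFrame({
    "Country name": ["Afghanistan", "Angola"],
    "2000 GDP": [np.nan, 656],
    "2012 GDP": [687, 5482],
})

# Long format: one row per (country, year) pair
long_df = pd.melt(wide, id_vars=["Country name"],
                  value_vars=["2000 GDP", "2012 GDP"],
                  var_name="Year", value_name="GDP")

# "2000 GDP" -> 2000
long_df["Year"] = long_df["Year"].str.split().str[0].astype(int)
print(long_df)
```

The two rows of wide become four rows of long_df, with the 2000 values listed before the 2012 values.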

Now we can make a dot plot that shows the 2000 and 2012 GDP data, with the two points for one country appearing on the same line. The styles argument gives unique appearances to different variables plotted on the same line so they can be distinguished.

In [10]:
fig = dot_plot(points=small1["GDP"], lines=small1["Country name"], styles=small1["Year"])
plt.xlabel("GDP", size=16)
Out[10]:
<matplotlib.text.Text at 0x7f52388df450>

We can prevent overplotting by "stacking" the points, so that they are slightly offset within each line.

In [11]:
fig = dot_plot(points=small1["GDP"], lines=small1["Country name"], styles=small1["Year"], stacked=True)
plt.xlabel("GDP", size=16)
Out[11]:
<matplotlib.text.Text at 0x7f5238750410>

We can also shade alternating lines to further distinguish the countries.

In [12]:
fig = dot_plot(points=small1["GDP"], lines=small1["Country name"], styles=small1["Year"], striped=True)
plt.xlim(-1000, 80000)
plt.xlabel("GDP", size=16)
Out[12]:
<matplotlib.text.Text at 0x7f523872e6d0>

Next we add a legend to clarify which point corresponds to which variable. This requires us to adjust the axes object so that there is room for the legend outside the dot plot.

In [13]:
fig = plt.figure(figsize=(8,5))
# Leave room to the right of the axes for the legend
ax = fig.add_axes([0.1, 0.1, 0.77, 0.8])
fig = dot_plot(points=small1["GDP"], lines=small1["Country name"], styles=small1["Year"], stacked=True, striped=True, ax=ax)
ha, la = ax.get_legend_handles_labels()
# Recent Matplotlib versions require loc to be passed by keyword
leg = plt.figlegend(ha, la, loc="center right", numpoints=1, handletextpad=0.001)
leg.draw_frame(False)
plt.xlabel("GDP", size=16)
plt.xlim(-1000, 80000)
Out[13]:
(-1000, 80000)

We can customize the appearances of the points in the plot using the marker_props keyword argument. It is a map from the values in the styles variable to dictionaries of Matplotlib point properties.

In [14]:
fig = plt.figure(figsize=(8,5))
ax = fig.add_axes([0.1, 0.1, 0.77, 0.8])
# One dictionary of point properties per value of the styles variable
marker_props = {2000: {"ms": 8, "alpha": 0.5}, 2012: {"ms": 8, "alpha": 0.5}}
fig = dot_plot(points=small1["GDP"], lines=small1["Country name"], styles=small1["Year"], striped=True, ax=ax,
              marker_props=marker_props)
ha, la = ax.get_legend_handles_labels()
leg = plt.figlegend(ha, la, loc="center right", numpoints=1, handletextpad=0.001)
plt.xlabel("GDP", size=17)
leg.draw_frame(False)
plt.xlim(-1000, 80000)
Out[14]:
(-1000, 80000)
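When the styles variable has many distinct values, the marker_props dictionary can be built from the data rather than typed by hand. The following is a sketch only: years stands in for small1["Year"], and base holds the same two properties used above.

```python
import pandas as pd

# Stand-in for small1["Year"]
years = pd.Series([2000, 2000, 2012, 2012])

# Properties shared by every style
base = {"ms": 8, "alpha": 0.5}

# One independent copy of the properties per distinct style value
marker_props = {int(year): dict(base) for year in years.unique()}
print(marker_props)
```

Because each entry is a separate copy, one year's properties can be changed afterwards without affecting the other.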

Now we return to the full dataset. Since there are so many countries, we need to make the plot much taller. We can also break it into sections by continent, by passing the Region variable as the sections argument.

In [15]:
fig = plt.figure(figsize=(8,40))
ax = fig.add_axes([0.1, 0.1, 0.77, 0.8])
marker_props = {2000: {"ms": 8, "alpha": 0.5}, 2012: {"ms": 8, "alpha": 0.5}}
# sections groups the lines by region
fig = dot_plot(points=data1["GDP"], lines=data1["Country name"], styles=data1["Year"], striped=True, ax=ax,
              marker_props=marker_props, sections=data1["Region"])
ha, la = ax.get_legend_handles_labels()
leg = plt.figlegend(ha, la, loc="center right", numpoints=1, handletextpad=0.001)
leg.draw_frame(False)
plt.xlabel("GDP", size=17)
plt.xlim(-1000, 80000)
Out[15]:
(-1000, 80000)
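Rather than hard-coding the figure height, it can be scaled with the number of lines. The small plots above used a height of 5 inches for 10 countries, which suggests roughly 0.5 inches per line. The helper below is a hypothetical rule of thumb based on that ratio, not a statsmodels feature.

```python
def figure_height(n_lines, inches_per_line=0.5, minimum=5.0):
    """Suggest a figure height in inches for a dot plot with n_lines lines.

    inches_per_line is an assumed rule of thumb taken from the 8x5 figure
    used for 10 countries; adjust it to taste.
    """
    return max(minimum, n_lines * inches_per_line)

print(figure_height(10))
print(figure_height(80))
```

The result could be used as the second element of figsize, for example with n_lines set to the number of distinct country names. Section titles take additional vertical space, so some extra height may be needed when sections is used.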