In [1]:
!date
import numpy as np, pandas as pd, matplotlib.pyplot as plt, mpld3, seaborn as sns
%matplotlib inline
Fri Dec 19 10:35:23 PST 2014

The Horizontal Bar Chart

A figure that I often want is the "horizontal bar chart", a way to represent one or two columns of a table visually for deep inspection. Each row of the table usually has a descriptive name, and the interesting comparisons run both ways: between rows within a column, and between columns within a row.

Here is how it might be done with mpld3, and how it might be done better:

In [3]:
# get some of my favorite data
df = pd.read_csv('http://ghdx.healthdata.org/sites/default/files/'
                 'record-attached-files/IHME_PHMRC_VA_DATA_ADULT_Y2013M09D11_0.csv',
                low_memory=False)
df.head()
Out[3]:
site module gs_code34 gs_text34 va34 gs_code46 gs_text46 va46 gs_code55 gs_text55 ... word_woman word_womb word_worri word_wors word_worsen word_worst word_wound word_xray word_yellow newid
0 Mexico Adult K71 Cirrhosis 6 K71 Cirrhosis 8 K71 Cirrhosis ... 0 0 0 0 0 0 0 0 0 1
1 AP Adult G40 Epilepsy 12 G40 Epilepsy 16 G40 Epilepsy ... 0 0 0 0 0 0 0 0 0 2
2 AP Adult J12 Pneumonia 26 J12 Pneumonia 37 J12 Pneumonia ... 0 0 0 0 0 0 0 0 0 3
3 Mexico Adult J33 COPD 8 J33 COPD 10 J33 COPD ... 0 0 0 0 0 0 0 0 0 4
4 UP Adult I21 Acute Myocardial Infarction 17 I21 Acute Myocardial Infarction 23 I21 Acute Myocardial Infarction ... 0 0 0 0 0 0 0 0 0 5

5 rows × 946 columns

In [27]:
# make a summary table that I would like to inspect
df['Field Site'] = df.site
df['Underlying Cause'] = df.gs_text34

g = df.groupby('Field Site')['Underlying Cause']

t = g.value_counts().unstack(0)
t = t.fillna(0)

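sites = t.columns.tolist()  # summarize over the site columns only, not the summary columns added below
t['Mean'] = t[sites].mean(axis=1)
t['Max'] = t[sites].max(axis=1)
t['Min'] = t[sites].min(axis=1)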

t
Out[27]:
Field Site AP Bohol Dar Mexico Pemba UP Mean Max Min
AIDS 136 0 203 120 0 43 83.666667 203 0
Acute Myocardial Infarction 101 116 3 76 0 104 66.666667 116 0
Asthma 21 6 12 1 6 1 7.833333 21 1
Bite of Venomous Animal 31 1 0 1 0 33 11.000000 33 0
Breast Cancer 3 28 90 69 0 5 32.500000 90 0
COPD 25 4 2 63 0 77 28.500000 77 0
Cervical Cancer 3 9 108 31 0 4 25.833333 108 0
Cirrhosis 51 39 27 133 0 63 52.166667 133 0
Colorectal Cancer 7 24 33 35 0 0 16.500000 35 0
Diabetes 88 77 59 105 38 47 69.000000 105 38
Diarrhea/Dysentery 77 27 37 18 21 48 38.000000 77 18
Drowning 32 4 9 0 28 33 17.666667 33 0
Epilepsy 30 5 6 2 0 5 8.000000 30 0
Esophageal Cancer 1 5 34 0 0 0 6.666667 34 0
Falls 32 38 11 36 26 30 28.833333 38 11
Fires 35 7 25 18 6 31 20.333333 35 6
Homicide 31 35 30 37 3 31 27.833333 37 3
Leukemia/Lymphomas 7 15 59 72 0 3 26.000000 72 0
Lung Cancer 5 23 11 59 0 8 17.666667 59 0
Malaria 29 0 34 0 2 35 16.666667 35 0
Maternal 71 41 135 39 46 136 78.000000 136 39
Other Cardiovascular Diseases 76 78 112 81 0 69 69.333333 112 0
Other Infectious Diseases 34 51 37 37 3 101 43.833333 101 3
Other Injuries 31 7 23 2 3 37 17.166667 37 2
Other Non-communicable Diseases 90 125 171 122 54 37 99.833333 171 37
Pneumonia 55 142 95 102 41 105 90.000000 142 41
Poisonings 30 3 8 9 1 35 14.333333 35 1
Prostate Cancer 1 4 32 11 0 0 8.000000 32 0
Renal Failure 102 68 49 92 0 105 69.333333 105 0
Road Traffic 39 55 31 34 11 32 33.666667 55 11
Stomach Cancer 5 5 21 31 0 0 10.333333 31 0
Stroke 125 179 103 122 0 101 105.000000 179 0
Suicide 49 16 13 11 2 33 20.666667 49 2
TB 101 22 103 17 6 27 46.000000 103 6
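
The `groupby` / `value_counts` / `unstack` chain above can be hard to picture, so here is a minimal sketch of the same pattern on a tiny table. The toy records are made up for illustration and are not from the PHMRC file. As far as I can tell, `pd.crosstab` produces the same cause-by-site counts in one call, which might be the simpler route.

```python
import pandas as pd

# Hypothetical toy records (not from the PHMRC file), one row per death
toy = pd.DataFrame({
    'Field Site': ['AP', 'AP', 'AP', 'Dar', 'Dar', 'Pemba'],
    'Underlying Cause': ['Stroke', 'Stroke', 'TB', 'TB', 'AIDS', 'Stroke'],
})

# Count causes within each site, then move the site level into the columns;
# site/cause pairs that never occur come out as NaN, so fill them with 0
counts = (toy.groupby('Field Site')['Underlying Cause']
             .value_counts()
             .unstack(0)
             .fillna(0))

# pd.crosstab builds the same cause-by-site table of counts in one call
xtab = pd.crosstab(toy['Underlying Cause'], toy['Field Site'])

print(counts)
print(xtab)
```

One difference to keep in mind: the `unstack` route goes through NaN and so ends up with floating-point counts, while `crosstab` keeps integers.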

The Comparison

For demonstration purposes, I plan to compare the distribution of data collected on Pemba Island to the average across all study sites. This is a health-metrics-specific example, but let's not get into the details.

In [28]:
t = t.sort_values('Mean')  # sort the table in a meaningful way (DataFrame.sort was removed from newer pandas)

fig = plt.figure(figsize=(12,16))  # make a nice, big figure for the plot

y = np.arange(len(t.index))  # select points on the y-axis for each bar

# do actual plotting
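# align='edge' makes each bar start at the given y value, so the pair sits around the tick at y+.5
plt.barh(y+.05, t.Pemba, height=.45, align='edge', color=sns.color_palette()[0], label='Pemba')
plt.barh( y+.5, t.Mean, height=.45, align='edge', xerr=[t.Mean - t.Min, t.Max - t.Mean],  # annoying format for error bars: [distance below, distance above]
         color=sns.color_palette()[1], ecolor='k',
         label='Cross-site Mean')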

plt.axis(xmin=0)  # silly error-bars go below zero, but don't show that

plt.legend(loc=(.5,.1))  # legend uses label values from calls to plt.barh
plt.yticks(y+.5, t.index)  # label each tick on the y-axis with the corresponding cause
plt.subplots_adjust(left=.5)  # make sure there is enough room to read the tick labels

plt.xlabel('Verbal Autopsies Collected')  # label the plot so that it is easy to remember what it is in the future
pass  # do something with no output to keep display clean
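
The error-bar format in the plotting cell deserves a closer look: `xerr` takes distances from the bar end, not absolute positions, with the lower distances in the first row and the upper distances in the second. Here is a small sketch on made-up numbers (the `mean`, `lo`, and `hi` arrays are hypothetical stand-ins for the Mean, Min, and Max columns) that checks the whiskers really end at the min and max.

```python
import matplotlib
matplotlib.use('Agg')  # draw off-screen so the sketch runs anywhere
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical summary values for three rows
mean = np.array([10., 20., 30.])
lo = np.array([4., 15., 30.])
hi = np.array([18., 22., 45.])

# Row 0: how far each whisker reaches below the bar end; row 1: how far above
xerr = np.vstack([mean - lo, hi - mean])

fig, ax = plt.subplots()
y = np.arange(len(mean))
bars = ax.barh(y, mean, height=.45, xerr=xerr, ecolor='k')

# Each whisker is stored as a line segment running from (lo, y) to (hi, y)
segments = bars.errorbar.lines[2][0].get_segments()
for seg, low, high in zip(segments, lo, hi):
    print(seg[0][0], low, seg[1][0], high)
```

Since both rows are distances, every entry of `xerr` should be non-negative; a negative value usually means the subtraction is the wrong way around.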
In [30]:
# make the figure interactive
mpld3.display(fig)
Out[30]:

Wishlist for the Interactive Version

  • A pan-and-zoom tool that maintains a minimum value of 0 on the x-axis
  • y-ticks that hide automatically when they are squeezed too close together
  • Automatic rescaling of the x-axis to maximize the dynamic range of the data
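
To make the first wish concrete, here is a rough sketch of the behavior in plain matplotlib, using the `xlim_changed` callback to snap the left limit back to 0 after any pan or zoom. The helper name `clamp_xmin_at_zero` is my own invention. This only covers matplotlib's own interactive backends; mpld3 handles pan and zoom in the browser, so I assume the real thing would need a JavaScript plugin instead.

```python
import matplotlib
matplotlib.use('Agg')  # draw off-screen so the sketch runs anywhere
import matplotlib.pyplot as plt


def clamp_xmin_at_zero(ax):
    """Hypothetical helper: keep the left x-limit of `ax` pinned at 0."""
    def on_xlim_changed(changed_ax):
        left, right = changed_ax.get_xlim()
        if left != 0:
            # This triggers the callback once more, but then left == 0 and it stops
            changed_ax.set_xlim(left=0)
    ax.callbacks.connect('xlim_changed', on_xlim_changed)


fig, ax = plt.subplots()
ax.barh([0, 1, 2], [3, 7, 5])
clamp_xmin_at_zero(ax)

ax.set_xlim(-5, 10)   # simulate a pan that drags the axis below zero
print(ax.get_xlim())  # the left limit is back at 0

ax.set_xlim(3, 8)     # simulate a zoom into the middle of the bars
print(ax.get_xlim())  # the left limit is back at 0 again
```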

None of this is that hard... what would you want?