!date
import numpy as np, pandas as pd, matplotlib.pyplot as plt, mpld3, seaborn as sns
%matplotlib inline
Fri Dec 19 10:35:23 PST 2014
A figure that I often want is the "horizontal bar chart", a way to represent one or two columns of a table visually for deep inspection. Each row of the table usually has a descriptive name, and the comparisons that I think might be interesting run both ways: between rows within a column, and between columns within a row.
Here is how it might be done with mpld3, and how it might be done better:
# get some of my favorite data
df = pd.read_csv('http://ghdx.healthdata.org/sites/default/files/'
                 'record-attached-files/IHME_PHMRC_VA_DATA_ADULT_Y2013M09D11_0.csv',
                 low_memory=False)
df.head()
 | site | module | gs_code34 | gs_text34 | va34 | gs_code46 | gs_text46 | va46 | gs_code55 | gs_text55 | ... | word_woman | word_womb | word_worri | word_wors | word_worsen | word_worst | word_wound | word_xray | word_yellow | newid |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Mexico | Adult | K71 | Cirrhosis | 6 | K71 | Cirrhosis | 8 | K71 | Cirrhosis | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
1 | AP | Adult | G40 | Epilepsy | 12 | G40 | Epilepsy | 16 | G40 | Epilepsy | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 |
2 | AP | Adult | J12 | Pneumonia | 26 | J12 | Pneumonia | 37 | J12 | Pneumonia | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3 |
3 | Mexico | Adult | J33 | COPD | 8 | J33 | COPD | 10 | J33 | COPD | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 4 |
4 | UP | Adult | I21 | Acute Myocardial Infarction | 17 | I21 | Acute Myocardial Infarction | 23 | I21 | Acute Myocardial Infarction | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 5 |
5 rows × 946 columns
# make a summary table that I would like to inspect
df['Field Site'] = df.site
df['Underlying Cause'] = df.gs_text34
g = df.groupby('Field Site')['Underlying Cause']
t = g.value_counts().unstack(0)
t = t.fillna(0)
t['Mean'] = t.mean(axis=1)
t['Max'] = t.max(axis=1)
t['Min'] = t.min(axis=1)
t
Field Site | AP | Bohol | Dar | Mexico | Pemba | UP | Mean | Max | Min |
---|---|---|---|---|---|---|---|---|---|
AIDS | 136 | 0 | 203 | 120 | 0 | 43 | 83.666667 | 203 | 0 |
Acute Myocardial Infarction | 101 | 116 | 3 | 76 | 0 | 104 | 66.666667 | 116 | 0 |
Asthma | 21 | 6 | 12 | 1 | 6 | 1 | 7.833333 | 21 | 1 |
Bite of Venomous Animal | 31 | 1 | 0 | 1 | 0 | 33 | 11.000000 | 33 | 0 |
Breast Cancer | 3 | 28 | 90 | 69 | 0 | 5 | 32.500000 | 90 | 0 |
COPD | 25 | 4 | 2 | 63 | 0 | 77 | 28.500000 | 77 | 0 |
Cervical Cancer | 3 | 9 | 108 | 31 | 0 | 4 | 25.833333 | 108 | 0 |
Cirrhosis | 51 | 39 | 27 | 133 | 0 | 63 | 52.166667 | 133 | 0 |
Colorectal Cancer | 7 | 24 | 33 | 35 | 0 | 0 | 16.500000 | 35 | 0 |
Diabetes | 88 | 77 | 59 | 105 | 38 | 47 | 69.000000 | 105 | 38 |
Diarrhea/Dysentery | 77 | 27 | 37 | 18 | 21 | 48 | 38.000000 | 77 | 18 |
Drowning | 32 | 4 | 9 | 0 | 28 | 33 | 17.666667 | 33 | 0 |
Epilepsy | 30 | 5 | 6 | 2 | 0 | 5 | 8.000000 | 30 | 0 |
Esophageal Cancer | 1 | 5 | 34 | 0 | 0 | 0 | 6.666667 | 34 | 0 |
Falls | 32 | 38 | 11 | 36 | 26 | 30 | 28.833333 | 38 | 11 |
Fires | 35 | 7 | 25 | 18 | 6 | 31 | 20.333333 | 35 | 6 |
Homicide | 31 | 35 | 30 | 37 | 3 | 31 | 27.833333 | 37 | 3 |
Leukemia/Lymphomas | 7 | 15 | 59 | 72 | 0 | 3 | 26.000000 | 72 | 0 |
Lung Cancer | 5 | 23 | 11 | 59 | 0 | 8 | 17.666667 | 59 | 0 |
Malaria | 29 | 0 | 34 | 0 | 2 | 35 | 16.666667 | 35 | 0 |
Maternal | 71 | 41 | 135 | 39 | 46 | 136 | 78.000000 | 136 | 39 |
Other Cardiovascular Diseases | 76 | 78 | 112 | 81 | 0 | 69 | 69.333333 | 112 | 0 |
Other Infectious Diseases | 34 | 51 | 37 | 37 | 3 | 101 | 43.833333 | 101 | 3 |
Other Injuries | 31 | 7 | 23 | 2 | 3 | 37 | 17.166667 | 37 | 2 |
Other Non-communicable Diseases | 90 | 125 | 171 | 122 | 54 | 37 | 99.833333 | 171 | 37 |
Pneumonia | 55 | 142 | 95 | 102 | 41 | 105 | 90.000000 | 142 | 41 |
Poisonings | 30 | 3 | 8 | 9 | 1 | 35 | 14.333333 | 35 | 1 |
Prostate Cancer | 1 | 4 | 32 | 11 | 0 | 0 | 8.000000 | 32 | 0 |
Renal Failure | 102 | 68 | 49 | 92 | 0 | 105 | 69.333333 | 105 | 0 |
Road Traffic | 39 | 55 | 31 | 34 | 11 | 32 | 33.666667 | 55 | 11 |
Stomach Cancer | 5 | 5 | 21 | 31 | 0 | 0 | 10.333333 | 31 | 0 |
Stroke | 125 | 179 | 103 | 122 | 0 | 101 | 105.000000 | 179 | 0 |
Suicide | 49 | 16 | 13 | 11 | 2 | 33 | 20.666667 | 49 | 2 |
TB | 101 | 22 | 103 | 17 | 6 | 27 | 46.000000 | 103 | 6 |
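The `groupby` / `value_counts` / `unstack` chain that builds this table is easier to follow on a tiny example. Here is a minimal sketch on a made-up toy frame (the rows are invented for illustration and are not real counts; the names `toy`, `counts`, `table`, and `sites` are mine). It also computes the summary columns from the site columns only, which is a slightly more explicit variant of the code above.

```python
import pandas as pd

# toy stand-in for the verbal autopsy data (made-up rows, not real counts)
toy = pd.DataFrame({
    'Field Site': ['AP', 'AP', 'AP', 'Pemba', 'Pemba', 'UP'],
    'Underlying Cause': ['Stroke', 'Stroke', 'TB', 'TB', 'Falls', 'Stroke'],
})

# count causes within each site, then move the site level into the columns
counts = toy.groupby('Field Site')['Underlying Cause'].value_counts()
table = counts.unstack(0).fillna(0)  # site/cause pairs that never occur become 0

# summarize across the site columns only
sites = list(table.columns)
table['Mean'] = table[sites].mean(axis=1)
table['Max'] = table[sites].max(axis=1)
table['Min'] = table[sites].min(axis=1)
```

Each row of `table` is a cause and each of the first columns is a site, which is the same shape as the table above.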
For demonstration purposes, I plan to compare the distribution of data collected on Pemba Island to the average across all study sites. This is a health-metrics-specific example, but let's not get into the details.
t = t.sort_values('Mean') # sort the table in a meaningful way
fig = plt.figure(figsize=(12,16)) # make a nice, big figure for the plot
y = np.arange(len(t.index)) # select points on the y-axis for each bar
# do actual plotting
plt.barh(y+.05, t.Pemba, height=.45, align='edge', color=sns.color_palette()[0], label='Pemba')
plt.barh(y+.5, t.Mean, height=.45, align='edge',
         xerr=[t.Mean - t.Min, t.Max - t.Mean], # annoying format for error bars: [lower, upper] distances from the bar end
         color=sns.color_palette()[1], ecolor='k',
         label='Cross-site Mean')
plt.axis(xmin=0) # silly error-bars go below zero, but don't show that
plt.legend(loc=(.5,.1)) # legend uses label values from calls to plt.barh
plt.yticks(y+.5, t.index) # label each tick on the y-axis with the corresponding cause
plt.subplots_adjust(left=.5) # make sure there is enough room to read the tick labels
plt.xlabel('Verbal Autopsies Collected') # label the plot so that it is easy to remember what it is in the future
pass # do something with no output to keep display clean
# make the figure interactive
mpld3.display(fig)
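The `xerr` argument is the fiddly part of the plot above: for asymmetric error bars, matplotlib expects a pair of distances measured from the end of each bar (lower first, then upper), not the absolute endpoints. A minimal sketch with made-up numbers (the arrays `mean`, `low`, and `high` are invented for illustration):

```python
import numpy as np
import matplotlib.pyplot as plt

mean = np.array([4.0, 10.0, 7.0])  # made-up bar lengths
low = np.array([1.0, 6.0, 7.0])    # made-up minimums
high = np.array([9.0, 12.0, 8.0])  # made-up maximums

# row 0: how far each error bar extends below the bar end
# row 1: how far each error bar extends above the bar end
xerr = np.array([mean - low, high - mean])

fig, ax = plt.subplots()
y = np.arange(len(mean))
ax.barh(y, mean, height=.45, xerr=xerr, ecolor='k')
ax.set_xlim(left=0)  # keep the x-axis from starting below zero
```

Both rows of `xerr` must be non-negative, which holds whenever the minimum is at most the mean and the mean is at most the maximum.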
None of this is that hard... what would you want?