Occupancy plots with Python and ggplot2 via rmagic

Background

In two previous posts (here and here), I showed how to do occupancy analysis with the beginnings of a Python based version of Hillmaker. The example is based on data from a hospital short stay unit.

In this short tutorial, I'll show how to use rmagic from within an IPython notebook so that we can make occupancy plots using ggplot2. In particular, we want to create a grid of occupancy histograms with one grid axis being patient type and the other axis being day of week. We want to make sure that the plots are ordered Sunday, Monday, ..., Saturday and NOT in alphabetical order (Friday, Monday, ..., Wednesday).

The first part of such an analysis leads to a table we call the "by datetime" table. At the end of Part 1 of this tutorial series, we ended up with a csv file called bydate_shortstay_csv.csv. Let's read it in and take a look at it.

You can find the data and the .ipynb file in my hselab-tutorials github repo. Clone or download a zip.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt


# Make the graphs a bit prettier, and bigger
pd.set_option('display.mpl_style', 'default') 
pd.set_option('display.line_width', 5000) 
pd.set_option('display.max_columns', 60) 
line_width has been deprecated, use display.width instead (currently both are
identical)

In [3]:
## Read sample data set and convert string dates to datetimes
bydatetime_df = pd.read_csv('data/bydate_shortstay_csv.csv',parse_dates=['datetime'])
In [3]:
bydatetime_df.head()
Out[3]:
category datetime arrivals binofday binofweek dayofweek departures occupancy
0 IVT 1996-01-07 00:00:00 0 0 288 6 0 0.0
1 IVT 1996-01-07 00:30:00 0 1 289 6 1 0.5
2 IVT 1996-01-07 01:00:00 0 2 290 6 0 0.0
3 IVT 1996-01-07 01:30:00 0 3 291 6 0 0.0
4 IVT 1996-01-07 02:00:00 0 4 292 6 0 0.0

5 rows × 8 columns

Let's say that we are interested in average occupancy by day of week (ignoring time of day) by patient type (category). Since each day consists of the same number of equally sized time bins, we can use the Pandas groupby function to easily compute mean occupancy by day of week.

In [4]:
meanocc_df = bydatetime_df.groupby(['category','dayofweek'])['occupancy'].mean()
In [5]:
meanocc_df
Out[5]:
category  dayofweek
ART       0             1.841435
          1             1.700810
          2             1.770544
          3             1.868866
          4             2.144907
          5             0.000000
          6             0.000000
CAT       0             2.562963
          1             2.055556
          2             2.181134
          3             2.136053
          4             2.540683
          5             0.620833
          6             0.278312
IVT       0             8.670891
          1             7.403183
          2             7.350810
          3             7.557870
          4             9.250231
          5             0.969792
          6             0.283494
MYE       0             2.056250
          1             1.873669
          2             1.841030
          3             1.826505
          4             2.022743
          5             0.147280
          6             0.091293
OTH       0             1.455093
          1             1.174363
          2             1.124248
          3             1.147454
          4             1.270544
          5             0.000000
          6             0.000000
Total     0            16.586632
          1            14.207581
          2            14.267766
          3            14.536748
          4            17.229109
          5             1.737905
          6             0.653098
Name: occupancy, dtype: float64

Even better than just the means would be occupancy histograms by patient type and day of week. That will require a little more work. On the way we'll see how to access individual levels of a hierarchical index, how to push data into an R workspace and use ggplot2 to make nice plots.

Let's start by computing average occupancy by category by date (i.e. roll up over the time bins).

In [6]:
meanoccdate = bydatetime_df.groupby(['category',bydatetime_df['datetime'].map(lambda x: x.date())])['occupancy'].mean()
In [7]:
meanoccdate.head()
Out[7]:
category  datetime  
ART       1996-01-07    0.000000
          1996-01-08    1.732639
          1996-01-09    1.532639
          1996-01-10    1.541667
          1996-01-11    1.872222
Name: occupancy, dtype: float64

Now we can use this to create a grid of histograms of occupancy by category and day of week. To make our life a little easier, let's add a string day of week column to this new dataframe (yes, the output of the groupby is a DataFrame or Series). This will require us to access the datetime level of the hierarchical index of our Series object, meanoccdate. What does the index look like for one specific row in meanoccdate?

In [8]:
meanoccdate.index[0]
Out[8]:
('ART', datetime.date(1996, 1, 7))

It's a tuple with the index values for each level of the hierarchy. We want the 2nd element in the tuple.

In [9]:
meanoccdate.index[0][1]
Out[9]:
datetime.date(1996, 1, 7)

Ok, we've got that part. Now, how to convert a Timestamp (equivalently, a datetime), to its day of week label? Reading the datetime docs leads to the strftime function for creating formatted string representations of datetimes.

In [10]:
ts = pd.Timestamp('6/26/2014')
In [11]:
ts.strftime('%a')
Out[11]:
'Thu'

Great. Now just put it all together.

In [12]:
meanoccdate_df = pd.DataFrame(meanoccdate)
In [13]:
meanoccdate_df['dayofweek'] = meanoccdate_df.index.map(lambda x: pd.Timestamp(x[1]).strftime('%a'))
In [14]:
meanoccdate_df.head()
Out[14]:
occupancy dayofweek
category datetime
ART 1996-01-07 0.000000 Sun
1996-01-08 1.732639 Mon
1996-01-09 1.532639 Tue
1996-01-10 1.541667 Wed
1996-01-11 1.872222 Thu

5 rows × 2 columns

In [15]:
meanoccdate_df.tail()
Out[15]:
occupancy dayofweek
category datetime
Total 1996-03-27 15.801389 Wed
1996-03-28 13.538194 Thu
1996-03-29 18.146528 Fri
1996-03-30 1.436111 Sat
1996-03-31 0.227083 Sun

5 rows × 2 columns

In [16]:
meanoccdate_df['occupancy']['Total']
Out[16]:
datetime
1996-01-07     0.871528
1996-01-08    17.759028
1996-01-09    15.146528
1996-01-10    14.095139
1996-01-11    16.148611
1996-01-12    16.236111
1996-01-13     1.586806
1996-01-14     0.725694
1996-01-15    18.181250
1996-01-16    16.000000
1996-01-17    14.304861
1996-01-18    16.816667
1996-01-19    16.011111
1996-01-20     2.631944
1996-01-21     0.682639
...
1996-03-17     0.753472
1996-03-18    17.765278
1996-03-19    11.013194
1996-03-20    12.238194
1996-03-21    13.309722
1996-03-22    15.611806
1996-03-23     0.915972
1996-03-24     0.570139
1996-03-25    15.807639
1996-03-26    14.565972
1996-03-27    15.801389
1996-03-28    13.538194
1996-03-29    18.146528
1996-03-30     1.436111
1996-03-31     0.227083
Name: occupancy, Length: 85, dtype: float64

Exploring rmagic - a prelude to graphing occupancy

The rmagic extension allows us to magically use R from within IPython. Well, it's not really magic. It relies on the RPy2 Python module.

http://nbviewer.ipython.org/github/ipython/ipython/blob/3607712653c66d63e0d7f13f073bde8c0f209ba8/docs/examples/notebooks/rmagic_extension.ipynb

After loading the rmagic extension, you can use %R for a single R line magic command or %%R for cell magic to handle a sequence of R commands.

In [17]:
# Load the extension
%load_ext rmagic
In [18]:
%matplotlib inline
In [19]:
# Create some data in Python and scatter it

import pylab
X = np.array([0,1,2,3,4])
Y = np.array([3,5,4,6,7])
pylab.scatter(X, Y)
Out[19]:
<matplotlib.collections.PathCollection at 0xba87cac>
/home/pcba/anaconda/lib/python2.7/site-packages/matplotlib/font_manager.py:1236: UserWarning: findfont: Font family ['monospace'] not found. Falling back to Bitstream Vera Sans
  (prop.get_family(), self.defaultFamily[fontext]))
In [20]:
# "Push" these two numpy arrays into the R "space"
%Rpush X Y
In [21]:
%%R
linmodel <- lm(Y~X)
print(summary(linmodel))
Call:
lm(formula = Y ~ X)

Residuals:
   1    2    3    4    5 
-0.2  0.9 -1.0  0.1  0.2 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)  
(Intercept)   3.2000     0.6164   5.191   0.0139 *
X             0.9000     0.2517   3.576   0.0374 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.7958 on 3 degrees of freedom
Multiple R-squared:   0.81,	Adjusted R-squared:  0.7467 
F-statistic: 12.79 on 1 and 3 DF,  p-value: 0.03739

So, pretty easy to run R commands from within IPython notebooks. Now let's exploit rmagic to do some ggplotting.

Occupancy Histograms

A popular plotting module for Python is matplotlib. The project homepage has many links to resources for learning it, with a very good place to start being the official documentation and the gallery of graphs. The are a few modes of using matplotlib. There is a pyplot mode which is particularly well suited for interactive plotting in a Python shell like IPython (much like one would work in Mathematica or MATLAB). You can also use matplotlib within Python scripts either with the pyplot commands or via an objected oriented API (similar to ggplot2 for plotting in R).

Here is a basic histogram for overall occupancy (there's a 'Total' cateogry in our data frame). For a more API based approach, see this example from the matplotlib page as well as the next version of our histogram below.

In [22]:
# normed=1 plots probs instead of counts, alpha in [0,1] is transparency level (RGBA colors)
plt.hist(meanoccdate_df['occupancy']['Total'], 20, normed=1, facecolor='green', alpha=0.75) 
plt.xlabel('Occupancy')
plt.ylabel('Probability')
plt.title(r'Histogram of Short Stay Occupancy')
plt.grid(True)
plt.show()

Clearly there's a day of the week effect - the lower mode is due to the weekends.

Now, let's create the same plot, but using R's ggplot2 package. Our basic strategy is:

  • use the %Rpush command to "push" the meanoccdate_df DataFrame into the R workspace.
  • use %R line magic to load the ggplot2 library
  • use %%R cell magic to create and show the plot

A big question is, what happens to the hiearchical index in the pandas DataFrame when pushed to R?

In [23]:
%Rpush meanoccdate_df
%R str(meanoccdate_df)
'data.frame':	510 obs. of  2 variables:
 $ occupancy: num [1:510(1d)] 0 1.73 1.53 1.54 1.87 ...
 $ dayofweek: Factor w/ 7 levels "Fri","Mon","Sat",..: 4 2 6 7 5 1 3 4 2 6 ...

Bummer, the category (patient type) and date fields that made up the pandas MultiIndex did not come through. No problem, we'll just create columns from the index before pushing the data frame to R.

In [24]:
meanoccdate_df['patient_type'] = meanoccdate_df.index.map(lambda x: x[0])
meanoccdate_df['date'] = meanoccdate_df.index.map(lambda x: x[1])
In [25]:
%Rpush meanoccdate_df
%R str(meanoccdate_df)
'data.frame':	510 obs. of  4 variables:
 $ occupancy   : num [1:510(1d)] 0 1.73 1.53 1.54 1.87 ...
 $ dayofweek   : Factor w/ 7 levels "Fri","Mon","Sat",..: 4 2 6 7 5 1 3 4 2 6 ...
 $ patient_type: Factor w/ 6 levels "ART","CAT","IVT",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ date        : Factor w/ 85 levels "1996-01-07","1996-01-08",..: 1 2 3 4 5 6 7 8 9 10 ...
In [26]:
%R library(ggplot2)
Out[26]:
array(['ggplot2', 'tools', 'stats', 'graphics', 'grDevices', 'utils',
       'datasets', 'methods', 'base'], 
      dtype='|S9')
In [27]:
%%R
g <- ggplot(data=meanoccdate_df[meanoccdate_df$patient_type == 'Total',]) + geom_histogram(aes(x=occupancy, y=..density..), fill="#FF9999", colour="black")
print(g)
stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

Now let's do a facet plot by patient_type and day of week.

In [28]:
%%R
g2 <- ggplot(data=meanoccdate_df) + geom_histogram(aes(x=occupancy, y=..density..), binwidth=2, fill="#FF9999", colour="black")
print(g2 + facet_grid(patient_type ~ dayofweek))

Notice that the days of the week are not sorted correctly. Why? Recall the structure of our R data.frame.

In [29]:
%R str(meanoccdate_df)
'data.frame':	510 obs. of  4 variables:
 $ occupancy   : num [1:510(1d)] 0 1.73 1.53 1.54 1.87 ...
 $ dayofweek   : Factor w/ 7 levels "Fri","Mon","Sat",..: 4 2 6 7 5 1 3 4 2 6 ...
 $ patient_type: Factor w/ 6 levels "ART","CAT","IVT",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ date        : Factor w/ 85 levels "1996-01-07","1996-01-08",..: 1 2 3 4 5 6 7 8 9 10 ...

The dayofweek column is a Factor. We want it to be an Ordered Factor and we want to tell R the order to use.

In [30]:
%%R
# Create vector with DOWs ordered as you wish
DOW_order <- c("Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat")

# Change DOW from factor to ordered factor using vector you just made
meanoccdate_df$dayofweek <- factor(meanoccdate_df$dayofweek,levels=DOW_order,ordered=TRUE)
In [31]:
%R str(meanoccdate_df)
'data.frame':	510 obs. of  4 variables:
 $ occupancy   : num [1:510(1d)] 0 1.73 1.53 1.54 1.87 ...
 $ dayofweek   : Ord.factor w/ 7 levels "Sun"<"Mon"<"Tue"<..: 1 2 3 4 5 6 7 1 2 3 ...
 $ patient_type: Factor w/ 6 levels "ART","CAT","IVT",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ date        : Factor w/ 85 levels "1996-01-07","1996-01-08",..: 1 2 3 4 5 6 7 8 9 10 ...
In [32]:
%%R
g2 <- ggplot(data=meanoccdate_df) + geom_histogram(aes(x=occupancy, y=..density..), binwidth=2, fill="#FF9999", colour="black")
print(g2 + facet_grid(patient_type ~ dayofweek))

And now we've got the DOW ordering we really wanted. This has been a quick tour of plotting occupancy data using ggplot2 from within an IPython notebook. Future installments of this series of occupancy analysis will focus on more occupancy visualizations using some combination of R and Python.