#!/usr/bin/env python # coding: utf-8 # This notebook compares [pandas](http://pandas.pydata.org) # and [dplyr](http://cran.r-project.org/web/packages/dplyr/index.html). # The comparison is just on syntax (verbage), not performance. Whether you're an R user looking to switch to pandas (or the other way around), I hope this guide will help ease the transition. # # We'll work through the [introductory dplyr vignette](http://cran.r-project.org/web/packages/dplyr/vignettes/introduction.html) to analyze some flight data. # # I'm working on a better layout to show the two packages side by side. # But for now I'm just putting the ``dplyr`` code in a comment above each python call. # In[1]: # Some prep work to get the data from R and into pandas get_ipython().run_line_magic('matplotlib', 'inline') get_ipython().run_line_magic('load_ext', 'rpy2.ipython') import pandas as pd import seaborn as sns pd.set_option("display.max_rows", 5) # In[2]: # %%R # install.packages("nycflights13", repos='http://cran.us.r-project.org') # In[3]: # %%R # library(nycflights13) # write.csv(flights, "flights.csv") # # Data: nycflights13 # In[4]: flights = pd.read_csv("flights.csv", index_col=0) # In[5]: # dim(flights) <--- The R code flights.shape # <--- The python code # In[6]: # head(flights) flights.head() # # Single table verbs # ``dplyr`` has a small set of nicely defined verbs. I've listed their closest pandas verbs. # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # #

dplyr	pandas
`filter()` (and `slice()`)	`query()` (and `loc[]`, `iloc[]`)
`arrange()`	`sort_values` and `sort_index()`
`select()` (and `rename()`)	`__getitem__` (and `rename()`)
`distinct()`	`drop_duplicates()`
`mutate()` (and `transmute()`)	assign
summarise()	None
sample_n() and sample_frac()	`sample`
`%>%`	`pipe`

# # # Some of the "missing" verbs in pandas are because there are other, different ways of achieving the same goal. For example `summarise` is spread across `mean`, `std`, etc. It's closest analog is actually the `.agg` method on a `GroupBy` object, as it reduces a DataFrame to a single row (per group). This isn't quite what `.describe` does. # # I've also included the `pipe` operator from R (`%>%`), the `pipe` method from pandas, even though it isn't quite a verb. # # Filter rows with filter(), query() # In[7]: # filter(flights, month == 1, day == 1) flights.query("month == 1 & day == 1") # We see the first big *language* difference between R and python. # Many python programmers will shun the R code as too magical. # How is the programmer supposed to know that `month` and `day` are supposed to represent columns in the DataFrame? # On the other hand, to emulate this *very* convenient feature of R, python has to write the expression as a string, and evaluate the string in the context of the DataFrame. # The more verbose version: # In[8]: # flights[flights$month == 1 & flights$day == 1, ] flights[(flights.month == 1) & (flights.day == 1)] # In[9]: # slice(flights, 1:10) flights.iloc[:9] # # Arrange rows with arrange(), sort() # In[10]: # arrange(flights, year, month, day) flights.sort_values(['year', 'month', 'day']) # In[11]: # arrange(flights, desc(arr_delay)) flights.sort_values('arr_delay', ascending=False) # It's worth mentioning the other common sorting method for pandas DataFrames, `sort_index`. Pandas puts much more emphasis on indicies, (or row labels) than R. # This is a design decision that has positives and negatives, which we won't go into here. Suffice to say that when you need to sort a `DataFrame` by the index, use `DataFrame.sort_index`. # # Select columns with select(), [] # In[12]: # select(flights, year, month, day) flights[['year', 'month', 'day']] # In[13]: # select(flights, year:day) flights.loc[:, 'year':'day'] # In[14]: # select(flights, -(year:day)) # No direct equivalent here. I would typically use # flights.drop(cols_to_drop, axis=1) # or fligths[flights.columns.difference(pd.Index(cols_to_drop))] # point to dplyr! # In[15]: # select(flights, tail_num = tailnum) flights.rename(columns={'tailnum': 'tail_num'})['tail_num'] # But like Hadley mentions, not that useful since it only returns the one column. ``dplyr`` and ``pandas`` compare well here. # In[16]: # rename(flights, tail_num = tailnum) flights.rename(columns={'tailnum': 'tail_num'}) # Pandas is more verbose, but the the argument to `columns` can be any mapping. So it's often used with a function to perform a common task, say `df.rename(columns=lambda x: x.replace('-', '_'))` to replace any dashes with underscores. Also, ``rename`` (the pandas version) can be applied to the Index. # One more note on the differences here. # Pandas could easily include a `.select` method. # [`xray`](http://xray.readthedocs.org/en/stable/), a library that builds on top of NumPy and pandas to offer labeled N-dimensional arrays (along with many other things) does [just that](http://xray.readthedocs.org/en/stable/indexing.html#indexing-with-labeled-dimensions). # Pandas chooses the `.loc` and `.iloc` accessors because *any valid selection is also a valid assignment*. This makes it easier to modify the data. # # ```python # flights.loc[:, 'year':'day'] = data # ``` # # where `data` is an object that is, or can be broadcast to, the correct shape. # # Extract distinct (unique) rows # In[17]: # distinct(select(flights, tailnum)) flights.tailnum.unique() # FYI this returns a numpy array instead of a Series. # In[18]: # distinct(select(flights, origin, dest)) flights[['origin', 'dest']].drop_duplicates() # OK, so ``dplyr`` wins there from a consistency point of view. ``unique`` is only defined on Series, not DataFrames. # # Add new columns with mutate() # We at pandas shamelessly stole this for [v0.16.0](http://pandas.pydata.org/pandas-docs/stable/whatsnew.html#whatsnew-0160-enhancements-assign). # In[19]: # mutate(flights, # gain = arr_delay - dep_delay, # speed = distance / air_time * 60) flights.assign(gain=flights.arr_delay - flights.dep_delay, speed=flights.distance / flights.air_time * 60) # In[20]: # mutate(flights, # gain = arr_delay - dep_delay, # gain_per_hour = gain / (air_time / 60) # ) (flights.assign(gain=flights.arr_delay - flights.dep_delay) .assign(gain_per_hour = lambda df: df.gain / (df.air_time / 60))) # The first example is pretty much identical (aside from the names, `mutate` vs. `assign`). # # The second example just comes down to language differences. In `R`, it's possible to implement a function like `mutate` where you can refer to `gain` in the line calcuating `gain_per_hour`, even though `gain` hasn't actually been calcuated yet. # # In Python, you can have arbitrary keyword arguments to functions (which we needed for `.assign`), but the order of the argumnets is arbitrary since `dict`s are unsorted and `**kwargs*` is a `dict`. So you can't have something like `df.assign(x=df.a / df.b, y=x **2)`, because you don't know whether `x` or `y` will come first (you'd also get an error saying `x` is undefined. # # To work around that with pandas, you'll need to split up the assigns, and pass in a *callable* to the second assign. The callable looks at itself to find a column named `gain`. Since the line above returns a DataFrame with the `gain` column added, the pipeline goes through just fine. # In[21]: # transmute(flights, # gain = arr_delay - dep_delay, # gain_per_hour = gain / (air_time / 60) # ) (flights.assign(gain=flights.arr_delay - flights.dep_delay) .assign(gain_per_hour = lambda df: df.gain / (df.air_time / 60)) [['gain', 'gain_per_hour']]) # # Summarise values with summarise() # In[22]: # summarise(flights, # delay = mean(dep_delay, na.rm = TRUE)) flights.dep_delay.mean() # This is only roughly equivalent. # `summarise` takes a callable (e.g. `mean`, `sum`) and evaluates that on the DataFrame. In pandas these are spread across `pd.DataFrame.mean`, `pd.DataFrame.sum`. This will come up again when we look at `groupby`. # # Randomly sample rows with sample_n() and sample_frac() # In[23]: # sample_n(flights, 10) flights.sample(n=10) # In[24]: # sample_frac(flights, 0.01) flights.sample(frac=.01) # # Grouped operations # In[25]: # planes <- group_by(flights, tailnum) # delay <- summarise(planes, # count = n(), # dist = mean(distance, na.rm = TRUE), # delay = mean(arr_delay, na.rm = TRUE)) # delay <- filter(delay, count > 20, dist < 2000) planes = flights.groupby("tailnum") delay = (planes.agg({"year": "count", "distance": "mean", "arr_delay": "mean"}) .rename(columns={"distance": "dist", "arr_delay": "delay", "year": "count"}) .query("count > 20 & dist < 2000")) delay # For me, dplyr's ``n()`` looked is a bit starge at first, but it's already growing on me. # # I think pandas is more difficult for this particular example. # There isn't as natural a way to mix column-agnostic aggregations (like ``count``) with column-specific aggregations like the other two. You end up writing could like `.agg{'year': 'count'}` which reads, "I want the count of `year`", even though you don't care about `year` specifically. You could just as easily have said `.agg('distance': 'count')`. # Additionally assigning names can't be done as cleanly in pandas; you have to just follow it up with a ``rename`` like before. # We may as well reproduce the graph. It looks like `ggplots` `geom_smooth` is some kind of lowess smoother. We can either us [seaborn](http://stanford.edu/~mwaskom/software/seaborn/): # In[26]: fig, ax = plt.subplots(figsize=(12, 6)) sns.regplot("dist", "delay", data=delay, lowess=True, ax=ax, scatter_kws={'color': 'k', 'alpha': .5, 's': delay['count'] / 10}, ci=90, line_kws={'linewidth': 3}); # Or using statsmodels directly for more control over the lowess, with an extremely lazy # "confidence interval". # In[27]: import statsmodels.api as sm # In[28]: smooth = sm.nonparametric.lowess(delay.delay, delay.dist, frac=1/8) ax = delay.plot(kind='scatter', x='dist', y = 'delay', figsize=(12, 6), color='k', alpha=.5, s=delay['count'] / 10) ax.plot(smooth[:, 0], smooth[:, 1], linewidth=3); std = smooth[:, 1].std() ax.fill_between(smooth[:, 0], smooth[:, 1] - std, smooth[:, 1] + std, alpha=.25); # In[29]: # destinations <- group_by(flights, dest) # summarise(destinations, # planes = n_distinct(tailnum), # flights = n() # ) destinations = flights.groupby('dest') destinations.agg({ 'tailnum': lambda x: len(x.unique()), 'year': 'count' }).rename(columns={'tailnum': 'planes', 'year': 'flights'}) # There's a little know feature to `groupby.agg`: it accepts a dict of dicts mapping # columns to `{name: aggfunc}` pairs. Here's the result: # In[30]: destinations = flights.groupby('dest') r = destinations.agg({'tailnum': {'planes': lambda x: len(x.unique())}, 'year': {'flights': 'count'}}) r # The result is a `MultiIndex` in the columns which can be a bit awkard to work with (you can drop a level with `r.columns.droplevel()`). Also the syntax going into the `.agg` may not be the clearest. # Similar to how ``dplyr`` provides optimized C++ versions of most of the `summarise` functions, pandas uses [cython](http://cython.org) optimized versions for most of the `agg` methods. # In[31]: # daily <- group_by(flights, year, month, day) # (per_day <- summarise(daily, flights = n())) daily = flights.groupby(['year', 'month', 'day']) per_day = daily['distance'].count() per_day # In[32]: # (per_month <- summarise(per_day, flights = sum(flights))) per_month = per_day.groupby(level=['year', 'month']).sum() per_month # In[33]: # (per_year <- summarise(per_month, flights = sum(flights))) per_year = per_month.sum() per_year # I'm not sure how ``dplyr`` is handling the other columns, like `year`, in the last example. With pandas, it's clear that we're grouping by them since they're included in the groupby. For the last example, we didn't group by anything, so they aren't included in the result. # # Chaining # Any follower of Hadley's [twitter account](https://twitter.com/hadleywickham/) will know how much R users *love* the ``%>%`` (pipe) operator. And for good reason! # In[34]: # flights %>% # group_by(year, month, day) %>% # select(arr_delay, dep_delay) %>% # summarise( # arr = mean(arr_delay, na.rm = TRUE), # dep = mean(dep_delay, na.rm = TRUE) # ) %>% # filter(arr > 30 | dep > 30) ( flights.groupby(['year', 'month', 'day']) [['arr_delay', 'dep_delay']] .mean() .query('arr_delay > 30 | dep_delay > 30') ) # A bit of soapboxing here if you'll indulge me. # # The example above is a bit contrived since it only uses methods on `DataFrame`. But what if you have some function to work into your pipeline that pandas hasn't (or won't) implement? In that case you're required to break up your pipeline by assigning your intermediate (probably uninteresting) DataFrame to a temporary variable you don't actually care about. # # `R` doesn't have this problem since the `%>%` operator works with any function that takes (and maybe returns) DataFrames. # The python language doesn't have any notion of right to left function application (other than special cases like `__radd__` and `__rmul__`). # It only allows the usual left to right `function(arguments)`, where you can think of the `()` as the "call this function" operator. # # Pandas wanted something like `%>%` and we did it in a farily pythonic way. The `pd.DataFrame.pipe` method takes a function and optionally some arguments, and calls that function with `self` (the DataFrame) as the first argument. # # So # # ```R # flights >%> my_function(my_argument=10) # ``` # # becomes # # ```python # flights.pipe(my_function, my_argument=10) # ``` # # We initially had grander visions for `.pipe`, but the wider python community didn't seem that interested. # # Other Data Sources # Pandas has tons [IO tools](http://pandas.pydata.org/pandas-docs/version/0.15.0/io.html) to help you get data in and out, including SQL databases via [SQLAlchemy](http://www.sqlalchemy.org). # # Summary # I think pandas held up pretty well, considering this was a vignette written for dplyr. I found the degree of similarity more interesting than the differences. The most difficult task was renaming of columns within an operation; they had to be followed up with a call to ``rename`` *after* the operation, which isn't that burdensome honestly. # # More and more it looks like we're moving towards future where being a language or package partisan just doesn't make sense. Not when you can load up a [Jupyter](http://jupyter.org) (formerly IPython) notebook to call up a library written in R, and hand those results off to python or Julia or whatever for followup, before going back to R to make a cool [shiny](http://shiny.rstudio.com) web app. # # There will always be a place for your "utility belt" package like dplyr or pandas, but it wouldn't hurt to be familiar with both. # # If you want to contribute to pandas, we're always looking for help at https://github.com/pydata/pandas/. # You can get ahold of me directly on [twitter](https://twitter.com/tomaugspurger).