This notebook compares pandas and dplyr. The comparison is just on syntax (verbiage), not performance. Whether you're an R user looking to switch to pandas or the other way around, I hope this guide will help ease the transition.

We'll work through the introductory dplyr vignette to analyze some flight data.

I'm working on a better layout to show the two packages side by side. But for now I'm just putting the dplyr code in a comment above each python call.

In [1]:

# Some prep work to get the data from R and into pandas
%matplotlib inline
%load_ext rpy2.ipython

import pandas as pd
import matplotlib.pyplot as plt  # needed for plt.subplots in the plotting cell
import seaborn as sns

pd.set_option("display.max_rows", 5)

In [2]:

# %%R
# install.packages("nycflights13", repos='http://cran.us.r-project.org')

In [3]:

# %%R
# library(nycflights13)
# write.csv(flights, "flights.csv")

Data: nycflights13

In [4]:

flights = pd.read_csv("flights.csv", index_col=0)

In [5]:

# dim(flights)   <--- The R code
flights.shape  # <--- The python code

Out[5]:

(336776, 16)

In [6]:

# head(flights)
flights.head()

Out[6]:

	year	month	day	dep_time	dep_delay	arr_time	arr_delay	carrier	tailnum	flight	origin	dest	air_time	distance	hour	minute
1	2013	1	1	517	2	830	11	UA	N14228	1545	EWR	IAH	227	1400	5	17
2	2013	1	1	533	4	850	20	UA	N24211	1714	LGA	IAH	227	1416	5	33
3	2013	1	1	542	2	923	33	AA	N619AA	1141	JFK	MIA	160	1089	5	42
4	2013	1	1	544	-1	1004	-18	B6	N804JB	725	JFK	BQN	183	1576	5	44
5	2013	1	1	554	-6	812	-25	DL	N668DN	461	LGA	ATL	116	762	5	54

Single table verbs

dplyr has a small set of nicely defined verbs. I've listed their closest pandas verbs.

dplyr	pandas
`filter()` (and `slice()`)	`query()` (and `loc[]`, `iloc[]`)
`arrange()`	`sort_values()` and `sort_index()`
`select()` (and `rename()`)	`__getitem__` (and `rename()`)
`distinct()`	`drop_duplicates()`
`mutate()` (and `transmute()`)	`assign()`
`summarise()`	None
`sample_n()` and `sample_frac()`	`sample()`
`%>%`	`pipe()`

Some of the "missing" verbs in pandas are because there are other, different ways of achieving the same goal. For example, summarise is spread across mean, std, etc. Its closest analog is actually the .agg method on a GroupBy object, as it reduces a DataFrame to a single row (per group). This isn't quite what .describe does.

I've also included the pipe operator from R (%>%) and the pipe method from pandas, even though they aren't quite verbs.

Filter rows with filter(), query()

In [7]:

# filter(flights, month == 1, day == 1)
flights.query("month == 1 & day == 1")

Out[7]:

	year	month	day	dep_time	dep_delay	arr_time	arr_delay	carrier	tailnum	flight	origin	dest	air_time	distance	hour	minute
1	2013	1	1	517	2	830	11	UA	N14228	1545	EWR	IAH	227	1400	5	17
2	2013	1	1	533	4	850	20	UA	N24211	1714	LGA	IAH	227	1416	5	33
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
841	2013	1	1	NaN	NaN	NaN	NaN	AA	N3EVAA	1925	LGA	MIA	NaN	1096	NaN	NaN
842	2013	1	1	NaN	NaN	NaN	NaN	B6	N618JB	125	JFK	FLL	NaN	1069	NaN	NaN

842 rows × 16 columns

We see the first big language difference between R and python. Many python programmers will shun the R code as too magical. How is the programmer supposed to know that month and day are supposed to represent columns in the DataFrame? On the other hand, to emulate this very convenient feature of R, pandas has you write the expression as a string, and evaluates that string in the context of the DataFrame.

The more verbose version:

In [8]:

# flights[flights$month == 1 & flights$day == 1, ]
flights[(flights.month == 1) & (flights.day == 1)]

Out[8]:

	year	month	day	dep_time	dep_delay	arr_time	arr_delay	carrier	tailnum	flight	origin	dest	air_time	distance	hour	minute
1	2013	1	1	517	2	830	11	UA	N14228	1545	EWR	IAH	227	1400	5	17
2	2013	1	1	533	4	850	20	UA	N24211	1714	LGA	IAH	227	1416	5	33
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
841	2013	1	1	NaN	NaN	NaN	NaN	AA	N3EVAA	1925	LGA	MIA	NaN	1096	NaN	NaN
842	2013	1	1	NaN	NaN	NaN	NaN	B6	N618JB	125	JFK	FLL	NaN	1069	NaN	NaN

842 rows × 16 columns
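To make the string-vs-mask comparison concrete, here's a minimal sketch on a made-up stand-in for flights (the values and the target_month variable are mine, purely for illustration). It also shows that query can reach local Python variables with an @ prefix.

```python
import pandas as pd

# Toy stand-in for the flights table (values are invented for illustration)
df = pd.DataFrame({
    "month": [1, 1, 2, 1],
    "day": [1, 2, 1, 1],
    "dep_delay": [2.0, 4.0, -1.0, 6.0],
})

target_month = 1  # a plain Python variable, referenced with @ inside query()

# The string is evaluated with the DataFrame's columns in scope
by_query = df.query("month == @target_month and day == 1")

# The explicit boolean-mask spelling of the same filter
by_mask = df[(df["month"] == target_month) & (df["day"] == 1)]

print(by_query.equals(by_mask))
```

Both spellings select the same rows; which one reads better is mostly a matter of taste.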

In [9]:

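# slice(flights, 1:10)
# Note: iloc is zero-based and excludes the stop position, so [:9] returns
# 9 rows; flights.iloc[:10] would match R's inclusive 1:10 exactly.
flights.iloc[:9]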

Out[9]:

	year	month	day	dep_time	dep_delay	arr_time	arr_delay	carrier	tailnum	flight	origin	dest	air_time	distance	hour	minute
1	2013	1	1	517	2	830	11	UA	N14228	1545	EWR	IAH	227	1400	5	17
2	2013	1	1	533	4	850	20	UA	N24211	1714	LGA	IAH	227	1416	5	33
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
8	2013	1	1	557	-3	709	-14	EV	N829AS	5708	LGA	IAD	53	229	5	57
9	2013	1	1	557	-3	838	-8	B6	N593JB	79	JFK	MCO	140	944	5	57

9 rows × 16 columns
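The off-by-one between R's 1:10 and python's slices is easy to trip over, so here's a small sketch on a toy frame (the 1-based index mimics the row labels that came over from R; the data itself is made up).

```python
import pandas as pd

# Toy frame with a 1-based index, like the one read from the R-written CSV
df = pd.DataFrame({"x": range(100, 115)}, index=range(1, 16))

first_nine = df.iloc[:9]   # positions 0..8, stop excluded: 9 rows
first_ten = df.iloc[:10]   # positions 0..9: 10 rows, like slice(df, 1:10)
by_label = df.loc[1:10]    # label-based, and inclusive of both ends: 10 rows

print(len(first_nine), len(first_ten), len(by_label))
```

Position-based iloc follows python's half-open convention, while label-based loc includes both endpoints, which happens to line up with R here.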

Arrange rows with arrange(), sort_values()

In [10]:

# arrange(flights, year, month, day) 
flights.sort_values(['year', 'month', 'day'])

Out[10]:

	year	month	day	dep_time	dep_delay	arr_time	arr_delay	carrier	tailnum	flight	origin	dest	air_time	distance	hour	minute
1	2013	1	1	517	2	830	11	UA	N14228	1545	EWR	IAH	227	1400	5	17
2	2013	1	1	533	4	850	20	UA	N24211	1714	LGA	IAH	227	1416	5	33
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
111295	2013	12	31	NaN	NaN	NaN	NaN	UA	NaN	219	EWR	ORD	NaN	719	NaN	NaN
111296	2013	12	31	NaN	NaN	NaN	NaN	UA	NaN	443	JFK	LAX	NaN	2475	NaN	NaN

336776 rows × 16 columns

In [11]:

# arrange(flights, desc(arr_delay))
flights.sort_values('arr_delay', ascending=False)

Out[11]:

	year	month	day	dep_time	dep_delay	arr_time	arr_delay	carrier	tailnum	flight	origin	dest	air_time	distance	hour	minute
7073	2013	1	9	641	1301	1242	1272	HA	N384HA	51	JFK	HNL	640	4983	6	41
235779	2013	6	15	1432	1137	1607	1127	MQ	N504MQ	3535	JFK	CMH	74	483	14	32
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
336775	2013	9	30	NaN	NaN	NaN	NaN	MQ	N511MQ	3572	LGA	CLE	NaN	419	NaN	NaN
336776	2013	9	30	NaN	NaN	NaN	NaN	MQ	N839MQ	3531	LGA	RDU	NaN	431	NaN	NaN

336776 rows × 16 columns

It's worth mentioning the other common sorting method for pandas DataFrames, sort_index. Pandas puts much more emphasis on indices (or row labels) than R. This is a design decision that has positives and negatives, which we won't go into here. Suffice it to say that when you need to sort a DataFrame by the index, use DataFrame.sort_index.
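A quick sketch of the difference, on a tiny made-up frame whose index is deliberately out of order:

```python
import pandas as pd

# Invented data; the index (row labels) is intentionally shuffled
df = pd.DataFrame({"arr_delay": [11.0, -18.0, 33.0]}, index=[3, 1, 2])

by_value = df.sort_values("arr_delay", ascending=False)  # order by a column
by_label = df.sort_index()                               # order by row labels

print(by_value.index.tolist())
print(by_label.index.tolist())
```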

Select columns with select(), []

In [12]:

# select(flights, year, month, day) 
flights[['year', 'month', 'day']]

Out[12]:

	year	month	day
1	2013	1	1
2	2013	1	1
...	...	...	...
336775	2013	9	30
336776	2013	9	30

336776 rows × 3 columns

In [13]:

# select(flights, year:day) 
flights.loc[:, 'year':'day']

Out[13]:

	year	month	day
1	2013	1	1
2	2013	1	1
...	...	...	...
336775	2013	9	30
336776	2013	9	30

336776 rows × 3 columns

In [14]:

# select(flights, -(year:day)) 

# No direct equivalent here. I would typically use
# flights.drop(cols_to_drop, axis=1)
# or flights[flights.columns.difference(pd.Index(cols_to_drop))]
# point to dplyr!
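For what it's worth, here is one way the drop approach could look for a range of columns. This is a sketch on a one-row toy frame, and cols_to_drop is just the placeholder name from the comment above.

```python
import pandas as pd

# One-row toy frame with a few of the flights columns (values invented)
df = pd.DataFrame({"year": [2013], "month": [1], "day": [1],
                   "dep_delay": [2.0], "carrier": ["UA"]})

# Use a label slice to collect the column range, then drop those columns
cols_to_drop = df.loc[:, "year":"day"].columns
remaining = df.drop(columns=cols_to_drop)

print(remaining.columns.tolist())
```

It takes two steps instead of one, so the point still goes to dplyr.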

In [15]:

# select(flights, tail_num = tailnum)
flights.rename(columns={'tailnum': 'tail_num'})['tail_num']

Out[15]:

1         N14228
2         N24211
           ...  
336775    N511MQ
336776    N839MQ
Name: tail_num, dtype: object

But as Hadley mentions, this isn't that useful since it only returns the one column. dplyr and pandas compare well here.

In [16]:

# rename(flights, tail_num = tailnum)
flights.rename(columns={'tailnum': 'tail_num'})

Out[16]:

	year	month	day	dep_time	dep_delay	arr_time	arr_delay	carrier	tail_num	flight	origin	dest	air_time	distance	hour	minute
1	2013	1	1	517	2	830	11	UA	N14228	1545	EWR	IAH	227	1400	5	17
2	2013	1	1	533	4	850	20	UA	N24211	1714	LGA	IAH	227	1416	5	33
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
336775	2013	9	30	NaN	NaN	NaN	NaN	MQ	N511MQ	3572	LGA	CLE	NaN	419	NaN	NaN
336776	2013	9	30	NaN	NaN	NaN	NaN	MQ	N839MQ	3531	LGA	RDU	NaN	431	NaN	NaN

336776 rows × 16 columns

Pandas is more verbose, but the argument to columns can be any mapping or function. So it's often used with a function to perform a common task, say df.rename(columns=lambda x: x.replace('-', '_')) to replace any dashes with underscores. Also, rename (the pandas version) can be applied to the Index.
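Here's that idea as a runnable sketch. The dashed column names are invented for the example; the real flights columns already use underscores.

```python
import pandas as pd

# Hypothetical frame with dashes in its column names
df = pd.DataFrame({"dep-delay": [2.0], "arr-delay": [11.0], "carrier": ["UA"]})

# A callable is applied to every column label
renamed = df.rename(columns=lambda name: name.replace("-", "_"))

# The same method works on the row labels via the index argument
relabeled = renamed.rename(index={0: "first"})

print(renamed.columns.tolist())
print(relabeled.index.tolist())
```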

One more note on the differences here. Pandas could easily include a .select method. xray (since renamed xarray), a library that builds on top of NumPy and pandas to offer labeled N-dimensional arrays (along with many other things), does just that. Pandas chooses the .loc and .iloc accessors because any valid selection is also a valid assignment. This makes it easier to modify the data.

flights.loc[:, 'year':'day'] = data

where data is an object that is, or can be broadcast to, the correct shape.
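A small sketch of selection-as-assignment, using a toy frame and two made-up choices for data: a scalar that gets broadcast, and an array that already has the right shape.

```python
import numpy as np
import pandas as pd

# Two-row toy frame (values invented)
df = pd.DataFrame({"year": [2013, 2013], "month": [1, 2], "day": [1, 1],
                   "carrier": ["UA", "AA"]})

# A scalar is broadcast to every selected cell
df.loc[:, "year":"day"] = 0
zeros_everywhere = bool((df.loc[:, "year":"day"] == 0).all().all())

# An array with the same shape as the selection (2 rows x 3 columns)
df.loc[:, "year":"day"] = np.array([[2014, 6, 15],
                                    [2014, 6, 16]])

print(df)
```

The carrier column sits outside the selection, so it is left alone.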

Extract distinct (unique) rows

In [17]:

# distinct(select(flights, tailnum))
flights.tailnum.unique()

Out[17]:

array(['N14228', 'N24211', 'N619AA', ..., 'N776SK', 'N785SK', 'N557AS'], dtype=object)

FYI this returns a numpy array instead of a Series.

In [18]:

# distinct(select(flights, origin, dest))
flights[['origin', 'dest']].drop_duplicates()

Out[18]:

	origin	dest
1	EWR	IAH
2	LGA	IAH
...	...	...
255456	EWR	ANC
275946	EWR	LGA

224 rows × 2 columns

OK, so dplyr wins there from a consistency point of view. unique is only defined on Series, not DataFrames.
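If you'd rather stay in pandas objects for the single-column case too, drop_duplicates is also defined on Series. A minimal sketch with a few tail numbers hard-coded for illustration:

```python
import pandas as pd

# A short Series standing in for flights.tailnum
s = pd.Series(["N14228", "N24211", "N14228"], name="tailnum")

as_array = s.unique()            # array-like of the distinct values
as_series = s.drop_duplicates()  # still a Series, original labels kept

print(type(as_series).__name__, as_series.tolist())
```

So drop_duplicates gives you one spelling that works for both one column and many.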

Add new columns with mutate()

We at pandas shamelessly stole this for v0.16.0.

In [19]:

# mutate(flights,
#   gain = arr_delay - dep_delay,
#   speed = distance / air_time * 60)

flights.assign(gain=flights.arr_delay - flights.dep_delay,
               speed=flights.distance / flights.air_time * 60)

Out[19]:

	year	month	day	dep_time	dep_delay	arr_time	arr_delay	carrier	tailnum	flight	origin	dest	air_time	distance	hour	minute	gain	speed
1	2013	1	1	517	2	830	11	UA	N14228	1545	EWR	IAH	227	1400	5	17	9	370.044053
2	2013	1	1	533	4	850	20	UA	N24211	1714	LGA	IAH	227	1416	5	33	16	374.273128
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
336775	2013	9	30	NaN	NaN	NaN	NaN	MQ	N511MQ	3572	LGA	CLE	NaN	419	NaN	NaN	NaN	NaN
336776	2013	9	30	NaN	NaN	NaN	NaN	MQ	N839MQ	3531	LGA	RDU	NaN	431	NaN	NaN	NaN	NaN

336776 rows × 18 columns

In [20]:

# mutate(flights,
#   gain = arr_delay - dep_delay,
#   gain_per_hour = gain / (air_time / 60)
# )

(flights.assign(gain=flights.arr_delay - flights.dep_delay)
        .assign(gain_per_hour = lambda df: df.gain / (df.air_time / 60)))

Out[20]:

	year	month	day	dep_time	dep_delay	arr_time	arr_delay	carrier	tailnum	flight	origin	dest	air_time	distance	hour	minute	gain	gain_per_hour
1	2013	1	1	517	2	830	11	UA	N14228	1545	EWR	IAH	227	1400	5	17	9	2.378855
2	2013	1	1	533	4	850	20	UA	N24211	1714	LGA	IAH	227	1416	5	33	16	4.229075
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
336775	2013	9	30	NaN	NaN	NaN	NaN	MQ	N511MQ	3572	LGA	CLE	NaN	419	NaN	NaN	NaN	NaN
336776	2013	9	30	NaN	NaN	NaN	NaN	MQ	N839MQ	3531	LGA	RDU	NaN	431	NaN	NaN	NaN	NaN

336776 rows × 18 columns

The first example is pretty much identical (aside from the names, mutate vs. assign).

The second example just comes down to language differences. In R, it's possible to implement a function like mutate where you can refer to gain in the line calculating gain_per_hour, even though gain hasn't actually been calculated yet.

In Python, you can have arbitrary keyword arguments to functions (which we needed for .assign), but the order of the arguments is arbitrary since dicts are unordered and **kwargs is a dict (this holds for Python versions before 3.6, which started preserving keyword-argument order). So you can't have something like df.assign(x=df.a / df.b, y=x ** 2), because you don't know whether x or y will come first (you'd also get an error saying x is undefined).

To work around that with pandas, you'll need to split up the assigns, and pass in a callable to the second assign. The callable is handed the DataFrame being assigned to, and looks up the column named gain on it. Since the line above returns a DataFrame with the gain column added, the pipeline goes through just fine.
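Here's the same pattern on a two-row toy frame (numbers invented so the arithmetic is easy to check by eye). Passing a callable to both assigns means nothing refers back to the outer variable.

```python
import pandas as pd

# Toy frame with made-up delays (minutes) and air times (minutes)
df = pd.DataFrame({"arr_delay": [11.0, 20.0],
                   "dep_delay": [2.0, 4.0],
                   "air_time": [120.0, 60.0]})

result = (
    df.assign(gain=lambda d: d.arr_delay - d.dep_delay)
      # d here is the frame returned by the previous assign, so d.gain exists
      .assign(gain_per_hour=lambda d: d.gain / (d.air_time / 60))
)

print(result[["gain", "gain_per_hour"]])
```

The original df is untouched, since assign returns a new DataFrame each time.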

In [21]:

# transmute(flights,
#   gain = arr_delay - dep_delay,
#   gain_per_hour = gain / (air_time / 60)
# )
(flights.assign(gain=flights.arr_delay - flights.dep_delay)
        .assign(gain_per_hour = lambda df: df.gain / (df.air_time / 60))
        [['gain', 'gain_per_hour']])

Out[21]:

	gain	gain_per_hour
1	9	2.378855
2	16	4.229075
...	...	...
336775	NaN	NaN
336776	NaN	NaN

336776 rows × 2 columns

Summarise values with summarise()

In [22]:

# summarise(flights,
#   delay = mean(dep_delay, na.rm = TRUE))
flights.dep_delay.mean()

Out[22]:

12.639070257304708

This is only roughly equivalent. summarise takes a callable (e.g. mean, sum) and evaluates that on the DataFrame. In pandas these are spread across pd.DataFrame.mean, pd.DataFrame.sum. This will come up again when we look at groupby.
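If you want something a bit closer in spirit to summarise on an ungrouped frame, DataFrame.agg takes a mapping from column to reduction. A sketch on made-up data (note that, like the mean above, the missing value is skipped by default):

```python
import pandas as pd

# Toy frame; the None becomes NaN and is skipped by mean
df = pd.DataFrame({"dep_delay": [2.0, 4.0, None],
                   "arr_delay": [11.0, 20.0, 5.0]})

# One reduction per column; the result is a Series keyed by column name
summary = df.agg({"dep_delay": "mean", "arr_delay": "max"})

print(summary)
```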

Randomly sample rows with sample_n() and sample_frac()

In [23]:

# sample_n(flights, 10)
flights.sample(n=10)

Out[23]:

	year	month	day	dep_time	dep_delay	arr_time	arr_delay	carrier	tailnum	flight	origin	dest	air_time	distance	hour	minute
197774	2013	5	5	1814	-4	2118	-2	B6	N554JB	35	JFK	PBI	141	1028	18	14
114716	2013	2	5	639	-6	953	1	UA	N825UA	369	EWR	DFW	211	1372	6	39
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
150179	2013	3	15	1949	-6	2237	-33	AA	N3ETAA	1709	LGA	MIA	145	1096	19	49
52160	2013	10	28	720	0	1005	5	UA	N534UA	261	LGA	IAH	185	1416	7	20

10 rows × 16 columns

In [24]:

# sample_frac(flights, 0.01)
flights.sample(frac=.01)

Out[24]:

	year	month	day	dep_time	dep_delay	arr_time	arr_delay	carrier	tailnum	flight	origin	dest	air_time	distance	hour	minute
28971	2013	10	3	605	-5	728	-17	WN	N238WN	2609	LGA	STL	126	888	6	5
233436	2013	6	13	617	-6	916	20	B6	N580JB	203	JFK	LAS	314	2248	6	17
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
3100	2013	1	4	1257	-2	1356	-12	UA	N825UA	343	EWR	BOS	43	200	12	57
61881	2013	11	7	1431	1	1658	-6	B6	N281JB	477	JFK	JAX	126	828	14	31

3368 rows × 16 columns
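Since the rows come back in a different order every run, it can help to pin the sample down. Here's a sketch using the random_state argument on a made-up frame of 100 rows:

```python
import pandas as pd

df = pd.DataFrame({"flight": range(100)})  # invented data

# Same seed, same sample
a = df.sample(n=10, random_state=0)
b = df.sample(n=10, random_state=0)

# frac works the same way: 10% of 100 rows
tenth = df.sample(frac=0.1, random_state=0)

print(a.equals(b), len(tenth))
```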

Grouped operations

In [25]:

# planes <- group_by(flights, tailnum)
# delay <- summarise(planes,
#   count = n(),
#   dist = mean(distance, na.rm = TRUE),
#   delay = mean(arr_delay, na.rm = TRUE))
# delay <- filter(delay, count > 20, dist < 2000)

planes = flights.groupby("tailnum")
delay = (planes.agg({"year": "count",
                     "distance": "mean",
                     "arr_delay": "mean"})
               .rename(columns={"distance": "dist",
                                "arr_delay": "delay",
                                "year": "count"})
               .query("count > 20 & dist < 2000"))
delay

Out[25]:

	dist	delay	count
tailnum
N0EGMQ	676.188679	9.982955	371
N10156	757.947712	12.717241	153
...	...	...	...
N999DN	895.459016	14.311475	61
N9EAMQ	674.665323	9.235294	248

2961 rows × 3 columns

For me, dplyr's n() looked a bit strange at first, but it's already growing on me.

I think pandas is more difficult for this particular example. There isn't as natural a way to mix column-agnostic aggregations (like count) with column-specific aggregations like the other two. You end up writing code like .agg({'year': 'count'}) which reads, "I want the count of year", even though you don't care about year specifically. You could just as easily have said .agg({'distance': 'count'}). Additionally, assigning names can't be done as cleanly in pandas; you have to just follow it up with a rename like before.
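One possible workaround for the column-agnostic count is GroupBy.size, which counts rows per group without naming a column. This is a sketch on a three-row toy frame, not a drop-in replacement for the cell above: size counts every row, while count skips missing values.

```python
import pandas as pd

# Toy frame; one arr_delay is missing on purpose
df = pd.DataFrame({"tailnum": ["A", "A", "B"],
                   "distance": [100.0, 300.0, 500.0],
                   "arr_delay": [10.0, None, -5.0]})

planes = df.groupby("tailnum")

delay = (planes[["distance", "arr_delay"]].mean()
         .assign(count=planes.size()))  # rows per group, aligned on tailnum

non_missing = planes["arr_delay"].count()  # skips the NaN, unlike size()

print(delay)
```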

We may as well reproduce the graph. It looks like ggplot's geom_smooth is some kind of lowess smoother. We can either use seaborn:

In [26]:

fig, ax = plt.subplots(figsize=(12, 6))

sns.regplot(x="dist", y="delay", data=delay, lowess=True, ax=ax,
            scatter_kws={'color': 'k', 'alpha': .5, 's': delay['count'] / 10}, ci=90,
            line_kws={'linewidth': 3});

Or using statsmodels directly for more control over the lowess, with an extremely lazy "confidence interval".

In [27]:

import statsmodels.api as sm

In [28]:

smooth = sm.nonparametric.lowess(delay.delay, delay.dist, frac=1/8)
ax = delay.plot(kind='scatter', x='dist', y = 'delay', figsize=(12, 6),
                color='k', alpha=.5, s=delay['count'] / 10)
ax.plot(smooth[:, 0], smooth[:, 1], linewidth=3);
std = smooth[:, 1].std()
ax.fill_between(smooth[:, 0], smooth[:, 1] - std, smooth[:, 1] + std, alpha=.25);

In [29]:

# destinations <- group_by(flights, dest)
# summarise(destinations,
#   planes = n_distinct(tailnum),
#   flights = n()
# )

destinations = flights.groupby('dest')
destinations.agg({
    'tailnum': lambda x: len(x.unique()),
    'year': 'count'
    }).rename(columns={'tailnum': 'planes',
                       'year': 'flights'})

Out[29]:

	planes	flights
dest
ABQ	108	254
ACK	58	265
...	...	...
TYS	273	631
XNA	176	1036

105 rows × 2 columns

There's a little-known feature of groupby.agg: it accepts a dict of dicts mapping columns to {name: aggfunc} pairs. Here's the result:

In [30]:

destinations = flights.groupby('dest')
r = destinations.agg({'tailnum': {'planes': lambda x: len(x.unique())},
                      'year': {'flights': 'count'}})
r

Out[30]:

	tailnum	year
	planes	flights
dest
ABQ	108	254
ACK	58	265
...	...	...
TYS	273	631
XNA	176	1036

105 rows × 2 columns

The result is a MultiIndex in the columns which can be a bit awkward to work with (you can drop a level with r.columns.droplevel()). Also the syntax going into the .agg may not be the clearest.
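A note for readers on newer pandas: the dict-of-dicts form was later deprecated and removed, and pandas 0.25 added "named aggregation", which names the outputs directly. Here's a sketch on a toy frame (data invented). I've swapped in the built-in "nunique" for the lambda, which is an assumption on my part: nunique ignores missing values, while len(x.unique()) counts NaN as a value.

```python
import pandas as pd

# Three made-up flights to two destinations
df = pd.DataFrame({"dest": ["ABQ", "ABQ", "ACK"],
                   "tailnum": ["N1", "N1", "N2"],
                   "year": [2013, 2013, 2013]})

# output_name=(column, aggfunc): flat, already-renamed columns
r = df.groupby("dest").agg(
    planes=("tailnum", "nunique"),
    flights=("year", "count"),
)

print(r)
```

This gets much closer to the dplyr version, with no MultiIndex and no follow-up rename.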

Similar to how dplyr provides optimized C++ versions of most of the summarise functions, pandas uses cython optimized versions for most of the agg methods.

In [31]:

# daily <- group_by(flights, year, month, day)
# (per_day   <- summarise(daily, flights = n()))

daily = flights.groupby(['year', 'month', 'day'])
per_day = daily['distance'].count()
per_day

Out[31]:

year  month  day
2013  1      1      842
             2      943
                   ... 
      12     30     968
             31     776
Name: distance, dtype: int64

In [32]:

# (per_month <- summarise(per_day, flights = sum(flights)))
per_month = per_day.groupby(level=['year', 'month']).sum()
per_month

Out[32]:

year  month
2013  1        27004
      2        24951
               ...  
      11       27268
      12       28135
Name: distance, dtype: int64

In [33]:

# (per_year  <- summarise(per_month, flights = sum(flights)))
per_year = per_month.sum()
per_year

Out[33]:

I'm not sure how dplyr is handling the other columns, like year, in the last example. With pandas, it's clear that we're grouping by them since they're included in the groupby. For the last example, we didn't group by anything, so they aren't included in the result.

Chaining

Any follower of Hadley's twitter account will know how much R users love the %>% (pipe) operator. And for good reason!

In [34]:

# flights %>%
#   group_by(year, month, day) %>%
#   select(arr_delay, dep_delay) %>%
#   summarise(
#     arr = mean(arr_delay, na.rm = TRUE),
#     dep = mean(dep_delay, na.rm = TRUE)
#   ) %>%
#   filter(arr > 30 | dep > 30)
(
flights.groupby(['year', 'month', 'day'])
    [['arr_delay', 'dep_delay']]
    .mean()
    .query('arr_delay > 30 | dep_delay > 30')
)

Out[34]:

			arr_delay	dep_delay
year	month	day
2013	1	16	34.247362	24.612865
	1	31	32.602854	28.658363
	1	...	...	...
	12	17	55.871856	40.705602
	12	23	32.226042	32.254149

49 rows × 2 columns

A bit of soapboxing here if you'll indulge me.

The example above is a bit contrived since it only uses methods on DataFrame. But what if you have some function to work into your pipeline that pandas hasn't (or won't) implement? In that case you're required to break up your pipeline by assigning your intermediate (probably uninteresting) DataFrame to a temporary variable you don't actually care about.

R doesn't have this problem since the %>% operator works with any function that takes (and maybe returns) DataFrames. The python language doesn't have any notion of right to left function application (other than special cases like __radd__ and __rmul__). It only allows the usual left to right function(arguments), where you can think of the () as the "call this function" operator.

Pandas wanted something like %>% and we did it in a fairly pythonic way. The pd.DataFrame.pipe method takes a function and optionally some arguments, and calls that function with self (the DataFrame) as the first argument.

So

flights %>% my_function(my_argument=10)

becomes

flights.pipe(my_function, my_argument=10)
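To see it end to end, here's a sketch with a made-up function (add_speed and its scale argument are hypothetical, not part of pandas) dropped into a chain on a toy frame:

```python
import pandas as pd


def add_speed(df, scale=60):
    """Hypothetical helper: add a speed column (distance per hour)."""
    return df.assign(speed=df["distance"] / df["air_time"] * scale)


# Toy frame with invented distances (miles) and air times (minutes)
flights_small = pd.DataFrame({"distance": [1400.0, 720.0],
                              "air_time": [200.0, 120.0]})

# pipe passes the DataFrame as the first argument to add_speed
piped = flights_small.pipe(add_speed, scale=60)
direct = add_speed(flights_small, scale=60)

print(piped.equals(direct))
```

The two calls do the same thing; pipe just lets the function sit in the middle of a method chain without a temporary variable.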

We initially had grander visions for .pipe, but the wider python community didn't seem that interested.

Other Data Sources

Pandas has tons of IO tools to help you get data in and out, including SQL databases via SQLAlchemy.
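As a tiny taste of the SQL round trip, here's a sketch using an in-memory SQLite database. To keep it self-contained I'm using the standard library's sqlite3 connection rather than SQLAlchemy, and the table name and data are made up.

```python
import sqlite3

import pandas as pd

# Invented two-row table
df = pd.DataFrame({"carrier": ["UA", "AA"], "dep_delay": [2.0, 4.0]})

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
try:
    df.to_sql("flights", conn, index=False)  # write the frame as a table
    late = pd.read_sql(
        "SELECT carrier, dep_delay FROM flights WHERE dep_delay > 3", conn
    )
finally:
    conn.close()

print(late)
```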

Summary

I think pandas held up pretty well, considering this was a vignette written for dplyr. I found the degree of similarity more interesting than the differences. The most difficult task was renaming of columns within an operation; they had to be followed up with a call to rename after the operation, which isn't that burdensome honestly.

More and more it looks like we're moving towards a future where being a language or package partisan just doesn't make sense. Not when you can load up a Jupyter (formerly IPython) notebook to call up a library written in R, and hand those results off to python or Julia or whatever for followup, before going back to R to make a cool shiny web app.

There will always be a place for your "utility belt" package like dplyr or pandas, but it wouldn't hurt to be familiar with both.

If you want to contribute to pandas, we're always looking for help at https://github.com/pydata/pandas/. You can get ahold of me directly on twitter.