dplython: dplyr in python

dplython is a port of R's dplyr package to Python: it provides dplyr's data manipulation verbs (select, mutate, group_by, summarize, and so on) chained together with a >> pipe. Below I work through several examples on the diamonds dataset, and I've also included equivalent pandas examples for comparison.

In [88]:
from dplython import *
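The diamonds dataset used throughout comes bundled with dplython (the star import pulls it in as a DplyFrame). As a minimal sketch, any pandas DataFrame can be wrapped the same way; 'your_data.csv' below is a hypothetical file:

import pandas as pd
from dplython import DplyFrame

# wrap an ordinary pandas DataFrame so the >> verbs work on it
df = DplyFrame(pd.read_csv('your_data.csv'))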
In [2]:
# dplython way: select 3 specific columns and return 5 rows of data
diamonds >> select(X.carat, X.cut, X.price) >> head(5)
Out[2]:
carat cut price
0 0.23 Ideal 326
1 0.21 Premium 326
2 0.23 Good 327
3 0.29 Premium 334
4 0.31 Good 335
In [3]:
# pandas way
diamonds[['carat', 'cut', 'price']].head()
Out[3]:
carat cut price
0 0.23 Ideal 326
1 0.21 Premium 326
2 0.23 Good 327
3 0.29 Premium 334
4 0.31 Good 335
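For comparison, the same pandas selection can also be written with .loc, which makes the row/column selection explicit:

# equivalent pandas selection using .loc (all rows, three columns)
diamonds.loc[:, ['carat', 'cut', 'price']].head()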
In [4]:
# dplython way: filter to carat > 4, then select 4 specific columns
diamonds >> dfilter(X.carat > 4) >> select(X.carat, X.cut, X.depth, X.price)
Out[4]:
carat cut depth price
25998 4.01 Premium 61.0 15223
25999 4.01 Premium 62.5 15223
27130 4.13 Fair 64.8 17329
27415 5.01 Fair 65.5 18018
27630 4.50 Fair 65.8 18531
In [5]:
# pandas way
diamonds.query("carat > 4")[['carat','cut','depth','price']]
Out[5]:
carat cut depth price
25998 4.01 Premium 61.0 15223
25999 4.01 Premium 62.5 15223
27130 4.13 Fair 64.8 17329
27415 5.01 Fair 65.5 18018
27630 4.50 Fair 65.8 18531
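The query() call can also be written as a plain boolean mask, which avoids the string expression. (As an aside, I believe newer dplython releases renamed dfilter to sift.)

# equivalent pandas filter using a boolean mask
diamonds[diamonds.carat > 4][['carat', 'cut', 'depth', 'price']]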
In [6]:
# dplython way: sample 10 random records, then sort by the carat column, then return only those specific columns
(diamonds >> 
  sample_n(10) >> 
  arrange(X.carat) >> 
  select(X.carat, X.cut, X.depth, X.price))
/home/pybokeh/envs/jupyter/lib/python3.5/site-packages/dplython/dplython.py:429: FutureWarning: sort(columns=....) is deprecated, use sort_values(by=.....)
  return lambda df: DplyFrame(df.sort(names))
Out[6]:
carat cut depth price
37842 0.33 Ideal 62.3 1002
31009 0.35 Ideal 62.2 748
52108 0.61 Ideal 59.4 2461
46058 0.64 Fair 59.5 1733
672 0.72 Ideal 60.3 2847
7807 1.05 Very Good 63.5 4292
19618 1.07 Ideal 62.4 8235
23217 1.14 Ideal 60.1 11226
18747 1.20 Premium 61.6 7661
22487 1.54 Very Good 62.7 10518
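(The FutureWarning above appears to come from dplython's arrange() calling the deprecated DataFrame.sort() internally, as the traceback shows; the result itself is fine.)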
In [7]:
# pandas way: get sample first, then select those specific columns, then sort by carat column
diamonds.sample(n=10)[['carat','cut','depth','price']].sort_values(by='carat')
Out[7]:
carat cut depth price
49 0.29 Very Good 60.7 404
33804 0.32 Very Good 58.1 842
40481 0.33 Ideal 61.8 1141
39781 0.36 Ideal 61.6 1094
30164 0.40 Very Good 60.5 720
42543 0.53 Premium 61.4 1324
9222 1.05 Premium 62.6 4557
9561 1.10 Premium 60.9 4622
27619 2.03 Very Good 61.7 18507
26940 2.27 Premium 62.4 16994
In [8]:
# dplython way: get 6 random rows, select 2 specific columns, then transpose
# It's possible to pass the entire dataframe using X._
diamonds >> sample_n(6) >> select(X.carat, X.price) >> X._.T
Out[8]:
34991 1780 32042 1205 37420 6248
carat 0.4 0.72 0.32 0.82 0.33 0.9
price 881.0 3048.00 779.00 2937.00 984.00 4008.0
In [9]:
# pandas way: sample 6 random rows, select 2 specific columns, then transpose
diamonds.sample(n=6)[['carat','price']].transpose()
Out[9]:
9752 20344 32959 32415 49302 35763
carat 0.94 1.51 0.31 0.41 0.32 0.4
price 4661.00 8742.00 461.00 791.00 540.00 912.0
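As a small shortcut, .T is an alias for .transpose(), so the pandas version can be shortened:

# .T is equivalent to .transpose()
diamonds.sample(n=6)[['carat', 'price']].T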
In [10]:
# dplython way: create a new column called carat_bin, group by cut and carat_bin, then get the average price
(diamonds >> 
  mutate(carat_bin=X.carat.round()) >> 
  group_by(X.cut, X.carat_bin) >> 
  summarize(avg_price=X.price.mean())
)
Out[10]:
avg_price carat_bin cut
0 4213.864948 1.0 Ideal
1 12337.020064 2.0 Premium
2 15842.666667 4.0 Fair
3 15053.555556 3.0 Very Good
4 786.054191 0.0 Good
5 12587.000000 4.0 Ideal
6 4135.271007 1.0 Very Good
7 11096.950321 2.0 Good
8 13466.823529 3.0 Fair
9 12838.984078 2.0 Ideal
10 3305.754579 1.0 Fair
11 4382.906453 1.0 Premium
12 863.908535 0.0 Ideal
13 15636.047619 3.0 Premium
14 15984.000000 4.0 Very Good
15 16335.000000 4.0 Premium
16 18018.000000 5.0 Fair
17 12281.851661 2.0 Very Good
18 16156.681818 3.0 Ideal
19 863.329085 0.0 Premium
20 766.354590 0.0 Very Good
21 3815.307879 1.0 Good
22 15924.176471 3.0 Good
23 9588.269737 2.0 Fair
24 1027.979275 0.0 Fair
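Note that dplython returns the groups in arbitrary order. To line the result up with the pandas output below, the summary can be piped through arrange() as well; a sketch using the same verbs as above:

(diamonds >>
  mutate(carat_bin=X.carat.round()) >>
  group_by(X.cut, X.carat_bin) >>
  summarize(avg_price=X.price.mean()) >>
  arrange(X.carat_bin, X.cut)
)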
In [14]:
# pandas way: create the new column, group by the 2 columns, then get the avg price, but then we have to
# rename price to avg_price.  With dplython, it is nice to be able to name the new column on the fly.
(diamonds.assign(carat_bin=diamonds.carat.round())
 .groupby(['carat_bin','cut'])[['price']].mean()
 .rename(columns={'price':'avg_price'})
)
Out[14]:
avg_price
carat_bin cut
0.0 Fair 1027.979275
Good 786.054191
Ideal 863.908535
Premium 863.329085
Very Good 766.354590
1.0 Fair 3305.754579
Good 3815.307879
Ideal 4213.864948
Premium 4382.906453
Very Good 4135.271007
2.0 Fair 9588.269737
Good 11096.950321
Ideal 12838.984078
Premium 12337.020064
Very Good 12281.851661
3.0 Fair 13466.823529
Good 15924.176471
Ideal 16156.681818
Premium 15636.047619
Very Good 15053.555556
4.0 Fair 15842.666667
Ideal 12587.000000
Premium 16335.000000
Very Good 15984.000000
5.0 Fair 18018.000000
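The pandas result carries carat_bin and cut in a MultiIndex rather than as columns. Appending .reset_index() flattens it into ordinary columns, matching the shape of the dplython output:

(diamonds.assign(carat_bin=diamonds.carat.round())
 .groupby(['carat_bin','cut'])[['price']].mean()
 .rename(columns={'price':'avg_price'})
 .reset_index()  # promote carat_bin and cut back to columns
)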

The next example is a bit trickier.

Goal: create a new column that contains the difference between a diamond's carat and the average carat for that diamond's color. I added a sort to make it easier to compare the result with the pandas result.

In [91]:
# dplython way: group by color, create a new column, ungroup, then sort by color and carat_diff, then
# return 10 rows of data
(diamonds >>
  group_by(X.color) >>
  mutate(carat_diff = X.carat - X.carat.mean()) >>
  ungroup() >>
  arrange(X.color, X.carat_diff) >>
  head(10)
)
/home/pybokeh/envs/jupyter/lib/python3.5/site-packages/dplython/dplython.py:379: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  df[key] = val.applyFcns(df)
/home/pybokeh/envs/jupyter/lib/python3.5/site-packages/dplython/dplython.py:429: FutureWarning: sort(columns=....) is deprecated, use sort_values(by=.....)
  return lambda df: DplyFrame(df.sort(names))
Out[91]:
Unnamed: 0 carat cut color clarity depth table price x y z carat_diff
50742 31598 0.20 Ideal D VS2 61.5 57.0 367 3.81 3.77 2.33 -0.457795
50743 31601 0.20 Premium D VS2 62.3 60.0 367 3.73 3.68 2.31 -0.457795
50744 31602 0.20 Premium D VS2 61.7 60.0 367 3.77 3.72 2.31 -0.457795
51721 38277 0.21 Premium D VS2 61.6 59.0 386 3.82 3.78 2.34 -0.447795
51722 38278 0.21 Premium D VS2 60.6 60.0 386 3.85 3.81 2.32 -0.447795
51723 38279 0.21 Premium D VS2 59.1 62.0 386 3.89 3.86 2.29 -0.447795
51724 38280 0.21 Premium D VS2 58.3 59.0 386 3.96 3.93 2.30 -0.447795
47170 55 0.22 Premium D VS2 59.3 62.0 404 3.91 3.88 2.31 -0.437795
47165 29 0.23 Very Good D VS2 60.5 61.0 357 3.96 3.97 2.40 -0.427795
47166 35 0.23 Very Good D VS1 61.9 58.0 402 3.92 3.96 2.44 -0.427795
In [90]:
# pandas way: create a new column by grouping by color and using transform to subtract each group's mean carat.
# Then sort by color and carat_diff, then return 10 rows of data
(diamonds.assign(carat_diff=diamonds
                 .groupby('color')['carat']
                 .transform(lambda x:x-x.mean())
                )
                .sort_values(by=['color','carat_diff'])
                .head(10)
)
Out[90]:
Unnamed: 0 carat cut color clarity depth table price x y z carat_diff
31597 31598 0.20 Ideal D VS2 61.5 57.0 367 3.81 3.77 2.33 -0.457795
31600 31601 0.20 Premium D VS2 62.3 60.0 367 3.73 3.68 2.31 -0.457795
31601 31602 0.20 Premium D VS2 61.7 60.0 367 3.77 3.72 2.31 -0.457795
38276 38277 0.21 Premium D VS2 61.6 59.0 386 3.82 3.78 2.34 -0.447795
38277 38278 0.21 Premium D VS2 60.6 60.0 386 3.85 3.81 2.32 -0.447795
38278 38279 0.21 Premium D VS2 59.1 62.0 386 3.89 3.86 2.29 -0.447795
38279 38280 0.21 Premium D VS2 58.3 59.0 386 3.96 3.93 2.30 -0.447795
54 55 0.22 Premium D VS2 59.3 62.0 404 3.91 3.88 2.31 -0.437795
28 29 0.23 Very Good D VS2 60.5 61.0 357 3.96 3.97 2.40 -0.427795
34 35 0.23 Very Good D VS1 61.9 58.0 402 3.92 3.96 2.44 -0.427795

I personally prefer dplython's syntax over pandas' in this case. What I also like about dplython is that you can name a column created from an aggregate function on the fly, whereas with pandas you have to rename the column after the aggregation.
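That said, if you're on pandas 0.25 or newer (a much later version than the one used in this post), named aggregation sidesteps the rename entirely:

(diamonds.assign(carat_bin=diamonds.carat.round())
 .groupby(['carat_bin', 'cut'])
 .agg(avg_price=('price', 'mean'))  # name the aggregate column on the fly
)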

Overall, I'm not sure if it's just because I've been using pandas for a while, but I still don't mind how things are done with pandas. YMMV!

However, I do wish pandas' query() method supported SQL-like pattern matching (for example, df.query("column like '%NAME%'")) and some way to filter on is null / is not null. Then I would consider the query() method pretty much feature complete with respect to what can be done with SQL's WHERE clause.
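In the meantime, the usual workarounds are boolean masks with the str and null accessors; using the clarity column from diamonds as an example:

diamonds[diamonds['clarity'].str.contains('VS')]   # roughly: clarity LIKE '%VS%'
diamonds[diamonds['clarity'].isnull()]             # clarity IS NULL
diamonds[diamonds['clarity'].notnull()]            # clarity IS NOT NULL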