Merge and pivot

In [1]:
import addutils.toc ; addutils.toc.js(ipy_notebook=True)
Out[1]:

In this tutorial we are going to see how to combine pandas data structures.

In [2]:
import numpy as np
import pandas as pd
from IPython.display import (display, HTML)
from addutils import side_by_side2, css_notebook
css_notebook()
Out[2]:

1 Concat

pandas.concat concatenates two (or more) pandas objects along a particular axis (the vertical one, axis=0, by default).

In [3]:
data = np.array([[-1, -2, 4, 3],[3,4, -4, -3], [2, -5, 3, 2], [2, -5, 3, 2]])
df1 = pd.DataFrame(data, columns = ['a','b','c','d'])
df2 = pd.DataFrame(np.random.randint(5, size=(3,6)) + 0.5, columns = list('fedcba'))
HTML(side_by_side2(df1, df2))
Out[3]:
a b c d
0 -1 -2 4 3
1 3 4 -4 -3
2 2 -5 3 2
3 2 -5 3 2
f e d c b a
0 2.5 2.5 4.5 3.5 3.5 4.5
1 3.5 3.5 4.5 3.5 3.5 1.5
2 0.5 1.5 1.5 0.5 4.5 4.5

When gluing together multiple DataFrames, you have a choice of how to handle the other axes (the ones not being concatenated). This is controlled by the join parameter:

In [4]:
HTML(side_by_side2(pd.concat([df1, df2]), pd.concat([df1, df2], join='inner')))
Out[4]:
a b c d e f
0 -1.0 -2.0 4.0 3.0 NaN NaN
1 3.0 4.0 -4.0 -3.0 NaN NaN
2 2.0 -5.0 3.0 2.0 NaN NaN
3 2.0 -5.0 3.0 2.0 NaN NaN
0 4.5 3.5 3.5 4.5 2.5 2.5
1 1.5 3.5 3.5 4.5 3.5 3.5
2 4.5 4.5 0.5 1.5 1.5 0.5
a b c d
0 -1.0 -2.0 4.0 3.0
1 3.0 4.0 -4.0 -3.0
2 2.0 -5.0 3.0 2.0
3 2.0 -5.0 3.0 2.0
0 4.5 3.5 3.5 4.5
1 1.5 3.5 3.5 4.5
2 4.5 4.5 0.5 1.5

It is possible to create a MultiIndex DataFrame by setting the keys parameter.

In [5]:
pd.concat([df1, df2], keys = ('First', 'Second'))
Out[5]:
a b c d e f
First 0 -1.0 -2.0 4.0 3.0 NaN NaN
1 3.0 4.0 -4.0 -3.0 NaN NaN
2 2.0 -5.0 3.0 2.0 NaN NaN
3 2.0 -5.0 3.0 2.0 NaN NaN
Second 0 4.5 3.5 3.5 4.5 2.5 2.5
1 1.5 3.5 3.5 4.5 3.5 3.5
2 4.5 4.5 0.5 1.5 1.5 0.5

Note that concat allows overlapping indexes. To avoid duplicate index labels, set the parameter ignore_index to True.

In [6]:
pd.concat([df1, df2], ignore_index = True)
Out[6]:
a b c d e f
0 -1.0 -2.0 4.0 3.0 NaN NaN
1 3.0 4.0 -4.0 -3.0 NaN NaN
2 2.0 -5.0 3.0 2.0 NaN NaN
3 2.0 -5.0 3.0 2.0 NaN NaN
4 4.5 3.5 3.5 4.5 2.5 2.5
5 1.5 3.5 3.5 4.5 3.5 3.5
6 4.5 4.5 0.5 1.5 1.5 0.5

And if we want to concatenate the DataFrames on a different axis:

In [7]:
concatenatedaxis = pd.concat([df1, df2], axis = 1, keys = ['ONE','TWO'])
concatenatedaxis
Out[7]:
ONE TWO
a b c d f e d c b a
0 -1 -2 4 3 2.5 2.5 4.5 3.5 3.5 4.5
1 3 4 -4 -3 3.5 3.5 4.5 3.5 3.5 1.5
2 2 -5 3 2 0.5 1.5 1.5 0.5 4.5 4.5
3 2 -5 3 2 NaN NaN NaN NaN NaN NaN
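
The NaN row above appears because df2 has no index label 3. Passing join='inner' together with axis=1 keeps only the row labels shared by both frames, dropping that row. A minimal sketch with frames of the same shapes as df1 and df2 (df2's values are arbitrary here):

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame([[-1, -2, 4, 3], [3, 4, -4, -3], [2, -5, 3, 2], [2, -5, 3, 2]],
                   columns=list('abcd'))                 # index 0..3
df2 = pd.DataFrame(np.arange(18).reshape(3, 6) + 0.5,
                   columns=list('fedcba'))               # index 0..2

# join='inner' intersects the row labels, so label 3 (missing in df2) is dropped:
inner = pd.concat([df1, df2], axis=1, join='inner', keys=['ONE', 'TWO'])
print(inner.shape)  # (3, 10) - no NaN-filled row
```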

2 Append

append is a shortcut for concat along axis=0:

In [8]:
d1 = pd.DataFrame(np.random.randint(10,99,(2,2)), index=[1001, 1002], columns=list('AB'))
d2 = pd.DataFrame(np.random.randint(10,99,(2,3)), index=[1002, 2001], columns=list('BCD'))

append allows overlapping indexes. Here we can see the two cases, with ignore_index left as False (the default) and set to True:

In [9]:
HTML(side_by_side2(d1, d2, d1.append(d2), d1.append(d2, ignore_index=True)))
Out[9]:
A B
1001 37 27
1002 10 94
B C D
1002 45 36 87
2001 32 71 64
A B C D
1001 37.0 27 NaN NaN
1002 10.0 94 NaN NaN
1002 NaN 45 36.0 87.0
2001 NaN 32 71.0 64.0
A B C D
0 37.0 27 NaN NaN
1 10.0 94 NaN NaN
2 NaN 45 36.0 87.0
3 NaN 32 71.0 64.0
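
Note that DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0; pd.concat reproduces both behaviours shown above. A minimal sketch with frames shaped like d1 and d2 (random values):

```python
import numpy as np
import pandas as pd

d1 = pd.DataFrame(np.random.randint(10, 99, (2, 2)), index=[1001, 1002], columns=list('AB'))
d2 = pd.DataFrame(np.random.randint(10, 99, (2, 3)), index=[1002, 2001], columns=list('BCD'))

# the modern replacement for d1.append(d2) and d1.append(d2, ignore_index=True):
stacked = pd.concat([d1, d2])                       # keeps the overlapping label 1002
renumbered = pd.concat([d1, d2], ignore_index=True) # renumbers the rows 0..3
print(stacked.index.tolist())     # [1001, 1002, 1002, 2001]
print(renumbered.index.tolist())  # [0, 1, 2, 3]
```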

3 Join

join is the equivalent of the homonymous SQL command. It is more flexible than append because it has a how parameter that specifies how to combine the indexes:

  • left keeps left indexes.
  • right keeps right indexes.
  • inner intersects indexes.
  • outer makes union of indexes.
In [10]:
HTML(side_by_side2(d1, d2))
Out[10]:
A B
1001 37 27
1002 10 94
B C D
1002 45 36 87
2001 32 71 64
In [11]:
d6 = d1.join(d2, rsuffix='_r', how='left')
d7 = d1.join(d2, rsuffix='_r', how='right')
d8 = d1.join(d2.D, how='inner')
d9 = d1.join(d2.D, how='outer')
HTML(side_by_side2(d6, d7, d8, d9))
Out[11]:
A B B_r C D
1001 37 27 NaN NaN NaN
1002 10 94 45.0 36.0 87.0
A B B_r C D
1002 10.0 94.0 45 36 87
2001 NaN NaN 32 71 64
A B D
1002 10 94 87
A B D
1001 37.0 27.0 NaN
1002 10.0 94.0 87.0
2001 NaN NaN 64.0

4 Merge

pandas provides a single function, merge, as the entry point for all standard database join operations between DataFrame objects:

pd.merge(left, right, how='inner', on=None, left_on=None, right_on=None,
         left_index=False, right_index=False, sort=True,
         suffixes=('_x', '_y'), copy=True)

  • on: Columns to join on. Must be found in both DataFrames
  • left_on: Columns from the left DataFrame to use as keys
  • right_on: same for right
  • left_index: If True, use the index (row labels) from the left DataFrame join key(s)
  • right_index: same for right
  • how: One of 'left', 'right', 'outer', 'inner'. Defaults to 'inner'.
  • suffixes: A tuple of string suffixes to apply to overlapping columns. Defaults to ('_x', '_y').
In [12]:
left = pd.DataFrame({'key': ['foo', 'foo'], 'lval': [1, 2]})
right = pd.DataFrame({'key': ['foo', 'bar'], 'rval': [4, 5]})
HTML(side_by_side2(left, right))
Out[12]:
key lval
0 foo 1
1 foo 2
key rval
0 foo 4
1 bar 5
In [13]:
pd.merge(left, right, on='key', how='right')
Out[13]:
key lval rval
0 foo 1.0 4
1 foo 2.0 4
2 bar NaN 5
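
merge also accepts an indicator parameter, useful for auditing which side each output row came from. A minimal sketch reusing the left/right frames above:

```python
import pandas as pd

left = pd.DataFrame({'key': ['foo', 'foo'], 'lval': [1, 2]})
right = pd.DataFrame({'key': ['foo', 'bar'], 'rval': [4, 5]})

# indicator=True adds a '_merge' column that records, for each output row,
# whether its key was found in 'left_only', 'right_only' or 'both' frames:
audit = pd.merge(left, right, on='key', how='outer', indicator=True)
print(audit[['key', '_merge']])
```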

A practical example: revenues is a small dataset that we want to join with the cities database to find the geographical location of the offices:

In [14]:
revenues = pd.read_csv('example_data/p05_d2.txt', index_col=0)
cities = pd.read_csv('example_data/cities.csv.bz2', compression='bz2',
                  names=['City', 'State', 'lat', 'lon'])
HTML(side_by_side2(revenues, cities.head(8)))
Out[14]:
City Name Revenues State
0 New York Roy 1250 NY
1 frisco Johnn 840 CA
2 houston Jim 349 tx
3 taft Paul 1100 OK
4 venice Ross 900 IL
City State lat lon
0 ROBERTS ID 43.698922 -112.173195
1 HODGENVILLE KY 37.559321 -85.707267
2 WILLIAMSPORT TN 35.722628 -87.212698
3 MAUMEE OH 41.571251 -83.685036
4 BERNE NY 42.602236 -74.154615
5 MONCLOVA OH 41.577251 -83.772378
6 MEADOWBROOK WV 39.346214 -80.313875
7 RUTH NV 39.342631 -114.885945

The 'cities' database contains 30101 records:

In [15]:
cities['City'].count()
Out[15]:
30101

First we map a normalization function to standardize the 'City' and 'State' names. Series.map applies the function passed as a parameter to each element of the Series:

In [16]:
revenues = revenues.reindex(columns=['Name','Revenues','City','State'])
revenues['City'] = revenues['City'].map(lambda str: str.upper().strip())
revenues['State'] = revenues['State'].map(lambda str: str.upper().strip())
display(revenues)
Name Revenues City State
0 Roy 1250 NEW YORK NY
1 Johnn 840 FRISCO CA
2 Jim 349 HOUSTON TX
3 Paul 1100 TAFT OK
4 Ross 900 VENICE IL
In [17]:
leftjoin = pd.merge(revenues, cities, left_on=['City', 'State'],
                                      right_on=['City', 'State'], how='left')
leftjoin = leftjoin.sort_values(by='Revenues', ascending=False)
HTML(side_by_side2(revenues, leftjoin))
Out[17]:
Name Revenues City State
0 Roy 1250 NEW YORK NY
1 Johnn 840 FRISCO CA
2 Jim 349 HOUSTON TX
3 Paul 1100 TAFT OK
4 Ross 900 VENICE IL
Name Revenues City State lat lon
0 Roy 1250 NEW YORK NY 40.757929 -73.985506
3 Paul 1100 TAFT OK 35.763648 -95.544501
4 Ross 900 VENICE IL 38.670250 -90.168859
1 Johnn 840 FRISCO CA NaN NaN
2 Jim 349 HOUSTON TX 29.759956 -95.362534
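
The left join keeps every row of revenues even when the lookup fails (FRISCO found no match above, hence its NaN coordinates), and the failed lookups can be isolated afterwards. A minimal sketch with reduced versions of the two frames:

```python
import pandas as pd

# miniature versions of the frames above (New York's coordinates as shown earlier)
revenues = pd.DataFrame({'Name': ['Roy', 'Johnn'], 'Revenues': [1250, 840],
                         'City': ['NEW YORK', 'FRISCO'], 'State': ['NY', 'CA']})
cities = pd.DataFrame({'City': ['NEW YORK'], 'State': ['NY'],
                       'lat': [40.757929], 'lon': [-73.985506]})

joined = pd.merge(revenues, cities, on=['City', 'State'], how='left')
# a NaN in 'lat' flags a City/State pair that found no match in the lookup table:
unmatched = joined[joined['lat'].isna()]
print(unmatched['City'].tolist())  # ['FRISCO']
```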

5 Pivoting

Pivoting is useful to analyze data in the so-called "stacked" or "record" format: each row stores a single (date, variable, value) observation, so the same variable appears in multiple rows:

In [18]:
data = pd.read_csv('example_data/pd06_pivot_Example.txt')
data
Out[18]:
Unnamed: 0 date variable value
0 0 2000-01-03 00:00:00 A 0.397018
1 1 2000-01-04 00:00:00 A 0.621109
2 2 2000-01-05 00:00:00 A -2.508284
3 3 2000-01-03 00:00:00 B 0.219796
4 4 2000-01-04 00:00:00 B 1.371027
5 5 2000-01-05 00:00:00 B -1.005504
6 6 2000-01-03 00:00:00 C -0.391149
7 7 2000-01-04 00:00:00 C 1.209346
8 8 2000-01-05 00:00:00 C -0.059617
9 9 2000-01-03 00:00:00 D -0.464558
10 10 2000-01-04 00:00:00 D -0.173605
11 11 2000-01-05 00:00:00 D -1.649632
In [19]:
pivot = data.pivot(index='date', columns='variable', values='value')
pivot.head()
Out[19]:
variable A B C D
date
2000-01-03 00:00:00 0.397018 0.219796 -0.391149 -0.464558
2000-01-04 00:00:00 0.621109 1.371027 1.209346 -0.173605
2000-01-05 00:00:00 -2.508284 -1.005504 -0.059617 -1.649632
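
pivot requires each (index, columns) pair to occur at most once; when the stacked data contains duplicate pairs, pivot raises a ValueError and pivot_table, which aggregates the duplicates (with the mean by default), must be used instead. A minimal sketch with made-up data:

```python
import pandas as pd

stacked = pd.DataFrame({'date': ['d1', 'd1', 'd2'],
                        'variable': ['A', 'A', 'A'],
                        'value': [1.0, 3.0, 5.0]})

# the pair ('d1', 'A') appears twice, so pivot cannot decide which value to keep:
try:
    stacked.pivot(index='date', columns='variable', values='value')
except ValueError as exc:
    print('pivot failed:', exc)

# pivot_table aggregates the duplicates instead (mean by default):
table = stacked.pivot_table(index='date', columns='variable', values='value')
print(table.loc['d1', 'A'])  # 2.0, the mean of 1.0 and 3.0
```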

A practical example: apply pivoting to a dataset containing the statistics of the US crimes State by State, downloaded from the Stanford University "Human-Computer Interaction" group (http://hci.stanford.edu/jheer/workshop/data/):

In [20]:
data2 = pd.read_csv('example_data/CrimeStatebyState.csv.bz2', compression='bz2')
data2.iloc[0:7000:1000]
Out[20]:
State Type of Crime Crime Year Count
0 Alabama Violent Crime Murder and nonnegligent Manslaughter 1960 406
1000 Arkansas Violent Crime Murder and nonnegligent Manslaughter 1994 294
2000 Connecticut Violent Crime Forcible rape 1982 692
3000 Florida Violent Crime Robbery 1970 12636
4000 Idaho Violent Crime Robbery 2004 241
5000 Iowa Violent Crime Aggravated assault 1992 6131
6000 Louisiana Property Crime Burglary 1980 63997

Definition: pd.pivot_table(data, values=None, index=None, columns=None, aggfunc='mean', fill_value=None, margins=False, dropna=True)

  • values: Columns to aggregate
  • aggfunc: Default is np.mean. This can be a list of functions as well
  • margins: Boolean; if True, add row/column subtotals and grand totals
  • dropna: Exclude columns whose entries are all NaN
In [21]:
pd.set_option("display.colheader_justify","right")
pivot2 = pd.pivot_table(data2, values='Count',
                        index=['Year'],
                        columns=['State'],
                        margins=True, aggfunc=np.sum)
pivot2.iloc[-10:,-8:]
Out[21]:
State Utah Vermont Virginia Washington West Virginia Wisconsin Wyoming All
Year
1997 123447 16658 261022 332469 44839 190133 20068 13195074
1998 115624 18552 248576 333799 46130 185093 18315 12485714
1999 105999 16735 231886 302509 49161 173062 16583 11634378
2000 99958 18185 214348 300932 47067 172124 16285 11608070
2001 96307 16978 228445 308492 46120 179410 17392 11849006
2002 103129 15600 229039 309931 45320 176987 17858 11878954
2003 105973 14510 220939 312814 46997 169788 17962 11826538
2004 103409 15272 220976 322167 51436 158258 18052 11679474
2005 101158 14956 221044 329406 52653 160646 17242 11556854
All 3432582 716716 9264900 10942317 1709475 7399321 726727 486181458
In [22]:
pivot3 = pd.pivot_table(data2, values='Count',
                        index=['Year', 'Type of Crime'],
                        columns=['State'],
                        margins=True, aggfunc=np.sum)
pivot3.iloc[-10:,:8]
Out[22]:
State Alabama Alaska Arizona Arkansas California Colorado Connecticut Delaware
Year Type of Crime
2001 Violent Crime 19582 3735 28675 12190 212855 15492 11492 4868
2002 Property Crime 180400 24118 318296 101171 1176484 180054 93426 26967
Violent Crime 19931 3627 30171 11501 208388 15882 10807 4836
2003 Property Crime 182241 24386 314335 99084 1215086 179706 92981 27943
Violent Crime 19331 3877 28638 12449 205551 15757 11045 5525
2004 Property Crime 182340 22172 291203 110911 1227194 180322 93942 27256
Violent Crime 19324 4159 28952 13814 189175 17121 10113 5105
2005 Property Crime 177393 23975 287345 112775 1200531 188449 89794 26245
Violent Crime 19678 4194 30478 14659 190178 18498 9635 5332
All 6745549 1019564 9436942 3670925 66614776 7359322 5649669 1359238

If we want to select the crimes for a given year (rows) and/or location (columns), we have to select by label with loc: iloc is purely positional, so pivot2.iloc[2004:2005, :8] would look for row positions (not year labels) and return an empty frame.

In [23]:
pivot2.loc[[2004, 2005]].iloc[:, :8]
Out[23]:
State Alabama Alaska Arizona Arkansas California Colorado Connecticut Delaware
Year
2004 201664 26331 320155 124725 1416369 197443 104055 32361
2005 197071 28169 317823 127434 1390709 206947 99429 31577

6 Stack and Unstack

stack "pivots" a level of the column labels to the innermost row level. In the following case, since no column labels remain after stacking, stack returns a Series instead of a DataFrame:

In [24]:
pivot3.iloc[:5,:8]
Out[24]:
State Alabama Alaska Arizona Arkansas California Colorado Connecticut Delaware
Year Type of Crime
1960 Property Crime 33823 3494 36539 16548 508511 35695 28393 9267
Violent Crime 6097 236 2704 1924 37558 2408 928 375
1961 Property Crime 32541 3605 41304 16836 521008 40156 31658 9343
Violent Crime 5564 208 2288 1809 38154 2659 878 318
1962 Property Crime 35829 3849 45694 18199 564451 47872 33798 10157
In [25]:
stk = pivot3.stack()
stk[:10]
Out[25]:
Year  Type of Crime   State               
1960  Property Crime  Alabama                  33823
                      Alaska                    3494
                      Arizona                  36539
                      Arkansas                 16548
                      California              508511
                      Colorado                 35695
                      Connecticut              28393
                      Delaware                  9267
                      District of Columbia     16495
                      Florida                 122858
dtype: int64

unstack reverses the operation by pivoting a row index level back onto the columns:

In [26]:
unstk = stk.unstack(-1)
unstk.iloc[:5,:8]
Out[26]:
State Alabama Alaska Arizona Arkansas California Colorado Connecticut Delaware
Year Type of Crime
1960 Property Crime 33823 3494 36539 16548 508511 35695 28393 9267
Violent Crime 6097 236 2704 1924 37558 2408 928 375
1961 Property Crime 32541 3605 41304 16836 521008 40156 31658 9343
Violent Crime 5564 208 2288 1809 38154 2659 878 318
1962 Property Crime 35829 3849 45694 18199 564451 47872 33798 10157
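
stack and unstack are inverses of each other: stacking the column level and then unstacking it again reproduces the original frame exactly. A minimal sketch with a toy frame shaped like pivot3 (made-up values):

```python
import numpy as np
import pandas as pd

idx = pd.MultiIndex.from_product([[1960, 1961], ['Property Crime', 'Violent Crime']],
                                 names=['Year', 'Type of Crime'])
df = pd.DataFrame(np.arange(8).reshape(4, 2), index=idx,
                  columns=pd.Index(['Alabama', 'Alaska'], name='State'))

# stack moves 'State' to the innermost row level (yielding a Series);
# unstack(-1) moves the innermost row level back onto the columns:
round_trip = df.stack().unstack(-1)
print(round_trip.equals(df))  # True
```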

Visit www.add-for.com for more tutorials and updates.

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.