import addutils.toc ; addutils.toc.js(ipy_notebook=True)
In this tutorial we are going to see how to combine pandas data structures.
import numpy as np
import pandas as pd
from IPython.display import (display, HTML)
from addutils import side_by_side2, css_notebook
css_notebook()
pandas.concat concatenates two (or more) pandas objects along a particular axis (the default is axis=0, the vertical one).
data = np.array([[-1, -2, 4, 3],[3,4, -4, -3], [2, -5, 3, 2], [2, -5, 3, 2]])
df1 = pd.DataFrame(data, columns = ['a','b','c','d'])
df2 = pd.DataFrame(np.random.randint(5, size=(3,6)) + 0.5, columns = list('fedcba'))
HTML(side_by_side2(df1, df2))
a | b | c | d | |
---|---|---|---|---|
0 | -1 | -2 | 4 | 3 |
1 | 3 | 4 | -4 | -3 |
2 | 2 | -5 | 3 | 2 |
3 | 2 | -5 | 3 | 2 |
f | e | d | c | b | a | |
---|---|---|---|---|---|---|
0 | 2.5 | 2.5 | 4.5 | 3.5 | 3.5 | 4.5 |
1 | 3.5 | 3.5 | 4.5 | 3.5 | 3.5 | 1.5 |
2 | 0.5 | 1.5 | 1.5 | 0.5 | 4.5 | 4.5 |
When gluing together multiple DataFrames, you have a choice of how to handle the other axes (other than the one being concatenated). This is controlled by the join parameter:
HTML(side_by_side2(pd.concat([df1, df2]), pd.concat([df1, df2], join='inner')))
a | b | c | d | e | f | |
---|---|---|---|---|---|---|
0 | -1.0 | -2.0 | 4.0 | 3.0 | NaN | NaN |
1 | 3.0 | 4.0 | -4.0 | -3.0 | NaN | NaN |
2 | 2.0 | -5.0 | 3.0 | 2.0 | NaN | NaN |
3 | 2.0 | -5.0 | 3.0 | 2.0 | NaN | NaN |
0 | 4.5 | 3.5 | 3.5 | 4.5 | 2.5 | 2.5 |
1 | 1.5 | 3.5 | 3.5 | 4.5 | 3.5 | 3.5 |
2 | 4.5 | 4.5 | 0.5 | 1.5 | 1.5 | 0.5 |
a | b | c | d | |
---|---|---|---|---|
0 | -1.0 | -2.0 | 4.0 | 3.0 |
1 | 3.0 | 4.0 | -4.0 | -3.0 |
2 | 2.0 | -5.0 | 3.0 | 2.0 |
3 | 2.0 | -5.0 | 3.0 | 2.0 |
0 | 4.5 | 3.5 | 3.5 | 4.5 |
1 | 1.5 | 3.5 | 3.5 | 4.5 |
2 | 4.5 | 4.5 | 0.5 | 1.5 |
It is possible to create a MultiIndex DataFrame by setting the keys parameter.
pd.concat([df1, df2], keys = ('First', 'Second'))
a | b | c | d | e | f | ||
---|---|---|---|---|---|---|---|
First | 0 | -1.0 | -2.0 | 4.0 | 3.0 | NaN | NaN |
1 | 3.0 | 4.0 | -4.0 | -3.0 | NaN | NaN | |
2 | 2.0 | -5.0 | 3.0 | 2.0 | NaN | NaN | |
3 | 2.0 | -5.0 | 3.0 | 2.0 | NaN | NaN | |
Second | 0 | 4.5 | 3.5 | 3.5 | 4.5 | 2.5 | 2.5 |
1 | 1.5 | 3.5 | 3.5 | 4.5 | 3.5 | 3.5 | |
2 | 4.5 | 4.5 | 0.5 | 1.5 | 1.5 | 0.5 |
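The keys become the outer level of the row index, so a single key selects the corresponding block back out. A minimal sketch (with hypothetical frames a and b, not the df1/df2 above):

```python
import pandas as pd

# Two tiny frames, concatenated with 'keys' as in the example above
a = pd.DataFrame({'x': [1, 2]})
b = pd.DataFrame({'x': [3, 4]})
combo = pd.concat([a, b], keys=('First', 'Second'))

# The keys form the outer level of the MultiIndex, so .loc picks one block
print(combo.loc['Second']['x'].tolist())   # [3, 4]
```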
Note that concat allows overlapping indexes. To avoid overlapping indexes, the ignore_index parameter has to be set to True.
pd.concat([df1, df2], ignore_index = True)
a | b | c | d | e | f | |
---|---|---|---|---|---|---|
0 | -1.0 | -2.0 | 4.0 | 3.0 | NaN | NaN |
1 | 3.0 | 4.0 | -4.0 | -3.0 | NaN | NaN |
2 | 2.0 | -5.0 | 3.0 | 2.0 | NaN | NaN |
3 | 2.0 | -5.0 | 3.0 | 2.0 | NaN | NaN |
4 | 4.5 | 3.5 | 3.5 | 4.5 | 2.5 | 2.5 |
5 | 1.5 | 3.5 | 3.5 | 4.5 | 3.5 | 3.5 |
6 | 4.5 | 4.5 | 0.5 | 1.5 | 1.5 | 0.5 |
And if we want to concatenate the DataFrames on a different axis:
concatenatedaxis = pd.concat([df1, df2], axis = 1, keys = ['ONE','TWO'])
concatenatedaxis
ONE | TWO | |||||||||
---|---|---|---|---|---|---|---|---|---|---|
a | b | c | d | f | e | d | c | b | a | |
0 | -1 | -2 | 4 | 3 | 2.5 | 2.5 | 4.5 | 3.5 | 3.5 | 4.5 |
1 | 3 | 4 | -4 | -3 | 3.5 | 3.5 | 4.5 | 3.5 | 3.5 | 1.5 |
2 | 2 | -5 | 3 | 2 | 0.5 | 1.5 | 1.5 | 0.5 | 4.5 | 4.5 |
3 | 2 | -5 | 3 | 2 | NaN | NaN | NaN | NaN | NaN | NaN |
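With axis=1 the keys end up on the columns instead, producing hierarchical column labels, and selecting a key returns the original block. A small sketch with hypothetical frames x and y:

```python
import pandas as pd

x = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
y = pd.DataFrame({'a': [5, 6], 'c': [7, 8]})
both = pd.concat([x, y], axis=1, keys=['ONE', 'TWO'])

# The top level of the column MultiIndex is the key
one = both['ONE']                 # only the columns that came from x
print(one.columns.tolist())       # ['a', 'b']
print(both['TWO']['c'].tolist())  # [7, 8]
```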
append is a shortcut for concat along axis=0:
d1 = pd.DataFrame(np.random.randint(10,99,(2,2)), index=[1001, 1002], columns=list('AB'))
d2 = pd.DataFrame(np.random.randint(10,99,(2,3)), index=[1002, 2001], columns=list('BCD'))
append allows overlapping indexes. Here we can see both cases, with ignore_index left at its default of False and then set to True:
HTML(side_by_side2(d1, d2, d1.append(d2), d1.append(d2, ignore_index=True)))
A | B | |
---|---|---|
1001 | 37 | 27 |
1002 | 10 | 94 |
B | C | D | |
---|---|---|---|
1002 | 45 | 36 | 87 |
2001 | 32 | 71 | 64 |
A | B | C | D | |
---|---|---|---|---|
1001 | 37.0 | 27 | NaN | NaN |
1002 | 10.0 | 94 | NaN | NaN |
1002 | NaN | 45 | 36.0 | 87.0 |
2001 | NaN | 32 | 71.0 | 64.0 |
A | B | C | D | |
---|---|---|---|---|
0 | 37.0 | 27 | NaN | NaN |
1 | 10.0 | 94 | NaN | NaN |
2 | NaN | 45 | 36.0 | 87.0 |
3 | NaN | 32 | 71.0 | 64.0 |
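Note that DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0; pd.concat is the drop-in replacement. A sketch of the equivalence, with two small hypothetical frames:

```python
import numpy as np
import pandas as pd

d1 = pd.DataFrame(np.arange(4).reshape(2, 2), index=[1001, 1002], columns=list('AB'))
d2 = pd.DataFrame(np.arange(6).reshape(2, 3), index=[1002, 2001], columns=list('BCD'))

# d1.append(d2) and pd.concat([d1, d2]) build the same frame
stacked = pd.concat([d1, d2])                        # keeps duplicate labels
renumbered = pd.concat([d1, d2], ignore_index=True)  # fresh 0..n-1 index
print(stacked.index.tolist())      # [1001, 1002, 1002, 2001]
print(renumbered.index.tolist())   # [0, 1, 2, 3]
```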
join is the equivalent of the homonymous SQL command. It is more flexible than append because its how parameter specifies how to combine the indexes:

  * left: keeps the left indexes
  * right: keeps the right indexes
  * inner: intersects the indexes
  * outer: takes the union of the indexes

HTML(side_by_side2(d1, d2))
A | B | |
---|---|---|
1001 | 37 | 27 |
1002 | 10 | 94 |
B | C | D | |
---|---|---|---|
1002 | 45 | 36 | 87 |
2001 | 32 | 71 | 64 |
d6 = d1.join(d2, rsuffix='_r', how='left')
d7 = d1.join(d2, rsuffix='_r', how='right')
d8 = d1.join(d2.D, how='inner')
d9 = d1.join(d2.D, how='outer')
HTML(side_by_side2(d6, d7, d8, d9))
A | B | B_r | C | D | |
---|---|---|---|---|---|
1001 | 37 | 27 | NaN | NaN | NaN |
1002 | 10 | 94 | 45.0 | 36.0 | 87.0 |
A | B | B_r | C | D | |
---|---|---|---|---|---|
1002 | 10.0 | 94.0 | 45 | 36 | 87 |
2001 | NaN | NaN | 32 | 71 | 64 |
A | B | D | |
---|---|---|---|
1002 | 10 | 94 | 87 |
A | B | D | |
---|---|---|---|
1001 | 37.0 | 27.0 | NaN |
1002 | 10.0 | 94.0 | 87.0 |
2001 | NaN | NaN | 64.0 |
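Keep in mind that join aligns on the index, not on columns; to join on a column you either set_index first or fall back to merge. A minimal sketch with hypothetical frames:

```python
import pandas as pd

d1 = pd.DataFrame({'A': [1, 2]}, index=[1001, 1002])
d2 = pd.DataFrame({'D': [9, 8]}, index=[1002, 2001])

# join aligns on the row labels; how='outer' keeps the union of labels
out = d1.join(d2, how='outer')
print(out.index.tolist())   # [1001, 1002, 2001]
```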
pandas provides a single function, merge, as the entry point for all standard database join operations between DataFrame objects:

pd.merge(left, right, how='inner', on=None, left_on=None, right_on=None,
         left_index=False, right_index=False, sort=True,
         suffixes=('_x', '_y'), copy=True)
  * on: columns to join on; must be found in both DataFrames
  * left_on: columns from the left DataFrame to use as keys
  * right_on: same for the right DataFrame
  * left_index: if True, use the index (row labels) from the left DataFrame as its join key(s)
  * right_index: same for the right DataFrame
  * how: one of 'left', 'right', 'outer', 'inner'; defaults to 'inner'
  * suffixes: a tuple of string suffixes to apply to overlapping columns; defaults to ('_x', '_y')

left = pd.DataFrame({'key': ['foo', 'foo'], 'lval': [1, 2]})
right = pd.DataFrame({'key': ['foo', 'bar'], 'rval': [4, 5]})
HTML(side_by_side2(left, right))
key | lval | |
---|---|---|
0 | foo | 1 |
1 | foo | 2 |
key | rval | |
---|---|---|
0 | foo | 4 |
1 | bar | 5 |
pd.merge(left, right, on='key', how='right')
key | lval | rval | |
---|---|---|---|
0 | foo | 1.0 | 4 |
1 | foo | 2.0 | 4 |
2 | bar | NaN | 5 |
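When auditing a merge it can help to know where each row came from; merge's indicator parameter adds a _merge column with exactly that information. A sketch on the same two frames:

```python
import pandas as pd

left = pd.DataFrame({'key': ['foo', 'foo'], 'lval': [1, 2]})
right = pd.DataFrame({'key': ['foo', 'bar'], 'rval': [4, 5]})

# indicator=True adds a '_merge' column: 'left_only', 'right_only' or 'both'
out = pd.merge(left, right, on='key', how='outer', indicator=True)
print(out['_merge'].tolist())
```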
A practical example: revenues is a small dataset that we want to join with the 'cities' database to find the geographical location of the offices:
revenues = pd.read_csv('example_data/p05_d2.txt', index_col=0)
cities = pd.read_csv('example_data/cities.csv.bz2', compression='bz2',
names=['City', 'State', 'lat', 'lon'])
HTML(side_by_side2(revenues, cities.head(8)))
City | Name | Revenues | State | |
---|---|---|---|---|
0 | New York | Roy | 1250 | NY |
1 | frisco | Johnn | 840 | CA |
2 | houston | Jim | 349 | tx |
3 | taft | Paul | 1100 | OK |
4 | venice | Ross | 900 | IL |
City | State | lat | lon | |
---|---|---|---|---|
0 | ROBERTS | ID | 43.698922 | -112.173195 |
1 | HODGENVILLE | KY | 37.559321 | -85.707267 |
2 | WILLIAMSPORT | TN | 35.722628 | -87.212698 |
3 | MAUMEE | OH | 41.571251 | -83.685036 |
4 | BERNE | NY | 42.602236 | -74.154615 |
5 | MONCLOVA | OH | 41.577251 | -83.772378 |
6 | MEADOWBROOK | WV | 39.346214 | -80.313875 |
7 | RUTH | NV | 39.342631 | -114.885945 |
The 'cities' database contains 30101 records:
cities['City'].count()
30101
First we map a normalization function to standardize the 'City' and 'State' names. map applies the function passed as a parameter to every element of the Series:
revenues = revenues.reindex(columns=['Name','Revenues','City','State'])
revenues['City'] = revenues['City'].map(lambda str: str.upper().strip())
revenues['State'] = revenues['State'].map(lambda str: str.upper().strip())
display(revenues)
Name | Revenues | City | State | |
---|---|---|---|---|
0 | Roy | 1250 | NEW YORK | NY |
1 | Johnn | 840 | FRISCO | CA |
2 | Jim | 349 | HOUSTON | TX |
3 | Paul | 1100 | TAFT | OK |
4 | Ross | 900 | VENICE | IL |
leftjoin = pd.merge(revenues, cities, left_on=['City', 'State'],
right_on=['City', 'State'], how='left')
leftjoin = leftjoin.sort_values(by='Revenues', ascending=False)
HTML(side_by_side2(revenues, leftjoin))
Name | Revenues | City | State | |
---|---|---|---|---|
0 | Roy | 1250 | NEW YORK | NY |
1 | Johnn | 840 | FRISCO | CA |
2 | Jim | 349 | HOUSTON | TX |
3 | Paul | 1100 | TAFT | OK |
4 | Ross | 900 | VENICE | IL |
Name | Revenues | City | State | lat | lon | |
---|---|---|---|---|---|---|
0 | Roy | 1250 | NEW YORK | NY | 40.757929 | -73.985506 |
3 | Paul | 1100 | TAFT | OK | 35.763648 | -95.544501 |
4 | Ross | 900 | VENICE | IL | 38.670250 | -90.168859 |
1 | Johnn | 840 | FRISCO | CA | NaN | NaN |
2 | Jim | 349 | HOUSTON | TX | 29.759956 | -95.362534 |
pivot_table is useful to analyze data in the so-called "stacked" or "record" format: each field of each record is stored in a column, and similar fields can appear multiple times:
data = pd.read_csv('example_data/pd06_pivot_Example.txt')
data
Unnamed: 0 | date | variable | value | |
---|---|---|---|---|
0 | 0 | 2000-01-03 00:00:00 | A | 0.397018 |
1 | 1 | 2000-01-04 00:00:00 | A | 0.621109 |
2 | 2 | 2000-01-05 00:00:00 | A | -2.508284 |
3 | 3 | 2000-01-03 00:00:00 | B | 0.219796 |
4 | 4 | 2000-01-04 00:00:00 | B | 1.371027 |
5 | 5 | 2000-01-05 00:00:00 | B | -1.005504 |
6 | 6 | 2000-01-03 00:00:00 | C | -0.391149 |
7 | 7 | 2000-01-04 00:00:00 | C | 1.209346 |
8 | 8 | 2000-01-05 00:00:00 | C | -0.059617 |
9 | 9 | 2000-01-03 00:00:00 | D | -0.464558 |
10 | 10 | 2000-01-04 00:00:00 | D | -0.173605 |
11 | 11 | 2000-01-05 00:00:00 | D | -1.649632 |
pivot = data.pivot(index='date', columns='variable', values='value')
pivot.head()
variable | A | B | C | D |
---|---|---|---|---|
date | ||||
2000-01-03 00:00:00 | 0.397018 | 0.219796 | -0.391149 | -0.464558 |
2000-01-04 00:00:00 | 0.621109 | 1.371027 | 1.209346 | -0.173605 |
2000-01-05 00:00:00 | -2.508284 | -1.005504 | -0.059617 | -1.649632 |
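Note that pivot refuses duplicated (index, columns) pairs, whereas pivot_table aggregates them with aggfunc (the mean by default). A small sketch with hypothetical data:

```python
import pandas as pd

dup = pd.DataFrame({'date': ['d1', 'd1', 'd2'],
                    'variable': ['A', 'A', 'A'],   # ('d1', 'A') appears twice
                    'value': [1.0, 3.0, 5.0]})

# pivot raises on duplicate index/column pairs...
try:
    dup.pivot(index='date', columns='variable', values='value')
except ValueError as err:
    print('pivot failed:', err)

# ...while pivot_table aggregates them (mean by default)
table = dup.pivot_table(index='date', columns='variable', values='value')
print(table.loc['d1', 'A'])   # mean of 1.0 and 3.0 -> 2.0
```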
A practical example: apply pivoting to a dataset containing the statistics of US crimes State by State, downloaded from the Stanford University "Human-Computer Interaction" group (http://hci.stanford.edu/jheer/workshop/data/):
data2 = pd.read_csv('example_data/CrimeStatebyState.csv.bz2', compression='bz2')
data2.iloc[0:7000:1000]
State | Type of Crime | Crime | Year | Count | |
---|---|---|---|---|---|
0 | Alabama | Violent Crime | Murder and nonnegligent Manslaughter | 1960 | 406 |
1000 | Arkansas | Violent Crime | Murder and nonnegligent Manslaughter | 1994 | 294 |
2000 | Connecticut | Violent Crime | Forcible rape | 1982 | 692 |
3000 | Florida | Violent Crime | Robbery | 1970 | 12636 |
4000 | Idaho | Violent Crime | Robbery | 2004 | 241 |
5000 | Iowa | Violent Crime | Aggravated assault | 1992 | 6131 |
6000 | Louisiana | Property Crime | Burglary | 1980 | 63997 |
Definition: pd.pivot_table(data, values=None, index=None, columns=None, aggfunc='mean', fill_value=None, margins=False, dropna=True)

  * values: columns to aggregate
  * aggfunc: default is np.mean; this can be a list of functions as well
  * margins: Boolean; add subtotals and grand totals
  * dropna: exclude columns whose entries are all NaN

pd.set_option("display.colheader_justify","right")
pivot2 = pd.pivot_table(data2, values='Count',
index=['Year'],
columns=['State'],
margins=True, aggfunc=np.sum)
pivot2.iloc[-10:,-8:]
State | Utah | Vermont | Virginia | Washington | West Virginia | Wisconsin | Wyoming | All |
---|---|---|---|---|---|---|---|---|
Year | ||||||||
1997 | 123447 | 16658 | 261022 | 332469 | 44839 | 190133 | 20068 | 13195074 |
1998 | 115624 | 18552 | 248576 | 333799 | 46130 | 185093 | 18315 | 12485714 |
1999 | 105999 | 16735 | 231886 | 302509 | 49161 | 173062 | 16583 | 11634378 |
2000 | 99958 | 18185 | 214348 | 300932 | 47067 | 172124 | 16285 | 11608070 |
2001 | 96307 | 16978 | 228445 | 308492 | 46120 | 179410 | 17392 | 11849006 |
2002 | 103129 | 15600 | 229039 | 309931 | 45320 | 176987 | 17858 | 11878954 |
2003 | 105973 | 14510 | 220939 | 312814 | 46997 | 169788 | 17962 | 11826538 |
2004 | 103409 | 15272 | 220976 | 322167 | 51436 | 158258 | 18052 | 11679474 |
2005 | 101158 | 14956 | 221044 | 329406 | 52653 | 160646 | 17242 | 11556854 |
All | 3432582 | 716716 | 9264900 | 10942317 | 1709475 | 7399321 | 726727 | 486181458 |
pivot3 = pd.pivot_table(data2, values='Count',
index=['Year', 'Type of Crime'],
columns=['State'],
margins=True, aggfunc=np.sum)
pivot3.iloc[-10:,:8]
State | Alabama | Alaska | Arizona | Arkansas | California | Colorado | Connecticut | Delaware | |
---|---|---|---|---|---|---|---|---|---|
Year | Type of Crime | ||||||||
2001 | Violent Crime | 19582 | 3735 | 28675 | 12190 | 212855 | 15492 | 11492 | 4868 |
2002 | Property Crime | 180400 | 24118 | 318296 | 101171 | 1176484 | 180054 | 93426 | 26967 |
Violent Crime | 19931 | 3627 | 30171 | 11501 | 208388 | 15882 | 10807 | 4836 | |
2003 | Property Crime | 182241 | 24386 | 314335 | 99084 | 1215086 | 179706 | 92981 | 27943 |
Violent Crime | 19331 | 3877 | 28638 | 12449 | 205551 | 15757 | 11045 | 5525 | |
2004 | Property Crime | 182340 | 22172 | 291203 | 110911 | 1227194 | 180322 | 93942 | 27256 |
Violent Crime | 19324 | 4159 | 28952 | 13814 | 189175 | 17121 | 10113 | 5105 | |
2005 | Property Crime | 177393 | 23975 | 287345 | 112775 | 1200531 | 188449 | 89794 | 26245 |
Violent Crime | 19678 | 4194 | 30478 | 14659 | 190178 | 18498 | 9635 | 5332 | |
All | 6745549 | 1019564 | 9436942 | 3670925 | 66614776 | 7359322 | 5649669 | 1359238 |
If we want to select crimes for a given year and/or location (note that iloc is purely positional, so year labels such as 2004 must be looked up with loc):

pivot2.loc[[2004, 2005]].iloc[:, :8]
stack "pivots" a level of the column labels to a new inner-most row level. In the following case, since after stacking there are no column labels remaining, stack returns a Series instead of a DataFrame:
pivot3.iloc[:5,:8]
State | Alabama | Alaska | Arizona | Arkansas | California | Colorado | Connecticut | Delaware | |
---|---|---|---|---|---|---|---|---|---|
Year | Type of Crime | ||||||||
1960 | Property Crime | 33823 | 3494 | 36539 | 16548 | 508511 | 35695 | 28393 | 9267 |
Violent Crime | 6097 | 236 | 2704 | 1924 | 37558 | 2408 | 928 | 375 | |
1961 | Property Crime | 32541 | 3605 | 41304 | 16836 | 521008 | 40156 | 31658 | 9343 |
Violent Crime | 5564 | 208 | 2288 | 1809 | 38154 | 2659 | 878 | 318 | |
1962 | Property Crime | 35829 | 3849 | 45694 | 18199 | 564451 | 47872 | 33798 | 10157 |
stk = pivot3.stack()
stk[:10]
Year  Type of Crime   State
1960  Property Crime  Alabama                  33823
                      Alaska                    3494
                      Arizona                  36539
                      Arkansas                 16548
                      California              508511
                      Colorado                 35695
                      Connecticut              28393
                      Delaware                  9267
                      District of Columbia     16495
                      Florida                 122858
dtype: int64
unstack reverses the operation by pivoting a level of the row index back to the columns:
unstk = stk.unstack(-1)
unstk.iloc[:5,:8]
State | Alabama | Alaska | Arizona | Arkansas | California | Colorado | Connecticut | Delaware | |
---|---|---|---|---|---|---|---|---|---|
Year | Type of Crime | ||||||||
1960 | Property Crime | 33823 | 3494 | 36539 | 16548 | 508511 | 35695 | 28393 | 9267 |
Violent Crime | 6097 | 236 | 2704 | 1924 | 37558 | 2408 | 928 | 375 | |
1961 | Property Crime | 32541 | 3605 | 41304 | 16836 | 521008 | 40156 | 31658 | 9343 |
Violent Crime | 5564 | 208 | 2288 | 1809 | 38154 | 2659 | 878 | 318 | |
1962 | Property Crime | 35829 | 3849 | 45694 | 18199 | 564451 | 47872 | 33798 | 10157 |
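unstack also accepts a level name instead of a position, which reads better when the MultiIndex levels are named. A sketch on a small hypothetical Series:

```python
import pandas as pd

idx = pd.MultiIndex.from_product([[1960, 1961], ['Property', 'Violent']],
                                 names=['Year', 'Type'])
s = pd.Series([10, 1, 12, 2], index=idx)

# unstack(-1) and unstack('Type') pivot the same (inner-most) level
by_pos = s.unstack(-1)
by_name = s.unstack('Type')
print(by_pos.equals(by_name))   # True
```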
Visit www.add-for.com for more tutorials and updates.
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.