Merge and pivot

In [1]:
import addutils.toc ; addutils.toc.js(ipy_notebook=True)
Out[1]:

In this tutorial we are going to see how to combine pandas data structures.

In [2]:
import numpy as np
import pandas as pd
from IPython.display import (display, HTML)
from addutils import side_by_side2, css_notebook
css_notebook()
Out[2]:

1 Concat

pandas.concat concatenates two (or more) pandas objects along a particular axis (the vertical one, axis=0, by default).

In [3]:
data = np.array([[-1, -2, 4, 3],[3,4, -4, -3], [2, -5, 3, 2], [2, -5, 3, 2]])
df1 = pd.DataFrame(data, columns = ['a','b','c','d'])
df2 = pd.DataFrame(np.random.randint(5, size=(3,6)) + 0.5, columns = list('fedcba'))
HTML(side_by_side2(df1, df2))
Out[3]:
a b c d
0 -1 -2 4 3
1 3 4 -4 -3
2 2 -5 3 2
3 2 -5 3 2
f e d c b a
0 2.5 2.5 4.5 3.5 3.5 4.5
1 3.5 3.5 4.5 3.5 3.5 1.5
2 0.5 1.5 1.5 0.5 4.5 4.5

When gluing together multiple DataFrames, you have a choice of how to handle the other axes (the ones not being concatenated). This is controlled by the join parameter:

In [4]:
HTML(side_by_side2(pd.concat([df1, df2]), pd.concat([df1, df2], join='inner')))
Out[4]:
a b c d e f
0 -1.0 -2.0 4.0 3.0 NaN NaN
1 3.0 4.0 -4.0 -3.0 NaN NaN
2 2.0 -5.0 3.0 2.0 NaN NaN
3 2.0 -5.0 3.0 2.0 NaN NaN
0 4.5 3.5 3.5 4.5 2.5 2.5
1 1.5 3.5 3.5 4.5 3.5 3.5
2 4.5 4.5 0.5 1.5 1.5 0.5
a b c d
0 -1.0 -2.0 4.0 3.0
1 3.0 4.0 -4.0 -3.0
2 2.0 -5.0 3.0 2.0
3 2.0 -5.0 3.0 2.0
0 4.5 3.5 3.5 4.5
1 1.5 3.5 3.5 4.5
2 4.5 4.5 0.5 1.5

It is possible to create a MultiIndex DataFrame by setting the keys parameter.

In [5]:
pd.concat([df1, df2], keys = ('First', 'Second'))
Out[5]:
a b c d e f
First 0 -1.0 -2.0 4.0 3.0 NaN NaN
1 3.0 4.0 -4.0 -3.0 NaN NaN
2 2.0 -5.0 3.0 2.0 NaN NaN
3 2.0 -5.0 3.0 2.0 NaN NaN
Second 0 4.5 3.5 3.5 4.5 2.5 2.5
1 1.5 3.5 3.5 4.5 3.5 3.5
2 4.5 4.5 0.5 1.5 1.5 0.5

Note that concat allows overlapping indexes. To avoid duplicate index labels, set the parameter ignore_index to True.

In [6]:
pd.concat([df1, df2], ignore_index = True)
Out[6]:
a b c d e f
0 -1.0 -2.0 4.0 3.0 NaN NaN
1 3.0 4.0 -4.0 -3.0 NaN NaN
2 2.0 -5.0 3.0 2.0 NaN NaN
3 2.0 -5.0 3.0 2.0 NaN NaN
4 4.5 3.5 3.5 4.5 2.5 2.5
5 1.5 3.5 3.5 4.5 3.5 3.5
6 4.5 4.5 0.5 1.5 1.5 0.5

And if we want to concatenate the DataFrames on a different axis:

In [7]:
concatenatedaxis = pd.concat([df1, df2], axis = 1, keys = ['ONE','TWO'])
concatenatedaxis
Out[7]:
ONE TWO
a b c d f e d c b a
0 -1 -2 4 3 2.5 2.5 4.5 3.5 3.5 4.5
1 3 4 -4 -3 3.5 3.5 4.5 3.5 3.5 1.5
2 2 -5 3 2 0.5 1.5 1.5 0.5 4.5 4.5
3 2 -5 3 2 NaN NaN NaN NaN NaN NaN
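
The NaN row above appears because df2 has no index label 3. Passing join='inner' together with axis=1 keeps only the row labels shared by both frames, dropping that row. A minimal sketch with frames of the same shapes as df1 and df2 (df2's values are arbitrary here):

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame([[-1, -2, 4, 3], [3, 4, -4, -3], [2, -5, 3, 2], [2, -5, 3, 2]],
                   columns=list('abcd'))                 # index 0..3
df2 = pd.DataFrame(np.arange(18).reshape(3, 6) + 0.5,
                   columns=list('fedcba'))               # index 0..2

# join='inner' intersects the row labels, so label 3 (missing in df2) is dropped:
inner = pd.concat([df1, df2], axis=1, join='inner', keys=['ONE', 'TWO'])
print(inner.shape)  # (3, 10) - no NaN-filled row
```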

2 Append

append is a shortcut for concat along axis=0:

In [8]:
d1 = pd.DataFrame(np.random.randint(10,99,(2,2)), index=[1001, 1002], columns=list('AB'))
d2 = pd.DataFrame(np.random.randint(10,99,(2,3)), index=[1002, 2001], columns=list('BCD'))

append allows overlapping indexes. Here we can see the two cases, with ignore_index left as False (the default) and set to True:

In [9]:
HTML(side_by_side2(d1, d2, d1.append(d2), d1.append(d2, ignore_index=True)))
Out[9]:
A B
1001 37 27
1002 10 94
B C D
1002 45 36 87
2001 32 71 64
A B C D
1001 37.0 27 NaN NaN
1002 10.0 94 NaN NaN
1002 NaN 45 36.0 87.0
2001 NaN 32 71.0 64.0
A B C D
0 37.0 27 NaN NaN
1 10.0 94 NaN NaN
2 NaN 45 36.0 87.0
3 NaN 32 71.0 64.0
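
Note that DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0; pd.concat reproduces both behaviours shown above. A minimal sketch with frames shaped like d1 and d2 (random values):

```python
import numpy as np
import pandas as pd

d1 = pd.DataFrame(np.random.randint(10, 99, (2, 2)), index=[1001, 1002], columns=list('AB'))
d2 = pd.DataFrame(np.random.randint(10, 99, (2, 3)), index=[1002, 2001], columns=list('BCD'))

# the modern replacement for d1.append(d2) and d1.append(d2, ignore_index=True):
stacked = pd.concat([d1, d2])                       # keeps the overlapping label 1002
renumbered = pd.concat([d1, d2], ignore_index=True) # renumbers the rows 0..3
print(stacked.index.tolist())     # [1001, 1002, 1002, 2001]
print(renumbered.index.tolist())  # [0, 1, 2, 3]
```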

3 Join

join is the equivalent of the homonymous SQL command. It is more flexible than append because it has a how parameter that specifies how to combine the indexes:

  • left keeps left indexes.
  • right keeps right indexes.
  • inner intersects indexes.
  • outer makes union of indexes.
In [10]:
HTML(side_by_side2(d1, d2))
Out[10]:
A B
1001 37 27
1002 10 94
B C D
1002 45 36 87
2001 32 71 64
In [11]:
d6 = d1.join(d2, rsuffix='_r', how='left')
d7 = d1.join(d2, rsuffix='_r', how='right')
d8 = d1.join(d2.D, how='inner')
d9 = d1.join(d2.D, how='outer')
HTML(side_by_side2(d6, d7, d8, d9))
Out[11]:
A B B_r C D
1001 37 27 NaN NaN NaN
1002 10 94 45.0 36.0 87.0
A B B_r C D
1002 10.0 94.0 45 36 87
2001 NaN NaN 32 71 64
A B D
1002 10 94 87
A B D
1001 37.0 27.0 NaN
1002 10.0 94.0 87.0
2001 NaN NaN 64.0

4 Merge

pandas provides a single function, merge, as the entry point for all standard database join operations between DataFrame objects:

pd.merge(left, right, how='inner', on=None, left_on=None, right_on=None,
         left_index=False, right_index=False, sort=True,
         suffixes=('_x', '_y'), copy=True)

  • on: Columns to join on. Must be found in both DataFrames
  • left_on: Columns from the left DataFrame to use as keys
  • right_on: same for right
  • left_index: If True, use the index (row labels) from the left DataFrame join key(s)
  • right_index: same for right
  • how: One of 'left', 'right', 'outer', 'inner'. Defaults to 'inner'.
  • suffixes: A tuple of string suffixes to apply to overlapping columns. Defaults to ('_x', '_y').
In [12]:
left = pd.DataFrame({'key': ['foo', 'foo'], 'lval': [1, 2]})
right = pd.DataFrame({'key': ['foo', 'bar'], 'rval': [4, 5]})
HTML(side_by_side2(left, right))
Out[12]:
key lval
0 foo 1
1 foo 2
key rval
0 foo 4
1 bar 5
In [13]:
pd.merge(left, right, on='key', how='right')
Out[13]:
key lval rval
0 foo 1.0 4
1 foo 2.0 4
2 bar NaN 5
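
merge also accepts an indicator parameter, useful for auditing which side each output row came from. A minimal sketch reusing the left/right frames above:

```python
import pandas as pd

left = pd.DataFrame({'key': ['foo', 'foo'], 'lval': [1, 2]})
right = pd.DataFrame({'key': ['foo', 'bar'], 'rval': [4, 5]})

# indicator=True adds a '_merge' column that records, for each output row,
# whether its key was found in 'left_only', 'right_only' or 'both' frames:
audit = pd.merge(left, right, on='key', how='outer', indicator=True)
print(audit[['key', '_merge']])
```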

A practical example: revenues is a small dataset that we want to join with the cities database to find the geographical location of the offices:

In [14]:
revenues = pd.read_csv('example_data/p05_d2.txt', index_col=0)
cities = pd.read_csv('example_data/cities.csv.bz2', compression='bz2',
                  names=['City', 'State', 'lat', 'lon'])
HTML(side_by_side2(revenues, cities.head(8)))
Out[14]:
City Name Revenues State
0 New York Roy 1250 NY
1 frisco Johnn 840 CA
2 houston Jim 349 tx
3 taft Paul 1100 OK
4 venice Ross 900 IL
City State lat lon
0 ROBERTS ID 43.698922 -112.173195
1 HODGENVILLE KY 37.559321 -85.707267
2 WILLIAMSPORT TN 35.722628 -87.212698
3 MAUMEE OH 41.571251 -83.685036
4 BERNE NY 42.602236 -74.154615
5 MONCLOVA OH 41.577251 -83.772378
6 MEADOWBROOK WV 39.346214 -80.313875
7 RUTH NV 39.342631 -114.885945

The 'cities' database contains 30101 records:

In [15]:
cities['City'].count()
Out[15]:
30101

First we map a normalization function to standardize the 'City' and 'State' names. Series.map applies the function passed as a parameter to each element of the Series:

In [16]:
revenues = revenues.reindex(columns=['Name','Revenues','City','State'])
revenues['City'] = revenues['City'].map(lambda str: str.upper().strip())
revenues['State'] = revenues['State'].map(lambda str: str.upper().strip())
display(revenues)
Name Revenues City State
0 Roy 1250 NEW YORK NY
1 Johnn 840 FRISCO CA
2 Jim 349 HOUSTON TX
3 Paul 1100 TAFT OK
4 Ross 900 VENICE IL
In [17]:
leftjoin = pd.merge(revenues, cities, left_on=['City', 'State'],
                                      right_on=['City', 'State'], how='left')
leftjoin = leftjoin.sort_values(by='Revenues', ascending=False)
HTML(side_by_side2(revenues, leftjoin))
Out[17]:
Name Revenues City State
0 Roy 1250 NEW YORK NY
1 Johnn 840 FRISCO CA
2 Jim 349 HOUSTON TX
3 Paul 1100 TAFT OK
4 Ross 900 VENICE IL
Name Revenues City State lat lon
0 Roy 1250 NEW YORK NY 40.757929 -73.985506
3 Paul 1100 TAFT OK 35.763648 -95.544501
4 Ross 900 VENICE IL 38.670250 -90.168859
1 Johnn 840 FRISCO CA NaN NaN
2 Jim 349 HOUSTON TX 29.759956 -95.362534
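
The left join keeps every row of revenues even when the lookup fails (FRISCO found no match above, hence its NaN coordinates), and the failed lookups can be isolated afterwards. A minimal sketch with reduced versions of the two frames:

```python
import pandas as pd

# miniature versions of the frames above (New York's coordinates as shown earlier)
revenues = pd.DataFrame({'Name': ['Roy', 'Johnn'], 'Revenues': [1250, 840],
                         'City': ['NEW YORK', 'FRISCO'], 'State': ['NY', 'CA']})
cities = pd.DataFrame({'City': ['NEW YORK'], 'State': ['NY'],
                       'lat': [40.757929], 'lon': [-73.985506]})

joined = pd.merge(revenues, cities, on=['City', 'State'], how='left')
# a NaN in 'lat' flags a City/State pair that found no match in the lookup table:
unmatched = joined[joined['lat'].isna()]
print(unmatched['City'].tolist())  # ['FRISCO']
```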

5 Pivoting

Pivoting is useful to analyze data in the so-called "stacked" or "record" format: each row stores a single (date, variable, value) observation, so the same variable appears in multiple rows:

In [18]:
data = pd.read_csv('example_data/pd06_pivot_Example.txt')
data
Out[18]:
Unnamed: 0 date variable value
0 0 2000-01-03 00:00:00 A 0.397018
1 1 2000-01-04 00:00:00 A 0.621109
2 2 2000-01-05 00:00:00 A -2.508284
3 3 2000-01-03 00:00:00 B 0.219796
4 4 2000-01-04 00:00:00 B 1.371027
5 5 2000-01-05 00:00:00 B -1.005504
6 6 2000-01-03 00:00:00 C -0.391149
7 7 2000-01-04 00:00:00 C 1.209346
8 8 2000-01-05 00:00:00 C -0.059617
9 9 2000-01-03 00:00:00 D -0.464558
10 10 2000-01-04 00:00:00 D -0.173605
11 11 2000-01-05 00:00:00 D -1.649632
In [19]:
pivot = data.pivot(index='date', columns='variable', values='value')
pivot.head()
Out[19]:
variable A B C D
date
2000-01-03 00:00:00 0.397018 0.219796 -0.391149 -0.464558
2000-01-04 00:00:00 0.621109 1.371027 1.209346 -0.173605
2000-01-05 00:00:00 -2.508284 -1.005504 -0.059617 -1.649632
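
pivot requires each (index, columns) pair to occur at most once; when the stacked data contains duplicate pairs, pivot raises a ValueError and pivot_table, which aggregates the duplicates (with the mean by default), must be used instead. A minimal sketch with made-up data:

```python
import pandas as pd

stacked = pd.DataFrame({'date': ['d1', 'd1', 'd2'],
                        'variable': ['A', 'A', 'A'],
                        'value': [1.0, 3.0, 5.0]})

# the pair ('d1', 'A') appears twice, so pivot cannot decide which value to keep:
try:
    stacked.pivot(index='date', columns='variable', values='value')
except ValueError as exc:
    print('pivot failed:', exc)

# pivot_table aggregates the duplicates instead (mean by default):
table = stacked.pivot_table(index='date', columns='variable', values='value')
print(table.loc['d1', 'A'])  # 2.0, the mean of 1.0 and 3.0
```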

A practical example: apply pivoting to a dataset containing the statistics of the US crimes State by State, downloaded from the Stanford University "Human-Computer Interaction" group (http://hci.stanford.edu/jheer/workshop/data/):

In [20]:
data2 = pd.read_csv('example_data/CrimeStatebyState.csv.bz2', compression='bz2')
data2.iloc[0:7000:1000]
Out[20]:
State Type of Crime Crime Year Count
0 Alabama Violent Crime Murder and nonnegligent Manslaughter 1960 406
1000 Arkansas Violent Crime Murder and nonnegligent Manslaughter 1994 294
2000 Connecticut Violent Crime Forcible rape 1982 692
3000 Florida Violent Crime Robbery 1970 12636
4000 Idaho Violent Crime Robbery 2004 241
5000 Iowa Violent Crime Aggravated assault 1992 6131
6000 Louisiana Property Crime Burglary 1980 63997

Definition: pd.pivot_table(data, values=None, index=None, columns=None, aggfunc='mean', fill_value=None, margins=False, dropna=True)

  • values: Columns to aggregate
  • aggfunc: Default is np.mean. This can be a list of functions as well
  • margins: Boolean; if True, add row/column subtotals and grand totals
  • dropna: Exclude columns whose entries are all NaN
In [21]:
pd.set_option("display.colheader_justify","right")
pivot2 = pd.pivot_table(data2, values='Count',
                        index=['Year'],
                        columns=['State'],
                        margins=True, aggfunc=np.sum)
pivot2.iloc[-10:,-8:]
Out[21]:
State Utah Vermont Virginia Washington West Virginia Wisconsin Wyoming All
Year
1997 123447 16658 261022 332469 44839 190133 20068 13195074
1998 115624 18552 248576 333799 46130 185093 18315 12485714
1999 105999 16735 231886 302509 49161 173062 16583 11634378
2000 99958 18185 214348 300932 47067 172124 16285 11608070
2001 96307 16978 228445 308492 46120 179410 17392 11849006
2002 103129 15600 229039 309931 45320 176987 17858 11878954
2003 105973 14510 220939 312814 46997 169788 17962 11826538
2004 103409 15272 220976 322167 51436 158258 18052 11679474
2005 101158 14956 221044 329406 52653 160646 17242 11556854
All 3432582 716716 9264900 10942317 1709475 7399321 726727 486181458
In [22]:
pivot3 = pd.pivot_table(data2, values='Count',
                        index=['Year', 'Type of Crime'],
                        columns=['State'],
                        margins=True, aggfunc=np.sum)
pivot3.iloc[-10:,:8]
Out[22]:
State Alabama Alaska Arizona Arkansas California Colorado Connecticut Delaware
Year Type of Crime
2001 Violent Crime 19582 3735 28675 12190 212855 15492 11492 4868
2002 Property Crime 180400 24118 318296 101171 1176484 180054 93426 26967
Violent Crime 19931 3627 30171 11501 208388 15882 10807 4836
2003 Property Crime 182241 24386 314335 99084 1215086 179706 92981 27943
Violent Crime 19331 3877 28638 12449 205551 15757 11045 5525
2004 Property Crime 182340 22172 291203 110911 1227194 180322 93942 27256
Violent Crime 19324 4159 28952 13814 189175 17121 10113 5105
2005 Property Crime 177393 23975 287345 112775 1200531 188449 89794 26245
Violent Crime 19678 4194 30478 14659 190178 18498 9635 5332
All 6745549 1019564 9436942 3670925 66614776 7359322 5649669 1359238

If we want to select the crimes for a given year (rows) and/or location (columns), we have to select by label with loc: iloc is purely positional, so pivot2.iloc[2004:2005, :8] would look for row positions (not year labels) and return an empty frame.

In [23]:
pivot2.loc[[2004, 2005]].iloc[:, :8]
Out[23]:
State Alabama Alaska Arizona Arkansas California Colorado Connecticut Delaware
Year
2004 201664 26331 320155 124725 1416369 197443 104055 32361
2005 197071 28169 317823 127434 1390709 206947 99429 31577

6 Stack and Unstack

stack "pivots" a level of the column labels to the innermost row level. In the following case, since no column labels remain after stacking, stack returns a Series instead of a DataFrame:

In [24]:
pivot3.iloc[:5,:8]
Out[24]:
State Alabama Alaska Arizona Arkansas California Colorado Connecticut Delaware
Year Type of Crime
1960 Property Crime 33823 3494 36539 16548 508511 35695 28393 9267
Violent Crime 6097 236 2704 1924 37558 2408 928 375
1961 Property Crime 32541 3605 41304 16836 521008 40156 31658 9343
Violent Crime 5564 208 2288 1809 38154 2659 878 318
1962 Property Crime 35829 3849 45694 18199 564451 47872 33798 10157
In [25]:
stk = pivot3.stack()
stk[:10]
Out[25]:
Year  Type of Crime   State               
1960  Property Crime  Alabama                  33823
                      Alaska                    3494
                      Arizona                  36539
                      Arkansas                 16548
                      California              508511
                      Colorado                 35695
                      Connecticut              28393
                      Delaware                  9267
                      District of Columbia     16495
                      Florida                 122858
dtype: int64

unstack reverses the operation by pivoting a row index level back onto the columns:

In [26]:
unstk = stk.unstack(-1)
unstk.iloc[:5,:8]
Out[26]:
State Alabama Alaska Arizona Arkansas California Colorado Connecticut Delaware
Year Type of Crime
1960 Property Crime 33823 3494 36539 16548 508511 35695 28393 9267
Violent Crime 6097 236 2704 1924 37558 2408 928 375
1961 Property Crime 32541 3605 41304 16836 521008 40156 31658 9343
Violent Crime 5564 208 2288 1809 38154 2659 878 318
1962 Property Crime 35829 3849 45694 18199 564451 47872 33798 10157
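
stack and unstack are inverses of each other: stacking the column level and then unstacking it again reproduces the original frame exactly. A minimal sketch with a toy frame shaped like pivot3 (made-up values):

```python
import numpy as np
import pandas as pd

idx = pd.MultiIndex.from_product([[1960, 1961], ['Property Crime', 'Violent Crime']],
                                 names=['Year', 'Type of Crime'])
df = pd.DataFrame(np.arange(8).reshape(4, 2), index=idx,
                  columns=pd.Index(['Alabama', 'Alaska'], name='State'))

# stack moves 'State' to the innermost row level (yielding a Series);
# unstack(-1) moves the innermost row level back onto the columns:
round_trip = df.stack().unstack(-1)
print(round_trip.equals(df))  # True
```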

Visit www.add-for.com for more tutorials and updates.

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.