A short tutorial on pandas MultiIndexing with DataFrames¶

In [1]:

import pandas as pd
import numpy as np
from IPython.display import display

print "pandas version:",pd.version.version

pandas version: 0.16.2

In [2]:

midx_col = pd.MultiIndex.from_product([['x1', 'x2', 'x3', 'x4'], ['y1', 'y2'],['z1','z2']],names=['clvl1','clvl2','clvl3'])
midx_row = pd.MultiIndex.from_product([['a1', 'a2', 'a3', 'a4'], ['b1', 'b2'],['c1','c2']],names=['rlvl1','rlvl2','rlvl3'])
df = pd.DataFrame(np.random.randint(10,size=(16,16)),index=midx_row,columns=midx_col)

display(df)

		clvl1	x1				x2				x3				x4
		clvl2	y1		y2		y1		y2		y1		y2		y1		y2
		clvl3	z1	z2	z1	z2	z1	z2	z1	z2	z1	z2	z1	z2	z1	z2	z1	z2
rlvl1	rlvl2	rlvl3
a1	b1	c1	2	4	3	7	5	3	6	9	0	5	8	3	6	2	3	5
	b1	c2	3	4	6	6	5	6	3	2	9	8	1	8	7	0	6	0
	b2	c1	4	0	5	7	0	4	8	5	7	1	9	4	7	6	6	6
	b2	c2	7	5	8	8	4	7	4	4	9	0	3	6	1	8	0	6
a2	b1	c1	3	2	4	2	8	5	7	0	4	6	3	7	2	5	1	4
	b1	c2	6	4	6	3	9	3	6	0	3	1	2	3	5	9	8	3
	b2	c1	8	9	0	7	9	8	0	3	7	6	1	8	9	0	3	9
	b2	c2	7	4	9	7	3	7	3	8	4	5	3	1	9	4	0	3
a3	b1	c1	9	0	8	0	7	5	8	2	7	8	4	3	7	2	1	7
	b1	c2	4	9	5	2	1	2	3	2	7	8	9	6	5	5	9	1
	b2	c1	6	8	9	9	6	0	1	2	2	9	9	3	0	5	2	4
	b2	c2	9	8	4	0	3	3	9	0	7	0	8	8	0	4	2	9
a4	b1	c1	7	1	6	1	7	2	6	2	4	7	3	2	9	4	0	2
	b1	c2	8	2	7	9	4	7	8	2	2	0	9	7	5	8	6	4
	b2	c1	9	3	8	6	2	4	8	5	8	4	1	5	4	6	3	3
	b2	c2	8	9	9	5	6	1	5	7	9	2	9	5	8	1	2	8

Single indexing on a MultiIndexed Dataframe¶

Indexing the first l < L levels (where L = len(MultiIndex.levels)) returns a DataFrame with the first l levels of the row index dropped.

If l == L a Series is returned.

Note that the column index is always kept intact.

In [3]:

display(df.loc['a1','b1'])
display(df.loc['a1','b1','c1'])

clvl1	x1				x2				x3				x4
clvl2	y1		y2		y1		y2		y1		y2		y1		y2
clvl3	z1	z2	z1	z2	z1	z2	z1	z2	z1	z2	z1	z2	z1	z2	z1	z2
rlvl3
c1	2	4	3	7	5	3	6	9	0	5	8	3	6	2	3	5
c2	3	4	6	6	5	6	3	2	9	8	1	8	7	0	6	0

clvl1  clvl2  clvl3
x1     y1     z1       2
              z2       4
       y2     z1       3
              z2       7
x2     y1     z1       5
              z2       3
       y2     z1       6
              z2       9
x3     y1     z1       0
              z2       5
       y2     z1       8
              z2       3
x4     y1     z1       6
              z2       2
       y2     z1       3
              z2       5
Name: (a1, b1, c1), dtype: int64

We can also select all indices at a particular level by using a colon :. Note that if we use a colon at any index level, the MultiIndex from the original DataFrame is kept intact (i.e. none of the levels are dropped).

In [4]:

df.loc['a1',:,'c1']

Out[4]:

		clvl1	x1				x2				x3				x4
		clvl2	y1		y2		y1		y2		y1		y2		y1		y2
		clvl3	z1	z2	z1	z2	z1	z2	z1	z2	z1	z2	z1	z2	z1	z2	z1	z2
rlvl1	rlvl2	rlvl3
a1	b1	c1	2	4	3	7	5	3	6	9	0	5	8	3	6	2	3	5
a1	b2	c1	4	0	5	7	0	4	8	5	7	1	9	4	7	6	6	6

Selecting arbitrary indices at multiple levels ¶

The previous examples showed how to select rows that match only a single index at each index level (actually we learned that we can also select all indices at a particular level by using colon :). But it's also possible to select arbitrary indices at each index level by using indexing of the form

df.loc[( (rlvl1 indices), (rlvl2 indices), (rlvl3 indices) ), ( (clvl1 indices), (clvl2 indices), (clvl3 indices) )]

Note that

The column index must be explicitly set
The order of index tuples must match the order of index levels

I'll go into more detail on this syntax in the next section.

Here's a few examples

In [5]:

display(df.loc[(('a1','a2','a4'), ('b1','b2'), ('c2')), :])
display(df.loc[(('a1','a2','a4'), ('b1','b2'), ('c2')), 'x1'])
display(df.loc[(('a1','a2','a4'), ('b1','b2'), ('c2')), ('x1','y1')])
display(df.loc[(('a1','a2','a4'), ('b1','b2'), ('c2')), (('x1','x2'),('y1'))])

		clvl1	x1				x2				x3				x4
		clvl2	y1		y2		y1		y2		y1		y2		y1		y2
		clvl3	z1	z2	z1	z2	z1	z2	z1	z2	z1	z2	z1	z2	z1	z2	z1	z2
rlvl1	rlvl2	rlvl3
a1	b1	c2	3	4	6	6	5	6	3	2	9	8	1	8	7	0	6	0
a1	b2	c2	7	5	8	8	4	7	4	4	9	0	3	6	1	8	0	6
a2	b1	c2	6	4	6	3	9	3	6	0	3	1	2	3	5	9	8	3
a2	b2	c2	7	4	9	7	3	7	3	8	4	5	3	1	9	4	0	3
a4	b1	c2	8	2	7	9	4	7	8	2	2	0	9	7	5	8	6	4
a4	b2	c2	8	9	9	5	6	1	5	7	9	2	9	5	8	1	2	8

		clvl2	y1		y2
		clvl3	z1	z2	z1	z2
rlvl1	rlvl2	rlvl3
a1	b1	c2	3	4	6	6
a1	b2	c2	7	5	8	8
a2	b1	c2	6	4	6	3
a2	b2	c2	7	4	9	7
a4	b1	c2	8	2	7	9
a4	b2	c2	8	9	9	5

		clvl3	z1	z2
rlvl1	rlvl2	rlvl3
a1	b1	c2	3	4
a1	b2	c2	7	5
a2	b1	c2	6	4
a2	b2	c2	7	4
a4	b1	c2	8	2
a4	b2	c2	8	9

		clvl1	x1		x2
		clvl2	y1		y1
		clvl3	z1	z2	z1	z2
rlvl1	rlvl2	rlvl3
a1	b1	c2	3	4	5	6
a1	b2	c2	7	5	4	7
a2	b1	c2	6	4	9	3
a2	b2	c2	7	4	3	7
a4	b1	c2	8	2	4	7
a4	b2	c2	8	9	6	1

To select all indices at a particular level, a slice(None) can be used in place of a tuple. (slice(None) takes the place of : here since a colon inside a tuple is illegal python syntax).

Just as in the last section, using a slice(None) at any index level will assure that the MultiIndex from the original DataFrame is kept intact.

In [6]:

display(df.loc[(('a1','a2','a4'), slice(None), ('c1')), :]) # select all indices at row level 2

		clvl1	x1				x2				x3				x4
		clvl2	y1		y2		y1		y2		y1		y2		y1		y2
		clvl3	z1	z2	z1	z2	z1	z2	z1	z2	z1	z2	z1	z2	z1	z2	z1	z2
rlvl1	rlvl2	rlvl3
a1	b1	c1	2	4	3	7	5	3	6	9	0	5	8	3	6	2	3	5
a1	b2	c1	4	0	5	7	0	4	8	5	7	1	9	4	7	6	6	6
a2	b1	c1	3	2	4	2	8	5	7	0	4	6	3	7	2	5	1	4
a2	b2	c1	8	9	0	7	9	8	0	3	7	6	1	8	9	0	3	9
a4	b1	c1	7	1	6	1	7	2	6	2	4	7	3	2	9	4	0	2
a4	b2	c1	9	3	8	6	2	4	8	5	8	4	1	5	4	6	3	3

Some ambiguous cases ¶

Remember how I said in the section above that for arbitrary multi-indexing you had to specify a column index for it to work? I lied :) This actually isn't always necessary. For example the following two are equivalent

In [7]:

result1 = df.loc[('a1','a2','a4'), ('b1','b2'), ('c2')] # incomplete indexing
result2 = df.loc[(('a1','a2','a4'), ('b1','b2'), ('c2')),:] # explicit column indexing
np.all(result1 == result2)

Out[7]:

True

The problem with this type of incomplete indexing with multi-index dataframes is that the behavior isn't always well defined. Specifically, when the number of index tuples is 1 or 2, pandas has no way of knowing which index tuples belong to the row index and which belong to the column index.

This is actually a python limitation. You see, indexing with 2 tuples without specifying a column index is of the form

df.loc[( (level1 indices), (level2 indices) )]

But in the context of the __getitem__ interface, this notation is exactly equivalent to

df.loc[(level1 indices), (level2 indices)]

In pandas, this case will always be interpreted as the typical row and column indexing.

Funnily enough, simply appending a comma after the parenthesis will eliminate this ambiguity

df.loc[( (level1 indices), (level2 indices) ),] <-- Note the added comma!

In summary, incomplete indexing with 3 or more indexing tuples works

In [8]:

df.loc[('a1','a2','a4'), ('b1','b3'), ('c2')]

Out[8]:

		clvl1	x1				x2				x3				x4
		clvl2	y1		y2		y1		y2		y1		y2		y1		y2
		clvl3	z1	z2	z1	z2	z1	z2	z1	z2	z1	z2	z1	z2	z1	z2	z1	z2
rlvl1	rlvl2	rlvl3
a1	b1	c2	3	4	6	6	5	6	3	2	9	8	1	8	7	0	6	0
a2	b1	c2	6	4	6	3	9	3	6	0	3	1	2	3	5	9	8	3
a4	b1	c2	8	2	7	9	4	7	8	2	2	0	9	7	5	8	6	4

But this does not work when using only 2 (or less) indexing tuples

In [13]:

try:
    df.loc[('a1','a2','a4'), ('b1','b3')]
except KeyError as e:
    print "KeyError:",e.message

KeyError: ('a1', 'a2', 'a4')

In [10]:

try:
    df.loc[('a1','a2','a4')] # remember: this is the same as df.loc['a1','a2','a4'] to python
except KeyError as e:
    print "KeyError:",e.message

KeyError: the label [a2] is not in the [columns]

Perhaps surprisingly, in cases where you only need to specify a single index at every level of the row index this type of shorthand indexing will always work - even if the length of the index tuple is 2.

In [11]:

df.loc['a1','b1']

Out[11]:

clvl1	x1				x2				x3				x4
clvl2	y1		y2		y1		y2		y1		y2		y1		y2
clvl3	z1	z2	z1	z2	z1	z2	z1	z2	z1	z2	z1	z2	z1	z2	z1	z2
rlvl3
c1	2	4	3	7	5	3	6	9	0	5	8	3	6	2	3	5
c2	3	4	6	6	5	6	3	2	9	8	1	8	7	0	6	0

This works because pandas tries first to match the second index (b1 in our example) to the column index and falls back to the row index if it is not found. Clearly this behavior has some consequences if your row and column Index's are not unique, so take care.