import pandas as pd
import numpy as np
from IPython.display import display
print "pandas version:",pd.version.version
pandas version: 0.16.2
midx_col = pd.MultiIndex.from_product([['x1', 'x2', 'x3', 'x4'], ['y1', 'y2'],['z1','z2']],names=['clvl1','clvl2','clvl3'])
midx_row = pd.MultiIndex.from_product([['a1', 'a2', 'a3', 'a4'], ['b1', 'b2'],['c1','c2']],names=['rlvl1','rlvl2','rlvl3'])
df = pd.DataFrame(np.random.randint(10,size=(16,16)),index=midx_row,columns=midx_col)
display(df)
clvl1 | x1 | x2 | x3 | x4 | ||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
clvl2 | y1 | y2 | y1 | y2 | y1 | y2 | y1 | y2 | ||||||||||
clvl3 | z1 | z2 | z1 | z2 | z1 | z2 | z1 | z2 | z1 | z2 | z1 | z2 | z1 | z2 | z1 | z2 | ||
rlvl1 | rlvl2 | rlvl3 | ||||||||||||||||
a1 | b1 | c1 | 2 | 4 | 3 | 7 | 5 | 3 | 6 | 9 | 0 | 5 | 8 | 3 | 6 | 2 | 3 | 5 |
c2 | 3 | 4 | 6 | 6 | 5 | 6 | 3 | 2 | 9 | 8 | 1 | 8 | 7 | 0 | 6 | 0 | ||
b2 | c1 | 4 | 0 | 5 | 7 | 0 | 4 | 8 | 5 | 7 | 1 | 9 | 4 | 7 | 6 | 6 | 6 | |
c2 | 7 | 5 | 8 | 8 | 4 | 7 | 4 | 4 | 9 | 0 | 3 | 6 | 1 | 8 | 0 | 6 | ||
a2 | b1 | c1 | 3 | 2 | 4 | 2 | 8 | 5 | 7 | 0 | 4 | 6 | 3 | 7 | 2 | 5 | 1 | 4 |
c2 | 6 | 4 | 6 | 3 | 9 | 3 | 6 | 0 | 3 | 1 | 2 | 3 | 5 | 9 | 8 | 3 | ||
b2 | c1 | 8 | 9 | 0 | 7 | 9 | 8 | 0 | 3 | 7 | 6 | 1 | 8 | 9 | 0 | 3 | 9 | |
c2 | 7 | 4 | 9 | 7 | 3 | 7 | 3 | 8 | 4 | 5 | 3 | 1 | 9 | 4 | 0 | 3 | ||
a3 | b1 | c1 | 9 | 0 | 8 | 0 | 7 | 5 | 8 | 2 | 7 | 8 | 4 | 3 | 7 | 2 | 1 | 7 |
c2 | 4 | 9 | 5 | 2 | 1 | 2 | 3 | 2 | 7 | 8 | 9 | 6 | 5 | 5 | 9 | 1 | ||
b2 | c1 | 6 | 8 | 9 | 9 | 6 | 0 | 1 | 2 | 2 | 9 | 9 | 3 | 0 | 5 | 2 | 4 | |
c2 | 9 | 8 | 4 | 0 | 3 | 3 | 9 | 0 | 7 | 0 | 8 | 8 | 0 | 4 | 2 | 9 | ||
a4 | b1 | c1 | 7 | 1 | 6 | 1 | 7 | 2 | 6 | 2 | 4 | 7 | 3 | 2 | 9 | 4 | 0 | 2 |
c2 | 8 | 2 | 7 | 9 | 4 | 7 | 8 | 2 | 2 | 0 | 9 | 7 | 5 | 8 | 6 | 4 | ||
b2 | c1 | 9 | 3 | 8 | 6 | 2 | 4 | 8 | 5 | 8 | 4 | 1 | 5 | 4 | 6 | 3 | 3 | |
c2 | 8 | 9 | 9 | 5 | 6 | 1 | 5 | 7 | 9 | 2 | 9 | 5 | 8 | 1 | 2 | 8 |
Indexing the first l < L levels (where L = len(MultiIndex.levels)
) returns a DataFrame
with the first l levels of the row index dropped.
If l == L a Series
is returned.
Note that the column index is always kept intact.
display(df.loc['a1','b1'])
display(df.loc['a1','b1','c1'])
clvl1 | x1 | x2 | x3 | x4 | ||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
clvl2 | y1 | y2 | y1 | y2 | y1 | y2 | y1 | y2 | ||||||||
clvl3 | z1 | z2 | z1 | z2 | z1 | z2 | z1 | z2 | z1 | z2 | z1 | z2 | z1 | z2 | z1 | z2 |
rlvl3 | ||||||||||||||||
c1 | 2 | 4 | 3 | 7 | 5 | 3 | 6 | 9 | 0 | 5 | 8 | 3 | 6 | 2 | 3 | 5 |
c2 | 3 | 4 | 6 | 6 | 5 | 6 | 3 | 2 | 9 | 8 | 1 | 8 | 7 | 0 | 6 | 0 |
clvl1 clvl2 clvl3 x1 y1 z1 2 z2 4 y2 z1 3 z2 7 x2 y1 z1 5 z2 3 y2 z1 6 z2 9 x3 y1 z1 0 z2 5 y2 z1 8 z2 3 x4 y1 z1 6 z2 2 y2 z1 3 z2 5 Name: (a1, b1, c1), dtype: int64
We can also select all indices at a particular level by using a colon :
. Note that if we use a colon at any index level, the MultiIndex from the original DataFrame is kept intact (i.e. none of the levels are dropped).
df.loc['a1',:,'c1']
clvl1 | x1 | x2 | x3 | x4 | ||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
clvl2 | y1 | y2 | y1 | y2 | y1 | y2 | y1 | y2 | ||||||||||
clvl3 | z1 | z2 | z1 | z2 | z1 | z2 | z1 | z2 | z1 | z2 | z1 | z2 | z1 | z2 | z1 | z2 | ||
rlvl1 | rlvl2 | rlvl3 | ||||||||||||||||
a1 | b1 | c1 | 2 | 4 | 3 | 7 | 5 | 3 | 6 | 9 | 0 | 5 | 8 | 3 | 6 | 2 | 3 | 5 |
b2 | c1 | 4 | 0 | 5 | 7 | 0 | 4 | 8 | 5 | 7 | 1 | 9 | 4 | 7 | 6 | 6 | 6 |
The previous examples showed how to select rows that match only a single index at each index level (actually we learned that we can also select all indices at a particular level by using colon :
). But it's also possible to select arbitrary indices at each index level by using indexing of the form
df.loc[( (rlvl1 indices), (rlvl2 indices), (rlvl3 indices) ), ( (clvl1 indices), (clvl2 indices), (clvl3 indices) )]
Note that
I'll go into more detail on this syntax in the next section.
Here's a few examples
display(df.loc[(('a1','a2','a4'), ('b1','b2'), ('c2')), :])
display(df.loc[(('a1','a2','a4'), ('b1','b2'), ('c2')), 'x1'])
display(df.loc[(('a1','a2','a4'), ('b1','b2'), ('c2')), ('x1','y1')])
display(df.loc[(('a1','a2','a4'), ('b1','b2'), ('c2')), (('x1','x2'),('y1'))])
clvl1 | x1 | x2 | x3 | x4 | ||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
clvl2 | y1 | y2 | y1 | y2 | y1 | y2 | y1 | y2 | ||||||||||
clvl3 | z1 | z2 | z1 | z2 | z1 | z2 | z1 | z2 | z1 | z2 | z1 | z2 | z1 | z2 | z1 | z2 | ||
rlvl1 | rlvl2 | rlvl3 | ||||||||||||||||
a1 | b1 | c2 | 3 | 4 | 6 | 6 | 5 | 6 | 3 | 2 | 9 | 8 | 1 | 8 | 7 | 0 | 6 | 0 |
b2 | c2 | 7 | 5 | 8 | 8 | 4 | 7 | 4 | 4 | 9 | 0 | 3 | 6 | 1 | 8 | 0 | 6 | |
a2 | b1 | c2 | 6 | 4 | 6 | 3 | 9 | 3 | 6 | 0 | 3 | 1 | 2 | 3 | 5 | 9 | 8 | 3 |
b2 | c2 | 7 | 4 | 9 | 7 | 3 | 7 | 3 | 8 | 4 | 5 | 3 | 1 | 9 | 4 | 0 | 3 | |
a4 | b1 | c2 | 8 | 2 | 7 | 9 | 4 | 7 | 8 | 2 | 2 | 0 | 9 | 7 | 5 | 8 | 6 | 4 |
b2 | c2 | 8 | 9 | 9 | 5 | 6 | 1 | 5 | 7 | 9 | 2 | 9 | 5 | 8 | 1 | 2 | 8 |
clvl2 | y1 | y2 | ||||
---|---|---|---|---|---|---|
clvl3 | z1 | z2 | z1 | z2 | ||
rlvl1 | rlvl2 | rlvl3 | ||||
a1 | b1 | c2 | 3 | 4 | 6 | 6 |
b2 | c2 | 7 | 5 | 8 | 8 | |
a2 | b1 | c2 | 6 | 4 | 6 | 3 |
b2 | c2 | 7 | 4 | 9 | 7 | |
a4 | b1 | c2 | 8 | 2 | 7 | 9 |
b2 | c2 | 8 | 9 | 9 | 5 |
clvl3 | z1 | z2 | ||
---|---|---|---|---|
rlvl1 | rlvl2 | rlvl3 | ||
a1 | b1 | c2 | 3 | 4 |
b2 | c2 | 7 | 5 | |
a2 | b1 | c2 | 6 | 4 |
b2 | c2 | 7 | 4 | |
a4 | b1 | c2 | 8 | 2 |
b2 | c2 | 8 | 9 |
clvl1 | x1 | x2 | ||||
---|---|---|---|---|---|---|
clvl2 | y1 | y1 | ||||
clvl3 | z1 | z2 | z1 | z2 | ||
rlvl1 | rlvl2 | rlvl3 | ||||
a1 | b1 | c2 | 3 | 4 | 5 | 6 |
b2 | c2 | 7 | 5 | 4 | 7 | |
a2 | b1 | c2 | 6 | 4 | 9 | 3 |
b2 | c2 | 7 | 4 | 3 | 7 | |
a4 | b1 | c2 | 8 | 2 | 4 | 7 |
b2 | c2 | 8 | 9 | 6 | 1 |
To select all indices at a particular level, a slice(None)
can be used in place of a tuple. (slice(None)
takes the place of :
here since a colon inside a tuple is illegal python syntax).
Just as in the last section, using a slice(None)
at any index level will assure that the MultiIndex from the original DataFrame is kept intact.
display(df.loc[(('a1','a2','a4'), slice(None), ('c1')), :]) # select all indices at row level 2
clvl1 | x1 | x2 | x3 | x4 | ||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
clvl2 | y1 | y2 | y1 | y2 | y1 | y2 | y1 | y2 | ||||||||||
clvl3 | z1 | z2 | z1 | z2 | z1 | z2 | z1 | z2 | z1 | z2 | z1 | z2 | z1 | z2 | z1 | z2 | ||
rlvl1 | rlvl2 | rlvl3 | ||||||||||||||||
a1 | b1 | c1 | 2 | 4 | 3 | 7 | 5 | 3 | 6 | 9 | 0 | 5 | 8 | 3 | 6 | 2 | 3 | 5 |
b2 | c1 | 4 | 0 | 5 | 7 | 0 | 4 | 8 | 5 | 7 | 1 | 9 | 4 | 7 | 6 | 6 | 6 | |
a2 | b1 | c1 | 3 | 2 | 4 | 2 | 8 | 5 | 7 | 0 | 4 | 6 | 3 | 7 | 2 | 5 | 1 | 4 |
b2 | c1 | 8 | 9 | 0 | 7 | 9 | 8 | 0 | 3 | 7 | 6 | 1 | 8 | 9 | 0 | 3 | 9 | |
a4 | b1 | c1 | 7 | 1 | 6 | 1 | 7 | 2 | 6 | 2 | 4 | 7 | 3 | 2 | 9 | 4 | 0 | 2 |
b2 | c1 | 9 | 3 | 8 | 6 | 2 | 4 | 8 | 5 | 8 | 4 | 1 | 5 | 4 | 6 | 3 | 3 |
Remember how I said in the section above that for arbitrary multi-indexing you had to specify a column index for it to work? I lied :) This actually isn't always necessary. For example the following two are equivalent
result1 = df.loc[('a1','a2','a4'), ('b1','b2'), ('c2')] # incomplete indexing
result2 = df.loc[(('a1','a2','a4'), ('b1','b2'), ('c2')),:] # explicit column indexing
np.all(result1 == result2)
True
The problem with this type of incomplete indexing with multi-index dataframes is that the behavior isn't always well defined. Specifically, when the number of index tuples is 1 or 2, pandas has no way of knowing which index tuples belong to the row index and which belong to the column index.
This is actually a python limitation. You see, indexing with 2 tuples without specifying a column index is of the form
df.loc[( (level1 indices), (level2 indices) )]
But in the context of the __getitem__
interface, this notation is exactly equivalent to
df.loc[(level1 indices), (level2 indices)]
In pandas, this case will always be interpreted as the typical row and column indexing.
Funnily enough, simply appending a comma after the parenthesis will eliminate this ambiguity
df.loc[( (level1 indices), (level2 indices) ),]
<-- Note the added comma!
In summary, incomplete indexing with 3 or more indexing tuples works
df.loc[('a1','a2','a4'), ('b1','b3'), ('c2')]
clvl1 | x1 | x2 | x3 | x4 | ||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
clvl2 | y1 | y2 | y1 | y2 | y1 | y2 | y1 | y2 | ||||||||||
clvl3 | z1 | z2 | z1 | z2 | z1 | z2 | z1 | z2 | z1 | z2 | z1 | z2 | z1 | z2 | z1 | z2 | ||
rlvl1 | rlvl2 | rlvl3 | ||||||||||||||||
a1 | b1 | c2 | 3 | 4 | 6 | 6 | 5 | 6 | 3 | 2 | 9 | 8 | 1 | 8 | 7 | 0 | 6 | 0 |
a2 | b1 | c2 | 6 | 4 | 6 | 3 | 9 | 3 | 6 | 0 | 3 | 1 | 2 | 3 | 5 | 9 | 8 | 3 |
a4 | b1 | c2 | 8 | 2 | 7 | 9 | 4 | 7 | 8 | 2 | 2 | 0 | 9 | 7 | 5 | 8 | 6 | 4 |
But this does not work when using only 2 (or less) indexing tuples
try:
df.loc[('a1','a2','a4'), ('b1','b3')]
except KeyError as e:
print "KeyError:",e.message
KeyError: ('a1', 'a2', 'a4')
try:
df.loc[('a1','a2','a4')] # remember: this is the same as df.loc['a1','a2','a4'] to python
except KeyError as e:
print "KeyError:",e.message
KeyError: the label [a2] is not in the [columns]
Perhaps surprisingly, in cases where you only need to specify a single index at every level of the row index this type of shorthand indexing will always work - even if the length of the index tuple is 2.
df.loc['a1','b1']
clvl1 | x1 | x2 | x3 | x4 | ||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
clvl2 | y1 | y2 | y1 | y2 | y1 | y2 | y1 | y2 | ||||||||
clvl3 | z1 | z2 | z1 | z2 | z1 | z2 | z1 | z2 | z1 | z2 | z1 | z2 | z1 | z2 | z1 | z2 |
rlvl3 | ||||||||||||||||
c1 | 2 | 4 | 3 | 7 | 5 | 3 | 6 | 9 | 0 | 5 | 8 | 3 | 6 | 2 | 3 | 5 |
c2 | 3 | 4 | 6 | 6 | 5 | 6 | 3 | 2 | 9 | 8 | 1 | 8 | 7 | 0 | 6 | 0 |
This works because pandas tries first to match the second index (b1
in our example) to the column index and falls back to the row index if it is not found. Clearly this behavior has some consequences if your row and column Index's are not unique, so take care.