Chapter 06 -- Hierarchical Indexing¶

Topics Covered¶

In [1]:

import pandas as pd
import numpy as np
from pandas import Series, DataFrame, Index
from IPython.display import Image

Simply put, a MultiIndex (also referred to as a hierarchical index) allows multiple index levels within a single index. Higher dimensional data can be represented in a one-dimensional Series or a two-dimensional DataFrame. Start with the creation of a synthetic DataFrame whose values are used to represent the number of traffic violations issued in cell #3 below.

pandas provide constructor methods allowing components to be defined separately. The construction of the index and columns below is self-explanatory.

For the data component: The first data = assignment constructs an array using numpy's random number generator with the round() method used to return the integer portion. The second date = assignment multiples the array of integers by 100 and floor divides by 5, and finally taking the absolute value.

The tickets = assignment constructs the DataFrame from the constituents.

MultiIndexing¶

In [2]:

index = pd.MultiIndex.from_product([[2012, 2013, 2014, 2015], ['1', '2', '3']],
                                   names=['year', 'month'])
columns=pd.MultiIndex.from_product([['city', 'suburbs', 'rural'], ['day', 'night']],
                                    names=['area', 'when'])

data = np.round(np.random.randn(12, 6),2)
data = abs(np.floor_divide(data[:] * 100, 5))

tickets = pd.DataFrame(data, index=index, columns=columns).sort_index().sort_index(axis=1)

In [3]:

index

Out[3]:

MultiIndex(levels=[[2012, 2013, 2014, 2015], ['1', '2', '3']],
           labels=[[0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3], [0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2]],
           names=['year', 'month'])

Notice how the MultiIndex displayed above contains multiple levels of indexing, in this case 'year' and 'month' along the row dimension. The levels act as labels for each data value encoding these names.

In SAS, the terminology is 'month' nested within 'year'. Said another way, 'month' is the inner-most level of the index.

In [4]:

tickets.columns

Out[4]:

MultiIndex(levels=[['city', 'rural', 'suburbs'], ['day', 'night']],
           labels=[[0, 0, 1, 1, 2, 2], [0, 1, 0, 1, 0, 1]],
           names=['area', 'when'])

Similiarly, the indexing for columns contains multiple levels, with the name 'area' labeling the outer levels 'city', 'rural', and 'suburbs'. 'when' is the name for the inner levels 'day' and 'night'.

In [5]:

tickets

Out[5]:

	area	city		rural		suburbs
	when	day	night	day	night	day	night
year	month
2012	1	39.0	18.0	20.0	7.0	12.0	9.0
	2	12.0	25.0	3.0	17.0	7.0	30.0
	3	13.0	27.0	34.0	32.0	12.0	30.0
2013	1	2.0	7.0	12.0	9.0	30.0	51.0
	2	37.0	10.0	2.0	0.0	8.0	28.0
	3	17.0	17.0	17.0	20.0	15.0	18.0
2014	1	16.0	42.0	25.0	20.0	14.0	8.0
	2	27.0	9.0	8.0	51.0	5.0	2.0
	3	9.0	33.0	13.0	11.0	26.0	15.0
2015	1	3.0	21.0	7.0	8.0	33.0	34.0
	2	1.0	2.0	24.0	7.0	2.0	2.0
	3	13.0	2.0	1.0	19.0	1.0	1.0

In [6]:

tickets.shape

Out[6]:

(12, 6)

Another SAS-like way to think of a MultiIndex is nested group-by's illustrated in cell below with PROC SORT. The resulting SAS data set tickets is 'tall' and 'skinny' with one observation for each of the nested levels.

Below, we will illustrate how 'stacking' the DataFrame can be used to reshape it to a form similiar to the SAS data set tickets.

    /******************************************************/
    /* c06_tickets_proc_sort.sas                          */
    /******************************************************/
    34       proc sort;
    35       by year month area nd;
    NOTE: 72 observations were read from "WORK.tickets"
    NOTE: Data set "WORK.tickets" has 72 observation(s) and 5 variable(s)

A SAS analog of the tickets Dataframe is rendered using PROC TABULATE. Unlike the DataFrame methods illustrated here, the physical shape of the SAS data set is not changed. In this case PROC TABULATE processes the 'tall' SAS data set to produce printed output shaped like the 'tickets' DataFrame.

    /******************************************************/
    /* c06_tickets_proc_tabulate.sas                      */
    /******************************************************/
    12       data tickets;
    13       length area $ 7;
    14       call streaminit(time());
    15       do year = 2012 to 2015;
    16          do month = 1 to 3;
    17             do area = 'City', 'Suburbs', 'Rural';
    18                do nd = 'Night', 'Day';
    19                   tickets = abs(int(rand( 'NORMAL')*100));
    20                   output;
    21                end;
    22             end;
    23          end;
    24       end;
    25       proc tabulate;
    26          var tickets;;
    27          class area nd year month;
    28             table year * month ,
    29                   area=' ' * nd=' ' * sum=' ' * tickets=' ';

In [7]:

Image(filename='Anaconda3\\output\\tabulate.JPG')

Out[7]:

Multi-Indexed Selection¶

The tickets DataFrame is essentially four-dimensional with area nested within when for columns, and month nested within year for rows. The multi-indexing for both rows and columns make sub-setting operations straightforward. In this case, selecting by the index label 'rural' returns an entire DataFrame. Strictly speaking, 'rural' is a level in the column MultiIndex illustrated in cell #5 above.

Select all rows by the 'rural' level.

In [8]:

tickets['rural']

Out[8]:

	when	day	night
year	month
2012	1	20.0	7.0
	2	3.0	17.0
	3	34.0	32.0
2013	1	12.0	9.0
	2	2.0	0.0
	3	17.0	20.0
2014	1	25.0	20.0
	2	8.0	51.0
	3	13.0	11.0
2015	1	7.0	8.0
	2	24.0	7.0
	3	1.0	19.0

Select all rows for tickets issued in the city at night.

In [9]:

tickets['city', 'night']

Out[9]:

year  month
2012  1        18.0
      2        25.0
      3        27.0
2013  1         7.0
      2        10.0
      3        17.0
2014  1        42.0
      2         9.0
      3        33.0
2015  1        21.0
      2         2.0
      3         2.0
Name: (city, night), dtype: float64

Multiple keys can be specified. Above, the request is for those tickets issued at 'night' in the 'city'. In this instance a Series is returned.

The examples in cells #8 and #9 work by selecting labels from 'area' which is the outer-most MultiIndex level for columns in the tickets DataFrame. Said another way, 'when' ('day' or 'night') is nested inside 'area'.

Another method to select rows and columns nested inside a hierarchical index is the DataFrame xs() method. This cross-section method generally results in a smaller DataFrame than the original.

xs() method for cross sections¶

The DataFrame xs() method uses the level argument to select data for rows and columns. The example in the cell below selects the 1st month for each year.

In [10]:

tickets.xs(('1'), level='month')

Out[10]:

area	city		rural		suburbs
when	day	night	day	night	day	night
year
2012	39.0	18.0	20.0	7.0	12.0	9.0
2013	2.0	7.0	12.0	9.0	30.0	51.0
2014	16.0	42.0	25.0	20.0	14.0	8.0
2015	3.0	21.0	7.0	8.0	33.0	34.0

Likewise, you can request just the rows for 2013.

In [11]:

tickets.xs((2013), level='year')

Out[11]:

area	city		rural		suburbs
when	day	night	day	night	day	night
month
1	2.0	7.0	12.0	9.0	30.0	51.0
2	37.0	10.0	2.0	0.0	8.0	28.0
3	17.0	17.0	17.0	20.0	15.0	18.0

Selections can use multiple keys. In this case, 2013 and month '1'.

In [12]:

tickets.xs((2013, '1'), level=('year', 'month'))

Out[12]:

	area	city		rural		suburbs
	when	day	night	day	night	day	night
year	month
2013	1	2.0	7.0	12.0	9.0	30.0	51.0

The .xs method also works for columns with the optional axis=1 (for columns) argument. In this case, the column 'night' in each of the 3 areas is selected.

In [13]:

tickets.xs(('night'), level='when', axis=1)

Out[13]:

	area	city	rural	suburbs
year	month
2012	1	18.0	7.0	9.0
	2	25.0	17.0	30.0
	3	27.0	32.0	30.0
2013	1	7.0	9.0	51.0
	2	10.0	0.0	28.0
	3	17.0	20.0	18.0
2014	1	42.0	20.0	8.0
	2	9.0	51.0	2.0
	3	33.0	11.0	15.0
2015	1	21.0	8.0	34.0
	2	2.0	7.0	2.0
	3	2.0	19.0	1.0

Advanced Indexing with .loc indexer¶

Chapter 5, Understanding Indexes covers the .loc indexer which can also be used with hierarchical indicies.

The colon (:) to the left of the comma (,) selects all rows. The selection to the right of the comma (,) requests the levels 'city' and 'suburbs' from 'area'.

In [14]:

tickets.loc[:,['city', 'suburbs']]

Out[14]:

	area	city		suburbs
	when	day	night	day	night
year	month
2012	1	39.0	18.0	12.0	9.0
	2	12.0	25.0	7.0	30.0
	3	13.0	27.0	12.0	30.0
2013	1	2.0	7.0	30.0	51.0
	2	37.0	10.0	8.0	28.0
	3	17.0	17.0	15.0	18.0
2014	1	16.0	42.0	14.0	8.0
	2	27.0	9.0	5.0	2.0
	3	9.0	33.0	26.0	15.0
2015	1	3.0	21.0	33.0	34.0
	2	1.0	2.0	2.0	2.0
	3	13.0	2.0	1.0	1.0

Below is a an example of partial slicing.

In [15]:

tickets.loc[2013:2014, ['city', 'suburbs']]

Out[15]:

	area	city		suburbs
	when	day	night	day	night
year	month
2013	1	2.0	7.0	30.0	51.0
	2	37.0	10.0	8.0	28.0
	3	17.0	17.0	15.0	18.0
2014	1	16.0	42.0	14.0	8.0
	2	27.0	9.0	5.0	2.0
	3	9.0	33.0	26.0	15.0

The example below illustrates slicing with a range of values by providing tuples containing both the row slice ( 2nd month, 2013 to 3rd month 2014) and the column slice ('rural' 'day' to 'suburbs' 'day').

In [16]:

tickets.loc[(2013, '2') : (2014, '3'), ('rural', 'day') : ('suburbs', 'day')]

Out[16]:

	area	rural		suburbs
	when	day	night	day
year	month
2013	2	2.0	0.0	8.0
2013	3	17.0	20.0	15.0
2014	1	25.0	20.0	14.0
	2	8.0	51.0	5.0
	3	13.0	11.0	26.0

The example below illustrates the .loc() indexer using multiple keys for row and column slices.

In [17]:

tickets.loc[(2013, '3'): (2014, '1')]

Out[17]:

	area	city		rural		suburbs
	when	day	night	day	night	day	night
year	month
2013	3	17.0	17.0	17.0	20.0	15.0	18.0
2014	1	16.0	42.0	25.0	20.0	14.0	8.0

In [18]:

tickets.loc[(2013, '3'): (2014, '1'), "city"]

Out[18]:

	when	day	night
year	month
2013	3	17.0	17.0
2014	1	16.0	42.0

Using Boolean operators with .loc indexer¶

Using boolean operators with the .loc indexer permits boolean evaluations across the Dataframe values. You can use a pd.IndexSlice to permit syntax using the colon (:) rather then slice(None) syntax when specifying ranges.

The mask object returns the boolean value True which is applied to the column MultiIndex 'night' as the inner-most index level within 'city'. Notice the shape of the DataFrame is defined by just those rows in the 'night' within 'city' column construct that evaluate True.

In [19]:

idx = pd.IndexSlice
mask = tickets[('city','night')]>25
tickets.loc[idx[mask,:,]]

Out[19]:

	area	city		rural		suburbs
	when	day	night	day	night	day	night
year	month
2012	3	13.0	27.0	34.0	32.0	12.0	30.0
2014	1	16.0	42.0	25.0	20.0	14.0	8.0
2014	3	9.0	33.0	13.0	11.0	26.0	15.0

The example in the cell below applies the boolean 'mask2' to all values in the DataFrame

In [20]:

mask2 = tickets[('rural', 'day')]>15
tickets.loc[idx[mask2, 'city':'rurual']]

Out[20]:

	area	city		rural
	when	day	night	day	night
year	month
2012	1	39.0	18.0	20.0	7.0
2012	3	13.0	27.0	34.0	32.0
2013	3	17.0	17.0	17.0	20.0
2014	1	16.0	42.0	25.0	20.0
2015	2	1.0	2.0	24.0	7.0

The example below applies boolean 'mask2' and also scopes the column request.

In [21]:

tickets.loc[idx[mask2, 'rural']]

Out[21]:

	when	day	night
year	month
2012	1	20.0	7.0
2012	3	34.0	32.0
2013	3	17.0	20.0
2014	1	25.0	20.0
2015	2	24.0	7.0

stack() and unstack() methods¶

DataFrames with hierarchical indices can use the .stack and .unstack attributes to reshape data from 'tall and skinny' to 'short and fat' formats and vice-versa.

The .stack() attribute pivots columns into rows. The .unstack() attribute pivots rows into columns. Consider the examples in the cell below.

The .unstack() attribute used without any arguments in this case pivots the 'month' rows to the inner-most index level of the columns. 'month' was the inner-most index level for rows. Also notice how we assigned the 'unstacked' DataFrame to create the new one, df_u.

In [22]:

df_u = tickets.unstack()
df_u

Out[22]:

area	city						rural						suburbs
when	day			night			day			night			day			night
month	1	2	3	1	2	3	1	2	3	1	2	3	1	2	3	1	2	3
year
2012	39.0	12.0	13.0	18.0	25.0	27.0	20.0	3.0	34.0	7.0	17.0	32.0	12.0	7.0	12.0	9.0	30.0	30.0
2013	2.0	37.0	17.0	7.0	10.0	17.0	12.0	2.0	17.0	9.0	0.0	20.0	30.0	8.0	15.0	51.0	28.0	18.0
2014	16.0	27.0	9.0	42.0	9.0	33.0	25.0	8.0	13.0	20.0	51.0	11.0	14.0	5.0	26.0	8.0	2.0	15.0
2015	3.0	1.0	13.0	21.0	2.0	2.0	7.0	24.0	1.0	8.0	7.0	19.0	33.0	2.0	1.0	34.0	2.0	1.0

The MultiIndex for the columns from DataFrame df_u is shown below. Contrast this column index with the one for tickets in cell #4 above.

In [23]:

df_u.columns

Out[23]:

MultiIndex(levels=[['city', 'rural', 'suburbs'], ['day', 'night'], ['1', '2', '3']],
           labels=[[0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2], [0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1], [0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2]],
           names=['area', 'when', 'month'])

The .stack() attribute performs the reverse operation of the .unstack() attribute. The .stack() attribute takes the inner most index level of the columns level and pivots them to the inner-most index level of the rows. In this case the new DataFrame is called df_s.

In [24]:

df_s = tickets.stack()
df_s

Out[24]:

		area	city	rural	suburbs
year	month	when
2012	1	day	39.0	20.0	12.0
	1	night	18.0	7.0	9.0
	2	day	12.0	3.0	7.0
	2	night	25.0	17.0	30.0
	3	day	13.0	34.0	12.0
	3	night	27.0	32.0	30.0
2013	1	day	2.0	12.0	30.0
	1	night	7.0	9.0	51.0
	2	day	37.0	2.0	8.0
	2	night	10.0	0.0	28.0
	3	day	17.0	17.0	15.0
	3	night	17.0	20.0	18.0
2014	1	day	16.0	25.0	14.0
	1	night	42.0	20.0	8.0
	2	day	27.0	8.0	5.0
	2	night	9.0	51.0	2.0
	3	day	9.0	13.0	26.0
	3	night	33.0	11.0	15.0
2015	1	day	3.0	7.0	33.0
	1	night	21.0	8.0	34.0
	2	day	1.0	24.0	2.0
	2	night	2.0	7.0	2.0
	3	day	13.0	1.0	1.0
	3	night	2.0	19.0	1.0

DataFrames containing MultiIndexes can 'stack' and 'unstack' multiple levels at a time.

In [25]:

df_like_sas = pd.DataFrame(tickets.stack(level=['when', 'area']))

We can 'stack' the original tickets DataFrame by the levels 'when' and 'area' from the column MultiIndex to create a new DataFrame shaped similiarly to the SAS data set tickets.

In [26]:

df_like_sas.shape

Out[26]:

(72, 1)

Display the first 12 rows.

In [27]:

df_like_sas.head(12)

Out[27]:

				0
year	month	when	area
2012	1	day	city	39.0
			rural	20.0
			suburbs	12.0
		night	city	18.0
			rural	7.0
			suburbs	9.0
	2	day	city	12.0
			rural	3.0
			suburbs	7.0
		night	city	25.0
			rural	17.0
			suburbs	30.0

Resources¶

Python Data Science Handbook , Essential Tools for Working With Data, by Jake VanderPlas.

Reshaping by pivoting DataFrame objects , from the pandas 0.19.0 documentation.

Reshaping by stacking and unstacking from the pandas 0.19.0 documentation.

MultiIndex / Advanced Indexing from the pandas 0.19.0 documentation.

Return to Chapter List