Pandas MultiIndex Tutorial

What's Pandas?

Pandas is one of the most-used open source libraries for importing and analyzing data available in Python today. It provides convenient ways to import, view, split, apply, and combine array-like data. And not just convenient, but efficient, too. For example, Pandas' read_csv and to_csv functions are so efficient that the library is often imported just for this task instead of relying on the standard library alternative!

The core value of the library, however, comes through several data structure options, primarily Series (for labeled, homogenously-typed, one-dimensional arrays) and DataFrames (for labeled, potentially heterogeneously-typed, two-dimensional arrays).

What's a DataFrame?

DataFrames are two-dimensional, labeled data structures, with columns of potentially different types. You can think of it like a spreadsheet or SQL table, or a dict of Series objects. It is generally the most commonly used Pandas object. Like Series, DataFrame accepts many different kinds of input, including dicts, lists, lists of lists, Series, numpy arrays, other DataFrames, external data from CSVs, etc.

The DataFrame has two core parts: an index (row labels; look like columns, but aren't) and columns (data with headers).

What's an index?

The index of a DataFrame is a set (i.e. each element is only represented once) that consists of a label for each row. To be helpful, those labels should be not just unique, but also meaningful. By default, if no index is provided, the index will be a numbered range, starting from 0 (known as a range index).

A more meaningful index, however, would be something that uniquely describes each row of your data in a way that will help you look things up. For example, in a list of transactions, the date-time might be most useful. Alternatively, in a grade book for a math class the name of the student might be most useful.

The most confusing thing about understanding about Pandas' indexes at first (in my opinion) is how to interact with them. While they look exactly like columns, they're not referenced in the same way (learn more here). Understanding that, I personally find it most useful to think of an index in Pandas like a column that's in time-out and that just can't play with all the other columns.

  • Note: Another common way in which Pandas' indexes are misunderstood at first is by thinking of them in SQL-like terms. While that can be helpful if that's what you're familiar with, in practice (and for performance) Pandas' and SQL's indexes are quite different (see this SO answer and Pandas Under The Hood by Jeff Tratner).

What is a MultiIndex DataFrame?

Pandas' multiindex DataFrames extend the DataFrames described above by enabling effective storage and manipulation of arbitrarily high dimension data in a 2-dimensional tabular structure. ((If that sentence doesn't make sense yet, don't worry - it should by the end of this tutorial.))

While the displayed version of a multiindexed DataFrame doesn't appear to be much more than a prettily-organized regular DataFrame, it's actually a pretty powerful structure if the data warrants its use.

When should you use one?

  1. When a single column’s value isn’t enough to uniquely identify a row (e.g. multiple records on the same date means date alone isn’t a good index).
  2. When data is logically hierarchical - meaning that it has multiple dimensions or “levels.”

Besides structure, multiindexes offer relatively easy in-memory retreival of complex data.

Realistic Demo Data

For this tutorial we will work with a realistic case of when multiindex DataFrames can come in handy - a grocer's retail transactions.

In [1]:
# Normally put all your import up top, but this cell was ugly in the midst of the intro paragraphs
from typing import Any

import numpy as np
import pandas as pd

display(f"Pandas version: {pd.__version__}")
'Pandas version: 0.22.0'

Note that this notebook is using the Pandas version above. There have been many changes to MultiIndex methods since 0.19, including major bug fixes. It is STRONGLY recommended to use the latest version of Pandas, but at least version 0.21 is required for all of the techniques in this notebook to work as presented.

In [2]:
# # Creates random mock data.
# # Based on the UPCs found in upc_meta_data.csv and saves to data.csv
# %run mocker.py
In [3]:
df = pd.read_csv('data.csv', parse_dates=['Date'])
df.sample(10)
Out[3]:
Date Store Category Subcategory UPC EAN Description Dollars Units
20 2018-07-14 Store 3 Beer Lagers 702770081011 Omission - Gluten Free Lager - 6 Pack 128.64 12
101 2018-07-10 Store 3 Alcohol Liqour 80480280024 Grey Goose - Imported Vodka - 750 mL Bottle 128.30 5
68 2018-07-14 Store 1 Wine Red 492130008399 Yellow Tail - Merlot - 750 mL Bottle 32.10 6
12 2018-07-16 Store 2 Beer Ales 94922755711 Bowser Beer - Beefy Brown Ale - 6 Pack 99.40 10
58 2018-07-16 Store 1 Wine Red 89744402585 Trapiche - Malbec - 750 mL Bottle 126.14 14
1 2018-07-11 Store 2 Beer Ales 736920111112 Goose Island - Honkers Ale - 6 Pack 166.74 14
100 2018-07-10 Store 2 Alcohol Liqour 80480280024 Grey Goose - Imported Vodka - 750 mL Bottle 231.30 9
59 2018-07-16 Store 3 Wine Red 89744402585 Trapiche - Malbec - 750 mL Bottle 66.16 8
116 2018-07-11 Store 3 Alcohol Liquor 674545000001 Don Julio - Tequila Blanco - 750 mL Bottle 102.04 4
77 2018-07-15 Store 3 Wine Red 492130008399 Yellow Tail - Merlot - 750 mL Bottle 58.20 12

Each row of data represents a sale of an item, but how do we set our index meaningfully? A numeric ID for each transaction would be fine, but it wouldn't tell us terribly much. Even the date of a transaction isn't useful by itself, since it's common for such a company to have many transactions on a given date (even at the same time).

Instead, we will have to look at a combination of the available metadata to create a unique and meaningful index. At first glance, a combination of the date, the store, and the hierarchy of each item (Category > Subcategory > UPC) look like they might be enough to get us a unique index.

Setting and Manipulating MultiIndexes

So let's take a look at how we can create our multiindex from our regular ol' DataFrame. We'll walk through the basics of setting, reordering, and resetting indexes, along with some useful tips/tricks. Then we can begin investigating our transaction data to learn about our sales and trends.

In [4]:
# Set just like the index for a DataFrame...
# ...except we give a list of column names instead of a single string column name
df.set_index(['Date', 'Store', 'Category', 'Subcategory', 'Description'], inplace=True)
df.head(3)
Out[4]:
UPC EAN Dollars Units
Date Store Category Subcategory Description
2018-07-11 Store 3 Beer Ales Goose Island - Honkers Ale - 6 Pack 736920111112 35.68 4
Store 2 Beer Ales Goose Island - Honkers Ale - 6 Pack 736920111112 166.74 14
2018-07-10 Store 3 Beer Ales Goose Island - Honkers Ale - 6 Pack 736920111112 67.36 8

Uh oh - it looks like we forgot to add the 'UPC EAN' column to our index, but don't worry - Pandas has us covered with extra set_index parameters for MultiIndexes:

In [5]:
# We can append a column to our existing index
df.set_index('UPC EAN', append=True, inplace=True)
df.head(3)
Out[5]:
Dollars Units
Date Store Category Subcategory Description UPC EAN
2018-07-11 Store 3 Beer Ales Goose Island - Honkers Ale - 6 Pack 736920111112 35.68 4
Store 2 Beer Ales Goose Island - Honkers Ale - 6 Pack 736920111112 166.74 14
2018-07-10 Store 3 Beer Ales Goose Island - Honkers Ale - 6 Pack 736920111112 67.36 8

That's almost right, but we'd actually like 'Description' to show up after 'UPC EAN'. We have a couple of options to get things in the right order:

In [6]:
# Option 1 is the generalized solution to reorder the index levels
# Note: We're not making an inplace change in this cell,
#       but it's worth noting that this method doesn't have an inplace parameter.
df.reorder_levels(order=['Date', 'Store', 'Category', 'Subcategory', 'UPC EAN', 'Description']).head(3)
Out[6]:
Dollars Units
Date Store Category Subcategory UPC EAN Description
2018-07-11 Store 3 Beer Ales 736920111112 Goose Island - Honkers Ale - 6 Pack 35.68 4
Store 2 Beer Ales 736920111112 Goose Island - Honkers Ale - 6 Pack 166.74 14
2018-07-10 Store 3 Beer Ales 736920111112 Goose Island - Honkers Ale - 6 Pack 67.36 8

reorder_levels() is useful, but it was a pain to have to type all five levels just two switch two. In cases like this we have a second, less verbose option:

In [7]:
# Option 2 just switches two index levels (a more common need than you'd think)
# Note: This time we're doing an inplace change, but there's no parameter for this method either.
df = df.swaplevel('Description', 'UPC EAN')
df.head(3)
Out[7]:
Dollars Units
Date Store Category Subcategory UPC EAN Description
2018-07-11 Store 3 Beer Ales 736920111112 Goose Island - Honkers Ale - 6 Pack 35.68 4
Store 2 Beer Ales 736920111112 Goose Island - Honkers Ale - 6 Pack 166.74 14
2018-07-10 Store 3 Beer Ales 736920111112 Goose Island - Honkers Ale - 6 Pack 67.36 8

Just when we thought we were done, it turns our we forgot to add the highest level of the product hierarchy - the Department - not just to our index, but to our DataFrame altogether. Luckily all of our records belong in the same Department, so here's a neat trick to add a new column with all the same values as a level in an existing index:

In [8]:
# A handy function to keep around for projects
def add_constant_index_level(df: pd.DataFrame, value: Any, level_name: str):
    """Add a new level to an existing index where every row has the same, given value.
    
    Args:
        df: Any existing pd.DataFrame.
        value: Value to be placed in every row of the new index level.
        level_name: Title of the new index level.
    
    Returns:
        df with an additional, prepended index level.
    """
    return pd.concat([df], keys=[value], names=[level_name])

df = add_constant_index_level(df, "Booooze", "Department")
df = df.reorder_levels(order=['Date', 'Store', 'Department', 'Category', 'Subcategory', 'UPC EAN', 'Description'])
df.head(3)

# # If we wanted to later drop that level
# df.index = df.index.droplevel(level='Department')
# df.head(3)
Out[8]:
Dollars Units
Date Store Department Category Subcategory UPC EAN Description
2018-07-11 Store 3 Booooze Beer Ales 736920111112 Goose Island - Honkers Ale - 6 Pack 35.68 4
Store 2 Booooze Beer Ales 736920111112 Goose Island - Honkers Ale - 6 Pack 166.74 14
2018-07-10 Store 3 Booooze Beer Ales 736920111112 Goose Island - Honkers Ale - 6 Pack 67.36 8

Now that our index is set the way we want it, what if we want to interact with those index levels? Here are a few helpful code snippets:

In [9]:
# checking out their unique values, for a single level 
df.index.get_level_values('Subcategory').unique()

# checking out their unique values, for combinations of multiple levels
# See answer at https://stackoverflow.com/questions/39080555/pandas-get-level-values-for-multiple-columns
Out[9]:
Index(['Ales', 'Lagers', 'Stouts', 'Malts', 'Red', 'White', 'Rose', 'Liqour',
       'Liquor'],
      dtype='object', name='Subcategory')

Note the typo in "Liquor" above. Good thing we checked out unique values! Maybe someone can submit a pull request to fix this for me :)

In [10]:
# Replace level values using rename
# Note that this can be done using set_levels as well, but it's a pain
df.rename(index={'Goose Island - Honkers Ale - 6 Pack': 'We changed this'}).head(15)

# Replace np.nan level values
# df.rename(index={np.nan: "''"}, inplace=True)
Out[10]:
Dollars Units
Date Store Department Category Subcategory UPC EAN Description
2018-07-11 Store 3 Booooze Beer Ales 736920111112 We changed this 35.68 4
Store 2 Booooze Beer Ales 736920111112 We changed this 166.74 14
2018-07-10 Store 3 Booooze Beer Ales 736920111112 We changed this 67.36 8
Store 2 Booooze Beer Ales 736920111112 We changed this 80.01 9
2018-07-12 Store 3 Booooze Beer Ales 736920111112 We changed this 78.30 10
Store 2 Booooze Beer Ales 736920111112 We changed this 37.75 5
2018-07-14 Store 3 Booooze Beer Ales 736920111112 We changed this 113.49 13
Store 2 Booooze Beer Ales 736920111112 We changed this 45.05 5
2018-07-10 Store 2 Booooze Beer Ales 94922755711 Bowser Beer - Beefy Brown Ale - 6 Pack 81.76 8
Store 3 Booooze Beer Ales 94922755711 Bowser Beer - Beefy Brown Ale - 6 Pack 62.48 8
2018-07-11 Store 2 Booooze Beer Ales 94922755711 Bowser Beer - Beefy Brown Ale - 6 Pack 104.91 13
Store 3 Booooze Beer Ales 94922755711 Bowser Beer - Beefy Brown Ale - 6 Pack 72.72 9
2018-07-16 Store 2 Booooze Beer Ales 94922755711 Bowser Beer - Beefy Brown Ale - 6 Pack 99.40 10
Store 3 Booooze Beer Ales 94922755711 Bowser Beer - Beefy Brown Ale - 6 Pack 46.65 5
2018-07-14 Store 2 Booooze Beer Ales 94922755711 Bowser Beer - Beefy Brown Ale - 6 Pack 124.52 11
In [11]:
# Rename index levels
temp_df = df.copy()
temp_df.index = df.index.set_names('Desc.', level='Description')
temp_df.head(3)
Out[11]:
Dollars Units
Date Store Department Category Subcategory UPC EAN Desc.
2018-07-11 Store 3 Booooze Beer Ales 736920111112 Goose Island - Honkers Ale - 6 Pack 35.68 4
Store 2 Booooze Beer Ales 736920111112 Goose Island - Honkers Ale - 6 Pack 166.74 14
2018-07-10 Store 3 Booooze Beer Ales 736920111112 Goose Island - Honkers Ale - 6 Pack 67.36 8

Understanding the MultiIndex Object

Why is this section all the way down here? Because the MultiIndex object is scary looking if you're new to using them. Many guides to hierarchical data analysis using multiindex DataFrames start with DataFrame creation and manipulation using MultiIndex objects, which I think both hinders adoption and is not reflective of how a lot of DataFrames get created in practice. As a result, my explanation of MultiIndex objects is very basic, because there are lots of other great resources out there if you want to learn more. Here are my top two:

In [12]:
df.index
Out[12]:
MultiIndex(levels=[[2018-07-10 00:00:00, 2018-07-11 00:00:00, 2018-07-12 00:00:00, 2018-07-13 00:00:00, 2018-07-14 00:00:00, 2018-07-15 00:00:00, 2018-07-16 00:00:00], ['Store 1', 'Store 2', 'Store 3'], ['Booooze'], ['Alcohol', 'Beer', 'Wine'], ['Ales', 'Lagers', 'Liqour', 'Liquor', 'Malts', 'Red', 'Rose', 'Stouts', 'White'], [71250000140, 80432400630, 80480280024, 81248501095, 83783575217, 85000024218, 89744402585, 94922755711, 492130008399, 674545000001, 702770081011, 702770082018, 736920111112], ['Bowser Beer - Beefy Brown Ale - 6 Pack', 'Colt 45 - Malt Liquor - 6 Pack', 'Curious Beasts - Chardonnay - 750 mL Bottle', 'Dark Horse - S. Blanc Rose - 750 mL Bottle', 'Don Julio - Tequila Blanco - 750 mL Bottle', 'Glenlivet - 12 Year Scotch Whisky - 750 mL Bottle', 'Goose Island - Honkers Ale - 6 Pack', 'Grey Goose - Imported Vodka - 750 mL Bottle', 'Omission - Gluten Free Lager - 6 Pack', 'Omission Pale Ale - Gluten Free Pale Ale - 6 Pack', 'Sierra Nevada Brewing Co. - Stout - 6 Pack', 'Trapiche - Malbec - 750 mL Bottle', 'Yellow Tail - Merlot - 750 mL Bottle']],
           labels=[[1, 1, 0, 0, 2, 2, 4, 4, 0, 0, 1, 1, 6, 6, 4, 4, 5, 5, 2, 2, 4, 4, 0, 0, 6, 6, 2, 2, 4, 4, 3, 3, 0, 0, 1, 1, 5, 5, 0, 0, 1, 1, 6, 6, 5, 5, 2, 2, 6, 6, 1, 1, 5, 5, 3, 3, 2, 2, 6, 6, 5, 5, 0, 0, 1, 1, 4, 4, 4, 4, 3, 3, 0, 0, 6, 6, 5, 5, 2, 2, 3, 3, 2, 2, 0, 0, 1, 1, 4, 4, 6, 6, 0, 0, 3, 3, 2, 2, 1, 1, 0, 0, 5, 5, 2, 2, 1, 1, 6, 6, 2, 2, 6, 6, 0, 0, 1, 1, 5, 5, 2, 2, 3, 3, 1, 1, 4, 4, 5, 5, 0, 0], [2, 1, 2, 1, 2, 1, 2, 1, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 2, 0, 2, 0, 2, 0, 2, 0, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 0, 2, 0, 2, 0, 2, 0, 2, 0, 2, 0, 2, 0, 2, 0, 2, 0, 2, 0, 2, 0, 2, 0, 2, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 4, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 6, 6, 6, 6, 6, 6, 6, 6, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3], [12, 12, 12, 12, 12, 12, 12, 12, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 10, 10, 10, 10, 10, 10, 10, 10, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 0, 0, 0, 0, 0, 0, 0, 0, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 5, 5, 5, 5, 5, 5, 5, 5, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [6, 6, 6, 6, 6, 6, 6, 6, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 8, 8, 8, 8, 8, 8, 8, 8, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 1, 1, 1, 1, 1, 1, 1, 1, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5]],
           names=['Date', 'Store', 'Department', 'Category', 'Subcategory', 'UPC EAN', 'Description'])

Well that's gross looking...but don't be scared - it's actually not that hard to understand.

'levels' is a list of lists, where each sublist represents all possible values in that index level. In other words, the 'levels' parameter reflects all possible unique values by level. For example, our first index level ('Date') has the possible values ['2018-07-10', '2018-07-11', '2018-07-12', ...].

  • Important Note: When talking about a multiindex DataFrame (not the parameter for the MultiIndex object), we talk about the "levels" as the index "columns." For example, the 'levels' of our df in a more general sense are 'Date', 'Store', 'Department', etc. Levels in this sense (and elsewhere in code) can also be referenced by number (e.g. 'Date' = 0 [read as 'level 0'], 'Store' = 1, 'Department' = 2, etc.).

'labels' is also a list of lists, but here each sublist reflects all of the values that appear in the row of that index. In other words, each sublist in our labels is of the same length as the entire dataframe, and the value of each row is one of the possible values defined in our associated level (above). Looking again at our first index level ('Date'), we see values like [1, 1, 0, 0, 2, 2, 4, 4 ...]. There are just an enumerated representation of the options defined in our level, so 0 = '2018-07-10', 1 = '2018-07-11', 2 = '2018-07-12', 3 = '2018-07-13', etc.

'names' is a list of the actual titles of each index level, in order of appearance from left to right.

With that fresh understanding of the 'anatomy' of a MultiIndex, we can look at...

Other Methods of Multiindex DataFrame Creation

For the most part, the two references listed in the section above cover this topic well; however, a common use case that isn't covered in those guides is creating a multiindex DataFrame while reading from a csv:

In [13]:
# We can set a MultiIndex while reading a csv by referencing columns to be used in the index by number
pd.read_csv("data.csv", index_col=[0, 1, 2, 3, 4, 5], skipinitialspace=True, parse_dates=['Date']).head(3)
Out[13]:
Dollars Units
Date Store Category Subcategory UPC EAN Description
2018-07-11 Store 3 Beer Ales 736920111112 Goose Island - Honkers Ale - 6 Pack 35.68 4
Store 2 Beer Ales 736920111112 Goose Island - Honkers Ale - 6 Pack 166.74 14
2018-07-10 Store 3 Beer Ales 736920111112 Goose Island - Honkers Ale - 6 Pack 67.36 8

We'll review more advanced importing/exporting methods below.

MultiIndex Columns (Multiple Column Levels)

For a different view we can also create hierarchical column levels. We'll do this by introducting a new method: unstack(). This function "pivots" an index level to a new level of column labels whose inner-most level consists of the pivoted index labels. Stack/unstack is one of the biggest reasons to use a MultiIndex, so it's work supplementing the examples here by checking out the official docs on reshaping.

With this new technique we can start actually investigating our data. For example, let's say we want to more easily compare sales of a product by store by day, we can unstack our 'Store' index level:

In [14]:
multi_col_lvl_df = df.unstack('Store')
multi_col_lvl_df.sample(10)
Out[14]:
Dollars Units
Store Store 1 Store 2 Store 3 Store 1 Store 2 Store 3
Date Department Category Subcategory UPC EAN Description
2018-07-16 Booooze Alcohol Liqour 80480280024 Grey Goose - Imported Vodka - 750 mL Bottle NaN 221.04 126.66 NaN 9.0 6.0
Liquor 674545000001 Don Julio - Tequila Blanco - 750 mL Bottle NaN 161.15 345.07 NaN 5.0 11.0
2018-07-10 Booooze Alcohol Liquor 80432400630 Glenlivet - 12 Year Scotch Whisky - 750 mL Bottle NaN 172.00 278.80 NaN 5.0 8.0
674545000001 Don Julio - Tequila Blanco - 750 mL Bottle NaN 168.06 204.12 NaN 6.0 7.0
2018-07-11 Booooze Alcohol Liquor 80432400630 Glenlivet - 12 Year Scotch Whisky - 750 mL Bottle NaN 201.72 318.80 NaN 6.0 8.0
2018-07-16 Booooze Beer Lagers 702770081011 Omission - Gluten Free Lager - 6 Pack 174.72 NaN 104.86 14.0 NaN 7.0
2018-07-12 Booooze Wine Red 492130008399 Yellow Tail - Merlot - 750 mL Bottle 95.16 NaN 75.60 12.0 NaN 10.0
2018-07-10 Booooze Beer Ales 702770082018 Omission Pale Ale - Gluten Free Pale Ale - 6 Pack NaN 89.76 135.72 NaN 6.0 13.0
2018-07-15 Booooze Alcohol Liquor 80432400630 Glenlivet - 12 Year Scotch Whisky - 750 mL Bottle NaN 415.68 476.64 NaN 12.0 12.0
2018-07-13 Booooze Beer Stouts 83783575217 Sierra Nevada Brewing Co. - Stout - 6 Pack NaN 82.67 75.74 NaN 7.0 7.0

The new view makes our comparison easier, but now it's a bit cluttered. Internally our multi-level columns are stored as tuples of the name values for each level, so we can easily fix the clutter by flattening the columns into a single level:

In [15]:
def flatten_cols(df: pd.DataFrame, delim: str = ""):
    """Flatten multiple column levels of the DataFrame into a one column level.

    Args:
        delim: the delimiter between the column values.

    Returns:
        A copy of the dataframe with the new column names.

    """
    new_cols = [delim.join((col_lev for col_lev in tup if col_lev))
                for tup in df.columns.values]
    ndf = df.copy()
    ndf.columns = new_cols

    return ndf

flattened_multi_col_df = flatten_cols(multi_col_lvl_df, " | ").head(3)
flattened_multi_col_df
Out[15]:
Dollars | Store 1 Dollars | Store 2 Dollars | Store 3 Units | Store 1 Units | Store 2 Units | Store 3
Date Department Category Subcategory UPC EAN Description
2018-07-10 Booooze Alcohol Liqour 80480280024 Grey Goose - Imported Vodka - 750 mL Bottle NaN 231.30 128.30 NaN 9.0 5.0
Liquor 80432400630 Glenlivet - 12 Year Scotch Whisky - 750 mL Bottle NaN 172.00 278.80 NaN 5.0 8.0
674545000001 Don Julio - Tequila Blanco - 750 mL Bottle NaN 168.06 204.12 NaN 6.0 7.0

If later we want to undo that flattening it's just as simple:

In [16]:
def unflatten_cols(df: pd.DataFrame, delim: str = ""):
    """Unflatten a single column level into multiple column levels.

    Args:
        delim: the delimiter to split on to identify the multiple column values.

    Returns:
        A copy of the dataframe with the new column levels.

    """
    new_cols = pd.MultiIndex.from_tuples([tuple(col.split(delim)) for col in df.columns])
    ndf = df.copy()
    ndf.columns = new_cols

    return ndf

unflatten_cols(flattened_multi_col_df, " | ")
Out[16]:
Dollars Units
Store 1 Store 2 Store 3 Store 1 Store 2 Store 3
Date Department Category Subcategory UPC EAN Description
2018-07-10 Booooze Alcohol Liqour 80480280024 Grey Goose - Imported Vodka - 750 mL Bottle NaN 231.30 128.30 NaN 9.0 5.0
Liquor 80432400630 Glenlivet - 12 Year Scotch Whisky - 750 mL Bottle NaN 172.00 278.80 NaN 5.0 8.0
674545000001 Don Julio - Tequila Blanco - 750 mL Bottle NaN 168.06 204.12 NaN 6.0 7.0

Importing/Exporting MultiIndex DataFrames

We've already seen an example of reading a csv, but what if we want to save our multiindex DataFrame and then be able to reread it? How the stored files need to be accessed and by whom they need to be accessed will determine a lot. If everyone who needs access to the data is Python/Pandas savy, pickling is fast and easy. If, however, other, not-tech-savy people will need to access the data, CSVs are a versitile storage medium. That said, it complicates our read_csv() a bit if we want to just dump out our multiindex. Revisiting our multi_col_lvl_df, where we have a multiindex DataFrame that has both multiple index levels and column levels, creates a difficult situation that requires getting to know as many as all 48 (yes, 48!) of read_csv's parameters.

In [17]:
# Write our multi-column-level df
multi_col_lvl_df.to_csv('multi_col_lvl_output.csv')

# Reading it back in requires the header parameter
read_multi_df = pd.read_csv('multi_col_lvl_output.csv', header=[0, 1], index_col=[0, 1, 2, 3, 4, 5],
                            skipinitialspace=True, parse_dates=[0]).head(3)

read_multi_df
Out[17]:
Dollars Units
Store Store 1 Store 2 Store 3 Store 1 Store 2 Store 3
Date Department Category Subcategory UPC EAN Description
2018-07-10 Booooze Alcohol Liqour 80480280024 Grey Goose - Imported Vodka - 750 mL Bottle NaN 231.30 128.30 NaN 9.0 5.0
Liquor 80432400630 Glenlivet - 12 Year Scotch Whisky - 750 mL Bottle NaN 172.00 278.80 NaN 5.0 8.0
674545000001 Don Julio - Tequila Blanco - 750 mL Bottle NaN 168.06 204.12 NaN 6.0 7.0

By adding a header parameter to deal with our multiple column levels, on top of index_col for the setting of the index, this import looks good, but it required inspecting the csv for the column and row numbers (that's right - names won't work. Just to be sure everything worked, let's also check our dtypes:

In [18]:
# A function to check our index level dtypes to aid this example
def index_level_dtypes(df):
    return [f"{df.index.names[i]}: {df.index.get_level_values(n).dtype}"
            for i, n in enumerate(df.index.names)]

index_level_dtypes(read_multi_df)
Out[18]:
['Date: datetime64[ns]',
 'Department: object',
 'Category: object',
 'Subcategory: object',
 'UPC EAN: int64',
 'Description: object']

Looks good. However let's take a step back and look at what would have happened if we'd not parsed the dates and wanted to later change the dtype:

In [19]:
# Reading it back iwithout parse_dates
bad_dtype_df = pd.read_csv('multi_col_lvl_output.csv', header=[0, 1], index_col=[0, 1, 2, 3, 4, 5],
                            skipinitialspace=True).head(3)

display(bad_dtype_df)
display(index_level_dtypes(bad_dtype_df))
Dollars Units
Store Store 1 Store 2 Store 3 Store 1 Store 2 Store 3
Date Department Category Subcategory UPC EAN Description
2018-07-10 Booooze Alcohol Liqour 80480280024 Grey Goose - Imported Vodka - 750 mL Bottle NaN 231.30 128.30 NaN 9.0 5.0
Liquor 80432400630 Glenlivet - 12 Year Scotch Whisky - 750 mL Bottle NaN 172.00 278.80 NaN 5.0 8.0
674545000001 Don Julio - Tequila Blanco - 750 mL Bottle NaN 168.06 204.12 NaN 6.0 7.0
['Date: object',
 'Department: object',
 'Category: object',
 'Subcategory: object',
 'UPC EAN: int64',
 'Description: object']

Updating the dtypes of our index columns isn't going to be so simple, because our MultiIndex levels are immutable. To make any changes to the levels, we actually have to recreate the levels:

In [20]:
bad_dtype_df.index.set_levels([pd.to_datetime(bad_dtype_df.index.levels[0]), bad_dtype_df.index.levels[1],
                               bad_dtype_df.index.levels[2], bad_dtype_df.index.levels[3],
                               bad_dtype_df.index.levels[4], bad_dtype_df.index.levels[5]],
                               inplace=True)
index_level_dtypes(bad_dtype_df)
Out[20]:
['Date: datetime64[ns]',
 'Department: object',
 'Category: object',
 'Subcategory: object',
 'UPC EAN: int64',
 'Description: object']

Alternatively we could reset just the 'Date' level, update its dtype, add it back to our index, and finally reorder our index:

In [21]:
bad_dtype_df2 = bad_dtype_df.reset_index(level='Date')
bad_dtype_df2['Date'] = pd.to_datetime(bad_dtype_df2['Date'],infer_datetime_format=True)
bad_dtype_df2.set_index('Date', append=True, inplace=True)
bad_dtype_df2 = (bad_dtype_df2.swaplevel('Date', 'Description')
                              .swaplevel('Date', 'UPC EAN')
                              .swaplevel('Date', 'Subcategory')
                              .swaplevel('Date', 'Category')
                              .swaplevel('Date', 'Department'))
index_level_dtypes(bad_dtype_df2)
Out[21]:
['Date: datetime64[ns]',
 'Department: object',
 'Category: object',
 'Subcategory: object',
 'UPC EAN: int64',
 'Description: object']

The moral of the story is, if you want to export and import complex multiindex DataFrames, learn read_csv's parameters well! The second moral is that, either of the above situations results in an awful lot of work.

One alternative if the readability of the CSV by other users is important, is to simply reset the column levels and index levels before we write to a CSV and recreate them when we import. While it requires more steps and information is lost while in CSV form, it makes the code manipulations a lot easier:

In [22]:
# a) restack the column levels,
# b) drop any blanks the unstacking created, and then
# c) reset the index so everything is a flat column again
# d) output to csv
multi_col_lvl_df.stack().dropna().reset_index().to_csv('index_removed_output.csv')

# Reading it back in will require
read_df = pd.read_csv("data.csv", index_col=[0, 1, 2, 3, 4, 5], skipinitialspace=True, parse_dates=['Date'])
read_df.unstack('Store').head(3)
Out[22]:
Dollars Units
Store Store 1 Store 2 Store 3 Store 1 Store 2 Store 3
Date Category Subcategory UPC EAN Description
2018-07-10 Alcohol Liqour 80480280024 Grey Goose - Imported Vodka - 750 mL Bottle NaN 231.30 128.30 NaN 9.0 5.0
Liquor 80432400630 Glenlivet - 12 Year Scotch Whisky - 750 mL Bottle NaN 172.00 278.80 NaN 5.0 8.0
674545000001 Don Julio - Tequila Blanco - 750 mL Bottle NaN 168.06 204.12 NaN 6.0 7.0

Still takes some work, but the code's a lot more straightforward.

To this point we've really just been learning the basics of setting up our multiindex DataFrames, with a few neat tricks along the way. Now comes the fun part - actually interacting with those DataFrames to analyze our data.

Multiindex Math

Math operations are as nearly as easy with multiindex DataFrames as with regular ones, but a whole lot more powerful. Let's say we want to calculate dollars per unit for each of our stores. That's as simple as:

In [23]:
dollars_per_unit = multi_col_lvl_df['Dollars'] / multi_col_lvl_df['Units']
dollars_per_unit.sample(10)
Out[23]:
Store Store 1 Store 2 Store 3
Date Department Category Subcategory UPC EAN Description
2018-07-11 Booooze Alcohol Liquor 80432400630 Glenlivet - 12 Year Scotch Whisky - 750 mL Bottle NaN 33.62 39.85
2018-07-16 Booooze Alcohol Liquor 674545000001 Don Julio - Tequila Blanco - 750 mL Bottle NaN 32.23 31.37
2018-07-10 Booooze Alcohol Liquor 674545000001 Don Julio - Tequila Blanco - 750 mL Bottle NaN 28.01 29.16
Wine White 81248501095 Curious Beasts - Chardonnay - 750 mL Bottle NaN 20.94 19.90
Red 89744402585 Trapiche - Malbec - 750 mL Bottle 7.11 NaN 6.30
Beer Ales 702770082018 Omission Pale Ale - Gluten Free Pale Ale - 6 Pack NaN 14.96 10.44
Alcohol Liqour 80480280024 Grey Goose - Imported Vodka - 750 mL Bottle NaN 25.70 25.66
2018-07-16 Booooze Wine Red 492130008399 Yellow Tail - Merlot - 750 mL Bottle 4.89 NaN 7.76
2018-07-12 Booooze Beer Stouts 83783575217 Sierra Nevada Brewing Co. - Stout - 6 Pack NaN 10.61 7.81
2018-07-16 Booooze Wine Red 89744402585 Trapiche - Malbec - 750 mL Bottle 9.01 NaN 8.27

We just operated on all the pairwise sub column levels at the same time! Now we'd like this answer as columns back in our original DataFrame:

In [24]:
# Add a column level for our new measure
dollars_per_unit.columns = pd.MultiIndex.from_product([['Dollars per Unit'], dollars_per_unit.columns])

# Concat it with our original data
pd.concat([multi_col_lvl_df, dollars_per_unit], axis='columns').head(3)
Out[24]:
Dollars Units Dollars per Unit
Store Store 1 Store 2 Store 3 Store 1 Store 2 Store 3 Store 1 Store 2 Store 3
Date Department Category Subcategory UPC EAN Description
2018-07-10 Booooze Alcohol Liqour 80480280024 Grey Goose - Imported Vodka - 750 mL Bottle NaN 231.30 128.30 NaN 9.0 5.0 NaN 25.70 25.66
Liquor 80432400630 Glenlivet - 12 Year Scotch Whisky - 750 mL Bottle NaN 172.00 278.80 NaN 5.0 8.0 NaN 34.40 34.85
674545000001 Don Julio - Tequila Blanco - 750 mL Bottle NaN 168.06 204.12 NaN 6.0 7.0 NaN 28.01 29.16

We can similarly apply functions to our multiindex DataFrame:

In [25]:
# Change our units to 000s for funsies
multi_col_lvl_df.applymap(lambda x: np.nan if np.isnan(x) else str(round(x/1000, 2)) + "k").head(10)
Out[25]:
Dollars Units
Store Store 1 Store 2 Store 3 Store 1 Store 2 Store 3
Date Department Category Subcategory UPC EAN Description
2018-07-10 Booooze Alcohol Liqour 80480280024 Grey Goose - Imported Vodka - 750 mL Bottle NaN 0.23k 0.13k NaN 0.01k 0.01k
Liquor 80432400630 Glenlivet - 12 Year Scotch Whisky - 750 mL Bottle NaN 0.17k 0.28k NaN 0.01k 0.01k
674545000001 Don Julio - Tequila Blanco - 750 mL Bottle NaN 0.17k 0.2k NaN 0.01k 0.01k
Beer Ales 94922755711 Bowser Beer - Beefy Brown Ale - 6 Pack NaN 0.08k 0.06k NaN 0.01k 0.01k
702770082018 Omission Pale Ale - Gluten Free Pale Ale - 6 Pack NaN 0.09k 0.14k NaN 0.01k 0.01k
736920111112 Goose Island - Honkers Ale - 6 Pack NaN 0.08k 0.07k NaN 0.01k 0.01k
Lagers 702770081011 Omission - Gluten Free Lager - 6 Pack 0.14k NaN 0.18k 0.01k NaN 0.01k
Stouts 83783575217 Sierra Nevada Brewing Co. - Stout - 6 Pack NaN 0.08k 0.12k NaN 0.01k 0.01k
Wine Red 89744402585 Trapiche - Malbec - 750 mL Bottle 0.06k NaN 0.04k 0.01k NaN 0.01k
492130008399 Yellow Tail - Merlot - 750 mL Bottle 0.05k NaN 0.03k 0.01k NaN 0.01k

Sorting

Before we start slicing and filtering, it's important to sort out data to drastically improve efficiency. In some cases pandas will begin sorting for you by default in the latest versions, but other times you'll get an ugly and confusing lexsort warning (referencing the default lexographic sorting order).

Sorting indexes works the exactly the same way for multiindex DataFrames as with regular DataFrames, but with some extra parameters to decide on:

In [26]:
# By default sort_index will sort all levels of the index,
# first sorting the first index level, then secondarily sorting the second index level, and so on...
sort1 = multi_col_lvl_df.copy()
sort1.sort_index(inplace=True)
print("\nSort by Date, then Department, then Category, etc.:")
display(sort1.head(10))

# ...but you can choose to starting from a different level...
sort2 = multi_col_lvl_df.copy()
sort2.sort_index(level='Category', inplace=True)
print("\nSort by Category, then Subcategory, then UPC EAN, then Description, then Date, then Deparment:")
display(sort2.head(10))

# ...or even to only sort on specifc levels.
sort3 = multi_col_lvl_df.copy()
sort3.sort_index(level=['Category', 'Date'], sort_remaining=False, inplace=True)
print("\nSort by Category, then Date only:")
display(sort3.head(10))
Sort by Date, then Department, then Category, etc.:
Dollars Units
Store Store 1 Store 2 Store 3 Store 1 Store 2 Store 3
Date Department Category Subcategory UPC EAN Description
2018-07-10 Booooze Alcohol Liqour 80480280024 Grey Goose - Imported Vodka - 750 mL Bottle NaN 231.30 128.30 NaN 9.0 5.0
Liquor 80432400630 Glenlivet - 12 Year Scotch Whisky - 750 mL Bottle NaN 172.00 278.80 NaN 5.0 8.0
674545000001 Don Julio - Tequila Blanco - 750 mL Bottle NaN 168.06 204.12 NaN 6.0 7.0
Beer Ales 94922755711 Bowser Beer - Beefy Brown Ale - 6 Pack NaN 81.76 62.48 NaN 8.0 8.0
702770082018 Omission Pale Ale - Gluten Free Pale Ale - 6 Pack NaN 89.76 135.72 NaN 6.0 13.0
736920111112 Goose Island - Honkers Ale - 6 Pack NaN 80.01 67.36 NaN 9.0 8.0
Lagers 702770081011 Omission - Gluten Free Lager - 6 Pack 137.67 NaN 176.96 13.0 NaN 14.0
Stouts 83783575217 Sierra Nevada Brewing Co. - Stout - 6 Pack NaN 78.40 119.90 NaN 7.0 10.0
Wine Red 89744402585 Trapiche - Malbec - 750 mL Bottle 56.88 NaN 37.80 8.0 NaN 6.0
492130008399 Yellow Tail - Merlot - 750 mL Bottle 46.30 NaN 34.56 10.0 NaN 6.0
Sort by Category, then Subcategory, then UPC EAN, then Description, then Date, then Deparment:
Dollars Units
Store Store 1 Store 2 Store 3 Store 1 Store 2 Store 3
Date Department Category Subcategory UPC EAN Description
2018-07-10 Booooze Alcohol Liqour 80480280024 Grey Goose - Imported Vodka - 750 mL Bottle NaN 231.30 128.30 NaN 9.0 5.0
Liquor 80432400630 Glenlivet - 12 Year Scotch Whisky - 750 mL Bottle NaN 172.00 278.80 NaN 5.0 8.0
674545000001 Don Julio - Tequila Blanco - 750 mL Bottle NaN 168.06 204.12 NaN 6.0 7.0
2018-07-11 Booooze Alcohol Liqour 80480280024 Grey Goose - Imported Vodka - 750 mL Bottle NaN 248.80 296.01 NaN 10.0 11.0
Liquor 80432400630 Glenlivet - 12 Year Scotch Whisky - 750 mL Bottle NaN 201.72 318.80 NaN 6.0 8.0
674545000001 Don Julio - Tequila Blanco - 750 mL Bottle NaN 338.10 102.04 NaN 10.0 4.0
2018-07-12 Booooze Alcohol Liqour 80480280024 Grey Goose - Imported Vodka - 750 mL Bottle NaN 279.36 228.08 NaN 12.0 8.0
Liquor 80432400630 Glenlivet - 12 Year Scotch Whisky - 750 mL Bottle NaN 250.25 455.13 NaN 7.0 13.0
674545000001 Don Julio - Tequila Blanco - 750 mL Bottle NaN 410.88 268.56 NaN 12.0 9.0
2018-07-13 Booooze Alcohol Liquor 80432400630 Glenlivet - 12 Year Scotch Whisky - 750 mL Bottle NaN 432.74 185.75 NaN 11.0 5.0
Sort by Category, then Date only:
Dollars Units
Store Store 1 Store 2 Store 3 Store 1 Store 2 Store 3
Date Department Category Subcategory UPC EAN Description
2018-07-10 Booooze Alcohol Liquor 674545000001 Don Julio - Tequila Blanco - 750 mL Bottle NaN 168.06 204.12 NaN 6.0 7.0
2018-07-11 Booooze Alcohol Liquor 674545000001 Don Julio - Tequila Blanco - 750 mL Bottle NaN 338.10 102.04 NaN 10.0 4.0
2018-07-12 Booooze Alcohol Liquor 674545000001 Don Julio - Tequila Blanco - 750 mL Bottle NaN 410.88 268.56 NaN 12.0 9.0
2018-07-15 Booooze Alcohol Liquor 674545000001 Don Julio - Tequila Blanco - 750 mL Bottle NaN 177.00 430.50 NaN 6.0 14.0
2018-07-16 Booooze Alcohol Liquor 674545000001 Don Julio - Tequila Blanco - 750 mL Bottle NaN 161.15 345.07 NaN 5.0 11.0
2018-07-10 Booooze Alcohol Liquor 80432400630 Glenlivet - 12 Year Scotch Whisky - 750 mL Bottle NaN 172.00 278.80 NaN 5.0 8.0
2018-07-11 Booooze Alcohol Liquor 80432400630 Glenlivet - 12 Year Scotch Whisky - 750 mL Bottle NaN 201.72 318.80 NaN 6.0 8.0
2018-07-12 Booooze Alcohol Liquor 80432400630 Glenlivet - 12 Year Scotch Whisky - 750 mL Bottle NaN 250.25 455.13 NaN 7.0 13.0
2018-07-13 Booooze Alcohol Liquor 80432400630 Glenlivet - 12 Year Scotch Whisky - 750 mL Bottle NaN 432.74 185.75 NaN 11.0 5.0
2018-07-14 Booooze Alcohol Liquor 80432400630 Glenlivet - 12 Year Scotch Whisky - 750 mL Bottle NaN 227.82 268.52 NaN 6.0 7.0

Slicing, Filtering, and Querying

With the sorting out of the way we can begin searching our data by specific criteria. There is a fantastic guide about understanding SettingwithCopyWarnings in Pandas, so if you're unclear on the difference between a view and a copy, I suggest giving that a read first.

Readers who have used Pandas for slicing previously will know that .loc is generally the preferred method of referencing cells in a DataFrame. The same type of syntax exists for multiindex Dataframes:

In [27]:
# A slicing helper. Works similarly to slicing in Python (e.g. list slicing),
#                   but is inclusive of both the start and stop values.
idx = pd.IndexSlice

# View rows with a Category of Beer, but any Date, Department, Subcategory, UPC EAN, or Description
# Only looking at Dollars columns
print("View only rows in the 'Dollars' columns where the Category is 'Beer':")
display(sort3.loc[idx[:, :, 'Beer'], 'Dollars':'Dollars'].head(10))

# If we just want to look at the Store 1 sub-column
print("\nView only rows in the 'Dollars' AND 'Store 1' column where the Category is 'Beer':")
display(sort3.loc[idx[:, :, 'Beer'], idx['Dollars', 'Store 1':'Store 1']].head(10))
View only rows in the 'Dollars' columns where the Category is 'Beer':
Dollars
Store Store 1 Store 2 Store 3
Date Department Category Subcategory UPC EAN Description
2018-07-10 Booooze Beer Ales 94922755711 Bowser Beer - Beefy Brown Ale - 6 Pack NaN 81.76 62.48
2018-07-11 Booooze Beer Ales 94922755711 Bowser Beer - Beefy Brown Ale - 6 Pack NaN 104.91 72.72
2018-07-14 Booooze Beer Ales 94922755711 Bowser Beer - Beefy Brown Ale - 6 Pack NaN 124.52 113.68
2018-07-15 Booooze Beer Ales 94922755711 Bowser Beer - Beefy Brown Ale - 6 Pack NaN 88.77 119.47
2018-07-16 Booooze Beer Ales 94922755711 Bowser Beer - Beefy Brown Ale - 6 Pack NaN 99.40 46.65
2018-07-11 Booooze Beer Malts 71250000140 Colt 45 - Malt Liquor - 6 Pack NaN 101.79 107.58
2018-07-13 Booooze Beer Malts 71250000140 Colt 45 - Malt Liquor - 6 Pack NaN 103.59 142.44
2018-07-15 Booooze Beer Malts 71250000140 Colt 45 - Malt Liquor - 6 Pack NaN 139.23 40.50
2018-07-16 Booooze Beer Malts 71250000140 Colt 45 - Malt Liquor - 6 Pack NaN 47.55 108.48
2018-07-10 Booooze Beer Ales 736920111112 Goose Island - Honkers Ale - 6 Pack NaN 80.01 67.36
View only rows in the 'Dollars' AND 'Store 1' column where the Category is 'Beer':
Dollars
Store Store 1
Date Department Category Subcategory UPC EAN Description
2018-07-10 Booooze Beer Ales 94922755711 Bowser Beer - Beefy Brown Ale - 6 Pack NaN
2018-07-11 Booooze Beer Ales 94922755711 Bowser Beer - Beefy Brown Ale - 6 Pack NaN
2018-07-14 Booooze Beer Ales 94922755711 Bowser Beer - Beefy Brown Ale - 6 Pack NaN
2018-07-15 Booooze Beer Ales 94922755711 Bowser Beer - Beefy Brown Ale - 6 Pack NaN
2018-07-16 Booooze Beer Ales 94922755711 Bowser Beer - Beefy Brown Ale - 6 Pack NaN
2018-07-11 Booooze Beer Malts 71250000140 Colt 45 - Malt Liquor - 6 Pack NaN
2018-07-13 Booooze Beer Malts 71250000140 Colt 45 - Malt Liquor - 6 Pack NaN
2018-07-15 Booooze Beer Malts 71250000140 Colt 45 - Malt Liquor - 6 Pack NaN
2018-07-16 Booooze Beer Malts 71250000140 Colt 45 - Malt Liquor - 6 Pack NaN
2018-07-10 Booooze Beer Ales 736920111112 Goose Island - Honkers Ale - 6 Pack NaN

Unfortunately the .loc syntax doesn't scale well to having many index levels. In order to select a values from a single level, .loc requires specifying ':' for every level that comes before. For the column names to show up for a single column you also need to use : syntax just to specify a single column. This can be incredibly annoying.

Instead, it is more common (and practical) to use df.xs (cross sections). df.xs allows us to filter our DataFrame to only those rows that match levels of our index we specify. We can choose an individual level or multiple levels easily:

In [28]:
# We provide the level value, which by default searches the first index level ('Date')
display(multi_col_lvl_df.xs('2018-07-12').head(10))

# Here we're more specific because we want to search 'Category instead'
display(multi_col_lvl_df.xs('Wine', level='Category').head(10))

# To search for rows that match all of our level value requirements, use tuples
display(multi_col_lvl_df.xs(('2018-07-10', 'Booooze', 'Wine'), level=['Date', 'Department', 'Category']).head(10))

# Note that chaining .xs works, but is significantly less efficient
display(multi_col_lvl_df.xs('2018-07-10', level='Date')
                        .xs('Booooze', level='Department')
                        .xs('Wine', level='Category')
                        .head(10))

# We can then use the column value filtering we're used to to filter rows as well
cross_section_df = multi_col_lvl_df.xs(('2018-07-10', 'Booooze'), level=['Date', 'Department'])
display(cross_section_df[cross_section_df['Dollars']['Store 1'] > 0].head(10))
Dollars Units
Store Store 1 Store 2 Store 3 Store 1 Store 2 Store 3
Department Category Subcategory UPC EAN Description
Booooze Alcohol Liqour 80480280024 Grey Goose - Imported Vodka - 750 mL Bottle NaN 279.36 228.08 NaN 12.0 8.0
Liquor 80432400630 Glenlivet - 12 Year Scotch Whisky - 750 mL Bottle NaN 250.25 455.13 NaN 7.0 13.0
674545000001 Don Julio - Tequila Blanco - 750 mL Bottle NaN 410.88 268.56 NaN 12.0 9.0
Beer Ales 702770082018 Omission Pale Ale - Gluten Free Pale Ale - 6 Pack NaN 109.20 56.76 NaN 8.0 6.0
736920111112 Goose Island - Honkers Ale - 6 Pack NaN 37.75 78.30 NaN 5.0 10.0
Lagers 702770081011 Omission - Gluten Free Lager - 6 Pack 40.12 NaN 93.10 4.0 NaN 10.0
Stouts 83783575217 Sierra Nevada Brewing Co. - Stout - 6 Pack NaN 116.71 109.34 NaN 11.0 14.0
Wine Red 89744402585 Trapiche - Malbec - 750 mL Bottle 60.24 NaN 69.29 8.0 NaN 13.0
492130008399 Yellow Tail - Merlot - 750 mL Bottle 95.16 NaN 75.60 12.0 NaN 10.0
Rose 85000024218 Dark Horse - S. Blanc Rose - 750 mL Bottle 93.94 83.82 NaN 7.0 6.0 NaN
Dollars Units
Store Store 1 Store 2 Store 3 Store 1 Store 2 Store 3
Date Department Subcategory UPC EAN Description
2018-07-10 Booooze Red 89744402585 Trapiche - Malbec - 750 mL Bottle 56.88 NaN 37.80 8.0 NaN 6.0
492130008399 Yellow Tail - Merlot - 750 mL Bottle 46.30 NaN 34.56 10.0 NaN 6.0
Rose 85000024218 Dark Horse - S. Blanc Rose - 750 mL Bottle 148.61 100.38 NaN 11.0 7.0 NaN
White 81248501095 Curious Beasts - Chardonnay - 750 mL Bottle NaN 293.16 278.60 NaN 14.0 14.0
2018-07-11 Booooze Red 89744402585 Trapiche - Malbec - 750 mL Bottle 50.40 NaN 47.25 6.0 NaN 7.0
Rose 85000024218 Dark Horse - S. Blanc Rose - 750 mL Bottle 52.24 173.03 NaN 4.0 13.0 NaN
White 81248501095 Curious Beasts - Chardonnay - 750 mL Bottle NaN 237.16 247.52 NaN 11.0 13.0
2018-07-12 Booooze Red 89744402585 Trapiche - Malbec - 750 mL Bottle 60.24 NaN 69.29 8.0 NaN 13.0
492130008399 Yellow Tail - Merlot - 750 mL Bottle 95.16 NaN 75.60 12.0 NaN 10.0
Rose 85000024218 Dark Horse - S. Blanc Rose - 750 mL Bottle 93.94 83.82 NaN 7.0 6.0 NaN
Dollars Units
Store Store 1 Store 2 Store 3 Store 1 Store 2 Store 3
Subcategory UPC EAN Description
Red 89744402585 Trapiche - Malbec - 750 mL Bottle 56.88 NaN 37.80 8.0 NaN 6.0
492130008399 Yellow Tail - Merlot - 750 mL Bottle 46.30 NaN 34.56 10.0 NaN 6.0
Rose 85000024218 Dark Horse - S. Blanc Rose - 750 mL Bottle 148.61 100.38 NaN 11.0 7.0 NaN
White 81248501095 Curious Beasts - Chardonnay - 750 mL Bottle NaN 293.16 278.60 NaN 14.0 14.0
Dollars Units
Store Store 1 Store 2 Store 3 Store 1 Store 2 Store 3
Subcategory UPC EAN Description
Red 89744402585 Trapiche - Malbec - 750 mL Bottle 56.88 NaN 37.80 8.0 NaN 6.0
492130008399 Yellow Tail - Merlot - 750 mL Bottle 46.30 NaN 34.56 10.0 NaN 6.0
Rose 85000024218 Dark Horse - S. Blanc Rose - 750 mL Bottle 148.61 100.38 NaN 11.0 7.0 NaN
White 81248501095 Curious Beasts - Chardonnay - 750 mL Bottle NaN 293.16 278.60 NaN 14.0 14.0
Dollars Units
Store Store 1 Store 2 Store 3 Store 1 Store 2 Store 3
Category Subcategory UPC EAN Description
Beer Lagers 702770081011 Omission - Gluten Free Lager - 6 Pack 137.67 NaN 176.96 13.0 NaN 14.0
Wine Red 89744402585 Trapiche - Malbec - 750 mL Bottle 56.88 NaN 37.80 8.0 NaN 6.0
492130008399 Yellow Tail - Merlot - 750 mL Bottle 46.30 NaN 34.56 10.0 NaN 6.0
Rose 85000024218 Dark Horse - S. Blanc Rose - 750 mL Bottle 148.61 100.38 NaN 11.0 7.0 NaN

That's a whole lot nicer of syntax than the .loc method, but it's still relatively verbose and requires separating out or index and column value filtering. Multiindex DataFrames also offer a lesser known method, df.query. Query is powerful for a few reasons:

  1. It can search both index levels and columns in the same query,
  2. Syntactically it's nicer, since it lets use write our query as short expressions,
  3. It's very efficient (depending on the chosen engine)

The documentation has excellent notes about the method and a nice usage guide that are definitely worth a read but a couple of examples using our dataset:

In [29]:
multi_col_lvl_df.query("Date == '2018-07-10' and Category == 'Wine'")
Out[29]:
Dollars Units
Store Store 1 Store 2 Store 3 Store 1 Store 2 Store 3
Date Department Category Subcategory UPC EAN Description
2018-07-10 Booooze Wine Red 89744402585 Trapiche - Malbec - 750 mL Bottle 56.88 NaN 37.80 8.0 NaN 6.0
492130008399 Yellow Tail - Merlot - 750 mL Bottle 46.30 NaN 34.56 10.0 NaN 6.0
Rose 85000024218 Dark Horse - S. Blanc Rose - 750 mL Bottle 148.61 100.38 NaN 11.0 7.0 NaN
White 81248501095 Curious Beasts - Chardonnay - 750 mL Bottle NaN 293.16 278.60 NaN 14.0 14.0

So why doesn't everyone always use df.query? It has two big drawbacks:

  1. It doesn't allow index or column names with spaces in the name.
  2. It can't deal with multiple column levels well.

However, if you're used to working with SQL databases and/or willing to work within the above limitations, this is a powerful and fast option.

Updating Column Values

Now that we've learned a few ways to slice, filter, and query our data, what if we want to alter values in our dataframe? We can take advantage of the .loc method to replace values in place without creating a copy or, if temporarily copying our data isn't an issue (i.e. too big for memory) we can use the df.xs method with df.update:

In [30]:
updated_df = multi_col_lvl_df.copy()
updated_df.loc[idx[:, :, 'Beer'], idx['Dollars', 'Store 1':'Store 1']] = "We changed this"
updated_df.head(10)
Out[30]:
Dollars Units
Store Store 1 Store 2 Store 3 Store 1 Store 2 Store 3
Date Department Category Subcategory UPC EAN Description
2018-07-10 Booooze Alcohol Liqour 80480280024 Grey Goose - Imported Vodka - 750 mL Bottle NaN 231.30 128.30 NaN 9.0 5.0
Liquor 80432400630 Glenlivet - 12 Year Scotch Whisky - 750 mL Bottle NaN 172.00 278.80 NaN 5.0 8.0
674545000001 Don Julio - Tequila Blanco - 750 mL Bottle NaN 168.06 204.12 NaN 6.0 7.0
Beer Ales 94922755711 Bowser Beer - Beefy Brown Ale - 6 Pack We changed this 81.76 62.48 NaN 8.0 8.0
702770082018 Omission Pale Ale - Gluten Free Pale Ale - 6 Pack We changed this 89.76 135.72 NaN 6.0 13.0
736920111112 Goose Island - Honkers Ale - 6 Pack We changed this 80.01 67.36 NaN 9.0 8.0
Lagers 702770081011 Omission - Gluten Free Lager - 6 Pack We changed this NaN 176.96 13.0 NaN 14.0
Stouts 83783575217 Sierra Nevada Brewing Co. - Stout - 6 Pack We changed this 78.40 119.90 NaN 7.0 10.0
Wine Red 89744402585 Trapiche - Malbec - 750 mL Bottle 56.88 NaN 37.80 8.0 NaN 6.0
492130008399 Yellow Tail - Merlot - 750 mL Bottle 46.3 NaN 34.56 10.0 NaN 6.0

The above syntax works well for immutable values, but you'll get an error if you try the same thing for a mutable value, like a list. To my knowledge the only way to set values in a DataFrame to a mutable object is one cell at a time, so doing multiple replacements requires iterrows:

In [31]:
# # Note: If the column we're changing's type isn't already object,
# # we need to change it or the value relpacements below will error.
# updated_df['Dollars']['Store 1'] = updated_df['Dollars']['Store 1'].astype(object)

# # Loop through rows, replacing single values
# # Only necessary if the new assigned value is mutable
# # Code below currently not working when there are multiple column levels, but works with one column level
# for index, row in updated_df.loc[idx[:, :, 'Beer'], idx['Dollars', 'Store 1':'Store 1']].iterrows():
#     updated_df.at[index, idx['Dollars', 'Store 1':'Store 1']] = ["We", "changed", "this"]

# updated_df.head(3)

Alternatively, using df.xs and then df.update instead of the full .loc is a little bit more understandable in my option. For example, the following code does the same as the code two cells above (except that it requires a copy):

In [32]:
updated_df = multi_col_lvl_df.copy()

df2 = updated_df.xs('Beer', level='Category', drop_level=False).copy()  # .copy() is to avoid SettingwithCopyWarning
df2[idx['Dollars', 'Store 1']] = "We ALSO changed this"

updated_df.update(df2, join="left", overwrite=True, filter_func=None, raise_conflict=False)
updated_df.head(10)
Out[32]:
Dollars Units
Store Store 1 Store 2 Store 3 Store 1 Store 2 Store 3
Date Department Category Subcategory UPC EAN Description
2018-07-10 Booooze Alcohol Liqour 80480280024 Grey Goose - Imported Vodka - 750 mL Bottle NaN 231.30 128.30 NaN 9.0 5.0
Liquor 80432400630 Glenlivet - 12 Year Scotch Whisky - 750 mL Bottle NaN 172.00 278.80 NaN 5.0 8.0
674545000001 Don Julio - Tequila Blanco - 750 mL Bottle NaN 168.06 204.12 NaN 6.0 7.0
Beer Ales 94922755711 Bowser Beer - Beefy Brown Ale - 6 Pack We ALSO changed this 81.76 62.48 NaN 8.0 8.0
702770082018 Omission Pale Ale - Gluten Free Pale Ale - 6 Pack We ALSO changed this 89.76 135.72 NaN 6.0 13.0
736920111112 Goose Island - Honkers Ale - 6 Pack We ALSO changed this 80.01 67.36 NaN 9.0 8.0
Lagers 702770081011 Omission - Gluten Free Lager - 6 Pack We ALSO changed this NaN 176.96 13.0 NaN 14.0
Stouts 83783575217 Sierra Nevada Brewing Co. - Stout - 6 Pack We ALSO changed this 78.40 119.90 NaN 7.0 10.0
Wine Red 89744402585 Trapiche - Malbec - 750 mL Bottle 56.88 NaN 37.80 8.0 NaN 6.0
492130008399 Yellow Tail - Merlot - 750 mL Bottle 46.3 NaN 34.56 10.0 NaN 6.0

Display options

In [33]:
pd.set_option('display.multi_sparse', True)

That's all, folks!

Hopefully this guide has been a gentle and practical tutorial on multiindex DataFrames in Pandas. If you find any errors, material ommissions, or topics that could just be explained more clearly, please submit leave a comment or better yet, a pull request!

To continue your learning, find a list of my top picks for Pandas resources - generally and specifically for MultiIndexes. Of particular note is the Official list of available methods for MultiIndexes, which has a full listing of all the available methods (many not listed here). This link is specifically for Pandas version 0.22, because, for some reason, the list in future versions excludes a lot of still valuable methods! What's more, this list provides the ONLY explanation for some methods, since many MultiIndex methods lack even the most basic of docstrings.

Resources