Pandas MultiIndex Tutorial¶

What's Pandas?¶

Pandas is one of the most-used open source libraries for importing and analyzing data available in Python today. It provides convenient ways to import, view, split, apply, and combine array-like data. And not just convenient, but efficient, too. For example, Pandas' read_csv and to_csv functions are so efficient that the library is often imported just for this task instead of relying on the standard library alternative!

The core value of the library, however, comes through several data structure options, primarily Series (for labeled, homogenously-typed, one-dimensional arrays) and DataFrames (for labeled, potentially heterogeneously-typed, two-dimensional arrays).

What's a DataFrame?¶

DataFrames are two-dimensional, labeled data structures, with columns of potentially different types. You can think of it like a spreadsheet or SQL table, or a dict of Series objects. It is generally the most commonly used Pandas object. Like Series, DataFrame accepts many different kinds of input, including dicts, lists, lists of lists, Series, numpy arrays, other DataFrames, external data from CSVs, etc.

The DataFrame has two core parts: an index (row labels; look like columns, but aren't) and columns (data with headers).

What's an index?¶

The index of a DataFrame is a set (i.e. each element is only represented once) that consists of a label for each row. To be helpful, those labels should be not just unique, but also meaningful. By default, if no index is provided, the index will be a numbered range, starting from 0 (known as a range index).

A more meaningful index, however, would be something that uniquely describes each row of your data in a way that will help you look things up. For example, in a list of transactions, the date-time might be most useful. Alternatively, in a grade book for a math class the name of the student might be most useful.

The most confusing thing about understanding about Pandas' indexes at first (in my opinion) is how to interact with them. While they look exactly like columns, they're not referenced in the same way (learn more here). Understanding that, I personally find it most useful to think of an index in Pandas like a column that's in time-out and that just can't play with all the other columns.

Note: Another common way in which Pandas' indexes are misunderstood at first is by thinking of them in SQL-like terms. While that can be helpful if that's what you're familiar with, in practice (and for performance) Pandas' and SQL's indexes are quite different (see this SO answer and Pandas Under The Hood by Jeff Tratner).

What is a MultiIndex DataFrame?¶

Pandas' multiindex DataFrames extend the DataFrames described above by enabling effective storage and manipulation of arbitrarily high dimension data in a 2-dimensional tabular structure. ((If that sentence doesn't make sense yet, don't worry - it should by the end of this tutorial.))

While the displayed version of a multiindexed DataFrame doesn't appear to be much more than a prettily-organized regular DataFrame, it's actually a pretty powerful structure if the data warrants its use.

When should you use one?¶

When a single column’s value isn’t enough to uniquely identify a row (e.g. multiple records on the same date means date alone isn’t a good index).
When data is logically hierarchical - meaning that it has multiple dimensions or “levels.”

Besides structure, multiindexes offer relatively easy in-memory retreival of complex data.

Realistic Demo Data¶

For this tutorial we will work with a realistic case of when multiindex DataFrames can come in handy - a grocer's retail transactions.

In [1]:

# Normally put all your import up top, but this cell was ugly in the midst of the intro paragraphs
from typing import Any

import numpy as np
import pandas as pd

display(f"Pandas version: {pd.__version__}")

'Pandas version: 0.22.0'

Note that this notebook is using the Pandas version above. There have been many changes to MultiIndex methods since 0.19, including major bug fixes. It is STRONGLY recommended to use the latest version of Pandas, but at least version 0.21 is required for all of the techniques in this notebook to work as presented.

In [2]:

# # Creates random mock data.
# # Based on the UPCs found in upc_meta_data.csv and saves to data.csv
# %run mocker.py

In [3]:

df = pd.read_csv('data.csv', parse_dates=['Date'])
df.sample(10)

Out[3]:

	Date	Store	Category	Subcategory	UPC EAN	Description	Dollars	Units
20	2018-07-14	Store 3	Beer	Lagers	702770081011	Omission - Gluten Free Lager - 6 Pack	128.64	12
101	2018-07-10	Store 3	Alcohol	Liqour	80480280024	Grey Goose - Imported Vodka - 750 mL Bottle	128.30	5
68	2018-07-14	Store 1	Wine	Red	492130008399	Yellow Tail - Merlot - 750 mL Bottle	32.10	6
12	2018-07-16	Store 2	Beer	Ales	94922755711	Bowser Beer - Beefy Brown Ale - 6 Pack	99.40	10
58	2018-07-16	Store 1	Wine	Red	89744402585	Trapiche - Malbec - 750 mL Bottle	126.14	14
1	2018-07-11	Store 2	Beer	Ales	736920111112	Goose Island - Honkers Ale - 6 Pack	166.74	14
100	2018-07-10	Store 2	Alcohol	Liqour	80480280024	Grey Goose - Imported Vodka - 750 mL Bottle	231.30	9
59	2018-07-16	Store 3	Wine	Red	89744402585	Trapiche - Malbec - 750 mL Bottle	66.16	8
116	2018-07-11	Store 3	Alcohol	Liquor	674545000001	Don Julio - Tequila Blanco - 750 mL Bottle	102.04	4
77	2018-07-15	Store 3	Wine	Red	492130008399	Yellow Tail - Merlot - 750 mL Bottle	58.20	12

Each row of data represents a sale of an item, but how do we set our index meaningfully? A numeric ID for each transaction would be fine, but it wouldn't tell us terribly much. Even the date of a transaction isn't useful by itself, since it's common for such a company to have many transactions on a given date (even at the same time).

Instead, we will have to look at a combination of the available metadata to create a unique and meaningful index. At first glance, a combination of the date, the store, and the hierarchy of each item (Category > Subcategory > UPC) look like they might be enough to get us a unique index.

Setting and Manipulating MultiIndexes¶

So let's take a look at how we can create our multiindex from our regular ol' DataFrame. We'll walk through the basics of setting, reordering, and resetting indexes, along with some useful tips/tricks. Then we can begin investigating our transaction data to learn about our sales and trends.

In [4]:

# Set just like the index for a DataFrame...
# ...except we give a list of column names instead of a single string column name
df.set_index(['Date', 'Store', 'Category', 'Subcategory', 'Description'], inplace=True)
df.head(3)

Out[4]:

					UPC EAN	Dollars	Units
Date	Store	Category	Subcategory	Description
2018-07-11	Store 3	Beer	Ales	Goose Island - Honkers Ale - 6 Pack	736920111112	35.68	4
2018-07-11	Store 2	Beer	Ales	Goose Island - Honkers Ale - 6 Pack	736920111112	166.74	14
2018-07-10	Store 3	Beer	Ales	Goose Island - Honkers Ale - 6 Pack	736920111112	67.36	8

Uh oh - it looks like we forgot to add the 'UPC EAN' column to our index, but don't worry - Pandas has us covered with extra set_index parameters for MultiIndexes:

In [5]:

# We can append a column to our existing index
df.set_index('UPC EAN', append=True, inplace=True)
df.head(3)

Out[5]:

						Dollars	Units
Date	Store	Category	Subcategory	Description	UPC EAN
2018-07-11	Store 3	Beer	Ales	Goose Island - Honkers Ale - 6 Pack	736920111112	35.68	4
2018-07-11	Store 2	Beer	Ales	Goose Island - Honkers Ale - 6 Pack	736920111112	166.74	14
2018-07-10	Store 3	Beer	Ales	Goose Island - Honkers Ale - 6 Pack	736920111112	67.36	8

That's almost right, but we'd actually like 'Description' to show up after 'UPC EAN'. We have a couple of options to get things in the right order:

In [6]:

# Option 1 is the generalized solution to reorder the index levels
# Note: We're not making an inplace change in this cell,
#       but it's worth noting that this method doesn't have an inplace parameter.
df.reorder_levels(order=['Date', 'Store', 'Category', 'Subcategory', 'UPC EAN', 'Description']).head(3)

Out[6]:

						Dollars	Units
Date	Store	Category	Subcategory	UPC EAN	Description
2018-07-11	Store 3	Beer	Ales	736920111112	Goose Island - Honkers Ale - 6 Pack	35.68	4
2018-07-11	Store 2	Beer	Ales	736920111112	Goose Island - Honkers Ale - 6 Pack	166.74	14
2018-07-10	Store 3	Beer	Ales	736920111112	Goose Island - Honkers Ale - 6 Pack	67.36	8

reorder_levels() is useful, but it was a pain to have to type all five levels just two switch two. In cases like this we have a second, less verbose option:

In [7]:

# Option 2 just switches two index levels (a more common need than you'd think)
# Note: This time we're doing an inplace change, but there's no parameter for this method either.
df = df.swaplevel('Description', 'UPC EAN')
df.head(3)

Out[7]:

						Dollars	Units
Date	Store	Category	Subcategory	UPC EAN	Description
2018-07-11	Store 3	Beer	Ales	736920111112	Goose Island - Honkers Ale - 6 Pack	35.68	4
2018-07-11	Store 2	Beer	Ales	736920111112	Goose Island - Honkers Ale - 6 Pack	166.74	14
2018-07-10	Store 3	Beer	Ales	736920111112	Goose Island - Honkers Ale - 6 Pack	67.36	8

Just when we thought we were done, it turns our we forgot to add the highest level of the product hierarchy - the Department - not just to our index, but to our DataFrame altogether. Luckily all of our records belong in the same Department, so here's a neat trick to add a new column with all the same values as a level in an existing index:

In [8]:

# A handy function to keep around for projects
def add_constant_index_level(df: pd.DataFrame, value: Any, level_name: str):
    """Add a new level to an existing index where every row has the same, given value.
    
    Args:
        df: Any existing pd.DataFrame.
        value: Value to be placed in every row of the new index level.
        level_name: Title of the new index level.
    
    Returns:
        df with an additional, prepended index level.
    """
    return pd.concat([df], keys=[value], names=[level_name])

df = add_constant_index_level(df, "Booooze", "Department")
df = df.reorder_levels(order=['Date', 'Store', 'Department', 'Category', 'Subcategory', 'UPC EAN', 'Description'])
df.head(3)

# # If we wanted to later drop that level
# df.index = df.index.droplevel(level='Department')
# df.head(3)

Out[8]:

							Dollars	Units
Date	Store	Department	Category	Subcategory	UPC EAN	Description
2018-07-11	Store 3	Booooze	Beer	Ales	736920111112	Goose Island - Honkers Ale - 6 Pack	35.68	4
2018-07-11	Store 2	Booooze	Beer	Ales	736920111112	Goose Island - Honkers Ale - 6 Pack	166.74	14
2018-07-10	Store 3	Booooze	Beer	Ales	736920111112	Goose Island - Honkers Ale - 6 Pack	67.36	8

Now that our index is set the way we want it, what if we want to interact with those index levels? Here are a few helpful code snippets:

In [9]:

# checking out their unique values, for a single level 
df.index.get_level_values('Subcategory').unique()

# checking out their unique values, for combinations of multiple levels
# See answer at https://stackoverflow.com/questions/39080555/pandas-get-level-values-for-multiple-columns

Out[9]:

Index(['Ales', 'Lagers', 'Stouts', 'Malts', 'Red', 'White', 'Rose', 'Liqour',
       'Liquor'],
      dtype='object', name='Subcategory')

Note the typo in "Liquor" above. Good thing we checked out unique values! Maybe someone can submit a pull request to fix this for me :)

In [10]:

# Replace level values using rename
# Note that this can be done using set_levels as well, but it's a pain
df.rename(index={'Goose Island - Honkers Ale - 6 Pack': 'We changed this'}).head(15)

# Replace np.nan level values
# df.rename(index={np.nan: "''"}, inplace=True)

Out[10]:

							Dollars	Units
Date	Store	Department	Category	Subcategory	UPC EAN	Description
2018-07-11	Store 3	Booooze	Beer	Ales	736920111112	We changed this	35.68	4
2018-07-11	Store 2	Booooze	Beer	Ales	736920111112	We changed this	166.74	14
2018-07-10	Store 3	Booooze	Beer	Ales	736920111112	We changed this	67.36	8
2018-07-10	Store 2	Booooze	Beer	Ales	736920111112	We changed this	80.01	9
2018-07-12	Store 3	Booooze	Beer	Ales	736920111112	We changed this	78.30	10
2018-07-12	Store 2	Booooze	Beer	Ales	736920111112	We changed this	37.75	5
2018-07-14	Store 3	Booooze	Beer	Ales	736920111112	We changed this	113.49	13
2018-07-14	Store 2	Booooze	Beer	Ales	736920111112	We changed this	45.05	5
2018-07-10	Store 2	Booooze	Beer	Ales	94922755711	Bowser Beer - Beefy Brown Ale - 6 Pack	81.76	8
2018-07-10	Store 3	Booooze	Beer	Ales	94922755711	Bowser Beer - Beefy Brown Ale - 6 Pack	62.48	8
2018-07-11	Store 2	Booooze	Beer	Ales	94922755711	Bowser Beer - Beefy Brown Ale - 6 Pack	104.91	13
2018-07-11	Store 3	Booooze	Beer	Ales	94922755711	Bowser Beer - Beefy Brown Ale - 6 Pack	72.72	9
2018-07-16	Store 2	Booooze	Beer	Ales	94922755711	Bowser Beer - Beefy Brown Ale - 6 Pack	99.40	10
2018-07-16	Store 3	Booooze	Beer	Ales	94922755711	Bowser Beer - Beefy Brown Ale - 6 Pack	46.65	5
2018-07-14	Store 2	Booooze	Beer	Ales	94922755711	Bowser Beer - Beefy Brown Ale - 6 Pack	124.52	11

In [11]:

# Rename index levels
temp_df = df.copy()
temp_df.index = df.index.set_names('Desc.', level='Description')
temp_df.head(3)

Out[11]:

							Dollars	Units
Date	Store	Department	Category	Subcategory	UPC EAN	Desc.
2018-07-11	Store 3	Booooze	Beer	Ales	736920111112	Goose Island - Honkers Ale - 6 Pack	35.68	4
2018-07-11	Store 2	Booooze	Beer	Ales	736920111112	Goose Island - Honkers Ale - 6 Pack	166.74	14
2018-07-10	Store 3	Booooze	Beer	Ales	736920111112	Goose Island - Honkers Ale - 6 Pack	67.36	8

Understanding the MultiIndex Object¶

Why is this section all the way down here? Because the MultiIndex object is scary looking if you're new to using them. Many guides to hierarchical data analysis using multiindex DataFrames start with DataFrame creation and manipulation using MultiIndex objects, which I think both hinders adoption and is not reflective of how a lot of DataFrames get created in practice. As a result, my explanation of MultiIndex objects is very basic, because there are lots of other great resources out there if you want to learn more. Here are my top two:

In [12]:

df.index

Out[12]:

MultiIndex(levels=[[2018-07-10 00:00:00, 2018-07-11 00:00:00, 2018-07-12 00:00:00, 2018-07-13 00:00:00, 2018-07-14 00:00:00, 2018-07-15 00:00:00, 2018-07-16 00:00:00], ['Store 1', 'Store 2', 'Store 3'], ['Booooze'], ['Alcohol', 'Beer', 'Wine'], ['Ales', 'Lagers', 'Liqour', 'Liquor', 'Malts', 'Red', 'Rose', 'Stouts', 'White'], [71250000140, 80432400630, 80480280024, 81248501095, 83783575217, 85000024218, 89744402585, 94922755711, 492130008399, 674545000001, 702770081011, 702770082018, 736920111112], ['Bowser Beer - Beefy Brown Ale - 6 Pack', 'Colt 45 - Malt Liquor - 6 Pack', 'Curious Beasts - Chardonnay - 750 mL Bottle', 'Dark Horse - S. Blanc Rose - 750 mL Bottle', 'Don Julio - Tequila Blanco - 750 mL Bottle', 'Glenlivet - 12 Year Scotch Whisky - 750 mL Bottle', 'Goose Island - Honkers Ale - 6 Pack', 'Grey Goose - Imported Vodka - 750 mL Bottle', 'Omission - Gluten Free Lager - 6 Pack', 'Omission Pale Ale - Gluten Free Pale Ale - 6 Pack', 'Sierra Nevada Brewing Co. - Stout - 6 Pack', 'Trapiche - Malbec - 750 mL Bottle', 'Yellow Tail - Merlot - 750 mL Bottle']],
           labels=[[1, 1, 0, 0, 2, 2, 4, 4, 0, 0, 1, 1, 6, 6, 4, 4, 5, 5, 2, 2, 4, 4, 0, 0, 6, 6, 2, 2, 4, 4, 3, 3, 0, 0, 1, 1, 5, 5, 0, 0, 1, 1, 6, 6, 5, 5, 2, 2, 6, 6, 1, 1, 5, 5, 3, 3, 2, 2, 6, 6, 5, 5, 0, 0, 1, 1, 4, 4, 4, 4, 3, 3, 0, 0, 6, 6, 5, 5, 2, 2, 3, 3, 2, 2, 0, 0, 1, 1, 4, 4, 6, 6, 0, 0, 3, 3, 2, 2, 1, 1, 0, 0, 5, 5, 2, 2, 1, 1, 6, 6, 2, 2, 6, 6, 0, 0, 1, 1, 5, 5, 2, 2, 3, 3, 1, 1, 4, 4, 5, 5, 0, 0], [2, 1, 2, 1, 2, 1, 2, 1, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 2, 0, 2, 0, 2, 0, 2, 0, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 0, 2, 0, 2, 0, 2, 0, 2, 0, 2, 0, 2, 0, 2, 0, 2, 0, 2, 0, 2, 0, 2, 0, 2, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 4, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 6, 6, 6, 6, 6, 6, 6, 6, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3], [12, 12, 12, 12, 12, 12, 12, 12, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 10, 10, 10, 10, 10, 10, 10, 10, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 0, 0, 0, 0, 0, 0, 0, 0, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 5, 5, 5, 5, 5, 5, 5, 5, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [6, 6, 6, 6, 6, 6, 6, 6, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 8, 8, 8, 8, 8, 8, 8, 8, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 1, 1, 1, 1, 1, 1, 1, 1, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5]],
           names=['Date', 'Store', 'Department', 'Category', 'Subcategory', 'UPC EAN', 'Description'])

Well that's gross looking...but don't be scared - it's actually not that hard to understand.

'levels' is a list of lists, where each sublist represents all possible values in that index level. In other words, the 'levels' parameter reflects all possible unique values by level. For example, our first index level ('Date') has the possible values ['2018-07-10', '2018-07-11', '2018-07-12', ...].

Important Note: When talking about a multiindex DataFrame (not the parameter for the MultiIndex object), we talk about the "levels" as the index "columns." For example, the 'levels' of our df in a more general sense are 'Date', 'Store', 'Department', etc. Levels in this sense (and elsewhere in code) can also be referenced by number (e.g. 'Date' = 0 [read as 'level 0'], 'Store' = 1, 'Department' = 2, etc.).

'labels' is also a list of lists, but here each sublist reflects all of the values that appear in the row of that index. In other words, each sublist in our labels is of the same length as the entire dataframe, and the value of each row is one of the possible values defined in our associated level (above). Looking again at our first index level ('Date'), we see values like [1, 1, 0, 0, 2, 2, 4, 4 ...]. There are just an enumerated representation of the options defined in our level, so 0 = '2018-07-10', 1 = '2018-07-11', 2 = '2018-07-12', 3 = '2018-07-13', etc.

'names' is a list of the actual titles of each index level, in order of appearance from left to right.

With that fresh understanding of the 'anatomy' of a MultiIndex, we can look at...

Other Methods of Multiindex DataFrame Creation¶

For the most part, the two references listed in the section above cover this topic well; however, a common use case that isn't covered in those guides is creating a multiindex DataFrame while reading from a csv:

In [13]:

# We can set a MultiIndex while reading a csv by referencing columns to be used in the index by number
pd.read_csv("data.csv", index_col=[0, 1, 2, 3, 4, 5], skipinitialspace=True, parse_dates=['Date']).head(3)

Out[13]:

						Dollars	Units
Date	Store	Category	Subcategory	UPC EAN	Description
2018-07-11	Store 3	Beer	Ales	736920111112	Goose Island - Honkers Ale - 6 Pack	35.68	4
2018-07-11	Store 2	Beer	Ales	736920111112	Goose Island - Honkers Ale - 6 Pack	166.74	14
2018-07-10	Store 3	Beer	Ales	736920111112	Goose Island - Honkers Ale - 6 Pack	67.36	8

We'll review more advanced importing/exporting methods below.

MultiIndex Columns (Multiple Column Levels)¶

For a different view we can also create hierarchical column levels. We'll do this by introducting a new method: unstack(). This function "pivots" an index level to a new level of column labels whose inner-most level consists of the pivoted index labels. Stack/unstack is one of the biggest reasons to use a MultiIndex, so it's work supplementing the examples here by checking out the official docs on reshaping.

With this new technique we can start actually investigating our data. For example, let's say we want to more easily compare sales of a product by store by day, we can unstack our 'Store' index level:

In [14]:

multi_col_lvl_df = df.unstack('Store')
multi_col_lvl_df.sample(10)

Out[14]:

						Dollars			Units
					Store	Store 1	Store 2	Store 3	Store 1	Store 2	Store 3
Date	Department	Category	Subcategory	UPC EAN	Description
2018-07-16	Booooze	Alcohol	Liqour	80480280024	Grey Goose - Imported Vodka - 750 mL Bottle	NaN	221.04	126.66	NaN	9.0	6.0
2018-07-16	Booooze	Alcohol	Liquor	674545000001	Don Julio - Tequila Blanco - 750 mL Bottle	NaN	161.15	345.07	NaN	5.0	11.0
2018-07-10	Booooze	Alcohol	Liquor	80432400630	Glenlivet - 12 Year Scotch Whisky - 750 mL Bottle	NaN	172.00	278.80	NaN	5.0	8.0
2018-07-10	Booooze	Alcohol	Liquor	674545000001	Don Julio - Tequila Blanco - 750 mL Bottle	NaN	168.06	204.12	NaN	6.0	7.0
2018-07-11	Booooze	Alcohol	Liquor	80432400630	Glenlivet - 12 Year Scotch Whisky - 750 mL Bottle	NaN	201.72	318.80	NaN	6.0	8.0
2018-07-16	Booooze	Beer	Lagers	702770081011	Omission - Gluten Free Lager - 6 Pack	174.72	NaN	104.86	14.0	NaN	7.0
2018-07-12	Booooze	Wine	Red	492130008399	Yellow Tail - Merlot - 750 mL Bottle	95.16	NaN	75.60	12.0	NaN	10.0
2018-07-10	Booooze	Beer	Ales	702770082018	Omission Pale Ale - Gluten Free Pale Ale - 6 Pack	NaN	89.76	135.72	NaN	6.0	13.0
2018-07-15	Booooze	Alcohol	Liquor	80432400630	Glenlivet - 12 Year Scotch Whisky - 750 mL Bottle	NaN	415.68	476.64	NaN	12.0	12.0
2018-07-13	Booooze	Beer	Stouts	83783575217	Sierra Nevada Brewing Co. - Stout - 6 Pack	NaN	82.67	75.74	NaN	7.0	7.0

The new view makes our comparison easier, but now it's a bit cluttered. Internally our multi-level columns are stored as tuples of the name values for each level, so we can easily fix the clutter by flattening the columns into a single level:

In [15]:

def flatten_cols(df: pd.DataFrame, delim: str = ""):
    """Flatten multiple column levels of the DataFrame into a one column level.

    Args:
        delim: the delimiter between the column values.

    Returns:
        A copy of the dataframe with the new column names.

    """
    new_cols = [delim.join((col_lev for col_lev in tup if col_lev))
                for tup in df.columns.values]
    ndf = df.copy()
    ndf.columns = new_cols

    return ndf

flattened_multi_col_df = flatten_cols(multi_col_lvl_df, " | ").head(3)
flattened_multi_col_df

Out[15]:

						Dollars \| Store 1	Dollars \| Store 2	Dollars \| Store 3	Units \| Store 1	Units \| Store 2	Units \| Store 3
Date	Department	Category	Subcategory	UPC EAN	Description
2018-07-10	Booooze	Alcohol	Liqour	80480280024	Grey Goose - Imported Vodka - 750 mL Bottle	NaN	231.30	128.30	NaN	9.0	5.0
			Liquor	80432400630	Glenlivet - 12 Year Scotch Whisky - 750 mL Bottle	NaN	172.00	278.80	NaN	5.0	8.0
			Liquor	674545000001	Don Julio - Tequila Blanco - 750 mL Bottle	NaN	168.06	204.12	NaN	6.0	7.0

If later we want to undo that flattening it's just as simple:

In [16]:

def unflatten_cols(df: pd.DataFrame, delim: str = ""):
    """Unflatten a single column level into multiple column levels.

    Args:
        delim: the delimiter to split on to identify the multiple column values.

    Returns:
        A copy of the dataframe with the new column levels.

    """
    new_cols = pd.MultiIndex.from_tuples([tuple(col.split(delim)) for col in df.columns])
    ndf = df.copy()
    ndf.columns = new_cols

    return ndf

unflatten_cols(flattened_multi_col_df, " | ")

Out[16]:

						Dollars			Units
						Store 1	Store 2	Store 3	Store 1	Store 2	Store 3
Date	Department	Category	Subcategory	UPC EAN	Description
2018-07-10	Booooze	Alcohol	Liqour	80480280024	Grey Goose - Imported Vodka - 750 mL Bottle	NaN	231.30	128.30	NaN	9.0	5.0
			Liquor	80432400630	Glenlivet - 12 Year Scotch Whisky - 750 mL Bottle	NaN	172.00	278.80	NaN	5.0	8.0
			Liquor	674545000001	Don Julio - Tequila Blanco - 750 mL Bottle	NaN	168.06	204.12	NaN	6.0	7.0

Importing/Exporting MultiIndex DataFrames¶

We've already seen an example of reading a csv, but what if we want to save our multiindex DataFrame and then be able to reread it? How the stored files need to be accessed and by whom they need to be accessed will determine a lot. If everyone who needs access to the data is Python/Pandas savy, pickling is fast and easy. If, however, other, not-tech-savy people will need to access the data, CSVs are a versitile storage medium. That said, it complicates our read_csv() a bit if we want to just dump out our multiindex. Revisiting our multi_col_lvl_df, where we have a multiindex DataFrame that has both multiple index levels and column levels, creates a difficult situation that requires getting to know as many as all 48 (yes, 48!) of read_csv's parameters.

In [17]:

# Write our multi-column-level df
multi_col_lvl_df.to_csv('multi_col_lvl_output.csv')

# Reading it back in requires the header parameter
read_multi_df = pd.read_csv('multi_col_lvl_output.csv', header=[0, 1], index_col=[0, 1, 2, 3, 4, 5],
                            skipinitialspace=True, parse_dates=[0]).head(3)

read_multi_df

Out[17]:

						Dollars			Units
					Store	Store 1	Store 2	Store 3	Store 1	Store 2	Store 3
Date	Department	Category	Subcategory	UPC EAN	Description
2018-07-10	Booooze	Alcohol	Liqour	80480280024	Grey Goose - Imported Vodka - 750 mL Bottle	NaN	231.30	128.30	NaN	9.0	5.0
			Liquor	80432400630	Glenlivet - 12 Year Scotch Whisky - 750 mL Bottle	NaN	172.00	278.80	NaN	5.0	8.0
			Liquor	674545000001	Don Julio - Tequila Blanco - 750 mL Bottle	NaN	168.06	204.12	NaN	6.0	7.0

By adding a header parameter to deal with our multiple column levels, on top of index_col for the setting of the index, this import looks good, but it required inspecting the csv for the column and row numbers (that's right - names won't work. Just to be sure everything worked, let's also check our dtypes:

In [18]:

# A function to check our index level dtypes to aid this example
def index_level_dtypes(df):
    return [f"{df.index.names[i]}: {df.index.get_level_values(n).dtype}"
            for i, n in enumerate(df.index.names)]

index_level_dtypes(read_multi_df)

Out[18]:

['Date: datetime64[ns]',
 'Department: object',
 'Category: object',
 'Subcategory: object',
 'UPC EAN: int64',
 'Description: object']

Looks good. However let's take a step back and look at what would have happened if we'd not parsed the dates and wanted to later change the dtype:

In [19]:

# Reading it back iwithout parse_dates
bad_dtype_df = pd.read_csv('multi_col_lvl_output.csv', header=[0, 1], index_col=[0, 1, 2, 3, 4, 5],
                            skipinitialspace=True).head(3)

display(bad_dtype_df)
display(index_level_dtypes(bad_dtype_df))

						Dollars			Units
					Store	Store 1	Store 2	Store 3	Store 1	Store 2	Store 3
Date	Department	Category	Subcategory	UPC EAN	Description
2018-07-10	Booooze	Alcohol	Liqour	80480280024	Grey Goose - Imported Vodka - 750 mL Bottle	NaN	231.30	128.30	NaN	9.0	5.0
			Liquor	80432400630	Glenlivet - 12 Year Scotch Whisky - 750 mL Bottle	NaN	172.00	278.80	NaN	5.0	8.0
			Liquor	674545000001	Don Julio - Tequila Blanco - 750 mL Bottle	NaN	168.06	204.12	NaN	6.0	7.0

['Date: object',
 'Department: object',
 'Category: object',
 'Subcategory: object',
 'UPC EAN: int64',
 'Description: object']

Updating the dtypes of our index columns isn't going to be so simple, because our MultiIndex levels are immutable. To make any changes to the levels, we actually have to recreate the levels:

In [20]:

bad_dtype_df.index.set_levels([pd.to_datetime(bad_dtype_df.index.levels[0]), bad_dtype_df.index.levels[1],
                               bad_dtype_df.index.levels[2], bad_dtype_df.index.levels[3],
                               bad_dtype_df.index.levels[4], bad_dtype_df.index.levels[5]],
                               inplace=True)
index_level_dtypes(bad_dtype_df)

Out[20]:

['Date: datetime64[ns]',
 'Department: object',
 'Category: object',
 'Subcategory: object',
 'UPC EAN: int64',
 'Description: object']

Alternatively we could reset just the 'Date' level, update its dtype, add it back to our index, and finally reorder our index:

In [21]:

bad_dtype_df2 = bad_dtype_df.reset_index(level='Date')
bad_dtype_df2['Date'] = pd.to_datetime(bad_dtype_df2['Date'],infer_datetime_format=True)
bad_dtype_df2.set_index('Date', append=True, inplace=True)
bad_dtype_df2 = (bad_dtype_df2.swaplevel('Date', 'Description')
                              .swaplevel('Date', 'UPC EAN')
                              .swaplevel('Date', 'Subcategory')
                              .swaplevel('Date', 'Category')
                              .swaplevel('Date', 'Department'))
index_level_dtypes(bad_dtype_df2)

Out[21]:

['Date: datetime64[ns]',
 'Department: object',
 'Category: object',
 'Subcategory: object',
 'UPC EAN: int64',
 'Description: object']

The moral of the story is, if you want to export and import complex multiindex DataFrames, learn read_csv's parameters well! The second moral is that, either of the above situations results in an awful lot of work.

One alternative if the readability of the CSV by other users is important, is to simply reset the column levels and index levels before we write to a CSV and recreate them when we import. While it requires more steps and information is lost while in CSV form, it makes the code manipulations a lot easier:

In [22]:

# a) restack the column levels,
# b) drop any blanks the unstacking created, and then
# c) reset the index so everything is a flat column again
# d) output to csv
multi_col_lvl_df.stack().dropna().reset_index().to_csv('index_removed_output.csv')

# Reading it back in will require
read_df = pd.read_csv("data.csv", index_col=[0, 1, 2, 3, 4, 5], skipinitialspace=True, parse_dates=['Date'])
read_df.unstack('Store').head(3)

Out[22]:

					Dollars			Units
				Store	Store 1	Store 2	Store 3	Store 1	Store 2	Store 3
Date	Category	Subcategory	UPC EAN	Description
2018-07-10	Alcohol	Liqour	80480280024	Grey Goose - Imported Vodka - 750 mL Bottle	NaN	231.30	128.30	NaN	9.0	5.0
		Liquor	80432400630	Glenlivet - 12 Year Scotch Whisky - 750 mL Bottle	NaN	172.00	278.80	NaN	5.0	8.0
		Liquor	674545000001	Don Julio - Tequila Blanco - 750 mL Bottle	NaN	168.06	204.12	NaN	6.0	7.0

Still takes some work, but the code's a lot more straightforward.

To this point we've really just been learning the basics of setting up our multiindex DataFrames, with a few neat tricks along the way. Now comes the fun part - actually interacting with those DataFrames to analyze our data.

Multiindex Math¶

Math operations are as nearly as easy with multiindex DataFrames as with regular ones, but a whole lot more powerful. Let's say we want to calculate dollars per unit for each of our stores. That's as simple as:

In [23]:

dollars_per_unit = multi_col_lvl_df['Dollars'] / multi_col_lvl_df['Units']
dollars_per_unit.sample(10)

Out[23]:

					Store	Store 1	Store 2	Store 3
Date	Department	Category	Subcategory	UPC EAN	Description
2018-07-11	Booooze	Alcohol	Liquor	80432400630	Glenlivet - 12 Year Scotch Whisky - 750 mL Bottle	NaN	33.62	39.85
2018-07-16	Booooze	Alcohol	Liquor	674545000001	Don Julio - Tequila Blanco - 750 mL Bottle	NaN	32.23	31.37
2018-07-10	Booooze	Alcohol	Liquor	674545000001	Don Julio - Tequila Blanco - 750 mL Bottle	NaN	28.01	29.16
		Wine	White	81248501095	Curious Beasts - Chardonnay - 750 mL Bottle	NaN	20.94	19.90
		Wine	Red	89744402585	Trapiche - Malbec - 750 mL Bottle	7.11	NaN	6.30
		Beer	Ales	702770082018	Omission Pale Ale - Gluten Free Pale Ale - 6 Pack	NaN	14.96	10.44
		Alcohol	Liqour	80480280024	Grey Goose - Imported Vodka - 750 mL Bottle	NaN	25.70	25.66
2018-07-16	Booooze	Wine	Red	492130008399	Yellow Tail - Merlot - 750 mL Bottle	4.89	NaN	7.76
2018-07-12	Booooze	Beer	Stouts	83783575217	Sierra Nevada Brewing Co. - Stout - 6 Pack	NaN	10.61	7.81
2018-07-16	Booooze	Wine	Red	89744402585	Trapiche - Malbec - 750 mL Bottle	9.01	NaN	8.27

We just operated on all the pairwise sub column levels at the same time! Now we'd like this answer as columns back in our original DataFrame:

In [24]:

# Add a column level for our new measure
dollars_per_unit.columns = pd.MultiIndex.from_product([['Dollars per Unit'], dollars_per_unit.columns])

# Concat it with our original data
pd.concat([multi_col_lvl_df, dollars_per_unit], axis='columns').head(3)

Out[24]:

						Dollars			Units			Dollars per Unit
					Store	Store 1	Store 2	Store 3	Store 1	Store 2	Store 3	Store 1	Store 2	Store 3
Date	Department	Category	Subcategory	UPC EAN	Description
2018-07-10	Booooze	Alcohol	Liqour	80480280024	Grey Goose - Imported Vodka - 750 mL Bottle	NaN	231.30	128.30	NaN	9.0	5.0	NaN	25.70	25.66
			Liquor	80432400630	Glenlivet - 12 Year Scotch Whisky - 750 mL Bottle	NaN	172.00	278.80	NaN	5.0	8.0	NaN	34.40	34.85
			Liquor	674545000001	Don Julio - Tequila Blanco - 750 mL Bottle	NaN	168.06	204.12	NaN	6.0	7.0	NaN	28.01	29.16

We can similarly apply functions to our multiindex DataFrame:

In [25]:

# Change our units to 000s for funsies
multi_col_lvl_df.applymap(lambda x: np.nan if np.isnan(x) else str(round(x/1000, 2)) + "k").head(10)

Out[25]:

						Dollars			Units
					Store	Store 1	Store 2	Store 3	Store 1	Store 2	Store 3
Date	Department	Category	Subcategory	UPC EAN	Description
2018-07-10	Booooze	Alcohol	Liqour	80480280024	Grey Goose - Imported Vodka - 750 mL Bottle	NaN	0.23k	0.13k	NaN	0.01k	0.01k
			Liquor	80432400630	Glenlivet - 12 Year Scotch Whisky - 750 mL Bottle	NaN	0.17k	0.28k	NaN	0.01k	0.01k
			Liquor	674545000001	Don Julio - Tequila Blanco - 750 mL Bottle	NaN	0.17k	0.2k	NaN	0.01k	0.01k
		Beer	Ales	94922755711	Bowser Beer - Beefy Brown Ale - 6 Pack	NaN	0.08k	0.06k	NaN	0.01k	0.01k
				702770082018	Omission Pale Ale - Gluten Free Pale Ale - 6 Pack	NaN	0.09k	0.14k	NaN	0.01k	0.01k
				736920111112	Goose Island - Honkers Ale - 6 Pack	NaN	0.08k	0.07k	NaN	0.01k	0.01k
			Lagers	702770081011	Omission - Gluten Free Lager - 6 Pack	0.14k	NaN	0.18k	0.01k	NaN	0.01k
			Stouts	83783575217	Sierra Nevada Brewing Co. - Stout - 6 Pack	NaN	0.08k	0.12k	NaN	0.01k	0.01k
		Wine	Red	89744402585	Trapiche - Malbec - 750 mL Bottle	0.06k	NaN	0.04k	0.01k	NaN	0.01k
		Wine	Red	492130008399	Yellow Tail - Merlot - 750 mL Bottle	0.05k	NaN	0.03k	0.01k	NaN	0.01k

Sorting¶

Before we start slicing and filtering, it's important to sort out data to drastically improve efficiency. In some cases pandas will begin sorting for you by default in the latest versions, but other times you'll get an ugly and confusing lexsort warning (referencing the default lexographic sorting order).

Sorting indexes works the exactly the same way for multiindex DataFrames as with regular DataFrames, but with some extra parameters to decide on:

In [26]:

# By default sort_index will sort all levels of the index,
# first sorting the first index level, then secondarily sorting the second index level, and so on...
sort1 = multi_col_lvl_df.copy()
sort1.sort_index(inplace=True)
print("\nSort by Date, then Department, then Category, etc.:")
display(sort1.head(10))

# ...but you can choose to starting from a different level...
sort2 = multi_col_lvl_df.copy()
sort2.sort_index(level='Category', inplace=True)
print("\nSort by Category, then Subcategory, then UPC EAN, then Description, then Date, then Deparment:")
display(sort2.head(10))

# ...or even to only sort on specifc levels.
sort3 = multi_col_lvl_df.copy()
sort3.sort_index(level=['Category', 'Date'], sort_remaining=False, inplace=True)
print("\nSort by Category, then Date only:")
display(sort3.head(10))

Sort by Date, then Department, then Category, etc.:

						Dollars			Units
					Store	Store 1	Store 2	Store 3	Store 1	Store 2	Store 3
Date	Department	Category	Subcategory	UPC EAN	Description
2018-07-10	Booooze	Alcohol	Liqour	80480280024	Grey Goose - Imported Vodka - 750 mL Bottle	NaN	231.30	128.30	NaN	9.0	5.0
			Liquor	80432400630	Glenlivet - 12 Year Scotch Whisky - 750 mL Bottle	NaN	172.00	278.80	NaN	5.0	8.0
			Liquor	674545000001	Don Julio - Tequila Blanco - 750 mL Bottle	NaN	168.06	204.12	NaN	6.0	7.0
		Beer	Ales	94922755711	Bowser Beer - Beefy Brown Ale - 6 Pack	NaN	81.76	62.48	NaN	8.0	8.0
				702770082018	Omission Pale Ale - Gluten Free Pale Ale - 6 Pack	NaN	89.76	135.72	NaN	6.0	13.0
				736920111112	Goose Island - Honkers Ale - 6 Pack	NaN	80.01	67.36	NaN	9.0	8.0
			Lagers	702770081011	Omission - Gluten Free Lager - 6 Pack	137.67	NaN	176.96	13.0	NaN	14.0
			Stouts	83783575217	Sierra Nevada Brewing Co. - Stout - 6 Pack	NaN	78.40	119.90	NaN	7.0	10.0
		Wine	Red	89744402585	Trapiche - Malbec - 750 mL Bottle	56.88	NaN	37.80	8.0	NaN	6.0
		Wine	Red	492130008399	Yellow Tail - Merlot - 750 mL Bottle	46.30	NaN	34.56	10.0	NaN	6.0

Sort by Category, then Subcategory, then UPC EAN, then Description, then Date, then Deparment:

						Dollars			Units
					Store	Store 1	Store 2	Store 3	Store 1	Store 2	Store 3
Date	Department	Category	Subcategory	UPC EAN	Description
2018-07-10	Booooze	Alcohol	Liqour	80480280024	Grey Goose - Imported Vodka - 750 mL Bottle	NaN	231.30	128.30	NaN	9.0	5.0
			Liquor	80432400630	Glenlivet - 12 Year Scotch Whisky - 750 mL Bottle	NaN	172.00	278.80	NaN	5.0	8.0
			Liquor	674545000001	Don Julio - Tequila Blanco - 750 mL Bottle	NaN	168.06	204.12	NaN	6.0	7.0
2018-07-11	Booooze	Alcohol	Liqour	80480280024	Grey Goose - Imported Vodka - 750 mL Bottle	NaN	248.80	296.01	NaN	10.0	11.0
			Liquor	80432400630	Glenlivet - 12 Year Scotch Whisky - 750 mL Bottle	NaN	201.72	318.80	NaN	6.0	8.0
			Liquor	674545000001	Don Julio - Tequila Blanco - 750 mL Bottle	NaN	338.10	102.04	NaN	10.0	4.0
2018-07-12	Booooze	Alcohol	Liqour	80480280024	Grey Goose - Imported Vodka - 750 mL Bottle	NaN	279.36	228.08	NaN	12.0	8.0
			Liquor	80432400630	Glenlivet - 12 Year Scotch Whisky - 750 mL Bottle	NaN	250.25	455.13	NaN	7.0	13.0
			Liquor	674545000001	Don Julio - Tequila Blanco - 750 mL Bottle	NaN	410.88	268.56	NaN	12.0	9.0
2018-07-13	Booooze	Alcohol	Liquor	80432400630	Glenlivet - 12 Year Scotch Whisky - 750 mL Bottle	NaN	432.74	185.75	NaN	11.0	5.0

Sort by Category, then Date only:

						Dollars			Units
					Store	Store 1	Store 2	Store 3	Store 1	Store 2	Store 3
Date	Department	Category	Subcategory	UPC EAN	Description
2018-07-10	Booooze	Alcohol	Liquor	674545000001	Don Julio - Tequila Blanco - 750 mL Bottle	NaN	168.06	204.12	NaN	6.0	7.0
2018-07-11	Booooze	Alcohol	Liquor	674545000001	Don Julio - Tequila Blanco - 750 mL Bottle	NaN	338.10	102.04	NaN	10.0	4.0
2018-07-12	Booooze	Alcohol	Liquor	674545000001	Don Julio - Tequila Blanco - 750 mL Bottle	NaN	410.88	268.56	NaN	12.0	9.0
2018-07-15	Booooze	Alcohol	Liquor	674545000001	Don Julio - Tequila Blanco - 750 mL Bottle	NaN	177.00	430.50	NaN	6.0	14.0
2018-07-16	Booooze	Alcohol	Liquor	674545000001	Don Julio - Tequila Blanco - 750 mL Bottle	NaN	161.15	345.07	NaN	5.0	11.0
2018-07-10	Booooze	Alcohol	Liquor	80432400630	Glenlivet - 12 Year Scotch Whisky - 750 mL Bottle	NaN	172.00	278.80	NaN	5.0	8.0
2018-07-11	Booooze	Alcohol	Liquor	80432400630	Glenlivet - 12 Year Scotch Whisky - 750 mL Bottle	NaN	201.72	318.80	NaN	6.0	8.0
2018-07-12	Booooze	Alcohol	Liquor	80432400630	Glenlivet - 12 Year Scotch Whisky - 750 mL Bottle	NaN	250.25	455.13	NaN	7.0	13.0
2018-07-13	Booooze	Alcohol	Liquor	80432400630	Glenlivet - 12 Year Scotch Whisky - 750 mL Bottle	NaN	432.74	185.75	NaN	11.0	5.0
2018-07-14	Booooze	Alcohol	Liquor	80432400630	Glenlivet - 12 Year Scotch Whisky - 750 mL Bottle	NaN	227.82	268.52	NaN	6.0	7.0

Slicing, Filtering, and Querying¶

With the sorting out of the way we can begin searching our data by specific criteria. There is a fantastic guide about understanding SettingwithCopyWarnings in Pandas, so if you're unclear on the difference between a view and a copy, I suggest giving that a read first.

Readers who have used Pandas for slicing previously will know that .loc is generally the preferred method of referencing cells in a DataFrame. The same type of syntax exists for multiindex Dataframes:

In [27]:

# A slicing helper. Works similarly to slicing in Python (e.g. list slicing),
#                   but is inclusive of both the start and stop values.
idx = pd.IndexSlice

# View rows with a Category of Beer, but any Date, Department, Subcategory, UPC EAN, or Description
# Only looking at Dollars columns
print("View only rows in the 'Dollars' columns where the Category is 'Beer':")
display(sort3.loc[idx[:, :, 'Beer'], 'Dollars':'Dollars'].head(10))

# If we just want to look at the Store 1 sub-column
print("\nView only rows in the 'Dollars' AND 'Store 1' column where the Category is 'Beer':")
display(sort3.loc[idx[:, :, 'Beer'], idx['Dollars', 'Store 1':'Store 1']].head(10))

View only rows in the 'Dollars' columns where the Category is 'Beer':

						Dollars
					Store	Store 1	Store 2	Store 3
Date	Department	Category	Subcategory	UPC EAN	Description
2018-07-10	Booooze	Beer	Ales	94922755711	Bowser Beer - Beefy Brown Ale - 6 Pack	NaN	81.76	62.48
2018-07-11	Booooze	Beer	Ales	94922755711	Bowser Beer - Beefy Brown Ale - 6 Pack	NaN	104.91	72.72
2018-07-14	Booooze	Beer	Ales	94922755711	Bowser Beer - Beefy Brown Ale - 6 Pack	NaN	124.52	113.68
2018-07-15	Booooze	Beer	Ales	94922755711	Bowser Beer - Beefy Brown Ale - 6 Pack	NaN	88.77	119.47
2018-07-16	Booooze	Beer	Ales	94922755711	Bowser Beer - Beefy Brown Ale - 6 Pack	NaN	99.40	46.65
2018-07-11	Booooze	Beer	Malts	71250000140	Colt 45 - Malt Liquor - 6 Pack	NaN	101.79	107.58
2018-07-13	Booooze	Beer	Malts	71250000140	Colt 45 - Malt Liquor - 6 Pack	NaN	103.59	142.44
2018-07-15	Booooze	Beer	Malts	71250000140	Colt 45 - Malt Liquor - 6 Pack	NaN	139.23	40.50
2018-07-16	Booooze	Beer	Malts	71250000140	Colt 45 - Malt Liquor - 6 Pack	NaN	47.55	108.48
2018-07-10	Booooze	Beer	Ales	736920111112	Goose Island - Honkers Ale - 6 Pack	NaN	80.01	67.36

View only rows in the 'Dollars' AND 'Store 1' column where the Category is 'Beer':

						Dollars
					Store	Store 1
Date	Department	Category	Subcategory	UPC EAN	Description
2018-07-10	Booooze	Beer	Ales	94922755711	Bowser Beer - Beefy Brown Ale - 6 Pack	NaN
2018-07-11	Booooze	Beer	Ales	94922755711	Bowser Beer - Beefy Brown Ale - 6 Pack	NaN
2018-07-14	Booooze	Beer	Ales	94922755711	Bowser Beer - Beefy Brown Ale - 6 Pack	NaN
2018-07-15	Booooze	Beer	Ales	94922755711	Bowser Beer - Beefy Brown Ale - 6 Pack	NaN
2018-07-16	Booooze	Beer	Ales	94922755711	Bowser Beer - Beefy Brown Ale - 6 Pack	NaN
2018-07-11	Booooze	Beer	Malts	71250000140	Colt 45 - Malt Liquor - 6 Pack	NaN
2018-07-13	Booooze	Beer	Malts	71250000140	Colt 45 - Malt Liquor - 6 Pack	NaN
2018-07-15	Booooze	Beer	Malts	71250000140	Colt 45 - Malt Liquor - 6 Pack	NaN
2018-07-16	Booooze	Beer	Malts	71250000140	Colt 45 - Malt Liquor - 6 Pack	NaN
2018-07-10	Booooze	Beer	Ales	736920111112	Goose Island - Honkers Ale - 6 Pack	NaN

Unfortunately the .loc syntax doesn't scale well to having many index levels. In order to select a values from a single level, .loc requires specifying ':' for every level that comes before. For the column names to show up for a single column you also need to use : syntax just to specify a single column. This can be incredibly annoying.

Instead, it is more common (and practical) to use df.xs (cross sections). df.xs allows us to filter our DataFrame to only those rows that match levels of our index we specify. We can choose an individual level or multiple levels easily:

In [28]:

# We provide the level value, which by default searches the first index level ('Date')
display(multi_col_lvl_df.xs('2018-07-12').head(10))

# Here we're more specific because we want to search 'Category instead'
display(multi_col_lvl_df.xs('Wine', level='Category').head(10))

# To search for rows that match all of our level value requirements, use tuples
display(multi_col_lvl_df.xs(('2018-07-10', 'Booooze', 'Wine'), level=['Date', 'Department', 'Category']).head(10))

# Note that chaining .xs works, but is significantly less efficient
display(multi_col_lvl_df.xs('2018-07-10', level='Date')
                        .xs('Booooze', level='Department')
                        .xs('Wine', level='Category')
                        .head(10))

# We can then use the column value filtering we're used to to filter rows as well
cross_section_df = multi_col_lvl_df.xs(('2018-07-10', 'Booooze'), level=['Date', 'Department'])
display(cross_section_df[cross_section_df['Dollars']['Store 1'] > 0].head(10))

					Dollars			Units
				Store	Store 1	Store 2	Store 3	Store 1	Store 2	Store 3
Department	Category	Subcategory	UPC EAN	Description
Booooze	Alcohol	Liqour	80480280024	Grey Goose - Imported Vodka - 750 mL Bottle	NaN	279.36	228.08	NaN	12.0	8.0
		Liquor	80432400630	Glenlivet - 12 Year Scotch Whisky - 750 mL Bottle	NaN	250.25	455.13	NaN	7.0	13.0
		Liquor	674545000001	Don Julio - Tequila Blanco - 750 mL Bottle	NaN	410.88	268.56	NaN	12.0	9.0
	Beer	Ales	702770082018	Omission Pale Ale - Gluten Free Pale Ale - 6 Pack	NaN	109.20	56.76	NaN	8.0	6.0
		Ales	736920111112	Goose Island - Honkers Ale - 6 Pack	NaN	37.75	78.30	NaN	5.0	10.0
		Lagers	702770081011	Omission - Gluten Free Lager - 6 Pack	40.12	NaN	93.10	4.0	NaN	10.0
		Stouts	83783575217	Sierra Nevada Brewing Co. - Stout - 6 Pack	NaN	116.71	109.34	NaN	11.0	14.0
	Wine	Red	89744402585	Trapiche - Malbec - 750 mL Bottle	60.24	NaN	69.29	8.0	NaN	13.0
		Red	492130008399	Yellow Tail - Merlot - 750 mL Bottle	95.16	NaN	75.60	12.0	NaN	10.0
		Rose	85000024218	Dark Horse - S. Blanc Rose - 750 mL Bottle	93.94	83.82	NaN	7.0	6.0	NaN

					Dollars			Units
				Store	Store 1	Store 2	Store 3	Store 1	Store 2	Store 3
Date	Department	Subcategory	UPC EAN	Description
2018-07-10	Booooze	Red	89744402585	Trapiche - Malbec - 750 mL Bottle	56.88	NaN	37.80	8.0	NaN	6.0
		Red	492130008399	Yellow Tail - Merlot - 750 mL Bottle	46.30	NaN	34.56	10.0	NaN	6.0
		Rose	85000024218	Dark Horse - S. Blanc Rose - 750 mL Bottle	148.61	100.38	NaN	11.0	7.0	NaN
		White	81248501095	Curious Beasts - Chardonnay - 750 mL Bottle	NaN	293.16	278.60	NaN	14.0	14.0
2018-07-11	Booooze	Red	89744402585	Trapiche - Malbec - 750 mL Bottle	50.40	NaN	47.25	6.0	NaN	7.0
		Rose	85000024218	Dark Horse - S. Blanc Rose - 750 mL Bottle	52.24	173.03	NaN	4.0	13.0	NaN
		White	81248501095	Curious Beasts - Chardonnay - 750 mL Bottle	NaN	237.16	247.52	NaN	11.0	13.0
2018-07-12	Booooze	Red	89744402585	Trapiche - Malbec - 750 mL Bottle	60.24	NaN	69.29	8.0	NaN	13.0
		Red	492130008399	Yellow Tail - Merlot - 750 mL Bottle	95.16	NaN	75.60	12.0	NaN	10.0
		Rose	85000024218	Dark Horse - S. Blanc Rose - 750 mL Bottle	93.94	83.82	NaN	7.0	6.0	NaN

			Dollars			Units
		Store	Store 1	Store 2	Store 3	Store 1	Store 2	Store 3
Subcategory	UPC EAN	Description
Red	89744402585	Trapiche - Malbec - 750 mL Bottle	56.88	NaN	37.80	8.0	NaN	6.0
Red	492130008399	Yellow Tail - Merlot - 750 mL Bottle	46.30	NaN	34.56	10.0	NaN	6.0
Rose	85000024218	Dark Horse - S. Blanc Rose - 750 mL Bottle	148.61	100.38	NaN	11.0	7.0	NaN
White	81248501095	Curious Beasts - Chardonnay - 750 mL Bottle	NaN	293.16	278.60	NaN	14.0	14.0

			Dollars			Units
		Store	Store 1	Store 2	Store 3	Store 1	Store 2	Store 3
Subcategory	UPC EAN	Description
Red	89744402585	Trapiche - Malbec - 750 mL Bottle	56.88	NaN	37.80	8.0	NaN	6.0
Red	492130008399	Yellow Tail - Merlot - 750 mL Bottle	46.30	NaN	34.56	10.0	NaN	6.0
Rose	85000024218	Dark Horse - S. Blanc Rose - 750 mL Bottle	148.61	100.38	NaN	11.0	7.0	NaN
White	81248501095	Curious Beasts - Chardonnay - 750 mL Bottle	NaN	293.16	278.60	NaN	14.0	14.0

				Dollars			Units
			Store	Store 1	Store 2	Store 3	Store 1	Store 2	Store 3
Category	Subcategory	UPC EAN	Description
Beer	Lagers	702770081011	Omission - Gluten Free Lager - 6 Pack	137.67	NaN	176.96	13.0	NaN	14.0
Wine	Red	89744402585	Trapiche - Malbec - 750 mL Bottle	56.88	NaN	37.80	8.0	NaN	6.0
	Red	492130008399	Yellow Tail - Merlot - 750 mL Bottle	46.30	NaN	34.56	10.0	NaN	6.0
	Rose	85000024218	Dark Horse - S. Blanc Rose - 750 mL Bottle	148.61	100.38	NaN	11.0	7.0	NaN

That's a whole lot nicer of syntax than the .loc method, but it's still relatively verbose and requires separating out or index and column value filtering. Multiindex DataFrames also offer a lesser known method, df.query. Query is powerful for a few reasons:

It can search both index levels and columns in the same query,
Syntactically it's nicer, since it lets use write our query as short expressions,
It's very efficient (depending on the chosen engine)

The documentation has excellent notes about the method and a nice usage guide that are definitely worth a read but a couple of examples using our dataset:

In [29]:

multi_col_lvl_df.query("Date == '2018-07-10' and Category == 'Wine'")

Out[29]:

						Dollars			Units
					Store	Store 1	Store 2	Store 3	Store 1	Store 2	Store 3
Date	Department	Category	Subcategory	UPC EAN	Description
2018-07-10	Booooze	Wine	Red	89744402585	Trapiche - Malbec - 750 mL Bottle	56.88	NaN	37.80	8.0	NaN	6.0
			Red	492130008399	Yellow Tail - Merlot - 750 mL Bottle	46.30	NaN	34.56	10.0	NaN	6.0
			Rose	85000024218	Dark Horse - S. Blanc Rose - 750 mL Bottle	148.61	100.38	NaN	11.0	7.0	NaN
			White	81248501095	Curious Beasts - Chardonnay - 750 mL Bottle	NaN	293.16	278.60	NaN	14.0	14.0

So why doesn't everyone always use df.query? It has two big drawbacks:

It doesn't allow index or column names with spaces in the name.
It can't deal with multiple column levels well.

However, if you're used to working with SQL databases and/or willing to work within the above limitations, this is a powerful and fast option.

Updating Column Values¶

Now that we've learned a few ways to slice, filter, and query our data, what if we want to alter values in our dataframe? We can take advantage of the .loc method to replace values in place without creating a copy or, if temporarily copying our data isn't an issue (i.e. too big for memory) we can use the df.xs method with df.update:

In [30]:

updated_df = multi_col_lvl_df.copy()
updated_df.loc[idx[:, :, 'Beer'], idx['Dollars', 'Store 1':'Store 1']] = "We changed this"
updated_df.head(10)

Out[30]:

						Dollars			Units
					Store	Store 1	Store 2	Store 3	Store 1	Store 2	Store 3
Date	Department	Category	Subcategory	UPC EAN	Description
2018-07-10	Booooze	Alcohol	Liqour	80480280024	Grey Goose - Imported Vodka - 750 mL Bottle	NaN	231.30	128.30	NaN	9.0	5.0
			Liquor	80432400630	Glenlivet - 12 Year Scotch Whisky - 750 mL Bottle	NaN	172.00	278.80	NaN	5.0	8.0
			Liquor	674545000001	Don Julio - Tequila Blanco - 750 mL Bottle	NaN	168.06	204.12	NaN	6.0	7.0
		Beer	Ales	94922755711	Bowser Beer - Beefy Brown Ale - 6 Pack	We changed this	81.76	62.48	NaN	8.0	8.0
				702770082018	Omission Pale Ale - Gluten Free Pale Ale - 6 Pack	We changed this	89.76	135.72	NaN	6.0	13.0
				736920111112	Goose Island - Honkers Ale - 6 Pack	We changed this	80.01	67.36	NaN	9.0	8.0
			Lagers	702770081011	Omission - Gluten Free Lager - 6 Pack	We changed this	NaN	176.96	13.0	NaN	14.0
			Stouts	83783575217	Sierra Nevada Brewing Co. - Stout - 6 Pack	We changed this	78.40	119.90	NaN	7.0	10.0
		Wine	Red	89744402585	Trapiche - Malbec - 750 mL Bottle	56.88	NaN	37.80	8.0	NaN	6.0
		Wine	Red	492130008399	Yellow Tail - Merlot - 750 mL Bottle	46.3	NaN	34.56	10.0	NaN	6.0

The above syntax works well for immutable values, but you'll get an error if you try the same thing for a mutable value, like a list. To my knowledge the only way to set values in a DataFrame to a mutable object is one cell at a time, so doing multiple replacements requires iterrows:

In [31]:

# # Note: If the column we're changing's type isn't already object,
# # we need to change it or the value relpacements below will error.
# updated_df['Dollars']['Store 1'] = updated_df['Dollars']['Store 1'].astype(object)

# # Loop through rows, replacing single values
# # Only necessary if the new assigned value is mutable
# # Code below currently not working when there are multiple column levels, but works with one column level
# for index, row in updated_df.loc[idx[:, :, 'Beer'], idx['Dollars', 'Store 1':'Store 1']].iterrows():
#     updated_df.at[index, idx['Dollars', 'Store 1':'Store 1']] = ["We", "changed", "this"]

# updated_df.head(3)

Alternatively, using df.xs and then df.update instead of the full .loc is a little bit more understandable in my option. For example, the following code does the same as the code two cells above (except that it requires a copy):

In [32]:

updated_df = multi_col_lvl_df.copy()

df2 = updated_df.xs('Beer', level='Category', drop_level=False).copy()  # .copy() is to avoid SettingwithCopyWarning
df2[idx['Dollars', 'Store 1']] = "We ALSO changed this"

updated_df.update(df2, join="left", overwrite=True, filter_func=None, raise_conflict=False)
updated_df.head(10)

Out[32]:

						Dollars			Units
					Store	Store 1	Store 2	Store 3	Store 1	Store 2	Store 3
Date	Department	Category	Subcategory	UPC EAN	Description
2018-07-10	Booooze	Alcohol	Liqour	80480280024	Grey Goose - Imported Vodka - 750 mL Bottle	NaN	231.30	128.30	NaN	9.0	5.0
			Liquor	80432400630	Glenlivet - 12 Year Scotch Whisky - 750 mL Bottle	NaN	172.00	278.80	NaN	5.0	8.0
			Liquor	674545000001	Don Julio - Tequila Blanco - 750 mL Bottle	NaN	168.06	204.12	NaN	6.0	7.0
		Beer	Ales	94922755711	Bowser Beer - Beefy Brown Ale - 6 Pack	We ALSO changed this	81.76	62.48	NaN	8.0	8.0
				702770082018	Omission Pale Ale - Gluten Free Pale Ale - 6 Pack	We ALSO changed this	89.76	135.72	NaN	6.0	13.0
				736920111112	Goose Island - Honkers Ale - 6 Pack	We ALSO changed this	80.01	67.36	NaN	9.0	8.0
			Lagers	702770081011	Omission - Gluten Free Lager - 6 Pack	We ALSO changed this	NaN	176.96	13.0	NaN	14.0
			Stouts	83783575217	Sierra Nevada Brewing Co. - Stout - 6 Pack	We ALSO changed this	78.40	119.90	NaN	7.0	10.0
		Wine	Red	89744402585	Trapiche - Malbec - 750 mL Bottle	56.88	NaN	37.80	8.0	NaN	6.0
		Wine	Red	492130008399	Yellow Tail - Merlot - 750 mL Bottle	46.3	NaN	34.56	10.0	NaN	6.0

Display options¶

In [33]:

pd.set_option('display.multi_sparse', True)

That's all, folks!¶

Hopefully this guide has been a gentle and practical tutorial on multiindex DataFrames in Pandas. If you find any errors, material ommissions, or topics that could just be explained more clearly, please submit leave a comment or better yet, a pull request!

To continue your learning, find a list of my top picks for Pandas resources - generally and specifically for MultiIndexes. Of particular note is the Official list of available methods for MultiIndexes, which has a full listing of all the available methods (many not listed here). This link is specifically for Pandas version 0.22, because, for some reason, the list in future versions excludes a lot of still valuable methods! What's more, this list provides the ONLY explanation for some methods, since many MultiIndex methods lack even the most basic of docstrings.

Resources¶

Official MultiIndex References

Other MultiIndex Tutorials

General Pandas Tutorials

Pandas Operations and Efficiency