Pandas is one of the most-used open source libraries for importing and analyzing data available in Python today. It provides convenient ways to import, view, split, apply, and combine array-like data. And not just convenient, but efficient, too. For example, Pandas' read_csv and to_csv functions are so efficient that the library is often imported just for this task instead of relying on the standard library alternative!
The core value of the library, however, comes through several data structure options, primarily Series (for labeled, homogenously-typed, one-dimensional arrays) and DataFrames (for labeled, potentially heterogeneously-typed, two-dimensional arrays).
DataFrames are two-dimensional, labeled data structures, with columns of potentially different types. You can think of it like a spreadsheet or SQL table, or a dict of Series objects. It is generally the most commonly used Pandas object. Like Series, DataFrame accepts many different kinds of input, including dicts, lists, lists of lists, Series, numpy arrays, other DataFrames, external data from CSVs, etc.
The DataFrame has two core parts: an index (row labels; look like columns, but aren't) and columns (data with headers).
The index of a DataFrame is a set (i.e. each element is only represented once) that consists of a label for each row. To be helpful, those labels should be not just unique, but also meaningful. By default, if no index is provided, the index will be a numbered range, starting from 0 (known as a range index).
A more meaningful index, however, would be something that uniquely describes each row of your data in a way that will help you look things up. For example, in a list of transactions, the date-time might be most useful. Alternatively, in a grade book for a math class the name of the student might be most useful.
The most confusing thing about understanding about Pandas' indexes at first (in my opinion) is how to interact with them. While they look exactly like columns, they're not referenced in the same way (learn more here). Understanding that, I personally find it most useful to think of an index in Pandas like a column that's in time-out and that just can't play with all the other columns.
Pandas' multiindex DataFrames extend the DataFrames described above by enabling effective storage and manipulation of arbitrarily high dimension data in a 2-dimensional tabular structure. ((If that sentence doesn't make sense yet, don't worry - it should by the end of this tutorial.))
While the displayed version of a multiindexed DataFrame doesn't appear to be much more than a prettily-organized regular DataFrame, it's actually a pretty powerful structure if the data warrants its use.
Besides structure, multiindexes offer relatively easy in-memory retreival of complex data.
For this tutorial we will work with a realistic case of when multiindex DataFrames can come in handy - a grocer's retail transactions.
# Normally put all your import up top, but this cell was ugly in the midst of the intro paragraphs
from typing import Any
import numpy as np
import pandas as pd
display(f"Pandas version: {pd.__version__}")
'Pandas version: 0.22.0'
Note that this notebook is using the Pandas version above. There have been many changes to MultiIndex methods since 0.19, including major bug fixes. It is STRONGLY recommended to use the latest version of Pandas, but at least version 0.21 is required for all of the techniques in this notebook to work as presented.
# # Creates random mock data.
# # Based on the UPCs found in upc_meta_data.csv and saves to data.csv
# %run mocker.py
df = pd.read_csv('data.csv', parse_dates=['Date'])
df.sample(10)
Date | Store | Category | Subcategory | UPC EAN | Description | Dollars | Units | |
---|---|---|---|---|---|---|---|---|
20 | 2018-07-14 | Store 3 | Beer | Lagers | 702770081011 | Omission - Gluten Free Lager - 6 Pack | 128.64 | 12 |
101 | 2018-07-10 | Store 3 | Alcohol | Liqour | 80480280024 | Grey Goose - Imported Vodka - 750 mL Bottle | 128.30 | 5 |
68 | 2018-07-14 | Store 1 | Wine | Red | 492130008399 | Yellow Tail - Merlot - 750 mL Bottle | 32.10 | 6 |
12 | 2018-07-16 | Store 2 | Beer | Ales | 94922755711 | Bowser Beer - Beefy Brown Ale - 6 Pack | 99.40 | 10 |
58 | 2018-07-16 | Store 1 | Wine | Red | 89744402585 | Trapiche - Malbec - 750 mL Bottle | 126.14 | 14 |
1 | 2018-07-11 | Store 2 | Beer | Ales | 736920111112 | Goose Island - Honkers Ale - 6 Pack | 166.74 | 14 |
100 | 2018-07-10 | Store 2 | Alcohol | Liqour | 80480280024 | Grey Goose - Imported Vodka - 750 mL Bottle | 231.30 | 9 |
59 | 2018-07-16 | Store 3 | Wine | Red | 89744402585 | Trapiche - Malbec - 750 mL Bottle | 66.16 | 8 |
116 | 2018-07-11 | Store 3 | Alcohol | Liquor | 674545000001 | Don Julio - Tequila Blanco - 750 mL Bottle | 102.04 | 4 |
77 | 2018-07-15 | Store 3 | Wine | Red | 492130008399 | Yellow Tail - Merlot - 750 mL Bottle | 58.20 | 12 |
Each row of data represents a sale of an item, but how do we set our index meaningfully? A numeric ID for each transaction would be fine, but it wouldn't tell us terribly much. Even the date of a transaction isn't useful by itself, since it's common for such a company to have many transactions on a given date (even at the same time).
Instead, we will have to look at a combination of the available metadata to create a unique and meaningful index. At first glance, a combination of the date, the store, and the hierarchy of each item (Category > Subcategory > UPC) look like they might be enough to get us a unique index.
So let's take a look at how we can create our multiindex from our regular ol' DataFrame. We'll walk through the basics of setting, reordering, and resetting indexes, along with some useful tips/tricks. Then we can begin investigating our transaction data to learn about our sales and trends.
# Set just like the index for a DataFrame...
# ...except we give a list of column names instead of a single string column name
df.set_index(['Date', 'Store', 'Category', 'Subcategory', 'Description'], inplace=True)
df.head(3)
UPC EAN | Dollars | Units | |||||
---|---|---|---|---|---|---|---|
Date | Store | Category | Subcategory | Description | |||
2018-07-11 | Store 3 | Beer | Ales | Goose Island - Honkers Ale - 6 Pack | 736920111112 | 35.68 | 4 |
Store 2 | Beer | Ales | Goose Island - Honkers Ale - 6 Pack | 736920111112 | 166.74 | 14 | |
2018-07-10 | Store 3 | Beer | Ales | Goose Island - Honkers Ale - 6 Pack | 736920111112 | 67.36 | 8 |
Uh oh - it looks like we forgot to add the 'UPC EAN' column to our index, but don't worry - Pandas has us covered with extra set_index parameters for MultiIndexes:
# We can append a column to our existing index
df.set_index('UPC EAN', append=True, inplace=True)
df.head(3)
Dollars | Units | ||||||
---|---|---|---|---|---|---|---|
Date | Store | Category | Subcategory | Description | UPC EAN | ||
2018-07-11 | Store 3 | Beer | Ales | Goose Island - Honkers Ale - 6 Pack | 736920111112 | 35.68 | 4 |
Store 2 | Beer | Ales | Goose Island - Honkers Ale - 6 Pack | 736920111112 | 166.74 | 14 | |
2018-07-10 | Store 3 | Beer | Ales | Goose Island - Honkers Ale - 6 Pack | 736920111112 | 67.36 | 8 |
That's almost right, but we'd actually like 'Description' to show up after 'UPC EAN'. We have a couple of options to get things in the right order:
# Option 1 is the generalized solution to reorder the index levels
# Note: We're not making an inplace change in this cell,
# but it's worth noting that this method doesn't have an inplace parameter.
df.reorder_levels(order=['Date', 'Store', 'Category', 'Subcategory', 'UPC EAN', 'Description']).head(3)
Dollars | Units | ||||||
---|---|---|---|---|---|---|---|
Date | Store | Category | Subcategory | UPC EAN | Description | ||
2018-07-11 | Store 3 | Beer | Ales | 736920111112 | Goose Island - Honkers Ale - 6 Pack | 35.68 | 4 |
Store 2 | Beer | Ales | 736920111112 | Goose Island - Honkers Ale - 6 Pack | 166.74 | 14 | |
2018-07-10 | Store 3 | Beer | Ales | 736920111112 | Goose Island - Honkers Ale - 6 Pack | 67.36 | 8 |
reorder_levels() is useful, but it was a pain to have to type all five levels just two switch two. In cases like this we have a second, less verbose option:
# Option 2 just switches two index levels (a more common need than you'd think)
# Note: This time we're doing an inplace change, but there's no parameter for this method either.
df = df.swaplevel('Description', 'UPC EAN')
df.head(3)
Dollars | Units | ||||||
---|---|---|---|---|---|---|---|
Date | Store | Category | Subcategory | UPC EAN | Description | ||
2018-07-11 | Store 3 | Beer | Ales | 736920111112 | Goose Island - Honkers Ale - 6 Pack | 35.68 | 4 |
Store 2 | Beer | Ales | 736920111112 | Goose Island - Honkers Ale - 6 Pack | 166.74 | 14 | |
2018-07-10 | Store 3 | Beer | Ales | 736920111112 | Goose Island - Honkers Ale - 6 Pack | 67.36 | 8 |
Just when we thought we were done, it turns our we forgot to add the highest level of the product hierarchy - the Department - not just to our index, but to our DataFrame altogether. Luckily all of our records belong in the same Department, so here's a neat trick to add a new column with all the same values as a level in an existing index:
# A handy function to keep around for projects
def add_constant_index_level(df: pd.DataFrame, value: Any, level_name: str):
"""Add a new level to an existing index where every row has the same, given value.
Args:
df: Any existing pd.DataFrame.
value: Value to be placed in every row of the new index level.
level_name: Title of the new index level.
Returns:
df with an additional, prepended index level.
"""
return pd.concat([df], keys=[value], names=[level_name])
df = add_constant_index_level(df, "Booooze", "Department")
df = df.reorder_levels(order=['Date', 'Store', 'Department', 'Category', 'Subcategory', 'UPC EAN', 'Description'])
df.head(3)
# # If we wanted to later drop that level
# df.index = df.index.droplevel(level='Department')
# df.head(3)
Dollars | Units | |||||||
---|---|---|---|---|---|---|---|---|
Date | Store | Department | Category | Subcategory | UPC EAN | Description | ||
2018-07-11 | Store 3 | Booooze | Beer | Ales | 736920111112 | Goose Island - Honkers Ale - 6 Pack | 35.68 | 4 |
Store 2 | Booooze | Beer | Ales | 736920111112 | Goose Island - Honkers Ale - 6 Pack | 166.74 | 14 | |
2018-07-10 | Store 3 | Booooze | Beer | Ales | 736920111112 | Goose Island - Honkers Ale - 6 Pack | 67.36 | 8 |
Now that our index is set the way we want it, what if we want to interact with those index levels? Here are a few helpful code snippets:
# checking out their unique values, for a single level
df.index.get_level_values('Subcategory').unique()
# checking out their unique values, for combinations of multiple levels
# See answer at https://stackoverflow.com/questions/39080555/pandas-get-level-values-for-multiple-columns
Index(['Ales', 'Lagers', 'Stouts', 'Malts', 'Red', 'White', 'Rose', 'Liqour', 'Liquor'], dtype='object', name='Subcategory')
Note the typo in "Liquor" above. Good thing we checked out unique values! Maybe someone can submit a pull request to fix this for me :)
# Replace level values using rename
# Note that this can be done using set_levels as well, but it's a pain
df.rename(index={'Goose Island - Honkers Ale - 6 Pack': 'We changed this'}).head(15)
# Replace np.nan level values
# df.rename(index={np.nan: "''"}, inplace=True)
Dollars | Units | |||||||
---|---|---|---|---|---|---|---|---|
Date | Store | Department | Category | Subcategory | UPC EAN | Description | ||
2018-07-11 | Store 3 | Booooze | Beer | Ales | 736920111112 | We changed this | 35.68 | 4 |
Store 2 | Booooze | Beer | Ales | 736920111112 | We changed this | 166.74 | 14 | |
2018-07-10 | Store 3 | Booooze | Beer | Ales | 736920111112 | We changed this | 67.36 | 8 |
Store 2 | Booooze | Beer | Ales | 736920111112 | We changed this | 80.01 | 9 | |
2018-07-12 | Store 3 | Booooze | Beer | Ales | 736920111112 | We changed this | 78.30 | 10 |
Store 2 | Booooze | Beer | Ales | 736920111112 | We changed this | 37.75 | 5 | |
2018-07-14 | Store 3 | Booooze | Beer | Ales | 736920111112 | We changed this | 113.49 | 13 |
Store 2 | Booooze | Beer | Ales | 736920111112 | We changed this | 45.05 | 5 | |
2018-07-10 | Store 2 | Booooze | Beer | Ales | 94922755711 | Bowser Beer - Beefy Brown Ale - 6 Pack | 81.76 | 8 |
Store 3 | Booooze | Beer | Ales | 94922755711 | Bowser Beer - Beefy Brown Ale - 6 Pack | 62.48 | 8 | |
2018-07-11 | Store 2 | Booooze | Beer | Ales | 94922755711 | Bowser Beer - Beefy Brown Ale - 6 Pack | 104.91 | 13 |
Store 3 | Booooze | Beer | Ales | 94922755711 | Bowser Beer - Beefy Brown Ale - 6 Pack | 72.72 | 9 | |
2018-07-16 | Store 2 | Booooze | Beer | Ales | 94922755711 | Bowser Beer - Beefy Brown Ale - 6 Pack | 99.40 | 10 |
Store 3 | Booooze | Beer | Ales | 94922755711 | Bowser Beer - Beefy Brown Ale - 6 Pack | 46.65 | 5 | |
2018-07-14 | Store 2 | Booooze | Beer | Ales | 94922755711 | Bowser Beer - Beefy Brown Ale - 6 Pack | 124.52 | 11 |
# Rename index levels
temp_df = df.copy()
temp_df.index = df.index.set_names('Desc.', level='Description')
temp_df.head(3)
Dollars | Units | |||||||
---|---|---|---|---|---|---|---|---|
Date | Store | Department | Category | Subcategory | UPC EAN | Desc. | ||
2018-07-11 | Store 3 | Booooze | Beer | Ales | 736920111112 | Goose Island - Honkers Ale - 6 Pack | 35.68 | 4 |
Store 2 | Booooze | Beer | Ales | 736920111112 | Goose Island - Honkers Ale - 6 Pack | 166.74 | 14 | |
2018-07-10 | Store 3 | Booooze | Beer | Ales | 736920111112 | Goose Island - Honkers Ale - 6 Pack | 67.36 | 8 |
Why is this section all the way down here? Because the MultiIndex object is scary looking if you're new to using them. Many guides to hierarchical data analysis using multiindex DataFrames start with DataFrame creation and manipulation using MultiIndex objects, which I think both hinders adoption and is not reflective of how a lot of DataFrames get created in practice. As a result, my explanation of MultiIndex objects is very basic, because there are lots of other great resources out there if you want to learn more. Here are my top two:
df.index
MultiIndex(levels=[[2018-07-10 00:00:00, 2018-07-11 00:00:00, 2018-07-12 00:00:00, 2018-07-13 00:00:00, 2018-07-14 00:00:00, 2018-07-15 00:00:00, 2018-07-16 00:00:00], ['Store 1', 'Store 2', 'Store 3'], ['Booooze'], ['Alcohol', 'Beer', 'Wine'], ['Ales', 'Lagers', 'Liqour', 'Liquor', 'Malts', 'Red', 'Rose', 'Stouts', 'White'], [71250000140, 80432400630, 80480280024, 81248501095, 83783575217, 85000024218, 89744402585, 94922755711, 492130008399, 674545000001, 702770081011, 702770082018, 736920111112], ['Bowser Beer - Beefy Brown Ale - 6 Pack', 'Colt 45 - Malt Liquor - 6 Pack', 'Curious Beasts - Chardonnay - 750 mL Bottle', 'Dark Horse - S. Blanc Rose - 750 mL Bottle', 'Don Julio - Tequila Blanco - 750 mL Bottle', 'Glenlivet - 12 Year Scotch Whisky - 750 mL Bottle', 'Goose Island - Honkers Ale - 6 Pack', 'Grey Goose - Imported Vodka - 750 mL Bottle', 'Omission - Gluten Free Lager - 6 Pack', 'Omission Pale Ale - Gluten Free Pale Ale - 6 Pack', 'Sierra Nevada Brewing Co. - Stout - 6 Pack', 'Trapiche - Malbec - 750 mL Bottle', 'Yellow Tail - Merlot - 750 mL Bottle']], labels=[[1, 1, 0, 0, 2, 2, 4, 4, 0, 0, 1, 1, 6, 6, 4, 4, 5, 5, 2, 2, 4, 4, 0, 0, 6, 6, 2, 2, 4, 4, 3, 3, 0, 0, 1, 1, 5, 5, 0, 0, 1, 1, 6, 6, 5, 5, 2, 2, 6, 6, 1, 1, 5, 5, 3, 3, 2, 2, 6, 6, 5, 5, 0, 0, 1, 1, 4, 4, 4, 4, 3, 3, 0, 0, 6, 6, 5, 5, 2, 2, 3, 3, 2, 2, 0, 0, 1, 1, 4, 4, 6, 6, 0, 0, 3, 3, 2, 2, 1, 1, 0, 0, 5, 5, 2, 2, 1, 1, 6, 6, 2, 2, 6, 6, 0, 0, 1, 1, 5, 5, 2, 2, 3, 3, 1, 1, 4, 4, 5, 5, 0, 0], [2, 1, 2, 1, 2, 1, 2, 1, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 2, 0, 2, 0, 2, 0, 2, 0, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 0, 2, 0, 2, 0, 2, 0, 2, 0, 2, 0, 2, 0, 2, 0, 2, 0, 2, 0, 2, 0, 2, 0, 2, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 4, 4, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 6, 6, 6, 6, 6, 6, 6, 6, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3], [12, 12, 12, 12, 12, 12, 12, 12, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 10, 10, 10, 10, 10, 10, 10, 10, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 0, 0, 0, 0, 0, 0, 0, 0, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 5, 5, 5, 5, 5, 5, 5, 5, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [6, 6, 6, 6, 6, 6, 6, 6, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 8, 8, 8, 8, 8, 8, 8, 8, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 1, 1, 1, 1, 1, 1, 1, 1, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5]], names=['Date', 'Store', 'Department', 'Category', 'Subcategory', 'UPC EAN', 'Description'])
Well that's gross looking...but don't be scared - it's actually not that hard to understand.
'levels' is a list of lists, where each sublist represents all possible values in that index level. In other words, the 'levels' parameter reflects all possible unique values by level. For example, our first index level ('Date') has the possible values ['2018-07-10', '2018-07-11', '2018-07-12', ...].
'labels' is also a list of lists, but here each sublist reflects all of the values that appear in the row of that index. In other words, each sublist in our labels is of the same length as the entire dataframe, and the value of each row is one of the possible values defined in our associated level (above). Looking again at our first index level ('Date'), we see values like [1, 1, 0, 0, 2, 2, 4, 4 ...]. There are just an enumerated representation of the options defined in our level, so 0 = '2018-07-10', 1 = '2018-07-11', 2 = '2018-07-12', 3 = '2018-07-13', etc.
'names' is a list of the actual titles of each index level, in order of appearance from left to right.
With that fresh understanding of the 'anatomy' of a MultiIndex, we can look at...
For the most part, the two references listed in the section above cover this topic well; however, a common use case that isn't covered in those guides is creating a multiindex DataFrame while reading from a csv:
# We can set a MultiIndex while reading a csv by referencing columns to be used in the index by number
pd.read_csv("data.csv", index_col=[0, 1, 2, 3, 4, 5], skipinitialspace=True, parse_dates=['Date']).head(3)
Dollars | Units | ||||||
---|---|---|---|---|---|---|---|
Date | Store | Category | Subcategory | UPC EAN | Description | ||
2018-07-11 | Store 3 | Beer | Ales | 736920111112 | Goose Island - Honkers Ale - 6 Pack | 35.68 | 4 |
Store 2 | Beer | Ales | 736920111112 | Goose Island - Honkers Ale - 6 Pack | 166.74 | 14 | |
2018-07-10 | Store 3 | Beer | Ales | 736920111112 | Goose Island - Honkers Ale - 6 Pack | 67.36 | 8 |
We'll review more advanced importing/exporting methods below.
For a different view we can also create hierarchical column levels. We'll do this by introducting a new method: unstack(). This function "pivots" an index level to a new level of column labels whose inner-most level consists of the pivoted index labels. Stack/unstack is one of the biggest reasons to use a MultiIndex, so it's work supplementing the examples here by checking out the official docs on reshaping.
With this new technique we can start actually investigating our data. For example, let's say we want to more easily compare sales of a product by store by day, we can unstack our 'Store' index level:
multi_col_lvl_df = df.unstack('Store')
multi_col_lvl_df.sample(10)
Dollars | Units | ||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|
Store | Store 1 | Store 2 | Store 3 | Store 1 | Store 2 | Store 3 | |||||
Date | Department | Category | Subcategory | UPC EAN | Description | ||||||
2018-07-16 | Booooze | Alcohol | Liqour | 80480280024 | Grey Goose - Imported Vodka - 750 mL Bottle | NaN | 221.04 | 126.66 | NaN | 9.0 | 6.0 |
Liquor | 674545000001 | Don Julio - Tequila Blanco - 750 mL Bottle | NaN | 161.15 | 345.07 | NaN | 5.0 | 11.0 | |||
2018-07-10 | Booooze | Alcohol | Liquor | 80432400630 | Glenlivet - 12 Year Scotch Whisky - 750 mL Bottle | NaN | 172.00 | 278.80 | NaN | 5.0 | 8.0 |
674545000001 | Don Julio - Tequila Blanco - 750 mL Bottle | NaN | 168.06 | 204.12 | NaN | 6.0 | 7.0 | ||||
2018-07-11 | Booooze | Alcohol | Liquor | 80432400630 | Glenlivet - 12 Year Scotch Whisky - 750 mL Bottle | NaN | 201.72 | 318.80 | NaN | 6.0 | 8.0 |
2018-07-16 | Booooze | Beer | Lagers | 702770081011 | Omission - Gluten Free Lager - 6 Pack | 174.72 | NaN | 104.86 | 14.0 | NaN | 7.0 |
2018-07-12 | Booooze | Wine | Red | 492130008399 | Yellow Tail - Merlot - 750 mL Bottle | 95.16 | NaN | 75.60 | 12.0 | NaN | 10.0 |
2018-07-10 | Booooze | Beer | Ales | 702770082018 | Omission Pale Ale - Gluten Free Pale Ale - 6 Pack | NaN | 89.76 | 135.72 | NaN | 6.0 | 13.0 |
2018-07-15 | Booooze | Alcohol | Liquor | 80432400630 | Glenlivet - 12 Year Scotch Whisky - 750 mL Bottle | NaN | 415.68 | 476.64 | NaN | 12.0 | 12.0 |
2018-07-13 | Booooze | Beer | Stouts | 83783575217 | Sierra Nevada Brewing Co. - Stout - 6 Pack | NaN | 82.67 | 75.74 | NaN | 7.0 | 7.0 |
The new view makes our comparison easier, but now it's a bit cluttered. Internally our multi-level columns are stored as tuples of the name values for each level, so we can easily fix the clutter by flattening the columns into a single level:
def flatten_cols(df: pd.DataFrame, delim: str = ""):
"""Flatten multiple column levels of the DataFrame into a one column level.
Args:
delim: the delimiter between the column values.
Returns:
A copy of the dataframe with the new column names.
"""
new_cols = [delim.join((col_lev for col_lev in tup if col_lev))
for tup in df.columns.values]
ndf = df.copy()
ndf.columns = new_cols
return ndf
flattened_multi_col_df = flatten_cols(multi_col_lvl_df, " | ").head(3)
flattened_multi_col_df
Dollars | Store 1 | Dollars | Store 2 | Dollars | Store 3 | Units | Store 1 | Units | Store 2 | Units | Store 3 | ||||||
---|---|---|---|---|---|---|---|---|---|---|---|
Date | Department | Category | Subcategory | UPC EAN | Description | ||||||
2018-07-10 | Booooze | Alcohol | Liqour | 80480280024 | Grey Goose - Imported Vodka - 750 mL Bottle | NaN | 231.30 | 128.30 | NaN | 9.0 | 5.0 |
Liquor | 80432400630 | Glenlivet - 12 Year Scotch Whisky - 750 mL Bottle | NaN | 172.00 | 278.80 | NaN | 5.0 | 8.0 | |||
674545000001 | Don Julio - Tequila Blanco - 750 mL Bottle | NaN | 168.06 | 204.12 | NaN | 6.0 | 7.0 |
If later we want to undo that flattening it's just as simple:
def unflatten_cols(df: pd.DataFrame, delim: str = ""):
"""Unflatten a single column level into multiple column levels.
Args:
delim: the delimiter to split on to identify the multiple column values.
Returns:
A copy of the dataframe with the new column levels.
"""
new_cols = pd.MultiIndex.from_tuples([tuple(col.split(delim)) for col in df.columns])
ndf = df.copy()
ndf.columns = new_cols
return ndf
unflatten_cols(flattened_multi_col_df, " | ")
Dollars | Units | ||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|
Store 1 | Store 2 | Store 3 | Store 1 | Store 2 | Store 3 | ||||||
Date | Department | Category | Subcategory | UPC EAN | Description | ||||||
2018-07-10 | Booooze | Alcohol | Liqour | 80480280024 | Grey Goose - Imported Vodka - 750 mL Bottle | NaN | 231.30 | 128.30 | NaN | 9.0 | 5.0 |
Liquor | 80432400630 | Glenlivet - 12 Year Scotch Whisky - 750 mL Bottle | NaN | 172.00 | 278.80 | NaN | 5.0 | 8.0 | |||
674545000001 | Don Julio - Tequila Blanco - 750 mL Bottle | NaN | 168.06 | 204.12 | NaN | 6.0 | 7.0 |
We've already seen an example of reading a csv, but what if we want to save our multiindex DataFrame and then be able to reread it? How the stored files need to be accessed and by whom they need to be accessed will determine a lot. If everyone who needs access to the data is Python/Pandas savy, pickling is fast and easy. If, however, other, not-tech-savy people will need to access the data, CSVs are a versitile storage medium. That said, it complicates our read_csv() a bit if we want to just dump out our multiindex. Revisiting our multi_col_lvl_df, where we have a multiindex DataFrame that has both multiple index levels and column levels, creates a difficult situation that requires getting to know as many as all 48 (yes, 48!) of read_csv's parameters.
# Write our multi-column-level df
multi_col_lvl_df.to_csv('multi_col_lvl_output.csv')
# Reading it back in requires the header parameter
read_multi_df = pd.read_csv('multi_col_lvl_output.csv', header=[0, 1], index_col=[0, 1, 2, 3, 4, 5],
skipinitialspace=True, parse_dates=[0]).head(3)
read_multi_df
Dollars | Units | ||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|
Store | Store 1 | Store 2 | Store 3 | Store 1 | Store 2 | Store 3 | |||||
Date | Department | Category | Subcategory | UPC EAN | Description | ||||||
2018-07-10 | Booooze | Alcohol | Liqour | 80480280024 | Grey Goose - Imported Vodka - 750 mL Bottle | NaN | 231.30 | 128.30 | NaN | 9.0 | 5.0 |
Liquor | 80432400630 | Glenlivet - 12 Year Scotch Whisky - 750 mL Bottle | NaN | 172.00 | 278.80 | NaN | 5.0 | 8.0 | |||
674545000001 | Don Julio - Tequila Blanco - 750 mL Bottle | NaN | 168.06 | 204.12 | NaN | 6.0 | 7.0 |
By adding a header parameter to deal with our multiple column levels, on top of index_col for the setting of the index, this import looks good, but it required inspecting the csv for the column and row numbers (that's right - names won't work. Just to be sure everything worked, let's also check our dtypes:
# A function to check our index level dtypes to aid this example
def index_level_dtypes(df):
return [f"{df.index.names[i]}: {df.index.get_level_values(n).dtype}"
for i, n in enumerate(df.index.names)]
index_level_dtypes(read_multi_df)
['Date: datetime64[ns]', 'Department: object', 'Category: object', 'Subcategory: object', 'UPC EAN: int64', 'Description: object']
Looks good. However let's take a step back and look at what would have happened if we'd not parsed the dates and wanted to later change the dtype:
# Reading it back iwithout parse_dates
bad_dtype_df = pd.read_csv('multi_col_lvl_output.csv', header=[0, 1], index_col=[0, 1, 2, 3, 4, 5],
skipinitialspace=True).head(3)
display(bad_dtype_df)
display(index_level_dtypes(bad_dtype_df))
Dollars | Units | ||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|
Store | Store 1 | Store 2 | Store 3 | Store 1 | Store 2 | Store 3 | |||||
Date | Department | Category | Subcategory | UPC EAN | Description | ||||||
2018-07-10 | Booooze | Alcohol | Liqour | 80480280024 | Grey Goose - Imported Vodka - 750 mL Bottle | NaN | 231.30 | 128.30 | NaN | 9.0 | 5.0 |
Liquor | 80432400630 | Glenlivet - 12 Year Scotch Whisky - 750 mL Bottle | NaN | 172.00 | 278.80 | NaN | 5.0 | 8.0 | |||
674545000001 | Don Julio - Tequila Blanco - 750 mL Bottle | NaN | 168.06 | 204.12 | NaN | 6.0 | 7.0 |
['Date: object', 'Department: object', 'Category: object', 'Subcategory: object', 'UPC EAN: int64', 'Description: object']
Updating the dtypes of our index columns isn't going to be so simple, because our MultiIndex levels are immutable. To make any changes to the levels, we actually have to recreate the levels:
bad_dtype_df.index.set_levels([pd.to_datetime(bad_dtype_df.index.levels[0]), bad_dtype_df.index.levels[1],
bad_dtype_df.index.levels[2], bad_dtype_df.index.levels[3],
bad_dtype_df.index.levels[4], bad_dtype_df.index.levels[5]],
inplace=True)
index_level_dtypes(bad_dtype_df)
['Date: datetime64[ns]', 'Department: object', 'Category: object', 'Subcategory: object', 'UPC EAN: int64', 'Description: object']
Alternatively we could reset just the 'Date' level, update its dtype, add it back to our index, and finally reorder our index:
bad_dtype_df2 = bad_dtype_df.reset_index(level='Date')
bad_dtype_df2['Date'] = pd.to_datetime(bad_dtype_df2['Date'],infer_datetime_format=True)
bad_dtype_df2.set_index('Date', append=True, inplace=True)
bad_dtype_df2 = (bad_dtype_df2.swaplevel('Date', 'Description')
.swaplevel('Date', 'UPC EAN')
.swaplevel('Date', 'Subcategory')
.swaplevel('Date', 'Category')
.swaplevel('Date', 'Department'))
index_level_dtypes(bad_dtype_df2)
['Date: datetime64[ns]', 'Department: object', 'Category: object', 'Subcategory: object', 'UPC EAN: int64', 'Description: object']
The moral of the story is, if you want to export and import complex multiindex DataFrames, learn read_csv's parameters well! The second moral is that, either of the above situations results in an awful lot of work.
One alternative if the readability of the CSV by other users is important, is to simply reset the column levels and index levels before we write to a CSV and recreate them when we import. While it requires more steps and information is lost while in CSV form, it makes the code manipulations a lot easier:
# a) restack the column levels,
# b) drop any blanks the unstacking created, and then
# c) reset the index so everything is a flat column again
# d) output to csv
multi_col_lvl_df.stack().dropna().reset_index().to_csv('index_removed_output.csv')
# Reading it back in will require
read_df = pd.read_csv("data.csv", index_col=[0, 1, 2, 3, 4, 5], skipinitialspace=True, parse_dates=['Date'])
read_df.unstack('Store').head(3)
Dollars | Units | |||||||||
---|---|---|---|---|---|---|---|---|---|---|
Store | Store 1 | Store 2 | Store 3 | Store 1 | Store 2 | Store 3 | ||||
Date | Category | Subcategory | UPC EAN | Description | ||||||
2018-07-10 | Alcohol | Liqour | 80480280024 | Grey Goose - Imported Vodka - 750 mL Bottle | NaN | 231.30 | 128.30 | NaN | 9.0 | 5.0 |
Liquor | 80432400630 | Glenlivet - 12 Year Scotch Whisky - 750 mL Bottle | NaN | 172.00 | 278.80 | NaN | 5.0 | 8.0 | ||
674545000001 | Don Julio - Tequila Blanco - 750 mL Bottle | NaN | 168.06 | 204.12 | NaN | 6.0 | 7.0 |
Still takes some work, but the code's a lot more straightforward.
To this point we've really just been learning the basics of setting up our multiindex DataFrames, with a few neat tricks along the way. Now comes the fun part - actually interacting with those DataFrames to analyze our data.
Math operations are as nearly as easy with multiindex DataFrames as with regular ones, but a whole lot more powerful. Let's say we want to calculate dollars per unit for each of our stores. That's as simple as:
dollars_per_unit = multi_col_lvl_df['Dollars'] / multi_col_lvl_df['Units']
dollars_per_unit.sample(10)
Store | Store 1 | Store 2 | Store 3 | |||||
---|---|---|---|---|---|---|---|---|
Date | Department | Category | Subcategory | UPC EAN | Description | |||
2018-07-11 | Booooze | Alcohol | Liquor | 80432400630 | Glenlivet - 12 Year Scotch Whisky - 750 mL Bottle | NaN | 33.62 | 39.85 |
2018-07-16 | Booooze | Alcohol | Liquor | 674545000001 | Don Julio - Tequila Blanco - 750 mL Bottle | NaN | 32.23 | 31.37 |
2018-07-10 | Booooze | Alcohol | Liquor | 674545000001 | Don Julio - Tequila Blanco - 750 mL Bottle | NaN | 28.01 | 29.16 |
Wine | White | 81248501095 | Curious Beasts - Chardonnay - 750 mL Bottle | NaN | 20.94 | 19.90 | ||
Red | 89744402585 | Trapiche - Malbec - 750 mL Bottle | 7.11 | NaN | 6.30 | |||
Beer | Ales | 702770082018 | Omission Pale Ale - Gluten Free Pale Ale - 6 Pack | NaN | 14.96 | 10.44 | ||
Alcohol | Liqour | 80480280024 | Grey Goose - Imported Vodka - 750 mL Bottle | NaN | 25.70 | 25.66 | ||
2018-07-16 | Booooze | Wine | Red | 492130008399 | Yellow Tail - Merlot - 750 mL Bottle | 4.89 | NaN | 7.76 |
2018-07-12 | Booooze | Beer | Stouts | 83783575217 | Sierra Nevada Brewing Co. - Stout - 6 Pack | NaN | 10.61 | 7.81 |
2018-07-16 | Booooze | Wine | Red | 89744402585 | Trapiche - Malbec - 750 mL Bottle | 9.01 | NaN | 8.27 |
We just operated on all the pairwise sub column levels at the same time! Now we'd like this answer as columns back in our original DataFrame:
# Add a column level for our new measure
dollars_per_unit.columns = pd.MultiIndex.from_product([['Dollars per Unit'], dollars_per_unit.columns])
# Concat it with our original data
pd.concat([multi_col_lvl_df, dollars_per_unit], axis='columns').head(3)
Dollars | Units | Dollars per Unit | ||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Store | Store 1 | Store 2 | Store 3 | Store 1 | Store 2 | Store 3 | Store 1 | Store 2 | Store 3 | |||||
Date | Department | Category | Subcategory | UPC EAN | Description | |||||||||
2018-07-10 | Booooze | Alcohol | Liqour | 80480280024 | Grey Goose - Imported Vodka - 750 mL Bottle | NaN | 231.30 | 128.30 | NaN | 9.0 | 5.0 | NaN | 25.70 | 25.66 |
Liquor | 80432400630 | Glenlivet - 12 Year Scotch Whisky - 750 mL Bottle | NaN | 172.00 | 278.80 | NaN | 5.0 | 8.0 | NaN | 34.40 | 34.85 | |||
674545000001 | Don Julio - Tequila Blanco - 750 mL Bottle | NaN | 168.06 | 204.12 | NaN | 6.0 | 7.0 | NaN | 28.01 | 29.16 |
We can similarly apply functions to our multiindex DataFrame:
# Change our units to 000s for funsies
multi_col_lvl_df.applymap(lambda x: np.nan if np.isnan(x) else str(round(x/1000, 2)) + "k").head(10)
Dollars | Units | ||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|
Store | Store 1 | Store 2 | Store 3 | Store 1 | Store 2 | Store 3 | |||||
Date | Department | Category | Subcategory | UPC EAN | Description | ||||||
2018-07-10 | Booooze | Alcohol | Liqour | 80480280024 | Grey Goose - Imported Vodka - 750 mL Bottle | NaN | 0.23k | 0.13k | NaN | 0.01k | 0.01k |
Liquor | 80432400630 | Glenlivet - 12 Year Scotch Whisky - 750 mL Bottle | NaN | 0.17k | 0.28k | NaN | 0.01k | 0.01k | |||
674545000001 | Don Julio - Tequila Blanco - 750 mL Bottle | NaN | 0.17k | 0.2k | NaN | 0.01k | 0.01k | ||||
Beer | Ales | 94922755711 | Bowser Beer - Beefy Brown Ale - 6 Pack | NaN | 0.08k | 0.06k | NaN | 0.01k | 0.01k | ||
702770082018 | Omission Pale Ale - Gluten Free Pale Ale - 6 Pack | NaN | 0.09k | 0.14k | NaN | 0.01k | 0.01k | ||||
736920111112 | Goose Island - Honkers Ale - 6 Pack | NaN | 0.08k | 0.07k | NaN | 0.01k | 0.01k | ||||
Lagers | 702770081011 | Omission - Gluten Free Lager - 6 Pack | 0.14k | NaN | 0.18k | 0.01k | NaN | 0.01k | |||
Stouts | 83783575217 | Sierra Nevada Brewing Co. - Stout - 6 Pack | NaN | 0.08k | 0.12k | NaN | 0.01k | 0.01k | |||
Wine | Red | 89744402585 | Trapiche - Malbec - 750 mL Bottle | 0.06k | NaN | 0.04k | 0.01k | NaN | 0.01k | ||
492130008399 | Yellow Tail - Merlot - 750 mL Bottle | 0.05k | NaN | 0.03k | 0.01k | NaN | 0.01k |
Before we start slicing and filtering, it's important to sort out data to drastically improve efficiency. In some cases pandas will begin sorting for you by default in the latest versions, but other times you'll get an ugly and confusing lexsort warning (referencing the default lexographic sorting order).
Sorting indexes works the exactly the same way for multiindex DataFrames as with regular DataFrames, but with some extra parameters to decide on:
# By default sort_index will sort all levels of the index,
# first sorting the first index level, then secondarily sorting the second index level, and so on...
sort1 = multi_col_lvl_df.copy()
sort1.sort_index(inplace=True)
print("\nSort by Date, then Department, then Category, etc.:")
display(sort1.head(10))
# ...but you can choose to starting from a different level...
sort2 = multi_col_lvl_df.copy()
sort2.sort_index(level='Category', inplace=True)
print("\nSort by Category, then Subcategory, then UPC EAN, then Description, then Date, then Deparment:")
display(sort2.head(10))
# ...or even to only sort on specifc levels.
sort3 = multi_col_lvl_df.copy()
sort3.sort_index(level=['Category', 'Date'], sort_remaining=False, inplace=True)
print("\nSort by Category, then Date only:")
display(sort3.head(10))
Sort by Date, then Department, then Category, etc.:
Dollars | Units | ||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|
Store | Store 1 | Store 2 | Store 3 | Store 1 | Store 2 | Store 3 | |||||
Date | Department | Category | Subcategory | UPC EAN | Description | ||||||
2018-07-10 | Booooze | Alcohol | Liqour | 80480280024 | Grey Goose - Imported Vodka - 750 mL Bottle | NaN | 231.30 | 128.30 | NaN | 9.0 | 5.0 |
Liquor | 80432400630 | Glenlivet - 12 Year Scotch Whisky - 750 mL Bottle | NaN | 172.00 | 278.80 | NaN | 5.0 | 8.0 | |||
674545000001 | Don Julio - Tequila Blanco - 750 mL Bottle | NaN | 168.06 | 204.12 | NaN | 6.0 | 7.0 | ||||
Beer | Ales | 94922755711 | Bowser Beer - Beefy Brown Ale - 6 Pack | NaN | 81.76 | 62.48 | NaN | 8.0 | 8.0 | ||
702770082018 | Omission Pale Ale - Gluten Free Pale Ale - 6 Pack | NaN | 89.76 | 135.72 | NaN | 6.0 | 13.0 | ||||
736920111112 | Goose Island - Honkers Ale - 6 Pack | NaN | 80.01 | 67.36 | NaN | 9.0 | 8.0 | ||||
Lagers | 702770081011 | Omission - Gluten Free Lager - 6 Pack | 137.67 | NaN | 176.96 | 13.0 | NaN | 14.0 | |||
Stouts | 83783575217 | Sierra Nevada Brewing Co. - Stout - 6 Pack | NaN | 78.40 | 119.90 | NaN | 7.0 | 10.0 | |||
Wine | Red | 89744402585 | Trapiche - Malbec - 750 mL Bottle | 56.88 | NaN | 37.80 | 8.0 | NaN | 6.0 | ||
492130008399 | Yellow Tail - Merlot - 750 mL Bottle | 46.30 | NaN | 34.56 | 10.0 | NaN | 6.0 |
Sort by Category, then Subcategory, then UPC EAN, then Description, then Date, then Deparment:
Dollars | Units | ||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|
Store | Store 1 | Store 2 | Store 3 | Store 1 | Store 2 | Store 3 | |||||
Date | Department | Category | Subcategory | UPC EAN | Description | ||||||
2018-07-10 | Booooze | Alcohol | Liqour | 80480280024 | Grey Goose - Imported Vodka - 750 mL Bottle | NaN | 231.30 | 128.30 | NaN | 9.0 | 5.0 |
Liquor | 80432400630 | Glenlivet - 12 Year Scotch Whisky - 750 mL Bottle | NaN | 172.00 | 278.80 | NaN | 5.0 | 8.0 | |||
674545000001 | Don Julio - Tequila Blanco - 750 mL Bottle | NaN | 168.06 | 204.12 | NaN | 6.0 | 7.0 | ||||
2018-07-11 | Booooze | Alcohol | Liqour | 80480280024 | Grey Goose - Imported Vodka - 750 mL Bottle | NaN | 248.80 | 296.01 | NaN | 10.0 | 11.0 |
Liquor | 80432400630 | Glenlivet - 12 Year Scotch Whisky - 750 mL Bottle | NaN | 201.72 | 318.80 | NaN | 6.0 | 8.0 | |||
674545000001 | Don Julio - Tequila Blanco - 750 mL Bottle | NaN | 338.10 | 102.04 | NaN | 10.0 | 4.0 | ||||
2018-07-12 | Booooze | Alcohol | Liqour | 80480280024 | Grey Goose - Imported Vodka - 750 mL Bottle | NaN | 279.36 | 228.08 | NaN | 12.0 | 8.0 |
Liquor | 80432400630 | Glenlivet - 12 Year Scotch Whisky - 750 mL Bottle | NaN | 250.25 | 455.13 | NaN | 7.0 | 13.0 | |||
674545000001 | Don Julio - Tequila Blanco - 750 mL Bottle | NaN | 410.88 | 268.56 | NaN | 12.0 | 9.0 | ||||
2018-07-13 | Booooze | Alcohol | Liquor | 80432400630 | Glenlivet - 12 Year Scotch Whisky - 750 mL Bottle | NaN | 432.74 | 185.75 | NaN | 11.0 | 5.0 |
Sort by Category, then Date only:
Dollars | Units | ||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|
Store | Store 1 | Store 2 | Store 3 | Store 1 | Store 2 | Store 3 | |||||
Date | Department | Category | Subcategory | UPC EAN | Description | ||||||
2018-07-10 | Booooze | Alcohol | Liquor | 674545000001 | Don Julio - Tequila Blanco - 750 mL Bottle | NaN | 168.06 | 204.12 | NaN | 6.0 | 7.0 |
2018-07-11 | Booooze | Alcohol | Liquor | 674545000001 | Don Julio - Tequila Blanco - 750 mL Bottle | NaN | 338.10 | 102.04 | NaN | 10.0 | 4.0 |
2018-07-12 | Booooze | Alcohol | Liquor | 674545000001 | Don Julio - Tequila Blanco - 750 mL Bottle | NaN | 410.88 | 268.56 | NaN | 12.0 | 9.0 |
2018-07-15 | Booooze | Alcohol | Liquor | 674545000001 | Don Julio - Tequila Blanco - 750 mL Bottle | NaN | 177.00 | 430.50 | NaN | 6.0 | 14.0 |
2018-07-16 | Booooze | Alcohol | Liquor | 674545000001 | Don Julio - Tequila Blanco - 750 mL Bottle | NaN | 161.15 | 345.07 | NaN | 5.0 | 11.0 |
2018-07-10 | Booooze | Alcohol | Liquor | 80432400630 | Glenlivet - 12 Year Scotch Whisky - 750 mL Bottle | NaN | 172.00 | 278.80 | NaN | 5.0 | 8.0 |
2018-07-11 | Booooze | Alcohol | Liquor | 80432400630 | Glenlivet - 12 Year Scotch Whisky - 750 mL Bottle | NaN | 201.72 | 318.80 | NaN | 6.0 | 8.0 |
2018-07-12 | Booooze | Alcohol | Liquor | 80432400630 | Glenlivet - 12 Year Scotch Whisky - 750 mL Bottle | NaN | 250.25 | 455.13 | NaN | 7.0 | 13.0 |
2018-07-13 | Booooze | Alcohol | Liquor | 80432400630 | Glenlivet - 12 Year Scotch Whisky - 750 mL Bottle | NaN | 432.74 | 185.75 | NaN | 11.0 | 5.0 |
2018-07-14 | Booooze | Alcohol | Liquor | 80432400630 | Glenlivet - 12 Year Scotch Whisky - 750 mL Bottle | NaN | 227.82 | 268.52 | NaN | 6.0 | 7.0 |
With the sorting out of the way we can begin searching our data by specific criteria. There is a fantastic guide about understanding SettingwithCopyWarnings in Pandas, so if you're unclear on the difference between a view and a copy, I suggest giving that a read first.
Readers who have used Pandas for slicing previously will know that .loc is generally the preferred method of referencing cells in a DataFrame. The same type of syntax exists for multiindex Dataframes:
# A slicing helper. Works similarly to slicing in Python (e.g. list slicing),
# but is inclusive of both the start and stop values.
idx = pd.IndexSlice
# View rows with a Category of Beer, but any Date, Department, Subcategory, UPC EAN, or Description
# Only looking at Dollars columns
print("View only rows in the 'Dollars' columns where the Category is 'Beer':")
display(sort3.loc[idx[:, :, 'Beer'], 'Dollars':'Dollars'].head(10))
# If we just want to look at the Store 1 sub-column
print("\nView only rows in the 'Dollars' AND 'Store 1' column where the Category is 'Beer':")
display(sort3.loc[idx[:, :, 'Beer'], idx['Dollars', 'Store 1':'Store 1']].head(10))
View only rows in the 'Dollars' columns where the Category is 'Beer':
Dollars | ||||||||
---|---|---|---|---|---|---|---|---|
Store | Store 1 | Store 2 | Store 3 | |||||
Date | Department | Category | Subcategory | UPC EAN | Description | |||
2018-07-10 | Booooze | Beer | Ales | 94922755711 | Bowser Beer - Beefy Brown Ale - 6 Pack | NaN | 81.76 | 62.48 |
2018-07-11 | Booooze | Beer | Ales | 94922755711 | Bowser Beer - Beefy Brown Ale - 6 Pack | NaN | 104.91 | 72.72 |
2018-07-14 | Booooze | Beer | Ales | 94922755711 | Bowser Beer - Beefy Brown Ale - 6 Pack | NaN | 124.52 | 113.68 |
2018-07-15 | Booooze | Beer | Ales | 94922755711 | Bowser Beer - Beefy Brown Ale - 6 Pack | NaN | 88.77 | 119.47 |
2018-07-16 | Booooze | Beer | Ales | 94922755711 | Bowser Beer - Beefy Brown Ale - 6 Pack | NaN | 99.40 | 46.65 |
2018-07-11 | Booooze | Beer | Malts | 71250000140 | Colt 45 - Malt Liquor - 6 Pack | NaN | 101.79 | 107.58 |
2018-07-13 | Booooze | Beer | Malts | 71250000140 | Colt 45 - Malt Liquor - 6 Pack | NaN | 103.59 | 142.44 |
2018-07-15 | Booooze | Beer | Malts | 71250000140 | Colt 45 - Malt Liquor - 6 Pack | NaN | 139.23 | 40.50 |
2018-07-16 | Booooze | Beer | Malts | 71250000140 | Colt 45 - Malt Liquor - 6 Pack | NaN | 47.55 | 108.48 |
2018-07-10 | Booooze | Beer | Ales | 736920111112 | Goose Island - Honkers Ale - 6 Pack | NaN | 80.01 | 67.36 |
View only rows in the 'Dollars' AND 'Store 1' column where the Category is 'Beer':
Dollars | ||||||
---|---|---|---|---|---|---|
Store | Store 1 | |||||
Date | Department | Category | Subcategory | UPC EAN | Description | |
2018-07-10 | Booooze | Beer | Ales | 94922755711 | Bowser Beer - Beefy Brown Ale - 6 Pack | NaN |
2018-07-11 | Booooze | Beer | Ales | 94922755711 | Bowser Beer - Beefy Brown Ale - 6 Pack | NaN |
2018-07-14 | Booooze | Beer | Ales | 94922755711 | Bowser Beer - Beefy Brown Ale - 6 Pack | NaN |
2018-07-15 | Booooze | Beer | Ales | 94922755711 | Bowser Beer - Beefy Brown Ale - 6 Pack | NaN |
2018-07-16 | Booooze | Beer | Ales | 94922755711 | Bowser Beer - Beefy Brown Ale - 6 Pack | NaN |
2018-07-11 | Booooze | Beer | Malts | 71250000140 | Colt 45 - Malt Liquor - 6 Pack | NaN |
2018-07-13 | Booooze | Beer | Malts | 71250000140 | Colt 45 - Malt Liquor - 6 Pack | NaN |
2018-07-15 | Booooze | Beer | Malts | 71250000140 | Colt 45 - Malt Liquor - 6 Pack | NaN |
2018-07-16 | Booooze | Beer | Malts | 71250000140 | Colt 45 - Malt Liquor - 6 Pack | NaN |
2018-07-10 | Booooze | Beer | Ales | 736920111112 | Goose Island - Honkers Ale - 6 Pack | NaN |
Unfortunately the .loc syntax doesn't scale well to having many index levels. In order to select a values from a single level, .loc requires specifying ':' for every level that comes before. For the column names to show up for a single column you also need to use
Instead, it is more common (and practical) to use df.xs (cross sections). df.xs allows us to filter our DataFrame to only those rows that match levels of our index we specify. We can choose an individual level or multiple levels easily:
# We provide the level value, which by default searches the first index level ('Date')
display(multi_col_lvl_df.xs('2018-07-12').head(10))
# Here we're more specific because we want to search 'Category instead'
display(multi_col_lvl_df.xs('Wine', level='Category').head(10))
# To search for rows that match all of our level value requirements, use tuples
display(multi_col_lvl_df.xs(('2018-07-10', 'Booooze', 'Wine'), level=['Date', 'Department', 'Category']).head(10))
# Note that chaining .xs works, but is significantly less efficient
display(multi_col_lvl_df.xs('2018-07-10', level='Date')
.xs('Booooze', level='Department')
.xs('Wine', level='Category')
.head(10))
# We can then use the column value filtering we're used to to filter rows as well
cross_section_df = multi_col_lvl_df.xs(('2018-07-10', 'Booooze'), level=['Date', 'Department'])
display(cross_section_df[cross_section_df['Dollars']['Store 1'] > 0].head(10))
Dollars | Units | |||||||||
---|---|---|---|---|---|---|---|---|---|---|
Store | Store 1 | Store 2 | Store 3 | Store 1 | Store 2 | Store 3 | ||||
Department | Category | Subcategory | UPC EAN | Description | ||||||
Booooze | Alcohol | Liqour | 80480280024 | Grey Goose - Imported Vodka - 750 mL Bottle | NaN | 279.36 | 228.08 | NaN | 12.0 | 8.0 |
Liquor | 80432400630 | Glenlivet - 12 Year Scotch Whisky - 750 mL Bottle | NaN | 250.25 | 455.13 | NaN | 7.0 | 13.0 | ||
674545000001 | Don Julio - Tequila Blanco - 750 mL Bottle | NaN | 410.88 | 268.56 | NaN | 12.0 | 9.0 | |||
Beer | Ales | 702770082018 | Omission Pale Ale - Gluten Free Pale Ale - 6 Pack | NaN | 109.20 | 56.76 | NaN | 8.0 | 6.0 | |
736920111112 | Goose Island - Honkers Ale - 6 Pack | NaN | 37.75 | 78.30 | NaN | 5.0 | 10.0 | |||
Lagers | 702770081011 | Omission - Gluten Free Lager - 6 Pack | 40.12 | NaN | 93.10 | 4.0 | NaN | 10.0 | ||
Stouts | 83783575217 | Sierra Nevada Brewing Co. - Stout - 6 Pack | NaN | 116.71 | 109.34 | NaN | 11.0 | 14.0 | ||
Wine | Red | 89744402585 | Trapiche - Malbec - 750 mL Bottle | 60.24 | NaN | 69.29 | 8.0 | NaN | 13.0 | |
492130008399 | Yellow Tail - Merlot - 750 mL Bottle | 95.16 | NaN | 75.60 | 12.0 | NaN | 10.0 | |||
Rose | 85000024218 | Dark Horse - S. Blanc Rose - 750 mL Bottle | 93.94 | 83.82 | NaN | 7.0 | 6.0 | NaN |
Dollars | Units | |||||||||
---|---|---|---|---|---|---|---|---|---|---|
Store | Store 1 | Store 2 | Store 3 | Store 1 | Store 2 | Store 3 | ||||
Date | Department | Subcategory | UPC EAN | Description | ||||||
2018-07-10 | Booooze | Red | 89744402585 | Trapiche - Malbec - 750 mL Bottle | 56.88 | NaN | 37.80 | 8.0 | NaN | 6.0 |
492130008399 | Yellow Tail - Merlot - 750 mL Bottle | 46.30 | NaN | 34.56 | 10.0 | NaN | 6.0 | |||
Rose | 85000024218 | Dark Horse - S. Blanc Rose - 750 mL Bottle | 148.61 | 100.38 | NaN | 11.0 | 7.0 | NaN | ||
White | 81248501095 | Curious Beasts - Chardonnay - 750 mL Bottle | NaN | 293.16 | 278.60 | NaN | 14.0 | 14.0 | ||
2018-07-11 | Booooze | Red | 89744402585 | Trapiche - Malbec - 750 mL Bottle | 50.40 | NaN | 47.25 | 6.0 | NaN | 7.0 |
Rose | 85000024218 | Dark Horse - S. Blanc Rose - 750 mL Bottle | 52.24 | 173.03 | NaN | 4.0 | 13.0 | NaN | ||
White | 81248501095 | Curious Beasts - Chardonnay - 750 mL Bottle | NaN | 237.16 | 247.52 | NaN | 11.0 | 13.0 | ||
2018-07-12 | Booooze | Red | 89744402585 | Trapiche - Malbec - 750 mL Bottle | 60.24 | NaN | 69.29 | 8.0 | NaN | 13.0 |
492130008399 | Yellow Tail - Merlot - 750 mL Bottle | 95.16 | NaN | 75.60 | 12.0 | NaN | 10.0 | |||
Rose | 85000024218 | Dark Horse - S. Blanc Rose - 750 mL Bottle | 93.94 | 83.82 | NaN | 7.0 | 6.0 | NaN |
Dollars | Units | |||||||
---|---|---|---|---|---|---|---|---|
Store | Store 1 | Store 2 | Store 3 | Store 1 | Store 2 | Store 3 | ||
Subcategory | UPC EAN | Description | ||||||
Red | 89744402585 | Trapiche - Malbec - 750 mL Bottle | 56.88 | NaN | 37.80 | 8.0 | NaN | 6.0 |
492130008399 | Yellow Tail - Merlot - 750 mL Bottle | 46.30 | NaN | 34.56 | 10.0 | NaN | 6.0 | |
Rose | 85000024218 | Dark Horse - S. Blanc Rose - 750 mL Bottle | 148.61 | 100.38 | NaN | 11.0 | 7.0 | NaN |
White | 81248501095 | Curious Beasts - Chardonnay - 750 mL Bottle | NaN | 293.16 | 278.60 | NaN | 14.0 | 14.0 |
Dollars | Units | |||||||
---|---|---|---|---|---|---|---|---|
Store | Store 1 | Store 2 | Store 3 | Store 1 | Store 2 | Store 3 | ||
Subcategory | UPC EAN | Description | ||||||
Red | 89744402585 | Trapiche - Malbec - 750 mL Bottle | 56.88 | NaN | 37.80 | 8.0 | NaN | 6.0 |
492130008399 | Yellow Tail - Merlot - 750 mL Bottle | 46.30 | NaN | 34.56 | 10.0 | NaN | 6.0 | |
Rose | 85000024218 | Dark Horse - S. Blanc Rose - 750 mL Bottle | 148.61 | 100.38 | NaN | 11.0 | 7.0 | NaN |
White | 81248501095 | Curious Beasts - Chardonnay - 750 mL Bottle | NaN | 293.16 | 278.60 | NaN | 14.0 | 14.0 |
Dollars | Units | ||||||||
---|---|---|---|---|---|---|---|---|---|
Store | Store 1 | Store 2 | Store 3 | Store 1 | Store 2 | Store 3 | |||
Category | Subcategory | UPC EAN | Description | ||||||
Beer | Lagers | 702770081011 | Omission - Gluten Free Lager - 6 Pack | 137.67 | NaN | 176.96 | 13.0 | NaN | 14.0 |
Wine | Red | 89744402585 | Trapiche - Malbec - 750 mL Bottle | 56.88 | NaN | 37.80 | 8.0 | NaN | 6.0 |
492130008399 | Yellow Tail - Merlot - 750 mL Bottle | 46.30 | NaN | 34.56 | 10.0 | NaN | 6.0 | ||
Rose | 85000024218 | Dark Horse - S. Blanc Rose - 750 mL Bottle | 148.61 | 100.38 | NaN | 11.0 | 7.0 | NaN |
That's a whole lot nicer of syntax than the .loc method, but it's still relatively verbose and requires separating out or index and column value filtering. Multiindex DataFrames also offer a lesser known method, df.query. Query is powerful for a few reasons:
The documentation has excellent notes about the method and a nice usage guide that are definitely worth a read but a couple of examples using our dataset:
multi_col_lvl_df.query("Date == '2018-07-10' and Category == 'Wine'")
Dollars | Units | ||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|
Store | Store 1 | Store 2 | Store 3 | Store 1 | Store 2 | Store 3 | |||||
Date | Department | Category | Subcategory | UPC EAN | Description | ||||||
2018-07-10 | Booooze | Wine | Red | 89744402585 | Trapiche - Malbec - 750 mL Bottle | 56.88 | NaN | 37.80 | 8.0 | NaN | 6.0 |
492130008399 | Yellow Tail - Merlot - 750 mL Bottle | 46.30 | NaN | 34.56 | 10.0 | NaN | 6.0 | ||||
Rose | 85000024218 | Dark Horse - S. Blanc Rose - 750 mL Bottle | 148.61 | 100.38 | NaN | 11.0 | 7.0 | NaN | |||
White | 81248501095 | Curious Beasts - Chardonnay - 750 mL Bottle | NaN | 293.16 | 278.60 | NaN | 14.0 | 14.0 |
So why doesn't everyone always use df.query? It has two big drawbacks:
However, if you're used to working with SQL databases and/or willing to work within the above limitations, this is a powerful and fast option.
Now that we've learned a few ways to slice, filter, and query our data, what if we want to alter values in our dataframe? We can take advantage of the .loc method to replace values in place without creating a copy or, if temporarily copying our data isn't an issue (i.e. too big for memory) we can use the df.xs method with df.update:
updated_df = multi_col_lvl_df.copy()
updated_df.loc[idx[:, :, 'Beer'], idx['Dollars', 'Store 1':'Store 1']] = "We changed this"
updated_df.head(10)
Dollars | Units | ||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|
Store | Store 1 | Store 2 | Store 3 | Store 1 | Store 2 | Store 3 | |||||
Date | Department | Category | Subcategory | UPC EAN | Description | ||||||
2018-07-10 | Booooze | Alcohol | Liqour | 80480280024 | Grey Goose - Imported Vodka - 750 mL Bottle | NaN | 231.30 | 128.30 | NaN | 9.0 | 5.0 |
Liquor | 80432400630 | Glenlivet - 12 Year Scotch Whisky - 750 mL Bottle | NaN | 172.00 | 278.80 | NaN | 5.0 | 8.0 | |||
674545000001 | Don Julio - Tequila Blanco - 750 mL Bottle | NaN | 168.06 | 204.12 | NaN | 6.0 | 7.0 | ||||
Beer | Ales | 94922755711 | Bowser Beer - Beefy Brown Ale - 6 Pack | We changed this | 81.76 | 62.48 | NaN | 8.0 | 8.0 | ||
702770082018 | Omission Pale Ale - Gluten Free Pale Ale - 6 Pack | We changed this | 89.76 | 135.72 | NaN | 6.0 | 13.0 | ||||
736920111112 | Goose Island - Honkers Ale - 6 Pack | We changed this | 80.01 | 67.36 | NaN | 9.0 | 8.0 | ||||
Lagers | 702770081011 | Omission - Gluten Free Lager - 6 Pack | We changed this | NaN | 176.96 | 13.0 | NaN | 14.0 | |||
Stouts | 83783575217 | Sierra Nevada Brewing Co. - Stout - 6 Pack | We changed this | 78.40 | 119.90 | NaN | 7.0 | 10.0 | |||
Wine | Red | 89744402585 | Trapiche - Malbec - 750 mL Bottle | 56.88 | NaN | 37.80 | 8.0 | NaN | 6.0 | ||
492130008399 | Yellow Tail - Merlot - 750 mL Bottle | 46.3 | NaN | 34.56 | 10.0 | NaN | 6.0 |
The above syntax works well for immutable values, but you'll get an error if you try the same thing for a mutable value, like a list. To my knowledge the only way to set values in a DataFrame to a mutable object is one cell at a time, so doing multiple replacements requires iterrows:
# # Note: If the column we're changing's type isn't already object,
# # we need to change it or the value relpacements below will error.
# updated_df['Dollars']['Store 1'] = updated_df['Dollars']['Store 1'].astype(object)
# # Loop through rows, replacing single values
# # Only necessary if the new assigned value is mutable
# # Code below currently not working when there are multiple column levels, but works with one column level
# for index, row in updated_df.loc[idx[:, :, 'Beer'], idx['Dollars', 'Store 1':'Store 1']].iterrows():
# updated_df.at[index, idx['Dollars', 'Store 1':'Store 1']] = ["We", "changed", "this"]
# updated_df.head(3)
Alternatively, using df.xs and then df.update instead of the full .loc is a little bit more understandable in my option. For example, the following code does the same as the code two cells above (except that it requires a copy):
updated_df = multi_col_lvl_df.copy()
df2 = updated_df.xs('Beer', level='Category', drop_level=False).copy() # .copy() is to avoid SettingwithCopyWarning
df2[idx['Dollars', 'Store 1']] = "We ALSO changed this"
updated_df.update(df2, join="left", overwrite=True, filter_func=None, raise_conflict=False)
updated_df.head(10)
Dollars | Units | ||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|
Store | Store 1 | Store 2 | Store 3 | Store 1 | Store 2 | Store 3 | |||||
Date | Department | Category | Subcategory | UPC EAN | Description | ||||||
2018-07-10 | Booooze | Alcohol | Liqour | 80480280024 | Grey Goose - Imported Vodka - 750 mL Bottle | NaN | 231.30 | 128.30 | NaN | 9.0 | 5.0 |
Liquor | 80432400630 | Glenlivet - 12 Year Scotch Whisky - 750 mL Bottle | NaN | 172.00 | 278.80 | NaN | 5.0 | 8.0 | |||
674545000001 | Don Julio - Tequila Blanco - 750 mL Bottle | NaN | 168.06 | 204.12 | NaN | 6.0 | 7.0 | ||||
Beer | Ales | 94922755711 | Bowser Beer - Beefy Brown Ale - 6 Pack | We ALSO changed this | 81.76 | 62.48 | NaN | 8.0 | 8.0 | ||
702770082018 | Omission Pale Ale - Gluten Free Pale Ale - 6 Pack | We ALSO changed this | 89.76 | 135.72 | NaN | 6.0 | 13.0 | ||||
736920111112 | Goose Island - Honkers Ale - 6 Pack | We ALSO changed this | 80.01 | 67.36 | NaN | 9.0 | 8.0 | ||||
Lagers | 702770081011 | Omission - Gluten Free Lager - 6 Pack | We ALSO changed this | NaN | 176.96 | 13.0 | NaN | 14.0 | |||
Stouts | 83783575217 | Sierra Nevada Brewing Co. - Stout - 6 Pack | We ALSO changed this | 78.40 | 119.90 | NaN | 7.0 | 10.0 | |||
Wine | Red | 89744402585 | Trapiche - Malbec - 750 mL Bottle | 56.88 | NaN | 37.80 | 8.0 | NaN | 6.0 | ||
492130008399 | Yellow Tail - Merlot - 750 mL Bottle | 46.3 | NaN | 34.56 | 10.0 | NaN | 6.0 |
pd.set_option('display.multi_sparse', True)
Hopefully this guide has been a gentle and practical tutorial on multiindex DataFrames in Pandas. If you find any errors, material ommissions, or topics that could just be explained more clearly, please submit leave a comment or better yet, a pull request!
To continue your learning, find a list of my top picks for Pandas resources - generally and specifically for MultiIndexes. Of particular note is the Official list of available methods for MultiIndexes, which has a full listing of all the available methods (many not listed here). This link is specifically for Pandas version 0.22, because, for some reason, the list in future versions excludes a lot of still valuable methods! What's more, this list provides the ONLY explanation for some methods, since many MultiIndex methods lack even the most basic of docstrings.
Official MultiIndex References
Other MultiIndex Tutorials
General Pandas Tutorials
Pandas Operations and Efficiency