#!/usr/bin/env python
# coding: utf-8

# # Pandas MultiIndex Tutorial

# # What's Pandas?
# Pandas is one of the most-used open source libraries for importing and analyzing data available in Python today. It provides convenient ways to import, view, split, apply, and combine array-like data. And not just convenient, but efficient, too. For example, Pandas' read_csv and to_csv functions are so efficient that the library is often imported just for this task instead of relying on the standard library alternative!
#
# The core value of the library, however, comes through several data structure options, primarily Series (for labeled, homogeneously-typed, one-dimensional arrays) and DataFrames (for labeled, potentially heterogeneously-typed, two-dimensional arrays).

# # What's a DataFrame?
# DataFrames are two-dimensional, labeled data structures, with columns of potentially different types. You can think of one like a spreadsheet, a SQL table, or a dict of Series objects. It is the most commonly used Pandas object. Like Series, DataFrame accepts many different kinds of input, including dicts, lists, lists of lists, Series, numpy arrays, other DataFrames, external data from CSVs, etc.
#
# The DataFrame has two core parts: an index (row labels; they look like columns, but aren't) and columns (data with headers).

# # What's an index?
# The index of a DataFrame consists of a label for each row. To be helpful, those labels should be not just unique (Pandas doesn't enforce uniqueness, but many operations assume it), but also meaningful. By default, if no index is provided, the index will be a numbered range, starting from 0 (known as a range index).
#
# A more meaningful index, however, would be something that uniquely describes each row of your data in a way that will help you look things up. For example, in a list of transactions, the date-time might be most useful. Alternatively, in a grade book for a math class, the student's name might be most useful.
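# To make that concrete, here is a minimal sketch (using a made-up grade book, not this tutorial's demo data) contrasting the default range index with a meaningful one:

```python
import pandas as pd

# Hypothetical grade-book data, for illustration only
grades = pd.DataFrame({"Student": ["Ada", "Blaise", "Carl"],
                       "Score": [98, 91, 87]})

# With no index given, rows are labeled 0, 1, 2, ... (a range index)
print(list(grades.index))  # [0, 1, 2]

# Promoting 'Student' to the index gives us meaningful row labels to look things up by
grades = grades.set_index("Student")
print(grades.loc["Ada", "Score"])  # 98
```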
# # The most confusing thing about Pandas' indexes at first (in my opinion) is how to interact with them. While they look exactly like columns, they're not referenced in the same way (learn more [here](https://pandas.pydata.org/pandas-docs/stable/indexing.html)). Understanding that, I personally find it most useful to think of an index in Pandas like a column that's in time-out and just can't play with all the other columns.
# - Note: Another common way in which Pandas' indexes are misunderstood at first is by thinking of them in SQL-like terms. While that can be helpful if that's what you're familiar with, in practice (and for performance) Pandas' and SQL's indexes are quite different (see this [SO answer](https://stackoverflow.com/questions/42641018/pandas-column-indexing-for-searching) and [Pandas Under The Hood by Jeff Tratner](http://www.jeffreytratner.com/slides/pandas-under-the-hood-pydata-seattle-2015.pdf)).

# # What is a MultiIndex DataFrame?
# Pandas' multiindex DataFrames extend the DataFrames described above by enabling effective storage and manipulation of arbitrarily high-dimensional data in a 2-dimensional tabular structure. (If that sentence doesn't make sense yet, don't worry - it should by the end of this tutorial.)
#
# While the displayed version of a multiindexed DataFrame doesn't appear to be much more than a prettily-organized regular DataFrame, it's actually a pretty powerful structure if the data warrants its use.

# # When should you use one?
# 1. When a single column's value isn't enough to uniquely identify a row (e.g. multiple records on the same date means date alone isn't a good index).
# 2. When data is logically hierarchical - meaning that it has multiple dimensions or "levels."
#
# Besides structure, multiindexes offer relatively easy in-memory retrieval of complex data.
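# Before diving into the realistic data below, here is a tiny hand-built sketch (toy numbers of my own) of what a two-level index buys you:

```python
import pandas as pd

# Neither Date nor Store alone uniquely identifies a row, but together they do
index = pd.MultiIndex.from_tuples(
    [("2018-07-10", "Store 1"), ("2018-07-10", "Store 2"),
     ("2018-07-11", "Store 1"), ("2018-07-11", "Store 2")],
    names=["Date", "Store"])
sales = pd.DataFrame({"Units": [3, 5, 2, 7]}, index=index)

# Partial indexing: one outer label selects a whole slice of the hierarchy
print(sales.loc["2018-07-10"])  # both stores for that date

# A full tuple pins down a single row
print(sales.loc[("2018-07-11", "Store 2"), "Units"])  # 7
```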
# # Realistic Demo Data
# For this tutorial we will work with a realistic case of when multiindex DataFrames can come in handy - a grocer's retail transactions.

# In[1]:

# Normally put all your imports up top, but this cell was ugly in the midst of the intro paragraphs
from typing import Any

import numpy as np
import pandas as pd

display(f"Pandas version: {pd.__version__}")

# Note that this notebook is using the Pandas version above. There have been many changes to MultiIndex methods since 0.19, including major bug fixes. **It is STRONGLY recommended to use the latest version of Pandas, but at least version 0.21 is required for all of the techniques in this notebook to work as presented.**

# In[2]:

# # Creates random mock data.
# # Based on the UPCs found in upc_meta_data.csv and saves to data.csv
# %run mocker.py

# In[3]:

df = pd.read_csv('data.csv', parse_dates=['Date'])
df.sample(10)

# Each row of data represents a sale of an item, but how do we set our index meaningfully? A numeric ID for each transaction would be fine, but it wouldn't tell us terribly much. Even the date of a transaction isn't useful by itself, since it's common for such a company to have many transactions on a given date (even at the same time).
#
# Instead, we will have to look at a combination of the available metadata to create a unique and meaningful index. At first glance, a combination of the date, the store, and the hierarchy of each item (Category > Subcategory > UPC) looks like it might be enough to get us a unique index.

# # Setting and Manipulating MultiIndexes
#
# So let's take a look at how we can create our multiindex from our regular ol' DataFrame. We'll walk through the basics of setting, reordering, and resetting indexes, along with some useful tips/tricks. Then we can begin investigating our transaction data to learn about our sales and trends.

# In[4]:

# Set just like the index for a DataFrame...
# ...except we give a list of column names instead of a single string column name
df.set_index(['Date', 'Store', 'Category', 'Subcategory', 'Description'], inplace=True)
df.head(3)

# Uh oh - it looks like we forgot to add the 'UPC EAN' column to our index, but don't worry - Pandas has us covered with extra set_index parameters for MultiIndexes:

# In[5]:

# We can append a column to our existing index
df.set_index('UPC EAN', append=True, inplace=True)
df.head(3)

# That's almost right, but we'd actually like 'Description' to show up after 'UPC EAN'. We have a couple of options to get things in the right order:

# In[6]:

# Option 1 is the generalized solution to reorder the index levels
# Note: We're not making an inplace change in this cell,
# but it's worth noting that this method doesn't have an inplace parameter.
df.reorder_levels(order=['Date', 'Store', 'Category', 'Subcategory', 'UPC EAN', 'Description']).head(3)

# reorder_levels() is useful, but it was a pain to have to type all six levels just to switch two. In cases like this we have a second, less verbose option:

# In[7]:

# Option 2 just switches two index levels (a more common need than you'd think)
# Note: This time we do want the change to stick, but there's no inplace parameter
# for this method either, so we reassign the result.
df = df.swaplevel('Description', 'UPC EAN')
df.head(3)

# Just when we thought we were done, it turns out we forgot to add the highest level of the product hierarchy - the Department - not just to our index, but to our DataFrame altogether. Luckily all of our records belong to the same Department, so here's a neat trick to add a new column with all the same values as a level in an existing index:

# In[8]:

# A handy function to keep around for projects
def add_constant_index_level(df: pd.DataFrame, value: Any, level_name: str):
    """Add a new level to an existing index where every row has the same, given value.

    Args:
        df: Any existing pd.DataFrame.
        value: Value to be placed in every row of the new index level.
        level_name: Title of the new index level.

    Returns:
        df with an additional, prepended index level.
    """
    return pd.concat([df], keys=[value], names=[level_name])

df = add_constant_index_level(df, "Booooze", "Department")
df = df.reorder_levels(order=['Date', 'Store', 'Department', 'Category', 'Subcategory', 'UPC EAN', 'Description'])
df.head(3)

# # If we wanted to later drop that level
# df.index = df.index.droplevel(level='Department')
# df.head(3)

# Now that our index is set the way we want it, what if we want to interact with those index levels? Here are a few helpful code snippets:

# In[9]:

# Checking out their unique values, for a single level
df.index.get_level_values('Subcategory').unique()

# Checking out their unique values, for combinations of multiple levels
# See answer at https://stackoverflow.com/questions/39080555/pandas-get-level-values-for-multiple-columns

# Note the typo in "Liquor" above. Good thing we checked out unique values! Maybe someone can submit a pull request to fix this for me :)

# In[10]:

# Replace level values using rename
# Note that this can be done using set_levels as well, but it's a pain
df.rename(index={'Goose Island - Honkers Ale - 6 Pack': 'We changed this'}).head(15)

# Replace np.nan level values
# df.rename(index={np.nan: "''"}, inplace=True)

# In[11]:

# Rename index levels
temp_df = df.copy()
temp_df.index = df.index.set_names('Desc.', level='Description')
temp_df.head(3)

# # Understanding the MultiIndex Object
# Why is this section all the way down here? Because the MultiIndex object is scary looking if you're new to it. Many guides to hierarchical data analysis using multiindex DataFrames start with DataFrame creation and manipulation using MultiIndex objects, which I think both hinders adoption and isn't reflective of how a lot of DataFrames get created in practice. As a result, my explanation of MultiIndex objects is very basic, because there are lots of other great resources out there if you want to learn more.
# Here are my top two:
# * [Official guide](https://pandas.pydata.org/pandas-docs/stable/advanced.html?highlight=indexslice#hierarchical-indexing-multiindex)
# * [Python Data Science Handbook by Jake Vanderplas](https://jakevdp.github.io/PythonDataScienceHandbook/03.05-hierarchical-indexing.html#Methods-of-MultiIndex-Creation)

# In[12]:

df.index

# Well that's gross looking...but don't be scared - it's actually not that hard to understand.
#
# **'levels'** is a list of lists, where each sublist represents all possible values in that index level. In other words, the 'levels' parameter reflects all possible unique values by level. For example, our first index level ('Date') has the possible values \['2018-07-10', '2018-07-11', '2018-07-12', ...\].
#
# * **Important Note:** When talking about a multiindex DataFrame (not the parameter for the MultiIndex object), we talk about the "levels" as the index "columns." For example, the 'levels' of our df in a more general sense are 'Date', 'Store', 'Department', etc. Levels in this sense (and elsewhere in code) can also be referenced by number (e.g. 'Date' = 0 \[read as 'level 0'\], 'Store' = 1, 'Department' = 2, etc.).
#
# **'labels'** is also a list of lists, but here each sublist records, row by row, which of the level's possible values appears in that row. In other words, each sublist in our labels is the same length as the entire DataFrame, and the value for each row points to one of the possible values defined in the associated level (above). Looking again at our first index level ('Date'), we see values like \[1, 1, 0, 0, 2, 2, 4, 4 ...\]. These are just an enumerated representation of the options defined in our level, so 0 = '2018-07-10', 1 = '2018-07-11', 2 = '2018-07-12', 3 = '2018-07-13', etc.
#
# **'names'** is a list of the actual titles of each index level, in order of appearance from left to right.

# With that fresh understanding of the 'anatomy' of a MultiIndex, we can look at...
# # Other Methods of Multiindex DataFrame Creation
# For the most part, the two references listed in the section above cover this topic well; however, a common use case that isn't covered in those guides is creating a multiindex DataFrame while reading from a csv:

# In[13]:

# We can set a MultiIndex while reading a csv by referencing columns to be used in the index by number
pd.read_csv("data.csv",
            index_col=[0, 1, 2, 3, 4, 5],
            skipinitialspace=True,
            parse_dates=['Date']).head(3)

# We'll review more advanced importing/exporting methods below.

# # MultiIndex Columns (Multiple Column Levels)
# For a different view we can also create hierarchical column levels. We'll do this by introducing a new method: unstack(). This function "pivots" an index level to a new level of column labels whose inner-most level consists of the pivoted index labels. **Stack/unstack is one of the biggest reasons to use a MultiIndex, so it's worth supplementing the examples here by checking out the [official docs on reshaping](https://pandas.pydata.org/pandas-docs/stable/reshaping.html#reshaping-by-stacking-and-unstacking).**
#
# With this new technique we can start actually investigating our data. For example, if we want to more easily compare sales of a product by store by day, we can unstack our 'Store' index level:

# In[14]:

multi_col_lvl_df = df.unstack('Store')
multi_col_lvl_df.sample(10)

# The new view makes our comparison easier, but now it's a bit cluttered. Internally our multi-level columns are stored as tuples of the name values for each level, so we can easily fix the clutter by flattening the columns into a single level:

# In[15]:

def flatten_cols(df: pd.DataFrame, delim: str = ""):
    """Flatten multiple column levels of the DataFrame into one column level.

    Args:
        df: Any existing pd.DataFrame.
        delim: The delimiter between the column values.

    Returns:
        A copy of the dataframe with the new column names.
""" new_cols = [delim.join((col_lev for col_lev in tup if col_lev)) for tup in df.columns.values] ndf = df.copy() ndf.columns = new_cols return ndf flattened_multi_col_df = flatten_cols(multi_col_lvl_df, " | ").head(3) flattened_multi_col_df # If later we want to undo that flattening it's just as simple: # In[16]: def unflatten_cols(df: pd.DataFrame, delim: str = ""): """Unflatten a single column level into multiple column levels. Args: delim: the delimiter to split on to identify the multiple column values. Returns: A copy of the dataframe with the new column levels. """ new_cols = pd.MultiIndex.from_tuples([tuple(col.split(delim)) for col in df.columns]) ndf = df.copy() ndf.columns = new_cols return ndf unflatten_cols(flattened_multi_col_df, " | ") # # Importing/Exporting MultiIndex DataFrames # We've already seen an example of reading a csv, but what if we want to save our multiindex DataFrame and then be able to reread it? How the stored files need to be accessed and by whom they need to be accessed will determine a lot. If everyone who needs access to the data is Python/Pandas savy, [pickling](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.to_pickle.html) is fast and easy. If, however, other, not-tech-savy people will need to access the data, CSVs are a versitile storage medium. That said, it complicates our read_csv() a bit if we want to just dump out our multiindex. Revisiting our multi_col_lvl_df, where we have a multiindex DataFrame that has both multiple index levels and column levels, creates a difficult situation that requires getting to know as many as all 48 (yes, 48!) of read_csv's parameters. 
# In[17]:

# Write our multi-column-level df
multi_col_lvl_df.to_csv('multi_col_lvl_output.csv')

# Reading it back in requires the header parameter
read_multi_df = pd.read_csv('multi_col_lvl_output.csv',
                            header=[0, 1],
                            index_col=[0, 1, 2, 3, 4, 5],
                            skipinitialspace=True,
                            parse_dates=[0]).head(3)
read_multi_df

# By adding a header parameter to deal with our multiple column levels, on top of index_col for the setting of the index, this import looks good - but it required inspecting the csv for the column and row numbers (that's right - names won't work). Just to be sure everything worked, let's also check our dtypes:

# In[18]:

# A function to check our index level dtypes to aid this example
def index_level_dtypes(df):
    return [f"{df.index.names[i]}: {df.index.get_level_values(n).dtype}"
            for i, n in enumerate(df.index.names)]

index_level_dtypes(read_multi_df)

# Looks good. However, let's take a step back and look at what would have happened if we'd not parsed the dates and wanted to later change the dtype:

# In[19]:

# Reading it back in without parse_dates
bad_dtype_df = pd.read_csv('multi_col_lvl_output.csv',
                           header=[0, 1],
                           index_col=[0, 1, 2, 3, 4, 5],
                           skipinitialspace=True).head(3)
display(bad_dtype_df)
display(index_level_dtypes(bad_dtype_df))

# Updating the dtypes of our index columns isn't going to be so simple, because our MultiIndex levels are immutable.
# To make any changes to the levels, we actually have to recreate them:

# In[20]:

bad_dtype_df.index.set_levels([pd.to_datetime(bad_dtype_df.index.levels[0]),
                               bad_dtype_df.index.levels[1],
                               bad_dtype_df.index.levels[2],
                               bad_dtype_df.index.levels[3],
                               bad_dtype_df.index.levels[4],
                               bad_dtype_df.index.levels[5]],
                              inplace=True)
index_level_dtypes(bad_dtype_df)

# Alternatively, we could reset just the 'Date' level, update its dtype, add it back to our index, and finally reorder our index:

# In[21]:

bad_dtype_df2 = bad_dtype_df.reset_index(level='Date')
bad_dtype_df2['Date'] = pd.to_datetime(bad_dtype_df2['Date'], infer_datetime_format=True)
bad_dtype_df2.set_index('Date', append=True, inplace=True)
bad_dtype_df2 = (bad_dtype_df2.swaplevel('Date', 'Description')
                              .swaplevel('Date', 'UPC EAN')
                              .swaplevel('Date', 'Subcategory')
                              .swaplevel('Date', 'Category')
                              .swaplevel('Date', 'Department'))
index_level_dtypes(bad_dtype_df2)

# The moral of the story is, if you want to export and import complex multiindex DataFrames, learn read_csv's parameters well! The second moral is that either of the above approaches is an awful lot of work.
#
# One alternative, if readability of the CSV by other users is important, is to simply reset the column levels and index levels before we write to a CSV and recreate them when we import. While it requires more steps and some structure is lost while in CSV form, it makes the code manipulations a lot easier:

# In[22]:

# a) stack the column level back into the index,
# b) drop any blanks the earlier unstacking created,
# c) reset the index so everything is a flat column again, and then
# d) output to csv (index=False keeps the throwaway range index out of the file)
multi_col_lvl_df.stack().dropna().reset_index().to_csv('index_removed_output.csv', index=False)

# Reading it back in just requires rebuilding the index and unstacking 'Store' again
read_df = pd.read_csv('index_removed_output.csv',
                      index_col=[0, 1, 2, 3, 4, 5, 6],
                      skipinitialspace=True,
                      parse_dates=['Date'])
read_df.unstack('Store').head(3)

# Still takes some work, but the code's a lot more straightforward.
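# The reset-then-rebuild pattern above is handy enough to wrap in a pair of small helpers. A sketch (the function names are my own, not Pandas API), assuming all index levels are written out as ordinary columns:

```python
import os
import tempfile

import pandas as pd

def write_flat_csv(df: pd.DataFrame, path: str) -> None:
    """Flatten all index levels into regular columns and write without the range index."""
    df.reset_index().to_csv(path, index=False)

def read_flat_csv(path: str, index_names, **read_kwargs) -> pd.DataFrame:
    """Read a flat CSV and rebuild the MultiIndex from the named columns."""
    return pd.read_csv(path, **read_kwargs).set_index(index_names)

# Toy round trip
index = pd.MultiIndex.from_tuples([("2018-07-10", "Beer"), ("2018-07-10", "Wine")],
                                  names=["Date", "Category"])
frame = pd.DataFrame({"Units": [3, 5]}, index=index)

path = os.path.join(tempfile.mkdtemp(), "flat.csv")
write_flat_csv(frame, path)
back = read_flat_csv(path, ["Date", "Category"])

pd.testing.assert_frame_equal(frame, back)
```

# Passing level names instead of column positions sidesteps the count-the-columns problem from the read_csv examples above.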
# To this point we've really just been learning the basics of setting up our multiindex DataFrames, with a few neat tricks along the way. Now comes the fun part - actually interacting with those DataFrames to analyze our data.

# # Multiindex Math
# Math operations are nearly as easy with multiindex DataFrames as with regular ones, but a whole lot more powerful. Let's say we want to calculate dollars per unit for each of our stores. That's as simple as:

# In[23]:

dollars_per_unit = multi_col_lvl_df['Dollars'] / multi_col_lvl_df['Units']
dollars_per_unit.sample(10)

# We just operated on all the pairwise sub-column levels at the same time! Now we'd like this answer as columns back in our original DataFrame:

# In[24]:

# Add a column level for our new measure
dollars_per_unit.columns = pd.MultiIndex.from_product([['Dollars per Unit'], dollars_per_unit.columns])

# Concat it with our original data
pd.concat([multi_col_lvl_df, dollars_per_unit], axis='columns').head(3)

# We can similarly apply functions to our multiindex DataFrame:

# In[25]:

# Change our units to 000s for funsies
multi_col_lvl_df.applymap(lambda x: np.nan if np.isnan(x) else str(round(x/1000, 2)) + "k").head(10)

# # Sorting
# Before we start slicing and filtering, it's important to sort our data, which can drastically improve lookup efficiency. In some cases the latest versions of pandas will sort for you by default, but other times you'll get an ugly and confusing lexsort warning (referencing the default lexicographic sorting order).
#
# Sorting indexes works exactly the same way for multiindex DataFrames as for regular DataFrames, but with some extra parameters to decide on:

# In[26]:

# By default sort_index will sort all levels of the index,
# first sorting the first index level, then secondarily sorting the second index level, and so on...
sort1 = multi_col_lvl_df.copy()
sort1.sort_index(inplace=True)
print("\nSort by Date, then Department, then Category, etc.:")
display(sort1.head(10))

# ...but you can choose to start from a different level...
sort2 = multi_col_lvl_df.copy()
sort2.sort_index(level='Category', inplace=True)
print("\nSort by Category, then Subcategory, then UPC EAN, then Description, then Date, then Department:")
display(sort2.head(10))

# ...or even to only sort on specific levels.
sort3 = multi_col_lvl_df.copy()
sort3.sort_index(level=['Category', 'Date'], sort_remaining=False, inplace=True)
print("\nSort by Category, then Date only:")
display(sort3.head(10))

# # Slicing, Filtering, and Querying
# With the sorting out of the way, we can begin searching our data by specific criteria. There is a fantastic guide to [understanding SettingwithCopyWarnings in Pandas](https://www.dataquest.io/blog/settingwithcopywarning/), so if you're unclear on the difference between a view and a copy, I suggest giving that a read first.
#
# Readers who have used Pandas for slicing previously will know that .loc is generally the preferred method of referencing cells in a DataFrame. The same type of syntax exists for multiindex DataFrames:

# In[27]:

# A slicing helper. Works similarly to slicing in Python (e.g. list slicing),
# but is inclusive of both the start and stop values.
idx = pd.IndexSlice

# View rows with a Category of Beer, but any Date, Department, Subcategory, UPC EAN, or Description
# Only looking at Dollars columns
print("View only rows in the 'Dollars' columns where the Category is 'Beer':")
display(sort3.loc[idx[:, :, 'Beer'], 'Dollars':'Dollars'].head(10))

# If we just want to look at the Store 1 sub-column
print("\nView only rows in the 'Dollars' AND 'Store 1' column where the Category is 'Beer':")
display(sort3.loc[idx[:, :, 'Beer'], idx['Dollars', 'Store 1':'Store 1']].head(10))

# Unfortunately the .loc syntax doesn't scale well to having many index levels.
# In order to select values from a single level, .loc requires specifying ':' for every level that comes before it. And for the column names to show up for a single column, you also need to use the ':' syntax just to specify that one column. This can be incredibly annoying.
#
# Instead, it is more common (and practical) to use df.xs (cross sections). df.xs allows us to filter our DataFrame to only those rows that match the index level values we specify. We can choose an individual level or multiple levels easily:

# In[28]:

# We provide the level value, which by default searches the first index level ('Date')
display(multi_col_lvl_df.xs('2018-07-12').head(10))

# Here we're more specific because we want to search 'Category' instead
display(multi_col_lvl_df.xs('Wine', level='Category').head(10))

# To search for rows that match all of our level value requirements, use tuples
display(multi_col_lvl_df.xs(('2018-07-10', 'Booooze', 'Wine'),
                            level=['Date', 'Department', 'Category']).head(10))

# Note that chaining .xs works, but is significantly less efficient
display(multi_col_lvl_df.xs('2018-07-10', level='Date')
                        .xs('Booooze', level='Department')
                        .xs('Wine', level='Category')
                        .head(10))

# We can then use the column value filtering we're used to in order to filter rows as well
cross_section_df = multi_col_lvl_df.xs(('2018-07-10', 'Booooze'), level=['Date', 'Department'])
display(cross_section_df[cross_section_df['Dollars']['Store 1'] > 0].head(10))

# That's much nicer syntax than the .loc method, but it's still relatively verbose and requires separating our index and column value filtering. Multiindex DataFrames also offer a lesser-known method, df.query. Query is powerful for a few reasons:
# 1. It can search both index levels and columns in the same query,
# 2. Syntactically it's nicer, since it lets us write our query as short expressions,
# 3.
# It's very efficient (depending on the chosen engine).
#
# The documentation has excellent [notes about the method](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.query.html) and a nice [usage guide](https://pandas.pydata.org/pandas-docs/stable/indexing.html#multiindex-query-syntax) that are definitely worth a read, but here are a couple of examples using our dataset:

# In[29]:

multi_col_lvl_df.query("Date == '2018-07-10' and Category == 'Wine'")

# So why doesn't everyone always use df.query? It has two big drawbacks:
# 1. It doesn't allow index or column names with spaces in the name.
# 2. It can't deal with multiple column levels well.
#
# However, if you're used to working with SQL databases and/or are willing to work within the above limitations, this is a powerful and fast option.

# # Updating Column Values
# Now that we've learned a few ways to slice, filter, and query our data, what if we want to alter values in our dataframe? We can take advantage of the .loc method to replace values in place without creating a copy or, if temporarily copying our data isn't an issue (i.e. it isn't too big for memory), we can use the df.xs method with df.update:

# In[30]:

updated_df = multi_col_lvl_df.copy()
updated_df.loc[idx[:, :, 'Beer'], idx['Dollars', 'Store 1':'Store 1']] = "We changed this"
updated_df.head(10)

# The above syntax works well for immutable values, but you'll get an error if you try the same thing with a mutable value, like a list. To my knowledge the only way to set values in a DataFrame to a mutable object is one cell at a time, so doing multiple replacements requires iterrows:

# In[31]:

# # Note: If the type of the column we're changing isn't already object,
# # we need to change it or the value replacements below will error.
# updated_df['Dollars']['Store 1'] = updated_df['Dollars']['Store 1'].astype(object)

# # Loop through rows, replacing single values
# # Only necessary if the new assigned value is mutable
# # Code below currently not working when there are multiple column levels, but works with one column level
# for index, row in updated_df.loc[idx[:, :, 'Beer'], idx['Dollars', 'Store 1':'Store 1']].iterrows():
#     updated_df.at[index, idx['Dollars', 'Store 1':'Store 1']] = ["We", "changed", "this"]
# updated_df.head(3)

# Alternatively, using df.xs and then df.update instead of the full .loc is a little bit more understandable, in my opinion. For example, the following code does the same as the code two cells above (except that it requires a copy):

# In[32]:

updated_df = multi_col_lvl_df.copy()
df2 = updated_df.xs('Beer', level='Category', drop_level=False).copy()  # .copy() is to avoid a SettingwithCopyWarning
df2[idx['Dollars', 'Store 1']] = "We ALSO changed this"
updated_df.update(df2, join="left", overwrite=True, filter_func=None, raise_conflict=False)
updated_df.head(10)

# # Display options

# In[33]:

pd.set_option('display.multi_sparse', True)

# # That's all, folks!
# Hopefully this guide has been a gentle and practical tutorial on multiindex DataFrames in Pandas. If you find any errors, material omissions, or topics that could just be explained more clearly, please leave a comment or, better yet, submit a pull request!
#
# To continue your learning, find below a list of my top picks for Pandas resources - both general and MultiIndex-specific. Of particular note is the [Official list of available methods for MultiIndexes](https://pandas.pydata.org/pandas-docs/version/0.22/generated/pandas.MultiIndex.html), which has a full listing of all the available methods (many not listed here).
# **This link is specifically for Pandas version 0.22, because, for some reason, the list in later versions excludes a lot of still-valuable methods!** What's more, this list provides the ONLY explanation for some methods, since many MultiIndex methods lack even the most basic of docstrings.

# # Resources
# **Official MultiIndex References**
# * [Official MultiIndex / Advanced Indexing Tutorial](https://pandas.pydata.org/pandas-docs/stable/advanced.html?highlight=indexslice#hierarchical-indexing-multiindex)
# * [Official list of available methods for MultiIndexes](https://pandas.pydata.org/pandas-docs/version/0.22/generated/pandas.MultiIndex.html)
#
# **Other MultiIndex Tutorials**
# * [Official Reshaping by Stacking and Unstacking Tutorial](https://pandas.pydata.org/pandas-docs/stable/reshaping.html#reshaping-by-stacking-and-unstacking)
# * [Hierarchical indices, groupby and pandas by DataCamp](https://www.datacamp.com/community/tutorials/pandas-multi-index)
# * [Python Data Science Handbook by Jake Vanderplas](https://jakevdp.github.io/PythonDataScienceHandbook/03.05-hierarchical-indexing.html#Methods-of-MultiIndex-Creation)
# * [MultiIndex slicing by Nelson](https://www.somebits.com/~nelson/pandas-multiindex-slice-demo.html)
# * [Pandas examples and cookbook by Eric Neilsen, Jr.](http://ehneilsen.net/notebook/pandasExamples/pandas_examples.html)
#
# **General Pandas Tutorials**
# * [Official 10 Minutes to Pandas Tutorial](https://pandas.pydata.org/pandas-docs/stable/10min.html)
# * [Official Pandas Cheat Sheet](https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf)
# * [Official Indexing and Selecting Data Tutorial](https://pandas.pydata.org/pandas-docs/stable/indexing.html)
# * [Understanding SettingwithCopyWarning in pandas by DataQuest](https://www.dataquest.io/blog/settingwithcopywarning/)
#
# **Pandas Operations and Efficiency**
# * [A Beginner's Guide to Optimizing Pandas Code for Speed by Sofia Heisler](https://engineering.upside.com/a-beginners-guide-to-optimizing-pandas-code-for-speed-c09ef2c6a4d6)
# * [Using pandas with large data by DataQuest](https://www.dataquest.io/blog/pandas-big-data/)
# * [Pandas Under The Hood by Jeff Tratner](http://www.jeffreytratner.com/slides/pandas-under-the-hood-pydata-seattle-2015.pdf)
# * [Pandas 2.0 Internals: Data structure changes by Wes McKinney](https://pandas-dev.github.io/pandas2/internal-architecture.html)