Panel data in pandas

Files needed = ('dogs.csv', 'WEOOct2016all.xls')

We are learning how to deal with complex datasets in pandas. We began by tackling the multiIndex and now we turn to panel data and its various representations.

Panel data have (at least) three dimensions, for example, a panel might record

  1. several variables (income, age, spending on cheese)
  2. regarding several people
  3. over time

The multiIndex gives us a natural way to handle this data.

In [ ]:
import pandas as pd                 # load pandas and shorten it to pd
import datetime as dt               # load datetime and shorten it to dt
import matplotlib.pyplot as plt     # for making figures

Some panel data

Load the 'dogs.csv' file. The different dimensions of the data are: variables (walks, snacks); dogs (Buster, Su); and time.

The difference between the 'prem' dataset in the last notebook and this one is that the dogs dataset has a time dimension.

For reference, here are Buster and Su.

In [ ]:
# load a data file with the number of walks and snacks my dogs have had 

dogs = pd.read_csv('dogs.csv')         # data on the habits of my dogs
dogs

Long vs. wide (or stacked vs. unstacked)

This data format is called long because there are lots of rows and not many columns. Moving between long and wide (lots of columns, fewer rows) is a common task in setting up panel data sets.

Pandas calls long data stacked and wide data unstacked. We use the stack() and unstack() methods for moving between long and wide with multiIndexed data. Stack and unstack do not work in place. They always return a copy, so we need to assign it to a variable.

Set up the index

In [ ]:
dogs['time'] = pd.to_datetime(dogs['time'])

# move everything but the data values to the index
dogs = dogs.set_index(['dog', 'time', 'var'])
dogs

Unstack

We unstack the data to put the variables into columns.

In [ ]:
# Moving var to a column creates a multiIndex with two levels

dogs_us = dogs.unstack('var') 
dogs_us
In [ ]:
dogs_us.columns

We can unstack several variables. As usual, we pass a list.

In [ ]:
dogs_us = dogs.unstack(['dog', 'var']) # move dog and var to columns, creates a multiindex with three levels
dogs_us
In [ ]:
dogs_us.columns

Removing a level

The outermost level of the column multiIndex is 'value' and is not very useful. Let's drop it.

Note that the level does not have a name associated with it (None is not a name) so we need to use the number associate with the level.

In [ ]:
dogs_us.columns = dogs_us.columns.droplevel(level = 0)
dogs_us

We now have an unstacked DataFrame and each column is a time series of one dog's observations (the data are 'wide'). Notice that we do not have observations for 2018-10-04 walks, so panda filled in NaNs for us.

Panel data like this are STATA's bread and butter, and STATA has a lot of tools for working with panels. Pandas does not have as large a toolset, but the multiIndex provides a very nice way to handle and display panel data. STATA does not have a multiIndex, so the column names in the data above would be: 'walkBuster', 'snacksBuster', 'walkSu', 'snacksSu'.

Creating new variables

I want to know the walk-to-snacks ratio to make sure they are getting enough exercise. We can refer to columns in the multiIndex using tuples.

In [ ]:
# compute the ratios
dogs_us[('Buster', 'ws_ratio')] =  dogs_us[('Buster', 'walk')]/dogs_us[('Buster', 'snacks')]
dogs_us[('Su', 'ws_ratio')] =  dogs_us[('Su', 'walk')]/dogs_us[('Su', 'snacks')]

# resort the index
dogs_us.sort_index(axis=1, inplace=True)
dogs_us

Ploting with panels

Let's plot the walk-snack ratio. There are several things in this code to talk about...

In [ ]:
# A stealthy import --- I usually put these in the first cell
import matplotlib.dates as mdates

fig, ax = plt.subplots(figsize=(15,6))

ax.plot(dogs_us.index, dogs_us[('Buster', 'ws_ratio')], color='red', marker = 'o', label='Buster')
ax.plot(dogs_us.index, dogs_us[('Su', 'ws_ratio')], color='blue', marker = 'o', label='Su')

ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.set_title('Walk-snack ratio')
ax.legend(frameon=False)

# Here I am formatting the x-axis dates
# ax.set_xticks(dogs_us.index)                                  # put ticks only where we have observations
# ax.xaxis.set_major_formatter(mdates.DateFormatter("%m-%d"))   #show the month and day

plt.show()

Looks pretty good, but Buster needs more walks. Let's dig into this a bit more.

1. Alternative plot command

In the figure above, I separately plotted Buster's and Su's data.

  1. This let me specify a label to get the legend correct
  2. It makes it clear how I specify each line color, width, marker, etc

In the figure below, I plot the index (the dates) against all the ws_ratio. This is a fast way to plot many lines.

  1. I have to handle the legend labels separately
  2. I am using matplotlib's default color cycler
In [ ]:
fig, ax = plt.subplots(figsize=(15,6))

ax.plot(dogs_us.index, dogs_us.xs('ws_ratio', level=1, axis=1))

ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.set_title('Walk-snack ratio')
#ax.legend(frameon=False)

ax.legend(dogs_us.columns.get_level_values(level='dog').unique(), frameon=False)

# Here I am formatting the x-axis dates
# ax.set_xticks(dogs_us.index)                                  # put ticks only where we have observations
# ax.xaxis.set_major_formatter(mdates.DateFormatter("%m-%d"))   #show the month and day

plt.show()

Legends

Comment out the more complicated legend code above and use our standard ax.legend(frameon=False) code. What happened? Matplotlib wasn't sure what to do about the legend labels, so it bonked.

To add labels to the legend:

  1. Get the values of the column names (at which level?)
  2. They are repeated, so pull out the unique values
  3. Assign the values to the legend
In [ ]:
# Here is get_level_values
print(dogs_us.columns.get_level_values(level='dog'))

# Here is get the unique levels
print(dogs_us.columns.get_level_values(level='dog').unique())

In general, legends are very flexible and potentially complicated. We will talk more about them later.

2. An aside on datetime axes

The Problem: The x-axis range is small, so matplotlib has added some labels with days broken up in hours. The data are not collected by the hour, so this axis is misleading. I only have daily observations.

I am exerting some control over the x-axis ticks and labels here.

ax.set_xticks(dogs_us.index)

specifies that I want ticks at the values in the index --- and only at these places.

The code

ax.xaxis.set_major_formatter(mdates.DateFormatter("%m-%d"))

tells matplotlib to format the tick labels as month-day. To use this second line of code, I needed to import the mdates (matplotlib dates) part of the matplotlib package.

In [ ]:
fig, ax = plt.subplots(figsize=(15,6))

ax.plot(dogs_us.index, dogs_us[('Buster', 'ws_ratio')], color='red', marker = 'o', label='Buster')
ax.plot(dogs_us.index, dogs_us[('Su', 'ws_ratio')], color='blue', marker = 'o', label='Su')

ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.set_title('Walk-snack ratio')

# Here I am formatting the x-axis dates
ax.set_xticks(dogs_us.index)                                  # put ticks only where we have observations
ax.xaxis.set_major_formatter(mdates.DateFormatter("%m-%d"))   #show the month and day

ax.legend(frameon=False)

plt.show()

[To try at home: We don't have data points for 10-04. Change the x-axis limits to that it only spans 10-01 to 10-03.]

Stack

We can stack the data to put the variables back on the rows.

When we stack the data, pandas defaults to dropping the NaNs. We can override this if we choose. If we do not pass an argument, it stacks the innermost index.

In [ ]:
# stack just one level
dogs_s = dogs_us.stack()       # stack shifts the columns to rows. 
dogs_s
In [ ]:
# stack two levels (and keep the missings)
dogs_s = dogs_us.stack(['dog', 'var'], dropna=False)
dogs_s

The data are back to 'long' form. Note that we have a series now, and not a DataFrame.

I would like dog as the outermost index. Use the swaplevel() to swap two levels.

In [ ]:
dogs_s=dogs_s.swaplevel('dog', 'time')
dogs_s
In [ ]:
# and swap var for time, so time is the innermost index
dogs_s=dogs_s.swaplevel('var', 'time')
dogs_s
In [ ]:
# and give it a sorting...
dogs_s.sort_index(inplace=True)
dogs_s

Let's get this ready to save as long data, like it was when we read it in.

  1. Reset the index to move them back to columns
  2. Give the value column a name
In [ ]:
dogs_s = dogs_s.reset_index()
dogs_s
In [ ]:
dogs_s.rename(columns={0 :'value'}, inplace=True)
dogs_s
In [ ]:
# Now save it. 
dogs_s.to_csv('dog_data_updated.csv', index=False)  #index = False does not print the index

Practice

Let's review multiIndexing with some real world data. The data are messy and will require some cleaning up and 'wrangling.' We will do some of it together, then I'll hand it off for you all to finish.

We will work with the IMF's World Economic Outlook, which contains historical data and the IMF's forecasts for many countries and variables.

Our goal is to study the evolution of debt in Germany, Argentina, and Greece.

First, download the data file. The file is here http://www.imf.org/external/pubs/ft/weo/2016/02/weodata/WEOOct2016all.xls. Clicking on the link should initiate a download.

Wow, there is a lot going on here. Let's get to work.

Read the data into a DataFrame

Read in the data, then inspect the head and tail to make sure it looks as expected.

In [ ]:
# The IMF website is acting up and the url doesn't always work
# url = 'http://www.imf.org/external/pubs/ft/weo/2016/02/weodata/WEOOct2016all.xls'

# The data are in a tab-separated list (even though the file ends in 'xls'. Not cool, IMF)
# The thousands parameter tells pandas to remove the comma when reading in numbers
weo = pd.read_csv('WEOOct2016all.xls', sep='\t', na_values=['n/a', '--'], thousands =',')
weo.head()

A quick aside about encoding

What the heck just happened? What is a UnicodeDecodeError?

Any 'text' character (be it latin, cyrilic, korean, chinese,...) must be translated to a series of 0s and 1s in order for a computer to understand it. This translation is known as an encoding. As computers developed, many different groups created many different encodings (e.g., ASCII), so the letter 'K' may translate to 0010 in one encoding and 00100101 in another. What a mess.

At some point Unicode was developed to serve as one standard encoding. Pandas defaults to utf-8 encoding. The world is converging on unicode, but other encoding are still out there.

The error message tells us that the IMF is not encoding its workbook tab-delimited file as utf-8. After a bit of digging, I found the file to be encoded in 'windows-1252'. (Now is not the time for a discussion of the bureaucratic nightmares of large international organizations, but this is a nice example.)

That's probably more than you ever wanted to know about text, but it's a scary world out there. If you try reading a file and pandas throws a unicode error, then you probably need to change the encoding.

The read_ methods of pandas (and other python methods) takes a parameter encoding to specify the file encoding. When we do not specify the encoding, pandas defaults to utf-8. Let's add the encoding parameter and try again.

In [ ]:
# url = 'http://www.imf.org/external/pubs/ft/weo/2016/02/weodata/WEOOct2016all.xls'

# The data are in a tab-separated list (even though the file ends in 'xls'. Not cool, IMF)
# The encoding parameter tells pandas how to read special characters
# The thousands parameter tells pandas to remove the comma when reading in numbers
weo = pd.read_csv('WEOOct2016all.xls', sep='\t', na_values=['n/a', '--'], thousands =',', encoding='windows-1252')
weo.head()
In [ ]:
weo.tail()

The data file has a footer (go back and look in the file) and we read it in as a line of data. We could go back and specify the skipfooter parameter to read_csv(), but we can also just drop it.

In [ ]:
weo.drop(8404, inplace=True)   # Using the row number is not very robust. What if the IMF added more countries to the workbook?
weo.tail()

Drop unneeded variables

Use drop() to get rid of columns that we are not going to use.

In [ ]:
weo.drop(['WEO Country Code', 'Subject Notes', 'Country/Series-specific Notes', 'Scale', 'Estimates Start After'],axis=1, inplace=True)
weo.head()

Keep the countries and variables we want

Let's look at Argentina, Germany, and Greece and the variables debt-gdp ratio and deficit-gdp ratio. We will use this as an opportunity to learn about the .isin( ) method of DataFrame.

In [ ]:
variables = ['GGXWDG_NGDP', 'GGXCNL_NGDP']
countries = ['ARG', 'DEU', 'GRC']

weo = weo[ weo['WEO Subject Code'].isin(variables) & weo['ISO'].isin(countries) ]
weo.head()

The isin() method

Check out the way that I selected the rows of the DataFrame I wanted. I used the isin() method (docs). You pass it a list of values and it returns a DataFrame of bools which you use to subset the data. This is a lot cleaner than using a bunch of conditional statements.

I'm using it twice: once to specify the variables I want and once to specify the countries I want. I join the two together with an & operator.

Your turn

The data are in pretty decent shape. Take a few minutes and try the following. Feel free to chat with those around if you get stuck. I am here, too.

  1. Rename the 'WEO Subject Code' to 'Variable' and 'Subject Descriptor' to 'Description'.
  1. Replace 'GGXCNL_NGDP' with 'Surplus
  2. Replace 'GGXWDG_NGDP' with 'Debt'
  1. Set the (row) multiIndex to be 'ISO', 'Variable', 'Country', 'Description', and 'Units', in that order.
  1. We want the row to be time and everything else to be columns. Swap the columns for rows using .transpose().
  2. Sort the row and column indexes.
  1. Plot debt levels for Argentina, Germany and Greece for all the years in the data. Add a legend. Make the figure look nice. Hint: The x-axis is time...

Wow, Greece has had an interesting run. Let's look at Greece more closely.

  1. Create a (2,1) grid of subplots. Plot Greece's surplus in the top axes and debt in the bottom axis. Make it look nice.
  1. Stack the data to create a long dataset
  2. Save the stacked data as a csv file