PS 00 Scratch. Data Manipulation Pieces Dropped from the Problem Set

4. Data Manipulation

4.1. Pandas dataframes

Pandas is one of the most widely used Python libraries in data science. It is commonly used for data cleaning, and with good reason: it is very powerful and flexible. Think of a dataframe as a table: its rows and columns are a collection of lists stacked on top of and next to each other. For example, here is a dataframe made up of ten lists, each consisting of a rating, a movie title, and a year:

In [ ]:
import numpy as np
import pandas as pd

top_10_movies_list = [
    [9.2, 'The Shawshank Redemption', '1994'],
    [9.2, 'The Godfather', '1972'],
    [9.0, 'The Godfather: Part II', '1974'],
    [8.9, 'Pulp Fiction', '1994'],
    [8.9, "Schindler's List", '1993'],
    [8.9, 'The Lord of the Rings: The Return of the King', '2003'],
    [8.9, '12 Angry Men', '1957'],
    [8.9, 'The Dark Knight', '2008'],
    [8.9, 'Il buono, il brutto, il cattivo', '1966'],
    [8.8, 'The Lord of the Rings: The Fellowship of the Ring', '2001']]

top_10_movies_df = pd.DataFrame(data=np.array(top_10_movies_list), columns=["Rating", "Movie", "Year"])
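One subtlety worth flagging: wrapping the list in np.array first casts every entry to a string, because a numpy array holds a single type. A minimal sketch of the caveat, using a made-up two-movie slice of the list above:

```python
import numpy as np
import pandas as pd

movies = [[9.2, 'The Shawshank Redemption', '1994'],
          [9.2, 'The Godfather', '1972']]

# np.array on a mixed-type list coerces every entry to a string,
# so the Rating column comes back as text, not numbers:
df = pd.DataFrame(data=np.array(movies), columns=["Rating", "Movie", "Year"])

# convert back to numbers when you need to do arithmetic:
ratings = df["Rating"].astype(float)
```

Passing the list of lists straight to pd.DataFrame, without np.array, would let pandas keep the ratings numeric in the first place.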

4.2 Reading in dataframes

Luckily for you, most data tables in this course will be premade and given to you in a form that is easily read by a pandas method, which creates the table for you. A common file type used for economic data is the Comma-Separated Values (.csv) file, which stores tabular data. It is not necessary for you to know exactly how .csv files store data, but you should know how to read one in as a pandas dataframe. You can use the "read_csv" method from pandas, which takes as its parameter the path to the csv file you are reading in.

We will read in a .csv file that contains quarterly real GDI, real GDP, and nominal GDP data in the U.S. from 1947 to the present.

In [ ]:
# Run this cell to read in the table
accounts = pd.read_csv("data/Quarterly_Accounts.csv")

The pd.read_csv function expects a path to a .csv file as its input, and will return a data table created from the data contained in the csv. We have provided Quarterly_Accounts.csv in the data directory, which is contained in the current working directory (that is, the folder this assignment is in). For this reason, we pass "data/Quarterly_Accounts.csv": the "data/" prefix tells read_csv to look inside the data directory for the file.

Here is a sample of some of the rows in this datatable:

In [ ]:
# .head() shows the first five rows of the table:
accounts.head()
4.3 Indexing dataframes

Oftentimes, tables will contain a lot of extraneous data that muddles our data tables, making it more difficult to quickly and accurately obtain the data we need. To correct for this, we can select out columns or rows that we need by indexing our dataframes.

The easiest way to index into a table is with square bracket notation. Suppose you wanted to obtain all of the real GDP data from the table. Using a single pair of square brackets, you could index the table for "Real GDP":

In [ ]:
# Run this cell and see what it outputs
accounts["Real GDP"]

Notice how the above cell returns a pandas Series—a single, array-like column—containing all the real GDP values in their original order. Now, if you wanted to get the first real GDP value from this column, you could index it with another pair of square brackets:

In [ ]:
accounts["Real GDP"][0]

Pandas columns have many of the same properties as numpy arrays. Keep in mind that pandas dataframes, as well as many other data structures, are zero-indexed, meaning indexes start at 0 and end at the number of elements minus one.
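As a quick illustration of zero-indexing, here is a made-up three-element column:

```python
import pandas as pd

ratings = pd.Series([9.2, 9.0, 8.9])

first = ratings[0]                  # index 0 holds the first element
last = ratings[len(ratings) - 1]    # the last element sits at length minus one
```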

If you wanted to create a new datatable with select columns from the original table, you can index with double brackets.

In [ ]:
## Note: .head() returns the first five rows of the table
accounts[["Year", "Quarter", "Real GDP", "Real GDI"]].head()

Alternatively, you can also get rid of columns you don't need using .drop():

In [ ]:
accounts.drop("Nominal GDP", axis=1).head()

Finally, you can use square bracket notation with a single set of brackets to select rows by their positions. You must specify a range of rows to slice. For example, if I wanted the 20th to 30th rows of accounts:

In [ ]:
# rows are zero-indexed and the end of a slice is excluded, so
# the 20th through 30th rows sit at positions 19 through 29:
accounts[19:30]
4.4. Filtering data

As you can tell from the previous section, indexing rows by position is only useful when you know the specific set of rows that you need, and you can only really get a range of entries. Working with data often involves huge datasets, making it inefficient and sometimes impossible to know exactly which indices to look at. On top of that, most data analysis concerns itself with looking for patterns or specific conditions in the data, which is impossible to find with simple position-based indexing.

Thankfully, you can also use square bracket notation to filter out data based on a condition. Suppose we only wanted real GDP and nominal GDP data from the 21st century:

In [ ]:
accounts[accounts["Year"] >= 2000][["Real GDP", "Nominal GDP"]]

The accounts table is being indexed by the condition accounts["Year"] >= 2000, which keeps only the rows whose "Year" is greater than or equal to 2000. We then index this table with the double bracket notation from the previous section to get only the real GDP and nominal GDP columns.
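To make the mechanics concrete, here is a minimal sketch with a made-up three-row table: the condition by itself evaluates to a boolean Series, and indexing with it keeps only the True rows:

```python
import pandas as pd

toy = pd.DataFrame({"Year": [1999, 2000, 2001],
                    "Real GDP": [1.0, 2.0, 3.0],
                    "Nominal GDP": [4.0, 5.0, 6.0]})

mask = toy["Year"] >= 2000                          # boolean Series: False, True, True
filtered = toy[mask][["Real GDP", "Nominal GDP"]]   # keeps the two True rows
```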

Suppose now we wanted a table with data from the first quarter, and where the real GDP was less than 5000 or nominal GDP is greater than 15,000.

In [ ]:
accounts[(accounts["Quarter"] == "Q1") & ((accounts["Real GDP"] < 5000) | (accounts["Nominal GDP"] > 15000))]

Many different conditions can be used to filter, and you can use the & and | operators to connect them. Make sure to include parentheses around each condition!

Another way to reorganize data to make it more convenient is to sort the data by the values in a specific column. For example, if we wanted to find the highest real GDP since 1947, we could sort the table for real GDP:

In [ ]:
accounts.sort_values("Real GDP")

But wait! The table looks like it's sorted in increasing order. This is because sort_values defaults to ascending order. To correct this, add the optional parameter ascending=False:

In [ ]:
accounts.sort_values("Real GDP", ascending=False)

Now we can clearly see that the highest real GDP was attained in the first quarter of this year, with a value of 16903.2.

4.5. Useful methods for numeric data

Here are a few useful functions when dealing with numeric data columns. To find the minimum value in a column, call min() on a column of the table:

In [ ]:
accounts["Real GDP"].min()

To find the maximum value, call max().

In [ ]:
accounts["Nominal GDP"].max()

And to find the average value of a column, use mean().

In [ ]:
accounts["Real GDI"].mean()
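These methods combine naturally with the filtering from section 4.4. A sketch with made-up numbers, averaging only the rows from 2000 on:

```python
import pandas as pd

toy = pd.DataFrame({"Year": [1999, 2000, 2001],
                    "Real GDP": [100.0, 110.0, 120.0]})

# filter first, then take the mean of the surviving rows:
avg = toy[toy["Year"] >= 2000]["Real GDP"].mean()   # (110 + 120) / 2 = 115.0
```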

4.6. Keeping track of your information

Data comes from all kinds of places. It is important to be systematic about where your data comes from—lest you do all your work, come up with quantitative answers, and then realize you have not the foggiest clue about what the ultimate source of data is and how reliable it is.

I find it useful to keep the data in a dataframe, to mark that it is a dataframe by making sure that its name ends with "_df", and to keep the source information in another object with the same name but ending with "_dict", for "dictionary"—a flexible built-in Python object.

Watch how I constructed the data for the next section, "5. Visualization", of this notebook:

In the spring of 2020, with the coming of coronavirus, the share of Americans who had jobs collapsed. In order to gain some perspective, I wanted to look back at how the share of Americans who had jobs had changed over time, and how it typically varies. Every month the United States Department of Labor's Bureau of Labor Statistics sends out interviewers to survey a random sample of Americans about their experiences. In this Current Population Survey one of the questions asked is: "do you have a job?" The share of American adults—people 16 and over—who answer "yes" is then called the employment-population ratio. This series is available, monthly, back to when the BLS started collecting the CPS in 1948.

Let us take a look at how the employment-population ratio has fluctuated over the past seventy-odd years.

The Federal Reserve Bank of St. Louis has an internet-accessible databank called FRED, for Federal Reserve Economic Data. I downloaded the employment-population ratio from FRED as a single .csv file—".csv" meaning that the data consist of values separated by commas—and parked it on the internet.

The first thing to do is to go out to the internet, get the .csv file, and load it into a pandas dataframe, using the pd.read_csv command pandas has for this specific purpose:

In [ ]:
# read the employment-population ratio data in from the internet
# into a dataframe:

# (the URL where the .csv is parked goes inside the call; the
# original link is elided in this draft)
employment_population_df = pd.read_csv("...")
And then look at what the data looks like—just the first few lines of the dataframe, the .head() of the file, will be enough:

In [ ]:
employment_population_df.head()
The second thing to do is to tell the python interpreter that we are going to want to use the Date column to organize the dataframe: that the data are naturally arranged by date, and we want to utilize that natural arrangement:

In [ ]:
# set the date column to be the index of the dataframe:

employment_population_df.set_index('Date', inplace=True)

We see that the dataframe stores the date and employment-population ratio information in its first two columns, and source information in its third and fourth columns. We would like to move that source information to a more convenient place so that it does not mess up presentation of the data. Therefore, third, we tell the python interpreter to create a related dict object, which starts out empty. We then loop over the first seven lines of the file—where the source information is—to transfer that information into the dict, like so:

In [ ]:
# as downloaded, the employment_population_df dataframe includes
# source and other such information in the first seven lines of 
# the dataframe. we would like to move that information to a more
# convenient place so that it does not mess up presentation of 
# the data

# therefore we create a related "dict" object, which starts out 
# empty:

employment_population_dict = {}

# and we load the source and related information into the dict
# object:

for i in range(7):
    # .iloc[i] picks out the i-th row by position, which still works
    # after the Date column has become the index:
    key = employment_population_df['Key'].iloc[i]
    employment_population_dict[key] = employment_population_df['Value'].iloc[i]

Now we can drop the source information from the dataframe without losing any information:

In [ ]:
# then we drop the source information from the dataframe:

employment_population_df.drop(['Key'], axis = 1, inplace=True)
employment_population_df.drop(['Value'], axis = 1, inplace=True)

We now have both employment_population_df and employment_population_dict around. Anyone looking at this notebook in the future should be able to infer that the first is a dataframe, and the second is related information. But just to be sure let's add the dataframe to the dictionary, so that anyone asking the dictionary if it has associated data in it will get the answer "yes". Thus, last, we do this:

In [ ]:
# last, we add the dataframe as another element of the "dict"
# object so that it is kept close and related to its source
# information. If we ever want to see information about the
# source, simply replace the "df" at the end of the dataframe's
# name with a "dict", and then look in that object to see what
# is there:

employment_population_dict['Data'] = employment_population_df

Now somebody who asks python if there is data associated with employment_population_dict by typing in employment_population_dict['Data'] will get a helpful result:

In [ ]:
employment_population_dict['Data']
There is all the data.

5. Visualization

Now that you can read in data and manipulate it, you are now ready to learn about how to visualize data. To begin, run the cell below to import the required libraries we will be using, and to make graphs appear inside the notebook rather than in a separate window:

In [ ]:
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib as mpl

5.1. It is very easy

We jumped through all our hoops to set up our employment_population_df dataframe in the last section. To plot a graph of how the employment-population ratio varies over time, all we do is take the name of the dataframe and add ".plot()" to the end of it. Here is our graph:

In [ ]:
# having jumped through these hoops, we can now take a look at
# what the share of american adults who every month tell the 
# cps interviewers that they have jobs looks like:

employment_population_df.plot()
That is it.

One of the great advantages of python and pandas is their built-in plotting methods. We can simply call .plot() on a dataframe to plot columns against one another; all we have to do is specify which column to plot on which axis. Something special that pandas does is attempt to automatically parse dates into something it can understand and order them sequentially.
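When a dataframe has no meaningful index, you can name the axes explicitly with the x and y parameters. A minimal sketch with made-up numbers (the Agg backend keeps it runnable without a display):

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")   # draw off-screen so the sketch runs anywhere
import matplotlib.pyplot as plt

toy = pd.DataFrame({"Year": [2018, 2019, 2020],
                    "Ratio": [60.4, 60.8, 52.8]})

# plot the Ratio column against the Year column:
ax = toy.plot(x="Year", y="Ratio")
```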

We probably want to pretty-up the graph a little bit: adding an axis label and a title, and removing the "Ratio" legend in the top-right corner, which does not add anything useful:

In [ ]:
# plot without the legend, then add an axis label and a title:
employment_population_df.plot(legend=False)

plt.ylabel('Percent of Adults')
plt.title('Employment to Population Ratio', size=24)

Yes, the collapse of American employment in coronavirus spring is historically unprecedented.

5.2. Slicing the data to look at subsamples

To gain a sense of how the employment-population ratio "usually" behaves, we can look at a narrower timeslice than the full 1948-2020 data:

In [ ]:
# if we just want to look at the data 1990-2005:

# (string dates in "YYYY-MM-DD" form sort chronologically, so a
# label slice on the Date index works; the end label "2006" makes
# sure all of 2005 is included)
employment_population_df.loc['1990':'2006'].plot(legend=False)

plt.ylabel('Percent of Adults')
plt.title('Employment to Population Ratio', size=24)

We see, clearly, the seasonal cycle—the semi-regular every-twelve-months fluctuations in employment that come with the seasons (of which, somewhat surprisingly, the most important component is the pre-Christmas gift-shopping season); we also see clearly, the irregular roughly seven-year "business cycle" fluctuations of boom and bust connected to shifts in businesses' optimism and pessimism about the future.

5.3. The "great recession" & the "secular stagnation recovery"

Before coronavirus spring, I thought the focus of this section would be on the years 2005-2015—the years from the peak of the housing bubble until the Federal Reserve began to try to "normalize" interest rates, because keeping interest rates as low as it could in order to try to boost employment to offset the damage done to the economy by the so-called "Great Recession" of 2008-2010 was no longer its highest priority:

In [ ]:
# and if we just want to look at the years 2005-2015—
# the years from the peak of the housing bubble until
# the federal reserve begins to try to "normalize" interest 
# rates because keeping interest rates as low as it 
# can in order to try to boost employment to offset 
# the damage done to the economy by the so-called "great 
# recession" is no longer its highest priority:

# (the end label "2016" makes sure all of 2015 is included)
employment_population_df.loc['2005':'2016'].plot(legend=False)

plt.ylabel('Percent of Adults')
plt.title('Employment to Population Ratio', size=24)

Note how, after the recession of the early 1990s, employment bounced back to its previous level—how the recession and recovery then looked like a "V". No such luck over 2005-2015: the recession and recovery looked, instead, like an "L".

5.4. Coronavirus spring

But right now our focus is clearly on the exceptionalism of the coronavirus spring:

In [ ]:
# right now our focus is clearly on the exceptionalism 
# of the coronavirus spring:

# (slicing from 2019 on puts the 2020 collapse in near-term context)
employment_population_df.loc['2019':].plot(legend=False)

plt.ylabel('Percent of Adults')
plt.title('Employment to Population Ratio', size=24)

5.5. Practice visualizing!

Spend some time playing with and analyzing this employment-population ratio series—this share of American adults who, each month, tell the CPS interviewers that they have jobs. See if you can identify all of these salient features of the series:

  1. the seasonal cycle—the substantial fluctuations in the share of American adults at work that happen each and every year
    • the shrinking of the magnitude of the seasonal cycle: it is a much smaller deal now than it was in the first decade after World War II.
  2. the business cycle—the large but definitely not regular fluctuations with a roughly seven-year period in the share of American adults with jobs
  3. the large and rapid downward business-cycle fall in the employment share that is the "Great Recession" of 2008-2010
    • the remarkably slow bounce-back of the employment share from its Great Recession low
  4. the thirty-year rise in the employment share with the coming of feminism: women were now not discouraged from taking jobs outside of housecleaners, waitresses, nurses, schoolteachers, and secretaries
  5. the tendency of the employment share to fall as more people are retired as the population ages and people enter the labor force later as years spent in school increase
  6. the collapse of the employment share in Coronavirus Spring

5.6. Accessing source information

Remember that we stored information about the source of the data in a dict object? We can pull it out and take a look at it whenever we choose:

In [ ]:
# remember that we stored information about the source
# of the data in a "dict" object? now let's look at that 
# "dict" object for information about the source of the 
# data series:

employment_population_dict
The dict object thus tells us that:

  • the source of the data is the Current Population Survey,
  • the data was last updated on 2020-05-08,
  • the series is called "Employment-Population Ratio, NSA"—that is, it is not seasonally adjusted—
  • it is a monthly data series,
  • it is a ratio,
  • its series ID is LNU02300000, and
  • it can be found in the "20th Century Economic History" data list.

If we wanted to access any one of these, we could pull it out of the "dict" object directly:

In [ ]:
employment_population_dict['Series ID']