Notebook

Pandas Intermediate 2¶

Description: This notebook describes how to:

Work with time series data in Pandas
Use .plot() to make different kinds of static charts

Use Case: For Learners (Detailed explanation, not ideal for researchers)

Difficulty: Intermediate

Knowledge Required:

Python Basics (Start Python Basics I)
Pandas Basics (Start Pandas Basics I)

Knowledge Recommended:

Completion Time: 90 minutes

Data Format: csv

Libraries Used: Pandas

Research Pipeline: None ___

In [ ]:

# import Pandas, `as pd` allows us to shorten typing pandas to pd
import pandas as pd

Times series data¶

Time-series data is everywhere. You see it used in public health record, weather records and many other places. Pandas library provides a comprehensive framework for working with times, dates, and time-series data.

In real life, we often find datasets where dates and times are stored as strings.

In [ ]:

import urllib
from pathlib import Path

# Check if a data folder exists. If not, create it.
data_folder = Path('./data/')
data_folder.mkdir(exist_ok=True)

# Download the sample files
url = 'https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/Pandas1_failed_banks_since_2000.csv'
file = './data/failed_banks.csv'
urllib.request.urlretrieve(url, file)
print('Sample file ready.')

# Read in the data
banks_df = pd.read_csv(file)
banks_df

In [ ]:

# check the data type of the Closing Date column
banks_df['Closing Date']

As you can see, the dates in the Closing Date column are strings. We can convert the data type to datetime using the to_datetime() function in Pandas. The strings in this column will be turned into datetime objects.

The datetime objects¶

In [ ]:

# convert the strings to datetime
banks_df['Closing Date'] = pd.to_datetime(banks_df['Closing Date'], format='%d-%b-%y')

In [ ]:

# check the data type of the Closing Date column
banks_df['Closing Date']

When we parse the strings in the Closing Date column into datetime objects, we specify a format to inform Pandas of how to do the parsing. For the cheatsheet of the format codes, you can check the documentation here.

One good thing about converting the dates and times into datetime objects is that we can then use them to index a dataframe. You will soon see that indexing using datetime objects is very flexible.

In [ ]:

# set the index to Closing Date
banks_df = banks_df.set_index('Closing Date').sort_index()

In [ ]:

# select data between two dates
banks_df.loc['2020-01-01':'2023-07-01']

Providing the full form YYYY-MM-DD when selecting data from the dataframe is expected, as our indexes are in the format of YYYY-MM-DD. However, what makes datetime indexes really convenient is that we can do partial indexing with them!

In [ ]:

# select data between two months
banks_df.loc['2020-01':'2023-07']

As you can see, even if we do not provide the full forms, we do not get an error.

In [ ]:

# select data between two years
banks_df.loc['2020':'2023']

In [ ]:

# reset the index for later use
banks_df = banks_df.reset_index()

Note that this kind of partial indexing is impossible with other data types. Let's create a small dataframe to test what will happen if we use partial indexing with string dates.

In [ ]:

# create a df with a column of dates stored as strings
date_str = pd.DataFrame({'Date':['2020-06-01', '2021-06-30'],
                        'num_failed_banks':[1,2]})

# set the index column to the Date column
date_str = date_str.set_index('Date')

# select data using partial indexing
date_str.loc['2020']

We got a key error! This is because '2020' is not one of the indexes and partial indexing with strings is impossible. If we supply '2020' as the index, Pandas will try to find the rows whose index is '2020', but there are no rows with this index in our small dataframe!

The datetime class has a variety of attributes that we can access. For example, we can access the 'year', 'month', and 'date' from the full form of a datetime object.

In [ ]:

# access the 'year' from datetime objects
banks_df['Closing Date'].dt.year

The datetime class also has a variety of methods that we can use. For example, if we would like to get the day of week for each date in the Closing Date column, we can use the day_name() method.

In [ ]:

# access the day of week from datetime objects
banks_df['Closing Date'].dt.day_name()

The attributes and methods available in the datetime class make it extremely easy to extract the 'year', 'month' and 'date' into different columns.

In the following code cell, can you write some code to create a new column that contains the closing years of the banks? In Pandas basics 2, we did it with string slicing, but this time let's try using the attributes of the datetime class to do it.

In [ ]:

# create a new column containing the closing years

Grouping data using the datetime objects is also quite convenient. We can use the resample() method to specify the granularity with which we would like to group the data.

In [ ]:

# group the data using resample()
banks_df.resample('Y', on='Closing Date').size()

With resample(), we can choose to upsample or downsample.

In [ ]:

# downsample to month with smaller granularity
banks_df.resample('M', on='Closing Date').size()

In [ ]:

# upsample to 12 hours with greater granularity
banks_df.resample('12H', on='Closing Date').size()

In Pandas intermediate 1, you have learned how to group the data using groupby(). In the next code cell, can you use both groupby() and resample() to get how many banks were failed in each state by year?

In [ ]:

# get how many banks were failed in each state by year

The timedelta objects¶

In Pandas, the difference between two time points is a timedelta object. In other words, while datetime objects are used to represent instants of time, timedelta objects are used to represent durations of time.

When you substract one datetime object from another, you'll get a timedelta object.

In [ ]:

# substract one time point from another 
pd.to_datetime('2020-06-01 12:00:00') - pd.to_datetime('2020-06-01 8:00:00')

In the following, let's grab the 2022 Boston Marathon dataset and use the completion time of the runners as an example.

In [ ]:

# Get the urls to the files and download the files
import urllib.request
url = 'https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/DataViz3_BostonMarathon2022.csv'
urllib.request.urlretrieve(url, './data/'+url.rsplit('/')[-1][9:])
    
# Success message
print('Sample file ready.')

# create a dataframe
bm_22 = pd.read_csv('./data/BostonMarathon2022.csv')

# get the data type of the OfficialTime column
bm_22['OfficialTime'].info()

In [ ]:

# turn the data type of the OfficialTime column to timedelta
bm_22['OfficialTime'] = pd.to_timedelta(bm_22['OfficialTime'])

In [ ]:

# get the difference in completion time between the 
# fastest runner and the slowest runner
bm_22['OfficialTime'].max() - bm_22['OfficialTime'].min()

Just like the datetime objects, the timedelta objects also have a variety of attributes and methods that we can make use of.

In [ ]:

# get the components of a timedelta object
bm_22['OfficialTime'].dt.components

In [ ]:

# Get the completion time in seconds
bm_22['OfficialTime'].dt.total_seconds()

Suppose you would like to get the average completion time of the runners in the United States. How do you do that?

In [ ]:

# get the average completion time of the runners in the US

Plotting in Pandas¶

Pandas uses the .plot() method to create charts and plots. In this section, we'll learn how to use the .plot() method to make different kinds of charts.

Scatterplot¶

Scatter plots are usually used to show the relationship between different variables. Suppose we would like to see whether there is a relationship between the age of the runners and their completion time.

In [ ]:

# prepare the data
bm_22_sc = bm_22[['AgeOnRaceDay', 'OfficialTime']].copy()

# make a scatter plot
bm_22_sc.plot(kind='scatter', x='AgeOnRaceDay', y='OfficialTime')

If the data are categorically grouped, you can slice the data into smaller data series and color them differently.

For example, suppose we are interested in the relationship between the age and the completion time of the runners from France.

In [ ]:

# prepare the data
bm_22_sc_fm = bm_22.loc[bm_22['CountryOfResName']=='France'][['AgeOnRaceDay', 'OfficialTime', 'Gender']].copy()
bm_22_sc_fm = bm_22_sc_fm.sort_values(by='Gender').reset_index(drop=True)

# get the cutting point in the Gender column
bm_22_sc_fm.loc[bm_22_sc_fm['Gender']=='F'].tail(1)

In [ ]:

# make a scatter plot
ax1 = bm_22_sc_fm.loc[:42].plot(kind='scatter',
                                   x='AgeOnRaceDay',
                                   y='OfficialTime', 
                                   c='blue',label='F')

ax2 = bm_22_sc_fm.loc[43:].plot(kind='scatter',
                                   x='AgeOnRaceDay',
                                   y='OfficialTime', 
                                   c='orange',label='M', ax=ax1)

Bar charts¶

In Pandas intermediate 1, we have plotted a bar chart showing the number of failed banks by year in the failed banks dataset. In this section, let's plot more kinds of bar charts. Suppose we would like to get the top ten non-US countries with the most runners and plot the number of runners from them in a horizontal bar chart.

In [ ]:

# prepare the data
bm_22_hbar = bm_22.groupby('CountryOfResName').size().sort_values().iloc[-11:-1]
bm_22_hbar

In [ ]:

# make a horizontal bar chart
bm_22_hbar.plot(kind='barh')

Now, suppose we would like to break the total number of runners in each country by gender and make a stacked bar chart instead.

In [ ]:

# prepare the data
bm_22_sthbar = bm_22[['CountryOfResName', 'Gender']].copy()

# get the names of the top 10 non-US countries with the most runners 
ctry = bm_22_sthbar.groupby('CountryOfResName').size().sort_values().iloc[-11:-1].index

# restructure the df for plotting
bm_22_sthbar = bm_22_sthbar.loc[bm_22_sthbar['CountryOfResName'].isin(ctry)].copy()
bm_22_sthbar = bm_22_sthbar.groupby(['CountryOfResName', 'Gender']).size().to_frame().unstack()
bm_22_sthbar.columns = bm_22_sthbar.columns.droplevel(0)
bm_22_sthbar['sum'] = bm_22_sthbar['F'] + bm_22_sthbar['M'] 
bm_22_sthbar = bm_22_sthbar.sort_values(by='sum').drop(columns='sum')

In [ ]:

# make the stacked horizontal bar chart
bm_22_sthbar.plot(kind='barh', stacked=True)

Coding Challenge! < / >

Can you plot another bar chart showing the number of female and male runners from the same ten countries but this time with the two bars for each country standing next to each other?

In [ ]:

# make a bar chart showing the number of female and male runners 
# of the top non-US countries with the most runners
# with two bars for each country, one for female, one for male

Pie chart¶

A pie chart is usually used to show how a total amount is divided between different levels of a categorical variable. In a pie chart, the levels of a categorical variable is represented by a slice of the pie.

Let's make a pie chart that shows how the total number of runners are divided between the completion time in hours.

In [ ]:

### prepare the data

# get the column(s) of interest
bm_22_pie = bm_22[['OfficialTime']].copy()

# create a new column containing the hours
bm_22_pie['Hour'] = bm_22_pie['OfficialTime'].dt.components['hours']

# group the data by Hour and get how many runners there are in each subgroup
bm_22_pie = bm_22_pie.groupby('Hour').size()

# make the pie chart
bm_22_pie.plot(kind='pie', labels=None, legend=True)

Histogram¶

A histogram is a bar chart which shows the frequency of observations. In a histogram, the x-axis is a continuous quantitative value. The height of each bar shows the frequency of a certain range of values. The biggest difference between a bar chart and a histogram is that a bar chart has categorical values on the x-axis but a histogram has continuous quantitative values on the x-axis.

Suppose we are interested in the distribution of the age of the runners in ranges of 10. How many runners were of age 20 - 29 at the time of race? How many were of age 30 - 39? Moreover, we would like to make a histogram that shows the distribution of age. Before we do that, can you take a guess which age group has the most runners?

In [ ]:

# get the youngest and oldest age
print(bm_22['AgeOnRaceDay'].max())
print(bm_22['AgeOnRaceDay'].min())

In [ ]:

# prepare the data for plotting
bm_22_hist = bm_22['AgeOnRaceDay'].copy()

# make a range of bins
bins = range(10, 100, 10)

bm_22_hist.plot(kind='hist', bins=bins, edgecolor='black', xlabel='Age',ylabel='Num_runners')

In the next code cell, can you make a histogram that displays the distribution of the completion time of the runners in ranges of 1 hour?

In [ ]:

# make a histogram displaying the distribution of 
# the completion time of the runners in ranges of 1 hour

Line graph¶

A line graph consists of points connected by line segments. It is commonly used to demonstrate changes in value. Oftentimes, the horizontal axis holds "time" and the vertical axis shows the change in a value of interest.

In Pandas intermediate 1, we made a bar chart to display the number of failed banks by year. Let's make a line graph to show the change in the number over the years in that dataset.

In [ ]:

### prepare the data for plotting

# get the column(s) of interest
banks_line = banks_df[['Closing Date']].copy()

# create a new column containing the years
banks_line['Year'] = banks_line['Closing Date'].dt.year.astype(str)

# group the data by year and get how many banks failed in each year
banks_line = banks_line.groupby('Year').size()

# plot the line graph
banks_line.plot(xlabel='Year')

Lesson Complete¶

Congratulations! You have completed Pandas Intermediate 2.

Start Next Lesson: Pandas intermediate 3 ->¶

Exercise Solutions¶

Here are a few solutions for exercises in this lesson.

In [ ]:

# get how many banks were failed in each state by year
banks_df.groupby('State').resample('Y', on='Closing Date').size()

In [ ]:

# get the average completion time of the runners from the US

# Get the url to the file and download the file
import urllib.request
url = 'https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/DataViz3_BostonMarathon2022.csv'
urllib.request.urlretrieve(url, './data/'+url.rsplit('/')[-1][9:])
    
# Success message
print('Sample file ready.')

# create a dataframe
bm_22 = pd.read_csv('./data/BostonMarathon2022.csv')

# turn the 'OfficialTime' column to type timedelta
bm_22['OfficialTime'] = pd.to_timedelta(bm_22['OfficialTime'])

# get the average completion time of the runners in the US
bm_22.loc[bm_22['CountryOfResName']=='United States of America']['OfficialTime'].mean()

In [ ]:

## create a bar chart with two bard next to each other for each country
# prepare the data
bm_22_bar_fm = bm_22[['CountryOfResName', 'Gender']].copy()

# get the names of the top 10 non-US countries with the most runners 
ctry = bm_22_bar_fm.groupby('CountryOfResName').size().sort_values().iloc[-11:-1].index

# restructure the df for plotting
bm_22_bar_fm = bm_22_bar_fm.loc[bm_22_bar_fm['CountryOfResName'].isin(ctry)].copy()
bm_22_bar_fm = bm_22_bar_fm.groupby(['CountryOfResName', 'Gender']).size().to_frame().unstack()
bm_22_bar_fm.columns = bm_22_bar_fm.columns.droplevel(0)

# plot the bar chart
bm_22_bar_fm.plot(kind='bar')

In [ ]:

### make a histogram showing the distribution of completion time 
### in ranges of 1 hour

# Get the url to the file and download the file
import urllib.request
url = 'https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/DataViz3_BostonMarathon2022.csv'
urllib.request.urlretrieve(url, './data/'+url.rsplit('/')[-1][9:])
    
# Success message
print('Sample file ready.')

# create a dataframe
bm_22 = pd.read_csv('./data/BostonMarathon2022.csv')

# get the column of interest
bm_22_time = bm_22[['OfficialTime']].copy()

# make a new column containing the hour only
bm_22_time['Hour'] = bm_22_time['OfficialTime'].str[0].astype(int)

# get the max and min value from OfficialTime
min_h = bm_22_time['Hour'].min()
max_h = bm_22_time['Hour'].max()

# make the bins
bins = range(min_h, max_h+2)

# plot the histogram
bm_22_time['Hour'].plot(kind='hist', logy=True, bins=bins, edgecolor='black', ylabel='Num_runners')

In [ ]: