Description: This notebook describes how to use the new PyArrow backend introduced in Pandas 2.0 and how to summarize data with groupby(), agg() and pivot tables.
Use Case: For Learners (Detailed explanation, not ideal for researchers)
Difficulty: Intermediate
Knowledge Required:
Knowledge Recommended:
Completion Time: 90 minutes
Data Format: csv
Libraries Used: Pandas
Research Pipeline: None
# import Pandas
import pandas as pd
# check the version of the installed pandas
pd.__version__
# set the max items to display to 20
pd.options.display.max_seq_items = 20
In April 2023, Pandas 2.0 was released. The defining feature of this release is the new PyArrow backend.
In Pandas basics, we learned how to read data from files of different formats into a dataframe using the `read_*()` methods. What we were doing was essentially loading data into memory. When loading data into memory, we need to decide how the data is stored there. Pandas was initially developed on top of NumPy data structures for memory management. This has advantages, but it also has pain points, which you can read about in this blog post by the creator of Pandas, Wes McKinney.
The new PyArrow backend in Pandas 2.0 changes the way Pandas works with data in memory. In short, the new backend reduces memory consumption during data processing and thus speeds up many operations.
# install pyarrow
!pip install pyarrow
import pyarrow as pa
pa.__version__
In the following, we'll use the dataset on the 2022 Boston Marathon to compare the processing speed of the old NumPy backend and the new PyArrow backend.
Let's first download the sample file and read in the data. As you can see, when we load the data using the read_csv()
method, we only pass the path to the sample file into the method. By default, the NumPy backend is used.
import urllib.request
from pathlib import Path
# Check if a data folder exists. If not, create it.
data_folder = Path('./data/')
data_folder.mkdir(exist_ok=True)
# Get the url to the file and download it
url = 'https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/DataViz3_BostonMarathon2022.csv'
# strip the 'DataViz3_' prefix from the filename when saving locally
urllib.request.urlretrieve(url, './data/'+url.rsplit('/')[-1][9:])
# Success message
print('Sample file ready.')
# create a dataframe, the NumPy backend is used by default
bm_22 = pd.read_csv('./data/BostonMarathon2022.csv')
bm_22
Suppose we would like to find all the runners whose name contains the four letters 't', 'a', 'p', 'i' in that order.
%%timeit -n 10
# get all names with 't', 'a', 'p', 'i'
bm_22.loc[bm_22['FullName'].str.contains('.*t.*a.*p.*i', case=False)]
Now, let's load the data into memory again, but with the PyArrow backend this time.
# read in the data using the pyarrow backend
bm_22_pa = pd.read_csv("./data/BostonMarathon2022.csv", dtype_backend="pyarrow")
Let's grab all names with 't', 'a', 'p', 'i' again and get the execution time.
%%timeit -n 10
# get all names with 't', 'a', 'p', 'i'
bm_22_pa[bm_22_pa['FullName'].str.contains('.*t.*a.*p.*i', case=False)]
Let's check the data type of the `FullName` column.
# check the data type of the FullName column, NumPy as backend
bm_22['FullName'].dtypes
# check the data type of the FullName column, pyarrow as backend
bm_22_pa['FullName'].dtypes
You can see that the data types are different. With NumPy as the backend, the data type is `object`; with PyArrow as the backend, it is `string[pyarrow]`. This indicates that the two backends store the data differently in memory.
The difference in processing speed is expected to become even larger when we have a huge dataset.
Let's generate 100,000 random names.
import random as rd
import string
def generate_random_names(num_names, name_len):
    rd.seed(0)  # make sure we get consistent results in each execution
    name_ls = []
    for i in range(num_names):
        name_ls.append(''.join(rd.choice(string.ascii_lowercase) for j in range(name_len)))
    return name_ls
# generate a list of random names
huge_name_ls = generate_random_names(100000, 15)
Let's create a pandas series to hold the names. By default, NumPy is used as the backend.
# create a pandas series holding the names
huge_name_col = pd.Series(huge_name_ls)
Let's first get all names containing 't', 'a', 'p', 'i' using the `loc` indexer with NumPy as the backend.
%%timeit -n 10
# get all names containing 't', 'a', 'p', 'i'
huge_name_col.loc[huge_name_col.str.contains('.*t.*a.*p.*i', case=False)]
Now, let's change the data type to PyArrow string and repeat the same search.
# create a pandas series to hold the names, data structure set to pyarrow string
huge_name_col_pa = pd.Series(huge_name_ls, dtype=pd.ArrowDtype(pa.string()))
%%timeit -n 10
# get all names containing 't', 'a', 'p', 'i'
huge_name_col_pa.loc[huge_name_col_pa.str.contains('.*t.*a.*p.*i', case=False)]
As you can see, PyArrow provides a data structure that enables memory-efficient string operations.
The performance enhancement applies to numerical data as well.
Let's use the following function to generate 10 million random numbers and compute the mean.
# write a function to generate 10 million random floating-point numbers
import numpy as np
def generate_ten_mil_randnum():
    np.random.seed(0)  # make sure we get consistent results in each execution
    min_value = -100000
    max_value = 10000
    return np.random.uniform(min_value, max_value, 10000000)
# create two series holding the numbers, one with NumPy
# one with pyarrow
nums = pd.Series(generate_ten_mil_randnum())
nums_pa = pd.Series(generate_ten_mil_randnum(),dtype='float64[pyarrow]')
# examine the data
nums.head()
# examine the data
nums_pa.head()
%%timeit -n 10
# calculate the mean, with NumPy as backend
nums.mean()
%%timeit -n 10
# calculate the mean, with PyArrow as backend
nums_pa.mean()
According to the United States Social Security Administration, the most popular given names for male and female babies born during the past century are James and Mary, respectively.
Could you find out how many runners in the 2022 Boston Marathon are named James and how many are named Mary? Try using the two different data representations, i.e., `bm_22` and `bm_22_pa`, and compare the execution times. Do you see any improvement when using the PyArrow backend?
In the Pandas basics series, you learned how to do data cleaning, filtering and preprocessing. The next step is to summarize the data to extract useful information. Pandas provides many methods for summarizing data. In this section, we'll use the Shakespeare dataframe we worked with in Pandas basics 2 to learn these methods.
# Creating a variable `dataset_id` to hold our dataset ID
# The default dataset is Shakespeare Quarterly, 1950-present
# retrieve the metadata
import constellate
dataset_id = "7e41317e-740f-e86a-4729-20dab492e925"
metadata = constellate.get_metadata(dataset_id)
# create a dataframe out of the metadata
shake_df = pd.read_csv(metadata)
Pandas makes summarizing a dataframe very easy. For example, we can count how many non-null values there are in each column using the `.count()` method.
# Get the number of non-null values in each column
shake_df.count()
We can also get the max or min value of a column using the `.max()` and `.min()` methods.
# Get the max value from the year column
shake_df['publicationYear'].max()
# Get the min value from the year column
shake_df['publicationYear'].min()
You can refer to the Pandas documentation for more methods that you can use to query the data.
When you summarize a dataframe, a very useful method is `.describe()`. It quickly displays summary statistics for whatever group of data it is applied to.
# Use the .describe() method to explore the title column
shake_df['title'].describe()
Groupby is a powerful Pandas method for summarizing data: it splits the data into groups based on a variable of your choice.
# Group the data by publicationYear
shake_df.groupby('publicationYear')
The `groupby()` method returns a GroupBy object, which describes how the rows of the original dataset have been split by the selected variable. You can see how the rows of the original dataframe have been grouped using the `groups` attribute after applying `groupby()`.
# See how the rows have been grouped
shake_df.groupby('publicationYear').groups
As you can see, a dictionary is returned whose keys are the unique values in `publicationYear` and whose values are lists of the row indexes belonging to each group.
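A single group can be pulled out of the GroupBy object with `get_group()`, passing one of those keys. A toy sketch (the column names mirror the Shakespeare dataframe, but the values are made up):

```python
import pandas as pd

df = pd.DataFrame({
    'publicationYear': [1950, 1950, 1951],
    'title': ['A', 'B', 'C'],
})
grouped = df.groupby('publicationYear')

# .groups maps each key to the index labels of the rows in that group
print(grouped.groups)
# .get_group() returns the sub-dataframe for one key
print(grouped.get_group(1950))  # the two rows from 1950
```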
You can group the data using multiple variables. For example, you may want to group the documents first by their publication year and then by the publisher.
# Group by multiple variables
# Take a look at the composite keys
shake_df.groupby(['publicationYear', 'publisher']).groups
If you take a look at the groups in the groupby object, you will see that essentially we have a composite key for each group. The first key, for example, is (1950, 'Folger Shakespeare Library'). The value associated with this key is a list of indexes, all of which are the rows storing the documents that were published in 1950 by Folger Shakespeare Library.
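With multiple grouping variables, an individual group can likewise be retrieved with `get_group()`, except that the key is now passed as a tuple. A toy sketch (publisher and title values are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    'publicationYear': [1950, 1950, 1951],
    'publisher': ['Folger', 'Oxford', 'Folger'],
    'title': ['A', 'B', 'C'],
})
grouped = df.groupby(['publicationYear', 'publisher'])

# the key of a multi-variable groupby is a composite (year, publisher) tuple
print(grouped.get_group((1950, 'Folger')))
```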
We have seen how to group the data in a dataframe. Of course, we don't stop at grouping the data; grouping is just a step towards querying it. After we apply the `.groupby()` method, we can use various Pandas methods to query the grouped data. For example, how do we get the number of documents per publisher in each publication year?
# Create a series storing the number of documents by publisher in each year
shake_df.groupby(['publicationYear', 'publisher']).size()
# get the 20 rows at the tail to see the groups more clearly
shake_df.groupby(['publicationYear', 'publisher']).size().tail(20)
After we group the data in a dataframe, we can apply the `agg()` method to calculate multiple statistics per group in one calculation.
For example, let's say we would like to know the sum of the word counts of all the documents from each year. To achieve this, we can group the data by `publicationYear` and then aggregate each subgroup by summing the numerical values in the `wordCount` column.
# Get the sum of word count in docs by publication year
shake_df.groupby('publicationYear').agg({'wordCount':'sum'})
Of course, you can choose other ways to aggregate the data in each subgroup. For example, suppose you are interested in the biggest word count by year.
# the biggest word count by year
shake_df.groupby('publicationYear').agg({'wordCount':'max'})
We can apply multiple aggregating functions to a single column.
# apply multiple functions to a single column
shake_df.groupby('publicationYear').agg({'wordCount':['sum', 'max']})
We can specify multiple columns to apply a function to.
# apply a single function to selected columns in each subgroup
shake_df.groupby('publicationYear').agg({'wordCount':'sum', 'pageCount': 'max'})
We can also apply multiple functions to each of the selected columns.
# apply multiple functions to selected columns in each subgroup
shake_df.groupby('publicationYear').agg({'wordCount':['sum', 'max'], 'pageCount':['max', 'size']})
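The dict-of-lists style above yields columns with a two-level header. When flat, readable column names are preferred, pandas also supports named aggregation, where each output column is spelled out as `new_name=(source_column, function)`. A small sketch on a toy dataframe (column names mirror the Shakespeare dataframe; values are made up):

```python
import pandas as pd

df = pd.DataFrame({
    'publicationYear': [1950, 1950, 1951],
    'wordCount': [100, 200, 300],
    'pageCount': [10, 20, 30],
})

# each keyword names one output column: new_name=(source_column, function)
summary = df.groupby('publicationYear').agg(
    total_words=('wordCount', 'sum'),
    max_pages=('pageCount', 'max'),
)
print(summary)
```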
Take the following dataframe containing information on the failed banks in the US since 2000. Can you work with the dataframe to find out which year witnessed the most failed banks?
# download the sample file
import urllib.request
url = 'https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/Pandas1_failed_banks_since_2000.csv'
file = './data/failed_banks.csv'
urllib.request.urlretrieve(url, file)
print('Sample file ready.')
# Read in the data
banks_df = pd.read_csv(file)
banks_df
# create a new column containing the closing year
# group the data by closing year and aggregate the data to get the number of failed banks by year
# get the year with the most failed banks
In the next code cell, can you find out which state witnessed the most failed banks in 2010?
### find out which state witnessed the most failed banks in 2010
Pandas has a `.pivot_table()` method that we can use to summarize data. It is called on a dataframe and takes parameters specifying the shape of the resulting pivot table.
In the previous section, we used the `.groupby()` and `agg()` methods to summarize data. For example, we grouped the documents in the Shakespeare dataframe by their year of publication and calculated the sum of their word counts. We can do the same thing using the `.pivot_table()` method.
# Create a pivot table giving the sum of
# word count by year
shake_df.pivot_table(index='publicationYear',
                     values='wordCount',
                     aggfunc='sum')
Again, when aggregating the data, you can apply a single function to multiple columns.
# Create a pivot table giving the max value of the wordCount
# and pageCount column by publicationYear
shake_df.pivot_table(index='publicationYear',
                     values=['wordCount', 'pageCount'],
                     aggfunc='max')
You can also apply multiple functions to a single column.
# Create a pivot table giving the sum and the mean value of
# word count by year
shake_df.pivot_table(index='publicationYear',
                     values='wordCount',
                     aggfunc=['sum', 'mean'])
Or, you can apply different functions to different columns.
# Create a pivot table giving the sum of
# the wordCount by publicationYear
# and the max value of pageCount by publicationYear
shake_df.pivot_table(index='publicationYear',
                     values=['wordCount', 'pageCount'],
                     aggfunc={'wordCount':'sum', 'pageCount':'max'})
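`pivot_table()` can also append a grand-total row via `margins=True`. A sketch on a toy dataframe with the same column names (values illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    'publicationYear': [1950, 1950, 1951],
    'wordCount': [100, 200, 300],
})

# margins=True adds an 'All' row holding the aggregate over all groups
pt = df.pivot_table(index='publicationYear', values='wordCount',
                    aggfunc='sum', margins=True)
print(pt)
```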
We have learned how to create a dataframe from files of different formats, how to clean the data and how to summarize the data. With the information we get from summarizing the data, we can go ahead and plot it!
For example, let's plot the number of failed banks by year in the failed banks dataset.
# Prepare the dataframe for plotting: take the two-digit year at the
# end of the 'Closing Date' strings and convert it to a four-digit year
banks_df['Closing Year'] = banks_df['Closing Date'].str[-2:].astype(int) + 2000
# plot a bar chart to show number of failed banks by year
banks_df.groupby('Closing Year').size().plot(kind='bar', ylabel='num_banks')
Congratulations! You have completed Pandas Intermediate 1.
Here are a few solutions for exercises in this lesson.
%%timeit -n 10
# find out how many runners are named James, with NumPy as backend
len(bm_22[bm_22['FullName'].str.startswith('James ')])
%%timeit -n 10
# find out how many runners are named James, with pyarrow as backend
len(bm_22_pa[bm_22_pa['FullName'].str.startswith('James ')])
%%timeit -n 10
# find out how many runners are named Mary, with NumPy as backend
len(bm_22[bm_22['FullName'].str.startswith('Mary ')])
%%timeit -n 10
# find out how many runners are named Mary, with pyarrow as backend
len(bm_22_pa[bm_22_pa['FullName'].str.startswith('Mary ')])
### find out which year witnessed the most failed banks
# download the sample file
import urllib.request
url = 'https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/Pandas1_failed_banks_since_2000.csv'
file = './data/failed_banks.csv'
urllib.request.urlretrieve(url, file)
print('Sample file ready.')
# Read in the data
banks_df = pd.read_csv(file)
# create a new column storing the closing year
banks_df['Closing Year'] = banks_df['Closing Date'].str[-2:].astype(int)+2000
# group the data by closing year and get the number of failed banks in each year
banks_df.groupby('Closing Year').agg('size').sort_values(ascending=False)
### find out which state witnessed the most failed banks in 2010
banks_df.groupby(['Closing Year', 'State']).agg('size')[2010].sort_values(ascending=False)