For more information, see following books on Python and data analysis:
Back to the normal notebook-of-exercises format for the next few weeks! The exercises for this week build on what we talked about today, and they primarily focus on a set of data about individual Chipotle orders.
TURNING IT IN: Submit your completed Jupyter notebook on Canvas by the deadline. Don't submit any other files.
pandas
is a third-party Python library for doing data analysis. It's developed and maintained by Wes McKinney, who at the time of initial development was working in finance. Suffice to say, pandas is very much beta software, like JupyterLab. There are going to be some bugs, design flaws, and other issues that you're going to run into; I'll address them as they come up in class, and if you run into them on your own, we can work through them together in office hours.
pandas is a foundational part of using Python for data science. Most if not all things that pandas does can be done with plain-jane Python, but, most of the time, pandas does them faster and easier. It's built on top of another extremely powerful third-party Python library called numpy
; if you are doing data analysis on a huge scale, consider numpy
.
We're using pandas because it has a powerful set of structures and functions that make working with large datasets simple. Once you learn these structures and functions, it becomes extremely easy to answer any question you want to ask with a given data set. This is my definition of data analysis: answering questions asked of data.
pandas also interacts nicely with a bunch of other Python libraries and programs, which make doing data science easier.
Importing pandas is simple.
import pandas as pd
%matplotlib inline
You can use the as
keyword in Python to specify a different name for a library you're importing. Here, I'm telling Python that I'm going to use the term pd
whenever I want to use pandas functions. It's basically like giving a Python library a nickname.
That second line, %matplotlib inline
, is a specifier within Jupyter for the matplotlib
library, which, as we already discussed, is for making 2D graphs and charts from data. This specifies that any plots we generate are displayed below the code block, like Jupyter normally does. We'll see this later.
We're going to do some stuff with a dataset that I provided for y'all here. It contains information about attendance at various community centers in Pittsburgh.
I said at the beginning of this lecture that pandas allows us to ask questions of data and get answers from it. In our analysis, we want to answer the following question of this dataset:
Which community center has had the most attendance over time?
NOTE: It's important to look at the comments in the code we're running through today. I'll be describing what's going on, line by line, by commenting my code. Confused as to what something does? Look at the comments first.
# load in the community center data file
data = pd.read_csv("community-center-attendance-2019.csv", index_col="date", parse_dates=True)
# look at the first ten rows of the data
data.head(10)
As you can see, this has the name of the community center and the attendance, all organized in chronological order. Let's see what the attendance values look like over time.
# no prizes for guessing what this does
data.plot()
We can pivot the data so the center names are columns and each row is the number of people attending that community center per day. This is basically rotating the data; it's a common data operation. You'll often hear accountant-types talking about "pivot tables" in Excel. This is like that, but way better.
# use the pivot function to make row values into data columns
data.pivot(columns="center_name", values="attendance_count").head()
That's a little ugly, because whenever there's no attendance data for a center on a given day, the default is NaN
, or "not a number". Thanks, Python. That's really useful.
Maybe we should consider separating the attendance values for each center. Let's first get a list of how much attendance data we have for each center.
# count the number of rows per center
data.groupby("center_name").count()
Hm. We don't have a ton of data for quite a few centers, either because they don't report attendance all that often, or they just aren't open all that often.
So, we're going to write a function that'll filter out any community center that doesn't have a lot of attendance data. We'll apply that filter to ever row in the dataframe using the groupby filter function.
This is... a little esoteric. What we're doing is using a special type of filter function called a lambda that, instead of doing something itself, will apply a function we write to each row. This sounds more complicated than it is. All we're doing is testing to see if each community center has more than 1000 attendance values; that's our standard.
# create a function we will use to perform a filtering operation on the data
# filter out centers that have less then 1000 total entries
# FUNCTION HERE
# use the custom function to filter out rows
popular_centers = data.groupby("center_name").filter(<NAME_OF_FILTER_FUNCTION>,
threshold=1000)
# look at what centers are in the data now
popular_centers.groupby("center_name").count()
That looks better. Now we've got some good data to work with! Let's look at the data again, now that we've filtered it.
NOTE: Part of that above code renames our data. Instead of calling it data
, like at the beginning, we named our filtered dataset to popular_centers
.
# get the first 5 rows
popular_centers.head(5)
# plot it
popular_centers.plot()
This plot... isn't great. It just doesn't do a great job at displaying anything useful. Let's try pivoting again, now that we've eliminated some of the sparser columns.
# use the pivot function to make rows into columns with only the popular community centers
pivoted_data = popular_centers.pivot(columns="center_name", values="attendance_count")
pivoted_data.head()
Not nearly as bad as before! Now, we can plot the attendance over time of individual community centers.
# plot the data!
pivoted_data.plot(figsize=(10,10))
Pretty messy, but definitely better. Now, let's calculate the cumulative sum, a measure that will add up attendance over time. This can give us both attendance values and a general sense of how they're changing over time.
# compute the cumulative sum for every column and make a chart
pivoted_data.cumsum().plot(figsize=(10,10))
Brookline seems to be the winner, but attendance isn't growing as it has been in the past. Let's look at month-by-month attendance.
# resample and compute the monthly totals for the popular community centers
pivoted_data.resample("M").sum().plot(figsize=(10,10))
Hmm, that's way too much noise. Let's plot it by year, instead. That's as simple as changing the M
to a Y
.
# resample and compute the yearly totals for the popular community centers
pivoted_data.resample("Y").sum().plot(figsize=(10,10))
And now, hey presto, we've turned this unwieldy, huge dataset into something that we can use to answer questions.
The major thing that pandas brings to Python is three new data structures: the Series, the Dataframe, and the Index.
A series is a one-dimensional array of indexed data. Let's break that down.
Here's a default, empty Series. What do you get when you run that? What about if you take the type()
of it?
pd.Series()
You can create a Series from a single Python list. When you do that, the indices are just numbers, starting from 0.
# create a regular Python list
my_list = [0.25, 0.5, 0.75, 1.0]
# transform that list into a Series
data = pd.Series(my_list)
# display the data in the series
data
A Series is kind of like a list, in that order matters. The first element will always be before the second element, and so on. You can use indexing to grab an item in a series, like so:
# get the first item in the series
data[0]
pandas makes this way more powerful and a little harder with the .iloc
function. You can use .iloc
to index a Series. This is the idiomatic "pandas-y" way to do it.
# get the first item, but pandas-y, with iloc
data.iloc[0]
# go crazy, grab the fourth element
data.iloc[3]
All of the normal Python indexing styles work with .iloc
. You can use slicing to grab certain sub-lists within a Series.
# get the first two elements
data.iloc[0:2]
# get the last two elements
data.iloc[2:]
You can also think of a Series as a Python dictionary. The main difference is that order matters in a Series. You can grab things by their name, in addition to their numerical index, and you can generate a Series from a dictionary.
# create a regular Python Dictionary
population_dict = {'California': 38332521,
'Texas': 26448193,
'New York': 19651127,
'Florida': 19552860,
'Illinois': 12882135}
# make that dictionary into a Series
population = pd.Series(population_dict)
# display the data
population
You can also make a named Series from two independent Python lists.
# create two ordered lists
population_list = [38332521, 26448193, 19651127, 19552860, 12882135]
states = ['California', 'Texas', 'New York', 'Florida', 'Illinois']
# Create a Series from those two lists
population = pd.Series(population_list, index=states)
# display the data
population
You can index and slice, just like we did before, but you can use the keys instead of the numbers! You use the .loc
method, which just stands for "location", to access items from their named keys.
# select the data value with the name "California"
population.loc['California']
You can do slicing with these keys, too!
population.loc['Texas':'Florida']
Answer the following questions concerning .iloc
, .loc
and Series indexing with pandas.
[0.1, 0.3, 0.75, 1.2, 1.6]
iloc
and slices to grab the second and third elements of your Series.iloc[4]
. There's a way to do it, search around. (Hint: think small.){'Aaron': 65,
'Lauren': 24,
'Joseph': 49,
'Mallory': 32,
'Eric': 19,
'Jeff': 84}
.loc
method..loc
..iloc
.
A pandas DataFrame is the main way to display and manipulate data with pandas. You're going to be suuuuuper familiar with DataFrames by the time we're done here.
A DataFrame is just a two-dimensional Series. It's like a table, or an Excel spreadsheet. Functionally, it's a Series-of-Serieses, a bunch of Series lined up together.
# remember our population Series?
population
# let's make one for the areas, too!
area_dict = {'Illinois': 149995, 'California': 423967,
'Texas': 695662, 'Florida': 170312,
'New York': 141297}
area = pd.Series(area_dict)
area
# now, we create a dictionary containing both of our dictionaries
# meta, huh?
state_info_dictionary = {'population': population,
'area': area}
# then we mash them together into a DataFrame
states = pd.DataFrame(state_info_dictionary)
# let's check our work!
states
pandas just takes care of lining everything up, because our indices (state names) are the same across our two dictionaries!
You can also generate a DataFrame from a list of dictionaries, and from a list of lists.
# create a list of dictionaries that contain our data.
# one dictionary per observation/row
dead_people = [
{"ssn":1, "first_name": "Bob", "last_name": "Jones", "age": 200},
{"ssn":2, "first_name": "Jane", "last_name": "Jones", "age": 199},
{"ssn":3, "first_name": "Ethel", "last_name": "Jones", "age": 180},
{"ssn":4, "first_name": "Hortense", "last_name": "Jones", "age": 178},
{"ssn":5, "first_name": "Vern", "last_name": "Jones", "age": 178}
]
# create a DataFrame from a list of dictionaries
pd.DataFrame(dead_people)
If you create a DataFrame from a list of lists, you can either specify the row indices, or it'll automatically number them, starting at 0.
# auto-number the rows
# create a list of lists, each sub-list is an observation/row
dead_people = [
[1,"Bob","Jones",200],
[2,"Jane","Jones",199],
[3,"Ethel","Jones",180],
[4,"Hortense","Jones",178],
[5,"Vern","Jones",178]
]
# specify the column names separately
column_names = ["ssn","first_name", "last_name", "age"]
# make a DataFrame with column names specified separately
pd.DataFrame(dead_people, columns=column_names)
# specify the row indices
# create a list of lists, each sub-list is an observation/row
dead_people = [
[1,"Bob","Jones",200],
[2,"Jane","Jones",199],
[3,"Ethel","Jones",180],
[4,"Hortense","Jones",178],
[5,"Vern","Jones",178]
]
# specify the column names separately
column_names = ["ssn","first_name", "last_name", "age"]
row_ids = [123,3452,3235,4345,563463]
# make a DataFrame with column names specified separately
dead_dataframe = pd.DataFrame(dead_people, columns=column_names, index=row_ids)
dead_dataframe
In pandas, the Series and the DataFrame are both containers for data; they store information. The Index is what makes that data useful.
You can use indexing to merge otherwise disparate datasets. Remember our list of states?
states
You can also access the column and row labels, programmatically, using the .columns
and .index
functions.
# get the column labels as a list-like data structure
states.columns
# get the row labels as a list-like data structure
states.index
We can use .loc
and .iloc
, and indexing, on DataFrames too! .loc
lets us select specific rows and columns by their name. The syntax for indexing a DataFrame is [ROW, COLUMN]
, where ROW
and COLUMN
are the index values.
# let's get Illinois' population
states.loc["Illinois", "population"]
This can get really powerful, really quickly.
You can use slicing within a row or column when indexing a DataFrame; this includes the :
operator, which selects all indices. You can also use a list in place of a row or column to select all items on that list from the DataFrame.
# two-dimensional slicing
# get the area for states from Florida to Texas
states.loc["Florida":"Texas", "area"]
# using a list to select values
# get the area for Florida and Texas
states.loc[["Florida", "Texas"], "area"]
# use a ":" to specify "all columns"
# get area and population for Florida and Texas
states.loc[["Florida", "Texas"], :]
# select all the rows and columns
states.loc[:,:]
And this all works for .iloc
, too.
# two-dimensional slicing
# get the area for states from Florida to Texas
states.iloc[1:, 1]
# using a list to select values
# get the area for Florida and Texas
states.iloc[[1, 4], 1]
# using a list with -1 index to select values to the end
# get the area for Florida and Texas
states.iloc[[1, -1], 1]
# using a ":" to specify "all columns"
# get area and population for Florida and Texas
states.iloc[[1, -1], :]
Sometimes you might want to go through each row in a DataFrame one at a time. Luckily, pandas has a function called iterrows
that lets you accomplish this. The basic blueprint looks like this:
# df is the name a of a dataframe that we have already created
for index, row in df.iterrows():
# do stuff with each row
Note that when you loop through the data frame, it unpacks the row into two separate parts: the row index, and the actual contents of the row.
The loop variable row
is a dictionary. Each column name in the data frame is a key in the dictionary; the value associated with the key is the value of that column in that particular row.
Now let's see some examples of how we can use iterrows
to loop through a dataframe. We will use our states
dataframe in all of these examples.
# use states.iterrows() because we are iterating through the states dataframe
for index, row in states.iterrows():
print(index)
# use states.iterrows() because we are iterating through the states dataframe
for index, row in states.iterrows():
print(row)
print() # add a blank space between each row
Each row is a dictionary with two keys: population
and area
. The values associated with those keys are the population and area of that row.
# use states.iterrows() because we are iterating through the states dataframe
for index, row in states.iterrows():
population = row['population']
print(population)
# use states.iterrows() because we are iterating through the states dataframe
for index, row in states.iterrows():
area = row['area']
print(area)
# create a variable to keep track of the total population
total = 0
# loop through each row one at a time
for index, row in states.iterrows():
# get the population of the current row
population = row['population']
# update the total to include the current population
total = total + population
print(total)
As you can imagine, being able to loop through the rows opens up a lot of possibilities for performing data analysis. However, writing a for loop to go through each row manually is actually quite inefficient. Luckily, pandas has many built in functions that allow you to accomplish the tasks that you would want to accomplish with looping. These functions take advantage of the structure of the dataframe! You'll learn about all of this next week.
Take a look at the community-center-attendance-2019.csv
file one more time. How is the data structured in the CSV? How does the way the data is stored in the CSV affect how it is read in Python? How could I restructure this data to make certain data analysis tasks faster?
What about in Pandas? How do the data structures in Pandas versus vanilla Python inform our use of the library?
That's it for today! There's a bit more slicing in your lab exercises for the week, and some more serious data analysis as well. Any questions before we begin?