Lab 7 Lesson

Processing Data with Python, Part 1

Topics

  • introduction to/motivation of pandas
  • analyzing real-life data
  • Series
  • DataFrames
  • manipulating DataFrames with Index

Resources

For more information, see following books on Python and data analysis:

Exercises

Back to the normal notebook-of-exercises format for the next few weeks! The exercises for this week build on what we talked about today, and they primarily focus on a set of data about individual Chipotle orders.

TURNING IT IN: Submit your completed Jupyter notebook on Canvas by the deadline. Don't submit any other files.

So, what is pandas anyway?

pandas is a third-party Python library for doing data analysis. It's developed and maintained by Wes McKinney, who at the time of initial development was working in finance. Suffice to say, pandas is very much beta software, like JupyterLab. There are going to be some bugs, design flaws, and other issues that you're going to run into; I'll address them as they come up in class, and if you run into them on your own, we can work through them together in office hours.

pandas is a foundational part of using Python for data science. Most if not all things that pandas does can be done with plain-jane Python, but, most of the time, pandas does them faster and easier. It's built on top of another extremely powerful third-party Python library called numpy; if you are doing data analysis on a huge scale, consider numpy.

Why are we using pandas?

We're using pandas because it has a powerful set of structures and functions that make working with large datasets simple. Once you learn these structures and functions, it becomes extremely easy to answer any question you want to ask with a given data set. This is my definition of data analysis: answering questions asked of data.

pandas also interacts nicely with a bunch of other Python libraries and programs, which make doing data science easier.

  • Jupyter notebooks, of course, allow you to construct computational narratives with code, data, and text. Displaying dataframes (one of pandas' data structures) as an inline HTML table is one of the major interactions between Jupyter and pandas.
  • Matplotlib is an incredibly powerful graphing library for Python. Generating plots from dataframes is simple with matplotlib and pandas.
  • pandas also integrates with scientific computing/machine learning Python libraries, like SciKit and SciPy.

pandas basics

Importing pandas is simple.

In [ ]:
import pandas as pd

%matplotlib inline

You can use the as keyword in Python to specify a different name for a library you're importing. Here, I'm telling Python that I'm going to use the term pd whenever I want to use pandas functions. It's basically like giving a Python library a nickname.

That second line, %matplotlib inline, is a specifier within Jupyter for the matplotlib library, which, as we already discussed, is for making 2D graphs and charts from data. This specifies that any plots we generate are displayed below the code block, like Jupyter normally does. We'll see this later.

Diving into data

We're going to do some stuff with a dataset that I provided for y'all here. It contains information about attendance at various community centers in Pittsburgh.

I said at the beginning of this lecture that pandas allows us to ask questions of data and get answers from it. In our analysis, we want to answer the following question of this dataset:

     Which community center has had the most attendance over time?

NOTE: It's important to look at the comments in the code we're running through today. I'll be describing what's going on, line by line, by commenting my code. Confused as to what something does? Look at the comments first.

In [ ]:
# load in the community center data file
data = pd.read_csv("community-center-attendance-2019.csv", index_col="date", parse_dates=True)

# look at the first ten rows of the data
data.head(10)

As you can see, this has the name of the community center and the attendance, all organized in chronological order. Let's see what the attendance values look like over time.

In [ ]:
# no prizes for guessing what this does
data.plot()

We can pivot the data so the center names are columns and each row is the number of people attending that community center per day. This is basically rotating the data; it's a common data operation. You'll often hear accountant-types talking about "pivot tables" in Excel. This is like that, but way better.

In [ ]:
# use the pivot function to make row values into data columns
data.pivot(columns="center_name", values="attendance_count").head()

That's a little ugly, because whenever there's no attendance data for a center on a given day, the default is NaN, or "not a number". Thanks, Python. That's really useful.

Maybe we should consider separating the attendance values for each center. Let's first get a list of how much attendance data we have for each center.

In [ ]:
# count the number of rows per center
data.groupby("center_name").count()

Hm. We don't have a ton of data for quite a few centers, either because they don't report attendance all that often, or they just aren't open all that often.

So, we're going to write a function that'll filter out any community center that doesn't have a lot of attendance data. We'll apply that filter to ever row in the dataframe using the groupby filter function.

This is... a little esoteric. What we're doing is using a special type of filter function called a lambda that, instead of doing something itself, will apply a function we write to each row. This sounds more complicated than it is. All we're doing is testing to see if each community center has more than 1000 attendance values; that's our standard.

In [ ]:
# create a function we will use to perform a filtering operation on the data
# filter out centers that have less then 1000 total entries

# FUNCTION HERE

# use the custom function to filter out rows
popular_centers = data.groupby("center_name").filter(<NAME_OF_FILTER_FUNCTION>, 
                                                     threshold=1000)
# look at what centers are in the data now
popular_centers.groupby("center_name").count()

That looks better. Now we've got some good data to work with! Let's look at the data again, now that we've filtered it.

NOTE: Part of that above code renames our data. Instead of calling it data, like at the beginning, we named our filtered dataset to popular_centers.

In [ ]:
# get the first 5 rows
popular_centers.head(5)
In [ ]:
# plot it
popular_centers.plot()

This plot... isn't great. It just doesn't do a great job at displaying anything useful. Let's try pivoting again, now that we've eliminated some of the sparser columns.

In [ ]:
# use the pivot function to make rows into columns with only the popular community centers
pivoted_data = popular_centers.pivot(columns="center_name", values="attendance_count")
pivoted_data.head()

Not nearly as bad as before! Now, we can plot the attendance over time of individual community centers.

In [ ]:
# plot the data! 
pivoted_data.plot(figsize=(10,10))

Pretty messy, but definitely better. Now, let's calculate the cumulative sum, a measure that will add up attendance over time. This can give us both attendance values and a general sense of how they're changing over time.

In [ ]:
# compute the cumulative sum for every column and make a chart
pivoted_data.cumsum().plot(figsize=(10,10))

Brookline seems to be the winner, but attendance isn't growing as it has been in the past. Let's look at month-by-month attendance.

In [ ]:
# resample and compute the monthly totals for the popular community centers
pivoted_data.resample("M").sum().plot(figsize=(10,10))

Hmm, that's way too much noise. Let's plot it by year, instead. That's as simple as changing the M to a Y.

In [ ]:
# resample and compute the yearly totals for the popular community centers
pivoted_data.resample("Y").sum().plot(figsize=(10,10))

And now, hey presto, we've turned this unwieldy, huge dataset into something that we can use to answer questions.

Let's talk data structures

The major thing that pandas brings to Python is three new data structures: the Series, the Dataframe, and the Index.

Series: a list, but a dictionary

A series is a one-dimensional array of indexed data. Let's break that down.

  • one-dimensional: only one column of data
  • array: a contiguous representation of the data (i.e. data[2] comes after data[1])
  • indexed: the data has indices that you use to access the data. These can be numbers or other, more complicated keys.

Here's a default, empty Series. What do you get when you run that? What about if you take the type() of it?

In [ ]:
pd.Series()

You can create a Series from a single Python list. When you do that, the indices are just numbers, starting from 0.

In [ ]:
# create a regular Python list
my_list = [0.25, 0.5, 0.75, 1.0]

# transform that list into a Series
data = pd.Series(my_list)

# display the data in the series
data

A Series is kind of like a list, in that order matters. The first element will always be before the second element, and so on. You can use indexing to grab an item in a series, like so:

In [ ]:
# get the first item in the series
data[0]

pandas makes this way more powerful and a little harder with the .iloc function. You can use .iloc to index a Series. This is the idiomatic "pandas-y" way to do it.

In [ ]:
# get the first item, but pandas-y, with iloc
data.iloc[0]
In [ ]:
# go crazy, grab the fourth element
data.iloc[3]

All of the normal Python indexing styles work with .iloc. You can use slicing to grab certain sub-lists within a Series.

In [ ]:
# get the first two elements
data.iloc[0:2]
In [ ]:
# get the last two elements
data.iloc[2:]

Alternate indices

You can also think of a Series as a Python dictionary. The main difference is that order matters in a Series. You can grab things by their name, in addition to their numerical index, and you can generate a Series from a dictionary.

In [ ]:
# create a regular Python Dictionary
population_dict = {'California': 38332521,
                   'Texas': 26448193,
                   'New York': 19651127,
                   'Florida': 19552860,
                   'Illinois': 12882135}

# make that dictionary into a Series 
population = pd.Series(population_dict)

# display the data
population

You can also make a named Series from two independent Python lists.

In [ ]:
# create two ordered lists
population_list = [38332521, 26448193, 19651127, 19552860, 12882135]
states = ['California', 'Texas', 'New York', 'Florida', 'Illinois']

# Create a Series from those two lists
population = pd.Series(population_list, index=states)

# display the data
population

You can index and slice, just like we did before, but you can use the keys instead of the numbers! You use the .loc method, which just stands for "location", to access items from their named keys.

In [ ]:
# select the data value with the name "California"
population.loc['California']

You can do slicing with these keys, too!

In [ ]:
population.loc['Texas':'Florida']

Group Assignment: some indexing questions

Answer the following questions concerning .iloc, .loc and Series indexing with pandas.

  1. Make a pandas Series from a Python list that contains the following numbers: [0.1, 0.3, 0.75, 1.2, 1.6]
  2. Use iloc and slices to grab the second and third elements of your Series.
  3. Grab the last element of your Series.
  4. As a challenge, grab the last element of your Series without using the length of the list. You can't use slicing or just say iloc[4]. There's a way to do it, search around. (Hint: think small.)
  5. Create another Series from the following dictionary:
     {'Aaron': 65,
      'Lauren': 24,
      'Joseph': 49,
      'Mallory': 32,
      'Eric': 19,
      'Jeff': 84}
  6. Print out Lauren's age, using the .loc method.
  7. Print out the elements between Joseph and Jeff using .loc.
  8. Do the same thing, but with numerical slicing using .iloc.
In [ ]:
 

Dataframe: adding a dimension

A pandas DataFrame is the main way to display and manipulate data with pandas. You're going to be suuuuuper familiar with DataFrames by the time we're done here.

A DataFrame is just a two-dimensional Series. It's like a table, or an Excel spreadsheet. Functionally, it's a Series-of-Serieses, a bunch of Series lined up together.

In [ ]:
# remember our population Series?
population
In [ ]:
# let's make one for the areas, too!
area_dict = {'Illinois': 149995, 'California': 423967, 
             'Texas': 695662, 'Florida': 170312, 
             'New York': 141297}
area = pd.Series(area_dict)
area
In [ ]:
# now, we create a dictionary containing both of our dictionaries
# meta, huh?
state_info_dictionary = {'population': population,
                       'area': area}

# then we mash them together into a DataFrame
states = pd.DataFrame(state_info_dictionary)
# let's check our work!
states

pandas just takes care of lining everything up, because our indices (state names) are the same across our two dictionaries!

You can also generate a DataFrame from a list of dictionaries, and from a list of lists.

In [ ]:
# create a list of dictionaries that contain our data.
# one dictionary per observation/row
dead_people = [
    {"ssn":1, "first_name": "Bob", "last_name": "Jones", "age": 200},
    {"ssn":2, "first_name": "Jane", "last_name": "Jones", "age": 199},
    {"ssn":3, "first_name": "Ethel", "last_name": "Jones", "age": 180},
    {"ssn":4, "first_name": "Hortense", "last_name": "Jones", "age": 178},
    {"ssn":5, "first_name": "Vern", "last_name": "Jones", "age": 178}
]

# create a DataFrame from a list of dictionaries
pd.DataFrame(dead_people)

If you create a DataFrame from a list of lists, you can either specify the row indices, or it'll automatically number them, starting at 0.

In [ ]:
# auto-number the rows

# create a list of lists, each sub-list is an observation/row
dead_people = [
    [1,"Bob","Jones",200],
    [2,"Jane","Jones",199],
    [3,"Ethel","Jones",180],
    [4,"Hortense","Jones",178],
    [5,"Vern","Jones",178]
]

# specify the column names separately
column_names = ["ssn","first_name", "last_name", "age"]

# make a DataFrame with column names specified separately
pd.DataFrame(dead_people, columns=column_names)
In [ ]:
# specify the row indices

# create a list of lists, each sub-list is an observation/row
dead_people = [
    [1,"Bob","Jones",200],
    [2,"Jane","Jones",199],
    [3,"Ethel","Jones",180],
    [4,"Hortense","Jones",178],
    [5,"Vern","Jones",178]
]

# specify the column names separately
column_names = ["ssn","first_name", "last_name", "age"]

row_ids = [123,3452,3235,4345,563463]

# make a DataFrame with column names specified separately
dead_dataframe = pd.DataFrame(dead_people, columns=column_names, index=row_ids)
dead_dataframe

Index and DataFrame slicing

In pandas, the Series and the DataFrame are both containers for data; they store information. The Index is what makes that data useful.

  • In a Series, the Index is the key to each value in the list.
  • In a DataFrame, the index is the column name. There is also an index for each row.

You can use indexing to merge otherwise disparate datasets. Remember our list of states?

In [ ]:
states

You can also access the column and row labels, programmatically, using the .columns and .index functions.

In [ ]:
# get the column labels as a list-like data structure
states.columns
In [ ]:
# get the row labels as a list-like data structure
states.index

We can use .loc and .iloc, and indexing, on DataFrames too! .loc lets us select specific rows and columns by their name. The syntax for indexing a DataFrame is [ROW, COLUMN], where ROW and COLUMN are the index values.

In [ ]:
# let's get Illinois' population
states.loc["Illinois", "population"]
In [ ]:
 

This can get really powerful, really quickly.

You can use slicing within a row or column when indexing a DataFrame; this includes the : operator, which selects all indices. You can also use a list in place of a row or column to select all items on that list from the DataFrame.

In [ ]:
# two-dimensional slicing

# get the area for states from Florida to Texas
states.loc["Florida":"Texas", "area"]
In [ ]:
# using a list to select values

# get the area for Florida and Texas
states.loc[["Florida", "Texas"], "area"]
In [ ]:
# use a ":" to specify "all columns"

# get area and population for Florida and Texas
states.loc[["Florida", "Texas"], :]
In [ ]:
# select all the rows and columns
states.loc[:,:]

And this all works for .iloc, too.

In [ ]:
# two-dimensional slicing

# get the area for states from Florida to Texas
states.iloc[1:, 1]
In [ ]:
# using a list to select values

# get the area for Florida and Texas
states.iloc[[1, 4], 1]
In [ ]:
# using a list with -1 index to select values to the end

# get the area for Florida and Texas
states.iloc[[1, -1], 1]
In [ ]:
# using a ":" to specify "all columns"

# get area and population for Florida and Texas
states.iloc[[1, -1], :]

The end!

That's it for today! There's a bit more slicing in your lab exercises for the week, and some more serious data analysis as well. Any questions before we begin?