Grouping

The pandas groupby method lets us group our data on the values in a column or columns and look at summary measures for the records sharing the same values.

For example, let's load the speed camera dataset again and ask which camera locations or days of the week have produced the most violations.

In [ ]:
import pandas as pd
df = pd.read_csv("data/Speed_Camera_Violations.csv")
df.head()

Now that the data is loaded, let's find the 10 locations with the most total violations recorded.

To do this, we need to group by the ADDRESS column, then examine the VIOLATIONS column of the resulting grouped dataframe.

In [ ]:
# first let's group by address and look at descriptive statistics for the first 10 addresses
df.groupby(["ADDRESS"])["VIOLATIONS"].describe().head(10)

The above records aren't sorted in any meaningful way, but the first thing to note is that the index is no longer just an integer; it is now the address. This is because the groupby method returns a special object whose index is made up of the values of the column being grouped on.
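We can confirm this by inspecting the index of the grouped result directly (a quick check, using the same groupby as above):

In [ ]:
# the index labels are now the address strings themselves
df.groupby(["ADDRESS"])["VIOLATIONS"].describe().index[:5]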

We can still use the loc indexer with this new grouped object to, for example, find the count for a given address:

In [ ]:
# `count` returns the number of rows for this address, not the total violation count.
# i.e. this tells us the number of observations.
df.groupby(["ADDRESS"])["VIOLATIONS"].count().loc["19 W CHICAGO AVE"]
In [ ]:
# to get the total violation count, we want the `sum` method:
df.groupby(["ADDRESS"])["VIOLATIONS"].sum().loc["19 W CHICAGO AVE"]
In [ ]:
# Now let's get the top 10 camera locations by total violation count:
df.groupby(["ADDRESS"])["VIOLATIONS"].sum().sort_values(ascending=False).head(10)

It's possible that some locations just have more observations than others, so a more meaningful measure is probably the mean violation count per observation. To get this we just need to use the mean method rather than sum.

In [ ]:
df.groupby(["ADDRESS"])["VIOLATIONS"].mean().sort_values(ascending=False).head(10)
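If we want to sanity-check that the high-mean locations aren't just low-observation flukes, the agg method can compute the count and the mean in one pass (a small sketch using the same grouping):

In [ ]:
# observation count and mean violations per address, sorted by the mean
df.groupby(["ADDRESS"])["VIOLATIONS"].agg(["count", "mean"]).sort_values("mean", ascending=False).head(10)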

How about days of the week? When are people most likely to be caught speeding?

The simplest way to do this is to create a new weekday column and group on that.

In [ ]:
# datetime series have a special `dt` property that exposes the date/time-specific functionality.
# In this case, dayofweek is a 0-based index where 0 = Monday, 6 = Sunday.
df["VIOLATION DATE"] = pd.to_datetime(df["VIOLATION DATE"], format="%m/%d/%Y")
df["VIOLATION DATE"].dt.dayofweek.head()
In [ ]:
df["DAY OF WEEK"] = df["VIOLATION DATE"].dt.dayofweek
df.groupby(["DAY OF WEEK"])["VIOLATIONS"].mean()

Plotting

It's not easy to grasp the distribution of speeding violations by day of the week at a glance from the table above, so let's produce a simple plot to help visualize it.

Pandas has a number of convenience methods that let us produce plots directly, without having to interact with matplotlib ourselves.

In [ ]:
# a bit of jupyter-specific magic
%matplotlib inline
In [ ]:
# We can call `plot` on just about anything to get a minimally-formatted plot
df.groupby(["DAY OF WEEK"])["VIOLATIONS"].mean().plot()

We can do much better. Let's do a horizontal bar plot, renaming the labels to the actual days of the week, and starting the X-axis at 0 to give a better sense of how much this data actually varies.

In [ ]:
# Create a mapping of day number to day name
daynames = {0: "Mon", 1: "Tue", 2: "Wed", 3: "Thu", 4: "Fri", 5: "Sat", 6: "Sun"}

# save the "mean violations grouped by day of week" series in a variable
violations_by_day_of_week = df.groupby(["DAY OF WEEK"])["VIOLATIONS"].mean()

# use the mapping created above to give the series index labels sensible values
violations_by_day_of_week.rename(index=daynames, inplace=True)

# use the reindex method to order them the way we want in the plot (starting with Sunday at the top of the y axis)
violations_by_day_of_week = violations_by_day_of_week.reindex(["Sat", "Fri", "Thu", "Wed", "Tue", "Mon", "Sun"])

# and finally, create the bar plot
violations_by_day_of_week.plot(kind='barh', title="City of Chicago Speed Camera Violations By Day Of Week") 

Combining DataFrames

Often you will need to combine data from multiple data sets together. There are two types of combinations: concatenations and merges (aka joins).

Concatenating means taking multiple DataFrame objects and appending their rows together to make a new DataFrame. In general you will do this when your datasets contain the same columns and you are combining observations of the same type together into one dataset that contains all the rows from all the datasets.

Merging is joining DataFrames together SQL-style by using common values. This is useful when you have multiple datasets with common keys and you want to combine them into one dataset that contains columns from all the datasets being merged.

In [ ]:
# Concatenation example
df1 = pd.DataFrame({'A': [1, 2, 3],
                    'B': [4, 5, 6]})

df2 = pd.DataFrame({'A': [7, 8, 9],
                    'B': [10, 11, 12]})

print("df1: ")
print(df1)
print()
print("df2: ")
print(df2)
print()
print("concatenated: ")
print(pd.concat([df1, df2]))
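Note that the row labels 0, 1, 2 repeat in the concatenated result, because each original DataFrame keeps its own index. If we'd rather have a fresh integer index, pd.concat accepts an ignore_index parameter (a quick sketch):

In [ ]:
# build a new 0..n-1 index instead of keeping the original row labels
print(pd.concat([df1, df2], ignore_index=True))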
In [ ]:
# Merge example
df1 = pd.DataFrame({'A': [1, 2, 3],
                    'B': [4, 5, 6]})

df2 = pd.DataFrame({'A': [1, 2, 3],
                    'C': ["foo", "bar", "baz"]})

print("df1: ")
print(df1)
print()
print("df2: ")
print(df2)
print()
print("merged: ")
print(pd.merge(df1, df2))
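By default, merge joins on all the columns the two DataFrames have in common (here just A) and keeps only rows that match in both. To be explicit about the key, or to keep unmatched rows, we can pass the on and how parameters (a small sketch with the same df1 and df2):

In [ ]:
# a left join on column A keeps every row of df1 even when df2 has no match
print(pd.merge(df1, df2, on="A", how="left"))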

Hierarchical Indexing

Up to now we have looked only at 1- and 2-dimensional data. While pandas has offered objects for handling 3- and 4-dimensional data (the Panel and Panel4D, both since deprecated), it's often more convenient to keep higher-dimensional data in a DataFrame but use an index with multiple levels. This kind of index is called a MultiIndex in pandas.

As an example, let's use mean annual water level data for the Great Lakes for 2015 and 2016. We'd like to be able to look up the mean water level by year and lake in a single series. To do this, we can create the series as before, but pass in a list of lists for the index.

In [ ]:
data = [183.57, 183.58, 176.59, 176.70, 175.20, 175.35, 174.32, 174.41, 74.74, 74.80]


labels = [
    ['Superior', 'Superior',
     'Michigan-Huron', 'Michigan-Huron',
     'St. Clair', 'St. Clair',
     'Erie', 'Erie',
     'Ontario', 'Ontario'],
    [2015, 2016,
     2015, 2016,
     2015, 2016,
     2015, 2016,
     2015, 2016,]
]

mean_levels = pd.Series(data, index=labels)
# name the index levels (note that Index.rename returns a new index rather than modifying it in place)
mean_levels.index.set_names(['Lake', 'Year'], inplace=True)
mean_levels
In [ ]:
# We can now get data by Lake and Year:
mean_levels['Superior', 2015]
In [ ]:
# Or do "partial indexing": select a label from one level of the index, getting back a series with all remaining levels:
mean_levels['Superior']
In [ ]:
# to do partial indexing on the "inner" level, we need to pass in an empty slice for the outer level:
mean_levels[:, 2015]

But why torture a series like this when this data could easily be represented in a DataFrame, with the lakes as columns?

Because we can do this same thing with more series and combine them into a dataframe that shares this two-dimensional index! If you squint, it's higher-dimensional data!!

We have a CSV containing monthly water level readings for the great lakes going back to 1918. We'll use it to calculate annual min and max readings, and then combine those calculated series with this series that contains annual means to create a new dataframe with year and lake as the index, and min, max, and mean as the columns.

In [ ]:
# the `index_col` parameter allows us to specify which columns to use as an index when creating
# the dataframe and in which order. Very handy!
df = pd.read_csv('data/GLHYD_data_metric.csv', header=12, index_col=[1, 0])
df.head()

At this point we have the lakes as columns, when what we really want is to take all of those column names and add them to our index as another level. This is exactly what the very useful stack method does. It "pivots" a level of column labels into a new innermost index level.

In this case, because our dataframe has only one level of column labels, stack turns that level into a new index level, leaving us with a series. We have "moved a dimension", transforming a dataframe with a two-level index into a series with a three-level index.

In [ ]:
lake_monthly_levels = df.stack()
lake_monthly_levels.head()
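The inverse operation is unstack, which moves the innermost index level back out to become column labels, so we can always round-trip between the two shapes (a quick sanity check):

In [ ]:
# this should give us back a dataframe with the lakes as columns again
lake_monthly_levels.unstack().head()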

Now, to make it easy for pandas to align and combine the series, we want to give our index levels nice, consistent names.

In [ ]:
lake_monthly_levels.index.set_names(['Year', 'Month', 'Lake'], inplace=True)
lake_monthly_levels.head()
In [ ]:
# Calculate the max water level for each year for each lake
max_levels = lake_monthly_levels.groupby(['Year', 'Lake']).agg('max')
max_levels.head()
In [ ]:
min_levels = lake_monthly_levels.groupby(['Year', 'Lake']).agg('min')
min_levels.head()

If we were to try to merge these three series together now, Pandas would fail because the index levels aren't in the same order. So the mean_levels series needs to have the order of its index levels swapped so that "Year" is the outer level. This is easily done with the reorder_levels method.

In [ ]:
mean_levels = mean_levels.reorder_levels([1, 0])
mean_levels.index.set_names(['Year', 'Lake'], inplace=True)
mean_levels

And now, finally, we can combine these three series into a new dataframe (and drop all the observations before 2015, which have no mean value). Note this is a very contrived example, as it would have been trivially easy to calculate the mean water levels from the CSV we loaded :)

In [ ]:
df = pd.DataFrame({'Mean': mean_levels, 'Max': max_levels, 'Min': min_levels}).dropna()
df
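For the record, that more direct approach might look something like this, computing all three statistics from the monthly readings in a single agg call (a sketch, using the lake_monthly_levels series from above):

In [ ]:
# mean, max, and min per year and lake in one pass
lake_monthly_levels.groupby(["Year", "Lake"]).agg(["mean", "max", "min"]).head()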
In [ ]:
# to select values based on just the inner index level, use the `xs` ("cross section") method:
df.xs('Erie', level=1)
In [ ]:
# Alternatively, you can still use `loc` and pass in an empty slice for the outer level:
df.loc[(slice(None), 'Erie'), :]
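Another option, if the nested-tuple syntax feels awkward, is the pd.IndexSlice helper, which lets us write the same selection with ordinary slice syntax (a small sketch):

In [ ]:
# equivalent to the loc call above
df.loc[pd.IndexSlice[:, "Erie"], :]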