#!/usr/bin/env python
# coding: utf-8
# # Aggregation and Grouping
# A fundamental piece of many data analysis tasks is efficient summarization: computing aggregations like `sum`, `mean`, `median`, `min`, and `max`, in which a single number summarizes aspects of a potentially large dataset.
# In this chapter, we'll explore aggregations in Pandas, from simple operations akin to what we've seen on NumPy arrays to more sophisticated operations based on the concept of a `groupby`.
# For convenience, we'll use the same `display` magic function that we used in the previous chapters:
# In[1]:
import numpy as np
import pandas as pd
class display(object):
"""Display HTML representation of multiple objects"""
template = """
"""
def __init__(self, *args):
self.args = args
def _repr_html_(self):
return '\n'.join(self.template.format(a, eval(a)._repr_html_())
for a in self.args)
def __repr__(self):
return '\n\n'.join(a + '\n' + repr(eval(a))
for a in self.args)
# ## Planets Data
#
# Here we will use the Planets dataset, available via the [Seaborn package](http://seaborn.pydata.org/) (see [Visualization With Seaborn](04.14-Visualization-With-Seaborn.ipynb)).
# It gives information on planets that astronomers have discovered around other stars (known as *extrasolar planets*, or *exoplanets* for short). It can be downloaded with a simple Seaborn command:
# In[2]:
import seaborn as sns
planets = sns.load_dataset('planets')
planets.shape
# In[3]:
planets.head()
# This has some details on the 1,000+ extrasolar planets discovered up to 2014.
# ## Simple Aggregation in Pandas
# In ["Aggregations: Min, Max, and Everything In Between"](02.04-Computation-on-arrays-aggregates.ipynb), we explored some of the data aggregations available for NumPy arrays.
# As with a one-dimensional NumPy array, for a Pandas ``Series`` the aggregates return a single value:
# In[4]:
rng = np.random.RandomState(42)
ser = pd.Series(rng.rand(5))
ser
# In[5]:
ser.sum()
# In[6]:
ser.mean()
# For a `DataFrame`, by default the aggregates return results within each column:
# In[7]:
df = pd.DataFrame({'A': rng.rand(5),
'B': rng.rand(5)})
df
# In[8]:
df.mean()
# By specifying the `axis` argument, you can instead aggregate within each row:
# In[9]:
df.mean(axis='columns')
# Pandas `Series` and `DataFrame` objects include all of the common aggregates mentioned in [Aggregations: Min, Max, and Everything In Between](02.04-Computation-on-arrays-aggregates.ipynb); in addition, there is a convenience method, `describe`, that computes several common aggregates for each column and returns the result.
# Let's use this on the Planets data, for now dropping rows with missing values:
# In[10]:
planets.dropna().describe()
# This method helps us understand the overall properties of a dataset.
# For example, we see in the `year` column that although exoplanets were discovered as far back as 1989, half of all planets in the dataset were not discovered until 2010 or after.
# This is largely thanks to the *Kepler* mission, which aimed to find eclipsing planets around other stars using a specially designed space telescope.
# The following table summarizes some other built-in Pandas aggregations:
#
# | Aggregation | Returns |
# |--------------------------|---------------------------------|
# | ``count`` | Total number of items |
# | ``first``, ``last`` | First and last item |
# | ``mean``, ``median`` | Mean and median |
# | ``min``, ``max`` | Minimum and maximum |
# | ``std``, ``var`` | Standard deviation and variance |
# | ``mad`` | Mean absolute deviation |
# | ``prod`` | Product of all items |
# | ``sum`` | Sum of all items |
#
# These are all methods of `DataFrame` and `Series` objects.
# To go deeper into the data, however, simple aggregates are often not enough.
# The next level of data summarization is the `groupby` operation, which allows you to quickly and efficiently compute aggregates on subsets of data.
# ## groupby: Split, Apply, Combine
#
# Simple aggregations can give you a flavor of your dataset, but often we would prefer to aggregate conditionally on some label or index: this is implemented in the so-called `groupby` operation.
# The name "group by" comes from a command in the SQL database language, but it is perhaps more illuminative to think of it in the terms first coined by Hadley Wickham of Rstats fame: *split, apply, combine*.
# ### Split, Apply, Combine
#
# A canonical example of this split-apply-combine operation, where the "apply" is a summation aggregation, is illustrated in this figure:
# ![](images/03.08-split-apply-combine.png)
#
# ([figure source in Appendix](https://github.com/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/06.00-Figure-Code.ipynb#Split-Apply-Combine))
# This illustrates what the `groupby` operation accomplishes:
#
# - The *split* step involves breaking up and grouping a `DataFrame` depending on the value of the specified key.
# - The *apply* step involves computing some function, usually an aggregate, transformation, or filtering, within the individual groups.
# - The *combine* step merges the results of these operations into an output array.
#
# While this could certainly be done manually using some combination of the masking, aggregation, and merging commands covered earlier, an important realization is that *the intermediate splits do not need to be explicitly instantiated*. Rather, the `groupby` can (often) do this in a single pass over the data, updating the sum, mean, count, min, or other aggregate for each group along the way.
# The power of the `groupby` is that it abstracts away these steps: the user need not think about *how* the computation is done under the hood, but rather can think about the *operation as a whole*.
#
# As a concrete example, let's take a look at using Pandas for the computation shown in the following figure.
# We'll start by creating the input `DataFrame`:
# In[11]:
df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
'data': range(6)}, columns=['key', 'data'])
df
# The most basic split-apply-combine operation can be computed with the `groupby` method of the `DataFrame`, passing the name of the desired key column:
# In[12]:
df.groupby('key')
# Notice that what is returned is a `DataFrameGroupBy` object, not a set of `DataFrame` objects.
# This object is where the magic is: you can think of it as a special view of the `DataFrame`, which is poised to dig into the groups but does no actual computation until the aggregation is applied.
# This "lazy evaluation" approach means that common aggregates can be implemented efficiently in a way that is almost transparent to the user.
#
# To produce a result, we can apply an aggregate to this `DataFrameGroupBy` object, which will perform the appropriate apply/combine steps to produce the desired result:
# In[13]:
df.groupby('key').sum()
# The `sum` method is just one possibility here; you can apply most Pandas or NumPy aggregation functions, as well as most `DataFrame` operations, as you will see in the following discussion.
# ### The GroupBy Object
#
# The `GroupBy` object is a flexible abstraction: in many ways, it can be treated as simply a collection of ``DataFrame``s, though it is doing more sophisticated things under the hood. Let's see some examples using the Planets data.
#
# Perhaps the most important operations made available by a `GroupBy` are *aggregate*, *filter*, *transform*, and *apply*.
# We'll discuss each of these more fully in the next section, but before that let's take a look at some of the other functionality that can be used with the basic `GroupBy` operation.
# #### Column indexing
#
# The `GroupBy` object supports column indexing in the same way as the `DataFrame`, and returns a modified `GroupBy` object.
# For example:
# In[14]:
planets.groupby('method')
# In[15]:
planets.groupby('method')['orbital_period']
# Here we've selected a particular `Series` group from the original `DataFrame` group by reference to its column name.
# As with the `GroupBy` object, no computation is done until we call some aggregate on the object:
# In[16]:
planets.groupby('method')['orbital_period'].median()
# This gives an idea of the general scale of orbital periods (in days) that each method is sensitive to.
# #### Iteration over groups
#
# The `GroupBy` object supports direct iteration over the groups, returning each group as a `Series` or `DataFrame`:
# In[17]:
for (method, group) in planets.groupby('method'):
print("{0:30s} shape={1}".format(method, group.shape))
# This can be useful for manual inspection of groups for the sake of debugging, but it is often much faster to use the built-in `apply` functionality, which we will discuss momentarily.
# #### Dispatch methods
#
# Through some Python class magic, any method not explicitly implemented by the `GroupBy` object will be passed through and called on the groups, whether they are `DataFrame` or `Series` objects.
# For example, using the `describe` method is equivalent to calling `describe` on the `DataFrame` representing each group:
# In[18]:
planets.groupby('method')['year'].describe().unstack()
# Looking at this table helps us to better understand the data: for example, the vast majority of planets until 2014 were discovered by the Radial Velocity and Transit methods, though the latter method became common more recently.
# The newest methods seem to be Transit Timing Variation and Orbital Brightness Modulation, which were not used to discover a new planet until 2011.
#
# Notice that these dispatch methods are applied *to each individual group*, and the results are then combined within `GroupBy` and returned.
# Again, any valid `DataFrame`/`Series` method can be called in a similar manner on the corresponding `GroupBy` object.
# ### Aggregate, Filter, Transform, Apply
#
# The preceding discussion focused on aggregation for the combine operation, but there are more options available.
# In particular, `GroupBy` objects have `aggregate`, `filter`, `transform`, and `apply` methods that efficiently implement a variety of useful operations before combining the grouped data.
#
# For the purpose of the following subsections, we'll use this ``DataFrame``:
# In[19]:
rng = np.random.RandomState(0)
df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
'data1': range(6),
'data2': rng.randint(0, 10, 6)},
columns = ['key', 'data1', 'data2'])
df
# #### Aggregation
#
# You're now familiar with `GroupBy` aggregations with `sum`, `median`, and the like, but the `aggregate` method allows for even more flexibility.
# It can take a string, a function, or a list thereof, and compute all the aggregates at once.
# Here is a quick example combining all of these:
# In[20]:
df.groupby('key').aggregate(['min', np.median, max])
# Another common pattern is to pass a dictionary mapping column names to operations to be applied on that column:
# In[21]:
df.groupby('key').aggregate({'data1': 'min',
'data2': 'max'})
# #### Filtering
#
# A filtering operation allows you to drop data based on the group properties.
# For example, we might want to keep all groups in which the standard deviation is larger than some critical value:
# In[22]:
def filter_func(x):
return x['data2'].std() > 4
display('df', "df.groupby('key').std()",
"df.groupby('key').filter(filter_func)")
# The filter function should return a Boolean value specifying whether the group passes the filtering. Here, because group A does not have a standard deviation greater than 4, it is dropped from the result.
# #### Transformation
#
# While aggregation must return a reduced version of the data, transformation can return some transformed version of the full data to recombine.
# For such a transformation, the output is the same shape as the input.
# A common example is to center the data by subtracting the group-wise mean:
# In[23]:
def center(x):
return x - x.mean()
df.groupby('key').transform(center)
# #### The apply method
#
# The `apply` method lets you apply an arbitrary function to the group results.
# The function should take a `DataFrame` and returns either a Pandas object (e.g., `DataFrame`, `Series`) or a scalar; the behavior of the combine step will be tailored to the type of output returned.
#
# For example, here is an `apply` operation that normalizes the first column by the sum of the second:
# In[24]:
def norm_by_data2(x):
# x is a DataFrame of group values
x['data1'] /= x['data2'].sum()
return x
df.groupby('key').apply(norm_by_data2)
# `apply` within a `GroupBy` is flexible: the only criterion is that the function takes a `DataFrame` and returns a Pandas object or scalar. What you do in between is up to you!
# ### Specifying the Split Key
#
# In the simple examples presented before, we split the `DataFrame` on a single column name.
# This is just one of many options by which the groups can be defined, and we'll go through some other options for group specification here.
# #### A list, array, series, or index providing the grouping keys
#
# The key can be any series or list with a length matching that of the `DataFrame`. For example:
# In[25]:
L = [0, 1, 0, 1, 2, 0]
df.groupby(L).sum()
# Of course, this means there's another, more verbose way of accomplishing the `df.groupby('key')` from before:
# In[26]:
df.groupby(df['key']).sum()
# #### A dictionary or series mapping index to group
#
# Another method is to provide a dictionary that maps index values to the group keys:
# In[27]:
df2 = df.set_index('key')
mapping = {'A': 'vowel', 'B': 'consonant', 'C': 'consonant'}
display('df2', 'df2.groupby(mapping).sum()')
# #### Any Python function
#
# Similar to mapping, you can pass any Python function that will input the index value and output the group:
# In[28]:
df2.groupby(str.lower).mean()
# #### A list of valid keys
#
# Further, any of the preceding key choices can be combined to group on a multi-index:
# In[29]:
df2.groupby([str.lower, mapping]).mean()
# ### Grouping Example
#
# As an example of this, in a few lines of Python code we can put all these together and count discovered planets by method and by decade:
# In[30]:
decade = 10 * (planets['year'] // 10)
decade = decade.astype(str) + 's'
decade.name = 'decade'
planets.groupby(['method', decade])['number'].sum().unstack().fillna(0)
# This shows the power of combining many of the operations we've discussed up to this point when looking at realistic datasets: we quickly gain a coarse understanding of when and how extrasolar planets were detected in the years after the first discovery.
#
# I would suggest digging into these few lines of code and evaluating the individual steps to make sure you understand exactly what they are doing to the result.
# It's certainly a somewhat complicated example, but understanding these pieces will give you the means to similarly explore your own data.