#!/usr/bin/env python
# coding: utf-8

# # groupby
#
# **With `groupby`, you can group data in a [DataFrame](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html) and apply calculations to those groups in various ways.**
#
# This Cheatbook (Cheatsheet + Notebook) introduces you to the core functionality of pandas' [`groupby`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html) function. [Here you can find the executable Jupyter Notebook version to directly play around with](https://mybinder.org/v2/gh/feststelltaste/software-analytics/master?filepath=cheatbooks%2Fgroupby.ipynb)!

# ## References
# Here you can find out more about this function:
#
# * [API Reference](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html)
# * [Pandas Grouper and Agg Functions Explained](https://pbpython.com/pandas-grouper-agg.html)
# * [Understanding the Transform Function in Pandas](https://pbpython.com/pandas_transform.html)

# ## Example Scenario
# This is an excerpt of a file list from a directory with the following information as separate columns / [Series](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.html):
#
# * `file`: the name of the file
# * `dir`: the name of the directory the file lives in
# * `bytes`: the size of the file in bytes
#
# This data is stored in a pandas DataFrame named `df`.

# In[1]:

import pandas as pd

df = pd.DataFrame({
    "file": ['hello.java', 'tutorial.md', 'controller.java', "build.sh", "deploy.sh"],
    "dir": ["src", "docs", "src", "src", "src"],
    "bytes": [54, 124, 36, 78, 62]
})
df

# ## When to use it
# `groupby` is a great way to summarize data and build a higher-level view of it (e.g., to go from code level to module level).
# In our scenario, we could, for example, count the number of files per directory.
# Let's take a look at this use case step by step.
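# As a quick preview, counting the files per directory is a one-liner. This is a minimal, self-contained sketch; the variable name `files_per_dir` is our own choice:

```python
import pandas as pd

# the same example data as above
df = pd.DataFrame({
    "file": ['hello.java', 'tutorial.md', 'controller.java', "build.sh", "deploy.sh"],
    "dir": ["src", "docs", "src", "src", "src"],
    "bytes": [54, 124, 36, 78, 62]
})

# group by directory and count the files in each group
files_per_dir = df.groupby('dir')['file'].count()
print(files_per_dir)
```

# The rest of this Cheatbook unpacks what happens in that single line.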
# ## Basic Principles
# You can use the `groupby` function on our DataFrame `df`.
# As parameter, you pass in the name (or a list of names) of the Series you want to group by.
# In our case, we want to group by the directories, i.e., the Series `dir`.

# In[2]:

df.groupby('dir')

# This gives you a [GroupBy](https://pandas.pydata.org/pandas-docs/stable/reference/groupby.html) object. We can take a look at the built groups by inspecting the `groups` attribute of the GroupBy object.

# In[3]:

df.groupby('dir').groups

# The `groups` attribute shows you the groups and their members, referenced by their indexes.

# ## Aggregating Values
# Now that we have built some groups, what's next? The next step is to decide what we want to do with the values that belong to a group. This means we need to tell the GroupBy object how to aggregate the values. We can apply a [multitude of aggregating functions](https://pandas.pydata.org/pandas-docs/stable/reference/groupby.html#computations-descriptive-stats) here, e.g.:
#
# * `count`: count the number of entries in each group

# In[4]:

df.groupby('dir').count()

# * `first`: take the first entry of each group

# In[5]:

df.groupby('dir').first()

# * `max`: take the entry with the highest value in each group

# In[6]:

df.groupby('dir').max()

# * `sum`: sum up all values within each group

# In[7]:

df.groupby('dir').sum()

# This gives us the number of bytes of all files that reside in a directory. Note that the `file` Series is gone because it doesn't contain any values we could sum up, so this Series was dropped. (Depending on your pandas version, you may need `sum(numeric_only=True)` to get this behavior; newer versions concatenate string columns instead of dropping them.)
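# If you only care about one Series, you can select it before aggregating; this sidesteps the question of what happens to non-numeric columns entirely. A minimal sketch with the `df` from above:

```python
import pandas as pd

# the same example data as above
df = pd.DataFrame({
    "file": ['hello.java', 'tutorial.md', 'controller.java', "build.sh", "deploy.sh"],
    "dir": ["src", "docs", "src", "src", "src"],
    "bytes": [54, 124, 36, 78, 62]
})

# select just the 'bytes' Series, then sum per directory
bytes_per_dir = df.groupby('dir')['bytes'].sum()
print(bytes_per_dir)
```

# The result is a Series with the directories as index and the summed bytes as values.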
# We can also apply [dedicated functions](https://pandas.pydata.org/pandas-docs/stable/reference/groupby.html#function-application) to each group, e.g.:
#
# * `agg`: apply a variety of aggregating functions to the groups at once (e.g., building the sum as well as counting the values)
# * `apply`: apply a custom function to each group to execute calculations as you like
# * `transform`: calculate summarizing values for each group and hand them back to every entry of the group (e.g., the sum of all entries of a group)
#
# We'll see `transform` in action later on!

# ## More Advanced Use Cases
# Let's dig deeper into our example scenario.
# We want to find out which kinds of files occupy how much space in which directory.
# For this, we extract the files' extensions from the `file` Series.
# We use the string `split` function to split at the `.` sign and keep just the last piece of the split file name (which is the file's extension).

# In[8]:

df['ext'] = df["file"].str.split(".").str[-1]
df

# We can then group this data in a more sophisticated way by using two Series for our groups.
# We sum up the numeric values (= the `bytes`) for each group.

# In[9]:

dir_ext_bytes = df.groupby(['dir', 'ext']).sum()
dir_ext_bytes

# Last, we want to calculate the ratio of the files' bytes for each extension.
# We first calculate the overall size per directory by using `transform`.
# The `transform` function doesn't reduce each group to a single result.
# Instead, it assigns the group's result to every entry of the group, keeping the original shape.

# In[10]:

bytes_per_dir = dir_ext_bytes.groupby('dir').transform('sum')
bytes_per_dir

# In our case, we summed up all the files' bytes per directory and attached that sum to each extension's entry.
# We can add this new information to our existing DataFrame.

# In[11]:

dir_ext_bytes['all'] = bytes_per_dir
dir_ext_bytes

# Now we are able to calculate the ratio.
# In[12]:

dir_ext_bytes['ratio'] = dir_ext_bytes['bytes'] / dir_ext_bytes['all']
dir_ext_bytes

# The result gives us the ratio of file sizes per file type for each directory.

# ## Summary
# The `groupby` function is an excellent way to summarize data.
# It creates a higher-level view of your fine-grained raw data.
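# As a supplement, the `agg` and `apply` operations mentioned earlier can be sketched like this. This is a minimal, self-contained example; the aggregation names passed to `agg` are standard pandas aggregations, and the lambda passed to `apply` is our own illustration:

```python
import pandas as pd

# the same example data as above
df = pd.DataFrame({
    "file": ['hello.java', 'tutorial.md', 'controller.java', "build.sh", "deploy.sh"],
    "dir": ["src", "docs", "src", "src", "src"],
    "bytes": [54, 124, 36, 78, 62]
})

# agg: apply several aggregations at once to the 'bytes' Series
stats = df.groupby('dir')['bytes'].agg(['sum', 'count'])

# apply: run a custom function on each group,
# e.g. to find the size of the largest file per directory
largest = df.groupby('dir')['bytes'].apply(lambda group: group.max())

print(stats)
print(largest)
```

# `agg` is the right tool when built-in aggregations suffice; reach for `apply` only when you need custom per-group logic, since it is generally slower.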