#!/usr/bin/env python
# coding: utf-8

# # groupby
#
# **With `groupby`, you can group data in a [DataFrame](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html) and apply calculations to those groups in various ways.**
#
# This Cheatbook (Cheatsheet + Notebook) introduces you to the core functionality of pandas' [`groupby`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html) function. [Here you can find the executable Jupyter Notebook version to directly play around with](https://mybinder.org/v2/gh/feststelltaste/software-analytics/master?filepath=cheatbooks%2Fgroupby.ipynb)!

# ## References
# Here you can find out more about this function:
#
# * [API Reference](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html)
# * [Pandas Grouper and Agg Functions Explained](https://pbpython.com/pandas-grouper-agg.html)
# * [Understanding the Transform Function in Pandas](https://pbpython.com/pandas_transform.html)

# ## Example Scenario
# This is an excerpt of a file list from a directory with the following information as separate columns / [Series](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.html):
#
# * `file`: the name of the file
# * `dir`: the name of the directory the file lives in
# * `bytes`: the size of the file in bytes
#
# This data is stored in a pandas DataFrame named `df`.

# In[1]:

import pandas as pd

df = pd.DataFrame({
    "file": ['hello.java', 'tutorial.md', 'controller.java', "build.sh", "deploy.sh"],
    "dir": ["src", "docs", "src", "src", "src"],
    "bytes": [54, 124, 36, 78, 62]
})
df

# ## When to use it
# `groupby` is a great way to summarize data and build a higher-level view of it (e.g., to go from code level to module level).
# In our scenario, we could, for example, count the number of files per directory.
# Let's take a look at this use case step by step.
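# As a quick preview, counting the files per directory is a one-liner. This is a minimal, self-contained sketch; the variable name `files_per_dir` is our own choice:

```python
import pandas as pd

# the same example data as above
df = pd.DataFrame({
    "file": ['hello.java', 'tutorial.md', 'controller.java', "build.sh", "deploy.sh"],
    "dir": ["src", "docs", "src", "src", "src"],
    "bytes": [54, 124, 36, 78, 62]
})

# group by directory and count the files in each group
files_per_dir = df.groupby('dir')['file'].count()
print(files_per_dir)
```

# The rest of this Cheatbook unpacks what happens in that single line.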
# ## Basic Principles
# You can use the `groupby` function on our DataFrame `df`.
# As parameter, you pass in the name (or a list of names) of the Series you want to group by.
# In our case, we want to group by the directories, i.e., the Series `dir`.

# In[2]:

df.groupby('dir')

# This gives you a [GroupBy](https://pandas.pydata.org/pandas-docs/stable/reference/groupby.html) object. We can take a look at the built groups by inspecting the `groups` attribute of the GroupBy object.

# In[3]:

df.groupby('dir').groups

# The `groups` attribute shows you the groups and their members, referenced by their indexes.

# ## Aggregating Values
# Now that we have built some groups, what's next? The next step is to decide what we want to do with the values that belong to a group. This means we need to tell the GroupBy object how to aggregate the values. We can apply a [multitude of aggregating functions](https://pandas.pydata.org/pandas-docs/stable/reference/groupby.html#computations-descriptive-stats) here, e.g.:
#
# * `count`: count the number of entries in each group

# In[4]:

df.groupby('dir').count()

# * `first`: take the first entry of each group

# In[5]:

df.groupby('dir').first()

# * `max`: take the entry with the highest value in each group

# In[6]:

df.groupby('dir').max()

# * `sum`: sum up all values within each group

# In[7]:

df.groupby('dir').sum()

# This gives us the number of bytes of all files that reside in a directory. Note that the `file` Series is gone because it doesn't contain any values we could sum up, so this Series was dropped. (Depending on your pandas version, you may need `sum(numeric_only=True)` to get this behavior; newer versions concatenate string columns instead of dropping them.)
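# If you only care about one Series, you can select it before aggregating; this sidesteps the question of what happens to non-numeric columns entirely. A minimal sketch with the `df` from above:

```python
import pandas as pd

# the same example data as above
df = pd.DataFrame({
    "file": ['hello.java', 'tutorial.md', 'controller.java', "build.sh", "deploy.sh"],
    "dir": ["src", "docs", "src", "src", "src"],
    "bytes": [54, 124, 36, 78, 62]
})

# select just the 'bytes' Series, then sum per directory
bytes_per_dir = df.groupby('dir')['bytes'].sum()
print(bytes_per_dir)
```

# The result is a Series with the directories as index and the summed bytes as values.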
# We can also apply [dedicated functions](https://pandas.pydata.org/pandas-docs/stable/reference/groupby.html#function-application) to each group, e.g.:
#
# * `agg`: apply a variety of aggregating functions to the groups at once (e.g., building the sum as well as counting the values)
# * `apply`: apply a custom function to each group to execute calculations as you like
# * `transform`: calculate summarizing values for each group and hand them back to every entry of the group (e.g., the sum of all entries of a group)
#
# We'll see `transform` in action later on!

# ## More Advanced Use Cases
# Let's dig deeper into our example scenario.
# We want to find out which kinds of files occupy how much space in which directory.
# For this, we extract the files' extensions from the `file` Series.
# We use the string `split` function to split at the `.` sign and keep just the last piece of the split file name (which is the file's extension).

# In[8]:

df['ext'] = df["file"].str.split(".").str[-1]
df

# We can then group this data in a more sophisticated way by using two Series for our groups.
# We sum up the numeric values (= the `bytes`) for each group.

# In[9]:

dir_ext_bytes = df.groupby(['dir', 'ext']).sum()
dir_ext_bytes

# Last, we want to calculate the ratio of the files' bytes for each extension.
# We first calculate the overall size per directory by using `transform`.
# The `transform` function doesn't reduce each group to a single result.
# Instead, it assigns the group's result to every entry of the group, keeping the original shape.

# In[10]:

bytes_per_dir = dir_ext_bytes.groupby('dir').transform('sum')
bytes_per_dir

# In our case, we summed up all the files' bytes per directory and attached that sum to each extension's entry.
# We can add this new information to our existing DataFrame.

# In[11]:

dir_ext_bytes['all'] = bytes_per_dir
dir_ext_bytes

# Now we are able to calculate the ratio.
# In[12]:

dir_ext_bytes['ratio'] = dir_ext_bytes['bytes'] / dir_ext_bytes['all']
dir_ext_bytes

# The result gives us the ratio of file sizes per file type for each directory.

# ## Summary
# The `groupby` function is an excellent way to summarize data.
# It creates a higher-level view of your fine-grained raw data.
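# As a supplement, the `agg` and `apply` operations mentioned earlier can be sketched like this. This is a minimal, self-contained example; the aggregation names passed to `agg` are standard pandas aggregations, and the lambda passed to `apply` is our own illustration:

```python
import pandas as pd

# the same example data as above
df = pd.DataFrame({
    "file": ['hello.java', 'tutorial.md', 'controller.java', "build.sh", "deploy.sh"],
    "dir": ["src", "docs", "src", "src", "src"],
    "bytes": [54, 124, 36, 78, 62]
})

# agg: apply several aggregations at once to the 'bytes' Series
stats = df.groupby('dir')['bytes'].agg(['sum', 'count'])

# apply: run a custom function on each group,
# e.g. to find the size of the largest file per directory
largest = df.groupby('dir')['bytes'].apply(lambda group: group.max())

print(stats)
print(largest)
```

# `agg` is the right tool when built-in aggregations suffice; reach for `apply` only when you need custom per-group logic, since it is generally slower.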