#!/usr/bin/env python
# coding: utf-8

# # Aggregations: min, max, and Everything in Between

# A first step in exploring any dataset is often to compute various summary statistics.
# Perhaps the most common summary statistics are the mean and standard deviation, which allow you to summarize the "typical" values in a dataset, but other aggregations are useful as well (the sum, product, median, minimum and maximum, quantiles, etc.).
# 
# NumPy has fast built-in aggregation functions for working on arrays; we'll discuss and try out some of them here.

# ## Summing the Values in an Array
# 
# As a quick example, consider computing the sum of all values in an array.
# Python itself can do this using the built-in `sum` function:

# In[1]:


import numpy as np
rng = np.random.default_rng()


# In[2]:


L = rng.random(100)
sum(L)


# The syntax is quite similar to that of NumPy's `sum` function, and the result is the same in the simplest case:

# In[3]:


np.sum(L)


# However, because NumPy executes the operation in compiled code, its version is much faster:

# In[4]:


big_array = rng.random(1000000)
get_ipython().run_line_magic('timeit', 'sum(big_array)')
get_ipython().run_line_magic('timeit', 'np.sum(big_array)')
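
# Outside IPython the `%timeit` magic is unavailable. A rough equivalent, assuming the standard-library `timeit` module and a smaller, hypothetical array named `sample`, might look like the sketch below. Absolute timings depend on the machine, so no expected values are shown:

```python
import timeit

import numpy as np

rng = np.random.default_rng()
sample = rng.random(100_000)  # hypothetical array, smaller than big_array to keep the run short

# Time five calls of each implementation; timeit returns the total seconds for all five calls.
builtin_seconds = timeit.timeit(lambda: sum(sample), number=5)
numpy_seconds = timeit.timeit(lambda: np.sum(sample), number=5)

print(f"built-in sum: {builtin_seconds / 5:.6f} s per call")
print(f"np.sum:       {numpy_seconds / 5:.6f} s per call")
```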


# Be careful, though: the `sum` function and the `np.sum` function are not identical, which can sometimes lead to confusion!
# In particular, their optional arguments have different meanings (`sum(x, 1)` initializes the sum at `1`, while `np.sum(x, 1)` sums along axis `1`), and `np.sum` is aware of multiple array dimensions, as we will see in the following section.
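# 
# The difference in the second argument is easiest to see on a small two-dimensional array. This is a minimal sketch; the array `x` is made up for illustration:

```python
import numpy as np

x = np.arange(6).reshape(2, 3)  # [[0, 1, 2], [3, 4, 5]]

# Built-in sum iterates over the rows and starts the running total at 1,
# so the result is 1 + [0, 1, 2] + [3, 4, 5].
builtin_result = sum(x, 1)

# np.sum treats the second positional argument as the axis,
# so each row is summed separately.
numpy_result = np.sum(x, 1)

print(builtin_result)  # [4 6 8]
print(numpy_result)    # [ 3 12]
```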

# ## Minimum and Maximum
# 
# Similarly, Python has built-in `min` and `max` functions, which find the minimum and maximum values of any given array:

# In[5]:


min(big_array), max(big_array)


# NumPy's corresponding functions have similar syntax, and again operate much more quickly:

# In[6]:


np.min(big_array), np.max(big_array)


# In[7]:


get_ipython().run_line_magic('timeit', 'min(big_array)')
get_ipython().run_line_magic('timeit', 'np.min(big_array)')


# For `min`, `max`, `sum`, and several other NumPy aggregates, a shorter syntax is to use methods of the array object itself:

# In[8]:


print(big_array.min(), big_array.max(), big_array.sum())


# Whenever possible, make sure that you are using the NumPy version of these aggregates when operating on NumPy arrays!
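# 
# One reason this matters beyond speed: the built-in functions iterate over the first axis only, so they can fail outright on multidimensional input. A small sketch, using a hypothetical array `grid`:

```python
import numpy as np

grid = np.array([[3, 1, 2],
                 [0, 5, 4]])

try:
    # Built-in min compares whole rows with `<`, which produces a boolean
    # array whose truth value is ambiguous.
    min(grid)
    builtin_failed = False
except ValueError:
    builtin_failed = True

print(builtin_failed)  # True
print(np.min(grid))    # 0
```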

# ### Multidimensional Aggregates
# 
# One common type of aggregation operation is an aggregate along a row or column.
# Say you have some data stored in a two-dimensional array:

# In[9]:


M = rng.integers(0, 10, (3, 4))
print(M)


# By default, NumPy aggregations apply across all elements of a multidimensional array:

# In[10]:


M.sum()


# Aggregation functions take an additional argument specifying the *axis* along which the aggregate is computed. For example, we can find the minimum value within each column by specifying `axis=0`:

# In[11]:


M.min(axis=0)


# The function returns four values, corresponding to the four columns of numbers.
# 
# Similarly, we can find the maximum value within each row:

# In[12]:


M.max(axis=1)


# The way the axis is specified here can be confusing to users coming from other languages.
# The `axis` keyword specifies the dimension of the array that will be *collapsed*, rather than the dimension that will be returned.
# So, specifying `axis=0` means that axis 0 will be collapsed: for two-dimensional arrays, values within each column will be aggregated.
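# 
# Comparing shapes before and after the aggregation makes the "collapsed axis" rule concrete. The sketch below uses a made-up array `A` rather than the random `M` above, so the printed values are fixed:

```python
import numpy as np

A = np.arange(12).reshape(3, 4)  # 3 rows, 4 columns

col_sums = A.sum(axis=0)  # axis 0 (length 3) collapses, leaving one value per column
row_sums = A.sum(axis=1)  # axis 1 (length 4) collapses, leaving one value per row

print(A.shape, "->", col_sums.shape, col_sums)  # (3, 4) -> (4,) [12 15 18 21]
print(A.shape, "->", row_sums.shape, row_sums)  # (3, 4) -> (3,) [ 6 22 38]

# keepdims=True retains the collapsed axis with length 1
kept = A.sum(axis=0, keepdims=True)
print(kept.shape)  # (1, 4)
```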

# ### Other Aggregation Functions
# 
# NumPy provides several other aggregation functions with a similar API. Most also have a `NaN`-safe counterpart that computes the result while ignoring missing values, which are marked by the special IEEE floating-point `NaN` value (see [Handling Missing Data](03.04-Missing-Values.ipynb)).
# 
# The following table provides a list of useful aggregation functions available in NumPy:
# 
# | Function name   | NaN-safe version  | Description                                   |
# |-----------------|-------------------|-----------------------------------------------|
# | `np.sum`        | `np.nansum`       | Compute sum of elements                       |
# | `np.prod`       | `np.nanprod`      | Compute product of elements                   |
# | `np.mean`       | `np.nanmean`      | Compute mean of elements                      |
# | `np.std`        | `np.nanstd`       | Compute standard deviation                    |
# | `np.var`        | `np.nanvar`       | Compute variance                              |
# | `np.min`        | `np.nanmin`       | Find minimum value                            |
# | `np.max`        | `np.nanmax`       | Find maximum value                            |
# | `np.argmin`     | `np.nanargmin`    | Find index of minimum value                   |
# | `np.argmax`     | `np.nanargmax`    | Find index of maximum value                   |
# | `np.median`     | `np.nanmedian`    | Compute median of elements                    |
# | `np.percentile` | `np.nanpercentile`| Compute rank-based statistics of elements     |
# | `np.any`        | N/A               | Evaluate whether any elements are true        |
# | `np.all`        | N/A               | Evaluate whether all elements are true        |
# 
# You will see these aggregates often throughout the rest of the book.
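# 
# As a brief illustration of the `NaN`-safe variants in the table, the sketch below uses a small made-up array `values` with one missing entry:

```python
import numpy as np

values = np.array([1.0, np.nan, 3.0, 4.0])  # hypothetical data with one missing value

plain_mean = np.mean(values)        # NaN propagates through the ordinary aggregate
safe_mean = np.nanmean(values)      # mean of the three non-NaN entries
safe_max = np.nanmax(values)        # largest non-NaN entry
safe_argmax = np.nanargmax(values)  # index of that entry in the original array

print(plain_mean)   # nan
print(safe_mean)
print(safe_max)     # 4.0
print(safe_argmax)  # 3
```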

# ## Example: What Is the Average Height of US Presidents?

# Aggregates available in NumPy can act as summary statistics for a set of values.
# As a small example, let's consider the heights of all US presidents.
# This data is available in the file *president_heights.csv*, which is a comma-separated list of labels and values:

# In[13]:


get_ipython().system('head -4 data/president_heights.csv')


# We'll use the Pandas package, which we'll explore more fully in [Part 3](03.00-Introduction-to-Pandas.ipynb), to read the file and extract this information (note that the heights are measured in centimeters):

# In[14]:


import pandas as pd
data = pd.read_csv('data/president_heights.csv')
heights = np.array(data['height(cm)'])
print(heights)


# Now that we have this data array, we can compute a variety of summary statistics:

# In[15]:


print("Mean height:       ", heights.mean())
print("Standard deviation:", heights.std())
print("Minimum height:    ", heights.min())
print("Maximum height:    ", heights.max())


# Note that in each case, the aggregation operation reduced the entire array to a single summarizing value, which gives us information about the distribution of values.
# We may also wish to compute quantiles:

# In[16]:


print("25th percentile:   ", np.percentile(heights, 25))
print("Median:            ", np.median(heights))
print("75th percentile:   ", np.percentile(heights, 75))


# We see that the median height of US presidents is 182 cm, or just shy of six feet.
# 
# Of course, sometimes it's more useful to see a visual representation of this data, which we can accomplish using tools in Matplotlib (we'll discuss Matplotlib more fully in [Part 4](04.00-Introduction-To-Matplotlib.ipynb)). For example, this code generates the following chart:

# In[17]:


get_ipython().run_line_magic('matplotlib', 'inline')
import matplotlib.pyplot as plt
# Matplotlib 3.6 renamed the bundled seaborn styles; older releases used 'seaborn-whitegrid'
plt.style.use('seaborn-v0_8-whitegrid')


# In[18]:


plt.hist(heights)
plt.title('Height Distribution of US Presidents')
plt.xlabel('height (cm)')
plt.ylabel('number');