- Understand the split-apply-combine concept for data analysis.
    - Use `groupby()`, `mean()`, `agg()` and `size()` to apply this technique.
- Produce scatter plots, line plots, and histograms using `seaborn` and `matplotlib`.
    - Set universal plot settings.
    - Understand and apply grids for faceting in `seaborn`.

- Split-apply-combine techniques in `pandas`
    - Using `mean()` to summarize categorical data (20 min)
    - Using `size()` to summarize categorical data (10 min)
- Data visualization with `matplotlib` and `seaborn` (10 min)
    - Visualizing one quantitative variable with multiple categorical variables (50 min)
    - Visualizing the relationship of two quantitative variables with multiple categorical variables (40 min)
- Split-apply-combine... plot! (20 min)

In [1]:

```
import pandas as pd
surveys = pd.read_csv('surveys.csv')
surveys.tail()
```

Out[1]:

## Using `mean()` to summarize categorical data

Many data analysis tasks can be approached using the *split-apply-combine* paradigm: split the data into groups, apply some analysis to each group, and then combine the results.

`pandas` facilitates this workflow through the use of `groupby()` to split data and summary/aggregation functions such as `mean()`, which collapse each group into a single-row summary of that group. The arguments to `groupby()` are the names of the columns that contain the *categorical* variables by which summary statistics should be calculated. To start, compute the mean `weight` by genus.

In [2]:

```
# Groups with no recorded weights show up as NaN in the result;
# they could be excluded by chaining `.dropna()`
surveys.groupby('genus')['weight'].mean()
```

Out[2]:

When the mean is computed, the default behavior is to ignore NA values, so they only need to be dropped if they are to be excluded from the visual output.
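The NA handling described above can be checked on a small example. This is a minimal sketch with made-up values (the `toy` data frame is hypothetical and only stands in for `surveys`): missing weights are skipped within a group, while a group without any recorded weight ends up as `NaN` in the result.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the surveys table
toy = pd.DataFrame({
    'genus': ['Dipodomys', 'Dipodomys', 'Dipodomys', 'Sylvilagus', 'Sylvilagus'],
    'weight': [40.0, np.nan, 50.0, np.nan, np.nan],
})

# The NaN in the Dipodomys group is ignored, so the mean is (40 + 50) / 2
group_means = toy.groupby('genus')['weight'].mean()
print(group_means)

# Sylvilagus has no recorded weights at all, so its mean is NaN;
# dropna() removes such groups from the summary
print(group_means.dropna())
```

Dropping the `NaN` rows only changes what is displayed; the means of the remaining groups are unaffected.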

Groups can also be created from multiple columns:

In [3]:

```
surveys.groupby(['genus', 'sex'])['weight'].mean()
```

Out[3]:

Since the same grouped data frame will be used in multiple code chunks below, it can be assigned to a new variable and reused instead of typing out the `groupby()` call each time.

In [4]:

```
grouped_surveys = surveys.groupby(['genus', 'sex'])
grouped_surveys['weight'].mean() # Show that the output is the same as above
```

Out[4]:

Instead of using the `mean()` method, the more general `agg()` method could be called to aggregate (or summarize) by *any* function, not just the mean. The equivalent to the `mean()` method would be to call `agg()` with the numpy function `np.mean()`.

In [5]:

```
import numpy as np
grouped_surveys['weight'].agg(np.mean).reset_index()
```

Out[5]:

This general approach is more flexible and powerful since multiple aggregation functions can be applied in the same line of code by passing them as a list to `agg()`. For instance, the standard deviation and mean could be computed in the same call.

In [6]:

```
# Multiple aggregation functions are passed as a list, hence the square brackets
grouped_surveys['weight'].agg([np.mean, np.std])
```

Out[6]:

Any function can be passed like this, including custom functions that you have written yourself. For many common aggregation functions, `pandas` accepts a string with the function name as a convenience.

In [7]:

```
grouped_surveys['weight'].agg(['mean', 'median', 'count'])
```

Out[7]:
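To illustrate passing a custom function, here is a minimal sketch on made-up data. The `toy` data frame and the helper `value_range()` are hypothetical names introduced only for this example; the helper is mixed with string names in the same list.

```python
import pandas as pd

# Hypothetical toy data; column names mirror the surveys table
toy = pd.DataFrame({
    'genus': ['A', 'A', 'A', 'B', 'B'],
    'weight': [10.0, 14.0, 18.0, 30.0, 50.0],
})


def value_range(values):
    """Return the spread between the largest and smallest value in a group."""
    return values.max() - values.min()


# Strings and custom functions can be combined in one list;
# the custom function's name becomes the column label
summary = toy.groupby('genus')['weight'].agg(['mean', 'median', value_range])
print(summary)
```

Each function in the list produces one column in the output, so the result has one row per group and one column per summary statistic.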

## Challenge

- Use `groupby()` and `agg()` to find the mean, min, and max hindfoot length for each species.
- What was the heaviest animal measured in each year? Return the columns `year`, `genus`, `species`, and `weight`.

  Hint: Look into the `idxmax()` method.

## Using `size()` to summarize categorical data

When working with data, it is common to want to know the number of observations present for each value of a categorical variable. For this, `pandas` provides the `size()` method. For example, to group by 'taxa' and find the number of observations for each 'taxa':

In [8]:

```
surveys.groupby('taxa').size()
```

Out[8]:

`size()` can also be used when grouping on multiple variables.

In [9]:

```
surveys.groupby(['taxa', 'sex']).size()
```

Out[9]:
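One detail that is easy to miss is that `size()` counts rows, whereas `count()` counts non-missing values in a column. The sketch below uses a made-up `toy` data frame (an assumption for illustration, not the survey data) to show the difference.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with some missing weights
toy = pd.DataFrame({
    'taxa': ['Rodent', 'Rodent', 'Rodent', 'Bird', 'Bird'],
    'weight': [40.0, np.nan, 52.0, np.nan, np.nan],
})

grouped = toy.groupby('taxa')

# size() reports the number of rows in each group, missing values included
n_rows = grouped.size()

# count() reports only the non-missing weights in each group
n_weights = grouped['weight'].count()

print(n_rows)
print(n_weights)
```

When the question is "how many observations were recorded", `size()` is usually the appropriate choice; `count()` answers "how many of them have a value in this column".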

If there are many groups, `size()` is not that useful on its own. For example, it is difficult to quickly find the five most abundant species among the observations.

In [10]:

```
surveys.groupby('species').size()
```

Out[10]:

Since there are many rows in this output, it would be beneficial to sort the table by its values. This is easy to do with the `sort_values()` method.

In [11]:

```
surveys.groupby('species').size().sort_values()
```

Out[11]:

That's better, but it could be helpful to display the most abundant species on top. In other words, the output should be arranged in descending order.

In [12]:

```
surveys.groupby('species').size().sort_values(ascending=False).head(5)
```

Out[12]:

Looks good! By now, the code statement has grown quite long because many methods have been *chained* together. It can be tricky to keep track of what is going on in long method chains. To make the code more readable, it can be broken up over multiple lines by wrapping it in parentheses.

In [13]:

```
(surveys
    .groupby('species')
    .size()
    .sort_values(ascending=False)
    .head(5)
)
```

Out[13]:

This looks neater and makes long method chains easier to read. There is no absolute rule for when to break code into multiple lines, but always try to write code that is easy for collaborators (your most common collaborator is a future version of yourself!) to understand.

`pandas` actually has a convenience function for returning the top five results, so the values don't need to be sorted explicitly.

In [14]:

```
(surveys
    .groupby(['species'])
    .size()
    .nlargest()  # the default is 5
)
```

Out[14]:
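That `nlargest()` gives the same result as sorting in descending order and taking the head can be verified on a small series. The counts below are invented for illustration.

```python
import pandas as pd

# Hypothetical observation counts per species (no ties, to keep the order unambiguous)
counts = pd.Series({
    'merriami': 9,
    'penicillatus': 4,
    'ordii': 7,
    'baileyi': 6,
    'flavus': 2,
    'torridus': 5,
})

# Explicit approach: sort in descending order, then keep the first three
top_sorted = counts.sort_values(ascending=False).head(3)

# Convenience approach: ask directly for the three largest values
top_nlargest = counts.nlargest(3)

print(top_nlargest)
```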

To include more attributes about these species, add columns to `groupby()`.

In [15]:

```
(surveys
    .groupby(['species', 'taxa', 'genus'])
    .size()
    .nlargest()
)
```

Out[15]:

Again, the display of the output shows that it is returned as a `Series`. The `to_frame()` method can be used to convert the output into a data frame.

In [16]:

```
(surveys
    .groupby(['species', 'taxa', 'genus'])
    .size()
    .nlargest()
    .to_frame()
)
```

Out[16]:

The reason that "species", "taxa", and "genus" are displayed in bold font is that `groupby()` makes these columns into the index (the row names) of the data frame. Indexes can be powerful when working with very large datasets (e.g. matching on indices is faster than matching on values in columns). However, multiple index levels like the ones above can be less intuitive to work with than columns, so it is often a good idea to reset the data frame's index unless there is a clear advantage to keeping it for downstream analysis.

To reset the index, the `reset_index()` method can be used instead of `to_frame()`.

In [17]:

```
(surveys
    .groupby(['species', 'taxa', 'genus'])
    .size()
    .nlargest()
    .reset_index()
)
```

Out[17]:

When the series was changed into a data frame, the values were put into a column. Columns need a name, and by default this is just the lowest unique number among the column names, in this case `0`. The `rename()` method can be used to change the name of the `0` column to something more meaningful.

In [18]:

```
(surveys
    .groupby(['species', 'taxa', 'genus'])
    .size()
    .nlargest()
    .reset_index()
    .rename(columns={0: 'size'})
)
```

Out[18]:
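As an alternative to renaming afterwards, the `reset_index()` method of a `Series` accepts a `name` argument for the values column. The sketch below compares both routes on a made-up `toy` data frame (hypothetical data, same column names as in the lecture).

```python
import pandas as pd

# Hypothetical toy data standing in for the surveys table
toy = pd.DataFrame({
    'species': ['merriami', 'merriami', 'ordii', 'merriami', 'ordii', 'baileyi'],
    'genus': ['Dipodomys', 'Dipodomys', 'Dipodomys',
              'Dipodomys', 'Dipodomys', 'Chaetodipus'],
})

sizes = toy.groupby(['species', 'genus']).size()

# Route 1: reset the index, then rename the automatically created 0 column
renamed_after = sizes.reset_index().rename(columns={0: 'size'})

# Route 2: name the values column while resetting the index
named_directly = sizes.reset_index(name='size')

print(named_directly)
```

Both routes give the same data frame; which one reads better in a long method chain is a matter of taste.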

Any column can be renamed this way.

In [19]:

```
(surveys
    .groupby(['species', 'taxa', 'genus'])
    .size()
    .nlargest()
    .reset_index()
    .rename(columns={'genus': 'Genus', 'taxa': 'Taxa'})
)
```

Out[19]:

## Challenge

- How many individuals were caught in each `plot_type` surveyed?
- Calculate the number of animals trapped per plot type for each year. Extract the combinations of year and plot type that had the three highest numbers of observations (e.g. "1998-Control").

## Data visualization with `matplotlib` and `seaborn`

There are many plotting packages in Python, making it possible to create diverse visualizations such as interactive web graphics, 3D animations, statistical visualizations, and map-based plots. Here, we will focus on two of the most useful for researchers: `matplotlib`, which is a robust, detail-oriented, low-level plotting interface, and `seaborn`, which provides high-level functions on top of `matplotlib` and allows the plotting calls to be expressed more in terms of what is being explored in the underlying data than in terms of which graphical elements to add to the plot.

For example, instead of instructing the computer to "go through a data frame and plot any observations of speciesX in blue, any observations of speciesY in red, etc", the `seaborn` syntax allows commands more similar to "color the data by species". Thanks to this functional way of interfacing with data, only minimal changes are required if the underlying data change or if a different type of plot is wanted for the visualization. It provides a language that facilitates thinking about data in ways that are conducive to exploratory analysis and allows for the creation of publication-quality plots with minimal adjustment and tweaking.
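The contrast between the two styles can be sketched on a tiny made-up data frame. The `toy` data and the chosen colors are assumptions for illustration; the left panel spells out the coloring group by group with `matplotlib`, while the right panel asks `seaborn` to color by species via `hue`.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical toy data with two species
toy = pd.DataFrame({
    'weight': [40, 45, 50, 20, 22, 25],
    'hindfoot_length': [35, 36, 38, 20, 21, 22],
    'species': ['merriami'] * 3 + ['penicillatus'] * 3,
})

fig, (ax_mpl, ax_sns) = plt.subplots(ncols=2, figsize=(8, 3))

# matplotlib: go through the groups and state which color each one gets
colors = {'merriami': 'tab:blue', 'penicillatus': 'tab:red'}
for species_name, group in toy.groupby('species'):
    ax_mpl.scatter(group['weight'], group['hindfoot_length'],
                   color=colors[species_name], label=species_name)
ax_mpl.legend()

# seaborn: "color the data by species"
sns.scatterplot(x='weight', y='hindfoot_length', hue='species',
                data=toy, ax=ax_sns)
```

If a third species were added to the data, the `matplotlib` version would need a new entry in `colors`, while the `seaborn` call would stay unchanged.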

The concepts of plotting with `seaborn` were briefly introduced in the first lecture. To make a plot of the number of observations for each species, first import the library and then use the `countplot()` function. Before the first plot is created, the line `%matplotlib inline` is used to specify that all plots should show up in the notebook instead of in a separate window.

In [20]:

```
%matplotlib inline
import seaborn as sns
sns.countplot(y='species', data=surveys)
```

Out[20]:

That's a lot of species... For convenience when introducing the following plotting concepts, the number of species will be limited to the four most abundant. To do this, first extract the names of the most abundant species.

In [21]:

```
most_common_species = (
    surveys['species']
    .value_counts()
    .nlargest(4)
    .index
)
most_common_species
```

Out[21]:

A subset can now be created from the data frame, including only those rows where the column 'species' matches any of the names in the `most_common_species` variable. As before, boolean indexing will be used for this. One way of doing this would be to combine four comparisons with the `|` operator.

In [22]:

```
surveys.loc[(surveys['species'] == most_common_species[0]) |
            (surveys['species'] == most_common_species[1]) |
            (surveys['species'] == most_common_species[2]) |
            (surveys['species'] == most_common_species[3])].shape
```

Out[22]:

That is quite tedious, and `pandas` has a special `isin()` method for comparing a data frame column to an array-like object of names such as the index extracted above.

In [23]:

```
surveys.loc[surveys['species'].isin(most_common_species)].shape
```

Out[23]:
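That the two approaches select the same rows can be confirmed by comparing the boolean masks directly. The following is a small sketch on invented data; `toy` and `wanted` are hypothetical names.

```python
import pandas as pd

# Hypothetical toy data and a short list of species to keep
toy = pd.DataFrame({
    'species': ['merriami', 'ordii', 'flavus', 'baileyi', 'merriami', 'torridus'],
})
wanted = pd.Index(['merriami', 'baileyi'])

# One comparison per name, combined with the | (or) operator
mask_or = (toy['species'] == wanted[0]) | (toy['species'] == wanted[1])

# A single call that scales to any number of names
mask_isin = toy['species'].isin(wanted)

print(toy.loc[mask_isin])
```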

Drop any NAs and assign this to a variable.

In [24]:

```
surveys_common = surveys.loc[surveys['species'].isin(most_common_species)].dropna()
surveys_common.shape
```

Out[24]:

This abbreviated data frame can now be used for plotting.

In [25]:

```
sns.countplot(y='species', data=surveys_common)
```

Out[25]:

That's more manageable! The text is a little small; change this with the `set_context()` function from `seaborn`, using a number above `1` for the `font_scale` parameter. The `context` parameter changes the size of objects in the plots, such as the line widths, and will be left as the default `notebook` for now.

These option changes will apply to all plots made from now on. Think of it as changing a value in the options menu of a graphical software.

In [26]:

```
sns.set_context(context='notebook', font_scale=1.4)
sns.countplot(y='species', data=surveys_common)
```

Out[26]:
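To see what `font_scale` does under the hood, the settings can be inspected with `sns.plotting_context()`, which returns the parameters as a dictionary without applying them. The same function can also be used in a `with` statement when a setting should only apply to a single plot instead of all subsequent ones. This is a minimal sketch, not part of the lecture's workflow.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import seaborn as sns

# Inspect the settings without changing anything globally
base = sns.plotting_context('notebook', font_scale=1.0)
scaled = sns.plotting_context('notebook', font_scale=1.4)
print(base['axes.labelsize'], scaled['axes.labelsize'])

# Remember the current global value for comparison
before = plt.rcParams['axes.labelsize']

# Temporary change: only plots created inside the with-block are affected
with sns.plotting_context('notebook', font_scale=1.4):
    fig, ax = plt.subplots()
    inside = plt.rcParams['axes.labelsize']

# Outside the block, the previous value is back
after = plt.rcParams['axes.labelsize']
```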

To get a vertical plot, change `y` to `x`. With long label names, horizontal plots can be easier to read.

In [27]:

```
sns.countplot(x='species', data=surveys_common)
```

Out[27]:

`seaborn` can do much more advanced visualizations than counting things. For example, to visualize summary statistics of the weight variable distribution for these four species, a boxplot can be used.

In [28]:

```
sns.boxplot(x='weight', y='species', data=surveys_common)
```

Out[28]:

The width of each box can be changed to make it look more appealing.

In [29]:

```
sns.boxplot(x='weight', y='species', data=surveys_common, width=0.4)
```

Out[29]:

The syntax is very similar to that of `countplot()`, but instead of just supplying one variable and asking `seaborn` to count the observations of that variable, the xy-variables are the categorical groups (the species) and the measurement of interest (the weight).

The aim of a box plot is to display statistics of the underlying distribution, which facilitate comparison of more than just the mean + standard deviation (or another single measure of central tendency and variation) across categorical variables. These specific box plots are so-called Tukey box plots by default, which means that the graphical elements correspond to the following statistics:

- The lines of the box represent the 25th, 50th (median), and 75th quantile in the data. These divide the data into four quartiles (0-25, 25-50, 50-75, 75-100).
- The whiskers extend to the most extreme observations that lie within 1.5 * the interquartile range (the distance between the 25th and 75th quantile) from the edges of the box.
- The flyers mark all individual observations that are outside the whiskers, which could be referred to as "outliers" (there are many definitions of what could constitute an outlier).

Most of these plot elements are configurable and could be set to represent different distribution statistics.
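The statistics behind a Tukey box plot can be computed by hand, which may help when interpreting the graphical elements. The sketch below uses made-up numbers and `numpy`'s default (linear interpolation) percentile method; plotting libraries may use slightly different quantile definitions, so treat the exact values as illustrative.

```python
import numpy as np

# Hypothetical weights with one value far away from the rest
weights = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 50], dtype=float)

# The three lines of the box
q1, median, q3 = np.percentile(weights, [25, 50, 75])
iqr = q3 - q1

# Whiskers may reach at most 1.5 * IQR beyond the box edges...
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# ...and are drawn at the most extreme observations inside those limits
inside = weights[(weights >= lower_fence) & (weights <= upper_fence)]
whisker_low, whisker_high = inside.min(), inside.max()

# Everything beyond the limits is drawn as an individual flyer
flyers = weights[(weights < lower_fence) | (weights > upper_fence)]

print(q1, median, q3, whisker_low, whisker_high, flyers)
```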

Another useful visualization for comparing distributions is the `violinplot`. Again, the syntax is the same as before, just change the plot name.

In [30]:

```
sns.violinplot(x='weight', y='species', data=surveys_common)
```

Out[30]:

Think of this plot as a smoothed version of the underlying histogram that is then mirrored underneath. Where the violin is wider, there are more observations. The inner part is a boxplot with the median marked as a white dot. Comparisons with histograms and other distribution visualizations will be discussed later in the workshop, but it is good to already keep in mind that it can be misleading to use a smoothed distribution if you have few observations, and it is probably better to show the individual data points instead of, or in addition to, the distribution plot.

The colors of the violin can be muted to bring out the box.

In [31]:

```
sns.violinplot(x='weight', y='species', data=surveys_common, color='lightgrey')
```

Out[31]:

An example of when a violin plot can be more informative than a box plot is when detecting multimodal distributions, which could indicate an underlying confounding variable that has been grouped together. This can be seen when plotting the 'plot_type' on the y-axis instead of the 'species'.

In [32]:

```
sns.boxplot(x='weight', y='plot_type', data=surveys_common)
```

Out[32]:

In [33]:

```
sns.violinplot(x='weight', y='plot_type', data=surveys_common)
```

Out[33]:

From the violin plot, it appears that there could be multiple distributions grouped together within each plot type (remember that the 'plot_type' indicates the type of trap used to catch the animals). There seems to be one distribution centered around weight=20 for all traps and one distribution centered around 45 (or 30 for Long-term krat exclosure). These observations could indeed be from the same distribution, but often when there are multiple bumps like this, it is a good idea to explore other variables in the data set and see if we can find the reason for the multiple bumps in the violin plot.

Since there appear to be 2-3 bumps in the distributions, it would be good to find a categorical variable in the data frame that has around the same number of unique values, since grouping based on these values could explain what we are seeing. The pandas method `nunique()` comes in handy for this task.

In [34]:

```
surveys_common.nunique().sort_values()
```

Out[34]:
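As a small illustration of how `nunique()` helps with this search, the sketch below counts unique values per column in a made-up data frame and keeps the columns with at most three unique values. The `toy` data and the threshold of three are assumptions chosen to match the "2-3 bumps" reasoning above.

```python
import pandas as pd

# Hypothetical toy data standing in for surveys_common
toy = pd.DataFrame({
    'record_id': [1, 2, 3, 4, 5, 6],
    'sex': ['F', 'M', 'F', 'M', 'F', 'M'],
    'plot_type': ['Control', 'Control', 'Rodent Exclosure',
                  'Control', 'Spectab exclosure', 'Control'],
})

# Number of unique values in each column, smallest first
unique_counts = toy.nunique().sort_values()
print(unique_counts)

# Columns with few unique values are candidates for explaining 2-3 bumps
candidates = unique_counts[unique_counts <= 3].index.tolist()
print(candidates)
```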

There are a few candidate variables that have a suitable number of unique values. A very effective approach for exploring multiple variables in a data set is to plot so-called small multiples of the data, where the same type of plot is used for different subsets of the data. These plots are drawn in rows and columns forming a grid pattern, and can be referred to as a "lattice", "facet", or "trellis" plot.

Visualizing categorical variables in this manner is a key step in exploratory data analysis, and thus `seaborn` has a dedicated plot function for this, called `factorplot()` (categorical variables are sometimes referred to as "factors"). This function can be used to draw the same violin plot as before, and easily spread the variables across the rows and columns, e.g. for the variable "sex".

In [35]:

```
sns.factorplot(x='weight', y='plot_type', data=surveys_common, col='sex',
               kind='violin')
```

Out[35]:
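Readers on newer `seaborn` releases should note that `factorplot()` was renamed to `catplot()` in version 0.9, with the old name deprecated. The sketch below shows the same kind of faceted violin plot with `catplot()` on randomly generated toy data (the `toy` data frame is an assumption for illustration), and checks that one facet per value of `sex` is created.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so the sketch runs without a display
import numpy as np
import pandas as pd
import seaborn as sns

# Hypothetical toy data: lighter females and heavier males
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    'weight': np.concatenate([rng.normal(20, 2, 40), rng.normal(45, 4, 40)]),
    'plot_type': np.tile(['Control', 'Rodent Exclosure'], 40),
    'sex': np.repeat(['F', 'M'], 40),
})

# col='sex' creates one column of the grid per unique value in 'sex'
grid = sns.catplot(x='weight', y='plot_type', col='sex', kind='violin', data=toy)

# The returned grid object holds the individual axes in a 2D array
print(grid.axes.shape)
```

The parameters used in this lecture (`col`, `row`, `col_wrap`, `kind`, `hue`, `aspect`) are also accepted by `catplot()`.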

Splitting by the sex of the animal is probably not the cleverest approach here, since the same sex from different species or genera would have different weights. Let's try adding "genus".

In [36]:

```
sns.factorplot(x='weight', y='plot_type', data=surveys_common, col='sex',
               row='genus', kind='violin', margin_titles=True)
```

Out[36]:

There are certainly differences between the two genera, but it appears that the data still is not split into unimodal distributions. A likely explanation could be that we still have multiple species per genus and the weight is species-dependent. Let's check how many species there are per genus and how many observations there are in each.

In [37]:

```
surveys_common.groupby(['genus', 'species']).size()
```

Out[37]:

If the mean weights for those species are different, it could indeed explain the additional bump in the Chaetodipus genus.

In [38]:

```
surveys_common.groupby(['genus', 'species'])['weight'].mean()
```

Out[38]:

A factor plot with the column variable set to "species" instead of "genus" might be able to separate the distributions.

In [39]:

```
sns.factorplot(x='weight', y='plot_type', data=surveys_common, col='species',
               kind='violin')
```

Out[39]:

That looks pretty good! The plot can be made more appealing by having two columns per row and making each plot a bit wider.

In [40]:

```
sns.factorplot(x='weight', y='plot_type', data=surveys_common, col='species',
               col_wrap=2, kind='violin', aspect=1.4)
```

Out[40]:

This is great, much of the variation in the weight data can be explained by the species observed. The only species where there still appear to be multimodal distributions (and thus *possibly* a confounding variable) is "baileyi" (and potentially "ordii"), especially for the "Spectab exclosure". The "sex" variable was used in a previous plot, but it was never explored within each species. Sexual dimorphism is common within species, and this could include weight differences.

In [41]:

```
sns.factorplot(x='weight', y='plot_type', hue='sex', data=surveys_common,
               col='species', col_wrap=2, kind='violin', aspect=1.4)
```

Out[41]:

It does indeed appear that there is a difference in mean and distribution between the sexes within the species "baileyi". Minor differences between the sexes within other species are also visible now although they were not big enough to show up in the initial violinplot (in later lectures, we will see more how the violin plot can hide differences like this). As a final beautification of this plot, the violins can be split down the middle to reduce clutter in the plot.

In [42]:

```
sns.factorplot(x='weight', y='plot_type', hue='sex', data=surveys_common,
               col='species', col_wrap=2, kind='violin', aspect=1.4, split=True)
```

Out[42]:

This clearly delivers the message and is easy to understand. A great aspect of the `factorplot()` function is that if there is a change of mind (or heart) about what type of visualization to use, only minor modifications are needed to completely change the plot appearance.

In [43]:

```
sns.factorplot(x='weight', y='plot_type', hue='sex', data=surveys_common,
               col='species', col_wrap=2, kind='box', aspect=1.4)
```

Out[43]:

Plotting the mean and 95% CI requires changing a couple of additional parameters to make the plot look good, but the code is largely identical.

In [44]:

```
sns.factorplot(x='weight', y='plot_type', hue='sex', data=surveys_common,
               col='species', col_wrap=2, kind='point', aspect=1.4, join=False,
               dodge=1.25)
```

Out[44]:
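The confidence intervals in a `seaborn` point plot are estimated by bootstrapping, that is, by repeatedly resampling the observations with replacement and recomputing the mean. The sketch below does this by hand with `numpy` on made-up weights; the sample size, the seed, and the 1000 resamples are assumptions for illustration and the numbers will not match the plot above.

```python
import numpy as np

# Hypothetical weights for one group of animals
rng = np.random.default_rng(42)
weights = rng.normal(loc=45, scale=5, size=200)

# Resample with replacement many times and record the mean of each resample
n_boot = 1000
boot_means = np.array([
    rng.choice(weights, size=weights.size, replace=True).mean()
    for _ in range(n_boot)
])

# The middle 95% of the bootstrapped means form the confidence interval
ci_low, ci_high = np.percentile(boot_means, [2.5, 97.5])

print(weights.mean(), ci_low, ci_high)
```

With more observations in a group the interval becomes narrower, which is why groups with few animals show longer error bars.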

To recap, `factorplot()` facilitates the representation of variables within data as different elements in the plot, such as the rows, columns, x-axis positions, and colors. There is a great description of this in the `seaborn` documentation:

It is important to choose how variables get mapped to the plot structure such that the most important comparisons are easiest to make. As a general rule, it is easier to compare positions that are closer together, so the `hue` variable should be used for the most important comparisons. For secondary comparisons, try to share the quantitative axis (so, use `col` for vertical plots and `row` for horizontal plots). Note that, although it is possible to make rather complex plots using this function, in many cases you may be better served by created several smaller and more focused plots than by trying to stuff many comparisons into one figure.

The last point above is worth illustrating with a challenge. It is easy to get carried away with `factorplot()` and try to visualize everything at once.

## Challenge

Create a grid of countplots comparing the number of observations between sexes across months. Create facets for each species and each plot_type.

First, examine the variables and their data types. For this, the entire "surveys" dataframe can be used.

In [46]:

```
surveys.info()
```

From this it is already clear that the only two quantitative variables are "weight" and "hindfoot_length". Although some of the others are integers, they are all categorical, such as month, day and year.

A scatter plot is the immediate choice for exploring pairwise relationships between variables. `seaborn` has a convenient scatter matrix function, `pairplot()`, for plotting the pairwise relationships between all numerical variables in the data frame.

In [47]:

```
# Since this plot creates so many graphical elements, the data set is subsampled to
# avoid waiting for the plot creation to finish. Setting `random_state` makes sure the
# same observations are sampled each time this is run.
surveys_sample = surveys.dropna().sample(1000, random_state=0)
sns.pairplot(surveys_sample)
```

Out[47]:
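The effect of `random_state` mentioned in the code comment above can be verified on a small made-up data frame: two draws with the same seed return exactly the same rows. The `toy` data frame is hypothetical.

```python
import pandas as pd

# Hypothetical toy data with 100 rows
toy = pd.DataFrame({'record_id': range(100), 'weight': range(100, 200)})

# Two independent draws with the same seed
first_draw = toy.sample(10, random_state=0)
second_draw = toy.sample(10, random_state=0)

# Same rows in the same order, so plots based on the sample are reproducible
print(first_draw.equals(second_draw))
```

Sampling is done without replacement by default, so no row appears twice in the subsample.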