- Produce scatter plots, line plots, and histograms using
`seaborn`

and`matplotlib`

. - Understand how to graphically explore relationships between variables.
- Apply grids for faceting in
`seaborn`

. - Set universal plot settings.
- Use
`seaborn`

grids with`matplotlib`

functions

- Data visualization with
`matplotlib`

and`seaborn`

(10 min)- Visualizing one quantitative variable with multiple categorical variables (50 min)
- Visualizing the relationship of two quantitative variable with multiple categorical variables (40min)
- Using any plotting function with
`seaborn`

grids (20 min)

In [1]:

```
# Setup by importing the same data as in previous lecture
import pandas as pd
surveys = pd.read_csv("https://ndownloader.figshare.com/files/2292169")
# or surveys = pd.read_csv("surveys.csv") if you saved it on your computer
```

In [2]:

```
surveys.head()
```

Out[2]:

`matplotlib`

and `seaborn`

¶There are many plotting packages in Python, making it possible to create diverse visualizations such as interactive web graphics, 3D animations, statistical visualizations, and map-based plots. A Google search for "Python graph gallery" or "Seaborn graph gallery" will turn up lots of examples of the diversity of plots that can be made.

Here, we will focus on two of the most useful for researchers: `matplotlib`

, which is a robust, detail-oriented, low level plotting interface, and `seaborn`

, which provides high level functions on top of `matplotlib`

and allows the plotting calls to be expressed more in terms what is being explored in the underlying data rather than what graphical elements to add to the plot.

Instead of instructing the computer to "go through a data frame and plot any observations of speciesX in blue, any observations of speciesY in red, etc", the `seaborn`

syntax is more similar to saying "color the data by species". Thanks to this functional way of interfacing with data, only minimal changes are required if the underlying data change or to switch the type of plot used for the visualization. It provides a language that facilitates thinking about data in ways that are conducive for exploratory analysis and allows for the creation of publication quality plots with minimal adjustments and tweaking.

The concepts of plotting with `seaborn`

were introduced briefly already in the first lecture. To make a plot of the number of observations for each species, first import the library and then use the `countplot()`

function. Before the first plot is created, the line `%matplotlib inline`

is used to specify that all plots should show up in the notebook instead of in a separate window.

In [3]:

```
%matplotlib inline
```

In [4]:

```
import seaborn as sns
sns.countplot(y='species', data=surveys)
```

Out[4]:

That's a lot of species... for convenience when introducing the following the plotting concept, the number of species will be limited to the four most abundant. To do this, the `value_counts()`

method can be used to find out how observations belong to each species. We could also get the same information using `surveys.groupby('species').size()`

.

In [5]:

```
surveys['species'].value_counts()
```

Out[5]:

The top four species names could be manually typed in to filter the surveys data frame.

In [6]:

```
surveys.loc[(surveys['species'] == 'merriami') |
(surveys['species'] == 'penicillatus') |
(surveys['species'] == 'ordii') |
(surveys['species'] == 'baileyi')].shape
```

Out[6]:

Comparing this number with the number of rows in the original data frame shows that it was filtered successfully.

In [7]:

```
surveys.shape
```

Out[7]:

However, it is rather tedious to type out the species names by hand and to do one comparison per species. Instead, the names of the most abundand species can be extract by using `nlargest()`

and returning only the index (row names) of the top observations.

In [8]:

```
most_common_species = (
surveys['species']
.value_counts()
.nlargest(4)
.index
)
most_common_species
```

Out[8]:

A subset can now be created from the data frame, including only those rows where the column 'species' matches any of the names in the `most_common_species`

variable. `pandas`

has a special `isin()`

method for comparing a data frame column to an array-like object of names such as the index extracted above.

In [9]:

```
surveys.loc[surveys['species'].isin(most_common_species)].shape
```

Out[9]:

The number of observations is the same as with the manual filtering above, but this method is more succinct.

To facilitate plotting, any NA values will be dropped before assigning the resulting data frame to a new variable name.

In [10]:

```
surveys_common = surveys.loc[surveys['species'].isin(most_common_species)].dropna()
surveys_common.shape
```

Out[10]:

This abbreviated data frame can now be used for plotting.

In [11]:

```
sns.countplot(y='species', data=surveys_common)
```

Out[11]:

That's more manageable! The text is a little small, which can be changed with the `set_context()`

function from `seaborn`

, using a number above `1`

for the fontscale parameter. The context parameter changes the size of object in the plots, such as the line widths, and will be left as the default `notebook`

for now.

These option changes will apply to all plots made from now on. Think of it as changing a value in the options menu of a graphical software.

In [12]:

```
sns.set_context(context='notebook', font_scale=1.4)
sns.countplot(y='species', data=surveys_common)
```

Out[12]:

To get a vertical plot, change `y`

to `x`

. With long label names, horizontal plots can be easier to read.

In [13]:

```
sns.countplot(x='species', data=surveys_common)
```

Out[13]:

`seaborn`

can do much more advanced visualizations than counting observations. Next, the relationship between one quantitative and one categorical valuable will be explored while stratifying the data based on its remaining categorical variables. To start, let's visualize summary statistics of the weight variable distribution for these fours species with a boxplot.

In [14]:

```
sns.boxplot(x='weight', y='species', data=surveys_common)
```

Out[14]:

The width of each box can be changed to make it look more appealing.

In [15]:

```
sns.boxplot(x='weight', y='species', data=surveys_common, width=0.4)
```

Out[15]:

The syntax is very similar to that of `countplot()`

, but instead of just supplying one variable and asking `seaborn`

to count the observations of that variable, the xy-variables are the categorical groups (the species) and the measurement of interest (the weight).

The aim of a box plot is to display a few statistics from the underlying distribution of one quantitative variable between the values of one categorical variable (the y-axis). The inclusion of multiple distribution statistics facilitates the comparison of more than just the mean + standard deviation (or another single measure of central tendency and variation). The `seaborn`

box plots are so-called Tukey box plots by default, which means that the graphical elements correspond to the following statistics:

- The lines of the box represent the 25th, 50th (median), and 75th quantile in the data. These divide the data into four quartiles (0-25, 25-50, 50-75, 75-100).
- The whiskers represent 1.5 * the interquartile range (the distance between the 25th and 75th quantile).
- The flyers mark all individual observations that are outside the whiskers, which could be referred to as "outliers" (although there are many definitions of what could constitutes an outlier).

Most of these plot elements are configurable and could be set to represent different distribution statistics.

Another useful visualization for comparing distributions is the `violinplot`

. Again, the syntax is the same as before, just change the plot name.

In [16]:

```
sns.violinplot (x='weight', y='species', data=surveys_common)
```

Out[16]:

Think of this plot as a smoothened version of the underlying histogram, that is then mirrored to give the violin-like shape. Where the violin is wider, there are more observations. The inner part is a boxplot with the median marked as a white dot. Comparisons with histograms and other distribution visualizations will be talked more about later in the workshop, but it is good to already keep in mind that although violinplots are great medium to large data sets, it can be misleading to use a smoothened distribution when there are very few observations, and it is probably better to show the individual data points instead of, or in addition to, the distribution plot.

The colors of the violins can be changed with the `palette`

keyword.

In [17]:

```
sns.violinplot (x='weight', y='species', data=surveys_common, palette='Blues')
```

Out[17]:

The `color`

parameter can be used to set all violins in the same color.

In [18]:

```
sns.violinplot (x='weight', y='species', data=surveys_common, color='lightgrey')
```

Out[18]:

An example for when a violin plot can be more informative than a box plot is to detect multimodal distributions, which could indicate that multiple values from an underlying confounding variable has been grouped together. This can be observed when comparing the weight of all animals in each type of trap used to catch them (the `plot_type`

column).

In [19]:

```
sns.boxplot(x='weight', y='plot_type', data=surveys_common)
```

Out[19]:

In [20]:

```
sns.violinplot(x='weight', y='plot_type', data=surveys_common)
```

Out[20]:

From the violin plot, it appears that there could be multiple distributions grouped together within each type of trap. There seems to be one distribution centered around 20 grams for all traps and one distribution centered around 45 grams (or 30 grams for Long-term krat exclosure). These observations could indeed be from the same distribution, but often when there are multiple bumps like this, it is a good idea to explore other variables in the data set, and see if there is a confounding variable contributing to the multimodality of the violin plot.

Since there appear to be 2-3 bumps in the distributions, it would be good to find a categorical variable in the data frame that has around the same number of unique values. The `pandas`

method `nunique()`

can count the number of unique values within each variable.

In [21]:

```
surveys_common.nunique().sort_values()
```

Out[21]:

There are a few candidate variables that have a suitable number of unique values. A very effective approach for exploring multiple categorical variables in a data set is to plot so-called "small multiples" of the data, where the same type of plot is used for different subsets of the data. These plots are drawn in rows and columns forming a grid pattern, and can be referred to as a "lattice", "facet", or "trellis" plot.

Visualizing categorical variables in this manner is a key step in exploratory data analysis, and thus `seaborn`

has a dedicated plot function for this, called `factorplot()`

(categorical variables are sometimes referred to as "factors"). This plot can be used to plot the same violin plot as before, and easily spread the variables across the rows and columns, e.g. for the variable `sex`

.

In [22]:

```
sns.factorplot(x='weight', y='plot_type', data=surveys_common, col='sex',
kind='violin')
```

Out[22]:

Sorting by the sex of the animal is probably not the most clever approach here since, the same sex from different species might very well have different weights. Let's try faceting the `genus`

variable on separate rows.

In [23]:

```
sns.factorplot(x='weight', y='plot_type', data=surveys_common, col='sex',
row='genus', kind='violin', margin_titles=True)
```

Out[23]:

There are certainly differences in weight between the two genera, but it appears that the data still is not split into unimodal distributions. A likely explanation could be that there are multiple species within each genus and the weight is species-dependent. Let's check how many species there are per genus and how many observations there are in each.

In [24]:

```
surveys_common.groupby(['genus', 'species']).size()
```

Out[24]:

There are two species within each genus. If the mean weights for those species are different, it could indeed explain the additional bump in the Chaetodipus distributions above.

In [25]:

```
surveys_common.groupby(['genus', 'species'])['weight'].mean()
```

Out[25]:

A factor plot with the column variable set to `'species'`

instead of `'genus'`

might be able to separate the distributions.

In [26]:

```
sns.factorplot(x='weight', y='plot_type', data=surveys_common,
col='species', kind='violin')
```

Out[26]:

That looks pretty good! The plot can be made more appealing by having two columns per row and making each plot a bit wider.

In [27]:

```
sns.factorplot(x='weight', y='plot_type', data=surveys_common, col='species',
col_wrap=2, kind='violin', aspect=1.4)
```

Out[27]:

This is great, much of the variation in the weight data can be explained by the species observed. The only species where there still appears to be multimodal distributions (and thus *possibly* a confounding variable, is within "baileyi" (and potentially "ordii"), especially for the "Spectab exclosure". The "sex" variable was used in a previous plot, but it was never explored within tin each species. It is common with sexual dimorphism within a species, and this could include weight differences.

In [28]:

```
sns.factorplot(x='weight', y='plot_type', hue='sex', data=surveys_common,
col='species', col_wrap=2, kind='violin', aspect=1.4)
```

Out[28]:

It does indeed appear that there is a difference in mean and distribution between the sexes within the species "baileyi". Minor differences between the sexes within other species are also visible now although they were not big enough to show up in the initial violinplot. As a final beautification of this plot, the violins can be split down the middle to reduce clutter.

In [29]:

```
sns.factorplot(x='weight', y='plot_type', hue='sex', data=surveys_common,
col='species', col_wrap=2, kind='violin', aspect=1.4, split=True)
```

Out[29]:

This clearly delivers the message and is easy to understand. One thing to notice here though is that the `plot_type`

variable doesn't look like it has much effect on the weights.

## Challenge¶

- Create a violin plot of species vs weight, split down the middle by sex.
- (Bonus) add the fifth and sixth most-common species to the plot.

A great aspect of the `factorplot()`

function is that if there is a change of minds (or hearts) in what type of visualization to use, only minor modifications are needed to completely change the plot appearance. For example, plotting the mean and 95% CI, requires changing a couple of parameters to make the plot look good, but the code is largely identical.

In [30]:

```
sns.factorplot(x='weight', y='plot_type', hue='sex', data=surveys_common,
col='species', col_wrap=2, kind='point', aspect=1.4, join=False,
dodge=1.25)
```

Out[30]:

To recap, `factorplot()`

facilitates the representation of variables within data as different elements in the plot, such as the rows, column, x-axis positions, and colors. There is a great description on this in the `seaborn`

documentation:

It is important to choose how variables get mapped to the plot structure such that the most important comparisons are easiest to make. As a general rule, it is easier to compare positions that are closer together, so the

`hue`

variable should be used for the most important comparisons. For secondary comparisons, try to share the quantitative axis (so, use`col`

for vertical plots and`row`

for horizontal plots). Note that, although it is possible to make rather complex plots using this function, in many cases you may be better served by created several smaller and more focused plots than by trying to stuff many comparisons into one figure

We talk more about the advantages and drawbacks of creating complex visualization with `factorplot()`

in the second part of the challenge.

## Challenge¶

- Explore the same relationship as in the last violin plot, but visualize the results using a boxplot instead of a violinplot. Which variable(s) did you need to change and why?
- (Bonus) Create a grid of countplots comparing the number of observations between sexes across months (with months on the x-axis for each plot). Create facets for each species and each plot_type. Is this a good plot for data explorations? What about for publication?

In the last section, one quantitative variable was visualized across one or more categorical variables. Here, the relationship between two quantitative variables will be explored while stratifying the data based on its remaining categorical variables. First, reexamine the variables and their data types using the entire `surveys`

data frame.

In [31]:

```
surveys.info()
```

The only two quantitative continuous variables are "weight" and "hindfoot_length". Although some of the others are integers, they are all categorical, such as month, day and year.

A scatter plot is the immediate choice for exploring pairwise relationships between continuous variables. `seaborn`

has a convenient scatter plot matrix function, `pairplot()`

, for plotting the pairwise relationships between all numerical variables in the data frame.

In [32]:

```
# Since this plot creates so many graphical elemments, the data set is subsampled to
# avoid waiting for the plot creation to finish. Setting `random_state` makes sure the
# same observations are sampled each time this is run.
surveys_sample = surveys.dropna().sample(1000, random_state=0)
sns.pairplot(surveys_sample)
```

Out[32]: