Data wrangling and visualization with pandas, seaborn, and matplotlib

Lesson preamble

Learning Objectives

  • Understand the split-apply-combine concept for data analysis.
    • Use groupby(), mean(), agg() and size() to apply this technique.
  • Produce scatter plots, line plots, and histograms using seaborn and matplotlib.
  • Set universal plot settings.
  • Understand and apply grids for faceting in seaborn.

Lesson outline

  • Split-apply-combine techniques in pandas
    • Using mean() to summarize categorical data (20 min)
    • Using size() to summarize categorical data (10 min)
  • Data visualization with matplotlib and seaborn (10 min)
    • Visualizing one quantitative variable with multiple categorical variables (50 min)
    • Visualizing the relationship of two quantitative variables with multiple categorical variables (40 min)
  • Split-apply-combine... plot! (20 min)
In [1]:
import pandas as pd

surveys = pd.read_csv('surveys.csv')
surveys.tail()
Out[1]:
       record_id  month  day  year  plot_id  species_id  sex  hindfoot_length  weight  genus        species   taxa    plot_type
34781  26966      10     25   1997  7        PL          M    20.0             16.0    Peromyscus   leucopus  Rodent  Rodent Exclosure
34782  27185      11     22   1997  7        PL          F    21.0             22.0    Peromyscus   leucopus  Rodent  Rodent Exclosure
34783  27792      5      2    1998  7        PL          F    20.0             8.0     Peromyscus   leucopus  Rodent  Rodent Exclosure
34784  28806      11     21   1998  7        PX          NaN  NaN              NaN     Chaetodipus  sp.       Rodent  Rodent Exclosure
34785  30986      7      1    2000  7        PX          NaN  NaN              NaN     Chaetodipus  sp.       Rodent  Rodent Exclosure

Split-apply-combine techniques in pandas

Using mean() to summarize categorical data

Many data analysis tasks can be approached using the split-apply-combine paradigm: split the data into groups, apply some analysis to each group, and then combine the results.
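
To make the three steps concrete, here is a minimal sketch that performs them by hand. The `toy` data frame and its values are made up for illustration and are not taken from surveys.csv.

```python
import pandas as pd

# Hypothetical toy data, not part of the surveys data set
toy = pd.DataFrame({
    'genus': ['A', 'A', 'B', 'B', 'B'],
    'weight': [10.0, 20.0, 30.0, 40.0, 50.0],
})

per_group = {}
for genus in toy['genus'].unique():
    # Split: keep only the rows that belong to this genus
    subset = toy.loc[toy['genus'] == genus]
    # Apply: summarize the subset with a single number
    per_group[genus] = subset['weight'].mean()

# Combine: collect the per-group results into one Series
combined = pd.Series(per_group)
print(combined)
```

Writing the loop out like this quickly becomes tedious, which is why a dedicated grouping method is convenient.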

pandas facilitates this workflow through the use of groupby() to split data and summary/aggregation functions such as mean(), which collapses each group into a single-row summary of that group. The arguments to groupby() are the column names that contain the categorical variables by which summary statistics should be calculated. To start, compute the mean weight by genus.

In [2]:
# Genera without any weight measurements get NaN as their mean
# These rows could be removed from the result with `dropna()`
surveys.groupby('genus')['weight'].mean()
Out[2]:
genus
Ammodramus                 NaN
Ammospermophilus           NaN
Amphispiza                 NaN
Baiomys               8.600000
Calamospiza                NaN
Callipepla                 NaN
Campylorhynchus            NaN
Chaetodipus          24.179329
Cnemidophorus              NaN
Crotalus                   NaN
Dipodomys            55.860219
Lizard                     NaN
Neotoma             159.245660
Onychomys            26.496173
Perognathus           8.377454
Peromyscus           21.456262
Pipilo                     NaN
Pooecetes                  NaN
Reithrodontomys      10.667939
Rodent                     NaN
Sceloporus                 NaN
Sigmodon             67.264574
Sparrow                    NaN
Spermophilus         93.500000
Sylvilagus                 NaN
Zonotrichia                NaN
Name: weight, dtype: float64

When the mean is computed, the default behavior is to ignore NA values, so they only need to be dropped if they are to be excluded from the visual output.
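
The sketch below illustrates this behavior on a small made-up data frame (`toy` is hypothetical): missing values are skipped when a group has at least one measurement, and a group with no measurements at all gets NaN as its mean.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: genus 'A' has one measurement, genus 'B' has none
toy = pd.DataFrame({
    'genus': ['A', 'A', 'B', 'B'],
    'weight': [10.0, np.nan, np.nan, np.nan],
})

means = toy.groupby('genus')['weight'].mean()
# 'A': the NaN is skipped, so the mean is based on the single value 10.0
# 'B': nothing is left after skipping NaN, so the mean itself is NaN
print(means)

# Drop the groups without a mean, e.g. before plotting
means_without_na = means.dropna()
print(means_without_na)
```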

Groups can also be created from multiple columns:

In [3]:
surveys.groupby(['genus', 'sex'])['weight'].mean()
Out[3]:
genus             sex
Ammospermophilus  M             NaN
Baiomys           F        9.161290
                  M        7.357143
Chaetodipus       F       23.763824
                  M       24.712219
Dipodomys         F       55.244360
                  M       56.243034
Neotoma           F      154.282209
                  M      165.652893
Onychomys         F       26.780959
                  M       26.246466
Perognathus       F        8.574803
                  M        8.204182
Peromyscus        F       22.491649
                  M       20.644279
Reithrodontomys   F       11.220080
                  M       10.159941
Sigmodon          F       71.696000
                  M       61.336842
Spermophilus      F       57.000000
                  M      130.000000
Name: weight, dtype: float64

Since the same grouped data frame will be used in multiple code chunks below, it can be assigned to a new variable and reused in the subsequent code chunks instead of typing out the groupby() call each time.

In [4]:
grouped_surveys = surveys.groupby(['genus', 'sex'])
grouped_surveys['weight'].mean() # Show that the output is the same as above
Out[4]:
genus             sex
Ammospermophilus  M             NaN
Baiomys           F        9.161290
                  M        7.357143
Chaetodipus       F       23.763824
                  M       24.712219
Dipodomys         F       55.244360
                  M       56.243034
Neotoma           F      154.282209
                  M      165.652893
Onychomys         F       26.780959
                  M       26.246466
Perognathus       F        8.574803
                  M        8.204182
Peromyscus        F       22.491649
                  M       20.644279
Reithrodontomys   F       11.220080
                  M       10.159941
Sigmodon          F       71.696000
                  M       61.336842
Spermophilus      F       57.000000
                  M      130.000000
Name: weight, dtype: float64

Instead of using the mean() method, the more general agg() method could be called to aggregate (or summarize) by any function, not just the mean. The equivalent to the mean() method would be to call agg() with the numpy function np.mean().

In [5]:
import numpy as np

grouped_surveys['weight'].agg(np.mean).reset_index()
Out[5]:
    genus             sex  weight
0   Ammospermophilus  M    NaN
1   Baiomys           F    9.161290
2   Baiomys           M    7.357143
3   Chaetodipus       F    23.763824
4   Chaetodipus       M    24.712219
5   Dipodomys         F    55.244360
6   Dipodomys         M    56.243034
7   Neotoma           F    154.282209
8   Neotoma           M    165.652893
9   Onychomys         F    26.780959
10  Onychomys         M    26.246466
11  Perognathus       F    8.574803
12  Perognathus       M    8.204182
13  Peromyscus        F    22.491649
14  Peromyscus        M    20.644279
15  Reithrodontomys   F    11.220080
16  Reithrodontomys   M    10.159941
17  Sigmodon          F    71.696000
18  Sigmodon          M    61.336842
19  Spermophilus      F    57.000000
20  Spermophilus      M    130.000000

This general approach is more flexible and powerful since multiple aggregation functions can be applied in the same line of code by passing them as a list to agg(). For instance, the standard deviation and mean could be computed in the same call.

In [6]:
# Multiple aggregation functions are passed as a list, hence the square brackets
grouped_surveys['weight'].agg([np.mean, np.std])
Out[6]:
                       mean        std
genus             sex
Ammospermophilus  M    NaN         NaN
Baiomys           F    9.161290    2.237510
                  M    7.357143    0.841897
Chaetodipus       F    23.763824   7.973696
                  M    24.712219   10.303329
Dipodomys         F    55.244360   29.657217
                  M    56.243034   29.008498
Neotoma           F    154.282209  39.186546
                  M    165.652893  48.991563
Onychomys         F    26.780959   6.269802
                  M    26.246466   6.360828
Perognathus       F    8.574803    4.123303
                  M    8.204182    3.238490
Peromyscus        F    22.491649   4.850259
                  M    20.644279   3.935623
Reithrodontomys   F    11.220080   2.604365
                  M    10.159941   1.760459
Sigmodon          F    71.696000   28.241820
                  M    61.336842   20.418291
Spermophilus      F    57.000000   NaN
                  M    130.000000  NaN

Any function can be passed like this, including custom functions that you write yourself. For many common aggregation functions, pandas also accepts a string with the function name as a convenience.

In [7]:
grouped_surveys['weight'].agg(['mean', 'median', 'count'])
Out[7]:
                       mean        median  count
genus             sex
Ammospermophilus  M    NaN         NaN     0
Baiomys           F    9.161290    9.0     31
                  M    7.357143    7.0     14
Chaetodipus       F    23.763824   23.0    3201
                  M    24.712219   21.0    2627
Dipodomys         F    55.244360   45.0    6826
                  M    56.243034   47.0    8649
Neotoma           F    154.282209  160.0   652
                  M    165.652893  170.0   484
Onychomys         F    26.780959   26.0    1502
                  M    26.246466   25.0    1627
Perognathus       F    8.574803    8.0     762
                  M    8.204182    8.0     813
Peromyscus        F    22.491649   23.0    958
                  M    20.644279   21.0    1206
Reithrodontomys   F    11.220080   11.0    1245
                  M    10.159941   10.0    1363
Sigmodon          F    71.696000   70.0    125
                  M    61.336842   58.0    95
Spermophilus      F    57.000000   57.0    1
                  M    130.000000  130.0   1
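
As a sketch of how a custom function can be mixed with the string shortcuts, the example below adds a hypothetical `weight_range()` helper (not part of pandas or this lesson) that returns the spread between the largest and smallest value in each group. The `toy` data frame is made up for illustration.

```python
import pandas as pd

# Hypothetical toy data, not part of the surveys data set
toy = pd.DataFrame({
    'genus': ['A', 'A', 'B', 'B', 'B'],
    'weight': [10.0, 20.0, 30.0, 40.0, 50.0],
})

def weight_range(values):
    """Return the difference between the largest and smallest value."""
    return values.max() - values.min()

# A string name and a custom function can be passed in the same list;
# the resulting column is named after the function
summary = toy.groupby('genus')['weight'].agg(['mean', weight_range])
print(summary)
```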

Challenge

  1. Use groupby() and agg() to find the mean, min, and max hindfoot length for each species.

  2. What was the heaviest animal measured in each year? Return the columns year, genus, species, and weight. Hint: Look into the idxmax() method.

Using size() to summarize categorical data

When working with data, it is common to want to know the number of observations present for each categorical variable. For this, pandas provides the size() method. For example, to group by 'taxa' and find the number of observations for each 'taxa':

In [8]:
surveys.groupby('taxa').size()
Out[8]:
taxa
Bird         450
Rabbit        75
Reptile       14
Rodent     34247
dtype: int64

size() can also be used when grouping on multiple variables.

In [9]:
surveys.groupby(['taxa', 'sex']).size()
Out[9]:
taxa    sex
Rodent  F      15690
        M      17348
dtype: int64
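
The output above lists only rodents. A plausible explanation, assuming that sex was not recorded for the other taxa, is that groupby() by default leaves out rows where a grouping column is missing. The sketch below shows this on made-up data; the `dropna=False` argument is an assumption about the installed version, since it requires pandas 1.1 or later.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: sex is missing for two of the four rows
toy = pd.DataFrame({
    'taxa': ['Rodent', 'Rodent', 'Bird', 'Bird'],
    'sex': ['F', 'M', np.nan, np.nan],
})

# Default: rows with a missing grouping key are silently left out
default_counts = toy.groupby('sex').size()
print(default_counts)

# Assumes pandas >= 1.1: keep the missing key as its own group
with_missing = toy.groupby('sex', dropna=False).size()
print(with_missing)
```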

If there are many groups, size() is not that useful on its own. For example, it is difficult to quickly find the five most abundant species among the observations.

In [10]:
surveys.groupby('species').size()
Out[10]:
species
albigula            1252
audubonii             75
baileyi             2891
bilineata            303
brunneicapillus       50
chlorurus             39
clarki                 1
eremicus            1299
flavus              1597
fulvescens            75
fulviventer           43
fuscus                 5
gramineus              8
harrisi              437
hispidus             179
intermedius            9
leucogaster         1006
leucophrys             2
leucopus              36
maniculatus          899
megalotis           2609
melanocorys           13
merriami           10596
montanus               8
ochrognathus          43
ordii               3027
penicillatus        3123
savannarum             2
scutalatus             1
sp.                   86
spectabilis         2504
spilosoma            248
squamata              16
taylori               46
tereticaudus           1
tigris                 1
torridus            2249
undulatus              5
uniparens              1
viridis                1
dtype: int64

Since there are many rows in this output, it would be beneficial to sort the table values and display the most abundant species first. This is easy to do with the sort_values() method.

In [11]:
surveys.groupby('species').size().sort_values()
Out[11]:
species
viridis                1
uniparens              1
scutalatus             1
tereticaudus           1
tigris                 1
clarki                 1
leucophrys             2
savannarum             2
undulatus              5
fuscus                 5
gramineus              8
montanus               8
intermedius            9
melanocorys           13
squamata              16
leucopus              36
chlorurus             39
ochrognathus          43
fulviventer           43
taylori               46
brunneicapillus       50
fulvescens            75
audubonii             75
sp.                   86
hispidus             179
spilosoma            248
bilineata            303
harrisi              437
maniculatus          899
leucogaster         1006
albigula            1252
eremicus            1299
flavus              1597
torridus            2249
spectabilis         2504
megalotis           2609
baileyi             2891
ordii               3027
penicillatus        3123
merriami           10596
dtype: int64

That's better, but it could be helpful to display the most abundant species on top. In other words, the output should be arranged in descending order, after which head() can limit it to the first five rows.

In [12]:
surveys.groupby('species').size().sort_values(ascending=False).head(5)
Out[12]:
species
merriami        10596
penicillatus     3123
ordii            3027
baileyi          2891
megalotis        2609
dtype: int64

Looks good! By now, the code statement has grown quite long because many methods have been chained together. It can be tricky to keep track of what is going on in long method chains. To make the code more readable, it can be broken up into multiple lines by surrounding it with parentheses.

In [13]:
(surveys
     .groupby('species')
     .size()
     .sort_values(ascending=False)
     .head(5)
)
Out[13]:
species
merriami        10596
penicillatus     3123
ordii            3027
baileyi          2891
megalotis        2609
dtype: int64

This looks neater and makes long method chains easier to read. There is no absolute rule for when to break code into multiple lines, but always try to write code that is easy for collaborators (your most common collaborator is a future version of yourself!) to understand.

pandas actually has a convenience function for returning the top five results, so the values don't need to be sorted explicitly.

In [14]:
(surveys
     .groupby(['species'])
     .size()
     .nlargest() # the default is 5
)
Out[14]:
species
merriami        10596
penicillatus     3123
ordii            3027
baileyi          2891
megalotis        2609
dtype: int64
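
A small sketch, using a made-up Series of counts, that checks that nlargest() gives the same result as sorting in descending order and taking the first rows:

```python
import pandas as pd

# Hypothetical counts, not taken from the surveys data set
counts = pd.Series({'a': 3, 'b': 10, 'c': 7, 'd': 1})

# Explicit sorting followed by head()
top_sorted = counts.sort_values(ascending=False).head(2)

# The convenience method, here asking for the top two instead of the default five
top_nlargest = counts.nlargest(2)

print(top_nlargest)
print(top_sorted.equals(top_nlargest))
```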

To include more attributes about these species, add columns to groupby().

In [15]:
(surveys
     .groupby(['species', 'taxa', 'genus'])
     .size()
     .nlargest()
) 
Out[15]:
species       taxa    genus          
merriami      Rodent  Dipodomys          10596
penicillatus  Rodent  Chaetodipus         3123
ordii         Rodent  Dipodomys           3027
baileyi       Rodent  Chaetodipus         2891
megalotis     Rodent  Reithrodontomys     2609
dtype: int64

Again, the display of the output shows that it is returned as a Series. The to_frame() method can be used to convert the output into a data frame.

In [16]:
(surveys
     .groupby(['species', 'taxa', 'genus'])
     .size()
     .nlargest()
     .to_frame()
) 
Out[16]:
                                       0
species       taxa    genus
merriami      Rodent  Dipodomys        10596
penicillatus  Rodent  Chaetodipus      3123
ordii         Rodent  Dipodomys        3027
baileyi       Rodent  Chaetodipus      2891
megalotis     Rodent  Reithrodontomys  2609

The reason that "species", "taxa", and "genus" are displayed in bold font is that groupby() makes these columns into the index (the row names) of the data frame. Indexes can be powerful when working with very large datasets (e.g. matching on indices is faster than matching on values in columns). However, multiple index levels like those above can be less intuitive to work with than columns, so it is often a good idea to reset the data frame's index unless there is a clear advantage to keeping it for downstream analysis.

To reset the index, the reset_index() method can be used instead of to_frame().

In [17]:
(surveys
     .groupby(['species', 'taxa', 'genus'])
     .size()
     .nlargest()
     .reset_index()
) 
Out[17]:
   species       taxa    genus            0
0  merriami      Rodent  Dipodomys        10596
1  penicillatus  Rodent  Chaetodipus      3123
2  ordii         Rodent  Dipodomys        3027
3  baileyi       Rodent  Chaetodipus      2891
4  megalotis     Rodent  Reithrodontomys  2609

When the series was changed into a data frame, the values were put into a column. Columns need a name, and by default this is just the lowest unique number among the column names, in this case 0. The rename() method can be used to change the name of the 0 column to something more meaningful.

In [18]:
(surveys
     .groupby(['species', 'taxa', 'genus'])
     .size()
     .nlargest()
     .reset_index()
     .rename(columns={0: 'size'})
) 
Out[18]:
   species       taxa    genus            size
0  merriami      Rodent  Dipodomys        10596
1  penicillatus  Rodent  Chaetodipus      3123
2  ordii         Rodent  Dipodomys        3027
3  baileyi       Rodent  Chaetodipus      2891
4  megalotis     Rodent  Reithrodontomys  2609

Any column can be renamed this way:

In [19]:
(surveys
     .groupby(['species', 'taxa', 'genus'])
     .size()
     .nlargest()
     .reset_index()
     .rename(columns={'genus': 'Genus', 'taxa': 'Taxa'})
) 
Out[19]:
   species       Taxa    Genus            0
0  merriami      Rodent  Dipodomys        10596
1  penicillatus  Rodent  Chaetodipus      3123
2  ordii         Rodent  Dipodomys        3027
3  baileyi       Rodent  Chaetodipus      2891
4  megalotis     Rodent  Reithrodontomys  2609
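
As a possible shortcut, the reset_index() method of a Series accepts a `name` argument for the values column, so the count column can be named without a separate rename() step. The sketch below compares the two approaches on a made-up data frame (`toy` is hypothetical).

```python
import pandas as pd

# Hypothetical toy data, not part of the surveys data set
toy = pd.DataFrame({
    'species': ['x', 'x', 'y'],
    'taxa': ['T', 'T', 'T'],
})

# Reset the index first, then rename the default 0 column
two_steps = (toy
    .groupby(['species', 'taxa'])
    .size()
    .reset_index()
    .rename(columns={0: 'size'})
)

# Name the values column while resetting the index
one_step = (toy
    .groupby(['species', 'taxa'])
    .size()
    .reset_index(name='size')
)

print(one_step)
```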

Challenge

  1. How many individuals were caught in each plot_type surveyed?

  2. Calculate the number of animals trapped per plot type for each year. Extract the combinations of year and plot type that had the three highest number of observations (e.g. "1998-Control").

Data visualization in matplotlib and seaborn

There are many plotting packages in Python, making it possible to create diverse visualizations such as interactive web graphics, 3D animations, statistical visualizations, and map-based plots. Here, we will focus on two of the most useful for researchers: matplotlib, which is a robust, detail-oriented, low-level plotting interface, and seaborn, which provides high-level functions on top of matplotlib. seaborn allows plotting calls to be expressed in terms of what is being explored in the underlying data rather than what graphical elements to add to the plot.

For example, instead of instructing the computer to "go through a data frame and plot any observations of speciesX in blue, any observations of speciesY in red, etc", the seaborn syntax allows commands more similar to "color the data by species". Thanks to this functional way of interfacing with data, only minimal changes are required if the underlying data change or if the type of plot used for the visualization is switched. It provides a language that facilitates thinking about data in ways that are conducive to exploratory analysis and allows for the creation of publication-quality plots with minimal adjustment and tweaking.

Plotting with seaborn was briefly introduced in the first lecture. To make a plot of the number of observations for each species, first import the library and then use the countplot() function. Before the first plot is created, the line %matplotlib inline is used to specify that all plots should show up in the notebook instead of in a separate window.

In [20]:
%matplotlib inline
import seaborn as sns

sns.countplot(y='species', data=surveys)
Out[20]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f50b824b668>

That's a lot of species... For convenience when introducing the following plotting concepts, the number of species will be limited to the four most abundant. To do this, first extract the names of the most abundant species.

In [21]:
most_common_species = (
    surveys['species']
       .value_counts()
       .nlargest(4)
       .index
)
most_common_species
Out[21]:
Index(['merriami', 'penicillatus', 'ordii', 'baileyi'], dtype='object')

A subset can now be created from the data frame, including only those rows where the column 'species' matches any of the names in the most_common_species variable. As before, boolean indexes will be used for this. One way of doing this would be to write four comparisons and combine them with the | operator.

In [22]:
surveys.loc[(surveys['species'] == most_common_species[0]) |
            (surveys['species'] == most_common_species[1]) |
            (surveys['species'] == most_common_species[2]) |
            (surveys['species'] == most_common_species[3])].shape
Out[22]:
(19637, 13)

That is quite tedious and pandas has a special isin() method for comparing a data frame column to an array-like object of names such as the index extracted above.

In [23]:
surveys.loc[surveys['species'].isin(most_common_species)].shape
Out[23]:
(19637, 13)
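
A small sketch on made-up data, confirming that isin() builds the same boolean index as comparisons chained with the | operator:

```python
import pandas as pd

# Hypothetical toy data, not part of the surveys data set
toy = pd.DataFrame({'species': ['a', 'b', 'c', 'a', 'd']})
wanted = ['a', 'c']

# One comparison per name, combined with | (element-wise "or")
mask_or = (toy['species'] == wanted[0]) | (toy['species'] == wanted[1])

# The same boolean index in a single call
mask_isin = toy['species'].isin(wanted)

print(mask_isin.tolist())
print(mask_or.equals(mask_isin))
```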

Drop any rows that contain NAs and assign the result to a variable.

In [24]:
surveys_common = surveys.loc[surveys['species'].isin(most_common_species)].dropna()
surveys_common.shape
Out[24]:
(18289, 13)
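
Note that dropna() without arguments removes every row that has a missing value in any column. If only some columns matter for the analysis, the `subset` parameter can restrict the check to those columns. A minimal sketch on made-up data (`toy` is hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with missing values in different columns
toy = pd.DataFrame({
    'species': ['a', 'b', 'c'],
    'weight': [1.0, np.nan, 3.0],
    'hindfoot_length': [np.nan, 2.0, 3.0],
})

# Keep only the rows that are complete in every column
all_complete = toy.dropna()

# Keep the rows where 'weight' is known, regardless of the other columns
weight_known = toy.dropna(subset=['weight'])

print(all_complete.shape, weight_known.shape)
```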

This abbreviated data frame can now be used for plotting.

In [25]:
sns.countplot(y='species', data=surveys_common)
Out[25]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f50ad75d2b0>

That's more manageable! The text is a little small; change this with the set_context() function from seaborn, using a number above 1 for the font_scale parameter. The context parameter changes the size of objects in the plots, such as the line widths, and will be left as the default, 'notebook', for now.

These option changes will apply to all plots made from now on. Think of it as changing a value in the options menu of a graphical software.

In [26]:
sns.set_context(context='notebook', font_scale=1.4)
sns.countplot(y='species', data=surveys_common)
Out[26]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f50ad747470>

To get a vertical plot, change y to x. With long label names, horizontal plots can be easier to read.

In [27]:
sns.countplot(x='species', data=surveys_common)
Out[27]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f50ad6961d0>

Visualizing one quantitative variable with multiple categorical variables

seaborn can do much more advanced visualizations than counting things. For example, to visualize summary statistics of the weight variable distribution for these four species, a boxplot can be used.

In [28]:
sns.boxplot(x='weight', y='species', data=surveys_common)
Out[28]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f50ad66f908>

The width of each box can be changed to make it look more appealing.

In [29]:
sns.boxplot(x='weight', y='species', data=surveys_common, width=0.4)
Out[29]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f50ad5f39b0>

The syntax is very similar to that of countplot(), but instead of just supplying one variable and asking seaborn to count the observations of that variable, the xy-variables are the categorical groups (the species) and the measurement of interest (the weight).

The aim of a box plot is to display statistics of the underlying distribution, which facilitate comparison of more than just the mean + standard deviation (or another single measure of central tendency and variation) across categorical variables. These specific box plots are so-called Tukey box plots by default, which means that the graphical elements correspond to the following statistics:

  • The lines of the box represent the 25th, 50th (median), and 75th percentiles of the data. These divide the data into four quarters (0-25, 25-50, 50-75, 75-100).
  • The whiskers extend to the most extreme observations that lie within 1.5 * the interquartile range (the distance between the 25th and 75th percentiles) from the edges of the box.
  • The flyers mark all individual observations that are outside the whiskers, which could be referred to as "outliers" (there are many definitions of what could constitute an outlier).

Most of these plot elements are configurable and could be set to represent different distribution statistics.
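
The statistics behind a Tukey box plot can be sketched with numpy. The weights below are made up, and the percentiles use numpy's default linear interpolation, so the exact numbers drawn by a plotting library could differ slightly if it uses another percentile method.

```python
import numpy as np

# Hypothetical weights with one unusually heavy observation
weights = np.array([10, 12, 13, 14, 15, 16, 18, 40], dtype=float)

# The three lines of the box
q1, median, q3 = np.percentile(weights, [25, 50, 75])
iqr = q3 - q1

# Observations further than 1.5 * IQR from the box edges become flyers
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr
is_inside = (weights >= lower_limit) & (weights <= upper_limit)

# The whiskers end at the most extreme observations inside the limits
whisker_low = weights[is_inside].min()
whisker_high = weights[is_inside].max()
flyers = weights[~is_inside]

print(q1, median, q3)
print(whisker_low, whisker_high, flyers)
```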

Another useful visualization for comparing distributions is the violinplot. Again, the syntax is the same as before, just change the plot name.

In [30]:
sns.violinplot(x='weight', y='species', data=surveys_common)
Out[30]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f50ad57ca58>

Think of this plot as a smoothened version of the underlying histogram, that is then mirrored underneath. Where the violin is wider, there are more observations. The inner part is a boxplot with the median marked as a white dot. Comparisons with histograms and other distribution visualizations will be talked more about later in the workshop, but it is good to already keep in mind that it can be misleading to use a smoothened distribution if you have few observations, and it is probably better to show the individual data points instead of, or in addition to, the distribution plot.

The colors of the violin can be muted to bring out the box.

In [31]:
sns.violinplot(x='weight', y='species', data=surveys_common, color='lightgrey')
Out[31]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f50ac2eafd0>

An example of when a violin plot can be more informative than a box plot is when detecting multimodal distributions, which could indicate that an underlying confounding variable has been grouped together. This can be seen when plotting 'plot_type' on the y-axis instead of 'species'.

In [32]:
sns.boxplot(x='weight', y='plot_type', data=surveys_common)
Out[32]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f50ac260908>
In [33]:
sns.violinplot(x='weight', y='plot_type', data=surveys_common)
Out[33]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f50ac207d68>

From the violin plot, it appears that there could be multiple distributions grouped together within each plot type (remember that the 'plot_type' indicates the type of trap used to catch the animals). There seems to be one distribution centered around weight=20 for all traps and one distribution centered around 45 (or 30 for Long-term krat exclosure). These observations could indeed be from the same distribution, but when there are multiple bumps like this, it is often a good idea to explore other variables in the data set and see if we can find the reason for the multiple bumps in the violin plot.

Since there appear to be 2-3 bumps in the distributions, it would be good to find a categorical variable in the data frame that has around the same number of unique values, since grouping based on these values could explain what we are seeing. The pandas method nunique() comes in handy for this task.

In [34]:
surveys_common.nunique().sort_values()
Out[34]:
taxa                   1
sex                    2
genus                  2
species_id             4
species                4
plot_type              5
month                 12
plot_id               24
year                  26
day                   31
hindfoot_length       40
weight                69
record_id          18289
dtype: int64

There are a few candidate variables that have a suitable number of unique values. A very effective approach for exploring multiple variables in a data set is to plot so-called small multiples of the data, where the same type of plot is used for different subsets of the data. These plots are drawn in rows and columns forming a grid pattern, and can be referred to as a "lattice", "facet", or "trellis" plot.

Visualizing categorical variables in this manner is a key step in exploratory data analysis, and thus seaborn has a dedicated plot function for this, called factorplot() (categorical variables are sometimes referred to as "factors"). This plot can be used to plot the same violin plot as before, and easily spread the variables across the rows and columns, e.g. for the variable "sex".

In [35]:
sns.factorplot(x='weight', y='plot_type', data=surveys_common, col='sex',
               kind='violin')
Out[35]:
<seaborn.axisgrid.FacetGrid at 0x7f50ac1c0d30>

Splitting by the sex of the animal is probably not the most clever approach here, since the same sex from different species or genera would have different weights. Let's try adding "genus".

In [36]:
sns.factorplot(x='weight', y='plot_type', data=surveys_common, col='sex',
               row='genus', kind='violin', margin_titles=True)
Out[36]:
<seaborn.axisgrid.FacetGrid at 0x7f50ac070c50>

There are certainly differences between the two genera, but it appears that the data still is not split into unimodal distributions. A likely explanation could be that we still have multiple species per genus and the weight is species-dependent. Let's check how many species there are per genus and how many observations there are in each.

In [37]:
surveys_common.groupby(['genus', 'species']).size()
Out[37]:
genus        species     
Chaetodipus  baileyi         2803
             penicillatus    2969
Dipodomys    merriami        9727
             ordii           2790
dtype: int64

If the mean weights for those species are different, it could indeed explain the additional bump in the Chaetodipus genus.

In [38]:
surveys_common.groupby(['genus', 'species'])['weight'].mean()
Out[38]:
genus        species     
Chaetodipus  baileyi         31.739922
             penicillatus    17.187942
Dipodomys    merriami        43.136013
             ordii           48.867384
Name: weight, dtype: float64

A factor plot faceted on "species" instead of "genus" might be able to separate the distributions.

In [39]:
sns.factorplot(x='weight', y='plot_type', data=surveys_common, col='species',
               kind='violin')
Out[39]:
<seaborn.axisgrid.FacetGrid at 0x7f50ac179390>

That looks pretty good! The plot can be made more appealing by having two columns per row and making each plot a bit wider.

In [40]:
sns.factorplot(x='weight', y='plot_type', data=surveys_common, col='species', 
               col_wrap=2, kind='violin', aspect=1.4)
Out[40]:
<seaborn.axisgrid.FacetGrid at 0x7f50a7e66da0>

This is great, much of the variation in the weight data can be explained by the species observed. The only species where there still appear to be multimodal distributions (and thus possibly a confounding variable) is "baileyi" (and potentially "ordii"), especially for the "Spectab exclosure". The "sex" variable was used in a previous plot, but it was never explored within each species. Sexual dimorphism is common within species, and this could include weight differences.

In [41]:
sns.factorplot(x='weight', y='plot_type', hue='sex', data=surveys_common,
               col='species', col_wrap=2, kind='violin', aspect=1.4) 
Out[41]:
<seaborn.axisgrid.FacetGrid at 0x7f50ac15dc18>

It does indeed appear that there is a difference in mean and distribution between the sexes within the species "baileyi". Minor differences between the sexes within other species are also visible now although they were not big enough to show up in the initial violinplot (in later lectures, we will see more how the violin plot can hide differences like this). As a final beautification of this plot, the violins can be split down the middle to reduce clutter in the plot.

In [42]:
sns.factorplot(x='weight', y='plot_type', hue='sex', data=surveys_common, 
               col='species', col_wrap=2, kind='violin', aspect=1.4, split=True)
Out[42]:
<seaborn.axisgrid.FacetGrid at 0x7f50a7f31470>

This clearly delivers the message and is easy to understand. A great aspect of the factorplot() function is that if there is a change of mind (or heart) about what type of visualization to use, only minor modifications are needed to completely change the plot appearance.

In [43]:
sns.factorplot(x='weight', y='plot_type', hue='sex', data=surveys_common, 
               col='species', col_wrap=2, kind='box', aspect=1.4)
Out[43]:
<seaborn.axisgrid.FacetGrid at 0x7f50a615c588>

Plotting the mean and 95% CI requires changing a couple of additional parameters to make the plot look good, but the code is largely identical.

In [44]:
sns.factorplot(x='weight', y='plot_type', hue='sex', data=surveys_common, 
               col='species', col_wrap=2, kind='point', aspect=1.4, join=False,
               dodge=1.25)
Out[44]:
<seaborn.axisgrid.FacetGrid at 0x7f50a6027828>

To recap, factorplot() facilitates the representation of variables within data as different elements in the plot, such as the rows, columns, x-axis positions, and colors. There is a great description of this in the seaborn documentation:

It is important to choose how variables get mapped to the plot structure such that the most important comparisons are easiest to make. As a general rule, it is easier to compare positions that are closer together, so the hue variable should be used for the most important comparisons. For secondary comparisons, try to share the quantitative axis (so, use col for vertical plots and row for horizontal plots). Note that, although it is possible to make rather complex plots using this function, in many cases you may be better served by created several smaller and more focused plots than by trying to stuff many comparisons into one figure

The last point above is worth illustrating with a challenge. It is easy to get carried away with factorplot() and try to visualize everything at once.

Challenge

Create a grid of countplots comparing the number of observations between sexes across months. Create facets for each species and each plot_type.

Visualizing the relationship of two quantitative variables across multiple categorical variables

First, examine the variables and their data types. For this, the entire "surveys" data frame can be used.

In [46]:
surveys.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 34786 entries, 0 to 34785
Data columns (total 13 columns):
record_id          34786 non-null int64
month              34786 non-null int64
day                34786 non-null int64
year               34786 non-null int64
plot_id            34786 non-null int64
species_id         34786 non-null object
sex                33038 non-null object
hindfoot_length    31438 non-null float64
weight             32283 non-null float64
genus              34786 non-null object
species            34786 non-null object
taxa               34786 non-null object
plot_type          34786 non-null object
dtypes: float64(2), int64(5), object(6)
memory usage: 3.5+ MB

From this it is already clear that the only two quantitative variables are "weight" and "hindfoot_length". Although some of the others are integers, they are all categorical, such as month, day and year.
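
The data types can also be filtered programmatically with select_dtypes(). The sketch below uses a made-up data frame (`toy` is hypothetical) to show that selecting on type alone also returns integer columns such as year, so deciding which columns are truly quantitative still takes knowledge of the data.

```python
import pandas as pd

# Hypothetical toy data mimicking a few of the survey columns
toy = pd.DataFrame({
    'year': [1997, 1998],
    'sex': ['M', 'F'],
    'weight': [16.0, 22.0],
})

# Columns stored as floating point numbers
float_columns = toy.select_dtypes(include='float64').columns.tolist()

# All numeric columns, including integer-coded categories such as year
number_columns = toy.select_dtypes(include='number').columns.tolist()

print(float_columns)
print(number_columns)
```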

A scatter plot is the immediate choice for exploring pairwise relationships between variables. seaborn has a convenient scatter matrix function, pairplot(), for plotting the pairwise relationships between all numerical variables in the data frame.

In [47]:
# Since this plot creates so many graphical elements, the data set is subsampled to
# avoid waiting for the plot creation to finish. Setting `random_state` makes sure the
# same observations are sampled each time this is run.
surveys_sample = surveys.dropna().sample(1000, random_state=0)
sns.pairplot(surveys_sample)
Out[47]:
<seaborn.axisgrid.PairGrid at 0x7f50a4c3c1d0>
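
The effect of `random_state` can be checked with a short sketch on made-up data (`toy` is hypothetical): two calls with the same seed return the same rows, whereas calls without a seed are free to differ.

```python
import pandas as pd

# Hypothetical toy data with 100 numbered records
toy = pd.DataFrame({'record_id': range(100)})

# The same seed gives the same ten rows in the same order
first = toy.sample(10, random_state=0)
second = toy.sample(10, random_state=0)

print(first.equals(second))
```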