seaborngrids (20 min)
# Setup by importing the same data as in previous lecture import pandas as pd surveys = pd.read_csv("https://ndownloader.figshare.com/files/2292169") # or surveys = pd.read_csv("surveys.csv") if you saved it on your computer
There are many plotting packages in Python, making it possible to create diverse visualizations such as interactive web graphics, 3D animations, statistical visualizations, and map-based plots. A Google search for "Python graph gallery" or "Seaborn graph gallery" will turn up lots of examples of the diversity of plots that can be made.
Here, we will focus on two of the most useful for researchers:
matplotlib, which is a robust, detail-oriented, low level plotting interface, and
seaborn, which provides high level functions on top of
matplotlib and allows the plotting calls to be expressed more in terms what is being explored in the underlying data rather than what graphical elements to add to the plot.
Instead of instructing the computer to "go through a data frame and plot any observations of speciesX in blue, any observations of speciesY in red, etc", the
seaborn syntax is more similar to saying "color the data by species". Thanks to this functional way of interfacing with data, only minimal changes are required if the underlying data change or to switch the type of plot used for the visualization. It provides a language that facilitates thinking about data in ways that are conducive for exploratory analysis and allows for the creation of publication quality plots with minimal adjustments and tweaking.
The concepts of plotting with
seaborn were introduced briefly already in the first lecture. To make a plot of the number of observations for each species, first import the library and then use the
countplot() function. Before the first plot is created, the line
%matplotlib inline is used to specify that all plots should show up in the notebook instead of in a separate window.
import seaborn as sns sns.countplot(y='species', data=surveys)
<matplotlib.axes._subplots.AxesSubplot at 0x7f1efde4b908>
That's a lot of species... for convenience when introducing the following the plotting concept, the number of species will be limited to the four most abundant. To do this, the
value_counts() method can be used to find out how observations belong to each species. We could also get the same information using
merriami 10596 penicillatus 3123 ordii 3027 baileyi 2891 megalotis 2609 spectabilis 2504 torridus 2249 flavus 1597 eremicus 1299 albigula 1252 leucogaster 1006 maniculatus 899 harrisi 437 bilineata 303 spilosoma 248 hispidus 179 sp. 86 audubonii 75 fulvescens 75 brunneicapillus 50 taylori 46 ochrognathus 43 fulviventer 43 chlorurus 39 leucopus 36 squamata 16 melanocorys 13 intermedius 9 montanus 8 gramineus 8 undulatus 5 fuscus 5 leucophrys 2 savannarum 2 tigris 1 uniparens 1 scutalatus 1 tereticaudus 1 viridis 1 clarki 1 Name: species, dtype: int64
The top four species names could be manually typed in to filter the surveys data frame.
surveys.loc[(surveys['species'] == 'merriami') | (surveys['species'] == 'penicillatus') | (surveys['species'] == 'ordii') | (surveys['species'] == 'baileyi')].shape
Comparing this number with the number of rows in the original data frame shows that it was filtered successfully.
However, it is rather tedious to type out the species names by hand and to do one comparison per species. Instead, the names of the most abundand species can be extract by using
nlargest() and returning only the index (row names) of the top observations.
most_common_species = ( surveys['species'] .value_counts() .nlargest(4) .index ) most_common_species
Index(['merriami', 'penicillatus', 'ordii', 'baileyi'], dtype='object')
A subset can now be created from the data frame, including only those rows where the column 'species' matches any of the names in the
pandas has a special
isin() method for comparing a data frame column to an array-like object of names such as the index extracted above.
The number of observations is the same as with the manual filtering above, but this method is more succinct.
To facilitate plotting, any NA values will be dropped before assigning the resulting data frame to a new variable name.
surveys_common = surveys.loc[surveys['species'].isin(most_common_species)].dropna() surveys_common.shape
This abbreviated data frame can now be used for plotting.
<matplotlib.axes._subplots.AxesSubplot at 0x7f1efa925fd0>
That's more manageable! The text is a little small, which can be changed with the
set_context() function from
seaborn, using a number above
1 for the fontscale parameter. The context parameter changes the size of object in the plots, such as the line widths, and will be left as the default
notebook for now.
These option changes will apply to all plots made from now on. Think of it as changing a value in the options menu of a graphical software.
sns.set_context(context='notebook', font_scale=1.4) sns.countplot(y='species', data=surveys_common)
<matplotlib.axes._subplots.AxesSubplot at 0x7f1efa899da0>
To get a vertical plot, change
x. With long label names, horizontal plots can be easier to read.
<matplotlib.axes._subplots.AxesSubplot at 0x7f1efa7e7b00>
seaborn can do much more advanced visualizations than counting observations. Next, the relationship between one quantitative and one categorical valuable will be explored while stratifying the data based on its remaining categorical variables. To start, let's visualize summary statistics of the weight variable distribution for these fours species with a boxplot.
sns.boxplot(x='weight', y='species', data=surveys_common)
<matplotlib.axes._subplots.AxesSubplot at 0x7f1efa7b3d30>
The width of each box can be changed to make it look more appealing.
sns.boxplot(x='weight', y='species', data=surveys_common, width=0.4)
<matplotlib.axes._subplots.AxesSubplot at 0x7f1efa931588>
The syntax is very similar to that of
countplot(), but instead of just supplying one variable and asking
seaborn to count the observations of that variable, the xy-variables are the categorical groups (the species) and the measurement of interest (the weight).
The aim of a box plot is to display a few statistics from the underlying distribution of one quantitative variable between the values of one categorical variable (the y-axis). The inclusion of multiple distribution statistics facilitates the comparison of more than just the mean + standard deviation (or another single measure of central tendency and variation). The
seaborn box plots are so-called Tukey box plots by default, which means that the graphical elements correspond to the following statistics:
Most of these plot elements are configurable and could be set to represent different distribution statistics.
Another useful visualization for comparing distributions is the
violinplot. Again, the syntax is the same as before, just change the plot name.
sns.violinplot (x='weight', y='species', data=surveys_common)
<matplotlib.axes._subplots.AxesSubplot at 0x7f1efa717438>
Think of this plot as a smoothened version of the underlying histogram, that is then mirrored to give the violin-like shape. Where the violin is wider, there are more observations. The inner part is a boxplot with the median marked as a white dot. Comparisons with histograms and other distribution visualizations will be talked more about later in the workshop, but it is good to already keep in mind that although violinplots are great medium to large data sets, it can be misleading to use a smoothened distribution when there are very few observations, and it is probably better to show the individual data points instead of, or in addition to, the distribution plot.
The colors of the violins can be changed with the
sns.violinplot (x='weight', y='species', data=surveys_common, palette='Blues')
<matplotlib.axes._subplots.AxesSubplot at 0x7f1efa58a5c0>
color parameter can be used to set all violins in the same color.
sns.violinplot (x='weight', y='species', data=surveys_common, color='lightgrey')
<matplotlib.axes._subplots.AxesSubplot at 0x7f1ef9949080>
An example for when a violin plot can be more informative than a box plot is to detect multimodal distributions, which could indicate that multiple values from an underlying confounding variable has been grouped together. This can be observed when comparing the weight of all animals in each type of trap used to catch them (the
sns.boxplot(x='weight', y='plot_type', data=surveys_common)
<matplotlib.axes._subplots.AxesSubplot at 0x7f1ef9737128>
sns.violinplot(x='weight', y='plot_type', data=surveys_common)
<matplotlib.axes._subplots.AxesSubplot at 0x7f1ef96320f0>
From the violin plot, it appears that there could be multiple distributions grouped together within each type of trap. There seems to be one distribution centered around 20 grams for all traps and one distribution centered around 45 grams (or 30 grams for Long-term krat exclosure). These observations could indeed be from the same distribution, but often when there are multiple bumps like this, it is a good idea to explore other variables in the data set, and see if there is a confounding variable contributing to the multimodality of the violin plot.
Since there appear to be 2-3 bumps in the distributions, it would be good to find a categorical variable in the data frame that has around the same number of unique values. The
nunique() can count the number of unique values within each variable.
taxa 1 sex 2 genus 2 species_id 4 species 4 plot_type 5 month 12 plot_id 24 year 26 day 31 hindfoot_length 40 weight 69 record_id 18289 dtype: int64
There are a few candidate variables that have a suitable number of unique values. A very effective approach for exploring multiple categorical variables in a data set is to plot so-called "small multiples" of the data, where the same type of plot is used for different subsets of the data. These plots are drawn in rows and columns forming a grid pattern, and can be referred to as a "lattice", "facet", or "trellis" plot.
Visualizing categorical variables in this manner is a key step in exploratory data analysis, and thus
seaborn has a dedicated plot function for this, called
factorplot() (categorical variables are sometimes referred to as "factors"). This plot can be used to plot the same violin plot as before, and easily spread the variables across the rows and columns, e.g. for the variable
sns.factorplot(x='weight', y='plot_type', data=surveys_common, col='sex', kind='violin')
<seaborn.axisgrid.FacetGrid at 0x7f1ef9810470>
Sorting by the sex of the animal is probably not the most clever approach here since, the same sex from different species might very well have different weights. Let's try faceting the
genus variable on separate rows.
sns.factorplot(x='weight', y='plot_type', data=surveys_common, col='sex', row='genus', kind='violin', margin_titles=True)
<seaborn.axisgrid.FacetGrid at 0x7f1ef971b0f0>
There are certainly differences in weight between the two genera, but it appears that the data still is not split into unimodal distributions. A likely explanation could be that there are multiple species within each genus and the weight is species-dependent. Let's check how many species there are per genus and how many observations there are in each.
genus species Chaetodipus baileyi 2803 penicillatus 2969 Dipodomys merriami 9727 ordii 2790 dtype: int64
There are two species within each genus. If the mean weights for those species are different, it could indeed explain the additional bump in the Chaetodipus distributions above.
genus species Chaetodipus baileyi 31.739922 penicillatus 17.187942 Dipodomys merriami 43.136013 ordii 48.867384 Name: weight, dtype: float64
A factor plot with the column variable set to
'species' instead of
'genus' might be able to separate the distributions.
sns.factorplot(x='weight', y='plot_type', data=surveys_common, col='species', kind='violin')
<seaborn.axisgrid.FacetGrid at 0x7f1ef91d61d0>
That looks pretty good! The plot can be made more appealing by having two columns per row and making each plot a bit wider.
sns.factorplot(x='weight', y='plot_type', data=surveys_common, col='species', col_wrap=2, kind='violin', aspect=1.4)
<seaborn.axisgrid.FacetGrid at 0x7f1ef9558860>
This is great, much of the variation in the weight data can be explained by the species observed. The only species where there still appears to be multimodal distributions (and thus possibly a confounding variable, is within "baileyi" (and potentially "ordii"), especially for the "Spectab exclosure". The "sex" variable was used in a previous plot, but it was never explored within tin each species. It is common with sexual dimorphism within a species, and this could include weight differences.
sns.factorplot(x='weight', y='plot_type', hue='sex', data=surveys_common, col='species', col_wrap=2, kind='violin', aspect=1.4)
<seaborn.axisgrid.FacetGrid at 0x7f1ef9558fd0>
It does indeed appear that there is a difference in mean and distribution between the sexes within the species "baileyi". Minor differences between the sexes within other species are also visible now although they were not big enough to show up in the initial violinplot. As a final beautification of this plot, the violins can be split down the middle to reduce clutter.
sns.factorplot(x='weight', y='plot_type', hue='sex', data=surveys_common, col='species', col_wrap=2, kind='violin', aspect=1.4, split=True)
<seaborn.axisgrid.FacetGrid at 0x7f1ef90eba20>
This clearly delivers the message and is easy to understand. One thing to notice here though is that the
plot_type variable doesn't look like it has much effect on the weights.
- Create a violin plot of species vs weight, split down the middle by sex.
- (Bonus) add the fifth and sixth most-common species to the plot.
A great aspect of the
factorplot() function is that if there is a change of minds (or hearts) in what type of visualization to use, only minor modifications are needed to completely change the plot appearance. For example, plotting the mean and 95% CI, requires changing a couple of parameters to make the plot look good, but the code is largely identical.
sns.factorplot(x='weight', y='plot_type', hue='sex', data=surveys_common, col='species', col_wrap=2, kind='point', aspect=1.4, join=False, dodge=1.25)
<seaborn.axisgrid.FacetGrid at 0x7f1ef33c8780>
factorplot() facilitates the representation of variables within data as different elements in the plot, such as the rows, column, x-axis positions, and colors. There is a great description on this in the
It is important to choose how variables get mapped to the plot structure such that the most important comparisons are easiest to make. As a general rule, it is easier to compare positions that are closer together, so the
huevariable should be used for the most important comparisons. For secondary comparisons, try to share the quantitative axis (so, use
colfor vertical plots and
rowfor horizontal plots). Note that, although it is possible to make rather complex plots using this function, in many cases you may be better served by created several smaller and more focused plots than by trying to stuff many comparisons into one figure
We talk more about the advantages and drawbacks of creating complex visualization with
factorplot() in the second part of the challenge.
- Explore the same relationship as in the last violin plot, but visualize the results using a boxplot instead of a violinplot. Which variable(s) did you need to change and why?
- (Bonus) Create a grid of countplots comparing the number of observations between sexes across months (with months on the x-axis for each plot). Create facets for each species and each plot_type. Is this a good plot for data explorations? What about for publication?
In the last section, one quantitative variable was visualized across one or more categorical variables. Here, the relationship between two quantitative variables will be explored while stratifying the data based on its remaining categorical variables. First, reexamine the variables and their data types using the entire
surveys data frame.
<class 'pandas.core.frame.DataFrame'> RangeIndex: 34786 entries, 0 to 34785 Data columns (total 13 columns): record_id 34786 non-null int64 month 34786 non-null int64 day 34786 non-null int64 year 34786 non-null int64 plot_id 34786 non-null int64 species_id 34786 non-null object sex 33038 non-null object hindfoot_length 31438 non-null float64 weight 32283 non-null float64 genus 34786 non-null object species 34786 non-null object taxa 34786 non-null object plot_type 34786 non-null object dtypes: float64(2), int64(5), object(6) memory usage: 3.5+ MB
The only two quantitative continuous variables are "weight" and "hindfoot_length". Although some of the others are integers, they are all categorical, such as month, day and year.
A scatter plot is the immediate choice for exploring pairwise relationships between continuous variables.
seaborn has a convenient scatter plot matrix function,
pairplot(), for plotting the pairwise relationships between all numerical variables in the data frame.
# Since this plot creates so many graphical elemments, the data set is subsampled to # avoid waiting for the plot creation to finish. Setting `random_state` makes sure the # same observations are sampled each time this is run. surveys_sample = surveys.dropna().sample(1000, random_state=0) sns.pairplot(surveys_sample)
<seaborn.axisgrid.PairGrid at 0x7f1ef26f8588>