#!/usr/bin/env python
# coding: utf-8

# ## Lesson preamble
# 
# ### Lecture objectives
# 
# - Learn about tidy data.
# - Transform data from the long to wide format.
# - Understand which types of figures are suitable to create from raw data.
# - Learn how to avoid common pitfalls when plotting large data sets.
# 
# ### Lecture outline
# 
# - Reshaping data with `pivot()`, `pivot_table()`, and `melt()` (40 min)
# - Visualization tips and tricks
#     - Changing plot appearance with `matplotlib` (35 min)
#     - Avoiding saturated plots (40 min)
#     - Choosing informative plots for categorical data (35 min)
#     - Making plots accessible through suitable color choices (10 min)

# In[1]:


# Setup by loading the data set from the previous lecture
import pandas as pd

# If you don't have the dataset 
#surveys = pd.read_csv('https://ndownloader.figshare.com/files/2292169')

# If you have already downloaded the dataset
surveys = pd.read_csv('./surveys.csv')

surveys.head()


# ## Reshaping data between long and wide formats
# 
# Data is often presented in a so-called wide format, e.g. with one column per measurement:
# 
# |person|weight|height|age|
# |------|------|------|---|
# |A|70|170|32|
# |B|85|179|28|
# 
# This can be a great way to display data so that it is easily interpretable by humans and is often used for summary statistics (commonly referred to as pivot tables). However, many data analysis functions in `pandas`, `seaborn` and other packages are optimized to work with the tidy data format. Tidy data is a long format where each row is a single observation and each column contains a single variable:
# 
# |person|measure|value|
# |------|-----------|-----|
# |     A|     weight|   70|
# |     A|     height|  170|
# |     A|        age|   32|
# |     B|     weight|   85|
# |     B|     height|  179|
# |     B|        age|   28|
# 
# `pandas` enables a wide range of manipulations of the structure of data, including alternating between the long and wide format. The survey data presented here is in a tidy format. To facilitate visual comparisons of the relationships between measurements across columns, it would be beneficial to display this data in the wide format. For example, what is the relationship between mean weights of different species caught at the same plot type?
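# 
# As a minimal sketch of the two layouts, the toy table above can be built by hand and reshaped from wide to long. The name `people_wide` and its values are just the illustrative numbers from the tables above, not part of the survey data.

```python
import pandas as pd

# Wide format: one row per person, one column per measurement
people_wide = pd.DataFrame({
    'person': ['A', 'B'],
    'weight': [70, 85],
    'height': [170, 179],
    'age': [32, 28],
})

# Long (tidy) format: one row per person-measurement combination
people_long = people_wide.melt(id_vars='person', var_name='measure', value_name='value')
people_long
```

# Note that the rows of `people_long` come out grouped by measure rather than by person, so the order differs from the tidy table shown above even though the content is the same.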
# 
# ### Subset data
# 
# To facilitate the visualization of the transformations between wide and tidy data, it is beneficial to create a subset of the data.

# In[2]:


species_sub = ['albigula', 'flavus', 'merriami']
col_sub = ['record_id', 'species', 'weight', 'plot_type']
surveys_sub = surveys.loc[surveys['species'].isin(species_sub), col_sub]
surveys_sub.head()


# In[3]:


surveys_sub.info()


# ### Long to wide with `pivot()` and `pivot_table()`
# 
# A long to wide transformation would be suitable to effectively visualize the relationship between the mean body weights of each species within the different plot types used to trap the animals. The first step in creating this table is to compute the mean weight for each species in each plot type.

# In[4]:


surveys_sub_gsp = (
    surveys_sub
        .groupby(['species', 'plot_type'])['weight']
        .mean()
        .reset_index()
)
surveys_sub_gsp


# To remove the repeating information for `species` and `plot_type`, this table can be pivoted into a wide format using the `pivot()` method. The arguments passed to `pivot()` include the rows (the index), the columns, and which values should populate the table.

# In[5]:


surveys_sub_gsp.pivot(index='plot_type', columns='species', values='weight')


# Compare how this table is displayed with the table in the previous cell. It is certainly easier to spot differences between the species and plot types in this wide format.
# 
# Since presenting summary statistics in a wide format is such a common operation, `pandas` has a dedicated method, `pivot_table()`, that performs both the data aggregation and pivoting.

# In[6]:


surveys_sub.pivot_table(index='plot_type', columns='species', values='weight')


# Although `pivot_table()` is the most convenient way to aggregate *and* pivot data, `pivot()` is still useful to reshape a data frame from long to wide *without* performing aggregation.
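# 
# A minimal sketch of the practical difference, using a small hypothetical table (`dup` and its values are made up for illustration): when the same index/column pair occurs more than once, `pivot()` raises a `ValueError`, whereas `pivot_table()` aggregates the duplicates.

```python
import pandas as pd

# Hypothetical long table with two weights recorded for the same species and plot type
dup = pd.DataFrame({
    'plot_type': ['Control', 'Control', 'Spectab exclosure'],
    'species': ['flavus', 'flavus', 'flavus'],
    'weight': [8.0, 10.0, 7.0],
})

try:
    dup.pivot(index='plot_type', columns='species', values='weight')
    pivot_failed = False
except ValueError:
    # pivot() cannot decide which of the duplicated values to keep
    pivot_failed = True

# pivot_table() aggregates the duplicates (the mean by default)
dup_wide = dup.pivot_table(index='plot_type', columns='species', values='weight')
dup_wide
```

# This is why the mean was computed with `groupby()` before calling `pivot()` above: after aggregation, each species and plot type combination occurs only once.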
# 
# With the data in a wide format, the pairwise correlations between the columns can be computed using `corr()`.

# In[7]:


surveys_sub_pvt = surveys_sub.pivot_table(index='plot_type', columns='species', values='weight')
surveys_sub_pvt.corr()


# The columns and rows can be swapped in the call to `pivot_table()`. This is useful both to present the table differently and to perform computations on a different axis (dimension) of the data frame (this result can also be obtained by calling the `transpose()` method of `surveys_sub_pvt`).

# In[8]:


surveys_sub.pivot_table(index='species', columns='plot_type', values='weight')
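

# As a quick check of the equivalence between swapping the arguments and transposing, here is a sketch on a small hypothetical stand-in for the survey subset (`toy` and its values are made up for illustration).

```python
import pandas as pd

# Hypothetical stand-in for the survey subset
toy = pd.DataFrame({
    'species': ['albigula', 'albigula', 'flavus', 'flavus'],
    'plot_type': ['Control', 'Rodent Exclosure', 'Control', 'Rodent Exclosure'],
    'weight': [150.0, 160.0, 8.0, 9.0],
})

by_plot = toy.pivot_table(index='plot_type', columns='species', values='weight')
by_species = toy.pivot_table(index='species', columns='plot_type', values='weight')

# Transposing one table should give the other
same = by_plot.transpose().equals(by_species)
same
```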


# With `pivot_table()` it is also possible to add margins (the aggregate over all rows and over all columns, labeled `All`), and to change the aggregation function.

# In[9]:


surveys_sub.pivot_table(index='plot_type', columns='species', values='weight', margins=True, aggfunc='median')


# ### Wide to long with `melt()`
# 
# It is also a common operation to reshape data from the wide to the long format, e.g. when getting the data into the most suitable format for analysis. For this transformation, the `melt()` method can be used to sweep up a set of columns into one key-value pair.
# 
# To prepare the data frame, the `plot_type` index name can be moved to a column name with the `reset_index()` method.

# In[10]:


surveys_sub_pvt


# In[11]:


surveys_sub_pvt = surveys_sub_pvt.reset_index()
surveys_sub_pvt


# At a minimum, `melt()` only requires the name of the column that should be kept intact. All remaining columns will have their values in the `value` column and their names in the `variable` column (here, the columns index already has the name "species", so this is used automatically instead of "variable").

# In[12]:


surveys_sub_pvt.melt(id_vars='plot_type')


# To be more explicit, all the arguments to `melt()` can be specified. This way it is also possible to exclude some columns, e.g. the species 'merriami'.

# In[13]:


surveys_sub_pvt.melt(id_vars='plot_type', value_vars=['albigula', 'flavus'], 
                     var_name='species', value_name='weight')


# >#### Challenge 1
# >
# >1. Make a wide data frame with `year` as columns, `plot_id` as rows, where the values are the number of genera per plot. *Hint* Remember `nunique()` from the last lecture. You will also need to reset the index before pivoting.
# >
# >2. Now take that data frame, and make it long again, so each row is a unique `plot_id` - `year` combination.

# # Visualization tips and tricks

# ## Changing plot appearance with `matplotlib`
# 
# The knowledge of how to make an appealing and informative visualization can be put into practice by working directly with `matplotlib`, and styling different elements of the plot. The high-level figures created by `seaborn` can also be configured via the `matplotlib` parameters, so learning more about them will be very useful.
# 
# As demonstrated previously with `FacetGrid`, one way of creating a line plot is by using the `plot()` function from `matplotlib.pyplot`. To facilitate the understanding of plotting concepts, the initial examples here will not include data frames, but instead have simple lists holding just a few data points.

# In[14]:


get_ipython().run_line_magic('matplotlib', 'inline')


# In[15]:


import matplotlib.pyplot as plt

x = [1, 2, 3, 4]
y = [1, 2, 4, 3]
plt.plot(x, y)


# However, this way of plotting is not very explicit and some configuration is outside our control, e.g. a figure is automatically created and it is assumed that the plot should go into the currently active region of this figure. This gives little control over exactly where to place the plots within a figure and how to modify the plot after creating it, e.g. adding a title or labeling the axes. For these operations, it is easier to use the object-oriented plotting interface, where an empty figure and its axes are created initially. This figure and its axes are assigned to variable names which are then explicitly used for plotting. In `matplotlib`, an axes refers to what you would often call a subplot colloquially and it is named "axes" because it consists of an x-axis and a y-axis by default. Calling `subplots()` without arguments creates a figure with a single empty axes.

# In[16]:


fig, ax = plt.subplots()


# Calling `subplots()` returns two objects, the figure and its axes. Plots can be added to the axes of the figure using the name of the returned axes object (here `ax`).

# In[17]:


fig, ax = plt.subplots()
ax.plot(x, y)


# To create a scatter plot, use `ax.scatter()` instead of `ax.plot()`.

# In[18]:


fig, ax = plt.subplots()
ax.scatter(x, y)


# Plots can also be combined in the same axes. The line style and marker color can be changed to facilitate viewing the elements in the combined plot.

# In[19]:


fig, ax = plt.subplots()
ax.scatter(x, y, color='red')
ax.plot(x, y, linestyle='dashed')


# And plot elements can be resized.

# In[20]:


fig, ax = plt.subplots()
ax.scatter(x, y, color='red', s=100)
ax.plot(x, y, linestyle='dashed', linewidth=3)


# It is common to modify the plot after creating it, e.g. adding a title or labeling the axes.

# In[21]:


fig, ax = plt.subplots()
ax.scatter(x, y, color='red')
ax.plot(x, y, linestyle='dashed')

ax.set_title('Line and scatter plot')
ax.set_xlabel('Measurement X')


# The scatter and line plot can easily be separated into two subplots within the same figure. Instead of assigning a single returned axes to `ax`, the two returned axes objects are assigned to `ax1` and `ax2` respectively.

# In[22]:


fig, (ax1, ax2) = plt.subplots(1, 2)
# The default is (1, 1), that's why it does not need
# to be specified with only one subplot


# To prevent plot elements, such as the axis tick labels, from overlapping, the `tight_layout()` method can be used.

# In[23]:


fig, (ax1, ax2) = plt.subplots(1, 2)
fig.tight_layout()


# The figure size can easily be controlled when it is created.

# In[24]:


fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 4)) # This refers to the size of the figure in inches when printed or in a PDF
fig.tight_layout()


# Putting it all together to separate the line and scatter plot.

# In[25]:


fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 4))
ax1.scatter(x, y, color='red')
ax2.plot(x, y, linestyle='dashed')

ax1.set_title('Scatter plot')
ax2.set_title('Line plot')
fig.tight_layout()


# > #### Challenge 2
# > 
# > 1. There are a plethora of colors available to use in `matplotlib`. Change the color of the line and the dots in the figure using [your favorite color from this list](https://stackoverflow.com/a/37232760/2166823).
# > 2. Use the documentation to also change the styling of the line in the line plot and the type of marker used in the scatter plot (you might need to search online for this).

# ### Saving graphs
# 
# Figures can be saved by calling the `savefig()` method and specifying the name of the file to create. The resolution of the figure can be controlled by the `dpi` parameter.

# In[26]:


fig.savefig('scatter-and-line.png', dpi=300)


# A PDF file can be saved by changing the extension in the specified file name. Since PDF is a vector file format, there is no need to specify the resolution.

# In[27]:


fig.savefig('scatter-and-line.pdf')


# This concludes the customization section. The concepts taught here will be applied in the next section on how to choose a suitable plot type for data sets with many observations.

# ## Avoiding saturated plots
# 
# Summary plots (especially bar plots) were previously mentioned to be potentially misleading, and it is often most appropriate to show every individual observation with a dot plot or the like, perhaps combined with summary markers where appropriate. But what if the data set is too big to visualize every single observation? In large data sets, it is often the case that plotting each individual observation would oversaturate the chart.
# 
# To illustrate saturation and how it can be avoided, load the `diamonds` dataset from the R sample data set repository:

# In[28]:


diamonds = pd.read_csv('https://vincentarelbundock.github.io/Rdatasets/csv/ggplot2/diamonds.csv', index_col=0)
diamonds.head()


# In[29]:


diamonds.info()


# When plotting a data frame, `matplotlib` plotting functions can be made aware of the structure of the data by specifying the `data` parameter and the `x` and `y` parameters can then be specified just by passing the name of a column in the data frame as a string.

# In[30]:


fig, ax = plt.subplots()
ax.scatter('carat', 'price', data=diamonds)


# Because this is a dataset with 53,940 observations, visualizing it in two dimensions creates a graph that is incredibly oversaturated. Oversaturated graphs make it *far more* difficult to glean information from the visualization. Maybe adjusting the size of each observation could help?

# In[31]:


fig, ax = plt.subplots()
ax.scatter('carat', 'price', data=diamonds, s=1)


# That's a bit better. Increasing the transparency (lowering `alpha`) might help further.

# In[32]:


fig, ax = plt.subplots()
ax.scatter('carat', 'price', data=diamonds, s=1, alpha=0.1)


# This is clearer than initially, but still does not reveal the full structure of the underlying data. Before proceeding, add a title and axis labels.

# In[33]:


fig, ax = plt.subplots()
ax.scatter('carat', 'price', data=diamonds, s=1, alpha=0.1)

ax.set_title('Diamond prices')
ax.set_xlabel('Carat')
ax.set_ylabel('Price')


# The font sizes of the labels and title are a bit small. They could be resized separately, but the easiest way to change all of them is with the previously used `set_context()` function from `seaborn`. Here, `despine()` is also used to improve the visual appeal of the plot by removing the top and right axis lines (spines).

# In[34]:


import seaborn as sns

sns.set_context('notebook', font_scale=1.3) # Increase all font sizes

fig, ax = plt.subplots()
ax.scatter('carat', 'price', data=diamonds, s=1, alpha=0.1)
sns.despine()

ax.set_title('Diamond prices')
ax.set_xlabel('Carat')
ax.set_ylabel('Price')
# sns.despine() essentially does the following:
# ax.spines['top'].set_visible(False)
# ax.spines['right'].set_visible(False)


# The x- and y-axis limits can be adjusted to zoom in on the denser areas of the plot.

# In[35]:


fig, ax = plt.subplots()
ax.scatter('carat', 'price', data=diamonds, s=1, alpha=0.1)
sns.despine()

ax.set_title('Diamond prices')
ax.set_xlabel('Carat')
ax.set_ylabel('Price')
ax.set_xlim(0.2, 1.2)
ax.set_ylim(200, 4000)


# The result is still not satisfactory, which illustrates that a scatter plot is simply not a good choice for huge data sets. A more suitable plot type for this data is a so-called `hexbin` plot, which is essentially a two-dimensional histogram where the color of each hexagonal bin represents the number of observations in that bin (analogous to the height in a one-dimensional histogram).

# In[36]:


fig, ax = plt.subplots()
ax.hexbin('carat', 'price', data=diamonds)


# This looks ugly because the bins with zero observations are still colored. This can be avoided by setting the minimum count of observations to color a bin.

# In[37]:


fig, ax = plt.subplots()
ax.hexbin('carat', 'price', data=diamonds, mincnt=1)


# The distribution of the data is now more akin to that of the scatter plot. To know what the different colors represent, a colorbar needs to be added to this plot. The space for the colorbar will be taken from a plot in the current figure.

# In[38]:


fig, ax = plt.subplots()
# Assign to a variable to reuse with the colorbar
hex_plot = ax.hexbin('carat', 'price', data=diamonds, mincnt=1)
# Create the colorbar from the hexbin plot axis
cax = fig.colorbar(hex_plot)


# Notice that the overall figure is the same size, and the axes that contains the hexbin plot shrank to make room for the colorbar. To remind ourselves what is plotted, axis labels can be added like previously.

# In[39]:


fig, ax = plt.subplots()
hex_plot = ax.hexbin('carat', 'price', data=diamonds, mincnt=1, gridsize=50)
sns.despine()
cax = fig.colorbar(hex_plot)

ax.set_title('Diamond prices')
ax.set_xlabel('Carat')
ax.set_ylabel('Price')
cax.set_label('Number of observations')


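# It is now clear that the yellow area represents over 2000 observations! To take a closer look at this dense region, the data can be subset to the smaller, lower-priced diamonds before plotting.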

# In[40]:


diamonds_subset = diamonds.loc[(diamonds['carat'] < 1.3) & (diamonds['price'] < 2500)]

fig, ax = plt.subplots()
hexbin = ax.hexbin('carat', 'price', data=diamonds_subset, mincnt=1)
sns.despine()
cax = fig.colorbar(hexbin)

cax.set_label('Number of observations')
ax.set_title('Diamond prices')
ax.set_xlabel('Carat')
ax.set_ylabel('Price')


# Although this hexbin plot is a great way of visualizing the distributions, it could be valuable to compare it to the histograms for each of the plotted variables.

# In[41]:


fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
fig.suptitle('Distribution plots', y=1.05)
sns.despine()

ax1.hist('carat', bins=30, data=diamonds) 
ax1.set_title('Diamond weight')
ax1.set_xlabel('Carat')

ax2.hist('price', bins=30, data=diamonds) 
ax2.set_title('Diamond price')
ax2.set_xlabel('USD')

fig.tight_layout()


# Since visualizing two individual 1D distributions together with their joint 2D distribution is a common operation, `seaborn` has a built-in function to create a hexbin plot with histograms on the marginal axes.

# In[42]:


sns.jointplot(x='carat', y='price', data=diamonds, kind='hex')


# This can be customized to appear more like the previous hexbin plots. Since `jointplot()` deals with both the hexbin and the histograms, the parameter names must be separated so that it is clear which plot they are referring to. This is done by passing them as dictionaries to the `joint_kws` and `marginal_kws` parameters ("kws" stands for "keywords").

# In[43]:


sns.jointplot(x='carat', y='price', data=diamonds, kind='hex', 
              joint_kws={'cmap':'viridis', 'mincnt':1},
              marginal_kws={'color': 'indigo'})


# ## Choosing informative plots for categorical data
# 
# When visualizing data it is important to explore different plotting options and reflect on which one best conveys the information within the data. In the following code cells, a sample data set is loaded from the `seaborn` data library in order to illustrate some advantages and disadvantages between categorical plot types. This is the same data as was used in lecture 1 and contains three different species of iris flowers and measurements of their sepals and petals.

# In[44]:


import seaborn as sns

iris = sns.load_dataset('iris')
iris.groupby('species').mean()


# ### Bar plots
# 
# A common visualization when comparing groups is to create a barplot of the means of each group and plot them next to each other.

# In[45]:


sns.barplot(x='species', y='sepal_length', data=iris)


# This barplot shows the mean and the 95% confidence interval. Individual plotting functions in `seaborn` return an axes with the plotted elements. This returned axes object can be assigned to a variable name and customized just as previously in this lecture.

# In[46]:


ax = sns.barplot(x='species', y='sepal_length', data=iris)

ax.set_ylim(0, 10)
ax.set_ylabel('Sepal Length')
ax.set_xlabel('')
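

# The 95% confidence interval drawn on the bars is estimated by `seaborn` through bootstrapping by default. The sketch below illustrates the idea with `numpy` on a handful of hypothetical measurements; it is a simplified illustration, not `seaborn`'s actual implementation, and the names `values` and `boot_means` are made up for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical measurements standing in for one group (e.g. one species)
values = np.array([5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9])

# Resample with replacement many times and record the mean of each resample
boot_means = np.array([
    rng.choice(values, size=values.size, replace=True).mean()
    for _ in range(1000)
])

# The middle 95% of the bootstrapped means forms the confidence interval
ci_low, ci_high = np.percentile(boot_means, [2.5, 97.5])
mean = values.mean()
print(f'mean = {mean:.2f}, 95% CI = [{ci_low:.2f}, {ci_high:.2f}]')
```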


# Since the `seaborn` plotting functions return a `matplotlib` axes object, these can be used with any `matplotlib` functions. For example, by creating a figure using `subplots()`, the `seaborn` plotting functions can be arranged as subplots in a grid. The syntax is slightly different from doing this with functions that are native to `matplotlib`, and the axes in which the `seaborn` function will plot needs to be specified with the `ax` parameter.

# In[47]:


fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, figsize=(10, 8), sharex=True)
fig.suptitle('Visualization comparison', y=1.02) # `y` is used to place the title a little bit higher up
sns.despine()

sns.barplot(x='species', y='sepal_length', data=iris, ax=ax1)
sns.boxplot(x='species', y='sepal_length', data=iris, ax=ax2)
sns.violinplot(x='species', y='sepal_length', data=iris, ax=ax3)
sns.swarmplot(x='species', y='sepal_length', data=iris, ax=ax4)

ax1.set_xlabel('')
ax2.set_xlabel('')
ax2.set_ylabel('')
ax3.set_xlabel('')
ax4.set_xlabel('')
ax4.set_ylabel('')

fig.tight_layout()


# >#### Challenge 3
# >
# >1. How many data points and/or distribution statistics are displayed in each of these plots?
# >2. Out of these plots, which one do you think is the most informative and why? Which is the most true to the underlying data?

# ### Pros and cons of different graph types
# 
# We will deepen the discussion around some of these ideas, in the context of the following plot:
# 
# ![*Reproduced with permission from [Dr. Koyama's poster](http://biostat.mc.vanderbilt.edu/wiki/pub/Main/TatsukiKoyama/Poster3.pdf)*](./img/dynamite-bars.png)
# 
# *Reproduced with permission from [Dr. Koyama's poster](http://biostat.mc.vanderbilt.edu/wiki/pub/Main/TatsukiKoyama/Poster3.pdf)*
# 
# It is generally advisable to avoid "decorative" plot elements that do not convey extra information about the data, *especially* when such elements hide the real data. An early champion of this idea was Edward Tufte, who details how to reduce so-called non-data ink and many other things in his book [The visual display of quantitative information](https://www.edwardtufte.com/tufte/books_vdqi). In the bar chart above, the only relevant information is given by where the rectangles of the bars end on the y-axis; the rest is unnecessary. Instead of using the rectangle's height, a simpler marker (circle, square, etc.) could have been used to indicate the height on the y-axis. Note that the body of the rectangle is not representative of where the data lie: there are probably no data points close to 0, and several above the rectangle.

# Barplots are especially misleading when used as data summaries, as in the
# example above. In a summary plot, only two distribution parameters (a measure of
# central tendency, e.g. the mean, and error, e.g. the standard deviation or a
# confidence interval) are displayed, instead of showing all the individual data
# points. This can be highly misleading, since different underlying distributions
# can give rise to the same summary plot. We also have no idea of how many observations there are in each group. These
# shortcomings become evident when comparing the barplot to the underlying
# distributions that were used to create them:
# 
# ![*Reproduced with permission from [Dr. Koyama's poster](http://biostat.mc.vanderbilt.edu/wiki/pub/Main/TatsukiKoyama/Poster3.pdf)*](./img/dynamite-vs-dists.png)
# 
# *Reproduced with permission from [Dr. Koyama's poster](http://biostat.mc.vanderbilt.edu/wiki/pub/Main/TatsukiKoyama/Poster3.pdf)*
# 
# Immediately, you can see that conclusions drawn from the barplot, such as that A
# and B have the same outcome, are factually incorrect. The distribution in D is
# bimodal, so representing it with a mean would be like observing black and
# white birds and concluding that the average bird color is grey; it's nonsensical.
# If we had planned our follow-up experiments based on the barplot alone,
# we would have been setting ourselves up for failure! Always be sceptical when
# you see a barplot in a published paper, and think of how the underlying
# distribution might look (note that barplots are more acceptable when used to
# represent counts, proportions or percentages, where there is only one data point
# per group in the data set).
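# 
# The point that different distributions can share the same summary can also be checked numerically. The sketch below builds two hypothetical samples (randomly generated, not the data from the poster) that have the same mean and standard deviation but very different shapes.

```python
import numpy as np

rng = np.random.default_rng(1)

def standardize(sample):
    """Rescale a sample to mean 0 and standard deviation 1."""
    return (sample - sample.mean()) / sample.std()

# Two hypothetical groups with very different shapes
unimodal = standardize(rng.normal(size=200))
bimodal = standardize(np.concatenate([rng.normal(-3, 0.5, 100),
                                      rng.normal(3, 0.5, 100)]))

# Both groups share the same mean and standard deviation,
# so a bar with an error bar would look the same for both
same_mean = np.isclose(unimodal.mean(), bimodal.mean())
same_sd = np.isclose(unimodal.std(), bimodal.std())

# Fraction of observations that lie close to the shared mean
near_mean_unimodal = np.mean(np.abs(unimodal) < 0.5)
near_mean_bimodal = np.mean(np.abs(bimodal) < 0.5)
```

# Comparing `near_mean_unimodal` with `near_mean_bimodal` shows how differently the observations are spread around the shared mean, which is information a bar with an error bar cannot convey.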
# 
# Boxplots and violin plots are more meaningful data summaries as they represent more than just two distribution parameters (such as mean +/- sd). However, these can still be misleading and it is often most appropriate to show each individual observation with a dot/hive/swarm plot, possibly combined with a superimposed summary plot or a marker for the mean or median *if* this additional information is useful. One exception, when it is not advisable to show all data points, is when the data set is gigantic and plotting each individual observation would oversaturate the chart. In that case, plot summary statistics or a 2D histogram (such as the hexbin plots shown earlier in this lecture).
# 
# Here is an example of how a violinplot can be combined together with the individual observations in `seaborn`.

# In[48]:


fig, ax = plt.subplots()
sns.despine()

sns.violinplot(x='species', y='sepal_length', data=iris,
               color='lightgrey', inner=None, ax=ax)
sns.swarmplot(x='species', y='sepal_length', data=iris,
              ax=ax) 
ax.set_ylabel('Sepal Length')
ax.set_xlabel('')


# Plotting elements have a default order in which they appear. This can be changed explicitly via the `zorder` parameter.

# In[49]:


fig, ax = plt.subplots()
sns.despine()

sns.violinplot(x='species', y='sepal_length', data=iris,
               color='lightgrey', inner=None, ax=ax, zorder=10)
sns.swarmplot(x='species', y='sepal_length', data=iris,
              ax=ax, zorder=0) 
ax.set_ylabel('Sepal Length')
ax.set_xlabel('')


# >#### Challenge 4
# >
# >1. So far, we've looked at the distribution of sepal length within species.  Try making a new plot to explore the distribution of another variable within each species.
# >2. Combine a `stripplot()` with a `boxplot()`. Set the `jitter` parameter to distribute the dots so that they are not all on one line.

# ## Making plots accessible through suitable color choices
# 
# Colour blindness is common in the population, and red-green colour blindness in particular affects 8% of men and 0.5% of women. Guidelines for making your visualizations more accessible to people affected by colour blindness will in many cases also improve the interpretability of your graphs for people who have standard colour vision. Here are a couple of examples:
# 
# Don't use jet rainbow-coloured heatmaps. Jet colourmaps are often the default heatmap used in many visualization packages (you've probably seen them before). 
# 
# ![](./img/heatmap.png)
# 
# Colour blind viewers are going to have a difficult time distinguishing the meaning of this heat map if some of the colours blend together.
# 
# ![](./img/colourblind.png)

# The jet colormap should be avoided for other reasons, including that the sharp transitions between colors introduce visual threshold levels that do not represent the underlying continuous data. Another issue is luminance, or brightness. For example, your eye is drawn to the yellow and cyan regions, because the luminance is higher. This can have the unfortunate effect of highlighting features in your data that don't actually exist, misleading your viewers! It also means that your graph is not going to translate well to greyscale in publication format.
# 
# More details about jet can be found in [this blog post](https://jakevdp.github.io/blog/2014/10/16/how-bad-is-your-colormap/) and [this series of posts](https://mycarta.wordpress.com/2012/05/12/the-rainbow-is-dead-long-live-the-rainbow-part-1/). In general, when presenting continuous data, a perceptually uniform colormap is often the most suitable choice. This type of colormap ensures that equal steps in data are perceived as equal steps in color space. The human brain perceives changes in lightness as changes in the data much better than, for example, changes in hue. Therefore, colormaps which have monotonically increasing lightness through the colormap will be better interpreted by the viewer. More details and examples of such colormaps are available in the [`matplotlib` documentation](http://matplotlib.org/users/colormaps.html), and many of the core design principles are outlined in [this entertaining talk](https://www.youtube.com/watch?v=xAoljeRJ3lU).
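# 
# As a rough sketch of what "monotonically increasing lightness" means, the lightness of a colormap can be approximated by a weighted sum of its red, green and blue channels and inspected along the colormap. The weights below are the Rec. 709 luma coefficients, used here as a simplifying assumption; a proper analysis would use a perceptual color space. The helper `approx_lightness` is a hypothetical name introduced for this example.

```python
import numpy as np
import matplotlib.pyplot as plt

def approx_lightness(cmap_name, n=256):
    """Rough lightness estimate: weighted sum of the RGB channels."""
    rgb = plt.get_cmap(cmap_name)(np.linspace(0, 1, n))[:, :3]
    return rgb @ np.array([0.2126, 0.7152, 0.0722])

for name in ['viridis', 'jet']:
    steps = np.diff(approx_lightness(name))
    print(name, 'never decreases in lightness:', bool(np.all(steps >= 0)))
```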
# 
# The default colormap in `matplotlib` is `viridis`, which was designed to have monotonically increasing lightness throughout. There is also `cividis`, which is designed to look the same to people with common forms of colorblindness as to people without colorblindness. Heatmaps are a good example of where color matters.

# In[50]:


fig, ax = plt.subplots(figsize=(3, 6))
iris_heat = iris.iloc[:, :4].sample(10, random_state=0)
sns.heatmap(iris_heat, cmap='cividis', ax=ax)


# Heatmaps are great when visualizing correlation matrices. Since correlations range from -1 to 1, it is suitable to use a diverging colormap with two distinct colors centered around 0.

# In[51]:


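# Drop the non-numeric `species` column, which `corr()` cannot handle in recent pandas versions
iris.drop(columns='species').corr()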


# In[52]:


# Drop the non-numeric `species` column before computing the correlations
sns.heatmap(iris.drop(columns='species').corr(), center=0, cmap='vlag')


# In a correlation matrix, the diagonal is a column's correlation with itself, so it is always perfect (1). The same values are mirrored above and below the diagonal.
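# 
# Both properties can be verified with a short sketch on hypothetical random data (the data frame `toy` below is made up for illustration and is unrelated to the iris data).

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
# Hypothetical numeric data with three columns
toy = pd.DataFrame(rng.normal(size=(50, 3)), columns=['a', 'b', 'c'])
corr = toy.corr()

# Each column correlates perfectly with itself
diagonal_is_one = np.allclose(np.diag(corr), 1.0)
# The matrix equals its own transpose, i.e. it is mirrored around the diagonal
is_symmetric = np.allclose(corr, corr.T)
```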

# Another approach to improve visualization clarity is to use different symbols for the groups and to change the color palette to one specifically designed to work well for common colorblindness.

# In[53]:


# To see all available palettes, set it to an empty string and view the error message
sns.lmplot(x='sepal_width', y='sepal_length', hue='species', data=iris,
           fit_reg=False, markers=['o', 's', 'd'], palette='colorblind')


# >#### Challenge 5 (optional)
# >
# >1. Take one of the figures created previously and upload it to [this website](http://www.color-blindness.com/coblis-color-blindness-simulator/) to see how it looks in the color blindness simulator.

# ### More general resources on plotting
# 
# - [Ten Simple Rules for Better Figures](http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1003833)
# - [Finding the Right Color Palettes for Data Visualizations](https://blog.graphiq.com/finding-the-right-color-palettes-for-data-visualizations-fcd4e707a283)
# - [Examples of bad graphs](https://www.biostat.wisc.edu/~kbroman/topten_worstgraphs/)
# - [More examples of bad graphs and how to improve them](https://www.stat.auckland.ac.nz/~ihaka/120/Lectures/lecture03.pdf)
# - [Wikipedia has a great article on misleading graphs](https://en.wikipedia.org/wiki/Misleading_graph)
# - [Usability article about how to design for people with color blindness](http://blog.usabilla.com/how-to-design-for-color-blindness/)