#!/usr/bin/env python
# coding: utf-8
#
#
# This notebook is an element of the [risk-engineering.org courseware](https://risk-engineering.org/). It can be distributed under the terms of the [Creative Commons Attribution-ShareAlike licence](https://creativecommons.org/licenses/by-sa/4.0/).
#
# Author: Eric Marsden.
#
# ---
#
# In this notebook, we illustrate NumPy features for working with correlated data. Check the [associated lecture slides](https://risk-engineering.org/correlation/) for background material and to download this content as a Jupyter/Python notebook.
# # Linear correlation
# In[18]:
import numpy
import matplotlib.pyplot as plt
import scipy.stats
plt.style.use("bmh")
get_ipython().run_line_magic('config', 'InlineBackend.figure_formats=["svg"]')
# In[19]:
X = numpy.random.normal(10, 1, 100)
X
# In[20]:
Y = -X + numpy.random.normal(0, 1, 100)
# In[21]:
plt.scatter(X, Y);
# Looking at the scatterplot above, we can see that the random variables $X$ and $Y$ are *correlated*. There are various statistical measures that allow us to quantify the degree of linear correlation. The most commonly used is [Pearson’s product-moment correlation coefficient](https://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient). It is available in `scipy.stats`.
# In[22]:
scipy.stats.pearsonr(X, Y)
# The first return value is the linear correlation coefficient, a value between -1 and 1 which measures the strength and direction of the linear relationship. A value greater than 0.9 indicates a strong positive linear correlation, and a value lower than -0.9 indicates a strong negative linear correlation (when $X$ increases, $Y$ tends to decrease).
#
# (The second return value is a *p-value*, a measure of the confidence that can be placed in the estimated correlation coefficient (smaller means more confidence). It is the probability that an uncorrelated system would produce datasets with a Pearson correlation at least as extreme as the one computed from these datasets. Here the p-value is very small, so we can have high confidence in the estimated value of the correlation coefficient.)
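# As a cross-check, Pearson's $r$ can also be computed directly from its definition, $r = \mathrm{cov}(X, Y) / (\sigma_X \sigma_Y)$, or read off the correlation matrix returned by `numpy.corrcoef`. The sketch below regenerates data like that in the cells above so it stands alone (the variable names `r_manual` and `r_matrix` are illustrative, not part of the notebook's state):

```python
import numpy

# generate correlated data as in the cells above
X = numpy.random.normal(10, 1, 100)
Y = -X + numpy.random.normal(0, 1, 100)

# Pearson's r from its definition: sample covariance divided by the
# product of the sample standard deviations (ddof=1 matches numpy.cov)
r_manual = numpy.cov(X, Y)[0, 1] / (numpy.std(X, ddof=1) * numpy.std(Y, ddof=1))

# numpy.corrcoef returns the 2x2 correlation matrix; the off-diagonal
# entry is the correlation between X and Y
r_matrix = numpy.corrcoef(X, Y)[0, 1]

print(r_manual, r_matrix)
```

# Both computations agree with the first value returned by `scipy.stats.pearsonr`.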
# ## Exercises
# **Exercise**: show that when the error in $Y$ decreases, the strength (absolute value) of the correlation coefficient increases.
# **Exercise**: produce data and a plot with a positive correlation coefficient.
# ## Anscombe’s quartet
# Let’s examine four datasets produced by the statistician [Francis Anscombe](https://en.wikipedia.org/wiki/Frank_Anscombe) to illustrate the importance of exploring your data qualitatively (for example by plotting the data), rather than relying only on summary statistics such as the linear correlation coefficient.
# In[23]:
x = numpy.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5])
y1 = numpy.array([8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68])
y2 = numpy.array([9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74])
y3 = numpy.array([7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73])
x4 = numpy.array([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8])
y4 = numpy.array([6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89])
# In[24]:
plt.scatter(x, y1)
plt.title("Anscombe quartet n°1")
plt.margins(0.1)
# In[25]:
scipy.stats.pearsonr(x, y1)
# In[26]:
plt.scatter(x, y2)
plt.title("Anscombe quartet n°2")
plt.margins(0.1)
# In[27]:
scipy.stats.pearsonr(x, y2)
# In[28]:
plt.scatter(x, y3)
plt.title("Anscombe quartet n°3")
plt.margins(0.1)
# In[29]:
scipy.stats.pearsonr(x, y3)
# In[30]:
plt.scatter(x4, y4)
plt.title("Anscombe quartet n°4")
plt.margins(0.1)
# In[31]:
scipy.stats.pearsonr(x4, y4)
# Notice that the linear correlation coefficient (Pearson's $r$) of the four datasets is essentially identical (0.816 to three decimal places), though clearly the relationship between $X$ and $Y$ is very different in each case! This illustrates the risks of depending only on quantitative descriptors to understand your datasets: you should also use different types of plots to give you a better overview of the data.
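# The near-identity of the summary statistics can be checked in a single pass over the four datasets. This sketch redefines the arrays from the cells above so it is self-contained:

```python
import numpy
import scipy.stats

# Anscombe's quartet, as defined in the cells above
x = numpy.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5])
y1 = numpy.array([8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68])
y2 = numpy.array([9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74])
y3 = numpy.array([7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73])
x4 = numpy.array([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8])
y4 = numpy.array([6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89])

# mean, sample variance and Pearson's r for each dataset
for xi, yi in [(x, y1), (x, y2), (x, y3), (x4, y4)]:
    r, p = scipy.stats.pearsonr(xi, yi)
    print(f"mean(y)={yi.mean():.2f}  var(y)={yi.var(ddof=1):.2f}  r={r:.3f}")
```

# All four datasets share a mean of $y$ around 7.50, a sample variance around 4.13, and $r \approx 0.816$, despite their very different shapes.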
# ## The Datasaurus
# The [Datasaurus](http://www.thefunctionalart.com/2016/08/download-datasaurus-never-trust-summary.html) provides another illustration of the importance of plotting your data to make sure it doesn't contain any surprises, rather than relying only on summary statistics.
# In[32]:
import pandas
ds = pandas.read_csv("https://risk-engineering.org/static/data/datasaurus.csv", header=None)
ds.describe()
# In[33]:
scipy.stats.pearsonr(ds[0], ds[1])
# These summary statistics don't look too nasty, but check out the data once it's plotted!
# In[34]:
plt.scatter(ds[0], ds[1]);
# [This article](https://www.autodeskresearch.com/publications/samestats) titled *Same Stats, Different Graphs: Generating Datasets with Varied Appearance and Identical Statistics through Simulated Annealing* runs a lot further with this concept.