#!/usr/bin/env python # coding: utf-8 # # # This notebook is an element of the [risk-engineering.org courseware](https://risk-engineering.org/). It can be distributed under the terms of the [Creative Commons Attribution-ShareAlike licence](https://creativecommons.org/licenses/by-sa/4.0/). # # Author: Eric Marsden . # # --- # # In this notebook, we illustrate NumPy features for working with correlated data. Check the [associated lecture slides](https://risk-engineering.org/correlation/) for background material and to download this content as a Jupyter/Python notebook. # # Linear correlation # In[18]: import numpy import matplotlib.pyplot as plt import scipy.stats plt.style.use("bmh") get_ipython().run_line_magic('config', 'InlineBackend.figure_formats=["svg"]') # In[19]: X = numpy.random.normal(10, 1, 100) X # In[20]: Y = -X + numpy.random.normal(0, 1, 100) # In[21]: plt.scatter(X, Y); # Looking at the scatterplot above, we can see that the random variables $X$ and $Y$ are *correlated*. There are various statistical measures that allow us to quantify the degree of linear correlation. The most commonly used is [Pearson’s product-moment correlation coefficient](https://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient). It is available in `scipy.stats`. # In[22]: scipy.stats.pearsonr(X, Y) # The first return value is the linear correlation coefficient, a value between -1 and 1 which measures the strength of the linear correlation. A value greater than 0.9 indicates a strong positive linear correlation, and a value lower than -0.9 indicates strong negative linear correlation (when $X$ increases, $Y$ decreases). # # (The second return value is a *p-value*, which is a measure of the confidence which can be placed in the estimation of the correlation coefficient (smaller = more confidence). It tells you the probability of an uncorrelated system producing datasets that have a Pearson correlation at least as extreme as the one computed from these datasets. Here we have a very very low p-value, so high confidence in the estimated value of the correlation coefficient.) # ## Exercises # **Exercise**: show that when the error in $Y$ decreases, the correlation coefficient increases. # **Exercise**: produce data and a plot with a negative correlation coefficient. # ## Anscombe’s quartet # Let’s examine four datasets produced by the statistician [Francis Anscombe](https://en.wikipedia.org/wiki/Frank_Anscombe) to illustrate the importance of exploring your data qualitatively (for example by plotting the data), rather than relying only on summary statistics such as the linear correlation coefficient. 
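# One possible sketch for the first exercise (the error levels below are arbitrary choices): as the standard deviation of the error term shrinks, the magnitude of Pearson’s $r$ grows towards 1.

# In[ ]:

# Sketch: shrinking the error in Y strengthens the linear correlation
for noise in (2.0, 1.0, 0.5, 0.1):
    Yn = -X + numpy.random.normal(0, noise, 100)
    r, p = scipy.stats.pearsonr(X, Yn)
    # cross-check with numpy.corrcoef, which returns the full correlation matrix
    assert abs(r - numpy.corrcoef(X, Yn)[0, 1]) < 1e-10
    print(f"error standard deviation {noise}: r = {r:.3f}")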
# ## Anscombe’s quartet

# Let’s examine four datasets produced by the statistician [Francis Anscombe](https://en.wikipedia.org/wiki/Frank_Anscombe) to illustrate the importance of exploring your data qualitatively (for example by plotting it), rather than relying only on summary statistics such as the linear correlation coefficient.

# In[23]:

x = numpy.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5])
y1 = numpy.array([8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68])
y2 = numpy.array([9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74])
y3 = numpy.array([7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73])
# the fourth dataset has its own x values
x4 = numpy.array([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8])
y4 = numpy.array([6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89])


# In[24]:

plt.scatter(x, y1)
plt.title("Anscombe quartet n°1")
plt.margins(0.1)


# In[25]:

scipy.stats.pearsonr(x, y1)


# In[26]:

plt.scatter(x, y2)
plt.title("Anscombe quartet n°2")
plt.margins(0.1)


# In[27]:

scipy.stats.pearsonr(x, y2)


# In[28]:

plt.scatter(x, y3)
plt.title("Anscombe quartet n°3")
plt.margins(0.1)


# In[29]:

scipy.stats.pearsonr(x, y3)


# In[30]:

plt.scatter(x4, y4)
plt.title("Anscombe quartet n°4")
plt.margins(0.1)


# In[31]:

scipy.stats.pearsonr(x4, y4)


# Notice that the linear correlation coefficient (Pearson’s $r$) of the four datasets is identical, even though the relationship between $X$ and $Y$ is clearly very different in each case! This illustrates the risk of depending only on quantitative descriptors to understand your datasets: you should also use different types of plots to get a better overview of the data.

# ## The Datasaurus

# The [Datasaurus](http://www.thefunctionalart.com/2016/08/download-datasaurus-never-trust-summary.html) provides another illustration of the importance of plotting your data to make sure it doesn’t contain any surprises, rather than relying only on summary statistics.

# In[32]:

import pandas

ds = pandas.read_csv("https://risk-engineering.org/static/data/datasaurus.csv", header=None)
ds.describe()


# In[33]:

scipy.stats.pearsonr(ds[0], ds[1])


# These summary statistics don’t look suspicious, but check out the data once it’s plotted!

# In[34]:

plt.scatter(ds[0], ds[1]);


# [This article](https://www.autodeskresearch.com/publications/samestats), titled *Same Stats, Different Graphs: Generating Datasets with Varied Appearance and Identical Statistics through Simulated Annealing*, takes this concept much further.
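# To push the same point a little further, here is a minimal sketch (using `scipy.stats.linregress`; the plotting choices are arbitrary) that fits a least-squares line to the Datasaurus: the regression summary is perfectly well defined, yet the fitted line tells us almost nothing about the structure that the plot reveals.

# In[ ]:

# Sketch: a least-squares fit to the Datasaurus is computable but uninformative
res = scipy.stats.linregress(ds[0], ds[1])
plt.scatter(ds[0], ds[1], alpha=0.5)
xs = numpy.linspace(ds[0].min(), ds[0].max(), 100)
plt.plot(xs, res.intercept + res.slope * xs, color="red")
plt.title(f"Datasaurus with least-squares fit (r = {res.rvalue:.3f})");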