This notebook is an element of the risk-engineering.org courseware. It can be distributed under the terms of the Creative Commons Attribution-ShareAlike licence.

Author: Eric Marsden [email protected].


In this notebook, we illustrate NumPy features for working with correlated data. Check the associated lecture slides for background material.

Linear correlation

In [16]:
import numpy
import matplotlib.pyplot as plt
import scipy.stats
import seaborn as sns   # imported for its plot styling (older seaborn versions restyle matplotlib on import)
%matplotlib inline
%config InlineBackend.figure_formats=['svg']
In [17]:
# draw 100 observations from a normal distribution with mean 10 and standard deviation 1
X = numpy.random.normal(10, 1, 100)
X
Out[17]:
array([ 10.20107479,   9.4063267 ,   9.35925975,   8.69936166,
         9.68910726,   9.85405318,   8.60661236,  10.08454305,
         8.30319939,  10.21445046,  10.4396301 ,  12.38902889,
        10.9808642 ,   7.98431786,  10.13631675,   8.44738428,
         9.36522325,  10.29499463,  10.04606004,  11.00638488,
        12.07261399,  10.80256371,   9.46126198,   9.38123702,
         8.98822604,   9.15236137,   9.3376605 ,   9.30438657,
         9.78996393,   9.49894784,   9.43647119,   9.55568992,
        11.20927467,  12.7191866 ,  10.42758769,   8.14224458,
         9.30565385,  10.63998949,  10.66122655,   9.83916936,
         9.27289739,   9.00283594,   9.51899124,   8.93694694,
        10.58172455,  10.63856832,   7.64131459,   9.84446853,
         8.02220301,   9.67709333,  10.24119223,   9.13050033,
         9.48903717,  10.53441498,  10.92532674,   9.75549168,
        10.31641759,   9.78170438,   9.53854654,   8.76682161,
        10.80344607,  11.58906864,  10.1593613 ,   9.18055869,
         8.81313529,  10.89648306,  10.17565836,   7.86195305,
         9.24215571,   9.09704679,   9.95070734,  10.77146896,
         9.31440722,  10.04332483,  11.64185445,   9.45643575,
         7.45589937,  11.52976664,  11.03312838,  10.21293762,
        10.57867935,  11.12948802,   9.30721498,   9.2566451 ,
        11.16357153,  10.50878674,   9.80893015,   9.57082672,
        10.12226219,  10.04377583,  10.63776629,  10.51013363,
         9.97421612,  11.50313469,  11.68939682,  12.41476162,
         9.02562803,  11.34000577,   9.12195023,   9.13789766])
In [18]:
# Y is a noisy decreasing function of X, so X and Y are (negatively) correlated
Y = -X + numpy.random.normal(0, 1, 100)
In [19]:
plt.scatter(X, Y);

Looking at the scatterplot above, we can see that the random variables $X$ and $Y$ are correlated. There are various statistical measures that allow us to quantify the degree of linear correlation. The most commonly used is Pearson's product-moment correlation coefficient. It is available in scipy.stats.

In [20]:
scipy.stats.pearsonr(X, Y)
Out[20]:
(-0.7416383159326555, 1.0905867258471829e-18)

The first return value is the linear correlation coefficient; the closer its magnitude is to 1, the stronger the linear correlation. Here the coefficient is negative, because $Y$ tends to decrease as $X$ increases.

(The second return value is a p-value, which measures the confidence that can be placed in the estimate of the correlation coefficient (smaller = more confidence). It is the probability that an uncorrelated system would produce datasets with a Pearson correlation at least as extreme as the one computed from these datasets. Here the p-value is very low, so we have high confidence in the estimated value of the correlation coefficient.)
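
As a cross-check, Pearson's coefficient can also be computed directly from its definition, $r = \mathrm{cov}(X, Y) / (\sigma_X \sigma_Y)$, or with numpy.corrcoef. This is a minimal sketch; both values should match the scipy.stats.pearsonr output above, up to floating-point rounding.

In [ ]:
# Pearson's r by hand: the covariance of X and Y normalized by the product
# of their standard deviations (ddof=1 gives sample statistics, matching numpy.cov)
r = numpy.cov(X, Y)[0, 1] / (numpy.std(X, ddof=1) * numpy.std(Y, ddof=1))
# numpy.corrcoef returns the 2x2 correlation matrix; the off-diagonal
# entry is the same coefficient
print(r, numpy.corrcoef(X, Y)[0, 1])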

Exercises

Exercise: show that when the error in $Y$ decreases, the strength of the correlation increases (the coefficient moves closer to $-1$). A possible starting point is sketched after the exercises below.

Exercise: produce data and a plot with a positive correlation coefficient.
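
A minimal sketch for the first exercise, one possible approach among several: reduce the standard deviation of the noise added to $Y$ and watch the coefficient move towards $-1$.

In [ ]:
# As the noise added to Y shrinks, the linear relation dominates and the
# correlation coefficient approaches -1
for sigma in [2.0, 1.0, 0.5, 0.1]:
    Yn = -X + numpy.random.normal(0, sigma, 100)
    r, p = scipy.stats.pearsonr(X, Yn)
    print(sigma, r)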

Anscombe’s quartet

Let's examine four datasets produced by the statistician Francis Anscombe to illustrate the importance of graphing data, rather than relying only on summary statistics (such as the linear correlation coefficient).

In [21]:
x =  numpy.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5])
y1 = numpy.array([8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68])
y2 = numpy.array([9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74])
y3 = numpy.array([7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73])
x4 = numpy.array([8,8,8,8,8,8,8,19,8,8,8])
y4 = numpy.array([6.58,5.76,7.71,8.84,8.47,7.04,5.25,12.50,5.56,7.91,6.89])
In [22]:
plt.plot(x, y1, 'o', color='blue')
plt.title("Anscombe quartet n° 1")
plt.margins(0.1)
In [23]:
scipy.stats.pearsonr(x, y1)
Out[23]:
(0.81642051634484003, 0.0021696288730787888)
In [24]:
plt.plot(x, y2, 'o', color='blue')
plt.title("Anscombe quartet n° 2")
plt.margins(0.1)
In [25]:
scipy.stats.pearsonr(x, y2)
Out[25]:
(0.81623650600024267, 0.0021788162369108031)
In [26]:
plt.plot(x, y3, 'o', color='blue')
plt.title("Anscombe quartet n° 3")
plt.margins(0.1)
In [27]:
scipy.stats.pearsonr(x, y3)
Out[27]:
(0.81628673948959807, 0.0021763052792280304)
In [28]:
plt.plot(x4, y4, 'o', color='blue')
plt.title("Anscombe quartet n° 4")
plt.margins(0.1)
In [29]:
scipy.stats.pearsonr(x4, y4)
Out[29]:
(0.81652143688850298, 0.0021646023471972127)

Notice that the linear correlation coefficients of the four datasets are essentially identical (all ≈ 0.816), though the relationship between the variables is clearly very different in each case!
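
The quartet in fact shares more than the correlation coefficient: the means and variances of $y$, and the least-squares regression line, are also almost exactly the same in all four datasets. A quick check (a sketch using scipy.stats.linregress):

In [ ]:
# The summary statistics are nearly identical across the four datasets:
# mean(y) ≈ 7.50, var(y) ≈ 4.12, regression line y ≈ 3.00 + 0.500·x
for xi, yi in [(x, y1), (x, y2), (x, y3), (x4, y4)]:
    slope, intercept, r, p, stderr = scipy.stats.linregress(xi, yi)
    print(numpy.mean(yi), numpy.var(yi, ddof=1), slope, intercept)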
