Correlation

Correlation is typically used to describe a linear relationship between two variables. In a broader sense, however, it can describe any kind of dependence or association between two variables.

Probably the most common measure of linear correlation is the Pearson correlation coefficient, defined as $\rho_{X,Y} = \frac{\operatorname{cov}(X,Y)}{\sigma_X \sigma_Y}$, where $\operatorname{cov}(X,Y)$ is the covariance of the two variables and $\sigma_X$, $\sigma_Y$ are their standard deviations. Let's see how the correlation changes for some randomly generated data depending on the number of samples used, the spread in the data, and the angle of the line.
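To make the definition concrete, here is a minimal sketch that computes the coefficient directly from the formula and compares it with NumPy's built-in `np.corrcoef`. The helper name `pearson_r` and the five data points are illustrative assumptions, not part of the notebook.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation computed as cov(X, Y) / (sigma_X * sigma_Y)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()                      # deviations from the mean of x
    dy = y - y.mean()                      # deviations from the mean of y
    cov = np.mean(dx * dy)                 # population covariance
    sigma_x = np.sqrt(np.mean(dx ** 2))    # population standard deviation of x
    sigma_y = np.sqrt(np.mean(dy ** 2))    # population standard deviation of y
    return cov / (sigma_x * sigma_y)

# Made-up data that lies close to a straight line
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

r_manual = pearson_r(x, y)
r_numpy = np.corrcoef(x, y)[0, 1]
print(r_manual, r_numpy)
```

The two values should agree to within floating-point rounding. The normalization (dividing by n or by n - 1) cancels in the ratio, so either convention gives the same r.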

First, we need to import a few packages...

In [1]:
import math
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats

from ipywidgets import interact
import ipywidgets as widgets

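Now we'll define two functions. The first scatters random points uniformly within a band along the x-axis: x runs from -5 to 5, and y is randomly displaced within a given width centered on 0. The second rotates those points about the origin by a given angle in degrees.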

In [2]:
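def makeLine(nsamples, width):
    # Uniform samples in the unit square, one (x, y) point per row
    line = np.random.random(size=(nsamples, 2))
    # Stretch x to a length of 10 and y to the requested width
    scale = np.array([[10.0, 0.0], [0.0, width]])
    # Center the band on the origin (broadcast over all rows)
    shift = np.array([-5.0, -width / 2.0])
    linecoords = np.dot(line, scale) + shift
    return linecoords

def rotateLine(line, angle):
    # Convert degrees to radians and build the 2-D rotation matrix
    theta = angle / 180.0 * np.pi
    R = np.array([[math.cos(theta), -math.sin(theta)], [math.sin(theta), math.cos(theta)]])
    # Points are stored as rows, so transpose, rotate, and transpose back
    rotated_line = np.transpose(np.dot(R, np.transpose(line)))
    return rotated_line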
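As a quick sanity check on the rotation convention, the sketch below rotates two known points by 90 degrees. It uses a standalone helper, `rotate_points` (a hypothetical name, written to mirror the rotation matrix above), so it can run on its own.

```python
import numpy as np

def rotate_points(points, angle_deg):
    """Rotate row-wise (x, y) points counterclockwise about the origin."""
    theta = np.deg2rad(angle_deg)
    R = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])
    # Each row p becomes (R @ p), which for row vectors is p @ R.T
    return points @ R.T

points = np.array([[1.0, 0.0],
                   [0.0, 2.0]])
rotated = rotate_points(points, 90.0)
print(np.round(rotated, 6))
```

With a counterclockwise convention, (1, 0) should land on (0, 1) and (0, 2) on (-2, 0). Distances from the origin are unchanged, which is why rotation alters the correlation without altering the shape of the point cloud.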

Now, we create some functions (and global data) to let us manipulate the line.

In [3]:
line = []
rotated_line = []

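def plotLine():
    # Scatter-plot the current points and report their Pearson r
    plt.plot(rotated_line[:,0], rotated_line[:,1], 'r.')
    plt.xlim([-8,8])
    plt.ylim([-8,8])
    # pearsonr returns (r, p-value); r is nan if x or y is constant (e.g. width 0 at angle 0)
    (r,p) = scipy.stats.pearsonr(rotated_line[:,0], rotated_line[:,1])
    text = 'r = %.4f' % r
    plt.figtext(0.45, 0.0, text)
    plt.show()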

def adjustLine(nsamples, width):
    global line
    global rotated_line
    line = makeLine(nsamples, width)
    rotated_line = line
    
def adjustAngle(angle):
    global line
    global rotated_line
    rotated_line = rotateLine(line, angle)
    plotLine()
    
def adjustAll(nsamples, width, angle):
    adjustLine(nsamples, width)
    adjustAngle(angle)

Here, we create an interactive plot showing the random data along with the Pearson correlation coefficient. Adjust the sliders to change the number of samples drawn from the distribution, the width of the line, and the angle. Watch the r-value shown below the graph.

In [4]:
interact(adjustAll, nsamples=widgets.IntSlider(min=5, max=500, step=1, value=10),
         width=widgets.FloatSlider(min=0.0, max=10.0, step=0.1,value=2.0),
         angle=widgets.FloatSlider(min=0.0, max=359.0, step=1.0, value=45.0));
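For comparison with the sampled r-values, the population correlation of this rotated band can be worked out in closed form. Assuming x is uniform over a length of 10 and y is uniform over the chosen width (as in `makeLine`), the variances before rotation are $a = 10^2/12$ and $b = \text{width}^2/12$, and after rotating by $\theta$ the correlation is $\rho = \frac{(a-b)\sin\theta\cos\theta}{\sqrt{(a\cos^2\theta + b\sin^2\theta)(a\sin^2\theta + b\cos^2\theta)}}$. The sketch below evaluates that expression; the function name `expected_r` is an assumption introduced for illustration.

```python
import math

def expected_r(width, angle_deg, length=10.0):
    """Population Pearson r of a uniform band rotated by angle_deg."""
    a = length ** 2 / 12.0      # variance of a uniform variable over `length`
    b = width ** 2 / 12.0       # variance of a uniform variable over `width`
    theta = math.radians(angle_deg)
    s, c = math.sin(theta), math.cos(theta)
    cov = (a - b) * s * c
    var_x = a * c * c + b * s * s
    var_y = a * s * s + b * c * c
    denom = math.sqrt(var_x * var_y)
    # r is undefined when either variance is zero
    return cov / denom if denom > 0 else float("nan")

for angle in (0, 15, 45, 90, 135):
    print(angle, round(expected_r(2.0, angle), 4))
```

Under these assumptions, the sampled r should scatter around this value and settle closer to it as the number of samples grows. Note that when the width equals the length (10), the band is a square and the expected correlation is 0 at every angle.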

Circular Distributions

Now, let's see what happens with data drawn from a different distribution: a circle. This data reveals the limits of the Pearson coefficient. There is clearly a relationship between the x-value and the resulting y-value, but because that relationship is not linear, the correlation coefficient should be 0. What kinds of correlation values do you measure when you have a small amount of data? Play with the sliders, and watch the r-value below the graph.

In [5]:
def circleDistribution(nsamples, radius, width):
    t = np.random.uniform(0.0, 2.0 * math.pi, size=nsamples)
    x = np.cos(t)*radius + (np.random.uniform(size=nsamples)*2-1)*width
    y = np.sin(t)*radius + (np.random.uniform(size=nsamples)*2-1)*width
    return(x,y)
In [6]:
def plotCircle(npts=1000, radius=10, width=1):
    (x,y) = circleDistribution(npts, radius, width)
    plt.plot(x,y, 'r.')
    (r,p) = scipy.stats.pearsonr(x,y)
    plt.figtext(0.45,0.0, 'r=%.3f' % r)
    plt.show()

Now let's create the plot. We're going to start with a small sample size, so you may see a correlation coefficient that suggests some degree of linear relationship. Remember, each time you adjust a slider, a new data set is drawn, so you can wiggle the sliders and get different r-values for comparable settings.

Next, adjust the number of points upwards. You'll find that even though there's a clear relationship between the two variables, the r-value will be almost 0.

In [7]:
interact(plotCircle,
         npts=widgets.IntSlider(min=5, max=500, step=10, value=10),
         radius=widgets.FloatSlider(min=5.0, max=50., step=1.0, value=5.0),
         width=widgets.FloatSlider(min=0.0, max=10.0, step=0.1, value=1.0));
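The effect of sample size can also be measured without the sliders. The sketch below redraws the circular data many times and summarizes how much r varies from draw to draw. It is a standalone approximation of `circleDistribution`; the names `circle_r` and `r_spread`, the fixed seed, and the trial count are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable

def circle_r(nsamples, radius=5.0, width=1.0):
    """Draw one noisy circle and return its Pearson r."""
    t = rng.uniform(0.0, 2.0 * np.pi, size=nsamples)
    x = np.cos(t) * radius + rng.uniform(-width, width, size=nsamples)
    y = np.sin(t) * radius + rng.uniform(-width, width, size=nsamples)
    return np.corrcoef(x, y)[0, 1]

def r_spread(nsamples, trials=2000):
    """Mean and standard deviation of r over many independent draws."""
    r_values = np.array([circle_r(nsamples) for _ in range(trials)])
    return r_values.mean(), r_values.std()

for n in (10, 100, 500):
    mean_r, std_r = r_spread(n)
    print(f"n={n:4d}  mean r = {mean_r:+.3f}  std of r = {std_r:.3f}")
```

The mean should stay near 0 at every sample size, while the spread shrinks as n grows. A single draw of 10 points can therefore show a sizeable r purely by chance.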

Parabolic Distributions

Now, let's look at data drawn from a parabolic distribution. Here, the data will be symmetric in X about 0, so again the Pearson r-value should be 0. What do you actually get? Could you conclude there's a linear relationship? Play with the sliders and find out: every time you move the sliders, it will generate a new data set.

In [8]:
def parabolicDistribution(nsamples, length, width):
    x = np.random.uniform(size=nsamples) * 2.0 * length - length
    y = x**2 + (np.random.uniform(size=nsamples) * 2.0 - 1.0) * width
    return(x,y)
In [9]:
def plotParabolicDistribution(npts, length, width):
    (x,y) = parabolicDistribution(npts, length, width)
    (r,p) = scipy.stats.pearsonr(x,y)
    plt.plot(x,y,'r.')
    plt.figtext(0.45, 0.0, 'r=%.3f' % r)
    plt.show()
In [10]:
interact(plotParabolicDistribution,
        npts = widgets.IntSlider(min=5, max=500, step=10, value=10),
        length = widgets.FloatSlider(min=1.0, max=20.0, step=0.1, value=20.0),
        width = widgets.FloatSlider(min=0.0, max=100.0, step=0.1, value=0.0));
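The zero correlation for the parabola depends on x being symmetric about 0. The sketch below removes the randomness by using evenly spaced x-values with no noise, then compares a symmetric range against two asymmetric ones. The helper `parabola_r` and the chosen ranges are illustrative assumptions.

```python
import numpy as np

def parabola_r(x_min, x_max, npts=201):
    """Pearson r between x and x**2 on an evenly spaced grid."""
    x = np.linspace(x_min, x_max, npts)
    y = x ** 2
    return np.corrcoef(x, y)[0, 1]

print(parabola_r(-20.0, 20.0))   # symmetric about 0
print(parabola_r(-10.0, 20.0))   # shifted to the right
print(parabola_r(0.0, 20.0))     # one arm of the parabola only
```

On the symmetric range the positive and negative halves cancel and r is 0 up to rounding error. Once the range is shifted, r becomes clearly positive, and on a single arm of the parabola it approaches 1 even though the relationship is still not linear.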