In this notebook, I show how to determine a correlation coefficient within the Bayesian framework both in a simply and a robust way. The correlation can be seen as a direct alternative to the traditional Pearson correlation coefficient.

The model and code shown here is motivated by the following blog posts and the book "Bayesian Cognitive Modeling":

- http://www.sumsar.net/blog/2013/08/bayesian-estimation-of-correlation/
- http://www.sumsar.net/blog/2013/08/robust-bayesian-estimation-of-correlation/
- http://bayesmodels.com/

My main contribution here is to show how do apply the model with Python and PyMC.

In [1]:

```
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()
```

First, let us start with defining our model both for the classic way of using a multivariate normal distribution as well as for the robust way that utilizes a multivariate student t-distribution.

In [2]:

```
import pymc as pymc
from pymc import Normal, Uniform, MvNormal, Exponential
from numpy.linalg import inv, det
from numpy import log, pi, dot
import numpy as np
from scipy.special import gammaln
def _model(data, robust=False):
# priors might be adapted here to be less flat
mu = Normal('mu', 0, 0.000001, size=2)
sigma = Uniform('sigma', 0, 1000, size=2)
rho = Uniform('r', -1, 1)
# we have a further parameter (prior) for the robust case
if robust == True:
nu = Exponential('nu',1/29., 1)
# we model nu as an Exponential plus one
@pymc.deterministic
def nuplus(nu=nu):
return nu + 1
@pymc.deterministic
def precision(sigma=sigma,rho=rho):
ss1 = float(sigma[0] * sigma[0])
ss2 = float(sigma[1] * sigma[1])
rss = float(rho * sigma[0] * sigma[1])
return inv(np.mat([[ss1, rss], [rss, ss2]]))
if robust == True:
# log-likelihood of multivariate t-distribution
@pymc.stochastic(observed=True)
def mult_t(value=data.T, mu=mu, tau=precision, nu=nuplus):
k = float(tau.shape[0])
res = 0
for r in value:
delta = r - mu
enum1 = gammaln((nu+k)/2.) + 0.5 * log(det(tau))
denom = (k/2.)*log(nu*pi) + gammaln(nu/2.)
enum2 = (-(nu+k)/2.) * log (1 + (1/nu)*delta.dot(tau).dot(delta.T))
result = enum1 + enum2 - denom
res += result[0]
return res[0,0]
else:
mult_n = MvNormal('mult_n', mu=mu, tau=precision, value=data.T, observed=True)
return locals()
def analyze(data, robust=False, plot=True):
model = pymc.MCMC(_model(data,robust))
model.sample(50000,25000)
print
if plot:
pymc.Matplot.plot(model)
return model
```

Let us define some data.

In [3]:

```
import numpy as np
x = np.array([525., 300., 450., 300., 400., 500., 550., 125., 300., 400., 500., 550.])
y = np.array([250., 225., 275., 350., 325., 375., 450., 400., 500., 550., 600., 525.])
data = np.array([x, y])
```

And now, we can do the inference. First, with the classic (non-robust) model.

In [4]:

```
model = analyze(data)
```

In [5]:

```
print model.stats()['r']['mean']
print model.stats()['r']['95% HPD interval']
```

In [6]:

```
(model.trace('r')[:]>0.05).mean()
```

Out[6]:

As a result, we see that only around 61% of all values of the posterior are above 0.05. Thus, we only see mediocre evidence for a positive correlation in the data.

Let us now relax this assumption and just determine what the probability is that there is no correlation at all (regardless negative or positive) with a ROPE of [0.05,0.05].

In [7]:

```
((model.trace('r')[:]<0.05) & (model.trace('r')[:]>-0.05)).mean()
```

Out[7]:

We can only see week evidence that there is no correlation at all. Thus, the probability that there is a negative correlation is around 27%. Overall, the inference and our inspection of the marginal posterior reveals that there might be a slight tendency towards a positive correlation, but we are pretty unsure about it. We might want to gather further data to supplement our inference.

Let us now compare these results to a classic Pearson correlation coefficient:

In [8]:

```
from scipy.stats.stats import pearsonr
pearsonr(x,y)
```

Out[8]:

In [9]:

```
s = [9.92,94,9.94,79,9.97,78,9.93,83,9.90,77,9.93,76,10.00,74,9.97,87,10.00,86,10.01,83,10.08,75,10.09,74,10.15,92,10.15,69,10.17,79,10.17,71,10.19,80,10.30,80,10.31,77,10.34,87]
x = [float(a) for i,a in enumerate(s) if i % 2 == 0]
y = [float(a) for i,a in enumerate(s) if i % 2 != 0]
data = np.array([x, y])
```

We again do inference with the simple model.

In [10]:

```
model = analyze(data,plot=False)
```

In [11]:

```
print model.stats()['r']['mean']
print model.stats()['r']['95% HPD interval']
```

In [12]:

```
from scipy.stats.stats import pearsonr
pearsonr(x,y)
```

Out[12]:

In [13]:

```
x.append(9.5)
y.append(115.)
data = np.array([x, y])
```

In [14]:

```
model = analyze(data,plot=False)
```

In [15]:

```
print model.stats()['r']['mean']
print model.stats()['r']['95% HPD interval']
```

In [16]:

```
from scipy.stats.stats import pearsonr
pearsonr(x,y)
```

Out[16]:

Not surprisingly, the Pearson correlation again agrees with our Bayesian inference and states the correlation as significant. This is not surprising, as the Pearson correlation is very non-robust to non-normality of the data as it is a measure of linear dependence. The same happends of course with our Bayesian model that models the data with a normal distribution.

Thus, we make our model more robust as elaborated in cited blog post. What we actually want, is a model that assumes that the majority of the data is normally distributed, but still allows outliers to exist. Within our Bayesian framework, we can choose any model for the likelihood. In that case, a robust model is the multivariate student t-distribution. For a detailed discussion please refer to corresponding blog post.

Now, let us apply the robust model (multivariate student t-distribution).

In [17]:

```
# this takes a while with many iterations as the likelihood calculation is not optimized
model = analyze(data, robust=True)
```

In [18]:

```
print model.stats()['r']['mean']
print model.stats()['r']['95% HPD interval']
```

This estimation looks much more reasonable and is closer to our original estimation for the data with no outlier using the multivariate normal model.

Let us take a look at the marginal of nu:

In [19]:

```
print model.stats()['nu']['mean']
print model.stats()['nu']['95% HPD interval']
```

In [ ]:

```
```