*Key ideas:* Correlation matrix, correlation thresholding, regularization, simulation studies

The sample correlation matrix is the most familiar estimate of the population correlation matrix. It is unbiased and consistent under mild conditions. However, the inverse, square root (Cholesky factor), and inverse square root of the sample correlation matrix may not be good estimates of their population counterparts, and may not even exist in some cases. These problems are likely to occur when the sample size is small relative to the dimension, or when the population correlation matrix is nearly singular.

In this notebook we explore two strategies to improve the estimation of these matrices:

Correlation matrices can be estimated subject to a lower bound on their eigenvalues. Several "nearest correlation" functions provided in Statsmodels attempt to find the closest symmetric positive definite (SPD) matrix, in the Frobenius norm, to a given square matrix, under the constraint that all eigenvalues are greater than a given value.
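As a quick illustration, `corr_nearest` can repair an indefinite matrix by enforcing an eigenvalue floor. The small matrix below is a hypothetical example, not taken from the simulation:

```python
import numpy as np
from statsmodels.stats.correlation_tools import corr_nearest

# A symmetric matrix with unit diagonal that is not positive definite
R = np.array([[1.0, 0.9, 0.0],
              [0.9, 1.0, 0.9],
              [0.0, 0.9, 1.0]])
print(np.linalg.eigvalsh(R).min())  # smallest eigenvalue is negative

# Find a nearby correlation matrix with eigenvalues bounded below
R_psd = corr_nearest(R, threshold=1e-6)
print(np.linalg.eigvalsh(R_psd).min())  # now positive
```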

"Hard element-wise thresholding" means that the elements of the sample correlation matrix that are smaller than some given value in magnitude are replaced with zero. If the threshold value scales with the standard error, i.e. 1/sqrt(n), then the hard thresholded sample correlation matrix is still consistent and asymptotically unbiased, but may provide better estimates of the population correlation matrix, and of its inverse, square root, and inverse square root. However the hard thresholded sample correlation matrix may not be symmetric and positive definite (SPD), in which case the square root does not exist as a real matrix. Running the thresholded covariance matrix through a "nearest correlation" routine resolves this issue.

Statsmodels provides three "nearest correlation" methods, located in `stats.correlation_tools`: `corr_nearest`, `corr_clipped`, and `corr_nearest_factor`. There are three corresponding methods for covariance matrices, called `cov_nearest`, `cov_clipped`, and `cov_nearest_factor`.
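For covariance matrices, `cov_nearest` rescales the input to a correlation matrix, projects it, and rescales back, so the variances are preserved. A brief sketch with a hypothetical indefinite input:

```python
import numpy as np
from statsmodels.stats.correlation_tools import cov_nearest

# An indefinite symmetric matrix offered as a covariance estimate
S = np.array([[4.0, 3.6, 0.0],
              [3.6, 4.0, 3.6],
              [0.0, 3.6, 4.0]])

S_psd = cov_nearest(S, method="nearest", threshold=1e-6)
print(np.linalg.eigvalsh(S_psd).min() > 0)
print(np.diag(S_psd))  # the diagonal (variances) is preserved
```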

Below we carry out a simulation study to investigate the performance of two of the nearest correlation methods provided in Statsmodels.

In [1]:

```
import numpy as np
import statsmodels.stats.correlation_tools as ctools
import pandas as pd
from IPython.core.display import HTML
```

In [2]:

```
def derived_matrices(C):
    """Return C, its inverse, its Cholesky square root, and the Cholesky
    square root of its inverse; None for any matrix that does not exist."""
    try:
        Sigma_inv = np.linalg.inv(C)
    except np.linalg.LinAlgError:
        Sigma_inv = None
    try:
        Sigma_sr = np.linalg.cholesky(C)
    except np.linalg.LinAlgError:
        Sigma_sr = None
    try:
        Sigma_inv_sr = np.linalg.cholesky(Sigma_inv) if Sigma_inv is not None else None
    except np.linalg.LinAlgError:
        Sigma_inv_sr = None
    return [C, Sigma_inv, Sigma_sr, Sigma_inv_sr]
```

In [3]:

```
def update_results(truth, estimates):
    # Frobenius distance between each true matrix and its estimate;
    # NaN if either matrix could not be computed.
    v = []
    for a, b in zip(truth, estimates):
        if a is not None and b is not None:
            v.append(np.sqrt(np.sum((a - b)**2)))
        else:
            v.append(np.nan)
    return v
```

In [4]:

```
def do_sim(n, Sigma, rthresh, ethresh):
    nrep = 100
    p = Sigma.shape[0]
    truth = derived_matrices(Sigma)
    Sigma_sr = truth[2]
    rslts = np.zeros((nrep, 4), dtype=np.float64)
    for k in range(nrep):
        # Simulate data with population covariance Sigma
        X = np.random.normal(size=(n, p))
        X = np.dot(X, Sigma_sr.T)
        C = np.dot(X.T, X) / n
        # Hard threshold, then project to a nearby correlation matrix
        CT = (np.abs(C) >= rthresh) * C
        CT = ctools.corr_nearest(CT, ethresh)
        estimates = derived_matrices(CT)
        rslts[k, :] = update_results(truth, estimates)
    rslts = pd.DataFrame(rslts)
    # Mean error, and the proportion of replications in which each
    # derived matrix existed
    rpd = pd.Series(np.concatenate((rslts.mean(0), pd.notnull(rslts).mean(0))))
    rpd.index = ["C", "CI", "CR", "CRI", "C(S)", "CI(S)", "CR(S)", "CRI(S)"]
    rpd["rthresh"] = rthresh
    rpd["ethresh"] = ethresh
    return rpd
```

In [5]:

```
def runsim(n, R):
    rslts = []
    for fthresh in 0, 0.5, 1:
        for ethresh in 0, 0.5, 1:
            rthresh = 2 * fthresh / np.sqrt(n)
            rslts1 = do_sim(2 * n, R, rthresh, ethresh)
            rslts.append(rslts1)
    rslts = pd.DataFrame(rslts)
    # Generate a formatting function that stars each column's minimum
    def gformat(col):
        mn = min(rslts[col])
        def fmt(x):
            if abs(x - mn) < 1e-5 and col in ["C", "CI", "CR", "CRI"]:
                return "*%.2f*" % x
            elif x >= 100:
                return ">100"
            else:
                return "%.2f" % x
        return fmt
    formatters = {z: gformat(z) for z in rslts.columns}
    formatters["ethresh"] = lambda x: "%.3f" % x
    return HTML(rslts.to_html(formatters=formatters, index_names=False))
```

In [6]:

```
runsim(40, np.eye(20))
```

Out[6]:

In [7]:

```
R = np.fromfunction(lambda i,j: 1.0/(1+abs(i-j)), (20,20), dtype=np.float64)
runsim(40, R)
```

Out[7]:

In [8]:

```
R = np.fromfunction(lambda i,j: 1.0/(1+abs(i-j))**2, (20,20), dtype=np.float64)
runsim(40, R)
```

Out[8]: