# Correlation matrix estimation with element-wise thresholding and eigenvalue constraints¶

Key ideas: Correlation matrix, correlation thresholding, regulation, simulation studies

The sample correlation matrix is the most familiar estimate of the population correlation matrix. It is unbiased and consistent under mild conditions. However the inverse, square root (Cholesky factor), and inverse square root of the sample correlation matrix may not be good estimates of their population counterparts, and may not even exist in some cases. These problems are likely to occur if the sample size is small relative to the dimension, or if the population correlation matrix is nearly singular.

In this notebook we explore two strategies to improve the estimation of these matrices:

• Correlation matrices can be estimated with lower bound constraints on the eigenvalues. Several "nearest correlation" functions provided in Statsmodels attempt to find the closest SPD matrix in the Frobenius norm to a given square matrix, under the constraint that all eigenvalues are greater than a given value.

• "Hard element-wise thresholding" means that the elements of the sample correlation matrix that are smaller than some given value in magnitude are replaced with zero. If the threshold value scales with the standard error, i.e. 1/sqrt(n), then the hard thresholded sample correlation matrix is still consistent and asymptotically unbiased, but may provide better estimates of the population correlation matrix, and of its inverse, square root, and inverse square root. However the hard thresholded sample correlation matrix may not be symmetric and positive definite (SPD), in which case the square root does not exist as a real matrix. Running the thresholded covariance matrix through a "nearest correlation" routine resolves this issue.

Statsmodels provides three "nearest correlation" methods, located in stats.correlation_tools. The methods are corr_nearest, corr_clipped, and corr_nearest_factor. There are three corresponding methods for covariance matrices called cov_nearest, cov_clipped, and cov_nearest_factor.

This is a simulation study to investigate the performance of two of the nearest correlation methods that are provided in Statsmodels.

In [1]:
import numpy as np
import statsmodels.stats.correlation_tools as ctools
import pandas as pd
from IPython.core.display import HTML


This is a utility function that calculates the inverse, Cholesky factor, and inverse Cholesky factor for a given input matrix. These matrices are stored in a list, using a value of None when the matrix doesn't exist.

In [2]:
def derived_matrices(C):
Sigma = C.copy()
try:
Sigma_inv = np.linalg.inv(C)
except:
Sigma_inv = None
try:
Sigma_sr = np.linalg.cholesky(C)
except:
Sigma_sr = None
try:
Sigma_inv_sr = np.linalg.cholesky(Sigma_inv)
except:
Sigma_inv_sr = None
return [C, Sigma_inv, Sigma_sr, Sigma_inv_sr]


Next we have another utility function, that is used to compute Frobenius norms for the differences bettween the estimate and the truth for the four matrices of interest (covariance, inverse covariance, Cholesky factor, and inverse Cholesky factor).

In [3]:
def update_results(truth, estimates):
v = []
for a,b in zip(truth, estimates):
if None not in (a,b):
v.append(np.sqrt(np.sum((a-b)**2)))
else:
v.append(np.nan)
return v


Here is the simulation for one population structure and one set of estimation tuning parameters. The elements of the sample correlation matrix are thresholded at 'rthresh', and the eigenvalues are thresholded at 'ethresh'.

In [4]:
def do_sim(n, Sigma, rthresh, ethresh):

nrep = 100
p = Sigma.shape[0]

truth = derived_matrices(Sigma)
Sigma_sr = truth[2]

rslts = np.zeros((nrep, 4), dtype=np.float64)
for k in range(nrep):
X = np.random.normal(size=(n,p))
X = np.dot(X, Sigma_sr.T)
C = np.dot(X.T, X) / n
CT = (np.abs(C) >= rthresh) * C
CT = ctools.corr_nearest(CT, ethresh)
estimates = derived_matrices(CT)
rslts[k,:] = update_results(truth, estimates)

rslts = pd.DataFrame(rslts)
rpd = pd.Series(np.concatenate((rslts.mean(0), pd.notnull(rslts).mean(0))))
rpd.index = ["C", "CI", "CR", "CRI", "C(S)", "CI(S)", "CR(S)", "CRI(S)"]
rpd = rpd.set_value("rthresh", rthresh)
rpd = rpd.set_value("ethresh", ethresh)
return rpd



Here is a wrapper that runs the simulations for a range of different threshold values, then prints the results in a table. We use a column formatter function that puts stars around the optimal value in each column (i.e. the value of the threshold parameters that gives the smallest estimation error).

In [5]:
def runsim(n, R):

rslts = []
for fthresh in 0, 0.5, 1:
for ethresh in 0,0.5,1:
rthresh = 2 * fthresh / np.sqrt(n)
rslts1 = do_sim(2*n, R, rthresh, ethresh)
rslts.append(rslts1)

rslts = pd.DataFrame(rslts)

# Generate a formatting function
def gformat(col):
mn = min(rslts[col])
def fmt(x):
if abs(x - mn) < 1e-5 and col in ["C", "CI", "CR", "CRI"]:
return "*%.2f*" % x
elif x >= 100:
return ">100"
else:
return "%.2f" % x
return fmt
formatters = {z: gformat(z) for z in rslts.columns}
formatters["ethresh"] = lambda x: "%.3f" % x
return HTML(rslts.to_html(formatters=formatters, index_names=False))


First we consider the setting where the population correlation matrix is diagonal. We see that the most accurate estimates of the correlation matrix result when using the greatest amounts of thresholding. The reason for this is that we are forcing values to be zero that in actuality are exactly zero. Imposing a minimum eigenvalue bound is also helpful.

In [6]:
runsim(40, np.eye(20))

Out[6]:
C CI CR CRI C(S) CI(S) CR(S) CRI(S) rthresh ethresh
0 2.31 3.96 1.62 1.98 1.00 1.00 1.00 1.00 0.00 0.000
1 2.17 2.59 1.44 1.54 1.00 1.00 1.00 1.00 0.00 0.500
2 1.77 0.99 0.93 0.73 1.00 1.00 1.00 1.00 0.00 1.000
3 1.83 2.87 1.28 1.51 1.00 1.00 1.00 1.00 0.16 0.000
4 1.75 2.12 1.17 1.26 1.00 1.00 1.00 1.00 0.16 0.500
5 1.31 0.82 0.70 0.58 1.00 1.00 1.00 1.00 0.16 1.000
6 0.90 0.93 0.51 0.51 1.00 1.00 1.00 1.00 0.32 0.000
7 0.83 0.87 0.46 0.47 1.00 1.00 1.00 1.00 0.32 0.500
8 *0.62* *0.46* *0.31* *0.27* 1.00 1.00 1.00 1.00 0.32 1.000

Next we consider an example where the population correlation matrix is dense. In this case, thresholding the correlations leads to worse performance (i.e. rthresh=0 is best), but constraining the minimum eigenvalue (using ethresh=0.5) can be quite helpful when estimating the derived matrices (inverse, etc.)

In [7]:
R = np.fromfunction(lambda i,j: 1.0/(1+abs(i-j)), (20,20), dtype=np.float64)
runsim(40, R)

/opt/anaconda/envs/np18py27-1.9/lib/python2.7/site-packages/statsmodels/stats/correlation_tools.py:82: UserWarning: maximum iteration reached
warnings.warn('maximum iteration reached')

Out[7]:
C CI CR CRI C(S) CI(S) CR(S) CRI(S) rthresh ethresh
0 2.32 6.33 1.56 2.47 1.00 1.00 1.00 1.00 0.00 0.000
1 *2.24* 2.28 *1.28* 1.34 1.00 1.00 1.00 1.00 0.00 0.500
2 2.70 4.26 1.62 1.98 1.00 1.00 1.00 1.00 0.00 1.000
3 2.52 >100 2.09 >100 1.00 1.00 0.80 0.73 0.16 0.000
4 2.24 2.39 1.33 1.41 1.00 1.00 1.00 1.00 0.16 0.500
5 3.51 4.30 1.94 2.15 1.00 1.00 1.00 1.00 0.16 1.000
6 3.04 >100 2.55 >100 1.00 1.00 0.52 0.35 0.32 0.000
7 2.69 *2.26* 1.41 *1.33* 1.00 1.00 1.00 1.00 0.32 0.500
8 4.21 4.33 2.21 2.29 1.00 1.00 1.00 1.00 0.32 1.000

In the following example, the population correlation matrix is approximately sparse. Here we se that element-wise thresholding and eigenvalue thresholding both improve the estimation accuracy.

In [8]:
R = np.fromfunction(lambda i,j: 1.0/(1+abs(i-j))**2, (20,20), dtype=np.float64)
runsim(40, R)

Out[8]:
C CI CR CRI C(S) CI(S) CR(S) CRI(S) rthresh ethresh
0 2.29 4.44 1.61 2.08 1.00 1.00 1.00 1.00 0.00 0.000
1 2.13 2.30 1.37 1.42 1.00 1.00 1.00 1.00 0.00 0.500
2 2.09 1.72 1.16 1.04 1.00 1.00 1.00 1.00 0.00 1.000
3 2.05 5.05 1.48 2.10 1.00 1.00 1.00 1.00 0.16 0.000
4 1.90 2.05 1.22 1.27 1.00 1.00 1.00 1.00 0.16 0.500
5 1.85 *1.68* *1.07* *1.00* 1.00 1.00 1.00 1.00 0.16 1.000
6 1.87 2.11 1.20 1.27 1.00 1.00 1.00 1.00 0.32 0.000
7 1.84 1.77 1.16 1.16 1.00 1.00 1.00 1.00 0.32 0.500
8 *1.80* 1.71 1.11 1.06 1.00 1.00 1.00 1.00 0.32 1.000