Marcos Duarte

To describe the dispersion of data or to make a prediction for a sample, we can calculate a prediction interval or tolerance interval. For the specific case of a univariate random variable, see Confidence and prediction intervals. For a multivariate random variable, we need to calculate a prediction hyperellipsoid (Chew, 1966):

The 95% prediction hyperellipsoid is a prediction interval for a sample of a multivariate random variable such that there is a 95% probability that a new observation will lie inside the hyperellipsoid.

The 95% prediction interval (a p-dimensional hyperellipsoid) for a sample of a multivariate random variable with $p$ dimensions and size $n$ from a population with a normal distribution is given by the following probabilistic bound, which holds 95% of the time (Chew, 1966):

$$ (x_{n+1}-\bar{x})'\: S^{-1} (x_{n+1}-\bar{x}) \leq \frac{F_{(0.95,\: p,\: n-p)}(n-1)\:p\:(n+1)}{n\:(n-p)} $$

where $x_{n+1}$ (with dimension $p$) is the new observation for which we want to calculate the prediction interval; $F$ is the F-distribution; $\bar{x}$ is the sample mean (with dimension $p$); and $S$ is the sample covariance matrix, defined as:

$$ S_{j,k} = \frac{1}{n-1}\sum_{i=1}^{n}(x_{ij}-\bar{x}_j)(x_{ik}-\bar{x}_k) $$

This is the formula for the prediction of the next single observation when the population mean and variance are unknown (we only have estimates of these parameters from the sample). This formula is also equivalent to the formula for the calculation of a tolerance region (called a Type 2 tolerance region by Chew (1966)):
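As a quick sanity check on this definition, the formula for $S$ matches what `np.cov` computes with its default unbiased ($n-1$) normalization (a minimal sketch with simulated data; the variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((100, 3))  # n = 100 observations of a p = 3 variable
n = x.shape[0]
xbar = x.mean(axis=0)              # sample mean (dimension p)
# S_{j,k} = 1/(n-1) * sum_i (x_ij - xbar_j)(x_ik - xbar_k):
S = (x - xbar).T @ (x - xbar) / (n - 1)
print(np.allclose(S, np.cov(x, rowvar=False)))  # True
```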

The 95% tolerance hyperellipsoid is a tolerance region for a sample of a multivariate random variable such that the average or expected value of the proportion of the population contained in the hyperellipsoid is 95%.

In this context, we can refer to the calculated hyperellipsoid as a prediction or tolerance hyperellipsoid (but prediction and tolerance intervals are in general different concepts in statistics).

The function `hyperellipsoid.py` (code at the end of this text) calculates the prediction or tolerance hyperellipsoid and some related parameters, and plots the results for a given multivariate random variable. The function signature is:

```
hypervol, saxes, angles, p0, R = hyperellipsoid(P, y=None, z=None, pvalue=.95,
                                                units=None, show=True, ax=None)
```

Before showing how to use this code, let's look a bit more at the statistical principles behind the estimation of a prediction or tolerance interval.

For $p$ = 1 (univariate case) the 95% prediction interval formula above reduces to:

$$ \left|\frac{x_{n+1}-\bar{x}}{s}\right| \leq \sqrt{F_{(0.95,\: 1,\: n-1)}(1+1/n)} $$

where $s$ is the square root of the sample variance, i.e., the sample standard deviation.

And considering that $F_{(0.95,\: 1,\: n-1)}$ is equal to $T^2_{(0.975,\: n-1)}$, the square of the Student's t-distribution, finally we have:

$$ \left|x_{n+1}-\bar{x}\right| \leq s\:T_{(0.975,\: n-1)}\sqrt{1+1/n} $$

Indeed, this is the 95% prediction interval used for a sample of a univariate variable; see Confidence and prediction intervals.
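We can verify the F-to-t equivalence numerically and compute the resulting interval (a minimal sketch with simulated data; the sample `x` is illustrative):

```python
import numpy as np
from scipy import stats

n = 20
# F(0.95, 1, n-1) equals the square of the t(0.975, n-1) quantile:
f = stats.f.ppf(0.95, 1, n - 1)
t = stats.t.ppf(0.975, n - 1)
print(np.isclose(f, t**2))  # True

# 95% prediction interval for the next observation of a univariate sample:
rng = np.random.default_rng(1)
x = rng.standard_normal(n)
half_width = x.std(ddof=1) * t * np.sqrt(1 + 1/n)
print(x.mean() - half_width, x.mean() + half_width)
```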

For large $n$, according to the central limit theorem, we can use the fact that $T(\infty)=N$, where $N$ is the normal distribution, and the above statement can be simplified to:

$$ \left|x_{n+1}-\bar{x}\right| \leq s\:N_{0.975} $$

$N_{0.975}$ is equal to 1.96, a commonly used value for calculating 95%-probability intervals.

Let's calculate the error of this approximation for different sample sizes.

First let's import the necessary Python libraries:

In [1]:

```
# import the necessary libraries
import numpy as np
from scipy import stats
```

In [2]:

```
F = lambda p, n: np.sqrt(stats.f.ppf(0.95, p, n-p)*(n-1)*p*(n+1)/n/(n-p))  # exact F-based bound
N = stats.norm.ppf(0.975)  # 97.5% quantile of the normal distribution
for i in [1000, 100, 30, 10]:
    print('\nApproximation error for n = %d' % i)
    print('Using Normal distribution: %.1f%%' % (100*(N - F(1, i))/F(1, i)))
```

For n = 1000 the approximation is probably good enough; for n = 10 it is poor. Note that the approximation always underestimates the exact bound.

For $p$ = 2 (bivariate case) the 95% prediction interval formula above reduces to (again, these probabilistic bounds hold 95% of the time; Chew, 1966):

$$ (x_{n+1}-\bar{x})'\: S^{-1} (x_{n+1}-\bar{x}) \leq \frac{2(n-1)(n+1)}{n\:(n-2)}F_{(0.95,\: 2,\: n-2)} $$

This formula is also equivalent to the formula for the calculation of a tolerance region such that the average or expected value of the proportion of the population contained in this region is exactly 95% (called a Type 2 tolerance region by Chew (1966)).

For large $n$, according to the central limit theorem, we can use the fact that $2F(2,\infty)=\chi^2(2)$, where $\chi^2$ is the chi-square distribution, and the formula above for the 95% prediction interval for the bivariate case can be simplified to:

$$ (x_{n+1}-\bar{x})'\: S^{-1} (x_{n+1}-\bar{x}) \leq \chi^2_{0.95,\: 2} $$

Or, equivalently, we can use the fact that $\sqrt{2F(2,\infty)}=R$, where $R$ is the Rayleigh distribution, and:

$$ (x_{n+1}-\bar{x})'\: S^{-1} (x_{n+1}-\bar{x}) \leq (R_{0.95})^2 $$

Let's calculate the error of this approximation for different sample sizes:

In [3]:

```
C = lambda p: np.sqrt(stats.chi2.ppf(0.95, p))  # chi-square distribution
R = stats.rayleigh.ppf(0.95)  # Rayleigh distribution
for i in [1000, 100, 30, 10]:
    print('\nApproximation error for n = %d' % i)
    print('Using Chi-square distribution: %.1f%%' % (100*(C(2) - F(2, i))/F(2, i)))
    print('Using Rayleigh distribution: %.1f%%' % (100*(R - F(2, i))/F(2, i)))
```

Again, for n = 1000 the approximation is probably good enough; for n = 10 it is poor, and it always underestimates the exact bound.

To read more about multivariate prediction and confidence intervals, see Chew (1966), and about the use of prediction ellipse in the field of posturography, see Schubert and Kirchner (2014).

We can write code to calculate the ellipse, the ellipsoid, and, in the general case, the hyperellipsoid using the formula for the prediction hyperellipsoid (this code is published in Duarte (2015)). The directions and lengths of the hyperellipsoid semi-axes are found, respectively, as the eigenvectors and eigenvalues of the covariance matrix of the data, using the concept of principal component analysis (PCA) or singular value decomposition (SVD), and the lengths of the semi-axes are scaled to attain the desired prediction probability. The volume of the hyperellipsoid is calculated with the equation for the volume of a p-dimensional ball, with the radius replaced by the semi-axes of the hyperellipsoid.

In Python, such code to compute the hypervolume of the 95% prediction hyperellipsoid is:

```
import numpy as np                                # import Numpy package
from scipy.stats import f as F                    # import F distribution
from scipy.special import gamma                   # import Gamma function
# data: (n, p) array with n observations of a p-dimensional variable
n, p = np.asarray(data).shape                     # 2-D array dimensions
cov = np.cov(data, rowvar=0)                      # covariance matrix of data
U, s, Vt = np.linalg.svd(cov)                     # singular value decomposition
f95 = F.ppf(.95, p, n-p)*(n-1)*p*(n+1)/n/(n-p)    # bound from the F distribution
saxes = np.sqrt(s*f95)                            # semi-axes lengths
hypervolume = np.pi**(p/2)/gamma(p/2+1)*np.prod(saxes)
hypervolume
```

For instance, if `data` has two columns, `hypervolume` will contain the area of the 95% prediction ellipse.

The equivalent code for a Matlab-like environment to compute the hypervolume of the 95% prediction hyperellipsoid is:

```
[n, p] = size(data); % 2-D array dimensions
covar = cov(data); % covariance matrix of data
[U, S, V] = svd(covar); % singular value decomposition
f95 = finv(.95,p,n-p)*(n-1)*p*(n+1)/n/(n-p); % F 95 percent point function
saxes = sqrt(diag(S)*f95); % semi-axes lengths
hypervolume = pi^(p/2)/gamma(p/2+1)*prod(saxes);
hypervolume
```

Let's now see a more complete function, `hyperellipsoid.py` (code at the end of this text), to compute the prediction hyperellipsoid and related variables:

In [4]:

```
import matplotlib.pyplot as plt
%matplotlib inline
import sys
sys.path.insert(1, r'./../functions') # add to pythonpath
from hyperellipsoid import hyperellipsoid
```

For bivariate random data with a normal distribution with mean 0 and variance 1, the prediction ellipse should be a circle:

In [5]:

```
P = np.random.randn(10000, 2)
area, axes, angles, center, R = hyperellipsoid(P, units='cm', show=True)
print('Area =', area)
print('Semi-axes =', axes)
print('Angles =', angles)
print('Center =', center)
print('Rotation matrix =\n', R)
```

For other data:

In [6]:

```
y = np.cumsum(np.random.randn(3000)) / 50
x = np.cumsum(np.random.randn(3000)) / 100
area, axes, angles, center, R = hyperellipsoid(x, y, units='cm', show=True)
print('Area =', area)
print('Semi-axes =', axes)
print('Angles =', angles)
print('Center =', center)
print('Rotation matrix =\n', R)
```

The **outside_hyperellipsoid** function can be used to determine whether specific points lie outside the ellipse as demonstrated below.

Note that the proportion of points outside a 95% ellipse is expected to tend toward 0.05 as n increases.

In [7]:

```
from hyperellipsoid import outside_hyperellipsoid
#Calculate ellipse and whether or not points lie outside:
np.random.seed(6)
P = np.random.randn(100, 2) #points
volume, axes, angles, center, R = hyperellipsoid(P, pvalue=.95, show=False)
b = outside_hyperellipsoid(axes, angles, center, R, P) #binary hit-miss (outside ellipse?)
p = P[b] #points outside ellipse
#Plot:
fig = plt.figure(figsize=(4, 4))
ax = fig.add_axes([0, 0, 1, 1])
ax.plot(p[:,0], p[:,1], 'ro', ms=8) #points outside ellipse plotted as large red dots
volume, axes, angles, center, R = hyperellipsoid(P, pvalue=.95, show=True, ax=ax)
# Report:
print('Number of points outside ellipsoid: %d' %b.sum())
print('Proportion of points outside ellipsoid: %.3f' %b.mean())
```
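The tendency toward 0.05 can also be checked without the `hyperellipsoid` module, by comparing squared Mahalanobis distances directly against the F-based bound from the prediction formula (a minimal sketch with simulated data; strictly, the bound applies to a new observation, but for large $n$ applying it to the sample itself yields nearly the same proportion):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n, p = 10000, 2
P = rng.standard_normal((n, p))                  # simulated bivariate sample
xbar = P.mean(axis=0)
Sinv = np.linalg.inv(np.cov(P, rowvar=False))    # inverse sample covariance
# squared Mahalanobis distance of each point to the sample mean:
d2 = np.einsum('ij,jk,ik->i', P - xbar, Sinv, P - xbar)
# F-based bound from the 95% prediction hyperellipsoid formula:
bound = stats.f.ppf(0.95, p, n - p)*(n - 1)*p*(n + 1)/(n*(n - p))
print((d2 > bound).mean())  # proportion outside; close to 0.05 for large n
```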

For trivariate random data with a normal distribution with mean 0 and variance 1, the prediction ellipsoid should be a sphere:

In [8]:

```
P = np.random.randn(10000, 3)
volume, axes, angles, center, R = hyperellipsoid(P, units='cm', show=True)
print('Volume =', volume)
print('Semi-axes =', axes)
print('Angles =', angles)
print('Center =', center)
print('Rotation matrix =\n', R)
```

For other data:

In [9]:

```
P = np.random.randn(1000, 3)
P[:, 2] = P[:, 2] + P[:, 1]*.5
P[:, 1] = P[:, 1] + P[:, 0]*.5
volume, axes, angles, center, R = hyperellipsoid(P, units='cm', show=True)
print('Volume =', volume)
print('Semi-axes =', axes)
print('Angles =', angles)
print('Center =', center)
print('Rotation matrix =\n', R)
```

The **outside_hyperellipsoid** function can be used to determine whether specific points lie outside the ellipsoid as demonstrated below.

Note that the proportion of points outside a 95% ellipsoid is expected to tend toward 0.05 as n increases.

In [10]:

```
from mpl_toolkits.mplot3d import Axes3D
from hyperellipsoid import outside_hyperellipsoid
#Calculate ellipsoid and whether or not points lie outside:
#Calculate ellipsoid and whether or not points lie outside:
np.random.seed(6)
P = np.random.randn(100, 3) #points
volume, axes, angles, center, R = hyperellipsoid(P, pvalue=.95, show=False)
b = outside_hyperellipsoid(axes, angles, center, R, P) #binary hit-miss (outside ellipsoid?)
p = P[b] #points outside ellipsoid
#Plot:
fig = plt.figure(figsize=(7, 7))
ax = fig.add_axes([0, 0, 1, 1], projection='3d')
ax.plot(p[:,0], p[:,1], p[:,2], 'ro', ms=8) #points outside ellipsoid plotted as large red dots
volume, axes, angles, center, R = hyperellipsoid(P, pvalue=.95, show=True, ax=ax)
# Report:
print('Number of points outside ellipsoid: %d' %b.sum())
print('Proportion of points outside ellipsoid: %.3f' %b.mean())
```

To be able to rotate and zoom the figure, for now you will have to plot it in a separate window:

In [11]:

```
# plot figure in a separate window
%matplotlib qt
```

In [12]:

```
hyperellipsoid(P, units='cm', show=True);
```