The following theory is from a section of "Statistical Pattern Recognition: A Review" (Jain et al., 2000) covering results from "A Problem of Dimensionality: A Simple Example" by Trunk (1979). The performance of a classifier depends on the number of samples seen, their dimensionality, and the complexity of the classifier, and choosing that complexity involves a tradeoff between bias and variance. What may be unintuitive is that additional information from more dimensions, even if each dimension is informative and non-redundant, can decrease classification accuracy when the number of samples remains small relative to the number of features. Trunk (1979) provides an example of this in a simple setting.
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set_context('talk')
import warnings
warnings.simplefilter("ignore")
Consider a two-class classification problem where each class has the same prior probability and a $d$-dimensional Gaussian distribution with identity covariance matrix (i.e. all dimensions are independent of one another, each with variance 1). The two class means $\mu_1$ and $\mu_2$ for class 1 and class 2, respectively, satisfy $\mu_1 = -\mu_2 = \mu$ where \begin{align} \mu &= \big(\frac{1}{\sqrt{1}}, \frac{1}{\sqrt{2}}, \dots, \frac{1}{\sqrt{d}} \big) \end{align} Each feature adds new information; however, the information contributed by each successive feature decreases monotonically and converges to 0.
We consider this classification setting, in which data are drawn independently and identically distributed with equal prior probability $\pi = \frac{1}{2}$ for each class, and the only unknown parameter is the mean.
d = 10
mu1 = 1 / np.sqrt(np.arange(1, d+1))
mu2 = -mu1
f, ax = plt.subplots(1,1,figsize=(8,3))
ax.plot(np.arange(1,d+1), mu1 - mu2)
ax.set_xticks(np.arange(1,d+1))
ax.set_xlabel('Dimension')
ax.set_ylabel('Difference')
ax.set_title('Difference in class mean across dimensions')
plt.show()
# Simulation: draw n points, assigning each to class +1 or -1 with
# probability 1/2; class means are +mu and -mu with identity covariance.
def simulate(n, d):
    n1 = np.random.binomial(n, 0.5)
    n2 = n - n1
    mu = 1 / np.sqrt(np.arange(1, d + 1))
    X = np.vstack((
        np.random.normal(mu, 1, (n1, d)),
        np.random.normal(-mu, 1, (n2, d)),
    ))
    return X, np.asarray([1] * n1 + [-1] * n2)
class BayesClassifier:
    def __init__(self, method):
        self.method = method
        self.mu = None

    def fit(self, X, y):
        # Plug-in estimate of mu: since the labels are +/-1 and the class
        # means are +mu and -mu, averaging y_i * x_i estimates mu.
        if self.method == 'naive':
            self.mu = np.mean(X * y[:, None], axis=0)
        return self

    def predict(self, x):
        if self.method == 'optimal':
            # The true mean is known: mu_i = 1 / sqrt(i).
            self.mu = 1 / np.sqrt(np.arange(1, x.shape[1] + 1))
        # Classify by the sign of the projection onto mu.
        proj = x @ self.mu
        return 2 * (proj > 0).astype(int) - 1
With Gaussian models for the distributions of each class, we classify an observation as the class with the greatest likelihood. Given a mean $\mu$ and covariance $\Sigma$, the likelihood of an observation $x$ is
$$P( x ; \mu, \Sigma) = \frac{1}{\sqrt{(2\pi)^d|\Sigma|}}e^{-\frac{1}{2}(x - \mu)^T\Sigma ^{-1}(x - \mu)}$$ Because we know our covariance matrix to be the identity, this simplifies to
$$P( x ; \mu) = \frac{1}{\sqrt{(2\pi)^d}}e^{-\frac{1}{2}(x - \mu)^T(x - \mu)}$$ and we make the classification $\hat{y} = \operatorname{arg\,max}_{y \in \{1,2\}} P( x ; \mu_y)$. In the case of the Bayes optimal classifier (where we know all the parameters), the probability of error for a given dimension $d$ can be written analytically as
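One intermediate step worth making explicit: for this symmetric setup, comparing the two likelihoods reduces to a linear projection rule. Taking the log-ratio of the class likelihoods,
\begin{align} \log \frac{P(x; \mu)}{P(x; -\mu)} &= -\frac{1}{2}(x - \mu)^T(x - \mu) + \frac{1}{2}(x + \mu)^T(x + \mu) = 2\mu^T x, \end{align}
so we assign class 1 exactly when $\mu^T x > 0$ and class 2 otherwise. This is why the `predict` method above only needs the sign of the projection `x @ self.mu`.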
$$P_{error}(d) = \int^\infty_{\sqrt{\sum_{i=1}^{d}\frac{1}{i}}} \frac{1}{\sqrt{2\pi}} e^{-\frac{1}{2}z^2} dz$$.
Clearly, in the limit as $d \rightarrow \infty$, the error $P_{error}(d) \rightarrow 0$ because the lower limit of integration diverges. This means that we can perfectly discriminate the two classes in the limit if we know the parameters.
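This limit is easy to check numerically. A minimal sketch (the function name `bayes_optimal_error` is mine, not from the source): the squared norm $\|\mu\|^2$ is the partial harmonic sum $H_d = \sum_{i=1}^d \frac{1}{i}$, and the Gaussian upper tail beyond $\sqrt{H_d}$ can be written with the standard library's `math.erfc`.

```python
import math

def bayes_optimal_error(d):
    # ||mu||^2 equals the partial harmonic sum H_d = sum_{i=1}^d 1/i.
    h = sum(1.0 / i for i in range(1, d + 1))
    # P_error(d) is the upper tail of a standard normal beyond sqrt(H_d):
    # integral_{sqrt(H_d)}^inf phi(z) dz = 0.5 * erfc(sqrt(H_d / 2)).
    return 0.5 * math.erfc(math.sqrt(h / 2.0))

for d in [1, 10, 100, 10_000]:
    print(d, bayes_optimal_error(d))
```

Since $H_d$ grows like $\ln d$, the error decays toward 0, albeit slowly.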
However, as in most real cases, we do not know the true parameters and need to estimate them. Here, we assume that only the means are unknown, which is a generous assumption. Thus, for each class we have our maximum-likelihood estimate $\hat{\mu}_i$ of the mean, and as Trunk showed, the analytical probability of error is
\begin{align} P_{error}(n,d) &= \int^{\infty}_{\theta(n,d)}\frac{1}{\sqrt{2\pi}} e^{-\frac{1}{2}z^2} dz,\\ \theta(n,d) &= \frac{\sum_{i=1}^d (\frac{1}{i})}{\sqrt{(1 + \frac{1}{n})\sum_{i=1}^d (\frac{1}{i}) + \frac{d}{n}}}. \end{align}In this case, however, it is apparent that for a fixed $n$, $\theta(n,d) \rightarrow 0$ as $d \rightarrow \infty$. Thus, the probability of error integrates over half of the symmetric density and $P_{error}(n,d) \rightarrow \frac{1}{2}$ in the limit. Thus, if we do not know the parameters, increasing the number of dimensions catastrophically hurts our classification performance.
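Evaluating $\theta(n, d)$ directly makes this limit concrete. A small sketch (the helper name `plugin_error` is mine, not from the source), fixing $n = 100$:

```python
import math

def plugin_error(n, d):
    # Trunk's error for the classifier that uses estimated means.
    h = sum(1.0 / i for i in range(1, d + 1))  # sum_{i=1}^d 1/i
    theta = h / math.sqrt((1.0 + 1.0 / n) * h + d / n)
    # Upper Gaussian tail beyond theta(n, d).
    return 0.5 * math.erfc(theta / math.sqrt(2.0))

n = 100
for d in [10, 100, 10_000, 1_000_000]:
    print(d, plugin_error(n, d))
```

The numerator grows like $\ln d$ while the denominator grows like $\sqrt{d/n}$, so the error first improves with added features and then climbs back toward $\frac{1}{2}$.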
This distinction between the optimal and the learned classifier is demonstrated empirically in the next plot, and the predicted behavior holds.
dim_range = np.asarray([10, 50, 100, 250, 500, 1000, 2500, 5e3, 1e4,
                        5e4, 1e5, 5e5, 1e6]).astype(int)
optimal_errors = []
naive_errors = []
for d in dim_range:
    # Note: the largest values of d require several GB of memory.
    X_train, y_train = simulate(n=100, d=d)
    X_test, y_test = simulate(n=2000, d=d)
    bayes_optimal = BayesClassifier('optimal')
    bayes_naive = BayesClassifier('naive').fit(X_train, y_train)
    # Misclassification rate on the held-out test set.
    optimal_errors.append(np.mean(y_test != bayes_optimal.predict(X_test)))
    naive_errors.append(np.mean(y_test != bayes_naive.predict(X_test)))

f, ax = plt.subplots(1, 1, figsize=(7, 3))
ax.plot(dim_range, optimal_errors, label='Bayes Optimal')
ax.plot(dim_range, naive_errors, label='Naive Bayes')
ax.set_xscale('log')
ax.set_xlabel('Number of dimensions')
ax.set_ylabel('Expected Error')
ax.set_title("A simulation of Trunk's Example")
ax.legend()
plt.show()
We clearly see that the Naive Bayes classifier approaches chance-level error ($\frac{1}{2}$). In summary, when learning a classifier from a finite amount of data, we do not want to arbitrarily add more features, even informative ones. Rather, as designers of a practical system, we should use the features that are most relevant and least obscured by noise given our sample size.