The K centroids (p-dimensional) lie in an affine subspace of dimension $\le K - 1$. Moreover, in locating the closest centroid, we can ignore distances orthogonal to this subspace, since they contribute equally to each class. Thus, we might as well project the sphered data $X^{*}$ onto this subspace.
If K = 3, this could allow us to view the data in a two-dimensional plot.
If K > 3, we might instead ask for an $L < K - 1$ dimensional subspace $H_L \subseteq H_{K-1}$. In summary, finding the sequence of optimal subspaces for LDA involves the following steps:
- Compute the $K \times p$ matrix of class centroids $\mathbf{M}$ and the common covariance matrix $\mathbf{W}$ (the within-class covariance);
- Compute $\mathbf{M}^{*}=\mathbf{M}\mathbf{W}^{-1/2}$ using the eigen-decomposition of $\mathbf{W}$;
- Compute $\mathbf{B}^{*}$, the covariance matrix of $\mathbf{M}^{*}$ ($\mathbf{B}$ being the between-class covariance), and its eigen-decomposition $\mathbf{B}^{*}=\mathbf{V}^{*}\mathbf{D}_B\mathbf{V}^{*T}$. The columns $v_l^{*}$ of $\mathbf{V}^{*}$, in sequence from first to last, define the coordinates of the optimal subspaces.
Combining all these operations, the $l$th discriminant variable is given by $Z_l = v_l^T X$ with $v_l=\mathbf{W}^{-1/2}v_l^{*}$.
TODO: Implement the algorithm and plot Figure 4.8.
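The three steps can be sketched in numpy as follows. This is only a sketch under my own assumptions (labels `y` as integers, pooled covariance with the usual $n - K$ divisor), not the book's code; reproducing Figure 4.8 itself would additionally require the vowel data.

```python
import numpy as np

def discriminant_coords(X, y, L=2):
    """First L discriminant variables of reduced-rank LDA.

    X: (n, p) data matrix; y: (n,) integer class labels.
    Returns Z: (n, L).
    """
    classes = np.unique(y)
    K = len(classes)

    # Step 1: K x p matrix of class centroids M and pooled
    # within-class covariance W.
    M = np.stack([X[y == k].mean(axis=0) for k in classes])
    W = sum(np.cov(X[y == k], rowvar=False) * (np.sum(y == k) - 1)
            for k in classes) / (len(y) - K)

    # Step 2: sphere the centroids, M* = M W^{-1/2}, via the
    # eigen-decomposition of W (symmetric positive definite).
    d, U = np.linalg.eigh(W)
    W_inv_sqrt = U @ np.diag(d ** -0.5) @ U.T
    M_star = M @ W_inv_sqrt

    # Step 3: B* is the covariance of the sphered centroids; its
    # eigenvectors v_l*, in order of decreasing eigenvalue, give the
    # coordinates of the optimal subspaces.
    dB, V_star = np.linalg.eigh(B_star := np.cov(M_star, rowvar=False))
    V_star = V_star[:, np.argsort(dB)[::-1]]

    # Discriminant variables Z_l = v_l^T X with v_l = W^{-1/2} v_l*.
    V = W_inv_sqrt @ V_star[:, :L]
    return X @ V
```

Plotting `Z[:, 0]` against `Z[:, 1]`, colored by class, gives a view of the data in the best two-dimensional discriminant subspace.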
Fisher arrived at this decomposition via a different route. He posed the problem:
*Find the linear combination $Z = a^T X$ such that the between-class variance is maximized relative to the within-class variance.*
The between-class variance of Z is $a^T\mathbf{B}a$ and within-class variance $a^T\mathbf{W}a$, where W is defined earlier, and B is the covariance matrix of the class centroid matrix M. Note that $\mathbf{B}+\mathbf{W}=\mathbf{T}$, where T is the total covariance matrix of X, ignoring class information.
Fisher's problem amounts to maximizing the Rayleigh quotient (4.15): $$ \max_{a} \cfrac{a^T\mathbf{B}a}{a^T\mathbf{W}a} $$
or equivalently (4.16):
$$ \max_{a} a^T\mathbf{B}a \text{ subject to } a^T\mathbf{W}a = 1. $$

This is a generalized eigenvalue problem, with $a$ given by the eigenvector corresponding to the largest eigenvalue of $\mathbf{W}^{-1}\mathbf{B}$.
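A quick numerical sketch of this generalized eigenvalue problem, using made-up matrices $\mathbf{B}$ and $\mathbf{W}$ (assumptions for illustration, not data from the book): `scipy.linalg.eigh(B, W)` solves $\mathbf{B}a = \lambda\mathbf{W}a$ for symmetric $\mathbf{B}$ and positive definite $\mathbf{W}$, returning eigenvalues in ascending order.

```python
import numpy as np
from scipy.linalg import eigh  # generalized symmetric eigensolver

# Toy covariance matrices (assumed, for illustration only).
rng = np.random.default_rng(0)
A = rng.normal(size=(3, 3))
W = A @ A.T + 3 * np.eye(3)   # "within-class" covariance, positive definite
C = rng.normal(size=(3, 2))
B = C @ C.T                   # "between-class" covariance, rank <= 2

# eigh(B, W) solves B a = lambda W a with eigenvalues in ascending order,
# so the last column maximizes the Rayleigh quotient a^T B a / a^T W a.
# Its eigenvectors are normalized so that a^T W a = 1, matching (4.16).
evals, evecs = eigh(B, W)
a = evecs[:, -1]
```

The maximized Rayleigh quotient equals `evals[-1]`, the largest eigenvalue of $\mathbf{W}^{-1}\mathbf{B}$.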
To summarize the developments so far:
- Classification can be achieved by sphering the data with respect to $\mathbf{W}$.
- Since only the relative distances to the centroids count, one can confine the data to the subspace spanned by the centroids in the sphered space.
- This subspace can be further decomposed into successively optimal subspaces in terms of centroid separation. This decomposition is identical to the decomposition due to Fisher.
TODO: Finish the last paragraphs.