Notebook

2.3 高斯分布¶

高斯分布，又叫正态分布，是连续变量经常使用的一个分布模型，一维的高斯分布如下：

$$ \mathcal{N}\left(x\left|~\mu,\sigma^2\right.\right) = \frac{1}{(2\pi\sigma^2)^{1/2}} \exp\left\{-\frac{1}{2\sigma^2}(x-\mu)^2\right\} $$

其中 $\mu$ 是均值，$\sigma$ 是方差。

$D$-维的高斯分布如下：

$$ \mathcal{N}\left(\mathbf x\left|~\mathbf{\mu, \Sigma}\right.\right) = \frac{1}{(2\pi)^{D/2}} \frac{1}{|\mathbf\Sigma|^{1/2}} \exp \left\{-\frac{1}{2}(\mathbf x - \mathbf \mu)^\top\mathbf\Sigma^{-1}(\mathbf x - \mathbf \mu)\right\} $$

其中，$D$ 维向量 $\mathbf \mu$ 是均值，$D\times D$ 矩阵 $\mathbf\Sigma$ 是方差，$|\mathbf\Sigma|$ 是其行列式。

之前我们已经看到，在均值和方差固定时，高斯函数是熵最大的连续分布，因此高斯分布的应用十分广泛。

而中心极限定理告诉我们，对于某个分布一组样本 $x_1, \dots, x_N$，他们的均值 $(x_1+\dots+x_N)/N$ 的分布会随着 $N$ 的增大而越来越接近一个高斯分布。

验证高斯分布¶

下面，我们来验证高斯分布是一个概率分布。

考虑高斯分布的结构中跟 $\mathbf x$ 有关的这个二次型：

$$ \Delta^2 = (\mathbf x - \mathbf \mu)^\top\mathbf\Sigma^{-1}(\mathbf x - \mathbf \mu) $$

这里的 $\Delta$ 叫做 $\mathbf \mu$ 和 $\mathbf x$ 的马氏距离（Mahalanobis distance），当 $\mathbf \Sigma$ 是单位矩阵的时候为欧氏距离。

在 $\Delta^2$ 相等的地方，高斯分布的概率密度也相等。

不失一般性，我们只需要考虑 $\mathbf\Sigma$ 是对称的情况。

事实上，如果 $\mathbf\Sigma$ 不对称，考虑其逆矩阵 $\mathbf\Lambda$，它总可以写成一个对称矩阵和反对称矩阵的形式：

$$\mathbf\Lambda = \mathbf\Lambda^S + \mathbf\Lambda^A =\frac{\mathbf{\Lambda + \Lambda^\top}}{2} + \frac{\mathbf{\Lambda - \Lambda^\top}}{2}$$

而这个反对称的矩阵的对角线均为 0，从而它的二次型为 $0$ 因此，我们有：

$$ (\mathbf x - \mathbf \mu)^\top\mathbf\Lambda(\mathbf x - \mathbf \mu) = (\mathbf x - \mathbf \mu)^\top\mathbf\Lambda^S(\mathbf x - \mathbf \mu) $$

所以 $\mathbf\Lambda$ 可以等价于 $\mathbf\Lambda^S$，而对称矩阵的逆矩阵也是对称的，因此只需要考虑 $\mathbf\Sigma$ 是对称的情况。

对于实对称矩阵，可以找到一组正交的特征向量使得：

$$ \mathbf{\Sigma u}_i = \lambda_i,\lambda_i\geq=0 \mathbf u_i, \mathbf u_i^\top\mathbf u_j = I_{ij}， i,j=1,\dots,D, $$

其中 $I_{ij}$ 表示单位矩阵的第 $i$ 行 $j$ 列的元素：

$$ I_{ij} = \left\{\begin{align} 1, &~i=j \\ 0, &~i \neq j \end{align}\right. $$

从而 $\mathbf\Sigma$ 可以表示为

$$ \mathbf\Sigma = \sum_{i=1}^D \lambda_i \mathbf u_i\mathbf u_i^\top = \mathbf{U\Lambda U}^\top $$

其中 $\Lambda =\text{diag}(\lambda_1, \dots,\lambda_D)$ 为对角阵，$\mathbf U$ 是 $u_1,\dots,u_D$ 按列拼成的正交矩阵 $\mathbf U^\top\mathbf U = \mathbf U\mathbf U^\top = \mathbf I$。

由正交性：

$$ \mathbf\Sigma^{-1} = (\mathbf{U\Lambda U}^\top)^{-1} = (\mathbf U^\top)^{-1}\mathbf\Lambda^{-1} \mathbf U^{-1} = \mathbf U \mathbf\Lambda^{-1} \mathbf U^\top = \sum_{i=1}^D \frac{1}{\lambda_i} \mathbf u_i\mathbf u_i^\top $$

令 $y_i=\mathbf u_i^\top (\mathbf{x-\mu})$，则

$$ \Delta^2 = \sum_{i=1}^D \frac{y_i^2}{\lambda_i} $$

写成向量形式：

$$ \mathbf y = \mathbf{U}^\top\mathbf{(x-\mu)} $$

因此，马氏距离相当于将 $\mathbf{x}$ 做平移变换后，再在 $\mathbf u_i$ 方向上作了一个大小为 $\lambda_i^{1/2}$ 的尺度变换。

为了满足高斯分布定义的合法性，需要使得所有的 $\lambda_i$ 均大于 0，即 $\mathbf\Sigma$ 是正定的。

考虑新的坐标系下的高斯分布形式，由 $\mathbf U$ 的正交性，我们有

$$ \mathbf x = \mathbf{U} \mathbf{y} + \mu $$

因此 $\mathbf x$ 对 $\mathbf y$ 的雅克比矩阵为：

$$ J_{ij} = \frac{\partial x_i}{\partial y_j} = U_{ij} $$

因此其行列式的绝对值为 1：

另一方面，矩阵 $\mathbf\Sigma$ 的行列式可以成特征值的乘积：

$$ |\mathbf\Sigma| = \prod_{j=1}^D \lambda_j $$

所以在新的坐标系下的分布为：

$$ p(\mathbf y) = p(\mathbf x)|\mathbf J|=\prod_{j=1}^D \frac{1}{\lambda_j^{1/2}}\exp\left(-\frac{y_j^2}{2\lambda_j}\right) $$

即 $D$ 个独立的单变量高斯分布的乘积。

之前我们证明过单变量高斯分布的积分是 1，因此：

$$ \int p(\mathbf y)\mathbf y = \prod_{j=1}^D \int_{-\infty}^{\infty} \frac{1}{\lambda_j^{1/2}}\exp\left(-\frac{y_j^2}{2\lambda_j}\right) dy_j = 1 $$

所以多维高斯分布的确是归一化的，非负性显然，所以它是一个概率分布。

In [1]:

import numpy as np
import scipy as sp
import matplotlib.pyplot as plt

%matplotlib inline

fig, ax = plt.subplots()

tt = np.linspace(-np.pi, np.pi, 100)

xx = 1 * np.sin(tt)
yy = 2 * np.cos(tt)

ax.plot(- xx * np.sin(np.pi/6) - yy * np.cos(np.pi/6), 
        xx * np.cos(np.pi/6) - yy * np.sin(np.pi/6), 'r')

ax.set_xlim([-3, 3])
ax.set_ylim([-2, 2])
ax.set_xlabel(r"$x_1$", fontsize="xx-large")
ax.set_ylabel(r"$x_2$", fontsize="xx-large")

ax.set_xticks([])
ax.set_yticks([])

ax.text(0, 0.3, r"$\mathbf{\mu}$", fontsize="xx-large")

ax.annotate(r"$y_1$",
            fontsize="xx-large",
            xy=(-2.5, -2.5 / np.sqrt(3)), xycoords='data',
            xytext=(2.5, 2.5 / np.sqrt(3)), textcoords='data',
            arrowprops=dict(arrowstyle="<-",
                            connectionstyle="arc3",
                            color="blue"))
ax.annotate(r"$y_2$",
            fontsize="xx-large",
            xy=(1, -np.sqrt(3)), xycoords='data',
            xytext=(-1, np.sqrt(3)), textcoords='data',
            arrowprops=dict(arrowstyle="<-",
                            connectionstyle="arc3",
                            color="blue"))

ax.annotate("",
            xy=(0.55, -0.55 * np.sqrt(3)), xycoords='data',
            xytext=(0.55 + np.sqrt(3), 1-0.55 * np.sqrt(3)), textcoords='data',
            arrowprops=dict(arrowstyle="<->",
                            connectionstyle="arc3"), 
            )
ax.annotate("",
            xy=(-1.05*np.sqrt(3), -1.05), xycoords='data',
            xytext=(-1.05*np.sqrt(3)-0.5, -1.05+0.5*np.sqrt(3)), textcoords='data',
            arrowprops=dict(arrowstyle="<->",
                            connectionstyle="arc3"), 
            )

ax.text(-2.7, -1, r"$\lambda_2^{1/2}$", fontsize="xx-large")
ax.text(1.4, -0.9, r"$\lambda_1^{1/2}$", fontsize="xx-large")

plt.show()

均值和协方差¶

计算一阶矩即均值，令 $\mathbf{z=x-\mu}$：

$$ \begin{aligned} \mathbb E[\mathbf x] & = \frac{1}{(2\pi)^{D/2}} \frac{1}{|\mathbf\Sigma|^{1/2}} \int \exp \left\{-\frac{1}{2}(\mathbf x - \mathbf \mu)^\top\mathbf\Sigma^{-1}(\mathbf x - \mathbf \mu)\right\} \mathbf x d\mathbf x\\ & = \frac{1}{(2\pi)^{D/2}} \frac{1}{|\mathbf\Sigma|^{1/2}} \int \exp \left\{-\frac{1}{2}\mathbf z^\top\mathbf\Sigma^{-1}\mathbf z\right\} \mathbf{(z+\mu)} d\mathbf z \end{aligned} $$

含有 $\mathbf{z + \mu}$ 中 $\mathbf z$ 的部分会消失，因为此时 $\mathbf z$ 的部分是一个奇函数，而 $\mathbf \mu$ 的部分相当于它乘上一个多维高斯分布的积分，因此：

$$ \mathbb E[\mathbf x] = \mu $$

考虑二阶矩，令 $\mathbf{z=x-\mu}$：

$$ \begin{aligned} \mathbb E[\mathbf x\mathbf x^\top] & = \frac{1}{(2\pi)^{D/2}} \frac{1}{|\mathbf\Sigma|^{1/2}} \int \exp \left\{-\frac{1}{2}(\mathbf x - \mathbf \mu)^\top\mathbf\Sigma^{-1}(\mathbf x - \mathbf \mu)\right\} \mathbf x \mathbf x^\top d\mathbf x\\ & = \frac{1}{(2\pi)^{D/2}} \frac{1}{|\mathbf\Sigma|^{1/2}} \int \exp \left\{-\frac{1}{2}\mathbf z^\top\mathbf\Sigma^{-1}\mathbf z\right\} \mathbf{(z+\mu)} \mathbf{(z+\mu)}^\top d\mathbf z \end{aligned} $$

展开后面的乘积，$\mathbf{\mu z}^\top$ 和 $\mathbf{z\mu}^\top$ 项的积分会因为奇函数的性质消去，$\mathbf{\mu\mu}^\top$ 项是常数乘以分布，因此只剩下 $\mathbf{\mu\mu}^\top$。现在考虑 $\mathbf{zz}^\top$ 项：

从之前的推导中，我们知道：

$$ \mathbf{z=x-\mu=Uy}=\sum_{j=1}^D y_j \mathbf u_j $$$$ \mathbf z^\top\mathbf\Sigma^{-1}\mathbf z = \sum_{k=1}^D \frac{y_k^2}{\lambda_k} $$

其中 $y_j = \mathbf u_j^\top \mathbf z$。

因此我们有：

\begin{aligned} & \frac{1}{(2\pi)^{D/2}} \frac{1}{|\mathbf\Sigma|^{1/2}} \int \exp \left\{-\frac{1}{2}\mathbf z^\top\mathbf\Sigma^{-1}\mathbf z\right\} \mathbf{z} \mathbf{z}^\top d\mathbf z \\ = & \frac{1}{(2\pi)^{D/2}} \frac{1}{|\mathbf\Sigma|^{1/2}} \sum_{i=1}^D \sum_{j=1}^D \int \exp \left\{-\sum_{k=1}^D \frac{y_k^2}{2\lambda_k}\right\} y_iy_j \mathbf u_i \mathbf u_j^\top d\mathbf y \end{aligned}

当 $i \neq j$ 时，奇函数的性质使得积分为 0，所以只有 $i = j$ 的积分项才能保留，因此：

$$ \mathbb E[\mathbf{zz}^\top] = \sum_{i=1}^D \mathbf{u}_i \mathbf u_i^\top \lambda_i = \mathbf \Sigma $$

$i=j$ 时，非 $i$ 项的 $y_k$ 的积分为 1，$i$ 项的积分相当于对一个均值为 $0$，方差为 $\lambda_i$ 的一维高斯函数求二阶矩，因此为 $0^2 + \lambda_i = \lambda_i$。

所以二阶矩为

$$ \mathbb E[\mathbf{xx}^\top] = \mathbf{\Sigma + \mu\mu}^\top $$

从而协方差矩阵为

$$ {\rm cov}[{\bf x}] = \mathbb E[(\bf x-\mathbb E[\bf x])(\bf x-\mathbb E[\bf x])^\top] = \mathbb E[\mathbf{xx}^\top] - \mathbb E[\bf x]\mathbb E[\bf x]^\top = \bf \Sigma $$

高斯分布的局限性¶

尽管高斯分布是一个广泛使用的模型，但是它仍然存在一定的局限性。

一个 $D$ 维的高斯分布的独立参数的数量级是 $O(D^2)$ 的（协方差 $D(D+1)/2$，均值 $D$），因此随着 $D$ 的增大，参数量的增长是平方量级的，此外，求逆操作的时间复杂度也随之增大。

一个解决这个问题的方法是限制协方差矩阵的形式，例如将其限制为对角形式 $\bf \Sigma={\rm diag}(\sigma_i^2)$，或者更进一步，限制为 $\bf \Sigma = \sigma^2 \bf I$，即各向同性（isotropic）的高斯分布。

但这样做限制了高斯分布的自由度。

一般高斯分布，对角线形式的高斯分布和各向同性的高斯分布的等高线如下图所示：

In [2]:

from matplotlib.mlab import bivariate_normal

x, y = np.mgrid[-1:1:.01, -1:1:.01]

fig, ax = plt.subplots(1, 3, figsize=(10, 3))

# 斜的椭圆
z = bivariate_normal(x, y, sigmax=2, sigmay=1, sigmaxy=1.5)
ax[0].contour(x, y, z, colors="r", 
              levels=np.linspace(0.1*z.min() + 0.9*z.max(), z.max(), 5))

ax[0].set_xticks([])
ax[0].set_yticks([])

# 坐标轴平行的椭圆
z = bivariate_normal(x, y, sigmax=2, sigmay=1, sigmaxy=0)
ax[1].contour(x, y, z, colors="r", 
              levels=np.linspace(0.2*z.min() + 0.8*z.max(), z.max(), 5))

ax[1].set_xticks([])
ax[1].set_yticks([])

# 圆
z = bivariate_normal(x, y, sigmax=2, sigmay=2, sigmaxy=0)
ax[2].contour(x, y, z, colors="r", 
              levels=np.linspace(0.25*z.min() + 0.75*z.max(), z.max(), 5))

ax[2].set_xticks([])
ax[2].set_yticks([])

plt.show()

高斯分布的另一个问题在于他只有一个峰值（unimodal），所以不能很好的表示多模态（multimodal）的分布。之后我们可以看到，这个问题可以通过引入隐变量来解决。

2.3.1 条件高斯分布¶

高斯分布的一个重要性质是如果两组变量的联合分布是高斯分布，那么其中一组相对于另一组的条件分布也是高斯分布，每组变量的边缘分布也是高斯的。

基本设定¶

考虑 $D$ 维的高斯分布 $p(\mathbf x)= \cal N(\bf x~|~\bf{\mu,\Sigma})$，将其分成两组变量 ${\bf x}_a, {\bf x}_b$。不失一般性，我们假设 ${\bf x}_a$ 是 $\bf x$ 的前 $M$ 个分量，${\bf x}_b$ 是 $\bf x$ 的后 $D-M$ 个分量，即

$$ {\bf x} = \begin{pmatrix} {\bf x}_a \\ {\bf x}_b \end{pmatrix} $$

对应的均值和协方差矩阵可以写为：

$$ \begin{align} {\bf \mu} & = \begin{pmatrix} {\bf \mu}_a \\ {\bf \mu}_b \end{pmatrix} \\ {\bf \Sigma} & = \begin{pmatrix} {\bf \Sigma}_{aa} & {\bf \Sigma}_{ab} \\ {\bf \Sigma}_{ba} & {\bf \Sigma}_{bb} \end{pmatrix} \end{align} $$

由 ${\bf\Sigma = \Sigma}^{\text T}$，我们可以知道 ${\bf \Sigma}_{aa}, {\bf \Sigma}_{bb}$ 是对称的，且 ${\bf \Sigma}_{ab}={\bf \Sigma}_{ba}^\top$。

为了方便，我们使用协方差的逆矩阵，或者叫精确度矩阵（precision matrix）：

$$ {\bf\Lambda \equiv \Sigma}^{-1} $$

它可以写成：

$$ {\bf \Lambda} = \begin{pmatrix} {\bf \Lambda}_{aa} & {\bf \Lambda}_{ab} \\ {\bf \Lambda}_{ba} & {\bf \Lambda}_{bb} \end{pmatrix} $$

由对称性，我们有 ${\bf \Lambda}_{aa}, {\bf \Lambda}_{bb}$ 是对称的，且 ${\bf \Lambda}_{ab}={\Lambda}_{ba}^\top$。

使用精确度矩阵表示¶

我们不从条件分布的定义出发计算，换一个角度，考虑二次型的部分：

$$ \begin{align} -\frac{1}{2}(\mathbf x - \mathbf \mu)^\top\mathbf\Sigma^{-1}(\mathbf x - \mathbf \mu) = & -\frac{1}{2}(\mathbf x_a - \mathbf \mu_a)^\top\mathbf\Lambda_{aa}(\mathbf x_a - \mathbf \mu_a) - \frac{1}{2}(\mathbf x_a - \mathbf \mu_a)^\top\mathbf\Lambda_{ab}(\mathbf x_b - \mathbf \mu_b) \\ & - \frac{1}{2}(\mathbf x_b - \mathbf \mu_b)^\top\mathbf\Lambda_{ba}(\mathbf x_a - \mathbf \mu_a) - \frac{1}{2}(\mathbf x_b - \mathbf \mu_b)^\top\mathbf\Lambda_{bb}(\mathbf x_b - \mathbf \mu_b) \end{align} $$

我们看到，对于 $\mathbf x_a$ 来说，它仍然是一个二次型，因此，条件分布 $p(\mathbf x_a|\mathbf x_b)$ 应该是一个高斯分布。

因为高斯分布由其均值和协方差完全确定，我们的目的变为计算条件分布的均值和方差。

为了计算均值和方法，我们可以使用配方法，对于一个高斯分布来说，我们有：

$$ -\frac{1}{2}(\mathbf x - \mathbf \mu)^\top\mathbf\Sigma^{-1}(\mathbf x - \mathbf \mu) = -\frac{1}{2}\mathbf x^\top \mathbf\Sigma^{-1} \mathbf x + \mathbf x^\top \mathbf\Sigma^{-1} \mathbf \mu + {\rm const} $$

其中常量表示与 $\mathbf x$ 无关的部分，另外用到了 $\bf \Sigma$ 的对称性。

这告诉我们，$\mathbf\Sigma^{-1}$ 是 $\mathbf x$ 二次项的中间部分，$\mathbf x$ 一次项的中间部分是 $\mathbf\Sigma^{-1} \mathbf \mu$。

回过头来，我们观察到二次型部分中，$\mathbf x_a$ 的二次项为：

$$ -\frac{1}{2}\mathbf x_a^\top \mathbf\Lambda_{aa} \mathbf x_a $$

我们立刻可以得到条件分布 $p(\mathbf x_a|\mathbf x_b)$ 的协方差矩阵为： $$ \mathbf \Sigma_{a|b} = \mathbf\Lambda_{aa}^{-1} $$

在考虑含有 $\mathbf x_a$ 的线性部分（利用 $\mathbf\Lambda_{ab} = \mathbf\Lambda_{ba}^\top$）：

$$ \mathbf x_a^\top\{\mathbf \Lambda_{aa}\mathbf\mu_a - \mathbf \Lambda_{ab}(\mathbf x_b - \mathbf \mu_b)\} $$

我们可以得到：

$$ \mathbf \Sigma_{a|b}^{-1}\mathbf \mu_{a|b} = \mathbf \Lambda_{aa}\mathbf\mu_a - \mathbf \Lambda_{ab}(\mathbf x_b - \mathbf \mu_b) $$

从而：

$$ \begin{align} \mathbf \mu_{a|b} & = \mathbf \Sigma_{a|b}\{\mathbf \Lambda_{aa}\mathbf\mu_a - \mathbf \Lambda_{ab}(\mathbf x_b - \mathbf \mu_b)\} \\ & = \mathbf \mu_a - \mathbf\Lambda_{aa}^{-1}\mathbf \Lambda_{ab}(\mathbf x_b - \mathbf \mu_b) \end{align} $$

使用协方差矩阵表示¶

我们也可以使用协方差矩阵的分量来表示条件分布的均值和方差，首先我们使用分块矩阵的 Schur 补（Schur complement）来表示分块矩阵的逆：

$$ \begin{pmatrix} \bf A & \bf B \\ \bf C & \bf D \\ \end{pmatrix}^{-1} = \begin{pmatrix} \bf M & {\bf -MBD}^{-1} \\ -{\bf D}^{-1}{\bf CM} & {\bf D}^{-1}+{\bf CMBD}^{-1} \\ \end{pmatrix} $$

其中 $\mathbf M =(\mathbf{A - BD}^{-1}\mathbf C)^{-1}$，是分块矩阵的 Schur 补。

结合定义式：

$$ \begin{pmatrix} {\bf \Sigma}_{aa} & {\bf \Sigma}_{ab} \\ {\bf \Sigma}_{ba} & {\bf \Sigma}_{bb} \end{pmatrix}^{-1} = \begin{pmatrix} {\bf \Lambda}_{aa} & {\bf \Lambda}_{ab} \\ {\bf \Lambda}_{ba} & {\bf \Lambda}_{bb} \end{pmatrix} $$

我们有：

$$ \begin{align} {\bf \Lambda}_{aa} & = (\mathbf{\Sigma}_{aa} - \mathbf\Sigma_{ab}\mathbf\Sigma_{bb}^{-1}\mathbf \Sigma_{ba})^{-1} \\ {\bf \Lambda}_{ab} & = -(\mathbf{\Sigma}_{aa} - \mathbf\Sigma_{ab}\mathbf\Sigma_{bb}^{-1}\mathbf \Sigma_{ba})^{-1}\mathbf\Sigma_{ab}\mathbf\Sigma_{bb}^{-1} \end{align} $$

带入均值和方差的表达式，我们有：

$$ \begin{align} \mathbf \mu_{a|b} & = \mathbf \mu_a - \mathbf\Lambda_{aa}^{-1}\mathbf \Lambda_{ab}(\mathbf x_b - \mathbf \mu_b) \\ & = \mu_a + \mathbf\Sigma_{ab}\mathbf\Sigma_{bb}^{-1} (\mathbf x_b-\mathbf \mu_b) \\ \mathbf \Sigma_{a|b} & = \mathbf\Lambda_{aa}^{-1} \\ & = \mathbf{\Sigma}_{aa} - \mathbf\Sigma_{ab}\mathbf\Sigma_{bb}^{-1}\mathbf \Sigma_{ba} \\ \end{align} $$

可以看到，使用精确度矩阵表示形式上更简单一些。

注意到，条件分布 $p(\mathbf x_a|\mathbf x_b)$ 的均值与 $\mathbf x_b$ 呈线性关系，而协方差与 $\mathbf x_a$ 独立，这表示了一个线性高斯模型。

2.3.2 边缘高斯分布¶

现在考虑边缘分布的形式：

$$ p(\mathbf x_a)=\int p(\mathbf x_a, \mathbf x_b) d\mathbf x_b $$

它也是一个高斯分布。

因为是对 $\mathbf x_b$ 的积分，我们考虑二次型中关于 $\mathbf x_b$ 的部分：

$$ -\frac{1}{2}\mathbf x_b^{\text T}\mathbf\Lambda_{bb}\mathbf x_b + \mathbf x_b^\top\mathbf m = -\frac{1}{2} (\mathbf x_b-\mathbf\Lambda^{-1}_{bb}\mathbf m)^\top (\mathbf x_b-\mathbf\Lambda^{-1}_{bb}\mathbf m) + \frac{1}{2} \mathbf m^\top\mathbf \Lambda_{bb}^{-1}\mathbf m $$

其中

$$ \mathbf m = \mathbf\Lambda_{bb}\mathbf\mu_{b}-\mathbf\Lambda_{ba}(\mathbf x_a - \mathbf\mu_a) $$

是一个与 $\mathbf x_b$ 无关的项。

配方之后我们发现，关于 $\mathbf x_b$ 的二次型部分可以拆分为两部分：一个高斯分布的二次型再加上一个跟 $\mathbf x_b$ 无关的部分。

对于高斯分布的二次型部分，积分

$$ \int \exp\left\{-\frac{1}{2} (\mathbf x_b-\mathbf\Lambda^{-1}_{bb}\mathbf m)^\top (\mathbf x_b-\mathbf\Lambda^{-1}_{bb}\mathbf m)\right\} d\mathbf x_b $$

是一个与 $\mathbf x_a, \mathbf x_b$ 无关的常数项。

考虑第二部分，并加上与 $\mathbf x_a$ 有关的项，我们有：

$$ \begin{align} & \frac{1}{2} \mathbf m^\top\mathbf \Lambda_{bb}^{-1}\mathbf m -\frac{1}{2}\mathbf x_a^\top \mathbf\Lambda_{aa} \mathbf x_a + \mathbf x_a^\top\{\mathbf \Lambda_{aa}\mathbf\mu_a + \mathbf \Lambda_{ab}\mathbf \mu_b\} + \text{const} \\ = & -\frac{1}{2} \mathbf x_a^\top (\mathbf\Lambda_{aa} - \mathbf\Lambda_{ba}\mathbf\Lambda_{bb}^{-1}\mathbf\Lambda_{ba})\mathbf x_a + \mathbf x_a^\top (\mathbf\Lambda_{aa} - \mathbf\Lambda_{ba}\mathbf\Lambda_{bb}^{-1}\mathbf\Lambda_{ba}) \mathbf \mu_a + \text{const} \end{align} $$

从之前的讨论立即可以得到协方差矩阵：

$$ \mathbf\Sigma_a = (\mathbf\Lambda_{aa} - \mathbf\Lambda_{ba}\mathbf\Lambda_{bb}^{-1}\mathbf\Lambda_{ba})^{-1} $$

以及均值

$$ \mathbf\Sigma_a (\mathbf\Lambda_{aa} - \mathbf\Lambda_{ba}\mathbf\Lambda_{bb}^{-1}\mathbf\Lambda_{ba}) \mathbf \mu_a = \mathbf\mu_a $$

再来利用之前 Schur 补的结论：

$$ \begin{pmatrix} \bf A & \bf B \\ \bf C & \bf D \\ \end{pmatrix}^{-1} = \begin{pmatrix} \bf M & {\bf -MBD}^{-1} \\ -{\bf D}^{-1}{\bf CM} & {\bf D}^{-1}+{\bf CMBD}^{-1} \\ \end{pmatrix} $$

其中 $\mathbf M =(\mathbf{A - BD}^{-1}\mathbf C)^{-1}$，是分块矩阵的 Schur 补。

$$ \begin{pmatrix} {\bf \Lambda}_{aa} & {\bf \Lambda}_{ab} \\ {\bf \Lambda}_{ba} & {\bf \Lambda}_{bb} \end{pmatrix}^{-1} = \begin{pmatrix} {\bf \Sigma}_{aa} & {\bf \Sigma}_{ab} \\ {\bf \Sigma}_{ba} & {\bf \Sigma}_{bb} \end{pmatrix} $$

我们用协方差矩阵来表示这个结果，可以得到：

$$ \mathbf \Sigma_{aa} = (\mathbf\Lambda_{aa} - \mathbf\Lambda_{ba}\mathbf\Lambda_{bb}^{-1}\mathbf\Lambda_{ba})^{-1} $$

从而：

$$ \begin{align} \mathbb E[\mathbf x_a]& = \mathbf\mu_a\\ {\rm cov}[\mathbf x_a]&=\mathbf \Sigma_{aa} \end{align} $$

可以看到，边缘分布的均值和协方差就是原来的均值和协方差中相对应的部分。

结论¶

已知 $p(\mathbf x)= \cal N(\bf x~|~\bf{\mu,\Sigma})$ 以及 $\mathbf\Lambda\equiv\mathbf\Sigma^{-1}$ 以及

$$ \begin{align} {\bf x} = \begin{pmatrix} {\bf x}_a \\ {\bf x}_b \end{pmatrix},~& {\bf \mu} = \begin{pmatrix} {\bf \mu}_a \\ {\bf \mu}_b \end{pmatrix} \\ {\bf \Sigma} = \begin{pmatrix} {\bf \Sigma}_{aa} & {\bf \Sigma}_{ab} \\ {\bf \Sigma}_{ba} & {\bf \Sigma}_{bb} \end{pmatrix},~& {\bf \Lambda} = \begin{pmatrix} {\bf \Lambda}_{aa} & {\bf \Lambda}_{ab} \\ {\bf \Lambda}_{ba} & {\bf \Lambda}_{bb} \end{pmatrix} \end{align} $$

条件分布为：

$$ \begin{align} p(\mathbf x_a|\mathbf x_b) & = \mathcal N (\mathbf x|\mathbf \mu_{a|b}, \mathbf \Lambda_{aa}^{-1}) \\ \mathbf \mu_{a|b} & = \mathbf \mu_a - \mathbf \Lambda_{aa}^{-1}\mathbf \Lambda_{ab} (\mathbf x_b-\mathbf \mu_b) \end{align} $$

边缘分布为：

$$ p(\mathbf x_a)=\mathcal N (\mathbf x_a|\mathbf \mu_{a}, \mathbf \Sigma_{aa}) $$

In [3]:

from scipy.stats import norm
from matplotlib.mlab import bivariate_normal

x, y = np.mgrid[0:1:.01, 0:1:.01]

fig, ax = plt.subplots(1, 2, figsize=(10, 4.5), dpi=200)

mu_a = 0.5
mu_b = 0.5
sigma_aa = 0.15
sigma_bb = 0.15
sigma_ab = 0.0215


# 联合分布

z = bivariate_normal(x, y, 
                     mux=mu_a, 
                     muy=mu_b, 
                     sigmax=sigma_aa, 
                     sigmay=sigma_bb, 
                     sigmaxy=sigma_ab)

ax[0].contour(x, y, z, colors="g", 
              levels = [0.99 * z.min() + 0.01 * z.max(),
                        0.9 * z.min() + 0.1 * z.max(), 
                        0.5 * z.min() + 0.5 * z.max()])

ax[0].plot([0, 1], [0.7, 0.7], 'r')

ax[0].set_xlim(0, 1)
ax[0].set_ylim(0, 1)

ax[0].set_xticks([0, 0.5, 1])
ax[0].set_yticks([0, 0.5, 1])

ax[0].set_xlabel(r"$x_a$", fontsize="xx-large")
ax[0].set_ylabel(r"$x_b$", fontsize="xx-large")

ax[0].text(0.2, 0.8, r"$x_b=0.7$", fontsize="xx-large")
ax[0].text(0.6, 0.3, r"$p(x_a, x_b)$", fontsize="xx-large")

# 边缘分布

xx = np.linspace(0, 1, 100)

pa = norm.pdf(xx, loc=mu_a, scale = sigma_aa)

ax[1].plot(xx, pa, 'b')

pa_b = norm.pdf(xx, 
                loc = mu_a + sigma_ab / sigma_bb ** 2 * (0.7 - mu_a), 
                scale = np.sqrt(sigma_aa ** 2 - sigma_ab ** 2 / sigma_bb ** 2))

ax[1].set_xlabel(r"$x_a$", fontsize="xx-large")

ax[1].set_xlim(0, 1)
ax[1].set_ylim(0, 10)
ax[1].set_xticks([0, 0.5, 1])
ax[1].set_yticks([0, 5, 10])

ax[1].text(0.2, 7, r"$p(x_a|x_b=0.7)$", fontsize="xx-large")
ax[1].text(0.2, 3, r"$p(x_a)$", fontsize="xx-large")

ax[1].plot(xx, pa_b, 'r')

plt.show()

2.3.3 高斯变量的贝叶斯理论¶

之前提到 $p(\mathbf x_a|\mathbf x_b)$ 的均值是 $\mathbf x_b$ 的线性函数。

现在我们假设 $p(\mathbf x)$ 是一个已知的高斯分布，而 $p(\bf y|x)$ 是一个条件高斯分布，且其均值是 $\bf x$ 的线性函数，方差与 $\bf x$ 独立。这就是所谓的高斯线性模型（linear Gaussian model）：

$$ \begin{align} p(\mathbf x)&=\mathcal N(\mathbf x~|~\mathbf \mu, \mathbf \Lambda^{-1}) \\ p(\mathbf y|\mathbf x)&=\mathcal N(\mathbf y~|~\mathbf{Ax+b}, \mathbf L^{-1}) \\ \end{align} $$

其中 $\mathbf \mu, \mathbf A, \mathbf b$ 是均值的参数，$\mathbf \Lambda, \mathbf L$ 是精确度矩阵。$\mathbf{x,y}$ 的维度分别为 $M, D$，则 $\mathbf A$ 的维度为 $D\times M$。

我们的目的是求 $p(\mathbf y)$ 的分布。

考虑它们的联合分布，令 $\mathbf z=\begin{pmatrix}\mathbf x\\ \mathbf y\end{pmatrix}$，则：

$$ \begin{align} \ln p(\mathbf z)&=\ln p(\mathbf x) + \ln p(\mathbf{y|x}) \\ &=-\frac{1}{2}(\mathbf{x-\mu})^\top\mathbf\Lambda(\mathbf{x-\mu}) -\frac{1}{2}(\mathbf{y-Ax-b})^\top\mathbf L(\mathbf{y-Ax-b}) + \text{const} \end{align} $$

为了得到 $\mathbf z$ 的精确度矩阵，我们考虑二次项：

$$ \begin{align} & -\frac{1}{2}\mathbf x^\top (\mathbf\Lambda + \mathbf A^{\text T}\mathbf L\mathbf A) \mathbf x - \frac{1}{2}\mathbf y^\top \mathbf L\mathbf y + \frac{1}{2}\mathbf y^\top \mathbf L\mathbf A\mathbf+ \frac{1}{2}\mathbf x^\top \mathbf A^\top\mathbf L\mathbf y \\ = & -\frac{1}{2} \begin{pmatrix} \mathbf x\\ \mathbf y \end{pmatrix}^\top \begin{pmatrix} \mathbf\Lambda + \mathbf A^\top\mathbf L\mathbf A & -\mathbf A^\top\mathbf L \\ -\mathbf L\mathbf A & \mathbf L \end{pmatrix} \begin{pmatrix} \mathbf x\\ \mathbf y \end{pmatrix} \\ = & -\frac{1}{2} \mathbf {z^\top Rz} \end{align} $$

$\mathbf z$ 的精确度矩阵为：

$$ \mathbf{R} = \begin{pmatrix} \mathbf\Lambda + \mathbf A^\top \mathbf L\mathbf A & -\mathbf A^\top \mathbf L \\ -\mathbf L\mathbf A & \mathbf L \end{pmatrix} $$

协方差矩阵为（使用 Schur 补的结论）：

$$ \text{cov}[\mathbf z]=\mathbf{R}^{-1} = \begin{pmatrix} \mathbf\Lambda + \mathbf A^\top\mathbf L\mathbf A & -\mathbf A^\top\mathbf L \\ -\mathbf L\mathbf A & \mathbf L \end{pmatrix}^{-1} = \begin{pmatrix} \mathbf\Lambda^{-1} & \mathbf\Lambda^{-1}\mathbf A^\top\\ \mathbf A\mathbf\Lambda^{-1} & \mathbf L^{-1}+\mathbf A\mathbf \Lambda^{-1}\mathbf A^\top \end{pmatrix} $$

考虑线性项：

$$ \mathbf x^\text{T}\mathbf{\Lambda\mu}-\mathbf x^\top\mathbf A^\top\mathbf{Lb} + \mathbf y^\top\mathbf{Lb} = \begin{pmatrix} \mathbf x\\ \mathbf y \end{pmatrix}^\top \begin{pmatrix} \mathbf{\Lambda\mu-A}^\top\mathbf{Lb}\\ \mathbf{Lb} \end{pmatrix} $$

我们有

$$ \mathbb E[\mathbf z]= \mathbf R^{-1} \begin{pmatrix} \mathbf{\Lambda\mu-A^\top Lb}\\ \mathbf{Lb} \end{pmatrix} = \begin{pmatrix} \mathbf{\mu}\\ \mathbf{A\mu+b} \end{pmatrix} $$

然后我们由边缘高斯分布的结论马上可以得到：

$$ \begin{align} \mathbb E[\mathbf y] &= \mathbf{A\mu+b} \\ {\rm cov} [\mathbf y] &= \mathbf L^{-1}+\mathbf A\mathbf \Lambda^{-1}\mathbf A^\top \end{align} $$

再来，我们由条件高斯分布的结论得到：

$$ \begin{align} \mathbb E[\mathbf{x|y}] &= (\mathbf\Lambda + \mathbf A^\top \mathbf L\mathbf A)^{-1}\left\{ \mathbf A^\top \mathbf{L(y-b)} + \mathbf{\Lambda\mu} \right\} \\ {\rm cov} [\mathbf{x|y}] &= (\mathbf\Lambda + \mathbf A^\top\mathbf L\mathbf A)^{-1} \end{align} $$

结论¶

已知：

$$ \begin{align} p(\mathbf x)&=\mathcal N(\mathbf x~|~\mathbf \mu, \mathbf \Lambda^{-1}) \\ p(\mathbf y|\mathbf x)&=\mathcal N(\mathbf y~|~\mathbf{Ax+b}, \mathbf L^{-1}) \\ \end{align} $$

我们有

$$ \begin{align} p(\mathbf y)&=\mathcal N(\mathbf y~|~\mathbf{A\mu+b}, \mathbf L^{-1}+\mathbf A\mathbf \Lambda^{-1}\mathbf A^\top) \\ p(\mathbf x|\mathbf y)&=\mathcal N(\mathbf y~|~\mathbf \Sigma\left\{ \mathbf A^\top \mathbf{L(y-b)} + \mathbf{\Lambda\mu} \right\}, \mathbf \Sigma) \\ \end{align} $$

其中 $\mathbf \Sigma = (\mathbf\Lambda + \mathbf A^\top\mathbf L\mathbf A)^{-1}$。

2.3.4 高斯分布最大似然¶

对于一组独立同分布的高斯观测数据 $\mathbf X=(x_1, \dots,x_N)^\text{T}$，高斯分布的似然函数为：

$$ \ln p({\bf X|\mu,\Sigma}) = -\frac{ND}{2}\ln (2\pi)-\frac{N}{2}\ln|\mathbf\Sigma|-\frac{1}{2}\sum_{n=1}^N ({\bf x_n-\mu})^\top\mathbf\Sigma^{-1}({\bf x_n-\mu}) $$

其充分统计量为

$$ \sum_{n=1}^N \mathbf x_n, \sum_{n=1}^N \mathbf x_n\mathbf x_n^\top $$

考虑对 $\mathbf\mu$ 求偏导：

$$ \frac{\partial}{\partial \mathbf \mu} \ln p({\bf X|\mu,\Sigma}) = \sum_{n=1}^N \mathbf\Sigma^{-1} (\bf x_n-\mu) $$

令其为零（严格来说要验证这是一个最大值点）：

$$ \mathbf \mu_{ML} = \frac{1}{N} \sum_{n=1}^N \mathbf x_n $$

所以最大似然的均值就是样本均值。

协方差的最大似然是（可以求偏导，对矩阵求偏导的方法参见附录 C）

$$ {\bf\Sigma}_{ML} = \frac{1}{N} \sum_{n=1}^N (\mathbf x_n-\mathbf \mu_{ML})(\mathbf x_n-\mathbf \mu_{ML})^\top $$

另一方面，在真实的分布下，最大似然估计的均值为

$$ \begin{align} \mathbb E [\mu_{ML}] & = \mathbf \mu \\ \mathbb E [\Sigma_{ML}] & = \frac{N-1}{N} \bf \Sigma \end{align} $$

所以最大似然估计的协方差不是无偏估计，为此，可以使用

$$ \mathbf{\tilde\Sigma}=\frac{1}{N-1} \sum_{n=1}^N (\mathbf x_n-\mathbf \mu_{ML})(\mathbf x_n-\mathbf \mu_{ML})^\top $$

来纠正。

2.3.5 序列估计¶

序列化方法要求我们每次只处理一个数据，然后扔掉这个数据。这种方法在对于处理数据量很大（大到不能一次处理所有数据）的问题很重要。

我们考虑从 $N-1$ 个观测数据到 $N$ 个观测数据时，均值的最大似然估计的变化：

$$ \begin{align} \mu_{ML}^{(N)} & = \frac{1}{N} \sum_{n=1}^N \mathbf x_n \\ & = \frac{1}{N} \mathbf x_N + \frac{1}{N} \sum_{n=1}^N-1 \mathbf x_n\\ & = \frac{1}{N} \mathbf x_N + \frac{N-1}{N} \mu_{ML}^{(N-1)} \\ & = \mu_{ML}^{(N-1)} + \frac{1}{N}(\mathbf x_N - \mu_{ML}^{(N-1)}) \end{align} $$

这相当于在原来的均值的基础上，增加了一个误差项 $\mathbf x_N - \mu_{ML}^{(N-1)}$ 的变化；同时随着 $N$ 的增大，误差项调整的权重逐渐变小。

Robbins-Monro 算法¶

在上面的序列估计算法中，我们得到的是最大似然解的精确结果，而在很多情况下，我们并不一定能得到精确的结果。为此，我们引入 Robbins-Monro 算法。

考虑联合分布 $p(z,\theta)$，$z$ 关于 $\theta$ 的条件分布的均值是一个关于 $\theta$ 的函数：

$$ f(\theta)\equiv \mathbb E[z|\theta]=\int zp(z|\theta)dz $$

我们的目的是找到一个 $\theta^\star$ 使得 $f(\theta^\star)=0$。

对于这个问题，如果我们有很多 $z$ 和 $\theta$ 的数据，我们可以建模来解决这个问题。

不过，现在假设我们每次只观察一个 $z$ 值，然后根据这个值来更新我们对 $\theta^\star$ 的估计，为此，我们先假定 $z$ 的条件方差是有限的：

$$ \mathbb E[(z-f)^2|\theta]<\infty $$

并且不失一般性的假设 $f(\theta)$ 在 $\theta^\star$ 附近是单调递增的。Robbins-Monro 过程给出：

$$ \theta^{(N)} = \theta^{(N-1)} + a_{N-1} z(\theta^{(N-1)}) $$

其中 $z(\theta^{(N)})$ 是 $z$ 在 $\theta$ 取 $\theta^{(N)}$ 时的观测值，参数 $a_N$ 满足：

$$ \begin{align} \lim_{N\to\infty} a_N & = 0 \\ \sum_{N=1}^\infty a_N & = \infty \\ \sum_{N=1}^\infty a_N^2 & < \infty \end{align} $$

第一个条件保证了收敛性；第二个条件保证了不会收敛到最优值附近的点；第三个条件保证误差项最终不会破坏收敛性。

对于最大似然问题，我们要求

$$ \frac{\partial}{\partial \theta} \left\{\left.\frac{1}{N} \sum_{n=1}^N \ln p({x_n|\theta}) \right\}\right|_{\theta_{ML}} = 0 $$

交换求和和偏导，并求极限有：

$$ \lim_{N\to\infty}\frac{1}{N} \sum_{n=1}^N \frac{\partial}{\partial \theta}\ln p({x_n|\theta}) = \mathbb E_x \left[ \frac{\partial}{\partial \theta}\ln p({x|\theta})\right] $$

我们可以看到，最大化极大似然对应与寻找一个函数的根。Robbins-Monro 过程给出：

$$ \theta^{(N)} = \theta^{(N-1)} + a_{N-1} \frac{\partial}{\partial \theta^{(N-1)}}\ln p({x_N|\theta^{(N-1)}}) $$

如果 $\theta$ 是均值，我们有

$$ z = \frac{\partial}{\partial \mu_{ML}}\ln p({x|\mu_{ML}, \sigma^2}) = \frac{1}{\sigma^2} (x-\mu_{ML}) $$

所以 $z$ 的分布是一个均值为 $\mu-\mu_{ML}$ 的高斯分布，我们相应的 $a_N$ 为 $\frac{\sigma^2}{N}$。

2.3.6 高斯分布的贝叶斯估计¶

极大似然估计给出了均值和协方差矩阵的点估计，现在我们通过引入先验分布来考虑它的贝叶斯估计。

均值未知，方差已知¶

考虑一维的高斯变量 $x$，并假设方差 $\sigma^2$ 已知，有 $N$ 个数据点 $\mathbf X=\{x_1,\dots,x_N\}$，其似然函数为

$$ p(\mathbf X|\mu)=\prod_{n=1}^N p(x_n|\mu) =\frac{1}{(2\pi\sigma^2)^\frac{N}{2}} \exp\left\{-\frac{1}{2\sigma^2} \sum_{n=1}^N (x_n-\mu)^2\right\} $$

注意似然函数不是一个关于 $\mu$ 的概率分布。

考虑共轭性，我们引入一个高斯先验 $p(\mu)$：

$$ p(\mu)=\mathcal N(\mu|\mu_0, \sigma_0^2) $$

因此后验分布为：

$$ p(\mu|\mathbf X) \propto p(\mathbf X|\mu)p(\mu) $$

用上文类似的配方法可以得到：

$$ p(\mu|\mathbf X) = \mathcal N(\mu|\mu_N,\sigma_N^2) $$

其中

$$ \begin{align} \mu_N & = \frac{\sigma^2}{N\sigma_0^2 + \sigma^2} \mu_0 + \frac{N\sigma_0^2}{N\sigma_0^2 + \sigma^2} \mu_{ML}\\ \frac{1}{\sigma_N^2} & = \frac{1}{\sigma_0^2} + \frac{N}{\sigma^2} \end{align} $$

其中

$$ \mu_{ML} = \frac{1}{N} \sum_{n=1}^N x_n $$

当 $N = 0$ 时，后验分布就是先验分布，当 $N\to\infty$ 时，后验分布的参数趋于最大似然的结果。

均值已知，方差未知¶

我们使用精确度 $\lambda\equiv 1~/~\sigma^2$ 来进行计算，其似然函数为：

$$ p(\mathbf X|\lambda)=\prod_{n=1}^N \mathcal N(x_n|\mu,\lambda^{-1}) \propto \lambda^{N/2} \exp\left\{-\frac{\lambda}{2} \sum_{n=1}^N(x_n-\mu)^2\right\} $$

考虑关于 $\lambda$ 的共轭性，我们引入的先验分布为 Gamma 分布：

$$ \mathrm{Gam}(\lambda~|~a, b) = \frac{1}{\Gamma(a)} b^a {\lambda^{a-1} \exp(-b\lambda)}, a > 0 $$

不同参数的 Gam 分布的图像如下图所示：

In [4]:

from scipy.stats import gamma

x = np.linspace(0.01, 2, 100)

fig, axes = plt.subplots(1, 3, figsize=(10, 2))

A = [0.1, 1, 4]
B = [0.1, 1, 6]

for a, b, ax in zip(A, B, axes):
    y = gamma.pdf(x, a = a, scale=1.0/b)
    ax.plot(x, y, color="r")
    ax.set_ylim(0, 2)
    ax.set_xticks([0, 1, 2])
    ax.set_yticks([0, 1, 2])
    ax.set_xlabel(r"$\lambda$", fontsize="xx-large")
    ax.text(1, 1, "$a={}$\n$b={}$".format(a, b), fontsize="xx-large")

plt.show()

其均值和方差分别为：

$$ \begin{align} \mathbb E[\lambda] & = \frac{a}{b}\\ {\rm var}[\lambda] & = \frac{a}{b^2} \end{align} $$

给定先验分布 ${\rm Gam}(\lambda~|~a_0, b_0)$，后验分布为：

$$ p(\lambda|\mathbf X)\propto\lambda^{a_0-1}\lambda^{N/2} \exp\left\{-b_0\lambda_0-\frac{\lambda}{2} \sum_{i=1}^N (x_n-\mu)^2\right\} $$

从而这也是一个 Gam 分布 ${\rm Gam}(\lambda~|~a_N, b_N)$，其中：

$$ \begin{align} a_N & = a_0 + \frac N 2 \\ b_N & = b_0 + \frac{1}{2} \sum_{i=1}^N (x_n-\mu)^2 = b_0 + \frac{N}{2} \sigma^2_{ML} \end{align} $$

因此，我们看到，$N$ 个数据点向参数 $a$ 添加了 $\frac N 2$ 的贡献，因此，我们可以认为初始状态向 $a$ 贡献了 $2a_0$ 个数据点；$N$ 个数据点向参数 $b$ 添加了 $\frac{N}{2} \sigma^2_{ML}$ 的贡献，我们可以认为这 $2a_0$ 个数据点平均每个贡献了 $2b_0/(2a_0)=b_0/a_0$ 的方差。

如果直接考虑方差，而不是精确度，我们使用的先验将是逆 Gamma 分布。不过用精确度表示起来更加方便。

均值和方差都未知¶

在这种情况下，似然函数为：

$$ p(\mathbf X|\mu, \lambda)=\left(\frac{\lambda}{2\pi}\right)^\frac{N}{2} \exp\left\{-\frac{\lambda}{2} \sum_{n=1}^N (x_n-\mu)^2\right\} \propto \left[\lambda^{1/2} \exp\left(-\frac{\lambda\mu^2}{2}\right)\right]^N \exp\left\{ \lambda\mu \sum_{n=1}^N x_n - \frac{\lambda}{2} \sum_{n=1}^N x_n^2 \right\} $$

这是一个指数族函数（exponential family）的形式。

由共轭性，考虑这样的先验分布：

$$ p(\mu,\lambda)\propto \left[\lambda^{1/2} \exp\left(-\frac{\lambda\mu^2}{2}\right)\right]^\beta \exp\left\{ c\lambda\mu - d\lambda \right\} = \exp\left\{-\frac{\beta\lambda}{2}(\mu-c/\beta)^2\right\} \lambda^{\beta/2} \exp\left\{-\left(d-\frac{c^2}{2\beta}\right)\lambda\right\} $$

由于 $p(\mu,\lambda) = p(\mu|\lambda)p(\lambda)$，通过观察我们发现：

$$ p(\mu,\lambda) = \mathcal N(\mu|\mu_0, (\beta\lambda)^{-1}) \mathrm{Gam}(\lambda|a,b) $$

其中 $\mu_0 = c/\beta, a=1+\beta/2, b=d-c^2/(2\beta)$。这个分布叫做高斯-伽马分布（Gaussian-gamma）或者正态-伽马分布（normal-gamma），其图像如图所示：

In [5]:

from scipy.stats import gamma

mu, lam = np.mgrid[-2:2:.01, 0.01:2:.01]

a, b = 5, 6
mu_0 = 0
beta = 2

z = (beta*lam/2/np.pi) ** 1/2 * np.exp(- beta*lam/2 * (mu-mu_0)**2) \
    * gamma.pdf(lam, a=a, scale=1.0/b)
    
fig, ax = plt.subplots()

ax.contour(mu, lam, z, colors='r')
ax.set_xticks([-2, 0, 2])
ax.set_yticks([0, 1, 2])

ax.set_xlabel(r"$\mu$", fontsize="xx-large")
ax.set_ylabel(r"$\lambda$", fontsize="xx-large")

plt.show()

在 $D$ 维高斯分布 $\mathcal N(\mathbf x|\mathbf\mu, \mathbf\Lambda^{-1})$ 的情况下，若均值未知，精确度矩阵已知，先验分布为：

$$\mathcal N(\mathbf \mu|\mathbf \mu_0, \mathbf\Lambda_0^{-1})$$

若均值已知，精确度矩阵未知，先验分布为 Wishart 分布：

$$ \mathcal W(\mathbf\Lambda|\mathbf W, \nu)=B|\mathbf\Lambda|^{(\nu-D-1)/2} \exp\left(-\frac{1}{2}\mathrm{Tr}(\mathbf W^{-1}\mathbf\Lambda)\right) $$

其中 $\nu$ 叫做分布的自由度（degrees of freedom），$\mathbf W$ 是一个 $D\times D$ 矩阵，$\rm Tr$ 是矩阵的迹,归一化参数 $B$ 为

$$ B(\mathbf W, \nu) = |\mathbf W|^{-\nu/2} \left(2^{\mu D/2} \pi^{D(D-1)/4} \prod_{i=1}^D \Gamma\left(\frac{\nu+1-i}{2}\right)\right)^{-1} $$

若均值和精确度矩阵都未知，先验分布为 normal-Wishart（或者 Gaussian-Wishart）分布：

$$ \mathcal N(\mathbf \mu|\mathbf \mu_0, (\beta\mathbf\Lambda)^{-1}) \mathcal W(\mathbf\Lambda|\mathbf W, \nu) $$

2.3.7 学生 t 分布¶

之前我们看到精确度的先验分布是一个 Gamma 分布。现在考虑一个单变量高斯分布 $\mathcal N(x|\mu,\tau^{-1})$ 和精确度的一个伽马分布先验 ${\rm Gam}(\tau|a,b)$，我们对精确度进行积分之后，得到 $x$ 的边缘分布为

$$ \begin{align} p(x|\mu, a,b) & = \int_{0}^{\infty} \mathcal N(x|\mu,\tau^{-1}) {\rm Gam}(\tau|a,b) d\tau \\ & = \int_{0}^{\infty} \frac{b^a e^{-b\tau} \tau^{a-1}}{\Gamma(a)} \left(\frac{\tau}{2\pi}\right)^{1/2} \exp\left\{-\frac{\tau}{2}(x-\mu)^2\right\} d\tau\\ & = \frac{b^a}{\Gamma(a)} \left(\frac{1}{2\pi}\right)^{1/2} \left[b+\frac{(x-\mu)^2}{2}\right]^{-a-1/2} \Gamma(a+1/2) \end{align} $$

（关于 $\tau$ 可以凑成一个 ${\rm Gam}(a+1/2, b+\frac{(x-\mu)^2}{2})$ 的核）

令参数 $\nu=2a, \lambda=a/b$，我们有

$$ \text{St}(x|\mu,\lambda,\nu)= \frac{\Gamma(v/2+1/2)}{\Gamma(v/2)} \left(\frac{\lambda}{\pi\nu}\right)^{1/2} \left[1+\frac{\lambda(x-\mu)^2}{\nu}\right]^{-\nu/2-1/2} $$

这就是学生 t 分布（Student’s t-distribution）。其中，参数 $\lambda$ 叫做精确度，$\mu$ 叫做自由度。

当自由度为 1 时，学生 t 分布退化为柯西分布，当自由度趋于无穷时，学生 t 分布变成均值 $\lambda$ 精度 $\lambda$ 的高斯分布（忽略所有与 $x$ 无关的项，对 $\nu$ 求极限，利用 $\lim_{x\to\infty}(1+\frac{1}{x})^x = e$）

其概率密度函数如图所示：

In [6]:

from scipy.stats import t, norm

xx = np.linspace(-5, 5, 100)

dfs = [0.1, 1]

fig, ax = plt.subplots()

for df in dfs:
    ax.plot(xx, t.pdf(xx, df))

ax.plot(xx, norm.pdf(xx))

ax.set_xlim(-5, 5)
ax.set_xticks([-5, 0, 5])
ax.set_ylim(0, 0.5)
    
ax.legend([r"$\nu=0.1$", r"$\nu=1.0$", r"$\nu\to\infty$"], fontsize="xx-large")
ax.set_title(r"$\lambda=1,\ \mu=0$", fontsize="xx-large")

plt.show()

学生 t 分布可以看成是无穷多的高斯分布的一个混合。

从图中可以看出，它产生的分布的尾巴要比高斯分布的长。事实上，在有异常值的情况下，它比高斯分布更鲁棒一些：

In [7]:

from scipy.stats import t, norm, uniform

tt = norm.rvs(size=30)
xx = np.linspace(-5, 10, 100)

fig, axes = plt.subplots(1, 2, figsize=(10, 4))

axes[0].hist(tt, 
             normed=True, 
             rwidth=0.5, 
             color="yellow", 
             bins = np.linspace(-5, 10, 30))

axes[0].set_xlim(-5, 10)
axes[0].set_xticks([-5, 0, 5, 10])
axes[1].set_ylim(0, 0.7)

l, s = norm.fit(tt)

axes[0].plot(xx, norm.pdf(xx, loc=l, scale=s), 'g')

d, l, s = t.fit(tt)
axes[0].plot(xx, t.pdf(xx, df=d, loc=l, scale=s), 'r')

axes[0].legend(["MLE for Gaussian", "MLE for t"])


tt1 = np.hstack([tt, uniform.rvs(size=4, loc=8, scale=2)])

axes[1].hist(tt1, 
             normed=True, 
             rwidth=0.5, 
             color="yellow",
             bins = np.linspace(-5, 10, 30))

axes[1].set_xlim(-5, 10)
axes[1].set_ylim(0, 0.7)
axes[1].set_xticks([-5, 0, 5, 10])

l, s = norm.fit(tt1)

axes[1].plot(xx, norm.pdf(xx, loc=l, scale=s), 'g')

d, l, s = t.fit(tt1)
axes[1].plot(xx, t.pdf(xx, df=d, loc=l, scale=s), 'r')

axes[1].legend(["MLE for Gaussian", "MLE for t"])

plt.show()

如果我们取 $\nu=2a, \lambda=a/b, \eta=\tau b /a$，我们有：

$$ \mathrm{St}(x|\mu,\lambda,\nu) = \int_{0}^{\infty} \mathcal N(x~|~\mu,(\eta\lambda)^{-1}) {\rm Gam}(\eta ~|~\nu/2,\nu/2) d\eta $$

（这也说明了学生 t 分布是一个概率分布：对 $x$ 积分等于 1。）

我们看到，自由度项只在 Gamma 分布出现，因此，我们可以将其推广到高维形式：

$$ \mathrm{St}(x|\mathbf{\mu,\Lambda},\nu) = \int_{0}^{\infty} \mathcal N(x~|~{\mathbf \mu,(\eta\mathbf\Lambda)^{-1}}) {\rm Gam}(\eta ~|~\nu/2,\nu/2) d\eta $$

与一维的结果类似，我们有：

$$ \mathrm{St}(x|\mathbf{\mu,\Lambda},\nu) = \frac{\Gamma(v/2+D/2)}{\Gamma(v/2)} \frac{|\mathbf\Lambda|^{1/2}}{(\pi\nu)^{D/2}} \left[1+\frac{\Delta^2}{\nu}\right]^{-\nu/2-D/2} $$

其中 $\Delta^2 \mathbf{= (x -\mu)^\top\Lambda(x - \mu)}$。

学生 t 分布的均值方差和众数分别为：

$$ \begin{align} \mathbb E[\mathbf x] & = \mathbf \mu & {\rm if} && \nu > 1\\ {\rm cov}[\mathbf x] & = \frac{\nu}{\nu-2} \mathbf\Lambda^{-1} & {\rm if} && \nu > 2 \\ {\rm mode}[\mathbf x] & = \mathbf \mu \\ \end{align} $$

2.3.8 周期变量和 von Mises 分布¶

高斯分布并不能很好的解决周期变量的问题。

对于周期变量，一种解决方法是使用极坐标系 $0\leq\theta<2\pi$ 的周期性，使用极坐标的角度 $\theta$ 表示周期变量。

在极坐标系下，我们需要考虑这样的问题，两个角度 $1^\circ, 359^\circ$ 十分接近，如果我们选择 $0^\circ$ 作为原点，均值和标准差分别为 $180^\circ, 179^\circ$；选择 $180^\circ$ 作为原点，均值和标准差分别为 $0^\circ, 1^\circ$。因此我们需要一种解决这种问题的特殊方法。

考虑找到一组周期变量 $\mathcal D=\{\theta_1,\dots, \theta_N\}$ 的均值的问题。之前我们看到，随着坐标系的变化，直接计算的方法的结果也在变化。

为了找到一种不随坐标原点变化的均值衡量方法，我们将这些变量看成是单位圆上的点，即将角度 $\theta$ 映射为点 $\mathbf x$：

In [8]:

xx = np.linspace(0, 2 * np.pi, 100)

_, ax = plt.subplots(figsize=(7, 7))

ax.plot(np.cos(xx), np.sin(xx), 'r')

ax.text(1, -0.2, "$x_1$", fontsize="xx-large")
ax.text(1, 0.2, "$x_2$", fontsize="xx-large")
ax.text(0.2, 1, "$x_3$", fontsize="xx-large")
ax.text(-0.4, 1, "$x_4$", fontsize="xx-large")

ax.axis("equal")
ax.set_xlim(-1.2, 1.2)

ax.spines['right'].set_color('none')
ax.spines['top'].set_color('none')

ax.xaxis.set_ticks_position('bottom')
ax.spines['bottom'].set_position(('data',0))
ax.yaxis.set_ticks_position('left')
ax.spines['left'].set_position(('data',0))

ax.set_xticks([])
ax.set_yticks([])

ax.plot([0, 0.4], [0, 0.6])
ax.text(0.12, 0.3, r'$\bar r$', fontsize="xx-large")
ax.text(0.42, 0.55, r'$\bar x$', fontsize="xx-large")
ax.text(0.35, 0.2, r'$\bar{\theta}$', fontsize="xx-large")

tt = np.linspace(0.2, np.sqrt(0.2 ** 2 + 0.3 ** 2))

ax.fill_between([0, 0.2], [0, 0.3], color="c")
ax.fill_between(tt, np.sqrt(0.2 ** 2 + 0.3 ** 2 - tt ** 2 + 0.001), color="c")

plt.show()

我们不直接求 $\theta$ 的均值，而是求这些点 $\bf x$ 的均值：

$$ \mathbf{\bar x} = \frac{1}{N} \sum_{n=1}^N \mathbf x_n $$

然后得到相应的极坐标的角度 $\bar\theta$。

这样的设定能保证我们的均值 $\bf \bar x$ 不随极坐标的原点设置变化而变化。一般来说，这个均值通常落在单位圆内。

考虑坐标变换，$\mathbf x_n = (\cos\theta_n,\sin\theta_n), \mathbf{\bar x}=(\bar r\cos\bar\theta, \bar r\sin\theta)$，我们有：

$$ \bar r\cos\bar\theta = \frac{1}{N} \sum_{n=1}^N \cos\theta_n, \bar r\sin\theta \frac{1}{N} \sum_{n=1}^N \sin\theta_n $$

利用正切函数，我们有：

$$ \bar\theta = \tan^{-1} \frac{\sum_n \sin\theta_n}{\sum_n \cos\theta_n} $$

von Mises 分布¶

对于周期函数，我们考虑一个周期为 $2\pi$ 的概率分布 $p(\theta)$，它应当满足这三个条件：

$$ \begin{align} p(\theta) & \geq 0 \\ \int_{0}^{2\pi} p(\theta) & =1 \\ p(\theta+2\pi) &= p(\theta) \end{align} $$

按照单位圆的设定，我们可以假定 $\mathbf x=(x_1,x_2)$ 满足一个高斯分布 $\mathcal N(\mathbf\mu, \sigma^2\mathbf I)$，其中 $\mathbf\mu=\mu_1,\mu_2$，所以：

$$ p(x_1,x_2) = \frac{1}{2\pi\sigma^2} \exp\left\{-\frac{(x_1-\mu_1)^2 + (x_2-\mu_2)^2}{2\sigma^2}\right\} $$

它的分布等高线是一系列的圆：

In [9]:

xx = np.linspace(0, 2 * np.pi, 100)

_, ax = plt.subplots(figsize=(7, 7))

ax.plot(np.cos(xx), np.sin(xx), 'r')

ax.spines['right'].set_color('none')
ax.spines['top'].set_color('none')

ax.xaxis.set_ticks_position('bottom')
ax.spines['bottom'].set_position(('data',0))
ax.yaxis.set_ticks_position('left')
ax.spines['left'].set_position(('data',0))
ax.set_xticks([])
ax.set_yticks([])

mu1, mu2 = 0.5, 0.8
for r in [0.25, 0.5, 0.75]:
    ax.plot(r * np.cos(xx) + mu1, r * np.sin(xx) + mu2, 'b')
    
ax.axis('equal')
ax.set_xlim(-1.5, 2)
ax.text(1.35, 1, r"$p(\mathbf{x})$", fontsize="xx-large")
ax.text(-1,-1, r'$r=1$', fontsize="xx-large")
ax.text(1.8, -.2, "$x_1$", fontsize="xx-large")
ax.text(-.2, 1.8, "$x_2$", fontsize="xx-large")

plt.show()

考虑极坐标的变换：

$$ \begin{align} x_1 = r\cos\theta, &&& x_2 = r\sin\theta \\ \mu_1 = r_0\cos\theta_0, &&& \mu_2 = r_0\sin\theta_0 \end{align} $$

并将 $\mathbf x$ 限制在单位圆上即 $r=1$，我们有：

$$ -\frac{1}{2\sigma^2}\left\{(\cos\theta-r_0\cos\theta_0)^2 + (\sin\theta-r_0\sin\theta_0)^2\right\} = -\frac{1}{2\sigma^2}\left\{1 + r_0^2 -2r_0\cos\theta\cos\theta_0 -2r_0\sin\theta\sin\theta_0\right\} = \frac{r_0}{\sigma^2} \cos(\theta-\theta_0) + \mathrm{const} $$

常数是与 $\theta$ 无关的项。

如果我们定义 $m=\frac{r_0}{\sigma^2}$，我们可以得到这样的一个分布：

$$ p(\theta|\theta_0,m) = \frac{1}{2\pi I_0(m)}\exp\{m\cos(\theta-\theta_0)\} $$

这样我们就得到了 von Mises 分布，或者叫 circular normal 分布，这里 $\theta_0$ 是分布的均值，$m$ 类似于高斯分布中的精确度。

von Mises 分布在直角坐标系和极坐标系的图像如图所示：

In [10]:

from scipy.stats import vonmises

tt = np.linspace(0, 2 * np.pi, 100)

fig, axes = plt.subplots(1, 2, figsize=(10, 4))

axes[0].plot(tt, vonmises.pdf(tt, 5, loc=np.pi/4), 'r')
axes[0].plot(tt, vonmises.pdf(tt, 1, loc=3*np.pi/4), 'b')
axes[0].set_xlim(0, 2*np.pi)
axes[0].legend([r'$m=5,\ \theta_0=\pi/4$', r'$m=1,\ \theta_0=3\pi/4$'], fontsize="x-large", loc=0)

for ax in axes:
    ax.set_xticks([])
    ax.set_yticks([])
    
ax = fig.add_subplot(122, projection='polar')
ax.set_axis_off()

ax.plot(tt, vonmises.pdf(tt, 5, loc=np.pi/4), 'r')
ax.plot(tt, vonmises.pdf(tt, 1, loc=3*np.pi/4), 'b')
ax.legend([r'$m=5,\ \theta_0=\pi/4$', r'$m=1,\ \theta_0=3\pi/4$'], fontsize="x-large", loc=0)
ax.grid("off")

ax.plot([0, np.pi/4], [0, 1], 'k')
ax.plot([0, 0], [0, 1], 'k')
ax.plot([0, 3*np.pi/4], [0, 1], 'k')
ax.text(0.1, 0.8, r'$0$', fontsize="x-large")
ax.text(-0.2, 0.8, r'$2\pi$', fontsize="x-large")
ax.text(0.3*np.pi, 1, r'$\pi / 4$', fontsize="x-large")
ax.text(0.7*np.pi, 0.8, r'$3\pi / 4$', fontsize="x-large")

plt.show()

$I_0(m)$ 是用来归一化概率分布的常数，是修正的 0 阶 Bessel 函数：

$$ I_0(m) =\frac{1}{2\pi} \int_{0}^{2\pi} \exp\{m\cos\theta\} d\theta $$

In [11]:

from scipy import special

_, ax = plt.subplots()

xx = np.r_[0:10:0.01]

ax.plot(xx, special.i0(xx), 'r')
ax.set_xlabel("$m$", fontsize='x-large')
ax.set_ylabel("$I_0(m)$", fontsize='x-large')
ax.set_xticks([0, 5, 10])
ax.set_yticks(np.r_[0:3001:1000])

plt.show()

现在考虑 von Mises 分布的最大似然：

$$ \ln p(\mathcal D|\theta_0,m) = -N\ln(2\pi)-N\ln I_0(m) + m\sum_{n=1}^N \cos(\theta_n-\theta_0) $$

令其对 $\theta_0$ 的偏导为 0，我们有

$$ \sum_{n=1}^N \sin(\theta_n-\theta_0) = \cos\theta_0 \sum_{n=1}^N \sin\theta_n - \sin\theta_0 \sum_{n=1}^N \cos\theta_n = 0 $$

从而

$$ \theta_{ML} = \tan^{-1} \left\{\frac{\sum_{n} \sin \theta_n}{\sum_{n} \cos \theta_n}\right\} $$

这与我们之前讨论的均值一致。

类似地，考虑对 $m$ 的偏导，利用 $I_0'(m) = I_1(m)$（$I_1$ 是修正的 1 阶 Bessel 函数），我们有：

$$ \frac{I_1(m)}{I_0(m)} = \frac{1}{N} \sum_{n=1}^N \cos(\theta_n-\theta_0) $$

令

$$ A(m) = \frac{I_1(m)}{I_0(m)} $$

其图像如下图所示。

因此，我们有

$$ A(m_{ML}) = \frac{1}{N} \cos\theta_0^{ML} \sum_{n=1}^N \cos\theta_n - \frac{1}{N} \sin\theta_0^{ML} \sum_{n=1}^N \sin\theta_n $$

我们可以通过数值的方法来求 $m_{ML}$ 的值。

In [12]:

_, ax = plt.subplots()

xx = np.r_[0:10:0.01]

ax.plot(xx, special.i1(xx) / special.i0(xx), 'r')

ax.set_xlabel("$m$", fontsize='x-large')
ax.set_ylabel("$A(m)$", fontsize='x-large')
ax.set_xticks(np.r_[0:11:5])
ax.set_yticks(np.r_[0:1.1:0.5])

plt.show()

2.3.9 高斯混合模型¶

之前说到，高斯分布是一个单峰模型，对于多峰情况的模拟效果并不好。

In [13]:

from scipy.stats import multivariate_normal

sz = 100

tt1 = multivariate_normal.rvs(cov=np.array([[2,1], [1,2]]), 
                              size=sz)
x1, y1 = tt1[:,0], tt1[:,1]

tt2 = multivariate_normal.rvs(mean = np.array([8, 8]),
                              cov=np.array([[1,0.5], [0.5,2]]), 
                              size=sz)
x2, y2 = tt2[:,0], tt2[:,1]

fig, axes = plt.subplots(1, 2, figsize=(10, 4))

m1 = np.mean(tt1, axis=0)
c1 = np.dot((tt1 - m1).T, (tt1 - m1)) / sz

m2 = np.mean(tt2, axis=0)
c2 = np.dot((tt2 - m2).T, (tt2 - m2)) / sz

tt = np.vstack([tt1, tt2])

m = np.mean(tt, axis=0)
c = np.dot((tt - m).T, (tt - m)) / sz / 2

x, y = np.mgrid[-5:12:0.1, -5:12:0.1]

z1 = bivariate_normal(x, y, 
                      mux=m1[0], 
                      muy=m1[1], 
                      sigmax=np.sqrt(c1[0][0]),
                      sigmay=np.sqrt(c1[1][1]),
                      sigmaxy = c1[1][0])

z2 = bivariate_normal(x, y, 
                      mux=m2[0], 
                      muy=m2[1], 
                      sigmax=np.sqrt(c2[0][0]),
                      sigmay=np.sqrt(c2[1][1]),
                      sigmaxy = c2[1][0])

z = bivariate_normal(x, y, 
                     mux=m[0], 
                     muy=m[1], 
                     sigmax=np.sqrt(c[0][0]),
                     sigmay=np.sqrt(c[1][1]),
                     sigmaxy = c[1][0])

axes[0].contour(x, y, z, 4, colors='b')
axes[1].contour(x, y, z1, 4, colors='b')
axes[1].contour(x, y, z2, 4, colors='b')

for ax in axes:
    ax.scatter(x1, y1, color='lime')
    ax.scatter(x2, y2, color='lime')

plt.show()

但我们可以使用多个高斯分布的混合来模拟多峰模型。

高斯分布的一个简单线性混合就能产生十分复杂的分布，如图所示，红线是三个高斯分布线性组合后的概率密度，蓝线是三个高斯分布按相应比例进行缩放后的概率密度：

In [14]:

xx = np.linspace(-20, 20, 200)

y1 = norm.pdf(xx, loc=0, scale=4)
y2 = norm.pdf(xx, loc=10, scale=1.5)
y3 = norm.pdf(xx, loc=-8, scale=2)

fig, ax = plt.subplots()

ax.plot(xx, y1, 'b')
ax.plot(xx, y2, 'b')
ax.plot(xx, y3, 'b')

ax.plot(xx, y1 + y2 + y3, 'r')

ax.set_xticks([])
ax.set_yticks([])

ax.set_xlabel("$x$", fontsize="xx-large")
ax.set_ylabel("$p(x)$", fontsize="xx-large")

plt.show()

考虑 $K$ 个高斯分布的线性组合，即高斯混合分布（mixture of Gaussians）：

$$ p(\mathbf x) = \sum_{k=1}^K \pi_k \mathcal N(\mathbf x|\mathbf \mu_k, \mathbf \Sigma_k) $$

其中 $\mathcal N(\mathbf x|\mathbf \mu_k, \mathbf \Sigma_k)$ 是一个高斯分布，是混合分布的一个组分（component of the mixture）。

考虑概率分布的归一性 $\int p(\mathbf x) d\mathbf x = 1$：

$$ \sum_{k=1}^K \pi_k = 1 $$

同时考虑非负性 $p(\mathbf x) \geq 0$：

$$ \forall k, \pi_k \geq 0 $$

于是我们有：

$$ 0 \leq \pi_k \leq 1 $$

另一方面，我们有：

$$ p(\mathbf x) = \sum_{k=1}^N p(k) p(\mathbf x|k) $$

从而我们可以认为 $\pi_k = p(k)$ 是第 $k$ 个组分的先验分布，$p(\mathbf x|k)$ 是给定 $k$ 下的条件分布。

从而我们可以计算 $k$ 的后验分布：

$$ \begin{align} \gamma_k(\mathbf x) & = p(k|\mathbf x) \\ & = \frac{p(k) p(\mathbf x|k)}{\sum_{k=1}^N p(k) p(\mathbf x|k)} \\ & = \frac{\pi_k \mathcal N(\mathbf x|\mathbf \mu_k, \mathbf \Sigma_k) }{\sum_{k=1}^K \pi_k \mathcal N(\mathbf x|\mathbf \mu_k, \mathbf \Sigma_k) } \end{align} $$

其参数为 $\bf \pi, \mu, \Sigma$，其中 $\mathbf \pi = {\pi_1, \dots, \pi_n}, \mathbf \mu = \{\mathbf\mu_1,\dots,\mathbf\mu_n\}, \mathbf\Sigma=\{\mathbf\Sigma_1,\dots,\mathbf\Sigma_n\}$。

对于数据 $\mathbf X=\{\mathbf x_1,\dots,\mathbf x_n\}$，似然函数为

$$ \ln p(\mathbf{X|\pi, \mu, \Sigma}) = \sum_{n=1}^N \ln \left\{ \sum_{k=1}^K \pi_k \mathcal N(\mathbf x_n|\mathbf \mu_k, \mathbf \Sigma_k) \right\} $$

其最大似然解比较复杂，没有显式的表示形式，可以用数值法求解，或者用 EM 算法求解。