训练数据集
\begin{align*} \\& T = \left\{ \left( x_{1}, y_{1} \right), \left( x_{2}, y_{2} \right), \cdots, \left( x_{N}, y_{N} \right) \right\} \end{align*}
假设分类模型是条件概率分布$P \left( Y | X \right), X \in \mathcal{X} \subseteq R^{n}$表示输入,$Y \in \mathcal{Y}$表示输出。给定输入$X$,以条件概率$P \left( Y | X \right)$输出$Y$。
特征函数$f \left( x, y \right)$描述输入$x$和输出$y$之间的某一事实, \begin{align*} f \left( x, y \right) = \left\{ \begin{aligned} \ & 1, x与y满足某一事实 \\ & 0, 否则 \end{aligned} \right.\end{align*}
特征函数$f \left( x, y \right)$关于经验分布$ \tilde{P} \left( X, Y \right) $的期望 \begin{align*} \\& E_{ \tilde{P} } \left( f \right) = \sum_{x, y} \tilde{P} \left( x, y \right) f \left( x, y \right) \end{align*}
特征函数$f \left( x, y \right)$关于模型$ P \left( Y | X \right) $与经验分布$ \tilde{P} \left( X \right) $的期望 \begin{align*} \\& E_{ P } \left( f \right) = \sum_{x, y} \tilde{P} \left( x \right) P \left( y | x \right) f \left( x, y \right) \end{align*}
最大熵模型:假设满足所有约束条件的模型集合为
\begin{align*} \\& \mathcal{C} \equiv \left\{ P \in \mathcal{P} | E_{ P } \left( f_{i} \right) = E_{ \tilde{P} } \left( f_{i} \right), i = 1,2, \cdots, n \right\}\end{align*}
定义在条件概率分布$P \left( Y | X \right)$上的条件熵为
\begin{align*} \\& H \left( P \right) = - \sum_{x,y} \tilde{P} \left( x \right) P \left( y | x \right) \log P \left( y | x \right) \end{align*}
则模型集合$\mathcal{C}$中条件熵$H \left( P \right)$最大的模型称为最大熵模型。
给定训练数据集$T = \left\{ \left( x_{1}, y_{1} \right), \left( x_{2}, y_{2} \right), \cdots, \left( x_{N}, y_{N} \right) \right\}$以及特征函数$f_{i} \left( x, y \right), i = 1, 2, \cdots, n$,最大熵模型的学习等价于最优化问题:
\begin{align*} \\& \max_{P \in \mathcal{C} } \quad H \left( P \right) = - \sum_{x,y} \tilde{P} \left( x \right) P \left( y | x \right) \log P \left( y | x \right)
\\ & s.t.\quad E_{ P } \left( f_{i} \right) = E_{ \tilde{P} } \left( f_{i} \right), i = 1,2, \cdots, n
\\ & \sum_{y} P \left( y | x \right) = 1 \end{align*}
等价的
\begin{align*} \\& \min_{P \in \mathcal{C} } \quad -H \left( P \right) = \sum_{x,y} \tilde{P} \left( x \right) P \left( y | x \right) \log P \left( y | x \right)
\\ & s.t.\quad E_{ P } \left( f_{i} \right) - E_{ \tilde{P} } \left( f_{i} \right) = 0, i = 1,2, \cdots, n
\\ & \sum_{y} P \left( y | x \right) = 1 \end{align*}
最优化问题的求解:
2. 求$\min_{P \in \mathcal{C} } L \left( P, w \right)$:
记对偶函数$\Psi \left( w \right) = min_{P \in \mathcal{C} } L \left( P, w \right) = L \left( P_{w}, w \right)$,其解记$P_{w} = \arg \min_{P \in \mathcal{C} } L \left( P, w \right) = P_{w} \left( y | x \right)$
\begin{align*} \\& \dfrac {\partial L \left( P, w \right)} {\partial P \left( y | x \right)} = \sum_{x,y} \tilde{P} \left( x \right) \left( \log P \left( y | x \right) + 1 \right) - \sum_{y} w_{0} - \sum_{x,y} \left( \tilde{P} \left( x \right) \sum_{i=1}^{n} w_{i} f_{i} \left( x, y \right) \right)
\\ & \quad = \sum_{x,y} \tilde{P} \left( x \right) \left( \log P \left( y | x \right) + 1 \right) - \sum_{x,y} P \left( x \right) w_{0} - \sum_{x,y} \left( \tilde{P} \left( x \right) \sum_{i=1}^{n} w_{i} f_{i} \left( x, y \right) \right)
\\ & \quad = \sum_{x,y} \tilde{P} \left( x \right) \left( \log P \left( y | x \right) + 1 - w_{0} - \sum_{i=1}^{n} w_{i} f_{i} \left( x, y \right) \right) = 0\end{align*}
由于$\tilde{P} \left( x \right) > 0 $,得
\begin{align*} \\ & \log P \left( y | x \right) + 1 - w_{0} - \sum_{i=1}^{n} w_{i} f_{i} \left( x, y \right)=0
\\ & P \left( y | x \right) = \exp \left( \sum_{i=1}^{n} w_{i} f_{i} \left( x, y \right) + w_{0} -1 \right) = \dfrac{ \exp \left( \sum_{i=1}^{n} w_{i} f_{i} \left( x, y \right) \right) }{ \exp \left( 1 - w_{0} \right)}\end{align*}
由于
\begin{align*} \\& \sum_{y} P \left( y | x \right) = 1 \end{align*}
则
\begin{align*} \\ & \sum_{y} P \left( y | x \right) = \sum_{y} \dfrac{ \exp \left( \sum_{i=1}^{n} w_{i} f_{i} \left( x, y \right) \right) }{ \exp \left( 1 - w_{0} \right)} = 1
\\ & \sum_{y} \exp \left( \sum_{i=1}^{n} w_{i} f_{i} \left( x, y \right) \right) = \exp \left( 1 - w_{0} \right)\end{align*}
代入,得
\begin{align*} \\ & P \left( y | x \right) = \dfrac{1 }{Z_{w} \left( x \right)}\exp \left( \sum_{i=1}^{n} w_{i} f_{i} \left( x, y \right) \right) \end{align*}
其中
\begin{align*} Z_{w} = \sum_{y} \exp \left( \sum_{i=1}^{n} w_{i} f_{i} \left( x, y \right) \right) \end{align*}
$Z_{w}$称为规范化因子;$f_{i} \left( x, y \right)$是特征函数;$w_{i}$是特征的权值。
3. 求$\max_{w} \Psi \left( w \right)$
将其解记为$w^{*}$,即
\begin{align*} w^{*} = \arg \max_{w} \Psi \left( w \right) \end{align*}
已知训练数据的经验概率分布$\tilde{P} \left( X, Y \right)$,则条件概率分布$P \left( X | Y \right)$的对数似然函数
\begin{align*} \\ & L_{\tilde{P}} \left( P_{w} \right) = \log \prod_{x,y} P \left( y | x \right)^{\tilde{P} \left( x, y \right)}
\\ & = \sum_{x,y} \tilde{P} \left( x, y \right) \log P \left( y | x \right)
\\ & = \sum_{x,y} \tilde{P} \left( x, y \right) \log \dfrac{\exp \left( \sum_{i=1}^{n} w_{i} f_{i} \left( x, y \right) \right)}{Z_{w} \left( x \right) }
\\ & = \sum_{x,y} \tilde{P} \left( x, y \right) \sum_{i=1}^{n} w_{i} f_{i} \left( x, y \right) - \sum_{x,y} \tilde{P} \left( x, y \right) \log Z_{w} \left( x \right)
\\ & = \sum_{x,y} \tilde{P} \left( x, y \right) \sum_{i=1}^{n} w_{i} f_{i} \left( x, y \right) - \sum_{x} \tilde{P} \left( x \right) \log Z_{w} \left( x \right)\end{align*}
对偶函数
\begin{align*} \\ & \Psi \left( w \right) = min_{P \in \mathcal{C} } L \left( P, w \right) = L \left( P_{w}, w \right) \\ & = - H \left( P_{w} \right) + w_{0} \left( 1 - \sum_{y} P_{w} \left( y | x \right) \right) + \sum_{i=1}^{n} w_{i} \left( E_{\tilde{P}} \left( f_{i} \right) - E_{P_{w}} \left( f_{i} \right) \right)
\\ & = \sum_{x,y} \tilde{P} \left( x \right) P_{w} \left( y | x \right) \log P_{w} \left( y | x \right)
\\& \quad\quad\quad + w_{0} \left( 1 - \sum_{y} \dfrac{1 }{Z_{w} \left( x \right)}\exp \left( \sum_{i=1}^{n} w_{i} f_{i} \left( x, y \right) \right) \right)
\\ & \quad\quad\quad + \sum_{i=1}^{n} w_{i} \left( \sum_{x, y} \tilde{P} \left( x, y \right) f_{i} \left( x, y \right) - \sum_{x, y} \tilde{P} \left( x \right) P_{w} \left( y | x \right) f_{i} \left( x, y \right) \right)
\\ & = \sum_{x, y} \tilde{P} \left( x, y \right) \sum_{i=1}^{n} w_{i} f_{i} \left( x, y \right) + \sum_{x,y} \tilde{P} \left( x \right) P_{w} \left( y | x \right) \left( \log P_{w} \left( y | x \right) - \sum_{i=1}^{n} w_{i} f_{i} \left(x, y \right) \right)
\\ & = \sum_{x,y} \tilde{P} \left( x, y \right) \sum_{i=1}^{n} w_{i} f_{i} \left( x, y \right) - \sum_{x,y} \tilde{P} \left( x, y \right) \log Z_{w} \left( x \right)
\\ & = \sum_{x,y} \tilde{P} \left( x, y \right) \sum_{i=1}^{n} w_{i} f_{i} \left( x, y \right) - \sum_{x} \tilde{P} \left( x \right) \log Z_{w} \left( x \right)\end{align*}
得
\begin{align*} \\ & L_{\tilde{P}} \left( P_{w} \right) = \Psi \left( w \right)\end{align*}
即,最大熵模型的极大似然估计等价于对偶函数极大化。
已知最大熵模型
\begin{align*} \\ & P_{w} \left( y | x \right) = \dfrac{1 }{Z_{w} \left( x \right)}\exp \left( \sum_{i=1}^{n} w_{i} f_{i} \left( x, y \right) \right) \end{align*}
其中
\begin{align*} Z_{w} = \sum_{y} \exp \left( \sum_{i=1}^{n} w_{i} f_{i} \left( x, y \right) \right) \end{align*}
$Z_{w}$称为规范化因子;$f_{i} \left( x, y \right)$是特征函数;$w_{i}$是特征的权值。
对数似然函数 \begin{align*} \\ & L \left( w \right) = \sum_{x,y} \tilde{P} \left( x, y \right) \log P_{w} \left( y | x \right) \\ & = \sum_{x,y} \tilde{P} \left( x, y \right) \sum_{i=1}^{n} w_{i} f_{i} \left( x, y \right) - \sum_{x} \tilde{P} \left( x \right) \log Z_{w} \left( x \right) \end{align*}
对于给定的经验分布$\tilde{P}$,模型参数从$w$到$w + \delta$,对数似然函数的改变量 \begin{align*} \\ & L \left( w + \delta \right) - L \left( w \right) = \sum_{x,y} \tilde{P} \left( x, y \right) \log P_{w + \delta} \left( y | x \right) - \sum_{x,y} \tilde{P} \left( x, y \right) \log P_{w} \left( y | x \right) \\ & = \left( \sum_{x,y} \tilde{P} \left( x, y \right) \sum_{i=1}^{n} \left( w_{i} + \delta_{i} \right) f_{i} \left( x, y \right) - \sum_{x} \tilde{P} \left( x \right) \log Z_{w + \delta} \left( x \right) \right) \\ & \quad\quad\quad\quad\quad\quad - \left( \sum_{x,y} \tilde{P} \left( x, y \right) \sum_{i=1}^{n} w_{i} f_{i} \left( x, y \right) - \sum_{x} \tilde{P} \left( x \right) \log Z_{w} \left( x \right) \right) \\ & = \sum_{x,y} \tilde{P} \left( x, y \right) \sum_{i=1}^{n} \delta_{i} f_{i} \left( x, y \right) - \sum_{x} \tilde{P} \left( x \right) \log \dfrac{Z_{w + \delta} \left( x \right)}{Z_{w} \left( x \right)}\end{align*}
由
\begin{align*} - \log \alpha \geq 1 - \alpha, \alpha > 0 \end{align*}
得
\begin{align*} \\ & L \left( w + \delta \right) - L \left( w \right) \geq \sum_{x,y} \tilde{P} \left( x, y \right) \sum_{i=1}^{n} \delta_{i} f_{i} \left( x, y \right) + 1 - \sum_{x} \tilde{P} \left( x \right) \dfrac{Z_{w + \delta} \left( x \right)}{Z_{w} \left( x \right)}
\\ & = \sum_{x,y} \tilde{P} \left( x, y \right) \sum_{i=1}^{n} \delta_{i} f_{i} \left( x, y \right) + 1 - \sum_{x} \tilde{P} \left( x \right) \dfrac{\sum_{y} \exp \left( \sum_{i=1}^{n} \left( w_{i} + \delta_{i} \right) f_{i} \left( x, y \right) \right)}{\sum_{y} \exp \left( \sum_{i=1}^{n} w_{i} f_{i} \left( x, y \right) \right)}
\\ & = \sum_{x,y} \tilde{P} \left( x, y \right) \sum_{i=1}^{n} \delta_{i} f_{i} \left( x, y \right) + 1 - \sum_{x} \tilde{P} \left( x \right) \sum_{y} \dfrac{ \exp \left( \sum_{i=1}^{n} w_{i} f_{i} \left( x, y \right) \right)}{\sum_{y} \exp \left( \sum_{i=1}^{n} w_{i} f_{i} \left( x, y \right) \right)} \exp \left( \sum_{i=1}^{n} \delta_{i} f_{i} \left( x, y \right) \right)
\\ & = \sum_{x,y} \tilde{P} \left( x, y \right) \sum_{i=1}^{n} \delta_{i} f_{i} \left( x, y \right) + 1 - \sum_{x} \tilde{P} \left( x \right) \sum_{y} P_{w} \left( y | x \right) \exp \left( \sum_{i=1}^{n} \delta_{i} f_{i} \left( x, y \right) \right)\end{align*}
记
\begin{align*} \\ & A \left( \delta | w \right) = \sum_{x,y} \tilde{P} \left( x, y \right) \sum_{i=1}^{n} \delta_{i} f_{i} \left( x, y \right) + 1 - \sum_{x} \tilde{P} \left( x \right) \sum_{y} P_{w} \left( y | x \right) \exp \left( \sum_{i=1}^{n} \delta_{i} f_{i} \left( x, y \right) \right)\end{align*}
则
\begin{align*} \\ & L \left( w + \delta \right) - L \left( w \right) \geq A \left( \delta | w \right)\end{align*}
即$ A \left( \delta | w \right)$是对数似然函数改变量的一个下届。
引入\begin{align*} \\ & f^{\#} \left( x, y \right) = \sum_{i} f_{i} \left( x, y \right) \end{align*}
表示所有特征在$\left( x, y \right)$出现的次数
则
\begin{align*} \\ & A \left( \delta | w \right) = \sum_{x,y} \tilde{P} \left( x, y \right) \sum_{i=1}^{n} \delta_{i} f_{i} \left( x, y \right) + 1 - \sum_{x} \tilde{P} \left( x \right) \sum_{y} P_{w} \left( y | x \right) \exp \left( f^{\#} \left( x, y \right) \sum_{i=1}^{n} \dfrac{\delta_{i} f_{i} \left( x, y \right) }{f^{\#} \left( x, y \right) } \right)\end{align*}
对任意$i$,有$\dfrac{f_{i} \left( x, y \right)}{f^{\#} \left( x, y \right)} \geq 0$且$\sum_{i=1}^{n} \dfrac{f_{i} \left( x, y \right)}{f^{\#} \left( x, y \right)} = 1$,
根据Jensen不等式,得
\begin{align*} \\ & \exp \left( \sum_{i=1}^{n} \dfrac{f_{i} \left( x, y \right)}{f^{\#} \left( x, y \right)} \delta_{i} f_{\#} \left( x, y \right) ) \right) \leq \sum_{i=1}^{n} \dfrac{f_{i} \left( x, y \right)}{f^{\#} \left( x, y \right)} \exp \left( \delta_{i} f^{\#} \left(x, y\right) \right)\end{align*}
则
\begin{align*} \\ & A \left( \delta | w \right) \geq \sum_{x,y} \tilde{P} \left( x, y \right) \sum_{i=1}^{n} \delta_{i} f_{i} \left( x, y \right) + 1 - \sum_{x} \tilde{P} \left( x \right) \sum_{y} P_{w} \left( y | x \right) \sum_{i=1}^{n} \left( \dfrac{f_{i} \left( x, y \right)}{f^{\#} \left( x, y \right)} \right) \exp \left( \delta_{i} f^{\#} \left(x, y\right) \right)\end{align*}
记
\begin{align*} \\ & B \left( \delta | w \right) = \sum_{x,y} \tilde{P} \left( x, y \right) \sum_{i=1}^{n} \delta_{i} f_{i} \left( x, y \right) + 1 - \sum_{x} \tilde{P} \left( x \right) \sum_{y} P_{w} \left( y | x \right) \sum_{i=1}^{n} \left( \dfrac{f_{i} \left( x, y \right)}{f^{\#} \left( x, y \right)} \right) \exp \left( \delta_{i} f^{\#} \left(x, y\right) \right)\end{align*}
则
\begin{align*} \\ & L \left( w + \delta \right) - L \left( w \right) \geq A \left( \delta | w \right) \geq B \left( \delta | w \right)\end{align*}
即$ B \left( \delta | w \right)$是对数似然函数改变量的一个新的(相对不紧的)下届。
求
\begin{align*} \\ & \dfrac {\partial B \left( \delta | w \right) }{\partial \delta_{i}} = \sum_{x,y} \tilde{P} \left( x, y \right) f_{i} \left( x, y \right) - \sum_{x} \tilde{P} \left( x \right) \sum_{y} P_{w} \left( y | x \right) f_{i} \left( x, y \right) \exp \left( \delta_{i} f^{\#} \left(x, y\right) \right)\end{align*}
令$ \dfrac {\partial B \left( \delta | w \right) }{\partial \delta_{i}} = 0 $
得
\begin{align*} \\ & \sum_{x,y} \tilde{P} \left( x, y \right) f_{i} \left( x, y \right) = \sum_{x, y} \tilde{P} \left( x \right) P_{w} \left( y | x \right) f_{i} \left( x, y \right) \exp \left( \delta_{i} f^{\#} \left(x, y\right) \right)\end{align*}
对$\delta_{i}$求解可解得$\delta$
改进的迭代尺度算法(IIS):
输入:特征函数$f_{i},i=1, 2, \cdots, n$,经验分布$\tilde{P} \left( x, y \right)$,模型$P_{w} \left( y | x \right)$
输出:最优参数值$w_{i}^{*}$;最优模型$P_{w^{*}}$
2.1 令$\delta_{i}$是方程
\begin{align*} \\ & \sum_{x,y} \tilde{P} \left( x, y \right) f_{i} \left( x, y \right) = \sum_{x, y} \tilde{P} \left( x \right) P_{w} \left( y | x \right) f_{i} \left( x, y \right) \exp \left( \delta_{i} f^{\#} \left(x, y\right) \right) \end{align*}
的解
2.2 更新$w_{i}$的值
\begin{align*} \\ & w_{i} \leftarrow w_{i} + \delta_{i}\end{align*}
3. 如果不是所有$w_{i}$都收敛,重复步骤2.