1. Learning the Parameters with Gradient Descent

Derivation of the Loss Function

As with the HMM, the CRF learns its parameters by maximum likelihood estimation. The log-likelihood function is:

$$
\begin{aligned}
L(w)&=L_{\tilde{P}}(P_w)\\
&=\log\prod_{x,y}P_w(y\mid x)^{\tilde{P}(x,y)}\\
&=\sum_{x,y}\tilde{P}(x,y)\log P_w(y\mid x)\\
&=\sum_{j=1}^N\log P_w(y_j\mid x_j)\quad(\text{assuming a sample size of }N)
\end{aligned}
$$

(In the last step the empirical distribution $\tilde{P}(x,y)=\frac{\#(x,y)}{N}$ contributes a constant factor $\frac{1}{N}$, which is dropped since it does not affect the optimum.)

Substituting the CRF's conditional distribution in its log-linear form:

$$
P_w(y\mid x)=\frac{\exp\left(\sum_{k=1}^Kw_kf_k(y,x)\right)}{Z_w(x)}
$$

into the log-likelihood above. Since $\log P_w(y\mid x)=\sum_{k=1}^Kw_kf_k(y,x)-\log Z_w(x)$, we get:

$$
L(w)=\sum_{j=1}^N\sum_{k=1}^Kw_kf_k(y_j,x_j)-\sum_{j=1}^N\log Z_w(x_j)
$$

Maximizing the likelihood is therefore equivalent to the minimization problem:

$$
w^*=\arg\min_{w}\sum_{j=1}^N\left(\log Z_w(x_j)-\sum_{k=1}^Kw_kf_k(y_j,x_j)\right)
$$
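To make the objective concrete, here is a minimal sketch (with made-up values for `w`, the feature vectors, and `logZ`; in the real model these come from the feature functions and the forward algorithm) that evaluates $\sum_{j=1}^N(\log Z_w(x_j)-\sum_{k=1}^Kw_kf_k(y_j,x_j))$:

import numpy as np

w = np.array([0.1, -0.2, 0.05])  # K = 3 weights
F = np.array([[1., 0., 1.],      # row j holds F(y_j, x_j); N = 2 samples here
              [0., 1., 1.]])
logZ = np.array([1.2, 0.9])      # log Z_w(x_j), taken as given in this sketch

objective = np.sum(logZ - F.dot(w))  # the quantity being minimized
print(objective)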

Gradient Descent

The simplest choice is plain gradient descent. The gradient used here works out to:

$$
g(w^t)=\sum_{j=1}^N\left(P_{w^t}(y_j\mid x_j)-1\right)F(y_j,x_j)
$$

(Strictly, $\nabla_w\log Z_w(x_j)$ is the expectation $\sum_yP_w(y\mid x_j)F(y,x_j)$ over all label sequences; the expression above keeps only the $y=y_j$ term of that expectation, a simplification the implementation below follows as well.)

Here, $F(y_j,x_j)=[f_1(y_j,x_j),f_2(y_j,x_j),\dots,f_K(y_j,x_j)]^T$, so the weight update rule follows immediately:

$$ w^{t+1}=w^t-\eta g(w^t) $$

Here, $\eta$ is the learning rate. Reading the update rule gives a feel for training: since $P_{w^t}(y_j\mid x_j)\le 1$, the factor $(P_{w^t}(y_j\mid x_j)-1)$ is non-positive, so each iteration increases the weights of feature functions that fire (output 1) on a training pair and leaves the weights of features that output 0 unchanged, which seems quite reasonable.
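As a quick sanity check on this intuition, here is a minimal sketch of a single stochastic step $w\leftarrow w-\eta(P_w(y_j\mid x_j)-1)F(y_j,x_j)$, again with a made-up feature vector and partition value:

import numpy as np

eta = 1e-3
w = np.zeros(3)
F_j = np.array([1., 0., 1.])     # F(y_j, x_j) for one sample
Z_j = 5.0                        # Z_w(x_j), taken as given in this sketch
P_w = np.exp(w.dot(F_j)) / Z_j   # P_w(y_j | x_j)
w = w - eta * (P_w - 1.0) * F_j  # (P_w - 1) <= 0, so firing features grow
print(w)                         # only the weights of firing features changed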

2. Code Implementation

In [1]:
import os
os.chdir('../')
from ml_models.pgm import CRFFeatureFunction
import numpy as np

"""
Implementation of a linear-chain CRF, packaged into ml_models.pgm
"""


class CRF(object):
    def __init__(self, epochs=10, lr=1e-3, tol=1e-5, output_status_num=None, input_status_num=None, unigram_rulers=None,
                 bigram_rulers=None):
        """
        :param epochs: 迭代次数
        :param lr: 学习率
        :param tol:梯度更新的阈值
        :param output_status_num:标签状态数
        :param input_status_num:输入状态数
        :param unigram_rulers: 状态特征规则
        :param bigram_rulers: 状态转移规则
        """
        self.epochs = epochs
        self.lr = lr
        self.tol = tol
        # add a head id and a tail id to the input and label state spaces
        self.output_status_num = output_status_num + 2
        self.input_status_num = input_status_num + 2
        self.input_status_head_tail = [input_status_num, input_status_num + 1]
        self.output_status_head_tail = [output_status_num, output_status_num + 1]
        # feature functions
        self.FF = CRFFeatureFunction(unigram_rulers, bigram_rulers)
        # model parameters
        self.w = None

    def fit(self, x, y):
        """
        :param x: [[...],[...],...,[...]]
        :param y: [[...],[...],...,[...]]
        :return
        """
        # add the head/tail ids to x and y
        x = [[self.input_status_head_tail[0]] + xi + [self.input_status_head_tail[1]] for xi in x]
        y = [[self.output_status_head_tail[0]] + yi + [self.output_status_head_tail[1]] for yi in y]
        self.FF.fit(x, y)
        self.w = np.ones(len(self.FF.feature_funcs)) * 1e-5
        for _ in range(0, self.epochs):
            # take a shortcut and use stochastic gradient descent
            for i in range(0, len(x)):
                xi = x[i]
                yi = y[i]
                """
                1.求F(yi \mid xi)以及P_w(yi \mid xi)
                """
                F_y_x = []
                # forward vector: a row of ones summing over all start states
                Z_x = np.ones(shape=(1, self.output_status_num))
                for j in range(1, len(xi)):
                    F_y_x.append(self.FF.map(yi[j - 1], yi[j], xi, j))
                    # build the transition matrix M for position j
                    M = np.zeros(shape=(self.output_status_num, self.output_status_num))
                    for k in range(0, self.output_status_num):
                        for t in range(0, self.output_status_num):
                            M[k, t] = np.exp(np.dot(self.w, self.FF.map(k, t, xi, j)))
                    # forward algorithm for Z(x)
                    Z_x = Z_x.dot(M)
                F_y_x = np.sum(F_y_x, axis=0)
                Z_x = np.sum(Z_x)
                # compute P_w(yi | xi)
                P_w = np.exp(np.dot(self.w, F_y_x)) / Z_x
                """
                2.求梯度,并更新
                """
                dw = (P_w - 1) * F_y_x
                self.w = self.w - self.lr * dw
                # end this epoch early once the RMS gradient falls below tol
                if np.sqrt(np.dot(dw, dw) / len(dw)) < self.tol:
                    break
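To see why the Z_x loop above yields the partition function, here is a standalone sketch (toy sizes and random log-potentials assumed, independent of CRFFeatureFunction) that checks the matrix-product forward computation against a brute-force sum over all label paths:

import itertools
import numpy as np

np.random.seed(0)
S, T = 3, 4                    # number of label states, number of transitions
# log-potential per position: logM[t][k, l] plays the role of w · f(k, l, x, t)
logM = np.random.randn(T, S, S)

# forward pass: a row of ones times the chain of M matrices, then summed
Z_forward = np.ones((1, S))
for t in range(T):
    Z_forward = Z_forward.dot(np.exp(logM[t]))
Z_forward = Z_forward.sum()

# brute force: enumerate every label path of T + 1 states
Z_brute = 0.0
for path in itertools.product(range(S), repeat=T + 1):
    Z_brute += np.exp(sum(logM[t][path[t], path[t + 1]] for t in range(T)))

print(Z_forward, Z_brute)      # the two values agree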
In [2]:
# a quick sanity check
x = [
    [1, 2, 3, 0, 1, 3, 4],
    [1, 2, 3],
    [0, 2, 4, 2],
    [4, 3, 2, 1],
    [3, 1, 1, 1, 1],
    [2, 1, 3, 2, 1, 3, 4]
]
y = x  # reuse the inputs as labels, just to exercise the code

crf = CRF(output_status_num=5, input_status_num=5)
crf.fit(x, y)
In [3]:
crf.w
Out[3]:
array([1.00000000e-05, 1.00009053e-01, 7.00089277e-02, 7.00091106e-02,
       2.00098984e-02, 4.00097839e-02, 6.00090103e-02, 1.00000000e-05,
       2.00092452e-02, 2.00092452e-02, 1.00099996e-02, 1.00099996e-02,
       3.00099986e-02, 2.00099991e-02, 2.00099991e-02, 2.00092452e-02,
       2.00092452e-02, 2.00092452e-02, 1.00099996e-02, 1.00099996e-02,
       3.00099986e-02, 2.00099991e-02, 2.00099991e-02, 1.00000000e-05,
       2.00092452e-02, 2.00092452e-02, 1.00099996e-02, 1.00099996e-02,
       3.00099986e-02, 2.00099991e-02, 2.00099991e-02, 2.00092452e-02,
       2.00092452e-02, 2.00092452e-02, 1.00099996e-02, 1.00099996e-02,
       3.00099986e-02, 2.00099991e-02, 2.00099991e-02, 2.00092452e-02,
       2.00092452e-02, 2.00092452e-02, 1.00099996e-02, 1.00099996e-02,
       3.00099986e-02, 2.00099991e-02, 2.00099991e-02, 1.00092456e-02,
       1.00092456e-02, 1.00092456e-02, 1.00092456e-02, 1.00092456e-02,
       1.00000000e-05, 1.00098988e-02, 1.00098988e-02, 1.00098988e-02,
       1.00098988e-02, 1.00098988e-02, 1.00098988e-02, 1.00098988e-02,
       1.00098988e-02, 1.00098988e-02, 1.00000000e-05, 1.00098988e-02,
       1.00098988e-02, 1.00098988e-02, 1.00098988e-02, 1.00098988e-02,
       1.00098988e-02, 1.00098988e-02, 1.00098988e-02, 1.00098988e-02,
       1.00098988e-02, 1.00098988e-02, 1.00098988e-02, 1.00098988e-02,
       1.00098988e-02, 1.00000000e-05, 1.00098860e-02, 2.00098855e-02,
       3.00098850e-02, 2.00098668e-02, 1.00098860e-02, 1.00098860e-02,
       2.00098855e-02, 3.00098850e-02, 2.00098668e-02, 1.00000000e-05,
       1.00098860e-02, 2.00098855e-02, 3.00098850e-02, 2.00098668e-02,
       1.00098860e-02, 1.00098860e-02, 2.00098855e-02, 3.00098850e-02,
       2.00098668e-02, 1.00098860e-02, 1.00098860e-02, 2.00098855e-02,
       3.00098850e-02, 2.00098668e-02, 1.00000000e-05, 1.00099808e-02,
       3.00099424e-02, 1.00099808e-02, 1.00099808e-02, 3.00099424e-02,
       1.00000000e-05, 1.00099808e-02, 3.00099424e-02, 1.00099808e-02,
       1.00099808e-02, 3.00099424e-02, 1.00099808e-02, 1.00099808e-02,
       3.00099424e-02, 1.00000000e-05, 1.00099995e-02, 1.00000000e-05,
       1.00099995e-02, 1.00099995e-02])
In [4]:
len(crf.w)
Out[4]:
122

So there are 122 feature functions in total. One last problem remains: predicting the label sequence, which is covered in the next section.
