Non-negative matrix factorization (NMF) is conceptually simple: much as SVD decomposes a matrix into a product of three matrices, NMF decomposes a matrix into two smaller ones. An original matrix $A_{m\times n}$ is approximated by the product of $W_{m\times k}$ and $H_{k\times n}$:
$$ A_{m\times n}\simeq W_{m\times k}H_{k\times n} $$Here, all entries of $A,W,H$ are required to be non-negative. Parameter estimation is equally simple: minimize the following squared loss:
$$ L(A,W,H)=\frac{1}{2}\sum_{i=1}^m\sum_{j=1}^n(A_{ij}-(WH)_{ij})^2=\frac{1}{2}\sum_{i=1}^m\sum_{j=1}^n(A_{ij}-\sum_{l=1}^kW_{il}H_{lj})^2 $$so the estimates are:
$$ W^*,H^*=\arg\min_{W,H}L(A,W,H) $$The parameters can be estimated by gradient descent; let us derive the gradients:
$$ \frac{\partial L}{\partial W_{ik}}=-\sum_j(A_{ij}-(WH)_{ij})\cdot \frac{\partial (WH)_{ij}}{\partial W_{ik}}=-\sum_j(A_{ij}-(WH)_{ij})\cdot H_{kj}=-\sum_j A_{ij}H_{kj}+\sum_j(WH)_{ij}H_{kj}=(WHH^T)_{ik}-(AH^T)_{ik} $$Similarly:
$$ \frac{\partial L}{\partial H_{kj}}=(W^TWH)_{kj}-(W^TA)_{kj} $$The gradient-descent updates can therefore be written as:
$$ W_{ik}\leftarrow W_{ik}+\alpha_1[(AH^T)_{ik}-(WHH^T)_{ik}]\\ H_{kj}\leftarrow H_{kj}+\alpha_2[(W^TA)_{kj}-(W^TWH)_{kj}] $$where $\alpha_1>0,\alpha_2>0$ are learning rates. If we cleverly set:
$$ \alpha_1=\frac{W_{ik}}{(WHH^T)_{ik}}\\ \alpha_2=\frac{H_{kj}}{(W^TWH)_{kj}} $$the updates become multiplicative:
$$ W_{ik}\leftarrow W_{ik}\cdot\frac{(AH^T)_{ik}}{(WHH^T)_{ik}}\\ H_{kj}\leftarrow H_{kj}\cdot\frac{(W^TA)_{kj}}{(W^TWH)_{kj}} $$Note that these update rules automatically satisfy our constraint: as long as $W$ and $H$ are initialized with non-negative values, they remain non-negative throughout the iterations.
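The multiplicative updates above translate directly into a few lines of NumPy. This is an illustrative sketch, not the `ml_models` implementation used below; the small constant `eps` guarding against division by zero is an added assumption, not part of the derivation:

```python
import numpy as np

def nmf_multiplicative(A, k, epochs=200, seed=0):
    """Factor a non-negative A (m x n) into W (m x k) and H (k x n)
    using the multiplicative update rules derived above."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    W = rng.random((m, k))  # non-negative random initialization
    H = rng.random((k, n))
    eps = 1e-12  # assumption: tiny constant to avoid division by zero
    for _ in range(epochs):
        # H_kj <- H_kj * (W^T A)_kj / (W^T W H)_kj
        H *= (W.T @ A) / (W.T @ W @ H + eps)
        # W_ik <- W_ik * (A H^T)_ik / (W H H^T)_ik
        W *= (A @ H.T) / (W @ H @ H.T + eps)
    return W, H

# usage: factors stay non-negative and the reconstruction is reasonable
A = np.random.default_rng(1).random((6, 8))
W, H = nmf_multiplicative(A, k=3)
print((W >= 0).all() and (H >= 0).all())  # True
```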
docs = [
    ["有", "微信", "红包", "的", "软件"],
    ["微信", "支付", "不行", "的"],
    ["我们", "需要", "稳定的", "微信", "支付", "接口"],
    ["申请", "公众号", "认证"],
    ["这个", "还有", "几天", "放", "垃圾", "流量"],
    ["可以", "提供", "聚合", "支付", "系统"]
]
word2id = {}
idx = 0
W = []
# assign each word an integer id and encode every doc as a list of ids
for doc in docs:
    tmp = []
    for word in doc:
        if word in word2id:
            tmp.append(word2id[word])
        else:
            word2id[word] = idx
            idx += 1
            tmp.append(word2id[word])
    W.append(tmp)
W
[[0, 1, 2, 3, 4], [1, 5, 6, 3], [7, 8, 9, 1, 5, 10], [11, 12, 13], [14, 15, 16, 17, 18, 19], [20, 21, 22, 5, 23]]
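As an aside, the id-assignment loop above can be written more compactly with `dict.setdefault`; a small self-contained equivalent on a toy corpus (the two-document `docs` here is illustrative only):

```python
# setdefault returns the existing id, or inserts len(word2id) as the new id
docs = [["a", "b", "c"], ["b", "d"]]
word2id = {}
W = [[word2id.setdefault(w, len(word2id)) for w in doc] for doc in docs]
print(W)  # [[0, 1, 2], [1, 3]]
```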
import numpy as np

# binary bag-of-words matrix: one row per document, one column per word id
data = np.zeros(shape=(len(docs), len(word2id)))
for idx, w in enumerate(W):
    for i in w:
        data[idx][i] = 1
import os
os.chdir('../')
from ml_models.decomposition import NMF
nmf = NMF(n_components=3, epochs=10)
trans = nmf.fit_transform(data)
def cosine(x1, x2):
    # cosine similarity: dot product normalized by the two vector norms
    return x1.dot(x2) / (np.sqrt(np.sum(np.power(x1, 2))) * np.sqrt(np.sum(np.power(x2, 2))))
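A quick sanity check of this cosine function on synthetic vectors (repeating the definition so the snippet runs standalone; the vectors are illustrative, not from the corpus):

```python
import numpy as np

def cosine(x1, x2):
    return x1.dot(x2) / (np.sqrt(np.sum(np.power(x1, 2))) * np.sqrt(np.sum(np.power(x2, 2))))

v = np.array([1.0, 2.0, 3.0])
print(round(cosine(v, 2 * v), 6))                                    # 1.0 (parallel vectors)
print(round(cosine(np.array([1.0, 0.0]), np.array([0.0, 1.0])), 6))  # 0.0 (orthogonal vectors)
```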
# Sentences 2 and 3 should be fairly similar, since both contain "微信" and "支付"
cosine(trans[1],trans[2])
0.1844189988582961
# Sentences 2 and 4 are clearly less similar than sentences 2 and 3, since they share no words at all
cosine(trans[1],trans[3])
0.16058138087290988
# Sentences 1 and 2 both contain "微信" and "的", so their similarity is higher than sentence 2's similarity to sentences 3 and 4
cosine(trans[1],trans[0])
0.2459676681757762