### 1. Semi-Naive Bayes Classifiers

#### SPODE

SPODE (Super-Parent ODE) assumes that all attributes depend on one and the same attribute, called the "super-parent"; for example, the figure below shows a SPODE whose super-parent is $X_1$.
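Concretely, with super-parent $X_1$ the SPODE classification rule factorizes every other attribute against both the class and the super-parent:

```latex
f(x) = \arg\max_{c} \; P(Y=c)\, P(x_1 \mid Y=c) \prod_{i=2}^{n} P(x_i \mid Y=c,\, x_1)
```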

#### TAN

TAN (Tree Augmented Naive Bayes) instead builds a maximum weighted spanning tree from the conditional mutual information between attributes, and then adds a directed edge from $Y$ to every attribute to obtain the semi-naive Bayes classifier. Assuming there are only four attributes $X_1,X_2,X_3,X_4$, the construction proceeds as follows:

(1) Compute the conditional mutual information between every pair of attributes:

$$I(X_i,X_j\mid Y)=\sum_{X_i,X_j,c}p(X_i,X_j\mid Y=c)\log\frac{p(X_i,X_j\mid Y=c)}{p(X_i\mid Y=c)\cdot p(X_j\mid Y=c)},\quad i\neq j,\ i,j\in \{1,2,3,4\}$$
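For discrete attributes, this sum can be estimated directly from sample frequencies. The sketch below is illustrative; `cond_mutual_info` and its arguments are names chosen here, not part of the accompanying `ml_models` library.

```python
import numpy as np

def cond_mutual_info(xi, xj, y, eps=1e-12):
    """Estimate the conditional mutual information I(X_i; X_j | Y)
    from discrete samples, following the sum above (illustrative sketch)."""
    cmi = 0.0
    for c in np.unique(y):
        xi_c, xj_c = xi[y == c], xj[y == c]
        for a in np.unique(xi_c):
            for b in np.unique(xj_c):
                p_ab = np.mean((xi_c == a) & (xj_c == b))  # p(a, b | c)
                p_a = np.mean(xi_c == a)                   # p(a | c)
                p_b = np.mean(xj_c == b)                   # p(b | c)
                if p_ab > 0:
                    cmi += p_ab * np.log(p_ab / (p_a * p_b + eps))
    return cmi
```

Two perfectly correlated attributes give a large value, while independent attributes give a value near zero.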

(2) Build a complete graph with the attributes as nodes, setting the weight of each edge to the conditional mutual information between its two endpoints:

(3) Build the maximum weighted spanning tree of this complete graph, pick a root variable, and orient all edges away from it:

(4) Add the class node $Y$ and a directed edge from $Y$ to every attribute.
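Step (3) can be sketched with Prim's algorithm run on the pairwise conditional-mutual-information matrix, keeping at each step the heaviest edge leaving the tree. The function name and edge representation below are assumptions for illustration, not code from the source repository.

```python
import numpy as np

def max_spanning_tree(w):
    """Prim's algorithm for a maximum-weight spanning tree.
    w: symmetric (n, n) weight matrix, e.g. pairwise conditional mutual information.
    Returns directed edges (parent, child) with node 0 chosen as the root."""
    n = w.shape[0]
    in_tree = {0}
    edges = []
    while len(in_tree) < n:
        best = None
        # pick the heaviest edge from the current tree to a new node
        for u in in_tree:
            for v in range(n):
                if v not in in_tree and (best is None or w[u, v] > w[best[0], best[1]]):
                    best = (u, v)
        edges.append(best)
        in_tree.add(best[1])
    return edges
```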

### 2. Parameter Estimation

$$f(x)=\arg\max_{c_k}P(Y=c_k)\prod_{i=1}^nP(X_i=x_i\mid Y=c_k,X_{Pa(i)}=x_{Pa(i)})$$

#### 1. Estimating $p(Y=c_k)$

$$p(Y=c_k)=\frac{\sum_{i=1}^NI(y_i=c_k)}{N},\quad k=1,2,\dots,K$$

where $N$ is the number of training samples and $I(\cdot)$ is the indicator function.
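This prior is just the class frequency; a minimal sketch (the helper name `estimate_prior` is my own, not from `ml_models`):

```python
import numpy as np

def estimate_prior(y):
    """MLE of p(Y=c_k): class frequency over the N training samples."""
    classes, counts = np.unique(y, return_counts=True)
    return dict(zip(classes.tolist(), (counts / len(y)).tolist()))
```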

#### 2. Estimating $p(X_i=x_i\mid Y=c_k,Pa(i)=b_{im})$

$$p(X_i=a_{il}\mid Y=c_k,Pa(i)=b_{im})=\frac{\sum_{j=1}^NI(x_i^j=a_{il},\,y^j=c_k,\,Pa(i)^j=b_{im})}{\sum_{j=1}^NI(y^j=c_k,\,Pa(i)^j=b_{im})}$$

where $a_{il}$ is the $l$-th possible value of attribute $X_i$, $b_{im}$ the $m$-th possible value of its parent, and the superscript $j$ indexes training samples.
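For discrete attributes this is again a ratio of counts; a sketch of the estimate, with illustrative names (`estimate_conditional` is not a function of the accompanying library):

```python
import numpy as np

def estimate_conditional(xi, pa, y, a, c, b):
    """Frequency estimate of p(X_i=a | Y=c, Pa(i)=b) from discrete samples."""
    cond = (y == c) & (pa == b)      # samples matching the conditioning event
    if cond.sum() == 0:
        return 0.0
    return np.mean(xi[cond] == a)    # fraction of those with X_i = a
```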

### 3. Implementation

In [1]:
import os
os.chdir('../')
import numpy as np
from ml_models import utils
%matplotlib inline

"""

"""

class SemiGaussianNBClassifier(object):
"""
"""
self.p_y = {}  # p(y)
self.p_x_y = {}  # p(x | y)
self.class_num = None
raise Exception("属性依赖超过1次")

def fit(self, x, y):
# 参数估计
self.class_num = y.max() + 1
for y_index in range(0, self.class_num):
# p(y)
y_n_sample = np.sum(y == y_index)
self.p_y[y_index] = np.log(y_n_sample / len(y))
self.p_x_y[y_index] = {}
# p(x | y)
x_y = x[y == y_index]
for i in range(0, x_y.shape[1]):
u = np.mean(x_y[:, [first_feature, second_feature]], axis=0)
sigma = np.cov(x_y[:, [first_feature, second_feature]].T)
else:
u = np.mean(x_y[:, i])
sigma = np.std(x_y[:, i])
self.p_x_y[y_index][i] = [u, sigma]

def predict_proba(self, x):
rst = []
for x_row in x:
tmp = []
for y_index in range(0, self.class_num):
p_y_log = self.p_y[y_index]
for j in range(0, len(x_row)):
xij = x_row[[first_feature, second_feature]]
p_y_log += np.log(utils.gaussian_nd(xij, self.p_x_y[y_index][j][0], self.p_x_y[y_index][j][1]))
else:
xij = x_row[j]
p_y_log += np.log(utils.gaussian_1d(xij, self.p_x_y[y_index][j][0], self.p_x_y[y_index][j][1]))
tmp.append(p_y_log)
rst.append(tmp)
return utils.softmax(np.asarray(rst)).reshape(x.shape[0], self.class_num)

def predict(self, x):
return np.argmax(self.predict_proba(x), axis=1).reshape(-1)

In [2]:
# generate synthetic data
from sklearn.datasets import make_blobs
X, y = make_blobs(n_samples=400, centers=4, cluster_std=0.85, random_state=0)
X = X[:, ::-1]

In [3]:
# fit with the dependency x_1 -> x_2 and plot the decision boundary
snb = SemiGaussianNBClassifier(link_rulers=[(0, 1)])
snb.fit(X, y)
utils.plot_decision_function(X, y, snb)


`link_rulers=[(0,1)]` adds a single dependency $x_1\rightarrow x_2$; if no dependency is added, the classifier behaves the same as the naive Bayes classifier of the previous section.

In [4]:
# fit without any dependency: equivalent to plain Gaussian naive Bayes
snb = SemiGaussianNBClassifier()
snb.fit(X, y)
utils.plot_decision_function(X, y, snb)