Notebook

In [1]:

# %load /Users/facai/Study/book_notes/preconfig.py
%matplotlib inline

import matplotlib.pyplot as plt
import seaborn as sns
sns.set(color_codes=True)
sns.set(font='SimHei')
plt.rcParams['axes.grid'] = False

from IPython.display import SVG

def show_image(filename, figsize=None):
    if figsize:
        plt.figure(figsize=figsize)

    plt.imshow(plt.imread(filename))

TreeBoost原理和实现（sklearn）简介¶

0. 前言¶

TreeBoost是对GBDT做了更深一步的优化，主要贡献在将整颗树的全局权重值细化到每片叶子的局部权重值。它不像xgboost在树生长时就做出指导，而是在树生长后再修正叶子值。对于工程中的GBDT模块，spark目前是传统的Gradient Boosting，而sklearn采用了TreeBoost，所以本文以sklearn为例进行说明。

本文假设读者已经了解GBDT的基本原理。

sklearn代码版本：

~/W/s/sklearn ❯❯❯ git log -n 1
commit d161bfaa1a42da75f4940464f7f1c524ef53484f
Author: John B Nelson <jnelso11@gmu.edu>
Date:   Thu May 26 18:36:37 2016 -0400

    Add missing double quote (#6831)

GBDT模块位于sklearn/ensemble/gradient_boosting.py文件，类的关系比较简单。

In [2]:

SVG("./res/Main.svg")

Out[2]:

1. GBDT优化¶

传统的Gradient Boost算法主要有四步：

对损失导数求偏导；
训练决策树，拟合偏导；
寻优模型权值；
将训练的模型，加到叠加模型中。

可以用数学公式对应表述为[1]：

$F_0(x) = \operatorname{arg \, min}_\rho \displaystyle \sum_{i=1}^N L(y_i, \rho)$
For $m=1$ to $M$ do:
1. $\tilde{y} = - \left [ \frac{\partial L (y_i, F(x_i))}{\partial F(x_i)} \right ]_{F(x) = F_{m-1}(x)}, \quad i = 1, 2, \dotsc, N$
2. $\mathbf{a}_m = \operatorname{arg \, min}_{\mathbf{a}, \beta} \displaystyle \sum_{i=1}^N \left [ \tilde{y}_i - \beta h(x_i; \mathbf{a}) \right ]^2$
3. $\rho_m = \operatorname{arg \, min}_\rho \displaystyle \sum_{i=1}^N L \left ( y_i, F_{m-1}(x_i) + \rho h(x_i; \mathbf{a}_m) \right)$
4. $F_m(x) = F_{m-1}(x) + l_r \rho_m h(x; \mathbf{a}_m)$

其中，$L$是损失函数， $l_r$是学习率，$\rho$是模型的权值，$h(x_i; \mathbf{a})$是决策树模型，$\mathbf{a}$是树的参数，$F$是最终累加模型。

利用一些数学方法和损失函数的特性，可以将第3步的权值寻优进行简化。在sklear中单步训练代码对应如下：

748     def _fit_stage(self, i, X, y, y_pred, sample_weight, sample_mask,
 749                    random_state, X_idx_sorted, X_csc=None, X_csr=None):
 750         """Fit another stage of ``n_classes_`` trees to the boosting model. """
 751 #+--  8 lines: assert sample_mask.dtype == np.bool----------
 759
 760             residual = loss.negative_gradient(y, y_pred, k=k,
 761                                               sample_weight=sample_weight)
 762
 763             # induce regression tree on residuals
 764             tree = DecisionTreeRegressor(
 765                 criterion='friedman_mse',
 766                 splitter='best',
 767 #+---  7 lines: max_depth=self.max_depth,------------------
 774                 presort=self.presort)
 775
 776 #+---  8 lines: if self.subsample < 1.0:------------------
 784                 tree.fit(X, residual, sample_weight=sample_weight,
 785                          check_input=False, X_idx_sorted=X_idx_sorted)
 786
 787             # update tree leaves
 788 #+--  5 lines: if X_csr is not None:---------------------
 793                 loss.update_terminal_regions(tree.tree_, X, y, residual, y_pred,
 794                                              sample_weight, sample_mask,
 795                                              self.learning_rate, k=k)
 796
 797             # add tree to ensemble
 798             self.estimators_[i, k] = tree
 799
 800         return y_pred

其中，

760L是求解偏导。
784L是生成决策树。注意，对于MSE，$\beta=1$。765L树的评价函数是friedman_mse，它是MSE的变种。我猜测$\beta$是一致的，后续有时间再深究。
793L是TreeBoost做的改进，将第3步寻优全局解$\rho$转化到树内部，后面主要内容就在这。
798L是加回到累加模型。

接下来，我们先就平方差和绝对值两种损失函数，对传统Gradient Boost方法进行第3步的优化，为后续TreeBoost铺路。

[1]: Friedman - Greedy function approximation: A gradient boosting machine

1.0 Least squares regression¶

这种损失函数定义是 $L(y, F) = \frac{1}{2} (y - F)^2$，则其导数为$\frac{\partial L}{\partial F} = -(y - F)$。将偏导代入到第1步有：

\begin{align} \tilde{y} &= - \left [ \frac{\partial L (y_i, F(x_i))}{\partial F(x_i)} \right ]_{F(x) = F_{m-1}(x)} \\ &= y_i - F_{m-1}(x_i), \quad i = 1, 2, \dotsc, N \\ \end{align}

将$L$和$\tilde{y}$代入到第3步中，整理可得：

\begin{align} \rho_m &= \operatorname{arg \, min}_\rho \displaystyle \sum_{i=1}^N L \left ( y_i, F_{m-1}(x_i) + \rho h(x_i; \mathbf{a}_m) \right) \\ &= \operatorname{arg \, min}_\rho \displaystyle \sum_{i=1}^N \frac{1}{2} \left ( y_i - F_{m-1}(x_i) - \rho h(x_i; \mathbf{a}_m) \right)^2 \\ &= \operatorname{arg \, min}_\rho \displaystyle \sum_{i=1}^N \frac{1}{2} \left ( \tilde{y}_i - \rho h(x_i; \mathbf{a}_m) \right)^2 \\ &= \operatorname{arg \, min}_\rho \displaystyle \sum_{i=1}^N \left ( \tilde{y}_i - \rho h(x_i; \mathbf{a}_m) \right)^2 \\ &= \operatorname{arg \, min}_\beta \displaystyle \sum_{i=1}^N \left ( \tilde{y}_i - \beta h(x_i; \mathbf{a}_m) \right)^2 \quad \text{符号替换}\\ &= \beta_m \end{align}

也就是说，对于平方差这种损失函数，它的最优权值就是第2步中指导决策树生成时的$\beta$值。

这个算法称为LS_Boost，具体过程为：

$F_0(x) = \bar{y}$
For $m=1$ to $M$ do:
1. $\tilde{y}_i = y_i - F_{m-1}(x_i), \quad i=1, N$
2. $(\rho_m, \mathbf{a}_m) = \operatorname{arg \, min}_{\mathbf{a}, \rho} \displaystyle \sum_{i=1}^N \left [ \tilde{y}_i - \rho h(x_i; \mathbf{a}) \right ]^2$
3. $F_m(x) = F_{m-1}(x) + l_r \rho_m h(x; \mathbf{a}_m)$

sklearn中代码如下：

274 class LeastSquaresError(RegressionLossFunction):
 275     """Loss function for least squares (LS) estimation.
 276     Terminal regions need not to be updated for least squares. """
 277     def init_estimator(self):
 278         return MeanEstimator()
 279
 280     def __call__(self, y, pred, sample_weight=None):
 281         if sample_weight is None:
 282             return np.mean((y - pred.ravel()) ** 2.0)
 283         else:
 284             return (1.0 / sample_weight.sum() *
 285                     np.sum(sample_weight * ((y - pred.ravel()) ** 2.0)))
 286
 287     def negative_gradient(self, y, pred, **kargs):
 288         return y - pred.ravel()
 289
 290     def update_terminal_regions(self, tree, X, y, residual, y_pred,
 291                                 sample_weight, sample_mask,
 292                                 learning_rate=1.0, k=0):
 293         """Least squares does not need to update terminal regions.
 294
 295         But it has to update the predictions.
 296         """
 297         # update predictions
 298         y_pred[:, k] += learning_rate * tree.predict(X).ravel()
 299
 300     def _update_terminal_region(self, tree, terminal_regions, leaf, X, y,
 301                                 residual, pred, sample_weight):
 302         pass

前面说过sklearn中$\beta=1$，则$l_r \times \beta = l_r$，所以update_terminal_regions这里直接用学习率和树预测值相乘。

1.1 Least-absolute-deviation (LAD) regression¶

损失函数定义为 $L(y, F) = | y - F |$，同样地，可以利用导数算出残差：

\begin{align} \tilde{y} &= - \left [ \frac{\partial L (y_i, F(x_i))}{\partial F(x_i)} \right ]_{F(x) = F_{m-1}(x)} \\ &= \operatorname{sign}(y_i - F_{m-1}(x_i))\\ \end{align}

$L(y, F)$是分段函数，它的导数在正数区间恒为1，负数区间恒为-1，而在间段点$F=0$处是不可导的，人为规定为0，所以可以用sign函数来描述：

\begin{equation} \operatorname{sign}(x) := { \begin{cases} -1 & {\text{if }} x<0, \\ 0 & {\text{if }} x=0, \\ 1 & {\text{if }} x>0. \end{cases} } \end{equation}

同样的，将$L$和$\tilde{y}$代入到第3步中，整理可得：

\begin{align} \rho_m &= \operatorname{arg \, min}_\rho \displaystyle \sum_{i=1}^N L \left ( y_i, F_{m-1}(x_i) + \rho h(x_i; \mathbf{a}_m) \right) \\ &= \operatorname{arg \, min}_\rho \sum_{i=1}^N \big | y_i - F_{m-1}(x_i) - \rho h(x_i; \mathbf{a}_m) \big | \\ &= \operatorname{arg \, min}_\rho \sum_{i=1}^N \big | h(x_i; \mathbf{a}_m) \big | \cdot \left | \frac{y_i - F_{m-1}(x_i)}{h(x_i; \mathbf{a}_m)} - \rho \right | \\ &= \operatorname{median}_W \left \{ \frac{y_i - F_{m-1}(x_i)}{h(x_i; \mathbf{a}_m)} \right \}_1^N, \quad w_i = \big | h(x_i; \mathbf{a}_m) \big | \end{align}

这里$\operatorname{median}_W \{ \cdot \}$是带权$w_i$的weighted median。

注意，weighted median的概念$\operatorname{median}_W (x)$，这里的加权，应理解为$x_i$出现了$w_i$次，而不是$x_i \cdot w_i$数值。你可以去维基页查看定义，也可以看下面的推导细节。不太好讲，但其实很简单。

在sklearn中，LAD用的TreeBoost去实现，所以没有代码对应。

推导细节¶

\begin{align} \rho_m &= \operatorname{arg \, min}_\rho \sum_{i=1}^N \big | h(x_i; \mathbf{a}_m) \big | \cdot \left | \frac{y_i - F_{m-1}(x_i)}{h(x_i; \mathbf{a}_m)} - \rho \right | \\ &= \operatorname{median}_W \left \{ \frac{y_i - F_{m-1}(x_i)}{h(x_i; \mathbf{a}_m)} \right \}_1^N, \quad w_i = \big | h(x_i; \mathbf{a}_m) \big | \end{align}

最后一步的过程比较跳，我们在这里详细说下推导。

在这之前，我们简化下符号，以利于理解，

\begin{align} \rho_m &= \operatorname{arg \, min}_\rho \sum_{i=1}^N \big | h(x_i; \mathbf{a}_m) \big | \cdot \left | \frac{y_i - F_{m-1}(x_i)}{h(x_i; \mathbf{a}_m)} - \rho \right | \\ &= \operatorname{arg \, min}_\rho \sum_{i=1}^N w_i \cdot | t_i - \rho | \\ \end{align}

无加权¶

首先，我们考虑无加权的情况，即$w_i = 1$。此时有$\rho_m = \operatorname{arg \, min}_\rho \sum_{i=1}^N | t_i - \rho |$。

有两种方法来解释为什么$\rho_m = \operatorname{median}(t_i)$:

第一种是几何方法，$| t_i - \rho |$表示两点距离。我们要找的$\rho_m$，本质上就是离所有点$t_i$总距离最短的点。所以通过作图的方法，直观理解。

In [3]:

show_image("./res/rho_m.png")

如上图，绿色直线是中位点到各点距离，黄色直线是另一个点的距离，随着这个点外移，总距离越来越长。总结，对于奇数个点，$\rho_m$是中位点；对于偶数点，$\rho_m$在两个中间点闭区间上。

第二种方法是代数方法。

首先我们直观地了感受下结果，对于$\rho_m = \operatorname{arg \, min}_\rho \sum_{i=1}^N | t_i - \rho | = \operatorname{arg \, min} f(\rho)$，导数是$f'(\rho) = \sum_{i=1}^N \operatorname{sign}(t_i - \rho)$。在极值点$f'(\rho) = 0$，而$\operatorname{sign}(x)$只有$1, -1, 0$三种值，所以要和为0，则必然要求$1$和$-1$的数量相同。而中位点正好满足此条件2。

然后，在Mathematics上看到一个较为严密的论证2。它先算最左端结果，再向右取区间推进，证明整个过程中距离是先单减再单增的过程，从而证明中位点是最小点，具体如下：

令集合$T$含$N$个元素$t_1 < t_2 < \dots < t_N$。

对于最左端$\rho < t_1$，有$f(\rho) = \sum_{i=1}^N | t_i - \rho | = \sum_{i=1}^N (t_i - \rho)$。很明显，对于每个子项，有$t_i - \rho > 0$，且$\rho \to t_1$时，$(t_i - \rho)$单减。也就是说，随着$\rho$从左向右靠近$t_1$点，所有子项都在减小，则总和$f(\rho)$单减。
选定任一区间，$t_k \leq \rho \leq \rho + d \leq t_{k+1}$，有：

\begin{align} f(\rho + d) &= \displaystyle \sum_{i=1}^N | t_i - (\rho + d) | \\ &= \sum_{i=1}^k (\rho + d - t_i) + \sum_{i=k+1}^N (t_i - (\rho + d)) \\ &= \sum_{i=1}^k (\rho - t_i) + \sum_{i=1}^k d + \sum_{i=k+1}^N (t_i - \rho) + \sum_{i=k+1}^N -d \\ &= \sum_{i=1}^N | t_i = \rho | + k d + (N - (k+1) + 1) \times -d \\ &= f(\rho) + d \times ( k - N + (k + 1 ) -1 ) \\ &= f(\rho) + d \times (2k - N) \end{align}

则有：

\begin{equation} f(\rho + d) = \begin{cases} < f(\rho), \quad \text{when } k < \frac{N}{2} \\ = f(\rho), \quad \text{when } k = \frac{N}{2} \\ > f(\rho), \quad \text{when } k > \frac{N}{2} \\ \end{cases} \end{equation}

也就是说，在$k < \frac{N}{2}$区间，$f(\rho)$单减；在$k > \frac{N}{2}$区间，$f(\rho)$单增。所以，在中位点$\frac{N}{2}$处，$f(\rho)$是最小值。于是得$\rho_m = \operatorname{median}(t_i)$。

有加权¶

\begin{equation} \rho_m = \operatorname{arg \, min}_\rho \sum_{i=1}^N w_i \cdot | t_i - \rho | \end{equation}

对于加权，有点无从下手的感觉。我们可以换个角度，会提供点有意思的想法。

假设$w_i$是正整数，则$w_i \cdot | t_i - \rho | = \sum_{k=1}^{w_i} | t_i - \rho |$，也就是说，相当于将集合$T$中的$t_i$点扩增到$w_i$个。那么，可以展开为：

\begin{align} \rho_m &= \operatorname{arg \, min}_\rho \sum_{i=1}^N w_i \cdot | t_i - \rho | \\ &= \operatorname{arg \, min}_\rho \sum_{i=1}^N \sum_{k=1}^{w_i} | t_i - \rho | \\ &= \operatorname{arg \, min}_\rho \sum_{t_i \in T'} | t_i - \rho | \quad \text{$T'$ 是$t_i$扩增$w_i$倍的集合} \end{align}

于是，有加权问题就变更到无加权的问题。这时，我们可以将解法表示如下：

将$t_i$按升序排列。
将$w_i$按对应顺序排，计算累积和$c_k = \sum_{i=1}^k w_i$。
找到对应中位点$c_m = \frac{1}{2} c_N$对应的$t_m$，这个$t_m$就是$\operatorname{median}_W \{t_i\}$。

对应的sklearn代码如下：

51 def _weighted_percentile(array, sample_weight, percentile=50):
 52     """Compute the weighted ``percentile`` of ``array`` with ``sample_weight``. """
 53     sorted_idx = np.argsort(array)
 54
 55     # Find index of median prediction for each sample
 56     weight_cdf = sample_weight[sorted_idx].cumsum()
 57     percentile_idx = np.searchsorted(
 58         weight_cdf, (percentile / 100.) * weight_cdf[-1])
 59     return array[sorted_idx[percentile_idx]]

当然，如果$w_i$是分数，应该怎么想，我暂时也没有思路。可以参考下维基定义Weighted median，后续有时间，细究下。

2. 从GBDT到TreeBoost¶

2.0 回归决策树¶

我们先回顾前面的Gradient Boost方法，其训练公式如下：

For $m=1$ to $M$ do:
1. $\tilde{y} = - \left [ \frac{\partial L (y_i, F(x_i))}{\partial F(x_i)} \right ]_{F(x) = F_{m-1}(x)}, \quad i = 1, 2, \dotsc, N$
2. $\mathbf{a}_m = \operatorname{arg \, min}_{\mathbf{a}, \beta} \displaystyle \sum_{i=1}^N \left [ \tilde{y}_i - \beta h(x_i; \mathbf{a}) \right ]^2$
3. $\rho_m = \operatorname{arg \, min}_\rho \displaystyle \sum_{i=1}^N L \left ( y_i, F_{m-1}(x_i) + \rho h(x_i; \mathbf{a}_m) \right)$
4. $F_m(x) = F_{m-1}(x) + \rho_m h(x; \mathbf{a}_m)$
  注意，为了描述方便，我移除了学习率这个乘数。

TreeBoost将外部的寻优参数内化到决策树内部，从而既让模型更精细，又加快了运行速度。而要让外部参数进入到决策树$h(x_i; \mathbf{a}_m)$，首要的问题是得打开这个函数，即使用解析式来描述。换句话说，我们需要对决策树建立数学模型。

对于$J$个叶子的回归决策树，可以表述为累加式：

\begin{equation} h(x; \{b_j, R_j\}_1^J) = \displaystyle \sum_{j=1}^J b_j \, \mathbf{1}(x \in R_j) \end{equation}

其中，$R_j$是各叶子，对应叶子的值是此区域的样本均值$b_j = \operatorname{ave}_{x_i \in R_j} y_i$。因为各个叶子是没有交集的，所以这个公式的含义等价于：如果$x \in R_j$，那么$h(x) = b_j$。

我们将回归决策树的数学式代入Gradient Boost中第四步，可得到：

\begin{align} F_m(x) &= F_{m-1}(x) + \rho_m h(x; \mathbf{a}_m) \\ &= F_{m-1}(x) + \rho_M \displaystyle \sum_{j=1}^J b_{jm} \, \mathbf{1}(x \in R_{jm}) \\ &= F_{m-1}(x) + \sum_{j=1}^J \color{red}{\rho_M b_{jm}} \, \mathbf{1}(x \in R_{jm}) \\ &= F_{m-1}(x) + \sum_{j=1}^J \color{red}{\gamma_{jm}} \, \mathbf{1}(x \in R_{jm}) \end{align}

请注意，最后一步定义$\gamma_{jm} = \rho_M b_{jm}$，只是个简单的代数替换。但它的实质是将权重从对整颗树的全局解移动到了对各个叶子的局部解。也就是说，这个权重值更加细化了，以前是对整颗树寻优，现在是针对各个叶子，各自寻优。此时，将定义代入Gradient Boost第三步，就得到局部权重的最优解：

\begin{equation} \{\gamma_{jm}\}_1^J = \displaystyle \operatorname{arg \, min}_{\{\gamma_j\}_1^J} \sum_{i=1}^N L \left ( y_i, F_{m-1}(x_i) + \sum_{j=1}^J \gamma_j \mathbf{1}(x \in R_{jm}) \right ) \end{equation}

又因为各叶子$R_{jm}$是互无交集，相互独立的，上式的最优解就是各叶子各自的最优解汇总，故可简写为：

\begin{equation} \gamma_{jm} = \displaystyle \operatorname{arg \, min}_{\gamma} \sum_{x_i \in R_{jm}} L(y_i, F_{m-1}(x_i) + \gamma) \end{equation}

2.1 LAD_TreeBoost¶

前面已经讨论过，对于损失函数$L(y,F) = |y - F|$，它的最优解是中位值。

故，对于LAD回归，可得：

\begin{equation} \gamma_{jm} = \displaystyle \operatorname{median}_{x_i \in R_{jm}} \{ y_i - F_{m-1}(x_i) \} \end{equation}

综上，可得到LAD_TreeBoost算法如下：

$F_0(x) = \operatorname{median}\{y_i\}_1^N$
For $m=1$ to $M$ do:
1. $\tilde{y}_i = \operatorname{sign}(y_i - F_{m-1}(x_i)), \, i=1, N$
2. $\{R_{jm}\}_1^J = \text{$J$ terminal node tree} \big (\{\tilde{y}_i, x_i\}_1^N \big )$
3. $\gamma_{jm} = \displaystyle \operatorname{median}_{x_i \in R_{jm}} \{ y_i - F_{m-1}(x_i)\}, \, j=1, J$
4. $F_m(x) = F_{m-1}(x) + \displaystyle \sum_{j=1}^J \gamma_{jm} \mathbf{1}(x \in R_{jm})$

对应的sklearn中代码为

305 class LeastAbsoluteError(RegressionLossFunction):
 306     """Loss function for least absolute deviation (LAD) regression. """
 307     def init_estimator(self):
 308         return QuantileEstimator(alpha=0.5)
 309
 310     def __call__(self, y, pred, sample_weight=None):
 311         if sample_weight is None:
 312             return np.abs(y - pred.ravel()).mean()                                      
 313         else:
 314             return (1.0 / sample_weight.sum() *
 315                     np.sum(sample_weight * np.abs(y - pred.ravel())))
 316
 317     def negative_gradient(self, y, pred, **kargs):                                    
 318         """1.0 if y - pred > 0.0 else -1.0"""
 319         pred = pred.ravel()                                                           
 320         return 2.0 * (y - pred > 0.0) - 1.0
 321                                                                                       
 322     def _update_terminal_region(self, tree, terminal_regions, leaf, X, y,
 323                                 residual, pred, sample_weight):
 324         """LAD updates terminal regions to median estimates. """
 325         terminal_region = np.where(terminal_regions == leaf)[0]
 326         sample_weight = sample_weight.take(terminal_region, axis=0)
 327         diff = y.take(terminal_region, axis=0) - pred.take(terminal_region, axis=0)
 328         tree.value[leaf, 0, 0] = _weighted_percentile(diff, sample_weight, percentile=50)

其中320L就是第一步的求导，328L就是第三步将树的叶子改为中位值（percentile=50）。

因为$\tilde{y}_i$只有两种值$\tilde{y} \in \{-1, 1\}$，又叶子值是取中位数，所以这个损失的鲁棒性非常强。但求解中位数不如平均数有快速方法，所以性能会受影响。

2.2 M-Regression¶

LS_Boost运行快，LDA_TreeBoost鲁棒性好，能不能将两者结合起来呢？M-Regression的初衷正是来源于此，它用阈值$\delta$，比如说三倍方差，将$|y-F|$误差分隔成两部份：在阈值内的认定是正常误差，用LS_Boost来约束；在阈值外认定是长尾和坏值，用LDA_TreeBoost来抵抗。通过调整$\delta$值来控制平衡，从而达到各取其长的好处。

这里具体对的损失函数叫Huber，定义如下：

\begin{equation} L(y, F) = \begin{cases} \frac{1}{2} (y - F)^2 \quad & |y - F| \leq \delta \\ \delta \cdot \left (\big |y - F \big | - \frac{\delta}{2} \right ) \quad & |y - F| > \delta \end{cases} \end{equation}

对应的具体取解推导要再参阅论文，如果后面有时间再来填坑。

在sklearn中对应的是HuberLossFunction类。

2.3 二分类逻辑回归树¶

这里用的损失函数叫negative binomial log-likelihood[3]，定义为：

\begin{equation} L(y, F) = \log (1 + e^{-2 y F}), \quad y \in \{-1, 1\} \end{equation}

其中$F(x) = \frac{1}{2} \log \left [ \frac{\operatorname{Pr}(y = 1 | x)}{\operatorname{Pr}(y = -1 | x)} \right ]$

则第一步残差展开为：

\begin{align} \tilde{y_i} &= - \left [ \frac{\partial L (y_i, F(x_i))}{\partial F(x_i)} \right ]_{F(x) = F_{m-1}(x)} \\ &= - \frac{1}{1 + e^{-2yF}} \cdot 1 \cdot e^{-2yF} \cdot -2y \\ &= 2y \frac{e^{-2yF}}{1 + e^{-2yF}} \\ &= \frac{2y}{1 + e^{2yF}} \\ &= \frac{2 y_i}{1 + e^{2 y_i F_{m-1}(x_i)}} \\ \end{align}

同样地，第三步寻优代入损失函数，变为：

\begin{equation} \rho_m = \displaystyle \operatorname{arg \, min}_\rho \sum_{i=1}^{N} \log \left ( 1 + e^{-2 y_i \big ( F_{m-1}(x_i) + \rho h(x_i; \mathbf{a}_m) \big )} \right ) \end{equation}

运用前面LAD_Boost的技巧，很容易得到：

\begin{equation} \gamma_{jm} = \displaystyle \operatorname{arg \, min}_\gamma \sum_{x_i \in R_{jm}} \log(1 + e^{-2 y_i ( F_{m-1}(x_i) + \gamma ) }) \end{equation}

上式没有封闭形式的解，根据[3]，可用一轮最优化中的牛顿迭代法，得到数值解：

\begin{equation} \gamma_{jm} = \displaystyle \sum_{x_i \in R_{jm}} \tilde{y}_i \Big / \sum_{x_i \in R_{jm}} |\tilde{y}_i| (2 - |\tilde{y}_i|) \end{equation}

这一步推导，我也还没有细看，以后有精力再细究。

这个算法被称为L2_TreeBoost，总结如下：

$F_0(x) = \frac{1}{2} \log \frac{1 + \bar{y}}{1 - \bar{y}}$
For $m=1$ to $M$ do:
1. $\tilde{y}_i = \frac{2 y_i}{1 + e^{2 y_i F_{m-1}(x_i)}}, \quad i=1, N$
2. $\{R_{jm}\}_1^J = \text{$J$ terminal node tree} \big (\{\tilde{y}_i, x_i\}_1^N \big )$
3. $\gamma_{jm} = \displaystyle \sum_{x_i \in R_{jm}} \tilde{y}_i \Big / \sum_{x_i \in R_{jm}} |\tilde{y}_i| (2 - |\tilde{y}_i|), \quad j=1, J$
4. $F_m(x) = F_{m-1}(x) + \displaystyle \sum_{j=1}^J \gamma_{jm} \mathbf{1}(x \in R_{jm})$

sklearn中代码如下：

106 class LogOddsEstimator(BaseEstimator):
 107     """An estimator predicting the log odds ratio."""
 108     scale = 1.0
 109
 110     def fit(self, X, y, sample_weight=None):
 111         # pre-cond: pos, neg are encoded as 1, 0
 112         if sample_weight is None:
 113             pos = np.sum(y)
 114             neg = y.shape[0] - pos
 115         else:
 116             pos = np.sum(sample_weight * y)
 117             neg = np.sum(sample_weight * (1 - y))
 118
 119         if neg == 0 or pos == 0:
 120             raise ValueError('y contains non binary labels.')
 121         self.prior = self.scale * np.log(pos / neg)
 122
 123     def predict(self, X):
 124         check_is_fitted(self, 'prior')
 125
 126         y = np.empty((X.shape[0], 1), dtype=np.float64)
 127         y.fill(self.prior)
 128         return y

注意，这个初始化$F_0$对公式做了点变形。

466 class BinomialDeviance(ClassificationLossFunction):
 467 #    """Binomial deviance loss function for binary classification.
 468 #+-- 10 lines: Binary classification is a special case; here, we only need to-------------------------
 478
 479     def init_estimator(self):
 480         return LogOddsEstimator()
 481
 482 #+--  9 lines: def __call__(self, y, pred, sample_weight=None):---------------------------------------
 491
 492     def negative_gradient(self, y, pred, **kargs):
 493         """Compute the residual (= negative gradient). """
 494         return y - expit(pred.ravel())
 495
 496     def _update_terminal_region(self, tree, terminal_regions, leaf, X, y,
 497                                 residual, pred, sample_weight):
 498         """Make a single Newton-Raphson step.
 499
 500         our node estimate is given by:
 501
 502             sum(w * (y - prob)) / sum(w * prob * (1 - prob))
 503
 504         we take advantage that: y - prob = residual
 505         """
 506         terminal_region = np.where(terminal_regions == leaf)[0]
 507         residual = residual.take(terminal_region, axis=0)
 508         y = y.take(terminal_region, axis=0)
 509         sample_weight = sample_weight.take(terminal_region, axis=0)
 510
 511         numerator = np.sum(sample_weight * residual)
 512         denominator = np.sum(sample_weight * (y - residual) * (1 - y + residual))
 513
 514         if denominator == 0.0:
 515             tree.value[leaf, 0, 0] = 0.0
 516         else:
 517             tree.value[leaf, 0, 0] = numerator / denominator
 518
 519     def _score_to_proba(self, score):
 520         proba = np.ones((score.shape[0], 2), dtype=np.float64)
 521         proba[:, 1] = expit(score.ravel())
 522         proba[:, 0] -= proba[:, 1]
 523         return proba

[3]: Jerome Friedman - Additive logistic regression: a statistical view of boosting

模型$F_M(x)$输出的分值需要转换成各类的概率值，519L的_score_to_proba处理思路来源如下：

通过联理公式，

\begin{align} & F(x) = \frac{1}{2} \log \left [ \frac{\operatorname{Pr}(y = 1 | x)}{\operatorname{Pr}(y = -1 | x)} \right ] \\ & \operatorname{Pr}(y = 1 | x) + \operatorname{Pr}(y = -1 | x) = 1 \end{align}

可以容易解得：

\begin{align} & \operatorname{Pr}(y = 1 | x) = \frac{1}{1 + e^{-2 F(x)}} \\ & \operatorname{Pr}(y = -1 | x) = \frac{1}{1 + e^{2 F(x)}} \\ \end{align}

因为$F_M(x)$和$F(x)$有关联，我们可以借用上式将分值近似转换成概率值：

\begin{align} p_{+}(x) &= \operatorname{\widehat{Pr}}(y = 1 | x) = \frac{1}{1 + e^{-2 F_M(x)}} \\ p_{-}(x) &= \operatorname{\widehat{Pr}}(y = -1 | x) = \frac{1}{1 + e^{2 F_M(x)}} \\ \end{align}

用于二分类时，应用如下公式：

\begin{equation} \hat{y}(x) = 2 \cdot \mathbf{1}[c(-1,1) p_{+}(x) > c(1,-1) p_{-}(x)] - 1 \end{equation}

其中，$c(\hat{y}, y)$是将$y$误认为$\hat{y}$的代价。

回过头来看score_to_proba函数，它其实是上式的简化版，并没有$c(\cdot)$，同时虽然$p_{+}(x)$和$p_{-}(x)$并非真实概率，但相加等于1，所以只算$p_{+}(x)$。

Influence trimming¶

在寻优$\gamma_{jm}$时，还有优化的空间：

\begin{align} \gamma_{jm} &= \displaystyle \operatorname{arg \, min}_\gamma \sum_{x_i \in R_{jm}} \log(1 + e^{-2 y_i ( F_{m-1}(x_i) + \gamma ) }) \\ &= \displaystyle \operatorname{arg \, min}_\gamma \sum_{x_i \in R_{jm}} \log(1 + e^{-2 \color{blue}{y_i F_{m-1}(x_i)}} \cdot e^{-2 y_i \gamma} ) \\ &= \displaystyle \operatorname{arg \, min}_\gamma \sum_{x_i \in R_{jm}} \phi(x_i) \end{align}

注意，上式中，若$y_i F_{m-1}(x_i)$非常大，则对应的$\phi(x_i) \to 0$。也就是说，这些样本对寻优不再有贡献。于是，我们可以定义$w_i = e^{-2 y_i F_{m-1}(x_i)}$作为测量函数，如果它的值小于一定阈值，这个样本就不再参与计算。

另外，对于一个$x_i$，其损失函数$L(y, F(x_i))$如果到达极值点，就相当于是常数，对于整体寻优$\operatorname{arg \, min} L(y, F(x))$不再有贡献。而判断极值点可以用二阶导数来度量，所以二阶导也能作为一种测量函数。具体到现在的逻辑回归树，就可以定义为：

\begin{equation} w_i = \frac{\partial^2 L}{\partial F^2} = |\tilde{y}_i| (2 - |\tilde{y}_i|) \end{equation}

至于这个移除的阈值，可以用比例来约束。即我们定阈值为$w_{l(\alpha)}$，而$l(\alpha)$满足：

\begin{equation} \displaystyle \sum_{i=1}^{l(\alpha)} w_{(i)} = \alpha \sum_{i=1}^{N} w_i \end{equation}

其中，$w_{(i)}$是按升序排列的权值。典型值$\alpha \in [0.05, 0.2]$，也就是说，可以减少10到20倍的计算量。

2.4 多分类逻辑回归树¶

多分类是二分类的推广，相对而言公式推导比较复杂，不再详述。它的简化，相当于是对每一个类建一个「是、否」的二分类树，最后对每个类逐一评分，输出分值最高的类。`

3. 参数与正则¶

3.0 Regularization¶

lack of fit(LOF):

regression
- average absolute error
classification
- minus twice log-likelihood(deviance)
- misclassification error-rate

两种shrinkage strategy:

简单：最终模型：$F_\nu(x) = \bar{y} + \nu \cdot (F_M(x) - \bar{y})$
复杂：每个迭代模型：$F_m(x) = F_{m-1}(x) + \nu \cdot \rho_m h(x; \mathbf{a}_m, \quad 0 < \nu \leq 1$

3.1 Tree boosting的参数¶

meta-parameter:

$M$: the number of iterations.
$\nu$: the learning rate.
$J$: the fixed number of terminal nodes.
The best tree size $J$ is governed by the effective interaction order of the target $F*(x)$, while it is unknown. $\to$ cross-validation.

4. 可解释性¶

理解贡献比较大的变量

4.0 Relative importance of input variables¶

The relative influences $I_j$, of the individual inputs $x_j$, on the variation of $\hat{F}(x)$ over the joint input variable distributation:

\begin{equation} I_j = \left ( E_x \left [ \frac{\partial \hat{F}(x)}{\partial x_j} \right ]^2 \cdot \operatorname{var}_x[x_j] \right )^{1/2} \end{equation}

据此，Breiman提出了适合于决策树的公式：

\begin{equation} \hat{I}_j^2(T) = \displaystyle \sum_{t=1}^{J-1} \hat{i}_t^2 \mathbf{1}(v_t = j) \end{equation}

其中，$v_t$是以$j$作分割特征的中间节点，$\hat{i}_t^2$是对应节点的Friedman MSE评价$i^2(R_l, R_r) = \frac{w_l w_r}{w_l + w_r}(\bar{y}_l - \bar{y}_r)^2$。

对于多颗树，就加和取平均：

\begin{equation} \hat{I}^2_j = \frac{1}{M} \displaystyle \sum_{m=1}^M \hat{I}^2_j (T_m) \end{equation}

这个思路非常朴素，就是计算各特征对决策树评价函数提升的贡献度。在sklearn中实现如下：

1201     def feature_importances_(self):
1202         """Return the feature importances (the higher, the more important the
1203            feature).
1204
1205         Returns
1206         -------
1207         feature_importances_ : array, shape = [n_features]
1208         """
1209         self._check_initialized()
1210
1211         total_sum = np.zeros((self.n_features, ), dtype=np.float64)
1212         for stage in self.estimators_:
1213             stage_sum = sum(tree.feature_importances_
1214                             for tree in stage) / len(stage)
1215             total_sum += stage_sum
1216
1217         importances = total_sum / len(self.estimators_)
1218         return importances

此时，我们也就明白了为什么765L决策树使用的friedman_mse损失函数了。

4.1 Partial dependence plots¶

Partial dependence主要是用于可视化一维或二维变量对于整体总结果的影响。它的计算思路非常简单，令有特征集合$X_s \cup X_c = X$，则$f(X_s = x_s) = \operatorname{Avg}(f(X_s = x_s, X_{ci}))$。也就是对于特定的$x_s$值，算出它的响应均值。

下图是一个示例sklearn: Partial Dependence Plots。

对于一维变量，可以看出，房价与Medlin(平均收入）成正比，与AveOccup（每户人均数）成反比，而与HouseAge（房龄）、AveRoom（房间均数）没有明显关系。再看二维变量（HouseAge - AveOccup），对于每户人均数大于2的情况，房龄与房价关系不大；但当人均数小于2时，则出现了相关性。

ref: Partial Dependency Plots and GBM

5. 总结¶

我们先讲了对GBDT框架的优化，再借此拓展到tree boosting方法，然后略微提了正则和调参的参数，最后介绍了两种用于解释特征变量的方法。

本文基本上相当于翻译了论文Friedman - Greedy function approximation: A gradient boosting machine，如果有不明晰的地方，可以直接查看原文。

In [ ]: