#!/usr/bin/env python
# coding: utf-8

# $\LaTeX \text{ commands here}
# \newcommand{\R}{\mathbb{R}}
# \newcommand{\im}{\text{im}\,}
# \newcommand{\norm}[1]{||#1||}
# \newcommand{\inner}[1]{\langle #1 \rangle}
# \newcommand{\span}{\mathrm{span}}
# \newcommand{\proj}{\mathrm{proj}}
# \newcommand{\OPT}{\mathrm{OPT}}
# $
# # **Georgia Tech, CS 4540**
#
# # L17: Boosting + AdaBoost
#
# Jake Abernethy
#
# *Tuesday, October 29, 2019*

# ## Recall: The Minimax Theorem
#
# Proved by von Neumann in the 1920s!
#
# **Theorem**: For any matrix $M \in \R^{n \times m}$ we have
# $$\min_{p \in \Delta_n} \max_{j \in [m]} p^\top M_{:j} = \max_{q \in \Delta_m} \min_{i \in [n]} M_{i:} q$$
# Equivalently:
# $$\min_{p \in \Delta_n} \max_{q \in \Delta_m} p^\top M q = \max_{q \in \Delta_m} \min_{p \in \Delta_n} p^\top M q$$

# ### Base/Weak Learners
#
# - Let $(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_n, y_n)$ be the training data.
# - Let $\mathscr{F}$ be a fixed set of classifiers called the base class.
# - A base learner for $\mathscr{F}$ is a rule that takes as input a set of weights $\mathbf{w} = (w_1, \ldots, w_n)$ such that $w_i \geq 0, \sum w_i = 1$, and outputs a classifier $f \in \mathscr{F}$ such that the weighted empirical risk $$e_{\mathbf{w}}(f) = \sum \limits_{i = 1}^n w_i \mathbb{1}_{\{f(\mathbf{x}_i) \neq y_i\}}$$ is (approximately) minimized.

# ### Boosting (2-Class Scenario)
#
# - Assume the class labels are $-1$ and $+1$.
# - The final classifier then has the form $h_T(\mathbf{x}) = \text{sign}\left(\sum \limits_{t = 1}^T \alpha_t f_t(\mathbf{x})\right)$, where $f_1, \ldots, f_T$ are called base (or weak) classifiers and $\alpha_1, \ldots, \alpha_T > 0$ reflect the confidence in the various base classifiers.

# ### Examples of Base (Weak) Learners
#
# - Decision stumps, i.e., decision trees of depth 1
# - Decision trees
# - Polynomial thresholds, e.g., $$f(\vec{x}) = \pm \text{sign}((\vec{w}^\top \vec{x})^2 - b)$$ where $b \in \mathbb{R}$ and $\vec{w} \in \mathbb{R}^d$ (a quadratic, radial-kernel-like threshold).

# ### AdaBoost (Adaptive Boosting)
#
# - The first concrete algorithm to successfully realize the boosting principle.
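# The weighted-risk-minimizing base learner above can be made concrete with a decision stump. Below is a minimal, illustrative sketch (the names `stump_learner`/`stump_predict` and the brute-force threshold search are our own, not from the lecture): it scans every feature, threshold, and sign and returns the stump minimizing $e_{\mathbf{w}}(f) = \sum_i w_i \mathbb{1}_{\{f(\mathbf{x}_i) \neq y_i\}}$.

```python
import numpy as np

def stump_learner(X, y, w):
    """Return the decision stump (feature j, threshold b, sign s) that
    minimizes the weighted empirical risk e_w(f) = sum_i w_i 1{f(x_i) != y_i}."""
    best, best_err = None, np.inf
    for j in range(X.shape[1]):                      # every feature
        for b in np.unique(X[:, j]):                 # every candidate threshold
            for s in (+1, -1):                       # both orientations
                pred = s * np.where(X[:, j] > b, 1, -1)
                err = w[pred != y].sum()             # weighted 0-1 error
                if err < best_err:
                    best_err, best = err, (j, b, s)
    return best, best_err

def stump_predict(X, stump):
    j, b, s = stump
    return s * np.where(X[:, j] > b, 1, -1)

# Usage: uniform weights w = (1/n, ..., 1/n) on a toy 1-D dataset.
X = np.array([[1.], [2.], [3.], [4.]])
y = np.array([1, 1, -1, -1])
w = np.full(len(y), 1.0 / len(y))
f, err = stump_learner(X, y, w)
```

The exhaustive search is $O(n d)$ candidate stumps per call, which is fine for small data and mirrors what "approximately minimized" asks of the base learner.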
# ### AdaBoost Algorithm
#
# An *iterative* algorithm for "ensembling" base learners
#
# - Input: $\{(\mathbf{x}_i, y_i)\}_{i = 1}^n, T, \mathscr{F}$, base learner
# - Initialize: $\mathbf{w}^{1} = (\frac{1}{n}, \ldots, \frac{1}{n})$
# - For $t = 1, \ldots, T$:
#     - $\mathbf{w}^{t} \rightarrow \boxed{\text{base learner}} \rightarrow f_t$
#     - $\alpha_t = \frac{1}{2}\ln\left(\frac{1 - r_t}{r_t}\right)$
#         - where $r_t := e_{\mathbf{w}^t}(f_t) = \sum \limits_{i = 1}^n w_i^t \mathbf{1}_{\{f_t(\mathbf{x}_i) \neq y_i\}}$
#     - $w_i^{t + 1} = \frac{w_i^t \exp\left(-\alpha_t y_i f_t(\mathbf{x}_i)\right)}{z_t}$, where $z_t$ normalizes.
# - Output: $h_T(\mathbf{x}) = \text{sign}\left(\sum \limits_{t = 1}^T \alpha_t f_t(\mathbf{x})\right)$

# ## Weak Learning Assumption
#
# For boosting to work, one needs the *weak learning hypothesis*: there is a $\gamma > 0$ such that, given any vector of weights $\mathbf{w} \in \Delta_n$ forming a distribution over your training set, you can always find a base hypothesis $f$ such that
# $$e_{\mathbf{w}}(f) := \sum \limits_{i = 1}^n w_i \mathbf{1}[f(\mathbf{x}_i) \neq y_i] \leq \frac 1 2 - \gamma$$

# ## Strong Learning Assumption
#
# Weak learning by itself isn't good enough. What we want is to combine the weak learners in a way that correctly predicts every example. That is, we want to find $\alpha_1, \ldots, \alpha_T$ and base learners $f_1, \ldots, f_T$ so that for every $i = 1, \ldots, n$:
# $$
# h_T(\mathbf{x}_i) := \text{sign}\left(\sum \limits_{t = 1}^T \alpha_t f_t(\mathbf{x}_i)\right) = y_i
# $$
# Equivalently, assuming each $f_j$ outputs $\pm 1$, if there are $m$ base learners:
# $$
# \exists \vec{\alpha} \in \mathbf{R}^m_+ \; \forall i \in [n]: \quad y_i \sum_{j = 1}^m \alpha_j f_j(\mathbf{x}_i) > 0
# $$

# ## These two are primal/dual statements!
#
# It turns out that we can formulate this as a game.
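# The loop above can be sketched directly in NumPy. This is an illustrative implementation, not the lecture's code: it bundles a brute-force decision stump as the base learner, and clips $r_t$ away from $0$ and $1$ so the logarithm in $\alpha_t$ stays finite.

```python
import numpy as np

def stump(X, y, w):
    """Minimal base learner: the axis-aligned stump (j, b, s) with the
    smallest weighted error under the distribution w."""
    best, best_err = None, np.inf
    for j in range(X.shape[1]):
        for b in np.unique(X[:, j]):
            for s in (+1, -1):
                pred = s * np.where(X[:, j] > b, 1, -1)
                err = w[pred != y].sum()
                if err < best_err:
                    best_err, best = err, (j, b, s)
    return best, best_err

def stump_pred(X, f):
    j, b, s = f
    return s * np.where(X[:, j] > b, 1, -1)

def adaboost(X, y, T):
    n = len(y)
    w = np.full(n, 1.0 / n)                    # w^1 = (1/n, ..., 1/n)
    fs, alphas = [], []
    for _ in range(T):
        f_t, r_t = stump(X, y, w)              # base learner on current weights
        r_t = float(np.clip(r_t, 1e-12, 1 - 1e-12))  # guard the log
        alpha_t = 0.5 * np.log((1 - r_t) / r_t)
        w = w * np.exp(-alpha_t * y * stump_pred(X, f_t))
        w /= w.sum()                           # z_t normalizes
        fs.append(f_t)
        alphas.append(alpha_t)
    # h_T(x) = sign(sum_t alpha_t f_t(x))
    return lambda Xq: np.sign(sum(a * stump_pred(Xq, f) for a, f in zip(alphas, fs)))

# Usage: a +,+,+,-,-,-,+,+ stripe that no single stump can classify,
# but that three boosted stumps fit perfectly.
X = np.array([[1.], [2.], [3.], [4.], [5.], [6.], [7.], [8.]])
y = np.array([1, 1, 1, -1, -1, -1, 1, 1])
h = adaboost(X, y, T=3)
```

Note how the reweighting step concentrates mass on the examples the current stump gets wrong, forcing the next stump to attend to them.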
# Let $M \in \{+1, -1\}^{n \times m}$ be defined by:
# $$ M_{i,j} := y_i f_j(\mathbf{x}_i)$$
#
# #### Problem
# How do the minimax value and the maximin value of this game relate to the Strong Learning and Weak Learning hypotheses? What does the minimax theorem tell us?
# \begin{align*}
# \text{Weak Learning}: &\quad \forall \mathbf{w} \in \Delta_n \; \exists j: \quad \sum \limits_{i = 1}^n w_i \mathbf{1}[f_j(\mathbf{x}_i) \ne y_i] \leq \frac 1 2 - \gamma
# \\
# \text{Strong Learning}: &\quad \exists \vec{\alpha} \in \mathbf{R}^m_+ \; \forall i \in [n]: \quad y_i \sum_{j = 1}^m \alpha_j f_j(\mathbf{x}_i) > 0
# \end{align*}

# ## AdaBoost through Coordinate Descent
#
# It is often said that we can view AdaBoost as "coordinate descent" on the exponential loss function. Coordinate descent is a simple algorithm: to minimize $L(\vec \alpha)$, first pick some appropriate coordinate $j$, then optimize *only that coordinate*. Repeat.
#
# #### Problem
#
# Show that the AdaBoost algorithm can be viewed as coordinate descent on the *exponential loss function*
# $$\text{Loss}(\vec{\alpha}) = \sum_{i=1}^n \exp \left( - y_i \left(\sum_{j=1}^m \alpha_j h_j(\vec{x}_i)\right)\right)$$
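# The coordinate-descent view can be checked numerically. Below is a sketch (function names are our own), assuming the base class is finite and given as a matrix $H$ with $H_{ij} = h_j(\vec{x}_i) \in \{\pm 1\}$. Each round picks the coordinate $j$ with the smallest weighted error under $w_i \propto \exp(-y_i (H\vec{\alpha})_i)$, then takes the closed-form exact line-search step $\frac{1}{2}\ln\frac{1-r}{r}$, which is precisely AdaBoost's $\alpha_t$.

```python
import numpy as np

def exp_loss(H, y, alpha):
    """Loss(alpha) = sum_i exp(-y_i * (H @ alpha)_i), with H[i, j] = h_j(x_i)."""
    return np.exp(-y * (H @ alpha)).sum()

def coordinate_descent_boost(H, y, T):
    n, m = H.shape
    alpha = np.zeros(m)
    for _ in range(T):
        u = np.exp(-y * (H @ alpha))      # unnormalized AdaBoost weights
        w = u / u.sum()
        # pick the coordinate (base classifier) with the smallest weighted error
        errs = np.array([w[H[:, j] != y].sum() for j in range(m)])
        j = int(errs.argmin())
        r = float(np.clip(errs[j], 1e-12, 1 - 1e-12))
        # exact minimization of the loss along coordinate j: the step that
        # solves min_d  e^{-d} * (1 - r) + e^{d} * r  is AdaBoost's alpha_t
        alpha[j] += 0.5 * np.log((1 - r) / r)
    return alpha

# Usage: two base classifiers as columns of H; column 0 is consistent with y,
# so coordinate descent piles weight onto it and drives the loss toward 0.
H = np.array([[1, 1], [1, -1], [-1, 1], [-1, -1]], dtype=float)
y = np.array([1., 1., -1., -1.])
a = coordinate_descent_boost(H, y, T=2)
```

Because each round re-derives the weights from the current margins $y_i (H\vec\alpha)_i$, the loop above reproduces AdaBoost's reweighting without storing $w^t$ explicitly.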