#!/usr/bin/env python
# coding: utf-8
# $\LaTeX \text{ commands here}
# \newcommand{\R}{\mathbb{R}}
# \newcommand{\im}{\text{im}\,}
# \newcommand{\norm}[1]{||#1||}
# \newcommand{\inner}[1]{\langle #1 \rangle}
# \newcommand{\span}{\mathrm{span}}
# \newcommand{\proj}{\mathrm{proj}}
# \newcommand{\OPT}{\mathrm{OPT}}
# $
#
#
# **Georgia Tech, CS 4540**
#
# # L17: Boosting + AdaBoost
#
# Jake Abernethy
#
# *Tuesday, October 29, 2019*
# ## Recall: The Minimax Theorem
#
# Proved by von Neumann in the 1920s!
#
# **Theorem**: For any matrix $M \in \R^{n \times m}$ we have
# $$\min_{p \in \Delta_n} \max_{j \in [m]} p^\top M_{:j} = \max_{q \in \Delta_m} \min_{i \in [n]} M_{i:} q$$
# Equivalently:
# $$\min_{p \in \Delta_n} \max_{q \in \Delta_m} p^\top M q = \max_{q \in \Delta_m} \min_{p \in \Delta_n} p^\top M q$$
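# The identity is easy to sanity-check numerically: each side of the first equation is the optimal value of a small linear program over the simplex. Below is a sketch (assuming `numpy` and `scipy.optimize.linprog` are available), using rock-paper-scissors as the example game, whose value is 0:

```python
# Sanity check of the minimax theorem via linear programming.
import numpy as np
from scipy.optimize import linprog

def minimax_value(M):
    """Compute min_{p in simplex} max_j p^T M[:, j] as an LP."""
    n, m = M.shape
    c = np.r_[np.zeros(n), 1.0]                    # variables (p, v); minimize v
    A_ub = np.c_[M.T, -np.ones(m)]                 # M^T p - v <= 0
    A_eq = np.r_[np.ones(n), 0.0].reshape(1, -1)   # sum_i p_i = 1
    res = linprog(c, A_ub=A_ub, b_ub=np.zeros(m), A_eq=A_eq, b_eq=[1.0],
                  bounds=[(0, None)] * n + [(None, None)])
    return res.fun

def maximin_value(M):
    """max_{q in simplex} min_i M[i, :] q, by symmetry of the LP."""
    return -minimax_value(-M.T)

# Rock-paper-scissors payoff matrix; both values should agree (and equal 0).
M = np.array([[0.0, 1.0, -1.0],
              [-1.0, 0.0, 1.0],
              [1.0, -1.0, 0.0]])
print(minimax_value(M), maximin_value(M))
```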
#
#
# ### Base/Weak Learners
#
# - Let $(\mathbf{x}_1, y_1), ..., (\mathbf{x}_n, y_n)$ be the training data.
# - Let $\mathscr{F}$ be a fixed set of classifiers called the base class.
# - A base learner for $\mathscr{F}$ is a rule that takes as input a set of weights $\mathbf{w} = (w_1, ..., w_n)$ such that $w_i \geq 0, \sum w_i = 1$, and outputs a classifier $f \in \mathscr{F}$ such that the weighted empirical risk $$e_w(f) = \sum \limits_{i = 1}^n w_i \mathbb{1}_{\{f(\mathbf{x}_i) \neq y_i\}}$$ is (approximately) minimized.
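# As a concrete (made-up) illustration, here is the weighted empirical risk $e_w(f)$ together with a brute-force base learner that scans a small finite class $\mathscr{F}$ of 1-D threshold classifiers for the minimizer:

```python
# Weighted empirical risk e_w(f), plus a brute-force base learner over a
# small finite class of 1-D threshold classifiers (illustrative setup).
import numpy as np

def weighted_risk(f, X, y, w):
    """e_w(f) = sum_i w_i * 1{f(x_i) != y_i}."""
    preds = np.array([f(x) for x in X])
    return float(np.sum(w * (preds != y)))

def base_learner(F, X, y, w):
    """Return the classifier in the finite class F minimizing e_w."""
    return min(F, key=lambda f: weighted_risk(f, X, y, w))

X = np.array([0.1, 0.4, 0.6, 0.9])
y = np.array([-1, -1, 1, 1])
w = np.full(4, 0.25)                      # uniform weights, sum to 1
F = [lambda x, t=t: 1 if x > t else -1 for t in (0.0, 0.5, 1.0)]
best = base_learner(F, X, y, w)
print(weighted_risk(best, X, y, w))       # threshold 0.5 fits this data
```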
# ### Boosting (2 Class Scenario)
#
# - Assume class labels are -1 and +1.
# - The final classifier then has the form:
# - $h_T(\mathbf{x}) = \text{sgn}\left(\sum \limits_{t = 1}^T \alpha_t f_t(\mathbf{x})\right)$ where $f_1, ..., f_T$ are called base (or weak) classifiers and $\alpha_1, ..., \alpha_T > 0$ reflect the confidence of the various base classifiers.
# ### Examples of Base (Weak) Learners
#
# - Decision Stumps, i.e., decision trees with depth 1
# - Decision Trees
# - Polynomial thresholds, e.g., $$f(\vec{x}) = \pm \text{sign}((\vec{w}^\top \vec{x})^2 - b)$$ where $b \in \mathbb{R}$ is a threshold and $\vec{w} \in \mathbb{R}^d$ a weight vector.
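# A decision stump, for instance, can be fit under example weights by brute force over every (feature, threshold, sign) split. A minimal sketch (the data here is made up for illustration):

```python
# Brute-force decision stump (depth-1 tree) under example weights.
import numpy as np

def fit_stump(X, y, w):
    """Find (feature j, threshold t, sign s) minimizing the weighted error
    of the classifier x -> s * sign(x[j] > t)."""
    n, d = X.shape
    best, best_err = None, np.inf
    for j in range(d):
        for t in np.unique(X[:, j]):
            raw = np.where(X[:, j] > t, 1, -1)
            for s in (1, -1):
                err = np.sum(w * (s * raw != y))
                if err < best_err:
                    best_err, best = err, (j, t, s)
    return best, best_err

X = np.array([[0.0], [1.0], [2.0], [3.0]])   # made-up 1-D data
y = np.array([-1, -1, 1, 1])
(j, t, s), err = fit_stump(X, y, np.full(4, 0.25))
print(j, t, s, err)
```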
# ### AdaBoost (Adaptive Boosting)
#
# - The first concrete algorithm to successfully realize the boosting principle.
#
#
# ### AdaBoost Algorithm
#
# An *iterative* algorithm for "ensembling" base learners
#
# - Input: $\{(\mathbf{x}_i, y_i)\}_{i = 1}^n, T, \mathscr{F}$, base learner
# - Initialize: $\mathbf{w}^{1} = (\frac{1}{n}, ..., \frac{1}{n})$
# - For $t = 1, ..., T$
# - $\mathbf{w}^{t} \rightarrow \boxed{\text{base learner}} \rightarrow f_t$
#     - $\alpha_t = \frac{1}{2}\ln\left(\frac{1 - r_t}{r_t}\right)$
#     - where $r_t := e_{\mathbf{w}^t}(f_t) = \sum \limits_{i = 1}^n w_i^t \mathbf{1}_{\{f_t(\mathbf{x}_i) \neq y_i\}}$ is the weighted error of $f_t$
#     - $w_i^{t + 1} = \frac{w_i^t \exp \left(- \alpha_t y_i f_t(\mathbf{x}_i)\right)}{z_t}$ where $z_t$ normalizes so that $\sum_i w_i^{t+1} = 1$.
# - Output: $h_T(\mathbf{x}) = \text{sign}\left(\sum \limits_{t = 1}^T \alpha_t f_t(\mathbf{x})\right)$
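# Putting the pieces together, here is a self-contained sketch of AdaBoost using 1-D threshold stumps as the base learner (an illustrative choice of base class, not the only one):

```python
# A self-contained sketch of AdaBoost with 1-D threshold stumps.
import numpy as np

def adaboost(X, y, T):
    n = len(X)
    w = np.full(n, 1.0 / n)                   # w^1 = (1/n, ..., 1/n)
    ensemble = []
    for _ in range(T):
        # Base learner: best threshold stump under the current weights.
        best = None
        for t in X:
            for s in (1, -1):
                pred = s * np.where(X > t, 1, -1)
                r = np.sum(w * (pred != y))   # r_t = e_{w^t}(f_t)
                if best is None or r < best[0]:
                    best = (r, t, s, pred)
        r, t, s, pred = best
        r = min(max(r, 1e-10), 1 - 1e-10)     # guard the log
        alpha = 0.5 * np.log((1 - r) / r)     # confidence of f_t
        w = w * np.exp(-alpha * y * pred)     # upweight mistakes
        w = w / w.sum()                       # z_t normalizes
        ensemble.append((alpha, t, s))
    def h(x):                                 # h_T(x) = sign(sum_t alpha_t f_t(x))
        return np.sign(sum(a * s * (1 if x > t else -1) for a, t, s in ensemble))
    return h

X = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1, 1, -1, -1, 1, 1])            # no single stump fits this
h = adaboost(X, y, T=3)
print([int(h(x)) for x in X])
```

Note that no single stump classifies this labeling, but a weighted vote of three stumps does.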
# ## Weak Learning Assumption
#
# For Boosting to work, one needs the *weak learning hypothesis*: there is a $\gamma > 0$ such that, given any vector of weights $\mathbf{w} \in \Delta_n$ forming a distribution over the training set, one can always find a base hypothesis $f$ such that
# $$e_{\mathbf{w}}(f) := \sum \limits_{i = 1}^n \mathbf{w}_i \mathbf{1}[f(\mathbf{x}_i) \neq y_i] \leq \frac 1 2 - \gamma$$
#
# ## Strong Learning Assumption
#
# Weak learning by itself isn't good enough. What we really want is to combine the weak learners in a way that correctly predicts every training example. That is, we want to find $\alpha_1, \ldots, \alpha_T$ and base learners $f_1, \ldots, f_T$ so that for every $i = 1, \ldots, n$:
# $$
# h_T(\mathbf{x}_i) := \text{sign}\left(\sum \limits_{t = 1}^T \alpha_t f_t(\mathbf{x}_i)\right) = y_i
# $$
# Equivalently, assuming each $f_j$ outputs $\pm 1$ and the base class contains $m$ classifiers:
# $$
# \exists \vec{\alpha} \in \mathbf{R}^m_+ \; \forall i \in [n]: \quad y_i \sum_{j = 1}^m \alpha_j f_j(\mathbf{x}_i) > 0
# $$
# ## These two are primal/dual statements!
#
# It turns out that we can formulate this as a game. Let $M \in \{+1, -1\}^{n \times m}$ be the matrix defined by:
# $$ M_{i,j} := y_i f_j(\mathbf{x}_i)$$
#
# #### Problem
# How do the minimax value and the maximin value of this game relate to the Strong Learning and Weak Learning hypotheses? What does the minimax theorem tell us?
# \begin{align*}
# \text{Weak Learning}: &\quad \forall \mathbf{w} \in \Delta_n \exists j: \quad \sum \limits_{i = 1}^n \mathbf{w}_i \mathbf{1}[f_j(\mathbf{x}_i) \ne y_i] \leq \frac 1 2 - \gamma
# \\
# \text{Strong Learning}:&\quad \exists \vec{\alpha} \in \mathbf{R}^m_+ \forall i \in [n]: \quad y_i \sum_{j = 1}^m \alpha_j f_j(\mathbf{x}_i) > 0
# \end{align*}
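# As a numeric illustration, we can build $M$ for a made-up toy problem (three 1-D points, threshold stumps as the base class) and compute the game value $\min_{\mathbf{w}} \max_j \mathbf{w}^\top M_{:j}$ by linear programming (`scipy` assumed available). A strictly positive value certifies strong learnability, and it equals $2\gamma$, twice the best guaranteed edge of the weak learner:

```python
# Build the game matrix M for a toy problem and compute its value by LP.
import numpy as np
from scipy.optimize import linprog

X = np.array([0.0, 1.0, 2.0])
y = np.array([1, -1, 1])                   # labels +, -, + on a line
stumps = [(t, s) for t in X for s in (1, -1)]
M = np.array([[yi * s * (1 if xi > t else -1) for (t, s) in stumps]
              for xi, yi in zip(X, y)])    # M[i, j] = y_i f_j(x_i)

n, m = M.shape
c = np.r_[np.zeros(n), 1.0]                # variables (w, v); minimize v
res = linprog(c,
              A_ub=np.c_[M.T, -np.ones(m)], b_ub=np.zeros(m),   # M^T w <= v
              A_eq=np.r_[np.ones(n), 0.0].reshape(1, -1), b_eq=[1.0],
              bounds=[(0, None)] * n + [(None, None)])
value = res.fun
print(value)                               # positive, so the stumps strongly learn this labeling
```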
# ## AdaBoost through Coordinate Descent
#
# It is often said that we can view AdaBoost as "coordinate descent" on the exponential loss function. Coordinate descent is a simple algorithm: to minimize $L(\vec \alpha)$, first pick an appropriate coordinate $j$, then optimize over *only that coordinate*. Repeat.
#
# #### Problem
#
# Show that the AdaBoost algorithm can be viewed as coordinate descent on the *exponential loss function*
# $$\text{Loss}(\vec{\alpha}) = \sum_{i=1}^n \exp \left( - y_i \left(\sum_{j=1}^m \alpha_j h_j(\vec{x}_i)\right)\right)$$
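# One way to convince yourself numerically: freeze all coordinates but one, minimize the loss along that coordinate with a generic 1-D optimizer, and compare against the AdaBoost-style closed form $\alpha_j + \frac{1}{2}\ln\frac{1-r}{r}$, where $r$ is the weighted error of $h_j$ under the weights $w_i \propto \exp(-y_i \sum_k \alpha_k h_k(\vec{x}_i))$. A sketch with made-up data ($H$, $y$, and $\alpha$ below are arbitrary):

```python
# Numeric check: exact line search on the exponential loss along one
# coordinate agrees with the AdaBoost step formula.
import numpy as np
from scipy.optimize import minimize_scalar

H = np.array([[ 1,  1, -1],      # H[i, j] = h_j(x_i), values in {-1, +1}
              [ 1, -1,  1],
              [-1,  1,  1],
              [ 1, -1, -1]])
y = np.array([1, 1, -1, -1])
alpha = np.array([0.2, 0.1, 0.3])          # current iterate

def loss(a_vec):
    """Exponential loss: sum_i exp(-y_i * sum_j a_j h_j(x_i))."""
    return np.sum(np.exp(-y * (H @ a_vec)))

j = 1                                      # coordinate to optimize
w = np.exp(-y * (H @ alpha))
w = w / w.sum()                            # current AdaBoost-style weights
r = np.sum(w * (H[:, j] != y))             # weighted error of h_j
closed_form = alpha[j] + 0.5 * np.log((1 - r) / r)

def loss_along_j(a):
    v = alpha.copy()
    v[j] = a
    return loss(v)

numeric = minimize_scalar(loss_along_j).x  # generic 1-D minimizer
print(closed_form, numeric)
```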
#
#