$\LaTeX \text{ commands here} \newcommand{\R}{\mathbb{R}} \newcommand{\im}{\text{im}\,} \newcommand{\norm}[1]{||#1||} \newcommand{\inner}[1]{\langle #1 \rangle} \newcommand{\span}{\mathrm{span}} \newcommand{\proj}{\mathrm{proj}} \newcommand{\OPT}{\mathrm{OPT}} $
Proved by von Neumann in 1928!
Theorem: For any matrix $M \in \R^{n \times m}$ we have $$\min_{p \in \Delta_n} \max_{j \in [m]} p^\top M_{:j} = \max_{q \in \Delta_m} \min_{i \in [n]} M_{i:} q$$ where $\Delta_k$ denotes the probability simplex $\{x \in \R^k_+ : \sum_i x_i = 1\}$, $M_{:j}$ is the $j$-th column of $M$, and $M_{i:}$ is its $i$-th row. Equivalently: $$\min_{p \in \Delta_n} \max_{q \in \Delta_m} p^\top M q = \max_{q \in \Delta_m} \min_{p \in \Delta_n} p^\top M q$$
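The theorem is easy to sanity-check numerically, since each side of the first identity is a linear program. Below is a minimal sketch (my illustration, not from the notes) that solves both LPs for a random payoff matrix with `scipy.optimize.linprog` and confirms the two values agree; `numpy` and `scipy` availability is assumed.

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
n, m = 5, 7
M = rng.standard_normal((n, m))

# Row player: min v  s.t.  M^T p <= v * 1,  p in Delta_n.
# Variables x = (p_1, ..., p_n, v).
c = np.r_[np.zeros(n), 1.0]                     # minimize v
A_ub = np.c_[M.T, -np.ones(m)]                  # M^T p - v <= 0
A_eq = np.r_[np.ones(n), 0.0].reshape(1, -1)    # sum_i p_i = 1
res_min = linprog(c, A_ub=A_ub, b_ub=np.zeros(m), A_eq=A_eq, b_eq=[1.0],
                  bounds=[(0, None)] * n + [(None, None)])

# Column player: max u  s.t.  M q >= u * 1,  q in Delta_m.
c = np.r_[np.zeros(m), -1.0]                    # maximize u
A_ub = np.c_[-M, np.ones(n)]                    # -M q + u <= 0
A_eq = np.r_[np.ones(m), 0.0].reshape(1, -1)    # sum_j q_j = 1
res_max = linprog(c, A_ub=A_ub, b_ub=np.zeros(n), A_eq=A_eq, b_eq=[1.0],
                  bounds=[(0, None)] * m + [(None, None)])

print(res_min.fun, -res_max.fun)                # the two game values agree
assert np.isclose(res_min.fun, -res_max.fun)
```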
Boosting is an iterative algorithm for "ensembling" base learners into a single, more accurate predictor.
For Boosting to work, one needs the weak learning hypothesis: there is a $\gamma > 0$ such that, given any vector of weights $\mathbf{w} \in \Delta_n$ forming a distribution over the training set, one can always find a base hypothesis $f$ whose weighted error satisfies $$e_{\mathbf{w}}(f) := \sum \limits_{i = 1}^n \mathbf{w}_i \mathbf{1}[f(\mathbf{x}_i) \ne y_i] \leq \frac 1 2 - \gamma$$ A concrete example of such a base learner is sketched below.
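To make this concrete, here is a sketch of a common base learner: a decision stump chosen to minimize the weighted error $e_{\mathbf{w}}(f)$ defined above. The stump family and the brute-force search are illustrative assumptions, not part of the notes.

```python
import numpy as np

def best_stump(X, y, w):
    """Return (feature, threshold, sign, weighted_error) minimizing
    sum_i w_i * 1[f(x_i) != y_i], where
    f(x) = sign * (+1 if x[feature] > threshold else -1)."""
    n, d = X.shape
    best = (None, None, None, np.inf)
    for j in range(d):
        for thr in np.unique(X[:, j]):
            pred = np.where(X[:, j] > thr, 1, -1)
            for s in (+1, -1):
                err = np.sum(w * (s * pred != y))
                if err < best[3]:
                    best = (j, thr, s, err)
    return best

# Usage: any stump with weighted error <= 1/2 - gamma witnesses
# the weak learning hypothesis for this particular w.
X = np.array([[0.1], [0.4], [0.6], [0.9]])
y = np.array([-1, -1, 1, 1])
w = np.full(4, 0.25)
print(best_stump(X, y, w))   # (0, 0.4, 1, 0.0) on this toy data
```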
Weak learning by itself isn't good enough. What we want is to combine the weak learners in a way that correctly predicts every example. That is, we want to find $\alpha_1, \ldots, \alpha_T$ and base learners $f_1, \ldots, f_T$ so that for every $i = 1, \ldots, n$: $$ h_T(\mathbf{x}_i) := \text{sign}\left(\sum \limits_{t = 1}^T \alpha_t f_t(\mathbf{x}_i)\right) = y_i $$ Equivalently, since for $y \in \{\pm 1\}$ we have $\text{sign}(z) = y$ iff $yz > 0$: assuming each $f_j$ outputs $\pm 1$ and there are $m$ base learners, $$ \exists \vec{\alpha} \in \R^m_+ \; \forall i \in [n]: \quad y_i \sum_{j = 1}^m \alpha_j f_j(\mathbf{x}_i) > 0 $$
It turns out that we can formulate this as a two-player zero-sum game between the examples and the base learners. Let $M \in \{+1, -1\}^{n \times m}$ be defined by $$ M_{i,j} := y_i f_j(\mathbf{x}_i),$$ so that $M_{i,j} = +1$ exactly when base learner $j$ classifies example $i$ correctly.
How do the minimax value and the maximin value of this game relate to the Strong Learning and Weak Learning hypotheses? What does the minimax theorem tell us? \begin{align*} \text{Weak Learning}: &\quad \forall \mathbf{w} \in \Delta_n \; \exists j: \quad \sum \limits_{i = 1}^n \mathbf{w}_i \mathbf{1}[f_j(\mathbf{x}_i) \ne y_i] \leq \frac 1 2 - \gamma \\ \text{Strong Learning}:&\quad \exists \vec{\alpha} \in \R^m_+ \; \forall i \in [n]: \quad y_i \sum_{j = 1}^m \alpha_j f_j(\mathbf{x}_i) > 0 \end{align*}
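Here is one way to connect the two (a sketch of the standard argument). Since $f_j(\mathbf{x}_i), y_i \in \{\pm 1\}$, we have $M_{i,j} = y_i f_j(\mathbf{x}_i) = 1 - 2\,\mathbf{1}[f_j(\mathbf{x}_i) \ne y_i]$, and hence for any $\mathbf{w} \in \Delta_n$: $$\mathbf{w}^\top M_{:j} = 1 - 2\, e_{\mathbf{w}}(f_j).$$ Weak Learning therefore says exactly that $\max_j \mathbf{w}^\top M_{:j} \geq 2\gamma$ for every $\mathbf{w} \in \Delta_n$, i.e. $$\min_{p \in \Delta_n} \max_{j \in [m]} p^\top M_{:j} \geq 2\gamma.$$ By the minimax theorem, the maximin value is then also at least $2\gamma$: there exists $q \in \Delta_m$ such that $M_{i:}\, q = y_i \sum_{j=1}^m q_j f_j(\mathbf{x}_i) \geq 2\gamma > 0$ for every $i \in [n]$. Taking $\vec{\alpha} = q$ gives Strong Learning, with margin at least $2\gamma$ on every example.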
It is often said that we can view AdaBoost as "coordinate descent" on the exponential loss function. Coordinate descent is a simple algorithm: to minimize $L(\vec \alpha)$, repeatedly pick an appropriate coordinate $j$, then minimize $L$ over that coordinate alone, holding the others fixed; see the sketch below.
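A minimal sketch of generic coordinate descent (my illustration; the cyclic coordinate rule and the 1-D solver are assumptions here, and AdaBoost makes different choices, as the exercise below works out):

```python
import numpy as np
from scipy.optimize import minimize_scalar

def coordinate_descent(L, alpha, steps=100):
    """Minimize L by cycling through coordinates and exactly
    minimizing L along each one, holding the others fixed."""
    for t in range(steps):
        j = t % len(alpha)                     # cyclic coordinate choice
        def along_j(s):
            a = alpha.copy()
            a[j] = s
            return L(a)
        alpha[j] = minimize_scalar(along_j).x  # exact 1-D line search
    return alpha

# Usage on a toy separable quadratic: minimum at (1, -2).
L = lambda a: (a[0] - 1) ** 2 + 2 * (a[1] + 2) ** 2
print(coordinate_descent(L, np.zeros(2)))      # approx [ 1. -2.]
```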
Show that the AdaBoost algorithm can be viewed as coordinate descent on the exponential loss function $$\text{Loss}(\vec{\alpha}) = \sum_{i=1}^n \exp \left( - y_i \left(\sum_{j=1}^m \alpha_j f_j(\mathbf{x}_i)\right)\right)$$
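One way to see this (a sketch of a common route, not necessarily the intended solution): write $F(\mathbf{x}_i) = \sum_j \alpha_j f_j(\mathbf{x}_i)$ and let $w_i := e^{-y_i F(\mathbf{x}_i)} / Z$ with $Z = \sum_{i'} e^{-y_{i'} F(\mathbf{x}_{i'})}$; these are exactly AdaBoost's example weights. Then $$\frac{\partial\, \text{Loss}}{\partial \alpha_j} = -\sum_{i=1}^n y_i f_j(\mathbf{x}_i)\, e^{-y_i F(\mathbf{x}_i)} = -Z \left(1 - 2\, e_{\mathbf{w}}(f_j)\right),$$ so the coordinate with the most negative directional derivative is the base learner with the smallest weighted error, which is precisely the hypothesis AdaBoost requests from the weak learner. Moving a step $\eta$ along that coordinate and splitting the sum into correct and incorrect examples, $$\text{Loss}(\vec{\alpha} + \eta\, \mathbf{e}_j) = Z\left[(1 - \epsilon)\, e^{-\eta} + \epsilon\, e^{\eta}\right], \qquad \epsilon := e_{\mathbf{w}}(f_j),$$ and setting the derivative in $\eta$ to zero gives the exact line-search step $\eta^\ast = \frac{1}{2} \ln \frac{1 - \epsilon}{\epsilon}$, which is AdaBoost's choice of $\alpha_t$. So each AdaBoost round is one step of coordinate descent: pick the steepest coordinate, then minimize exactly along it.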