$\LaTeX \text{ commands here} \newcommand{\R}{\mathbb{R}} \newcommand{\im}{\text{im}\,} \newcommand{\norm}[1]{||#1||} \newcommand{\inner}[1]{\langle #1 \rangle} \newcommand{\span}{\mathrm{span}} \newcommand{\proj}{\mathrm{proj}} \newcommand{\OPT}{\mathrm{OPT}} \newcommand{\grad}{\nabla} \newcommand{\eps}{\varepsilon} $
Georgia Tech, CS 4540
Jake Abernethy
Quiz password: opt
Thursday, September 19, 2019
It became clear in the early 90s that you could do learning and prediction without using statistical tools. One could prove guarantees for sequential algorithms that would hold on any sequence of data.
The seminal paper on this topic was The Weighted Majority Algorithm by Littlestone and Warmuth. It was a big idea that has been used in many places, and is a core result in the theory of learning.
This idea, plus the Perceptron Algorithm from Lecture 1, spawned a number of other ideas in ML. In future lectures we'll be exploring Online Convex Optimization and related tools, that can be used to study lots of other ideas as well.
As you saw in your reading, we can imagine a sequential game as follows:
Assume there is a perfect expert, i.e. $x_i^t = y^t$ for all $t$. Without knowing $i$ in advance, how can you perform almost as well? Come up with an algorithm that makes the smallest number of mistakes possible. (A mistake occurs when $\hat y^t \ne y^t$). Note: the Algorithm should make significantly fewer than $N$ mistakes!
The halving algorithm:
Theorem: the halving algorithm will make fewer than $\log_2 N$ mistakes!
Assume you have $m$ teams playing a league. On each of a sequence of rounds, some pair of teams $(i,j)$ will play each other. There is some unknown permutation $\pi \in S_m$ which determines the outcome of each match, so $i$ beats $j$ only when $\pi(i) > \pi(j)$. You're a gambler, and you want to know how to bet on who wins. What is the best algorithm to use, and how few mistakes will it make before knowing the exact ranking $\pi$?
There are $m!$ permutations or possible rankings of the $m$ teams. Let each of these permutations $\pi$ be an expert. Since it is given that one of these permutations are correct, we can run the Halving algorithm on these $m!$ permutations.
By the Halving algorithm, this approach will make at most $\log m! = O(m \log m)$ mistakes.
You looked at the Weighted Majority Algorithm in the reading, which can be thought of as a generalization of the Halving Algorithm when there are no perfect experts. But let's consider a simple scenario first, and try to use the Halving Algorithm again.
Assume you have $N$ experts, and each will give you a prediction on each round, and there are exactly $N$ rounds. There isn't a perfect expert, but there is one expert that will make exactly one mistake.
Can you find an algorithm that uses the experts' advice to make predictions, and is guaranteed to make no more than $2 \log_2 N$ mistakes? (Hint: Use the Halving Alg!)
For each expert $e$, make alternate experts $e'_i$ for $i \in [N]$ such that the prediction of $e'_i$ is the same as $e$ at each round $t$, except when $t = i$.
Since we are guaranteed that one expert $e$ makes exacty one mistake, then we are also guaranteed that one of it's alternates is a perfect expert, since one of them must have corrected the one mistake expert $e$ made.
Since we have a perfect expert, we can run the Halving Algorithm on these $N^2$ experts, and the bound on the number of mistakes we will make is $\log N^2 = 2 \log N$.
Theorem: Let $M_T$ be num mistakes of WMA after $T$ rounds, and let $M_T(i)$ be the number of mistakes of expert $i$. Then for every expert $i$ we have $$ M_T \leq 2(1+\epsilon)M_T(i) + \frac{2 \log N}{\epsilon}$$
What is the best choice of $\epsilon$ here?
We want to pick an $\epsilon$ such that the bound is minimized.
Let $f(\epsilon) = 2(1 + \epsilon)M_T(i) + \frac{2 \log N}{\epsilon}$. To minimize $f(\epsilon)$, we take find $\epsilon$ such that $f'(\epsilon) = 0$.
$$f'(\epsilon) = 2M_T(i) - \frac{2 \log N}{\epsilon^2} = 0$$$$\frac{\log N}{\epsilon^2} = M_T(i)$$$$\epsilon = \sqrt{\frac{\log N}{M_T(i)}}$$This bound is further bounded by $M_T(i^*)$ where $i^*$ is the expert that makes the fewest mistakes. Note that this requires that we know the number of mistakes the best expert makes ahead of time.
Theorem: Let $M_T$ be num mistakes of RWM after $T$ rounds, and let $M_T(i)$ be the number of mistakes of expert $i$. Then for every expert $i$ we have $$ \mathbb{E}[M_T] \leq (1+\epsilon)M_T(i) + \frac{\log N}{\epsilon}$$
Corollary: If we tune correctly, we see that $\epsilon = \sqrt{\frac{\log N}{M_T(i)}}$ where $i$ is the index of any expert (which we can choose as the best). Then we have $$ \mathbb{E}[M_T] - M_T(i) \leq 2\sqrt{M_T(i) \log N}$$
Now assume we are in the gambling problem we had before, with $m$ teams, and on each round a pair of teams shows up to play a match. This time, however, there is some good permutation $\pi$, but it's not perfect! There will be at most $K$ matches whose outcome doesn't following the permutation $\pi$.
What's a good algorithm for the gambler? And how many mistakes will she make along the way?