$\LaTeX \text{ commands here} \newcommand{\R}{\mathbb{R}} \newcommand{\im}{\text{im}\,} \newcommand{\norm}[1]{||#1||} \newcommand{\inner}[1]{\langle #1 \rangle} \renewcommand{\span}{\mathrm{span}} \newcommand{\proj}{\mathrm{proj}} \newcommand{\OPT}{\mathrm{OPT}} \newcommand{\grad}{\nabla} \newcommand{\eps}{\varepsilon} $
Georgia Tech, CS 4540
Jake Abernethy
Quiz password: online
Tuesday, September 24, 2019
It became clear in the early 2000s that many of the tools being developed for sequential (online) learning problems were instances of a generic framework. Researchers realized that these ideas could be generalized into a master setting, now called Online Convex Optimization (OCO). The basic setup is that you're given a convex and bounded decision set $K \subset \mathbb{R}^d$, and you execute the following protocol: on each round $t = 1, \ldots, T$, you pick a point $x_t \in K$, an adversary reveals a convex loss function $f_t : K \to \mathbb{R}$, and you suffer the loss $f_t(x_t)$.
Objective: Minimize the regret, given by $$ \text{Regret}_T := \sum_{t=1}^T f_t(x_t) - \min_{x \in K} \sum_{t=1}^T f_t(x) $$
In the "online learning with experts" setting, we imagined a setting where, on each round, an algorithm picks a distribution $p_t$ over the $N$ experts, and then observes a loss $\ell_t(i)$ for each expert $i$. The goal is to minimize the long-term expected loss versus the loss of the best expert: $$\sum_{t=1}^T \left(\sum_i p_t(i) \ell_t(i)\right) - \min_i \sum_{t=1}^T \ell_t(i)$$
Show that the Hedge setting is a special case of OCO
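One way to see the reduction concretely (a Python sketch; the variable names and numbers are illustrative): take the decision set $K$ to be the probability simplex over the $N$ experts, and take each round's loss to be the linear — hence convex — function $f_t(p) = \langle p, \ell_t \rangle$.

```python
# Sketch: the Hedge setting as an instance of OCO (illustrative values).
# Decision set K = the probability simplex over N experts; the round-t loss
# is the linear -- hence convex -- function f_t(p) = <p, ell_t>.

def oco_loss(p, ell):
    """Expected loss of playing distribution p against per-expert losses ell."""
    return sum(pi * li for pi, li in zip(p, ell))

p = [0.5, 0.3, 0.2]        # a point in the simplex (entries sum to 1)
ell = [1.0, 0.0, 0.5]      # the adversary's per-expert losses this round

# The OCO loss at p is exactly the Hedge expected loss sum_i p(i) * ell_t(i).
assert abs(oco_loss(p, ell) - 0.6) < 1e-12
```

Since a linear function over the simplex is minimized at a vertex, the comparator $\min_{x \in K} \sum_t f_t(x)$ recovers $\min_i \sum_t \ell_t(i)$, the loss of the best single expert.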
We like to invest our money, but we don't always know the best way to invest. Suppose the stock market (say, the S&P 500) moves by a multiplicative factor $r_t \in (0,\infty)$ on day $t$, where $r_t = 1$ means the market doesn't change and $r_t = 1.01$ means the market went up by 1%.
You are also welcome to invest some of your own cash in the stock market, and you start out with $C$ dollars. If you invested a $p$ fraction of your money in the S&P 500 every day, keeping $1-p$ as cash on hand, then after $T$ days you'd have $C\prod_{t=1}^T ((1-p) + pr_t)$.
Of course, you can change your investment every day, putting a $p_t$ fraction of your money in the market on day $t$. Then you'd end up with $C\prod_{t=1}^T ((1-p_t) + p_tr_t)$ after $T$ rounds.
Let's say your goal is to maximize your overall wealth relative to the best fixed portfolio in hindsight. That is, you want to maximize $$\frac{C\prod_{t=1}^T ((1-p_t) + p_tr_t)}{\max_{p\in [0,1]}C\prod_{t=1}^T ((1-p) + pr_t)}$$ Can you turn this into an OCO problem?
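A sketch of one standard reduction: take logarithms, which turns the products into sums. $$\log \frac{\prod_{t=1}^T ((1-p_t) + p_tr_t)}{\max_{p\in [0,1]}\prod_{t=1}^T ((1-p) + pr_t)} = -\left(\sum_{t=1}^T f_t(p_t) - \min_{p \in [0,1]} \sum_{t=1}^T f_t(p)\right)$$ where $f_t(p) := -\log((1-p) + pr_t)$. Each $f_t$ is convex on $K = [0,1]$ (it is the negative log of a positive affine function), so maximizing the wealth ratio is exactly minimizing $\text{Regret}_T$ for this sequence of losses.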
Here's a trick that gets used a lot: instead of working with the original sequence of functions $f_1, \ldots, f_T$, let's use an alternative sequence of functions that depends on your choices of $x_t$ along the way.
Let $h_t(x) := f_t(x_t) + \langle \nabla f_t(x_t), x - x_t \rangle$. (Note that I can only define this sequence once I have committed to my choice of $x_t$'s!)
Can you relate $\text{Regret}_T(h_1, \ldots, h_T)$ and $\text{Regret}_T(f_1, \ldots, f_T)$?
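A sketch of the standard argument: since each $f_t$ is convex, the first-order condition gives $f_t(x) \geq h_t(x)$ for every $x \in K$, with equality at $x = x_t$ (so $f_t(x_t) = h_t(x_t)$). Letting $x^* \in \arg\min_{x \in K} \sum_t f_t(x)$, $$\text{Regret}_T(f_1, \ldots, f_T) = \sum_{t=1}^T f_t(x_t) - \sum_{t=1}^T f_t(x^*) \leq \sum_{t=1}^T h_t(x_t) - \sum_{t=1}^T h_t(x^*) \leq \text{Regret}_T(h_1, \ldots, h_T).$$ So any algorithm with low regret against the linear functions $h_t$ also has low regret against the original $f_t$'s.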
You've now read about the OGD algorithm:
Theorem: One can show that, as long as the functions $f_t$ are $G$-Lipschitz and the set $K$ has diameter $D$, then $$\text{Regret}_T(\text{OGD}) \leq \frac 3 2 GD\sqrt{T}$$
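For concreteness, here is a minimal sketch of OGD, taking $K = [0,1]$ so that the projection step is just a clamp, with the standard step sizes $\eta_t = D/(G\sqrt{t})$. The loss sequence at the bottom is an illustrative toy example, not the lecture's specific setup.

```python
import math

# Sketch of Online Gradient Descent on K = [0, 1]: on each round, take a
# gradient step on the just-revealed loss, then project back onto K.

def ogd(gradients, G, D=1.0, x0=0.0):
    """gradients[t] maps the current point x_t to the gradient of f_t at x_t."""
    x, iterates = x0, []
    for t, grad in enumerate(gradients, start=1):
        iterates.append(x)
        eta = D / (G * math.sqrt(t))              # step-size schedule
        x = min(1.0, max(0.0, x - eta * grad(x))) # gradient step + projection
    return iterates

# Toy run: f_t(x) = (x - 0.7)^2 every round, so grad f_t(x) = 2(x - 0.7),
# and G = 2 is a valid Lipschitz bound on [0, 1].
xs = ogd([lambda x: 2 * (x - 0.7)] * 100, G=2.0)
assert all(0.0 <= x <= 1.0 for x in xs)           # iterates stay in K
```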
Question: Can we apply OGD to the Hedge setting? If so, why might this not be the best choice of algorithm?
You will notice that the OGD algorithm looks very similar to the standard GD algorithm for minimizing a fixed convex lipschitz function $\Phi(\cdot)$. But keep in mind the differences here: OGD uses a sequence of functions, whereas GD operates on only a single function.
How can we use the regret bound for OGD to get a convergence bound for (iterate-averaged) gradient descent? Hint: You can use $\Phi(\cdot)$ more than once! And you might find Jensen's inequality useful.
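A sketch of the argument the hint suggests: run OGD on the constant sequence $f_t := \Phi$ for every $t$, and let $\bar{x} := \frac{1}{T}\sum_{t=1}^T x_t$ be the averaged iterate. By Jensen's inequality (since $\Phi$ is convex), $$\Phi(\bar{x}) \leq \frac{1}{T}\sum_{t=1}^T \Phi(x_t) = \min_{x \in K} \Phi(x) + \frac{\text{Regret}_T(\text{OGD})}{T} \leq \min_{x \in K} \Phi(x) + \frac{3GD}{2\sqrt{T}}.$$ So the averaged iterate is within $O(GD/\sqrt{T})$ of the minimum value after $T$ steps.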
Another algorithm that sometimes works well is known as Follow the Leader (FTL). It's defined as follows: $$x_t := \arg\min_{x \in K} \sum_{s=1}^{t-1} f_s(x)$$
But this algorithm doesn't always work well. It can achieve linear regret. Can you construct an example where this occurs?
Suppose we have $2$ experts. On the first round neither of them has ever made a mistake, so the decision maker picks one of them at random. The adversary then selects the other expert's advice to be the correct one, so the chosen expert has made $1$ mistake and the decision maker has made $1$ mistake.
On the second round the decision maker picks the advice of the expert that hasn't made a mistake. The adversary then selects the other expert's advice to be correct, and now both experts have $1$ mistake each while the decision maker has made $2$ mistakes.
This sequence of choices repeats for as long as the game lasts. Suppose the game continues for $T$ rounds. Then the decision maker will have made $T$ mistakes, while the best expert will have made only $\lfloor T/2 \rfloor$ mistakes (exactly one expert errs each round, alternating between the two).
Let regret be defined as $$ \text{Regret}_T = M_T - \min_i M_T(i) $$ where $M_T$ is the number of mistakes made by the decision maker and $M_T(i)$ is the number of mistakes made by expert $i$. Thus the regret is the number of mistakes made by the decision maker minus the number of mistakes made by the best expert in hindsight.
Therefore, in the Follow the Leader example described above, the regret of the algorithm is $$ \text{Regret}_T = M_T - \min_i M_T(i) = T - \lfloor T/2 \rfloor = \lceil T/2 \rceil $$ which is $\Theta(T)$ – linear in $T$.
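The two-expert adversary above is easy to simulate. This sketch assumes FTL breaks ties toward expert $0$ (any fixed, predictable tie-breaking rule gives the same linear-regret behavior):

```python
# Simulate FTL against the adversary described above: the adversary always
# makes the expert that FTL follows wrong, so the decision maker errs every
# round while the experts' mistakes split roughly evenly.

def ftl_two_experts(T):
    mistakes = [0, 0]          # cumulative mistakes of each expert
    dm_mistakes = 0            # cumulative mistakes of the decision maker
    for _ in range(T):
        pick = 0 if mistakes[0] <= mistakes[1] else 1  # follow the leader
        mistakes[pick] += 1    # adversary makes the chosen expert wrong
        dm_mistakes += 1       # so the decision maker errs this round
    return dm_mistakes, min(mistakes)

T = 101
dm, best = ftl_two_experts(T)
assert dm == T and best == T // 2
assert dm - best == (T + 1) // 2   # regret = ceil(T/2), linear in T
```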