$\LaTeX \text{ commands here} \newcommand{\R}{\mathbb{R}} \newcommand{\im}{\text{im}\,} \newcommand{\norm}[1]{||#1||} \newcommand{\inner}[1]{\langle #1 \rangle} \renewcommand{\span}{\mathrm{span}} \newcommand{\proj}{\mathrm{proj}} \newcommand{\OPT}{\mathrm{OPT}} \newcommand{\grad}{\nabla} \newcommand{\eps}{\varepsilon} $
Georgia Tech, CS 4540
Jake Abernethy
Quiz password: online
Tuesday, September 24, 2019
It became clear in the early 2000s that many of the tools being developed for sequential (online) learning problems were instances of a generic framework. Researchers realized that these ideas could be generalized into a master setting, now called Online Convex Optimization (OCO). The basic setup is that you're given a convex and bounded decision set $K \subset \mathbb{R}^d$, and you execute the following protocol: on each round $t = 1, \ldots, T$, you pick a point $x_t \in K$, an adversary reveals a convex loss function $f_t : K \to \mathbb{R}$, and you suffer the loss $f_t(x_t)$.
Objective: Minimize the regret, given by $$ \text{Regret}_T := \sum_{t=1}^T f_t(x_t) - \min_{x \in K} \sum_{t=1}^T f_t(x) $$
In the "online learning with experts" setting, we imagined a setting where, on each round, an algorithm picks a distribution $p_t$ over the $N$ experts, and then observes a loss $\ell_t(i)$ for each expert $i$. The goal is to minimize the long-term expected loss versus the loss of the best expert: $$\sum_{t=1}^T \left(\sum_i p_t(i) \ell_t(i)\right) - \min_i \sum_{t=1}^T \ell_t(i)$$
Show that the Hedge setting is a special case of OCO
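One way to see the reduction concretely (a Python sketch; the variable names and numbers are illustrative): take the decision set $K$ to be the probability simplex over the $N$ experts, and take each round's loss to be the linear — hence convex — function $f_t(p) = \langle p, \ell_t \rangle$.

```python
# Sketch: the Hedge setting as an instance of OCO (illustrative values).
# Decision set K = the probability simplex over N experts; the round-t loss
# is the linear -- hence convex -- function f_t(p) = <p, ell_t>.

def oco_loss(p, ell):
    """Expected loss of playing distribution p against per-expert losses ell."""
    return sum(pi * li for pi, li in zip(p, ell))

p = [0.5, 0.3, 0.2]        # a point in the simplex (entries sum to 1)
ell = [1.0, 0.0, 0.5]      # the adversary's per-expert losses this round

# The OCO loss at p is exactly the Hedge expected loss sum_i p(i) * ell_t(i).
assert abs(oco_loss(p, ell) - 0.6) < 1e-12
```

Since a linear function over the simplex is minimized at a vertex, the comparator $\min_{x \in K} \sum_t f_t(x)$ recovers $\min_i \sum_t \ell_t(i)$, the loss of the best single expert.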
We like to invest our money, but we don't always know the best way to invest. Suppose the stock market (say, the S&P 500) moves by a multiplicative factor $r_t \in (0,\infty)$ on day $t$, where $r_t = 1$ means the market doesn't change and $r_t = 1.01$ means the market went up by 1%.
You are also welcome to invest some of your own cash in the stock market, and you start out with $C$ dollars. If you invested a $p$ fraction of your money in the S&P 500 every day, keeping $1-p$ as cash on hand, then after $T$ days you'd have $C\prod_{t=1}^T ((1-p) + pr_t)$.
Of course, you can change your investment every day, putting a $p_t$ fraction of your money in the market on day $t$. Then you'd end up with $C\prod_{t=1}^T ((1-p_t) + p_tr_t)$ after $T$ rounds.
Let's say your goal is to maximize your overall wealth relative to the best fixed portfolio in hindsight. That is, you want to maximize $$\frac{C\prod_{t=1}^T ((1-p_t) + p_tr_t)}{\max_{p\in [0,1]}C\prod_{t=1}^T ((1-p) + pr_t)}$$ Can you turn this into an OCO problem?
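A sketch of one standard reduction: take logarithms, which turns the products into sums. $$\log \frac{\prod_{t=1}^T ((1-p_t) + p_tr_t)}{\max_{p\in [0,1]}\prod_{t=1}^T ((1-p) + pr_t)} = -\left(\sum_{t=1}^T f_t(p_t) - \min_{p \in [0,1]} \sum_{t=1}^T f_t(p)\right)$$ where $f_t(p) := -\log((1-p) + pr_t)$. Each $f_t$ is convex on $K = [0,1]$ (it is the negative log of a positive affine function), so maximizing the wealth ratio is exactly minimizing $\text{Regret}_T$ for this sequence of losses.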
Here's a trick that gets used a lot: instead of working with the original sequence of functions $f_1, \ldots, f_T$, let's use an alternative sequence of functions that depends on your choices of $x_t$ along the way.
Let $h_t(x) := f_t(x_t) + \langle \nabla f_t(x_t), x - x_t \rangle$. (Note that I can only define this sequence once I have committed to my choice of $x_t$'s!)
Can you relate $\text{Regret}_T(h_1, \ldots, h_T)$ and $\text{Regret}_T(f_1, \ldots, f_T)$?
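A sketch of the standard argument: since each $f_t$ is convex, the first-order condition gives $f_t(x) \geq h_t(x)$ for every $x \in K$, with equality at $x = x_t$ (so $f_t(x_t) = h_t(x_t)$). Letting $x^* \in \arg\min_{x \in K} \sum_t f_t(x)$, $$\text{Regret}_T(f_1, \ldots, f_T) = \sum_{t=1}^T f_t(x_t) - \sum_{t=1}^T f_t(x^*) \leq \sum_{t=1}^T h_t(x_t) - \sum_{t=1}^T h_t(x^*) \leq \text{Regret}_T(h_1, \ldots, h_T).$$ So any algorithm with low regret against the linear functions $h_t$ also has low regret against the original $f_t$'s.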
You've now read about the OGD algorithm:
Theorem: One can show that, as long as the functions $f_t$ are $G$-Lipschitz and the set $K$ has diameter $D$, then $$\text{Regret}_T(\text{OGD}) \leq \frac 3 2 GD\sqrt{T}$$
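For concreteness, here is a minimal sketch of OGD, taking $K = [0,1]$ so that the projection step is just a clamp, with the standard step sizes $\eta_t = D/(G\sqrt{t})$. The loss sequence at the bottom is an illustrative toy example, not the lecture's specific setup.

```python
import math

# Sketch of Online Gradient Descent on K = [0, 1]: on each round, take a
# gradient step on the just-revealed loss, then project back onto K.

def ogd(gradients, G, D=1.0, x0=0.0):
    """gradients[t] maps the current point x_t to the gradient of f_t at x_t."""
    x, iterates = x0, []
    for t, grad in enumerate(gradients, start=1):
        iterates.append(x)
        eta = D / (G * math.sqrt(t))              # step-size schedule
        x = min(1.0, max(0.0, x - eta * grad(x))) # gradient step + projection
    return iterates

# Toy run: f_t(x) = (x - 0.7)^2 every round, so grad f_t(x) = 2(x - 0.7),
# and G = 2 is a valid Lipschitz bound on [0, 1].
xs = ogd([lambda x: 2 * (x - 0.7)] * 100, G=2.0)
assert all(0.0 <= x <= 1.0 for x in xs)           # iterates stay in K
```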
Question: Can we apply OGD to the Hedge setting? If so, why might this not be the best choice of algorithm?
You will notice that the OGD algorithm looks very similar to the standard GD algorithm for minimizing a fixed convex lipschitz function $\Phi(\cdot)$. But keep in mind the differences here: OGD uses a sequence of functions, whereas GD operates on only a single function.
How can we use the regret bound for OGD to get a convergence bound for (iterate-averaged) gradient descent? Hint: You can use $\Phi(\cdot)$ more than once! And you might find Jensen's inequality useful.
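A sketch of the argument the hint suggests: run OGD on the constant sequence $f_t := \Phi$ for every $t$, and let $\bar{x} := \frac{1}{T}\sum_{t=1}^T x_t$ be the averaged iterate. By Jensen's inequality (since $\Phi$ is convex), $$\Phi(\bar{x}) \leq \frac{1}{T}\sum_{t=1}^T \Phi(x_t) = \min_{x \in K} \Phi(x) + \frac{\text{Regret}_T(\text{OGD})}{T} \leq \min_{x \in K} \Phi(x) + \frac{3GD}{2\sqrt{T}}.$$ So the averaged iterate is within $O(GD/\sqrt{T})$ of the minimum value after $T$ steps.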
Another algorithm that sometimes works well is known as Follow the Leader (FTL). It's defined as follows: $$x_t := \arg\min_{x \in K} \sum_{s=1}^{t-1} f_s(x)$$
But this algorithm doesn't always work well. It can achieve linear regret. Can you construct an example where this occurs?
Suppose we have $2$ experts. On the first round neither of them has ever made a mistake, so the decision maker picks one of them at random. The adversary then selects the other expert's advice to be the correct one, so the chosen expert has made $1$ mistake and the decision maker has made $1$ mistake.
On the second round the decision maker picks the advice of the expert that hasn't made a mistake. The adversary then selects the other expert's advice to be correct, and now both experts have $1$ mistake each while the decision maker has made $2$ mistakes.
This sequence of choices repeats for as long as the game lasts. Suppose the game continues for $T$ rounds. Then the decision maker will have made $T$ mistakes, while the best expert will have made only $\lfloor T/2 \rfloor$ mistakes (exactly one expert errs each round, alternating between the two).
Let regret be defined as $$ \text{Regret}_T = M_T - \min_i M_T(i) $$ where $M_T$ is the number of mistakes made by the decision maker and $M_T(i)$ is the number of mistakes made by expert $i$. Thus the regret is the number of mistakes made by the decision maker minus the number of mistakes made by the best expert in hindsight.
Therefore, in the Follow the Leader example described above, the regret of the algorithm is $$ \text{Regret}_T = M_T - \min_i M_T(i) = T - \lfloor T/2 \rfloor = \lceil T/2 \rceil $$ which is $\Theta(T)$ – linear in $T$.
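The two-expert adversary above is easy to simulate. This sketch assumes FTL breaks ties toward expert $0$ (any fixed, predictable tie-breaking rule gives the same linear-regret behavior):

```python
# Simulate FTL against the adversary described above: the adversary always
# makes the expert that FTL follows wrong, so the decision maker errs every
# round while the experts' mistakes split roughly evenly.

def ftl_two_experts(T):
    mistakes = [0, 0]          # cumulative mistakes of each expert
    dm_mistakes = 0            # cumulative mistakes of the decision maker
    for _ in range(T):
        pick = 0 if mistakes[0] <= mistakes[1] else 1  # follow the leader
        mistakes[pick] += 1    # adversary makes the chosen expert wrong
        dm_mistakes += 1       # so the decision maker errs this round
    return dm_mistakes, min(mistakes)

T = 101
dm, best = ftl_two_experts(T)
assert dm == T and best == T // 2
assert dm - best == (T + 1) // 2   # regret = ceil(T/2), linear in T
```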