$\LaTeX \text{ commands here} \newcommand{\R}{\mathbb{R}} \newcommand{\im}{\text{im}\,} \newcommand{\norm}[1]{||#1||} \newcommand{\inner}[1]{\langle #1 \rangle} \newcommand{\span}{\mathrm{span}} \newcommand{\proj}{\mathrm{proj}} \newcommand{\OPT}{\mathrm{OPT}} \newcommand{\grad}{\nabla} \newcommand{\eps}{\varepsilon} $
Georgia Tech, CS 4540
Jake Abernethy
Quiz password: sgd
Tuesday, October 1, 2019
It became clear in the early 2000s that a lot of the tools that were being developed in sequential (online) learning problems were part of a generic framework. Researchers realized that you could generalize all of these ideas to a master setting. The basic setup is that you're given a convex and bounded decision set $K \subset \mathbb{R}^d$, and you will execute the following protocol.
Objective: Minimize the regret, given by $$ \text{Regret}_T := \sum_{t=1}^T f_t(x_t) - \min_{x \in K} \sum_{t=1}^T f_t(x) $$
You've now read about the OGD algorithm:
Theorem: One can show that, as long as the functions $f_t$ are $G$-Lipschitz, and the set $K$ has diameter $D$, then $$\text{Regret}_T(\text{OGD}) \leq \frac 3 2 GD\sqrt{T}$$
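To make the protocol concrete, here is a minimal sketch of OGD in Python, assuming the decision set $K$ is a Euclidean ball (so projection is just rescaling) and using the step size $\eta_t = D/(G\sqrt{t})$ from the analysis; the function and variable names are illustrative, not from any particular library.

```python
import numpy as np

def ogd(grads, K_radius, D, G, T, d):
    """Online Gradient Descent sketch. Assumes K is the Euclidean ball of
    radius K_radius (so its diameter is D = 2 * K_radius). `grads` is a
    callable: grads(t, x) returns a (sub)gradient of f_t at x."""
    x = np.zeros(d)                        # start at the center of K
    iterates = []
    for t in range(1, T + 1):
        iterates.append(x.copy())
        eta = D / (G * np.sqrt(t))         # step size from the regret analysis
        y = x - eta * grads(t, x)          # gradient step
        norm = np.linalg.norm(y)           # project back onto K
        x = y if norm <= K_radius else y * (K_radius / norm)
    return iterates
```

For instance, running this on the fixed 1-Lipschitz losses $f_t(x) = |x - 1/2|$ over $K = [-1, 1]$ (so $G = 1$, $D = 2$) yields regret within the $\frac{3}{2}GD\sqrt{T}$ guarantee.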
Recall the Perceptron algorithm. We have a sample of data $\{(x_1, y_1), \ldots, (x_T, y_T)\}$ where $x_t \in \mathbb{R}^d$ and $y_t = \pm 1$. We assume there exists some $w^* \in \mathbb{R}^d$ such that $y_t(w^*\cdot x_t) > \gamma$ for every $t$, where $\gamma > 0$ is the margin.
Show that the Perceptron algorithm is a special case of Online Gradient Descent
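As a hint for the exercise, the sketch below (an illustration under stated assumptions, not the full argument) compares the classic Perceptron update with an OGD step of size $1$ on the surrogate loss $f_t(w) = \max(0, -y_t (w \cdot x_t))$ with $K = \R^d$, so there is no projection; a valid subgradient is $-y_t x_t$ on a mistake and $0$ otherwise, which makes the two updates coincide.

```python
import numpy as np

def perceptron_step(w, x, y):
    """Classic Perceptron: add y*x only when a mistake is made."""
    if y * np.dot(w, x) <= 0:
        return w + y * x
    return w

def ogd_step(w, x, y, eta=1.0):
    """OGD step on f_t(w) = max(0, -y (w . x)) with step size 1 and no
    projection: the chosen subgradient is -y*x on a mistake, else 0."""
    grad = -y * x if y * np.dot(w, x) <= 0 else np.zeros_like(x)
    return w - eta * grad
```

On any sequence of examples the two step functions produce identical iterates, which is the heart of the reduction.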
You will notice that the OGD algorithm looks very similar to the standard GD algorithm for minimizing a fixed convex Lipschitz function $\Phi(\cdot)$. But keep in mind the difference here: OGD uses a sequence of functions, whereas GD operates on only a single function.
How can we use the regret bound for OGD to get a convergence bound for (iterate-averaged) gradient descent? Hint: You can use $\Phi(\cdot)$ more than once! And you might find Jensen's inequality useful.
We can start with the regret bound for OGD: $$\sum_{t = 1}^T f_t(x_t) - \min_{x \in K} \sum_{t = 1}^T f_t(x) \leq \frac{3}{2}GD\sqrt{T}$$
Our function $\Phi$ is convex and Lipschitz, so we may take $f_t = \Phi$ for every $t$ in this regret bound:
$$\sum_{t = 1}^T \Phi(x_t) - \min_{x \in K} \sum_{t = 1}^T \Phi(x) \leq \frac{3}{2}GD\sqrt{T}$$Note that $\Phi$, as a function, does not change with the time $t$. Since we are after a bound for iterate-averaged GD, we average the function values by dividing both sides by the horizon $T$:
$$ \frac{1}{T}\sum_{t = 1}^T \Phi(x_t) - \frac{1}{T}\min_{x \in K} \sum_{t = 1}^T \Phi(x) \leq \frac{3GD}{2\sqrt{T}} $$Now apply Jensen's inequality: for a convex function, the average of the function's values is at least the function of the average, so replacing the first term with $\Phi$ evaluated at the averaged iterate can only decrease the left-hand side:
$$\Phi\left(\frac{1}{T}\sum_{t = 1}^T x_t\right) - \frac{1}{T}\min_{x \in K} \sum_{t = 1}^T \Phi(x) \leq \frac{3GD}{2\sqrt{T}} $$In the second term, the sum consists of $T$ copies of the same quantity, so dividing by $T$ leaves just the minimum:
$$\Phi\left(\frac{1}{T}\sum_{t = 1}^T x_t\right) - \min_{x \in K} \Phi(x) \leq \frac{3GD}{2\sqrt{T}} $$This final statement bounds the suboptimality of the averaged iterate: the value of $\Phi$ at the average of the iterates exceeds the minimum value by at most $\frac{3GD}{2\sqrt{T}}$.
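This online-to-batch conversion can be checked numerically. The sketch below runs GD with the OGD step sizes on a fixed convex $\Phi$ and returns the averaged iterate; the helper names and the 1-d test problem are illustrative choices, not from the course materials.

```python
import numpy as np

def averaged_gd(grad_phi, proj, x0, D, G, T):
    """Gradient descent on a fixed convex Phi with the OGD step sizes
    eta_t = D / (G sqrt(t)); returns the average of the iterates."""
    x, total = x0.copy(), np.zeros_like(x0)
    for t in range(1, T + 1):
        total += x                                   # accumulate iterates
        x = proj(x - (D / (G * np.sqrt(t))) * grad_phi(x))
    return total / T
```

For example, with $\Phi(x) = |x - 0.3|$ on $K = [-1, 1]$ ($G = 1$, $D = 2$), after $T = 100$ steps the averaged iterate is within $\frac{3GD}{2\sqrt{T}} = 0.3$ of optimal in function value.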
Technically speaking, a random variable $X : \Omega \to \R$ is just a map from some sample space $\Omega$ to the real numbers. We assume we have a probability measure on $\Omega$.
$$\def\E{\mathbb{E}}$$When $\Omega$ is a finite or countably-infinite set, then we can assume our measure is just a probability $p(\omega)$ assigned to every $\omega \in \Omega$, where $\sum_{\omega \in \Omega} p(\omega) = 1$.
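For a finite sample space everything can be computed directly; here is a toy sketch (a fair die, with exact fractions) just to fix the definitions:

```python
from fractions import Fraction

# Sample space for a fair die: Omega = {1,...,6}, with p(omega) = 1/6.
Omega = range(1, 7)
p = {w: Fraction(1, 6) for w in Omega}
assert sum(p.values()) == 1             # p is a probability measure

# A random variable is just a map X: Omega -> R, e.g. X(omega) = omega^2.
X = lambda w: w ** 2

# Its expectation: E[X] = sum over omega of p(omega) * X(omega).
EX = sum(p[w] * X(w) for w in Omega)
```

Here $\E[X] = \frac{1 + 4 + 9 + 16 + 25 + 36}{6} = \frac{91}{6}$.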
When the sample space $\Omega$ is uncountably infinite, we usually need full measure theory to talk about random variables in general. We won't do that today, but consider taking a measure theory course!
Continuous random variables are the most common in Machine Learning. The best way to think about random variables is through their probability density function and cumulative distribution function.
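As a concrete instance, here is a sketch of the density and CDF of a Gaussian $N(\mu, \sigma^2)$ using only the standard library (the CDF via the error function); this example is an illustrative addition, not part of the original notes.

```python
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Probability density function of N(mu, sigma^2)."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

def normal_cdf(x, mu=0.0, sigma=1.0):
    """P(X <= x): the CDF is the integral of the density up to x,
    expressible via the error function erf."""
    return 0.5 * (1 + math.erf((x - mu) / (sigma * math.sqrt(2))))
```

For the standard normal, `normal_cdf(0.0)` is exactly $1/2$, and `normal_cdf(1.96)` is approximately $0.975$, the familiar two-sided 95% quantile.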
In many fields, especially Machine Learning, we deal with functions that have randomized inputs. For example, assume we have some distribution $D$, we often consider functions of the form $$ F(\theta) := \E_{\xi \sim D} [f(\theta;\xi)]. $$ Here, $\theta$ are the parameters we want to optimize, and $\xi$ is some random input, such as a datapoint. But we'd like to optimize $\theta$ "on average", i.e. in expectation over a random sample of $\xi \sim D$.
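When $D$ is the uniform distribution over a finite dataset, the gradient of $F$ is exactly the average of the per-sample gradients, so the gradient at one random sample is an unbiased estimate of $\nabla F(\theta)$. A small sketch with a least-squares loss (the data and the loss are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(50, 3))           # samples xi_i = (a_i, b_i)
b = rng.normal(size=50)
theta = rng.normal(size=3)

def grad_f(theta, i):
    """Gradient of the per-sample loss f(theta; xi_i) = (a_i . theta - b_i)^2."""
    return 2 * (A[i] @ theta - b[i]) * A[i]

# With D uniform over the n samples, grad F(theta) is the average of the
# per-sample gradients, so each grad_f(theta, i) is an unbiased estimate.
full_grad = np.mean([grad_f(theta, i) for i in range(50)], axis=0)
```

This averaged gradient agrees with the analytic gradient of the mean squared error, $\frac{2}{n} A^\top (A\theta - b)$.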
We can optimize functions defined through such expectations using randomized (stochastic) gradients.
Theorem (Hazan's book): As long as $f(\cdot;\xi)$ is convex, $\|\nabla f(\cdot; \cdot) \| \leq G$ and $\text{diam}(K) \leq D$, then for the averaged iterate $\bar \theta_T := \frac{1}{T}\sum_{t=1}^T \theta_t$ produced by stochastic gradient descent we have $$\E[F(\bar \theta_T)] - \min_{\theta \in K} F(\theta) \leq \frac{3GD}{2\sqrt{T}}$$
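Here is a sketch of projected SGD with iterate averaging on a simple 1-d problem, $F(\theta) = \E_\xi[|\theta - \xi|]$ with $\xi$ uniform on $\{-1/2, +1/2\}$ and $K = [-1, 1]$ (so $G = 1$, $D = 2$); the setup is an illustrative choice, not from the notes.

```python
import numpy as np

rng = np.random.default_rng(0)

def sgd(T, D=2.0, G=1.0):
    """Projected SGD on F(theta) = E_xi[|theta - xi|], with xi uniform on
    {-0.5, +0.5} and K = [-1, 1]; returns the averaged iterate."""
    theta, total = 0.0, 0.0
    for t in range(1, T + 1):
        total += theta
        xi = rng.choice([-0.5, 0.5])                  # draw xi ~ D
        g = np.sign(theta - xi)                       # stochastic (sub)gradient
        theta = np.clip(theta - (D / (G * np.sqrt(t))) * g, -1.0, 1.0)
    return total / T
```

Here $F(\theta) = \frac{1}{2}(|\theta - 1/2| + |\theta + 1/2|)$, which attains its minimum value $1/2$ on all of $[-1/2, 1/2]$, and the averaged SGD iterate lands near that optimal region.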
Let $F(\theta) := \frac 1 n \sum_{i=1}^n f(\theta; \xi_i)$ be some objective function, defined as an average over samples $\xi_1, \ldots, \xi_n$. Consider two algorithms for minimizing $F$: