%matplotlib inline
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
np.random.seed(1337)
kwargs = {'linewidth' : 3.5}
font = {'weight' : 'normal', 'size' : 24}
matplotlib.rc('font', **font)
def error_plot(ys, yscale='log'):
    """Plot error against iteration count, on a log scale by default."""
    plt.figure(figsize=(8, 8))
    plt.xlabel('Step')
    plt.ylabel('Error')
    plt.yscale(yscale)
    plt.plot(range(len(ys)), ys, **kwargs)
$\LaTeX \text{ commands here} \newcommand{\R}{\mathbb{R}} \newcommand{\im}{\text{im}\,} \newcommand{\norm}[1]{||#1||} \newcommand{\inner}[1]{\langle #1 \rangle} \newcommand{\span}{\mathrm{span}} \newcommand{\proj}{\mathrm{proj}} \newcommand{\OPT}{\mathrm{OPT}} \newcommand{\grad}{\nabla} \newcommand{\eps}{\varepsilon} $
Georgia Tech, CS 4540
Jake Abernethy, Benjamin Bray, Naveen Kodali
Quiz password: descent
Tuesday, September 10, 2019
Warning: In this lecture, we will abuse notation and pretend that $\nabla f(x)$ is a column vector. Always remember that it's actually a row vector, and we're just pretending!!!
For a differentiable $f : \R^d \rightarrow \R$, gradient descent follows the direction of steepest descent from a starting point $x_0 \in \R^d$, performing the iteration $$ x_{t+1} = x_t - \eta\, \nabla f(x_t). $$
A function $f : \Omega \rightarrow \R$ is $L$-Lipschitz continuous on its domain $\Omega \subset \R^d$ provided that
$$ | f(x) - f(y) | \leq L \norm{x - y}_2 \quad \forall\, x,y \in \Omega $$

Part B: Prove that a Lipschitz continuous function is continuous; that is, for all $x_0 \in \R^d$ and $\eps > 0$ there exists $\delta > 0$ such that for all $x \in \R^d$, $$\norm{x-x_0} < \delta \implies |f(x)-f(x_0)| < \eps$$
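A possible solution sketch (added here; not part of the original quiz): given $\eps > 0$, choose $\delta = \eps / L$ (for $L > 0$; if $L = 0$ the function is constant and any $\delta$ works). Then $$\norm{x - x_0} < \delta \implies |f(x) - f(x_0)| \leq L \norm{x - x_0} < L\delta = \eps.$$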
Part C: (Save for Homework) Prove that if $f$ is differentiable, convex, and $L$-Lipschitz, then $\norm{\grad f(x)}_2 \leq L$.
Recall that a function is convex if it is lower bounded by its linear approximation at every point. A function is strongly convex if it is also lower bounded by a quadratic approximation. More precisely, we say $f$ is $\alpha$-strongly convex if the following holds for every $x, x_0 \in \text{dom}(f)$: $$ f(x) \geq f(x_0) + \nabla f(x_0) \cdot (x - x_0) + \frac \alpha 2 \| x - x_0 \|^2 $$ Similarly, a convex function is called $\beta$-smooth if the inequality goes the other way! $$ f(x) \leq f(x_0) + \nabla f(x_0) \cdot (x - x_0) + \frac \beta 2 \| x - x_0 \|^2 $$
Example: for $f(x) = \frac{1}{2} x^T A x$ with symmetric $A$, we have $\nabla^2 f(x) = A$, so $f$ is $\alpha$-strongly convex and $\beta$-smooth with $\alpha = \lambda_{\min}(A)$ and $\beta = \lambda_{\max}(A)$.
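As a quick numerical sanity check of both inequalities, the following cell (an addition to the notebook, not from the original) verifies the strong convexity and smoothness bounds for a random quadratic, using $\alpha = \lambda_{\min}(A)$ and $\beta = \lambda_{\max}(A)$:

# Sanity check: f(x) = 0.5 x^T A x with symmetric PSD A satisfies the
# alpha-strong-convexity and beta-smoothness bounds, with
# alpha = lambda_min(A) and beta = lambda_max(A).
A = np.random.normal(0, 1, (5, 5))
A = A.T @ A                      # symmetric positive semidefinite
evals = np.linalg.eigvalsh(A)    # eigenvalues in ascending order
alpha, beta = evals[0], evals[-1]
f = lambda z: 0.5 * z @ A @ z
grad_f = lambda z: A @ z
x, x0 = np.random.normal(0, 1, (2, 5))
linear = f(x0) + grad_f(x0) @ (x - x0)
half_sq_dist = 0.5 * np.linalg.norm(x - x0)**2
print(f(x) >= linear + alpha * half_sq_dist - 1e-9)  # strong convexity: True
print(f(x) <= linear + beta * half_sq_dist + 1e-9)   # smoothness: True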
Let $x^*$ be the minimizer of $f$, and suppose $\|x_0 - x^*\| \leq R$.
Our goal for today is to prove the following:
Show that $$f(x_t) - f(x^*) \leq \frac{1}{2\eta} \left(||x_t - x^*||^2 - ||x_{t+1} - x^*||^2\right) + \frac{\eta L^2}{2}$$
Hint: First, show that $u^T v = \frac{1}{2} (||u||^2 + ||v||^2 - ||u - v||^2)$ for any $u, v \in \R^d$
First, notice that $f(x_t) - f(x^*) \leq \nabla f(x_t)\cdot(x_t - x^*)$ since $f$ is convex.
Next, notice that \begin{eqnarray*} \|x_{t+1} - x^*\|^2 & = & \| x_t - \eta \nabla f(x_t) - x^*\|^2\\ & = & \|x_t - x^*\|^2 - 2 \eta \nabla f(x_t) \cdot (x_t - x^*) + \eta^2 \|\nabla f(x_t)\|^2 \\ & \leq & \|x_t - x^*\|^2 - 2\eta \nabla f(x_t) \cdot (x_t - x^*) + \eta^2 L^2, \end{eqnarray*} where the last step uses $\|\nabla f(x_t)\| \leq L$ (Part C above).
Rearranging, dividing by $2\eta$, and combining with the previous statement, we have $$ f(x_t) - f(x^*) \leq \nabla f(x_t)\cdot(x_t - x^*) \leq \frac{1}{2\eta}\left(\|x_{t} - x^*\|^2 - \|x_{t+1} - x^*\|^2\right) + \frac{\eta L^2}{2} $$
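As a numerical sanity check (an addition, not part of the original notes), the cell below tests the bound for $f(x) = \norm{x}$, which is convex and $1$-Lipschitz with minimizer $x^* = 0$, using its (sub)gradient $x / \norm{x}$ away from the origin:

eta, L = 0.05, 1.0
x_star = np.zeros(10)
x_t = np.random.normal(0, 1, 10)
g = x_t / np.linalg.norm(x_t)   # gradient of ||x|| at x_t != 0
x_next = x_t - eta * g          # one gradient descent step
lhs = np.linalg.norm(x_t) - np.linalg.norm(x_star)
rhs = ((np.linalg.norm(x_t - x_star)**2
        - np.linalg.norm(x_next - x_star)**2) / (2 * eta)
       + eta * L**2 / 2)
print(lhs <= rhs)  # True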
Using the previous step, prove for any $\eta > 0$ that
$$f\left(\frac{1}{t}\sum_{i=0}^{t-1}x_i\right) - f(x^*) \leq \frac{R^2}{2\eta t} + \frac{\eta L^2}{2}$$

Hint: Sum over iterates and identify a telescoping sum. Then, use Jensen's inequality (i.e. the definition of convexity).
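A sketch of the argument (added for completeness): summing the per-step bound over $i = 0, \dots, t-1$, the squared-distance terms telescope, $$\sum_{i=0}^{t-1}\left(f(x_i) - f(x^*)\right) \leq \frac{1}{2\eta}\left(\norm{x_0 - x^*}^2 - \norm{x_t - x^*}^2\right) + \frac{t \eta L^2}{2} \leq \frac{R^2}{2\eta} + \frac{t \eta L^2}{2}.$$ Dividing by $t$ and applying Jensen's inequality to the average of the iterates gives the claim.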
Now we have
$$f\left(\frac{1}{t}\sum_{i=0}^{t-1}x_i\right) - f(x^*) \leq \frac{R^2}{2\eta t} + \frac{\eta L^2}{2}$$

Since this holds for all $\eta > 0$, we should pick the step size that gives the best bound. Which $\eta$ should we choose? What bound does it give?
(Solution: $\eta = \frac{R}{L \sqrt{t}}$)
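For completeness (this derivation is an addition): the bound is minimized where its derivative in $\eta$ vanishes, $$\frac{d}{d\eta}\left[\frac{R^2}{2\eta t} + \frac{\eta L^2}{2}\right] = -\frac{R^2}{2\eta^2 t} + \frac{L^2}{2} = 0 \iff \eta = \frac{R}{L\sqrt{t}},$$ at which point the two terms balance: each equals $\frac{RL}{2\sqrt{t}}$.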
Using $\eta = \frac{R}{L \sqrt{t}}$, we obtain the following bound:
$$ f\left(\frac{1}{t}\sum_{i=0}^{t-1}x_i\right) - f(x^*) \leq \frac{R L}{\sqrt{t}} $$

Suppose we want to minimize $f : \Omega \rightarrow \R$ over some convex set $\Omega \subset \R^d$ instead of the entire space $\R^d$. We can still apply gradient descent as long as we project back onto the convex domain $\Omega$ after each iteration. Let $\Pi_\Omega(x) = \arg\min_{z \in \Omega} \norm{z-x}_2^2$ denote the projection of $x$ onto the convex set $\Omega$.
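As a concrete example (an addition to the notes; the radius $r$ and the name `proj_ball` are ours), the projection onto a Euclidean ball has a simple closed form:

def proj_ball(x, r=1.0):
    """Project x onto the L2 ball of radius r."""
    # Scale x back to the boundary if it lies outside; otherwise leave it.
    norm = np.linalg.norm(x)
    return x if norm <= r else (r / norm) * x

A function like this can be passed as the `proj` argument of the `gradient_descent` implementation below.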
Show that Projected Gradient Descent achieves the same convergence rate by making a minor adjustment to the proof. (Hint: you may want to use a fact from your homework!)
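Sketch (assuming the homework fact in question is that projections onto convex sets are nonexpansive): for any $y$ and any $x^* \in \Omega$, $$\norm{\Pi_\Omega(y) - x^*} \leq \norm{y - x^*},$$ so with the projected update $x_{t+1} = \Pi_\Omega(x_t - \eta \nabla f(x_t))$, the first line of the key inequality becomes $\norm{x_{t+1} - x^*}^2 \leq \norm{x_t - \eta \nabla f(x_t) - x^*}^2$ and the rest of the proof goes through unchanged.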
Adapted from Moritz Hardt's lecture notebook
We start with a basic implementation of projected gradient descent. Note that this implementation keeps around all points computed along the way. This is clearly not what you would do on large instances. We do this for illustrative purposes to be able to easily inspect the computed sequence of points.
def gradient_descent(init, steps, grad, proj=lambda x: x):
    """Projected gradient descent.

    Inputs:
        init: starting point
        steps: list of scalar step sizes
        grad: function mapping points to gradients
        proj (optional): function mapping points to points

    Returns:
        List of all points computed by projected gradient descent.
    """
    xs = [init]
    for step in steps:
        x_prev = xs[-1]
        # Take a gradient step, then project back onto the feasible set.
        x_update = proj(x_prev - step * grad(x_prev))
        xs.append(x_update)
    return xs
As a toy example, let's optimize $f(x) = \frac{1}{2} \norm{x}^2$, which has gradient $\nabla f(x) = x$.
def quadratic(x):
    return 0.5*x.dot(x)

# The gradient of f(x) = 0.5 * ||x||^2 is x itself.
def quadratic_gradient(x):
    return x
Note the function is 1-smooth and 1-strongly convex. Our theorems would then suggest that we use a constant step size of 1. If you think about it, for this step size the algorithm will actually find the optimal solution in just one step: $x_1 = x_0 - \nabla f(x_0) = x_0 - x_0 = 0$.
x0 = np.random.normal(0, 1, (1000))
_, x1 = gradient_descent(x0, [1.0], quadratic_gradient)
Indeed, it does:
(x1 == 0).all()
True
Let's see what happens if we don't have the right learning rate.
xs = gradient_descent(x0, [0.1]*50, quadratic_gradient)
error_plot([quadratic(x) for x in xs])
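With step size $0.1$, each update is $x_{t+1} = x_t - 0.1\,x_t = 0.9\,x_t$, so the error decays geometrically: $f(x_t) = 0.81^t f(x_0)$, a straight line on the log-scale plot. A quick check (this cell is an addition, not part of the original notebook):

# Each iterate is 0.9 times the previous one, so
# f(x_t) = 0.5 * ||0.9^t x_0||^2 = 0.81^t * f(x_0).
predicted = [0.81**t * quadratic(x0) for t in range(len(xs))]
print(np.allclose([quadratic(x) for x in xs], predicted))  # True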
Let's say we want to optimize the function inside some affine subspace. Recall that affine subspaces are convex sets. Below we pick a random low-dimensional affine subspace $b + \im U$ and define the corresponding projection operator.
# U is an orthonormal basis of a random 100-dimensional subspace.
U = np.linalg.qr(np.random.normal(0, 1, (1000, 100)))[0]
b = np.random.normal(0, 1, 1000)
def proj(x):
    """Projection of x onto the affine subspace b + im(U)."""
    # Shift by b, project onto the column span of U, shift back.
    return b + U.dot(U.T.dot(x - b))
x0 = np.random.normal(0, 1, (1000))
xs = gradient_descent(x0, [0.1]*50, quadratic_gradient, proj)
# The minimizer of 0.5*||x||^2 over the affine set is its closest
# point to the origin, i.e. the projection of 0.
x_opt = proj(0)
error_plot([quadratic(x) for x in xs])
plt.plot(range(len(xs)), [quadratic(x_opt)]*len(xs),
         label=r'$\frac{1}{2}|\!|x_{\mathrm{opt}}|\!|^2$')
plt.legend()
The orange line shows the optimal error, which the algorithm reaches quickly. The iterates also converge to the optimal solution in the domain, as the following plot shows.
error_plot([np.linalg.norm(x_opt-x)**2 for x in xs])