%matplotlib inline
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
np.random.seed(1337)
kwargs = {'linewidth' : 3.5}
font = {'weight' : 'normal', 'size' : 24}
matplotlib.rc('font', **font)
def error_plot(ys, yscale='log'):
    plt.figure(figsize=(8, 8))
    plt.xlabel('Step')
    plt.ylabel('Error')
    plt.yscale(yscale)
    plt.plot(range(len(ys)), ys, **kwargs)
$\LaTeX \text{ commands here} \newcommand{\R}{\mathbb{R}} \newcommand{\im}{\text{im}\,} \newcommand{\norm}[1]{||#1||} \newcommand{\inner}[1]{\langle #1 \rangle} \newcommand{\span}{\mathrm{span}} \newcommand{\proj}{\mathrm{proj}} \newcommand{\OPT}{\mathrm{OPT}} \newcommand{\grad}{\nabla} \newcommand{\eps}{\varepsilon} $
Georgia Tech, CS 4540
Jake Abernethy, Benjamin Bray, Naveen Kodali
Thursday, September 20, 2018
Minimize an objective $f : \R^d \rightarrow \R$ by following the negative gradient direction:
$$ x_{t+1} = x_t - \eta_t \nabla f(x_t) $$

Last time, we began proving the following convergence rate: if $f$ is convex and $L$-Lipschitz, and $\norm{x_0 - x^*} \leq R$, then gradient descent with step size $\eta = \frac{R}{L\sqrt{t}}$ achieves $$ f\left(\frac{1}{t}\sum_{i=0}^{t} x_i\right) - f(x^*) \leq \frac{RL}{\sqrt{t}}. $$
Last time, we were able to show that for any step size $\eta > 0$,
$$f\left(\frac{1}{t}\sum_{i=0}^{t}x_i\right) - f(x^*) \leq \frac{R^2}{2\eta t} + \frac{\eta L^2}{2}$$

Define $g(\eta) = \frac{R^2}{2\eta t} + \frac{\eta L^2}{2}$ and differentiate:
$$ \frac{\partial g}{\partial \eta} = \frac{-R^2}{2\eta^2 t} + \frac{L^2}{2} $$

Setting the derivative to zero and solving for $\eta$ gives $\eta = \frac{R}{L\sqrt{t}}$. Plugging this into $g$ gives $g(\eta) = \frac{RL}{2\sqrt{t}} + \frac{RL}{2\sqrt{t}} = \frac{RL}{\sqrt{t}}$, which proves the theorem!
In general, there are several strategies for choosing the step size: a fixed constant $\eta_t = \eta$, a decreasing schedule such as $\eta_t \propto 1/\sqrt{t}$, or a line search at each iteration.
Suppose we want to minimize $f : \Omega \rightarrow \R$ within some convex set $\Omega \subset \R^d$ instead of the entire space $\R^d$. We can still apply gradient descent as long as we project back onto the convex domain $\Omega$ after each iteration:
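Concretely, writing $\Pi_\Omega(y) = \arg\min_{z \in \Omega} \norm{z - y}$ for the Euclidean projection onto $\Omega$, the projected update is $$ x_{t+1} = \Pi_\Omega\left(x_t - \eta_t \nabla f(x_t)\right). $$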
When minimizing an objective $f : \Omega \rightarrow \R$ using gradient descent, we can obtain stronger convergence guarantees by making assumptions about how "nice" a function $f$ is, such as Lipschitz continuity, strong convexity, and smoothness, which we define below.
Remember the notation $C \succcurlyeq D$ means $C - D$ is positive-semidefinite. Below, let $A \in \R^{n \times n}$ be symmetric.
Part A: For which values of $\alpha \in \R$ is the matrix $\alpha I$ positive semidefinite? For which values is it negative semidefinite?

Part B: If $A$ has eigenpairs $A v_k = \lambda_k v_k$ for $1 \leq k \leq n$, what are the eigenvalues and eigenvectors of $A - \alpha I$ when $\alpha \in \R$?
Part C: Let $A \in \R^{n \times n}$ be symmetric. For fixed $\alpha > 0$, which conditions guarantee that $A \succcurlyeq \alpha I$? What about $A \preccurlyeq \alpha I$?
Part A: The eigenvalues of $\alpha I$ are all equal to $\alpha$, so this matrix is psd when $\alpha \geq 0$ and nsd when $\alpha \leq 0$.
Part B: The matrix $A - \alpha I$ has eigenvectors $v_k$ and eigenvalues $\mu_k = \lambda_k - \alpha$, since $$ (A - \alpha I) v_k = A v_k - \alpha v_k = (\lambda_k - \alpha) v_k $$
Part C: By definition, $A \succcurlyeq \alpha I$ iff $A - \alpha I$ is positive semidefinite. By the previous part, this requires $\lambda_k \geq \alpha$ for all $k$. Similarly, $A \preccurlyeq \alpha I$ whenever $\lambda_k \leq \alpha$ for all $k$.
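We can also check the eigenvalue shift numerically; the matrix and the value of $\alpha$ below are arbitrary illustrative choices:

# Illustrative check: the eigenvalues of A - alpha*I are those of A shifted down by alpha.
A = np.random.normal(0, 1, (5, 5))
A = (A + A.T) / 2                      # symmetrize so the eigenvalues are real
alpha = 0.7
shifted = np.linalg.eigvalsh(A - alpha * np.eye(5))
print(np.allclose(shifted, np.linalg.eigvalsh(A) - alpha))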
Remember that a differentiable function $f : \Omega \rightarrow \R$ is convex iff it lies above its linear approximation at every point, that is, $$ f(y) \geq f(x) + \inner{ \nabla f(x), y-x } \quad \forall\, x,y \in \Omega $$
A function is $\alpha$-strongly convex if it also lies above a quadratic approximation at every point, where $\alpha$ controls the steepness: $$ f(y) \geq f(x) + \inner{ \nabla f(x), y-x } + \tfrac{\alpha}{2} \norm{y - x}^2 $$
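For example, $f(x) = \frac{1}{2}\norm{x}^2$ satisfies this definition with $\alpha = 1$, in fact with equality: $$ f(y) = f(x) + \inner{x, y - x} + \tfrac{1}{2}\norm{y - x}^2. $$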
Problem: Show that a twice-differentiable function $f$ is $\alpha$-strongly convex if and only if $\nabla^2 f(x) \succcurlyeq \alpha I$ for all $x \in \Omega$.

Hint: First, show that $f$ is $\alpha$-strongly convex if and only if the function $g(x) = f(x) - \frac{\alpha}{2}\norm{x}_2^2$ is convex. Then, use the second-order condition for convexity on $g$.
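Filling in the second step of the hint: since $\nabla^2 g(x) = \nabla^2 f(x) - \alpha I$, the second-order condition for convexity of $g$ gives $$ \nabla^2 f(x) - \alpha I \succcurlyeq 0 \quad \Longleftrightarrow \quad \nabla^2 f(x) \succcurlyeq \alpha I \quad \text{for all } x \in \Omega. $$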
Reminder: A function is $\alpha$-strongly convex if it also lies above a quadratic approximation at every point, where $\alpha$ controls the steepness: $$ f(y) \geq f(x) + \inner{ \nabla f(x), y-x } + \tfrac{\alpha}{2} \norm{y - x}^2 $$
A function $f : \Omega \rightarrow \R$ is $\beta$-smooth if it lies below a quadratic at each point:
$$ f(y) \leq f(x) + \inner{ \nabla f(x), y-x } + \tfrac{\beta}{2} \norm{y-x}_2^2 $$

Similarly to strong convexity, a twice-differentiable function is $\beta$-smooth if and only if $\nabla^2 f(x) \preccurlyeq \beta I$ for all $x \in \Omega$.
Let $A \in \R^{d \times d}$ be symmetric and positive-definite and define $f(x) = \frac{1}{2} x^T A x$.
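For this quadratic, the gradient and Hessian are $$ \nabla f(x) = Ax, \qquad \nabla^2 f(x) = A, $$ so by the second-order conditions above, $f$ is $\lambda_{\min}(A)$-strongly convex and $\lambda_{\max}(A)$-smooth.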
Let $f(x)$ be a $\beta$-smooth and $\lambda$-strongly convex function with minimizer $x^*$. Assume our initial point is $x_1$, and repeatedly perform gradient descent: $x_{t+1} = x_t - \eta \nabla f(x_t)$. Then it can be shown that $$ f(x_T) - f(x^*) \leq \frac{\beta}{2} \exp(-4T/(\kappa + 1)) \|x_1 - x^*\|^2. $$ where $\kappa = \frac{\beta}{\lambda}$ is the condition number, and the learning rate needs to be set as $\eta = \frac{2}{\lambda + \beta}$.
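As a quick sanity check, we can test this bound numerically on a small quadratic $f(x) = \frac{1}{2} x^T A x$ with $A = \mathrm{diag}(\lambda, \beta)$, which is $\beta$-smooth and $\lambda$-strongly convex; the values $\lambda = 1$, $\beta = 10$ below are arbitrary illustrative choices.

# Check the linear-convergence bound on a 2-d quadratic (illustrative values).
lam, beta = 1.0, 10.0
A = np.diag([lam, beta])                 # f(x) = 0.5 * x^T A x, minimizer x* = 0
eta = 2 / (lam + beta)                   # step size from the theorem
kappa = beta / lam                       # condition number
x1 = np.array([1.0, 1.0])                # initial point x_1
x = x1.copy()
for T in range(1, 8):
    f_gap = 0.5 * x.dot(A).dot(x)        # f(x_T) - f(x*), since f(x*) = 0
    bound = 0.5 * beta * np.exp(-4 * T / (kappa + 1)) * np.linalg.norm(x1)**2
    print(T, f_gap <= bound)             # should print True for every T
    x = x - eta * A.dot(x)               # gradient step: grad f(x) = A x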
Let $f(x)$ be a $\beta$-smooth function, and assume we want to solve $\min_{x \in \Omega} f(x)$ for some constrained set $\Omega$ with diameter $D$. The Frank-Wolfe algorithm repeats two steps: solve the linear subproblem $y_t = \arg\min_{y \in \Omega} \inner{\nabla f(x_t), y}$, then move towards its solution, $x_{t+1} = (1 - \eta_t) x_t + \eta_t y_t$.
Theorem: After $t$ rounds of Frank-Wolfe with step size $\eta_t = 2/(t+2)$, we have $$ f(x_t) - f(x^*) \leq \frac{2 \beta D^2}{t + 2}. $$
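Here is a minimal Python sketch of the algorithm, assuming the constraint set is the $\ell_1$ ball of radius 1 (chosen only because its linear subproblem has a simple closed form); frank_wolfe and l1_ball_lmo are illustrative names, not code from the original notebook.

def frank_wolfe(init, grad, lmo, T):
    """Frank-Wolfe sketch: `lmo` solves argmin_{y in Omega} <g, y> for a gradient g."""
    x = init
    for t in range(T):
        y = lmo(grad(x))                # y_t = argmin_{y in Omega} <grad f(x_t), y>
        eta = 2.0 / (t + 2)             # step size from the theorem above
        x = (1 - eta) * x + eta * y     # convex combination stays inside Omega
    return x

def l1_ball_lmo(g, radius=1.0):
    """Linear minimization over the l1 ball: the optimum is a signed vertex."""
    i = np.argmax(np.abs(g))
    y = np.zeros_like(g)
    y[i] = -radius * np.sign(g[i])
    return y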
Adapted from Moritz Hardt's lecture notebook
We start with a basic implementation of projected gradient descent. Note that this implementation keeps around all points computed along the way. This is clearly not what you would do on large instances. We do this for illustrative purposes to be able to easily inspect the computed sequence of points.
def gradient_descent(init, steps, grad, proj=lambda x: x):
    """Projected gradient descent.

    Inputs:
        init: starting point
        steps: list of scalar step sizes
        grad: function mapping points to gradients
        proj (optional): function mapping points to points

    Returns:
        List of all points computed by projected gradient descent.
    """
    xs = [init]
    for step in steps:
        x_step = xs[-1]
        # Take a gradient step, then project back onto the feasible set.
        x_update = proj(x_step - step * grad(x_step))
        xs.append(x_update)
    return xs
As a toy example, let's optimize $f(x) = \frac{1}{2} \norm{x}^2$, which has gradient $\nabla f(x) = x$.
def quadratic(x):
    return 0.5*x.dot(x)

# The gradient of f(x) = 0.5 * <x, x> is simply x.
def quadratic_gradient(x):
    return x
Note the function is 1-smooth and 1-strongly convex. Our theorems would then suggest that we use a constant step size of 1. If you think about it, for this step size the algorithm will actually find the optimal solution in just one step.
x0 = np.random.normal(0, 1, (1000))
_, x1 = gradient_descent(x0, [1.0], quadratic_gradient)
Indeed, it does:
(x1 == 0).all()
True
Let's see what happens if we don't have the right learning rate.
xs = gradient_descent(x0, [0.1]*50, quadratic_gradient)
error_plot([quadratic(x) for x in xs])
Let's say we want to optimize the function inside some affine subspace. Recall that affine subspaces are convex sets. Below we pick a random low-dimensional affine subspace $b + \im U$ and define the corresponding projection operator.
# U is an orthonormal basis of a random 100-dimensional subspace.
U = np.linalg.qr(np.random.normal(0, 1, (1000, 100)))[0]
b = np.random.normal(0, 1, 1000)
def proj(x):
    """Projection of x onto the affine subspace b + im(U)."""
    # Shift by b, project onto the column span of U, then shift back.
    return b + U.dot(U.T.dot(x - b))
x0 = np.random.normal(0, 1, (1000))
xs = gradient_descent(x0, [0.1]*50, quadratic_gradient, proj)
# the optimal solution is the projection of the origin
x_opt = proj(0)
error_plot([quadratic(x) for x in xs])
plt.plot(range(len(xs)), [quadratic(x_opt)]*len(xs),
         label=r'$\frac{1}{2}||x_{\mathrm{opt}}||^2$')
plt.legend()
The orange line shows the optimal error, which the algorithm reaches quickly. The iterates also converge to the optimal solution in the domain, as the following plot shows.
error_plot([np.linalg.norm(x_opt-x)**2 for x in xs])