%matplotlib inline
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
np.random.seed(1337)
kwargs = {'linewidth' : 3.5}
font = {'weight' : 'normal', 'size' : 24}
matplotlib.rc('font', **font)
def error_plot(ys, yscale='log'):
    """Plot error against iteration count, on a log scale by default."""
    plt.figure(figsize=(8, 8))
    plt.xlabel('Step')
    plt.ylabel('Error')
    plt.yscale(yscale)
    plt.plot(range(len(ys)), ys, **kwargs)
$\LaTeX \text{ commands here} \newcommand{\R}{\mathbb{R}} \newcommand{\im}{\text{im}\,} \newcommand{\norm}[1]{||#1||} \newcommand{\inner}[1]{\langle #1 \rangle} \newcommand{\span}{\mathrm{span}} \newcommand{\proj}{\mathrm{proj}} \newcommand{\OPT}{\mathrm{OPT}} \newcommand{\grad}{\nabla} \newcommand{\eps}{\varepsilon} $
Georgia Tech, CS 4540
Jake Abernethy, Benjamin Bray, Naveen Kodali
Quiz password: descent
Tuesday, September 10, 2019
Warning: In this lecture, we will abuse notation and pretend that $\nabla f(x)$ is a column vector. Always remember that it's actually a row vector, and we're just pretending!!!
For a differentiable $f : \R^d \rightarrow \R$, gradient descent follows the direction of steepest descent from a starting point $x_0 \in \R^d$, performing the iteration $$ x_{t+1} = x_t - \eta\, \nabla f(x_t). $$
A function $f : \Omega \rightarrow \R$ is $L$-Lipschitz continuous on its domain $\Omega \subset \R^d$ provided that
$$ | f(x) - f(y) | \leq L \norm{x - y}_2 \quad \forall\, x,y \in \Omega $$

Part B: Prove that a Lipschitz continuous function is continuous; that is, for all $x_0 \in \R^d$ and $\eps > 0$ there exists $\delta > 0$ such that for all $x \in \R^d$, $$\norm{x-x_0} < \delta \implies |f(x)-f(x_0)| < \eps$$
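A possible solution sketch (added here; not part of the original quiz): given $\eps > 0$, choose $\delta = \eps / L$ (for $L > 0$; if $L = 0$ the function is constant and any $\delta$ works). Then $$\norm{x - x_0} < \delta \implies |f(x) - f(x_0)| \leq L \norm{x - x_0} < L\delta = \eps.$$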
Part C: (Save for Homework) Prove that if $f$ is differentiable, convex, and $L$-Lipschitz, then $\norm{\grad f(x)}_2 \leq L$.
Recall that a function is convex if it is lower bounded by its linear approximation at every point. A function is strongly convex if it is also lower bounded by a quadratic approximation. More precisely, we say $f$ is $\alpha$-strongly convex if the following holds for every $x, x_0 \in \text{dom}(f)$: $$ f(x) \geq f(x_0) + \nabla f(x_0) \cdot (x - x_0) + \frac \alpha 2 \| x - x_0 \|^2 $$ Similarly, a convex function is called $\beta$-smooth if the inequality goes the other way! $$ f(x) \leq f(x_0) + \nabla f(x_0) \cdot (x - x_0) + \frac \beta 2 \| x - x_0 \|^2 $$
Example: for $f(x) = \frac{1}{2} x^T A x$ with symmetric $A$, we have $\nabla^2 f(x) = A$, so $f$ is $\alpha$-strongly convex and $\beta$-smooth with $\alpha = \lambda_{\min}(A)$ and $\beta = \lambda_{\max}(A)$.
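As a quick numerical sanity check of both inequalities, the following cell (an addition to the notebook, not from the original) verifies the strong convexity and smoothness bounds for a random quadratic, using $\alpha = \lambda_{\min}(A)$ and $\beta = \lambda_{\max}(A)$:

# Sanity check: f(x) = 0.5 x^T A x with symmetric PSD A satisfies the
# alpha-strong-convexity and beta-smoothness bounds, with
# alpha = lambda_min(A) and beta = lambda_max(A).
A = np.random.normal(0, 1, (5, 5))
A = A.T @ A                      # symmetric positive semidefinite
evals = np.linalg.eigvalsh(A)    # eigenvalues in ascending order
alpha, beta = evals[0], evals[-1]
f = lambda z: 0.5 * z @ A @ z
grad_f = lambda z: A @ z
x, x0 = np.random.normal(0, 1, (2, 5))
linear = f(x0) + grad_f(x0) @ (x - x0)
half_sq_dist = 0.5 * np.linalg.norm(x - x0)**2
print(f(x) >= linear + alpha * half_sq_dist - 1e-9)  # strong convexity: True
print(f(x) <= linear + beta * half_sq_dist + 1e-9)   # smoothness: True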
Let $x^*$ be the minimizer of $f$, and suppose $\|x_0 - x^*\| \leq R$.
Our goal for today is to prove the following:
Show that $$f(x_t) - f(x^*) \leq \frac{1}{2\eta} \left(||x_t - x^*||^2 - ||x_{t+1} - x^*||^2\right) + \frac{\eta L^2}{2}$$
Hint: First, show that $u^T v = \frac{1}{2} (||u||^2 + ||v||^2 - ||u - v||^2)$ for any $u, v \in \R^d$
First, notice that $f(x_t) - f(x^*) \leq \nabla f(x_t)\cdot(x_t - x^*)$ since $f$ is convex.
Next, notice that \begin{eqnarray*} \|x_{t+1} - x^*\|^2 & = & \| x_t - \eta \nabla f(x_t) - x^*\|^2\\ & = & \|x_t - x^*\|^2 - 2 \eta \nabla f(x_t) \cdot (x_t - x^*) + \eta^2 \|\nabla f(x_t)\|^2 \\ & \leq & \|x_t - x^*\|^2 - 2\eta \nabla f(x_t) \cdot (x_t - x^*) + \eta^2 L^2, \end{eqnarray*} where the last step uses $\|\nabla f(x_t)\| \leq L$ (Part C above).
Rearranging, dividing by $2\eta$, and combining with the previous statement, we have $$ f(x_t) - f(x^*) \leq \nabla f(x_t)\cdot(x_t - x^*) \leq \frac{1}{2\eta}\left(\|x_{t} - x^*\|^2 - \|x_{t+1} - x^*\|^2\right) + \frac{\eta L^2}{2} $$
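As a numerical sanity check (an addition, not part of the original notes), the cell below tests the bound for $f(x) = \norm{x}$, which is convex and $1$-Lipschitz with minimizer $x^* = 0$, using its (sub)gradient $x / \norm{x}$ away from the origin:

eta, L = 0.05, 1.0
x_star = np.zeros(10)
x_t = np.random.normal(0, 1, 10)
g = x_t / np.linalg.norm(x_t)   # gradient of ||x|| at x_t != 0
x_next = x_t - eta * g          # one gradient descent step
lhs = np.linalg.norm(x_t) - np.linalg.norm(x_star)
rhs = ((np.linalg.norm(x_t - x_star)**2
        - np.linalg.norm(x_next - x_star)**2) / (2 * eta)
       + eta * L**2 / 2)
print(lhs <= rhs)  # True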
Using the previous step, prove for any $\eta > 0$ that
$$f\left(\frac{1}{t}\sum_{i=0}^{t-1}x_i\right) - f(x^*) \leq \frac{R^2}{2\eta t} + \frac{\eta L^2}{2}$$

Hint: Sum over iterates and identify a telescoping sum. Then, use Jensen's inequality (i.e. the definition of convexity).
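A sketch of the argument (added for completeness): summing the per-step bound over $i = 0, \dots, t-1$, the squared-distance terms telescope, $$\sum_{i=0}^{t-1}\left(f(x_i) - f(x^*)\right) \leq \frac{1}{2\eta}\left(\norm{x_0 - x^*}^2 - \norm{x_t - x^*}^2\right) + \frac{t \eta L^2}{2} \leq \frac{R^2}{2\eta} + \frac{t \eta L^2}{2}.$$ Dividing by $t$ and applying Jensen's inequality to the average of the iterates gives the claim.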
Now we have
$$f\left(\frac{1}{t}\sum_{i=0}^{t-1}x_i\right) - f(x^*) \leq \frac{R^2}{2\eta t} + \frac{\eta L^2}{2}$$

Since this holds for all $\eta > 0$, we should pick the step size that gives the best bound. Which $\eta$ should we choose? What bound does it give?
(Solution: $\eta = \frac{R}{L \sqrt{t}}$)
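For completeness (this derivation is an addition): the bound is minimized where its derivative in $\eta$ vanishes, $$\frac{d}{d\eta}\left[\frac{R^2}{2\eta t} + \frac{\eta L^2}{2}\right] = -\frac{R^2}{2\eta^2 t} + \frac{L^2}{2} = 0 \iff \eta = \frac{R}{L\sqrt{t}},$$ at which point the two terms balance: each equals $\frac{RL}{2\sqrt{t}}$.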
Using $\eta = \frac{R}{L \sqrt{t}}$, we obtain the following bound:
$$ f\left(\frac{1}{t}\sum_{i=0}^{t-1}x_i\right) - f(x^*) \leq \frac{R L}{\sqrt{t}} $$

Suppose we want to minimize $f : \Omega \rightarrow \R$ over some convex set $\Omega \subset \R^d$ instead of the entire space $\R^d$. We can still apply gradient descent as long as we project back onto the convex domain $\Omega$ after each iteration. Let $\Pi_\Omega(x) = \arg\min_{z \in \Omega} \norm{z-x}_2^2$ denote the projection of $x$ onto the convex set $\Omega$.
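As a concrete example (an addition to the notes; the radius $r$ and the name `proj_ball` are ours), the projection onto a Euclidean ball has a simple closed form:

def proj_ball(x, r=1.0):
    """Project x onto the L2 ball of radius r."""
    # Scale x back to the boundary if it lies outside; otherwise leave it.
    norm = np.linalg.norm(x)
    return x if norm <= r else (r / norm) * x

A function like this can be passed as the `proj` argument of the `gradient_descent` implementation below.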
Show that Projected Gradient Descent achieves the same convergence rate by making a minor adjustment to the proof. (Hint: you may want to use a fact from your homework!)
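Sketch (assuming the homework fact in question is that projections onto convex sets are nonexpansive): for any $y$ and any $x^* \in \Omega$, $$\norm{\Pi_\Omega(y) - x^*} \leq \norm{y - x^*},$$ so with the projected update $x_{t+1} = \Pi_\Omega(x_t - \eta \nabla f(x_t))$, the first line of the key inequality becomes $\norm{x_{t+1} - x^*}^2 \leq \norm{x_t - \eta \nabla f(x_t) - x^*}^2$ and the rest of the proof goes through unchanged.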
Adapted from Moritz Hardt's lecture notebook
We start with a basic implementation of projected gradient descent. Note that this implementation keeps around all points computed along the way. This is clearly not what you would do on large instances. We do this for illustrative purposes to be able to easily inspect the computed sequence of points.
def gradient_descent(init, steps, grad, proj=lambda x: x):
    """Projected gradient descent.

    Inputs:
        init: starting point
        steps: list of scalar step sizes
        grad: function mapping points to gradients
        proj (optional): function mapping points to points

    Returns:
        List of all points computed by projected gradient descent.
    """
    xs = [init]
    for step in steps:
        x_prev = xs[-1]
        # Take a gradient step, then project back onto the feasible set.
        x_update = proj(x_prev - step * grad(x_prev))
        xs.append(x_update)
    return xs
As a toy example, let's optimize $f(x) = \frac{1}{2} \norm{x}^2$, which has gradient $\nabla f(x) = x$.
def quadratic(x):
    return 0.5*x.dot(x)

# The gradient of f(x) = 0.5 * ||x||^2 is x itself.
def quadratic_gradient(x):
    return x
Note the function is 1-smooth and 1-strongly convex. Our theorems would then suggest that we use a constant step size of 1. If you think about it, for this step size the algorithm will actually find the optimal solution in just one step: $x_1 = x_0 - \nabla f(x_0) = x_0 - x_0 = 0$.
x0 = np.random.normal(0, 1, (1000))
_, x1 = gradient_descent(x0, [1.0], quadratic_gradient)
Indeed, it does:
(x1 == 0).all()
True
Let's see what happens if we don't have the right learning rate.
xs = gradient_descent(x0, [0.1]*50, quadratic_gradient)
error_plot([quadratic(x) for x in xs])
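With step size $0.1$, each update is $x_{t+1} = x_t - 0.1\,x_t = 0.9\,x_t$, so the error decays geometrically: $f(x_t) = 0.81^t f(x_0)$, a straight line on the log-scale plot. A quick check (this cell is an addition, not part of the original notebook):

# Each iterate is 0.9 times the previous one, so
# f(x_t) = 0.5 * ||0.9^t x_0||^2 = 0.81^t * f(x_0).
predicted = [0.81**t * quadratic(x0) for t in range(len(xs))]
print(np.allclose([quadratic(x) for x in xs], predicted))  # True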
Let's say we want to optimize the function inside some affine subspace. Recall that affine subspaces are convex sets. Below we pick a random low-dimensional affine subspace $b + \im U$ and define the corresponding projection operator.
# U is an orthonormal basis of a random 100-dimensional subspace.
U = np.linalg.qr(np.random.normal(0, 1, (1000, 100)))[0]
b = np.random.normal(0, 1, 1000)
def proj(x):
    """Projection of x onto the affine subspace b + im(U)."""
    # Shift by b, project onto the column span of U, shift back.
    return b + U.dot(U.T.dot(x - b))
x0 = np.random.normal(0, 1, (1000))
xs = gradient_descent(x0, [0.1]*50, quadratic_gradient, proj)
# The minimizer of 0.5*||x||^2 over the affine set is its closest
# point to the origin, i.e. the projection of 0.
x_opt = proj(0)
error_plot([quadratic(x) for x in xs])
plt.plot(range(len(xs)), [quadratic(x_opt)]*len(xs),
         label=r'$\frac{1}{2}|\!|x_{\mathrm{opt}}|\!|^2$')
plt.legend()
The orange line shows the optimal error, which the algorithm reaches quickly. The iterates also converge to the optimal solution in the domain, as the following plot shows.
error_plot([np.linalg.norm(x_opt-x)**2 for x in xs])