Noise tolerance of accelerated gradient descent

Here we compare the noise tolerance of basic gradient descent with that of accelerated gradient descent for smooth convex optimization. We will see that errors in the gradient accumulate for accelerated gradient descent, while they do not accumulate for basic gradient descent. This illustrates that for noisy convex optimization, basic gradient descent is preferable to accelerated gradient descent despite its slower convergence rate in the noise-free setting.
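
As a minimal illustration of why gradient noise need not accumulate for plain gradient descent, consider the one-dimensional quadratic $f(x) = \frac{\beta}{2}x^2$. With step size $1/\beta$ and a noisy gradient $\beta x_t + \xi_t$, one step gives

$$
x_{t+1} = x_t - \frac{1}{\beta}\left(\beta x_t + \xi_t\right) = -\frac{\xi_t}{\beta},
$$

so the error at every iteration is a single rescaled noise term and never builds up. Accelerated gradient descent, by contrast, mixes the previous iterate back in through its momentum term, which is one way to see how past noise can keep re-entering the updates. Recall also the standard noise-free rates in objective value: $O(\beta\|x_0 - x^*\|^2/t)$ for gradient descent versus $O(\beta\|x_0 - x^*\|^2/t^2)$ for accelerated gradient descent.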

In [1]:
import matplotlib.pyplot as plt
import numpy as np

def gd(x0,gradient,smoothness=1,n_iterations=100):
    """ Gradient descent for a smooth convex function """
    x = x0
    xs = [x0]
    for t in range(n_iterations):
        # standard step size 1/beta for a beta-smooth function
        x = x - (1.0/smoothness)*gradient(x)
        xs.append(x)
    return xs

def accgd(x0,gradient,smoothness=1,n_iterations=100):
    """ Accelerated gradient descent for a smooth convex function
        See: http://blogs.princeton.edu/imabandit/2013/04/01/acceleratedgradientdescent/
    """
    x = x0
    y = x0
    xs = [x0]
    a = 0
    for t in range(n_iterations):
        # lambda_t and lambda_{t+1} from the recursion lambda_t = (1 + sqrt(1 + 4*lambda_{t-1}^2))/2
        a = 0.5 * (1.0 + np.sqrt(1+4.0*a**2))
        a2 = 0.5 * (1.0 + np.sqrt(1+4.0*a**2))
        # momentum coefficient gamma_t = (1 - lambda_t)/lambda_{t+1}
        gamma = (1.0-a)/a2
        # gradient step, then a momentum step that mixes in the previous iterate
        y2 = x - (1.0/smoothness) * gradient(x)
        x = (1.0-gamma)*y2 + gamma*y
        y = y2
        xs.append(x)
    return xs
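
A minimal usage sketch: both routines take a starting point, a gradient oracle, a smoothness constant, and an iteration count, and return the list of iterates. On the one-dimensional quadratic f(x) = x^2/2, whose gradient is x and whose smoothness constant is 1, both should converge to the minimizer 0.

In [ ]:
def quad_grad(x):
    # gradient of f(x) = x^2/2
    return x

print(gd(2.0, quad_grad, smoothness=1, n_iterations=5)[-1])
print(accgd(2.0, quad_grad, smoothness=1, n_iterations=5)[-1])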

Laplacian of a cycle

Below, the matrix P is the tridiagonal matrix with 2 on the diagonal and -1 on the off-diagonals, i.e. the Laplacian of a cycle with one vertex grounded (equivalently, the Laplacian of a path graph with unit boundary terms at the endpoints). We minimize the quadratic f(x) = 0.5 xᵀPx - bᵀx, whose minimizer solves the linear system Px = b. This quadratic is the classic hard instance used to show that the convergence rate of accelerated gradient descent is tight.
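
For concreteness, with n = 5 the construction in the next cell gives the tridiagonal matrix

$$
P = \begin{pmatrix}
2 & -1 & 0 & 0 & 0 \\
-1 & 2 & -1 & 0 & 0 \\
0 & -1 & 2 & -1 & 0 \\
0 & 0 & -1 & 2 & -1 \\
0 & 0 & 0 & -1 & 2
\end{pmatrix}.
$$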

In [2]:
n = 100
# adjacency matrix of the path 0 - 1 - ... - (n-1)
A = np.zeros((n,n))
for i in range(1,n-1):
    A[i,i+1] = 1
    A[i,i-1] = 1
A[0,1] = 1
A[n-1,n-2] = 1
# tridiagonal matrix with 2 on the diagonal and -1 off the diagonal
P = 2.0*np.eye(n) - A
b = np.zeros(n)
b[0] = 1
# exact minimizer via the pseudoinverse
opt = np.dot(np.linalg.pinv(P),b)

def path(x):
    """ Quadratic objective f(x) = 0.5*x'Px - b'x, minimized where Px = b """
    return 0.5*np.dot(x,np.dot(P,x)) - np.dot(x,b)

def pathgrad(x):
    """ Gradient of the quadratic objective: Px - b """
    return np.dot(P,x) - b

its = 5000

# smoothness constant 4: the largest eigenvalue of P is at most 4 (Gershgorin)
xs3 = gd(np.zeros(n),pathgrad,4,its)
ys3 = [ abs(path(xs3[i])-path(opt)) for i in range(its) ]
xs3acc = accgd(np.zeros(n),pathgrad,4,its)
ys3acc = [ abs(path(xs3acc[i])-path(opt)) for i in range(its) ]

plt.yscale('log')
plt.ylim(top=0.1)   # leave the lower limit automatic; 0 is not valid on a log axis
plt.plot(range(its),ys3,label='gradient descent')
plt.plot(range(its),ys3acc,label='accelerated gradient descent')
plt.legend()
[Figure: objective error |f(x_t) - f(opt)| versus iteration on a log scale; accelerated gradient descent converges much faster than plain gradient descent on this noise-free instance.]

Noisy cycle

We repeat the same example, but now we add Gaussian noise to the gradient. The plot illustrates that accelerated gradient descent diverges, while basic gradient descent converges essentially to the best possible answer given the noise in the gradient.

In [3]:
def noisygrad(x):
    """ Exact gradient plus independent Gaussian noise (std 0.1) in every coordinate """
    return np.dot(P,x) - b + np.random.normal(0,0.1,n)

its = 500

xs3 = gd(np.zeros(n),noisygrad,4,its)
ys3 = [ abs(path(xs3[i])-path(opt)) for i in range(its) ]
xs3acc = accgd(np.zeros(n),noisygrad,4,its)
ys3acc = [ abs(path(xs3acc[i])-path(opt)) for i in range(its) ]

plt.yscale('linear')
plt.plot(range(its),ys3,label='gradient descent')
plt.plot(range(its),ys3acc,label='accelerated gradient descent')
plt.legend()
[Figure: objective error |f(x_t) - f(opt)| versus iteration with noisy gradients; the error of accelerated gradient descent grows over time, while plain gradient descent settles near a small noise floor.]
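
As a rough check of the claim that plain gradient descent reaches roughly the best accuracy the noise allows, one can rerun it with a few noise levels and compare the final objective gaps. A minimal sketch, reusing gd, path, P, b, and opt from the cells above:

In [ ]:
# Sketch: final objective gap of plain gradient descent for several noise levels.
# If the noise sets the accuracy floor, the gap should shrink with sigma.
for sigma in [0.2, 0.1, 0.05]:
    def grad_sigma(x, s=sigma):
        return np.dot(P, x) - b + np.random.normal(0, s, n)
    xs = gd(np.zeros(n), grad_sigma, 4, 500)
    print(sigma, abs(path(xs[-1]) - path(opt)))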