Gradient Descent

In [1]:
using PyPlot
using NtToolBox

$\newcommand{\dotp}[2]{\langle #1, #2 \rangle}$ $\newcommand{\umin}[1]{\underset{#1}{\min}\;}$ $\newcommand{\RR}{\mathbb{R}}$ $\newcommand{\pd}[2]{ \frac{ \partial #1}{\partial #2} }$ $\newcommand{\norm}[1]{\|#1\|}$ $\newcommand{\pa}[1]{\left(#1\right)}$

We consider the problem of finding a minimizer of a convex smooth function $\;f : \RR^d \rightarrow \RR$; that is, we want to solve $$\umin{x \in \RR^d} f(x)$$

Note that the minimizer is not necessarily unique.

The simplest method is gradient descent, which iteratively computes $$ x^{(k+1)} = x^{(k)} - \tau \nabla f(x^{(k)}), $$ where $\nabla f(x) \in \RR^d$ is the gradient of $f$ at $x$, $x^{(0)} \in \RR^d$ is an arbitrary initial point, and the stepsize $\tau$ must satisfy $$0<\tau<2/\beta$$ for convergence, where $\beta$ is a Lipschitz constant of $\nabla f$; that is, $$ \forall (x,x') \in \RR^d \times \RR^d, \quad \norm{\nabla f(x) - \nabla f(x')} \leq \beta \norm{x-x'}. $$

For instance, if $f$ is of class $C^2$, $$\beta= \sup_x \norm{Hf(x)},$$ where $Hf(x) \in \RR^{d \times d}$ is the Hessian of $f$ at $x$ and $\norm{\cdot}$ is the spectral operator norm (largest eigenvalue).
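For the quadratic example studied below, this bound is easy to check numerically: the Hessian is constant, and its spectral norm (largest eigenvalue) gives the Lipschitz constant of the gradient. A quick sketch in Python/NumPy (used here only for illustration, since the notebook is in Julia), with the anisotropy $\eta = 8$ used later:

```python
import numpy as np

eta = 8.0
# Hessian of f(x) = (x1^2 + eta*x2^2)/2 is the constant matrix diag(1, eta)
H = np.array([[1.0, 0.0],
              [0.0, eta]])
# spectral norm of a symmetric positive matrix = largest eigenvalue
beta = np.linalg.norm(H, 2)
print(beta)  # max(1, eta) = 8.0
```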

Gradient descent in 2-D

We consider a simple problem, corresponding to the minimization of a 2-D ($d=2$) quadratic form $$ f(x) = \frac{1}{2} \pa{ x_1^2 + \eta x_2^2 } ,$$ where $\eta>0$ controls the anisotropy, and hence the difficulty, of the problem.

We can note that minimizing a strongly convex quadratic function is equivalent to solving a positive definite linear system. Gradient descent is then equivalent to the Richardson iteration applied to this linear system.
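To make this equivalence concrete: for $f(x) = \frac{1}{2} x^\top A x - b^\top x$ with $A$ positive definite, we have $\nabla f(x) = Ax - b$, so the gradient step $x - \tau \nabla f(x)$ is exactly the Richardson iteration $x \leftarrow x + \tau (b - A x)$, which converges to the solution of $Ax = b$. A minimal sketch in Python/NumPy (for illustration only; the matrix and right-hand side below are arbitrary choices):

```python
import numpy as np

A = np.array([[1.0, 0.0], [0.0, 8.0]])   # positive definite
b = np.array([1.0, 2.0])
tau = 1.8 / 8.0                           # 0 < tau < 2/beta, beta = 8
x = np.zeros(2)
for _ in range(200):
    x = x + tau * (b - A @ x)             # same as x - tau * grad f(x)
print(x)  # converges to the solution of A x = b, i.e. [1.0, 0.25]
```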

We define the anisotropy parameter $\eta$:

In [2]:
eta = 8;

We define the function $f$:

In [3]:
f = x -> (x[1]^2 + eta*x[2]^2) / 2;

Note: a 'lambda' (anonymous) function is a one-line function definition; in Matlab, this would be `f = @(x) (x(1)^2 + eta*x(2)^2)/2;`

An example of function evaluation:

In [4]:
f([2,3])
Out[4]:
38.0

We display the function using a contourplot:

In [5]:
include("NtToolBox/src/ndgrid.jl")
t = linspace(-.7,.7,101)
(u, v) = meshgrid(t,t)
F = (u.^2 + eta.*v.^2) ./ 2
contourf(t, t, F, 35);

We define the gradient of $f$:

In [6]:
function Grad_f(x)
    return [x[1]; eta*x[2]]
end;

An example of evaluation:

In [7]:
Grad_f([1,2]);

Since $f$ is quadratic, its Hessian is the constant matrix $\left(\begin{array}{cc}1&0\\0&\eta\end{array}\right)$. Its spectral norm, which is the Lipschitz constant of Grad_f, is $\beta=\max(1,\eta)$.

The stepsize $\tau$ must satisfy $0< \tau < 2/\beta$.
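The role of this bound can be checked empirically: with a stepsize inside $(0, 2/\beta)$ the iterates converge to the minimizer, while a stepsize above $2/\beta$ makes them blow up. A small sketch in Python/NumPy (for illustration; the value $\tau = 0.3 > 2/8 = 0.25$ is an arbitrary choice outside the bound):

```python
import numpy as np

eta = 8.0
grad_f = lambda x: np.array([x[0], eta * x[1]])

def run(tau, nbiter=50):
    x = np.array([1.0, 1.0])
    for _ in range(nbiter):
        x = x - tau * grad_f(x)
    return x

beta = max(1.0, eta)
x_good = run(1.8 / beta)   # tau = 0.225 < 2/beta: converges to 0
x_bad  = run(0.3)          # tau = 0.3 > 2/beta = 0.25: diverges
print(np.linalg.norm(x_good), np.linalg.norm(x_bad))
```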

In [8]:
tau = 1.8/max(eta,1); tau
Out[8]:
0.225

Now we implement the gradient descent method: given the initial estimate $x^{(0)}$ of the solution, the stepsize $\tau$ and the number $k$ of iterations, we compute $x^{(k)}$:

In [9]:
nbiter = 10
x = [1, 1]  # initial estimate of the solution
for iter in 1:nbiter  # iter goes from 1 to nbiter
    x = x - tau.*Grad_f(x)
end
x   # displays x, like in Matlab; use print(x) if this is not the last expression of the cell, otherwise nothing is shown
Out[9]:
2-element Array{Float64,1}:
 0.0781658
 0.107374 
In [10]:
f(x)
Out[10]:
0.049171809817584886

We encapsulate the code in a function called GradDescent.

In [11]:
function GradDescent(Grad_f, x0, nbiter, tau)
    x = x0
    for iter in 1:nbiter  # iter goes from 1 to nbiter
        x = x - tau.*Grad_f(x)   # x has type 'array'. Like in C, one can also write x-=tau*Grad_f(x)
    end
    return x
end;
In [12]:
GradDescent(Grad_f, [1, 1], 10, tau)
Out[12]:
2-element Array{Float64,1}:
 0.0781658
 0.107374 

We define a function called GradDescentArray which stores all the iterates as the rows of a 'matrix' (a 2-D array in fact): at every iteration, the new iterate is appended as a row by vertical concatenation, `[xlist; ...]` (note that on arrays, `+` is the classical elementwise addition, not concatenation).

In [13]:
function GradDescentArray(Grad_f, x0, nbiter, tau)
    x = x0
    xlist = [x0[1] x0[2]]  
    for iter in 1:nbiter
        x = x - tau*Grad_f(x)     
        xlist = [xlist; [x[1] x[2]]]
    end
    return xlist
end;
In [14]:
xarray = GradDescentArray(Grad_f,[0.6,0.6],10,0.225)
Out[14]:
11×2 Array{Float64,2}:
 0.6         0.6      
 0.465      -0.48     
 0.360375    0.384    
 0.279291   -0.3072   
 0.21645     0.24576  
 0.167749   -0.196608 
 0.130005    0.157286 
 0.100754   -0.125829 
 0.0780845   0.100663 
 0.0605155  -0.0805306
 0.0468995   0.0644245

We can access the elements of an array like in Matlab. Be careful: unlike in C or Python, the first element has index 1! Some examples:

In [15]:
xarray[2, 1]
Out[15]:
0.46499999999999997
In [16]:
xarray[2, :]
Out[16]:
2-element Array{Float64,1}:
  0.465
 -0.48 
In [17]:
xarray[:, 1]
Out[17]:
11-element Array{Float64,1}:
 0.6      
 0.465    
 0.360375 
 0.279291 
 0.21645  
 0.167749 
 0.130005 
 0.100754 
 0.0780845
 0.0605155
 0.0468995

We can apply the function $f$ to every iterate $x^{(k)}$ stored as a row of `xarray`, using the command `mapslices`:

In [18]:
mapslices(f, xarray, 2)
Out[18]:
11×1 Array{Float64,2}:
 1.62     
 1.02971  
 0.654759 
 0.416489 
 0.265017 
 0.168689 
 0.107407 
 0.0684076
 0.043581 
 0.0277718
 0.0177019

We plot the cost function $f(x^{(k)})$ as a function of $k$, in log-scale:

In [19]:
plot(0:size(xarray)[1] - 1, log10(mapslices(f, xarray, 2)), "o-");

We remark that because the function $f$ is strongly convex, the convergence is linear. Also, we can note that gradient descent is monotonic: $f(x^{(k)})$ decreases at every iteration.
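The linear rate can be made explicit for this quadratic: each gradient step multiplies the coordinates by $1-\tau$ and $1-\tau\eta$ respectively, so the cost eventually decays by the factor $\rho^2$ per iteration, where $\rho = \max(|1-\tau|, |1-\tau\eta|)$. A quick check in Python/NumPy (for illustration; same $\eta$, $\tau$ and starting point as above):

```python
import numpy as np

eta, tau = 8.0, 0.225
f = lambda x: (x[0]**2 + eta * x[1]**2) / 2
grad_f = lambda x: np.array([x[0], eta * x[1]])

x = np.array([0.6, 0.6])
costs = [f(x)]
for _ in range(30):
    x = x - tau * grad_f(x)
    costs.append(f(x))

# per-coordinate contraction factors are 1-tau = 0.775 and 1-tau*eta = -0.8
rho = max(abs(1 - tau), abs(1 - tau * eta))   # = 0.8 here
print(costs[-1] / costs[-2])  # approaches rho**2 = 0.64
```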

We plot the iterates above the contourplot of $f$:

In [20]:
contourf(t, t, F, 35)
xarray = GradDescentArray(Grad_f, [0.6, 0.6], 30, 0.225)
plot(xarray[:, 1], xarray[:, 2], "w.-")
Out[20]:
1-element Array{Any,1}:
 PyObject <matplotlib.lines.Line2D object at 0x000000000A23D828>

__Exercise: re-run the previous cells with different values of tau (try 0.25,0.225,0.2,0.125,0.05).__

__Also try different initial estimates.__

__Also try different values of the anisotropy eta__.

Gradient descent in large dimension: image restoration

Local differential operators like the discrete gradient are fundamental for variational image processing.

We load the image Lena in the array $x^\sharp$ (we choose the symbol $\sharp$, 'sharp', because the image is clean, not degraded).

In [21]:
name = "NtToolBox/src/data/lena.png"
xsharp = load_image(name)*256
println(size(xsharp))
println("The size of the image is $(size(xsharp,1)) x $(size(xsharp,2)).")
println("The range of the pixel values is [$(minimum(xsharp)), $(maximum(xsharp))].") # load_image scales the values to [0,1]; the multiplication by 256 above brings them to [0,256].
(512,512)
The size of the image is 512 x 512.
The range of the pixel values is [0.0, 256.0].

We display the image.

In [24]:
imageplot(xsharp)
set_cmap("gray")
colorbar()       # displays the color bar close to the image
subplots_adjust(top = 0.75)
title("This is Lena");

We define the 'discrete gradient' $D$, the linear operator that maps an image to an image whose values are pairs of vertical and horizontal finite differences:

$$(D x)_{n_1,n_2} = \big((D x)_{n_1,n_2,v},(D x)_{n_1,n_2,h}\big)=( x_{n_1+1,n_2}-x_{n_1,n_2}, x_{n_1,n_2+1}-x_{n_1,n_2} ) \in \RR^2,$$ where $n_1$ and $n_2$ are the row and column numbers, respectively (the origin $n_1=n_2=0$ corresponds to the pixel at the top left corner) and where Neumann boundary conditions are assumed: a difference across the boundary is zero.

Let us define the operator $D$ as a function:

In [25]:
function D(x)
    vdiff = [diff(x,1) ; zeros(1,size(x,2))] # concatenates along the rows
    hdiff = [diff(x,2) zeros(size(x,1),1)]   # concatenates along the columns
    return cat(3, vdiff, hdiff) # combination along a third dimension
end;
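As a sanity check, the same construction can be tried on a tiny $2 \times 2$ example; here is a sketch in Python/NumPy (for illustration only; the input matrix is an arbitrary choice, and note that `diff` and the zero padding play the same roles as in the Julia function above):

```python
import numpy as np

def D(x):
    # vertical differences x[n1+1,n2] - x[n1,n2], zero on the last row
    v = np.vstack([np.diff(x, axis=0), np.zeros((1, x.shape[1]))])
    # horizontal differences x[n1,n2+1] - x[n1,n2], zero on the last column
    h = np.hstack([np.diff(x, axis=1), np.zeros((x.shape[0], 1))])
    return np.stack([v, h], axis=2)  # combine along a third dimension

x = np.array([[1.0, 2.0],
              [4.0, 8.0]])
g = D(x)
print(g[:, :, 0])  # vertical:   [[3, 6], [0, 0]]
print(g[:, :, 1])  # horizontal: [[1, 0], [4, 0]]
```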
In [26]:
v = D(xsharp);

We display the two components as two grayscale images.

In [27]:
fig = figure(figsize=(16,7))
suptitle("The two images of vertical and horizontal finite differences")
subplot(1,2,1)
imageplot(v[:, :, 1])
title("Image of vertical finite differences")
set_cmap("gray")
colorbar()

subplot(1,2,2)
imageplot(v[:, :, 2])
colorbar()
title("Image of horizontal finite differences")
set_cmap("gray")

Let us display the image of the magnitudes $\norm{(D x)_{n_1,n_2}}$, which are large near edges.

In [28]:
imshow(sqrt(sum(v.^2, 3)[:, :]))
colorbar()      
subplots_adjust(top = 0.75);