Description: Introduction to automatic differentiation and the DualNumbers
package. Prepared for MIT course 15.S60 "Software Tools for Operations Research", January 2015.
Author: Miles Lubin
License:
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
Consider a general constrained nonlinear optimization problem: $$ \begin{align} \min \quad&f(x)\\ \text{s.t.} \quad& g(x) = 0, \\ & h(x) \leq 0. \end{align} $$ where $f : \mathbb{R}^n \to \mathbb{R}, g : \mathbb{R}^n \to \mathbb{R}^r$, and $h: \mathbb{R}^n \to \mathbb{R}^s$.
When $f$ and $h$ are convex and $g$ is affine, we can hope for a globally optimal solution; otherwise, we can typically only ask for a locally optimal solution.
What approaches can we use to solve this?
This is not meant to be an exhaustive list; see http://plato.asu.edu/sub/nlores.html#general and http://www.neos-guide.org/content/nonlinear-programming for more details.
Consider numbers of the form $x + y\epsilon$ with $x,y \in \mathbb{R}$. We define $\epsilon^2 = 0$, so $$ (x_1 + y_1\epsilon)(x_2+y_2\epsilon) = x_1x_2 + (x_1y_2 + x_2y_1)\epsilon. $$ These are called the dual numbers. Think of $\epsilon$ as an infinitesimal perturbation (you've probably seen hand-wavy algebra using $(dx)^2 = 0$ when computing integrals - this is the same idea).
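For example, $(2+3\epsilon)(4+5\epsilon) = 8 + (2\cdot 5 + 4\cdot 3)\epsilon + 15\epsilon^2 = 8 + 22\epsilon$, since the $\epsilon^2$ term vanishes.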
If we are given an infinitely differentiable function in Taylor expanded form $$ f(x) = \sum_{k=0}^{\infty} \frac{f^{(k)}(a)}{k!} (x-a)^k $$ it follows that $$ f(x+y\epsilon) = \sum_{k=0}^{\infty} \frac{f^{(k)}(a)}{k!} (x-a+y\epsilon)^k = \sum_{k=0}^{\infty} \frac{f^{(k)}(a)}{k!} (x-a)^k + y\epsilon\sum_{k=0}^{\infty} \frac{f^{(k)}(a)}{k!}\binom{k}{1} (x-a)^{k-1} = f(x) + yf'(x)\epsilon $$
Let's unpack what's going on here. We started with a function $f : \mathbb{R} \to \mathbb{R}$. Dual numbers are not real numbers, so a priori it doesn't even make sense to ask for the value $f(x+y\epsilon)$ given $x+y\epsilon \in \mathbb{D}$ (the set of dual numbers). But we plugged the dual number into the Taylor expansion anyway, and by using the algebra rule $\epsilon^2 = 0$ we found that $f(x+y\epsilon)$ must be equal to $f(x) + yf'(x)\epsilon$ if we use the Taylor expansion as the definition of $f : \mathbb{D} \to \mathbb{D}$.
Alternatively, for any once differentiable function $f : \mathbb{R} \to \mathbb{R}$, we can define its extension to the dual numbers as $$ f(x+y\epsilon) = f(x) + yf'(x)\epsilon. $$ This is essentially equivalent to the previous definition.
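As a quick check that this definition agrees with the algebra rule $\epsilon^2 = 0$: take $f(x) = x^2$. The definition gives $f(3+\epsilon) = 9 + 6\epsilon$, and expanding directly gives $(3+\epsilon)^2 = 9 + 6\epsilon + \epsilon^2 = 9 + 6\epsilon$.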
Let's verify a very basic property, the chain rule, using this definition.
Suppose $h(x) = f(g(x))$. Then, $$ h(x+y\epsilon) = f(g(x+y\epsilon)) = f(g(x) + yg'(x)\epsilon) = f(g(x)) + yg'(x)f'(g(x))\epsilon = h(x) + yh'(x)\epsilon. $$
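As a concrete instance, take $g(x) = x^2$ and $f = \sin$, so $h(x) = \sin(x^2)$. Then $h(x+\epsilon) = \sin((x+\epsilon)^2) = \sin(x^2 + 2x\epsilon) = \sin(x^2) + 2x\cos(x^2)\epsilon$, and indeed $h'(x) = 2x\cos(x^2)$.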
Maybe that's not too surprising, but it's actually quite a useful observation.
Dual numbers are implemented in the DualNumbers package in Julia.
using DualNumbers
You construct $x + y\epsilon$ with Dual(x,y). The real and epsilon components are accessed as realpart(d) and epsilon(d):
d = Dual(2.0,1.0)
2.0 + 1.0ɛ
typeof(d)
DualNumbers.Dual{Float64}
realpart(d)
2.0
epsilon(d)
1.0
How is addition of dual numbers defined?
@which d+Dual(3.0,4.0)
Clicking on the link, we'll see:
+(z::Dual, w::Dual) = dual(real(z)+real(w), epsilon(z)+epsilon(w))
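For example (a check added here for illustration), adding the two dual numbers above combines the components separately:
d + Dual(3.0,4.0)  # should give 5.0 + 5.0ɛ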
Multiplication?
Dual(2.0,2.0)*Dual(3.0,4.0)
6.0 + 14.0ɛ
@which Dual(2.0,2.0)*Dual(3.0,4.0)
The code is:
*(z::Dual, w::Dual) = dual(real(z)*real(w), epsilon(z)*real(w)+real(z)*epsilon(w))
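The ɛ component is just the product rule: if $z = x_1 + y_1\epsilon$ and $w = x_2 + y_2\epsilon$ encode values $x_i$ and derivatives $y_i$, the derivative of the product is $y_1 x_2 + x_1 y_2$; in the example above, $14.0 = 2.0\cdot 3.0 + 2.0\cdot 4.0$.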
Basic univariate functions?
log(Dual(2.0,1.0))
0.6931471805599453 + 0.5ɛ
1/2.0  # the exact derivative: log'(2.0) = 1/2
0.5
How is this implemented?
@code_lowered log(Dual(2.0,1.0))
LambdaInfo template for log(z::DualNumbers.Dual) at /home/mlubin/.julia/v0.5/DualNumbers/src/dual.jl:273
:(begin
    nothing
    x = (DualNumbers.value)(z) # line 274:
    xp = (DualNumbers.epsilon)(z) # line 275:
    return (DualNumbers.Dual)((DualNumbers.log)(x), xp * (1 / x))
end)
Trig functions?
@code_lowered sin(Dual(2.0,1.0))
LambdaInfo template for sin(z::DualNumbers.Dual) at /home/mlubin/.julia/v0.5/DualNumbers/src/dual.jl:273
:(begin
    nothing
    x = (DualNumbers.value)(z) # line 274:
    xp = (DualNumbers.epsilon)(z) # line 275:
    return (DualNumbers.Dual)((DualNumbers.sin)(x), xp * (DualNumbers.cos)(x))
end)
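As a quick sanity check (added here; not in the original notebook), the ɛ component should match the exact derivative $\cos(2.0)$:
epsilon(sin(Dual(2.0,1.0))) - cos(2.0)  # should be 0.0 (up to rounding)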
We can define a function in Julia as:
f(x) = x^2 - log(x)
# Or equivalently
function f(x)
return x^2 - log(x)
end
f (generic function with 1 method)
f(Dual(1/sqrt(2),1))
0.8465735902799727 - 2.220446049250313e-16ɛ
[Exercise]: Differentiate it!
- Evaluate $f$ at $1 + \epsilon$. What are $f(1)$ and $f'(1)$?
- Evaluate $f$ at $\frac{1}{\sqrt{2}} + \epsilon$. What are $f(\frac{1}{\sqrt{2}})$ and $f'(\frac{1}{\sqrt{2}})$?
- Define a new function fprime which returns the derivative of f by using DualNumbers. (One possible sketch appears after this list.)
- Use the finite difference formula $$ f'(x) \approx \frac{f(x+h)-f(x)}{h} $$ to evaluate $f'(\frac{1}{\sqrt{2}})$ approximately using a range of values of $h$. Visualize the approximation error using @manipulate, plots, or both!
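One possible sketch for the fprime item (an illustration assuming the f defined above; fprime is a name introduced here):
fprime(x) = epsilon(f(Dual(x, 1.0)))
fprime(1/sqrt(2))  # should be approximately 0, matching the ɛ component seen above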
Recall Newton's iterative method for finding zeros: $$ x \leftarrow x - \frac{f(x)}{f'(x)} $$ until $f(x) \approx 0$.
Let's use this method to compute $\sqrt{x}$ by solving $f(z) = 0$ where $f(z) = z^2-x$. So $f'(z) = 2z$, and we can implement the algorithm as:
function squareroot(x)
z = x # Initial starting point
while abs(z*z - x) > 1e-13
z = z - (z*z-x)/(2z)
end
return z
end
squareroot (generic function with 1 method)
squareroot(100)
10.0
Can we differentiate this code? Yes!
d = squareroot(Dual(100.0,1.0))
10.0 + 0.049999999999999996ɛ
epsilon(d) # Computed derivative
0.049999999999999996
1/(2*sqrt(100)) # The exact derivative
0.05
abs(epsilon(d)-1/(2*sqrt(100)))
6.938893903907228e-18
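What happened is that every arithmetic operation inside the while loop was performed on dual numbers, so the ɛ component carried the derivative of each Newton iterate with respect to the input; the final ɛ component is the derivative of the value that squareroot actually computes, which closely approximates $\frac{d}{dx}\sqrt{x} = \frac{1}{2\sqrt{x}}$.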
Dual numbers can be used to compute the gradient of a function $f: \mathbb{R}^n \to \mathbb{R}$. This requires $n$ evaluations of $f$ with dual number input, essentially computing the partial derivative in each of the $n$ dimensions. We won't get into the details, but this procedure is implemented in Optim with the autodiff=true keyword.
using Optim
rosenbrock(x) = (1.0 - x[1])^2 + 100.0 * (x[2] - x[1]^2)^2
optimize(rosenbrock, [0.0, 0.0], method = :l_bfgs, autodiff = true)
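To make the $n$-evaluation procedure above concrete, here is a minimal sketch (not Optim's actual implementation; dual_gradient is a name introduced here) that seeds the ɛ component of one coordinate at a time:
# Sketch: gradient of f at x via n dual-number evaluations
function dual_gradient(f, x)
    n = length(x)
    g = zeros(n)
    for i in 1:n
        # Perturb only the i-th coordinate by an infinitesimal ɛ
        xdual = [Dual(x[j], j == i ? 1.0 : 0.0) for j in 1:n]
        g[i] = epsilon(f(xdual))
    end
    return g
end

dual_gradient(rosenbrock, [0.0, 0.0])  # should be approximately [-2.0, 0.0]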
When $n$ is large, there's an alternative procedure called reverse-mode automatic differentiation which requires only $O(1)$ evaluations of $f$ to compute its gradient. This is the method used internally by JuMP (implemented in ReverseDiffSparse).
This was just an introduction to one technique from the area of automatic differentiation. For more references, see autodiff.org.