Description: Introduction to automatic differentiation and the DualNumbers
package. Prepared for MIT course 15.S60 "Software Tools for Operations Research", January 2015.
Author: Miles Lubin
License:
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
Consider a general constrained nonlinear optimization problem: $$ \begin{align} \min \quad&f(x)\\ \text{s.t.} \quad& g(x) = 0, \\ & h(x) \leq 0. \end{align} $$ where $f : \mathbb{R}^n \to \mathbb{R}, g : \mathbb{R}^n \to \mathbb{R}^r$, and $h: \mathbb{R}^n \to \mathbb{R}^s$.
When $f$ and $h$ are convex and $g$ is affine, we can hope for a globally optimal solution; otherwise, we can typically only ask for a locally optimal solution.
What approaches can we use to solve this?
This is not meant to be an exhaustive list; see http://plato.asu.edu/sub/nlores.html#general and http://www.neos-guide.org/content/nonlinear-programming for more details.
Consider numbers of the form $x + y\epsilon$ with $x,y \in \mathbb{R}$. We define $\epsilon^2 = 0$, so $$ (x_1 + y_1\epsilon)(x_2+y_2\epsilon) = x_1x_2 + (x_1y_2 + x_2y_1)\epsilon. $$ These are called the dual numbers. Think of $\epsilon$ as an infinitesimal perturbation (you've probably seen hand-wavy algebra using $(dx)^2 = 0$ when computing integrals - this is the same idea).
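For example, $(2+3\epsilon)(4+5\epsilon) = 8 + (2\cdot 5 + 4\cdot 3)\epsilon + 15\epsilon^2 = 8 + 22\epsilon$, since the $\epsilon^2$ term vanishes.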
If we are given an infinitely differentiable function in Taylor expanded form $$ f(x) = \sum_{k=0}^{\infty} \frac{f^{(k)}(a)}{k!} (x-a)^k $$ it follows that $$ f(x+y\epsilon) = \sum_{k=0}^{\infty} \frac{f^{(k)}(a)}{k!} (x-a+y\epsilon)^k = \sum_{k=0}^{\infty} \frac{f^{(k)}(a)}{k!} (x-a)^k + y\epsilon\sum_{k=0}^{\infty} \frac{f^{(k)}(a)}{k!}\binom{k}{1} (x-a)^{k-1} = f(x) + yf'(x)\epsilon $$
Let's unpack what's going on here. We started with a function $f : \mathbb{R} \to \mathbb{R}$. Dual numbers are not real numbers, so a priori it doesn't even make sense to ask for the value $f(x+y\epsilon)$ given $x+y\epsilon \in \mathbb{D}$ (the set of dual numbers). But we plugged the dual number into the Taylor expansion anyway, and by using the algebra rule $\epsilon^2 = 0$ we found that $f(x+y\epsilon)$ must be equal to $f(x) + yf'(x)\epsilon$ if we use the Taylor expansion as the definition of $f : \mathbb{D} \to \mathbb{D}$.
Alternatively, for any once differentiable function $f : \mathbb{R} \to \mathbb{R}$, we can define its extension to the dual numbers as $$ f(x+y\epsilon) = f(x) + yf'(x)\epsilon. $$ This is essentially equivalent to the previous definition.
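As a quick check that this definition agrees with the algebra rule $\epsilon^2 = 0$: take $f(x) = x^2$. The definition gives $f(3+\epsilon) = 9 + 6\epsilon$, and expanding directly gives $(3+\epsilon)^2 = 9 + 6\epsilon + \epsilon^2 = 9 + 6\epsilon$.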
Let's verify a very basic property, the chain rule, using this definition.
Suppose $h(x) = f(g(x))$. Then, $$ h(x+y\epsilon) = f(g(x+y\epsilon)) = f(g(x) + yg'(x)\epsilon) = f(g(x)) + yg'(x)f'(g(x))\epsilon = h(x) + yh'(x)\epsilon. $$
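As a concrete instance, take $g(x) = x^2$ and $f = \sin$, so $h(x) = \sin(x^2)$. Then $h(x+\epsilon) = \sin((x+\epsilon)^2) = \sin(x^2 + 2x\epsilon) = \sin(x^2) + 2x\cos(x^2)\epsilon$, and indeed $h'(x) = 2x\cos(x^2)$.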
Maybe that's not too surprising, but it's actually quite a useful observation.
Dual numbers are implemented in the DualNumbers package in Julia.
using DualNumbers
You construct $x + y\epsilon$ with Dual(x,y). The real and epsilon components are accessed as realpart(d) and epsilon(d):
d = Dual(2.0,1.0)
2.0 + 1.0ɛ
typeof(d)
DualNumbers.Dual{Float64}
realpart(d)
2.0
epsilon(d)
1.0
How is addition of dual numbers defined?
@which d+Dual(3.0,4.0)
Clicking on the link, we'll see:
+(z::Dual, w::Dual) = dual(real(z)+real(w), epsilon(z)+epsilon(w))
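For example (a check added here for illustration), adding the two dual numbers above combines the components separately:
d + Dual(3.0,4.0)  # should give 5.0 + 5.0ɛ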
Multiplication?
Dual(2.0,2.0)*Dual(3.0,4.0)
6.0 + 14.0ɛ
@which Dual(2.0,2.0)*Dual(3.0,4.0)
The code is:
*(z::Dual, w::Dual) = dual(real(z)*real(w), epsilon(z)*real(w)+real(z)*epsilon(w))
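The ɛ component is just the product rule: if $z = x_1 + y_1\epsilon$ and $w = x_2 + y_2\epsilon$ encode values $x_i$ and derivatives $y_i$, the derivative of the product is $y_1 x_2 + x_1 y_2$; in the example above, $14.0 = 2.0\cdot 3.0 + 2.0\cdot 4.0$.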
Basic univariate functions?
log(Dual(2.0,1.0))
0.6931471805599453 + 0.5ɛ
1/2.0  # the exact derivative: log'(2.0) = 1/2
0.5
How is this implemented?
@code_lowered log(Dual(2.0,1.0))
LambdaInfo template for log(z::DualNumbers.Dual) at /home/mlubin/.julia/v0.5/DualNumbers/src/dual.jl:273
:(begin
    nothing
    x = (DualNumbers.value)(z) # line 274:
    xp = (DualNumbers.epsilon)(z) # line 275:
    return (DualNumbers.Dual)((DualNumbers.log)(x), xp * (1 / x))
end)
Trig functions?
@code_lowered sin(Dual(2.0,1.0))
LambdaInfo template for sin(z::DualNumbers.Dual) at /home/mlubin/.julia/v0.5/DualNumbers/src/dual.jl:273
:(begin
    nothing
    x = (DualNumbers.value)(z) # line 274:
    xp = (DualNumbers.epsilon)(z) # line 275:
    return (DualNumbers.Dual)((DualNumbers.sin)(x), xp * (DualNumbers.cos)(x))
end)
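As a quick sanity check (added here; not in the original notebook), the ɛ component should match the exact derivative $\cos(2.0)$:
epsilon(sin(Dual(2.0,1.0))) - cos(2.0)  # should be 0.0 (up to rounding)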
We can define a function in Julia as:
f(x) = x^2 - log(x)
# Or equivalently
function f(x)
return x^2 - log(x)
end
f (generic function with 1 method)
f(Dual(1/sqrt(2),1))
0.8465735902799727 - 2.220446049250313e-16ɛ
[Exercise]: Differentiate it!
- Evaluate $f$ at $1 + \epsilon$. What are $f(1)$ and $f'(1)$?
- Evaluate $f$ at $\frac{1}{\sqrt{2}} + \epsilon$. What are $f(\frac{1}{\sqrt{2}})$ and $f'(\frac{1}{\sqrt{2}})$?
- Define a new function fprime which returns the derivative of f by using DualNumbers. (One possible sketch appears after this list.)
- Use the finite difference formula $$ f'(x) \approx \frac{f(x+h)-f(x)}{h} $$ to evaluate $f'(\frac{1}{\sqrt{2}})$ approximately using a range of values of $h$. Visualize the approximation error using @manipulate, plots, or both!
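One possible sketch for the fprime item (an illustration assuming the f defined above; fprime is a name introduced here):
fprime(x) = epsilon(f(Dual(x, 1.0)))
fprime(1/sqrt(2))  # should be approximately 0, matching the ɛ component seen above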
Recall Newton's iterative method for finding zeros: $$ x \leftarrow x - \frac{f(x)}{f'(x)} $$ until $f(x) \approx 0$.
Let's use this method to compute $\sqrt{x}$ by solving $f(z) = 0$ where $f(z) = z^2-x$. So $f'(z) = 2z$, and we can implement the algorithm as:
function squareroot(x)
z = x # Initial starting point
while abs(z*z - x) > 1e-13
z = z - (z*z-x)/(2z)
end
return z
end
squareroot (generic function with 1 method)
squareroot(100)
10.0
Can we differentiate this code? Yes!
d = squareroot(Dual(100.0,1.0))
10.0 + 0.049999999999999996ɛ
epsilon(d) # Computed derivative
0.049999999999999996
1/(2*sqrt(100)) # The exact derivative
0.05
abs(epsilon(d)-1/(2*sqrt(100)))
6.938893903907228e-18
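What happened is that every arithmetic operation inside the while loop was performed on dual numbers, so the ɛ component carried the derivative of each Newton iterate with respect to the input; the final ɛ component is the derivative of the value that squareroot actually computes, which closely approximates $\frac{d}{dx}\sqrt{x} = \frac{1}{2\sqrt{x}}$.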
Dual numbers can be used to compute the gradient of a function $f: \mathbb{R}^n \to \mathbb{R}$. This requires $n$ evaluations of $f$ with dual number input, essentially computing the partial derivative in each of the $n$ dimensions. We won't get into the details, but this procedure is implemented in Optim with the autodiff=true keyword.
using Optim
rosenbrock(x) = (1.0 - x[1])^2 + 100.0 * (x[2] - x[1]^2)^2
optimize(rosenbrock, [0.0, 0.0], method = :l_bfgs, autodiff = true)
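To make the $n$-evaluation procedure above concrete, here is a minimal sketch (not Optim's actual implementation; dual_gradient is a name introduced here) that seeds the ɛ component of one coordinate at a time:
# Sketch: gradient of f at x via n dual-number evaluations
function dual_gradient(f, x)
    n = length(x)
    g = zeros(n)
    for i in 1:n
        # Perturb only the i-th coordinate by an infinitesimal ɛ
        xdual = [Dual(x[j], j == i ? 1.0 : 0.0) for j in 1:n]
        g[i] = epsilon(f(xdual))
    end
    return g
end

dual_gradient(rosenbrock, [0.0, 0.0])  # should be approximately [-2.0, 0.0]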
When $n$ is large, there's an alternative procedure called reverse-mode automatic differentiation which requires only $O(1)$ evaluations of $f$ to compute its gradient. This is the method used internally by JuMP (implemented in ReverseDiffSparse).
This was just an introduction to one technique from the area of automatic differentiation. For more references, see autodiff.org.