#!/usr/bin/env python
# coding: utf-8

# # Non-Linear Least Squares
# 
# This notebook covers the theory behind non-linear least squares: where it all comes from and the logic behind the different ways to solve such problems. We start with least squares optimization problems and steepest gradient descent, and then move on to the non-linear case with the Gauss-Newton method to solve it.

# $\newcommand{\x}{\boldsymbol{\mathrm{x}}}$
# $\newcommand{\h}{\boldsymbol{\mathrm{h}}}$
# $\newcommand{\g}{\boldsymbol{\mathrm{g}}}$
# $\newcommand{\J}{\boldsymbol{\mathrm{J}}}$
# $\newcommand{\H}{\boldsymbol{\mathrm{H}}}$
# $\newcommand{\f}{\boldsymbol{\mathrm{f}}}$
# $\newcommand{\F}{\boldsymbol{\mathrm{F}}}$
# $\newcommand{\l}{\boldsymbol{l}}$
# $\newcommand{\L}{\boldsymbol{\mathrm{L}}}$
# $\newcommand{\norm}[1]{\left\lVert#1\right\rVert}$

# ## Least Squares Problem
# We want to find a **local minimizer** $\x^*$ for
# \begin{equation}
# F(\x) = \frac{1}{2}\sum_{i=1}^{m}(f_i(\x))^2,
# \end{equation}
# where $f_i: {\rm I\!R}^n \rightarrow {\rm I\!R}, i = 1, \dots, m$ are given functions, and $m \geq n$.
# 
# We assume that the cost function $F$ is differentiable and smooth enough that the following Taylor expansion holds:
# \begin{equation}
# F(\x + \h) = F(\x) + \h^T\g + \frac{1}{2}\h^T\H\h + O(\norm{\h}^3),
# \end{equation}
# where $\g$ is the *gradient*: $\g \equiv \F^\prime(\x) = \begin{bmatrix} \frac{\partial{F}}{\partial x_1}(\x) \\ \vdots \\ \frac{\partial{F}}{\partial x_n}(\x) \end{bmatrix}$, and $\H$ is the *Hessian*: $\H \equiv \F^{\prime \prime}(\x) = \begin{bmatrix} \frac{\partial^2 F}{\partial x_i \partial x_j}(\x) \end{bmatrix}$.
# 
# ### Necessary condition for a local minimizer
# If $\x^*$ is a local minimizer, then $\x^*$ is a stationary point, i.e., $\g^* \equiv \F^\prime(\x^*) = 0$.
# 
# However, a stationary point can also be a *saddle point*, i.e. a point which is a local minimum in one direction and a local maximum in another direction. To determine whether a given stationary point $\x_s$ is a local minimizer we need to include the second order term in the Taylor series:
# \begin{equation}
# F(\x_s + \h) = F(\x_s) + \frac{1}{2}\h^T \H_s \h + O(\norm{\h}^3),
# \end{equation}
# where $\H_s = F^{\prime \prime}(\x_s)$. From the fact that $F(\x_s + \h)$ must be bigger than $F(\x_s)$ for any sufficiently small $\h \neq 0$ we can formulate:
# 
# ### Sufficient condition for a local minimizer
# If $\x_s$ is a stationary point and $F^{\prime \prime}(\x_s)$ is positive definite, then $\x_s$ is a local minimizer.

# $\newcommand\myeq{\stackrel{\tiny{\mbox{dot product}}}{=}}$
# 
# ## Descent methods
# Now that we know *what* we are searching for, we need to find *how* to search for a solution. We do it by (iteratively) moving in the correct direction towards a local minimizer. Roughly, we do the following:
# 
# - find the direction of descent $\h$.
# - make a step of some size $\alpha$ in that direction.
# - repeat
# 
# We then consider the variation of the function $F$ with the step $\h$:
# \begin{eqnarray}
# F(\x + \alpha \h)
# &=& F(\x) + \alpha \h^T \F^\prime(\x) + O(\alpha^2) \\
# &\simeq& F(\x) + \alpha \h^T \F^\prime(\x)
# \end{eqnarray}
# 
# ### Steepest descent
# How big should the step be? This is an open question. We can say that we perform a step $\alpha \h$, where $\alpha$ is positive. Then the relative gain in function value is:
# 
# \begin{eqnarray}
# \lim_{\alpha \rightarrow 0}\frac{F(\x) - F(\x + \alpha \h)}{\alpha \norm{\h}}
# = \lim_{\alpha \rightarrow 0}\frac{F(\x) - (F(\x) + \alpha \h^T \F^\prime(\x))}{\alpha \norm{\h}}
# = \lim_{\alpha \rightarrow 0}\frac{- \alpha \h^T \F^\prime(\x)}{\alpha \norm{\h}}
# = -\frac{1}{\norm{\h}} \h^T \F^\prime(\x)
# \myeq -\norm{\F^\prime(\x)} \cos{\theta}
# \end{eqnarray}
# 
# Here, $\theta$ is the angle between $\h$ and $\F^\prime(\x)$. Therefore, we see that the biggest decrease in function value is reached when $\theta = \pi$, i.e. $\h = -\F^\prime(\x)$.
# 
# This method has one downside: because the magnitude of the step depends on the magnitude of the derivative, the final convergence (close to a stationary point, where the derivative is zero) is slow, although the method performs well in the initial stage of the iterative process.
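# As a minimal sketch of the idea (the helper `steepest_descent` and the quadratic test problem below are illustrative choices, not part of the derivation above), we repeatedly step along $\h = -\F^\prime(\x)$ with a fixed step size $\alpha$:

# In[ ]:


import numpy as np

def steepest_descent(grad, x0, alpha=0.1, tol=1e-8, max_iter=1000):
    """Minimize F by stepping along h = -F'(x) with a fixed step size alpha."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:  # near a stationary point
            break
        x = x - alpha * g  # step of size alpha in the direction h = -g
    return x

# Example: F(x) = 1/2 x^T A x with positive definite A; the minimizer is the origin.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
x_star = steepest_descent(lambda x: A @ x, x0=[4.0, -2.0])
print(x_star)  # approaches [0, 0]; progress slows as the gradient shrinks

# Note how the step length shrinks together with the gradient, which is exactly the slow final convergence mentioned above.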
# ### Newton's method
# To get good convergence in the final stage, we can look at Newton's method. We derive it from the fact that $\x^*$ is a stationary point, so, writing $\x^* = \x + \h$, we have $\F^\prime(\x^*) = \F^\prime(\x + \h) = 0$. We can furthermore consider its Taylor expansion:
# 
# \begin{eqnarray}
# \F^\prime(\x + \h) &=& \F^\prime(\x) + \F^{\prime \prime}(\x) \h + O(\norm{\h}^2) \\
# &\simeq& \F^\prime(\x) + \F^{\prime \prime}(\x) \h
# \end{eqnarray}
# 
# Setting the left-hand side of this equation to $0$ gives an equation whose solutions yield $\h$: $\H \h = -\F^{\prime}(\x)$.
# 
# Note that Newton's method does not really care whether the direction of optimization is towards a minimum, a maximum or a saddle point. It just guides you towards the "closest" stationary point. So it works best when you know you are already in the vicinity of the minimum you are looking for.
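# A minimal sketch of a Newton iteration (the helper `newton` and the smooth test function below are our own illustrative choices): at each step we solve $\H \h = -\F^\prime(\x)$ for the step $\h$.

# In[ ]:


import numpy as np

def newton(grad, hess, x0, tol=1e-10, max_iter=50):
    """Find a stationary point by repeatedly solving H h = -F'(x)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:  # F'(x) = 0: stationary point reached
            break
        h = np.linalg.solve(hess(x), -g)  # Newton step: H h = -g
        x = x + h
    return x

# Example: F(x) = x_1^4 + x_1 x_2 + (1 + x_2)^2, a smooth non-quadratic function.
grad = lambda x: np.array([4 * x[0]**3 + x[1], x[0] + 2 * (1 + x[1])])
hess = lambda x: np.array([[12 * x[0]**2, 1.0], [1.0, 2.0]])
print(newton(grad, hess, x0=[1.0, 1.0]))  # converges in a handful of iterations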
# ## Non-linear Least Squares
# The methods in this section are more efficient for least squares problems than the general optimization methods above, and they do not need to compute second derivatives. More formally, we want to find $\x^* = \mathrm{argmin}_\x\{F(\x)\}$, where
# \begin{equation}
# F(\x) = \frac{1}{2}\sum_{i = 1}^{m}(f_i(\x))^2 = \frac{1}{2} \f(\x)^T \f(\x)
# \end{equation}
# 
# Provided that $\f$ has continuous second partial derivatives, we can write its Taylor expansion:
# 
# \begin{equation}
# \f(\x + \h) = \f(\x) + \J(\x)\h + O(\norm{\h}^2),
# \end{equation}
# 
# where $\J$ is the *Jacobian* and $\h$ is some step.
# 
# The partial derivatives of $F$ follow directly from the definition at the beginning of this cell:
# 
# \begin{eqnarray}
# \frac{\partial F}{\partial x_j}(\x)
# &=& \frac{\partial (\frac{1}{2}\sum_{i = 1}^{m}(f_i(\x))^2) }{\partial x_j} \\
# &=& \sum_{i = 1}^{m} {f_i(\x) \frac{\partial f_i}{\partial x_j}(\x) }.
# \end{eqnarray}
# 
# Thus the gradient is $\F^\prime(\x) = \J(\x)^T \f(\x)$.
# 
# Following the same logic, we can also compute the Hessian: $\F^{\prime \prime}(\x) = \J(\x)^T \J(\x) + \sum_{i=1}^{m}{f_i(\x) f_i^{\prime \prime}(\x)}$.
# 
# Depending on the underlying functions, such a Hessian can be too expensive to compute. To avoid direct Hessian computation we can use:
# 
# ### The Gauss-Newton Method
# This method is just a non-linear least squares optimization method that **linearizes** $\f$ in the neighborhood of $\x$, i.e. for small $\norm{\h}$, the Taylor expansion of $\f$ is:
# 
# \begin{equation}
# \f(\x + \h) \simeq \l(\h) \equiv \f(\x) + \J(\x)\h
# \end{equation}
# 
# And we can do the same for the function $F$:
# 
# \begin{eqnarray}
# F(\x + \h) \simeq L(\h) &\equiv& \frac{1}{2}\l(\h)^T \l(\h) \\
# &=& \frac{1}{2} \f^T \f + \h^T \J^T \f + \frac{1}{2}\h^T \J^T \J \h \\
# &=& F(\x) + \h^T \J^T \f + \frac{1}{2}\h^T \J^T \J \h
# \end{eqnarray}
# 
# From this it is easy to derive that the gradient and the Hessian of $L$ are:
# 
# \begin{eqnarray}
# \L^\prime(\h) &=& \J^T \f + \J^T \J\h \\
# \L^{\prime \prime}(\h) &=& \J^T \J
# \end{eqnarray}
# 
# Now we can find a minimizer of $L$ (unique whenever $\J$ has full column rank, so that $\J^T \J$ is positive definite) by solving for $\h$:
# 
# \begin{eqnarray}
# (\J^T \J)\h = -\J^T \f
# \end{eqnarray}
# 
# The typical next step is then: $\x \rightarrow \x + \alpha \h$. The classical Gauss-Newton method uses $\alpha = 1$.
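# A minimal sketch of the classical Gauss-Newton iteration (the helper `gauss_newton` and the exponential curve-fitting example below are our own illustrative choices, not prescribed by the text): at each step we solve $(\J^T \J)\h = -\J^T \f$ and update $\x \rightarrow \x + \alpha \h$ with $\alpha = 1$.

# In[ ]:


import numpy as np

def gauss_newton(f, jac, x0, alpha=1.0, tol=1e-10, max_iter=100):
    """Classical Gauss-Newton: solve (J^T J) h = -J^T f, then step x + alpha*h."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        J, r = jac(x), f(x)
        h = np.linalg.solve(J.T @ J, -J.T @ r)  # normal equations for the step
        x = x + alpha * h
        if np.linalg.norm(h) < tol:
            break
    return x

# Example: fit the model y = a * exp(b * t), so f_i(x) = a * exp(b * t_i) - y_i,
# with Jacobian rows [exp(b * t_i), a * t_i * exp(b * t_i)].
rng = np.random.default_rng(0)
t = np.linspace(0, 2, 20)
y = 2.0 * np.exp(-1.5 * t) + 0.01 * rng.standard_normal(t.size)

f = lambda x: x[0] * np.exp(x[1] * t) - y
jac = lambda x: np.column_stack([np.exp(x[1] * t), x[0] * t * np.exp(x[1] * t)])
print(gauss_newton(f, jac, x0=[1.0, -1.0]))  # close to the true (2.0, -1.5)

# Note that only first derivatives of f are needed: J^T J stands in for the true Hessian of F.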