%matplotlib inline
import numpy as np
from matplotlib import pyplot as plt
$\LaTeX \text{ commands here} \newcommand{\R}{\mathbb{R}} \newcommand{\im}{\text{im}\,} \newcommand{\norm}[1]{||#1||} \newcommand{\inner}[1]{\langle #1 \rangle} \newcommand{\span}{\mathrm{span}} \newcommand{\proj}{\mathrm{proj}} $
The set of separating hyperplanes. Suppose that $C$ and $D$ are disjoint subsets of $\R^n$. Consider the set of $(a, b) \in \R^{n+1}$ for which $$a^\top x \leq b \text{ for all } x \in C$$ and $$a^\top x \geq b \text{ for all } x \in D.$$ Show that this set is a convex cone (which is the singleton $\{0\}$ if there is no hyperplane that separates $C$ and $D$).
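A sketch of one possible argument: call this set $K$. If $(a_1, b_1), (a_2, b_2) \in K$ and $\theta_1, \theta_2 \geq 0$, then for every $x \in C$ $$(\theta_1 a_1 + \theta_2 a_2)^\top x = \theta_1 a_1^\top x + \theta_2 a_2^\top x \leq \theta_1 b_1 + \theta_2 b_2,$$ and the reverse inequality holds for every $x \in D$, so $\theta_1 (a_1, b_1) + \theta_2 (a_2, b_2) \in K$: closure under nonnegative combinations is exactly what it means for $K$ to be a convex cone. For the parenthetical remark, note that $(0,0) \in K$ always, that $a = 0$ forces $b = 0$, and that any $(a, b) \in K$ with $a \neq 0$ defines a separating hyperplane.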
Separating hyperplane theorem: if convex sets $C$ and $D$ are disjoint, then there is a vector $w$ and a scalar $b$ such that $\forall x\in C, \; w^\top x \leq b$ and $\forall z \in D, \; w^\top z \geq b$.
Supporting hyperplane theorem: given a non-empty convex set $C$ and some $x_0 \in \text{bndry}(C)$, there is a vector $w$ and a scalar $b$ such that $w^\top x_0 = b$ and $\forall x \in C, \; w^\top x \leq b$.
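To build intuition, here is a quick picture (a hypothetical example, not a proof): two disjoint discs in $\R^2$, together with a hand-picked separating hyperplane, $w = (1, 0)$ and $b = 2$.

theta = np.linspace(0, 2 * np.pi, 200)
fig, ax = plt.subplots()
# C: unit disc centered at (0, 0); D: unit disc centered at (4, 0)
ax.fill(np.cos(theta), np.sin(theta), alpha=0.3, label="C")
ax.fill(4 + np.cos(theta), np.sin(theta), alpha=0.3, label="D")
# with w = (1, 0) and b = 2: w^T x <= b on C and w^T x >= b on D
ax.axvline(2, color="k", linestyle="--", label="separating hyperplane")
ax.set_aspect("equal")
ax.legend()
plt.show()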
Prove: the first theorem implies the second. (Hint: think about the interior of $C$.)
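One way to see the implication (a sketch, assuming $\text{int}(C) \neq \emptyset$): the sets $\text{int}(C)$ and $\{x_0\}$ are disjoint convex sets, since a boundary point is never an interior point. The first theorem then provides $w$ and $b$ with $w^\top x \leq b$ for all $x \in \text{int}(C)$ and $w^\top x_0 \geq b$. A convex set with non-empty interior is contained in the closure of its interior, so by continuity $w^\top x \leq b$ for all $x \in C$; and since $x_0 \in \text{bndry}(C) \subseteq \text{cl}(C)$, the same limiting argument gives $w^\top x_0 \leq b$, hence $w^\top x_0 = b$.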
Carathéodory's Theorem: Let $X \subset \R^d$. Then each point in $\text{conv}(X)$ is a convex combination of at most $d + 1$ points of $X$.
Proof (start): Suppose, for contradiction, that there is some $y \in \text{conv}(X)$ whose smallest representation as a convex combination uses $m \geq d+2$ points of $X$. Then $$ y = \sum_{i=1}^m \lambda_i x_i \text{ with } \sum_{i=1}^m \lambda_i = 1 \text{ and } \lambda_i > 0 \text{ for all } i.$$ Notice that the $m \geq d+2$ points $x_1, \ldots, x_m$ must be affinely dependent (since no more than $d+1$ points in $\R^d$ can be affinely independent). That means there are coefficients $\mu_1, \ldots, \mu_m$, not all zero, such that $$\sum_{i=1}^m \mu_i x_i = 0 \text{ and } \sum_{i=1}^m \mu_i = 0;$$ since the $\mu_i$ sum to zero and are not all zero, at least one $\mu_i > 0$. Let's combine these two statements, and observe that for any $\alpha \in \R$ we have $$y = y + \alpha \cdot 0 = \sum_{i=1}^m \lambda_i x_i + \alpha \sum_{i=1}^m \mu_i x_i = \sum_{i=1}^m (\lambda_i + \alpha \mu_i) x_i$$
$\ldots$
$\ldots$ which is a contradiction!
We know that there is at least one $j$ with $\mu_j > 0$. Set $\alpha = - \min_{i : \mu_i > 0} \frac{\lambda_i}{\mu_i}$, let $i^*$ be the minimizing $i$ in this expression, and let $\lambda_i' := \lambda_i + \alpha \mu_i$. Our particular choice of $\alpha$ ensures two things: first, $\lambda_i' \geq 0$ for every $i$ (when $\mu_i \leq 0$ we have $\alpha \mu_i \geq 0$, and when $\mu_i > 0$ we have $\alpha \geq -\lambda_i / \mu_i$); second, $\lambda_{i^*}' = 0$. Moreover, since $\sum_i \mu_i = 0$, we still have $\sum_i \lambda_i' = 1$.
We have just shown that $y$ can be written as another convex combination, $y = \sum_{i=1}^m \lambda_i' x_i$. But notice that, in this convex combination, one of the $\lambda_i'$ is 0! That means we were able to write $y$ as a convex combination of fewer than $m$ points, a contradiction.
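The reduction step in this proof is completely constructive, so we can sketch it in numpy (a sketch up to numerical tolerance; the helper name caratheodory_reduce and the random test data are our own choices):

def caratheodory_reduce(X, lam, tol=1e-12):
    """Reduce the convex combination y = lam @ X to at most d + 1 points.

    X is an (m, d) array of points; lam is an (m,) vector of nonnegative
    weights summing to 1. Returns new weights, representing the same y,
    with at most d + 1 nonzero entries."""
    X, lam = np.asarray(X, float), np.asarray(lam, float).copy()
    d = X.shape[1]
    while np.count_nonzero(lam > tol) > d + 1:
        active = np.where(lam > tol)[0]
        pts = X[active]
        # affine dependence: mu with sum(mu_i x_i) = 0, sum(mu) = 0, mu != 0
        A = np.vstack([pts.T, np.ones(len(active))])
        mu = np.linalg.svd(A)[2][-1]   # null-space direction of A
        if mu.max() <= 0:
            mu = -mu                   # guarantee some mu_i > 0
        pos = mu > tol
        alpha = -np.min(lam[active][pos] / mu[pos])
        lam[active] += alpha * mu      # lam' = lam + alpha * mu
        lam[lam < tol] = 0.0           # one weight hits (numerical) zero
    return lam

rng = np.random.default_rng(0)
X = rng.standard_normal((6, 2))        # 6 points in R^2
lam = rng.random(6)
lam /= lam.sum()
y = lam @ X
lam2 = caratheodory_reduce(X, lam)
print(np.count_nonzero(lam2), np.allclose(lam2 @ X, y))  # at most 3, True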
For a differentiable function $f : \R \to \R$, the derivative is defined as $$f'(x) = \lim_{\delta \to 0} \frac{ f(x + \delta) - f(x)}{\delta}.$$
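The limit definition immediately suggests a numerical approximation: fix a small $\delta$ and evaluate the difference quotient (the test function and $\delta$ below are arbitrary choices):

f = lambda x: x ** 2
delta = 1e-6
print((f(3 + delta) - f(3)) / delta)  # ~6.000001, close to f'(3) = 6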
For a function $f : \R^n \to \R$, the partial derivative w.r.t. $x_i$ is $$\frac{\partial f(x)}{\partial x_i} = \lim_{\delta \to 0} \frac{ f(x + \delta e_i ) - f(x)}{\delta}$$ where $e_i$ is the $i$th basis vector.
The gradient of $f$ can be viewed as the vector of partial derivatives $$ \nabla_x f = \nabla f(x) = \left[\frac{\partial f(x)}{\partial x_1}, \ldots, \frac{\partial f(x)}{\partial x_n}\right]$$
But another way to view the gradient is as a linear transformation with the same domain and codomain as $f$. That is, $\nabla_x f$ is a linear map $\R^n \to \R$, where $$ \nabla_x f [u] = \left[\frac{\partial f(x)}{\partial x_1}, \ldots, \frac{\partial f(x)}{\partial x_n}\right] \cdot u.$$
But this is indeed the same as the directional derivative of $f$ in the direction of $u$. That is, $$ \nabla_x f[u] = \lim_{\delta \to 0} \frac{ f(x + \delta u ) - f(x)}{\delta}$$
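We can check this numerically on a small example (the function $f$, the point $x$, and the direction $u$ below are arbitrary choices):

# compare the finite-difference directional derivative with grad . u
f = lambda x: x[0] ** 2 + 3 * x[1]
x = np.array([1.0, 2.0])
u = np.array([0.6, 0.8])
grad = np.array([2 * x[0], 3.0])  # gradient computed by hand
delta = 1e-6
print((f(x + delta * u) - f(x)) / delta, grad @ u)  # both ~3.6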
If $f$ is a function mapping $\R^n \to \R^m$, then the analogue of the gradient is called the Jacobian, the $m \times n$ matrix of partial derivatives: $$D_f := \begin{pmatrix} \frac{\partial f_1}{\partial x_1} & \cdots & \frac{\partial f_1}{\partial x_n}\\ \vdots & & \vdots \\ \frac{\partial f_m}{\partial x_1} & \cdots & \frac{\partial f_m}{\partial x_n} \end{pmatrix} $$ The Hessian of a function $f : \R^n \to \R$ is defined as the matrix of second partial derivatives: $$ \nabla^2 f = D^2_f := \begin{pmatrix} \frac{\partial^2 f}{\partial x_1 \partial x_1} & \cdots & \frac{\partial^2 f}{\partial x_1 \partial x_n}\\ \vdots & & \vdots \\ \frac{\partial^2 f}{\partial x_n \partial x_1} & \cdots & \frac{\partial^2 f}{\partial x_n \partial x_n} \end{pmatrix} $$
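This definition translates directly into a finite-difference approximation of the Jacobian, column by column (a sketch; the helper name numerical_jacobian and the test function are our own choices):

def numerical_jacobian(f, x, delta=1e-6):
    """Forward-difference approximation of the Jacobian of f at x."""
    x = np.asarray(x, float)
    fx = np.atleast_1d(f(x))
    J = np.zeros((fx.size, x.size))
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = delta  # step along the i-th basis vector
        J[:, i] = (np.atleast_1d(f(x + e)) - fx) / delta
    return J

# example: f(x) = (x_1 x_2, x_1 + x_2) has Jacobian [[x_2, x_1], [1, 1]]
f = lambda x: np.array([x[0] * x[1], x[0] + x[1]])
print(numerical_jacobian(f, [2.0, 3.0]))  # ~[[3, 2], [1, 1]]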
Example: the log-sum-exp function $f(x) = \log \sum_{i=1}^n \exp(x_i)$.
For a good description, see these notes.
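Working the example out (a standard computation): the partial derivatives are $$\frac{\partial f(x)}{\partial x_i} = \frac{\exp(x_i)}{\sum_{j=1}^n \exp(x_j)},$$ so $\nabla f(x) = \mathrm{softmax}(x)$, and differentiating once more gives the Hessian $$\nabla^2 f(x) = \mathrm{diag}(s) - s s^\top, \quad s := \mathrm{softmax}(x).$$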
Imagine we have $g : \R^n \to \R^m$ and $f : \R^m \to \R^k$. Then we want to compute the gradient (Jacobian) of the composition $f \circ g$, i.e. of $x \mapsto f(g(x))$. Is there any easy way to do this? The answer is YES, according to the multivariate chain rule: $$ D_{f \circ g}[x] = D_f[g(x)] \cdot D_g[x]$$ This is a powerful statement, although it may not be clear why: it says we can obtain the Jacobian of the composed function by simply composing the Jacobians, treated as linear transformations (i.e. multiplying the matrices).
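Here is a numerical sanity check of this identity, reusing the numerical_jacobian helper defined above ($g$, $f$, and the point $x$ are arbitrary smooth test choices):

# g: R^2 -> R^3 and f: R^3 -> R^2
g = lambda x: np.array([x[0] * x[1], np.sin(x[0]), x[1] ** 2])
f = lambda y: np.array([y[0] + y[1] * y[2], np.exp(y[0])])

x = np.array([0.5, -1.2])
lhs = numerical_jacobian(lambda x_: f(g(x_)), x)              # D_{f o g}[x]
rhs = numerical_jacobian(f, g(x)) @ numerical_jacobian(g, x)  # D_f[g(x)] D_g[x]
print(np.allclose(lhs, rhs, atol=1e-4))  # True, up to finite-difference error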
Let $f(x) = \sum_{i=1}^m -\log (A_i \cdot x - b_i)$, where $A_i$ denotes the $i$th row of an $m\times n$ matrix $A$ and $b \in \R^m$. Notice that we can write $f(x) = h(g(x)) = (h \circ g)(x)$ where $h(y) = -\sum_i \log y_i$ and $g(x) = A x - b$.
Use the multivariate chain rule to compute $\nabla_x f$.
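One way to carry out the computation (a sketch of a solution): $D_g[x] = A$, and $\nabla h(y) = (-1/y_1, \ldots, -1/y_m)$, so the chain rule gives $$\nabla_x f = D_g[x]^\top \nabla h(g(x)) = -A^\top \left(\frac{1}{Ax - b}\right),$$ where the reciprocal is taken elementwise. A quick numerical check (sizes and data are arbitrary; $x$ is chosen so that $Ax - b > 0$, the domain of $f$):

rng = np.random.default_rng(1)
m, n = 5, 3
A = rng.standard_normal((m, n))
x = rng.standard_normal(n)
b = A @ x - rng.random(m) - 0.1  # makes Ax - b lie in [0.1, 1.1)

f = lambda x_: -np.sum(np.log(A @ x_ - b))
grad = -A.T @ (1.0 / (A @ x - b))  # the chain-rule answer

# central differences for the numerical gradient
delta = 1e-6
fd = np.array([(f(x + delta * e) - f(x - delta * e)) / (2 * delta)
               for e in np.eye(n)])
print(np.allclose(fd, grad, atol=1e-6))  # True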