%matplotlib inline
import numpy as np
from matplotlib import pyplot as plt
$\LaTeX \text{ commands here} \newcommand{\R}{\mathbb{R}} \newcommand{\im}{\text{im}\,} \newcommand{\norm}[1]{||#1||} \newcommand{\inner}[1]{\langle #1 \rangle} \newcommand{\span}{\mathrm{span}} \newcommand{\proj}{\mathrm{proj}} $
The set of separating hyperplanes. Suppose that $C$ and $D$ are disjoint subsets of $\R^n$. Consider the set of $(a, b) \in \R^{n+1}$ for which $$a^\top x \leq b \text{ for all } x \in C$$ and $$a^\top x \geq b \text{ for all } x \in D.$$ Show that this set is a convex cone (which is the singleton $\{0\}$ if there is no hyperplane that separates $C$ and $D$).
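A sketch of one possible argument: call this set $K$. If $(a_1, b_1), (a_2, b_2) \in K$ and $\theta_1, \theta_2 \geq 0$, then for every $x \in C$ $$(\theta_1 a_1 + \theta_2 a_2)^\top x = \theta_1 a_1^\top x + \theta_2 a_2^\top x \leq \theta_1 b_1 + \theta_2 b_2,$$ and the reverse inequality holds for every $x \in D$, so $\theta_1 (a_1, b_1) + \theta_2 (a_2, b_2) \in K$: closure under nonnegative combinations is exactly what it means for $K$ to be a convex cone. For the parenthetical remark, note that $(0,0) \in K$ always, that $a = 0$ forces $b = 0$, and that any $(a, b) \in K$ with $a \neq 0$ defines a separating hyperplane.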
Separating hyperplane theorem: if convex sets $C$ and $D$ are disjoint, then there is a vector $w$ and a scalar $b$ such that $\forall x\in C, \; w^\top x \leq b$ and $\forall z \in D, \; w^\top z \geq b$.
Supporting hyperplane theorem: given a non-empty convex set $C$ and some $x_0 \in \text{bndry}(C)$, there is a vector $w$ and a scalar $b$ such that $w^\top x_0 = b$ and $\forall x \in C, \; w^\top x \leq b$.
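To build intuition, here is a quick picture (a hypothetical example, not a proof): two disjoint discs in $\R^2$, together with a hand-picked separating hyperplane, $w = (1, 0)$ and $b = 2$.

theta = np.linspace(0, 2 * np.pi, 200)
fig, ax = plt.subplots()
# C: unit disc centered at (0, 0); D: unit disc centered at (4, 0)
ax.fill(np.cos(theta), np.sin(theta), alpha=0.3, label="C")
ax.fill(4 + np.cos(theta), np.sin(theta), alpha=0.3, label="D")
# with w = (1, 0) and b = 2: w^T x <= b on C and w^T x >= b on D
ax.axvline(2, color="k", linestyle="--", label="separating hyperplane")
ax.set_aspect("equal")
ax.legend()
plt.show()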
Prove: the first theorem implies the second. (Hint: think about the interior of $C$.)
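One way to see the implication (a sketch, assuming $\text{int}(C) \neq \emptyset$): the sets $\text{int}(C)$ and $\{x_0\}$ are disjoint convex sets, since a boundary point is never an interior point. The first theorem then provides $w$ and $b$ with $w^\top x \leq b$ for all $x \in \text{int}(C)$ and $w^\top x_0 \geq b$. A convex set with non-empty interior is contained in the closure of its interior, so by continuity $w^\top x \leq b$ for all $x \in C$; and since $x_0 \in \text{bndry}(C) \subseteq \text{cl}(C)$, the same limiting argument gives $w^\top x_0 \leq b$, hence $w^\top x_0 = b$.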
Carathéodory's Theorem: Let $X \subset \R^d$. Then each point in $\text{conv}(X)$ is a convex combination of at most $d + 1$ points of $X$.
Proof (start): Suppose, for contradiction, that there is some $y \in \text{conv}(X)$ whose smallest representation as a convex combination uses $m \geq d+2$ points of $X$. Then $$ y = \sum_{i=1}^m \lambda_i x_i \text{ with } \sum_{i=1}^m \lambda_i = 1 \text{ and } \lambda_i > 0 \text{ for all } i.$$ Notice that the $m \geq d+2$ points $x_1, \ldots, x_m$ must be affinely dependent (since no more than $d+1$ points in $\R^d$ can be affinely independent). That means there are coefficients $\mu_1, \ldots, \mu_m$, not all zero, such that $$\sum_{i=1}^m \mu_i x_i = 0 \text{ and } \sum_{i=1}^m \mu_i = 0;$$ since the $\mu_i$ sum to zero and are not all zero, at least one $\mu_i > 0$. Let's combine these two statements, and observe that for any $\alpha \in \R$ we have $$y = y + \alpha \cdot 0 = \sum_{i=1}^m \lambda_i x_i + \alpha \sum_{i=1}^m \mu_i x_i = \sum_{i=1}^m (\lambda_i + \alpha \mu_i) x_i$$
$\ldots$
$\ldots$ which is a contradiction!
We know that there is at least one $j$ with $\mu_j > 0$. Set $\alpha = - \min_{i : \mu_i > 0} \frac{\lambda_i}{\mu_i}$, let $i^*$ be the minimizing $i$ in this expression, and let $\lambda_i' := \lambda_i + \alpha \mu_i$. Our particular choice of $\alpha$ ensures two things: first, $\lambda_i' \geq 0$ for every $i$ (when $\mu_i \leq 0$ we have $\alpha \mu_i \geq 0$, and when $\mu_i > 0$ we have $\alpha \geq -\lambda_i / \mu_i$); second, $\lambda_{i^*}' = 0$. Moreover, since $\sum_i \mu_i = 0$, we still have $\sum_i \lambda_i' = 1$.
We have just shown that $y$ can be written as another convex combination, $y = \sum_{i=1}^m \lambda_i' x_i$. But notice that, in this convex combination, one of the $\lambda_i'$ is 0! That means we were able to write $y$ as a convex combination of fewer than $m$ points, a contradiction.
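The reduction step in this proof is completely constructive, so we can sketch it in numpy (a sketch up to numerical tolerance; the helper name caratheodory_reduce and the random test data are our own choices):

def caratheodory_reduce(X, lam, tol=1e-12):
    """Reduce the convex combination y = lam @ X to at most d + 1 points.

    X is an (m, d) array of points; lam is an (m,) vector of nonnegative
    weights summing to 1. Returns new weights, representing the same y,
    with at most d + 1 nonzero entries."""
    X, lam = np.asarray(X, float), np.asarray(lam, float).copy()
    d = X.shape[1]
    while np.count_nonzero(lam > tol) > d + 1:
        active = np.where(lam > tol)[0]
        pts = X[active]
        # affine dependence: mu with sum(mu_i x_i) = 0, sum(mu) = 0, mu != 0
        A = np.vstack([pts.T, np.ones(len(active))])
        mu = np.linalg.svd(A)[2][-1]   # null-space direction of A
        if mu.max() <= 0:
            mu = -mu                   # guarantee some mu_i > 0
        pos = mu > tol
        alpha = -np.min(lam[active][pos] / mu[pos])
        lam[active] += alpha * mu      # lam' = lam + alpha * mu
        lam[lam < tol] = 0.0           # one weight hits (numerical) zero
    return lam

rng = np.random.default_rng(0)
X = rng.standard_normal((6, 2))        # 6 points in R^2
lam = rng.random(6)
lam /= lam.sum()
y = lam @ X
lam2 = caratheodory_reduce(X, lam)
print(np.count_nonzero(lam2), np.allclose(lam2 @ X, y))  # at most 3, True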
For a differentiable function $f : \R \to \R$, the derivative is defined as $$f'(x) = \lim_{\delta \to 0} \frac{ f(x + \delta) - f(x)}{\delta}.$$
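The limit definition immediately suggests a numerical approximation: fix a small $\delta$ and evaluate the difference quotient (the test function and $\delta$ below are arbitrary choices):

f = lambda x: x ** 2
delta = 1e-6
print((f(3 + delta) - f(3)) / delta)  # ~6.000001, close to f'(3) = 6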
For a function $f : \R^n \to \R$, the partial derivative w.r.t. $x_i$ is $$\frac{\partial f(x)}{\partial x_i} = \lim_{\delta \to 0} \frac{ f(x + \delta e_i ) - f(x)}{\delta}$$ where $e_i$ is the $i$th basis vector.
The gradient of $f$ can be viewed as the vector of partial derivatives $$ \nabla_x f = \nabla f(x) = \left[\frac{\partial f(x)}{\partial x_1}, \ldots, \frac{\partial f(x)}{\partial x_n}\right]$$
But another way to view the gradient is as a linear transformation with the same domain and codomain as $f$. That is, $\nabla_x f$ is a linear map $\R^n \to \R$, where $$ \nabla_x f [u] = \left[\frac{\partial f(x)}{\partial x_1}, \ldots, \frac{\partial f(x)}{\partial x_n}\right] \cdot u.$$
But this is indeed the same as the directional derivative of $f$ in the direction of $u$. That is, $$ \nabla_x f[u] = \lim_{\delta \to 0} \frac{ f(x + \delta u ) - f(x)}{\delta}$$
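We can check this numerically on a small example (the function $f$, the point $x$, and the direction $u$ below are arbitrary choices):

# compare the finite-difference directional derivative with grad . u
f = lambda x: x[0] ** 2 + 3 * x[1]
x = np.array([1.0, 2.0])
u = np.array([0.6, 0.8])
grad = np.array([2 * x[0], 3.0])  # gradient computed by hand
delta = 1e-6
print((f(x + delta * u) - f(x)) / delta, grad @ u)  # both ~3.6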
If $f$ is a function mapping $\R^n \to \R^m$, then the analogue of the gradient is called the Jacobian, the $m \times n$ matrix of partial derivatives: $$D_f := \begin{pmatrix} \frac{\partial f_1}{\partial x_1} & \cdots & \frac{\partial f_1}{\partial x_n}\\ \vdots & & \vdots \\ \frac{\partial f_m}{\partial x_1} & \cdots & \frac{\partial f_m}{\partial x_n} \end{pmatrix} $$ The Hessian of a function $f : \R^n \to \R$ is defined as the matrix of second partial derivatives: $$ \nabla^2 f = D^2_f := \begin{pmatrix} \frac{\partial^2 f}{\partial x_1 \partial x_1} & \cdots & \frac{\partial^2 f}{\partial x_1 \partial x_n}\\ \vdots & & \vdots \\ \frac{\partial^2 f}{\partial x_n \partial x_1} & \cdots & \frac{\partial^2 f}{\partial x_n \partial x_n} \end{pmatrix} $$
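This definition translates directly into a finite-difference approximation of the Jacobian, column by column (a sketch; the helper name numerical_jacobian and the test function are our own choices):

def numerical_jacobian(f, x, delta=1e-6):
    """Forward-difference approximation of the Jacobian of f at x."""
    x = np.asarray(x, float)
    fx = np.atleast_1d(f(x))
    J = np.zeros((fx.size, x.size))
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = delta  # step along the i-th basis vector
        J[:, i] = (np.atleast_1d(f(x + e)) - fx) / delta
    return J

# example: f(x) = (x_1 x_2, x_1 + x_2) has Jacobian [[x_2, x_1], [1, 1]]
f = lambda x: np.array([x[0] * x[1], x[0] + x[1]])
print(numerical_jacobian(f, [2.0, 3.0]))  # ~[[3, 2], [1, 1]]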
Example: the log-sum-exp function $f(x) = \log \sum_{i=1}^n \exp(x_i)$.
For a good description, see these notes.
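Working the example out (a standard computation): the partial derivatives are $$\frac{\partial f(x)}{\partial x_i} = \frac{\exp(x_i)}{\sum_{j=1}^n \exp(x_j)},$$ so $\nabla f(x) = \mathrm{softmax}(x)$, and differentiating once more gives the Hessian $$\nabla^2 f(x) = \mathrm{diag}(s) - s s^\top, \quad s := \mathrm{softmax}(x).$$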
Imagine we have $g : \R^n \to \R^m$ and $f : \R^m \to \R^k$. Then we want to compute the gradient (Jacobian) of the composition $f \circ g$, i.e. of $x \mapsto f(g(x))$. Is there any easy way to do this? The answer is YES, according to the multivariate chain rule: $$ D_{f \circ g}[x] = D_f[g(x)] \cdot D_g[x]$$ This is a powerful statement, although it may not be clear why: it says we can obtain the Jacobian of the composed function by simply composing the Jacobians, treated as linear transformations (i.e. multiplying the matrices).
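Here is a numerical sanity check of this identity, reusing the numerical_jacobian helper defined above ($g$, $f$, and the point $x$ are arbitrary smooth test choices):

# g: R^2 -> R^3 and f: R^3 -> R^2
g = lambda x: np.array([x[0] * x[1], np.sin(x[0]), x[1] ** 2])
f = lambda y: np.array([y[0] + y[1] * y[2], np.exp(y[0])])

x = np.array([0.5, -1.2])
lhs = numerical_jacobian(lambda x_: f(g(x_)), x)              # D_{f o g}[x]
rhs = numerical_jacobian(f, g(x)) @ numerical_jacobian(g, x)  # D_f[g(x)] D_g[x]
print(np.allclose(lhs, rhs, atol=1e-4))  # True, up to finite-difference error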
Let $f(x) = \sum_{i=1}^m -\log (A_i \cdot x - b_i)$, where $A_i$ denotes the $i$th row of an $m\times n$ matrix $A$ and $b \in \R^m$. Notice that we can write $f(x) = h(g(x)) = (h \circ g)(x)$ where $h(y) = -\sum_i \log y_i$ and $g(x) = A x - b$.
Use the multivariate chain rule to compute $\nabla_x f$.
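One way to carry out the computation (a sketch of a solution): $D_g[x] = A$, and $\nabla h(y) = (-1/y_1, \ldots, -1/y_m)$, so the chain rule gives $$\nabla_x f = D_g[x]^\top \nabla h(g(x)) = -A^\top \left(\frac{1}{Ax - b}\right),$$ where the reciprocal is taken elementwise. A quick numerical check (sizes and data are arbitrary; $x$ is chosen so that $Ax - b > 0$, the domain of $f$):

rng = np.random.default_rng(1)
m, n = 5, 3
A = rng.standard_normal((m, n))
x = rng.standard_normal(n)
b = A @ x - rng.random(m) - 0.1  # makes Ax - b lie in [0.1, 1.1)

f = lambda x_: -np.sum(np.log(A @ x_ - b))
grad = -A.T @ (1.0 / (A @ x - b))  # the chain-rule answer

# central differences for the numerical gradient
delta = 1e-6
fd = np.array([(f(x + delta * e) - f(x - delta * e)) / (2 * delta)
               for e in np.eye(n)])
print(np.allclose(fd, grad, atol=1e-6))  # True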