$\newcommand{\R}{\mathbb{R}}$
$$ \min_{x \in \R^n} \;\; f(x) $$
where $f$ is a convex function.
Methods:
For $f: \R^n \to \R$, a subgradient of $f$ at $x$ is a vector $g$ satisfying
$$ f(y) \ge f(x) + \langle g, y-x \rangle, \;\; \forall y \in \R^n. $$
Gradient: $s = -\nabla f(x)$ is a descent direction, i.e., $\langle \nabla f(x), s \rangle < 0$.
Subgradient: $s = -g$ is not necessarily a descent direction.
Ex. $f(x,y) = 2|x| + |y|$
At $(x,y) = (1.5,0)$: $g = (2, 1)$ is a valid subgradient, and $-g = (-2, -1)$.
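A quick numerical sanity check (a sketch) that $g$ satisfies the subgradient inequality at this point, before plotting:

# Test f(y) ≥ f(x0) + ⟨g, y - x0⟩ at many random points y (a sketch).
f(p) = 2abs(p[1]) + abs(p[2])
x0 = [1.5, 0.0]
g  = [2.0, 1.0]
ok = all(f(y) ≥ f(x0) + sum(g .* (y .- x0)) - 1e-12 for y in eachcol(randn(2, 10_000)))
@show ok    # true: g = (2, 1) is a valid subgradient at (1.5, 0)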
using PyPlot

n = 200
x = range(-4, 4, length=n)       # replaces deprecated linspace
y = range(-4, 4, length=n)
xgrid = repeat(x', n, 1)         # replaces deprecated repmat
ygrid = repeat(y, 1, n)
z = zeros(n, n)
for i in 1:n
    for j in 1:n
        z[i, j] = 2abs(x[i]) + abs(y[j])
    end
end
cp = contour(xgrid, ygrid, z, colors="blue", linewidths=2.0)
clabel(cp, inline=1, fontsize=8)
# xlabel("x")
# ylabel("y")
# tight_layout()
ax = gca()
ax.spines["top"].set_visible(false)          # hide the top edge of the axis
ax.spines["right"].set_visible(false)        # hide the right edge of the axis
ax.spines["left"].set_position("center")     # move the left axis to the center
ax.spines["bottom"].set_position("center")   # move the bottom axis to the center
ax.xaxis.set_ticks_position("bottom")        # show x-ticks only at the bottom
ax.yaxis.set_ticks_position("left")          # show y-ticks only on the left
arrow(1.5, 0, -2, -1,
      head_width=0.1,
      width=0.00015,
      head_length=0.05,
      overhang=0.5,
      head_starts_at_zero=false,             # boolean, not the string "false"
      facecolor="black")
annotate("(1.5,0)", xy=(1.2, 0.1))
annotate("-g", xy=(0.6, -0.8))
Ex. $f(x) = \|x\|_1 = \sum_{i=1}^n |x_i|$
using PyPlot

n = 200
x = range(-1, 1, length=n)       # replaces deprecated linspace
y = range(-1, 1, length=n)
xgrid = repeat(x', n, 1)         # replaces deprecated repmat
ygrid = repeat(y, 1, n)
z = zeros(n, n)
for i in 1:n
    for j in 1:n
        z[i, j] = abs(x[i]) + abs(y[j])
    end
end
using3D()                        # enable mplot3d 3D axes in PyPlot.jl
fig = figure("pyplot_surfaceplot", figsize=(10,10))
ax = fig.add_subplot(1, 1, 1, projection="3d")
ax.plot_surface(xgrid, ygrid, z, rstride=2, cstride=2, edgecolors="k",
                cmap=ColorMap("gray"), alpha=0.8, linewidth=0.25)
xlabel("x")
ylabel("y")
A subgradient $g \in \partial f(x)$ is given componentwise by
$$ g_i = \begin{cases} 1 & \text{if } x_i > 0,\\ -1 & \text{if } x_i < 0,\\ \text{any value in } [-1,1] & \text{otherwise.} \end{cases} $$
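In code, a convenient choice picks $g_i = 0$ at the kinks (a minimal sketch; any value in $[-1,1]$ would be valid):

subgrad_l1(x) = sign.(x)        # sign(0) = 0 ∈ [-1,1], so this is a valid subgradient

subgrad_l1([1.5, 0.0, -2.0])    # returns [1.0, 0.0, -1.0]

Changes to gradient descent: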
Assumptions: every subgradient is bounded, $\|g\| \le M$, and the initial distance satisfies $\|x_1 - x^*\| \le D$.
We can show that the best iterate up to the $k$th iteration satisfies: $$ f(x_{k, \text{best}}) - f(x^*) \le \frac{D^2 + M^2 \sum_{i=1}^k \alpha_i^2}{2\sum_{i=1}^k \alpha_i} $$
Constant stepsize $\alpha_i = c$: the right-hand side tends to $\frac{M^2 c}{2}$ as $k \to \infty$, so the method converges only to a neighborhood of the optimal value.
Recall: gradient descent had a faster $O(1/k)$ convergence rate; the subgradient method attains only $O(1/\sqrt{k})$.
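For concreteness, a minimal sketch of the subgradient method with a constant stepsize on $f(x) = \|x\|_1$, tracking the best value seen so far (the iterates need not decrease monotonically):

function subgradient_method(x0; α = 0.01, K = 1000)
    x, fbest = copy(x0), Inf
    for k in 1:K
        g = sign.(x)                     # a subgradient of ‖·‖₁ (0 at kinks)
        x = x - α * g                    # fixed step; no line search
        fbest = min(fbest, sum(abs, x))  # track the best iterate's value
    end
    return fbest
end

subgradient_method(randn(10))            # stalls at a small positive floor
                                         # proportional to α (cf. the M²c/2 bound)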
Subgradient method + Randomness
Motivation
Consider an objective function with a very large summation:
$$ \min_{x\in X \subset \R^n} \;\; \sum_{i=1}^m \ell(x; D_i) $$
where $\ell(\cdot; D_i): \R^n \to \R$ is a convex function.
Writing $f(x) = \sum_{i=1}^m \ell(x; D_i)$, the (full) gradient of $f$ (assuming $f$ is differentiable) is $$ \nabla f(x) = \sum_{i=1}^m \nabla \ell(x; D_i), $$ which requires $m$ gradient evaluations per iteration; this is expensive when $m$ is large.
Q. Can we avoid this?
Consider a stochastic subgradient $G$ of $f$ at $x$.
$\newcommand{\E}{\mathbb{E}}$
Required property: $\E[G] \in \partial f(x)$, i.e.,
$$ f(y) \ge f(x) + \langle \E[G], y-x \rangle, \;\; \forall y \in \R^n. $$
We can consider $G$ as a noisy estimate of $g$: $$ \text{e.g.,}\;\; G = g + v, \;\; g \in \partial f(x), \;\; \E[v] = 0. $$
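A minimal sketch under an assumed setup: take $\ell(x; D_i) = |x - D_i|$ on $\R$, so $f(x) = \sum_{i=1}^m |x - D_i|$; sampling one index uniformly and rescaling by $m$ gives an unbiased stochastic subgradient with $\E[G] = \sum_{i=1}^m \mathrm{sign}(x - D_i) \in \partial f(x)$, at the cost of a single term per iteration:

# Unbiased stochastic subgradient for f(x) = Σᵢ |x - D[i]| (a sketch)
function stochastic_subgradient(x, D)
    m = length(D)
    i = rand(1:m)                # sample one data point uniformly
    return m * sign(x - D[i])    # E[G] = Σᵢ sign(x - D[i]) ∈ ∂f(x)
end

D = randn(1000)                  # hypothetical data
G = stochastic_subgradient(0.0, D)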
The usual formulation has
$$ \min_{x \in X}\;\; \tilde f(x) = \frac{1}{m} \sum_{i=1}^m F(x;\xi_i) $$
where the number of samples $m$ can be very large; as $m\to\infty$, $\tilde f$ approaches the true risk below.
This is called Sample Average Approximation (SAA) or Empirical Risk Minimization (ERM).
In data science, what we really want to do is
$$ \min_{x \in X} \;\; f(x) = \E[ F(x;\xi) ] = \int_\Xi F(x;\xi)\, dP(\xi) $$
This is called Stochastic Approximation (SA) or Risk Minimization.
Assumptions:
Two constants: $D \ge \|x_1 - x^*\|$ (initial distance) and $M^2 \ge \sup_{x\in X} \E[\|G(x,\xi)\|^2]$ (second moment of the stochastic subgradient).
$f$ is $c$-strongly convex, and the stepsize is $\alpha_k = \frac{\theta}{k}$ with $\theta > \frac{1}{2c}$,
$f$ is differentiable, $\nabla f$ is $L$-Lipschitz continuous, and $x^* \in \text{int}\, X$;
under these conditions the classical analysis gives an $O(1/k)$ rate in expectation.
The convergence is sensitive to the choice of $\theta$ (misestimating the strong convexity constant $c$ can be catastrophic):
Ex. $f(x) = x^2/10$, $G(x,\xi) = \nabla f(x)$ (no noise), $X = [-1,1]$, stepsize $\alpha_k = \theta/k$. Here $c = 1/5$, so the condition requires $\theta > 2.5$; with, say, $\theta = 1$ the iterate decays only like $k^{-1/5}$.
Even with $k=10^9$, we still have $x_k > 0.015$, far from $x^* = 0$.
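A quick simulation of this example (a sketch; the choice $\theta = 1$ is an assumed reading of the example, and $K = 10^7$ is used so it runs quickly):

function classical_sgd(K; θ = 1.0)
    ∇f(x) = x / 5                                   # f(x) = x²/10, so c = 1/5
    x = 1.0                                         # start at the boundary of X
    for k in 1:K
        x = clamp(x - (θ / k) * ∇f(x), -1.0, 1.0)   # projected step, αₖ = θ/k
    end
    return x
end

classical_sgd(10^7)    # ≈ 0.03: the iterate decays only like k^(-θ/5)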
Differences from the classical SGD: use a more robust stepsize together with averaging of the iterates, $\tilde x_s^k = \frac{1}{k-s+1} \sum_{i=s}^{k} x_i$.
Constant stepsize $\alpha_k = \frac{D}{M\sqrt{K}}$, $$ \E[ f(\tilde x_s^K) - f(x^*) ] \le \frac{DM}{\sqrt{K}} \left[ \frac{2K}{K-s+1} + \frac{1}{2} \right] $$
Diminishing stepsize $\alpha_k = \frac{\theta D}{M\sqrt{k}}$ for any $\theta>0$, $$ \E[ f(\tilde x_s^k) - f(x^*) ] \le \frac{DM}{\sqrt{k}} \left[ \frac{2}{\theta} \left( \frac{k}{k-s+1} \right) + \frac{\theta}{2} \sqrt{\frac{k}{s}} \right] $$
$s = \lceil rk \rceil$, $r \in (0,1)$: $$ \E[ f(\tilde x_s^k) - f(x^*) ] \le \frac{DM}{\sqrt{k}} \left[ \frac{2}{\theta} \left( \frac{1}{1-r} \right) + \frac{\theta}{2} \sqrt{r} \right] $$
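A minimal sketch of this averaging scheme on an assumed toy problem: $f(x) = \frac{1}{2}\E[(x-\xi)^2]$ with $\xi \sim N(0,1)$, so $G(x,\xi) = x - \xi$ is an unbiased stochastic gradient and $x^* = 0$:

function averaged_sgd(; K = 10_000, r = 0.5, D = 1.0, M = 2.0)
    α = D / (M * sqrt(K))            # constant stepsize from the bound above
    s = ceil(Int, r * K)             # average only the tail iterates s, ..., K
    x, avg, cnt = 1.0, 0.0, 0
    for k in 1:K
        x = clamp(x - α * (x - randn()), -1.0, 1.0)   # projected stochastic step
        if k ≥ s
            avg += x; cnt += 1
        end
    end
    return avg / cnt                 # the averaged iterate x̃ₛᴷ
end

averaged_sgd()    # close to x* = 0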
For a convex function $f: \R^n \to \R$, show that its subdifferential $\partial f(x)$ is a convex set. (Bonus) Show also that $\partial f(x)$ is closed.
Suppose that $f:\R^n \to \R$ is convex and Lipschitz continuous (with a constant $L>0$). Show that its subgradients are bounded in norm.
Implement the SGD algorithm to solve the SVM (primal form without intercept): $$ \min_{w \in \R^n} \;\; \frac{1}{m} \sum_{i=1}^m \max\{0,\, 1 - y_i \langle w, x_i \rangle\} + \frac{\lambda}{2} \|w\|^2 $$
where $x_i \in \R^n$ and $y_i \in \{-1,+1\}$ for $i=1,\dots,m$, and $\lambda>0$. Train the SVM on the 60,000 training samples from the MNIST data set (use the MNIST Julia package and traindata()). Report the prediction accuracy of the trained SVM on the test set of 10,000 samples not used for training (use testdata()). In SVM, we predict the label of a given input point $x\in\R^n$ as $+1$ if $\langle w^*, x \rangle > 0$, or $-1$ if $\langle w^*, x \rangle < 0$.
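A possible starting point, not a full solution: a Pegasos-style SGD sketch for the primal objective above. The stepsize $\alpha_t = \frac{1}{\lambda t}$ is an assumption (the exercise does not fix one), and loading MNIST and binarizing its 0-9 labels into $\{-1,+1\}$ is left to the reader.

using LinearAlgebra

# SGD on the primal SVM: at each step, sample one example and take a
# stochastic subgradient of  λ/2 ‖w‖² + max(0, 1 - yᵢ⟨w, xᵢ⟩).
function svm_sgd(X, y; λ = 1e-4, epochs = 5)
    m, n = size(X)
    w, t = zeros(n), 0
    for _ in 1:epochs, _ in 1:m
        t += 1
        i = rand(1:m)                    # sample one training example
        g = λ * w                        # subgradient of the regularizer
        if y[i] * dot(w, X[i, :]) < 1    # hinge term is active
            g -= y[i] * X[i, :]
        end
        w -= (1 / (λ * t)) * g           # Pegasos-style stepsize 1/(λt), an assumption
    end
    return w
end

predict(w, x) = dot(w, x) > 0 ? +1 : -1  # the prediction rule stated above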