Review the main ideas in Lec. 17, p. 2--4.
Analytically compute the projection of a point $x_{0}\in\mathbb{R}^{n}_{\geq 0}\setminus\{0\}$ onto the standard simplex $\mathcal{S} := \{x\in\mathbb{R}^{n}_{\geq 0}\mid \mathbf{1}^{\top}x = 1\}$ with respect to the generalized Kullback-Leibler divergence $$D_{\text{KL}}(x\parallel x_{0}) := \sum_{i=1}^{n}\left(x_{i}\log\frac{x_{i}}{x_{0i}} - x_{i} + x_{0i}\right).$$ In other words, compute $\operatorname{proj}^{D_{\text{KL}}}_{\mathcal{S}}\left(x_{0}\right)$ as a function of $x_{0}$.
(Hint: Set up the relevant constrained optimization problem, argue why you can invoke statement 2 in Lec. 15, p. 2, and then solve the KKT conditions from Lec. 14, p. 18-19.)
Analytically compute the projection of $X_{0}\in\mathbb{S}^{n}$ onto the cone $\mathbb{S}_{+}^{n}$ with respect to the Frobenius norm (Lec. 17, p. 7, 1st row and second column). In other words, compute $\operatorname{proj}^{\|\cdot\|_{\text{F}}}_{\mathbb{S}^{n}_{+}}\left(X_{0}\right)$ as a function of $X_{0}$.
(Hint: Same hints as in part (a). Review Lec. 4 for taking derivative of scalar with respect to matrix. Also, the spectral theorem for symmertic matrices should help.)
During Lec. 17, p. 9-10, Example 2, we mentioned that the $k$ rank ball
$$\mathcal{C}_{mn}^{k} := \{X\in\mathbb{R}^{m\times n} \mid \text{rank}(X)\leq k\}, \quad\text{for fixed}\;k\leq \min\{m,n\},$$is a nonconvex set.
To understand this set nonconvexity, fix $m=n=3$ and $k=2$. Construct an example to demonstrate that the set $\mathcal{C}^{2}_{33}=\{X\in\mathbb{R}^{3\times 3}\mid \text{rank}(X)\leq 2\}$ is nonconvex.
Remark: Notice that even though $\text{rank}(X)$ is a nonconvex function, we need to be careful in arguing set nonconvexity becuase it is possible for a nonconvex function to have convex sublevel sets.
In practice, we often encounter optimization problems where the objective is a convex but nondifferentiable function. To solve such problems, we need the concept of subgradient.
We say a vector $g\in\mathbb{R}^{n}$ is a subgradient of $f:\mathbb{R}^{n}\mapsto \mathbb{R}$ at $x\in\textbf{dom}(f)$ if $$f(y) \geq f(x) + g^{\top}(y-x),\quad\text{for all}\quad y\in\textbf{dom}(f).$$
The above definition is a generalization of the concept of gradient in the sense if $f$ were differentiable, the above would precisely be the first order characterization of convexity. In general, subgradient at a point may not be unique. The set of all subgradients at $x$ is called the subdifferential $\partial f(x) := \{g \mid f(y) \geq f(x) + g^{\top}(y-x),\:\text{for all}\:y\in\textbf{dom}(f)\}$.
As a simple example, consider $f(x) = |x|$ with $\textbf{dom}(f)=\mathbb{R}$. Then for $x>0$, the subgradient $g=1$ is unique, i.e., $\partial f(x) = \{1\}$. Likewise for $x<0$, we have $\partial f(x) = \{-1\}$. At $x=0$, the inequality $|y|\geq g^{\top}y$ is saisfied iff $g\in[-1,1]$. Thus, $\partial f(0) = [-1,1]$.
It can be shown that if $f$ is differentiable at $x$, then $\partial f(x) = \{\nabla f(x)\}$, as you intuitively expect. Thus, subgradient $g$ generalizes the concept of gradient.
Given convex functions $f_{1},...,f_{m}(x)$, let $f(x):=\max\{f_{1}(x), ..., f_{m}(x)\}$. Define $I(x)=\{i\mid f_{i}(x)=f(x)\}$, that is, the index of the "active" functions at $x$. A nice result is this: to compute subgradient $g$ of $f$ at $x$, choose any $i\in I(x)$. Then $g$ can be taken as any subgradient of $f_{i}$ at $x$. That is, if $i$ is an active index, then $g\in\partial f_{i}(x)$ implies $g\in\partial f(x)$. Furthermore, $$\partial f(x) = \text{conv}\left(\underset{i\in I(x)}{\bigcup}\partial f_{i}(x)\right),$$ the convex hull of the union of subdifferentials of "active" functions at $x$.
Use the above result to prove that if $f(x) = \|x\|_{1}$ with $x\in\mathbb{R}^{n}$, then $\partial f(x) = S_{1} \times S_{2} \times ... \times S_{n}$ (Cartesian product of sets), where $S_{j} := [-1,1]$ for $x_{j}=0$, $S_{j} := \{1\}$ for $x_{j}>0$, and $S_{j} := \{-1\}$ for $x_{j}<0$, $j=1,...,n$.
Hint: Start by writing $f(x)=\|x\|_{1}$ as the pointwise max of a finite number of functions.
To solve a problem of the form $\underset{x\in\mathbb{R}^{n}}{\text{minimize}}\:f(x)$, where $f$ is convex but not everywhere differentiable, a standard iterative algorithm is the subgradient method: $$x_{k+1} = x_{k} - \eta_{k}g_{k}, \quad k=1,2,...,$$
where $g_{k}\in\partial f(x_{k})$ is any subgradient, and $\eta_{k}>0$ denotes stepsize at the $k$th iteration. Assume that the norm of subgradient is bounded, i.e., there exists $G$ such that $\|g_{k}\|_{2} \leq G$ for all iteration index $k=1,2,...$.
Although it looks very similar to gradient descent, subgradient method is not a descent method (see Lec. 18: intro to gradient descent, p. 2). Consider a running estimate of the best value so far: $f^{\text{best}}_{k} = \min\{f(x_{1}), ..., f(x_{k})\}$. Since $f^{\text{best}}_{k}$ is decreasing, it will have a limit, which may be finite or $-\infty$.
If $x_{\text{opt}}$ is a minimizer, then after $k$ iterations, the subgradient method satisfies:
$$f^{\text{best}}_{k} - f(x_{\text{opt}}) \leq \dfrac{\|x_{0} - x_{\text{opt}}\|_{2}^{2} + G^{2}\sum_{i=1}^{k}\eta_{i}^{2}}{2\sum_{i=1}^{k}\eta_{i}}.$$(i) From above, what can you tell about the convergence of subgradient method if the stepsize sequence $\{\eta_{k}\}$ is not summable but square summable, i.e., if $\sum_{k=1}^{\infty}\eta_{k} = \infty$, and $\sum_{k=1}^{\infty}\eta_{k}^{2} < \infty$.
(ii) What can you tell about the convergence of subgradient method if we choose the stepsize $\eta_{k} \equiv 1/k$ for all $k$?
A popular problem in statistics and machine learning is the so called "lasso problem", given by
$$\underset{x\in\mathbb{R}^{n}}{\text{minimize}}\:\frac{1}{2}\|Ax-b\|_{2}^{2} + \gamma \|x\|_{1},$$where $A\in\mathbb{R}^{m\times n}$, $b\in\mathbb{R}^{m}$, and $\gamma>0$ is a regularization parameter. For $\gamma = 0$, it reduces to ordinary least squares. For $\gamma>0$, it solves a linear regression problem while promoting sparsity in the minimizer. The objective function is a sum of two terms, and is convex but nondifferentiable. The first summand in the objective promotes data fidelity while the second promotes sparsity.
(i) Use your answer in part (a) to explicitly write down the subgradient update rule for the lasso problem.
(ii) Use the MATLAB starter code $\texttt{AM229F20HW8problem.m}$ inside CANVAS Files section, folder: HW problems and solutions, to report the optimal value and cpu time for cvx solution of the lasso problem for given problem data. Also report the 10,000th iterate value of the objective and cpu time for subgradient method for the same data with $\eta_k \equiv 1/k$ using the same code. Please submit your completed code.
You can also submit the Python equivalent of the completed starter code with cvxpy but please make sure to use the same numerical values for the problem data.