#!/usr/bin/env python
# coding: utf-8

# # Problem 1 [60 points] Projection on a set
# 
# Review the main ideas in Lec. 17, p. 2--4.

# ## (a) [20 points] Projection on standard simplex
# 
# **Analytically** compute the projection of a point $x_{0}\in\mathbb{R}^{n}_{\geq 0}\setminus\{0\}$ onto the standard simplex $\mathcal{S} := \{x\in\mathbb{R}^{n}_{\geq 0}\mid \mathbf{1}^{\top}x = 1\}$ with respect to the **generalized Kullback-Leibler divergence** 
# $$D_{\text{KL}}(x\parallel x_{0}) := \sum_{i=1}^{n}\left(x_{i}\log\frac{x_{i}}{x_{0i}} - x_{i} + x_{0i}\right).$$
# In other words, compute $\operatorname{proj}^{D_{\text{KL}}}_{\mathcal{S}}\left(x_{0}\right)$ as a function of $x_{0}$.
# 
# (**Hint:** Set up the relevant constrained optimization problem, argue why you can invoke statement 2 in Lec. 15, p. 2, and then solve the KKT conditions from Lec. 14, p. 18-19.)

# ## (b) [30 points] Projection on $\mathbb{S}_{+}^{n}$
# 
# **Analytically** compute the projection of $X_{0}\in\mathbb{S}^{n}$ onto the cone $\mathbb{S}_{+}^{n}$ with respect to the **Frobenius norm** (Lec. 17, p. 7, 1st row and second column). In other words, compute $\operatorname{proj}^{\|\cdot\|_{\text{F}}}_{\mathbb{S}^{n}_{+}}\left(X_{0}\right)$ as a function of $X_{0}$.
# 
# (**Hint:** Same hints as in part (a). Review Lec. 4 for taking derivative of scalar with respect to matrix. Also, the spectral theorem for symmertic matrices should help.)

# ## (c) [10 points] Nonconvexity of rank ball
# 
# During Lec. 17, p. 9-10, Example 2, we mentioned that the $k$ rank ball
# 
# $$\mathcal{C}_{mn}^{k} := \{X\in\mathbb{R}^{m\times n} \mid \text{rank}(X)\leq k\}, \quad\text{for fixed}\;k\leq \min\{m,n\},$$
# is a nonconvex set.
# 
# To understand this **set nonconvexity**, fix $m=n=3$ and $k=2$. **Construct an example** to demonstrate that the set $\mathcal{C}^{2}_{33}=\{X\in\mathbb{R}^{3\times 3}\mid \text{rank}(X)\leq 2\}$ is nonconvex.
# 
# **Remark:** Notice that even though $\text{rank}(X)$ is a nonconvex **function**, we need to be careful in arguing **set** nonconvexity becuase it is possible for a nonconvex function to have convex sublevel sets.

# # Problem 2 [40 points] Subgradient algorithm
# 
# In practice, we often encounter optimization problems where the objective is a **convex but nondifferentiable function**. To solve such problems, we need the concept of **subgradient**.
# 
# We say a vector $g\in\mathbb{R}^{n}$ is a **subgradient** of $f:\mathbb{R}^{n}\mapsto \mathbb{R}$ at $x\in\textbf{dom}(f)$ if 
# $$f(y) \geq f(x) + g^{\top}(y-x),\quad\text{for all}\quad y\in\textbf{dom}(f).$$
# 
# The above definition is a generalization of the concept of gradient in the sense if $f$ were differentiable, the above would precisely be the first order characterization of convexity. In general, subgradient at a point may not be unique. The set of all subgradients at $x$ is called the **subdifferential** $\partial f(x) := \{g \mid f(y) \geq f(x) + g^{\top}(y-x),\:\text{for all}\:y\in\textbf{dom}(f)\}$.
# 
# As a simple example, consider $f(x) = |x|$ with $\textbf{dom}(f)=\mathbb{R}$. Then for $x>0$, the subgradient $g=1$ is unique, i.e., $\partial f(x) = \{1\}$. Likewise for $x<0$, we have $\partial f(x) = \{-1\}$. At $x=0$, the inequality $|y|\geq g^{\top}y$ is saisfied iff $g\in[-1,1]$. Thus, $\partial f(0) = [-1,1]$. 
# 
# It can be shown that if $f$ is differentiable at $x$, then $\partial f(x) = \{\nabla f(x)\}$, as you intuitively expect. Thus, subgradient $g$ generalizes the concept of gradient.

# ## (a) [15 points] Subgradient and subdifferential of $\ell_{1}$ norm
# 
# Given convex functions $f_{1},...,f_{m}(x)$, let $f(x):=\max\{f_{1}(x), ..., f_{m}(x)\}$. Define $I(x)=\{i\mid f_{i}(x)=f(x)\}$, that is, the index of the "active" functions at $x$. A nice result is this: to compute subgradient $g$ of $f$ at $x$, choose any $i\in I(x)$. Then $g$ can be taken as any subgradient of $f_{i}$ at $x$. That is, if $i$ is an active index, then $g\in\partial f_{i}(x)$ implies $g\in\partial f(x)$. Furthermore, $$\partial f(x) = \text{conv}\left(\underset{i\in I(x)}{\bigcup}\partial f_{i}(x)\right),$$ the convex hull of the union of subdifferentials of "active" functions at $x$. 
# 
# **Use the above result to prove that** if $f(x) = \|x\|_{1}$ with $x\in\mathbb{R}^{n}$, then $\partial f(x) = S_{1} \times S_{2} \times ... \times S_{n}$ (Cartesian product of sets), where $S_{j} := [-1,1]$ for $x_{j}=0$, $S_{j} := \{1\}$ for $x_{j}>0$, and $S_{j} := \{-1\}$ for $x_{j}<0$, $j=1,...,n$. 
# 
# **Hint:** Start by writing $f(x)=\|x\|_{1}$ as the pointwise max of a finite number of functions.

# ## (b) [2 + 3 = 5 points] Algorithm
# 
# To solve a problem of the form $\underset{x\in\mathbb{R}^{n}}{\text{minimize}}\:f(x)$, where $f$ is convex but not everywhere differentiable, a standard iterative algorithm is the **subgradient method**:
# $$x_{k+1} = x_{k} - \eta_{k}g_{k}, \quad k=1,2,...,$$
# 
# where $g_{k}\in\partial f(x_{k})$ is **any** subgradient, and $\eta_{k}>0$ denotes stepsize at the $k$th iteration. Assume that the norm of subgradient is bounded, i.e., there exists $G$ such that $\|g_{k}\|_{2} \leq G$ for all iteration index $k=1,2,...$. 
# 
# Although it looks very similar to gradient descent, subgradient method is not a descent method (see Lec. 18: intro to gradient descent, p. 2). Consider a running estimate of the best value so far: $f^{\text{best}}_{k} = \min\{f(x_{1}), ..., f(x_{k})\}$. Since $f^{\text{best}}_{k}$ is decreasing, it will have a limit, which may be finite or $-\infty$.
# 
# If $x_{\text{opt}}$ is a minimizer, then after $k$ iterations, the subgradient method satisfies:
# 
# $$f^{\text{best}}_{k} - f(x_{\text{opt}}) \leq \dfrac{\|x_{0} - x_{\text{opt}}\|_{2}^{2} + G^{2}\sum_{i=1}^{k}\eta_{i}^{2}}{2\sum_{i=1}^{k}\eta_{i}}.$$
# 
# (i) **From above, what can you tell about the convergence** of subgradient method if the stepsize sequence $\{\eta_{k}\}$ is not summable but square summable, i.e., if $\sum_{k=1}^{\infty}\eta_{k} = \infty$, and $\sum_{k=1}^{\infty}\eta_{k}^{2} < \infty$.
# 
# (ii) **What can you tell about the convergence** of subgradient method if we choose the stepsize $\eta_{k} \equiv 1/k$ for all $k$?

# ## (c) [5 + 15 = 20 points] Numerical case study
# 
# A popular problem in statistics and machine learning is the so called "lasso problem", given by
# 
# $$\underset{x\in\mathbb{R}^{n}}{\text{minimize}}\:\frac{1}{2}\|Ax-b\|_{2}^{2} + \gamma \|x\|_{1},$$
# 
# where $A\in\mathbb{R}^{m\times n}$, $b\in\mathbb{R}^{m}$, and $\gamma>0$ is a regularization parameter. For $\gamma = 0$, it reduces to ordinary least squares. For $\gamma>0$, it solves a linear regression problem while promoting sparsity in the minimizer. The objective function is a sum of two terms, and is convex but nondifferentiable. The first summand in the objective promotes data fidelity while the second promotes sparsity.  
# 
# (i) Use your answer in part (a) to **explicitly write down the subgradient update rule** for the lasso problem.
# 
# (ii) Use the MATLAB starter code $\texttt{AM229F20HW8problem.m}$ inside CANVAS Files section, folder: HW problems and solutions, to **report the optimal value and cpu time for cvx solution** of the lasso problem for given problem data. Also **report the 10,000th iterate value of the objective and cpu time for subgradient method** for the same data with $\eta_k \equiv 1/k$ using the same code. **Please submit your completed code**. 
# 
# You can also submit the Python equivalent of the completed starter code with cvxpy but please make sure to use the same numerical values for the problem data.