Consider the maximum eigenvalue $\lambda_{\max}$ and the minimum eigenvalue $\lambda_{\min}$ for any $X\in\mathbb{S}^{n}$.
The purpose of this exercise is to investigate whether $\lambda_{\max}(X)$ and $\lambda_{\min}(X)$ are convex functions of the matrix variable $X\in\mathbb{S}^{n}$.
For any $x\in\mathbb{R}^{n}$ and any $X\in\mathbb{S}^{n}$, derive the inequality:
$$\lambda_{\min}(X) \leq \dfrac{x^{\top}Xx}{x^{\top}x} \leq \lambda_{\max}(X).$$(Hint: start from the spectral theorem applied to a symmetric matrix.)
By the spectral theorem, $X = VDV^{\top}$ where $V$ is orthogonal, and $D$ is the diagonal matrix with its diagonal entries being the eigenvalues of $X$, i.e., $D_{ii} = \lambda_{i}(X)$.
Let $y:= V^{\top}x$. Then $y^{\top}y = x^{\top}x$. Also, $$x^{\top}Xx = x^{\top}VDV^{\top}x = y^{\top}Dy = \sum_{i=1}^{n}\lambda_{i}(X)y_{i}^{2}.$$ Since $$\lambda_{\min}(X)\|y\|_{2}^{2} \leq \sum_{i=1}^{n}\lambda_{i}(X)y_{i}^{2} \leq \lambda_{\max}(X)\|y\|_{2}^{2} \quad\Leftrightarrow\quad \lambda_{\min}(X)\|x\|_{2}^{2} \leq x^{\top}Xx \leq \lambda_{\max}(X)\|x\|_{2}^{2},$$ we therefore get $$\lambda_{\min}(X) \leq \dfrac{x^{\top}Xx}{x^{\top}x} \leq \lambda_{\max}(X),$$ as desired.
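As a quick numeric sanity check (our own addition, not part of the assigned solution; the dimension, seed, and use of `numpy` are arbitrary choices), one can sample a random symmetric matrix and confirm that the Rayleigh quotient never escapes the interval $[\lambda_{\min}(X), \lambda_{\max}(X)]$:

```python
import numpy as np

# Sample a random symmetric matrix X in S^n (n = 5 is an arbitrary choice).
rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))
X = (A + A.T) / 2
lams = np.linalg.eigvalsh(X)  # eigenvalues in ascending order

# The Rayleigh quotient of any nonzero x must lie in [lambda_min, lambda_max].
for _ in range(1000):
    x = rng.standard_normal(5)
    r = x @ X @ x / (x @ x)
    assert lams[0] - 1e-12 <= r <= lams[-1] + 1e-12
print("Rayleigh quotient always within [lambda_min(X), lambda_max(X)]")
```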
Use part (a) to prove the following two formulas:
$$\lambda_{\max}(X) = \underset{\|x\|_{2}=1}{\sup} x^{\top}X x, \qquad \lambda_{\min}(X) = \underset{\|x\|_{2}=1}{\inf} x^{\top}X x, \qquad x\in\mathbb{R}^{n}, \quad X\in\mathbb{S}^{n}.$$(Hint: think about if and when equality is achieved in the weak inequalities derived in part (a).)
Let $v_{\max}$ be the normalized eigenvector associated with $\lambda_{\max}(X)$, i.e., $Xv_{\max} = \lambda_{\max}(X)v_{\max}$. Likewise, $Xv_{\min} = \lambda_{\min}(X)v_{\min}$.
Then, $v_{\max}^{\top}Xv_{\max} = \lambda_{\max}(X)v_{\max}^{\top}v_{\max} = \lambda_{\max}(X)$. Similarly, $v_{\min}^{\top}Xv_{\min} = \lambda_{\min}(X)v_{\min}^{\top}v_{\min} = \lambda_{\min}(X)$. By part (a), every unit vector $x$ satisfies $\lambda_{\min}(X) \leq x^{\top}Xx \leq \lambda_{\max}(X)$, and the vectors $v_{\max}$ and $v_{\min}$ achieve the right and left bounds, respectively. Hence the supremum equals $\lambda_{\max}(X)$ and the infimum equals $\lambda_{\min}(X)$, proving the statements.
Use part (b) to argue which of the two functions $f(X) := \lambda_{\max}(X)$ and $g(X) = \lambda_{\min}(X)$ is/are convex over $\textbf{dom}(f) = \textbf{dom}(g) = \mathbb{S}^{n}$. Explain.
(Hint: think operations preserving function convexity/concavity from Lec. 8.)
Since $x^{\top}X x$ is linear in $X$, it is both convex and concave in $X$.
From Prob. 1(b), $\lambda_{\max}(X)$ is the pointwise supremum of a family of convex (indeed linear) functions of $X$, which by Lec. 8, p. 15, is convex over $\mathbb{S}^{n}$.
From Prob. 1(b), $\lambda_{\min}(X)$ is the pointwise infimum of a family of concave (indeed linear) functions of $X$, which by Lec. 8, p. 15, is concave over $\mathbb{S}^{n}$.
Hence $f(X) = \lambda_{\max}(X)$ is convex, while $g(X) = \lambda_{\min}(X)$ is concave, over $\mathbb{S}^{n}$.
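A numeric spot-check of this conclusion (again our own addition, using `numpy` with arbitrary seed and dimension): along any matrix segment $\theta X_{1} + (1-\theta)X_{2}$ in $\mathbb{S}^{n}$, $\lambda_{\max}$ should satisfy the convexity inequality and $\lambda_{\min}$ the concavity inequality.

```python
import numpy as np

rng = np.random.default_rng(1)
def sym(n):
    """Random matrix in S^n."""
    A = rng.standard_normal((n, n))
    return (A + A.T) / 2

X1, X2 = sym(4), sym(4)
lmax = lambda X: np.linalg.eigvalsh(X)[-1]
lmin = lambda X: np.linalg.eigvalsh(X)[0]

for theta in np.linspace(0.0, 1.0, 11):
    Xt = theta * X1 + (1 - theta) * X2
    # convexity of lambda_max and concavity of lambda_min along the segment
    assert lmax(Xt) <= theta * lmax(X1) + (1 - theta) * lmax(X2) + 1e-12
    assert lmin(Xt) >= theta * lmin(X1) + (1 - theta) * lmin(X2) - 1e-12
print("lambda_max convex and lambda_min concave along the segment")
```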
Consider the support function $h(y)$ of any compact convex set introduced in HW3, Prob. 1(b).
Is the support function convex or concave or neither? Give reasons to explain your answer.
(Hint: again think operations preserving function convexity/concavity from Lec. 8.)
The support function $h(y)$ is convex in $y\in\mathbb{R}^{n}$.
By definition, $h(y) := \underset{x\in\mathcal{X}}{\sup}x^{\top}y$, i.e., the pointwise supremum of functions that are linear (hence convex) in $y$. Also, $\textbf{dom}(h)=\mathbb{R}^{n}$ is an affine, hence convex set.
Therefore, by Lec. 8, p. 15, $h(y)$ is a convex function of $y\in\mathbb{R}^{n}$.
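To illustrate numerically (our own addition; we take the compact convex set to be the convex hull of a random finite point set, since the sup over the hull is attained at one of the points):

```python
import numpy as np

# h is the support function of the convex hull of a finite point set.
rng = np.random.default_rng(2)
pts = rng.standard_normal((20, 3))
h = lambda y: np.max(pts @ y)

y1, y2 = rng.standard_normal(3), rng.standard_normal(3)
for theta in np.linspace(0.0, 1.0, 11):
    yt = theta * y1 + (1 - theta) * y2
    # convexity inequality: h(theta*y1 + (1-theta)*y2) <= theta*h(y1) + (1-theta)*h(y2)
    assert h(yt) <= theta * h(y1) + (1 - theta) * h(y2) + 1e-12
print("support function is convex in y")
```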
(i) What is the function $f(x)$ whose convex conjugate equals the support function $h(y)$? Give reasons to explain your answer.
(ii) What is the geometric interpretation of your result in 2(b)(i)?
(i) Comparing the definitions of the Legendre-Fenchel conjugate with that of the support function of set $\mathcal{X}$, it follows that $h(y)$ is the Legendre-Fenchel conjugate of the indicator function of $\mathcal{X}$.
More specifically, $h(y)=f^{*}(y)$ where $$ f(x) \equiv \mathbf{1}_{\mathcal{X}}(x) := \begin{cases}0 & \text{if}\quad x\in\mathcal{X}\\ +\infty &\text{otherwise.}\end{cases}$$
(ii) The interpretation is as follows. The indicator function uniquely describes the set, so its conjugate carries the same information. This analytically confirms the geometric intuition given in HW3, Prob. 1(b) that the support function $h(y)$ uniquely describes a (closed) convex set.
(i) Is the epigraph of support function: $\text{epi}(h)$ a cone? Why/why not?
(ii) Is $\text{epi}(h)$ a convex cone? Why/why not?
(i) Yes, $\text{epi}(h)$ is a cone.
To see why, first notice that the support function is positively homogeneous of degree one: for any $\alpha>0$, $$h(\alpha y) = \underset{x\in\mathcal{X}}{\sup}\,x^{\top}(\alpha y) = \alpha\underset{x\in\mathcal{X}}{\sup}\,x^{\top}y = \alpha h(y).$$ Hence, if $(y,t)\in\text{epi}(h)$, i.e., $y\in\textbf{dom}(h)$ and $h(y)\leq t$, then $h(\alpha y) = \alpha h(y) \leq \alpha t$, so $(\alpha y, \alpha t)\in\text{epi}(h)$ for every $\alpha>0$.
This proves that $\text{epi}(h)$ is a cone.
(ii) Yes, $\text{epi}(h)$ is a convex cone.
From 2(a), we know that $h(y)$ is a convex function. So its epigraph $\text{epi}(h)$ must be a convex set (Lec. 8, p. 3). From 2(c)(i), we also know that $\text{epi}(h)$ is a cone. Combining the two, the statement follows.
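The homogeneity argument behind 2(c)(i) is easy to verify numerically (our own addition; the finite point set standing in for $\mathcal{X}$ is an arbitrary choice):

```python
import numpy as np

# Support function of (the convex hull of) a random finite point set.
rng = np.random.default_rng(3)
pts = rng.standard_normal((10, 2))
h = lambda y: np.max(pts @ y)

# Positive homogeneity h(alpha*y) = alpha*h(y) means epi(h) is closed
# under the scaling (y, t) -> (alpha*y, alpha*t) for alpha > 0.
y = rng.standard_normal(2)
t = h(y) + 0.5                        # (y, t) is a point of epi(h)
for alpha in (0.1, 2.0, 7.5):
    assert np.isclose(h(alpha * y), alpha * h(y))
    assert h(alpha * y) <= alpha * t  # (alpha*y, alpha*t) stays in epi(h)
print("epi(h) is closed under positive scaling")
```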
Given a strictly convex differentiable function $\psi$ over a closed convex set $\mathcal{X}$, the associated Bregman divergence $D_{\psi}:\mathcal{X}\times\mathcal{X}\to\mathbb{R}_{\geq 0}$ is defined as
$$D_{\psi}(x,y) := \psi(x) - \psi(y) - \langle\nabla\psi(y), x-y\rangle, \qquad\forall x,y\in\mathcal{X}.$$The above formula has a clean geometric meaning: the Bregman divergence quantifies the amount by which a strictly convex function lies above its linear approximation (tangent hyperplane).
For example, when $\psi(x) = \|x\|_{2}^{2}$, then for any closed convex $\mathcal{X}\subseteq\mathbb{R}^{n}$, we have $D_{\psi}(x,y) = \|x-y\|_{2}^{2}$, the squared Euclidean distance. (It is a good idea to verify this yourself!)
As another example, when $\psi(x) = \sum_{i=1}^{n}x_{i}\log x_{i}$, $\mathcal{X}\equiv \{x\in\mathbb{R}^{n}_{\geq 0}\mid \sum_{i=1}^{n}x_{i}=1\}$ (the standard simplex, see Lec. 5, p. 15), then $D_{\psi}(x,y) = \sum_{i=1}^{n}x_{i}\log(x_{i}/y_{i})$, the Kullback-Leibler divergence/relative entropy. (Again, good idea to verify this yourself!)
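The two worked examples above can be verified numerically (our own addition; the generic `bregman` helper and the random test points are our choices):

```python
import numpy as np

# Generic Bregman divergence from psi and its gradient.
def bregman(psi, grad, x, y):
    return psi(x) - psi(y) - grad(y) @ (x - y)

rng = np.random.default_rng(4)
x, y = rng.random(4) + 0.1, rng.random(4) + 0.1

# psi(x) = ||x||_2^2 gives the squared Euclidean distance.
d_euc = bregman(lambda v: v @ v, lambda v: 2 * v, x, y)
assert np.isclose(d_euc, np.sum((x - y) ** 2))

# Negative entropy on the simplex gives the KL divergence.
x, y = x / x.sum(), y / y.sum()
d_kl = bregman(lambda v: np.sum(v * np.log(v)),
               lambda v: np.log(v) + 1, x, y)
assert np.isclose(d_kl, np.sum(x * np.log(x / y)))
print("both closed forms verified")
```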
Even though $D_{\psi}(x,y)\neq D_{\psi}(y,x)$ in general, the Bregman divergence has many nice properties, and appears frequently in machine learning applications.
(i) Is $D_{\psi}(x,y)$ strictly convex in $x$? Why/why not?
(ii) If $\psi$ is $m$-strongly convex, then prove that $D_{\psi}(x,y) \geq \frac{m}{2}\|x-y\|_{2}^{2}$, and that $D_{\psi}(x,y)$ is $m$-strongly convex in $x$.
(iii) Construct a counterexample to demonstrate that $D_{\psi}(x,y)$ need not be convex in its second argument $y$.
(i) Yes, it is strictly convex in $x$.
This follows from the strict convexity of the function $\psi$, because for any $x_{1}\neq x_{2}\in\mathcal{X}$ and any $\theta$ satisfying $0<\theta<1$, we get
\begin{align*} D_\psi(\theta x_{1} + (1-\theta)x_{2}, y) &= \psi(\theta x_{1}+(1-\theta)x_{2}) - \psi(y) - \langle\nabla\psi(y), \theta x_{1}+(1-\theta)x_{2}-y\rangle\\ &< \theta\psi(x_{1}) + (1-\theta)\psi(x_{2}) - \left(\theta + (1-\theta)\right)\psi(y) - \theta \langle\nabla\psi(y),x_1 - y\rangle - (1-\theta) \langle\nabla\psi(y),x_2 - y\rangle\\ &= \theta D_\psi(x_{1}, y) + (1-\theta) D_\psi(x_{2}, y), \end{align*}establishing the claim.
(ii) By definition of Bregman divergence, we have $$D_{\psi}(x,y) = \psi(x) - \psi(y) - \langle\nabla\psi(y), x-y\rangle \geq \frac{m}{2}\|x-y\|_{2}^{2},$$ where the last inequality follows from the definition of $m$-strong convexity of $\psi$ (Lec. 9, p. 1). This proves the first statement.
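A numeric check of this lower bound (our own addition; the choice $\psi(x)=\|x\|_{2}^{2}+\log\sum_{i}e^{x_{i}}$ is an assumption for illustration, $2$-strongly convex since its Hessian is $2I$ plus a PSD log-sum-exp Hessian):

```python
import numpy as np

# psi(x) = ||x||^2 + logsumexp(x): Hessian = 2I + (PSD term), so m = 2.
psi  = lambda v: v @ v + np.log(np.sum(np.exp(v)))
grad = lambda v: 2 * v + np.exp(v) / np.sum(np.exp(v))
D    = lambda x, y: psi(x) - psi(y) - grad(y) @ (x - y)

m = 2.0
rng = np.random.default_rng(5)
for _ in range(200):
    x, y = rng.standard_normal(3), rng.standard_normal(3)
    # the strong-convexity lower bound D_psi(x,y) >= (m/2) ||x - y||^2
    assert D(x, y) >= m / 2 * np.sum((x - y) ** 2) - 1e-12
print("strong-convexity lower bound holds on all samples")
```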
Next, to show that $D_{\psi}(x,y)$ is $m$-strongly convex in $x$, we start with a general result:
the sum of an $m$-strongly convex function $f(x)$ and a convex function $g(x)$ is $m$-strongly convex.
To see this, let $h=f+g$. We have: $f(y)\geq f(x) + \langle\nabla f(x) , y-x\rangle + \frac{m}{2}\|y-x\|_{2}^{2}$, and $g(y) \geq g(x) + \langle\nabla g(x) , y-x\rangle$. Adding the two inequalities, we get $$h(y) \geq h(x) + \langle\nabla h(x) , y-x\rangle + \frac{m}{2}\|y-x\|_{2}^{2}, \quad \forall\;x,y\in\mathcal{X},$$ showing that $h(x)=f(x)+g(x)$ is $m$-strongly convex in $x$. To apply this result in the definition of Bregman divergence, set $f(x)\equiv \psi(x)$ and $g(x)\equiv - \left(\psi(y) + \langle\nabla\psi(y), x-y\rangle\right)$, which is affine, hence convex, in $x$. This proves that $D_{\psi}(x,y)$ is $m$-strongly convex in $x$.
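For part (iii), one standard counterexample (our choice, not taken from the source) is $\psi(t)=t^{4}$ on $\mathbb{R}$: it is strictly convex, yet $D_{\psi}(x,y)=x^{4}-y^{4}-4y^{3}(x-y)=x^{4}+3y^{4}-4xy^{3}$ fails midpoint convexity in $y$, as this sketch checks:

```python
# Counterexample sketch: psi(t) = t^4 is strictly convex, but
# D_psi(x, y) = x^4 + 3y^4 - 4xy^3 is not convex in y.
D = lambda x, y: x**4 - y**4 - 4 * y**3 * (x - y)

x, y1, y2 = 1.0, 0.4, 0.6
mid = D(x, 0.5 * (y1 + y2))            # value at the midpoint in y
avg = 0.5 * D(x, y1) + 0.5 * D(x, y2)  # average of the endpoint values
assert mid > avg  # convexity in y would require mid <= avg
print(f"D at midpoint {mid:.4f} > average {avg:.4f}: not convex in y")
```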
The Bregman divergence $D_{\psi}(x,y)$ behaves like the square of a generalized distance between elements in $\mathcal{X}$.
(i) Prove that $$\nabla_{x}D_{\psi}(x,y) = \nabla_{x}\psi(x) - \nabla_{y}\psi(y),$$
where the subscript of the gradient operator $\nabla$ indicates the vector with respect to which the gradient is taken.
We have \begin{align*} D_{\psi}(x,y) &= \psi(x) - \psi(y) - \left(\nabla_{y}\psi(y)\right)^{\top}x + \left(\nabla_{y}\psi(y)\right)^{\top}y,\\ \Rightarrow \nabla_{x}D_{\psi}(x,y) &= \nabla_{x}\psi(x) - 0 - \nabla_{y}\psi(y) + 0, \end{align*} hence the result.
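This gradient identity can be spot-checked with central finite differences (our own addition; the quartic-plus-quadratic $\psi$ below is an arbitrary strictly convex choice):

```python
import numpy as np

psi  = lambda v: np.sum(v ** 4) + v @ v   # strictly convex on R^3
grad = lambda v: 4 * v ** 3 + 2 * v       # its exact gradient
D    = lambda x, y: psi(x) - psi(y) - grad(y) @ (x - y)

rng = np.random.default_rng(6)
x, y = rng.standard_normal(3), rng.standard_normal(3)

# Central finite differences of D(., y) at x.
eps, g_fd = 1e-6, np.zeros(3)
for i in range(3):
    e = np.zeros(3); e[i] = eps
    g_fd[i] = (D(x + e, y) - D(x - e, y)) / (2 * eps)

# Should match grad psi(x) - grad psi(y) from part (i).
assert np.allclose(g_fd, grad(x) - grad(y), atol=1e-4)
print("nabla_x D_psi(x, y) = grad psi(x) - grad psi(y) confirmed numerically")
```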
(ii) Prove the generalized law of cosines:
$$D_{\psi}(x,y) = D_{\psi}(x,z) + D_{\psi}(z,y) - \langle x-z, \nabla\psi(y) - \nabla\psi(z)\rangle, \qquad \forall x,y,z\in\mathcal{X}.$$Remark: Notice that for $\psi(x) = \|x\|_{2}^2$, this reduces to $\|x-y\|_{2}^{2} = \|x-z\|_{2}^{2} + \|z-y\|_{2}^{2} - 2\underbrace{\langle x-z, y-z\rangle}_{\|x-z\|_{2}\|y-z\|_{2}\cos\angle yzx}$.
By definition of Bregman divergence, we have
\begin{align*} D_{\psi}(x,z) + D_{\psi}(z,y) &= \psi(x) - \psi(y) - \langle x - z, \nabla\psi(z)\rangle - \langle z-y,\nabla\psi(y)\rangle. \end{align*}Adding and subtracting $\langle \nabla\psi(y), x-y\rangle$ to the RHS above, we get
\begin{align*} D_{\psi}(x,z) + D_{\psi}(z,y) &= D_{\psi}(x,y) + \langle x,\nabla\psi(y)-\nabla\psi(z)\rangle - \langle z, \nabla\psi(y) - \nabla\psi(z)\rangle\\ &= D_{\psi}(x,y) + \langle x-z,\nabla\psi(y)-\nabla\psi(z)\rangle,\\ \Rightarrow D_{\psi}(x,y) &= D_{\psi}(x,z) + D_{\psi}(z,y) - \langle x-z, \nabla\psi(y) - \nabla\psi(z)\rangle. \end{align*}(iii) Recall that $\mathcal{X}\equiv\textbf{dom}(\psi)$. Consider a closed convex set $\mathcal{C}$ such that $\mathcal{C}\cap\text{interior}\left(\mathcal{X}\right)\neq\emptyset$. For notational convenience, introduce $\mathcal{A}:=\mathcal{C}\cap\text{interior}\left(\mathcal{X}\right)$, which is a convex set.
Define the Bregman projection of $y$ onto $\mathcal{A}$ with respect to $\psi$ as
$${\rm{proj}}_{\mathcal{A}}\left(y\right) := \underset{a\in\mathcal{A}}{\arg\min} \,D_{\psi}\left(a, y\right).$$Using parts (i)-(ii), prove the generalized Pythagorean theorem: $$D_{\psi}(x,y) \geq D_{\psi}(x, {\rm{proj}}_{\mathcal{A}}\left(y\right)) \: + \: D_{\psi}( {\rm{proj}}_{\mathcal{A}}\left(y\right),y).$$
Hint: If $a^{\text{opt}}$ is a local minimizer of the (possibly nonconvex) function $f(a)$ over $\mathcal{A}$, then the directional derivative satisfies $$\bigg\langle a - a^{\text{opt}}, \nabla_{a} f(a)\big\vert_{a=a^{\text{opt}}}\bigg\rangle \geq 0, \qquad \forall a\in\mathcal{A}.$$ If $f$ is convex over $\mathcal{A}$, then the above inequality is also sufficient to guarantee that $a^{\text{opt}}$ is a minimizer.
Remark: In the generalized Pythagorean theorem, the equality is achieved when the set $\mathcal{A}$ is affine.
To demonstrate $D_{\psi}(x,y) \:-\: D_{\psi}(x, {\rm{proj}}_{\mathcal{A}}\left(y\right)) \: - \: D_{\psi}( {\rm{proj}}_{\mathcal{A}}\left(y\right),y) \geq 0$, we start by specializing part 3(b)(ii) for $z \equiv {\rm{proj}}_{\mathcal{A}}\left(y\right)$. This gives
$$D_{\psi}(x,y) \:-\: D_{\psi}(x, {\rm{proj}}_{\mathcal{A}}\left(y\right)) \: - \: D_{\psi}( {\rm{proj}}_{\mathcal{A}}\left(y\right),y) = \bigg\langle x - {\rm{proj}}_{\mathcal{A}}\left(y\right), \nabla_{a}D_{\psi}(a,y)\bigg\vert_{a = {\rm{proj}}_{\mathcal{A}}\left(y\right)}\bigg\rangle.$$Using the given hint with $f(a)\equiv D_{\psi}(a,y)$ and $a^{\text{opt}} = {\rm{proj}}_{\mathcal{A}}\left(y\right)$, the inner product in the above RHS is $\geq 0$ for any $x\in\mathcal{A}$, completing the proof.
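As a final numeric sanity check (our own addition): for $\psi(x)=\|x\|_{2}^{2}$ the Bregman projection onto the box $\mathcal{A}=[0,1]^{3}$ is the Euclidean projection, i.e., coordinatewise clipping, and the generalized Pythagorean inequality reduces to the familiar Euclidean one.

```python
import numpy as np

D = lambda a, b: np.sum((a - b) ** 2)   # Bregman divergence for psi = ||.||^2

rng = np.random.default_rng(7)
for _ in range(200):
    y = 2 * rng.standard_normal(3)
    p = np.clip(y, 0.0, 1.0)            # Bregman (= Euclidean) projection onto A
    x = rng.random(3)                   # any x in A = [0, 1]^3
    # generalized Pythagorean inequality, specialized to the Euclidean case
    assert D(x, y) >= D(x, p) + D(p, y) - 1e-12
print("generalized Pythagorean inequality holds on all samples")
```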