$\newcommand{\vct}[1]{\mathbf{#1}}$ $\newcommand{\mtx}[1]{\mathbf{#1}}$ $\newcommand{\e}{\varepsilon}$ $\newcommand{\norm}[1]{\|#1\|}$ $\newcommand{\minimize}{\text{minimize}\quad}$ $\newcommand{\maximize}{\text{maximize}\quad}$ $\newcommand{\subjto}{\quad\text{subject to}\quad}$ $\newcommand{\R}{\mathbb{R}}$ $\newcommand{\trans}{T}$ $\newcommand{\ip}[2]{\langle {#1}, {#2} \rangle}$ $\newcommand{\zerovct}{\vct{0}}$ $\newcommand{\diff}[1]{\mathrm{d}{#1}}$ $\newcommand{\conv}{\operatorname{conv}}$ $\newcommand{\inter}{{\operatorname{int}}}$
In this final lecture we address several topics related to the exam. The main talking points are listed below. Other than that, the past exam, the midterm test, and Part A of the problem sheets should serve as good guidance.
It is essential that you are comfortable manipulating vectors and matrices, and are able to compute gradients of multivariate functions with ease. If, for example, $\mtx{A}\in \R^{n\times n}$, $\vct{w}\in \R^n$ and $\vct{x}\in \R^n$, then
\begin{equation*} \vct{w}^{\trans}\mtx{A}\vct{x} = \vct{x}^{\trans}\mtx{A}^{\trans}\vct{w} = \ip{\vct{w}}{\mtx{A}\vct{x}} = \ip{\mtx{A}^{\trans}\vct{w}}{\vct{x}}. \end{equation*}The gradient with respect to $\vct{x}$ of $f(\vct{x}) = \vct{w}^{\trans}\mtx{A}\vct{x}$ is
\begin{equation*} \nabla_{\vct{x}} f(\vct{x}) = \mtx{A}^{\trans}\vct{w}. \end{equation*}Note in particular that the gradient $\nabla_{\vct{x}}$ of a linear form $\vct{w}^{\trans}\vct{x}$ is $\vct{w}$, and the gradient $\nabla_{\vct{w}}$ is $\vct{x}$. For a quadratic function $f(\vct{x})=\vct{x}^{\trans}\mtx{A}\vct{x}$,
\begin{equation*} \nabla_{\vct{x}} f(\vct{x}) = (\mtx{A}+\mtx{A}^{\trans})\vct{x}, \end{equation*}which equals $2\mtx{A}\vct{x}$ when $\mtx{A}$ is symmetric. It is also recommended to have a look at the Preliminaries document, which is included as an appendix in the complete set of lecture notes. If in doubt, you may want to write the functions out in coordinates and work as in Example 2.11.
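These identities are easy to sanity-check numerically. The following sketch (not part of the notes; `finite_diff_grad` is a hypothetical helper) compares the analytic gradients above against central finite differences using NumPy.

```python
import numpy as np

def finite_diff_grad(f, x, h=1e-6):
    """Approximate the gradient of f at x by central differences."""
    g = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x)
        e[i] = h
        g[i] = (f(x + e) - f(x - e)) / (2 * h)
    return g

rng = np.random.default_rng(0)
n = 4
A = rng.standard_normal((n, n))
w = rng.standard_normal(n)
x = rng.standard_normal(n)

# f(x) = w^T A x has gradient A^T w.
assert np.allclose(finite_diff_grad(lambda x: w @ A @ x, x), A.T @ w, atol=1e-4)

# f(x) = x^T A x has gradient (A + A^T) x, i.e. 2 A x for symmetric A.
assert np.allclose(finite_diff_grad(lambda x: x @ A @ x, x), (A + A.T) @ x, atol=1e-4)
```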
We have seen various criteria for sets to be convex, and for functions to be convex.
Convexity of functions.
Convexity of sets.
It helps to have a collection of convex functions and convex sets in mind. When determining whether a set in $\R^2$ is convex (or whether a function of one variable is), a small sketch may help.
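Another cheap way to build intuition, complementing the criteria above, is to spot-check the defining inequality $f(t\vct{x} + (1-t)\vct{y}) \leq t f(\vct{x}) + (1-t) f(\vct{y})$ at random points. The sketch below (illustrative only; a failed check certifies non-convexity, while a passed check proves nothing) does this for functions of one variable.

```python
import numpy as np

def looks_convex(f, rng, samples=1000):
    """Spot-check f(t*x + (1-t)*y) <= t*f(x) + (1-t)*f(y) at random points."""
    for _ in range(samples):
        x, y = rng.uniform(-5, 5, size=2)
        t = rng.uniform()
        if f(t * x + (1 - t) * y) > t * f(x) + (1 - t) * f(y) + 1e-12:
            return False  # a certificate of non-convexity
    return True  # no violation found (not a proof of convexity)

rng = np.random.default_rng(1)
print(looks_convex(np.exp, rng))  # True: exp is convex
print(looks_convex(np.sin, rng))  # False (almost surely): sin is not convex
```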
For convex functions defined on $\R^2$, one should be able to sketch the level sets.
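For example, for $f(x_1,x_2) = x_1^2 + 4x_2^2$ the level sets are nested ellipses. A quick plotting sketch (assuming matplotlib is available) for checking your hand-drawn pictures:

```python
import numpy as np
import matplotlib.pyplot as plt

# Level sets of the convex quadratic f(x1, x2) = x1^2 + 4*x2^2: nested ellipses.
x1, x2 = np.meshgrid(np.linspace(-3, 3, 200), np.linspace(-3, 3, 200))
plt.contour(x1, x2, x1**2 + 4 * x2**2, levels=[0.5, 1, 2, 4, 8])
plt.gca().set_aspect("equal")
plt.xlabel("$x_1$")
plt.ylabel("$x_2$")
plt.show()
```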
We have seen a class of algorithms called descent algorithms, the most important of which are gradient descent and Newton's method.
Note that we encountered Newton's method in two guises: as a minimization algorithm, and as a root-finding algorithm. The minimization version of Newton's algorithm for a function $f$ is merely the root-finding version of Newton's method for the gradient $\nabla f$.
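To make the correspondence concrete, here is a minimal one-dimensional sketch (illustrative, with a hypothetical example function): the root-finding iteration $x \leftarrow x - g(x)/g'(x)$ applied to $g = f'$ is exactly the minimization iteration $x \leftarrow x - f'(x)/f''(x)$.

```python
def newton_root(g, dg, x, steps=20):
    """Newton's method for solving g(x) = 0."""
    for _ in range(steps):
        x = x - g(x) / dg(x)
    return x

# Minimizing f(x) = x^4 - 3x^2 + x means finding a root of f'(x) = 4x^3 - 6x + 1.
df = lambda x: 4 * x**3 - 6 * x + 1    # gradient of f
d2f = lambda x: 12 * x**2 - 6          # second derivative of f
print(newton_root(df, d2f, x=2.0))     # converges to a critical point of f
```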
For the descent methods considered, it will be important to
Associated with iterative algorithms is the notion of rates of convergence. You should be able to recognise the rates of convergence of the algorithms discussed, as well as those of given sequences of numbers. When do gradient descent or Newton's method fail to converge?
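As a reminder (the standard definitions, stated for a sequence of errors $e_k \to 0$):
\begin{align*}
\text{linear:} \quad & \lim_{k\to\infty} \frac{e_{k+1}}{e_k} = c \in (0,1), && \text{e.g. } e_k = 2^{-k},\\
\text{superlinear:} \quad & \lim_{k\to\infty} \frac{e_{k+1}}{e_k} = 0, && \text{e.g. } e_k = k^{-k},\\
\text{quadratic:} \quad & \limsup_{k\to\infty} \frac{e_{k+1}}{e_k^2} < \infty, && \text{e.g. } e_k = 2^{-2^k}.
\end{align*}Typically, gradient descent converges linearly (for smooth, strongly convex functions), while Newton's method converges quadratically once the iterates are close enough to the minimizer.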
We have seen LP from the point of view of theory and from the point of view of applications.
Applications.
One should be able to switch easily between primal and dual formulations; one standard primal-dual pair is recalled below.
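For reference (in the notation of these notes; the course may have used an equivalent standard form), the primal problem
\begin{equation*} \minimize \ip{\vct{c}}{\vct{x}} \subjto \mtx{A}\vct{x}\geq \vct{b}, \ \vct{x}\geq \zerovct \end{equation*}
has the dual
\begin{equation*} \maximize \ip{\vct{b}}{\vct{y}} \subjto \mtx{A}^{\trans}\vct{y}\leq \vct{c}, \ \vct{y}\geq \zerovct. \end{equation*}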
Algorithms / solving.
Theory.
One should also be comfortable working with convex and conic hulls of points. For example, given vectors $\vct{a}_1,\dots,\vct{a}_p\in \R^n$, the set of points \begin{equation*} \sum_{i=1}^p x_i \vct{a}_i, \quad x_i\geq 0, \end{equation*} is a convex set (indeed a convex cone), and can be written as $\{\mtx{A}\vct{x} : \vct{x}\geq \zerovct\}$, where $\mtx{A}$ is the matrix with columns $\vct{a}_i$.
Also, if the matrix $\mtx{A}$ consists of rows $\vct{a}_i^{\trans}$, then the set of $\vct{x}$ such that $\mtx{A}\vct{x}\leq \zerovct$ is the same as the set of $\vct{x}$ such that $\ip{\vct{x}}{\vct{a}_i}\leq 0$ for all $i$.
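Membership of a point $\vct{b}$ in such a cone is itself an LP feasibility problem: does $\mtx{A}\vct{x}=\vct{b}$ have a solution with $\vct{x}\geq \zerovct$, where $\mtx{A}$ has columns $\vct{a}_i$? A sketch using SciPy (assuming it is available; `in_conic_hull` is a hypothetical helper):

```python
import numpy as np
from scipy.optimize import linprog

def in_conic_hull(A, b):
    """Check whether b lies in the conic hull of the columns of A,
    i.e. whether A x = b has a solution with x >= 0 (an LP feasibility problem)."""
    res = linprog(c=np.zeros(A.shape[1]), A_eq=A, b_eq=b,
                  bounds=[(0, None)] * A.shape[1])
    return res.success

A = np.array([[1.0, 0.0], [0.0, 1.0]])   # columns span the nonnegative quadrant
print(in_conic_hull(A, np.array([2.0, 3.0])))    # True
print(in_conic_hull(A, np.array([-1.0, 1.0])))   # False
```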
An important point to notice here is the close analogy with linear programming theory.
Theory (in order of complexity).
The above tasks can only be
It can be useful to remember some special cases. These include:
The barrier function is a concept that we did not strictly encounter in LP and is important in the context of non-linear convex optimization; its definition is recalled below. It can be helpful to be able to switch quickly between matrix notation and listing all the constraints individually!
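As a reminder (assuming, as is standard, the logarithmic barrier for the linear constraints $\mtx{A}\vct{x}\leq \vct{b}$ with rows $\vct{a}_i^{\trans}$):
\begin{equation*} F(\vct{x}) = -\sum_{i=1}^m \log\left(b_i - \ip{\vct{a}_i}{\vct{x}}\right), \end{equation*}
defined on the interior of the feasible set, with $F(\vct{x})\to \infty$ as $\vct{x}$ approaches the boundary. Note how the matrix form $\mtx{A}\vct{x}\leq \vct{b}$ and the individual constraints $\ip{\vct{a}_i}{\vct{x}}\leq b_i$, $i=1,\dots,m$, appear side by side here.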
There are two things to remember with respect to semidefinite programming:
As for examples, one should know how to transform simple problems into semidefinite programs. For example:
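One classic instance (a standard textbook example, not necessarily the one used in the lectures) is minimizing the largest eigenvalue of a symmetric matrix $\mtx{A}(\vct{x}) = \mtx{A}_0 + x_1\mtx{A}_1 + \cdots + x_k\mtx{A}_k$ depending affinely on $\vct{x}$:
\begin{equation*} \minimize t \subjto t\mtx{I} - \mtx{A}(\vct{x}) \succeq 0, \end{equation*}
with variables $t\in \R$ and $\vct{x}\in \R^k$. The constraint is a linear matrix inequality, so this is a semidefinite program, and its optimal value is $\min_{\vct{x}} \lambda_{\max}(\mtx{A}(\vct{x}))$.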
Dispersed throughout the course we had a look at various applications. These included
It is not so crucial to remember every detail in the treatment of these examples, but one should be able to explain the main ideas: in particular, how exactly the different optimization types enter, some of the problems that can arise, and potential concrete applications.
Thought experiment. Imagine you need to explain to someone who is reasonably mathematically literate some of these applications, and how the different optimization types (gradient descent, quadratic programming, SDP) enter the picture. That is about the level of detail you need to remember.