$\LaTeX \text{ commands here} \newcommand{\R}{\mathbb{R}} \newcommand{\im}{\text{im}\,} \newcommand{\norm}[1]{||#1||} \newcommand{\inner}[1]{\langle #1 \rangle} \newcommand{\span}{\mathrm{span}} \newcommand{\proj}{\mathrm{proj}} \newcommand{\OPT}{\mathrm{OPT}} \newcommand{\grad}{\nabla} \newcommand{\eps}{\varepsilon} $
Georgia Tech, CS 4540
Jake Abernethy and Benjamin Bray
Quiz password: regress
Tuesday, September 17, 2019
General objective function for supervised machine learning problems: $$ J(\theta; x, y) = \frac{1}{N} \sum_{n=1}^N \ell(y_n, h_\theta(x_n)) $$ Goal: find a $\theta$ that minimizes this objective! (Note: we often use $w$ for the vector of parameters.)
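As a concrete illustration, here is a minimal NumPy sketch of this objective; the names `loss` and `predict` are placeholders standing in for whatever $\ell$ and $h_\theta$ you choose (the squared-loss example below is one such choice, not the only one):

```python
import numpy as np

def objective(theta, X, y, loss, predict):
    """Average loss J(theta) = (1/N) * sum_n loss(y_n, h_theta(x_n))."""
    preds = predict(theta, X)   # h_theta applied to each row of X
    return np.mean(loss(y, preds))

# Example choices: linear predictor with squared loss.
predict = lambda theta, X: X @ theta
loss = lambda y, yhat: (y - yhat) ** 2
```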
In the SVM problem, we have $n$ datapoints $(x_1, y_1), \ldots, (x_n, y_n)$, where $x_i \in \mathbb{R}^d$, and $y_i \in \{-1,1\}$. We want to find a linear classifier $\theta \in \mathbb{R}^d$ which classifies the datapoints with a large margin but allowing for some slack. The soft margin SVM is formulated as follows: \begin{align*} \mathop{\text{minimize }}_{\theta \in \mathbb{R}^d, \xi_1,\ldots, \xi_n \geq 0} \quad & \frac 1 2 \|\theta\|^2 + C \sum_{i=1}^n \xi_i \\ \text{subject to } \quad & y_i (\theta^\top x_i) \geq 1 - \xi_i \quad \text{ for } i=1, \ldots, n \end{align*}
The above formulation, with $n + d$ variables, is actually equivalent to the unconstrained problem below, which has only $d$ variables. Why is that? $$ \mathop{\text{minimize }}_{\theta \in \mathbb{R}^d} \quad \frac 1 2 \|\theta\|^2 + C\sum_{i=1}^n \max\{0, 1 - y_i(\theta^\top x_i)\} $$
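One way to see the equivalence: for any fixed $\theta$, the best feasible slack is $\xi_i = \max\{0, 1 - y_i(\theta^\top x_i)\}$, i.e. the hinge loss. A minimal sketch of the resulting unconstrained objective (the function name is my own):

```python
import numpy as np

def svm_objective(theta, X, y, C):
    """0.5 * ||theta||^2 + C * sum_i max(0, 1 - y_i * theta^T x_i)."""
    margins = y * (X @ theta)               # y_i * theta^T x_i
    hinge = np.maximum(0.0, 1.0 - margins)  # optimal slack xi_i for fixed theta
    return 0.5 * theta @ theta + C * hinge.sum()
```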
Let us assume we have found the $\theta^*$ that solves the SVM objective: $$ \mathop{\text{minimize }}_{\theta \in \mathbb{R}^d} \quad \frac 1 2 \|\theta\|^2 + C\sum_{i=1}^n \max\{0, 1 - y_i(\theta^\top x_i)\}. $$ Now consider the set of so-called support vectors $S := \{ i : y_i(\theta^{*\top} x_i) \leq 1 \}$.
Is $\theta^*$ also a solution of the following modified objective? Why or why not? $$ \mathop{\text{minimize }}_{\theta \in \mathbb{R}^d} \quad \frac 1 2 \|\theta\|^2 + C\sum_{i \in S} \max\{0, 1 - y_i(\theta^\top x_i)\}. $$
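To experiment with this question numerically, one could recover the support set from a solution and compare the two objectives; a rough sketch, reusing `svm_objective` from above:

```python
import numpy as np

def support_set(theta_star, X, y):
    """Indices i with y_i * theta^T x_i <= 1 (the support vectors)."""
    margins = y * (X @ theta_star)
    return np.where(margins <= 1.0)[0]

# At theta* itself, restricting the sum to S leaves the objective value
# unchanged, since points outside S contribute zero hinge loss there; the
# question is whether theta* still *minimizes* the restricted objective.
```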
As shown in the reading, the (batch) gradient descent update for the least squares objective is: $$ \theta_{t+1} = \theta_{t} + \alpha \sum_{n=1}^N (y_n - h_\theta(x_n)) x_n $$
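A minimal batch gradient descent loop implementing this update, assuming a linear hypothesis $h_\theta(x) = \theta^\top x$ (step size and iteration count are illustrative defaults):

```python
import numpy as np

def gradient_descent_ls(X, y, alpha=0.01, iters=1000):
    """Batch GD for least squares: theta += alpha * sum_n (y_n - theta^T x_n) x_n."""
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        residuals = y - X @ theta         # y_n - h_theta(x_n)
        theta += alpha * X.T @ residuals  # the update from the reading
    return theta
```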
Using the convergence theory from last lecture, what can you say about the convergence rate of gradient descent for least squares? Is "iterate averaging" required?
(Hint: From the reading, you know the objective is convex. Are any stronger properties true?)
Suppose we add a 2-norm regularizer to the objective function for least-squares linear regression. How does the convergence rate change? $$ J_{\text{new}}(\theta) = J_{\text{old}}(\theta) + \lambda \|\theta\|^2 $$ (here, $\lambda > 0$ is a parameter that controls the relative importance of the regularizer)
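Mechanically, the added term $\lambda\|\theta\|^2$ contributes $2\lambda\theta$ to the gradient (its Hessian is $2\lambda I$). A sketch of the modified update under the same linear-hypothesis assumption as above:

```python
import numpy as np

def gradient_descent_ridge(X, y, lam, alpha=0.01, iters=1000):
    """Batch GD on J_old(theta) + lam * ||theta||^2."""
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        grad = -X.T @ (y - X @ theta) + 2.0 * lam * theta  # grad of regularized J
        theta -= alpha * grad
    return theta
```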
In the notes, you saw an objective function for computing maximum likelihood estimates of the parameters to a logistic regression model. The probabilistic interpretation is nice, but we can also view it as a variant of our "formula" above, where we choose to use the binary cross-entropy loss.
As shown in the notes, the (batch) gradient descent update rule is $$ \theta_{t+1} = \theta_t + \alpha \sum_{n=1}^N (y_n - h_\theta(x_n)) x_n $$
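A sketch of this update for logistic regression, where $h_\theta(x) = \sigma(\theta^\top x)$ with $\sigma(z) = 1/(1+e^{-z})$ and labels $y_n \in \{0,1\}$:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_descent_logistic(X, y, alpha=0.01, iters=1000):
    """Batch GD for logistic regression: theta += alpha * sum_n (y_n - sigma(theta^T x_n)) x_n."""
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        residuals = y - sigmoid(X @ theta)  # y_n - h_theta(x_n)
        theta += alpha * X.T @ residuals
    return theta
```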
Using the convergence theory from last lecture, what can you say about the convergence rate of gradient descent for logistic regression?
(Hint: From the reading, you know the objective is convex. Is the objective strongly convex? Strongly smooth?)
Suppose we add a 2-norm regularizer to the objective function for logistic regression. How does the convergence rate change? $$ J_{\text{new}}(\theta) = J_{\text{old}}(\theta) + \lambda \|\theta\|^2 $$ (here, $\lambda > 0$ is a parameter that controls the relative importance of the regularizer)
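As with least squares, the regularizer contributes $2\lambda\theta$ to the gradient; a sketch of the regularized update (`sigmoid` as defined in the previous sketch):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_descent_logistic_l2(X, y, lam, alpha=0.01, iters=1000):
    """Batch GD on the logistic objective plus lam * ||theta||^2."""
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        grad = -X.T @ (y - sigmoid(X @ theta)) + 2.0 * lam * theta
        theta -= alpha * grad
    return theta
```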