$\LaTeX \text{ commands here} \newcommand{\R}{\mathbb{R}} \newcommand{\im}{\text{im}\,} \newcommand{\norm}[1]{||#1||} \newcommand{\inner}[1]{\langle #1 \rangle} \newcommand{\span}{\mathrm{span}} \newcommand{\proj}{\mathrm{proj}} \newcommand{\OPT}{\mathrm{OPT}} \newcommand{\grad}{\nabla} \newcommand{\eps}{\varepsilon} $
Georgia Tech, CS 4540
Jake Abernethy and Benjamin Bray
Quiz password: regress
Tuesday, September 17, 2019
General objective function for supervised machine learning problems: $$ J(\theta; x, y) = \frac{1}{N} \sum_{n=1}^N \ell(y_n, h_\theta(x_n)) $$ Goal: find a $\theta$ that minimizes this objective! (Note: we often use $w$ for the vector of parameters.)
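As a concrete illustration, here is a minimal NumPy sketch of this objective; the names `loss` and `predict` are placeholders standing in for whatever $\ell$ and $h_\theta$ you choose (the squared-loss example below is one such choice, not the only one):

```python
import numpy as np

def objective(theta, X, y, loss, predict):
    """Average loss J(theta) = (1/N) * sum_n loss(y_n, h_theta(x_n))."""
    preds = predict(theta, X)   # h_theta applied to each row of X
    return np.mean(loss(y, preds))

# Example choices: linear predictor with squared loss.
predict = lambda theta, X: X @ theta
loss = lambda y, yhat: (y - yhat) ** 2
```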
In the SVM problem, we have $n$ datapoints $(x_1, y_1), \ldots, (x_n, y_n)$, where $x_i \in \mathbb{R}^d$, and $y_i \in \{-1,1\}$. We want to find a linear classifier $\theta \in \mathbb{R}^d$ which classifies the datapoints with a large margin but allowing for some slack. The soft margin SVM is formulated as follows: \begin{align*} \mathop{\text{minimize }}_{\theta \in \mathbb{R}^d, \xi_1,\ldots, \xi_n \geq 0} \quad & \frac 1 2 \|\theta\|^2 + C \sum_{i=1}^n \xi_i \\ \text{subject to } \quad & y_i (\theta^\top x_i) \geq 1 - \xi_i \quad \text{ for } i=1, \ldots, n \end{align*}
The above formulation, with $n + d$ variables, is actually equivalent to the unconstrained problem below, which has only $d$ variables. Why is that? $$ \mathop{\text{minimize }}_{\theta \in \mathbb{R}^d} \quad \frac 1 2 \|\theta\|^2 + C\sum_{i=1}^n \max\{0, 1 - y_i(\theta^\top x_i)\} $$
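One way to see the equivalence: for any fixed $\theta$, the best feasible slack is $\xi_i = \max\{0, 1 - y_i(\theta^\top x_i)\}$, i.e. the hinge loss. A minimal sketch of the resulting unconstrained objective (the function name is my own):

```python
import numpy as np

def svm_objective(theta, X, y, C):
    """0.5 * ||theta||^2 + C * sum_i max(0, 1 - y_i * theta^T x_i)."""
    margins = y * (X @ theta)               # y_i * theta^T x_i
    hinge = np.maximum(0.0, 1.0 - margins)  # optimal slack xi_i for fixed theta
    return 0.5 * theta @ theta + C * hinge.sum()
```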
Let us assume we have found the $\theta^*$ that solves the SVM objective: $$ \mathop{\text{minimize }}_{\theta \in \mathbb{R}^d} \quad \frac 1 2 \|\theta\|^2 + C\sum_{i=1}^n \max\{0, 1 - y_i(\theta^\top x_i)\}. $$ Now consider the set of so-called support vectors $S := \{ i : y_i(\theta^{*\top} x_i) \leq 1 \}$.
Is $\theta^*$ also a solution of the following modified objective? Why or why not? $$ \mathop{\text{minimize }}_{\theta \in \mathbb{R}^d} \quad \frac 1 2 \|\theta\|^2 + C\sum_{i \in S} \max\{0, 1 - y_i(\theta^\top x_i)\}. $$
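To experiment with this question numerically, one could recover the support set from a solution and compare the two objectives; a rough sketch, reusing `svm_objective` from above:

```python
import numpy as np

def support_set(theta_star, X, y):
    """Indices i with y_i * theta^T x_i <= 1 (the support vectors)."""
    margins = y * (X @ theta_star)
    return np.where(margins <= 1.0)[0]

# At theta* itself, restricting the sum to S leaves the objective value
# unchanged, since points outside S contribute zero hinge loss there; the
# question is whether theta* still *minimizes* the restricted objective.
```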
As shown in the reading, the (batch) gradient descent update for the least squares objective is: $$ \theta_{t+1} = \theta_{t} + \alpha \sum_{n=1}^N (y_n - h_\theta(x_n)) x_n $$
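A minimal batch gradient descent loop implementing this update, assuming a linear hypothesis $h_\theta(x) = \theta^\top x$ (step size and iteration count are illustrative defaults):

```python
import numpy as np

def gradient_descent_ls(X, y, alpha=0.01, iters=1000):
    """Batch GD for least squares: theta += alpha * sum_n (y_n - theta^T x_n) x_n."""
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        residuals = y - X @ theta         # y_n - h_theta(x_n)
        theta += alpha * X.T @ residuals  # the update from the reading
    return theta
```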
Using the convergence theory from last lecture, what can you say about the convergence rate of gradient descent for least squares? Is "iterate averaging" required?
(Hint: From the reading, you know the objective is convex. Are any stronger properties true?)
Suppose we add a 2-norm regularizer to the objective function for least-squares linear regression. How does the convergence rate change? $$ J_{\text{new}}(\theta) = J_{\text{old}}(\theta) + \lambda \|\theta\|^2 $$ (here, $\lambda > 0$ is a parameter that controls the relative importance of the regularizer)
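Mechanically, the added term $\lambda\|\theta\|^2$ contributes $2\lambda\theta$ to the gradient (its Hessian is $2\lambda I$). A sketch of the modified update under the same linear-hypothesis assumption as above:

```python
import numpy as np

def gradient_descent_ridge(X, y, lam, alpha=0.01, iters=1000):
    """Batch GD on J_old(theta) + lam * ||theta||^2."""
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        grad = -X.T @ (y - X @ theta) + 2.0 * lam * theta  # grad of regularized J
        theta -= alpha * grad
    return theta
```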
In the notes, you saw an objective function for computing maximum likelihood estimates of the parameters to a logistic regression model. The probabilistic interpretation is nice, but we can also view it as a variant of our "formula" above, where we choose to use the binary cross-entropy loss.
As shown in the notes, the (batch) gradient descent update rule is $$ \theta_{t+1} = \theta_t + \alpha \sum_{n=1}^N (y_n - h_\theta(x_n)) x_n $$
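A sketch of this update for logistic regression, where $h_\theta(x) = \sigma(\theta^\top x)$ with $\sigma(z) = 1/(1+e^{-z})$ and labels $y_n \in \{0,1\}$:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_descent_logistic(X, y, alpha=0.01, iters=1000):
    """Batch GD for logistic regression: theta += alpha * sum_n (y_n - sigma(theta^T x_n)) x_n."""
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        residuals = y - sigmoid(X @ theta)  # y_n - h_theta(x_n)
        theta += alpha * X.T @ residuals
    return theta
```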
Using the convergence theory from last lecture, what can you say about the convergence rate of gradient descent for logistic regression?
(Hint: From the reading, you know the objective is convex. Is the objective strongly convex? Strongly smooth?)
Suppose we add a 2-norm regularizer to the objective function for logistic regression. How does the convergence rate change? $$ J_{\text{new}}(\theta) = J_{\text{old}}(\theta) + \lambda \|\theta\|^2 $$ (here, $\lambda > 0$ is a parameter that controls the relative importance of the regularizer)
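As with least squares, the regularizer contributes $2\lambda\theta$ to the gradient; a sketch of the regularized update (`sigmoid` as defined in the previous sketch):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_descent_logistic_l2(X, y, lam, alpha=0.01, iters=1000):
    """Batch GD on the logistic objective plus lam * ||theta||^2."""
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        grad = -X.T @ (y - sigmoid(X @ theta)) + 2.0 * lam * theta
        theta -= alpha * grad
    return theta
```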