#!/usr/bin/env python
# coding: utf-8

# $\LaTeX \text{ commands here}
# \newcommand{\R}{\mathbb{R}}
# \newcommand{\im}{\text{im}\,}
# \newcommand{\norm}[1]{||#1||}
# \newcommand{\inner}[1]{\langle #1 \rangle}
# \newcommand{\span}{\mathrm{span}}
# \newcommand{\proj}{\mathrm{proj}}
# \newcommand{\OPT}{\mathrm{OPT}}
# \newcommand{\grad}{\nabla}
# \newcommand{\eps}{\varepsilon}
# $

# # **Georgia Tech, CS 4540**
#
# # L8: Least Squares, Support Vector Machines, and Logistic Regression
#
# Jake Abernethy and Benjamin Bray
#
# Quiz password: regress
#
# *Tuesday, September 17, 2019*

# ## Review: Objective Functions
#
# General objective function for *supervised* machine learning problems:
# $$
# J(\theta; x, y) = \frac{1}{N} \sum_{n=1}^N \ell(y_n, h_\theta(x_n))
# $$
# Goal: find a $\theta$ that minimizes this objective! (*Note*: we often use $w$ for the vector of parameters)

# ## Support Vector Machines
#
# In the SVM problem, we have $n$ *datapoints* $(x_1, y_1), \ldots, (x_n, y_n)$, where $x_i \in \mathbb{R}^d$ and $y_i \in \{-1,1\}$. We want to find a linear classifier $\theta \in \mathbb{R}^d$ which classifies the datapoints with a *large margin* while allowing for some *slack*. The soft-margin SVM is formulated as follows:
# \begin{align*}
# \mathop{\text{minimize }}_{\theta \in \mathbb{R}^d, \xi_1,\ldots, \xi_n \geq 0} \quad & \frac 1 2 \|\theta\|^2 + C \sum_{i=1}^n \xi_i \\
# \text{subject to } \quad & y_i (\theta^\top x_i) \geq 1 - \xi_i \quad \text{ for } i=1, \ldots, n
# \end{align*}
#
# #### Problem
#
# The above formulation, which has $n + d$ variables, is equivalent to the unconstrained problem below, which has only $d$ variables. Why is that?
# $$
# \mathop{\text{minimize }}_{\theta \in \mathbb{R}^d} \quad \frac 1 2 \|\theta\|^2 + C\sum_{i=1}^n \max\{0, 1 - y_i(\theta^\top x_i)\}
# $$

# ## SVM with only support vectors
#
# Let us assume we have found the $\theta^*$ that solves the SVM objective:
# $$
# \mathop{\text{minimize }}_{\theta \in \mathbb{R}^d} \quad \frac 1 2 \|\theta\|^2 + C\sum_{i=1}^n \max\{0, 1 - y_i(\theta^\top x_i)\}.
# $$
# Now consider the set of so-called *support vectors* $S := \{ i : y_i(\theta^{*\top} x_i) \leq 1 \}$.
#
# #### Problem
# Is $\theta^*$ also a solution of the following modified objective? Why or why not?
# $$
# \mathop{\text{minimize }}_{\theta \in \mathbb{R}^d} \quad \frac 1 2 \|\theta\|^2 + C\sum_{i \in S} \max\{0, 1 - y_i(\theta^\top x_i)\}.
# $$
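# As a quick illustration (not part of the original lecture), here is a minimal NumPy sketch of the unconstrained hinge-loss objective above, together with the support-vector index set $S$ for a candidate $\theta$. The function names, the array layout ($X$ with rows $x_i^\top$, labels $y_i \in \{-1,+1\}$), and the toy data are assumptions made only for this sketch.

import numpy as np

def svm_objective(theta, X, y, C=1.0):
    """Soft-margin SVM objective: 0.5*||theta||^2 + C * sum_i max{0, 1 - y_i theta^T x_i}."""
    margins = y * (X @ theta)                 # y_i * theta^T x_i for each i
    hinge = np.maximum(0.0, 1.0 - margins)    # slack needed by each point
    return 0.5 * theta.dot(theta) + C * hinge.sum()

def support_vectors(theta, X, y):
    """Indices S = { i : y_i * theta^T x_i <= 1 }: points that touch or violate the margin."""
    return np.where(y * (X @ theta) <= 1.0)[0]

# Tiny usage example with synthetic data
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))                  # n = 20 points in d = 3 dimensions
y = np.where(X[:, 0] > 0, 1.0, -1.0)          # labels determined by the first coordinate
theta = np.array([1.0, 0.0, 0.0])             # a candidate (not necessarily optimal) classifier
print(svm_objective(theta, X, y, C=1.0))
print(support_vectors(theta, X, y))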
# ## Least Squares
#
# ##### Objective
#
# $$
# \begin{align}
# J(\theta; X, y)
# &= \frac{1}{2} \sum_{n=1}^N (y_n - \theta^T x_n)^2 \\
# &= \frac{1}{2} || y - X^T \theta ||^2
# \end{align}
# $$
#
# * Input dataset $X \in \mathbb{R}^{D \times N}$ with columns $x_1,\dots,x_N \in \R^D$
# * Output $y \in \mathbb{R}^N$ with entries $y_1,\dots,y_N \in \mathbb{R}$
# * Hypothesis $h_\theta : \mathbb{R}^D \rightarrow \mathbb{R}$ defined as $h_\theta(x) = \theta^T x$
# * Squared loss $\ell(y_1,y_2) = (y_1 - y_2)^2$
# * Weights / Parameters $\theta \in \mathbb{R}^D$

# ### Gradient Descent for Least Squares: Convergence
#
# As shown in the reading, the (batch) gradient descent update for the least squares objective is:
# $$
# \theta_{t+1}
# = \theta_{t} + \alpha \sum_{n=1}^N (y_n - h_{\theta_t}(x_n)) x_n
# $$
#
# #### Problem: Convergence Rate
#
# Using the convergence theory from last lecture, what can you say about the convergence rate of gradient descent for least squares? Is "iterate averaging" required?
#
# (*Hint:* From the reading, you know the objective is convex. Are any stronger properties true?)

# #### Problem: Regularization
#
# Suppose we add a 2-norm regularizer to the objective function for least-squares linear regression. How does the convergence rate change?
# $$
# J_{new}(\theta) = J_{old}(\theta) + \lambda || \theta ||^2
# $$
# (Here, $\lambda > 0$ is a parameter which controls the relative importance of the regularizer.)
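# A minimal NumPy sketch (not part of the original lecture) of the batch gradient descent update above. Following the convention in this section, $X$ has shape $(D, N)$ with columns $x_n$, so the update $\sum_n (y_n - \theta^\top x_n) x_n$ becomes `X @ (y - X.T @ theta)`. The optional `lam` term corresponds to the 2-norm regularizer in the problem above; the function name, step size, iteration count, and toy data are illustrative assumptions.

import numpy as np

def least_squares_gd(X, y, alpha=1e-3, lam=0.0, num_iters=2000):
    """Batch gradient descent on 0.5 * ||y - X^T theta||^2 + lam * ||theta||^2."""
    theta = np.zeros(X.shape[0])
    for _ in range(num_iters):
        residual = y - X.T @ theta                             # y_n - h_theta(x_n) for each n
        theta = theta + alpha * (X @ residual - 2.0 * lam * theta)
    return theta

# Tiny usage example: recover known weights from noiseless synthetic data
rng = np.random.default_rng(0)
X = rng.normal(size=(3, 100))                                  # D = 3 features, N = 100 points
true_theta = np.array([1.0, -2.0, 0.5])
y = X.T @ true_theta
print(least_squares_gd(X, y))                                  # should be close to true_theta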
# ## Logistic Regression
#
# In the notes, you saw an objective function for computing maximum likelihood estimates of the parameters of a logistic regression model. The probabilistic interpretation is nice, but we can also view it as a variant of our "formula" above, where we choose to use the **binary cross-entropy loss**.
#
# ##### Objective
#
# $$
# J(\theta)
# = NLL(\theta)
# = - \sum_{n=1}^N \bigg[ y_n \log h_\theta(x_n) + (1-y_n) \log (1-h_\theta(x_n)) \bigg]
# $$
#
# - Input dataset $X \in \mathbb{R}^{D \times N}$ with columns $x_1,\dots,x_N \in \R^D$
# - Binary output $y \in \{ 0,1 \}^N$ with entries $y_1,\dots,y_N \in \{0,1\}$
# - Hypothesis $h_\theta : \mathbb{R}^D \rightarrow (0,1)$ defined as $h_\theta(x) = \sigma(\theta^T x) = \frac{1}{1 + \exp(-\theta^T x)}$
# - Logistic function $\sigma(z) = \frac{1}{1+\exp(-z)}$ with the special property that $\sigma'(z) = \sigma(z) (1 - \sigma(z))$
# - Cross-entropy loss $\ell(y,z) = -y \log z - (1-y) \log (1-z)$
# - Weights / Parameters $\theta \in \mathbb{R}^D$

# ### Gradient Descent for Logistic Regression
#
# As shown in the notes, the (batch) gradient descent update rule is
# $$
# \theta_{t+1} = \theta_t + \alpha \sum_{n=1}^N (y_n - h_{\theta_t}(x_n)) x_n
# $$
#
# #### Problem: Convergence Rate
#
# Using the convergence theory from last lecture, what can you say about the convergence rate of gradient descent for logistic regression?
#
# (*Hint:* From the reading, you know the objective is convex. Is the objective strongly convex? Strongly smooth?)

# #### Problem: Regularizing Logistic Regression
#
# Suppose we add a 2-norm regularizer to the objective function for logistic regression. How does the convergence rate change?
# $$
# J_{new}(\theta) = J_{old}(\theta) + \lambda || \theta ||^2
# $$
# (Here, $\lambda > 0$ is a parameter which controls the relative importance of the regularizer.)
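# A minimal NumPy sketch (not part of the original lecture) of batch gradient descent for the regularized logistic regression objective $NLL(\theta) + \lambda ||\theta||^2$; setting `lam = 0` recovers the unregularized update above. As before, $X$ has shape $(D, N)$ with columns $x_n$ and $y \in \{0,1\}^N$; the function names, step size, `lam`, iteration count, and toy data are illustrative assumptions.

import numpy as np

def sigmoid(z):
    """Logistic function sigma(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def logistic_gd(X, y, alpha=0.01, lam=0.0, num_iters=2000):
    """Batch gradient descent on NLL(theta) + lam * ||theta||^2."""
    theta = np.zeros(X.shape[0])
    for _ in range(num_iters):
        h = sigmoid(X.T @ theta)                       # h_theta(x_n) for each n
        grad = -X @ (y - h) + 2.0 * lam * theta        # gradient of NLL plus regularizer
        theta = theta - alpha * grad
    return theta

# Tiny usage example on a linearly separable toy problem
rng = np.random.default_rng(0)
X = rng.normal(size=(2, 200))                          # D = 2 features, N = 200 points
y = (X[0] + X[1] > 0).astype(float)                    # labels determined by x_1 + x_2 > 0
print(logistic_gd(X, y, lam=0.1))                      # components should be positive and roughly equal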