#!/usr/bin/env python
# coding: utf-8
# $\LaTeX \text{ commands here}
# \newcommand{\R}{\mathbb{R}}
# \newcommand{\im}{\text{im}\,}
# \newcommand{\norm}[1]{||#1||}
# \newcommand{\inner}[1]{\langle #1 \rangle}
# \newcommand{\span}{\mathrm{span}}
# \newcommand{\proj}{\mathrm{proj}}
# \newcommand{\OPT}{\mathrm{OPT}}
# \newcommand{\grad}{\nabla}
# \newcommand{\eps}{\varepsilon}
# $
#
#
# **Georgia Tech, CS 4540**
#
# # L8: Least Squares, Support Vector Machines, and Logistic Regression
#
# Jake Abernethy and Benjamin Bray
#
# Quiz password: regress
#
# *Tuesday, September 17, 2019*
# ## Review: Objective Functions
#
# General Objective Function for *supervised* machine learning problems:
# $$
# J(\theta; x, y) = \frac{1}{N} \sum_{n=1}^N \ell(y_n, h_\theta(x_n))
# $$
# Goal: find a $\theta$ that minimizes this objective! (*Note*: we often use $w$ for the vector of parameters)
# ## Support Vector Machines
#
# In the SVM problem, we have $n$ *datapoints* $(x_1, y_1), \ldots, (x_n, y_n)$, where $x_i \in \mathbb{R}^d$, and $y_i \in \{-1,1\}$. We want to find a linear classifier $\theta \in \mathbb{R}^d$ that classifies the datapoints with a *large margin*, while allowing for some *slack*. The soft-margin SVM is formulated as follows:
# \begin{align*}
# \mathop{\text{minimize }}_{\theta \in \mathbb{R}^d, \xi_1,\ldots, \xi_n \geq 0} \quad & \frac 1 2 \|\theta\|^2 + C \sum_{i=1}^n \xi_i \\
# \text{subject to } \quad & y_i (\theta^\top x_i) \geq 1 - \xi_i \quad \text{ for } i=1, \ldots, n
# \end{align*}
# #### Problem
#
# The above formulation, which has $n + d$ variables, is actually equivalent to the unconstrained problem below, which has only $d$ variables. Why is that?
# $$
# \mathop{\text{minimize }}_{\theta \in \mathbb{R}^d} \quad \frac 1 2 \|\theta\|^2 + C\sum_{i=1}^n \max\{0, 1 - y_i(\theta^\top x_i)\}
# $$
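# A minimal numerical sketch (not part of the lecture materials) of the unconstrained hinge-loss form above: the objective is evaluated and minimized with a few subgradient steps on small synthetic data. Only `numpy` is assumed, and the variable names are illustrative.

# In[ ]:


# Sketch: unconstrained hinge-loss form of the soft-margin SVM on synthetic data.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))                     # 20 datapoints x_i in R^3 (rows)
y = np.sign(X @ np.array([1.0, -2.0, 0.5]))      # labels in {-1, +1}
C = 1.0

def svm_objective(theta, X, y, C):
    """(1/2)||theta||^2 + C * sum_i max(0, 1 - y_i theta^T x_i)."""
    hinge = np.maximum(0.0, 1.0 - y * (X @ theta))
    return 0.5 * theta @ theta + C * hinge.sum()

def svm_subgradient(theta, X, y, C):
    """A subgradient: theta - C * sum over margin-violating i of y_i x_i."""
    active = (y * (X @ theta)) < 1.0
    return theta - C * (y[active, None] * X[active]).sum(axis=0)

theta = np.zeros(3)
for t in range(1, 501):                          # diminishing-step subgradient method
    theta -= (0.1 / t) * svm_subgradient(theta, X, y, C)
print(svm_objective(theta, X, y, C))
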
# ## SVM with only support vectors
#
# Let us assume we have found the $\theta^*$ that solves the SVM objective:
# $$
# \mathop{\text{minimize }}_{\theta \in \mathbb{R}^d} \quad \frac 1 2 \|\theta\|^2 + C\sum_{i=1}^n \max\{0, 1 - y_i(\theta^\top x_i)\}.
# $$
# Now consider the set of so-called *support vectors* $S := \{ i : y_i(\theta^{*\top} x_i) \leq 1 \}$.
#
# #### Problem
# Is $\theta^*$ also a solution of the following modified objective? Why or why not?
# $$
# \mathop{\text{minimize }}_{\theta \in \mathbb{R}^d} \quad \frac 1 2 \|\theta\|^2 + C\sum_{i \in S} \max\{0, 1 - y_i(\theta^\top x_i)\}.
# $$
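# As a quick numerical check (reusing the synthetic data and the approximate minimizer `theta` from the SVM sketch above as a stand-in for $\theta^*$): points outside $S$ satisfy $y_i(\theta^{*\top} x_i) > 1$ and contribute zero hinge loss, so the full and restricted objectives take the same value at $\theta^*$. This only compares objective values; whether $\theta^*$ remains a minimizer is the question posed above.

# In[ ]:


# Compare the full objective with the one restricted to the support vectors S.
def svm_objective_subset(theta, X, y, C, idx):
    hinge = np.maximum(0.0, 1.0 - y[idx] * (X[idx] @ theta))
    return 0.5 * theta @ theta + C * hinge.sum()

S = np.where(y * (X @ theta) <= 1.0)[0]          # indices of support vectors
print(np.isclose(svm_objective(theta, X, y, C),
                 svm_objective_subset(theta, X, y, C, S)))
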
#
# ## Least Squares
#
# ##### Objective
#
# $$
# \begin{align}
# J(\theta; X, y)
# &= \frac{1}{2} \sum_{n=1}^N (y_n - \theta^T x_n)^2 \\
# &= \frac{1}{2} || y - X^T \theta ||^2
# \end{align}
# $$
#
# * Input dataset $X \in \mathbb{R}^{D \times N}$ with columns $x_1,\dots,x_N \in \R^D$
# * Output $y \in \mathbb{R}^N$ with entries $y_1,\dots,y_N \in \mathbb{R}$
# * Hypothesis $h_\theta : \mathbb{R}^D \rightarrow \mathbb{R}$ defined as $h_\theta(x) = \theta^T x$
# * Squared loss $\ell(y_1,y_2) = (y_1 - y_2)^2$
# * Weights / Parameters $\theta \in \mathbb{R}^D$
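# A short sketch of this setup on synthetic data (assuming `numpy`; the variable names are illustrative): the objective written with the $D \times N$ data matrix, plus the normal-equations solution for reference.

# In[ ]:


# Least-squares objective with X (D x N, columns are datapoints) and y in R^N.
import numpy as np

rng = np.random.default_rng(0)
D, N = 3, 50
X_ls = rng.normal(size=(D, N))                   # columns x_1, ..., x_N
theta_true = np.array([2.0, -1.0, 0.5])
y_ls = X_ls.T @ theta_true + 0.1 * rng.normal(size=N)

def least_squares_objective(theta, X, y):
    """J(theta) = (1/2) sum_n (y_n - theta^T x_n)^2 = (1/2) ||y - X^T theta||^2."""
    residual = y - X.T @ theta
    return 0.5 * residual @ residual

# For reference: the minimizer solves the normal equations (X X^T) theta = X y.
theta_hat = np.linalg.solve(X_ls @ X_ls.T, X_ls @ y_ls)
print(least_squares_objective(theta_hat, X_ls, y_ls))
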
# ### Gradient Descent for Least Squares, convergence
#
# As shown in the reading, the (batch) gradient descent update for the least squares objective is:
# $$
# \theta_{t+1}
# = \theta_{t} + \alpha \sum_{n=1}^N (y_n - h_{\theta_t}(x_n)) x_n
# $$
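# A minimal sketch of this update (assuming the synthetic `X_ls`, `y_ls`, and `least_squares_objective` from the cell above):

# In[ ]:


# Batch gradient descent for least squares: theta <- theta + alpha * X (y - X^T theta).
def least_squares_grad_step(theta, X, y, alpha):
    residual = y - X.T @ theta                   # entries y_n - h_theta(x_n)
    return theta + alpha * (X @ residual)

theta_gd = np.zeros(D)
for _ in range(200):
    theta_gd = least_squares_grad_step(theta_gd, X_ls, y_ls, alpha=1e-2)
print(least_squares_objective(theta_gd, X_ls, y_ls))   # should approach the value at theta_hat
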
#
# #### Problem: Convergence Rate
#
# Using the convergence theory from last lecture, what can you say about the convergence rate of gradient descent for least squares? Is "iterate averaging" required?
#
# (*Hint:* From the reading, you know the objective is convex. Are any stronger properties true?)
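# One way to probe the hint numerically (a sketch, reusing `X_ls` from above): the Hessian of this objective is $X X^T$, so its largest eigenvalue gives a smoothness constant and its smallest eigenvalue, when positive, a strong-convexity constant.

# In[ ]:


# Extreme eigenvalues of the least-squares Hessian X X^T.
hessian = X_ls @ X_ls.T
eigvals = np.linalg.eigvalsh(hessian)
print("smoothness L ~", eigvals.max(), "  strong convexity mu ~", eigvals.min())
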
# #### Problem: Regularization
#
# Suppose we add a 2-norm regularizer to the objective function for least-squares linear regression. How does the convergence rate change?
# $$
# J_{new}(\theta) = J_{old}(\theta) + \lambda || \theta ||^2
# $$
# (here, $\lambda > 0$ is a parameter which controls the relative importance of the regularizer)
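# Since the regularizer contributes $2\lambda I$ to the Hessian, the smallest eigenvalue of $X X^T + 2\lambda I$ is at least $2\lambda$; a one-line check on the synthetic data from above (the value of $\lambda$ here is arbitrary):

# In[ ]:


# Regularized least-squares Hessian: X X^T + 2*lambda*I.
lam = 0.5
eigvals_reg = np.linalg.eigvalsh(X_ls @ X_ls.T + 2 * lam * np.eye(D))
print(eigvals_reg.min(), ">=", 2 * lam)
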
# ## Logistic Regression
#
# In the notes, you saw an objective function for computing maximum likelihood estimates of the parameters of a logistic regression model. The probabilistic interpretation is nice, but we can also view it as a variant of our "formula" above, where we choose to use the **binary cross-entropy loss**.
#
# ##### Objective
#
# $$
# J(\theta)
# = NLL(\theta)
# = - \sum_{n=1}^N \bigg[ y_n \log h_\theta(x_n) + (1-y_n) \log (1-h_\theta(x_n)) \bigg]
# $$
#
#
#
# - Input dataset $X \in \mathbb{R}^{D \times N}$ with columns $x_1,\dots,x_N \in \R^D$
# - Binary Output $y \in \{ 0,1 \}^N$ with entries $y_1,\dots,y_N \in \{0,1\}$
# - Hypothesis $h_\theta : \mathbb{R}^D \rightarrow \mathbb{R}$ defined as $h_\theta(x) = \sigma(\theta^T x) = \frac{1}{1 + \exp(-\theta^T x)}$
# - Logistic function $\sigma(z) = \frac{1}{1+\exp(-z)}$ with the special property that $\sigma'(z) = \sigma(z) (1 - \sigma(z))$
# - Cross-entropy loss $\ell(y,z) = -y \log z - (1-y) \log (1-z)$
# - Weights / Parameters $\theta \in \mathbb{R}^D$
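# A minimal sketch of these pieces on synthetic data (assuming `numpy`; names are illustrative): the sigmoid hypothesis and the negative log-likelihood (cross-entropy) objective.

# In[ ]:


# Logistic regression: sigmoid hypothesis and cross-entropy (NLL) objective.
import numpy as np

rng = np.random.default_rng(1)
D, N = 3, 100
X_lr = rng.normal(size=(D, N))                   # columns are datapoints
theta_true = np.array([1.5, -1.0, 0.5])
y_lr = (rng.random(N) < 1.0 / (1.0 + np.exp(-(X_lr.T @ theta_true)))).astype(float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def nll(theta, X, y, eps=1e-12):
    """-sum_n [ y_n log h_theta(x_n) + (1 - y_n) log(1 - h_theta(x_n)) ]."""
    h = sigmoid(X.T @ theta)
    return -np.sum(y * np.log(h + eps) + (1 - y) * np.log(1 - h + eps))

print(nll(np.zeros(D), X_lr, y_lr))              # equals N * log(2) at theta = 0
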
# ### Gradient Descent for Logistic Regression
#
# As shown in the notes, the (batch) gradient descent update rule is
# $$
# \theta_{t+1} = \theta_t + \alpha \sum_{n=1}^N (y_n - h_{\theta_t}(x_n)) x_n
# $$
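# A sketch of this update (assuming `sigmoid`, `nll`, `X_lr`, and `y_lr` from the cell above):

# In[ ]:


# Batch gradient descent for logistic regression.
def logistic_grad_step(theta, X, y, alpha):
    residual = y - sigmoid(X.T @ theta)          # entries y_n - h_theta(x_n)
    return theta + alpha * (X @ residual)

theta_lr = np.zeros(D)
for _ in range(2000):
    theta_lr = logistic_grad_step(theta_lr, X_lr, y_lr, alpha=1e-2)
print(nll(theta_lr, X_lr, y_lr))                 # should drop well below the theta = 0 value
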
#
# #### Problem: Convergence Rate
#
# Using the convergence theory from last lecture, what can you say about the convergence rate of gradient descent for logistic regression?
#
# (*Hint:* From the reading, you know the objective is convex. Is the objective strongly convex? Strongly smooth?)
# #### Problem: Regularizing Logistic Regression
#
# Suppose we add a 2-norm regularizer to the objective function for logistic regression. How does the convergence rate change?
# $$
# J_{new}(\theta) = J_{old}(\theta) + \lambda || \theta ||^2
# $$
# (here, $\lambda > 0$ is a parameter which controls the relative importance of the regularizer)
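# As with least squares, the regularizer adds $2\lambda I$ to the Hessian of the NLL, so the regularized objective is strongly convex with parameter at least $2\lambda$; a small check on the synthetic data from the cells above (the value of $\lambda$ is arbitrary):

# In[ ]:


# Hessian of the regularized logistic NLL: sum_n s_n (1 - s_n) x_n x_n^T + 2*lambda*I.
lam = 0.5
s = sigmoid(X_lr.T @ theta_lr)
hessian_lr = (X_lr * (s * (1 - s))) @ X_lr.T + 2 * lam * np.eye(D)
print(np.linalg.eigvalsh(hessian_lr).min(), ">=", 2 * lam)
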