In [ ]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import scipy as sp
import scipy.stats as st
import scipy.linalg as la
from math import sqrt
import pandas as pd
from pandas import DataFrame
import re
print("Modules Imported!")

Linear Regression

In this lab we will apply linear regression to real and synthetic data and study Stochastic Gradient Descent.

If you are not familiar with matrix operations, review this page before you start.

Also, for matrix inversion, we will use scipy.linalg, which we imported with

import scipy.linalg as la

In particular, you will find la.inv useful.

Estimating Box Office Gross

(40 pts) We will use opening weekend gross to estimate total gross, based on data for 2017 (through mid-September). First, we will use pandas to read a CSV file into an object called a DataFrame. (pandas can do a lot of things, but here we only use it to read from a file.) We use pandas as:

df = pd.read_csv('BoxOffice2017.csv')

But read_csv can also take other arguments. For example, if the names of the columns are not given in the file, we can pass names=[...], or we can set the delimiter to be any whitespace instead of just a comma with delim_whitespace=True, as in the following call (with a hypothetical whitespace-delimited file raw_data.txt):

df = pd.read_csv('raw_data.txt', names = ['first_name', 'last_name', 'age', 'education', 'income'], delim_whitespace=True)

We then print the first few rows to check that everything has loaded properly, and check the data type for each column.

In [ ]:
df = pd.read_csv('BoxOffice2017.csv')
df.head() # The first few rows of the dataframe
In [ ]:
df.dtypes

Next we extract the data from the dataframe. Our input variable is the opening gross (OpeningGross) and our output is the total gross (TotalGross). The data needs some preprocessing.

In [ ]:
Y = df[['TotalGross']].values # extract total grosses
# remove ',' and '$' and convert to float
# Y is a two-dimensional array, so we use two for loops
for i in range(np.shape(Y)[0]): 
    for j in range(np.shape(Y)[1]):
        Y[i,j] = float(re.sub('[$,]','',Y[i,j])) 
Y = Y.astype(float)


X = df[['OpeningGross']].values # if we need multiple columns we can write, e.g., X = df[['OpeningGross','Theaters']].values
# solution

  1. Plot the input and output in a scatter plot.
  2. Find $\theta^*$ minimizing the RMSE.
  3. Compute the RMSE.
  4. Plot the line $X\theta^*$ on the same plot as the input-output data.
In [ ]:
# solution
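One possible approach is sketched below (not the official solution). It assumes the OpeningGross column contains currency strings that need the same '$'/',' cleanup as TotalGross, and it fits a line without an intercept via the normal equations, $\theta^* = (X^TX)^{-1}X^TY$.

In [ ]:
# A minimal sketch, not the official solution.
# Assumption: OpeningGross holds strings like '$1,234,567', as TotalGross did.
X = df[['OpeningGross']].values
for i in range(np.shape(X)[0]):
    for j in range(np.shape(X)[1]):
        X[i,j] = float(re.sub('[$,]','',X[i,j]))
X = X.astype(float)

# 1. Scatter plot of input vs. output
plt.scatter(X, Y)
plt.xlabel('Opening gross')
plt.ylabel('Total gross')

# 2. theta* minimizing the RMSE (normal equations, no intercept term)
theta_star = la.inv(X.T @ X) @ X.T @ Y

# 3. RMSE of the fit
rmse = sqrt(np.mean((Y - X @ theta_star)**2))
print('theta* =', theta_star.ravel(), ', RMSE =', rmse)

# 4. Overlay the fitted line X theta*
plt.plot(X, X @ theta_star, 'r')
plt.show()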

Maximum likelihood with synthetic data

(30 pts) This time we will use synthetic data. The data will be multivariate normal and so minimizing RMSE is the same as maximum likelihood estimation.

Generate a sample $\left\{(x_1,y_1),\dotsc,(x_{100},y_{100})\right\}$ of size $100$, where $$x_i\sim \mathcal N (\mu,K),$$ with mean $\mu=[0,0]$ and covariance $$ K=\left[\begin{array}{cc} 10 & 5\\ 5 & 10\\ \end{array}\right],$$ and $$ y_i \sim \mathcal N(\theta^T x_i,1)$$ with $$ \theta=\left[\begin{array}{c} 3 \\ -1\\ \end{array}\right].$$

Arrange these in two matrices $X$ and $Y$ of sizes $100\times2$ and $100\times1$, respectively.

In [ ]:
# solution
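A minimal sketch of one way to generate this data, using numpy's multivariate_normal (the seed is optional, for reproducibility):

In [ ]:
# A minimal sketch: generate the synthetic dataset.
np.random.seed(0)  # optional, for reproducibility
mu = np.array([0., 0.])
K = np.array([[10., 5.], [5., 10.]])
theta_true = np.array([[3.], [-1.]])

X = np.random.multivariate_normal(mu, K, size=100)  # 100 x 2
Y = X @ theta_true + np.random.randn(100, 1)        # 100 x 1, unit-variance noise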

Find the maximum likelihood estimate $\theta^*$ of $\theta$, and compare it with the true value. Note that this is the same as the estimate that minimizes the RMSE. Find the RMSE.

In [ ]:
# solution
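As a sketch, the ML (equivalently, least-squares) estimate can again be computed from the normal equations, assuming X and Y from the previous cell:

In [ ]:
# A minimal sketch: ML / least-squares estimate and its RMSE.
theta_ml = la.inv(X.T @ X) @ X.T @ Y
rmse = sqrt(np.mean((Y - X @ theta_ml)**2))
print('theta_ML =', theta_ml.ravel())  # compare with the true [3, -1]
print('RMSE =', rmse)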

Stochastic Gradient Descent for ML

(30 pts)

Finding the root: The basic case

Suppose that we want to solve the equation $$f(\theta)=0,$$ which we know to have a unique solution $\theta^*$, with $f(\theta)$ increasing at this solution. One way to find it is to start with an arbitrary $\theta_0$ and iterate $$ \theta_t = \theta_{t-1} - a_t f(\theta_{t-1}), $$ where the step sizes $a_t$ satisfy $$\sum_{t=1}^\infty a_t = \infty,\quad \sum_{t=1}^\infty a_t^2 < \infty.$$ For example, we may let $a_t=\frac1t$. It can be shown that $\theta_t$ converges to $\theta^*$.


Consider the function $f(\theta)=\arctan \theta -1$ and let $\theta_0=0$. Use the method described above to find $\theta^*$ such that $f(\theta^*)=0$, i.e., $\arctan \theta^*=1$. Set the number of iterations to 1000. Plot the values of $\theta_t$ and compare $\theta_{1000}$ with $\theta^*$. Let $a_t=\frac1{t^{3/4}}$.

Bonus: It is also interesting to see how the behavior changes with $a_t=\frac1{t^r}$ for different values of $r$.

In [ ]:
# solution:
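A minimal sketch of the iteration (here $\theta^*=\tan 1$, since $\arctan\theta^*-1=0$):

In [ ]:
# A minimal sketch: deterministic root finding for f(theta) = arctan(theta) - 1.
f = lambda theta: np.arctan(theta) - 1

T = 1000
thetas = np.zeros(T + 1)  # thetas[0] = theta_0 = 0
for t in range(1, T + 1):
    a_t = 1.0 / t**0.75
    thetas[t] = thetas[t-1] - a_t * f(thetas[t-1])

plt.plot(thetas)
plt.xlabel('t')
plt.ylabel('theta_t')
plt.show()
print('theta_1000 =', thetas[-1], ', theta* = tan(1) =', np.tan(1))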

Finding the root with noisy observations

Importantly, this method works even if we can't compute $f(\theta)$ exactly but can only obtain $F(\theta)$, a noisy version of $f(\theta)$ such that $$ f(\theta)=E[F(\theta)]. $$ Using the noisy version, we let $$ \theta_t = \theta_{t-1} - a_t F(\theta_{t-1}), $$ and again $\theta_t$ converges to $\theta^*$.


To see this in action, let $F(\theta)=f(\theta) + Z = \arctan \theta - 1 + Z$, where $Z\sim \mathcal N(0,1)$. Start with $\theta_0=0$ and let $$ \theta_t = \theta_{t-1} - a_t (f(\theta_{t-1})+Z_t), $$ where, for each $t$, $Z_t$ is a fresh sample from $\mathcal N(0,1)$. Plot the values of $\theta_t$ and compare $\theta_{1000}$ with $\theta^*$. Let $a_t=\frac1{t^{3/4}}$.

Bonus: It is also interesting to see how the behavior changes with $a_t=\frac1{t^r}$ for different values of $r$.

In [ ]:
# solution:
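A minimal sketch of the noisy version; it differs from the previous cell only by the fresh $\mathcal N(0,1)$ noise added to each observation of $f$:

In [ ]:
# A minimal sketch: root finding from noisy observations F(theta) = f(theta) + Z.
f = lambda theta: np.arctan(theta) - 1

T = 1000
thetas = np.zeros(T + 1)  # thetas[0] = theta_0 = 0
for t in range(1, T + 1):
    a_t = 1.0 / t**0.75
    Z_t = np.random.randn()  # fresh N(0,1) sample at each step
    thetas[t] = thetas[t-1] - a_t * (f(thetas[t-1]) + Z_t)

plt.plot(thetas)
plt.xlabel('t')
plt.ylabel('theta_t')
plt.show()
print('theta_1000 =', thetas[-1], ', theta* = tan(1) =', np.tan(1))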

Stochastic Gradient Descent

Now consider two random variables $x$ and $y$ and suppose that we have $N$ samples $\{(x_1,y_1),(x_2,y_2),\dotsc,(x_N,y_N)\}$. Our goal is linear regression of $y$ with respect to $x$ as $y = \theta^T x$. The least-squares solution to this problem requires finding $\theta$ such that $$\nabla J(\theta)=-\sum_{i=1}^N(y_i-\theta^T x_i)x_i=0.$$ This can also be viewed as finding the root of $f(\theta)=-E[(y-\theta^T x)x]$. If we knew the distribution of $x$ and $y$, we could find the root via $$ \theta_t = \theta_{t-1} - a_t f(\theta_{t-1}). $$ But we don't. Based on the explanation above, however, we can instead use a noisy version of $f$: $$ \theta_t = \theta_{t-1} + a_t (y_i-\theta_{t-1}^Tx_i)x_i, $$ where $(x_i,y_i)$ is a randomly chosen data point, since $-(y_i-\theta^T x_i)x_i$ is a noisy observation of $f(\theta)$.

In the context of linear regression, this method is referred to as Least Mean Squares or LMS (which isn't a very good name). But the principle applies more generally and is referred to as Stochastic Gradient Descent, where "stochastic" refers to the fact that we choose a random data point from our dataset in each step.


Set $\theta_0=(0,0)$ and let $$ \theta_t=\theta_{t-1} + \frac{1}{t+10} (y_i-\theta_{t-1}^Tx_i)x_i, $$ for $t=1,\dotsc,200$, where $i$ is a random index indicating which data point is used in this step. Use the data generated in "Maximum likelihood with synthetic data". Note that $\theta_t$ and $x_i$ are two-dimensional. Plot both elements of $\theta_t$ and compare them with the true values $[3,-1]$.

In [ ]:
# solution
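A minimal sketch of LMS on the synthetic data, assuming the X (100 x 2) and Y (100 x 1) arrays generated in the section above:

In [ ]:
# A minimal sketch: LMS / stochastic gradient descent.
T = 200
theta = np.zeros(2)
history = np.zeros((T + 1, 2))  # history[t] stores theta_t
N = X.shape[0]

for t in range(1, T + 1):
    i = np.random.randint(N)  # pick a random data point
    x_i, y_i = X[i], Y[i, 0]
    theta = theta + (1.0 / (t + 10)) * (y_i - theta @ x_i) * x_i
    history[t] = theta

plt.plot(history[:, 0], label='theta_t[0], true value 3')
plt.plot(history[:, 1], label='theta_t[1], true value -1')
plt.legend()
plt.show()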