In [ ]:

```
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import scipy as sp
import scipy.stats as st
import scipy.linalg as la
from math import sqrt
import pandas as pd
from pandas import DataFrame
import re
print("Modules Imported!")
```

In this lab we will apply linear regression to real and synthetic data and study Stochastic Gradient Descent.

If you are not familiar with matrix operations, review this page befor you start.

Also, for matrix inversion, we will use `numpy.linalg`

, which we imported with

```
import numpy.linalg as la
```

In particular, you will find `numpy.linalg.inv`

useful.

(40 pts) We will use opening weekend gross to estimate total gross based on data for 2017 (until mid september). First, we will use `pandas`

to read a csv file in to an object called a dataframe. (`pandas`

can do a lot of things, but here we only use it to read from a file.)
We use panda as:

`df = pd.read_csv('BoxOffice2017.csv')`

But it can also take other arguments. For example, if the names of the columns are not given in the file, we can use `names=[...]`

or we can set the delimiter to be any whitespace instead of just comma with `delim_whitespace=True`

, as shown below:

`df = pd.DataFrame(raw_data, names = ['first_name', 'last_name', 'age', 'education','income'], delim_whitespace=True)`

We then print the first few rows to check that everything has loaded properly, and check the data type for each column.

In [ ]:

```
df = pd.read_csv('BoxOffice2017.csv')
df.head() # The first few rows of the dataframe
```

In [ ]:

```
df.dtypes
```

Next we extract data from the dataframe. Our input variable is the Openning Gross and our output is Total Gross. The data needs some preprocessing.

In [ ]:

```
Y = df[['TotalGross']].values # extract grosses
# remove ',' and '$' and convert to float
# Y is a two dimensional array, so we use a two for loops
for i in range(np.shape(Y)[0]):
for j in range(np.shape(Y)[1]):
Y[i,j] = float(re.sub('[$,]','',Y[i,j]))
Y = Y.astype(float)
X = df[['OpeningGross']].values # if we need multiple columns we can write, e.g., X = df[['OpeningGross','Theaters']].values
# solution
```

- Plot the input and output in a scatter plot.
- Find $\theta^*$ minimizing the RMSE.
- Compute the RMSE.
- Plot the line $X\theta^*$ on the same plot as the input-output data.

In [ ]:

```
# solution
```

(30 pts) This time we will use synthetic data. The data will be multivariate normal and so minimizing RMSE is the same as maximum likelihood estimation.

Generate a samples $\left\{(x_1,y_1),\dotsc,(x_{100},y_{100})\right\}$ of size $100$, where $$x_i\sim \mathcal N (\mu,K),$$ with mean $\mu=[0,0]$ and covariance $$ K=\left[\begin{array}{cc} 10 & 5\\ 5 & 10\\ \end{array}\right],$$ and $$ y_i \sim \mathcal(\theta^T x_i,1)$$ with $$ \theta=\left[\begin{array}{c} 3 \\ -1\\ \end{array}\right].$$

Arrange these in two matrixes $X$ and $Y$ of sizes $100\times2$ and $100\times1$, respectively.

In [ ]:

```
# solution
```

Find the maximum likelihood estimate $\theta^*$ of $\theta$, and compare with true value. Note that this is the same as the estimate that minimizes RMSE. Find the RMSE.

In [ ]:

```
# solution
```

(30 pts)

Suppose that we want to solve the equation $$f(\theta)=0,$$ which we know to have a unique solution $\theta^*$ such that $f(\theta)$ is increasing at this solution. One way to find this solution is to start with an arbitrary $\theta_0$ and let $$ \theta_t = \theta_{t-1} - a_t f(\theta_t), $$ where $a_i$ satisfies $$\sum_{t=0}^\infty a_t = \infty,\quad \sum_{t=0}^\infty a_t^2 < \infty.$$ For example, we may let $a_t=\frac1t$. It can be shown that $\theta_t$ converges to $\theta^*$.

Consider the function $f(\theta)=\arctan \theta -1$ and let $\theta_0=0$. Use the method described above to find $\theta^*$ such that $f(\theta^*)=1$. Set the number of iterations to 1000. Plot the values of $\theta_t$ and compare $\theta_{1000}$ with $\theta^*$. Let $a_t=\frac1{t^{3/4}}$.

Bonus: It is also interesting to see how the behavior changes with $a_t=\frac1{t^r}$ for different values of $r$.

In [ ]:

```
# solution:
```

Importantly, this method works even if we can't actually compute $f(\theta)$ but instead can obtain $F(\theta)$, which is a noisy version of $f(\theta)$ such that $$ f(\theta)=E[F(\theta)]. $$ The good news is that using the noisy version works too and we can let $$ \theta_t = \theta_{t-1} - a_t F(\theta_t), $$ and again $\theta_t$ converges to $\theta^*$.

To see this in action, let $F(\theta)=f(\theta) + Z = \arctan \theta - 1 + Z$, where $Z\sim \mathcal N(0,1)$. Start with $\theta_0=0$ and let $$ \theta_t = \theta_{t-1} - a_t (f(\theta_{t-1}+Z_t)), $$ where, for each $t$, $Z_t$ is a new sample from $\mathcal N(0,1)$. Plot the values of $\theta_t$ and compare $\theta_{1000}$ with $\theta^*$. Let $a_t=\frac1{t^{3/4}}$.

Bonus: It is also interesting to see how the behavior changes with $a_t=\frac1{t^r}$ for different values of $r$.

In [ ]:

```
# solution:
```

Now consider two random variables $x$ and $y$ and suppose that we have $N$ samples: $\{(x_1,y_1),(x_2,y_2),\dotsc,(x_N,y_N)\}$. Our goal is linear regression of $y$ with repect to $x$ as $y = \theta^T x$. The gradient descent solution to this problem requires finding $\theta$ such that $$\nabla J(\theta)=-\sum_{i=1}^N(y_i-\theta^T x_i)x_i=0.$$ This can also be viewed as wanting to find the root of $f(\theta)=-E[(y-\theta^T x)x]$. But we don't know the distribution of $x$ and $y$. If we did, we could find the root by $$ \theta_t = \theta_{t-1} - a_t f(\theta_t), $$ But based on the explanation above, we can instead find the root by $$ \theta_t = \theta_{t-1} + a_t (y_i-\theta_{t}^Tx_i)x_i, $$ where $(x_i,y_i)$ is a random data point.

In the context of linear regression, this method is refered to as *Least Mean Square* or *LMS* (which isn't a very good name). But the principle applies more generally and is refered to as *Stochastic Gradient Descent*, where *stochastic* refers to the fact that we chose a random data point from our dataset in each step.

Set $\theta_0=(0,0)$ and let $$ \theta_t=\theta_{t-1} + \frac{1}{(t+10)} (y_i-\theta_t^Tx_i)x_i, $$ for $t=1,\dotsc,200$, where $i$ is a random index indicating which data point is used in this step. Use data generated in Maximum likelihood with synthetic data. Note that $\theta_t$ and $x_i$ are two-dimensional. Plot both elements of $\theta_t$ and compare them with the ture values of [3,-1].

In [ ]:

```
# solution
```