- Understand the nomenclature of the regression equation and terms
- Be able to recognize which terms represent noise and which are model terms
- Perform linear regression in 1 or more dimensions
- Be able to construct probability distributions representing parameter uncertainty
- Compute hypothesis tests and confidence intervals

In [1]:

```
%matplotlib inline
import random
import numpy as np
import matplotlib.pyplot as plt
from math import sqrt, pi, erf
import scipy.stats
```

Let's define the problem setting for linear regression.

We have some data, $x$ and $y$, and we want to fit a line to the data. Our model will be:

$$\hat{y} = \hat{\alpha} + \hat{\beta}x$$where a $\hat{}$ indicates our best estimate of something. We make an assumption that the process that generates our data looks like:

$$y = \alpha + \beta x + \epsilon$$where $\epsilon$ is coming from a normal distribution

- $\alpha$ - an intercept
- $\hat{\alpha}$ - our estimated best intercept
- $\beta$ - a slope
- $\hat{\beta}$ - our estimated best slope
- $\hat{y}$ - our estimated y's
- $\epsilon$ - the variable which is creating normally distributed noise.
- $x,y$ - the data
- Residual - the difference between $\hat{y}$ and $y$ at a particular point ($\hat{y}_i - y_i$)
- Sum of squared residues (SSR) - $\sum_i (\hat{y}_i - y_i)^2$
- Sum of squared error (SSE) - the same as SSR
- Total sum of squares (TSS) - $\sum_i (\bar{y} - y_i)^2$, which is a kind of variance
- $S_{\textrm{something}}$ - The standard error of something

One way to view this problem is as an optimization. Can we write down an equation that results in a single value that we minimize? Our ultimate goal is to make $\hat{y}$ as close as possible to $y$. Mathematically, that means we want $\sum_i(y_i - \hat{y}_i)^2$ to be small.

That is the objective function to minimize, but what are the dimensions to change? Those are $\hat{\alpha}$ and $\hat{\beta}$. So we need to write down a function which takes in $\hat{\alpha}$ and $\hat{\beta}$ and returns how good the fit is:

$$f(\hat{\alpha}, \hat{\beta}) = \sum_i \left(y_i - \hat{\alpha} - \hat{\beta} x_i\right)^2$$

We can minimize this equation using any of our minimization techniques or we can do it analytically.

Using calculus you can show that the minimum to $f(\alpha, \beta)$ is:

$$\hat{\beta} = \frac{\sum_i(x_i - \bar{x})(y_i - \bar{y})}{\sum_i(x_i - \bar{x})^2}$$With a little bit of algebra, you can show this is

$$\hat{\beta} = \frac{\sigma_{xy}}{\sigma_x^2}$$where $\sigma_{xy}$ is the sample covariance of $x$ and $y$ and $\sigma_x^2$ is the sample variance of $x$.

To find the intercept, you can just take the average of the residuals (not their squares!) given the model so far:

$$\hat{\alpha} = \frac{1}{N}\sum_i (y_i - \hat{\beta}x_i)$$Let's see this in action

In [2]:

```
# Make some data -> this is problem setup
# Do NOT copy this because it only generates random data
# This does not perform regression
x = np.linspace(0,10, 20)
y = 1 + x * 2.5 + scipy.stats.norm.rvs(scale=2, size=20)
plt.plot(x,y, 'o')
plt.show()
```

In [3]:

```
cov = np.cov(x,y, ddof=2)
#recall that the diagonal is variances, so we use that directly
beta_hat = cov[0,1] / cov[0,0]
alpha_hat = np.mean( y - beta_hat * x)
print(f'alpha_hat = {alpha_hat:.2} ({1})')
print(f'beta_hat = {beta_hat:.2} ({2.5})')
```

In [4]:

```
plt.plot(x,y, 'o')
plt.plot(x, alpha_hat + beta_hat * x)
plt.show()
```

Notice that we didn't get exactly the correct answer. The points were generated with a slope of 2.5 and an intercept of 1, whereas our fit was a little bit off

Another important hypothesis is determining if a sample is normally distributed. We said in class many times that we require things to be normal for many methods. Here's how we actually test that.

**Data Type:** single group of samples

**Compares:** If the samples came from an unknown parent normal distribution (are they normally distributed)

**Null Hypothesis:** The samples are from the unknown parent normal distribution

**Conditions:** None

**Python:** `scipy.stats.shapiro`

**Notes:** There are many other tests for normality. This one is not the simplests, but is the most effective

In [5]:

```
data = [12.4, 12.6, 11.8, 11.5, 11.9, 12.2, 12.0, 12.1, 11.8]
scipy.stats.shapiro(data)
```

Out[5]:

The $p$-value is quite high, so we don't reject the null hypothesis

In [6]:

```
data = np.linspace(0,1, 100)
scipy.stats.shapiro(data)
```

Out[6]:

The $p$-value is 0.002, so we correctly reject the null hypothesis

One of our assumptions is that the noise is normally distributed. Recall our model:

$$y = \alpha + \beta x + \epsilon$$$$\epsilon = y - \alpha - \beta x \approx y - \hat{\alpha} - \hat{\beta} x = y - \hat{y}$$Where the $\approx$ is because we are using our estimates for $\alpha$ and $\beta$. We can now check our assumption by histogramming the residuals, which should be the same as looking at the $\epsilon$ distribution.

In [7]:

```
plt.hist(y - beta_hat * x - alpha_hat)
plt.show()
```

At this point, it's unclear if they are normally distributed. Luckily we just learned the Shapiroâ€“Wilk Test!

In [8]:

```
scipy.stats.shapiro(y - beta_hat * x - alpha_hat)
```

Out[8]:

It looks like the residuals may indeed be normally distributed. So, our assumption was valid.

There are a few ways of measuring goodness of fit. One way is to just compute the SSR, the sum of the squared residuals. However, this has the negative that the units of $y$ appear in the goodness of fit. Here is what people typically use, the coefficient of determination:

$$R^2 = 1 - \frac{\textrm{SSR}}{\textrm{TSS}} = 1 - \frac{\sum_i \left(\hat{y}_i - y\right)^2}{\sum_i \left(\bar{y} - y\right)^2}$$This equation has the property that it's unitless, it's $1$ when the fit is perfect, and $0$ when the fit is awful. In the case of linear regression, $R$ is the same as the correlation coefficient.

In [9]:

```
ssr = np.sum((y - alpha_hat - beta_hat * x)**2)
tss = np.sum((np.mean(y) - y)**2)
rsq = 1 - ssr / tss
print(rsq, sqrt(rsq))
print(np.corrcoef(x,y))
```

Knowing goodness of fit is not quite useful yet though. We'd like to do hypothesis tests, build confidence intervals, etc. We need to know how the $\hat{\beta}$ are distributed!

As you may have noticed, sometimes it gets confusing that the difference between the sample mean and the true mean, $\mu - \bar{x}$, are distributed according to a normal or $T$ distribution if the samples are distributed according to a normal distribution. Thus we end up with two distributions: one with parameters $\mu$, $\sigma$ that describes the samples and one with parameters $\mu$, $\sigma / \sqrt{N}$ that describes the difference between sample mean and true mean. Often this $\sigma / \sqrt{N}$ term is called **Standard Error** to distinguish it from the standard deviation of the population (true/hidden) distribution. Thus you can write:

We're to be referring to **Standard Error** often now. You then can use **Standard Error** in confidence intervals or hypothesis tests. The same rules as previously apply: $N < 25$ requires a $t$-distribution and above is normal.

One other important thing to remember is that the denominator in the standard error should be the square root of the degrees of freedom. Usually this is written as $N - D$, where $D$ is the deducted degrees of freedom. In the case of linear regression, we have $D$ being the number of coefficients we're fitting. That $N - D$ is also the degrees of freedom in the $t$-distribution.

You may read the derivation in Bulmer on page 226. The variance in our estimated values ($S^2$) for slope and intercept are:

$$S^2_{\epsilon} =\frac{\sigma^2_{\epsilon}}{N - D} = \frac{1}{N - D}\sum_i \left(\hat{y}_i - y_i\right)^2$$$$S^2_{\alpha} = S^2_{\epsilon} \left[ \frac{1}{N - D} + \frac{\bar{x}^2}{\sum_i\left(x_i - \bar{x}\right)^2}\right]$$$$S^2_{\beta} = \frac{S^2_{\epsilon}}{\sum_i \left(x_i - \bar{x}\right)^2}$$$D$ here is the number of fit coefficients. 2 in our case.

We'll continue using the data from above

In [10]:

```
df = len(x) - 2
s2_epsilon = np.sum((y - alpha_hat - beta_hat * x) ** 2) / df
s2_alpha = s2_epsilon * (1. / df + np.mean(x) ** 2 / (np.sum((np.mean(x) - x) ** 2)))
print('The standard error for the intercept is', np.sqrt(s2_alpha))
```

Let's just visualize now what the distribution for where the true intercept looks like. We know it is distributed according to:

$$P(\alpha) = T(\mu=\hat{\alpha}, \sigma=S_\alpha, df=18)$$The degrees of freedom here is

In [11]:

```
alpha_grid = np.linspace(-2, 2, 100)
P_alpha = scipy.stats.t.pdf(alpha_grid, loc=alpha_hat, scale=np.sqrt(s2_alpha), df=len(x) - 2)
plt.plot(alpha_grid, P_alpha)
plt.axvline(1, color='red')
plt.axvline(alpha_hat)
plt.show()
```

Once we compute $S_\beta$, we can use the formulas we've seen before:

$$P(\beta = \hat{\beta} \pm y) = 0.95$$$$T = \frac{y}{S_\beta}$$$$y = TS_\beta$$The confidence interval will then be:

$$\beta = \hat{\beta} \pm TS_\beta$$with 95% confidence

In [12]:

```
s2_beta = s2_epsilon / np.sum((x - np.mean(x))**2)
T = scipy.stats.t.ppf(0.975, len(x) - 2)
print(s2_beta, T)
print('beta = ', beta_hat, '+/-', T * np.sqrt(s2_beta), ' with 95% confidence')
```

Notice that just like in confidence intervals, once $N$ becomes large we can replace the $t$-distribution with a normal distribution.

The next interesting consequence of the $t$-distribution is that we can construct a hypothesis test. For example, we could test if the intercept should be $0$.

Our null hypothesis is that the intercept is $0$. Then, we can compute how big an interval would have to be constructed around $0$ to just capture what we calculated for $\alpha$. The distribution for that interval is:

$$P(\alpha) = T(\mu=0, \sigma=S_\alpha, df=N - D = 19)$$We take $D = 1$ here because we work under the assumption of the null hpyothesis (i.e., only slope needs to be fit). For the actual regression analysis, you should still use $D = 2$. Only when you do the hypothesis test itself do you use the $D = 1$

To convert to a standard $t$-distribution (variance is 1), so we have:

$$T = \frac{\hat{\alpha}}{S_\alpha}$$$$\int_{-T}^T p(T)\,dT = 1 - p$$In [13]:

```
df = len(x) - 2
s2_epsilon = np.sum((y - alpha_hat - beta_hat * x) ** 2) / df
s2_alpha = s2_epsilon * (1. / df + np.mean(x) ** 2 / (np.sum((np.mean(x) - x) ** 2)))
#ensure our T-value is positive, so our integral doesn't get flipped
T = abs(alpha_hat / sqrt(s2_alpha))
p = 1 - (scipy.stats.t.cdf(T, len(x) - 1) - scipy.stats.t.cdf(-T, len(x) - 1))
print('alpha = ', alpha_hat, ' T = ', T, ' p-value = ', p)
```

*Depends on random date above!*

When I ran this, the $p$-value is $\approx 0.04$ which says the evidence is weak but we can reject the null hypothesis. There is an intercept, since the $p$-value is below our threshold of 5%.

Our governing equation is now:

$$y = {\mathbf X\beta} + \epsilon$$where $\mathbf X$ is an $N\times D$ matrix, where $N$ is the number of data points and $D$ is the number of dimensions. $\beta$ is then a $D$ length column vector. We want to find $\hat{\beta}$ and get a model that looks like:

$$\hat{y} = {\mathbf X\hat{\beta}}$$This can be done with optimization just like last time, but we can more easily do it with matrix algebra. The equation for $\hat{\beta}$ is:

$$\hat{\beta} = (\mathbf{X}^T \mathbf{X})^{-1}\mathbf{X}^Ty$$Let's try to fit the equation:

$$ y = 3 + 3 x_1 + x_2 + \epsilon$$with

$$\hat{y} = \beta_0 + \beta_1 x_1 + \beta_2 x_2$$In [14]:

```
#NOTE: THIS IS NOT PART OF REGRESSION!!!!
#DO NOT COPY PASTE THIS CODE INTO HW/EXAM
#generate data
#I add some noise to the x coordinate to just spread the points out a little.
x1 = np.linspace(0,1,15)+ scipy.stats.norm.rvs(size=15)
x2 = np.linspace(0,1,15) + scipy.stats.norm.rvs(size=len(x1))
y = 3 * x1 - 2 * x2 + 3 + scipy.stats.norm.rvs(size=len(x1))
y
```

Out[14]:

In [15]:

```
import numpy.linalg as linalg
x_mat = np.column_stack( (np.ones(len(x1)), x1, x2) )
x_mat
```

Out[15]:

Now we have our $X$ matrix set-up. Now we need to evaluate the matrix equation for $\hat{\beta}$ above:

In [16]:

```
#dot -> matrix multiplication
#transpose -> take a transpose
#linalg.inv -> compute a matrix inverse
beta_hat = linalg.inv(x_mat.transpose() @ x_mat) @ x_mat.transpose() @ y
```

Since it is tedius to type that whole equation out, you can instead use a shortcut:

In [17]:

```
beta_hat, *_ = linalg.lstsq(x_mat, y)
```

The `*_`

symbol means put the rest of the return value into the `_`

variable, which recall is how we indicate that we're making a variable which we will not use.

Let's now see how well the regression did!

In [18]:

```
y_hat = x_mat @ beta_hat
```

The first plot will be a $\hat{y}$ vs $y$ plot. This is called a **parity** plot. If $y$ and $\hat{y}$ are the same, you would see a $y=x$ line. How far the deviation is from that line is how bad the fit is.

In [19]:

```
plt.plot(y, y_hat, 'o')
plt.plot(y, y, 'r')
plt.xlabel('$y$')
plt.ylabel('$\hat{y}$')
plt.show()
```

Of course we can also look at the histogram of residuals

In [20]:

```
plt.hist(y - y_hat)
plt.show()
```

Just like before, if we can find the standard error for the noise/residual, $\epsilon$, then we can find everything else. The equation for that is:

$$S^2_{\epsilon} =\frac{\sigma^2_{\epsilon}}{N - D} = \frac{1}{N - D}\sum_i \left(\hat{y}_i - y_i\right)^2$$where $D$ is the dimension of $\beta$. Knowing that, the standard error for $\hat{\beta}$ is

$$S^2_{\beta} = S^2_{\epsilon} \left(\mathbf{X}^T\mathbf{X}\right)^{-1}$$where the diagonal elements are the standard errors. The off-diagonal elements are unrelated.

And again, $P(\beta - \hat{\beta}) \propto T(0, \sigma = S_\beta, df = N - D)$

The $N - D$ term is called the degrees of freedom. $D$ is the number of fit coefficients.

One of the most common uses for this form of least-squares is if your problem is non-linear and you want to linearize it. For example, let's say I am in 1 dimension and the model I'd like to have is:

$$y = \beta_2 x^2 + \beta_1 x + \beta_0 + \epsilon$$I can package that into a vector, like this: $[x^2, x, 1]$ and create an $\mathbf{X}$ matrix by creating a vector for each $x_i$. If my $x$ values are 1,2,3, that would like:

$$\left[ \begin{array}{lcr} 1^2 & 1 & 1\\ 2^2 & 2 & 1\\ 3^2 & 3 & 1\\ \end{array}\right]$$In [21]:

```
#NOTE THIS IS NOT PART OF REGRESSION!
#make some data to regress
x = np.linspace(-3, 3, 25)
y = 2 * x ** 2 - 3 * x + 4 + scipy.stats.norm.rvs(size=len(x), loc=0, scale=1.5)
#END
```

In [22]:

```
plt.plot(x,y, 'o', label='data')
plt.plot(x,2 * x ** 2 - 3 * x + 4, '-', label='exact solution')
plt.legend(loc='upper right')
plt.show()
```

In [23]:

```
x_mat = np.column_stack( (x**2, x, np.ones(len(x))) )
x_mat
```

Out[23]:

In [24]:

```
beta,*_ = linalg.lstsq(x_mat, y)
print(beta)
plt.plot(x,y, 'o', label='data')
plt.plot(x,2 * x ** 2 - 3 * x + 4, '-', label='exact solution')
plt.plot(x,x_mat.dot(beta), label='least squares')
plt.legend(loc='upper right')
plt.show()
```

In [25]:

```
yhat = x_mat @ beta
resids = yhat - y
SSR = np.sum(resids**2)
se2_epsilon = SSR / (len(x) - len(beta))
print(se2_epsilon)
```

In [26]:

```
se2_beta = se2_epsilon * linalg.inv(x_mat.transpose() @ x_mat)
print(se2_beta)
```

Now that we have the standard error matrix, we can create confidence intervals for the $\beta$ values.

In [29]:

```
for i in range(len(beta)):
#get our T-value for the confidence interval
T = scipy.stats.t.ppf(0.975, len(x) - len(beta))
# Get the width of the confidence interval using our previously computed standard error
cwidth = T * np.sqrt(se2_beta[i,i])
# print the result, using 2 - i to match our numbering above
print(f'beta_{i} is {beta[i]:.2f} +/- {cwidth:.2f} with 95% confidence')
```

The same equations above apply here. For the example of the polynomial:

In [28]:

```
TSS = np.sum( (np.mean(y) - y)**2)
R2 = 1 - SSR / TSS
R2, np.sqrt(R2)
```

Out[28]:

You can justify a regression and its coefficients by:

- A Spearmann correlation test
- A Shapiro-Wilks normality test on residuals
- A Goodness of fit

You can justify a particular **model** by:

- Plotting
- Hypothesis tests on individual coefficients

You CANNOT use a goodness of fit to compare models, since each time you add a new parameter you get a better fit

You should do the following when you want to do a good job:

- Justify correlation
- Compute your model coefficients, their standard errors, confidence intervals, and p-values for existing
- Plot your regressed model (use $y$ vs $\hat{y}$ for N-dimensions)
- Histogram and compute Shapiro-Wilks normality test on residuals