## Two interpretations of regression

• Linear: $\hat{y} = wx$
• Probabilistic (MLE & MAP): $y \sim N(wx, \sigma^2)$, with $w$ estimated by, e.g., $\textrm{argmax}_w p(D|w)$ (the MLE)

Review slides on Linear Regression

In regression we are always given X, so we only model Y conditional on X: given X, what is Y?

• MAP: $\textrm{argmax}_w \left[\prod_{i=1}^n p(y_i | w, x_i)\right] p(w)$
• MLE: $\textrm{argmax}_w \prod_{i=1}^n p(y_i | w, x_i)$

• Estimating means for normal distribution:
We have a likelihood: $y_i \sim N(\mu, \sigma^2)$ (in regression, $\mu = wx_i$)
We add a prior: $w \sim N(0, \gamma^2)$
See the slides for how to combine the likelihood and the prior
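
As a rough illustration of how the prior changes the estimate, here is a minimal sketch for the one-weight model $y_i \sim N(wx_i, \sigma^2)$ with prior $w \sim N(0, \gamma^2)$. Setting the derivative of the log posterior to zero gives $w_{MAP} = \sum x_i y_i / (\sum x_i^2 + \sigma^2/\gamma^2)$, and dropping the prior term gives the MLE. The data and the values of `true_w`, `sigma`, and `gamma` are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data from y = w*x + noise (all values here are illustrative).
true_w, sigma, gamma = 2.0, 1.0, 0.5
x = rng.uniform(-1.0, 1.0, size=20)
y = true_w * x + rng.normal(0.0, sigma, size=20)

# MLE: maximizing the Gaussian likelihood is ordinary least squares.
w_mle = np.sum(x * y) / np.sum(x * x)

# MAP: the N(0, gamma^2) prior adds sigma^2 / gamma^2 to the denominator,
# which shrinks the estimate toward zero.
w_map = np.sum(x * y) / (np.sum(x * x) + sigma**2 / gamma**2)

print(f"MLE: {w_mle:.3f}  MAP: {w_map:.3f}")
```

A smaller `gamma` (a tighter prior) shrinks the MAP estimate more strongly. As `gamma` grows, the MAP estimate approaches the MLE.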

• Constant Term in Linear Regression
When coding this up (e.g., in Matlab), you generally need to add the constant (intercept) term yourself, typically by appending a column of ones to X. Something to watch for.
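
A small sketch of the constant-term issue, written in Python with NumPy rather than Matlab (the idea is the same). The toy data are an assumption for illustration: an exact line with intercept 3 and slope 2.

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0])
y = 3.0 + 2.0 * x            # exact line with intercept 3 and slope 2

# Without a constant term the model is forced through the origin.
w_no_const = np.sum(x * y) / np.sum(x * x)

# With a column of ones, the first weight plays the role of the intercept.
X = np.column_stack([np.ones_like(x), x])
w_const, *_ = np.linalg.lstsq(X, y, rcond=None)

print(w_no_const)   # slope only, distorted by the missing intercept
print(w_const)      # approximately [3, 2]
```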

## Linear Regression with Varying Noise

Different noise at each observation: heteroscedasticity.
In the real world, the noise on more extreme measurements is often greater.

$y_i \sim N(wx_i, \sigma_i^2)$ <- note how sigma changes with $i$

Sometimes we know something about the noise. We can then use a different sigma at each point, assume the noise terms are independent, plug these into the Gaussian likelihood, and simplify.
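
The simplification can be sketched as follows, assuming independent Gaussian noise with known $\sigma_i$ and the model $y_i \sim N(wx_i, \sigma_i^2)$:

$$
\begin{aligned}
p(D \mid w) &= \prod_{i=1}^n \frac{1}{\sqrt{2\pi}\,\sigma_i} \exp\left(-\frac{(y_i - wx_i)^2}{2\sigma_i^2}\right) \\
\log p(D \mid w) &= -\sum_{i=1}^n \log\left(\sqrt{2\pi}\,\sigma_i\right) - \frac{1}{2}\sum_{i=1}^n \frac{(y_i - wx_i)^2}{\sigma_i^2}
\end{aligned}
$$

The first sum does not depend on $w$, so maximizing the likelihood over $w$ is the same as minimizing the second sum.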

This is called Weighted Regression:
$\textrm{argmin}_w \sum_{i = 1}^n (y_i - wx_i)^2/\sigma_i^2$

i.e., you weigh noisy measurements less
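
A minimal sketch of weighted regression for the one-weight model. Setting the derivative of the weighted sum of squares to zero gives $w = \sum (x_i y_i/\sigma_i^2) / \sum (x_i^2/\sigma_i^2)$. The data, and the assumption that the noise level is known and grows as `0.2 * x`, are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(1)

true_w = 1.5
x = np.linspace(1.0, 10.0, 200)
sigma_i = 0.2 * x                      # assumed known: noise grows with x
y = true_w * x + rng.normal(0.0, sigma_i)

# Weighted least squares: each squared residual is divided by sigma_i^2.
weights = 1.0 / sigma_i**2
w_weighted = np.sum(weights * x * y) / np.sum(weights * x * x)

# Ordinary least squares treats every point as equally reliable.
w_ols = np.sum(x * y) / np.sum(x * x)

print(f"weighted: {w_weighted:.3f}  unweighted: {w_ols:.3f}")
```

Both estimators are unbiased here. The weighted one has lower variance because the noisy points at large `x` no longer dominate the fit.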

## Non-linear Regression

Suppose you know that y is related to a function of x in such a way that the predicted values [lost slide]...
$y_i \sim N(\sqrt{w + x_i}, \sigma^2)$

MLE: $\textrm{argmin}_w \sum_{i=1}^n (y_i - \sqrt{w + x_i})^2$
Then use non-linear optimization techniques, of which many are available
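
One such technique is Gauss-Newton, sketched here for the square-root model. This is only one of many possible optimizers and is not necessarily the one used in the lecture. The function name `fit_sqrt_model`, the starting point `w0`, and the synthetic data are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

true_w = 4.0
x = np.linspace(0.0, 10.0, 50)
y = np.sqrt(true_w + x) + rng.normal(0.0, 0.05, size=x.size)

def fit_sqrt_model(x, y, w0=1.0, n_iter=50, tol=1e-10):
    """Gauss-Newton for argmin_w sum_i (y_i - sqrt(w + x_i))^2."""
    w = w0
    for _ in range(n_iter):
        pred = np.sqrt(w + x)
        resid = y - pred
        jac = 0.5 / pred                  # d/dw sqrt(w + x_i)
        step = np.sum(jac * resid) / np.sum(jac * jac)
        # Keep w + x_i strictly positive so the square root stays defined.
        w_new = max(w + step, -x.min() + 1e-6)
        if abs(w_new - w) < tol:
            return w_new
        w = w_new
    return w

w_hat = fit_sqrt_model(x, y)
print(f"estimated w: {w_hat:.3f}")
```

Unlike the linear case, there is no closed-form solution, so the answer depends on iterating from a starting guess.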

## Polynomial Regression

$y = a + bx^2$
Is this linear or nonlinear regression?
It is linear

We make a new variable:

$$
z = \begin{bmatrix} 1 & x_1^2 \\ 1 & x_2^2 \\ \vdots & \vdots \\ 1 & x_n^2 \end{bmatrix}
$$



Now: $\hat{y} = zw$ and it is linear (linear in weights)
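
A short sketch of this trick: build the matrix $z$ with a column of ones and a column of squared inputs, then run ordinary least squares on it. The true coefficients `a_true` and `b_true` and the noise level are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(3)

a_true, b_true = 1.0, 0.5
x = np.linspace(-3.0, 3.0, 40)
y = a_true + b_true * x**2 + rng.normal(0.0, 0.1, size=x.size)

# z has one row per observation: [1, x_i^2].
z = np.column_stack([np.ones_like(x), x**2])

# Ordinary least squares on the transformed features.
w, *_ = np.linalg.lstsq(z, y, rcond=None)
a_hat, b_hat = w
y_hat = z @ w

print(f"a = {a_hat:.3f}, b = {b_hat:.3f}")
```

The fitted curve is a parabola in $x$, but the fitting problem itself is linear because $\hat{y}$ is a linear function of the weights.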

• $y = w \sin(x)$ <- linear estimation
* $\sin(x)$ is a transformed feature, but still a feature
* the model is still linear in $w$

$y = \sin(wx)$ <- nonlinear estimation
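
To see why the distinction matters, compare the least-squares conditions for the two models (a sketch, assuming squared-error loss in both cases). For $y = w\sin(x)$:

$$
\frac{d}{dw} \sum_{i=1}^n (y_i - w \sin x_i)^2 = -2 \sum_{i=1}^n \sin x_i \,(y_i - w \sin x_i) = 0
\quad\Rightarrow\quad
w = \frac{\sum_{i=1}^n y_i \sin x_i}{\sum_{i=1}^n \sin^2 x_i}
$$

For $y = \sin(wx)$:

$$
\frac{d}{dw} \sum_{i=1}^n (y_i - \sin(w x_i))^2 = -2 \sum_{i=1}^n x_i \cos(w x_i)\,(y_i - \sin(w x_i)) = 0
$$

The first condition is linear in $w$ and has a closed-form solution. In the second, $w$ sits inside the sine and cosine, so there is no closed form and an iterative optimizer is needed.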

Often you have some really non-linear relationship between X and Y. Can you do some transformation on these to make the relationship linear?

Let us choose a set of points on x: $\mu_1 \dots \mu_k$. For each point we create a Gaussian basis function centered there: $z_j = e^{-\frac{||x - \mu_j||^2}{\sigma^2}}$

For every $x$, generate a vector of features $z_1 \dots z_k$: the $z_j$ whose centers are near $x$ will be close to 1, and those whose centers are far from $x$ will be close to zero

One adjustable parameter in this situation: the kernel width, or $\sigma$. If the kernel width is really big, every center has an effect on every point. If it is really narrow, then only very close centers have an effect
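
The effect of the kernel width can be seen numerically. This sketch assumes 1-D inputs and basis functions of the form $z_j = e^{-(x - \mu_j)^2/\sigma^2}$. The helper name `rbf_features`, the centers, and the two widths are chosen only for illustration.

```python
import numpy as np

def rbf_features(x, centers, sigma):
    """z_j = exp(-(x - mu_j)^2 / sigma^2) for 1-D inputs x and centers mu_j."""
    diff = x[:, None] - centers[None, :]
    return np.exp(-(diff**2) / sigma**2)

centers = np.linspace(0.0, 10.0, 6)    # mu_1 ... mu_k, chosen on the x axis
x = np.array([2.0])                    # a single query point

z_wide = rbf_features(x, centers, sigma=50.0)
z_narrow = rbf_features(x, centers, sigma=0.5)

np.set_printoptions(precision=3, suppress=True)
print("wide:  ", z_wide[0])    # every center responds, all values near 1
print("narrow:", z_narrow[0])  # only the center at x = 2 responds
```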

Now the new features (the $z_j$) are correlated, so we generally use ridge regression (the MAP estimate under a Gaussian prior on $w$)
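
Putting the pieces together, here is a minimal sketch of ridge regression on Gaussian basis features, using the closed form $w = (Z^T Z + \lambda I)^{-1} Z^T y$. The target function (a sine curve), the number of centers, and the values of `sigma` and `lam` are all assumptions made for the example, not values from the lecture.

```python
import numpy as np

rng = np.random.default_rng(4)

def rbf_features(x, centers, sigma):
    """z_j = exp(-(x - mu_j)^2 / sigma^2) for 1-D inputs x and centers mu_j."""
    diff = x[:, None] - centers[None, :]
    return np.exp(-(diff**2) / sigma**2)

# A clearly non-linear relationship between x and y.
x = np.linspace(0.0, 10.0, 100)
y = np.sin(x) + rng.normal(0.0, 0.1, size=x.size)

centers = np.linspace(0.0, 10.0, 12)
sigma, lam = 1.5, 1e-2
Z = rbf_features(x, centers, sigma)

# Ridge solution: w = (Z^T Z + lam * I)^{-1} Z^T y
w = np.linalg.solve(Z.T @ Z + lam * np.eye(centers.size), Z.T @ y)
y_hat = Z @ w

rmse = np.sqrt(np.mean((y_hat - np.sin(x))**2))
print(f"RMSE against the noise-free curve: {rmse:.3f}")
```

The penalty `lam` keeps the linear system well conditioned even though neighboring columns of `Z` overlap heavily.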

This is radial basis function (RBF) regression; it is related in spirit to local methods such as LOESS

Later: the use of kernels in regression
