Linear regression finds the straight line that best fits the given data. It does so by learning the slope of the line and the bias term (the y-intercept).
Given a table:
| size of house in sq. ft ($x$) | price in $1000s ($y$) |
|---|---|
| 450 | 100 |
| 324 | 78 |
| 844 | 123 |
Our hypothesis (prediction) is: $$h_\theta(x) = \theta_0 + \theta_1x$$ This is nothing but the equation of a line, and it is what we will use to predict the price. When we say the machine learns, we mean that it adjusts the parameters $\theta_0$ and $\theta_1$. Then, for a new $x$ (size of house), we insert the value of $x$ into the equation above to produce a value $\hat y$ (our prediction).
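As a minimal sketch, the hypothesis is just a one-line function (the $\theta$ values below are made up purely for illustration, not learned):

```python
def predict(theta0, theta1, x):
    """Hypothesis h_theta(x) = theta_0 + theta_1 * x.

    Returns the predicted price (in $1000s) for a house of size x sq. ft.
    """
    return theta0 + theta1 * x

# Hypothetical parameters: with theta0=10.0 and theta1=0.2,
# a 450 sq. ft house is predicted at 10.0 + 0.2 * 450 = 100.0 ($100k).
y_hat = predict(10.0, 0.2, 450)
print(y_hat)
```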
Our prediction $\hat y$ will not always be accurate; it will have a certain error, which we define by an equation. We also need this equation in order to minimise the error; it is called the loss (or cost) function. One of the most widely used is Mean Squared Error (MSE), which is nothing but the mean of all the squared errors. We will use a halved version of it as our cost function: $$J(\theta) = \frac{1}{2m}\sum_{i=1}^m{(h_\theta(x_i) - y_i)^2}$$
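The halved MSE can be sketched directly from the formula (the predictions passed in below are hypothetical numbers, just to exercise the function):

```python
def mse_loss(predictions, targets):
    """Halved MSE: J = 1/(2m) * sum((h(x_i) - y_i)^2)."""
    m = len(targets)
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / (2 * m)

# Errors of 0 and 2 give (0^2 + 2^2) / (2 * 2) = 1.0
loss = mse_loss([100.0, 80.0], [100.0, 78.0])
print(loss)
```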
As we will see later, when we differentiate the squared error, the $\frac{1}{2}$ cancels the $2$ that the power rule brings down. Without it, we would be stuck with a constant factor of $2$ in the gradient, which serves no purpose.
Our objective function is $$\min_{\theta}\; J(\theta)$$
Which simply means: find the values of $\theta$ that minimise the cost function $J(\theta)$. In order to do that, we differentiate the cost function. Differentiating gives us the gradient, which points in the direction in which the error increases fastest. Once we have the gradient, we simply update our $\theta$ values by taking a step in the opposite direction (a step towards lower error).
So the update rule is the following equation: $$\theta_j = \theta_j - \alpha\frac{\partial}{\partial \theta_j} J(\theta)$$
Where,
$\alpha$ = the learning rate, which controls the size of each step we take in the direction of lower error.
This process is nothing but Gradient Descent. There are a few versions of gradient descent; some of them are:
- Batch Gradient Descent
- Stochastic Gradient Descent
- Mini-batch Gradient Descent
In the update rule, the important part is calculating the derivative. Since we have two parameters, we have two partial derivatives: one with respect to $\theta_0$ and another with respect to $\theta_1$.
So the derivation for $\theta_0$ is:
$ \begin{align} \notag \theta_0 &= \theta_0 - \alpha\frac{\partial}{\partial \theta_0} J(\theta) \\ \notag &= \theta_0 - \alpha\frac{\partial}{\partial \theta_0}\left(\frac{1}{2m}\sum_{i=1}^m{(h_\theta(x_i) - y_i)^2}\right) \\ \notag &= \theta_0 - \alpha\frac{2}{2m}\sum_{i=1}^m{(h_\theta(x_i) - y_i)}\frac{\partial}{\partial \theta_0}{(h_\theta(x_i) - y_i)} \\ \notag &= \theta_0 - \alpha\frac{1}{m}\sum_{i=1}^m{(h_\theta(x_i) - y_i)}\frac{\partial}{\partial \theta_0}{(h_\theta(x_i) - y_i)} \\ \notag &= \theta_0 - \alpha\frac{1}{m}\sum_{i=1}^m{(h_\theta(x_i) - y_i)}\frac{\partial}{\partial \theta_0}{(\theta_0 + \theta_1x_i - y_i)} \\ \notag &= \theta_0 - \alpha\frac{1}{m}\sum_{i=1}^m{(h_\theta(x_i) - y_i)}(1 + 0 - 0) \end{align} $
$$\therefore \theta_0= \theta_0 - \alpha\frac{1}{m}\sum_{i=1}^m{(h_\theta(x_i) - y_i)} $$Similarly, for $\theta_1$: $ \begin{align} \notag \theta_1 &= \theta_1 - \alpha\frac{\partial}{\partial \theta_1} J(\theta) \\ \notag &= \theta_1 - \alpha\frac{\partial}{\partial \theta_1}\left(\frac{1}{2m}\sum_{i=1}^m{(h_\theta(x_i) - y_i)^2}\right) \\ \notag &= \theta_1 - \alpha\frac{2}{2m}\sum_{i=1}^m{(h_\theta(x_i) - y_i)}\frac{\partial}{\partial \theta_1}{(h_\theta(x_i) - y_i)} \\ \notag &= \theta_1 - \alpha\frac{1}{m}\sum_{i=1}^m{(h_\theta(x_i) - y_i)}\frac{\partial}{\partial \theta_1}{(h_\theta(x_i) - y_i)} \\ \notag &= \theta_1 - \alpha\frac{1}{m}\sum_{i=1}^m{(h_\theta(x_i) - y_i)}\frac{\partial}{\partial \theta_1}{(\theta_0 + \theta_1x_i - y_i)} \\ \notag &= \theta_1 - \alpha\frac{1}{m}\sum_{i=1}^m{(h_\theta(x_i) - y_i)}(0 + x_i - 0) \end{align} $
$$\therefore \theta_1= \theta_1 - \alpha\frac{1}{m}\sum_{i=1}^m{(h_\theta(x_i) - y_i)}\,x_i $$We will implement Batch Gradient Descent, i.e. we update the parameters once per pass, using the gradient computed over the entire dataset. Our algorithm hence becomes:
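The two update rules above can be sketched in plain Python on the three rows from the table. The learning rate, iteration count, and the feature scaling are my own choices for this sketch, not from the text:

```python
xs = [450, 324, 844]   # size in sq. ft
ys = [100, 78, 123]    # price in $1000s
m = len(xs)

# Scale x to thousands of sq. ft to keep the updates numerically stable
# (an implementation choice, not part of the derivation).
xs = [x / 1000 for x in xs]

theta0, theta1 = 0.0, 0.0
alpha = 0.1
for _ in range(10000):
    # One pass over the whole dataset per update: batch gradient descent.
    errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
    grad0 = sum(errors) / m                               # (1/m) * sum(h(x_i) - y_i)
    grad1 = sum(e * x for e, x in zip(errors, xs)) / m    # (1/m) * sum((h(x_i) - y_i) * x_i)
    theta0 -= alpha * grad0
    theta1 -= alpha * grad1

# Approaches theta0 ≈ 57.3, theta1 ≈ 79.8, i.e. price ≈ 57.3 + 79.8 * (size/1000)
print(theta0, theta1)
```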
The update rule stated above can be extended to as many $\theta$s as needed using matrix notation. The key is to store the $\theta$s in one vector and the input data in a matrix, then take their dot product.
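A minimal sketch of the vectorised form, assuming NumPy and the same scaled data as before: the $\theta$s live in one vector, and the inputs live in a design matrix `X` whose first column is all ones so that the bias term $\theta_0$ falls out of the same dot product.

```python
import numpy as np

# Design matrix: column of ones for theta_0, sizes in thousands of sq. ft.
X = np.array([[1.0, 0.450],
              [1.0, 0.324],
              [1.0, 0.844]])
y = np.array([100.0, 78.0, 123.0])
m = len(y)

theta = np.zeros(2)   # [theta_0, theta_1] in one vector
alpha = 0.1
for _ in range(10000):
    # X @ theta gives all predictions at once; X.T @ errors gives
    # both partial derivatives at once.
    gradient = X.T @ (X @ theta - y) / m
    theta -= alpha * gradient

print(theta)
```

This computes exactly the same updates as the two scalar rules above, just for all parameters simultaneously, which is why the same code handles any number of features.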