Linear regression finds the straight line that best fits the given data. It does so by learning the slope of the line and the bias term (the y-intercept).
Given a table:
| size of house in sq. ft ($x$) | price in $1000s ($y$) |
|---|---|
| 450 | 100 |
| 324 | 78 |
| 844 | 123 |
Our hypothesis (prediction) is: $$h_\theta(x) = \theta_0 + \theta_1x$$ This is nothing but the equation of a line, and it is what we will use to predict the price. When we say the machine learns, we mean that it adjusts the parameters $\theta_0$ and $\theta_1$. Then, for a new $x$ (size of house), we insert the value of $x$ into the equation above to produce a value $\hat y$ (our prediction).
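As a minimal sketch, the hypothesis is just a one-line function (the $\theta$ values below are made up purely for illustration, not learned):

```python
def predict(theta0, theta1, x):
    """Hypothesis h_theta(x) = theta_0 + theta_1 * x.

    Returns the predicted price (in $1000s) for a house of size x sq. ft.
    """
    return theta0 + theta1 * x

# Hypothetical parameters: with theta0=10.0 and theta1=0.2,
# a 450 sq. ft house is predicted at 10.0 + 0.2 * 450 = 100.0 ($100k).
y_hat = predict(10.0, 0.2, 450)
print(y_hat)
```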
Our prediction $\hat y$ will not always be accurate; it will have a certain error, which we define by an equation. We also need this equation in order to minimise the error; it is called the loss (or cost) function. One of the most widely used is Mean Squared Error (MSE), which is nothing but the mean of all the squared errors. We will use a halved version of it as our cost function: $$J(\theta) = \frac{1}{2m}\sum_{i=1}^m{(h_\theta(x_i) - y_i)^2}$$
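The halved MSE can be sketched directly from the formula (the predictions passed in below are hypothetical numbers, just to exercise the function):

```python
def mse_loss(predictions, targets):
    """Halved MSE: J = 1/(2m) * sum((h(x_i) - y_i)^2)."""
    m = len(targets)
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / (2 * m)

# Errors of 0 and 2 give (0^2 + 2^2) / (2 * 2) = 1.0
loss = mse_loss([100.0, 80.0], [100.0, 78.0])
print(loss)
```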
As we will see later, when we differentiate the squared error, the $\frac{1}{2}$ cancels the $2$ that the power rule brings down. Without it, we would be stuck with a constant factor of $2$ in the gradient, which serves no purpose.
Our objective function is $$\min_{\theta}\; J(\theta)$$
Which simply means: find the values of $\theta$ that minimise the cost function $J(\theta)$. In order to do that, we differentiate the cost function. Differentiating gives us the gradient, which points in the direction in which the error increases fastest. Once we have the gradient, we simply update our $\theta$ values by taking a step in the opposite direction (a step towards lower error).
So the update rule is the following equation: $$\theta_j = \theta_j - \alpha\frac{\partial}{\partial \theta_j} J(\theta)$$
Where,
$\alpha$ = the learning rate, which controls the size of each step we take in the direction of lower error.
This process is nothing but Gradient Descent. There are a few versions of gradient descent; some of them are:
- Batch Gradient Descent
- Stochastic Gradient Descent
- Mini-batch Gradient Descent
In the update rule, the important part is calculating the derivative. Since we have two parameters, we have two partial derivatives: one with respect to $\theta_0$ and another with respect to $\theta_1$.
So the derivation for $\theta_0$ is:
$ \begin{align} \notag \theta_0 &= \theta_0 - \alpha\frac{\partial}{\partial \theta_0} J(\theta) \\ \notag &= \theta_0 - \alpha\frac{\partial}{\partial \theta_0}\left(\frac{1}{2m}\sum_{i=1}^m{(h_\theta(x_i) - y_i)^2}\right) \\ \notag &= \theta_0 - \alpha\frac{2}{2m}\sum_{i=1}^m{(h_\theta(x_i) - y_i)}\frac{\partial}{\partial \theta_0}{(h_\theta(x_i) - y_i)} \\ \notag &= \theta_0 - \alpha\frac{1}{m}\sum_{i=1}^m{(h_\theta(x_i) - y_i)}\frac{\partial}{\partial \theta_0}{(h_\theta(x_i) - y_i)} \\ \notag &= \theta_0 - \alpha\frac{1}{m}\sum_{i=1}^m{(h_\theta(x_i) - y_i)}\frac{\partial}{\partial \theta_0}{(\theta_0 + \theta_1x_i - y_i)} \\ \notag &= \theta_0 - \alpha\frac{1}{m}\sum_{i=1}^m{(h_\theta(x_i) - y_i)}(1 + 0 - 0) \end{align} $
$$\therefore \theta_0= \theta_0 - \alpha\frac{1}{m}\sum_{i=1}^m{(h_\theta(x_i) - y_i)} $$Similarly, for $\theta_1$: $ \begin{align} \notag \theta_1 &= \theta_1 - \alpha\frac{\partial}{\partial \theta_1} J(\theta) \\ \notag &= \theta_1 - \alpha\frac{\partial}{\partial \theta_1}\left(\frac{1}{2m}\sum_{i=1}^m{(h_\theta(x_i) - y_i)^2}\right) \\ \notag &= \theta_1 - \alpha\frac{2}{2m}\sum_{i=1}^m{(h_\theta(x_i) - y_i)}\frac{\partial}{\partial \theta_1}{(h_\theta(x_i) - y_i)} \\ \notag &= \theta_1 - \alpha\frac{1}{m}\sum_{i=1}^m{(h_\theta(x_i) - y_i)}\frac{\partial}{\partial \theta_1}{(h_\theta(x_i) - y_i)} \\ \notag &= \theta_1 - \alpha\frac{1}{m}\sum_{i=1}^m{(h_\theta(x_i) - y_i)}\frac{\partial}{\partial \theta_1}{(\theta_0 + \theta_1x_i - y_i)} \\ \notag &= \theta_1 - \alpha\frac{1}{m}\sum_{i=1}^m{(h_\theta(x_i) - y_i)}(0 + x_i - 0) \end{align} $
$$\therefore \theta_1= \theta_1 - \alpha\frac{1}{m}\sum_{i=1}^m{(h_\theta(x_i) - y_i)}\,x_i $$We will implement Batch Gradient Descent, i.e. we update the parameters once per pass, using the gradient computed over the entire dataset. Our algorithm hence becomes:
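The two update rules above can be sketched in plain Python on the three rows from the table. The learning rate, iteration count, and the feature scaling are my own choices for this sketch, not from the text:

```python
xs = [450, 324, 844]   # size in sq. ft
ys = [100, 78, 123]    # price in $1000s
m = len(xs)

# Scale x to thousands of sq. ft to keep the updates numerically stable
# (an implementation choice, not part of the derivation).
xs = [x / 1000 for x in xs]

theta0, theta1 = 0.0, 0.0
alpha = 0.1
for _ in range(10000):
    # One pass over the whole dataset per update: batch gradient descent.
    errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
    grad0 = sum(errors) / m                               # (1/m) * sum(h(x_i) - y_i)
    grad1 = sum(e * x for e, x in zip(errors, xs)) / m    # (1/m) * sum((h(x_i) - y_i) * x_i)
    theta0 -= alpha * grad0
    theta1 -= alpha * grad1

# Approaches theta0 ≈ 57.3, theta1 ≈ 79.8, i.e. price ≈ 57.3 + 79.8 * (size/1000)
print(theta0, theta1)
```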
The update rule stated above can be extended to as many $\theta$s as needed using matrix notation. The key is to store the $\theta$s in one vector and the input data in a matrix, then take their dot product.
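A minimal sketch of the vectorised form, assuming NumPy and the same scaled data as before: the $\theta$s live in one vector, and the inputs live in a design matrix `X` whose first column is all ones so that the bias term $\theta_0$ falls out of the same dot product.

```python
import numpy as np

# Design matrix: column of ones for theta_0, sizes in thousands of sq. ft.
X = np.array([[1.0, 0.450],
              [1.0, 0.324],
              [1.0, 0.844]])
y = np.array([100.0, 78.0, 123.0])
m = len(y)

theta = np.zeros(2)   # [theta_0, theta_1] in one vector
alpha = 0.1
for _ in range(10000):
    # X @ theta gives all predictions at once; X.T @ errors gives
    # both partial derivatives at once.
    gradient = X.T @ (X @ theta - y) / m
    theta -= alpha * gradient

print(theta)
```

This computes exactly the same updates as the two scalar rules above, just for all parameters simultaneously, which is why the same code handles any number of features.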