This document briefly explains why the target mean value minimizes MSE and why the target median minimizes MAE.
Suppose we have a dataset $\{(x_i, y_i)\}_{i=1}^{N}$. Basically, we are given pairs: features $x_i$ and the corresponding target value $y_i \in \mathbb{R}$.

We will denote the vector of targets as $y \in \mathbb{R}^N$, such that $y_i$ is the target for object $x_i$. Similarly, $\hat{y} \in \mathbb{R}^N$ denotes the predictions for the objects: $\hat{y}_i$ for object $x_i$.
Let's start with MSE loss. It is defined as follows:
$$\mathrm{MSE}(y, \hat{y}) = \frac{1}{N}\sum_{i=1}^{N} (\hat{y}_i - y_i)^2$$

Now, the question is: if the predictions for all the objects were the same and equal to a constant $\alpha$, i.e. $\hat{y}_i = \alpha$, what value of $\alpha$ would minimize the MSE?
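As a quick illustration, here is a minimal Python sketch of this constant-prediction MSE, evaluated on a small made-up target vector (the values are purely for illustration):

```python
# Toy targets (hypothetical values, just for illustration)
y = [-0.5, 0.0, 1.0, 3.0, 3.4]

def f(alpha, y):
    """MSE when every prediction equals the constant alpha."""
    return sum((alpha - yi) ** 2 for yi in y) / len(y)

# A constant closer to the bulk of the targets gives a lower MSE:
print(f(0.0, y))
print(f(1.0, y))
```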
$$\min_{\alpha} f(\alpha) = \frac{1}{N}\sum_{i=1}^{N} (\alpha - y_i)^2$$

The function $f(\alpha)$ that we want to minimize is smooth with respect to $\alpha$. A necessary condition for $\alpha^*$ to be a local optimum is $\left.\frac{df}{d\alpha}\right|_{\alpha=\alpha^*} = 0$.
Let's find the points that satisfy this condition:
$$\left.\frac{df}{d\alpha}\right|_{\alpha=\alpha^*} = \frac{2}{N}\sum_{i=1}^{N} (\alpha^* - y_i) = 0$$

And finally:

$$\alpha^* = \frac{1}{N}\sum_{i=1}^{N} y_i$$
Since the second derivative $\frac{d^2 f}{d\alpha^2} = 2$ is positive at the point $\alpha^*$, what we found is a local minimum (and, because $f$ is convex, it is also the global one).
So, that is how it is possible to show that the optimal constant for the MSE metric is the target mean value.
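A small numeric sanity check of this result, using a made-up target vector: the sample mean should beat every other constant on a grid of candidates.

```python
# Toy targets (hypothetical values, just for illustration)
y = [-0.5, 0.0, 1.0, 3.0, 3.4]

def f(alpha, y):
    """MSE when every prediction equals the constant alpha."""
    return sum((alpha - yi) ** 2 for yi in y) / len(y)

mean_y = sum(y) / len(y)

# Scan a grid of candidate constants; none should beat the mean.
grid = [i / 100 for i in range(-200, 500)]
best = min(grid, key=lambda a: f(a, y))
assert f(mean_y, y) <= f(best, y)
print(mean_y, f(mean_y, y))
```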
Similarly to the way we found the optimal constant for MSE loss, we can find it for MAE, which is defined as

$$\mathrm{MAE}(y, \hat{y}) = \frac{1}{N}\sum_{i=1}^{N} |\hat{y}_i - y_i|$$

For a constant prediction $\hat{y}_i = \alpha$ we again minimize $f(\alpha) = \frac{1}{N}\sum_{i=1}^{N} |\alpha - y_i|$.

Recall that $\frac{d|x|}{dx} = \operatorname{sign}(x)$ for $x \neq 0$, where $\operatorname{sign}$ stands for the signum function. Thus
$$\left.\frac{df}{d\alpha}\right|_{\alpha=\alpha^*} = \frac{1}{N}\sum_{i=1}^{N} \operatorname{sign}(\alpha^* - y_i) = 0$$

So we need to find an $\alpha^*$ such that
$$g(\alpha^*) = \frac{1}{N}\sum_{i=1}^{N} \operatorname{sign}(\alpha^* - y_i) = 0$$

Note that $g(\alpha^*)$ is a piecewise-constant non-decreasing function: $g(\alpha^*) = -1$ for all values of $\alpha^*$ less than the minimum $y_i$, and $g(\alpha^*) = 1$ for $\alpha^* > \max_i y_i$. The function "jumps" by $\frac{2}{N}$ at every point $y_i$. Here is an example of how this function looks for $y = [-0.5, 0, 1, 3, 3.4]$:
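The piecewise-constant function $g$ can be evaluated directly; here is a minimal Python sketch for the example vector:

```python
y = [-0.5, 0.0, 1.0, 3.0, 3.4]

def sign(x):
    """Signum: -1, 0, or +1."""
    return (x > 0) - (x < 0)

def g(alpha, y):
    """Normalized sum of signs: (1/N) * sum_i sign(alpha - y_i)."""
    return sum(sign(alpha - yi) for yi in y) / len(y)

# Below the minimum target g is -1, above the maximum it is +1,
# and it jumps up by 2/N each time alpha passes one of the y_i.
print(g(-1.0, y))  # -1.0
print(g(1.0, y))   # 0.0 at the median of y
print(g(4.0, y))   # 1.0
```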
Basically there are $N$ jumps of the same size, starting from $-1$ and ending at $1$. It is clear that you need about $N/2$ jumps to hit zero, and that happens exactly at the median value of the target vector: $g(\mathrm{median}(y)) = 0$. We should be careful and separate two cases: with an odd number of points the median is the middle value, where $g$ is exactly zero; with an even number, any value between the two middle points gives $g = 0$, so the minimizer is not unique, but the intuition remains the same.
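Both cases can be checked numerically; this sketch (again with made-up targets) verifies that the median's constant-prediction MAE is never beaten on a grid of alternatives:

```python
def mae_const(alpha, y):
    """MAE when every prediction equals the constant alpha."""
    return sum(abs(alpha - yi) for yi in y) / len(y)

def median(y):
    """Middle value for odd N; midpoint of the two middle values for even N."""
    s = sorted(y)
    n = len(s)
    mid = n // 2
    return s[mid] if n % 2 == 1 else (s[mid - 1] + s[mid]) / 2

for y in ([-0.5, 0.0, 1.0, 3.0, 3.4],   # odd number of points
          [-0.5, 0.0, 1.0, 3.0]):       # even number of points
    m = median(y)
    grid = [i / 100 for i in range(-200, 500)]
    # No constant on the grid should achieve a lower MAE than the median.
    assert all(mae_const(m, y) <= mae_const(a, y) + 1e-12 for a in grid)
    print(m, mae_const(m, y))
```

In the even case the assertion holds with equality for several grid points, reflecting the flat region of $f(\alpha)$ between the two middle targets.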