We assume that $k=1$, that the noise is Gaussian, and that $\Psi$ is finite.
We will assume that the target variable is described by

$$y = \vec{F}(\vec{x},\vec{w}) + \varepsilon\,,$$

where $\vec{F}$ is the model function, $\vec{x}$ the input, $\vec{w}$ the model parameters, and $\varepsilon$ a noise term.
The derivations provided below all assume Gaussian noise on the target data. Is this a good assumption? In many cases, yes. The argument hinges on the Central Limit Theorem, which says that the sum of many independent random variables behaves like a Gaussian-distributed random variable.
The noise term in this model, $\varepsilon$, can be thought of as the sum of features not included in the model function $\vec{F}(\vec{x},\vec{w})$. If these unaccounted-for features are themselves independent random variables and there are many of them, the Central Limit Theorem suggests a Gaussian model is appropriate.
It is possible that there are only a few unaccounted-for features, or that there is genuinely non-Gaussian noise in our observation measurements, e.g. sensor shot noise, which often follows a Poisson distribution. In such cases the assumption is no longer valid.
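To see the Central Limit Theorem argument in action, here is a small illustrative sketch (the choice of distribution, the 20 summands, and the sample size are all arbitrary): the sum of 20 independent uniform variables already has a close-to-Gaussian histogram.

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.RandomState(42)
# Each column plays the role of an unaccounted-for feature; sum 20 of them.
summed = rng.uniform(-1, 1, size=(10000, 20)).sum(axis=1)

plt.hist(summed, bins=50, density=True, alpha=0.6, label='sum of 20 uniforms')
xs = np.linspace(summed.min(), summed.max(), 200)
sigma = np.sqrt(20 / 3.0)  # each uniform(-1, 1) has variance 1/3
plt.plot(xs, np.exp(-xs**2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi)),
         label='Gaussian with matching variance')
plt.legend(frameon=True);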
Training dataset
import numpy as np               # if not already imported earlier in the notebook
import matplotlib.pyplot as plt  # if not already imported earlier in the notebook

N = 5 # Size of the data set
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0]]) # Inputs, shape: N x 1
y = np.array([10.5, 5.0, 3.0, 2.5, 1.0]) # Outputs, shape: N
X.T
array([[ 0., 1., 2., 3., 4.]])
y
array([ 10.5, 5. , 3. , 2.5, 1. ])
Test dataset
N_test = 100
X_test = np.linspace(0.0, 4.0, N_test).reshape((N_test, 1))
y_test = np.linspace(7.0, -5.0, N_test)+ 2*np.random.randn(N_test)
plt.plot(X[:, 0], y, 'bo', label='Training')
plt.plot(X_test[:, 0], y_test, "g.", label='Test')
plt.xlabel('$X$'); plt.ylabel('$y$'); plt.legend(frameon=True);
For a linear model, each observation is

$$y^{(n)} = \vec{w}^\intercal \vec{x}^{(n)} + b + \varepsilon^{(n)}\,,$$

where $\vec{w}$ and $b$ constitute a weight vector and bias, and $\varepsilon^{(n)} \sim \mathcal{N}(0, \sigma^2_\varepsilon)$.
A shorter version of this expression is the following matrix equation:

$$\vec{y} = \vec{X} \cdot \vec{w} + \vec{\epsilon},\ \vec{X} \in \mathbb{R}^{m \times n},\ \vec{y} \in \mathbb{R}^m,\ \vec{\epsilon} \in \mathbb{R}^m,\ \vec{w} \in \mathbb{R}^n,$$

where the $i$-th row of $\vec{X}$ represents $\vec{x}^{(i)}$ and the $j$-th entry of $\vec{y}$ represents $y^{(j)}$ (the bias $b$ is absorbed into $\vec{w}$ via a constant feature).
Least squares seeks the weights minimizing $\left\|\vec{X} \cdot \vec{w} - \vec{y}\right\|^2_2$, where $\left\|\vec{A}\right\|_2$ is called the Frobenius norm and is the generalization of the Euclidean norm to matrices.
There are two ways to adjust the weights of a linear model (which includes linear models with nonlinear features): solving the normal equations in closed form, or iterating with gradient descent.
In addition, we can add a penalty term to the objective function to discourage large weights: Tikhonov regularization (forward pointer).
Solving $\vec{X} \cdot \vec{w} = \vec{y}$ means solving $$ \left\{ \begin{array}{rcrcl} x^{(1)}_1 w_1 & + x^{(1)}_2 w_2 & + \cdots & + x^{(1)}_n w_n & = y^{(1)} \\ x^{(2)}_1 w_1 & + x^{(2)}_2 w_2 & + \cdots & + x^{(2)}_n w_n & = y^{(2)} \\ x^{(3)}_1 w_1 & + x^{(3)}_2 w_2 & + \cdots & + x^{(3)}_n w_n & = y^{(3)} \\ \vdots & \vdots & \ddots & \vdots & = \vdots \\ x^{(m)}_1 w_1 & + x^{(m)}_2 w_2 & + \cdots & + x^{(m)}_n w_n & = y^{(m)} \end{array} \right.\,. $$
That is, the solution of
$$\vec{X}^\intercal\vec{X}\, \hat{\vec{w}} = \vec{X}^\intercal\vec{y}$$

for $\hat{\vec{w}}$, i.e.,

$$\hat{\vec{w}} = (\vec{X}^\intercal\vec{X})^{-1}\vec{X}^\intercal\vec{y}\,.$$

def normal_equations(X, y):
    # w = np.linalg.inv(X.T.dot(X)).dot(X.T).dot(y)
    # ... or we can use the Moore-Penrose pseudoinverse (usually more
    # numerically stable)
    w = np.linalg.pinv(X).dot(y)
    # ... or we can use the least-squares solver
    # w = np.linalg.lstsq(X, y, rcond=None)[0]
    return w
Append a constant bias feature (a column of ones) to each instance:
X_bias = np.hstack((X, np.ones((N, 1))))
X_test_bias = np.hstack((X_test, np.ones((N_test, 1))))
X.T
array([[ 0., 1., 2., 3., 4.]])
X_bias.T
array([[ 0.,  1.,  2.,  3.,  4.],
       [ 1.,  1.,  1.,  1.,  1.]])
w_norm = normal_equations(X_bias, y)
plt.plot(X_bias[:, 0], y, "o")
plt.plot(X_bias[:, 0], X_bias.dot(w_norm), "m-", linewidth=2)
plt.xlabel("x"); plt.ylabel("y"); plt.title("Model: $y = %.2f x + %.2f$" % tuple(w_norm));
We can now solve linear regression problems at will!
Test the other forms of normal equations.
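As a quick sanity check, the sketch below (reusing `X_bias` and `y` from above) confirms that the three variants inside `normal_equations` agree to numerical precision:

w_inv = np.linalg.inv(X_bias.T.dot(X_bias)).dot(X_bias.T).dot(y)
w_pinv = np.linalg.pinv(X_bias).dot(y)
w_lstsq = np.linalg.lstsq(X_bias, y, rcond=None)[0]
print(np.allclose(w_inv, w_pinv), np.allclose(w_pinv, w_lstsq))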
Think of this problem: linear regression where every input instance is a 256 × 256 pixel image.

Assuming that each pixel is represented by three bytes defining the RGB values, storing $\vec{X}^\intercal\vec{X}$ requires:
n_pixels = 256 * 256
memory_size_xtx = n_pixels * n_pixels * 3 * 8
print("Required memory: %d GB" % (memory_size_xtx / 2**30))
Required memory: 96 GB
Inversion is a procedure that requires $O(n^{2.373})$ operations (the exponent of the asymptotically fastest known matrix-multiplication algorithms).
In cases where the training data set is very large or data is received in a stream, a direct solution using the normal equations may not be possible. An alternative approach is the stochastic gradient descent algorithm.
Iteratively update the weights by stepping against the gradient of the error.

The weight update $\Delta \vec{w}_t$ is proportional to minus the gradient of the error function $E(\vec{X}, \vec{y};\vec{w}_t)$; for each weight $w_i$:

$$ \Delta w_i = - \alpha\frac{\partial E(\vec{X}, \vec{y};\vec{w})}{\partial w_i}\,, $$

where $\alpha > 0$ is the learning rate.
Update $\vec{w}$ incrementally in every iteration $t$ as:
$$ \vec{w}(t+1) = \vec{w}(t) + \Delta \vec{w}(t)\,. $$

Using as error function the sum of squared errors (SSE),
$$E(\vec{X}, \vec{y};\vec{w}) = \frac{1}{2}\left\|\vec{X} \cdot \vec{w} - \vec{y}\right\|^2_2\,,$$

the gradient becomes

$$\nabla_{\vec{w}} E(\vec{X}, \vec{y};\vec{w}) = \vec{X}^\intercal \cdot (\vec{X} \cdot \vec{w} - \vec{y})\,.$$

def error(X, y, w):
    return 0.5*np.linalg.norm(X.dot(w) - y)**2
def linear_regression_gradient(X, y, w):
    return X.T.dot(X.dot(w)-y)
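To gain confidence in the analytic gradient, we can compare it against central finite differences; `numerical_gradient` below is an illustrative helper, not part of the derivation:

def numerical_gradient(X, y, w, h=1e-6):
    # Central differences on the SSE error, one weight at a time
    grad = np.zeros_like(w)
    for i in range(len(w)):
        w_plus, w_minus = w.copy(), w.copy()
        w_plus[i] += h
        w_minus[i] -= h
        grad[i] = (error(X, y, w_plus) - error(X, y, w_minus)) / (2 * h)
    return grad

w_check = np.array([1.0, -2.0])
print(np.allclose(linear_regression_gradient(X_bias, y, w_check),
                  numerical_gradient(X_bias, y, w_check)))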
Initialize weights to small random values
w = np.random.random(2)/10000
alpha = 0.05
for i in range(50):
# print('Iteration', i+1, 'weights', w, 'error', error(X_bias, y, w))
w = w - alpha*linear_regression_gradient(X_bias, y, w)
print('w_norm:', w_norm, 'w grad. desc.', w)
w_norm: [-2.15 8.7 ] w grad. desc. [-2.089 8.526]
Implementing the gradient descent loop as a function.
def gradient_descent(X, y, w_0, alpha, max_iters):
    'Returns the values of the weights as learning took place.'
    w = np.array(w_0, dtype=np.float64)
    w_hist = np.zeros(shape=(max_iters+1, w.shape[0]))
    w_hist[0] = w
    for i in range(max_iters):
        delta_weights = -alpha*linear_regression_gradient(X, y, w)
        w += delta_weights
        w_hist[i+1] = w
    return w_hist
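The loop above is *batch* gradient descent: every update uses the full data set. A stochastic variant updates on one randomly chosen sample per step; the function below is a sketch with the same interface (the name and the `seed` parameter are our own choices):

def stochastic_gradient_descent(X, y, w_0, alpha, max_iters, seed=0):
    'Like gradient_descent, but each update uses a single random sample.'
    rng = np.random.RandomState(seed)
    w = np.array(w_0, dtype=np.float64)
    w_hist = np.zeros(shape=(max_iters+1, w.shape[0]))
    w_hist[0] = w
    for i in range(max_iters):
        n = rng.randint(len(X))        # pick one training instance
        x_n, y_n = X[n:n+1], y[n:n+1]  # keep the 2-D/1-D shapes
        w += -alpha*linear_regression_gradient(x_n, y_n, w)
        w_hist[i+1] = w
    return w_hist

With a constant learning rate the iterates hover around the solution rather than settling on it, which is the usual trade-off of stochastic updates.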
w_0 = [-3,2] # we fix the initial weights to make it more illustrative
alpha = 0.05
max_iters = 25
w_hist = gradient_descent(X_bias, y, w_0, alpha, max_iters)
plot_hist_contour(X_bias, y, w_hist, w_norm, title='end='+str(w_hist[-1]), show_legend=True)
What is the impact of the learning rate ($\alpha$)?
def alphas_study(alphas):
fig = plt.figure(figsize=(11,7))
for i,alpha in enumerate(alphas):
ax = fig.add_subplot(2,3,i+1)
w_hist = gradient_descent(X_bias, y , w_0, alpha, max_iters)
plot_hist_contour(X_bias, y, w_hist, w_norm, ax=ax, title='$\\alpha='+str(alpha)+'$')
plt.legend(scatterpoints=1, ncol=3, bbox_to_anchor=(-0.2,-0.2), frameon=True);
plt.tight_layout()
alphas = np.linspace(0.02,0.07,6)
alphas
array([ 0.02, 0.03, 0.04, 0.05, 0.06, 0.07])
alphas_study(alphas)
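The divergence for the larger learning rates is not accidental: for a quadratic error, plain gradient descent converges only if $\alpha < 2/\lambda_\max$, where $\lambda_\max$ is the largest eigenvalue of $\vec{X}^\intercal\vec{X}$. A quick check on our data:

lambda_max = np.linalg.eigvalsh(X_bias.T.dot(X_bias)).max()
print('Stable for alpha < %.4f' % (2.0 / lambda_max))

For `X_bias` this bound comes out at roughly $0.06$, consistent with the transition from convergence to divergence across the panels above.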
Idea: adding an inertial momentum term to the update can speed up convergence and damp oscillations.
Update $\vec{w}$ incrementally in every iteration $t$ as:
$$\vec{w}(t+1) = \vec{w}(t) + \Delta \vec{w}(t),\qquad \Delta \vec{w}(t) = -\alpha \nabla_{\vec{w}} E + \beta\, \Delta \vec{w}(t-1),$$where $\beta\in\mathbb{R}^+$ is known as the momentum rate.
def gradient_descent_with_momentum(X, y, w_0, alpha, beta, max_iters):
    w = np.array(w_0, dtype=np.float64)
    w_hist = np.zeros(shape=(max_iters+1, w.shape[0]))
    w_hist[0] = w
    omega = np.zeros_like(w)  # previous update; carries the momentum
    for i in range(max_iters):
        delta_weights = -alpha*linear_regression_gradient(X, y, w) + beta*omega
        omega = delta_weights  # remember this update for the next step
        w += delta_weights
        w_hist[i+1] = w
    return w_hist
alpha = 0.05
beta = 0.5
max_iters = 25
w_hist = gradient_descent(X_bias, y, (-3,2), alpha, max_iters)
w_hist_mom = gradient_descent_with_momentum(X_bias,y, (-3,2), alpha, beta, max_iters)
comparison_plot()
def alphas_study_with_momentum(alphas, beta):
fig = plt.figure(figsize=(11,7))
for i,alpha in enumerate(alphas):
ax = fig.add_subplot(2,3,i+1)
w_hist = gradient_descent_with_momentum(X_bias, y , w_0, alpha, beta, max_iters)
plot_hist_contour(X_bias, y, w_hist, w_norm, ax=ax, title='$\\alpha='+str(alpha)+'$')
plt.legend(scatterpoints=1, ncol=3, bbox_to_anchor=(-0.2,-0.2), frameon=True);
plt.tight_layout()
alphas_study_with_momentum(alphas, 0.5)
alphas_study_with_momentum(alphas, 0.25)
alphas_study_with_momentum(alphas, 0.75)
... and I know you are wondering what $\beta > 1$ looks like.
alphas_study_with_momentum(alphas, 1.05)
plt.plot(X_bias[:, 0], y, "o", label='training data')
plt.plot(X_bias[:, 0], X_bias.dot(w_norm), "D-", c='lightcoral',
linewidth=2, label='normal equation')
plt.plot(X_bias[:, 0], X_bias.dot(w_hist[-1]), "s-", c='skyblue',
linewidth=2, label='gradient descent')
plt.plot(X_bias[:, 0], X_bias.dot(w_hist_mom[-1]), "^-", c='green',
linewidth=2, alpha=0.5, label='gradient descent\n with momentum')
plt.xlabel("$X$"); plt.ylabel("$y$"); plt.legend(numpoints=1, bbox_to_anchor=(1.45,1), frameon=True);
N_test = 10000000
X_test_load = np.linspace(0.0, 4.0, N_test).reshape((N_test, 1))
y_test_load = np.linspace(7.0, -5.0, N_test)+ 2*np.random.randn(N_test)
X_test_load_bias = np.hstack((X_test_load, np.ones((N_test, 1))))
%%time
w_norm = normal_equations(X_test_load_bias, y_test_load)
CPU times: user 788 ms, sys: 410 ms, total: 1.2 s Wall time: 1.07 s
%%time
w_grad = gradient_descent_with_momentum(X_test_load_bias, y_test_load,
np.random.random(2)/1000, 0.000001, 0.25, 25)
CPU times: user 2.62 s, sys: 753 ms, total: 3.37 s Wall time: 1.77 s
w_norm
array([-3. , 7.001])
w_grad[-1]
array([ -6.249e+43, -2.443e+43])
np.linalg.norm(w_norm - w_grad[-1])
6.7101402268503633e+43
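Gradient descent exploded here: with $N = 10^7$ samples the gradient $\vec{X}^\intercal(\vec{X}\vec{w}-\vec{y})$ grows with $N$, so even $\alpha = 10^{-6}$ is far above the stability bound. A common remedy, sketched below with a hypothetical `gradient_descent_mean` helper, is to average the gradient over the samples, which makes a reasonable $\alpha$ roughly independent of the data set size:

def gradient_descent_mean(X, y, w_0, alpha, max_iters):
    'Gradient descent on the mean (rather than summed) squared error.'
    w = np.array(w_0, dtype=np.float64)
    for _ in range(max_iters):
        w -= alpha * linear_regression_gradient(X, y, w) / len(X)
    return w

w_mean = gradient_descent_mean(X_test_load_bias, y_test_load,
                               np.random.random(2)/1000, 0.05, 25)

This no longer diverges; with more iterations (or momentum) the iterates approach the normal-equation solution.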
To approximate nonlinear functions with a linear model, we have to generate nonlinear features.
In this example, we generate cosinusoidal features. You could also try radial basis functions, polynomials, ...
We expand each feature $x_j$ to the nonlinear feature vector

$$\vec{\phi}(x_j) = \left(\cos\frac{0\,\pi x_j}{d},\ \cos\frac{1\,\pi x_j}{d},\ \ldots,\ \cos\frac{d\,\pi x_j}{d}\right)^\intercal\,,$$

where $d$ is the degree, giving $d+1$ basis functions.
from itertools import chain, cycle
def cosinusoidalize(X, n_degree):
X_cosinusoidal = np.ndarray((len(X), n_degree+1))
for d in range(n_degree+1):
X_cosinusoidal[:, d] = np.cos(X[:, 0] * np.pi * d / n_degree)
return X_cosinusoidal
# Utility function
def build_cosinusoidal(n_degree, w):
return ((("%+.2f \\cos(\\frac{%d \\pi x}{" + str(n_degree) + "} )") * (n_degree + 1)) % tuple(chain(*zip(w, range(n_degree + 1)))))
N_DEGREE = 4
X_cosinusoidal = cosinusoidalize(X, n_degree=N_DEGREE)
X_test_cosinusoidal = cosinusoidalize(X_test, n_degree=N_DEGREE)
X.T, X_cosinusoidal
(array([[ 0.,  1.,  2.,  3.,  4.]]),
 array([[ 1.   ,  1.   ,  1.   ,  1.   ,  1.   ],
        [ 1.   ,  0.707,  0.   , -0.707, -1.   ],
        [ 1.   ,  0.   , -1.   , -0.   ,  1.   ],
        [ 1.   , -0.707, -0.   ,  0.707, -1.   ],
        [ 1.   , -1.   ,  1.   , -1.   ,  1.   ]]))
from matplotlib.lines import lineStyles
ls = cycle([style for style in lineStyles if not lineStyles[style] =='_draw_nothing'])
for deg in range(N_DEGREE):
plt.plot(np.linspace(0,4,100), X_test_cosinusoidal[:,deg],
linestyle=next(ls), label='Degree '+str(deg))
plt.legend(frameon=True, bbox_to_anchor=(1.,1))
plt.title('Cosine basis transformations'); plt.xlabel("$X$"); plt.ylabel("$y$");
w_cos = normal_equations(X_cosinusoidal, y)
plt.figure(figsize=(11,4.5))
plt.plot(X_bias[:, 0], X_bias.dot(w), "m-", label='linear')
plt.plot(X_test_bias[:, 0], X_test_cosinusoidal.dot(w_cos), "g:", label='cosinusoidal')
plt.plot(X_bias[:, 0], y, "bo", label='training')
plt.xlabel("$X$"); plt.ylabel("$y$"); plt.legend(frameon=True)
plt.title("$y = " + build_cosinusoidal(N_DEGREE, w_cos) + "$");
You might be wondering, is this cheating? Why does the curve pass exactly through all five training points for $d =$ `N_DEGREE` $= 4$? With $d+1 = 5$ basis functions and $N = 5$ training points, the design matrix is square, so as long as it is invertible the linear system can be solved exactly and the training error is zero.
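A quick check (reusing the objects above) confirms that the design matrix is square and invertible, so the fit interpolates the training points exactly:

print(X_cosinusoidal.shape)                       # (5, 5): square system
print(np.linalg.matrix_rank(X_cosinusoidal))      # 5: full rank, invertible
print(np.allclose(X_cosinusoidal.dot(w_cos), y))  # True: zero training error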
Regularization comprises techniques used to account for overfitting or underfitting of a given model when using a maximum likelihood approach.
The error function is augmented with a penalty,

$$E(\vec{w}) = E_D(\vec{w}) + \lambda E_W(\vec{w})\,,$$

where $\lambda$ is the regularization coefficient, $E_D$ is the data error, and $E_W$ is the regularization term.
A commonly used regularization term is the sum-of-squares of the model parameters,
$$E_W(\vec{w}) = \frac1{2}\vec{w}^\intercal\vec{w}\,,$$known as the weight decay regularizer.
This regularization term leads, for a linear regression model with Gaussian noise on the training data, to the optimal solution

$$ \vec{w} = \left(\lambda \vec{I} + \vec{\Phi}^\intercal \vec{\Phi}\right)^{-1} \vec{\Phi}^\intercal\vec{y}\,, $$

obtained by setting the gradient $\vec{\Phi}^\intercal(\vec{\Phi}\vec{w} - \vec{y}) + \lambda\vec{w}$ to zero.

def y(x, m, b, mu=0, sigma=1):
    return m * x + b + np.random.normal(mu, sigma, 1)[0]
def el_pow(x, pow):
    # Element-wise power; for a NumPy array this is equivalent to x**pow
    temp = x
    for i in range(pow - 1):
        temp = temp * x
    return temp
import math  # needed for math.pow below

def prediction(params, x):
    # Evaluate the polynomial sum_i params[i] * x**i
    pred = 0
    for i in range(len(params)):
        pred += params[i] * math.pow(x, i)
    return pred
Generating training data, with $N$ data points
N = 101
M = 8
t = np.empty(N)
domain = np.empty(N)
domain_bound = 1.0 / N
for i in range(N):
domain[i] = i * domain_bound
for i in range(N):
t[i] = y(x=domain[i], m=4.89, b=0.57)
Find the solution without using regularization. First, build the design matrix, $\vec{\Phi}$, of dimension $N\times M$:
phi = np.array([np.ones(N), domain, el_pow(domain,2), el_pow(domain,3),
el_pow(domain,4), el_pow(domain,5), el_pow(domain,6),el_pow(domain,7)]).T
temp1 = np.linalg.inv(np.dot(phi.T, phi))  # inverse of phi.T dot phi
temp2 = np.dot(temp1, phi.T)
Solution without regularization:
w_nreg = np.dot(temp2,t)
w_nreg
array([ 0.194, 29.042, -317.676, 1720.279, -4590.23 , 6374.297, -4413.416, 1202.347])
predicted_t = [prediction(w_nreg,x) for x in domain]
Finding the regularized solution
lam = 0.1
temp1 = np.linalg.inv(lam * np.eye(M) + np.dot(phi.T, phi))
temp2 = np.dot(temp1, phi.T)
w_reg = np.dot(temp2, t)
w_reg
array([ 0.793, 3.763, 1.178, 0.412, 0.146, -0.088, -0.397, -0.772])
predicted_t_reg = [prediction(w_reg, x) for x in domain]
plt.plot(domain, t, '.', label='training')
plt.plot(domain, predicted_t, label='predicted')
plt.plot(domain, predicted_t_reg, '--', label='regularized')
plt.legend(loc='lower right', frameon=True); plt.xlabel("$X$"); plt.ylabel("$y$")
plt.title('Effects of regularization');
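To quantify the shrinkage induced by the weight-decay term, a last quick sketch compares the norms of the two weight vectors:

print('||w|| without regularization: %.1f' % np.linalg.norm(w_nreg))
print('||w|| with regularization:    %.2f' % np.linalg.norm(w_reg))

The unregularized fit achieves its wiggly shape only through very large, mutually cancelling coefficients; the penalty $\lambda E_W$ trades a little training error for far smaller weights.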