In the last lecture, we learned how to set up and use softmax regression to classify all 10 handwritten digits based on pixel intensities on a 28x28 grid (the MNIST dataset).
However, the training is slow even on a decently configured laptop, due to the size of the dataset and the number of parameters.
Today we will learn a new method called Stochastic Gradient Descent (SGD), one of the two pillars of deep learning (the other being backpropagation).
Every state-of-the-art deep learning library contains implementations of various refinements of (stochastic) gradient descent.
We consider a loss function (e.g., the one we saw in softmax regression) which is the average of the sample-wise loss:
$$L(\mathbf{w}) := L(\mathbf{w}; X,\mathbf{y}) = \frac{1}{N}\sum_{i=1}^N f_i(\mathbf{w}; \mathbf{x}^{(i)},y^{(i)})$$which has the weights $\mathbf{w}$ as the parameters. Let us recall the softmax loss function here for comparison:
$$ L (\mathbf{w}) = - \frac{1}{N}\sum_{i=1}^N \Big\{\text{cross-entropy for each sample} \Big\}=- \frac{1}{N}\sum_{i=1}^N \left\{\sum_{k=1}^K 1_{\{y^{(i)} = k\}} \ln \Bigg( \frac{\exp(\mathbf{w}_k^{\top} \mathbf{x}^{(i)})}{\sum_{j=1}^{K} \exp\big(\mathbf{w}_j^{\top} \mathbf{x}^{(i)} \big) } \Bigg)\right\}. $$Then the gradient descent method for it reads:
Choose initial guess $\mathbf{w}_0$, step size (learning rate) $\eta$, number of iterations $M$
For $k=0,1,2, \cdots, M$
$\displaystyle\mathbf{w}_{k+1} = \mathbf{w}_k - \eta\nabla_{\mathbf{w}} L(\mathbf{w}_k) = \mathbf{w}_k - \eta\frac{1}{N}\sum_{i=1}^N \nabla_{\mathbf{w}} f_i(\mathbf{w}_k; \mathbf{x}^{(i)},y^{(i)})$
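For reference, the gradient of each sample-wise loss with respect to the $k$-th weight vector has the well-known "predicted probability minus indicator" form:
$$\nabla_{\mathbf{w}_k} f_i(\mathbf{w}; \mathbf{x}^{(i)}, y^{(i)}) = \Big( p_k^{(i)} - 1_{\{y^{(i)} = k\}} \Big)\, \mathbf{x}^{(i)}, \qquad p_k^{(i)} := \frac{\exp(\mathbf{w}_k^{\top} \mathbf{x}^{(i)})}{\sum_{j=1}^{K} \exp\big(\mathbf{w}_j^{\top} \mathbf{x}^{(i)}\big)}.$$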
The gradient has to be evaluated for all $N$ samples in every iteration, and each evaluation involves a matrix-vector multiplication of cost $O(n)$, where $n$ is the number of features in one sample.
The idea of SGD is to replace the full gradient with the gradient at a single randomly chosen sample, so that each iteration becomes much cheaper; as a result, the iterates behave very differently from those of gradient descent, even for the same loss function.
Suppose our loss function is still:
$$L := L(\mathbf{w}; X,\mathbf{y}) = \frac{1}{N}\sum_{i=1}^N f_i(\mathbf{w}; \mathbf{x}^{(i)},y^{(i)}),$$where $X = (\mathbf{x}^{(1)}, \dots, \mathbf{x}^{(N)})^{\top}$ are the training samples, and $\mathbf{y} = (y^{(1)}, \dots, y^{(N)})^{\top}$ are the labels/target values for the training samples.
Choose initial guess $\mathbf{w}_0$, step size (learning rate) $\eta$, number of inner iterations $M$, number of epochs $n_E$
Set $\mathbf{w}_{M+1} = \mathbf{w}_0$ for epoch $e=0$
For epoch $e=1,2, \cdots, n_E$
$\mathbf{w}_{0}$ for the current epoch is $\mathbf{w}_{M+1}$ for the previous epoch.
Randomly shuffle the samples so that $\{\mathbf{x}^{(m)},y^{(m)}\}_{m=1}^N$ is a permutation of the original dataset.
For $m=0,1,2, \cdots, M$
$\displaystyle\mathbf{w}_{m+1} = \mathbf{w}_m - \eta \nabla f_m(\mathbf{w}_m; \mathbf{x}^{(m)},y^{(m)})$
If $M = N$, i.e., the inner loop sweeps through all training samples once, then one outer iteration is called a (complete) epoch.
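Before specializing to a concrete model, here is a minimal sketch of the loop above, assuming a per-sample gradient function grad_f(w, x, y) (a hypothetical placeholder for whichever model we plug in) and taking $M = N$, i.e., one full pass per epoch:
import numpy as np

def sgd(w0, grad_f, X, y, eta=1e-3, n_epochs=5):
    # vanilla SGD: one randomly shuffled pass over the data per epoch (M = N)
    w = w0.copy()
    N = len(y)
    for _ in range(n_epochs):
        for i in np.random.permutation(N):
            w -= eta * grad_f(w, X[i], y[i])  # update using a single sample
    return w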
Let us give it a go on linear regression. This time, we are using scikit-learn's built-in dataset generator. For linear regression itself, we could use scikit-learn's LinearRegression() class, as in the multivariate regression from previous lectures, but since we are illustrating SGD, we will implement it ourselves.
Recall the loss function with the $L^2$ regularization and its gradient: let the weight $\mathbf{w} = (w_0, \widehat{\mathbf{w}})$, where $w_0$ is the bias and $\widehat{\mathbf{w}}$ is the vector containing the weights for the features of the dataset:
$$ L(\mathbf{w}) = \frac{1}{N}\sum_{i=1}^N \left( [1, \;\mathbf{x}^{(i)}]^{\top} \mathbf{w} - y^{(i)} \right)^2 + \epsilon |\widehat{\mathbf{w}}|^2, \\ \frac{\partial L(\mathbf{w})}{\partial \mathbf{w}} = \frac{2}{N}\sum_{i=1}^N [1, \;\mathbf{x}^{(i)}]\left( [1, \;\mathbf{x}^{(i)}]^{\top} \mathbf{w} - y^{(i)}\right) + 2\epsilon\, [0, \widehat{\mathbf{w}}] $$If there is no bias:
$$
L(\mathbf{w}) = \frac{1}{N}\sum_{i=1}^N
\left( \mathbf{w}^{\top}\mathbf{x}^{(i)} - y^{(i)} \right)^2 + \epsilon |\mathbf{w}|^2, \\
\frac{\partial L(\mathbf{w})}{\partial \mathbf{w}} = \frac{2}{N}\sum_{i=1}^N \mathbf{x}^{(i)} \left( \mathbf{w}^{\top}\mathbf{x}^{(i)} - y^{(i)}\right) + 2\epsilon\,\mathbf{w}. $$
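The implementation below handles the no-bias case, which matches the dataset we are about to generate (make_regression uses bias=0.0 by default); for completeness, here is a sketch of the with-bias gradient under the same conventions (the function name is illustrative, and the bias $w_0$ is not regularized):
import numpy as np

def gradient_loss_with_bias(w, X, y, eps=1e-3):
    # w[0] is the bias w_0; w[1:] are the feature weights \hat{w}
    Xb = np.column_stack([np.ones(len(X)), X])   # augment each sample with a leading 1
    residual = Xb @ w - y                        # [1, x]^T w - y for every sample
    grad_data = 2*np.mean(residual[:, None]*Xb, axis=0)
    grad_reg = 2*eps*np.concatenate([[0.0], w[1:]])  # the bias is not regularized
    return grad_data + grad_reg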
Reference: Scikit-learn's dataset loading utilities
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
np.set_printoptions(suppress=True)
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
X, y = make_regression(n_samples=10000, n_features=10)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
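A quick look at the shapes: with the 70/30 split above we expect 7000 training samples and 3000 test samples, each with 10 features.
# sanity check on the shapes of the split
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)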
eps = 1e-3 # regularization parameter
# our model: returns the linear prediction X w (no bias term in this example)
def h(X, w):
    # X: training sample(s), w: weight vector
    return np.matmul(X, w)
# loss function = mean squared error on the given dataset X, y, plus L2 regularization
def loss(w, X, y):
    residual_components = h(X, w) - y
    regularization = eps*np.sum(w**2)
    # np.mean is preferred over (1/len(y))*np.sum(...): when y is a single label
    # (as in the SGD inner loop), len(y) does not exist, while np.mean still works
    return np.mean(residual_components**2) + regularization
def gradient_loss(w, X, y):
    # per-sample gradient of the squared error: (h(x, w) - y) x, one row per sample
    gradient_for_all_training_data = (h(X, w) - y).reshape(-1,1)*X
    gradient_for_regularization = 2*eps*w
    # average over the given samples: an (n_features,) array
    gradient_mean_training_data = np.mean(gradient_for_all_training_data, axis=0)
    return 2*gradient_mean_training_data + gradient_for_regularization
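Since the SGD inner loop below will call these functions on a single sample, it is worth a quick sanity check that both handle the 1-D case; a minimal check (the zero weight vector here is just illustrative):
w_tmp = np.zeros(X_train.shape[1])
print(loss(w_tmp, X_train[0, :], y_train[0]))                # scalar loss for one sample
print(gradient_loss(w_tmp, X_train[0, :], y_train[0]).shape) # (10,), same as for the full batch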
# we define a validation function to compute the R^2 score
def rsquared(w, X, y):
    y_pred = h(X, w)
    return 1 - np.sum((y - y_pred)**2)/np.sum((y - y.mean())**2)
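As a quick illustration of the metric: fitting the exact least-squares weights should give $R^2 \approx 1$, since make_regression generates noiseless data by default (w_exact is an illustrative name):
w_exact, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)
print(rsquared(w_exact, X_train, y_train))  # ~1.0 on noiseless data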
# initialization and hyper-parameters
w = 1e-2*np.random.random(np.shape(X_train)[1]) # weights (no bias term in this example)
eta = 1e-3 # step size (learning rate)
num_steps = 1000
# sanity check: two equivalent ways of computing the model output on the first 10 samples
np.sum(w * X_train, axis=1)[:10]
np.matmul(X_train, w)[:10]
loss_at_eachstep = np.zeros(num_steps) # record the change of the loss function
for i in range(num_steps):
    loss_at_eachstep[i] = loss(w, X_train, y_train)
    dw = gradient_loss(w, X_train, y_train)
    w -= eta * dw
    if i % 200 == 0:
        # keep track of the loss and R^2 scores, to make sure we are moving in the right direction
        print("loss after", i+1, "iterations is:", loss(w, X_train, y_train))
        print("Training R squared after", i+1, "iterations is:", rsquared(w, X_train, y_train))
        print("Testing R squared after", i+1, "iterations is:", rsquared(w, X_test, y_test))
plt.plot(range(num_steps), loss_at_eachstep)
plt.show()
We use np.random.permutation to randomly permute the order of the samples.
eta = 1e-4 # step size (learning rate); in general SGD needs a smaller learning rate
num_epochs = 5 # number of outer iterations (epochs)
N = len(y_train) # number of training samples
M = N # number of inner iterations; in general you can choose M <= N
w = 1e-2*np.random.random(np.shape(X_train)[1])
sgd_loss_at_eachstep = np.zeros([num_epochs,M]) # record the change of the loss function:
# one row per epoch (outer iteration), one column per inner iteration
for e in range(num_epochs):
    shuffle_index = np.random.permutation(N)
    for m in range(M):
        i = shuffle_index[m] # index of the m-th sample after shuffling
        sgd_loss_at_eachstep[e,m] = loss(w, X_train, y_train)
        # gradient of the loss at the i-th sample only
        dw = gradient_loss(w, X_train[i,:], y_train[i])
        w -= eta * dw
        if m % 1000 == 0:
            print("loss after", e+1, "epochs and", m+1, "iterations is:", loss(w, X_train, y_train))
            print("Training R squared after", e+1, "epochs and", m+1,
                  "iterations is:", rsquared(w, X_train, y_train))
            print("Testing R squared after", e+1, "epochs and", m+1,
                  "iterations is:", rsquared(w, X_test, y_test))
plt.plot(sgd_loss_at_eachstep.reshape(-1)[:300], label="SGD loss")
plt.legend()
plt.show()
In vanilla SGD, each update of the parameters $\mathbf{w}$ is computed w.r.t. a single randomly selected training sample. In mini-batch SGD, the update is computed for a mini-batch (a small number of training samples), as opposed to a single example. The reason for this is twofold: averaging over a mini-batch reduces the variance of the parameter updates, which can lead to more stable convergence, and it lets the computation exploit highly optimized vectorized matrix operations.
A typical mini-batch size is a power of two $2^k$ (32, 256, etc.), although the optimal mini-batch size can vary with the application and the size of the dataset (e.g., AlphaGo's training used a mini-batch size of 2048 positions).
Choose initial guess $\mathbf{w}_0$, step size (learning rate) $\eta$,
batch size $n_B$, number of inner iterations $M\leq N/n_B$, number of epochs $n_E$
Set $\mathbf{w}_{M+1} = \mathbf{w}_0$ for epoch $e=0$
For epoch $e=1,2, \cdots, n_E$
$\mathbf{w}_{0}$ for the current epoch is $\mathbf{w}_{M+1}$ for the previous epoch.
Randomly shuffle the training samples.
For $m=0,1,2, \cdots, M$
$\displaystyle\mathbf{w}_{m+1} = \mathbf{w}_m - \frac{\eta}{n_B}\sum_{i=1}^{n_B} \nabla f_{m n_B+i}(\mathbf{w}_m; \mathbf{x}^{(m n_B+i)},y^{(m n_B+i)})$
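Before running the loop, one sanity check worth doing (illustrative): since our vectorized gradient_loss already averages over the samples it is given, the mini-batch gradient should coincide with the average of the per-sample gradients, regularization included.
idx = np.random.permutation(len(y_train))[:16]  # an illustrative mini-batch of 16 indices
g_batch = gradient_loss(w, X_train[idx, :], y_train[idx])
g_mean = np.mean([gradient_loss(w, X_train[j, :], y_train[j]) for j in idx], axis=0)
print(np.allclose(g_batch, g_mean))  # True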
eta = 1e-4 # step size (learning rate)
num_epochs = 20
N = len(y_train)
num_batch = 16 # mini-batch size n_B, i.e., the number of samples per batch
M = int(np.ceil(N/num_batch)) # number of mini-batches (inner iterations) per epoch
w = 1e-2*np.random.random(np.shape(X_train)[1])
sgdmini_loss_at_eachstep = np.zeros([num_epochs,M]) # record the change of the loss function
for e in range(num_epochs):
    shuffle_index = np.random.permutation(N)
    for m in range(0, N, num_batch):
        # indices of the samples in the current mini-batch
        i = shuffle_index[m:m+num_batch]
        sgdmini_loss_at_eachstep[e, m//num_batch] = loss(w, X_train, y_train)
        # gradient of the loss on the current mini-batch
        dw = gradient_loss(w, X_train[i,:], y_train[i])
        w -= eta * dw
        if m % 1000 == 0 and e % 5 == 0:
            print("loss after", e+1, "epochs and", m//num_batch+1, "iterations is:", loss(w, X_train, y_train))
            print("Training R squared after", e+1, "epochs and", m//num_batch+1,
                  "iterations is:", rsquared(w, X_train, y_train))
            print("Testing R squared after", e+1, "epochs and", m//num_batch+1,
                  "iterations is:", rsquared(w, X_test, y_test))
import seaborn as sns
sns.set()
num_plot = 300
plt.figure(figsize=[16,8])
plt.plot(sgdmini_loss_at_eachstep.reshape(-1)[:num_plot], 'b-',
label="SGD mini-batch loss")
plt.plot(sgd_loss_at_eachstep.reshape(-1)[:num_plot],
label="SGD loss", linewidth=2, color = 'green')
plt.legend(fontsize=20)
plt.show()