In this notebook, we'll cover the Gradient Descent algorithm and its variants: Batch Gradient Descent, Mini-Batch Gradient Descent, and Stochastic Gradient Descent. Next, we'll cover momentum methods such as Momentum, Root Mean Square Propagation (RMSProp), and Adaptive Moment Estimation (Adam). Optimization algorithms have one of the following goals:
The Gradient Descent algorithm is a first-order optimization algorithm, which means it only takes the first derivative into account when performing updates on the parameters. Below is how the algorithm works in the case of logistic regression:
The update equation is: $$w^j = w^j - \alpha \nabla_w J\tag{3}$$
* Continue the process until the cost function converges, that is, until the error curve becomes flat and stops changing.
* Notice that if the slope at the current value of $w$ is positive, we're to the right of the optimal $w$; the update will be negative and will move $w$ toward the optimal value. If the slope is negative, the update will be positive and will increase the current value of $w$ toward the optimum. See the figure below:
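To make the update rule concrete, below is a minimal NumPy sketch of a single batch gradient-descent step for logistic regression. The helper sigmoid and the names w, b, and alpha are illustrative assumptions, not functions defined elsewhere in this notebook.
# Illustrative sketch: one batch gradient-descent update for logistic regression
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def gradient_descent_step(w, b, X, Y, alpha):
    """X: (num_features, m), Y: (1, m), w: (num_features, 1), b: scalar."""
    m = X.shape[1]
    A = sigmoid(np.dot(w.T, X) + b)    # predictions, shape (1, m)
    dw = np.dot(X, (A - Y).T) / m      # gradient of the cost w.r.t. w
    db = np.sum(A - Y) / m             # gradient of the cost w.r.t. b
    w = w - alpha * dw                 # update rule (3)
    b = b - alpha * db
    return w, b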
In deep learning, optimization algorithms face a few challenges:
In batch gradient descent, we sum over all examples on each iteration when performing forward propagation and back-propagation; therefore, each update requires a pass over the full training set. The main advantages:
The main disadvantages:
Instead of going over all examples, mini-batch gradient descent sums over a smaller number of examples determined by the batch size. The way it works is as follows:
The batch size is something we can tune; it is usually chosen as a power of 2, such as 32, 64, 128, 256, or 512.
The main advantages:
The main disadvantages:
With large training datasets, we usually don't need more than 2-10 passes over all training examples (epochs). Note that with batch size $b = m$, we recover Batch Gradient Descent.
It's the same as mini-batch gradient descent but with batch size $b = 1$. Therefore, learning happens on every example. The main disadvantage is that we can't utilize vectorization over a single example, so it becomes very slow. Also, the variance of each step becomes large since we use only one example per update. Below is a graph that shows gradient descent and its variants and their directions towards the minimum:
Since mini-batch gradient descent learns from a smaller number of examples at each step, it adds noise (oscillations) to the direction towards the minimum. Also, the cost function is often very sensitive in the direction of some parameters and insensitive in the direction of others. Rather than maintaining a separate learning rate for each parameter, momentum methods address this by using an exponentially weighted (moving) average of the past gradients to smooth the steps towards the minimum. In effect, the update steps shrink near the minimum region because oscillating past gradients partially cancel in the average. Therefore, momentum methods only change the update formula (3); all other computations, such as forward propagation and back-propagation, stay the same.
Momentum uses the exponentially weighted average of the past gradients to smooth the gradient steps in the direction of convergence. Below are the formulas needed to implement the momentum method: $$v_{dW^{[l]}} = \beta v_{dW^{[l]}} + (1 - \beta) dW^{[l]}\tag{4} \\ $$ $$v_{db^{[l]}} = \beta v_{db^{[l]}} + (1 - \beta) db^{[l]}\tag{5} \\ $$ $$W^{[l]} = W^{[l]} - \alpha v_{dW^{[l]}}\tag{6} \\ $$ $$b^{[l]} = b^{[l]} - \alpha v_{db^{[l]}}\tag{7} \\ $$ $\beta$ reflects how far back we go to include past gradients. For example, if $\beta = 0.9$, we average over roughly the past 10 gradients ($\frac{1}{1 - \beta}$), as sketched below.
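To see where the rough window of $\frac{1}{1 - \beta}$ comes from, unroll the recursion in (4) with $v$ initialized to zero: $$v_t = (1 - \beta) \sum_{k=0}^{t-1} \beta^{k} \, dW_{t-k}$$ The weights $\beta^k$ decay geometrically; with $\beta = 0.9$, a gradient from 10 steps back still carries about $0.9^{10} \approx 0.35$ of the weight of the most recent one, so the average is effectively dominated by roughly the last $\frac{1}{1 - \beta} = 10$ gradients.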
RMSProp decays the learning rate using an exponentially weighted average of the squared gradients. It's expected to converge faster for convex functions, and to converge to the minimum region for nonconvex functions, by decaying the learning rate and discarding history from the extreme past. Below are the formulas needed to implement the RMSProp method: $$s_{dW^{[l]}} = \beta s_{dW^{[l]}} + (1 - \beta) (dW^{[l]})^2 \tag{8}\\ $$ $$s_{db^{[l]}} = \beta s_{db^{[l]}} + (1 - \beta) (db^{[l]})^2 \tag{9} \\ $$ $$W^{[l]} = W^{[l]} - \alpha \frac{dW^{[l]}}{\sqrt{s_{dW^{[l]}}+ \epsilon}} \tag{10} \\ $$ $$b^{[l]} = b^{[l]} - \alpha \frac{db^{[l]}}{\sqrt{s_{db^{[l]}}+ \epsilon}} \tag{11} \\ $$
Adam combines momentum and RMSProp, using bias-corrected versions of both moving averages. Since $v$ and $s$ are initialized at zero, they are biased toward zero early in training, so we divide each by $(1 - \beta^t)$ to correct for that bias, where $t$ is the number of update steps taken by Adam. Below are the formulas needed to implement Adam: $$v_{dW^{[l]}} = \beta_1 v_{dW^{[l]}} + (1 - \beta_1) dW^{[l]} \tag{12} \\ $$ $$v_{db^{[l]}} = \beta_1 v_{db^{[l]}} + (1 - \beta_1) db^{[l]} \tag{13} \\ $$ $$s_{dW^{[l]}} = \beta_2 s_{dW^{[l]}} + (1 - \beta_2) (dW^{[l]})^2 \tag{14} \\ $$ $$s_{db^{[l]}} = \beta_2 s_{db^{[l]}} + (1 - \beta_2) (db^{[l]})^2 \tag{15} \\ $$ $$v_{dW^{[l]}}^{corrected} = \frac{v_{dW^{[l]}}} {1 - \beta_1^t} \tag{16} \\ $$ $$v_{db^{[l]}}^{corrected} = \frac{v_{db^{[l]}}} {1 - \beta_1^t} \tag{17} \\ $$ $$s_{dW^{[l]}}^{corrected} = \frac{s_{dW^{[l]}}} {1 - \beta_2^t} \tag{18} \\ $$ $$s_{db^{[l]}}^{corrected} = \frac{s_{db^{[l]}}} {1 - \beta_2^t} \tag{19} \\ $$ $$W^{[l]} = W^{[l]} - \alpha \frac{v_{dW^{[l]}}^{corrected}}{\sqrt{s_{dW^{[l]}}^{corrected} + \epsilon}} \tag{20} \\ $$ $$b^{[l]} = b^{[l]} - \alpha \frac{v_{db^{[l]}}^{corrected}}{\sqrt{s_{db^{[l]}}^{corrected} + \epsilon}} \tag{21} \\ $$
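To see what the bias correction buys us, consider the very first step ($t = 1$) with $v$ initialized to zero: $$v_{dW^{[l]}} = (1 - \beta_1)\, dW^{[l]}, \qquad v_{dW^{[l]}}^{corrected} = \frac{(1 - \beta_1)\, dW^{[l]}}{1 - \beta_1^1} = dW^{[l]}$$ Without the correction, the first updates would be scaled by $(1 - \beta_1) = 0.1$ with the default $\beta_1 = 0.9$, biasing the moving averages toward zero early in training; the correction fades away as $t$ grows since $\beta_1^t \to 0$.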
Default values are: $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$, $\alpha = 0.001$. Adam has proven to be effective in many real-world applications.
We'll train a neural network using mini-batch gradient descent (MB), MB with momentum, MB with RMSProp, and MB with Adam. The dataset we'll be using has images of cats and non-cats. We'll build a neural network to classify whether an image contains a cat or not. Each image is 64 x 64 pixels in RGB.
Let's import the data and take a look at the shape as well as a sample of a cat image from the training set.
# Loading packages
import os
import h5py
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
# import local modules
os.chdir("../scripts/")
from coding_deep_neural_network_from_scratch import (initialize_parameters,
L_model_forward,
compute_cost,
L_model_backward,
update_parameters,
accuracy)
%matplotlib inline
sns.set_context("notebook")
plt.style.use("fivethirtyeight")
plt.rcParams['figure.figsize'] = (12, 6)
# Import training data
train_dataset = h5py.File("../data/train_catvnoncat.h5", "r")
X_train = np.array(train_dataset["train_set_x"])
Y_train = np.array(train_dataset["train_set_y"])
# Plot a sample image
plt.imshow(X_train[50])
plt.axis("off");
# Import test data
test_dataset = h5py.File("../data/test_catvnoncat.h5", "r")
X_test = np.array(test_dataset["test_set_x"])
Y_test = np.array(test_dataset["test_set_y"])
# Transform data
X_train = X_train.reshape(209, -1).T
X_train = X_train / 255
Y_train = Y_train.reshape(-1, 209)
X_test = X_test.reshape(50, -1).T
X_test = X_test / 255
Y_test = Y_test.reshape(-1, 50)
# print the new shape of both training and test datasets
print("Training data dimensions:")
print("X's dimension: {}, Y's dimension: {}".format(X_train.shape, Y_train.shape))
print("Test data dimensions:")
print("X's dimension: {}, Y's dimension: {}".format(X_test.shape, Y_test.shape))
Training data dimensions: X's dimension: (12288, 209), Y's dimension: (1, 209) Test data dimensions: X's dimension: (12288, 50), Y's dimension: (1, 50)
Let's now write the helper functions to implement all the optimization algorithms covered, namely: mini-batch gradient descent (MB), MB with Momentum, MB with RMSProp, and MB with Adam.
def random_mini_batches(X, Y, mini_batch_size=64, seed=0):
"""
Creates a list of random minibatches from (X, Y)
Arguments:
X -- input data, shape: input size, number of examples
Y -- "label" vector, shape: 1, number of examples
    mini_batch_size -- size of the mini-batches, integer
    seed -- random seed for reproducible shuffling
Returns:
mini_batches -- list of synchronous (mini_batch_X, mini_batch_Y)
"""
np.random.seed(seed)
m = X.shape[1]
mini_batches = []
# shuffle training set
permutation = np.random.permutation(m)
shuffle_X = X[:, permutation]
shuffle_Y = Y[:, permutation]
num_complete_minibatches = m // mini_batch_size
for k in range(num_complete_minibatches):
mini_batch_X = shuffle_X[:, k*mini_batch_size:(k + 1)*mini_batch_size]
mini_batch_Y = shuffle_Y[:, k*mini_batch_size:(k + 1)*mini_batch_size]
mini_batch = (mini_batch_X, mini_batch_Y)
mini_batches.append(mini_batch)
    # handle the remaining examples if m % mini_batch_size != 0
if m % mini_batch_size != 0:
mini_batch_X = shuffle_X[:, num_complete_minibatches*mini_batch_size:]
mini_batch_Y = shuffle_Y[:, num_complete_minibatches*mini_batch_size:]
mini_batch = (mini_batch_X, mini_batch_Y)
mini_batches.append(mini_batch)
return mini_batches
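# Quick sanity check (illustrative usage, not part of the original notebook):
# with m = 209 examples and a batch size of 64 we expect 3 full mini-batches
# plus one partial mini-batch of 209 - 3 * 64 = 17 examples.
mini_batches = random_mini_batches(X_train, Y_train, mini_batch_size=64, seed=1)
print("number of mini-batches:", len(mini_batches))            # 4
print("first mini-batch X shape:", mini_batches[0][0].shape)   # (12288, 64)
print("last mini-batch X shape:", mini_batches[-1][0].shape)   # (12288, 17)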
def initialize_momentum(parameters):
"""
Initializes the velocity as a python dictionary with:
- keys: "dW1", "db1", ..., "dWL", "dbL"
- values: numpy arrays of zeros of the same shape as the
corresponding gradients/parameters.
Arguments:
parameters -- python dictionary containing parameters.
Returns:
v -- python dictionary containing the current velocity.
"""
L = len(parameters) // 2
v = {}
for l in range(1, L + 1):
v["dW" + str(l)] = np.zeros_like(parameters["W" + str(l)])
v["db" + str(l)] = np.zeros_like(parameters["b" + str(l)])
return v
def update_parameters_with_momentum(parameters, grads, v, beta, learning_rate):
"""
Update parameters using Momentum
Arguments:
parameters -- python dictionary containing your parameters:
    grads -- python dictionary containing your gradients for each parameter:
v -- python dictionary containing the current velocity:
beta -- the momentum hyperparameter --> scalar
learning_rate -- the learning rate --> scalar
Returns:
parameters -- python dictionary containing your updated parameters
v -- python dictionary containing your updated velocities
"""
L = len(parameters) // 2
for l in range(1, L + 1):
# update momentum velocity
v["dW" + str(l)] =\
beta * v["dW" + str(l)] + (1 - beta) * grads["dW" + str(l)]
v["db" + str(l)] =\
beta * v["db" + str(l)] + (1 - beta) * grads["db" + str(l)]
# update parameters
parameters["W" + str(l)] =\
parameters["W" + str(l)] - learning_rate * v["dW" + str(l)]
parameters["b" + str(l)] =\
parameters["b" + str(l)] - learning_rate * v["db" + str(l)]
return parameters, v
def initialize_rmsprop(parameters):
"""
    Initializes the moving average of the squared gradients as a python dictionary with:
- keys: "dW1", "db1", ..., "dWL", "dbL"
- values: numpy arrays of zeros of the same shape as the
corresponding gradients/parameters.
Arguments:
parameters -- python dictionary containing parameters.
Returns:
    s -- python dictionary containing the moving average of the squared gradients.
"""
L = len(parameters) // 2
s = {}
for l in range(1, L + 1):
s["dW" + str(l)] = np.zeros_like(parameters["W" + str(l)])
s["db" + str(l)] = np.zeros_like(parameters["b" + str(l)])
return s
def update_parameters_with_rmsprop(parameters, grads, s, beta, learning_rate,
epsilon=1e-8):
"""
    Update parameters using RMSProp
    Arguments:
    parameters -- python dictionary containing parameters:
    grads -- python dictionary containing gradients for each parameter:
    s -- python dictionary containing the moving average of the squared gradients:
    beta -- the RMSProp hyperparameter --> scalar
    learning_rate -- the learning rate --> scalar
    epsilon -- hyperparameter preventing division by zero in RMSProp updates
    Returns:
    parameters -- python dictionary containing the updated parameters
    s -- python dictionary containing the updated moving averages of the squared gradients
"""
L = len(parameters) // 2
for l in range(1, L + 1):
        # update the moving average of the squared gradients
s["dW" + str(l)] =\
beta * s["dW" + str(l)] +\
(1 - beta) * np.square(grads["dW" + str(l)])
s["db" + str(l)] =\
beta * s["db" + str(l)] +\
(1 - beta) * np.square(grads["db" + str(l)])
# update parameters
parameters["W" + str(l)] -= (learning_rate * grads["dW" + str(l)])\
/ (np.sqrt(s["dW" + str(l)] + epsilon))
parameters["b" + str(l)] -= (learning_rate * grads["db" + str(l)])\
/ (np.sqrt(s["db" + str(l)] + epsilon))
return parameters, s
def initialize_adam(parameters):
"""
Initializes v and s as two python dictionaries with:
- keys: "dW1", "db1", ..., "dWL", "dbL"
- values: numpy arrays of zeros of the same shape as the
corresponding gradients/parameters.
    Arguments:
    parameters -- python dictionary containing your parameters.
    Returns:
    v -- python dictionary that will contain the exponentially weighted
         average of the gradient.
    s -- python dictionary that will contain the exponentially weighted
         average of the squared gradient.
"""
L = len(parameters) // 2
v = {}
s = {}
for l in range(1, L + 1):
v["dW" + str(l)] = np.zeros_like(parameters["W" + str(l)])
v["db" + str(l)] = np.zeros_like(parameters["b" + str(l)])
s["dW" + str(l)] = np.zeros_like(parameters["W" + str(l)])
s["db" + str(l)] = np.zeros_like(parameters["b" + str(l)])
return v, s
def update_parameters_with_adam(parameters, grads, v, s, t, learning_rate,
beta1=0.9, beta2=0.999, epsilon=1e-8):
"""
Update parameters using Adam
Arguments:
parameters -- python dictionary containing parameters:
    grads -- python dictionary containing gradients for each parameter:
v -- Adam variable, moving average of the first gradient
    s -- Adam variable, moving average of the squared gradient
    t -- number of update steps taken so far (used for bias correction)
    learning_rate -- the learning rate, scalar.
beta1 -- Exponential decay hyperparameter for the first moment estimates
beta2 -- Exponential decay hyperparameter for the second moment estimates
epsilon -- hyperparameter preventing division by zero in Adam updates
Returns:
parameters -- python dictionary containing updated parameters
v -- Adam variable, moving average of the first gradient
s -- Adam variable, moving average of the squared gradient
"""
L = len(parameters) // 2
v_corrected = {}
s_corrected = {}
for l in range(1, L + 1):
        # update the moving average of both the first gradient and the squared gradient
v["dW" + str(l)] = beta1 * v["dW" + str(l)] +\
(1 - beta1) * grads["dW" + str(l)]
v["db" + str(l)] = beta1 * v["db" + str(l)] +\
(1 - beta1) * grads["db" + str(l)]
s["dW" + str(l)] = beta2 * s["dW" + str(l)] +\
(1 - beta2) * np.square(grads["dW" + str(l)])
s["db" + str(l)] = beta2 * s["db" + str(l)] + \
(1 - beta2) * np.square(grads["db" + str(l)])
        # compute the bias-corrected estimates of the moving averages
v_corrected["dW" + str(l)] = v["dW" + str(l)] / (1 - beta1**t)
v_corrected["db" + str(l)] = v["db" + str(l)] / (1 - beta1**t)
s_corrected["dW" + str(l)] = s["dW" + str(l)] / (1 - beta2**t)
s_corrected["db" + str(l)] = s["db" + str(l)] / (1 - beta2**t)
# update parameters
parameters["W" + str(l)] -= (learning_rate * v_corrected["dW" + str(l)])\
/ (np.sqrt(s_corrected["dW" + str(l)] + epsilon))
parameters["b" + str(l)] -= (learning_rate * v_corrected["db" + str(l)])\
/ (np.sqrt(s_corrected["db" + str(l)] + epsilon))
return parameters, v, s
def model(X, Y, layers_dims, optimizer="adam", learning_rate=0.01,
mini_batch_size=64, beta=0.9, beta1=0.9, beta2=0.999, epsilon=1e-8,
num_epochs=3000, print_cost=True, activation_fn="relu"):
"""
Implements multi-neural network model which can be run in different
optimizer modes.
Arguments:
X -- input data, shape: number of features, number of examples
Y -- label vector, shape: 1, number of examples
layers_dims -- python list, containing the size of each layer
optimizer -- "momentum", "rmsprop", or "adam".
learning_rate -- the learning rate --> scalar.
mini_batch_size -- the size of a mini batch
beta -- Momentum/RMSProp hyperparameter
beta1 -- Exponential decay hyperparameter for the past gradients
beta2 -- Exponential decay hyperparameter for the past squared gradients
epsilon -- hyperparameter preventing division by zero
num_epochs -- number of epochs
print_cost -- True to print the cost every 1000 epochs
activation_fn -- function to be used on hidden layers: "relu", or "tanh"
Returns:
parameters -- python dictionary containing updated parameters
"""
# set random seed to get consistent output
seed = 1
np.random.seed(seed)
# initialize parameters
parameters = initialize_parameters(layers_dims)
# initialize moving averages based on optimizer modes
assert(optimizer == "mb" or optimizer == "momentum" or
optimizer == "rmsprop" or optimizer == "adam")
if optimizer == "momentum":
v = initialize_momentum(parameters)
elif optimizer == "rmsprop":
s = initialize_rmsprop(parameters)
elif optimizer == "adam":
v, s = initialize_adam(parameters)
t = 0
# initialize costs list
costs = []
# iterate over number of epochs
for epoch in range(num_epochs):
# split the training data into mini batches
seed += 1
mini_batches = random_mini_batches(X, Y, mini_batch_size, seed=seed)
# iterate over mini batches
for mini_batch in mini_batches:
mini_batch_X, mini_batch_Y = mini_batch
# compute fwd prop
AL, caches = L_model_forward(
mini_batch_X, parameters, activation_fn)
# compute cost
cost = compute_cost(AL, mini_batch_Y)
# compute gradients
grads = L_model_backward(AL, mini_batch_Y, caches, activation_fn)
# update parameters
if optimizer == "mb":
parameters = update_parameters(
parameters, grads, learning_rate)
elif optimizer == "momentum":
parameters, v = update_parameters_with_momentum(
parameters, grads, v, beta, learning_rate)
elif optimizer == "rmsprop":
parameters, s = update_parameters_with_rmsprop(
parameters, grads, s, beta, learning_rate, epsilon)
elif optimizer == "adam":
t += 1
parameters, v, s = update_parameters_with_adam(
parameters, grads, v, s, t, learning_rate, beta1, beta2,
epsilon)
        # compute the cost over the full training set at the end of the epoch
        AL, caches = L_model_forward(X, parameters, activation_fn)
        cost = compute_cost(AL, Y)
        if epoch % 100 == 0:
            costs.append(cost)
            if print_cost and epoch % 1000 == 0:
                print("Cost after epoch {}: {:.4f}".format(epoch, cost))
# plot the cost
plt.plot(costs)
plt.ylabel('Cost')
plt.xlabel('Epochs (per hundreds)')
plt.title("Learning rate = " + str(learning_rate))
plt.show()
return parameters
Now, we're ready to train the neural network using different learning algorithms.
# Setup layers dimensions
layers_dims = [X_train.shape[0], 200, 200, 1]
# train NN with mini-batch gradient descent only
parameters = model(X_train, Y_train, layers_dims, optimizer="mb",
learning_rate=0.03, mini_batch_size=64,
beta1=0.9, num_epochs=3000, activation_fn="tanh")
# print the training and test accuracy
print("The training accuracy rate: {}".format(
accuracy(X_train, parameters, Y_train, "tanh")[-7:]))
print("The test accuracy rate: {}".format(
accuracy(X_test, parameters, Y_test, "tanh")[-7:]))
# train NN with mini-batch and momentum
parameters = model(X_train, Y_train, layers_dims, optimizer="momentum",
learning_rate=0.03, mini_batch_size=64,
beta1=0.9, num_epochs=1000, activation_fn="tanh")
# print the training and test accuracy
print("The training accuracy rate: {}".format(
accuracy(X_train, parameters, Y_train, "tanh")[-8:]))
print("The test accuracy rate: {}".format(
accuracy(X_test, parameters, Y_test, "tanh")[-7:]))
The training accuracy rate: 100.00%. The test accuracy rate: 70.00%.
# train NN with mini-batch and RMSProp
parameters = model(X_train, Y_train, layers_dims, optimizer="rmsprop",
learning_rate=0.001, mini_batch_size=64,
beta1=0.9, epsilon=1e-6, num_epochs=1000,
activation_fn="tanh")
# print the training and test accuracy
print("The training accuracy rate: {}".format(
accuracy(X_train, parameters, Y_train, "tanh")[-8:]))
print("The test accuracy rate: {}".format(
accuracy(X_test, parameters, Y_test, "tanh")[-7:]))
The training accuracy rate: 100.00%. The test accuracy rate: 64.00%.
# train NN with mini-batch and Adam
parameters = model(X_train, Y_train, layers_dims, optimizer="adam",
learning_rate=0.001, mini_batch_size=64,
beta1=0.9, beta2=0.999, epsilon=1e-8,
num_epochs=1000, activation_fn="tanh")
# print the training and test accuracy
print("The training accuracy rate: {}".format(
accuracy(X_train, parameters, Y_train, "tanh")[-7:]))
print("The test accuracy rate: {}".format(
accuracy(X_test, parameters, Y_test, "tanh")[-7:]))
The training accuracy rate: 97.13%. The test accuracy rate: 72.00%.
We're by no means comparing the performance of the optimization algorithms here. The goal is to see how each algorithm works; all of them are very effective and can be tuned to yield good results.
The most popular optimization algorithms are: mini-batch gradient descent (MB), MB with RMSProp, and MB with Adam. Even though the neural network doesn't converge to a global minimum, it's able to reach significantly low values of the loss function and achieve very low error rates. Below are some key takeaways:
There is no definitive guide for which optimization algorithm to use in a specific scenario; it's largely trial and error. However, momentum-based methods such as RMSProp and Adam combined with mini-batch gradient descent are very effective.
Momentum methods can be used with batch, mini-batch, and stochastic gradient descent.
Batch size $b$ can be tuned using a validation set. It plays a critical role in the performance of the learning algorithm.
$\beta$, which controls how far back we go to include past gradients, can be tuned too.
Gradient descent is a first-order optimization algorithm, which means it doesn't take into account the second derivatives of the cost function. However, the curvature of the function affects the size of each learning step. The gradient measures the steepness of the curve, while the second derivative measures its curvature. Therefore:
* If the curvature is negative, the cost function decreases faster than the gradient predicts.
* If the curvature is zero, the gradient predicts the decrease well.
* If the curvature is positive, the cost function decreases more slowly than the gradient predicts and can eventually start to increase.
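A second-order Taylor expansion of the cost around the current point makes this explicit (a standard result; $g$ is the gradient and $H$ the Hessian at the current $w$): $$J(w - \alpha g) \approx J(w) - \alpha g^T g + \frac{1}{2} \alpha^2 g^T H g$$ The first-order term always decreases the cost, but when the curvature term $\frac{1}{2} \alpha^2 g^T H g$ is large and positive, it can cancel, or even outweigh, that decrease.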
As a result, a direction that looks promising to the gradient may not actually be so, and may slow the learning process or even cause it to diverge.
If the Hessian matrix has a poor condition number, i.e. the direction of highest curvature has much more curvature than the direction of lowest curvature, the cost function will be very sensitive in some directions and insensitive in others. This makes things harder for gradient descent, because the direction that looks promising to the gradient may not lead to big changes in the cost function.
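As an illustration (a toy sketch with made-up values, not part of the notebook's model), consider a quadratic cost $J(w) = \frac{1}{2} w^T H w$ with an ill-conditioned Hessian; plain gradient descent oscillates along the high-curvature direction while barely moving along the low-curvature one:
# Toy illustration (made-up values): gradient descent on an ill-conditioned quadratic
import numpy as np

H = np.array([[100.0, 0.0],
              [0.0, 1.0]])                      # curvature 100 along w1, 1 along w2
print("condition number:", np.linalg.cond(H))   # 100.0

w = np.array([1.0, 1.0])
alpha = 0.019          # just under the stability limit 2/100 for the steep direction
for step in range(5):
    grad = H @ w                   # gradient of J(w) = 0.5 * w^T H w
    w = w - alpha * grad           # plain gradient-descent update
    print(step, w)
# w1 flips sign and oscillates around 0 while w2 barely moves: the cost is very
# sensitive in one direction and nearly insensitive in the other.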
The gradient norm (measured by $g^T g$) is expected to decrease slowly with each learning step, because the curve gets flatter and its steepness decreases as we approach the minimum. However, in practice we often observe the gradient norm increasing, which is due to the curvature of the cost surface. Nonetheless, even with an increasing gradient norm, the neural network is able to achieve very low error rates.
Newton's method is a second-order optimization algorithm that uses curvature information to jump more directly toward the minimum. It takes the second derivatives into account, and its update rule is $w = w - H^{-1} g$, where $g$ is the gradient and $H$ is the Hessian matrix containing the second derivatives. The main disadvantages of Newton's method are: