In this chapter, we will learn how neural networks with multiple inputs and multiple outputs learn through gradient descent.
"You don't learn to walk by following rules. You learn by doing and by falling over." — Richard Branson
Let's show how a neural network with multiple inputs can learn:
# An empty network with multiple inputs.
def w_sum(a, b):
    assert len(a) == len(b)
    S = 0
    for i in range(len(a)):
        S += a[i] * b[i]
    return S
# Initialize the weights.
weights = [.1, .2, -.1]

# Define the model.
def neural_network(X, W):
    prediction = w_sum(X, W)
    return prediction
# PREDICT + COMPARE: Making a Prediction, and Calculating Error & Delta.
toes = [ 8.5, 9.5, 9.9, 9.0]
wlrec = [0.65, 0.8, 0.8, 0.9]
nfans = [ 1.2, 1.3, 0.5, 1.0]
win_or_lose_binary = [1, 1, 0, 1]
true = win_or_lose_binary[0]
x0 = [toes[0], wlrec[0], nfans[0]]
pred = neural_network(x0, weights)
error = (pred - true) ** 2
delta = pred - true
# LEARN: Calculating each weight_delta and putting it on each weight
def ele_mul(number, vector):
    return [number * v_i for v_i in vector]
# Calculate the gradient associated with each weight.
gradients = ele_mul(delta, x0)
print(gradients)
[-1.189999999999999, -0.09099999999999994, -0.16799999999999987]
# LEARN: Updating the Weights.
alpha = .01
for i in range(len(weights)):
    weights[i] -= alpha * gradients[i]
print(weights)
print(gradients)
[0.1119, 0.20091, -0.09832]
[-1.189999999999999, -0.09099999999999994, -0.16799999999999987]
The properties involved in gradient descent with multiple inputs are fascinating and worth discussing. Let's look at them side by side:
# Single Input: Making a Prediction and calculating error and delta.
number_of_toes = [8.5]
win_or_lose_binary = [1]
input = number_of_toes[0]
true = win_or_lose_binary[0]
weight = 0.3
prediction = input * weight
error = (prediction - true) ** 2
gradient = 2 * input * (prediction - true)  # exact derivative; the factor of 2 is often folded into the learning rate
delta = prediction - true
Delta, in this case, is a measure of how much higher or lower we want the node's value to be in order to predict perfectly given the current training example. Weight delta is a derivative-based estimate of the direction and amount we should move a weight to reduce node_delta, accounting for scaling, negative reversal, and stopping.
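Since the gradient for a weight is just delta times the input, all three effects can be seen directly in a tiny sketch (the values here are illustrative, not from the chapter's example):

```python
# The gradient for a weight is delta * input, so the input controls
# how the weight update is scaled, reversed, or stopped.
delta = 0.5

for x, effect in [(8.5, "scaling: a large input produces a large update"),
                  (-1.0, "negative reversal: a negative input flips the sign"),
                  (0.0, "stopping: a zero input produces no update at all")]:
    gradient = delta * x
    print(f"input = {x:5}: gradient = {gradient:6}  ({effect})")
```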
In the above figure, we plotted three individual error/weight curves, one for each weight. The slope of each curve is reflected in its gradient value. The gradient is steepest for (a) because (a) has an input value significantly larger than the others and therefore a larger derivative.
Notice that most of the learning was performed on the weight with the largest input, because the input strongly scales the slope. This is not necessarily a good thing; in fact, there is a technique called data normalization that encourages learning across all weights despite dataset characteristics such as this. This significant difference in slope also forced us to set a learning rate lower than the standard one (.01 instead of .1).
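As a rough sketch of what data normalization can look like (min-max scaling here; the helper name is ours, not from the chapter), each feature is rescaled into a comparable range so no single weight dominates the learning:

```python
toes = [8.5, 9.5, 9.9, 9.0]
wlrec = [0.65, 0.8, 0.8, 0.9]
nfans = [1.2, 1.3, 0.5, 1.0]

def min_max_normalize(values):
    # Rescale a feature so its smallest value maps to 0 and its largest to 1.
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

for feature in (toes, wlrec, nfans):
    print(min_max_normalize(feature))
```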
Freezing weights is a great exercise for understanding how the weights affect each other. We will train again, except weight (a) will never be adjusted; we will try to learn the training example using only weights (b) and (c).
lr = .3
for iter in range(3):
    pred = neural_network(x0, weights)
    error = (pred - true) ** 2
    delta = pred - true
    gradients = ele_mul(delta, x0)
    gradients[0] = 0  # freeze weight (a)
    print(f"Iteration: {iter}")
    print(f"Prediction: {round(pred, 5)}")
    print(f"Error: {round(error, 5)}")
    print(f"Weights: {[round(w, 5) for w in weights]}")
    print(f"- - - - - -")
    for i in range(len(weights)):
        weights[i] -= lr * gradients[i]
Iteration: 0
Prediction: 0.96376
Error: 0.00131
Weights: [0.1119, 0.20091, -0.09832]
- - - - - -
Iteration: 1
Prediction: 0.98401
Error: 0.00026
Weights: [0.1119, 0.20798, -0.08527]
- - - - - -
Iteration: 2
Prediction: 0.99294
Error: 5e-05
Weights: [0.1119, 0.2111, -0.07952]
- - - - - -
It is somewhat surprising that the remaining weights still learned despite freezing the weight on the biggest contributing feature. Graph (a) still finds the bottom of its bowl because the whole error curve changes as the overall error changes. The black dot can move horizontally only if its weight is updated, and because (a)'s weight is frozen, the dot must stay fixed. However, this does not stop the error from going to 0.
This is an extremely important lesson. If we converged using only (b) and (c) and then tried to train (a), (a) wouldn't move, because the error has already reached 0. In other words, (a) may be a powerful input with lots of predictive power, but if the network accidentally figures out how to predict accurately on the training data without it, it will never learn to incorporate (a) into its prediction.
The 3 graphs representing the loss function according to each input element are, in reality, 2D slices of a 4-dimensional space. 3 of the dimensions are the weight values, and the fourth is the error. The shape of this 4-D function is the error plane, or the curvature of the loss function. The curvature is determined by the training data.
Our goal is to find the weight configuration at the global minimum of the loss curvature.
Neural networks can also make multiple predictions using only a single input. Beyond that, we now understand that a simple mechanism (stochastic gradient descent) is constantly being used to perform learning across a wide variety of architectures.
# An empty network with multiple outputs.
weights = [.3, .2, .9]

def neural_network(X, W):
    prediction = ele_mul(X, W)
    return prediction
# PREDICT: Making a prediction and calculating error and delta.
wlrec = [.65, 1, 1, .9]
hurt = [.1, 0, 0, .1]
win = [1, 1, 0, 1]
sad = [.1, 0, .1, .2]
x0 = wlrec[0]
target = [hurt[0], win[0], sad[0]]
pred = neural_network(x0, weights)
error = [0, 0, 0]
pure_error = [0, 0, 0]
for i in range(len(target)):
    error[i] = (pred[i] - target[i]) ** 2
    pure_error[i] = pred[i] - target[i]
pred, error, pure_error
([0.195, 0.13, 0.5850000000000001], [0.009025, 0.7569, 0.2352250000000001], [0.095, -0.87, 0.4850000000000001])
# COMPARE: calculating each gradient.
gradients = [x0 * pure_error[0], x0 * pure_error[1], x0 * pure_error[2]]
# UPDATE: Updating the Weights.
lr = 3
for i in range(len(weights)):
    weights[i] -= lr * gradients[i]

# The new prediction with the updated weights.
[x0 * w for w in weights]
SGD naturally generalizes to arbitrary architectures. Let's implement SGD with multiple inputs and outputs:
# 1. An empty Network with multiple inputs and outputs.
weights = [[.1, .1, -.3],
[.1, .2, 0],
[0, 1.3, .1]]
def vect_mat_mul(vect, matrix):
    assert len(vect) == len(matrix)
    output = [0] * len(matrix)
    for i in range(len(vect)):
        output[i] = w_sum(vect, matrix[i])
    return output

def neural_network(X, W):
    prediction = vect_mat_mul(X, W)
    return prediction
# 2. PREDICT: Making a Prediction & Calculating Error & Delta (Pure Error).
# Inputs
toes = [8.5, 9.5, 9.9, 9.0]
wlrec = [.65, 0.8, 0.8, 0.9]
nfans = [1.2, 1.3, 0.5, 1.0]
# Outputs
hurt = [.1, 0, 0,.1]
win = [ 1, 1, 0, 1]
sad = [.1, 0,.1,.2]
# learning rate
alpha = .01
x0 = [toes[0], wlrec[0], nfans[0]]
target = [hurt[0], win[0], sad[0]]
prediction = neural_network(x0, weights)
error, delta = [0, 0, 0], [0, 0, 0]
for i in range(len(prediction)):
    error[i] = (prediction[i] - target[i]) ** 2
    delta[i] = prediction[i] - target[i]
import numpy as np
# 3. COMPARE: Calculating each weight_delta & Putting it on each weight.
# we have 9 gradients, or weight deltas.
def outer_prod(vect_a, vect_b):
    out = np.zeros((len(vect_a), len(vect_b)))
    for i in range(len(vect_a)):
        for j in range(len(vect_b)):
            out[i][j] = vect_a[i] * vect_b[j]
    return out

# gradients[i][j] pairs output i's delta with input j, matching the layout of weights[i][j].
gradients = outer_prod(delta, x0)
# 4. LEARN: Updating the Weights
for i in range(len(weights)):
    for j in range(len(weights[0])):
        weights[i][j] -= alpha * gradients[i][j]
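For reference, the same predict/compare/learn step can be written compactly with NumPy. This is a sketch mirroring the loops above, not code from the chapter:

```python
import numpy as np

weights = np.array([[0.1, 0.1, -0.3],
                    [0.1, 0.2, 0.0],
                    [0.0, 1.3, 0.1]])
alpha = 0.01

x0 = np.array([8.5, 0.65, 1.2])      # toes, wlrec, nfans
target = np.array([0.1, 1.0, 0.1])   # hurt, win, sad

prediction = weights.dot(x0)         # same result as vect_mat_mul
delta = prediction - target
gradients = np.outer(delta, x0)      # gradients[i][j] = delta[i] * x0[j]
weights -= alpha * gradients
```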
Even though we understand "how" learning happens, another interesting question is "what do the weights store while learning?"
To answer this question, we will move on to our first real-world dataset: the Modified National Institute of Standards and Technology (MNIST) dataset. It consists of digits that high school students and employees of the US Census Bureau wrote some years ago. Each entry is a black-and-white image of a handwritten digit, accompanied by the actual number being written (0-9). Each image is 784 pixels (28 x 28).
In this case, the neural network must have 784 input features. On the other end, we want to predict 10 probabilities, one for each digit. The neural network will tell us which digit is most likely to be what was drawn.
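As a sketch of the shapes involved (variable names here are illustrative, not from the chapter's code), the weight matrix pairs every one of the 784 pixels with each of the 10 outputs:

```python
import numpy as np

num_pixels = 28 * 28   # 784 input features, one per pixel
num_digits = 10        # one output score per digit

# One weight per (digit, pixel) pair.
weights = np.random.default_rng(0).normal(size=(num_digits, num_pixels))
image = np.zeros(num_pixels)   # a flattened 28 x 28 image

scores = weights.dot(image)    # one score per digit
print(scores.shape)            # (10,)
```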
Let's take a look at the MNIST dataset:
from tensorflow.keras.datasets import mnist
(X_train, y_train), (X_test, y_test) = mnist.load_data()
# let's take a sample.
images = X_train[0:1000]
labels = y_train[0:1000]
images.shape, labels.shape
((1000, 28, 28), (1000,))
This diagram represents the new MNIST classification neural network. If this network could predict perfectly, it would take in an image's pixels (say, a 2), then predict a 1 in the correct output position and a 0 everywhere else. If it could do this correctly for all images in the dataset, it would have no error.
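This target encoding — a 1 in the correct position and a 0 everywhere else — is known as one-hot encoding. A minimal sketch (the helper name is ours):

```python
import numpy as np

def one_hot(label, num_classes=10):
    # A 1 in the labeled position, 0 everywhere else.
    target = np.zeros(num_classes)
    target[label] = 1.0
    return target

print(one_hot(2))   # [0. 0. 1. 0. 0. 0. 0. 0. 0. 0.]
```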
But what does it mean to modify a bunch of weights to learn an aggregate pattern?
An interesting and intuitive practice in neural network research is to visualize the weights as if they were an image. Each output node has a weight coming from each pixel, so what does this relationship look like?
In the above figure, the network learned to construct faint artifacts of a 2 and a 1, created from the weights for the 2 and 1 outputs. The neutral color (red) represents a weight of 0. The key to understanding why these artifacts were created is the notion of the dot product.
# remember that the dot product is a mathematical measure of similarity.
import numpy as np
a = np.array([0, 1, 0, 1])
b = np.array([1, 0, 1, 0])
a.dot(b)
0
A dot product is a loose measurement of similarity between two vectors. Thus, if the weight vector for 2 is similar to the input vector, the network outputs a high score for 2, resulting in a higher probability.
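To make that concrete, here is a toy sketch with hypothetical 4-pixel "images" (values invented for illustration): the input that looks more like the weight vector gets the larger dot product, and hence the higher score:

```python
import numpy as np

weights_for_2 = np.array([0.0, 1.0, 1.0, 0.0])   # a made-up weight vector for "2"
image_like_2 = np.array([0.1, 0.9, 0.8, 0.0])    # resembles the weight pattern
image_unlike_2 = np.array([0.9, 0.0, 0.1, 0.9])  # does not

print(weights_for_2.dot(image_like_2))    # high score
print(weights_for_2.dot(image_unlike_2))  # low score
```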
Gradient descent is a general learning algorithm. The most important subtext in this chapter is that gradient descent is a very flexible algorithm: if we combine weights in a way that lets us calculate an error function and a delta, gradient descent can show us how to move those weights to reduce the error.
We will spend the rest of this book exploring different types of weight combinations and error functions for which gradient descent is useful.