In this chapter, we will:
[Douglas Adams] "O Deep Thought computer," he said, "the task we have designed you to perform is this. We want you to tell us..." he paused. "The Answer."
The streetlight problem is a toy problem that considers how a network can learn an entire dataset.
Imagine that you are approaching a street corner in a foreign country. As you approach, you look up and realize that the streetlight is unfamiliar. In this case, how can you know when it's safe to cross the street?
To solve this problem, you might sit at a street corner for a few minutes observing the correlation between each light combination and whether people around you choose to stop or walk. After a few minutes, you realize that there is a perfect correlation between the middle light and whether it's safe to walk or not.
You learned this pattern by observing all individual data points and searching for correlation. This is what we're going to train a neural network to do!
We have two datasets: on the one hand, six streetlight states; on the other hand, six observations of whether people walked or not.
Neural networks do not read streetlights, so we want to prepare this data for processing. The first thing to do is split it into two groups:
We want to translate the streetlight data into numerical values:
The goal after that is to teach a neural network to translate a streetlight pattern into the correct stop/walk pattern. What we really want to do is to transform the input information we have into the correct stop/walk target signal.
In data matrices, a common convention is to give each recorded example its own row and each feature (thing being recorded) its own column. This makes the matrix easier to read and process.
The data matrix does not have to be all 1s and 0s; it happens to be here because we are dealing with binary information. The matrix itself should mirror the patterns that exist in the real world.
The underlying pattern is not the same as the matrix:
The resulting matrix is called a lossless representation because we can perfectly convert back and forth between the stop/walk notes and the matrix.
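As a small illustration of that claim (the dictionary encoding below is hypothetical, chosen just for this sketch: 1 means on/walk, 0 means off/stop), one observation and its matrix row convert back and forth with nothing lost:

# Sketch: round-tripping one observation between "notes" and a matrix row.
note = {"lights": ("on", "off", "on"), "action": "walk"}
row = [1 if light == "on" else 0 for light in note["lights"]]   # [1, 0, 1]
label = 1 if note["action"] == "walk" else 0                    # 1
decoded = {"lights": tuple("on" if v else "off" for v in row),
           "action": "walk" if label else "stop"}
assert decoded == note   # nothing was lost in either direction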
Let's create the streetlight pattern matrix:
import numpy as np
streetlights = np.array([[1,0,1], [0,1,1], [0,0,1], [1,1,1], [0,1,1], [1,0,1]])
streetlights.shape
(6, 3)
NumPy is really just a fancy wrapper for an array of arrays that provides special, matrix-oriented functions.
walk_vs_stop = np.array([[0], [1], [0], [1], [1], [0]])
walk_vs_stop.shape
(6, 1)
# Personal Implementation
ws = np.random.rand(streetlights.shape[1])
x_i = streetlights[0]
y_i = walk_vs_stop[0]
lr = .1
for iteration in range(20):
    # predict.
    prediction = x_i.dot(ws)
    # MSE error.
    error = (prediction - y_i) ** 2
    # update weights.
    for j in range(len(ws)):
        gradient = 2 * x_i[j] * (prediction - y_i)
        ws[j] -= lr * gradient
    if iteration % 2 == 0:
        print(f"Error: {round(error[0], 5)} | Prediction: {round(prediction, 5)}")
Error: 0.46497 | Prediction: 0.68189
Error: 0.06026 | Prediction: 0.24548
Error: 0.00781 | Prediction: 0.08837
Error: 0.00101 | Prediction: 0.03181
Error: 0.00013 | Prediction: 0.01145
Error: 2e-05 | Prediction: 0.00412
Error: 0.0 | Prediction: 0.00148
Error: 0.0 | Prediction: 0.00053
Error: 0.0 | Prediction: 0.00019
Error: 0.0 | Prediction: 7e-05
# Book's Implementation.
weights = np.array([.5, .48, -.7])
alpha = .1
input = streetlights[0]
goal_prediction = walk_vs_stop[0]
for iteration in range(20):
    prediction = input.dot(weights)
    error = (goal_prediction - prediction) ** 2
    delta = prediction - goal_prediction
    weights = weights - (alpha * (input * delta))
    print(f"Error: {round(error[0], 5)} | Prediction: {round(prediction, 5)}")
Error: 0.04 | Prediction: -0.2
Error: 0.0256 | Prediction: -0.16
Error: 0.01638 | Prediction: -0.128
Error: 0.01049 | Prediction: -0.1024
Error: 0.00671 | Prediction: -0.08192
Error: 0.00429 | Prediction: -0.06554
Error: 0.00275 | Prediction: -0.05243
Error: 0.00176 | Prediction: -0.04194
Error: 0.00113 | Prediction: -0.03355
Error: 0.00072 | Prediction: -0.02684
Error: 0.00046 | Prediction: -0.02147
Error: 0.0003 | Prediction: -0.01718
Error: 0.00019 | Prediction: -0.01374
Error: 0.00012 | Prediction: -0.011
Error: 8e-05 | Prediction: -0.0088
Error: 5e-05 | Prediction: -0.00704
Error: 3e-05 | Prediction: -0.00563
Error: 2e-05 | Prediction: -0.0045
Error: 1e-05 | Prediction: -0.0036
Error: 1e-05 | Prediction: -0.00288
The neural network has been learning only one streetlight. Next, let's implement a training loop to learn from all of the data points we have:
# let's generalize the algorithm.
ws = np.random.rand(streetlights.shape[1])
lr = .1
epochs = 7
for iteration in range(epochs):
    for i in range(len(streetlights)):
        # predict.
        prediction = streetlights[i].dot(ws)
        # MSE error.
        error = (prediction - walk_vs_stop[i]) ** 2
        # update weights.
        for j in range(len(ws)):
            gradient = 2 * streetlights[i][j] * (prediction - walk_vs_stop[i])
            ws[j] -= lr * gradient
        print(f"Prediction: {round(prediction, 2)} | Reality: {walk_vs_stop[i][0]} | Error: {round(error[0], 5)}")
Prediction: 1.39 | Reality: 0 | Error: 1.94297 Prediction: 0.91 | Reality: 1 | Error: 0.0076 Prediction: 0.19 | Reality: 0 | Error: 0.03563 Prediction: 1.57 | Reality: 1 | Error: 0.33057 Prediction: 0.68 | Reality: 1 | Error: 0.10242 Prediction: 0.65 | Reality: 0 | Error: 0.42256 Prediction: 0.39 | Reality: 0 | Error: 0.15212 Prediction: 0.6 | Reality: 1 | Error: 0.16003 Prediction: -0.03 | Reality: 0 | Error: 0.00078 Prediction: 1.11 | Reality: 1 | Error: 0.01157 Prediction: 0.72 | Reality: 1 | Error: 0.07698 Prediction: 0.33 | Reality: 0 | Error: 0.11028 Prediction: 0.2 | Reality: 0 | Error: 0.0397 Prediction: 0.73 | Reality: 1 | Error: 0.07439 Prediction: -0.04 | Reality: 0 | Error: 0.00161 Prediction: 1.06 | Reality: 1 | Error: 0.00343 Prediction: 0.82 | Reality: 1 | Error: 0.03206 Prediction: 0.19 | Reality: 0 | Error: 0.03783 Prediction: 0.12 | Reality: 0 | Error: 0.01362 Prediction: 0.83 | Reality: 1 | Error: 0.02879 Prediction: -0.04 | Reality: 0 | Error: 0.00132 Prediction: 1.05 | Reality: 1 | Error: 0.00209 Prediction: 0.89 | Reality: 1 | Error: 0.01273 Prediction: 0.12 | Reality: 0 | Error: 0.01334 Prediction: 0.07 | Reality: 0 | Error: 0.0048 Prediction: 0.9 | Reality: 1 | Error: 0.01095 Prediction: -0.03 | Reality: 0 | Error: 0.001 Prediction: 1.04 | Reality: 1 | Error: 0.00142 Prediction: 0.93 | Reality: 1 | Error: 0.00512 Prediction: 0.07 | Reality: 0 | Error: 0.00463 Prediction: 0.04 | Reality: 0 | Error: 0.00167 Prediction: 0.94 | Reality: 1 | Error: 0.00419 Prediction: -0.03 | Reality: 0 | Error: 0.00075 Prediction: 1.03 | Reality: 1 | Error: 0.00099 Prediction: 0.95 | Reality: 1 | Error: 0.00211 Prediction: 0.04 | Reality: 0 | Error: 0.00156 Prediction: 0.02 | Reality: 0 | Error: 0.00056 Prediction: 0.96 | Reality: 1 | Error: 0.00162 Prediction: -0.02 | Reality: 0 | Error: 0.00056 Prediction: 1.03 | Reality: 1 | Error: 0.0007 Prediction: 0.97 | Reality: 1 | Error: 0.0009 Prediction: 0.02 | Reality: 0 | Error: 0.0005
# predictions with the final weights, compared with the ground truths.
[round(ws.dot(streetlight), 1) for streetlight in streetlights], streetlights, walk_vs_stop
([0.0, 1.0, -0.0, 1.0, 1.0, 0.0], array([[1, 0, 1], [0, 1, 1], [0, 0, 1], [1, 1, 1], [0, 1, 1], [1, 0, 1]]), array([[0], [1], [0], [1], [1], [0]]))
# Book's Implementation.
weights = np.array([.5, .48, -.7])
alpha = .1
for iteration in range(20):
    error_for_all_lights = 0
    for row_index in range(len(walk_vs_stop)):
        input = streetlights[row_index]
        goal_prediction = walk_vs_stop[row_index]
        prediction = input.dot(weights)
        error = (goal_prediction - prediction) ** 2
        error_for_all_lights += error
        delta = prediction - goal_prediction
        weights = weights - (alpha * (input * delta))
    print(f"Error: {round(error_for_all_lights[0], 5)}")
Error: 2.65612
Error: 0.96287
Error: 0.55092
Error: 0.36446
Error: 0.25168
Error: 0.17798
Error: 0.12864
Error: 0.09511
Error: 0.07195
Error: 0.05565
Error: 0.04395
Error: 0.03536
Error: 0.02891
Error: 0.02395
Error: 0.02006
Error: 0.01695
Error: 0.01442
Error: 0.01233
Error: 0.01059
Error: 0.00912
The idea of learning one example at a time is called stochastic gradient descent. It performs a prediction and a weight update for each training example separately. It also iterates through the entire dataset many times until it can find a weight configuration that works well for the entire dataset.
The second option is (full) gradient descent: instead of updating the weights once for each training example, the network computes the average loss (and gradient) over the entire dataset, changing the weights only after that full pass.
A third option, mini-batch gradient descent, sits between the two: instead of updating the weights using one example or the entire dataset, we choose a batch size (typically between 8 and 1024) and update the weights after each batch; a minimal sketch follows below.
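Here is a minimal sketch of that mini-batch variant, reusing the streetlights and walk_vs_stop arrays (and np) defined above; the batch_size and epoch count are illustrative, not from the book:

# Minimal mini-batch gradient descent sketch (illustrative hyperparameters).
batch_size = 2
ws = np.random.rand(streetlights.shape[1])
lr = 0.1
for epoch in range(40):
    for start in range(0, len(streetlights), batch_size):
        batch_x = streetlights[start:start + batch_size]      # shape (batch, 3)
        batch_y = walk_vs_stop[start:start + batch_size, 0]   # shape (batch,)
        preds = batch_x.dot(ws)                               # one prediction per example
        deltas = preds - batch_y
        # average the per-example gradients, then do a single weight update.
        gradient = 2 * batch_x.T.dot(deltas) / len(batch_x)
        ws -= lr * gradient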
Correlation is found wherever the weights are set to high numbers. Inversely, randomness with respect to the input is found wherever the weights converge to 0.
So, a valid question to ask is: how did the network identify correlation in the last example? The answer comes from the input: in the process of gradient descent, each training example asserts either upward or downward pressure on the weights. On average, there was more upward pressure on the middle weight and more downward pressure on the other two.
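As a rough sketch (the starting weights below are illustrative, not the book's), we can make this pressure visible: the push an example exerts on weight j is proportional to input[j] * delta, so a light that is off exerts no pressure at all, while the middle light's pressure consistently points in the error-reducing direction.

# Rough sketch: the pressure each example exerts on each weight is input * delta.
# The weight update subtracts a multiple of this pressure, so positive values
# push a weight down and negative values push it up. Illustrative starting weights.
ws = np.array([0.5, 0.5, 0.5])
for x, target in zip(streetlights, walk_vs_stop[:, 0]):
    delta = x.dot(ws) - target
    print(x, "pressure on each weight:", x * delta)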
Next, let's ask how each individual weight learns:
Each node is individually trying to correctly predict the output given the input. For the most part, each node ignores all other nodes when attempting to do so; the only cross-communication is that all three weights must share the same error measure. The weight update is nothing more than taking this shared error measure and multiplying it by each respective input.
A key part of why neural networks learn is error attribution: given a shared error, the network needs to figure out which weights contributed (so they can be adjusted) and which weights did not contribute (so they can be left alone). On average, this causes the network to find the correlation between the middle input and the output to be the dominant predictive force, enhancing the predictive accuracy of the network.
To summarize:
Sometimes correlation happens accidentally. If a particular configuration of weights accidentally creates perfect correlation between the predictions and the output dataset (without finding the pattern that actually matters), the neural network will stop learning, because the error is already zero and exerts no further pressure on the weights.
In essence, the neural network memorized the two training examples instead of finding the correlation that will generalize to any possible streetlight configuration. The greatest challenge we will face with deep learning is pushing our neural network to generalize instead of just memorize.
As nodes learn, they absorb some of the error; in other words, they absorb part of the correlation. This causes the network to predict with moderate correlative power, which reduces the error. After that, the other weights only try to adjust their values to correctly predict what's left. Regularization forces weights with conflicting pressure to move toward 0, effectively saying that only weights with really strong correlation can stay on.
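The book doesn't implement regularization in this chapter, but as a minimal sketch of one common form (L2 weight decay) applied to the single-layer network above, each update adds a term that pulls every weight toward 0, so only weights earning consistent correlation pressure stay large; reg_strength is an illustrative value:

# Minimal L2 weight-decay sketch (one common regularizer; not the book's code).
# Reuses streetlights, walk_vs_stop, and np from above.
reg_strength = 0.01
ws = np.random.rand(streetlights.shape[1])
lr = 0.1
for epoch in range(40):
    for x, target in zip(streetlights, walk_vs_stop[:, 0]):
        delta = x.dot(ws) - target
        # error gradient plus a decay term that pulls each weight toward 0.
        ws -= lr * (2 * x * delta + reg_strength * ws)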
In the case of a single layer of weights between input and output, each weight learns on its own, finding the correlation between its associated input column and the output. When the correlation is indirect, meaning that a combination of the inputs (rather than any individual column) is correlated with the output, we use the multi-layer perceptron architecture.
Neural networks search for correlation between their input and output layers. Because sometimes the input dataset doesn't directly correlate with the output dataset, we'll use the input dataset to create an intermediate dataset that does have correlation with the output.
This will lead us to what is called representation learning.
The middle layer represents the intermediate dataset. The resulting network is still just a function. It has a bunch of weights that are collected together in a particular way.
Gradient descent still works because we can calculate how much each weight contributes to the error and adjust it to push that error toward 0.
If we look at the stacked neural network architecture, ignore the lower weights, and consider their output to be the dataset, then the top half of the neural network is just like the networks trained in the preceding chapter. We can use the same learning logic to help the network learn efficiently.
The part that we don't yet understand is how to update the weights of the first layer. Previously, we used delta as a cached/normalized error measure. Now we want to figure out how to know the delta values at the first layer so they can help the second layer make accurate predictions.
The way to use the delta at layer 2 to figure out the delta at layer 1 is to multiply it by each of the respective weights connecting layer 1 to layer 2. It's like the prediction logic, but in reverse. This process of moving delta backward is called backpropagation.
Backpropagation lets us say: "If we want this node to be x amount higher, then each of these previous four nodes needs to be x * weights_1_2 amount higher/lower," because these weights were amplifying the prediction by weights_1_2 times.
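A tiny numeric illustration (the numbers are made up for this sketch): pushing the output delta back through weights_1_2 gives one delta per hidden node.

# Hypothetical numbers: 4 hidden nodes feeding 1 output node.
import numpy as np
weights_1_2 = np.array([[0.5], [-1.0], [2.0], [0.1]])   # shape (4, 1)
layer_2_delta = np.array([[0.25]])                       # how far the output was off
layer_1_delta = layer_2_delta.dot(weights_1_2.T)         # shape (1, 4)
print(layer_1_delta)  # roughly [[ 0.125 -0.25  0.5  0.025]]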
Once we know this, we can update each weight matrix as we did before: for each weight, we multiply its output delta by its input value and adjust the weight by that much (optionally scaled by the learning rate).
As it turns out, we need one more piece to make this neural network train. The Problem lies in the following statement:
The composition of linear mappings is itself a linear mapping.
Meaning that no matter how many stacked (purely linear) layers we add to the neural network, we can always find an equivalent network with only one layer that processes the input in exactly the same way.
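Here is a minimal sketch of that claim (random matrices used purely for illustration): two stacked linear layers produce exactly the same output as the single matrix W1.dot(W2).

# Sketch: stacking two linear layers is equivalent to one linear layer.
import numpy as np
np.random.seed(0)
W1 = np.random.rand(3, 4)
W2 = np.random.rand(4, 1)
x = np.array([1.0, 0.0, 1.0])
two_layer = x.dot(W1).dot(W2)       # input -> hidden -> output, no nonlinearity
one_layer = x.dot(W1.dot(W2))       # the single equivalent weight matrix
print(np.allclose(two_layer, one_layer))  # True, by associativity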
The middle nodes don't get to add anything to the conversation; they don't get to have correlation of their own. Each one is merely more or less correlated with the various input nodes. And because we know that in the new dataset there is no correlation between any individual input and the output, how can the middle layer help? As the network stands, it doesn't.
In essence, we want the middle layer to sometimes correlate with an input, and sometimes not correlate. This gives it a correlation of its own and the opportunity to not just always be x% correlated with one input and y% correlated with another input. Instead, it can be x% correlated with one input only when it wants to be.
This is called conditional correlation, or "sometimes correlation," but let's just call it nonlinearity.
If a node's value dropped below 0, normally the node would still have the same correlation to the input as always (it would just happen to be negative in value). But if we turn off the node whenever it would be negative, then it has zero correlation to any input whenever it's negative.
This means the node can now pick and choose when it wants to be correlated with something. It can say something like: "Make me perfectly correlated to the left input, but only when the right input is turned off." Now the node can be conditional (it speaks for itself!).
The fancy term for this "if the node would be negative, set it to 0" logic is nonlinearity. Without this tweak, the neural network is linear. There are many kinds of nonlinearities, but the one discussed here is, in many cases, the best one to use. It's also the simplest, ReLU:
ReLU(x) = max(x, 0)

import numpy as np
np.random.seed(1)  # set the seed if you are interested in reproducing the same results between runs.
def ReLU(x):
    return (x > 0) * x
lr = .1
hidden_size = 4
X = np.array([[1, 0, 1],
              [0, 1, 1],
              [0, 0, 1],
              [1, 1, 1]])
y = np.array([[1, 1, 0, 0]]).T
# weights to connect the 3 layers.
ws_0_1 = (2 * np.random.random((3, hidden_size))) - 1
ws_1_2 = (2 * np.random.random((hidden_size, 1))) - 1
ws_0_1.shape, ws_1_2.shape
((3, 4), (4, 1))
layer_0 = X[0]
layer_1 = ReLU(np.dot(layer_0, ws_0_1))
layer_2 = np.dot(layer_1, ws_1_2)
def ReLU_grad(x):
    """Derivative of ReLU."""
    return (x > 0) * 1
for epoch in range(100):
    for i in range(len(X)):
        # get input X[i] & target y[i]
        x_i, y_i = X[i], y[i]
        # calculate prediction
        hs = ReLU(np.dot(x_i, ws_0_1))
        prediction = np.dot(hs, ws_1_2)
        # calculate error, pure error.
        error = (prediction - y_i) ** 2
        delta = prediction - y_i
        # calculate gradients of 1st layer.
        grad_0_1 = np.zeros(ws_0_1.shape)
        for line_i in range(len(ws_0_1)):
            for col_i in range(len(ws_0_1[0])):
                grad_0_1[line_i][col_i] = 2 * delta * x_i[line_i] * ws_1_2[col_i] * ReLU_grad(hs[col_i])
        # update weights of 1st layer.
        ws_0_1 -= lr * grad_0_1
        # calculate gradients of 2nd layer.
        grad_1_2 = np.zeros(ws_1_2.shape)
        for line_i in range(len(ws_1_2)):
            grad_1_2[line_i] = 2 * delta * hs[line_i]
        # update weights of 2nd layer.
        ws_1_2 -= lr * grad_1_2
    if (epoch % 10 == 0):
        print(f"ERROR: {round(error[0], 30)}")
ws_1_2
ERROR: 0.0
ERROR: 0.07710452429266079
ERROR: 0.03764561420363532
ERROR: 0.002405811235989023
ERROR: 9.222743383310113e-06
ERROR: 0.0
ERROR: 0.0
ERROR: 0.0
ERROR: 0.0
ERROR: 0.0
array([[-0.5910955 ], [ 1.13962134], [-0.94522481], [ 1.11202675]])
# Book Implementation.
for iteration in range(100):
    layer_2_error = 0
    for i in range(len(X)):
        layer_0 = X[i:i+1]
        layer_1 = ReLU(np.dot(layer_0, ws_0_1))
        layer_2 = np.dot(layer_1, ws_1_2)
        layer_2_error += np.sum((layer_2 - y[i:i+1]) ** 2)
        layer_2_delta = (layer_2 - y[i:i+1])
        layer_1_delta = layer_2_delta.dot(ws_1_2.T) * ReLU_grad(layer_1)
        ws_1_2 -= lr * layer_1.T.dot(layer_2_delta)
        ws_0_1 -= lr * layer_0.T.dot(layer_1_delta)
    if (iteration % 10 == 0):
        print(f"ERROR: {round(layer_2_error, 30)}")
ERROR: 3.629241238667749e-11
ERROR: 8.651118420460012e-12
ERROR: 2.080468952941911e-12
ERROR: 5.00404140651715e-13
ERROR: 1.2035981037474324e-13
ERROR: 2.894955274935546e-14
ERROR: 6.963091350021569e-15
ERROR: 1.6747973437405323e-15
ERROR: 4.0283054359352306e-16
ERROR: 9.689079599258301e-17
Proper error attribution is the goal of backpropagation: figuring out how much each weight contributed to the overall error. Now that we know how much the final prediction should move up or down, we need to figure out how much each middle node should move up or down; we call these the "intermediate predictions." Once we have the delta at layer 1, we can use the same process as before for calculating a weight update.
Backpropagation is about calculating deltas for intermediate layers so we can perform gradient descent.
A two-layer network might have a problem classifying cat vs. non-cat pictures, because no individual pixel correlates with whether there's a cat in the picture; only different configurations of pixels correlate with whether there's a cat.
Deep learning is all about creating intermediate layers (representations) wherein each node in an intermediate layer represents the presence or absence of a different configuration of inputs. Because the intermediate layers detect various pixel configurations, they give the final layer the information it needs to correctly predict the presence or absence of a cat.
Some neural networks have hundreds of layers!
The rest of this book will be dedicated to studying different phenomena within these layers in an effort to explore the full power of deep neural networks.
import numpy as np
X = np.array([[1, 0, 1],
              [0, 1, 1],
              [0, 0, 1],
              [1, 1, 1]])
y = np.array([[1, 1, 0, 0]]).T
epochs = 10000
lr = 0.1
X.shape, y.shape
((4, 3), (4, 1))
# init weights.
ws_1 = np.random.rand(X.shape[1], 4)
ws_2 = np.random.rand(4, y.shape[1])
ws_1.shape, ws_2.shape
((3, 4), (4, 1))
def relu(x):
    return (x > 0) * x
def grad_relu(x):
    return x > 0
for epoch in range(epochs):
    for i in range(len(X)):
        # get input/output
        layer_in = X[i:i+1]
        # calculate prediction
        layer_1 = relu(layer_in.dot(ws_1))
        layer_out = layer_1.dot(ws_2).reshape(1, 1)
        # calculate delta 2
        delta_2 = layer_out - y[i:i+1]
        # calc error for logs
        error = delta_2 ** 2
        # calculate delta 1
        # delta_2.dot(ws_2.T) -> (1, 4)
        # grad_relu(layer_1) -> (1, 4)
        # * : element-wise multiplication.
        delta_1 = delta_2.dot(ws_2.T) * grad_relu(layer_1)
        # update weights
        ws_2 -= lr * (layer_1.T.reshape(4, 1).dot(delta_2))
        ws_1 -= lr * (layer_in.T.reshape(3, 1).dot(delta_1))
    if epoch % 200 == 0:
        print(f"ERROR: {error[0][0]}")
ERROR: 1.8831273134174278 ERROR: 0.0001422217424704162 ERROR: 7.932455952075163e-12 ERROR: 6.184971668330622e-19 ERROR: 3.58738933991847e-26 ERROR: 1.232595164407831e-32 ERROR: 1.1093356479670479e-31 ERROR: 1.1093356479670479e-31 ERROR: 1.1093356479670479e-31 ERROR: 1.1093356479670479e-31 ERROR: 1.1093356479670479e-31 ERROR: 1.1093356479670479e-31 ERROR: 1.1093356479670479e-31 ERROR: 1.1093356479670479e-31 ERROR: 1.1093356479670479e-31 ERROR: 1.1093356479670479e-31 ERROR: 1.1093356479670479e-31 ERROR: 1.1093356479670479e-31 ERROR: 1.1093356479670479e-31 ERROR: 1.1093356479670479e-31 ERROR: 1.1093356479670479e-31 ERROR: 1.1093356479670479e-31 ERROR: 1.1093356479670479e-31 ERROR: 1.1093356479670479e-31 ERROR: 1.1093356479670479e-31 ERROR: 1.1093356479670479e-31 ERROR: 1.1093356479670479e-31 ERROR: 1.1093356479670479e-31 ERROR: 1.1093356479670479e-31 ERROR: 1.1093356479670479e-31 ERROR: 1.1093356479670479e-31 ERROR: 1.1093356479670479e-31 ERROR: 1.1093356479670479e-31 ERROR: 1.1093356479670479e-31 ERROR: 1.1093356479670479e-31 ERROR: 1.1093356479670479e-31 ERROR: 1.1093356479670479e-31 ERROR: 1.1093356479670479e-31 ERROR: 1.1093356479670479e-31 ERROR: 1.1093356479670479e-31 ERROR: 1.1093356479670479e-31 ERROR: 1.1093356479670479e-31 ERROR: 1.1093356479670479e-31 ERROR: 1.1093356479670479e-31 ERROR: 1.1093356479670479e-31 ERROR: 1.1093356479670479e-31 ERROR: 1.1093356479670479e-31 ERROR: 1.1093356479670479e-31 ERROR: 1.1093356479670479e-31 ERROR: 1.1093356479670479e-31
# test weights.
for i in range(len(X)):
    x_i, y_i = X[i], y[i]
    print('y: ', y_i, ' y_hat: ', relu(x_i.dot(ws_1)).dot(ws_2).squeeze())
y: [1] y_hat: 0.9999999999999996
y: [1] y_hat: 0.9999999999999997
y: [0] y_hat: 3.531122842112766e-16
y: [0] y_hat: 1.1102230246251565e-16