In this chapter, we will:
[Douglas Adams] "O Deep Thought computer," he said, "the task we have designed you to perform is this. We want you to tell us..." he paused. "The Answer."
The streetlight problem is a toy problem that considers how a network can learn an entire dataset.
Imagine that you are approaching a street corner in a foreign country. As you approach, you look up and realize that the streetlight is unfamiliar. In this case, how can you know when it's safe to cross the street?
To solve this problem, you might sit at a street corner for a few minutes observing the correlation between each light combination and whether people around you choose to stop or walk. After a few minutes, you realize that there is a perfect correlation between the middle light and whether it's safe to walk or not.
You learned this pattern by observing all individual data points and searching for correlation. This is what we're going to train a neural network to do!
We have two datasets: on the one hand, six streetlight states; on the other hand, six observations of whether people walked or not.
Neural networks do not read streetlights, so we want to prepare this data for processing. The first thing to do is split it into two groups:
We want to translate the streetlight data into numerical values:
The goal after that is to teach a neural network to translate a streetlight pattern into the correct stop/walk pattern. What we really want to do is to transform the input information we have into the correct stop/walk target signal.
In data matrices, a common convention is to give each recorded example its own row and each feature (thing being recorded) its own column. This makes the matrix easier to read and process.
The data matrix does not have to be all 1s and 0s; it happens to be here because we are dealing with binary information. The matrix itself should mirror the patterns that exist in the real world.
The underlying pattern is not the same as the matrix:
The resulting matrix is called a lossless representation because we can perfectly convert back and forth between the stop/walk notes and the matrix.
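As a small illustration of that claim (the dictionary encoding below is hypothetical, chosen just for this sketch: 1 means on/walk, 0 means off/stop), one observation and its matrix row convert back and forth with nothing lost:

# Sketch: round-tripping one observation between "notes" and a matrix row.
note = {"lights": ("on", "off", "on"), "action": "walk"}
row = [1 if light == "on" else 0 for light in note["lights"]]   # [1, 0, 1]
label = 1 if note["action"] == "walk" else 0                    # 1
decoded = {"lights": tuple("on" if v else "off" for v in row),
           "action": "walk" if label else "stop"}
assert decoded == note   # nothing was lost in either direction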
Let's create the streetlight pattern matrix:
import numpy as np
streetlights = np.array([[1,0,1], [0,1,1], [0,0,1], [1,1,1], [0,1,1], [1,0,1]])
streetlights.shape
(6, 3)
NumPy is really just a fancy wrapper for an array of arrays that provides special, matrix-oriented functions.
walk_vs_stop = np.array([[0], [1], [0], [1], [1], [0]])
walk_vs_stop.shape
(6, 1)
# Personal Implementation
ws = np.random.rand(streetlights.shape[1])
x_i = streetlights[0]
y_i = walk_vs_stop[0]
lr = .1
for iteration in range(20):
    # predict.
    prediction = x_i.dot(ws)
    # MSE error.
    error = (prediction - y_i) ** 2
    # update weights.
    for j in range(len(ws)):
        gradient = 2 * x_i[j] * (prediction - y_i)
        ws[j] -= lr * gradient
    if iteration % 2 == 0:
        print(f"Error: {round(error[0], 5)} | Prediction: {round(prediction, 5)}")
Error: 0.46497 | Prediction: 0.68189
Error: 0.06026 | Prediction: 0.24548
Error: 0.00781 | Prediction: 0.08837
Error: 0.00101 | Prediction: 0.03181
Error: 0.00013 | Prediction: 0.01145
Error: 2e-05 | Prediction: 0.00412
Error: 0.0 | Prediction: 0.00148
Error: 0.0 | Prediction: 0.00053
Error: 0.0 | Prediction: 0.00019
Error: 0.0 | Prediction: 7e-05
# Book's Implementation.
weights = np.array([.5, .48, -.7])
alpha = .1
input = streetlights[0]
goal_prediction = walk_vs_stop[0]
for iteration in range(20):
    prediction = input.dot(weights)
    error = (goal_prediction - prediction) ** 2
    delta = prediction - goal_prediction
    weights = weights - (alpha * (input * delta))
    print(f"Error: {round(error[0], 5)} | Prediction: {round(prediction, 5)}")
Error: 0.04 | Prediction: -0.2
Error: 0.0256 | Prediction: -0.16
Error: 0.01638 | Prediction: -0.128
Error: 0.01049 | Prediction: -0.1024
Error: 0.00671 | Prediction: -0.08192
Error: 0.00429 | Prediction: -0.06554
Error: 0.00275 | Prediction: -0.05243
Error: 0.00176 | Prediction: -0.04194
Error: 0.00113 | Prediction: -0.03355
Error: 0.00072 | Prediction: -0.02684
Error: 0.00046 | Prediction: -0.02147
Error: 0.0003 | Prediction: -0.01718
Error: 0.00019 | Prediction: -0.01374
Error: 0.00012 | Prediction: -0.011
Error: 8e-05 | Prediction: -0.0088
Error: 5e-05 | Prediction: -0.00704
Error: 3e-05 | Prediction: -0.00563
Error: 2e-05 | Prediction: -0.0045
Error: 1e-05 | Prediction: -0.0036
Error: 1e-05 | Prediction: -0.00288
The neural network has been learning only one streetlight. Next, let's implement a training loop to learn from all of the data points we have:
# let's generalize the algorithm.
ws = np.random.rand(streetlights.shape[1])
lr = .1
epochs = 7
for iteration in range(epochs):
    for i in range(len(streetlights)):
        # predict.
        prediction = streetlights[i].dot(ws)
        # MSE error.
        error = (prediction - walk_vs_stop[i]) ** 2
        # update weights.
        for j in range(len(ws)):
            gradient = 2 * streetlights[i][j] * (prediction - walk_vs_stop[i])
            ws[j] -= lr * gradient
        print(f"Prediction: {round(prediction, 2)} | Reality: {walk_vs_stop[i][0]} | Error: {round(error[0], 5)}")
Prediction: 1.39 | Reality: 0 | Error: 1.94297 Prediction: 0.91 | Reality: 1 | Error: 0.0076 Prediction: 0.19 | Reality: 0 | Error: 0.03563 Prediction: 1.57 | Reality: 1 | Error: 0.33057 Prediction: 0.68 | Reality: 1 | Error: 0.10242 Prediction: 0.65 | Reality: 0 | Error: 0.42256 Prediction: 0.39 | Reality: 0 | Error: 0.15212 Prediction: 0.6 | Reality: 1 | Error: 0.16003 Prediction: -0.03 | Reality: 0 | Error: 0.00078 Prediction: 1.11 | Reality: 1 | Error: 0.01157 Prediction: 0.72 | Reality: 1 | Error: 0.07698 Prediction: 0.33 | Reality: 0 | Error: 0.11028 Prediction: 0.2 | Reality: 0 | Error: 0.0397 Prediction: 0.73 | Reality: 1 | Error: 0.07439 Prediction: -0.04 | Reality: 0 | Error: 0.00161 Prediction: 1.06 | Reality: 1 | Error: 0.00343 Prediction: 0.82 | Reality: 1 | Error: 0.03206 Prediction: 0.19 | Reality: 0 | Error: 0.03783 Prediction: 0.12 | Reality: 0 | Error: 0.01362 Prediction: 0.83 | Reality: 1 | Error: 0.02879 Prediction: -0.04 | Reality: 0 | Error: 0.00132 Prediction: 1.05 | Reality: 1 | Error: 0.00209 Prediction: 0.89 | Reality: 1 | Error: 0.01273 Prediction: 0.12 | Reality: 0 | Error: 0.01334 Prediction: 0.07 | Reality: 0 | Error: 0.0048 Prediction: 0.9 | Reality: 1 | Error: 0.01095 Prediction: -0.03 | Reality: 0 | Error: 0.001 Prediction: 1.04 | Reality: 1 | Error: 0.00142 Prediction: 0.93 | Reality: 1 | Error: 0.00512 Prediction: 0.07 | Reality: 0 | Error: 0.00463 Prediction: 0.04 | Reality: 0 | Error: 0.00167 Prediction: 0.94 | Reality: 1 | Error: 0.00419 Prediction: -0.03 | Reality: 0 | Error: 0.00075 Prediction: 1.03 | Reality: 1 | Error: 0.00099 Prediction: 0.95 | Reality: 1 | Error: 0.00211 Prediction: 0.04 | Reality: 0 | Error: 0.00156 Prediction: 0.02 | Reality: 0 | Error: 0.00056 Prediction: 0.96 | Reality: 1 | Error: 0.00162 Prediction: -0.02 | Reality: 0 | Error: 0.00056 Prediction: 1.03 | Reality: 1 | Error: 0.0007 Prediction: 0.97 | Reality: 1 | Error: 0.0009 Prediction: 0.02 | Reality: 0 | Error: 0.0005
# predictions with the final weights, compared with the ground truths.
[round(ws.dot(streetlight), 1) for streetlight in streetlights], streetlights, walk_vs_stop
([0.0, 1.0, -0.0, 1.0, 1.0, 0.0], array([[1, 0, 1], [0, 1, 1], [0, 0, 1], [1, 1, 1], [0, 1, 1], [1, 0, 1]]), array([[0], [1], [0], [1], [1], [0]]))
# Book's Implementation.
weights = np.array([.5, .48, -.7])
alpha = .1
for iteration in range(20):
    error_for_all_lights = 0
    for row_index in range(len(walk_vs_stop)):
        input = streetlights[row_index]
        goal_prediction = walk_vs_stop[row_index]
        prediction = input.dot(weights)
        error = (goal_prediction - prediction) ** 2
        error_for_all_lights += error
        delta = prediction - goal_prediction
        weights = weights - (alpha * (input * delta))
    print(f"Error: {round(error_for_all_lights[0], 5)}")
Error: 2.65612
Error: 0.96287
Error: 0.55092
Error: 0.36446
Error: 0.25168
Error: 0.17798
Error: 0.12864
Error: 0.09511
Error: 0.07195
Error: 0.05565
Error: 0.04395
Error: 0.03536
Error: 0.02891
Error: 0.02395
Error: 0.02006
Error: 0.01695
Error: 0.01442
Error: 0.01233
Error: 0.01059
Error: 0.00912
The idea of learning one example at a time is called stochastic gradient descent. It performs a prediction and a weight update for each training example separately. It also iterates through the entire dataset many times until it can find a weight configuration that works well for the entire dataset.
The second option is (full) gradient descent: instead of updating the weights once for each training example, the network computes the average loss (and gradient) over the entire dataset, changing the weights only after that full pass.
A third option, mini-batch gradient descent, sits between the two: instead of updating the weights using one example or the entire dataset, we choose a batch size (typically between 8 and 1024) and update the weights after each batch; a minimal sketch follows below.
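Here is a minimal sketch of that mini-batch variant, reusing the streetlights and walk_vs_stop arrays (and np) defined above; the batch_size and epoch count are illustrative, not from the book:

# Minimal mini-batch gradient descent sketch (illustrative hyperparameters).
batch_size = 2
ws = np.random.rand(streetlights.shape[1])
lr = 0.1
for epoch in range(40):
    for start in range(0, len(streetlights), batch_size):
        batch_x = streetlights[start:start + batch_size]      # shape (batch, 3)
        batch_y = walk_vs_stop[start:start + batch_size, 0]   # shape (batch,)
        preds = batch_x.dot(ws)                               # one prediction per example
        deltas = preds - batch_y
        # average the per-example gradients, then do a single weight update.
        gradient = 2 * batch_x.T.dot(deltas) / len(batch_x)
        ws -= lr * gradient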
Correlation is found wherever the weights are set to high numbers. Inversely, randomness with respect to the input is found wherever the weights converge to 0.
So, a valid question to ask is: how did the network identify correlation in the last example? The answer comes from the input: in the process of gradient descent, each training example asserts either upward or downward pressure on the weights. On average, there was more upward pressure on the middle weight and more downward pressure on the other two.
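As a rough sketch (the starting weights below are illustrative, not the book's), we can make this pressure visible: the push an example exerts on weight j is proportional to input[j] * delta, so a light that is off exerts no pressure at all, while the middle light's pressure consistently points in the error-reducing direction.

# Rough sketch: the pressure each example exerts on each weight is input * delta.
# The weight update subtracts a multiple of this pressure, so positive values
# push a weight down and negative values push it up. Illustrative starting weights.
ws = np.array([0.5, 0.5, 0.5])
for x, target in zip(streetlights, walk_vs_stop[:, 0]):
    delta = x.dot(ws) - target
    print(x, "pressure on each weight:", x * delta)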
Next, let's ask how each individual weight learns:
Each node is individually trying to correctly predict the output given the input. For the most part, each node ignores all other nodes when attempting to do so; the only cross-communication is that all three weights must share the same error measure. The weight update is nothing more than taking this shared error measure and multiplying it by each respective input.
A key part of why neural networks learn is error attribution: given a shared error, the network needs to figure out which weights contributed (so they can be adjusted) and which weights did not contribute (so they can be left alone). On average, this causes the network to find the correlation between the middle input and the output to be the dominant predictive force, enhancing the predictive accuracy of the network.
To summarize:
Sometimes correlation happens accidentally. If a particular configuration of weights accidentally creates perfect correlation between the predictions and the output dataset (without finding the pattern that actually matters), the neural network will stop learning, because the error is already zero and exerts no further pressure on the weights.
In essence, the neural network memorized the two training examples instead of finding the correlation that will generalize to any possible streetlight configuration. The greatest challenge we will face with deep learning is pushing our neural network to generalize instead of just memorize.
As nodes learn, they absorb some of the error; in other words, they absorb part of the correlation. This causes the network to predict with moderate correlative power, which reduces the error. After that, the other weights only try to adjust their values to correctly predict what's left. Regularization forces weights with conflicting pressure to move toward 0, effectively saying that only weights with really strong correlation can stay on.
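The book doesn't implement regularization in this chapter, but as a minimal sketch of one common form (L2 weight decay) applied to the single-layer network above, each update adds a term that pulls every weight toward 0, so only weights earning consistent correlation pressure stay large; reg_strength is an illustrative value:

# Minimal L2 weight-decay sketch (one common regularizer; not the book's code).
# Reuses streetlights, walk_vs_stop, and np from above.
reg_strength = 0.01
ws = np.random.rand(streetlights.shape[1])
lr = 0.1
for epoch in range(40):
    for x, target in zip(streetlights, walk_vs_stop[:, 0]):
        delta = x.dot(ws) - target
        # error gradient plus a decay term that pulls each weight toward 0.
        ws -= lr * (2 * x * delta + reg_strength * ws)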
In the case of a single layer of weights between input and output, each weight learns on its own, finding the correlation between its associated input column and the output. When the correlation is indirect, meaning that a combination of the inputs (rather than any individual column) is correlated with the output, we use the multi-layer perceptron architecture.
Neural networks search for correlation between their input and output layers. Because sometimes the input dataset doesn't directly correlate with the output dataset, we'll use the input dataset to create an intermediate dataset that does have correlation with the output.
This will lead us to what is called representation learning.
The middle layer represents the intermediate dataset. The resulting network is still just a function. It has a bunch of weights that are collected together in a particular way.
Gradient descent still works because we can calculate how much each weight contributes to the error and adjust it to push that error toward 0.
If we look at the stacked neural network architecture, ignore the lower weights, and consider their output to be the dataset, then the top half of the neural network is just like the networks trained in the preceding chapter. We can use the same learning logic to help the network learn efficiently.
The part that we don't yet understand is how to update the weights of the first layer. Previously, we used delta as a cached/normalized error measure. Now we want to figure out how to know the delta values at the first layer so they can help the second layer make accurate predictions.
The way to use the delta at layer 2 to figure out the delta at layer 1 is to multiply it by each of the respective weights connecting layer 1 to layer 2. It's like the prediction logic, but in reverse. This process of moving delta backward is called backpropagation.
Backpropagation lets us say: "If we want this node to be x amount higher, then each of these previous four nodes needs to be x * weights_1_2 amount higher/lower," because these weights were amplifying the prediction by weights_1_2 times.
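A tiny numeric illustration (the numbers are made up for this sketch): pushing the output delta back through weights_1_2 gives one delta per hidden node.

# Hypothetical numbers: 4 hidden nodes feeding 1 output node.
import numpy as np
weights_1_2 = np.array([[0.5], [-1.0], [2.0], [0.1]])   # shape (4, 1)
layer_2_delta = np.array([[0.25]])                       # how far the output was off
layer_1_delta = layer_2_delta.dot(weights_1_2.T)         # shape (1, 4)
print(layer_1_delta)  # roughly [[ 0.125 -0.25  0.5  0.025]]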
Once we know this, we can update each weight matrix as we did before: for each weight, we multiply its output delta by its input value and adjust the weight by that much (optionally scaled by the learning rate).
As it turns out, we need one more piece to make this neural network train. The Problem lies in the following statement:
The composition of linear mappings is itself a linear mapping.
Meaning that no matter how many stacked (purely linear) layers we add to the neural network, we can always find an equivalent network with only one layer that processes the input in exactly the same way.
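Here is a minimal sketch of that claim (random matrices used purely for illustration): two stacked linear layers produce exactly the same output as the single matrix W1.dot(W2).

# Sketch: stacking two linear layers is equivalent to one linear layer.
import numpy as np
np.random.seed(0)
W1 = np.random.rand(3, 4)
W2 = np.random.rand(4, 1)
x = np.array([1.0, 0.0, 1.0])
two_layer = x.dot(W1).dot(W2)       # input -> hidden -> output, no nonlinearity
one_layer = x.dot(W1.dot(W2))       # the single equivalent weight matrix
print(np.allclose(two_layer, one_layer))  # True, by associativity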
The middle nodes don't get to add anything to the conversation; they don't get to have correlation of their own. Each one is merely more or less correlated with the various input nodes. And because we know that in the new dataset there is no correlation between any individual input and the output, how can the middle layer help? As the network stands, it doesn't.
In essence, we want the middle layer to sometimes correlate with an input, and sometimes not correlate. This gives it a correlation of its own and the opportunity to not just always be x% correlated with one input and y% correlated with another input. Instead, it can be x% correlated with one input only when it wants to be.
This is called conditional correlation, or "sometimes correlation," but let's just call it nonlinearity.
If a node's value dropped below 0, normally the node would still have the same correlation to the input as always (it would just happen to be negative in value). But if we turn off the node whenever it would be negative, then it has zero correlation to any input whenever it's negative.
This means the node can now pick and choose when it wants to be correlated with something. It can say something like: "Make me perfectly correlated to the left input, but only when the right input is turned off." Now the node can be conditional (it speaks for itself!).
The fancy term for this "if the node would be negative, set it to 0" logic is nonlinearity. Without this tweak, the neural network is linear. There are many kinds of nonlinearities, but the one discussed here is, in many cases, the best one to use. It's also the simplest, ReLU:
ReLU(x) = max(x, 0)

import numpy as np
np.random.seed(1)  # set the seed if you are interested in reproducing the same results between runs.
def ReLU(x):
    return (x > 0) * x
lr = .1
hidden_size = 4
X = np.array([[1, 0, 1],
              [0, 1, 1],
              [0, 0, 1],
              [1, 1, 1]])
y = np.array([[1, 1, 0, 0]]).T
# weights to connect the 3 layers.
ws_0_1 = (2 * np.random.random((3, hidden_size))) - 1
ws_1_2 = (2 * np.random.random((hidden_size, 1))) - 1
ws_0_1.shape, ws_1_2.shape
((3, 4), (4, 1))
layer_0 = X[0]
layer_1 = ReLU(np.dot(layer_0, ws_0_1))
layer_2 = np.dot(layer_1, ws_1_2)
def ReLU_grad(x):
    """Derivative of ReLU."""
    return (x > 0) * 1
for epoch in range(100):
    for i in range(len(X)):
        # get input X[i] & target y[i]
        x_i, y_i = X[i], y[i]
        # calculate prediction
        hs = ReLU(np.dot(x_i, ws_0_1))
        prediction = np.dot(hs, ws_1_2)
        # calculate error, pure error.
        error = (prediction - y_i) ** 2
        delta = prediction - y_i
        # calculate gradients of 1st layer.
        grad_0_1 = np.zeros(ws_0_1.shape)
        for line_i in range(len(ws_0_1)):
            for col_i in range(len(ws_0_1[0])):
                grad_0_1[line_i][col_i] = 2 * delta * x_i[line_i] * ws_1_2[col_i] * ReLU_grad(hs[col_i])
        # update weights of 1st layer.
        ws_0_1 -= lr * grad_0_1
        # calculate gradients of 2nd layer.
        grad_1_2 = np.zeros(ws_1_2.shape)
        for line_i in range(len(ws_1_2)):
            grad_1_2[line_i] = 2 * delta * hs[line_i]
        # update weights of 2nd layer.
        ws_1_2 -= lr * grad_1_2
    if (epoch % 10 == 0):
        print(f"ERROR: {round(error[0], 30)}")
ws_1_2
ERROR: 0.0
ERROR: 0.07710452429266079
ERROR: 0.03764561420363532
ERROR: 0.002405811235989023
ERROR: 9.222743383310113e-06
ERROR: 0.0
ERROR: 0.0
ERROR: 0.0
ERROR: 0.0
ERROR: 0.0
array([[-0.5910955 ], [ 1.13962134], [-0.94522481], [ 1.11202675]])
# Book Implementation.
for iteration in range(100):
    layer_2_error = 0
    for i in range(len(X)):
        layer_0 = X[i:i+1]
        layer_1 = ReLU(np.dot(layer_0, ws_0_1))
        layer_2 = np.dot(layer_1, ws_1_2)
        layer_2_error += np.sum((layer_2 - y[i:i+1]) ** 2)
        layer_2_delta = (layer_2 - y[i:i+1])
        layer_1_delta = layer_2_delta.dot(ws_1_2.T) * ReLU_grad(layer_1)
        ws_1_2 -= lr * layer_1.T.dot(layer_2_delta)
        ws_0_1 -= lr * layer_0.T.dot(layer_1_delta)
    if (iteration % 10 == 0):
        print(f"ERROR: {round(layer_2_error, 30)}")
ERROR: 3.629241238667749e-11
ERROR: 8.651118420460012e-12
ERROR: 2.080468952941911e-12
ERROR: 5.00404140651715e-13
ERROR: 1.2035981037474324e-13
ERROR: 2.894955274935546e-14
ERROR: 6.963091350021569e-15
ERROR: 1.6747973437405323e-15
ERROR: 4.0283054359352306e-16
ERROR: 9.689079599258301e-17
Proper error attribution is the goal of backpropagation: figuring out how much each weight contributed to the overall error. Now that we know how much the final prediction should move up or down, we need to figure out how much each middle node should move up or down; we call these the "intermediate predictions." Once we have the delta at layer 1, we can use the same process as before for calculating a weight update.
Backpropagation is about calculating deltas for intermediate layers so we can perform gradient descent.
A two-layer network might have a problem classifying cat vs. non-cat pictures, because no individual pixel correlates with whether there's a cat in the picture; only different configurations of pixels correlate with whether there's a cat.
Deep learning is all about creating intermediate layers (representations) wherein each node in an intermediate layer represents the presence or absence of a different configuration of inputs. Because the intermediate layers detect various pixel configurations, they give the final layer the information it needs to correctly predict the presence or absence of a cat.
Some neural networks have hundreds of layers!
The rest of this book will be dedicated to studying different phenomena within these layers in an effort to explore the full power of deep neural networks.
import numpy as np
X = np.array([[1, 0, 1],
              [0, 1, 1],
              [0, 0, 1],
              [1, 1, 1]])
y = np.array([[1, 1, 0, 0]]).T
epochs = 10000
lr = 0.1
X.shape, y.shape
((4, 3), (4, 1))
# init weights.
ws_1 = np.random.rand(X.shape[1], 4)
ws_2 = np.random.rand(4, y.shape[1])
ws_1.shape, ws_2.shape
((3, 4), (4, 1))
def relu(x):
    return (x > 0) * x
def grad_relu(x):
    return x > 0
for epoch in range(epochs):
    for i in range(len(X)):
        # get input/output
        layer_in = X[i:i+1]
        # calculate prediction
        layer_1 = relu(layer_in.dot(ws_1))
        layer_out = layer_1.dot(ws_2).reshape(1, 1)
        # calculate delta 2
        delta_2 = layer_out - y[i:i+1]
        # calc error for logs
        error = delta_2 ** 2
        # calculate delta 1
        # delta_2.dot(ws_2.T) -> (1, 4)
        # grad_relu(layer_1) -> (1, 4)
        # * : element-wise multiplication.
        delta_1 = delta_2.dot(ws_2.T) * grad_relu(layer_1)
        # update weights
        ws_2 -= lr * (layer_1.T.reshape(4, 1).dot(delta_2))
        ws_1 -= lr * (layer_in.T.reshape(3, 1).dot(delta_1))
    if epoch % 200 == 0:
        print(f"ERROR: {error[0][0]}")
ERROR: 1.8831273134174278 ERROR: 0.0001422217424704162 ERROR: 7.932455952075163e-12 ERROR: 6.184971668330622e-19 ERROR: 3.58738933991847e-26 ERROR: 1.232595164407831e-32 ERROR: 1.1093356479670479e-31 ERROR: 1.1093356479670479e-31 ERROR: 1.1093356479670479e-31 ERROR: 1.1093356479670479e-31 ERROR: 1.1093356479670479e-31 ERROR: 1.1093356479670479e-31 ERROR: 1.1093356479670479e-31 ERROR: 1.1093356479670479e-31 ERROR: 1.1093356479670479e-31 ERROR: 1.1093356479670479e-31 ERROR: 1.1093356479670479e-31 ERROR: 1.1093356479670479e-31 ERROR: 1.1093356479670479e-31 ERROR: 1.1093356479670479e-31 ERROR: 1.1093356479670479e-31 ERROR: 1.1093356479670479e-31 ERROR: 1.1093356479670479e-31 ERROR: 1.1093356479670479e-31 ERROR: 1.1093356479670479e-31 ERROR: 1.1093356479670479e-31 ERROR: 1.1093356479670479e-31 ERROR: 1.1093356479670479e-31 ERROR: 1.1093356479670479e-31 ERROR: 1.1093356479670479e-31 ERROR: 1.1093356479670479e-31 ERROR: 1.1093356479670479e-31 ERROR: 1.1093356479670479e-31 ERROR: 1.1093356479670479e-31 ERROR: 1.1093356479670479e-31 ERROR: 1.1093356479670479e-31 ERROR: 1.1093356479670479e-31 ERROR: 1.1093356479670479e-31 ERROR: 1.1093356479670479e-31 ERROR: 1.1093356479670479e-31 ERROR: 1.1093356479670479e-31 ERROR: 1.1093356479670479e-31 ERROR: 1.1093356479670479e-31 ERROR: 1.1093356479670479e-31 ERROR: 1.1093356479670479e-31 ERROR: 1.1093356479670479e-31 ERROR: 1.1093356479670479e-31 ERROR: 1.1093356479670479e-31 ERROR: 1.1093356479670479e-31 ERROR: 1.1093356479670479e-31
# test weights.
for i in range(len(X)):
    x_i, y_i = X[i], y[i]
    print('y: ', y_i, ' y_hat: ', relu(x_i.dot(ws_1)).dot(ws_2).squeeze())
y: [1] y_hat: 0.9999999999999996
y: [1] y_hat: 0.9999999999999997
y: [0] y_hat: 3.531122842112766e-16
y: [0] y_hat: 1.1102230246251565e-16