Neural Networks: Multi-Layer Perceptron

Now that we've done one layer successfully, let's try more! We begin by adding one hidden layer to our network, and then generalize to include any number of hidden layers.
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
plt.rcParams['figure.facecolor'] = 'white'
The "hard" dataset that we couldn't predict with logistic regression:
def rescale(X):
return (X - X.mean(axis=0)) / (X.var(axis=0))
n_samples = 1000
X0 = np.random.normal(loc=[0,0], scale=[2,0.5], size=(int(n_samples/2), 2))
X11 = np.random.normal(loc=[0,3.5], scale=[0.5,1], size=(int(n_samples/4), 2))
X12 = np.random.normal(loc=[0,-3.5], scale=[0.5,1], size=(int(n_samples/4), 2))
X1 = np.vstack([X11, X12])
X = np.vstack([X0, X1])
# X = rescale(X)
y0 = np.zeros(shape=(int(n_samples/2), 1))
y1 = np.ones(shape=(int(n_samples/2), 1))
yhat = np.vstack([y0, y1])
X0 = X[(yhat==0).reshape(-1)]
X1 = X[(yhat==1).reshape(-1)]
plt.scatter(*X0.T, label='0', alpha=0.4); plt.scatter(*X1.T, label='1', alpha=0.4)
plt.legend();
Sigmoid Activation and Cross-Entropy Loss
def sig(z):
return 1 / (1 + np.exp(-z))
def dsig_dz(z):
return sig(z) * (1 - sig(z))
def J(y, yhat):
eps = 1e-8
return -(yhat*np.log(y+eps) + (1-yhat)*np.log(1-y+eps))
def dJ_dy(y, yhat):
eps = 1e-8
return (1-yhat)/(1-y+eps) - yhat/(y+eps)
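As a quick sanity check (a small sketch, not part of the original notebook), we can compare these analytic derivatives against central finite differences; eps_fd is just an illustrative step size:
# Illustrative check: analytic derivatives vs. central finite differences
eps_fd = 1e-5
z_test = np.linspace(-3, 3, 7)
y_test, yhat_test = 0.3, 1.0
# d(sigma)/dz
fd_sig = (sig(z_test + eps_fd) - sig(z_test - eps_fd)) / (2 * eps_fd)
print(np.max(np.abs(fd_sig - dsig_dz(z_test))))   # expect a very small number
# dJ/dy
fd_J = (J(y_test + eps_fd, yhat_test) - J(y_test - eps_fd, yhat_test)) / (2 * eps_fd)
print(np.abs(fd_J - dJ_dy(y_test, yhat_test)))    # expect a very small number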
For our hidden layer we will use a new type of unit (a neuron with an activation function): the rectified linear unit,
$$\mathrm{ReLU}(z) = \begin{cases} z & z \ge 0 \\ 0 & z < 0. \end{cases}$$
This activation is nice because it is fast to compute, and it doesn't saturate, by which we mean the derivative doesn't go to zero asymptotically, so terms that depend on the derivative don't die out:
$$\frac{d\,\mathrm{ReLU}}{dz} = \begin{cases} 1 & z \ge 0 \\ 0 & z < 0 \end{cases}$$
def relu(z):
return np.where(z>0, z, 0)
def drelu_dz(z):
return np.where(z>0, 1, 0)
z = np.linspace(-5,5)
plt.plot(z, relu(z), label='ReLU$(z)$'); plt.plot(z, drelu_dz(z), label='$d$ ReLU / $dz$'); plt.legend();
plt.title('The ReLU function');
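To make the saturation point concrete, here is a small comparison (illustrative, not from the original notebook): for large positive $z$ the sigmoid's derivative is nearly zero, while the ReLU derivative stays at 1:
# Compare derivative magnitudes as z grows (illustrative)
for z_val in [1.0, 5.0, 10.0]:
    print('z = {:>4}: dsig/dz = {:.2e}, dReLU/dz = {}'.format(
        z_val, dsig_dz(z_val), drelu_dz(z_val)))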
To start off, let's consider the case of having a single hidden ReLU layer.
n_input = 2
n_hidden = 10
n_output = 1
We now have two sets of weights, $w_1$ and $w_2$. These weights are matrices connecting the units of consecutive layers: define them to have as many rows as there are units in the previous layer and as many columns as units in the next layer, so $w_1 \in \mathbb{R}^{\,n_\text{input} \times n_\text{hidden}}$ and $w_2 \in \mathbb{R}^{\,n_\text{hidden} \times n_\text{output}}$. We will also explicitly consider bias vectors $b_1$ and $b_2$, each with one entry per unit in the next layer (one bias per activation unit).
In the previous notebook, we dealt with bias by appending a constant input $x_0 = 1$ and an additional weight, but I now find it clearer to just add the bias separately. In any case the effect is the same. The actions of the layers are then:
$$\begin{aligned} z_1 &= x_0 w_1 + b_1^T \\ x_1 &= \mathrm{ReLU}(z_1) \\ z_2 &= x_1 w_2 + b_2^T \\ y = x_2 &= \sigma(z_2), \end{aligned}$$
where the bias vectors are transposed since $x$ and $z$ are row vectors. Here we denote the sigmoid function as $\sigma$.
Let's initialize some random nonzero weights to demonstrate as we go along:
w1 = np.random.normal(0,0.1, size=(n_input, n_hidden))
w2 = np.random.normal(0,0.1, size=(n_hidden, n_output))
b1 = np.random.normal(0,0.1, size=(n_hidden, 1))
b2 = np.random.normal(0,0.1, size=(n_output, 1))
The forward pass is simply the calculation performed above:
def forward1(x0, w1, b1, w2, b2):
x1 = relu(np.dot(x0, w1) + b1.T) # output of hidden layer
return sig(np.dot(x1, w2) + b2.T) # output of output layer
We can check the prediction on the first sample:
y = forward1(X[0], w1, b1, w2, b2)
y
array([[0.49080367]])
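Because every operation in forward1 broadcasts over rows, the same function can also score a batch of samples at once (train1 below relies on this); a quick illustrative check:
# forward1 broadcasts over rows, so a batch of samples works too (illustrative)
y_batch = forward1(X[:5], w1, b1, w2, b2)
print(y_batch.shape)   # expect (5, 1): one prediction per sample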
For the backward pass, we follow the approach detailed in this video and define a quantity that will become useful later on: $\delta_\ell$, the partial derivative of the cost function with respect to $z_\ell$:
$$\delta_\ell \equiv \frac{\partial J}{\partial z_\ell}$$
We can begin the explicit calculation with the last delta:
$$\delta_2 = \frac{\partial J}{\partial z_2} = \frac{\partial J}{\partial y}\frac{\partial y}{\partial z_2} = \frac{\partial J}{\partial y}\frac{\partial \sigma}{\partial z_2}$$
Moving on to $\delta_1$, we write out the vector and matrix indices explicitly for initial clarity:
$$\delta_{1i} = \frac{\partial J}{\partial z_{1i}} = \sum_j \frac{\partial J}{\partial z_{2j}}\frac{\partial z_{2j}}{\partial z_{1i}} = \sum_{j,k} \frac{\partial J}{\partial z_{2j}}\frac{\partial z_{2j}}{\partial x_{1k}}\frac{\partial x_{1k}}{\partial z_{1i}}$$
Now,
$$z_{2j} = \sum_m x_{1m}\, w_{2,mj} + b_{2j},$$
so
$$\frac{\partial z_{2j}}{\partial x_{1k}} = w_{2,kj}.$$
Furthermore, $x_{1k} = \mathrm{ReLU}(z_{1k})$, so $\frac{\partial x_{1k}}{\partial z_{1i}} = \mathrm{ReLU}'(z_{1k})\,\delta_{ki}$ (a Kronecker delta),
or, flipping back to matrix notation,
$$\delta_1 = w_2\,\delta_2 \odot \mathrm{ReLU}'(z_1)^T,$$
where $\odot$ is element-wise multiplication.
We see that $\delta_2$ appeared in the definition of $\delta_1$. It's easy to convince yourself that this generalizes to
$$\delta_\ell = w_{\ell+1}\,\delta_{\ell+1} \odot \mathrm{ReLU}'(z_\ell)^T,$$
and the form of the equation makes it easy to substitute other activation functions for ReLU as well.
Furthermore, the update rules for training, $w_\ell \to w_\ell - \alpha\,\partial J/\partial w_\ell$ and $b_\ell \to b_\ell - \alpha\,\partial J/\partial b_\ell$, can also be written in terms of $\delta$:
Since $z_\ell = x_{\ell-1} w_\ell + b_\ell^T$,
$$\frac{\partial J}{\partial w_\ell} = (\delta_\ell\, x_{\ell-1})^T \qquad \text{and} \qquad \frac{\partial J}{\partial b_\ell} = \delta_\ell.$$
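Before wiring these formulas into code, a quick numerical spot check can build confidence. The sketch below is not from the original notebook; it perturbs a single entry of $w_2$ and compares the finite-difference change in the cost with the corresponding entry of $(\delta_2\, x_1)^T$:
# Illustrative spot check of dJ/dw2 on one sample (not part of the original notebook)
x0 = X[0]; yh = yhat[0]
z1 = np.dot(x0, w1) + b1.T
x1 = relu(z1)
z2 = np.dot(x1, w2) + b2.T
y_pred = sig(z2)
delta2 = dJ_dy(y_pred, yh) * dsig_dz(z2)   # shape (1, 1)
analytic = np.multiply(delta2, x1).T       # (delta_2 x_1)^T, shape (n_hidden, 1)
eps_fd = 1e-6                              # finite-difference step, chosen for illustration
w2_plus = w2.copy();  w2_plus[0, 0] += eps_fd
w2_minus = w2.copy(); w2_minus[0, 0] -= eps_fd
fd = (J(sig(np.dot(x1, w2_plus) + b2.T), yh)
      - J(sig(np.dot(x1, w2_minus) + b2.T), yh)) / (2 * eps_fd)
print(analytic[0, 0], fd[0, 0])            # the two numbers should agree closely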
Armed with these convenient definitions we can implement our backward pass to update our weights, and our training function, which looks much the same as in the previous notebook:
def backward1(x0, w1, b1, w2, b2, y, yhat, alpha):
# quantities
z1 = np.dot(x0, w1) + b1.T
x1 = relu(z1)
z2 = np.dot(x1, w2) + b2.T
#y = sig(z2)
delta2 = dJ_dy(y, yhat) * dsig_dz(z2)
delta1 = np.matmul(w2, delta2) * drelu_dz(z1).T
w2 -= alpha * np.multiply(delta2, x1).T
w1 -= alpha * np.multiply(delta1, x0).T
b2 -= alpha * delta2
b1 -= alpha * delta1
return w1, b1, w2, b2
alpha=0.1
y = forward1(X[0], w1, b1, w2, b2)
w1, b1, w2, b2 = backward1(X[0], w1, b1, w2, b2, y, yhat[0], alpha)
print(y)
print(J(y, yhat[0]))
[[0.49080367]] [[0.67492159]]
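We can also apply backward1 repeatedly to this single sample and watch the prediction drift toward its label, much as we will do for the general network later. This is purely illustrative (it keeps updating the weights initialized above, which train1 below re-initializes anyway):
# Illustrative: keep taking steps on the first sample only
for i in range(500):
    y = forward1(X[0], w1, b1, w2, b2)
    w1, b1, w2, b2 = backward1(X[0], w1, b1, w2, b2, y, yhat[0], alpha)
    if i % 100 == 0:
        print(y, '-->', yhat[0])   # y should move toward yhat[0] = 0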
def train1(X, yhat, n_hidden, alpha, n_epoch):
n_samples = X.shape[0]
n_input = X.shape[1]
n_output = 1
# keep track of performance during training
costs = np.zeros(shape=(n_epoch,1))
# random nonzero initialization
w1 = np.random.normal(0, 1, size=(n_input, n_hidden))
w2 = np.random.normal(0, 1, size=(n_hidden, n_output))
b1 = np.random.normal(0, 1, size=(n_hidden, 1))
b2 = np.random.normal(0, 1, size=(n_output, 1))
for epoch in range(n_epoch):
for i in range(n_samples):
x0 = X[i,:]; yh = yhat[i]
y = forward1(x0, w1, b1, w2, b2) # prediction for one sample
w1, b1, w2, b2 = backward1(x0, w1, b1, w2, b2, y, yh, alpha) # take step
# ### Some niceness to see our progress
# Calculate total cost after epoch
predictions = forward1(X, w1, b1, w2, b2) # predictions for entire set
costs[epoch] = np.mean(J(predictions, yhat)) # mean cost per sample
# report progress
if ((epoch % 10) == 0) or (epoch == (n_epoch - 1)):
#print(predictions.round())
accuracy = np.mean(predictions.round() == yhat) # current accuracy on entire set
print('Training accuracy after epoch {}: {:.4%}'.format(epoch, accuracy))
return w1, b1, w2, b2, costs
Let's give it a try!
n_epoch = 200
n_hidden = 2
alpha = 0.001
w1, b1, w2, b2, costs = train1(X, yhat, n_hidden, alpha, n_epoch)
Training accuracy after epoch 0: 50.0000%
Training accuracy after epoch 10: 50.0000%
Training accuracy after epoch 20: 50.0000%
Training accuracy after epoch 30: 72.0000%
Training accuracy after epoch 40: 75.0000%
Training accuracy after epoch 50: 78.5000%
Training accuracy after epoch 60: 83.0000%
Training accuracy after epoch 70: 90.4000%
Training accuracy after epoch 80: 96.5000%
Training accuracy after epoch 90: 98.6000%
Training accuracy after epoch 100: 98.8000%
Training accuracy after epoch 110: 99.2000%
Training accuracy after epoch 120: 99.2000%
Training accuracy after epoch 130: 99.0000%
Training accuracy after epoch 140: 99.0000%
Training accuracy after epoch 150: 99.0000%
Training accuracy after epoch 160: 99.0000%
Training accuracy after epoch 170: 99.0000%
Training accuracy after epoch 180: 98.9000%
Training accuracy after epoch 190: 98.9000%
Training accuracy after epoch 199: 98.9000%
plt.plot(costs); plt.title('Mean cost per sample after each epoch');
x1 = np.linspace(-8,8, 250)
x2 = np.linspace(-10,10, 250)
fun_map = np.empty((x1.size, x2.size))
for n,i in enumerate(x1):
for m,j in enumerate(x2):
fun_map[m,n] = forward1([i,-j], w1, b1, w2, b2)
X0 = X[(yhat==0).reshape(-1)]
X1 = X[(yhat==1).reshape(-1)]
plt.figure(figsize=(10,5))
plt.imshow(fun_map, extent=[x1.min(), x1.max(), x2.min(), x2.max()],
vmin=0, vmax=1, aspect='auto')
plt.colorbar()
plt.contour(x1, -x2, fun_map, levels=[0.5], colors=['r'], linewidths=2)
plt.scatter(*X0.T, label='0', alpha=0.4); plt.scatter(*X1.T, label='1', alpha=0.4)
plt.legend();
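As an aside, the nested loop above is easy to read but slow. Since forward1 already handles a whole batch of rows, the same map can be computed in a single vectorized call; a sketch of that alternative (the variable names here are new):
# Vectorized alternative to the nested loop above (illustrative)
g1, g2 = np.meshgrid(x1, x2)                       # grid over the two input features
grid = np.column_stack([g1.ravel(), -g2.ravel()])  # same -j flip as in the loop
fun_map_vec = forward1(grid, w1, b1, w2, b2).reshape(g2.shape)
print(np.allclose(fun_map_vec, fun_map))           # should print True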
Looks pretty good! We added the decision boundary in red so that we can more easily visualize how it changes as we make changes to our network.
Our functions are starting to take a lot of parameters, which is often a sign that you should be generalizing. Since all our definitions are recursive anyway, we can actually handle any number of layers with ease.
Instead of specifying n_hidden, we will move to specifying the number of units in each layer (including the input and output layers), using a tuple which we'll call shape.
We will store quantities that are matrix-valued, but inconsistent in dimension across different layers, in a dictionary with the layer number ℓ as the key.
def init_model(shape):
w = {}
b = {}
for layer in range(1, len(shape)): # no weights for input layer
w[layer] = np.random.normal(0,0.1, size=(shape[layer-1], shape[layer])) # dim: from_units, to_units
b[layer] = np.random.normal(0,0.1, size=(shape[layer], 1)) # dim: to_units, 1
return w,b
def forwardn(x0, w, b):
n_layers = len(w)
x_prev = x0
for l in range(1, n_layers):
x_l = relu(np.dot(x_prev, w[l]) + b[l].T) # output of a hidden layer
x_prev = x_l
return sig(np.dot(x_prev, w[n_layers]) + b[n_layers].T) # output of output layer
Let's test these out with a bunch of layers!
t = (2,3,6,8,1,)
w,b = init_model(t)
print(w)
{1: array([[ 0.06181095, -0.22796272, -0.13053987], [-0.10271135, -0.05801015, -0.07595937]]), 2: array([[ 0.08546893, 0.04472243, 0.10677069, -0.27842714, -0.06287323, 0.01765893], [-0.033705 , 0.06336374, 0.23601233, 0.07679502, 0.16572488, 0.07469416], [-0.02062047, -0.04991025, 0.09991342, -0.03286914, 0.05510907, -0.07973095]]), 3: array([[-0.17834555, -0.01982056, 0.13754984, -0.03927334, -0.02762466, 0.19623374, 0.04666107, -0.17265983], [-0.11245948, 0.0559374 , 0.04578567, 0.13311479, 0.11499437, 0.16349449, -0.18871903, 0.00515981], [ 0.09665122, 0.11057042, 0.03347225, -0.0316614 , -0.06024299, -0.0876207 , 0.00674745, 0.0033442 ], [-0.03706872, -0.04867028, -0.00373517, -0.05504343, 0.12834597, 0.04784655, 0.07212743, -0.10916401], [-0.04596272, 0.09596565, -0.02752977, -0.04241039, 0.05234288, 0.12034268, -0.10948764, -0.00940387], [-0.07367622, 0.00860961, -0.00632613, 0.08375661, -0.04271112, -0.12024093, 0.12569929, -0.01522808]]), 4: array([[-0.03419322], [-0.0363885 ], [ 0.16400532], [-0.03601286], [-0.0935391 ], [ 0.04563441], [ 0.00649827], [-0.012408 ]])}
y = forwardn(X[:10], w, b)
y
array([[0.48567418], [0.48581211], [0.48548631], [0.48544916], [0.48578261], [0.48582915], [0.48539676], [0.48553286], [0.48581705], [0.48546157]])
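Since the weights and biases are stored per layer, it is also easy to see how many trainable parameters a given shape implies. count_params below is not a function from the notebook, just an illustrative helper:
# Hypothetical helper: count the trainable parameters implied by a shape tuple
def count_params(shape):
    total = 0
    for layer in range(1, len(shape)):
        total += shape[layer-1] * shape[layer]   # weight matrix into this layer
        total += shape[layer]                    # one bias per unit
    return total

print(count_params(t))   # for t = (2, 3, 6, 8, 1) this is 98
print(sum(wl.size for wl in w.values()) + sum(bl.size for bl in b.values()))   # same count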
def backwardn(x0, w, b, y, yhat, alpha):
n_layers = len(w)
z = {}
x = {0:x0}
# x and z values for calculating derivatives
for l in range(1, n_layers+1):
z[l] = np.dot(x[l-1], w[l]) + b[l].T
x[l] = relu(z[l])
delta = {}
# deltas and updates
for l in range(n_layers, 0, -1): # start with last layer and move backward
if l == n_layers: # base case
delta[l] = dJ_dy(y, yhat)*dsig_dz(z[n_layers])
else: # recursive case
delta[l] = np.dot(w[l+1], delta[l+1]) * drelu_dz(z[l]).T
# update weights and biases
w[l] -= alpha * np.multiply(delta[l], x[l-1]).T
b[l] -= alpha * delta[l]
return w, b
Does this work for a simple case? If we update a bunch of times using one sample, we should see the prediction move towards $\hat{y}$:
for i in range(1000):
w,b = backwardn(X[0], w, b, y[0], 0, 0.1)
y[0] = forwardn(X[0], w, b)
if i%100 == 0:
print(y[0],'-->',yhat[0])
[0.47231947] --> [0.]
[0.04555354] --> [0.]
[0.0126895] --> [0.]
[0.00604703] --> [0.]
[0.00364882] --> [0.]
[0.00249834] --> [0.]
[0.00184846] --> [0.]
[0.00144065] --> [0.]
[0.00116531] --> [0.]
[0.00096917] --> [0.]
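One more reassuring check (a sketch, not in the original): for a network with a single hidden layer, backwardn should take exactly the same step as backward1 when both start from identical weights:
# Cross-check: backwardn with one hidden layer should agree with backward1
w_chk, b_chk = init_model((2, 4, 1))
w1c, w2c = w_chk[1].copy(), w_chk[2].copy()
b1c, b2c = b_chk[1].copy(), b_chk[2].copy()

y_chk = forwardn(X[0], w_chk, b_chk)
w_chk, b_chk = backwardn(X[0], w_chk, b_chk, y_chk, yhat[0], 0.1)
w1c, b1c, w2c, b2c = backward1(X[0], w1c, b1c, w2c, b2c, y_chk, yhat[0], 0.1)
print(np.allclose(w_chk[1], w1c), np.allclose(w_chk[2], w2c))   # should print True True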
Awesome! Feeling confident, we also implement the training function to do this with all the samples:
def trainn(X, yhat, shape, alpha, n_epoch):
n_samples = X.shape[0]
n_input = X.shape[1]
# keep track of performance during training
costs = np.zeros(shape=(n_epoch,1))
# random nonzero initialization
w,b = init_model(shape)
for epoch in range(n_epoch):
for i in range(n_samples):
x0 = X[i,:]; yh = yhat[i]
y = forwardn(x0, w, b) # prediction for one sample
w, b = backwardn(x0, w, b, y, yh, alpha) # take step
# ### Some niceness to see our progress
# Calculate total cost after epoch
predictions = forwardn(X, w, b) # predictions for entire set
costs[epoch] = np.mean(J(predictions, yhat)) # mean cost per sample
# report progress
if ((epoch % 10) == 0) or (epoch == (n_epoch - 1)):
accuracy = np.mean(predictions.round() == yhat) # current accuracy on entire set
print('Training accuracy after epoch {}: {:.4%}'.format(epoch, accuracy))
return w, b, costs
n_epoch = 200
shape = (2,5,3,1)
alpha = 0.001
w, b, costs = trainn(X, yhat, shape, alpha, n_epoch)
Training accuracy after epoch 0: 50.0000%
Training accuracy after epoch 10: 50.0000%
Training accuracy after epoch 20: 50.0000%
Training accuracy after epoch 30: 50.0000%
Training accuracy after epoch 40: 50.0000%
Training accuracy after epoch 50: 50.0000%
Training accuracy after epoch 60: 50.0000%
Training accuracy after epoch 70: 50.0000%
Training accuracy after epoch 80: 50.0000%
Training accuracy after epoch 90: 50.0000%
Training accuracy after epoch 100: 98.4000%
Training accuracy after epoch 110: 98.9000%
Training accuracy after epoch 120: 99.4000%
Training accuracy after epoch 130: 99.5000%
Training accuracy after epoch 140: 99.6000%
Training accuracy after epoch 150: 99.6000%
Training accuracy after epoch 160: 99.6000%
Training accuracy after epoch 170: 99.7000%
Training accuracy after epoch 180: 99.7000%
Training accuracy after epoch 190: 99.7000%
Training accuracy after epoch 199: 99.7000%
plt.plot(costs); plt.title('Mean cost per sample after each epoch');
x1 = np.linspace(-8,8, 250)
x2 = np.linspace(-10,10, 250)
fun_map = np.empty((x1.size, x2.size))
for n,i in enumerate(x1):
for m,j in enumerate(x2):
fun_map[m,n] = forwardn([i,-j], w, b)
X0 = X[(yhat==0).reshape(-1)]
X1 = X[(yhat==1).reshape(-1)]
plt.figure(figsize=(10,5))
plt.imshow(fun_map, extent=[x1.min(), x1.max(), x2.min(), x2.max()],
vmin=0, vmax=1, aspect='auto')
plt.colorbar()
plt.scatter(*X0.T, label='0', alpha=0.4); plt.scatter(*X1.T, label='1', alpha=0.4)
plt.contour(x1, -x2, fun_map, levels=[0.5], colors=['r'], linewidths=2)
plt.legend();
Even more flexibility than before!
Of course unlimited flexibility has its own pitfalls, so in the next notebook we'll look into some ways of getting that under control.