In this chapter, we will look at what activation functions are, the constraints that make a function useful as one, the standard hidden-layer activations (sigmoid, tanh, ReLU) and the softmax output activation, how to use their slopes during backpropagation, and how to upgrade the MNIST network to use tanh and softmax.
[George Gordon Byron] "I Know that 2 and 2 make 4 –– & should be glad to prove it too if I could –– though I must say if by any sort of process I could convert 2 & 2 into 5 it would give me much greater pleasure"
An activation function is a function applied to the neurons in a layer during prediction. In that sense, an activation function is any function that can take one number and return another number. There are, however, an infinite number of functions in the universe, and not all of them are useful as activation functions.
We've already used an activation function called ReLU. The ReLU function had the effect of turning all negative numbers to zero.
There are several constraints on what makes a good activation function; we present them next.
First, the function must have an output number for any input; in other words, it must be defined across the whole input domain.
Second, an activation function should never change direction; it should be always increasing or always decreasing (monotonic). This constraint isn't technically a requirement, but consider a function that maps different input values to the same output: that function may have multiple perfect configurations, and as a result we can't know the correct direction to move the weights.
For an advanced look into this subject, read more about convex versus non-convex optimization.
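As a toy illustration of the many-to-one problem (a made-up example, not part of any network here): with a non-monotonic "activation" such as squaring, two different inputs produce exactly the same output, so the error alone can't tell us which direction to move.
import numpy as np

def square_activation(x):
    # non-monotonic: decreasing for x < 0, increasing for x > 0
    return x ** 2

inputs = np.array([-2.0, 2.0])
print(square_activation(inputs))   # [4. 4.] -- two different inputs, one output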
Third, linear functions won't do: they merely scale values and don't affect how correlated a neuron is to its various inputs; they just make the collective correlation that's represented louder or softer.
What we want instead is selective correlation.
Finally, we will be using the chosen activation function (and its derivative) a lot, so we want both to be efficiently computable. As an example, ReLU has become very popular largely because it's cheap to compute.
Sigmoid is great because it smoothly squishes the infinite range of inputs into an output between 0 and 1. This lets us interpret the output of any neuron as a probability. We typically use this nonlinearity both in hidden and output layers.
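A minimal sketch of sigmoid in NumPy (sigmoid isn't actually used in the networks later in this chapter, so this is purely illustrative):
import numpy as np

def sigmoid(x):
    # squashes any real number into the open interval (0, 1)
    return 1 / (1 + np.exp(-x))

print(sigmoid(np.array([-10.0, -1.0, 0.0, 1.0, 10.0])))
# approximately [0.0000454, 0.2689, 0.5, 0.7311, 0.9999546]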
Tanh is the same as sigmoid except that its output lies between -1 and 1. This means it can also express negative correlation, which is powerful for hidden layers: on many problems, tanh will outperform sigmoid in the hidden layers.
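A small, made-up comparison of the two (np.tanh is exactly what the network below uses): sigmoid maps negative inputs to small positive outputs, while tanh maps them to negative outputs, which is what lets a hidden neuron express negative correlation.
import numpy as np

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(1 / (1 + np.exp(-x)))   # sigmoid: ~[0.12, 0.38, 0.50, 0.62, 0.88] -- all positive
print(np.tanh(x))             # tanh:    ~[-0.96, -0.46, 0.00, 0.46, 0.96] -- negatives allowed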
For the output layer, choosing the best activation function depends on what we're trying to predict. Broadly speaking, there are three major types of output layers:
The first type predicts raw data values, such as the average temperature in Colorado given the average temperatures in the surrounding states.
The second type predicts independent yes/no probabilities. Here it's best to use the sigmoid function, because it models an individual probability separately for each output node.
The third type, by far the most common use case in neural networks, predicts a single label out of many. In this case, it's better to have an activation function that models the idea that the more likely it is to be one label, the less likely it is to be any of the other labels.
As we can see in the figure above, the average 2 shares quite a bit with the average 3. As a general rule, similar inputs create similar outputs: when we take some numbers and multiply them by a matrix, if the starting numbers are pretty similar, the ending numbers will be pretty similar.
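A tiny illustration of that claim (the vectors and the matrix below are made up): two similar inputs pushed through the same weight matrix land close together.
import numpy as np

np.random.seed(1)
W = np.random.random((3, 4))      # an arbitrary weight matrix

a = np.array([1.0, 0.9, 0.1])     # two fairly similar inputs
b = np.array([0.9, 1.0, 0.2])

print(a.dot(W))                    # the two output vectors
print(b.dot(W))                    # end up close to each other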
As a result, sigmoid will penalize the network for recognizing a 2 by anything other than features that are exclusively related to 2s.
Most digit images share lots of pixels in the middle of the image, so the network starts trying to focus on the edges instead. As we can see in the weight image, the light areas are probably the best individual indicators of a 2.
What we are really striving for, though, is a network that "sees" the entire shape of a digit before outputting a prediction.
Softmax raises e to the power of each input value and then divides each result by the layer's sum. The nice thing about softmax is that the higher the network predicts one value, the lower it predicts all the others; it encourages the network to pick one class with very high probability.
To adjust how aggressively softmax pushes the probability mass toward one class, use a base slightly larger or smaller than e: lower bases attenuate less, and higher bases attenuate more. For now, we stick with e.
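Here is a minimal sketch of softmax in NumPy with the base exposed as a parameter, so the attenuation effect is easy to play with (the base argument is our own addition for illustration; the networks below use plain np.exp):
import numpy as np

def softmax_with_base(x, base=np.e):
    # raise `base` to each input value, then normalize by the row sum
    temp = np.power(base, x)
    return temp / np.sum(temp, axis=1, keepdims=True)

scores = np.array([[1.0, 2.0, 3.0]])
print(softmax_with_base(scores))             # base e: the standard softmax
print(softmax_with_base(scores, base=10.0))  # larger base: a much sharper distribution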
We know the following:
The slope of ReLU for positive numbers is exactly 1, and the slope of ReLU for negative numbers is exactly 0. As a reminder, the slope is a measure of how much the output of ReLU will change given a change in its input, and the input to a layer here refers to the value before the nonlinearity. Modifying the input to ReLU (by a tiny amount) therefore has a 1:1 effect on the output if the neuron was predicting positively and a 0:1 effect if it was predicting negatively.
Thus, when we backpropagate, in order to generate layer_1_delta, we multiply the delta backpropagated from layer_2 (layer_2_delta.dot(W_2.T)) by the slope of ReLU at the point predicted during forward propagation. For some deltas the slope is 1 (positive numbers), and for others it's 0 (negative numbers).
The important thing to remember is that the slope is an indicator of how much a tiny change to the input affects the output. The update therefore encourages the network to leave weights alone if adjusting them would have little to no effect.
To compute layer_1_delta, we multiply the backpropagated delta by the layer's slope:
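A minimal, self-contained sketch of that line with ReLU as the hidden activation (the shapes and the relu2deriv helper are assumptions for illustration; it mirrors the grad_ReLU function defined at the end of the chapter):
import numpy as np

def relu2deriv(output):
    return output > 0                                                # slope of ReLU: 1 where active, 0 where not

batch_size, hidden_size, num_labels = 100, 100, 10
layer_1 = np.maximum(0, np.random.randn(batch_size, hidden_size))    # hidden activations
layer_2_delta = np.random.randn(batch_size, num_labels) * 0.01       # delta arriving from the output layer
W_2 = np.random.randn(hidden_size, num_labels) * 0.1                 # hidden-to-output weights

# gate the backpropagated delta by ReLU's slope at each hidden node
layer_1_delta = layer_2_delta.dot(W_2.T) * relu2deriv(layer_1)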
layer_1_delta[0] represents how much higher or lower the first hidden node of layer 1 should be in order to reduce the network's error. The end goal of a neuron's delta is to inform the weights whether they should move; if moving them would have no effect, they should be left alone.
This condition is obvious for ReLU, which is either on or off, but sigmoid is subtler: its sensitivity to changes in the input slowly increases as the input approaches 0 from either direction, while very positive and very negative inputs have a slope very near 0. Thus, as the input becomes very positive or very negative, small changes to the incoming weights become less relevant to the neuron's error on that specific training example.
For all of the previous activation functions, we can directly convert their output to their slope:
Most good activation functions have a means by which the output of the layer (computed during forward propagation) can be used to compute the derivative. This has become the standard practice for computing derivatives in neural networks, and it's quite handy.
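A quick sketch of those output-to-slope conversions for the activations discussed above (the helper names are ours; the chapter's code defines tanh2deriv and grad_ReLU in exactly this spirit):
import numpy as np

# each slope is computed from the activation's *output*, not from its input
def sigmoid2deriv(output):
    return output * (1 - output)        # near 0 for outputs close to 0 or 1

def tanh2deriv(output):
    return 1 - output ** 2              # largest at 0, shrinking toward the extremes

def relu2deriv(output):
    return (output > 0).astype(float)   # 1 where the unit was active, 0 elsewhere
Softmax is the exception: its slope is usually folded into the output delta together with the error, which is why both networks below compute layer_2_delta directly as the difference between the prediction and the target.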
Let's upgrade the MNIST network to reflect what we've learned. Theoretically, the tanh function should make for a better hidden-layer activation, and the softmax function should make for a better output-layer activation, but things aren't always as simple as they seem. For tanh, we had to reduce the standard deviation of the incoming weights (we adjust the initial weight values to be between -0.01 and +0.01). We also tune the learning rate.
import numpy as np
import sys
from tensorflow.keras.datasets import mnist
(X_train, y_train), (X_test, y_test) = mnist.load_data()
X_train.shape, y_train.shape, X_test.shape, y_test.shape
((60000, 28, 28), (60000,), (10000, 28, 28), (10000,))
images, labels = (X_train[:1000].reshape(1000, 28*28))/255, y_train[:1000]
images.shape, labels.shape
((1000, 784), (1000,))
one_hot_labels = np.zeros((labels.shape[0], 10))
one_hot_labels.shape
(1000, 10)
for i, l in enumerate(labels):
    one_hot_labels[i][l] = 1
labels = one_hot_labels
labels.shape
(1000, 10)
test_images = X_test.reshape(X_test.shape[0], 28*28)/255
test_images.shape
(10000, 784)
one_hot_test_labels = np.zeros((y_test.shape[0], 10))
one_hot_test_labels.shape
(10000, 10)
for i, l in enumerate(y_test):
    one_hot_test_labels[i][l] = 1
test_labels = one_hot_test_labels
test_labels.shape
(10000, 10)
# activation functions.
def tanh(x):
    return np.tanh(x)

def tanh2deriv(output):
    return 1 - (output ** 2)

def softmax(x):
    temp = np.exp(x)
    return temp / np.sum(temp, axis=1, keepdims=True)
alpha, iterations, hidden_size = (2, 300, 100)
pixels_per_image, num_labels = (784, 10)
batch_size = 100
W_1 = (0.02 * np.random.random((pixels_per_image,hidden_size))) - 0.01
W_2 = (0.2 * np.random.random((hidden_size,num_labels))) - 0.1
W_1.shape, W_2.shape
((784, 100), (100, 10))
# Training Loop
for j in range(iterations):  # epochs
    correct_cnt = 0
    for i in range(int(len(images) / batch_size)):  # batches
        batch_start, batch_end = (i * batch_size), ((i+1) * batch_size)
        # Forward Propagation
        layer_0 = images[batch_start:batch_end]
        layer_1 = tanh(np.dot(layer_0, W_1))
        dropout_mask = np.random.randint(2, size=layer_1.shape)
        layer_1 *= dropout_mask * 2
        layer_2 = softmax(np.dot(layer_1, W_2))
        # benchmarking
        for k in range(batch_size):
            correct_cnt += int(np.argmax(layer_2[k:k+1]) == np.argmax(labels[batch_start+k:batch_start+k+1]))
        # backpropagation
        layer_2_delta = (labels[batch_start:batch_end] - layer_2) / (batch_size * layer_2.shape[0])
        layer_1_delta = layer_2_delta.dot(W_2.T) * tanh2deriv(layer_1)
        layer_1_delta *= dropout_mask
        # optimization
        W_2 += alpha * layer_1.T.dot(layer_2_delta)
        W_1 += alpha * layer_0.T.dot(layer_1_delta)
    test_correct_cnt = 0
    for i in range(len(test_images)):  # test images
        # predict
        layer_0 = test_images[i:i+1]
        layer_1 = tanh(np.dot(layer_0, W_1))
        layer_2 = np.dot(layer_1, W_2)
        # benchmark
        test_correct_cnt += int(np.argmax(layer_2) == np.argmax(test_labels[i:i+1]))
    if (j % 10 == 0):
        print(f"I: {j} | Test-Acc: {round(test_correct_cnt / float(len(test_images)), 5)} | Train-Acc: {round(correct_cnt / float(len(images)), 5)}")
I: 0 | Test-Acc: 0.4141 | Train-Acc: 0.159
I: 10 | Test-Acc: 0.6838 | Train-Acc: 0.691
I: 20 | Test-Acc: 0.7108 | Train-Acc: 0.725
I: 30 | Test-Acc: 0.7377 | Train-Acc: 0.763
I: 40 | Test-Acc: 0.7698 | Train-Acc: 0.807
I: 50 | Test-Acc: 0.794 | Train-Acc: 0.831
I: 60 | Test-Acc: 0.8109 | Train-Acc: 0.85
I: 70 | Test-Acc: 0.8193 | Train-Acc: 0.864
I: 80 | Test-Acc: 0.8273 | Train-Acc: 0.869
I: 90 | Test-Acc: 0.833 | Train-Acc: 0.877
I: 100 | Test-Acc: 0.8399 | Train-Acc: 0.892
I: 110 | Test-Acc: 0.8419 | Train-Acc: 0.889
I: 120 | Test-Acc: 0.8452 | Train-Acc: 0.902
I: 130 | Test-Acc: 0.846 | Train-Acc: 0.906
I: 140 | Test-Acc: 0.8498 | Train-Acc: 0.908
I: 150 | Test-Acc: 0.852 | Train-Acc: 0.906
I: 160 | Test-Acc: 0.8543 | Train-Acc: 0.918
I: 170 | Test-Acc: 0.8561 | Train-Acc: 0.926
I: 180 | Test-Acc: 0.8573 | Train-Acc: 0.921
I: 190 | Test-Acc: 0.859 | Train-Acc: 0.931
I: 200 | Test-Acc: 0.8602 | Train-Acc: 0.933
I: 210 | Test-Acc: 0.8607 | Train-Acc: 0.928
I: 220 | Test-Acc: 0.8628 | Train-Acc: 0.93
I: 230 | Test-Acc: 0.8637 | Train-Acc: 0.938
I: 240 | Test-Acc: 0.8651 | Train-Acc: 0.941
I: 250 | Test-Acc: 0.8656 | Train-Acc: 0.945
I: 260 | Test-Acc: 0.8672 | Train-Acc: 0.94
I: 270 | Test-Acc: 0.8678 | Train-Acc: 0.939
I: 280 | Test-Acc: 0.8687 | Train-Acc: 0.943
I: 290 | Test-Acc: 0.8692 | Train-Acc: 0.951
In this section, we're going to make sure we understand batch stochastic gradient descent and the new activation functions by implementing dropout with ReLU:
from tensorflow.keras import datasets
# Load Data.
(X_train, y_train), (X_test, y_test) = datasets.mnist.load_data()
X_train.shape, y_train.shape, X_test.shape, y_test.shape
((60000, 28, 28), (60000,), (10000, 28, 28), (10000,))
# light data pre-processing
x_train, y_train = (X_train[:1000].reshape((1000, 28*28))/255.), (y_train[:1000])
# one-hot encoding `y_train`
labels_train = np.zeros((y_train.shape[0], 10))
for i, v in enumerate(y_train):
    labels_train[i][v] = 1
labels_train[:3]
array([[0., 0., 0., 0., 0., 1., 0., 0., 0., 0.], [1., 0., 0., 0., 0., 0., 0., 0., 0., 0.], [0., 0., 0., 0., 1., 0., 0., 0., 0., 0.]])
# same for the test data
x_test = X_test.reshape((-1, 28*28))/255.
labels_test = np.zeros((y_test.shape[0], 10))
for i, v in enumerate(y_test):
    labels_test[i][v] = 1
labels_test[:3]
array([[0., 0., 0., 0., 0., 0., 0., 1., 0., 0.], [0., 0., 1., 0., 0., 0., 0., 0., 0., 0.], [0., 1., 0., 0., 0., 0., 0., 0., 0., 0.]])
x_train.shape, labels_train.shape, x_test.shape, labels_test.shape
((1000, 784), (1000, 10), (10000, 784), (10000, 10))
# activation functions.
def ReLU(x):
    return (x > 0) * x

def grad_ReLU(x):
    return (x > 0).astype('int')

def tanh(x):
    return np.tanh(x)

def tanh2deriv(x):
    return 1 - (x ** 2)

def softmax(x):
    temp = np.exp(x)
    return temp / np.sum(temp, axis=1, keepdims=True)
# configuration parameters
lr, epochs, hidden_size = 2, 100, 100
pixels_count, labels_count = 784, 10
batch_size = 100
# Random Weights Initialization
W_0 = (0.02 * np.random.random((784,100))) - 0.01
W_1 = (0.02 * np.random.random((100,10))) - 0.01
W_0.shape, W_1.shape
((784, 100), (100, 10))
for epoch in range(epochs):
    # each epoch passes over all the training data, so we track accuracy per epoch
    correct_count = []
    for batch_i in range(int(x_train.shape[0] / batch_size)):
        # get batch
        batch_start, batch_end = (batch_i * batch_size), ((batch_i+1) * batch_size)
        X = x_train[batch_start:batch_end]
        y = labels_train[batch_start:batch_end]
        # forward propagation
        layer_0 = X
        layer_1 = ReLU(np.matmul(layer_0, W_0))
        dropout_mask = np.random.randint(2, size=layer_1.shape)
        layer_1 *= dropout_mask * 2
        layer_2 = softmax(np.matmul(layer_1, W_1))
        # evaluation: loop over the images in the batch
        for k in range(batch_size):
            x_i, y_i, y_i_hat = X[k:k+1], y[k:k+1], layer_2[k:k+1]
            if np.argmax(y_i_hat.squeeze()) == np.argmax(y_i.squeeze()):
                correct_count.append(1)
            else:
                correct_count.append(0)
        # backpropagation
        layer_2_delta = (layer_2 - y) / (batch_size * layer_2.shape[1])
        layer_1_delta = layer_2_delta.dot(W_1.T) * grad_ReLU(layer_1)
        layer_1_delta *= dropout_mask
        # optimization
        W_1 -= lr * (layer_1.T.dot(layer_2_delta))
        W_0 -= lr * (layer_0.T.dot(layer_1_delta))
    test_correct_count = list()
    # evaluate over the test dataset
    for i in range(x_test.shape[0]):
        # get data
        x_i, y_i = x_test[i:i+1], labels_test[i:i+1]
        # forward propagation
        layer_0 = x_i
        layer_1 = ReLU(layer_0.dot(W_0))
        layer_2 = softmax(layer_1.dot(W_1))
        if np.argmax(layer_2.squeeze()) == np.argmax(y_i.squeeze()):
            test_correct_count.append(1)
        else:
            test_correct_count.append(0)
    if (epoch % 10 == 0):
        print("\n" + "Epoch:" + str(epoch) +
              " Test-Acc:" + str(np.sum(np.array(test_correct_count)) / np.array(test_correct_count).shape[0]) +
              " Train-Acc:" + str(np.sum(np.array(correct_count)) / np.array(correct_count).shape[0]))
Epoch:0 Test-Acc:0.3397 Train-Acc:0.173
Epoch:10 Test-Acc:0.7704 Train-Acc:0.734
Epoch:20 Test-Acc:0.8415 Train-Acc:0.872
Epoch:30 Test-Acc:0.8543 Train-Acc:0.894
Epoch:40 Test-Acc:0.867 Train-Acc:0.922
Epoch:50 Test-Acc:0.8697 Train-Acc:0.946
Epoch:60 Test-Acc:0.8712 Train-Acc:0.943
Epoch:70 Test-Acc:0.8712 Train-Acc:0.965
Epoch:80 Test-Acc:0.875 Train-Acc:0.967
Epoch:90 Test-Acc:0.8748 Train-Acc:0.963