In this chapter, we will look at what activation functions are, the constraints that make a function useful as one, the standard hidden-layer activations (sigmoid, tanh, ReLU) and the softmax output activation, how to use their slopes during backpropagation, and how to upgrade the MNIST network to use tanh and softmax.
[George Gordon Byron] "I Know that 2 and 2 make 4 –– & should be glad to prove it too if I could –– though I must say if by any sort of process I could convert 2 & 2 into 5 it would give me much greater pleasure"
An activation function is a function applied to the neurons in a layer during prediction. In that sense, an activation function is any function that can take one number and return another number. There are, however, an infinite number of functions in the universe, and not all of them are useful as activation functions.
We've already used an activation function called ReLU. The ReLU function had the effect of turning all negative numbers to zero.
There are several constraints on what makes a good activation function; we present them next.
First, the function must have an output number for any input; in other words, it must be defined across the whole input domain.
Second, an activation function should never change direction; it should be always increasing or always decreasing (monotonic). This constraint isn't technically a requirement, but consider a function that maps different input values to the same output: that function may have multiple perfect configurations, and as a result we can't know the correct direction to move the weights.
For an advanced look into this subject, read more about convex versus non-convex optimization.
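As a toy illustration of the many-to-one problem (a made-up example, not part of any network here): with a non-monotonic "activation" such as squaring, two different inputs produce exactly the same output, so the error alone can't tell us which direction to move.
import numpy as np

def square_activation(x):
    # non-monotonic: decreasing for x < 0, increasing for x > 0
    return x ** 2

inputs = np.array([-2.0, 2.0])
print(square_activation(inputs))   # [4. 4.] -- two different inputs, one output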
Third, linear functions won't do: they merely scale values and don't affect how correlated a neuron is to its various inputs; they just make the collective correlation that's represented louder or softer.
What we want instead is selective correlation.
Finally, we will be using the chosen activation function (and its derivative) a lot, so we want both to be efficiently computable. As an example, ReLU has become very popular largely because it's cheap to compute.
Sigmoid is great because it smoothly squishes the infinite range of inputs into an output between 0 and 1. This lets us interpret the output of any neuron as a probability. We typically use this nonlinearity both in hidden and output layers.
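A minimal sketch of sigmoid in NumPy (sigmoid isn't actually used in the networks later in this chapter, so this is purely illustrative):
import numpy as np

def sigmoid(x):
    # squashes any real number into the open interval (0, 1)
    return 1 / (1 + np.exp(-x))

print(sigmoid(np.array([-10.0, -1.0, 0.0, 1.0, 10.0])))
# approximately [0.0000454, 0.2689, 0.5, 0.7311, 0.9999546]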
Tanh is the same as sigmoid except that its output lies between -1 and 1. This means it can also express negative correlation, which is powerful for hidden layers: on many problems, tanh will outperform sigmoid in the hidden layers.
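A small, made-up comparison of the two (np.tanh is exactly what the network below uses): sigmoid maps negative inputs to small positive outputs, while tanh maps them to negative outputs, which is what lets a hidden neuron express negative correlation.
import numpy as np

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(1 / (1 + np.exp(-x)))   # sigmoid: ~[0.12, 0.38, 0.50, 0.62, 0.88] -- all positive
print(np.tanh(x))             # tanh:    ~[-0.96, -0.46, 0.00, 0.46, 0.96] -- negatives allowed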
For the output layer, choosing the best activation function depends on what we're trying to predict. Broadly speaking, there are three major types of output layers:
The first type predicts raw data values, such as the average temperature in Colorado given the average temperatures in the surrounding states.
The second type predicts independent yes/no probabilities. Here it's best to use the sigmoid function, because it models an individual probability separately for each output node.
The third type, by far the most common use case in neural networks, predicts a single label out of many. In this case, it's better to have an activation function that models the idea that the more likely it is to be one label, the less likely it is to be any of the other labels.
As we can see in the figure above, the average 2 shares quite a bit with the average 3. As a general rule, similar inputs create similar outputs: when we take some numbers and multiply them by a matrix, if the starting numbers are pretty similar, the ending numbers will be pretty similar.
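A tiny illustration of that claim (the vectors and the matrix below are made up): two similar inputs pushed through the same weight matrix land close together.
import numpy as np

np.random.seed(1)
W = np.random.random((3, 4))      # an arbitrary weight matrix

a = np.array([1.0, 0.9, 0.1])     # two fairly similar inputs
b = np.array([0.9, 1.0, 0.2])

print(a.dot(W))                    # the two output vectors
print(b.dot(W))                    # end up close to each other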
As a result, sigmoid will penalize the network for recognizing a 2 by anything other than features that are exclusively related to 2s.
Most digit images share lots of pixels in the middle of the image, so the network starts trying to focus on the edges instead. As we can see in the weight image, the light areas are probably the best individual indicators of a 2.
What we are really striving for, though, is a network that "sees" the entire shape of a digit before outputting a prediction.
Softmax raises e to the power of each input value and then divides each result by the layer's sum. The nice thing about softmax is that the higher the network predicts one value, the lower it predicts all the others; it encourages the network to pick one class with very high probability.
To adjust how aggressively softmax pushes the probability mass toward one class, use a base slightly larger or smaller than e: lower bases attenuate less, and higher bases attenuate more. For now, we stick with e.
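Here is a minimal sketch of softmax in NumPy with the base exposed as a parameter, so the attenuation effect is easy to play with (the base argument is our own addition for illustration; the networks below use plain np.exp):
import numpy as np

def softmax_with_base(x, base=np.e):
    # raise `base` to each input value, then normalize by the row sum
    temp = np.power(base, x)
    return temp / np.sum(temp, axis=1, keepdims=True)

scores = np.array([[1.0, 2.0, 3.0]])
print(softmax_with_base(scores))             # base e: the standard softmax
print(softmax_with_base(scores, base=10.0))  # larger base: a much sharper distribution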
We know the following:
The slope of ReLU for positive numbers is exactly 1, and the slope of ReLU for negative numbers is exactly 0. As a reminder, the slope is a measure of how much the output of ReLU will change given a change in its input, and the input to a layer here refers to the value before the nonlinearity. Modifying the input to ReLU (by a tiny amount) therefore has a 1:1 effect on the output if the neuron was predicting positively and a 0:1 effect if it was predicting negatively.
Thus, when we backpropagate, in order to generate layer_1_delta, we multiply the delta backpropagated from layer_2 (layer_2_delta.dot(W_2.T)) by the slope of ReLU at the point predicted during forward propagation. For some deltas the slope is 1 (positive numbers), and for others it's 0 (negative numbers).
The important thing to remember is that the slope is an indicator of how much a tiny change to the input affects the output. The update therefore encourages the network to leave weights alone if adjusting them would have little to no effect.
To compute layer_1_delta, we multiply the backpropagated delta by the layer's slope:
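A minimal, self-contained sketch of that line with ReLU as the hidden activation (the shapes and the relu2deriv helper are assumptions for illustration; it mirrors the grad_ReLU function defined at the end of the chapter):
import numpy as np

def relu2deriv(output):
    return output > 0                                                # slope of ReLU: 1 where active, 0 where not

batch_size, hidden_size, num_labels = 100, 100, 10
layer_1 = np.maximum(0, np.random.randn(batch_size, hidden_size))    # hidden activations
layer_2_delta = np.random.randn(batch_size, num_labels) * 0.01       # delta arriving from the output layer
W_2 = np.random.randn(hidden_size, num_labels) * 0.1                 # hidden-to-output weights

# gate the backpropagated delta by ReLU's slope at each hidden node
layer_1_delta = layer_2_delta.dot(W_2.T) * relu2deriv(layer_1)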
layer_1_delta[0] represents how much higher or lower the first hidden node of layer 1 should be in order to reduce the network's error. The end goal of a neuron's delta is to inform the weights whether they should move; if moving them would have no effect, they should be left alone.
This condition is obvious for ReLU, which is either on or off, but sigmoid is subtler: its sensitivity to changes in the input slowly increases as the input approaches 0 from either direction, while very positive and very negative inputs have a slope very near 0. Thus, as the input becomes very positive or very negative, small changes to the incoming weights become less relevant to the neuron's error on that specific training example.
For all of the previous activation functions, we can directly convert their output to their slope:
Most good activation functions have a means by which the output of the layer (computed during forward propagation) can be used to compute the derivative. This has become the standard practice for computing derivatives in neural networks, and it's quite handy.
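A quick sketch of those output-to-slope conversions for the activations discussed above (the helper names are ours; the chapter's code defines tanh2deriv and grad_ReLU in exactly this spirit):
import numpy as np

# each slope is computed from the activation's *output*, not from its input
def sigmoid2deriv(output):
    return output * (1 - output)        # near 0 for outputs close to 0 or 1

def tanh2deriv(output):
    return 1 - output ** 2              # largest at 0, shrinking toward the extremes

def relu2deriv(output):
    return (output > 0).astype(float)   # 1 where the unit was active, 0 elsewhere
Softmax is the exception: its slope is usually folded into the output delta together with the error, which is why both networks below compute layer_2_delta directly as the difference between the prediction and the target.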
Let's upgrade the MNIST network to reflect what we've learned. Theoretically, the tanh function should make for a better hidden-layer activation, and the softmax function should make for a better output-layer activation, but things aren't always as simple as they seem. For tanh, we had to reduce the standard deviation of the incoming weights (we adjust the initial weight values to be between -0.01 and +0.01). We also tune the learning rate.
import numpy as np
import sys
from tensorflow.keras.datasets import mnist
(X_train, y_train), (X_test, y_test) = mnist.load_data()
X_train.shape, y_train.shape, X_test.shape, y_test.shape
((60000, 28, 28), (60000,), (10000, 28, 28), (10000,))
images, labels = (X_train[:1000].reshape(1000, 28*28))/255, y_train[:1000]
images.shape, labels.shape
((1000, 784), (1000,))
one_hot_labels = np.zeros((labels.shape[0], 10))
one_hot_labels.shape
(1000, 10)
for i, l in enumerate(labels):
    one_hot_labels[i][l] = 1
labels = one_hot_labels
labels.shape
(1000, 10)
test_images = X_test.reshape(X_test.shape[0], 28*28)/255
test_images.shape
(10000, 784)
one_hot_test_labels = np.zeros((y_test.shape[0], 10))
one_hot_test_labels.shape
(10000, 10)
for i, l in enumerate(y_test):
    one_hot_test_labels[i][l] = 1
test_labels = one_hot_test_labels
test_labels.shape
(10000, 10)
# activation functions.
def tanh(x):
    return np.tanh(x)

def tanh2deriv(output):
    return 1 - (output ** 2)

def softmax(x):
    temp = np.exp(x)
    return temp / np.sum(temp, axis=1, keepdims=True)
alpha, iterations, hidden_size = (2, 300, 100)
pixels_per_image, num_labels = (784, 10)
batch_size = 100
W_1 = (0.02 * np.random.random((pixels_per_image,hidden_size))) - 0.01
W_2 = (0.2 * np.random.random((hidden_size,num_labels))) - 0.1
W_1.shape, W_2.shape
((784, 100), (100, 10))
# Training Loop
for j in range(iterations):  # epochs
    correct_cnt = 0
    for i in range(int(len(images) / batch_size)):  # batches
        batch_start, batch_end = (i * batch_size), ((i+1) * batch_size)
        # Forward Propagation
        layer_0 = images[batch_start:batch_end]
        layer_1 = tanh(np.dot(layer_0, W_1))
        dropout_mask = np.random.randint(2, size=layer_1.shape)
        layer_1 *= dropout_mask * 2
        layer_2 = softmax(np.dot(layer_1, W_2))
        # benchmarking
        for k in range(batch_size):
            correct_cnt += int(np.argmax(layer_2[k:k+1]) == np.argmax(labels[batch_start+k:batch_start+k+1]))
        # backpropagation
        layer_2_delta = (labels[batch_start:batch_end] - layer_2) / (batch_size * layer_2.shape[0])
        layer_1_delta = layer_2_delta.dot(W_2.T) * tanh2deriv(layer_1)
        layer_1_delta *= dropout_mask
        # optimization
        W_2 += alpha * layer_1.T.dot(layer_2_delta)
        W_1 += alpha * layer_0.T.dot(layer_1_delta)
    test_correct_cnt = 0
    for i in range(len(test_images)):  # test images
        # predict
        layer_0 = test_images[i:i+1]
        layer_1 = tanh(np.dot(layer_0, W_1))
        layer_2 = np.dot(layer_1, W_2)
        # benchmark
        test_correct_cnt += int(np.argmax(layer_2) == np.argmax(test_labels[i:i+1]))
    if (j % 10 == 0):
        print(f"I: {j} | Test-Acc: {round(test_correct_cnt / float(len(test_images)), 5)} | Train-Acc: {round(correct_cnt / float(len(images)), 5)}")
I: 0 | Test-Acc: 0.4141 | Train-Acc: 0.159
I: 10 | Test-Acc: 0.6838 | Train-Acc: 0.691
I: 20 | Test-Acc: 0.7108 | Train-Acc: 0.725
I: 30 | Test-Acc: 0.7377 | Train-Acc: 0.763
I: 40 | Test-Acc: 0.7698 | Train-Acc: 0.807
I: 50 | Test-Acc: 0.794 | Train-Acc: 0.831
I: 60 | Test-Acc: 0.8109 | Train-Acc: 0.85
I: 70 | Test-Acc: 0.8193 | Train-Acc: 0.864
I: 80 | Test-Acc: 0.8273 | Train-Acc: 0.869
I: 90 | Test-Acc: 0.833 | Train-Acc: 0.877
I: 100 | Test-Acc: 0.8399 | Train-Acc: 0.892
I: 110 | Test-Acc: 0.8419 | Train-Acc: 0.889
I: 120 | Test-Acc: 0.8452 | Train-Acc: 0.902
I: 130 | Test-Acc: 0.846 | Train-Acc: 0.906
I: 140 | Test-Acc: 0.8498 | Train-Acc: 0.908
I: 150 | Test-Acc: 0.852 | Train-Acc: 0.906
I: 160 | Test-Acc: 0.8543 | Train-Acc: 0.918
I: 170 | Test-Acc: 0.8561 | Train-Acc: 0.926
I: 180 | Test-Acc: 0.8573 | Train-Acc: 0.921
I: 190 | Test-Acc: 0.859 | Train-Acc: 0.931
I: 200 | Test-Acc: 0.8602 | Train-Acc: 0.933
I: 210 | Test-Acc: 0.8607 | Train-Acc: 0.928
I: 220 | Test-Acc: 0.8628 | Train-Acc: 0.93
I: 230 | Test-Acc: 0.8637 | Train-Acc: 0.938
I: 240 | Test-Acc: 0.8651 | Train-Acc: 0.941
I: 250 | Test-Acc: 0.8656 | Train-Acc: 0.945
I: 260 | Test-Acc: 0.8672 | Train-Acc: 0.94
I: 270 | Test-Acc: 0.8678 | Train-Acc: 0.939
I: 280 | Test-Acc: 0.8687 | Train-Acc: 0.943
I: 290 | Test-Acc: 0.8692 | Train-Acc: 0.951
In this section, we're going to make sure we understand batch stochastic gradient descent and the new activation functions by implementing dropout with ReLU:
from tensorflow.keras import datasets
# Load Data.
(X_train, y_train), (X_test, y_test) = datasets.mnist.load_data()
X_train.shape, y_train.shape, X_test.shape, y_test.shape
((60000, 28, 28), (60000,), (10000, 28, 28), (10000,))
# light data pre-processing
x_train, y_train = (X_train[:1000].reshape((1000, 28*28))/255.), (y_train[:1000])
# one-hot encoding `y_train`
labels_train = np.zeros((y_train.shape[0], 10))
for i, v in enumerate(y_train):
    labels_train[i][v] = 1
labels_train[:3]
array([[0., 0., 0., 0., 0., 1., 0., 0., 0., 0.], [1., 0., 0., 0., 0., 0., 0., 0., 0., 0.], [0., 0., 0., 0., 1., 0., 0., 0., 0., 0.]])
# same for the test data
x_test = X_test.reshape((-1, 28*28))/255.
labels_test = np.zeros((y_test.shape[0], 10))
for i, v in enumerate(y_test):
    labels_test[i][v] = 1
labels_test[:3]
array([[0., 0., 0., 0., 0., 0., 0., 1., 0., 0.], [0., 0., 1., 0., 0., 0., 0., 0., 0., 0.], [0., 1., 0., 0., 0., 0., 0., 0., 0., 0.]])
x_train.shape, labels_train.shape, x_test.shape, labels_test.shape
((1000, 784), (1000, 10), (10000, 784), (10000, 10))
# activation functions.
def ReLU(x):
    return (x > 0) * x

def grad_ReLU(x):
    return (x > 0).astype('int')

def tanh(x):
    return np.tanh(x)

def tanh2deriv(x):
    return 1 - (x ** 2)

def softmax(x):
    temp = np.exp(x)
    return temp / np.sum(temp, axis=1, keepdims=True)
# configuration parameters
lr, epochs, hidden_size = 2, 100, 100
pixels_count, labels_count = 784, 10
batch_size = 100
# Random Weights Initialization
W_0 = (0.02 * np.random.random((784,100))) - 0.01
W_1 = (0.02 * np.random.random((100,10))) - 0.01
W_0.shape, W_1.shape
((784, 100), (100, 10))
for epoch in range(epochs):
    # each epoch passes over all the training data, so we track accuracy per epoch
    correct_count = []
    for batch_i in range(int(x_train.shape[0] / batch_size)):
        # get batch
        batch_start, batch_end = (batch_i * batch_size), ((batch_i+1) * batch_size)
        X = x_train[batch_start:batch_end]
        y = labels_train[batch_start:batch_end]
        # forward propagation
        layer_0 = X
        layer_1 = ReLU(np.matmul(layer_0, W_0))
        dropout_mask = np.random.randint(2, size=layer_1.shape)
        layer_1 *= dropout_mask * 2
        layer_2 = softmax(np.matmul(layer_1, W_1))
        # evaluation: loop over the images in the batch
        for k in range(batch_size):
            x_i, y_i, y_i_hat = X[k:k+1], y[k:k+1], layer_2[k:k+1]
            if np.argmax(y_i_hat.squeeze()) == np.argmax(y_i.squeeze()):
                correct_count.append(1)
            else:
                correct_count.append(0)
        # backpropagation
        layer_2_delta = (layer_2 - y) / (batch_size * layer_2.shape[1])
        layer_1_delta = layer_2_delta.dot(W_1.T) * grad_ReLU(layer_1)
        layer_1_delta *= dropout_mask
        # optimization
        W_1 -= lr * (layer_1.T.dot(layer_2_delta))
        W_0 -= lr * (layer_0.T.dot(layer_1_delta))
    test_correct_count = list()
    # evaluate over the test dataset
    for i in range(x_test.shape[0]):
        # get data
        x_i, y_i = x_test[i:i+1], labels_test[i:i+1]
        # forward propagation
        layer_0 = x_i
        layer_1 = ReLU(layer_0.dot(W_0))
        layer_2 = softmax(layer_1.dot(W_1))
        if np.argmax(layer_2.squeeze()) == np.argmax(y_i.squeeze()):
            test_correct_count.append(1)
        else:
            test_correct_count.append(0)
    if (epoch % 10 == 0):
        print("\n" + "Epoch:" + str(epoch) +
              " Test-Acc:" + str(np.sum(np.array(test_correct_count)) / np.array(test_correct_count).shape[0]) +
              " Train-Acc:" + str(np.sum(np.array(correct_count)) / np.array(correct_count).shape[0]))
Epoch:0 Test-Acc:0.3397 Train-Acc:0.173
Epoch:10 Test-Acc:0.7704 Train-Acc:0.734
Epoch:20 Test-Acc:0.8415 Train-Acc:0.872
Epoch:30 Test-Acc:0.8543 Train-Acc:0.894
Epoch:40 Test-Acc:0.867 Train-Acc:0.922
Epoch:50 Test-Acc:0.8697 Train-Acc:0.946
Epoch:60 Test-Acc:0.8712 Train-Acc:0.943
Epoch:70 Test-Acc:0.8712 Train-Acc:0.965
Epoch:80 Test-Acc:0.875 Train-Acc:0.967
Epoch:90 Test-Acc:0.8748 Train-Acc:0.963