MNIST neural network demo

A project by Sam Greydanus. MIT license

This is a vanilla two-layer neural network for solving the MNIST written digit classification task. When I was first learning how to use neural networks, backpropagation was by far the most confusing part.

I've broken the gradient computations into several smaller functions and included mathematical explanations of each step to make backprop more transparent

Load dependencies and data

In [5]:
import numpy as np
import matplotlib.pyplot as plt
import as cm
%matplotlib inline

from tensorflow.examples.tutorials.mnist import input_data
mnist = input_data.read_data_sets('MNIST_data', one_hot=False)
Extracting MNIST_data/train-images-idx3-ubyte.gz
Extracting MNIST_data/train-labels-idx1-ubyte.gz
Extracting MNIST_data/t10k-images-idx3-ubyte.gz
Extracting MNIST_data/t10k-labels-idx1-ubyte.gz


In [6]:
resume = False
batch_size = 16
lr = 2e-2
reg = 1e-4
h1_size = 100 # first hidden layer
h2_size = 10 # second hidden layer
D = 28*28 # dimensionality

Model Initialization

In [7]:
model = {}
# first layer
model['W1'] = np.random.randn(D,h1_size) / np.sqrt(h1_size) # Xavier initialization
model['b1'] = np.random.randn(1,h1_size) / np.sqrt(h1_size) # see
#second layer
model['W2'] = np.random.randn(h1_size,h2_size) / np.sqrt(h2_size)
model['b2'] = np.random.randn(1,h2_size) / np.sqrt(h2_size)

Forward propagation

x $\cdot$ W+b function

The x $\cdot$ W+b operation is just a linear transformation plus a constant. For a given input $x_j$ and output $h_k$ we have: $$h_k = x_jW_{jk} + b_k $$

In [8]:
def xW_plus_b(x, W, b):
    return,W) + b # in some cases you can even drop the bias b

Softmax function

The softmax function is $p_k=\frac{e^{z_k}}{\sum_i e^{z_i}}$

In [9]:
def softmax(x):
    maxes = np.amax(x, axis=1, keepdims=True)
    e = np.exp(x - maxes) # improves numerics
    dist = e / np.sum(e, axis=1, keepdims=True)
    return dist

ReLU function

The ReLU function is $z_i=relu(h_i)$ where

$$relu(h_i) = \left\{ \begin{array}{ll} h_i & \quad h \ge 0\\ 0 & \quad \mathrm{otherwise} \end{array} \right.$$
In [10]:
def relu(x):
    x[x<0] = 0
    return x

Putting it together

In [11]:
def forward(X, model):
    # evaluate class scores, [N x K]
    hs = [] # we'll need the h's for computing gradients
    z1 = xW_plus_b(X, model['W1'], model['b1'])
    h1 = relu(z1); hs.append(h1)
    z2 = xW_plus_b(h1, model['W2'], model['b2'])
    h2 = relu(z2); hs.append(h2)
    probs = softmax(h2)
    return probs, hs


Gradient of the softmax

The original softmax function is: $$p_k=\frac{e^{z_k}}{\sum_i e^{z_i}}$$ We're using a cross entropy loss function which looks like $$L_i= \left\{ \begin{array}{ll} y - log(p_k) & \quad y = k \\ -log(p_k) & \quad \mathrm{otherwise} \end{array} \right.$$ Taking the gradient of $L$ with respect to $f_k$, we get $$\frac{\partial L_i}{\partial z_k}=\frac{\partial}{\partial z_k}\left(-log(e^{z_k}) + log(\sum_i e^{z_i}) \right) = \frac{\partial}{\partial z_k}\left(-z_k + log(e^{z_1} + \ldots + e^{z_k} + \ldots + e^{z_n}) \right)$$ $$\frac{\partial L_i}{\partial z_k}=-\mathbb{1}(y_i=k) + \frac{e^{z_k}}{(e^{z_1} + \ldots + e^{z_k} + \ldots + e^{z_n})}$$ $$\frac{\partial L_i}{\partial z_k}= -\mathbb{1}(y_i=k) + p_k$$ Now we know how to calculate the whole gradient $\frac{\partial L}{\partial z_k}$

In [12]:
def dsoftmax(h, y, batch_size):
    h[range(batch_size),y] -= 1
    return h/y.shape[0] # divide by batch size

Gradient of the ReLU

Propagating the gradient through the ReLU looks like: $$\frac{\partial L}{\partial z_k} \frac{\partial z_k}{\partial h_k} =\frac{\partial}{\partial z_k}\left( relu(h_k) \right) = \left\{ \begin{array}{ll} \frac{\partial L}{\partial z_k} * 1 & \quad h \ge 0\\ 0 & \quad \mathrm{otherwise} \end{array} \right.$$ Since we know $\frac{\partial L}{\partial z_k}$ from calculations above, we can also calculate $\frac{\partial L}{\partial h_k}$

In [13]:
def drelu(dz, h):
    dz[h <= 0] = 0 # backprop relu
    return dz

Gradient of x $\cdot$ W+b

The x $\cdot$ W+b operation is just a matrix dot product plus a constant vector $b$. We can break it into indices as follows: $$ h_k = x_j*W_{jk} + b_k $$

Now we just need to propagate the gradient through this function.

$$\frac{\partial L}{\partial h_k} \frac{\partial h_k}{\partial x_j} = \frac{\partial L}{\partial h_k} \frac{\partial}{\partial x_j} \left( x_j*W_{jk} + b_k \right) = \frac{\partial L}{\partial h_k} W_{jk}$$

In practice, we'd like to vectorize this calculation. The best way to do this is with a matrix dot product $$\frac{\partial L}{\partial x_j} = \frac{\partial L}{\partial h} \cdot W^\intercal$$

In [14]:
def dxW_plus_b(dh, model):
    return, model['W2'].T)

Putting it together

In [21]:
def backward(y, probs, hs, model):
    grads = { k : np.zeros_like(v) for k,v in model.items() }
    dh2 = dsoftmax(probs, y, batch_size)
    # second hidden layer
    print(hs[0].T.shape, dh2.shape, model['W2'].shape)
    grads['W2'] =[0].T, dh2)
    grads['b2'] = np.sum(dh2, axis=0, keepdims=True)

    # first hidden layer
    dh1 = dxW_plus_b(dh2, model)
    dh1 = drelu(dh1, hs[0]) # backprop through relu
    grads['W1'] =, dh1)
    grads['b1'] = np.sum(dh1, axis=0, keepdims=True)
    return grads

Evaluation metric

In [22]:
# evaluate training set accuracy
def eval_model(model):
    X = mnist.test.images
    y = mnist.test.labels
    hidden_layer = np.maximum(0,, model['W1']) + model['b1'])
    scores =, model['W2']) + model['b2']
    predicted_class = np.argmax(scores, axis=1)
    return (np.mean(predicted_class == y))

Gradient descent loop

A note on regularization

I've added L2 regularization to the cost and gradients here even though I didn't touch on it in the Gradients section above. Here's how it works.

Overfitting is a problem which occurs when a model fits itself to individual examples in training data. This prevents it from generalizing well on new data. One way that neural networks overfit is by making certain weights very large; we use L2 Regularization to prevent this.

The idea of L2 regularization is to add a penalty - the regularization loss - to the loss function which scales with the sum square of all parameters in the model. $$L_{\mathrm{regularization}} = \frac{1}{2} \gamma \theta \cdot \theta$$

Where $L_\mathrm{regularization}$ is regularization loss, $\gamma$ is a scaling hyperparameter, and $\theta$ is an 1D vector of all the training parameters. The gradient, then, has the simple form of

$$\frac{\partial L_\mathrm{regularization}}{\partial \theta} = \gamma \theta$$
In [23]:
loss_history = []
smoothing_factor = 0.95
running_loss = 0
for i in range(2):
    X, y = mnist.train.next_batch(batch_size)
    probs, hs = forward(X, model)
    # compute the loss: average cross-entropy loss and L2 regularization
    y_logprobs = -np.log(probs[range(batch_size),y]) # we want probs on the y labels to be large
    reg_loss = 0.5*reg*np.sum([np.sum(w*w) for w in model.values()])
    loss = np.sum(y_logprobs)/batch_size + reg_loss

    grads = backward(y, probs, hs, model) # data gradients
    grads = {k : v+reg*model[k] for (k,v) in grads.items()} # L2 gradients
    # update parameters
    model = {k : model[k] - lr*grads[k] for (k,v) in grads.items()}
    # boring book-keeping
    running_loss = smoothing_factor*running_loss + (1-smoothing_factor)*loss
    if i % 1000 == 0: print("iteration {}: test accuracy {:3f}".format(i, eval_model(model)))
    if i % 250 == 0: print("\titeration %d: loss %f" % (i, running_loss))
    if i % 10 == 0: loss_history.append((i,running_loss))
(100, 16) (16, 10) (100, 10)
iteration 0: test accuracy 0.925300
	iteration 0: loss 0.012139
(100, 16) (16, 10) (100, 10)

Visualize training loss

In [14]:
plt.title("Smoothed loss")
plt.xlabel("training steps")

train_steps, smoothed_losses = zip(*loss_history)
plt.plot(train_steps, smoothed_losses)

Sample model

In [15]:
X = mnist.test.images
y = mnist.test.labels
hidden_layer = np.maximum(0,, model['W1']) + model['b1'])
scores =, model['W2']) + model['b2']
predicted_class = np.argmax(scores, axis=1)

for i in range(0,10,5):
    img1 = np.reshape(X[i+0,:], (28,28))
    img2 = np.reshape(X[i+1,:], (28,28))
    img3 = np.reshape(X[i+2,:], (28,28))
    img4 = np.reshape(X[i+3,:], (28,28))
    img5 = np.reshape(X[i+4,:], (28,28))
    plt.figure(i, figsize=(10,4))
    plt.subplot(151) ; plt.title("predicted: " + str(predicted_class[i]))
    plt.imshow(img1, cmap=cm.gray)
    plt.subplot(152) ; plt.title("predicted: " + str(predicted_class[i+1]))
    plt.imshow(img2, cmap=cm.gray)
    plt.subplot(153) ; plt.title("predicted: " + str(predicted_class[i+2]))
    plt.imshow(img3, cmap=cm.gray)
    plt.subplot(154) ; plt.title("predicted: " + str(predicted_class[i+3]))
    plt.imshow(img4, cmap=cm.gray)
    plt.subplot(155) ; plt.title("predicted: " + str(predicted_class[i+4]))
    plt.imshow(img5, cmap=cm.gray)