Exercise 3: Multiclass Classification and Neural Networks

In the first part of the exercise we'll use a subset of the MNIST dataset to train a logistic regression classifier that can recognize handwritten digits. In the second part we'll use an already trained neural network for the same task (training the network itself is the subject of exercise 4).

Part 1: Multiclass Classification using Logistic Regression

Start by importing the libraries and routines that we'll use.

In [1]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import scipy.io
from scipy.special import expit
from scipy.optimize import minimize
import time

Let's load the data and display it. Also prepend the bias terms (a column full of ones) to the observation matrix X.

In [2]:
#Load data stored in matlab format
mat = scipy.io.loadmat('ex3data1.mat')
#One can check that mat is a dictionary with keys 'X' and 'y'
X, y = mat['X'], mat['y']
#y contains the class labels; the digit 0 is labeled as '10', so let's relabel it to 0.
y[y==10] = 0
(m, n) = np.shape(X)
# Add intercept term to X
Xt = np.concatenate((np.ones((m, 1)), X), axis=1)
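
As a quick check (not in the original notebook), we can inspect the array shapes; for the standard ex3data1.mat there are 5000 examples of 400 pixels each, so Xt should be 5000 x 401:

print("X: %s  y: %s  Xt: %s" % (X.shape, y.shape, Xt.shape))
# expected: X: (5000, 400)  y: (5000, 1)  Xt: (5000, 401)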
In [3]:
#Each row in X represents a 20x20 image of a digit
#Randomly select nrows rows of matrix X and display each one as an image
def display_data(X, nrows):
    idxs = np.random.randint(0,X.shape[0], nrows)
    for i in idxs:
        plt.imshow(X[i,:].reshape((20,20), order='F'), cmap='Greys')
        plt.show()
display_data(X, 5)
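
As a variant (a sketch, not part of the exercise code), several random digits can be displayed in a single figure using subplots instead of one figure per digit:

def display_grid(X, nrows=2, ncols=5):
    #pick nrows*ncols random examples and plot them on one grid
    idxs = np.random.randint(0, X.shape[0], nrows * ncols)
    fig, axes = plt.subplots(nrows, ncols, figsize=(ncols, nrows))
    for ax, i in zip(axes.ravel(), idxs):
        ax.imshow(X[i, :].reshape((20, 20), order='F'), cmap='Greys')
        ax.axis('off')
    plt.show()
display_grid(X)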

The implementation of the cost function for regularized logistic regression is the same as in exercise 2. We reproduce it here.

In [4]:
#Implementation of logistic regression cost function and gradient 
#and of its regularized version.
def sigmoid(z):
    return expit(z)
def lr_cost(theta, X, y):
    """ Logistic Regression cost function and Gradient.
        Denoting g(z) = 1/(1+exp(-z))
                    h = g(X theta) (m,)
    In:
        theta (n+1,) float array
        X (m, n+1) float array
        y (m, ) binary array 
    Out: 
        cost (float): 1/m (-y^T log(h) - (1-y)^T log(1-h))   
        grad (n+1, ): 1/m X^T(g(X theta)-y)
    """
    m = X.shape[0]
    h = sigmoid(np.dot(X, theta))
    cost = 1./float(m) * (-np.dot(y, np.log(h)) - np.dot((1.-y), np.log(1.-h)))
    grad = 1./float(m) * np.dot(X.T, h-y)
    return cost, grad
def lr_cost_reg(theta, X, y, lam):
    """ Regularized Logistic Regression cost function and Gradient.
        Denoting g(z) = 1/(1+exp(-z))
                    h = g(X theta) (m,)
                    t = theta[1:n+1]
    In:
        theta (n+1,) float array
        X (m, n+1) float array
        y (m,) binary array 
        lam : scalar, regularization parameter
    Out: 
        cost (float): 1/m (-y^T log(h) - (1-y)^T log(1-h)) + lam/(2m) * t^T t
        grad (n+1,): 1/m X^T(g(X theta)-y) + lam/m * [0; t]
    """
    m = X.shape[0]
    cost, grad = lr_cost(theta, X, y)
    t = theta[1:]
    cost += lam/(2.*m) * np.dot(t, t)
    grad[1:] += lam/float(m) * t
    return cost, grad
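
As a sanity check (not part of the original exercise), the analytic gradient returned by lr_cost_reg can be compared with a finite-difference approximation on a few random components; the two should agree to several significant digits:

#Illustrative gradient check on the binary problem "digit 0 vs the rest"
y0 = (y.flatten() == 0).astype(np.float64)   #binary targets for class 0
theta_test = 0.01 * np.random.randn(n + 1)
_, grad_analytic = lr_cost_reg(theta_test, Xt, y0, 1.0)
eps = 1e-5
for j in np.random.randint(0, n + 1, 5):
    e = np.zeros(n + 1)
    e[j] = eps
    c_plus, _ = lr_cost_reg(theta_test + e, Xt, y0, 1.0)
    c_minus, _ = lr_cost_reg(theta_test - e, Xt, y0, 1.0)
    num_grad = (c_plus - c_minus) / (2 * eps)
    print("theta[%d]: analytic %.6e, numerical %.6e" % (j, grad_analytic[j], num_grad))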

Training a one-vs-all classifier.

For each of the 10 classes we train a regularized binary classifier that predicts the probability that an image belongs to class k; at prediction time we pick the class with the highest probability.
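
Concretely, the binary targets for one class are built by comparing the label vector with that class index (a tiny illustration; the training routine below does this inside its loop):

y3 = (y == 3).astype(np.float64).flatten()   #1.0 where the image is a 3, 0.0 elsewhere
print("number of 3s in the training set: %d" % int(y3.sum()))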

In [5]:
def train_one_vs_all(X, y, lam, num_labels, method='BFGS'):
    """ Function that trains a 1 vs all logistic regression classifier
    In:
        X : (m,n) array of floats containing the m observations, one per row. 
            It is assumed that the bias terms are included 
            (the first column is full of ones).
        y : (m,1) array of ints between 0 and num_labels-1; entry i is the
            label of the observation at row i of X.
        lam (float): positive , regularization parameter.
        num_labels (int): number of labels (or classes).
        method (str): method to be used by minimize, e.g BFGS or CG, see documentation
                      of scipy.optimize.minimize.
    Out: 
        theta_opt : (n, num_labels) array of floats; column k contains
            the optimal weights of classifier k.
    """
    options = {'maxiter':800, 'disp':True}
    (m,n) = X.shape
    theta_opt = np.zeros((n, num_labels))
    theta0 = np.zeros(n)
    #for each class
    for k in range(num_labels):
        yk = (y==k).astype(np.float64)
        res = minimize(lr_cost_reg, theta0, args=(X,yk.flatten(),lam), 
                   method=method, jac=True, options=options)
        theta_opt[:,k] = res['x']
    return theta_opt
lam = 1.0
#time it
t1 = time.time()
theta_opt = train_one_vs_all(Xt,y,lam, 10)
t2 = time.time()
print "time taken to train: %.2f seconds"%(t2-t1)
Optimization terminated successfully.
         Current function value: 0.020149
         Iterations: 179
         Function evaluations: 180
         Gradient evaluations: 180
Optimization terminated successfully.
         Current function value: 0.026961
         Iterations: 164
         Function evaluations: 165
         Gradient evaluations: 165
Optimization terminated successfully.
         Current function value: 0.068444
         Iterations: 208
         Function evaluations: 209
         Gradient evaluations: 209
Optimization terminated successfully.
         Current function value: 0.071707
         Iterations: 208
         Function evaluations: 209
         Gradient evaluations: 209
Optimization terminated successfully.
         Current function value: 0.052051
         Iterations: 203
         Function evaluations: 204
         Gradient evaluations: 204
Optimization terminated successfully.
         Current function value: 0.076859
         Iterations: 221
         Function evaluations: 222
         Gradient evaluations: 222
Optimization terminated successfully.
         Current function value: 0.034787
         Iterations: 187
         Function evaluations: 188
         Gradient evaluations: 188
Optimization terminated successfully.
         Current function value: 0.046726
         Iterations: 191
         Function evaluations: 192
         Gradient evaluations: 192
Optimization terminated successfully.
         Current function value: 0.092747
         Iterations: 228
         Function evaluations: 229
         Gradient evaluations: 229
Optimization terminated successfully.
         Current function value: 0.089312
         Iterations: 218
         Function evaluations: 219
         Gradient evaluations: 219
time taken to train: 18.01 seconds

Define a function that performs the actual prediction and evaluate the accuracy on the training set.

In [6]:
def predict_class(X, theta, num_labels):
    m = X.shape[0]
    prob = np.zeros((m, num_labels))
    #for each class k
    for k in range(num_labels):
        #evaluate probability that each image is in class k
        prob[:,k] = sigmoid(np.dot(X, theta[:,k]))
    #predict for each image the class with the highest probability
    prediction = prob.argmax(axis=1)
    return prediction
def eval_accuracy(X, y, theta, num_labels):
    yf = np.int64(y.flatten())
    prediction = predict_class(X, theta, num_labels)
    accuracy = np.double(prediction==yf).mean() * 100
    return accuracy
acc = eval_accuracy(Xt, y, theta_opt, 10)
print ("Accuracy on Training set = %.2f percent"%acc)
Accuracy on Training set = 94.46 percent
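
To see where the classifier makes its mistakes, one can also look at per-class accuracy (an illustrative sketch, not part of the original exercise):

pred_train = predict_class(Xt, theta_opt, 10)
y_flat = np.int64(y.flatten())
for k in range(10):
    mask = (y_flat == k)
    acc_k = np.double(pred_train[mask] == k).mean() * 100
    print("digit %d: %.2f percent correct" % (k, acc_k))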
In [7]:
#Convenience function to print an image in the dataset, its label and the predicted class.
def classify_img(Xt, y, theta, row):
    img = np.array(Xt[row,1:])
    plt.imshow(img.reshape((20,20), order='F'), cmap='Greys')
    plt.title('Digit at row %d'%row)
    print("Image at row %d is labeled as a %d:"%(row,y[row]))
    pred = predict_class(Xt[row:row+1,:], theta, 10)
    print("Predicted class by Logistic regression: %d"%pred[0])
classify_img(Xt, y, theta_opt, row=3500)
Image at row 3500 is labeled as a 7:
Predicted class by Logistic regression: 7

Part 2: Neural Networks

A neural network that performs multiclass classification of the handwritten digits has already been trained; all we need to do is load its weights and code up the prediction (forward propagation) function.

In [8]:
#Neural network parameters
input_layer_size  = 400  # 20x20 Input Images of Digits
hidden_layer_size = 25   # 25 hidden units
num_labels = 10          # 10 output classes
#Load weights
nn_weights =  scipy.io.loadmat('ex3weights.mat')
In [9]:
theta1  = nn_weights['Theta1']
theta2  = nn_weights['Theta2']
#Each weight matrix has one extra column (401 = 400+1, 26 = 25+1) for the bias unit of the previous layer.
print theta1.shape
print theta2.shape
(25, 401)
(10, 26)
In [11]:
def nn_predict(theta1, theta2, X):
    """ Forward propagation through the two-layer network.
        X must already include the bias term (shape (n+1,) or (n+1, m)). """
    #hidden layer activations
    a1 = sigmoid(np.dot(theta1, X))
    #prepend the bias unit to the hidden layer
    bias = np.ones(1)
    if X.ndim == 2:
        bias = np.ones((1, a1.shape[1]))
    a1biased = np.concatenate((bias, a1), axis=0)
    #output layer: one probability per class
    a2 = sigmoid(np.dot(theta2, a1biased))
    return a2
In [17]:
#Example use of the neural network prediction and comparison with logistic regression
classify_img(Xt, y, theta_opt, row=0)
pred = nn_predict(theta1, theta2, Xt[0,:])
print "Output of Neural Network Prediction:", pred.argmax()+1 
# note that the 10th class is predicted by the neural network, which means a 0.
Image at row 0 is labeled as a 0:
Predicted class by Logistic regression: 0
Output of Neural Network Prediction: 10
In [18]:
# Evaluate accuracy
mat = scipy.io.loadmat('ex3data1.mat')
#The original vector of labels is indexed between 1 and 10 where 10 represents the digit 0
y = mat['y'][:,0]
pred = nn_predict(theta1, theta2, Xt.T)
In [20]:
#Check that the dimensions are correct, 1 column per datum
print pred.shape
(10, 5000)
In [21]:
idxs = np.argmax(pred, axis = 0 )
idxs += 1 #so that the range is 1..10
print "Predicted Classes : ", idxs
print "Ground Truth :", y
nn_acc = np.float64(idxs==y).mean()*100
print "Accuracy of Neural Network Multiclass classification = %.2f percent"%nn_acc
Predicted Classes :  [10 10 10 ...,  9  9  9]
Ground Truth : [10 10 10 ...,  9  9  9]
Accuracy of Neural Network Multiclass classification = 97.52 percent
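
To report the network's predictions as digits 0-9 rather than its 1-10 class labels, class 10 can be mapped back to 0 (a small sketch using the conventions above):

digits = idxs % 10           #10 -> 0, 1..9 unchanged
true_digits = y % 10
print("Accuracy (digit labels) = %.2f percent" % (np.float64(digits == true_digits).mean() * 100))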