In the last lecture we learned logistic regression, which classifies a "0" digit versus a "1" digit based on pixel intensities on a 28x28 grid.
Today we will learn how to classify all 10 digits.
Reference: adapted from the MATLAB tutorial in the Stanford Deep Learning Tutorial.
import numpy as np
import matplotlib.pyplot as plt
from tqdm import tqdm_notebook
%matplotlib inline
Let us load the MNIST dataset of handwritten digits, both training and testing data. You can download the npz-format files from the Files tab on Canvas. (In the csv data from the Kaggle digit recognizer competition, the first column of each sample is the label and the remaining 784 columns represent a 28x28 grayscale image.) In the npz data (numpy's native zip format), X_train and X_test both have 784 columns representing a 28x28 grayscale image; y_train and y_test, which range from 0 to 9 for a total of 10 classes, are the labels of the training and testing samples, respectively.
data_train = np.load('mnist_train.npz')
X_train = data_train['X']
y_train = data_train['y']
data_test = np.load('mnist_test.npz')
X_test = data_test['X']
y_test = data_test['y']
# visualize 20 randomly chosen training images, with their labels
_, axes = plt.subplots(4,5, figsize=(16, 14))
axes = axes.reshape(-1)
# randomly choosing 20 samples to display
idx = np.random.choice(60000, size=20)
for i in range(20):
    axes[i].axis('off') # hide the axes ticks
    axes[i].imshow(X_train[idx[i],:].reshape(28,28), cmap = 'gray')
    axes[i].set_title(str(int(y_train[idx[i]])), color= 'black', fontsize=25)
plt.show()
In the logistic regression problem, for a labeled sample $(\mathbf{x},y)$ we estimated the probability that the sample is in class 0 (i.e., the image is a 0) or class 1. In softmax regression, a sample's label is $y\in \{1,\dots, K\}$, where $K$ is the number of classes, and the feature vector is $\mathbf{x}\in \mathbb{R}^n$ (in MNIST, a 28x28 image flattened to a (784,) array). We want to estimate the probability $P(y=k|\mathbf{x})$ for every class $k$ using the softmax model
$$ h(\mathbf{x}; W) = \begin{pmatrix} P(y=1|\mathbf{x};\mathbf{w}) \\ P(y=2|\mathbf{x};\mathbf{w}) \\ \vdots \\ P(y=K|\mathbf{x};\mathbf{w}) \end{pmatrix} = \frac{1}{\sum_{j=1}^{K}\exp\big(\mathbf{w}_j^{\top}\mathbf{x}\big)} \begin{pmatrix} \exp(\mathbf{w}_1^{\top}\mathbf{x}) \\ \exp(\mathbf{w}_2^{\top}\mathbf{x}) \\ \vdots \\ \exp(\mathbf{w}_K^{\top}\mathbf{x}) \end{pmatrix}, $$
where we have $K$ sets of parameters, $\mathbf{w}_1, \mathbf{w}_2, \dots, \mathbf{w}_K$, and the factor $\sum_{j=1}^{K}{\exp\big(\mathbf{w}_j^{\top} \mathbf{x}\big)}$ normalizes the results to be a probability.
$W$ is an $n\times K$ matrix containing all $K$ sets of parameters, obtained by concatenating $\mathbf{w}_1, \mathbf{w}_2, \dots, \mathbf{w}_K$ as columns, so that $\mathbf{w}_k = (w_{k1}, \dots, w_{kn})^{\top} = (w_{kl})$ for $l = 1,\dots, n$:
$$ W = \left( \begin{array}{cccc}| & | & | & | \\ \mathbf{w}_1 & \mathbf{w}_2 & \cdots & \mathbf{w}_K \\ | & | & | & | \end{array}\right), $$so that $W^{\top}\mathbf{x}$ can be computed in a vectorized way.
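To make the normalization concrete, here is a tiny numerical sketch with made-up scores for $K=3$ classes (the values of $W^{\top}\mathbf{x}$ below are hypothetical, not from the MNIST data):
import numpy as np
scores = np.array([2.0, 1.0, 0.1])            # hypothetical values of W^T x for K = 3 classes
probs = np.exp(scores) / np.exp(scores).sum()  # softmax normalization
print(probs, probs.sum())                      # about [0.659 0.242 0.099], sums to 1.0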
Define the following indicator function: $$ 1_{\{y = k\}} = 1_{\{k\}}(y) = \delta_{yk} = \begin{cases} 1 & \text{when } y = k, \\[5pt] 0 & \text{otherwise}. \end{cases} $$
The loss function again uses the cross-entropy:
$$ \begin{aligned} L (\mathbf{w};X,\mathbf{y}) & = - \frac{1}{N}\sum_{i=1}^N \sum_{k=1}^K \Bigl\{ 1_{\{y^{(i)} = k\}} \ln P\big(y^{(i)}=k | \mathbf{x}^{(i)} ; \mathbf{w} \big) \Bigr\} \\ & = - \frac{1}{N}\sum_{i=1}^N \sum_{k=1}^K \left\{1_{\{y^{(i)} = k\}} \ln \Bigg( \frac{\exp(\mathbf{w}_k^{\top} \mathbf{x}^{(i)})}{\sum_{j=1}^{K} \exp\big(\mathbf{w}_j^{\top} \mathbf{x}^{(i)} \big) } \Bigg)\right\}. \end{aligned} $$Notice that in the inner sum over the labels, $1_{\{y^{(i)} = k\}} = 1$ for exactly one of the $K$ terms and is 0 for the rest.
For example, if a sample $(\mathbf{x},y)$ is in the 5th class (representing digit 4), $y=5$, then its one-hot vector is $$[1_{\{y^{(i)} = k\}} ]_{k=1}^{10}= [0, 0, 0, 0, 1, 0, 0, 0, 0, 0].$$ Hopefully, after training, the model above can predict something like: $$h(\mathbf{x};W) = [0.009,\; 0.01,\; 0.01,\; 0.009,\; 0.91,\; 0.009,\; 0.009,\; 0.01,\; 0.009,\; 0.010]$$
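For this sample only the $k=5$ term of the inner sum survives, so its contribution to the loss would be $-\ln(0.91)\approx 0.094$; a quick check with the hypothetical numbers above:
import numpy as np
p = np.array([0,0,0,0,1,0,0,0,0,0])   # one-hot ground truth for the 5th class (digit 4)
q = np.array([0.009,0.01,0.01,0.009,0.91,0.009,0.009,0.01,0.009,0.010])  # hypothetical model output
print(-np.sum(p * np.log(q)))          # about 0.0943, i.e. -ln(0.91)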
Now the gradient of $L$ with respect to the whole $k$-th set of weights is:
$$ \frac{\partial L }{\partial \mathbf{w}_{k}} = - \frac{1}{N}\sum_{i=1}^N \mathbf{x}^{(i)}\left( 1_{\{y^{(i)} = k\} } - \big(\text{$k$-th component of } h(\mathbf{x}^{(i)};W)\big) \right) = \frac{1}{N}\sum_{i=1}^N \mathbf{x}^{(i)}\left( \frac{\exp(\mathbf{w}_k^{\top} \mathbf{x}^{(i)})} {\sum_{m=1}^{K} \exp(\mathbf{w}_m^{\top} \mathbf{x}^{(i)} )} -1_{\{y^{(i)} = k\} } \right). \tag{$\diamond$} $$One big challenge here is that the weights are now represented by a matrix.
The class with the largest estimated probability is taken as the sample's predicted label: $$ \hat{y} = \operatorname{arg}\max_{j} P\big(y = j| \mathbf{x}\big). $$
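Using the hypothetical output above, a one-line sketch of this decision rule (note that numpy's argmax returns a 0-based index):
import numpy as np
h = np.array([0.009,0.01,0.01,0.009,0.91,0.009,0.009,0.01,0.009,0.010])  # hypothetical h(x; W)
print(np.argmax(h))   # 4 in 0-based indexing, i.e. the 5th class, which is the digit 4 in MNIST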
N = len(y_train) # number of training samples
n = np.shape(X_train)[1] # 784, which is number of pixels (number of features)
K = 10 # number of classes
np.random.seed(42)
w = 1e-4*np.random.random(n*K)
# w: a (7840,) array, a small random initial guess
# 7840 = 784x10: 784 features, 10 classes
# during computation it will be reshaped into 10 columns standing for 10 sets of weights
# model sigma(X; w)
# w: 10 sets of weights
# X: training samples, of shape 60000 rows by 784 columns
# output: (60000, 10), the i-th row represents the probabilities of the i-th sample
# being in the k-th class (k-th column entry)
def sigma(X,w):
    W = w.reshape(n,K)
    s = np.exp(np.matmul(X,W))
    total = np.sum(s, axis=1).reshape(-1,1)
    prob = s / total
    return prob
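One practical caveat (an addition, not part of the lecture code): np.exp can overflow once the scores in np.matmul(X,W) grow large during training. A common remedy is to subtract each row's maximum score before exponentiating, which leaves the probabilities unchanged; a minimal sketch of such a variant, reusing n and K from above:
def sigma_stable(X, w):
    # same as sigma, but shift each row of scores by its max to avoid overflow in np.exp
    W = w.reshape(n, K)
    scores = np.matmul(X, W)
    scores -= scores.max(axis=1, keepdims=True)   # softmax is invariant to this per-row shift
    s = np.exp(scores)
    return s / s.sum(axis=1, keepdims=True)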
# loss function, averaged over N (size of the training data)
# a vectorized implementation, with a for loop of only 10 iterations
def loss(w,X,y):
    loss_components = np.zeros(N)
    for k in range(K):
        loss_components += np.log(sigma(X,w))[:,k] * (y == k)
        # loss_components is a (60000,) array
    return -np.mean(loss_components) # same as loss_components.sum()/N
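As a quick sanity check (an expectation, not a number from the lecture): with the tiny random initial w, the predicted probabilities are nearly uniform over the 10 classes, so the initial loss should be close to $\ln 10 \approx 2.303$.
print(loss(w, X_train, y_train))   # expected to be near np.log(10) ≈ 2.3026 before training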
def gradient_loss(w,X,y):
    gradient_for_each_weight_class = np.empty([n,K])
    # 10 columns, each column represents the gradient w.r.t. one set of weights w_k
    for k in range(K):
        gradient_for_all_training_data_for_class_k = (sigma(X,w)[:,k] - (y==k)).reshape(-1,1)*X
        gradient_for_each_weight_class[:,k] = np.mean(gradient_for_all_training_data_for_class_k, axis=0)
        # each column is a (784,) array, obtained by averaging over all 60000 training samples
    return gradient_for_each_weight_class.reshape(n*K)
    # flatten back to a (7840,) array matching the shape of w
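To gain some confidence in the gradient formula and its implementation, we can compare a few entries of gradient_loss against a centered finite difference of loss (a quick sketch; the coordinates and step size below are arbitrary choices, and the check is slow since each loss evaluation passes over the full training set):
eps = 1e-6
g = gradient_loss(w, X_train, y_train)
for idx in [4060, 4063, 4067]:             # weights for a pixel near the image center, chosen arbitrarily
    e = np.zeros_like(w); e[idx] = eps
    fd = (loss(w + e, X_train, y_train) - loss(w - e, X_train, y_train)) / (2*eps)
    print(idx, fd, g[idx])                  # the two numbers should nearly agree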
For a fixed set of weights, sigma(X,w) gives 10 probabilities for each sample (training or testing). Here we implement an accuracy-checking function: it takes X_train or X_test as input, computes for each sample the class label with the highest probability, and returns the accuracy.
# we define a function that computes the classification accuracy
def check_acc(w,X,y):
    prob = sigma(X,w) # for each sample, it computes 10 probabilities based on current weight w
    highest_prob_index = np.argmax(prob, axis=1)
    return np.mean(highest_prob_index == y.astype(int))
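Before training, the weights are essentially random, so the accuracy should be near the chance level of roughly 10% (an expectation to verify, not a figure from the lecture):
print(check_acc(w, X_train, y_train))   # roughly 0.1 for the untrained random weights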
eta = 1e-5 # step size (learning rate)
num_steps = 5
for i in tqdm_notebook(range(num_steps)):
    dw = gradient_loss(w,X_train,y_train)
    w -= eta * dw
    print("Training accuracy after", i+1, "iterations is: ", check_acc(w,X_train,y_train))
    print("Testing accuracy after", i+1, "iterations is: ", check_acc(w,X_test,y_test))
    # keep track of training and testing accuracy to make sure we are heading in the right direction
    print("loss after", i+1, "iterations is: ", loss(w,X_train,y_train))
Because our dataset is big, one iteration of gradient descent requires evaluating the gradient over all the training samples, which takes $O(N\cdot d)$ CPU time ($N$: number of training samples, $d$: number of features in each sample). Stochastic gradient descent, which remedies this, comes next time...
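As a preview only (a sketch; the batch size and number of steps are arbitrary, and the details are left to the next lecture), stochastic gradient descent replaces the full-data gradient with the gradient on a small random minibatch, so each step costs $O(B\cdot d)$ with $B \ll N$:
# minibatch stochastic gradient descent, preview sketch
batch_size = 64
for step in range(100):
    batch = np.random.choice(N, size=batch_size, replace=False)   # random minibatch indices
    dw = gradient_loss(w, X_train[batch], y_train[batch])          # gradient on the minibatch only
    w -= eta * dw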
We can use scikit-learn's LogisticRegression() class in the linear_model module to perform this task for us. Quoting the reference:
In the multiclass case, the training algorithm uses the one-vs-rest (OvR) scheme if the 'multi_class' option is set to 'ovr', and uses the cross-entropy loss if the 'multi_class' option is set to 'multinomial'. (Currently the 'multinomial' option is supported only by the 'lbfgs', 'sag' and 'newton-cg' solvers.)
For small datasets, 'liblinear' is a good choice, whereas 'sag' and 'saga' are faster for large ones.
For multiclass problems, only 'newton-cg', 'sag', 'saga' and 'lbfgs' handle multinomial loss; 'liblinear' is limited to one-versus-rest schemes. 'newton-cg', 'lbfgs' and 'sag' only handle L2 penalty, whereas 'liblinear' and 'saga' handle L1 penalty.
# copy the example
from sklearn.linear_model import LogisticRegression
mnist_clf = LogisticRegression(random_state=42,
solver='lbfgs', tol= 1e-5, max_iter = 2000, verbose=True,
multi_class='multinomial')
# a faster solver is sag according to the reference
# verbose prints output during training (for the liblinear and lbfgs solvers)
mnist_clf.fit(X_train[:10000,:], y_train[:10000]) # we only use first 10000 images as training data
mnist_clf.predict(X_test[:10, :])
# we visualize the first 10 rows of X_test to see how the predictions compare
# visualize the first 10 rows of the testing data, with their true labels
_, axes = plt.subplots(2,5, figsize=(16, 7))
axes = axes.reshape(-1)
for i in range(10):
    axes[i].axis('off') # hide the axes ticks
    axes[i].imshow(X_test[i,:].reshape(28,28), cmap = 'gray')
    axes[i].set_title(str(int(y_test[i])), color= 'black', fontsize=25)
plt.show()
Our softmax got 8 out of 10 correct, not a bad score. Running the following cell will give you the prediction accuracy on the testing samples.
mnist_clf.score(X_test, y_test)
Given a test input $\mathbf{x}\in \mathbb{R}^n$ (a 28x28 image flattened to a (784,) array), we want to estimate the probability $P(y=k|\mathbf{x})$ for each value of $k=1,\dots,K$ using a certain model (hypothesis). In other words, from the input image, we want to estimate the probability of this image being classified as each label among the $K$ labels, and we choose the most probable one to label this image as our prediction based on the model. Thus, for each sample, our model (hypothesis) outputs a $K$-dimensional vector (whose elements sum to $1$ to make it a probability) giving us our $K$ estimated probabilities. Concretely, our model $h(\mathbf{x}; W)$, which, given the current weights $W$, is the probability vector for $\mathbf{x}$, takes the form:
$$ h(\mathbf{x}; W) = \begin{pmatrix} P(y=1|\mathbf{x};\mathbf{w}) \\ P(y=2|\mathbf{x};\mathbf{w}) \\ \vdots \\ P(y=K|\mathbf{x};\mathbf{w}) \end{pmatrix} = \frac{1}{\sum_{j=1}^{K}\exp\big(\mathbf{w}_j^{\top}\mathbf{x}\big)} \begin{pmatrix} \exp(\mathbf{w}_1^{\top}\mathbf{x}) \\ \exp(\mathbf{w}_2^{\top}\mathbf{x}) \\ \vdots \\ \exp(\mathbf{w}_K^{\top}\mathbf{x}) \end{pmatrix}. $$
In total we have $K$ sets of parameters, $\mathbf{w}_1, \mathbf{w}_2, \dots, \mathbf{w}_K$, and the factor $\sum_{j=1}^{K}{\exp\big(\mathbf{w}_j^{\top} \mathbf{x}\big)}$ normalizes the results to be a probability.
When we implement softmax regression, it is usually convenient to represent $W$, containing all $K$ sets of parameters, as an $n\times K$ matrix obtained by concatenating $\mathbf{w}_1, \mathbf{w}_2, \dots, \mathbf{w}_K$ as columns, so that $\mathbf{w}_k = (w_{k1}, \dots, w_{kn})^{\top} = (w_{kl})$ for $l = 1,\dots, n$:
$$ W = \left( \begin{array}{cccc}| & | & | & | \\ \mathbf{w}_1 & \mathbf{w}_2 & \cdots & \mathbf{w}_K \\ | & | & | & | \end{array}\right), $$so that $W^{\top}\mathbf{x}$ can be computed in a vectorized way.
Define the following indicator function: $$ 1_{\{y = k\}} = 1_{\{k\}}(y) = \delta_{yk} = \begin{cases} 1 & \text{when } y = k, \\[5pt] 0 & \text{otherwise}. \end{cases} $$ First let us recall the loss function for logistic regression with $N$ training samples $(\mathbf{x}^{(i)}, y^{(i)})$, and rewrite it using this indicator: $$ \begin{aligned} L^{\text{Logistic}} (\mathbf{w}) &= - \frac{1}{N}\sum_{i=1}^N \Bigl\{ y^{(i)} \ln\big( h(\mathbf{x}^{(i)}) \big) + \big(1 - y^{(i)}\big) \ln\big( 1 - h(\mathbf{x}^{(i)}) \big) \Bigr\} \\ & = - \frac{1}{N}\sum_{i=1}^N \sum_{k=0}^1 \Bigl\{ 1_{\{y^{(i)} = k\}} \ln P\big(y^{(i)}=k | \mathbf{x}^{(i)} ; \mathbf{w} \big) \Bigr\}. \end{aligned} $$
Now our loss function for softmax regression is the generalization of the above:
$$ \begin{aligned} L (\mathbf{w}) = L^{\text{Softmax}} (\mathbf{w}) & = - \frac{1}{N}\sum_{i=1}^N \sum_{k=1}^K \Bigl\{ 1_{\{y^{(i)} = k\}} \ln P\big(y^{(i)}=k | \mathbf{x}^{(i)} ; \mathbf{w} \big) \Bigr\} \\ & = - \frac{1}{N}\sum_{i=1}^N \sum_{k=1}^K \left\{1_{\{y^{(i)} = k\}} \ln \Bigg( \frac{\exp(\mathbf{w}_k^{\top} \mathbf{x}^{(i)})}{\sum_{j=1}^{K} \exp\big(\mathbf{w}_j^{\top} \mathbf{x}^{(i)} \big) } \Bigg)\right\}. \end{aligned} $$Notice that in the inner sum over the labels, $\sum_{k=1}^K$, the indicator $1_{\{y^{(i)} = k\}} = 1$ for exactly one of the $K$ terms and is 0 for the rest. The loss function above is the average of the cross-entropy for each sample: $$ H(p,q)\ =\ -\sum^{K}_{k=1}p_{k}\log q_{k}, $$ where $p_{k}$ is the ground truth probability that this sample is in the $k$-th class (known), and $q_{k}$ is our model's estimated/predicted probability.
Notice our weights to be trained have $K$ sets $\mathbf{w}_1, \mathbf{w}_2, \dots, \mathbf{w}_K$, and each $k$-th weight vector has $n$ components: $\mathbf{w}_k = (w_{k1}, \dots, w_{kl},\dots, w_{kn})^{\top}$. The first subscript is $1\leq k \leq K$ (the label's index; we have this many sets of weights), the second subscript is $1\leq l\leq n$ ($\mathbf{x}$'s feature index).
The indices involved are pretty complicated; to simplify the notation a bit, denote the probability predicted by our model that the $i$-th training sample is in the $k$-th class as:
$$ \sigma_{k}^{(i)}:= P\big(y^{(i)}=k | \mathbf{x}^{(i)} ; \mathbf{w} \big) = \frac{\exp(\mathbf{w}_k^{\top} \mathbf{x}^{(i)})} {\sum_{m=1}^{K} \exp(\mathbf{w}_m^{\top} \mathbf{x}^{(i)} )}, $$then $$ \frac{\partial L }{\partial w_{jl}} = - \frac{1}{N}\sum_{i=1}^N \sum_{k=1}^K \left\{ 1_{\{y^{(i)} = k\} } \frac{\partial}{\partial w_{jl}}\Big( \ln \sigma_{k}^{(i)}\Big) \right\} = - \frac{1}{N}\sum_{i=1}^N \sum_{k=1}^K \left\{ 1_{\{y^{(i)} = k\} } \frac{1}{\sigma_{k}^{(i)}}\frac{\partial}{\partial w_{jl}} \sigma_{k}^{(i)} \right\}. \tag{$\star$} $$
Now computing the partial derivative above:
$$ \begin{aligned} \frac{\partial \sigma_{k}^{(i)}}{\partial w_{jl}} &= \frac{\partial }{\partial w_{jl}} \left( \frac{\exp(\mathbf{w}_k^{\top} \mathbf{x}^{(i)})} {\sum_{m=1}^{K} \exp(\mathbf{w}_m^{\top} \mathbf{x}^{(i)} )}\right) \\ &= \frac{1}{\sum_{m=1}^{K}\exp(\mathbf{w}_m^{\top} \mathbf{x}^{(i)} )} \frac{\partial }{\partial w_{jl}} \left( \exp(\mathbf{w}_k^{\top} \mathbf{x}^{(i)})\right) - \frac{\exp(\mathbf{w}_k^{\top} \mathbf{x}^{(i)})} { \left(\sum_{m=1}^{K}\exp(\mathbf{w}_m^{\top} \mathbf{x}^{(i)}) \right)^2} \frac{\partial }{\partial w_{jl}} \left( \sum_{m=1}^{K}\exp(\mathbf{w}_m^{\top} \mathbf{x}^{(i)}) \right) \\ &= \frac{1}{\sum_{m=1}^{K}\exp(\mathbf{w}_m^{\top} \mathbf{x}^{(i)} )} 1_{\{j = k\}} \exp(\mathbf{w}_k^{\top} \mathbf{x}^{(i)}) \frac{\partial }{\partial w_{jl}} \left( \mathbf{w}_k^{\top} \mathbf{x}^{(i)} \right) - \frac{\exp(\mathbf{w}_k^{\top} \mathbf{x}^{(i)})} { \left(\sum_{m=1}^{K}\exp(\mathbf{w}_m^{\top} \mathbf{x}^{(i)}) \right)^2} \exp(\mathbf{w}_j^{\top} \mathbf{x}^{(i)}) \frac{\partial }{\partial w_{jl}} \left( \mathbf{w}_j^{\top} \mathbf{x}^{(i)} \right). \end{aligned} \tag{$\dagger$} $$By the property of the indicator function, we have:
$$ 1_{\{j = k\}} \exp(\mathbf{w}_k^{\top} \mathbf{x}^{(i)}) \frac{\partial }{\partial w_{jl}} \left( \mathbf{w}_k^{\top} \mathbf{x}^{(i)} \right) = \begin{cases} \exp(\mathbf{w}_j^{\top} \mathbf{x}^{(i)}) x_l^{(i)} & \text{if } j=k, \\[3pt] 0 & \text{if }j\neq k. \end{cases} $$Hence, $(\dagger)$ can be further written as:
$$ \begin{aligned} \frac{\partial \sigma_{k}^{(i)}}{\partial w_{jl}} = \frac{\exp(\mathbf{w}_k^{\top} \mathbf{x}^{(i)})} { \sum_{m=1}^{K}\exp(\mathbf{w}_m^{\top} \mathbf{x}^{(i)}) } \left( 1_{\{j = k\}} - \frac{\exp(\mathbf{w}_j^{\top} \mathbf{x}^{(i)})} { \sum_{m=1}^{K}\exp(\mathbf{w}_m^{\top} \mathbf{x}^{(i)}) } \right) x^{(i)}_l = \sigma_{k}^{(i)} \left( 1_{\{j = k\}} - \sigma_{j}^{(i)} \right) x^{(i)}_l. \end{aligned} $$Now plugging this back to $(\star)$:
$$ \begin{aligned} \frac{\partial L }{\partial w_{jl}} &= - \frac{1}{N}\sum_{i=1}^N \sum_{k=1}^K \left\{ 1_{\{y^{(i)} = k\} } \left( 1_{\{j = k\}} - \sigma_{j}^{(i)} \right) x^{(i)}_l \right\} \\ &=- \frac{1}{N}\sum_{i=1}^N x^{(i)}_l\left\{ \sum_{k=1}^K 1_{\{y^{(i)} = k\} } 1_{\{j = k\}} - \sum_{k=1}^K 1_{\{y^{(i)} = k\} } \sigma_{j}^{(i)} \right\} \\ &= - \frac{1}{N}\sum_{i=1}^N x^{(i)}_l\left( 1_{\{y^{(i)} = j\} } - \sigma_{j}^{(i)} \right) \\ & = - \frac{1}{N}\sum_{i=1}^N x^{(i)}_l\Big( 1_{\{y^{(i)} = j\} } - P\big(y^{(i)}=j | \mathbf{x}^{(i)} ; \mathbf{w} \big) \Big). \end{aligned} $$This is pretty simple, and it has a nice interpretation similar to the maximum likelihood function: the term in the parentheses is the difference between the actual probability and the probability estimated by our model.
Now the derivative of $L$ with respect to the whole $k$-th set of weights is:
$$ \frac{\partial L }{\partial \mathbf{w}_{k}} = - \frac{1}{N}\sum_{i=1}^N \mathbf{x}^{(i)}\left( 1_{\{y^{(i)} = k\} } - \sigma_{k}^{(i)} \right) = \frac{1}{N}\sum_{i=1}^N \mathbf{x}^{(i)}\left( \frac{\exp(\mathbf{w}_k^{\top} \mathbf{x}^{(i)})} {\sum_{m=1}^{K} \exp(\mathbf{w}_m^{\top} \mathbf{x}^{(i)} )} -1_{\{y^{(i)} = k\} } \right). \tag{$\diamond$} $$
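The gradient $(\diamond)$ for all $K$ columns at once can be written as a single matrix product using a one-hot encoding of the labels; a minimal sketch (equivalent to gradient_loss above, assuming the same n, K, and sigma defined in the earlier cells):
def gradient_loss_vectorized(w, X, y):
    # Y_onehot[i, k] = 1 if sample i is in class k, else 0
    Y_onehot = (y.astype(int).reshape(-1, 1) == np.arange(K)).astype(float)
    # (diamond): the columns of X^T (sigma - Y_onehot) / N are dL/dw_k
    G = X.T @ (sigma(X, w) - Y_onehot) / len(y)
    return G.reshape(n*K)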