In the last lecture we learned logistic regression, which classifies a "0" digit versus a "1" digit based on pixel intensities on a 28x28 grid.
Today we will learn how to classify all 10 digits.
Reference: adapted from the MATLAB tutorial in the Stanford Deep Learning Tutorial.
import numpy as np
import matplotlib.pyplot as plt
from tqdm import tqdm_notebook
%matplotlib inline
Let us load the MNIST dataset of handwritten digits, both training and testing data. You can download the npz-format files from the Files tab on Canvas. (In the csv data from the Kaggle digit recognizer competition, the first column of each sample is the label and the remaining 784 columns represent a 28x28 grayscale image.) In the npz data (numpy's native zip format), X_train and X_test both have 784 columns representing a 28x28 grayscale image; y_train and y_test, which range from 0 to 9 for a total of 10 classes, are the labels of the training and testing samples, respectively.
data_train = np.load('mnist_train.npz')
X_train = data_train['X']
y_train = data_train['y']
data_test = np.load('mnist_test.npz')
X_test = data_test['X']
y_test = data_test['y']
# visualize 20 randomly chosen training images, with their labels
_, axes = plt.subplots(4,5, figsize=(16, 14))
axes = axes.reshape(-1)
# randomly choosing 20 samples to display
idx = np.random.choice(60000, size=20)
for i in range(20):
    axes[i].axis('off') # hide the axes ticks
    axes[i].imshow(X_train[idx[i],:].reshape(28,28), cmap = 'gray')
    axes[i].set_title(str(int(y_train[idx[i]])), color= 'black', fontsize=25)
plt.show()
In the logistic regression problem, for a labeled sample $(\mathbf{x},y)$ we estimated the probability that the sample is in class 0 (i.e., the image is a 0) or class 1. In softmax regression, a sample's label is $y\in \{1,\dots, K\}$, where $K$ is the number of classes, and the feature vector is $\mathbf{x}\in \mathbb{R}^n$ (in MNIST, a 28x28 image flattened to a (784,) array). We want to estimate the probability $P(y=k|\mathbf{x})$ for every class $k$ using the softmax model
$$ h(\mathbf{x}; W) = \begin{pmatrix} P(y=1|\mathbf{x};\mathbf{w}) \\ P(y=2|\mathbf{x};\mathbf{w}) \\ \vdots \\ P(y=K|\mathbf{x};\mathbf{w}) \end{pmatrix} = \frac{1}{\sum_{j=1}^{K}\exp\big(\mathbf{w}_j^{\top}\mathbf{x}\big)} \begin{pmatrix} \exp(\mathbf{w}_1^{\top}\mathbf{x}) \\ \exp(\mathbf{w}_2^{\top}\mathbf{x}) \\ \vdots \\ \exp(\mathbf{w}_K^{\top}\mathbf{x}) \end{pmatrix}, $$
where we have $K$ sets of parameters, $\mathbf{w}_1, \mathbf{w}_2, \dots, \mathbf{w}_K$, and the factor $\sum_{j=1}^{K}{\exp\big(\mathbf{w}_j^{\top} \mathbf{x}\big)}$ normalizes the results to be a probability.
$W$ is an $n\times K$ matrix containing all $K$ sets of parameters, obtained by concatenating $\mathbf{w}_1, \mathbf{w}_2, \dots, \mathbf{w}_K$ as columns, so that $\mathbf{w}_k = (w_{k1}, \dots, w_{kn})^{\top} = (w_{kl})$ for $l = 1,\dots, n$:
$$ W = \left( \begin{array}{cccc}| & | & | & | \\ \mathbf{w}_1 & \mathbf{w}_2 & \cdots & \mathbf{w}_K \\ | & | & | & | \end{array}\right), $$so that $W^{\top}\mathbf{x}$ can be computed in a vectorized way.
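To make the normalization concrete, here is a tiny numerical sketch with made-up scores for $K=3$ classes (the values of $W^{\top}\mathbf{x}$ below are hypothetical, not from the MNIST data):
import numpy as np
scores = np.array([2.0, 1.0, 0.1])            # hypothetical values of W^T x for K = 3 classes
probs = np.exp(scores) / np.exp(scores).sum()  # softmax normalization
print(probs, probs.sum())                      # about [0.659 0.242 0.099], sums to 1.0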
Define the following indicator function: $$ 1_{\{y = k\}} = 1_{\{k\}}(y) = \delta_{yk} = \begin{cases} 1 & \text{when } y = k, \\[5pt] 0 & \text{otherwise}. \end{cases} $$
The loss function again uses the cross-entropy:
$$ \begin{aligned} L (\mathbf{w};X,\mathbf{y}) & = - \frac{1}{N}\sum_{i=1}^N \sum_{k=1}^K \Bigl\{ 1_{\{y^{(i)} = k\}} \ln P\big(y^{(i)}=k | \mathbf{x}^{(i)} ; \mathbf{w} \big) \Bigr\} \\ & = - \frac{1}{N}\sum_{i=1}^N \sum_{k=1}^K \left\{1_{\{y^{(i)} = k\}} \ln \Bigg( \frac{\exp(\mathbf{w}_k^{\top} \mathbf{x}^{(i)})}{\sum_{j=1}^{K} \exp\big(\mathbf{w}_j^{\top} \mathbf{x}^{(i)} \big) } \Bigg)\right\}. \end{aligned} $$Notice that in the inner sum over the labels, $1_{\{y^{(i)} = k\}} = 1$ for exactly one of the $K$ terms and is 0 for the rest.
For example, if a sample $(\mathbf{x},y)$ is in the 5th class (representing digit 4), $y=5$, then its one-hot vector is $$[1_{\{y^{(i)} = k\}} ]_{k=1}^{10}= [0, 0, 0, 0, 1, 0, 0, 0, 0, 0].$$ Hopefully, after training, the model above can predict something like: $$h(\mathbf{x};W) = [0.009,\; 0.01,\; 0.01,\; 0.009,\; 0.91,\; 0.009,\; 0.009,\; 0.01,\; 0.009,\; 0.010]$$
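For this sample only the $k=5$ term of the inner sum survives, so its contribution to the loss would be $-\ln(0.91)\approx 0.094$; a quick check with the hypothetical numbers above:
import numpy as np
p = np.array([0,0,0,0,1,0,0,0,0,0])   # one-hot ground truth for the 5th class (digit 4)
q = np.array([0.009,0.01,0.01,0.009,0.91,0.009,0.009,0.01,0.009,0.010])  # hypothetical model output
print(-np.sum(p * np.log(q)))          # about 0.0943, i.e. -ln(0.91)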
Now the gradient of $L$ with respect to the whole $k$-th set of weights is:
$$ \frac{\partial L }{\partial \mathbf{w}_{k}} = - \frac{1}{N}\sum_{i=1}^N \mathbf{x}^{(i)}\left( 1_{\{y^{(i)} = k\} } - \big(\text{$k$-th component of } h(\mathbf{x}^{(i)};W)\big) \right) = \frac{1}{N}\sum_{i=1}^N \mathbf{x}^{(i)}\left( \frac{\exp(\mathbf{w}_k^{\top} \mathbf{x}^{(i)})} {\sum_{m=1}^{K} \exp(\mathbf{w}_m^{\top} \mathbf{x}^{(i)} )} -1_{\{y^{(i)} = k\} } \right). \tag{$\diamond$} $$One big challenge here is that the weights are now represented by a matrix.
The class with the largest estimated probability is taken as the sample's predicted label: $$ \hat{y} = \operatorname{arg}\max_{j} P\big(y = j| \mathbf{x}\big). $$
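Using the hypothetical output above, a one-line sketch of this decision rule (note that numpy's argmax returns a 0-based index):
import numpy as np
h = np.array([0.009,0.01,0.01,0.009,0.91,0.009,0.009,0.01,0.009,0.010])  # hypothetical h(x; W)
print(np.argmax(h))   # 4 in 0-based indexing, i.e. the 5th class, which is the digit 4 in MNIST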
N = len(y_train) # number of training samples
n = np.shape(X_train)[1] # 784, which is number of pixels (number of features)
K = 10 # number of classes
np.random.seed(42)
w = 1e-4*np.random.random(n*K)
# w: a (7840,) array, a small random initial guess
# 7840 = 784x10: 784 features, 10 classes
# during computation it will be reshaped into 10 columns standing for 10 sets of weights
# model sigma(X; w)
# w: 10 sets of weights
# X: training samples, of shape 60000 rows by 784 columns
# output: (60000, 10), the i-th row represents the probabilities of the i-th sample
# being in the k-th class (k-th column entry)
def sigma(X,w):
    W = w.reshape(n,K)
    s = np.exp(np.matmul(X,W))
    total = np.sum(s, axis=1).reshape(-1,1)
    prob = s / total
    return prob
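One practical caveat (an addition, not part of the lecture code): np.exp can overflow once the scores in np.matmul(X,W) grow large during training. A common remedy is to subtract each row's maximum score before exponentiating, which leaves the probabilities unchanged; a minimal sketch of such a variant, reusing n and K from above:
def sigma_stable(X, w):
    # same as sigma, but shift each row of scores by its max to avoid overflow in np.exp
    W = w.reshape(n, K)
    scores = np.matmul(X, W)
    scores -= scores.max(axis=1, keepdims=True)   # softmax is invariant to this per-row shift
    s = np.exp(scores)
    return s / s.sum(axis=1, keepdims=True)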
# loss function, averaged over N (size of the training data)
# a vectorized implementation, with a for loop of only 10 iterations
def loss(w,X,y):
    loss_components = np.zeros(N)
    for k in range(K):
        loss_components += np.log(sigma(X,w))[:,k] * (y == k)
        # loss_components is a (60000,) array
    return -np.mean(loss_components) # same as loss_components.sum()/N
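As a quick sanity check (an expectation, not a number from the lecture): with the tiny random initial w, the predicted probabilities are nearly uniform over the 10 classes, so the initial loss should be close to $\ln 10 \approx 2.303$.
print(loss(w, X_train, y_train))   # expected to be near np.log(10) ≈ 2.3026 before training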
def gradient_loss(w,X,y):
    gradient_for_each_weight_class = np.empty([n,K])
    # 10 columns, each column represents the gradient w.r.t. one set of weights w_k
    for k in range(K):
        gradient_for_all_training_data_for_class_k = (sigma(X,w)[:,k] - (y==k)).reshape(-1,1)*X
        gradient_for_each_weight_class[:,k] = np.mean(gradient_for_all_training_data_for_class_k, axis=0)
        # each column is a (784,) array, obtained by averaging over all 60000 training samples
    return gradient_for_each_weight_class.reshape(n*K)
    # flatten back to a (7840,) array matching the shape of w
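To gain some confidence in the gradient formula and its implementation, we can compare a few entries of gradient_loss against a centered finite difference of loss (a quick sketch; the coordinates and step size below are arbitrary choices, and the check is slow since each loss evaluation passes over the full training set):
eps = 1e-6
g = gradient_loss(w, X_train, y_train)
for idx in [4060, 4063, 4067]:             # weights for a pixel near the image center, chosen arbitrarily
    e = np.zeros_like(w); e[idx] = eps
    fd = (loss(w + e, X_train, y_train) - loss(w - e, X_train, y_train)) / (2*eps)
    print(idx, fd, g[idx])                  # the two numbers should nearly agree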
For a fixed set of weights, sigma(X,w) gives 10 probabilities for each sample (training or testing). Here we implement an accuracy-checking function: it takes X_train or X_test as input, computes for each sample the class label with the highest probability, and returns the accuracy.
# we define a function that computes the classification accuracy
def check_acc(w,X,y):
    prob = sigma(X,w) # for each sample, it computes 10 probabilities based on current weight w
    highest_prob_index = np.argmax(prob, axis=1)
    return np.mean(highest_prob_index == y.astype(int))
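Before training, the weights are essentially random, so the accuracy should be near the chance level of roughly 10% (an expectation to verify, not a figure from the lecture):
print(check_acc(w, X_train, y_train))   # roughly 0.1 for the untrained random weights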
eta = 1e-5 # step size (learning rate)
num_steps = 5
for i in tqdm_notebook(range(num_steps)):
    dw = gradient_loss(w,X_train,y_train)
    w -= eta * dw
    print("Training accuracy after", i+1, "iterations is: ", check_acc(w,X_train,y_train))
    print("Testing accuracy after", i+1, "iterations is: ", check_acc(w,X_test,y_test))
    # keep track of training and testing accuracy to make sure we are heading in the right direction
    print("loss after", i+1, "iterations is: ", loss(w,X_train,y_train))
Because our dataset is big, one iteration of gradient descent requires evaluating the gradient over all the training samples, which takes $O(N\cdot d)$ CPU time ($N$: number of training samples, $d$: number of features in each sample). Stochastic gradient descent, which remedies this, comes next time...
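As a preview only (a sketch; the batch size and number of steps are arbitrary, and the details are left to the next lecture), stochastic gradient descent replaces the full-data gradient with the gradient on a small random minibatch, so each step costs $O(B\cdot d)$ with $B \ll N$:
# minibatch stochastic gradient descent, preview sketch
batch_size = 64
for step in range(100):
    batch = np.random.choice(N, size=batch_size, replace=False)   # random minibatch indices
    dw = gradient_loss(w, X_train[batch], y_train[batch])          # gradient on the minibatch only
    w -= eta * dw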
We can use scikit-learn's LogisticRegression() class in the linear_model module to perform this task for us. Quoting the reference:
In the multiclass case, the training algorithm uses the one-vs-rest (OvR) scheme if the 'multi_class' option is set to 'ovr', and uses the cross-entropy loss if the 'multi_class' option is set to 'multinomial'. (Currently the 'multinomial' option is supported only by the 'lbfgs', 'sag' and 'newton-cg' solvers.)
For small datasets, 'liblinear' is a good choice, whereas 'sag' and 'saga' are faster for large ones.
For multiclass problems, only 'newton-cg', 'sag', 'saga' and 'lbfgs' handle multinomial loss; 'liblinear' is limited to one-versus-rest schemes. 'newton-cg', 'lbfgs' and 'sag' only handle L2 penalty, whereas 'liblinear' and 'saga' handle L1 penalty.
# copy the example
from sklearn.linear_model import LogisticRegression
mnist_clf = LogisticRegression(random_state=42,
solver='lbfgs', tol= 1e-5, max_iter = 2000, verbose=True,
multi_class='multinomial')
# a faster solver is sag according to the reference
# verbose prints output during training (for the liblinear and lbfgs solvers)
mnist_clf.fit(X_train[:10000,:], y_train[:10000]) # we only use first 10000 images as training data
mnist_clf.predict(X_test[:10, :])
# we visualize the first 10 rows of X_test to see how the predictions compare
# visualize the first 10 rows of the testing data, with their true labels
_, axes = plt.subplots(2,5, figsize=(16, 7))
axes = axes.reshape(-1)
for i in range(10):
    axes[i].axis('off') # hide the axes ticks
    axes[i].imshow(X_test[i,:].reshape(28,28), cmap = 'gray')
    axes[i].set_title(str(int(y_test[i])), color= 'black', fontsize=25)
plt.show()
Our softmax got 8 out of 10 correct, not a bad score. Running the following cell will give you the prediction accuracy on the testing samples.
mnist_clf.score(X_test, y_test)
Given a test input $\mathbf{x}\in \mathbb{R}^n$ (a 28x28 image flattened to a (784,) array), we want to estimate the probability $P(y=k|\mathbf{x})$ for each value of $k=1,\dots,K$ using a certain model (hypothesis). In other words, from the input image, we want to estimate the probability of this image being classified as each label among the $K$ labels, and we choose the most probable one to label this image as our prediction based on the model. Thus, for each sample, our model (hypothesis) outputs a $K$-dimensional vector (whose elements sum to $1$ to make it a probability) giving us our $K$ estimated probabilities. Concretely, our model $h(\mathbf{x}; W)$, which, given the current weights $W$, is the probability vector for $\mathbf{x}$, takes the form:
$$ h(\mathbf{x}; W) = \begin{pmatrix} P(y=1|\mathbf{x};\mathbf{w}) \\ P(y=2|\mathbf{x};\mathbf{w}) \\ \vdots \\ P(y=K|\mathbf{x};\mathbf{w}) \end{pmatrix} = \frac{1}{\sum_{j=1}^{K}\exp\big(\mathbf{w}_j^{\top}\mathbf{x}\big)} \begin{pmatrix} \exp(\mathbf{w}_1^{\top}\mathbf{x}) \\ \exp(\mathbf{w}_2^{\top}\mathbf{x}) \\ \vdots \\ \exp(\mathbf{w}_K^{\top}\mathbf{x}) \end{pmatrix}. $$
In total we have $K$ sets of parameters, $\mathbf{w}_1, \mathbf{w}_2, \dots, \mathbf{w}_K$, and the factor $\sum_{j=1}^{K}{\exp\big(\mathbf{w}_j^{\top} \mathbf{x}\big)}$ normalizes the results to be a probability.
When we implement softmax regression, it is usually convenient to represent $W$, containing all $K$ sets of parameters, as an $n\times K$ matrix obtained by concatenating $\mathbf{w}_1, \mathbf{w}_2, \dots, \mathbf{w}_K$ as columns, so that $\mathbf{w}_k = (w_{k1}, \dots, w_{kn})^{\top} = (w_{kl})$ for $l = 1,\dots, n$:
$$ W = \left( \begin{array}{cccc}| & | & | & | \\ \mathbf{w}_1 & \mathbf{w}_2 & \cdots & \mathbf{w}_K \\ | & | & | & | \end{array}\right), $$so that $W^{\top}\mathbf{x}$ can be computed in a vectorized way.
Define the following indicator function: $$ 1_{\{y = k\}} = 1_{\{k\}}(y) = \delta_{yk} = \begin{cases} 1 & \text{when } y = k, \\[5pt] 0 & \text{otherwise}. \end{cases} $$ First let us recall the loss function for logistic regression with $N$ training samples $(\mathbf{x}^{(i)}, y^{(i)})$, and rewrite it using this indicator: $$ \begin{aligned} L^{\text{Logistic}} (\mathbf{w}) &= - \frac{1}{N}\sum_{i=1}^N \Bigl\{ y^{(i)} \ln\big( h(\mathbf{x}^{(i)}) \big) + \big(1 - y^{(i)}\big) \ln\big( 1 - h(\mathbf{x}^{(i)}) \big) \Bigr\} \\ & = - \frac{1}{N}\sum_{i=1}^N \sum_{k=0}^1 \Bigl\{ 1_{\{y^{(i)} = k\}} \ln P\big(y^{(i)}=k | \mathbf{x}^{(i)} ; \mathbf{w} \big) \Bigr\}. \end{aligned} $$
Now our loss function for softmax regression is the generalization of the above:
$$ \begin{aligned} L (\mathbf{w}) = L^{\text{Softmax}} (\mathbf{w}) & = - \frac{1}{N}\sum_{i=1}^N \sum_{k=1}^K \Bigl\{ 1_{\{y^{(i)} = k\}} \ln P\big(y^{(i)}=k | \mathbf{x}^{(i)} ; \mathbf{w} \big) \Bigr\} \\ & = - \frac{1}{N}\sum_{i=1}^N \sum_{k=1}^K \left\{1_{\{y^{(i)} = k\}} \ln \Bigg( \frac{\exp(\mathbf{w}_k^{\top} \mathbf{x}^{(i)})}{\sum_{j=1}^{K} \exp\big(\mathbf{w}_j^{\top} \mathbf{x}^{(i)} \big) } \Bigg)\right\}. \end{aligned} $$Notice that in the inner sum over the labels, $\sum_{k=1}^K$, the indicator $1_{\{y^{(i)} = k\}} = 1$ for exactly one of the $K$ terms and is 0 for the rest. The loss function above is the average of the cross-entropy for each sample: $$ H(p,q)\ =\ -\sum^{K}_{k=1}p_{k}\log q_{k}, $$ where $p_{k}$ is the ground truth probability that this sample is in the $k$-th class (known), and $q_{k}$ is our model's estimated/predicted probability.
Notice our weights to be trained have $K$ sets $\mathbf{w}_1, \mathbf{w}_2, \dots, \mathbf{w}_K$, and each $k$-th weight vector has $n$ components: $\mathbf{w}_k = (w_{k1}, \dots, w_{kl},\dots, w_{kn})^{\top}$. The first subscript is $1\leq k \leq K$ (the label's index; we have this many sets of weights), the second subscript is $1\leq l\leq n$ ($\mathbf{x}$'s feature index).
The indices involved are pretty complicated; to simplify the notation a bit, denote the probability predicted by our model that the $i$-th training sample is in the $k$-th class as:
$$ \sigma_{k}^{(i)}:= P\big(y^{(i)}=k | \mathbf{x}^{(i)} ; \mathbf{w} \big) = \frac{\exp(\mathbf{w}_k^{\top} \mathbf{x}^{(i)})} {\sum_{m=1}^{K} \exp(\mathbf{w}_m^{\top} \mathbf{x}^{(i)} )}, $$then $$ \frac{\partial L }{\partial w_{jl}} = - \frac{1}{N}\sum_{i=1}^N \sum_{k=1}^K \left\{ 1_{\{y^{(i)} = k\} } \frac{\partial}{\partial w_{jl}}\Big( \ln \sigma_{k}^{(i)}\Big) \right\} = - \frac{1}{N}\sum_{i=1}^N \sum_{k=1}^K \left\{ 1_{\{y^{(i)} = k\} } \frac{1}{\sigma_{k}^{(i)}}\frac{\partial}{\partial w_{jl}} \sigma_{k}^{(i)} \right\}. \tag{$\star$} $$
Now computing the partial derivative above:
$$ \begin{aligned} \frac{\partial \sigma_{k}^{(i)}}{\partial w_{jl}} &= \frac{\partial }{\partial w_{jl}} \left( \frac{\exp(\mathbf{w}_k^{\top} \mathbf{x}^{(i)})} {\sum_{m=1}^{K} \exp(\mathbf{w}_m^{\top} \mathbf{x}^{(i)} )}\right) \\ &= \frac{1}{\sum_{m=1}^{K}\exp(\mathbf{w}_m^{\top} \mathbf{x}^{(i)} )} \frac{\partial }{\partial w_{jl}} \left( \exp(\mathbf{w}_k^{\top} \mathbf{x}^{(i)})\right) - \frac{\exp(\mathbf{w}_k^{\top} \mathbf{x}^{(i)})} { \left(\sum_{m=1}^{K}\exp(\mathbf{w}_m^{\top} \mathbf{x}^{(i)}) \right)^2} \frac{\partial }{\partial w_{jl}} \left( \sum_{m=1}^{K}\exp(\mathbf{w}_m^{\top} \mathbf{x}^{(i)}) \right) \\ &= \frac{1}{\sum_{m=1}^{K}\exp(\mathbf{w}_m^{\top} \mathbf{x}^{(i)} )} 1_{\{j = k\}} \exp(\mathbf{w}_k^{\top} \mathbf{x}^{(i)}) \frac{\partial }{\partial w_{jl}} \left( \mathbf{w}_k^{\top} \mathbf{x}^{(i)} \right) - \frac{\exp(\mathbf{w}_k^{\top} \mathbf{x}^{(i)})} { \left(\sum_{m=1}^{K}\exp(\mathbf{w}_m^{\top} \mathbf{x}^{(i)}) \right)^2} \exp(\mathbf{w}_j^{\top} \mathbf{x}^{(i)}) \frac{\partial }{\partial w_{jl}} \left( \mathbf{w}_j^{\top} \mathbf{x}^{(i)} \right). \end{aligned} \tag{$\dagger$} $$By the property of the indicator function, we have:
$$ 1_{\{j = k\}} \exp(\mathbf{w}_k^{\top} \mathbf{x}^{(i)}) \frac{\partial }{\partial w_{jl}} \left( \mathbf{w}_k^{\top} \mathbf{x}^{(i)} \right) = \begin{cases} \exp(\mathbf{w}_j^{\top} \mathbf{x}^{(i)}) x_l^{(i)} & \text{if } j=k, \\[3pt] 0 & \text{if }j\neq k. \end{cases} $$Hence, $(\dagger)$ can be further written as:
$$ \begin{aligned} \frac{\partial \sigma_{k}^{(i)}}{\partial w_{jl}} = \frac{\exp(\mathbf{w}_k^{\top} \mathbf{x}^{(i)})} { \sum_{m=1}^{K}\exp(\mathbf{w}_m^{\top} \mathbf{x}^{(i)}) } \left( 1_{\{j = k\}} - \frac{\exp(\mathbf{w}_j^{\top} \mathbf{x}^{(i)})} { \sum_{m=1}^{K}\exp(\mathbf{w}_m^{\top} \mathbf{x}^{(i)}) } \right) x^{(i)}_l = \sigma_{k}^{(i)} \left( 1_{\{j = k\}} - \sigma_{j}^{(i)} \right) x^{(i)}_l. \end{aligned} $$Now plugging this back to $(\star)$:
$$ \begin{aligned} \frac{\partial L }{\partial w_{jl}} &= - \frac{1}{N}\sum_{i=1}^N \sum_{k=1}^K \left\{ 1_{\{y^{(i)} = k\} } \left( 1_{\{j = k\}} - \sigma_{j}^{(i)} \right) x^{(i)}_l \right\} \\ &=- \frac{1}{N}\sum_{i=1}^N x^{(i)}_l\left\{ \sum_{k=1}^K 1_{\{y^{(i)} = k\} } 1_{\{j = k\}} - \sum_{k=1}^K 1_{\{y^{(i)} = k\} } \sigma_{j}^{(i)} \right\} \\ &= - \frac{1}{N}\sum_{i=1}^N x^{(i)}_l\left( 1_{\{y^{(i)} = j\} } - \sigma_{j}^{(i)} \right) \\ & = - \frac{1}{N}\sum_{i=1}^N x^{(i)}_l\Big( 1_{\{y^{(i)} = j\} } - P\big(y^{(i)}=j | \mathbf{x}^{(i)} ; \mathbf{w} \big) \Big). \end{aligned} $$This is pretty simple, and it has a nice interpretation similar to the maximum likelihood function: the term in the parentheses is the difference between the actual probability and the probability estimated by our model.
Now the derivative of $L$ with respect to the whole $k$-th set of weights is:
$$ \frac{\partial L }{\partial \mathbf{w}_{k}} = - \frac{1}{N}\sum_{i=1}^N \mathbf{x}^{(i)}\left( 1_{\{y^{(i)} = k\} } - \sigma_{k}^{(i)} \right) = \frac{1}{N}\sum_{i=1}^N \mathbf{x}^{(i)}\left( \frac{\exp(\mathbf{w}_k^{\top} \mathbf{x}^{(i)})} {\sum_{m=1}^{K} \exp(\mathbf{w}_m^{\top} \mathbf{x}^{(i)} )} -1_{\{y^{(i)} = k\} } \right). \tag{$\diamond$} $$
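The gradient $(\diamond)$ for all $K$ columns at once can be written as a single matrix product using a one-hot encoding of the labels; a minimal sketch (equivalent to gradient_loss above, assuming the same n, K, and sigma defined in the earlier cells):
def gradient_loss_vectorized(w, X, y):
    # Y_onehot[i, k] = 1 if sample i is in class k, else 0
    Y_onehot = (y.astype(int).reshape(-1, 1) == np.arange(K)).astype(float)
    # (diamond): the columns of X^T (sigma - Y_onehot) / N are dL/dw_k
    G = X.T @ (sigma(X, w) - Y_onehot) / len(y)
    return G.reshape(n*K)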