In the last lecture, we learned the goal of a binary classification problem. To sum up:
Now let us look at the famous MNIST dataset of handwritten digits; please download the npz
file from Canvas.
import numpy as np
import matplotlib.pyplot as plt
data_train = np.load('mnist_binary_train.npz')
The labels are stored in data_train['y'], and each row of data_train['X']
has 784 columns representing one image. Let us try to visualize the first 20 rows of the training data, with their labels.
X_train = data_train['X']
y_train = data_train['y']
fig, axes = plt.subplots(4,5, figsize=(12, 14))
axes = axes.reshape(-1)
for i in range(20):
    axes[i].axis('off')
    axes[i].imshow(X_train[i, :].reshape(28, 28), cmap='gray')
    axes[i].set_title(str(int(y_train[i])), color='black', fontsize=25)
plt.show()
The weight vector $\mathbf{w}$ has the same shape as a sample's feature vector $\mathbf{x}$; $h(\mathbf{x})$ is our estimate of $P(y=1|\mathbf{x})$, and $1 - h(\mathbf{x})$ is our estimate of $P(y=0|\mathbf{x}) = 1 - P(y=1|\mathbf{x})$.
$$ h(\mathbf{x}) = h(\mathbf{x};\mathbf{w}) = \frac{1}{1 + \exp(-\mathbf{w}^\top \mathbf{x})} =: \sigma(\mathbf{w}^\top \mathbf{x}) $$where $\sigma(z)$ is the Sigmoid function $1/(1+e^{-z})$, or more compactly, because $y = 0$ or $1$: $$ P(y|\mathbf{x}) \text{ is estimated by } h(\mathbf{x})^y \big(1 - h(\mathbf{x}) \big)^{1-y}. $$
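Two quick numeric checks of the facts just used, namely $\sigma(z) + \sigma(-z) = 1$ and that $h^y(1-h)^{1-y}$ selects the right probability (a standalone sketch, not part of the lecture code):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-3.0, 0.0, 2.5])
print(sigmoid(z) + sigmoid(-z))   # all ones: P(y=1|x) + P(y=0|x) = 1

h = sigmoid(1.2)                  # some estimate of P(y=1|x)
print(h**1 * (1 - h)**0)          # y = 1: the compact form equals h
print(h**0 * (1 - h)**1)          # y = 0: the compact form equals 1 - h
```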
The cross-entropy loss for two probability distributions is defined as follows, where $K$ is the number of classes and $\hat{y}$ is the prediction from the model (an estimate of $y$): $$ H(p,q)\ =\ -\sum^{K}_{k=1}p_{k}\log q_{k}\ =\ -y\log {\hat {y}}-(1-y)\log(1-{\hat {y}}) $$ Since we estimate $y$ using $h(\mathbf{x})$, $$ L (\mathbf{w}; X, \mathbf{y}) = - \frac{1}{N}\sum_{i=1}^N \Bigl[ y^{(i)} \ln\big( h(\mathbf{x}^{(i)}; \mathbf{w}) \big) + \big(1 - y^{(i)}\big) \ln\big( 1 - h(\mathbf{x}^{(i)}; \mathbf{w}) \big) \Bigr].
\tag{$\star$} $$
The gradient of the loss function $(\star)$ with respect to the weights $\mathbf{w}$ is:
$$ \nabla_{\mathbf{w}} \big( L (\mathbf{w}) \big) =\frac{1}{N}\sum_{i=1}^N \big( h(\mathbf{x}^{(i)};\mathbf{w}) - y^{(i)} \big) \mathbf{x}^{(i)} . \tag{$\dagger$} $$
Now let us run gradient descent based on $(\dagger)$, with code adapted from Lecture 12.
# Initialization
N = len(y_train)
w = np.zeros(np.shape(X_train)[1])
# zero initial guess; np.shape(X_train)[1] = 784, which is the no. of pixels,
# and we want the initial guess to be small

# model h(X; w) = sigma(Xw)
# w: weights, shape (784,)
# X: training data,
#    shape 12665 (no. of samples) rows by 784 (no. of features) columns
def h(w, X):
    z = np.matmul(X, w)
    return 1.0 / (1.0 + np.exp(-z))
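One practical aside (not part of the lecture code): for very negative $z$, `np.exp(-z)` overflows and NumPy emits a RuntimeWarning, even though the final value still rounds to the right answer. A standard numerically stable rewrite evaluates an algebraically equal expression on each sign of $z$; a sketch:

```python
import numpy as np

def sigmoid_stable(z):
    # evaluate sigma(z) without ever exponentiating a positive number
    z = np.asarray(z, dtype=float)
    out = np.empty_like(z)
    pos = z >= 0
    # z >= 0: exp(-z) <= 1, so the usual formula is safe
    out[pos] = 1.0 / (1.0 + np.exp(-z[pos]))
    # z < 0: sigma(z) = exp(z) / (1 + exp(z)), and exp(z) <= 1 here
    ez = np.exp(z[~pos])
    out[~pos] = ez / (1.0 + ez)
    return out

print(sigmoid_stable(np.array([-1000.0, 0.0, 1000.0])))  # [0.  0.5 1. ]
```

(`scipy.special.expit` provides the same function ready-made.)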
# loss function, averaged over N (size of training data);
# a vectorized implementation without a for loop
def loss(w, X, y):
    loss_components = np.log(h(w, X)) * y + (1.0 - y) * np.log(1 - h(w, X))
    # above is a dimension (12665,) array
    return -np.mean(loss_components)  # same as -loss_components.sum()/N
def gradient_loss(w, X, y):
    gradient_for_all_training_data = (h(w, X) - y).reshape(-1, 1) * X
    # return a (784,) array: the average over all 12665 training samples
    return np.mean(gradient_for_all_training_data, axis=0)
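As a sanity check on formula $(\dagger)$ and its vectorization, one can compare the analytic gradient against central finite differences of the loss on a small synthetic problem (a standalone sketch; the data here are made up):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(w, X, y):
    h = sigmoid(X @ w)
    return -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))

def gradient_loss(w, X, y):
    # broadcasting form used above; equivalent to X.T @ (h - y) / N
    return np.mean((sigmoid(X @ w) - y).reshape(-1, 1) * X, axis=0)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
y = rng.integers(0, 2, size=50).astype(float)
w = rng.normal(size=4) * 0.1

# central finite differences, one coordinate at a time
eps = 1e-6
fd = np.array([(loss(w + eps * e, X, y) - loss(w - eps * e, X, y)) / (2 * eps)
               for e in np.eye(4)])

print(np.max(np.abs(fd - gradient_loss(w, X, y))))  # should be very small
```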
eta = 5e-5  # step size (learning rate)
num_steps = 500

# this block is for the plots
fig, axes = plt.subplots(2, 5, figsize=(12, 5))
axes = axes.reshape(-1)
loss_at_eachstep = np.zeros(num_steps)  # record the change of the loss function
for i in range(num_steps):
    loss_at_eachstep[i] = loss(w, X_train, y_train)  # this step is optional
    dw = gradient_loss(w, X_train, y_train)
    w -= eta * dw
    if i % 50 == 0:  # plot weights and print loss every 50 steps
        print("loss after", i + 1, "iterations is: ", loss(w, X_train, y_train))
        axes[i//50].axis('off')
        axes[i//50].imshow(w.reshape(28, 28), cmap='gray')
        axes[i//50].set_title("%4i iterations" % i)
        fig.canvas.draw()
        fig.canvas.flush_events()
plt.plot(range(num_steps), loss_at_eachstep)
plt.yscale('log')
plt.show()
Now let us use the testing data set to see if the accuracy is good.
# import the testing data and extract zeros and ones like we did before
data_test = np.load('mnist_binary_test.npz')
X_test = data_test['X']
y_test = data_test['y']
# compute the y_pred using the weights w and X_test
probability_estimate = h(w,X_test)
y_pred = 1*(probability_estimate > 0.5) # integer
# if probability_estimate is > 0.5, it is the 2nd class (class 1), otherwise it is the first class (class 0)
percentage_getting_label_correct = np.mean(y_pred == y_test)
print("{:.5%}".format(percentage_getting_label_correct))
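Beyond the overall accuracy, it can be informative to tabulate how the errors split between the two classes with a confusion matrix. A standalone sketch of the bookkeeping, using made-up labels and predictions (in the notebook these would be y_test and y_pred):

```python
import numpy as np

# hypothetical labels and predictions, purely for illustration
y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0])
y_hat  = np.array([0, 1, 1, 1, 0, 0, 1, 0])

cm = np.zeros((2, 2), dtype=int)
for t, p in zip(y_true, y_hat):
    cm[t, p] += 1               # rows: true class, columns: predicted class

print(cm)                       # [[3 1]
                                #  [1 3]]
print(np.trace(cm) / cm.sum())  # accuracy = diagonal / total = 0.75
```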
Read the manual of the logistic regression class in scikit-learn, and follow the example there to redo the classification above.
from sklearn.linear_model import LogisticRegression
mnist_binary_reg = LogisticRegression(solver='lbfgs', max_iter=1000, verbose=True)
mnist_binary_reg.fit(X_train,y_train)
y_pred = mnist_binary_reg.predict(X_test)
percentage_getting_label_correct = np.mean(y_pred == y_test)
print("{:.5%}".format(percentage_getting_label_correct))