In the last lecture, we learned the goal of a binary classification problem. To sum up:
Now let us look at the famous MNIST dataset of handwritten digits; please download the npz
file from Canvas.
import numpy as np
import matplotlib.pyplot as plt
data_train = np.load('mnist_binary_train.npz')
The labels are stored in data_train['y'], and each row of data_train['X']
has 784 columns representing one image. Let us try to visualize the first 20 rows of the training data, with their labels.
X_train = data_train['X']
y_train = data_train['y']
fig, axes = plt.subplots(4,5, figsize=(12, 14))
axes = axes.reshape(-1)
for i in range(20):
    axes[i].axis('off')
    axes[i].imshow(X_train[i, :].reshape(28, 28), cmap='gray')
    axes[i].set_title(str(int(y_train[i])), color='black', fontsize=25)
plt.show()
The weight vector $\mathbf{w}$ has the same shape as a sample's feature vector $\mathbf{x}$; $h(\mathbf{x})$ is our estimate of $P(y=1|\mathbf{x})$, and $1 - h(\mathbf{x})$ is our estimate of $P(y=0|\mathbf{x}) = 1 - P(y=1|\mathbf{x})$.
$$ h(\mathbf{x}) = h(\mathbf{x};\mathbf{w}) = \frac{1}{1 + \exp(-\mathbf{w}^\top \mathbf{x})} =: \sigma(\mathbf{w}^\top \mathbf{x}) $$where $\sigma(z)$ is the Sigmoid function $1/(1+e^{-z})$, or more compactly, because $y = 0$ or $1$: $$ P(y|\mathbf{x}) \text{ is estimated by } h(\mathbf{x})^y \big(1 - h(\mathbf{x}) \big)^{1-y}. $$
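Two quick numeric checks of the facts just used, namely $\sigma(z) + \sigma(-z) = 1$ and that $h^y(1-h)^{1-y}$ selects the right probability (a standalone sketch, not part of the lecture code):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-3.0, 0.0, 2.5])
print(sigmoid(z) + sigmoid(-z))   # all ones: P(y=1|x) + P(y=0|x) = 1

h = sigmoid(1.2)                  # some estimate of P(y=1|x)
print(h**1 * (1 - h)**0)          # y = 1: the compact form equals h
print(h**0 * (1 - h)**1)          # y = 0: the compact form equals 1 - h
```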
The cross-entropy loss for two probability distributions is defined as follows, where $K$ is the number of classes and $\hat{y}$ is the prediction from the model (an estimate of $y$): $$ H(p,q)\ =\ -\sum^{K}_{k=1}p_{k}\log q_{k}\ =\ -y\log {\hat {y}}-(1-y)\log(1-{\hat {y}}) $$ Since we estimate $y$ using $h(\mathbf{x})$, $$ L (\mathbf{w}; X, \mathbf{y}) = - \frac{1}{N}\sum_{i=1}^N \Bigl[ y^{(i)} \ln\big( h(\mathbf{x}^{(i)}; \mathbf{w}) \big) + \big(1 - y^{(i)}\big) \ln\big( 1 - h(\mathbf{x}^{(i)}; \mathbf{w}) \big) \Bigr].
\tag{$\star$} $$
The gradient of the loss function $(\star)$ with respect to the weights $\mathbf{w}$ is:
$$ \nabla_{\mathbf{w}} \big( L (\mathbf{w}) \big) =\frac{1}{N}\sum_{i=1}^N \big( h(\mathbf{x}^{(i)};\mathbf{w}) - y^{(i)} \big) \mathbf{x}^{(i)} . \tag{$\dagger$} $$
Now let us run gradient descent based on $(\dagger)$, with code adapted from Lecture 12.
# Initialization
N = len(y_train)
w = np.zeros(np.shape(X_train)[1])
# zero initial guess; np.shape(X_train)[1] = 784, which is the no. of pixels,
# and we want the initial guess to be small

# model h(X; w) = sigma(Xw)
# w: weights, shape (784,)
# X: training data,
#    shape 12665 (no. of samples) rows by 784 (no. of features) columns
def h(w, X):
    z = np.matmul(X, w)
    return 1.0 / (1.0 + np.exp(-z))
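One practical aside (not part of the lecture code): for very negative $z$, `np.exp(-z)` overflows and NumPy emits a RuntimeWarning, even though the final value still rounds to the right answer. A standard numerically stable rewrite evaluates an algebraically equal expression on each sign of $z$; a sketch:

```python
import numpy as np

def sigmoid_stable(z):
    # evaluate sigma(z) without ever exponentiating a positive number
    z = np.asarray(z, dtype=float)
    out = np.empty_like(z)
    pos = z >= 0
    # z >= 0: exp(-z) <= 1, so the usual formula is safe
    out[pos] = 1.0 / (1.0 + np.exp(-z[pos]))
    # z < 0: sigma(z) = exp(z) / (1 + exp(z)), and exp(z) <= 1 here
    ez = np.exp(z[~pos])
    out[~pos] = ez / (1.0 + ez)
    return out

print(sigmoid_stable(np.array([-1000.0, 0.0, 1000.0])))  # [0.  0.5 1. ]
```

(`scipy.special.expit` provides the same function ready-made.)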
# loss function, averaged over N (size of training data);
# a vectorized implementation without a for loop
def loss(w, X, y):
    loss_components = np.log(h(w, X)) * y + (1.0 - y) * np.log(1 - h(w, X))
    # above is a dimension (12665,) array
    return -np.mean(loss_components)  # same as -loss_components.sum()/N
def gradient_loss(w, X, y):
    gradient_for_all_training_data = (h(w, X) - y).reshape(-1, 1) * X
    # return a (784,) array: the average over all 12665 training samples
    return np.mean(gradient_for_all_training_data, axis=0)
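As a sanity check on formula $(\dagger)$ and its vectorization, one can compare the analytic gradient against central finite differences of the loss on a small synthetic problem (a standalone sketch; the data here are made up):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(w, X, y):
    h = sigmoid(X @ w)
    return -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))

def gradient_loss(w, X, y):
    # broadcasting form used above; equivalent to X.T @ (h - y) / N
    return np.mean((sigmoid(X @ w) - y).reshape(-1, 1) * X, axis=0)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
y = rng.integers(0, 2, size=50).astype(float)
w = rng.normal(size=4) * 0.1

# central finite differences, one coordinate at a time
eps = 1e-6
fd = np.array([(loss(w + eps * e, X, y) - loss(w - eps * e, X, y)) / (2 * eps)
               for e in np.eye(4)])

print(np.max(np.abs(fd - gradient_loss(w, X, y))))  # should be very small
```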
eta = 5e-5  # step size (learning rate)
num_steps = 500

# this block is for the plots
fig, axes = plt.subplots(2, 5, figsize=(12, 5))
axes = axes.reshape(-1)
loss_at_eachstep = np.zeros(num_steps)  # record the change of the loss function
for i in range(num_steps):
    loss_at_eachstep[i] = loss(w, X_train, y_train)  # this step is optional
    dw = gradient_loss(w, X_train, y_train)
    w -= eta * dw
    if i % 50 == 0:  # plot weights and print loss every 50 steps
        print("loss after", i + 1, "iterations is: ", loss(w, X_train, y_train))
        axes[i//50].axis('off')
        axes[i//50].imshow(w.reshape(28, 28), cmap='gray')
        axes[i//50].set_title("%4i iterations" % i)
        fig.canvas.draw()
        fig.canvas.flush_events()
plt.plot(range(num_steps), loss_at_eachstep)
plt.yscale('log')
plt.show()
Now let us use the testing data set to see if the accuracy is good.
# import the testing data and extract zeros and ones like we did before
data_test = np.load('mnist_binary_test.npz')
X_test = data_test['X']
y_test = data_test['y']
# compute the y_pred using the weights w and X_test
probability_estimate = h(w,X_test)
y_pred = 1*(probability_estimate > 0.5) # integer
# if probability_estimate is > 0.5, it is the 2nd class (class 1), otherwise it is the first class (class 0)
percentage_getting_label_correct = np.mean(y_pred == y_test)
print("{:.5%}".format(percentage_getting_label_correct))
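Beyond the overall accuracy, it can be informative to tabulate how the errors split between the two classes with a confusion matrix. A standalone sketch of the bookkeeping, using made-up labels and predictions (in the notebook these would be y_test and y_pred):

```python
import numpy as np

# hypothetical labels and predictions, purely for illustration
y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0])
y_hat  = np.array([0, 1, 1, 1, 0, 0, 1, 0])

cm = np.zeros((2, 2), dtype=int)
for t, p in zip(y_true, y_hat):
    cm[t, p] += 1               # rows: true class, columns: predicted class

print(cm)                       # [[3 1]
                                #  [1 3]]
print(np.trace(cm) / cm.sum())  # accuracy = diagonal / total = 0.75
```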
Read the manual of the logistic regression class in scikit-learn, and follow the example there to redo the classification above.
from sklearn.linear_model import LogisticRegression
mnist_binary_reg = LogisticRegression(solver='lbfgs', max_iter=1000, verbose=True)
mnist_binary_reg.fit(X_train,y_train)
y_pred = mnist_binary_reg.predict(X_test)
percentage_getting_label_correct = np.mean(y_pred == y_test)
print("{:.5%}".format(percentage_getting_label_correct))