The following cell downloads a data set that contains $N=10000$ handwritten digit images, together with the correct label for each image. Each image is made of $28\times 28$ grayscale pixels ($m=28$) and the label is an integer from 0 to 9, so we represent the $n$-th data pair by $X^{(n)}\in\mathbb{R}^{m\times m}$ (image) and $y^{(n)}\in\{0,1,\dots,9\}$ (label).
We will use some of these data pairs to train our model. The model is simply a Python function that receives an $m\times m$ grayscale image as input and returns the integer from 0 to 9 that the image most closely resembles. By "training" we mean finding appropriate parameters of the model function based on the data pairs reserved for training.
Note that $X^{(n)}\in\mathbb{R}^{m\times m}$ and $y^{(n)}$ are accessible as X[:,:,n] and y[n] below.
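Concretely, the trained model will behave like the function sketched below. This is only a sketch of the interface: the names predict and theta are illustrative, and the actual parameters are computed by least squares later in this notebook.

import numpy as np

def predict(image, theta):
    # flatten the m-by-m image and prepend a constant feature of 1
    x = np.hstack((1, image.flatten()))
    # theta (hypothetical, found by training) has one column of
    # parameters per digit; return the digit with the largest score
    return int(np.argmax(x @ theta))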
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

data = pd.read_csv('https://jonghank.github.io/ee370/files/numbers.csv', \
                   header=None).values

m = 28        # 28x28: image size
K = 10        # 10: each image represents one of 10 digits
N = 10000     # 10000: number of images in the data set

y = data[:,0]                              # first column: labels
X = np.zeros((m,m,N))                      # remaining columns: pixel values
for n in range(N):
    X[:,:,n] = data[n,1:].reshape((m,m))

print(X.shape, y.shape)
(28, 28, 10000) (10000,)
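As a quick sanity check on this indexing, we can display a single data pair; a minimal sketch (any n from 0 to N-1 works):

n = 0                                # index of the data pair to inspect
plt.imshow(X[:,:,n], cmap='gray')    # the image X[:,:,n]
plt.title(f'y[{n}] = {y[n]}')        # its label
plt.axis('off')
plt.show()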
For example, the first 12 images and labels from the data set are shown below. You can check that they match.
n_examples = 12

print(f'First {n_examples} images:\n')
plt.figure(figsize=(12,4), dpi=100)
for n in range(n_examples):
    plt.subplot(1,n_examples,n+1)
    plt.imshow(X[:,:,n], cmap='gray')
    plt.axis('off')
plt.show()

print(f'First {n_examples} labels:\n{y[:n_examples]}')
First 12 images:
First 12 labels:
[7 2 1 0 4 1 4 9 5 9 0 6]
The following splits the data set into the train set and the validation set (7:3). The first 12 images from the validation set are shown below.
N_train = 7000
N_valid = 3000

X_train = X[:,:,:N_train]
y_train = y[:N_train]
X_valid = X[:,:,N_train:]
y_valid = y[N_train:]
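This keeps the first 7,000 pairs for training and the remaining 3,000 for validation. If the rows of the CSV happened to be ordered by label, a random permutation before splitting would be safer. A minimal sketch of such a shuffled split (the seed 0 is an arbitrary choice for reproducibility; we keep the sequential split above so that the printed results below are unchanged):

rng = np.random.default_rng(0)       # fixed seed for reproducibility
perm = rng.permutation(N)            # random reordering of the N indices
X_train, y_train = X[:,:,perm[:N_train]], y[perm[:N_train]]
X_valid, y_valid = X[:,:,perm[N_train:]], y[perm[N_train:]]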
n_examples = 12

print(f'First {n_examples} images in the validation set:\n')
plt.figure(figsize=(12,4), dpi=100)
for n in range(n_examples):
    plt.subplot(1,n_examples,n+1)
    plt.imshow(X_valid[:,:,n], cmap='gray')
    plt.axis('off')
plt.show()
First 12 images in the validation set:
We first design a binary classifier that detects the digit '0'. The train set and the validation set are preprocessed below: each image is flattened into a feature vector of length $1+m^2=785$ (a constant 1 followed by the pixel values), and the target is $+1$ if the label is 0 and $-1$ otherwise.
K = 2

A_train = np.ones((N_train,1+m*m))
b_train = -np.ones(N_train, dtype=int)
for i in range(N_train):
    A_train[i,:] = np.hstack((1,X_train[:,:,i].flatten()))
    if y_train[i] == 0:
        b_train[i] = 1

A_valid = np.ones((N_valid,1+m*m))
b_valid = -np.ones(N_valid, dtype=int)
for i in range(N_valid):
    A_valid[i,:] = np.hstack((1,X_valid[:,:,i].flatten()))
    if y_valid[i] == 0:
        b_valid[i] = 1
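The loops above can also be written without explicit iteration. A vectorized sketch giving the same arrays (it relies on the fact that reshaping the (m, m, N) array to (m*m, N) in C order puts the flattened n-th image in column n):

A_train = np.hstack((np.ones((N_train,1)), X_train.reshape(m*m, N_train).T))
b_train = np.where(y_train == 0, 1, -1)
A_valid = np.hstack((np.ones((N_valid,1)), X_valid.reshape(m*m, N_valid).T))
b_valid = np.where(y_valid == 0, 1, -1)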
Optimal classifier parameters and the predictions:
theta_opt = np.linalg.lstsq(A_train, b_train, rcond=None)[0]
b_pred_train = np.round(np.sign(A_train@theta_opt))
b_pred_valid = np.round(np.sign(A_valid@theta_opt))
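np.linalg.lstsq returns a minimizer of $\|A\theta - b\|_2^2$, and any least-squares solution $\hat\theta$ satisfies the normal equations $A^TA\hat\theta = A^Tb$. A quick sanity check (the printed residual should be small relative to the scale of the data):

# gradient of the least-squares objective at theta_opt; should be ~0
print(np.linalg.norm(A_train.T @ (A_train @ theta_opt - b_train)))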
Confusion matrix on the train set:
C = np.zeros((K,K), dtype=int)
for n in range(N_train):
    if b_pred_train[n] == 1:
        if b_train[n] == 1:
            C[0,0] += 1    # true positive
        else:
            C[1,0] += 1    # false positive
    else:
        if b_train[n] == 1:
            C[0,1] += 1    # false negative
        else:
            C[1,1] += 1    # true negative
np.set_printoptions(precision=2)
print(f'Confusion matrix (train):\n{C}\n')
print(f'Error rate (train):\n{100-np.sum(np.diag(C))/N_train*100:.2f} percent\n')
Confusion matrix (train):
[[ 621   51]
 [  25 6303]]

Error rate (train):
1.09 percent
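The counting loop can also be replaced by boolean masks; a vectorized sketch producing the same 2x2 matrix (rows index the true class, columns the predicted class):

tp = np.sum((b_train ==  1) & (b_pred_train ==  1))   # true positives
fn = np.sum((b_train ==  1) & (b_pred_train == -1))   # false negatives
fp = np.sum((b_train == -1) & (b_pred_train ==  1))   # false positives
tn = np.sum((b_train == -1) & (b_pred_train == -1))   # true negatives
C = np.array([[tp, fn],
              [fp, tn]])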
and the confusion matrix on the validation set:
C = np.zeros((K,K), dtype=int)
for n in range(N_valid):
    if b_pred_valid[n] == 1:
        if b_valid[n] == 1:
            C[0,0] += 1    # true positive
        else:
            C[1,0] += 1    # false positive
    else:
        if b_valid[n] == 1:
            C[0,1] += 1    # false negative
        else:
            C[1,1] += 1    # true negative
np.set_printoptions(precision=2)
print(f'Confusion matrix (validation):\n{C}\n')
print(f'Error rate (validation):\n{100-np.sum(np.diag(C))/N_valid*100:.2f} percent\n')
Confusion matrix (validation):
[[ 281   27]
 [  11 2681]]

Error rate (validation):
1.27 percent
We now design a multiclass classifier that tells which of the digits '0', '1', ..., '9' the input image looks most like.
Preprocessing first:
K = 10

A_train = np.ones((N_train,1+m*m))
b_train = -np.ones((N_train,K), dtype=int)
for i in range(N_train):
    A_train[i,:] = np.hstack((1,X_train[:,:,i].flatten()))
    for j in range(K):
        if y_train[i] == j:
            b_train[i,j] = 1

A_valid = np.ones((N_valid,1+m*m))
b_valid = -np.ones((N_valid,K), dtype=int)
for i in range(N_valid):
    A_valid[i,:] = np.hstack((1,X_valid[:,:,i].flatten()))
    for j in range(K):
        if y_valid[i] == j:
            b_valid[i,j] = 1
The predictor parameter $\hat\theta$ is $785\times 10$, where the $k$-th column $\theta_k$ represents the predictor parameter for the $k$-th digit, so

$$\tilde f_k(x^{(i)}) = (x^{(i)})^T \theta_k$$

and we choose the final prediction $\hat f(x^{(i)}) = l$ so that $\tilde f_l(x^{(i)})$ is the largest among $\tilde f_0(x^{(i)}), \tilde f_1(x^{(i)}), \dots, \tilde f_9(x^{(i)})$:

$$\hat f(x^{(i)}) = \operatorname*{argmax}_{l\in\{0,1,\dots,9\}} \tilde f_l(x^{(i)})$$

theta_opt = np.linalg.lstsq(A_train, b_train, rcond=None)[0]
b_pred_train = np.argmax(A_train@theta_opt, axis=1)
b_pred_valid = np.argmax(A_valid@theta_opt, axis=1)
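Note that because b_train has one column per digit, a single call to np.linalg.lstsq fits all ten one-vs-rest classifiers at once: the k-th column of theta_opt solves the least-squares problem with b_train[:,k] alone as the target. A quick check of this, sketched for k = 0:

theta_0 = np.linalg.lstsq(A_train, b_train[:,0], rcond=None)[0]
print(np.allclose(theta_0, theta_opt[:,0]))   # should print True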
Confusion matrices on both the train set and the validation set:
C = np.zeros((K,K), dtype=int)
for n in range(N_train):
    C[b_pred_train[n], y_train[n]] += 1

np.set_printoptions(precision=2)
print(f'Confusion matrix (train):\n{C}\n')
print(f'Error rate (train):\n{100-np.sum(np.diag(C))/N_train*100:.2f} percent\n')

C = np.zeros((K,K), dtype=int)
for n in range(N_valid):
    C[b_pred_valid[n], y_valid[n]] += 1

np.set_printoptions(precision=2)
print(f'Confusion matrix (validation):\n{C}\n')
print(f'Error rate (validation):\n{100-np.sum(np.diag(C))/N_valid*100:.2f} percent\n')
Confusion matrix (train):
[[649   0  10   2   0  13  13   5  13  14]
 [  0 773  19   9  13  10   6  30  23  14]
 [  0   5 635  15   3   3   3   8   7   0]
 [  4   0  13 623   0  33   1   4  22  10]
 [  1   3   9   5 646  20   5  13  10  50]
 [  4   0   0  13   2 503   9   1  14   1]
 [  6   4  13   2   3  13 613   1   8   1]
 [  1   1  11  12   1  12   0 619   6  31]
 [  6   8  18  13   7  20   6   3 569  11]
 [  1   1   1   8  25   6   0  28  10 587]]

Error rate (train):
11.19 percent

Confusion matrix (validation):
[[290   0   8   2   0   3   3   1   2   1]
 [  0 308   4   3   2   9   0   2   9   1]
 [  3   8 248  26   0   1   5   7   2   0]
 [  1   0   7 240   0   4   6   1   6   2]
 [  3   1   8   0 256  11   0   5   2  23]
 [  6   0   3   8   5 209   4   0  29   3]
 [  4   0   3   0   6   6 281   0   2   0]
 [  0   0   5   5   0   1   0 290   2  42]
 [  1  23   9  11   7  15   3   2 226   4]
 [  0   0   8  13   6   0   0   8  12 214]]

Error rate (validation):
14.60 percent
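As in the binary case, the counting loops can be vectorized; np.add.at accumulates counts over repeated index pairs, so the following sketch builds the same 10x10 matrix:

C = np.zeros((K,K), dtype=int)
np.add.at(C, (b_pred_train, y_train), 1)   # C[prediction, true label] += 1 per sample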
We enhance the multiclass classification performance by adding random features. The additional features are random sums and subtractions of the existing features, passed through an elementwise $\max(\cdot,0)$, so the augmented feature vector for the $i$-th data pair is

$$x^{(i)} \longleftarrow \begin{bmatrix} x^{(i)} \\ \max(Rx^{(i)},\,0) \end{bmatrix}$$

where $R$ is a $1000\times 785$ matrix filled with random $1$'s and $-1$'s.
Feature engineering on the train set and the validation set:
n_plus = 1000
R = np.sign(np.random.randn(n_plus,1+m*m))   # random matrix of +1/-1 entries

A_train = np.ones((N_train,1+m*m))
b_train = -np.ones((N_train,K), dtype=int)
for i in range(N_train):
    A_train[i,:] = np.hstack((1,X_train[:,:,i].flatten()))
    for j in range(K):
        if y_train[i] == j:
            b_train[i,j] = 1

Aa_train = np.ones((N_train,1+m*m+n_plus))
for i in range(N_train):
    Aa_train[i,:] = np.hstack((A_train[i,:],np.maximum(R@A_train[i,:],0)))

A_valid = np.ones((N_valid,1+m*m))
b_valid = -np.ones((N_valid,K), dtype=int)
for i in range(N_valid):
    A_valid[i,:] = np.hstack((1,X_valid[:,:,i].flatten()))
    for j in range(K):
        if y_valid[i] == j:
            b_valid[i,j] = 1

Aa_valid = np.ones((N_valid,1+m*m+n_plus))
for i in range(N_valid):
    Aa_valid[i,:] = np.hstack((A_valid[i,:],np.maximum(R@A_valid[i,:],0)))
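The row-by-row augmentation can also be done in one shot: since $\max(\cdot,0)$ acts elementwise, applying $R$ to all rows at once gives identical features. A vectorized sketch:

Aa_train = np.hstack((A_train, np.maximum(A_train @ R.T, 0)))   # (N_train, 785+n_plus)
Aa_valid = np.hstack((A_valid, np.maximum(A_valid @ R.T, 0)))   # (N_valid, 785+n_plus)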
Optimal classifier parameter and the predictions on the train set and the validation set:
theta_opt = np.linalg.lstsq(Aa_train, b_train, rcond=None)[0]
b_pred_train = np.argmax(Aa_train@theta_opt, axis=1)
b_pred_valid = np.argmax(Aa_valid@theta_opt, axis=1)
Confusion matrices:
C = np.zeros((K,K), dtype=int)
for n in range(N_train):
    C[b_pred_train[n], y_train[n]] += 1

np.set_printoptions(precision=2)
print(f'Confusion matrix (train):\n{C}\n')
print(f'Error rate (train):\n{100-np.sum(np.diag(C))/N_train*100:.2f} percent\n')

C = np.zeros((K,K), dtype=int)
for n in range(N_valid):
    C[b_pred_valid[n], y_valid[n]] += 1

np.set_printoptions(precision=2)
print(f'Confusion matrix (validation):\n{C}\n')
print(f'Error rate (validation):\n{100-np.sum(np.diag(C))/N_valid*100:.2f} percent\n')
Confusion matrix (train):
[[669   0   1   0   0   0   1   0   0   3]
 [  0 791   0   0   0   0   3   7   1   4]
 [  0   1 723   0   0   0   0   1   0   0]
 [  0   0   0 695   0   3   0   0   2   4]
 [  0   1   2   0 693   1   0   2   1   2]
 [  0   0   0   0   0 627   2   0   0   0]
 [  1   1   0   0   0   0 650   0   2   0]
 [  1   0   1   3   0   0   0 699   0   0]
 [  1   0   2   4   1   2   0   0 674   2]
 [  0   1   0   0   6   0   0   3   2 704]]

Error rate (train):
1.07 percent

Confusion matrix (validation):
[[302   0   4   1   0   3   1   0   1   1]
 [  0 335   1   0   1   0   0   1   2   0]
 [  1   0 266   6   1   2   1   5   0   0]
 [  0   0   3 277   0   7   0   1   1   0]
 [  0   0   5   0 269   1   5   1   1   4]
 [  1   0   1   6   0 238   6   0   9   2]
 [  3   0   0   0   0   4 285   1   2   0]
 [  0   0   8   0   0   0   2 306   1   8]
 [  1   5   9  12   3   4   1   0 270   2]
 [  0   0   6   6   8   0   1   1   5 273]]

Error rate (validation):
5.97 percent