# Project 1: Digit Classification with KNN and Naive Bayes¶

In this project, you'll implement your own image recognition system for classifying digits. Read through the code and the instructions carefully and add your own code where indicated. Each problem can be addressed succinctly with the included packages -- please don't add any more. Grading will be based on writing clean, commented code, along with a few short answers.

As always, you're welcome to work on the project in groups and discuss ideas on the course wall, but please prepare your own write-up (with your own code).

If you're interested, check out these links related to digit recognition:

Yann Lecun's MNIST benchmarks: http://yann.lecun.com/exdb/mnist/

Stanford Streetview research and data: http://ufldl.stanford.edu/housenumbers/

In [1]:
# This tells matplotlib not to try opening a new window for each plot.
%matplotlib inline

# Import a bunch of libraries.
import time
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.ticker import MultipleLocator
from sklearn.pipeline import Pipeline
from sklearn.datasets import fetch_mldata
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix
from sklearn.linear_model import LinearRegression
from sklearn.naive_bayes import BernoulliNB
from sklearn.naive_bayes import MultinomialNB
from sklearn.naive_bayes import GaussianNB
from sklearn.grid_search import GridSearchCV
from sklearn.metrics import classification_report

# Set the randomizer seed so results are the same each time.
np.random.seed(0)


Load the data. Notice that we are splitting the data into training, development, and test. We also have a small subset of the training data called mini_train_data and mini_train_labels that you should use in all the experiments below, unless otherwise noted.

In [2]:
# Load the digit data either from mldata.org, or once downloaded to data_home, from disk. The data is about 53MB so this cell
# should take a while the first time your run it.
mnist = fetch_mldata('MNIST original', data_home='~/datasets/mnist')
X, Y = mnist.data, mnist.target

# Rescale grayscale values to [0,1].
X = X / 255.0

# Shuffle the input: create a random permutation of the integers between 0 and the number of data points and apply this
# permutation to X and Y.
# NOTE: Each time you run this cell, you'll re-shuffle the data, resulting in a different ordering.
shuffle = np.random.permutation(np.arange(X.shape[0]))
X, Y = X[shuffle], Y[shuffle]

print 'data shape: ', X.shape
print 'label shape:', Y.shape

# Set some variables to hold test, dev, and training data.
test_data, test_labels = X[61000:], Y[61000:]
dev_data, dev_labels = X[60000:61000], Y[60000:61000]
train_data, train_labels = X[:60000], Y[:60000]
mini_train_data, mini_train_labels = X[:1000], Y[:1000]

data shape:  (70000, 784)
label shape: (70000,)


(1) Create a 10x10 grid to visualize 10 examples of each digit. Python hints:

• plt.rc() for setting the colormap, for example to black and white
• plt.subplot() for creating subplots
• plt.imshow() for rendering a matrix
• np.array.reshape() for reshaping a 1D feature vector into a 2D matrix (for rendering)
In [3]:
def P1(num_examples=10):

### STUDENT START ###

num_digits = 10 # number of digits to show

## Select num_examples examples for each

# Create index dictionary
index_dict = { i : [] for i in xrange(num_digits) }
digits_completed = 0

# Iterate through labels
for i, label in enumerate(train_labels):
# If we are in need of the label we just came accross add it
if len(index_dict[label]) < num_examples:
index_dict[label].append(i)

# If we have just finished the digit, add one to digits_completed
if len(index_dict[label]) == num_examples:
digits_completed += 1

# When we have finished our 10 digits, stop iterating through labels
if digits_completed == num_digits:
break

## Generate the figure
fig, ax = plt.subplots(num_digits, num_examples, figsize=(10,10))

for i in xrange(num_digits): # for each digit
for j in xrange(num_examples): # for each plot within each digit
# Get the position in the data
_ind = index_dict[i][j]

# Plot with no axes
ax[i,j].imshow(train_data[_ind].reshape(28,28), cmap=plt.cm.gray)
ax[i,j].axis('off')

fig.suptitle('Showing {} examples for each digit'.format(num_examples), size=18)

### STUDENT END ###

P1(10)


(2) Evaluate a K-Nearest-Neighbors model with k = [1,3,5,7,9] using the mini training set. Report accuracy on the dev set. For k=1, show precision, recall, and F1 for each label. Which is the most difficult digit?

• KNeighborsClassifier() for fitting and predicting
• classification_report() for producing precision, recall, F1 results
In [4]:
def P2(k_values, extra=False):

### STUDENT START ###

# For extra report
extra_reports = []

for k in k_values:
# Create and train model
_model = KNeighborsClassifier(n_neighbors=k)
_model.fit(mini_train_data, mini_train_labels)

# Predict
_predicted = _model.predict(dev_data)

## Extra report comparing the overall performance of each of the models
if extra:
_p = classification_report(y_true=dev_labels, y_pred=_predicted, digits=3)

# Report for k==1
if k == 1:
_p = classification_report(y_true=dev_labels, y_pred=_predicted, digits=3)
print 'For k==1 neighbors:'
print _p

# Parse report (obtain prec, recall and F1 for avg / total)
line = _p.split('\n')[-2]
_1, precision, recall, f1, _2 = [x for x in line.split('  ') if len(x)]

# Calculate accuracy and append to list
_correct = sum([ i == j for i,j in zip(_predicted, dev_labels) ])
accuracy = 1. * _correct / len(dev_labels)

extra_reports.append( (k, accuracy, precision, recall, f1) )

# Print extra report
if extra:
print '  MODEL      Accuracy  Precision   Recall  F1score'
for entry in extra_reports:
print 'Neighbors {0}:    {1:.1%}     {2}    {3}   {4}'.format(*entry)

### STUDENT END ###

k_values = [1, 3, 5, 7, 9]
P2(k_values, extra=True)

For k==1 neighbors:
precision    recall  f1-score   support

0.0      0.907     0.980     0.942        99
1.0      0.890     1.000     0.942       105
2.0      0.988     0.794     0.880       102
3.0      0.773     0.872     0.820        86
4.0      0.885     0.817     0.850       104
5.0      0.927     0.835     0.879        91
6.0      0.940     0.959     0.949        98
7.0      0.889     0.920     0.904       113
8.0      0.944     0.875     0.908        96
9.0      0.777     0.821     0.798       106

avg / total      0.892     0.888     0.888      1000

MODEL      Accuracy  Precision   Recall  F1score
Neighbors 1:    88.8%     0.892     0.888    0.888
Neighbors 3:    87.8%     0.887     0.878    0.878
Neighbors 5:    86.9%     0.877     0.869    0.869
Neighbors 7:    86.5%     0.875     0.865    0.865
Neighbors 9:    86.3%     0.875     0.863    0.863


According to F1 score, the most difficult digit seems to be '9'. Going back to the figure produced by P1, out of the examples shown '9' is indeed the most variable.

(3) Using k=1, report dev set accuracy for the training set sizes below. Also, measure the amount of time needed for prediction with each training size.

• time.time() gives a wall clock value you can use for timing operations
In [5]:
def P3(train_sizes, accuracies):

### STUDENT START ###

print '  TRAIN SIZE   Accuracy   Precision   Recall  F1score    Time'

for size in train_sizes:
# Train model
_model = KNeighborsClassifier(n_neighbors=1)
_model.fit(train_data[:size], train_labels[:size])

# Predict and time the prediction
_start = time.time()
_predicted = _model.predict(dev_data)
_elapsed = time.time() - _start

# Calculate accuracy and append to list
_correct = sum([ i == j for i,j in zip(_predicted, dev_labels) ])
accuracy = 1. * _correct / len(dev_labels)
accuracies.append(accuracy)

# Extra reporting
_p = classification_report(y_true=dev_labels, y_pred=_predicted, digits=3)
# Parse report (obtain prec, recall and F1 for avg / total)
line = _p.split('\n')[-2]
_1, precision, recall, f1, _2 = [x for x in line.split('  ') if len(x)]

# Format time and print along with accuracies
_t = str(_elapsed).split('.')
print '{s:5} records:    {a:<.1%}      {p}    {r}   {f}  {t[0]:>2}.{t[1]:.2}s'.format(
s=size, a=accuracy, p=precision, r=recall, f=f1, t=_t)

### STUDENT END ###

train_sizes = [100, 200, 400, 800, 1600, 3200, 6400, 12800, 25000]
accuracies = []
P3(train_sizes, accuracies)

  TRAIN SIZE   Accuracy   Precision   Recall  F1score    Time
100 records:    72.0%      0.765     0.720    0.714   0.14s
200 records:    78.6%      0.800     0.786    0.784   0.33s
400 records:    84.1%      0.853     0.841    0.840   0.60s
800 records:    88.4%      0.890     0.884    0.884   1.47s
1600 records:    90.2%      0.905     0.902    0.902   2.85s
3200 records:    92.6%      0.928     0.926    0.926   5.37s
6400 records:    93.7%      0.938     0.937    0.937   9.32s
12800 records:    95.9%      0.959     0.959    0.959  20.39s
25000 records:    97.0%      0.970     0.970    0.970  38.44s


(4) Fit a regression model that predicts accuracy from training size. What does it predict for n=60000? What's wrong with using regression here? Can you apply a transformation that makes the predictions more reasonable?

• Remember that the sklearn fit() functions take an input matrix X and output vector Y. So each input example in X is a vector, even if it contains only a single value.
In [6]:
def P4(transform=False, size=60000):

### STUDENT START ###

# Access data from P3 and reshape train sizes (to matrix)
_ts = map( lambda x: [x,], train_sizes )

# If specified, transform data
if transform:
_acc = map( lambda x: 1./(1 - x) , accuracies )
else:
_acc = accuracies

# Fit model and predict for 600k
_m = LinearRegression()
_m.fit(_ts, _acc)
pred = _m.predict([[size]])

# If transformed, untransform data
if transform:
_pred = 1 -  1. / pred[0]
else:
_pred = pred[0]

# Print result
print 'Transformation {w:>3} -> Accuracy predicted (size={s}k): {p:>6.1%}'.format(
w= 'ON' if transform else 'OFF', s=size/1000, p=_pred )

### STUDENT END ###

P4()
P4(transform=True)

Transformation OFF -> Accuracy predicted (size=60k): 124.3%
Transformation  ON -> Accuracy predicted (size=60k):  98.7%


The prediction yields accuracy > 1 because there is no restriction for the linear regression. We should apply a transformation on the accuracy so that we model an unconstrained variables (function of the accuracy), fit the regression on the transformed data, and then apply the inverse transformation to the predicted values to obtain the predicted accuracy.

The transformation applied is: $y = \frac{1}{1-x}$
Whose inverse is: $x = 1 - \frac{1}{y}$

Fit a 1-NN and output a confusion matrix for the dev data. Use the confusion matrix to identify the most confused pair of digits, and display a few example mistakes.

• confusion_matrix() produces a confusion matrix
In [7]:
def P5():

### STUDENT START ###

# Fit model and get predictions
_m = KNeighborsClassifier(n_neighbors=1)
_m.fit(mini_train_data, mini_train_labels)
_p = _m.predict(dev_data)

# Print confusion matrix
print 'CONFUSION MATRIX'
print confusion_matrix(y_true=dev_labels, y_pred=_p, labels= range(10))

# Print examples of mislabeled 9s (two 9s as 3s and two 9s as 7s)
fig, ax = plt.subplots(2, 2, figsize=(6,6))

_num_as3, _num_as7 = 0, 0 # aux for keeping track of examples plotted

for i, label in enumerate(dev_labels):
if label == 9:

# If 9 mislabeled as 3
if _p[i] == 3 and _num_as3 < 2:
ax[0,_num_as3].imshow(dev_data[i].reshape(28,28), cmap=plt.cm.gray)
ax[0,_num_as3].axis('off')
_num_as3 += 1

# If 9 mislabeled as 7
if _p[i] == 7 and _num_as7 < 2:
ax[1,_num_as7].imshow(dev_data[i].reshape(28,28), cmap=plt.cm.gray)
ax[1,_num_as7].axis('off')
_num_as7 += 1

if _num_as3 + _num_as7 >= 4:
break

fig.suptitle('TOP: 9s classified as 3s , BOTTOM: 9s classified as 7s', size=15)

### STUDENT END ###

P5()

CONFUSION MATRIX
[[ 97   0   0   0   0   0   2   0   0   0]
[  0 105   0   0   0   0   0   0   0   0]
[  4   4  81   4   0   0   0   4   3   2]
[  1   0   0  75   0   3   0   3   1   3]
[  0   2   0   0  85   0   3   0   0  14]
[  2   0   0   9   0  76   0   1   1   2]
[  1   1   1   0   1   0  94   0   0   0]
[  1   4   0   1   1   0   0 104   0   2]
[  0   2   0   5   0   2   1   0  84   2]
[  1   0   0   3   9   1   0   5   0  87]]


(6) A common image processing technique is to smooth an image by blurring. The idea is that the value of a particular pixel is estimated as the weighted combination of the original value and the values around it. Typically, the blurring is Gaussian -- that is, the weight of a pixel's influence is determined by a Gaussian function over the distance to the relevant pixel.

Implement a simplified Gaussian blur by just using the 8 neighboring pixels: the smoothed value of a pixel is a weighted combination of the original value and the 8 neighboring values. Try applying your blur filter in 3 ways:

• preprocess the training data but not the dev data
• preprocess the dev data but not the training data
• preprocess both training and dev data

Note that there are Guassian blur filters available, for example in scipy.ndimage.filters. You're welcome to experiment with those, but you are likely to get the best results with the simplified version I described above.

In [8]:
class GaussianBlur:
'''Implements a Gaussian Blur with the specified blur_range and std_deviation

Attributes:
blur_range [int]: Length of the blur's radius (Defaults to 1)
std_dev [float]: Standard deviation of the Gaussian applied (Defaults to 1)
weights [matrix of floats]: weights for the specified blur_range and std_dev

'''

def __init__(self, blur_range=1, std_dev=1):
'''Initialize a GaussianBlur with specified blur length and standard deviation'''
self.blur_range = blur_range
self.std_dev = std_dev

# Calculate weights
self.weights = np.zeros([self.blur_range + 1, self.blur_range + 1])
for i in xrange(self.blur_range + 1):
for j in xrange(self.blur_range + 1):
self.weights[i, j] = np.exp( -1. * (i**2 + j**2) / (2 * self.std_dev**2) ) \
/ ( 2 * np.pi * self.std_dev**2 )

def blur(self, matrix):
'''Apply Gaussian Blur'''
# Get matrix shape, initialize blurred matrix
height, width = matrix.shape
blurred = np.zeros_like(matrix)

for r in xrange(height):
for c in xrange(width):
_vals = [] # collects values of neighbor pixels
_weights = [] # collects the weights as specified by the Gaussian

# get allowed set of jitters
possible_rows = xrange( max(0, r - self.blur_range), min(height, r + self.blur_range + 1) )
possible_cols = xrange( max(0, c - self.blur_range), min(width, c + self.blur_range + 1) )

# iterate through jitters
for _r in possible_rows:
for _c in possible_cols:
_vals.append( matrix[_r , _c] )
_weights.append( self.weights[abs(r - _r) , abs(c - _c)] )

blurred[r][c] = np.dot(_vals, _weights) / sum(_weights)

return blurred

In [9]:
def P6(blur_train=True, blur_dev=True):

### STUDENT START ###

_start = time.time() # 'start' timer
_m = KNeighborsClassifier(n_neighbors=1)

# Initialize GaussianBlur if necessary
if blur_train or blur_dev:
gblur = GaussianBlur()

# Blur (if specified) and train
if blur_train:
_train = map( lambda x: gblur.blur( x.reshape(28,28) ).flatten() , mini_train_data )
else:
_train = mini_train_data
_m.fit(_train, mini_train_labels)

# Blur (if specified) and predict
if blur_dev:
_dev = map( lambda x: gblur.blur( x.reshape(28,28) ).flatten() , dev_data )
else:
_dev = dev_data
_predicted = _m.predict(_dev)

_elapsed = time.time() - _start # 'stop' timer
_t = str(_elapsed).split('.')

# Calculate accuracy and print report
_correct = sum([ i == j for i,j in zip(_predicted, dev_labels) ])
accuracy = 1. * _correct / len(dev_labels)

print '{tr:^16}    {dv:^12}      {a:.1%}   {t[0]:>2}.{t[1]:.2}s'.format(
tr= 'ON' if blur_train else 'OFF',
dv= 'ON' if blur_dev else 'OFF',
a=accuracy, t=_t )

### STUDENT END ###

print 'Blurred Training    Blurred Dev    Accuracy    Time'
P6(blur_train=False, blur_dev=False)
P6(blur_train=False, blur_dev=True)
P6(blur_train=True, blur_dev=False)
P6(blur_train=True, blur_dev=True)

Blurred Training    Blurred Dev    Accuracy    Time
OFF               OFF           88.8%    1.73s
OFF                ON           87.5%   48.40s
ON               OFF           90.5%   45.11s
ON                ON           90.1%   90.08s


Given the results above, it seems clear that blurring the training dataset improves accuracy (my intuition is that the smoothing makes the dataset more generalizable).
However, although the difference is quite small, it seems that it is better not to smooth the dataset to predict.

(7) Fit a Naive Bayes classifier and report accuracy on the dev data. Remember that Naive Bayes estimates P(feature|label). While sklearn can handle real-valued features, let's start by mapping the pixel values to either 0 or 1. You can do this as a preprocessing step, or with the binarize argument. With binary-valued features, you can use BernoulliNB. Next try mapping the pixel values to 0, 1, or 2, representing white, grey, or black. This mapping requires MultinomialNB. Does the multi-class version improve the results? Why or why not?

In [10]:
def binarize(data, thresholds=[.5]):
'''Split data into different classes given a set of thresholds (defaults to binarizing at .5)'''

binarized_data = map( lambda d: np.digitize(d, thresholds), data )

return binarized_data

In [11]:
rep_template = '{m:^15}  {th:^12}    {a:.1%}   {tm[0]:>2}.{tm[1]:.3}s'

def P7(thresholds):

### STUDENT START ###

_start = time.time() # 'start' timer

# Select Binomial or Multinomial
if len(thresholds) == 1:
mod = BernoulliNB(alpha=1., fit_prior=False)
else:
mod = MultinomialNB(alpha=1., fit_prior=False)

# Train and fit (use binarize with .5, and alpha 1)
mod.fit( binarize(mini_train_data, thresholds=thresholds) , mini_train_labels)
_p = mod.predict( binarize(dev_data,thresholds=thresholds) )

_elapsed = time.time() - _start # 'stop' timer
_t = str(_elapsed).split('.')

# Calculate and report accuracy
_correct = sum([ i == j for i,j in zip(_p, dev_labels) ])
accuracy = 1. * _correct / len(dev_labels)

print rep_template.format(
m='BernoulliNB' if len(thresholds) == 1 else 'MultinomialNB',
th=thresholds, a=accuracy, tm=_t)

### STUDENT END ###

print '{:^15}  Threshold(s)  Accuracy    Time'.format('MODEL')

# BernoulliNB
P7([.5])
P7([.9])
P7([.1])
# MultinomialNB
P7([.33, .66])
P7([.1, .9])
P7([.4, .6])

     MODEL       Threshold(s)  Accuracy    Time
BernoulliNB       [0.5]        81.5%    0.370s
BernoulliNB       [0.9]        76.6%    0.309s
BernoulliNB       [0.1]        81.2%    0.323s
MultinomialNB   [0.33, 0.66]    80.1%    0.250s
MultinomialNB    [0.1, 0.9]     79.8%    0.234s
MultinomialNB    [0.4, 0.6]     80.0%    0.230s


After trying with several thresholds it seems that multiclass does not improve results (unless compared with a very high binarizing threshold). This may be due to the fact that it does not make sense to have three classes when our objects are just lines, and thus objects which should be able to have more different values. The gray class does not add any further information about the digit recorded (it might if we were classifying objects that had different colors or tonalities).

(8) Use GridSearchCV to perform a search over values of alpha (the Laplace smoothing parameter) in a Bernoulli NB model. What is the best value for alpha? What is the accuracy when alpha=0? Is this what you'd expect?

• Note that GridSearchCV partitions the training data so the results will be a bit different than if you used the dev data for evaluation.
In [12]:
def P8(alphas):

### STUDENT START ###

nb = BernoulliNB(binarize = .5, fit_prior=False) # initialize estimator
gr_srch = GridSearchCV(nb, alphas)
gr_srch.fit(mini_train_data, mini_train_labels) # fit grid search

return gr_srch

### STUDENT END ###

alphas = {'alpha': [0.0, 0.0001, 0.001, 0.01, 0.1, 0.5, 1.0, 2.0, 10.0]}
nb = P8(alphas)

## Report accuracies and scores for each alpha specified
print 'ALPHA   AVG-Score    SCORES'
for row in nb.grid_scores_:
print "{0[alpha]:<6}   {1:5.1%}   [ {2[0]:5.1%}, {2[1]:5.1%}, {2[2]:5.1%} ]".format(*row)

/Library/Python/2.7/site-packages/sklearn/naive_bayes.py:766: RuntimeWarning: divide by zero encountered in log
self.feature_log_prob_ = (np.log(smoothed_fc)
/Library/Python/2.7/site-packages/sklearn/naive_bayes.py:785: RuntimeWarning: divide by zero encountered in log
neg_prob = np.log(1 - np.exp(self.feature_log_prob_))

ALPHA   AVG-Score    SCORES
0.0       9.2%   [  9.2%,  9.3%,  9.1% ]
0.0001   81.7%   [ 84.6%, 80.2%, 80.2% ]
0.001    82.0%   [ 84.6%, 80.5%, 80.9% ]
0.01     82.1%   [ 84.6%, 80.5%, 81.2% ]
0.1      81.7%   [ 84.3%, 80.5%, 80.2% ]
0.5      80.0%   [ 81.7%, 79.0%, 79.3% ]
1.0      79.4%   [ 81.1%, 78.1%, 79.0% ]
2.0      78.4%   [ 81.1%, 76.6%, 77.5% ]
10.0     72.0%   [ 75.4%, 69.7%, 70.8% ]

In [13]:
print nb.best_params_

{'alpha': 0.01}


The best value for alpha is 0.01. Given the accuracies obtained for alpha = 0, it seems there must be an error in the transformation since the values are disproportionately off. Additionally, the warning that pops up mentions division by zero.

(9) Try training a model using GuassianNB, which is intended for real-valued features, and evaluate on the dev data. You'll notice that it doesn't work so well. Try to diagnose the problem. You should be able to find a simple fix that returns the accuracy to around the same rate as BernoulliNB. Explain your solution.

Hint: examine the parameters estimated by the fit() method, theta_ and sigma_.

In [14]:
report_template = '{n}. {ap:>24}    {ac:>5.1%}'

def P9():
### STUDENT START ###
"""Trains a Gaussian Naive Bayes model (assumes P(x|y)~N(theta,sigma))

Various approaches tested:
1. Raw - using the gnb model as given
2. Set theta and sigma based on the data's mean and std deviation
3. Fix sigma to be 1
4. Standardize data using the GNB's estimated theta and sigma

NOTE: A fifth approach was tried, standardizing data according to the data's theta and sigma.
However, the approach failed due to some sigma's being zero.
"""

gnb = GaussianNB()
gnb.fit(mini_train_data, mini_train_labels)

# Record original params
sigma_gnb = gnb.sigma_.copy()
theta_gnb = gnb.theta_.copy()

# Estimate data's mean and standard deviation
sigma_data = sigma_gnb.copy()
theta_data = theta_gnb.copy()
for c in xrange(10):
_filter = mini_train_labels == c
sigma_data[c] = np.std(mini_train_data[ _filter ], axis=0)
theta_data[c] = np.average(mini_train_data[ _filter ], axis=0)

## 1. Raw approach
# Predict and get accuracies
_p = gnb.predict(dev_data)
_correct = sum([ i == j for i,j in zip(_p, dev_labels) ])
accuracy = 1. * _correct / len(dev_labels)

# Print report
print report_template.format(n=1, ap='Raw (no transform)', ac=accuracy)

## 2. Calculate std dev from the data and apply to sigma parameter
gnb.theta_ = theta_data
gnb.sigma_ = sigma_data

# Predict and get accuracies
_p = gnb.predict(dev_data)
_correct = sum([ i == j for i,j in zip(_p, dev_labels) ])
accuracy = 1. * _correct / len(dev_labels)

# Print report
print report_template.format(n=2, ap="Data's parameters", ac=accuracy)

## 3. Fix sigma to be 1
gnb.theta_ = theta_gnb  # revert
gnb.sigma_ = np.ones_like(gnb.sigma_)

# Predict and get accuracies
_p = gnb.predict(dev_data)
_correct = sum([ i == j for i,j in zip(_p, dev_labels) ])
accuracy = 1. * _correct / len(dev_labels)

print report_template.format(n=3, ap='Fix Sigma=1' ,ac=accuracy)

## 4. Standardize data approach using original GNB's estimated theta and sigma
std_mini_train_data = np.zeros_like(mini_train_data)
for i, c in enumerate(mini_train_labels):
std_mini_train_data[i] = (mini_train_data[i] - theta_gnb[c]) / sigma_gnb[c]

std_dev_data = np.zeros_like(dev_data)
for i, c in enumerate(dev_labels):
std_dev_data[i] = (dev_data[i] - theta_gnb[c]) / sigma_gnb[c]

gnb2 = GaussianNB()
gnb2.fit(std_mini_train_data, mini_train_labels)
_p = gnb2.predict(std_dev_data)
_correct = sum([ i == j for i,j in zip(_p, dev_labels) ])
accuracy = 1. * _correct / len(dev_labels)

print report_template.format(n=4, ap='Standardize Data (GNB)', ac=accuracy)

### STUDENT END ###

print '            APPROACH          Accuracy'
print report_template.format(n=0, ap='BernoulliNB (P8)', ac=.821)
P9()

            APPROACH          Accuracy
0.         BernoulliNB (P8)    82.1%
1.       Raw (no transform)    62.1%
2.        Data's parameters     9.9%
3.              Fix Sigma=1    80.2%
4.   Standardize Data (GNB)    85.6%

/Library/Python/2.7/site-packages/sklearn/naive_bayes.py:393: RuntimeWarning: divide by zero encountered in log
n_ij = - 0.5 * np.sum(np.log(2. * np.pi * self.sigma_[i, :]))
/Library/Python/2.7/site-packages/sklearn/naive_bayes.py:395: RuntimeWarning: divide by zero encountered in divide
(self.sigma_[i, :]), 1)
/Library/Python/2.7/site-packages/sklearn/naive_bayes.py:395: RuntimeWarning: invalid value encountered in divide
(self.sigma_[i, :]), 1)


The different approaches play with the parameters that specifies the Gaussian that determines the likelihood of the features for a given class.

The problem seems to be that some fo the features have a std deviation of zero, which does not play well with the Gaussian's analytic form. Therefore two approaches seem to solve the problem:

1. Fixing $\sigma = 1$
2. Standardizing using the estimated parameters by the GNB (which are close, but no sigma's are zero) helps produce data better fit for the GaussianNB assumptions.

(10) Because Naive Bayes is a generative model, we can use the trained model to generate digits. Train a BernoulliNB model and then generate a 10x20 grid with 20 examples of each digit. Because you're using a Bernoulli model, each pixel output will be either 0 or 1. How do the generated digits compare to the training digits?

• You can use np.random.rand() to generate random numbers from a uniform distribution
• The estimated probability of each pixel is stored in feature_log_prob_. You'll need to use np.exp() to convert a log probability back to a probability.
In [15]:
def P10(num_examples=20):

### STUDENT START ###
num_digits = 10 # number of digits to show

# Fit BernoulliNB model with alpha=0.01 (as per P8's response)
bnb = BernoulliNB(alpha=0.01, binarize=.5, fit_prior=False)
bnb.fit( mini_train_data , mini_train_labels)

## Generate the figure
fig, ax = plt.subplots(num_digits, num_examples, figsize=(16,8))

for dig in xrange(num_digits): # for each digit
for j in xrange(num_examples): # for each plot within each digit

# Generate example by generating random numbers and using the model's probabilities to binarize
_example = (np.random.rand(784) < np.exp(bnb.feature_log_prob_[dig])).reshape(28,28)

# Plot with no axes
ax[dig,j].imshow(_example, cmap=plt.cm.gray)
ax[dig,j].axis('off')

fig.suptitle('Showing {} randomly generated examples for each digit'.format(num_examples), size=18)

### STUDENT END ###

P10(20)


From the plotted figure it can be observed that the generated examples tend to be more homogeneous than the training digits. These examples seem to be almost averages of the training examples, in which features distinctive of the number (like edges, or certain blan areas) are maintained, and other areas are noisier. Training examples are more crisp and clear, but vary more from one example to another.

(11) Remember that a strongly calibrated classifier is rougly 90% accurate when the posterior probability of the predicted class is 0.9. A weakly calibrated classifier is more accurate when the posterior is 90% than when it is 80%. A poorly calibrated classifier has no positive correlation between posterior and accuracy.

Train a BernoulliNB model with a reasonable alpha value. For each posterior bucket (think of a bin in a histogram), you want to estimate the classifier's accuracy. So for each prediction, find the bucket the maximum posterior belongs to and update the "correct" and "total" counters.

How would you characterize the calibration for the Naive Bayes model?

In [16]:
def P11(buckets, correct, total):

### STUDENT START ###

# Fit BernoulliNB model with alpha=0.01 (as per P8's response)
bnb = BernoulliNB(alpha=0.01, binarize=.5, fit_prior=False)
bnb.fit( mini_train_data , mini_train_labels)
_pred_prob = np.exp( bnb.predict_log_proba(dev_data) )

# Get predicted probabilities and labels
posterior_prob = np.max(_pred_prob, axis=1)
predicted_label = np.argmax(_pred_prob, axis=1)

# Update total and correct counts accordingly
for i, p in enumerate(posterior_prob):  # for each prediction
for j, b in enumerate(buckets):  # for each posterior bucket
if p >= b:
total[j] += 1
# check if correct
if predicted_label[i] == dev_labels[i]:
correct[j] += 1

### STUDENT END ###

buckets = [0.5, 0.9, 0.999, 0.99999, 0.9999999, 0.999999999, 0.99999999999, 0.9999999999999, 1.0]
correct = [0 for i in buckets]
total = [0 for i in buckets]

P11(buckets, correct, total)

print '  POSTERIOR BUCKET    Accuracy  Total  Correct'
for i in range(len(buckets)):
accuracy = 0.0
if (total[i] > 0): accuracy = 1. * correct[i] / total[i]
print 'P <= {:<15.13}    {:>5.1%}    {:>4d}      {:>3d}'.format(buckets[i],accuracy, total[i], correct[i])

  POSTERIOR BUCKET    Accuracy  Total  Correct
P <= 0.5                81.8%    1000      818
P <= 0.9                83.7%     970      812
P <= 0.999              87.3%     884      772
P <= 0.99999            90.1%     798      719
P <= 0.9999999          91.8%     743      682
P <= 0.999999999        93.1%     682      635
P <= 0.99999999999      93.9%     607      570
P <= 0.9999999999999    95.3%     529      504
P <= 1.0                95.8%     499      478


It seems that the classifier is somewhat weakly calibrated, since it has positive correlation between posterior probability and accuracy, but they do not coincide very well.

(12) EXTRA CREDIT

Try designing extra features to see if you can improve the performance of Naive Bayes on the dev set. Here are a few ideas to get you started:

• Try summing the pixel values in each row and each column.
• Try counting the number of enclosed regions; 8 usually has 2 enclosed regions, 9 usually has 1, and 7 usually has 0.

Make sure you comment your code well!

## Brief Index¶

The following additional features where tested:

• Sum of rows and columns
• Hough transform for straight line detection
• Sobel filter for edge detection
• Enclosed region detection through connectivity in labeled images and region detection
• Histogram of Oriented Gradients for counting orientation of gradients in each region
In [20]:
# Additional imports necessary for image processing
from skimage import exposure
from skimage.filters import sobel
from skimage.feature import canny, hog
from skimage.measure import label, regionprops

import matplotlib.patches as mpatches # for plotting regions

In [18]:
# Create index dictionary for an example of each digit
index_dict = {}

for i, lbl in enumerate(mini_train_labels):
# If we are in need of the label we just came accross add it
if lbl not in index_dict:
index_dict[lbl] = i

# When we have finished our 10 digits, stop iterating through labels
if len(index_dict) >= 10:
break

In [22]:
## PLOT EXAMPLES
fig, ax = plt.subplots(10, 7, figsize=(10,14))

for i in xrange(10): # for each digit

## Feature Extraction
# Get the image, Sobel filter, Hough transform and Enclosed regions
_img = mini_train_data[index_dict[i]].reshape(28,28)
_sobel = sobel(_img)
_hough = canny(_img, sigma=1.25)

# Histogram of Oriented Gradients (HOG)
_, _hog = hog(_img, orientations=8, pixels_per_cell=(4, 4),
cells_per_block=(1, 1), visualise=True)
# rescale HOG's histogram for better display

# Region detection on Raw
_labeled = label(_img, connectivity=_img.ndim)
_regions = regionprops(_labeled)

# Region detection on Hough
_labeled_hou = label(_hough, connectivity=_img.ndim)
_regions_hou = regionprops(_labeled_hou)

## Plots (no axes)
# Raw image
ax[i,0].imshow(_img, cmap=plt.cm.gray)
ax[i,0].axis('off')

# Sobel Filter
ax[i,1].imshow(_sobel, cmap=plt.cm.gray)
ax[i,1].axis('off')

# Hough Transform
ax[i,2].imshow(_hough, cmap=plt.cm.gray)
ax[i,2].axis('off')

ax[i,3].imshow(_hog, cmap=plt.cm.gray)
ax[i,3].axis('off')
ax[i,4].axis('off')

# Region on Raw
ax[i,5].imshow(_labeled, cmap=plt.cm.gray)
ax[i,5].axis('off')
for reg in _regions:
# draw rectangle around segmented coins
minr, minc, maxr, maxc = reg.bbox
rect = mpatches.Rectangle((minc, minr), maxc - minc, maxr - minr,
fill=False, edgecolor='red', linewidth=2)

# Region on Hough (sigma=1)
ax[i,6].imshow(_labeled, cmap=plt.cm.gray)
ax[i,6].axis('off')
for reg in _regions_hou:
# draw rectangle around segmented coins
minr, minc, maxr, maxc = reg.bbox
rect = mpatches.Rectangle((minc, minr), maxc - minc, maxr - minr,
fill=False, edgecolor='red', linewidth=2)

# Add column titles and figure title
fig.text(.175,.915, 'Raw\nImage', ha='center', fontsize=17)
fig.text(.285,.915, 'Sobel\nFilter', ha='center', fontsize=17)
fig.text(.4,.915, 'Hough\nTransform', ha='center', fontsize=17)
fig.text(.515,.925, 'HOG', ha='center', fontsize=17)