Title: Introduction to Classification
Author: Thomas Breuel
Institution: UniKL
import numpy,scipy,scipy.ndimage,zlib
import random as pyrandom
from pylab import *
from pylab import random as arandom
def method(cls):
    import new
    def _wrap(f):
        cls.__dict__[f.func_name] = new.instancemethod(f,None,cls)
        return None
    return _wrap
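The decorator above relies on the Python 2 `new` module and `f.func_name`, which no longer exist in Python 3. For reference, a minimal Python 3 equivalent (a sketch; the name `method3` is chosen here only to avoid clashing with the notebook's `method`) uses `setattr` and `f.__name__`:

```python
def method3(cls):
    # Attach a function to an existing class after the fact, so classes
    # can be built up incrementally, cell by cell.
    def _wrap(f):
        setattr(cls, f.__name__, f)
        return None   # bind the name on the class, not in the module
    return _wrap

class Empty:
    pass

@method3(Empty)
def greet(self):
    return "hello"

print(Empty().greet())  # hello
```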
In regular software engineering, people think of software development as starting from a specification. In pattern recognition, however, you don't know the "specification". Instead, you have a source of data that generates instances. Let's look at how this works.
First, we generate a problem instance. In this case, the problem instance is actually the instantiation of a class, but in the real world, the problem instance is usually some particular physical instance: a device, a web site, a person, a book, etc.
We may call the problem instance nature, in order to emphasize that it is usually given by physics and not another software system. However, "nature" may well be another software system.
# Nature
class Nature:
    # Abstract interface; concrete subclasses implement these methods.
    def training_sample(self):
        # returns a (measurement,class) pair
        pass
    def challenge(self):
        # returns a measurement whose class is to be predicted
        pass
    def response(self,r):
        # returns a reward evaluating the response r
        pass
# An instance of "nature"
class SevenSegments(Nature):
    def __init__(self):
        self.vs = [None] * 10
        self.vs[0] = array((1,1,1,1,1,1,0))
        self.vs[1] = array((0,1,1,0,0,0,0))
        self.vs[2] = array((1,1,0,1,1,0,1))
        self.vs[3] = array((1,1,1,1,0,0,1))
        self.vs[4] = array((0,1,1,0,0,1,1))
        self.vs[5] = array((1,0,1,1,0,1,1))
        self.vs[6] = array((1,0,1,1,1,1,1))
        self.vs[7] = array((1,1,1,0,0,0,0))
        self.vs[8] = array((1,1,1,1,1,1,1))
        self.vs[9] = array((1,1,1,1,0,1,1))
Nature gives us training samples consisting of measurements and correct responses.
@method(SevenSegments)
def training_sample(self):
    c = pyrandom.randint(0,9)
    v = self.vs[c]
    return (v,c)
Based on these training samples, we need to build a model that returns correct classifications. At test time, nature presents challenges: training samples without the classification.
@method(SevenSegments)
def challenge(self):
    v,self.c = self.training_sample()
    return v
Our classifier needs to respond with a class, and nature evaluates our response and gives us feedback.
@method(SevenSegments)
def response(self,c):
    return (c==self.c)
nature = SevenSegments()
nature.training_sample()
(array([1, 1, 1, 1, 1, 1, 0]), 0)
We are trying to build a model of nature. A model responds to challenges by predicting a response.
# Models
class Model:
    def __init__(self,dataset):
        self.dataset = dataset
    def predict(self,v):
        # trivial baseline: always predict class 0
        return 0
Nature gives us training examples. These training examples consist of some kind of measurement vector v and a corresponding classification c.
v,c = nature.training_sample()
print "measurement vector",v
print "class",c
measurement vector [1 0 1 1 0 1 1]
class 5
At training time, we can request training samples.
Usually, generating labeled training samples costs money, so we only obtain a limited number of them.
These are usually collected in a dataset.
trainingset = [nature.training_sample() for i in range(100)]
trainingset[:5]
[(array([1, 1, 0, 1, 1, 0, 1]), 2), (array([1, 0, 1, 1, 1, 1, 1]), 6), (array([1, 1, 1, 1, 1, 1, 0]), 0), (array([1, 1, 1, 1, 1, 1, 0]), 0), (array([1, 1, 1, 1, 0, 0, 1]), 3)]
We usually use the dataset to create a model.
model = Model(trainingset)
After training, nature presents us with challenges or test samples.
nature.challenge()
array([1, 1, 1, 1, 0, 1, 1])
The task of the model is to predict a value based on the challenge.
prediction = model.predict(nature.challenge())
print prediction
0
Our prediction is then handed back to nature for evaluation. We may or may not see this evaluation, but eventually, our model will be judged on the quality of its predictions.
nature.response(prediction)
False
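The whole challenge/response protocol above can be sketched end-to-end with a toy nature and a toy model (the names `CoinNature` and `IdentityModel` are illustrative, not part of the notebook):

```python
import random

class CoinNature:
    # toy "nature" with two classes, where the measurement is the class
    def training_sample(self):
        c = random.randint(0, 1)
        return (c, c)              # (measurement, class)
    def challenge(self):
        v, self.c = self.training_sample()
        return v
    def response(self, c):
        return c == self.c         # feedback: was the prediction right?

class IdentityModel:
    def predict(self, v):
        return v                   # the measurement is the class here

nature = CoinNature()
model = IdentityModel()
rewards = [nature.response(model.predict(nature.challenge())) for _ in range(100)]
print(sum(rewards))  # 100: a perfect model on this trivial problem
```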
A second important question is how we validate a model. In standard software engineering, the assumption is (rightly or wrongly) that all we need to show is conformance to the specification. However, pattern recognition and machine learning methods do not have a specification. They do, however, have lots of data, and that lets us make good empirical predictions about performance.
# evaluating the model
def count_errors(nature,model,trials=100):
    errors = 0
    for i in range(trials):
        v = nature.challenge()
        c = model.predict(v)
        if nature.response(c)!=True:
            errors += 1
    return errors
count_errors(nature,model,trials=100)
93
OK, that's not very good... the model is wrong about 90 percent of the time, which is chance level for a ten-class problem. Let's try for something better.
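As a quick sanity check on the chance-level claim: a constant predictor facing ten equally likely classes should be wrong about 90 percent of the time. A small simulation (a sketch, not from the notebook):

```python
import random

# A constant predictor of class 0 against 10 equally likely classes
# should err on roughly 9 out of 10 trials, i.e. chance level.
random.seed(0)
trials = 100000
errors = sum(1 for _ in range(trials) if random.randint(0, 9) != 0)
rate = errors / float(trials)
print(rate)  # approximately 0.9
```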
class MemoryModel:
    def __init__(self,dataset):
        self.memory = {}
        for v,c in dataset:
            self.memory[tuple(v)] = c
    def predict(self,v):
        # unknown vectors default to class 0
        return self.memory.get(tuple(v),0)
model = MemoryModel(trainingset)
count_errors(nature,model,trials=100)
0
In the noise-free case, memorization of samples may give good results.
It still lacks generalization: it cannot make good predictions for previously unseen samples that are similar to, but different from, the stored samples.
In the above example, memorization worked only because nature generated fixed patterns with a one-to-one correspondence to classes.
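To see the lack of generalization concretely, here is a dict-based memorizer (re-created standalone for illustration) confronted with a vector that differs from a stored pattern by a single segment:

```python
# A dict-based memorizer has no notion of similarity: a vector one bit
# away from a stored pattern is simply unknown to it.
memory = {(1, 1, 1, 1, 1, 1, 0): 0}   # the stored segment pattern for "0"
query = (0, 1, 1, 1, 1, 1, 0)         # same pattern with one segment flipped
print(memory.get(query, "unknown"))   # unknown
```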
In practice, there is usually noise.
Noise means that measurements are randomly perturbed: the same class can produce different measurement vectors, so exact memorization no longer suffices.
The overall goal remains the same: minimize the error rate.
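The bit-flip noise used below can be sketched in isolation. On 0/1 arrays, `!=` acts as XOR, so segments where the mask is True get toggled (plain NumPy imports are used here instead of the notebook's pylab style):

```python
import numpy as np

# Bit-flip noise: draw a Bernoulli mask and XOR it with the pattern.
rng = np.random.RandomState(0)          # fixed seed, for reproducibility
v = np.array([1, 1, 1, 1, 1, 1, 0])    # the segment pattern for "0"
flip = rng.random_sample(7) < 0.07     # Bernoulli(0.07) mask per segment
noisy = 1.0 * (v != flip)              # != acts as XOR on 0/1 values
```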
class NoisySevenSegments(SevenSegments):
    def __init__(self):
        SevenSegments.__init__(self)   # reuse the segment patterns above
        self.p_noise = 0.07
Now the training samples consist of the true targets, but with some of the segments flipped.
@method(NoisySevenSegments)
def training_sample(self):
    c = pyrandom.randint(0,len(self.vs)-1)
    v = self.vs[c]
    flip = 1.0*(arandom(size=7)<self.p_noise)
    v = 1.0*(v!=flip)
    return (v,c)
Let's see how well the memory model works.
nature = NoisySevenSegments()
trainingset = [nature.training_sample() for i in range(100)]
model = MemoryModel(trainingset)
print count_errors(nature,model,trials=10000)
3425
Let's use "similarity" to help improve performance.
from scipy.spatial import distance
distance.cdist(randn(1,10),randn(10,10))
array([[ 3.8478801 , 4.20201828, 3.70312897, 5.37046302, 3.23728652, 4.94973151, 3.44314282, 2.96195443, 3.27705394, 3.53697688]])
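`cdist` computes all pairwise distances between two sets of row vectors, Euclidean by default. On points with known geometry the result is easy to check:

```python
import numpy as np
from scipy.spatial import distance

# One query point against two reference points: the distances are the
# hypotenuse of a 3-4-5 triangle and a unit step, respectively.
a = np.array([[0.0, 0.0]])                # one query point
b = np.array([[3.0, 4.0], [0.0, 1.0]])   # two reference points
print(distance.cdist(a, b))              # [[5. 1.]]
```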
# nearest neighbor models
class NearestNeighborModel:
    def __init__(self,dataset):
        self.centers = array([v for v,c in dataset],'f')
        self.classes = array([c for v,c in dataset],'i')
    def predict(self,v):
        # return the class of the closest training vector
        ds = distance.cdist(array([v],'f'),self.centers)
        i = argmin(ds[0])
        return self.classes[i]
model = NearestNeighborModel(trainingset)
count_errors(nature,model,trials=10000)
2859
We often report the error rate, the fraction of samples that are misclassified.
count_errors(nature,model,trials=10000)/10000.0
0.2785
Above, we evaluated the model by trying it out in the real world and getting feedback from nature. However, usually, we use a test set instead, a set of samples obtained like training samples.
We can use this to estimate the error rate as well.
testset = [nature.training_sample() for i in range(100)]
truth = [c for v,c in testset]
predictions = [model.predict(v) for v,c in testset]
sum(array(predictions)!=array(truth))
26
We can wrap this up as a function that evaluates a model on a test set.
def dataset_eval(testset,model):
    truth = [c for v,c in testset]
    predictions = [model.predict(v) for v,c in testset]
    return sum(array(predictions)!=array(truth))*1.0/len(testset)
dataset_eval(testset,model)
0.26000000000000001
Note that we can also evaluate the performance of the model on the training set itself. This will, in general, give an error rate that is too low. For nearest neighbor classifiers, it is easy to see why: each training sample's nearest neighbor in the training set is the sample itself, at distance zero, so errors occur only where identical vectors carry different labels.
dataset_eval(trainingset,model)
0.13
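This can be demonstrated standalone: with distinct, noise-free training vectors, a 1-nearest-neighbor model evaluated on its own training set makes no errors, because each sample's nearest neighbor is itself at distance zero (the code below re-creates the nearest-neighbor logic for illustration):

```python
import numpy as np
from scipy.spatial import distance

# Three distinct seven-segment patterns with labels 0, 1, 2.
centers = np.array([[1, 1, 1, 1, 1, 1, 0],
                    [0, 1, 1, 0, 0, 0, 0],
                    [1, 1, 0, 1, 1, 0, 1]], 'f')
classes = np.array([0, 1, 2])
# Classify each training vector against the whole training set.
preds = [int(classes[np.argmin(distance.cdist(centers[i:i+1], centers)[0])])
         for i in range(len(centers))]
print(preds)  # [0, 1, 2] -- zero training-set errors
```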
Two important things to remember about test sets are: the test samples must be kept strictly separate from the training samples, and they must be drawn from the same source as the training samples.
In fact, often we just get a single set of labeled samples. In that case, we need to split the data manually into training and test sets.
all_indexes = set(range(len(trainingset)))
test_indexes = set(pyrandom.sample(all_indexes,int(0.1*len(all_indexes))))
training_indexes = all_indexes - test_indexes
my_testset = [trainingset[i] for i in test_indexes]
my_trainingset = [trainingset[i] for i in training_indexes]
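A common alternative to the set-based split above is to shuffle indices and carve off a test portion. A sketch (the helper `split_dataset` is hypothetical, not part of the notebook):

```python
import random

def split_dataset(dataset, test_fraction=0.1, seed=0):
    # Shuffle the indices, then take the first chunk as the test set
    # and the remainder as the training set.
    rng = random.Random(seed)
    indexes = list(range(len(dataset)))
    rng.shuffle(indexes)
    n_test = int(test_fraction * len(dataset))
    test = [dataset[i] for i in indexes[:n_test]]
    train = [dataset[i] for i in indexes[n_test:]]
    return train, test

data = [(i, i % 10) for i in range(100)]   # stand-in for labeled samples
train, test = split_dataset(data)
print(len(train), len(test))  # 90 10
```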
In fact, we often split labeled data into three sets: a training set for building models, a validation set for choosing among models, and a test set reserved for the final performance estimate.
Note that you cannot use a test set too often: just by chance, you may find a model that merely seems to work well.
We will talk about ways of dealing with this problem later.