Title: Introduction to Classification
Author: Thomas Breuel
Institution: UniKL
import numpy,scipy,scipy.ndimage,zlib
import random as pyrandom
from pylab import *
from pylab import random as arandom
def method(cls):
    import new
    def _wrap(f):
        cls.__dict__[f.func_name] = new.instancemethod(f,None,cls)
        return None
    return _wrap
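The decorator above relies on the Python 2 `new` module and `f.func_name`, which no longer exist in Python 3. For reference, a minimal Python 3 equivalent (a sketch; the name `method3` is chosen here only to avoid clashing with the notebook's `method`) uses `setattr` and `f.__name__`:

```python
def method3(cls):
    # Attach a function to an existing class after the fact, so classes
    # can be built up incrementally, cell by cell.
    def _wrap(f):
        setattr(cls, f.__name__, f)
        return None   # bind the name on the class, not in the module
    return _wrap

class Empty:
    pass

@method3(Empty)
def greet(self):
    return "hello"

print(Empty().greet())  # hello
```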
In regular software engineering, people think of software development as starting from a specification. In pattern recognition, however, you don't know the "specification". Instead, you have a source of data that generates instances. Let's look at how this works.
First, we generate a problem instance. In this case, the problem instance is actually the instantiation of a class, but in the real world, the problem instance is usually some particular physical instance: a device, a web site, a person, a book, etc.
We may call the problem instance nature, in order to emphasize that it is usually given by physics and not another software system. However, "nature" may well be another software system.
# Nature
class Nature:
    # Abstract interface; concrete subclasses implement these methods.
    def training_sample(self):
        # returns a (measurement,class) pair
        pass
    def challenge(self):
        # returns a measurement whose class is to be predicted
        pass
    def response(self,r):
        # returns a reward evaluating the response r
        pass
# An instance of "nature"
class SevenSegments(Nature):
    def __init__(self):
        self.vs = [None] * 10
        self.vs[0] = array((1,1,1,1,1,1,0))
        self.vs[1] = array((0,1,1,0,0,0,0))
        self.vs[2] = array((1,1,0,1,1,0,1))
        self.vs[3] = array((1,1,1,1,0,0,1))
        self.vs[4] = array((0,1,1,0,0,1,1))
        self.vs[5] = array((1,0,1,1,0,1,1))
        self.vs[6] = array((1,0,1,1,1,1,1))
        self.vs[7] = array((1,1,1,0,0,0,0))
        self.vs[8] = array((1,1,1,1,1,1,1))
        self.vs[9] = array((1,1,1,1,0,1,1))
Nature gives us training samples consisting of measurements and correct responses.
@method(SevenSegments)
def training_sample(self):
    c = pyrandom.randint(0,9)
    v = self.vs[c]
    return (v,c)
Based on these training samples, we need to build a model that returns correct classifications. At test time, nature presents challenges: training samples without the classification.
@method(SevenSegments)
def challenge(self):
    v,self.c = self.training_sample()
    return v
Our classifier needs to respond with a class, and nature evaluates our response and gives us feedback.
@method(SevenSegments)
def response(self,c):
    return (c==self.c)
nature = SevenSegments()
nature.training_sample()
(array([1, 1, 1, 1, 1, 1, 0]), 0)
We are trying to build a model of nature. A model responds to challenges by predicting a response.
# Models
class Model:
    def __init__(self,dataset):
        self.dataset = dataset
    def predict(self,v):
        # trivial baseline: always predict class 0
        return 0
Nature gives us training examples. These training examples consist of some kind of measurement vector v and a corresponding classification c.
v,c = nature.training_sample()
print "measurement vector",v
print "class",c
measurement vector [1 0 1 1 0 1 1]
class 5
At training time, we can request training samples.
Usually, generating labeled training samples costs money, so we only obtain a limited number of them.
These are usually collected in a dataset.
trainingset = [nature.training_sample() for i in range(100)]
trainingset[:5]
[(array([1, 1, 0, 1, 1, 0, 1]), 2), (array([1, 0, 1, 1, 1, 1, 1]), 6), (array([1, 1, 1, 1, 1, 1, 0]), 0), (array([1, 1, 1, 1, 1, 1, 0]), 0), (array([1, 1, 1, 1, 0, 0, 1]), 3)]
We usually use the dataset to create a model.
model = Model(trainingset)
After training, nature presents us with challenges or test samples.
nature.challenge()
array([1, 1, 1, 1, 0, 1, 1])
The task of the model is to predict a value based on the challenge.
prediction = model.predict(nature.challenge())
print prediction
0
Our prediction is then handed back to nature for evaluation. We may or may not see this evaluation, but eventually, our model will be judged on the quality of its predictions.
nature.response(prediction)
False
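The whole challenge/response protocol above can be sketched end-to-end with a toy nature and a toy model (the names `CoinNature` and `IdentityModel` are illustrative, not part of the notebook):

```python
import random

class CoinNature:
    # toy "nature" with two classes, where the measurement is the class
    def training_sample(self):
        c = random.randint(0, 1)
        return (c, c)              # (measurement, class)
    def challenge(self):
        v, self.c = self.training_sample()
        return v
    def response(self, c):
        return c == self.c         # feedback: was the prediction right?

class IdentityModel:
    def predict(self, v):
        return v                   # the measurement is the class here

nature = CoinNature()
model = IdentityModel()
rewards = [nature.response(model.predict(nature.challenge())) for _ in range(100)]
print(sum(rewards))  # 100: a perfect model on this trivial problem
```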
A second important question is how we validate a model. In standard software engineering, the assumption is (rightly or wrongly) that all we need to show is conformance to the specification. However, pattern recognition and machine learning methods do not have a specification. They do, however, have lots of data, and that lets us make good empirical predictions about performance.
# evaluating the model
def count_errors(nature,model,trials=100):
    errors = 0
    for i in range(trials):
        v = nature.challenge()
        c = model.predict(v)
        if nature.response(c)!=True:
            errors += 1
    return errors
count_errors(nature,model,trials=100)
93
OK, that's not very good... the model is wrong about 90 percent of the time, which is chance level for a ten-class problem. Let's try for something better.
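As a quick sanity check on the chance-level claim: a constant predictor facing ten equally likely classes should be wrong about 90 percent of the time. A small simulation (a sketch, not from the notebook):

```python
import random

# A constant predictor of class 0 against 10 equally likely classes
# should err on roughly 9 out of 10 trials, i.e. chance level.
random.seed(0)
trials = 100000
errors = sum(1 for _ in range(trials) if random.randint(0, 9) != 0)
rate = errors / float(trials)
print(rate)  # approximately 0.9
```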
class MemoryModel:
    def __init__(self,dataset):
        self.memory = {}
        for v,c in dataset:
            self.memory[tuple(v)] = c
    def predict(self,v):
        # unknown vectors default to class 0
        return self.memory.get(tuple(v),0)
model = MemoryModel(trainingset)
count_errors(nature,model,trials=100)
0
In the noise-free case, memorization of samples may give good results.
It still lacks generalization: it cannot make good predictions for previously unseen samples that are similar to, but different from, the stored samples.
In the above example, memorization worked only because nature generated fixed patterns with a one-to-one correspondence to classes.
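To see the lack of generalization concretely, here is a dict-based memorizer (re-created standalone for illustration) confronted with a vector that differs from a stored pattern by a single segment:

```python
# A dict-based memorizer has no notion of similarity: a vector one bit
# away from a stored pattern is simply unknown to it.
memory = {(1, 1, 1, 1, 1, 1, 0): 0}   # the stored segment pattern for "0"
query = (0, 1, 1, 1, 1, 1, 0)         # same pattern with one segment flipped
print(memory.get(query, "unknown"))   # unknown
```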
In practice, there is usually noise.
Noise means that measurements are randomly perturbed: the same class can produce different measurement vectors, so exact memorization no longer suffices.
The overall goal remains the same: minimize the error rate.
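The bit-flip noise used below can be sketched in isolation. On 0/1 arrays, `!=` acts as XOR, so segments where the mask is True get toggled (plain NumPy imports are used here instead of the notebook's pylab style):

```python
import numpy as np

# Bit-flip noise: draw a Bernoulli mask and XOR it with the pattern.
rng = np.random.RandomState(0)          # fixed seed, for reproducibility
v = np.array([1, 1, 1, 1, 1, 1, 0])    # the segment pattern for "0"
flip = rng.random_sample(7) < 0.07     # Bernoulli(0.07) mask per segment
noisy = 1.0 * (v != flip)              # != acts as XOR on 0/1 values
```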
class NoisySevenSegments(SevenSegments):
    def __init__(self):
        SevenSegments.__init__(self)   # reuse the segment patterns above
        self.p_noise = 0.07
Now the training samples consist of the true targets, but with some of the segments flipped.
@method(NoisySevenSegments)
def training_sample(self):
    c = pyrandom.randint(0,len(self.vs)-1)
    v = self.vs[c]
    flip = 1.0*(arandom(size=7)<self.p_noise)
    v = 1.0*(v!=flip)
    return (v,c)
Let's see how well the memory model works.
nature = NoisySevenSegments()
trainingset = [nature.training_sample() for i in range(100)]
model = MemoryModel(trainingset)
print count_errors(nature,model,trials=10000)
3425
Let's use "similarity" to help improve performance.
from scipy.spatial import distance
distance.cdist(randn(1,10),randn(10,10))
array([[ 3.8478801 , 4.20201828, 3.70312897, 5.37046302, 3.23728652, 4.94973151, 3.44314282, 2.96195443, 3.27705394, 3.53697688]])
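`cdist` computes all pairwise distances between two sets of row vectors, Euclidean by default. On points with known geometry the result is easy to check:

```python
import numpy as np
from scipy.spatial import distance

# One query point against two reference points: the distances are the
# hypotenuse of a 3-4-5 triangle and a unit step, respectively.
a = np.array([[0.0, 0.0]])                # one query point
b = np.array([[3.0, 4.0], [0.0, 1.0]])   # two reference points
print(distance.cdist(a, b))              # [[5. 1.]]
```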
# nearest neighbor models
class NearestNeighborModel:
    def __init__(self,dataset):
        self.centers = array([v for v,c in dataset],'f')
        self.classes = array([c for v,c in dataset],'i')
    def predict(self,v):
        # return the class of the closest training vector
        ds = distance.cdist(array([v],'f'),self.centers)
        i = argmin(ds[0])
        return self.classes[i]
model = NearestNeighborModel(trainingset)
count_errors(nature,model,trials=10000)
2859
We often report the error rate, the fraction of samples that are misclassified.
count_errors(nature,model,trials=10000)/10000.0
0.2785
Above, we evaluated the model by trying it out in the real world and getting feedback from nature. However, usually, we use a test set instead, a set of samples obtained like training samples.
We can use this to estimate the error rate as well.
testset = [nature.training_sample() for i in range(100)]
truth = [c for v,c in testset]
predictions = [model.predict(v) for v,c in testset]
sum(array(predictions)!=array(truth))
26
We can wrap this up as a function that evaluates a model on a test set.
def dataset_eval(testset,model):
    truth = [c for v,c in testset]
    predictions = [model.predict(v) for v,c in testset]
    return sum(array(predictions)!=array(truth))*1.0/len(testset)
dataset_eval(testset,model)
0.26000000000000001
Note that we can also evaluate the performance of the model on the training set itself. This will, in general, give an error rate that is too low. For nearest neighbor classifiers, it is easy to see why: each training sample's nearest neighbor in the training set is the sample itself, at distance zero, so errors occur only where identical vectors carry different labels.
dataset_eval(trainingset,model)
0.13
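This can be demonstrated standalone: with distinct, noise-free training vectors, a 1-nearest-neighbor model evaluated on its own training set makes no errors, because each sample's nearest neighbor is itself at distance zero (the code below re-creates the nearest-neighbor logic for illustration):

```python
import numpy as np
from scipy.spatial import distance

# Three distinct seven-segment patterns with labels 0, 1, 2.
centers = np.array([[1, 1, 1, 1, 1, 1, 0],
                    [0, 1, 1, 0, 0, 0, 0],
                    [1, 1, 0, 1, 1, 0, 1]], 'f')
classes = np.array([0, 1, 2])
# Classify each training vector against the whole training set.
preds = [int(classes[np.argmin(distance.cdist(centers[i:i+1], centers)[0])])
         for i in range(len(centers))]
print(preds)  # [0, 1, 2] -- zero training-set errors
```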
Two important things to remember about test sets are: the test samples must be kept strictly separate from the training samples, and they must be drawn from the same source as the training samples.
In fact, often we just get a single set of labeled samples. In that case, we need to split the data manually into training and test sets.
all_indexes = set(range(len(trainingset)))
test_indexes = set(pyrandom.sample(all_indexes,int(0.1*len(all_indexes))))
training_indexes = all_indexes - test_indexes
my_testset = [trainingset[i] for i in test_indexes]
my_trainingset = [trainingset[i] for i in training_indexes]
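A common alternative to the set-based split above is to shuffle indices and carve off a test portion. A sketch (the helper `split_dataset` is hypothetical, not part of the notebook):

```python
import random

def split_dataset(dataset, test_fraction=0.1, seed=0):
    # Shuffle the indices, then take the first chunk as the test set
    # and the remainder as the training set.
    rng = random.Random(seed)
    indexes = list(range(len(dataset)))
    rng.shuffle(indexes)
    n_test = int(test_fraction * len(dataset))
    test = [dataset[i] for i in indexes[:n_test]]
    train = [dataset[i] for i in indexes[n_test:]]
    return train, test

data = [(i, i % 10) for i in range(100)]   # stand-in for labeled samples
train, test = split_dataset(data)
print(len(train), len(test))  # 90 10
```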
In fact, we often split labeled data into three sets: a training set for building models, a validation set for choosing among models, and a test set reserved for the final performance estimate.
Note that you cannot use a test set too often: just by chance, you may find a model that merely seems to work well.
We will talk about ways of dealing with this problem later.