a.k.a. Classifying for Fun and Profit
Last class we mapped some members of the NBA. The result looked something like this:
import pandas as pd
import matplotlib.pylab as plt
nba_df = pd.read_csv('data/NBA-Census-10.14.2013.csv')
# Loop through the positions, markers and colors
for position, marker, color in zip(['C','G','F'], ">xo", "cmy"):
    # Get the players that play that given position
    players = nba_df[nba_df["POS"] == position]
    # Add their points to the scatterplot
    plt.scatter(players["Ht (In.)"], players["WT"], c=color, marker=marker, alpha=0.5)
# Add some labels for readability
plt.xlabel("Height (inches)")
plt.ylabel("Weight")
[Scatterplot: height vs. weight, with a different marker and color for each position]
First, what the heck is this `for ... in zip()` thing?
Last time we mapped everyone, we did it a more verbose way. First we pulled out the centers, forwards and guards, then we graphed the centers, forwards and guards. Lots of repetition, right?
Using a `for` loop with `zip` allows us to loop through more than one thing at a time.
for position, marker, color in zip(['C','G','F'], ">xo", "cmy"):
    print "== Going through the loop!"
    print "Position is %s" % position
    print "Marker is %s" % marker
    print "Color is %s" % color
== Going through the loop!
Position is C
Marker is >
Color is c
== Going through the loop!
Position is G
Marker is x
Color is m
== Going through the loop!
Position is F
Marker is o
Color is y
The first time through we get the first elements of everything being `zip`ped up (the `'C'`, the `'>'` and the `'c'`). This works because you can loop over the members of `"cmy"` the same as you can loop over `['C','G','F']`:
letters = "cmy"
print letters[0]
print letters[1]
print letters[2]
arr = ['C', 'G', 'F']
print arr[0]
print arr[1]
print arr[2]
c
m
y
C
G
F
And since our `for` loop is asking for three things (`position`, `marker`, and `color`), Python automatically assigns the first element of each `zip`ped tuple to the matching variable. Let's take another look!
for letter, number in zip("ABC", "123"):
    print "== Going through the loop!"
    print "Letter is %s" % letter
    print "Number is %s" % number
== Going through the loop!
Letter is A
Number is 1
== Going through the loop!
Letter is B
Number is 2
== Going through the loop!
Letter is C
Number is 3
The `for` loop assigns variables based on their position in the `zip`ped list: `letter` maps to `"ABC"` and `number` to `"123"`. You'll run into this pattern often enough that you should probably know about it!
Oh, and if you want to know what `zip` is actually doing...
zip(['C','G','F'], ">xo", "cmy")
[('C', '>', 'c'), ('G', 'x', 'm'), ('F', 'o', 'y')]
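One wrinkle worth knowing: `zip` stops at the shortest thing you hand it, so any extra elements get quietly dropped.

print zip("ABCD", "12")
# Prints [('A', '1'), ('B', '2')] - the 'C' and 'D' never show up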
Hopefully that makes a bit of sense so you can work through it when you see it!
import pandas as pd
import matplotlib.pylab as plt
nba_df = pd.read_csv('data/NBA-Census-10.14.2013.csv')
# Loop through the positions, markers and colors
for position, marker, color in zip(['C','G','F'], ">xo", "cmy"):
    # Get the players that play that given position
    players = nba_df[nba_df["POS"] == position]
    # Add their points to the scatterplot
    plt.scatter(players["Ht (In.)"], players["WT"], c=color, marker=marker, alpha=0.5)
# Add some labels for readability
plt.xlabel("Height (inches)")
plt.ylabel("Weight")
[Scatterplot: height vs. weight by position, same as above]
We can see that the guards are lighter and shorter, the centers are taller and heavier, and the forwards are somewhere in between. If given someone clearly in those groups, we wouldn't do a terrible job of figuring out what position they play. We know it because we can see it, but how can a computer figure that out?
Remember that time in middle school you bought a skateboard and your parents decided you were going to spend the rest of your life shoplifting and smoking pot because That's All Kids With Skateboards Do? And hey, maybe some of them did, but Come On, Mom, You're Different, You Promise?
Luckily for us, computers are exactly dumb enough to jump to the same sorts of conclusions - we're going to help them get there.
The nearest neighbor algorithm is an incredibly simple way to classify objects. It's an example of supervised learning, a type of machine learning where the model requires a bit of initial data (called training data) to base its predictions on.
Your parents probably based their model on a Fox News special called Satanic Dope-Fiend Skaters. As we'll see later, it's important to pick a nice random set of training data to not bias your later results.
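For the record, we're about to grab the first 20 rows of the spreadsheet because it keeps the outputs below reproducible, but that's not actually random. If the CSV happens to be sorted in some sneaky way, our training data will be biased. A minimal sketch of the safer approach, shuffling the rows first (`random_training` is just a name I made up):

import numpy as np
# Reorder the rows randomly, then slice off the first 20 as training data
shuffled = nba_df.take(np.random.permutation(len(nba_df)))
random_training = shuffled[:20]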
# First, let's get a small dose of training data.
# Let's take 20 players out of the full 528
nba_training = nba_df[:20]
# And graph it to see what we've got
for position, marker, color in zip(['C','G','F'], ">xo", "cmy"):
    players = nba_training[nba_training["POS"] == position]
    plt.scatter(players["Ht (In.)"], players["WT"], c=color, marker=marker)
plt.xlabel("Height (inches)")
plt.ylabel("Weight")
[Scatterplot: the 20 training players, height vs. weight by position]
Looks pretty clear to me! You can still see the centers on the top right, the forwards in the middle and the guards to the bottom left.
The way nearest neighbors works is by finding, well, the nearest neighbor of a new point. Let's say I have a new person who is 220 pounds and 76 inches tall.
plt.plot(76, 220, marker='*', c='r', ms=10)
# Zoom the axes in around our new player
plt.ylim([170, 270])
plt.xlim([72, 84])
(72, 84)
Let's see how close he is to all of the other players...
for index, player in nba_training.iterrows():
    # Draw a dashed line from the new player to each training player
    plt.plot([76, player["Ht (In.)"]], [220, player["WT"]], 'k', linestyle='dashed', linewidth=1)
plt.scatter(nba_training["Ht (In.)"], nba_training["WT"])
plt.plot(76, 220, marker='*', c='r', ms=10)
[Plot: dashed lines from the new player to each of the 20 training players]
# Except of course that didn't do anything. Let's actually get lengths.
import numpy as np
np.sqrt((nba_training["Ht (In.)"] - 76) ** 2 + (nba_training["WT"] - 220) ** 2)
0      2.236068
1      3.000000
2     25.179357
3      7.000000
4     10.440307
5      3.000000
6      3.605551
7     21.377558
8     16.155494
9     25.961510
10    20.024984
11    40.049969
12    37.013511
13    13.152946
14    37.483330
15    35.227830
16    30.413813
17     1.000000
18    10.198039
19    35.128336
dtype: float64
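By the way, you don't have to eyeball that list for the smallest number. Stash the distances in a variable and pandas will find the minimum for you; `idxmin` hands back the index of the smallest value:

distances = np.sqrt((nba_training["Ht (In.)"] - 76) ** 2 + (nba_training["WT"] - 220) ** 2)
# The index of the closest player - this should print 17
print distances.idxmin()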
# Looks like he's closest to #17
# Let's plot new guy
plt.plot(76,220, marker='*',c='r', ms=10)
# Let's plot 17
match = nba_training.ix[17]
plt.plot(match["Ht (In.)"], match["WT"], marker='*',c='r', ms=10)
plt.ylim([170, 270])
plt.xlim([72, 84])
(72, 84)
# Looks pretty close! So who is #17?
nba_training.ix[17]
Name                                  Harden, James
Age                                              24
Team                                        Rockets
POS                                               G
#                                                13
2013 $                                  $13,701,250
Ht (In.)                                         77
WT                                              220
EXP                                               4
1st Year                                       2009
DOB                                       8/26/1989
School                                Arizona State
City                                Los Angeles, CA
State (Province, Territory, Etc..)       California
Country                                          US
Race                                          Black
HS Only                                          No
Name: 17, dtype: object
He's a guard! And according to the person across the room from me, he is very famous but doesn't play defense at all, so people actually hate him. And he has a beard.
But regardless, since he's the closest person (or nearest neighbor) to our new player, we're going to predict that the new guy is a guard, too.
The nearest neighbor algorithm just finds the closest known point and classifies the new point to match. It also works with a lot more than a 2-dimensional graph: it works in big huge n-dimensional space; it's just easier to visualize in 2D.
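To make that concrete, here's a minimal sketch of straight-line distance with any number of features, not just two. The `columns` list is whatever measurements you care about, and `euclidean_distance` is just a name I picked:

import numpy as np

def euclidean_distance(row, point, columns):
    # Square the difference along every axis, add them up, square root
    total = 0
    for column, value in zip(columns, point):
        total += (row[column] - value) ** 2
    return np.sqrt(total)

# Same answer as before for our 76-inch, 220-pound mystery man
print euclidean_distance(nba_training.ix[17], [76, 220], ["Ht (In.)", "WT"])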
Do you want to write a program to compute the minimum distance for every single new player? No, of course not. And while we should probably make you do that anyway, instead we'll introduce you to a brand new friend called scikit-learn.
Scikit-learn comes with a fun little module called `neighbors`, which - you guessed it - does nearest neighbor analysis. And it'll even work with pandas!
from sklearn.neighbors import KNeighborsClassifier
# Let's initialize a classifier
knn = KNeighborsClassifier(n_neighbors=1)
# knn.fit takes two parameters
# First, the content we want to train on. For us
# it's height and weight.
# Secondly, how we're classifying each element of the
# training data. We're classifying by position!
knn.fit(nba_training[["Ht (In.)", "WT"]], nba_training["POS"])
# Now we need to make a prediction. It's pretty easy
# (note the double brackets: predict expects a list of points)
knn.predict([[76, 220]])
array(['G'], dtype=object)
Ta-da! It predicted guard, just like we 1) guessed and 2) predicted ourselves. This is so easy it should be illegal. We can also find some more information about any given point...
# Let's get the first neighbor
distance, neighbors = knn.kneighbors([[76, 220]])
# It returns an array of arrays because you can ask about several
# points at once. We only asked about one, so grab the first row.
print "Neighbor is %s " % neighbors[0]
print "Distance is %s" % distance[0]
nba_training.ix[neighbors[0]]
Neighbor is [17]
Distance is [ 1.]
  | Name | Age | Team | POS | # | 2013 $ | Ht (In.) | WT | EXP | 1st Year | DOB | School | City | State (Province, Territory, Etc..) | Country | Race | HS Only |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
17 | Harden, James | 24 | Rockets | G | 13 | $13,701,250 | 77 | 220 | 4 | 2009 | 8/26/1989 | Arizona State | Los Angeles, CA | California | US | Black | No |
# Let's get the closest three neighbors
distance, neighbors = knn.kneighbors([[76, 220]], 3)
# Even though it's returning three neighbors/distances now,
# you'll still use [0] to grab them
print "Neighbors are %s " % neighbors[0]
print "Distances are %s" % distance[0]
# You should be able to add a new distance column with that
# data, too. See the sketch after the table below for one way.
nba_training.ix[neighbors[0]]
Neighbors are [17 0 1]
Distances are [ 1.          2.23606798  3.        ]
  | Name | Age | Team | POS | # | 2013 $ | Ht (In.) | WT | EXP | 1st Year | DOB | School | City | State (Province, Territory, Etc..) | Country | Race | HS Only |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
17 | Harden, James | 24 | Rockets | G | 13 | $13,701,250 | 77 | 220 | 4 | 2009 | 8/26/1989 | Arizona State | Los Angeles, CA | California | US | Black | No |
0 | Gee, Alonzo | 26 | Cavaliers | F | 33 | $3,250,000 | 78 | 219 | 4 | 2009 | 5/29/1987 | Alabama | Riviera Beach, FL | Florida | US | Black | No |
1 | Wallace, Gerald | 31 | Celtics | F | 45 | $10,105,855 | 79 | 220 | 12 | 2001 | 7/23/1982 | Alabama | Sylacauga, AL | Alabama | US | Black | No |
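About that comment in the code above: one way to bolt the distances onto the matched rows is to copy the slice and assign a new column. A sketch, probably not the most elegant way:

# Grab the matched rows, then attach the distances as a column
matches = nba_training.ix[neighbors[0]].copy()
matches["Distance"] = distance[0]
matches[["Name", "POS", "Ht (In.)", "WT", "Distance"]]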
Hrm, looks like there are a few forwards nearby, too. Maybe instead of just looking at one neighbor, we should look at more? We're in luck!
knn = KNeighborsClassifier(n_neighbors=3)
# Same training data as before, but now the prediction
# is a vote among the three nearest neighbors
knn.fit(nba_training[["Ht (In.)", "WT"]], nba_training["POS"])
knn.predict([[76, 220]])
array(['F'], dtype=object)
knn = KNeighborsClassifier(n_neighbors=5)
# And again, voting among the five nearest neighbors
knn.fit(nba_training[["Ht (In.)", "WT"]], nba_training["POS"])
knn.predict([[76, 220]])
array(['F'], dtype=object)
knn = KNeighborsClassifier(n_neighbors=20)
# And once more, voting among all twenty training points
knn.fit(nba_training[["Ht (In.)", "WT"]], nba_training["POS"])
knn.predict([[76, 220]])
array(['F'], dtype=object)
Even though `k = 1` says he's a guard, `k = 3` says he'll be a forward. We then upped it to `k = 20`, which agrees that he'll be a forward, although there are only twenty data points in our training set, so that probably just means there are more forwards than anything else.
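That hunch is easy to check, since `value_counts` will tally how many of each position we trained on:

# How many of each position are in our 20 training players?
print nba_training["POS"].value_counts()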
knn = KNeighborsClassifier(n_neighbors=20)
knn.fit(nba_training[["Ht (In.)", "WT"]], nba_training["POS"])
knn.predict([[90, 420]])
array(['F'], dtype=object)
Yeah, someone who is 7'6" and 420 lbs is probably going to be a center, not a forward. Looks wrong to me.
Note to self: pick your `n_neighbors` values carefully.
We've actually had a big problem following us around for a while, but since you aren't an expert Problem Tracker you might not have noticed. It deals with how we're computing distance.
Let's say we have Cool Mister Gerald, who is 200lbs and 7'6". Is he more similar to 7'3", 220lb Bubbles McSlamdunk or 6'0", 200lb Speedy Runsalot? (that's 90 inches, 87 inches and 72 inches, respectively)
bubbles_distance = np.sqrt((90 - 87) ** 2 + (200 - 220) ** 2)
speedy_distance = np.sqrt((90 - 72) ** 2 + (200 - 200) ** 2)
print "%s units from Bubbles" % bubbles_distance
print "%s units from Speedy" % speedy_distance
20.2237484162 units from Bubbles
18.0 units from Speedy
So it looks like he should be assigned to the same group as Speedy. But does that make any sense? Cool Mister Gerald and Bubbles McSlamdunk seem like they'd have a lot more in common, since they're both really tall. Speedy might be the same weight as Cool Mister, but he's also really short (for the NBA!).
That, my friends, is our Big Problem.
Right now, both height and weight are on equal footing. An inch of height difference puts you as far away from someone as a pound of weight difference, even though being a foot taller than someone means a lot more than being 12 pounds lighter.
We need to fix this up if anyone is going to take us seriously.
The way we're going to do this is by figuring out how far from average everyone is in standard deviations. If you're way above-average tall, you'll have a good chance of being grouped with someone else who is way above-average tall, even if you aren't necessarily similar in other ways.
Let's take a look.
# First, let's subtract the mean from everyone's height
height_distance_from_mean = nba_df["Ht (In.)"] - nba_df["Ht (In.)"].mean()
# ...and look at the first 5
height_distance_from_mean[:5]
0   -1.119318
1   -0.119318
2   -6.119318
3    3.880682
4   -0.119318
Name: Ht (In.), dtype: float64
By subtracting the mean, if you're above zero, you're above average. Below zero, below average. Now we need to divide this value by the standard deviation to see how meaningful each unit of distance actually is.
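To sanity-check with the numbers above: player 0 is 78 inches tall, about 1.1 inches below the mean (roughly 79.1 inches).

# Dividing by the standard deviation of height should print
# roughly -0.33, the first value in the next output
print (78 - nba_df["Ht (In.)"].mean()) / nba_df["Ht (In.)"].std()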
MAKE SURE YOU ARE DOING THIS IN YOUR ORIGINAL `nba_df` DATA. WHEN WE'RE PLAYING AROUND WITH THINGS LATER, WE'LL ADD MORE ELEMENTS INTO OUR TEST DATA AND WE DON'T WANT TO HAVE TO RECALCULATE!
adjusted_heights = height_distance_from_mean / nba_df["Ht (In.)"].std()
adjusted_heights[:5]
0   -0.326190
1   -0.034772
2   -1.783284
3    1.130904
4   -0.034772
Name: Ht (In.), dtype: float64
So player 2 is way below average in height, player 3 is above average, and the rest are middling-ish. Let's add this information into the dataframe!
adjusted_weights = (nba_df["WT"] - nba_df["WT"].mean()) / nba_df["WT"].std()
# Need to put it into the original dataframe!
nba_df["Adj Weight"] = adjusted_weights
nba_df["Adj Height"] = adjusted_heights
# Let's take a look at our new adjusted columns
nba_df[["Name", "WT", "Adj Weight", "Ht (In.)", "Adj Height"]][:10]
  | Name | WT | Adj Weight | Ht (In.) | Adj Height |
---|---|---|---|---|---|
0 | Gee, Alonzo | 219 | -0.078962 | 78 | -0.326190 |
1 | Wallace, Gerald | 220 | -0.043175 | 79 | -0.034772 |
2 | Williams, Mo | 195 | -0.937848 | 73 | -1.783284 |
3 | Gladness, Mickell | 220 | -0.043175 | 83 | 1.130904 |
4 | Jefferson, Richard | 230 | 0.314694 | 79 | -0.034772 |
5 | Hill, Solomon | 220 | -0.043175 | 79 | -0.034772 |
6 | Budinger, Chase | 218 | -0.114749 | 79 | -0.034772 |
7 | Williams, Derrick | 241 | 0.708351 | 80 | 0.256647 |
8 | Hill, Jordan | 235 | 0.493629 | 82 | 0.839485 |
9 | Frye, Channing | 245 | 0.851498 | 83 | 1.130904 |
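Since we've now done the subtract-the-mean, divide-by-the-standard-deviation dance twice, you could wrap it in a tiny helper; `standardize` is just a name I picked:

def standardize(series):
    # How many standard deviations each value sits from the column's mean
    return (series - series.mean()) / series.std()

nba_df["Adj Weight"] = standardize(nba_df["WT"])
nba_df["Adj Height"] = standardize(nba_df["Ht (In.)"])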
# Let's take 20 and get a new set of training data, then map it
adjusted_training = nba_df[:20]
for position, marker, color in zip(['C','G','F'], ">xo", "cmy"):
    # Get the players that play that given position
    players = adjusted_training[adjusted_training["POS"] == position]
    # Add their points to the scatterplot
    plt.scatter(players["Adj Height"], players["Adj Weight"], c=color, marker=marker, alpha=0.5)
plt.xlabel("Height (standard deviations from the mean)")
plt.ylabel("Weight (standard deviations from the mean)")
# Nearly everything is within 3 standard deviations, right? Let's look.
plt.ylim([-3, 3])
plt.xlim([-3, 3])
(-3, 3)
# Let's normalize our friend by subtracting the mean and dividing by the
# standard deviation
adj_height = (76 - nba_df["Ht (In.)"].mean()) / nba_df["Ht (In.)"].std()
adj_weight = (220 - nba_df["WT"].mean()) / nba_df["WT"].std()
# And try some predicting
knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(adjusted_training[["Adj Height", "Adj Weight"]], adjusted_training["POS"])
print knn.predict([[adj_height, adj_weight]])
['G']
# With more neighbors...
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(adjusted_training[["Adj Height", "Adj Weight"]], adjusted_training["POS"])
print knn.predict([[adj_height, adj_weight]])
['F']
# And more neighbors...
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(adjusted_training[["Adj Height", "Adj Weight"]], adjusted_training["POS"])
print knn.predict([[adj_height, adj_weight]])
['G']
Last time `k = 5` said he was a forward, so adjusting our units is definitely changing things up!
Now that we've got a predictor, how do we know if it's any good? We need some testing data!
Testing data isn't made-up people whose positions we have to guess at; it's data we already know the answers to, so we can check whether our predictions are right.
This is why we were only using 20 members of the NBA as our training data. Now we can predict what the other members of the NBA should be, and compare our predictions to what we know they actually are.
# Based on a single nearest neighbor
knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(adjusted_training[["Adj Height", "Adj Weight"]], adjusted_training["POS"])
# Take the training data out of our test data (we're using the first 20 to train)
test_data = nba_df[20:]
# Now predict for every single one of 'em
predictions = knn.predict(test_data[["Adj Height", "Adj Weight"]])
# View a few of the predictions
predictions[:10]
array(['G', 'F', 'F', 'G', 'F', 'F', 'F/C', 'F/C', 'G', 'F/C'], dtype=object)
# Now let's see which predictions match the actual positions.
# Note the 'F/C' guesses above: the census data lists combination
# positions, too, not just the three we've been plotting.
prediction_results = test_data["POS"] == predictions
# Peek at the first ten results: True means our prediction
# matched the player's actual position
prediction_results[:10]
20     True
21    False
22    False
23     True
24     True
25     True
26    False
27     True
28     True
29     True
Name: POS, dtype: bool
# Let's look at the raw count of matches and non-matches
print "Number of matches is"
print len(prediction_results[prediction_results == True])
print "Number of wrong results is"
print len(prediction_results[prediction_results == False])
Number of matches is
278
Number of wrong results is
230
This gets its own header because I was really excited when I learned about it. Right now we can say, okay, there were 278 matches and 230 non-matches.
That's roughly a 50% hit rate, so not terribly good. But what exactly is the hit rate? Well, we could divide the 278 matches by the 508 total predictions, but typing all of that out is going to make our fingers cramp.
`False` is `0`. `True` is `1`. That means you can just take the mean of `prediction_results`:
prediction_results.mean()
0.547244094488189
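As it happens, scikit-learn will do this bookkeeping for you: every classifier has a `score` method that runs the predictions and averages the matches in one go.

# One-liner version of predict-then-mean; should match the
# 0.547 we just computed
print knn.score(test_data[["Adj Height", "Adj Weight"]], test_data["POS"])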
A 54% hit rate isn't so great. The higher, the better, so what can we do to improve? Let's start off by trying more neighbors.
for k in [1, 2, 3, 5, 10, 20]:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(adjusted_training[["Adj Height", "Adj Weight"]], adjusted_training["POS"])
    # Leave the 20 training rows out of the test data, or we'd be
    # grading the classifier on answers it has already memorized
    test_data = nba_df[20:]
    predictions = knn.predict(test_data[["Adj Height", "Adj Weight"]])
    prediction_results = test_data["POS"] == predictions
    print "With a k of %s we had a score of %s" % (k, prediction_results.mean())
With a k of 1 we had a score of 0.547244094488
With a k of 2 we had a score of 0.600393700787
With a k of 3 we had a score of 0.531496062992
With a k of 5 we had a score of 0.555118110236
With a k of 10 we had a score of 0.490157480315
With a k of 20 we had a score of 0.267716535433
The score actually starts to drop pretty quickly. Since we have a really small set of training data (only twenty players), as we add more neighbors we're reaching further and further away.
With a max of about 60%, it's still a pretty rough score. We can increase the amount of training data we have, though, instead of just increasing the number of neighbors. Let's compare twenty to fifty, one hundred, two hundred fifty, and four hundred fifty.
for training_data_size in [20, 50, 100, 250, 450]:
    test_data_size = len(nba_df) - training_data_size
    print "== Training data size: %s, Test data size: %s" % (training_data_size, test_data_size)
    larger_training_data = nba_df[:training_data_size]
    for k in [1, 2, 3, 5, 10, 20]:
        knn = KNeighborsClassifier(n_neighbors=k)
        knn.fit(larger_training_data[["Adj Height", "Adj Weight"]], larger_training_data["POS"])
        test_data = nba_df[training_data_size:]
        predictions = knn.predict(test_data[["Adj Height", "Adj Weight"]])
        prediction_results = test_data["POS"] == predictions
        print "With a k of %s we had a score of %s" % (k, prediction_results.mean())
== Training data size: 20, Test data size: 508
With a k of 1 we had a score of 0.547244094488
With a k of 2 we had a score of 0.600393700787
With a k of 3 we had a score of 0.531496062992
With a k of 5 we had a score of 0.555118110236
With a k of 10 we had a score of 0.490157480315
With a k of 20 we had a score of 0.267716535433
== Training data size: 50, Test data size: 478
With a k of 1 we had a score of 0.627615062762
With a k of 2 we had a score of 0.627615062762
With a k of 3 we had a score of 0.60460251046
With a k of 5 we had a score of 0.673640167364
With a k of 10 we had a score of 0.654811715481
With a k of 20 we had a score of 0.558577405858
== Training data size: 100, Test data size: 428
With a k of 1 we had a score of 0.635514018692
With a k of 2 we had a score of 0.668224299065
With a k of 3 we had a score of 0.644859813084
With a k of 5 we had a score of 0.63785046729
With a k of 10 we had a score of 0.651869158879
With a k of 20 we had a score of 0.607476635514
== Training data size: 250, Test data size: 278
With a k of 1 we had a score of 0.654676258993
With a k of 2 we had a score of 0.687050359712
With a k of 3 we had a score of 0.687050359712
With a k of 5 we had a score of 0.679856115108
With a k of 10 we had a score of 0.694244604317
With a k of 20 we had a score of 0.697841726619
== Training data size: 450, Test data size: 78
With a k of 1 we had a score of 0.730769230769
With a k of 2 we had a score of 0.74358974359
With a k of 3 we had a score of 0.769230769231
With a k of 5 we had a score of 0.782051282051
With a k of 10 we had a score of 0.807692307692
With a k of 20 we had a score of 0.807692307692
The more training the better, it seems! And once we get a ton of data, having a large value for `k` doesn't seem to be as bad anymore.
The concept of holding back some of your data to use as test data is called cross validation. The problem we're seeing, though, is a tradeoff: more training data makes a better model, but you still need enough data left over to test it against.
One way to use your data for training and testing is called k-folds cross-validation.
Let's say we have 500 pieces of data, and we want to test using 5 folds. First, you break your data up into five equal sets. Then you run your model (nearest neighbors, in this case) five times, each time holding a different set out to test with:
Run 1: Test with 1, Train with 2, 3, 4, 5
Run 2: Test with 2, Train with 1, 3, 4, 5
Run 3: Test with 3, Train with 1, 2, 4, 5
Run 4: Test with 4, Train with 1, 2, 3, 5
Run 5: Test with 5, Train with 1, 2, 3, 4
When you're done, average the results and you've got yourself a pretty good idea of how good your model is at predicting!
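And wouldn't you know it, scikit-learn will run the whole k-folds routine for you, too. A sketch, assuming an older scikit-learn where this lives in `sklearn.cross_validation` (newer versions moved it to `sklearn.model_selection`):

from sklearn.cross_validation import cross_val_score

knn = KNeighborsClassifier(n_neighbors=5)
# Five folds: train on four-fifths of the data, test on the held-out
# fifth, five times over, then look at all five scores
scores = cross_val_score(knn, nba_df[["Adj Height", "Adj Weight"]], nba_df["POS"], cv=5)
print scores
print "Average score is %s" % scores.mean()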