a.k.a. Classifying for Fun and Profit
Last class we mapped some members of the NBA. The result looked something like this:
import pandas as pd
import matplotlib.pylab as plt
nba_df = pd.read_csv('data/NBA-Census-10.14.2013.csv')
# Loop through the positions, markers and colors
for position, marker, color in zip(['C','G','F'], ">xo", "cmy"):
    # Get the players that play that given position
    players = nba_df[nba_df["POS"] == position]
    # Add their points to the scatterplot
    plt.scatter(players["Ht (In.)"], players["WT"], c=color, marker=marker, alpha=0.5)
# Add some labels for readability
plt.xlabel("Height (inches)")
plt.ylabel("Weight")
[Scatterplot: height vs. weight, with a different marker and color for each position]
First, what the heck is this `for ... in zip()` thing?
Last time we mapped everyone, we did it a more verbose way. First we pulled out the centers, forwards and guards, then we graphed the centers, forwards and guards. Lots of repetition, right?
Using a `for` loop with `zip` allows us to loop through more than one thing at a time.
for position, marker, color in zip(['C','G','F'], ">xo", "cmy"):
    print "== Going through the loop!"
    print "Position is %s" % position
    print "Marker is %s" % marker
    print "Color is %s" % color
== Going through the loop!
Position is C
Marker is >
Color is c
== Going through the loop!
Position is G
Marker is x
Color is m
== Going through the loop!
Position is F
Marker is o
Color is y
The first time through we get the first elements of everything being `zip`ped up (the `'C'`, the `'>'` and the `'c'`). This works because you can loop over the members of `"cmy"` the same as you can loop over `['C','G','F']`:
letters = "cmy"
print letters[0]
print letters[1]
print letters[2]
arr = ['C', 'G', 'F']
print arr[0]
print arr[1]
print arr[2]
c
m
y
C
G
F
And since our `for` loop is asking for three things (`position`, `marker`, and `color`), Python automatically assigns the first element of each `zip`ped tuple to the matching variable. Let's take another look!
for letter, number in zip("ABC", "123"):
    print "== Going through the loop!"
    print "Letter is %s" % letter
    print "Number is %s" % number
== Going through the loop!
Letter is A
Number is 1
== Going through the loop!
Letter is B
Number is 2
== Going through the loop!
Letter is C
Number is 3
The `for` loop assigns variables based on their position in the `zip`ped list: `letter` maps to `"ABC"` and `number` to `"123"`. You'll run into this pattern often enough that you should probably know about it!
Oh, and if you want to know what `zip` is actually doing...
zip(['C','G','F'], ">xo", "cmy")
[('C', '>', 'c'), ('G', 'x', 'm'), ('F', 'o', 'y')]
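One wrinkle worth knowing: `zip` stops at the shortest thing you hand it, so any extra elements get quietly dropped.

print zip("ABCD", "12")
# Prints [('A', '1'), ('B', '2')] - the 'C' and 'D' never show up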
Hopefully that makes a bit of sense so you can work through it when you see it!
import pandas as pd
import matplotlib.pylab as plt
nba_df = pd.read_csv('data/NBA-Census-10.14.2013.csv')
# Loop through the positions, markers and colors
for position, marker, color in zip(['C','G','F'], ">xo", "cmy"):
    # Get the players that play that given position
    players = nba_df[nba_df["POS"] == position]
    # Add their points to the scatterplot
    plt.scatter(players["Ht (In.)"], players["WT"], c=color, marker=marker, alpha=0.5)
# Add some labels for readability
plt.xlabel("Height (inches)")
plt.ylabel("Weight")
[Scatterplot: height vs. weight by position, same as above]
We can see that the guards are lighter and shorter, the centers are taller and heavier, and the forwards are somewhere in between. If given someone clearly in those groups, we wouldn't do a terrible job of figuring out what position they play. We know it because we can see it, but how can a computer figure that out?
Remember that time in middle school you bought a skateboard and your parents decided you were going to spend the rest of your life shoplifting and smoking pot because That's All Kids With Skateboards Do? And hey, maybe some of them did, but Come On, Mom, You're Different, You Promise?
Luckily for us, computers are exactly dumb enough to jump to the same sorts of conclusions - we're going to help them get there.
The nearest neighbor algorithm is an incredibly simple way to classify objects. It's an example of supervised learning, a type of machine learning where the model requires a bit of initial data (called training data) to base its predictions on.
Your parents probably based their model on a Fox News special called Satanic Dope-Fiend Skaters. As we'll see later, it's important to pick a nice random set of training data to not bias your later results.
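For the record, we're about to grab the first 20 rows of the spreadsheet because it keeps the outputs below reproducible, but that's not actually random. If the CSV happens to be sorted in some sneaky way, our training data will be biased. A minimal sketch of the safer approach, shuffling the rows first (`random_training` is just a name I made up):

import numpy as np
# Reorder the rows randomly, then slice off the first 20 as training data
shuffled = nba_df.take(np.random.permutation(len(nba_df)))
random_training = shuffled[:20]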
# First, let's get a small dose of training data.
# Let's take 20 players out of the full 528
nba_training = nba_df[:20]
# And graph it to see what we've got
for position, marker, color in zip(['C','G','F'], ">xo", "cmy"):
    players = nba_training[nba_training["POS"] == position]
    plt.scatter(players["Ht (In.)"], players["WT"], c=color, marker=marker)
plt.xlabel("Height (inches)")
plt.ylabel("Weight")
[Scatterplot: the 20 training players, height vs. weight by position]
Looks pretty clear to me! You can still see the centers on the top right, the forwards in the middle and the guards to the bottom left.
The way nearest neighbors works is by finding, well, the nearest neighbor of a new point. Let's say I have a new person who is 220 pounds and 76 inches tall.
plt.plot(76, 220, marker='*', c='r', ms=10)
# Zoom the axes in around our new player
plt.ylim([170, 270])
plt.xlim([72, 84])
(72, 84)
Let's see how close he is to all of the other players...
for index, player in nba_training.iterrows():
    # Draw a dashed line from the new player to each training player
    plt.plot([76, player["Ht (In.)"]], [220, player["WT"]], 'k', linestyle='dashed', linewidth=1)
plt.scatter(nba_training["Ht (In.)"], nba_training["WT"])
plt.plot(76, 220, marker='*', c='r', ms=10)
[Plot: dashed lines from the new player to each of the 20 training players]
# Except of course that didn't do anything. Let's actually get lengths.
import numpy as np
np.sqrt((nba_training["Ht (In.)"] - 76) ** 2 + (nba_training["WT"] - 220) ** 2)
0      2.236068
1      3.000000
2     25.179357
3      7.000000
4     10.440307
5      3.000000
6      3.605551
7     21.377558
8     16.155494
9     25.961510
10    20.024984
11    40.049969
12    37.013511
13    13.152946
14    37.483330
15    35.227830
16    30.413813
17     1.000000
18    10.198039
19    35.128336
dtype: float64
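By the way, you don't have to eyeball that list for the smallest number. Stash the distances in a variable and pandas will find the minimum for you; `idxmin` hands back the index of the smallest value:

distances = np.sqrt((nba_training["Ht (In.)"] - 76) ** 2 + (nba_training["WT"] - 220) ** 2)
# The index of the closest player - this should print 17
print distances.idxmin()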
# Looks like he's closest to #17
# Let's plot new guy
plt.plot(76,220, marker='*',c='r', ms=10)
# Let's plot 17
match = nba_training.ix[17]
plt.plot(match["Ht (In.)"], match["WT"], marker='*',c='r', ms=10)
plt.ylim([170, 270])
plt.xlim([72, 84])
(72, 84)
# Looks pretty close! So who is #17?
nba_training.ix[17]
Name                                  Harden, James
Age                                              24
Team                                        Rockets
POS                                               G
#                                                13
2013 $                                  $13,701,250
Ht (In.)                                         77
WT                                              220
EXP                                               4
1st Year                                       2009
DOB                                       8/26/1989
School                                Arizona State
City                                Los Angeles, CA
State (Province, Territory, Etc..)       California
Country                                          US
Race                                          Black
HS Only                                          No
Name: 17, dtype: object
He's a guard! And according to the person across the room from me, he is very famous but doesn't play defense at all, so people actually hate him. And he has a beard.
But regardless, since he's the closest person (or nearest neighbor) to our new player, we're going to predict that the new guy is a guard, too.
The nearest neighbor algorithm just finds the closest known point and classifies the new point to match. It also works with a lot more than a 2-dimensional graph: it works in big huge n-dimensional space; it's just easier to visualize in 2D.
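To make that concrete, here's a minimal sketch of straight-line distance with any number of features, not just two. The `columns` list is whatever measurements you care about, and `euclidean_distance` is just a name I picked:

import numpy as np

def euclidean_distance(row, point, columns):
    # Square the difference along every axis, add them up, square root
    total = 0
    for column, value in zip(columns, point):
        total += (row[column] - value) ** 2
    return np.sqrt(total)

# Same answer as before for our 76-inch, 220-pound mystery man
print euclidean_distance(nba_training.ix[17], [76, 220], ["Ht (In.)", "WT"])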
Do you want to write a program to compute the minimum distance for every single new player? No, of course not. And while we should probably make you do that anyway, instead we'll introduce you to a brand new friend called scikit-learn.
Scikit-learn comes with a fun little module called `neighbors`, which - you guessed it - does nearest neighbor analysis. And it'll even work with pandas!
from sklearn.neighbors import KNeighborsClassifier
# Let's initialize a classifier
knn = KNeighborsClassifier(n_neighbors=1)
# knn.fit takes two parameters
# First, the content we want to train on. For us
# it's height and weight.
# Secondly, how we're classifying each element of the
# training data. We're classifying by position!
knn.fit(nba_training[["Ht (In.)", "WT"]], nba_training["POS"])
# Now we need to make a prediction. It's pretty easy
# (note the double brackets: predict expects a list of points)
knn.predict([[76, 220]])
array(['G'], dtype=object)
Ta-da! It predicted guard, just like we 1) guessed and 2) predicted ourselves. This is so easy it should be illegal. We can also find some more information about any given point...
# Let's get the first neighbor
distance, neighbors = knn.kneighbors([[76, 220]])
# It returns an array of arrays because you can ask about several
# points at once. We only asked about one, so grab the first row.
print "Neighbor is %s " % neighbors[0]
print "Distance is %s" % distance[0]
nba_training.ix[neighbors[0]]
Neighbor is [17]
Distance is [ 1.]
  | Name | Age | Team | POS | # | 2013 $ | Ht (In.) | WT | EXP | 1st Year | DOB | School | City | State (Province, Territory, Etc..) | Country | Race | HS Only |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
17 | Harden, James | 24 | Rockets | G | 13 | $13,701,250 | 77 | 220 | 4 | 2009 | 8/26/1989 | Arizona State | Los Angeles, CA | California | US | Black | No |
# Let's get the closest three neighbors
distance, neighbors = knn.kneighbors([[76, 220]], 3)
# Even though it's returning three neighbors/distances now,
# you'll still use [0] to grab them
print "Neighbors are %s " % neighbors[0]
print "Distances are %s" % distance[0]
# You should be able to add a new distance column with that
# data, too. See the sketch after the table below for one way.
nba_training.ix[neighbors[0]]
Neighbors are [17 0 1]
Distances are [ 1.          2.23606798  3.        ]
  | Name | Age | Team | POS | # | 2013 $ | Ht (In.) | WT | EXP | 1st Year | DOB | School | City | State (Province, Territory, Etc..) | Country | Race | HS Only |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
17 | Harden, James | 24 | Rockets | G | 13 | $13,701,250 | 77 | 220 | 4 | 2009 | 8/26/1989 | Arizona State | Los Angeles, CA | California | US | Black | No |
0 | Gee, Alonzo | 26 | Cavaliers | F | 33 | $3,250,000 | 78 | 219 | 4 | 2009 | 5/29/1987 | Alabama | Riviera Beach, FL | Florida | US | Black | No |
1 | Wallace, Gerald | 31 | Celtics | F | 45 | $10,105,855 | 79 | 220 | 12 | 2001 | 7/23/1982 | Alabama | Sylacauga, AL | Alabama | US | Black | No |
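About that comment in the code above: one way to bolt the distances onto the matched rows is to copy the slice and assign a new column. A sketch, probably not the most elegant way:

# Grab the matched rows, then attach the distances as a column
matches = nba_training.ix[neighbors[0]].copy()
matches["Distance"] = distance[0]
matches[["Name", "POS", "Ht (In.)", "WT", "Distance"]]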
Hrm, looks like there are a few forwards nearby, too. Maybe instead of just looking at one neighbor, we should look at more? We're in luck!
knn = KNeighborsClassifier(n_neighbors=3)
# Same training data as before, but now the prediction
# is a vote among the three nearest neighbors
knn.fit(nba_training[["Ht (In.)", "WT"]], nba_training["POS"])
knn.predict([[76, 220]])
array(['F'], dtype=object)
knn = KNeighborsClassifier(n_neighbors=5)
# And again, voting among the five nearest neighbors
knn.fit(nba_training[["Ht (In.)", "WT"]], nba_training["POS"])
knn.predict([[76, 220]])
array(['F'], dtype=object)
knn = KNeighborsClassifier(n_neighbors=20)
# And once more, voting among all twenty training points
knn.fit(nba_training[["Ht (In.)", "WT"]], nba_training["POS"])
knn.predict([[76, 220]])
array(['F'], dtype=object)
Even though `k = 1` says he's a guard, `k = 3` says he'll be a forward. We then upped it to `k = 20`, which agrees that he'll be a forward, although there are only twenty data points in our training set, so that probably just means there are more forwards than anything else.
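That hunch is easy to check, since `value_counts` will tally how many of each position we trained on:

# How many of each position are in our 20 training players?
print nba_training["POS"].value_counts()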
knn = KNeighborsClassifier(n_neighbors=20)
knn.fit(nba_training[["Ht (In.)", "WT"]], nba_training["POS"])
knn.predict([[90, 420]])
array(['F'], dtype=object)
Yeah, someone who is 7'6" and 420 lbs is probably going to be a center, not a forward. Looks wrong to me.
Note to self: pick your `n_neighbors` values carefully.
We've actually had a big problem following us around for a while, but since you aren't an expert Problem Tracker you might not have noticed. It deals with how we're computing distance.
Let's say we have Cool Mister Gerald, who is 200lbs and 7'6". Is he more similar to 7'3", 220lb Bubbles McSlamdunk or 6'0", 200lb Speedy Runsalot? (that's 90 inches, 87 inches and 72 inches, respectively)
bubbles_distance = np.sqrt((90 - 87) ** 2 + (200 - 220) ** 2)
speedy_distance = np.sqrt((90 - 72) ** 2 + (200 - 200) ** 2)
print "%s units from Bubbles" % bubbles_distance
print "%s units from Speedy" % speedy_distance
20.2237484162 units from Bubbles
18.0 units from Speedy
So it looks like he should be assigned to the same group as Speedy. But does that make any sense? Cool Mister Gerald and Bubbles McSlamdunk seem like they'd have a lot more in common, since they're both really tall. Speedy might be the same weight as Cool Mister, but he's also really short (for the NBA!).
That, my friends, is our Big Problem.
Right now, both height and weight are on equal footing. An inch of height difference puts you as far away from someone as a pound of weight difference, even though being a foot taller than someone means a lot more than being 12 pounds lighter.
We need to fix this up if anyone is going to take us seriously.
The way we're going to do this is by figuring out how far from average everyone is in standard deviations. If you're way above-average tall, you'll have a good chance of being grouped with someone else who is way above-average tall, even if you aren't necessarily similar in other ways.
Let's take a look.
# First, let's subtract the mean from everyone's height
height_distance_from_mean = nba_df["Ht (In.)"] - nba_df["Ht (In.)"].mean()
# ...and look at the first 5
height_distance_from_mean[:5]
0   -1.119318
1   -0.119318
2   -6.119318
3    3.880682
4   -0.119318
Name: Ht (In.), dtype: float64
By subtracting the mean, if you're above zero, you're above average. Below zero, below average. Now we need to divide this value by the standard deviation to see how meaningful each unit of distance actually is.
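To sanity-check with the numbers above: player 0 is 78 inches tall, about 1.1 inches below the mean (roughly 79.1 inches).

# Dividing by the standard deviation of height should print
# roughly -0.33, the first value in the next output
print (78 - nba_df["Ht (In.)"].mean()) / nba_df["Ht (In.)"].std()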
MAKE SURE YOU ARE DOING THIS IN YOUR ORIGINAL `nba_df` DATA. WHEN WE'RE PLAYING AROUND WITH THINGS LATER, WE'LL ADD MORE ELEMENTS INTO OUR TEST DATA AND WE DON'T WANT TO HAVE TO RECALCULATE!
adjusted_heights = height_distance_from_mean / nba_df["Ht (In.)"].std()
adjusted_heights[:5]
0   -0.326190
1   -0.034772
2   -1.783284
3    1.130904
4   -0.034772
Name: Ht (In.), dtype: float64
So player 2 is way below average in height, player 3 is above average, and the rest are middling-ish. Let's add this information into the dataframe!
adjusted_weights = (nba_df["WT"] - nba_df["WT"].mean()) / nba_df["WT"].std()
# Need to put it into the original dataframe!
nba_df["Adj Weight"] = adjusted_weights
nba_df["Adj Height"] = adjusted_heights
# Let's take a look at our new adjusted columns
nba_df[["Name", "WT", "Adj Weight", "Ht (In.)", "Adj Height"]][:10]
  | Name | WT | Adj Weight | Ht (In.) | Adj Height |
---|---|---|---|---|---|
0 | Gee, Alonzo | 219 | -0.078962 | 78 | -0.326190 |
1 | Wallace, Gerald | 220 | -0.043175 | 79 | -0.034772 |
2 | Williams, Mo | 195 | -0.937848 | 73 | -1.783284 |
3 | Gladness, Mickell | 220 | -0.043175 | 83 | 1.130904 |
4 | Jefferson, Richard | 230 | 0.314694 | 79 | -0.034772 |
5 | Hill, Solomon | 220 | -0.043175 | 79 | -0.034772 |
6 | Budinger, Chase | 218 | -0.114749 | 79 | -0.034772 |
7 | Williams, Derrick | 241 | 0.708351 | 80 | 0.256647 |
8 | Hill, Jordan | 235 | 0.493629 | 82 | 0.839485 |
9 | Frye, Channing | 245 | 0.851498 | 83 | 1.130904 |
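Since we've now done the subtract-the-mean, divide-by-the-standard-deviation dance twice, you could wrap it in a tiny helper; `standardize` is just a name I picked:

def standardize(series):
    # How many standard deviations each value sits from the column's mean
    return (series - series.mean()) / series.std()

nba_df["Adj Weight"] = standardize(nba_df["WT"])
nba_df["Adj Height"] = standardize(nba_df["Ht (In.)"])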
# Let's take 20 and get a new set of training data, then map it
adjusted_training = nba_df[:20]
for position, marker, color in zip(['C','G','F'], ">xo", "cmy"):
    # Get the players that play that given position
    players = adjusted_training[adjusted_training["POS"] == position]
    # Add their points to the scatterplot
    plt.scatter(players["Adj Height"], players["Adj Weight"], c=color, marker=marker, alpha=0.5)
plt.xlabel("Height (standard deviations from the mean)")
plt.ylabel("Weight (standard deviations from the mean)")
# Nearly everything is within 3 standard deviations, right? Let's look.
plt.ylim([-3, 3])
plt.xlim([-3, 3])
(-3, 3)
# Let's normalize our friend by subtracting the mean and dividing by the
# standard deviation
adj_height = (76 - nba_df["Ht (In.)"].mean()) / nba_df["Ht (In.)"].std()
adj_weight = (220 - nba_df["WT"].mean()) / nba_df["WT"].std()
# And try some predicting
knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(adjusted_training[["Adj Height", "Adj Weight"]], adjusted_training["POS"])
print knn.predict([[adj_height, adj_weight]])
['G']
# With more neighbors...
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(adjusted_training[["Adj Height", "Adj Weight"]], adjusted_training["POS"])
print knn.predict([[adj_height, adj_weight]])
['F']
# And more neighbors...
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(adjusted_training[["Adj Height", "Adj Weight"]], adjusted_training["POS"])
print knn.predict([[adj_height, adj_weight]])
['G']
Last time `k = 5` said he was a forward, so adjusting our units is definitely changing things up!
Now that we've got a predictor, how do we know if it's any good? We need some testing data!
Testing data isn't made-up people whose positions we have to guess at; it's data we already know the answers to, so we can check whether our predictions are right.
This is why we were only using 20 members of the NBA as our training data. Now we can predict what the other members of the NBA should be, and compare our predictions to what we know they actually are.
# Based on a single nearest neighbor
knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(adjusted_training[["Adj Height", "Adj Weight"]], adjusted_training["POS"])
# Take the training data out of our test data (we're using the first 20 to train)
test_data = nba_df[20:]
# Now predict for every single one of 'em
predictions = knn.predict(test_data[["Adj Height", "Adj Weight"]])
# View a few of the predictions
predictions[:10]
array(['G', 'F', 'F', 'G', 'F', 'F', 'F/C', 'F/C', 'G', 'F/C'], dtype=object)
# Now let's see which predictions match the actual positions.
# Note the 'F/C' guesses above: the census data lists combination
# positions, too, not just the three we've been plotting.
prediction_results = test_data["POS"] == predictions
# Peek at the first ten results: True means our prediction
# matched the player's actual position
prediction_results[:10]
20     True
21    False
22    False
23     True
24     True
25     True
26    False
27     True
28     True
29     True
Name: POS, dtype: bool
# Let's look at the raw count of matches and non-matches
print "Number of matches is"
print len(prediction_results[prediction_results == True])
print "Number of wrong results is"
print len(prediction_results[prediction_results == False])
Number of matches is
278
Number of wrong results is
230
This gets its own header because I was really excited when I learned about it. Right now we can say, okay, there were 278 matches and 230 non-matches.
That's roughly a 50% hit rate, so not terribly good. But what exactly is the hit rate? Well, we could divide the 278 matches by the 508 total predictions, but typing all of that out is going to make our fingers cramp.
`False` is `0`. `True` is `1`. That means you can just take the mean of `prediction_results`:
prediction_results.mean()
0.547244094488189
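As it happens, scikit-learn will do this bookkeeping for you: every classifier has a `score` method that runs the predictions and averages the matches in one go.

# One-liner version of predict-then-mean; should match the
# 0.547 we just computed
print knn.score(test_data[["Adj Height", "Adj Weight"]], test_data["POS"])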
A 54% hit rate isn't so great. The higher, the better, so what can we do to improve? Let's start off by trying more neighbors.
for k in [1, 2, 3, 5, 10, 20]:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(adjusted_training[["Adj Height", "Adj Weight"]], adjusted_training["POS"])
    # Leave the 20 training rows out of the test data, or we'd be
    # grading the classifier on answers it has already memorized
    test_data = nba_df[20:]
    predictions = knn.predict(test_data[["Adj Height", "Adj Weight"]])
    prediction_results = test_data["POS"] == predictions
    print "With a k of %s we had a score of %s" % (k, prediction_results.mean())
With a k of 1 we had a score of 0.547244094488
With a k of 2 we had a score of 0.600393700787
With a k of 3 we had a score of 0.531496062992
With a k of 5 we had a score of 0.555118110236
With a k of 10 we had a score of 0.490157480315
With a k of 20 we had a score of 0.267716535433
The score actually starts to drop pretty quickly. Since we have a really small set of training data (only twenty players), as we add more neighbors we're reaching further and further away.
With a max of about 60%, it's still a pretty rough score. We can increase the amount of training data we have, though, instead of just increasing the number of neighbors. Let's compare twenty to fifty, one hundred, two hundred fifty, and four hundred fifty.
for training_data_size in [20, 50, 100, 250, 450]:
    test_data_size = len(nba_df) - training_data_size
    print "== Training data size: %s, Test data size: %s" % (training_data_size, test_data_size)
    larger_training_data = nba_df[:training_data_size]
    for k in [1, 2, 3, 5, 10, 20]:
        knn = KNeighborsClassifier(n_neighbors=k)
        knn.fit(larger_training_data[["Adj Height", "Adj Weight"]], larger_training_data["POS"])
        test_data = nba_df[training_data_size:]
        predictions = knn.predict(test_data[["Adj Height", "Adj Weight"]])
        prediction_results = test_data["POS"] == predictions
        print "With a k of %s we had a score of %s" % (k, prediction_results.mean())
== Training data size: 20, Test data size: 508
With a k of 1 we had a score of 0.547244094488
With a k of 2 we had a score of 0.600393700787
With a k of 3 we had a score of 0.531496062992
With a k of 5 we had a score of 0.555118110236
With a k of 10 we had a score of 0.490157480315
With a k of 20 we had a score of 0.267716535433
== Training data size: 50, Test data size: 478
With a k of 1 we had a score of 0.627615062762
With a k of 2 we had a score of 0.627615062762
With a k of 3 we had a score of 0.60460251046
With a k of 5 we had a score of 0.673640167364
With a k of 10 we had a score of 0.654811715481
With a k of 20 we had a score of 0.558577405858
== Training data size: 100, Test data size: 428
With a k of 1 we had a score of 0.635514018692
With a k of 2 we had a score of 0.668224299065
With a k of 3 we had a score of 0.644859813084
With a k of 5 we had a score of 0.63785046729
With a k of 10 we had a score of 0.651869158879
With a k of 20 we had a score of 0.607476635514
== Training data size: 250, Test data size: 278
With a k of 1 we had a score of 0.654676258993
With a k of 2 we had a score of 0.687050359712
With a k of 3 we had a score of 0.687050359712
With a k of 5 we had a score of 0.679856115108
With a k of 10 we had a score of 0.694244604317
With a k of 20 we had a score of 0.697841726619
== Training data size: 450, Test data size: 78
With a k of 1 we had a score of 0.730769230769
With a k of 2 we had a score of 0.74358974359
With a k of 3 we had a score of 0.769230769231
With a k of 5 we had a score of 0.782051282051
With a k of 10 we had a score of 0.807692307692
With a k of 20 we had a score of 0.807692307692
The more training the better, it seems! And once we get a ton of data, having a large value for `k` doesn't seem to be as bad anymore.
The concept of holding back some of your data to use as test data is called cross validation. The problem we're seeing, though, is a tradeoff: more training data makes a better model, but you still need enough data left over to test it against.
One way to use your data for training and testing is called k-folds cross-validation.
Let's say we have 500 pieces of data, and we want to test using 5 folds. First, you break your data up into five equal sets. Then you run your model (nearest neighbors, in this case) five times, each time holding a different set out to test with:
Run 1: Test with 1, Train with 2, 3, 4, 5
Run 2: Test with 2, Train with 1, 3, 4, 5
Run 3: Test with 3, Train with 1, 2, 4, 5
Run 4: Test with 4, Train with 1, 2, 3, 5
Run 5: Test with 5, Train with 1, 2, 3, 4
When you're done, average the results and you've got yourself a pretty good idea of how good your model is at predicting!
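And wouldn't you know it, scikit-learn will run the whole k-folds routine for you, too. A sketch, assuming an older scikit-learn where this lives in `sklearn.cross_validation` (newer versions moved it to `sklearn.model_selection`):

from sklearn.cross_validation import cross_val_score

knn = KNeighborsClassifier(n_neighbors=5)
# Five folds: train on four-fifths of the data, test on the held-out
# fifth, five times over, then look at all five scores
scores = cross_val_score(knn, nba_df[["Adj Height", "Adj Weight"]], nba_df["POS"], cv=5)
print scores
print "Average score is %s" % scores.mean()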