Introduction

In this assignment you will be using regression and classification to explore different data sets.

First: You will use data from before 2002 in Sean Lahman's Baseball Database to create a metric for picking baseball players using linear regression. This is the same database we used in Homework 1. It contains the "complete batting and pitching statistics from 1871 to 2013, plus fielding statistics, standings, team stats, managerial records, post-season data, and more". Documentation is provided here.

"Sabermetrics Science" http://saberseminar.com/wp-content/uploads/2012/01/saber-web.jpg

Second: You will use the famous iris data set to perform a $k$-nearest neighbor classification using cross validation. While it was introduced in 1936, it is still one of the most popular example data sets in the machine learning community. Wikipedia describes the data set as follows: "The data set consists of 50 samples from each of three species of Iris (Iris setosa, Iris virginica and Iris versicolor). Four features were measured from each sample: the length and the width of the sepals and petals, in centimetres." Here is an illustration of what the four features measure:

"iris data features" http://sebastianraschka.com/Images/2014_python_lda/iris_petal_sepal.png

Third: You will investigate the influence of higher dimensional spaces on classification using another standard machine learning data set called the digits data set. This data set is similar to the MNIST data set discussed in the lecture. The main difference is that each digit is represented by an 8x8 pixel image patch, which is considerably smaller than the 28x28 pixels of MNIST. In addition, the gray values are restricted to 16 different values (4 bit), instead of 256 (8 bit) for MNIST.

Finally: In preparation for Homework 4, we want you to read through the following articles related to predicting the 2014 Senate Midterm Elections.


Load Python modules

In [1]:
# special IPython command to prepare the notebook for matplotlib
%matplotlib inline 

import os
from IPython import display

import requests 
#import StringIO # not working in python 3
import zipfile
import numpy as np
import pandas as pd # pandas
import matplotlib.pyplot as plt # module for plotting 

# If this module is not already installed, you may need to install it. 
# You can do this by typing 'pip install seaborn' in the command line
import seaborn as sns 

import sklearn
import sklearn.datasets
# Note: cross_validation and grid_search are deprecated (see the warnings below);
# the cells further down import their replacements from sklearn.model_selection.
import sklearn.cross_validation
import sklearn.decomposition
import sklearn.grid_search
import sklearn.neighbors
import sklearn.metrics
/Users/ko/anaconda/lib/python3.6/site-packages/sklearn/cross_validation.py:44: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.
  "This module will be removed in 0.20.", DeprecationWarning)
/Users/ko/anaconda/lib/python3.6/site-packages/sklearn/grid_search.py:43: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. This module will be removed in 0.20.
  DeprecationWarning)

Problem 1: Sabermetrics

Using data preceding the 2002 season, pick 10 offensive players while keeping the payroll under $20 million (assign each player the median salary). Predict how many games this team would win in a 162-game season.

In this problem we will be returning to Sean Lahman's Baseball Database that we used in Homework 1. From this database, we will extract five data sets containing information such as yearly stats and standings, batting statistics, fielding statistics, player names, player salaries and biographical information. You will explore the data in this database from before 2002 and create a metric for picking players.

Problem 1(a)

Load in these CSV files from the Sean Lahman's Baseball Database. For this assignment, we will use the 'Teams.csv', 'Batting.csv', 'Salaries.csv', 'Fielding.csv', 'Master.csv' tables. Read these tables into separate pandas DataFrames with the following names.

CSV file name Name of pandas DataFrame
Teams.csv teams
Batting.csv players
Salaries.csv salaries
Fielding.csv fielding
Master.csv master
In [17]:
### Your code here ###

url = "http://seanlahman.com/files/database/baseballdatabank-2017.1.zip"

r = requests.get(url, stream=True)

# Total size in bytes (not used below, but handy for a progress bar).
total_size = int(r.headers['Content-Length'])

with open("data/baseballdatabank-2017.1.zip", 'wb') as fd:
    for chunk in r.iter_content(chunk_size=256):
        fd.write(chunk)
        
print(f"file succesfully downloaded")
print(os.listdir("data"))
file successfully downloaded
['.DS_Store', 'baseballdatabank-2017.1', 'baseballdatabank-2017.1.zip', 'exprs_GSE5859.csv', 'lahman', 'lahman-csv_2014-02-14.zip', 'sampleinfo_GSE5859.csv']
In [3]:
z = zipfile.ZipFile("data/baseballdatabank-2017.1.zip")
# extract into the data dir
z.extractall("data")
z.close()
data_dir = "data/baseballdatabank-2017.1/core/"
print(os.listdir(data_dir))
['AllstarFull.csv', 'Appearances.csv', 'AwardsManagers.csv', 'AwardsPlayers.csv', 'AwardsShareManagers.csv', 'AwardsSharePlayers.csv', 'Batting.csv', 'BattingPost.csv', 'CollegePlaying.csv', 'Fielding.csv', 'FieldingOF.csv', 'FieldingOFsplit.csv', 'FieldingPost.csv', 'HallOfFame.csv', 'HomeGames.csv', 'Managers.csv', 'ManagersHalf.csv', 'Master.csv', 'Parks.csv', 'Pitching.csv', 'PitchingPost.csv', 'readme2014.txt', 'Salaries.csv', 'Schools.csv', 'SeriesPost.csv', 'Teams.csv', 'TeamsFranchises.csv', 'TeamsHalf.csv']
In [4]:
read_files = ['Teams.csv', 'Batting.csv', 'Salaries.csv', 'Fielding.csv', 'Master.csv']
teams, players, salaries, fielding, master = (pd.read_csv(data_dir + f) for f in read_files)

Problem 1(b)

Calculate the median salary for each player and create a pandas DataFrame called medianSalaries with four columns: (1) the player ID, (2) the first name of the player, (3) the last name of the player and (4) the median salary of the player. Show the head of the medianSalaries DataFrame.

In [5]:
### Your code here ###
salaries.head()
Out[5]:
yearID teamID lgID playerID salary
0 1985 ATL NL barkele01 870000
1 1985 ATL NL bedrost01 550000
2 1985 ATL NL benedbr01 545000
3 1985 ATL NL campri01 633333
4 1985 ATL NL ceronri01 625000
In [6]:
master[["playerID", "nameFirst", "nameLast"]].head()
Out[6]:
playerID nameFirst nameLast
0 aardsda01 David Aardsma
1 aaronha01 Hank Aaron
2 aaronto01 Tommie Aaron
3 aasedo01 Don Aase
4 abadan01 Andy Abad

First, group the salaries by player and take the median:

In [7]:
medianSalaries = salaries.groupby("playerID").median().reset_index()
print(medianSalaries.shape)
medianSalaries.head()
(5155, 3)
Out[7]:
playerID yearID salary
0 aardsda01 2009.0 419000.0
1 aasedo01 1987.5 612500.0
2 abadan01 2006.0 327000.0
3 abadfe01 2014.0 525900.0
4 abbotje01 1999.5 255000.0

Now add in the first and last names:

In [8]:
medianSalaries = medianSalaries.merge(master[["playerID", "nameFirst", "nameLast"]], on="playerID")
print(medianSalaries.shape)
medianSalaries.head()
(5147, 5)
Out[8]:
playerID yearID salary nameFirst nameLast
0 aardsda01 2009.0 419000.0 David Aardsma
1 aasedo01 1987.5 612500.0 Don Aase
2 abadan01 2006.0 327000.0 Andy Abad
3 abadfe01 2014.0 525900.0 Fernando Abad
4 abbotje01 1999.5 255000.0 Jeff Abbott
In [9]:
medianSalaries.drop(["yearID"], axis=1, inplace=True)

The final dataframe:

In [10]:
print(medianSalaries.shape)
medianSalaries.head()
(5147, 4)
Out[10]:
playerID salary nameFirst nameLast
0 aardsda01 419000.0 David Aardsma
1 aasedo01 612500.0 Don Aase
2 abadan01 327000.0 Andy Abad
3 abadfe01 525900.0 Fernando Abad
4 abbotje01 255000.0 Jeff Abbott

Problem 1(c)

Now, consider only team/season combinations in which the teams played 162 Games. Exclude all data from before 1947. Compute the per plate appearance rates for singles, doubles, triples, HR, and BB. Create a new pandas DataFrame called stats that has the teamID, yearID, wins and these rates.

Hint: Singles are hits that are not doubles, triples, nor HR. Plate appearances are base on balls plus at bats.
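In other words: singles = H - 2B - 3B - HR and plate appearances PA = AB + BB; each rate below is the corresponding count divided by PA.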

Looking at the readme file for the columns needed for this question:

teams table:

  • "G" is games played
  • "W" is wins
  • "yearID" is the year of the game
  • "H" is all hits
  • "2B" is doubles
  • "3B" is triples
  • "HR" is homeruns
  • "BB" is base on balls
  • "AB" is at bats

So first, get a DataFrame of all team seasons with 162 games played, from 1947 onwards:

In [11]:
### Your code here ###
stats = teams[teams["yearID"] >= 1947]
stats = stats[stats["G"]==162]
print(stats.shape)
stats.head()
(1068, 48)
Out[11]:
yearID lgID teamID franchID divID Rank G Ghome W L ... DP FP name park attendance BPF PPF teamIDBR teamIDlahman45 teamIDretro
1366 1961 AL KC1 OAK NaN 9 162 80.0 61 100 ... 160.0 0.972 Kansas City Athletics Municipal Stadium I 683817.0 101 103 KCA KC1 KC1
1367 1961 AL LAA ANA NaN 8 162 82.0 70 91 ... 154.0 0.969 Los Angeles Angels Wrigley Field (LA) 603510.0 111 112 LAA LAA LAA
1377 1962 AL BAL BAL NaN 7 162 82.0 77 85 ... 152.0 0.980 Baltimore Orioles Memorial Stadium 790254.0 94 93 BAL BAL BAL
1379 1962 AL CHA CHW NaN 5 162 81.0 85 77 ... 153.0 0.982 Chicago White Sox Comiskey Park 1131562.0 100 99 CHW CHA CHA
1380 1962 NL CHN CHC NaN 9 162 81.0 59 103 ... 171.0 0.977 Chicago Cubs Wrigley Field 609802.0 104 105 CHC CHN CHN

5 rows × 48 columns

Now drop the unneeded columns:

In [12]:
data_cols = ["teamID", "yearID", "W", "H", "2B", "3B", "HR", "BB", "AB"]
stats = stats[data_cols]
stats.head(2)
Out[12]:
teamID yearID W H 2B 3B HR BB AB
1366 KC1 1961 61 1342 216 47 90 580 5423
1367 LAA 1961 70 1331 218 22 189 681 5424
In [13]:
stats["singles"] = stats["H"] - stats["2B"] - stats["3B"] - stats["HR"]
stats["plates"] = stats["BB"] + stats["AB"]
stats.head()
Out[13]:
teamID yearID W H 2B 3B HR BB AB singles plates
1366 KC1 1961 61 1342 216 47 90 580 5423 989 6003
1367 LAA 1961 70 1331 218 22 189 681 5424 902 6105
1377 BAL 1962 77 1363 225 34 156 516 5491 948 6007
1379 CHA 1962 85 1415 250 56 92 620 5514 1017 6134
1380 CHN 1962 59 1398 196 56 126 504 5534 1020 6038

Compute the per plate appearance rates for singles, doubles, triples, HR, and BB

In [14]:
for col in ["singles","2B","3B","HR","BB"]:
    stats[col] = stats[col] / stats["plates"]
stats.head()
Out[14]:
teamID yearID W H 2B 3B HR BB AB singles plates
1366 KC1 1961 61 1342 0.035982 0.007829 0.014993 0.096618 5423 0.164751 6003
1367 LAA 1961 70 1331 0.035708 0.003604 0.030958 0.111548 5424 0.147748 6105
1377 BAL 1962 77 1363 0.037456 0.005660 0.025970 0.085900 5491 0.157816 6007
1379 CHA 1962 85 1415 0.040756 0.009129 0.014998 0.101076 5514 0.165797 6134
1380 CHN 1962 59 1398 0.032461 0.009275 0.020868 0.083471 5534 0.168930 6038

So the final stats DataFrame is:

In [15]:
stats = stats[["teamID", "yearID", "W", "singles","2B","3B","HR","BB"]]
stats.head()
Out[15]:
teamID yearID W singles 2B 3B HR BB
1366 KC1 1961 61 0.164751 0.035982 0.007829 0.014993 0.096618
1367 LAA 1961 70 0.147748 0.035708 0.003604 0.030958 0.111548
1377 BAL 1962 77 0.157816 0.037456 0.005660 0.025970 0.085900
1379 CHA 1962 85 0.165797 0.040756 0.009129 0.014998 0.101076
1380 CHN 1962 59 0.168930 0.032461 0.009275 0.020868 0.083471

Problem 1(d)

Is there a noticeable time trend in the rates computed in Problem 1(c)?

In [171]:
### Your code here ###
sns.pairplot(data=stats, x_vars="yearID", 
             y_vars=["W", "singles", "2B", "3B", "HR", "BB"], size=5, kind="reg")
Out[171]:
<seaborn.axisgrid.PairGrid at 0x140dd53c8>

From the pairplots it looks like the singles and triples rates are decreasing over time while the doubles rate is going up. Home runs, interestingly enough, show a sudden shift upwards - possibly linked to some change in the game or the equipment, such as a livelier ball.

Problem 1(e)

Using the stats DataFrame from Problem 1(c), adjust the singles per PA rates so that the average across teams for each year is 0. Do the same for the doubles, triples, HR, and BB rates.

In [172]:
### Your code here ###
# to get the mean for each year:
stats_mean_by_year = stats.groupby("yearID").mean()
stats_mean_by_year.head()
Out[172]:
W singles 2B 3B HR BB
yearID
1961 65.500000 0.156249 0.035845 0.005717 0.022975 0.104083
1962 78.454545 0.165632 0.035853 0.006777 0.023811 0.088590
1963 78.142857 0.162467 0.034020 0.006896 0.021254 0.080336
1964 81.727273 0.167251 0.036336 0.006748 0.021548 0.079152
1965 82.000000 0.160042 0.035539 0.006534 0.022693 0.085745

I tried sklearn's preprocessing, but it doesn't easily normalize within each year group, so we need our own function:

In [225]:
def meanNormalizeRates(df):
    subRates = df[["singles","2B","3B","HR","BB"]]
    df[["singles","2B","3B","HR","BB"]] = subRates - subRates.mean(axis=0)
    return df

statsNorm = stats.groupby('yearID').apply(meanNormalizeRates)
statsNorm.head()
Out[225]:
teamID yearID W singles 2B 3B HR BB
1366 KC1 1961 61 0.008502 0.000137 0.002113 -0.007983 -0.007465
1367 LAA 1961 70 -0.008502 -0.000137 -0.002113 0.007983 0.007465
1377 BAL 1962 77 -0.007816 0.001604 -0.001117 0.002158 -0.002690
1379 CHA 1962 85 0.000165 0.004904 0.002352 -0.008813 0.012486
1380 CHN 1962 59 0.003298 -0.003391 0.002497 -0.002944 -0.005119
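For reference, the same per-year mean-normalization can be done without a helper function using groupby/transform; a minimal sketch, assuming stats still holds the raw rates from 1(c) (statsNorm2 is just a hypothetical name):

In [ ]:
rate_cols = ["singles", "2B", "3B", "HR", "BB"]
statsNorm2 = stats.copy()
statsNorm2[rate_cols] = statsNorm2[rate_cols] - statsNorm2.groupby("yearID")[rate_cols].transform("mean")
statsNorm2.head()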

Problem 1(f)

Build a simple linear regression model to predict the number of wins from the average adjusted singles, doubles, triples, HR, and BB rates. To decide which of these terms to include, fit the model to data up to 2002 and compute the average squared residuals from predictions to years past 2002. Use the fitted model to define a new sabermetric summary: offensive predicted wins (OPW). Hint: the new summary should be a linear combination of one to five of the five rates.

In [298]:
### Your code here ###
from sklearn import linear_model
from sklearn.metrics import mean_squared_error

# the train/test split: seasons up to 2002 for training, seasons after 2002 for testing
mask = statsNorm["yearID"]<=2002
xtrain = statsNorm[["singles","2B","3B","HR","BB"]][mask]
xtest = statsNorm[["singles","2B","3B","HR","BB"]][~mask]
ytrain = statsNorm["W"][mask]
ytest = statsNorm["W"][~mask]

# fit the model first, then check the shapes and the error on the held-out years
regr = linear_model.LinearRegression()
regr.fit(xtrain, ytrain)
print(ytrain.shape, ytest.shape, regr.predict(xtrain).shape, regr.predict(xtest).shape)

mean_squared_error(ytest, regr.predict(xtest))
(684,) (384,) (684,) (384,)
Out[298]:
83.049337451741621

Taking a look at how far the predictions are from the actual values:

In [371]:
sns.regplot(ytest, regr.predict(xtest))
sns.regplot(ytrain, regr.predict(xtrain));
In [370]:
#plt.fill_between(x=range(384), y1=ytest, y2=regr.predict(xtest), interpolate=True)
plt.plot(range(384), ytest, label="Actual Y test values", alpha=0.4, color="g", lw=2, ls="-")
plt.plot(range(384), regr.predict(xtest), label="Predicted Y test values", alpha=0.3, color="r")
Out[370]:
[<matplotlib.lines.Line2D at 0x14b99a630>]
In [354]:
plt.fill_between(x=range(ytrain.shape[0]), y1=ytrain, y2=regr.predict(xtrain), 
                 interpolate=True)
plt.plot(range(ytrain.shape[0]), ytrain, label="Actual Y train values", alpha=0.3, color="g")
plt.plot(range(ytrain.shape[0]), regr.predict(xtrain), label="Predicted Y train values", alpha=0.3, color="r")
Out[354]:
[<matplotlib.lines.Line2D at 0x14a1edf28>]

Your answer here: the MSE is about 83. The predictions have less variability than the actual values - judging from the plots they are on trend, so to speak, but the actual win totals spike higher and lower than the predicted ones.
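The problem also asks which of the five rate terms to include. A minimal sketch of one way to check, assuming the xtrain/xtest/ytrain/ytest split and imports from the cell above: fit a model on every subset of the columns and compare the test MSE.

In [ ]:
from itertools import combinations

rate_cols = ["singles", "2B", "3B", "HR", "BB"]
results = []
for r in range(1, len(rate_cols) + 1):
    for cols in combinations(rate_cols, r):
        m = linear_model.LinearRegression().fit(xtrain[list(cols)], ytrain)
        results.append((mean_squared_error(ytest, m.predict(xtest[list(cols)])), cols))

# the five subsets with the smallest test MSE
for mse, cols in sorted(results)[:5]:
    print(f"{mse:8.2f}  {cols}")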

Problem 1(g)

Now we will create a similar database for individual players. Consider only player/year combinations in which the player had at least 500 plate appearances. Consider only the years we considered for the calculations above (after 1947 and seasons with 162 games). For each player/year compute singles, doubles, triples, HR, BB per plate appearance rates. Create a new pandas DataFrame called playerstats that has the playerID, yearID and the rates of these stats. Remove the yearly average from these rates, as done in Problem 1(e).

In [624]:
### Your code here ###

playerstats = players[players["yearID"] >= 1947]
playerstats = playerstats[playerstats["G"]>=160]

# adding in the two new columns
playerstats["PA"] = playerstats["AB"] + playerstats["BB"]
playerstats["1B"] = playerstats["H"] - playerstats["2B"] - playerstats["3B"] - playerstats["HR"]

# convert numbers into rates per appearance
for col in ["1B", "2B", "3B", "HR", "BB"]:
    playerstats[col] = playerstats[col] / playerstats["PA"]
    
# we only want these cols
playerstats = playerstats[["playerID", "yearID", "1B", "2B", "3B", "HR", "BB"]]

print(players.shape, playerstats.shape)
playerstats.head()
(102816, 22) (603, 7)
Out[624]:
playerID yearID 1B 2B 3B HR BB
41336 brutobi01 1961 0.176560 0.022831 0.007610 0.025875 0.092846
41386 colavro01 1961 0.132184 0.043103 0.002874 0.064655 0.162356
41661 marisro01 1961 0.114035 0.023392 0.005848 0.089181 0.137427
41778 richabo01 1961 0.213873 0.024566 0.007225 0.004335 0.043353
41785 robinbr01 1961 0.195804 0.053147 0.009790 0.009790 0.065734

Now to normalize per year:

In [625]:
def meanNormalizeRates(df):
    """takes in a DF and the columns to normalize"""
    cols_to_norm = ["1B", "2B", "3B", "HR", "BB"]
    tempDf = df[cols_to_norm]
    df[cols_to_norm] = tempDf - tempDf.mean(axis=0)
    return df

playerstats = playerstats.groupby('yearID').apply(meanNormalizeRates)
playerstats.shape
Out[625]:
(603, 7)

Show the head of the playerstats DataFrame.

In [626]:
### Your code here ###
playerstats.head()
Out[626]:
playerID yearID 1B 2B 3B HR BB
41336 brutobi01 1961 0.007998 -0.008939 -0.001184 -0.008974 -0.004180
41386 colavro01 1961 -0.036378 0.011334 -0.005921 0.029806 0.065330
41661 marisro01 1961 -0.054527 -0.008378 -0.002946 0.054332 0.040400
41778 richabo01 1961 0.045310 -0.007203 -0.001569 -0.030514 -0.053674
41785 robinbr01 1961 0.027242 0.021377 0.000996 -0.025059 -0.031292

Problem 1(h)

Using the playerstats DataFrame created in Problem 1(g), create a new DataFrame called playerLS containing the player's lifetime stats. This DataFrame should contain the playerID, the year the player's career started, the year the player's career ended and the player's lifetime average for each of the quantities (singles, doubles, triples, HR, BB). For simplicity we will simply compute the average of the yearly rates (a more correct way would be to go back to the totals).
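For reference, a minimal sketch of the totals-based alternative mentioned in the parentheses above, assuming the players table from 1(a) (tot and career_rates are just hypothetical names); below we stick with the simple average of yearly rates:

In [ ]:
# career totals per player, then rates computed from the totals
tot = players.groupby("playerID")[["H", "2B", "3B", "HR", "BB", "AB"]].sum()
tot["PA"] = tot["AB"] + tot["BB"]
tot["1B"] = tot["H"] - tot["2B"] - tot["3B"] - tot["HR"]
career_rates = tot[["1B", "2B", "3B", "HR", "BB"]].div(tot["PA"], axis=0)
career_rates.head()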

playerstats from 1(g) has most of the info, and to get the career start and end we can either search for the min and max year for each player (a quick sketch follows the list below) or just use the "debut" and "finalGame" columns from the master table.

  • debut Date that player made first major league appearance
  • finalGame Date that player made final major league appearance (blank if still active)
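A minimal sketch of the min/max-year option (career_years is just a hypothetical name); below we use debut/finalGame instead:

In [ ]:
career_years = playerstats.groupby("playerID")["yearID"].agg(["min", "max"])
career_years.columns = ["careerStart", "careerEnd"]
career_years.head()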
In [627]:
master[["playerID", "debut", "finalGame"]].head(2)
Out[627]:
playerID debut finalGame
0 aardsda01 2004-04-06 2015-08-23
1 aaronha01 1954-04-13 1976-10-03
In [628]:
### Your code here ###
playerLS = playerstats.groupby("playerID").mean().drop("yearID", axis=1).reset_index()
playerLS.head()
Out[628]:
playerID 1B 2B 3B HR BB
0 aaronha01 -0.007832 0.002895 -0.003433 0.024918 0.023050
1 abreubo01 -0.014127 0.008774 -0.001982 -0.008251 0.052051
2 ageeto01 0.009557 0.003229 0.003496 0.001698 -0.027181
3 allendi01 0.001701 0.007721 0.009833 0.003657 0.014888
4 alomaro01 0.009168 0.010377 0.008724 -0.015226 -0.014299
In [629]:
playerLS = playerLS.merge(master[["playerID", "debut", "finalGame"]], how="inner", on="playerID")
playerLS.head()
Out[629]:
playerID 1B 2B 3B HR BB debut finalGame
0 aaronha01 -0.007832 0.002895 -0.003433 0.024918 0.023050 1954-04-13 1976-10-03
1 abreubo01 -0.014127 0.008774 -0.001982 -0.008251 0.052051 1996-09-01 2014-09-28
2 ageeto01 0.009557 0.003229 0.003496 0.001698 -0.027181 1962-09-14 1973-09-30
3 allendi01 0.001701 0.007721 0.009833 0.003657 0.014888 1963-09-03 1977-06-19
4 alomaro01 0.009168 0.010377 0.008724 -0.015226 -0.014299 1988-04-22 2004-09-05

Show the head of the playerLS DataFrame.

In [11]:
### Your code here ###

Problem 1(i)

Compute the OPW for each player based on the average rates in the playerLS DataFrame. You can interpret this summary statistic as the predicted wins for a team with 9 batters exactly like the player in question. Add this column to the playerLS DataFrame. Call this column OPW.

offensive predicted wins (OPW)

In [630]:
playerLS["OPW"] = regr.predict(playerLS[["1B", "2B", "3B", "HR", "BB"]])
playerLS.head()
Out[630]:
playerID 1B 2B 3B HR BB debut finalGame OPW
0 aaronha01 -0.007832 0.002895 -0.003433 0.024918 0.023050 1954-04-13 1976-10-03 107.694862
1 abreubo01 -0.014127 0.008774 -0.001982 -0.008251 0.052051 1996-09-01 2014-09-28 95.462674
2 ageeto01 0.009557 0.003229 0.003496 0.001698 -0.027181 1962-09-14 1973-09-30 79.155456
3 allendi01 0.001701 0.007721 0.009833 0.003657 0.014888 1963-09-03 1977-06-19 107.425460
4 alomaro01 0.009168 0.010377 0.008724 -0.015226 -0.014299 1988-04-22 2004-09-05 80.458396

Problem 1(j)

Add four columns to the playerLS DataFrame that contain the player's position (C, 1B, 2B, 3B, SS, LF, CF, RF, or OF), first name, last name and median salary.

The fielding table contains the position 'POS', and we already have the name and median salary in the medianSalaries DataFrame.

First, add the median salaries to the playerLS DataFrame:

In [631]:
### Your code here ###
playerLS = playerLS.merge(medianSalaries, how="inner", on="playerID")
playerLS.head()
Out[631]:
playerID 1B 2B 3B HR BB debut finalGame OPW salary nameFirst nameLast
0 abreubo01 -0.014127 0.008774 -0.001982 -0.008251 0.052051 1996-09-01 2014-09-28 95.462674 9000000.0 Bobby Abreu
1 alomaro01 0.009168 0.010377 0.008724 -0.015226 -0.014299 1988-04-22 2004-09-05 80.458396 5466667.0 Roberto Alomar
2 altuvjo01 0.055975 0.010345 0.002287 -0.012276 0.001718 2011-07-20 2016-10-02 109.362671 1250000.0 Jose Altuve
3 anderga01 0.034407 0.002158 -0.002039 -0.014866 -0.076270 1994-07-27 2010-08-06 46.572173 3250000.0 Garret Anderson
4 andruel01 0.023382 0.001604 0.001352 -0.030660 -0.024045 2009-04-06 2016-10-02 57.419704 3837500.0 Elvis Andrus

Now to add in the POS of each player from the fielding table. This is trickier than it looks as the fielding table contains the fielding position for each year, and the playerLS table has only one entry per player. So I'm going to go with the most frequently used fielding position for each player and add that to the playerLS table.

In [632]:
from collections import Counter

def findPosition(df):
    """takes in a df, and for each row returns a POS by selecting
    the most frequent POS for that player from the fielding table"""
    playerID = df["playerID"]
    positions = Counter(fielding[fielding["playerID"]==playerID]["POS"])
    return positions.most_common(1)[0][0]

#apply the function to each row:
playerLS["Pos"] = playerLS.apply(findPosition, axis=1)
print(playerLS.shape)
playerLS.head()
(227, 13)
Out[632]:
playerID 1B 2B 3B HR BB debut finalGame OPW salary nameFirst nameLast Pos
0 abreubo01 -0.014127 0.008774 -0.001982 -0.008251 0.052051 1996-09-01 2014-09-28 95.462674 9000000.0 Bobby Abreu OF
1 alomaro01 0.009168 0.010377 0.008724 -0.015226 -0.014299 1988-04-22 2004-09-05 80.458396 5466667.0 Roberto Alomar 2B
2 altuvjo01 0.055975 0.010345 0.002287 -0.012276 0.001718 2011-07-20 2016-10-02 109.362671 1250000.0 Jose Altuve 2B
3 anderga01 0.034407 0.002158 -0.002039 -0.014866 -0.076270 1994-07-27 2010-08-06 46.572173 3250000.0 Garret Anderson OF
4 andruel01 0.023382 0.001604 0.001352 -0.030660 -0.024045 2009-04-06 2016-10-02 57.419704 3837500.0 Elvis Andrus SS
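For reference, the same lookup can be done in one pass over the fielding table instead of a scan per row; a minimal sketch (main_pos is just a hypothetical name, and ties may break differently than Counter.most_common):

In [ ]:
main_pos = fielding.groupby("playerID")["POS"].agg(lambda s: s.value_counts().idxmax())
# playerLS["Pos"] = playerLS["playerID"].map(main_pos)   # would give the same column as the apply above
main_pos.head()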

Show the head of the playerLS DataFrame.

In [14]:
### Your code here ###

Problem 1(k)

Subset the playerLS DataFrame for players who were active in 2002 and 2003 and played at least three years. Plot and describe the relationship between the median salary (in millions) and the predicted number of wins.

First, converting to datetime:

In [633]:
### Your code here ###
playerLS["debut"] = pd.to_datetime(playerLS["debut"])
playerLS["finalGame"] = pd.to_datetime(playerLS["finalGame"])
print(playerLS.shape)
playerLS.head(2)
(227, 13)
Out[633]:
playerID 1B 2B 3B HR BB debut finalGame OPW salary nameFirst nameLast Pos
0 abreubo01 -0.014127 0.008774 -0.001982 -0.008251 0.052051 1996-09-01 2014-09-28 95.462674 9000000.0 Bobby Abreu OF
1 alomaro01 0.009168 0.010377 0.008724 -0.015226 -0.014299 1988-04-22 2004-09-05 80.458396 5466667.0 Roberto Alomar 2B

Building a mask to select:

In [634]:
mask = (((playerLS["finalGame"].dt.year - playerLS["debut"].dt.year) >= 3) & 
        (playerLS["debut"].dt.year <= 2002) & 
        (playerLS["finalGame"].dt.year >= 2003))

playeractive = playerLS[mask]
print(playeractive.shape)
playeractive.head(3)
(78, 13)
Out[634]:
playerID 1B 2B 3B HR BB debut finalGame OPW salary nameFirst nameLast Pos
0 abreubo01 -0.014127 0.008774 -0.001982 -0.008251 0.052051 1996-09-01 2014-09-28 95.462674 9000000.0 Bobby Abreu OF
1 alomaro01 0.009168 0.010377 0.008724 -0.015226 -0.014299 1988-04-22 2004-09-05 80.458396 5466667.0 Roberto Alomar 2B
3 anderga01 0.034407 0.002158 -0.002039 -0.014866 -0.076270 1994-07-27 2010-08-06 46.572173 3250000.0 Garret Anderson OF
In [635]:
sns.lmplot("salary", "OPW", data=playeractive, logx=True)
plt.ylim(0,250)
sns.lmplot("salary", "OPW", data=playeractive, hue="Pos", size=6, fit_reg=True)
plt.ylim(0,250)
/Users/ko/anaconda/lib/python3.6/site-packages/seaborn/linearmodels.py:267: RuntimeWarning: invalid value encountered in log
  grid = np.c_[np.ones(len(grid)), np.log(grid)]
Out[635]:
(0, 250)

In the first chart there is a lot of OPW variation at the same salary, with a steady increase with salary. The player on the far right, though, seems vastly overpaid for his OPW. The second chart fits a linear regression per position, where there is a lot more variation, though this data set doesn't have enough points per position to really say.

Problem 1(l)

Pick one player for each of these 10 positions: C, 1B, 2B, 3B, SS, LF, CF, RF, DH, or OF, keeping the total median salary of all 10 players below $20 million. Report their averaged predicted wins and total salary.

In [812]:
### Your code here ###
columns = ["playerID", "Pos", "salary", "OPW"]

salary_cap = 20*np.power(10, 6)

def make_team(salary_cap=salary_cap):
    df = pd.DataFrame()
    for pos in ["C", "1B", "2B", "3B", "SS", "LF", "CF", "RF", "DH", "OF"]:
        df_temp = playeractive[playeractive["Pos"]==pos]
        if df_temp.shape[0] > 0:
            df = df.append(df_temp.sample())
    return df

scores = []
best_team=make_team()
              
# going to make a bunch of random teams with the salary cap less then 20M
for i in range(150):
    df = make_team()
    while df["salary"].sum() >= salary_cap:
        df = make_team()
    score = df["OPW"].mean()
    scores.append(score)
    if score >= max(scores):
        best_team = df

# The team with the highest mean OPW as per above:
print(f"Salary: {best_team['salary'].sum():,.0f}, Mean OPW: {best_team.OPW.mean():.2f}")
best_team
Salary: 19,225,000, Mean OPW: 85.99
Out[812]:
playerID 1B 2B 3B HR BB debut finalGame OPW salary nameFirst nameLast Pos
139 mcgrifr01 -0.045828 -0.004944 -0.002342 0.027440 0.078672 1986-05-17 2004-07-15 115.192971 4275000.0 Fred McGriff 1B
22 biggicr01 0.012406 0.006105 -0.000200 -0.015707 0.008167 1988-06-26 2007-09-30 81.420779 4000000.0 Craig Biggio 2B
27 booneaa01 -0.025670 0.004521 -0.001056 -0.006579 -0.012648 1997-06-20 2009-10-04 56.273163 925000.0 Aaron Boone 3B
176 rolliji01 -0.008713 -0.001946 0.010000 -0.008171 -0.010932 2000-09-17 2016-06-08 73.728189 8000000.0 Jimmy Rollins SS
89 gonzalu01 0.004170 0.003094 0.001470 0.012240 0.012480 1990-09-04 2008-09-28 103.358827 2025000.0 Luis Gonzalez OF
In [813]:
sns.distplot(scores)
plt.axvline(max(scores), color="r");

Note: only five of the ten listed positions actually occur in playeractive, so the "team" above has five players. This random search could be redone more carefully later, but it's good enough for now.
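One way to redo it: since only a handful of positions actually occur in playeractive, an exhaustive search over one player per represented position is feasible. A minimal sketch, assuming playeractive and salary_cap from above (best is just a hypothetical name, and the loop may take a little while if the position pools are large):

In [ ]:
from itertools import product

POSITIONS = ["C", "1B", "2B", "3B", "SS", "LF", "CF", "RF", "DH", "OF"]
pools = [g for pos, g in playeractive.groupby("Pos") if pos in POSITIONS]

best = None
for rows in product(*(g.itertuples() for g in pools)):
    total = sum(r.salary for r in rows)
    if total < salary_cap:
        opw = np.mean([r.OPW for r in rows])
        if best is None or opw > best[0]:
            best = (opw, total, [r.playerID for r in rows])

best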

Problem 1(m)

What do these players outperform in? Singles, doubles, triples, HR, or BB?

So we need to know which of ["1B","2B","3B","HR","BB"] contributes most to OPW for these players.

In [816]:
# NOTE: with only 5 players and 5 features (plus an intercept) this fit is
# underdetermined, so the coefficients below should be treated with caution
regr2 = linear_model.LinearRegression()
regr2.fit(best_team[["1B","2B","3B","HR","BB"]], best_team["OPW"])
regr2.coef_
Out[816]:
array([ 621.88427825, -340.92979757,  476.95997605,  834.65337277,
        442.93047415])
In [817]:
df = best_team.copy() 
df[["1B","2B","3B","HR","BB"]] = df[["1B","2B","3B","HR","BB"]] * regr2.coef_ * 100
df
Out[817]:
playerID 1B 2B 3B HR BB debut finalGame OPW salary nameFirst nameLast Pos
139 mcgrifr01 -2849.959598 168.544770 -111.708611 2290.289714 3484.623079 1986-05-17 2004-07-15 115.192971 4275000.0 Fred McGriff 1B
22 biggicr01 771.524101 -208.147868 -9.523578 -1311.004953 361.722421 1988-06-26 2007-09-30 81.420779 4000000.0 Craig Biggio 2B
27 booneaa01 -1596.372187 -154.124887 -50.389977 -549.078708 -560.225737 1997-06-20 2009-10-04 56.273163 925000.0 Aaron Boone 3B
176 rolliji01 -541.837860 66.333591 476.965145 -681.958606 -484.191138 2000-09-17 2016-06-08 73.728189 8000000.0 Jimmy Rollins SS
89 gonzalu01 259.351878 -105.476641 70.117467 1021.619694 552.762544 1990-09-04 2008-09-28 103.358827 2025000.0 Luis Gonzalez OF
In [818]:
for pos in ["1B","2B","3B","HR","BB"]:
    print(pos, df[pos].mean())
1B -791.4587330732895
2B -46.574206918616504
3B 75.0920892529091
HR 153.9734283087909
BB 670.938233712274

Your answer here: BB (walks) seems to contribute the most to OPW for this team.

A better way is to use a bigger data set to get the regression coefficients (since OPW was defined by regr, fitting on all of playerLS essentially recovers regr's coefficients):

In [819]:
regr3 = linear_model.LinearRegression()
regr3.fit(playerLS[["1B","2B","3B","HR","BB"]], playerLS["OPW"])

df = best_team.copy() 
df[["1B","2B","3B","HR","BB"]] = df[["1B","2B","3B","HR","BB"]] * regr3.coef_ * 100
for pos in ["1B","2B","3B","HR","BB"]:
    print(pos, df[pos].mean())
1B -685.2660211999464
2B 69.3114948228255
3B 172.04358802572847
HR 157.48965503691235
BB 778.5899413533468

Discussion for Problem 1

Write a brief discussion of your conclusions to the questions and tasks above in 100 words or less.


Most of the work here was data wrangling: turning the team and player tables into per-plate-appearance rates and normalizing them by year. A linear regression on the team rates predicts wins with an MSE of about 83 on the post-2002 seasons, and applying the same model to individual players gives the OPW summary. Combining OPW with median salaries lets us assemble a cheap team (about $19 million) with a mean OPW of roughly 86, driven mostly by players who walk a lot.

Problem 2

$k$-Nearest Neighbors and Cross Validation

What is the optimal $k$ for predicting species using $k$-nearest neighbor classification on the four features provided by the iris data set?

In this problem you will get to know the famous iris data set, and use cross validation to select the optimal $k$ for a $k$-nearest neighbor classification. This problem set makes heavy use of the sklearn library. In addition to Pandas, it is one of the most useful libraries for data scientists! After completing this homework assignment you will know all the basics to get started with your own machine learning projects in sklearn.

Future lectures will give further background information on different classifiers and their specific strengths and weaknesses, but once you have the basics of sklearn down, changing the classifier boils down to changing one or two lines of code.
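For instance, a minimal sketch (not needed for this problem) of what such a swap could look like, since every sklearn estimator exposes the same fit/predict interface:

In [ ]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression

clf = KNeighborsClassifier(n_neighbors=5)   # the classifier used in this problem
# clf = LogisticRegression()                # <- swapping models is just this line
# clf.fit(x_train, y_train); clf.predict(x_test)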

The data set is so popular that sklearn provides an extra function to load it:

In [507]:
#load the iris data set
iris = sklearn.datasets.load_iris()

X = iris.data  
Y = iris.target

X.shape, Y.shape
Out[507]:
((150, 4), (150,))

Problem 2(a)

Split the data into a train and a test set. Use a random selection of 33% of the samples as test data. Sklearn provides the train_test_split function for this purpose. Print the dimensions of all the train and test data sets you have created.

See the train_test_split docs; the version linked above is deprecated.

In [970]:
### Your code here ###
from sklearn.model_selection import train_test_split

# NOTE: X and Y must still hold the iris data here (150 samples, 4 features); if the
# digits cell from Problem 3 has been run in the meantime, re-run the iris load cell
# first, otherwise the shapes below come out as the digits shapes instead of
# (100, 4), (50, 4), (100,), (50,)
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.33)
x_train.shape, x_test.shape, y_train.shape, y_test.shape
Out[970]:
((1203, 64), (594, 64), (1203,), (594,))

Problem 2(b)

Examine the data further by looking at the projections to the first two principal components of the data. Use the TruncatedSVD function for this purpose, and create a scatter plot. Use the colors on the scatter plot to represent the different classes in the target data.

In [975]:
### Your code here ###
from sklearn.decomposition import TruncatedSVD

svd = TruncatedSVD()

# center the data (strictly the test set should be centered with the training mean,
# but the difference is negligible here)
x_train_centered = x_train-np.mean(x_train, axis=0)
x_test_centered = x_test-np.mean(x_test, axis=0)

x_pca = svd.fit_transform(x_train_centered)

plt.scatter(x_pca[:,0], x_pca[:,1], c=y_train, cmap=plt.cm.tab10,  # tab10 replaces the deprecated Vega10 name
            s=(2+y_train)*20)
plt.title("Principal component plot")
plt.colorbar()
plt.xlabel("PC 1"), plt.ylabel("PC 2");

Problem 2(c)

In the lecture we discussed how to use cross validation to estimate the optimal value for $k$ (the number of nearest neighbors to base the classification on). Use ten fold cross validation to estimate the optimal value for $k$ for the iris data set.

Note: For your convenience sklearn includes not only the KNN classifier but also a grid search function. It is called grid search because, if you have to optimize more than one parameter, it is common practice to define a range of possible values for each parameter; an exhaustive search then runs over the complete grid defined by all possible parameter combinations. This can get computationally heavy, but luckily our KNN classifier only requires tuning a single parameter for this problem set.
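A hypothetical two-parameter grid, just to make the idea concrete (the actual search below tunes only n_neighbors):

In [ ]:
# 24 values of k times 2 weighting schemes = 48 combinations, each scored with 10-fold CV
param_grid = {
    "n_neighbors": list(range(1, 25)),
    "weights": ["uniform", "distance"],
}
# sklearn.model_selection.GridSearchCV(KNeighborsClassifier(), param_grid, cv=10)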

In [976]:
### Your code here ###
from sklearn.neighbors import KNeighborsClassifier

neigh = KNeighborsClassifier()
neigh.fit(x_train_centered, y_train) 
np.mean(neigh.predict(x_test_centered)==y_test)
Out[976]:
0.98148148148148151

Pretty good accuracy right off the bat! Now to find the optimal $k$ using GridSearchCV:

In [977]:
from sklearn.model_selection import GridSearchCV
k_values = [k for k in range(1,25)]
parameters = {'n_neighbors': k_values}
clf = GridSearchCV(KNeighborsClassifier(), parameters, cv=10)
clf.fit(x_train_centered, y_train)
Out[977]:
GridSearchCV(cv=10, error_score='raise',
       estimator=KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform'),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'n_neighbors': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=None, verbose=0)

Problem 2(d)

Visualize the result by plotting the score results versus values for $k$.

In [978]:
### Your code here ###
#plt.plot(clf.param_grid["n_neighbors"], clf.cv_results_["mean_test_score"], linewidth=2)
plt.errorbar(clf.param_grid["n_neighbors"], clf.cv_results_["mean_test_score"], 
             clf.cv_results_["std_test_score"], lw=3, elinewidth=2, ecolor="g")
plt.ylim(0.8,1.05)
plt.title("K vs accuracy")
plt.xticks(k_values)
plt.xlabel("K value"), plt.ylabel("Accuracy")

# highlight the best value
best_k = clf.best_params_["n_neighbors"]
plt.scatter(best_k, clf.best_score_, alpha=0.2, s=300, c="g")
plt.annotate(f"Best k: {best_k}, Best score: {clf.best_score_}", color="r", 
             arrowprops=dict(facecolor='r', shrink=0.05), 
             xy=(best_k, clf.best_score_), xytext=(best_k+2, 0.88));

Verify that the grid search has indeed chosen the right parameter value for $k$.

In [979]:
### Your code here ###
clf.best_params_
Out[979]:
{'n_neighbors': 1}

Problem 2(e)

Test the performance of our tuned KNN classifier on the test set.

In [980]:
### Your code here ###
# simple way to check:
clf.best_estimator_.score(x_test_centered, y_test)
Out[980]:
0.98148148148148151

Looking at the cross-validation score:

In [981]:
from sklearn.model_selection import cross_val_score
scores = cross_val_score(clf.best_estimator_, x_test_centered, y_test, cv=5)
scores.mean(), scores.std()
Out[981]:
(0.96961102036050573, 0.01459413416300674)
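For a per-class breakdown, something like the following should work; a minimal sketch using the tuned estimator and the split from 2(a):

In [ ]:
y_pred = clf.best_estimator_.predict(x_test_centered)
print(sklearn.metrics.classification_report(y_test, y_pred))
print(sklearn.metrics.confusion_matrix(y_test, y_pred))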

Discussion for Problem 2

Write a brief discussion of your conclusions to the questions and tasks above in 100 words or less.


We classified the iris data set with $k$-nearest neighbors, which was quite accurate; the grid search picked $k$ = 1. The plain test-set score differs slightly from the cross_val_score because cross_val_score refits the tuned classifier on five smaller folds of the test set and averages the fold scores, so some variation between the two numbers is expected.

Problem 3:

The Curse and Blessing of Higher Dimensions

In this problem we will investigate the influence of higher dimensional spaces on the classification. The data set is again one of the standard data sets from sklearn. The digits data set is similar to the MNIST data set discussed in the lecture. The main difference is that each digit is represented by an 8x8 pixel image patch, which is considerably smaller than the 28x28 pixels of MNIST. In addition, the gray values are restricted to 16 different values (4 bit), instead of 256 (8 bit) for MNIST.

First we again load our data set.

In [864]:
digits = sklearn.datasets.load_digits()

X = digits.data  
Y = digits.target

X.shape, Y.shape
Out[864]:
((1797, 64), (1797,))

Problem 3(a)

Start with the same steps as in Problem 2. Split the data into train and test set. Use 33% of the samples as test data. Print the dimensions of all the train and test data sets you created.

In [865]:
### Your code here ###
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.33)
x_train.shape, x_test.shape, y_train.shape, y_test.shape
Out[865]:
((1203, 64), (594, 64), (1203,), (594,))

Problem 3(b)

Similar to Problem 2(b), create a scatter plot of the projections to the first two PCs. Use the colors on the scatter plot to represent the different classes in the target data. How well can we separate the classes?

Hint: Use a Colormap in matplotlib to represent the different classes in the target data.

In [982]:
### Your code here ###

from sklearn.decomposition import TruncatedSVD
svd = TruncatedSVD()

# center the data
x_train = x_train-np.mean(x_train, axis=0)
x_test = x_test-np.mean(x_test, axis=0)

# x_train is already centered, so there is no need to subtract the mean again
x_pca = svd.fit_transform(x_train)

plt.scatter(x_pca[:,0], x_pca[:,1], c=y_train, cmap=plt.cm.tab10)
plt.colorbar()
plt.title("Principal component plot")
plt.xlabel("PC 1"), plt.ylabel("PC 2");

Create individual scatter plots using only two classes at a time to explore which classes are most difficult to distinguish in terms of class separability. You do not need to create scatter plots for all pairwise comparisons, but at least show one.

In [983]:
NUMS = 4 # make plots for digits from 0 to this number

# setting up the number combinations
from itertools import combinations
combs = np.array([i for i in combinations(range(NUMS),2)])

fig = plt.figure(figsize=(16,16))
sns.despine(left=True, bottom=True)
cmap = plt.cm.get_cmap('tab10')
grid_size = (NUMS, NUMS)

axes = [] # holds all the subplots
i = 0 # the axis number

for x, y in combs:
    # first we need to separate out the points belonging to digits x and y
    mask_x = (y_train == x)
    mask_y = (y_train == y)
    
    axes.append(plt.subplot2grid(grid_size, (x,y)))
    axes[i].set_title(f"{x} vs {y}", fontsize=18)
    
    # scatter plot for digit x
    axes[i].scatter(x_pca[:,0][mask_x], x_pca[:,1][mask_x],  
                    c=[cmap(i/10) for i in y_train[mask_x]], label=str(x))
    # scatter plot for digit y
    axes[i].scatter(x_pca[:,0][mask_y], x_pca[:,1][mask_y], 
                    c=[cmap(i/10) for i in y_train[mask_y]], label=str(y))

    axes[i].legend(loc="lower right", facecolor="grey", frameon=True, framealpha=0.15)
    i += 1

fig.tight_layout()