## IDENTIFYING MOST COMMON NBA PLAYERS FROM THE 2017 DRAFT CLASS USING PCA DIMENSIONALITY REDUCTION AND K NEAREST NEIGHBORS ALGORITHM

It's common to hear talking heads tell us that Lonzo Ball looks like the next Jason Kidd, or that Josh Jackson is a better-shooting version of Kawhi Leonard. But there's a lot of inherent bias in these prognostications. First, they're limited to players we're familiar with. Second, we choose to see certain aspects of a player's game because we want to describe someone as a great shooter, passer, or rebounder. The goal of this analysis is to strip away those biases and get the most accurate comparisons possible, using the best data we have available.

The approach is pretty straightforward and is outlined below before we dig into the code.

• Take every player drafted since 2010 who also played in the NCAA
• Append their basic and advanced college stats from CBB Reference
• Take all NCAA players from this season and retrieve their advanced stats as well
• Since we have 38 different statistics, there's a lot of covariance among our features, so we'll perform "dimensionality reduction" to reduce them to the fewest features that can explain the variance we see in our dataset
• Take every player and measure their Euclidean distance to every other player in the dataset
• Limit the dataset to NCAA players compared to NBA players (this is the comparison we wanted to make from the beginning)
• Return every NBA player and sort by ascending distance
• Limit to Chad Ford's top 50 and export to CSV
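
The steps above can be sketched end to end on synthetic data. This is only a minimal illustration, not the notebook's actual pipeline: the player labels, the five `stat_*` columns, and the choice of two components are all assumptions made for the example, and it uses `scipy.spatial.distance.cdist` to measure college-to-drafted distances directly rather than building the full pairwise matrix.

```python
import numpy as np
import pandas as pd
from scipy.spatial.distance import cdist
from sklearn.decomposition import PCA

# Synthetic stand-in data: 6 "college" players and 4 "drafted" players,
# each described by 5 made-up statistics (not real numbers).
rng = np.random.default_rng(0)
stat_cols = ["stat_a", "stat_b", "stat_c", "stat_d", "stat_e"]
college = pd.DataFrame(rng.normal(size=(6, 5)), columns=stat_cols,
                       index=["college_%d" % i for i in range(6)])
drafted = pd.DataFrame(rng.normal(size=(4, 5)), columns=stat_cols,
                       index=["drafted_%d" % i for i in range(4)])

# 1. Stack both groups and z-score every column.
combined = pd.concat([college, drafted])
scaled = (combined - combined.mean()) / combined.std()

# 2. Reduce to a few principal components (2 here, purely for illustration).
components = PCA(n_components=2).fit_transform(scaled)
college_pcs = components[:len(college)]
drafted_pcs = components[len(college):]

# 3. Euclidean distance from every college player to every drafted player.
dist = pd.DataFrame(cdist(college_pcs, drafted_pcs, metric="euclidean"),
                    index=college.index, columns=drafted.index)

# 4. For each college player, the drafted player at the smallest distance.
closest = dist.idxmin(axis=1)
print(closest)
```

Sorting a single row of `dist` in ascending order gives the ranked list of comparisons for that one college player.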
In [ ]:
#import necessary libraries
import pandas as pd
import numpy as np
from time import sleep
from scipy.spatial import distance
from scipy.spatial.distance import squareform
from scipy.spatial.distance import pdist
from sklearn.decomposition import PCA

In [158]:
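#read in datasets
#NCAA stats saved by the scraping cell below; the saved index comes back as 'Unnamed: 0'
df = pd.read_csv("ncaa_stats.csv")
#the NBA draft dataset (nba, with Player, Rd, Pk and Year columns) also needs to be loaded here

df.drop('Unnamed: 0',axis=1,inplace=True)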

In [228]:
#scrape data from bbref if pulling data for the first time
#i = 100

#while(i < 15000):
#    print("Number of players retrieved:", str(i))
#    i = i+100
#    sleep(10)

#realign columns properly
#cols = ['Rk', 'Player', 'Class', 'Season',
#       'Pos', 'School', 'Conf', 'G', 'MP', 'MP.1', 'FG', 'FGA', '2P', '2PA',
#       '3P', '3PA', 'FT', 'FTA', 'ORB', 'DRB', 'TRB', 'AST', 'STL', 'BLK',
#       'TOV', 'PF', 'PTS', 'PER', 'TS%', 'eFG%', 'ORB%', 'DRB%', 'TRB%',
#       'AST%', 'STL%', 'BLK%', 'TOV%', 'USG%', 'PProd', 'ORtg', 'DRtg', 'OWS',
#       'DWS', 'WS', 'OBPM', 'DBPM', 'BPM','drop1','drop2','drop3']

#df.columns = cols

#df.drop(['drop1','drop2','drop3'], axis=1, inplace=True)

#df = df.drop(df[df.Class == 'Class'].index)

#send to CSV for perpetuity
#df.to_csv("ncaa_stats.csv")

In [229]:
#keep only each player's most recent season (one row per player)
#df_new is a boolean mask: True where the row's Season equals that player's latest Season
df_new = df.groupby(['Player'])['Season'].transform('max') == df['Season']

stats = df[df_new]
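
To make the masking step concrete, here is a small sketch on invented data (the player labels, seasons, and point totals are assumptions for illustration only). It shows how `transform('max')` broadcasts each player's latest season back onto every row so it can be compared against the `Season` column.

```python
import pandas as pd

# Toy data: player names and seasons are invented for illustration.
toy = pd.DataFrame({
    "Player": ["A", "A", "B", "B", "B", "C"],
    "Season": ["2014-15", "2015-16", "2012-13", "2013-14", "2014-15", "2016-17"],
    "PTS":    [8.1, 12.4, 3.0, 9.5, 15.2, 10.0],
})

# transform('max') returns one value per original row (the group's maximum),
# so comparing it with the Season column gives a row-level boolean mask.
# Seasons in the "YYYY-YY" format sort correctly as plain strings.
is_latest = toy.groupby("Player")["Season"].transform("max") == toy["Season"]
latest = toy[is_latest]
print(latest)
```

Note that two different players who share a name would be treated as one group here, so only the later of their seasons would survive.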

In [230]:
#limit to the few columns we need from the NBA dataset
nba = nba[['Player','Rd','Pk','Year']]

In [231]:
#merge nba players and get their ncaa stats
draft_stats = pd.merge(nba,stats,on='Player',how='left')

draft_stats = draft_stats.dropna(subset=['Class'])
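
The left merge followed by `dropna` is what removes draft picks with no college record. A minimal sketch with invented names follows; the `indicator=True` argument is an optional extra, not something the notebook uses, and is included only to make the unmatched rows visible.

```python
import pandas as pd

# Invented names; "Overseas Pick" has no college row on purpose.
picks = pd.DataFrame({"Player": ["College Pick", "Overseas Pick"],
                      "Pk": [1, 2]})
college = pd.DataFrame({"Player": ["College Pick", "Undrafted Senior"],
                        "Class": ["FR", "SR"],
                        "PTS": [14.6, 11.2]})

# A left merge keeps every draft pick; picks without college stats get NaN.
# indicator=True adds a _merge column showing where each row came from.
merged = pd.merge(picks, college, on="Player", how="left", indicator=True)
print(merged)

# Dropping rows where Class is missing removes picks with no college record.
with_stats = merged.dropna(subset=["Class"])
print(with_stats)
```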

In [233]:
#limit to the most recent NCAA season and drop the raw MP column (we only want Min/Gm)
#drop() returns a new frame here, which avoids a SettingWithCopyWarning from modifying a slice in place
test = stats[stats['Season']== '2016-17'].drop('MP',axis=1)

draft_stats_test = draft_stats[draft_stats['Year'] < 2017]

In [236]:
#drop unwanted columns
draft_stats_vars = draft_stats_test.drop(['Rd','Pk','Year','Rk','Class','Season','Pos','School','Conf','G','MP'],axis=1)
test_16_vars = test.drop(['Class','Rk','Season','Pos','School','Conf','G'],axis=1)

In [237]:
#verify they have the same # of columns
print ("test df shape:",test_16_vars.shape)
print ("draft df shape:",draft_stats_vars.shape)

test df shape: (2080, 39)
draft df shape: (272, 39)

In [238]:
# concat the two DFs together into one big DF
final = pd.concat([test_16_vars,draft_stats_vars])
final = final.reset_index()

In [239]:
#finally drop the player and additional index column
df_final = final.drop(['Player','index'],axis=1)

In [240]:
#normalize the data so that distance is equalized regardless of the scale of the metric
final_normal = (df_final - df_final.mean()) / df_final.std()
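
As a quick check on what this normalization does, the sketch below applies the same formula to two invented columns on very different scales. The column names and values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Two made-up columns on very different scales.
toy = pd.DataFrame({"minutes": [10.0, 20.0, 30.0, 40.0],
                    "ts_pct": [0.45, 0.50, 0.55, 0.60]})

# Same formula as above: subtract the column mean, divide by the column std.
# pandas' std() uses the sample standard deviation (ddof=1) by default.
z = (toy - toy.mean()) / toy.std()
print(z)

# After scaling, every column has mean 0 and standard deviation 1,
# so no single statistic dominates the Euclidean distance.
print(z.mean().round(10))
print(z.std())
```

One design note: `sklearn.preprocessing.StandardScaler` divides by the population standard deviation (ddof=0) instead, so its output differs from this formula by a constant factor per column.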

In [241]:
#introduce PCA Dimensionality Reduction to get the best features that explain the variance in our matrix
pca = PCA()
transformed_pca_x = pca.fit_transform(final_normal)
#create component indices
component_names = ["component_"+str(comp) for comp in range(1, len(pca.explained_variance_)+1)]

#generate new component dataframe
transformed_pca_x = pd.DataFrame(transformed_pca_x,columns=component_names)

       MP.1        FG       FGA        2P       2PA        3P       3PA  \
0  1.214320  1.676989  1.813782  1.160537  1.524675  0.978063  0.792355
1  0.184873  1.405143  0.496129  2.210807  1.755843 -1.315460 -1.417786
2 -1.393613 -0.905546 -1.464283 -0.169806 -0.594367 -1.315460 -1.417786
3  0.339290  0.385722 -0.500146  1.090519  0.407362 -1.194749 -1.229689
4  0.510865 -0.157970 -0.628698 -0.309842 -0.787007  0.133081  0.039967

FT       FTA       ORB    ...         USG%     PProd      ORtg  \
0  3.794487  3.365690  1.594710    ...     1.726118  2.079515  1.316056
1 -0.014472  0.844939  2.430493    ...     1.105218  1.340251  0.764449
2 -1.149055 -0.856568  0.400733    ...    -1.292741 -1.222051  0.291643
3 -0.095513 -0.037324  2.072300    ...    -0.736072  0.708647  1.473658
4 -0.662805 -0.919586 -0.076857    ...    -1.314151  0.012447  2.133616

DRtg       OWS       DWS        WS      OBPM      DBPM       BPM
0 -2.210414  2.258310  2.564215  2.687228  2.628517  2.274177  3.040330
1 -2.560484  1.446197  2.875438  2.201929  1.684870  3.228015  2.956779
2 -2.088651 -0.584086  0.541266 -0.224570  0.363764  4.108481  2.580799
3 -2.179973  1.283774  2.875438  2.080604  1.087227  3.117957  2.497248
4 -1.525496  1.202563  1.941769  1.655967  1.684870  2.347549  2.455473

[5 rows x 38 columns]

Out[241]:
component_1 component_2 component_3 component_4 component_5 component_6 component_7 component_8 component_9 component_10 ... component_29 component_30 component_31 component_32 component_33 component_34 component_35 component_36 component_37 component_38
0 10.276067 1.605826 0.772921 2.503615 -1.112311 -1.594821 -0.589460 0.639170 -0.174985 1.860121 ... -0.173023 -0.133946 -0.141375 0.002717 0.017823 -0.004699 -0.041912 0.010904 -0.005366 0.002974
1 8.860138 -3.794910 -0.077645 4.185647 -0.216325 -0.925672 -1.518631 -0.144931 0.604341 -1.232700 ... -0.267700 -0.014189 0.101492 -0.010417 0.005115 -0.000098 0.002593 -0.001163 0.001736 -0.000472
2 0.206886 -6.265915 2.390440 4.415517 0.160699 -1.142673 2.502830 3.200884 -0.197537 -0.942359 ... 0.176014 -0.076339 -0.066579 -0.015415 -0.015631 -0.000041 0.002867 -0.022992 -0.002697 -0.002196
3 6.413919 -5.715263 2.318523 3.536236 0.269141 0.516624 1.059161 0.196717 0.208271 -1.431863 ... 0.027190 0.038664 0.004368 -0.008195 0.003281 0.008977 0.004088 -0.017837 0.001622 -0.000914
4 2.651144 -1.205150 4.764476 4.285265 0.440931 0.745651 -0.708535 1.183168 -0.092450 -0.207216 ... 0.120202 0.048320 0.073231 0.011929 0.018293 -0.037068 -0.039336 -0.005705 0.000903 -0.001435

5 rows × 38 columns

In [264]:
#generate component loadings on original features
component_matrix = pd.DataFrame(pca.components_,index=component_names,columns = df_final.columns)
component_matrix["explained_variance_ratio"] = pca.explained_variance_ratio_
component_matrix["eigenvalue"] = pca.explained_variance_

print("explained variance running sum by component:",component_matrix.explained_variance_ratio.cumsum())

explained variance running sum by component: component_1     0.361325
component_2     0.583178
component_3     0.699289
component_4     0.779868
component_5     0.822355
component_6     0.852970
component_7     0.879692
component_8     0.904938
component_9     0.925857
component_10    0.942240
component_11    0.955429
component_12    0.967679
component_13    0.977788
component_14    0.985724
component_15    0.989414
component_16    0.991526
component_17    0.993078
component_18    0.994361
component_19    0.995431
component_20    0.996239
component_21    0.996847
component_22    0.997343
component_23    0.997810
component_24    0.998226
component_25    0.998636
component_26    0.998989
component_27    0.999272
component_28    0.999514
component_29    0.999744
component_30    0.999874
component_31    0.999943
component_32    0.999957
component_33    0.999970
component_34    0.999982
component_35    0.999991
component_36    0.999996
component_37    0.999998
component_38    1.000000
Name: explained_variance_ratio, dtype: float64
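
Rather than reading the cutoff off the printed running sum, the number of components can be picked programmatically. The sketch below is one way to do that on synthetic data; the 0.95 threshold and the generated matrix are assumptions for the example, not values from this analysis.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data: 200 rows, 10 correlated columns built from 3 hidden factors.
rng = np.random.default_rng(42)
factors = rng.normal(size=(200, 3))
X = factors @ rng.normal(size=(3, 10)) + 0.1 * rng.normal(size=(200, 10))
X = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

pca = PCA().fit(X)
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components whose running sum reaches the threshold.
threshold = 0.95  # assumed cutoff, chosen only for this example
n_keep = int(np.searchsorted(cumulative, threshold) + 1)
print(n_keep, cumulative[n_keep - 1])
```

The first `n_keep` columns of the transformed data can then be selected with `.iloc[:, :n_keep]`.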

In [243]:
#so let's perform the KNN algorithm on components 1-14, since they explain 98.57% of the variance in the dataset
pca_final = transformed_pca_x.iloc[:,:14]

In [245]:
#get the distance between every pair of observations in the final DF
distances_euclidean = pdist(pca_final, metric='euclidean')

In [246]:
#create a pairwise matrix
distances_matrix = squareform(distances_euclidean)

Out[246]:
array([[  0.        ,  20.88732888,  35.13218736, ...,  25.94866902,
24.31348246,  24.06270058],
[ 20.88732888,   0.        ,  25.82637323, ...,  26.99952771,
23.93688809,  24.37278189],
[ 35.13218736,  25.82637323,   0.        , ...,  24.78983909,
27.13142578,  28.85847513],
...,
[ 25.94866902,  26.99952771,  24.78983909, ...,   0.        ,
14.00979177,  16.17721438],
[ 24.31348246,  23.93688809,  27.13142578, ...,  14.00979177,
0.        ,  12.05671275],
[ 24.06270058,  24.37278189,  28.85847513, ...,  16.17721438,
12.05671275,   0.        ]])
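
For readers unfamiliar with `pdist` and `squareform`, here is a tiny example with three hand-picked points whose distances are easy to verify (a 3-4-5 triangle). The points are invented for illustration.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Three points in 2-D with easy-to-check distances (a 3-4-5 triangle).
points = np.array([[0.0, 0.0],
                   [3.0, 0.0],
                   [3.0, 4.0]])

# pdist returns the condensed form: one value per unordered pair,
# in the order (0,1), (0,2), (1,2).
condensed = pdist(points, metric="euclidean")
print(condensed)  # [3. 5. 4.]

# squareform expands it to a symmetric matrix with zeros on the diagonal.
square = squareform(condensed)
print(square)
```

Row `i`, column `j` of the square matrix is the distance between observation `i` and observation `j`, which is why the row positions must stay aligned with the player rows that produced them.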
In [247]:
#transform that pairwise matrix into a dataframe and add an index field for joining
distances = pd.DataFrame(distances_matrix)
distances['id_a'] = range(0, len(distances))

Out[247]:
0 1 2 3 4 5 6 7 8 9 ... 2343 2344 2345 2346 2347 2348 2349 2350 2351 id_a
0 0.000000 20.887329 35.132187 26.087636 26.702912 27.043791 25.089999 29.283917 20.236402 16.541398 ... 23.732446 30.058939 27.202155 21.132404 25.292953 30.492015 25.948669 24.313482 24.062701 0
1 20.887329 0.000000 25.826373 16.572213 22.913603 19.197107 19.541230 27.641119 21.757821 18.055367 ... 17.035436 25.575648 21.268822 23.118253 29.960605 26.592151 26.999528 23.936888 24.372782 1
2 35.132187 25.826373 0.000000 18.614549 20.941981 13.929478 29.310328 33.334112 29.194351 33.430191 ... 23.729913 25.141384 23.215902 40.104858 24.370200 35.457359 24.789839 27.131426 28.858475 2
3 26.087636 16.572213 18.614549 0.000000 19.288614 12.727091 19.429013 21.493559 22.047909 21.372450 ... 16.091108 16.148922 24.852787 29.462924 26.806176 27.935244 26.345192 24.074822 26.579903 3
4 26.702912 22.913603 20.941981 19.288614 0.000000 15.406146 21.812691 21.842560 15.930106 20.433810 ... 20.206717 24.155608 16.821444 28.147994 21.506791 27.476980 16.496903 20.102148 23.606290 4

5 rows × 2353 columns

In [248]:
#pivot this data so that we can have one row per player comparison
distances_final = pd.melt(distances,id_vars='id_a')

#rename columns
cols = ['player_a','player_b','eucl_dist']
distances_final.columns = cols

Out[248]:
id_a variable value
0 0 0 0.000000
1 1 0 20.887329
2 2 0 35.132187
3 3 0 26.087636
4 4 0 26.702912
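
The reshaping done by `pd.melt` is easier to see on a small matrix. The sketch below uses an invented 3 x 3 distance matrix; the final filter that drops self-pairs is an optional extra step, not something the notebook does.

```python
import pandas as pd

# A tiny symmetric distance matrix for three observations (values invented).
square = pd.DataFrame([[0.0, 1.5, 2.0],
                       [1.5, 0.0, 0.7],
                       [2.0, 0.7, 0.0]])
square["id_a"] = range(len(square))

# melt keeps id_a and turns every other column into (variable, value) rows,
# so an n x n matrix becomes n * n rows, one per ordered pair.
long = pd.melt(square, id_vars="id_a")
long.columns = ["player_a", "player_b", "eucl_dist"]

# Each pair appears twice (a,b and b,a) alongside the zero self-distances;
# dropping the self-pairs is optional but keeps them out of sorted results.
pairs = long[long["player_a"] != long["player_b"]]
print(pairs.sort_values("eucl_dist").head())
```

With the 2,352 players in this dataset (2,080 NCAA plus 272 drafted), the melted frame has 2,352 x 2,352 = 5,531,904 rows.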
In [251]:
#merge in the players' names
final1 = pd.merge(distances_final, final, how='inner',left_on='player_a', right_index=True)

final2 = pd.merge(final1,final,how='inner',left_on='player_b',right_index=True)

final2 = final2[['eucl_dist','Player_y','Player_x']]

In [255]:
#create lookup tables
#.copy() makes each lookup an independent frame, so adding a column does not trigger SettingWithCopyWarning
ncaa_lookup = test[['Player']].copy()
ncaa_lookup['player_type'] = 'NCAA'

nba_lookup = draft_stats_test[['Player']].copy()
nba_lookup['player_type'] = 'NBA'

In [256]:
#join over whether or not the players are NBA comparisons or NCAA players
final3 = pd.merge(final2,ncaa_lookup,how='left',left_on='Player_y',right_on='Player')
final3 = final3.drop('Player',axis=1)
final3.rename(columns={'player_type':'player_y_type'},inplace=True)

final4 = pd.merge(final3,ncaa_lookup,how='left',left_on='Player_x',right_on='Player')
final4.drop('Player',axis=1,inplace=True)
final4.rename(columns={'player_type':'player_x_type'},inplace=True)

final4.fillna('NBA',inplace=True)
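
Joining the player type back on by name can multiply rows whenever two players share a name, which is one possible cause of the repeated pairs that appear in the exported table. A hedged alternative, sketched below on invented rows, is to label each source before the two frames are concatenated so that no name-based lookup is needed afterwards. This is a different design from the cells above, not a description of them.

```python
import pandas as pd

# Invented rows; two different college players share the name "J. Smith".
ncaa = pd.DataFrame({"Player": ["J. Smith", "J. Smith", "K. Lee"],
                     "PTS": [10.0, 4.0, 7.0]})
nba = pd.DataFrame({"Player": ["M. Jones"], "PTS": [12.0]})

# Label each source before stacking so no name-based lookup is needed later.
combined = pd.concat([ncaa.assign(player_type="NCAA"),
                      nba.assign(player_type="NBA")],
                     ignore_index=True)

# The row position now identifies a player uniquely, even with repeated names,
# and player_type can be read straight from the same row.
labels = combined[["Player", "player_type"]]
print(labels)
```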

In [259]:
#create the final dataframe with only NCAA players compared to NBA players
final_final = final4[(final4.player_y_type =='NCAA') & (final4.player_x_type=='NBA')]

#df[(df.A == 1) & (df.D == 6)]

In [265]:
#examine the comparisons for Lonzo Ball
final_final[final_final['Player_y'] == 'Lonzo Ball'].sort_values(by='eucl_dist',ascending = True)

Out[265]:
eucl_dist Player_y Player_x player_y_type player_x_type
18847 13.716531 Lonzo Ball Shane Larkin NCAA NBA
18767 15.257080 Lonzo Ball Reggie Jackson NCAA NBA
18843 15.593754 Lonzo Ball Jerian Grant NCAA NBA
18929 16.083855 Lonzo Ball Lamar Patterson NCAA NBA
18870 16.613314 Lonzo Ball Patrick McCaw NCAA NBA
18886 17.019915 Lonzo Ball Shabazz Napier NCAA NBA
18865 17.092628 Lonzo Ball Tyus Jones NCAA NBA
18962 17.203179 Lonzo Ball Lorenzo Brown NCAA NBA
18853 17.436172 Lonzo Ball Reggie Bullock NCAA NBA
18757 18.096593 Lonzo Ball Isaiah Thomas NCAA NBA
18869 18.109116 Lonzo Ball Delon Wright NCAA NBA
18882 18.261786 Lonzo Ball D'Angelo Russell NCAA NBA
18751 18.609615 Lonzo Ball Trey Burke NCAA NBA
18836 18.690465 Lonzo Ball Ben McLemore NCAA NBA
18903 18.895548 Lonzo Ball Denzel Valentine NCAA NBA
18901 18.964175 Lonzo Ball Tyler Ulis NCAA NBA
18973 19.131855 Lonzo Ball Wade Baldwin NCAA NBA
18958 19.170505 Lonzo Ball Michael Gbinije NCAA NBA
18913 19.406235 Lonzo Ball Pat Connaughton NCAA NBA
18984 19.436834 Lonzo Ball Marcus Denmon NCAA NBA
18931 19.696883 Lonzo Ball Kris Dunn NCAA NBA
19006 19.841475 Lonzo Ball Isaiah Cousins NCAA NBA
18800 19.850398 Lonzo Ball Allen Crabbe NCAA NBA
18808 19.928488 Lonzo Ball Solomon Hill NCAA NBA
18878 19.954606 Lonzo Ball Jamal Murray NCAA NBA
18861 19.981715 Lonzo Ball Ray McCallum NCAA NBA
18876 19.984895 Lonzo Ball Orlando Johnson NCAA NBA
18832 19.996917 Lonzo Ball Josh Richardson NCAA NBA
18835 20.028378 Lonzo Ball Malcolm Brogdon NCAA NBA
18895 20.102194 Lonzo Ball Cameron Payne NCAA NBA
... ... ... ... ... ...
19000 31.974613 Lonzo Ball Cady Lalanne NCAA NBA
18845 32.226368 Lonzo Ball Jordan Hamilton NCAA NBA
18996 32.377422 Lonzo Ball Dakari Johnson NCAA NBA
18902 32.601743 Lonzo Ball Jarnell Stokes NCAA NBA
18946 33.082097 Lonzo Ball Damian Jones NCAA NBA
18814 33.309737 Lonzo Ball Shabazz Muhammad NCAA NBA
18920 33.330938 Lonzo Ball Tony Mitchell NCAA NBA
18829 33.377107 Lonzo Ball Julius Randle NCAA NBA
18759 33.571917 Lonzo Ball Tristan Thompson NCAA NBA
18855 33.813413 Lonzo Ball Jimmer Fredette NCAA NBA
18924 34.171508 Lonzo Ball Joel Bolomboy NCAA NBA
18823 34.273858 Lonzo Ball Festus Ezeli NCAA NBA
18955 34.390509 Lonzo Ball Cameron Bairstow NCAA NBA
18788 34.604572 Lonzo Ball John Henson NCAA NBA
18991 34.906913 Lonzo Ball Alec Brown NCAA NBA
18909 35.015887 Lonzo Ball Rakeem Christmas NCAA NBA
18939 35.619096 Lonzo Ball Fab Melo NCAA NBA
18758 36.065498 Lonzo Ball Andre Drummond NCAA NBA
18873 36.104126 Lonzo Ball Pascal Siakam NCAA NBA
18819 36.443424 Lonzo Ball Jeff Withey NCAA NBA
18793 36.852015 Lonzo Ball Myles Turner NCAA NBA
18818 37.646551 Lonzo Ball Mike Muscala NCAA NBA
18833 37.727792 Lonzo Ball Thomas Robinson NCAA NBA
18737 38.120727 Lonzo Ball Anthony Davis NCAA NBA
18922 38.253961 Lonzo Ball Jordan Mickey NCAA NBA
18937 39.284590 Lonzo Ball Keith Benson NCAA NBA
18740 39.583770 Lonzo Ball Kenneth Faried NCAA NBA
18817 39.696040 Lonzo Ball T.J. Warren NCAA NBA
18884 40.637818 Lonzo Ball Skal Labissiere NCAA NBA
18777 41.570897 Lonzo Ball James Johnson NCAA NBA

271 rows × 5 columns
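
The ranked list above comes from sorting the full pairwise table. For comparison, scikit-learn's `NearestNeighbors` can return the k closest NBA rows for each NCAA row directly. The sketch below is an alternative technique on made-up 2-D component scores, with hypothetical labels, and is not the method used in this notebook.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Made-up 2-D component scores: 4 "NBA" reference rows and 2 "NCAA" query rows.
nba_names = np.array(["nba_0", "nba_1", "nba_2", "nba_3"])
nba_pcs = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [5.0, 5.0]])
ncaa_pcs = np.array([[0.9, 0.1], [4.0, 4.0]])

# Fit on the NBA rows only, then ask for the 2 closest NBA rows per NCAA row.
nn = NearestNeighbors(n_neighbors=2, metric="euclidean").fit(nba_pcs)
distances, indices = nn.kneighbors(ncaa_pcs)

# Results come back sorted by ascending distance for each query row.
for row, (d, idx) in enumerate(zip(distances, indices)):
    print(row, list(zip(nba_names[idx], d.round(3))))
```

Because only NCAA-to-NBA distances are computed, this avoids materializing the NCAA-to-NCAA comparisons that are filtered out later anyway.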

In [263]:
#write this final_final table out to CSV
final_final.to_csv("player_comparison_2017_pca.csv")

Out[263]:
eucl_dist Player_y Player_x player_y_type player_x_type
553466 4.897065 Steve Vasturia Caris LeVert NCAA NBA
1066727 5.062801 Jared Terrell Malcolm Lee NCAA NBA
230272 5.379588 Rawle Alkins Solomon Hill NCAA NBA
589021 5.415472 Dylan Ennis Cory Joseph NCAA NBA
1361372 5.521101 Rodney Purvis Malachi Richardson NCAA NBA
2157231 5.577684 Christian Terrell Zach LaVine NCAA NBA
503535 5.852017 Tracy Abrams Zach LaVine NCAA NBA
762657 5.911154 Khadeen Carrington Darius Johnson-Odom NCAA NBA
116274 5.911482 Sterling Brown Justise Winslow NCAA NBA
413247 6.009668 Sviatoslav Mykhailiuk Zach LaVine NCAA NBA
676827 6.016287 Justin Robinson Isaiah Canaan NCAA NBA
676826 6.016287 Justin Robinson Isaiah Canaan NCAA NBA
966792 6.018570 Jordan Murphy Marcus Thornton NCAA NBA
967018 6.018570 Jordan Murphy Marcus Thornton NCAA NBA
762514 6.030128 Khadeen Carrington Austin Rivers NCAA NBA
2285549 6.036571 Darrell Davis Jordan Hamilton NCAA NBA
696001 6.080880 Jack Gibbs Isaiah Canaan NCAA NBA
1197419 6.142668 Xavier Johnson Malik Beasley NCAA NBA
1069023 6.158689 Christian Vital Zach LaVine NCAA NBA
1674917 6.229392 Cullen VanLeer Jordan Hamilton NCAA NBA
2468661 6.258655 Alex Murphy Abdel Nader NCAA NBA
2656239 6.387709 Rob Edwards Jordan Williams NCAA NBA
1553807 6.409944 D.J. Fenner Malcolm Lee NCAA NBA
120981 6.443540 Edrice Adebayo Quincy Acy NCAA NBA
2000483 6.444821 Brynton Lemar James Young NCAA NBA
2000410 6.457194 Brynton Lemar Austin Rivers NCAA NBA
2439989 6.606846 Alexander Aka Gorski Jordan Hamilton NCAA NBA
33068 6.608452 Donovan Mitchell Gary Harris NCAA NBA
914535 6.698608 Kavin Gilder-Tilbury Terrence Ross NCAA NBA
280171 6.719336 London Perrantes Tony Snell NCAA NBA
... ... ... ... ... ...
4723217 55.659372 Sebastian Townes Anthony Davis NCAA NBA
4834889 55.806828 Darius Moore Anthony Davis NCAA NBA
4808753 55.818913 Eliel Gonzalez Anthony Davis NCAA NBA
4792121 55.901396 Mike Green Anthony Davis NCAA NBA
4017545 55.908765 Christian Ellis Anthony Davis NCAA NBA
4616297 56.092542 Isaiah Walton Anthony Davis NCAA NBA
4956065 56.164070 Darrell Riley Anthony Davis NCAA NBA
4915673 56.267556 Sam Hunt Anthony Davis NCAA NBA
4958441 56.304151 Raheem Watts Anthony Davis NCAA NBA
4899041 56.318796 Sam Burmeister Anthony Davis NCAA NBA
4918049 56.357529 Jermaine Marrow Anthony Davis NCAA NBA
4925177 56.468464 Reggie Dillard Anthony Davis NCAA NBA
4637681 56.498150 Matthew Butler Anthony Davis NCAA NBA
4685201 56.636098 Asante Gist Anthony Davis NCAA NBA
4908545 56.642419 Jo'Vontae Millner Anthony Davis NCAA NBA
4682825 56.740089 Josh Boyd Anthony Davis NCAA NBA
4782617 56.883457 Reggie Oliver Anthony Davis NCAA NBA
4877657 56.946361 Charles Tucker Jr. Anthony Davis NCAA NBA
4963193 56.987312 Max Heidegger Anthony Davis NCAA NBA
4825385 57.088632 Delante Jones Anthony Davis NCAA NBA
4991705 57.289504 Amos Given Anthony Davis NCAA NBA
4939433 57.606328 Tyson Batiste Anthony Davis NCAA NBA
4708961 57.632143 Taylor Johnson Anthony Davis NCAA NBA
4998833 57.667914 Marcus Merriweather Anthony Davis NCAA NBA
4970321 57.728321 Rakim Lubin Anthony Davis NCAA NBA
4763609 57.844177 August Haas Anthony Davis NCAA NBA
4982201 58.519947 Rakiya Battle Anthony Davis NCAA NBA
4865777 58.674243 Junior Lomomba Anthony Davis NCAA NBA
4984577 59.255539 Elijah Pughsley Anthony Davis NCAA NBA
4977449 59.585905 Chris Shields Anthony Davis NCAA NBA

570455 rows × 5 columns