NHL: Connor McDavid Predictions

  1. At what age will Connor McDavid’s points-per-game (points in hockey are assists + goals) most likely peak?
  2. What is the best estimate for Connor McDavid’s points-per-game in his peak season?
  3. What is the best estimate for Connor McDavid’s points-per-game in the 2027 season?
  4. Having completed that, what different question would you like to put to this data? (Just describe the angle / question you think might be interesting and potentially fruitful; you should not dive into the particular analysis. Write up your thoughts on an interesting topic you'd pursue with the attached data.)
In [6]:
%reload_ext watermark
%watermark -a Jonathan.Ma -v -m -p numpy,scipy,pandas -g
Jonathan.Ma 

CPython 3.5.2
IPython 5.1.0

numpy 1.11.1
scipy 0.18.0
pandas 0.18.1

compiler   : GCC 4.2.1 Compatible Apple LLVM 4.2 (clang-425.0.28)
system     : Darwin
release    : 15.6.0
machine    : x86_64
processor  : i386
CPU cores  : 2
interpreter: 64bit
Git hash   :

Loading in the libraries for analysis.

In [112]:
# Getting data into right structure
import os
import numpy as np
import pandas as pd
import datetime as date

# Split train/test (sklearn.cross_validation is deprecated in favor of model_selection)
from sklearn.model_selection import train_test_split

# Outlier detection
from sklearn import svm

# Clustering
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler
from sklearn import manifold, decomposition, ensemble, discriminant_analysis, random_projection
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD, PCA

# Visualizing
import seaborn as sns
import matplotlib.pyplot as plt
import bokeh.charts
from bokeh.plotting import figure, ColumnDataSource, show
from bokeh.models import HoverTool
from bokeh.io import output_notebook

# regression line: import statsmodels formula API as alias smf
import statsmodels.formula.api as smf

%matplotlib inline
output_notebook()

import warnings
warnings.filterwarnings('ignore')
Loading BokehJS ...

Loading & Cleaning up the dataset

In [113]:
# read csv file, parse date from date of birth column
DIR = '/Users/JMa/Learn/NHL/'
filename = os.path.join(DIR, 'data/raw/NHL_Forwards_v01.csv')
df = pd.read_csv(filename, parse_dates = ['DateOfBirth'])

# add Connor McDavid's stats from http://oilers.nhl.com/club/player.htm?id=8478402
# (the name is entered as 'CONNER MCDAVID' to match the lookup string used later)
col = df.columns
df_CM = pd.DataFrame([[231200, 'CONNER MCDAVID',2016, 'Edmonton Oilers', 45, 16/45, 32/45, 48/45, pd.to_datetime('1997-01-13', format='%Y-%m-%d')]], columns =col)
df = df.append(df_CM, ignore_index=True)

# get age during season from birthdate; add Age, Goals, Assists, Birth Month (BM) columns
df['DateOfBirth'] = pd.to_datetime(df['DateOfBirth'], format='%y-%m-%d')    # re-parse: source years are two-digit
season = pd.to_datetime(df['SeasonEnding'], format='%Y')
df['DateOfBirth'] = df['DateOfBirth'].where(df['DateOfBirth'] < season, df['DateOfBirth'] - np.timedelta64(100, 'Y'))   # two-digit years can parse into the wrong century; shift those back 100 years
df['Age'] = (season - df['DateOfBirth']).astype('<m8[Y]')   # whole years at season end
df['BM'] = df.DateOfBirth.map(lambda x: x.strftime('%m')).astype('int')

# get total goals and assist numbers
df['Goals'] = df['GoalsPerGame']*df['GamesPlayed']
df['Assists'] = df['AssistsPerGame']*df['GamesPlayed']

# Flag mid-season transfers; count seasons played and seasons played on each team
df['one'] = 1.0
tmp = df.groupby(['PlayerId', 'SeasonEnding']).cumsum()
tmp.loc[tmp.one == 1, 'One'] = 0
tmp.loc[tmp.one == 2, 'One'] = 1
df['Transfer'] = tmp.One
tmp = df.groupby(['PlayerId', 'Transfer']).cumsum()
df['SeasonsPlayed'] = tmp.one
df['Shift'] = df.SeasonsPlayed.shift(1)
df.loc[df.Transfer == 1, 'SeasonsPlayed'] = df.Shift
tmp = df.groupby(['PlayerId', 'Team']).cumsum()
df['SeasonsPlayedOnTeam'] = tmp.one
df['Shift2'] = df.Shift.shift(1)
df.loc[df.Transfer.isnull(), 'SeasonsPlayed'] = df.Shift2
df.loc[df.Transfer.isnull()] = 0.0
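The century fix above is subtle, so here is a minimal, self-contained sketch of what it does (toy dates, not the real dataset). It uses `pd.DateOffset(years=100)`, the calendar-exact equivalent of the `np.timedelta64(100, 'Y')` shift above, which newer pandas versions reject.

```python
import pandas as pd

# strptime's %y maps 00-68 to 20xx and 69-99 to 19xx, so a birth year
# like '97' parses correctly but '65' would become 2065. Any DOB that
# lands *after* the season end must be pushed back one century.
dob = pd.to_datetime(pd.Series(['65-03-01', '97-01-13']), format='%y-%m-%d')
season = pd.to_datetime(pd.Series([2000, 2016]), format='%Y')

# Keep the DOB where it precedes the season end; otherwise shift it
# back 100 years (calendar-exact, unlike an approximate timedelta).
fixed = dob.where(dob < season, dob - pd.DateOffset(years=100))

print(fixed.dt.year.tolist())   # [1965, 1997]
```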
In [114]:
df = df.drop(['DateOfBirth'], axis = 1)
df = df.drop(['one', 'Shift', 'Shift2'], axis=1)
df = df.drop_duplicates()
In [115]:
df.describe()
Out[115]:
PlayerId SeasonEnding GamesPlayed GoalsPerGame AssistsPerGame TotalPointsPerGame Age BM Goals Assists Transfer SeasonsPlayed SeasonsPlayedOnTeam
count 9949.000000 9949.000000 9949.000000 9949.000000 9949.000000 9949.000000 9949.000000 9949.000000 9949.000000 9949.000000 9949.000000 9949.000000 9949.00000
mean 13963.149161 2007.955473 45.095487 0.155599 0.221333 0.402401 26.698462 5.989647 8.835461 12.421650 0.083626 4.324153 2.48045
std 16068.621696 20.740724 29.457167 0.138012 0.189207 1.515894 4.576401 3.447783 10.291136 13.746989 0.276841 3.127816 2.03447
min 0.000000 0.000000 0.000000 -0.379747 -0.437500 -0.784810 0.000000 0.000000 -30.000000 -35.000000 0.000000 0.000000 0.00000
25% 8670.000000 2003.000000 16.000000 0.046875 0.076923 0.150000 23.000000 3.000000 1.000000 2.000000 0.000000 2.000000 1.00000
50% 9098.000000 2008.000000 48.000000 0.133333 0.187500 0.333333 26.000000 6.000000 5.000000 8.000000 0.000000 3.000000 2.00000
75% 11744.000000 2012.000000 73.000000 0.237288 0.333333 0.560606 30.000000 9.000000 14.000000 19.000000 0.000000 6.000000 3.00000
max 231299.000000 2016.000000 700.000000 1.000000 1.444444 100.000000 56.000000 12.000000 360.000000 240.000000 1.000000 16.000000 16.00000
In [120]:
# remove outliers (impossible or extreme values)
df = df[df.TotalPointsPerGame < 60]
df = df[df.GamesPlayed < 100 ]
df = df[df.TotalPointsPerGame > 0]
df = df[df.Age < 45]
df.describe()
Out[120]:
PlayerId SeasonEnding GamesPlayed GoalsPerGame AssistsPerGame TotalPointsPerGame Age BM Goals Assists Transfer SeasonsPlayed SeasonsPlayedOnTeam
count 8716.000000 8716.000000 8716.000000 8716.000000 8716.000000 8716.000000 8716.000000 8716.000000 8716.000000 8716.000000 8716.000000 8716.000000 8716.000000
mean 13363.077100 2008.121156 50.526159 0.177552 0.252525 0.430072 26.939307 6.043483 10.047614 14.149151 0.087884 4.543713 2.604176
std 15607.965446 4.994524 26.142991 0.133032 0.180634 0.273597 4.572999 3.459735 9.671585 13.532639 0.283143 3.166708 2.108650
min 59.000000 2000.000000 1.000000 0.000000 0.000000 0.017857 18.000000 1.000000 0.000000 0.000000 0.000000 1.000000 1.000000
25% 8631.000000 2003.000000 26.000000 0.079365 0.116667 0.214286 23.000000 3.000000 2.000000 3.000000 0.000000 2.000000 1.000000
50% 9034.500000 2008.000000 57.000000 0.154388 0.216216 0.378049 26.000000 6.000000 7.000000 10.000000 0.000000 4.000000 2.000000
75% 11417.000000 2012.000000 75.000000 0.250000 0.351351 0.597561 30.000000 9.000000 15.000000 21.000000 0.000000 6.000000 3.000000
max 231299.000000 2016.000000 82.000000 1.000000 1.444444 2.000000 43.000000 12.000000 65.000000 92.000000 1.000000 16.000000 16.000000

Overview of the stats and what's in our dataset.

  1. The average player appears in 50.5 games per season.
  2. Average age is 26.9.
  3. Most Goals Per Game is 1.0
  4. Most Assists Per Game is 1.44
  5. Most Total Points Per Game is 2.0
In [3]:
df.corr('pearson')
Out[3]:
PlayerId SeasonEnding GamesPlayed GoalsPerGame AssistsPerGame TotalPointsPerGame Age BM Goals Assists Transfer SeasonsPlayed SeasonsPlayedOnTeam
PlayerId 1.000000 0.102468 -0.131784 -0.110374 -0.119445 -0.132530 -0.135613 -0.006231 -0.148092 -0.156726 -0.016505 -0.260140 -0.159115
SeasonEnding 0.102468 1.000000 -0.018052 -0.013186 -0.036618 -0.030550 -0.031313 0.013905 -0.024721 -0.035427 -0.048749 0.484502 0.271939
GamesPlayed -0.131784 -0.018052 1.000000 0.309083 0.293979 0.344363 0.154807 0.045412 0.693670 0.690075 -0.321066 0.204890 0.293407
GoalsPerGame -0.110374 -0.013186 0.309083 1.000000 0.510068 0.823215 0.032011 0.050022 0.783828 0.585526 -0.038984 0.105547 0.216206
AssistsPerGame -0.119445 -0.036618 0.293979 0.510068 1.000000 0.908183 0.108914 0.072973 0.589344 0.796773 -0.028898 0.153079 0.255996
TotalPointsPerGame -0.132530 -0.030550 0.344363 0.823215 0.908183 1.000000 0.087464 0.072366 0.770383 0.810715 -0.037880 0.152397 0.274162
Age -0.135613 -0.031313 0.154807 0.032011 0.108914 0.087464 1.000000 0.102281 0.079268 0.130796 0.072002 0.623031 0.211185
BM -0.006231 0.013905 0.045412 0.050022 0.072973 0.072366 0.102281 1.000000 0.055892 0.075958 0.002541 0.042311 0.051119
Goals -0.148092 -0.024721 0.693670 0.783828 0.589344 0.770383 0.079268 0.055892 1.000000 0.830278 -0.201231 0.160247 0.312239
Assists -0.156726 -0.035427 0.690075 0.585526 0.796773 0.810715 0.130796 0.075958 0.830278 1.000000 -0.197324 0.194604 0.339044
Transfer -0.016505 -0.048749 -0.321066 -0.038984 -0.028898 -0.037880 0.072002 0.002541 -0.201231 -0.197324 1.000000 0.029372 -0.217901
SeasonsPlayed -0.260140 0.484502 0.204890 0.105547 0.153079 0.152397 0.623031 0.042311 0.160247 0.194604 0.029372 1.000000 0.474150
SeasonsPlayedOnTeam -0.159115 0.271939 0.293407 0.216206 0.255996 0.274162 0.211185 0.051119 0.312239 0.339044 -0.217901 0.474150 1.000000
  1. High correlation between Total Points Per Game (TPPG) and both Goals Per Game and Assists Per Game, which is expected since points are goals plus assists.
  2. Assists Per Game correlates more strongly with TPPG than Goals Per Game does (r = 0.91 vs 0.82). An assist is awarded to the player or players (maximum of two) who touch the puck prior to the goal, provided no defender plays or possesses the puck in between, so assists are more plentiful than goals and carry more of the point total.
  3. Near-zero linear correlation between TPPG and Age (r = 0.09), and only moderate correlation with Games Played (r = 0.34); there is most likely a higher-order (e.g. quadratic) relationship with Age. Visualization will confirm.

Note: Pearson's correlation coefficient measures the strength of a *linear* relationship between paired data (and assumes the data are roughly normally distributed). In a sample it is denoted by r and is by design constrained as follows.

* Positive values denote positive linear correlation;
* Negative values denote negative linear correlation;
* A value of 0 denotes no linear correlation;
* The closer the value is to 1 or –1, the stronger the linear correlation.
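As a quick illustration of the Pearson/Spearman distinction (toy data, not the NHL set): for a monotonic but nonlinear relationship, the rank-based Spearman coefficient is exactly 1 while Pearson's r is not.

```python
import numpy as np
from scipy import stats

# Monotonic but nonlinear relationship: y grows as the cube of x.
x = np.arange(1, 11, dtype=float)
y = x ** 3

# Pearson measures linear association; Spearman measures monotone
# association via ranks, so it is exactly 1 here.
r_pearson, _ = stats.pearsonr(x, y)
r_spearman, _ = stats.spearmanr(x, y)

print(round(r_spearman, 3))   # 1.0
```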
In [4]:
df.corr('spearman')
Out[4]:
PlayerId SeasonEnding GamesPlayed GoalsPerGame AssistsPerGame TotalPointsPerGame Age BM Goals Assists Transfer SeasonsPlayed SeasonsPlayedOnTeam
PlayerId 1.000000 0.200075 -0.218979 -0.209724 -0.248501 -0.250376 -0.190519 -0.015525 -0.261536 -0.289302 -0.011098 -0.304398 -0.156712
SeasonEnding 0.200075 1.000000 -0.028445 -0.008828 -0.033236 -0.024522 -0.037405 0.015569 -0.012859 -0.027622 -0.048165 0.464273 0.238399
GamesPlayed -0.218979 -0.028445 1.000000 0.391209 0.364817 0.389906 0.155094 0.049418 0.793196 0.797204 -0.303058 0.234870 0.346879
GoalsPerGame -0.209724 -0.008828 0.391209 1.000000 0.581190 0.841620 0.048505 0.063359 0.820964 0.633356 -0.045802 0.151689 0.221349
AssistsPerGame -0.248501 -0.033236 0.364817 0.581190 1.000000 0.912624 0.107525 0.074559 0.618109 0.800683 -0.033591 0.179617 0.230848
TotalPointsPerGame -0.250376 -0.024522 0.389906 0.841620 0.912624 1.000000 0.083638 0.078367 0.759258 0.787122 -0.036317 0.174954 0.239908
Age -0.190519 -0.037405 0.155094 0.048505 0.107525 0.083638 1.000000 0.099997 0.120616 0.162187 0.079617 0.612176 0.132645
BM -0.015525 0.015569 0.049418 0.063359 0.074559 0.078367 0.099997 1.000000 0.069843 0.074427 0.002042 0.034023 0.042093
Goals -0.261536 -0.012859 0.793196 0.820964 0.618109 0.759258 0.120616 0.069843 1.000000 0.883972 -0.214954 0.240388 0.348780
Assists -0.289302 -0.027622 0.797204 0.633356 0.800683 0.787122 0.162187 0.074427 0.883972 1.000000 -0.213143 0.261388 0.361774
Transfer -0.011098 -0.048165 -0.303058 -0.045802 -0.033591 -0.036317 0.079617 0.002042 -0.214954 -0.213143 1.000000 0.040194 -0.310796
SeasonsPlayed -0.304398 0.464273 0.234870 0.151689 0.179617 0.174954 0.612176 0.034023 0.240388 0.261388 0.040194 1.000000 0.452228
SeasonsPlayedOnTeam -0.156712 0.238399 0.346879 0.221349 0.230848 0.239908 0.132645 0.042093 0.348780 0.361774 -0.310796 0.452228 1.000000

There is little difference between the Pearson and Spearman coefficients, which suggests the correlations are not being driven by outliers or strongly nonlinear monotone effects.

Visualizing the Data

In [5]:
sns.pairplot(df, x_vars=['SeasonEnding','GamesPlayed','Age','Goals', 'Assists', 'SeasonsPlayed', 'SeasonsPlayedOnTeam'], y_vars=['TotalPointsPerGame'], size=7, aspect=0.7, kind='reg')
Out[5]:
<seaborn.axisgrid.PairGrid at 0x105e30860>

Visualizing the data to see whether there are any clear relationships. As expected, goals and assists relate linearly to total points per game, since points in hockey are a combination of the two. It looks like there is a quadratic relationship between age and TPPG, but another visualization should confirm this. The 2004-05 season is missing because of the NHL lockout that year.

We can confirm correlation with TPPG by fitting a regression line through the data. Goals and assists are strongly positively correlated with TPPG, as seen in the steep slope of the regression line; age and games played are only slightly correlated.
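The statsmodels formula API imported earlier is one way to test the suspected quadratic age curve directly. A sketch on synthetic data (the peak age of 27 and the noise level are invented, purely for illustration): fit TPPG on Age and Age², then read the peak age off the vertex of the fitted parabola.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)

# Synthetic stand-in: TPPG peaks mid-career at age 27 by construction.
age = rng.uniform(18, 40, 500)
tppg = 0.6 - 0.004 * (age - 27) ** 2 + rng.normal(0, 0.05, 500)
toy = pd.DataFrame({'Age': age, 'TPPG': tppg})

# Quadratic fit via the formula API; I(...) protects the power term.
fit = smf.ols('TPPG ~ Age + I(Age ** 2)', data=toy).fit()

# Vertex of the fitted parabola = estimated peak age.
peak_age = -fit.params['Age'] / (2 * fit.params['I(Age ** 2)'])
print(round(peak_age, 1))   # close to 27 by construction
```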

In [6]:
g= sns.jointplot(data=df, x= 'Age', y='TotalPointsPerGame', kind="kde")
g.ax_joint.collections[0].set_alpha(0)
g.set_axis_labels("Age", "Total Points Per Game");

We can see that younger players dominate the sample, mainly between ages 21 and 27. The typical player peaks around 23-24 years old, but at under 0.5 TPPG.

In [7]:
g = sns.jointplot(data=df, x= 'GamesPlayed', y='TotalPointsPerGame', kind="kde")
g.ax_joint.collections[0].set_alpha(0)
g.set_axis_labels("Games Played", "Total Points Per Game");

It looks like there are two sets of players: those who play fewer than 20 games and those who play over 60.

Split into two groups at a 40-games-played threshold

In [11]:
df_20 = df[df.GamesPlayed < 40]
df_40 = df[df.GamesPlayed >= 40]
sns.pairplot(df_20, x_vars=['SeasonEnding','GamesPlayed','Age','Goals', 'Assists', 'SeasonsPlayed', 'SeasonsPlayedOnTeam'], y_vars=['TotalPointsPerGame'], size=7, aspect=0.7, kind='reg')
Out[11]:
<seaborn.axisgrid.PairGrid at 0x10cd47d30>
In [7]:
sns.pairplot(df_40, x_vars=['SeasonEnding','GamesPlayed','Age','Goals', 'Assists', 'SeasonsPlayed', 'SeasonsPlayedOnTeam'], y_vars=['TotalPointsPerGame'], size=7, aspect=0.7, kind='reg')
Out[7]:
<seaborn.axisgrid.PairGrid at 0x10afc77f0>

Analysis

Since there wasn't a clear linear relationship between age and TPPG, another type of analysis is needed. I chose to follow something similar to FiveThirtyEight's CARMELO system: use the historical data to group players together with t-Distributed Stochastic Neighbor Embedding (t-SNE) and project career paths. The advantage of t-SNE is that it is unsupervised, which suits a problem where Age and TPPG showed no simple correlation. The hope is that the natural groupings of NHL players give our best estimate of the age at which Connor McDavid (or any other player) will peak and what his career will look like.

Process:

  1. Compute the correlation matrix of the stats for all players.
  2. Use PCA or a truncated SVD to reduce the data to fewer dimensions, so that this approach still works when more stats and features are added in the future.
  3. Apply the t-SNE dimensionality-reduction algorithm. Roughly, t-SNE is useful because it conserves the overall topology of the data, so neighboring players are mapped to neighboring locations in a two-dimensional space.
  4. Visualize and grab the players closest to Connor McDavid. Convert their distances to him into a similarity score.
  5. Use polynomial regression on those neighbors to estimate Connor McDavid's future stats.

Note: other clustering techniques such as k-means or MDS could be used, but the groupings are easiest to see visually with t-SNE. The model will improve as more stats and more of Connor McDavid's data accumulate.
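For reference, the k-means alternative mentioned above could operate directly on the 2-D embedding coordinates rather than reading groups off the plot. A minimal sketch on synthetic blobs (the two cluster centres are invented):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.RandomState(0)

# Two well-separated synthetic blobs standing in for embedding coords.
emb = np.vstack([
    rng.normal(loc=(0, 0), scale=0.1, size=(50, 2)),
    rng.normal(loc=(5, 5), scale=0.1, size=(50, 2)),
])

# Cluster the 2-D coordinates; each blob should get one label.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(emb)
labels = km.labels_

print(len(set(labels[:50])), len(set(labels[50:])))   # 1 1
```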

In [63]:
# Set index 
df_final = df_40
y = df_final['TotalPointsPerGame']
df_final = df_final.set_index(['PlayerId', 'Team', 'Name', 'Age'])
df_final = df_final.drop(['GoalsPerGame', 'AssistsPerGame'], axis=1)
df_final.head()
Out[63]:
SeasonEnding GamesPlayed TotalPointsPerGame BM Goals Assists Transfer SeasonsPlayed SeasonsPlayedOnTeam
PlayerId Team Name Age
75 Tampa Bay Lightning NILS EKMAN 24.0 2001 43 0.465116 3 9.0 11.0 0.0 2.0 2.0
San Jose Sharks NILS EKMAN 27.0 2004 82 0.670732 3 22.0 33.0 0.0 3.0 1.0
29.0 2006 77 0.740260 3 21.0 36.0 0.0 4.0 2.0
79 Atlanta Thrashers JOHAN GARPENLOV 31.0 2000 73 0.219178 3 2.0 14.0 0.0 1.0 1.0
89 Phoenix Coyotes MATHIAS TJARNQVIST 28.0 2008 78 0.141026 4 4.0 7.0 0.0 4.0 2.0

Truncated SVD vs PCA

In general, truncated SVD is used for sparse data and PCA for dense data. With so few features it wasn't obvious which would work better, so I tried both.
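One way to see the relationship between the two (a sketch on random dense data): PCA centers the data internally while TruncatedSVD does not, which is exactly why TruncatedSVD suits sparse matrices, where centering would destroy sparsity. On pre-centered data the two agree up to per-column signs.

```python
import numpy as np
from sklearn.decomposition import PCA, TruncatedSVD

rng = np.random.RandomState(0)
X = rng.rand(100, 6)

# Center manually; PCA would do this internally, TruncatedSVD won't.
Xc = X - X.mean(axis=0)

pca = PCA(n_components=3).fit_transform(Xc)
svd = TruncatedSVD(n_components=3, algorithm='arpack',
                   random_state=0).fit_transform(Xc)

# Principal components are only defined up to sign, so compare
# each column up to a sign flip.
same = all(
    np.allclose(pca[:, j], svd[:, j]) or np.allclose(pca[:, j], -svd[:, j])
    for j in range(3)
)
print(same)   # True
```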

In [64]:
# Total Points Per Game
y = y.as_matrix()

# compute correlation matrix between players
df_final = df_final.apply(pd.to_numeric, errors='coerce')   # convert_objects() is deprecated
df_final = df_final.astype(float)
corr_x = df_final.T.corr()
X = corr_x.as_matrix()
In [65]:
# perform SVD | t-SNE dimensionality reduction
file = DIR + '/data/processed/SVD3_tsne_clusters.csv'
if os.path.isfile(file):
    # If the file is already present, don't run again
    print('%s already present - Skipping the dimensionality reduction.' % (file))
else:
    X_reduced = TruncatedSVD(n_components=3, random_state=0).fit_transform(X)
    tsne = manifold.TSNE(n_components=2, perplexity=40, verbose=2)
    X_tsne = tsne.fit_transform(X_reduced)

    df_SVD = pd.DataFrame(X_tsne)

    # Match up players via index, save to file
    df_SVD['Name'] = [x[2] for x in df_final.index]
    df_SVD['Team'] = [x[1] for x in df_final.index]
    df_SVD['Age'] = [x[3] for x in df_final.index]

    df_SVD.columns = ['x', 'y', 'Name', 'Team', 'Age']

    # Save data for future analysis. 
    df_SVD.to_csv(file, index=False)

    # confirming it worked for all the data. 
    print('Huzzaaaaa!')
/Users/JMa/Learn/NHL//data/processed/SVD3_tsne_clusters.csv already present - Skipping the dimensionality reduction.
[t-SNE] Computing pairwise distances...
[t-SNE] Computing 121 nearest neighbors...
[t-SNE] Computed conditional probabilities for sample 1000 / 8745
[t-SNE] Computed conditional probabilities for sample 2000 / 8745
[t-SNE] Computed conditional probabilities for sample 3000 / 8745
[t-SNE] Computed conditional probabilities for sample 4000 / 8745
[t-SNE] Computed conditional probabilities for sample 5000 / 8745
[t-SNE] Computed conditional probabilities for sample 6000 / 8745
[t-SNE] Computed conditional probabilities for sample 7000 / 8745
[t-SNE] Computed conditional probabilities for sample 8000 / 8745
[t-SNE] Computed conditional probabilities for sample 8745 / 8745
[t-SNE] Mean sigma: 0.000154
[t-SNE] Iteration 25: error = 1.4120536, gradient norm = 0.0024560
[t-SNE] Iteration 50: error = 1.4029273, gradient norm = 0.0094328
[t-SNE] Iteration 75: error = 1.2945052, gradient norm = 0.0027703
[t-SNE] Iteration 100: error = 1.2638632, gradient norm = 0.0024147
[t-SNE] Error after 100 iterations with early exaggeration: 1.263863
[t-SNE] Iteration 125: error = 1.2029926, gradient norm = 0.0018307
[t-SNE] Iteration 150: error = 1.1812812, gradient norm = 0.0016973
[t-SNE] Iteration 175: error = 1.1755172, gradient norm = 0.0016734
[t-SNE] Iteration 200: error = 1.1739495, gradient norm = 0.0016676
[t-SNE] Iteration 225: error = 1.1735171, gradient norm = 0.0016657
[t-SNE] Iteration 250: error = 1.1734033, gradient norm = 0.0016649
[t-SNE] Iteration 275: error = 1.1733677, gradient norm = 0.0016648
[t-SNE] Iteration 300: error = 1.1733609, gradient norm = 0.0016648
[t-SNE] Iteration 325: error = 1.1733584, gradient norm = 0.0016647
[t-SNE] Iteration 350: error = 1.1733500, gradient norm = 0.0016649
[t-SNE] Iteration 375: error = 1.1733578, gradient norm = 0.0016647
[t-SNE] Iteration 375: error difference 0.000000. Finished.
[t-SNE] Error after 375 iterations: 1.173358
In [13]:
# perform PCA | t-SNE dimensionality reduction
file = DIR + '/data/processed/PCA3_tsne_clusters.csv'
if os.path.isfile(file):
    # If the file is already present, don't run again
    print('%s already present - Skipping the dimensionality reduction.' % (file))
else:
    X_reduced = PCA(n_components=3).fit_transform(X)
    tsne = manifold.TSNE(n_components=2, perplexity=40, verbose=2)
    X_tsne = tsne.fit_transform(X_reduced)

    df_PCA = pd.DataFrame(X_tsne)
    # Match up players via index, save to file
    df_PCA['Name'] = [x[2] for x in df_final.index]
    df_PCA['Team'] = [x[1] for x in df_final.index]
    df_PCA['Age'] = [x[3] for x in df_final.index]
    df_PCA.columns = ['x', 'y', 'Name', 'Team', 'Age']
    df_PCA.to_csv(file, index=False)

    df_PCA.shape
[t-SNE] Computing pairwise distances...
[t-SNE] Computing 121 nearest neighbors...
[t-SNE] Computed conditional probabilities for sample 1000 / 5669
[t-SNE] Computed conditional probabilities for sample 2000 / 5669
[t-SNE] Computed conditional probabilities for sample 3000 / 5669
[t-SNE] Computed conditional probabilities for sample 4000 / 5669
[t-SNE] Computed conditional probabilities for sample 5000 / 5669
[t-SNE] Computed conditional probabilities for sample 5669 / 5669
[t-SNE] Mean sigma: 0.000350
[t-SNE] Iteration 25: error = 1.7095543, gradient norm = 0.0114829
[t-SNE] Iteration 50: error = 1.6565709, gradient norm = 0.0066356
[t-SNE] Iteration 75: error = 1.4221132, gradient norm = 0.0026807
[t-SNE] Iteration 100: error = 1.3566098, gradient norm = 0.0022401
[t-SNE] Error after 100 iterations with early exaggeration: 1.356610
[t-SNE] Iteration 125: error = 1.2422367, gradient norm = 0.0016965
[t-SNE] Iteration 150: error = 1.2052388, gradient norm = 0.0015366
[t-SNE] Iteration 175: error = 1.1960778, gradient norm = 0.0014941
[t-SNE] Iteration 200: error = 1.1936114, gradient norm = 0.0014835
[t-SNE] Iteration 225: error = 1.1929373, gradient norm = 0.0014806
[t-SNE] Iteration 250: error = 1.1927449, gradient norm = 0.0014798
[t-SNE] Iteration 275: error = 1.1926939, gradient norm = 0.0014796
[t-SNE] Iteration 300: error = 1.1926819, gradient norm = 0.0014795
[t-SNE] Iteration 325: error = 1.1926782, gradient norm = 0.0014795
[t-SNE] Iteration 350: error = 1.1926770, gradient norm = 0.0014795
[t-SNE] Iteration 375: error = 1.1926767, gradient norm = 0.0014795
[t-SNE] Iteration 375: error difference 0.000000. Finished.
[t-SNE] Error after 375 iterations: 1.192677
[t-SNE] Computing pairwise distances...
[t-SNE] Computing 121 nearest neighbors...
[t-SNE] Computed conditional probabilities for sample 1000 / 8745
[t-SNE] Computed conditional probabilities for sample 2000 / 8745
[t-SNE] Computed conditional probabilities for sample 3000 / 8745
[t-SNE] Computed conditional probabilities for sample 4000 / 8745
[t-SNE] Computed conditional probabilities for sample 5000 / 8745
[t-SNE] Computed conditional probabilities for sample 6000 / 8745
[t-SNE] Computed conditional probabilities for sample 7000 / 8745
[t-SNE] Computed conditional probabilities for sample 8000 / 8745
[t-SNE] Computed conditional probabilities for sample 8745 / 8745
[t-SNE] Mean sigma: 0.000284
[t-SNE] Iteration 25: error = 1.4046679, gradient norm = 0.0024244
[t-SNE] Iteration 50: error = 1.3958834, gradient norm = 0.0094297
[t-SNE] Iteration 75: error = 1.2817439, gradient norm = 0.0028523
[t-SNE] Iteration 100: error = 1.2486776, gradient norm = 0.0024751
[t-SNE] Error after 100 iterations with early exaggeration: 1.248678
[t-SNE] Iteration 125: error = 1.1841278, gradient norm = 0.0018861
[t-SNE] Iteration 150: error = 1.1613634, gradient norm = 0.0017461
[t-SNE] Iteration 175: error = 1.1554110, gradient norm = 0.0017062
[t-SNE] Iteration 200: error = 1.1537817, gradient norm = 0.0016978
[t-SNE] Iteration 225: error = 1.1533287, gradient norm = 0.0016955
[t-SNE] Iteration 250: error = 1.1531978, gradient norm = 0.0016945
[t-SNE] Iteration 275: error = 1.1531537, gradient norm = 0.0016949
[t-SNE] Iteration 300: error = 1.1531700, gradient norm = 0.0016940
[t-SNE] Iteration 325: error = 1.1531450, gradient norm = 0.0016946
[t-SNE] Iteration 350: error = 1.1531432, gradient norm = 0.0016945
[t-SNE] Iteration 375: error = 1.1531596, gradient norm = 0.0016943
[t-SNE] Iteration 400: error = 1.1531614, gradient norm = 0.0016942
[t-SNE] Iteration 400: did not make any progress during the last 30 episodes. Finished.
[t-SNE] Error after 400 iterations: 1.153161

PCA performed slightly better (final t-SNE error 1.153 vs 1.173), so that embedding is used for the rest of the analysis.

In [69]:
DIR = '/Users/JMa/Learn/NHL/'
filename = os.path.join(DIR, 'data/processed/SVD3_tsne_clusters.csv')
df_SVD = pd.read_csv(filename)
In [80]:
df2 = df_SVD
Player = 'CONNER MCDAVID'

comb = pd.merge(df, df2, how='inner', on = ['Name', 'Team', 'Age'])
CM = comb[comb.Name == Player]

# Similarity score based on how close each player is to him (distance formula)
comb['SScore'] = 100 - np.sqrt((CM.x.mean() - comb.x)**2 + (CM.y.mean() - comb.y)**2)
comb['Norm_SScore'] = (comb.SScore - comb.SScore.min()) / (comb.SScore.max() - comb.SScore.min())
comb['Standard_SScore'] = (comb.SScore - comb.SScore.mean()) / comb.SScore.std()

CM = comb[comb.Name == Player]
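The scoring above in miniature (invented coordinates): similarity is 100 minus the Euclidean distance from the reference point in the 2-D t-SNE plane, then min-max rescaled to [0, 1].

```python
import numpy as np
import pandas as pd

# Three toy embedding points; the first stands in for the McDavid point.
pts = pd.DataFrame({'x': [0.0, 3.0, 0.0], 'y': [0.0, 4.0, 1.0]})
ref = pts.iloc[0]

# Similarity falls off with Euclidean distance from the reference.
dist = np.sqrt((ref.x - pts.x) ** 2 + (ref.y - pts.y) ** 2)
sscore = 100 - dist

# Min-max rescaling to [0, 1], as in Norm_SScore above.
norm = (sscore - sscore.min()) / (sscore.max() - sscore.min())

print(sscore.tolist())   # [100.0, 95.0, 99.0]
print(norm.tolist())     # [1.0, 0.0, 0.8]
```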
In [86]:
# Save dataframes 
file = DIR + '/data/processed/comb.csv'
file2 = DIR + '/data/processed/CM.csv'

if os.path.isfile(file):
    # If the files are already present, don't save again
    print('%s, %s already present - Skipping the saving.' % (file, file2))
else:
    comb.to_csv(file, index=False)
    CM.to_csv(file2, index=False)
In [74]:
# Visualize    
def bplot(df1, df2):   
    # Interactive Bokeh plot
    _tools = 'box_zoom,pan,save,resize,reset,tap,wheel_zoom'
    fig = figure(tools=_tools, title='t-SNE of Players', responsive=True,
                 x_axis_label='Component 1', y_axis_label='Component 2')

    source = ColumnDataSource(df1)
    source2 = ColumnDataSource(df2)
    hover = HoverTool()
    hover.tooltips=[('Player, Age, Year','@Name, @Age, @SeasonEnding'),]
    # Pass column names together with the ColumnDataSource (passing raw
    # arrays *and* a source at the same time is a bug)
    fig.scatter('x', 'y', source=source, size=8, alpha=0.6,
                line_color='grey', fill_color='grey')

    fig.scatter('x', 'y', source=source2, size=8, alpha=0.6,
                line_color='red', fill_color='red')
    fig.add_tools(hover)

    show(fig)
    

bplot(comb, CM)