The game of basketball is growing in popularity around the world, and China is one of the main drivers of that growth. In fact, the country has two professional leagues, a major and a minor, that together employ about 400 players. The top league is the Chinese Basketball Association (CBA), which currently comprises 18 teams, a number that has grown in recent years to accommodate the surge in demand.
This league is becoming increasingly relevant to the development of global talent, as players from the United States now go abroad to take advantage of its opportunities for playing time and earning power. With more and more talented foreign players joining an improving domestic scene, the CBA is primed for a significant increase in revenue as talent attracts spending and spectators. Additionally, the CBA has no salary cap, which further magnifies its players' earning potential by allowing the consumer market to fully dictate their value.
With this growing collection of quality players, I believe it's important to analyze and better understand this league to identify those who could potentially be valuable in the NBA.
The plan: import the data, then try to classify players as foreign or domestic based on their attributes and on-court performance.
y = pop off ['nationality']
X = everything other than ['sys_id', 'season', 'player', 'team', 'birth_city', 'draft_status'] from info
Keep all calculated columns for now; address them during feature selection
Classification -- k-fold cross-validated modeling with logistic regression, random forest, SVC, AdaBoost, and gradient boosting (sketched just below)
Neighbors -- kNN
Build an ROC curve
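As a preview of that plan, here is a minimal cross-validation sketch over the candidate classifiers. It assumes the feature matrix x_reduced and label vector y_dom built later in this notebook; cv=5 and the f1 metric are illustrative choices, not settings taken from the analysis below.
from sklearn.cross_validation import cross_val_score
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier
candidates = [LogisticRegression(), RandomForestClassifier(), SVC(),
              AdaBoostClassifier(), GradientBoostingClassifier()]
for model in candidates:
    # 5-fold cross-validated f1 for each candidate model
    scores = cross_val_score(model, x_reduced, y_dom, cv=5, scoring='f1')
    print model.__class__.__name__, scores.mean(), scores.std()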
Preliminary dimensions:
stats.shape = 774 x 64
stats_final.shape = 708 x 66
Data integrity issues found along the way are handled in the cleaning steps below.
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import sqlite3 as sql
db_read = 'data/cba.sqlite'
sql_query = '''SELECT tot.*, adv.*, misc.*, info.ht, info.wt, info.nationality
FROM team_info_realgm tinfo
JOIN player_season_totals tot
ON tinfo.team_id = tot.team
JOIN player_season_advanced adv
ON tot.player = adv.player AND tot.season = adv.season AND tinfo.team_id = adv.team
JOIN player_season_misc misc
ON tot.player = misc.player AND tot.season = misc.season AND tinfo.team_id = misc.team
JOIN player_season_info info
ON tot.player = info.player AND tot.season = info.season AND tinfo.team = info.team'''
with sql.connect(db_read) as con:
    df = pd.read_sql(sql_query, con)
# the joins duplicate the shared key columns (player, season, team);
# transposing, dropping duplicate rows, and transposing back dedupes columns
stats = df.T.drop_duplicates().T
stats.shape
(774, 64)
# drop identifier columns that shouldn't inform the model
cols_retained = stats.columns - ['sys_id', 'season', 'player', 'team', 'birth_city', 'draft_status']
stats = stats[cols_retained]
# convert heights from 'feet-inches' strings to total inches
ht_inches = []
for height in stats['ht']:
    h = height.split('-')
    if h[0] and h[1]:
        ht_inches.append(12.0 * float(h[0]) + float(h[1]))
    else:
        ht_inches.append('')
stats['ht'] = ht_inches
stats = stats.convert_objects(convert_numeric=True)
# impute missing heights and weights with the column means
stats = stats.fillna(stats.mean()[['ht', 'wt']])
# sanity checks on dtypes and missing values:
# [(type(stats[col]), col) for col in stats.columns]
# [(stats[col].dtype, col) for col in stats.columns]
# [(stats[col].isnull().sum(), col) for col in stats.columns]
stats.shape
(774, 60)
stats['nationality'].value_counts()
China                               573
United States                       126
-                                    33
Taiwan                               13
Jordan                                7
Syria                                 3
Puerto Rico                           3
Iran                                  3
Palestine                             2
Ivory Coast                           2
Democratic Republic of the Congo      2
Netherlands                           1
Lebanon                               1
Senegal                               1
Australia                             1
Nigeria                               1
France                                1
dtype: int64
stats['nationality'] = stats['nationality'].replace(to_replace='-', value=np.nan)
stats = stats.dropna(subset=['nationality'])
#domestic for China, Taiwan, Hong Kong, non-domestic for anything else
#1 for domestic, 0 for foreign
stats['domestic'] = stats['nationality'].apply(lambda x: 1 if x in ['China', 'Taiwan', 'Hong Kong'] else 0)
stats['nationality_class'] = stats['nationality'].apply(lambda x: x if x in ['China', 'United States'] else "Other")
stats['nationality_class'].value_counts()
China            573
United States    126
Other             42
dtype: int64
stats_nopos = stats.drop(['pos'], axis=1)
stats_nopos = stats_nopos.dropna()
stats.shape
(741, 62)
stats = stats.dropna()
pos_binarized = pd.get_dummies(stats['pos'], prefix='pos')
stats_final = pd.concat([stats, pos_binarized], axis=1).drop(['pos'], axis=1)
Next, reduce collinearity among the features; this is essential for logistic regression.
PCA -- the downside is that it removes the interpretability of the original features, since each principal component is a linear combination of all of them.
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
y_nat = stats_final.pop('nationality').values
y_nat_class = stats_final.pop('nationality_class').values
y_dom = stats_final.pop('domestic').values
x = stats_final.values
y_nat = y_nat.astype(str)
y_nat_class = y_nat_class.astype(str)
feature_names = stats_final.columns.values
scaler = StandardScaler()
x_scaled = scaler.fit_transform(x)
def show_exp_var(decomper, data):
    # plot total explained variance as a function of the number of components
    var_exp = []
    for n in xrange(1, data.shape[1] + 1):
        dc = decomper(n_components=n)
        dc.fit(data)
        var_exp.append(dc.explained_variance_.sum())
    plt.figure(figsize=(16, 10))
    plt.grid(b=True, which='major', color='black', linestyle='-')
    plt.plot(np.array(var_exp))
    plt.tight_layout()
show_exp_var(PCA, x_scaled)
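Refitting a PCA for every component count is wasteful; an equivalent single-fit alternative (a sketch, not what was run here) reads the cumulative explained-variance ratio directly:
pca_full = PCA()  # n_components=None keeps all components
pca_full.fit(x_scaled)
cum_ratio = np.cumsum(pca_full.explained_variance_ratio_)
# e.g. cum_ratio[29] is the fraction of total variance retained by 30 components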
pca = PCA(n_components=30)
x_reduced = pca.fit_transform(x_scaled)
x_reduced.shape
(708, 30)
Binary classification -- domestic vs. non-domestic
from sklearn.cross_validation import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import confusion_matrix, recall_score, precision_score, f1_score, roc_curve, roc_auc_score
from sklearn.grid_search import GridSearchCV
xtrain, xtest, ytrain, ytest = train_test_split(x_reduced, y_dom, random_state=13)
def get_scores(x, y, model):
    # print an evaluation summary for a fitted classifier
    preds = model.predict(x)
    print "model:", model
    print "cf matrix:\n", confusion_matrix(y, preds)
    print "accuracy:", model.score(x, y)
    print "precision (tp/tp+fp):", precision_score(y, preds)
    print "recall (tp/p):", recall_score(y, preds)
    print "f1:", f1_score(y, preds)
lg = LogisticRegression()
lg.fit(xtrain, ytrain)
get_scores(xtest, ytest, lg)
model: LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, penalty='l2', random_state=None, tol=0.0001)
cf matrix:
[[ 32   6]
 [  3 136]]
accuracy: 0.949152542373
precision (tp/tp+fp): 0.957746478873
recall (tp/p): 0.978417266187
f1: 0.967971530249
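The plan above also calls for an ROC curve. Here is a minimal sketch using this fitted logistic model's class probabilities (roc_curve and roc_auc_score were already imported; the plot is illustrative, not output reproduced from the original run):
probs = lg.predict_proba(xtest)[:, 1]  # predicted probability of the positive (domestic) class
fpr, tpr, thresholds = roc_curve(ytest, probs)
plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, label='logistic regression (AUC = %.3f)' % roc_auc_score(ytest, probs))
plt.plot([0, 1], [0, 1], '--', label='chance')
plt.xlabel('false positive rate')
plt.ylabel('true positive rate')
plt.legend(loc='lower right')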
rf = RandomForestClassifier(n_jobs=-1)
pg = {'n_estimators': np.arange(10, 101, 10), 'criterion': ['gini', 'entropy']}
gs = GridSearchCV(rf, param_grid=pg, scoring='f1')
gs.fit(x_reduced, y_dom)
GridSearchCV(cv=None,
       estimator=RandomForestClassifier(bootstrap=True, compute_importances=None,
            criterion='gini', max_depth=None, max_features='auto',
            min_density=None, min_samples_leaf=1, min_samples_split=2,
            n_estimators=10, n_jobs=-1, oob_score=False, random_state=None,
            verbose=0),
       fit_params={}, iid=True, loss_func=None, n_jobs=1,
       param_grid={'n_estimators': array([ 10,  20,  30,  40,  50,  60,  70,  80,  90, 100]),
                   'criterion': ['gini', 'entropy']},
       pre_dispatch='2*n_jobs', refit=True, score_func=None, scoring='f1',
       verbose=0)
gs.grid_scores_
[mean: 0.94650, std: 0.00770, params: {'n_estimators': 10, 'criterion': 'gini'},
 mean: 0.95059, std: 0.00822, params: {'n_estimators': 20, 'criterion': 'gini'},
 mean: 0.95230, std: 0.00825, params: {'n_estimators': 30, 'criterion': 'gini'},
 mean: 0.95322, std: 0.00174, params: {'n_estimators': 40, 'criterion': 'gini'},
 mean: 0.95698, std: 0.00050, params: {'n_estimators': 50, 'criterion': 'gini'},
 mean: 0.95352, std: 0.00861, params: {'n_estimators': 60, 'criterion': 'gini'},
 mean: 0.95668, std: 0.00510, params: {'n_estimators': 70, 'criterion': 'gini'},
 mean: 0.95793, std: 0.00901, params: {'n_estimators': 80, 'criterion': 'gini'},
 mean: 0.95968, std: 0.00621, params: {'n_estimators': 90, 'criterion': 'gini'},
 mean: 0.95877, std: 0.00832, params: {'n_estimators': 100, 'criterion': 'gini'},
 mean: 0.95086, std: 0.01082, params: {'n_estimators': 10, 'criterion': 'entropy'},
 mean: 0.95370, std: 0.00508, params: {'n_estimators': 20, 'criterion': 'entropy'},
 mean: 0.94830, std: 0.00633, params: {'n_estimators': 30, 'criterion': 'entropy'},
 mean: 0.95825, std: 0.00299, params: {'n_estimators': 40, 'criterion': 'entropy'},
 mean: 0.95886, std: 0.00603, params: {'n_estimators': 50, 'criterion': 'entropy'},
 mean: 0.96230, std: 0.00530, params: {'n_estimators': 60, 'criterion': 'entropy'},
 mean: 0.95098, std: 0.00386, params: {'n_estimators': 70, 'criterion': 'entropy'},
 mean: 0.95654, std: 0.00300, params: {'n_estimators': 80, 'criterion': 'entropy'},
 mean: 0.95728, std: 0.00831, params: {'n_estimators': 90, 'criterion': 'entropy'},
 mean: 0.95553, std: 0.00428, params: {'n_estimators': 100, 'criterion': 'entropy'}]
gs.best_params_
{'criterion': 'entropy', 'n_estimators': 60}
# best_params_ favored entropy, but its f1 edge over gini is within one
# standard deviation of the CV scores, so gini is kept here
rf = RandomForestClassifier(n_jobs=-1, n_estimators=60, criterion='gini')
rf.fit(xtrain, ytrain)
get_scores(xtest, ytest, rf)
model: RandomForestClassifier(bootstrap=True, compute_importances=None,
            criterion='gini', max_depth=None, max_features='auto',
            min_density=None, min_samples_leaf=1, min_samples_split=2,
            n_estimators=60, n_jobs=-1, oob_score=False, random_state=None,
            verbose=0)
cf matrix:
[[ 31   7]
 [  4 135]]
accuracy: 0.937853107345
precision (tp/tp+fp): 0.950704225352
recall (tp/p): 0.971223021583
f1: 0.960854092527
rf.classes_
array([0, 1])
# retrain on the original (untransformed) features so the feature
# importances map back to named columns
# xtrain_full, xtest_full, ytrain_full, ytest_full = train_test_split(x, y_dom, random_state=13)
xtrain_full, xtest_full, ytrain_full, ytest_full = train_test_split(x, y_dom)
rf_full = RandomForestClassifier(n_jobs=-1, n_estimators=60, criterion='gini')
rf_full.fit(xtrain_full, ytrain_full)
get_scores(xtest_full, ytest_full, rf_full)
model: RandomForestClassifier(bootstrap=True, compute_importances=None,
            criterion='gini', max_depth=None, max_features='auto',
            min_density=None, min_samples_leaf=1, min_samples_split=2,
            n_estimators=60, n_jobs=-1, oob_score=False, random_state=None,
            verbose=0)
cf matrix:
[[ 32   3]
 [  4 138]]
accuracy: 0.960451977401
precision (tp/tp+fp): 0.978723404255
recall (tp/p): 0.971830985915
f1: 0.975265017668
rf_full.feature_importances_
array([  0.00000000e+00,   7.61751267e-03,   3.13574587e-02,   2.64774026e-03,
         2.61600500e-03,   6.54577299e-03,   4.76293559e-03,   1.54990354e-02,
         4.55698688e-03,   1.03682349e-02,   3.59508890e-02,   2.87579205e-02,
         3.00913479e-02,   4.47696216e-03,   5.57373914e-03,   1.19521081e-02,
         4.42461778e-03,   3.33032693e-03,   4.42995269e-03,   1.87947330e-03,
         1.68992817e-02,   4.21984947e-02,   5.50119292e-03,   4.52364876e-02,
         4.51412110e-03,   5.18549984e-02,   1.72402778e-02,   4.33524412e-03,
         1.24089454e-02,   7.50654866e-02,   8.54232169e-02,   4.40645106e-03,
         6.91914957e-03,   7.27429653e-03,   3.12880746e-03,   4.80652787e-03,
         5.59911784e-03,   1.10449361e-02,   7.30903352e-02,   4.30476116e-03,
         6.44953096e-03,   4.96497938e-03,   2.64860891e-02,   5.20678231e-03,
         3.73270872e-03,   2.82547750e-03,   5.21704728e-03,   4.78211955e-03,
         1.18011311e-02,   6.07522336e-03,   1.06799876e-03,   2.98982527e-02,
         5.60596590e-03,   1.52757282e-01,   6.29450397e-03,   7.24529133e-03,
         1.22050948e-02,   6.38227075e-03,   9.90651661e-04,   7.29441914e-04,
         0.00000000e+00,   1.51123175e-04,   1.03988413e-03])
feature_ranks = [feature_names[i] for i in np.argsort(rf_full.feature_importances_)[::-1]]
feature_ranks
['usgpct', 'hob', 'high_game', 'per', 'fta', 'fic',
 'fgm', 'dbl_dbl', '_40_pts', 'drbpct', 'trbpct', 'drb',
 'pts', 'ftm', 'fga', 'astpct', 'gp', 'ws',
 'ediff', 'tov', 'ows', 'blkpct', '_20_reb', 'min',
 'win_pct', 'l', 'ast', 'ppr', 'wt', 'w',
 'tovpct', 'tspct', 'ortg', 'dws', 'fgpct', 'stlpct',
 'reb', 'pps', 'orbpct', 'total_s_pct', 'ast_to', 'blk',
 'ft_fga', 'drtg', 'fg3m', 'efgpct', 'ht', 'ftpct',
 'pf', 'stl', 'fg3a', 'orb', 'stl_to', '_5_blk',
 '_5_stl', 'fg3pct', 'tpl_dbl', 'pos_SG', 'pos_C', 'pos_PF',
 'pos_SF', 'pos_PG', '_20_ast']
feature_indices = np.arange(len(feature_names))
fig = plt.figure(figsize=(16,10))
ax = fig.add_axes([0, 0, 1, 1])
ax.set_title("Feature Importance in Classifying Domestic vs Foreign CBA Players")
ax.barh(feature_indices, np.sort(rf_full.feature_importances_)[::-1], align="center")
ax.set_yticks(feature_indices)
ax.set_yticklabels(feature_ranks)
ax.autoscale()
fig.show()
Based on the strong precision/recall/accuracy performance of the logistic regression and random forest classifiers, it appears there are indeed features that can predict whether a player is domestic or foreign.
Assuming the foreign players are indeed more talented and more impactful, the top-ranked features make intuitive sense.
For a summary of the statistics, see the glossary at http://basketball.realgm.com/info/glossary
It is also important to note that many of these top features are calculated, rather than directly observed, statistics.
Because PCA was used to address collinearity during feature selection, the transformed features no longer provide insight into what separates domestic from non-domestic players.
To investigate this further, I will revisit the feature selection process and manually select features using the VIF (variance inflation factor) and the feature correlation matrix. By doing this, I can potentially use logistic regression while retaining the meaning of the feature set.
from statsmodels.stats.outliers_influence import variance_inflation_factor
# VIF for each feature: how much its variance is inflated by collinearity with the others
vif = [variance_inflation_factor(x_scaled, col) for col in xrange(x_scaled.shape[1])]
zip(vif, feature_names)
[(nan, '_20_ast'),
 (2.3232646413884468, '_20_reb'),
 (3.099973924140988, '_40_pts'),
 (2.5583691874867571, '_5_blk'),
 (3.2826153029391079, '_5_stl'),
 (190036935.48370814, 'ast'),
 (8.2032827384218852, 'ast_to'),
 (17.978150577641646, 'astpct'),
 (19124656.985708512, 'blk'),
 (4.8911542003284083, 'blkpct'),
 (13.717477282514521, 'dbl_dbl'),
 (inf, 'drb'),
 (518.93282127926705, 'drbpct'),
 (21129.564180145786, 'drtg'),
 (21377.576159298682, 'dws'),
 (346061.19871661032, 'ediff'),
 (86.736913551758022, 'efgpct'),
 (74.464553355393164, 'fg3a'),
 (1069.3081439950811, 'fg3m'),
 (160256.93698674868, 'fg3pct'),
 (1904161358.7764828, 'fga'),
 (40434.534043811895, 'fgm'),
 (81245.066394499881, 'fgpct'),
 (3338961506.4996514, 'fic'),
 (11.081218981979251, 'ft_fga'),
 (73638132.87765573, 'fta'),
 (3730.3210419056595, 'ftm'),
 (228885.57379329094, 'ftpct'),
 (inf, 'gp'),
 (10.256000365408415, 'high_game'),
 (27.118818234214988, 'hob'),
 (7.9971082754363625, 'ht'),
 (inf, 'l'),
 (48.10307546446689, 'min'),
 (inf, 'orb'),
 (230.81040424653685, 'orbpct'),
 (287639.38165621378, 'ortg'),
 (83465.746619397309, 'ows'),
 (30.100700172182293, 'per'),
 (36924926.888011456, 'pf'),
 (10.560570393422465, 'ppr'),
 (76.475391471295495, 'pps'),
 (6519203798.0495596, 'pts'),
 (inf, 'reb'),
 (42019635.050169773, 'stl'),
 (4.1066018150601806, 'stl_to'),
 (4.7900846441743417, 'stlpct'),
 (724672.55940323207, 'total_s_pct'),
 (90452664.394567743, 'tov'),
 (2.7081295157277996, 'tovpct'),
 (1.689207276032876, 'tpl_dbl'),
 (1161.9775733693359, 'trbpct'),
 (65.065564699399644, 'tspct'),
 (13.404526175881928, 'usgpct'),
 (inf, 'w'),
 (4.1287363801882559, 'win_pct'),
 (155021.14855930192, 'ws'),
 (2.9694097757938884, 'wt'),
 (inf, 'pos_C'),
 (inf, 'pos_PF'),
 (inf, 'pos_PG'),
 (inf, 'pos_SF'),
 (inf, 'pos_SG')]
corr_matrix = pd.DataFrame(x_scaled, columns=feature_names).corr().abs()
corr_top = corr_matrix.unstack().order(ascending=False)
# each pair appears twice (mirrored), so take every other row; skip self-correlations
corr_top[corr_top != 1][:750:2]
pts     fgm       0.992658
        fga       0.987447
ftm     fta       0.987418
reb     drb       0.985903
fg3m    fg3a      0.984703
fgm     fga       0.981999
ortg    ediff     0.970763
tspct   efgpct    0.960869
fic     fgm       0.959132
ws      ows       0.954628
pts     fic       0.951468
ws      fic       0.945673
fgpct   efgpct    0.943900
reb     orb       0.942665
tspct   pps       0.936179
...
stlpct  stl       0.514828
w       fga       0.511142
        tov       0.511094
wt      orbpct    0.510350
ws      blk       0.508952
ows     fg3a      0.506598
fic     fg3m      0.505807
tov     l         0.505148
gp      fta       0.505010
fic     drbpct    0.504449
ows     ediff     0.503364
fic     fg3a      0.501707
orb     gp        0.500524
ortg    hob       0.500017
ws      ortg      0.498958
Length: 375, dtype: float64
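The correlation list above can be turned into drop candidates programmatically. A sketch, assuming corr_top as built above; the 0.9 cutoff is an illustrative choice. For each highly correlated pair, the first feature is kept and its partner marked for removal.
drop_candidates = set()
for (f1, f2), r in corr_top.iteritems():
    if r == 1.0 or r < 0.9:
        continue  # skip self-pairs and weakly correlated pairs
    if f1 not in drop_candidates:
        drop_candidates.add(f2)  # f1 stays, its near-duplicate f2 goes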
# '_20_ast' is constant (zero importance above, NaN VIF), so it carries no information
stats_pruned = stats_final.drop('_20_ast', axis=1)
# y = stats_pruned.pop('nationality').values
x = stats_pruned.values
scaler = StandardScaler()
x_pruned = scaler.fit_transform(x)
feature_names_pruned = stats_pruned.columns.values
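From here, the manual VIF-based selection described above can proceed iteratively: drop the feature with the highest VIF, recompute, and repeat until everything falls below a threshold. A sketch of that loop follows; prune_by_vif is a hypothetical helper, and the threshold of 10 is a common rule of thumb rather than a value taken from this analysis.
def prune_by_vif(x, names, threshold=10.0):
    # repeatedly remove the feature with the highest VIF until all
    # remaining VIFs fall below the threshold
    names = list(names)
    while True:
        vifs = np.array([variance_inflation_factor(x, i) for i in xrange(x.shape[1])])
        vifs[np.isnan(vifs)] = np.inf  # treat degenerate columns as maximally redundant
        worst = vifs.argmax()
        if vifs[worst] < threshold:
            return x, names
        x = np.delete(x, worst, axis=1)
        del names[worst]

x_vif, names_vif = prune_by_vif(x_pruned, feature_names_pruned)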