CBA Classification Analysis

Motivation/Background

The game of basketball is growing in popularity around the world, and China is a major driver of that growth. In fact, China has two professional leagues, a major and a minor, that together employ about 400 players. The top league is the Chinese Basketball Association (CBA), which is currently composed of 18 teams, a number that has grown in recent years to accommodate the surge in demand.

The league is becoming increasingly relevant to the development of global talent, as players from the United States now go over to take advantage of the opportunities for playing time and earning power. With more and more talented foreign players joining an improving domestic scene, the CBA is primed for a significant increase in business and revenue, as talent attracts spending and spectators. Additionally, the CBA's lack of a salary cap further magnifies players' earning potential, allowing the consumer market to fully dictate their value.

With this growing collection of quality players, I believe it's important to analyze and better understand the league in order to identify players who could be valuable in the NBA.

Import and process data

Import the data and try to classify players as foreign or domestic based on their attributes and on-court performance.

Y = pop off ['nationality']
X = everything other than ['sys_id', 'season', 'player', 'team', 'birth_city', 'draft_status'] from info

Keep all calculated columns for now; address them during feature selection.
Classification -- k-fold cross-validated modeling with logistic regression, random forest, SVC, AdaBoost, and gradient boosting (a sketch of this comparison appears in the modeling section below)
Clustering -- KNN

Build an ROC curve (a sketch appears after the logistic regression results below)

Preliminary dimensions:
stats.shape = 774 x 64
stats_final.shape = 708 x 64

Found data integrity issues:

  1. 'Wei Liu' player_season_advanced, player_season_misc -- FIXED, dropped inconsistencies
  2. Issues from player_season_info -- FIXED, updated
    "Xiaowei Zhang","2012" -- traded b/w SDS & GFD
    "Tal Co","2012" -- absent
    "YoBo Xia","2012" -- absent
    "Mike Efevberha","2014" -- absent
    "Jerel McNeal","2014" -- listed under wrong team
  3. Missing ht/wt -- FIXED, imputed with column means
  4. 33 players missing nationality from 2014 -- FIXED, dropped
In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import sqlite3 as sql
In [2]:
db_read = 'data/cba.sqlite'
In [3]:
sql_query = '''SELECT tot.*, adv.*, misc.*, info.ht, info.wt, info.nationality
FROM team_info_realgm tinfo
JOIN player_season_totals tot
ON tinfo.team_id = tot.team
JOIN player_season_advanced adv
ON tot.player = adv.player AND tot.season = adv.season AND tinfo.team_id = adv.team
JOIN player_season_misc misc
ON tot.player = misc.player AND tot.season = misc.season AND tinfo.team_id = misc.team
JOIN player_season_info info
ON tot.player = info.player AND tot.season = info.season  AND tinfo.team = info.team'''

with sql.connect(db_read) as con:
    df = pd.read_sql(sql_query, con)
In [4]:
stats = df.T.drop_duplicates().T  # transpose trick to drop the duplicate columns produced by the joins
In [5]:
stats.shape
Out[5]:
(774, 64)
In [6]:
cols_retained = stats.columns - ['sys_id', 'season', 'player', 'team', 'birth_city', 'draft_status']
stats = stats[cols_retained]

Convert ht to inches

In [7]:
ht_inches = []
for height in stats['ht']:
    h = height.split('-')  # heights are stored as 'feet-inches' strings, e.g. '6-10'
    if h[0] and h[1]:
        ht = 12.0 * float(h[0]) + 1.0 * float(h[1])
        ht_inches.append(ht)
    else:
        ht_inches.append('')  # blank out malformed/missing heights; converted to NaN and imputed below
stats['ht'] = ht_inches
In [8]:
stats = stats.convert_objects(convert_numeric=True)

Process missing 'wt' & 'ht'

In [9]:
stats = stats.fillna(stats.mean()[['ht' , 'wt']])
In [10]:
# [(type(stats[col]), col) for col in stats.columns]
In [11]:
# [(stats[col].dtype, col) for col in stats.columns]
In [12]:
# [(stats[col].isnull().sum(), col) for col in stats.columns]
In [13]:
stats.shape
Out[13]:
(774, 60)

Process 'nationality' column

In [14]:
stats['nationality'].value_counts()
Out[14]:
China                               573
United States                       126
-                                    33
Taiwan                               13
Jordan                                7
Syria                                 3
Puerto Rico                           3
Iran                                  3
Palestine                             2
Ivory Coast                           2
Democratic Republic of the Congo      2
Netherlands                           1
Lebanon                               1
Senegal                               1
Australia                             1
Nigeria                               1
France                                1
Panama                                1
dtype: int64
In [15]:
stats['nationality'] = stats['nationality'].replace(to_replace='-', value=np.nan)
In [16]:
stats = stats.dropna(subset=['nationality'])
In [17]:
#domestic for China, Taiwan, Hong Kong, non-domestic for anything else
#1 for domestic, 0 for foreign
stats['domestic'] = stats['nationality'].apply(lambda x: 1 if x in ['China', 'Taiwan', 'Hong Kong'] else 0)
In [18]:
stats['nationality_class'] = stats['nationality'].apply(lambda x: x if x in ['China', 'United States'] else "Other")
In [19]:
stats['nationality_class'].value_counts()
Out[19]:
China            573
United States    126
Other             42
dtype: int64

stats with no pos column -- Currently not used

In [20]:
stats_nopos = stats.drop(['pos'], axis=1)
In [21]:
stats_nopos = stats_nopos.dropna()

stats dropping the records with NaN values

In [22]:
stats.shape
Out[22]:
(741, 62)
In [23]:
stats = stats.dropna()

Process/binarize positional values

In [24]:
pos_binarized = pd.get_dummies(stats['pos'], prefix='pos')
In [25]:
stats_final = pd.concat([stats, pos_binarized], axis=1).drop(['pos'], axis=1)

Feature Selection

Reduce collinearity; this is especially important for logistic regression.
PCA -- the downside is that it removes the interpretability of the original features

In [26]:
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
In [27]:
y_nat = stats_final.pop('nationality').values
y_nat_class = stats_final.pop('nationality_class').values
y_dom = stats_final.pop('domestic').values
x = stats_final.values
In [28]:
y_nat = y_nat.astype(str)
y_nat_class = y_nat_class.astype(str)
In [29]:
feature_names = stats_final.columns.values
In [30]:
scaler = StandardScaler()
x_scaled = scaler.fit_transform(x)
In [31]:
def show_exp_var(decomper, data):
    var_exp = []
    for n in xrange(1, 63+1):  # one fit per candidate component count (63 features total)
        dc = decomper(n_components=n)
        dc.fit(data)
        var_exp.append(dc.explained_variance_.sum())  # total variance explained by the first n components
    plt.figure(figsize=(16,10))
    plt.grid(b=True, which='major', color='black', linestyle='-')
    plt.plot(np.array(var_exp))
    plt.tight_layout()
In [32]:
show_exp_var(PCA, x_scaled)
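
A note on the approach: the same elbow can be read from a single PCA fit via the cumulative explained-variance ratio. A minimal sketch, assuming the x_scaled array from above (not executed as part of this notebook):

pca_full = PCA()  # keep all components
pca_full.fit(x_scaled)
cum_var = np.cumsum(pca_full.explained_variance_ratio_)  # fraction of total variance explained by the first n components
plt.plot(cum_var)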
In [33]:
pca = PCA(n_components=30)
x_reduced = pca.fit_transform(x_scaled)
In [34]:
x_reduced.shape
Out[34]:
(708, 30)

Modeling!!

Binary classification -- Domestic vs nondomestic

In [35]:
from sklearn.cross_validation import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import confusion_matrix, recall_score, precision_score, f1_score, roc_curve, roc_auc_score
from sklearn.grid_search import GridSearchCV

Cross Validation Splitting -- Reduced feature set

In [36]:
xtrain, xtest, ytrain, ytest = train_test_split(x_reduced, y_dom, random_state=13)
In [37]:
def get_scores(x, y, model):
    preds = model.predict(x)  # predict once, reuse for all metrics
    print "model:", model
    print "cf matrix:\n", confusion_matrix(y, preds)
    print "accuracy:", model.score(x, y)
    print "precision (tp/tp+fp):", precision_score(y, preds)
    print "recall (tp/p):", recall_score(y, preds)
    print "f1:", f1_score(y, preds)

Basic Logistic Classification with reduced feature set ("x" 30 features)

In [38]:
lg = LogisticRegression()
lg.fit(xtrain, ytrain)
get_scores(xtest, ytest, lg)
model: LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, penalty=l2, random_state=None, tol=0.0001)
cf matrix:
[[ 32   6]
 [  3 136]]
accuracy: 0.949152542373
precision (tp/tp+fp): 0.957746478873
recall (tp/p): 0.978417266187
f1: 0.967971530249
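
The plan above calls for an ROC curve. A minimal sketch using the already-imported roc_curve and roc_auc_score on this fitted logistic model (assuming the xtest/ytest split above; not executed as part of this notebook):

probs = lg.predict_proba(xtest)[:, 1]  # predicted probability of the positive class (domestic = 1)
fpr, tpr, thresholds = roc_curve(ytest, probs)
auc = roc_auc_score(ytest, probs)

plt.figure(figsize=(8, 8))
plt.plot(fpr, tpr, label='logistic regression (AUC = %.3f)' % auc)
plt.plot([0, 1], [0, 1], 'k--')  # chance line
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend(loc='lower right')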

Random Forest Classification with reduced feature set ("x" 30 features); Grid Search for Optimal Parameters

In [39]:
rf = RandomForestClassifier(n_jobs=-1)
pg = {'n_estimators': np.arange(10, 101, 10), 'criterion': ['gini', 'entropy']}
gs = GridSearchCV(rf, param_grid=pg, scoring='f1')
gs.fit(x_reduced, y_dom)
Out[39]:
GridSearchCV(cv=None,
       estimator=RandomForestClassifier(bootstrap=True, compute_importances=None,
            criterion='gini', max_depth=None, max_features='auto',
            min_density=None, min_samples_leaf=1, min_samples_split=2,
            n_estimators=10, n_jobs=-1, oob_score=False, random_state=None,
            verbose=0),
       fit_params={}, iid=True, loss_func=None, n_jobs=1,
       param_grid={'n_estimators': array([ 10,  20,  30,  40,  50,  60,  70,  80,  90, 100]), 'criterion': ['gini', 'entropy']},
       pre_dispatch='2*n_jobs', refit=True, score_func=None, scoring='f1',
       verbose=0)
In [40]:
gs.grid_scores_
Out[40]:
[mean: 0.94650, std: 0.00770, params: {'n_estimators': 10, 'criterion': 'gini'},
 mean: 0.95059, std: 0.00822, params: {'n_estimators': 20, 'criterion': 'gini'},
 mean: 0.95230, std: 0.00825, params: {'n_estimators': 30, 'criterion': 'gini'},
 mean: 0.95322, std: 0.00174, params: {'n_estimators': 40, 'criterion': 'gini'},
 mean: 0.95698, std: 0.00050, params: {'n_estimators': 50, 'criterion': 'gini'},
 mean: 0.95352, std: 0.00861, params: {'n_estimators': 60, 'criterion': 'gini'},
 mean: 0.95668, std: 0.00510, params: {'n_estimators': 70, 'criterion': 'gini'},
 mean: 0.95793, std: 0.00901, params: {'n_estimators': 80, 'criterion': 'gini'},
 mean: 0.95968, std: 0.00621, params: {'n_estimators': 90, 'criterion': 'gini'},
 mean: 0.95877, std: 0.00832, params: {'n_estimators': 100, 'criterion': 'gini'},
 mean: 0.95086, std: 0.01082, params: {'n_estimators': 10, 'criterion': 'entropy'},
 mean: 0.95370, std: 0.00508, params: {'n_estimators': 20, 'criterion': 'entropy'},
 mean: 0.94830, std: 0.00633, params: {'n_estimators': 30, 'criterion': 'entropy'},
 mean: 0.95825, std: 0.00299, params: {'n_estimators': 40, 'criterion': 'entropy'},
 mean: 0.95886, std: 0.00603, params: {'n_estimators': 50, 'criterion': 'entropy'},
 mean: 0.96230, std: 0.00530, params: {'n_estimators': 60, 'criterion': 'entropy'},
 mean: 0.95098, std: 0.00386, params: {'n_estimators': 70, 'criterion': 'entropy'},
 mean: 0.95654, std: 0.00300, params: {'n_estimators': 80, 'criterion': 'entropy'},
 mean: 0.95728, std: 0.00831, params: {'n_estimators': 90, 'criterion': 'entropy'},
 mean: 0.95553, std: 0.00428, params: {'n_estimators': 100, 'criterion': 'entropy'}]
In [41]:
gs.best_params_
Out[41]:
{'criterion': 'entropy', 'n_estimators': 60}
In [42]:
rf = RandomForestClassifier(n_jobs=-1, n_estimators=60, criterion='gini')  # note: the grid search above marginally favored 'entropy'
rf.fit(xtrain, ytrain)
get_scores(xtest, ytest, rf)
model: RandomForestClassifier(bootstrap=True, compute_importances=None,
            criterion=gini, max_depth=None, max_features=auto,
            min_density=None, min_samples_leaf=1, min_samples_split=2,
            n_estimators=60, n_jobs=-1, oob_score=False, random_state=None,
            verbose=0)
cf matrix:
[[ 31   7]
 [  4 135]]
accuracy: 0.937853107345
precision (tp/tp+fp): 0.950704225352
recall (tp/p): 0.971223021583
f1: 0.960854092527
In [43]:
rf.classes_
Out[43]:
array([0, 1])
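
The plan also calls for comparing SVC, AdaBoost, and gradient boosting alongside the two models above via k-fold cross-validation. A minimal sketch of that comparison on the reduced feature set, assuming x_reduced and y_dom from above, default hyperparameters, and f1 scoring (not executed as part of this notebook):

from sklearn.cross_validation import cross_val_score
from sklearn.svm import SVC
from sklearn.ensemble import AdaBoostClassifier

candidates = [('logistic regression', LogisticRegression()),
              ('random forest', RandomForestClassifier(n_estimators=60, n_jobs=-1)),
              ('svc', SVC()),
              ('adaboost', AdaBoostClassifier()),
              ('gradient boost', GradientBoostingClassifier())]

for name, model in candidates:
    scores = cross_val_score(model, x_reduced, y_dom, scoring='f1', cv=5)  # 5-fold CV, f1 on the domestic class
    print name, "mean f1: %.4f (+/- %.4f)" % (scores.mean(), scores.std())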

Random Forest Classification with full, unscaled feature set ("x" 63 features)

In [61]:
# xtrain_full, xtest_full, ytrain_full, ytest_full = train_test_split(x, y_dom, random_state=13)
xtrain_full, xtest_full, ytrain_full, ytest_full = train_test_split(x, y_dom)
In [62]:
rf_full = RandomForestClassifier(n_jobs=-1, n_estimators=60, criterion='gini')
rf_full.fit(xtrain_full, ytrain_full)
get_scores(xtest_full, ytest_full, rf_full)
model: RandomForestClassifier(bootstrap=True, compute_importances=None,
            criterion=gini, max_depth=None, max_features=auto,
            min_density=None, min_samples_leaf=1, min_samples_split=2,
            n_estimators=60, n_jobs=-1, oob_score=False, random_state=None,
            verbose=0)
cf matrix:
[[ 32   3]
 [  4 138]]
accuracy: 0.960451977401
precision (tp/tp+fp): 0.978723404255
recall (tp/p): 0.971830985915
f1: 0.975265017668
In [63]:
rf_full.feature_importances_
Out[63]:
array([  0.00000000e+00,   7.61751267e-03,   3.13574587e-02,
         2.64774026e-03,   2.61600500e-03,   6.54577299e-03,
         4.76293559e-03,   1.54990354e-02,   4.55698688e-03,
         1.03682349e-02,   3.59508890e-02,   2.87579205e-02,
         3.00913479e-02,   4.47696216e-03,   5.57373914e-03,
         1.19521081e-02,   4.42461778e-03,   3.33032693e-03,
         4.42995269e-03,   1.87947330e-03,   1.68992817e-02,
         4.21984947e-02,   5.50119292e-03,   4.52364876e-02,
         4.51412110e-03,   5.18549984e-02,   1.72402778e-02,
         4.33524412e-03,   1.24089454e-02,   7.50654866e-02,
         8.54232169e-02,   4.40645106e-03,   6.91914957e-03,
         7.27429653e-03,   3.12880746e-03,   4.80652787e-03,
         5.59911784e-03,   1.10449361e-02,   7.30903352e-02,
         4.30476116e-03,   6.44953096e-03,   4.96497938e-03,
         2.64860891e-02,   5.20678231e-03,   3.73270872e-03,
         2.82547750e-03,   5.21704728e-03,   4.78211955e-03,
         1.18011311e-02,   6.07522336e-03,   1.06799876e-03,
         2.98982527e-02,   5.60596590e-03,   1.52757282e-01,
         6.29450397e-03,   7.24529133e-03,   1.22050948e-02,
         6.38227075e-03,   9.90651661e-04,   7.29441914e-04,
         0.00000000e+00,   1.51123175e-04,   1.03988413e-03])
In [64]:
feature_ranks = [feature_names[i] for i in np.argsort(rf_full.feature_importances_)[::-1]]
In [65]:
feature_ranks
Out[65]:
['usgpct',
 'hob',
 'high_game',
 'per',
 'fta',
 'fic',
 'fgm',
 'dbl_dbl',
 '_40_pts',
 'drbpct',
 'trbpct',
 'drb',
 'pts',
 'ftm',
 'fga',
 'astpct',
 'gp',
 'ws',
 'ediff',
 'tov',
 'ows',
 'blkpct',
 '_20_reb',
 'min',
 'win_pct',
 'l',
 'ast',
 'ppr',
 'wt',
 'w',
 'tovpct',
 'tspct',
 'ortg',
 'dws',
 'fgpct',
 'stlpct',
 'reb',
 'pps',
 'orbpct',
 'total_s_pct',
 'ast_to',
 'blk',
 'ft_fga',
 'drtg',
 'fg3m',
 'efgpct',
 'ht',
 'ftpct',
 'pf',
 'stl',
 'fg3a',
 'orb',
 'stl_to',
 '_5_blk',
 '_5_stl',
 'fg3pct',
 'tpl_dbl',
 'pos_SG',
 'pos_C',
 'pos_PF',
 'pos_SF',
 'pos_PG',
 '_20_ast']
In [66]:
feature_indices = np.arange(len(feature_names))
fig = plt.figure(figsize=(16,10))
ax = fig.add_axes([0, 0, 1, 1])
ax.set_title("Feature Importance in Classifying Domestic vs Foreign CBA Players")
ax.barh(feature_indices, np.sort(rf_full.feature_importances_)[::-1], align="center")
ax.set_yticks(feature_indices)
ax.set_yticklabels(feature_ranks)
ax.autoscale()
fig.show()

Analysis

Based on the strong precision/recall/accuracy performance of the logistic and random forest classifiers, it appears there are indeed features that can predict whether a player is domestic or foreign.

Assuming the foreign players are indeed more talented and more impactful, the top results make intuitive sense.

  1. "usgpct" (Usage %) --
  2. "hob" (Hands on buckets) --
  3. "per" (Player Efficiency Rating) --
  4. "high_game" (Most points scored in a game) --
  5. "pts" --
  6. "fic" (Floor Impact Counter) --

For a summary of the statistics, please reference the following: http://basketball.realgm.com/info/glossary

It is also important to note that many of these top features are in fact calculated, rather than observed, statistics.

During the feature selection process, when PCA was used to address collinearity, the features were transformed into principal components and no longer provide insight into which differences separate domestic and non-domestic players.

To investigate this further, I will revisit the feature selection process and attempt to select features manually using VIF (variance inflation factors) and the feature correlation matrix. This way I can potentially use logistic regression while retaining the meaning of the feature set.

Feature Analysis/Selection Part II

VIF & Correlation Matrix (Pearson Coefficients) on full feature set ("x" 64 features)

In [46]:
from statsmodels.stats.outliers_influence import variance_inflation_factor
In [47]:
vif = [variance_inflation_factor(x_scaled, col) for col in xrange(x_scaled.shape[1])]
In [48]:
zip(vif, feature_names)
Out[48]:
[(nan, '_20_ast'),
 (2.3232646413884468, '_20_reb'),
 (3.099973924140988, '_40_pts'),
 (2.5583691874867571, '_5_blk'),
 (3.2826153029391079, '_5_stl'),
 (190036935.48370814, 'ast'),
 (8.2032827384218852, 'ast_to'),
 (17.978150577641646, 'astpct'),
 (19124656.985708512, 'blk'),
 (4.8911542003284083, 'blkpct'),
 (13.717477282514521, 'dbl_dbl'),
 (inf, 'drb'),
 (518.93282127926705, 'drbpct'),
 (21129.564180145786, 'drtg'),
 (21377.576159298682, 'dws'),
 (346061.19871661032, 'ediff'),
 (86.736913551758022, 'efgpct'),
 (74.464553355393164, 'fg3a'),
 (1069.3081439950811, 'fg3m'),
 (160256.93698674868, 'fg3pct'),
 (1904161358.7764828, 'fga'),
 (40434.534043811895, 'fgm'),
 (81245.066394499881, 'fgpct'),
 (3338961506.4996514, 'fic'),
 (11.081218981979251, 'ft_fga'),
 (73638132.87765573, 'fta'),
 (3730.3210419056595, 'ftm'),
 (228885.57379329094, 'ftpct'),
 (inf, 'gp'),
 (10.256000365408415, 'high_game'),
 (27.118818234214988, 'hob'),
 (7.9971082754363625, 'ht'),
 (inf, 'l'),
 (48.10307546446689, 'min'),
 (inf, 'orb'),
 (230.81040424653685, 'orbpct'),
 (287639.38165621378, 'ortg'),
 (83465.746619397309, 'ows'),
 (30.100700172182293, 'per'),
 (36924926.888011456, 'pf'),
 (10.560570393422465, 'ppr'),
 (76.475391471295495, 'pps'),
 (6519203798.0495596, 'pts'),
 (inf, 'reb'),
 (42019635.050169773, 'stl'),
 (4.1066018150601806, 'stl_to'),
 (4.7900846441743417, 'stlpct'),
 (724672.55940323207, 'total_s_pct'),
 (90452664.394567743, 'tov'),
 (2.7081295157277996, 'tovpct'),
 (1.689207276032876, 'tpl_dbl'),
 (1161.9775733693359, 'trbpct'),
 (65.065564699399644, 'tspct'),
 (13.404526175881928, 'usgpct'),
 (inf, 'w'),
 (4.1287363801882559, 'win_pct'),
 (155021.14855930192, 'ws'),
 (2.9694097757938884, 'wt'),
 (inf, 'pos_C'),
 (inf, 'pos_PF'),
 (inf, 'pos_PG'),
 (inf, 'pos_SF'),
 (inf, 'pos_SG')]
In [49]:
corr_matrix = pd.DataFrame(x_scaled, columns=feature_names).corr().abs()
In [50]:
corr_top = corr_matrix.unstack().order(ascending=False)
In [51]:
corr_top[corr_top != 1][:750:2]  # drop the diagonal; each pair appears twice in the unstacked matrix, so take every other entry
Out[51]:
pts    fgm       0.992658
       fga       0.987447
ftm    fta       0.987418
reb    drb       0.985903
fg3m   fg3a      0.984703
fgm    fga       0.981999
ortg   ediff     0.970763
tspct  efgpct    0.960869
fic    fgm       0.959132
ws     ows       0.954628
pts    fic       0.951468
ws     fic       0.945673
fgpct  efgpct    0.943900
reb    orb       0.942665
tspct  pps       0.936179
...
stlpct  stl       0.514828
w       fga       0.511142
        tov       0.511094
wt      orbpct    0.510350
ws      blk       0.508952
ows     fg3a      0.506598
fic     fg3m      0.505807
tov     l         0.505148
gp      fta       0.505010
fic     drbpct    0.504449
ows     ediff     0.503364
fic     fg3a      0.501707
orb     gp        0.500524
ortg    hob       0.500017
ws      ortg      0.498958
Length: 375, dtype: float64
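
One way to act on this matrix (a sketch of a possible next step, not part of the original analysis): greedily flag one feature from every pair whose absolute correlation exceeds a cutoff, keeping the first member of each pair. The 0.9 cutoff is an assumption, not a value from this analysis.

def correlated_drop_candidates(corr, threshold=0.9):
    # corr is the absolute-value correlation DataFrame built above
    to_drop = set()
    cols = corr.columns
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            if cols[i] in to_drop or cols[j] in to_drop:
                continue
            if corr.iloc[i, j] > threshold:
                to_drop.add(cols[j])  # keep the first feature, flag the second
    return sorted(to_drop)

# candidates = correlated_drop_candidates(corr_matrix, threshold=0.9)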

Manual Feature Selection (stats_pruned)

In [52]:
# no values/information in this feature (its VIF was NaN above)
stats_pruned = stats_final.drop('_20_ast', axis=1)
In [53]:
# y = stats_pruned.pop('nationality').values
x = stats_pruned.values
In [54]:
scaler = StandardScaler()
x_pruned = scaler.fit_transform(x)
In [55]:
feature_names_pruned = stats_pruned.columns.values
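
From here, the manual selection can proceed by iteratively dropping the feature with the largest VIF until every remaining VIF falls below a threshold. A minimal sketch, assuming x_pruned and feature_names_pruned from above; the threshold of 10 is a common rule of thumb and an assumption, not a value from this analysis:

def prune_by_vif(data, names, threshold=10.0):
    # repeatedly drop the column with the worst VIF until all are below threshold
    data = data.copy()
    names = list(names)
    while True:
        vifs = np.array([variance_inflation_factor(data, i) for i in xrange(data.shape[1])])
        vifs[~np.isfinite(vifs)] = np.inf  # treat inf/NaN (perfect collinearity, constant columns) as worst offenders
        worst = int(np.argmax(vifs))
        if vifs[worst] <= threshold:
            return data, names
        data = np.delete(data, worst, axis=1)
        names.pop(worst)

# x_vif, names_vif = prune_by_vif(x_pruned, feature_names_pruned)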