This short notebook shows how to scrape historical ratings from Ken Pomeroy's (KenPom) college basketball ratings site.
KenPom has ratings for NCAA men's basketball teams going back to 2002.
import numpy as np
import pandas as pd
import pracpred.scrape as pps
from tqdm import tqdm_notebook
from pathlib import Path
PROJECT_DIR = Path.cwd().parent
DATA_DIR = PROJECT_DIR / 'data' / 'scraped'
DATA_DIR.mkdir(exist_ok=True, parents=True)
OUTPUT_DIR = PROJECT_DIR / 'data' / 'kenpom'
OUTPUT_DIR.mkdir(exist_ok=True, parents=True)
Let's start by scraping the 2017 data. We can use the HTML table scraping functionality in our pracpred package.
You can find more information on the pracpred package here and here. You can install the package in your sports analytics environment by running pip install pracpred in a terminal (Mac or Linux) or the Anaconda Prompt (Windows).
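If the import at the top of the notebook fails, the package probably isn't installed in the active environment. A minimal, purely optional check like the one below makes that failure explicit (it only re-attempts the same import used above).
try:
    import pracpred.scrape as pps
except ImportError:
    print('pracpred is not installed; run `pip install pracpred` first')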
url = 'https://kenpom.com/index.php?y=2017'
tables = pps.HTMLTables(url)
len(tables)
1
tables[0].to_df().head()
  | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | Strength of Schedule | NaN | NaN | NaN | NaN | NaN | NCSOS | NaN
1 | Rank | Team | Conf | W-L | AdjEM | AdjO | NaN | AdjD | NaN | AdjT | ... | Luck | NaN | AdjEM | NaN | OppO | NaN | OppD | NaN | AdjEM | NaN |
2 | 1 | Gonzaga 1 | WCC | 37-2 | +32.05 | 118.4 | 16 | 86.3 | 1 | 70.1 | ... | +.020 | 133 | +2.99 | 89 | 106.2 | 84 | 103.3 | 105 | +1.01 | 127 |
3 | 2 | Villanova 1 | BE | 32-4 | +29.88 | 122.4 | 3 | 92.5 | 12 | 64.1 | ... | +.010 | 166 | +9.33 | 33 | 109.8 | 38 | 100.5 | 32 | +3.55 | 61 |
4 | 3 | North Carolina 1 | ACC | 33-7 | +28.22 | 120.7 | 9 | 92.5 | 11 | 71.3 | ... | +.037 | 85 | +12.49 | 6 | 112.0 | 4 | 99.5 | 19 | +3.87 | 53 |
5 rows × 21 columns
Before we go further, let's put this simple logic in a function which will get the data for any particular year.
def raw_kenpom(year):
    url = f'https://kenpom.com/index.php?y={year}'
    tables = pps.HTMLTables(url)
    return tables[0].to_df()
raw = raw_kenpom(2017)
raw.head()
  | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | Strength of Schedule | NaN | NaN | NaN | NaN | NaN | NCSOS | NaN
1 | Rank | Team | Conf | W-L | AdjEM | AdjO | NaN | AdjD | NaN | AdjT | ... | Luck | NaN | AdjEM | NaN | OppO | NaN | OppD | NaN | AdjEM | NaN |
2 | 1 | Gonzaga 1 | WCC | 37-2 | +32.05 | 118.4 | 16 | 86.3 | 1 | 70.1 | ... | +.020 | 133 | +2.99 | 89 | 106.2 | 84 | 103.3 | 105 | +1.01 | 127 |
3 | 2 | Villanova 1 | BE | 32-4 | +29.88 | 122.4 | 3 | 92.5 | 12 | 64.1 | ... | +.010 | 166 | +9.33 | 33 | 109.8 | 38 | 100.5 | 32 | +3.55 | 61 |
4 | 3 | North Carolina 1 | ACC | 33-7 | +28.22 | 120.7 | 9 | 92.5 | 11 | 71.3 | ... | +.037 | 85 | +12.49 | 6 | 112.0 | 4 | 99.5 | 19 | +3.87 | 53 |
5 rows × 21 columns
In the raw table, the first two rows are the table headers. Many of these cells are blank (NaN) because each header spans two rows. "NaN" stands for "not a number" and is a value defined in the numpy package, on top of which pandas is built. You can read more about how pandas handles missing values here.
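As a quick, optional check, we can ask pandas which cells in those two header rows it treats as missing; this is just an illustration using the raw table we already scraped.
# count how many of the 21 columns are missing in each of the two header rows
raw.iloc[:2].isna().sum(axis=1)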
Most of the useful column names are in the second row. Also, we see that the columns which have blank (NaN) values in this row correspond to the ranks of each statistic in the KenPom table. The ranks show up as small text to the right of the statistic value. For example, Gonzaga had the sixteenth best Adjusted Offensive Efficiency and the highest-ranked Adjusted Defensive Efficiency for 2017, according to KenPom.
raw.iloc[1]
0       Rank
1       Team
2       Conf
3        W-L
4      AdjEM
5       AdjO
6        NaN
7       AdjD
8        NaN
9       AdjT
10       NaN
11      Luck
12       NaN
13     AdjEM
14       NaN
15      OppO
16       NaN
17      OppD
18       NaN
19     AdjEM
20       NaN
Name: 1, dtype: object
Let's write a function to pull out the useful column names from the raw data.
def column_names(raw_cols):
    cols = raw_cols.copy()
    adj_em_count = 0
    # 'AdjEM' appears three times: overall, strength of schedule, and non-conference SOS
    adj_em_col = {
        1: 'AdjEM',
        2: 'SOS AdjEM',
        3: 'NCSOS AdjEM',
    }
    for i, col in enumerate(cols):
        if col == 'AdjEM':
            adj_em_count += 1
            cols[i] = adj_em_col[adj_em_count]
        # blank header cells hold the rank of the statistic to their left
        if str(col) == 'nan':
            cols[i] = cols[i-1] + '_rank'
        if col == 'Rank':
            cols[i] = 'KenPom'
    return cols
column_names(raw.iloc[1])
0                KenPom
1                  Team
2                  Conf
3                   W-L
4                 AdjEM
5                  AdjO
6             AdjO_rank
7                  AdjD
8             AdjD_rank
9                  AdjT
10            AdjT_rank
11                 Luck
12            Luck_rank
13            SOS AdjEM
14       SOS AdjEM_rank
15                 OppO
16            OppO_rank
17                 OppD
18            OppD_rank
19          NCSOS AdjEM
20     NCSOS AdjEM_rank
Name: 1, dtype: object
Now we can write a function to take the raw data, pull out the column names, and create a new DataFrame with those names as its columns.
This function drops the first two rows of the raw data, since they are no longer needed once we have the column names. It also drops any rows with blank data, in particular the column headers which repeat several times on the web site.
def formatted(raw):
    df = raw.copy()
    df.columns = column_names(raw.iloc[1])
    df.columns.name = None
    # drop the two header rows, then any repeated header rows embedded in the data
    df = df.drop([0, 1])
    return df.dropna().reset_index(drop=True)
formatted(raw).shape
(351, 21)
There are 351 rows of data, which matches the number of teams in the 2017 KenPom web site table.
Now we need to split the school name from the NCAA tournament seed. To do this, we'll use regular expressions to separate the text from the number. You can read more about working with text data in pandas here.
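Before applying the pattern to the whole column, here is a minimal sketch of what it captures, using the standard-library re module on a couple of made-up team strings (the names are only for illustration).
import re
pattern = r'(.+) (\d+)'
print(re.search(pattern, 'Gonzaga 1').groups())  # ('Gonzaga', '1') -- name and seed split apart
print(re.search(pattern, 'Longwood'))            # None -- no trailing seed, so no match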
formatted(raw)['Team'].str.extract(r'(.+) (\d+)', expand=True).head()
  | 0 | 1
---|---|---|
0 | Gonzaga | 1 |
1 | Villanova | 1 |
2 | North Carolina | 1 |
3 | Kentucky | 2 |
4 | Florida | 4 |
formatted(raw)['Team'].str.extract(r'(.+) (\d+)', expand=True).tail()
  | 0 | 1
---|---|---|
346 | NaN | NaN |
347 | NaN | NaN |
348 | NaN | NaN |
349 | NaN | NaN |
350 | NaN | NaN |
Notice that for rows which don't have an NCAA seed, the regular expression pattern matching fails. In this case, we end up with NaN in both columns for such rows. To correct for this, we fall back to the original team name using the pandas fillna() method.
def ncaa_seed(df):
    split = df['Team'].str.extract(r'(.+) (\d+)', expand=True)
    # teams without a seed fail the pattern match, so fall back to the original name
    team = split[0].fillna(df['Team'])
    df['Team'] = team.str.replace(';', '')
    df['Seed'] = split[1]
    return df
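To see the fillna() fallback on its own, here is a minimal sketch with two made-up team strings (again, purely for illustration).
teams = pd.Series(['Gonzaga 1', 'Longwood'])
split = teams.str.extract(r'(.+) (\d+)', expand=True)
split[0].fillna(teams)  # 'Gonzaga' from the match, 'Longwood' carried over unchanged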
Next, we want to split the wins and losses string to get one column for each value.
def wins_losses(df):
    df['Wins'] = df['W-L'].str.split('-').str.get(0).astype(int)
    df['Losses'] = df['W-L'].str.split('-').str.get(1).astype(int)
    df = df.drop(columns=['W-L'])
    return df
wins_losses(ncaa_seed(formatted(raw)))[['Team', 'Seed', 'Wins', 'Losses']].head()
  | Team | Seed | Wins | Losses
---|---|---|---|---|
0 | Gonzaga | 1 | 37 | 2 |
1 | Villanova | 1 | 32 | 4 |
2 | North Carolina | 1 | 33 | 7 |
3 | Kentucky | 2 | 32 | 6 |
4 | Florida | 4 | 27 | 9 |
wins_losses(ncaa_seed(formatted(raw)))[['Team', 'Seed', 'Wins', 'Losses']].tail()
  | Team | Seed | Wins | Losses
---|---|---|---|---|
346 | Longwood | NaN | 6 | 24 |
347 | Arkansas Pine Bluff | NaN | 7 | 25 |
348 | North Carolina A&T | NaN | 3 | 29 |
349 | Presbyterian | NaN | 5 | 25 |
350 | Alabama A&M | NaN | 2 | 27 |
The last things we need to do are to make sure all of the columns have the correct data types and to order them the way we want.
def numeric(df):
    # these columns are strings with a leading '+' or '-' sign on the web site
    cols_to_strip = [
        'AdjEM',
        'Luck',
        'SOS AdjEM',
        'NCSOS AdjEM',
    ]
    for col in cols_to_strip:
        df[col] = pd.to_numeric(df[col])
    rank_cols = [col for col in df.columns if '_rank' in col]
    for col in rank_cols:
        df[col] = df[col].astype(int)
    other_cols = [
        'AdjO',
        'AdjD',
        'AdjT',
        'OppO',
        'OppD',
    ]
    for col in other_cols:
        df[col] = df[col].astype(float)
    return df
def ordered(df):
    first_cols = [
        'Year',
        'Team',
        'Conf',
        'Seed',
        'Wins',
        'Losses',
        'KenPom',
    ]
    last_cols = [col for col in df.columns if 'rank' in col]
    cols = first_cols + [col for col in df.columns if col not in first_cols and col not in last_cols] + last_cols
    return df[cols]
def formatted_kenpom(year):
    raw = raw_kenpom(year)
    df = formatted(raw)
    df = ncaa_seed(df)
    df = wins_losses(df)
    df = numeric(df)
    df['Year'] = int(year)
    df = ordered(df)
    return df
df = formatted_kenpom(2017)
df.shape
(351, 24)
df.dtypes
Year                  int64
Team                 object
Conf                 object
Seed                 object
Wins                  int64
Losses                int64
KenPom               object
AdjEM               float64
AdjO                float64
AdjD                float64
AdjT                float64
Luck                float64
SOS AdjEM           float64
OppO                float64
OppD                float64
NCSOS AdjEM         float64
AdjO_rank             int64
AdjD_rank             int64
AdjT_rank             int64
Luck_rank             int64
SOS AdjEM_rank        int64
OppO_rank             int64
OppD_rank             int64
NCSOS AdjEM_rank      int64
dtype: object
df.head()
  | Year | Team | Conf | Seed | Wins | Losses | KenPom | AdjEM | AdjO | AdjD | ... | OppD | NCSOS AdjEM | AdjO_rank | AdjD_rank | AdjT_rank | Luck_rank | SOS AdjEM_rank | OppO_rank | OppD_rank | NCSOS AdjEM_rank
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2017 | Gonzaga | WCC | 1 | 37 | 2 | 1 | 32.05 | 118.4 | 86.3 | ... | 103.3 | 1.01 | 16 | 1 | 76 | 133 | 89 | 84 | 105 | 127 |
1 | 2017 | Villanova | BE | 1 | 32 | 4 | 2 | 29.88 | 122.4 | 92.5 | ... | 100.5 | 3.55 | 3 | 12 | 324 | 166 | 33 | 38 | 32 | 61 |
2 | 2017 | North Carolina | ACC | 1 | 33 | 7 | 3 | 28.22 | 120.7 | 92.5 | ... | 99.5 | 3.87 | 9 | 11 | 40 | 85 | 6 | 4 | 19 | 53 |
3 | 2017 | Kentucky | SEC | 2 | 32 | 6 | 4 | 27.72 | 119.1 | 91.4 | ... | 99.5 | 3.74 | 12 | 7 | 26 | 175 | 19 | 24 | 15 | 56 |
4 | 2017 | Florida | SEC | 4 | 27 | 9 | 5 | 27.50 | 116.9 | 89.5 | ... | 97.8 | 8.19 | 25 | 5 | 117 | 286 | 7 | 28 | 2 | 15 |
5 rows × 24 columns
Now that we have one function that will scrape and format the data for one year, all we need to do is call it for each year since 2002. Then we can save the data to a CSV file for later analysis.
def scrape_kenpom(year=None):
    if year:
        return formatted_kenpom(year)
    else:
        dfs = dict()
        # tqdm_notebook displays a progress bar while the years are scraped
        for year in tqdm_notebook(range(2002, 2018)):
            dfs[year] = formatted_kenpom(year)
        return pd.concat([dfs[year] for year in dfs], ignore_index=True)
df = scrape_kenpom()
df.shape
(5453, 24)
filename = 'kenpom-historical.csv'
csvfile = OUTPUT_DIR.joinpath(filename)
df.to_csv(csvfile, index=False, float_format='%g')
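As an optional sanity check, we can read the file back and confirm the shape matches what was scraped; this is just a quick verification, not part of the scraping itself.
check = pd.read_csv(csvfile)
check.shape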