This notebook scrapes data from the Washington Post's NCAA Tournament History site.
First, the site was used to create a list of all NCAA tournament games going back to 1985. This list was manually saved in this spreadsheet. The "Team Links" tab of that spreadsheet contains all the distinct teams that have played in the NCAA Tournament, in the form of hyperlinks to detailed Washington Post team pages. The team names and their hyperlink URLs were saved into this CSV file.
The code in this notebook loops through every team in the CSV file and scrapes the historical team information into a new CSV file.
import numpy as np
import pandas as pd
from tqdm.notebook import tqdm as tqdm_notebook  # tqdm_notebook now lives in tqdm.notebook
from pathlib import Path
PROJECT_DIR = Path.cwd().parent
DATA_DIR = PROJECT_DIR / 'data' / 'scraped'
DATA_DIR.mkdir(exist_ok=True, parents=True)
OUTPUT_DIR = PROJECT_DIR / 'data' / 'prepared'
OUTPUT_DIR.mkdir(exist_ok=True, parents=True)
filename = 'teams-washpost.csv'
teamsfile = DATA_DIR.joinpath(filename)
teams = pd.read_csv(teamsfile).set_index('name')
teams.shape
(297, 1)
teams.head()
| name | url |
| --- | --- |
| Air Force | https://apps.washingtonpost.com/sports/apps/li... |
| Akron | https://apps.washingtonpost.com/sports/apps/li... |
| Alabama | https://apps.washingtonpost.com/sports/apps/li... |
| Alabama A&M | https://apps.washingtonpost.com/sports/apps/li... |
| Alabama State | https://apps.washingtonpost.com/sports/apps/li... |
url = teams.loc['Air Force', 'url']
We can use the built-in pandas function `read_html` to read the simple HTML tables on each team's page.
pd.read_html(url)[0]
| | Year | Seed | Record | Round reached | Bid type | Region | Coach |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 2006 | 13 | 24-7 | 1First Round | At Large | East | Jeff Bzdelik |
| 1 | 2004 | 11 | 22-7 | 1First Round | At Large | South | Joe Scott |
The function below attempts to read each team's page and returns a small DataFrame for each team. Note that at the time I wrote this, several Washington Post pages were "under construction" and returned HTTP errors. Therefore, the function includes exception handling so that the code scrapes as much information as possible.
def team_info(team, url):
    try:
        df = pd.read_html(url)[0]
    except Exception as e:
        print(f'Failed for {team}:', e)
        print(url)
        return None
    # Strip trailing whitespace from the scraped column names
    df.columns = df.columns.str.rstrip()
    df['Team'] = team
    cols = [
        'Team',
        'Year',
        'Seed',
        'Region',
        'Bid type',
        'Round reached',
        'Coach',
        'Record',
    ]
    return df[cols]
We want to put all the information into one large DataFrame. The function below loops through all the teams and uses the pandas `concat()` function to do this.
def get_info(teams):
    info = dict()
    for name in tqdm_notebook(teams.index):
        url = teams.loc[name, 'url']
        info[name] = team_info(name, url)
    return pd.concat([info[name] for name in teams.index], ignore_index=True)
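Note that `team_info()` returns `None` for any team whose page failed to load. This is safe because `pd.concat()` silently drops `None` entries from its input list, so failed teams simply disappear from the combined result. A quick check of that behavior:

```python
import pandas as pd

a = pd.DataFrame({'Team': ['Air Force']})
b = None  # simulates a team whose page returned an HTTP error
c = pd.DataFrame({'Team': ['Akron']})

# The None entry is dropped silently; only the two real frames are concatenated
combined = pd.concat([a, b, c], ignore_index=True)
```
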
Now we can scrape all the information.
df = get_info(teams)
Failed for California: HTTP Error 500: Internal Server Error
https://apps.washingtonpost.com/sports/apps/live-updating-mens-ncaa-basketball-bracket/schools/california/
Failed for Vanderbilt: HTTP Error 500: Internal Server Error
https://apps.washingtonpost.com/sports/apps/live-updating-mens-ncaa-basketball-bracket/schools/vanderbilt/
Failed for Wichita State: HTTP Error 500: Internal Server Error
https://apps.washingtonpost.com/sports/apps/live-updating-mens-ncaa-basketball-bracket/schools/wichita-state/
len(df)
2113
df.head()
| | Team | Year | Seed | Region | Bid type | Round reached | Coach | Record |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | Air Force | 2006 | 13 | East | At Large | 1First Round | Jeff Bzdelik | 24-7 |
| 1 | Air Force | 2004 | 11 | South | At Large | 1First Round | Joe Scott | 22-7 |
| 2 | Akron | 2013 | 12 | South | Automatic Qualifier | 1First Round | Keith Dambrot | 26-7 |
| 3 | Akron | 2011 | 15 | Southwest | Automatic Qualifier | 1First Round | Keith Dambrot | 23-13 |
| 4 | Akron | 2009 | 13 | South | Automatic Qualifier | 1First Round | Keith Dambrot | 23-13 |
Let's clean up the "Round reached" column by splitting the numeric prefix and the round name into two distinct columns.
df['Round reached'].unique()
array(['1First Round', '2Second Round', '4Elite Eight', '3Sweet 16', '0Play-In', '6National Championship', '5Final Four'], dtype=object)
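Each value starts with a single digit followed by the round name, so a regex like `^(\d)(.+)` can split them. A quick sketch on a few of the values above:

```python
import pandas as pd

rounds = pd.Series(['1First Round', '3Sweet 16', '6National Championship'])

# str.extract returns one column per capture group:
# column 0 is the leading digit, column 1 is the remaining text
parts = rounds.str.extract(r'^(\d)(.+)', expand=True)
```
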
def format_round(df):
    # Use a raw string so the \d escape is passed through to the regex engine
    df[['Eliminated', 'Round Reached']] = df['Round reached'].str.extract(r'^(\d)(.+)', expand=True)
    return df.drop(columns=['Round reached'])
df = format_round(df)
df.head()
| | Team | Year | Seed | Region | Bid type | Coach | Record | Eliminated | Round Reached |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | Air Force | 2006 | 13 | East | At Large | Jeff Bzdelik | 24-7 | 1 | First Round |
| 1 | Air Force | 2004 | 11 | South | At Large | Joe Scott | 22-7 | 1 | First Round |
| 2 | Akron | 2013 | 12 | South | Automatic Qualifier | Keith Dambrot | 26-7 | 1 | First Round |
| 3 | Akron | 2011 | 15 | Southwest | Automatic Qualifier | Keith Dambrot | 23-13 | 1 | First Round |
| 4 | Akron | 2009 | 13 | South | Automatic Qualifier | Keith Dambrot | 23-13 | 1 | First Round |
filename = 'team_history-washpost.csv'
csvfile = OUTPUT_DIR.joinpath(filename)
df.to_csv(csvfile, index=False)
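As a quick sanity check, the saved CSV can be read back to confirm the round trip preserves the columns. A minimal sketch with a stand-in DataFrame and a temporary file:

```python
import tempfile
from pathlib import Path

import pandas as pd

sample = pd.DataFrame({
    'Team': ['Air Force'],
    'Year': [2006],
    'Eliminated': ['1'],
    'Round Reached': ['First Round'],
})

with tempfile.TemporaryDirectory() as tmp:
    csvfile = Path(tmp) / 'team_history-washpost.csv'
    # index=False keeps the row index out of the file, so the
    # columns read back exactly as they were written
    sample.to_csv(csvfile, index=False)
    restored = pd.read_csv(csvfile)
```
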