This notebook scrapes data from the Washington Post's NCAA Tournament History site.
First, the site was used to create a list of all NCAA tournament games going back to 1985. This list was manually saved in this spreadsheet. The "Team Links" tab of that spreadsheet contains all the distinct teams that have played in the NCAA Tournament, in the form of hyperlinks to detailed Washington Post team pages. The team names and their hyperlink URLs were saved into this CSV file.
The code in this notebook loops through every team in the CSV file and scrapes the historical team information into a new CSV file.
import numpy as np
import pandas as pd
from tqdm.notebook import tqdm as tqdm_notebook  # tqdm_notebook now lives in tqdm.notebook
from pathlib import Path
PROJECT_DIR = Path.cwd().parent
DATA_DIR = PROJECT_DIR / 'data' / 'scraped'
DATA_DIR.mkdir(exist_ok=True, parents=True)
OUTPUT_DIR = PROJECT_DIR / 'data' / 'prepared'
OUTPUT_DIR.mkdir(exist_ok=True, parents=True)
filename = 'teams-washpost.csv'
teamsfile = DATA_DIR.joinpath(filename)
teams = pd.read_csv(teamsfile).set_index('name')
teams.shape
(297, 1)
teams.head()
| name | url |
| --- | --- |
| Air Force | https://apps.washingtonpost.com/sports/apps/li... |
| Akron | https://apps.washingtonpost.com/sports/apps/li... |
| Alabama | https://apps.washingtonpost.com/sports/apps/li... |
| Alabama A&M | https://apps.washingtonpost.com/sports/apps/li... |
| Alabama State | https://apps.washingtonpost.com/sports/apps/li... |
url = teams.loc['Air Force', 'url']
We can use the built-in pandas function `read_html` to read the simple HTML tables on each team's page.
pd.read_html(url)[0]
| | Year | Seed | Record | Round reached | Bid type | Region | Coach |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 2006 | 13 | 24-7 | 1First Round | At Large | East | Jeff Bzdelik |
| 1 | 2004 | 11 | 22-7 | 1First Round | At Large | South | Joe Scott |
The function below attempts to read each team's page and returns a small DataFrame for each team. Note that at the time I wrote this, several Washington Post pages were "under construction" and returned HTTP errors. Therefore, the function includes exception handling so that the code scrapes as much information as possible.
def team_info(team, url):
    try:
        df = pd.read_html(url)[0]
    except Exception as e:
        print(f'Failed for {team}:', e)
        print(url)
        return None
    # Strip trailing whitespace from the scraped column names
    df.columns = df.columns.str.rstrip()
    df['Team'] = team
    cols = [
        'Team',
        'Year',
        'Seed',
        'Region',
        'Bid type',
        'Round reached',
        'Coach',
        'Record',
    ]
    return df[cols]
We want to put all the information into one large DataFrame. The function below loops through all the teams and uses the pandas `concat()` function to do this.
def get_info(teams):
    info = dict()
    for name in tqdm_notebook(teams.index):
        url = teams.loc[name, 'url']
        info[name] = team_info(name, url)
    return pd.concat([info[name] for name in teams.index], ignore_index=True)
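Note that `team_info()` returns `None` for any team whose page failed to load. This is safe because `pd.concat()` silently drops `None` entries from its input list, so failed teams simply disappear from the combined result. A quick check of that behavior:

```python
import pandas as pd

a = pd.DataFrame({'Team': ['Air Force']})
b = None  # simulates a team whose page returned an HTTP error
c = pd.DataFrame({'Team': ['Akron']})

# The None entry is dropped silently; only the two real frames are concatenated
combined = pd.concat([a, b, c], ignore_index=True)
```
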
Now we can scrape all the information.
df = get_info(teams)
Failed for California: HTTP Error 500: Internal Server Error
https://apps.washingtonpost.com/sports/apps/live-updating-mens-ncaa-basketball-bracket/schools/california/
Failed for Vanderbilt: HTTP Error 500: Internal Server Error
https://apps.washingtonpost.com/sports/apps/live-updating-mens-ncaa-basketball-bracket/schools/vanderbilt/
Failed for Wichita State: HTTP Error 500: Internal Server Error
https://apps.washingtonpost.com/sports/apps/live-updating-mens-ncaa-basketball-bracket/schools/wichita-state/
len(df)
2113
df.head()
| | Team | Year | Seed | Region | Bid type | Round reached | Coach | Record |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | Air Force | 2006 | 13 | East | At Large | 1First Round | Jeff Bzdelik | 24-7 |
| 1 | Air Force | 2004 | 11 | South | At Large | 1First Round | Joe Scott | 22-7 |
| 2 | Akron | 2013 | 12 | South | Automatic Qualifier | 1First Round | Keith Dambrot | 26-7 |
| 3 | Akron | 2011 | 15 | Southwest | Automatic Qualifier | 1First Round | Keith Dambrot | 23-13 |
| 4 | Akron | 2009 | 13 | South | Automatic Qualifier | 1First Round | Keith Dambrot | 23-13 |
Let's clean up the "Round reached" column by splitting the numeric prefix and the round name into two distinct columns.
df['Round reached'].unique()
array(['1First Round', '2Second Round', '4Elite Eight', '3Sweet 16', '0Play-In', '6National Championship', '5Final Four'], dtype=object)
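Each value starts with a single digit followed by the round name, so a regex like `^(\d)(.+)` can split them. A quick sketch on a few of the values above:

```python
import pandas as pd

rounds = pd.Series(['1First Round', '3Sweet 16', '6National Championship'])

# str.extract returns one column per capture group:
# column 0 is the leading digit, column 1 is the remaining text
parts = rounds.str.extract(r'^(\d)(.+)', expand=True)
```
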
def format_round(df):
    # Use a raw string so the \d escape is passed through to the regex engine
    df[['Eliminated', 'Round Reached']] = df['Round reached'].str.extract(r'^(\d)(.+)', expand=True)
    return df.drop(columns=['Round reached'])
df = format_round(df)
df.head()
| | Team | Year | Seed | Region | Bid type | Coach | Record | Eliminated | Round Reached |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | Air Force | 2006 | 13 | East | At Large | Jeff Bzdelik | 24-7 | 1 | First Round |
| 1 | Air Force | 2004 | 11 | South | At Large | Joe Scott | 22-7 | 1 | First Round |
| 2 | Akron | 2013 | 12 | South | Automatic Qualifier | Keith Dambrot | 26-7 | 1 | First Round |
| 3 | Akron | 2011 | 15 | Southwest | Automatic Qualifier | Keith Dambrot | 23-13 | 1 | First Round |
| 4 | Akron | 2009 | 13 | South | Automatic Qualifier | Keith Dambrot | 23-13 | 1 | First Round |
filename = 'team_history-washpost.csv'
csvfile = OUTPUT_DIR.joinpath(filename)
df.to_csv(csvfile, index=False)
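As a quick sanity check, the saved CSV can be read back to confirm the round trip preserves the columns. A minimal sketch with a stand-in DataFrame and a temporary file:

```python
import tempfile
from pathlib import Path

import pandas as pd

sample = pd.DataFrame({
    'Team': ['Air Force'],
    'Year': [2006],
    'Eliminated': ['1'],
    'Round Reached': ['First Round'],
})

with tempfile.TemporaryDirectory() as tmp:
    csvfile = Path(tmp) / 'team_history-washpost.csv'
    # index=False keeps the row index out of the file, so the
    # columns read back exactly as they were written
    sample.to_csv(csvfile, index=False)
    restored = pd.read_csv(csvfile)
```
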