Python code for predicting NCAA field hockey games

Neal Caren - University of North Carolina, Chapel Hill

Current predictions

This code follows my standard workflow for collecting and analyzing data from the web, so I think it might be useful, even if you don't care about sports rankings or field hockey.

I begin by gathering a list of all the web pages I want to scrape with the get_teams function. The next pair of functions, get_schedule and read_table_row, scrape the pages that have the data I'm interested in. In this case, I'm interested in all the played and scheduled games, and I'm looking to grab the date, opponent, and score. For unknown reasons, schools sometimes go by different names, so I have to do some name cleaning. The last step, beginning with generate_skill, does a little bit of analysis. Here, I'm simply computing a power ranking that is the average victory margin, adjusted for home field advantage and the opponent's average victory margin. I use a similar method to compute total points scored. Finally, I estimate predicted values for the games that haven't been played yet. If I were doing anything fancier, I would use Pandas and/or scikit-learn or export the data.

In [1]:
import requests
import re
import csv
import numpy as np

from time import sleep
In [2]:
def get_teams():
    #Reads FieldHockeyCorner and grabs all the NCAA Division 1 team names and abbreviations
    url = ''
    page = requests.get(url)
    teams = re.findall("tcode=(.*?)&div=1'>(.*?)<",page.text)
    return teams
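To see what the pattern in get_teams captures, here is a minimal sketch run against a made-up fragment of HTML. The sample markup, the team codes, and the link path are assumptions for illustration; the real page may be laid out differently.

```python
import re

# Hypothetical markup: the real page's links may look different.
sample_html = (
    "<a href='schedule?tcode=ALB&div=1'>Albany</a>"
    "<a href='schedule?tcode=DEL&div=1'>Delaware</a>"
    "<a href='schedule?tcode=XYZ&div=3'>Some D3 School</a>"
)

# Same pattern as get_teams: capture the team code and the display name
# for every link that ends in div=1.
teams = re.findall("tcode=(.*?)&div=1'>(.*?)<", sample_html)

# Each match is a (code, name) tuple, which is why get_schedule
# uses team[0] for the URL and team[1] for the printed name.
```

Because (.*?) can run across tag boundaries, a non-Division 1 link that is followed by a Division 1 link could be swallowed into a single match. An HTML parser would likely be sturdier if the page layout is less regular than this sample.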
In [3]:
def get_schedule(team):
    #Grabs the schedule for a specific team
    #Includes games played and to be played
    print 'Gettings schedule for', team[1]
    url = '' % team[0]
    page = requests.get(url)
    table = re.findall(r'<tr><td>(.*?)<\/td><td>(.*?)<\/td><td>(.*?)<\/td><td>(.*?)<\/td><\/tr>',page.text)
    table = [ read_table_row(row, team[1]) for row in table]
    return table
In [4]:
def read_table_row(row, team):
    #Decodes a row of the table from Fieldhockey corner
    try:
        #The score is the first item in the result cell that contains a dash
        score = [item for item in row[2].split() if '-' in item][0]
        own_score =   score.split('-')[0]
        other_score = score.split('-')[1]
    except IndexError:
        #No score found, so the game hasn't been played yet
        other_score = '.'
        own_score = '.'
    if 'vs. ' in row[1]:
        location = 'Neutral'
    elif '<b>' in row[1]:
        location = 'Home'
    elif 'at' in row[1]:
        location = 'Away'
    else:
        location = 'Other'
    return {'team'     : team,
            'date'     : row[0],
            'opponent' : row[1].replace('vs. ','').replace('</b>','').replace('<b>','').replace('at ',''),
            'notes'    : row[3],
            'location' : location,
            'own_score': own_score,
            'other_score' : other_score}
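The two decisions inside read_table_row (is there a score, and where was the game played) can be tried out on their own. This is a minimal sketch: parse_score and parse_location are hypothetical helper names, and the sample cell contents are made up to mirror the markers the function looks for.

```python
def parse_score(result_cell):
    """Return (own_score, other_score) as strings, or ('.', '.') if unplayed."""
    tokens = [item for item in result_cell.split() if '-' in item]
    if not tokens:
        # No dash-separated token means no result has been posted yet
        return '.', '.'
    parts = tokens[0].split('-')
    return parts[0], parts[1]


def parse_location(opponent_cell):
    """Apply the same checks, in the same order, as read_table_row."""
    if 'vs. ' in opponent_cell:
        return 'Neutral'
    elif '<b>' in opponent_cell:
        return 'Home'
    elif 'at' in opponent_cell:
        return 'Away'
    return 'Other'


# Hypothetical cell contents
played = parse_score('W 3-2 (OT)')
unplayed = parse_score('')
locations = [parse_location(cell)
             for cell in ('vs. Syracuse', '<b>Duke</b>', 'at Drexel')]
```

One thing this makes visible: the 'at' test is a plain substring check, so an opponent cell with no marker at all, such as 'Kent State', would also come back as 'Away' because 'at' occurs inside 'State'. Checking for 'at ' at the start of the cell might be a safer variant, depending on how the site formats its rows.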
In [5]:
#grab the teams and their schedules
teams = get_teams()
schedules = [ get_schedule(team) for team in teams]
Gettings schedule for Albany
Gettings schedule for Delaware
Gettings schedule for Longwood
Gettings schedule for Pacific
Gettings schedule for Syracuse
Gettings schedule for American
Gettings schedule for Drexel
Gettings schedule for Louisville
Gettings schedule for Penn
Gettings schedule for Temple
Gettings schedule for Appalachian St.
Gettings schedule for Duke
Gettings schedule for Maine
Gettings schedule for Penn State
Gettings schedule for Towson
Gettings schedule for Ball State
Gettings schedule for Fairfield
Gettings schedule for Maryland
Gettings schedule for Princeton
Gettings schedule for UC Davis
Gettings schedule for Boston C.
Gettings schedule for Georgetown
Gettings schedule for Massachusetts
Gettings schedule for Providence
Gettings schedule for UMass-Lowell
Gettings schedule for Boston U.
Gettings schedule for Harvard
Gettings schedule for Miami
Gettings schedule for Quinnipiac
Gettings schedule for Vermont
Gettings schedule for Brown
Gettings schedule for Hofstra
Gettings schedule for Michigan
Gettings schedule for Radford
Gettings schedule for Villanova
Gettings schedule for Bryant
Gettings schedule for Holy Cross
Gettings schedule for Michigan State
Gettings schedule for Richmond
Gettings schedule for Virginia
Gettings schedule for Bucknell
Gettings schedule for Indiana
Gettings schedule for Missouri State
Gettings schedule for Rider
Gettings schedule for VCU
Gettings schedule for California
Gettings schedule for Iowa
Gettings schedule for Monmouth
Gettings schedule for Robert Morris
Gettings schedule for Wake Forest
Gettings schedule for Central Michigan
Gettings schedule for James Madison
Gettings schedule for New Hampshire
Gettings schedule for Rutgers
Gettings schedule for William & Mary
Gettings schedule for Colgate
Gettings schedule for Kent State
Gettings schedule for North Carolina
Gettings schedule for Sacred Heart
Gettings schedule for Yale
Gettings schedule for Columbia
Gettings schedule for La Salle
Gettings schedule for Northeastern
Gettings schedule for Saint Francis
Gettings schedule for Connecticut
Gettings schedule for Lafayette
Gettings schedule for Northwestern
Gettings schedule for Saint Joseph's
Gettings schedule for Cornell
Gettings schedule for Lehigh
Gettings schedule for Ohio
Gettings schedule for Saint Louis
Gettings schedule for Dartmouth
Gettings schedule for Liberty
Gettings schedule for Ohio State
Gettings schedule for Siena
Gettings schedule for Davidson
Gettings schedule for Lock Haven
Gettings schedule for Old Dominion
Gettings schedule for Stanford
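The import cell brings in sleep, though none of the functions shown here call it. If the goal is to pause between requests, one way to do that is sketched below. The names fetch_all and delay are made up for this example, and the one-second default is an arbitrary choice, not a figure from the original code.

```python
from time import sleep


def fetch_all(items, fetch, delay=1.0):
    """Call fetch(item) for each item, pausing `delay` seconds between calls."""
    results = []
    for i, item in enumerate(items):
        if i > 0:
            # Wait before every request except the first
            sleep(delay)
        results.append(fetch(item))
    return results


# Stand-in for get_schedule so the sketch runs without network access
doubled = fetch_all([1, 2, 3], lambda x: x * 2, delay=0)
```

With the functions defined above, the equivalent call would be fetch_all(teams, get_schedule) in place of the list comprehension.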
In [6]:
#Flatten the list so it is a list of games
games  = [item for sublist in schedules for item in sublist]
In [7]:
#Split into played and unplayed
played_games =   [game for game in games if game['own_score']!='.']
unplayed_games = [game for game in games if game['own_score']=='.']
In [8]:
#Because some names are listed in multiple ways
name_clean = {'Boston C.': 'Boston College',
              'Boston U.': 'Boston University',
              'Vcu'      : 'Virginia Commonwealth',
              'Appalachian St.' : 'Appalachian State',
              'Uc Davis' : 'UC Davis'}
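The keys 'Vcu' and 'Uc Davis' look odd until you notice that the later functions call .title() on every name before the lookup. A minimal sketch of that two-step cleaning follows; clean_name is a hypothetical wrapper, since the original code inlines the same expression.

```python
name_clean = {'Boston C.': 'Boston College',
              'Boston U.': 'Boston University',
              'Vcu'      : 'Virginia Commonwealth',
              'Appalachian St.' : 'Appalachian State',
              'Uc Davis' : 'UC Davis'}


def clean_name(raw):
    # Title-case first, then map known variants to one canonical name
    titled = raw.title()
    return name_clean.get(titled, titled)


cleaned = [clean_name(n) for n in ('VCU', 'UC Davis', 'Boston C.', 'Maryland')]
# 'VCU'.title() gives 'Vcu', which is why the dictionary key is spelled that way

# str.title() also capitalizes the letter after an apostrophe
quirk = clean_name("Saint Joseph's")
```

The apostrophe behavior is harmless as long as every spelling of a school goes through the same function, but it would show up in any printed output.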
In [9]:
def generate_skill(games, n=50):
    #Ranking algorithm. Average winning margin adjusted for opponents winning margin
    #N is the number of times to iterate through. Seems to converge after 10 or so loops
    skill = {}
    for x in range(0, n):
        team_skill_list = {}
        for game in games:
            #clean up the team names
            team = name_clean.get(game['team'].title(),game['team'].title())
            opponent = name_clean.get(game['opponent'].title(),game['opponent'].title())
            #Hard coded Home Field Advantage at .7, which was the figure from 2012
            if game['location'] == 'Home':
                hfa = .7
            else:
                hfa = 0
            #figure out how unexpected the margin of victory was
            expected_margin =  skill.get(team,0)     - skill.get(opponent,0) + hfa 
            observed_margin = int(game['own_score']) - int(game['other_score'])
            difference = observed_margin - expected_margin
            #Add the unexpected portion to a list by team
            try:
                team_skill_list[team].append(difference)
            except KeyError:
                team_skill_list[team] = [difference]

        #New skill is old skill plus average of the new unexpected portion
        skill = {team: np.mean(team_skill_list[team]) + skill.get(team,0) for team in team_skill_list}

        #center the skills to prevent drift
        mean =  np.mean([skill[team] for team in skill])
        skill = {team: skill[team] - mean  for team in skill}
    return skill
In [10]:
skills = generate_skill(played_games)
In [11]:
#take a look at the top teams
for team in sorted(skills, key=skills.get, reverse=True)[:5]:
    print team,skills[team]
Maryland 5.93977027965
North Carolina 5.16134861747
Virginia 3.81517327634
Connecticut 3.66998671713
Iowa 3.35718763212
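To check that the iteration does what the description says, here is a minimal sketch of the same update on a made-up three-team round robin. The function name rate, the tuple layout, and the scores are all assumptions for this example; it uses plain Python instead of NumPy so it runs on its own. As in the scraped data, each game is listed once from each team's point of view.

```python
def rate(games, n=50, hfa=0.7):
    """Iterative margin ratings from (team, opponent, location, own, other) tuples."""
    skill = {}
    for _ in range(n):
        surprises = {}
        for team, opponent, location, own, other in games:
            bonus = hfa if location == 'Home' else 0.0
            expected = skill.get(team, 0.0) - skill.get(opponent, 0.0) + bonus
            # Collect how far each result was from the expected margin
            surprises.setdefault(team, []).append((own - other) - expected)
        # New rating = old rating + average surprise
        skill = {t: skill.get(t, 0.0) + sum(v) / float(len(v))
                 for t, v in surprises.items()}
        # Center the ratings so they average to zero
        center = sum(skill.values()) / float(len(skill))
        skill = {t: s - center for t, s in skill.items()}
    return skill


# Made-up results: A beat B 2-0, B beat C 1-0, A beat C 3-0, all neutral site
games = [
    ('A', 'B', 'Neutral', 2, 0), ('B', 'A', 'Neutral', 0, 2),
    ('B', 'C', 'Neutral', 1, 0), ('C', 'B', 'Neutral', 0, 1),
    ('A', 'C', 'Neutral', 3, 0), ('C', 'A', 'Neutral', 0, 3),
]
ratings = rate(games)
```

In this toy case the ratings settle near 5/3, -1/3 and -4/3, so the gaps between them reproduce the observed margins: A over B by 2, B over C by 1, and A over C by 3. Real schedules are not this tidy, so the ratings there are a compromise across inconsistent results.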
In [12]:
def generate_total(games, n=20):
    #Same function as above, except for total points scored rather than margin
    #Should probably be combined with above
    total = {}
    for x in range(0, n):
        team_total_list = {}
        for game in games:
            team = name_clean.get(game['team'].title(),game['team'].title())
            opponent = name_clean.get(game['opponent'].title(),game['opponent'].title())
            expected_total =  total.get(team,0)     + total.get(opponent,0) + 4.7 
            observed_total = int(game['own_score']) + int(game['other_score'])
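            difference = observed_total - expected_total
            #Add the unexpected portion to a list by team
            try:
                team_total_list[team].append(difference)
            except KeyError:
                team_total_list[team] = [difference]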

        #New total is old total plus new average
        total = {team: np.mean(team_total_list[team]) + total.get(team,0) for team in team_total_list}
    return total
In [13]:
totals = generate_total(played_games)
In [14]:
for team in sorted(totals, key=totals.get, reverse=True)[:10]:
    print team,totals[team]
Kent State 2.64308192784
Dartmouth 2.41186253797
Maryland 2.3897954595
Saint Louis 1.99132275916
Penn 1.74421499118
North Carolina 1.54463283301
Vermont 1.4099056861
Boston College 1.35897993364
Appalachian State 1.15298955323
Longwood 1.07484246491
In [15]:
def make_prediction(games,skill,total):
    # for unplayed games, come up with a prediction based on location, 
    # and who is playing
    predictions = {}
    for game in games:
        team = name_clean.get(game['team'].title(),game['team'].title())
        opponent = name_clean.get(game['opponent'].title(),game['opponent'].title())
        total_predict = 4.7 + total.get(team,0) + total.get(opponent,0)
        if game['location']=='Home':
            hfa = .7
            modifier = "at"
        else:
            hfa = 0
            modifier ="vs"
        #Quick hack to sort by date
        date = game['date']
        day = int(date.split()[1])
        if 'Sep. ' in date:
            day = day + 900
        elif 'Oct. ' in date:
            day = day + 1000
        elif 'Nov. ' in date:
            day = day + 1100
        elif 'Dec. ' in date:
            day = day + 1200            
        #Skip away games to avoid duplication
        if game['location']!='Away':
            expected_margin =  skill.get(team,0)     - skill.get(opponent,0) + hfa

            #round both numbers to the nearest half goal
            expected_margin = round(expected_margin * 2,0)/2
            total_predict = round(total_predict*2,0)/2
            #fix for cases where margin is greater than total:
            if expected_margin > total_predict:
                expected_margin = total_predict
            #note: a zero margin prints as -0.0 because the sign is flipped below
            row = [date,'%s %s %s' % (opponent,modifier,team),"%.1f" % -expected_margin,"%.1f" % total_predict]
            #Group the rows by their date key
            try:
                predictions[day].append(row)
            except KeyError:
                predictions[day] = [row]
    return predictions
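The arithmetic inside make_prediction is easy to trace by hand. The sketch below plugs in the ratings printed earlier for Maryland and North Carolina and works through a hypothetical game with Maryland at home. That matchup is not in the prediction list; it is used only because both teams appear in both printed tables. half_round is a made-up helper name for the rounding step.

```python
# Ratings copied from the printed output above
skill = {'Maryland': 5.93977027965, 'North Carolina': 5.16134861747}
total = {'Maryland': 2.3897954595, 'North Carolina': 1.54463283301}


def half_round(x):
    # Round to the nearest half goal, as make_prediction does
    return round(x * 2, 0) / 2


# Home team's expected margin: rating gap plus the 0.7 home field advantage
margin = half_round(skill['Maryland'] - skill['North Carolina'] + 0.7)

# Expected total goals: the 4.7 baseline plus both teams' total ratings
goals = half_round(4.7 + total['Maryland'] + total['North Carolina'])

# The printed line shows the margin with its sign flipped
line = '%s at %s %.1f %.1f' % ('North Carolina', 'Maryland', -margin, goals)
```

The raw margin is about 1.48 and the raw total about 8.63, which round to 1.5 and 8.5, so the line would read "North Carolina at Maryland -1.5 8.5".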
In [16]:
preds = make_prediction(unplayed_games,skills,totals)
In [21]:
for day in preds:
    print ' '.join(preds[day][0])
Oct. 25 William & Mary at Drexel -3.0 4.0
Oct. 26 Fairfield at Albany -4.0 5.5
Oct. 27 Princeton at Albany -2.0 6.0
Oct. 28 Maine at Providence 1.0 4.0
Oct. 30 Villanova at Penn -5.0 5.5
Oct. 31 Stanford at California 1.5 1.5
Sep. 24 UC Davis at Miami -4.0 5.0
Sep. 25 Rider at Drexel -4.0 4.0
Sep. 26 Richmond at Longwood 3.0 4.5
Sep. 27 Sacred Heart at Albany -5.0 5.0
Sep. 28 Stanford vs Syracuse -0.5 3.0
Sep. 29 Stanford at Albany -1.0 4.5
Nov. 1 Delaware at Drexel -1.0 5.0
Nov. 2 North Carolina at Syracuse 1.5 6.0
Nov. 3 Villanova at Delaware -4.5 4.5
Nov. 9 Princeton at Penn -0.0 7.0
Nov. 10 Harvard at Columbia -2.5 4.0
Oct. 1 Bryant at Holy Cross 0.5 4.0
Oct. 2 Lafayette at Penn -1.5 6.0
Oct. 4 William & Mary at Delaware -2.5 5.0
Oct. 5 Virginia at Syracuse -0.0 4.5
Oct. 6 Virginia at Albany 0.5 5.5
Oct. 7 Missouri State at Iowa -3.0 3.0
Oct. 8 Hofstra at Maryland -7.5 7.5
Oct. 9 Liberty at Longwood 0.5 4.5
Oct. 10 Richmond at William & Mary 2.0 3.5
Oct. 11 Towson at Delaware -4.0 4.0
Oct. 12 Vermont at Albany -7.0 7.0
Oct. 13 Princeton at Delaware -1.0 6.0
Oct. 15 Holy Cross at Vermont -0.5 5.0
Oct. 16 Radford at Wake Forest -4.5 4.5
Oct. 17 Lafayette at Drexel -2.5 3.5
Oct. 18 New Hampshire at Albany -3.0 6.5
Oct. 19 Wheelock at American -3.0 4.5
Oct. 20 Radford at Longwood -2.0 5.5
Oct. 22 Central Michigan at Michigan State -1.0 5.0
Oct. 23 Ohio State at Ball State 3.0 3.5
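The listing above comes out in dictionary order, not date order, because the loop iterates over preds directly. The numeric key built inside make_prediction is what makes sorting possible. A minimal sketch of that key on its own, with date_key and MONTH_OFFSET as hypothetical names and a few dates taken from the output:

```python
MONTH_OFFSET = {'Sep.': 900, 'Oct.': 1000, 'Nov.': 1100, 'Dec.': 1200}


def date_key(date):
    # 'Oct. 25' becomes 1025, so comparing keys compares month first, then day
    month, day = date.split()[:2]
    return MONTH_OFFSET.get(month, 0) + int(day)


dates = ['Oct. 25', 'Sep. 24', 'Nov. 1', 'Oct. 1']
ordered = sorted(dates, key=date_key)
```

Since preds is already keyed by this number, looping over sorted(preds) should print the games chronologically.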

Not shown: Ugly code that turns the predictions into an HTML table.