Neal Caren - University of North Carolina, Chapel Hill
This code follows my standard workflow for collecting and analyzing data from the web, so I think it might be useful, even if you don't care about sports rankings or field hockey.
I begin by gathering a list of all the web pages I want to scrape with the `get_teams` function. The next pair of functions, `get_schedule` and `read_table_row`, scrape the pages that have the data I'm interested in. In this case, I want all the played and scheduled games, and I grab the date, opponent, and score for each. For unknown reasons, schools sometimes go by different names, so I have to do some name cleaning. The last step, beginning with `generate_skill`, does a little bit of analysis. Here, I'm simply computing a power ranking that is the average victory margin, adjusting for home field advantage and the opponent's average victory margin. I use a similar method to compute total points scored. Finally, I estimate some predicted values for the games that haven't been played yet. If I were doing anything fancier, I would use Pandas and/or scikit-learn or export the data.
import requests
import re
import csv
import numpy as np
from time import sleep
def get_teams():
    #Reads FieldHockeyCorner and grabs all the NCAA Division 1 team names and abbreviations
    url = 'http://www.fieldhockeycorner.com/scores.php?div=1'
    page = requests.get(url)
    #Each match is a (team code, team name) tuple
    teams = re.findall("tcode=(.*?)&div=1'>(.*?)<",page.text)
    return teams
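To make the regular expression concrete, here is a small sketch that runs the same pattern against a made-up fragment of HTML. The markup and the team codes (`unc`, `md`) are assumptions for illustration and are not copied from the real page.

```python
import re

# Hypothetical HTML fragment; the real page's markup may differ.
sample_html = (
    "<a href='scores.php?action=schedule&tcode=unc&div=1'>North Carolina</a>"
    "<a href='scores.php?action=schedule&tcode=md&div=1'>Maryland</a>"
)

# Same pattern as in get_teams: each match is a (code, name) tuple.
sample_teams = re.findall("tcode=(.*?)&div=1'>(.*?)<", sample_html)
```

Each tuple pairs the code that goes into the schedule URL with the display name, which is why `get_schedule` uses `team[0]` for the URL and `team[1]` for the label.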
def get_schedule(team):
    #Grabs the schedule for a specific team
    #Includes games played and to be played
    #Pause between requests to go easy on the server
    sleep(.5)
    print('Gettings schedule for %s' % team[1])
    url = 'http://www.fieldhockeycorner.com/scores.php?action=schedule&tcode=%s&div=1' % team[0]
    page = requests.get(url)
    #Raw string so the backslashes reach the regex engine unchanged; each row has four cells
    table = re.findall(r'<tr><td>(.*?)<\/td><td>(.*?)<\/td><td>(.*?)<\/td><td>(.*?)<\/td><\/tr>',page.text)
    table = [ read_table_row(row, team[1]) for row in table]
    return table
def read_table_row(row, team):
    #Decodes a row of the table from FieldHockeyCorner
    try:
        #The score is the first token in the result cell that contains a hyphen
        score = [item for item in row[2].split() if '-' in item][0]
        own_score = score.split('-')[0]
        other_score = score.split('-')[1]
    except IndexError:
        #No score token found, so the game has not been played; '.' marks missing
        other_score = '.'
        own_score = '.'
    if 'vs. ' in row[1]:
        location = 'Neutral'
    elif '<b>' in row[1]:
        location = 'Home'
    elif 'at' in row[1]:
        location = 'Away'
    else:
        location = 'Other'
    return {'team' : team,
            'date' : row[0],
            'opponent' : row[1].replace('vs. ','').replace('</b>','').replace('<b>','').replace('at ',''),
            'notes' : row[3],
            'location' : location,
            'own_score': own_score,
            'other_score' : other_score
            }
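The score handling is the part of `read_table_row` most likely to surprise, so here it is in isolation. This is a sketch, not the function above: `parse_score` is a hypothetical helper, and the two example cells are guesses at what a result cell might contain.

```python
def parse_score(result_cell):
    """Return (own_score, other_score) as strings, or ('.', '.') if no score."""
    # Keep only whitespace-separated tokens that contain a hyphen.
    candidates = [item for item in result_cell.split() if '-' in item]
    if not candidates:
        # Nothing that looks like a score: treat the game as unplayed.
        return '.', '.'
    own, other = candidates[0].split('-')[:2]
    return own, other

played = parse_score('W 3-1')      # hypothetical cell for a finished game
upcoming = parse_score('1:00 PM')  # hypothetical cell for a scheduled game
```

Scores stay as strings here, just as in the scraper, and only get converted with `int()` once the analysis starts.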
#grab the teams and their schedules
teams = get_teams()
schedules = [ get_schedule(team) for team in teams]
Gettings schedule for Albany
Gettings schedule for Delaware
Gettings schedule for Longwood
Gettings schedule for Pacific
Gettings schedule for Syracuse
Gettings schedule for American
Gettings schedule for Drexel
Gettings schedule for Louisville
Gettings schedule for Penn
Gettings schedule for Temple
Gettings schedule for Appalachian St.
Gettings schedule for Duke
Gettings schedule for Maine
Gettings schedule for Penn State
Gettings schedule for Towson
Gettings schedule for Ball State
Gettings schedule for Fairfield
Gettings schedule for Maryland
Gettings schedule for Princeton
Gettings schedule for UC Davis
Gettings schedule for Boston C.
Gettings schedule for Georgetown
Gettings schedule for Massachusetts
Gettings schedule for Providence
Gettings schedule for UMass-Lowell
Gettings schedule for Boston U.
Gettings schedule for Harvard
Gettings schedule for Miami
Gettings schedule for Quinnipiac
Gettings schedule for Vermont
Gettings schedule for Brown
Gettings schedule for Hofstra
Gettings schedule for Michigan
Gettings schedule for Radford
Gettings schedule for Villanova
Gettings schedule for Bryant
Gettings schedule for Holy Cross
Gettings schedule for Michigan State
Gettings schedule for Richmond
Gettings schedule for Virginia
Gettings schedule for Bucknell
Gettings schedule for Indiana
Gettings schedule for Missouri State
Gettings schedule for Rider
Gettings schedule for VCU
Gettings schedule for California
Gettings schedule for Iowa
Gettings schedule for Monmouth
Gettings schedule for Robert Morris
Gettings schedule for Wake Forest
Gettings schedule for Central Michigan
Gettings schedule for James Madison
Gettings schedule for New Hampshire
Gettings schedule for Rutgers
Gettings schedule for William & Mary
Gettings schedule for Colgate
Gettings schedule for Kent State
Gettings schedule for North Carolina
Gettings schedule for Sacred Heart
Gettings schedule for Yale
Gettings schedule for Columbia
Gettings schedule for La Salle
Gettings schedule for Northeastern
Gettings schedule for Saint Francis
Gettings schedule for Connecticut
Gettings schedule for Lafayette
Gettings schedule for Northwestern
Gettings schedule for Saint Joseph's
Gettings schedule for Cornell
Gettings schedule for Lehigh
Gettings schedule for Ohio
Gettings schedule for Saint Louis
Gettings schedule for Dartmouth
Gettings schedule for Liberty
Gettings schedule for Ohio State
Gettings schedule for Siena
Gettings schedule for Davidson
Gettings schedule for Lock Haven
Gettings schedule for Old Dominion
Gettings schedule for Stanford
#Flatten the list so it is a list of games
games = [item for sublist in schedules for item in sublist]
#Split into played and unplayed
played_games = [game for game in games if game['own_score']!='.']
unplayed_games = [game for game in games if game['own_score']=='.']
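Because each game is a plain dictionary, exporting the data is straightforward. Here is a minimal sketch, written for Python 3, that writes game records with `csv.DictWriter`. The records in `sample_games`, the column order in `fields`, and the file name mentioned in the comment are all assumptions for illustration.

```python
import csv
import io

# Hypothetical game records with the same keys that read_table_row returns.
sample_games = [
    {'team': 'Duke', 'date': 'Sep. 24', 'opponent': 'Maryland', 'notes': '',
     'location': 'Home', 'own_score': '2', 'other_score': '3'},
    {'team': 'Duke', 'date': 'Oct. 25', 'opponent': 'Virginia', 'notes': '',
     'location': 'Away', 'own_score': '.', 'other_score': '.'},
]

fields = ['team', 'date', 'opponent', 'location', 'own_score', 'other_score', 'notes']

# Write to an in-memory buffer here; to write a file instead, use something
# like open('games.csv', 'w', newline='') (hypothetical file name).
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=fields)
writer.writeheader()
writer.writerows(sample_games)
csv_text = buffer.getvalue()
```

Unplayed games keep the `'.'` placeholder in the score columns, so whatever reads the file later needs to treat that value as missing.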
#Because some names are listed in multiple ways
#Keys are written the way str.title() renders them, since names are title-cased before lookup
name_clean = {'Boston C.': 'Boston College',
              'Boston U.': 'Boston University',
              'Vcu' : 'Virginia Commonwealth',
              'Appalachian St.' : 'Appalachian State',
              'Uc Davis' : 'UC Davis'}
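The odd-looking keys such as `'Vcu'` make sense once you see the lookup in action: every name goes through `str.title()` first. A small sketch, where `clean_name` is a hypothetical helper that wraps the same two steps the ranking functions perform inline:

```python
# Copy of the lookup table so the example is self-contained.
name_clean = {'Boston C.': 'Boston College',
              'Boston U.': 'Boston University',
              'Vcu': 'Virginia Commonwealth',
              'Appalachian St.': 'Appalachian State',
              'Uc Davis': 'UC Davis'}

def clean_name(raw):
    # str.title() turns 'VCU' into 'Vcu', which is why the keys are title-cased.
    titled = raw.title()
    # Fall back to the title-cased name when there is no entry in the table.
    return name_clean.get(titled, titled)

cleaned = [clean_name(n) for n in ['VCU', 'UC Davis', 'Boston C.', 'Duke']]
```

Names without an entry, like Duke, pass through unchanged apart from the title-casing.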
def generate_skill(games, n=50):
    #Ranking algorithm. Average winning margin adjusted for opponent's winning margin
    #n is the number of times to iterate through. Seems to converge after 10 or so loops
    skill = {}
    for x in range(0, n):
        team_skill_list = {}
        for game in games:
            #clean up the team names
            team = name_clean.get(game['team'].title(),game['team'].title())
            opponent = name_clean.get(game['opponent'].title(),game['opponent'].title())
            #Hard coded Home Field Advantage at .7, which was the figure from 2012
            if game['location'] == 'Home':
                hfa = .7
            else:
                hfa = 0
            #figure out how unexpected the margin of victory was
            expected_margin = skill.get(team,0) - skill.get(opponent,0) + hfa
            observed_margin = int(game['own_score']) - int(game['other_score'])
            difference = observed_margin - expected_margin
            #Add the unexpected portion to a list by team
            try:
                team_skill_list[team].append(difference)
            except KeyError:
                team_skill_list[team] = [difference]
        #New skill is old skill plus average of the new unexpected portion
        skill = {team: np.mean(team_skill_list[team]) + skill.get(team,0) for team in team_skill_list}
        #center the skills to prevent drift
        mean = np.mean([skill[team] for team in skill])
        skill = {team: skill[team] - mean for team in skill}
    return skill
skills = generate_skill(played_games)
#take a look at the top teams
for team in sorted(skills, key=skills.get, reverse=True)[:5]:
    print('%s %s' % (team, skills[team]))
Maryland 5.93977027965
North Carolina 5.16134861747
Virginia 3.81517327634
Connecticut 3.66998671713
Iowa 3.35718763212
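The update in `generate_skill` is easier to follow on a toy league. The following is a stripped-down sketch, not the function above: `toy_skill`, `toy_games`, and the three teams are made up, games are plain tuples rather than scraped dictionaries, and it uses plain Python instead of NumPy. The logic is meant to mirror the same steps: compare each observed margin with the expected one, add the average surprise to each team's rating, then re-center.

```python
def toy_skill(games, n=50, hfa=0.7):
    """Iteratively rate teams from (team, opponent, margin, is_home) tuples."""
    skill = {}
    for _ in range(n):
        surprises = {}
        for team, opponent, margin, is_home in games:
            # Expected margin: rating gap plus home field advantage for home games.
            bonus = hfa if is_home else 0
            expected = skill.get(team, 0) - skill.get(opponent, 0) + bonus
            surprises.setdefault(team, []).append(margin - expected)
        # New skill is old skill plus the average surprise for that team.
        skill = {t: skill.get(t, 0) + sum(v) / float(len(v))
                 for t, v in surprises.items()}
        # Re-center so the ratings average zero and do not drift.
        center = sum(skill.values()) / float(len(skill))
        skill = {t: s - center for t, s in skill.items()}
    return skill

# Hypothetical results; each game appears once per team, as in the scraped data.
toy_games = [
    ('A', 'B', 2, True), ('B', 'A', -2, False),
    ('B', 'C', 1, True), ('C', 'B', -1, False),
    ('A', 'C', 3, False), ('C', 'A', -3, True),
]
toy = toy_skill(toy_games)
ranking = sorted(toy, key=toy.get, reverse=True)
```

With these made-up results, A ends up ranked ahead of B, and B ahead of C, and the ratings sum to zero because of the centering step.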
def generate_total(games, n=20):
    #Same function as above, except for total points scored rather than margin
    #Should probably be combined with above
    total = {}
    for x in range(0, n):
        team_total_list = {}
        for game in games:
            team = name_clean.get(game['team'].title(),game['team'].title())
            opponent = name_clean.get(game['opponent'].title(),game['opponent'].title())
            #4.7 is the hard coded baseline for total goals in a game
            expected_total = total.get(team,0) + total.get(opponent,0) + 4.7
            observed_total = int(game['own_score']) + int(game['other_score'])
            difference = observed_total - expected_total
            try:
                team_total_list[team].append(difference)
            except KeyError:
                team_total_list[team] = [difference]
        #New total is old total plus new average
        total = {team: np.mean(team_total_list[team]) + total.get(team,0) for team in team_total_list}
    return total
totals = generate_total(played_games)
for team in sorted(totals, key=totals.get, reverse=True)[:10]:
    print('%s %s' % (team, totals[team]))
Kent State 2.64308192784
Dartmouth 2.41186253797
Maryland 2.3897954595
Saint Louis 1.99132275916
Penn 1.74421499118
North Carolina 1.54463283301
Vermont 1.4099056861
Boston College 1.35897993364
Appalachian State 1.15298955323
Longwood 1.07484246491
def make_prediction(games,skill,total):
    # for unplayed games, come up with a prediction based on location,
    # and who is playing
    predictions = {}
    for game in games:
        team = name_clean.get(game['team'].title(),game['team'].title())
        opponent = name_clean.get(game['opponent'].title(),game['opponent'].title())
        total_predict = 4.7 + total.get(team,0) + total.get(opponent,0)
        if game['location']=='Home':
            hfa = .7
            modifier = "at"
        else:
            hfa = 0
            modifier ="vs"
        #Quick hack to sort by date: month*100 + day
        date = game['date']
        day = int(date.split()[1])
        if 'Sep. ' in date:
            day = day + 900
        elif 'Oct. ' in date:
            day = day + 1000
        elif 'Nov. ' in date:
            day = day + 1100
        elif 'Dec. ' in date:
            day = day + 1200
        #Skip away games to avoid duplication; each game also appears on the other team's schedule
        if game['location']!='Away':
            expected_margin = skill.get(team,0) - skill.get(opponent,0) + hfa
            #round both predictions to the nearest half goal
            expected_margin = round(expected_margin * 2,0)/2
            total_predict = round(total_predict*2,0)/2
            #fix for cases where margin is greater than the predicted total:
            if expected_margin > total_predict:
                expected_margin = total_predict
            # fix for -0 margin (not implemented; the output can still show -0.0)
            row = [date,'%s %s %s' % (opponent,modifier,team),"%.1f" % -expected_margin,"%.1f" % total_predict]
            try:
                predictions[day].append(row)
            except KeyError:
                predictions[day] = [row]
    return predictions
preds = make_prediction(unplayed_games,skills,totals)
for day in preds:
    print(' '.join(preds[day][0]))
Oct. 25 William & Mary at Drexel -3.0 4.0
Oct. 26 Fairfield at Albany -4.0 5.5
Oct. 27 Princeton at Albany -2.0 6.0
Oct. 28 Maine at Providence 1.0 4.0
Oct. 30 Villanova at Penn -5.0 5.5
Oct. 31 Stanford at California 1.5 1.5
Sep. 24 UC Davis at Miami -4.0 5.0
Sep. 25 Rider at Drexel -4.0 4.0
Sep. 26 Richmond at Longwood 3.0 4.5
Sep. 27 Sacred Heart at Albany -5.0 5.0
Sep. 28 Stanford vs Syracuse -0.5 3.0
Sep. 29 Stanford at Albany -1.0 4.5
Nov. 1 Delaware at Drexel -1.0 5.0
Nov. 2 North Carolina at Syracuse 1.5 6.0
Nov. 3 Villanova at Delaware -4.5 4.5
Nov. 9 Princeton at Penn -0.0 7.0
Nov. 10 Harvard at Columbia -2.5 4.0
Oct. 1 Bryant at Holy Cross 0.5 4.0
Oct. 2 Lafayette at Penn -1.5 6.0
Oct. 4 William & Mary at Delaware -2.5 5.0
Oct. 5 Virginia at Syracuse -0.0 4.5
Oct. 6 Virginia at Albany 0.5 5.5
Oct. 7 Missouri State at Iowa -3.0 3.0
Oct. 8 Hofstra at Maryland -7.5 7.5
Oct. 9 Liberty at Longwood 0.5 4.5
Oct. 10 Richmond at William & Mary 2.0 3.5
Oct. 11 Towson at Delaware -4.0 4.0
Oct. 12 Vermont at Albany -7.0 7.0
Oct. 13 Princeton at Delaware -1.0 6.0
Oct. 15 Holy Cross at Vermont -0.5 5.0
Oct. 16 Radford at Wake Forest -4.5 4.5
Oct. 17 Lafayette at Drexel -2.5 3.5
Oct. 18 New Hampshire at Albany -3.0 6.5
Oct. 19 Wheelock at American -3.0 4.5
Oct. 20 Radford at Longwood -2.0 5.5
Oct. 22 Central Michigan at Michigan State -1.0 5.0
Oct. 23 Ohio State at Ball State 3.0 3.5
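The loop above prints only the first game stored for each day and walks the dictionary in whatever order it happens to be in, which is likely why the listing is not chronological. Here is a sketch of one way to list every game in date order using the numeric keys. `sample_preds` is a hand-built stand-in for `preds`, and the "Team X vs Team Y" row is invented to show a day with two games.

```python
# Hypothetical predictions keyed by the month*100 + day sort key.
sample_preds = {
    1025: [['Oct. 25', 'William & Mary at Drexel', '-3.0', '4.0']],
    924: [['Sep. 24', 'UC Davis at Miami', '-4.0', '5.0'],
          ['Sep. 24', 'Team X vs Team Y', '0.5', '4.5']],
    1101: [['Nov. 1', 'Delaware at Drexel', '-1.0', '5.0']],
}

# sorted() walks the keys chronologically; the inner loop keeps every game.
lines = [' '.join(row)
         for day in sorted(sample_preds)
         for row in sample_preds[day]]
```

Each entry in `lines` has the same format as the output above, so it can be printed directly.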
Not shown: Ugly code that turns the predictions into an html table.
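For readers who want that last step anyway, here is a rough sketch of how prediction rows could be turned into an HTML table. This is not the code used for the original table: `predictions_to_html` and the column headings are assumptions, and it is written for Python 3 because it relies on `html.escape`.

```python
from html import escape

def predictions_to_html(rows):
    """Build a simple HTML table from [date, matchup, margin, total] rows."""
    header = '<tr><th>Date</th><th>Game</th><th>Margin</th><th>Total</th></tr>'
    body = ''.join(
        # Escape each cell so names like "William & Mary" stay valid HTML.
        '<tr>' + ''.join('<td>%s</td>' % escape(cell) for cell in row) + '</tr>'
        for row in rows
    )
    return '<table>' + header + body + '</table>'

table_html = predictions_to_html(
    [['Oct. 25', 'William & Mary at Drexel', '-3.0', '4.0']]
)
```

The rows have the same shape as the lists that `make_prediction` stores, so the values of its result can be flattened and passed straight in.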