Good football data is hard to come by. Basic stat counts are easily available, but full play data (i.e. a play broken down into its individual components: interceptions and tackles, runs, passes and shots, etc.) is very rare. And the play is the most important unit in a team sport like football. So imagine my surprise and great joy when I came across a fantastic dataset of full play-by-play data for all World Cup matches.
After spending some time in the wonderful world of web scraping, one becomes aware of hints that something worthwhile is going on. Whenever I see a pretty interactive chart on a web-page like the great Huff Post Data's World Cup page, my spider sense starts tingling.
The first thing to do is to make sure that the site is using only HTML5 and JavaScript. Check.
OK, so how is the website sending the data to the browser? Chrome's developer tools are your friend: open the Network tab and filter by "json".
Bingo.
The website was sending the full dataset to the browser.
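The Network tab check can also be done offline: Chrome lets you export the recorded requests as a HAR file, which is plain JSON. Here is a minimal sketch (Python 3, standard library only) that lists the requests that look like JSON. The function name `json_responses` and the sample entries are made up for illustration; only the `log` / `entries` / `request.url` / `response.content.mimeType` layout comes from the HAR format.

```python
def json_responses(har):
    """Return the URLs in a HAR dict whose response looks like JSON."""
    urls = []
    for entry in har.get("log", {}).get("entries", []):
        mime = entry.get("response", {}).get("content", {}).get("mimeType", "")
        url = entry.get("request", {}).get("url", "")
        # Match on the declared MIME type or on a ".json" path (query string ignored)
        if "json" in mime.lower() or url.split("?")[0].endswith(".json"):
            urls.append(url)
    return urls


# Hypothetical HAR excerpt, shaped like a Chrome export
SAMPLE_HAR = {
    "log": {
        "entries": [
            {"request": {"url": "http://example.com/index.html"},
             "response": {"content": {"mimeType": "text/html"}}},
            {"request": {"url": "http://example.com/matches/1.json"},
             "response": {"content": {"mimeType": "application/json"}}},
            {"request": {"url": "http://example.com/app.js"},
             "response": {"content": {"mimeType": "application/javascript"}}},
        ]
    }
}

json_urls = json_responses(SAMPLE_HAR)
print(json_urls)
```

With a real export you would load the file with `json.load` first and pass the resulting dict to the function.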
#imports
import requests
import json
import mechanize
from bs4 import BeautifulSoup
import time
#initializes the browser
br = mechanize.Browser()
br.set_handle_refresh(mechanize._http.HTTPRefreshProcessor(), max_time=10)
br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]
Looking at a match page, you see that the links to all matches are listed in a handy menu. Looking at the HTML source, you can see that they are inside a span tag with class set to "matchup".
So the following code gets you all the links:
starting_link = 'http://data.huffingtonpost.com/2014/world-cup/matches/belgium-vs-usa-731822'
response = mechanize.urlopen(starting_link)
soup = BeautifulSoup(response)
links_html = soup.find_all("span", class_="matchup")
links = []
for link_html in links_html:
    a = link_html.find_all('a')
    for l in a:
        link = l.get('href')
        link = link.split('/')[-1]  # keep only the last path segment, e.g. 'brazil-vs-chile-731815'
        links.append(link)
links[:5]
['brazil-vs-chile-731815', 'colombia-vs-uruguay-731816', 'netherlands-vs-mexico-731817', 'costa-rica-vs-greece-731818', 'france-vs-nigeria-731819']
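If you want to try the link extraction without mechanize or BeautifulSoup, the same idea can be sketched with the standard library's `html.parser` (Python 3). This is only an illustration under assumptions: the class name `MatchLinkParser` and the sample HTML are made up, and the real page markup may differ from the snippet used here.

```python
from html.parser import HTMLParser


class MatchLinkParser(HTMLParser):
    """Collects the last path segment of every <a href> inside <span class="matchup">."""

    def __init__(self):
        super().__init__()
        self.span_depth = 0  # > 0 while we are inside a "matchup" span
        self.slugs = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "span":
            classes = (attrs.get("class") or "").split()
            # Enter a matchup span, or track a span nested inside one
            if self.span_depth or "matchup" in classes:
                self.span_depth += 1
        elif tag == "a" and self.span_depth and attrs.get("href"):
            self.slugs.append(attrs["href"].rstrip("/").split("/")[-1])

    def handle_endtag(self, tag):
        if tag == "span" and self.span_depth:
            self.span_depth -= 1


# Hypothetical markup, loosely modelled on the menu described above
SAMPLE_HTML = """
<ul>
  <li><span class="matchup"><a href="/2014/world-cup/matches/brazil-vs-chile-731815">BRA v CHI</a></span></li>
  <li><span class="other"><a href="/2014/world-cup/about">About</a></span></li>
  <li><span class="matchup"><a href="/2014/world-cup/matches/colombia-vs-uruguay-731816">COL v URU</a></span></li>
</ul>
"""

parser = MatchLinkParser()
parser.feed(SAMPLE_HTML)
print(parser.slugs)
```

The link in the "other" span is ignored, which mirrors what `soup.find_all("span", class_="matchup")` does in the cell above.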
With this information we can get all the match data through a simple request, which gives back easily readable JSON.
def get_match_data(match):
    match_id = match.split('-')[-1]  # the numeric ID is the last part of the link
    response = mechanize.urlopen('http://data.huffingtonpost.com/2014/world-cup/matches/%s.json' % match_id)
    match_data = json.loads(response.read())
    return match_data
match_data = get_match_data(links[0]) # test
match_data.keys()
[u'team_stats', u'events', u'summary']
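The two steps in that function (derive the JSON URL from the link, then look at what came back) can be tried separately from the network call. The sketch below is Python 3 and makes assumptions: `match_json_url` and `summarize` are hypothetical helper names, and the sample payload only reuses the three top-level keys shown above, with empty placeholder values rather than real match data.

```python
import json

BASE_URL = "http://data.huffingtonpost.com/2014/world-cup/matches"


def match_json_url(slug):
    """Build the JSON URL from a link such as 'brazil-vs-chile-731815'."""
    match_id = slug.rsplit("-", 1)[-1]
    if not match_id.isdigit():
        raise ValueError("no numeric match id at the end of %r" % slug)
    return "%s/%s.json" % (BASE_URL, match_id)


def summarize(match_data):
    """Map each top-level key to the type of its value, for a quick first look."""
    return {key: type(value).__name__ for key, value in match_data.items()}


# Placeholder payload: the keys come from the output above, the values are assumptions
sample_payload = json.loads('{"team_stats": [], "events": [], "summary": {}}')

url = match_json_url("brazil-vs-chile-731815")
overview = summarize(sample_payload)
print(url)
print(overview)
```

Validating the ID before building the URL gives a clear error for a malformed link instead of a confusing HTTP failure later on.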
Unfortunately, the data includes IDs only. The page has names, though, so there must be some conversion taking place. At this point, I was scared that I would have to look through all the script files and JavaScript code to see where the conversion took place.
However, the first (and obvious) step was enough: simply searching for a player's name in the main page source showed that variables HPIN.teams and HPIN.players contained the names and IDs, plus a bunch of other information (like position, birth date and even preferred foot). The script tag that defined the variables has no class or id, so we could only identify it by its position.
def get_match_names(match):
    response = mechanize.urlopen('http://data.huffingtonpost.com/2014/world-cup/matches/%s' % match)
    soup = BeautifulSoup(response)
    data = {}
    data_script = soup.findAll("script")[1]  # gets the second script block; hopefully all pages follow the same format
    data_lines = data_script.text.split('\n')
    for line in data_lines[1:]:
        try:
            # format of a variable is HPIN.variable = [list of dictionaries]
            # this splits the line into the variable name and its JSON value
            line_data = line.split(' = ')
            name = line_data[0].split('.')[1]
            value = json.loads(line_data[1][:-1])  # [:-1] drops the last character (presumably a trailing semicolon)
            data[name] = value
        except (IndexError, ValueError):
            print "error parsing string: ", line  # should only occur on blank lines
    return data
names = get_match_names(links[0])
names.keys()
error parsing string:
[u'statCategories', u'awayTeam', u'callbackPath', u'homeTeam', u'teams', u'players', u'imageCallbackPath', u'imageCallbackInterval', u'twitterUrl']
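For comparison, here is a sketch of the same line-by-line parsing done with a regular expression instead of `split`, plus the ID-to-name lookup that the parsed variables make possible. It is Python 3 and rests on assumptions: the sample script text, the `id` and `name` field names, and the helper name `parse_hpin_script` are all made up for illustration, since the real variables may be laid out differently.

```python
import json
import re

# One assignment per line: HPIN.<name> = <JSON value>; (semicolon optional)
ASSIGNMENT = re.compile(r"^\s*HPIN\.(\w+)\s*=\s*(.+?);?\s*$")


def parse_hpin_script(script_text):
    """Return (parsed variables, non-blank lines that could not be parsed)."""
    parsed, skipped = {}, []
    for line in script_text.splitlines():
        found = ASSIGNMENT.match(line)
        if not found:
            if line.strip():  # blank lines are ignored silently
                skipped.append(line)
            continue
        name, raw_value = found.groups()
        try:
            parsed[name] = json.loads(raw_value)
        except ValueError:  # json.JSONDecodeError is a subclass of ValueError
            skipped.append(line)
    return parsed, skipped


# Hypothetical script body; the field names are assumptions
SAMPLE_SCRIPT = """
var HPIN = HPIN || {};
HPIN.teams = [{"id": 1, "name": "Team A"}, {"id": 2, "name": "Team B"}];
HPIN.players = [{"id": 10, "name": "Player X"}];
HPIN.twitterUrl = "http://example.com";

"""

parsed, skipped = parse_hpin_script(SAMPLE_SCRIPT)
team_names = {team["id"]: team["name"] for team in parsed["teams"]}
print(sorted(parsed))
print(skipped)
print(team_names)
```

Returning the skipped lines, rather than printing them, makes it easy to check that only uninteresting lines were dropped.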
Alright, so now we have all the match links, a function that returns the events and stats from each match, and a function that returns the players and team names. Let's put it all together. First, create a dictionary:
data = {}
Then, execute a loop that will get the data from all the matches and add it to the dictionary. The if statement ensures you don't have to reprocess a match if you have to run the cell again (e.g. due to a network error).
for match in links:
    if match not in data:
        print match
        time.sleep(60)  # wait a minute between matches
        match_data = get_match_data(match)
        match_names = get_match_names(match)
        data[match] = {'data': match_data, 'names': match_names}
        print match, " done"
    else:
        print match, " already processed"
brazil-vs-chile-731815
error parsing string:
brazil-vs-chile-731815 done
colombia-vs-uruguay-731816
error parsing string:
colombia-vs-uruguay-731816 done
[... the same three lines repeat for each of the remaining matches ...]
south-korea-vs-belgium-731813
error parsing string:
south-korea-vs-belgium-731813 done
brazil-vs-chile-731815 already processed
colombia-vs-uruguay-731816 already processed
[... one "already processed" line for each of the remaining matches ...]
south-korea-vs-belgium-731813 already processed
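The skip-if-already-stored idea can be sketched as a small reusable function. This is a Python 3 illustration under assumptions, not the loop used above: `scrape_all` is a hypothetical name, the fetcher is passed in so the example can run without network access, and unlike the original loop it catches `OSError` and keeps going instead of stopping at the first failure.

```python
import time


def scrape_all(slugs, fetch, store, delay_seconds=60, sleep=time.sleep):
    """Fetch every slug not yet in `store`; return a list of (slug, error message)."""
    failed = []
    for slug in slugs:
        if slug in store:
            continue  # already processed on an earlier run
        try:
            store[slug] = fetch(slug)
        except OSError as exc:  # network errors from urllib are OSError subclasses
            failed.append((slug, str(exc)))
        sleep(delay_seconds)  # be polite to the server between requests
    return failed


# Fake fetcher that fails once for one match, to simulate a network error
calls = []


def flaky_fetch(slug):
    calls.append(slug)
    if slug == "b-vs-c-2" and calls.count(slug) == 1:
        raise OSError("simulated network error")
    return {"slug": slug}


store = {}
slugs = ["a-vs-b-1", "b-vs-c-2", "c-vs-d-3"]
no_wait = lambda seconds: None  # skip the real delay in this demo
first_run = scrape_all(slugs, flaky_fetch, store, sleep=no_wait)
second_run = scrape_all(slugs, flaky_fetch, store, sleep=no_wait)
print(first_run, second_run, len(store))
```

On the second run only the match that failed is requested again, which is exactly what the `if match not in data` check buys you.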
print len(data.keys()) #make sure you have all 64 games
64
import pickle
with open("wc2014.p", "wb") as f:
    pickle.dump(data, f)
with open("wc2014.p", "rb") as f:
    print data == pickle.load(f)  # because I'm a bit OCD and want to make sure the data was properly stored
True
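The store-then-compare check can be wrapped in a helper so it is easy to repeat. A minimal Python 3 sketch, assuming a hypothetical function name `save_and_verify` and a tiny stand-in dictionary instead of the real 64-match dataset:

```python
import os
import pickle
import tempfile


def save_and_verify(obj, path):
    """Pickle `obj` to `path`, read it back, and report whether the copy is equal."""
    with open(path, "wb") as f:
        pickle.dump(obj, f, protocol=pickle.HIGHEST_PROTOCOL)
    with open(path, "rb") as f:
        return pickle.load(f) == obj


# Stand-in for the scraped dictionary; the contents are placeholders
sample = {"brazil-vs-chile-731815": {"data": {"events": []}, "names": {"teams": []}}}

with tempfile.TemporaryDirectory() as folder:
    ok = save_and_verify(sample, os.path.join(folder, "wc2014.p"))
print(ok)
```

Keep in mind that pickle files should only be loaded from sources you trust, since unpickling can execute arbitrary code.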
The boring part is over. Now, it's time to play :)
Check out my WC final analysis notebook for an example of what you can do with the data, and follow my github repository Football Crunching for more analysis in the future.