Web scraping in Python

I find sports more exciting when I know what to expect. How many points is a team scoring relative to how they did in the past? Is the defense stopping a high-scoring team or one with a struggling offense? For many sports, Las Vegas provides me the information I want with the line (the expected difference between the home team's score and the visitor's score) and the total (the sum of the expected scores). For example, at game time on Monday, March 11th, Iona's men's basketball team was a four-point favorite (-4) over Manhattan, with a total of 117.

Vegas doesn't provide lines for NCAA women's lacrosse (or any other college sport but football and men's basketball). Laxpower has a power rating. Comparing two teams' ratings and adding a home field advantage does give a pretty good line, but they don't provide total projections. So I decided to create my own Vegas-style lines and totals. You can see the results.

Creating lines involves: finding and downloading game data; developing a power rating model to rank each team; and using that model to predict future games. More generally, this is the same process used to scrape data from websites for quantitative analysis. It's the same process I used, for example, to [analyze](http://nealcaren.web.unc.edu/files/2012/05/smoc.pdf) a white racist web forum.

Luckily, Laxpower has all the game information, both for games played and scheduled. I wanted to cycle through each of the team pages to get the information, but first I needed to know the URLs for all those pages. The ranking page lists all the teams along with links to their pages, so I can grab the URLs from there.

In my browser, I looked at the source for the ranking page, that is, the raw HTML. I searched for "North Carolina" so I could get a sense of what each link looked like. Fortunately, the page had two pieces of information for each team, listed in a way that was very easy to extract. Each link began with a " and ended in PHP". (Actually, this is just the relative path to the URL. I'll fill in the beginning part of the URL later.) This was followed by a >, the school's name, and then a <. This is a situation where a simple regular expression would allow me to pull out the information I needed.

To get my list that contains all the URLs, I could take advantage of the uniform way they were listed. In the Python variant of regular expressions, the powerful combination .*? will match any character (except a newline), repeated any number of times, but as few times as possible: it stops as soon as the rest of the pattern can match. So searching a text for My .*? dog would grab the word or words used between My and dog. In my case, I wanted to extract all the instances of text that occurred between a quotation mark and PHP followed by a quotation mark, so I could search for instances of ".*?PHP" in the page's text.
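To see why the question mark matters, here is a small sketch comparing .*? with the plain .* on a made-up sentence. It is written for Python 3 (unlike the Python 2 cells in this post), and the sample text is hypothetical.

```python
import re

# Hypothetical sample text with two "My ... dog" phrases.
text = "My big brown dog chased my neighbor's cat. My small dog slept."

# Lazy: .*? takes as few characters as possible before " dog".
lazy = re.findall(r"My (.*?) dog", text)

# Greedy: .* takes as many characters as possible, so it runs
# all the way to the last " dog" in the text.
greedy = re.findall(r"My (.*) dog", text)

print(lazy)    # ['big brown', 'small']
print(greedy)  # ["big brown dog chased my neighbor's cat. My small"]
```

The greedy version is the usual culprit when a pattern swallows far more of the page than intended.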

In [22]:
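import urllib2
import re

# teams_html holds the raw HTML of the ranking page, downloaded
# beforehand (e.g. with urllib2.urlopen(...).read()); that step is not shown.
teams = re.findall('".*?PHP"', teams_html)
print teams[:5]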

['"XMADXX.PHP"', '"XUFLXX.PHP"', '"XNWSXX.PHP"', '"XSYRXX.PHP"', '"XUNCXX.PHP"']

This is pretty good, but I don't want the quotation marks. I can be pickier about what I extract by using parentheses, which instruct re to return only the text matched between them.

In [23]:
teams=re.findall('"(.*?PHP)"',teams_html)
print teams[:5]

['XMADXX.PHP', 'XUFLXX.PHP', 'XNWSXX.PHP', 'XSYRXX.PHP', 'XUNCXX.PHP']

As I noted above, next to this is also the school's name. I can extract this as well by extending the re statement.

In [24]:
teams=re.findall('"(.*?PHP)">(.*?)<',teams_html)
print teams[:5]

[('XMADXX.PHP', 'Maryland'), ('XUFLXX.PHP', 'Florida'), ('XNWSXX.PHP', 'Northwestern'), ('XSYRXX.PHP', 'Syracuse'), ('XUNCXX.PHP', 'North Carolina')]

Adding >(.*?)< had the effect of extending the search and returning everything between the greater-than and less-than signs. Because the pattern now has two sets of parentheses, the matches are returned as a list of tuples. Note that regular expressions are complicated and, more often than not, a first attempt will return either nothing or the entire text of the document. Trial, error, and reading are the only way forward.
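Trial and error goes faster on a small snippet than on a whole page. The sketch below, written for Python 3, runs the same patterns on a hypothetical two-link fragment shaped like the links described above, and shows one common way a pattern ends up returning nothing.

```python
import re

# Hypothetical fragment shaped like the ranking page's links.
sample_html = (
    '<a href="XMADXX.PHP">Maryland</a> '
    '<a href="XUNCXX.PHP">North Carolina</a>'
)

# No parentheses: each whole match comes back as a string.
whole = re.findall(r'".*?PHP"', sample_html)

# Two sets of parentheses: one (path, name) tuple per match.
pairs = re.findall(r'"(.*?PHP)">(.*?)<', sample_html)

# Matching is case-sensitive by default, so lowercase "php" finds nothing...
nothing = re.findall(r'"(.*?php)"', sample_html)

# ...unless the IGNORECASE flag is passed.
relaxed = re.findall(r'"(.*?php)"', sample_html, flags=re.IGNORECASE)

print(whole)    # ['"XMADXX.PHP"', '"XUNCXX.PHP"']
print(pairs)    # [('XMADXX.PHP', 'Maryland'), ('XUNCXX.PHP', 'North Carolina')]
print(nothing)  # []
print(relaxed)  # ['XMADXX.PHP', 'XUNCXX.PHP']
```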

I want to remove any duplicates by turning the returned list into a set, and then back into a list.

In [25]:
print len(teams)
teams=list(set(teams))
print len(teams)

200
100
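One side effect of going through a set is that the original order is lost, which is likely why the teams printed further down no longer follow the ranking order. That doesn't matter for my purposes, but if it did, a sketch like the following (Python 3.7 or later, with made-up input) would drop duplicates while keeping the first occurrence of each team in place.

```python
# Hypothetical list of (path, name) tuples with every team listed twice.
teams = [
    ("XMADXX.PHP", "Maryland"),
    ("XUFLXX.PHP", "Florida"),
    ("XMADXX.PHP", "Maryland"),
    ("XUFLXX.PHP", "Florida"),
]

# Dictionary keys are unique and keep insertion order in Python 3.7+,
# so this removes duplicates without shuffling the list.
unique_teams = list(dict.fromkeys(teams))

print(len(teams), len(unique_teams))  # 4 2
print(unique_teams)
```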

I also want to store it in a more useful format, since I'll forget later on whether the team or the URL was first in the tuple.

In [26]:
teams=[{'team id':t[0],'team name':t[1]} for t in teams]
print teams[:5]

[{'team name': 'Lehigh', 'team id': 'XLEHXX.PHP'}, {'team name': 'Columbia', 'team id': 'XCMBXX.PHP'}, {'team name': 'Boston University', 'team id': 'XBOUXX.PHP'}, {'team name': 'Princeton', 'team id': 'XPRIXX.PHP'}, {'team name': 'Quinnipiac', 'team id': 'XQUIXX.PHP'}]

Now that I know all the teams and where to get information about them, I want to go to each of those pages and get the information about each game: who, when, where, and, if it has already been played, what the score was. A quick look at the source for a page shows that the information is stored in an HTML table. This is good news, as data presented in other ways can be hard to get, and some, such as data displayed using Flash, can be impossible.

I'm going to use the BeautifulSoup module to help parse the HTML. Regular expressions can get you pretty far, but modules like BeautifulSoup can save you a lot of time. They are much easier to use if you already know things like what a DOM element is, but are still usable for those who don't code web pages.

After downloading, opening, and soupifying the page (see the function below), you can extract the table with a simple table = soup.find("table"), while rows = table.find_all("tr") will identify each of the rows. You might not want the first or last rows, depending on how the information is presented, so you can slice by appending something like [1:], which will start after the first row, or [1:-1], which will also drop the last row.

Within each row, you can extract a list of the cells with cells = row.findAll('td'). Another powerful feature of BeautifulSoup is that it can get rid of the HTML formatting with .get_text(), which is a lot more efficient than a complicated regular expression that might not always work. In my case, I'm not going to sort through the contents of each of the cells here. Since I want all the information, I'm just going to dump it to a file and organize it later.

My get_team_page function downloads the page and then extracts the contents of all the informative rows of the table and returns them as a list of lists. In retrospect, this should probably be split into two functions, with one that downloads the page and another that extracts the table information. That second function would be useful in other contexts, so I could include it in other projects.

In [27]:
from bs4 import BeautifulSoup

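def get_team_page(team):
    # Build the URL for this team's detail page and download the raw HTML.
    team_url = 'http://www.laxpower.com/update13/binwom/%s?tab=detail' % team['team id']
    team_html = urllib2.urlopen(team_url).read()
    soup = BeautifulSoup(team_html.decode('utf-8', 'ignore'))
    table = soup.find("table")
    rows = []
    # Skip the first three rows and the last one, which are not informative.
    for row in table.find_all("tr")[3:-1]:
        # Strip the HTML from each cell and drop the filler characters.
        data = [d.get_text().replace(u'\xa0\x96', '') for d in row.findAll('td')]
        # Put the team name first and keep the first five columns.
        outline = [team['team name']] + data[:5]
        rows.append([i.encode('utf-8') for i in outline])
    return rows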
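As a rough idea of what that two-function split could look like, here is a sketch written for Python 3 (so urllib.request stands in for urllib2). The names download_page and extract_table_rows, the skip_top and skip_bottom parameters, and the sample table are all my own inventions for illustration, not part of the original script.

```python
from urllib.request import urlopen

from bs4 import BeautifulSoup


def download_page(url):
    """Return the page at `url` as text, ignoring undecodable bytes."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", "ignore")


def extract_table_rows(html, skip_top=0, skip_bottom=0):
    """Return the first table in `html` as a list of lists of cell text."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table")
    if table is None:
        return []
    rows = table.find_all("tr")
    # Slice off header and footer rows; a plain [a:-0] would be empty,
    # so the end index is computed explicitly.
    rows = rows[skip_top:len(rows) - skip_bottom]
    return [
        [cell.get_text(strip=True) for cell in row.find_all("td")]
        for row in rows
    ]


# Hypothetical schedule table: one header row, two games, one footer row.
sample_html = """
<table>
  <tr><td>Date</td><td>Opponent</td><td>Score</td></tr>
  <tr><td>02/09</td><td>Duke</td><td>12-9</td></tr>
  <tr><td>02/16</td><td>Navy</td><td>15-4</td></tr>
  <tr><td>Record</td><td></td><td>2-0</td></tr>
</table>
"""

games = extract_table_rows(sample_html, skip_top=1, skip_bottom=1)
print(games)  # [['02/09', 'Duke', '12-9'], ['02/16', 'Navy', '15-4']]
```

Because extract_table_rows only needs a string of HTML, it can be tried out on a saved page or a snippet like the one above without touching the network.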


For maximum flexibility, I want to output all the data to a tab-separated file. I do this with the csv module. The \t tells the writer to use a tab instead of the default comma between items.

In [28]:
import csv
outfile=csv.writer(open('lax_13.tsv','wb'),delimiter='\t')
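For anyone following along in Python 3, the same idea looks slightly different: the file is opened in text mode with newline="" rather than 'wb', and a with block closes the file so that every row is flushed to disk. This is only a sketch; the rows and the file name are made up.

```python
import csv
import os
import tempfile

# Hypothetical rows in the same shape get_team_page returns.
rows = [
    ["North Carolina", "02/09", "Duke", "12-9"],
    ["North Carolina", "02/16", "Navy", "15-4"],
]

# Write to a throwaway directory so the example leaves nothing behind.
path = os.path.join(tempfile.mkdtemp(), "example.tsv")

# newline="" lets the csv module control line endings itself.
with open(path, "w", newline="", encoding="utf-8") as handle:
    writer = csv.writer(handle, delimiter="\t")
    writer.writerows(rows)

# Reading it back with the same delimiter recovers the original lists.
with open(path, newline="", encoding="utf-8") as handle:
    read_back = list(csv.reader(handle, delimiter="\t"))

print(read_back == rows)  # True
```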


In order to be polite to the website, I want to pause a second between each page. Generally, I try to save the contents of the page locally so that I only have to download it once. In this case, I'll be running it every day and I want the most recent results, so I'm not going to save each page. Additionally, the function above will crash if the web server is down or returns any other sort of HTTP error. A better function would put the urllib2.urlopen() in a try: so that it can skip over those pages, if you think that is acceptable. Otherwise, you might have it so that if it can't download the page, it loads up the most recent locally saved version. It all depends on what the data is and what you want to do with it.
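A minimal sketch of that more forgiving download, written for Python 3, might look like the following. The function name fetch_or_cached, the cache_path argument, and the opener parameter (there only so the example can be run without a network connection) are all assumptions of mine.

```python
import io
import os
import tempfile
from urllib.error import URLError
from urllib.request import urlopen


def fetch_or_cached(url, cache_path, opener=urlopen):
    """Download `url` and refresh the local copy; fall back to it on failure.

    Returns the page as bytes, or None if the download fails and no
    locally saved copy exists.
    """
    try:
        with opener(url) as response:
            html = response.read()
    except (URLError, OSError):
        # Server down, HTTP error, or timeout: use the saved copy if any.
        if os.path.exists(cache_path):
            with open(cache_path, "rb") as handle:
                return handle.read()
        return None
    # Download worked, so save it for the next time the server is down.
    with open(cache_path, "wb") as handle:
        handle.write(html)
    return html


# Stand-ins for urlopen so the demonstration needs no network access.
def failing_opener(url):
    raise URLError("server down")


def working_opener(url):
    return io.BytesIO(b"<html>fresh copy</html>")


cache = os.path.join(tempfile.mkdtemp(), "team.html")
first = fetch_or_cached("http://example.invalid/", cache, opener=failing_opener)
second = fetch_or_cached("http://example.invalid/", cache, opener=working_opener)
third = fetch_or_cached("http://example.invalid/", cache, opener=failing_opener)

print(first)   # None: nothing downloaded and nothing saved yet
print(second)  # b'<html>fresh copy</html>'
print(third)   # b'<html>fresh copy</html>', served from the saved copy
```

A caller that prefers to skip failed pages can simply check for None and move on to the next team.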

The loop goes through each team, downloads the page, extracts the table, and then writes the results to the file. I have a print statement in it so that I can watch it go. When all is working, it takes about two minutes because of the sleep(1) pause, and I like to make sure it isn't caught on anything. I've commented the print statement out here because it made the page too long.

In [30]:
from time import sleep

for team in teams:
    #print team['team name']
    rows = get_team_page(team)
    outfile.writerows(rows)
    sleep(1)


The resulting file, lax_13.tsv, can be read into any statistical program or into Excel if you want to do your analysis there. In part II, I'll describe my power ranking and prediction models, which were built in Python.