!date
Mon Jan 20 17:00:20 EST 2014
Recently I did some analysis on the likelihood of making NFL Field Goal over distance and temperature. I would lke to add alitude as a feature, but unfortunately the dataset I am using has the Stadium name, but not the city the game was played in. So took the stadium names and Google's geocoding API to get the location of the stadium and Google's elevation API to get the stadium's elevation. The following is an overview of extracting the elevation for each stadium.
import pandas as pd
import pandas.io.sql as psql
import mysql.connector as sql
import urllib2
import urllib
import json
import time
I use the NFL data set from Armchair Analysis. One of the tables includes the stadium each game was played in. Tge data us stored in a MySQL database, so lets extract the unique stadium names into a pandas DataFrame to get elevation using pandas.io.sql.
import pickle
with open('sql.config','rb') as fp:
config = pickle.load(fp)
con = sql.connect(**config)
df = psql.frame_query('''
SELECT DISTINCT(STAD)
FROM games;''', con)
con.close()
Now lets call Google's API for both location and elevation and save them into our pandas DataFrame object. We will not get a result for every entry, but luckily we will get them for most. Also, Google has rate limiting, so I had to use time.sleep value between calls to ensure Google does not deny my request.
main_url = "http://maps.googleapis.com/maps/api/geocode/json?"
ELEVATION_BASE_URL = 'http://maps.googleapis.com/maps/api/elevation/json'
LOCATION_BASE_URL = 'http://maps.googleapis.com/maps/api/geocode/json'
def getAlt(d):
loc_args = {'address': d, 'sensor': 'false'}
url = LOCATION_BASE_URL + '?' + urllib.urlencode(loc_args)
page = urllib2.urlopen(url)
data = json.load(page)
time.sleep(0.1)
if 'results' in data and len(data['results'])>0:
ll = data['results'][0]['geometry']['location']
loc = ','.join(str(x) for x in [ll['lat'],ll['lng']])
el_args = {'locations': loc, 'sensor': 'false'}
url = ELEVATION_BASE_URL + '?' + urllib.urlencode(el_args)
page = urllib2.urlopen(url)
el_data = json.load(page)
if 'results' in el_data:
return ll['lat'], ll['lng'], el_data['results'][0]['elevation']
else:
return None, None, None
return None, None, None
df['lat'], df['lng'], df['alt'] = zip(*df.STAD.map(getAlt))
df.head()
STAD | lat | lng | alt | |
---|---|---|---|---|
0 | Georgia Dome | 33.75769 | -84.40083 | 303.1637 |
1 | Cleveland Browns Stadium | 41.50648 | -81.70004 | 183.6383 |
2 | Texas Stadium | 32.84107 | -96.9109 | 133.4713 |
3 | Lambeau Field | 44.50136 | -88.06075 | 190.2768 |
4 | Arrowhead Stadium | 39.04894 | -94.48392 | 256.7705 |
As I mentioned above, there will be some stadium names that do not get a result from Google's API. Here is the list of stadiums without a location and elevation.
df[df.lat.isnull()].head(12)
STAD | lat | lng | alt | |
---|---|---|---|---|
10 | Network Associates Coliseum | None | None | None |
14 | Trans World Dome | None | None | None |
15 | PSINet Stadium | None | None | None |
19 | RCA Dome | None | None | None |
39 | Seahawks Stadium | None | None | None |
44 | Qwest Field | None | None | None |
47 | McAfee Coliseum | None | None | None |
48 | Azteca Stadium | None | None | None |
60 | Mall of America Field at HHH Metrodome | None | None | None |
62 | New Meadowlands Stadium | None | None | None |
68 | Mercedes-Benz Superdome | None | None | None |
69 | Snapdragon Stadium | None | None | None |
Since there are only 12, I will manually enter the rest by searching for the right city and finding the elevation. I will ignore putting in latitude and longitude for now, those columns are actually not needed for my analsysis.
df.ix[df.STAD.str.contains('Network Associates Coliseum'),'alt'] = 13.11
df.ix[df.STAD.str.contains('Trans World Dome'),'alt'] = 142.03
df.ix[df.STAD.str.contains('PSINet'),'alt'] = 3.05
df.ix[df.STAD.str.contains('RCA'),'alt'] = 216.34
df.ix[df.STAD.str.contains('Seahawks'),'alt'] = 5.31
df.ix[df.STAD.str.contains('Qwest'),'alt'] = 5.31
df.ix[df.STAD.str.contains('McAfee'),'alt'] = 10.74
df.ix[df.STAD.str.contains('Azteca'),'alt'] = 2268.13
df.ix[df.STAD.str.contains('Mall of America'),'alt'] = 253.42
df.ix[df.STAD.str.contains('New Meadowlands'),'alt'] = 0.86
df.ix[df.STAD.str.contains('Mercedes-Benz'),'alt'] = 0.0006
df.ix[df.STAD.str.contains('Snapdragon'),'alt'] = 22.89
df[df.lat.isnull()].head(12)
STAD | lat | lng | alt | |
---|---|---|---|---|
10 | Network Associates Coliseum | None | None | 13.11 |
14 | Trans World Dome | None | None | 142.03 |
15 | PSINet Stadium | None | None | 3.05 |
19 | RCA Dome | None | None | 216.34 |
39 | Seahawks Stadium | None | None | 5.31 |
44 | Qwest Field | None | None | 5.31 |
47 | McAfee Coliseum | None | None | 10.74 |
48 | Azteca Stadium | None | None | 2268.13 |
60 | Mall of America Field at HHH Metrodome | None | None | 253.42 |
62 | New Meadowlands Stadium | None | None | 0.86 |
68 | Mercedes-Benz Superdome | None | None | 0.0006 |
69 | Snapdragon Stadium | None | None | 22.89 |
Now lets save the results to a CSV file to avoid runing this code whenever we need the data.
df.to_csv('stadium_with_altitude.csv')
Lastly, here is a histogram of the elevation. Denver and Mexico City are the outlier stadiums, and the reason there is a count greater than one for Denver is that there are multiple stadium names in the dataset for the Broncos (as expected). This is true for a number of teams.
%matplotlib inline
import matplotlib.pyplot as plt
df.alt.hist(figsize=(12,8), bins=20)
plt.xlabel('Elevation (m)')
plt.ylabel('Count')
<matplotlib.text.Text at 0x10a839a50>
print 'Denver elevation: %.2f (m)' % df[df.STAD.str.contains('Mile')].alt[:1]
print 'Mexico City elevation: %.2f (m)' % df[df.STAD.str.contains('Azteca')].alt[:1]
Denver elevation: 1583.44 (m) Mexico City elevation: 2268.13 (m)
from IPython.core.display import HTML
def css_styling():
styles = open("style.css", "r").read()
return HTML(styles)
css_styling()