Using MySQL Weather History Database on Amazon EC2

This notebook shows some example queries using the weather history database set up as described in the complementary notebook weather_mysqlconfig. The MySQLdb library is used to manage the interaction with the database. See this blog post for more context.

Configuration: Libaries, Keys, Convenience Methods

In [1]:
import pandas as pd
import numpy as np
import MySQLdb as mdb
In [2]:
AWSAccessKeyId, AWSSecretKey = ( line.strip().split('=')[1] for line in open('/Users/brian/rootkey.csv','r') )
sshKeyName, instancePass, mysqlPass, myIP = ( line.strip().split('=')[1] for line in open('/Users/brian/passwords.csv','r') )
In [3]:
def sqlCall(cmd):
    con = mdb.connect( host = 'ec2-54-91-115-227.compute-1.amazonaws.com', 
                   passwd = mysqlPass, 
                   user = 'weatherdbuser', 
                   db='weather' );
    cur = con.cursor()
    cur.execute(cmd)
    field_names = [i[0] for i in cur.description]
    output = []
    for i in range(cur.rowcount):
        output.append(cur.fetchone())
    con.close()
    df = pd.DataFrame(output)
    try:
        df.columns = field_names
    except:
        pass
    return df

Quick Views of the Tables in the Database

In [4]:
cmd =  "SELECT * FROM country LIMIT 5"
cmd += ";"
sqlCall(cmd)
Out[4]:
FIPS_ID country_name
0 AA ARUBA
1 AC ANTIGUA AND BARBUDA
2 AF AFGHANISTAN
3 AG ALGERIA
4 AI ASCENSION ISLAND
In [5]:
cmd =  "SELECT * FROM station LIMIT 5"
cmd += ";"
sqlCall(cmd)
Out[5]:
USAF WBAN station_name CTRY FIPS_ID ST callid lat lon elev
0 000000 99999 NYGGBUKTA GREENLAND- STA GL GL 73483 21567 30
1 000010 99999 JAN HAYEN NO NO 70983 -7700 229
2 000020 99999 ISFJORD RADIO SPITZBERGEN NO NO 78067 13633 79
3 000030 99999 BJORNOYA BARENTS SEA NO NO 74467 19283 290
4 000040 99999 VAROO NO NO 70367 31100 119
In [6]:
cmd =  "SELECT * FROM observation LIMIT 5"
cmd += ";"
sqlCall(cmd)
Out[6]:
STNYMD STN WBAN year month day temp dewp slp stp visib wdsp mxspd gust maxtemp mintemp prcp sndp FRSHTT
0 00820920090309 008209 99999 2009 3 9 782 710 99999 99999 9999 13 19 9999 806 716 0 9999 000000
1 00820920090310 008209 99999 2009 3 10 678 656 99999 99999 9999 12 41 9999 806 554 0 9999 000000
2 00820920090311 008209 99999 2009 3 11 681 646 99999 99999 9999 13 29 9999 824 554 0 9999 000000
3 00820920090312 008209 99999 2009 3 12 666 639 99999 99999 9999 12 41 9999 806 572 0 9999 000000
4 00820920090313 008209 99999 2009 3 13 669 622 99999 99999 9999 13 41 9999 806 554 0 9999 000000

Selecting a Subset of Stations for Further Analysis and Visualization

The database contains weather data for more than 9000 sites. For initial exploration and visualization, we will select a subset of these stations. In particular, the following cells ([7] through [11]) select the longest-running record (ie. the one with the most observations) from each country, breaking ties randomly.

In [7]:
cmd  = "SELECT USAF, FIPS_ID, obs.numYears "
cmd += "FROM station "
cmd += "JOIN (SELECT STN, COUNT(DISTINCT year) as numYears "
cmd += "FROM observation "
cmd += "GROUP BY STN) as obs "
cmd += "ON station.USAF = obs.STN "
cmd += ";"
r = sqlCall(cmd)
In [8]:
r.head()
Out[8]:
USAF FIPS_ID numYears
0 010010 NO 41
1 010013 NO 3
2 010014 NO 24
3 010015 NO 23
4 010016 NO 5
In [9]:
idx = r.groupby(['FIPS_ID'])['numYears'].transform(max) == r['numYears']
longestPerCountry = r[idx].sort('FIPS_ID')
longestPerCountry.tail()
Out[9]:
USAF FIPS_ID numYears
11161 679750 ZI 44
11128 677750 ZI 44
11167 679910 ZI 44
11124 677650 ZI 44
11143 678670 ZI 44
In [10]:
fn = lambda obj: obj.loc[np.random.choice(obj.index, 1, True),:]
selectedSites = longestPerCountry.groupby('FIPS_ID', as_index=False).apply(fn)
selectedSites.tail()
Out[10]:
USAF FIPS_ID numYears
255 7819 414363 YM 22
256 3391 131500 YU 31
257 10227 601400 YY 3
258 11106 676650 ZA 37
259 11167 679910 ZI 44
In [11]:
selectedSitesList = selectedSites['USAF'].tolist()
selectedSitesList[:10]
Out[11]:
['035963',
 '789820',
 '788620',
 '411960',
 '409480',
 '603900',
 '375750',
 '136150',
 '377890',
 '664220']

Collecting Site Description Details for these Sites

The following collects the ID, station name, country name, latitude, longitude and elevation for each of the stations on the given site list.

In [12]:
slist = str(selectedSitesList)[1:-1]
slist[:100]
Out[12]:
"'035963', '789820', '788620', '411960', '409480', '603900', '375750', '136150', '377890', '664220', "
In [13]:
cmd  = "SELECT s.USAF, s.station_name, c.country_name, "
cmd += "(s.lat / 1000) as lat, (s.lon / 1000) as lon, (s.elev / 10) as elev "
cmd += "FROM station s "
cmd += "JOIN country c ON s.FIPS_ID = c.FIPS_ID "
cmd += "WHERE s.USAF IN (" + slist + ");"
In [14]:
sites = sqlCall(cmd)
sites[['lat','lon','elev']] = sites[['lat','lon','elev']].astype(float)
sites.head()
Out[14]:
USAF station_name country_name lat lon elev
0 010620 HOPEN SVALBARD 76.500 25.067 10
1 013840 OSLO/GARDERMOEN NORWAY 60.200 11.083 204
2 026800 HOBURG SWEDEN 56.917 18.150 39
3 029740 HELSINKI-VANTAA FINLAND 60.317 24.967 56
4 031350 PRESTWICK(CIV/NAVY) UNITED KINGDOM 55.500 -4.583 20

Collecting Monthly Average, Minimum and Maximum Temperatures for these Sites

The weather observations in the database are recorded on a daily basis. For our purposes, we are only interested in the monthly average, minimum and maximum temperatures, and only for the site list described above.

In [24]:
siteTemps = []
for site in selectedSitesList:
    cmd  = "SELECT STN, year, month, AVG(temp), MAX(maxtemp), MIN(mintemp) "
    cmd += "FROM observation "
    cmd += "WHERE STN = '" + site + "' " 
    cmd += "GROUP BY year, month "
    cmd += ";"
    siteTemps.append(sqlCall(cmd))
siteTemps[0].head(10)
Out[24]:
STN year month AVG(temp) MAX(maxtemp) MIN(mintemp)
0 035963 1945 1 330.0323 473 174
1 035963 1945 2 453.7500 603 313
2 035963 1945 3 466.0000 644 313
3 035963 1945 4 489.9333 734 324
4 035963 1945 5 545.1667 793 324
5 035963 1945 6 592.4333 783 453
6 035963 1945 7 590.0000 703 484
7 035963 1951 6 590.1200 763 441
8 035963 1951 7 624.9032 784 453
9 035963 1951 8 610.3548 734 473

Initial graphing

The following graph shows the distribution of selected sites by latitude and longitude (note that the rough shape of Africa is visible in the center of the graph).

In [25]:
sites.plot(x='lon',y='lat',kind='scatter')
Out[25]:
<matplotlib.axes._subplots.AxesSubplot at 0x10863fdd0>

The following graph shows an example scatter plot of monthly average temperatures (over many years) for a single site.

In [49]:
grapher = siteTemps[0]
grapher[['year']] = grapher[['year']].astype(int)
grapher[['month']] = grapher[['month']].astype(int)
grapher[['t']] = grapher[['AVG(temp)']].astype(float) / 10
grapher.groupby('year')
grapher.plot(x='month',y='t',kind='scatter')
Out[49]:
<matplotlib.axes._subplots.AxesSubplot at 0x10b8b2210>

Exporting Processed Data for D3 Visualization

The sites and temperature data is exported to text files as follows. It is then used in the D3 visualization shown in the blog post here

In [26]:
sites.to_csv('sites.csv',sep='\t',index=False)
In [27]:
for i in range(len(selectedSitesList)):
    siteTemps[i].to_csv('sitesTemps' + str(i) + '.csv',sep='\t',index=False)