Configuring Weather History Database using MySQL on Amazon EC2

This notebook describes a nearly fully automated process for configuring a MySQL database with the Daily Global Weather Measurements data on Amazon EC2. The original data is available as a collection of text files on an Amazon EBS volume snapshot (snap-ac47f4c5), as described here:

http://aws.amazon.com/datasets/2759

Before running this script, three files are required:

  • rootkey.csv : containing the AWSAccessKeyId and AWSSecretKey values for the AWS account being used

  • passwords.csv : containing an ssh key name, an access password for the EC2 instance, an access password for the MySQL database, and the user's IP address.

  • populateDB.py : a Python file that will be sent to the EC2 instance to run one of the data loading processes. The text of this file is copied below, in the notes just before In[34].

A security group must also be set up in AWS prior to running this script, with the expected name of 'onlyMyIP' (see In[5] below).

The notebook also requires three Python libraries beyond the standard installation:

  • boto
  • paramiko
  • pandas

Finally, this notebook sadly cannot be run successfully via the Run All option. Instead, do the following:

  • scroll down to the heading "--- Manual Modification for Require TTY ---", select that cell

  • from there, select Run All Above from the Cell menu, wait for completion (5-10 minutes)

  • follow the instructions directly below that heading in order to change the EC2 instance to not require a TTY connection (which will then let us use Paramiko for scripted ssh connections to the machine)

  • then select Run All Below from the Cell menu, and the rest of the script will run to completion

The full script should take about 8-10 hours to run, but requires very few resources on the local machine (most of the work is done on the remote EC2 machine).

Libraries, Keys, Convenience Methods

Libraries: boto is used for AWS interaction, Paramiko for ssh, and Pandas is only used here for nice table outputs in IPython (could thus be cut out if desired).

In [1]:
import sys, os, time
import paramiko as pm
import boto as bt
import pandas as pd

Access keys and passwords: in external files, formatted as lines of "[name] = [value]", as per the AWS rootkey.csv download.

In [2]:
AWSAccessKeyId, AWSSecretKey = ( line.strip().split('=')[1] for line in open('/Users/brian/rootkey.csv','r') )
sshKeyName, instancePass, mysqlPass, myIP = ( line.strip().split('=')[1] for line in open('/Users/brian/passwords.csv','r') )
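Note that `split('=')[1]` silently truncates any value that happens to contain a second '='. A slightly more defensive parser for the same "[name] = [value]" format (a sketch; readKeyFile is a hypothetical helper, not used elsewhere in this notebook):

```python
def readKeyFile(path):
    """Parse lines of "[name] = [value]" into a dict.

    Splits on the first '=' only, so values containing '=' survive,
    and strips stray whitespace around both names and values.
    """
    values = {}
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if not line or '=' not in line:
                continue
            name, _, value = line.partition('=')
            values[name.strip()] = value.strip()
    return values
```

This also lets keys be looked up by name, e.g. readKeyFile('rootkey.csv')['AWSAccessKeyId'], instead of relying on line order.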

Convenience methods: just one in this case - waitUntilReady is used throughout to wait for completed status, particularly when interacting with AWS.

In [3]:
def waitUntilReady(obj,desiredStatus):
    while obj.update() != desiredStatus:
        sys.stdout.write(".")
        time.sleep(1)
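As written, waitUntilReady will poll forever if the resource never reaches the desired status (e.g. a volume that errors out). A variant with an overall timeout (a sketch; waitUntilReadyOrTimeout is a hypothetical addition, not used in the cells below):

```python
import sys, time

def waitUntilReadyOrTimeout(obj, desiredStatus, timeout=600, interval=1):
    """Poll obj.update() until it returns desiredStatus.

    Returns True on success, or False if `timeout` seconds elapse first.
    """
    waited = 0
    while obj.update() != desiredStatus:
        if waited >= timeout:
            return False
        sys.stdout.write(".")
        time.sleep(interval)
        waited += interval
    return True
```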

Connecting to EC2, Starting an Instance, Connecting Volumes

Establish a connection to EC2.

In [4]:
ec2 = bt.connect_ec2( aws_access_key_id = AWSAccessKeyId, 
                      aws_secret_access_key = AWSSecretKey )

Start an EC2 instance based on the basic Amazon Linux AMI.

In [5]:
res = ec2.run_instances( 'ami-146e2a7c', # "Amazon Linux AMI"
                         key_name = sshKeyName, 
                         instance_type = 'r3.large', 
                         security_groups = ['onlyMyIP'], # previously-configured security group
                         placement = 'us-east-1a' ) # region

allInstances = res.instances
instance = allInstances[0]

Create two EBS volumes, one empty, the other based on the weather data (snapshot 'snap-ac47f4c5').

In [6]:
mysqlVolume = ec2.create_volume(size=100, zone='us-east-1a', snapshot=None, volume_type='gp2')
givenDataVolume = ec2.create_volume(size=20, zone='us-east-1a', snapshot='snap-ac47f4c5', volume_type='gp2')

Wait until the instance and volumes are ready.

In [7]:
time.sleep(10) # to make sure the instance is registered before checking if it is ready
waitUntilReady(instance,'running')
waitUntilReady(mysqlVolume,'available')
waitUntilReady(givenDataVolume,'available')
instance, mysqlVolume, givenDataVolume
......
Out[7]:
(Instance:i-088a96f2, Volume:vol-c39b2d89, Volume:vol-a69a2cec)

Attach the volumes to the instance.

In [8]:
att1 = ec2.attach_volume(mysqlVolume.id, instance.id, '/dev/xvdc')
att2 = ec2.attach_volume(givenDataVolume.id, instance.id, '/dev/xvdb')

Wait until they are properly attached.

In [9]:
waitUntilReady(mysqlVolume,'in-use')
waitUntilReady(givenDataVolume,'in-use')

--- Manual Modification for Require TTY ---

At this point, an unfortunate thing needs to happen:

  • in terminal, manually ssh into the EC2 instance
  • sudo vim /etc/sudoers
  • add a line at the end to say : 'Defaults !requiretty'
  • save the edited file and exit

This will let us use Paramiko to run sudo commands without requiring a tty. Pain in the butt, but the Amazon Linux AMI ships with requiretty enabled by default in sudoers, so it is unavoidable.

SSH Methods

The following methods will be used hereafter for sending commands and transferring files to the EC2 instance via SSH.

In [10]:
def establishSSHconnection():
    rsaKey = pm.RSAKey.from_private_key_file(sshKeyName + '.pem')
    ssh = pm.SSHClient()
    ssh.set_missing_host_key_policy(pm.AutoAddPolicy())
    totalTime = 0 
    connected = False
    while not connected:
        try:
            ssh.connect( str(instance.public_dns_name), 
                         username = 'ec2-user', 
                         pkey = rsaKey )
            connected = True
        except Exception as e: 
            sys.stdout.write(".")
            time.sleep(1)
            totalTime += 1
            if totalTime >= 300:
                sys.stdout.write( str(instance.public_dns_name) + ': ' + 
                                  "Timed out waiting for instance to get ssh ready\n" )
                break
    return ssh

def prettyOut(obj):
    l = [r.strip('\n').split('\t') for r in obj.readlines()]
    if len(l) > 1:
        df = pd.DataFrame(l[1:len(l)])
        df.columns = l[0]
    else:
        df = pd.DataFrame(l)
    return df

def sshCommandWait(cmd,pretty=False):
    ssh = establishSSHconnection()
    stdin, stdout, stderr = ssh.exec_command(cmd)
    if pretty:
        stdout = prettyOut(stdout)
        stderr = prettyOut(stderr)
    else:
        stdout = stdout.readlines()
        stderr = stderr.readlines()
    ssh.close()
    return stdout, stderr

def sshCommandNoWait(cmd):
    ssh = establishSSHconnection()
    ssh.exec_command(cmd)
    ssh.close()

def sftpSendFile(localFile,remoteFile):
    ssh = establishSSHconnection()
    ftp = ssh.open_sftp()
    ftp.put(localFile,remoteFile)
    ftp.close()
    ssh.close()

def sftpGetFile(remoteFile,localFile):
    ssh = establishSSHconnection()
    ftp = ssh.open_sftp()
    try:
        ftp.get(remoteFile,localFile)
        success = True
    except IOError: # remote file does not exist (yet)
        success = False
    ftp.close()
    ssh.close()
    return success

A quick test of the SSH command method:

In [11]:
o,e = sshCommandWait('sudo echo hello')
o,e
Out[11]:
([u'hello\n'], [])

Mount the EBS Volumes

View the currently mounted volumes:

In [12]:
o,e = sshCommandWait('df -h')
o
Out[12]:
[u'Filesystem      Size  Used Avail Use% Mounted on\n',
 u'/dev/xvda1      7.8G  1.1G  6.6G  14% /\n',
 u'devtmpfs        7.5G   64K  7.5G   1% /dev\n',
 u'tmpfs           7.5G     0  7.5G   0% /dev/shm\n']

View all available volumes before starting the mounting process:

In [13]:
o,e = sshCommandWait('lsblk')
o
Out[13]:
[u'NAME    MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT\n',
 u'xvda    202:0    0    8G  0 disk \n',
 u'\u2514\u2500xvda1 202:1    0    8G  0 part /\n',
 u'xvdb    202:16   0   20G  0 disk \n',
 u'xvdc    202:32   0  100G  0 disk \n']

Format the new volume (only the new one, not the given data one):

In [14]:
o,e = sshCommandWait('sudo mkfs -t ext4 /dev/xvdc')

Make directories on the EC2 instance and mount the volumes. The mounting of the given data volume will take a few minutes.

In [15]:
o,e = sshCommandWait('sudo mkdir /mnt/givenDataVolume')
o,e = sshCommandWait('sudo mkdir /mnt/mysqlVolume')
o,e = sshCommandWait('sudo mount /dev/xvdb /mnt/givenDataVolume')
o,e = sshCommandWait('sudo mount /dev/xvdc /mnt/mysqlVolume')

View the mounted volumes again to confirm that the new volumes mounted correctly.

In [16]:
o,e = sshCommandWait('df -h')
o
Out[16]:
[u'Filesystem      Size  Used Avail Use% Mounted on\n',
 u'/dev/xvda1      7.8G  1.1G  6.6G  14% /\n',
 u'devtmpfs        7.5G   64K  7.5G   1% /dev\n',
 u'tmpfs           7.5G     0  7.5G   0% /dev/shm\n',
 u'/dev/xvdb        20G   17G  2.3G  88% /mnt/givenDataVolume\n',
 u'/dev/xvdc        99G   61M   94G   1% /mnt/mysqlVolume\n']

Set up MySQL on Instance

The Amazon Linux AMI's default repositories include mysql and mysql-server - we just need to update the installed packages and run the installers.

In [17]:
o,e = sshCommandWait('sudo yum -y update')
o,e = sshCommandWait('sudo yum -y install mysql')
o,e = sshCommandWait('sudo mkdir /var/lib/mysql')
o,e = sshCommandWait('sudo yum -y install mysql-server')

To use the attached volume for the database instead of the default folder, set up the necessary folders and permissions, and then change the data directory and the tmp directory in the MySQL configuration file.

In [18]:
o,e = sshCommandWait('sudo mkdir -p /mnt/mysqlVolume/data')
o,e = sshCommandWait('sudo mkdir -p /mnt/mysqlVolume/tmp')
o,e = sshCommandWait('sudo chown -R mysql.mysql /mnt/mysqlVolume/data')
o,e = sshCommandWait('sudo chown -R mysql.mysql /mnt/mysqlVolume/tmp')
In [19]:
cmd  = 'sudo sed -i '
cmd += '"s/datadir\=\/var\/lib\/mysql/datadir\=\/mnt\/mysqlVolume\/data\\ntmpdir\=\/mnt\/mysqlVolume\/tmp/g" '
cmd += '/etc/my.cnf'
o,e = sshCommandWait(cmd)
o,e
Out[19]:
([], [])
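The sed invocation above replaces the datadir line in /etc/my.cnf and inserts a tmpdir line directly after it. Its effect can be sanity-checked locally with an equivalent Python substitution (a sketch; the my.cnf excerpt below is a typical Amazon Linux default, not captured from the instance):

```python
import re

# An assumed default [mysqld] section (illustrative, not read from the instance)
my_cnf = """[mysqld]
datadir=/var/lib/mysql
socket=/var/lib/mysql/mysql.sock
"""

# The same substitution the sed command performs on /etc/my.cnf
edited = re.sub(r'datadir=/var/lib/mysql',
                'datadir=/mnt/mysqlVolume/data\ntmpdir=/mnt/mysqlVolume/tmp',
                my_cnf)
# 'edited' now points MySQL's data and tmp directories at the mounted EBS volume
```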

Start MySQL

In [20]:
o,e = sshCommandWait('sudo /etc/init.d/mysqld start')

Set the root password for MySQL, and configure the server to start on boot:

In [21]:
o,e = sshCommandWait('mysqladmin -u root password "' + mysqlPass + '"')
o,e = sshCommandWait('sudo chkconfig mysqld on')

A quick test of the server setup:

In [22]:
o,e = sshCommandWait('mysql -e "SHOW DATABASES;" -u root --password=' + mysqlPass, True)
o
Out[22]:
Database
0 information_schema
1 mysql
2 performance_schema
3 test

Configure Database

Add a database named 'weather' to MySQL:

In [23]:
o,e = sshCommandWait('mysql -e "CREATE DATABASE weather;" -u root --password=' + mysqlPass)

Create 3 tables in the database - 'country', 'station' and 'observation'. The first two will be small, the last one very large.
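The intended relationships are country.FIPS_ID ↔ station.FIPS_ID and station.USAF ↔ observation.STN. A trimmed-down sketch of the schema and its join path, using an in-memory SQLite database with made-up rows (SQLite stands in for MySQL purely for local illustration):

```python
import sqlite3

con = sqlite3.connect(':memory:')
cur = con.cursor()
# Trimmed versions of the three tables, keeping only the join columns
cur.execute('CREATE TABLE country (FIPS_ID TEXT PRIMARY KEY, country_name TEXT)')
cur.execute('CREATE TABLE station (USAF TEXT PRIMARY KEY, FIPS_ID TEXT, station_name TEXT)')
cur.execute('CREATE TABLE observation (STNYMD TEXT PRIMARY KEY, STN TEXT, temp INTEGER)')
# Made-up rows, purely for illustration
cur.execute("INSERT INTO country VALUES ('XX', 'EXAMPLELAND')")
cur.execute("INSERT INTO station VALUES ('000001', 'XX', 'EXAMPLE STN')")
cur.execute("INSERT INTO observation VALUES ('00000120090101', '000001', 320)")
# Join an observation back to its country via the station table
cur.execute("""SELECT c.country_name, s.station_name, o.temp
               FROM observation o
               JOIN station s ON o.STN = s.USAF
               JOIN country c ON s.FIPS_ID = c.FIPS_ID""")
row = cur.fetchone()
```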

country

This will hold the simple relationship between an ID and a country name, from the 'country-list.txt' file in the top-level folder of the given data directory.

In [24]:
sqlcommand  = 'CREATE TABLE country '
sqlcommand += '( FIPS_ID VARCHAR(2) PRIMARY KEY, country_name VARCHAR(47) ) '
sqlcommand += ';'
o,e = sshCommandWait('mysql -e "' + sqlcommand + '" -u root --password=' + mysqlPass + ' weather')
In [25]:
sqlcommand = 'DESCRIBE country;'
o,e = sshCommandWait('mysql -e "' + sqlcommand + '" -u root --password=' + mysqlPass + ' weather', True)
o
Out[25]:
Field Type Null Key Default Extra
0 FIPS_ID varchar(2) NO PRI NULL
1 country_name varchar(47) YES NULL
station

This table will hold metadata about the stations, including name, station ID, country ID, latitude, longitude and elevation. The data will come from the 'ish-history.csv' file in the top-level folder of the given data directory.

In [26]:
sqlcommand  = 'CREATE TABLE station '
sqlcommand += '( USAF VARCHAR(6) PRIMARY KEY, WBAN VARCHAR(6), '
sqlcommand += '  station_name VARCHAR(30), CTRY VARCHAR(2), FIPS_ID VARCHAR(2), ST VARCHAR(2), callid VARCHAR(4), '
sqlcommand += '  lat INT, lon INT, elev INT ) '
sqlcommand += ';'
o,e = sshCommandWait('mysql -e "' + sqlcommand + '" -u root --password=' + mysqlPass + ' weather')
In [27]:
sqlcommand = 'DESCRIBE station;'
o,e = sshCommandWait('mysql -e "' + sqlcommand + '" -u root --password=' + mysqlPass + ' weather', True)
o
Out[27]:
Field Type Null Key Default Extra
0 USAF varchar(6) NO PRI NULL
1 WBAN varchar(6) YES NULL
2 station_name varchar(30) YES NULL
3 CTRY varchar(2) YES NULL
4 FIPS_ID varchar(2) YES NULL
5 ST varchar(2) YES NULL
6 callid varchar(4) YES NULL
7 lat int(11) YES NULL
8 lon int(11) YES NULL
9 elev int(11) YES NULL
observation

This table will hold the weather data for each station (and is by far the largest of the three tables). The data will come from the fixed-width text files held in a large number of folders on the data volume. Note that in many of the fields in this table, the values will be stored as integers representing tenths of the real value (e.g. a value of 320 in the temp column is used to represent 32.0 degrees F).
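Client code therefore needs a small decoding step when reading these columns back. A sketch (fromTenths is a hypothetical helper; the sentinel handling is an assumption based on the 9999/99999 missing-value codes visible in the sample output further below):

```python
def fromTenths(value):
    """Convert a stored tenths integer back to its real value.

    Returns None for the dataset's apparent missing-value sentinels.
    """
    if value in (9999, 99999):
        return None
    return value / 10.0

# e.g. a stored temp of 320 decodes to 32.0 degrees F
```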

In [28]:
sqlcommand  = 'CREATE TABLE observation '
sqlcommand += '( STNYMD VARCHAR(14) PRIMARY KEY, '
sqlcommand += '  STN VARCHAR(6), WBAN VARCHAR(6), year INT, month INT, day INT, '
sqlcommand += '  temp INT, dewp INT, slp INT, stp INT, visib INT, '
sqlcommand += '  wdsp INT, mxspd INT, gust INT, maxtemp INT, mintemp INT, '
sqlcommand += '  prcp INT, sndp INT, FRSHTT VARCHAR(6) ) '
sqlcommand += ';'
o,e = sshCommandWait('mysql -e "' + sqlcommand + '" -u root --password=' + mysqlPass + ' weather')
In [29]:
sqlcommand = 'DESCRIBE observation;'
o,e = sshCommandWait('mysql -e "' + sqlcommand + '" -u root --password=' + mysqlPass + ' weather', True)
o
Out[29]:
Field Type Null Key Default Extra
0 STNYMD varchar(14) NO PRI NULL
1 STN varchar(6) YES NULL
2 WBAN varchar(6) YES NULL
3 year int(11) YES NULL
4 month int(11) YES NULL
5 day int(11) YES NULL
6 temp int(11) YES NULL
7 dewp int(11) YES NULL
8 slp int(11) YES NULL
9 stp int(11) YES NULL
10 visib int(11) YES NULL
11 wdsp int(11) YES NULL
12 mxspd int(11) YES NULL
13 gust int(11) YES NULL
14 maxtemp int(11) YES NULL
15 mintemp int(11) YES NULL
16 prcp int(11) YES NULL
17 sndp int(11) YES NULL
18 FRSHTT varchar(6) YES NULL

Populate Database

country

Populating the 'country' table requires just a simple file read.

In [30]:
sqlcommand  = 'LOAD DATA LOCAL INFILE \'\/mnt\/givenDataVolume\/country-list.txt\' '
sqlcommand += 'INTO TABLE country '
sqlcommand += 'FIELDS TERMINATED BY \'          \' '
sqlcommand += 'LINES TERMINATED BY \'\n\' '
sqlcommand += 'IGNORE 2 LINES '
sqlcommand += '; '
o,e = sshCommandWait('mysql -e "' + sqlcommand + '" -u root --password=' + mysqlPass + ' weather --local-infile')
In [31]:
sqlcommand = 'SELECT * FROM country LIMIT 5;'
o,e = sshCommandWait('mysql -e "' + sqlcommand + '" -u root --password=' + mysqlPass + ' weather', True)
o
Out[31]:
FIPS_ID country_name
0 AA ARUBA
1 AC ANTIGUA AND BARBUDA
2 AF AFGHANISTAN
3 AG ALGERIA
4 AI ASCENSION ISLAND
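The FIELDS TERMINATED BY string in the LOAD DATA statement above is a literal run of ten spaces, which is how country-list.txt separates the FIPS code from the country name. The split can be checked on synthetic lines (the rows here are made up, not real file contents):

```python
DELIM = ' ' * 10  # the literal ten-space delimiter from the LOAD DATA statement

# Synthetic "[code][10 spaces][name]" lines, not real country-list.txt contents
lines = ['XX' + DELIM + 'EXAMPLELAND',
         'YY' + DELIM + 'SAMPLE ISLANDS']
rows = [line.split(DELIM) for line in lines]
# rows -> [['XX', 'EXAMPLELAND'], ['YY', 'SAMPLE ISLANDS']]
```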
station

Populating the 'station' table also requires just a simple file read.

In [32]:
sqlcommand  = 'LOAD DATA LOCAL INFILE \'\/mnt\/givenDataVolume\/ish-history.csv\' '
sqlcommand += 'INTO TABLE station '
sqlcommand += 'FIELDS TERMINATED BY \',\' ENCLOSED BY \'\\"\' '
sqlcommand += 'LINES TERMINATED BY \'\n\' '
sqlcommand += 'IGNORE 1 LINES '
sqlcommand += '; '
o,e = sshCommandWait('mysql -e "' + sqlcommand + '" -u root --password=' + mysqlPass + ' weather --local-infile')
In [33]:
sqlcommand = 'SELECT * FROM station LIMIT 5;'
o,e = sshCommandWait('mysql -e "' + sqlcommand + '" -u root --password=' + mysqlPass + ' weather', True)
o
Out[33]:
USAF WBAN station_name CTRY FIPS_ID ST callid lat lon elev
0 000000 99999 NYGGBUKTA GREENLAND- STA GL GL 73483 21567 30
1 000010 99999 JAN HAYEN NO NO 70983 -7700 229
2 000020 99999 ISFJORD RADIO SPITZBERGEN NO NO 78067 13633 79
3 000030 99999 BJORNOYA BARENTS SEA NO NO 74467 19283 290
4 000040 99999 VAROO NO NO 70367 31100 119
observation

Populating the 'observation' table is much more complicated and time-consuming. The data is stored in multiple files in multiple folders:

gsod/[year]/[filename_by_station_ID]

Some of these files have a .op extension; others have no extension at all. They are also in fixed-width format, which LOAD DATA does not handle, and they contain some extra text that we do not want. So the following Python script first converts each file to a CSV in the form we want, then loads it into the observation table.

import os, sys
mysqlPass = sys.argv[1]
given_path = "/mnt/givenDataVolume/gsod/"
mysql_path = "/mnt/mysqlVolume/data/"

fixed_widths = [[1,6],[8,12],[15,18],[19,20],[21,22],[25,30],[36,41],[47,52],[58,63],
                [69,73],[79,83],[89,93],[96,100],[103,108],[111,116],[119,123],[126,130],
                [133,138]]


with open('log.txt','w') as logfile:

    for yr in range(1929,2010):

        files = []
        for (dirpath, dirnames, filenames) in os.walk(given_path + str(yr) + '/'):
            files.extend(filenames)
            break
        for f in files:
            logfile.write('year # ' + str(yr) + ', filename ' + f + '\n')

            # read file as fixed width, write as csv
            with open(given_path + str(yr) + '/' + f,'r') as fi:
                with open(mysql_path + f.replace(".op","") + '.csv','w') as fo:
                    for li in fi:
                        if li[0:3] != 'STN':

                            # primary key = concat(stn,yr,mo,da)
                            fo.write(li[0:6] + li[14:22] + ',')

                            # remaining fields
                            for k, w in enumerate(fixed_widths):
                                if k < len(fixed_widths) - 1:
                                    if k > 4:
                                        # round, rather than truncate, when converting to tenths
                                        fo.write( str(int(round(float(li[w[0]-1:w[1]])*10))) + ',')
                                    else:
                                        fo.write( li[w[0]-1:w[1]] + ',')
                                else:
                                    fo.write( li[w[0]-1:w[1]] + '\n')

            # load text file to mysql
            logfile.write("loading to mysql" + '\n')
            cmd = 'mysql -e '
            cmd += '"LOAD DATA LOCAL INFILE \'' + mysql_path + f.replace(".op","") + '.csv' + '\' '
            cmd += 'INTO TABLE observation FIELDS TERMINATED BY \',\' '
            cmd += 'LINES TERMINATED BY \'\n\' '
            cmd += ';" '
            cmd += '-u root --password=' + mysqlPass + ' weather --local-infile'
            os.system(cmd)

            # delete csv file
            os.system('sudo rm ' + mysql_path + f.replace(".op","") + '.csv')

with open('end.txt','w') as f:
    f.write('complete')
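The fixed-width slicing at the heart of the script can be exercised on a synthetic line (the values and the line itself are made up for illustration; only the column spec is copied from the script):

```python
# Column spec from the script above: 1-based, inclusive [start, end] pairs
fixed_widths = [[1,6],[8,12],[15,18],[19,20],[21,22],[25,30],[36,41],[47,52],[58,63],
                [69,73],[79,83],[89,93],[96,100],[103,108],[111,116],[119,123],[126,130],
                [133,138]]

# Build a synthetic fixed-width line (not a real GSOD record) by right-justifying
# a made-up value into each column span
values = ['010010', '99999', '2009', '03', '09', '32.0', '30.1', '1013.2', '1010.0',
          '10.0', '5.0', '9.0', '12.0', '35.6', '28.4', '0.00', '1.0', '000000']
line = [' '] * 140
for (start, end), v in zip(fixed_widths, values):
    line[start-1:end] = list(v.rjust(end - start + 1))
line = ''.join(line)

# Re-slice the line exactly the way the script does
parsed = [line[start-1:end] for start, end in fixed_widths]
assert [p.strip() for p in parsed] == values

# Columns after index 4 are stored as integer tenths, e.g. 32.0 -> 320
assert int(float(parsed[5]) * 10) == 320
```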

The script above should be saved to a file named 'populateDB.py' in the local working directory. The following command then sends it to the EC2 instance:

In [34]:
sftpSendFile('populateDB.py','populateDB.py')

We can now run it on the remote machine; this may take 3-3.5 hours. Nohup and output redirection let the process keep running independently of the SSH connection.

In [35]:
sshCommandNoWait('sudo nohup python populateDB.py ' + mysqlPass + ' > proc.out 2> proc.err < /dev/null &')
In [36]:
fileReady = False
fillTime = 0
while not fileReady:
    fileReady = sftpGetFile('end.txt','test.txt')
    if not fileReady:
        time.sleep(60)
        fillTime += 1
        if fillTime > 100*60: # give up after 100 hours (6,000 one-minute polls)
            break
print fillTime
203

A quick test to make sure that the populating script worked:

In [37]:
sqlcommand = 'SELECT * FROM observation LIMIT 5;'
o,e = sshCommandWait('mysql -e "' + sqlcommand + '" -u root --password=' + mysqlPass + ' weather', True)
o
Out[37]:
STNYMD STN WBAN year month day temp dewp slp stp visib wdsp mxspd gust maxtemp mintemp prcp sndp FRSHTT
0 00820920090309 008209 99999 2009 3 9 782 710 99999 99999 9999 13 19 9999 806 716 0 9999 000000
1 00820920090310 008209 99999 2009 3 10 678 656 99999 99999 9999 12 41 9999 806 554 0 9999 000000
2 00820920090311 008209 99999 2009 3 11 681 646 99999 99999 9999 13 29 9999 824 554 0 9999 000000
3 00820920090312 008209 99999 2009 3 12 666 639 99999 99999 9999 12 41 9999 806 572 0 9999 000000
4 00820920090313 008209 99999 2009 3 13 669 622 99999 99999 9999 13 41 9999 806 554 0 9999 000000

The following shows some general information about the table. Note that it contains approximately 109,880,661 rows (InnoDB's row count in SHOW TABLE STATUS is an estimate).

In [38]:
sqlcommand = "SHOW TABLE STATUS LIKE 'observation';"
o,e = sshCommandWait('mysql -e "' + sqlcommand + '" -u root --password=' + mysqlPass + ' weather')
o
Out[38]:
[u'Name\tEngine\tVersion\tRow_format\tRows\tAvg_row_length\tData_length\tMax_data_length\tIndex_length\tData_free\tAuto_increment\tCreate_time\tUpdate_time\tCheck_time\tCollation\tChecksum\tCreate_options\tComment\n',
 u'observation\tInnoDB\t10\tCompact\t109880661\t145\t15998124032\t0\t0\t7340032\tNULL\t2015-03-24 02:27:57\tNULL\tNULL\tlatin1_swedish_ci\tNULL\t\t\n']

The following checks the disk usage of the database. Note that at this stage (before configuring the table indexes), the database requires about 16 GB of disk space.

In [39]:
o,e = sshCommandWait('df -h')
o
Out[39]:
[u'Filesystem      Size  Used Avail Use% Mounted on\n',
 u'/dev/xvda1      7.8G  1.4G  6.4G  18% /\n',
 u'devtmpfs        7.5G   64K  7.5G   1% /dev\n',
 u'tmpfs           7.5G     0  7.5G   0% /dev/shm\n',
 u'/dev/xvdb        20G   17G  2.3G  88% /mnt/givenDataVolume\n',
 u'/dev/xvdc        99G   16G   79G  17% /mnt/mysqlVolume\n']

Detach and Delete Given Data Volume

Since we are not using the given data volume any more (all of the data has been copied to the MySQL database), we can detach and delete it so that we are no longer being charged for it.

In [40]:
o,e = sshCommandWait('sudo umount -d /dev/xvdb')
In [41]:
givenDataVolume.detach()
waitUntilReady(givenDataVolume,'available')
givenDataVolume.delete()
........
Out[41]:
True

Configure Indices on Tables

To make queries on the tables reasonably fast (particularly on the large table), we need to add indexes.
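The effect of an index on a query plan can be illustrated locally (SQLite's EXPLAIN QUERY PLAN stands in for MySQL's EXPLAIN here, purely for illustration):

```python
import sqlite3

con = sqlite3.connect(':memory:')
cur = con.cursor()
cur.execute('CREATE TABLE observation (STNYMD TEXT PRIMARY KEY, year INT, temp INT)')

# Without an index on year, a filter on year must scan the whole table
before = cur.execute(
    'EXPLAIN QUERY PLAN SELECT * FROM observation WHERE year = 2009').fetchall()[0][-1]

# With the index, the planner switches to an index search
cur.execute('CREATE INDEX idx_year ON observation (year)')
after = cur.execute(
    'EXPLAIN QUERY PLAN SELECT * FROM observation WHERE year = 2009').fetchall()[0][-1]
# 'before' reports a table SCAN; 'after' reports a SEARCH using idx_year
```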

In [42]:
sqlcommand  = 'ALTER TABLE station '
sqlcommand += 'ADD INDEX (FIPS_ID), '
sqlcommand += 'ADD INDEX (lat), '
sqlcommand += 'ADD INDEX (lon) '
sqlcommand += ';'
cmd  = 'sudo nohup mysql -e "' + sqlcommand + '" -u root --password=' + mysqlPass + ' weather' 
cmd += ' > proc2.out 2> proc2.err < /dev/null &'
sshCommandNoWait(cmd)
In [43]:
sqlcommand = 'DESCRIBE station;'
o,e = sshCommandWait('mysql -e "' + sqlcommand + '" -u root --password=' + mysqlPass + ' weather', True)
o
Out[43]:
Field Type Null Key Default Extra
0 USAF varchar(6) NO PRI NULL
1 WBAN varchar(6) YES NULL
2 station_name varchar(30) YES NULL
3 CTRY varchar(2) YES NULL
4 FIPS_ID varchar(2) YES MUL NULL
5 ST varchar(2) YES NULL
6 callid varchar(4) YES NULL
7 lat int(11) YES MUL NULL
8 lon int(11) YES MUL NULL
9 elev int(11) YES NULL
In [44]:
sqlcommand  = 'ALTER TABLE observation '
sqlcommand += 'ADD INDEX (STN), '
sqlcommand += 'ADD INDEX (year), '
sqlcommand += 'ADD INDEX (month), '
sqlcommand += 'ADD INDEX (day), '
sqlcommand += 'ADD INDEX (temp), '
sqlcommand += 'ADD INDEX (maxtemp), '
sqlcommand += 'ADD INDEX (mintemp) '
sqlcommand += ';'
cmd  = 'sudo time nohup mysql -e "' + sqlcommand + '" -u root --password=' + mysqlPass + ' weather' 
cmd += ' > proc2.out 2> proc2.err < /dev/null &'
sshCommandNoWait(cmd)
In [45]:
time.sleep(60*fillTime*2) # wait roughly twice as long as the data load took
In [46]:
sqlcommand = 'DESCRIBE observation;'
o,e = sshCommandWait('mysql -e "' + sqlcommand + '" -u root --password=' + mysqlPass + ' weather', True)
o
Out[46]:
Field Type Null Key Default Extra
0 STNYMD varchar(14) NO PRI NULL
1 STN varchar(6) YES MUL NULL
2 WBAN varchar(6) YES NULL
3 year int(11) YES MUL NULL
4 month int(11) YES MUL NULL
5 day int(11) YES MUL NULL
6 temp int(11) YES MUL NULL
7 dewp int(11) YES NULL
8 slp int(11) YES NULL
9 stp int(11) YES NULL
10 visib int(11) YES NULL
11 wdsp int(11) YES NULL
12 mxspd int(11) YES NULL
13 gust int(11) YES NULL
14 maxtemp int(11) YES MUL NULL
15 mintemp int(11) YES MUL NULL
16 prcp int(11) YES NULL
17 sndp int(11) YES NULL
18 FRSHTT varchar(6) YES NULL

The following again checks the disk usage of the database. Note that adding the indexes has more than doubled the database's disk space requirement.

In [47]:
o,e = sshCommandWait('df -h')
o
Out[47]:
[u'Filesystem      Size  Used Avail Use% Mounted on\n',
 u'/dev/xvda1      7.8G  1.4G  6.4G  18% /\n',
 u'devtmpfs        7.5G   60K  7.5G   1% /dev\n',
 u'tmpfs           7.5G     0  7.5G   0% /dev/shm\n',
 u'/dev/xvdc        99G   34G   60G  37% /mnt/mysqlVolume\n']

Make Database Remotely Accessible

The following change to the MySQL configuration file makes the database remotely accessible. The MySQL server must be restarted after the change.

In [48]:
cmd = 'sudo sed -i "s/mysql\\.sock/mysql\\.sock\\nbind-address\\=0\\.0\\.0\\.0/g" /etc/my.cnf'
o,e = sshCommandWait(cmd)
In [49]:
o,e = sshCommandWait('sudo /etc/init.d/mysqld restart')

Set up a new MySQL user named "weatherdbuser" and grant it full privileges on the weather database when connecting from 'myIP'.

In [50]:
sqlcommand = 'GRANT ALL ON weather.* TO weatherdbuser@' + myIP + ' IDENTIFIED BY \'' + mysqlPass + '\';'
o,e = sshCommandWait('mysql -e "' + sqlcommand + '" -u root --password=' + mysqlPass)

Test MySQL Connection

A companion (shorter) IPython notebook, weather_queries.ipynb, runs a variety of queries against the remote database; here we just provide a quick example.

In [51]:
import MySQLdb as mdb
print 'opening connection'
con = mdb.connect( host = instance.public_dns_name, 
                   passwd = mysqlPass, 
                   user = 'weatherdbuser', 
                   db='weather' );
cur = con.cursor()
cur.execute("SELECT * FROM observation LIMIT 10;")
for i in range(cur.rowcount):
    row = cur.fetchone()
    print row
print 'closing connection'
con.close()
opening connection
('00820920090309', '008209', '99999', 2009L, 3L, 9L, 782L, 710L, 99999L, 99999L, 9999L, 13L, 19L, 9999L, 806L, 716L, 0L, 9999L, '000000')
('00820920090310', '008209', '99999', 2009L, 3L, 10L, 678L, 656L, 99999L, 99999L, 9999L, 12L, 41L, 9999L, 806L, 554L, 0L, 9999L, '000000')
('00820920090311', '008209', '99999', 2009L, 3L, 11L, 681L, 646L, 99999L, 99999L, 9999L, 13L, 29L, 9999L, 824L, 554L, 0L, 9999L, '000000')
('00820920090312', '008209', '99999', 2009L, 3L, 12L, 666L, 639L, 99999L, 99999L, 9999L, 12L, 41L, 9999L, 806L, 572L, 0L, 9999L, '000000')
('00820920090313', '008209', '99999', 2009L, 3L, 13L, 669L, 622L, 99999L, 99999L, 9999L, 13L, 41L, 9999L, 806L, 554L, 0L, 9999L, '000000')
('00820920090314', '008209', '99999', 2009L, 3L, 14L, 700L, 691L, 99999L, 99999L, 9999L, 11L, 60L, 9999L, 806L, 662L, 0L, 9999L, '000000')
('00820920090315', '008209', '99999', 2009L, 3L, 15L, 715L, 715L, 99999L, 99999L, 9999L, 15L, 29L, 9999L, 806L, 698L, 0L, 9999L, '000000')
('00820920090316', '008209', '99999', 2009L, 3L, 16L, 692L, 692L, 99999L, 99999L, 9999L, 15L, 80L, 9999L, 716L, 644L, 0L, 9999L, '000000')
('00820920090317', '008209', '99999', 2009L, 3L, 17L, 667L, 604L, 99999L, 99999L, 9999L, 26L, 70L, 9999L, 824L, 590L, 0L, 9999L, '000000')
('00820920090318', '008209', '99999', 2009L, 3L, 18L, 652L, 599L, 99999L, 99999L, 9999L, 18L, 60L, 9999L, 806L, 554L, 0L, 9999L, '000000')
closing connection