NSF Funding History 1990 - 2016

Ths script requests zipped files containing annual funding history from the NSF website, unpacks, parses and loads the resulting xml files (in-memory) to 4 relational data sets into SQLite, and finally performs some analysis using Pandas, Seaborn and Plotly.

In [1]:
import os
import sys
import json
import re
import csv
import datetime as dt
import requests
from collections import OrderedDict
from io import BytesIO
from zipfile import ZipFile
from lxml import etree as ET

import pandas as pd
import numpy as np
import sqlalchemy as sql
import matplotlib.pylab as plt
import seaborn as sns
import plotly.plotly as py
from plotly.graph_objs import Bar, Scatter, Marker, Layout
from plotly.tools import FigureFactory as FF
from wordcloud import WordCloud, STOPWORDS
from bs4 import BeautifulSoup
from nltk.corpus import stopwords
from IPython.display import HTML

%matplotlib inline

Engineering

Scraping the NSF

downloaded from: http://www.nsf.gov/awardsearch/download.jsp

In [2]:
# Constants
data_in  = 'data_in/'
data_out = 'data_out/' 

# URL for http-requests
req_root = 'http://www.nsf.gov/awardsearch/download?DownloadFileName={YR}&All=true'

# When to start looking at awards, and when to end. It is the YR in the root
year_start = 1990
year_end = 2017

csv_out = [
    'NSF_AWARDS.csv',
    'NSF_ABSTRACTS.csv',
    'NSF_INSTITUTIONS.csv',
    'NSF_PI.csv'
]

department_normalization = {
    'Polar Progrms': 'Polar Programs',
    'Arctic Sciences Division' : 'Polar Programs',
    'Research On Learning In Formal And Informal Settings (DRL)': 'Research On Learning',
    'Information Systems': 'Information & Intelligent Systems',
    'Civil, Mechanical, And Manufacturing Innovation': 'Civil, Mechanical, And Manufacturing Inn',
    'Behavioral And Cognitive Sci' : 'Behavioral And Cognitive Sciences',
    'Integrative Organismal System' : 'Integrative Organismal Sys',
    'Information Systems': 'Information & Intelligent Systems'

}

modifiers_to_remove = {
    'Division ': '', 
    'Of ': '',
    'Div ': '',
    'Directorate ':'',
    'Office ': '',
    'Direct ': '',
    'Divn ': '',
    'For ': '',
}

Each annual zipped file contains one xml file per award:

In [621]:
with open('data_in/2016/1623912.xml', 'r') as xml_view:
    print(xml_view.read())
<?xml version="1.0" encoding="UTF-8"?>

<rootTag>
  <Award>
    <AwardTitle>Dynamics of Extreme Precipitation in the Northeast United States in Observations and Models</AwardTitle>
    <AwardEffectiveDate>11/01/2016</AwardEffectiveDate>
    <AwardExpirationDate>10/31/2019</AwardExpirationDate>
    <AwardAmount>453807</AwardAmount>
    <AwardInstrument>
      <Value>Standard Grant</Value>
    </AwardInstrument>
    <Organization>
      <Code>06020100</Code>
      <Directorate>
        <LongName>Directorate For Geosciences</LongName>
      </Directorate>
      <Division>
        <LongName>Div Atmospheric &amp; Geospace Sciences</LongName>
      </Division>
    </Organization>
    <ProgramOfficer>
      <SignBlockName>Anjuli S. Bamzai</SignBlockName>
    </ProgramOfficer>
    <AbstractNarration>Extreme precipitation and its related impacts, especially flooding, result in significant loss of life, property and infrastructure damage, transportation disruption, and storm water pollution, and have economic costs of more than $8 billion per year for the US. This project will undertake fundamental research into understanding the causes of extreme precipitation in the Northeast US. This is the most economically developed and densely populated region of the country. Improved model representation of these processes is a crucial step in the overall goal of improving forecasts and projections of these high-cost events, thereby mitigating their impacts. &lt;br/&gt;&lt;br/&gt;Processes that cause extreme precipitation over daily to weekly periods in the Northeast US will be identified. The ability of current climate models to reproduce these processes will be examined. The two motivating questions are: What types of storms cause extreme precipitation in the Northeast? Do current models correctly reproduce these storms types and their relationship to extreme precipitation? Using observational data, storm types associated with extreme precipitation will be identified by applying advanced analytic techniques. Characteristic patterns in the jet stream and other storm features that occur in association with extreme precipitation will be identified. This analysis will then be undertaken on the climate model output to identify the storm types that are produced in the models and compare the modeled types to the observed types. The differences will be highlighted for use in model development and for providing context for model forecasts and projections. The physical processes by which the extreme precipitation are generated within each storm type will also be investigated. The relative strength of different factors that are known to influence precipitation, such as the amount of moisture in the lower atmosphere, will be examined within each storm type. After the key factors are identified, their influence will then be further tested in a regional, high resolution model by changing the strength of individual factors and examining how the modeled precipitation changes in response.</AbstractNarration>
    <MinAmdLetterDate>10/25/2016</MinAmdLetterDate>
    <MaxAmdLetterDate>10/25/2016</MaxAmdLetterDate>
    <ARRAAmount/>
    <AwardID>1623912</AwardID>
    <Investigator>
      <FirstName>Mathew</FirstName>
      <LastName>Barlow</LastName>
      <EmailAddress>[email protected]</EmailAddress>
      <StartDate>10/25/2016</StartDate>
      <EndDate/>
      <RoleCode>Principal Investigator</RoleCode>
    </Investigator>
    <Investigator>
      <FirstName>JianHua</FirstName>
      <LastName>Qian</LastName>
      <EmailAddress>[email protected]</EmailAddress>
      <StartDate>10/25/2016</StartDate>
      <EndDate/>
      <RoleCode>Co-Principal Investigator</RoleCode>
    </Investigator>
    <Institution>
      <Name>University of Massachusetts Lowell</Name>
      <CityName>Lowell</CityName>
      <ZipCode>018543643</ZipCode>
      <PhoneNumber>9789344170</PhoneNumber>
      <StreetAddress>600 Suffolk Street</StreetAddress>
      <CountryName>United States</CountryName>
      <StateName>Massachusetts</StateName>
      <StateCode>MA</StateCode>
    </Institution>
    <ProgramElement>
      <Code>5740</Code>
      <Text>CLIMATE &amp; LARGE-SCALE DYNAMICS</Text>
    </ProgramElement>
  </Award>
</rootTag>

In [626]:
def division_cleanse(raw_division):
    """
    Use normalization dicts to standardize legacy division naming.
    """

    rep = dict((re.escape(k), v) for k, v in modifiers_to_remove.items())
    pattern = re.compile("|".join(rep.keys()))

    division = pattern.sub(lambda m: rep[re.escape(m.group(0))],
                           raw_division.title())

    return department_normalization.get(division, division)

def get_PIs(list_investigators):
    """
    Returns a semi-colon joined list of PI names (first and last)
    """
    return ';'.join(
        ['{} {}'.format(PI.find('FirstName').text, 
                        PI.find('LastName').text) 
         for PI in list_investigators])


# Builder functions
def awards_builder(tree):
    global awards
    list_investigators = tree.findall('.//Award/Investigator')
    return awards.append(
            OrderedDict([
                    ('AwardID',tree.find('Award/AwardID').text),
                    ('AwardTitle',tree.find('Award/AwardTitle').text),
                    ('AwardEffectiveDate',tree.find('Award/AwardEffectiveDate').text),
                    ('AwardExpirationDate',tree.find('Award/AwardExpirationDate').text),
                    ('AwardAmount',tree.find('Award/AwardAmount').text),
                    ('InstitutionName',tree.find('Award/Institution/Name').text),
                    ('Division', division_cleanse(tree.find('Award/Organization/Division')[0].text)),
                    ('Directorate', division_cleanse(tree.find('Award/Organization/Directorate')[0].text)),
                    ('EmailAddress',tree.find('Award/Investigator/EmailAddress').text),
                    ('InvestigatorNames', get_PIs(list_investigators)),
                    ('NumInvestigators', len(list_investigators))
            ]))

def abstracts_builder(tree):
    global abstracts
    return abstracts.append(
            OrderedDict([
                    ('AwardID',tree.find('Award/AwardID').text),
                    ('Abstract',tree.find('Award/AbstractNarration').text)
             ]))

def institutions_builder(tree):
    global institutions
    return institutions.append(
            OrderedDict([
                    ('InstitutionName',tree.find('Award/Institution/Name').text),
                    ('StreetAddress',tree.find('Award/Institution/StreetAddress').text),
                    ('CityName',tree.find('Award/Institution/CityName').text),
                    ('StateCode',tree.find('Award/Institution/StateCode').text),
                    ('ZipCode',tree.find('Award/Institution/ZipCode').text),
                    ('CountryName',tree.find('Award/Institution/CountryName').text)
            ]))

def PI_builder(tree):
    global PI
    return PI.append(
            OrderedDict([
                    ('EmailAddress',tree.find('Award/Investigator/EmailAddress').text),
                    ('FirstName',tree.find('Award/Investigator/FirstName').text),
                    ('LastName',tree.find('Award/Investigator/LastName').text)
            ]))

def xml_parse(file):
    '''
    Calls all four data set builders after parsing a xml file into a tree.
    This function is called for each unzipped-file.
    '''
    try:
        tree = ET.parse(file)
    except:
        return("empty")
    try:
        awards_builder(tree)
    except:
        return(tree.find('Award/AwardID').text,"is missing an award key")
    
    abstracts_builder(tree)
    institutions_builder(tree)
    PI_builder(tree)
    return

def curl(path):
    '''
    Requests zip file of a year's Award info from the NSf site.
    The zip file is unzipped and stored as a stream of bytes in memory.
    Each of the unzipped xml files is parsed by xml_parse for all 4 data set builders.
    '''
    file_url = req_root.format(YR = str(path))
    print("*** {} ***\ndownloading {}".format(path,file_url))
    url = requests.get(file_url)
    zipfile = ZipFile(BytesIO(url.content))
    [xml_parse(zipfile.open(file)) for file in zipfile.namelist()]
    return

def remove_duplicates(dict_list):
    return [dict(tupleized) for tupleized in 
            set(tuple(item.items()) for item in dict_list)]

def json_dump():
    '''
    Iterates through each relational dataset, and dumps each to json.
    '''
    dict_NSF = [awards,
                abstracts,
                institutions,
                PI]

    out_json = ["NSF_AWARDS.json",
                "NSF_ABSTRACTS.json",
                "NSF_INSTITUTIONS.json",
                "NSF_PI.json"]

    for list_dicts, json_file in zip(dict_NSF, out_json):
        with open(os.path.join(data_out, json_file), 'w') as outfile:
            json.dump(list_dicts, outfile, indent=4)

def csv_dump():
    '''
    Iterates through each relational dataset, and dumps each to csv.
    '''
    for list_dicts, file in zip(dict_NSF, csv_out):
        with open(os.path.join(data_out,file), 'w') as outfile:
            dict_writer = csv.DictWriter(outfile, list_dicts[0].keys())
            dict_writer.writeheader()
            dict_writer.writerows(list_dicts)
In [629]:
awards = []
abstracts = []
institutions = []
PI = []
dict_NSF = [awards, abstracts, institutions, PI]

start = dt.datetime.now()
[curl(year) for year in range(year_start, year_end+1)] 
print("{ROWS} records scraped in {TIME} sec".format(
        ROWS=len(awards),
        TIME=(dt.datetime.now()-start).total_seconds()))

# remove doops
start = dt.datetime.now()
[remove_duplicates(data_set) for data_set in [awards, PI, institutions]]
print("Duplicates removed in %s sec." % (dt.datetime.now()-start).total_seconds())

# dump'em
start = dt.datetime.now()
csv_dump()
print("Dumped into CSV in %s sec." % (dt.datetime.now()-start).total_seconds())
*** 1990 ***
downloading http://www.nsf.gov/awardsearch/download?DownloadFileName=1990&All=true
*** 1991 ***
downloading http://www.nsf.gov/awardsearch/download?DownloadFileName=1991&All=true
*** 1992 ***
downloading http://www.nsf.gov/awardsearch/download?DownloadFileName=1992&All=true
*** 1993 ***
downloading http://www.nsf.gov/awardsearch/download?DownloadFileName=1993&All=true
*** 1994 ***
downloading http://www.nsf.gov/awardsearch/download?DownloadFileName=1994&All=true
*** 1995 ***
downloading http://www.nsf.gov/awardsearch/download?DownloadFileName=1995&All=true
*** 1996 ***
downloading http://www.nsf.gov/awardsearch/download?DownloadFileName=1996&All=true
*** 1997 ***
downloading http://www.nsf.gov/awardsearch/download?DownloadFileName=1997&All=true
*** 1998 ***
downloading http://www.nsf.gov/awardsearch/download?DownloadFileName=1998&All=true
*** 1999 ***
downloading http://www.nsf.gov/awardsearch/download?DownloadFileName=1999&All=true
*** 2000 ***
downloading http://www.nsf.gov/awardsearch/download?DownloadFileName=2000&All=true
*** 2001 ***
downloading http://www.nsf.gov/awardsearch/download?DownloadFileName=2001&All=true
*** 2002 ***
downloading http://www.nsf.gov/awardsearch/download?DownloadFileName=2002&All=true
*** 2003 ***
downloading http://www.nsf.gov/awardsearch/download?DownloadFileName=2003&All=true
*** 2004 ***
downloading http://www.nsf.gov/awardsearch/download?DownloadFileName=2004&All=true
*** 2005 ***
downloading http://www.nsf.gov/awardsearch/download?DownloadFileName=2005&All=true
*** 2006 ***
downloading http://www.nsf.gov/awardsearch/download?DownloadFileName=2006&All=true
*** 2007 ***
downloading http://www.nsf.gov/awardsearch/download?DownloadFileName=2007&All=true
*** 2008 ***
downloading http://www.nsf.gov/awardsearch/download?DownloadFileName=2008&All=true
*** 2009 ***
downloading http://www.nsf.gov/awardsearch/download?DownloadFileName=2009&All=true
*** 2010 ***
downloading http://www.nsf.gov/awardsearch/download?DownloadFileName=2010&All=true
*** 2011 ***
downloading http://www.nsf.gov/awardsearch/download?DownloadFileName=2011&All=true
*** 2012 ***
downloading http://www.nsf.gov/awardsearch/download?DownloadFileName=2012&All=true
*** 2013 ***
downloading http://www.nsf.gov/awardsearch/download?DownloadFileName=2013&All=true
*** 2014 ***
downloading http://www.nsf.gov/awardsearch/download?DownloadFileName=2014&All=true
*** 2015 ***
downloading http://www.nsf.gov/awardsearch/download?DownloadFileName=2015&All=true
*** 2016 ***
downloading http://www.nsf.gov/awardsearch/download?DownloadFileName=2016&All=true
*** 2017 ***
downloading http://www.nsf.gov/awardsearch/download?DownloadFileName=2017&All=true
298261 records scraped in 711.08449 sec
Duplicates removed in 16.04135 sec.
Dumped into CSV in 36.556531 sec.
In [22]:
pd.read_csv(os.path.join(data_out,"NSF_AWARDS.csv"), nrows=5)
Out[22]:
AwardID AwardTitle AwardEffectiveDate AwardExpirationDate AwardAmount InstitutionName Division Directorate EmailAddress InvestigatorNames NumInvestigators
0 0 Regulation of Sn-Glycerol-3-Phosphate Metaboli... 07/01/1986 07/01/1986 0 Virginia Polytechnic Institute and State Unive... Molecular And Cellular Bioscience Biological Sciences [email protected] Timothy Larson 1
1 9 Design of Cutting Tools for High Speed Milling 06/15/2000 05/31/2004 280000 University of Florida Civil, Mechanical, & Manufact Inn Engineering [email protected] Jiri Tlusty;Tony Schmitz;John Ziegert 3
2 26 A Novel Ultrasonic Cooling Concept for Microel... 06/15/2000 05/31/2004 292026 North Carolina State University Electrical, Commun & Cyber Sys Engineering [email protected] Paul Ro;Andrey Kuznetsov;Angus Kingon 3
3 27 Development of a Wireless Sensor to Detect Cra... 04/15/2000 03/31/2004 238000 University of Texas at Austin Civil, Mechanical, & Manufact Inn Engineering [email protected] Dean Neikirk;Sharon Wood;Karl Frank 3
4 31 Development of Link-to-Column Connections for ... 09/01/2000 08/31/2003 285000 University of Texas at Austin Civil, Mechanical, & Manufact Inn Engineering [email protected] Michael Engelhardt 1

Getting the data into a SQLite database

We could have gone straight from the stream of bytes to the SQLite db, but there's nothing wrong with keeping the processed back-up files in disk :)

In [631]:
def SQLite_dump(infile, Type='csv'):
    # open SQlite connection
    disk_engine = sql.create_engine('sqlite:///NSF.db')
    start = dt.datetime.now()
    tableName = infile.replace("NSF_","")\
                      .replace(".csv","")\
                      .replace(".json","")\
                      .replace(data_out,"")
    
    print("*********************************************************************")
    if Type == 'csv':
        '''
        Set Type to csv after csv_dump.
        Break each CSV into chunks to reduce in memory stoarge of dataset.
        Each chunk is sent to a SQLite table (tableName).
        '''
        chunksize = 20000
        j = 0
        index_start = 1
        
        for df in pd.read_csv(infile, chunksize=chunksize, iterator=True, encoding='utf-8'):
            if tableName == 'AWARDS':
                for col in ['AwardEffectiveDate','AwardExpirationDate']:
                    df[col] = pd.to_datetime(df[col], format='%m/%d/%Y')
            df.index += index_start
            index_start = df.index[-1] + 1
            j += 1

            print("{X} seconds: {Y} records dumped into SQLite table {Z}".format(
                    X=(dt.datetime.now() - start).total_seconds(),
                    Y=j*chunksize,
                    Z=tableName
                ))
            
            if j == 1:
                df.to_sql(tableName, disk_engine, if_exists='replace', index=False)
            
            else:
                df.to_sql(tableName, disk_engine, if_exists='append', index=False)
            
    elif Type == 'json':
        '''
        After json_dump, file is opened in a dataframe and dumped into a sqlite table
        '''
        df = pd.read_json(infile)
        if tableName == 'AWARDS':
            for col in ['AwardEffectiveDate','AwardExpirationDate']:
                df[col] = pd.to_datetime(df[col], format='%m/%d/%Y')
                
        print("{X} seconds: {Y} records dumped into SQLite table {Z}".format(
                X=(dt.datetime.now() - start).seconds,
                Y=len(df),
                Z=tableName
            ))
        
        df.to_sql(tableName, disk_engine, if_exists='replace', index=False)
In [632]:
for infile in csv_out:
    SQLite_dump(os.path.join(data_out,infile), Type='csv')
*********************************************************************
0.543148 seconds: 20000 records dumped into SQLite table AWARDS
5.331882 seconds: 40000 records dumped into SQLite table AWARDS
6.401413 seconds: 60000 records dumped into SQLite table AWARDS
7.47707 seconds: 80000 records dumped into SQLite table AWARDS
8.588326 seconds: 100000 records dumped into SQLite table AWARDS
9.827497 seconds: 120000 records dumped into SQLite table AWARDS
11.051997 seconds: 140000 records dumped into SQLite table AWARDS
12.171535 seconds: 160000 records dumped into SQLite table AWARDS
13.445196 seconds: 180000 records dumped into SQLite table AWARDS
15.086242 seconds: 200000 records dumped into SQLite table AWARDS
20.884972 seconds: 220000 records dumped into SQLite table AWARDS
22.812093 seconds: 240000 records dumped into SQLite table AWARDS
25.089158 seconds: 260000 records dumped into SQLite table AWARDS
26.491599 seconds: 280000 records dumped into SQLite table AWARDS
28.456642 seconds: 300000 records dumped into SQLite table AWARDS
*********************************************************************
0.996505 seconds: 20000 records dumped into SQLite table ABSTRACTS
29.262021 seconds: 40000 records dumped into SQLite table ABSTRACTS
30.11163 seconds: 60000 records dumped into SQLite table ABSTRACTS
32.335524 seconds: 80000 records dumped into SQLite table ABSTRACTS
34.743973 seconds: 100000 records dumped into SQLite table ABSTRACTS
37.305316 seconds: 120000 records dumped into SQLite table ABSTRACTS
40.759159 seconds: 140000 records dumped into SQLite table ABSTRACTS
44.580109 seconds: 160000 records dumped into SQLite table ABSTRACTS
48.659568 seconds: 180000 records dumped into SQLite table ABSTRACTS
52.77365 seconds: 200000 records dumped into SQLite table ABSTRACTS
56.958778 seconds: 220000 records dumped into SQLite table ABSTRACTS
61.348238 seconds: 240000 records dumped into SQLite table ABSTRACTS
66.34427 seconds: 260000 records dumped into SQLite table ABSTRACTS
68.026501 seconds: 280000 records dumped into SQLite table ABSTRACTS
69.319802 seconds: 300000 records dumped into SQLite table ABSTRACTS
*********************************************************************
0.055064 seconds: 20000 records dumped into SQLite table INSTITUTIONS
1.834172 seconds: 40000 records dumped into SQLite table INSTITUTIONS
2.236637 seconds: 60000 records dumped into SQLite table INSTITUTIONS
2.655564 seconds: 80000 records dumped into SQLite table INSTITUTIONS
3.029963 seconds: 100000 records dumped into SQLite table INSTITUTIONS
3.428473 seconds: 120000 records dumped into SQLite table INSTITUTIONS
3.829848 seconds: 140000 records dumped into SQLite table INSTITUTIONS
4.290678 seconds: 160000 records dumped into SQLite table INSTITUTIONS
12.752918 seconds: 180000 records dumped into SQLite table INSTITUTIONS
13.461992 seconds: 200000 records dumped into SQLite table INSTITUTIONS
13.893989 seconds: 220000 records dumped into SQLite table INSTITUTIONS
14.296411 seconds: 240000 records dumped into SQLite table INSTITUTIONS
14.689038 seconds: 260000 records dumped into SQLite table INSTITUTIONS
15.061441 seconds: 280000 records dumped into SQLite table INSTITUTIONS
15.420823 seconds: 300000 records dumped into SQLite table INSTITUTIONS
*********************************************************************
0.037913 seconds: 20000 records dumped into SQLite table PI
1.016936 seconds: 40000 records dumped into SQLite table PI
1.320865 seconds: 60000 records dumped into SQLite table PI
1.630687 seconds: 80000 records dumped into SQLite table PI
1.963261 seconds: 100000 records dumped into SQLite table PI
2.252138 seconds: 120000 records dumped into SQLite table PI
2.53908 seconds: 140000 records dumped into SQLite table PI
2.851338 seconds: 160000 records dumped into SQLite table PI
3.151069 seconds: 180000 records dumped into SQLite table PI
3.445793 seconds: 200000 records dumped into SQLite table PI
3.791598 seconds: 220000 records dumped into SQLite table PI
4.069519 seconds: 240000 records dumped into SQLite table PI
4.338497 seconds: 260000 records dumped into SQLite table PI
4.651899 seconds: 280000 records dumped into SQLite table PI
4.997948 seconds: 300000 records dumped into SQLite table PI

Analysis

Filtering SQL tables into Pandas.

There is a read_sql() function in Pandas, but who wants to open connections, and write querries every time? This function takes care of that in a way that is SQL-native, with the advantage of default values and auto formatting to spend more time analyzing data and less time writing querries!

In [10]:
def do(query):
    '''
    Direct sql queries to a Pandas dataframe.
    '''
    disk_engine = sql.create_engine('sqlite:///NSF.db')
    df = pd.read_sql_query(query, disk_engine)
    
    for col in df.columns:
        if "Date" in col:
            df[col] = pd.to_datetime(df[col])
    return df
In [29]:
df = do("SELECT * FROM AWARDS")
df.head()
Out[29]:
AwardID AwardTitle AwardEffectiveDate AwardExpirationDate AwardAmount InstitutionName Division Directorate EmailAddress InvestigatorNames NumInvestigators
0 0 Regulation of Sn-Glycerol-3-Phosphate Metaboli... 1986-07-01 1986-07-01 0 Virginia Polytechnic Institute and State Unive... Molecular And Cellular Bioscience Biological Sciences [email protected] Timothy Larson 1
1 9 Design of Cutting Tools for High Speed Milling 2000-06-15 2004-05-31 280000 University of Florida Civil, Mechanical, & Manufact Inn Engineering [email protected] Jiri Tlusty;Tony Schmitz;John Ziegert 3
2 26 A Novel Ultrasonic Cooling Concept for Microel... 2000-06-15 2004-05-31 292026 North Carolina State University Electrical, Commun & Cyber Sys Engineering [email protected] Paul Ro;Andrey Kuznetsov;Angus Kingon 3
3 27 Development of a Wireless Sensor to Detect Cra... 2000-04-15 2004-03-31 238000 University of Texas at Austin Civil, Mechanical, & Manufact Inn Engineering [email protected] Dean Neikirk;Sharon Wood;Karl Frank 3
4 31 Development of Link-to-Column Connections for ... 2000-09-01 2003-08-31 285000 University of Texas at Austin Civil, Mechanical, & Manufact Inn Engineering [email protected] Michael Engelhardt 1
In [33]:
q = """
SELECT 
    strftime(\'%Y\',AwardEffectiveDate) as year,
    Division,
    SUM(AwardAmount) as 'AwardSum'
FROM AWARDS GROUP BY Division ORDER BY AwardSum DESC;"""
do(q).head()
Out[33]:
year Division AwardSum
0 2016 Ocean Sciences 6287996066
1 2016 Research On Learning 5199775389
2 2016 Astronomical Sciences 4801812127
3 2016 Materials Research 4727804645
4 2016 Physics 4391902343
In [34]:
q = """
SELECT 
    strftime(\'%Y\',AwardEffectiveDate) as year,
    Directorate,
    SUM(AwardAmount) as 'AwardSum'
FROM AWARDS GROUP BY 2 ORDER BY 3 DESC;"""
do(q).head()
Out[34]:
year Directorate AwardSum
0 2016 Mathematical & Physical Scien 20959270657
1 2016 Geosciences 17517102569
2 2017 Education And Human Resources 15754031311
3 2016 Computer & Info Scie & Enginr 12113647384
4 2016 Engineering 11317262347

Analysis

Let's use Plotly and Seaborn for some Analysis

In [6]:
plotly.offline.init_notebook_mode()
with open("/Users/leonyin/Documents/creds/plotly_credentials.json",'r') as data_file:
    '''
    Get username credentials.
    '''
    creds = json.load(data_file)['credentials']
    
    plotly.tools.set_credentials_file(
        username=creds['USER'],
        api_key=creds['API_KEY']
    )
with open("/Users/leonyin/Documents/creds/plotly_presets.json",'r') as data_file:
    '''
    Load graphy presets.
    '''
    layout_dict = json.load(data_file)['layout']