This script requests zipped files of annual funding history from the NSF website, unpacks and parses the resulting XML files in memory, loads them as four relational data sets into SQLite, and finally performs some analysis using Pandas, Seaborn and Plotly.
import os
import sys
import json
import re
import csv
import datetime as dt
import requests
from collections import OrderedDict
from io import BytesIO
from zipfile import ZipFile
from lxml import etree as ET
import pandas as pd
import numpy as np
import sqlalchemy as sql
import matplotlib.pyplot as plt
import seaborn as sns
import plotly
import plotly.plotly as py
from plotly.graph_objs import Bar, Scatter, Marker, Layout
from plotly.tools import FigureFactory as FF
from wordcloud import WordCloud, STOPWORDS
from bs4 import BeautifulSoup
from nltk.corpus import stopwords
from IPython.display import HTML
%matplotlib inline
Data downloaded from: http://www.nsf.gov/awardsearch/download.jsp
# Constants
data_in = 'data_in/'
data_out = 'data_out/'
# URL for http-requests
req_root = 'http://www.nsf.gov/awardsearch/download?DownloadFileName={YR}&All=true'
# First and last award years to fetch; each year fills the {YR} slot in req_root
year_start = 1990
year_end = 2017
csv_out = [
'NSF_AWARDS.csv',
'NSF_ABSTRACTS.csv',
'NSF_INSTITUTIONS.csv',
'NSF_PI.csv'
]
department_normalization = {
'Polar Progrms': 'Polar Programs',
'Arctic Sciences Division' : 'Polar Programs',
'Research On Learning In Formal And Informal Settings (DRL)': 'Research On Learning',
'Information Systems': 'Information & Intelligent Systems',
'Civil, Mechanical, And Manufacturing Innovation': 'Civil, Mechanical, And Manufacturing Inn',
'Behavioral And Cognitive Sci' : 'Behavioral And Cognitive Sciences',
'Integrative Organismal System' : 'Integrative Organismal Sys'
}
modifiers_to_remove = {
'Division ': '',
'Of ': '',
'Div ': '',
'Directorate ':'',
'Office ': '',
'Direct ': '',
'Divn ': '',
'For ': '',
}
Each annual zipped file contains one xml file per award:
with open('data_in/2016/1623912.xml', 'r') as xml_view:
print(xml_view.read())
<?xml version="1.0" encoding="UTF-8"?> <rootTag> <Award> <AwardTitle>Dynamics of Extreme Precipitation in the Northeast United States in Observations and Models</AwardTitle> <AwardEffectiveDate>11/01/2016</AwardEffectiveDate> <AwardExpirationDate>10/31/2019</AwardExpirationDate> <AwardAmount>453807</AwardAmount> <AwardInstrument> <Value>Standard Grant</Value> </AwardInstrument> <Organization> <Code>06020100</Code> <Directorate> <LongName>Directorate For Geosciences</LongName> </Directorate> <Division> <LongName>Div Atmospheric & Geospace Sciences</LongName> </Division> </Organization> <ProgramOfficer> <SignBlockName>Anjuli S. Bamzai</SignBlockName> </ProgramOfficer> <AbstractNarration>Extreme precipitation and its related impacts, especially flooding, result in significant loss of life, property and infrastructure damage, transportation disruption, and storm water pollution, and have economic costs of more than $8 billion per year for the US. This project will undertake fundamental research into understanding the causes of extreme precipitation in the Northeast US. This is the most economically developed and densely populated region of the country. Improved model representation of these processes is a crucial step in the overall goal of improving forecasts and projections of these high-cost events, thereby mitigating their impacts. <br/><br/>Processes that cause extreme precipitation over daily to weekly periods in the Northeast US will be identified. The ability of current climate models to reproduce these processes will be examined. The two motivating questions are: What types of storms cause extreme precipitation in the Northeast? Do current models correctly reproduce these storms types and their relationship to extreme precipitation? Using observational data, storm types associated with extreme precipitation will be identified by applying advanced analytic techniques. Characteristic patterns in the jet stream and other storm features that occur in association with extreme precipitation will be identified. This analysis will then be undertaken on the climate model output to identify the storm types that are produced in the models and compare the modeled types to the observed types. The differences will be highlighted for use in model development and for providing context for model forecasts and projections. The physical processes by which the extreme precipitation are generated within each storm type will also be investigated. The relative strength of different factors that are known to influence precipitation, such as the amount of moisture in the lower atmosphere, will be examined within each storm type. 
After the key factors are identified, their influence will then be further tested in a regional, high resolution model by changing the strength of individual factors and examining how the modeled precipitation changes in response.</AbstractNarration> <MinAmdLetterDate>10/25/2016</MinAmdLetterDate> <MaxAmdLetterDate>10/25/2016</MaxAmdLetterDate> <ARRAAmount/> <AwardID>1623912</AwardID> <Investigator> <FirstName>Mathew</FirstName> <LastName>Barlow</LastName> <EmailAddress>Mathew_Barlow@uml.edu</EmailAddress> <StartDate>10/25/2016</StartDate> <EndDate/> <RoleCode>Principal Investigator</RoleCode> </Investigator> <Investigator> <FirstName>JianHua</FirstName> <LastName>Qian</LastName> <EmailAddress>JianHua_Qian@uml.edu</EmailAddress> <StartDate>10/25/2016</StartDate> <EndDate/> <RoleCode>Co-Principal Investigator</RoleCode> </Investigator> <Institution> <Name>University of Massachusetts Lowell</Name> <CityName>Lowell</CityName> <ZipCode>018543643</ZipCode> <PhoneNumber>9789344170</PhoneNumber> <StreetAddress>600 Suffolk Street</StreetAddress> <CountryName>United States</CountryName> <StateName>Massachusetts</StateName> <StateCode>MA</StateCode> </Institution> <ProgramElement> <Code>5740</Code> <Text>CLIMATE & LARGE-SCALE DYNAMICS</Text> </ProgramElement> </Award> </rootTag>
def division_cleanse(raw_division):
"""
Use normalization dicts to standardize legacy division naming.
"""
rep = dict((re.escape(k), v) for k, v in modifiers_to_remove.items())
pattern = re.compile("|".join(rep.keys()))
division = pattern.sub(lambda m: rep[re.escape(m.group(0))],
raw_division.title())
return department_normalization.get(division, division)
def get_PIs(list_investigators):
"""
Returns a semi-colon joined list of PI names (first and last)
"""
return ';'.join(
['{} {}'.format(PI.find('FirstName').text,
PI.find('LastName').text)
for PI in list_investigators])
# Builder functions
def awards_builder(tree):
global awards
list_investigators = tree.findall('.//Award/Investigator')
return awards.append(
OrderedDict([
('AwardID',tree.find('Award/AwardID').text),
('AwardTitle',tree.find('Award/AwardTitle').text),
('AwardEffectiveDate',tree.find('Award/AwardEffectiveDate').text),
('AwardExpirationDate',tree.find('Award/AwardExpirationDate').text),
('AwardAmount',tree.find('Award/AwardAmount').text),
('InstitutionName',tree.find('Award/Institution/Name').text),
('Division', division_cleanse(tree.find('Award/Organization/Division')[0].text)),
('Directorate', division_cleanse(tree.find('Award/Organization/Directorate')[0].text)),
('EmailAddress',tree.find('Award/Investigator/EmailAddress').text),
('InvestigatorNames', get_PIs(list_investigators)),
('NumInvestigators', len(list_investigators))
]))
def abstracts_builder(tree):
global abstracts
return abstracts.append(
OrderedDict([
('AwardID',tree.find('Award/AwardID').text),
('Abstract',tree.find('Award/AbstractNarration').text)
]))
def institutions_builder(tree):
global institutions
return institutions.append(
OrderedDict([
('InstitutionName',tree.find('Award/Institution/Name').text),
('StreetAddress',tree.find('Award/Institution/StreetAddress').text),
('CityName',tree.find('Award/Institution/CityName').text),
('StateCode',tree.find('Award/Institution/StateCode').text),
('ZipCode',tree.find('Award/Institution/ZipCode').text),
('CountryName',tree.find('Award/Institution/CountryName').text)
]))
def PI_builder(tree):
global PI
return PI.append(
OrderedDict([
('EmailAddress',tree.find('Award/Investigator/EmailAddress').text),
('FirstName',tree.find('Award/Investigator/FirstName').text),
('LastName',tree.find('Award/Investigator/LastName').text)
]))
def xml_parse(file):
'''
Calls all four data set builders after parsing an xml file into a tree.
This function is called for each unzipped file.
'''
try:
tree = ET.parse(file)
except:
return("empty")
try:
awards_builder(tree)
except:
return(tree.find('Award/AwardID').text,"is missing an award key")
abstracts_builder(tree)
institutions_builder(tree)
PI_builder(tree)
return
def curl(path):
'''
Requests the zip file of a year's award info from the NSF site.
The zip file is unzipped and stored as a stream of bytes in memory.
Each of the unzipped xml files is parsed by xml_parse for all 4 data set builders.
'''
file_url = req_root.format(YR = str(path))
print("*** {} ***\ndownloading {}".format(path,file_url))
url = requests.get(file_url)
zipfile = ZipFile(BytesIO(url.content))
[xml_parse(zipfile.open(file)) for file in zipfile.namelist()]
return
def remove_duplicates(dict_list):
return [dict(tupleized) for tupleized in
set(tuple(item.items()) for item in dict_list)]
def json_dump():
'''
Iterates through each relational dataset, and dumps each to json.
'''
dict_NSF = [awards,
abstracts,
institutions,
PI]
out_json = ["NSF_AWARDS.json",
"NSF_ABSTRACTS.json",
"NSF_INSTITUTIONS.json",
"NSF_PI.json"]
for list_dicts, json_file in zip(dict_NSF, out_json):
with open(os.path.join(data_out, json_file), 'w') as outfile:
json.dump(list_dicts, outfile, indent=4)
def csv_dump():
'''
Iterates through each relational dataset, and dumps each to csv.
'''
for list_dicts, file in zip(dict_NSF, csv_out):
with open(os.path.join(data_out,file), 'w') as outfile:
dict_writer = csv.DictWriter(outfile, list_dicts[0].keys())
dict_writer.writeheader()
dict_writer.writerows(list_dicts)
awards = []
abstracts = []
institutions = []
PI = []
dict_NSF = [awards, abstracts, institutions, PI]
start = dt.datetime.now()
[curl(year) for year in range(year_start, year_end+1)]
print("{ROWS} records scraped in {TIME} sec".format(
ROWS=len(awards),
TIME=(dt.datetime.now()-start).total_seconds()))
# remove dupes
start = dt.datetime.now()
awards, PI, institutions = [remove_duplicates(data_set) for data_set in [awards, PI, institutions]]
dict_NSF = [awards, abstracts, institutions, PI]
print("Duplicates removed in %s sec." % (dt.datetime.now()-start).total_seconds())
# dump'em
start = dt.datetime.now()
csv_dump()
print("Dumped into CSV in %s sec." % (dt.datetime.now()-start).total_seconds())
*** 1990 *** downloading http://www.nsf.gov/awardsearch/download?DownloadFileName=1990&All=true *** 1991 *** downloading http://www.nsf.gov/awardsearch/download?DownloadFileName=1991&All=true *** 1992 *** downloading http://www.nsf.gov/awardsearch/download?DownloadFileName=1992&All=true *** 1993 *** downloading http://www.nsf.gov/awardsearch/download?DownloadFileName=1993&All=true *** 1994 *** downloading http://www.nsf.gov/awardsearch/download?DownloadFileName=1994&All=true *** 1995 *** downloading http://www.nsf.gov/awardsearch/download?DownloadFileName=1995&All=true *** 1996 *** downloading http://www.nsf.gov/awardsearch/download?DownloadFileName=1996&All=true *** 1997 *** downloading http://www.nsf.gov/awardsearch/download?DownloadFileName=1997&All=true *** 1998 *** downloading http://www.nsf.gov/awardsearch/download?DownloadFileName=1998&All=true *** 1999 *** downloading http://www.nsf.gov/awardsearch/download?DownloadFileName=1999&All=true *** 2000 *** downloading http://www.nsf.gov/awardsearch/download?DownloadFileName=2000&All=true *** 2001 *** downloading http://www.nsf.gov/awardsearch/download?DownloadFileName=2001&All=true *** 2002 *** downloading http://www.nsf.gov/awardsearch/download?DownloadFileName=2002&All=true *** 2003 *** downloading http://www.nsf.gov/awardsearch/download?DownloadFileName=2003&All=true *** 2004 *** downloading http://www.nsf.gov/awardsearch/download?DownloadFileName=2004&All=true *** 2005 *** downloading http://www.nsf.gov/awardsearch/download?DownloadFileName=2005&All=true *** 2006 *** downloading http://www.nsf.gov/awardsearch/download?DownloadFileName=2006&All=true *** 2007 *** downloading http://www.nsf.gov/awardsearch/download?DownloadFileName=2007&All=true *** 2008 *** downloading http://www.nsf.gov/awardsearch/download?DownloadFileName=2008&All=true *** 2009 *** downloading http://www.nsf.gov/awardsearch/download?DownloadFileName=2009&All=true *** 2010 *** downloading http://www.nsf.gov/awardsearch/download?DownloadFileName=2010&All=true *** 2011 *** downloading http://www.nsf.gov/awardsearch/download?DownloadFileName=2011&All=true *** 2012 *** downloading http://www.nsf.gov/awardsearch/download?DownloadFileName=2012&All=true *** 2013 *** downloading http://www.nsf.gov/awardsearch/download?DownloadFileName=2013&All=true *** 2014 *** downloading http://www.nsf.gov/awardsearch/download?DownloadFileName=2014&All=true *** 2015 *** downloading http://www.nsf.gov/awardsearch/download?DownloadFileName=2015&All=true *** 2016 *** downloading http://www.nsf.gov/awardsearch/download?DownloadFileName=2016&All=true *** 2017 *** downloading http://www.nsf.gov/awardsearch/download?DownloadFileName=2017&All=true 298261 records scraped in 711.08449 sec Duplicates removed in 16.04135 sec. Dumped into CSV in 36.556531 sec.
pd.read_csv(os.path.join(data_out,"NSF_AWARDS.csv"), nrows=5)
AwardID | AwardTitle | AwardEffectiveDate | AwardExpirationDate | AwardAmount | InstitutionName | Division | Directorate | EmailAddress | InvestigatorNames | NumInvestigators | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | Regulation of Sn-Glycerol-3-Phosphate Metaboli... | 07/01/1986 | 07/01/1986 | 0 | Virginia Polytechnic Institute and State Unive... | Molecular And Cellular Bioscience | Biological Sciences | tilarson@vt.edu | Timothy Larson | 1 |
1 | 9 | Design of Cutting Tools for High Speed Milling | 06/15/2000 | 05/31/2004 | 280000 | University of Florida | Civil, Mechanical, & Manufact Inn | Engineering | jtlusty@ufl.edu | Jiri Tlusty;Tony Schmitz;John Ziegert | 3 |
2 | 26 | A Novel Ultrasonic Cooling Concept for Microel... | 06/15/2000 | 05/31/2004 | 292026 | North Carolina State University | Electrical, Commun & Cyber Sys | Engineering | ro@ncsu.edu | Paul Ro;Andrey Kuznetsov;Angus Kingon | 3 |
3 | 27 | Development of a Wireless Sensor to Detect Cra... | 04/15/2000 | 03/31/2004 | 238000 | University of Texas at Austin | Civil, Mechanical, & Manufact Inn | Engineering | neikirk@mail.utexas.edu | Dean Neikirk;Sharon Wood;Karl Frank | 3 |
4 | 31 | Development of Link-to-Column Connections for ... | 09/01/2000 | 08/31/2003 | 285000 | University of Texas at Austin | Civil, Mechanical, & Manufact Inn | Engineering | mde@mail.utexas.edu | Michael Engelhardt | 1 |
We could have gone straight from the stream of bytes to the SQLite db, but there's nothing wrong with keeping the processed back-up files on disk :)
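For reference, a minimal sketch of that direct route (SQLite_dump_direct is a hypothetical helper, not part of the pipeline; it skips the date parsing the CSV path below performs) would hand each in-memory list of dicts straight to to_sql:
def SQLite_dump_direct():
    # hypothetical: dump each in-memory list of dicts straight into SQLite,
    # skipping the CSV/JSON intermediates entirely
    disk_engine = sql.create_engine('sqlite:///NSF.db')
    for list_dicts, table in zip(dict_NSF, ['AWARDS', 'ABSTRACTS', 'INSTITUTIONS', 'PI']):
        pd.DataFrame(list_dicts).to_sql(table, disk_engine, if_exists='replace', index=False)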
def SQLite_dump(infile, Type='csv'):
# open SQlite connection
disk_engine = sql.create_engine('sqlite:///NSF.db')
start = dt.datetime.now()
tableName = infile.replace("NSF_","")\
.replace(".csv","")\
.replace(".json","")\
.replace(data_out,"")
print("*********************************************************************")
if Type == 'csv':
'''
Set Type to csv after csv_dump.
Break each CSV into chunks to reduce in-memory storage of the dataset.
Each chunk is sent to a SQLite table (tableName).
'''
chunksize = 20000
j = 0
index_start = 1
for df in pd.read_csv(infile, chunksize=chunksize, iterator=True, encoding='utf-8'):
if tableName == 'AWARDS':
for col in ['AwardEffectiveDate','AwardExpirationDate']:
df[col] = pd.to_datetime(df[col], format='%m/%d/%Y')
df.index += index_start
index_start = df.index[-1] + 1
j += 1
print("{X} seconds: {Y} records dumped into SQLite table {Z}".format(
X=(dt.datetime.now() - start).total_seconds(),
Y=j*chunksize,
Z=tableName
))
if j == 1:
df.to_sql(tableName, disk_engine, if_exists='replace', index=False)
else:
df.to_sql(tableName, disk_engine, if_exists='append', index=False)
elif Type == 'json':
'''
After json_dump, file is opened in a dataframe and dumped into a sqlite table
'''
df = pd.read_json(infile)
if tableName == 'AWARDS':
for col in ['AwardEffectiveDate','AwardExpirationDate']:
df[col] = pd.to_datetime(df[col], format='%m/%d/%Y')
print("{X} seconds: {Y} records dumped into SQLite table {Z}".format(
X=(dt.datetime.now() - start).total_seconds(),
Y=len(df),
Z=tableName
))
df.to_sql(tableName, disk_engine, if_exists='replace', index=False)
for infile in csv_out:
SQLite_dump(os.path.join(data_out,infile), Type='csv')
********************************************************************* 0.543148 seconds: 20000 records dumped into SQLite table AWARDS 5.331882 seconds: 40000 records dumped into SQLite table AWARDS 6.401413 seconds: 60000 records dumped into SQLite table AWARDS 7.47707 seconds: 80000 records dumped into SQLite table AWARDS 8.588326 seconds: 100000 records dumped into SQLite table AWARDS 9.827497 seconds: 120000 records dumped into SQLite table AWARDS 11.051997 seconds: 140000 records dumped into SQLite table AWARDS 12.171535 seconds: 160000 records dumped into SQLite table AWARDS 13.445196 seconds: 180000 records dumped into SQLite table AWARDS 15.086242 seconds: 200000 records dumped into SQLite table AWARDS 20.884972 seconds: 220000 records dumped into SQLite table AWARDS 22.812093 seconds: 240000 records dumped into SQLite table AWARDS 25.089158 seconds: 260000 records dumped into SQLite table AWARDS 26.491599 seconds: 280000 records dumped into SQLite table AWARDS 28.456642 seconds: 300000 records dumped into SQLite table AWARDS ********************************************************************* 0.996505 seconds: 20000 records dumped into SQLite table ABSTRACTS 29.262021 seconds: 40000 records dumped into SQLite table ABSTRACTS 30.11163 seconds: 60000 records dumped into SQLite table ABSTRACTS 32.335524 seconds: 80000 records dumped into SQLite table ABSTRACTS 34.743973 seconds: 100000 records dumped into SQLite table ABSTRACTS 37.305316 seconds: 120000 records dumped into SQLite table ABSTRACTS 40.759159 seconds: 140000 records dumped into SQLite table ABSTRACTS 44.580109 seconds: 160000 records dumped into SQLite table ABSTRACTS 48.659568 seconds: 180000 records dumped into SQLite table ABSTRACTS 52.77365 seconds: 200000 records dumped into SQLite table ABSTRACTS 56.958778 seconds: 220000 records dumped into SQLite table ABSTRACTS 61.348238 seconds: 240000 records dumped into SQLite table ABSTRACTS 66.34427 seconds: 260000 records dumped into SQLite table ABSTRACTS 68.026501 seconds: 280000 records dumped into SQLite table ABSTRACTS 69.319802 seconds: 300000 records dumped into SQLite table ABSTRACTS ********************************************************************* 0.055064 seconds: 20000 records dumped into SQLite table INSTITUTIONS 1.834172 seconds: 40000 records dumped into SQLite table INSTITUTIONS 2.236637 seconds: 60000 records dumped into SQLite table INSTITUTIONS 2.655564 seconds: 80000 records dumped into SQLite table INSTITUTIONS 3.029963 seconds: 100000 records dumped into SQLite table INSTITUTIONS 3.428473 seconds: 120000 records dumped into SQLite table INSTITUTIONS 3.829848 seconds: 140000 records dumped into SQLite table INSTITUTIONS 4.290678 seconds: 160000 records dumped into SQLite table INSTITUTIONS 12.752918 seconds: 180000 records dumped into SQLite table INSTITUTIONS 13.461992 seconds: 200000 records dumped into SQLite table INSTITUTIONS 13.893989 seconds: 220000 records dumped into SQLite table INSTITUTIONS 14.296411 seconds: 240000 records dumped into SQLite table INSTITUTIONS 14.689038 seconds: 260000 records dumped into SQLite table INSTITUTIONS 15.061441 seconds: 280000 records dumped into SQLite table INSTITUTIONS 15.420823 seconds: 300000 records dumped into SQLite table INSTITUTIONS ********************************************************************* 0.037913 seconds: 20000 records dumped into SQLite table PI 1.016936 seconds: 40000 records dumped into SQLite table PI 1.320865 seconds: 60000 records dumped into SQLite table PI 1.630687 seconds: 
80000 records dumped into SQLite table PI 1.963261 seconds: 100000 records dumped into SQLite table PI 2.252138 seconds: 120000 records dumped into SQLite table PI 2.53908 seconds: 140000 records dumped into SQLite table PI 2.851338 seconds: 160000 records dumped into SQLite table PI 3.151069 seconds: 180000 records dumped into SQLite table PI 3.445793 seconds: 200000 records dumped into SQLite table PI 3.791598 seconds: 220000 records dumped into SQLite table PI 4.069519 seconds: 240000 records dumped into SQLite table PI 4.338497 seconds: 260000 records dumped into SQLite table PI 4.651899 seconds: 280000 records dumped into SQLite table PI 4.997948 seconds: 300000 records dumped into SQLite table PI
There is a read_sql() function in Pandas, but who wants to open connections and write boilerplate every time? This helper takes care of that in a way that is SQL-native, with the advantage of default values and automatic date parsing, so we spend more time analyzing data and less time writing queries!
def do(query):
'''
Direct sql queries to a Pandas dataframe.
'''
disk_engine = sql.create_engine('sqlite:///NSF.db')
df = pd.read_sql_query(query, disk_engine)
for col in df.columns:
if "Date" in col:
df[col] = pd.to_datetime(df[col])
return df
df = do("SELECT * FROM AWARDS")
df.head()
AwardID | AwardTitle | AwardEffectiveDate | AwardExpirationDate | AwardAmount | InstitutionName | Division | Directorate | EmailAddress | InvestigatorNames | NumInvestigators | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | Regulation of Sn-Glycerol-3-Phosphate Metaboli... | 1986-07-01 | 1986-07-01 | 0 | Virginia Polytechnic Institute and State Unive... | Molecular And Cellular Bioscience | Biological Sciences | tilarson@vt.edu | Timothy Larson | 1 |
1 | 9 | Design of Cutting Tools for High Speed Milling | 2000-06-15 | 2004-05-31 | 280000 | University of Florida | Civil, Mechanical, & Manufact Inn | Engineering | jtlusty@ufl.edu | Jiri Tlusty;Tony Schmitz;John Ziegert | 3 |
2 | 26 | A Novel Ultrasonic Cooling Concept for Microel... | 2000-06-15 | 2004-05-31 | 292026 | North Carolina State University | Electrical, Commun & Cyber Sys | Engineering | ro@ncsu.edu | Paul Ro;Andrey Kuznetsov;Angus Kingon | 3 |
3 | 27 | Development of a Wireless Sensor to Detect Cra... | 2000-04-15 | 2004-03-31 | 238000 | University of Texas at Austin | Civil, Mechanical, & Manufact Inn | Engineering | neikirk@mail.utexas.edu | Dean Neikirk;Sharon Wood;Karl Frank | 3 |
4 | 31 | Development of Link-to-Column Connections for ... | 2000-09-01 | 2003-08-31 | 285000 | University of Texas at Austin | Civil, Mechanical, & Manufact Inn | Engineering | mde@mail.utexas.edu | Michael Engelhardt | 1 |
q = """
SELECT
strftime(\'%Y\',AwardEffectiveDate) as year,
Division,
SUM(AwardAmount) as 'AwardSum'
FROM AWARDS GROUP BY Division ORDER BY AwardSum DESC;"""
do(q).head()
year | Division | AwardSum | |
---|---|---|---|
0 | 2016 | Ocean Sciences | 6287996066 |
1 | 2016 | Research On Learning | 5199775389 |
2 | 2016 | Astronomical Sciences | 4801812127 |
3 | 2016 | Materials Research | 4727804645 |
4 | 2016 | Physics | 4391902343 |
q = """
SELECT
strftime(\'%Y\',AwardEffectiveDate) as year,
Directorate,
SUM(AwardAmount) as 'AwardSum'
FROM AWARDS GROUP BY 2 ORDER BY 3 DESC;"""
do(q).head()
year | Directorate | AwardSum | |
---|---|---|---|
0 | 2016 | Mathematical & Physical Scien | 20959270657 |
1 | 2016 | Geosciences | 17517102569 |
2 | 2017 | Education And Human Resources | 15754031311 |
3 | 2016 | Computer & Info Scie & Enginr | 12113647384 |
4 | 2016 | Engineering | 11317262347 |
Let's use Plotly and Seaborn for some analysis.
plotly.offline.init_notebook_mode()
with open("/Users/leonyin/Documents/creds/plotly_credentials.json",'r') as data_file:
'''
Get username credentials.
'''
creds = json.load(data_file)['credentials']
plotly.tools.set_credentials_file(
username=creds['USER'],
api_key=creds['API_KEY']
)
with open("/Users/leonyin/Documents/creds/plotly_presets.json",'r') as data_file:
'''
Load graphy presets.
'''
layout_dict = json.load(data_file)['layout']
def top_funds(df,groupby,n=5):
return df.groupby(groupby) \
.apply(lambda x:x['AwardAmount'].sum()) \
.sort_values(ascending=False) \
.head(n)
years = list(range(year_start,year_end))
top_n = 10
annual_funds = {}
annual_funds['Other'] = {}
for year in years:
df = do("""SELECT strftime(\'%Y\',AwardEffectiveDate) as year,
Division, SUM(AwardAmount) as 'AwardSum'
FROM AWARDS WHERE YEAR='{}'
GROUP BY Division ORDER BY AwardSum DESC""".format(year))
annual_funds['Other'][year] = sum(df.AwardSum)
# get top 10 to plot
for i in range(top_n):
division = df.get_value(i,'Division')
_sum =df.get_value(i,'AwardSum')
annual_funds['Other'][year] -= _sum
if division in annual_funds:
annual_funds[division][year] = _sum
else:
annual_funds[division] = {year:_sum}
tops = []
for division in top_funds(do("SELECT * FROM AWARDS"),groupby='Division', n=top_n).keys():
tops.append({
'x': years,
'y': [annual_funds[division].get(year,None) for year in years],
'name': division,
'type': 'bar'
})
py.iplot({
'data': tops,
'layout': {
'barmode': 'stack',
'xaxis': layout_dict['axis'],
'yaxis': layout_dict['axis'],
'plot_bgcolor': layout_dict['plot_bgcolor'],
'title': '{} Highest Funded NSF Divisions'.format(top_n)
}})
The plot can be viewed here.
Ocean Sciences happens to be the highest-funded NSF division.
We can also look at how each division's funding varies interannually and across divisions.
df = do('SELECT * FROM ABSTRACTS '
'LEFT JOIN AWARDS '
'USING(AwardID)'
'WHERE Abstract NOT NULL')
df['year'] = df['AwardEffectiveDate'].dt.year
df['AwardDuration'] = (abs(((df['AwardExpirationDate'] - df['AwardEffectiveDate']).dt.days)) + 1) / 365
df['AwardAmountStandardized'] = df['AwardAmount'] / (df['NumInvestigators'] * df['AwardDuration'])
df.replace({'AwardAmountStandardized' : {np.nan : 0}}, inplace=True)
def division_history(divisions_to_compare, start=1990, end=2016,
agg_key='Division', num_bins=3):
metric1 = 'AwardAmount'
metric2 = 'AwardAmountStandardized'
df_plot = df[(df[agg_key].isin(divisions_to_compare)) &
(df['year'] >= start) & (df['year'] <= end)]
fig, ax = plt.subplots(2,2, figsize=(18,12))
# Top two plots
for label, df_div in df_plot.groupby([agg_key,'year'])[metric1].sum() \
.reset_index().groupby(agg_key):
df_div[[metric1, 'year']].plot(x='year', y=metric1, logy=False,
ax=ax[0][0], label=label,
title='Annual Funding per Division (USD)')
for label, df_div in df_plot.groupby([agg_key,'year'])[metric2].mean() \
.reset_index().groupby(agg_key):
df_div[[metric2,'year']].plot(x='year', y=metric2, logy=False,
ax=ax[0][1], label=label,
title='Average Annual Funding Available '
'per Investigator per Year of Research (USD)')
# Bottom two plots
if len(divisions_to_compare) > 2:
# Density distro for more than two divisions
for label, df_div in df_plot[df_plot['year']==end].groupby(agg_key):
df_div['AwardDuration'].plot(kind='kde', logx=False, xlim=(0,10),
ax=ax[1][0], label=label, legend=True,
title='{} Distribution of '
'Funding Length (Years)'.format(end))
for label, df_div in df_plot[df_plot['year']==end].groupby(agg_key):
df_div['NumInvestigators'].plot(kind='kde', logx=False, xlim=(0,8),
ax=ax[1][1], label=label,
title='{} Distribution of Number of '
'Research Collaborators'.format(end))
else:
# Violins for 2 or less
pd.options.mode.chained_assignment = None
bins = np.linspace(start, end, num=num_bins+1)
labels = ['{:.0f}-{:.0f}'.format(
bins[i], bins[i+1]
) for i in range(len(bins) - 1)]
df_plot['YearRange'] = pd.cut(df['year'], bins, labels=labels)
ax[1, 0].set_title('Distribution of Funding Length (Years)')
sns.violinplot(data=df_plot, y='YearRange', x='AwardDuration', hue=agg_key,
split=True, inner=None, ax=ax[1][0])
ax[1, 1].set_title('Distribution of Research Collaborators Per Award')
g = sns.violinplot(data=df_plot, y='YearRange', x='NumInvestigators', hue=agg_key,
split=True, inner=None, ax=ax[1][1])
g.set(ylabel='')
g.set(yticks=[])
ax[0, 1].legend_.remove()
ax[1, 0].legend_.remove()
ax[1, 1].legend_.remove()
divisions_to_compare = ['Molecular And Cellular Bioscience', 'Social And Economic Sciences']
def get_history(full_name):
'''
Returns a PI-centric SQLite query to a Pandas dataframe.
'''
return do(
"SELECT DISTINCT AwardID, AwardTitle, AwardEffectiveDate,"
"InstitutionName, Division, AwardAmount FROM AWARDS "
"INNER JOIN PI using(EmailAddress) "
"WHERE FirstName||' '||LastName = '{}' "
"ORDER BY AwardEffectiveDate ASC".format(full_name)
)
division_history(divisions_to_compare, start=2000, num_bins=3)
Let's look at a PI's history in more depth...
get_history('Cynthia Rudin')
AwardID | AwardTitle | AwardEffectiveDate | InstitutionName | Division | AwardAmount | |
---|---|---|---|---|---|---|
0 | 1053407 | CAREER: New Approaches for Ranking in Machine ... | 2011-09-01 | Massachusetts Institute of Technology | Information & Intelligent Systems | 480000 |
To compare the funding histories of multiple PIs, we can recycle get_history
to build per-PI funding time series.
def trace(PI):
'''
Generates trace for a given PI
'''
def build_annotation(fund_history):
'''
Returns a list of break-delimited, JSON-looking strings for HTML rendering on the Plotly plot.
More metadata!
'''
A = ["Award ID: {}".format(str(award)) for award in fund_history['AwardID'].tolist()]
I = ["Institution : {}".format(uni) for uni in fund_history['InstitutionName'].tolist()]
M = ["Award Amount: ${}".format(str(award)) for award in fund_history['AwardAmount'].tolist()]
return ['<br>'.join([award, inst, amount]) for (award, inst, amount) in zip(A, I, M)]
fund_history = get_history(PI)
return Scatter(
x = pd.to_datetime(fund_history['AwardEffectiveDate'], format='%Y/%m/%d'),
y = fund_history['AwardAmount'].cumsum(),
text = build_annotation(fund_history),
name = PI,
connectgaps = True
)
py.iplot({
'data': [trace('Scott Doney'), trace('Jorge Sarmiento'), trace('Hugh Ducklow')],
'layout':
{'title': "Marine Biogeochemists' NSF Funding History in USD 2000-2016",
'xaxis': layout_dict['axis'],
'yaxis': layout_dict['axis'],
'plot_bgcolor': layout_dict['plot_bgcolor']
}
})
The plot can be viewed here.
With the annotations, we can backtrack to see which awards were fulfilled at which institution. This provides an accurate track record of each scientist's funding history. Unfortunately this plot oversimplifies an award as being given to one recipient, when in reality these awards are split into many slices and may last several years.
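One way to soften that, sketched below, is to accumulate each PI's per-investigator share instead of the full award amount (get_history_per_pi and trace_per_pi are hypothetical variants, and dividing by NumInvestigators is only a rough proxy for how an award is actually split):
def get_history_per_pi(full_name):
    # same query as get_history, but also pulling NumInvestigators
    return do(
        "SELECT DISTINCT AwardID, AwardTitle, AwardEffectiveDate, "
        "InstitutionName, Division, AwardAmount, NumInvestigators FROM AWARDS "
        "INNER JOIN PI USING(EmailAddress) "
        "WHERE FirstName||' '||LastName = '{}' "
        "ORDER BY AwardEffectiveDate ASC".format(full_name)
    )

def trace_per_pi(PI):
    # cumulative funding, crediting each PI with an equal slice of every award
    hist = get_history_per_pi(PI)
    return Scatter(
        x=hist['AwardEffectiveDate'],
        y=(hist['AwardAmount'] / hist['NumInvestigators']).cumsum(),
        name='{} (per-PI share)'.format(PI),
        connectgaps=True
    )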
def read_abstract(AwardID):
'''
Renders the abstract in HTML so we can learn more!
'''
abstract = do("SELECT * FROM ABSTRACTS "
"WHERE AwardID={}".format(AwardID))
return HTML(abstract['Abstract'].loc[0])
We see a huge $20M spike for Jorge Sarmiento in September 2014.
We can quickly access the abstract using AwardID 1425989.
read_abstract(1425989)
We can also split the data by abstract topic.
Currently this is only supported via substring matching.
An unsupervised learning algorithm might be interesting to use here.
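As a rough sketch of what that could look like (scikit-learn is an assumed extra dependency, not imported above; newer versions rename get_feature_names to get_feature_names_out), NMF on TF-IDF vectors of the abstracts can surface topics without hand-picked substrings:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF

# vectorize a sample of abstracts and factor them into 10 topics
sample = df['Abstract'].dropna().sample(10000, random_state=1)
tfidf = TfidfVectorizer(max_features=5000, stop_words='english')
X = tfidf.fit_transform(sample)
nmf = NMF(n_components=10, random_state=1).fit(X)

terms = tfidf.get_feature_names()
for k, component in enumerate(nmf.components_):
    top = [terms[i] for i in component.argsort()[::-1][:8]]
    print('Topic {}: {}'.format(k, ', '.join(top)))
The substring approach used in the rest of this notebook follows below.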
def topic_history(topic):
abstract_df = do(
'SELECT * FROM AWARDS '
'INNER JOIN ( '
'SELECT * FROM ABSTRACTS '
"WHERE ABSTRACT LIKE '%{}%' "
'AND Abstract NOT NULL '
"AND Abstract != 'None'"
') USING(AwardID)'.format(topic))
return abstract_df
topic_history("ocean acidification").head()
AwardID | AwardTitle | AwardEffectiveDate | AwardExpirationDate | AwardAmount | InstitutionName | Division | Directorate | EmailAddress | InvestigatorNames | NumInvestigators | Abstract | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 9418652 | Analysis of New Drill Core Samples from within... | 1995-09-15 | 1998-08-31 | 119848 | Lunar and Planetary Institute | Earth Sciences | Geosciences | Buck.Sharpton@alaska.edu | Virgil Sharpton;Benjamin Shuraytz | 2 | 9418652 Sharpton Recent analyses of drill core... |
1 | 628302 | Collaborative Research: Dynamics of carbon rel... | 2006-09-15 | 2009-08-31 | 179989 | Purdue University | Earth Sciences | Geosciences | gabe.bowen@utah.edu | Gabriel Bowen | 1 | Intellectual Merit: The ocean is the largest s... |
2 | 628336 | Collaborative Research: Dynamics of carbon rel... | 2006-09-15 | 2008-04-30 | 307481 | University of Wisconsin-Madison | Earth Sciences | Geosciences | awinguth@uta.edu | Arne Winguth | 1 | ABSTRACT<br/><br/>Intellectual Merit: The ocea... |
3 | 628358 | Collaborative Research: Dynamics of carbon rel... | 2006-09-15 | 2010-08-31 | 240002 | Yale University | Earth Sciences | Geosciences | mark.pagani@yale.edu | Mark Pagani | 1 | ABSTRACT<br/><br/>Intellectual Merit: The ocea... |
4 | 628366 | Collaborative Research: Dynamics of carbon rel... | 2006-09-15 | 2009-04-30 | 179261 | Williams College | Earth Sciences | Geosciences | hstoll@geo.umass.edu | Heather Stoll | 1 | ABSTRACT<br/><br/>Intellectual Merit: The ocea... |
How can we leverage the abstracts to understand what each field within a Division is all about?
We can count words within abstracts and visualize them, with size proportional to frequency, as a word cloud.
First we need to ignore common words, called stopwords.
def download_english_stopwords():
'''
Stopwords are words that don't give any unique insight
across different texts.
'''
stop_url = 'https://raw.githubusercontent.com/nltk/nltk_data/' \
'gh-pages/packages/corpora/stopwords.zip'
url = requests.get(stop_url)
zipfile = ZipFile(BytesIO(url.content))
with zipfile.open('stopwords/english', 'r') as f:
return f.read().decode('utf8').split('\n')
stops = download_english_stopwords()
stops[:5]
['i', 'me', 'my', 'myself', 'we']
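Since nltk.corpus.stopwords is already imported at the top but never used, the same list could come from there instead of the manual download; a minimal sketch, assuming the stopwords corpus has been fetched once (e.g. via nltk.download('stopwords')):
# alternative to download_english_stopwords(), using the nltk import above
nltk_stops = stopwords.words('english')
nltk_stops[:5]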
stops += [
'lab', 'abstract', 'project', 'career', 'research',
'give', 'given', 'gives', 'giving', 'get', 'fund',
'funding', 'funds', 'gain', 'gained', 'gaining',
'gains', 'go', 'good', 'goods', 'great', 'greater',
'greatest', 'greatly', 'high', 'higher', 'improve',
'improves', 'improvement', 'improving', 'increase',
'increased', 'increasing', 'increasingly', 'know',
'large', 'larger', 'largely', 'learn', 'learned',
'learner', 'less', 'like', 'investigator', 'also',
'university', 'student', 'students', 'undergraduate',
'graduate', 'br', 'new', 'pi', 'one', 'two', 'three',
'four', 'five', 'year', 'well', 'studies', 'things',
'using', 'many', 'use', 'used', 'users',
'school', 'work', 'nsf', 'science', 'professor',
'researcher', 'researchers'
]
def display_cloud(words, title, exclude=list(), save=False):
wc = WordCloud(max_words=300, stopwords=stops + exclude, margin=10, background_color="white",
random_state=1, scale=.75, width=800, height=400).generate(words)
plt.title(title)
plt.imshow(wc.to_array())
plt.axis("off")
plt.figure()
if save:
wc.to_file("clouds/{}_cloud.png".format(title.replace(' ', '_').lower()))
We can compare the most common terms across divisions.
for division in divisions_to_compare:
words = df[(df['year'] == 2016) & (df['Division'] == division)] \
.groupby('Division')['Abstract'] \
.apply(lambda x: x.sum()).iloc[0]
display_cloud(words, division, exclude=['data'], save=True)
We can create a wordcloud for a specific topic (substring)
topic = 'policy'
words = topic_history(topic) \
.groupby('Division')['Abstract'] \
.apply(lambda x: x.sum()).iloc[0]
display_cloud(words, topic, save=False)
How are researchers connected?
Let's make a network of collaborators based on who has been awarded grants together.
Work in progress... but a beautiful mistake!
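In the meantime, a minimal sketch of one way to look at the same co-award relationships with networkx (an assumed extra dependency, not part of the pipeline above): build an undirected graph with an edge for every pair of investigators named on the same award.
import networkx as nx
from itertools import combinations

G = nx.Graph()
for names in df[df['year'] >= 2000]['InvestigatorNames'].dropna():
    PIs = [name.strip().title() for name in names.split(';')]
    for a, b in combinations(PIs, 2):
        # weight an edge by the number of shared awards
        if G.has_edge(a, b):
            G[a][b]['weight'] += 1
        else:
            G.add_edge(a, b, weight=1)

# investigators with the most distinct collaborators
print(sorted(dict(G.degree()).items(), key=lambda kv: kv[1], reverse=True)[:10])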
# df['Division'].value_counts().iloc[:40]
def count_PIS(PI_STRING):
"""
Split the semicolon-delimited PI string and count awards per PI.
"""
PI_list = [x.rstrip().lstrip().title() for x in PI_STRING.split(';')]
for PI in PI_list:
if PI_count.get(PI):
PI_count[PI]['count'] += 1
else:
PI_count[PI] = {'count': 1}
def parse_PIS(PI_STRING):
"""
Split the semicolon-delimited PI string and tally pairwise collaborations.
"""
PI_list = [x.rstrip().lstrip().title() for x in PI_STRING.split(';')]
if len(PI_list) > 1:
for PI in PI_list:
if PI_dict.get(PI):
# check if PI exists
continue
else:
# if PI not in dict, add it
PI_dict[PI] = dict()
#check other PIs
remaining_PI = [x for x in PI_list if x != PI]
for sub_PI in remaining_PI:
# for PI already collaborated add one
if PI_dict[PI].get(sub_PI):
PI_dict[PI][sub_PI] += 1
else:
PI_dict[PI][sub_PI] = 1
else:
return
PI_dict = dict()
PI_count = dict()
# df[
# (df['Division'].isin(['Ocean Sciences']))
# & (df['year'] >= 2000)
# ]['InvestigatorNames'].apply(parse_PIS)
df[df['year'] >= 2000]['InvestigatorNames'].apply(parse_PIS)
# df[(df['Division'].isin(['Ocean Sciences'])) & (df['year'] >= 2000)]['InvestigatorNames'].apply(count_PIS)
df[df['year'] >= 2000]['InvestigatorNames'].apply(count_PIS)
# PI_exclude = [PI for (PI, count) in PI_count.items() if count <= 2]
df_connection = pd.DataFrame([
{'source' : source.replace(' ', '_'),
'target': target.replace(' ', '_'),
'value': value}
for (source, targets) in PI_dict.items()
for (target, value) in targets.items()])
# if source not in PI_exclude
# or target not in PI_exclude])
force_directed_graph = {}
force_directed_graph['links'] = df_connection.to_dict(orient='records')
force_directed_graph['nodes'] = [{'id': source.replace(' ', '_'), 'group': 1}\
for source in df_connection['source'].append(df_connection['target']).unique()]
with open('data_out/nsf.json', 'w') as fp:
json.dump(force_directed_graph, fp)
PI_count
def get_edges(PI):
'''
look through dataframes for PI and all PIs who have shared an award with that PI.
'''
df_PI = df[df['InvestigatorNames'].str.contains(PI)]
PI_count[PI]['Children'] = set(';'.join(df_PI['InvestigatorNames'].tolist()).split(';'))
[get_edges(PI) for PI in PI_count.keys()]
def depth(first_PI, PI):
    if not PI_count[PI].get('subChildren'):
        PI_count[PI]['subChildren'] = set()
    diff_set = PI_count[PI]['Children'] - PI_count[first_PI]['subChildren']
    if diff_set:
        print("difference found between {} and {}".format(first_PI, PI))
        for new_PI in diff_set:
            # add the new PI to the original PI's subchildren
            # (add the full name; set(new_PI) would add individual characters)
            PI_count[first_PI]['subChildren'].add(new_PI)
            # add the original PI to the new PI's subchildren
            PI_count[new_PI].setdefault('subChildren', set()).add(first_PI)
            # recursive call
            depth(first_PI, new_PI)
    else:  # base case
        return
ex = set(';'.join(df[df['InvestigatorNames'].str.lower().str.contains('Sarah Gille'.lower())]['InvestigatorNames'].tolist()).split(';'))
PI_count['Sarah Gille'].get('Children')
{'Alexa Griesel', 'B. Gregory Mitchell', 'David Sandwell', 'Farooq Azam', 'Janet Sprintall', 'Jennifer MacKinnon', 'John Orcutt', 'Julie McClean', 'Katherine Barbeau', 'Matthew Mazloff', 'Osmund Holm-Hansen', 'Paul Henkart', 'Sarah Gille', 'Teresa Chereskin'}
for PI in PI_count.keys():
PI_count[PI]['subChildren'] = PI_count[PI]['Children']
Check each set of Children, append their Children to the dict, and start generating the graph:
the links are all permutations of investigators.
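The transitive collaborator sets that depth() builds up are just the connected components of a co-award graph; with the networkx graph G from the earlier sketch (an assumption, since that graph is not part of the original pipeline), they fall out directly:
# connected components of the co-award graph G from the networkx sketch above
components = sorted(nx.connected_components(G), key=len, reverse=True)
print('largest collaboration cluster: {} investigators'.format(len(components[0])))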
PI_count['Shawn Campagna']['subChildren']
{'Alison Buchan', 'Brynn Voy', 'DeEtta Mills', 'Diana Downs', 'Eberhard Voit', 'Elizabeth Fozo', 'Erik Zinser', 'Gladys Alexandre', 'Jay Whelan', 'Jennifer DeBruyn', 'Laurie Richardson', 'Michael Best', 'Patrick Gillevet', 'Shawn Campagna', 'Steven Wilhelm'}