Auditing Federal Contractors Part 1

By Leon Yin Last Updated 2017-06-11

View this notebook in NBViewer or Github

This two-part module will show you how to request data from USASpending.gov, store it as a tab-separated value (tsv) file, and perform some temporal and spatial analysis.

I hope these two modules are clear enough to be used by journalists, lawyers, and other folks who wish to use data to audit government contractors.

Please view part 2 on NBViewer or Github

Downloading data from USA Spending

This notebook describes how to download annual records from the USASpending.gov website for a specific contractor.

In this example we download all records for the Corrections Corporation of America (CCA) and all of its subsidiaries.

Let's start by installing and importing all the Python packages used in this module.

In [5]:
%%sh
pip install -r requirements.txt
In [21]:
%matplotlib inline
import os
from zipfile import ZipFile
from multiprocessing import Pool
import requests
from io import BytesIO
from itertools import repeat
import datetime

import pandas as pd
In [2]:
latest_update = '20170515'
next_latest  = '20170115'

year = 2017

dep= 'All'
In [3]:
url = ('http://download.usaspending.gov/data_archives/{UP_MONTH}/'
       'tsv/{{YEAR}}_{DEP}_Contracts_Full_{UP_DATE}.tsv.zip'.format(
            UP_MONTH=latest_update[:-2], DEP=dep, UP_DATE=latest_update))
url
Out[3]:
'http://download.usaspending.gov/data_archives/201705/tsv/{YEAR}_All_Contracts_Full_20170515.tsv.zip'
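For a given fiscal year, the remaining `{YEAR}` placeholder is filled in at download time. A quick sketch of how the final URL comes together (the dates here match the `latest_update` value above):

```python
# The archive URL keeps {YEAR} as a placeholder until download time.
template = ('http://download.usaspending.gov/data_archives/201705/'
            'tsv/{YEAR}_All_Contracts_Full_20170515.tsv.zip')

# Filling in a specific fiscal year produces the final download URL.
print(template.format(YEAR=2016))
```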
In [4]:
# for files that were not updated by the date above...
url_legacy = ('http://download.usaspending.gov/data_archives/{UP_MONTH}/'
              'tsv/{{YEAR}}_{DEP}_Contracts_Full_{UP_DATE}.tsv.zip'.format(
                  UP_MONTH=next_latest[:-2], DEP=dep, UP_DATE=next_latest))
In [5]:
data_in = 'data_in/spending'
In [6]:
# these are the years we are interested in:
start = 2000
end = 2017
years = [y for y in range(start, end + 1)]
In [7]:
# These are the aliases for each company. Matching is case-insensitive,
# but each alias needs to be enclosed in single or double quotes!
companies =  ['Corrections Corporation of America', 
              'CoreCivic',
              'TransCor']
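Joining these aliases with `'|'` yields a single alternation pattern, which is what `str.contains(regex, case=False)` matches against in the function below. A small sketch (the vendor names here are made up for illustration):

```python
import re

companies = ['Corrections Corporation of America',
             'CoreCivic',
             'TransCor']

# Joining with '|' produces one alternation pattern; matched
# case-insensitively, just like str.contains(..., case=False).
pattern = '|'.join(companies)

print(bool(re.search(pattern, 'CORECIVIC OF TENNESSEE LLC', re.IGNORECASE)))  # True
print(bool(re.search(pattern, 'ACME CONSTRUCTION', re.IGNORECASE)))           # False
```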
In [9]:
def load_and_sift(year, regex):
    '''Downloads zipped tsv file from: 
    https://www.usaspending.gov/DownloadCenter/Pages/dataarchives.aspx
    to a requests object.
    
    Expands the zipfile and reads each file, chunkwise, into Pandas dataframes.
    The dataframe (df) is filtered by the companies' regex pattern.
    
    Args:
       year (int):  The fiscal year of records to load.
       regex (string): A regex expression of company name(s).
    
    Returns:
       df: a Pandas DataFrame containing records for the given year.
    '''
    print(year)
    r = requests.get(url.format(YEAR=year))
    last_update = datetime.datetime.strptime(latest_update, '%Y%m%d')
    
    if r.status_code == 404: # if url doesn't work, use the legacy url.
        r = requests.get(url_legacy.format(YEAR=year))
        last_update = datetime.datetime.strptime(next_latest, '%Y%m%d')
            
    if r.status_code == 200: # make sure the download was successful.
                     
        # the downloaded stream is a zip archive
        zipfile = ZipFile(BytesIO(r.content))        
        df_final = pd.DataFrame()
        
        # for each file in the zip archive
        for f in zipfile.namelist():              
            # process the file in dataframe chunks!
            for df in pd.read_csv(zipfile.open(f), sep='\t',
                                  chunksize=100000, low_memory=False):
 
                # filter the dataframe chunk for active vendors
                # and relevant company names.
                df = df[(~df['vendorname'].isnull()) &
                        (df['vendorname'].str.contains(regex, case=False))]
                
                # some date tags...
                df['lastupdate'] = last_update
                df['contract_year'] = year
                df['filename'] = f
                df['search_terms'] = regex
                                
                df_final = df_final.append(df, ignore_index=True)
        
        return df_final
    
    else:
        raise requests.HTTPError('bad request: status {}'.format(r.status_code))

This next step might take a while. It's expedited using Pool, which parallelizes the task across worker processes.

It is equivalent to

df_list = []
for year, co in zip(years, repeat('|'.join(companies))):
    df_list.append(load_and_sift(year, co))
df = pd.concat(df_list, ignore_index=True)
In [16]:
with Pool() as pool:
    df_list = pool.starmap(load_and_sift, zip(years, repeat('|'.join(companies))))
df = pd.concat(df_list, ignore_index=True)
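pool.starmap unpacks each (year, regex) tuple into positional arguments, calling load_and_sift(year, regex) once per year. The same call semantics can be illustrated serially with itertools.starmap and a toy function:

```python
from itertools import repeat, starmap

def label(year, regex):
    return '{}:{}'.format(year, regex)

# Pool.starmap shares itertools.starmap's semantics: each
# (year, regex) tuple is unpacked into positional arguments.
print(list(starmap(label, zip([2000, 2001], repeat('CCA')))))
# ['2000:CCA', '2001:CCA']
```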

We can take a peek at 5 random records here:

In [9]:
df.sample(5)
Out[9]:
unique_transaction_id transaction_status dollarsobligated baseandexercisedoptionsvalue baseandalloptionsvalue maj_agency_cat mod_agency maj_fund_agency_cat contractingofficeagencyid contractingofficeid ... prime_awardee_executive4 prime_awardee_executive4_compensation prime_awardee_executive5 prime_awardee_executive5_compensation interagencycontractingauthority last_modified_date lastupdate contract_year filename search_terms
782 43e9c36cedfb72813511fe13fc39d742 active 0.00 0.0 0.0 1500: JUSTICE, DEPARTMENT OF 1501: OFFICES, BOARDS AND DIVISIONS 1500: JUSTICE, DEPARTMENT OF 1501: OFFICES, BOARDS AND DIVISIONS OFDT: OFFICE OF THE FEDERAL DETENTION TRUSTEE ... NaN 0.0 NaN 0.0 X: Not Applicable 01/09/2012 2017-04-15 2012 datafeeds\2012_All_Contracts_Full_20170415.tsv Corrections Corporation of America|CoreCivic|T...
280 4871c9c2bf29ef438ad8c8e2a3a22926 active 48073.00 48073.0 48073.0 9700: DEPT OF DEFENSE 5700: DEPT OF THE AIR FORCE 9700: DEPT OF DEFENSE 5700: DEPT OF THE AIR FORCE FA5613: FA5613 700 CONS LGC ... NaN 0.0 NaN 0.0 X: Not Applicable 04/18/2008 2017-04-15 2008 datafeeds\2008_All_Contracts_Full_20170415.tsv Corrections Corporation of America|CoreCivic|T...
874 4ec36c4a3a56e151c7e7f89d77f0be77 active 0.00 0.0 0.0 1500: JUSTICE, DEPARTMENT OF 1501: OFFICES, BOARDS AND DIVISIONS 1500: JUSTICE, DEPARTMENT OF 1501: OFFICES, BOARDS AND DIVISIONS OFDT: OFFICE OF THE FEDERAL DETENTION TRUSTEE ... NaN 0.0 NaN 0.0 X: Not Applicable 06/18/2013 2017-04-15 2012 datafeeds\2012_All_Contracts_Full_20170415.tsv Corrections Corporation of America|CoreCivic|T...
919 20204082b037e602ef22bad3c6d7a918 active 1522136.00 1522136.0 1522136.0 1500: JUSTICE, DEPARTMENT OF 1501: OFFICES, BOARDS AND DIVISIONS 1500: JUSTICE, DEPARTMENT OF 1501: OFFICES, BOARDS AND DIVISIONS OFDT: OFFICE OF THE FEDERAL DETENTION TRUSTEE ... NaN 0.0 NaN 0.0 X: Not Applicable 04/26/2013 2017-04-15 2013 datafeeds\2013_All_Contracts_Full_20170415.tsv Corrections Corporation of America|CoreCivic|T...
1306 3d423442685d42d99a05519784779622 active -2087743.31 0.0 0.0 1500: Department of Justice 1544: U.S. MARSHALS SERVICE 1500: Department of Justice 1544: U.S. MARSHALS SERVICE HQ018: OFFICE OF CONTRACTS&AGREEMENTS ... NaN 0.0 NaN 0.0 X: NOT APPLICABLE 12/07/2015 2017-04-15 2016 datafeeds\2016_All_Contracts_Full_20170415.tsv Corrections Corporation of America|CoreCivic|T...

5 rows × 229 columns

There are a lot of columns! So it might be hard to know what you're after.

In [10]:
len(df.columns)
Out[10]:
229

Save the data to a gzipped tab-separated value (tsv) file.

In [15]:
outfile = data_in + '_' + companies[0].replace(' ', '_').lower() + '.tsv.gz'
df.to_csv(outfile, sep='\t', compression='gzip', index=False)
print("Data saved to {}".format(outfile))
Data saved to data_in/spending_corrections_corporation_of_america.tsv.gz
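To confirm the round trip, the gzipped tsv can be read back with pandas, which infers the compression from the .gz extension. A minimal sketch with a tiny stand-in frame (the real df has 229 columns):

```python
import pandas as pd

# A tiny stand-in frame; the vendor name is made up for illustration.
df = pd.DataFrame({'vendorname': ['CORECIVIC OF TENNESSEE LLC'],
                   'dollarsobligated': [48073.0]})

outfile = 'spending_sample.tsv.gz'
df.to_csv(outfile, sep='\t', compression='gzip', index=False)

# pandas infers gzip compression from the .gz extension on read.
df_back = pd.read_csv(outfile, sep='\t')
print(df_back.equals(df))  # True
```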

Now we're ready to perform some analysis!

See Part 2 here!