By Leon Yin Last Updated 2018-12-06
View this notebook in NBViewer or Github. The output is availalble publically on Google Drive and Github.
Following my analysis of federal contracts to the private prisons CoreCivic and Geo Group, this notebook analyzes publically avaiable contracts from the Immigration and Customs Enforcement Agency (ICE). This analysis uses data downloaded from the USASpending.gov dashboard, specifically I filtered Funding Agency
with "U.S. Immigration and Customs Enforcement (ICE) | Sub-Agency".
%matplotlib inline
import glob
import datetime
import string
from collections import Counter
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from nltk.corpus import stopwords
from fuzzywuzzy import fuzz
/anaconda3/lib/python3.6/site-packages/fuzzywuzzy/fuzz.py:35: UserWarning: Using slow pure-python SequenceMatcher. Install python-Levenshtein to remove this warning warnings.warn('Using slow pure-python SequenceMatcher. Install python-Levenshtein to remove this warning')
# data downloaded from USAspending.gov
input_file_pattern = 'ice_data/data_in/all_prime_awards_subawards_*/*'
files = glob.glob(input_file_pattern)
files
['ice_data/data_in/all_prime_awards_subawards_20181206050302/all_contracts_subawards_1.csv', 'ice_data/data_in/all_prime_awards_subawards_20181206050302/all_contracts_prime_awards_1.csv', 'ice_data/data_in/all_prime_awards_subawards_20181206050302/all_assistance_prime_awards_1.csv', 'ice_data/data_in/all_prime_awards_subawards_20181206050302/all_assistance_subawards_1.csv']
Let's just look at prime contracts today.
df_prime_contracts = pd.read_csv(files[1])
/anaconda3/lib/python3.6/site-packages/IPython/core/interactiveshell.py:2698: DtypeWarning: Columns (13,32,40,42,43,50,72,73,81,100,128,129) have mixed types. Specify dtype option on import or set low_memory=False. interactivity=interactivity, compiler=compiler, result=result)
Something to note here, if we don't filter out funding agency we get some inaccuracies!
{
'awarding_sub_agency_name' : 'U.S. IMMIGRATION AND CUSTOMS ENFORCEMENT', # N = 15056
'funding_sub_agency_name' : 'U.S. IMMIGRATION AND CUSTOMS ENFORCEMENT', # N = 17145
}
{'awarding_sub_agency_name': 'U.S. IMMIGRATION AND CUSTOMS ENFORCEMENT', 'funding_sub_agency_name': 'U.S. IMMIGRATION AND CUSTOMS ENFORCEMENT'}
# here we're filtering only when ICE awards a contract
df_prime_contracts = df_prime_contracts[
df_prime_contracts['awarding_sub_agency_name'] == 'U.S. IMMIGRATION AND CUSTOMS ENFORCEMENT'
]
len(df_prime_contracts)
15056
# there are a lot of columns!
prime_cols = df_prime_contracts.columns
num_cols = len(prime_cols)
print(f"There are {num_cols} columns in this file!\nHere are some of the column names:")
prime_cols.values.tolist()[:20]
There are 249 columns in this file! Here are some of the column names:
['award_id_piid', 'parent_award_agency_id', 'parent_award_agency_name', 'parent_award_id', 'obligated_amount', 'current_total_value_of_award', 'potential_total_value_of_award', 'period_of_performance_start_date', 'period_of_performance_current_end_date', 'period_of_performance_potential_end_date', 'ordering_period_end_date', 'awarding_agency_code', 'awarding_agency_name', 'awarding_sub_agency_code', 'awarding_sub_agency_name', 'awarding_office_code', 'awarding_office_name', 'funding_agency_code', 'funding_agency_name', 'funding_sub_agency_code']
That is a lot of columns, here are the definitions of the columns we'll be using here:
recipient_duns
The unique identification number for the ultimate parent of an awardee or recipient. Currently the identifier is the 9-digit number maintained by Dun & Bradstreet as the global parent DUNS® number.
potential_total_value_of_award
The total amount that could be obligated on a contract. This total includes the base plus options amount. For example, if a recipient is awarded 10M on a base contract with 3 option years at 1M each, the potential award amount is 13M.
current_total_value_of_award
The amount of money that the government has promised (obligated) to pay a recipient for a contract. This means the base amount and any exercised options.
period_of_performance_start_date
The date that the award begins.
period_of_performance_potential_end_date
For procurement, the date on which, for the award referred to by the action being reported if all potential pre-determined or pre-negotiated options were exercised, awardee effort is completed or the award is otherwise ended. Administrative actions related to this award may continue to occur after this date. This date does not apply to procurement indefinite delivery vehicles under which definitive orders may be awarded.
funding_sub_agency_name
A Funding Agency pays for the majority of funds for an award out of its budget. Typically, the Funding Agency is the same as the Awarding Agency. In some cases, one agency will administer an award (Awarding Agency) and another agency will pay for it (Funding Agency).
awarding_sub_agency_name
The Awarding Sub Agency is the sub agency that issues and administers the award. For example, the Internal Revenue Service (IRS) is a sub agency of the Department of the Treasury.
Thanks for providing that glossary USASpending, let's do some minor data wrangling and get some summary statistics.
# convert columns to datetime
date_cols = ['period_of_performance_start_date', 'period_of_performance_potential_end_date']
for col in date_cols:
df_prime_contracts[col] = pd.to_datetime(df_prime_contracts[col])
# aggregate contract values
total_spend = df_prime_contracts.drop_duplicates(subset=['award_id_piid']) \
.current_total_value_of_award.sum()
potential_spend = df_prime_contracts.drop_duplicates(subset=['award_id_piid']) \
.potential_total_value_of_award.sum()
at_stake = potential_spend - total_spend
num_contractors = len(df_prime_contracts.recipient_duns.unique())
first_contract = df_prime_contracts.period_of_performance_start_date.min().strftime('%Y-%m-%d')
print("ICE has invested ${:,.2f} in contracts with {} Prime contractors "
"since {}.\nICE has ${:,.2f} in potential payouts with active contracts.".format(
total_spend, num_contractors, first_contract, at_stake))
ICE has invested $6,959,189,462.00 in contracts with 3134 Prime contractors since 2006-10-01. ICE has $10,616,424,251.18 in potential payouts with active contracts.
Let's sort this data by the contract end date
df_prime_contracts.sort_values(by=['period_of_performance_potential_end_date'],
ascending=False,
inplace=True)
There are typos in the recipient names, so let's use their DUNs IDs for aggregate analysis.
We can create a lookup_table
containing the latest name and contact info
lookup_cols = [
'recipient_parent_duns',
'recipient_parent_name',
'recipient_address_line_1',
'recipient_address_line_2',
'recipient_city_name',
'recipient_state_code',
'recipient_state_name',
'recipient_zip_4_code',
'recipient_congressional_district',
'recipient_phone_number',
'recipient_fax_number'
]
lookup_table = df_prime_contracts.drop_duplicates(
subset = ['recipient_parent_duns'],
keep = 'first'
)[lookup_cols]
There are contractors with similar names (and different DUNS), we can find similar names using fuzzywuzzy, and replace them for aggregation.
from fuzzywuzzy import fuzz
look_up_names = lookup_table[~lookup_table.recipient_parent_name.isnull()].recipient_parent_name.unique()
look_up_names.sort()
Here we will compare all names, and create a standardized dictionary (switch_name
) to replace similar names.
window_size = 10 # look at to words to the right and left
threshold = 90 # how similar is enough?
skip = []
switch_name = dict()
for i, comp in enumerate(look_up_names):
if comp not in skip:
window = np.concatenate((look_up_names[i-window_size: i],
look_up_names[i+1: i+window_size]))
for to_search in window:
similarity_score = fuzz.token_sort_ratio(comp, to_search)
if similarity_score >= threshold:
switch_name[comp] = to_search
skip += [to_search]
Let's swamp out the fuzzy names!
df_prime_contracts.recipient_parent_name.replace(switch_name,
inplace = True)
lookup_table.recipient_parent_name.replace(switch_name,
inplace = True)
lookup_table.drop_duplicates(subset = ['recipient_parent_name'],
keep = 'first',
inplace = True)
Now let's get how much each company has been rewarded, how much they have the potential to be rewarded, how many awards they're been granted, and when their contracts are finished. This code is chained, and kind of messy. Each line aggreagates by the recipient ID, and performs some sort of agg function, and then sorts the values according to the total of the agg function.
contract_amounts = df_prime_contracts.groupby('recipient_parent_name') \
.current_total_value_of_award \
.sum() \
.sort_values(ascending=False)
contract_potential = df_prime_contracts.groupby('recipient_parent_name') \
.potential_total_value_of_award \
.sum()
num_awards = df_prime_contracts.groupby('recipient_parent_name') \
.award_id_piid.nunique()
award_ids = df_prime_contracts.groupby('recipient_parent_name') \
.award_id_piid\
.unique().str.join(', ')
contract_enddate = df_prime_contracts[~df_prime_contracts.period_of_performance_potential_end_date.isnull()] \
.groupby('recipient_parent_name') \
.period_of_performance_potential_end_date.max()
contract_startdate = df_prime_contracts[~df_prime_contracts.period_of_performance_start_date.isnull()]\
.groupby('recipient_parent_name') \
.period_of_performance_start_date \
.min()
For context, let's see what kinds of services each of these companies provide by counting the most frequent words in service descriptions per contractor.
def count_words(row, counter, sw):
'''
Splits sentences, casts words to lowercase, and counts all words that aren't in stopwords (sw)
'''
for c in string.punctuation:
row = row.replace(c, " ")
counter.update([w for w in row.lower().split() if w not in sw])
sw = stopwords.words('English')
top_words = []
for company_, df_ in df_prime_contracts.groupby('recipient_parent_name'):
word_counter_ = Counter()
df_['product_or_service_code_description'].apply(count_words,
args = (word_counter_, sw,))
# get top 5 words
top_words_ = ', '.join([w for w, c in word_counter_.most_common(5)])
# create a dcitonary, and add it to the list
d_ = dict(top_words_from_serv_desc = top_words_,
recipient_parent_name = company_)
top_words.append(d_)
top_words_per_service_description = pd.DataFrame(top_words)
We can join them all together using this craziness below:
output = (
contract_amounts.reset_index()
.merge(
num_awards.reset_index(),
on='recipient_parent_name',
)
.merge(
contract_potential.reset_index(),
on='recipient_parent_name'
)
.merge(
contract_enddate.reset_index(),
on='recipient_parent_name'
)
.merge(
contract_startdate.reset_index(),
on='recipient_parent_name'
)
.merge(
top_words_per_service_description,
on='recipient_parent_name'
)
.merge(
award_ids.reset_index(),
on='recipient_parent_name'
)
.merge(
lookup_table,
on='recipient_parent_name'
)
)
The last merge above is the look_up table, containing metadata for each recipient.
Let's change some of the aggregate column names to be more decriptive:
remapping = {
'current_total_value_of_award' : 'current_total_value_of_awards_USD',
'potential_total_value_of_award' : 'potential_total_value_of_awards_USD',
'period_of_performance_potential_end_date' : 'last_contract_end_date',
'period_of_performance_start_date' : 'first_contract_start_date',
'award_id_piid_x' : 'number_of_prime_awards',
'award_id_piid_y' : 'award_id_piids'
}
output.columns = [remapping.get(c, c) for c in output.columns]
# convert datetimes
for col in ['last_contract_end_date', 'first_contract_start_date']:
output[col] = output[col].dt.strftime('%Y-%m-%d')
# calculate how much money in on the line for each contractor
output['remaining_total_value_of_awards_USD'] = output.potential_total_value_of_awards_USD - output.current_total_value_of_awards_USD
Let's timestamp when this file was made.
output['analysis_date'] = datetime.datetime.now() # when was this file made?
output['raw_data_download_date'] = datetime.datetime(2018,12,6) # when was raw data downloaded?
And lastly, let's re-order the columns so they're readable
order_cols = [
'recipient_parent_name',
'recipient_parent_duns',
'top_words_from_serv_desc',
'number_of_prime_awards',
'current_total_value_of_awards_USD',
'potential_total_value_of_awards_USD',
'remaining_total_value_of_awards_USD',
'award_id_piids',
'first_contract_start_date',
'last_contract_end_date',
'recipient_address_line_1',
'recipient_address_line_2',
'recipient_city_name',
'recipient_state_code',
'recipient_state_name',
'recipient_zip_4_code',
'recipient_congressional_district',
'recipient_phone_number',
'recipient_fax_number',
'analysis_date',
'raw_data_download_date',
]
# make these human readible
new_cols = [' '.join(c.split('_')).title() for c in order_cols]
Here is the output:
out_order = output[order_cols]
out_order.columns = new_cols
out_order.head(50)
Recipient Parent Name | Recipient Parent Duns | Top Words From Serv Desc | Number Of Prime Awards | Current Total Value Of Awards Usd | Potential Total Value Of Awards Usd | Remaining Total Value Of Awards Usd | Award Id Piids | First Contract Start Date | Last Contract End Date | ... | Recipient Address Line 2 | Recipient City Name | Recipient State Code | Recipient State Name | Recipient Zip 4 Code | Recipient Congressional District | Recipient Phone Number | Recipient Fax Number | Analysis Date | Raw Data Download Date | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | SCIENCE APPLICATIONS INTERNATIONAL CORPORATION | 078745061 | telecom, systems, development, analysis, house... | 8 | 5.322573e+08 | 7.048101e+08 | 1.725527e+08 | HSCETC09J00034, HSCETC10J00005, HSCOW5D0056, H... | 2015-06-04 | 2015-09-24 | ... | NaN | MCLEAN | VA | VIRGINIA | 22102 | NaN | 7036765681 | 7036765063 | 2018-12-06 01:30:13.766382 | 2018-12-06 |
1 | THE GEO GROUP INC | 612706465 | guard, housekeeping, services, support, profes... | 56 | 5.295999e+08 | 2.641455e+09 | 2.111855e+09 | 70CDCR18FR0000045, HSCEDM11J00054, 70CDCR18FR0... | 2007-06-01 | 2029-10-11 | ... | NaN | BOCA RATON | FL | FLORIDA | 334878242 | 22 | 5619997359 | 5619997648 | 2018-12-06 01:30:13.766382 | 2018-12-06 |
2 | CSI AVIATION SERVICES INC. | 102793460 | travel, air, passenger, transportation, reloca... | 15 | 5.213086e+08 | 6.580096e+08 | 1.367010e+08 | HSCECR14F00009, HSCECR14F00010, HSCECR14F00015... | 2009-10-05 | 2018-06-30 | ... | NaN | ALBUQUERQUE | NM | NEW MEXICO | 871072876 | 01 | 5057619000 | 5053427377 | 2018-12-06 01:30:13.766382 | 2018-12-06 |
3 | INGENESIS INC. | 019992424 | medical, general, health, care | 2 | 3.335133e+08 | 4.491412e+08 | 1.156279e+08 | 70CDCR18C00000002, HSCECR12F00048 | 2018-08-06 | 2018-11-15 | ... | NaN | SAN ANTONIO | TX | TEXAS | 782174430 | 21 | 2103660033 | 2105684582 | 2018-12-06 01:30:13.766382 | 2018-12-06 |
4 | M V M INC. | 038049532 | support, housekeeping, guard, administrative, ... | 38 | 2.611629e+08 | 1.086340e+09 | 8.251771e+08 | 70CMSW18FR0000058, HSCEMS17F00040, HSCEDM14J00... | 2010-09-20 | 2023-01-31 | ... | NaN | ASHBURN | VA | VIRGINIA | 201476063 | 10 | 5712234620 | 5712234487 | 2018-12-06 01:30:13.766382 | 2018-12-06 |
5 | GENERAL DYNAMICS CORPORATION | 001381284 | support, ammunition, 30mm, equipment, professi... | 52 | 2.480584e+08 | 4.923976e+08 | 2.443392e+08 | HSCEMD16F00117, HSCETC16J00002, HSCEMD15F00113... | 2011-04-07 | 2021-09-24 | ... | NaN | CHANTILLY | VA | VIRGINIA | 201513848 | 10 | 7038761243 | 7038761243 | 2018-12-06 01:30:13.766382 | 2018-12-06 |
6 | NANA REGIONAL CORPORATION INC | 079253761 | housekeeping, guard | 13 | 2.451490e+08 | 1.172830e+09 | 9.276814e+08 | HSCEDM15J00019, 70CDCR18FR0000051, 70CDCR18FR0... | 2016-06-21 | 2024-04-30 | ... | NaN | ANCHORAGE | AK | ALASKA | 99502 | 00 | 7037666837 | 7037666906 | 2018-12-06 01:30:13.766382 | 2018-12-06 |
7 | CORRECTIONS CORPORATION OF AMERICA | 159734151 | guard, housekeeping, support, services, profes... | 32 | 2.415587e+08 | 6.227455e+08 | 3.811868e+08 | 70CDCR18FR0000058, 70CDCR18FR0000017, 70CDCR18... | 2010-04-29 | 2019-06-30 | ... | NaN | NASHVILLE | TN | TENNESSEE | 372156105 | 05 | 6152636713 | 6152633090 | 2018-12-06 01:30:13.766382 | 2018-12-06 |
8 | SPECTRUM SECURITY SERVICES INC. | 6.18954e+08 | guard, housekeeping, services | 13 | 1.927423e+08 | 3.628039e+08 | 1.700616e+08 | 70CDCR18FC0000013, 70CDCR18FC0000012, 70CDCR18... | 2009-12-09 | 2019-10-31 | ... | NaN | JAMUL | CA | CALIFORNIA | 9.19353e+08 | 50 | NaN | NaN | 2018-12-06 01:30:13.766382 | 2018-12-06 |
9 | DELOITTE LLP | 014127109 | support, management, professional, program, fi... | 83 | 1.779484e+08 | 3.771559e+08 | 1.992075e+08 | 70CMSW18FR0000103, 70CMSW18FC0000097, HSCEMS17... | 2011-08-29 | 2023-09-29 | ... | NaN | ARLINGTON | VA | VIRGINIA | 222091742 | 08 | 5718580176 | 7038426748 | 2018-12-06 01:30:13.766382 | 2018-12-06 |
10 | AHTNA INCORPORATED | 6.95861e+07 | housekeeping, guard, maint, rep, alt | 9 | 1.421123e+08 | 4.689208e+08 | 3.268086e+08 | 70CDCR19FR0000003, 70CDCR18FR0000003, HSCEDM17... | 2011-10-21 | 2019-11-30 | ... | NaN | ANCHORAGE | AK | ALASKA | 995035677 | 00 | 9073346477 | 9073346750 | 2018-12-06 01:30:13.766382 | 2018-12-06 |
11 | CLASSIC AIR CHARTER INC. | 9.61705e+08 | travel, transportation, relocation, lodging, r... | 1 | 1.420237e+08 | 6.518740e+08 | 5.098503e+08 | 70CDCR18FR0000002 | 2018-11-14 | 2023-06-30 | ... | NaN | HUNTINGTON | NY | NEW YORK | 1.17433e+08 | 03 | 6.3155e+09 | NaN | 2018-12-06 01:30:13.766382 | 2018-12-06 |
12 | TRAILBOSS ENTERPRISES INC. | 879740454 | housekeeping, guard, facilities, operations, s... | 6 | 1.187095e+08 | 5.630046e+08 | 4.442951e+08 | 70CDCR18FR0000159, 70CDCR18FR0000075, 70CDCR18... | 2018-07-18 | 2019-09-27 | ... | NaN | ANCHORAGE | AK | ALASKA | 995012503 | 00 | 9073388243 | 9073380869 | 2018-12-06 01:30:13.766382 | 2018-12-06 |
13 | BOOZ ALLEN HAMILTON HOLDING CORPORATION | 964725688 | support, professional, management, telecom, pr... | 23 | 1.099139e+08 | 3.423611e+08 | 2.324473e+08 | 70CTD018FR0000193, HSCEMD17F00080, HSCETC15F00... | 2013-02-05 | 2024-03-06 | ... | NaN | MCLEAN | VA | VIRGINIA | 221023830 | 11 | 7033770195 | 7039023200 | 2018-12-06 01:30:13.766382 | 2018-12-06 |
14 | ASSET PROTECTION & SECURITY SERVICES L.P. | 009741828 | housekeeping, guard, oper, govt, misc | 11 | 9.983111e+07 | 3.326429e+08 | 2.328118e+08 | HSCEDM16J00022, HSCEDM16J00018, HSCEDM15J00018... | 2012-02-15 | 2018-12-28 | ... | NaN | CORPUS CHRISTI | TX | TEXAS | 784133787 | 27 | 3619061552 | 3619061844 | 2018-12-06 01:30:13.766382 | 2018-12-06 |
15 | GLOBAL PRECISION SYSTEMS LLC | 805596710 | housekeeping, guard, tableware | 9 | 9.716002e+07 | 4.184622e+08 | 3.213022e+08 | 70CDCR18FR0000134, 70CDCR18FR0000061, HSCEDM17... | 2017-05-15 | 2022-09-29 | ... | NaN | NOME | AK | ALASKA | 99762 | 00 | 9073348308 | 9075632742 | 2018-12-06 01:30:13.766382 | 2018-12-06 |
16 | STG INTERNATIONAL INC. | 179570403 | support, professional, program, management, ad... | 25 | 8.962683e+07 | 4.524422e+08 | 3.628154e+08 | 70CDCR18C00000003, HSCEMS13F00071, HSCEMD13F00... | 2009-07-16 | 2023-06-28 | ... | NaN | ALEXANDRIA | VA | VIRGINIA | 223145504 | 08 | 7035786030 | 5712556545 | 2018-12-06 01:30:13.766382 | 2018-12-06 |
17 | SAVANTAGE FINANCIAL SERVICES INC. | 878187012 | telecom, telecommunications | 2 | 8.604917e+07 | 1.302544e+08 | 4.420525e+07 | HSCETC17F00002, HSCETC12F00013 | 2017-11-22 | 2020-12-31 | ... | NaN | ROCKVILLE | MD | MARYLAND | 208504339 | 08 | NaN | NaN | 2018-12-06 01:30:13.766382 | 2018-12-06 |
18 | PALANTIR TECHNOLOGIES INC. | 3.62131e+08 | data, telecom, information, broadcasting, dist... | 6 | 7.103665e+07 | 1.696240e+08 | 9.858732e+07 | 70CTD019C00000001, HSCETC15C00001, HSCETC13F00... | 2013-09-13 | 2021-11-27 | ... | NaN | PALO ALTO | CA | CALIFORNIA | 9.43012e+08 | 18 | 6.50815e+09 | 6.50618e+09 | 2018-12-06 01:30:13.766382 | 2018-12-06 |
19 | DEV TECHNOLOGY GROUP INC. | 105667401 | telecom, services, programming, systems, devel... | 15 | 6.771468e+07 | 1.370574e+08 | 6.934275e+07 | HSCETC16J00057, HSCETC16J00026, HSCETC17J00004... | 2012-06-28 | 2022-02-28 | ... | NaN | RESTON | VA | VIRGINIA | 201905041 | 11 | 7038608135 | 7038608138 | 2018-12-06 01:30:13.766382 | 2018-12-06 |
20 | PHACIL INC. | 054885111 | telecom, help, desk | 1 | 6.757944e+07 | 6.969906e+07 | 2.119618e+06 | HSCETC12F00014 | 2018-09-05 | 2019-03-09 | ... | NaN | MCLEAN | VA | VIRGINIA | 221023590 | 11 | 7035261852 | 7037377100 | 2018-12-06 01:30:13.766382 | 2018-12-06 |
21 | PERSPECTA INC. | 081218565 | telecom, data, centers, storage, support | 3 | 6.156383e+07 | 8.597697e+07 | 2.441314e+07 | HSCETE15J00227, 70CMSD18FC0000012, HSCETC08F00013 | 2018-09-21 | 2020-06-26 | ... | NaN | HERNDON | VA | VIRGINIA | 201713299 | 11 | 7037333793 | 7037422674 | 2018-12-06 01:30:13.766382 | 2018-12-06 |
22 | CACI INTERNATIONAL INC | 045534641 | support, professional, miscellaneous, communic... | 4 | 5.923109e+07 | 9.455504e+07 | 3.532395e+07 | HSCEMS16J00065, HSCEMD14FAF003, HSCEMD10F00260... | 2014-04-14 | 2021-03-17 | ... | NaN | CHANTILLY | VA | VIRGINIA | 201512218 | 10 | 7036794137 | 7036793402 | 2018-12-06 01:30:13.766382 | 2018-12-06 |
23 | WIDEPOINT CORPORATION | 788549012 | support, professional, program, management | 2 | 5.494329e+07 | 2.127748e+08 | 1.578315e+08 | HSCETE15F00007, HSCETE15F00002 | 2017-03-23 | 2018-12-08 | ... | NaN | MC LEAN | VA | VIRGINIA | 221023371 | 11 | 7033495644 | 7038483560 | 2018-12-06 01:30:13.766382 | 2018-12-06 |
24 | MAXIM HEALTHCARE SERVICES INC. | 606290401 | medical, general, health, care | 3 | 5.416702e+07 | 5.416702e+07 | 0.000000e+00 | HSCEDM17C00002, HSCEDM15C00004, HSCEDM14C00005 | 2016-08-05 | 2018-11-22 | ... | NaN | COLUMBIA | MD | MARYLAND | 210463236 | 03 | 4109101553 | 4109101615 | 2018-12-06 01:30:13.766382 | 2018-12-06 |
25 | NET DIRECT SYSTEMS LLC | 5.40078e+07 | equipment, support, information, technology, adp | 931 | 4.864500e+07 | 5.851272e+07 | 9.867719e+06 | 70CTD018FC0000010, HSCETE17J00388, HSCETE17J00... | 2010-03-22 | 2018-12-13 | ... | NaN | APEX | NC | NORTH CAROLINA | 2.75396e+08 | 04 | NaN | NaN | 2018-12-06 01:30:13.766382 | 2018-12-06 |
26 | THE MITRE CORPORATION | 007872690 | development, support, engineering, r, research | 12 | 4.857490e+07 | 4.913481e+07 | 5.599040e+05 | 70CMSW18FR0000119, HSCETC16J00004, HSCETC16J00... | 2010-08-31 | 2019-09-28 | ... | NaN | MC LEAN | VA | VIRGINIA | 221027539 | 08 | 7039836474 | 7039837121 | 2018-12-06 01:30:13.766382 | 2018-12-06 |
27 | GRANT THORNTON INTERNATIONAL LTD | 3.47201e+08 | support, management, professional, program, fi... | 24 | 4.768990e+07 | 1.035107e+08 | 5.582075e+07 | 70CMSD18FC0000060, 70CMSW18FC0000076, 70CMSD18... | 2011-07-11 | 2024-03-09 | ... | NaN | ALEXANDRIA | VA | VIRGINIA | 223145767 | 08 | NaN | NaN | 2018-12-06 01:30:13.766382 | 2018-12-06 |
28 | KNIGHT POINT SYSTEMS LLC | 622594492 | telecom, help, desk, support, telecommunications | 4 | 4.439087e+07 | 7.911528e+07 | 3.472440e+07 | HSCETC16J00050, 70CTD018FR0000058, HSCETC14J00... | 2014-07-24 | 2021-06-29 | ... | NaN | RESTON | VA | VIRGINIA | 201905109 | 11 | 7036577051 | 5712663106 | 2018-12-06 01:30:13.766382 | 2018-12-06 |
29 | U.S. BANCORP | 006213482 | support, management, financial | 2 | 4.063902e+07 | 5.850522e+07 | 1.786619e+07 | 70CTD018C00000001, HSCEMD12C00003 | 2018-03-23 | 2023-04-27 | ... | NaN | MINNEAPOLIS | MN | MINNESOTA | 554027000 | 05 | 2022610806 | 2022610800 | 2018-12-06 01:30:13.766382 | 2018-12-06 |
30 | CAP GEMINI | 382014330 | support, professional, program, management, se... | 13 | 3.812979e+07 | 8.017269e+07 | 4.204290e+07 | 70CDCR18FC0000021, HSCEMS16F00054, HSCEMS17F00... | 2010-08-16 | 2023-06-30 | ... | NaN | RESTON | VA | VIRGINIA | 201911560 | 11 | 5713361618 | 5713361700 | 2018-12-06 01:30:13.766382 | 2018-12-06 |
31 | JACOBS ENGINEERING GROUP INC. | 074103508 | telecom, telecommunications, transmission, sys... | 2 | 3.640971e+07 | 3.772398e+07 | 1.314267e+06 | HSCETC16F00018, HSCOW3D0603 | 2018-04-06 | 2018-09-08 | ... | NaN | RESTON | VA | VIRGINIA | 201905380 | 11 | 7038964000 | NaN | 2018-12-06 01:30:13.766382 | 2018-12-06 |
32 | INTERNATIONAL BUSINESS MACHINES CORPORATION | 1.36808e+06 | software, adp, information, technology, support | 14 | 3.244281e+07 | 3.516806e+07 | 2.725246e+06 | 70CTD018FR0000377, HSCETC14F00043, HSCETC09J00... | 2011-09-15 | 2019-03-31 | ... | NaN | BETHESDA | MD | MARYLAND | 208171834 | 08 | 8003336705 | 8885200533 | 2018-12-06 01:30:13.766382 | 2018-12-06 |
33 | VISTA OUTDOOR SALES LLC | 079728394 | 30mm, ammunition, guns, 75mm, miscellaneous | 236 | 2.907978e+07 | 2.857832e+08 | 2.567034e+08 | 70CMSW18FR0000023, HSCEMS16J00074, 70CMSW18FR0... | 2016-03-22 | 2022-09-26 | ... | NaN | ANOKA | MN | MINNESOTA | 553036794 | 06 | 7637126056 | 7633232506 | 2018-12-06 01:30:13.766382 | 2018-12-06 |
34 | 360 IT INTEGRATED SOLUTIONS | 962373515 | telecom, telecommunications, facility, operati... | 2 | 2.885949e+07 | 3.252903e+07 | 3.669532e+06 | HSCETC16J00055, HSCETC14J00117 | 2018-09-21 | 2020-09-29 | ... | NaN | RESTON | VA | VIRGINIA | 201905014 | 11 | 8884580345 | 7034373709 | 2018-12-06 01:30:13.766382 | 2018-12-06 |
35 | DENALI HOLDING INC | 6.0184e+08 | software, information, technology, adp, compon... | 171 | 2.823319e+07 | 4.828102e+07 | 2.004783e+07 | HSCETE16J00107, HSCETE15J00642, HSCETE16J00069... | 2014-11-20 | 2020-08-25 | ... | NaN | ROUND ROCK | TX | TEXAS | 7.86827e+08 | 31 | 5.12729e+09 | 5.12283e+09 | 2018-12-06 01:30:13.766382 | 2018-12-06 |
36 | DTC COMMUNICATIONS INC. | 611639688 | equipment, recording, reproducing, video, sound | 27 | 2.762288e+07 | 3.817939e+07 | 1.055651e+07 | 70CMSD18FR0000267, 70CMSD18FR0000189, 70CMSD18... | 2016-10-25 | 2019-03-29 | ... | NaN | PINELLAS PARK | FL | FLORIDA | 337826132 | 13 | 7274716988 | 7274716773 | 2018-12-06 01:30:13.766382 | 2018-12-06 |
37 | NATIVE ENERGY & TECHNOLOGY INC. | 080778868 | government, maintenance, owned, operated, gogo | 1 | 2.706727e+07 | 5.421699e+07 | 2.714972e+07 | HSCEMR17F00041 | 2018-10-18 | 2027-06-05 | ... | NaN | SAN ANTONIO | TX | TEXAS | 782492239 | 20 | 2102316060 | 2102316098 | 2018-12-06 01:30:13.766382 | 2018-12-06 |
38 | XL ASSOCIATES INC. | 6.08717e+08 | support, management, professional, program, data | 19 | 2.683397e+07 | 3.480865e+07 | 7.974672e+06 | 70CMSW18FC0000104, HSCETC16J00047, HSCEMR15F00... | 2014-07-11 | 2023-07-14 | ... | NaN | VIENNA | VA | VIRGINIA | 221822264 | 11 | 7038480400 | 7032569508 | 2018-12-06 01:30:13.766382 | 2018-12-06 |
39 | MCKINSEY & COMPANY INC. | 001673920 | support, management, professional | 3 | 2.657078e+07 | 3.180656e+07 | 5.235783e+06 | HSCECR16F00004, HSCECR17F00003, HSCECR16F00001 | 2017-04-05 | 2019-03-20 | ... | NaN | WASHINGTON | DC | DISTRICT OF COLUMBIA | 200362412 | 00 | 2026621185 | 2026623256 | 2018-12-06 01:30:13.766382 | 2018-12-06 |
40 | METROPOLITAN INTERPRETERS & TRANSLATORS INC | 186945325 | support, administrative, translation, interpre... | 18 | 2.591282e+07 | 2.920811e+07 | 3.295290e+06 | HSCEMD17F00215, HSCEMD16F00142, HSCEMD15J00067... | 2010-09-14 | 2019-03-16 | ... | NaN | NEW YORK | NY | NEW YORK | 100178538 | 12 | 2136734710 | 2136734711 | 2018-12-06 01:30:13.766382 | 2018-12-06 |
41 | UNIVERSAL LANGUAGE CORPORATION | 961404241 | support, administrative, translation, interpre... | 9 | 2.411732e+07 | 4.469079e+07 | 2.057347e+07 | HSCEMD09J00229, HSCEMD09J00240, HSCENV08JCL016... | 2007-09-10 | 2014-03-22 | ... | NaN | REGO PARK | NY | NEW YORK | 113742244 | 06 | 7189979369 | 7184590331 | 2018-12-06 01:30:13.766382 | 2018-12-06 |
42 | ACCENTURE INC. | 985015354 | telecom, cyber, security, data, backup | 3 | 2.380236e+07 | 3.929861e+07 | 1.549624e+07 | HSCETC16F00019, HSCETE14A00005, HSCETC15A00008 | 2017-04-27 | 2021-09-27 | ... | NaN | ALEXANDRIA | VA | VIRGINIA | 223141552 | 08 | 5407299926 | 5713662006 | 2018-12-06 01:30:13.766382 | 2018-12-06 |
43 | PROCENTRIX INC. | 626154640 | telecom, systems, development, strategy, archi... | 10 | 2.337984e+07 | 2.365943e+07 | 2.795936e+05 | HSCETC17F00005, HSCETC16J00060, HSCETC15J00048... | 2013-03-13 | 2019-05-31 | ... | NaN | HERNDON | VA | VIRGINIA | 201714583 | 11 | 7036352627 | 7036352629 | 2018-12-06 01:30:13.766382 | 2018-12-06 |
44 | DYNAMIS INC. | 809429728 | support, professional, program, management | 3 | 2.199776e+07 | 2.604925e+07 | 4.051484e+06 | 70CMSD18FR0000184, 70CMSD18FR0000217, HSCEMD13... | 2018-08-23 | 2023-09-16 | ... | NaN | FAIRFAX | VA | VIRGINIA | 220314504 | 11 | 7034654400 | 7034654488 | 2018-12-06 01:30:13.766382 | 2018-12-06 |
45 | MOTOROLA INC | 1.32546e+06 | equipment, communication, radio, television, a... | 54 | 2.193658e+07 | 2.415277e+07 | 2.216185e+06 | 70CMSD18P00000143, HSCEMS17J00177, HSCEMS16J00... | 2010-10-18 | 2023-08-31 | ... | NaN | LINTHICUM HEIGHTS | MD | MARYLAND | 210902535 | 03 | 4107126545 | 4107126501 | 2018-12-06 01:30:13.766382 | 2018-12-06 |
46 | HARRIS CORPORATION | 4.20334e+06 | equipment, countermeasures, electronic, counte... | 17 | 2.127281e+07 | 2.130907e+07 | 3.626050e+04 | HSCEMD17P00004, 70CMSD18P00000076, 70CMSD18P00... | 2012-10-12 | 2022-10-02 | ... | NaN | PALM BAY | FL | FLORIDA | 329053377 | 08 | 3217276361 | NaN | 2018-12-06 01:30:13.766382 | 2018-12-06 |
47 | JAVA PRODUCTIONS INC. | 143692858 | support, professional, program, management | 1 | 2.090808e+07 | 2.243571e+07 | 1.527626e+06 | HSCEMD14F00041 | 2018-07-10 | 2019-10-29 | ... | NaN | BLACKSBURG | VA | VIRGINIA | 240605130 | 09 | 5408183098 | 5403013615 | 2018-12-06 01:30:13.766382 | 2018-12-06 |
48 | K'OYITL'OTS'INA LIMITED | 9.43393e+08 | facilities, government, housekeeping, support,... | 30 | 1.949453e+07 | 4.086089e+07 | 2.136636e+07 | 70CMSW18FR0000015, HSCEMS16C00006, HSCEMS16C00... | 2012-05-10 | 2023-02-28 | ... | NaN | FAIRBANKS | AK | ALASKA | 997094175 | 00 | 9074528119 | 9074528109 | 2018-12-06 01:30:13.766382 | 2018-12-06 |
49 | ENGILITY CORPORATION | 7.83838e+08 | support, professional, program, management, legal | 18 | 1.874999e+07 | 1.874999e+07 | 0.000000e+00 | HSCEMD16FAF001, HSCEMS14F00057, HSCEMD12F00018... | 2013-02-22 | 2016-09-30 | ... | NaN | ASHBURN | VA | VIRGINIA | 201475060 | 10 | 5712918900 | 5712918957 | 2018-12-06 01:30:13.766382 | 2018-12-06 |
50 rows × 21 columns
There are still a lot of duplicates, future work will be fuzzy matching these accounts.
Let's save this as a csv.
outfile = 'ice_data/data_out/ice_prime_contractors_aggregated_clean.csv'
out_order.to_csv(outfile, index=False)
import math
millnames = ['',' K',' M',' B',' T']
def millify(n):
if not np.isnan(n):
n = float(n)
millidx = max(0,min(len(millnames)-1,
int(math.floor(0 if n == 0 else math.log10(abs(n))/3))))
return '{:.2f}{}'.format(n / 10**(3 * millidx), millnames[millidx])
return np.nan
# to make it more human, smh...
for col in ['Current Total Value Of Awards Usd', 'Potential Total Value Of Awards Usd']:
out_order[col] = out_order[col].apply(millify)
outfile = 'ice_data/data_out/ice_prime_contractors_aggregated_clean_numbers.csv'
out_order.to_csv(outfile, index=False)
We can plot the top paid contractors from past awards, and the contractors who have the most valuable current projects.
ax = output.sort_values(by=['current_total_value_of_awards_USD'], ascending=False) \
.head(30).plot(x='recipient_parent_name',
y='current_total_value_of_awards_USD',
color='orange',
kind='barh',
logx=True,
figsize=(8,12),
title= "30 most valuable ICE contractors from since 2006",
legend=False)
# annotate each bar
for rect in ax.patches:
width = rect.get_width()
ax.text(1.2*rect.get_width(), rect.get_y()+ 0.5 * rect.get_height(),
'$' + millify(width),
ha='center', va='center')
# remove boarder
for spine in plt.gca().spines.values():
spine.set_visible(False)
# remoce axis titles
ax.set_xlabel("")
ax.set_ylabel("")
# remove ticks
ax.tick_params(axis=u'both', which=u'both',length=0)
ax.get_xaxis().set_ticks([])
ax.invert_yaxis();
ax = output.sort_values(by=['remaining_total_value_of_awards_USD'], ascending=False) \
.head(30).plot(x='recipient_parent_name',
y='remaining_total_value_of_awards_USD',
color='orange',
kind='barh',
logx=True,
figsize=(8,12),
title= "30 most valuable ICE contractors in progress",
legend=False)
# annotate each bar
for rect in ax.patches:
width = rect.get_width()
ax.text(1.32*rect.get_width(), rect.get_y()+ 0.5 * rect.get_height(),
'$' + millify(width),
ha='center', va='center')
# remove boarder
for spine in plt.gca().spines.values():
spine.set_visible(False)
# remoce axis titles
ax.set_xlabel("")
ax.set_ylabel("")
# remove ticks
ax.tick_params(axis=u'both', which=u'both',length=0)
ax.get_xaxis().set_ticks([])
ax.invert_yaxis();