ICE Contractor Analysis

By Leon Yin Last Updated 2018-12-06

View this notebook in NBViewer or Github. The output is availalble publically on Google Drive and Github.

Following my analysis of federal contracts to the private prisons CoreCivic and Geo Group, this notebook analyzes publically avaiable contracts from the Immigration and Customs Enforcement Agency (ICE). This analysis uses data downloaded from the USASpending.gov dashboard, specifically I filtered Funding Agency with "U.S. Immigration and Customs Enforcement (ICE) | Sub-Agency".

In [1]:
%matplotlib inline
import glob
import datetime
import string
from collections import Counter

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from nltk.corpus import stopwords
from fuzzywuzzy import fuzz
/anaconda3/lib/python3.6/site-packages/fuzzywuzzy/fuzz.py:35: UserWarning: Using slow pure-python SequenceMatcher. Install python-Levenshtein to remove this warning
  warnings.warn('Using slow pure-python SequenceMatcher. Install python-Levenshtein to remove this warning')
In [2]:
# data downloaded from USAspending.gov
input_file_pattern = 'ice_data/data_in/all_prime_awards_subawards_*/*'
In [3]:
files = glob.glob(input_file_pattern)
files
Out[3]:
['ice_data/data_in/all_prime_awards_subawards_20181206050302/all_contracts_subawards_1.csv',
 'ice_data/data_in/all_prime_awards_subawards_20181206050302/all_contracts_prime_awards_1.csv',
 'ice_data/data_in/all_prime_awards_subawards_20181206050302/all_assistance_prime_awards_1.csv',
 'ice_data/data_in/all_prime_awards_subawards_20181206050302/all_assistance_subawards_1.csv']

Let's just look at prime contracts today.

In [4]:
df_prime_contracts = pd.read_csv(files[1])
/anaconda3/lib/python3.6/site-packages/IPython/core/interactiveshell.py:2698: DtypeWarning: Columns (13,32,40,42,43,50,72,73,81,100,128,129) have mixed types. Specify dtype option on import or set low_memory=False.
  interactivity=interactivity, compiler=compiler, result=result)

Something to note here, if we don't filter out funding agency we get some inaccuracies!

In [5]:
{
    'awarding_sub_agency_name' : 'U.S. IMMIGRATION AND CUSTOMS ENFORCEMENT', # N = 15056
    'funding_sub_agency_name' : 'U.S. IMMIGRATION AND CUSTOMS ENFORCEMENT', # N = 17145
}
Out[5]:
{'awarding_sub_agency_name': 'U.S. IMMIGRATION AND CUSTOMS ENFORCEMENT',
 'funding_sub_agency_name': 'U.S. IMMIGRATION AND CUSTOMS ENFORCEMENT'}
In [6]:
# here we're filtering only when ICE awards a contract
df_prime_contracts = df_prime_contracts[
    df_prime_contracts['awarding_sub_agency_name'] == 'U.S. IMMIGRATION AND CUSTOMS ENFORCEMENT'
]

len(df_prime_contracts)
Out[6]:
15056
In [7]:
# there are a lot of columns!
prime_cols = df_prime_contracts.columns
num_cols = len(prime_cols)
print(f"There are {num_cols} columns in this file!\nHere are some of the column names:")
prime_cols.values.tolist()[:20]
There are 249 columns in this file!
Here are some of the column names:
Out[7]:
['award_id_piid',
 'parent_award_agency_id',
 'parent_award_agency_name',
 'parent_award_id',
 'obligated_amount',
 'current_total_value_of_award',
 'potential_total_value_of_award',
 'period_of_performance_start_date',
 'period_of_performance_current_end_date',
 'period_of_performance_potential_end_date',
 'ordering_period_end_date',
 'awarding_agency_code',
 'awarding_agency_name',
 'awarding_sub_agency_code',
 'awarding_sub_agency_name',
 'awarding_office_code',
 'awarding_office_name',
 'funding_agency_code',
 'funding_agency_name',
 'funding_sub_agency_code']

That is a lot of columns, here are the definitions of the columns we'll be using here:

Glossary of fields used in this Analysis with Definitions from USASpending:

recipient_duns
The unique identification number for the ultimate parent of an awardee or recipient. Currently the identifier is the 9-digit number maintained by Dun & Bradstreet as the global parent DUNS® number.

potential_total_value_of_award
The total amount that could be obligated on a contract. This total includes the base plus options amount. For example, if a recipient is awarded 10M on a base contract with 3 option years at 1M each, the potential award amount is 13M.

current_total_value_of_award
The amount of money that the government has promised (obligated) to pay a recipient for a contract. This means the base amount and any exercised options.

period_of_performance_start_date
The date that the award begins.

period_of_performance_potential_end_date
For procurement, the date on which, for the award referred to by the action being reported if all potential pre-determined or pre-negotiated options were exercised, awardee effort is completed or the award is otherwise ended. Administrative actions related to this award may continue to occur after this date. This date does not apply to procurement indefinite delivery vehicles under which definitive orders may be awarded.

funding_sub_agency_name
A Funding Agency pays for the majority of funds for an award out of its budget. Typically, the Funding Agency is the same as the Awarding Agency. In some cases, one agency will administer an award (Awarding Agency) and another agency will pay for it (Funding Agency).

awarding_sub_agency_name
The Awarding Sub Agency is the sub agency that issues and administers the award. For example, the Internal Revenue Service (IRS) is a sub agency of the Department of the Treasury.


Thanks for providing that glossary USASpending, let's do some minor data wrangling and get some summary statistics.

In [8]:
# convert columns to datetime
date_cols = ['period_of_performance_start_date', 'period_of_performance_potential_end_date']
for col in date_cols:
    df_prime_contracts[col] = pd.to_datetime(df_prime_contracts[col])
In [9]:
# aggregate contract values
total_spend = df_prime_contracts.drop_duplicates(subset=['award_id_piid']) \
                                .current_total_value_of_award.sum()
potential_spend = df_prime_contracts.drop_duplicates(subset=['award_id_piid']) \
                                    .potential_total_value_of_award.sum()
at_stake =  potential_spend - total_spend

num_contractors = len(df_prime_contracts.recipient_duns.unique())
first_contract = df_prime_contracts.period_of_performance_start_date.min().strftime('%Y-%m-%d')

print("ICE has invested ${:,.2f} in contracts with {} Prime contractors "
      "since {}.\nICE has ${:,.2f} in potential payouts with active contracts.".format(
          total_spend, num_contractors, first_contract, at_stake))
ICE has invested $6,959,189,462.00 in contracts with 3134 Prime contractors since 2006-10-01.
ICE has $10,616,424,251.18 in potential payouts with active contracts.

Let's sort this data by the contract end date

In [10]:
df_prime_contracts.sort_values(by=['period_of_performance_potential_end_date'], 
                               ascending=False, 
                               inplace=True)

There are typos in the recipient names, so let's use their DUNs IDs for aggregate analysis.
We can create a lookup_table containing the latest name and contact info

In [11]:
lookup_cols = [
    'recipient_parent_duns',
    'recipient_parent_name', 
    'recipient_address_line_1', 
    'recipient_address_line_2',
    'recipient_city_name', 
    'recipient_state_code', 
    'recipient_state_name',
    'recipient_zip_4_code', 
    'recipient_congressional_district',
    'recipient_phone_number', 
    'recipient_fax_number'
]
In [12]:
lookup_table = df_prime_contracts.drop_duplicates(
    subset = ['recipient_parent_duns'],
    keep = 'first'
)[lookup_cols]

There are contractors with similar names (and different DUNS), we can find similar names using fuzzywuzzy, and replace them for aggregation.

In [13]:
from fuzzywuzzy import fuzz
In [14]:
look_up_names = lookup_table[~lookup_table.recipient_parent_name.isnull()].recipient_parent_name.unique()
look_up_names.sort()

Here we will compare all names, and create a standardized dictionary (switch_name) to replace similar names.

In [15]:
window_size = 10 # look at to words to the right and left
threshold = 90   # how similar is enough?

skip = []
switch_name = dict()
for i, comp in enumerate(look_up_names):
    if comp not in skip:
        window = np.concatenate((look_up_names[i-window_size: i], 
                                 look_up_names[i+1: i+window_size]))
        for to_search in window:
            similarity_score = fuzz.token_sort_ratio(comp, to_search)
            if similarity_score >= threshold:
                switch_name[comp] = to_search
                skip += [to_search]

Let's swamp out the fuzzy names!

In [16]:
df_prime_contracts.recipient_parent_name.replace(switch_name, 
                                                 inplace = True)
In [17]:
lookup_table.recipient_parent_name.replace(switch_name, 
                                           inplace = True)

lookup_table.drop_duplicates(subset = ['recipient_parent_name'],
                             keep = 'first',
                             inplace = True)

Now let's get how much each company has been rewarded, how much they have the potential to be rewarded, how many awards they're been granted, and when their contracts are finished. This code is chained, and kind of messy. Each line aggreagates by the recipient ID, and performs some sort of agg function, and then sorts the values according to the total of the agg function.

In [18]:
contract_amounts = df_prime_contracts.groupby('recipient_parent_name') \
                                     .current_total_value_of_award \
                                     .sum() \
                                     .sort_values(ascending=False)
In [19]:
contract_potential = df_prime_contracts.groupby('recipient_parent_name') \
                                       .potential_total_value_of_award \
                                       .sum()
In [20]:
num_awards = df_prime_contracts.groupby('recipient_parent_name') \
                               .award_id_piid.nunique()
In [21]:
award_ids = df_prime_contracts.groupby('recipient_parent_name') \
                              .award_id_piid\
                              .unique().str.join(', ')
In [22]:
contract_enddate = df_prime_contracts[~df_prime_contracts.period_of_performance_potential_end_date.isnull()] \
                       .groupby('recipient_parent_name') \
                       .period_of_performance_potential_end_date.max()
In [23]:
contract_startdate = df_prime_contracts[~df_prime_contracts.period_of_performance_start_date.isnull()]\
                         .groupby('recipient_parent_name') \
                         .period_of_performance_start_date \
                         .min()

For context, let's see what kinds of services each of these companies provide by counting the most frequent words in service descriptions per contractor.

In [24]:
def count_words(row, counter, sw):
    '''
    Splits sentences, casts words to lowercase, and counts all words that aren't in stopwords (sw)
    '''
    for c in string.punctuation:
        row = row.replace(c, " ")
    counter.update([w for w in row.lower().split() if w not in sw])
In [25]:
sw = stopwords.words('English')
In [26]:
top_words = []
for company_, df_ in df_prime_contracts.groupby('recipient_parent_name'):
    word_counter_ = Counter()
    df_['product_or_service_code_description'].apply(count_words, 
                                                     args = (word_counter_, sw,))
    # get top 5 words
    top_words_ = ', '.join([w for w, c in word_counter_.most_common(5)])
    
    # create a dcitonary, and add it to the list
    d_ = dict(top_words_from_serv_desc = top_words_,
              recipient_parent_name = company_)
    
    top_words.append(d_)
    
top_words_per_service_description = pd.DataFrame(top_words)

We can join them all together using this craziness below:

In [27]:
output = (
    contract_amounts.reset_index()
        .merge(
          num_awards.reset_index(), 
          on='recipient_parent_name',
        )
        .merge(
          contract_potential.reset_index(), 
          on='recipient_parent_name'
        )
        .merge(
          contract_enddate.reset_index(), 
          on='recipient_parent_name'
        )
        .merge(
          contract_startdate.reset_index(), 
          on='recipient_parent_name'
        )
        .merge(
          top_words_per_service_description, 
          on='recipient_parent_name'
        )
        .merge(
          award_ids.reset_index(),
          on='recipient_parent_name'
        )
        .merge(
          lookup_table,
          on='recipient_parent_name'
        )
    )

The last merge above is the look_up table, containing metadata for each recipient.
Let's change some of the aggregate column names to be more decriptive:

In [28]:
remapping = {
    'current_total_value_of_award' : 'current_total_value_of_awards_USD',
    'potential_total_value_of_award' : 'potential_total_value_of_awards_USD',
    'period_of_performance_potential_end_date' : 'last_contract_end_date',
    'period_of_performance_start_date' : 'first_contract_start_date',
    'award_id_piid_x' : 'number_of_prime_awards',
    'award_id_piid_y' : 'award_id_piids'
}
In [29]:
output.columns = [remapping.get(c, c) for c in output.columns]
In [30]:
# convert datetimes
for col in ['last_contract_end_date', 'first_contract_start_date']:
    output[col] = output[col].dt.strftime('%Y-%m-%d')
In [31]:
# calculate how much money in on the line for each contractor
output['remaining_total_value_of_awards_USD'] = output.potential_total_value_of_awards_USD - output.current_total_value_of_awards_USD

Let's timestamp when this file was made.

In [32]:
output['analysis_date'] = datetime.datetime.now() # when was this file made?
output['raw_data_download_date'] = datetime.datetime(2018,12,6) # when was raw data downloaded?

And lastly, let's re-order the columns so they're readable

In [33]:
order_cols = [
    'recipient_parent_name',
    'recipient_parent_duns',
    'top_words_from_serv_desc',
    'number_of_prime_awards',
    'current_total_value_of_awards_USD',
    'potential_total_value_of_awards_USD',
    'remaining_total_value_of_awards_USD',
    'award_id_piids',
    'first_contract_start_date',
    'last_contract_end_date',
    'recipient_address_line_1',
    'recipient_address_line_2',
    'recipient_city_name', 
    'recipient_state_code', 
    'recipient_state_name',
    'recipient_zip_4_code',
    'recipient_congressional_district',
    'recipient_phone_number', 
    'recipient_fax_number',
    'analysis_date',
    'raw_data_download_date',
]

# make these human readible
new_cols = [' '.join(c.split('_')).title() for c in order_cols]

Here is the output:

In [34]:
out_order = output[order_cols]
out_order.columns = new_cols
In [35]:
out_order.head(50)
Out[35]:
Recipient Parent Name Recipient Parent Duns Top Words From Serv Desc Number Of Prime Awards Current Total Value Of Awards Usd Potential Total Value Of Awards Usd Remaining Total Value Of Awards Usd Award Id Piids First Contract Start Date Last Contract End Date ... Recipient Address Line 2 Recipient City Name Recipient State Code Recipient State Name Recipient Zip 4 Code Recipient Congressional District Recipient Phone Number Recipient Fax Number Analysis Date Raw Data Download Date
0 SCIENCE APPLICATIONS INTERNATIONAL CORPORATION 078745061 telecom, systems, development, analysis, house... 8 5.322573e+08 7.048101e+08 1.725527e+08 HSCETC09J00034, HSCETC10J00005, HSCOW5D0056, H... 2015-06-04 2015-09-24 ... NaN MCLEAN VA VIRGINIA 22102 NaN 7036765681 7036765063 2018-12-06 01:30:13.766382 2018-12-06
1 THE GEO GROUP INC 612706465 guard, housekeeping, services, support, profes... 56 5.295999e+08 2.641455e+09 2.111855e+09 70CDCR18FR0000045, HSCEDM11J00054, 70CDCR18FR0... 2007-06-01 2029-10-11 ... NaN BOCA RATON FL FLORIDA 334878242 22 5619997359 5619997648 2018-12-06 01:30:13.766382 2018-12-06
2 CSI AVIATION SERVICES INC. 102793460 travel, air, passenger, transportation, reloca... 15 5.213086e+08 6.580096e+08 1.367010e+08 HSCECR14F00009, HSCECR14F00010, HSCECR14F00015... 2009-10-05 2018-06-30 ... NaN ALBUQUERQUE NM NEW MEXICO 871072876 01 5057619000 5053427377 2018-12-06 01:30:13.766382 2018-12-06
3 INGENESIS INC. 019992424 medical, general, health, care 2 3.335133e+08 4.491412e+08 1.156279e+08 70CDCR18C00000002, HSCECR12F00048 2018-08-06 2018-11-15 ... NaN SAN ANTONIO TX TEXAS 782174430 21 2103660033 2105684582 2018-12-06 01:30:13.766382 2018-12-06
4 M V M INC. 038049532 support, housekeeping, guard, administrative, ... 38 2.611629e+08 1.086340e+09 8.251771e+08 70CMSW18FR0000058, HSCEMS17F00040, HSCEDM14J00... 2010-09-20 2023-01-31 ... NaN ASHBURN VA VIRGINIA 201476063 10 5712234620 5712234487 2018-12-06 01:30:13.766382 2018-12-06
5 GENERAL DYNAMICS CORPORATION 001381284 support, ammunition, 30mm, equipment, professi... 52 2.480584e+08 4.923976e+08 2.443392e+08 HSCEMD16F00117, HSCETC16J00002, HSCEMD15F00113... 2011-04-07 2021-09-24 ... NaN CHANTILLY VA VIRGINIA 201513848 10 7038761243 7038761243 2018-12-06 01:30:13.766382 2018-12-06
6 NANA REGIONAL CORPORATION INC 079253761 housekeeping, guard 13 2.451490e+08 1.172830e+09 9.276814e+08 HSCEDM15J00019, 70CDCR18FR0000051, 70CDCR18FR0... 2016-06-21 2024-04-30 ... NaN ANCHORAGE AK ALASKA 99502 00 7037666837 7037666906 2018-12-06 01:30:13.766382 2018-12-06
7 CORRECTIONS CORPORATION OF AMERICA 159734151 guard, housekeeping, support, services, profes... 32 2.415587e+08 6.227455e+08 3.811868e+08 70CDCR18FR0000058, 70CDCR18FR0000017, 70CDCR18... 2010-04-29 2019-06-30 ... NaN NASHVILLE TN TENNESSEE 372156105 05 6152636713 6152633090 2018-12-06 01:30:13.766382 2018-12-06
8 SPECTRUM SECURITY SERVICES INC. 6.18954e+08 guard, housekeeping, services 13 1.927423e+08 3.628039e+08 1.700616e+08 70CDCR18FC0000013, 70CDCR18FC0000012, 70CDCR18... 2009-12-09 2019-10-31 ... NaN JAMUL CA CALIFORNIA 9.19353e+08 50 NaN NaN 2018-12-06 01:30:13.766382 2018-12-06
9 DELOITTE LLP 014127109 support, management, professional, program, fi... 83 1.779484e+08 3.771559e+08 1.992075e+08 70CMSW18FR0000103, 70CMSW18FC0000097, HSCEMS17... 2011-08-29 2023-09-29 ... NaN ARLINGTON VA VIRGINIA 222091742 08 5718580176 7038426748 2018-12-06 01:30:13.766382 2018-12-06
10 AHTNA INCORPORATED 6.95861e+07 housekeeping, guard, maint, rep, alt 9 1.421123e+08 4.689208e+08 3.268086e+08 70CDCR19FR0000003, 70CDCR18FR0000003, HSCEDM17... 2011-10-21 2019-11-30 ... NaN ANCHORAGE AK ALASKA 995035677 00 9073346477 9073346750 2018-12-06 01:30:13.766382 2018-12-06
11 CLASSIC AIR CHARTER INC. 9.61705e+08 travel, transportation, relocation, lodging, r... 1 1.420237e+08 6.518740e+08 5.098503e+08 70CDCR18FR0000002 2018-11-14 2023-06-30 ... NaN HUNTINGTON NY NEW YORK 1.17433e+08 03 6.3155e+09 NaN 2018-12-06 01:30:13.766382 2018-12-06
12 TRAILBOSS ENTERPRISES INC. 879740454 housekeeping, guard, facilities, operations, s... 6 1.187095e+08 5.630046e+08 4.442951e+08 70CDCR18FR0000159, 70CDCR18FR0000075, 70CDCR18... 2018-07-18 2019-09-27 ... NaN ANCHORAGE AK ALASKA 995012503 00 9073388243 9073380869 2018-12-06 01:30:13.766382 2018-12-06
13 BOOZ ALLEN HAMILTON HOLDING CORPORATION 964725688 support, professional, management, telecom, pr... 23 1.099139e+08 3.423611e+08 2.324473e+08 70CTD018FR0000193, HSCEMD17F00080, HSCETC15F00... 2013-02-05 2024-03-06 ... NaN MCLEAN VA VIRGINIA 221023830 11 7033770195 7039023200 2018-12-06 01:30:13.766382 2018-12-06
14 ASSET PROTECTION & SECURITY SERVICES L.P. 009741828 housekeeping, guard, oper, govt, misc 11 9.983111e+07 3.326429e+08 2.328118e+08 HSCEDM16J00022, HSCEDM16J00018, HSCEDM15J00018... 2012-02-15 2018-12-28 ... NaN CORPUS CHRISTI TX TEXAS 784133787 27 3619061552 3619061844 2018-12-06 01:30:13.766382 2018-12-06
15 GLOBAL PRECISION SYSTEMS LLC 805596710 housekeeping, guard, tableware 9 9.716002e+07 4.184622e+08 3.213022e+08 70CDCR18FR0000134, 70CDCR18FR0000061, HSCEDM17... 2017-05-15 2022-09-29 ... NaN NOME AK ALASKA 99762 00 9073348308 9075632742 2018-12-06 01:30:13.766382 2018-12-06
16 STG INTERNATIONAL INC. 179570403 support, professional, program, management, ad... 25 8.962683e+07 4.524422e+08 3.628154e+08 70CDCR18C00000003, HSCEMS13F00071, HSCEMD13F00... 2009-07-16 2023-06-28 ... NaN ALEXANDRIA VA VIRGINIA 223145504 08 7035786030 5712556545 2018-12-06 01:30:13.766382 2018-12-06
17 SAVANTAGE FINANCIAL SERVICES INC. 878187012 telecom, telecommunications 2 8.604917e+07 1.302544e+08 4.420525e+07 HSCETC17F00002, HSCETC12F00013 2017-11-22 2020-12-31 ... NaN ROCKVILLE MD MARYLAND 208504339 08 NaN NaN 2018-12-06 01:30:13.766382 2018-12-06
18 PALANTIR TECHNOLOGIES INC. 3.62131e+08 data, telecom, information, broadcasting, dist... 6 7.103665e+07 1.696240e+08 9.858732e+07 70CTD019C00000001, HSCETC15C00001, HSCETC13F00... 2013-09-13 2021-11-27 ... NaN PALO ALTO CA CALIFORNIA 9.43012e+08 18 6.50815e+09 6.50618e+09 2018-12-06 01:30:13.766382 2018-12-06
19 DEV TECHNOLOGY GROUP INC. 105667401 telecom, services, programming, systems, devel... 15 6.771468e+07 1.370574e+08 6.934275e+07 HSCETC16J00057, HSCETC16J00026, HSCETC17J00004... 2012-06-28 2022-02-28 ... NaN RESTON VA VIRGINIA 201905041 11 7038608135 7038608138 2018-12-06 01:30:13.766382 2018-12-06
20 PHACIL INC. 054885111 telecom, help, desk 1 6.757944e+07 6.969906e+07 2.119618e+06 HSCETC12F00014 2018-09-05 2019-03-09 ... NaN MCLEAN VA VIRGINIA 221023590 11 7035261852 7037377100 2018-12-06 01:30:13.766382 2018-12-06
21 PERSPECTA INC. 081218565 telecom, data, centers, storage, support 3 6.156383e+07 8.597697e+07 2.441314e+07 HSCETE15J00227, 70CMSD18FC0000012, HSCETC08F00013 2018-09-21 2020-06-26 ... NaN HERNDON VA VIRGINIA 201713299 11 7037333793 7037422674 2018-12-06 01:30:13.766382 2018-12-06
22 CACI INTERNATIONAL INC 045534641 support, professional, miscellaneous, communic... 4 5.923109e+07 9.455504e+07 3.532395e+07 HSCEMS16J00065, HSCEMD14FAF003, HSCEMD10F00260... 2014-04-14 2021-03-17 ... NaN CHANTILLY VA VIRGINIA 201512218 10 7036794137 7036793402 2018-12-06 01:30:13.766382 2018-12-06
23 WIDEPOINT CORPORATION 788549012 support, professional, program, management 2 5.494329e+07 2.127748e+08 1.578315e+08 HSCETE15F00007, HSCETE15F00002 2017-03-23 2018-12-08 ... NaN MC LEAN VA VIRGINIA 221023371 11 7033495644 7038483560 2018-12-06 01:30:13.766382 2018-12-06
24 MAXIM HEALTHCARE SERVICES INC. 606290401 medical, general, health, care 3 5.416702e+07 5.416702e+07 0.000000e+00 HSCEDM17C00002, HSCEDM15C00004, HSCEDM14C00005 2016-08-05 2018-11-22 ... NaN COLUMBIA MD MARYLAND 210463236 03 4109101553 4109101615 2018-12-06 01:30:13.766382 2018-12-06
25 NET DIRECT SYSTEMS LLC 5.40078e+07 equipment, support, information, technology, adp 931 4.864500e+07 5.851272e+07 9.867719e+06 70CTD018FC0000010, HSCETE17J00388, HSCETE17J00... 2010-03-22 2018-12-13 ... NaN APEX NC NORTH CAROLINA 2.75396e+08 04 NaN NaN 2018-12-06 01:30:13.766382 2018-12-06
26 THE MITRE CORPORATION 007872690 development, support, engineering, r, research 12 4.857490e+07 4.913481e+07 5.599040e+05 70CMSW18FR0000119, HSCETC16J00004, HSCETC16J00... 2010-08-31 2019-09-28 ... NaN MC LEAN VA VIRGINIA 221027539 08 7039836474 7039837121 2018-12-06 01:30:13.766382 2018-12-06
27 GRANT THORNTON INTERNATIONAL LTD 3.47201e+08 support, management, professional, program, fi... 24 4.768990e+07 1.035107e+08 5.582075e+07 70CMSD18FC0000060, 70CMSW18FC0000076, 70CMSD18... 2011-07-11 2024-03-09 ... NaN ALEXANDRIA VA VIRGINIA 223145767 08 NaN NaN 2018-12-06 01:30:13.766382 2018-12-06
28 KNIGHT POINT SYSTEMS LLC 622594492 telecom, help, desk, support, telecommunications 4 4.439087e+07 7.911528e+07 3.472440e+07 HSCETC16J00050, 70CTD018FR0000058, HSCETC14J00... 2014-07-24 2021-06-29 ... NaN RESTON VA VIRGINIA 201905109 11 7036577051 5712663106 2018-12-06 01:30:13.766382 2018-12-06
29 U.S. BANCORP 006213482 support, management, financial 2 4.063902e+07 5.850522e+07 1.786619e+07 70CTD018C00000001, HSCEMD12C00003 2018-03-23 2023-04-27 ... NaN MINNEAPOLIS MN MINNESOTA 554027000 05 2022610806 2022610800 2018-12-06 01:30:13.766382 2018-12-06
30 CAP GEMINI 382014330 support, professional, program, management, se... 13 3.812979e+07 8.017269e+07 4.204290e+07 70CDCR18FC0000021, HSCEMS16F00054, HSCEMS17F00... 2010-08-16 2023-06-30 ... NaN RESTON VA VIRGINIA 201911560 11 5713361618 5713361700 2018-12-06 01:30:13.766382 2018-12-06
31 JACOBS ENGINEERING GROUP INC. 074103508 telecom, telecommunications, transmission, sys... 2 3.640971e+07 3.772398e+07 1.314267e+06 HSCETC16F00018, HSCOW3D0603 2018-04-06 2018-09-08 ... NaN RESTON VA VIRGINIA 201905380 11 7038964000 NaN 2018-12-06 01:30:13.766382 2018-12-06
32 INTERNATIONAL BUSINESS MACHINES CORPORATION 1.36808e+06 software, adp, information, technology, support 14 3.244281e+07 3.516806e+07 2.725246e+06 70CTD018FR0000377, HSCETC14F00043, HSCETC09J00... 2011-09-15 2019-03-31 ... NaN BETHESDA MD MARYLAND 208171834 08 8003336705 8885200533 2018-12-06 01:30:13.766382 2018-12-06
33 VISTA OUTDOOR SALES LLC 079728394 30mm, ammunition, guns, 75mm, miscellaneous 236 2.907978e+07 2.857832e+08 2.567034e+08 70CMSW18FR0000023, HSCEMS16J00074, 70CMSW18FR0... 2016-03-22 2022-09-26 ... NaN ANOKA MN MINNESOTA 553036794 06 7637126056 7633232506 2018-12-06 01:30:13.766382 2018-12-06
34 360 IT INTEGRATED SOLUTIONS 962373515 telecom, telecommunications, facility, operati... 2 2.885949e+07 3.252903e+07 3.669532e+06 HSCETC16J00055, HSCETC14J00117 2018-09-21 2020-09-29 ... NaN RESTON VA VIRGINIA 201905014 11 8884580345 7034373709 2018-12-06 01:30:13.766382 2018-12-06
35 DENALI HOLDING INC 6.0184e+08 software, information, technology, adp, compon... 171 2.823319e+07 4.828102e+07 2.004783e+07 HSCETE16J00107, HSCETE15J00642, HSCETE16J00069... 2014-11-20 2020-08-25 ... NaN ROUND ROCK TX TEXAS 7.86827e+08 31 5.12729e+09 5.12283e+09 2018-12-06 01:30:13.766382 2018-12-06
36 DTC COMMUNICATIONS INC. 611639688 equipment, recording, reproducing, video, sound 27 2.762288e+07 3.817939e+07 1.055651e+07 70CMSD18FR0000267, 70CMSD18FR0000189, 70CMSD18... 2016-10-25 2019-03-29 ... NaN PINELLAS PARK FL FLORIDA 337826132 13 7274716988 7274716773 2018-12-06 01:30:13.766382 2018-12-06
37 NATIVE ENERGY & TECHNOLOGY INC. 080778868 government, maintenance, owned, operated, gogo 1 2.706727e+07 5.421699e+07 2.714972e+07 HSCEMR17F00041 2018-10-18 2027-06-05 ... NaN SAN ANTONIO TX TEXAS 782492239 20 2102316060 2102316098 2018-12-06 01:30:13.766382 2018-12-06
38 XL ASSOCIATES INC. 6.08717e+08 support, management, professional, program, data 19 2.683397e+07 3.480865e+07 7.974672e+06 70CMSW18FC0000104, HSCETC16J00047, HSCEMR15F00... 2014-07-11 2023-07-14 ... NaN VIENNA VA VIRGINIA 221822264 11 7038480400 7032569508 2018-12-06 01:30:13.766382 2018-12-06
39 MCKINSEY & COMPANY INC. 001673920 support, management, professional 3 2.657078e+07 3.180656e+07 5.235783e+06 HSCECR16F00004, HSCECR17F00003, HSCECR16F00001 2017-04-05 2019-03-20 ... NaN WASHINGTON DC DISTRICT OF COLUMBIA 200362412 00 2026621185 2026623256 2018-12-06 01:30:13.766382 2018-12-06
40 METROPOLITAN INTERPRETERS & TRANSLATORS INC 186945325 support, administrative, translation, interpre... 18 2.591282e+07 2.920811e+07 3.295290e+06 HSCEMD17F00215, HSCEMD16F00142, HSCEMD15J00067... 2010-09-14 2019-03-16 ... NaN NEW YORK NY NEW YORK 100178538 12 2136734710 2136734711 2018-12-06 01:30:13.766382 2018-12-06
41 UNIVERSAL LANGUAGE CORPORATION 961404241 support, administrative, translation, interpre... 9 2.411732e+07 4.469079e+07 2.057347e+07 HSCEMD09J00229, HSCEMD09J00240, HSCENV08JCL016... 2007-09-10 2014-03-22 ... NaN REGO PARK NY NEW YORK 113742244 06 7189979369 7184590331 2018-12-06 01:30:13.766382 2018-12-06
42 ACCENTURE INC. 985015354 telecom, cyber, security, data, backup 3 2.380236e+07 3.929861e+07 1.549624e+07 HSCETC16F00019, HSCETE14A00005, HSCETC15A00008 2017-04-27 2021-09-27 ... NaN ALEXANDRIA VA VIRGINIA 223141552 08 5407299926 5713662006 2018-12-06 01:30:13.766382 2018-12-06
43 PROCENTRIX INC. 626154640 telecom, systems, development, strategy, archi... 10 2.337984e+07 2.365943e+07 2.795936e+05 HSCETC17F00005, HSCETC16J00060, HSCETC15J00048... 2013-03-13 2019-05-31 ... NaN HERNDON VA VIRGINIA 201714583 11 7036352627 7036352629 2018-12-06 01:30:13.766382 2018-12-06
44 DYNAMIS INC. 809429728 support, professional, program, management 3 2.199776e+07 2.604925e+07 4.051484e+06 70CMSD18FR0000184, 70CMSD18FR0000217, HSCEMD13... 2018-08-23 2023-09-16 ... NaN FAIRFAX VA VIRGINIA 220314504 11 7034654400 7034654488 2018-12-06 01:30:13.766382 2018-12-06
45 MOTOROLA INC 1.32546e+06 equipment, communication, radio, television, a... 54 2.193658e+07 2.415277e+07 2.216185e+06 70CMSD18P00000143, HSCEMS17J00177, HSCEMS16J00... 2010-10-18 2023-08-31 ... NaN LINTHICUM HEIGHTS MD MARYLAND 210902535 03 4107126545 4107126501 2018-12-06 01:30:13.766382 2018-12-06
46 HARRIS CORPORATION 4.20334e+06 equipment, countermeasures, electronic, counte... 17 2.127281e+07 2.130907e+07 3.626050e+04 HSCEMD17P00004, 70CMSD18P00000076, 70CMSD18P00... 2012-10-12 2022-10-02 ... NaN PALM BAY FL FLORIDA 329053377 08 3217276361 NaN 2018-12-06 01:30:13.766382 2018-12-06
47 JAVA PRODUCTIONS INC. 143692858 support, professional, program, management 1 2.090808e+07 2.243571e+07 1.527626e+06 HSCEMD14F00041 2018-07-10 2019-10-29 ... NaN BLACKSBURG VA VIRGINIA 240605130 09 5408183098 5403013615 2018-12-06 01:30:13.766382 2018-12-06
48 K'OYITL'OTS'INA LIMITED 9.43393e+08 facilities, government, housekeeping, support,... 30 1.949453e+07 4.086089e+07 2.136636e+07 70CMSW18FR0000015, HSCEMS16C00006, HSCEMS16C00... 2012-05-10 2023-02-28 ... NaN FAIRBANKS AK ALASKA 997094175 00 9074528119 9074528109 2018-12-06 01:30:13.766382 2018-12-06
49 ENGILITY CORPORATION 7.83838e+08 support, professional, program, management, legal 18 1.874999e+07 1.874999e+07 0.000000e+00 HSCEMD16FAF001, HSCEMS14F00057, HSCEMD12F00018... 2013-02-22 2016-09-30 ... NaN ASHBURN VA VIRGINIA 201475060 10 5712918900 5712918957 2018-12-06 01:30:13.766382 2018-12-06

50 rows × 21 columns

There are still a lot of duplicates, future work will be fuzzy matching these accounts.
Let's save this as a csv.

In [36]:
outfile = 'ice_data/data_out/ice_prime_contractors_aggregated_clean.csv'
out_order.to_csv(outfile, index=False)
In [37]:
import math

millnames = ['',' K',' M',' B',' T']

def millify(n):
    if not np.isnan(n):
        n = float(n)
        millidx = max(0,min(len(millnames)-1,
                            int(math.floor(0 if n == 0 else math.log10(abs(n))/3))))

        return '{:.2f}{}'.format(n / 10**(3 * millidx), millnames[millidx])
    return np.nan
In [38]:
# to make it more human, smh...
for col in ['Current Total Value Of Awards Usd', 'Potential Total Value Of Awards Usd']:
    out_order[col] = out_order[col].apply(millify)
In [39]:
outfile = 'ice_data/data_out/ice_prime_contractors_aggregated_clean_numbers.csv'
out_order.to_csv(outfile, index=False)

Plotting Top Contractors

We can plot the top paid contractors from past awards, and the contractors who have the most valuable current projects.

In [40]:
ax = output.sort_values(by=['current_total_value_of_awards_USD'], ascending=False) \
           .head(30).plot(x='recipient_parent_name', 
                          y='current_total_value_of_awards_USD', 
                          color='orange',
                          kind='barh',
                          logx=True,
                          figsize=(8,12),
                          title= "30 most valuable ICE contractors from since 2006",
                          legend=False)
# annotate each bar
for rect in ax.patches: 
    width = rect.get_width()
    ax.text(1.2*rect.get_width(), rect.get_y()+ 0.5 * rect.get_height(),
            '$' + millify(width),
            ha='center', va='center')

# remove boarder    
for spine in plt.gca().spines.values():
    spine.set_visible(False)

# remoce axis titles
ax.set_xlabel("")
ax.set_ylabel("")

# remove ticks
ax.tick_params(axis=u'both', which=u'both',length=0)
ax.get_xaxis().set_ticks([])
ax.invert_yaxis();
In [41]:
ax = output.sort_values(by=['remaining_total_value_of_awards_USD'], ascending=False) \
           .head(30).plot(x='recipient_parent_name', 
                          y='remaining_total_value_of_awards_USD', 
                          color='orange',
                          kind='barh',
                          logx=True,
                          figsize=(8,12),
                          title= "30 most valuable ICE contractors in progress",
                          legend=False)
# annotate each bar
for rect in ax.patches: 
    width = rect.get_width()
    ax.text(1.32*rect.get_width(), rect.get_y()+ 0.5 * rect.get_height(),
            '$' + millify(width),
            ha='center', va='center')

# remove boarder    
for spine in plt.gca().spines.values():
    spine.set_visible(False)

# remoce axis titles
ax.set_xlabel("")
ax.set_ylabel("")

# remove ticks
ax.tick_params(axis=u'both', which=u'both',length=0)
ax.get_xaxis().set_ticks([])
ax.invert_yaxis();