ZBP to ZCTA

Retrieves data from the Census Bureau's ZIP Code Business Patterns API for a specific area and summarizes it by ZCTA. Three tables are generated: one for employees that contains employment, establishments, and wages, one for industries that contains counts of establishments by 2-digit sector NAICS codes, and one reference table that correlates sector numbers and names. Initial data retrieved from API is written to json, final output is written to a SQLite database.

https://www.census.gov/data/developers/data-sets/cbp-nonemp-zbp/zbp-api.html

Variables

In [1]:
import pandas as pd, requests, sqlite3, os, json
from IPython.display import clear_output
In [2]:
#Crosswalk files - update only if necessary
uszips_file='Zip_to_ZCTA_crosswalk_2015_JSI.csv'
zcta_file='geocorr14_modified.csv'

uszips_path=os.path.join('inputs',uszips_file)
zcta_path=zcta_file=os.path.join('inputs',zcta_file)

#Dump files for api data storage
ejsonpath=os.path.join('outputs', 'emp_data.json')
ijsonpath=os.path.join('outputs', 'ind_data.json')
cjsonpath=os.path.join('outputs', 'codes_data.json')

#API variables - UPDATE THE YEAR
keyfile='census_key.txt'

year='2016'
dsource='zbp'
state='36'
ecols='ESTAB,EMP,PAYQTR1,PAYANN'
icols='ESTAB'
ncodes=['00','11','21','22','23','31-33','42','44-45','48-49',
        '51','52','53','54','55','56','61','62','71','72','81',
        '99']

#SQL output - UPDATE EACH TABLE NAME
dbname=os.path.join('outputs','testdb.sqlite')
emptable='zbp2016emp'
indtable='zbp2016ind'
codetable='zbp2016indcodes'

Crosswalking

Read in the files that relate US ZIP codes to ZCTAs (from JSI) and ZCTAs to counties for local area (from MCDC Geocorr), then join them by ZIP Code to create a ZCTA to ZIP table for the local area

In [3]:
uszips=pd.read_csv(uszips_path, sep=',', dtype={'ZIP':str, 'ZCTA':str})
uszips.head()
Out[3]:
ZIP PO_NAME STATE ZIP_TYPE ZCTA
0 96916 Merizo GU Post Office or large volume customer 96916
1 96917 Inarajan GU Post Office or large volume customer 96917
2 96928 Agat GU Post Office or large volume customer 96928
3 96915 Santa Rita GU ZIP Code area 96915
4 96923 Mangilao GU Post Office or large volume customer 96913
In [4]:
#All ZIP Codes in US
uszips.shape
Out[4]:
(41270, 5)
In [5]:
zcta=pd.read_csv(zcta_path, sep=',', dtype={'zcta5':str, 'county14':str})
zcta.head()
Out[5]:
zcta5 county14 cntyname2 zipname pop10 afact
0 10001 36061 New York NY New York, NY 21102 1.0
1 10002 36061 New York NY New York, NY 81410 1.0
2 10003 36061 New York NY New York, NY 56024 1.0
3 10004 36061 New York NY New York, NY 3089 1.0
4 10005 36061 New York NY New York, NY 7135 1.0
In [6]:
#ZCTAs in local area
zcta.shape
Out[6]:
(214, 6)
In [7]:
#Merge ZIP Codes with ZCTAs for local area
zip2zcta = pd.merge(uszips[['ZIP','ZIP_TYPE','PO_NAME','ZCTA']],zcta[['zcta5','county14']],how='right', 
                    left_on='ZCTA', right_on='zcta5').set_index('ZIP')
zip2zcta.drop(columns=['ZCTA'],inplace=True)
zip2zcta.head()
Out[7]:
ZIP_TYPE PO_NAME zcta5 county14
ZIP
10307 ZIP Code area Staten Island 10307 36085
10309 ZIP Code area Staten Island 10309 36085
10312 ZIP Code area Staten Island 10312 36085
10308 ZIP Code area Staten Island 10308 36085
11697 ZIP Code area Breezy Point 11697 36081
In [8]:
#ZIP Codes in local area
zip2zcta.shape
Out[8]:
(317, 4)

API Call and Processing

Request the ZBP data from the Census Bureau for the state, join the ZBP data to the local ZCTA to ZIP table based on ZIP Code, and group the data by ZCTA

In [9]:
with open(keyfile) as key:
    api_key=key.read().strip()
In [10]:
base_url = f'https://api.census.gov/data/{year}/{dsource}'
base_url
Out[10]:
'https://api.census.gov/data/2016/zbp'

ZBP Employment Data

This data can be requested in a single api call - do not rerun the requests block if retrieval is successful but subsequent notebook changes are needed. Proceed to the next block and pull data from json dump file.

THIS BLOCK IS A REQUESTS BLOCK!

In [11]:
#Code 200 = success, do not rerun this block unless it's necessary
edata_url = f'{base_url}?get={ecols}&for=zipcode:*&ST={state}&key={api_key}'
response=requests.get(edata_url)
if response.status_code==200:
    emp_data=response.json()
    with open(ejsonpath, 'w') as f:
        json.dump(emp_data, f)
    print('Data dumped to json file')
else:
    print('Problem with retrieval, response code',resonse.status_code)
Data dumped to json file
In [12]:
with open(ejsonpath, 'r') as f:
    ejsondata=json.load(f)
zbpemp=pd.DataFrame(ejsondata[1:], columns=ejsondata[0]).rename(columns={'PAYQTR1':'PAYQ1','PAYANN':'PAYAN'}).set_index('zipcode')
zbpemp.drop(columns=['ST'],inplace=True)
for field in zbpemp.columns:
    zbpemp=zbpemp.astype(dtype={field:'int64'},inplace=True)
zbpemp.head()
Out[12]:
ESTAB EMP PAYQ1 PAYAN
zipcode
00501 2 0 0 0
06390 40 211 2489 13914
10003 4300 107728 1685926 6918342
10001 7296 149825 2238126 9003675
10002 2998 23220 164606 723242
In [13]:
#ZIP Codes for the entire state
zbpemp.shape
Out[13]:
(2064, 4)
In [14]:
#Flag columns count the number of establishments for which data is not disclosed
flags=['FLAG_EMP','FLAG_PAYQ1','FLAG_PAYAN']
for flagcol in flags:
    datacol=flagcol.split('_')[1]
    zbpemp[flagcol]=0
    zbpemp.loc[zbpemp[datacol] == 0, flagcol] = zbpemp['ESTAB']
zbpemp.head()
Out[14]:
ESTAB EMP PAYQ1 PAYAN FLAG_EMP FLAG_PAYQ1 FLAG_PAYAN
zipcode
00501 2 0 0 0 2 2 2
06390 40 211 2489 13914 0 0 0
10003 4300 107728 1685926 6918342 0 0 0
10001 7296 149825 2238126 9003675 0 0 0
10002 2998 23220 164606 723242 0 0 0
In [15]:
#Join to ZIP ZCTA crosswalk
zbpemp2zcta = pd.merge(zip2zcta,zbpemp,how='inner',left_index=True,right_index=True)
zbpemp2zcta.index.name = 'ZIP'
zbpemp2zcta.head()
Out[15]:
ZIP_TYPE PO_NAME zcta5 county14 ESTAB EMP PAYQ1 PAYAN FLAG_EMP FLAG_PAYQ1 FLAG_PAYAN
ZIP
10307 ZIP Code area Staten Island 10307 36085 281 1568 12395 54830 0 0 0
10309 ZIP Code area Staten Island 10309 36085 811 8783 75794 338636 0 0 0
10312 ZIP Code area Staten Island 10312 36085 925 7111 52775 258524 0 0 0
10308 ZIP Code area Staten Island 10308 36085 468 2622 17576 79489 0 0 0
11697 ZIP Code area Breezy Point 11697 36081 45 341 3320 16467 0 0 0
In [16]:
#ZIP codes in the local area that appear in the ZBP data
zbpemp2zcta.shape
Out[16]:
(290, 11)
In [17]:
#Aggregate to ZCTAs
zctaemp=zbpemp2zcta[['zcta5','ESTAB','EMP','PAYQ1','PAYAN','FLAG_EMP','FLAG_PAYQ1','FLAG_PAYAN']].groupby(['zcta5'])[['ESTAB','EMP','PAYQ1','PAYAN','FLAG_EMP','FLAG_PAYQ1','FLAG_PAYAN']].sum()
zctaemp.head()
Out[17]:
ESTAB EMP PAYQ1 PAYAN FLAG_EMP FLAG_PAYQ1 FLAG_PAYAN
zcta5
10001 8071 162819 2554068 10340827 0 0 0
10002 2998 23220 164606 723242 0 0 0
10003 4306 107738 1685965 6918644 1 1 1
10004 1655 69945 2049963 6767972 3 3 3
10005 1389 47554 2373123 6595876 49 49 49

ZBP Industry Data

This data must be requested one record at a time. NOTE that this request can take a LONG TIME, up to one hour to complete for approx 300 ZIP Codes. Once the request is finished the data gets dumped into a json file. If the request is successful but subsequent blocks need to be modified, don't rerun the requests block - pull the data from the json file.

In [18]:
zipcodes=zbpemp2zcta.index.tolist()
len(zipcodes)
Out[18]:
290

THIS BLOCK IS A REQUESTS BLOCK!

Retrieving approx 300 ZIP Codes takes 1 hour

In [19]:
#If this block is successful but there are subsequent problems, do not rerun it - start from the following block.
#For industry data, if there are no records for an industry create a blank record with zeros
n=0
z=0
ind_data=[['estab','naics','zipcode']]
for zcode in zipcodes:
#for zcode in zipcodes[0:5]:
    clear_output(wait=True)
    for naics in ncodes:
        idata_url = f'{base_url}?get={icols}&NAICS2012={naics}&for=zipcode:{zcode}&key={api_key}'
        try:
            response=requests.get(idata_url)
        except requests.exceptions.RequestException as e:
            print (e)
            break
        if response.status_code==200:
            jsondata=response.json()
            ind_data.append(jsondata[1]) 
            n=n+1
        elif response.status_code==204:
            record=['0',naics,zcode]
            ind_data.append(record)
            n=n+1
        else:
            print('Problem retrieving data, status code:',response.status_code)
            break
    z=z+1
    print(n,'records have been retrieved for',z,'ZIP codes...')
print('Done')

with open(ijsonpath, 'w') as f:
    json.dump(ind_data, f)
print('Data dumped to json file')
6090 records have been retrieved for 290 ZIP codes...
Done
Data dumped to json file
In [20]:
with open(ijsonpath, 'r') as f:
    ijsondata=json.load(f)
zbpind = pd.DataFrame(ijsondata[1:],columns=ijsondata[0])
zbpind['estab']=zbpind['estab'].astype('int64', inplace=True)
zbpind.head()
Out[20]:
estab naics zipcode
0 281 00 10307
1 0 11 10307
2 0 21 10307
3 0 22 10307
4 56 23 10307
In [21]:
#Pivot data to move NAICS to columns
zbpind_tab=zbpind.pivot(index='zipcode', columns='naics', values='estab')
zbpind_tab=zbpind_tab.add_prefix('N')
zbpind_tab.rename(columns=lambda x: x.replace('-', '_'),inplace=True)
zbpind_tab.head()
Out[21]:
naics N00 N11 N21 N22 N23 N31_33 N42 N44_45 N48_49 N51 ... N53 N54 N55 N56 N61 N62 N71 N72 N81 N99
zipcode
10001 7296 1 0 3 256 216 1265 669 71 388 ... 460 1470 143 388 139 345 263 493 525 16
10002 2998 0 0 1 109 68 197 561 47 72 ... 276 267 2 93 26 213 98 585 317 4
10003 4300 1 0 1 54 41 110 433 8 197 ... 385 706 28 124 79 335 520 636 504 12
10004 1600 0 1 0 27 11 64 75 25 117 ... 86 452 18 84 33 90 26 139 122 5
10005 1340 1 1 0 13 5 39 45 8 64 ... 75 405 16 74 27 36 25 77 107 2

5 rows × 21 columns

In [22]:
#Join to ZIP ZCTA crosswalk
zbpind2zcta = pd.merge(zip2zcta[['zcta5']],zbpind_tab,how='inner',left_index=True,right_index=True)
zbpind2zcta.index.name = 'ZIP'
zbpind2zcta.head()
Out[22]:
zcta5 N00 N11 N21 N22 N23 N31_33 N42 N44_45 N48_49 ... N53 N54 N55 N56 N61 N62 N71 N72 N81 N99
ZIP
10307 10307 281 0 0 0 56 6 11 31 14 ... 9 26 0 24 4 15 6 29 32 2
10309 10309 811 0 0 0 149 14 36 105 27 ... 30 77 0 44 15 87 25 64 84 1
10312 10312 925 0 0 1 137 9 46 119 33 ... 18 102 0 56 25 142 9 83 100 3
10308 10308 468 0 0 0 67 3 13 65 16 ... 15 53 3 27 8 58 9 47 62 1
11697 11697 45 0 0 0 9 0 3 6 0 ... 3 4 0 4 1 4 2 5 3 0

5 rows × 22 columns

In [23]:
#Aggregate to ZCTAs
zctaind=zbpind2zcta.groupby(['zcta5']).sum()
zctaind.head()
Out[23]:
N00 N11 N21 N22 N23 N31_33 N42 N44_45 N48_49 N51 ... N53 N54 N55 N56 N61 N62 N71 N72 N81 N99
zcta5
10001 8071 1 0 3 271 226 1367 702 78 422 ... 520 1729 164 435 147 365 332 508 565 16
10002 2998 0 0 1 109 68 197 561 47 72 ... 276 267 2 93 26 213 98 585 317 4
10003 4306 1 0 1 54 41 110 434 8 197 ... 386 706 28 124 80 335 521 637 505 12
10004 1655 0 1 0 27 11 65 77 25 134 ... 90 460 20 88 33 90 26 139 124 5
10005 1389 1 1 0 13 5 39 45 10 66 ... 76 408 17 75 27 36 25 80 107 2

5 rows × 21 columns

In [24]:
#Generate and calculate percent total columns
ncols=list(zctaind)
for c in ncols[1:]:
    pct=c+'_PCT'
    zctaind[pct]=((zctaind[c]/zctaind['N00'])*100).round(2)
zctaind.head()
Out[24]:
N00 N11 N21 N22 N23 N31_33 N42 N44_45 N48_49 N51 ... N53_PCT N54_PCT N55_PCT N56_PCT N61_PCT N62_PCT N71_PCT N72_PCT N81_PCT N99_PCT
zcta5
10001 8071 1 0 3 271 226 1367 702 78 422 ... 6.44 21.42 2.03 5.39 1.82 4.52 4.11 6.29 7.00 0.20
10002 2998 0 0 1 109 68 197 561 47 72 ... 9.21 8.91 0.07 3.10 0.87 7.10 3.27 19.51 10.57 0.13
10003 4306 1 0 1 54 41 110 434 8 197 ... 8.96 16.40 0.65 2.88 1.86 7.78 12.10 14.79 11.73 0.28
10004 1655 0 1 0 27 11 65 77 25 134 ... 5.44 27.79 1.21 5.32 1.99 5.44 1.57 8.40 7.49 0.30
10005 1389 1 1 0 13 5 39 45 10 66 ... 5.47 29.37 1.22 5.40 1.94 2.59 1.80 5.76 7.70 0.14

5 rows × 41 columns

NAICS Codes

THIS BLOCK IS A REQUESTS BLOCK!

In [25]:
#Need to input a ZIP Code that contains establishments in every industry
#Don't rerun unless necessary
ind_codes=[['name','naics','zip']]
for naics in ncodes:
    data_url = f'{base_url}?get=NAICS2012_TTL&NAICS2012={naics}&for=zipcode:08088&key={api_key}'
    response=requests.get(data_url)
    jsondata=response.json()
    ind_codes.append(jsondata[1]) 
if len(ind_codes)==22:
    with open(cjsonpath, 'w') as f:
        json.dump(ind_codes, f)
        print('Retrieved all 21 codes and dumped to json')
else:
    print('Some codes are missing; try a different ZIP')
Retrieved all 22 codes and dumped to json
In [26]:
with open(cjsonpath, 'r') as f:
    cjsondata=json.load(f)
codes=pd.DataFrame(cjsondata[1:],columns=cjsondata[0]).set_index('naics').drop(columns='zip')
codes.head()
Out[26]:
name
naics
00 Total for all sectors
11 Agriculture, forestry, fishing and hunting
21 Mining, quarrying, and oil and gas extraction
22 Utilities
23 Construction

Quality Control Checks

In [27]:
#Does sum of industries equal industry total?
indsum=zctaind['N00'].subtract(zctaind.iloc[:,1:21].sum(axis=1))
if indsum.sum()==0:
    print (True)
else:
    print(indsum.loc[indsum != 0])
True
In [28]:
#Is sum of percent totals approximately 100?
ptotal=zctaind.iloc[:,21:].sum(axis=1)
if ptotal.loc[(ptotal <= 99.05) | (ptotal >= 100.05)].empty:
    print(True)
else:
    print(ptotal.loc[(ptotal <= 99.05) | (ptotal >= 100.05)])
True
In [29]:
#Do number of ZCTAs in employment table match the industries table?
ecount=zctaemp.shape[0]
icount=zctaind.shape[0]
if ecount == icount:
    print (True)
else:
    print('Mistmatched count between employment',ecount, 'rows and industry',icount, 'rows')
True
In [30]:
#Does sum of estabslishments from employment table equal establishments in industries table?
estsum=zctaemp['ESTAB'].subtract(zctaind['N00'])
if estsum.sum()==0:
    print (True)
else:
    print(estsum.loc[estsum != 0])
True

Write to Database

In [31]:
con = sqlite3.connect(dbname) 
cur = con.cursor()
In [32]:
#Employment table
cur.execute('DROP TABLE IF EXISTS {};'.format(emptable))
qcreate_emptab="""
CREATE TABLE {}(
zcta5 TEXT NOT NULL PRIMARY KEY,
estab INTEGER,
emp INTEGER,
payq1 INTEGER,
payan INTEGER,
flag_emp INTEGER,
flag_payq1 INTEGER,
flag_payan INTEGER);
""".format(emptable)

cur.execute(qcreate_emptab)
Out[32]:
<sqlite3.Cursor at 0x29dc5ff2420>
In [33]:
#Don't run this block unless you've run the previous one
zctaemp.to_sql(name='{}'.format(emptable), if_exists='append', index=True, con=con)
In [34]:
cur.execute('SELECT COUNT(*) FROM {};'.format(emptable))
rows = cur.fetchone()
print(rows[0], 'records written to', emptable)
214 records written to zbp2016emp
In [35]:
#Replace zeros with nulls, as these values really represent no data
for col in zctaemp.columns[1:]:
    qupdate='UPDATE {} SET {} = NULL WHERE {} = 0;'.format(emptable,col,col)
    cur.execute(qupdate)
    print(cur.rowcount,'records updated for',col)
    con.commit()
1 records updated for EMP
1 records updated for PAYQ1
1 records updated for PAYAN
188 records updated for FLAG_EMP
188 records updated for FLAG_PAYQ1
188 records updated for FLAG_PAYAN
In [36]:
#Industry table
cur.execute('DROP TABLE IF EXISTS {}'.format(indtable))
qcreate_indtab="""
CREATE TABLE {} (
zcta5 TEXT NOT NULL PRIMARY KEY, 
N00 INTEGER, 
N11 INTEGER, 
N21 INTEGER, 
N22 INTEGER, 
N23 INTEGER, 
N31_33 INTEGER, 
N42 INTEGER, 
N44_45 INTEGER, 
N48_49 INTEGER, 
N51 INTEGER, 
N52 INTEGER, 
N53 INTEGER, 
N54 INTEGER, 
N55 INTEGER, 
N56 INTEGER, 
N61 INTEGER, 
N62 INTEGER, 
N71 INTEGER, 
N72 INTEGER, 
N81 INTEGER, 
N99 INTEGER, 
N11_PCT REAL, 
N21_PCT REAL, 
N22_PCT REAL, 
N23_PCT REAL, 
N31_33_PCT REAL, 
N42_PCT REAL, 
N44_45_PCT REAL, 
N48_49_PCT REAL, 
N51_PCT REAL, 
N52_PCT REAL, 
N53_PCT REAL, 
N54_PCT REAL, 
N55_PCT REAL, 
N56_PCT REAL, 
N61_PCT REAL, 
N62_PCT REAL, 
N71_PCT REAL, 
N72_PCT REAL, 
N81_PCT REAL, 
N99_PCT REAL);
""".format(indtable)

cur.execute(qcreate_indtab)
Out[36]:
<sqlite3.Cursor at 0x29dc5ff2420>
In [37]:
#Don't run this block unless you've run the previous one
zctaind.to_sql(name='{}'.format(indtable), if_exists='append', index=True, con=con)
In [38]:
cur.execute('SELECT COUNT(*) FROM {};'.format(indtable))
rows = cur.fetchone()
print(rows[0], 'records written to', indtable)
214 records written to zbp2016ind
In [39]:
#Replace zeros with nulls, as these values really represent no data
for col in zctaind.columns[21:]:
    qupdate='UPDATE {} SET {} = NULL WHERE {} = 0.0;'.format(indtable,col,col)
    cur.execute(qupdate)
    print(cur.rowcount,'records updated for',col)
    con.commit()
193 records updated for N11_PCT
195 records updated for N21_PCT
146 records updated for N22_PCT
18 records updated for N23_PCT
29 records updated for N31_33_PCT
10 records updated for N42_PCT
6 records updated for N44_45_PCT
23 records updated for N48_49_PCT
16 records updated for N51_PCT
6 records updated for N52_PCT
7 records updated for N53_PCT
4 records updated for N54_PCT
66 records updated for N55_PCT
7 records updated for N56_PCT
19 records updated for N61_PCT
9 records updated for N62_PCT
17 records updated for N71_PCT
7 records updated for N72_PCT
2 records updated for N81_PCT
67 records updated for N99_PCT
In [40]:
#NAICS code table
cur.execute('DROP TABLE IF EXISTS {};'.format(codetable))
qcreate_codetab="""
CREATE TABLE {}(
naics TEXT NOT NULL PRIMARY KEY,
name TEXT);
""".format(codetable)

cur.execute(qcreate_codetab)
Out[40]:
<sqlite3.Cursor at 0x29dc5ff2420>
In [41]:
#Don't run this block unless you've run the previous one
codes.to_sql(name='{}'.format(codetable), if_exists='append', index=True, con=con)
In [42]:
cur.execute('SELECT COUNT(*) FROM {};'.format(codetable))
rows = cur.fetchone()
print(rows[0], 'records written to', codetable)
21 records written to zbp2016indcodes
In [43]:
con.close()
In [ ]: