Author: Mark Bauer
# importing libraries
import os
import numpy as np
import pandas as pd
from sodapy import Socrata
# nyc open data domain and 311 dataset id
socrata_domain = 'data.cityofnewyork.us'
socrata_dataset_identifier = 'erm2-nwe9'
# If you choose to use a token, run the following command on the terminal (or add it to your .bashrc)
# $ export SODAPY_APPTOKEN=<token>
socrata_token = os.environ.get("SODAPY_APPTOKEN")
Practice querying using the sodapy client and basic query format.
# Socrata object to fetch data
client = Socrata(
domain=socrata_domain,
app_token=socrata_token,
timeout=1000
)
print(client)
WARNING:root:Requests made without an app_token will be subject to strict throttling limits.
<sodapy.socrata.Socrata object at 0x10a54cf40>
We use the Socrata get
method to fetch data. Pass the dataset id
and the query
to this method.
query = """
SELECT
complaint_type,
count(complaint_type) AS count
GROUP BY
complaint_type
ORDER BY
count(complaint_type) DESC
LIMIT
20
"""
# returned as JSON from API, converted to Python list of dictionaries by sodapy
results = client.get(socrata_dataset_identifier, query=query)
# convert to pandas DataFrame
results_df = pd.DataFrame(results)
print('shape of data: {}'.format(results_df.shape))
results_df
shape of data: (20, 2)
complaint_type | count | |
---|---|---|
0 | Noise - Residential | 3217341 |
1 | Illegal Parking | 2245293 |
2 | HEAT/HOT WATER | 2030803 |
3 | Blocked Driveway | 1487709 |
4 | Street Condition | 1227532 |
5 | Noise - Street/Sidewalk | 1155974 |
6 | Street Light Condition | 1123871 |
7 | Request Large Bulky Item Collection | 1073753 |
8 | PLUMBING | 908288 |
9 | HEATING | 887869 |
10 | Water System | 865727 |
11 | UNSANITARY CONDITION | 800577 |
12 | Noise | 663290 |
13 | General Construction/Plumbing | 639597 |
14 | Traffic Signal Condition | 575904 |
15 | Noise - Commercial | 567159 |
16 | PAINT/PLASTER | 533191 |
17 | GENERAL CONSTRUCTION | 500821 |
18 | Noise - Vehicle | 494785 |
19 | Sewer | 478443 |
Table xx: Count 311 complaints by complaint_type
.
query = """
SELECT
descriptor,
count(descriptor) AS count
GROUP BY
descriptor
ORDER BY
count(descriptor) DESC
LIMIT
20
"""
results = client.get(socrata_dataset_identifier, query=query)
results_df = pd.DataFrame(results)
print('shape of data: {}'.format(results_df.shape))
results_df
shape of data: (20, 2)
descriptor | count | |
---|---|---|
0 | Loud Music/Party | 3548896 |
1 | ENTIRE BUILDING | 1326457 |
2 | No Access | 1109122 |
3 | Request Large Bulky Item Collection | 1073753 |
4 | Banging/Pounding | 880625 |
5 | HEAT | 868960 |
6 | Street Light Out | 827388 |
7 | Pothole | 743364 |
8 | APARTMENT ONLY | 704346 |
9 | Blocked Hydrant | 628045 |
10 | Posted Parking Sign Violation | 560211 |
11 | Loud Talking | 511630 |
12 | N/A | 479756 |
13 | With License Plate | 433902 |
14 | CEILING | 429105 |
15 | PESTS | 407991 |
16 | Derelict Vehicles | 379085 |
17 | Partial Access | 378587 |
18 | MOLD | 347253 |
19 | Car/Truck Music | 317067 |
Table xx: Count 311 complaints by complaint_type
.
query = """
SELECT
complaint_type,
descriptor,
count(*) AS count
GROUP BY
complaint_type, descriptor
ORDER BY
count(*) DESC
LIMIT
20
"""
results = client.get(socrata_dataset_identifier, query=query)
results_df = pd.DataFrame(results)
print('shape of data: {}'.format(results_df.shape))
results_df
shape of data: (20, 3)
complaint_type | descriptor | count | |
---|---|---|---|
0 | Noise - Residential | Loud Music/Party | 2145838 |
1 | HEAT/HOT WATER | ENTIRE BUILDING | 1326457 |
2 | Blocked Driveway | No Access | 1109122 |
3 | Request Large Bulky Item Collection | Request Large Bulky Item Collection | 1073753 |
4 | Noise - Street/Sidewalk | Loud Music/Party | 903909 |
5 | HEATING | HEAT | 868960 |
6 | Street Light Condition | Street Light Out | 827388 |
7 | Noise - Residential | Banging/Pounding | 822786 |
8 | Street Condition | Pothole | 740964 |
9 | HEAT/HOT WATER | APARTMENT ONLY | 704346 |
10 | Illegal Parking | Blocked Hydrant | 625989 |
11 | Illegal Parking | Posted Parking Sign Violation | 557321 |
12 | Noise - Commercial | Loud Music/Party | 431799 |
13 | UNSANITARY CONDITION | PESTS | 407991 |
14 | Derelict Vehicles | Derelict Vehicles | 379085 |
15 | Blocked Driveway | Partial Access | 378587 |
16 | Traffic Signal Condition | Controller | 300982 |
17 | Noise - Vehicle | Car/Truck Music | 298957 |
18 | Illegal Parking | Blocked Sidewalk | 298010 |
19 | Noise | Noise: Construction Before/After Hours (NM1) | 270104 |
Table xx: Count 311 complaints by complaint_type
and descriptor
.
query = """
SELECT
complaint_type,
count(complaint_type)
WHERE
LOWER(complaint_type) LIKE '%flood%'
GROUP BY
complaint_type
ORDER BY
count(complaint_type) DESC
LIMIT
10
"""
results = client.get(socrata_dataset_identifier, query=query)
results_df = pd.DataFrame(results)
print('shape of data: {}'.format(results_df.shape))
results_df
shape of data: (0, 0)
Table xx: Count 311 Complaints by complaint_type
where complaint_type
contains the word flood
.
query = """
SELECT
descriptor,
count(descriptor) AS count
WHERE
LOWER(descriptor) LIKE '%flood%'
GROUP BY
descriptor
ORDER BY
count(descriptor) DESC
LIMIT
20
"""
results = client.get(socrata_dataset_identifier, query=query)
results_df = pd.DataFrame(results)
print('shape of data: {}'.format(results_df.shape))
results_df
shape of data: (11, 2)
descriptor | count | |
---|---|---|
0 | Catch Basin Clogged/Flooding (Use Comments) (SC) | 111582 |
1 | Street Flooding (SJ) | 37764 |
2 | Flood Light Lamp Out | 6464 |
3 | Highway Flooding (SH) | 3115 |
4 | Flood Light Lamp Cycling | 2584 |
5 | Ready NY - Flooding | 271 |
6 | Flood Light Lamp Dayburning | 223 |
7 | Flood Light Lamp Missing | 211 |
8 | Flood Light Lamp Dim | 185 |
9 | RAIN GARDEN FLOODING (SRGFLD) | 152 |
10 | Flooded | 56 |
Table xx: Count 311 Complaints by descriptor
where descriptor
contains the word flood
.
Select all rows where descriptor
is Street Flooding (SJ)
and created_date
is between 2010 and 2020
. We use limit 40,000 as an approximate of all possible rows.
query = """
SELECT
*
WHERE
descriptor == 'Street Flooding (SJ)'
AND created_date BETWEEN '2010' AND '2020'
LIMIT
40000
"""
results = client.get(socrata_dataset_identifier, query=query)
results_df = pd.DataFrame(results)
# close client
client.close()
print('shape of data: {}'.format(results_df.shape))
results_df.head()
shape of data: (25747, 32)
unique_key | created_date | closed_date | agency | agency_name | complaint_type | descriptor | incident_zip | incident_address | street_name | ... | open_data_channel_type | park_facility_name | park_borough | latitude | longitude | location | intersection_street_1 | intersection_street_2 | facility_type | due_date | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 45283755 | 2019-12-31T22:42:00.000 | 2020-01-07T11:07:00.000 | DEP | Department of Environmental Protection | Sewer | Street Flooding (SJ) | 10306 | 131 GRIMSBY STREET | GRIMSBY STREET | ... | ONLINE | Unspecified | STATEN ISLAND | 40.57504060779978 | -74.0931855786481 | {'latitude': '40.57504060779978', 'longitude':... | NaN | NaN | NaN | NaN |
1 | 45283863 | 2019-12-31T17:34:00.000 | 2020-01-01T15:45:00.000 | DEP | Department of Environmental Protection | Sewer | Street Flooding (SJ) | 10312 | NaN | NaN | ... | PHONE | Unspecified | STATEN ISLAND | 40.56047555908232 | -74.1668892252524 | {'latitude': '40.56047555908232', 'longitude':... | ARTHUR KILL ROAD | GETZ AVENUE | NaN | NaN |
2 | 45279400 | 2019-12-31T16:11:00.000 | 2020-01-08T10:10:00.000 | DEP | Department of Environmental Protection | Sewer | Street Flooding (SJ) | 10305 | 753 QUINCY AVENUE | QUINCY AVENUE | ... | PHONE | Unspecified | STATEN ISLAND | 40.576529751013474 | -74.08418458891498 | {'latitude': '40.576529751013474', 'longitude'... | NaN | NaN | NaN | NaN |
3 | 45277773 | 2019-12-31T15:42:00.000 | 2020-01-01T05:25:00.000 | DEP | Department of Environmental Protection | Sewer | Street Flooding (SJ) | 11379 | 61-21 70 STREET | 70 STREET | ... | PHONE | Unspecified | QUEENS | 40.72035428730757 | -73.88758860416793 | {'latitude': '40.72035428730757', 'longitude':... | NaN | NaN | NaN | NaN |
4 | 45282532 | 2019-12-31T12:18:00.000 | 2019-12-31T14:15:00.000 | DEP | Department of Environmental Protection | Sewer | Street Flooding (SJ) | 11375 | NaN | NaN | ... | PHONE | Unspecified | QUEENS | 40.721453503515995 | -73.84397787334406 | {'latitude': '40.721453503515995', 'longitude'... | QUEENS BOULEVARD | 71 AVENUE | NaN | NaN |
5 rows × 32 columns
# sanity checks
print('Number of total records: {:,}.\n'.format(results_df.shape[0]))
print('min date:', results_df['created_date'].min())
print('max date:', results_df['created_date'].max())
Number of total records: 25,747. min date: 2010-01-02T08:26:00.000 max date: 2019-12-31T22:42:00.000
# summary of dataframe
results_df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 25747 entries, 0 to 25746 Data columns (total 32 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 unique_key 25747 non-null object 1 created_date 25747 non-null object 2 closed_date 25745 non-null object 3 agency 25747 non-null object 4 agency_name 25747 non-null object 5 complaint_type 25747 non-null object 6 descriptor 25747 non-null object 7 incident_zip 24899 non-null object 8 incident_address 16188 non-null object 9 street_name 16188 non-null object 10 cross_street_1 22472 non-null object 11 cross_street_2 22464 non-null object 12 address_type 25741 non-null object 13 city 24901 non-null object 14 status 25747 non-null object 15 resolution_description 25743 non-null object 16 resolution_action_updated_date 25747 non-null object 17 community_board 25747 non-null object 18 bbl 14603 non-null object 19 borough 25747 non-null object 20 x_coordinate_state_plane 24817 non-null object 21 y_coordinate_state_plane 24817 non-null object 22 open_data_channel_type 25747 non-null object 23 park_facility_name 25747 non-null object 24 park_borough 25747 non-null object 25 latitude 24817 non-null object 26 longitude 24817 non-null object 27 location 24817 non-null object 28 intersection_street_1 9616 non-null object 29 intersection_street_2 9616 non-null object 30 facility_type 23296 non-null object 31 due_date 1 non-null object dtypes: object(32) memory usage: 6.3+ MB
# sort count nulls descending
(results_df
.isnull()
.sum()
.sort_values(ascending=False)
)
due_date 25746 intersection_street_2 16131 intersection_street_1 16131 bbl 11144 incident_address 9559 street_name 9559 cross_street_2 3283 cross_street_1 3275 facility_type 2451 location 930 longitude 930 latitude 930 x_coordinate_state_plane 930 y_coordinate_state_plane 930 incident_zip 848 city 846 address_type 6 resolution_description 4 closed_date 2 open_data_channel_type 0 park_borough 0 park_facility_name 0 unique_key 0 borough 0 community_board 0 created_date 0 status 0 descriptor 0 complaint_type 0 agency_name 0 agency 0 resolution_action_updated_date 0 dtype: int64
# writing output file as a csv
results_df.to_csv('data/street-flooding-query.csv', index=False)
# sanity check, listing items in data folder
%ls data/
README.md street-flooding-query.csv street-flooding-complaints.csv streets-clipped.gpkg
df = pd.read_csv('data/street-flooding-query.csv', low_memory=False)
print('shape of data: {}'.format(df.shape))
df.head()
shape of data: (25747, 32)
unique_key | created_date | closed_date | agency | agency_name | complaint_type | descriptor | incident_zip | incident_address | street_name | ... | open_data_channel_type | park_facility_name | park_borough | latitude | longitude | location | intersection_street_1 | intersection_street_2 | facility_type | due_date | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 45283755 | 2019-12-31T22:42:00.000 | 2020-01-07T11:07:00.000 | DEP | Department of Environmental Protection | Sewer | Street Flooding (SJ) | 10306.0 | 131 GRIMSBY STREET | GRIMSBY STREET | ... | ONLINE | Unspecified | STATEN ISLAND | 40.575041 | -74.093186 | {'latitude': '40.57504060779978', 'longitude':... | NaN | NaN | NaN | NaN |
1 | 45283863 | 2019-12-31T17:34:00.000 | 2020-01-01T15:45:00.000 | DEP | Department of Environmental Protection | Sewer | Street Flooding (SJ) | 10312.0 | NaN | NaN | ... | PHONE | Unspecified | STATEN ISLAND | 40.560476 | -74.166889 | {'latitude': '40.56047555908232', 'longitude':... | ARTHUR KILL ROAD | GETZ AVENUE | NaN | NaN |
2 | 45279400 | 2019-12-31T16:11:00.000 | 2020-01-08T10:10:00.000 | DEP | Department of Environmental Protection | Sewer | Street Flooding (SJ) | 10305.0 | 753 QUINCY AVENUE | QUINCY AVENUE | ... | PHONE | Unspecified | STATEN ISLAND | 40.576530 | -74.084185 | {'latitude': '40.576529751013474', 'longitude'... | NaN | NaN | NaN | NaN |
3 | 45277773 | 2019-12-31T15:42:00.000 | 2020-01-01T05:25:00.000 | DEP | Department of Environmental Protection | Sewer | Street Flooding (SJ) | 11379.0 | 61-21 70 STREET | 70 STREET | ... | PHONE | Unspecified | QUEENS | 40.720354 | -73.887589 | {'latitude': '40.72035428730757', 'longitude':... | NaN | NaN | NaN | NaN |
4 | 45282532 | 2019-12-31T12:18:00.000 | 2019-12-31T14:15:00.000 | DEP | Department of Environmental Protection | Sewer | Street Flooding (SJ) | 11375.0 | NaN | NaN | ... | PHONE | Unspecified | QUEENS | 40.721454 | -73.843978 | {'latitude': '40.721453503515995', 'longitude'... | QUEENS BOULEVARD | 71 AVENUE | NaN | NaN |
5 rows × 32 columns
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 25747 entries, 0 to 25746 Data columns (total 32 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 unique_key 25747 non-null int64 1 created_date 25747 non-null object 2 closed_date 25745 non-null object 3 agency 25747 non-null object 4 agency_name 25747 non-null object 5 complaint_type 25747 non-null object 6 descriptor 25747 non-null object 7 incident_zip 24899 non-null float64 8 incident_address 16188 non-null object 9 street_name 16188 non-null object 10 cross_street_1 22472 non-null object 11 cross_street_2 22464 non-null object 12 address_type 25741 non-null object 13 city 24901 non-null object 14 status 25747 non-null object 15 resolution_description 25743 non-null object 16 resolution_action_updated_date 25747 non-null object 17 community_board 25747 non-null object 18 bbl 14603 non-null float64 19 borough 25747 non-null object 20 x_coordinate_state_plane 24817 non-null float64 21 y_coordinate_state_plane 24817 non-null float64 22 open_data_channel_type 25747 non-null object 23 park_facility_name 25747 non-null object 24 park_borough 25747 non-null object 25 latitude 24817 non-null float64 26 longitude 24817 non-null float64 27 location 24817 non-null object 28 intersection_street_1 9616 non-null object 29 intersection_street_2 9616 non-null object 30 facility_type 0 non-null float64 31 due_date 1 non-null object dtypes: float64(7), int64(1), object(24) memory usage: 6.3+ MB
There are two columns that we should probably drop due to enormous amounts of missing data, facility_type
and due_date
. We will dive in further in the EDA notebook.