Querying NYC 311 Street Flooding Complaints Using sodapy¶

Author: Mark Bauer

In [1]:

# importing libraries
import os
import numpy as np
import pandas as pd
from sodapy import Socrata

In [2]:

# nyc open data domain and 311 dataset id
socrata_domain = 'data.cityofnewyork.us'
socrata_dataset_identifier = 'erm2-nwe9'

# If you choose to use a token, run the following command on the terminal (or add it to your .bashrc)
# $ export SODAPY_APPTOKEN=<token>
socrata_token = os.environ.get("SODAPY_APPTOKEN")

Preview and Explore the 311 Dataset¶

Practice querying using the sodapy client and basic query format.

In [3]:

# Socrata object to fetch data
client = Socrata(
    domain=socrata_domain,
    app_token=socrata_token,
    timeout=1000
)

print(client)

WARNING:root:Requests made without an app_token will be subject to strict throttling limits.

<sodapy.socrata.Socrata object at 0x10a54cf40>

We use the Socrata get method to fetch data. Pass the dataset id and the query to this method.

In [4]:

query = """
SELECT 
    complaint_type, 
    count(complaint_type) AS count  
GROUP BY 
    complaint_type   
ORDER BY 
    count(complaint_type) DESC
LIMIT
    20
"""

# returned as JSON from API, converted to Python list of dictionaries by sodapy
results = client.get(socrata_dataset_identifier, query=query)

# convert to pandas DataFrame
results_df = pd.DataFrame(results)

print('shape of data: {}'.format(results_df.shape))
results_df

shape of data: (20, 2)

Out[4]:

	complaint_type	count
0	Noise - Residential	3217341
1	Illegal Parking	2245293
2	HEAT/HOT WATER	2030803
3	Blocked Driveway	1487709
4	Street Condition	1227532
5	Noise - Street/Sidewalk	1155974
6	Street Light Condition	1123871
7	Request Large Bulky Item Collection	1073753
8	PLUMBING	908288
9	HEATING	887869
10	Water System	865727
11	UNSANITARY CONDITION	800577
12	Noise	663290
13	General Construction/Plumbing	639597
14	Traffic Signal Condition	575904
15	Noise - Commercial	567159
16	PAINT/PLASTER	533191
17	GENERAL CONSTRUCTION	500821
18	Noise - Vehicle	494785
19	Sewer	478443

Table xx: Count 311 complaints by complaint_type.

In [5]:

query = """
SELECT 
    descriptor, 
    count(descriptor) AS count 
GROUP BY 
    descriptor    
ORDER BY 
    count(descriptor) DESC
LIMIT
    20
"""

results = client.get(socrata_dataset_identifier, query=query)
results_df = pd.DataFrame(results)

print('shape of data: {}'.format(results_df.shape))
results_df

shape of data: (20, 2)

Out[5]:

	descriptor	count
0	Loud Music/Party	3548896
1	ENTIRE BUILDING	1326457
2	No Access	1109122
3	Request Large Bulky Item Collection	1073753
4	Banging/Pounding	880625
5	HEAT	868960
6	Street Light Out	827388
7	Pothole	743364
8	APARTMENT ONLY	704346
9	Blocked Hydrant	628045
10	Posted Parking Sign Violation	560211
11	Loud Talking	511630
12	N/A	479756
13	With License Plate	433902
14	CEILING	429105
15	PESTS	407991
16	Derelict Vehicles	379085
17	Partial Access	378587
18	MOLD	347253
19	Car/Truck Music	317067

Table xx: Count 311 complaints by complaint_type.

In [6]:

query = """
SELECT 
    complaint_type,
    descriptor, 
    count(*) AS count  
GROUP BY 
    complaint_type, descriptor   
ORDER BY 
    count(*) DESC
LIMIT
    20
"""

results = client.get(socrata_dataset_identifier, query=query)
results_df = pd.DataFrame(results)

print('shape of data: {}'.format(results_df.shape))
results_df

shape of data: (20, 3)

Out[6]:

	complaint_type	descriptor	count
0	Noise - Residential	Loud Music/Party	2145838
1	HEAT/HOT WATER	ENTIRE BUILDING	1326457
2	Blocked Driveway	No Access	1109122
3	Request Large Bulky Item Collection	Request Large Bulky Item Collection	1073753
4	Noise - Street/Sidewalk	Loud Music/Party	903909
5	HEATING	HEAT	868960
6	Street Light Condition	Street Light Out	827388
7	Noise - Residential	Banging/Pounding	822786
8	Street Condition	Pothole	740964
9	HEAT/HOT WATER	APARTMENT ONLY	704346
10	Illegal Parking	Blocked Hydrant	625989
11	Illegal Parking	Posted Parking Sign Violation	557321
12	Noise - Commercial	Loud Music/Party	431799
13	UNSANITARY CONDITION	PESTS	407991
14	Derelict Vehicles	Derelict Vehicles	379085
15	Blocked Driveway	Partial Access	378587
16	Traffic Signal Condition	Controller	300982
17	Noise - Vehicle	Car/Truck Music	298957
18	Illegal Parking	Blocked Sidewalk	298010
19	Noise	Noise: Construction Before/After Hours (NM1)	270104

Table xx: Count 311 complaints by complaint_type and descriptor.

In [7]:

query = """
SELECT 
    complaint_type, 
    count(complaint_type)
WHERE 
    LOWER(complaint_type) LIKE '%flood%'   
GROUP BY 
    complaint_type
ORDER BY 
    count(complaint_type) DESC
LIMIT
    10
"""

results = client.get(socrata_dataset_identifier, query=query)
results_df = pd.DataFrame(results)

print('shape of data: {}'.format(results_df.shape))
results_df

shape of data: (0, 0)

Out[7]:

Table xx: Count 311 Complaints by complaint_type where complaint_type contains the word flood.

In [8]:

query = """
SELECT 
    descriptor, 
    count(descriptor) AS count
WHERE 
    LOWER(descriptor) LIKE '%flood%' 
GROUP BY 
    descriptor  
ORDER BY 
    count(descriptor) DESC
LIMIT
    20 
"""

results = client.get(socrata_dataset_identifier, query=query)
results_df = pd.DataFrame(results)

print('shape of data: {}'.format(results_df.shape))
results_df

shape of data: (11, 2)

Out[8]:

	descriptor	count
0	Catch Basin Clogged/Flooding (Use Comments) (SC)	111582
1	Street Flooding (SJ)	37764
2	Flood Light Lamp Out	6464
3	Highway Flooding (SH)	3115
4	Flood Light Lamp Cycling	2584
5	Ready NY - Flooding	271
6	Flood Light Lamp Dayburning	223
7	Flood Light Lamp Missing	211
8	Flood Light Lamp Dim	185
9	RAIN GARDEN FLOODING (SRGFLD)	152
10	Flooded	56

Table xx: Count 311 Complaints by descriptor where descriptor contains the word flood.

311 Street Flooding Complaints Query¶

Select all rows where descriptor is Street Flooding (SJ) and created_date is between 2010 and 2020. We use limit 40,000 as an approximate of all possible rows.

In [9]:

query = """
SELECT 
    *
WHERE 
    descriptor == 'Street Flooding (SJ)'
    AND created_date BETWEEN '2010' AND '2020'
LIMIT
    40000
"""

results = client.get(socrata_dataset_identifier, query=query)
results_df = pd.DataFrame(results)

# close client
client.close()

print('shape of data: {}'.format(results_df.shape))
results_df.head()

shape of data: (25747, 32)

Out[9]:

	unique_key	created_date	closed_date	agency	agency_name	complaint_type	descriptor	incident_zip	incident_address	street_name	...	open_data_channel_type	park_facility_name	park_borough	latitude	longitude	location	intersection_street_1	intersection_street_2	facility_type	due_date
0	45283755	2019-12-31T22:42:00.000	2020-01-07T11:07:00.000	DEP	Department of Environmental Protection	Sewer	Street Flooding (SJ)	10306	131 GRIMSBY STREET	GRIMSBY STREET	...	ONLINE	Unspecified	STATEN ISLAND	40.57504060779978	-74.0931855786481	{'latitude': '40.57504060779978', 'longitude':...	NaN	NaN	NaN	NaN
1	45283863	2019-12-31T17:34:00.000	2020-01-01T15:45:00.000	DEP	Department of Environmental Protection	Sewer	Street Flooding (SJ)	10312	NaN	NaN	...	PHONE	Unspecified	STATEN ISLAND	40.56047555908232	-74.1668892252524	{'latitude': '40.56047555908232', 'longitude':...	ARTHUR KILL ROAD	GETZ AVENUE	NaN	NaN
2	45279400	2019-12-31T16:11:00.000	2020-01-08T10:10:00.000	DEP	Department of Environmental Protection	Sewer	Street Flooding (SJ)	10305	753 QUINCY AVENUE	QUINCY AVENUE	...	PHONE	Unspecified	STATEN ISLAND	40.576529751013474	-74.08418458891498	{'latitude': '40.576529751013474', 'longitude'...	NaN	NaN	NaN	NaN
3	45277773	2019-12-31T15:42:00.000	2020-01-01T05:25:00.000	DEP	Department of Environmental Protection	Sewer	Street Flooding (SJ)	11379	61-21 70 STREET	70 STREET	...	PHONE	Unspecified	QUEENS	40.72035428730757	-73.88758860416793	{'latitude': '40.72035428730757', 'longitude':...	NaN	NaN	NaN	NaN
4	45282532	2019-12-31T12:18:00.000	2019-12-31T14:15:00.000	DEP	Department of Environmental Protection	Sewer	Street Flooding (SJ)	11375	NaN	NaN	...	PHONE	Unspecified	QUEENS	40.721453503515995	-73.84397787334406	{'latitude': '40.721453503515995', 'longitude'...	QUEENS BOULEVARD	71 AVENUE	NaN	NaN

5 rows × 32 columns

In [10]:

# sanity checks
print('Number of total records: {:,}.\n'.format(results_df.shape[0]))
      
print('min date:', results_df['created_date'].min())
print('max date:', results_df['created_date'].max())

Number of total records: 25,747.

min date: 2010-01-02T08:26:00.000
max date: 2019-12-31T22:42:00.000

In [11]:

# summary of dataframe
results_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25747 entries, 0 to 25746
Data columns (total 32 columns):
 #   Column                          Non-Null Count  Dtype 
---  ------                          --------------  ----- 
 0   unique_key                      25747 non-null  object
 1   created_date                    25747 non-null  object
 2   closed_date                     25745 non-null  object
 3   agency                          25747 non-null  object
 4   agency_name                     25747 non-null  object
 5   complaint_type                  25747 non-null  object
 6   descriptor                      25747 non-null  object
 7   incident_zip                    24899 non-null  object
 8   incident_address                16188 non-null  object
 9   street_name                     16188 non-null  object
 10  cross_street_1                  22472 non-null  object
 11  cross_street_2                  22464 non-null  object
 12  address_type                    25741 non-null  object
 13  city                            24901 non-null  object
 14  status                          25747 non-null  object
 15  resolution_description          25743 non-null  object
 16  resolution_action_updated_date  25747 non-null  object
 17  community_board                 25747 non-null  object
 18  bbl                             14603 non-null  object
 19  borough                         25747 non-null  object
 20  x_coordinate_state_plane        24817 non-null  object
 21  y_coordinate_state_plane        24817 non-null  object
 22  open_data_channel_type          25747 non-null  object
 23  park_facility_name              25747 non-null  object
 24  park_borough                    25747 non-null  object
 25  latitude                        24817 non-null  object
 26  longitude                       24817 non-null  object
 27  location                        24817 non-null  object
 28  intersection_street_1           9616 non-null   object
 29  intersection_street_2           9616 non-null   object
 30  facility_type                   23296 non-null  object
 31  due_date                        1 non-null      object
dtypes: object(32)
memory usage: 6.3+ MB

In [12]:

# sort count nulls descending
(results_df
 .isnull()
 .sum()
 .sort_values(ascending=False)
)

Out[12]:

due_date                          25746
intersection_street_2             16131
intersection_street_1             16131
bbl                               11144
incident_address                   9559
street_name                        9559
cross_street_2                     3283
cross_street_1                     3275
facility_type                      2451
location                            930
longitude                           930
latitude                            930
x_coordinate_state_plane            930
y_coordinate_state_plane            930
incident_zip                        848
city                                846
address_type                          6
resolution_description                4
closed_date                           2
open_data_channel_type                0
park_borough                          0
park_facility_name                    0
unique_key                            0
borough                               0
community_board                       0
created_date                          0
status                                0
descriptor                            0
complaint_type                        0
agency_name                           0
agency                                0
resolution_action_updated_date        0
dtype: int64

In [13]:

# writing output file as a csv
results_df.to_csv('data/street-flooding-query.csv', index=False)

# sanity check, listing items in data folder
%ls data/

README.md                       street-flooding-query.csv
street-flooding-complaints.csv  streets-clipped.gpkg

Sanity Check of CSV¶

In [14]:

df = pd.read_csv('data/street-flooding-query.csv', low_memory=False)

print('shape of data: {}'.format(df.shape))
df.head()

shape of data: (25747, 32)

Out[14]:

	unique_key	created_date	closed_date	agency	agency_name	complaint_type	descriptor	incident_zip	incident_address	street_name	...	open_data_channel_type	park_facility_name	park_borough	latitude	longitude	location	intersection_street_1	intersection_street_2	facility_type	due_date
0	45283755	2019-12-31T22:42:00.000	2020-01-07T11:07:00.000	DEP	Department of Environmental Protection	Sewer	Street Flooding (SJ)	10306.0	131 GRIMSBY STREET	GRIMSBY STREET	...	ONLINE	Unspecified	STATEN ISLAND	40.575041	-74.093186	{'latitude': '40.57504060779978', 'longitude':...	NaN	NaN	NaN	NaN
1	45283863	2019-12-31T17:34:00.000	2020-01-01T15:45:00.000	DEP	Department of Environmental Protection	Sewer	Street Flooding (SJ)	10312.0	NaN	NaN	...	PHONE	Unspecified	STATEN ISLAND	40.560476	-74.166889	{'latitude': '40.56047555908232', 'longitude':...	ARTHUR KILL ROAD	GETZ AVENUE	NaN	NaN
2	45279400	2019-12-31T16:11:00.000	2020-01-08T10:10:00.000	DEP	Department of Environmental Protection	Sewer	Street Flooding (SJ)	10305.0	753 QUINCY AVENUE	QUINCY AVENUE	...	PHONE	Unspecified	STATEN ISLAND	40.576530	-74.084185	{'latitude': '40.576529751013474', 'longitude'...	NaN	NaN	NaN	NaN
3	45277773	2019-12-31T15:42:00.000	2020-01-01T05:25:00.000	DEP	Department of Environmental Protection	Sewer	Street Flooding (SJ)	11379.0	61-21 70 STREET	70 STREET	...	PHONE	Unspecified	QUEENS	40.720354	-73.887589	{'latitude': '40.72035428730757', 'longitude':...	NaN	NaN	NaN	NaN
4	45282532	2019-12-31T12:18:00.000	2019-12-31T14:15:00.000	DEP	Department of Environmental Protection	Sewer	Street Flooding (SJ)	11375.0	NaN	NaN	...	PHONE	Unspecified	QUEENS	40.721454	-73.843978	{'latitude': '40.721453503515995', 'longitude'...	QUEENS BOULEVARD	71 AVENUE	NaN	NaN

5 rows × 32 columns

In [15]:

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25747 entries, 0 to 25746
Data columns (total 32 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   unique_key                      25747 non-null  int64  
 1   created_date                    25747 non-null  object 
 2   closed_date                     25745 non-null  object 
 3   agency                          25747 non-null  object 
 4   agency_name                     25747 non-null  object 
 5   complaint_type                  25747 non-null  object 
 6   descriptor                      25747 non-null  object 
 7   incident_zip                    24899 non-null  float64
 8   incident_address                16188 non-null  object 
 9   street_name                     16188 non-null  object 
 10  cross_street_1                  22472 non-null  object 
 11  cross_street_2                  22464 non-null  object 
 12  address_type                    25741 non-null  object 
 13  city                            24901 non-null  object 
 14  status                          25747 non-null  object 
 15  resolution_description          25743 non-null  object 
 16  resolution_action_updated_date  25747 non-null  object 
 17  community_board                 25747 non-null  object 
 18  bbl                             14603 non-null  float64
 19  borough                         25747 non-null  object 
 20  x_coordinate_state_plane        24817 non-null  float64
 21  y_coordinate_state_plane        24817 non-null  float64
 22  open_data_channel_type          25747 non-null  object 
 23  park_facility_name              25747 non-null  object 
 24  park_borough                    25747 non-null  object 
 25  latitude                        24817 non-null  float64
 26  longitude                       24817 non-null  float64
 27  location                        24817 non-null  object 
 28  intersection_street_1           9616 non-null   object 
 29  intersection_street_2           9616 non-null   object 
 30  facility_type                   0 non-null      float64
 31  due_date                        1 non-null      object 
dtypes: float64(7), int64(1), object(24)
memory usage: 6.3+ MB

There are two columns that we should probably drop due to enormous amounts of missing data, facility_type and due_date. We will dive in further in the EDA notebook.

In [ ]: