# The usual preamble
%matplotlib inline
import polars as pl
import plotlib as sbn
import matplotlib.pyplot as plt
# Make the graphs a bit prettier, and bigger
plt.style.use('ggplot')
plt.rcParams['figure.figsize'] = (15, 5)
We're going to use a new dataset here, to demonstrate how to deal with larger datasets. This is a subset of the of 311 service requests from NYC Open Data.
# because of mixed types we specify dtype to prevent any errors
complaints = pl.read_csv('../data/311-service-requests.csv', dtypes={'Incident Zip':pl.Utf8})
Notice that we had to explicitly specify the dtype of the 'Incident Zip' column as a string type (Utf8). This means that it's encountered a problem reading in our data. In this case it almost certainly means that it has columns where some of the entries are strings and some are integers.
For now we're going to ignore it and hope we don't run into a problem, but in the long run we'd need to investigate this warning.
Use the head function to get the top rows of a dataframe
complaints.head()
Unique Key | Created Date | Closed Date | Agency | Agency Name | Complaint Type | Descriptor | Location Type | Incident Zip | Incident Address | Street Name | Cross Street 1 | Cross Street 2 | Intersection Street 1 | Intersection Street 2 | Address Type | City | Landmark | Facility Type | Status | Due Date | Resolution Action Updated Date | Community Board | Borough | X Coordinate (State Plane) | Y Coordinate (State Plane) | Park Facility Name | Park Borough | School Name | School Number | School Region | School Code | School Phone Number | School Address | School City | School State | School Zip | School Not Found | School or Citywide Complaint | Vehicle Type | Taxi Company Borough | Taxi Pick Up Location | Bridge Highway Name | Bridge Highway Direction | Road Ramp | Bridge Highway Segment | Garage Lot Name | Ferry Direction | Ferry Terminal Name | Latitude | Longitude | Location |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
i64 | str | str | str | str | str | str | str | str | str | str | str | str | str | str | str | str | str | str | str | str | str | str | str | i64 | i64 | str | str | str | str | str | str | str | str | str | str | str | str | str | str | str | str | str | str | str | str | str | str | str | f64 | f64 | str |
26589651 | "10/31/2013 02:… | null | "NYPD" | "New York City … | "Noise - Street… | "Loud Talking" | "Street/Sidewal… | "11432" | "90-03 169 STRE… | "169 STREET" | "90 AVENUE" | "91 AVENUE" | null | null | "ADDRESS" | "JAMAICA" | null | "Precinct" | "Assigned" | "10/31/2013 10:… | "10/31/2013 02:… | "12 QUEENS" | "QUEENS" | 1042027 | 197389 | "Unspecified" | "QUEENS" | "Unspecified" | "Unspecified" | "Unspecified" | "Unspecified" | "Unspecified" | "Unspecified" | "Unspecified" | "Unspecified" | "Unspecified" | "N" | null | null | null | null | null | null | null | null | null | null | null | 40.708275 | -73.791604 | "(40.7082753259… |
26593698 | "10/31/2013 02:… | null | "NYPD" | "New York City … | "Illegal Parkin… | "Commercial Ove… | "Street/Sidewal… | "11378" | "58 AVENUE" | "58 AVENUE" | "58 PLACE" | "59 STREET" | null | null | "BLOCKFACE" | "MASPETH" | null | "Precinct" | "Open" | "10/31/2013 10:… | null | "05 QUEENS" | "QUEENS" | 1009349 | 201984 | "Unspecified" | "QUEENS" | "Unspecified" | "Unspecified" | "Unspecified" | "Unspecified" | "Unspecified" | "Unspecified" | "Unspecified" | "Unspecified" | "Unspecified" | "N" | null | null | null | null | null | null | null | null | null | null | null | 40.721041 | -73.909453 | "(40.7210405356… |
26594139 | "10/31/2013 02:… | "10/31/2013 02:… | "NYPD" | "New York City … | "Noise - Commer… | "Loud Music/Par… | "Club/Bar/Resta… | "10032" | "4060 BROADWAY" | "BROADWAY" | "WEST 171 STREE… | "WEST 172 STREE… | null | null | "ADDRESS" | "NEW YORK" | null | "Precinct" | "Closed" | "10/31/2013 10:… | "10/31/2013 02:… | "12 MANHATTAN" | "MANHATTAN" | 1001088 | 246531 | "Unspecified" | "MANHATTAN" | "Unspecified" | "Unspecified" | "Unspecified" | "Unspecified" | "Unspecified" | "Unspecified" | "Unspecified" | "Unspecified" | "Unspecified" | "N" | null | null | null | null | null | null | null | null | null | null | null | 40.84333 | -73.939144 | "(40.8433297546… |
26595721 | "10/31/2013 01:… | "10/31/2013 02:… | "NYPD" | "New York City … | "Noise - Vehicl… | "Car/Truck Horn… | "Street/Sidewal… | "10023" | "WEST 72 STREET… | "WEST 72 STREET… | "COLUMBUS AVENU… | "AMSTERDAM AVEN… | null | null | "BLOCKFACE" | "NEW YORK" | null | "Precinct" | "Closed" | "10/31/2013 09:… | "10/31/2013 02:… | "07 MANHATTAN" | "MANHATTAN" | 989730 | 222727 | "Unspecified" | "MANHATTAN" | "Unspecified" | "Unspecified" | "Unspecified" | "Unspecified" | "Unspecified" | "Unspecified" | "Unspecified" | "Unspecified" | "Unspecified" | "N" | null | null | null | null | null | null | null | null | null | null | null | 40.778009 | -73.980213 | "(40.7780087446… |
26590930 | "10/31/2013 01:… | null | "DOHMH" | "Department of … | "Rodent" | "Condition Attr… | "Vacant Lot" | "10027" | "WEST 124 STREE… | "WEST 124 STREE… | "LENOX AVENUE" | "ADAM CLAYTON P… | null | null | "BLOCKFACE" | "NEW YORK" | null | "N/A" | "Pending" | "11/30/2013 01:… | "10/31/2013 01:… | "10 MANHATTAN" | "MANHATTAN" | 998815 | 233545 | "Unspecified" | "MANHATTAN" | "Unspecified" | "Unspecified" | "Unspecified" | "Unspecified" | "Unspecified" | "Unspecified" | "Unspecified" | "Unspecified" | "Unspecified" | "N" | null | null | null | null | null | null | null | null | null | null | null | 40.807691 | -73.947387 | "(40.8076909270… |
To select a column, we index with the name of the column, like this:
complaints['Complaint Type']
Complaint Type |
---|
str |
"Noise - Street… |
"Illegal Parkin… |
"Noise - Commer… |
"Noise - Vehicl… |
"Rodent" |
"Noise - Commer… |
"Blocked Drivew… |
"Noise - Commer… |
"Noise - Commer… |
"Noise - Commer… |
"Noise - House … |
"Noise - Commer… |
… |
"Noise" |
"Noise - Commer… |
"Noise - Street… |
"Noise" |
"Noise - Commer… |
"Water System" |
"Water System" |
"Maintenance or… |
"Illegal Parkin… |
"Noise - Street… |
"Noise - Commer… |
"Blocked Drivew… |
To get the first 5 rows of a dataframe, we can also use a slice: df[:5]
. But it's generally recommended to use the head()
function.
This is a great way to get a sense for what kind of information is in the dataframe -- take a minute to look at the contents and get a feel for this dataset.
complaints.head()
Unique Key | Created Date | Closed Date | Agency | Agency Name | Complaint Type | Descriptor | Location Type | Incident Zip | Incident Address | Street Name | Cross Street 1 | Cross Street 2 | Intersection Street 1 | Intersection Street 2 | Address Type | City | Landmark | Facility Type | Status | Due Date | Resolution Action Updated Date | Community Board | Borough | X Coordinate (State Plane) | Y Coordinate (State Plane) | Park Facility Name | Park Borough | School Name | School Number | School Region | School Code | School Phone Number | School Address | School City | School State | School Zip | School Not Found | School or Citywide Complaint | Vehicle Type | Taxi Company Borough | Taxi Pick Up Location | Bridge Highway Name | Bridge Highway Direction | Road Ramp | Bridge Highway Segment | Garage Lot Name | Ferry Direction | Ferry Terminal Name | Latitude | Longitude | Location |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
i64 | str | str | str | str | str | str | str | str | str | str | str | str | str | str | str | str | str | str | str | str | str | str | str | i64 | i64 | str | str | str | str | str | str | str | str | str | str | str | str | str | str | str | str | str | str | str | str | str | str | str | f64 | f64 | str |
26589651 | "10/31/2013 02:… | null | "NYPD" | "New York City … | "Noise - Street… | "Loud Talking" | "Street/Sidewal… | "11432" | "90-03 169 STRE… | "169 STREET" | "90 AVENUE" | "91 AVENUE" | null | null | "ADDRESS" | "JAMAICA" | null | "Precinct" | "Assigned" | "10/31/2013 10:… | "10/31/2013 02:… | "12 QUEENS" | "QUEENS" | 1042027 | 197389 | "Unspecified" | "QUEENS" | "Unspecified" | "Unspecified" | "Unspecified" | "Unspecified" | "Unspecified" | "Unspecified" | "Unspecified" | "Unspecified" | "Unspecified" | "N" | null | null | null | null | null | null | null | null | null | null | null | 40.708275 | -73.791604 | "(40.7082753259… |
26593698 | "10/31/2013 02:… | null | "NYPD" | "New York City … | "Illegal Parkin… | "Commercial Ove… | "Street/Sidewal… | "11378" | "58 AVENUE" | "58 AVENUE" | "58 PLACE" | "59 STREET" | null | null | "BLOCKFACE" | "MASPETH" | null | "Precinct" | "Open" | "10/31/2013 10:… | null | "05 QUEENS" | "QUEENS" | 1009349 | 201984 | "Unspecified" | "QUEENS" | "Unspecified" | "Unspecified" | "Unspecified" | "Unspecified" | "Unspecified" | "Unspecified" | "Unspecified" | "Unspecified" | "Unspecified" | "N" | null | null | null | null | null | null | null | null | null | null | null | 40.721041 | -73.909453 | "(40.7210405356… |
26594139 | "10/31/2013 02:… | "10/31/2013 02:… | "NYPD" | "New York City … | "Noise - Commer… | "Loud Music/Par… | "Club/Bar/Resta… | "10032" | "4060 BROADWAY" | "BROADWAY" | "WEST 171 STREE… | "WEST 172 STREE… | null | null | "ADDRESS" | "NEW YORK" | null | "Precinct" | "Closed" | "10/31/2013 10:… | "10/31/2013 02:… | "12 MANHATTAN" | "MANHATTAN" | 1001088 | 246531 | "Unspecified" | "MANHATTAN" | "Unspecified" | "Unspecified" | "Unspecified" | "Unspecified" | "Unspecified" | "Unspecified" | "Unspecified" | "Unspecified" | "Unspecified" | "N" | null | null | null | null | null | null | null | null | null | null | null | 40.84333 | -73.939144 | "(40.8433297546… |
26595721 | "10/31/2013 01:… | "10/31/2013 02:… | "NYPD" | "New York City … | "Noise - Vehicl… | "Car/Truck Horn… | "Street/Sidewal… | "10023" | "WEST 72 STREET… | "WEST 72 STREET… | "COLUMBUS AVENU… | "AMSTERDAM AVEN… | null | null | "BLOCKFACE" | "NEW YORK" | null | "Precinct" | "Closed" | "10/31/2013 09:… | "10/31/2013 02:… | "07 MANHATTAN" | "MANHATTAN" | 989730 | 222727 | "Unspecified" | "MANHATTAN" | "Unspecified" | "Unspecified" | "Unspecified" | "Unspecified" | "Unspecified" | "Unspecified" | "Unspecified" | "Unspecified" | "Unspecified" | "N" | null | null | null | null | null | null | null | null | null | null | null | 40.778009 | -73.980213 | "(40.7780087446… |
26590930 | "10/31/2013 01:… | null | "DOHMH" | "Department of … | "Rodent" | "Condition Attr… | "Vacant Lot" | "10027" | "WEST 124 STREE… | "WEST 124 STREE… | "LENOX AVENUE" | "ADAM CLAYTON P… | null | null | "BLOCKFACE" | "NEW YORK" | null | "N/A" | "Pending" | "11/30/2013 01:… | "10/31/2013 01:… | "10 MANHATTAN" | "MANHATTAN" | 998815 | 233545 | "Unspecified" | "MANHATTAN" | "Unspecified" | "Unspecified" | "Unspecified" | "Unspecified" | "Unspecified" | "Unspecified" | "Unspecified" | "Unspecified" | "Unspecified" | "N" | null | null | null | null | null | null | null | null | null | null | null | 40.807691 | -73.947387 | "(40.8076909270… |
We can combine these to get the first 5 rows of a column:
complaints['Complaint Type'].head()
Complaint Type |
---|
str |
"Noise - Street… |
"Illegal Parkin… |
"Noise - Commer… |
"Noise - Vehicl… |
"Rodent" |
"Noise - Commer… |
"Blocked Drivew… |
"Noise - Commer… |
"Noise - Commer… |
"Noise - Commer… |
and it doesn't matter which direction we do it in:
complaints.head()['Complaint Type']
Complaint Type |
---|
str |
"Noise - Street… |
"Illegal Parkin… |
"Noise - Commer… |
"Noise - Vehicl… |
"Rodent" |
What if we just want to know the complaint type and the borough, but not the rest of the information? Polars makes it really easy to select a subset of the columns: just index with list of columns you want.
complaints[['Complaint Type', 'Borough']]
Complaint Type | Borough |
---|---|
str | str |
"Noise - Street… | "QUEENS" |
"Illegal Parkin… | "QUEENS" |
"Noise - Commer… | "MANHATTAN" |
"Noise - Vehicl… | "MANHATTAN" |
"Rodent" | "MANHATTAN" |
"Noise - Commer… | "QUEENS" |
"Blocked Drivew… | "QUEENS" |
"Noise - Commer… | "QUEENS" |
"Noise - Commer… | "MANHATTAN" |
"Noise - Commer… | "BROOKLYN" |
"Noise - House … | "BROOKLYN" |
"Noise - Commer… | "MANHATTAN" |
… | … |
"Noise" | "MANHATTAN" |
"Noise - Commer… | "BROOKLYN" |
"Noise - Street… | "MANHATTAN" |
"Noise" | "MANHATTAN" |
"Noise - Commer… | "QUEENS" |
"Water System" | "MANHATTAN" |
"Water System" | "MANHATTAN" |
"Maintenance or… | "BROOKLYN" |
"Illegal Parkin… | "QUEENS" |
"Noise - Street… | "MANHATTAN" |
"Noise - Commer… | "BROOKLYN" |
"Blocked Drivew… | "BROOKLYN" |
That showed us a summary, and then we can look at the first 10 rows:
complaints[['Complaint Type', 'Borough']].head(10)
Complaint Type | Borough |
---|---|
str | str |
"Noise - Street… | "QUEENS" |
"Illegal Parkin… | "QUEENS" |
"Noise - Commer… | "MANHATTAN" |
"Noise - Vehicl… | "MANHATTAN" |
"Rodent" | "MANHATTAN" |
"Noise - Commer… | "QUEENS" |
"Blocked Drivew… | "QUEENS" |
"Noise - Commer… | "QUEENS" |
"Noise - Commer… | "MANHATTAN" |
"Noise - Commer… | "BROOKLYN" |
This is a really easy question to answer! There's a value_counts()
method that we can use:
complaints['Complaint Type'].value_counts(sort=True)
Complaint Type | counts |
---|---|
str | u32 |
"HEATING" | 14200 |
"GENERAL CONSTR… | 7471 |
"Street Light C… | 7117 |
"DOF Literature… | 5797 |
"PLUMBING" | 5373 |
"PAINT - PLASTE… | 5149 |
"Blocked Drivew… | 4590 |
"NONCONST" | 3998 |
"Street Conditi… | 3473 |
"Illegal Parkin… | 3343 |
"Noise" | 3321 |
"Traffic Signal… | 3145 |
… | … |
"Window Guard" | 2 |
"Snow" | 1 |
"Stalled Sites" | 1 |
"DHS Income Sav… | 1 |
"Tunnel Conditi… | 1 |
"Municipal Park… | 1 |
"Trans Fat" | 1 |
"Open Flame Per… | 1 |
"DWD" | 1 |
"Highway Sign -… | 1 |
"Ferry Permit" | 1 |
"X-Ray Machine/… | 1 |
If we just wanted the top 10 most common complaints, we can use the top_k
function like so:
complaint_counts = complaints['Complaint Type'].value_counts().top_k(10, by='counts')
complaint_counts
Complaint Type | counts |
---|---|
str | u32 |
"HEATING" | 14200 |
"GENERAL CONSTR… | 7471 |
"Street Light C… | 7117 |
"DOF Literature… | 5797 |
"PLUMBING" | 5373 |
"PAINT - PLASTE… | 5149 |
"Blocked Drivew… | 4590 |
"NONCONST" | 3998 |
"Street Conditi… | 3473 |
"Illegal Parkin… | 3343 |
But it gets better! We can plot them!
plt.xticks(rotation=45)
sbn.barplot(complaint_counts, x='Complaint Type', y='counts')