A practical introduction to IPython Notebook & pandas

Hi! I'm Julia.

Right now: Hacker School.
Before: Data scientist.

I'm on the internet at http://jvns.ca, http://twitter.com/b0rk

Follow along by downloading this presentation and running the code yourself:

Setup:

  • You can ask me any question, any time.
  • It's the end of the day
  • I'd rather cover less material and have you understand more of it.
  • There will be exercises! Pair up with the person next to you and do the exercises.
In [1]:
%pylab inline
import pandas as pd
pd.set_option('display.mpl_style', 'default')
figsize(15, 6)
pd.set_option('display.line_width', 4000)
pd.set_option('display.max_columns', 100)
Populating the interactive namespace from numpy and matplotlib

Goal (in 6 months)

Know how to use IPython Notebook + pandas to answer your questions about data

  • How to start IPython Notebook
  • How to read data into pandas
  • How to do simple manipulations on pandas dataframes

Goal (Today)

Know how to use pandas to answer some specific questions about a dataset

Roadmap:

  1. Demo with rats
  2. Dataframes: what makes pandas powerful
  3. Selecting data from a dataframe
  4. Time series and indexes and resampling
  5. Groupby + aggregate

Some notes about installation:

Don't do this:

sudo apt-get install ipython-notebook

Instead, do this:

pip install ipython tornado pyzmq

or install Anaconda from http://store.continuum.io (what I do)

You can start IPython notebook by running

ipython notebook --pylab inline

First: Read the data

In [78]:
# Download and read the data
!wget "http://bit.ly/311-data-tar-gz"
!tar -xzf "311-data.tar.gz" # wget saves the file under different names on different systems,
!tar -xzf "311-data-tar-gz" # so try both names; one of these two commands is expected to fail
orig_data = pd.read_csv('./311-service-requests.csv', nrows=100000, parse_dates=['Created Date'])
--2013-11-08 15:02:41--  http://bit.ly/311-data-tar-gz
Resolving bit.ly (bit.ly)... 69.58.188.40, 69.58.188.39
Connecting to bit.ly (bit.ly)|69.58.188.40|:80... connected.
HTTP request sent, awaiting response... 301 Moved
Location: https://dl.dropboxusercontent.com/u/115162019/311-data.tar.gz [following]
--2013-11-08 15:02:41--  https://dl.dropboxusercontent.com/u/115162019/311-data.tar.gz
Resolving dl.dropboxusercontent.com (dl.dropboxusercontent.com)... 184.73.228.95
Connecting to dl.dropboxusercontent.com (dl.dropboxusercontent.com)|184.73.228.95|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 8492118 (8.1M) [application/octet-stream]
Saving to: `311-data-tar-gz.2'

100%[======================================>] 8,492,118   1.61M/s   in 6.1s    

2013-11-08 15:02:48 (1.33 MB/s) - `311-data-tar-gz.2' saved [8492118/8492118]

tar (child): 311-data.tar.gz: Cannot open: No such file or directory
tar (child): Error is not recoverable: exiting now
tar: Child returned status 2
tar: Error is not recoverable: exiting now
In [81]:
plot(orig_data['Longitude'], orig_data['Latitude'], '.', color="purple")
Out[81]:
[<matplotlib.lines.Line2D at 0xafb9fd0>]

Example 1: Graph the number of noise complaints each hour in New York

In [3]:
complaints = orig_data[['Created Date', 'Complaint Type']]
noise_complaints = complaints[complaints['Complaint Type'] == 'Noise - Street/Sidewalk']
noise_complaints.set_index('Created Date').sort_index().resample('H', how=len).plot()
Out[3]:
<matplotlib.axes.AxesSubplot at 0x6e2e190>

Example 2: What are the most common complaint types?

In [4]:
orig_data['Complaint Type'].value_counts()[:20].plot(kind='bar')
Out[4]:
<matplotlib.axes.AxesSubplot at 0x4bc6e90>

Example 3: Does every zip code complain about the same things?

In [5]:
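# The 10 zip codes with the most complaints
popular_zip_codes = orig_data['Incident Zip'].value_counts()[:10].index
# Count complaints for each (zip code, complaint type) pair: one row per zip, one column per type
zipcode_incident_table = orig_data.groupby(['Incident Zip', 'Complaint Type'])['Descriptor'].aggregate(len).unstack()
top_5_complaints = zipcode_incident_table.transpose()[popular_zip_codes]
# Divide each zip code's column by its total, so we compare fractions instead of raw counts
normalized_complaints = top_5_complaints / top_5_complaints.sum()
# Keep the 5 complaint types that are most common in zip code 11226 and plot them for every zip
normalized_complaints.dropna(how='any').sort('11226', ascending=False)[:5].transpose().plot(kind='bar')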
Out[5]:
<matplotlib.axes.AxesSubplot at 0x32b5850>

Roadmap:

  1. Numpy: what makes pandas fast
  2. Dataframes: what makes pandas powerful
  3. Selecting data from a dataframe
  4. Time series and indexes
  5. Graphing

1. Numpy: What makes pandas fast

In [6]:
import numpy as np

How to create a numpy array

In [7]:
np.array([1,2,8.0, 3])
Out[7]:
array([ 1.,  2.,  8.,  3.])
In [8]:
np.arange(10)
Out[8]:
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
In [9]:
# Generate random numbers
np.random.random(10)
Out[9]:
array([ 0.28288364,  0.82679209,  0.24453135,  0.1364644 ,  0.54386584,
        0.4973374 ,  0.47602631,  0.42742955,  0.57587563,  0.97857772])

How to operate on numpy arrays

In [10]:
prices = np.array([31, 40, 12, 40])
prices
Out[10]:
array([31, 40, 12, 40])
In [11]:
# Change the type
prices.astype(np.float32)
Out[11]:
array([ 31.,  40.,  12.,  40.], dtype=float32)
In [12]:
prices.astype(np.int64)
Out[12]:
array([31, 40, 12, 40])
In [13]:
# Find which ones are even
prices % 2 == 0
Out[13]:
array([False,  True,  True,  True], dtype=bool)
In [14]:
# Get only the even prices
prices[prices % 2 == 0]
Out[14]:
array([40, 12, 40])
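
You can also combine several conditions. Here's a minimal sketch, assuming Python 3 and a recent numpy; the names `is_even` and `is_big` are made up for illustration. Each comparison gives a boolean array, and `&` / `|` combine them element by element. The parentheses matter, because `&` and `|` bind more tightly than `==` and `>`.

```python
import numpy as np

prices = np.array([31, 40, 12, 40])

# Each comparison yields a boolean array with one entry per price
is_even = prices % 2 == 0
is_big = prices > 20

# & is element-wise "and"; | is element-wise "or"
even_and_big = prices[is_even & is_big]
even_or_big = prices[(prices % 2 == 0) | (prices > 20)]

print(even_and_big)
print(even_or_big)
```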

More array operations

In [15]:
# Find the mean
np.mean(prices)
Out[15]:
30.75
In [16]:
prices * prices
Out[16]:
array([ 961, 1600,  144, 1600])

Vectorized operations: Don't do this:

In [17]:
v1 = np.array([1, 2, 3, 4, 5])
v2 = np.array([1, 2, 3, 8, 9])
In [18]:
result = np.zeros_like(v1)
for i in xrange(len(v1)):
    result[i] = 2 * v1[i] + 3 * v2[i]
print result
[ 5 10 15 32 37]

Do this instead:

In [19]:
result = 2 * v1 + 3 * v2
print result
[ 5 10 15 32 37]
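
If you're on Python 3, the cells above need `range` instead of `xrange` and `print(...)` with parentheses. Here's a small sketch under that assumption that runs both versions side by side and checks that they agree; the names `looped` and `vectorized` are mine.

```python
import numpy as np

v1 = np.array([1, 2, 3, 4, 5])
v2 = np.array([1, 2, 3, 8, 9])

# Slow way: an explicit Python loop over every index
looped = np.zeros_like(v1)
for i in range(len(v1)):
    looped[i] = 2 * v1[i] + 3 * v2[i]

# Fast way: one vectorized expression, no Python-level loop
vectorized = 2 * v1 + 3 * v2

# Both approaches should agree element by element
print(np.array_equal(looped, vectorized))
```

The loop and the one-liner compute the same thing; the difference is that the vectorized version does its looping inside numpy's compiled code.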

Exercise 1: Compute the mean of the numbers 1-1000000

When you're done, try some harder things:

  • Generate some random numbers and find the mean or standard deviation.
  • Play around some more with creating arrays.
In [20]:
# Your code here

Exercise 2: Find all the elements in the prices array that are divisible by 6

When you're done:

  • Find all the cubes less than 10000.
In [21]:
# Your code here

What is pandas?

A few awesome things about pandas

  • Really, really, really, really good at time series
  • Can import Excel files (!!!)
  • Fast (joining dataframes, etc.)

The dataframe is what lets you manipulate data easily -- it's basically the whole reason for pandas. It's a powerful concept borrowed from the statistical computing language R.

If you don't know R, you can think of it like a database table (it has rows and columns), or like a table of numbers.

2. Dataframes: what makes pandas powerful

In [22]:
people = pd.read_csv('tiny.csv')
people
Out[22]:
name age height
0 Scott 12 61
1 Lea 13 73
2 Julia 14 66
3 Kate 15 62
4 Rishi 18 70

This is like a SQL database table, or an R dataframe. There are 3 columns, called 'name', 'age', and 'height', and 5 rows.
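
If you don't have tiny.csv handy, here's a sketch that builds the same table by hand, assuming Python 3 and a recent pandas. The values are copied from the output above; `shape` and `columns` let you check the number of rows and columns yourself.

```python
import pandas as pd

# Same values as the tiny.csv table shown above, typed in by hand
people = pd.DataFrame({
    'name': ['Scott', 'Lea', 'Julia', 'Kate', 'Rishi'],
    'age': [12, 13, 14, 15, 18],
    'height': [61, 73, 66, 62, 70],
})

print(people.shape)          # (number of rows, number of columns)
print(list(people.columns))  # the column names
```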

3. Selecting data from a dataframe

I want you to know about this because you almost always want only a subset of the data you're working on. We are going to look at a CSV with 52 columns, from which we loaded the first 100,000 rows.

In [23]:
# Load the first 5 rows of our CSV
small_requests = pd.read_csv('./311-service-requests.csv', nrows=5)
In [24]:
# How to get a column
small_requests['Complaint Type']
Out[24]:
0    Noise - Street/Sidewalk
1            Illegal Parking
2         Noise - Commercial
3            Noise - Vehicle
4                     Rodent
Name: Complaint Type, dtype: object
In [25]:
# How to get a subset of the columns
small_requests[['Complaint Type', 'Created Date']]
Out[25]:
Complaint Type Created Date
0 Noise - Street/Sidewalk 10/31/2013 02:08:41 AM
1 Illegal Parking 10/31/2013 02:01:04 AM
2 Noise - Commercial 10/31/2013 02:00:24 AM
3 Noise - Vehicle 10/31/2013 01:56:23 AM
4 Rodent 10/31/2013 01:53:44 AM
In [26]:
# How to get 3 rows
small_requests[:3]
Out[26]:
Unique Key Created Date Closed Date Agency Agency Name Complaint Type Descriptor Location Type Incident Zip Incident Address Street Name Cross Street 1 Cross Street 2 Intersection Street 1 Intersection Street 2 Address Type City Landmark Facility Type Status Due Date Resolution Action Updated Date Community Board Borough X Coordinate (State Plane) Y Coordinate (State Plane) Park Facility Name Park Borough School Name School Number School Region School Code School Phone Number School Address School City School State School Zip School Not Found School or Citywide Complaint Vehicle Type Taxi Company Borough Taxi Pick Up Location Bridge Highway Name Bridge Highway Direction Road Ramp Bridge Highway Segment Garage Lot Name Ferry Direction Ferry Terminal Name Latitude Longitude Location
0 26589651 10/31/2013 02:08:41 AM NaN NYPD New York City Police Department Noise - Street/Sidewalk Loud Talking Street/Sidewalk 11432 90-03 169 STREET 169 STREET 90 AVENUE 91 AVENUE NaN NaN ADDRESS JAMAICA NaN Precinct Assigned 10/31/2013 10:08:41 AM 10/31/2013 02:35:17 AM 12 QUEENS QUEENS 1042027 197389 Unspecified QUEENS Unspecified Unspecified Unspecified Unspecified Unspecified Unspecified Unspecified Unspecified Unspecified N NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 40.708275 -73.791604 (40.70827532593202, -73.79160395779721)
1 26593698 10/31/2013 02:01:04 AM NaN NYPD New York City Police Department Illegal Parking Commercial Overnight Parking Street/Sidewalk 11378 58 AVENUE 58 AVENUE 58 PLACE 59 STREET NaN NaN BLOCKFACE MASPETH NaN Precinct Open 10/31/2013 10:01:04 AM NaN 05 QUEENS QUEENS 1009349 201984 Unspecified QUEENS Unspecified Unspecified Unspecified Unspecified Unspecified Unspecified Unspecified Unspecified Unspecified N NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 40.721041 -73.909453 (40.721040535628305, -73.90945306791765)
2 26594139 10/31/2013 02:00:24 AM 10/31/2013 02:40:32 AM NYPD New York City Police Department Noise - Commercial Loud Music/Party Club/Bar/Restaurant 10032 4060 BROADWAY BROADWAY WEST 171 STREET WEST 172 STREET NaN NaN ADDRESS NEW YORK NaN Precinct Closed 10/31/2013 10:00:24 AM 10/31/2013 02:39:42 AM 12 MANHATTAN MANHATTAN 1001088 246531 Unspecified MANHATTAN Unspecified Unspecified Unspecified Unspecified Unspecified Unspecified Unspecified Unspecified Unspecified N NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 40.843330 -73.939144 (40.84332975466513, -73.93914371913482)

Get the first 3 rows of a column

In [27]:
small_requests['Agency Name'][:3]
Out[27]:
0    New York City Police Department
1    New York City Police Department
2    New York City Police Department
Name: Agency Name, dtype: object
In [28]:
small_requests[:3]['Agency Name']
Out[28]:
0    New York City Police Department
1    New York City Police Department
2    New York City Police Department
Name: Agency Name, dtype: object

Compare a column to a value

In [29]:
small_requests['Complaint Type']
Out[29]:
0    Noise - Street/Sidewalk
1            Illegal Parking
2         Noise - Commercial
3            Noise - Vehicle
4                     Rodent
Name: Complaint Type, dtype: object
In [30]:
# This is like our numpy example from before
small_requests['Complaint Type'] == 'Noise - Street/Sidewalk'
Out[30]:
0     True
1    False
2    False
3    False
4    False
Name: Complaint Type, dtype: bool

That's numpy in action! Using == on a column of a dataframe gives us a series of True and False values, one for each row.
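
A series of True and False values is useful on its own, too. Here's a minimal sketch, assuming Python 3 and a recent pandas, with the five complaint types typed in by hand; the names `is_street_noise`, `n_matches` and `others` are made up for illustration.

```python
import pandas as pd

# The five complaint types from small_requests, typed in by hand
complaint_type = pd.Series([
    'Noise - Street/Sidewalk',
    'Illegal Parking',
    'Noise - Commercial',
    'Noise - Vehicle',
    'Rodent',
], name='Complaint Type')

is_street_noise = complaint_type == 'Noise - Street/Sidewalk'

# True counts as 1 and False as 0, so sum() counts the matches
n_matches = is_street_noise.sum()

# ~ flips every True/False, giving "everything except street noise"
others = complaint_type[~is_street_noise]

print(n_matches)
print(others.tolist())
```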

Selecting only the rows with noise complaints

In [31]:
# This is like our numpy example earlier
noise_complaints = small_requests[small_requests['Complaint Type'] == 'Noise - Street/Sidewalk']
noise_complaints
Out[31]:
Unique Key Created Date Closed Date Agency Agency Name Complaint Type Descriptor Location Type Incident Zip Incident Address Street Name Cross Street 1 Cross Street 2 Intersection Street 1 Intersection Street 2 Address Type City Landmark Facility Type Status Due Date Resolution Action Updated Date Community Board Borough X Coordinate (State Plane) Y Coordinate (State Plane) Park Facility Name Park Borough School Name School Number School Region School Code School Phone Number School Address School City School State School Zip School Not Found School or Citywide Complaint Vehicle Type Taxi Company Borough Taxi Pick Up Location Bridge Highway Name Bridge Highway Direction Road Ramp Bridge Highway Segment Garage Lot Name Ferry Direction Ferry Terminal Name Latitude Longitude Location
0 26589651 10/31/2013 02:08:41 AM NaN NYPD New York City Police Department Noise - Street/Sidewalk Loud Talking Street/Sidewalk 11432 90-03 169 STREET 169 STREET 90 AVENUE 91 AVENUE NaN NaN ADDRESS JAMAICA NaN Precinct Assigned 10/31/2013 10:08:41 AM 10/31/2013 02:35:17 AM 12 QUEENS QUEENS 1042027 197389 Unspecified QUEENS Unspecified Unspecified Unspecified Unspecified Unspecified Unspecified Unspecified Unspecified Unspecified N NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 40.708275 -73.791604 (40.70827532593202, -73.79160395779721)

Every dataframe has an index: an integer, a date, or some other label associated with each row.

In [32]:
# How to get a specific row
small_requests.ix[0]
Out[32]:
Unique Key                                                       26589651
Created Date                                       10/31/2013 02:08:41 AM
Closed Date                                                           NaN
Agency                                                               NYPD
Agency Name                               New York City Police Department
Complaint Type                                    Noise - Street/Sidewalk
Descriptor                                                   Loud Talking
Location Type                                             Street/Sidewalk
Incident Zip                                                        11432
Incident Address                                         90-03 169 STREET
Street Name                                                    169 STREET
Cross Street 1                                                  90 AVENUE
Cross Street 2                                                  91 AVENUE
Intersection Street 1                                                 NaN
Intersection Street 2                                                 NaN
Address Type                                                      ADDRESS
City                                                              JAMAICA
Landmark                                                              NaN
Facility Type                                                    Precinct
Status                                                           Assigned
Due Date                                           10/31/2013 10:08:41 AM
Resolution Action Updated Date                     10/31/2013 02:35:17 AM
Community Board                                                 12 QUEENS
Borough                                                            QUEENS
X Coordinate (State Plane)                                        1042027
Y Coordinate (State Plane)                                         197389
Park Facility Name                                            Unspecified
Park Borough                                                       QUEENS
School Name                                                   Unspecified
School Number                                                 Unspecified
School Region                                                 Unspecified
School Code                                                   Unspecified
School Phone Number                                           Unspecified
School Address                                                Unspecified
School City                                                   Unspecified
School State                                                  Unspecified
School Zip                                                    Unspecified
School Not Found                                                        N
School or Citywide Complaint                                          NaN
Vehicle Type                                                          NaN
Taxi Company Borough                                                  NaN
Taxi Pick Up Location                                                 NaN
Bridge Highway Name                                                   NaN
Bridge Highway Direction                                              NaN
Road Ramp                                                             NaN
Bridge Highway Segment                                                NaN
Garage Lot Name                                                       NaN
Ferry Direction                                                       NaN
Ferry Terminal Name                                                   NaN
Latitude                                                         40.70828
Longitude                                                        -73.7916
Location                          (40.70827532593202, -73.79160395779721)
Name: 0, Length: 52, dtype: object
In [33]:
# How not to get a row
small_requests[0]
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-33-cb28920bbdf9> in <module>()
      1 # How not to get a row
----> 2 small_requests[0]

/opt/anaconda/lib/python2.7/site-packages/pandas/core/frame.pyc in __getitem__(self, key)
   1926         else:
   1927             # get column
-> 1928             return self._get_item_cache(key)
   1929 
   1930     def _getitem_slice(self, key):

/opt/anaconda/lib/python2.7/site-packages/pandas/core/generic.pyc in _get_item_cache(self, item)
    568             return cache[item]
    569         except Exception:
--> 570             values = self._data.get(item)
    571             res = self._box_item_values(item, values)
    572             cache[item] = res

/opt/anaconda/lib/python2.7/site-packages/pandas/core/internals.pyc in get(self, item)
   1381 
   1382     def get(self, item):
-> 1383         _, block = self._find_block(item)
   1384         return block.get(item)
   1385 

/opt/anaconda/lib/python2.7/site-packages/pandas/core/internals.pyc in _find_block(self, item)
   1523 
   1524     def _find_block(self, item):
-> 1525         self._check_have(item)
   1526         for i, block in enumerate(self.blocks):
   1527             if item in block:

/opt/anaconda/lib/python2.7/site-packages/pandas/core/internals.pyc in _check_have(self, item)
   1530     def _check_have(self, item):
   1531         if item not in self.items:
-> 1532             raise KeyError('no item named %s' % com.pprint_thing(item))
   1533 
   1534     def reindex_axis(self, new_axis, method=None, axis=0, copy=True):

KeyError: u'no item named 0'
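
A note if you're on a newer pandas: `.ix` worked in the version used for this talk, but it was later deprecated and is gone as of pandas 1.0. Here's a sketch of the replacements, assuming Python 3 and a recent pandas. The index labels 'a', 'b', 'c' are made up so that you can tell "by label" from "by position".

```python
import pandas as pd

people = pd.DataFrame({
    'name': ['Scott', 'Lea', 'Julia'],
    'age': [12, 13, 14],
}, index=['a', 'b', 'c'])  # made-up labels, to tell label from position

# .iloc selects by position: the first row, whatever its label is
first_by_position = people.iloc[0]

# .loc selects by index label
first_by_label = people.loc['a']

# Plain [] looks for a *column* named 0, so it raises KeyError
try:
    people[0]
    got_key_error = False
except KeyError:
    got_key_error = True

print(first_by_position['name'], first_by_label['name'], got_key_error)
```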

Exercise 2: Selecting things from dataframes

  • Select all the people with even ages from people
  • Find out how many complaints were filed with the NYPD
  • The zip code here is 10007. How many complaints were filed here?
  • Find out which values the Descriptor column can have when the Complaint Type is "Noise - Street/Sidewalk"
In [34]:
# Your code here

Back to our example

In [35]:
# We ran this at the beginning, so we don't have to run it again. Just here as a reminder.
#orig_data = pd.read_csv('./311-service-requests.csv', nrows=100000, parse_dates=['Created Date'])
In [36]:
complaints = orig_data[['Created Date', 'Complaint Type']]
noise_complaints = complaints[complaints['Complaint Type'] == 'Noise - Street/Sidewalk']
noise_complaints.set_index('Created Date').sort_index().resample('H', how=len).plot()
Out[36]:
<matplotlib.axes.AxesSubplot at 0x662ae90>

Indexes

In [37]:
noise_complaints[:3]
Out[37]:
Created Date Complaint Type
0 2013-10-31 02:08:41 Noise - Street/Sidewalk
16 2013-10-31 00:54:03 Noise - Street/Sidewalk
25 2013-10-31 00:35:18 Noise - Street/Sidewalk
In [38]:
noise_complaints = noise_complaints.set_index('Created Date')
In [39]:
noise_complaints[:3]
Out[39]:
Complaint Type
Created Date
2013-10-31 02:08:41 Noise - Street/Sidewalk
2013-10-31 00:54:03 Noise - Street/Sidewalk
2013-10-31 00:35:18 Noise - Street/Sidewalk

Sorting the index

Pandas is awesome for date time index stuff. It was built for dealing with financial data, which is ALL TIME SERIES.

In [40]:
noise_complaints = noise_complaints.sort_index()
noise_complaints[:3]
Out[40]:
Complaint Type
Created Date
2013-10-07 15:45:56 Noise - Street/Sidewalk
2013-10-07 16:17:41 Noise - Street/Sidewalk
2013-10-07 16:58:08 Noise - Street/Sidewalk
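
One thing a sorted datetime index makes easy is selecting rows by date. Here's a minimal sketch, assuming Python 3 and a recent pandas; the four timestamps are made up in the style of the data above.

```python
import pandas as pd

# A few made-up timestamps, deliberately out of order
noise = pd.DataFrame(
    {'Complaint Type': ['Noise - Street/Sidewalk'] * 4},
    index=pd.to_datetime([
        '2013-10-31 02:08:41',
        '2013-10-07 16:17:41',
        '2013-10-31 00:54:03',
        '2013-10-07 15:45:56',
    ]),
)

noise = noise.sort_index()

# With a sorted datetime index, a date string selects that whole day
one_day = noise.loc['2013-10-07']

# A slice of date strings selects a range (both ends included)
october_tail = noise.loc['2013-10-30':'2013-10-31']

print(len(one_day), len(october_tail))
```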

Counting the complaints each hour

In [41]:
noise_complaints.resample('H', how=len)[:3]
Out[41]:
Complaint Type
Created Date
2013-10-07 15:00:00 1
2013-10-07 16:00:00 2
2013-10-07 17:00:00 0
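
If you're on a newer pandas, `resample` no longer takes the `how=` argument; you call a method on the result instead. Here's a sketch of the same hourly count, assuming Python 3 and a recent pandas, with four made-up timestamps. I use the lowercase `'h'` alias here; as far as I know older versions spell it `'H'`, as in the cell above.

```python
import pandas as pd

# Made-up timestamps in the style of the 'Created Date' column
created = pd.to_datetime([
    '2013-10-07 15:45:56',
    '2013-10-07 16:17:41',
    '2013-10-07 16:58:08',
    '2013-10-07 18:05:00',
])
noise = pd.DataFrame(
    {'Complaint Type': ['Noise - Street/Sidewalk'] * 4},
    index=created,
)

# One bin per hour; size() counts the rows that fall in each bin,
# and hours with no complaints show up as 0
hourly = noise.resample('h').size()

print(hourly)
```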

Example 1: done!

In [42]:
noise_complaints.resample('H', how=len).plot()
Out[42]:
<matplotlib.axes.AxesSubplot at 0x330ea50>

Chaining commands together

In [43]:
complaints = orig_data[['Created Date', 'Complaint Type']]
noise_complaints = complaints[complaints['Complaint Type'] == 'Noise - Street/Sidewalk']
noise_complaints.set_index('Created Date').sort_index().resample('H', how=len).plot()
Out[43]:
<matplotlib.axes.AxesSubplot at 0x3b63910>

Exercise 3: Time series resampling

  • Find the number of noise complaints every day!
  • Find how many complaints about rodents there are each week

Example 2: What are the most common complaint types?

In [44]:
orig_data['Complaint Type'].value_counts()
Out[44]:
HEATING                     13983
GENERAL CONSTRUCTION         6859
Street Light Condition       6513
DOF Literature Request       5107
PLUMBING                     4884
PAINT - PLASTER              4671
Blocked Driveway             3992
NONCONST                     3646
Street Condition             3070
Noise                        2942
Traffic Signal Condition     2895
Illegal Parking              2865
Dirty Conditions             2364
ELECTRIC                     2154
Noise - Commercial           2120
...
Window Guard                         2
Legal Services Provider Complaint    2
Public Assembly                      2
Ferry Permit                         1
Trans Fat                            1
DFTA Literature Request              1
Highway Sign - Damaged               1
X-Ray Machine/Equipment              1
DHS Income Savings Requirement       1
Tunnel Condition                     1
Snow                                 1
Stalled Sites                        1
Open Flame Permit                    1
Municipal Parking Facility           1
DWD                                  1
Length: 165, dtype: int64
In [45]:
orig_data['Complaint Type'].value_counts()[:20].plot(kind='bar')
Out[45]:
<matplotlib.axes.AxesSubplot at 0x3702210>
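
Here's `value_counts` on a tiny made-up series, so you can see exactly what it does. This is a sketch assuming Python 3 and a recent pandas. The result is sorted with the most common value first, and `normalize=True` gives fractions instead of counts.

```python
import pandas as pd

# A made-up handful of complaint types
complaint_type = pd.Series(
    ['HEATING', 'Noise', 'HEATING', 'PLUMBING', 'HEATING', 'Noise'])

# Counts per distinct value, most common first
counts = complaint_type.value_counts()

# The same thing as a fraction of all rows
shares = complaint_type.value_counts(normalize=True)

print(counts)
print(shares['HEATING'])
```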

Exercise 4: Do the same thing for a different column

In [46]:
# Your code here.

Example 3: Which weekday has the most noise complaints?

In [50]:
complaints = orig_data[['Created Date', 'Complaint Type']]
noise_complaints = complaints[complaints['Complaint Type'] == 'Noise - Street/Sidewalk']
noise_complaints = noise_complaints.set_index("Created Date")
In [63]:
noise_complaints['weekday'] = noise_complaints.index.weekday
noise_complaints[:3]
Out[63]:
Complaint Type weekday
Created Date
2013-10-31 02:08:41 Noise - Street/Sidewalk 3
2013-10-31 00:54:03 Noise - Street/Sidewalk 3
2013-10-31 00:35:18 Noise - Street/Sidewalk 3
In [64]:
# Count the complaints by weekday
counts_by_weekday = noise_complaints.groupby('weekday').aggregate(len)
counts_by_weekday
Out[64]:
Complaint Type
weekday
0 200
1 187
2 204
3 149
4 180
5 312
6 280
In [65]:
# change the index to be actual days (pandas counts weekdays from 0 = Monday)
counts_by_weekday.index = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
In [66]:
counts_by_weekday.plot(kind='bar')
Out[66]:
<matplotlib.axes.AxesSubplot at 0x6c9da50>
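
It's easy to get the day labels off by one, so it's worth checking against a date you know. In pandas, `weekday` is 0 for Monday and 6 for Sunday; 2013-10-31 was a Thursday, which is why it shows up as 3 above. Here's a minimal sketch, assuming Python 3 and a recent pandas, that looks each number up in a list of names instead of assigning the labels by position. The timestamps are made up.

```python
import pandas as pd

# 2013-10-31 was a Thursday and 2013-11-02 a Saturday
created = pd.to_datetime(
    ['2013-10-31 02:08:41', '2013-10-31 00:54:03', '2013-11-02 23:10:00'])
noise = pd.DataFrame(
    {'Complaint Type': ['Noise - Street/Sidewalk'] * 3},
    index=created,
)

noise['weekday'] = noise.index.weekday   # 0 = Monday ... 6 = Sunday
counts = noise.groupby('weekday').size()

# Look each number up by value, so days with no complaints
# cannot shift the labels
day_names = ['Monday', 'Tuesday', 'Wednesday', 'Thursday',
             'Friday', 'Saturday', 'Sunday']
counts.index = [day_names[i] for i in counts.index]

print(counts)
```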

Exercise 5: Count the complaints by hour instead

In [67]:
# Your code here

A few more cool things

String searching

In [77]:
# We need to get rid of the NA values for this to work
street_names = orig_data['Street Name'].fillna('')
In [75]:
manhattan_streets = street_names[street_names.str.contains("MANHATTAN")]
manhattan_streets
Out[75]:
263          MANHATTAN AVENUE
1387         MANHATTAN AVENUE
1589         MANHATTAN AVENUE
1943         MANHATTAN AVENUE
2826         MANHATTAN AVENUE
2968         MANHATTAN AVENUE
3364         MANHATTAN AVENUE
6068         MANHATTAN AVENUE
7359     MANHATTAN BEACH PARK
7360     MANHATTAN BEACH PARK
7917         MANHATTAN AVENUE
10095        MANHATTAN AVENUE
10688        MANHATTAN AVENUE
11043        MANHATTAN AVENUE
12668        MANHATTAN AVENUE
...
77358             MANHATTAN AVENUE
77404    MANHATTAN COLLEGE PARKWAY
77885             MANHATTAN AVENUE
82118         MANHATTAN BEACH PARK
82122         MANHATTAN BEACH PARK
84985                MANHATTAN AVE
85032             MANHATTAN AVENUE
85202              MANHATTAN COURT
85602             MANHATTAN AVENUE
85680             MANHATTAN AVENUE
89159             MANHATTAN AVENUE
91088    MANHATTAN COLLEGE PARKWAY
93843         MANHATTAN BEACH PARK
95950             MANHATTAN AVENUE
96630             MANHATTAN AVENUE
Name: Street Name, Length: 106, dtype: object
In [76]:
manhattan_streets.value_counts()
Out[76]:
MANHATTAN AVENUE             88
MANHATTAN COLLEGE PARKWAY     7
MANHATTAN BEACH PARK          6
MANHATTAN STREET              3
MANHATTAN COURT               1
MANHATTAN AVE                 1
dtype: int64
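
Another way to deal with the NA values is to tell `str.contains` what to do with them. Here's a sketch, assuming Python 3 and a recent pandas, on a made-up handful of street names: `na=False` treats a missing name as "doesn't match".

```python
import pandas as pd

# Made-up street names, including one missing value
street_names = pd.Series(
    ['MANHATTAN AVENUE', None, 'BROADWAY', 'MANHATTAN COURT', '58 AVENUE'])

# na=False treats missing names as "no match", so no fillna is needed
is_manhattan = street_names.str.contains('MANHATTAN', na=False)
manhattan_streets = street_names[is_manhattan]

print(manhattan_streets.tolist())
```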

Looking at complaints close to us

In [91]:
# Our current latitude and longitude
our_lat, our_long = 40.714151,-74.00878
In [94]:
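# Squared straight-line distance in degrees: not a real distance, but good enough to find nearby points
distance_from_us = (orig_data['Longitude'] - our_long)**2 + (orig_data['Latitude'] - our_lat)**2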
In [96]:
pd.Series(distance_from_us).hist()
Out[96]:
<matplotlib.axes.AxesSubplot at 0xa5d7350>
In [103]:
close_complaints = orig_data[distance_from_us < 0.00005]
In [106]:
close_complaints['Complaint Type'].value_counts()[:20].plot(kind='bar')
Out[106]:
<matplotlib.axes.AxesSubplot at 0x1988ff90>
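
The "distance" above is measured in squared degrees, which is fine for picking out nearby complaints but isn't a distance you can read in kilometers. If you want one, here's a sketch using the haversine formula. This is a different technique from the one in the talk, and it assumes a spherical Earth with a radius of 6371 km. The function name `haversine_km` and the sample points are made up; it assumes Python 3 and a recent numpy.

```python
import numpy as np

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km, assuming a spherical Earth (R = 6371 km)."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 6371.0 * 2 * np.arcsin(np.sqrt(a))

our_lat, our_long = 40.714151, -74.00878

# Works on whole arrays (or pandas columns) at once.
# First point: exactly where we are. Second: one degree of latitude north.
lats = np.array([40.714151, 41.714151])
longs = np.array([-74.00878, -74.00878])
distances = haversine_km(our_lat, our_long, lats, longs)

print(distances)
```

Passing `orig_data['Latitude']` and `orig_data['Longitude']` instead of the two small arrays should give one distance per complaint, which you could then compare against a cutoff in kilometers.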