In [1]:
import pandas as pd
In [2]:
inspections = pd.read_csv("../data/inspections.csv", index_col=0)
In [3]:
inspections.head(3)
Out[3]:
CAMIS DBA BORO BUILDING STREET ZIPCODE PHONE CUISINE DESCRIPTION INSPECTION DATE ACTION VIOLATION CODE VIOLATION DESCRIPTION CRITICAL FLAG SCORE GRADE GRADE DATE RECORD DATE INSPECTION TYPE
0 41158108 NICK'S GOURMET DELI QUEENS 7415 DITMARS BOULEVARD 11370 7182788338 American 07/21/2015 Violations were cited in the following area(s). 04L Evidence of mice or live mice present in facil... Critical 11.0 NaN NaN 09/27/2016 Cycle Inspection / Initial Inspection
1 41187577 HANSOL NUTRITION CENTER QUEENS 16026 NORTHERN BOULEVARD 11358 7188880200 Korean 07/13/2016 Violations were cited in the following area(s). 06A Personal cleanliness inadequate. Outer garment... Critical 30.0 NaN NaN 09/27/2016 Cycle Inspection / Initial Inspection
2 41705988 KURA MANHATTAN 130 ST MARKS PLACE 10009 2122281010 Japanese 05/08/2013 Violations were cited in the following area(s). 02B Hot food item not held at or above 140º F. Critical 27.0 NaN NaN 09/27/2016 Pre-permit (Operational) / Initial Inspection
In [3]:
inspections['INSPECTION DATE'] = pd.to_datetime(inspections['INSPECTION DATE'])
In [47]:
# random_camis = inspections.sample().iloc[0]['CAMIS']
# inspections[inspections['CAMIS'] == random_camis].sort_values(by='INSPECTION DATE')['INSPECTION DATE']

Get initial inspection date.

In [4]:
inspections_f = inspections.groupby('CAMIS')\
                            .apply(lambda df: inspections[inspections['CAMIS'] == df['CAMIS'].iloc[0]]\
                                               .sort_values(by="INSPECTION DATE")\
                                               .iloc[0]\
                                               .drop('CAMIS')
                                  )\
                            .rename(columns={'INSPECTION DATE': 'INITIAL INSPECTION DATE'})
In [5]:
inspections_f.sample(1)
Out[5]:
DBA BORO BUILDING STREET ZIPCODE PHONE CUISINE DESCRIPTION INITIAL INSPECTION DATE ACTION VIOLATION CODE VIOLATION DESCRIPTION CRITICAL FLAG SCORE GRADE GRADE DATE RECORD DATE INSPECTION TYPE
CAMIS
50001937 TASTY CHICKEN BROOKLYN 1687 86TH STREET 11214 7182591111 American 08/21/2013 Violations were cited in the following area(s). 15L Smoke free workplace smoking policy inadequate... Not Critical NaN NaN NaN 09/27/2016 Smoke-Free Air Act / Initial Inspection

Not sure why, but this operation drops the DBA field. None others are affected. Maybe something with the indexing?

Let's take the most recent inspection date.

In [5]:
inspections_ff = inspections_f.copy()
inspections_ff['LATEST INSPECTION DATE'] = inspections_ff\
        .apply(lambda srs: inspections[inspections['CAMIS'] == srs.name]\
                                  .sort_values(by='INSPECTION DATE')\
                                  .iloc[-1]\
                                  ['INSPECTION DATE'],
              axis='columns')

Rename the INSPECTION TYPE to INITIAL INSPECTION TYPE (this flags new establishments from pre-existing ones, in the cotext of our dataset).

In [6]:
inspections_ff = inspections_ff.rename(columns={'INSPECTION TYPE': 'INITIAL INSPECTION TYPE'})

Checking the flags:

In [91]:
inspections_ff['INITIAL INSPECTION TYPE'].value_counts()
Out[91]:
Cycle Inspection / Initial Inspection                          12537
Pre-permit (Operational) / Initial Inspection                   7630
Pre-permit (Non-operational) / Initial Inspection               2256
Cycle Inspection / Re-inspection                                 877
Administrative Miscellaneous / Initial Inspection                683
Smoke-Free Air Act / Initial Inspection                          281
Trans Fat / Initial Inspection                                   247
Pre-permit (Operational) / Re-inspection                          82
Trans Fat / Compliance Inspection                                 80
Trans Fat / Re-inspection                                         75
Inter-Agency Task Force / Initial Inspection                      53
Calorie Posting / Initial Inspection                              30
Trans Fat / Second Compliance Inspection                          27
Administrative Miscellaneous / Re-inspection                      27
Smoke-Free Air Act / Re-inspection                                23
Administrative Miscellaneous / Compliance Inspection              14
Cycle Inspection / Compliance Inspection                          12
Pre-permit (Operational) / Compliance Inspection                  12
Cycle Inspection / Reopening Inspection                            8
Smoke-Free Air Act / Compliance Inspection                         6
Calorie Posting / Re-inspection                                    5
Pre-permit (Operational) / Reopening Inspection                    4
Pre-permit (Non-operational) / Re-inspection                       3
Cycle Inspection / Second Compliance Inspection                    2
Administrative Miscellaneous / Second Compliance Inspection        2
Pre-permit (Operational) / Second Compliance Inspection            1
Smoke-Free Air Act / Limited Inspection                            1
Calorie Posting / Compliance Inspection                            1
Smoke-Free Air Act / Second Compliance Inspection                  1
Name: INITIAL INSPECTION TYPE, dtype: int64

A lot of them are null.

In [93]:
inspections_ff['INITIAL INSPECTION TYPE'].isnull().astype(int).sum()
Out[93]:
1094

Remember, this is new establishments that have not been inspected yet. That the numbers match up here is encouraging.

In [94]:
inspections['INSPECTION TYPE'].isnull().astype(int).sum()
Out[94]:
1094

We reattach the lost DBA column.

In [95]:
inspections_ff.head(1)
Out[95]:
DBA BORO BUILDING STREET ZIPCODE PHONE CUISINE DESCRIPTION INITIAL INSPECTION DATE ACTION VIOLATION CODE VIOLATION DESCRIPTION CRITICAL FLAG SCORE GRADE GRADE DATE RECORD DATE INITIAL INSPECTION TYPE LATEST INSPECTION DATE
CAMIS
30075445 NaN BRONX 1007.0 MORRIS PARK AVE 10462 7.188925e+09 Bakery 2013-06-01 Violations were cited in the following area(s). 16B The original nutritional fact labels and/or in... Not Critical NaN NaN NaN 09/24/2016 Trans Fat / Compliance Inspection 2016-02-18
In [7]:
inspections_fff = inspections_ff.copy()
inspections_fff['DBA'] = inspections_fff.apply(lambda srs: 
                                                   inspections[inspections['CAMIS'] == srs.name]\
                                                   .iloc[0]['DBA'],
                                               axis='columns')

Prepend descriptors, to more easily distinguish things down the road.

In [8]:
inspections_fff.columns = ["DOHMH " + col for col in inspections_fff.columns]
In [105]:
inspections_fff.sample()
Out[105]:
DOHMH DBA DOHMH BORO DOHMH BUILDING DOHMH STREET DOHMH ZIPCODE DOHMH PHONE DOHMH CUISINE DESCRIPTION DOHMH INITIAL INSPECTION DATE DOHMH ACTION DOHMH VIOLATION CODE DOHMH VIOLATION DESCRIPTION DOHMH CRITICAL FLAG DOHMH SCORE DOHMH GRADE DOHMH GRADE DATE DOHMH RECORD DATE DOHMH INITIAL INSPECTION TYPE DOHMH LATEST INSPECTION DATE
CAMIS
40374834 CASA BELLA MANHATTAN 127.0 MULBERRY STREET 10013 2.124314e+09 Italian 2013-08-15 Violations were cited in the following area(s). 06D Food contact surface not properly washed, rins... Critical 6.0 NaN NaN 09/24/2016 Cycle Inspection / Initial Inspection 2016-04-26

Now we get information from Yelp!

In [9]:
from yelp.client import Client
from yelp.oauth1_authenticator import Oauth1Authenticator
from yelp.errors import BusinessUnavailable
import os
import json

def import_credentials(filename='../data/yelp_credentials.json'):
    try:
        data = json.load(open(filename))
        return data
    except:
        raise IOError('This API requires Yelp credentials to work. Did you forget to define them?')

credentials = import_credentials()

auth = Oauth1Authenticator(
    consumer_key=credentials['consumer_key'],
    consumer_secret=credentials['consumer_secret'],
    token=credentials['token'],
    token_secret=credentials['token_secret']
)

client = Client(auth)
In [10]:
from tqdm import tqdm
In [77]:
import httplib2.HttpLib2Error
---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)
<ipython-input-77-ee8a03be3a31> in <module>()
----> 1 from httplib2.error import HTTPError

ImportError: No module named 'httplib2.error'
In [82]:
from urllib.error import HTTPError
import yelp

def yelp_phone_fetch(num):
    """
    Performs the phone search described in notebook 02 to fetch information on the entity associated with a number.
    """
    if not num:
        return None
    else:
        try:
            business = client.phone_search(num).businesses[0]
            if business and business.location and business.location.coordinate:
                return {
                    'Yelp ID': business.id,
                    'Yelp Is Claimed': business.is_claimed,
                    'Yelp Is Closed': business.is_closed,
                    'Yelp Name': business.name,
                    'Yelp URL': business.url,
                    'Yelp Review Count': business.review_count,
                    'Yelp Categories': business.categories,
                    'Yelp Rating': business.rating,
                    'Yelp Address': business.location.address,
                    'Yelp Neighborhoods': business.location.neighborhoods,
                    'Yelp Latitude': business.location.coordinate.latitude,
                    'Yelp Longitude': business.location.coordinate.longitude,
                       }
            else:  # Partial information, skip.
                return None
        except IndexError:  # Phone search failed!
            return None
        except yelp.errors.InvalidParameter:  # Invalid number!
            return None

After some testing there appears to be a very significant miss rate. Remember, we're fuzzy matching phone numbers from one data set with phone numbers in another dataset. It's far from 100% that we'll get something back out. I'm interested in what percentage of the time we're successful.

In [12]:
def random_number_dba_tuple():
    random_number, random_dba = inspections_fff.sample().iloc[0][['DOHMH PHONE', 'DOHMH DBA']]
    random_number = str(int(random_number))
    return random_number, random_dba
In [15]:
random_number_dba_tuple()
Out[15]:
(2125878880, 'MULTI TASTES DINER')
In [127]:
yelp_phone_fetch(random_number_dba_tuple()[0])
Out[127]:
{'Yelp Address': ['41 E 11th St'],
 'Yelp Categories': [Category(name='Japanese', alias='japanese')],
 'Yelp ID': 'ootoya-greenwich-village-new-york',
 'Yelp Is Claimed': True,
 'Yelp Is Closed': False,
 'Yelp Latitude': 40.7333107,
 'Yelp Longitude': -73.9929962,
 'Yelp Name': 'Ootoya Greenwich Village',
 'Yelp Neighborhoods': ['Greenwich Village'],
 'Yelp Rating': 4.0,
 'Yelp Review Count': 135,
 'Yelp URL': 'https://www.yelp.com/biz/ootoya-greenwich-village-new-york?adjust_creative=dkJPGu_jtTyHwsEgZIZN6g&utm_campaign=yelp_api&utm_medium=api_v2_phone_search&utm_source=dkJPGu_jtTyHwsEgZIZN6g'}

Hey I've been there! Nice. More rigorously:

In [133]:
one_hundred_randoms = [random_number_dba_tuple() for i in range(100)]
In [138]:
testset = [yelp_phone_fetch(num) for num, placename in tqdm(one_hundred_randoms)]
  0%|                                                  | 0/100 [00:00<?, ?it/s]
  1%|▍                                         | 1/100 [00:00<00:25,  3.85it/s]
  2%|▊                                         | 2/100 [00:00<00:28,  3.45it/s]
  3%|█▎                                        | 3/100 [00:00<00:25,  3.74it/s]
  4%|█▋                                        | 4/100 [00:02<00:52,  1.84it/s]
  5%|██                                        | 5/100 [00:02<00:42,  2.25it/s]
  6%|██▌                                       | 6/100 [00:02<00:35,  2.62it/s]
  7%|██▉                                       | 7/100 [00:02<00:31,  2.99it/s]
  8%|███▎                                      | 8/100 [00:02<00:29,  3.10it/s]
  9%|███▊                                      | 9/100 [00:03<00:32,  2.78it/s]
 10%|████                                     | 10/100 [00:03<00:30,  2.94it/s]
 11%|████▌                                    | 11/100 [00:03<00:26,  3.33it/s]
 12%|████▉                                    | 12/100 [00:04<00:24,  3.54it/s]
 13%|█████▎                                   | 13/100 [00:04<00:23,  3.64it/s]
 14%|█████▋                                   | 14/100 [00:04<00:29,  2.95it/s]
 15%|██████▏                                  | 15/100 [00:05<00:33,  2.58it/s]
 16%|██████▌                                  | 16/100 [00:05<00:29,  2.88it/s]
 17%|██████▉                                  | 17/100 [00:05<00:25,  3.22it/s]
 18%|███████▍                                 | 18/100 [00:06<00:23,  3.47it/s]
 19%|███████▊                                 | 19/100 [00:06<00:20,  3.86it/s]
 20%|████████▏                                | 20/100 [00:06<00:20,  3.96it/s]
 21%|████████▌                                | 21/100 [00:06<00:18,  4.35it/s]
 22%|█████████                                | 22/100 [00:07<00:18,  4.14it/s]
 23%|█████████▍                               | 23/100 [00:07<00:23,  3.23it/s]
 24%|█████████▊                               | 24/100 [00:07<00:27,  2.72it/s]
 25%|██████████▎                              | 25/100 [00:08<00:24,  3.10it/s]
 26%|██████████▋                              | 26/100 [00:08<00:20,  3.55it/s]
 27%|███████████                              | 27/100 [00:08<00:19,  3.65it/s]
 28%|███████████▍                             | 28/100 [00:09<00:24,  2.89it/s]
 29%|███████████▉                             | 29/100 [00:09<00:21,  3.36it/s]
 30%|████████████▎                            | 30/100 [00:09<00:24,  2.89it/s]
 31%|████████████▋                            | 31/100 [00:10<00:20,  3.31it/s]
 32%|█████████████                            | 32/100 [00:10<00:18,  3.63it/s]
 33%|█████████████▌                           | 33/100 [00:10<00:17,  3.74it/s]
 34%|█████████████▉                           | 34/100 [00:10<00:16,  3.98it/s]
 35%|██████████████▎                          | 35/100 [00:10<00:14,  4.38it/s]
 36%|██████████████▊                          | 36/100 [00:11<00:14,  4.30it/s]
 37%|███████████████▏                         | 37/100 [00:11<00:13,  4.82it/s]
 38%|███████████████▌                         | 38/100 [00:11<00:13,  4.54it/s]
 39%|███████████████▉                         | 39/100 [00:11<00:12,  4.75it/s]
 40%|████████████████▍                        | 40/100 [00:11<00:12,  4.87it/s]
 41%|████████████████▊                        | 41/100 [00:12<00:12,  4.81it/s]
 42%|█████████████████▏                       | 42/100 [00:12<00:11,  5.05it/s]
 43%|█████████████████▋                       | 43/100 [00:12<00:10,  5.27it/s]
 44%|██████████████████                       | 44/100 [00:12<00:11,  4.93it/s]
 45%|██████████████████▍                      | 45/100 [00:12<00:11,  4.97it/s]
 46%|██████████████████▊                      | 46/100 [00:13<00:10,  4.99it/s]
 47%|███████████████████▎                     | 47/100 [00:13<00:11,  4.72it/s]
 48%|███████████████████▋                     | 48/100 [00:13<00:14,  3.51it/s]
 49%|████████████████████                     | 49/100 [00:14<00:17,  2.83it/s]
 50%|████████████████████▌                    | 50/100 [00:14<00:15,  3.23it/s]
 51%|████████████████████▉                    | 51/100 [00:14<00:17,  2.85it/s]
 52%|█████████████████████▎                   | 52/100 [00:15<00:15,  3.19it/s]
 53%|█████████████████████▋                   | 53/100 [00:15<00:16,  2.81it/s]
 54%|██████████████████████▏                  | 54/100 [00:16<00:17,  2.56it/s]
 55%|██████████████████████▌                  | 55/100 [00:16<00:15,  2.86it/s]
 56%|██████████████████████▉                  | 56/100 [00:16<00:13,  3.21it/s]
 57%|███████████████████████▎                 | 57/100 [00:16<00:11,  3.63it/s]
 58%|███████████████████████▊                 | 58/100 [00:17<00:12,  3.31it/s]
 59%|████████████████████████▏                | 59/100 [00:17<00:11,  3.59it/s]
 60%|████████████████████████▌                | 60/100 [00:17<00:10,  3.90it/s]
 61%|█████████████████████████                | 61/100 [00:17<00:09,  4.20it/s]
 62%|█████████████████████████▍               | 62/100 [00:17<00:08,  4.36it/s]
 63%|█████████████████████████▊               | 63/100 [00:18<00:08,  4.42it/s]
 64%|██████████████████████████▏              | 64/100 [00:18<00:07,  4.93it/s]
 65%|██████████████████████████▋              | 65/100 [00:18<00:07,  4.83it/s]
 66%|███████████████████████████              | 66/100 [00:18<00:07,  4.52it/s]
 67%|███████████████████████████▍             | 67/100 [00:18<00:06,  4.74it/s]
 68%|███████████████████████████▉             | 68/100 [00:19<00:06,  4.88it/s]
 69%|████████████████████████████▎            | 69/100 [00:19<00:07,  4.41it/s]
 70%|████████████████████████████▋            | 70/100 [00:20<00:09,  3.05it/s]
 71%|█████████████████████████████            | 71/100 [00:20<00:08,  3.55it/s]
 72%|█████████████████████████████▌           | 72/100 [00:20<00:07,  3.94it/s]
 73%|█████████████████████████████▉           | 73/100 [00:20<00:06,  4.22it/s]
 74%|██████████████████████████████▎          | 74/100 [00:20<00:05,  4.35it/s]
 75%|██████████████████████████████▊          | 75/100 [00:21<00:07,  3.35it/s]
 76%|███████████████████████████████▏         | 76/100 [00:21<00:06,  3.96it/s]
 77%|███████████████████████████████▌         | 77/100 [00:21<00:05,  3.98it/s]
 78%|███████████████████████████████▉         | 78/100 [00:21<00:05,  4.01it/s]
 79%|████████████████████████████████▍        | 79/100 [00:22<00:04,  4.24it/s]
 80%|████████████████████████████████▊        | 80/100 [00:22<00:04,  4.84it/s]
 81%|█████████████████████████████████▏       | 81/100 [00:22<00:04,  4.73it/s]
 82%|█████████████████████████████████▌       | 82/100 [00:22<00:03,  5.09it/s]
 83%|██████████████████████████████████       | 83/100 [00:22<00:03,  4.94it/s]
 84%|██████████████████████████████████▍      | 84/100 [00:22<00:03,  5.22it/s]
 85%|██████████████████████████████████▊      | 85/100 [00:23<00:03,  4.78it/s]
 86%|███████████████████████████████████▎     | 86/100 [00:24<00:07,  1.96it/s]
 87%|███████████████████████████████████▋     | 87/100 [00:24<00:05,  2.35it/s]
 88%|████████████████████████████████████     | 88/100 [00:24<00:04,  2.80it/s]
 89%|████████████████████████████████████▍    | 89/100 [00:25<00:03,  3.16it/s]
 90%|████████████████████████████████████▉    | 90/100 [00:25<00:02,  3.43it/s]
 91%|█████████████████████████████████████▎   | 91/100 [00:25<00:02,  3.80it/s]
 92%|█████████████████████████████████████▋   | 92/100 [00:25<00:01,  4.44it/s]
 93%|██████████████████████████████████████▏  | 93/100 [00:25<00:01,  4.77it/s]
 94%|██████████████████████████████████████▌  | 94/100 [00:26<00:01,  3.44it/s]
 95%|██████████████████████████████████████▉  | 95/100 [00:26<00:01,  3.80it/s]
 96%|███████████████████████████████████████▎ | 96/100 [00:26<00:00,  4.14it/s]
 97%|███████████████████████████████████████▊ | 97/100 [00:26<00:00,  4.16it/s]
 98%|████████████████████████████████████████▏| 98/100 [00:27<00:00,  4.70it/s]
 99%|████████████████████████████████████████▌| 99/100 [00:27<00:00,  4.46it/s]
100%|████████████████████████████████████████| 100/100 [00:27<00:00,  4.71it/s]
In [145]:
np.array([entity == None for entity in testset]).astype(int).sum()
Out[145]:
19

19 misses in 100 randoms. So we're able to match 80% of the time. Not bad as far as fuzzy matches go.

If our data is missing at random (MAR), then we are happy, because this makes for a statistically valid sample of all restaurants in New York City—we can simply drop the other values.

But there's a high likelihood, in my professional opinion, that the data is missing not at a random (MNAR). I expect that it's more likely for a restaurant or eatery in a poor neighborhood to be missing the phone number information in Yelp! than one in a richer, whiter community. This is a severe under-reporting issue that will invalidate any conclusions we try to make using this data with regarding to "gentrification" and whatnot.

We'll need to validate the data geospatially. We'll do that next.

Run the full Yelp! API data through, one slice at a time.

In [150]:
len(inspections_fff)
Out[150]:
26074
In [13]:
inspections_fff.to_csv("../data/inspections_flattened_initial.csv", encoding="utf-8")

Oy vey. This will require two days of processing, since the API limit is 25000/day.

In [15]:
from tqdm import tqdm_notebook
In [14]:
del inspections
del inspections_f
del inspections_ff
In [38]:
inspections_fff['DOHMH PHONE'].iloc[2702:2705]
Out[38]:
CAMIS
40788706    7187231080
40788884    __________
40788886    6462307208
Name: DOHMH PHONE, dtype: object

Uh, ok.

Hideous

In [39]:
raw_yelp_5000 = [yelp_phone_fetch(int(num)) if pd.notnull(num) and str.isdigit(num) else None for num in tqdm_notebook(inspections_fff['DOHMH PHONE'][:5000])]

In [42]:
raw_yelp_5000_2 = [yelp_phone_fetch(int(num)) if pd.notnull(num) and str.isdigit(num) else None for num in tqdm_notebook(inspections_fff['DOHMH PHONE'][5000:10000])]
In [43]:
raw_yelp_5000_3 = [yelp_phone_fetch(int(num)) if pd.notnull(num) and str.isdigit(num) else None for num in tqdm_notebook(inspections_fff['DOHMH PHONE'][10000:15000])]
In [54]:
import pickle

with open("../data/raw_yelp_list.pkl", "wb") as f:
    pickle.dump(raw_yelp_5000+ raw_yelp_5000_2 + raw_yelp_5000_3, f)
In [18]:
import pickle

with open("../data/raw_yelp_list.pkl", "rb") as f:
    raw_yelp_1_to_3 = pickle.load(f)
In [20]:
len(raw_yelp_1_to_3)
Out[20]:
15000
In [24]:
raw_yelp_5000_4 = [yelp_phone_fetch(int(num)) if pd.notnull(num) and str.isdigit(str(int(num))) else None for num in tqdm_notebook(inspections_fff['DOHMH PHONE'][15000:20000])]
In [35]:
inspections_fff['DOHMH PHONE'].iloc[20000 + 3270]
Out[35]:
1646644665.0

This appears to be an invalid phone number. Parsing it through Yelp raises a reliable error, which I've now coded into the loop.

In [84]:
yelp_phone_fetch(int(inspections_fff['DOHMH PHONE'].iloc[20000 + 3270]))

There seem to be multiple such numbers in there.

In [85]:
yelp_phone_fetch(int(inspections_fff['DOHMH PHONE'].iloc[20000 + 4065]))
In [87]:
raw_yelp_5000_5 = [yelp_phone_fetch(int(num)) if pd.notnull(num) and str.isdigit(str(int(num))) else None for num in tqdm_notebook(inspections_fff['DOHMH PHONE'][20000:25000])]
In [88]:
raw_yelp_5000_6 = [yelp_phone_fetch(int(num)) if pd.notnull(num) and str.isdigit(str(int(num))) else None for num in tqdm_notebook(inspections_fff['DOHMH PHONE'][25000:])]
In [90]:
all_raws = raw_yelp_1_to_3 + raw_yelp_5000_4 + raw_yelp_5000_5 + raw_yelp_5000_6
In [91]:
len(all_raws)
Out[91]:
26064

With the data thus acquired, we now assign it to our dataframe. Before we do that, first we need to clean up what we have.

In [92]:
inspections_fff.sample(1)
Out[92]:
DOHMH DBA DOHMH BORO DOHMH BUILDING DOHMH STREET DOHMH ZIPCODE DOHMH PHONE DOHMH CUISINE DESCRIPTION DOHMH INITIAL INSPECTION DATE DOHMH ACTION DOHMH VIOLATION CODE DOHMH VIOLATION DESCRIPTION DOHMH CRITICAL FLAG DOHMH SCORE DOHMH GRADE DOHMH GRADE DATE DOHMH RECORD DATE DOHMH INITIAL INSPECTION TYPE DOHMH LATEST INSPECTION DATE
CAMIS
41405535 TWO BOOTS MANHATTAN 625.0 9 AVENUE 10036 2.129563e+09 Pizza 2014-06-24 Violations were cited in the following area(s). 04L Evidence of mice or live mice present in facil... Critical 15.0 NaN NaN 09/27/2016 Cycle Inspection / Initial Inspection 2016-09-13
In [127]:
all_raws[0] = {'Yelp Address': ['1007 Morris Park Avenue'],
 'Yelp Categories': ["Category(name='Bakeries', alias='bakeries')",
  "Category(name='Desserts', alias='desserts')"],
 'Yelp ID': 'morris-pk-bake-shop-bronx',
 'Yelp Is Claimed': False,
 'Yelp Is Closed': False,
 'Yelp Latitude': 40.848445892334,
 'Yelp Longitude': -73.8560791015625,
 'Yelp Name': 'Morris Pk Bake Shop',
 'Yelp Neighborhoods': ['Morris Park'],
 'Yelp Rating': 4.5,
 'Yelp Review Count': 27,
 'Yelp URL': 'https://www.yelp.com/biz/morris-pk-bake-shop-bronx?adjust_creative=dkJPGu_jtTyHwsEgZIZN6g&utm_campaign=yelp_api&utm_medium=api_v2_phone_search&utm_source=dkJPGu_jtTyHwsEgZIZN6g'}
In [131]:
all_raws[0]
Out[131]:
{'Yelp Address': ['1007 Morris Park Avenue'],
 'Yelp Categories': ["Category(name='Bakeries', alias='bakeries')",
  "Category(name='Desserts', alias='desserts')"],
 'Yelp ID': 'morris-pk-bake-shop-bronx',
 'Yelp Is Claimed': False,
 'Yelp Is Closed': False,
 'Yelp Latitude': 40.848445892334,
 'Yelp Longitude': -73.8560791015625,
 'Yelp Name': 'Morris Pk Bake Shop',
 'Yelp Neighborhoods': ['Morris Park'],
 'Yelp Rating': 4.5,
 'Yelp Review Count': 27,
 'Yelp URL': 'https://www.yelp.com/biz/morris-pk-bake-shop-bronx?adjust_creative=dkJPGu_jtTyHwsEgZIZN6g&utm_campaign=yelp_api&utm_medium=api_v2_phone_search&utm_source=dkJPGu_jtTyHwsEgZIZN6g'}
In [232]:
import re
import copy
re_cat_word = re.compile("name='[-\w\s&()/,']+(?=')")

def safe_group(match_obj):
    try:
        return match_obj.group().replace("name='", "")
    except AttributeError:
        return None

def format_data(yelp_dict):
    if yelp_dict:
        ret = copy.deepcopy(yelp_dict)
        if ret['Yelp Address']:
            ret['Yelp Address'] = ret['Yelp Address'][0]
        cats = ret['Yelp Categories']
        # print(cats)
        if cats:
            try:
                parsed_cats = [re.search(re_cat_word, str(cat)).group().replace("name='", "") for cat in cats]
            except:
                print(cats)
                parsed_cats = []
            # print(parsed_cats)
            ret['Yelp Categories'] = "|".join(parsed_cats)
        neighborhoods = ret['Yelp Neighborhoods']
        if neighborhoods:
            ret['Yelp Neighborhoods'] = "|".join(ret['Yelp Neighborhoods'])
        del ret['Yelp URL']
        return ret
    else:
        return None
In [235]:
formatted_yelp_data = []

for raw in tqdm_notebook(all_raws):
    formatted_yelp_data.append(format_data(raw))
[Category(name="Women's Clothing", alias='womenscloth')]
[Category(name='Department Stores', alias='deptstores'), Category(name="Men's Clothing", alias='menscloth'), Category(name="Women's Clothing", alias='womenscloth')]
[Category(name="Men's Clothing", alias='menscloth'), Category(name="Women's Clothing", alias='womenscloth')]
[Category(name="Men's Clothing", alias='menscloth'), Category(name='Used, Vintage & Consignment', alias='vintage')]
[Category(name='Jewelry', alias='jewelry'), Category(name="Women's Clothing", alias='womenscloth')]
[Category(name='Motorcycle Gear', alias='motorcyclinggear'), Category(name="Men's Clothing", alias='menscloth'), Category(name='Coffee & Tea', alias='coffee')]
[Category(name="Children's Clothing", alias='childcloth'), Category(name='Ice Cream & Frozen Yogurt', alias='icecream'), Category(name='Desserts', alias='desserts')]
[Category(name='Shoe Stores', alias='shoes'), Category(name="Men's Clothing", alias='menscloth'), Category(name="Women's Clothing", alias='womenscloth')]
[Category(name='Toy Stores', alias='toys'), Category(name="Men's Clothing", alias='menscloth'), Category(name="Women's Clothing", alias='womenscloth')]
In [236]:
formatted_yelp_data[0]
Out[236]:
{'Yelp Address': '1007 Morris Park Avenue',
 'Yelp Categories': 'Bakeries|Desserts',
 'Yelp ID': 'morris-pk-bake-shop-bronx',
 'Yelp Is Claimed': False,
 'Yelp Is Closed': False,
 'Yelp Latitude': 40.848445892334,
 'Yelp Longitude': -73.8560791015625,
 'Yelp Name': 'Morris Pk Bake Shop',
 'Yelp Neighborhoods': 'Morris Park',
 'Yelp Rating': 4.5,
 'Yelp Review Count': 27}

Now assign.

In [241]:
inspections_ffff = inspections_fff.copy()

for key in formatted_yelp_data[0].keys():
    inspections_ffff[key] = [s[key] if s else None for s in formatted_yelp_data]
In [249]:
inspections_ffff.columns = [col.upper() for col in inspections_ffff.columns]
In [250]:
inspections_ffff
Out[250]:
DOHMH DBA DOHMH BORO DOHMH BUILDING DOHMH STREET DOHMH ZIPCODE DOHMH PHONE DOHMH CUISINE DESCRIPTION DOHMH INITIAL INSPECTION DATE DOHMH ACTION DOHMH VIOLATION CODE ... YELP LATITUDE YELP ADDRESS YELP ID YELP RATING YELP REVIEW COUNT YELP IS CLAIMED YELP LONGITUDE YELP NEIGHBORHOODS YELP CATEGORIES YELP NAME
CAMIS
30075445 MORRIS PARK BAKE SHOP BRONX 1007.0 MORRIS PARK AVE 10462 7.188925e+09 Bakery 2013-06-01 Violations were cited in the following area(s). 16B ... 40.848446 1007 Morris Park Avenue morris-pk-bake-shop-bronx 4.5 27.0 False -73.856079 Morris Park Bakeries|Desserts Morris Pk Bake Shop
30112340 WENDY'S BROOKLYN 469.0 FLATBUSH AVENUE 11225 7.182875e+09 Hamburgers 2014-06-05 Violations were cited in the following area(s). 10B ... 40.662952 469 Flatbush Ave wendys-brooklyn-4 2.0 23.0 False -73.961753 Prospect Heights|Prospect Lefferts Gardens Fast Food|Burgers Wendy's
30191841 DJ REYNOLDS PUB AND RESTAURANT MANHATTAN 351.0 WEST 57 STREET 10019 2.122453e+09 Irish 2013-07-22 Violations were cited in the following area(s). 10B ... 40.767750 351 W 57th St dj-reynolds-new-york-3 3.0 75.0 False -73.984870 Midtown West|Hell's Kitchen Irish|Pubs DJ Reynolds
40356018 RIVIERA CATERER BROOKLYN 2780.0 STILLWELL AVENUE 11224 7.183723e+09 American 2013-06-05 Violations were cited in the following area(s). 10F ... 40.579521 2780 Stillwell Ave riviera-caterers-brooklyn 4.0 23.0 True -73.982430 Coney Island Caterers Riviera Caterers
40356151 BRUNOS ON THE BOULEVARD QUEENS 8825.0 ASTORIA BOULEVARD 11369 7.183351e+09 American 2014-04-11 Violations were cited in the following area(s). 04J ... 40.764240 8825 Astoria Blvd events-by-brunos-jackson-heights 4.0 15.0 True -73.880410 East Elmhurst Caterers|Venues & Event Spaces Events By Bruno's
40356483 WILKEN'S FINE FOOD BROOKLYN 7114.0 AVENUE U 11234 7.184444e+09 Delicatessen 2013-07-09 Violations were cited in the following area(s). 02H ... 40.619900 7114 Ave U wilkens-ii-deli-brooklyn 3.5 25.0 False -73.906853 Bergen Beach Delis Wilkens II Deli
40356731 TASTE THE TROPICS ICE CREAM BROOKLYN 1839.0 NOSTRAND AVENUE 11226 7.188561e+09 Ice Cream, Gelato, Yogurt, Ices 2013-07-10 Violations were cited in the following area(s). 10B ... 40.640820 1839 Nostrand Ave taste-the-tropics-brooklyn 4.5 16.0 False -73.948151 Flatbush Ice Cream & Frozen Yogurt Taste the Tropics
40357217 WILD ASIA BRONX 2300.0 SOUTHERN BOULEVARD 10460 7.182208e+09 American 2013-06-19 Violations were cited in the following area(s). 10B ... NaN None None NaN NaN None NaN None None None
40357437 C & C CATERING SERVICE BROOKLYN 7715.0 18 AVENUE 11214 7.182323e+09 American 2014-04-16 Violations were cited in the following area(s). 06D ... 40.611713 7715 18th Ave c-and-c-catering-service-brooklyn 3.0 2.0 True -73.997261 Bensonhurst Caterers C & C Catering Service
40359480 1 EAST 66TH STREET KITCHEN MANHATTAN 1.0 EAST 66 STREET 10065 2.128794e+09 American 2014-05-07 Violations were cited in the following area(s). 10B ... 40.768684 1 E 66th St wyeth-james-new-york 0.0 0.0 False -73.969337 Upper East Side None Wyeth James
40359705 NATHAN'S FAMOUS BROOKLYN 1310.0 SURF AVENUE 11224 7.183332e+09 Hotdogs 2013-12-03 Violations were cited in the following area(s). 10F ... NaN None None NaN NaN None NaN None None None
40360045 SEUDA FOODS BROOKLYN 705.0 KINGS HIGHWAY 11223 7.183752e+09 Jewish/Kosher 2013-10-10 Violations were cited in the following area(s). 08A ... 40.606472 705 Kings Hwy seuda-foods-brooklyn 4.5 3.0 False -73.965415 Midwood|Gravesend Caterers Seuda Foods
40360076 CARVEL ICE CREAM BROOKLYN 203.0 CHURCH AVENUE 11218 7.184390e+09 Ice Cream, Gelato, Yogurt, Ices 2014-02-10 Violations were cited in the following area(s). 10F ... 40.643380 203 Church Ave carvel-brooklyn-3 3.5 18.0 False -73.978199 Kensington|Flatbush Ice Cream & Frozen Yogurt Carvel
40361322 CARVEL ICE CREAM QUEENS 26515.0 HILLSIDE AVENUE 11004 7.183430e+09 Ice Cream, Gelato, Yogurt, Ices 2013-09-18 Violations were cited in the following area(s). 08A ... 40.743605 26515 Hillside Ave carvel-glen-oaks-2 5.0 4.0 False -73.660313 None Ice Cream & Frozen Yogurt Carvel
40361606 THE MOVABLE FEAST BROOKLYN 284.0 PROSPECT PARK WEST 11215 7.189653e+09 American 2013-11-14 Violations were cited in the following area(s). 10F ... 40.658035 284 Prospect Park W the-movable-feast-brooklyn 4.5 16.0 True -73.982879 South Slope|Windsor Terrace Caterers The Movable Feast
40361618 SAL'S DELI QUEENS 12908.0 20 AVENUE 11356 7.186619e+09 Delicatessen 2014-08-16 Violations were cited in the following area(s). 08C ... 40.781471 12908 20th Ave sals-deli-college-point 4.5 4.0 False -73.839287 College Point Delis|Sandwiches Sal's Deli
40361708 BULLY'S DELI MANHATTAN 759.0 BROADWAY 10003 2.122550e+09 Delicatessen 2014-01-21 Violations were cited in the following area(s). 10J ... 40.730826 759 Broadway bullys-deli-new-york-13 3.0 110.0 True -73.992404 Greenwich Village Delis Bully's Deli
40361998 STEVE CHU'S DELI & GROCERY QUEENS 3406.0 10 STREET 11106 7.187269e+09 Delicatessen 2013-03-13 Violations were cited in the following area(s). 10F ... 40.764748 3406 10th St steve-chus-deli-and-grocery-astoria 4.0 4.0 False -73.938873 Astoria Delis Steve Chu's Deli & Grocery
40362098 HARRIET'S KITCHEN MANHATTAN 502.0 AMSTERDAM AVENUE 10024 2.127210e+09 American 2014-02-03 Violations were cited in the following area(s). 04M ... 40.786678 502 Amsterdam Ave harriets-kitchen-new-york 3.0 106.0 True -73.976074 Upper West Side American (Traditional)|Burgers Harriet's Kitchen
40362264 P & S DELI GROCERY MANHATTAN 730.0 COLUMBUS AVENUE 10025 2.129323e+09 American 2013-04-08 Violations were cited in the following area(s). 02G ... 40.792419 724 Columbus Ave ps-deli-and-grocery-manhattan 4.0 2.0 False -73.967912 Upper West Side Grocery PS Deli and Grocery
40362274 ANGELIKA FILM CENTER MANHATTAN 18.0 WEST HOUSTON STREET 10012 2.129953e+09 American 2013-04-05 Violations were cited in the following area(s). 10F ... 40.725891 18 W Houston St angelika-film-center-and-cafe-new-york-new-york-2 3.5 396.0 True -73.996986 Greenwich Village Cinema|Cafes Angelika Film Center & Cafe - New York
40362432 HO MEI RESTAURANT QUEENS 10305.0 37 AVENUE 11368 7.187797e+09 Chinese 2013-05-02 Violations were cited in the following area(s). 06D ... 40.753552 10305 37th Avenue ho-mei-corona 5.0 3.0 False -73.864227 North Corona Restaurants Ho Mei
40362715 THE COUNTRY CAFE MANHATTAN 60.0 WALL STREET 10005 3.474279e+09 Sandwiches/Salads/Mixed Buffet 2013-09-18 Violations were cited in the following area(s). 06B ... 40.706241 60 Wall St the-country-cafe-new-york-2 2.5 16.0 False -74.008506 Financial District Sandwiches|Cafes|Breakfast & Brunch The Country Cafe
40362869 SHASHEMENE INT'L RESTAURA BROOKLYN 195.0 EAST 56 STREET 11203 3.474301e+09 Caribbean 2013-05-08 Violations were cited in the following area(s). 10B ... NaN None None NaN NaN None NaN None None None
40363093 CARVEL ICE CREAM BRONX 1006.0 EAST 233 STREET 10466 7.186527e+09 Ice Cream, Gelato, Yogurt, Ices 2013-08-02 Violations were cited in the following area(s). 06D ... 40.890331 1006 E 233rd St carvel-bronx-3 2.0 5.0 False -73.848595 Edenwald Ice Cream & Frozen Yogurt Carvel
40363098 DUNKIN' DONUTS BROOKLYN 56.0 COURT STREET 11201 7.186252e+09 Donuts 2013-01-08 Violations were cited in the following area(s). 08A ... 40.692272 56 Court st dunkin-donuts-brooklyn-131 1.0 2.0 False -73.991493 Brooklyn Heights Donuts|Coffee & Tea Dunkin' Donuts
40363289 HAPPY GARDEN BRONX 1236.0 238 SPOFFORD AVE 10474 7.186172e+09 Chinese 2013-12-30 Violations were cited in the following area(s). 10F ... 40.814002 1236 Spofford Avenue chinese-fried-chicken-bronx-3 0.0 0.0 False -73.889529 Hunts Point Chinese Chinese Fried Chicken
40363298 CAFE METRO MANHATTAN 625.0 8 AVENUE 10018 2.127149e+09 American 2013-12-13 Violations were cited in the following area(s). 06D ... 40.756252 625 8th Ave Bsmt 2 cosane-new-york 0.0 0.0 False -73.990685 Theater District|Midtown West|Hell's Kitchen Coffee & Tea Cosane
40363333 TONY'S DELI QUEENS 1069.0 WYCKOFF AVENUE 11385 7.183868e+09 Delicatessen 2013-06-10 Violations were cited in the following area(s). 04N ... 40.694920 1069 Wyckoff Avenue tonts-too-deli-and-coffee-shop-ridgewood 0.0 0.0 False -73.902460 Bushwick|Ridgewood Delis|Sandwiches Tont's Too Deli & Coffee Shop
40363426 LEXLER DELI MANHATTAN 405.0 LEXINGTON AVENUE 10174 2.126871e+09 Sandwiches/Salads/Mixed Buffet 2013-08-21 Violations were cited in the following area(s). 10F ... 40.751910 405 Lexington Ave lexler-deli-new-york 4.5 15.0 False -73.975530 Midtown East Delis Lexler Deli
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
50055989 NaN QUEENS 25109.0 NORTHERN BLVD 11362 3.473282e+09 Other 1900-01-01 NaN NaN ... NaN None None NaN NaN None NaN None None None
50055990 RESTAURANT ASSOCIATES INC MANHATTAN 1177.0 AVENUE OF THE AMERICAS 10036 2.123363e+09 Other 1900-01-01 NaN NaN ... NaN None None NaN NaN None NaN None None None
50055996 NaN MANHATTAN NaN CANAL ST 10013 7.184619e+09 Other 1900-01-01 NaN NaN ... 40.759792 39-15 Main St alicechencpa-queens 1.0 3.0 False -73.830063 Flushing|Downtown Flushing Financial Services AliceChenCPA
50055997 NaN BRONX 1167.0 BOSTON RD 10456 7.184508e+09 Other 1900-01-01 NaN NaN ... 40.828840 1167 Boston Rd mikes-pizza-and-pasta-bronx 5.0 1.0 False -73.903340 Morrisania Pizza Mike's Pizza and Pasta
50056005 SANTA PANZA BROOKLYN 1079.0 BROADWAY 11221 7.186980e+09 Other 1900-01-01 NaN NaN ... NaN None None NaN NaN None NaN None None None
50056006 NaN BROOKLYN 239.0 BRIGHTON BEACH AVE 11235 9.019072e+09 Other 1900-01-01 NaN NaN ... NaN None None NaN NaN None NaN None None None
50056007 NaN MANHATTAN 1.0 E 35TH ST 10016 9.178265e+09 Other 1900-01-01 NaN NaN ... NaN None None NaN NaN None NaN None None None
50056009 NaN QUEENS 15702.0 NORTHERN BLVD 11354 9.176238e+09 Other 1900-01-01 NaN NaN ... NaN None None NaN NaN None NaN None None None
50056014 PIZZA ROYALE BROOKLYN 6716.0 FORT HAMILTON PKWY 11219 7.182385e+09 Other 1900-01-01 NaN NaN ... 40.629331 6718 Fort Hamilton Pkwy pizza-royale-brooklyn 3.0 39.0 True -74.011227 Dyker Heights Pizza|Italian Pizza Royale
50056016 NaN QUEENS 1526.0 COLLEGE POINT BLVD 11356 9.178734e+09 Other 1900-01-01 NaN NaN ... NaN None None NaN NaN None NaN None None None
50056017 VANGUARD WINE BAR MANHATTAN 1372.0 1ST AVE 10021 2.122889e+09 Other 1900-01-01 NaN NaN ... NaN None None NaN NaN None NaN None None None
50056018 NaN BRONX 3513.0 E TREMONT AVE 10465 7.188220e+09 Other 1900-01-01 NaN NaN ... 40.830978 3513 E Tremont Ave legendary-bar-and-restaurant-bronx 4.5 11.0 False -73.826118 Schuylerville Bars|Italian Legendary Bar & Restaurant
50056020 DOT & LINE BROOKLYN 253.0 BERGEN ST 11217 3.475178e+08 Other 1900-01-01 NaN NaN ... NaN None None NaN NaN None NaN None None None
50056021 NaN MANHATTAN 2656.0 BROADWAY 10025 7.184900e+09 Other 1900-01-01 NaN NaN ... 40.797590 2656 Broadway lava-kitchen-new-york 0.0 0.0 True -73.969310 Manhattan Valley Chinese|Asian Fusion|Noodles Lava Kitchen
50056022 NaN QUEENS 3526.0 FARRINGTON ST 11354 7.184600e+09 Other 1900-01-01 NaN NaN ... 40.763670 3526 Farrington St joomakgol-flushing 4.5 47.0 True -73.831750 Flushing|Downtown Flushing Korean JooMakGol
50056030 NaN QUEENS 11916.0 ROCKAWAY BLVD 11420 9.178057e+09 Other 1900-01-01 NaN NaN ... NaN None None NaN NaN None NaN None None None
50056031 NaN BROOKLYN 652.0 PENNSYLVANIA AVE 11207 9.174454e+09 Other 1900-01-01 NaN NaN ... NaN None None NaN NaN None NaN None None None
50056033 NaN QUEENS 10837.0 CORONA AVE 11368 9.172952e+09 Other 1900-01-01 NaN NaN ... NaN None None NaN NaN None NaN None None None
50056036 NORTHERN CAFE INC. QUEENS 10113.0 NORTHERN BLVD 11368 3.477838e+09 Other 1900-01-01 NaN NaN ... NaN None None NaN NaN None NaN None None None
50056056 NaN QUEENS 1304.0 BEACH CHANNEL DR 11691 3.478130e+09 Other 1900-01-01 NaN NaN ... NaN None None NaN NaN None NaN None None None
50056057 NaN QUEENS 6418.0 108TH ST 11375 9.173228e+09 Other 1900-01-01 NaN NaN ... NaN None None NaN NaN None NaN None None None
50056063 NaN MANHATTAN 200.0 DYCKMAN ST 10040 3.472959e+09 Other 1900-01-01 NaN NaN ... NaN None None NaN NaN None NaN None None None
50056065 NaN MANHATTAN 4241.0 BROADWAY 10033 6.468212e+09 Other 1900-01-01 NaN NaN ... NaN None None NaN NaN None NaN None None None
50056067 HAROLD'S MEAT & THREE MANHATTAN 231.0 HUDSON ST 10013 3.476752e+09 Other 1900-01-01 NaN NaN ... NaN None None NaN NaN None NaN None None None
50056068 XELAJU RESTAURANT QUEENS 8830.0 PARSONS BLVD 11432 7.186573e+09 Other 1900-01-01 NaN NaN ... 40.706911 88-30 Parsons Blvd luna-de-xelaju-restaurant-and-pizzeria-jamaica 2.0 1.0 False -73.802483 Jamaica Spanish Luna de Xelaju Restaurant & Pizzeria
50056069 NaN MANHATTAN 4484.0 BROADWAY 10040 9.172041e+09 Other 1900-01-01 NaN NaN ... NaN None None NaN NaN None NaN None None None
50056073 CULINART INC MANHATTAN 1.0 UN PLZ 10017 2.127581e+09 Other 1900-01-01 NaN NaN ... NaN None None NaN NaN None NaN None None None
50056074 ABRACO MANHATTAN 81.0 E 7TH ST 10003 3.476102e+09 Other 1900-01-01 NaN NaN ... NaN None None NaN NaN None NaN None None None
50056081 NaN MANHATTAN 232.0 E 111TH ST 10029 9.174841e+09 Other 1900-01-01 NaN NaN ... NaN None None NaN NaN None NaN None None None
50056087 NaN BROOKLYN 293.0 AVENUE X 11223 9.177708e+09 Other 1900-01-01 NaN NaN ... NaN None None NaN NaN None NaN None None None

26064 rows × 29 columns

In [251]:
inspections_ffff.to_csv("../data/yelp_dohmh_agg_data.csv", encoding='utf-8')