Understanding How menus.nypl.org Handles Data Updates

Trevor Muñoz, @trevormunoz

11 October 2014

Let's walk through the process of identifying a data quality problem and then correcting it through NYPL's own interfaces. For the purposes of this experiment, I'll use some data that represents clusters of similar dish names that begin with non-ascii alphabetic characters …

In [1]:
SOURCE_FILE = '/Users/libraries/Code/menus-site/data/dishes/fixtures_by_alpha/_nonascii-fixture_data-reshaped_with_items.json'
In [2]:
import json
In [3]:
with open(SOURCE_FILE, 'r') as infile:
    OUTLIER_DATA = json.loads(infile.read())
In [4]:
len(OUTLIER_DATA.keys())
Out[4]:
31631
In [5]:
import re

def starts_with_number(obj):
    if re.match(r'[0-9]', obj) is not None:
        return True
    else:
        return False
In [6]:
NUMBER_OUTLIER_DATA = {k: v for k,v in OUTLIER_DATA.items() if starts_with_number(k) == True}

From previous exploration of the data, we know that some dishes are listed in the database even though they have a value of 0 for times_appeared — Let's filter those out.

In [7]:
def find_nonnull_appearances(key):
    resultlist = NUMBER_OUTLIER_DATA[key]
    filtered_resultlist = [n for n in resultlist if n['times_appeared'] != 0]
    return (key, filtered_resultlist)
In [8]:
APPEARS_1_OR_MORE = {k: v for k,v in NUMBER_OUTLIER_DATA.items() if find_nonnull_appearances(k)[1] != []}
In [9]:
len(APPEARS_1_OR_MORE.keys())
Out[9]:
30671

Already in this set, I see two patterns of errors that I would like to investigate further:

  • Where double zeroes get sorted to the beginning of a key — someone has probably transcribed the price as part of the dish name
  • Where 4 digits get sorted to the beginning of a key this is probably a year, probably for booze (in other words, an outlier value but not an error)

Let's investigate the double zeros …

In [10]:
DOUBLE_0_START = {k: v for k,v in APPEARS_1_OR_MORE.items() if k.startswith('00')}
In [11]:
len(DOUBLE_0_START.keys())
Out[11]:
685

By filtering for fingerprint values that start with numbers, from this selecting a particular pattern of "errors", and eliminating dishes that appear 0 times, we come out with a much-smaller set of possible test cases. Let's take a random sample of these …

In [12]:
import random
In [13]:
TEST_CASES = dict(random.sample(DOUBLE_0_START.items(), 20))
In [14]:
print(TEST_CASES.keys())
dict_keys(['00 1 2 champagne heidsieck pints', '00 2 cologne d from grunhauser leiden moselle', '00 1 crabflakes salad', '00 1 dinner', '00 1 charge person room service', '00 28 beetesauce gedämpft in lachsschnitte mangold petersilienkartoffeln rote', '00 19 duck pie s shepherd', '00 3 arlequin de mignon veau', '00 1825 2 and pale pts qts sherry yriarte', '00 1847 3 and chateau claret lafitte pts qts', '00 1836 3 imported in judge m madeira rich s story', '00 1 fresh hawaiian pineapple', '00 1844 3 and chateau claret margaux pts qts', '00 3 chicken fresh killed milkfed roast', '00 1 25 bacon fried remoulade scallops with', '00 14 20 and beans fava lobster radicchio risotto savory with', '00 58 99 balthazar de fruits grand le mer plateaux', '00 1 50 box cigarettes franc large per size smallsize', '00 1841 3 bottled castle chateau claret lafitte', '00 10 and cheese cream nova salmon scotia smoked'])
In [15]:
TEST_CASES['00 1 crabflakes salad']
Out[15]:
[{'item_uri': ['http://menus.nypl.org/menu_items/944194/edit'],
  'menus_appeared': 1,
  'times_appeared': 1,
  'name': 'Crabflakes Salad 1.00',
  'page_uri': ['http://menus.nypl.org/menu_pages/63015'],
  'dish_uri': 'http://menus.nypl.org/dishes/361169'}]
In [16]:
IMG_URL = 'http://j2k.repo.nypl.org/adore-djatoka/resolver?url_ver=Z39.88-2004&rft_id=urn:uuid:bcb737e9-be35-80f1-e040-e00a180630eb&svc_id=info:lanl-repo/svc/getRegion&svc_val_fmt=info:ofi/fmt:kev:mtx:jpeg2000&svc.format=image/jpeg&svc.rotate=0&svc.region=2306,1222,438,2085&svc.scale=950,200'
In [17]:
from PIL import Image
In [18]:
import requests
In [19]:
img_req = requests.get(IMG_URL, stream=True)
if img_req.status_code == 200:
    with open('/tmp/menu_data_downloads/menu_slice.jpg', 'wb') as savefile:
        for chunk in img_req.iter_content():
            savefile.write(chunk)
In [20]:
im = Image.open('/tmp/menu_data_downloads/menu_slice.jpg')
In [21]:
#copied from http://nbviewer.ipython.org/gist/deeplook/5162445

from io import BytesIO

from IPython.core import display
#from PIL import Image


def display_pil_image(im):
   """Displayhook function for PIL Images, rendered as PNG."""

   b = BytesIO()
   im.save(b, format='png')
   data = b.getvalue()

   ip_img = display.Image(data=data, format='png', embed=True)
   return ip_img._repr_png_()


# register display func with PNG formatter:
png_formatter = get_ipython().display_formatter.formatters['image/png']
dpi = png_formatter.for_type(Image.Image, display_pil_image)
In [22]:
im
Out[22]:

Now we can indeed see the problem — the price has been tacked onto to the end of the dish name

In [23]:
import os

Now, let's watch what happens through the API when we manually go and correct this error …

First, how many total dishes do we have at time t0?

In [24]:
payload = {"token" : os.environ['MENUS_API_KEY']} 

dish_count = requests.get('http://api.menus.nypl.org/dishes/', params=payload)
stats = json.loads(dish_count.content.decode())['stats']
print(stats)
{'count': 410997}

Now let's see the current API responses for the dish above …

In [25]:
DISH = TEST_CASES['00 1 crabflakes salad'][0]
print(DISH)
{'item_uri': ['http://menus.nypl.org/menu_items/944194/edit'], 'menus_appeared': 1, 'times_appeared': 1, 'name': 'Crabflakes Salad 1.00', 'page_uri': ['http://menus.nypl.org/menu_pages/63015'], 'dish_uri': 'http://menus.nypl.org/dishes/361169'}
In [29]:
import datetime
import time

target_path = re.split('/', DISH['dish_uri'], maxsplit=3)
api_uri = 'http://api.menus.nypl.org/{0}'.format(target_path[-1])

req_t0 = requests.get(api_uri, params=payload)
print(api_uri)
print(datetime.datetime.now())
print(req_t0.status_code)
resp_t0 = json.loads(req_t0.content.decode())
time.sleep(0.5)

#And we'll grab the linked menu while we're at it
menu_api_uri = api_uri + '/menus'
menu_resp_t0 = json.loads(requests.get(menu_api_uri, params=payload).content.decode())
http://api.menus.nypl.org/dishes/361169
2014-10-11 14:22:00.220319
200
In [30]:
print(json.dumps(resp_t0, indent=2))
{
  "last_appeared": 1933,
  "first_appeared": 1933,
  "links": [
    {
      "href": "http://menus.nypl.org/api/dishes",
      "rel": "index"
    },
    {
      "href": "http://menus.nypl.org/api/dishes/361169/menus",
      "rel": "menus"
    }
  ],
  "times_appeared": 1,
  "highest_price": null,
  "description": null,
  "menus_appeared": 1,
  "id": 361169,
  "lowest_price": null,
  "name": "Crabflakes Salad 1.00"
}

Now, I'm going to go to http://menus.nypl.org/menu_items/944194/edit and manually update the value and see what happens when "Crabflakes Salad 1.00" becomes "Crabflakes Salad"

In [33]:
target_path = re.split('/', DISH['dish_uri'], maxsplit=3)
api_uri = 'http://api.menus.nypl.org/{0}'.format(target_path[-1])

req_t1 = requests.get(api_uri, params=payload)
print(api_uri)
print(datetime.datetime.now())
print(req_t1.status_code)
resp_t1 = json.loads(req_t1.content.decode())
time.sleep(0.5)

#And we'll grab the linked menu while we're at it
menu_api_uri = api_uri + '/menus'
menu_resp_t1 = json.loads(requests.get(menu_api_uri, params=payload).content.decode())
http://api.menus.nypl.org/dishes/361169
2014-10-11 14:25:28.373109
404

So, now the same API request we executed a minute ago seems to 404 …

In [35]:
print(json.dumps(resp_t1, indent=2))
{
  "error": "Dish Not Found"
}

Let's go look at the webpage that loads after we make our edit…

In [63]:
from IPython.display import HTML
HTML('<iframe src=http://menus.nypl.org/menu_pages/63015?useformat=mobile width=700 height=600></iframe>')
Out[63]:

Now "Crabflakes Salad" links to http://menus.nypl.org/dishes/289940 … Let's check that out via the API

In [50]:
NEW_URI = 'http://menus.nypl.org/dishes/289940'

target_path = re.split('/', NEW_URI, maxsplit=3)
api_uri = 'http://api.menus.nypl.org/{0}'.format(target_path[-1])

req_t2 = requests.get(api_uri, params=payload)
print(api_uri)
print(datetime.datetime.now())
print(req_t2.status_code)
resp_t2 = json.loads(req_t2.content.decode())
time.sleep(0.5)

#And we'll grab the linked menu while we're at it
menu_api_uri = api_uri + '/menus'
menu_resp_t2 = json.loads(requests.get(menu_api_uri, params=payload).content.decode())
http://api.menus.nypl.org/dishes/289940
2014-10-11 14:44:49.779602
200
In [51]:
print(json.dumps(resp_t2, indent=2))
{
  "last_appeared": 1933,
  "first_appeared": 1933,
  "links": [
    {
      "href": "http://menus.nypl.org/api/dishes",
      "rel": "index"
    },
    {
      "href": "http://menus.nypl.org/api/dishes/289940/menus",
      "rel": "menus"
    }
  ],
  "times_appeared": 60,
  "highest_price": "$3.25",
  "description": null,
  "menus_appeared": 60,
  "id": 289940,
  "lowest_price": "$0.90",
  "name": "Crabflakes Salad"
}
In [53]:
dish_count_t2 = requests.get('http://api.menus.nypl.org/dishes/', params=payload)
stats_t2 = json.loads(dish_count_t2.content.decode())['stats']
print(stats)
{'count': 410997}

No "new" dish has appeared, so it looks like this has quickly gotten lumped in with other appearances of "Crabflakes Salad"? Obviously, I can't go back in time to check but I can inspect the same information from the latest published data dump …

In [54]:
import pandas as pd
In [55]:
!ls /tmp/menu_data_downloads/2014_10_01/
Dish.csv     Menu.csv     MenuItem.csv MenuPage.csv
In [57]:
OCT_1_DATA_DF = pd.DataFrame.from_csv('/tmp/menu_data_downloads/2014_10_01/Dish.csv', index_col='id')
In [59]:
OCT_1_DATA_DF[OCT_1_DATA_DF.index == 361169]
Out[59]:
name description menus_appeared times_appeared first_appeared last_appeared lowest_price highest_price
id
361169 Crabflakes Salad 1.00 NaN 1 1 1933 1933 0 0

There's our original entry that we just changed …

In [60]:
OCT_1_DATA_DF[OCT_1_DATA_DF.index == 289940]
Out[60]:
name description menus_appeared times_appeared first_appeared last_appeared lowest_price highest_price
id
289940 Crabflakes Salad NaN 59 59 1933 1933 0.9 3.25

Yep! Now we can see the difference — in the static version of the data there were only 59 appearances of "Crabflakes Salad" but the API now returns "times_appeared": 60

This makes tracking and managing updates a bit more complicated because instead of:

In [65]:
from IPython.display import Image
embed1 = Image('menu_data_updates1.png')
embed1
Out[65]:

We get:

In [66]:
embed2 = Image('menu_data_updates2.png')
embed2
Out[66]: