Using Pandas to Curate Data from the New York Public Library's What's On the Menu? Project

Trevor Muñoz

10 January 2014

For more background, please see the accompanying blog post.

In [1]:
# By convention
import pandas as pd

Loading Data

For this I want to set up two DataFrames: one with the contents of Dish.csv (~25 MB in the latest download) and one with the contents of MenuItem.csv (~114 MB).

As mentioned, Open Refine hangs before it can load MenuItem.csv (unless the Java heap size is increased to provide more memory). So, though the timings will vary across systems, I've included time-profiling information to show how quickly Pandas handles this job.
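The loading pattern used in the next cells can be sketched with a tiny in-memory CSV; io.StringIO stands in here for the local file paths:

```python
import io
import pandas as pd

# A miniature stand-in for Dish.csv; the real file has ~400,000 rows.
csv_text = """id,name,menus_appeared
1,Consomme printaniere royal,10
2,Chicken gumbo,110
"""

# index_col='id' makes the dish id the row index rather than an ordinary
# column, which is what enables index-based lookups and joins later.
df = pd.read_csv(io.StringIO(csv_text), index_col='id')
print(df.loc[2, 'name'])  # -> Chicken gumbo
```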

In [2]:
%%time

dish_data = '/Users/libraries/Dropbox/NYPL_Data/2013_12_18_11_50_34_data/Dish.csv'
dish_df = pd.read_csv(dish_data, index_col='id')
CPU times: user 701 ms, sys: 111 ms, total: 812 ms
Wall time: 888 ms
In [3]:
%%time

menu_item_data = '/Users/libraries/Dropbox/NYPL_Data/2013_12_18_11_50_34_data/MenuItem.csv'
item_df = pd.read_csv(menu_item_data, index_col='dish_id')
CPU times: user 3.43 s, sys: 519 ms, total: 3.95 s
Wall time: 4.42 s

We can make it easier to join the data later by selecting the dish id (the "primary key" of the first DataFrame) as the index column when loading the second DataFrame.
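That index-based linkage can be sketched with hypothetical miniature versions of the two tables:

```python
import pandas as pd

# Hypothetical miniature versions of the Dish and MenuItem tables.
dish = pd.DataFrame({'name': ['Chicken gumbo', 'Onion au gratin']},
                    index=pd.Index([2, 4], name='id'))
items = pd.DataFrame({'price': [0.6, 0.5, 0.35]},
                     index=pd.Index([2, 2, 4], name='dish_id'))

# join() matches on the indexes by default, so each menu-item row
# picks up the dish name for its dish_id automatically.
joined = items.join(dish)
print(joined['name'].tolist())
# -> ['Chicken gumbo', 'Chicken gumbo', 'Onion au gratin']
```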

With both DataFrames set up, we can print the first few rows to check that everything looks right.

In [4]:
dish_df[:5]
Out[4]:
name description menus_appeared times_appeared first_appeared last_appeared lowest_price highest_price
id
1 Consomme printaniere royal NaN 10 10 1897 1927 0.20 0.4
2 Chicken gumbo NaN 110 116 1895 1960 0.10 0.8
3 Tomato aux croutons NaN 15 15 1893 1917 0.25 0.4
4 Onion au gratin NaN 41 41 1900 1971 0.35 1.0
5 St. Emilion NaN 62 64 1881 1981 0.00 18.0
In [5]:
item_df[:5]
Out[5]:
id menu_page_id price high_price created_at updated_at xpos ypos
dish_id
1 1 1389 0.4 NaN 2011-03-28 15:00:44 UTC 2011-04-19 04:33:15 UTC 0.111429 0.254735
2 2 1389 0.6 NaN 2011-03-28 15:01:13 UTC 2011-04-19 15:00:54 UTC 0.438571 0.254735
3 3 1389 0.4 NaN 2011-03-28 15:01:40 UTC 2011-04-19 19:10:05 UTC 0.140000 0.261922
4 4 1389 0.5 NaN 2011-03-28 15:01:51 UTC 2011-04-19 19:07:01 UTC 0.377143 0.262720
5 5 3079 0.5 1 2011-03-28 15:21:26 UTC 2011-04-13 15:25:27 UTC 0.105714 0.313178

Pandas integrates with the Notebook to display DataFrames as nice HTML tables. It's also possible to print a summary description of a DataFrame.

In [6]:
dish_df
Out[6]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 399455 entries, 1 to 477346
Data columns (total 8 columns):
name              399455  non-null values
description       0  non-null values
menus_appeared    399455  non-null values
times_appeared    399455  non-null values
first_appeared    399455  non-null values
last_appeared     399455  non-null values
lowest_price      377962  non-null values
highest_price     377962  non-null values
dtypes: float64(3), int64(4), object(1)
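Readers with a current pandas version will see a truncated table rather than this summary when echoing a DataFrame; the equivalent view now comes from DataFrame.info(). A minimal sketch:

```python
import io
import pandas as pd

df = pd.DataFrame({'name': ['Chicken gumbo', 'St. Emilion'],
                   'description': [None, None]})

# info() writes the entry count, per-column non-null counts, and dtypes;
# capture it in a buffer to show it is plain text.
buf = io.StringIO()
df.info(buf=buf)
summary = buf.getvalue()
print(summary)
```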

Clustering to Find Duplicate Values

For the moment, I'll focus on the Dish data.

With a single function call, we can see how many different values we have in the 'name' column before we start de-duping.

In [7]:
unique_values_t0 = len(dish_df.name.unique())
print unique_values_t0
399421

To start, almost all the values of dish name are supposedly unique.

In [8]:
# Drop the 'description' column because it's empty and taking up screen real estate
del dish_df['description']

In order to find very similar values that we can identify as duplicates and normalize, I'll pursue the same series of steps I would take using Open Refine.

In [9]:
def strip_and_lower(x):
    """
    Basic string normalization:
    1) strip leading and trailing whitespace
    2) convert to lowercase
    3) normalize internal whitespace (remove extra spaces between tokens)
    """
    
    tokens = x.strip().lower().split()
    result = ' '.join(filter(None, tokens))
    
    return result

# Apply the function to every value of name using map() — rather than, for instance, using a loop
dish_df['name_modified'] = dish_df['name'].map(strip_and_lower)

A new column has been created:

In [10]:
dish_df[:5]
Out[10]:
name menus_appeared times_appeared first_appeared last_appeared lowest_price highest_price name_modified
id
1 Consomme printaniere royal 10 10 1897 1927 0.20 0.4 consomme printaniere royal
2 Chicken gumbo 110 116 1895 1960 0.10 0.8 chicken gumbo
3 Tomato aux croutons 15 15 1893 1917 0.25 0.4 tomato aux croutons
4 Onion au gratin 41 41 1900 1971 0.35 1.0 onion au gratin
5 St. Emilion 62 64 1881 1981 0.00 18.0 st. emilion

This already yields some duplicates which had been showing up as distinct values because of small (inconsequential) differences in whitespace or capitalization.

In [11]:
len(dish_df.name_modified.unique())
Out[11]:
373336

Now, if we implement Open Refine-style clustering, we should find more candidates for de-duping.

In [12]:
import re

def fingerprint(x):
    """
    A modified version of the fingerprint clustering algorithm implemented by Open Refine.
    See https://github.com/OpenRefine/OpenRefine/wiki/Clustering-In-Depth
    This does not normalize to ASCII characters since diacritics may be significant in this dataset
    """
    alphanumeric_tokens = filter(None, re.split(r'\W', x))
    seen = set()
    seen_add = seen.add
    deduped = sorted([i for i in alphanumeric_tokens if i not in seen and not seen_add(i)])
    fingerprint = ''.join(deduped)
    
    return fingerprint

dish_df['fingerprint'] = dish_df['name_modified'].map(fingerprint)
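For readers on Python 3, the same fingerprint logic can be written as follows (filter returns an iterator there, and the regex gets a raw string; the behaviour is unchanged):

```python
import re

def fingerprint_py3(x):
    """Open Refine-style fingerprint: split on non-word characters,
    drop empty strings and duplicates, sort the tokens, and rejoin."""
    tokens = filter(None, re.split(r'\W', x))
    seen = set()
    # set.add() returns None, so 'not seen.add(t)' records t while
    # keeping the first occurrence of each token.
    deduped = sorted(t for t in tokens if t not in seen and not seen.add(t))
    return ''.join(deduped)

# Different surface forms collapse to the same fingerprint.
print(fingerprint_py3('potatoes, au gratin'))   # -> augratinpotatoes
print(fingerprint_py3('au gratin  potatoes'))   # -> augratinpotatoes
```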

We can see the results as another new column

In [13]:
dish_df[:5]
Out[13]:
name menus_appeared times_appeared first_appeared last_appeared lowest_price highest_price name_modified fingerprint
id
1 Consomme printaniere royal 10 10 1897 1927 0.20 0.4 consomme printaniere royal consommeprintaniereroyal
2 Chicken gumbo 110 116 1895 1960 0.10 0.8 chicken gumbo chickengumbo
3 Tomato aux croutons 15 15 1893 1917 0.25 0.4 tomato aux croutons auxcroutonstomato
4 Onion au gratin 41 41 1900 1971 0.35 1.0 onion au gratin augratinonion
5 St. Emilion 62 64 1881 1981 0.00 18.0 st. emilion emilionst

To see the "clusters" of potential duplicate name values that the fingerprinting method identifies, we can group the data by the contents of the fingerprint column

In [14]:
clusters = dish_df.groupby('fingerprint')

# Order the clusters by size
s = clusters.size()
s.order()
Out[14]:
fingerprint
000000dedouzainehuitreslazelande                                1
0001001938duringhavemenmoreorplacedwho                          1
000100populationportslopesunnyto                                1
000112bottledrinkginhollandperrum                               1
000112bottledrinkginjamaicaperrum                               1
000112bottledrinkginoldperrumtom                                1
000112bottledrinkginperplymouthrum                              1
000113030aacresafterallandareasavebeenbeingbodegasbuttsbybyasscocompanyconsidereddelicatedondryextremelyfamilyfavoritefinestfirmforfoundedgenerationsgonzalesgonzalezhavinginisitjerezjoejudgedkeptknownlargestliterallymanuelmariamaturingmembersminimumnamedofoldolderoneoroverownerspepeproducingsschieffelinselectedsherriessherrysolerasspainstandardthethroughouttiotodaytranslatedultimateunclevineyardvineyardswarehouseswaswhichwhosewithworldyears    1
00012belosdzlalesno                                             1
000150additioncupsinofpintpricequarttowine                      1
000150boxcigarettesfranclargepersizesmall                       1
000150boxcigarettesfrancslargepersizesmall                      1
0001550andchampagnejouetmilsperrier                             1
000160cgermainegorgelmacons                                     1
000165amontilladoptsqtssherry                                   1
...
broiledchickenhalf       36
brownedhashedpotatoes    36
codryextraghmumm         36
eggsfriedtwo             36
alecgingerimported       38
mashedpotatoes           38
dryextraghmumms          39
creamhashedinpotatoes    40
creamicevanilla          40
augratinpotatoes         41
2eggsonpoachedtoast      44
friedpotatoessweet       45
boiledpotatoes           46
frenchfriedpotatoes      47
                         81
Length: 330038, dtype: int64
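A side note for current pandas versions: Series.order(), used above, was later removed, and sort_values() is its replacement. The grouping-and-counting pattern, sketched on toy data:

```python
import pandas as pd

df = pd.DataFrame({'fingerprint': ['augratinpotatoes', 'chickengumbo',
                                   'augratinpotatoes', 'augratinpotatoes']})

# Group rows by fingerprint and count the members of each cluster;
# sort_values() is the modern spelling of the old Series.order().
sizes = df.groupby('fingerprint').size().sort_values()
print(sizes['augratinpotatoes'])  # -> 3
```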

We might want to know what scope of variation there is in the size of the clusters. This can be accomplished with one line:

In [15]:
s.value_counts()
Out[15]:
1     294118
2      23204
3       6230
4       2564
5       1370
6        788
7        491
8        323
9        249
10       178
11       110
12       102
15        50
14        46
13        39
18        24
20        20
16        20
17        20
19        14
22        12
21        10
25         7
36         6
31         5
26         5
30         5
24         4
23         3
33         3
38         2
29         2
40         2
27         2
45         1
44         1
46         1
41         1
32         1
39         1
35         1
47         1
28         1
81         1
dtype: int64

So, by this measure, there are still 294,118 singleton values (down from 399,421), but there are 23,204 two-item clusters that could be normalized, 110 eleven-item clusters, a single 47-item cluster, and so on.

The print display of the various clusters above shows only the fingerprint value and the count of rows that share it, which is not the easiest way to see what's going on. For clarity, let's print the five largest clusters:

In [16]:
# Actually we'll be excluding the last item of the series since the value there is blank
for f in s.order()[-6:-1].index:
    print dish_df[dish_df.fingerprint == f].name
id
2759        Potatoes, au gratin
7176         Potatoes au Gratin
8373        Potatoes--Au gratin
35728       Potatoes: au gratin
44271        Au Gratin Potatoes
84510      Au Gratin (Potatoes)
94968     Potatoes, au Gratin, 
97166      POTATOES:- Au gratin
185040     Au Gratin [potatoes]
313168      Au Gratin  Potatoes
315697     (Potatoes) Au Gratin
325940       Au Gratin Potatoes
330420       au-Gratin Potatoes
353435     Potatoes:  Au gratin
373639       Potatoes Au Gratin
376396     Potatoes - au gratin
376414      Potatoes, Au Gratin
378597       Au gratin potatoes
378732       Potatoes au gratin
379157      Potatoes, au Gratin
379534        Potatoes AuGratin
379829     [Potatoes] au gratin
380247     [Potatoes] Au Gratin
380907      Potatoes au Gratin 
383920       Au gratin Potatoes
385327      Potatoes Au Gratin 
385777       Au Gratin potatoes
387869       Potatoes Au gratin
388838       potatoes au gratin
389716     Potatoes - Au gratin
391056      POTATOES: Au Gratin
398307      Potatoes: Au gratin
398348       Potatoes-Au gratin
414322       au gratin potatoes
414927      Potatoes -Au Gratin
420093       POTATOES Au Gratin
435770      Potatoes, Au gratin
441471      POTATOES: AU GRATIN
448532     POTATOES - Au gratin
450243       POTATOES AU GRATIN
453840      potatoes, au gratin
Name: name, dtype: object
id
22361       Poached Eggs (2) on toast
22562        Eggs, 2 Poached on Toast
22763         2 Poached Eggs on Toast
31324     Eggs, Poached on toast (2) 
48043       Eggs poached on toast (2)
51940       Poached eggs on toast (2)
79275      2 Poached  (Eggs) on Toast
81059      Eggs Poached (2), on Toast
81421        2 Eggs, poached on toast
90960       2 Poached (eggs) on Toast
91432       2 (Eggs) Poached on Toast
108224       Poached Eggs, 2 on toast
111031     Poached Eggs on Toast, [2]
127128     Eggs, Poached (2) on Toast
128696      Eggs (2) poached on toast
183077      EGGS Poached (2) on Toast
206923       2 Poached on Toast, Eggs
224031       Eggs: 2 Poached on Toast
296286    Eggs - Poached on Toast (2)
334815      (2) Poached Eggs on Toast
369966     Eggs (2), Poached on Toast
375226      Eggs, poached on toast(2)
376027     Eggs, poached on toast (2)
376449    EGGS, poached on toast (2) 
376635    [EGGS] poached on toast (2)
376746     EGGS, poached on toast (2)
377100     Poached eggs (2), on toast
380354     EGGS  poached on toast (2)
381711        2 Poached Eggs on toast
382721     EGGS, poached on toast [2]
383950      Poached Eggs on Toast (2)
389804      Poached eggs on Toast (2)
399299       2 Eggs, Poached on Toast
408143      Poached eggs (2) on toast
417205     Eggs, poached on toast [2]
433441        2 eggs poached on toast
440712     Eggs, Poached on Toast (2)
442689      Eggs poached on toast [2]
445035       2 eggs, poached on toast
452711     POACHED EGGS ON TOAST  (2)
458993     Eggs, Poached on Toast [2]
465510      EGGS poached on toast (2)
466056      EGGS poached on toast [2]
476441     Eggs poached, on toast (2)
Name: name, dtype: object
id
2089         Sweet Potatoes Fried
3780        Sweet Potatoes, Fried
4075         fried Sweet Potatoes
9476        Potatoes, sweet fried
17042      Fried sweet [potatoes]
17116      Potatoes, Sweet, Fried
38431      Sweet Potatoes - Fried
42055       Potatoes sweet, fried
56061       Potatoes, fried sweet
76902        Potatoes Sweet Fried
78525     Sweet (potatoes), fried
86202       POTATOES: Fried sweet
92455       Sweet Potatoes: fried
101532       Potatoes Fried Sweet
136255       Sweet fried potatoes
198754     Sweet  Potatoes, fried
349957      Sweet Potatoes--Fried
354008      Potatoes  SWEET Fried
356717     Sweet Potatoes---Fried
373866       Fried sweet potatoes
373917       FRIED SWEET POTATOES
373973     Sweet potatoes - Fried
374052       Fried Sweet Potatoes
376416      Potatoes, Fried Sweet
376488      Sweet Potatoes: Fried
379831     Fried [Sweet Potatoes]
380251     Fried [sweet potatoes]
387427     Sweet potatoes - fried
390331      Sweet potatoes, fried
392859       Fried Sweet potatoes
396587     SWEET POTATOES---Fried
397513    Sweet potatoes -- Fried
397930     SWEET POTATOES - Fried
401304       Sweet potatoes-fried
402285       Sweet potatoes-Fried
406090       fried sweet potatoes
409582       Sweet potatoes fried
411471      Sweet potatoes--Fried
431756     Sweet Potatoes - fried
452385      Sweet Potatoes, fried
454978      fried sweet  potatoes
456340       Sweet Fried Potatoes
461269     SWEET POTATOES - fried
465671     Sweet Potatoes, Fried 
471570      Fried Sweet Potatoes.
Name: name, dtype: object
id
219           Boiled potatoes
2752         Potatoes, boiled
4290         Potatoes--Boiled
17902        Boiled Potatoes,
20308         Potatoes Boiled
31198        Potatoes: boiled
34056        Boiled Potatoes.
37644       Potatoes - Boiled
40611         Boiled Potatoes
44255       Boiled (potatoes)
59475       Boiled [Potatoes]
63979         POTATOES-boiled
74342         Potatoes Boiled
76369        POTATOES  Boiled
85397       Potatoes , boiled
94958       Potatoes, Boiled,
97154       Potatoes:- Boiled
125408      POTATOES: *Boiled
194143       Boiled  Potatoes
199538        Boiled-Potatoes
225209      Potatoes:--boiled
362463      Potatoes---Boiled
374956        Boiled Potatoes
376397       Potatoes, Boiled
379856    Potatoes -- Boiled 
380179     Potatoes -- Boiled
380244      Boiled [potatoes]
385714       potatoes, boiled
389060       *Boiled Potatoes
394137        boiled potatoes
394579        Boiled POTATOES
395878        BOILED POTATOES
398306       Potatoes: Boiled
398346        Potatoes-Boiled
401302        Potatoes boiled
407244      Potatoes - boiled
412613      Potatoes, boiled 
426362       Potatoes boiled 
427434         BoiledPotatoes
427492       Potatoes Boiled 
432003       Boiled Potatoes 
436408      Potatoes, Boiled 
451614       POTATOES--Boiled
452667      POTATOES - Boiled
465416       POTATOES: Boiled
471029       Boiled potatoes.
Name: name, dtype: object
id
1259         French fried potatoes
2761        Potatoes, French fried
11893      Potatoes---French Fried
14987       Potatoes: French fried
17047      French fried [potatoes]
17949      Potatoes - French Fried
18441        Potatoes French fried
21120        Potatoes-French Fried
21264       Potatoes--French Fried
29043     POTATOES:  French Fried.
32765       Potatoes French, fried
35844       French Fried Potatoes.
37852      French Fried (POTATOES)
39455      Potatoes, French, fried
57562       Potatoes- French Fried
74346        Potatoes French Fried
76376       POTATOES  French Fried
97164      POTATOES:- French fried
100589    Potatoes - French Fried,
112767     Potatoes, Fried, French
147020     Potatoes:  French Fried
181396    Potatoes -- French Fried
248905       French Fried Potatoes
328845       French-Fried Potatoes
373638       Potatoes French Fried
375011       French Fried Potatoes
376400      Potatoes, French Fried
376487      Potatoes: French Fried
376508     Potatoes - French fried
379156       French Fried potatoes
379828     French Fried [Potatoes]
386120      Potatoes, french fried
387212      Potatoes--French fried
387671     French Fried (potatoes)
393034       French fried Potatoes
395877       FRENCH FRIED POTATOES
398352       Potatoes-French fried
401306       Potatoes french fried
401838      French  fried potatoes
404612     Potatoes:  French fried
407806      French fried potatoes 
410281       french fried potatoes
410591     Potatoes: French fried 
436571     POTATOES - French Fried
441467      POTATOES: FRENCH FRIED
455232      potatoes, french fried
473731      POTATOES--FRENCH FRIED
Name: name, dtype: object

At a glance these do indeed look like values that should be normalized to a single "name". Since the default way that IPython prints a Pandas Series object includes the 'id', it is easy to construct URLs to allow for some sanity checking.

(I'm not automating the sanity checking here because I don't want to get in bad odor with NYPL's anti-spambot mechanisms.)

In [17]:
testing_cluster = dish_df[dish_df.fingerprint == 'augratinpotatoes']

for item in testing_cluster.index.tolist():
    url = "http://menus.nypl.org/dishes/{0}".format(item)
    print url
http://menus.nypl.org/dishes/2759
http://menus.nypl.org/dishes/7176
http://menus.nypl.org/dishes/8373
http://menus.nypl.org/dishes/35728
http://menus.nypl.org/dishes/44271
http://menus.nypl.org/dishes/84510
http://menus.nypl.org/dishes/94968
http://menus.nypl.org/dishes/97166
http://menus.nypl.org/dishes/185040
http://menus.nypl.org/dishes/313168
http://menus.nypl.org/dishes/315697
http://menus.nypl.org/dishes/325940
http://menus.nypl.org/dishes/330420
http://menus.nypl.org/dishes/353435
http://menus.nypl.org/dishes/373639
http://menus.nypl.org/dishes/376396
http://menus.nypl.org/dishes/376414
http://menus.nypl.org/dishes/378597
http://menus.nypl.org/dishes/378732
http://menus.nypl.org/dishes/379157
http://menus.nypl.org/dishes/379534
http://menus.nypl.org/dishes/379829
http://menus.nypl.org/dishes/380247
http://menus.nypl.org/dishes/380907
http://menus.nypl.org/dishes/383920
http://menus.nypl.org/dishes/385327
http://menus.nypl.org/dishes/385777
http://menus.nypl.org/dishes/387869
http://menus.nypl.org/dishes/388838
http://menus.nypl.org/dishes/389716
http://menus.nypl.org/dishes/391056
http://menus.nypl.org/dishes/398307
http://menus.nypl.org/dishes/398348
http://menus.nypl.org/dishes/414322
http://menus.nypl.org/dishes/414927
http://menus.nypl.org/dishes/420093
http://menus.nypl.org/dishes/435770
http://menus.nypl.org/dishes/441471
http://menus.nypl.org/dishes/448532
http://menus.nypl.org/dishes/450243
http://menus.nypl.org/dishes/453840

Better Dish Statistics

Identifying the clusters is valuable: we can use this information not only to improve the quality of the curated data set but also to clarify and improve analysis of the data about dish frequencies and "timespans." For the purposes of example, I'll use the "french fried potatoes" cluster.

With 47 values to choose from, what should we select as the normalized value? How about the most common string already represented in the cluster?

In [18]:
# Get a Dataframe consisting of all the rows that share a common fingerprint
result_df = dish_df[dish_df.fingerprint == 'frenchfriedpotatoes']

# Get the number of appearances for each result by selecting that column (a Series) and sorting it in descending order
by_appearance = result_df.times_appeared.order(ascending=False)

# Use the index of the max value to look up the most common value for the name
print "Most common value: '{0}'".format(dish_df.name[by_appearance.idxmax()])
Most common value: 'French fried potatoes'
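A slight variant: idxmax() alone returns the row label of the maximum, so no sort is strictly required. A sketch using a hypothetical two-row slice of this cluster:

```python
import pandas as pd

# A hypothetical two-row slice of the 'frenchfriedpotatoes' cluster.
cluster = pd.DataFrame(
    {'name': ['French fried potatoes', 'Potatoes, French fried'],
     'times_appeared': [1392, 243]},
    index=pd.Index([1259, 2761], name='id'))

# idxmax() gives the index label of the largest value directly.
most_common = cluster.loc[cluster.times_appeared.idxmax(), 'name']
print(most_common)  # -> French fried potatoes
```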

Does clustering the values change the number of times a dish appeared or the earliest and latest dates it appears?

In [19]:
result_df.times_appeared[:5]
Out[19]:
id
1259     1392
2761      243
11893       1
14987      23
17047       3
Name: times_appeared, dtype: int64

According to this data (it varies slightly from what appears on the web site), the dish with id 1259 and name 'French fried potatoes' appears 1,392 times. Dish 2761 (name "Potatoes, French fried") appears 243 times, etc.

If we assert that all of these instances are in fact the same type of dish, we can quickly calculate a new total for how often this dish appears on the menus in the data set.

In [20]:
total_appearances = result_df.times_appeared.sum()
print "French fried potatoes appear {0} times in the menus digitized by NYPL.".format(total_appearances)
French fried potatoes appear 2140 times in the menus digitized by NYPL.

We can do something similar for the date ranges. (It isn't even necessary to do anything with the data type of the values for first_appeared and last_appeared.)

In [21]:
# It turns out we can't straightforwardly use min() as well as max(),
# because some dishes apparently have '0' for first_appeared,
# so take the value from the first row of the cluster instead.
aggregate_first_appeared = result_df.first_appeared.values[0]
aggregate_last_appeared = result_df.last_appeared.max()

print "French fried potatoes first appeared on a menu in {0} and last appeared on a menu in {1}"\
.format(aggregate_first_appeared, aggregate_last_appeared)
French fried potatoes first appeared on a menu in 1884 and last appeared on a menu in 1989

For an apparently ever-popular dish like 'French fried potatoes' this 100-or-so year range is not that different from what appears on the page for the most-popular variant (Dish 1259). The aggregate values might shift more for other dishes.

Of course, there are also still 'pommes frites' to account for …
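A more defensive alternative to taking the first row's first_appeared value is to mask the zero placeholders before calling min(); a sketch with hypothetical values:

```python
import pandas as pd

# Hypothetical first_appeared values; 0 stands in for "unknown year".
first_appeared = pd.Series([1884, 0, 1900, 1893])

# Exclude the zero placeholders, then take the minimum.
earliest = first_appeared[first_appeared > 0].min()
print(earliest)  # -> 1884
```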

Plotting Results

Reproducing NYPL's lovely appearance-by-year frequency plots takes a little more doing, but it provides a good opportunity to demonstrate the facilities that Pandas provides for pulling together different sets of data using SQL-like mechanisms.

In [22]:
# Load data from the other CSV files included in the NYPL's regular data dumps because we'll need it all 
page_data = '/Users/libraries/Dropbox/NYPL_Data/2013_12_18_11_50_34_data/MenuPage.csv'
page_df = pd.read_csv(page_data, index_col='id')

menu_data = '/Users/libraries/Dropbox/NYPL_Data/2013_12_18_11_50_34_data/Menu.csv'
menu_df = pd.read_csv(menu_data, index_col='id', parse_dates=True)

(For this final bit, I'm not entirely confident that I'm doing this in the best way. It seems to "work" but it feels a little verbose. Caveat lector.)

In [23]:
# This should create a DataFrame from the Menu Items data set that contains 
# only "rows" related to our example cluster of 'French fried potato' dishes
lookup_items_df = item_df[item_df.index.isin(result_df.index)]
In [24]:
# The length of this DataFrame should be the same as the sum of occurrences we calculated above
try:
    assert len(lookup_items_df) == result_df.times_appeared.sum()
    print "Length of the DataFrame is consistent"
except AssertionError:
    print "Something's wrong"
Length of the DataFrame is consistent
In [25]:
lookup_items_df.head(10)
Out[25]:
id menu_page_id price high_price created_at updated_at xpos ypos
dish_id
1259 454 5217 NaN NaN 2011-04-19 03:29:41 UTC 2011-10-05 21:07:14 UTC 0.342857 0.761442
1259 530 130 NaN NaN 2011-04-19 13:17:01 UTC 2011-05-13 05:14:16 UTC 0.667143 0.735197
1259 1194 6471 NaN NaN 2011-04-19 19:22:19 UTC 2011-05-06 18:52:50 UTC 0.315714 0.747705
1259 1792 134 NaN NaN 2011-04-19 20:42:20 UTC 2011-04-19 20:42:20 UTC 0.550000 0.685084
1259 2093 146 NaN NaN 2011-04-19 22:22:22 UTC 2011-04-19 22:22:22 UTC 0.568571 0.683963
1259 2953 1627 NaN NaN 2011-04-20 02:18:30 UTC 2011-04-20 02:18:30 UTC 0.508571 0.623720
1259 3715 4783 0.10 NaN 2011-04-20 05:06:28 UTC 2011-10-02 16:31:41 UTC 0.301429 0.338671
2761 3966 3531 0.10 NaN 2011-04-20 05:25:58 UTC 2011-04-20 05:25:58 UTC 0.342857 0.477359
1259 4565 1432 0.10 NaN 2011-04-20 16:04:24 UTC 2011-05-11 20:53:58 UTC 0.131429 0.824837
2761 5788 5006 0.15 NaN 2011-04-20 19:12:01 UTC 2011-05-31 20:07:02 UTC 0.535714 0.652158

From this intermediate DataFrame we want to grab the values for menu_page_id so that we can look up the relevant menus to which they belong:

In [26]:
target_pages = lookup_items_df.menu_page_id.values

lookup_pages_df = page_df[page_df.index.isin(target_pages)]
In [27]:
lookup_pages_df.head(10)
Out[27]:
menu_id page_number image_id full_height full_width uuid
id
130 12463 1 466928 3049 2004 510D47DB-491F-A3D9-E040-E00A18064A99
134 12465 1 466931 3411 2291 510d47db-4923-a3d9-e040-e00a18064a99
146 12470 1 466940 3427 2282 510d47db-4933-a3d9-e040-e00a18064a99
263 12505 1 475815 4949 3261 510d47db-4309-a3d9-e040-e00a18064a99
265 12506 3 4000008161 5559 3229 510d47db-430b-a3d9-e040-e00a18064a99
309 12518 3 4000013961 4659 2823 510d47db-662f-a3d9-e040-e00a18064a99
332 12523 5 4000013986 3703 2427 510d47db-6652-a3d9-e040-e00a18064a99
460 12562 1 466438 2803 1979 510d47db-44a4-a3d9-e040-e00a18064a99
482 12569 1 466597 2600 1811 510d47db-461a-a3d9-e040-e00a18064a99
501 12577 1 466783 2884 2083 510d47db-47c1-a3d9-e040-e00a18064a99

Now we can grab all the relevant values of 'menu_id' for menus and do the last lookup:

In [28]:
target_menus = lookup_pages_df.menu_id.values
In [29]:
lookup_menus_df = menu_df[menu_df.index.isin(target_menus)]

Now we should have a DataFrame containing only menus that have pages on which one of the dishes from our 'French fried potato' cluster appears:

In [30]:
lookup_menus_df.head(10)
Out[30]:
name sponsor event venue place physical_description occasion notes call_number keywords language date location location_type currency currency_symbol status page_count dish_count
id
12463 NaN HOTEL EASTMAN BREAKFAST COMMERCIAL HOT SPRINGS, AR CARD; 4.75X7.5; EASTER; NaN 1900-2822 NaN NaN 1900-04-15 Hotel Eastman NaN NaN NaN under review 2 67
12465 NaN NORDDEUTSCHER LLOYD BREMEN FRUHSTUCK/BREAKFAST; COMMERCIAL DAMPFER KAISER WILHELM DER GROSSE; CARD; ILLU; COL; 5.5X8.0; NaN MENU IN GERMAN AND ENGLISH; ILLUS, STEAMSHIP A... 1900-2827 NaN NaN 1900-04-16 Norddeutscher Lloyd Bremen NaN NaN NaN under review 2 84
12470 NaN NORDDEUTSCHER LLOYD BREMEN FRUHSTUCK/BREAKFAST COMMERCIAL SCHNELLDAMPFER KAISER WILHELM DER GROSSE; BROADSIDE; ILLUS; COL; 5.5X8.50; NaN MENU IN GERMAN AND ENGLISH; ILLUS, LIGHTHOUSE;... 1900-2839 NaN NaN 1900-04-17 Norddeutscher Lloyd Bremen NaN NaN NaN under review 2 80
12505 NaN BARTHOLDI HOTEL BREAKFAST COMMERCIAL NY BROADSIDE; COL; 7 X 11.75; DAILY VERY POOR CONDITION; 1900-1137 NaN NaN 1900-02-18 Bartholdi Hotel NaN Dollars $ under review 2 203
12506 NaN BARTHOLDI HOTEL LUNCH & DINNER COMMERCIAL 23RD ST & BWAY NY FOLDER; ILLUS;8.25 X 14; DAILY MANY ITEMS HANDWRITTEN; HOURS INCLUDED; 1900-1138 NaN NaN 1900-02-18 Bartholdi Hotel NaN Dollars $ under review 4 336
12518 NaN COLUMBIA RESTAURANT DAILY MENU COMMERCIAL 48 EAST 14TH STREET,[NEW YORK,NY?] FOLDER;ILLUS;7.25X11.5; DAILY; 2 COPIES;WINE LIST; 1901-1527 NaN NaN 1901-06-03 Columbia Restaurant NaN Dollars $ under review 4 475
12523 NaN CITIZENS'STEAMBOAT COMPANY LUNCH COMMERCIAL STEAMER SARATOGA,TROY LINE BOOKLET;ILLUS;COL;6.25X9; DAILY; AU SABLE CHASM ON FRONT COVER;BLUE MOUNTAIN LA... 1901-1537 NaN NaN 1901-01-01 Citizens'steamboat Company NaN Dollars $ under review 8 246
12562 NaN CUNARD LINE BREAKFAST COMMERCIAL EN ROUTE ON BOARD R.M.S. CAMPANIA CARD; 4.5 X 6.5; NaN PRICED WINE LIST ON BACK OF MENU; PRICES IN BR... 1900-2170 NaN NaN 1900-02-28 Cunard Line NaN NaN NaN under review 2 110
12569 NaN MAXWELL HOUSE SUPPER COMMERCIAL NASHVILLE, [TN?] CARD; ILLUS; 4.25X6.25 NaN ELABORATE PRINTING OF HOTEL NAME FORMS ILLUSTR... 1900-2397 NaN NaN 1900-03-12 Maxwell House NaN NaN NaN under review 2 38
12577 NaN HADDON HALL SUPPER COMMERCIAL [PHILADELPHIA,PA.] BROADSIDE; ILLUS; COL; 5 X 7; DAILY NaN 1900-2625 NaN NaN 1900-03-31 Haddon Hall NaN NaN NaN under review 2 60
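For comparison, the three isin() lookups above could also be expressed as a chain of SQL-style merges, which reads a little less verbosely; a sketch with hypothetical miniature tables (column names follow the CSVs used above):

```python
import pandas as pd

# Hypothetical miniature versions of the three tables.
items = pd.DataFrame({'dish_id': [1259, 1259], 'menu_page_id': [454, 530]})
pages = pd.DataFrame({'id': [454, 530], 'menu_id': [12463, 12465]})
menus = pd.DataFrame({'id': [12463, 12465],
                      'date': ['1900-04-15', '1900-04-16']})

# merge() is pandas' SQL-style join; chaining items -> pages -> menus
# replaces the three separate isin() lookups.
linked = (items
          .merge(pages, left_on='menu_page_id', right_on='id')
          .merge(menus, left_on='menu_id', right_on='id'))
print(linked['date'].tolist())  # -> ['1900-04-15', '1900-04-16']
```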

This next bit is a workaround, since parsing the dates in the date column doesn't seem to be working for me as expected. I need to extract the year from each menu's date so that I can group the number of appearances of a dish by year.

In [31]:
import dateutil.parser

def extract_year(x):
    try:
        parsed = dateutil.parser.parse(x)
        return parsed.year
    except (ValueError, TypeError):
        return None

lookup_menus_df['date_year'] = lookup_menus_df['date'].map(extract_year)
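As an alternative to the per-value dateutil workaround, pandas' vectorized parser can do the same job, with errors='coerce' mapping unparseable values to NaT; a sketch:

```python
import pandas as pd

dates = pd.Series(['1900-04-15', 'not a date', '1901-06-03'])

# to_datetime with errors='coerce' turns unparseable values into NaT,
# and .dt.year then extracts the year (NaN where parsing failed).
years = pd.to_datetime(dates, errors='coerce').dt.year
print(years.tolist())
```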

With a little IPython magic, it's possible to plot frequency of appearances:

In [32]:
%matplotlib inline

by_year = lookup_menus_df.groupby(lookup_menus_df['date_year'])
year_series = by_year.size()
year_series.plot(figsize=(20,10))
Out[32]:
<matplotlib.axes.AxesSubplot at 0x10e5ff4d0>

Of course, this graph does not look right. What's going on with that huge spike at the beginning of the 20th century? And why does the number of appearances decrease toward mid-century?

This skew is an artifact of the coverage of the underlying menu collection that NYPL holds. From the Library's own collection guide:

Thousands of menus beginning with the oldest items in the collection from 1851. The collection is strongest for the period between 1890 and 1910.

This is why NYPL, on their own site, has normalized the plot of dish appearances by total number of menus held for a given year.

In [33]:
# Extract years in the complete DataFrame for menus
menu_df['date_year'] = menu_df['date'].map(extract_year)

# Group by year and calculate the number of menus held for each year
total_menu_by_year = menu_df.groupby(menu_df['date_year'])
menu_total_series = total_menu_by_year.size()

With this information, we can create a normalized graph:

In [34]:
normed = [year_series[date_year]/menu_total_series[date_year].astype(float) for date_year in year_series.index.tolist()]
pd.Series(normed, index=year_series.index).plot(figsize=(20,10))
Out[34]:
<matplotlib.axes.AxesSubplot at 0x10e60a0d0>
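The list comprehension can also be replaced by plain Series division, since pandas aligns the two series on their shared year index; a sketch with toy counts:

```python
import pandas as pd

appearances = pd.Series({1900: 4, 1901: 2})   # menus featuring the dish
menu_totals = pd.Series({1900: 8, 1901: 8})   # all menus held for that year

# Division aligns the two Series on their shared year index,
# so no explicit loop over years is needed.
normed = appearances / menu_totals
print(normed.tolist())  # -> [0.5, 0.25]
```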

I hope this notebook demonstrates some ways to use the data analysis tools provided by Pandas in the service of curating data from the What's On the Menu? project.