Trevor Muñoz
10 January 2014
For more background, please see the accompanying blog post
# By convention
import pandas as pd
For this I want to set up two DataFrames: one with the contents of Dish.csv (~25 MB in the latest download) and one with the contents of MenuItem.csv (~114 MB).
As mentioned, Open Refine hangs before it can load MenuItem.csv (unless the Java heap size is increased to provide more memory). So, though the timings will vary across systems, I've included time-profiling information to show how speedily Pandas handles this job.
%%time
dish_data = '/Users/libraries/Dropbox/NYPL_Data/2013_12_18_11_50_34_data/Dish.csv'
dish_df = pd.read_csv(dish_data, index_col='id')
%%time
menu_item_data = '/Users/libraries/Dropbox/NYPL_Data/2013_12_18_11_50_34_data/MenuItem.csv'
item_df = pd.read_csv(menu_item_data, index_col='dish_id')
We can make it easier to join the data up later by indexing the second DataFrame on dish_id, which corresponds to the "primary key" (the id column) of the first DataFrame.
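As a quick illustration (and not part of the curation workflow below), sharing an index means menu-item rows can be joined directly onto their dish records; the rsuffix argument is just a precaution against overlapping column names.
# Sketch: attach dish attributes to menu-item rows via the shared dish id index
item_df.join(dish_df, rsuffix='_dish').head()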
With both DataFrames set up, we can print the first few rows to check that everything looks right.
dish_df[:5]
item_df[:5]
Pandas integrates with the Notebook to display DataFrames as nice HTML tables. It's also possible to print a summary description of a DataFrame.
dish_df
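(Exactly what displays for a large DataFrame varies by pandas version; dish_df.info() prints the summary explicitly in any case.)
# Explicit summary: index, column names, non-null counts, and dtypes
dish_df.info()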
For the moment, I'll focus on the Dish data.
With a single function call, we can see how many different values we have in the 'name' column before we start de-duping:
unique_values_t0 = len(dish_df.name.unique())
print unique_values_t0
To start, almost all of the dish name values are supposedly unique.
# Drop the 'description' column because it's empty and taking up screen real estate
del dish_df['description']
In order to find very similar values that we can identify as duplicates and normalize, I'll pursue the same series of steps I would take using Open Refine.
def strip_and_lower(x):
    """
    Basic string normalization:
    1) strip leading and trailing whitespace
    2) convert to lowercase
    3) normalize internal whitespace (remove extra spaces between tokens)
    """
    tokens = x.strip().lower().split()
    result = ' '.join(filter(None, tokens))
    return result
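A quick check (not in the original workflow) that the normalization behaves as described:
# Leading/trailing whitespace, capitalization, and internal runs of spaces all disappear
print strip_and_lower('  French  Fried   Potatoes ')
# prints: french fried potatoes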
# Apply the function to every value of name using map() — rather than, for instance, using a loop
dish_df['name_modified'] = dish_df['name'].map(strip_and_lower)
A new column has been created:
dish_df[:5]
This already yields some duplicates which had been showing up as distinct values because of small (inconsequential) differences in whitespace or capitalization.
len(dish_df.name_modified.unique())
Now, if we implement Open Refine-style clustering, we should find more candidates for de-duping.
import re
def fingerprint(x):
    """
    A modified version of the fingerprint clustering algorithm implemented by Open Refine.
    See https://github.com/OpenRefine/OpenRefine/wiki/Clustering-In-Depth
    This does not normalize to ASCII characters since diacritics may be significant in this dataset.
    """
    alphanumeric_tokens = filter(None, re.split(r'\W', x))
    seen = set()
    seen_add = seen.add
    deduped = sorted([i for i in alphanumeric_tokens if i not in seen and not seen_add(i)])
    fingerprint = ''.join(deduped)
    return fingerprint
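As a quick sanity check (again, not part of the original workflow), two name variants that ought to land in the same cluster produce the same fingerprint:
# Both variants reduce to the same sorted, de-duplicated token string
print fingerprint('french fried potatoes')
print fingerprint('potatoes, french fried')
# prints: frenchfriedpotatoes (twice)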
dish_df['fingerprint'] = dish_df['name_modified'].map(fingerprint)
We can see the results as another new column:
dish_df[:5]
To see the "clusters" of potential duplicate name values that the fingerprinting method identifies, we can group the data by the contents of the fingerprint column
clusters = dish_df.groupby('fingerprint')
# Order the clusters by size
s = clusters.size()
s.order()
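(A side note: Series.order() has since been removed from pandas; in newer versions the equivalent call is sort_values().)
# Equivalent in newer versions of pandas, where Series.order() no longer exists
s.sort_values()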
We might want to know how much variation there is in the size of the clusters. This can be accomplished with one line:
s.value_counts()
So, by this measure, there are still 294,118 singleton values (down from 399,421), but there are 23,204 two-item clusters that could be normalized, 110 11-item clusters, just one 47-item cluster, and so on.
The display of the clusters above shows only the fingerprint value and the count of rows that share it, which is not the easiest way to see what's going on. For clarity, let's print the five largest clusters:
# Actually we'll be excluding the last item of the series since the value there is blank
for f in s.order()[-6:-1].index:
    print dish_df[dish_df.fingerprint == f].name
At a glance these do indeed look like values that should be normalized to a single "name". Since the default way that IPython prints a Pandas Series object includes the 'id', it is easy to construct URLs to allow for some sanity checking.
(I'm not automating the sanity checking here because I don't want to get in bad odor with NYPL's anti-spambot mechanisms.)
testing_cluster = dish_df[dish_df.fingerprint == 'augratinpotatoes']
for item in testing_cluster.index.tolist():
    url = "http://menus.nypl.org/dishes/{0}".format(item)
    print url
Identifying the clusters is great and we can use this information not only to improve the quality of the curated data set but also to clarify and improve analysis of the data about dish frequencies and "timespans." For the purposes of example, I'll use the "french fried potatoes" cluster.
With 47 values to choose from, what should we select for a normalized value? How about the most common string already represented in the cluster?
# Get a Dataframe consisting of all the rows that share a common fingerprint
result_df = dish_df[dish_df.fingerprint == 'frenchfriedpotatoes']
# Get the times_appeared count for each row by selecting that column (a Series)
# and sorting it in descending order
by_appearance = result_df.times_appeared.order(ascending=False)
# Use the index of the max value to look up the most common value for the name
print "Most common value: '{0}'".format(dish_df.name[by_appearance.idxmax()])
Does clustering the values change the number of times a dish appeared or the earliest and latest dates it appears?
result_df.times_appeared[:5]
According to this data (it varies slightly from what appears on the web site), the dish with id 1259 and name 'French fried potatoes' appears 1,392 times. Dish 2761 (name "Potatoes, French Fried") appears 243 times, etc.
If we assert that all of these instances are in fact the same type of dish, we can quickly calculate a new total for how often this dish appears on the menus in the data set.
total_appearances = result_df.times_appeared.sum()
print "French fried potatoes appear {0} times in the menus digitized by NYPL.".format(total_appearances)
We can do something similar for the date ranges. (It isn't even necessary to do anything with the data type of the values for first_appeared and last_appeared.)
# It turns out we can't use min() as straightforwardly as we use max(),
# because some dishes apparently have '0' for first_appeared.
aggregate_first_appeared = result_df.first_appeared.values[0]
aggregate_last_appeared = result_df.last_appeared.max()
print "French fried potatoes first appeared on a menu in {0} and last appeared on a menu in {1}"\
.format(aggregate_first_appeared, aggregate_last_appeared)
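A sketch of an alternative that makes the intent explicit: drop the apparently-bogus '0' values and take the minimum of what remains.
# Ignore rows where first_appeared is 0, then take the earliest remaining year
nonzero_first = result_df.first_appeared[result_df.first_appeared > 0]
print nonzero_first.min()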
For an apparently ever-popular dish like 'French fried potatoes' this 100-or-so year range is not that different from what appears on the page for the most-popular variant (Dish 1259). The aggregate values might shift more for other dishes.
Of course, there are also still 'pommes frites' to account for …
Reproducing NYPL's lovely appearances-by-year frequency plots takes a little more doing, but it provides a good opportunity to demonstrate the facilities that Pandas provides for pulling together different sets of data using SQL-like mechanisms.
# Load data from the other CSV files included in the NYPL's regular data dumps because we'll need it all
page_data = '/Users/libraries/Dropbox/NYPL_Data/2013_12_18_11_50_34_data/MenuPage.csv'
page_df = pd.read_csv(page_data, index_col='id')
menu_data = '/Users/libraries/Dropbox/NYPL_Data/2013_12_18_11_50_34_data/Menu.csv'
menu_df = pd.read_csv(menu_data, index_col='id', parse_dates=True)
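One thing worth noting, and possibly related to the date-parsing issue mentioned below: with index_col set, parse_dates=True only attempts to parse the index column ('id' here). To parse the date column itself, it has to be named explicitly; a sketch, using a separate variable so as not to disturb anything below:
# Hypothetical alternative load: parse the 'date' column rather than the index
menu_df_parsed = pd.read_csv(menu_data, index_col='id', parse_dates=['date'])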
(For this final bit, I'm not entirely confident that I'm doing this in the best way. It seems to "work" but it feels a little verbose. Caveat lector.)
# This should create a DataFrame from the Menu Items data set that contains
# only "rows" related to our example cluster of 'French fried potato' dishes
lookup_items_df = item_df[item_df.index.isin(result_df.index)]
# The length of this DataFrame should be same as the sum of occurrences we calculated above
try:
    assert len(lookup_items_df) == result_df.times_appeared.sum()
    print "Length of the DataFrame is consistent"
except:
    print "Something's wrong"
lookup_items_df.head(10)
From this intermediate DataFrame we want to grab the values of menu_page_id so that we can look up the relevant menu pages to which these items belong:
target_pages = lookup_items_df.menu_page_id.values
lookup_pages_df = page_df[page_df.index.isin(target_pages)]
lookup_pages_df.head(10)
Now we can grab all the relevant values of 'menu_id' for menus and do the last lookup:
target_menus = lookup_pages_df.menu_id.values
lookup_menus_df = menu_df[menu_df.index.isin(target_menus)]
Now we should have a DataFrame containing only menus that have pages on which one of the dishes from our 'French fried potato' cluster appears:
lookup_menus_df.head(10)
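As an aside, the same chain of lookups can be expressed more compactly with pandas' SQL-style merge. This is only a sketch, using the column names already loaded above (menu_page_id on menu items, menu_id on menu pages) and the suffixes argument as a guard against overlapping column names:
# Sketch: join cluster menu items to their pages, then those pages to their menus,
# matching menu_page_id / menu_id against the index of page_df / menu_df
cluster_items = item_df[item_df.index.isin(result_df.index)]
items_with_pages = pd.merge(cluster_items, page_df,
                            left_on='menu_page_id', right_index=True,
                            suffixes=('_item', '_page'))
items_with_menus = pd.merge(items_with_pages, menu_df,
                            left_on='menu_id', right_index=True,
                            suffixes=('', '_menu'))
items_with_menus.head(10)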
This next bit is a workaround, since parsing the dates in the date column doesn't seem to be working for me as expected. I need to extract the year for each menu so that I can group the number of appearances of a dish by year.
import dateutil.parser
def extract_year(x):
    """Parse a date string and return just the year; return None if parsing fails."""
    try:
        datestring = dateutil.parser.parse(x)
        return datestring.year
    except:
        return None
lookup_menus_df['date_year'] = lookup_menus_df['date'].map(extract_year)
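For what it's worth, a sketch of an alternative that avoids the custom function, assuming a version of pandas where pd.to_datetime accepts errors='coerce' (unparseable dates become NaT):
# Hypothetical alternative: let pandas do the parsing, coercing bad values to NaT
parsed_dates = pd.to_datetime(lookup_menus_df['date'], errors='coerce')
lookup_menus_df['date_year'] = parsed_dates.map(lambda d: d.year if pd.notnull(d) else None)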
With a little IPython magic, it's possible to plot frequency of appearances:
%matplotlib inline
by_year = lookup_menus_df.groupby(lookup_menus_df['date_year'])
year_series = by_year.size()
year_series.plot(figsize=(20,10))
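Optionally, the plot can be labeled; Series.plot() returns a matplotlib Axes object, so the usual matplotlib calls apply:
# Label the axes of the frequency plot
ax = year_series.plot(figsize=(20,10))
ax.set_xlabel('year')
ax.set_ylabel('menu appearances of dishes in the cluster')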
Of course, this graph does not look right. What's going on with that huge spike at the beginning of the 20th century? And why do the number of appearances decrease toward mid-century?
This skew is an artifact of the coverage of the underlying menu collection that NYPL holds. From the Library's own collection guide:
Thousands of menus beginning with the oldest items in the collection from 1851. The collection is strongest for the period between 1890 and 1910.
This is why NYPL, on their own site, has normalized the plot of dish appearances by total number of menus held for a given year.
# Extract years in the complete DataFrame for menus
menu_df['date_year'] = menu_df['date'].map(extract_year)
# Group by year and calculate the number of menus held for each year
total_menu_by_year = menu_df.groupby(menu_df['date_year'])
menu_total_series = total_menu_by_year.size()
With this information, we can create a normalized graph:
normed = [year_series[date_year]/menu_total_series[date_year].astype(float) for date_year in year_series.index.tolist()]
pd.Series(normed, index=year_series.index).plot(figsize=(20,10))
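A slightly more idiomatic sketch: since both Series are indexed by year, pandas will align them automatically, so the normalization can be a single division.
# Index-aligned division; years with no appearances of the dish become NaN and are dropped
normed_series = (year_series / menu_total_series.astype(float)).dropna()
normed_series.plot(figsize=(20,10))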
I hope this notebook demonstrates some ways to use the data analysis tools provided by Pandas in the service of curating data from What's On the Menu?