Programmatic Use of Open Refine to Facet and Cluster Names of 'Dishes' from NYPL's What's on the menu?

Trevor Muñoz

18 August 2013

In [1]:
from google.refine import refine, facet
import pystache

This client library assumes that Open Refine is installed and that the Refine server application is running. To create a connection to the server:

In [2]:
server = refine.RefineServer()
grefine = refine.Refine(server)

Now that we have a connection to the server from our application we can make calls that replicate the functionality available in the standard graphical user interface (GUI). For instance, I can list the projects I have in my copy of Refine (ymmv). In this case, I have just one project at the moment, containing the 8/1/2013 data release from NYPL. The list_projects command returns a JSON response containing metadata about my project(s). To do any work, I need the project identifier (2310205155087) so I can open the relevant project:

In [3]:
grefine.list_projects()
Out[3]:
{u'2310205155087': {u'created': u'2013-08-16T20:45:49Z',
  u'customMetadata': {},
  u'modified': u'2013-08-16T20:56:56Z',
  u'name': u'2013_08_01_07_05_00_data'}}
In [4]:
nypl_dishes = grefine.open_project('2310205155087')

Now I can issue commands to facet the values in the 'name' column. To demonstrate that the failure I experienced using the standard GUI is a bottleneck in the frontend code and not the server code that is actually computing the data facets, I'll time the execution of the facet command:

In [5]:
name_facet = facet.TextFacet('name')

%%timeit facet_response = nypl_dishes.compute_facets(name_facet)
1 loops, best of 3: 7.83 s per loop

At 7.68 seconds for best time, this command is not amazingly quick but neither is the execution so slow that the whole application should grind to a halt.

To check that we're getting the same behavior with this command as in the GUI, let's see if the number of calculated facets in the same (370,004):

In [6]:
facet_response = nypl_dishes.compute_facets(name_facet)
facets = facet_response.facets[0]

From this object we can access a dictionary called 'choices' that contains the facets of the data along with their associated counts. The number of keys in this dictionary is the number of facets.

In [7]:
len(facets.choices.keys())
Out[7]:
370004

Now I can display the list of unique values for dish 'name' in descendingly order of their raw count. This is the same information that would appear in one of the boxes of the sidebar of the Refine GUI if it didn't choke. For the sake of space, I'll only output the first 25 here:

In [8]:
for k in sorted(facets.choices, key=lambda k: facets.choices[k].count, reverse=True)[:25]:
    print facets.choices[k].count, k
13 potatoes hashed in cream
13 cold roast beef
11 club sandwich
10 lobster salad
10 hot roast beef sandwich
10 american cheese
10 clams: little necks
9 celery
9 american cheese sandwich
9 strawberry ice cream
9 potatoes au gratin
9 french fried potatoes
9 chocolate ice cream
9 new lima beans
9 broiled oysters, celery sauce
9 roast beef sandwich
9 caviar, special importation
9 ice cream
9 vanilla ice cream
8 bulgarzoon (scientifically fermented milk), per bottle
8 milk toast
8 crab meat salad
8 cold corned beef
8 little neck clams
8 green turtle soup

In the GUI, the order of operations is to calculate the facets for a column then generate clusters. From the perspective of this little script, the dependency is not clear. In any case, it's now also possible to calculate and inspect the output of Refine's clustering functionality operating on the 'name' column. At first we'll just use the default binning clusterer:

In [9]:
cluster_response = nypl_dishes.compute_clusters('name')

Again, what is returned is a JSON response with information about the clusters found in the data. In the GUI, this is turned into a modal overlay where users can select the best value to represent a cluster and then normalize batches of values at the same time.

We can see how many clusters were generated in this first pass:

In [10]:
len(cluster_response)
Out[10]:
25712

And we can inspect the data in the clusters. (For space, only the first five clusters are shown):

In [11]:
for cluster in cluster_response[:5]:
    print '\n'
    for line in cluster:
        print(pystache.render('{{count}} \t {{value}}', line))
5 	 eggs, poached on toast (2)
4 	 poached eggs on toast (2)
3 	 2 eggs, poached on toast
3 	 eggs poached on toast (2)
3 	 eggs, poached on toast [2]
2 	 2 poached (eggs) on toast
2 	 2 poached eggs on toast
2 	 eggs poached on toast [2]
2 	 poached eggs (2) on toast
1 	 (2) poached eggs on toast
1 	 2 (eggs) poached on toast
1 	 2 eggs poached on toast
1 	 2 poached on toast, eggs
1 	 [eggs] poached on toast (2)
1 	 eggs (2) poached on toast
1 	 eggs (2), poached on toast
1 	 eggs - poached on toast (2)
1 	 eggs poached (2) on toast
1 	 eggs poached (2), on toast
1 	 eggs, 2 poached on toast
1 	 eggs, poached (2) on toast
1 	 eggs: 2 poached on toast
1 	 poached eggs (2), on toast
1 	 poached eggs on toast, [2]
1 	 poached eggs, 2 on toast


2 	 2 eggs fried
2 	 2 eggs, fried
2 	 eggs fried (2)
2 	 eggs, fried (2)
2 	 fried eggs (2)
1 	 (2) eggs fried
1 	 (2) fried eggs
1 	 2 eggs: fried
1 	 2 fried eggs
1 	 [2] fried eggs
1 	 eggs (2) fried
1 	 eggs (2), fried
1 	 eggs - fried (2)
1 	 eggs, (2) fried
1 	 eggs, 2 fried
1 	 eggs, fried 2
1 	 eggs, fried [2]
1 	 eggs: fried (2)
1 	 fried (2 eggs)
1 	 fried 2 [eggs]
1 	 fried eggs (2 eggs)
1 	 fried eggs 2
1 	 fried eggs, (2)
1 	 fried eggs, 2


3 	 boiled eggs (2)
3 	 eggs boiled (2)
3 	 eggs, boiled (2)
2 	 2 boiled eggs
2 	 2 eggs boiled
2 	 2 eggs, boiled
2 	 eggs, boiled [2]
1 	 (2) boiled eggs
1 	 (2) eggs boiled
1 	 2 eggs: boiled
1 	 [2] boiled eggs
1 	 boiled (2 eggs)
1 	 boiled 2 [eggs]
1 	 boiled eggs (2 eggs)
1 	 boiled eggs 2
1 	 boiled eggs, 2
1 	 eggs (2) boiled
1 	 eggs (2), boiled
1 	 eggs boiled [2]
1 	 eggs, (2) boiled
1 	 eggs, 2 boiled
1 	 eggs, boiled 2
1 	 eggs: boiled (2)


6 	 imported ginger ale, c. & c.
2 	 c & c imported ginger ale
2 	 ginger ale, imported c & c
1 	 "c. & c." ginger ale, imported
1 	 c. & c. ginger ale, imported
1 	 c. & c. imported ginger ale
1 	 ginger ale (c. & c.), imported
1 	 ginger ale (imported) c. & c.
1 	 ginger ale imported, c. & c.
1 	 ginger ale, c & c (imported)
1 	 ginger ale, c. & c. (imported)
1 	 ginger ale, c. & c., imported
1 	 ginger ale, imported, c & c
1 	 ginger ale, imported, c. & c.
1 	 imported (c. & c.) ginger ale
1 	 imported c. & c. ginger ale
1 	 imported ginger ale (c & c)
1 	 imported ginger ale (c. & c.)
1 	 imported ginger ale c & c
1 	 imported ginger ale c. & c
1 	 imported ginger ale c. & c.
1 	 imported ginger ale, c & c
1 	 imported ginger ale, c. & c


7 	 broiled chicken (half)
3 	 broiled chicken, half
3 	 half broiled chicken
2 	 broiled half chicken
2 	 chicken, broiled, half
1 	 (half) broiled chicken
1 	 [broiled] chicken (half)
1 	 broiled chicken - (half)
1 	 broiled chicken half
1 	 broiled chicken, (half)
1 	 chicken (half) (broiled)
1 	 chicken (half), broiled
1 	 chicken broiled (half)
1 	 chicken broiled half
1 	 chicken broiled, half
1 	 chicken half, broiled
1 	 chicken, broiled (half)
1 	 chicken, half, broiled
1 	 half (broiled chicken)
1 	 half chicken broiled
1 	 half chicken, broiled

Directly inspecting the clusters produced by Refine shows that there is great potential for improving the quality of the data by normalizing names of dishes.