OpenRefine Client¶

Notebook demonstrating how to control OpenRefine via a Python client.

Use the dbutlerdb/refine-client-py fork of PaulMakepeace/refine-client-py for Python3 support.

I'm not yet convinced this is sensible or that the python client is as useable as it might be? Things like pandas are perfectly serviceable for working with tabular data in a notebook, so why would we want to use the OpenRefine engine?

Getting Started¶

To start, ensure that the OpenRefine application server is running. You can start it from the notebook homepage (New -> OpenRefine Session.

The server connection is looked for on the default port 3333. This can be hardcoded as part of the nbopenrefineder OpneRefine start-up command.

In [1]:

import os

#We might have to spot these in the notebook server log - the terminal output where the notebook server is started
#If we're in MyBinder, that's not much help though?
#os.environ['OPENREFINE_HOST']='127.0.0.1'
#os.environ['OPENREFINE_PORT']='39113'

In [3]:

from open.refine import refine

#What is the distinction between these two?
#Can we make use of it in a Jupyter context somehow?

server = refine.RefineServer()
orefine = refine.Refine(server)

We can list any projects that currently exist:

In [5]:

orefine.list_projects().items()

Out[5]:

dict_items([])

Creating Projects¶

Projects can be created via the API from a source data file.

In [255]:

%%capture
try:
    import pandas as pd
except:
    !pip install pandas
    import pandas as pd

In [256]:

#Create a simple test data file as a CSV file

#via https://github.com/dbutlerdb/refine-client-py/blob/master/tests/data/duplicates.csv
data = '''email,name,state,gender,purchase
danny.baron@example1.com,Danny Baron,CA,M,TV
melanie.white@example2.edu,Melanie White,NC,F,iPhone
danny.baron@example1.com,D. Baron,CA,M,Winter jacket
ben.tyler@example3.org,Ben Tyler,NV,M,Flashlight
arthur.duff@example4.com,Arthur Duff,OR,M,Dining table
danny.baron@example1.com,Daniel Baron,CA,M,Bike
jean.griffith@example5.org,Jean Griffith,WA,F,Power drill
melanie.white@example2.edu,Melanie White,NC,F,iPad
ben.morisson@example6.org,Ben Morisson,FL,M,Amplifier
arthur.duff@example4.com,Arthur Duff,OR,M,Night table'''


fn = 'test.csv'

with open(fn,'w') as f:
    f.write(data)

Either copy and paste the above data into a new OpenRefine project as a clipboard item or load it in as a CSV data file.

We should now be able to see the project via the project index list:

In [257]:

#For some reason, the project does not appear to get named?
orefine.list_projects()

Out[257]:

{'1695820561517': {'name': 'clipboard',
  'created': '2019-01-04T13:48:22Z',
  'modified': '2019-01-04T13:48:24Z',
  'creator': '',
  'contributors': '',
  'subject': '',
  'description': '',
  'rowCount': 10,
  'customMetadata': {},
  'importOptionMetadata': [{'guessCellValueTypes': False,
    'ignoreLines': -1,
    'processQuotes': True,
    'fileSource': '(clipboard)',
    'encoding': '',
    'separator': ',',
    'storeBlankCellsAsNulls': True,
    'storeBlankRows': True,
    'skipDataLines': 0,
    'includeFileSources': False,
    'headerLines': 1,
    'limit': -1,
    'projectName': 'clipboard'}]}}

Opening an OpenRefine Project¶

To open an OpenRefine project, we need to get it's key value from the project list. (Note that different projects may have the same name.)

In [258]:

dir(orefine)

Out[258]:

['__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 'default_options',
 'get_project_id',
 'get_project_name',
 'list_projects',
 'new_project',
 'open_project',
 'server',
 'set_options',
 'set_project_name']

In [259]:

KEY = '2165169842672'  #Need to use a valid key
KEY = list(orefine.list_projects().keys())[0]

p = orefine.open_project(KEY)

In [260]:

#Inspect the column names
p.columns

Out[260]:

['email', 'name', 'state', 'gender', 'purchase']

In [261]:

#Rename a column

col = p.columns[-1]
p.rename_column(col,'{}2'.format(col))
p.columns, p.column_order

Out[261]:

(['email', 'name', 'state', 'gender', 'purchase2'],
 {'email': 0,
  'name': 1,
  'state': 2,
  'gender': 3,
  'purchase': 4,
  'purchase2': 4})

Hmmm, it looks like there's a duplicate there?

Adding A Column¶

OpenRefine allows you to add a column derived from one or more other columns.

.add_column(oldColumn, newColumn, expression="value", column_insert_index=None, on_error='set-to-blank'])

The value keyword denotes using the cell values from the original column.

In [262]:

p.add_column('name', 'name2', column_insert_index=2)

Out[262]:

{'code': 'ok',
 'historyEntry': {'id': 1546610143487,
  'description': 'Create new column name2 based on column name by filling 10 rows with value',
  'time': '2019-01-04T13:48:33Z'}}

How do we access the history from the API? Or do we have to create one ourselves from command responses?

In [263]:

#Add a column based on the name
p.add_column('name', 'NAME', "toUppercase(value)")

Out[263]:

{'code': 'ok',
 'historyEntry': {'id': 1546610429767,
  'description': 'Create new column NAME based on column name by filling 10 rows with toUppercase(value)',
  'time': '2019-01-04T13:48:33Z'}}

For a list of supported GREL (Google Refine Expression Language) commands, see here.

Delete A Column¶

The remove column OpenRefine option doesn't appear to be supported?

Export CSV¶

The export appears to work on the dataset with current filters set. This is distinct from the complere dataset.

In [264]:

import pandas as pd
from io import StringIO

pd.read_csv( StringIO( p.export(export_format='csv') ) )

Out[264]:

	email	name	NAME	name2	state	gender	purchase2
0	danny.baron@example1.com	Danny Baron	DANNY BARON	Danny Baron	CA	M	TV
1	melanie.white@example2.edu	Melanie White	MELANIE WHITE	Melanie White	NC	F	iPhone
2	danny.baron@example1.com	D. Baron	D. BARON	D. Baron	CA	M	Winter jacket
3	ben.tyler@example3.org	Ben Tyler	BEN TYLER	Ben Tyler	NV	M	Flashlight
4	arthur.duff@example4.com	Arthur Duff	ARTHUR DUFF	Arthur Duff	OR	M	Dining table
5	danny.baron@example1.com	Daniel Baron	DANIEL BARON	Daniel Baron	CA	M	Bike
6	jean.griffith@example5.org	Jean Griffith	JEAN GRIFFITH	Jean Griffith	WA	F	Power drill
7	melanie.white@example2.edu	Melanie White	MELANIE WHITE	Melanie White	NC	F	iPad
8	ben.morisson@example6.org	Ben Morisson	BEN MORISSON	Ben Morisson	FL	M	Amplifier
9	arthur.duff@example4.com	Arthur Duff	ARTHUR DUFF	Arthur Duff	OR	M	Night table

In [265]:

def show_export(p):
    display( pd.read_csv( StringIO( p.export(export_format='csv') ) ) )

Via https://stackoverflow.com/a/53742983/454773 it looks like there may be an options parmamter that allows for column selection. This is not currently supported by the Python client.

Display Table With Current Filters¶

We can display the state of the table with current project filters set.

In [266]:

def show_table(p):
    ''' Display currently selected rows in the table. '''
    
    cells = [ [col['v'] for col in row['cells']] for row in p.get_rows().rows.rows_response ]
    df = pd.DataFrame( cells )
    
    #The list of columns seems to include historical items
    #But how do we also guarantee the current one? dicts are inherently unordered?
    cols = ['Unnamed_{}'.format(i) for i in range(len(df.columns))]
    for (k,v) in sorted(p.column_order.items(), key=lambda kv: kv[1]):
        cols[v]=k
        
    #Set the column names guessed at - is there a better way?
    df.columns = cols
    
    display ( df ) #columns = [n for n in p.column_order]
    
    
#How do we get the full list of column names?

In [267]:

show_table(p)

	email	name	NAME	name2	purchase	gender	purchase2
0	danny.baron@example1.com	Danny Baron	CA	M	TV	Danny Baron	DANNY BARON
1	melanie.white@example2.edu	Melanie White	NC	F	iPhone	Melanie White	MELANIE WHITE
2	danny.baron@example1.com	D. Baron	CA	M	Winter jacket	D. Baron	D. BARON
3	ben.tyler@example3.org	Ben Tyler	NV	M	Flashlight	Ben Tyler	BEN TYLER
4	arthur.duff@example4.com	Arthur Duff	OR	M	Dining table	Arthur Duff	ARTHUR DUFF
5	danny.baron@example1.com	Daniel Baron	CA	M	Bike	Daniel Baron	DANIEL BARON
6	jean.griffith@example5.org	Jean Griffith	WA	F	Power drill	Jean Griffith	JEAN GRIFFITH
7	melanie.white@example2.edu	Melanie White	NC	F	iPad	Melanie White	MELANIE WHITE
8	ben.morisson@example6.org	Ben Morisson	FL	M	Amplifier	Ben Morisson	BEN MORISSON
9	arthur.duff@example4.com	Arthur Duff	OR	M	Night table	Arthur Duff	ARTHUR DUFF

Table Manipulation¶

How can we manipulate the data table?

Column Names and Column Order¶

We can look up the current column names and column order as follows:

In [268]:

p.column_order

Out[268]:

{'email': 0,
 'name': 1,
 'state': 4,
 'gender': 5,
 'purchase': 4,
 'purchase2': 6,
 'name2': 3,
 'NAME': 2}

Rearrange column order:

In [269]:

p.reorder_columns(['email','name','state','gender','purchase2','NAME','name2'])
show_export(p)

	email	name	state	gender	purchase2	NAME	name2
0	danny.baron@example1.com	Danny Baron	CA	M	TV	DANNY BARON	Danny Baron
1	melanie.white@example2.edu	Melanie White	NC	F	iPhone	MELANIE WHITE	Melanie White
2	danny.baron@example1.com	D. Baron	CA	M	Winter jacket	D. BARON	D. Baron
3	ben.tyler@example3.org	Ben Tyler	NV	M	Flashlight	BEN TYLER	Ben Tyler
4	arthur.duff@example4.com	Arthur Duff	OR	M	Dining table	ARTHUR DUFF	Arthur Duff
5	danny.baron@example1.com	Daniel Baron	CA	M	Bike	DANIEL BARON	Daniel Baron
6	jean.griffith@example5.org	Jean Griffith	WA	F	Power drill	JEAN GRIFFITH	Jean Griffith
7	melanie.white@example2.edu	Melanie White	NC	F	iPad	MELANIE WHITE	Melanie White
8	ben.morisson@example6.org	Ben Morisson	FL	M	Amplifier	BEN MORISSON	Ben Morisson
9	arthur.duff@example4.com	Arthur Duff	OR	M	Night table	ARTHUR DUFF	Arthur Duff

In [270]:

p.reorder_columns(['email','name'])
show_export(p)

	email	name
0	danny.baron@example1.com	Danny Baron
1	melanie.white@example2.edu	Melanie White
2	danny.baron@example1.com	D. Baron
3	ben.tyler@example3.org	Ben Tyler
4	arthur.duff@example4.com	Arthur Duff
5	danny.baron@example1.com	Daniel Baron
6	jean.griffith@example5.org	Jean Griffith
7	melanie.white@example2.edu	Melanie White
8	ben.morisson@example6.org	Ben Morisson
9	arthur.duff@example4.com	Arthur Duff

In [271]:

p.reorder_columns(['email','name','state','gender','purchase2','NAME','name2'])
show_export(p)

	email	name
0	danny.baron@example1.com	Danny Baron
1	melanie.white@example2.edu	Melanie White
2	danny.baron@example1.com	D. Baron
3	ben.tyler@example3.org	Ben Tyler
4	arthur.duff@example4.com	Arthur Duff
5	danny.baron@example1.com	Daniel Baron
6	jean.griffith@example5.org	Jean Griffith
7	melanie.white@example2.edu	Melanie White
8	ben.morisson@example6.org	Ben Morisson
9	arthur.duff@example4.com	Arthur Duff

In [272]:

p.columns

Out[272]:

['email', 'name']

Although the full set of columns aren't in the exportable view of the table, they do still exist:

In [273]:

p.get_rows().rows.rows_response[0]

Out[273]:

{'flagged': False,
 'starred': False,
 'cells': [{'v': 'danny.baron@example1.com'},
  {'v': 'Danny Baron'},
  {'v': 'CA'},
  {'v': 'M'},
  {'v': 'TV'},
  {'v': 'Danny Baron'},
  {'v': 'DANNY BARON'}],
 'i': 0}

So how do we reset that? We presumably need to unstep the project history?

Sorting¶

Data in the table can be sorted on one or more columns, each ascending or descending.

In [274]:

from open.refine.facet import Sorting

In [275]:

sorting = Sorting('email')

sorting.criteria

Out[275]:

[{'column': 'email',
  'valueType': 'string',
  'caseSensitive': False,
  'reverse': False,
  'errorPosition': 1,
  'blankPosition': 2}]

In [276]:

Sorting(['name', 'gender']).criteria

Out[276]:

[{'column': 'name',
  'valueType': 'string',
  'caseSensitive': False,
  'reverse': False,
  'errorPosition': 1,
  'blankPosition': 2},
 {'column': 'gender',
  'valueType': 'string',
  'caseSensitive': False,
  'reverse': False,
  'errorPosition': 1,
  'blankPosition': 2}]

In [277]:

p.sorting = Sorting('email')
p.reorder_rows()

show_table(p)

	email	name	state	gender	purchase2	NAME	name2
0	arthur.duff@example4.com	Arthur Duff	OR	M	Dining table	Arthur Duff	ARTHUR DUFF
1	arthur.duff@example4.com	Arthur Duff	OR	M	Night table	Arthur Duff	ARTHUR DUFF
2	ben.morisson@example6.org	Ben Morisson	FL	M	Amplifier	Ben Morisson	BEN MORISSON
3	ben.tyler@example3.org	Ben Tyler	NV	M	Flashlight	Ben Tyler	BEN TYLER
4	danny.baron@example1.com	Danny Baron	CA	M	TV	Danny Baron	DANNY BARON
5	danny.baron@example1.com	D. Baron	CA	M	Winter jacket	D. Baron	D. BARON
6	danny.baron@example1.com	Daniel Baron	CA	M	Bike	Daniel Baron	DANIEL BARON
7	jean.griffith@example5.org	Jean Griffith	WA	F	Power drill	Jean Griffith	JEAN GRIFFITH
8	melanie.white@example2.edu	Melanie White	NC	F	iPhone	Melanie White	MELANIE WHITE
9	melanie.white@example2.edu	Melanie White	NC	F	iPad	Melanie White	MELANIE WHITE

In [278]:

#Seems we can be more explicit - but are the missing items (e.g. valueType, caseSensitive) important?
Sorting([{'column':'name','reverse':True}]).criteria

Out[278]:

[{'column': 'name', 'reverse': True, 'errorPosition': 1, 'blankPosition': 2}]

In [279]:

p.sorting = Sorting(['email',{'column':'name','reverse':True}])
p.reorder_rows()

show_table(p)

	email	name	state	gender	purchase2	NAME	name2
0	arthur.duff@example4.com	Arthur Duff	OR	M	Dining table	Arthur Duff	ARTHUR DUFF
1	arthur.duff@example4.com	Arthur Duff	OR	M	Night table	Arthur Duff	ARTHUR DUFF
2	ben.morisson@example6.org	Ben Morisson	FL	M	Amplifier	Ben Morisson	BEN MORISSON
3	ben.tyler@example3.org	Ben Tyler	NV	M	Flashlight	Ben Tyler	BEN TYLER
4	danny.baron@example1.com	Danny Baron	CA	M	TV	Danny Baron	DANNY BARON
5	danny.baron@example1.com	Daniel Baron	CA	M	Bike	Daniel Baron	DANIEL BARON
6	danny.baron@example1.com	D. Baron	CA	M	Winter jacket	D. Baron	D. BARON
7	jean.griffith@example5.org	Jean Griffith	WA	F	Power drill	Jean Griffith	JEAN GRIFFITH
8	melanie.white@example2.edu	Melanie White	NC	F	iPhone	Melanie White	MELANIE WHITE
9	melanie.white@example2.edu	Melanie White	NC	F	iPad	Melanie White	MELANIE WHITE

In [280]:

p.get_rows().rows.rows_response[0]

Out[280]:

{'flagged': False,
 'starred': False,
 'cells': [{'v': 'arthur.duff@example4.com'},
  {'v': 'Arthur Duff'},
  {'v': 'OR'},
  {'v': 'M'},
  {'v': 'Dining table'},
  {'v': 'Arthur Duff'},
  {'v': 'ARTHUR DUFF'}],
 'i': 0}

Unset filters¶

It looks like we can use .engine.reset_all() to reset filters.

In [281]:

p.engine.reset_all()
show_table(p)

	email	name	state	gender	purchase2	NAME	name2
0	arthur.duff@example4.com	Arthur Duff	OR	M	Dining table	Arthur Duff	ARTHUR DUFF
1	arthur.duff@example4.com	Arthur Duff	OR	M	Night table	Arthur Duff	ARTHUR DUFF
2	ben.morisson@example6.org	Ben Morisson	FL	M	Amplifier	Ben Morisson	BEN MORISSON
3	ben.tyler@example3.org	Ben Tyler	NV	M	Flashlight	Ben Tyler	BEN TYLER
4	danny.baron@example1.com	Danny Baron	CA	M	TV	Danny Baron	DANNY BARON
5	danny.baron@example1.com	Daniel Baron	CA	M	Bike	Daniel Baron	DANIEL BARON
6	danny.baron@example1.com	D. Baron	CA	M	Winter jacket	D. Baron	D. BARON
7	jean.griffith@example5.org	Jean Griffith	WA	F	Power drill	Jean Griffith	JEAN GRIFFITH
8	melanie.white@example2.edu	Melanie White	NC	F	iPhone	Melanie White	MELANIE WHITE
9	melanie.white@example2.edu	Melanie White	NC	F	iPad	Melanie White	MELANIE WHITE

In [282]:

#The export filter still seems to be set though?
show_export(p)

	email	name
0	arthur.duff@example4.com	Arthur Duff
1	arthur.duff@example4.com	Arthur Duff
2	ben.morisson@example6.org	Ben Morisson
3	ben.tyler@example3.org	Ben Tyler
4	danny.baron@example1.com	Danny Baron
5	danny.baron@example1.com	Daniel Baron
6	danny.baron@example1.com	D. Baron
7	jean.griffith@example5.org	Jean Griffith
8	melanie.white@example2.edu	Melanie White
9	melanie.white@example2.edu	Melanie White

In [283]:

#Also need an example with numeric type

Project history¶

How do we explore a project's history?

In [284]:

def show_history_item(h):
    return h.time, h.id,h.description

show_history_item(p.history_entry)

Out[284]:

('2019-01-04T13:48:39Z', 1546609838269, 'Reorder rows')

Creating an OpenRefine Project¶

This doesn't seem to work?

It would be useful to have a simple recipe for creating a project from a pandas dataframe.

In [285]:

import os

#Create an OpenRefine project from the data file
#Use the absolute path to the file
p2 = orefine.new_project(project_file=os.path.abspath(fn),
                      project_name='Test 1',
                      project_file_name=fn)

#Do we have to mediate this via a file transfer? eg could we go more directly from a pandas dataframe somehow?

In [286]:

#For some reason, the project does not appear to get named?
#There also can be a delay before the index listing shows that the data has been loaded?
orefine.list_projects()

Out[286]:

{'1718022364826': {'name': 'Untitled',
  'created': '2019-01-04T13:48:41Z',
  'modified': '2019-01-04T13:48:41Z',
  'creator': '',
  'contributors': '',
  'subject': '',
  'description': '',
  'rowCount': 0,
  'customMetadata': {},
  'importOptionMetadata': [{'storeBlankRows': True,
    'includeFileSources': False,
    'skipDataLines': 0,
    'guessCellValueTypes': False,
    'headerLines': 1,
    'ignoreLines': -1,
    'processQuotes': True,
    'fileSource': 'test.csv',
    'separator': ',',
    'storeBlankCellsAsNulls': True}]},
 '1695820561517': {'name': 'clipboard',
  'created': '2019-01-04T13:48:22Z',
  'modified': '2019-01-04T13:48:40Z',
  'creator': '',
  'contributors': '',
  'subject': '',
  'description': '',
  'rowCount': 10,
  'customMetadata': {},
  'importOptionMetadata': [{'guessCellValueTypes': False,
    'ignoreLines': -1,
    'processQuotes': True,
    'fileSource': '(clipboard)',
    'encoding': '',
    'separator': ',',
    'storeBlankCellsAsNulls': True,
    'storeBlankRows': True,
    'skipDataLines': 0,
    'includeFileSources': False,
    'headerLines': 1,
    'limit': -1,
    'projectName': 'clipboard'}]}}

Delete a Project¶

Take care with this one — there is no warning...

In [287]:

p2.delete()

Out[287]:

True

In [288]:

dir(refine.RefineProject)

Out[288]:

['__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 'add_column',
 'annotate_one_row',
 'apply_operations',
 'blank_down',
 'clusterer_defaults',
 'compute_clusters',
 'compute_facets',
 'delete',
 'do_json',
 'do_raw',
 'edit',
 'export',
 'export_rows',
 'fill_down',
 'flag_row',
 'get_models',
 'get_operations',
 'get_preference',
 'get_reconciliation_service_by_name_or_url',
 'get_reconciliation_services',
 'get_rows',
 'guess_types_of_column',
 'json_data',
 'mass_edit',
 'move_column',
 'project_name',
 'project_url',
 'reconcile',
 'remove_rows',
 'rename_column',
 'reorder_columns',
 'reorder_rows',
 'split_column',
 'star_row',
 'text_transform',
 'transpose_columns_into_rows',
 'transpose_rows_into_columns',
 'wait_until_idle']

OpenRefine supports facetting on a column, including facet counts, and several clustering algorithms.

In [289]:

from open.refine import facet

fr=p.compute_facets(facet.TextFacet('name'))

In [290]:

facets = fr.facets[0]
for k in sorted(facets.choices,
                key=lambda k: facets.choices[k].count,
                reverse=True):
    print(facets.choices[k].count, k)

2 Arthur Duff
2 Melanie White
1 Ben Tyler
1 D. Baron
1 Danny Baron
1 Jean Griffith
1 Ben Morisson
1 Daniel Baron

In [293]:

facet_value = facet.TextFacet(column='email',
                              selection=['danny.baron@example1.com','arthur.duff@example4.com'])
facet_value.as_dict()

Out[293]:

{'type': 'list',
 'name': 'email',
 'columnName': 'email',
 'omitBlank': False,
 'omitError': False,
 'selectBlank': False,
 'selectError': False,
 'invert': False,
 'expression': 'value',
 'selection': [{'v': {'v': 'danny.baron@example1.com',
    'l': 'danny.baron@example1.com'}},
  {'v': {'v': 'arthur.duff@example4.com', 'l': 'arthur.duff@example4.com'}}]}

In [294]:

#The export doesn't appear to be affected?
p.engine.add_facet(facet_value)
p.compute_facets()

show(p)

	email	name
0	arthur.duff@example4.com	Arthur Duff
1	arthur.duff@example4.com	Arthur Duff
2	ben.morisson@example6.org	Ben Morisson
3	ben.tyler@example3.org	Ben Tyler
4	danny.baron@example1.com	Danny Baron
5	danny.baron@example1.com	Daniel Baron
6	danny.baron@example1.com	D. Baron
7	jean.griffith@example5.org	Jean Griffith
8	melanie.white@example2.edu	Melanie White
9	melanie.white@example2.edu	Melanie White

In [295]:

#But the table view is?
show_table(p)

	email	name	state	gender	purchase2	NAME	name2
0	arthur.duff@example4.com	Arthur Duff	OR	M	Dining table	Arthur Duff	ARTHUR DUFF
1	arthur.duff@example4.com	Arthur Duff	OR	M	Night table	Arthur Duff	ARTHUR DUFF
2	danny.baron@example1.com	Danny Baron	CA	M	TV	Danny Baron	DANNY BARON
3	danny.baron@example1.com	Daniel Baron	CA	M	Bike	Daniel Baron	DANIEL BARON
4	danny.baron@example1.com	D. Baron	CA	M	Winter jacket	D. Baron	D. BARON

In [296]:

#Reset the facet filter
facet_value.reset()
facet_value.as_dict()

Out[296]:

{'type': 'list',
 'name': 'email',
 'columnName': 'email',
 'omitBlank': False,
 'omitError': False,
 'selectBlank': False,
 'selectError': False,
 'invert': False,
 'expression': 'value',
 'selection': []}

In [300]:

show_table(p)

	email	name	state	gender	purchase2	NAME	name2
0	arthur.duff@example4.com	Arthur Duff	OR	M	Dining table	Arthur Duff	ARTHUR DUFF
1	arthur.duff@example4.com	Arthur Duff	OR	M	Night table	Arthur Duff	ARTHUR DUFF
2	ben.morisson@example6.org	Ben Morisson	FL	M	Amplifier	Ben Morisson	BEN MORISSON
3	ben.tyler@example3.org	Ben Tyler	NV	M	Flashlight	Ben Tyler	BEN TYLER
4	danny.baron@example1.com	Danny Baron	CA	M	TV	Danny Baron	DANNY BARON
5	danny.baron@example1.com	Daniel Baron	CA	M	Bike	Daniel Baron	DANIEL BARON
6	danny.baron@example1.com	D. Baron	CA	M	Winter jacket	D. Baron	D. BARON
7	jean.griffith@example5.org	Jean Griffith	WA	F	Power drill	Jean Griffith	JEAN GRIFFITH
8	melanie.white@example2.edu	Melanie White	NC	F	iPhone	Melanie White	MELANIE WHITE
9	melanie.white@example2.edu	Melanie White	NC	F	iPad	Melanie White	MELANIE WHITE

Clustering¶

Open Refine supports several clustering methods:

clusterer_type: binning; refine_function: fingerprint|metaphone3|cologne-phonetic
clusterer_type: binning; refine_function: ngram-fingerprint; params: {'ngram-size': INT}
clusterer_type: knn; refine_function: levenshtein|ppm; params: {'radius': FLOAT,'blocking-ngram-size': INT}

In [291]:

clusters=p.compute_clusters('name',
                            clusterer_type='binning',
                            refine_function='cologne-phonetic')
for cluster in clusters:
    print(cluster)

In [292]:

clusters=p.compute_clusters('name',
                            clusterer_type='knn',
                            refine_function='levenshtein',
                            params={'radius':3})
for cluster in clusters:
    print(cluster)

[{'value': 'Danny Baron', 'count': 1}, {'value': 'Daniel Baron', 'count': 1}]

So how do we then select and apply a correction amongst clustered values?

Reconciliation¶

Reconciliation allows column values to be "reconciled" with data from third party services.

In [19]:

data='''Name
Diane Abbott
Boris Johnson
Boris Johnstone
Diana Abbot
Boris Johnston
Joanna Lumley
Boris Johnstone
'''

fn = 'test2.csv'

with open(fn,'w') as f:
    f.write(data)
    
p=orefine.new_project(project_file=os.path.abspath(fn),
                      project_name='Test 2',
                      project_file_name=fn)

In [24]:

p.get_reconciliation_services()

Out[24]:

[{'preview': {'width': 400,
   'url': 'https://tools.wmflabs.org/openrefine-wikidata/en/preview?id={{id}}',
   'height': 100},
  'extend': {'property_settings': [{'default': 0,
     'name': 'limit',
     'label': 'Limit',
     'type': 'number',
     'help_text': 'Maximum number of values to return per row (0 for no limit)'},
    {'default': 'best',
     'name': 'rank',
     'label': 'Ranks',
     'choices': [{'name': 'Any rank', 'value': 'any'},
      {'name': 'Only the best rank', 'value': 'best'},
      {'name': 'Preferred and normal ranks', 'value': 'no_deprecated'}],
     'type': 'select',
     'help_text': 'Filter statements by rank'},
    {'default': 'any',
     'name': 'references',
     'label': 'References',
     'choices': [{'name': 'Any statement', 'value': 'any'},
      {'name': 'At least one reference', 'value': 'referenced'},
      {'name': 'At least one non-wiki reference', 'value': 'no_wiki'}],
     'type': 'select',
     'help_text': 'Filter statements by their references'},
    {'default': False,
     'name': 'count',
     'label': 'Return counts instead of values',
     'type': 'checkbox',
     'help_text': 'The number of values will be returned.'}],
   'propose_properties': {'service_url': 'https://tools.wmflabs.org/openrefine-wikidata',
    'service_path': '/en/propose_properties'}},
  'view': {'url': 'https://www.wikidata.org/wiki/{{id}}'},
  'ui': {'handler': 'ReconStandardServicePanel'},
  'identifierSpace': 'http://www.wikidata.org/entity/',
  'name': 'Wikidata (en)',
  'suggest': {'property': {'service_url': 'https://tools.wmflabs.org/openrefine-wikidata',
    'flyout_service_path': '/en/flyout/property?id=${id}',
    'service_path': '/en/suggest/property'},
   'type': {'service_url': 'https://tools.wmflabs.org/openrefine-wikidata',
    'flyout_service_path': '/en/flyout/type?id=${id}',
    'service_path': '/en/suggest/type'},
   'entity': {'service_url': 'https://tools.wmflabs.org/openrefine-wikidata',
    'flyout_service_path': '/en/flyout/entity?id=${id}',
    'service_path': '/en/suggest/entity'}},
  'defaultTypes': [{'name': 'entity', 'id': 'Q35120'}],
  'url': 'https://tools.wmflabs.org/openrefine-wikidata/en/api',
  'schemaSpace': 'http://www.wikidata.org/prop/direct/'}]

In [25]:

p.get_reconciliation_service_by_name_or_url('Wikidata (en)')

Out[25]:

{'preview': {'width': 400,
  'url': 'https://tools.wmflabs.org/openrefine-wikidata/en/preview?id={{id}}',
  'height': 100},
 'extend': {'property_settings': [{'default': 0,
    'name': 'limit',
    'label': 'Limit',
    'type': 'number',
    'help_text': 'Maximum number of values to return per row (0 for no limit)'},
   {'default': 'best',
    'name': 'rank',
    'label': 'Ranks',
    'choices': [{'name': 'Any rank', 'value': 'any'},
     {'name': 'Only the best rank', 'value': 'best'},
     {'name': 'Preferred and normal ranks', 'value': 'no_deprecated'}],
    'type': 'select',
    'help_text': 'Filter statements by rank'},
   {'default': 'any',
    'name': 'references',
    'label': 'References',
    'choices': [{'name': 'Any statement', 'value': 'any'},
     {'name': 'At least one reference', 'value': 'referenced'},
     {'name': 'At least one non-wiki reference', 'value': 'no_wiki'}],
    'type': 'select',
    'help_text': 'Filter statements by their references'},
   {'default': False,
    'name': 'count',
    'label': 'Return counts instead of values',
    'type': 'checkbox',
    'help_text': 'The number of values will be returned.'}],
  'propose_properties': {'service_url': 'https://tools.wmflabs.org/openrefine-wikidata',
   'service_path': '/en/propose_properties'}},
 'view': {'url': 'https://www.wikidata.org/wiki/{{id}}'},
 'ui': {'handler': 'ReconStandardServicePanel'},
 'identifierSpace': 'http://www.wikidata.org/entity/',
 'name': 'Wikidata (en)',
 'suggest': {'property': {'service_url': 'https://tools.wmflabs.org/openrefine-wikidata',
   'flyout_service_path': '/en/flyout/property?id=${id}',
   'service_path': '/en/suggest/property'},
  'type': {'service_url': 'https://tools.wmflabs.org/openrefine-wikidata',
   'flyout_service_path': '/en/flyout/type?id=${id}',
   'service_path': '/en/suggest/type'},
  'entity': {'service_url': 'https://tools.wmflabs.org/openrefine-wikidata',
   'flyout_service_path': '/en/flyout/entity?id=${id}',
   'service_path': '/en/suggest/entity'}},
 'defaultTypes': [{'name': 'entity', 'id': 'Q35120'}],
 'url': 'https://tools.wmflabs.org/openrefine-wikidata/en/api',
 'schemaSpace': 'http://www.wikidata.org/prop/direct/'}

In [26]:

#Get suggested types
p.guess_types_of_column('Name',
                        'https://tools.wmflabs.org/openrefine-wikidata/en/api')

Out[26]:

[{'id': 'Q5', 'name': 'human', 'score': 5, 'count': 6},
 {'id': 'Q13442814',
  'name': 'scholarly article',
  'score': 1.833333333333333,
  'count': 5},
 {'id': 'Q3305213', 'name': 'painting', 'score': 1.5, 'count': 2},
 {'id': 'Q5398426', 'name': 'television series', 'score': 0.5, 'count': 1},
 {'id': 'Q571', 'name': 'book', 'score': 0.3333333333333333, 'count': 1}]

In [27]:

reconciliation_config = {
                'mode': 'standard-service',
                'service': 'https://tools.wmflabs.org/openrefine-wikidata/en/preview?id={{id}}',
                'identifierSpace': 'http://www.wikidata.org/entity/',
                'schemaSpace': 'http://www.wikidata.org/prop/direct/',
                'type': {
                    'id': 'Q5',
                    'name': 'human',
                },
                'autoMatch': True,
                'columnDetails': [],
            }

In [ ]:

p.reconcile(column='Name',
            service='Wikidata (en)',
            reconciliation_config=reconciliation_config)

Reconciliation is asynchronous - how do we check the actual response when it arrives?

Code suggests call wait_until_idle(), but where? Perhaps requires a fork in the code to embed it in refine.py::RefineProject().reconcile() prior to providing the response?

In [ ]:

pd.read_csv( StringIO( p.export(export_format='csv') ) )

See also: OpenRefine Style Reconciliation Containers. It would be good to weave reconciliation service recipes into Binderised demos, perhaps even using a Jupyter kernel gateway powered API to allow reconciliation service definitions via a notebook?

Debug strategies¶

I need to figure some recipes for debugging in notebooks... Indeed, debugging anywhere!

In [ ]:

#%%debug
import pdb; pdb.set_trace()
p.guess_types_of_column('Name', 'Wikidata (en)')
#c

In [ ]:

pdb.pm()
#locals()['s']
#q

OpenRefine Client¶

Getting Started¶

Creating Projects¶

Opening an OpenRefine Project¶

Adding A Column¶

Delete A Column¶

Export CSV¶

Display Table With Current Filters¶

Table Manipulation¶

Column Names and Column Order¶

Sorting¶

Unset filters¶

Project history¶

Creating an OpenRefine Project¶

Delete a Project¶

Facets and Clustering¶

Facets¶

Clustering¶

Reconciliation¶

Debug strategies¶