OpenRefine Client

Notebook demonstrating how to control headless OpenRefine via a Python client.

Use the dbutlerdb/refine-client-py fork of PaulMakepeace/refine-client-py for Python3 support.

I'm not yet convinced this is sensible or that the python client is as useable as it might be? Things like pandas are perfectly serviceable for working with tabular data in a notebook, so why would we want to use the OpenRefine engine?

Getting Started

To start, ensure that the OpenRefine application server is running. You can start it from the notebook homepage (New -> OpenRefine Session.

The server connection is looked for on the default port 3333. This can be hardcoded as part of the nbopenrefineder OpneRefine start-up command.

In [1]:
from open.refine import refine

#What is the distinction between these two?
#Can we make use of it in a Jupyter context somehow?

server = refine.RefineServer()
orefine = refine.Refine(server)

We can list any projects that currently exist (this should be empty):

In [2]:

Creating an OpenRefine Project

It would be useful to have a simple recipe for creating a project from a pandas dataframe.

In [5]:
#Create a simple test data file as a CSV file

data = '''email,name,state,gender,purchase
[email protected],Danny Baron,CA,M,TV
[email protected],Melanie White,NC,F,iPhone
[email protected],D. Baron,CA,M,Winter jacket
[email protected],Ben Tyler,NV,M,Flashlight
[email protected],Arthur Duff,OR,M,Dining table
[email protected],Daniel Baron,CA,M,Bike
[email protected],Jean Griffith,WA,F,Power drill
[email protected],Melanie White,NC,F,iPad
[email protected],Ben Morisson,FL,M,Amplifier
[email protected],Arthur Duff,OR,M,Night table'''

fn = 'test.csv'

with open(fn,'w') as f:
In [6]:
import os

#Create an OpenRefine project from the data file
#Use the absolute path to the file
p2 = orefine.new_project(project_file=os.path.abspath(fn),
                      project_name='Test 1',

#Do we have to mediate this via a file transfer? eg could we go more directly from a pandas dataframe somehow?
In [11]:
#For some reason, the project does not appear to get named?
#There also can be a delay before the index listing shows that the data has been loaded?
{'1996034361665': {'name': 'Untitled',
  'created': '2019-01-07T13:36:27Z',
  'modified': '2019-01-07T13:36:27Z',
  'creator': '',
  'contributors': '',
  'subject': '',
  'description': '',
  'rowCount': 0,
  'customMetadata': {},
  'importOptionMetadata': [{'storeBlankRows': True,
    'includeFileSources': False,
    'skipDataLines': 0,
    'guessCellValueTypes': False,
    'headerLines': 1,
    'ignoreLines': -1,
    'processQuotes': True,
    'fileSource': 'test.csv',
    'separator': ',',
    'storeBlankCellsAsNulls': True}]}}
In [14]:
import pandas as pd
In [15]:
def show_table(p):
    ''' Display currently selected rows in the table. '''
    cells = [ [col['v'] for col in row['cells']] for row in p.get_rows().rows.rows_response ]
    df = pd.DataFrame( cells )
    #The list of columns seems to include historical items
    #But how do we also guarantee the current one? dicts are inherently unordered?
    cols = ['Unnamed_{}'.format(i) for i in range(len(df.columns))]
    for (k,v) in sorted(p.column_order.items(), key=lambda kv: kv[1]):
    #Set the column names guessed at - is there a better way?
    df.columns = cols
    display ( df ) #columns = [n for n in p.column_order]
#How do we get the full list of column names?
In [16]:
email name state gender purchase
0 [email protected] Danny Baron CA M TV
1 [email protected] Melanie White NC F iPhone
2 [email protected] D. Baron CA M Winter jacket
3 [email protected] Ben Tyler NV M Flashlight
4 [email protected] Arthur Duff OR M Dining table
5 [email protected] Daniel Baron CA M Bike
6 [email protected] Jean Griffith WA F Power drill
7 [email protected] Melanie White NC F iPad
8 [email protected] Ben Morisson FL M Amplifier
9 [email protected] Arthur Duff OR M Night table