Using the openrefine-client in a Linux Bash environment

Preparations

First we need an OpenRefine server running and the openrefine-client installed.

Option 1: binder

This Binder has OpenRefine, the openrefine-client and a Jupyter server proxy preinstalled. OpenRefine should be listening on the default port 3333, and the GUI should be available at the URL path /openrefine.

Option 2: Local environment

Ensure you have an OpenRefine server running. Then install the OpenRefine client as follows.

In [ ]:
# create the target directory first; wget -O fails if it does not exist
mkdir -p ~/.local/bin
wget -nv https://github.com/opencultureconsulting/openrefine-client/releases/download/v0.3.10/openrefine-client_0-3-10_linux -O ~/.local/bin/openrefine-client
chmod +x ~/.local/bin/openrefine-client
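
The commands in the rest of this tutorial call openrefine-client without a path, which only works if ~/.local/bin is on your PATH. That is not the case on every distribution. A minimal sketch of a check follows; the helper name path_contains is hypothetical and not part of the client.

```shell
# Hypothetical helper: succeeds if the directory given as $1 is listed in a
# PATH-style string (second argument, defaults to the current $PATH).
path_contains() {
  local dir="$1" path_list="${2:-$PATH}"
  case ":${path_list}:" in
    *":${dir}:"*) return 0 ;;
    *) return 1 ;;
  esac
}

if path_contains "$HOME/.local/bin"; then
  echo "~/.local/bin is on PATH"
else
  echo "~/.local/bin is not on PATH; add it with: export PATH=\"\$HOME/.local/bin:\$PATH\""
fi
```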

Create a directory

We will create several files, so it is tidier to work in a new folder.

In [ ]:
# timestamped folder name, e.g. 20240131_120000
workspace=$(date +%Y%m%d_%H%M%S)
mkdir -p "$HOME/$workspace" && cd "$HOME/$workspace" && pwd

Create project

Download sample data

In [ ]:
openrefine-client --download "https://git.io/fj5hF" --output=duplicates.csv

Import file into OpenRefine

In [ ]:
openrefine-client --create duplicates.csv

List all projects

In [ ]:
openrefine-client --list

Show project metadata

In [ ]:
openrefine-client --info "duplicates"

Export project to terminal

In [ ]:
openrefine-client --export "duplicates"

Apply rules from JSON file

Download a sample JSON file (its content was previously extracted from the Undo/Redo history in the OpenRefine graphical user interface)

In [ ]:
openrefine-client --download "https://git.io/fj5ju" --output=duplicates-deletion.json
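
Before applying a rules file, it can help to see which operations it contains. The sketch below lists the "op" identifiers found in such a file using grep and sed. The helper name list_ops and the sample file content are hypothetical: they only illustrate the general shape of an OpenRefine operation history (a JSON array of objects, each with an "op" key), not the actual content of duplicates-deletion.json.

```shell
# Hypothetical helper: print the value of every "op" key in a JSON rules file.
# This is text matching, not a JSON parser; it assumes each "op" value is a
# plain string without escaped quotes.
list_ops() {
  grep -o '"op"[[:space:]]*:[[:space:]]*"[^"]*"' "$1" | sed 's/.*"\([^"]*\)"$/\1/'
}

# Illustrative rules file (invented for this example).
rules=$(mktemp)
cat > "$rules" <<'EOF'
[
  { "op": "core/column-rename", "oldColumnName": "mail", "newColumnName": "email", "description": "Rename column mail to email" },
  { "op": "core/column-removal", "columnName": "notes", "description": "Remove column notes" }
]
EOF
list_ops "$rules"
rm -f "$rules"
```

For the file downloaded above, the call would be list_ops duplicates-deletion.json.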

Apply transformation rules

In [ ]:
openrefine-client --apply duplicates-deletion.json "duplicates"

Export project to terminal again

In [ ]:
openrefine-client --export "duplicates"

Export project to file

Export data in Excel (.xls) format

In [ ]:
openrefine-client --export "duplicates" --output deduped.xls

Delete project

In [ ]:
openrefine-client --delete "duplicates"

Advanced templating

Create another project from the example file above

In [ ]:
openrefine-client --create duplicates.csv --projectName=advanced

The following example exports the columns "name" and "purchase" in JSON format from the project "advanced", restricted to rows that match the regex text filter ^F$ in the column "gender":

In [ ]:
openrefine-client "advanced" \
--prefix='{ "events" : [
' \
--template='    { "name" : {{jsonize(cells["name"].value)}}, "purchase" : {{jsonize(cells["purchase"].value)}} }' \
--rowSeparator=',
' \
--suffix='
] }' \
--filterQuery='^F$' \
--filterColumn='gender'
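
To make the roles of --prefix, --template, --rowSeparator and --suffix easier to see, the following sketch mimics in plain shell how the four pieces are concatenated into one document. It does not call the client and is not its implementation; the function name assemble and the two rendered rows are invented for illustration.

```shell
# Same literal values as passed to --prefix, --rowSeparator and --suffix above.
prefix='{ "events" : [
'
row_separator=',
'
suffix='
] }'

# Hypothetical helper: join already rendered rows (one per argument) as
# prefix + row + separator + row + ... + suffix.
assemble() {
  local out="$prefix" first=1 row
  for row in "$@"; do
    if [ "$first" -eq 1 ]; then first=0; else out="${out}${row_separator}"; fi
    out="${out}${row}"
  done
  printf '%s\n' "${out}${suffix}"
}

# Invented rows standing in for what the template renders per matching record.
assemble '    { "name" : "Alice", "purchase" : "Book" }' \
         '    { "name" : "Carol", "purchase" : "Lamp" }'
```

The separator is only placed between rows, never after the last one, which is what keeps the resulting JSON array valid.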

There is also an option to store the results in multiple files. Each file will contain the prefix, a single processed row, and the suffix.

In [ ]:
openrefine-client "advanced" \
--prefix='{ "events" : [
' \
--template='    { "name" : {{jsonize(cells["name"].value)}}, "purchase" : {{jsonize(cells["purchase"].value)}} }' \
--rowSeparator=',
' \
--suffix='
] }' \
--filterQuery='^F$' \
--filterColumn='gender' \
--output=advanced.json \
--splitToFiles=true

Filenames are suffixed with the row number by default (e.g. advanced_1.json, advanced_2.json etc.). There is another option to use the value in the first column instead:

In [ ]:
openrefine-client "advanced" \
--prefix='{ "events" : [
' \
--template='    { "name" : {{jsonize(cells["name"].value)}}, "purchase" : {{jsonize(cells["purchase"].value)}} }' \
--rowSeparator=',
' \
--suffix='
] }' \
--filterQuery='^F$' \
--filterColumn='gender' \
--output=advanced.json \
--splitToFiles=true \
--suffixById=true

Check the results in the current directory.

In [ ]:
ls

Because the first column ("email") of our project "advanced" contains duplicates, this command overwrites a file whenever two rows share the same email address. When using this option, the first column should contain unique identifiers.
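
One way to spot this problem before exporting is to look for repeated values in the first column of the source CSV. The sketch below does that with standard tools. The helper name duplicate_ids and the sample rows are hypothetical, and the approach assumes a header row and no quoted commas in the first column.

```shell
# Hypothetical helper: print values that occur more than once in the first
# column of a CSV file (header row skipped, naive comma splitting).
duplicate_ids() {
  tail -n +2 "$1" | cut -d, -f1 | sort | uniq -d
}

# Illustrative data only (not the tutorial's sample file).
sample=$(mktemp)
printf '%s\n' 'email,name' 'a@example.org,Ann' 'b@example.org,Bob' 'a@example.org,Anna' > "$sample"
duplicate_ids "$sample"   # prints a@example.org
rm -f "$sample"
```

Run against the tutorial data as duplicate_ids duplicates.csv; any output means that the row-number suffix is the safer choice.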

Delete project

In [ ]:
openrefine-client --delete "advanced"

Getting help

In [ ]:
openrefine-client --help

The openrefine-client is available as a single-file executable for Windows, macOS and Linux. Client and server can run on different machines: the host and port of the OpenRefine server can be specified with -H and -P (e.g. -H 127.0.0.1 -P 80).
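
When the server runs on another machine, repeating -H and -P on every call gets tedious. A small wrapper function can add them for you. This is only a sketch: the function name refine and the variables REFINE_BIN, REFINE_HOST and REFINE_PORT are hypothetical conventions of this example and are not read by the client itself.

```shell
# Hypothetical wrapper: prepend host and port to every client call.
# Defaults to 127.0.0.1:3333 unless REFINE_HOST / REFINE_PORT are set.
# REFINE_BIN exists only so the command can be swapped out (e.g. for a dry run).
refine() {
  "${REFINE_BIN:-openrefine-client}" \
    -H "${REFINE_HOST:-127.0.0.1}" \
    -P "${REFINE_PORT:-3333}" \
    "$@"
}

# Dry run: show the arguments that would be passed, without contacting a server.
REFINE_BIN=echo REFINE_HOST=127.0.0.1 REFINE_PORT=80 refine --list
```

With the variables exported once per session, refine --list or refine --export "duplicates" then target the remote server.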

Please file an issue if you are missing a feature in the command line interface or have found a bug. You are also welcome to ask questions!