GATE Worker

The GATE Worker is a module that lets you run arbitrary code in a Java GATE process from Python and exchange documents between Python and Java.

One possible use of this is to run an existing GATE pipeline on a Python GateNLP document.

This works by having the Python module communicate with a Java process over a socket connection. Java calls made on the Python side are sent to the Java process and executed there, and the result is sent back to Python.
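To make the mechanism concrete, here is a toy sketch of the same request/response idea in pure Python. It is not the protocol the GATE Worker actually uses: the JSON message format, the METHODS table, and the RemoteProxy class are all invented for illustration, and both ends run in one process over a socket pair.

```python
import json
import socket
import threading

# Hypothetical stand-in for the "Java side": methods that can be called remotely.
METHODS = {
    "upper": lambda text: text.upper(),
    "length": lambda text: len(text),
}


def serve(conn):
    """Execute one JSON request per line and write one JSON reply per line."""
    reader = conn.makefile("r", encoding="utf-8")
    writer = conn.makefile("w", encoding="utf-8")
    with conn, reader, writer:
        for line in reader:  # the loop ends when the client closes its side
            request = json.loads(line)
            result = METHODS[request["method"]](*request["args"])
            writer.write(json.dumps({"result": result}) + "\n")
            writer.flush()


class RemoteProxy:
    """Client side: turns attribute calls into requests sent over the socket."""

    def __init__(self, conn):
        self._conn = conn
        self._reader = conn.makefile("r", encoding="utf-8")
        self._writer = conn.makefile("w", encoding="utf-8")

    def __getattr__(self, name):
        if name.startswith("_"):
            raise AttributeError(name)

        def call(*args):
            # Send the call, then block until the result comes back.
            self._writer.write(json.dumps({"method": name, "args": args}) + "\n")
            self._writer.flush()
            return json.loads(self._reader.readline())["result"]

        return call

    def close(self):
        self._writer.close()
        self._reader.close()
        self._conn.close()


def demo():
    client_sock, server_sock = socket.socketpair()
    server = threading.Thread(target=serve, args=(server_sock,))
    server.start()
    proxy = RemoteProxy(client_sock)
    results = (proxy.upper("gate"), proxy.length("gate"))
    proxy.close()
    server.join()
    return results
```

Calling demo() sends two calls through the proxy and collects the values computed on the other end of the socket, which is the same pattern as calling a Java method through the worker and getting its result back in Python.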

For this to work, GATE and Java have to be installed on the machine that runs the GATE Worker.

The easiest way to run this is by first manually starting the GATE Worker in the Java GATE GUI and then connecting to it from the Python side.

Manually starting the GATE Worker from GATE

  1. Start GATE
  2. Load the Python plugin using the CREOLE Plugin Manager
  3. Create a new Language Resource (NOTE: not a Processing Resource!): "PythonWorkerLr"

When creating the PythonWorkerLr, the following initialization parameters can be specified:

  • authToken: used to prevent other processes from connecting to the worker. You can either specify a string here or, with useAuthToken set to true, let GATE choose a random token and display it in the message pane after the resource has been created.
    • for testing this, enter "verysecretauthtoken"
  • host: The host name or address to bind to. The default 127.0.0.1 makes the worker only visible on the same machine. In order to make it visible on other machines, use the host name or IP address on the network or use 0.0.0.0
    • for testing, keep the default of 127.0.0.1
  • logActions: if this is set to true, the actions requested by the Python process are logged to the message pane.
    • for testing, change to "true"
  • port: the port number to use. Each worker requires its own port number, so if more than one worker is running on a machine, they need to use different, unused port numbers.
    • for testing, keep the default
  • useAuthToken: if this is set to false, no auth token is generated and used, and the connection can be established by any process connecting to that port number.
    • for testing, keep the default

A GATE Worker started via the PythonWorkerLr keeps running until the resource is deleted or GATE is closed.

Using the GATE Worker from Python

Once the PythonWorkerLr resource has been created, it is ready to be used by a Python program:

In [4]:
from gatenlp.gateworker import GateWorker

To connect to an already running worker process, the parameter start=False must be specified. In addition, the auth token must be provided, as well as the port and host if they differ from the defaults.

In [6]:
gs = GateWorker(start=False, auth_token="verysecretauthtoken")
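If the worker was created with a non-default host or port, those values have to be passed as well. The helper below is a small sketch that collects the connection settings in one place. The function name connect_kwargs is made up for this example, and it assumes that GateWorker accepts host and port as keyword arguments and that the defaults are 127.0.0.1 and 25333, as shown in the worker's help output.

```python
DEFAULT_HOST = "127.0.0.1"
DEFAULT_PORT = 25333


def connect_kwargs(auth_token, host=DEFAULT_HOST, port=DEFAULT_PORT):
    """Build keyword arguments for connecting to an already running worker.

    Hypothetical helper: the keyword names host and port are assumptions.
    """
    if not 0 < port < 65536:
        raise ValueError(f"not a valid port number: {port}")
    kwargs = {"start": False, "auth_token": auth_token}
    # Only pass host and port when they differ from the defaults.
    if host != DEFAULT_HOST:
        kwargs["host"] = host
    if port != DEFAULT_PORT:
        kwargs["port"] = port
    return kwargs


# Usage (requires gatenlp and a running worker):
# from gatenlp.gateworker import GateWorker
# gs = GateWorker(**connect_kwargs("verysecretauthtoken", port=25334))
```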

The GateWorker instance can now be used to run arbitrary Java methods on the Java side. It also provides a number of useful methods directly (see the PythonDoc for gateworker):

  • gs.load_gdoc(filepath, mimetype=None): load a GATE document on the Java side and return it to Python
  • gs.save_gdoc(gatedocument, filepath, mimetype=None): save a GATE document on the Java side
  • gs.gdoc2pdoc(gatedocument): convert the Java GATE document to a Python GateNLP document and return it
  • gs.pdoc2gdoc(doc): convert the Python GateNLP document to a Java GATE document and return it
  • gs.del_resource(gatedocument): remove a Java GATE document on the Java side (this is necessary to release memory). This can also be used to remove other kinds of GATE resources like ProcessingResource, Corpus, LanguageResource etc.
  • gs.load_pdoc(filepath, mimetype=None): load a document on the Java side using the file format specified via the mime type and return it as a Python GateNLP document
  • gs.log_actions(trueorfalse): switch logging of actions on the worker side off/on
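These methods combine into the round trip mentioned at the start: send a Python document to Java, run a pipeline on it, and fetch the result. The following is a minimal sketch of that pattern. The function name run_pipeline_on_pdoc is made up for this example, and it only uses the methods listed above plus gs.worker.run4Document, which is shown further below.

```python
def run_pipeline_on_pdoc(gs, pipeline, pdoc):
    """Run a Java GATE pipeline on a Python GateNLP document.

    gs is a connected GateWorker and pipeline a Java pipeline object
    that has already been loaded on the Java side.
    """
    gdoc = gs.pdoc2gdoc(pdoc)  # copy the document to the Java side
    try:
        gs.worker.run4Document(pipeline, gdoc)
        return gs.gdoc2pdoc(gdoc)  # copy the annotated document back
    finally:
        # Always release the Java document, even if the pipeline failed.
        gs.del_resource(gdoc)
```

Releasing the Java document in a finally block is a design choice of this sketch: it keeps documents from piling up on the Java side when many documents are processed in a loop.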

In addition, there is a larger number of utility methods available through gs.worker (see the PythonWorker source code). Here are a few examples:

  • loadMavenPlugin(group, artifact, version): make the plugin identified by the given Maven coordinates available
  • loadPipelineFromFile(filepath): load the pipeline/controller from the given file path and return it
  • loadDocumentFromFile(filepath): load a GATE document from the file and return it
  • loadDocumentFromFile(filepath, mimetype): load a GATE document from the file using the format corresponding to the given mime type and return it
  • saveDocumentToFile(gatedocument, filepath, mimetype): save the document to the file, using the format corresponding to the mime type
  • createDocument(content): create a new document from the given String content and return it
  • run4Document(pipeline, document): run the given pipeline on the given document
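The utility methods can also be chained so that a document never leaves the Java process. The sketch below loads a file, runs a pipeline on it, and saves the result, all on the Java side. The function name annotate_file and the paths in the usage comment are placeholders, and which mime types are accepted depends on the GATE installation.

```python
def annotate_file(gs, pipeline, in_path, out_path, mimetype):
    """Load, annotate and save a document entirely on the Java side."""
    gdoc = gs.worker.loadDocumentFromFile(in_path)
    try:
        gs.worker.run4Document(pipeline, gdoc)
        gs.worker.saveDocumentToFile(gdoc, out_path, mimetype)
    finally:
        # Free the Java document once it has been written (or on error).
        gs.del_resource(gdoc)


# Usage (requires gatenlp and a running worker; paths are hypothetical):
# pipeline = gs.worker.loadPipelineFromFile("/path/to/pipeline.xgapp")
# annotate_file(gs, pipeline, "/path/in.txt", "/path/out.xml", some_mimetype)
```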
In [7]:
# Create a new Java document from a string
# You should see how the document gets created in the GATE GUI
gdoc1 = gs.worker.createDocument("This is a 💩 document. It mentions Barack Obama and George Bush and New York.")
gdoc1
Out[7]:
JavaObject id=o5
In [8]:
# you can call the API methods for the document directly from Python
print(gdoc1.getName())
print(gdoc1.getFeatures())
GATE Document_00016
{'gate.SourceURL': 'created from String'}
In [9]:
# so far the document only "lives" in the Java process. In order to copy it to Python, it has to be converted
# to a Python GateNLP document:
pdoc1 = gs.gdoc2pdoc(gdoc1)
pdoc1.text
Out[9]:
'This is a 💩 document. It mentions Barack Obama and George Bush and New York.'
In [10]:
# Let's load ANNIE on the Java side and run it on that document:
# First we have to load the ANNIE plugin:
gs.worker.loadMavenPlugin("uk.ac.gate.plugins", "annie", "8.6")
In [11]:
# now load the prepared ANNIE pipeline from the plugin
pipeline = gs.worker.loadPipelineFromPlugin("uk.ac.gate.plugins","annie", "/resources/ANNIE_with_defaults.gapp")
pipeline.getName()
Out[11]:
'ANNIE'
In [12]:
# run the pipeline on the document and convert it to a GateNLP Python document and display it
gs.worker.run4Document(pipeline, gdoc1)
pdoc1 = gs.gdoc2pdoc(gdoc1)
In [13]:
pdoc1
Out[13]:
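Besides displaying the converted document, the annotations created by the pipeline can be inspected programmatically. The sketch below collects the text covered by all annotations of one type. The helper name covered_texts is made up, and it assumes the GateNLP document API offers annset(name), with_type(type) and annotations with start and end offsets that index into pdoc.text, and that ANNIE writes its annotations to the default set (the empty name).

```python
def covered_texts(pdoc, anntype, setname=""):
    """Return the text spans covered by all annotations of the given type."""
    annset = pdoc.annset(setname).with_type(anntype)
    return [pdoc.text[ann.start:ann.end] for ann in annset]


# Usage, after the pipeline has run (requires gatenlp):
# covered_texts(pdoc1, "Person")
```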

Manually starting the GATE Worker from Python

After installation of Python gatenlp, the command gatenlp-gate-worker is available.

You can run gatenlp-gate-worker --help to get help information:

usage: gatenlp-gate-worker [-h] [--port PORT] [--host HOST] [--auth AUTH]
                          [--noauth] [--gatehome GATEHOME]
                          [--platform PLATFORM] [--log_actions] [--keep]

Start Java GATE Worker

optional arguments:
  -h, --help           show this help message and exit
  --port PORT          Port (25333)
  --host HOST          Host to bind to (127.0.0.1)
  --auth AUTH          Auth token to use (generate random)
  --noauth             Do not use auth token
  --gatehome GATEHOME  Location of GATE (environment variable GATE_HOME)
  --platform PLATFORM  OS/Platform: windows or linux (autodetect)
  --log_actions        If worker actions should be logged
  --keep               Prevent shutting down the worker

For example, to start a GATE Worker like the one started via the PythonWorkerLr above, but this time with an explicitly chosen auth token and with logging of the actions switched on:

gatenlp-gate-worker --auth 841e634a-d1f0-4768-b763-a7738ddee003 --log_actions
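When the worker should be launched from a script rather than typed into a shell, the command line can be assembled programmatically. This is a sketch that only uses the options listed in the help output above. The function name worker_command is made up, and rejecting the combination of auth and noauth is a choice of this sketch, not documented behavior of the command.

```python
def worker_command(auth=None, noauth=False, port=None, host=None,
                   gatehome=None, log_actions=False, keep=False):
    """Build the argument list for the gatenlp-gate-worker command."""
    if auth is not None and noauth:
        raise ValueError("pass either auth or noauth, not both")
    cmd = ["gatenlp-gate-worker"]
    if port is not None:
        cmd += ["--port", str(port)]
    if host is not None:
        cmd += ["--host", host]
    if auth is not None:
        cmd += ["--auth", auth]
    if noauth:
        cmd.append("--noauth")
    if gatehome is not None:
        cmd += ["--gatehome", gatehome]
    if log_actions:
        cmd.append("--log_actions")
    if keep:
        cmd.append("--keep")
    return cmd


# Usage (requires gatenlp, Java and GATE to be installed):
# import subprocess
# proc = subprocess.Popen(worker_command(auth="my-token", log_actions=True))
```

Passing the arguments as a list avoids shell quoting problems if the auth token or the GATE path contains spaces or special characters.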

Again, the Python program can connect to the worker as before:

In [7]:
gs = GateWorker(start=False, auth_token="841e634a-d1f0-4768-b763-a7738ddee003")
gs
Out[7]:
<gatenlp.gateworker.GateWorker at 0x7fb6204e67f0>

A GATE Worker started this way keeps running until it is interrupted from the keyboard using "Ctrl-C" or until the Python side sends the "close" request:

In [8]:
gs.close()

Automatically starting the GATE Worker from Python

When using the GateWorker class from Python, it is possible to start the worker process automatically in the background by setting the parameter start to True:

In [9]:
gs = GateWorker(start=True, auth_token="my-super-secret-auth-token")
Trying to start GATE Worker on port=25333 host=127.0.0.1 log=false keep=false
PythonWorkerRunner.java: starting server with 25333/127.0.0.1/my-super-secret-auth-token/false
In [10]:
gdoc1 = gs.worker.createDocument("This is a 💩 document. It mentions Barack Obama and George Bush and New York.")
gdoc1
Out[10]:
JavaObject id=o0
In [11]:
# when done, the gate worker should get closed:
gs.close()

A better way to close the GATE Worker

In [3]:
# using the GateWorker this way will automatically close it when exiting the with block:
with GateWorker(start=True) as gw:
    print(gw.gate_version)
    
Trying to start GATE Worker on port=25333 host=127.0.0.1 log=false keep=false
Process id is 8778
9.0.1
PythonWorkerRunner.java: starting server with 25333/127.0.0.1/OQ__kPvCOvkanlu4S9TGcpQrssg/false
Java GatenlpWorker ENDING: 8778

Using the GateWorkerAnnotator

The GateWorkerAnnotator is an annotator that simplifies the common task of letting a GATE Java annotation pipeline annotate a collection of Python gatenlp documents. It can be used like other annotators (see Processing).

To run the GateWorkerAnnotator, Java must be installed and the java command must be on the path. Currently only Java version 8 has been tested.

A simple way to install Java on Linux and choose from various Java versions is SDKMan.

Also, the GATE_HOME environment variable must be set, or the path to an installed Java GATE must be passed using the gatehome parameter.

An installed Java GATE can be one of:

  • a GATE release downloaded from https://github.com/GateNLP/gate-core/releases/ and installed
    • the GATE release will get installed into some directory
    • the GATE_HOME environment variable or the gatehome parameter should point to that directory
  • the gate-core repository checked out locally and installed using Maven (mvn install)
    • the GATE_HOME environment variable or the gatehome parameter should point to the distro subdirectory of that repository directory
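Both requirements can be checked before a worker is started. The following is a small sketch of such a check. The function names resolve_gatehome and java_available are made up, and giving an explicit gatehome value precedence over the environment variable is an assumption of this sketch.

```python
import os
import shutil


def resolve_gatehome(gatehome=None, environ=None):
    """Return the GATE location from the argument or from GATE_HOME."""
    environ = os.environ if environ is None else environ
    path = gatehome or environ.get("GATE_HOME")
    if not path:
        raise ValueError("set GATE_HOME or pass the gatehome parameter")
    return path


def java_available():
    """Return True if a java command can be found on the path."""
    return shutil.which("java") is not None


# Usage (requires gatenlp; raises early with a clear message otherwise):
# from gatenlp.gateworker import GateWorker
# if java_available():
#     gw = GateWorker(gatehome=resolve_gatehome())
```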
In [1]:
from gatenlp import Document
# Create a small corpus of documents to process
texts = [
    "A very simple document.",
    "Another document, this one mentions New York and Washington. It also mentions the person Barack Obama.",
    "One more document for this little test."
]
corpus = [Document(t) for t in texts]
In [2]:
from gatenlp.gateworker import GateWorkerAnnotator
from gatenlp.processing.executor import SerialCorpusExecutor
In [6]:
# use the path of your GATE pipeline instead of annie.xgapp
# To create the GateWorkerAnnotator a GateWorker must first be created

# To run the pipeline on a corpus, first initialize the pipeline using start(), then annotate all documents, 
# then finish the pipeline using finish().
# At this point the same annotator can be used in the same way again to run on another corpus.
# If the GateWorkerAnnotator is not used any more, use close() to stop the GateWorker (the GATE worker is also
# stopped automatically when the Python process ends)

# If an executor is used, only the final close() is necessary, as the executor takes care of everything else

with GateWorker() as gw:
    pipeline = GateWorkerAnnotator("annie.xgapp", gw)
    executor = SerialCorpusExecutor(pipeline, corpus=corpus)
    executor()

    
Trying to start GATE Worker on port=25333 host=127.0.0.1 log=false keep=false
Process id is 10995
PythonWorkerRunner.java: starting server with 25333/127.0.0.1/6C8L67T0iLuVFHEovPN07nNGz2c/false
Java GatenlpWorker ENDING: 10995
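Without an executor, the start(), annotate, finish() sequence described in the comments above has to be carried out by hand. The following is a minimal sketch of that loop. The function name annotate_corpus is made up, and it assumes that the annotator is callable with a document and returns the annotated document.

```python
def annotate_corpus(annotator, corpus):
    """Run an annotator over all documents of a corpus without an executor."""
    annotator.start()  # initialize the pipeline for this corpus
    annotated = [annotator(doc) for doc in corpus]
    annotator.finish()  # tell the pipeline that the corpus is done
    return annotated


# Usage (requires gatenlp, Java and GATE; annie.xgapp is a placeholder path):
# with GateWorker() as gw:
#     annotator = GateWorkerAnnotator("annie.xgapp", gw)
#     corpus = annotate_corpus(annotator, corpus)
```

After finish() the same annotator can be started again for another corpus, as long as the GateWorker it uses is still open.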
In [7]:
# Show the second document
corpus[1]
Out[7]: