# Documents

In [1]:
from gatenlp import Document

In [2]:
# To load a document from a file with the name "file.bdocjs" into gatenlp simply use:
# doc = Document.load("file.bdocjs")

# But it is also possible to load from a file that is somewhere on the internet. For this notebook, we use
# an example document that gets loaded from a URL:
doc = Document.load("https://gatenlp.github.io/python-gatenlp/testdocument1.txt")

# We can visualize the document by printing it:
print(doc)

Document(This is a test document.

It contains just a few sentences.
Here is a sentence that mentions a few named entities like
the persons Barack Obama or Ursula von der Leyen, locations
like New York City, Vienna or Beijing or companies like
Google, UniCredit or Huawei.

Here we include a URL https://gatenlp.github.io/python-gatenlp/
and a fake email address [email protected] as well
as #some #cool #hastags and a bunch of emojis like 😽 (a kissing cat),
👩‍🏫 (a woman teacher), 🧬 (DNA),
🧗 (a person climbing),
💩 (a pile of poo).

Here we test a few different scripts, e.g. Hangul 한글 or
simplified Hanzi 汉字 or Farsi فارسی which goes from right to left.

,features=Features({}),anns=[])


Printing the document shows the document text and indicates that there are no document features and no annotations. This is to be expected, since we just loaded the document from a plain text file.

In a Jupyter notebook, a gatenlp document can also be visualized graphically by either just using the document as the last value of a cell or by using the IPython "display" function:

In [3]:
from IPython.display import display
display(doc)


This shows the document in a layout that has three areas: the document text in the upper left, the list of annotation sets and type names in the upper right, and the document or annotation features at the bottom. In the example above only the text is shown because there are no document features or annotations.

## Document features

In [4]:
doc.features["loaded-from"] = "https://gatenlp.github.io/python-gatenlp/testdocument1.txt"
doc.features["purpose"] = "test document for gatenlp"
doc.features["someotherfeature"] = 22
doc.features["andanother"] = {"what": "a dict", "alist": [1,2,3,4,5]}


Document features map feature names to feature values and behave much like a Python dictionary. Feature names should always be strings. Feature values can be anything, but a document can only be stored or exchanged with Java GATE if its feature values are restricted to what can be serialized as JSON: dictionaries, lists, numbers, strings and booleans.
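The JSON restriction can be checked before saving. The following is a minimal sketch using only the Python standard library; the helper name `is_json_serializable` is hypothetical and not part of gatenlp.

```python
import json


def is_json_serializable(value):
    """Return True if json.dumps accepts the value (hypothetical helper)."""
    try:
        json.dumps(value)
        return True
    except (TypeError, ValueError):
        return False


# Values like the ones used as document features above
ok_nested = is_json_serializable({"what": "a dict", "alist": [1, 2, 3, 4, 5]})
ok_number = is_json_serializable(22)
# A Python set has no JSON equivalent, so json.dumps rejects it
ok_set = is_json_serializable({1, 2, 3})

print(ok_nested, ok_number, ok_set)
```

Note that passing this check does not guarantee an identical round trip: `json` turns tuples into lists and non-string dictionary keys into strings.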

Now that we have created document features, the document is shown like this:

In [5]:
doc

Out[5]:
In [6]:
# to retrieve a feature value we can do:
doc.features["purpose"]

Out[6]:
'test document for gatenlp'
In [7]:
# If a feature does not exist, None is returned or a default value if specified:
print(doc.features.get("doesntexist"))
print(doc.features.get("doesntexist", "MV!"))

None
MV!


## Annotations

Let's add some annotations too. Annotations are items of information about some range of characters within the document. They can be used to represent information about things like tokens, entities, sentences, paragraphs, or anything else that corresponds to a contiguous range of offsets in the document.

Annotations consist of the following parts:

• The "start" and "end" offset to identify the text the annotation refers to
• A "type" which is an arbitrary name that identifies what kind of thing the annotation describes, e.g. "Token"
• Features: these work in the same way as for the whole document: an arbitrary set of feature name / feature value pairs which provide more information, e.g. for a Token the features could include the lemma, the part of speech, the stem, the number, etc.

Annotations can be organized in "annotation sets". Each annotation set has a name and a set of annotations. There can be as many sets as needed.

Annotations can overlap arbitrarily and there can be as many as needed.
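To make the parts of an annotation concrete, here is a plain-Python sketch of the data model described above. The `Span` class and the set name "Demo" are hypothetical stand-ins for illustration; this is not the gatenlp `Annotation` class.

```python
from dataclasses import dataclass, field


@dataclass
class Span:
    """Hypothetical stand-in for an annotation: offsets, type, features."""
    start: int
    end: int
    type: str
    features: dict = field(default_factory=dict)

    def overlaps(self, other):
        # Two ranges overlap if each one starts before the other ends
        return self.start < other.end and other.start < self.end


# Annotation sets sketched as a mapping from set name to a list of spans
sets = {
    "Demo": [
        Span(0, 4, "Token", {"pos": "DT"}),
        Span(5, 7, "Token"),
        Span(0, 24, "Sentence"),
    ]
}

first_token, second_token, sentence = sets["Demo"]
print(first_token.overlaps(sentence))      # token lies inside the sentence
print(first_token.overlaps(second_token))  # the two tokens are disjoint
```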

Let us manually add a few annotations to the document:

In [8]:
# create and get an annotation set with the name "Set1"
annset = doc.annset("Set1")


Add an annotation to the set which refers to the first word in the document "This". The range of characters for this word starts at offset 0 and the length of the annotation is 4, so the "start" offset is 0 and the "end" offset is 0+4=4. Note that the end offset always points to the offset after the last character of the range.
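The offset convention is the same one Python uses for slicing, which the following standalone sketch illustrates on the first sentence of the test document (plain strings only, no gatenlp calls):

```python
text = "This is a test document."

start, end = 0, 4            # "end" is the offset after the last character
covered = text[start:end]    # the text the annotation would refer to
length = end - start         # the length is simply end minus start

print(repr(covered), length)
```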

In [9]:
annset.add(0,4,"Word",{"what": "our first annotation"})

Out[9]:
Annotation(0,4,Word,features=Features({'what': 'our first annotation'}),id=0)
In [10]:
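# Add more
annset.add(5,7,"Word",{"what": "our second annotation"})
annset.add(0,24,"Sentence",{"what": "our first sentence annotation"})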

Out[10]:
Annotation(0,24,Sentence,features=Features({'what': 'our first sentence annotation'}),id=2)

If we visualize the document now, the newly created set "Set1" is shown in the right part of the display. It shows the different annotation types that exist in the set and how many annotations of each type are in the set. If you click the check box, the annotation ranges are shown in the text with the colour associated with the annotation type. You can then click on a range / annotation in the text and the features of the annotation are shown in the lower part. To show the features of a different annotation, click on the coloured range for that annotation in the text. To show the document features, click on "Document".

If you have selected more than one type, a range can have more than one overlapping annotation. This is shown by mixing the colours. If you click at such a location, a dialog appears which lets you select which of the overlapping annotations to display the features for.

In [11]:
doc

Out[11]:

Let's load a larger document, this time from an HTML file: the Wikipedia page for "Natural language processing":

In [12]:
doc2 = Document.load("https://en.m.wikipedia.org/wiki/Natural_language_processing", fmt="html", parser="html.parser")
doc2

Out[12]:

The markup present in the original HTML file is converted into annotations in the annotation set with the name "Original markups". For example all the HTML links are present as annotations of type "a" (there are 449 of those), the level 3 headings are present as annotations of type "h3" and so on.
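The idea of turning markup into annotations can be sketched with the standard library's `html.parser`. This is a simplified illustration under the assumption that each element becomes one (start, end, tag) span over the extracted text; it is not gatenlp's actual HTML loader, and the class name `MarkupToAnnotations` is hypothetical.

```python
from html.parser import HTMLParser


class MarkupToAnnotations(HTMLParser):
    """Collect the text and record one (start, end, tag) span per element."""

    def __init__(self):
        super().__init__()
        self.text_parts = []   # pieces of extracted text
        self.length = 0        # length of the text extracted so far
        self.open_tags = []    # stack of (tag, start_offset)
        self.annotations = []  # finished (start, end, tag) spans

    def handle_starttag(self, tag, attrs):
        self.open_tags.append((tag, self.length))

    def handle_endtag(self, tag):
        # Close the most recently opened tag with the same name, if any
        for i in range(len(self.open_tags) - 1, -1, -1):
            if self.open_tags[i][0] == tag:
                _, start = self.open_tags.pop(i)
                self.annotations.append((start, self.length, tag))
                break

    def handle_data(self, data):
        self.text_parts.append(data)
        self.length += len(data)


parser = MarkupToAnnotations()
parser.feed('<h3>Links</h3><p>See <a href="https://example.org">this page</a>.</p>')
parser.close()

text = "".join(parser.text_parts)
for start, end, tag in parser.annotations:
    print(tag, start, end, repr(text[start:end]))
```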

GateNLP documents can be loaded from a number of different text representations. When you run Document.load(filepath), gatenlp tries to determine the format of the document automatically from the file extension. If that fails, the format can be specified explicitly with the fmt= keyword argument, which takes a mnemonic or a MIME type specification for the format.

The following formats are known. Each entry lists the mnemonic first, if one exists, then the MIME type, and then a description of the format. All of the following formats can be loaded and saved:

• text, text/plain: Plain text, extension .txt, by default this is expected to be encoded in "UTF-8" but a different encoding can be specified using the encoding= keyword argument.
• text/plain+gzip: Gzip compressed plain text, same as text but gzip compressed.
• bdocjs, json, text/bdocjs: BDOC Json Format, extension .bdocjs, which can be exchanged with Java GATE via the format BDOC plugin (https://gatenlp.github.io/gateplugin-Format_Bdoc/)
• bdocjsgz, jsongz, text/bdocjs+gzip: BDOC Json Format, GZip compressed, extension .bdocjs.gz
• yaml, text/bdocym: BDOC Yaml Format, extension .bdocym, which can be exchanged with Java GATE via the format BDOC plugin. This format allows for serialization of shared nested arrays/maps and exchange of these between Java GATE and Python GateNLP.
• yamlgz, text/bdocym+gzip: BDOC Yaml Format, GZip compressed, extension .bdocym.gz
• msgpack, application/msgpack: BDOC Message Pack format, extension .bdocmp. Can be exchanged with Java GATE via the format BDOC plugin
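As an illustration of how a format could be derived from a file extension, the sketch below maps the extensions listed above to their mnemonics. The table and the `guess_format` helper are assumptions made for this example; gatenlp's own detection logic may work differently.

```python
# Longer extensions come first so ".bdocjs.gz" is not mistaken for ".gz" alone
EXTENSION_TO_FORMAT = [
    (".bdocjs.gz", "bdocjsgz"),
    (".bdocym.gz", "yamlgz"),
    (".bdocjs", "bdocjs"),
    (".bdocym", "yaml"),
    (".bdocmp", "msgpack"),
    (".txt", "text"),
]


def guess_format(path):
    """Return the format mnemonic for a path, or None if unknown (hypothetical)."""
    lower = path.lower()
    for extension, fmt in EXTENSION_TO_FORMAT:
        if lower.endswith(extension):
            return fmt
    return None


for name in ["tmpdoc.bdocjs", "corpus/doc1.bdocjs.gz", "notes.txt", "page.unknown"]:
    print(name, "->", guess_format(name))
```

When the helper returns None, the caller would have to state the format explicitly, which corresponds to passing fmt= when loading.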

The following formats can only be loaded:

• html, text/html: HTML files can be loaded and will be parsed to obtain the text and to create annotations that correspond to the HTML markup (these annotations are in annotation set "Original markups"). Note that not all HTML can be parsed without problems and this will NOT load the rendered form of the HTML page, i.e. anything created or influenced by JavaScript code on the page is not loaded.
• gatexml: Java GATE XML format, extension .xml can be loaded, but Java-specific data is not supported. If e.g. features have Java lists or arrays or similar as a value, the load will fail unless the keyword argument ignore_unknown_types=True is specified.

The following formats can only be saved:

• html-ann-viewer: This creates an HTML file which can be used to visualize the document. The following keyword arguments can be used: notebook=True to create a div instead of a complete HTML document, offline=True to include all JavaScript code necessary for visualization in the document instead of loading it from the internet, htmlid="somename" to make all HTML, CSS and JavaScript definitions for the generated HTML code unique, so that several different pieces of HTML code can be embedded in the same page.

Documents can also be saved and loaded using Python pickle.

Documents can also be converted to and from a Python-only representation using the methods doc.to_dict() and Document.from_dict(thedict), which can be used to serialize or transfer the document in many other formats.
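Because the dictionary representation consists only of plain Python values, it can be passed through any serializer that handles dictionaries. The sketch below uses a small hand-written dictionary with the same keys as the output of doc.to_dict() (an assumption made so the example runs without gatenlp) and round-trips it through the standard `json` module.

```python
import json

# Hand-written stand-in shaped like the result of doc.to_dict()
as_dict_like = {
    "annotation_sets": {
        "Set1": {
            "name": "Set1",
            "annotations": [
                {"type": "Word", "start": 0, "end": 4, "id": 0,
                 "features": {"what": "our first annotation"}},
            ],
            "next_annid": 1,
        }
    },
    "text": "This is a test document.",
    "features": {"purpose": "test document for gatenlp"},
    "offset_type": "p",
    "name": "",
}

encoded = json.dumps(as_dict_like)   # serialize to a JSON string
decoded = json.loads(encoded)        # and restore the dictionary

print(decoded == as_dict_like)
```

With a real document, the restored dictionary would then be handed to Document.from_dict to obtain a document again.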

In [13]:
# Convert the document to a dictionary representation:
as_dict = doc.to_dict()
as_dict

Out[13]:
{'annotation_sets': {'Set1': {'name': 'Set1',
'annotations': [{'type': 'Word',
'start': 0,
'end': 4,
'id': 0,
'features': {'what': 'our first annotation'}},
{'type': 'Word',
'start': 5,
'end': 7,
'id': 1,
'features': {'what': 'our second annotation'}},
{'type': 'Sentence',
'start': 0,
'end': 24,
'id': 2,
'features': {'what': 'our first sentence annotation'}}],
'next_annid': 3}},
'text': 'This is a test document.\n\nIt contains just a few sentences. \nHere is a sentence that mentions a few named entities like \nthe persons Barack Obama or Ursula von der Leyen, locations\nlike New York City, Vienna or Beijing or companies like \nGoogle, UniCredit or Huawei. \n\nHere we include a URL https://gatenlp.github.io/python-gatenlp/ \nand a fake email address [email protected] as well \nas #some #cool #hastags and a bunch of emojis like 😽 (a kissing cat),\n👩\u200d🏫 (a woman teacher), \U0001f9ec (DNA), \n\U0001f9d7 (a person climbing), \n💩 (a pile of poo). \n\nHere we test a few different scripts, e.g. Hangul 한글 or \nsimplified Hanzi 汉字 or Farsi فارسی which goes from right to left. \n\n\n',
'features': {'loaded-from': 'https://gatenlp.github.io/python-gatenlp/testdocument1.txt',
'purpose': 'test document for gatenlp',
'someotherfeature': 22,
'andanother': {'what': 'a dict', 'alist': [1, 2, 3, 4, 5]}},
'offset_type': 'p',
'name': ''}
In [14]:
# create a copy by creating a new Document from the dictionary representation
doc_copy = Document.from_dict(as_dict)
doc_copy

Out[14]:
In [15]:
# Save the document in bdocjs format
doc.save("tmpdoc.bdocjs")

# show what the document looks like
with open("tmpdoc.bdocjs", "rt") as infp:
    print(infp.read())

{"annotation_sets": {"Set1": {"name": "Set1", "annotations": [{"type": "Word", "start": 0, "end": 4, "id": 0, "features": {"what": "our first annotation"}}, {"type": "Word", "start": 5, "end": 7, "id": 1, "features": {"what": "our second annotation"}}, {"type": "Sentence", "start": 0, "end": 24, "id": 2, "features": {"what": "our first sentence annotation"}}], "next_annid": 3}}, "text": "This is a test document.\n\nIt contains just a few sentences. \nHere is a sentence that mentions a few named entities like \nthe persons Barack Obama or Ursula von der Leyen, locations\nlike New York City, Vienna or Beijing or companies like \nGoogle, UniCredit or Huawei. \n\nHere we include a URL https://gatenlp.github.io/python-gatenlp/ \nand a fake email address [email protected] as well \nas #some #cool #hastags and a bunch of emojis like \ud83d\ude3d (a kissing cat),\n\ud83d\udc69\u200d\ud83c\udfeb (a woman teacher), \ud83e\uddec (DNA), \n\ud83e\uddd7 (a person climbing), \n\ud83d\udca9 (a pile of poo). \n\nHere we test a few different scripts, e.g. Hangul \ud55c\uae00 or \nsimplified Hanzi \u6c49\u5b57 or Farsi \u0641\u0627\u0631\u0633\u06cc which goes from right to left. \n\n\n", "features": {"loaded-from": "https://gatenlp.github.io/python-gatenlp/testdocument1.txt", "purpose": "test document for gatenlp", "someotherfeature": 22, "andanother": {"what": "a dict", "alist": [1, 2, 3, 4, 5]}}, "offset_type": "p", "name": ""}

# load the document from the saved bdocjs format file
Document.load("tmpdoc.bdocjs")

# clean up the document
import os
os.remove("tmpdoc.bdocjs")