(C) 2023-2024 by Damir Cavar
Version: 1.1, January 2024
Download: This and various other Jupyter notebooks are available from my GitHub repo.
Prerequisites:
!pip install -U stanza
To install spaCy, follow the instructions on the Install spaCy page.
!pip install -U pip setuptools wheel
The following installation of spaCy is ideal for my environment, i.e., using a GPU and CUDA 12.x. See the spaCy homepage for detailed installation instructions.
!pip install -U 'spacy[cuda12x,transformers,lookups,ja]'
This is a tutorial related to the L645 Advanced Natural Language Processing course in Fall 2023 at Indiana University. The following tutorial assumes that you are using a recent Python 3 distribution and Stanza 1.5.1 or newer.
This notebook assumes that you have set up Stanza on your computer with your Python distribution. Follow the instructions on the Stanza installation page to set up a working environment for the following code. The code will also require that you are online and that the specific language models can be downloaded and installed.
Loading the Stanza module and spaCy's Displacy for visualization:
import stanza
from stanza.models.common.doc import Document
from stanza.pipeline.core import Pipeline
from spacy import displacy
The following code will load the English language model for Stanza:
stanza.download('en')
Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.7.0.json: 0%| …
2024-01-23 12:31:57 INFO: Downloading default packages for language: en (English) ...
Downloading https://huggingface.co/stanfordnlp/stanza-en/resolve/v1.7.0/models/default.zip: 0%| | 0…
2024-01-23 12:32:13 INFO: Finished downloading models and saved to /home/damir/stanza_resources.
We can configure the Stanza pipeline to include all desired linguistic annotation modules. In this case we request tokenization, multi-word token expansion, part-of-speech tagging, lemmatization, named entity recognition (with the biomedical ncbi_disease package), dependency parsing, constituency parsing, and sentiment analysis:
nlp = stanza.Pipeline('en', processors='tokenize,mwt,pos,lemma,ner,depparse,constituency,sentiment', package={"ner": ["ncbi_disease", "ontonotes"]}, use_gpu=False, download_method="reuse_resources")
2024-01-23 12:32:27 WARNING: Can not find ner: ontonotes from official model list. Ignoring it.
Downloading https://huggingface.co/stanfordnlp/stanza-en/resolve/v1.7.0/models/ner/ncbi_disease.pt: 0%| …
Downloading https://huggingface.co/stanfordnlp/stanza-en/resolve/v1.7.0/models/forward_charlm/pubmed.pt: 0%|…
Downloading https://huggingface.co/stanfordnlp/stanza-en/resolve/v1.7.0/models/backward_charlm/pubmed.pt: 0%…
Downloading https://huggingface.co/stanfordnlp/stanza-en/resolve/v1.7.0/models/pretrain/biomed.pt: 0%| …
2024-01-23 12:32:33 INFO: Loading these models for language: en (English):
======================================
| Processor    | Package             |
--------------------------------------
| tokenize     | combined            |
| mwt          | combined            |
| pos          | combined_charlm     |
| lemma        | combined_nocharlm   |
| constituency | ptb3-revised_charlm |
| depparse     | combined_charlm     |
| sentiment    | sstplus             |
| ner          | ncbi_disease        |
======================================
2024-01-23 12:32:33 INFO: Using device: cpu
2024-01-23 12:32:33 INFO: Loading: tokenize
/home/damir/.local/lib/python3.12/site-packages/transformers/utils/generic.py:441: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
  _torch_pytree._register_pytree_node(
2024-01-23 12:32:33 INFO: Loading: mwt
2024-01-23 12:32:33 INFO: Loading: pos
2024-01-23 12:32:34 INFO: Loading: lemma
2024-01-23 12:32:34 INFO: Loading: constituency
2024-01-23 12:32:34 INFO: Loading: depparse
2024-01-23 12:32:34 INFO: Loading: sentiment
2024-01-23 12:32:34 INFO: Loading: ner
2024-01-23 12:32:35 INFO: Done loading processors!
doc = nlp("The pilot had arthritis. What's so important to underline is that Metz worked for both Northrop and Lockheed Martin in New York City and is not known for hyperbole. Yet even after flying the pre-production F-22, a far more mature machine than the YF-23 ever was, he makes it quite clear that Northrop's offering was on par with Lockheed's, if not superior.")
for i, sentence in enumerate(doc.sentences):
    print(f'====== Sentence {i+1} tokens =======')
    print(*[f'id: {token.id}\ttext: {token.text}' for token in sentence.tokens], sep='\n')
====== Sentence 1 tokens =======
id: (1,)    text: The
id: (2,)    text: pilot
id: (3,)    text: had
id: (4,)    text: arthritis
id: (5,)    text: .
====== Sentence 2 tokens =======
id: (1, 2)  text: What's
id: (3,)    text: so
id: (4,)    text: important
id: (5,)    text: to
id: (6,)    text: underline
id: (7,)    text: is
id: (8,)    text: that
id: (9,)    text: Metz
id: (10,)   text: worked
id: (11,)   text: for
id: (12,)   text: both
id: (13,)   text: Northrop
id: (14,)   text: and
id: (15,)   text: Lockheed
id: (16,)   text: Martin
id: (17,)   text: in
id: (18,)   text: New
id: (19,)   text: York
id: (20,)   text: City
id: (21,)   text: and
id: (22,)   text: is
id: (23,)   text: not
id: (24,)   text: known
id: (25,)   text: for
id: (26,)   text: hyperbole
id: (27,)   text: .
====== Sentence 3 tokens =======
id: (1,)    text: Yet
id: (2,)    text: even
id: (3,)    text: after
id: (4,)    text: flying
id: (5,)    text: the
id: (6,)    text: pre-production
id: (7,)    text: F
id: (8,)    text: -
id: (9,)    text: 22
id: (10,)   text: ,
id: (11,)   text: a
id: (12,)   text: far
id: (13,)   text: more
id: (14,)   text: mature
id: (15,)   text: machine
id: (16,)   text: than
id: (17,)   text: the
id: (18,)   text: YF
id: (19,)   text: -
id: (20,)   text: 23
id: (21,)   text: ever
id: (22,)   text: was
id: (23,)   text: ,
id: (24,)   text: he
id: (25,)   text: makes
id: (26,)   text: it
id: (27,)   text: quite
id: (28,)   text: clear
id: (29,)   text: that
id: (30, 31)    text: Northrop's
id: (32,)   text: offering
id: (33,)   text: was
id: (34,)   text: on
id: (35,)   text: par
id: (36,)   text: with
id: (37, 38)    text: Lockheed's
id: (39,)   text: ,
id: (40,)   text: if
id: (41,)   text: not
id: (42,)   text: superior
id: (43,)   text: .
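Note that the token ids above are tuples: a token whose id covers more than one number, such as (1, 2) for "What's", is a multi-word token that the mwt processor expands into several syntactic words. The following minimal sketch, independent of Stanza, shows how such id tuples can be used to pick out contractions from (id, text) pairs like those printed above:

```python
# (id, text) pairs copied from the Sentence 2 output above.
tokens = [
    ((1, 2), "What's"), ((3,), "so"), ((4,), "important"), ((5,), "to"),
    ((6,), "underline"), ((7,), "is"), ((8,), "that"), ((9,), "Metz"),
]

def multiword_tokens(tokens):
    """Return the surface forms of tokens that expand into several words."""
    return [text for ids, text in tokens if len(ids) > 1]

print(multiword_tokens(tokens))  # ["What's"]
```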
print(*[f'word: {word.text}\tupos: {word.upos}\txpos: {word.xpos}\tfeats: {word.feats if word.feats else "_"}' for sent in doc.sentences for word in sent.words], sep='\n')
word: The   upos: DET   xpos: DT   feats: Definite=Def|PronType=Art
word: pilot upos: NOUN  xpos: NN   feats: Number=Sing
word: had   upos: VERB  xpos: VBD  feats: Mood=Ind|Number=Sing|Person=3|Tense=Past|VerbForm=Fin
word: arthritis upos: NOUN  xpos: NN   feats: Number=Sing
word: . upos: PUNCT xpos: .    feats: _
word: What  upos: PRON  xpos: WP   feats: PronType=Int
word: 's    upos: AUX   xpos: VBZ  feats: Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin
word: so    upos: ADV   xpos: RB   feats: _
word: important upos: ADJ   xpos: JJ   feats: Degree=Pos
word: to    upos: PART  xpos: TO   feats: _
word: underline upos: VERB  xpos: VB   feats: VerbForm=Inf
word: is    upos: AUX   xpos: VBZ  feats: Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin
word: that  upos: SCONJ xpos: IN   feats: _
word: Metz  upos: PROPN xpos: NNP  feats: Number=Sing
word: worked    upos: VERB  xpos: VBD  feats: Mood=Ind|Number=Sing|Person=3|Tense=Past|VerbForm=Fin
word: for   upos: ADP   xpos: IN   feats: _
word: both  upos: CCONJ xpos: CC   feats: _
word: Northrop  upos: PROPN xpos: NNP  feats: Number=Sing
word: and   upos: CCONJ xpos: CC   feats: _
word: Lockheed  upos: PROPN xpos: NNP  feats: Number=Sing
word: Martin    upos: PROPN xpos: NNP  feats: Number=Sing
word: in    upos: ADP   xpos: IN   feats: _
word: New   upos: ADJ   xpos: NNP  feats: Degree=Pos
word: York  upos: PROPN xpos: NNP  feats: Number=Sing
word: City  upos: PROPN xpos: NNP  feats: Number=Sing
word: and   upos: CCONJ xpos: CC   feats: _
word: is    upos: AUX   xpos: VBZ  feats: Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin
word: not   upos: PART  xpos: RB   feats: _
word: known upos: VERB  xpos: VBN  feats: Tense=Past|VerbForm=Part|Voice=Pass
word: for   upos: ADP   xpos: IN   feats: _
word: hyperbole upos: NOUN  xpos: NN   feats: Number=Sing
word: . upos: PUNCT xpos: .    feats: _
word: Yet   upos: CCONJ xpos: CC   feats: _
word: even  upos: ADV   xpos: RB   feats: _
word: after upos: SCONJ xpos: IN   feats: _
word: flying    upos: VERB  xpos: VBG  feats: VerbForm=Ger
word: the   upos: DET   xpos: DT   feats: Definite=Def|PronType=Art
word: pre-production    upos: NOUN  xpos: NN   feats: Number=Sing
word: F upos: PROPN xpos: NNP  feats: Number=Sing
word: - upos: PUNCT xpos: HYPH feats: _
word: 22    upos: NUM   xpos: CD   feats: NumForm=Digit|NumType=Card
word: , upos: PUNCT xpos: ,    feats: _
word: a upos: DET   xpos: DT   feats: Definite=Ind|PronType=Art
word: far   upos: ADV   xpos: RB   feats: Degree=Pos
word: more  upos: ADV   xpos: RBR  feats: Degree=Cmp
word: mature    upos: ADJ   xpos: JJ   feats: Degree=Pos
word: machine   upos: NOUN  xpos: NN   feats: Number=Sing
word: than  upos: ADP   xpos: IN   feats: _
word: the   upos: DET   xpos: DT   feats: Definite=Def|PronType=Art
word: YF    upos: PROPN xpos: NNP  feats: Number=Sing
word: - upos: PUNCT xpos: HYPH feats: _
word: 23    upos: NUM   xpos: CD   feats: NumForm=Digit|NumType=Card
word: ever  upos: ADV   xpos: RB   feats: _
word: was   upos: AUX   xpos: VBD  feats: Mood=Ind|Number=Sing|Person=3|Tense=Past|VerbForm=Fin
word: , upos: PUNCT xpos: ,    feats: _
word: he    upos: PRON  xpos: PRP  feats: Case=Nom|Gender=Masc|Number=Sing|Person=3|PronType=Prs
word: makes upos: VERB  xpos: VBZ  feats: Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin
word: it    upos: PRON  xpos: PRP  feats: Case=Acc|Gender=Neut|Number=Sing|Person=3|PronType=Prs
word: quite upos: ADV   xpos: RB   feats: _
word: clear upos: ADJ   xpos: JJ   feats: Degree=Pos
word: that  upos: SCONJ xpos: IN   feats: _
word: Northrop  upos: PROPN xpos: NNP  feats: Number=Sing
word: 's    upos: PART  xpos: POS  feats: _
word: offering  upos: NOUN  xpos: NN   feats: Number=Sing
word: was   upos: AUX   xpos: VBD  feats: Mood=Ind|Number=Sing|Person=3|Tense=Past|VerbForm=Fin
word: on    upos: ADP   xpos: IN   feats: _
word: par   upos: NOUN  xpos: NN   feats: Number=Sing
word: with  upos: ADP   xpos: IN   feats: _
word: Lockheed  upos: PROPN xpos: NNP  feats: Number=Sing
word: 's    upos: PART  xpos: POS  feats: _
word: , upos: PUNCT xpos: ,    feats: _
word: if    upos: SCONJ xpos: IN   feats: _
word: not   upos: PART  xpos: RB   feats: _
word: superior  upos: ADJ   xpos: JJ   feats: Degree=Pos
word: . upos: PUNCT xpos: .    feats: _
print(*[f'word: {word.text+" "}\tlemma: {word.lemma}' for sent in doc.sentences for word in sent.words], sep='\n')
word: The   lemma: the
word: pilot lemma: pilot
word: had   lemma: have
word: arthritis lemma: arthritis
word: . lemma: .
word: What  lemma: what
word: 's    lemma: be
word: so    lemma: so
word: important lemma: important
word: to    lemma: to
word: underline lemma: underline
word: is    lemma: be
word: that  lemma: that
word: Metz  lemma: Metz
word: worked    lemma: work
word: for   lemma: for
word: both  lemma: both
word: Northrop  lemma: Northrop
word: and   lemma: and
word: Lockheed  lemma: Lockheed
word: Martin    lemma: Martin
word: in    lemma: in
word: New   lemma: New
word: York  lemma: York
word: City  lemma: City
word: and   lemma: and
word: is    lemma: be
word: not   lemma: not
word: known lemma: know
word: for   lemma: for
word: hyperbole lemma: hyperbole
word: . lemma: .
word: Yet   lemma: yet
word: even  lemma: even
word: after lemma: after
word: flying    lemma: fly
word: the   lemma: the
word: pre-production    lemma: pre-production
word: F lemma: F
word: - lemma: -
word: 22    lemma: 22
word: , lemma: ,
word: a lemma: a
word: far   lemma: far
word: more  lemma: more
word: mature    lemma: mature
word: machine   lemma: machine
word: than  lemma: than
word: the   lemma: the
word: YF    lemma: YF
word: - lemma: -
word: 23    lemma: 23
word: ever  lemma: ever
word: was   lemma: be
word: , lemma: ,
word: he    lemma: he
word: makes lemma: make
word: it    lemma: it
word: quite lemma: quite
word: clear lemma: clear
word: that  lemma: that
word: Northrop  lemma: Northrop
word: 's    lemma: 's
word: offering  lemma: offering
word: was   lemma: be
word: on    lemma: on
word: par   lemma: par
word: with  lemma: with
word: Lockheed  lemma: Lockheed
word: 's    lemma: 's
word: , lemma: ,
word: if    lemma: if
word: not   lemma: not
word: superior  lemma: superior
word: . lemma: .
for sentence in doc.sentences:
    print(sentence.constituency)
(ROOT (S (NP (DT The) (NN pilot)) (VP (VBD had) (NP (NN arthritis))) (. .)))
(ROOT (S (SBAR (WHNP (WP What)) (S (VP (VBZ 's) (ADJP (RB so) (JJ important) (SBAR (S (VP (TO to) (VP (VB underline))))))))) (VP (VBZ is) (SBAR (IN that) (S (NP (NNP Metz)) (VP (VP (VBD worked) (PP (IN for) (NP (CC both) (NP (NNP Northrop)) (CC and) (NP (NNP Lockheed) (NNP Martin)))) (PP (IN in) (NP (NML (NNP New) (NNP York)) (NNP City)))) (CC and) (VP (VBZ is) (RB not) (VP (VBN known) (PP (IN for) (NP (NN hyperbole))))))))) (. .)))
(ROOT (S (CC Yet) (PP (ADVP (RB even)) (IN after) (S (VP (VBG flying) (NP (DT the) (NN pre-production) (NNP F) (HYPH -) (CD 22))))) (, ,) (NP (NP (DT a) (ADJP (ADVP (RB far) (RBR more)) (JJ mature)) (NN machine)) (PP (IN than) (NP (DT the) (NNP YF) (HYPH -) (CD 23)))) (ADVP (RB ever)) (VP (VBD was)) (, ,) (NP (NP (PRP he))) (VP (VBZ makes) (S (NP (NP (PRP it))) (ADJP (RB quite) (JJ clear)) (SBAR (IN that) (S (NP (NP (NNP Northrop) (POS 's)) (NN offering)) (VP (VBD was) (PP (IN on) (NP (NN par))) (PP (IN with) (ADJP (NP (NNP Lockheed) (POS 's)) (, ,) (SBAR (IN if) (FRAG (RB not) (ADJP (JJ superior))))))))))) (. .)))
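The constituency processor prints each parse as a Penn-Treebank-style bracketing (Stanza also exposes the tree programmatically via the constituency object). As an illustration of how to work with such bracketings, here is a small, Stanza-independent sketch that parses the printed string into nested lists and extracts the word spans of all constituents with a given label:

```python
def parse_tree(s):
    """Parse a Penn-Treebank-style bracketing into nested [label, child, ...] lists."""
    tokens = s.replace("(", " ( ").replace(")", " ) ").split()
    def read(pos):
        assert tokens[pos] == "("
        node = [tokens[pos + 1]]  # constituent label
        pos += 2
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                child, pos = read(pos)
                node.append(child)
            else:
                node.append(tokens[pos])  # leaf: the word itself
                pos += 1
        return node, pos + 1
    tree, _ = read(0)
    return tree

def leaves(node):
    """Collect the words under a node, left to right."""
    if isinstance(node, str):
        return [node]
    return [w for child in node[1:] for w in leaves(child)]

def constituents(node, label):
    """Yield the word spans of all constituents with the given label."""
    if isinstance(node, str):
        return
    if node[0] == label:
        yield " ".join(leaves(node))
    for child in node[1:]:
        yield from constituents(child, label)

tree = parse_tree("(ROOT (S (NP (DT The) (NN pilot)) (VP (VBD had) (NP (NN arthritis))) (. .)))")
print(list(constituents(tree, "NP")))  # ['The pilot', 'arthritis']
```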
print(*[f'entity: {ent.text}\ttype: {ent.type}' for ent in doc.ents], sep='\n')
entity: arthritis type: DISEASE
print(*[f'token: {token.text}\tner: {token.ner}' for sent in doc.sentences for token in sent.tokens], sep='\n')
token: The  ner: O
token: pilot    ner: O
token: had  ner: O
token: arthritis    ner: S-DISEASE
token: .    ner: O
token: What ner: O
token: 's   ner: O
token: so   ner: O
token: important    ner: O
token: to   ner: O
token: underline    ner: O
token: is   ner: O
token: that ner: O
token: Metz ner: S-ORG
token: worked   ner: O
token: for  ner: O
token: both ner: O
token: Northrop ner: S-ORG
token: and  ner: O
token: Lockheed ner: B-ORG
token: Martin   ner: E-ORG
token: in   ner: O
token: New  ner: B-GPE
token: York ner: I-GPE
token: City ner: E-GPE
token: and  ner: O
token: is   ner: O
token: not  ner: O
token: known    ner: O
token: for  ner: O
token: hyperbole    ner: O
token: .    ner: O
token: Yet  ner: O
token: even ner: O
token: after    ner: O
token: flying   ner: O
token: the  ner: O
token: pre-production   ner: O
token: F    ner: B-PRODUCT
token: -    ner: I-PRODUCT
token: 22   ner: E-PRODUCT
token: ,    ner: O
token: a    ner: O
token: far  ner: O
token: more ner: O
token: mature   ner: O
token: machine  ner: O
token: than ner: O
token: the  ner: B-PRODUCT
token: YF   ner: I-PRODUCT
token: -    ner: I-PRODUCT
token: 23   ner: E-PRODUCT
token: ever ner: O
token: was  ner: O
token: ,    ner: O
token: he   ner: O
token: makes    ner: O
token: it   ner: O
token: quite    ner: O
token: clear    ner: O
token: that ner: O
token: Northrop ner: S-ORG
token: 's   ner: O
token: offering ner: O
token: was  ner: O
token: on   ner: O
token: par  ner: O
token: with ner: O
token: Lockheed ner: S-ORG
token: 's   ner: O
token: ,    ner: O
token: if   ner: O
token: not  ner: O
token: superior ner: O
token: .    ner: O
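The token-level annotations use the BIOES scheme: S marks a single-token entity, B/I/E mark the beginning, inside, and end of a multi-token entity, and O marks tokens outside any entity. A minimal sketch, independent of Stanza, of decoding such tag sequences into entity spans (joining tokens with single spaces is a simplification; in practice one would use character offsets):

```python
def bioes_to_entities(tagged_tokens):
    """Decode (token, BIOES-tag) pairs into (entity_text, type) spans."""
    entities, current = [], []
    for text, tag in tagged_tokens:
        if tag == "O":
            current = []
            continue
        prefix, etype = tag.split("-", 1)
        if prefix == "S":        # single-token entity
            entities.append((text, etype))
        elif prefix == "B":      # beginning of a multi-token entity
            current = [text]
        elif prefix == "I":      # inside the entity
            current.append(text)
        elif prefix == "E":      # end: emit the accumulated span
            current.append(text)
            entities.append((" ".join(current), etype))
            current = []
    return entities

# (token, tag) pairs copied from the output above.
tags = [("New", "B-GPE"), ("York", "I-GPE"), ("City", "E-GPE"),
        ("arthritis", "S-DISEASE"), (".", "O")]
print(bioes_to_entities(tags))  # [('New York City', 'GPE'), ('arthritis', 'DISEASE')]
```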
for i, sentence in enumerate(doc.sentences):
    print("%d -> %d" % (i, sentence.sentiment))
0 -> 0
1 -> 2
2 -> 0
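The sentiment processor assigns each sentence an integer score: 0 for negative, 1 for neutral, and 2 for positive. A small sketch mapping the scores printed above to readable labels:

```python
SENTIMENT_LABELS = {0: "negative", 1: "neutral", 2: "positive"}

def label_sentiments(scores):
    """Map Stanza's integer sentence-sentiment scores to readable labels."""
    return [SENTIMENT_LABELS[s] for s in scores]

# Scores copied from the output above: sentences 1 and 3 negative, sentence 2 positive.
print(label_sentiments([0, 2, 0]))  # ['negative', 'positive', 'negative']
```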
stanza.download(lang="multilingual")
stanza.download(lang="en")
# stanza.download(lang="fr")
stanza.download(lang="de")
Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.5.0.json: 0%| …
2023-09-20 17:34:37 INFO: Downloading default packages for language: multilingual (multilingual) ...
2023-09-20 17:34:37 INFO: File exists: C:\Users\damir\stanza_resources\multilingual\default.zip
2023-09-20 17:34:37 INFO: Finished downloading models and saved to C:\Users\damir\stanza_resources.
Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.5.0.json: 0%| …
2023-09-20 17:34:38 INFO: Downloading default packages for language: en (English) ...
2023-09-20 17:34:38 INFO: File exists: C:\Users\damir\stanza_resources\en\default.zip
2023-09-20 17:34:42 INFO: Finished downloading models and saved to C:\Users\damir\stanza_resources.
Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.5.0.json: 0%| …
2023-09-20 17:34:42 INFO: Downloading default packages for language: de (German) ...
2023-09-20 17:34:43 INFO: File exists: C:\Users\damir\stanza_resources\de\default.zip
2023-09-20 17:34:47 INFO: Finished downloading models and saved to C:\Users\damir\stanza_resources.
nlp = Pipeline(lang="multilingual", processors="langid")
docs = ["Hello world.", "Hallo, Welt!"]
docs = [Document([], text=text) for text in docs]
nlp(docs)
print("\n".join(f"{doc.text}\t{doc.lang}" for doc in docs))
2023-09-20 17:36:07 INFO: Checking for updates to resources.json in case models have been updated. Note: this behavior can be turned off with download_method=None or download_method=DownloadMethod.REUSE_RESOURCES
Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.5.0.json: 0%| …
2023-09-20 17:36:07 INFO: Loading these models for language: multilingual ():
=======================
| Processor | Package |
-----------------------
| langid    | ud      |
=======================
2023-09-20 17:36:07 INFO: Using device: cuda
2023-09-20 17:36:07 INFO: Loading: langid
2023-09-20 17:36:07 INFO: Done loading processors!
Hello world.    en
Hallo, Welt!    it
Note that the short German input is misidentified as Italian here; language identification is unreliable on very short strings.
I wrote the following function to convert Stanza's dependency tree data structure into a Displacy-compatible format, so that dependency trees can be rendered with spaCy's excellent visualizer:
def get_stanza_dep_displacy_manual(doc):
    """Convert a Stanza Document into the manual data format expected by displacy."""
    res = []
    for sentence in doc.sentences:
        words = [{"text": w.text, "tag": w.upos} for w in sentence.words]
        arcs = []
        for w in sentence.words:
            # The root word has no incoming arc to draw.
            if w.deprel == "root":
                continue
            start = w.head - 1
            end = w.id - 1
            # displacy requires start < end; the "dir" flag encodes where the head is.
            if start < end:
                arcs.append({"start": start, "end": end, "label": w.deprel, "dir": "right"})
            else:
                arcs.append({"start": end, "end": start, "label": w.deprel, "dir": "left"})
        res.append({"words": words, "arcs": arcs})
    return res
Note that nlp currently refers to the multilingual language-identification pipeline, so we first re-create an English pipeline that includes the dependency parser before parsing the example sentence:
nlp = stanza.Pipeline('en', processors='tokenize,mwt,pos,lemma,depparse', download_method="reuse_resources")
doc = nlp("John loves to read books and Mary newspapers.")
We can now generate the spaCy-compatible data format from the dependency tree to be able to visualize it:
res = get_stanza_dep_displacy_manual(doc)
The rendering can be achieved using the Displacy call:
displacy.render(res, style="dep", manual=True, options={"compact":False, "distance":110})
Finally, the annotated document can be exported in CoNLL-U format:
from stanza.utils.conll import CoNLL
CoNLL.write_doc2conll(doc, "output.conllu")
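The exported file uses the standard 10-column CoNLL-U layout, one word per line with blank lines separating sentences. As a minimal, hedged sketch (independent of Stanza's own CoNLL utilities, and ignoring multi-word token ranges), such a file can be read back like this:

```python
def read_conllu(text):
    """Parse CoNLL-U text into sentences of {field: value} dicts (10-column format)."""
    fields = ["id", "form", "lemma", "upos", "xpos", "feats",
              "head", "deprel", "deps", "misc"]
    sentences, current = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:                      # blank line: sentence boundary
            if current:
                sentences.append(current)
                current = []
        elif not line.startswith("#"):    # skip comment lines
            current.append(dict(zip(fields, line.split("\t"))))
    if current:
        sentences.append(current)
    return sentences

# A short sample in the format write_doc2conll produces (annotations are illustrative).
sample = "# text = The pilot had arthritis.\n" \
         "1\tThe\tthe\tDET\tDT\tDefinite=Def|PronType=Art\t2\tdet\t_\t_\n" \
         "2\tpilot\tpilot\tNOUN\tNN\tNumber=Sing\t3\tnsubj\t_\t_\n"
sents = read_conllu(sample)
print(sents[0][0]["form"], sents[0][1]["deprel"])  # The nsubj
```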
(C) 2023-2024 by Damir Cavar <dcavar@iu.edu>