(C) 2023-2024 by Damir Cavar
Version: 1.1, January 2024
Download: This and various other Jupyter notebooks are available from my GitHub repo.
Prerequisites:
!pip install -U stanza
To install spaCy, follow the instructions on the Install spaCy page.
!pip install -U pip setuptools wheel
The following installation of spaCy is ideal for my environment, i.e., using a GPU and CUDA 12.x. See the spaCy homepage for detailed installation instructions.
!pip install -U 'spacy[cuda12x,transformers,lookups,ja]'
This is a tutorial related to the L645 Advanced Natural Language Processing course in Fall 2023 at Indiana University. The following tutorial assumes that you are using a recent Python 3 distribution and Stanza 1.5.1 or newer.
This notebook assumes that you have set up Stanza on your computer with your Python distribution. Follow the instructions on the Stanza installation page to set up a working environment for the following code. The code will also require that you are online and that the specific language models can be downloaded and installed.
Loading the Stanza module and spaCy's Displacy for visualization:
import stanza
from stanza.models.common.doc import Document
from stanza.pipeline.core import Pipeline
from spacy import displacy
The following code will load the English language model for Stanza:
stanza.download('en')
Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.7.0.json: 0%| …
2024-01-23 12:31:57 INFO: Downloading default packages for language: en (English) ...
Downloading https://huggingface.co/stanfordnlp/stanza-en/resolve/v1.7.0/models/default.zip: 0%| | 0…
2024-01-23 12:32:13 INFO: Finished downloading models and saved to /home/damir/stanza_resources.
We can configure the Stanza pipeline to include all desired linguistic annotation modules. In this case we request tokenization, multi-word token expansion, part-of-speech tagging, lemmatization, named entity recognition (with the biomedical ncbi_disease package), dependency parsing, constituency parsing, and sentiment analysis:
nlp = stanza.Pipeline('en', processors='tokenize,mwt,pos,lemma,ner,depparse,constituency,sentiment', package={"ner": ["ncbi_disease", "ontonotes"]}, use_gpu=False, download_method="reuse_resources")
2024-01-23 12:32:27 WARNING: Can not find ner: ontonotes from official model list. Ignoring it.
Downloading https://huggingface.co/stanfordnlp/stanza-en/resolve/v1.7.0/models/ner/ncbi_disease.pt: 0%| …
Downloading https://huggingface.co/stanfordnlp/stanza-en/resolve/v1.7.0/models/forward_charlm/pubmed.pt: 0%|…
Downloading https://huggingface.co/stanfordnlp/stanza-en/resolve/v1.7.0/models/backward_charlm/pubmed.pt: 0%…
Downloading https://huggingface.co/stanfordnlp/stanza-en/resolve/v1.7.0/models/pretrain/biomed.pt: 0%| …
2024-01-23 12:32:33 INFO: Loading these models for language: en (English):
======================================
| Processor    | Package             |
--------------------------------------
| tokenize     | combined            |
| mwt          | combined            |
| pos          | combined_charlm     |
| lemma        | combined_nocharlm   |
| constituency | ptb3-revised_charlm |
| depparse     | combined_charlm     |
| sentiment    | sstplus             |
| ner          | ncbi_disease        |
======================================
2024-01-23 12:32:33 INFO: Using device: cpu
2024-01-23 12:32:33 INFO: Loading: tokenize
/home/damir/.local/lib/python3.12/site-packages/transformers/utils/generic.py:441: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
  _torch_pytree._register_pytree_node(
2024-01-23 12:32:33 INFO: Loading: mwt
2024-01-23 12:32:33 INFO: Loading: pos
2024-01-23 12:32:34 INFO: Loading: lemma
2024-01-23 12:32:34 INFO: Loading: constituency
2024-01-23 12:32:34 INFO: Loading: depparse
2024-01-23 12:32:34 INFO: Loading: sentiment
2024-01-23 12:32:34 INFO: Loading: ner
2024-01-23 12:32:35 INFO: Done loading processors!
doc = nlp("The pilot had arthritis. What's so important to underline is that Metz worked for both Northrop and Lockheed Martin in New York City and is not known for hyperbole. Yet even after flying the pre-production F-22, a far more mature machine than the YF-23 ever was, he makes it quite clear that Northrop's offering was on par with Lockheed's, if not superior.")
for i, sentence in enumerate(doc.sentences):
    print(f'====== Sentence {i+1} tokens =======')
    print(*[f'id: {token.id}\ttext: {token.text}' for token in sentence.tokens], sep='\n')
====== Sentence 1 tokens =======
id: (1,)    text: The
id: (2,)    text: pilot
id: (3,)    text: had
id: (4,)    text: arthritis
id: (5,)    text: .
====== Sentence 2 tokens =======
id: (1, 2)  text: What's
id: (3,)    text: so
id: (4,)    text: important
id: (5,)    text: to
id: (6,)    text: underline
id: (7,)    text: is
id: (8,)    text: that
id: (9,)    text: Metz
id: (10,)   text: worked
id: (11,)   text: for
id: (12,)   text: both
id: (13,)   text: Northrop
id: (14,)   text: and
id: (15,)   text: Lockheed
id: (16,)   text: Martin
id: (17,)   text: in
id: (18,)   text: New
id: (19,)   text: York
id: (20,)   text: City
id: (21,)   text: and
id: (22,)   text: is
id: (23,)   text: not
id: (24,)   text: known
id: (25,)   text: for
id: (26,)   text: hyperbole
id: (27,)   text: .
====== Sentence 3 tokens =======
id: (1,)    text: Yet
id: (2,)    text: even
id: (3,)    text: after
id: (4,)    text: flying
id: (5,)    text: the
id: (6,)    text: pre-production
id: (7,)    text: F
id: (8,)    text: -
id: (9,)    text: 22
id: (10,)   text: ,
id: (11,)   text: a
id: (12,)   text: far
id: (13,)   text: more
id: (14,)   text: mature
id: (15,)   text: machine
id: (16,)   text: than
id: (17,)   text: the
id: (18,)   text: YF
id: (19,)   text: -
id: (20,)   text: 23
id: (21,)   text: ever
id: (22,)   text: was
id: (23,)   text: ,
id: (24,)   text: he
id: (25,)   text: makes
id: (26,)   text: it
id: (27,)   text: quite
id: (28,)   text: clear
id: (29,)   text: that
id: (30, 31)    text: Northrop's
id: (32,)   text: offering
id: (33,)   text: was
id: (34,)   text: on
id: (35,)   text: par
id: (36,)   text: with
id: (37, 38)    text: Lockheed's
id: (39,)   text: ,
id: (40,)   text: if
id: (41,)   text: not
id: (42,)   text: superior
id: (43,)   text: .
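Note that the token ids above are tuples: a token whose id covers more than one number, such as (1, 2) for "What's", is a multi-word token that the mwt processor expands into several syntactic words. The following minimal sketch, independent of Stanza, shows how such id tuples can be used to pick out contractions from (id, text) pairs like those printed above:

```python
# (id, text) pairs copied from the Sentence 2 output above.
tokens = [
    ((1, 2), "What's"), ((3,), "so"), ((4,), "important"), ((5,), "to"),
    ((6,), "underline"), ((7,), "is"), ((8,), "that"), ((9,), "Metz"),
]

def multiword_tokens(tokens):
    """Return the surface forms of tokens that expand into several words."""
    return [text for ids, text in tokens if len(ids) > 1]

print(multiword_tokens(tokens))  # ["What's"]
```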
print(*[f'word: {word.text}\tupos: {word.upos}\txpos: {word.xpos}\tfeats: {word.feats if word.feats else "_"}' for sent in doc.sentences for word in sent.words], sep='\n')
word: The   upos: DET   xpos: DT   feats: Definite=Def|PronType=Art
word: pilot upos: NOUN  xpos: NN   feats: Number=Sing
word: had   upos: VERB  xpos: VBD  feats: Mood=Ind|Number=Sing|Person=3|Tense=Past|VerbForm=Fin
word: arthritis upos: NOUN  xpos: NN   feats: Number=Sing
word: . upos: PUNCT xpos: .    feats: _
word: What  upos: PRON  xpos: WP   feats: PronType=Int
word: 's    upos: AUX   xpos: VBZ  feats: Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin
word: so    upos: ADV   xpos: RB   feats: _
word: important upos: ADJ   xpos: JJ   feats: Degree=Pos
word: to    upos: PART  xpos: TO   feats: _
word: underline upos: VERB  xpos: VB   feats: VerbForm=Inf
word: is    upos: AUX   xpos: VBZ  feats: Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin
word: that  upos: SCONJ xpos: IN   feats: _
word: Metz  upos: PROPN xpos: NNP  feats: Number=Sing
word: worked    upos: VERB  xpos: VBD  feats: Mood=Ind|Number=Sing|Person=3|Tense=Past|VerbForm=Fin
word: for   upos: ADP   xpos: IN   feats: _
word: both  upos: CCONJ xpos: CC   feats: _
word: Northrop  upos: PROPN xpos: NNP  feats: Number=Sing
word: and   upos: CCONJ xpos: CC   feats: _
word: Lockheed  upos: PROPN xpos: NNP  feats: Number=Sing
word: Martin    upos: PROPN xpos: NNP  feats: Number=Sing
word: in    upos: ADP   xpos: IN   feats: _
word: New   upos: ADJ   xpos: NNP  feats: Degree=Pos
word: York  upos: PROPN xpos: NNP  feats: Number=Sing
word: City  upos: PROPN xpos: NNP  feats: Number=Sing
word: and   upos: CCONJ xpos: CC   feats: _
word: is    upos: AUX   xpos: VBZ  feats: Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin
word: not   upos: PART  xpos: RB   feats: _
word: known upos: VERB  xpos: VBN  feats: Tense=Past|VerbForm=Part|Voice=Pass
word: for   upos: ADP   xpos: IN   feats: _
word: hyperbole upos: NOUN  xpos: NN   feats: Number=Sing
word: . upos: PUNCT xpos: .    feats: _
word: Yet   upos: CCONJ xpos: CC   feats: _
word: even  upos: ADV   xpos: RB   feats: _
word: after upos: SCONJ xpos: IN   feats: _
word: flying    upos: VERB  xpos: VBG  feats: VerbForm=Ger
word: the   upos: DET   xpos: DT   feats: Definite=Def|PronType=Art
word: pre-production    upos: NOUN  xpos: NN   feats: Number=Sing
word: F upos: PROPN xpos: NNP  feats: Number=Sing
word: - upos: PUNCT xpos: HYPH feats: _
word: 22    upos: NUM   xpos: CD   feats: NumForm=Digit|NumType=Card
word: , upos: PUNCT xpos: ,    feats: _
word: a upos: DET   xpos: DT   feats: Definite=Ind|PronType=Art
word: far   upos: ADV   xpos: RB   feats: Degree=Pos
word: more  upos: ADV   xpos: RBR  feats: Degree=Cmp
word: mature    upos: ADJ   xpos: JJ   feats: Degree=Pos
word: machine   upos: NOUN  xpos: NN   feats: Number=Sing
word: than  upos: ADP   xpos: IN   feats: _
word: the   upos: DET   xpos: DT   feats: Definite=Def|PronType=Art
word: YF    upos: PROPN xpos: NNP  feats: Number=Sing
word: - upos: PUNCT xpos: HYPH feats: _
word: 23    upos: NUM   xpos: CD   feats: NumForm=Digit|NumType=Card
word: ever  upos: ADV   xpos: RB   feats: _
word: was   upos: AUX   xpos: VBD  feats: Mood=Ind|Number=Sing|Person=3|Tense=Past|VerbForm=Fin
word: , upos: PUNCT xpos: ,    feats: _
word: he    upos: PRON  xpos: PRP  feats: Case=Nom|Gender=Masc|Number=Sing|Person=3|PronType=Prs
word: makes upos: VERB  xpos: VBZ  feats: Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin
word: it    upos: PRON  xpos: PRP  feats: Case=Acc|Gender=Neut|Number=Sing|Person=3|PronType=Prs
word: quite upos: ADV   xpos: RB   feats: _
word: clear upos: ADJ   xpos: JJ   feats: Degree=Pos
word: that  upos: SCONJ xpos: IN   feats: _
word: Northrop  upos: PROPN xpos: NNP  feats: Number=Sing
word: 's    upos: PART  xpos: POS  feats: _
word: offering  upos: NOUN  xpos: NN   feats: Number=Sing
word: was   upos: AUX   xpos: VBD  feats: Mood=Ind|Number=Sing|Person=3|Tense=Past|VerbForm=Fin
word: on    upos: ADP   xpos: IN   feats: _
word: par   upos: NOUN  xpos: NN   feats: Number=Sing
word: with  upos: ADP   xpos: IN   feats: _
word: Lockheed  upos: PROPN xpos: NNP  feats: Number=Sing
word: 's    upos: PART  xpos: POS  feats: _
word: , upos: PUNCT xpos: ,    feats: _
word: if    upos: SCONJ xpos: IN   feats: _
word: not   upos: PART  xpos: RB   feats: _
word: superior  upos: ADJ   xpos: JJ   feats: Degree=Pos
word: . upos: PUNCT xpos: .    feats: _
print(*[f'word: {word.text+" "}\tlemma: {word.lemma}' for sent in doc.sentences for word in sent.words], sep='\n')
word: The   lemma: the
word: pilot lemma: pilot
word: had   lemma: have
word: arthritis lemma: arthritis
word: . lemma: .
word: What  lemma: what
word: 's    lemma: be
word: so    lemma: so
word: important lemma: important
word: to    lemma: to
word: underline lemma: underline
word: is    lemma: be
word: that  lemma: that
word: Metz  lemma: Metz
word: worked    lemma: work
word: for   lemma: for
word: both  lemma: both
word: Northrop  lemma: Northrop
word: and   lemma: and
word: Lockheed  lemma: Lockheed
word: Martin    lemma: Martin
word: in    lemma: in
word: New   lemma: New
word: York  lemma: York
word: City  lemma: City
word: and   lemma: and
word: is    lemma: be
word: not   lemma: not
word: known lemma: know
word: for   lemma: for
word: hyperbole lemma: hyperbole
word: . lemma: .
word: Yet   lemma: yet
word: even  lemma: even
word: after lemma: after
word: flying    lemma: fly
word: the   lemma: the
word: pre-production    lemma: pre-production
word: F lemma: F
word: - lemma: -
word: 22    lemma: 22
word: , lemma: ,
word: a lemma: a
word: far   lemma: far
word: more  lemma: more
word: mature    lemma: mature
word: machine   lemma: machine
word: than  lemma: than
word: the   lemma: the
word: YF    lemma: YF
word: - lemma: -
word: 23    lemma: 23
word: ever  lemma: ever
word: was   lemma: be
word: , lemma: ,
word: he    lemma: he
word: makes lemma: make
word: it    lemma: it
word: quite lemma: quite
word: clear lemma: clear
word: that  lemma: that
word: Northrop  lemma: Northrop
word: 's    lemma: 's
word: offering  lemma: offering
word: was   lemma: be
word: on    lemma: on
word: par   lemma: par
word: with  lemma: with
word: Lockheed  lemma: Lockheed
word: 's    lemma: 's
word: , lemma: ,
word: if    lemma: if
word: not   lemma: not
word: superior  lemma: superior
word: . lemma: .
for sentence in doc.sentences:
    print(sentence.constituency)
(ROOT (S (NP (DT The) (NN pilot)) (VP (VBD had) (NP (NN arthritis))) (. .)))
(ROOT (S (SBAR (WHNP (WP What)) (S (VP (VBZ 's) (ADJP (RB so) (JJ important) (SBAR (S (VP (TO to) (VP (VB underline))))))))) (VP (VBZ is) (SBAR (IN that) (S (NP (NNP Metz)) (VP (VP (VBD worked) (PP (IN for) (NP (CC both) (NP (NNP Northrop)) (CC and) (NP (NNP Lockheed) (NNP Martin)))) (PP (IN in) (NP (NML (NNP New) (NNP York)) (NNP City)))) (CC and) (VP (VBZ is) (RB not) (VP (VBN known) (PP (IN for) (NP (NN hyperbole))))))))) (. .)))
(ROOT (S (CC Yet) (PP (ADVP (RB even)) (IN after) (S (VP (VBG flying) (NP (DT the) (NN pre-production) (NNP F) (HYPH -) (CD 22))))) (, ,) (NP (NP (DT a) (ADJP (ADVP (RB far) (RBR more)) (JJ mature)) (NN machine)) (PP (IN than) (NP (DT the) (NNP YF) (HYPH -) (CD 23)))) (ADVP (RB ever)) (VP (VBD was)) (, ,) (NP (NP (PRP he))) (VP (VBZ makes) (S (NP (NP (PRP it))) (ADJP (RB quite) (JJ clear)) (SBAR (IN that) (S (NP (NP (NNP Northrop) (POS 's)) (NN offering)) (VP (VBD was) (PP (IN on) (NP (NN par))) (PP (IN with) (ADJP (NP (NNP Lockheed) (POS 's)) (, ,) (SBAR (IN if) (FRAG (RB not) (ADJP (JJ superior))))))))))) (. .)))
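The constituency processor prints each parse as a Penn-Treebank-style bracketing (Stanza also exposes the tree programmatically via the constituency object). As an illustration of how to work with such bracketings, here is a small, Stanza-independent sketch that parses the printed string into nested lists and extracts the word spans of all constituents with a given label:

```python
def parse_tree(s):
    """Parse a Penn-Treebank-style bracketing into nested [label, child, ...] lists."""
    tokens = s.replace("(", " ( ").replace(")", " ) ").split()
    def read(pos):
        assert tokens[pos] == "("
        node = [tokens[pos + 1]]  # constituent label
        pos += 2
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                child, pos = read(pos)
                node.append(child)
            else:
                node.append(tokens[pos])  # leaf: the word itself
                pos += 1
        return node, pos + 1
    tree, _ = read(0)
    return tree

def leaves(node):
    """Collect the words under a node, left to right."""
    if isinstance(node, str):
        return [node]
    return [w for child in node[1:] for w in leaves(child)]

def constituents(node, label):
    """Yield the word spans of all constituents with the given label."""
    if isinstance(node, str):
        return
    if node[0] == label:
        yield " ".join(leaves(node))
    for child in node[1:]:
        yield from constituents(child, label)

tree = parse_tree("(ROOT (S (NP (DT The) (NN pilot)) (VP (VBD had) (NP (NN arthritis))) (. .)))")
print(list(constituents(tree, "NP")))  # ['The pilot', 'arthritis']
```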
print(*[f'entity: {ent.text}\ttype: {ent.type}' for ent in doc.ents], sep='\n')
entity: arthritis type: DISEASE
print(*[f'token: {token.text}\tner: {token.ner}' for sent in doc.sentences for token in sent.tokens], sep='\n')
token: The  ner: O
token: pilot    ner: O
token: had  ner: O
token: arthritis    ner: S-DISEASE
token: .    ner: O
token: What ner: O
token: 's   ner: O
token: so   ner: O
token: important    ner: O
token: to   ner: O
token: underline    ner: O
token: is   ner: O
token: that ner: O
token: Metz ner: S-ORG
token: worked   ner: O
token: for  ner: O
token: both ner: O
token: Northrop ner: S-ORG
token: and  ner: O
token: Lockheed ner: B-ORG
token: Martin   ner: E-ORG
token: in   ner: O
token: New  ner: B-GPE
token: York ner: I-GPE
token: City ner: E-GPE
token: and  ner: O
token: is   ner: O
token: not  ner: O
token: known    ner: O
token: for  ner: O
token: hyperbole    ner: O
token: .    ner: O
token: Yet  ner: O
token: even ner: O
token: after    ner: O
token: flying   ner: O
token: the  ner: O
token: pre-production   ner: O
token: F    ner: B-PRODUCT
token: -    ner: I-PRODUCT
token: 22   ner: E-PRODUCT
token: ,    ner: O
token: a    ner: O
token: far  ner: O
token: more ner: O
token: mature   ner: O
token: machine  ner: O
token: than ner: O
token: the  ner: B-PRODUCT
token: YF   ner: I-PRODUCT
token: -    ner: I-PRODUCT
token: 23   ner: E-PRODUCT
token: ever ner: O
token: was  ner: O
token: ,    ner: O
token: he   ner: O
token: makes    ner: O
token: it   ner: O
token: quite    ner: O
token: clear    ner: O
token: that ner: O
token: Northrop ner: S-ORG
token: 's   ner: O
token: offering ner: O
token: was  ner: O
token: on   ner: O
token: par  ner: O
token: with ner: O
token: Lockheed ner: S-ORG
token: 's   ner: O
token: ,    ner: O
token: if   ner: O
token: not  ner: O
token: superior ner: O
token: .    ner: O
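The token-level annotations use the BIOES scheme: S marks a single-token entity, B/I/E mark the beginning, inside, and end of a multi-token entity, and O marks tokens outside any entity. A minimal sketch, independent of Stanza, of decoding such tag sequences into entity spans (joining tokens with single spaces is a simplification; in practice one would use character offsets):

```python
def bioes_to_entities(tagged_tokens):
    """Decode (token, BIOES-tag) pairs into (entity_text, type) spans."""
    entities, current = [], []
    for text, tag in tagged_tokens:
        if tag == "O":
            current = []
            continue
        prefix, etype = tag.split("-", 1)
        if prefix == "S":        # single-token entity
            entities.append((text, etype))
        elif prefix == "B":      # beginning of a multi-token entity
            current = [text]
        elif prefix == "I":      # inside the entity
            current.append(text)
        elif prefix == "E":      # end: emit the accumulated span
            current.append(text)
            entities.append((" ".join(current), etype))
            current = []
    return entities

# (token, tag) pairs copied from the output above.
tags = [("New", "B-GPE"), ("York", "I-GPE"), ("City", "E-GPE"),
        ("arthritis", "S-DISEASE"), (".", "O")]
print(bioes_to_entities(tags))  # [('New York City', 'GPE'), ('arthritis', 'DISEASE')]
```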
for i, sentence in enumerate(doc.sentences):
    print("%d -> %d" % (i, sentence.sentiment))
0 -> 0
1 -> 2
2 -> 0
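The sentiment processor assigns each sentence an integer score: 0 for negative, 1 for neutral, and 2 for positive. A small sketch mapping the scores printed above to readable labels:

```python
SENTIMENT_LABELS = {0: "negative", 1: "neutral", 2: "positive"}

def label_sentiments(scores):
    """Map Stanza's integer sentence-sentiment scores to readable labels."""
    return [SENTIMENT_LABELS[s] for s in scores]

# Scores copied from the output above: sentences 1 and 3 negative, sentence 2 positive.
print(label_sentiments([0, 2, 0]))  # ['negative', 'positive', 'negative']
```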
stanza.download(lang="multilingual")
stanza.download(lang="en")
# stanza.download(lang="fr")
stanza.download(lang="de")
Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.5.0.json: 0%| …
2023-09-20 17:34:37 INFO: Downloading default packages for language: multilingual (multilingual) ...
2023-09-20 17:34:37 INFO: File exists: C:\Users\damir\stanza_resources\multilingual\default.zip
2023-09-20 17:34:37 INFO: Finished downloading models and saved to C:\Users\damir\stanza_resources.
Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.5.0.json: 0%| …
2023-09-20 17:34:38 INFO: Downloading default packages for language: en (English) ...
2023-09-20 17:34:38 INFO: File exists: C:\Users\damir\stanza_resources\en\default.zip
2023-09-20 17:34:42 INFO: Finished downloading models and saved to C:\Users\damir\stanza_resources.
Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.5.0.json: 0%| …
2023-09-20 17:34:42 INFO: Downloading default packages for language: de (German) ...
2023-09-20 17:34:43 INFO: File exists: C:\Users\damir\stanza_resources\de\default.zip
2023-09-20 17:34:47 INFO: Finished downloading models and saved to C:\Users\damir\stanza_resources.
nlp = Pipeline(lang="multilingual", processors="langid")
docs = ["Hello world.", "Hallo, Welt!"]
docs = [Document([], text=text) for text in docs]
nlp(docs)
print("\n".join(f"{doc.text}\t{doc.lang}" for doc in docs))
2023-09-20 17:36:07 INFO: Checking for updates to resources.json in case models have been updated. Note: this behavior can be turned off with download_method=None or download_method=DownloadMethod.REUSE_RESOURCES
Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.5.0.json: 0%| …
2023-09-20 17:36:07 INFO: Loading these models for language: multilingual ():
=======================
| Processor | Package |
-----------------------
| langid    | ud      |
=======================
2023-09-20 17:36:07 INFO: Using device: cuda
2023-09-20 17:36:07 INFO: Loading: langid
2023-09-20 17:36:07 INFO: Done loading processors!
Hello world.    en
Hallo, Welt!    it
Note that the short German input is misidentified as Italian here; language identification is unreliable on very short strings.
I wrote the following function to convert Stanza's dependency tree data structure into a Displacy-compatible format, so that dependency trees can be rendered with spaCy's excellent visualizer:
def get_stanza_dep_displacy_manual(doc):
    """Convert a Stanza Document into the manual data format expected by displacy."""
    res = []
    for sentence in doc.sentences:
        words = [{"text": w.text, "tag": w.upos} for w in sentence.words]
        arcs = []
        for w in sentence.words:
            # The root word has no incoming arc to draw.
            if w.deprel == "root":
                continue
            start = w.head - 1
            end = w.id - 1
            # displacy requires start < end; the "dir" flag encodes where the head is.
            if start < end:
                arcs.append({"start": start, "end": end, "label": w.deprel, "dir": "right"})
            else:
                arcs.append({"start": end, "end": start, "label": w.deprel, "dir": "left"})
        res.append({"words": words, "arcs": arcs})
    return res
Note that nlp currently refers to the multilingual language-identification pipeline, so we first re-create an English pipeline that includes the dependency parser before parsing the example sentence:
nlp = stanza.Pipeline('en', processors='tokenize,mwt,pos,lemma,depparse', download_method="reuse_resources")
doc = nlp("John loves to read books and Mary newspapers.")
We can now generate the spaCy-compatible data format from the dependency tree to be able to visualize it:
res = get_stanza_dep_displacy_manual(doc)
The rendering can be achieved using the Displacy call:
displacy.render(res, style="dep", manual=True, options={"compact":False, "distance":110})
Finally, the annotated document can be exported in CoNLL-U format:
from stanza.utils.conll import CoNLL
CoNLL.write_doc2conll(doc, "output.conllu")
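The exported file uses the standard 10-column CoNLL-U layout, one word per line with blank lines separating sentences. As a minimal, hedged sketch (independent of Stanza's own CoNLL utilities, and ignoring multi-word token ranges), such a file can be read back like this:

```python
def read_conllu(text):
    """Parse CoNLL-U text into sentences of {field: value} dicts (10-column format)."""
    fields = ["id", "form", "lemma", "upos", "xpos", "feats",
              "head", "deprel", "deps", "misc"]
    sentences, current = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:                      # blank line: sentence boundary
            if current:
                sentences.append(current)
                current = []
        elif not line.startswith("#"):    # skip comment lines
            current.append(dict(zip(fields, line.split("\t"))))
    if current:
        sentences.append(current)
    return sentences

# A short sample in the format write_doc2conll produces (annotations are illustrative).
sample = "# text = The pilot had arthritis.\n" \
         "1\tThe\tthe\tDET\tDT\tDefinite=Def|PronType=Art\t2\tdet\t_\t_\n" \
         "2\tpilot\tpilot\tNOUN\tNN\tNumber=Sing\t3\tnsubj\t_\t_\n"
sents = read_conllu(sample)
print(sents[0][0]["form"], sents[0][1]["deprel"])  # The nsubj
```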
(C) 2023-2024 by Damir Cavar <dcavar@iu.edu>