
spaCy Tutorial¶

Version: 1.8, January 2024

Download: This and various other Jupyter notebooks are available from my GitHub repo.

This is a tutorial related to the L665 course on Machine Learning for NLP focusing on Deep Learning, Spring 2018, and L645 Advanced Natural Language Processing in Fall 2023 at Indiana University. The following tutorial assumes that you are using a newer distribution of Python 3.x and spaCy 3.5 or newer.

Requirements¶

The following code examples presuppose a running Python 3.x environment with Jupyter Lab and spaCy installed.

To install spaCy follow the instructions on the Install spaCy page.

In [ ]:

!pip install -U pip setuptools wheel

The following installation of spaCy is ideal for my environment, i.e., using a GPU and CUDA 12.x. See the spaCy homepage for detailed installation instructions.

In [ ]:

!pip install -U 'spacy[cuda12x,transformers,lookups,ja]'

Once spaCy is installed, install the language models using the following commands.

For the small English model:

python -m spacy download en_core_web_sm

For the medium English language model:

python -m spacy download en_core_web_md

For the large English language model:

python -m spacy download en_core_web_lg

For the small Spanish language model:

python -m spacy download es_core_news_sm

In [ ]:

!python -m spacy download en_core_web_sm
!python -m spacy download en_core_web_md
!python -m spacy download en_core_web_lg
!python -m spacy download es_core_news_sm

Introduction to spaCy¶

Follow the instructions on the spaCy homepage about installation of the module and language models. Your local spaCy module is correctly installed if the following command is successful:

In [1]:

import spacy

We can load the Spanish NLP pipeline in the following way:

In [2]:

nlp = spacy.load("es_core_news_sm")


Tokenization¶

In [3]:

doc = nlp(u'Como estas? Estoy bien.')
for token in doc:
    print(token.text, token.lemma_)

Como como
estas este
? ?
Estoy estar
bien bien
. .

Part-of-Speech Tagging¶

We can tokenize the text and part-of-speech tag the individual tokens using the following code:

In [4]:

doc = nlp(u'Como estas? Estoy bien.')

for token in doc:
    print("\t".join( (token.text, str(token.idx), token.lemma_, token.pos_, token.tag_, token.dep_,
          token.shape_, str(token.is_alpha), str(token.is_stop) )))

Como	0	como	SCONJ	SCONJ	mark	Xxxx	True	True
estas	5	este	DET	DET	ROOT	xxxx	True	True
?	10	?	PUNCT	PUNCT	punct	?	False	False
Estoy	12	estar	AUX	AUX	cop	Xxxxx	True	True
bien	18	bien	ADV	ADV	ROOT	xxxx	True	True
.	22	.	PUNCT	PUNCT	punct	.	False	False

The above output contains one token per line, with the following tab-separated columns: the token itself, its character offset in the text, the lemma, the coarse-grained Part-of-Speech tag, the fine-grained tag, the dependency label, the orthographic shape (upper and lower case characters as X or x respectively), the boolean for the token consisting of alphabetic characters only, and the boolean for it being a stopword.
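To make the shape column more concrete, here is a minimal plain-Python sketch of how such an orthographic shape can be computed. It is an approximation written for illustration, not spaCy's own implementation: the function name word_shape_sketch and the cap of four repeated shape characters (max_run) are assumptions, chosen so that the result agrees with the output above, where estas is shown as xxxx.

```python
def word_shape_sketch(text, max_run=4):
    """Map a token to an orthographic shape (hypothetical helper, not spaCy's API)."""
    shape = []
    last = ""
    run = 0
    for char in text:
        if char.isalpha():
            shape_char = "X" if char.isupper() else "x"
        elif char.isdigit():
            shape_char = "d"
        else:
            shape_char = char  # punctuation and symbols are kept as they are
        if shape_char == last:
            run += 1
        else:
            run = 1
            last = shape_char
        # Assumption: runs of the same shape character are cut off after max_run.
        if run <= max_run:
            shape.append(shape_char)
    return "".join(shape)


shapes = {w: word_shape_sketch(w) for w in ["Como", "estas", "?", "Estoy", "2018"]}
print(shapes)
```

The cap on repeated characters keeps long words from producing arbitrarily long shapes, so that words of different lengths can share one shape.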

Dependency Parse¶

Using the same approach as above for PoS-tags, we can print the Dependency Parse relations:

In [5]:

for token in doc:
    print(token.text, token.dep_, token.head.text, token.head.pos_,
          [child for child in token.children])

Como mark estas DET []
estas ROOT estas DET [Como, ?]
? punct estas DET []
Estoy cop bien ADV []
bien ROOT bien ADV [Estoy, .]
. punct bien ADV []

As specified in the code, each line represents one token. The token is printed in the first column, followed by its dependency relation to its head, the head token in the third column, the Part-of-Speech tag of the head, and the list of the token's children.
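The relation between heads and children can be illustrated without spaCy. The following sketch copies the tokens and their heads from the output above into two plain lists (the list names and the helper children_from_heads are made up for this illustration) and rebuilds the children lists from the head pointers alone. As in the output, a root token is its own head.

```python
# Tokens and the index of each token's head, copied from the output above.
tokens = ["Como", "estas", "?", "Estoy", "bien", "."]
heads = [1, 1, 1, 4, 4, 4]


def children_from_heads(tokens, heads):
    """Collect, for every token index, the tokens that point to it as head."""
    children = {i: [] for i in range(len(tokens))}
    for i, head in enumerate(heads):
        if head != i:  # a root points to itself and is not its own child
            children[head].append(tokens[i])
    return children


children = children_from_heads(tokens, heads)
roots = [tokens[i] for i, head in enumerate(heads) if head == i]

for i, token in enumerate(tokens):
    print(token, tokens[heads[i]], children[i])
print("roots:", roots)
```

Two roots appear because the input consists of two sentences, each with its own dependency tree.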

Named Entity Recognition¶

Similarly to PoS-tags and Dependency Parse Relations, we can print out Named Entity labels:

In [6]:

nlp = spacy.load("en_core_web_lg")

In [8]:

text = "John Lee Hooker loves Ali Hassan Kuban when driving on the highway."
doc = nlp(text)
for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

John Lee Hooker 0 15 PERSON
Ali Hassan Kuban 22 38 PERSON

We can extend the input with some more entities:

In [9]:

doc = nlp(u'Ali Hassan Kuban said that Apple Inc. from California will buy Google in May 2018.')

The corresponding NE-labels are:

In [10]:

for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

Ali Hassan Kuban 0 16 PERSON
Apple Inc. 27 37 ORG
California 43 53 GPE
Google 63 69 ORG
May 2018 73 81 DATE
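The start_char and end_char values are character offsets into the input string, with the end offset being exclusive. A quick way to check this is to slice the text with the offsets from the output above. The sketch below uses plain Python only; the list name entities is an assumption for this illustration.

```python
text = ("Ali Hassan Kuban said that Apple Inc. from California "
        "will buy Google in May 2018.")

# (start_char, end_char, label) triples copied from the output above.
entities = [
    (0, 16, "PERSON"),
    (27, 37, "ORG"),
    (43, 53, "GPE"),
    (63, 69, "ORG"),
    (73, 81, "DATE"),
]

# Slicing with the offsets recovers the surface string of each entity.
spans = [(text[start:end], label) for start, end, label in entities]
for surface, label in spans:
    print(surface, label)
```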

Pattern Matching in spaCy¶

You can define patterns in spaCy and generate a label (here HelloWorld) whenever a pattern matches in some text, using the spaCy Matcher class. In the code below we print out the match ID, the label, the start and end token offsets of the match, and the matched text.

In [ ]:

from spacy.matcher import Matcher

matcher = Matcher(nlp.vocab)
pattern = [{'LOWER': 'hello'}, {'IS_PUNCT': True}, {'LOWER': 'world'}]
matcher.add('HelloWorld', [pattern])

doc = nlp(u'Hello, world! Hello... world!')
matches = matcher(doc)
for match_id, start, end in matches:
    string_id = nlp.vocab.strings[match_id]  # Get string representation
    span = doc[start:end]  # The matched span
    print(match_id, string_id, start, end, span.text)
print("-" * 50)
doc = nlp(u'Hello, world! Hello world!')
matches = matcher(doc)
for match_id, start, end in matches:
    string_id = nlp.vocab.strings[match_id]  # Get string representation
    span = doc[start:end]  # The matched span
    print(match_id, string_id, start, end, span.text)
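To see why the pattern matches with a comma or an ellipsis between the two words but not without any punctuation, the following plain-Python sketch imitates the idea of a token pattern. It is not spaCy's Matcher: the regular-expression tokenizer, the predicate list PATTERN, and the function names are assumptions made for this illustration, and only fixed-length patterns like the one above are covered.

```python
import re
import unicodedata


def tokenize_sketch(text):
    # Crude stand-in for a tokenizer: words, an ellipsis, or single symbols.
    return re.findall(r"\w+|\.{3}|[^\w\s]", text)


def is_punct(token):
    # True if every character belongs to a Unicode punctuation category.
    return all(unicodedata.category(ch).startswith("P") for ch in token)


# One predicate per token position, mirroring LOWER / IS_PUNCT / LOWER.
PATTERN = [
    lambda tok: tok.lower() == "hello",
    is_punct,
    lambda tok: tok.lower() == "world",
]


def match_sketch(tokens, pattern):
    """Return (start, end) token offsets of every window that fits the pattern."""
    hits = []
    for start in range(len(tokens) - len(pattern) + 1):
        window = tokens[start:start + len(pattern)]
        if all(test(tok) for test, tok in zip(pattern, window)):
            hits.append((start, start + len(pattern)))
    return hits


first = match_sketch(tokenize_sketch("Hello, world! Hello... world!"), PATTERN)
second = match_sketch(tokenize_sketch("Hello, world! Hello world!"), PATTERN)
print(first)
print(second)
```

In the second text the sequence Hello world has no punctuation token in the middle position, so only the first occurrence is reported.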

What spaCy is Missing¶

From the linguistic standpoint, when looking at the analytical output of the NLP pipeline in spaCy, there are some important components missing:

Clause boundary detection
Anaphora resolution (partially solved in the Coreference modules)
Temporal reference resolution
...

There are add-on modules that provide annotations for additional linguistic levels, as for example:

Constituent structure trees (scope relations over constituents and phrases)
Coreference analysis

You can find various such addons in the spaCy Universe.

Clause Boundary Detection¶

Complex sentences consist of clauses. For precise processing of semantic properties of natural language utterances we need to segment the sentences into clauses. The following sentence:

The man said that the woman claimed that the child broke the toy.

can be broken into the following clauses:

Matrix clause: [ the man said ]
Embedded clause: [ that the woman claimed ]
Embedded clause: [ that the child broke the toy ]

These clauses do not form an ordered list or flat sequence; they are in fact hierarchically organized. The matrix clause verb selects as its complement an embedded finite clause with the complementizer that. The embedded predicate claimed selects the same kind of clausal complement. We express this hierarchical relation in the form of embedding in tree representations:

[ the man said [ that the woman claimed [ that the child broke the toy ] ] ]

Or using a graphical representation in form of a tree:

The hierarchical relation of sub-clauses is relevant when it comes to semantics. The clause John sold his car can be interpreted as an assertion that describes an event with John as the agent, and the car as the object of a selling event in the past. If the clause is embedded under a matrix clause that contains a sentential negation, the proposition is assumed to NOT be true: [ Mary did not say that [ John sold his car ] ]

It is possible with additional effort to translate the Dependency Trees into clauses and reconstruct the clause hierarchy into a relevant form or data structure. SpaCy does not offer a direct data output of such relations.
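As an illustration of what such a data structure could look like, the sketch below reads the bracketed notation used above into nested Python lists and computes the depth of clause embedding. It does not derive clauses from a dependency parse; the function names parse_brackets and depth are made up for this example, and the input is the hand-bracketed sentence from above.

```python
def parse_brackets(text):
    """Turn '[ a [ b ] ]' style notation into nested lists of words."""
    tokens = text.replace("[", " [ ").replace("]", " ] ").split()
    root = []
    stack = [root]
    for tok in tokens:
        if tok == "[":
            node = []
            stack[-1].append(node)  # attach the new clause to its parent
            stack.append(node)
        elif tok == "]":
            if len(stack) == 1:
                raise ValueError("unbalanced closing bracket")
            stack.pop()
        else:
            stack[-1].append(tok)
    if len(stack) != 1:
        raise ValueError("unbalanced opening bracket")
    return root


def depth(node):
    """Number of nested clause levels, counting the clause itself as 1."""
    return 1 + max((depth(c) for c in node if isinstance(c, list)), default=0)


sentence = ("[ the man said [ that the woman claimed "
            "[ that the child broke the toy ] ] ]")
tree = parse_brackets(sentence)[0]
print(tree)
print(depth(tree))
```

Each embedded clause ends up as the last element of the clause that selects it, which preserves the hierarchy that a flat list of clauses would lose.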

One problem still remains, and this is clausal discontinuities. None of the common NLP pipelines, and spaCy in particular, can deal with any kind of discontinuities in any reasonable way. Discontinuities can be observed when syntactic structures are split over the clause or sentence, or elements occur in a non-canonical position, as in the following example:

Which car did John claim that Mary took?

The embedded clause consists of the sequence [ Mary took which car ]. One part of the sequence appears dislocated and precedes the matrix clause in the above example. Simple Dependency Parsers cannot generate any reasonable output that makes it easy to identify and reconstruct the relations of clausal elements in these structures.

Constituent Structure Trees¶

Dependency Parse trees are a simplification of relations of elements in the clause. They ignore structural and hierarchical relations in a sentence or clause, as shown in the examples above. Instead the Dependency Parse trees show simple functional relations in the sense of sentential functions like subject or object of a verb.

SpaCy does not output any kind of constituent structure and more detailed relational properties of phrases and more complex structural units in a sentence or clause.

Since many semantic properties are defined in terms of structural relations and hierarchies, that is, scope relations, they are more complicated to reconstruct or map from the Dependency Parse trees.

Anaphora Resolution¶

SpaCy does not offer any anaphora resolution annotation. That is, the referent of a pronoun, as in the following examples, is not annotated in the resulting linguistic data structure:

John saw him.
John said that he saw the house.
Tim sold his house. He moved to Paris.
John saw himself in the mirror.

Knowing the restrictions on pronominal binding (in English, for example), we can partially generate the potential or most likely anaphor-antecedent relations. This, however, is not part of the spaCy output.

One problem, however, is that spaCy does not provide parse trees of the constituent structure and clausal hierarchies, which is crucial for the correct analysis of pronominal anaphoric relations.

Coreference Analysis¶

Some NLP pipelines are capable of providing coreference analyses for constituents in clauses. For example, the following sentences should be analyzed as talking about the same person:

The CEO of Apple, Tim Cook, decided to apply for a job at Google. Cook said that he is not satisfied with the quality of the iPhones anymore. He prefers the Pixel 2.

The constituents [ the CEO of Apple, Tim Cook ] in the first sentence, [ Cook ] in the second sentence, and [ he ] in the third, should all be tagged as referencing the same entity, that is the one mentioned in the first sentence. SpaCy does not provide such a level of analysis or annotation.

Temporal Reference¶

For various analysis levels it is essential to identify the time references in a sentence or utterance, for example the time the utterance is made or the time the described event happened.

Certain tenses are expressed as periphrastic constructions, including auxiliaries and main verbs. SpaCy does not provide the relevant information to identify these constructions and tenses.

Using the Dependency Parse Visualizer¶

Vectors¶

To use vectors in spaCy, you might consider installing the larger models for the particular language. The common module and language packages only come with the small models. The larger models can be installed as described on the spaCy vectors page:

python -m spacy download en_core_web_lg

The large model en_core_web_lg contains more than 1 million unique vectors.

Let us restart all necessary modules again, in particular spaCy:

In [1]:

import spacy

We can now load the English NLP pipeline to process a word list. Since the small models in spaCy only include context-sensitive tensors, we should use the downloaded large model for better word vectors. We load the large model as follows:

In [2]:

nlp = spacy.load('en_core_web_lg')


We can process a list of words by the pipeline using the nlp object:

In [3]:

tokens = nlp(u'dog poodle beagle cat banana apple')

As described in the spaCy chapter Word Vectors and Semantic Similarity, Doc, Span, and Token objects provide a similarity() method, which returns a similarity score between two such objects, here between words:

In [4]:

for token1 in tokens:
    # print(token1.vector)
    for token2 in tokens:
        print(token1, token2, token1.similarity(token2))

dog dog 1.0
dog poodle 0.6339901089668274
dog beagle 0.5964534282684326
dog cat 0.8220817446708679
dog banana 0.2090904861688614
dog apple 0.22881002724170685
poodle dog 0.6339901089668274
poodle poodle 1.0
poodle beagle 0.6217650771141052
poodle cat 0.6388018131256104
poodle banana 0.2899792790412903
poodle apple 0.237016960978508
beagle dog 0.5964534282684326
beagle poodle 0.6217650771141052
beagle beagle 1.0
beagle cat 0.5943629145622253
beagle banana 0.10636148601770401
beagle apple 0.1200629323720932
cat dog 0.8220817446708679
cat poodle 0.6388018131256104
cat beagle 0.5943629145622253
cat cat 1.0
cat banana 0.2235882729291916
cat apple 0.20368057489395142
banana dog 0.2090904861688614
banana poodle 0.2899792790412903
banana beagle 0.10636148601770401
banana cat 0.2235882729291916
banana banana 1.0
banana apple 0.6646701097488403
apple dog 0.22881002724170685
apple poodle 0.237016960978508
apple beagle 0.1200629323720932
apple cat 0.20368057489395142
apple banana 0.6646701097488403
apple apple 1.0
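By default, similarity() is based on the cosine of the angle between the two vectors. The following plain-Python sketch shows that computation on made-up three-dimensional vectors. The values in toy_vectors are hypothetical and are not taken from any spaCy model, and returning 0.0 for a zero vector is a choice made here to avoid a division by zero.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equally long vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # assumption: treat a missing (all-zero) vector as dissimilar
    return dot / (norm_u * norm_v)


# Hypothetical toy vectors, for illustration only.
toy_vectors = {
    "dog": [0.9, 0.1, 0.0],
    "cat": [0.8, 0.3, 0.0],
    "banana": [0.0, 0.2, 0.9],
}

for w1 in toy_vectors:
    for w2 in toy_vectors:
        print(w1, w2, cosine_similarity(toy_vectors[w1], toy_vectors[w2]))
```

Because the measure depends only on the angle, it is symmetric and equals 1.0 for a vector compared with itself, which is the pattern visible in the output above.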

We can access the vectors of these objects using the vector attribute:

In [5]:

tokens = nlp(u'dog cat banana grungle')

for token in tokens:
    print(token.text, token.has_vector, token.vector_norm, token.is_oov)

dog True 75.254234 False
cat True 63.188496 False
banana True 31.620354 False
grungle False 0.0 True

The attribute has_vector returns a boolean depending on whether the token has a vector in the model or not. The token grungle has no vector. It is also out-of-vocabulary (OOV), as the fourth column shows. Thus, it also has a norm of $0$, that is, it has a length of $0$.

Here the token vector has $300$ dimensions. We can print out the vector for a token:

In [6]:

n = 0
print(tokens[n].text, len(tokens[n].vector), tokens[n].vector)

dog 300 [ 1.2330e+00  4.2963e+00 -7.9738e+00 -1.0121e+01  1.8207e+00  1.4098e+00
 -4.5180e+00 -5.2261e+00 -2.9157e-01  9.5234e-01  6.9880e+00  5.0637e+00
 -5.5726e-03  3.3395e+00  6.4596e+00 -6.3742e+00  3.9045e-02 -3.9855e+00
  1.2085e+00 -1.3186e+00 -4.8886e+00  3.7066e+00 -2.8281e+00 -3.5447e+00
  7.6888e-01  1.5016e+00 -4.3632e+00  8.6480e+00 -5.9286e+00 -1.3055e+00
  8.3870e-01  9.0137e-01 -1.7843e+00 -1.0148e+00  2.7300e+00 -6.9039e+00
  8.0413e-01  7.4880e+00  6.1078e+00 -4.2130e+00 -1.5384e-01 -5.4995e+00
  1.0896e+01  3.9278e+00 -1.3601e-01  7.7732e-02  3.2218e+00 -5.8777e+00
  6.1359e-01 -2.4287e+00  6.2820e+00  1.3461e+01  4.3236e+00  2.4266e+00
 -2.6512e+00  1.1577e+00  5.0848e+00 -1.7058e+00  3.3824e+00  3.2850e+00
  1.0969e+00 -8.3711e+00 -1.5554e+00  2.0296e+00 -2.6796e+00 -6.9195e+00
 -2.3386e+00 -1.9916e+00 -3.0450e+00  2.4890e+00  7.3247e+00  1.3364e+00
  2.3828e-01  8.4388e-02  3.1480e+00 -1.1128e+00 -3.5598e+00 -1.2115e-01
 -2.0357e+00 -3.2731e+00 -7.7205e+00  4.0948e+00 -2.0732e+00  2.0833e+00
 -2.2803e+00 -4.9850e+00  9.7667e+00  6.1779e+00 -1.0352e+01 -2.2268e+00
  2.5765e+00 -5.7440e+00  5.5564e+00 -5.2735e+00  3.0004e+00 -4.2512e+00
 -1.5682e+00  2.2698e+00  1.0491e+00 -9.0486e+00  4.2936e+00  1.8709e+00
  5.1985e+00 -1.3153e+00  6.5224e+00  4.0113e-01 -1.2583e+01  3.6534e+00
 -2.0961e+00  1.0022e+00 -1.7873e+00 -4.2555e+00  7.7471e+00  1.0173e+00
  3.1626e+00  2.3558e+00  3.3589e-01 -4.4178e+00  5.0584e+00 -2.4118e+00
 -2.7445e+00  3.4170e+00 -1.1574e+01 -2.6568e+00 -3.6933e+00 -2.0398e+00
  5.0976e+00  6.5249e+00  3.3573e+00  9.5334e-01 -9.4430e-01 -9.4395e+00
  2.7867e+00 -1.7549e+00  1.7287e+00  3.4942e+00 -1.6883e+00 -3.5771e+00
 -1.9013e+00  2.2239e+00 -5.4335e+00 -6.5724e+00 -6.7228e-01 -1.9748e+00
 -3.1080e+00 -1.8570e+00  9.9496e-01  8.9135e-01 -4.4254e+00  3.3125e-01
  5.8815e+00  1.9384e+00  5.7294e-01 -2.8830e+00  3.8087e+00 -1.3095e+00
  5.9208e+00  3.3620e+00  3.3571e+00 -3.8807e-01  9.0022e-01 -5.5742e+00
 -4.2939e+00  1.4992e+00 -4.7080e+00 -2.9402e+00 -1.2259e+00  3.0980e-01
  1.8858e+00 -1.9867e+00 -2.3554e-01 -5.4535e-01 -2.1387e-01  2.4797e+00
  5.9710e+00 -7.1249e+00  1.6257e+00 -1.5241e+00  7.5974e-01  1.4312e+00
  2.3641e+00 -3.5566e+00  9.2066e-01  4.4934e-01 -1.3233e+00  3.1733e+00
 -4.7059e+00 -1.2090e+01 -3.9241e-01 -6.8457e-01 -3.6789e+00  6.6279e+00
 -2.9937e+00 -3.8361e+00  1.3868e+00 -4.9002e+00 -2.4299e+00  6.4312e+00
  2.5056e+00 -4.5080e+00 -5.1278e+00 -1.5585e+00 -3.0226e+00 -8.6811e-01
 -1.1538e+00 -1.0022e+00 -9.1651e-01 -4.7810e-01 -1.6084e+00 -2.7307e+00
  3.7080e+00  7.7423e-01 -1.1085e+00 -6.8755e-01 -8.2901e+00  3.2405e+00
 -1.6108e-01 -6.2837e-01 -5.5960e+00 -4.4865e+00  4.0115e-01 -3.7063e+00
 -2.1704e+00  4.0789e+00 -1.7973e+00  8.9538e+00  8.9421e-01 -4.8128e+00
  4.5367e+00 -3.2579e-01 -5.2344e+00 -3.9766e+00 -2.1979e+00  3.5699e+00
  1.4982e+00  6.0972e+00 -1.9704e+00  4.6522e+00 -3.7734e-01  3.9101e-02
  2.5361e+00 -1.8096e+00  8.7035e+00 -8.6372e+00 -3.5257e+00  3.1034e+00
  3.2635e+00  4.5437e+00 -5.7290e+00 -2.9141e-01 -2.0011e+00  8.5328e+00
 -4.5064e+00 -4.8276e+00 -1.1786e+01  3.5607e-01 -5.7115e+00  6.3122e+00
 -3.6650e+00  3.3597e-01  2.5017e+00 -3.5025e+00 -3.7891e+00 -3.1343e+00
 -1.4429e+00 -6.9119e+00 -2.6114e+00 -5.9757e-01  3.7847e-01  6.3187e+00
  2.8965e+00 -2.5397e+00  1.8022e+00  3.5486e+00  4.4721e+00 -4.8481e+00
 -3.6252e+00  4.0969e+00 -2.0081e+00 -2.0122e-01  2.5244e+00 -6.8817e-01
  6.7184e-01 -7.0466e+00  1.6641e+00 -2.2308e+00 -3.8960e+00  6.1320e+00
 -8.0335e+00 -1.7130e+00  2.5688e+00 -5.2547e+00  6.9845e+00  2.7835e-01
 -6.4554e+00 -2.1327e+00 -5.6515e+00  1.1174e+01 -8.0568e+00  5.7985e+00]

Here is another example of similarities for some famous words:

In [7]:

tokens = nlp(u'queen king chef')

for token1 in tokens:
    for token2 in tokens:
        print(token1, token2, token1.similarity(token2))

queen queen 1.0
queen king 0.6108841896057129
queen chef 0.13113069534301758
king queen 0.6108841896057129
king king 1.0
king chef 0.04403642565011978
chef queen 0.13113069534301758
chef king 0.04403642565011978
chef chef 1.0

Similarities in Context¶

In spaCy, the parsing, tagging, and NER models make use of vector representations of contexts that represent the meaning of words. The meaning of a text is represented as an array of floats, i.e., a tensor, computed during pipeline processing. With this approach, words that have not been seen before can still be typed or classified. SpaCy uses a 4-layer convolutional network to compute these tensors, so that each tensor models a context of four words to the left and right of any given word.

Let us use the example from the spaCy documentation and check the word labrador:

In [8]:

tokens = nlp(u'labrador')

for token in tokens:
    print(token.text, token.has_vector, token.vector_norm, token.is_oov)

labrador True 22.03589 False

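We can now test for the context. Note that the code below compares each sentence as a whole (the Doc object) with dog, not just the labrador token: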

In [9]:

doc1 = nlp(u"The labrador barked.")
doc2 = nlp(u"The labrador swam.")
doc3 = nlp(u"The people on Labrador are Canadians.")

dog = nlp(u"dog")

count = 0
for doc in [doc1, doc2, doc3]:
    lab = doc
    count += 1
    print(str(count) + ":", lab.similarity(dog))

1: 0.09123279947734134
2: 0.0818719127377742
3: 0.09721566052157429

Using this strategy we can compute document or text similarities as well:

In [10]:

docs = ( nlp(u"Paris is the largest city in France."),
        nlp(u"Vilnius is the capital of Lithuania."),
        nlp(u"An emu is a large bird.") )

for x in range(len(docs)):
    zset = set(range(len(docs)))
    zset.remove(x)
    for y in zset:
        print(x, y, docs[x].similarity(docs[y]))

0 1 0.8596882830672081
0 2 0.5688490403558649
1 0 0.8596882830672081
1 2 0.6276001607674082
2 0 0.5688490403558649
2 1 0.6276001607674082
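For models with static word vectors, the vector of a Doc defaults to the average of its token vectors. One consequence is that the order of the words has no influence on the document vector. The sketch below demonstrates this with plain Python and made-up two-dimensional vectors; toy_vectors, doc_vector, and cosine_similarity are hypothetical names for this illustration and the numbers are not taken from a spaCy model.

```python
import math

# Hypothetical toy vectors, for illustration only.
toy_vectors = {
    "dog": [1.0, 0.0],
    "bites": [0.0, 1.0],
    "man": [0.5, 0.5],
}


def doc_vector(words):
    """Average the vectors of all words, ignoring their order."""
    dims = len(next(iter(toy_vectors.values())))
    total = [0.0] * dims
    for word in words:
        for i, value in enumerate(toy_vectors[word]):
            total[i] += value
    return [value / len(words) for value in total]


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


a = doc_vector("dog bites man".split())
b = doc_vector("man bites dog".split())
print(a, b, cosine_similarity(a, b))
```

Averaging is a bag-of-words operation, so sentences that contain the same words in a different order receive (up to rounding) the same vector.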

We can vary the word order in sentences and compare them:

In [11]:

docs = [nlp(u"dog bites man"), nlp(u"man bites dog"),
        nlp(u"man dog bites"), nlp(u"cat eats mouse")]

for doc in docs:
    for other_doc in docs:
        print('"' + doc.text + '"', '"' + other_doc.text + '"', doc.similarity(other_doc))

"dog bites man" "dog bites man" 1.0
"dog bites man" "man bites dog" 0.9999999483653454
"dog bites man" "man dog bites" 0.9999998947030624
"dog bites man" "cat eats mouse" 0.6909646217296318
"man bites dog" "dog bites man" 0.9999999483653454
"man bites dog" "man bites dog" 1.0
"man bites dog" "man dog bites" 1.000000012821272
"man bites dog" "cat eats mouse" 0.69096462081283
"man dog bites" "dog bites man" 0.9999998947030624
"man dog bites" "man bites dog" 1.000000012821272
"man dog bites" "man dog bites" 1.0
"man dog bites" "cat eats mouse" 0.690964625000244
"cat eats mouse" "dog bites man" 0.6909646217296318
"cat eats mouse" "man bites dog" 0.69096462081283
"cat eats mouse" "man dog bites" 0.690964625000244
"cat eats mouse" "cat eats mouse" 1.0

Custom Models¶

Optimization¶

In [20]:

nlp = spacy.load('en_core_web_lg')

Training Models¶

This example code for training an NER model is based on the training example in spaCy.

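We will import some components from the __future__ module; under Python 3 these imports have no effect and are kept only for compatibility with the original example. Read its documentation here.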

In [1]:

from __future__ import unicode_literals, print_function

We import the random module for pseudo-random number generation:

In [2]:

import random

We import the Path object from the pathlib module:

In [3]:

from pathlib import Path

We import spaCy:

In [4]:

import spacy

We also import the minibatch and compounding functions from spacy.util, and the Example class from spacy.training.example:

In [5]:

from spacy.util import minibatch, compounding
from spacy.training.example import Example
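For orientation, the behavior of the two helpers can be approximated in plain Python. The sketch below is not spaCy's code: compounding_sketch and minibatch_sketch are hypothetical names, and a compound rate of 2.0 is used only to make the growth visible in a few steps. The idea is that the batch size starts at a small value, is multiplied by a constant factor at every step, and is capped at a maximum.

```python
import itertools


def compounding_sketch(start, stop, compound):
    """Yield start, start*compound, start*compound**2, ... capped at stop."""
    value = start
    while True:
        yield min(value, stop)  # assumes start <= stop
        value *= compound


def minibatch_sketch(items, sizes):
    """Cut items into consecutive batches whose sizes come from an iterator."""
    iterator = iter(items)
    for size in sizes:
        batch = list(itertools.islice(iterator, int(size)))
        if not batch:
            return  # no items left
        yield batch


first_sizes = list(itertools.islice(compounding_sketch(4.0, 32.0, 2.0), 5))
batches = list(minibatch_sketch(range(10), compounding_sketch(4.0, 32.0, 2.0)))
print(first_sizes)
print(batches)
```

With a data set of only two training examples and a starting size of 4, every pass over the data consists of a single batch that contains both examples.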

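The training data is formatted as a JSON-like Python list of (text, annotations) pairs, where each entity is given by its start and end character offsets and a label: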

In [6]:

TRAIN_DATA = [
    ("Who is Shaka Khan?", {"entities": [(7, 17, "PERSON")]}),
    ("I like London and Berlin.", {"entities": [(7, 13, "LOC"), (18, 24, "LOC")]}),
]
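Since wrong character offsets are a common source of trouble in annotated training data, it can help to check them before training. The following is a minimal plain-Python sketch of such a check; the function check_entities and its three rules (offsets inside the text, no surrounding whitespace, no overlap) are assumptions for this illustration and not part of spaCy.

```python
TRAIN_DATA = [
    ("Who is Shaka Khan?", {"entities": [(7, 17, "PERSON")]}),
    ("I like London and Berlin.", {"entities": [(7, 13, "LOC"), (18, 24, "LOC")]}),
]


def check_entities(text, entities):
    """Return a list of human-readable problems found in the annotations."""
    problems = []
    previous_end = 0
    for start, end, label in sorted(entities):
        if not (0 <= start < end <= len(text)):
            problems.append(f"{label} ({start}, {end}): offsets outside the text")
            continue
        span = text[start:end]
        if span != span.strip():
            problems.append(f"{label} {span!r}: leading or trailing whitespace")
        if start < previous_end:
            problems.append(f"{label} {span!r}: overlaps a previous entity")
        previous_end = max(previous_end, end)
    return problems


report = {text: check_entities(text, ann["entities"]) for text, ann in TRAIN_DATA}
surface = [[text[s:e] for s, e, _ in ann["entities"]] for text, ann in TRAIN_DATA]
print(report)
print(surface)
```

For the two examples above the report is empty, and slicing the texts with the offsets yields exactly the annotated names.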

We create a blank 'xx' (multi-language) model and add an NER component to it:

In [7]:

nlp = spacy.blank("xx")  # create blank Language class
ner = nlp.add_pipe("ner", last=True)


We add the named entity labels to the NER model:

In [8]:

for _, annotations in TRAIN_DATA:
    for ent in annotations.get("entities"):
        ner.add_label(ent[2])

Assuming that the model is empty and untrained, we reset and initialize the weights randomly using the call below. In spaCy 3, begin_training() is a deprecated alias of nlp.initialize():

In [9]:

nlp.begin_training()

Out[9]:

<thinc.optimizers.Optimizer at 0x768613f1a020>

We would not do this if the model is supposed to be tuned or retrained on new data.

We get all pipe-names in the model that are not our NER related pipes to disable them during training:

In [10]:

pipe_exceptions = ["ner", "trf_wordpiecer", "trf_tok2vec"]
other_pipes = [pipe for pipe in nlp.pipe_names if pipe not in pipe_exceptions]

We can now disable the other pipes and train just the NER using 100 iterations:

In [11]:

with nlp.disable_pipes(*other_pipes):  # only train NER
    for itn in range(100):
        random.shuffle(TRAIN_DATA)
        losses = {}
        # batch up the examples using spaCy's minibatch
        batches = minibatch(TRAIN_DATA, size=compounding(4.0, 32.0, 1.001))
        for batch in batches:
            for text, annotations in batch:
                print(text)
                print(annotations)
                doc = nlp.make_doc(text)
                example = Example.from_dict(doc, annotations)
                nlp.update([example],
                    drop=0.5,  # dropout - make it harder to memorise data
                    losses=losses,
                )
        print("Losses", losses)

Who is Shaka Khan?
{'entities': [(7, 17, 'PERSON')]}
I like London and Berlin.
{'entities': [(7, 13, 'LOC'), (18, 24, 'LOC')]}
Losses {'ner': 9.804199874401093}
Who is Shaka Khan?
{'entities': [(7, 17, 'PERSON')]}
I like London and Berlin.
{'entities': [(7, 13, 'LOC'), (18, 24, 'LOC')]}
Losses {'ner': 9.378168404102325}
I like London and Berlin.
{'entities': [(7, 13, 'LOC'), (18, 24, 'LOC')]}
Who is Shaka Khan?
{'entities': [(7, 17, 'PERSON')]}
Losses {'ner': 8.878702640533447}
I like London and Berlin.
{'entities': [(7, 13, 'LOC'), (18, 24, 'LOC')]}
Who is Shaka Khan?
{'entities': [(7, 17, 'PERSON')]}
Losses {'ner': 8.084787487983704}
I like London and Berlin.
{'entities': [(7, 13, 'LOC'), (18, 24, 'LOC')]}
Who is Shaka Khan?
{'entities': [(7, 17, 'PERSON')]}
Losses {'ner': 7.684804081916809}
Who is Shaka Khan?
{'entities': [(7, 17, 'PERSON')]}
I like London and Berlin.
{'entities': [(7, 13, 'LOC'), (18, 24, 'LOC')]}
Losses {'ner': 6.719165354967117}
Who is Shaka Khan?
{'entities': [(7, 17, 'PERSON')]}
I like London and Berlin.
{'entities': [(7, 13, 'LOC'), (18, 24, 'LOC')]}
Losses {'ner': 6.1549230217933655}
Who is Shaka Khan?
{'entities': [(7, 17, 'PERSON')]}
I like London and Berlin.
{'entities': [(7, 13, 'LOC'), (18, 24, 'LOC')]}
Losses {'ner': 5.926921933889389}
Who is Shaka Khan?
{'entities': [(7, 17, 'PERSON')]}
I like London and Berlin.
{'entities': [(7, 13, 'LOC'), (18, 24, 'LOC')]}
Losses {'ner': 5.114250376820564}
Who is Shaka Khan?
{'entities': [(7, 17, 'PERSON')]}
I like London and Berlin.
{'entities': [(7, 13, 'LOC'), (18, 24, 'LOC')]}
Losses {'ner': 3.905261367559433}
Who is Shaka Khan?
{'entities': [(7, 17, 'PERSON')]}
I like London and Berlin.
{'entities': [(7, 13, 'LOC'), (18, 24, 'LOC')]}
Losses {'ner': 4.803224857896566}
I like London and Berlin.
{'entities': [(7, 13, 'LOC'), (18, 24, 'LOC')]}
Who is Shaka Khan?
{'entities': [(7, 17, 'PERSON')]}
Losses {'ner': 4.071965433657169}
Who is Shaka Khan?
{'entities': [(7, 17, 'PERSON')]}
I like London and Berlin.
{'entities': [(7, 13, 'LOC'), (18, 24, 'LOC')]}
Losses {'ner': 4.124644186813384}
Who is Shaka Khan?
{'entities': [(7, 17, 'PERSON')]}
I like London and Berlin.
{'entities': [(7, 13, 'LOC'), (18, 24, 'LOC')]}
Losses {'ner': 4.419207601691596}
Who is Shaka Khan?
{'entities': [(7, 17, 'PERSON')]}
I like London and Berlin.
{'entities': [(7, 13, 'LOC'), (18, 24, 'LOC')]}
Losses {'ner': 4.069842653349042}
I like London and Berlin.
{'entities': [(7, 13, 'LOC'), (18, 24, 'LOC')]}
Who is Shaka Khan?
{'entities': [(7, 17, 'PERSON')]}
Losses {'ner': 2.900299648696091}
I like London and Berlin.
{'entities': [(7, 13, 'LOC'), (18, 24, 'LOC')]}
Who is Shaka Khan?
{'entities': [(7, 17, 'PERSON')]}
Losses {'ner': 6.496187977492809}
I like London and Berlin.
{'entities': [(7, 13, 'LOC'), (18, 24, 'LOC')]}
Who is Shaka Khan?
{'entities': [(7, 17, 'PERSON')]}
Losses {'ner': 3.7045014230534434}
Who is Shaka Khan?
{'entities': [(7, 17, 'PERSON')]}
I like London and Berlin.
{'entities': [(7, 13, 'LOC'), (18, 24, 'LOC')]}
Losses {'ner': 2.545258560916409}
Who is Shaka Khan?
{'entities': [(7, 17, 'PERSON')]}
I like London and Berlin.
{'entities': [(7, 13, 'LOC'), (18, 24, 'LOC')]}
Losses {'ner': 3.515323496016208}
Who is Shaka Khan?
{'entities': [(7, 17, 'PERSON')]}
I like London and Berlin.
{'entities': [(7, 13, 'LOC'), (18, 24, 'LOC')]}
Losses {'ner': 1.9346504728600848}
I like London and Berlin.
{'entities': [(7, 13, 'LOC'), (18, 24, 'LOC')]}
Who is Shaka Khan?
{'entities': [(7, 17, 'PERSON')]}
Losses {'ner': 3.617041664198041}
Who is Shaka Khan?
{'entities': [(7, 17, 'PERSON')]}
I like London and Berlin.
{'entities': [(7, 13, 'LOC'), (18, 24, 'LOC')]}
Losses {'ner': 3.0120007601799443}
I like London and Berlin.
{'entities': [(7, 13, 'LOC'), (18, 24, 'LOC')]}
Who is Shaka Khan?
{'entities': [(7, 17, 'PERSON')]}
Losses {'ner': 1.988534303032793}
Who is Shaka Khan?
{'entities': [(7, 17, 'PERSON')]}
I like London and Berlin.
{'entities': [(7, 13, 'LOC'), (18, 24, 'LOC')]}
Losses {'ner': 2.788861875771545}
I like London and Berlin.
{'entities': [(7, 13, 'LOC'), (18, 24, 'LOC')]}
Who is Shaka Khan?
{'entities': [(7, 17, 'PERSON')]}
Losses {'ner': 2.3091221836657496}
Who is Shaka Khan?
{'entities': [(7, 17, 'PERSON')]}
I like London and Berlin.
{'entities': [(7, 13, 'LOC'), (18, 24, 'LOC')]}
Losses {'ner': 1.9209051267089308}
I like London and Berlin.
{'entities': [(7, 13, 'LOC'), (18, 24, 'LOC')]}
Who is Shaka Khan?
{'entities': [(7, 17, 'PERSON')]}
Losses {'ner': 1.3401959295297274}
I like London and Berlin.
{'entities': [(7, 13, 'LOC'), (18, 24, 'LOC')]}
Who is Shaka Khan?
{'entities': [(7, 17, 'PERSON')]}
Losses {'ner': 1.1315552217129152}
I like London and Berlin.
{'entities': [(7, 13, 'LOC'), (18, 24, 'LOC')]}
Who is Shaka Khan?
{'entities': [(7, 17, 'PERSON')]}
Losses {'ner': 1.381677663710434}
Who is Shaka Khan?
{'entities': [(7, 17, 'PERSON')]}
I like London and Berlin.
{'entities': [(7, 13, 'LOC'), (18, 24, 'LOC')]}
Losses {'ner': 1.3113824253427993}
I like London and Berlin.
{'entities': [(7, 13, 'LOC'), (18, 24, 'LOC')]}
Who is Shaka Khan?
{'entities': [(7, 17, 'PERSON')]}
Losses {'ner': 0.45815791326629096}
I like London and Berlin.
{'entities': [(7, 13, 'LOC'), (18, 24, 'LOC')]}
Who is Shaka Khan?
{'entities': [(7, 17, 'PERSON')]}
Losses {'ner': 0.7744309550371327}
Who is Shaka Khan?
{'entities': [(7, 17, 'PERSON')]}
I like London and Berlin.
{'entities': [(7, 13, 'LOC'), (18, 24, 'LOC')]}
Losses {'ner': 0.4487278034382598}
I like London and Berlin.
{'entities': [(7, 13, 'LOC'), (18, 24, 'LOC')]}
Who is Shaka Khan?
{'entities': [(7, 17, 'PERSON')]}
Losses {'ner': 0.18023752846686225}
I like London and Berlin.
{'entities': [(7, 13, 'LOC'), (18, 24, 'LOC')]}
Who is Shaka Khan?
{'entities': [(7, 17, 'PERSON')]}
Losses {'ner': 0.8337155280485953}
I like London and Berlin.
{'entities': [(7, 13, 'LOC'), (18, 24, 'LOC')]}
Who is Shaka Khan?
{'entities': [(7, 17, 'PERSON')]}
Losses {'ner': 0.020622616767529134}
I like London and Berlin.
{'entities': [(7, 13, 'LOC'), (18, 24, 'LOC')]}
Who is Shaka Khan?
{'entities': [(7, 17, 'PERSON')]}
Losses {'ner': 0.006422741743957072}
Who is Shaka Khan?
{'entities': [(7, 17, 'PERSON')]}
I like London and Berlin.
{'entities': [(7, 13, 'LOC'), (18, 24, 'LOC')]}
Losses {'ner': 0.11949631332268508}
Who is Shaka Khan?
{'entities': [(7, 17, 'PERSON')]}
I like London and Berlin.
{'entities': [(7, 13, 'LOC'), (18, 24, 'LOC')]}
Losses {'ner': 0.3233942783861958}
I like London and Berlin.
{'entities': [(7, 13, 'LOC'), (18, 24, 'LOC')]}
Who is Shaka Khan?
{'entities': [(7, 17, 'PERSON')]}
Losses {'ner': 0.00031674511999167654}
Who is Shaka Khan?
{'entities': [(7, 17, 'PERSON')]}
I like London and Berlin.
{'entities': [(7, 13, 'LOC'), (18, 24, 'LOC')]}
Losses {'ner': 0.013451710974875561}
I like London and Berlin.
{'entities': [(7, 13, 'LOC'), (18, 24, 'LOC')]}
Who is Shaka Khan?
{'entities': [(7, 17, 'PERSON')]}
Losses {'ner': 0.006336543364760207}
I like London and Berlin.
{'entities': [(7, 13, 'LOC'), (18, 24, 'LOC')]}
Who is Shaka Khan?
{'entities': [(7, 17, 'PERSON')]}
Losses {'ner': 0.0008653662256715763}
Who is Shaka Khan?
{'entities': [(7, 17, 'PERSON')]}
I like London and Berlin.
{'entities': [(7, 13, 'LOC'), (18, 24, 'LOC')]}
Losses {'ner': 0.0005173218071663293}
I like London and Berlin.
{'entities': [(7, 13, 'LOC'), (18, 24, 'LOC')]}
Who is Shaka Khan?
{'entities': [(7, 17, 'PERSON')]}
Losses {'ner': 0.0004392343511747966}
I like London and Berlin.
{'entities': [(7, 13, 'LOC'), (18, 24, 'LOC')]}
Who is Shaka Khan?
{'entities': [(7, 17, 'PERSON')]}
Losses {'ner': 3.445272148226838e-07}
I like London and Berlin.
{'entities': [(7, 13, 'LOC'), (18, 24, 'LOC')]}
Who is Shaka Khan?
{'entities': [(7, 17, 'PERSON')]}
Losses {'ner': 0.00021716944827355634}
I like London and Berlin.
{'entities': [(7, 13, 'LOC'), (18, 24, 'LOC')]}
Who is Shaka Khan?
{'entities': [(7, 17, 'PERSON')]}
Losses {'ner': 6.8657545295985195e-06}
I like London and Berlin.
{'entities': [(7, 13, 'LOC'), (18, 24, 'LOC')]}
Who is Shaka Khan?
{'entities': [(7, 17, 'PERSON')]}
Losses {'ner': 3.807316376228079e-05}
Who is Shaka Khan?
{'entities': [(7, 17, 'PERSON')]}
I like London and Berlin.
{'entities': [(7, 13, 'LOC'), (18, 24, 'LOC')]}
Losses {'ner': 1.9374009751112753e-05}
Who is Shaka Khan?
{'entities': [(7, 17, 'PERSON')]}
I like London and Berlin.
{'entities': [(7, 13, 'LOC'), (18, 24, 'LOC')]}
Losses {'ner': 3.907229847021995e-06}
I like London and Berlin.
{'entities': [(7, 13, 'LOC'), (18, 24, 'LOC')]}
Who is Shaka Khan?
{'entities': [(7, 17, 'PERSON')]}
Losses {'ner': 2.4131223756608614e-07}
Who is Shaka Khan?
{'entities': [(7, 17, 'PERSON')]}
I like London and Berlin.
{'entities': [(7, 13, 'LOC'), (18, 24, 'LOC')]}
Losses {'ner': 1.0854577383638013e-05}
Who is Shaka Khan?
{'entities': [(7, 17, 'PERSON')]}
I like London and Berlin.
{'entities': [(7, 13, 'LOC'), (18, 24, 'LOC')]}
Losses {'ner': 6.739205577006435e-08}
Who is Shaka Khan?
{'entities': [(7, 17, 'PERSON')]}
I like London and Berlin.
{'entities': [(7, 13, 'LOC'), (18, 24, 'LOC')]}
Losses {'ner': 4.0725146180469517e-07}
Who is Shaka Khan?
{'entities': [(7, 17, 'PERSON')]}
I like London and Berlin.
{'entities': [(7, 13, 'LOC'), (18, 24, 'LOC')]}
Losses {'ner': 0.00022054431668037646}
I like London and Berlin.
{'entities': [(7, 13, 'LOC'), (18, 24, 'LOC')]}
Who is Shaka Khan?
{'entities': [(7, 17, 'PERSON')]}
Losses {'ner': 0.00012138565389001919}
I like London and Berlin.
{'entities': [(7, 13, 'LOC'), (18, 24, 'LOC')]}
Who is Shaka Khan?
{'entities': [(7, 17, 'PERSON')]}
Losses {'ner': 1.326270706712906e-06}
Who is Shaka Khan?
{'entities': [(7, 17, 'PERSON')]}
I like London and Berlin.
{'entities': [(7, 13, 'LOC'), (18, 24, 'LOC')]}
Losses {'ner': 4.293141829328753e-05}
Who is Shaka Khan?
{'entities': [(7, 17, 'PERSON')]}
I like London and Berlin.
{'entities': [(7, 13, 'LOC'), (18, 24, 'LOC')]}
Losses {'ner': 1.2039716128148807e-08}
I like London and Berlin.
{'entities': [(7, 13, 'LOC'), (18, 24, 'LOC')]}
Who is Shaka Khan?
{'entities': [(7, 17, 'PERSON')]}
Losses {'ner': 8.679873165927199e-08}
Who is Shaka Khan?
{'entities': [(7, 17, 'PERSON')]}
I like London and Berlin.
{'entities': [(7, 13, 'LOC'), (18, 24, 'LOC')]}
Losses {'ner': 1.0078720289877151e-07}
Who is Shaka Khan?
{'entities': [(7, 17, 'PERSON')]}
I like London and Berlin.
{'entities': [(7, 13, 'LOC'), (18, 24, 'LOC')]}
Losses {'ner': 0.07870026917200694}
Who is Shaka Khan?
{'entities': [(7, 17, 'PERSON')]}
I like London and Berlin.
{'entities': [(7, 13, 'LOC'), (18, 24, 'LOC')]}
Losses {'ner': 1.1056136560999869e-08}
Who is Shaka Khan?
{'entities': [(7, 17, 'PERSON')]}
I like London and Berlin.
{'entities': [(7, 13, 'LOC'), (18, 24, 'LOC')]}
Losses {'ner': 3.2413324670157183e-06}
Who is Shaka Khan?
{'entities': [(7, 17, 'PERSON')]}
I like London and Berlin.
{'entities': [(7, 13, 'LOC'), (18, 24, 'LOC')]}
Losses {'ner': 7.367572379416744e-06}
Who is Shaka Khan?
{'entities': [(7, 17, 'PERSON')]}
I like London and Berlin.
{'entities': [(7, 13, 'LOC'), (18, 24, 'LOC')]}
Losses {'ner': 1.1689998484037621e-05}
Who is Shaka Khan?
{'entities': [(7, 17, 'PERSON')]}
I like London and Berlin.
{'entities': [(7, 13, 'LOC'), (18, 24, 'LOC')]}
Losses {'ner': 2.8223324067675686e-06}
Who is Shaka Khan?
{'entities': [(7, 17, 'PERSON')]}
I like London and Berlin.
{'entities': [(7, 13, 'LOC'), (18, 24, 'LOC')]}
Losses {'ner': 1.2329731947213951e-07}
I like London and Berlin.
{'entities': [(7, 13, 'LOC'), (18, 24, 'LOC')]}
Who is Shaka Khan?
{'entities': [(7, 17, 'PERSON')]}
Losses {'ner': 1.7893614912219646e-07}
I like London and Berlin.
{'entities': [(7, 13, 'LOC'), (18, 24, 'LOC')]}
Who is Shaka Khan?
{'entities': [(7, 17, 'PERSON')]}
Losses {'ner': 0.0003141640662639505}
I like London and Berlin.
{'entities': [(7, 13, 'LOC'), (18, 24, 'LOC')]}
Who is Shaka Khan?
{'entities': [(7, 17, 'PERSON')]}
Losses {'ner': 9.306160481876498e-08}
Who is Shaka Khan?
{'entities': [(7, 17, 'PERSON')]}
I like London and Berlin.
{'entities': [(7, 13, 'LOC'), (18, 24, 'LOC')]}
Losses {'ner': 9.457749717680469e-06}
I like London and Berlin.
{'entities': [(7, 13, 'LOC'), (18, 24, 'LOC')]}
Who is Shaka Khan?
{'entities': [(7, 17, 'PERSON')]}
Losses {'ner': 9.575859847124382e-08}
Who is Shaka Khan?
{'entities': [(7, 17, 'PERSON')]}
I like London and Berlin.
{'entities': [(7, 13, 'LOC'), (18, 24, 'LOC')]}
Losses {'ner': 9.575427957434568e-06}
Who is Shaka Khan?
{'entities': [(7, 17, 'PERSON')]}
I like London and Berlin.
{'entities': [(7, 13, 'LOC'), (18, 24, 'LOC')]}
Losses {'ner': 2.4935574864731465e-08}
I like London and Berlin.
{'entities': [(7, 13, 'LOC'), (18, 24, 'LOC')]}
Who is Shaka Khan?
{'entities': [(7, 17, 'PERSON')]}
Losses {'ner': 1.5672836009821735e-06}
I like London and Berlin.
{'entities': [(7, 13, 'LOC'), (18, 24, 'LOC')]}
Who is Shaka Khan?
{'entities': [(7, 17, 'PERSON')]}
Losses {'ner': 3.52608805128802e-10}
I like London and Berlin.
{'entities': [(7, 13, 'LOC'), (18, 24, 'LOC')]}
Who is Shaka Khan?
{'entities': [(7, 17, 'PERSON')]}
Losses {'ner': 1.17968810970269e-09}
I like London and Berlin.
{'entities': [(7, 13, 'LOC'), (18, 24, 'LOC')]}
Who is Shaka Khan?
{'entities': [(7, 17, 'PERSON')]}
Losses {'ner': 5.371021146179102e-09}
I like London and Berlin.
{'entities': [(7, 13, 'LOC'), (18, 24, 'LOC')]}
Who is Shaka Khan?
{'entities': [(7, 17, 'PERSON')]}
Losses {'ner': 2.3592887209626158e-08}
Who is Shaka Khan?
{'entities': [(7, 17, 'PERSON')]}
I like London and Berlin.
{'entities': [(7, 13, 'LOC'), (18, 24, 'LOC')]}
Losses {'ner': 2.28903862167769e-06}
I like London and Berlin.
{'entities': [(7, 13, 'LOC'), (18, 24, 'LOC')]}
Who is Shaka Khan?
{'entities': [(7, 17, 'PERSON')]}
Losses {'ner': 2.9878317819723995e-07}
I like London and Berlin.
{'entities': [(7, 13, 'LOC'), (18, 24, 'LOC')]}
Who is Shaka Khan?
{'entities': [(7, 17, 'PERSON')]}
Losses {'ner': 4.793343483500154e-09}
Who is Shaka Khan?
{'entities': [(7, 17, 'PERSON')]}
I like London and Berlin.
{'entities': [(7, 13, 'LOC'), (18, 24, 'LOC')]}
Losses {'ner': 1.3447286281598351e-08}
I like London and Berlin.
{'entities': [(7, 13, 'LOC'), (18, 24, 'LOC')]}
Who is Shaka Khan?
{'entities': [(7, 17, 'PERSON')]}
Losses {'ner': 0.23371704747150776}
Who is Shaka Khan?
{'entities': [(7, 17, 'PERSON')]}
I like London and Berlin.
{'entities': [(7, 13, 'LOC'), (18, 24, 'LOC')]}
Losses {'ner': 0.003400509015195647}
I like London and Berlin.
{'entities': [(7, 13, 'LOC'), (18, 24, 'LOC')]}
Who is Shaka Khan?
{'entities': [(7, 17, 'PERSON')]}
Losses {'ner': 1.0186260824423491e-05}
I like London and Berlin.
{'entities': [(7, 13, 'LOC'), (18, 24, 'LOC')]}
Who is Shaka Khan?
{'entities': [(7, 17, 'PERSON')]}
Losses {'ner': 3.051644578424991e-07}
I like London and Berlin.
{'entities': [(7, 13, 'LOC'), (18, 24, 'LOC')]}
Who is Shaka Khan?
{'entities': [(7, 17, 'PERSON')]}
Losses {'ner': 2.3526982615854487e-07}
I like London and Berlin.
{'entities': [(7, 13, 'LOC'), (18, 24, 'LOC')]}
Who is Shaka Khan?
{'entities': [(7, 17, 'PERSON')]}
Losses {'ner': 3.3635482859874966e-06}
I like London and Berlin.
{'entities': [(7, 13, 'LOC'), (18, 24, 'LOC')]}
Who is Shaka Khan?
{'entities': [(7, 17, 'PERSON')]}
Losses {'ner': 4.304957177634687e-06}
Who is Shaka Khan?
{'entities': [(7, 17, 'PERSON')]}
I like London and Berlin.
{'entities': [(7, 13, 'LOC'), (18, 24, 'LOC')]}
Losses {'ner': 1.3826546434807992e-07}
Who is Shaka Khan?
{'entities': [(7, 17, 'PERSON')]}
I like London and Berlin.
{'entities': [(7, 13, 'LOC'), (18, 24, 'LOC')]}
Losses {'ner': 1.6475824441941653e-06}
Who is Shaka Khan?
{'entities': [(7, 17, 'PERSON')]}
I like London and Berlin.
{'entities': [(7, 13, 'LOC'), (18, 24, 'LOC')]}
Losses {'ner': 6.3103205019734994e-09}
Who is Shaka Khan?
{'entities': [(7, 17, 'PERSON')]}
I like London and Berlin.
{'entities': [(7, 13, 'LOC'), (18, 24, 'LOC')]}
Losses {'ner': 0.00010522956870933458}
I like London and Berlin.
{'entities': [(7, 13, 'LOC'), (18, 24, 'LOC')]}
Who is Shaka Khan?
{'entities': [(7, 17, 'PERSON')]}
Losses {'ner': 3.506134576097521e-10}
I like London and Berlin.
{'entities': [(7, 13, 'LOC'), (18, 24, 'LOC')]}
Who is Shaka Khan?
{'entities': [(7, 17, 'PERSON')]}
Losses {'ner': 1.479756073354846e-10}
Who is Shaka Khan?
{'entities': [(7, 17, 'PERSON')]}
I like London and Berlin.
{'entities': [(7, 13, 'LOC'), (18, 24, 'LOC')]}
Losses {'ner': 3.5580518043934527e-09}
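
Each annotation printed in the log above is a `(start, end, label)` triple of character offsets into the sentence, with an exclusive end index. A minimal sketch, using only the two sentences from the log and plain Python slicing (the names `examples` and `spans_for` are illustrative and not part of spaCy), shows how the offsets map back to the entity strings:

```python
# The two annotated sentences exactly as they appear in the training log.
examples = [
    ("Who is Shaka Khan?", {"entities": [(7, 17, "PERSON")]}),
    ("I like London and Berlin.", {"entities": [(7, 13, "LOC"), (18, 24, "LOC")]}),
]


def spans_for(text, annotations):
    """Return (surface string, label) pairs for each character-offset span."""
    return [(text[start:end], label) for start, end, label in annotations["entities"]]


for text, annotations in examples:
    print(spans_for(text, annotations))
```

Slicing the text this way is a quick sanity check on hand-written annotations before training: an offset that is off by one shows up as a truncated or padded entity string.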

We can test the trained model on the two training sentences:

In [12]:

for text, _ in TRAIN_DATA:
    doc = nlp(text)
    print("Entities", [(ent.text, ent.label_) for ent in doc.ents])
    print("Tokens", [(t.text, t.ent_type_, t.ent_iob) for t in doc])

Entities [('Shaka Khan', 'PERSON')]
Tokens [('Who', '', 2), ('is', '', 2), ('Shaka', 'PERSON', 3), ('Khan', 'PERSON', 1), ('?', '', 2)]
Entities [('London', 'LOC'), ('Berlin', 'LOC')]
Tokens [('I', '', 2), ('like', '', 2), ('London', 'LOC', 3), ('and', '', 2), ('Berlin', 'LOC', 3), ('.', '', 2)]
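
The integer in each token tuple is spaCy's IOB code: 3 marks the first token of an entity (B), 1 a token inside an entity (I), 2 a token outside any entity (O), and 0 means that no entity tag has been set. spaCy also exposes the string form directly as `Token.ent_iob_`. As a small sketch that needs no model, the tuples printed above can be decoded with a lookup table (the names `IOB_CODES` and `to_bio` are illustrative):

```python
# spaCy's integer IOB codes; 0 means no entity tag has been set.
IOB_CODES = {0: "", 1: "I", 2: "O", 3: "B"}


def to_bio(tokens):
    """Convert (text, ent_type, ent_iob) tuples into BIO-style tags."""
    tags = []
    for text, ent_type, ent_iob in tokens:
        prefix = IOB_CODES[ent_iob]
        # Attach the entity label only to tokens that are part of an entity.
        tags.append((text, f"{prefix}-{ent_type}" if ent_type else prefix))
    return tags


# Token tuples copied from the first sentence in the output above.
tokens = [("Who", "", 2), ("is", "", 2), ("Shaka", "PERSON", 3),
          ("Khan", "PERSON", 1), ("?", "", 2)]
print(to_bio(tokens))
```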

We set the output directory to a models folder inside the directory where the notebook is running:

In [33]:

from pathlib import Path  # harmless if Path was already imported earlier
output_dir = Path("./models/")

Save the model to the output directory:

In [34]:

# Create the directory (and any missing parents) if it does not exist yet
output_dir.mkdir(parents=True, exist_ok=True)
nlp.to_disk(output_dir)

To make sure everything worked, we can load the saved model and run it on the same sentences:

In [35]:

nlp2 = spacy.load(output_dir)
for text, _ in TRAIN_DATA:
    doc = nlp2(text)
    print("Entities", [(ent.text, ent.label_) for ent in doc.ents])
    print("Tokens", [(t.text, t.ent_type_, t.ent_iob) for t in doc])

Entities [('London', 'LOC'), ('Berlin', 'LOC')]
Tokens [('I', '', 2), ('like', '', 2), ('London', 'LOC', 3), ('and', '', 2), ('Berlin', 'LOC', 3), ('.', '', 2)]
Entities [('Shaka Khan', 'PERSON')]
Tokens [('Who', '', 2), ('is', '', 2), ('Shaka', 'PERSON', 3), ('Khan', 'PERSON', 1), ('?', '', 2)]
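
The reloaded pipeline lists the two sentences in the opposite order from the earlier test cell. This is presumably because TRAIN_DATA is shuffled in place during training and the cells were run at different times (an assumption, not something the output itself confirms). The predictions are otherwise identical, which can be checked independently of order. In the sketch below the two lists are copied by hand from the printed output, and `normalize` is an illustrative helper:

```python
# Entity output copied from the two test cells (before saving and after reloading).
before_save = [
    [("Shaka Khan", "PERSON")],
    [("London", "LOC"), ("Berlin", "LOC")],
]
after_reload = [
    [("London", "LOC"), ("Berlin", "LOC")],
    [("Shaka Khan", "PERSON")],
]


def normalize(results):
    """Make per-sentence entity lists comparable regardless of sentence order."""
    return sorted(tuple(entities) for entities in results)


print(normalize(before_save) == normalize(after_reload))
```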