At the center of spaCy is the object containing the processing pipeline. We usually call this variable "nlp".
For example, to create an English nlp object, you can import the English language class from spacy.lang.en and instantiate it. You can use the nlp object like a function to analyze text.
It contains all the different components in the pipeline.
It also includes language-specific rules used for tokenizing the text into words and punctuation. spaCy supports a variety of languages, which are available in spacy.lang. The code below uses the Portuguese class, spacy.lang.pt.Portuguese, in exactly the same way.
# Import the Portuguese language class
from spacy.lang.pt import Portuguese
# Create the nlp object
nlp = Portuguese()
nlp
<spacy.lang.pt.Portuguese at 0x1e1617712b0>
When you process a text with the nlp object, spaCy creates a Doc object – short for "document". The Doc lets you access information about the text in a structured way, and no information is lost.
The Doc also behaves like a normal Python sequence: you can iterate over its tokens or get a token by its index. But more on that later!
# Created by processing a string of text with the nlp object
doc = nlp("Olá mundo!")
# Iterate over tokens in a Doc
for token in doc:
    print(token.text)
Olá
mundo
!
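The sequence behavior described above can be checked with a tokenizer-only pipeline. This is a minimal sketch that assumes spacy.blank("en") is available, which needs no model download; the example sentence is made up for illustration.

```python
import spacy

# A blank English pipeline: tokenizer only, no statistical model required
nlp = spacy.blank("en")
doc = nlp("Hello world!")

# Sequence behavior: length, indexing and iteration
print(len(doc))        # number of tokens in the Doc
print(doc[0].text)     # first token
print(doc[-1].text)    # negative indices count from the end
token_texts = [token.text for token in doc]
print(token_texts)
```

Because the Doc supports len(), indexing and iteration, most idioms you know from Python lists carry over directly.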
Token objects represent the tokens in a document – for example, a word or a punctuation character.
To get a token at a specific position, you can index into the Doc.
Token objects also provide various attributes that let you access more information about the tokens. For example, the .text attribute returns the verbatim token text.
doc = nlp("Oi mundo!")
# Index into the Doc to get a single Token
token = doc[1]
# Get the token text via the .text attribute
print(token.text)
mundo
A Span object is a slice of the document consisting of one or more tokens. It's only a view of the Doc and doesn't contain any data itself.
To create a Span, you can use Python's slice notation. For example, doc[1:3] will create a slice starting from the token at position 1, up to – but not including! – the token at position 3.
doc = nlp("Oi mundo!")
# A slice from the Doc is a Span object
span = doc[1:3]
# Get the span text via the .text attribute
print(span.text)
mundo!
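To make the "view" idea more concrete, here is a small sketch with a blank English pipeline (an assumption, chosen so that no model is needed). It shows that the slice is a Span, that its start and end are token offsets into the parent Doc, and that it keeps a reference to that Doc.

```python
import spacy
from spacy.tokens import Span

nlp = spacy.blank("en")  # tokenizer-only pipeline, no model required
doc = nlp("Hello big world!")

# Slicing a Doc returns a Span, not a list of tokens
span = doc[1:3]
print(isinstance(span, Span))
print(span.text)               # text covered by tokens 1 and 2
print(span.start, span.end)    # token offsets into the parent Doc
print(span.doc is doc)         # the Span points back to the Doc it came from
```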
Here you can see some of the available token attributes:
"i" is the index of the token within the parent document.
"text" returns the token text.
"is_alpha", "is_punct" and "like_num" return boolean values indicating whether the token consists of alphabetic characters, whether it's punctuation, or whether it resembles a number, for example the token "10" or the word "ten".
These attributes are also called lexical attributes: they refer to the entry in the vocabulary and don't depend on the token's context.
doc = nlp("O ingresso custa $50.")
print('Index: ', [token.i for token in doc])
print('Text: ', [token.text for token in doc])
print('is_alpha:', [token.is_alpha for token in doc])
print('is_punct:', [token.is_punct for token in doc])
print('like_num:', [token.like_num for token in doc])
Index: [0, 1, 2, 3, 4, 5]
Text: ['O', 'ingresso', 'custa', '$', '50', '.']
is_alpha: [True, True, True, False, False, False]
is_punct: [False, False, False, False, False, True]
like_num: [False, False, False, False, True, False]
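The "10" versus "ten" case mentioned above can be tried in English as well. The sketch below assumes a blank English pipeline, which is enough because lexical attributes do not depend on a trained model; the sentence is invented for the example.

```python
import spacy

nlp = spacy.blank("en")  # lexical attributes need no statistical model
doc = nlp("I paid 10 dollars for ten apples.")

# Print the index, text and three lexical attributes for each token
for token in doc:
    print(token.i, token.text, token.is_alpha, token.is_punct, token.like_num)

# Collect every token that resembles a number, digits or spelled out
numbers = [token.text for token in doc if token.like_num]
print(numbers)
```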
Let’s get started and try out spaCy! In this exercise, you’ll be able to try out some of the 45+ available languages.
Part 1: English
Import the English class from spacy.lang.en and create the nlp object.
Create a doc and print its text.
# Import the Portuguese language class (used here in place of English)
from spacy.lang.pt import Portuguese
# Create the nlp object
nlp = Portuguese()
# Process a text
doc = nlp("Essa é uma frase.")
# Print the document text
print(doc.text)
Essa é uma frase.
When you call nlp on a string, spaCy first tokenizes the text and creates a document object. In this exercise, you’ll learn more about the Doc, as well as its views Token and Span.
Step 1
Import the English language class and create the nlp object.
Process the text and instantiate a Doc object in the variable doc.
Select the first token of the Doc and print its text.
Create a slice of the Doc for the tokens “tree kangaroos” and “tree kangaroos and narwhals”.
# Import the English language class and create the nlp object
from spacy.lang.en import English
nlp = English()
# Process the text
doc = nlp("I like tree kangaroos and narwhals.")
# Select the first token
first_token = doc[0]
# Print the first token's text
print(first_token.text)
I
Import the English language class and create the nlp object.
Process the text and instantiate a Doc object in the variable doc.
Create a slice of the Doc for the tokens “tree kangaroos” and “tree kangaroos and narwhals”.
# Import the English language class and create the nlp object
from spacy.lang.en import English
nlp = English()
# Process the text
doc = nlp("I like tree kangaroos and narwhals.")
# A slice of the Doc for "tree kangaroos"
tree_kangaroos = doc[2:4]
print(tree_kangaroos.text)
# A slice of the Doc for "tree kangaroos and narwhals" (without the ".")
tree_kangaroos_and_narwhals = doc[2:6]
print(tree_kangaroos_and_narwhals.text)
tree kangaroos
tree kangaroos and narwhals
In this example, you’ll use spaCy’s Doc and Token objects, and lexical attributes to find percentages in a text. You’ll be looking for two consecutive tokens: a number and a percent sign.
Use the like_num token attribute to check whether a token in the doc resembles a number.
Get the token following the current token in the document. The index of the next token in the doc is token.i + 1.
Check whether the next token’s text attribute is a percent sign “%”.
from spacy.lang.en import English
nlp = English()
# Process the text
doc = nlp(
    "In 1990, more than 60% of people in East Asia were in extreme poverty. "
    "Now less than 4% are."
)
# Iterate over the tokens in the doc
for token in doc:
    # Check if the token resembles a number and is not the last token
    if token.like_num and token.i + 1 < len(doc):
        # Get the next token in the document
        next_token = doc[token.i + 1]
        # Check if the next token's text equals '%'
        if next_token.text == "%":
            print("Percentage found:", token.text)
print('like_num:', [token.like_num for token in doc])
Percentage found: 60
Percentage found: 4
like_num: [False, True, False, False, False, True, False, False, False, False, False, False, False, False, False, False, False, False, False, False, True, False, False, False]
Some of the most interesting things you can analyze are context-specific: for example, whether a word is a verb or whether a span of text is a person name.
Statistical models enable spaCy to make predictions in context. This usually includes part-of-speech tags, syntactic dependencies and named entities.
Models are trained on large datasets of labeled example texts.
They can be updated with more examples to fine-tune their predictions – for example, to perform better on your specific data.
spaCy provides a number of pre-trained model packages you can download using the "spacy download" command. For example, the "en_core_web_sm" package is a small English model that supports all core capabilities and is trained on web text.
The spacy.load function loads a model package by name and returns an nlp object.
The package provides the binary weights that enable spaCy to make predictions.
It also includes the vocabulary, and meta information to tell spaCy which language class to use and how to configure the processing pipeline.
Visit: https://github.com/explosion/spacy-models/releases/
Remember, to download:
python -m spacy download pt_core_news_sm
python -m spacy download en_core_web_sm
import spacy
# Load the small English model
nlp = spacy.load('en_core_web_sm')
# Process a text
doc = nlp("She ate the pizza")
# Iterate over the tokens
for token in doc:
    # Print the text and the predicted part-of-speech tag
    print(token.text, token.pos_)
She PRON
ate VERB
the DET
pizza NOUN
import spacy
# Load the small Portuguese model
nlp = spacy.load('pt_core_news_sm')
# Process a text
doc = nlp("Ela comeu uma pizza enorme")
# Iterate over the tokens
for token in doc:
    # Print the text and the predicted part-of-speech tag
    print(token.text, token.pos_)
Ela PRON
comeu VERB
uma DET
pizza NOUN
enorme ADJ
Let's take a look at the model's predictions. In this example, we're using spaCy to predict part-of-speech tags, the word types in context.
First, we load the small English model and receive an nlp object.
Next, we're processing the text "She ate the pizza".
For each token in the Doc, we can print the text and the .pos_ attribute, the predicted part-of-speech tag.
In spaCy, attributes that return strings usually end with an underscore – attributes without the underscore return an ID.
Here, the model correctly predicted "ate" as a verb and "pizza" as a noun.
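The underscore convention can be seen without a trained model by using the .orth and .orth_ pair instead of .pos and .pos_. That substitution is an assumption made for this sketch: it lets the example run on a blank English pipeline, and the two pairs follow the same string-versus-ID convention.

```python
import spacy

nlp = spacy.blank("en")  # no model needed for the .orth / .orth_ pair
doc = nlp("She ate the pizza")
token = doc[1]

print(token.orth_)   # string form of the token text
print(token.orth)    # integer ID (hash) of the same string

# The shared StringStore converts between the two representations
print(nlp.vocab.strings[token.orth])
print(nlp.vocab.strings["ate"] == token.orth)
```

With a loaded model, token.pos and token.pos_ relate to each other in the same way.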
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("She ate the pizza")
for token in doc:
    print(token.text, token.pos_)
She PRON
ate VERB
the DET
pizza NOUN
import spacy
# Load the small Portuguese model
nlp = spacy.load('pt_core_news_sm')
# Process a text
doc = nlp("Posso comprar ingresso com cartao de débito")
# Iterate over the tokens
for token in doc:
    # Print the text and the predicted part-of-speech tag
    print(token.text, token.pos_)
Posso VERB
comprar VERB
ingresso NOUN
com ADP
cartao NOUN
de ADP
débito NOUN
# Process a text
doc = nlp("Ela comeu o macarrão com molho de tomate")
# Iterate over the tokens
for token in doc:
    # Print the text and the predicted part-of-speech tag
    print(token.text, token.pos_)
Ela PRON
comeu VERB
o DET
macarrão NOUN
com ADP
molho NOUN
de ADP
tomate NOUN
# Process a text
doc = nlp("Quero comprar um ingresso para o jogo")
# Iterate over the tokens
for token in doc:
    # Print the text and the predicted part-of-speech tag
    print(token.text, token.pos_)
Quero VERB
comprar VERB
um DET
ingresso NOUN
para ADP
o DET
jogo NOUN
In addition to the part-of-speech tags, we can also predict how the words are related. For example, whether a word is the subject of the sentence or an object.
The .dep_ attribute returns the predicted dependency label.
The .head attribute returns the syntactic head token. You can also think of it as the parent token this word is attached to.
for token in doc:
    print(token.text, token.pos_, token.dep_, token.head.text)
Quero VERB ROOT Quero
comprar VERB xcomp Quero
um DET det ingresso
ingresso NOUN obj comprar
para ADP case jogo
o DET det jogo
jogo NOUN nmod ingresso
To describe syntactic dependencies, spaCy uses a standardized label scheme. Here's an example of some common labels:
The pronoun "She" is a nominal subject attached to the verb – in this case, to "ate".
The noun "pizza" is a direct object attached to the verb "ate". It is eaten by the subject, "she".
The determiner "the", also known as an article, is attached to the noun "pizza".
Named entities are "real world objects" that are assigned a name – for example, a person, an organization or a country.
The doc.ents property lets you access the named entities predicted by the model.
It returns a tuple of Span objects, so we can iterate over it and print the entity text and the entity label using the .label_ attribute.
In this case, the model is correctly predicting "Apple" as an organization, "U.K." as a geopolitical entity and "$1 billion" as money.
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp(u"Apple is looking at buying U.K. startup for $1 billion")
for ent in doc.ents:
    print(ent.text, ent.label_)
Apple ORG
U.K. GPE
$1 billion MONEY
# Load the small Portuguese model
nlp = spacy.load('pt_core_news_sm')
# Process a text
doc = nlp(u"A empresa Apple está buscando comprar uma empresa na Inglaterra por R$1 bilhão de reais")
# Iterate over the predicted entities
for ent in doc.ents:
    # Print the entity text and its label
    print(ent.text, ent.label_)
Apple ORG
Inglaterra LOC
R$ PER
# Load the small Portuguese model
nlp = spacy.load('pt_core_news_sm')
# Process a text
doc = nlp(u"Como faço para contratar um plano de Sócio Torcedor do Flamengo")
# Iterate over the predicted entities
for ent in doc.ents:
    # Print the entity text and its label
    print(ent.text, ent.label_)
Sócio Torcedor PER
Flamengo ORG
# Load the small Portuguese model
nlp = spacy.load('pt_core_news_sm')
# Process a text
doc = nlp(u"Quais são os planos para quem mora fora do Rio de Janeiro")
# Iterate over the predicted entities
for ent in doc.ents:
    # Print the entity text and its label
    print(ent.text, ent.label_)
Rio de Janeiro LOC
A quick tip: To get definitions for the most common tags and labels, you can use the spacy.explain helper function.
For example, "GPE" for geopolitical entity isn't exactly intuitive – but spacy.explain can tell you that it refers to countries, cities and states.
The same works for part-of-speech tags and dependency labels.
spacy.explain('ORG')
'Companies, agencies, institutions, etc.'
spacy.explain('PER')
'Named person or family.'
spacy.explain('GPE')
'Countries, cities, states'
spacy.explain('NNP')
'noun, proper singular'
spacy.explain('dobj')
'direct object'
spacy.explain('MISC')
'Miscellaneous entities, e.g. events, nationalities, products or works of art'
spacy.explain('LOC')
'Non-GPE locations, mountain ranges, bodies of water'
The models we’re using in this course are already pre-installed. For more details on spaCy’s statistical models and how to install them on your machine, see the documentation.
Use spacy.load to load the small English model 'en_core_web_sm'. Process the text and print the document text.
import spacy
# Load the 'en_core_web_sm' model
nlp = spacy.load("en_core_web_sm")
text = "It’s official: Apple is the first U.S. public company to reach a $1 trillion market value"
# Process the text
doc = nlp(text)
# Print the document text
print(doc.text)
It’s official: Apple is the first U.S. public company to reach a $1 trillion market value
You’ll now get to try one of spaCy’s pre-trained model packages and see its predictions in action. Feel free to try it out on your own text! To find out what a tag or label means, you can call spacy.explain in the loop. For example: spacy.explain('PROPN') or spacy.explain('GPE').
Process the text with the nlp object and create a doc.
For each token, print the token text, the token’s .pos_ (part-of-speech tag) and the token’s .dep_ (dependency label).
import spacy
nlp = spacy.load("en_core_web_sm")
text = "It’s official: Apple is the first U.S. public company to reach a $1 trillion market value"
# Process the text
doc = nlp(text)
for token in doc:
    # Get the token text, part-of-speech tag and dependency label
    token_text = token.text
    token_pos = token.pos_
    token_dep = token.dep_
    # This is for formatting only
    print("{:<12}{:<10}{:<10}".format(token_text, token_pos, token_dep))
It          PRON      nsubj
’s          PROPN     ROOT
official    NOUN      acomp
:           PUNCT     punct
Apple       PROPN     nsubj
is          VERB      ROOT
the         DET       det
first       ADJ       amod
U.S.        PROPN     nmod
public      ADJ       amod
company     NOUN      attr
to          PART      aux
reach       VERB      relcl
a           DET       det
$           SYM       quantmod
1           NUM       compound
trillion    NUM       nummod
market      NOUN      compound
value       NOUN      dobj
spacy.explain('dobj')
'direct object'
spacy.explain('attr')
'attribute'
spacy.explain('nsubj')
'nominal subject'
import spacy
nlp = spacy.load("pt_core_news_sm")
text = "É oficial: A Apple é a primeira empresa americana a atingir o valor de mercado de 1 bilhão de reais"
# Process the text
doc = nlp(text)
for token in doc:
    # Get the token text, part-of-speech tag and dependency label
    token_text = token.text
    token_pos = token.pos_
    token_dep = token.dep_
    # This is for formatting only
    print("{:<12}{:<10}{:<10}".format(token_text, token_pos, token_dep))
É           VERB      cop
oficial     ADJ       ROOT
:           PUNCT     punct
A           DET       det
Apple       PROPN     nsubj
é           VERB      cop
a           DET       det
primeira    ADJ       amod
empresa     NOUN      ROOT
americana   ADJ       amod
a           ADP       mark
atingir     VERB      acl
o           DET       det
valor       NOUN      obj
de          ADP       case
mercado     NOUN      nmod
de          ADP       case
1           NUM       nummod
bilhão      NOUN      nmod
de          ADP       case
reais       NOUN      nmod
spacy.explain('cop')
'copula'
spacy.explain('ROOT')
spacy.explain('acl')
spacy.explain('nummod')
spacy.explain('nmod')
'modifier of nominal'
import spacy
nlp = spacy.load("pt_core_news_sm")
text = u"Meu cartão ingresso ainda não chegou. Como faço para entrar no jogo"
# Process the text
doc = nlp(text)
for token in doc:
    # Get the token text, part-of-speech tag and dependency label
    token_text = token.text
    token_pos = token.pos_
    token_dep = token.dep_
    # This is for formatting only
    print("{:<12}{:<10}{:<10}".format(token_text, token_pos, token_dep))
Meu         DET       det
cartão      NOUN      nsubj
ingresso    ADJ       acl
ainda       ADV       advmod
não         ADV       advmod
chegou      VERB      ROOT
.           PUNCT     punct
Como        SCONJ     case
faço        NOUN      ROOT
para        ADP       mark
entrar      VERB      acl
no          ADP       nummod
jogo        NOUN      obj
spacy.explain('acl')
Process the text and create a doc object.
Iterate over the doc.ents and print the entity text and label_ attribute.
import spacy
nlp = spacy.load("en_core_web_sm")
text = "It’s official: Apple is the first U.S. public company to reach a $1 trillion market value"
# Process the text
doc = nlp(text)
# Iterate over the predicted entities
for ent in doc.ents:
    # Print the entity text and its label
    print(ent.text, ent.label_)
Apple ORG
first ORDINAL
U.S. GPE
$1 trillion MONEY
import spacy
nlp = spacy.load("pt_core_news_sm")
text = "É oficial: A Apple é a primeira empresa americana a atingir o valor de mercado de 1 bilhão de reais"
# Process the text
doc = nlp(text)
# Iterate over the predicted entities
for ent in doc.ents:
    # Print the entity text and its label
    print(ent.text, ent.label_)
Apple ORG
import spacy
nlp = spacy.load("pt_core_news_sm")
text = "Os carros de passeio estão custando 120% mais caros"
# Process the text
doc = nlp(text)
# Iterate over the predicted entities
for ent in doc.ents:
    # Print the entity text and its label
    print(ent.text, ent.label_)
import spacy
nlp = spacy.load("pt_core_news_sm")
text = "O rei da Inglaterra morreu de acidente de carro"
# Process the text
doc = nlp(text)
# Iterate over the predicted entities
for ent in doc.ents:
    # Print the entity text and its label
    print(ent.text, ent.label_)
Inglaterra LOC
import spacy
nlp = spacy.load("en_core_web_sm")
text = "The king of England died on a car accident"
# Process the text
doc = nlp(text)
# Iterate over the predicted entities
for ent in doc.ents:
    # Print the entity text and its label
    print(ent.text, ent.label_)
England GPE
import spacy
nlp = spacy.load("en_core_web_sm")
text = "I want to buy a tennis shoes and pay 120 dollars"
# Process the text
doc = nlp(text)
# Iterate over the predicted entities
for ent in doc.ents:
    # Print the entity text and its label
    print(ent.text, ent.label_)
120 dollars MONEY
Models are statistical and not always right. Whether their predictions are correct depends on the training data and the text you’re processing. Let’s take a look at an example.
Process the text with the nlp object.
Iterate over the entities and print the entity text and label.
Looks like the model didn’t predict “iPhone X”. Create a span for those tokens manually.
import spacy
nlp = spacy.load("en_core_web_sm")
text = "New iPhone X release date leaked as Apple reveals pre-orders by mistake"
# Process the text
doc = nlp(text)
# Iterate over the entities
for ent in doc.ents:
    # Print the entity text and label
    print(ent.text, ent.label_)
# Get the span for "iPhone X"
iphone_x = doc[1:3]
# Print the span text
print("Missing entity:", iphone_x.text)
Apple ORG
Missing entity: iPhone X
Now we'll take a look at spaCy's matcher, which lets you write rules to find words and phrases in text.
Compared to regular expressions, the matcher works with Doc and Token objects instead of only strings.
It's also more flexible: you can search for texts but also other lexical attributes.
You can even write rules that use the model's predictions.
For example, find the word "duck" only if it's a verb, not a noun.
Why not just regular expressions?
Match patterns are lists of dictionaries. Each dictionary describes one token. The keys are the names of token attributes, mapped to their expected values.
In this example, we're looking for two tokens with the text "iPhone" and "X".
We can also match on other token attributes. Here, we're looking for two tokens whose lowercase forms equal "iphone" and "x".
We can even write patterns using attributes predicted by the model. Here, we're matching a token with the lemma "buy", plus a noun. The lemma is the base form, so this pattern would match phrases like "buying milk" or "bought flowers".
Match patterns
[{'ORTH': 'iPhone'}, {'ORTH': 'X'}]
[{'LOWER': 'iphone'}, {'LOWER': 'x'}]
[{'LEMMA': 'buy'}, {'POS': 'NOUN'}]
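The difference between the first two patterns can be tried on a blank English pipeline, since ORTH and LOWER are lexical attributes. This is a sketch under two assumptions: the helper name find is hypothetical, and matcher.add is called with the spaCy v3 signature, where patterns are passed as a list.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")  # ORTH and LOWER are lexical, so no model is required

def find(pattern, text):
    """Hypothetical helper: return the text of every span matching `pattern`."""
    matcher = Matcher(nlp.vocab)
    # spaCy v3 signature (assumed here): patterns are passed as a list
    matcher.add("DEMO", [pattern])
    doc = nlp(text)
    return [doc[start:end].text for _, start, end in matcher(doc)]

exact = [{"ORTH": "iPhone"}, {"ORTH": "X"}]      # exact, case-sensitive text
lower = [{"LOWER": "iphone"}, {"LOWER": "x"}]    # compares lowercase forms

exact_on_cased = find(exact, "New iPhone X release date leaked")
exact_on_lower = find(exact, "new iphone x release date leaked")
lower_on_lower = find(lower, "new iphone x release date leaked")
print(exact_on_cased, exact_on_lower, lower_on_lower)
```

The ORTH pattern only fires on the exact casing, while the LOWER pattern also finds the all-lowercase spelling.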
To use a pattern, we first import the matcher from spacy.matcher.
We also load a model and create the nlp object.
The matcher is initialized with the shared vocabulary, nlp.vocab. You'll learn more about this later – for now, just remember to always pass it in.
The matcher.add method lets you add a pattern. The first argument is a unique ID to identify which pattern was matched. The second argument is an optional callback. We don't need one here, so we set it to None. The third argument is the pattern. Note that this is the spaCy v2 signature; in spaCy v3 the patterns are passed as a list in the second argument, as in matcher.add('IPHONE_PATTERN', [pattern]).
To match the pattern on a text, we can call the matcher on any doc.
This will return the matches.
import spacy
# Import the Matcher
from spacy.matcher import Matcher
# Load a model and create the nlp object
nlp = spacy.load('en_core_web_sm')
# Initialize the matcher with the shared vocab
matcher = Matcher(nlp.vocab)
# Add the pattern to the matcher
pattern = [{'ORTH': 'iPhone'}, {'ORTH': 'X'}]
matcher.add('IPHONE_PATTERN', None, pattern)
# Process some text
doc = nlp("New iPhone X release date leaked")
# Call the matcher on the doc
matches = matcher(doc)
matches
[(9528407286733565721, 1, 3)]
When you call the matcher on a doc, it returns a list of tuples.
Each tuple consists of three values: the match ID, the start index and the end index of the matched span.
This means we can iterate over the matches and create a Span object: a slice of the doc at the start and end index.
# Call the matcher on the doc
doc = nlp("New iPhone X release date leaked")
matches = matcher(doc)
# Iterate over the matches
for match_id, start, end in matches:
    # Get the matched span
    matched_span = doc[start:end]
    print(matched_span.text)
iPhone X
match_id: hash value of the pattern name
start: start index of matched span
end: end index of matched span
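Since the match ID is the hash of the pattern name, the readable name can be looked up again in the shared string store. The sketch below assumes a blank English pipeline and the spaCy v3 form of matcher.add, where patterns are passed as a list.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")  # the ORTH pattern below needs no model
matcher = Matcher(nlp.vocab)
# spaCy v3 signature (assumed here); v2 used matcher.add(name, None, pattern)
matcher.add("IPHONE_PATTERN", [[{"ORTH": "iPhone"}, {"ORTH": "X"}]])

doc = nlp("New iPhone X release date leaked")
matches = matcher(doc)
for match_id, start, end in matches:
    # Look the hash up in the StringStore to get the pattern name back
    name = nlp.vocab.strings[match_id]
    print(name, start, end, doc[start:end].text)
```

This is handy once several patterns are registered on the same matcher and you need to know which one fired.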
Here's an example of a more complex pattern using lexical attributes.
We're looking for five tokens:
A token consisting of only digits.
Three case-insensitive tokens for "fifa", "world" and "cup".
And a token that consists of punctuation.
The pattern matches the tokens "2018 FIFA World Cup:".
pattern = [
    {'IS_DIGIT': True},
    {'LOWER': 'fifa'},
    {'LOWER': 'world'},
    {'LOWER': 'cup'},
    {'IS_PUNCT': True}
]
# Process some text
doc = nlp("2018 FIFA World Cup: France won!")
# Initialize the matcher with the shared vocab
matcher = Matcher(nlp.vocab)
# Add the pattern to the matcher
matcher.add('FIFA', None, pattern)
# Call the matcher on the doc
matches = matcher(doc)
# Iterate over the matches
for match_id, start, end in matches:
    # Get the matched span
    matched_span = doc[start:end]
    print(matched_span.text)
2018 FIFA World Cup:
In this example, we're looking for two tokens:
A verb with the lemma "love", followed by a noun.
This pattern will match "loved dogs" and "love cats".
pattern = [
    {'LEMMA': 'love', 'POS': 'VERB'},
    {'POS': 'NOUN'}
]
# Process some text
doc = nlp("I loved dogs but now I love cats more.")
# Initialize the matcher with the shared vocab
matcher = Matcher(nlp.vocab)
# Add the pattern to the matcher
matcher.add('PETS', None, pattern)
# Call the matcher on the doc
matches = matcher(doc)
# Iterate over the matches
for match_id, start, end in matches:
    # Get the matched span
    matched_span = doc[start:end]
    print(matched_span.text)
loved dogs
love cats
pattern = [
    {'LEMMA': 'love', 'POS': 'VERB'},
    {'POS': 'NOUN'}
]
# Process some text
doc = nlp("I loved dogs. Now I love cats more.")
# Initialize the matcher with the shared vocab
matcher = Matcher(nlp.vocab)
# Add the pattern to the matcher
matcher.add('PETS', None, pattern)
# Call the matcher on the doc
matches = matcher(doc)
# Iterate over the matches
for match_id, start, end in matches:
    # Get the matched span
    matched_span = doc[start:end]
    print(matched_span.text)
loved dogs
love cats
Operators and quantifiers let you define how often a token should be matched. They can be added using the "OP" key.
Here, the "?" operator makes the determiner token optional, so it will match a token with the lemma "buy", an optional article and a noun.
pattern = [
    {'LEMMA': 'buy'},
    {'POS': 'DET', 'OP': '?'},  # optional: match 0 or 1 times
    {'POS': 'NOUN'}
]
# Process some text
doc = nlp("I bought a smartphone. Now I'm buying apps.")
# Initialize the matcher with the shared vocab
matcher = Matcher(nlp.vocab)
# Add the pattern to the matcher
matcher.add('CONSUME', None, pattern)
# Call the matcher on the doc
matches = matcher(doc)
# Iterate over the matches
for match_id, start, end in matches:
    # Get the matched span
    matched_span = doc[start:end]
    print(matched_span.text)
bought a smartphone
buying apps
"OP" can have one of four values:
An "!" negates the token, so it's matched 0 times.
A "?" makes the token optional, and matches it 0 or 1 times.
A "+" matches a token 1 or more times.
And finally, an "*" matches 0 or more times.
Operators can make your patterns a lot more powerful, but they also add more complexity – so use them wisely.
{'OP': '!'} Negation: match 0 times
{'OP': '?'} Optional: match 0 or 1 times
{'OP': '+'} Match 1 or more times
{'OP': '*'} Match 0 or more times
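The "?" operator can also be tried without a trained model by matching on lowercase text instead of lemmas and part-of-speech tags. That swap is an assumption made for this sketch, as are the blank English pipeline, the example sentences and the spaCy v3 form of matcher.add.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")  # lexical attributes only, so no model is required
matcher = Matcher(nlp.vocab)

pattern = [
    {"LOWER": "buy"},
    {"LOWER": "a", "OP": "?"},   # optional: match 0 or 1 times
    {"LOWER": "ticket"},
]
# spaCy v3 signature (assumed here): patterns are passed as a list
matcher.add("BUY_TICKET", [pattern])

results = {}
for text in ["I want to buy a ticket", "I want to buy ticket"]:
    doc = nlp(text)
    # Store the matched span texts for each input sentence
    results[text] = [doc[start:end].text for _, start, end in matcher(doc)]
    print(text, "->", results[text])
```

Both sentences match the same pattern, once with the optional article present and once without it.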
Let’s try spaCy’s rule-based Matcher. You’ll be using the example from the previous exercise and write a pattern that can match the phrase “iPhone X” in the text.
Import the Matcher from spacy.matcher.
Initialize it with the nlp object’s shared vocab.
Create a pattern that matches the 'TEXT' values of two tokens: "iPhone" and "X".
Use the matcher.add method to add the pattern to the matcher.
Call the matcher on the doc and store the result in the variable matches.
Iterate over the matches and get the matched span from the start to the end index.
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("New iPhone X release date leaked as Apple reveals pre-orders by mistake")
print(doc.text)
# Import the Matcher
from spacy.matcher import Matcher
# Initialize the Matcher with the shared vocabulary
matcher = Matcher(nlp.vocab)
# Create a pattern matching two tokens: "iPhone" and "X"
pattern = [{'ORTH': 'iPhone'}, {'ORTH': 'X'}]
# Add the pattern to the matcher
matcher.add('IPHONE_PATTERN', None, pattern)
# Use the matcher on the doc
matches = matcher(doc)
print("Matches:", [doc[start:end].text for match_id, start, end in matches])
New iPhone X release date leaked as Apple reveals pre-orders by mistake
Matches: ['iPhone X']
import spacy
from spacy.matcher import Matcher
nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)
doc = nlp(
    "After making the iOS update you won't notice a radical system-wide "
    "redesign: nothing like the aesthetic upheaval we got with iOS 7. Most of "
    "iOS 11's furniture remains the same as in iOS 10. But you will discover "
    "some tweaks once you delve a little deeper."
)
# Write a pattern for full iOS versions ("iOS 7", "iOS 11", "iOS 10")
pattern = [{"TEXT": 'iOS'}, {"IS_DIGIT": True}]
# Add the pattern to the matcher and apply the matcher to the doc
matcher.add("IOS_VERSION_PATTERN", None, pattern)
matches = matcher(doc)
print("Total matches found:", len(matches))
# Iterate over the matches and print the span text
for match_id, start, end in matches:
    print("Match found:", doc[start:end].text)
Total matches found: 3
Match found: iOS 7
Match found: iOS 11
Match found: iOS 10
Write one pattern that only matches forms of “download” (tokens with the lemma “download”), followed by a token with the part-of-speech tag 'PROPN' (proper noun).
import spacy
from spacy.matcher import Matcher
nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)
doc = nlp(
    "i downloaded Fortnite on my laptop and can't open the game at all. Help? "
    "so when I was downloading Minecraft, I got the Windows version where it "
    "is the '.zip' folder and I used the default program to unpack it... do "
    "I also need to download Winzip?"
)
# Write a pattern that matches a form of "download" plus proper noun
pattern = [{"LEMMA": 'download'}, {"POS": 'PROPN'}]
# Add the pattern to the matcher and apply the matcher to the doc
matcher.add("DOWNLOAD_THINGS_PATTERN", None, pattern)
matches = matcher(doc)
print("Total matches found:", len(matches))
# Iterate over the matches and print the span text
for match_id, start, end in matches:
    print("Match found:", doc[start:end].text)
Total matches found: 3
Match found: downloaded Fortnite
Match found: downloading Minecraft
Match found: download Winzip
import spacy
from spacy.matcher import Matcher
nlp = spacy.load("pt_core_news_sm")
matcher = Matcher(nlp.vocab)
doc = nlp(
    "Eu baixei varios aplicativos na minha vida "
    "mas quando eu baixo o fortinite sempre dá problemas "
    "eu realmente nao sei mais o que fazer "
    "Tem algum problema para baixar o winzip?"
)
# Write a pattern that matches a form of "baixar" (to download) plus proper noun
pattern = [{"LEMMA": 'baixar'}, {"POS": 'PROPN'}]
# Add the pattern to the matcher and apply the matcher to the doc
matcher.add("DOWNLOAD_THINGS_PATTERN", None, pattern)
matches = matcher(doc)
print("Total matches found:", len(matches))
# Iterate over the matches and print the span text
for match_id, start, end in matches:
    print("Match found:", doc[start:end].text)
Total matches found: 0
Part 3
Write one pattern that matches adjectives ('ADJ') followed by one or two 'NOUN's (one noun and one optional noun).
import spacy
from spacy.matcher import Matcher
nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)
doc = nlp(
    "Features of the app include a beautiful design, smart search, automatic "
    "labels and optional voice responses."
)
# Write a pattern for adjective plus one or two nouns
pattern = [{"POS": "ADJ"}, {"POS": "NOUN"}, {"POS": "NOUN", "OP": "?"}]
# Add the pattern to the matcher and apply the matcher to the doc
matcher.add("ADJ_NOUN_PATTERN", None, pattern)
matches = matcher(doc)
print("Total matches found:", len(matches))
# Iterate over the matches and print the span text
for match_id, start, end in matches:
    print("Match found:", doc[start:end].text)
Total matches found: 4
Match found: beautiful design
Match found: smart search
Match found: automatic labels
Match found: optional voice responses