Setup Resources:
To install, open a terminal and run
pip install -U spacy
After installation, you also need to download the English language model:
python -m spacy download en_core_web_lg
To use spacy with English:
import spacy
nlp = spacy.load("en_core_web_lg")
Make sure you install from the terminal first before trying the in-notebook installation below.
%%capture
# Install spacy from within the Jupyter notebook.
try:
    from pip import main as pipmain
except ImportError:
    from pip._internal import main as pipmain
packages = ['spacy']
pipmain(['install'] + packages);
%%capture
!python -m spacy download en_core_web_lg
import spacy
nlp = spacy.load('en_core_web_lg')
doc = nlp("The hungry, hungry catepillar ate all of the food, and then he became a butterfly!")
doc.text.split()
['The', 'hungry,', 'hungry', 'caterpillar', 'ate', 'all', 'of', 'the', 'food,', 'and', 'then', 'he', 'became', 'a', 'butterfly!']
Note that some of the punctuation gets attached to the previous word. We don't want that.
[token.orth_ for token in doc]
['The', 'hungry', ',', 'hungry', 'caterpillar', 'ate', 'all', 'of', 'the', 'food', ',', 'and', 'then', 'he', 'became', 'a', 'butterfly', '!']
Remove punctuation with .is_punct, whitespace tokens with .is_space, and stop words with .is_stop:
[token.orth_ for token in doc if not (token.is_punct or token.is_space or token.is_stop)]
['hungry', 'hungry', 'caterpillar', 'ate', 'food', 'butterfly']
Note how all the punctuation, white spaces, and stop words have been removed and we are left only with the "important" words.
Aside: In the example below, the contraction gets split up. Try using nltk's casual_tokenize to split words instead.
text2 = "Hey!!! Find Jessica's website at https://www.google.com/"
doc2 = nlp(text2)
print(doc2.text.split())
[token.orth_ for token in doc2]
['Hey!!!', 'Find', "Jessica's", 'website', 'at', 'https://www.google.com/']
['Hey', '!', '!', '!', 'Find', 'Jessica', "'s", 'website', 'at', 'https://www.google.com/']
%%capture
packages = ['nltk']
pipmain(['install'] + packages);
from nltk.tokenize import casual_tokenize
casual_tokenize(text2)
['Hey', '!', '!', '!', 'Find', "Jessica's", 'website', 'at', 'https://www.google.com/']
Stop words carry little meaning on their own and can skew frequency analysis, so it is usually worth removing them.
spacy_stopwords = spacy.lang.en.stop_words.STOP_WORDS
spacy_stopwords
{"'d", "'ll", "'m", "'re", "'s", "'ve", 'a', 'about', 'above', 'across', 'after', 'afterwards', 'again', 'against', 'all', 'almost', 'alone', 'along', 'already', 'also', 'although', 'always', 'am', 'among', 'amongst', 'amount', 'an', 'and', 'another', 'any', 'anyhow', 'anyone', 'anything', 'anyway', 'anywhere', 'are', 'around', 'as', 'at', 'back', 'be', 'became', 'because', 'become', 'becomes', 'becoming', 'been', 'before', 'beforehand', 'behind', 'being', 'below', 'beside', 'besides', 'between', 'beyond', 'both', 'bottom', 'but', 'by', 'ca', 'call', 'can', 'cannot', 'could', 'did', 'do', 'does', 'doing', 'done', 'down', 'due', 'during', 'each', 'eight', 'either', 'eleven', 'else', 'elsewhere', 'empty', 'enough', 'even', 'ever', 'every', 'everyone', 'everything', 'everywhere', 'except', 'few', 'fifteen', 'fifty', 'first', 'five', 'for', 'former', 'formerly', 'forty', 'four', 'from', 'front', 'full', 'further', 'get', 'give', 'go', 'had', 'has', 'have', 'he', 'hence', 'her', 'here', 'hereafter', 'hereby', 'herein', 'hereupon', 'hers', 'herself', 'him', 'himself', 'his', 'how', 'however', 'hundred', 'i', 'if', 'in', 'indeed', 'into', 'is', 'it', 'its', 'itself', 'just', 'keep', 'last', 'latter', 'latterly', 'least', 'less', 'made', 'make', 'many', 'may', 'me', 'meanwhile', 'might', 'mine', 'more', 'moreover', 'most', 'mostly', 'move', 'much', 'must', 'my', 'myself', "n't", 'name', 'namely', 'neither', 'never', 'nevertheless', 'next', 'nine', 'no', 'nobody', 'none', 'noone', 'nor', 'not', 'nothing', 'now', 'nowhere', 'n‘t', 'n’t', 'of', 'off', 'often', 'on', 'once', 'one', 'only', 'onto', 'or', 'other', 'others', 'otherwise', 'our', 'ours', 'ourselves', 'out', 'over', 'own', 'part', 'per', 'perhaps', 'please', 'put', 'quite', 'rather', 're', 'really', 'regarding', 'same', 'say', 'see', 'seem', 'seemed', 'seeming', 'seems', 'serious', 'several', 'she', 'should', 'show', 'side', 'since', 'six', 'sixty', 'so', 'some', 'somehow', 'someone', 'something', 'sometime', 'sometimes', 'somewhere', 'still', 'such', 'take', 'ten', 'than', 'that', 'the', 'their', 'them', 'themselves', 'then', 'thence', 'there', 'thereafter', 'thereby', 'therefore', 'therein', 'thereupon', 'these', 'they', 'third', 'this', 'those', 'though', 'three', 'through', 'throughout', 'thru', 'thus', 'to', 'together', 'too', 'top', 'toward', 'towards', 'twelve', 'twenty', 'two', 'under', 'unless', 'until', 'up', 'upon', 'us', 'used', 'using', 'various', 'very', 'via', 'was', 'we', 'well', 'were', 'what', 'whatever', 'when', 'whence', 'whenever', 'where', 'whereafter', 'whereas', 'whereby', 'wherein', 'whereupon', 'wherever', 'whether', 'which', 'while', 'whither', 'who', 'whoever', 'whole', 'whom', 'whose', 'why', 'will', 'with', 'within', 'without', 'would', 'yet', 'you', 'your', 'yours', 'yourself', 'yourselves', '‘d', '‘ll', '‘m', '‘re', '‘s', '‘ve', '’d', '’ll', '’m', '’re', '’s', '’ve'}
Use spaCy's .lemma_ attribute to reduce each word to its base form (lemma):
lemma_words = "going gone went goes"
nlp_lemma_words = nlp(lemma_words)
[word.lemma_ for word in nlp_lemma_words]
['go', 'go', 'go', 'go']
lemma_words = "has had have"
nlp_lemma_words = nlp(lemma_words)
[word.lemma_ for word in nlp_lemma_words]
['have', 'have', 'have']
lemma_words = "falsely"
nlp_lemma_words = nlp(lemma_words)
[word.lemma_ for word in nlp_lemma_words]
['falsely']
This is especially useful for text classification because lemmatising the text avoids duplicating word forms when building models like bag-of-words.
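As a rough sketch of the effect (the sentence below is just an illustration), counting lemmas instead of surface forms collapses inflected variants like "dog"/"dogs" and "ran"/"running" into a single bag-of-words feature:
from collections import Counter

bow_doc = nlp("The dogs were running while another dog ran after them.")
# count lemmas of the content words only
lemma_counts = Counter(token.lemma_.lower()
                       for token in bow_doc
                       if not (token.is_punct or token.is_space or token.is_stop))
print(lemma_counts)  # roughly: Counter({'dog': 2, 'run': 2})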
Use the .pos_ and .tag_ attributes to get coarse-grained and fine-grained part-of-speech tags:
doc2 = nlp("My dog's toy actually belongs to the neighbor's cat.")
pos_tags = [(i, i.tag_) for i in doc2]
pos_tags
[(My, 'PRP$'), (dog, 'NN'), ('s, 'POS'), (toy, 'NN'), (actually, 'RB'), (belongs, 'VBZ'), (to, 'IN'), (the, 'DT'), (neighbor, 'NN'), ('s, 'POS'), (cat, 'NN'), (., '.')]
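For comparison, .pos_ gives the coarse-grained universal part-of-speech tags for the same tokens (a quick sketch; the exact labels depend on the loaded model):
# coarse-grained universal POS tags for the same sentence
[(token.text, token.pos_) for token in doc2]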
Create a list of owner-possession tuples:
[(i[0].nbor(-1), i[0].nbor(+1)) for i in pos_tags if i[1] == "POS"]
[(dog, toy), (neighbor, cat)]
Resources:
spaCy provides pre-trained word embeddings, which were downloaded along with the English model. spaCy can parse entire blocks of text and assign word vectors using the loaded model. Then, use .vector to get the word vector for a token.
Important note: spaCy's small models (those ending in sm) don't ship with word vectors. You can still use .similarity to compare, but the results won't be as good. To use real word vectors, make sure to download one of the large models:
python -m spacy download en_core_web_lg
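A quick sanity check (a small sketch) confirms that the loaded model actually ships with word vectors:
# vectors table shape, e.g. (rows, 300) for en_core_web_lg
print(nlp.vocab.vectors.shape)
# True when the token has a real word vector
print(nlp("cat")[0].has_vector)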
tokens = nlp(u"cat dog water cloud kitty")
print(tokens[0].text, tokens[0].vector)
cat [-0.15067 -0.024468 -0.23368 -0.23378 -0.18382 0.32711 -0.22084 -0.28777 0.12759 1.1656 -0.64163 -0.098455 -0.62397 0.010431 -0.25653 0.31799 0.037779 1.1904 -0.17714 -0.2595 -0.31461 0.038825 -0.15713 -0.13484 0.36936 -0.30562 -0.40619 -0.38965 0.3686 0.013963 -0.6895 0.004066 -0.1367 0.32564 0.24688 -0.14011 0.53889 -0.80441 -0.1777 -0.12922 0.16303 0.14917 -0.068429 -0.33922 0.18495 -0.082544 -0.46892 0.39581 -0.13742 -0.35132 0.22223 -0.144 -0.048287 0.3379 -0.31916 0.20526 0.098624 -0.23877 0.045338 0.43941 0.030385 -0.013821 -0.093273 -0.18178 0.19438 -0.3782 0.70144 0.16236 0.0059111 0.024898 -0.13613 -0.11425 -0.31598 -0.14209 0.028194 0.5419 -0.42413 -0.599 0.24976 -0.27003 0.14964 0.29287 -0.31281 0.16543 -0.21045 -0.4408 1.2174 0.51236 0.56209 0.14131 0.092514 0.71396 -0.021051 -0.33704 -0.20275 -0.36181 0.22055 -0.25665 0.28425 -0.16968 0.058029 0.61182 0.31576 -0.079185 0.35538 -0.51236 0.4235 -0.30033 -0.22376 0.15223 -0.048292 0.23532 0.46507 -0.67579 -0.32905 0.08446 -0.22123 -0.045333 0.34463 -0.1455 -0.18047 -0.17887 0.96879 -1.0028 -0.47343 0.28542 0.56382 -0.33211 -0.38275 -0.2749 -0.22955 -0.24265 -0.37689 0.24822 0.36941 0.14651 -0.37864 0.31134 -0.28449 0.36948 -2.8174 -0.38319 -0.022373 0.56376 0.40131 -0.42131 -0.11311 -0.17317 0.1411 -0.13194 0.18494 0.097692 -0.097341 -0.23987 0.16631 -0.28556 0.0038654 0.53292 -0.32367 -0.38744 0.27011 -0.34181 -0.27702 -0.67279 -0.10771 -0.062189 -0.24783 -0.070884 -0.20898 0.062404 0.022372 0.13408 0.1305 -0.19546 -0.46849 0.77731 -0.043978 0.3827 -0.23376 1.0457 -0.14371 -0.3565 -0.080713 -0.31047 -0.57822 -0.28067 -0.069678 0.068929 -0.16227 -0.63934 -0.62149 0.11222 -0.16969 -0.54637 0.49661 0.46565 0.088294 -0.48496 0.69263 -0.068977 -0.53709 0.20802 -0.42987 -0.11921 0.1174 -0.18443 0.43797 -0.1236 0.3607 -0.19608 -0.35366 0.18808 -0.5061 0.14455 -0.024368 -0.10772 -0.0115 0.58634 -0.054461 0.0076487 -0.056297 0.27193 0.23096 -0.29296 -0.24325 0.10317 -0.10014 0.7089 0.17402 -0.0037509 -0.46304 0.11806 -0.16457 -0.38609 0.14524 0.098122 -0.12352 -0.1047 0.39047 -0.3063 -0.65375 -0.0044248 -0.033876 0.037114 -0.27472 0.0053147 0.30737 0.12528 -0.19527 -0.16461 0.087518 -0.051107 -0.16323 0.521 0.10822 -0.060379 -0.71735 -0.064327 0.37043 -0.41054 -0.2728 -0.30217 0.015771 -0.43056 0.35647 0.17188 -0.54598 -0.21541 -0.044889 -0.10597 -0.54391 0.53908 0.070938 0.097839 0.097908 0.17805 0.18995 0.49962 -0.18529 0.051234 0.019574 0.24805 0.3144 -0.29304 0.54235 0.46672 0.26017 -0.44705 0.28287 -0.033345 -0.33181 -0.10902 -0.023324 0.2106 -0.29633 0.81506 0.038524 0.46004 0.17187 -0.29804 ]
Now we can use the word vectors from spaCy to compare the similarity of the words using .similarity.
for token1 in tokens:
    for token2 in tokens:
        print(token1.text, token2.text, token1.similarity(token2))
cat cat 1.0
cat dog 0.8016855
cat water 0.2888436
cat cloud 0.16586679
cat kitty 0.7888994
dog cat 0.8016855
dog dog 1.0
dog water 0.30933863
dog cloud 0.1380703
dog kitty 0.6306644
water cat 0.2888436
water dog 0.30933863
water water 1.0
water cloud 0.3084506
water kitty 0.22237565
cloud cat 0.16586679
cloud dog 0.1380703
cloud water 0.3084506
cloud cloud 1.0
cloud kitty 0.14712334
kitty cat 0.7888994
kitty dog 0.6306644
kitty water 0.22237565
kitty cloud 0.14712334
kitty kitty 1.0
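Under the hood, .similarity between two tokens is by default the cosine similarity of their vectors; a minimal numpy sketch reproduces the numbers above:
import numpy as np

def cosine(u, v):
    # cosine similarity between two word vectors
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

cat, dog = nlp("cat dog")
print(cosine(cat.vector, dog.vector))  # close to the cat/dog value above (~0.80)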
from spacy.matcher import Matcher
from spacy.tokens import Span
matcher = Matcher(nlp.vocab)
Define a pattern and add it to the matcher. LOWER indicates that the token's lowercase form is matched.
# define the pattern
pattern = [{'LOWER': 'computer', 'POS': 'NOUN'},
           {'POS': {'NOT_IN': ['VERB']}}]

# add the pattern to the previously created matcher object
matcher.add("Matching", None, pattern)
text = "Computer programming is the process of writing instructions that get executed by computers. The instructions, also known as code, are written in a programming language which the computer can understand and use to perform a task or solve a problem. Basic computer programming involves the analysis of a problem and development of a logical sequence of instructions to solve it. There can be numerous paths to a solution and the computer programmer seeks to design and code that which is most efficient. Among the programmer’s tasks are understanding requirements, determining the right programming language to use, designing or architecting the solution, coding, testing, debugging and writing documentation so that the solution can be easily understood by other programmers.Computer programming is at the heart of computer science. It is the implementation portion of software development, application development and software engineering efforts, transforming ideas and theories into actual, working solutions."
doc = nlp(text)
matches = matcher(doc)

# print the matched results and extract out the results
for match_id, start, end in matches:
    # nlp.vocab.strings[match_id]
    span = doc[start:end]
    print("Indexes:", start, end, span.text)
Indexes: 0 2 Computer programming
Indexes: 45 47 computer programming
Indexes: 75 77 computer programmer
Indexes: 131 133 Computer programming
Indexes: 138 140 computer science
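The match_id returned by the matcher is a hash of the name passed to matcher.add; it can be mapped back to the original string through the vocab's string store, e.g.:
for match_id, start, end in matches:
    # look up the string label ("Matching") behind the hash
    label = nlp.vocab.strings[match_id]
    print(label, doc[start:end].text)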
The PhraseMatcher allows us to match specific phrases and combinations of words.
from spacy.matcher import PhraseMatcher
matcher2 = PhraseMatcher(nlp.vocab, attr='LOWER')

# the list containing the phrases to be matched
terminology_list = ["Machine learning", "Hidden Structure",
                    "Unlabeled Data"]
patterns = [nlp.make_doc(text) for text in terminology_list]

# add the patterns to the matcher object without any callbacks
matcher2.add("Phrase Matching", None, *patterns)
# the input text string is converted to a Document object
doc2 = nlp("Supervised machine learning algorithms can apply what has been learned in the past to new data using labeled examples to predict future events. Starting from the analysis of a known training dataset, the learning algorithm produces an inferred function to make predictions about the output values. The system is able to provide targets for any new input after sufficient training. The learning algorithm can also compare its output with the correct, intended output and find errors in order to modify the model accordingly. In contrast, unsupervised machine learning algorithms are used when the information used to train is neither classified nor labeled. Unsupervised learning studies how systems can infer a function to describe a hidden structure from unlabeled data. The system doesn’t figure out the right output, but it explores the data and can draw inferences from datasets to describe hidden structures from unlabeled data. Semi-supervised machine learning algorithms fall somewhere in between supervised and unsupervised learning, since they use both labeled and unlabeled data for training – typically a small amount of labeled data and a large amount of unlabeled data. The systems that use this method are able to considerably improve learning accuracy. Usually, semi-supervised learning is chosen when the acquired labeled data requires skilled and relevant resources in order to train it / learn from it. Otherwise, acquiring unlabeled data generally doesn’t require additional resources.")#call the matcher object the document object and it will return #match_id, start and stop indexes of the matched words
matches2 = matcher2(doc2)#print the matched results and extract out the results
for match_id, start, end in matches2:
    span = doc2[start:end]
    print("Indexes:", start, end, span.text)
Indexes: 1 3 machine learning
Indexes: 93 95 machine learning
Indexes: 122 124 hidden structure
Indexes: 125 127 unlabeled data
Indexes: 154 156 unlabeled data
Indexes: 160 162 machine learning
Indexes: 178 180 unlabeled data
Indexes: 195 197 unlabeled data
Indexes: 243 245 unlabeled data