Setup Resources:
To install, open a terminal and run
pip install -U spacy
After installation, you also need to download the English language model:
python -m spacy download en_core_web_lg
To use spacy with English:
import spacy
nlp = spacy.load("en_core_web_lg")
Make sure you install from the terminal first before trying the in-notebook installation below.
%%capture
# Install spacy from within the Jupyter notebook.
try:
    from pip import main as pipmain
except ImportError:
    from pip._internal import main as pipmain
packages = ['spacy']
pipmain(['install'] + packages);
%%capture
!python -m spacy download en_core_web_lg
import spacy
nlp = spacy.load('en_core_web_lg')
doc = nlp("The hungry, hungry catepillar ate all of the food, and then he became a butterfly!")
doc.text.split()
['The', 'hungry,', 'hungry', 'caterpillar', 'ate', 'all', 'of', 'the', 'food,', 'and', 'then', 'he', 'became', 'a', 'butterfly!']
Note that some of the punctuation gets attached to the previous word. We don't want that.
[token.orth_ for token in doc]
['The', 'hungry', ',', 'hungry', 'caterpillar', 'ate', 'all', 'of', 'the', 'food', ',', 'and', 'then', 'he', 'became', 'a', 'butterfly', '!']
Remove punctuation with .is_punct, whitespace tokens with .is_space, and stop words with .is_stop:
[token.orth_ for token in doc if not (token.is_punct or token.is_space or token.is_stop)]
['hungry', 'hungry', 'caterpillar', 'ate', 'food', 'butterfly']
Note how all the punctuation, white spaces, and stop words have been removed and we are left only with the "important" words.
Aside: In the example below, the contraction gets split up. Try using nltk's casual_tokenize to split words instead.
text2 = "Hey!!! Find Jessica's website at https://www.google.com/"
doc2 = nlp(text2)
print(doc2.text.split())
[token.orth_ for token in doc2]
['Hey!!!', 'Find', "Jessica's", 'website', 'at', 'https://www.google.com/']
['Hey', '!', '!', '!', 'Find', 'Jessica', "'s", 'website', 'at', 'https://www.google.com/']
%%capture
packages = ['nltk']
pipmain(['install'] + packages);
from nltk.tokenize import casual_tokenize
casual_tokenize(text2)
['Hey', '!', '!', '!', 'Find', "Jessica's", 'website', 'at', 'https://www.google.com/']
Stop words carry little meaning on their own and can skew frequency analysis, so it is usually worth removing them.
spacy_stopwords = spacy.lang.en.stop_words.STOP_WORDS
spacy_stopwords
{"'d", "'ll", "'m", "'re", "'s", "'ve", 'a', 'about', 'above', 'across', 'after', 'afterwards', 'again', 'against', 'all', 'almost', 'alone', 'along', 'already', 'also', 'although', 'always', 'am', 'among', 'amongst', 'amount', 'an', 'and', 'another', 'any', 'anyhow', 'anyone', 'anything', 'anyway', 'anywhere', 'are', 'around', 'as', 'at', 'back', 'be', 'became', 'because', 'become', 'becomes', 'becoming', 'been', 'before', 'beforehand', 'behind', 'being', 'below', 'beside', 'besides', 'between', 'beyond', 'both', 'bottom', 'but', 'by', 'ca', 'call', 'can', 'cannot', 'could', 'did', 'do', 'does', 'doing', 'done', 'down', 'due', 'during', 'each', 'eight', 'either', 'eleven', 'else', 'elsewhere', 'empty', 'enough', 'even', 'ever', 'every', 'everyone', 'everything', 'everywhere', 'except', 'few', 'fifteen', 'fifty', 'first', 'five', 'for', 'former', 'formerly', 'forty', 'four', 'from', 'front', 'full', 'further', 'get', 'give', 'go', 'had', 'has', 'have', 'he', 'hence', 'her', 'here', 'hereafter', 'hereby', 'herein', 'hereupon', 'hers', 'herself', 'him', 'himself', 'his', 'how', 'however', 'hundred', 'i', 'if', 'in', 'indeed', 'into', 'is', 'it', 'its', 'itself', 'just', 'keep', 'last', 'latter', 'latterly', 'least', 'less', 'made', 'make', 'many', 'may', 'me', 'meanwhile', 'might', 'mine', 'more', 'moreover', 'most', 'mostly', 'move', 'much', 'must', 'my', 'myself', "n't", 'name', 'namely', 'neither', 'never', 'nevertheless', 'next', 'nine', 'no', 'nobody', 'none', 'noone', 'nor', 'not', 'nothing', 'now', 'nowhere', 'n‘t', 'n’t', 'of', 'off', 'often', 'on', 'once', 'one', 'only', 'onto', 'or', 'other', 'others', 'otherwise', 'our', 'ours', 'ourselves', 'out', 'over', 'own', 'part', 'per', 'perhaps', 'please', 'put', 'quite', 'rather', 're', 'really', 'regarding', 'same', 'say', 'see', 'seem', 'seemed', 'seeming', 'seems', 'serious', 'several', 'she', 'should', 'show', 'side', 'since', 'six', 'sixty', 'so', 'some', 'somehow', 'someone', 'something', 'sometime', 'sometimes', 'somewhere', 'still', 'such', 'take', 'ten', 'than', 'that', 'the', 'their', 'them', 'themselves', 'then', 'thence', 'there', 'thereafter', 'thereby', 'therefore', 'therein', 'thereupon', 'these', 'they', 'third', 'this', 'those', 'though', 'three', 'through', 'throughout', 'thru', 'thus', 'to', 'together', 'too', 'top', 'toward', 'towards', 'twelve', 'twenty', 'two', 'under', 'unless', 'until', 'up', 'upon', 'us', 'used', 'using', 'various', 'very', 'via', 'was', 'we', 'well', 'were', 'what', 'whatever', 'when', 'whence', 'whenever', 'where', 'whereafter', 'whereas', 'whereby', 'wherein', 'whereupon', 'wherever', 'whether', 'which', 'while', 'whither', 'who', 'whoever', 'whole', 'whom', 'whose', 'why', 'will', 'with', 'within', 'without', 'would', 'yet', 'you', 'your', 'yours', 'yourself', 'yourselves', '‘d', '‘ll', '‘m', '‘re', '‘s', '‘ve', '’d', '’ll', '’m', '’re', '’s', '’ve'}
Use spaCy's .lemma_ attribute to reduce each word to its base form (lemma):
lemma_words = "going gone went goes"
nlp_lemma_words = nlp(lemma_words)
[word.lemma_ for word in nlp_lemma_words]
['go', 'go', 'go', 'go']
lemma_words = "has had have"
nlp_lemma_words = nlp(lemma_words)
[word.lemma_ for word in nlp_lemma_words]
['have', 'have', 'have']
lemma_words = "falsely"
nlp_lemma_words = nlp(lemma_words)
[word.lemma_ for word in nlp_lemma_words]
['falsely']
This is especially useful for text classification because lemmatising the text avoids duplicating word forms when building models like bag-of-words.
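As a rough sketch of the effect (the sentence below is just an illustration), counting lemmas instead of surface forms collapses inflected variants like "dog"/"dogs" and "ran"/"running" into a single bag-of-words feature:
from collections import Counter

bow_doc = nlp("The dogs were running while another dog ran after them.")
# count lemmas of the content words only
lemma_counts = Counter(token.lemma_.lower()
                       for token in bow_doc
                       if not (token.is_punct or token.is_space or token.is_stop))
print(lemma_counts)  # roughly: Counter({'dog': 2, 'run': 2})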
Use the .pos_ and .tag_ attributes to get coarse-grained and fine-grained part-of-speech tags:
doc2 = nlp("My dog's toy actually belongs to the neighbor's cat.")
pos_tags = [(i, i.tag_) for i in doc2]
pos_tags
[(My, 'PRP$'), (dog, 'NN'), ('s, 'POS'), (toy, 'NN'), (actually, 'RB'), (belongs, 'VBZ'), (to, 'IN'), (the, 'DT'), (neighbor, 'NN'), ('s, 'POS'), (cat, 'NN'), (., '.')]
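For comparison, .pos_ gives the coarse-grained universal part-of-speech tags for the same tokens (a quick sketch; the exact labels depend on the loaded model):
# coarse-grained universal POS tags for the same sentence
[(token.text, token.pos_) for token in doc2]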
Create a list of owner-possession tuples:
[(i[0].nbor(-1), i[0].nbor(+1)) for i in pos_tags if i[1] == "POS"]
[(dog, toy), (neighbor, cat)]
Resources:
spaCy provides pre-trained word embeddings, which were downloaded along with the English model. spaCy can parse entire blocks of text and assign word vectors using the loaded model. Then, use .vector to get the word vector for a token.
Important note: spaCy's small models (those ending in sm) don't ship with word vectors. You can still use .similarity to compare, but the results won't be as good. To use real word vectors, make sure to download one of the large models:
python -m spacy download en_core_web_lg
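A quick sanity check (a small sketch) confirms that the loaded model actually ships with word vectors:
# vectors table shape, e.g. (rows, 300) for en_core_web_lg
print(nlp.vocab.vectors.shape)
# True when the token has a real word vector
print(nlp("cat")[0].has_vector)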
tokens = nlp(u"cat dog water cloud kitty")
print(tokens[0].text, tokens[0].vector)
cat [-0.15067 -0.024468 -0.23368 -0.23378 -0.18382 0.32711 -0.22084 -0.28777 0.12759 1.1656 -0.64163 -0.098455 -0.62397 0.010431 -0.25653 0.31799 0.037779 1.1904 -0.17714 -0.2595 -0.31461 0.038825 -0.15713 -0.13484 0.36936 -0.30562 -0.40619 -0.38965 0.3686 0.013963 -0.6895 0.004066 -0.1367 0.32564 0.24688 -0.14011 0.53889 -0.80441 -0.1777 -0.12922 0.16303 0.14917 -0.068429 -0.33922 0.18495 -0.082544 -0.46892 0.39581 -0.13742 -0.35132 0.22223 -0.144 -0.048287 0.3379 -0.31916 0.20526 0.098624 -0.23877 0.045338 0.43941 0.030385 -0.013821 -0.093273 -0.18178 0.19438 -0.3782 0.70144 0.16236 0.0059111 0.024898 -0.13613 -0.11425 -0.31598 -0.14209 0.028194 0.5419 -0.42413 -0.599 0.24976 -0.27003 0.14964 0.29287 -0.31281 0.16543 -0.21045 -0.4408 1.2174 0.51236 0.56209 0.14131 0.092514 0.71396 -0.021051 -0.33704 -0.20275 -0.36181 0.22055 -0.25665 0.28425 -0.16968 0.058029 0.61182 0.31576 -0.079185 0.35538 -0.51236 0.4235 -0.30033 -0.22376 0.15223 -0.048292 0.23532 0.46507 -0.67579 -0.32905 0.08446 -0.22123 -0.045333 0.34463 -0.1455 -0.18047 -0.17887 0.96879 -1.0028 -0.47343 0.28542 0.56382 -0.33211 -0.38275 -0.2749 -0.22955 -0.24265 -0.37689 0.24822 0.36941 0.14651 -0.37864 0.31134 -0.28449 0.36948 -2.8174 -0.38319 -0.022373 0.56376 0.40131 -0.42131 -0.11311 -0.17317 0.1411 -0.13194 0.18494 0.097692 -0.097341 -0.23987 0.16631 -0.28556 0.0038654 0.53292 -0.32367 -0.38744 0.27011 -0.34181 -0.27702 -0.67279 -0.10771 -0.062189 -0.24783 -0.070884 -0.20898 0.062404 0.022372 0.13408 0.1305 -0.19546 -0.46849 0.77731 -0.043978 0.3827 -0.23376 1.0457 -0.14371 -0.3565 -0.080713 -0.31047 -0.57822 -0.28067 -0.069678 0.068929 -0.16227 -0.63934 -0.62149 0.11222 -0.16969 -0.54637 0.49661 0.46565 0.088294 -0.48496 0.69263 -0.068977 -0.53709 0.20802 -0.42987 -0.11921 0.1174 -0.18443 0.43797 -0.1236 0.3607 -0.19608 -0.35366 0.18808 -0.5061 0.14455 -0.024368 -0.10772 -0.0115 0.58634 -0.054461 0.0076487 -0.056297 0.27193 0.23096 -0.29296 -0.24325 0.10317 -0.10014 0.7089 0.17402 -0.0037509 -0.46304 0.11806 -0.16457 -0.38609 0.14524 0.098122 -0.12352 -0.1047 0.39047 -0.3063 -0.65375 -0.0044248 -0.033876 0.037114 -0.27472 0.0053147 0.30737 0.12528 -0.19527 -0.16461 0.087518 -0.051107 -0.16323 0.521 0.10822 -0.060379 -0.71735 -0.064327 0.37043 -0.41054 -0.2728 -0.30217 0.015771 -0.43056 0.35647 0.17188 -0.54598 -0.21541 -0.044889 -0.10597 -0.54391 0.53908 0.070938 0.097839 0.097908 0.17805 0.18995 0.49962 -0.18529 0.051234 0.019574 0.24805 0.3144 -0.29304 0.54235 0.46672 0.26017 -0.44705 0.28287 -0.033345 -0.33181 -0.10902 -0.023324 0.2106 -0.29633 0.81506 0.038524 0.46004 0.17187 -0.29804 ]
Now we can use the word vectors from spaCy to compare the similarity of the words using .similarity.
for token1 in tokens:
    for token2 in tokens:
        print(token1.text, token2.text, token1.similarity(token2))
cat cat 1.0
cat dog 0.8016855
cat water 0.2888436
cat cloud 0.16586679
cat kitty 0.7888994
dog cat 0.8016855
dog dog 1.0
dog water 0.30933863
dog cloud 0.1380703
dog kitty 0.6306644
water cat 0.2888436
water dog 0.30933863
water water 1.0
water cloud 0.3084506
water kitty 0.22237565
cloud cat 0.16586679
cloud dog 0.1380703
cloud water 0.3084506
cloud cloud 1.0
cloud kitty 0.14712334
kitty cat 0.7888994
kitty dog 0.6306644
kitty water 0.22237565
kitty cloud 0.14712334
kitty kitty 1.0
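Under the hood, .similarity between two tokens is by default the cosine similarity of their vectors; a minimal numpy sketch reproduces the numbers above:
import numpy as np

def cosine(u, v):
    # cosine similarity between two word vectors
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

cat, dog = nlp("cat dog")
print(cosine(cat.vector, dog.vector))  # close to the cat/dog value above (~0.80)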
from spacy.matcher import Matcher
from spacy.tokens import Span
matcher = Matcher(nlp.vocab)
Define a pattern and add it to the matcher. LOWER indicates that the token's lowercase form is matched.
# define the pattern
pattern = [{'LOWER': 'computer', 'POS': 'NOUN'},
           {'POS': {'NOT_IN': ['VERB']}}]

# add the pattern to the previously created matcher object
matcher.add("Matching", None, pattern)
text = "Computer programming is the process of writing instructions that get executed by computers. The instructions, also known as code, are written in a programming language which the computer can understand and use to perform a task or solve a problem. Basic computer programming involves the analysis of a problem and development of a logical sequence of instructions to solve it. There can be numerous paths to a solution and the computer programmer seeks to design and code that which is most efficient. Among the programmer’s tasks are understanding requirements, determining the right programming language to use, designing or architecting the solution, coding, testing, debugging and writing documentation so that the solution can be easily understood by other programmers.Computer programming is at the heart of computer science. It is the implementation portion of software development, application development and software engineering efforts, transforming ideas and theories into actual, working solutions."
doc = nlp(text)
matches = matcher(doc)

# print the matched results and extract out the results
for match_id, start, end in matches:
    # nlp.vocab.strings[match_id]
    span = doc[start:end]
    print("Indexes:", start, end, span.text)
Indexes: 0 2 Computer programming
Indexes: 45 47 computer programming
Indexes: 75 77 computer programmer
Indexes: 131 133 Computer programming
Indexes: 138 140 computer science
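The match_id returned by the matcher is a hash of the name passed to matcher.add; it can be mapped back to the original string through the vocab's string store, e.g.:
for match_id, start, end in matches:
    # look up the string label ("Matching") behind the hash
    label = nlp.vocab.strings[match_id]
    print(label, doc[start:end].text)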
The PhraseMatcher allows us to match specific phrases and combinations of words.
from spacy.matcher import PhraseMatcher
matcher2 = PhraseMatcher(nlp.vocab, attr='LOWER')

# the list containing the phrases to be matched
terminology_list = ["Machine learning", "Hidden Structure",
                    "Unlabeled Data"]
patterns = [nlp.make_doc(text) for text in terminology_list]

# add the patterns to the matcher object without any callbacks
matcher2.add("Phrase Matching", None, *patterns)
# the input text string is converted to a Document object
doc2 = nlp("Supervised machine learning algorithms can apply what has been learned in the past to new data using labeled examples to predict future events. Starting from the analysis of a known training dataset, the learning algorithm produces an inferred function to make predictions about the output values. The system is able to provide targets for any new input after sufficient training. The learning algorithm can also compare its output with the correct, intended output and find errors in order to modify the model accordingly. In contrast, unsupervised machine learning algorithms are used when the information used to train is neither classified nor labeled. Unsupervised learning studies how systems can infer a function to describe a hidden structure from unlabeled data. The system doesn’t figure out the right output, but it explores the data and can draw inferences from datasets to describe hidden structures from unlabeled data. Semi-supervised machine learning algorithms fall somewhere in between supervised and unsupervised learning, since they use both labeled and unlabeled data for training – typically a small amount of labeled data and a large amount of unlabeled data. The systems that use this method are able to considerably improve learning accuracy. Usually, semi-supervised learning is chosen when the acquired labeled data requires skilled and relevant resources in order to train it / learn from it. Otherwise, acquiring unlabeled data generally doesn’t require additional resources.")#call the matcher object the document object and it will return #match_id, start and stop indexes of the matched words
matches2 = matcher2(doc2)#print the matched results and extract out the results
for match_id, start, end in matches2:
    span = doc2[start:end]
    print("Indexes:", start, end, span.text)
Indexes: 1 3 machine learning
Indexes: 93 95 machine learning
Indexes: 122 124 hidden structure
Indexes: 125 127 unlabeled data
Indexes: 154 156 unlabeled data
Indexes: 160 162 machine learning
Indexes: 178 180 unlabeled data
Indexes: 195 197 unlabeled data
Indexes: 243 245 unlabeled data