This notebook shows how you can explore aspects of a text using the Natural Language Toolkit (NLTK).
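Some of the things you can do include tokenizing a text, generating concordances, counting word frequencies, filtering out stopwords, finding collocations and similar words, and searching with regular expressions.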
For more on NLTK see the online version of the book Natural Language Processing with Python.
Before we can analyze a text we need to load it in and tokenize it.
Before you can use NLTK you need to make sure it is installed. The Anaconda Navigator installs NLTK by default, but you can always test whether it is installed by running import nltk. Try it. It will give you an error if you don't have it.
import nltk
If you don't have it there are different ways to install it.
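If you would rather check for NLTK without triggering an error, the following minimal sketch uses only the standard library. The variable name nltk_available is made up for this illustration, and the install commands in the comments are the usual ones for pip and conda, to be run in a terminal rather than in Python.

```python
import importlib.util

# find_spec returns None when the package cannot be found,
# so this check does not raise ImportError.
nltk_available = importlib.util.find_spec("nltk") is not None

if nltk_available:
    print("NLTK is installed.")
else:
    # Typical install commands (run in a terminal, not in Python):
    #   pip install nltk
    #   conda install nltk
    print("NLTK is not installed.")
```

Some NLTK features also need data packages that are fetched separately with nltk.download(...), for example nltk.download("stopwords") for the stopword list used later in this notebook.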
Now we will get a text to process with NLTK.
First we see what text files we have.
ls *.txt
Hume Enquiry.txt negative.txt positive.txt Hume Treatise.txt obama_tweets.txt
We are going to use the "Hume Enquiry.txt" from the Gutenberg Project. You can use whatever text you want. We print the first 50 characters to check.
theText2Use = "Hume Enquiry.txt"
with open(theText2Use, "r") as fileToRead:
    theString = fileToRead.read()
print("This string has", len(theString), "characters.")
print(theString[:50])
This string has 366798 characters.
The Project Gutenberg EBook of An Enquiry Concerni
Now we tokenize the text using NLTK's tokenizer, producing a list called "listOfTokens", and check the first words. Note that the NLTK tokenizer doesn't eliminate punctuation and doesn't lowercase the words. You can tokenize using another method if you want. Then we create an NLTK text object from the tokens. Note how the text object behaves like a list of tokens.
listOfTokens = nltk.word_tokenize(theString)
theText = nltk.Text(listOfTokens)
print(listOfTokens[:50])
['The', 'Project', 'Gutenberg', 'EBook', 'of', 'An', 'Enquiry', 'Concerning', 'Human', 'Understanding', ',', 'by', 'David', 'Hume', 'and', 'L.', 'A.', 'Selby-Bigge', 'This', 'eBook', 'is', 'for', 'the', 'use', 'of', 'anyone', 'anywhere', 'at', 'no', 'cost', 'and', 'with', 'almost', 'no', 'restrictions', 'whatsoever', '.', 'You', 'may', 'copy', 'it', ',', 'give', 'it', 'away', 'or', 're-use', 'it', 'under', 'the']
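As an illustration of "another method", here is a minimal regular-expression tokenizer that needs only the standard library. The function name simple_tokenize and the pattern are assumptions made for this sketch, not part of NLTK, and the results will differ from NLTK's in places (for example, it splits "L." into "L" and ".").

```python
import re

def simple_tokenize(text):
    """Split text into word tokens and single punctuation marks."""
    # \w+(?:[-']\w+)* keeps hyphenated words and contractions together;
    # [^\w\s] picks up each remaining punctuation character on its own.
    return re.findall(r"\w+(?:[-']\w+)*|[^\w\s]", text)

sample = "You may copy it, give it away or re-use it."
sample_tokens = simple_tokenize(sample)
print(sample_tokens)
```

A list produced this way can be passed to nltk.Text(...) in the same way as the output of nltk.word_tokenize.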
Now we get a concordance for a word in one line. Note that we can control the width of the concordances. Edit the word to explore.
theText.concordance("the", width=100)
Displaying 25 of 3499 matches: The Project Gutenberg EBook of An Enquiry Concernin the use of anyone anywhere at no cost and with almo u may copy it , give it away or re-use it under the terms of the Project Gutenberg License included , give it away or re-use it under the terms of the Project Gutenberg License included with this eB AVID HUME Extracted from : Enquiries Concerning the Human Understanding , and Concerning the Princi erning the Human Understanding , and Concerning the Principles of Morals , By David Hume . Reprinte ples of Morals , By David Hume . Reprinted from The Posthumous Edition of 1777 , and Edited with In Oxford . Second Edition , 1902 CONTENTS I . Of the different Species of Philosophy II . Of the Ori Of the different Species of Philosophy II . Of the Origin of Ideas III . Of the Association of Ide Philosophy II . Of the Origin of Ideas III . Of the Association of Ideas IV . Sceptical Doubts conc ation of Ideas IV . Sceptical Doubts concerning the Operations of the Understanding V. Sceptical So . Sceptical Doubts concerning the Operations of the Understanding V. Sceptical Solution of these Do on of these Doubts VI . Of Probability VII . Of the Idea of necessary Connexion VIII . Of Liberty a nnexion VIII . Of Liberty and Necessity IX . Of the Reason of Animals X . Of Miracles XI . Of a par cular Providence and of a future State XII . Of the academical or sceptical Philosophy INDEX SECTIO al or sceptical Philosophy INDEX SECTION I . OF THE DIFFERENT SPECIES OF PHILOSOPHY . 1 . Moral phi ECIES OF PHILOSOPHY . 1 . Moral philosophy , or the science of human nature , may be treated after has its peculiar merit , and may contribute to the entertainment , instruction , and reformation o nt , instruction , and reformation of mankind . 
The one considers man chiefly as born for action ; ne object , and avoiding another , according to the value which these objects seem to possess , and hese objects seem to possess , and according to the light in which they present themselves . As vir . As virtue , of all objects , is allowed to be the most valuable , this species of philosophers pa ble , this species of philosophers paint her in the most amiable colours ; borrowing all helps from s manner , and such as is best fitted to please the imagination , and engage the affections . They t fitted to please the imagination , and engage the affections . They select the most striking obse
Note that concordance is not case sensitive. This will give you a concordance of both capitalized and lowercase words.
By default only 25 matches are displayed. If you want more lines, set the lines parameter.
theText.concordance("the", lines=30)
Displaying 30 of 3499 matches: The Project Gutenberg EBook of An Enquiry d L. A. Selby-Bigge This eBook is for the use of anyone anywhere at no cost and it , give it away or re-use it under the terms of the Project Gutenberg Licens away or re-use it under the terms of the Project Gutenberg License included wi Extracted from : Enquiries Concerning the Human Understanding , and Concerning Human Understanding , and Concerning the Principles of Morals , By David Hume rals , By David Hume . Reprinted from The Posthumous Edition of 1777 , and Edit Second Edition , 1902 CONTENTS I . Of the different Species of Philosophy II . fferent Species of Philosophy II . Of the Origin of Ideas III . Of the Associat II . Of the Origin of Ideas III . Of the Association of Ideas IV . Sceptical D deas IV . Sceptical Doubts concerning the Operations of the Understanding V. Sc l Doubts concerning the Operations of the Understanding V. Sceptical Solution o e Doubts VI . Of Probability VII . Of the Idea of necessary Connexion VIII . Of II . Of Liberty and Necessity IX . Of the Reason of Animals X . Of Miracles XI idence and of a future State XII . Of the academical or sceptical Philosophy IN tical Philosophy INDEX SECTION I . OF THE DIFFERENT SPECIES OF PHILOSOPHY . 1 . HILOSOPHY . 1 . Moral philosophy , or the science of human nature , may be trea eculiar merit , and may contribute to the entertainment , instruction , and ref uction , and reformation of mankind . The one considers man chiefly as born for , and avoiding another , according to the value which these objects seem to pos ts seem to possess , and according to the light in which they present themselve e , of all objects , is allowed to be the most valuable , this species of philo species of philosophers paint her in the most amiable colours ; borrowing all and such as is best fitted to please the imagination , and engage the affectio o please the imagination , and engage the affections . 
They select the most str d engage the affections . They select the most striking observations and instan roper contrast ; and alluring us into the paths of virtue by the views of glory luring us into the paths of virtue by the views of glory and happiness , direct , direct our steps in these paths by the soundest precepts and most illustriou trious examples . They make us _feel_ the difference between vice and virtue ;
One annoyance is that you can't easily save a concordance to a file, because the concordance method of the NLTK text object prints its results to the screen for exploration rather than returning them. To save the results you will need to copy and paste them into a word processor or text editor.
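If you need concordance lines in a file, one option is to build them yourself as strings. The sketch below is a simplified keyword-in-context routine written in plain Python. The function kwic_lines, its parameters and the output file name are all made up for this illustration, and its output is formatted differently from NLTK's. Recent NLTK releases also appear to offer a concordance_list method on text objects that returns the lines instead of printing them, so check whether your installed version has it.

```python
import os
import tempfile

def kwic_lines(tokens, target, context=5, max_lines=25):
    """Return keyword-in-context lines for target as a list of strings."""
    target = target.lower()
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() == target:  # case-insensitive match
            left = " ".join(tokens[max(0, i - context):i])
            right = " ".join(tokens[i + 1:i + 1 + context])
            lines.append(f"{left} [{token}] {right}".strip())
            if len(lines) == max_lines:
                break
    return lines

demo_tokens = "Of the Origin of Ideas . Of the Association of Ideas".split()
demo_lines = kwic_lines(demo_tokens, "ideas", context=2)
print(demo_lines)

# Write the lines to a text file (a temporary path is used for illustration).
out_path = os.path.join(tempfile.gettempdir(), "concordance_demo.txt")
with open(out_path, "w", encoding="utf-8") as out_file:
    out_file.write("\n".join(demo_lines))
```

With the notebook's own data you would call kwic_lines(listOfTokens, "truth") and choose your own output path.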
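A dispersion plot shows where in the text each word occurs. Unlike concordance, both dispersion_plot and count are case sensitive, so "Truth" and "truth" are treated as separate words.
%matplotlib inline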
theText.dispersion_plot(["Truth","truth"])
print(theText.count("Truth"), " ", theText.count("truth"))
1 20
To make the count case insensitive we are going to use a list comprehension to lowercase every token and get a new list of tokens. We are also going to get rid of punctuation by keeping only the tokens that begin with a letter (the if condition in the comprehension). Then we can count things in the list.
theLowerTokens = [token.lower() for token in listOfTokens if token[0].isalpha()]
print(theLowerTokens[:20])
['the', 'project', 'gutenberg', 'ebook', 'of', 'an', 'enquiry', 'concerning', 'human', 'understanding', 'by', 'david', 'hume', 'and', 'l.', 'a.', 'selby-bigge', 'this', 'ebook', 'is']
theLowerTokens.count("truth")
21
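It helps to see exactly what the first-character test keeps and drops. The short sketch below applies the same condition to a handful of sample tokens of the kinds that appear in this text, and compares it with a stricter variant that requires every character to be a letter. The variable names are made up for this illustration.

```python
sample_tokens = ["The", ",", "L.", "Selby-Bigge", "1777", "_feel_", "Truth", "truth"]

# Same filter as above: keep tokens whose FIRST character is a letter.
first_char_filter = [t.lower() for t in sample_tokens if t[0].isalpha()]

# Stricter variant: keep tokens made up ONLY of letters.
all_alpha_filter = [t.lower() for t in sample_tokens if t.isalpha()]

print(first_char_filter)
print(all_alpha_filter)
```

Note that the first-character test keeps "l." and "selby-bigge" but drops words wrapped in underscores such as "_feel_", while the stricter test also drops abbreviations and hyphenated words. Which one is appropriate depends on what you want to count.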
With NLTK we can get word frequencies. These can be displayed as a table. We can then do other things with the frequency distribution object.
theLowerFreqs = nltk.FreqDist(theLowerTokens)
theLowerFreqs.tabulate(15)
the of and to a in that is it which or be we by from 3499 2848 2210 1809 1165 1117 1002 955 786 750 711 674 663 564 529
theLowerFreqs["truth"]
21
Rather than get the count we can get the relative frequency which is the count divided by the number of tokens.
theLowerFreqs.freq("the")
0.058443293803240356
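The arithmetic behind counts and relative frequencies can be sketched with Python's collections.Counter, which NLTK's FreqDist builds on. The token list here is a made-up miniature example. The last lines work backwards from the two figures reported above for "the" to the implied number of tokens.

```python
from collections import Counter

tokens = ["the", "truth", "of", "the", "matter", "is", "the", "truth"]
counts = Counter(tokens)

total = sum(counts.values())      # number of tokens counted
relative = counts["the"] / total  # count divided by number of tokens

print(counts.most_common(3))
print(relative)

# Working backwards from the figures above: 3499 occurrences of "the"
# at a relative frequency of 0.058443293803240356 implies this many tokens.
implied_total = round(3499 / 0.058443293803240356)
print(implied_total)
```

You can compare the implied total with len(theLowerTokens) in the notebook.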
We can also plot the high frequency words.
%matplotlib inline
theLowerFreqs.plot(30)
The most frequent words are mostly function words such as "the" and "of" that say little about the content of the text. NLTK provides a list of English stopwords that we can use to filter them out.
stopwords = nltk.corpus.stopwords.words("english")
print(stopwords[:20])
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', 'her', 'hers']
We need to create a new list of tokens without the stopwords.
theLowerContentWords = [token for token in theLowerTokens if token not in stopwords]
theLowerContentWords[:10]
['project', 'gutenberg', 'ebook', 'enquiry', 'concerning', 'human', 'understanding', 'david', 'hume', 'l.']
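With tens of thousands of tokens, testing each one against a list means scanning that list over and over. A common refinement is to convert the stopwords to a set first, which makes each membership test much cheaper. The sketch below uses a small made-up stopword list standing in for NLTK's English list.

```python
# Hypothetical mini stopword list standing in for NLTK's English list.
stopword_list = ["the", "of", "and", "is", "it"]
stopword_set = set(stopword_list)  # set lookups do not scan the whole list

tokens = ["the", "science", "of", "human", "nature", "is", "it"]
content_words = [t for t in tokens if t not in stopword_set]
print(content_words)
```

In the notebook this would amount to building set(stopwords) once and using it in the comprehension; the result is the same, only faster.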
Now we can create a table of high frequency content words.
theLowerContFreqs = nltk.FreqDist(theLowerContentWords)
theLowerContFreqs.tabulate(10)
may one nature must us experience cause human mind never 295 203 200 177 169 166 157 149 145 125
Words like "may", "one" and "must" are still not very informative, so we add a few more stopwords of our own and filter again.
moreStopwords = ["may","one","must","us","never","every"]
theLowerContentWords = [token for token in theLowerContentWords if token not in moreStopwords]
theLowerContFreqs = nltk.FreqDist(theLowerContentWords)
theLowerContFreqs.tabulate(10)
nature experience cause human mind effect ideas objects idea reason 200 166 157 149 145 124 120 120 116 116
Now we plot the updated frequency distribution.
theLowerContFreqs.plot(30)
We might also want to check how these words are used by looking at their concordance.
theText.concordance("power", width=80, lines=5)
Displaying 5 of 103 matches: in the subdividing and balancing of power ; the lawyer more method and finer p n , which not only escapes all human power and authority , but is not even rest ceived ; nor is any thing beyond the power of thought , except what implies an limits , and that all this creative power of the mind amounts to no more than o show distinctly the action of that power , which produces any single effect i
theText.collocations(10)
Project Gutenberg-tm; Project Gutenberg; Literary Archive; Gutenberg- tm electronic; common life; Archive Foundation; electronic works; Gutenberg Literary; sensible qualities; United States
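Note how we are getting a lot of bigrams with "Gutenberg". That's because NLTK looks for bigrams (pairs of adjacent words) that appear together more often than you would expect from how often each word appears on its own, and the Project Gutenberg license text repeats such pairs many times. If you ask for more collocations you can see some that have to do with the text.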
theText.collocations(100)
Project Gutenberg-tm; Project Gutenberg; Literary Archive; Gutenberg- tm electronic; common life; Archive Foundation; electronic works; Gutenberg Literary; sensible qualities; United States; external objects; human nature; set forth; human testimony; voluntary actions; electronic work; necessary connexion; public domain; secret powers; Gutenberg-tm License; regular conjunction; human life; _in infinitum_; reasonings concerning; usual attendant; one object; constantly conjoined; David Hume; universally allowed; human understanding; seems evident; concerning matter; Human Understanding; copyright holder; take place; simple ideas; real existence; every moment; may observe; shall find; certain degree; infinitely less; CONCERNING HUMAN; ENQUIRY CONCERNING; HUMAN UNDERSTANDING; PROJECT GUTENBERG; past experience; Enquiry Concerning; one event; give rise; good fortune; conjoined together; human actions; customary transition; infinite number; must confess; constant conjunction; common sense; narrow limits; mutual destruction; strictly examined; first appearance; conclusions concerning; one instance; primary qualities; Concerning Human; mental geography; physical points; experimental reasoning; universally acknowledged; natural instinct; inward sentiment; universal doubt; paragraph 1.F.3; two kinds; may serve; new effects; uniform experience; usual course; Distributed Proofreaders; Jonathan Ingram; Plain Vanilla; Vanilla ASCII; _Christian Religion_; _necessary connexion_; _vis inertiae_; secondary qualities; natural events; derivative works; Gutenberg-tm trademark; greater variety; well known; distinctly conceived; may seem; strong presumption; divine existence; similar instances; draw inferences; human reason; human action
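The intuition behind collocation scoring can be sketched with pointwise mutual information (PMI), which compares how often a pair occurs with how often it would occur if the two words were independent. This is not NLTK's scoring function; it is a simpler stand-in for the same idea, and the function name bigram_pmi, the min_count parameter and the toy token list are assumptions made for this illustration.

```python
import math
from collections import Counter

def bigram_pmi(tokens, min_count=2):
    """Score each bigram seen at least min_count times by PMI (log base 2)."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_tokens = len(tokens)
    n_bigrams = len(tokens) - 1
    scores = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue  # rare pairs give unreliable scores
        p_pair = count / n_bigrams
        p_w1 = unigrams[w1] / n_tokens
        p_w2 = unigrams[w2] / n_tokens
        scores[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))
    return scores

tokens = "project gutenberg is here and is there and is project gutenberg and is".split()
scores = bigram_pmi(tokens)
for pair, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(pair, round(score, 2))
```

In this toy list "and is" occurs more often than "project gutenberg", but "project gutenberg" scores higher because its two words rarely occur apart. The same effect pushes the Gutenberg boilerplate to the top of the collocations above.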
We can get words that are similar to target words. These are not synonyms but words being used in similar contexts. You can use this to expand on a word you are interested in.
theText.similar("truth")
cause reason nature men it ideas necessity mankind action objects conduct them body power experience resemblance first miracles science life
You can use this to get concordances of sets of similar words.
listOfWords2Conc = ["reason","fact","knowledge","ideas"]
for i in listOfWords2Conc:
    print(i.upper() + ": ")
    theText.concordance(i, width=80, lines=5)
    print("--------------------------------------------------\n")
REASON: Displaying 5 of 116 matches: Of Liberty and Necessity IX . Of the Reason of Animals X . Of Miracles XI . Of a eigns . 7 . But is this a sufficient reason , why philosophers should desist fro iscover the proper province of human reason . For , besides , that many persons er parts of nature . And there is no reason to despair of equal success in our e RT I . 20 . All the objects of human reason or enquiry may naturally be divided -------------------------------------------------- FACT: Displaying 5 of 89 matches: tainty and evidence . 21 . Matters of fact , which are the second objects of hum ing . The contrary of every matter of fact is still possible ; because it can ne s of any real existence and matter of fact , beyond the present testimony of our . All reasonings concerning matter of fact seem to be founded on the relation of a man , why he believes any matter of fact , which is absent ; for instance , th -------------------------------------------------- KNOWLEDGE: Displaying 5 of 37 matches: prehension , possesses an accurate knowledge of the internal fabric , the opera make any addition to our stock of knowledge , in subjects of such unspeakable letter received from him , or the knowledge of his former resolutions and prom must enquire how we arrive at the knowledge of cause and effect . I shall vent admits of no exception , that the knowledge of this relation is not , in any i -------------------------------------------------- IDEAS: Displaying 5 of 120 matches: of Philosophy II . Of the Origin of Ideas III . Of the Association of Ideas IV of Ideas III . Of the Association of Ideas IV . Sceptical Doubts concerning the rror ! SECTION II . OF THE ORIGIN OF IDEAS . 11 . Every one will readily allow d impressions are distinguished from ideas , which are the less lively percepti untain , we only join two consistent ideas , _gold_ , and _mountain_ , with whi --------------------------------------------------
NLTK can show us the contexts that two or more words have in common. Each context is shown as the word before and the word after, joined by an underscore.
theText.common_contexts(["nature", "experience"],10)
human_it from_and of_are in_and of_but this_he by_that of_which of_in the_and
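The notion of a shared context can be sketched in plain Python by recording, for every word, the pairs of words that surround it and then intersecting those sets. This is a simplified illustration of the idea rather than NLTK's implementation; the function contexts_by_word and the toy token list are made up for this example.

```python
from collections import defaultdict

def contexts_by_word(tokens):
    """Map each lowercased word to the set of (previous, next) word pairs around it."""
    contexts = defaultdict(set)
    lowered = [t.lower() for t in tokens]
    # The first and last tokens are skipped because they lack a neighbour on one side.
    for i in range(1, len(lowered) - 1):
        contexts[lowered[i]].add((lowered[i - 1], lowered[i + 1]))
    return contexts

tokens = "the course of nature and the course of experience and from nature and".split()
ctx = contexts_by_word(tokens)

# Contexts in which both words occur, printed in the left_right style used above.
shared = ctx["nature"] & ctx["experience"]
shared_labels = sorted("_".join(pair) for pair in shared)
print(shared_labels)
```

Words that share many contexts in this sense are the kind of words that similar reports.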
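We can use regular expressions on tokens with the findall method of the Text object. Some guidelines:
Each token in the pattern is wrapped in angle brackets, as in <experience>.
<.*> matches any token, as . means any character and * means 0 or more of.
? means 0 or 1 of, so <.*>? matches an optional token.
Parentheses select the part of the match that is displayed.
Here are some examples.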
theText.findall("(<.*>)<experience>")
and; from; by; by; to; without; by; not; and; ,; assist; by; to; from; have; that; this; past; from; from; of; to; of; of; by; from; all; from; past; by; from; more; more; this; his; from; from; and; of; and; pure; is; farther; of; our; from; daily; any; and; from; without; from; by; besides; of; from; And; common; by; we; by; except; certain; by; and; and; fancied; of; by; without; have; this; have; this; uniform; and; that; of; no; and; past; past; the; our; seeming; from; and; not; even; past; greater; 's; and; Though; to; of; infallible; past; our; by; past; the; from; this; of; uniform; his; have; uniform; unalterable; from; uniform; uniform; no; we; regular; is; same; of; the; and; from; past; my; human; make; same; by; in; other; from; of; If; By; here; any; from; our; and; on; only; by; from; from; by; from; long; by; from; from; and; of; by; uniform; and; of; by; .; to; for; only; from; .
theText.findall("<.*><.*><nature>")
of human nature; regard human nature; , that nature; into the nature; of human nature; derived from nature; parts of nature; concerning human nature; limits of nature; , where nature; concerning their nature; triangle in nature; a like nature; is the nature; the same nature; of this nature; concerning the nature; course of nature; course of nature; laws of nature; discover in nature; established by nature; is the nature; , that nature; of their nature; course of nature; of human nature; similarity which nature; Of what nature; course of nature; learned the nature; Their secret nature; and transitory nature; as human nature; of human nature; priori_ the nature; of human nature; of human nature; of this nature; accurately the nature; excited by nature; the whole nature; the peculiar nature; observed that nature; the same nature; a similar nature; course of nature; works of nature; wisdom of nature; . As nature; the very nature; contrivance of nature; constitutes the nature; irregularity in nature; how soon nature; productions in nature; in all nature; and the nature; with the nature; and the nature; with the nature; scenes of nature; operations of nature; powers of nature; appears in nature; force in nature; author of nature; They rob nature; throughout all nature; course of nature; of this nature; laws of nature; scenes of nature; operations of nature; operations of nature; that human nature; of human nature; with the nature; course of nature; of human nature; of human nature; part of nature; characters which nature; course of nature; part of nature; laws of nature; of human nature; part of nature; the same nature; the inflexible nature; but their nature; of human nature; a similar nature; powers of nature; being in nature; their very nature; of that nature; phenomena of nature; system of nature; formed by nature; intention of nature; of the nature; course of nature; of this nature; uniformity of nature; hand of nature; a like nature; in human nature; state of 
nature; is placing nature; course of nature; laws of nature; the very nature; laws of nature; course of nature; from the nature; laws of nature; laws of nature; laws of nature; contrary to nature; law of nature; not its nature; in human nature; frame of nature; from human nature; the public nature; _singular_ a nature; of this nature; or miraculous nature; by the nature; of this nature; laws of nature; the very nature; laws of nature; course of nature; dissolution of nature; laws of nature; course of nature; laws of nature; extraordinary in nature; of human nature; of human nature; order of nature; phenomena of nature; phenomena in nature; appearances of nature; appearances of nature; course of nature; course of nature; course of nature; order of nature; course of nature; course of nature; course of nature; course of nature; course of nature; order of nature; laws which nature; with the nature; works of nature; works of nature; Author of nature; course of nature; In human nature; course of nature; delicate a nature; particular a nature; of this nature; a like nature; from the nature; instinct of nature; instincts of nature; instinct of nature; contrary a nature; a like nature; propensities of nature; a like nature; of our nature; of our nature; necessities of nature; in human nature; situation of nature; us the nature; course of nature; entrusted by nature; production in nature; . Human nature; course of nature; in human nature; of human nature; course of nature; , for nature; design in nature; appears in nature; match for nature
theText.findall("(<.*><.*>)<truth>")
talk of; is a; and a; , and; of their; the same; love of; . The; discovery of; for the; for the; inclination to; distinguish between; violation of; violations of; depart from; to reach; _criteria_ of; with great; love of
theText.findall("<not><.*>?<true>")
not universally true; not a true; not true
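To see why the angle brackets matter, here is a simplified sketch of how a token-level pattern can be turned into an ordinary regular expression: the tokens are joined into one string with each token wrapped in angle brackets, and the pattern is rewritten so that . can never run past the end of a token. The function token_findall is made up for this illustration. It is case sensitive, returns the whole match rather than a parenthesised part, and does not support a literal dot in the pattern or tokens that contain angle brackets, so treat it as an approximation of the idea rather than a description of NLTK's code.

```python
import re

def token_findall(tokens, pattern):
    """Find token sequences matching a pattern written like <not><.*>?<true>."""
    # Wrap every token in angle brackets and join them into one string.
    text = "".join("<" + tok + ">" for tok in tokens)
    # Turn each <...> into a group so that ? and * apply to whole tokens.
    regex = pattern.replace("<", "(?:<(?:").replace(">", ")>)")
    # Stop '.' from matching across token boundaries.
    regex = regex.replace(".", "[^<>]")
    results = []
    for match in re.finditer(regex, text):
        words = re.findall(r"<([^<>]*)>", match.group(0))
        results.append(" ".join(words))
    return results

demo_tokens = ["this", "is", "not", "true", "and", "not", "a", "true", "story"]
optional_hits = token_findall(demo_tokens, "<not><.*>?<true>")
any_token_hits = token_findall(["laws", "of", "nature"], "<.*><nature>")
print(optional_hits)
print(any_token_hits)
```

The first pattern finds "not true" as well as "not a true" because <.*>? allows zero or one token between the two fixed words.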
CC BY-SA From The Art of Literary Text Analysis by Stéfan Sinclair & Geoffrey Rockwell. Edited and revised by Melissa Mony.
Created October 10th, 2016 (Jupyter 4.2.1)