This notebook introduces the Natural Language Toolkit (NLTK).
The first step is to import the nltk library and to load some example texts.
import nltk
from nltk.book import *
*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908
text1.dispersion_plot(['Ahab','whale','Ishmael','Queequeg', 'Moby', 'dive'])
text4.dispersion_plot(["citizens", "democracy", "freedom", "duties", "America"])
text1.concordance('Ishmael')
Building index... Displaying 20 of 20 matches: SONG . CHAPTER 1 Loomings . Call me Ishmael . Some years ago -- never mind how ED STATES . " WHALING VOYAGE BY ONE ISHMAEL . " BLOODY BATTLE IN AFFGHANISTAN . f silver ,-- So , wherever you go , Ishmael , said I to myself , as I stood in de to lodge for the night , my dear Ishmael , be sure to inquire the price , an nkling glasses within . But go on , Ishmael , said I at last ; don ' t you hear g and teeth - gnashing there . Ha , Ishmael , muttered I , backing out , Wretch emen who had gone before me . Yes , Ishmael , the same fate may be thine . But ? thought I . Do you suppose now , Ishmael , that the magnanimous God of heave l , which , if left to myself , I , Ishmael , should infallibly light upon , fo Bildad . Now then , my young man , Ishmael ' s thy name , didn ' t ye say ? We say ? Well then , down ye go here , Ishmael , for the three hundredth lay ." " why don ' t you speak ? It ' s I -- Ishmael ." But all remained still as before l fear ! CHAPTER 41 Moby Dick . I , Ishmael , was one of that crew ; my shouts lain , would be to dive deeper than Ishmael can go . The subterranean miner tha oul ; thou surrenderest to a hypo , Ishmael . Tell me , why this strong young c snows of prairies ; all these , to Ishmael , are as the shaking of that buffal ubtle meanings , how may unlettered Ishmael hope to read the awful Chaldee of t onditional skeleton . But how now , Ishmael ? How is it , that you , a mere oar for exhibition ? Explain thyself , Ishmael . Can you land a full - grown whale le witness have you hitherto been , Ishmael ; but have a care how you seize the
text1.similar('Ishmael')
Building word-context index...
ahab did in it what about admit am and been bound bury could do except extreme for goes guess had
text1.similar('whale')
ship boat sea time captain deck man pequod world other whales air crew head water line thing side way body
text1.collocations()
Building collocations list
Sperm Whale; Moby Dick; White Whale; old man; Captain Ahab; sperm whale; Right Whale; Captain Peleg; New Bedford; Cape Horn; cried Ahab; years ago; lower jaw; never mind; Father Mapple; cried Stubb; chief mate; white whale; ivory leg; one hand
text1.concordance("monstrous")
Displaying 11 of 11 matches: ong the former , one was of a most monstrous size . ... This came towards us , ON OF THE PSALMS . " Touching that monstrous bulk of the whale or ork we have r ll over with a heathenish array of monstrous clubs and spears . Some were thick d as you gazed , and wondered what monstrous cannibal and savage could ever hav that has survived the flood ; most monstrous and most mountainous ! That Himmal they might scout at Moby Dick as a monstrous fable , or still worse and more de th of Radney .'" CHAPTER 55 Of the Monstrous Pictures of Whales . I shall ere l ing Scenes . In connexion with the monstrous pictures of whales , I am strongly ere to enter upon those still more monstrous stories of them which are to be fo ght have been rummaged out of this monstrous cabinet there is no telling . But of Whale - Bones ; for Whales of a monstrous size are oftentimes cast up dead u
text2.concordance("monstrous")
Building index... Displaying 11 of 11 matches: . " Now , Palmer , you shall see a monstrous pretty girl ." He immediately went your sister is to marry him . I am monstrous glad of it , for then I shall have ou may tell your sister . She is a monstrous lucky girl to get him , upon my ho k how you will like them . Lucy is monstrous pretty , and so good humoured and Jennings , " I am sure I shall be monstrous glad of Miss Marianne ' s company usual noisy cheerfulness , " I am monstrous glad to see you -- sorry I could n t however , as it turns out , I am monstrous glad there was never any thing in so scornfully ! for they say he is monstrous fond of her , as well he may . I s possible that she should ." " I am monstrous glad of it . Good gracious ! I hav thing of the kind . So then he was monstrous happy , and talked on some time ab e very genteel people . He makes a monstrous deal of money , and they keep thei
text1.similar("monstrous")
abundant candid careful christian contemptible curious delightfully determined doleful domineering exasperate fearless few gamesome horrible impalpable imperial lamentable lazy loving
text2.similar("monstrous")
Building word-context index...
very exceedingly heartily so a amazingly as extremely good great remarkably sweet vast
def lexical_diversity(text):
    return len(text)/(1.0*len(set(text)))

def percentage(count, total):
    return 100*count/total
lexical_diversity(text3)
16.050197203298673
lexical_diversity(text4)
14.941049825712529
lexical_diversity(text5)
7.420046158918563
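As a quick sanity check, the two helpers can be exercised on a toy token list (a sketch; the helpers are repeated so the cell is self-contained, and percentage divides by float(total) to avoid Python 2 integer truncation):

```python
def lexical_diversity(text):
    # average number of uses of each distinct word
    return len(text) / (1.0 * len(set(text)))

def percentage(count, total):
    # share of the tokens accounted for by count, as a percentage
    return 100 * count / float(total)

toy = ['the', 'cat', 'sat', 'on', 'the', 'mat']
lexical_diversity(toy)                   # 6 tokens / 5 types = 1.2
percentage(toy.count('the'), len(toy))   # 'the' is 2 of 6 tokens, about 33.3
```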
fdist1 = FreqDist(text1)
vocabulary1 = fdist1.keys()
fdist1['whale']
906
fdist1['monstrous']
10
fdist1.plot(50, cumulative=True)
hapaxes1 = fdist1.hapaxes()
print len(hapaxes1)
print hapaxes1[1000:1010]
9002
['Gull', 'Gurry', 'HACKLUYT', 'HAILS', 'HALF', 'HAMLET', 'HANDS', 'HANGING', 'HARD', 'HARDY']
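A hapax is a word that occurs exactly once in a text; the same list can be recovered from any token sequence without FreqDist (a sketch using collections.Counter):

```python
from collections import Counter

def hapaxes(tokens):
    # words whose frequency in the token list is exactly 1
    counts = Counter(tokens)
    return [w for w, c in counts.items() if c == 1]

sorted(hapaxes(['a', 'b', 'a', 'c']))  # ['b', 'c']
```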
thursday_sents = nltk.corpus.gutenberg.sents('chesterton-thursday.txt')
sent22 = thursday_sents[22]
' '.join(sent22)
'THE suburb of Saffron Park lay on the sunset side of London , as red and ragged as a cloud of sunset .'
nltk.bigrams(w for w in sent22 if w.isalpha())
[('THE', 'suburb'), ('suburb', 'of'), ('of', 'Saffron'), ('Saffron', 'Park'), ('Park', 'lay'), ('lay', 'on'), ('on', 'the'), ('the', 'sunset'), ('sunset', 'side'), ('side', 'of'), ('of', 'London'), ('London', 'as'), ('as', 'red'), ('red', 'and'), ('and', 'ragged'), ('ragged', 'as'), ('as', 'a'), ('a', 'cloud'), ('cloud', 'of'), ('of', 'sunset')]
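nltk.bigrams simply pairs each token with its successor; a minimal pure-Python equivalent (a sketch, not the library implementation):

```python
def bigrams(words):
    # pair each word with the one that follows it
    return list(zip(words, words[1:]))

bigrams(['THE', 'suburb', 'of', 'Saffron', 'Park'])
# [('THE', 'suburb'), ('suburb', 'of'), ('of', 'Saffron'), ('Saffron', 'Park')]
```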
import networkx as nx
G = nx.Graph()
begin_sent = 22
end_sent = 24
sents = thursday_sents[begin_sent:end_sent+1]
for sent in sents:
    G.add_edges_from(nltk.bigrams(w for w in sent if w.isalpha()))
nx.draw(G)
Below [BKL] refers to "Natural Language Processing with Python" by Bird, Klein and Loper and [MAR] refers to "Mining the Social Web" by Matthew A. Russell.
See page 95 of [BKL].
import codecs, nltk, pprint
hard_times_path = "/home/matthew/workspace/resources/C/Corpus Stylistics/Dickens, Charles/786-0.txt"
david_copperfield_path = "/home/matthew/workspace/resources/C/Corpus Stylistics/Dickens, Charles/pg766.txt"
f = codecs.open(hard_times_path, encoding = 'utf-8')
david_copperfield_file = codecs.open(david_copperfield_path, encoding = 'utf-8')
hard_times_raw_text = f.read()
len(hard_times_raw_text)
610690
david_copperfield_raw_text = david_copperfield_file.read()
len(david_copperfield_raw_text)
1992524
See page 112 of [BKL].
sent_tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
sents = sent_tokenizer.tokenize(hard_times_raw_text)
print sents[171]
He had reached the neutral ground upon the outskirts of the town, which was neither town nor country, and yet was either spoiled, when his ears were invaded by the sound of music.
len(sents)
4498
An alternative approach based on [MAR].
sents = nltk.tokenize.sent_tokenize(hard_times_raw_text)
print sents[171]
He had reached the neutral ground upon the outskirts of the town, which was neither town nor country, and yet was either spoiled, when his ears were invaded by the sound of music.
DC_sents = nltk.tokenize.sent_tokenize(david_copperfield_raw_text)
print DC_sents[171]
Calls a house a rookery when there's not a rook near it, and takes the birds on trust, because he sees the nests!
tokens = [nltk.tokenize.word_tokenize(s) for s in sents]
len(tokens)
4498
print tokens[171]
[u'He', u'had', u'reached', u'the', u'neutral', u'ground', u'upon', u'the', u'outskirts', u'of', u'the', u'town', u',', u'which', u'was', u'neither', u'town', u'nor', u'country', u',', u'and', u'yet', u'was', u'either', u'spoiled', u',', u'when', u'his', u'ears', u'were', u'invaded', u'by', u'the', u'sound', u'of', u'music', u'.']
DC_tokens = [nltk.tokenize.word_tokenize(s) for s in DC_sents]
print DC_tokens[171]
[u'Calls', u'a', u'house', u'a', u'rookery', u'when', u'there', u"'s", u'not', u'a', u'rook', u'near', u'it', u',', u'and', u'takes', u'the', u'birds', u'on', u'trust', u',', u'because', u'he', u'sees', u'the', u'nests', u'!']
Warning: Very slow
# pos_tagged_tokens = [nltk.pos_tag(t) for t in tokens]
' '.join(tokens[171])
u'He had reached the neutral ground upon the outskirts of the town , which was neither town nor country , and yet was either spoiled , when his ears were invaded by the sound of music .'
bigram = nltk.bigrams(w for w in tokens[171] if w.isalpha())
import networkx as nx
G = nx.Graph()
G.add_edges_from(bigram)
nx.draw(G)
print nltk.ngrams((w for w in tokens[171] if w.isalpha()), 5)
[(u'He', u'had', u'reached', u'the', u'neutral'), (u'had', u'reached', u'the', u'neutral', u'ground'), (u'reached', u'the', u'neutral', u'ground', u'upon'), (u'the', u'neutral', u'ground', u'upon', u'the'), (u'neutral', u'ground', u'upon', u'the', u'outskirts'), (u'ground', u'upon', u'the', u'outskirts', u'of'), (u'upon', u'the', u'outskirts', u'of', u'the'), (u'the', u'outskirts', u'of', u'the', u'town'), (u'outskirts', u'of', u'the', u'town', u'which'), (u'of', u'the', u'town', u'which', u'was'), (u'the', u'town', u'which', u'was', u'neither'), (u'town', u'which', u'was', u'neither', u'town'), (u'which', u'was', u'neither', u'town', u'nor'), (u'was', u'neither', u'town', u'nor', u'country'), (u'neither', u'town', u'nor', u'country', u'and'), (u'town', u'nor', u'country', u'and', u'yet'), (u'nor', u'country', u'and', u'yet', u'was'), (u'country', u'and', u'yet', u'was', u'either'), (u'and', u'yet', u'was', u'either', u'spoiled'), (u'yet', u'was', u'either', u'spoiled', u'when'), (u'was', u'either', u'spoiled', u'when', u'his'), (u'either', u'spoiled', u'when', u'his', u'ears'), (u'spoiled', u'when', u'his', u'ears', u'were'), (u'when', u'his', u'ears', u'were', u'invaded'), (u'his', u'ears', u'were', u'invaded', u'by'), (u'ears', u'were', u'invaded', u'by', u'the'), (u'were', u'invaded', u'by', u'the', u'sound'), (u'invaded', u'by', u'the', u'sound', u'of'), (u'by', u'the', u'sound', u'of', u'music')]
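nltk.ngrams generalizes the bigram idea to a sliding window of length n; a minimal pure-Python equivalent (a sketch, not the library implementation):

```python
def ngrams(words, n):
    # slide a window of length n across the token list
    return list(zip(*[words[i:] for i in range(n)]))

ngrams(['He', 'had', 'reached', 'the', 'neutral', 'ground'], 5)
# [('He', 'had', 'reached', 'the', 'neutral'),
#  ('had', 'reached', 'the', 'neutral', 'ground')]
```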
Now consider the full text of "Hard Times", here represented by the list tokens.
fivegrams = []
for t in tokens:
    fivegrams += nltk.ngrams((w for w in t if w.isalpha()), 5)
print ' '.join(fivegrams[2000])
board with a dry Ogre
Create a dictionary to count occurrences of specific 5-grams.
D = {}
for gram in fivegrams:
    if not D.get(gram):
        D[gram] = 1
    else:
        D[gram] += 1
Iterate through the keys of the dictionary and print any 5-gram that occurs more than three times in the text.
for gram in D.keys():
    if D[gram] > 3:
        print ' '.join(gram), D[gram]
and why did he not 4
did he not come back 4
as if she would have 4
and venison with a gold 4
why did he not come 4
sir returned Sparsit with a 5
man and why did he 4
The emphasis was helped by 4
venison with a gold spoon 4
I am Josiah Bounderby of 4
to the Project Gutenberg Literary 6
was the man and why 4
his hands in his pockets 8
No little Gradgrind had ever 4
this town and I know 4
the terms of this agreement 7
am Josiah Bounderby of Coketown 4
with his hands in his 6
town and I know the 4
soup and venison with a 4
emphasis was helped by the 4
turtle soup and venison with 4
the man and why did 4
Project Gutenberg Literary Archive Foundation 13
of this town and I 4
the Project Gutenberg Literary Archive 11
a law to punish me 4
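collections.Counter expresses the counting loop above in a single update per sentence (a sketch on invented toy data, with a minimal ngrams helper in place of nltk.ngrams):

```python
from collections import Counter

def ngrams(words, n):
    # sliding window of length n over a token list
    return list(zip(*[words[i:] for i in range(n)]))

sents = [['I', 'am', 'Josiah', 'Bounderby', 'of', 'Coketown'],
         ['I', 'am', 'Josiah', 'Bounderby', 'of', 'Coketown', 'he', 'said']]
counts = Counter()
for s in sents:
    counts.update(ngrams(s, 5))
counts[('I', 'am', 'Josiah', 'Bounderby', 'of')]  # 2
```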
Now we can try to find the occurrences of one of the 5-grams, 'venison with a gold spoon', in the original text.
hard_times_raw_text.find('venison with a gold spoon')
250056
print hard_times_raw_text[250010:250082]
That object is, to be fed on turtle soup and venison with a gold spoon.
hard_times_raw_text.find('venison with a gold spoon',250057)
250156
print hard_times_raw_text[250084:250182]
Now, they’re not a-going—none of ’em—ever to be fed on turtle soup and venison with a gold spoon.
We find only two occurrences instead of the four we counted. This is probably a punctuation issue: the 5-grams were counted over tokens with punctuation stripped out, so a literal string search misses occurrences where punctuation or line breaks fall between the words.
hard_times_raw_text.find('venison with a gold spoon', 250157)
-1
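One way to recover the missing matches is to search with a regular expression that allows whitespace or punctuation between the words of the 5-gram (a sketch; the sample string here is invented for illustration):

```python
import re

phrase = 'venison with a gold spoon'
# join the words with a class matching whitespace and common punctuation
pattern = r'[\s,;:!?\'"-]+'.join(re.escape(w) for w in phrase.split())
sample = ('fed on turtle soup and venison with a gold spoon. '
          'venison, with a gold spoon, say some.')
len(re.findall(pattern, sample))  # 2: the literal match and the punctuated one
```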
five_gram_2 = 'his hands in his pockets'
hard_times_raw_text.find(five_gram_2)
39377
hard_times_raw_text.find(five_gram_2, 39378)
41788
hard_times_raw_text.find(five_gram_2, 41789)
58759
hard_times_raw_text.find(five_gram_2, 58760)
476235
hard_times_raw_text.find(five_gram_2, 476236)
514354
hard_times_raw_text.find(five_gram_2, 514355)
-1
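The chain of find calls above can be wrapped in a small helper that collects every occurrence at once (a sketch):

```python
def find_all(text, phrase):
    # collect every start offset of phrase in text
    offsets, start = [], text.find(phrase)
    while start != -1:
        offsets.append(start)
        start = text.find(phrase, start + 1)
    return offsets

find_all('spam ham spam eggs spam', 'spam')  # [0, 9, 19]
```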
Now we will try to reproduce Table 3.3 on p.47 of [MM].
DC_bigrams = []
for t in DC_tokens:
    DC_bigrams += nltk.ngrams((w for w in t if w.isalpha()), 2)
DC_D = {}
for gram in DC_bigrams:
    if not DC_D.get(gram):
        DC_D[gram] = 1
    else:
        DC_D[gram] += 1
for gram in DC_D.keys():
    if DC_D[gram] > 500:
        print ' '.join(gram), DC_D[gram]
to be 848
that I 885
at the 529
with a 613
it was 578
my aunt 631
I am 653
I was 950
to me 616
on the 660
of the 1392
to the 762
in a 694
and I 668
I have 639
in the 1521
of my 546
I had 904
DC_trigrams = []
for t in DC_tokens:
    DC_trigrams += nltk.ngrams((w for w in t if w.isalpha()), 3)
DC_D_3 = {}
for gram in DC_trigrams:
    if not DC_D_3.get(gram):
        DC_D_3[gram] = 1
    else:
        DC_D_3[gram] += 1
for gram in DC_D_3.keys():
    if DC_D_3[gram] > 60:
        print ' '.join(gram), DC_D_3[gram]
I had been 88
said my aunt 219
that it was 83
I can not 62
I should have 71
out of the 124
that she was 61
that I was 124
I am sure 88
a good deal 71
one of the 65
if I had 82
as if he 81
would have been 62
that I had 90
I am not 65
I could not 125
I do know 102
there was a 73
DC_fourgrams = []
for t in DC_tokens:
    DC_fourgrams += nltk.ngrams((w for w in t if w.isalpha()), 4)
DC_D_4 = {}
for gram in DC_fourgrams:
    if not DC_D_4.get(gram):
        DC_D_4[gram] = 1
    else:
        DC_D_4[gram] += 1
for gram in DC_D_4.keys():
    if DC_D_4[gram] > 18:
        print ' '.join(gram), DC_D_4[gram]
as well as I 22
for a long time 22
as if he were 30
as if he had 19
I have no doubt 27
in the course of 28
it would have been 22
a good deal of 28
in a state of 23
I could not help 32
if I had been 19
I am sure I 26
as if it were 28
as if I had 25
for a little while 23
I do know what 31
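The three counting passes above differ only in n and the cut-off; a generic helper (a sketch on invented toy data) avoids the repetition:

```python
from collections import Counter

def frequent_ngrams(sent_tokens, n, threshold):
    # count alphabetic n-grams per sentence, keep those above the threshold
    counts = Counter()
    for sent in sent_tokens:
        words = [w for w in sent if w.isalpha()]
        counts.update(zip(*[words[i:] for i in range(n)]))
    return {g: c for g, c in counts.items() if c > threshold}

toy = [['to', 'be', 'or', 'not', 'to', 'be'],
       ['to', 'be', 'is', 'to', 'do']]
frequent_ngrams(toy, 2, 2)  # {('to', 'be'): 3}
```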
By comparison with [MM] there is a problem in how I am handling punctuation. For example, I record the most frequent four-gram as "I do know what", but in [MM] it is "I don't know what". The cause is that word_tokenize splits "don't" into "do" and "n't", and the isalpha() filter then discards "n't".
david_copperfield_raw_text.find("I do know what")
-1
david_copperfield_raw_text.find("I don't know what")
19264
print david_copperfield_raw_text[19264:19296]
I don't know what's the matter.
At the very least I ought to reduce all text to lowercase, as done in [MM].
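A sketch of both fixes, lowercasing each token and keeping contraction fragments such as "n't" rather than discarding them with isalpha() (the token list is invented, imitating word_tokenize output):

```python
def normalize(tokens):
    # lowercase, and keep tokens that are alphabetic once apostrophes are
    # removed, so contraction fragments like "n't" survive the filter
    return [w.lower() for w in tokens
            if w.replace("'", "").isalpha()]

normalize(['I', 'do', "n't", 'know', 'what', ',', 'said', 'he'])
# ['i', 'do', "n't", 'know', 'what', 'said', 'he']
```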
We begin with an extract.
test_extract = sents[1024: 1037]
print ' '.join(test_extract)
Then, withdrawing his hand and swallowing his mouthful of chop, he said to Stephen: ‘Now you know, this good lady is a born lady, a high lady. You are not to suppose because she keeps my house for me, that she hasn’t been very high up the tree—ah, up at the top of the tree! Now, if you have got anything to say that can’t be said before a born lady, this lady will leave the room. If what you have got to say _can_ be said before a born lady, this lady will stay where she is.’ ‘Sir, I hope I never had nowt to say, not fitten for a born lady to year, sin’ I were born mysen’,’ was the reply, accompanied with a slight flush. ‘Very well,’ said Mr. Bounderby, pushing away his plate, and leaning back. ‘Fire away!’ ‘I ha’ coom,’ Stephen began, raising his eyes from the floor, after a moment’s consideration, ‘to ask yo yor advice. I need ’t overmuch. I were married on Eas’r Monday nineteen year sin, long and dree. She were a young lass—pretty enow—wi’ good accounts of herseln. Well! She went bad—soon. Not along of me. Gonnows I were not a unkind husband to her.’ ‘I have heard all this before,’ said Mr. Bounderby.
import codecs, nltk
little_dorrit_path = "/home/matthew/workspace/resources/C/Corpus Stylistics/Dickens, Charles/pg963.txt"
f = codecs.open(little_dorrit_path, encoding = 'utf-8')
little_dorrit_file = codecs.open(little_dorrit_path, encoding = 'utf-8')
little_dorrit_raw = little_dorrit_file.read()
len(little_dorrit_raw)
1936177
little_dorrit_raw.find(u'At the close of this recital')
1725461
end_phrase = 'producing the money.'
little_dorrit_raw.find(end_phrase)
1727185
task_string = little_dorrit_raw[1725461:1727185 + len(end_phrase)]
print task_string
At the close of this recital, Arthur turned his eyes upon the impudent and wicked face. As it met his, the nose came down over the moustache and the moustache went up under the nose. When nose and moustache had settled into their places again, Monsieur Rigaud loudly snapped his fingers half-a-dozen times; bending forward to jerk the snaps at Arthur, as if they were palpable missiles which he jerked into his face. 'Now, Philosopher!' said Rigaud.'What do you want with me?' 'I want to know,' returned Arthur, without disguising his abhorrence, 'how you dare direct a suspicion of murder against my mother's house?' 'Dare!' cried Rigaud. 'Ho, ho! Hear him! Dare? Is it dare? By Heaven, my small boy, but you are a little imprudent!' 'I want that suspicion to be cleared away,' said Arthur. 'You shall be taken there, and be publicly seen. I want to know, moreover, what business you had there when I had a burning desire to fling you down-stairs. Don't frown at me, man! I have seen enough of you to know that you are a bully and coward. I need no revival of my spirits from the effects of this wretched place to tell you so plain a fact, and one that you know so well.' White to the lips, Rigaud stroked his moustache, muttering, 'By Heaven, my small boy, but you are a little compromising of my lady, your respectable mother'--and seemed for a minute undecided how to act. His indecision was soon gone. He sat himself down with a threatening swagger, and said: 'Give me a bottle of wine. You can buy wine here. Send one of your madmen to get me a bottle of wine. I won't talk to you without wine. Come! Yes or no?' 'Fetch him what he wants, Cavalletto,' said Arthur, scornfully, producing the money.
import re
re_1 = r"'[^']+'"
re_2 = r"'[a-zA-Z0-9_,!? ]+(?:[-'][a-zA-Z0-9_,!? ]+)*'"
re_3 = r"'[^']+[\.,!?]'"
nltk.re_show(re_1, task_string[423:])
{'Now, Philosopher!'} said Rigaud.{'What do you want with me?'} {'I want to know,'} returned Arthur, without disguising his abhorrence, {'how you dare direct a suspicion of murder against my mother'}s house?{' '}Dare!{' cried Rigaud. '}Ho, ho! Hear him! Dare? Is it dare? By Heaven, my small boy, but you are a little imprudent!{' '}I want that suspicion to be cleared away,{' said Arthur. '}You shall be taken there, and be publicly seen. I want to know, moreover, what business you had there when I had a burning desire to fling you down-stairs. Don{'t frown at me, man! I have seen enough of you to know that you are a bully and coward. I need no revival of my spirits from the effects of this wretched place to tell you so plain a fact, and one that you know so well.'} White to the lips, Rigaud stroked his moustache, muttering, {'By Heaven, my small boy, but you are a little compromising of my lady, your respectable mother'}--and seemed for a minute undecided how to act. His indecision was soon gone. He sat himself down with a threatening swagger, and said: {'Give me a bottle of wine. You can buy wine here. Send one of your madmen to get me a bottle of wine. I won'}t talk to you without wine. Come! Yes or no?{' '}Fetch him what he wants, Cavalletto,' said Arthur, scornfully, producing the money.
re.findall(re_1, task_string)
[u"'Now, Philosopher!'", u"'What do you want with me?'", u"'I want to know,'", u"'how you dare direct a suspicion of murder against my mother'", u"'\r\n\r\n'", u"' cried Rigaud. '", u"'\r\n\r\n'", u"' said Arthur. '", u"'t frown at me, man! I have seen enough of you to know\r\nthat you are a bully and coward. I need no revival of my spirits from\r\nthe effects of this wretched place to tell you so plain a fact, and one\r\nthat you know so well.'", u"'By Heaven,\r\nmy small boy, but you are a little compromising of my lady, your\r\nrespectable mother'", u"'Give me a bottle of wine. You can buy wine here. Send one of your\r\nmadmen to get me a bottle of wine. I won'", u"'\r\n\r\n'"]
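The stray matches above arise because the apostrophes in contractions (don't, won't) are being mistaken for closing quotes. A stricter pattern (a sketch; the sample string is invented) only treats a straight quote as a delimiter when it sits next to whitespace or a string boundary:

```python
import re

# opening quote must follow start-of-string or whitespace;
# closing quote must precede whitespace or end-of-string;
# internal apostrophes (won't) are allowed because they are
# not followed by whitespace
quote_re = r"(?:(?<=\s)|^)'(?:[^']|'(?!\s|$))+'(?=\s|$)"

sample = "He said: 'I won't talk to you without wine.' Then he left."
re.findall(quote_re, sample)  # ["'I won't talk to you without wine.'"]
```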
from nltk.corpus import PlaintextCorpusReader
corpus_root = "/home/matthew/workspace/resources/C/Corpus Stylistics/Dickens, Charles"
wordlists = PlaintextCorpusReader(corpus_root, '.*')
wordlists.fileids()
['786-0.txt', 'pg1023.txt', 'pg580.txt', 'pg730.txt', 'pg766.txt', 'pg963.txt', 'pg967.txt']
hard_times_sents_raw = wordlists.sents('786-0.txt')  # the "Hard Times" file; this list is used below but was never defined
len(hard_times_sents_raw)
a = 0
for sentence in hard_times_sents_raw:
    a += len(sentence)
print a
import re
s = hard_times_sents_raw[1000]
# re.findall(r'\W+', sentence)
print s
In this section I show how to remove extraneous text (the Project Gutenberg header and license) from the raw string for "Hard Times". This is done by hand, by first identifying the first and last sentences of the text. There are some issues with Unicode that I haven't yet resolved.
hard_times_first_sentence = '\xe2\x80\x98 NOW , what I want is , Facts .'  # \xe2\x80\x98 is the UTF-8 encoding of the opening quote
hard_times_first_sentence.split() in hard_times_sents_raw
first_sentence_index = hard_times_sents_raw.index(hard_times_first_sentence.split())
' '.join(hard_times_sents_raw[first_sentence_index])
hard_times_last_sentence = 'We shall sit with lighter bosoms on the hearth , to see the ashes of our fires turn gray and cold .'
hard_times_last_sentence.split() in hard_times_sents_raw
last_sentence_index = hard_times_sents_raw.index(hard_times_last_sentence.split())
' '.join(hard_times_sents_raw[last_sentence_index])
hard_times_sents = hard_times_sents_raw[first_sentence_index:last_sentence_index + 1]
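The trimming steps above generalize into a small helper (a sketch, shown on an invented toy sentence list):

```python
def trim(sents, first_sentence, last_sentence):
    # slice out everything between the known first and last sentences, inclusive
    start = sents.index(first_sentence.split())
    end = sents.index(last_sentence.split())
    return sents[start:end + 1]

toy = [['header', 'text'],
       ['NOW', ',', 'Facts', '.'],
       ['body', '.'],
       ['the', 'end', '.'],
       ['license', 'text']]
trim(toy, 'NOW , Facts .', 'the end .')
# [['NOW', ',', 'Facts', '.'], ['body', '.'], ['the', 'end', '.']]
```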