Learning NLTK

This notebook introduces the Natural Language Toolkit (NLTK).

NLTK Example Texts

The first step is to import the nltk library and to load some example texts.

In [1]:
import nltk
In [2]:
from nltk.book import *
*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908

Dispersion Plots

In [3]:
text1.dispersion_plot(['Ahab','whale','Ishmael','Queequeg', 'Moby', 'dive'])
In [4]:
text4.dispersion_plot(["citizens", "democracy", "freedom", "duties", "America"])
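
A dispersion plot marks, for each word, the token positions at which it occurs in the text. As a rough sketch of the underlying data (not NLTK's implementation), the positions can be collected with enumerate. The helper word_offsets and the toy token list are illustrative names, not part of NLTK.

```python
def word_offsets(tokens, targets):
    """Map each target word to the list of token positions where it occurs."""
    offsets = {word: [] for word in targets}
    for position, token in enumerate(tokens):
        if token in offsets:
            offsets[token].append(position)
    return offsets

# Toy token list standing in for a real text such as text1.
toy_tokens = ['Call', 'me', 'Ishmael', '.', 'The', 'whale', 'and', 'the', 'whale']
toy_offsets = word_offsets(toy_tokens, ['whale', 'Ishmael', 'Ahab'])
```

Plotting each list of positions on its own horizontal line gives the familiar striped picture.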

Concordances

In [5]:
text1.concordance('Ishmael')
Building index...
Displaying 20 of 20 matches:
SONG . CHAPTER 1 Loomings . Call me Ishmael . Some years ago -- never mind how 
ED STATES . " WHALING VOYAGE BY ONE ISHMAEL . " BLOODY BATTLE IN AFFGHANISTAN .
f silver ,-- So , wherever you go , Ishmael , said I to myself , as I stood in 
de to lodge for the night , my dear Ishmael , be sure to inquire the price , an
nkling glasses within . But go on , Ishmael , said I at last ; don ' t you hear
g and teeth - gnashing there . Ha , Ishmael , muttered I , backing out , Wretch
emen who had gone before me . Yes , Ishmael , the same fate may be thine . But 
 ? thought I . Do you suppose now , Ishmael , that the magnanimous God of heave
l , which , if left to myself , I , Ishmael , should infallibly light upon , fo
 Bildad . Now then , my young man , Ishmael ' s thy name , didn ' t ye say ? We
say ? Well then , down ye go here , Ishmael , for the three hundredth lay ." " 
why don ' t you speak ? It ' s I -- Ishmael ." But all remained still as before
l fear ! CHAPTER 41 Moby Dick . I , Ishmael , was one of that crew ; my shouts 
lain , would be to dive deeper than Ishmael can go . The subterranean miner tha
oul ; thou surrenderest to a hypo , Ishmael . Tell me , why this strong young c
 snows of prairies ; all these , to Ishmael , are as the shaking of that buffal
ubtle meanings , how may unlettered Ishmael hope to read the awful Chaldee of t
onditional skeleton . But how now , Ishmael ? How is it , that you , a mere oar
 for exhibition ? Explain thyself , Ishmael . Can you land a full - grown whale
le witness have you hitherto been , Ishmael ; but have a care how you seize the
In [6]:
text1.similar('Ishmael')
Building word-context index...
ahab did in it what about admit am and been bound bury could do except
extreme for goes guess had
In [7]:
text1.similar('whale')
ship boat sea time captain deck man pequod world other whales air crew
head water line thing side way body
In [8]:
text1.collocations()
Building collocations list
Sperm Whale; Moby Dick; White Whale; old man; Captain Ahab; sperm
whale; Right Whale; Captain Peleg; New Bedford; Cape Horn; cried Ahab;
years ago; lower jaw; never mind; Father Mapple; cried Stubb; chief
mate; white whale; ivory leg; one hand
In [9]:
text1.concordance("monstrous")
Displaying 11 of 11 matches:
ong the former , one was of a most monstrous size . ... This came towards us , 
ON OF THE PSALMS . " Touching that monstrous bulk of the whale or ork we have r
ll over with a heathenish array of monstrous clubs and spears . Some were thick
d as you gazed , and wondered what monstrous cannibal and savage could ever hav
that has survived the flood ; most monstrous and most mountainous ! That Himmal
they might scout at Moby Dick as a monstrous fable , or still worse and more de
th of Radney .'" CHAPTER 55 Of the Monstrous Pictures of Whales . I shall ere l
ing Scenes . In connexion with the monstrous pictures of whales , I am strongly
ere to enter upon those still more monstrous stories of them which are to be fo
ght have been rummaged out of this monstrous cabinet there is no telling . But 
of Whale - Bones ; for Whales of a monstrous size are oftentimes cast up dead u
In [10]:
text2.concordance("monstrous")
Building index...
Displaying 11 of 11 matches:
. " Now , Palmer , you shall see a monstrous pretty girl ." He immediately went
your sister is to marry him . I am monstrous glad of it , for then I shall have
ou may tell your sister . She is a monstrous lucky girl to get him , upon my ho
k how you will like them . Lucy is monstrous pretty , and so good humoured and 
 Jennings , " I am sure I shall be monstrous glad of Miss Marianne ' s company 
 usual noisy cheerfulness , " I am monstrous glad to see you -- sorry I could n
t however , as it turns out , I am monstrous glad there was never any thing in 
so scornfully ! for they say he is monstrous fond of her , as well he may . I s
possible that she should ." " I am monstrous glad of it . Good gracious ! I hav
thing of the kind . So then he was monstrous happy , and talked on some time ab
e very genteel people . He makes a monstrous deal of money , and they keep thei
In [11]:
text1.similar("monstrous")
abundant candid careful christian contemptible curious delightfully
determined doleful domineering exasperate fearless few gamesome
horrible impalpable imperial lamentable lazy loving
In [12]:
text2.similar("monstrous")
Building word-context index...
very exceedingly heartily so a amazingly as extremely good great
remarkably sweet vast

Lexical Diversity

In [13]:
def lexical_diversity(text):
    return len(text)/(1.0*len(set(text)))

def percentage(count, total):
    return 100.0*count/total
In [14]:
lexical_diversity(text3)
Out[14]:
16.050197203298673
In [15]:
lexical_diversity(text4)
Out[15]:
14.941049825712529
In [16]:
lexical_diversity(text5)
Out[16]:
7.420046158918563
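
Note that lexical_diversity, as defined here, is the number of tokens divided by the number of distinct tokens, so a higher value means a more repetitive text. A small hand-checkable sketch on a made-up token list (toy_text is not one of the NLTK texts):

```python
def lexical_diversity(text):
    # Tokens per distinct type; float() avoids integer division under Python 2.
    return len(text) / float(len(set(text)))

toy_text = ['the', 'cat', 'sat', 'on', 'the', 'mat', 'the', 'end']
# 8 tokens and 6 distinct types: the, cat, sat, on, mat, end
toy_diversity = lexical_diversity(toy_text)
```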
In [17]:
fdist1 = FreqDist(text1)
In [18]:
vocabulary1 = fdist1.keys()
In [19]:
fdist1['whale']
Out[19]:
906
In [20]:
fdist1['monstrous']
Out[20]:
10
In [21]:
fdist1.plot(50, cumulative=True)
In [22]:
hapaxes1 = fdist1.hapaxes()
print len(hapaxes1)
print hapaxes1[1000:1010]
9002
['Gull', 'Gurry', 'HACKLUYT', 'HAILS', 'HALF', 'HAMLET', 'HANDS', 'HANGING', 'HARD', 'HARDY']
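
Hapaxes are words that occur exactly once in the text. The same idea can be sketched with collections.Counter from the standard library. This is an illustration on a made-up token list, not a description of how FreqDist.hapaxes() is implemented.

```python
from collections import Counter

toy_tokens = ['a', 'whale', 'a', 'ship', 'the', 'whale', 'sea']
counts = Counter(toy_tokens)
# Keep only the words whose count is exactly 1, sorted for a stable order.
toy_hapaxes = sorted(word for word, n in counts.items() if n == 1)
```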

n-grams

In [23]:
thursday_sents = nltk.corpus.gutenberg.sents('chesterton-thursday.txt')
sent22 = thursday_sents[22]
' '.join(sent22)
Out[23]:
'THE suburb of Saffron Park lay on the sunset side of London , as red and ragged as a cloud of sunset .'
In [24]:
nltk.bigrams(w for w in sent22 if w.isalpha())
Out[24]:
[('THE', 'suburb'),
 ('suburb', 'of'),
 ('of', 'Saffron'),
 ('Saffron', 'Park'),
 ('Park', 'lay'),
 ('lay', 'on'),
 ('on', 'the'),
 ('the', 'sunset'),
 ('sunset', 'side'),
 ('side', 'of'),
 ('of', 'London'),
 ('London', 'as'),
 ('as', 'red'),
 ('red', 'and'),
 ('and', 'ragged'),
 ('ragged', 'as'),
 ('as', 'a'),
 ('a', 'cloud'),
 ('cloud', 'of'),
 ('of', 'sunset')]
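
A bigram is simply a pair of adjacent tokens. A minimal pure-Python sketch of the same pairing, assuming the tokens are already in a list (simple_bigrams is an illustrative name, not an NLTK function):

```python
def simple_bigrams(tokens):
    """Pair each token with its successor."""
    return list(zip(tokens, tokens[1:]))

words = ['THE', 'suburb', 'of', 'Saffron', 'Park']
pairs = simple_bigrams(words)
```

Each pair can then serve as an edge between two word nodes in a graph.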
In [25]:
import networkx as nx
G = nx.Graph()
begin_sent = 22
end_sent = 24
sents = thursday_sents[begin_sent:end_sent+1]
for sent in sents:
    G.add_edges_from(nltk.bigrams(w for w in sent if w.isalpha()))
nx.draw(G)

Lexical clusters in Dickens

Below [BKL] refers to "Natural Language Processing with Python" by Bird, Klein and Loper and [MAR] refers to "Mining the Social Web" by Matthew A. Russell.

Getting the raw text

See page 95 of [BKL].

In [129]:
import codecs, nltk, pprint
hard_times_path = "/home/matthew/workspace/resources/C/Corpus Stylistics/Dickens, Charles/786-0.txt"
david_copperfield_path = "/home/matthew/workspace/resources/C/Corpus Stylistics/Dickens, Charles/pg766.txt"
f = codecs.open(hard_times_path, encoding = 'utf-8')
david_copperfield_file = codecs.open(david_copperfield_path, encoding = 'utf-8')
In [27]:
hard_times_raw_text = f.read()
len(hard_times_raw_text)
Out[27]:
610690
In [130]:
david_copperfield_raw_text = david_copperfield_file.read()
len(david_copperfield_raw_text)
Out[130]:
1992524
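
The cells above leave both file handles open. A possible alternative is sketched here with io.open and a with block, so that the file is closed automatically. The helper read_text is hypothetical, and the temporary file merely stands in for a real corpus file.

```python
import io
import os
import tempfile

def read_text(path, encoding='utf-8'):
    """Return the decoded contents of a text file, closing it afterwards."""
    with io.open(path, encoding=encoding) as handle:
        return handle.read()

# Demonstration on a throwaway file rather than a real Gutenberg text.
fd, demo_path = tempfile.mkstemp(suffix='.txt')
os.close(fd)
with io.open(demo_path, 'w', encoding='utf-8') as handle:
    handle.write(u'Now, what I want is, Facts.')
demo_text = read_text(demo_path)
os.remove(demo_path)
```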

Sentence tokenizing

See page 112 of [BKL].

In [28]:
sent_tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
sents = sent_tokenizer.tokenize(hard_times_raw_text)
In [29]:
print sents[171]
He had reached the neutral ground upon the outskirts of the town, which
was neither town nor country, and yet was either spoiled, when his ears
were invaded by the sound of music.
In [30]:
len(sents)
Out[30]:
4498

An alternative approach based on [MAR].

In [31]:
sents = nltk.tokenize.sent_tokenize(hard_times_raw_text)
In [32]:
print sents[171]
He had reached the neutral ground upon the outskirts of the town, which
was neither town nor country, and yet was either spoiled, when his ears
were invaded by the sound of music.
In [131]:
DC_sents = nltk.tokenize.sent_tokenize(david_copperfield_raw_text)
print DC_sents[171]
Calls a house a rookery when there's not a rook near it,
and takes the birds on trust, because he sees the nests!

Word tokenizing

In [33]:
tokens = [nltk.tokenize.word_tokenize(s) for s in sents]
In [34]:
len(tokens)
Out[34]:
4498
In [35]:
print tokens[171]
[u'He', u'had', u'reached', u'the', u'neutral', u'ground', u'upon', u'the', u'outskirts', u'of', u'the', u'town', u',', u'which', u'was', u'neither', u'town', u'nor', u'country', u',', u'and', u'yet', u'was', u'either', u'spoiled', u',', u'when', u'his', u'ears', u'were', u'invaded', u'by', u'the', u'sound', u'of', u'music', u'.']
In [134]:
DC_tokens = [nltk.tokenize.word_tokenize(s) for s in DC_sents]
In [135]:
print DC_tokens[171]
[u'Calls', u'a', u'house', u'a', u'rookery', u'when', u'there', u"'s", u'not', u'a', u'rook', u'near', u'it', u',', u'and', u'takes', u'the', u'birds', u'on', u'trust', u',', u'because', u'he', u'sees', u'the', u'nests', u'!']

POS tagging

Warning: Very slow

In [36]:
# pos_tagged_tokens = [nltk.pos_tag(t) for t in tokens]

n-grams

In [63]:
' '.join(tokens[171])
Out[63]:
u'He had reached the neutral ground upon the outskirts of the town , which was neither town nor country , and yet was either spoiled , when his ears were invaded by the sound of music .'
In [61]:
bigram = nltk.bigrams(w for w in tokens[171] if w.isalpha())
In [62]:
import networkx as nx
G = nx.Graph()
G.add_edges_from(bigram)
nx.draw(G)
In [75]:
print nltk.ngrams((w for w in tokens[171] if w.isalpha()), 5)
[(u'He', u'had', u'reached', u'the', u'neutral'), (u'had', u'reached', u'the', u'neutral', u'ground'), (u'reached', u'the', u'neutral', u'ground', u'upon'), (u'the', u'neutral', u'ground', u'upon', u'the'), (u'neutral', u'ground', u'upon', u'the', u'outskirts'), (u'ground', u'upon', u'the', u'outskirts', u'of'), (u'upon', u'the', u'outskirts', u'of', u'the'), (u'the', u'outskirts', u'of', u'the', u'town'), (u'outskirts', u'of', u'the', u'town', u'which'), (u'of', u'the', u'town', u'which', u'was'), (u'the', u'town', u'which', u'was', u'neither'), (u'town', u'which', u'was', u'neither', u'town'), (u'which', u'was', u'neither', u'town', u'nor'), (u'was', u'neither', u'town', u'nor', u'country'), (u'neither', u'town', u'nor', u'country', u'and'), (u'town', u'nor', u'country', u'and', u'yet'), (u'nor', u'country', u'and', u'yet', u'was'), (u'country', u'and', u'yet', u'was', u'either'), (u'and', u'yet', u'was', u'either', u'spoiled'), (u'yet', u'was', u'either', u'spoiled', u'when'), (u'was', u'either', u'spoiled', u'when', u'his'), (u'either', u'spoiled', u'when', u'his', u'ears'), (u'spoiled', u'when', u'his', u'ears', u'were'), (u'when', u'his', u'ears', u'were', u'invaded'), (u'his', u'ears', u'were', u'invaded', u'by'), (u'ears', u'were', u'invaded', u'by', u'the'), (u'were', u'invaded', u'by', u'the', u'sound'), (u'invaded', u'by', u'the', u'sound', u'of'), (u'by', u'the', u'sound', u'of', u'music')]

Now consider the full text of "Hard Times", here represented by the list tokens.

In [67]:
fivegrams = []
for t in tokens:
    fivegrams += nltk.ngrams((w for w in t if w.isalpha()), 5)
In [73]:
print ' '.join(fivegrams[2000])
board with a dry Ogre
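
Because the n-grams are built one sentence at a time, no 5-gram spans a sentence boundary. A minimal sketch of that idea with plain list slicing, on made-up sentences (sentence_ngrams is an illustrative helper, not an NLTK function):

```python
def sentence_ngrams(sentences, n):
    """Collect n-grams of alphabetic tokens, never crossing sentence boundaries."""
    grams = []
    for sentence in sentences:
        words = [w for w in sentence if w.isalpha()]
        # A sentence with fewer than n words contributes nothing.
        for i in range(len(words) - n + 1):
            grams.append(tuple(words[i:i + n]))
    return grams

toy_sentences = [['He', 'had', 'reached', 'the', 'town', '.'],
                 ['It', 'was', 'late', '.']]
toy_trigrams = sentence_ngrams(toy_sentences, 3)
```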

Create a dictionary to count occurrences of specific 5-grams.

In [76]:
D = {}
for gram in fivegrams:
    if not D.get(gram):
        D[gram] = 1
    else:
        D[gram] += 1
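
The same tally can be written more compactly with collections.Counter from the standard library. This is an alternative sketch rather than the method used above, and the toy list of 2-grams stands in for fivegrams.

```python
from collections import Counter

toy_grams = [('his', 'hands'), ('in', 'his'), ('his', 'hands'), ('his', 'pockets')]
gram_counts = Counter(toy_grams)  # one pass; missing keys count as 0
# Keep the n-grams seen more than once, together with their counts.
repeated = [(gram, n) for gram, n in gram_counts.items() if n > 1]
```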

Iterate through the keys of the dictionary and print any 5-grams that occur more than 3 times in the text.

In [86]:
for gram in D.keys():
    if D[gram] > 3:
        print ' '.join(gram), D[gram]
and why did he not 4
did he not come back 4
as if she would have 4
and venison with a gold 4
why did he not come 4
sir returned Sparsit with a 5
man and why did he 4
The emphasis was helped by 4
venison with a gold spoon 4
I am Josiah Bounderby of 4
to the Project Gutenberg Literary 6
was the man and why 4
his hands in his pockets 8
No little Gradgrind had ever 4
this town and I know 4
the terms of this agreement 7
am Josiah Bounderby of Coketown 4
with his hands in his 6
town and I know the 4
soup and venison with a 4
emphasis was helped by the 4
turtle soup and venison with 4
the man and why did 4
Project Gutenberg Literary Archive Foundation 13
of this town and I 4
the Project Gutenberg Literary Archive 11
a law to punish me 4

Now we can try to find the occurrences of one of the 5-grams, 'venison with a gold spoon', in the original text.

In [87]:
hard_times_raw_text.find('venison with a gold spoon')
Out[87]:
250056
In [116]:
print hard_times_raw_text[250010:250082]
That object is, to be fed on turtle soup and venison with a gold spoon.
In [96]:
hard_times_raw_text.find('venison with a gold spoon',250057)
Out[96]:
250156
In [113]:
print hard_times_raw_text[250084:250182]
Now, they’re not a-going—none of ’em—ever to be fed on turtle soup and
venison with a gold spoon.

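We find only 2 occurrences, rather than the 4 we counted. The likely cause is that the 5-gram count ignores punctuation and line breaks, whereas find requires an exact character-for-character match in the raw text.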

In [115]:
hard_times_raw_text.find('venison with a gold spoon', 250157)
Out[115]:
-1
In [122]:
five_gram_2 = 'his hands in his pockets'
hard_times_raw_text.find(five_gram_2)
Out[122]:
39377
In [123]:
hard_times_raw_text.find(five_gram_2, 39378)
Out[123]:
41788
In [124]:
hard_times_raw_text.find(five_gram_2, 41789)
Out[124]:
58759
In [125]:
hard_times_raw_text.find(five_gram_2, 58760)
Out[125]:
476235
In [126]:
hard_times_raw_text.find(five_gram_2, 476236)
Out[126]:
514354
In [127]:
hard_times_raw_text.find(five_gram_2, 514355)
Out[127]:
-1
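
The repeated calls to find above can be wrapped in a loop. A minimal sketch, assuming non-overlapping matches are wanted (find_all is a hypothetical helper, and the sample string is made up):

```python
def find_all(text, phrase):
    """Return the start index of every non-overlapping occurrence of phrase."""
    if not phrase:
        return []  # an empty phrase would otherwise loop forever
    positions = []
    start = text.find(phrase)
    while start != -1:
        positions.append(start)
        # Resume the search just past the end of this match.
        start = text.find(phrase, start + len(phrase))
    return positions

sample = 'his hands in his pockets, and his hands in his pockets again'
hits = find_all(sample, 'his hands in his pockets')
```

Collapsing runs of whitespace first, for example with ' '.join(text.split()), would let a phrase match across the line breaks in the raw text, at the cost of the indices then referring to the normalized string.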

Now we will try to reproduce Table 3.3 on p.47 of [MM].

In [136]:
DC_bigrams = []
for t in DC_tokens:
    DC_bigrams += nltk.ngrams((w for w in t if w.isalpha()), 2)
In [138]:
DC_D = {}
for gram in DC_bigrams:
    if not DC_D.get(gram):
        DC_D[gram] = 1
    else:
        DC_D[gram] += 1
In [139]:
for gram in DC_D.keys():
    if DC_D[gram] > 500:
        print ' '.join(gram), DC_D[gram]
to be 848
that I 885
at the 529
with a 613
it was 578
my aunt 631
I am 653
I was 950
to me 616
on the 660
of the 1392
to the 762
in a 694
and I 668
I have 639
in the 1521
of my 546
I had 904
In [140]:
DC_trigrams = []
for t in DC_tokens:
    DC_trigrams += nltk.ngrams((w for w in t if w.isalpha()), 3)
In [141]:
DC_D_3 = {}
for gram in DC_trigrams:
    if not DC_D_3.get(gram):
        DC_D_3[gram] = 1
    else:
        DC_D_3[gram] += 1
In [142]:
for gram in DC_D_3.keys():
    if DC_D_3[gram] > 60:
        print ' '.join(gram), DC_D_3[gram]
I had been 88
said my aunt 219
that it was 83
I can not 62
I should have 71
out of the 124
that she was 61
that I was 124
I am sure 88
a good deal 71
one of the 65
if I had 82
as if he 81
would have been 62
that I had 90
I am not 65
I could not 125
I do know 102
there was a 73
In [144]:
DC_fourgrams = []
for t in DC_tokens:
    DC_fourgrams += nltk.ngrams((w for w in t if w.isalpha()), 4)
In [145]:
DC_D_4 = {}
for gram in DC_fourgrams:
    if not DC_D_4.get(gram):
        DC_D_4[gram] = 1
    else:
        DC_D_4[gram] += 1
In [149]:
for gram in DC_D_4.keys():
    if DC_D_4[gram] > 18:
        print ' '.join(gram), DC_D_4[gram]
as well as I 22
for a long time 22
as if he were 30
as if he had 19
I have no doubt 27
in the course of 28
it would have been 22
a good deal of 28
in a state of 23
I could not help 32
if I had been 19
I am sure I 26
as if it were 28
as if I had 25
for a little while 23
I do know what 31

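Comparison with [MM] reveals a problem in how I am handling punctuation. For example, I record the most frequent 4-gram as "I do know what", but in [MM] it is "I don't know what". The cause is that word_tokenize splits "don't" into the two tokens "do" and "n't", and the isalpha() filter then discards "n't".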

In [163]:
david_copperfield_raw_text.find("I do know what")
Out[163]:
-1
In [151]:
david_copperfield_raw_text.find("I don't know what")
Out[151]:
19264
In [162]:
print david_copperfield_raw_text[19264:19296]
I don't know what's the
matter.

At the very least I ought to reduce all text to lowercase, as done in [MM].
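
One possible way to address both points is sketched here with a regular expression rather than word_tokenize: lowercase the text and treat an apostrophe between letters as part of the word, so that "don't" survives as a single token. The pattern and the helper name simple_words are assumptions for illustration, not the procedure used in [MM], and accented letters are not handled.

```python
import re

# Letters, optionally joined by a straight or curly apostrophe (don't, there's).
WORD_RE = re.compile(u"[a-z]+(?:['\u2019][a-z]+)*")

def simple_words(text):
    """Lowercase the text and return its word tokens, keeping contractions whole."""
    return WORD_RE.findall(text.lower())

words = simple_words(u"I don't know what's the\r\nmatter.")
```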

Finding suspended quotations

We begin with an extract.

In [40]:
test_extract = sents[1024: 1037]
In [41]:
print ' '.join(test_extract)
Then, withdrawing
his hand and swallowing his mouthful of chop, he said to Stephen:

‘Now you know, this good lady is a born lady, a high lady. You are not
to suppose because she keeps my house for me, that she hasn’t been very
high up the tree—ah, up at the top of the tree! Now, if you have got
anything to say that can’t be said before a born lady, this lady will
leave the room. If what you have got to say _can_ be said before a born
lady, this lady will stay where she is.’

‘Sir, I hope I never had nowt to say, not fitten for a born lady to year,
sin’ I were born mysen’,’ was the reply, accompanied with a slight flush. ‘Very well,’ said Mr. Bounderby, pushing away his plate, and leaning
back. ‘Fire away!’

‘I ha’ coom,’ Stephen began, raising his eyes from the floor, after a
moment’s consideration, ‘to ask yo yor advice. I need ’t overmuch. I
were married on Eas’r Monday nineteen year sin, long and dree. She were
a young lass—pretty enow—wi’ good accounts of herseln. Well! She went
bad—soon. Not along of me. Gonnows I were not a unkind husband to her.’

‘I have heard all this before,’ said Mr. Bounderby.

Interview Task

In [2]:
import codecs, nltk
little_dorrit_path = "/home/matthew/workspace/resources/C/Corpus Stylistics/Dickens, Charles/pg963.txt"
little_dorrit_file = codecs.open(little_dorrit_path, encoding = 'utf-8')
In [3]:
little_dorrit_raw = little_dorrit_file.read()
In [4]:
len(little_dorrit_raw)
Out[4]:
1936177
In [5]:
little_dorrit_raw.find(u'At the close of this recital')
Out[5]:
1725461
In [9]:
end_phrase = 'producing the money.'
little_dorrit_raw.find(end_phrase)
Out[9]:
1727185
In [106]:
task_string = little_dorrit_raw[1725461:1727185 + len(end_phrase)]
print task_string
At the close of this recital, Arthur turned his eyes upon the impudent
and wicked face. As it met his, the nose came down over the moustache
and the moustache went up under the nose. When nose and moustache had
settled into their places again, Monsieur Rigaud loudly snapped his
fingers half-a-dozen times; bending forward to jerk the snaps at Arthur,
as if they were palpable missiles which he jerked into his face.

'Now, Philosopher!' said Rigaud.'What do you want with me?'

'I want to know,' returned Arthur, without disguising his abhorrence,
'how you dare direct a suspicion of murder against my mother's house?'

'Dare!' cried Rigaud. 'Ho, ho! Hear him! Dare? Is it dare? By Heaven, my
small boy, but you are a little imprudent!'

'I want that suspicion to be cleared away,' said Arthur. 'You shall
be taken there, and be publicly seen. I want to know, moreover,
what business you had there when I had a burning desire to fling you
down-stairs. Don't frown at me, man! I have seen enough of you to know
that you are a bully and coward. I need no revival of my spirits from
the effects of this wretched place to tell you so plain a fact, and one
that you know so well.'

White to the lips, Rigaud stroked his moustache, muttering, 'By Heaven,
my small boy, but you are a little compromising of my lady, your
respectable mother'--and seemed for a minute undecided how to act.
His indecision was soon gone. He sat himself down with a threatening
swagger, and said:

'Give me a bottle of wine. You can buy wine here. Send one of your
madmen to get me a bottle of wine. I won't talk to you without wine.
Come! Yes or no?'

'Fetch him what he wants, Cavalletto,' said Arthur, scornfully,
producing the money.
In [397]:
import re

re_1 = r"'[^']+'"
re_2 = r"'[a-zA-Z0-9_,!? ]+(?:[-'][a-zA-Z0-9_,!? ]+)*'"
re_3 = r"'[^']+[\.,!?]'"

nltk.re_show(re_1, task_string[423:])
{'Now, Philosopher!'} said Rigaud.{'What do you want with me?'}

{'I want to know,'} returned Arthur, without disguising his abhorrence,
{'how you dare direct a suspicion of murder against my mother'}s house?{'

'}Dare!{' cried Rigaud. '}Ho, ho! Hear him! Dare? Is it dare? By Heaven, my
small boy, but you are a little imprudent!{'

'}I want that suspicion to be cleared away,{' said Arthur. '}You shall
be taken there, and be publicly seen. I want to know, moreover,
what business you had there when I had a burning desire to fling you
down-stairs. Don{'t frown at me, man! I have seen enough of you to know
that you are a bully and coward. I need no revival of my spirits from
the effects of this wretched place to tell you so plain a fact, and one
that you know so well.'}

White to the lips, Rigaud stroked his moustache, muttering, {'By Heaven,
my small boy, but you are a little compromising of my lady, your
respectable mother'}--and seemed for a minute undecided how to act.
His indecision was soon gone. He sat himself down with a threatening
swagger, and said:

{'Give me a bottle of wine. You can buy wine here. Send one of your
madmen to get me a bottle of wine. I won'}t talk to you without wine.
Come! Yes or no?{'

'}Fetch him what he wants, Cavalletto,' said Arthur, scornfully,
producing the money.
In [396]:
re.findall(re_1, task_string)
Out[396]:
[u"'Now, Philosopher!'",
 u"'What do you want with me?'",
 u"'I want to know,'",
 u"'how you dare direct a suspicion of murder against my mother'",
 u"'\r\n\r\n'",
 u"' cried Rigaud. '",
 u"'\r\n\r\n'",
 u"' said Arthur. '",
 u"'t frown at me, man! I have seen enough of you to know\r\nthat you are a bully and coward. I need no revival of my spirits from\r\nthe effects of this wretched place to tell you so plain a fact, and one\r\nthat you know so well.'",
 u"'By Heaven,\r\nmy small boy, but you are a little compromising of my lady, your\r\nrespectable mother'",
 u"'Give me a bottle of wine. You can buy wine here. Send one of your\r\nmadmen to get me a bottle of wine. I won'",
 u"'\r\n\r\n'"]
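
The matches above go wrong because re_1 treats every apostrophe as a quotation mark, so "mother's", "Don't" and "won't" end or start a quotation in the wrong place. One possible refinement, offered as a sketch only, is to treat an apostrophe with a word character on both sides as part of a word. QUOTE_RE and quoted_spans are illustrative names. Dialect forms with a trailing apostrophe (such as "sin'" in "Hard Times") and plural possessives would still confuse it.

```python
import re

# Opening quote: an apostrophe not preceded by a word character.
# Inside: any non-apostrophe, or an apostrophe between two word characters.
# Closing quote: an apostrophe not followed by a word character.
QUOTE_RE = re.compile(r"(?<!\w)'((?:[^']|(?<=\w)'(?=\w))+)'(?!\w)")

def quoted_spans(text):
    """Return the text inside each single-quoted span, without the quotes."""
    return QUOTE_RE.findall(text)

sample = ("'I want to know,' returned Arthur, 'how you dare direct a "
          "suspicion of murder against my mother's house?'")
spans = quoted_spans(sample)
```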

Using the NLTK built-in corpus reader (optional)

In [290]:
from nltk.corpus import PlaintextCorpusReader
corpus_root = "/home/matthew/workspace/resources/C/Corpus Stylistics/Dickens, Charles"
wordlists = PlaintextCorpusReader(corpus_root, '.*')
wordlists.fileids()
Out[290]:
['786-0.txt',
 'pg1023.txt',
 'pg580.txt',
 'pg730.txt',
 'pg766.txt',
 'pg963.txt',
 'pg967.txt']
In [42]:
# Assumption: the sentences of "Hard Times" come from the corpus reader above,
# as lists of tokens; '786-0.txt' is the "Hard Times" file.
hard_times_sents_raw = wordlists.sents('786-0.txt')
In [43]:
len(hard_times_sents_raw)
In [ ]:
a = 0
for sentence in hard_times_sents_raw:
    a += len(sentence)
print a
In [ ]:
import re
s = hard_times_sents_raw[1000]
# re.findall(r'\W+', sentence)
print s

Removing extraneous text (optional)

In this section I show how to remove extraneous text from the raw text of "Hard Times". This is done by hand: I identify the first and last sentences of the novel and keep everything between them. There are some Unicode issues that I haven't yet resolved.

In [ ]:
hard_times_first_sentence = '\xe2 \x80\x98 NOW , what I want is , Facts .'
hard_times_first_sentence.split() in hard_times_sents_raw
In [ ]:
first_sentence_index = hard_times_sents_raw.index(hard_times_first_sentence.split())
In [ ]:
' '.join(hard_times_sents_raw[first_sentence_index])
In [ ]:
hard_times_last_sentence = 'We shall sit with lighter bosoms on the hearth , to see the ashes of our fires turn gray and cold .'
hard_times_last_sentence.split() in hard_times_sents_raw
In [ ]:
last_sentence_index = hard_times_sents_raw.index(hard_times_last_sentence.split())
In [ ]:
' '.join(hard_times_sents_raw[last_sentence_index])
In [ ]:
hard_times_sents = hard_times_sents_raw[first_sentence_index:last_sentence_index + 1]
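
The same trimming can also be sketched directly on a raw string, by slicing from the first occurrence of an opening phrase to the end of the last occurrence of a closing phrase. trim_between is a hypothetical helper and the sample text is made up; a real Project Gutenberg file would need its own phrases.

```python
def trim_between(raw, first_phrase, last_phrase):
    """Return raw from first_phrase through last_phrase, or None if either is missing."""
    start = raw.find(first_phrase)
    end = raw.rfind(last_phrase)
    if start == -1 or end == -1 or end < start:
        return None
    return raw[start:end + len(last_phrase)]

sample = (u"LICENSE NOTES. NOW, what I want is, Facts. The story. "
          u"Turn gray and cold. END NOTES.")
body = trim_between(sample, u"NOW, what I want", u"gray and cold.")
```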