Working with Corpora

Warm-up

In this course, we will focus mainly on NLTK, a Python package for NLP and related text-mining tasks. NLTK comes with abundant language resources as well as extensive documentation that covers every module, class, and function in the toolkit, specifying parameters and giving usage examples.

Check that NLTK is installed by typing import nltk at the prompt. Run some graphical demonstrations, e.g., type nltk.app.concordance() to see how a concordancer works with a corpus. You are also encouraged to run the other demonstrations.

In [10]:
import nltk
# nltk.app.concordance() # it takes some time
# nltk.download()        # install data and packages required.

Once again, why should you be happy working on NLP with Python?

In [44]:
# Code Guess

for line in open('../Corpus_data/4Practice/alice.txt'):
	for word in line.split():
		if word.endswith('ing'):
			print word

# what about words ending with 'ed'?
Being
keeping
contacting
trying
editing
upping
something
using
reading
sending
receiving
paying
including
receiving
sending
including
following
including
resulting
cessing
using
following
scanning
beginning
sitting
having
nothing
considering
making
getting
picking
nothing
burning
considering
stopping
falling
going
coming
killing
nothing
tumbling
anything
getting
showing
falling
nothing
talking
saying
dozing
walking
saying
hurrying
hanging
trying
wondering
nothing
waiting
hoping
shutting
going
finding
shutting
going
going
going
having
finding
nothing
going
thing
crying
bring
trying
having
playing
pretending
lying
holding
expecting
nothing
opening
getting
planning
sending
lying
crying
shedding
reaching
pattering
trotting
muttering
fanning
everything
feeling
thinking
puzzling
saying
shining
smiling
putting
saying
being
being
growing
going
shrinking
shrinking
lying
bathing
digging
lodging
trying
being
everything
something
splashing
Everything
swimming
speaking
thing
having
anything
soothing
purring
licking
washing
thing
catching
bristling
trembling
swimming
making
trembling
getting
queer-looking
clinging
dripping
talking
knowing
thing
Atheling
getting
turning
rising
meeting
meaning
going
thing
thing
running
running
pointing
calling
turning
saying
thing
anything
looking
thing
something
turning
looking
puzzling
something
morning
nothing
wasting
cunning
thinking
looking
nothing
getting
walking
talking
saying
addressing
catching
wrapping
getting
trembling
pattering
hoping
coming
trotting
looking
muttering
looking
hunting
seen--everything
hunting
doing
trying
going
sending
fancying
thing
"Coming
ordering
going
interesting
being
pressing
being
saying
lying
getting
growing
being
thing
taking
making
pattering
coming
forgetting
waiting
something
Digging
`Digging
pulling
hearing
anything
rumbling
talking
bring
coming
everything
scratching
scrambling
saying
thing
squeaking
something
moving
rattling
turning
shrinking
waiting
being
giving
something
thing
thing
peering
looking
stretching
trying
coaxing
knowing
being
thinking
having
expecting
running
barking
hanging
making
teaching
something
anything
thing
growing
sitting
smoking
taking
anything
encouraging
opening
being
beginning
making
puzzling
something
swallowing
nothing
something
shilling
anything
beginning
changing
losing
rearing
smoking
remarking
looking
trying
shrinking
moving
shaking
getting
curving
going
nothing
beating
nothing
talking
attending
pleasing
saying
anything
hatching
beginning
raising
thinking
wriggling
trying
denying
telling
looking
looking
getting
nibbling
growing
bringing
anything
talking
puzzling
going
thing
nibbling
looking
wondering
running
judging
producing
changing
hearing
sitting
staring
making
going
howling
attending
looking
skimming
nothing
repeating
`Anything
talking
sitting
nursing
leaning
stirring
sneezing
howling
sitting
grinning
feeling
trying
throwing
everything
howling
jumping
showing
`Talking
stirring
nursing
singing
giving
tossing
thing
flinging
thing
snorting
doubling
straightening
nursing
undoing
thing
sneezing
expressing
getting
thing
going
nothing
thing
beginning
thinking
saying
seeing
sitting
waving
waving
getting
looking
expecting
raving
sitting
appearing
vanishing
beginning
ending
thing
saying
raving
having
sitting
using
resting
talking
encouraging
nothing
being
looking
hearing
asking
thing
thing
thing
talking
thing
thing
turning
looking
shaking
holding
looking
nothing
looking
meaning
opening
going
turning
something
asking
wasting
tossing
anything
(pointing
sing
something
singing
murdering
thing
moving
beginning
getting
eating
thinking
living
nothing
making
beginning
learning
forgetting
considering
choosing
interrupting
learning
yawning
rubbing
getting
things--everything
going
being
thing
drawing
hoping
trying
leading
taking
unlocking
nibbling
growing
painting
splashing
bringing
watching
painting
doing
looking
carrying
jumping
talking
smiling
everything
noticing
carrying
having
tossing
turning
pointing
lying
lying
glaring
King
bowing
turning
doing
going
examining
remaining
looking
wondering
walking
peeping
running
tumbling
managing
getting
hanging
going
bursting
going
provoking
crawling
getting
walking
waiting
quarrelling
fighting
stamping
shouting
beheading
looking
wondering
being
watching
getting
speaking
feeling
complaining
confusing
being
walking
finishing
talking
going
looking
King
passing
settling
looking
King
going
screaming
having
croqueting
trying
going
talking
thing
going
anything
something
nothing
fading
King
looking
having
thinking
keeping
going
keeping
minding
digging
finding
wondering
feeling
putting
everything
nothing
everything
`Thinking
beginning
frowning
stamping
resting
remarking
playing
quarrelling
shouting
being
thing
King
lying
leaving
sitting
sighing
sobbing
getting
interesting
thinking
sobbing
asking
`living
`Reeling
counting
Fainting
Laughing
sighing
something
shaking
punching
running
thing
capering
dropping
jumping
dancing
treading
waving
whiting
treading
waiting
whiting
interesting
feeling
whiting
wondering
running
going
adventures--beginning
going
telling
repeating
coming
something
wondering
anything
trembling
sharing
repeating
confusing
thing
sing
accounting
Sing
sing
Waiting
taking
waiting
King
standing
King
looking
everything
everything
being
meaning
writing
anything
putting
King
looking
writing
taking
hunting
King
bringing
King
King
King
turning
staring
shifting
looking
beginning
sitting
growing
staring
`Bring
King
trembling
getting
twinkling
twinkling
twinkling
King
looking
being
King
reading
waiting
sneezing
King
King
folding
frowning
getting
feeling
forgetting
upsetting
sprawling
reminding
picking
running
King
looking
thing
waving
being
being
anything
gazing
King
`Nothing
`Nothing
King
turning
beginning
frowning
making
King
trying
writing
King
trembling
jumping
nothing
thing
clapping
thing
King
nothing
King
rubbing
interrupting
meaning
meaning
meaning
spreading
looking
meaning
turning
being
muttering
King
pointing
`Nothing
throwing
writing
using
trickling
looking
King
King
having
turning
nothing
flying
lying
brushing
reading
getting
thinking
leaning
watching
setting
thinking
dreaming
looking
wandering
neighbouring
never-ending
ordering
sneezing
squeaking
choking
rustling
rippling
waving
rattling
tinkling
lowing
loving
remembering
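
The warm-up loop generalizes to any suffix, including 'ed'. The following is only a sketch under a few assumptions: the helper name words_ending_with and the two sample lines are made up for illustration, and the helper strips surrounding punctuation, which the original loop does not do (note tokens such as "Coming and seen--everything in the output above).

```python
import string

def words_ending_with(lines, suffix):
    """Return the words in `lines` that end with `suffix`."""
    matches = []
    for line in lines:
        for word in line.split():
            # Drop leading/trailing punctuation such as quotes or commas.
            word = word.strip(string.punctuation)
            if word.endswith(suffix):
                matches.append(word)
    return matches

# Hypothetical sample lines; an open file object can be passed instead.
sample = ['Alice was beginning to get very tired,',
          '"Coming in a minute," she called, and hurried on.']
print(words_ending_with(sample, 'ing'))
print(words_ending_with(sample, 'ed'))
```

Because a file object yields lines when iterated, words_ending_with(open(path), 'ed') should behave like the loop above, apart from the punctuation stripping.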

Accessing Corpora

  1. What is a corpus?
  2. What is a corpus query tool (system)?
  3. How are corpora used in Computational Linguistics?
    • Corpora are the lifeblood of computational linguistics research.
    • Collection, Preprocessing, and Analysis.
    • Descriptive Statistics
      • Corpora and statistics are good friends.
        • How likely is the word “friend” to be a verb?
        • What language use patterns are associated with positive product reviews?
    • Statistical Modeling
      • Use statistics to make predictions about unseen language data.
        • Train a statistical model based on part of a corpus.
        • Evaluate on unseen material.
        • Models are task specific.
    • Consistent Evaluation
      • Different approaches to a common task can be evaluated on common material.
      • “Shared Tasks”
      • This allows for “state-of-the-art” results to be established and verified.

Existing Corpora

Accessing Corpora: the programmer's way

Using Corpora Modules in NLTK, you can

  • Access Text Corpora

    • tokenization (/segmentation), Part-of-Speech tagging, chunking and parsing.
    • Processing raw text from the Web and from disk
      • crawling, processing with Unicode.
  • Creating and Managing Custom Corpora

    • Corpus Reader Structure; Life Cycle of a Corpus; Corpus Viewer

I hope you've already downloaded the book collection with nltk.download()

Note: The idea here
[o] How simple programs can help you manipulate and analyze language data
[x] How to write these programs.

Searching in text

  • There are many ways to examine the context of a text apart from simply reading it.
    • E.g., a concordance view shows us every occurrence of a given word, together with some context.
In [5]:
from nltk.book import *  # plain Eng: "from NLTK's book module, load all items"
text1
*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908
Out[5]:
<Text: Moby Dick by Herman Melville 1851>
In [6]:
text1.concordance('curious')
Building index...
Displaying 25 of 54 matches:
 pockets , and produced at length a curious little deformed image with a hunch 
and bent upon narrowly observing so curious a creature . When , at last , his m
pfold among the Green Mountains . A curious sight ; these bashful bears , these
ou will see other sights still more curious , certainly more comical . There we
ken chair , wriggling all over with curious carving ; and the bottom of which w
nt I stood a little puzzled by this curious request , not knowing exactly how t
ucket , though it certainly seems a curious story , that when he sailed the old
us began ranging alongside . It was curious and not unpleasing , how Peleg and 
eens , even modern ones , a certain curious process of seasoning them for their
g come to beg truce of a fortress . Curious to tell , this imperial negro , Aha
ing at it . But what was still more curious , Flask -- you know how curious all
ore curious , Flask -- you know how curious all dreams are -- through all this 
e Unicorn whale . He is certainly a curious example of the Unicornism to be fou
 , seems to feel relieved from some curious restraint ; for , tipping all sorts
s company were assembled , and with curious and not wholly unapprehensive faces
sh ?" " Does he fan - tail a little curious , sir , before he goes down ?" said
eader deliberately . " And has he a curious spout , too ," said Daggoo , " very
me to time have originated the most curious and contradictory speculations rega
ing one or two very interesting and curious particulars in the habits of sperm 
d upon gigantic Daggoo was yet more curious ; for sustaining himself with a coo
ising lee of churches . For by some curious fatality , as it is often noted of 
ore , previously to advert to those curious imaginary portraits of him which ev
ks you will at times meet with very curious touches at the whale , where all ma
nt of Learning " you will find some curious whales . But quitting all these unp
 at all . For it is one of the more curious things about this Leviathan , that 
In [8]:
text1.concordance("curious", width=50, lines=10)
Displaying 10 of 54 matches:
produced at length a curious little deformed imag
arrowly observing so curious a creature . When , 
 Green Mountains . A curious sight ; these bashfu
er sights still more curious , certainly more com
ggling all over with curious carving ; and the bo
ttle puzzled by this curious request , not knowin
it certainly seems a curious story , that when he
g alongside . It was curious and not unpleasing ,
ern ones , a certain curious process of seasoning
ruce of a fortress . Curious to tell , this imper
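
To make the idea of a concordance concrete, here is a minimal keyword-in-context sketch in plain Python. It is not NLTK's implementation: the function name, its parameters, and the toy token list are assumptions chosen for illustration, and it simply aligns each hit with a fixed number of characters of context on both sides.

```python
def concordance(tokens, word, width=30, lines=5):
    """Return up to `lines` keyword-in-context rows for `word`."""
    rows = []
    target = word.lower()
    for i, token in enumerate(tokens):
        if token.lower() != target:
            continue
        # Every token is at least one character long, so `width` tokens
        # on each side always cover `width` characters of context.
        left = ' '.join(tokens[max(0, i - width):i])[-width:]
        right = ' '.join(tokens[i + 1:i + 1 + width])[:width]
        rows.append('%s %s %s' % (left.rjust(width), token, right))
        if len(rows) == lines:
            break
    return rows

# Toy token list standing in for a real text.
tokens = 'It was curious and not unpleasing , how curious all dreams are'.split()
for row in concordance(tokens, 'curious', width=12):
    print(row)
```

Right-justifying the left context is what makes the keyword line up in a single column, as in the NLTK output above.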

Count occurrences of a word

In [9]:
text1.count("monstrous")
Out[9]:
10

What other words appear in a similar range of contexts?

The common_contexts() method allows us to examine just the contexts that are shared by two or more words.

In [10]:
text1.common_contexts(['curious','very'])
Building word-context index...
a_little a_sight
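
The notion of a shared context can be sketched in a few lines, assuming that a "context" is just the pair of words immediately before and after the target. The helper names and the toy sentence are hypothetical, and NLTK's own method may differ in its details.

```python
def contexts(tokens, word):
    """Set of (previous, next) token pairs surrounding `word`."""
    found = set()
    for i in range(1, len(tokens) - 1):
        if tokens[i].lower() == word:
            found.add((tokens[i - 1].lower(), tokens[i + 1].lower()))
    return found

def shared_contexts(tokens, words):
    """Contexts in which every word of `words` occurs."""
    shared = contexts(tokens, words[0])
    for word in words[1:]:
        shared &= contexts(tokens, word)
    return sorted(shared)

# Toy data: 'curious' and 'very' both occur in a_sight and a_little.
tokens = ('a curious sight and a very little dog saw '
          'a very sight of a curious little cat').split()
for before, after in shared_contexts(tokens, ['curious', 'very']):
    print('%s_%s' % (before, after))
```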

Find typical word pairs for this text

In [11]:
text1.collocations(num=20)
Building collocations list
Sperm Whale; Moby Dick; White Whale; old man; Captain Ahab; sperm
whale; Right Whale; Captain Peleg; New Bedford; Cape Horn; cried Ahab;
years ago; lower jaw; never mind; Father Mapple; cried Stubb; chief
mate; white whale; ivory leg; one hand
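
As a rough intuition for collocations, one can count adjacent word pairs and keep the most frequent ones. This is a deliberate simplification: it ranks pairs by raw frequency only, which is not necessarily how NLTK scores them, and the function name, the min_len filter, and the toy tokens are assumptions for illustration.

```python
from collections import Counter

def frequent_bigrams(tokens, num=3, min_len=4):
    """Return the `num` most frequent adjacent pairs of longer words."""
    pairs = Counter(
        (first, second)
        for first, second in zip(tokens, tokens[1:])
        # Skip punctuation and short function words such as 'the'.
        if len(first) >= min_len and len(second) >= min_len
    )
    return [pair for pair, count in pairs.most_common(num)]

tokens = ('the Sperm Whale and the White Whale met ; '
          'the Sperm Whale fled from the White Whale').split()
print(frequent_bigrams(tokens, num=2))
```

Raw counts favour pairs of individually frequent words, which is why real collocation finders usually combine frequency with an association measure.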

What words are similar to "monstrous" in Moby Dick (text1), and in Sense and Sensibility (text2)?

In [12]:
text1.similar("monstrous")
text2.similar("monstrous")
abundant candid careful christian contemptible curious delightfully
determined doleful domineering exasperate fearless few gamesome
impalpable imperial loving maddens mean modifies
Building word-context index...
very exceedingly heartily so a amazingly as extremely good great
remarkably sweet vast

Generate a nonsense text in the style of this document

In [15]:
text1.generate(length=40)
[ Moby Dick . To this , however ignorant the world ashore may be
encircling him . But thank heaven , man ! he ' ll try my hand ; I
forgot all about our horrible oath ; in him

Dispersion plot

  • Another way to examine words is to display where in the text they occur, i.e., how far from the beginning each occurrence appears.
    • Each stripe represents an instance of a word, and each row represents the entire text. (NumPy and Matplotlib required)
In [16]:
text4.dispersion_plot(['citizens', 'democracy', 'freedom'])
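
The data behind such a plot is simply the list of positions at which each word occurs. The sketch below computes these offsets in plain Python; the function name and the toy token list are illustrative assumptions, and no plotting is attempted.

```python
def offsets(tokens, word):
    """Token indices at which `word` occurs in `tokens`."""
    return [i for i, token in enumerate(tokens) if token == word]

# Toy token list standing in for a real text.
tokens = 'freedom for citizens means freedom under law for all citizens'.split()
for word in ['citizens', 'freedom', 'democracy']:
    print('%s: %s' % (word, offsets(tokens, word)))
```

Each index would become one stripe on that word's row; a word that never occurs, such as 'democracy' here, yields an empty row.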

Counting Vocabulary

  • Find out the length of a text, in terms of the words and punctuation symbols that appear. (i.e., tokens)
In [17]:
len(text3)
Out[17]:
44764
  • Find out the set of tokens, where all duplicates are collapsed together (i.e., types).
In [18]:
len(set(text3))
Out[18]:
2789
  • Simple Measure of Lexical Richness
    • Suppose that we define lexical richness as $\frac{tokens}{types}$, then the lexical richness of text3 is calculated as:
In [19]:
from __future__ import division
len(text3)/len(set(text3))
Out[19]:
16.050197203298673
In [23]:
float(len(text3)) / float(len(set(text3)))
Out[23]:
16.050197203298673
  • NLTK can collect the words in a text with their frequencies
In [28]:
fd1 = text1.vocab()
# the most frequent "word"?
fd1.max()
Out[28]:
','
In [29]:
# and its frequency
fd1[',']
Out[29]:
18713
In [27]:
# the 10 most frequent words
fd1.items()[0:10]
Out[27]:
[(',', 18713),
 ('the', 13721),
 ('.', 6862),
 ('of', 6536),
 ('and', 6024),
 ('a', 4569),
 ('to', 4542),
 (';', 4072),
 ('in', 3916),
 ('that', 2982)]
  • Count individual words, and compute what percentage of the text is taken up by a specific word:
In [24]:
text3.count('smote')
Out[24]:
5
In [25]:
100 * text3.count('the') / len(text3)
Out[25]:
5.386024483960325
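
The two measures used in this section can be wrapped in small helper functions. This is a sketch only: the function names and the sample sentence are assumptions, and lexical richness follows the tokens/types definition given above.

```python
def lexical_richness(tokens):
    """Average number of uses per type: number of tokens / number of types."""
    return float(len(tokens)) / len(set(tokens))

def percentage(count, total):
    """Share of `total` taken up by `count`, in percent."""
    return 100.0 * count / total

# Toy token list: 10 tokens, 8 distinct types.
tokens = 'in the beginning God created the heaven and the earth'.split()
print(lexical_richness(tokens))
print(percentage(tokens.count('the'), len(tokens)))
```

Using float() and 100.0 keeps the division exact under Python 2 even without the __future__ import.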

[Exercise]

How many times does the word lol appear in text5? How much is this as a percentage of the total number of words in this text?


How to access built-in NLTK-corpora?

NLTK ships with several useful textual corpora that are used widely in the NLP research community.

  • Brown Corpus - General corpus of American English from the 1960s.
  • Inaugural Corpus - US Presidential Inaugural Address Corpus. 145,735 tokens.
  • Treebank Corpus - Penn Treebank Corpus sample. 100,676 tagged tokens in parse trees.
  • Genesis Corpus - Multilingual Bible samples. 284,441 words.
  • Gutenberg Corpus: A selection of 14 texts chosen from Project Gutenberg, the largest online collection of free ebooks. The corpus contains a total of 1.7 million words.
  • Stopwords Corpus: Besides regular content words, there is another class of words called stop words that perform important grammatical functions but are unlikely to be interesting by themselves, such as prepositions, complementizers and determiners. NLTK comes bundled with the Stopwords Corpus - a list of 2400 stop words across 11 different languages (including English).
In the following, we look at:

  1. how to select individual texts, and
  2. how to work with them.

From Gutenberg Corpus (Textual Archive)

NLTK includes a small selection of texts from the Project Gutenberg electronic text archive, which contains some 25,000 free electronic books, hosted at http://www.gutenberg.org/. Use the following to list the available texts and to load an individual text from the corpus:

In [70]:
nltk.corpus.gutenberg.fileids()
Out[70]:
['austen-emma.txt',
 'austen-persuasion.txt',
 'austen-sense.txt',
 'bible-kjv.txt',
 'blake-poems.txt',
 'bryant-stories.txt',
 'burgess-busterbrown.txt',
 'carroll-alice.txt',
 'chesterton-ball.txt',
 'chesterton-brown.txt',
 'chesterton-thursday.txt',
 'edgeworth-parents.txt',
 'melville-moby_dick.txt',
 'milton-paradise.txt',
 'shakespeare-caesar.txt',
 'shakespeare-hamlet.txt',
 'shakespeare-macbeth.txt',
 'whitman-leaves.txt']
In [71]:
emma = nltk.corpus.gutenberg.words('austen-emma.txt')

Note that if you want to perform concordancing and the other tasks mentioned above (searching, counting, dispersion plotting, etc.) on such a text, you need to wrap its word list in an nltk.Text object first, using the following pair of statements:

In [11]:
emma = nltk.Text(nltk.corpus.gutenberg.words('austen-emma.txt')) 
emma.concordance('surprize')
emma.similar('surprize') 
Building index...
Displaying 25 of 37 matches:
er father , was sometimes taken by surprize at his being still able to pity ` 
hem do the other any good ." " You surprize me ! Emma must do Harriet good : a
Knightley actually looked red with surprize and displeasure , as he stood up ,
r . Elton , and found to his great surprize , that Mr . Elton was actually on 
d aid ." Emma saw Mrs . Weston ' s surprize , and felt that it must be great ,
father was quite taken up with the surprize of so sudden a journey , and his f
y , in all the favouring warmth of surprize and conjecture . She was , moreove
he appeared , to have her share of surprize , introduction , and pleasure . Th
ir plans ; and it was an agreeable surprize to her , therefore , to perceive t
talking aunt had taken me quite by surprize , it must have been the death of m
f all the dialogue which ensued of surprize , and inquiry , and congratulation
 the present . They might chuse to surprize her ." Mrs . Cole had many to agre
the mode of it , the mystery , the surprize , is more like a young woman ' s s
 to her song took her agreeably by surprize -- a second , slightly but correct
" " Oh ! no -- there is nothing to surprize one at all .-- A pretty fortune ; 
t to be considered . Emma ' s only surprize was that Jane Fairfax should accep
of your admiration may take you by surprize some day or other ." Mr . Knightle
ation for her will ever take me by surprize .-- I never had a thought of her i
 expected by the best judges , for surprize -- but there was great joy . Mr . 
 sound of at first , without great surprize . " So unreasonably early !" she w
d Frank Churchill , with a look of surprize and displeasure .-- " That is easy
; and Emma could imagine with what surprize and mortification she must be retu
tled that Jane should go . Quite a surprize to me ! I had not the least idea !
 . It is impossible to express our surprize . He came to speak to his father o
g engaged !" Emma even jumped with surprize ;-- and , horror - struck , exclai
Building word-context index...
her letter circumstance him idea it man marriage mind pleasure truth
word acquaintance all blind care character child conduct consolation

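Most NLTK corpus readers offer the methods raw(), words(), and sents(), which return the text as a single string, as a list of words, and as a list of sentences, respectively.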

In [12]:
from nltk.corpus import inaugural
print inaugural.raw()[:60]      
print inaugural.words()[:6] 
Fellow-Citizens of the Senate and of the House of Representa
['Fellow', '-', 'Citizens', 'of', 'the', 'Senate']

Brown Corpus (tagged corpus)

The Brown Corpus contains texts from different sources, which have been categorized by genre, such as news, fiction, etc. See here for the complete list. We can optionally specify particular categories or files to read.

In [89]:
from nltk.corpus import brown
brown.fileids()
brown.categories()
Out[89]:
['adventure',
 'belles_lettres',
 'editorial',
 'fiction',
 'government',
 'hobbies',
 'humor',
 'learned',
 'lore',
 'mystery',
 'news',
 'religion',
 'reviews',
 'romance',
 'science_fiction']
In [91]:
brown.words(categories='hobbies')
Out[91]:
['Too', 'often', 'a', 'beginning', 'bodybuilder', ...]
In [92]:
brown.words(fileids=['ce36'])
Out[92]:
['There', 'comes', 'a', 'time', 'in', 'the', 'lives', ...]
In [94]:
brown.sents(categories=['news', 'fiction', 'romance'])
Out[94]:
[['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', 'Friday', 'an', 'investigation', 'of', "Atlanta's", 'recent', 'primary', 'election', 'produced', '``', 'no', 'evidence', "''", 'that', 'any', 'irregularities', 'took', 'place', '.'], ['The', 'jury', 'further', 'said', 'in', 'term-end', 'presentments', 'that', 'the', 'City', 'Executive', 'Committee', ',', 'which', 'had', 'over-all', 'charge', 'of', 'the', 'election', ',', '``', 'deserves', 'the', 'praise', 'and', 'thanks', 'of', 'the', 'City', 'of', 'Atlanta', "''", 'for', 'the', 'manner', 'in', 'which', 'the', 'election', 'was', 'conducted', '.'], ...]

In addition to the methods raw(), words(), and sents(), a tagged corpus in NLTK offers a tagged_words() method to access the corpus as a list of (word, tag) pairs.

In [95]:
brown.tagged_words()[:3]	 
Out[95]:
[('The', 'AT'), ('Fulton', 'NP-TL'), ('County', 'NN-TL')]
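
Pairs in this (word, tag) shape are what make questions like "How likely is the word 'friend' to be a verb?" answerable. The sketch below counts tags per word in plain Python; the function name and the toy tagged data are hypothetical and are not taken from the Brown Corpus.

```python
from collections import Counter, defaultdict

def tag_counts(tagged_words):
    """Map each lowercased word to a Counter of the tags it received."""
    counts = defaultdict(Counter)
    for word, tag in tagged_words:
        counts[word.lower()][tag] += 1
    return counts

# Hypothetical toy data in the same (word, tag) shape as tagged_words().
tagged = [('The', 'AT'), ('friend', 'NN'), ('will', 'MD'), ('friend', 'VB'),
          ('a', 'AT'), ('friend', 'NN')]
counts = tag_counts(tagged)
total = sum(counts['friend'].values())
# Estimated probability that 'friend' is tagged as a verb in this toy data.
print(float(counts['friend']['VB']) / total)
```

The same function should accept the output of brown.tagged_words() directly, since it only relies on iterating over pairs.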

NLTK's built-in corpora are diverse. One example is the Names Corpus, which contains two files, female.txt and male.txt, each containing a list of a few thousand common first names organized by gender.

In [47]:
from nltk.corpus import names   
names.fileids()
names.words('female.txt')[0:20]
len(names.words('female.txt'))
Out[47]:
5001

Another example is the Words Corpus, which provides English word lists: one file (en-basic) with the 850 Basic English words, and another with over 200,000 known English words.

In [48]:
from nltk.corpus import words  
words.fileids()
words.words('en-basic')[0:20]
len(words.words('en-basic'))
Out[48]:
850

[Exercise]

Recall the easy-to-read code from the warm-up. Write a script that prints out the words ending with 'tion' in the English Basic Word List ('en-basic').


Simple Corpus Statistics

NLTK book 1.3: Frequency Distributions

NLTK book 2.2: Conditional Frequency Distributions

NLTK provides built-in support for frequency distribution.

In [50]:
fd1 = FreqDist(text3) # nltk.probability.FreqDist()
fd1
Out[50]:
<FreqDist with 2789 samples and 44764 outcomes>
  • Use the FreqDist to find the 20 most frequent words of the Book of Genesis (text3)
In [51]:
voc1 = fd1.keys() # keys() gives a list of all the distinct types in the text, here in order of decreasing frequency.
voc1[:20]
Out[51]:
[',',
 'and',
 'the',
 'of',
 '.',
 'And',
 'his',
 'he',
 'to',
 ';',
 'unto',
 'in',
 'that',
 'I',
 'said',
 'him',
 'a',
 'my',
 'was',
 'for']
In [53]:
fd1['the']
Out[53]:
2411
In [55]:
len(fd1.hapaxes())
Out[55]:
1195
  • Cumulative frequency plot
In [56]:
fd1.plot(20,cumulative = True)
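
The core of a frequency distribution can be approximated with the standard library's collections.Counter. The sketch below is only an analogy and not NLTK's FreqDist; the toy token list is an assumption, and hapaxes are computed by hand as the types that occur exactly once.

```python
from collections import Counter

# Toy token list standing in for a real text.
tokens = 'and God said let there be light and there was light'.split()
fd = Counter(tokens)

print(fd['and'])           # absolute count of one type
print(fd.most_common(3))   # the three most frequent types with their counts

# Hapaxes: types that occur exactly once in the text.
hapaxes = sorted(w for w, n in fd.items() if n == 1)
print(hapaxes)

# Relative frequency: count divided by the total number of tokens.
print(float(fd['light']) / len(tokens))
```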

A condition specifies the context in which an experiment is performed. Often, we are interested in the effect that conditions have on the outcome for an experiment. For example, we might want to examine how the distribution of a word’s length (the outcome) is affected by the word’s initial letter (the condition).

Conditional frequency distributions provide a tool for exploring this type of question. A conditional frequency distribution is a collection of frequency distributions for the same experiment, run under different conditions. The individual frequency distributions are indexed by the condition. Conditional frequency distributions are represented using the ConditionalFreqDist class, which is defined by the nltk.probability module.

The ConditionalFreqDist constructor creates a new empty conditional frequency distribution, and to access the frequency distribution for a condition, use the indexing operator.

In [ ]:
cfd1 = ConditionalFreqDist()  # a new, empty conditional frequency distribution
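
The word-length example from above can be sketched without NLTK by keeping one Counter per condition. This is an illustration of the idea only: the function name and the toy sentence are assumptions, and a dictionary of Counters is used here in place of the ConditionalFreqDist class.

```python
from collections import Counter, defaultdict

def conditional_freq(pairs):
    """Build {condition: Counter(outcomes)} from (condition, outcome) pairs."""
    cfd = defaultdict(Counter)
    for condition, outcome in pairs:
        cfd[condition][outcome] += 1
    return cfd

# Condition: the word's initial letter. Outcome: the word's length.
tokens = 'the whale that the crew then saw was white'.split()
cfd = conditional_freq((w[0], len(w)) for w in tokens)

# Indexing by a condition gives the frequency distribution for that condition.
print(dict(cfd['t']))
print(dict(cfd['w']))
```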

[Exercise]

  • Refer to Table 1-2 in the NLTK book, and apply the other functions to Shakespeare's Hamlet
In [58]:
fd2 = FreqDist(nltk.corpus.shakespeare.words('hamlet.xml'))
In [61]:
fd2.tabulate(20)
fd2['the'] # num. of times 'the' occurs
fd2.freq('the') # relative frequency of 'the' (count / total number of tokens)
fd2.plot(20)
   ,    .  the    '  and   to   of    I    ;    :  you    a   my    ?   in HAMLET   it   is  not  his
3211 1289  996  909  705  640  631  606  580  519  497  467  435  417  413  389  362  324  300  281
  • For R fans: output the data and send it to R for pretty plotting and statistical modeling.
    • you could use pickle to store the FreqDist (or ConditionalFreqDist) object in a file
In [68]:
import pickle
f = open('test.pkl', 'wb')  # pickle files should be opened in binary mode
pickle.dump(fd1, f)
f.close()
  • and to get back the object
In [69]:
f = open('test.pkl', 'rb')
fd1 = pickle.load(f)  # pass the open file object to load(), not the old fd1 variable
f.close()
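
A self-contained round trip, sketched with a plain Counter standing in for the FreqDist, shows the pattern: the file object goes to both dump() and load(), and the file is opened in binary mode. The temporary directory and the sample sentence are assumptions for illustration.

```python
import os
import pickle
import tempfile
from collections import Counter

# A Counter stands in for the frequency distribution object.
fd = Counter('to be or not to be'.split())

# Write to a throwaway directory so no existing file is overwritten.
path = os.path.join(tempfile.mkdtemp(), 'test.pkl')

with open(path, 'wb') as f:      # 'wb': write in binary mode
    pickle.dump(fd, f)

with open(path, 'rb') as f:      # 'rb': read in binary mode
    restored = pickle.load(f)    # load() takes the open file object

print(restored == fd)
```

The with statement closes the file even if an error occurs, so no explicit f.close() is needed.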

Doing Stylistics

Now you can do corpus-based stylistics by studying systematic differences between genres. For instance, let's compare how different genres use modal verbs.

  • First, provide the counts for a particular genre.
In [96]:
fiction = brown.words(categories='fiction')
fd_fiction = nltk.FreqDist([w.lower() for w in fiction])
In [97]:
modals = ['can','could','may','might','must','will']
for m in modals:
    print m + ":", fd_fiction[m],
    
can: 39 could: 168 may: 10 might: 44 must: 55 will: 56
  • Second, obtain counts for each genre of interest by using NLTK's support for Conditional Frequency Distributions.
In [98]:
cfd = nltk.ConditionalFreqDist(
    (genre, word)
    for genre in brown.categories()
    for word in brown.words(categories=genre))

genres = ['news', 'religion', 'hobbies','fiction', 'romance', 'humor','adventure']
cfd.tabulate(conditions=genres, samples=modals)
           can could  may might must will
     news   93   86   66   38   50  389
 religion   82   59   78   12   54   71
  hobbies  268   58  131   22   83  264
  fiction   37  166    8   44   55   52
  romance   74  193   11   51   45   43
    humor   16   30    8    8    9   13
adventure   46  151    5   58   27   50

[For R guys]: Exploratory Statistics and (interactive) Plotting !!

Check rCharts

In [1]:
load_ext rmagic
In [3]:
%%R
require(rCharts)

hair_eye_male <- subset(as.data.frame(HairEyeColor), Sex == "Male")
n1 <- nPlot(Freq ~ Hair, group = "Eye", data = hair_eye_male, type = "multiBarChart")
n1$save('nvd3plot.html',cdn=TRUE)
In [4]:
from IPython.display import HTML
HTML("<iframe width=800 height=400 src=http://127.0.0.1:8888/files/nvd3plot.html></iframe>")
Out[4]: