import sys
#reload(sys)
#sys.setdefaultencoding("utf-8")
import nltk
nltk.download()
showing info http://nltk.googlecode.com/svn/trunk/nltk_data/index.xml
True
from nltk.book import *
Wikipedia:
A concordance is an alphabetical list of the principal words used in a book or body of work, with their immediate contexts. Because of the time, difficulty, and expense involved in creating a concordance in the pre-computer era, only works of special importance, such as the Vedas, Bible, Qur'an or the works of Shakespeare, had concordances prepared for them.
print text1
text1.concordance("monstrous")
<Text: Moby Dick by Herman Melville 1851>
Displaying 11 of 11 matches:
ong the former , one was of a most monstrous size . ... This came towards us ,
ON OF THE PSALMS . " Touching that monstrous bulk of the whale or ork we have r
ll over with a heathenish array of monstrous clubs and spears . Some were thick
d as you gazed , and wondered what monstrous cannibal and savage could ever hav
that has survived the flood ; most monstrous and most mountainous ! That Himmal
they might scout at Moby Dick as a monstrous fable , or still worse and more de
th of Radney .'" CHAPTER 55 Of the monstrous Pictures of Whales . I shall ere l
ing Scenes . In connexion with the monstrous pictures of whales , I am strongly
ere to enter upon those still more monstrous stories of them which are to be fo
ght have been rummaged out of this monstrous cabinet there is no telling . But
of Whale - Bones ; for Whales of a monstrous size are oftentimes cast up dead u
print text3
text3.concordance("begat")
<Text: The Book of Genesis>
Displaying 25 of 67 matches:
unto Enoch was born Irad : and Irad begat Mehujael : and Mehujael begat Methus
d Irad begat Mehujael : and Mehujael begat Methusa and Methusael begat Lamech .
Mehujael begat Methusa and Methusael begat Lamech . And Lamech took unto him tw
ed an hundred and thirty years , and begat a son in his own likeness , and afte
n Seth were eight hundred yea and he begat sons and daughters : And all the day
ived an hundred and five years , and begat Enos : And Seth lived after he begat
begat Enos : And Seth lived after he begat Enos eight hundred and seven years ,
eight hundred and seven years , and begat sons and daughte And all the days of
. And Enos lived ninety years , and begat Cainan : And Enos lived after he beg
gat Cainan : And Enos lived after he begat Cainan eight hundred and fifteen yea
ight hundred and fifteen years , and begat sons and daughte And all the days of
. And Cainan lived seventy years and begat Mahalaleel : And Cainan lived after
halaleel : And Cainan lived after he begat Mahalaleel eight hundred and forty y
eight hundred and forty years , and begat sons and daughte And all the days of
eel lived sixty and five years , and begat Jared : And Mahalaleel lived after h
ared : And Mahalaleel lived after he begat Jared eight hundred and thirty years
eight hundred and thirty years , and begat sons and daughte And all the days of
hundred sixty and two years , and he begat Eno And Jared lived after he begat E
e begat Eno And Jared lived after he begat Enoch eight hundred years , and bega
egat Enoch eight hundred years , and begat sons and daughte And all the days of
och lived sixty and five years , and begat Methuselah : And Enoch walked with G
: And Enoch walked with God after he begat Methuselah three hundred years , and
Methuselah three hundred years , and begat sons and daughte And all the days of
hundred eighty and seven years , and begat Lamech . And Methuselah lived after
mech . And Methuselah lived after he begat Lamech seven hundred eighty and two
NLTK can perform contextual analyses. Here, it looks for contexts of the word "monstrous" ("most monstrous size"), and then looks for other words that appear in such a context ("most ____ size").
print text1
text1.similar("monstrous")
<Text: Moby Dick by Herman Melville 1851>
abundant candid careful christian contemptible curious delightfully determined doleful domineering exasperate fearless few gamesome horrible impalpable imperial lamentable lazy loving
print text2
text2.similar("monstrous")
<Text: Sense and Sensibility by Jane Austen 1811>
very exceedingly heartily so a amazingly as extremely good great remarkably sweet vast
print text2
text2.common_contexts(["monstrous","very"])
<Text: Sense and Sensibility by Jane Austen 1811>
a_lucky a_pretty am_glad be_glad is_pretty
print text1
text1.common_contexts(["monstrous","curious"])
<Text: Moby Dick by Herman Melville 1851>
most_and
A dispersion plot is just a simple indication of where a word occurs within a corpus. You'll even find similar indicators in the scroll bars of modern editors.
print text4
figsize(10,6)
text4.dispersion_plot(["citizens","democracy","freedom","liberty","duties","America","slavery","women"])
<Text: Inaugural Address Corpus>
For n-gram text generation, we estimate the conditional probabilities $P(x_k|x_{k-1},\dots,x_{k-n+1})$, i.e. each word conditioned on the previous $n-1$ words, and then sample from this distribution.
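(As a Python 3 sketch of the idea, not NLTK's implementation: a bigram model, i.e. n = 2, estimated by counting successors and sampled with random.choices.)

```python
import random
from collections import defaultdict, Counter

def train_bigrams(tokens):
    # For each word, count how often each successor follows it:
    # an empirical estimate of P(x_k | x_{k-1}).
    model = defaultdict(Counter)
    for w1, w2 in zip(tokens, tokens[1:]):
        model[w1][w2] += 1
    return model

def generate(model, start, n=10):
    # Repeatedly sample the next word from the successor counts,
    # stopping early if the last word has no recorded successor.
    out = [start]
    while len(out) < n and model[out[-1]]:
        successors = model[out[-1]]
        words = list(successors)
        weights = [successors[w] for w in words]
        out.append(random.choices(words, weights=weights)[0])
    return out

tokens = "in the beginning god created the heaven and the earth".split()
model = train_bigrams(tokens)
print(" ".join(generate(model, "the", n=6)))
```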
print text3
text3.generate()
<Text: The Book of Genesis>
In the six hundredth and first year , and the name of the garden ; and he was at Chezib , when Laban heard the voice of the gard But of the ground made the firmament from the way to come in unto them , Know of a month . And Onan knew that they might dwell together ; and they joined battle with them , What profit is it that compasseth the whole age of Jacob answered Shechem and Hamor his father , an adder in the land shall be called any more Jacob , went in male and
Texts in NLTK are represented as Text objects. Text objects are a lot like "lists" of words or tokens.
print list(text4)[:30]
['Fellow', '-', 'Citizens', 'of', 'the', 'Senate', 'and', 'of', 'the', 'House', 'of', 'Representatives', ':', 'Among', 'the', 'vicissitudes', 'incident', 'to', 'life', 'no', 'event', 'could', 'have', 'filled', 'me', 'with', 'greater', 'anxieties', 'than', 'that']
len(text3)
44764
print sorted(set(text3))[:100]
[u'!', u"'", u'(', u')', u',', u',)', u'.', u'.)', u':', u';', u';)', u'?', u'?)', u'A', u'Abel', u'Abelmizraim', u'Abidah', u'Abide', u'Abimael', u'Abimelech', u'Abr', u'Abrah', u'Abraham', u'Abram', u'Accad', u'Achbor', u'Adah', u'Adam', u'Adbeel', u'Admah', u'Adullamite', u'After', u'Aholibamah', u'Ahuzzath', u'Ajah', u'Akan', u'All', u'Allonbachuth', u'Almighty', u'Almodad', u'Also', u'Alvah', u'Alvan', u'Am', u'Amal', u'Amalek', u'Amalekites', u'Ammon', u'Amorite', u'Amorites', u'Amraphel', u'An', u'Anah', u'Anamim', u'And', u'Aner', u'Angel', u'Appoint', u'Aram', u'Aran', u'Ararat', u'Arbah', u'Ard', u'Are', u'Areli', u'Arioch', u'Arise', u'Arkite', u'Arodi', u'Arphaxad', u'Art', u'Arvadite', u'As', u'Asenath', u'Ashbel', u'Asher', u'Ashkenaz', u'Ashteroth', u'Ask', u'Asshur', u'Asshurim', u'Assyr', u'Assyria', u'At', u'Atad', u'Avith', u'Baalhanan', u'Babel', u'Bashemath', u'Be', u'Because', u'Becher', u'Bedad', u'Beeri', u'Beerlahairoi', u'Beersheba', u'Behold', u'Bela', u'Belah', u'Benam']
len(text3)*1.0/len(set(text3))
16.050197203298673
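(This token/type ratio is often called lexical diversity. A small Python 3 helper, shown on a short sentence from text3 rather than the whole book:)

```python
def lexical_diversity(tokens):
    # Average number of times each distinct word type is used.
    return len(tokens) / float(len(set(tokens)))

tokens = "and god said let there be light and there was light".split()
print(lexical_diversity(tokens))  # 11 tokens, 8 types -> 1.375
```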
text3.count("smote")
5
sent1
['Call', 'me', 'Ishmael', '.']
text1[100:110]
['and', 'to', 'teach', 'them', 'by', 'what', 'name', 'a', 'whale', '-']
from nltk import *
A frequency distribution (FreqDist) is a histogram: a count of how often each word occurs in a text.
fdist = FreqDist(text1)
fdist['them']
471
plot(list(fdist._cumulative_frequencies()))
[<matplotlib.lines.Line2D at 0x2497b390>]
vocabulary = fdist.keys()
print vocabulary[:50]
[',', 'the', '.', 'of', 'and', 'a', 'to', ';', 'in', 'that', "'", '-', 'his', 'it', 'I', 's', 'is', 'he', 'with', 'was', 'as', '"', 'all', 'for', 'this', '!', 'at', 'by', 'but', 'not', '--', 'him', 'from', 'be', 'on', 'so', 'whale', 'one', 'you', 'had', 'have', 'there', 'But', 'or', 'were', 'now', 'which', '?', 'me', 'like']
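(FreqDist behaves much like the standard library's collections.Counter; a Python 3 sketch on a toy token list:)

```python
from collections import Counter

tokens = "the whale , the whale , the white whale .".split()
fdist = Counter(tokens)        # token -> count, much like nltk.FreqDist
print(fdist["whale"])          # 3
print(fdist.most_common(2))    # the most frequent tokens first
```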
sents()
sent1: Call me Ishmael .
sent2: The family of Dashwood had long been settled in Sussex .
sent3: In the beginning God created the heaven and the earth .
sent4: Fellow - Citizens of the Senate and of the House of Representatives :
sent5: I have a problem with people PMing me to lol JOIN
sent6: SCENE 1 : [ wind ] [ clop clop clop ] KING ARTHUR : Whoa there !
sent7: Pierre Vinken , 61 years old , will join the board as a nonexecutive director Nov. 29 .
sent8: 25 SEXY MALE , seeks attrac older single lady , for discreet encounters .
sent9: THE suburb of Saffron Park lay on the sunset side of London , as red and ragged as a cloud of sunset .
Tokenizers split a text into "tokens": words, punctuation marks, and similar units.
s = "Call me Ishmael. The quick brown fox is sleeping."
The sent_tokenize function splits a text into sentences.
nltk.tokenize.sent_tokenize(s)
['Call me Ishmael.', 'The quick brown fox is sleeping.']
The word_tokenize function splits a text into words.
nltk.tokenize.word_tokenize(s)
['Call', 'me', 'Ishmael.', 'The', 'quick', 'brown', 'fox', 'is', 'sleeping', '.']
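(Note that word_tokenize above left 'Ishmael.' intact; punctuation handling differs between tokenizers. As a Python 3 toy, not NLTK's algorithm, a regex tokenizer that always splits punctuation off:)

```python
import re

def simple_word_tokenize(text):
    # A rough tokenizer: runs of word characters, or single punctuation
    # marks. A toy only; nltk's word_tokenize handles many more cases
    # (contractions, quotes, abbreviations, ...).
    return re.findall(r"\w+|[^\w\s]", text)

s = "Call me Ishmael. The quick brown fox is sleeping."
print(simple_word_tokenize(s))
```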
Lemmatizers take words and turn them into lemmas (or lemmata), the canonical or dictionary form of a word.
from nltk import stem
L = stem.WordNetLemmatizer()
L.lemmatize("dogs")
'dog'
L.lemmatize("shops")
'shop'
L.lemmatize("mice")
'mouse'
L.lemmatize('fantasized','v')
'fantasize'
Lemmatization sometimes requires knowledge of the part-of-speech.
L.lemmatize('has')
'ha'
L.lemmatize('has','v')
'have'
L.lemmatize('having','v')
'have'
Stemming reduces inflected forms to a common base form, the stem.
The stem may or may not be the dictionary form of the word; what matters primarily is that different inflected forms map to the same stem, and different lemmas map to different stems.
Stemming is important for information retrieval and web search, as different inflected forms are usually treated the same for query purposes.
P = nltk.stem.porter.PorterStemmer()
P.stem("overdoing")
'overdo'
P.stem("bender")
'bender'
P.stem("bending")
'bend'
P.stem("communities")
'commun'
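(The stem 'commun' is not a dictionary word, which is fine for retrieval. A toy Python 3 suffix-stripper, not the Porter algorithm, to illustrate the idea:)

```python
def naive_stem(word):
    # Strip the first matching suffix, keeping at least 3 leading
    # characters. A toy only: the real Porter stemmer applies ordered
    # rewrite rules with extra conditions on the remaining stem.
    for suffix in ("ities", "ing", "ies", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for w in ("communities", "bending", "shops"):
    print(w, "->", naive_stem(w))
```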
Recall that part-of-speech is the linguistic category of a word/lexical item.
We had nouns, verbs, adjectives, etc.
Some of these classes are open (they grow, like nouns), some of them are closed (like prepositions).
POS tagging infers the part-of-speech of a word in the context of a sentence. The sentence context is needed, since many words belong to different parts of speech depending on how they are used.
pos_tag(word_tokenize("The quick brown fox jumps over the lazy dogs."))
[('The', 'DT'), ('quick', 'NN'), ('brown', 'NN'), ('fox', 'NN'), ('jumps', 'NNS'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'NN'), ('dogs', 'NNS'), ('.', '.')]
pos_tag(word_tokenize("I compare thee to a summer's day?"))
[('I', 'PRP'), ('compare', 'VBP'), ('thee', 'JJ'), ('to', 'TO'), ('a', 'DT'), ('summer', 'NN'), ("'s", 'POS'), ('day', 'NN'), ('?', '.')]
nltk.help.upenn_tagset('PRP')
PRP: pronoun, personal hers herself him himself hisself it itself me myself one oneself ours ourselves ownself self she thee theirs them themselves they thou thy us
nltk.help.upenn_tagset("JJ")
JJ: adjective or numeral, ordinal third ill-mannered pre-war regrettable oiled calamitous first separable ectoplasmic battery-powered participatory fourth still-to-be-named multilingual multi-disciplinary ...
pos_tag(word_tokenize("Buffalo buffalo buffalo."))
[('Buffalo', 'NNP'), ('buffalo', 'VBD'), ('buffalo', 'NN'), ('.', '.')]
Parsing is the process of recovering the phrase structure (a tree structure) of a sentence from the linear sequence of words.
Parsing is driven by a grammar.
Recall your context-free grammars from the introductory class.
Recall terminals and non-terminals.
Context-free grammars are grammars in which the left-hand side of every production is a single non-terminal.
(In NLTK 3, bracket_parse has been replaced by nltk.Tree.fromstring.)
tree = nltk.bracket_parse('(NP (Adj old) (NP (N men) (Conj and) (N women)))')
tree.draw()
(In NLTK 3, parse_cfg has been replaced by nltk.CFG.fromstring.)
grammar = parse_cfg("""
S -> NP VP
PP -> P NP
NP -> 'the' N | N PP | 'the' N PP
VP -> V NP | V PP | V NP PP
N -> 'cat'
N -> 'dog'
N -> 'rug'
V -> 'chased'
V -> 'sat'
P -> 'in'
P -> 'on'
""")
grammar
<Grammar with 15 productions>
from nltk.parse import ShiftReduceParser
P = ShiftReduceParser(grammar)
tree = P.parse(word_tokenize("the cat chased the dog"))
tree.draw()