import re
Natural language processing uses a variety of resources:
NLTK stores word lists, corpora, etc. on disk as text files, but it provides you programmatic access to these resources. This means that you can get well-defined text resources already parsed with often just a single function call.
This is very useful for research (and exercises). For actual applications, of course, you may still want to rely on text files.
from nltk.corpus import words
words.words()[:10]
['A', 'a', 'aa', 'aal', 'aalii', 'aam', 'Aani', 'aardvark', 'aardwolf', 'Aaron']
dictionary = [w.lower() for w in words.words()]
print dictionary[:10]
['a', 'a', 'aa', 'aal', 'aalii', 'aam', 'aani', 'aardvark', 'aardwolf', 'aaron']
Project Gutenberg is a large collection of books, manually transcribed.
There is a small corpus of Gutenberg texts in NLTK, ready for processing.
from nltk.corpus import gutenberg
austen = gutenberg.words('austen-sense.txt')
print len(austen)
print austen[:10]
141576 ['[', 'Sense', 'and', 'Sensibility', 'by', 'Jane', 'Austen', '1811', ']', 'CHAPTER']
You can get a list of all the files in this corpus using the fileids
function.
gutenberg.fileids()
['austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt', 'bible-kjv.txt', 'blake-poems.txt', 'bryant-stories.txt', 'burgess-busterbrown.txt', 'carroll-alice.txt', 'chesterton-ball.txt', 'chesterton-brown.txt', 'chesterton-thursday.txt', 'edgeworth-parents.txt', 'melville-moby_dick.txt', 'milton-paradise.txt', 'shakespeare-caesar.txt', 'shakespeare-hamlet.txt', 'shakespeare-macbeth.txt', 'whitman-leaves.txt']
gutenberg
actually refers to an object representing the corpus, with a number
of standard methods.
print [m for m in dir(gutenberg) if re.match('[a-z]',m)]
['abspath', 'abspaths', 'encoding', 'fileids', 'files', 'items', 'open', 'paras', 'raw', 'read', 'readme', 'root', 'sents', 'tokenized', 'words']
help(gutenberg.sents)
Help on method sents in module nltk.corpus.reader.plaintext: sents(self, fileids=None, sourced=False) method of nltk.corpus.reader.plaintext.PlaintextCorpusReader instance @return: the given file(s) as a list of sentences or utterances, each encoded as a list of word strings. @rtype: C{list} of (C{list} of C{str})
print gutenberg.readme()[:300]
Project Gutenberg Selections http://gutenberg.net/ This corpus contains etexts from from Project Gutenberg, by the following authors: * Jane Austen (3) * William Blake (2) * Thornton W. Burgess * Sarah Cone Bryant * Lewis Carroll * G. K. Chesterton (3) * Maria Edgeworth * King James Bible * Herman
If you don't give a fileid
, you get the whole corpus as one piece of text:
for s in gutenberg.sents()[:10]:
print s[:5]
['[', 'Emma', 'by', 'Jane', 'Austen'] ['VOLUME', 'I'] ['CHAPTER', 'I'] ['Emma', 'Woodhouse', ',', 'handsome', ','] ['She', 'was', 'the', 'youngest', 'of'] ['Her', 'mother', 'had', 'died', 'too'] ['Sixteen', 'years', 'had', 'Miss', 'Taylor'] ['Between', '_them_', 'it', 'was', 'more'] ['Even', 'before', 'Miss', 'Taylor', 'had'] ['The', 'real', 'evils', ',', 'indeed']
With these resources and the standard Python primitives, many kinds of processing become qutie easy. Let's look for words that are in Sense and Sensibility, but are not contained in the basic wordlist.
austen = [w.lower() for w in austen if not re.search(r'[\d\W]',w)]
print len(austen)
austen[:10]
120733
['sense', 'and', 'sensibility', 'by', 'jane', 'austen', 'chapter', 'the', 'family', 'of']
diff = set(set(austen)-set(words.words()))
diff = sorted(list(diff))
print len(diff)
print diff[:10]
1659 ['abbeyland', 'abhorred', 'abilities', 'abounded', 'abridgement', 'abused', 'abuses', 'acacia', 'accents', 'accepting']
Words like 'abhorred' and 'abounded' just derive from common words like 'abhor' and 'abound'. We don't really want these grammatical variants to be listed separately. To avoid this, we use a lemmatizer or a stemmer.
from nltk import stem
st = stem.LancasterStemmer()
stems = [st.stem(w) for w in dictionary]
austen_stems = list(set([st.stem(w) for w in list(set(austen))]))
diff = sorted(list(set(austen_stems)-set(stems)))
diff[:10]
['abbeyland', 'adie', 'allenham', 'annamar', 'anym', 'arbo', 'avignon', 'becam', 'befel', 'blackest']
Q: What is the difference between a lemmatizer and a stemmer?
Some words occur very frequently and have mainly grammatical functions. These are contained in a list of stopwords.
from nltk.corpus import stopwords
stops = set([w.lower() for w in stopwords.words('english')])
len(stops)
127
print " ".join(stops)
all just being over both through yourselves its before herself had should to only under ours has do them his very they not during now him nor did this she each further where few because doing some are our ourselves out what for while does above between t be we who were here hers by on about of against s or own into yourself down your from her their there been whom too themselves was until more himself that but don with than those he me myself these up will below can theirs my and then is am it an as itself at have in any if again no when same how other which you after most such why a off i yours so the having once
Let's see how many of the words in Austen's writing are not stopwords.
print len(austen)
print len([w for w in austen if w not in stops])
120733 54074
Another kind of dictionary that is very useful translates words into their pronunciations.
from nltk.corpus import cmudict
entries = cmudict.entries()
for entry in entries[20:30]: print entry
("aaronson's", ['AA1', 'R', 'AH0', 'N', 'S', 'AH0', 'N', 'Z']) ('aarti', ['AA1', 'R', 'T', 'IY2']) ('aase', ['AA1', 'S']) ('aasen', ['AA1', 'S', 'AH0', 'N']) ('ab', ['AE1', 'B']) ('ab', ['EY1', 'B', 'IY1']) ('ababa', ['AH0', 'B', 'AA1', 'B', 'AH0']) ('ababa', ['AA1', 'B', 'AH0', 'B', 'AH0']) ('abacha', ['AE1', 'B', 'AH0', 'K', 'AH0']) ('aback', ['AH0', 'B', 'AE1', 'K'])
Note: the pronunciations are given in [Arpabet](http://en.wikipedia.org/wiki/Arpabet Arpabet)
Using this dictionary, we can find words with particular relationships between pronunciation and spelling. For example, "dove" is a heteronym: the word is written in exactly one way, but there are multiple pronunciations and meanings.
print [l for l in entries if l[0]=="dove"]
[('dove', ['D', 'AH1', 'V']), ('dove', ['D', 'OW1', 'V'])]
Let's look for homophones: words that are spelled differently but sound the same.
from collections import defaultdict
reverse = defaultdict(list)
for k,v in entries:
v = " ".join(v)
reverse[v].append(k)
ambiguity = [len(l) for l in reverse.values()]
hist(ambiguity)
(array([111354, 2314, 748, 303, 131, 84, 12, 13, 3, 4]), array([ 1. , 2.2, 3.4, 4.6, 5.8, 7. , 8.2, 9.4, 10.6, 11.8, 13. ]), <a list of 10 Patch objects>)
ambiguities = sorted([(len(l),k,l) for k,l in reverse.items()])
for n,_,l in ambiguities[-30:]:
print n," ".join(l)
9 gee's geez geeze g's g.'s g.s jees jeez jeeze 9 caan caen con conn conn. kahn khan kohn kon 9 corey corrie corry cory coury kaori korey korry kory 9 kea kee key khe ki kyi qi quai quay 9 coal coale cole colle koehl koelle kohl kol kole 9 coo cou coup coups khoo khuu koo ku qu 9 cues 'cuse kuse kuze q's q.'s q.s ques queues 9 leu lew lieu loo lou louw loux lu lue 9 nicholls nichols nichols' nickel's nickels nickles nickols nicolls nicols 9 kneece kniess neace neece neice neiss nice niece niess 10 ais ayes eis eye's eyes eyes' i's i.'s i.s ise 10 barey barre barrie barry berrey berri berrie berry buerry bury 10 bailey bailie baillie bailly baily baley bally baylee bayley bayly 10 cay k k. 'kay kay kaye khe quai quay quaye 10 mauri maurie maury morey mori morrie morry mory mowrey mowry 10 kneale kneel neal neale neall neel neil neile neill niel 10 au aux eau eaux o o' o. oh ow owe 10 eaux oh's ohs ooohs o's o.'s o.s o.s' ose owes 10 reits reitz right's rights rights' rite's rites wright's wrights writes 10 reaux rheault rho ro roe roh rohe row rowe wroe 10 c c. cie sci sea see si sie sieh tse 10 seagle segal segel seigal seigel seigle siegal siegel siegell siegle 10 ewe hugh u u. uwe yew yoo you yu yue 11 freas frease freeh's frees freese freeze freis frese friese frieze friis 11 mae's maes mais maize mase mayes may's mays mays' mayse maze 11 hsu schoo schou schue schuh shew shiu shoe shoo shu shue 12 cees c's c.'s c.s saez sea's seas sease sees seese seize sies 12 ewes hughes hughes' hughs u's u.'s u.s use yew's yoos youse yu's 13 carey carie carrey carrie carry cary kairey kari karry kary kerrey kerri kerry 13 laurey lauri laurie laury lawrie lawry lorey lori lorie lorrie lorry lory lowrie
In fact, there are many different possible relationships between words, meanings, sounds, and spellings.
imshow(imread("word-relationships.png"))
<matplotlib.image.AxesImage at 0xb3ba710>
Created by Morris Swadesh, a list of words that exist in many languages.
These special wordlists of corresponding words are used for glottochronology and lexicostatistics.
Note that we are actually talking about lemmas here, since the words themselves may have multiple grammatical forms and be homographs.
from nltk.corpus import swadesh
fr2en = swadesh.entries(['fr','en'])
print fr2en[:10]
[('je', 'I'), ('tu, vous', 'you (singular), thou'), ('il', 'he'), ('nous', 'we'), ('vous', 'you (plural)'), ('ils, elles', 'they'), ('ceci', 'this'), ('cela', 'that'), ('ici', 'here'), ('l\xc3\xa0', 'there')] [(u'je', u'I'), (u'tu, vous', u'you (singular), thou'), (u'il', u'he'), (u'nous', u'we'), (u'vous', u'you (plural)'), (u'ils, elles', u'they'), (u'ceci', u'this'), (u'cela', u'that'), (u'ici', u'here'), (u'l\xe0', u'there')]
Let's briefly mention another kind of lexicon, a "toolbox lexicon". It's basically just a bunch of attributes on a word (usually a lemma). The meaning is defined on a case-by-case basis.
Often, these kinds of lexica represent notes created by linguists when studying a language (in fact, they are often captured on paper notecards before being computerized).
from nltk.corpus import toolbox
toolbox.entries('rotokas.dic')[:2]
[('kaa', [('ps', 'V'), ('pt', 'A'), ('ge', 'gag'), ('tkp', 'nek i pas'), ('dcsv', 'true'), ('vx', '1'), ('sc', '???'), ('dt', '29/Oct/2005'), ('ex', 'Apoka ira kaaroi aioa-ia reoreopaoro.'), ('xp', 'Kaikai i pas long nek bilong Apoka bikos em i kaikai na toktok.'), ('xe', 'Apoka is gagging from food while talking.')]), ('kaa', [('ps', 'V'), ('pt', 'B'), ('ge', 'strangle'), ('tkp', 'pasim nek'), ('arg', 'O'), ('vx', '2'), ('dt', '07/Oct/2006'), ('ex', 'Rera rauroro rera kaarevoi.'), ('xp', 'Em i holim pas em na nekim em.'), ('xe', 'He is holding him and strangling him.'), ('ex', 'Iroiro-ia oirato okoearo kaaivoi uvare rirovira kaureoparoveira.'), ('xp', 'Ol i pasim nek bilong man long rop bikos em i save bikhet tumas.'), ('xe', "They strangled the man's neck with rope because he was very stubborn and arrogant."), ('ex', 'Oirato okoearo kaaivoi iroiro-ia. Uva viapau uvuiparoi ra vovouparo uva kopiiroi.'), ('xp', 'Ol i pasim nek bilong man long rop. Olsem na em i no pulim win olsem na em i dai.'), ('xe', "They strangled the man's neck with a rope. And he couldn't breathe and he died.")])]