from nltk.corpus import brown
brown.root
FileSystemPathPointer('/home/tmb/nltk_data/corpora/brown')
brown.readme()
'BROWN CORPUS\n\nA Standard Corpus of Present-Day Edited American\nEnglish, for use with Digital Computers.\n\nby W. N. Francis and H. Kucera (1964)\nDepartment of Linguistics, Brown University\nProvidence, Rhode Island, USA\n\nRevised 1971, Revised and Amplified 1979\n\nhttp://www.hit.uib.no/icame/brown/bcm.html\n\nDistributed with the permission of the copyright holder,\nredistribution permitted.\n'
brown.fileids()[:10]
['ca01', 'ca02', 'ca03', 'ca04', 'ca05', 'ca06', 'ca07', 'ca08', 'ca09', 'ca10']
Files may have different encodings; the default is ASCII, in which case the text is processed as plain str.
brown.encoding("ca01")
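As a rough illustration of what a per-file encoding means in practice, the sketch below decodes raw bytes with an encoding looked up by file id. The table ENCODINGS, the helper decode_file, and the sample bytes are hypothetical stand-ins, not part of NLTK, and the code assumes Python 3.

```python
# Hypothetical per-file encoding table (an assumption for illustration only).
ENCODINGS = {"ca01": "ascii", "faust.txt": "utf8"}

def decode_file(fileid, data, default="ascii"):
    """Decode the raw bytes of one file with the encoding declared for it."""
    return data.decode(ENCODINGS.get(fileid, default))

# UTF-8 bytes for a word with an umlaut decode correctly only with the right encoding.
utf8_bytes = u"Trag\u00f6die".encode("utf8")
decoded = decode_file("faust.txt", utf8_bytes)
print(decoded == u"Trag\u00f6die")
```

Decoding the same bytes under the "ascii" entry would raise a UnicodeDecodeError, which is why a reader needs to know the encoding of each file.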
Files may also be in different categories.
brown.categories()
['adventure', 'belles_lettres', 'editorial', 'fiction', 'government', 'hobbies', 'humor', 'learned', 'lore', 'mystery', 'news', 'religion', 'reviews', 'romance', 'science_fiction']
The corpus abstraction spares you from dealing with individual files, encodings, and similar details.
A single corpus object gives you access to all the raw text, all the words, and all the sentences in the corpus.
brown.raw()[:100]
'\n\n\tThe/at Fulton/np-tl County/nn-tl Grand/jj-tl Jury/nn-tl said/vbd Friday/nr an/at investigation/nn'
brown.words()[:10]
['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', 'Friday', 'an', 'investigation', 'of']
for s in brown.sents()[:10]:
    print(s[:5])
['The', 'Fulton', 'County', 'Grand', 'Jury']
['The', 'jury', 'further', 'said', 'in']
['The', 'September-October', 'term', 'jury', 'had']
['``', 'Only', 'a', 'relative', 'handful']
['The', 'jury', 'said', 'it', 'did']
['It', 'recommended', 'that', 'Fulton', 'legislators']
['The', 'grand', 'jury', 'commented', 'on']
['Merger', 'proposed']
['However', ',', 'the', 'jury', 'said']
['The', 'City', 'Purchasing', 'Department', ',']
brown.tagged_words()[:10]
[('The', 'AT'), ('Fulton', 'NP-TL'), ('County', 'NN-TL'), ('Grand', 'JJ-TL'), ('Jury', 'NN-TL'), ('said', 'VBD'), ('Friday', 'NR'), ('an', 'AT'), ('investigation', 'NN'), ('of', 'IN')]
brown.tagged_sents()[0][:10]
[('The', 'AT'), ('Fulton', 'NP-TL'), ('County', 'NN-TL'), ('Grand', 'JJ-TL'), ('Jury', 'NN-TL'), ('said', 'VBD'), ('Friday', 'NR'), ('an', 'AT'), ('investigation', 'NN'), ('of', 'IN')]
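The raw text shown earlier stores each token as word/tag, while the tagged views return (word, TAG) pairs with the tag uppercased. The sketch below shows one way to get from the first form to the second in plain Python. The helper names split_tagged_token and parse_tagged_text are hypothetical, and this is not NLTK's reader code.

```python
def split_tagged_token(token, sep="/"):
    """Split 'word/tag' on the last separator and uppercase the tag."""
    word, _, tag = token.rpartition(sep)
    return (word, tag.upper())

def parse_tagged_text(raw):
    """Turn whitespace-separated 'word/tag' tokens into (word, TAG) pairs."""
    return [split_tagged_token(tok) for tok in raw.split()]

# Sample copied from the start of the raw Brown text shown above.
sample = "\n\n\tThe/at Fulton/np-tl County/nn-tl Grand/jj-tl Jury/nn-tl said/vbd"
tagged = parse_tagged_text(sample)
words = [word for word, _ in tagged]   # dropping the tags gives the plain word view
print(tagged[:3])
print(words)
```

Splitting on the last slash rather than the first keeps words that themselves contain a slash intact.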
import nltk.corpus.reader
corpus = nltk.corpus.reader.plaintext.PlaintextCorpusReader(".", r"[ft].*txt", encoding="utf8")
corpus.fileids()
['faust.txt', 'tomsawyer.txt']
corpus.raw()[:100]
u'Faust: Der Trag\xf6die erster Teil\n\nJohann Wolfgang von Goethe\n\n\nZueignung.\n\nIhr naht euch wieder, schw'
corpus.paras()[:2]
[[[u'Faust', u':', u'Der', u'Trag\xf6die', u'erster', u'Teil']], [[u'Johann', u'Wolfgang', u'von', u'Goethe']]]
print(corpus.sents()[500])
[u'FAUST', u':', u'Vor', u'jenem', u'droben', u'steht', u'geb\xfcckt', u',', u'Der', u'helfen', u'lehrt', u'und', u'H\xfclfe', u'schickt', u'.']
print(corpus.words()[500:510])
[u'heute', u'!', u'DICHTER', u':', u'O', u'sprich', u'mir', u'nicht', u'von', u'jener']
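To make the idea of a reader built from a root directory, a file-id pattern, and an encoding more concrete, here is a minimal sketch of such an object in plain Python 3. The class TinyPlaintextCorpus, its crude regex tokenizer, and the sample file contents are all assumptions made for illustration; NLTK's own reader tokenizes differently and does much more.

```python
import os
import re
import shutil
import tempfile

class TinyPlaintextCorpus(object):
    """Hypothetical stand-in for a plaintext corpus reader (not NLTK code)."""

    def __init__(self, root, fileid_pattern, encoding="utf8"):
        self.root = root
        self.pattern = re.compile(fileid_pattern)
        self.encoding = encoding

    def fileids(self):
        # Keep only the file names that the pattern matches in full.
        names = sorted(os.listdir(self.root))
        return [n for n in names if self.pattern.fullmatch(n)]

    def raw(self, fileids=None):
        # Concatenate the decoded contents of the selected files.
        if fileids is None:
            fileids = self.fileids()
        elif isinstance(fileids, str):
            fileids = [fileids]
        parts = []
        for fileid in fileids:
            with open(os.path.join(self.root, fileid), "rb") as handle:
                parts.append(handle.read().decode(self.encoding))
        return "".join(parts)

    def words(self, fileids=None):
        # Crude tokenizer: runs of word characters, or single punctuation marks.
        return re.findall(r"\w+|[^\w\s]", self.raw(fileids))

# Build a throwaway corpus directory with made-up contents.
root = tempfile.mkdtemp()
samples = {"faust.txt": u"Faust: Der Trag\u00f6die erster Teil\n",
           "tomsawyer.txt": u"\"TOM!\" No answer.\n",
           "notes.md": u"ignored\n"}
for name, text in samples.items():
    with open(os.path.join(root, name), "wb") as handle:
        handle.write(text.encode("utf8"))

tiny = TinyPlaintextCorpus(root, r"[ft].*txt")
ids = tiny.fileids()
faust_words = tiny.words("faust.txt")
tom_words = tiny.words("tomsawyer.txt")
all_words = tiny.words()
shutil.rmtree(root)   # remove the temporary directory again

print(ids)
print(tom_words)
```

Calling words() without arguments walks every matching file, which is the sense in which the corpus object hides the individual files from the caller.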
from nltk import Text
text = Text(corpus.words("tomsawyer.txt"))
text.concordance("with")
Building index...
Displaying 25 of 647 matches:
" TOM !" No answer . " What ' s gone with that boy , I wonder ? You TOM !" No
ding down and punching under the bed with the broom , and so she needed breath
eded breath to punctuate the punches with . She resurrected nothing but the ca
 - brother ) Sid was already through with his part of the work ( picking up ch
et vanity to believe she was endowed with a talent for dark and mysterious dip
 sewed . " Bother ! Well , go ' long with you . I ' d made sure you ' d played
 didn ' t think you sewed his collar with white thread , but it ' s black ." "
it ' s black ." " Why , I did sew it with white ! Tom !" But Tom did not wait
 Confound it ! sometimes she sews it with white , and sometimes she sews it wi
th white , and sometimes she sews it with black . I wish to geeminy she ' d st
f it , and he strode down the street with his mouth full of harmony __________
ure is concerned , the advantage was with the boy , not the astronomer . The s
art , don ' t you ? I could lick you with one hand tied behind me , if I wante
do it ." " Well I will , if you fool with me ." " Oh yes -- I ' ve seen whole
n ' t either ." So they stood , each with a foot placed at an angle as a brace
 angle as a brace , and both shoving with might and main , and glowering at ea
d main , and glowering at each other with hate . But neither could get an adva
nd flushed , each relaxed his strain with watchful caution , and Tom said : "
other on you , and he can thrash you with his little finger , and I ' ll make
it so ." Tom drew a line in the dust with his big toe , and said : " I dare yo
 out of his pocket and held them out with derision . Tom struck them to the gr
er ' s nose , and covered themselves with dust and glory . Presently the confu
tride the new boy , and pounding him with his fists . " Holler ' nuff !" said
Better look out who you ' re fooling with next time ." The new boy went off br
ht him out ." To which Tom responded with jeers , and started off in high feat
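A concordance view lists every occurrence of a word with a fixed amount of context on each side, so that the matches line up in one column. The function below is a minimal sketch of that idea in plain Python 3. The name concordance_lines and its parameters are hypothetical, and the details may differ from what NLTK does internally.

```python
def concordance_lines(tokens, word, width=79, max_lines=25):
    """Return fixed-width lines with each match of `word` in the same column."""
    half = (width - len(word) - 2) // 2   # characters of context per side
    span = width // 4                     # tokens of context per side
    target = word.lower()
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() != target:
            continue
        left = " ".join(tokens[max(0, i - span):i])
        right = " ".join(tokens[i + 1:i + 1 + span])
        # Clip long context and pad short context so the matches align.
        left = left[-half:].rjust(half)
        right = right[:half]
        lines.append("%s %s %s" % (left, token, right))
        if len(lines) == max_lines:
            break
    return lines

tokens = "Tom drew a line in the dust with his big toe , and said : I dare you".split()
lines = concordance_lines(tokens, "with", width=39)
for line in lines:
    print(line)
```

Because the context is clipped by characters rather than by tokens, words at the edges of a line can be cut in half, as in the output above.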
text.similar("with")
Building word-context index...
and in on to for of was at into up s that through but if just upon what as by
text.common_contexts(["with","as"])
but_the is_a long_you up_a
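Both similar() and common_contexts() rest on the same idea: describe each word by the (previous word, next word) pairs it occurs between, then compare those sets. The sketch below implements that idea in plain Python 3 on a made-up sentence. The function names and the scoring by number of shared contexts are assumptions for illustration, and NLTK's actual ranking may differ.

```python
from collections import Counter, defaultdict

def context_index(tokens):
    """Map each lowercased word to a Counter of its (previous, next) contexts."""
    index = defaultdict(Counter)
    padded = ["*START*"] + [t.lower() for t in tokens] + ["*END*"]
    for prev, word, nxt in zip(padded, padded[1:], padded[2:]):
        index[word][(prev, nxt)] += 1
    return index

def similar_words(index, word, n=20):
    """Rank other words by how many distinct contexts they share with `word`."""
    contexts = set(index[word.lower()])
    scores = Counter()
    for other, seen in index.items():
        if other != word.lower():
            shared = len(contexts & set(seen))
            if shared:
                scores[other] = shared
    return [w for w, _ in scores.most_common(n)]

def shared_contexts(index, words):
    """Return the contexts common to every word in `words`, as 'prev_next'."""
    shared = set.intersection(*(set(index[w.lower()]) for w in words))
    return sorted("%s_%s" % pair for pair in shared)

# Made-up token sequence in which "with" and "in" fill the same slot.
tokens = "she sews it with white thread and she sews it in white thread".split()
index = context_index(tokens)
similar = similar_words(index, "with")
contexts = shared_contexts(index, ["with", "in"])
print(similar)
print(contexts)
```

The prev_next strings follow the same underscore notation as the output above, where but_the means the words were seen between "but" and "the".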