Word List

  1. First, have or create a word list file: a single-column CSV file, or a plain text file with one word per line.
  2. Then instantiate a WordListCorpusReader in nltk, which generates a list of words from the file(s).
    • It takes two arguments: the directory path containing the files, and a list of filenames.
In [ ]:
import nltk
from nltk.corpus.reader import WordListCorpusReader

Sample word list: 58,000 English words. This list was compiled by merging different word lists. The British spelling was preferred and American versions deleted. We have used it in crossword compiling (together with a programme) with much success. A few word groups (e.g. RUN_OF_THE_MILL, written RUNOFTHEMILL) are therefore also included. In all hyphenated words the hyphen was deleted to form one word.

In [23]:
wlist = WordListCorpusReader('../Corpus_data/word.lists/', ['58000_lowercase_en.txt','bow_wordlist.txt'],encoding=u'utf8')
# Note that if you open the Python console in the same directory as the files, you can use '.' as the directory path.
In [25]:
#wlist.fileids()
len(wlist.words())
Out[25]:
93237
  • WordListCorpusReader inherits from CorpusReader, which we introduced earlier. CorpusReader is the general base class for all corpus readers (corpus in the broad sense) and handles the common corpus-processing work, while WordListCorpusReader reads the files and tokenizes each line to produce a list of words.
In [29]:
#wlist.words()
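To make the line-by-line reading concrete, here is a minimal plain-Python sketch of roughly what happens when several one-word-per-line files are turned into a single word list. It is not NLTK's implementation: the function name read_word_lists and the two sample files are made up for illustration, and stripping surrounding whitespace is an assumption of this sketch.

```python
import os
import tempfile

def read_word_lists(root, filenames, encoding="utf8"):
    """Return one flat list of words from files holding one word per line."""
    words = []
    for name in filenames:
        with open(os.path.join(root, name), encoding=encoding) as handle:
            for line in handle:
                word = line.strip()
                if word:                 # skip blank lines
                    words.append(word)
    return words

# Build two tiny throwaway word lists so the sketch runs anywhere.
demo_dir = tempfile.mkdtemp()
samples = {"a.txt": "apple\nbanana\n\ncherry\n", "b.txt": "dog\nelephant\n"}
for name, text in samples.items():
    with open(os.path.join(demo_dir, name), "w", encoding="utf8") as handle:
        handle.write(text)

demo_words = read_word_lists(demo_dir, ["a.txt", "b.txt"])
print(len(demo_words), demo_words)
```

As with the reader above, the words from all listed files end up in one list, which is why len() reports the combined size of both files.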
  • NLTK also comes with two lists of English words: one with 850 basic words, another with over 200,000 common words.
In [30]:
from nltk.corpus import words
In [31]:
words.fileids()
Out[31]:
['en', 'en-basic']
In [32]:
len(words.words('en-basic'))
Out[32]:
850
  • For linguistic field workers, NLTK includes the so-called Swadesh wordlists: lists of around 200 basic words in several languages. (The languages are identified by ISO 639 two-letter codes.)
In [35]:
from nltk.corpus import swadesh
import pprint
print(swadesh.fileids())
['be', 'bg', 'bs', 'ca', 'cs', 'cu', 'de', 'en', 'es', 'fr', 'hr', 'it', 'la', 'mk', 'nl', 'pl', 'pt', 'ro', 'ru', 'sk', 'sl', 'sr', 'sw', 'uk']
In [37]:
print(swadesh.words('de'))
['ich', 'du, Sie', 'er', 'wir', 'ihr, Sie', 'sie', 'dieses', 'jenes', 'hier', 'dort', 'wer', 'was', 'wo', 'wann', 'wie', 'nicht', 'alle', 'viele', 'einige', 'wenige', 'andere', 'eins', 'zwei', 'drei', 'vier', 'f\xc3\xbcnf', 'gro\xc3\x9f', 'lang', 'breit, weit', 'dick', 'schwer', 'klein', 'kurz', 'eng', 'd\xc3\xbcnn', 'Frau', 'Mann', 'Mensch', 'Kind', 'Frau, Ehefrau', 'Mann, Ehemann', 'Mutter', 'Vater', 'Tier', 'Fisch', 'Vogel', 'Hund', 'Laus', 'Schlange', 'Wurm', 'Baum', 'Wald', 'Stock', 'Frucht', 'Samen', 'Blatt', 'Wurzel', 'Rinde', 'Blume', 'Gras', 'Seil', 'Haut', 'Fleisch', 'Blut', 'Knochen', 'Fett', 'Ei', 'Horn', 'Schwanz', 'Feder', 'Haar', 'Kopf, Haupt', 'Ohr', 'Auge', 'Nase', 'Mund', 'Zahn', 'Zunge', 'Fingernagel', 'Fu\xc3\x9f', 'Bein', 'Knie', 'Hand', 'Fl\xc3\xbcgel', 'Bauch', 'Eingeweide, Innereien', 'Hals', 'R\xc3\xbccken', 'Brust', 'Herz', 'Leber', 'trinken', 'essen', 'bei\xc3\x9fen', 'saugen', 'spucken', 'erbrechen', 'blasen', 'atmen', 'lachen', 'sehen', 'h\xc3\xb6ren', 'wissen', 'denken', 'riechen', 'f\xc3\xbcrchten', 'schlafen', 'leben', 'sterben', 't\xc3\xb6ten', 'k\xc3\xa4mpfen', 'jagen', 'schlagen', 'schneiden', 'spalten', 'stechen', 'kratzen', 'graben', 'schwimmen', 'fliegen', 'gehen', 'kommen', 'liegen', 'sitzen', 'stehen', 'drehen', 'fallen', 'geben', 'halten', 'quetschen', 'reiben', 'waschen', 'wischen', 'ziehen', 'dr\xc3\xbccken', 'werfen', 'binden', 'n\xc3\xa4hen', 'z\xc3\xa4hlen', 'sagen', 'singen', 'spielen', 'schweben', 'flie\xc3\x9fen', 'frieren', 'schwellen', 'Sonne', 'Mond', 'Stern', 'Wasser', 'Regen', 'Flu\xc3\x9f', 'See', 'Meer, See', 'Salz', 'Stein', 'Sand', 'Staub', 'Erde', 'Wolke', 'Nebel', 'Himmel', 'Wind', 'Schnee', 'Eis', 'Rauch', 'Feuer', 'Asche', 'brennen', 'Stra\xc3\x9fe', 'Berg', 'rot', 'gr\xc3\xbcn', 'gelb', 'wei\xc3\x9f', 'schwarz', 'Nacht', 'Tag', 'Jahr', 'warm', 'kalt', 'voll', 'neu', 'alt', 'gut', 'schlecht', 'verrottet', 'schmutzig', 'gerade', 'rund', 'scharf', 'stumpf', 'glatt', 'nass, feucht', 'trocken', 'richtig', 
'nah, nahe', 'weit, fern', 'rechts', 'links', 'bei, an', 'in', 'mit', 'und', 'wenn, falls, ob', 'weil', 'Name']
  • You can access the corresponding words (often cognates) from multiple languages using the entries() method, which returns one tuple per concept.
In [40]:
en2de = swadesh.entries(['de','en'])    # despite the name, each tuple is (German, English), in the order requested
en2de
Out[40]:
[('ich', 'I'),
 ('du, Sie', 'you (singular), thou'),
 ('er', 'he'),
 ('wir', 'we'),
 ('ihr, Sie', 'you (plural)'),
 ('sie', 'they'),
 ('dieses', 'this'),
 ('jenes', 'that'),
 ('hier', 'here'),
 ('dort', 'there'),
 ('wer', 'who'),
 ('was', 'what'),
 ('wo', 'where'),
 ('wann', 'when'),
 ('wie', 'how'),
 ('nicht', 'not'),
 ('alle', 'all'),
 ('viele', 'many'),
 ('einige', 'some'),
 ('wenige', 'few'),
 ('andere', 'other'),
 ('eins', 'one'),
 ('zwei', 'two'),
 ('drei', 'three'),
 ('vier', 'four'),
 ('f\xc3\xbcnf', 'five'),
 ('gro\xc3\x9f', 'big'),
 ('lang', 'long'),
 ('breit, weit', 'wide'),
 ('dick', 'thick'),
 ('schwer', 'heavy'),
 ('klein', 'small'),
 ('kurz', 'short'),
 ('eng', 'narrow'),
 ('d\xc3\xbcnn', 'thin'),
 ('Frau', 'woman'),
 ('Mann', 'man (adult male)'),
 ('Mensch', 'man (human being)'),
 ('Kind', 'child'),
 ('Frau, Ehefrau', 'wife'),
 ('Mann, Ehemann', 'husband'),
 ('Mutter', 'mother'),
 ('Vater', 'father'),
 ('Tier', 'animal'),
 ('Fisch', 'fish'),
 ('Vogel', 'bird'),
 ('Hund', 'dog'),
 ('Laus', 'louse'),
 ('Schlange', 'snake'),
 ('Wurm', 'worm'),
 ('Baum', 'tree'),
 ('Wald', 'forest'),
 ('Stock', 'stick'),
 ('Frucht', 'fruit'),
 ('Samen', 'seed'),
 ('Blatt', 'leaf'),
 ('Wurzel', 'root'),
 ('Rinde', 'bark (from tree)'),
 ('Blume', 'flower'),
 ('Gras', 'grass'),
 ('Seil', 'rope'),
 ('Haut', 'skin'),
 ('Fleisch', 'meat'),
 ('Blut', 'blood'),
 ('Knochen', 'bone'),
 ('Fett', 'fat (noun)'),
 ('Ei', 'egg'),
 ('Horn', 'horn'),
 ('Schwanz', 'tail'),
 ('Feder', 'feather'),
 ('Haar', 'hair'),
 ('Kopf, Haupt', 'head'),
 ('Ohr', 'ear'),
 ('Auge', 'eye'),
 ('Nase', 'nose'),
 ('Mund', 'mouth'),
 ('Zahn', 'tooth'),
 ('Zunge', 'tongue'),
 ('Fingernagel', 'fingernail'),
 ('Fu\xc3\x9f', 'foot'),
 ('Bein', 'leg'),
 ('Knie', 'knee'),
 ('Hand', 'hand'),
 ('Fl\xc3\xbcgel', 'wing'),
 ('Bauch', 'belly'),
 ('Eingeweide, Innereien', 'guts'),
 ('Hals', 'neck'),
 ('R\xc3\xbccken', 'back'),
 ('Brust', 'breast'),
 ('Herz', 'heart'),
 ('Leber', 'liver'),
 ('trinken', 'drink'),
 ('essen', 'eat'),
 ('bei\xc3\x9fen', 'bite'),
 ('saugen', 'suck'),
 ('spucken', 'spit'),
 ('erbrechen', 'vomit'),
 ('blasen', 'blow'),
 ('atmen', 'breathe'),
 ('lachen', 'laugh'),
 ('sehen', 'see'),
 ('h\xc3\xb6ren', 'hear'),
 ('wissen', 'know (a fact)'),
 ('denken', 'think'),
 ('riechen', 'smell'),
 ('f\xc3\xbcrchten', 'fear'),
 ('schlafen', 'sleep'),
 ('leben', 'live'),
 ('sterben', 'die'),
 ('t\xc3\xb6ten', 'kill'),
 ('k\xc3\xa4mpfen', 'fight'),
 ('jagen', 'hunt'),
 ('schlagen', 'hit'),
 ('schneiden', 'cut'),
 ('spalten', 'split'),
 ('stechen', 'stab'),
 ('kratzen', 'scratch'),
 ('graben', 'dig'),
 ('schwimmen', 'swim'),
 ('fliegen', 'fly (verb)'),
 ('gehen', 'walk'),
 ('kommen', 'come'),
 ('liegen', 'lie'),
 ('sitzen', 'sit'),
 ('stehen', 'stand'),
 ('drehen', 'turn'),
 ('fallen', 'fall'),
 ('geben', 'give'),
 ('halten', 'hold'),
 ('quetschen', 'squeeze'),
 ('reiben', 'rub'),
 ('waschen', 'wash'),
 ('wischen', 'wipe'),
 ('ziehen', 'pull'),
 ('dr\xc3\xbccken', 'push'),
 ('werfen', 'throw'),
 ('binden', 'tie'),
 ('n\xc3\xa4hen', 'sew'),
 ('z\xc3\xa4hlen', 'count'),
 ('sagen', 'say'),
 ('singen', 'sing'),
 ('spielen', 'play'),
 ('schweben', 'float'),
 ('flie\xc3\x9fen', 'flow'),
 ('frieren', 'freeze'),
 ('schwellen', 'swell'),
 ('Sonne', 'sun'),
 ('Mond', 'moon'),
 ('Stern', 'star'),
 ('Wasser', 'water'),
 ('Regen', 'rain'),
 ('Flu\xc3\x9f', 'river'),
 ('See', 'lake'),
 ('Meer, See', 'sea'),
 ('Salz', 'salt'),
 ('Stein', 'stone'),
 ('Sand', 'sand'),
 ('Staub', 'dust'),
 ('Erde', 'earth'),
 ('Wolke', 'cloud'),
 ('Nebel', 'fog'),
 ('Himmel', 'sky'),
 ('Wind', 'wind'),
 ('Schnee', 'snow'),
 ('Eis', 'ice'),
 ('Rauch', 'smoke'),
 ('Feuer', 'fire'),
 ('Asche', 'ashes'),
 ('brennen', 'burn'),
 ('Stra\xc3\x9fe', 'road'),
 ('Berg', 'mountain'),
 ('rot', 'red'),
 ('gr\xc3\xbcn', 'green'),
 ('gelb', 'yellow'),
 ('wei\xc3\x9f', 'white'),
 ('schwarz', 'black'),
 ('Nacht', 'night'),
 ('Tag', 'day'),
 ('Jahr', 'year'),
 ('warm', 'warm'),
 ('kalt', 'cold'),
 ('voll', 'full'),
 ('neu', 'new'),
 ('alt', 'old'),
 ('gut', 'good'),
 ('schlecht', 'bad'),
 ('verrottet', 'rotten'),
 ('schmutzig', 'dirty'),
 ('gerade', 'straight'),
 ('rund', 'round'),
 ('scharf', 'sharp'),
 ('stumpf', 'dull'),
 ('glatt', 'smooth'),
 ('nass, feucht', 'wet'),
 ('trocken', 'dry'),
 ('richtig', 'correct'),
 ('nah, nahe', 'near'),
 ('weit, fern', 'far'),
 ('rechts', 'right'),
 ('links', 'left'),
 ('bei, an', 'at'),
 ('in', 'in'),
 ('mit', 'with'),
 ('und', 'and'),
 ('wenn, falls, ob', 'if'),
 ('weil', 'because'),
 ('Name', 'name')]
  • With the dictionary data type (introduced later), you can convert this list of pairs into a simple lookup dictionary.
In [41]:
translate = dict(en2de)
translate['Kind']
Out[41]:
'child'
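The conversion above can be sketched without NLTK on a handful of pairs copied from the entries() output. The names de_to_en, en_to_de and split_alternatives are illustrative only. The sketch also shows one possible way to handle entries that pack several alternatives into one string (such as 'du, Sie'), which a plain dict() lookup on a single word would miss.

```python
# A few (German, English) pairs copied from the entries() output above.
pairs = [
    ("ich", "I"),
    ("du, Sie", "you (singular), thou"),
    ("Kind", "child"),
    ("Hund", "dog"),
    ("Meer, See", "sea"),
]

de_to_en = dict(pairs)                       # German -> English lookup
en_to_de = {en: de for de, en in pairs}      # swap each pair for English -> German

def split_alternatives(entry):
    """Split an entry such as 'Meer, See' into its individual words."""
    return [part.strip() for part in entry.split(",")]

# Index every individual German alternative separately.
de_word_to_en = {}
for de, en in pairs:
    for word in split_alternatives(de):
        de_word_to_en.setdefault(word, en)   # keep the first gloss seen

print(de_to_en["Kind"])
print(en_to_de["dog"])
print(de_word_to_en["Sie"])
```

Using setdefault keeps the first gloss when the same word appears in more than one entry; whether that is the right choice depends on the application.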
  • You can also compare words across related languages, here Germanic (en, de, nl) and Romance (fr, es, pt).
In [42]:
language = ['en','de','nl','fr','es','pt']
entries = swadesh.entries(language)    # build the aligned tuples once, not on every iteration
for i in [120,140,160]:
    print(entries[i])
('walk', 'gehen', 'lopen, stappen', 'marcher', 'caminar', 'andar, caminhar, passear')
('sing', 'singen', 'zingen', 'chanter', 'cantar', 'cantar')
('fog', 'Nebel', 'mist, nevel', 'brouillard', 'niebla', 'neblina, n\xc3\xa9voa, nevoeiro, bruma')

Applications of Word List

  • Text preprocessing
  • Spell checking
  • Stopword filtering for NLP tasks
  • Games (e.g., word puzzles)

[1]. Filtering a text against a word list, e.g. to find unusual or misspelled words.

In [48]:
def unusual_words(text):
    text_vocab = set(w.lower() for w in text if w.isalpha())
    english_vocab = set(w.lower() for w in nltk.corpus.words.words())
    unusual = text_vocab.difference(english_vocab)                    # set.difference()
    return sorted(unusual)
In [49]:
from nltk.corpus import gutenberg
print(gutenberg.fileids())
['austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt', 'bible-kjv.txt', 'blake-poems.txt', 'bryant-stories.txt', 'burgess-busterbrown.txt', 'carroll-alice.txt', 'chesterton-ball.txt', 'chesterton-brown.txt', 'chesterton-thursday.txt', 'edgeworth-parents.txt', 'melville-moby_dick.txt', 'milton-paradise.txt', 'shakespeare-caesar.txt', 'shakespeare-hamlet.txt', 'shakespeare-macbeth.txt', 'whitman-leaves.txt']
In [53]:
unusual_words(nltk.corpus.gutenberg.words('austen-sense.txt'))    # check gutenberg.fileids()
Out[53]:
['abbeyland',
 'abhorred',
 'abilities',
 'abounded',
 'abridgement',
 'abused',
 'abuses',
 'accents',
 'accepting',
 'accommodations',
 'accompanied',
 'accounted',
 'accounts',
 'accustomary',
 'aches',
 'acknowledging',
 'acknowledgment',
 'acknowledgments',
 'acquaintances',
 'acquiesced',
 'acquitted',
 'acquitting',
 'acted',
 'actions',
 'adapted',
 'adding',
 'additions',
 'addressed',
 'addresses',
 'addressing',
 'adhering',
 'adieus',
 'adjusting',
 'administering',
 'admirers',
 'admires',
 'admitting',
 'adorned',
 'advances',
 'advantages',
 'affairs',
 'affections',
 'affects',
 'affixed',
 'afflictions',
 'afforded',
 'affording',
 'ages',
 'agitated',
 'agonies',
 'ailments',
 'aimed',
 'alarms',
 'alienated',
 'alighted',
 'alleged',
 'allenham',
 'allowances',
 'allowed',
 'allowing',
 'alluded',
 'alterations',
 'altered',
 'altering',
 'amended',
 'amounted',
 'amusements',
 'ankles',
 'annamaria',
 'anne',
 'annexed',
 'announced',
 'announcing',
 'annuities',
 'annum',
 'answered',
 'answering',
 'answers',
 'anticipated',
 'anticipating',
 'anticipations',
 'anymore',
 'apartments',
 'apologies',
 'apologising',
 'apologized',
 'appearances',
 'appeared',
 'appearing',
 'appeased',
 'appetites',
 'applauded',
 'applying',
 'appointed',
 'apprehended',
 'apprehensions',
 'approached',
 'approved',
 'arbour',
 'ardour',
 'arguments',
 'arranged',
 'arrangements',
 'arranging',
 'arrived',
 'arrives',
 'arriving',
 'ascended',
 'ascertained',
 'asked',
 'asking',
 'assembled',
 'assemblies',
 'asserted',
 'assertions',
 'assiduities',
 'assisted',
 'assisting',
 'associating',
 'assurances',
 'astonished',
 'atoned',
 'atoning',
 'attaching',
 'attachments',
 'attacked',
 'attacks',
 'attained',
 'attempted',
 'attempting',
 'attempts',
 'attendants',
 'attended',
 'attending',
 'attentions',
 'attracted',
 'attractions',
 'attributed',
 'attributing',
 'auditors',
 'augmenting',
 'austen',
 'authorised',
 'authors',
 'availed',
 'avignon',
 'avoided',
 'avoiding',
 'awaited',
 'awakened',
 'awaking',
 'bags',
 'balls',
 'banished',
 'barouches',
 'bathed',
 'bears',
 'beasts',
 'beauties',
 'became',
 'bedrooms',
 'beds',
 'befallen',
 'befalls',
 'befell',
 'began',
 'begged',
 'begins',
 'behaved',
 'beings',
 'believed',
 'believes',
 'belonged',
 'belongs',
 'benefited',
 'bequeathed',
 'berkeley',
 'bestowed',
 'betrayed',
 'betraying',
 'biased',
 'blackest',
 'blameable',
 'blessings',
 'blights',
 'blossoms',
 'blundered',
 'blushed',
 'blushes',
 'bolder',
 'bones',
 'bonomi',
 'books',
 'booksellers',
 'borrowed',
 'bottoms',
 'boys',
 'brandon',
 'breakfasting',
 'bribing',
 'brightened',
 'brighter',
 'bringing',
 'brings',
 'broader',
 'brothers',
 'bruised',
 'buildings',
 'bursts',
 'buying',
 'called',
 'calls',
 'calming',
 'candles',
 'candour',
 'canvassing',
 'cards',
 'cares',
 'caresses',
 'careys',
 'carriages',
 'carries',
 'cases',
 'casts',
 'cats',
 'caused',
 'ceased',
 'ceasing',
 'censured',
 'centre',
 'certainties',
 'chagrined',
 'chairs',
 'chambers',
 'chanced',
 'changed',
 'changes',
 'changing',
 'characters',
 'charged',
 'charmed',
 'charms',
 'cheated',
 'checking',
 'cheeks',
 'cheerfuller',
 'cherished',
 'cherries',
 'children',
 'choked',
 'chuse',
 'chusing',
 'circles',
 'circumstances',
 'civilities',
 'claimed',
 'claiming',
 'claims',
 'clarke',
 'cleared',
 'cleveland',
 'clogged',
 'closing',
 'clouds',
 'coats',
 'collecting',
 'coloured',
 'colouring',
 'combe',
 'comforted',
 'comforts',
 'comings',
 'commanded',
 'commands',
 'commended',
 'comments',
 'commissioned',
 'commonest',
 'communicated',
 'companions',
 'compared',
 'compares',
 'comparisons',
 'complained',
 'complaining',
 'complaints',
 'completed',
 'compliments',
 'comprehended',
 'concealing',
 'concerns',
 'concessions',
 'concluded',
 'conclusions',
 'conditions',
 'conducted',
 'confessed',
 'confidante',
 'conforming',
 'congratulated',
 'congratulating',
 'congratulations',
 'conjectured',
 'conjectures',
 'conjecturing',
 'connections',
 'conquests',
 'consented',
 'consequences',
 'considerations',
 'considers',
 'consisted',
 'consists',
 'consoled',
 'conspired',
 'constantia',
 'consulted',
 'contained',
 'containing',
 'contend',
 'contenting',
 'continuing',
 'contradicted',
 'contrasted',
 'contributed',
 'contributing',
 'contrived',
 'contrives',
 'contriving',
 'controlled',
 'conveniences',
 'conversations',
 'conversed',
 'conversing',
 'conveyed',
 'conveying',
 'copying',
 'cordials',
 'cottages',
 'counsellor',
 'counteracted',
 'couples',
 'courted',
 'courting',
 'courtland',
 'cousins',
 'cowper',
 'cows',
 'coxcombs',
 'cramps',
 'created',
 'creating',
 'creatures',
 'cries',
 'crimsoned',
 'curtsying',
 'cutlets',
 'danced',
 'dances',
 'dared',
 'darker',
 'dartford',
 'dashwood',
 'dashwoods',
 'daughters',
 'davies',
 'dawdled',
 'dawlish',
 'dawned',
 'dearer',
 'dearest',
 'debated',
 'debts',
 'deceived',
 'deciding',
 'decisions',
 'declares',
 'declaring',
 'declining',
 'deemed',
 'deeper',
 'deepest',
 'defects',
 'defended',
 'deficiencies',
 'degrees',
 'delaford',
 'delayed',
 'delays',
 'deliberating',
 'delicacies',
 'delighful',
 'delineated',
 'delivered',
 'demanded',
 'demands',
 'demonstrations',
 'demur',
 'denied',
 'dennison',
 'denoted',
 'denoting',
 'departing',
 'depended',
 'depends',
 'deprived',
 'described',
 'describing',
 'deserts',
 'deserves',
 'designs',
 'desiring',
 'despatch',
 'despatching',
 'despised',
 'despising',
 'destroyed',
 'destroys',
 'detaining',
 'detected',
 'detecting',
 'determining',
 'deterred',
 'detested',
 'devolved',
 'died',
 'dies',
 'differed',
 'differing',
 'difficulties',
 'dimensions',
 'diminished',
 'dined',
 'dinners',
 'directing',
 'directions',
 'disagreements',
 'disappeared',
 'disappointments',
 'disapproved',
 'disapproves',
 'disapproving',
 'discarded',
 'discharged',
 'disclaiming',
 'disclosing',
 'discontents',
 'discovering',
 'discussions',
 'disgraced',
 'disinherited',
 'disliked',
 'dismissed',
 'dismounted',
 'dispatched',
 'dispatches',
 'dispersing',
 'disposing',
 'disputes',
 'disqualifications',
 'disregarded',
 'dissembling',
 'dissented',
 'distresses',
 'distrusts',
 'diverted',
 'doatingly',
 'donavan',
 'doomed',
 'dooming',
 'doors',
 'dorsetshire',
 'doubted',
 'doubts',
 'douceur',
 'downs',
 'dr',
 'drains',
 'drawings',
 'draws',
 'dreaded',
 'dreading',
 'dreaming',
 'dresses',
 'drives',
 'dropped',
 'drops',
 'drury',
 'duets',
 'duties',
 'earlier',
 'earliest',
 'earned',
 'ears',
 'echoed',
 'editions',
 'edtions',
 'effected',
 'effecting',
 'effusions',
 'elliott',
 'ellison',
 'ellisons',
 'eloping',
 'eluded',
 'embellishments',
 'embraced',
 'embraces',
 'employments',
 'enabled',
 'enamoured',
 'encouraged',
 'encouragements',
 'encroachments',
 'encumbered',
 'endeavoring',
 'endeavors',
 'endeavour',
 'endeavoured',
 'endeavouring',
 'endeavours',
 'endowed',
 'ends',
 'endured',
 'enfeebled',
 'enforcing',
 'engagements',
 'england',
 'enjoyed',
 'enjoyments',
 'enquired',
 'enquiries',
 'enquiring',
 'ensued',
 'ensured',
 'entered',
 'entertained',
 'entitled',
 'entreated',
 'entreaties',
 'entrusted',
 'equalled',
 'equals',
 'erred',
 'errors',
 'escaped',
 'esq',
 'establishing',
 'esteemed',
 'esteeming',
 'esteems',
 'estimating',
 'estranged',
 'evenings',
 'events',
 'evils',
 'examined',
 'exceeded',
 'excellencies',
 'exchanged',
 'exclaimed',
 'exclamations',
 'excused',
 'excuses',
 'exercised',
 'exercising',
 'exerted',
 'exertions',
 'exeter',
 'exhilarated',
 'existed',
 'expectations',
 'expected',
 'expecting',
 'expects',
 'expenses',
 'experiencing',
 'explained',
 'explanations',
 'expressing',
 'expressions',
 'extolling',
 'extorted',
 'extorting',
 'extremest',
 'eyeing',
 'eyes',
 'faces',
 'facts',
 'failed',
 'falls',
 'familiarized',
 'families',
 'fancying',
 'fates',
 'fatigued',
 'fatigues',
 'faults',
 'favour',
 'favourable',
 'favourite',
 'favourites',
 'fearing',
 'fears',
 'features',
 'feelings',
 'feels',
 'feet',
 'felicitations',
 'females',
 'ferrars',
 'fetches',
 'fettered',
 'finds',
 'finest',
 'fingers',
 'flattered',
 'flatteries',
 'flowed',
 'fluctuating',
 'flushed',
 'foibles',
 'followed',
 'follows',
 'fond',
 'footsteps',
 'forebodings',
 'foreplanned',
 'foresaw',
 'foreseeing',
 'foreseen',
 'forfeited',
 'forfeiting',
 'forgave',
 'forgiven',
 'forms',
 'forsaking',
 'fortunes',
 'forwarded',
 'foundations',
 'founded',
 'fowls',
 'friendliest',
 'friends',
 'frightens',
 'froid',
 'frosts',
 'fulfil',
 'fulfilled',
 'fullest',
 'gained',
 'gales',
 'gardens',
 'garrets',
 'gates',
 'gathered',
 'generations',
 'gentlemen',
 'gigs',
 'gilberts',
 'girls',
 'gives',
 'glances',
 'gloried',
 'gloves',
 'godby',
 'goings',
 'goodby',
 'governed',
 'gowns',
 'graces',
 'grandmothers',
 'granted',
 'greatest',
 'grieves',
 'grows',
 'guardians',
 'guessed',
 'guests',
 'guided',
 'guineas',
 'habits',
 'hallooing',
 'hands',
 'handsomer',
 'handsomest',
 'hang',
 'hanover',
 'happened',
 'happens',
 'hardened',
 'hardships',
 'harley',
 'harris',
 'has',
 'hastened',
 'hastening',
 'hated',
 'hates',
 'hating',
 'having',
 'hazarded',
 'hazarding',
 'heads',
 'heard',
 'hears',
 'heightened',
 'heightening',
 'heights',
 'heirs',
 'held',
 'hens',
 'henshawe',
 'hesitated',
 'hiding',
 'hills',
 'hinted',
 'hints',
 'hoarded',
 'holborn',
 'holburn',
 'holds',
 'holidays',
 'homes',
 'hon',
 'honeysuckles',
 'honiton',
 'honour',
 'honourable',
 'honourably',
 'honoured',
 'honours',
 'hopes',
 'hoping',
 'horrors',
 'horses',
 'hours',
 'houses',
 'howsever',
 'humbled',
 'humiliations',
 'humored',
 'humoured',
 'humouring',
 'hunted',
 'hunters',
 'hunts',
 'hurrying',
 'husbands',
 'huswifes',
 'ideas',
 'idled',
 'idolized',
 'ii',
 'imaginations',
 'imagined',
 'imagining',
 'imbibed',
 'immoveable',
 'imparted',
 'imperfections',
 'implied',
 'implies',
 'impoverished',
 'impoverishing',
 'improved',
 'improvements',
 'imputed',
 'inclinations',
 'inclined',
 'inclosing',
 'including',
 'incommoded',
 'inconveniences',
 'increased',
 'incurred',
 'incurring',
 'indulged',
 'infants',
 'inflicted',
 'inflicting',
 'influenced',
 'inforce',
 'inforced',
 'informing',
 'inhabitants',
 'inhabiting',
 'inheriting',
 'injuries',
 'inquired',
 'inquiries',
 'insinuations',
 'insisted',
 'installed',
 'instigated',
 'instructions',
 'insulted',
 'intends',
 'intentions',
 'intents',
 'interests',
 'interposed',
 'interspersed',
 'intervals',
 'interviews',
 'intimated',
 'introduced',
 'introducing',
 'intruded',
 'invented',
 'inventing',
 'invitations',
 'invited',
 'irritated',
 'irritates',
 'issued',
 'jealousies',
 'jenning',
 'jennings',
 'jewels',
 'jilting',
 'joined',
 'joked',
 'jokes',
 'joking',
 'joys',
 'judged',
 'judging',
 'judgments',
 'jumbled',
 'justified',
 'keeps',
 'keys',
 'kicked',
 'kinder',
 'kindest',
 'kingham',
 'kissed',
 'kisses',
 'knees',
 'knives',
 'knows',
 'laboured',
 'lamentations',
 'lamps',
 'lanes',
 'languages',
 'larger',
 'largest',
 'lasted',
 'laughed',
 'laughs',
 'leagued',
 'legacies',
 'lengthened',
 'lengths',
 'lessened',
 'lessening',
 'letters',
 'letting',
 'lies',
 'lifted',
 'lightened',
 'liked',
 'likes',
 'limbs',
 'limits',
 'lines',
 'lingered',
 'lingering',
 'lips',
 'listened',
 'lives',
 'livings',
 'll',
 'lodges',
 'loitered',
 'lombardy',
 'london',
 'longed',
 'longest',
 'longstaple',
 'looked',
 'looks',
 'loved',
 'lovers',
 'loves',
 'lowered',
 'lurking',
 'magna',
 'maids',
 'maintained',
 'makes',
 'mama',
 'managed',
 'marlborough',
 'marriages',
 'marries',
 'matters',
 'maxims',
 'meadows',
 'meals',
 'means',
 'meantime',
 'measures',
 'medicines',
 'meditated',
 'meditations',
 'meetings',
 'mentioned',
 'mentioning',
 'merest',
 'merits',
 'merrier',
 'messages',
 'middleton',
 'middletons',
 'militated',
 'minds',
 'minutes',
 'misapplied',
 'misinformed',
 'missed',
 'misses',
 'mistakes',
 'mixing',
 'modestest',
 'mohrs',
 'moments',
 'months',
 'morton',
 'mosquitoes',
 'mothers',
 'motives',
 'moved',
 'murmurings',
 'muttered',
 'nabobs',
 'named',
 'names',
 'natured',
 'nearer',
 'needed',
 'neglected',
 'neighbour',
 'neighbourhood',
 'neighbouring',
 'neighbourly',
 'neighbours',
 'nerves',
 'nests',
 'nettles',
 'newer',
 'newspapers',
 'nicest',
 'nieces',
 'nipped',
 'nodded',
 'nods',
 'noisier',
 'notes',
 'noticed',
 'noticing',
 'notions',
 'nt',
 'nurses',
 'obeyed',
 'objected',
 'objections',
 'objects',
 'obligations',
 'observations',
 'observed',
 'obstacles',
 'obstructed',
 'obtained',
 'obtaining',
 'obviated',
 'obviating',
 'occasioned',
 'occasions',
 'occupations',
 'occupied',
 'occurred',
 'oddest',
 'offence',
 'offences',
 'offending',
 'offered',
 'offices',
 'oftener',
 'oftenest',
 'oldest',
 'olives',
 'omitted',
 'ones',
 'opened',
 'opinions',
 'opportunities',
 'ordained',
 'orders',
 'originated',
 'ornamented',
 'ornaments',
 'others',
 'outdone',
 'outgrown',
 'outlived',
 'outraged',
 'outstaid',
 'outstretched',
 'outstripped',
 'outweighs',
 'overcame',
 'overcoming',
 'overheard',
 'overlooked',
 'overpowered',
 'overspreading',
 'overstrained',
 'owed',
 'owned',
 'owners',
 'owning',
 'paces',
 'pacified',
 'packages',
 'packed',
 'pages',
 'paid',
 'pains',
 'palanquins',
 'palmers',
 'pangs',
 'papers',
 'parcels',
 'parents',
 'parlors',
 'parlour',
 'parrys',
 'particulars',
 'parties',
 'parting',
 'partners',
 'parts',
 'passages',
 'passed',
 'passions',
 'patches',
 'patronised',
 'patterns',
 'paused',
 'pausing',
 'pearls',
 'perceived',
 'perfections',
 'performances',
 'performed',
 'performers',
 'performing',
 'permitting',
 'persecutions',
 'persevered',
 'persisted',
 'persons',
 'persuading',
 'pictures',
 'pieces',
 'pimples',
 'piqued',
 'pitched',
 'pitied',
 'placed',
 'placing',
 'plaguing',
 'planning',
 'plans',
 'plantations',
 'plants',
 'played',
 'playfellows',
 'playing',
 'playthings',
 'pleasanter',
 'pleasantest',
 'pleased',
 'pleasures',
 'plums',
 'pointers',
 'points',
 'ponds',
 'poplars',
 'popt',
 'possesses',
 'possessions',
 'postponing',
 'posts',
 'pounds',
 'poured',
 'powers',
 'practices',
 'practise',
 'practised',
 'praised',
 'praises',
 'pratt',
 'prayers',
 'pre',
 'preceded',
 'preferring',
 'prejudices',
 'premeditated',
 'premises',
 'prenticed',
 'preparations',
 'preparing',
 'prescribed',
 'prescriptions',
 'presented',
 'presenting',
 'presents',
 'preserved',
 'presided',
 'pressed',
 'presumed',
 'pretence',
 'pretends',
 'pretensions',
 'prettier',
 'prettiest',
 'prevailed',
 'prevailing',
 'prevented',
 'preyed',
 'principles',
 'probabilities',
 'proceeded',
 'proclaimed',
 'procured',
 'procuring',
 'producing',
 'professions',
 'profited',
 'prohibited',
 'projects',
 'promised',
 'promises',
 'promontories',
 'promoted',
 'promoting',
 'prompted',
 'pronouncing',
 'proofs',
 'propensities',
 'prophecies',
 'proposals',
 'proposed',
 'prospects',
 'protestations',
 'protested',
 'proud',
 'provisions',
 'provoked',
 'publishing',
 'pulled',
 'puppies',
 'purchases',
 'purposes',
 'pursued',
 'pursuing',
 'pursuits',
 'puts',
 'putting',
 'qualifications',
 'quarrelled',
 'quarrelling',
 'questions',
 'quickened',
 'quicker',
 'quickest',
 'quieted',
 'quitting',
 'rained',
 'raises',
 'rambles',
 'raptures',
 'reached',
 'reaped',
 'reasonings',
 'reasons',
 'recalled',
 'receiving',
 'reckoned',
 'reckons',
 'reclining',
 'recognised',
 'recollecting',
 'recommended',
 'recommending',
 'reconciled',
 'recovered',
 'recovering',
 'recreating',
 'recurred',
 'referred',
 'referring',
 'refinements',
 'reflections',
 'refreshed',
 'refreshments',
 'refused',
 'regarded',
 'regards',
 'regrets',
 'regretted',
 'regretting',
 'rejected',
 'rejoiced',
 'relating',
 'relations',
 'relatives',
 'released',
 'relics',
 'relied',
 'relinquished',
 'relying',
 'remained',
 'remaining',
 'remarks',
 'remedies',
 'remembered',
 'remembering',
 'remembers',
 'remembrances',
 'reminded',
 'reminding',
 'reminds',
 'removes',
 'rendered',
 'renewed',
 'renewing',
 'renounced',
 'repaid',
 'repaired',
 'repeating',
 'repining',
 'replied',
 'replying',
 'reports',
 'representations',
 'represented',
 'representing',
 'reproached',
 'reproaches',
 'reproaching',
 'reproved',
 'reproving',
 'repulsed',
 'requested',
 'requesting',
 'required',
 'requires',
 'requiring',
 'rescued',
 'reseated',
 'resembled',
 'resembling',
 'resented',
 'resettled',
 'resided',
 'residing',
 'resisted',
 'resists',
 'resolving',
 'resorted',
 'resources',
 'respected',
 'respects',
 'rested',
 'restored',
 'restoring',
 'restraints',
 'resumed',
 'retailed',
 'retained',
 'retreated',
 'retrenched',
 'returning',
 'reverted',
 'revived',
 'rewarded',
 'rheumatisms',
 'ribbons',
 'richardson',
 'richardsons',
 'richer',
 'rings',
 'rises',
 'risking',
 'rivals',
 'roads',
 'roared',
 'robbed',
 'rocks',
 'rooms',
 'roused',
 'ruins',
 'rumour',
 'sackville',
 'sacrificed',
 'sakes',
 'salts',
 'sandersons',
 'sashes',
 'sauntered',
 'saves',
 'savings',
 'says',
 'scenes',
 'schemes',
 'scolded',
 'scorning',
 'scotland',
 'scott',
 'scrawls',
 'screamed',
 'screams',
 'screens',
 'scrupled',
 'scruples',
 'scrupling',
 'scrutinies',
 'searched',
 'seasons',
 'seats',
 'seconded',
 'seconds',
 'secrets',
 'secured',
 'secures',
 'securing',
 'seduced',
 'seemed',
 'seems',
 'seized',
 'sellers',
 'sends',
 'sensations',
 'senses',
 'sentences',
 'sentiments',
 'separated',
 'separations',
 'servants',
 'served',
 'services',
 'shades',
 'shakespeare',
 'shared',
 'sharing',
 'sharpe',
 'shew',
 'shewed',
 'shewing',
 'shewn',
 'shews',
 'shillings',
 'shocked',
 'shoes',
 'shops',
 'shoulders',
 'showed',
 'showers',
 'shrubberies',
 'shrugging',
 'shuddering',
 'shutters',
 'sighed',
 'signs',
 'silencing',
 'silks',
 'simpered',
 'simpering',
 'simplest',
 'simpson',
 'sisters',
 'situations',
 'slightest',
 'smallest',
 'smiled',
 'smiles',
 'smirked',
 'smokes',
 'sobbed',
 'sobered',
 'sobs',
 'softened',
 'solicitations',
 'somersetshire',
 'songs',
 'soothings',
 'sorrows',
 'souls',
 'sounds',
 'sources',
 'spared',
 'speaks',
 'spends',
 'spirits',
 'sportsmen',
 'sprained',
 'spraining',
 'spunging',
 'spurned',
 'stairs',
 'stammered',
 'stanhill',
 'stared',
 'stares',
 'started',
 'startled',
 'stating',
 'staying',
 'steele',
 'steeles',
 'steepest',
 'steps',
 'stimulated',
 'stirred',
 'stockings',
 'stopt',
 'strains',
 'strangers',
 'streamed',
 'strengthened',
 'stretched',
 'strictest',
 'strikes',
 'stronger',
 'strongest',
 'struggled',
 'studies',
 'stupified',
 'styled',
 'subjects',
 'submitted',
 'submitting',
 'subsisted',
 'subsisting',
 'succeeded',
 'succour',
 'suffered',
 'sufferings',
 'suffers',
 'suggested',
 'suited',
 'summits',
 'summoned',
 'superannuated',
 'supplanted',
 'supplied',
 'supplying',
 'supported',
 'supports',
 'surfaces',
 'surpassed',
 'surprised',
 'survived',
 'suspecting',
 'suspects',
 'suspicions',
 'swallowed',
 'sweetest',
 'sweetmeats',
 'syllables',
 'sympathised',
 'symptoms',
 'systems',
 'takes',
 'talents',
 'talked',
 'talks',
 'tallest',
 'tastes',
 'taverns',
 'tears',
 'teazed',
 'teazing',
 'tells',
 'tempers',
 'tempted',
 'tended',
 'tenderest',
 'terminated',
 'terms',
 'thanked',
 'things',
 'thinks',
 'thirds',
 'thistles',
 'thomson',
 'thorns',
 'thoughts',
 'threatened',
 'threats',
 'thunderbolts',
 'tis',
 'tithes',
 'traced',
 'traded',
 'trades',
 'traits',
 'transacted',
 'transgressed',
 'travellers',
 'travelling',
 'treasured',
 'treated',
 'trees',
 'trembled',
 'tremour',
 'trials',
 'tricked',
 'tricks',
 'tries',
 'trifled',
 'troubles',
 'truest',
 'trusted',
 'truths',
 'twould',
 'undergone',
 'undervalued',
 'unfavourable',
 'unites',
 'unlover',
 'unpleasantest',
 'urged',
 'ushered',
 'using',
 'valleys',
 'variations',
 'varying',
 've',
 'ventured',
 'venturing',
 'viewed',
 'viewing',
 'views',
 'vigour',
 'villages',
 'violins',
 'virtues',
 'visited',
 'visitors',
 'visits',
 'voices',
 'vouchsafed',
 'waistcoats',
 'waited',
 'walked',
 'walks',
 'walls',
 'wandered',
 'wanted',
 'wants',
 'warmest',
 'weakened',
 'weaknesses',
 'weddings',
 'weeks',
 'welcomed',
 'westminster',
 'westons',
 'wettest',
 'weymouth',
 'whiled',
 'whims',
 'whitakers',
 'whiter',
 'whitwell',
 'wildest',
 'williams',
 'willoughby',
 'willoughbys',
 'windows',
 'winks',
 'wiping',
 'wisest',
 'wishes',
 'withdrew',
 'witnessed',
 'witnesses',
 'witnessing',
 'witticisms',
 'wittiest',
 'wives',
 'women',
 'wondered',
 'woods',
 'words',
 'workmen',
 'worlds',
 'wrapt',
 'writes',
 'yards',
 'years',
 'yielded',
 'youngest']
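Much of the output above consists of ordinary inflected forms (asked, books, years) next to proper names and older spellings, because the set difference flags anything that is not literally in the reference list. The following toy version of the same function makes this visible on a tiny example. The token list and toy_vocab are invented stand-ins, and the vocabulary is passed in as a parameter here only so the sketch runs without NLTK data.

```python
def unusual_words(text_tokens, vocabulary):
    """Return the sorted lowercase alphabetic tokens missing from vocabulary."""
    text_vocab = {w.lower() for w in text_tokens if w.isalpha()}
    known = {w.lower() for w in vocabulary}
    return sorted(text_vocab - known)        # same idea as set.difference()

# Toy stand-ins for the corpus text and the reference word list.
tokens = ["The", "Dashwoods", "asked", "about", "books", ",", "book", "ask"]
toy_vocab = ["the", "about", "ask", "book"]

print(unusual_words(tokens, toy_vocab))
```

Here "asked" and "books" are reported even though "ask" and "book" are in the vocabulary, which mirrors what happens with the Austen text.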

[2]. Filtering a text with a stopword list, i.e. a list of high-frequency (function) words, such as "the", that we want to remove from the text before further processing.

In [56]:
from nltk.corpus import stopwords
stopwords.words('english')    # check corpus/stopwords for available languages.
Out[56]:
['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 'her',
 'hers',
 'herself',
 'it',
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each',
 'few',
 'more',
 'most',
 'other',
 'some',
 'such',
 'no',
 'nor',
 'not',
 'only',
 'own',
 'same',
 'so',
 'than',
 'too',
 'very',
 's',
 't',
 'can',
 'will',
 'just',
 'don',
 'should',
 'now']
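
A minimal sketch of the filtering step described above. The function name `remove_stopwords` and the sample sentence are illustrative only, and a tiny hand-written stoplist stands in for `stopwords.words('english')` so that the snippet runs without the NLTK data installed.

```python
def remove_stopwords(tokens, stoplist):
    """Return the tokens that do not appear in stoplist (case-insensitive)."""
    # A set gives constant-time membership tests, which matters for long texts.
    stopset = set(w.lower() for w in stoplist)
    return [t for t in tokens if t.lower() not in stopset]

# Hypothetical stand-in; in practice pass stopwords.words('english') instead.
tiny_stoplist = ['the', 'of', 'had', 'been', 'in']

tokens = 'The family of Dashwood had long been settled in Sussex'.split()
content_words = remove_stopwords(tokens, tiny_stoplist)
print(content_words)
```

Lowercasing both sides is a design choice here: the NLTK stopword list is all lowercase, so a sentence-initial "The" would otherwise slip through.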

[3]. Spelling checker

[Enchant: a spelling correction API]

sudo pip install pyenchant

Exercise

  1. Plot the frequency distribution of letters in the word list (i.e., the 58000_lowercase_en.txt you just downloaded).
  2. Define a function that computes the proportion of words in austen-sense.txt that are not in the stopword list.
In [ ]:
 

VerbNet

VerbNet (VN) (Kipper-Schuler 2006) is the largest on-line verb lexicon currently available for English. It is a hierarchical, domain-independent, broad-coverage lexicon with mappings to other lexical resources such as WordNet and FrameNet.

VerbNet is organized into verb classes extending Levin (1993) classes through refinement and addition of subclasses to achieve syntactic and semantic coherence among members of a class. Each verb class in VN is completely described by thematic roles, selectional restrictions on the arguments, and frames consisting of a syntactic description and semantic predicates with a temporal function, in a manner similar to the event decomposition of Moens and Steedman (1988).

http://nltk.org/_modules/nltk/corpus/reader/verbnet.html

VerbNet has recently been integrated with 57 new classes from Korhonen and Briscoe's (2004) (K&B) proposed extension to Levin's original classification (Kipper et al., 2006). This work involved associating detailed syntactic-semantic descriptions with the K&B classes and organizing them appropriately into the existing VN taxonomy. An additional set of 53 new classes from Korhonen and Ryant (2005) (K&R) has also been incorporated into VN. The outcome is a freely available resource which constitutes the most comprehensive and versatile Levin-style verb classification for English. After the two extensions, VN's coverage of PropBank tokens (Palmer et al., 2005) has also increased from 78.45% to 90.86%, making feasible the creation of a substantial training corpus annotated with VN thematic role labels and class membership assignments, to be released in 2007. This will finally enable large-scale experimentation on the utility of syntax-based classes for improving the performance of syntactic parsers and semantic role labelers on new domains.

In [1]:
from nltk.corpus import verbnet as vn
len(vn.classids())
Out[1]:
429
In [2]:
vn.classids('hit')
Out[2]:
['bump-18.4',
 'contiguous_location-47.8-1',
 'hit-18.1-1',
 'reach-51.8',
 'throw-17.1-1']
In [3]:
v = vn.vnclass('hit-18.1-1')
In [4]:
vn.lemmas('hit-18.1-1')
Out[4]:
['bang',
 'bash',
 'batter',
 'beat',
 'bump',
 'butt',
 'dash',
 'drum',
 'hammer',
 'hit',
 'kick',
 'knock',
 'lash',
 'pound',
 'rap',
 'slap',
 'smack',
 'strike',
 'tamp',
 'tap',
 'thump',
 'thwack',
 'whack',
 'click']
In [5]:
vn.wordnetids('hit-18.1-1')
Out[5]:
['bang%2:35:00',
 'bang%2:35:01',
 'bash%2:35:00',
 'batter%2:35:01',
 'batter%2:35:00',
 'batter%2:30:00',
 'beat%2:35:01',
 'beat%2:36:00',
 'beat%2:35:03',
 'beat%2:35:10',
 'beat%2:35:12',
 'bump%2:35:00',
 'butt%2:35:00',
 'dash%2:35:02',
 'drum%2:39:00',
 'hammer%2:35:00',
 'hit%2:35:01',
 'hit%2:35:00',
 'hit%2:33:01',
 'hit%2:33:03',
 'kick%2:35:00',
 'knock%2:35:01',
 'knock%2:35:00',
 'knock%2:39:00',
 'lash%2:35:01',
 'lash%2:35:00',
 'pound%2:35:00',
 'pound%2:35:01',
 'pound%2:30:03',
 'rap%2:35:00',
 'rap%2:39:00',
 'slap%2:35:00',
 'smack%2:35:02',
 'strike%2:35:01',
 'strike%2:35:00',
 'strike%2:35:09',
 'tamp%2:35:00',
 'tap%2:35:00',
 'tap%2:39:01',
 'thump%2:35:00',
 'thwack%2:35:00',
 'whack%2:35:00',
 'click%2:35:00']
In [6]:
print(vn.pprint_themroles('hit-18.1-1'))
* Instrument[+body_part +refl]
In [7]:
[t.attrib['type'] for t in v.findall('THEMROLES/THEMROLE/SELRESTRS/SELRESTR')]
Out[7]:
['body_part', 'refl']
In [8]:
[t.attrib['type'] for t in v.findall('THEMROLES/THEMROLE')]
Out[8]:
['Instrument']
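
The two `findall` calls above are ordinary ElementTree path queries on the XML element returned by `vn.vnclass()`. To see how the paths map onto the markup without installing the corpus, here is a small sketch on a hand-written fragment. The fragment is an assumption modelled on the outputs above, heavily simplified, and not copied from the VerbNet files.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for the XML behind vn.vnclass('hit-18.1-1').
xml_text = """
<VNSUBCLASS ID="hit-18.1-1">
  <MEMBERS>
    <MEMBER name="bang" wn="bang%2:35:00"/>
    <MEMBER name="bash" wn="bash%2:35:00"/>
  </MEMBERS>
  <THEMROLES>
    <THEMROLE type="Instrument">
      <SELRESTRS>
        <SELRESTR Value="+" type="body_part"/>
        <SELRESTR Value="+" type="refl"/>
      </SELRESTRS>
    </THEMROLE>
  </THEMROLES>
</VNSUBCLASS>
""".strip()

v = ET.fromstring(xml_text)

# Each path step names a child tag, so this walks THEMROLES -> THEMROLE.
roles = [t.attrib['type'] for t in v.findall('THEMROLES/THEMROLE')]

# Two levels deeper: the selectional restrictions attached to each role.
restrs = [t.attrib['Value'] + t.attrib['type']
          for t in v.findall('THEMROLES/THEMROLE/SELRESTRS/SELRESTR')]

# The same idea retrieves the member verbs of the class.
members = [m.attrib['name'] for m in v.findall('MEMBERS/MEMBER')]

print(roles)
print(restrs)
print(members)
```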

FrameNet

  • A lexical database of English based on a theory of meaning called Frame Semantics, deriving from the work of Charles J. Fillmore and colleagues.

  • The basic idea: the meanings of most words can best be understood on the basis of a (semantic) frame: a description of a type of event, relation, or entity and the participants in it. A Frame is a script-like conceptual structure that describes a particular type of situation, object, or event along with the participants and propositions that are needed for that Frame.

For example, the concept of cooking typically involves a person doing the cooking (Cook), the food that is to be cooked (Food), something to hold the food while cooking (Container) and a source of heat (Heating_instrument). In the FrameNet project, this is represented as a frame called Apply_heat, and the Cook, Food, Heating_instrument and Container are called frame elements (FEs) (i.e., roles of a Frame). Words that evoke this frame, such as fry, bake, boil, and broil, are called lexical units (LUs) of the Apply_heat frame.

  • FrameNet also includes relations between Frames. Several types of relations are defined, of which the most important are:

    1. Inheritance: An IS-A relation. The child frame is a subtype of the parent frame, and each FE in the parent is bound to a corresponding FE in the child. An example is the "Revenge" frame which inherits from the "Rewards_and_punishments" frame.
    2. Using: The child frame presupposes the parent frame as background, e.g. the "Speed" frame "uses" (or presupposes) the "Motion" frame; however, not all parent FEs need to be bound to child FEs.
    3. Subframe: The child frame is a subevent of a complex event represented by the parent, e.g. the "Criminal_process" frame has subframes of "Arrest", "Arraignment", "Trial", and "Sentencing".
    4. Perspective_on: The child frame provides a particular perspective on an un-perspectivized parent frame. A pair of examples consists of the "Hiring" and "Get_a_job" frames, which perspectivize the "Employment_start" frame from the Employer's and the Employee's point of view, respectively.
  • The job of FrameNet is to define the frames and to annotate sentences to show how the FEs fit syntactically around the word that evokes the frame.
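
The notions of frame, frame element, lexical unit and frame-to-frame relation can be pictured as a small data structure. The sketch below is only an illustration built from the examples in these notes. The dictionary layout and the helper names `frames_evoked_by` and `children` are made up for this purpose and are not the FrameNet or NLTK API.

```python
# Frames with their frame elements (FEs) and lexical units (LUs),
# taken from the cooking example above.
frames = {
    'Apply_heat': {
        'FEs': ['Cook', 'Food', 'Container', 'Heating_instrument'],
        'LUs': ['fry', 'bake', 'boil', 'broil'],
    },
}

# (relation type, parent frame, child frame) triples from the list above.
relations = [
    ('Inheritance', 'Rewards_and_punishments', 'Revenge'),
    ('Using', 'Motion', 'Speed'),
    ('Subframe', 'Criminal_process', 'Arrest'),
    ('Subframe', 'Criminal_process', 'Arraignment'),
    ('Subframe', 'Criminal_process', 'Trial'),
    ('Subframe', 'Criminal_process', 'Sentencing'),
    ('Perspective_on', 'Employment_start', 'Hiring'),
    ('Perspective_on', 'Employment_start', 'Get_a_job'),
]

def frames_evoked_by(word):
    """Names of the frames that list word among their lexical units."""
    return sorted(name for name, f in frames.items() if word in f['LUs'])

def children(parent, relation):
    """Child frames linked to parent by the given relation type."""
    return [c for r, p, c in relations if r == relation and p == parent]

print(frames_evoked_by('bake'))
print(children('Criminal_process', 'Subframe'))
print(children('Employment_start', 'Perspective_on'))
```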

sudo python -m nltk.downloader framenet_v15 (or use nltk.download() to install framenet)

https://github.com/dasmith/FrameNet-python

Frames

  • Use the frames() function to get a list of all of the Frames in FrameNet. (If you supply a regular expression pattern to the frames() function, you will get a list of all Frames whose names match that pattern.)

For example, the "Apply_heat" frame describes a common situation involving a Cook, some Food, and a Heating_Instrument, and is evoked by words such as bake, blanch, boil, broil, brown, simmer, steam, etc.
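
The pattern-based lookup can be tried out without the corpus installed. The sketch below imitates it with `re.search` over a short list of frame names taken from these notes. The helper `match_frames` is hypothetical, and using search (rather than full-match) semantics is an assumption about how the real `frames()` behaves.

```python
import re

def match_frames(names, pattern=None):
    """Return the names matching pattern, or all names if no pattern is given."""
    if pattern is None:
        return list(names)
    return [n for n in names if re.search(pattern, n)]

# Frame names mentioned in these notes, not the full FrameNet inventory.
names = ['Apply_heat', 'Motion', 'Speed', 'Revenge',
         'Hiring', 'Get_a_job', 'Employment_start']

print(match_frames(names))               # no pattern: every frame name
print(match_frames(names, r'(?i)heat'))  # case-insensitive substring match
print(match_frames(names, r'ing$'))      # names ending in "ing"
```

With FrameNet installed, the corresponding call would presumably look like `fn.frames(r'(?i)heat')` after `from nltk.corpus import framenet as fn`.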


HOMEWORK

  1. [60%] Define a function for neologism detection.
  2. [20%] Install PyEnchant and get it up and running with a few queries.
  3. [20%] Term project pre-proposal/groupings for corpus, lexicon and app.

Reading: HTTLCS [chapter 7]; NLTK [section 3.1-3.3]

In [ ]: