Working with Lexical Data

  1. WordNet: A Lexical Network of English
  2. VerbNet: A Class-Based Verb Lexicon

  • WordNet is a lexical database of the English language (wordnets now exist for many other languages as well).
    • Synset: a grouping of synonymous words that express the same concept.
    • Relations: synsets are connected by lexical semantic relations, e.g., antonymy, hypernymy, meronymy, etc.

Synsets

  • Looking up synsets for a word in WordNet
  • NLTK comes with a simple reader for looking up words in WordNet. Be sure the WordNet corpus is installed and unzipped under nltk_data/corpora/wordnet (e.g., via nltk.download('wordnet')).
In [1]:
import nltk
from nltk.corpus import wordnet as wn
wn.synsets('book') # get a list of synsets that the word belongs to.
wn.synsets('book')[0] # note that a synset is identified with [headword.pos.nn] format
Out[1]:
Synset('book.n.01')
In [93]:
wn.synsets('book', pos=wn.VERB)
Out[93]:
[Synset('book.v.01'),
 Synset('reserve.v.04'),
 Synset('book.v.03'),
 Synset('book.v.04')]
  • Using the returned synsets to explore a number of attributes via their accessor methods. Note that synsets() and synset() are different methods.
In [94]:
wn.synsets('book')[0].name()
Out[94]:
'book.n.01'
In [95]:
wn.synsets('book')[0].definition()
Out[95]:
'a written work or composition that has been published (printed on pages bound together)'
In [96]:
wn.synsets('book')[0].examples()
Out[96]:
['I am reading a good book on economics']
  • Each synset contains one or more lemmas, which represent a specific sense of a specific word.
In [101]:
wn.synset('book.n.02').lemmas()  # note: lemmas() belongs to a single synset(), not to synsets()
Out[101]:
[Lemma('book.n.02.book'), Lemma('book.n.02.volume')]
In [107]:
wn.lemma('book.n.02.volume').synset()
Out[107]:
Synset('book.n.02')

Lexical Semantic Relations

  • Looking up Lexical Semantic Relations for synsets in WordNet
  • Synsets are organized in a kind of inheritance tree. More abstract terms are known as hypernyms and more specific terms are hyponyms. This tree can be traced all the way up to a root hypernym.
In [6]:
wn.synsets('book')[0].hypernyms()
Out[6]:
[Synset('publication.n.01')]
In [7]:
wn.synsets('book')[0].hypernyms()[0].hyponyms()
Out[7]:
[Synset('new_edition.n.01'),
 Synset('book.n.01'),
 Synset('volume.n.04'),
 Synset('impression.n.06'),
 Synset('republication.n.01'),
 Synset('tip_sheet.n.01'),
 Synset('magazine.n.01'),
 Synset('reference.n.08'),
 Synset('collection.n.02'),
 Synset('reissue.n.01'),
 Synset('periodical.n.01'),
 Synset('read.n.01')]
In [114]:
# a more compact way:
# wn.synsets('dog')  # first check which synsets the word 'dog' appears in
dog_1 = wn.synset('dog.n.01')
dog_1.examples()
Out[114]:
['the dog barked all night']
In [115]:
dog_1.root_hypernyms()
Out[115]:
[Synset('entity.n.01')]
In [118]:
dog_1.member_holonyms()
Out[118]:
[Synset('pack.n.06'), Synset('canis.n.01')]
  • Note that the relations antonyms, derivationally_related_forms and pertainyms are defined only over lemmas, not synsets.
In [139]:
good = wn.synset('good.a.01')
In [128]:
good.antonyms()
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-128-ac72d7751601> in <module>()
----> 1 good.antonyms()

AttributeError: 'Synset' object has no attribute 'antonyms'
In [140]:
# so those relations have to be retrieved from a lemma
good.lemmas()[0].antonyms()
Out[140]:
[Lemma('bad.a.01.bad')]

Lemmas and Verb Frames

In [152]:
eat = wn.lemma('eat.v.03.eat')
eat
Out[152]:
Lemma('feed.v.06.eat')
In [153]:
eat.count()
Out[153]:
4
In [154]:
for lemma in wn.synset('eat.v.03').lemmas():
    print(lemma, lemma.count())
Lemma('feed.v.06.feed') 3
Lemma('feed.v.06.eat') 4
In [155]:
for lemma in wn.lemmas('eat', 'v'):
    print(lemma, lemma.count())
Lemma('eat.v.01.eat') 61
Lemma('eat.v.02.eat') 13
Lemma('feed.v.06.eat') 4
Lemma('eat.v.04.eat') 0
Lemma('consume.v.05.eat') 0
Lemma('corrode.v.01.eat') 0
  • Recall that lemmas can also have relations between them.
In [157]:
vocal = wn.lemma('vocal.a.01.vocal')
vocal.derivationally_related_forms()
Out[157]:
[Lemma('vocalize.v.02.vocalize')]
In [158]:
vocal.pertainyms()  # only the last expression's value is shown below
vocal.antonyms()
Out[158]:
[Lemma('instrumental.a.01.instrumental')]
  • Verb frames
In [159]:
wn.synset('think.v.01').frame_ids()
Out[159]:
[5, 9]
In [160]:
for lemma in wn.synset('think.v.01').lemmas():
    print(lemma, lemma.frame_ids())
    print(" | ".join(lemma.frame_strings()))
Lemma('think.v.01.think') [5, 9]
Something think something Adjective/Noun | Somebody think somebody
Lemma('think.v.01.believe') [5, 9]
Something believe something Adjective/Noun | Somebody believe somebody
Lemma('think.v.01.consider') [5, 9]
Something consider something Adjective/Noun | Somebody consider somebody
Lemma('think.v.01.conceive') [5, 9]
Something conceive something Adjective/Noun | Somebody conceive somebody

Applications

Lexical Ontologies

  • Looking up Ontologies in WordNet
    • All of these book-related synsets share the same root hypernym, entity, one of the most abstract terms in the English language. You can trace the entire path from entity down to book.n.01 using the hypernym_paths() method.
In [165]:
book_1 = wn.synset('book.n.01')
book_1.hypernym_paths()
Out[165]:
[[Synset('entity.n.01'),
  Synset('physical_entity.n.01'),
  Synset('object.n.01'),
  Synset('whole.n.02'),
  Synset('artifact.n.01'),
  Synset('creation.n.02'),
  Synset('product.n.02'),
  Synset('work.n.02'),
  Synset('publication.n.01'),
  Synset('book.n.01')]]

Lowest Common Hypernyms

  • LCH (the lowest_common_hypernyms() method) locates the lowest hypernym shared by two given synsets.
In [193]:
wn.synset('kin.n.01').lowest_common_hypernyms(wn.synset('mother.n.01'))
#wn.synset('policeman.n.01').lowest_common_hypernyms(wn.synset('chef.n.01'))
Out[193]:
[Synset('organism.n.01')]
  • This method generally returns a single result, but in some cases, more than one valid LCH is possible:
In [194]:
wn.synset('body.n.09').lowest_common_hypernyms(wn.synset('sidereal_day.n.01'))
Out[194]:
[Synset('measure.n.02'), Synset('attribute.n.02')]

Synset Similarity

  • Calculating conceptual similarity between synsets.
    • synset1.path_similarity(synset2): Return a score denoting how similar two word senses are, based on the shortest path that connects the senses in the is-a (hypernym/hyponym) taxonomy. The score is in the range 0 to 1.
    • synset1.lch_similarity(synset2): Leacock-Chodorow Similarity: Return a score denoting how similar two word senses are, based on the shortest path that connects the senses (as above) and the maximum depth of the taxonomy in which the senses occur. The relationship is given as -log(p/2d) where p is the shortest path length and d the taxonomy depth.
    • synset1.wup_similarity(synset2): Wu-Palmer Similarity: Return a score denoting how similar two word senses are, based on the depth of the two senses in the taxonomy and that of their Least Common Subsumer (most specific ancestor node). Note that at this time the scores given do not always agree with those given by Pedersen's Perl implementation of WordNet::Similarity.
    • synset1.res_similarity(synset2, ic): Resnik Similarity: Return a score denoting how similar two word senses are, based on the Information Content (IC) of the Least Common Subsumer (most specific ancestor node). Note that for any similarity measure that uses information content, the result is dependent on the corpus used to generate the information content and the specifics of how the information content was created.
    • synset1.jcn_similarity(synset2, ic): Jiang-Conrath Similarity Return a score denoting how similar two word senses are, based on the Information Content (IC) of the Least Common Subsumer (most specific ancestor node) and that of the two input Synsets. The relationship is given by the equation 1 / (IC(s1) + IC(s2) - 2 * IC(lcs)).
    • synset1.lin_similarity(synset2, ic): Lin Similarity: Return a score denoting how similar two word senses are, based on the Information Content (IC) of the Least Common Subsumer (most specific ancestor node) and that of the two input Synsets. The relationship is given by the equation 2 * IC(lcs) / (IC(s1) + IC(s2)).
In [166]:
dog = wn.synset('dog.n.01')
cat = wn.synset('cat.n.01')
hit = wn.synset('hit.v.01')
slap = wn.synset('slap.v.01')
In [171]:
#dog.path_similarity(cat)
wn.path_similarity(dog, cat)
Out[171]:
0.2
In [172]:
#hit.lch_similarity(slap)
wn.lch_similarity(hit, slap)
Out[172]:
1.3121863889661687
In [173]:
wn.wup_similarity(hit, slap)
Out[173]:
0.25
In [178]:
from nltk.corpus import wordnet_ic
brown_ic = wordnet_ic.ic('ic-brown.dat')
semcor_ic = wordnet_ic.ic('ic-semcor.dat')

dog.res_similarity(cat, brown_ic)
Out[178]:
7.911666509036577
In [176]:
dog.jcn_similarity(cat, brown_ic)
Out[176]:
0.4497755285516739
In [179]:
dog.lin_similarity(cat, semcor_ic)
Out[179]:
0.8863288628086228
  • Looking up Part-of-Speech in WordNet
    • There are four common POS in WordNet: noun (n), adjective (a), adverb (r), and verb (v).
    • You can also look up a simplified part-of-speech tag.
    • These POS tags can be used to look up specific synsets for a word. For example, the word 'book' can be used as a noun or a verb: in WordNet, 'book' has 11 noun synsets and 4 verb synsets.
In [13]:
book = wn.synsets('book')
len(wn.synsets('book'))  # NB: note the difference between lemma and synset
Out[13]:
15
In [15]:
len(wn.synsets('book', pos='n'))
Out[15]:
11
In [14]:
len(wn.synsets('book', pos='v'))
Out[14]:
4
  • Looking up Lemmas and Synonyms in WordNet

    • Since the lemmas in a synset all have the same meaning, they can be treated as synonyms. So if you want to get all synonyms for a synset:
In [21]:
book3 = wn.synset('record.n.05')
[lemma.name() for lemma in book3.lemmas()]
Out[21]:
['record', 'record_book', 'book']

As you can see, record, record_book and book are three distinct lemmas in the same synset, Synset('record.n.05'). In this way, a synset represents a group of lemmas that all have the same meaning.

Access to all Synsets

  • Iterate over all the noun synsets:
In [181]:
for synset in list(wn.all_synsets('n'))[:5]:
    print(synset)
Synset('entity.n.01')
Synset('physical_entity.n.01')
Synset('abstraction.n.06')
Synset('thing.n.12')
Synset('object.n.01')
  • Get all synsets for this word, possibly restricted by POS:
In [182]:
wn.synsets('dog', pos='v')
Out[182]:
[Synset('chase.v.01')]
  • Walk through the noun synsets looking at their hypernyms:
In [183]:
from itertools import islice
for synset in islice(wn.all_synsets('n'), 5):
    print(synset, synset.hypernyms())
Synset('entity.n.01') []
Synset('physical_entity.n.01') [Synset('entity.n.01')]
Synset('abstraction.n.06') [Synset('entity.n.01')]
Synset('thing.n.12') [Synset('physical_entity.n.01')]
Synset('object.n.01') [Synset('physical_entity.n.01')]
  • All possible synonyms: As mentioned before, many words have multiple synsets because the word can have different meanings depending on the context. But let's say you didn't care about the context, and wanted to get all possible synonyms for a word.
In [23]:
synonyms = []
for syn in wn.synsets('book'):
    for lemma in syn.lemmas():
        synonyms.append(lemma.name())
In [26]:
len(synonyms)
Out[26]:
38
  • As you can see, there appear to be 38 possible synonyms for the word book. But in fact, some are verb forms, and many are just different usages of book. If we instead take the set of synonyms, there are fewer unique words.
In [25]:
len(set(synonyms))                            
Out[25]:
25
In [27]:
from nltk.corpus import wordnet as wn
set(s.name().split('.')[0] for s in wn.synsets('trade') if '.v.' in s.name())
Out[27]:
{'deal', 'trade'}

WordNet as Graph


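The hypernym relation induces a directed graph over synsets (in fact a DAG, since a synset may have more than one hypernym). Below is a minimal sketch of building such a graph with the third-party networkx library (an assumption; the construction itself relies only on the hypernyms() calls shown above):
In [ ]:
import networkx as nx
from nltk.corpus import wordnet as wn

G = nx.DiGraph()

def add_hypernym_edges(synset):
    # add an edge from each synset to each of its hypernyms, walking upward
    for hyper in synset.hypernyms():
        G.add_edge(synset.name(), hyper.name())
        add_hypernym_edges(hyper)

add_hypernym_edges(wn.synset('dog.n.01'))
print(G.number_of_nodes(), G.number_of_edges())
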
SemCor

  • SemCor is a subset of the Brown corpus (234K words) labelled with word senses.
  • According to the NLTK documentation, the sense-tagged corpus can be loaded with the nltk.corpus.semcor reader, as sketched below.
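A minimal sketch, assuming the semcor package has been downloaded (e.g., via nltk.download('semcor')); tagged_sents(tag='sem') returns each sentence as a list of chunks labelled with WordNet lemmas:
In [ ]:
from nltk.corpus import semcor
# the first sense-tagged sentence; each chunk's label is a WordNet Lemma
sent = semcor.tagged_sents(tag='sem')[0]
print(sent[:5])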

VerbNet

VerbNet (VN) (Kipper-Schuler 2006) is the largest online verb lexicon currently available for English. It is hierarchical, domain-independent, and broad-coverage, with mappings to other lexical resources such as WordNet and FrameNet.

VerbNet is organized into verb classes that extend Levin's (1993) classes through refinement and the addition of subclasses to achieve syntactic and semantic coherence among the members of a class. Each verb class in VN is completely described by thematic roles, selectional restrictions on the arguments, and frames consisting of a syntactic description and semantic predicates with a temporal function, in a manner similar to the event decomposition of Moens and Steedman (1988).

http://nltk.org/_modules/nltk/corpus/reader/verbnet.html

VerbNet has recently been integrated with 57 new classes from Korhonen and Briscoe's (2004) (K&B) proposed extension to Levin's original classification (Kipper et al., 2006). This work involved associating detailed syntactic-semantic descriptions with the K&B classes, as well as organizing them appropriately into the existing VN taxonomy. An additional set of 53 new classes from Korhonen and Ryant (2005) (K&R) has also been incorporated into VN. The outcome is a freely available resource that constitutes the most comprehensive and versatile Levin-style verb classification for English. The two extensions increased VN's coverage of PropBank tokens (Palmer et al., 2005) from 78.45% to 90.86%, making feasible the creation of a substantial training corpus annotated with VN thematic role labels and class membership assignments, to be released in 2007. This will finally enable large-scale experimentation on the utility of syntax-based classes for improving the performance of syntactic parsers and semantic role labelers on new domains.

In [51]:
from nltk.corpus import verbnet as vn
len(vn.classids())
Out[51]:
429
In [53]:
vn.classids('hit')
Out[53]:
['bump-18.4',
 'contiguous_location-47.8-1',
 'hit-18.1-1',
 'reach-51.8',
 'throw-17.1-1']
In [55]:
v = vn.vnclass('hit-18.1-1')
v
Out[55]:
<Element 'VNSUBCLASS' at 0x10765e630>
In [56]:
vn.lemmas('hit-18.1-1')
Out[56]:
['bang',
 'bash',
 'batter',
 'beat',
 'bump',
 'butt',
 'dash',
 'drum',
 'hammer',
 'hit',
 'kick',
 'knock',
 'lash',
 'pound',
 'rap',
 'slap',
 'smack',
 'strike',
 'tamp',
 'tap',
 'thump',
 'thwack',
 'whack',
 'click']
In [57]:
vn.wordnetids('hit-18.1-1')
Out[57]:
['bang%2:35:00',
 'bang%2:35:01',
 'bash%2:35:00',
 'batter%2:35:01',
 'batter%2:35:00',
 'batter%2:30:00',
 'beat%2:35:01',
 'beat%2:36:00',
 'beat%2:35:03',
 'beat%2:35:10',
 'beat%2:35:12',
 'bump%2:35:00',
 'butt%2:35:00',
 'dash%2:35:02',
 'drum%2:39:00',
 'hammer%2:35:00',
 'hit%2:35:01',
 'hit%2:35:00',
 'hit%2:33:01',
 'hit%2:33:03',
 'kick%2:35:00',
 'knock%2:35:01',
 'knock%2:35:00',
 'knock%2:39:00',
 'lash%2:35:01',
 'lash%2:35:00',
 'pound%2:35:00',
 'pound%2:35:01',
 'pound%2:30:03',
 'rap%2:35:00',
 'rap%2:39:00',
 'slap%2:35:00',
 'smack%2:35:02',
 'strike%2:35:01',
 'strike%2:35:00',
 'strike%2:35:09',
 'tamp%2:35:00',
 'tap%2:35:00',
 'tap%2:39:01',
 'thump%2:35:00',
 'thwack%2:35:00',
 'whack%2:35:00',
 'click%2:35:00']
In [58]:
print(vn.pprint_themroles('hit-18.1-1'))
* Instrument[+body_part +refl]
In [41]:
# NB: this cell ran earlier in the session, with v bound to a different class
[t.attrib['type'] for t in v.findall('THEMROLES/THEMROLE/SELRESTRS/SELRESTR')]
Out[41]:
['comestible', 'solid']
In [42]:
[t.attrib['type'] for t in v.findall('THEMROLES/THEMROLE')]
Out[42]:
['Patient']

HOMEWORK

  1. [60%] Define sent to be the list of words ['she', 'sells', 'sea', 'shells', 'by', 'the', 'sea', 'shore']. Write code to perform the following tasks:
    1. Print all words beginning with 'sh'.
    2. Print all words longer than four characters.
  2. [20%] What percentage of noun synsets have no hyponyms? (Hint: you can get all noun synsets using wn.all_synsets('n').)
  3. [20%] What is the similarity score between 'strike' and 'hit' based on the Lin Similarity algorithm?
  4. [bonus 20%] Compute the average number of senses for nouns, adjectives, and adverbs in WordNet.

Reading: HTTLCS [chapter 5-6]; NLTK [section 2.5]