Word2vec

You can install gensim as follows:

pip install --upgrade gensim
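
The code in this notebook uses the gensim 3.x API (for example the iter= parameter, model.wv.vocab, and model.accuracy); several of these names changed in gensim 4.x, so it is worth checking which version you have installed:

import gensim
print(gensim.__version__)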

Here is some simple starter code for word2vec.

In [10]:
from gensim.models import Word2Vec
import gensim
from sklearn.decomposition import PCA
from matplotlib import pyplot
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

# define training data
sentences = [['this', 'is', 'the', 'first', 'sentence', 'for', 'word2vec'],
        ['this', 'is', 'the', 'second', 'sentence'],
        ['yet', 'another', 'sentence'],
        ['one', 'more', 'sentence'],
        ['and', 'the', 'final', 'sentence']]
# train the model (min_count=1 keeps every word, even those that appear only once)
model = Word2Vec(sentences, min_count=1)
# look up the learned 100-dimensional vector for the word 'first'
model.wv['first']
2021-01-24 21:59:34,588 : INFO : collecting all words and their counts
2021-01-24 21:59:34,589 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2021-01-24 21:59:34,591 : INFO : collected 14 word types from a corpus of 22 raw words and 5 sentences
2021-01-24 21:59:34,592 : INFO : Loading a fresh vocabulary
2021-01-24 21:59:34,593 : INFO : effective_min_count=1 retains 14 unique words (100% of original 14, drops 0)
2021-01-24 21:59:34,595 : INFO : effective_min_count=1 leaves 22 word corpus (100% of original 22, drops 0)
2021-01-24 21:59:34,596 : INFO : deleting the raw counts dictionary of 14 items
2021-01-24 21:59:34,597 : INFO : sample=0.001 downsamples 14 most-common words
2021-01-24 21:59:34,598 : INFO : downsampling leaves estimated 2 word corpus (12.7% of prior 22)
2021-01-24 21:59:34,600 : INFO : estimated required memory for 14 words and 100 dimensions: 18200 bytes
2021-01-24 21:59:34,601 : INFO : resetting layer weights
2021-01-24 21:59:34,612 : INFO : training model with 3 workers on 14 vocabulary and 100 features, using sg=0 hs=0 sample=0.001 negative=5 window=5
2021-01-24 21:59:34,621 : INFO : worker thread finished; awaiting finish of 2 more threads
2021-01-24 21:59:34,623 : INFO : worker thread finished; awaiting finish of 1 more threads
2021-01-24 21:59:34,624 : INFO : worker thread finished; awaiting finish of 0 more threads
2021-01-24 21:59:34,625 : INFO : EPOCH - 1 : training on 22 raw words (1 effective words) took 0.0s, 283 effective words/s
2021-01-24 21:59:34,628 : INFO : worker thread finished; awaiting finish of 2 more threads
2021-01-24 21:59:34,629 : INFO : worker thread finished; awaiting finish of 1 more threads
2021-01-24 21:59:34,631 : INFO : worker thread finished; awaiting finish of 0 more threads
2021-01-24 21:59:34,632 : INFO : EPOCH - 2 : training on 22 raw words (0 effective words) took 0.0s, 0 effective words/s
2021-01-24 21:59:34,637 : INFO : worker thread finished; awaiting finish of 2 more threads
2021-01-24 21:59:34,638 : INFO : worker thread finished; awaiting finish of 1 more threads
2021-01-24 21:59:34,639 : INFO : worker thread finished; awaiting finish of 0 more threads
2021-01-24 21:59:34,640 : INFO : EPOCH - 3 : training on 22 raw words (4 effective words) took 0.0s, 1261 effective words/s
2021-01-24 21:59:34,644 : INFO : worker thread finished; awaiting finish of 2 more threads
2021-01-24 21:59:34,645 : INFO : worker thread finished; awaiting finish of 1 more threads
2021-01-24 21:59:34,646 : INFO : worker thread finished; awaiting finish of 0 more threads
2021-01-24 21:59:34,647 : INFO : EPOCH - 4 : training on 22 raw words (3 effective words) took 0.0s, 811 effective words/s
2021-01-24 21:59:34,653 : INFO : worker thread finished; awaiting finish of 2 more threads
2021-01-24 21:59:34,657 : INFO : worker thread finished; awaiting finish of 1 more threads
2021-01-24 21:59:34,659 : INFO : worker thread finished; awaiting finish of 0 more threads
2021-01-24 21:59:34,661 : INFO : EPOCH - 5 : training on 22 raw words (2 effective words) took 0.0s, 255 effective words/s
2021-01-24 21:59:34,663 : INFO : training on a 110 raw words (10 effective words) took 0.0s, 209 effective words/s
2021-01-24 21:59:34,665 : WARNING : under 10 jobs per worker: consider setting a smaller `batch_words' for smoother alpha decay
Out[10]:
array([-2.01762188e-03,  1.46727194e-03,  2.21495074e-03,  2.48606596e-03,
       -3.38722020e-03, -4.72196518e-03,  1.13484675e-04,  3.78799066e-03,
       -3.79963452e-03, -9.89270120e-05, -2.52333982e-03,  2.89688702e-03,
        4.89748036e-03,  2.50620279e-03,  4.16825153e-03, -3.07514612e-03,
       -2.38831318e-03, -5.04913682e-04, -6.98224583e-04, -6.69230765e-04,
        2.76499719e-04, -4.19675605e-03,  4.73028951e-04,  1.51047867e-03,
       -4.11835406e-03, -1.53737795e-03, -6.25130895e-04,  1.74643099e-03,
        3.85645055e-03,  4.20596730e-03, -1.91992184e-03, -2.74513010e-03,
       -1.87818106e-04, -3.68853589e-03,  2.53451988e-03,  4.18071076e-03,
       -2.31573801e-03,  4.75549093e-03,  2.34071515e-03,  3.07040405e-03,
       -3.87906824e-04,  3.75742023e-03,  4.93973680e-03,  1.93535234e-03,
       -9.87664796e-04,  5.24982868e-04,  1.35029189e-03,  4.31928644e-03,
       -4.28817142e-03, -2.13935971e-03,  3.76118813e-03,  3.59573867e-04,
       -4.67378506e-03,  1.21080875e-03,  3.55756711e-05,  1.36071024e-03,
       -1.30385987e-03,  3.72920628e-03,  4.55394760e-03,  2.67964299e-03,
       -5.24081406e-04,  3.04466276e-03, -2.67909328e-03, -1.01336616e-03,
       -4.60730371e-04,  2.57151038e-03,  3.18618212e-03,  2.81267334e-03,
       -4.77797119e-03,  5.38517081e-04, -4.40169964e-03,  4.43629455e-03,
        4.46568709e-03, -1.46277121e-03, -4.15219506e-03, -2.89633963e-03,
        4.79440251e-03,  1.17966556e-03,  3.73994792e-03,  5.24884730e-04,
        1.00487494e-03, -4.70128981e-03, -3.31647671e-03, -4.25759377e-03,
        4.07797284e-03, -1.91678107e-03, -3.40599520e-03,  2.60688760e-03,
        6.43456078e-05, -4.42066789e-03,  4.64845495e-03,  8.57527426e-04,
        4.38214658e-04,  4.11094818e-03, -4.38492326e-03, -3.06548807e-03,
       -2.48974352e-03, -2.99249915e-03,  6.16223668e-04,  1.83907070e-03],
      dtype=float32)
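
Besides looking up raw vectors, the trained model can answer similarity queries. For example (a quick sketch on the toy model above; with this little training data the scores are essentially random):

# cosine similarity between two words in the toy vocabulary
print(model.wv.similarity('this', 'is'))
# the words most similar to 'sentence' according to the toy model
print(model.wv.most_similar('sentence'))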

PCA

We can reduce the dimensionality to two and plot the words as shown below:

In [2]:
# retrieve the trained vectors for every word in the vocabulary
X = model.wv[model.wv.vocab]
# project the 100-dimensional vectors down to two dimensions with PCA
pca = PCA(n_components=2)
result = pca.fit_transform(X)
# scatter-plot the projection and label each point with its word
pyplot.scatter(result[:, 0], result[:, 1])
words = list(model.wv.vocab)
for i, word in enumerate(words):
    pyplot.annotate(word, xy=(result[i, 0], result[i, 1]))
pyplot.show()
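
To see how faithful the two-dimensional picture is, you can check how much of the variance the projection keeps (a quick check on the pca object fitted above):

# fraction of the total variance explained by each of the two principal components
print(pca.explained_variance_ratio_)
print(pca.explained_variance_ratio_.sum())
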
In [12]:
# read the corpus: one sentence per line, lower-cased and whitespace-tokenized
with open('../data/vldb.txt', 'r') as txtfile:
    sentences = [line.lower().strip().split(' ') for line in txtfile]
# train word2vec on the corpus, dropping words that occur fewer than 2 times
model = gensim.models.Word2Vec(sentences, min_count=2, iter=5)
test = 'query'
print('words similar to \'' + test + '\':\t' + str(model.wv.most_similar(test)))
2021-01-24 22:02:01,225 : INFO : collecting all words and their counts
2021-01-24 22:02:01,227 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2021-01-24 22:02:01,251 : INFO : collected 5908 word types from a corpus of 34460 raw words and 4324 sentences
2021-01-24 22:02:01,254 : INFO : Loading a fresh vocabulary
2021-01-24 22:02:01,270 : INFO : effective_min_count=2 retains 2307 unique words (39% of original 5908, drops 3601)
2021-01-24 22:02:01,277 : INFO : effective_min_count=2 leaves 30859 word corpus (89% of original 34460, drops 3601)
2021-01-24 22:02:01,311 : INFO : deleting the raw counts dictionary of 5908 items
2021-01-24 22:02:01,314 : INFO : sample=0.001 downsamples 44 most-common words
2021-01-24 22:02:01,315 : INFO : downsampling leaves estimated 22156 word corpus (71.8% of prior 30859)
2021-01-24 22:02:01,331 : INFO : estimated required memory for 2307 words and 100 dimensions: 2999100 bytes
2021-01-24 22:02:01,334 : INFO : resetting layer weights
2021-01-24 22:02:01,831 : INFO : training model with 3 workers on 2307 vocabulary and 100 features, using sg=0 hs=0 sample=0.001 negative=5 window=5
2021-01-24 22:02:01,861 : INFO : worker thread finished; awaiting finish of 2 more threads
2021-01-24 22:02:01,863 : INFO : worker thread finished; awaiting finish of 1 more threads
2021-01-24 22:02:01,865 : INFO : worker thread finished; awaiting finish of 0 more threads
2021-01-24 22:02:01,866 : INFO : EPOCH - 1 : training on 34460 raw words (22155 effective words) took 0.0s, 757883 effective words/s
2021-01-24 22:02:01,906 : INFO : worker thread finished; awaiting finish of 2 more threads
2021-01-24 22:02:01,909 : INFO : worker thread finished; awaiting finish of 1 more threads
2021-01-24 22:02:01,915 : INFO : worker thread finished; awaiting finish of 0 more threads
2021-01-24 22:02:01,918 : INFO : EPOCH - 2 : training on 34460 raw words (22162 effective words) took 0.0s, 499035 effective words/s
2021-01-24 22:02:01,966 : INFO : worker thread finished; awaiting finish of 2 more threads
2021-01-24 22:02:01,969 : INFO : worker thread finished; awaiting finish of 1 more threads
2021-01-24 22:02:01,970 : INFO : worker thread finished; awaiting finish of 0 more threads
2021-01-24 22:02:01,971 : INFO : EPOCH - 3 : training on 34460 raw words (22252 effective words) took 0.0s, 567163 effective words/s
2021-01-24 22:02:02,036 : INFO : worker thread finished; awaiting finish of 2 more threads
2021-01-24 22:02:02,037 : INFO : worker thread finished; awaiting finish of 1 more threads
2021-01-24 22:02:02,039 : INFO : worker thread finished; awaiting finish of 0 more threads
2021-01-24 22:02:02,040 : INFO : EPOCH - 4 : training on 34460 raw words (22222 effective words) took 0.1s, 420764 effective words/s
2021-01-24 22:02:02,138 : INFO : worker thread finished; awaiting finish of 2 more threads
2021-01-24 22:02:02,152 : INFO : worker thread finished; awaiting finish of 1 more threads
2021-01-24 22:02:02,157 : INFO : worker thread finished; awaiting finish of 0 more threads
2021-01-24 22:02:02,158 : INFO : EPOCH - 5 : training on 34460 raw words (22134 effective words) took 0.1s, 242916 effective words/s
2021-01-24 22:02:02,159 : INFO : training on a 172300 raw words (110925 effective words) took 0.3s, 338606 effective words/s
2021-01-24 22:02:02,160 : WARNING : under 10 jobs per worker: consider setting a smaller `batch_words' for smoother alpha decay
2021-01-24 22:02:02,163 : INFO : precomputing L2-norms of word weight vectors
words similar to 'query':	[('a', 0.9999188780784607), ('in', 0.9999072551727295), ('with', 0.9999068975448608), ('queries', 0.9999058246612549), ('and', 0.9999037981033325), ('for', 0.9999023675918579), ('using', 0.9999022483825684), ('over', 0.9999008774757385), ('of', 0.9998964667320251), ('by', 0.9998931288719177)]
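
Note that on such a small corpus, trained for only a few epochs, the nearest neighbours of 'query' are dominated by frequent function words ('a', 'in', 'with', ...). One simple mitigation is to drop stop words before training; a rough sketch with a small hand-picked stop-word list (not a full one):

stopwords = {'a', 'an', 'the', 'and', 'of', 'in', 'on', 'for', 'with', 'by', 'to', 'over', 'using'}
filtered = [[w for w in sent if w not in stopwords] for sent in sentences]
model_nostop = gensim.models.Word2Vec(filtered, min_count=2, iter=5)
print(model_nostop.wv.most_similar('query'))
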
In [4]:
model.accuracy('../data/questions-words.txt')
Out[4]:
[{'section': 'capital-common-countries', 'correct': [], 'incorrect': []},
 {'section': 'capital-world', 'correct': [], 'incorrect': []},
 {'section': 'currency', 'correct': [], 'incorrect': []},
 {'section': 'city-in-state', 'correct': [], 'incorrect': []},
 {'section': 'family', 'correct': [], 'incorrect': []},
 {'section': 'gram1-adjective-to-adverb', 'correct': [], 'incorrect': []},
 {'section': 'gram2-opposite',
  'correct': [],
  'incorrect': [('CERTAIN', 'UNCERTAIN', 'CONSISTENT', 'INCONSISTENT'),
   ('CONSISTENT', 'INCONSISTENT', 'CERTAIN', 'UNCERTAIN')]},
 {'section': 'gram3-comparative',
  'correct': [],
  'incorrect': [('FAST', 'FASTER', 'GOOD', 'BETTER'),
   ('FAST', 'FASTER', 'LOW', 'LOWER'),
   ('FAST', 'FASTER', 'SMART', 'SMARTER'),
   ('GOOD', 'BETTER', 'LOW', 'LOWER'),
   ('GOOD', 'BETTER', 'SMART', 'SMARTER'),
   ('GOOD', 'BETTER', 'FAST', 'FASTER'),
   ('LOW', 'LOWER', 'SMART', 'SMARTER'),
   ('LOW', 'LOWER', 'FAST', 'FASTER'),
   ('LOW', 'LOWER', 'GOOD', 'BETTER'),
   ('SMART', 'SMARTER', 'FAST', 'FASTER'),
   ('SMART', 'SMARTER', 'GOOD', 'BETTER'),
   ('SMART', 'SMARTER', 'LOW', 'LOWER')]},
 {'section': 'gram4-superlative', 'correct': [], 'incorrect': []},
 {'section': 'gram5-present-participle',
  'correct': [],
  'incorrect': [('IMPLEMENT', 'IMPLEMENTING', 'RUN', 'RUNNING'),
   ('RUN', 'RUNNING', 'IMPLEMENT', 'IMPLEMENTING')]},
 {'section': 'gram6-nationality-adjective', 'correct': [], 'incorrect': []},
 {'section': 'gram7-past-tense',
  'correct': [],
  'incorrect': [('ENHANCING', 'ENHANCED', 'GENERATING', 'GENERATED'),
   ('GENERATING', 'GENERATED', 'ENHANCING', 'ENHANCED')]},
 {'section': 'gram8-plural',
  'correct': [],
  'incorrect': [('CLOUD', 'CLOUDS', 'COMPUTER', 'COMPUTERS'),
   ('CLOUD', 'CLOUDS', 'MACHINE', 'MACHINES'),
   ('COMPUTER', 'COMPUTERS', 'MACHINE', 'MACHINES'),
   ('COMPUTER', 'COMPUTERS', 'CLOUD', 'CLOUDS'),
   ('MACHINE', 'MACHINES', 'CLOUD', 'CLOUDS'),
   ('MACHINE', 'MACHINES', 'COMPUTER', 'COMPUTERS')]},
 {'section': 'gram9-plural-verbs', 'correct': [], 'incorrect': []},
 {'section': 'total',
  'correct': [],
  'incorrect': [('CERTAIN', 'UNCERTAIN', 'CONSISTENT', 'INCONSISTENT'),
   ('CONSISTENT', 'INCONSISTENT', 'CERTAIN', 'UNCERTAIN'),
   ('FAST', 'FASTER', 'GOOD', 'BETTER'),
   ('FAST', 'FASTER', 'LOW', 'LOWER'),
   ('FAST', 'FASTER', 'SMART', 'SMARTER'),
   ('GOOD', 'BETTER', 'LOW', 'LOWER'),
   ('GOOD', 'BETTER', 'SMART', 'SMARTER'),
   ('GOOD', 'BETTER', 'FAST', 'FASTER'),
   ('LOW', 'LOWER', 'SMART', 'SMARTER'),
   ('LOW', 'LOWER', 'FAST', 'FASTER'),
   ('LOW', 'LOWER', 'GOOD', 'BETTER'),
   ('SMART', 'SMARTER', 'FAST', 'FASTER'),
   ('SMART', 'SMARTER', 'GOOD', 'BETTER'),
   ('SMART', 'SMARTER', 'LOW', 'LOWER'),
   ('IMPLEMENT', 'IMPLEMENTING', 'RUN', 'RUNNING'),
   ('RUN', 'RUNNING', 'IMPLEMENT', 'IMPLEMENTING'),
   ('ENHANCING', 'ENHANCED', 'GENERATING', 'GENERATED'),
   ('GENERATING', 'GENERATED', 'ENHANCING', 'ENHANCED'),
   ('CLOUD', 'CLOUDS', 'COMPUTER', 'COMPUTERS'),
   ('CLOUD', 'CLOUDS', 'MACHINE', 'MACHINES'),
   ('COMPUTER', 'COMPUTERS', 'MACHINE', 'MACHINES'),
   ('COMPUTER', 'COMPUTERS', 'CLOUD', 'CLOUDS'),
   ('MACHINE', 'MACHINES', 'CLOUD', 'CLOUDS'),
   ('MACHINE', 'MACHINES', 'COMPUTER', 'COMPUTERS')]}]
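
model.accuracy returns one dict per analogy section plus a final 'total' entry, so an overall score can be computed from the lengths of the 'correct' and 'incorrect' lists (a small sketch; here every analogy is answered incorrectly, so the score is 0%):

results = model.accuracy('../data/questions-words.txt')
total = results[-1]  # the aggregated 'total' section
n_correct = len(total['correct'])
n_all = n_correct + len(total['incorrect'])
if n_all > 0:
    print('analogy accuracy: %d/%d = %.1f%%' % (n_correct, n_all, 100.0 * n_correct / n_all))
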
In [5]:
from nltk.corpus import wordnet as wn
wn.synsets('car')[0].lemma_names()
Out[5]:
['car', 'auto', 'automobile', 'machine', 'motorcar']
In [6]:
wn.synsets('car')[1].lemma_names()
Out[6]:
['car', 'railcar', 'railway_car', 'railroad_car']
In [7]:
wn.synsets('car')[3].lemma_names()
Out[7]:
['car', 'elevator_car']
In [8]:
panda=wn.synset('panda.n.01')
hyper=lambda s:s.hypernyms()
list(panda.closure(hyper))
Out[8]:
[Synset('procyonid.n.01'),
 Synset('carnivore.n.01'),
 Synset('placental.n.01'),
 Synset('mammal.n.01'),
 Synset('vertebrate.n.01'),
 Synset('chordate.n.01'),
 Synset('animal.n.01'),
 Synset('organism.n.01'),
 Synset('living_thing.n.01'),
 Synset('whole.n.02'),
 Synset('object.n.01'),
 Synset('physical_entity.n.01'),
 Synset('entity.n.01')]
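
WordNet's hypernym hierarchy also supports graph-based similarity measures between synsets, for example (a quick sketch with two common nouns):

dog = wn.synset('dog.n.01')
cat = wn.synset('cat.n.01')
# similarity based on the shortest path between the two synsets in the hypernym graph
print(dog.path_similarity(cat))
# Wu-Palmer similarity, based on the depth of the least common subsumer
print(dog.wup_similarity(cat))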