This notebook replicates the Document Embedding with Paragraph Vectors paper, http://arxiv.org/abs/1507.07998.
In that paper, the authors only showed results from the DBOW ("distributed bag of words") mode, trained on the English Wikipedia. Here we replicate this experiment using not only DBOW, but also the DM ("distributed memory") mode of the Paragraph Vector algorithm aka Doc2Vec.
Let's import the necessary modules and set up logging. The code below assumes Python 3.7+ and Gensim 4.0+.
import logging
import multiprocessing
from pprint import pprint
import smart_open
from gensim.corpora.wikicorpus import WikiCorpus, tokenize
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
First, download the dump of all Wikipedia articles from the Wikimedia dumps site, https://dumps.wikimedia.org/enwiki/ – you want the file named enwiki-latest-pages-articles.xml.bz2.
Second, convert that Wikipedia article dump from the arcane Wikimedia XML format into a plain text file. This will make the subsequent training faster and also allow easy inspection of the data = "input eyeballing".
We'll preprocess each article at the same time, normalizing its text to lowercase, splitting into tokens, etc. Below I use a regexp tokenizer that simply looks for alphabetic sequences as tokens. But feel free to adapt the text preprocessing to your own domain. High quality preprocessing is often critical for the final pipeline accuracy – garbage in, garbage out!
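For intuition, a minimal stand-in for such a regexp tokenizer might look like this (a hedged sketch, not the exact tokenize function imported above):

```python
import re

def simple_tokenize(text):
    # Lowercase the text, then keep only alphabetic runs of 2+ characters,
    # roughly mimicking a regexp tokenizer for English Wikipedia text.
    return re.findall(r"[a-z]{2,}", text.lower())

print(simple_tokenize("Anarchism is a political philosophy, est. 1840!"))
# → ['anarchism', 'is', 'political', 'philosophy', 'est']
```

Numbers and single characters are dropped here; swap in your own pattern if your domain needs them.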
wiki = WikiCorpus(
    "enwiki-latest-pages-articles.xml.bz2",  # path to the file you downloaded above
    tokenizer_func=tokenize,  # simple regexp; plug in your own tokenizer here
    metadata=True,  # also return the article titles and ids when parsing
    dictionary={},  # don't start processing the data yet
)

with smart_open.open("wiki.txt.gz", "w", encoding='utf8') as fout:
    for article_no, (content, (page_id, title)) in enumerate(wiki.get_texts()):
        title = ' '.join(title.split())  # normalize any whitespace in the title
        if article_no % 500000 == 0:
            logging.info("processing article #%i: %r (%i tokens)", article_no, title, len(content))
        fout.write(f"{title}\t{' '.join(content)}\n")  # title_of_article [TAB] words of the article
2022-04-16 11:23:20,663 : INFO : processing article #0: 'Anarchism' (6540 tokens)
2022-04-16 11:30:53,798 : INFO : processing article #500000: 'Onward Muslim Soldiers' (517 tokens)
2022-04-16 11:36:14,662 : INFO : processing article #1000000: 'Push Upstairs' (354 tokens)
2022-04-16 11:40:59,785 : INFO : processing article #1500000: 'Small nucleolar RNA Z278' (113 tokens)
2022-04-16 11:45:58,630 : INFO : processing article #2000000: '1925–26 Boston Bruins season' (556 tokens)
2022-04-16 11:51:03,737 : INFO : processing article #2500000: 'Tessier, Saskatchewan' (119 tokens)
2022-04-16 11:56:20,254 : INFO : processing article #3000000: 'Sebezhsky District' (908 tokens)
2022-04-16 12:01:59,089 : INFO : processing article #3500000: 'Niko Peleshi' (248 tokens)
2022-04-16 12:07:23,184 : INFO : processing article #4000000: 'Kudoa gunterae' (109 tokens)
2022-04-16 12:13:08,024 : INFO : processing article #4500000: 'Danko (singer)' (699 tokens)
2022-04-16 12:19:33,734 : INFO : processing article #5000000: 'Lada West Togliatti' (253 tokens)
2022-04-16 12:22:20,928 : INFO : finished iterating over Wikipedia corpus of 5205168 documents with 3016298486 positions (total 21961341 articles, 3093120544 positions before pruning articles shorter than 50 words)
The above took about 1 hour and created a new ~5.8 GB file named wiki.txt.gz. Note the output text was transparently compressed into .gz (GZIP) right away, using the smart_open library, to save on disk space.
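As a quick aside, smart_open infers the codec from the file extension, so .gz files compress on write and decompress on read with no extra code – a small sketch (hypothetical file name):

```python
import smart_open

# Writing to a path ending in .gz transparently GZIP-compresses the stream.
with smart_open.open("demo.txt.gz", "w", encoding="utf8") as fout:
    fout.write("hello compressed world\n")

# Reading decompresses just as transparently.
with smart_open.open("demo.txt.gz", encoding="utf8") as fin:
    print(fin.read())  # → hello compressed world
```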
Next we'll set up a document stream to load the preprocessed articles from wiki.txt.gz one by one, in the format expected by Doc2Vec, ready for training. We don't want to load everything into RAM at once, because that would blow up the memory. And it is not necessary – Gensim can handle streamed input training data:
class TaggedWikiCorpus:
    def __init__(self, wiki_text_path):
        self.wiki_text_path = wiki_text_path

    def __iter__(self):
        for line in smart_open.open(self.wiki_text_path, encoding='utf8'):
            title, words = line.split('\t')
            yield TaggedDocument(words=words.split(), tags=[title])
documents = TaggedWikiCorpus('wiki.txt.gz') # A streamed iterable; nothing in RAM yet.
# Load and print the first preprocessed Wikipedia document, as a sanity check = "input eyeballing".
first_doc = next(iter(documents))
print(first_doc.tags, ': ', ' '.join(first_doc.words[:50] + ['………'] + first_doc.words[-50:]))
['Anarchism'] : anarchism is political philosophy and movement that is sceptical of authority and rejects all involuntary coercive forms of hierarchy anarchism calls for the abolition of the state which it holds to be unnecessary undesirable and harmful as historically left wing movement placed on the farthest left of the political spectrum ……… criticism of philosophical anarchism defence of philosophical anarchism stating that both kinds of anarchism philosophical and political anarchism are philosophical and political claims anarchistic popular fiction novel an argument for philosophical anarchism external links anarchy archives anarchy archives is an online research center on the history and theory of anarchism
The document seems legit, so let's move on to finally training some Doc2Vec models.
The original paper had a vocabulary size of 915,715 word types, so we'll try to match it by setting max_final_vocab to 1,000,000 in the Doc2Vec constructor.
Other critical parameters were left unspecified in the paper, so we'll go with a window size of eight (a prediction window of 8 tokens to either side). It looks like the authors tried vector dimensionality of 100, 300, 1,000 & 10,000 in the paper (with 10k dims performing the best), but I'll only train with 200 dimensions here, to keep the RAM in check on my laptop.
Feel free to tinker with these values yourself if you like:
workers = 20  # or use multiprocessing.cpu_count() - 1, to leave one core for the OS & other stuff
# PV-DBOW: paragraph vector in distributed bag of words mode
model_dbow = Doc2Vec(
    dm=0, dbow_words=1,  # dbow_words=1 to train word vectors at the same time too, not only DBOW
    vector_size=200, window=8, epochs=10, workers=workers, max_final_vocab=1000000,
)

# PV-DM: paragraph vector in distributed memory mode
model_dm = Doc2Vec(
    dm=1, dm_mean=1,  # use average of context word vectors to train DM
    vector_size=200, window=8, epochs=10, workers=workers, max_final_vocab=1000000,
)
2022-04-18 12:05:46,344 : INFO : Doc2Vec lifecycle event {'params': 'Doc2Vec<dbow+w,d200,n5,w8,mc5,s0.001,t20>', 'datetime': '2022-04-18T12:05:46.344471', 'gensim': '4.1.3.dev0', 'python': '3.8.10 (default, Nov 26 2021, 20:14:08) \n[GCC 9.3.0]', 'platform': 'Linux-5.4.0-94-generic-x86_64-with-glibc2.29', 'event': 'created'}
2022-04-18 12:05:46,345 : INFO : Doc2Vec lifecycle event {'params': 'Doc2Vec<dm/m,d200,n5,w8,mc5,s0.001,t20>', 'datetime': '2022-04-18T12:05:46.345716', 'gensim': '4.1.3.dev0', 'python': '3.8.10 (default, Nov 26 2021, 20:14:08) \n[GCC 9.3.0]', 'platform': 'Linux-5.4.0-94-generic-x86_64-with-glibc2.29', 'event': 'created'}
Run one pass through the Wikipedia corpus, to collect the 1M vocabulary and initialize the doc2vec models:
model_dbow.build_vocab(documents, progress_per=500000)
print(model_dbow)
# Save some time by copying the vocabulary structures from the DBOW model to the DM model.
# Both models are built on top of exactly the same data, so there's no need to repeat the vocab-building step.
model_dm.reset_from(model_dbow)
print(model_dm)
2022-04-18 12:05:47,311 : INFO : collecting all words and their counts 2022-04-18 12:05:47,313 : INFO : PROGRESS: at example #0, processed 0 words (0 words/s), 0 word types, 0 tags 2022-04-18 12:07:35,880 : INFO : PROGRESS: at example #500000, processed 656884578 words (6050478 words/s), 3221051 word types, 500000 tags 2022-04-18 12:08:38,784 : INFO : PROGRESS: at example #1000000, processed 1021477892 words (5796084 words/s), 4478830 word types, 1000000 tags 2022-04-18 12:09:29,607 : INFO : PROGRESS: at example #1500000, processed 1308608477 words (5649726 words/s), 5419923 word types, 1500000 tags 2022-04-18 12:10:13,477 : INFO : PROGRESS: at example #2000000, processed 1554211349 words (5598537 words/s), 6190970 word types, 2000000 tags 2022-04-18 12:10:56,549 : INFO : PROGRESS: at example #2500000, processed 1794853915 words (5587147 words/s), 6943275 word types, 2500000 tags 2022-04-18 12:11:39,668 : INFO : PROGRESS: at example #3000000, processed 2032520202 words (5511955 words/s), 7668721 word types, 3000000 tags 2022-04-18 12:12:23,192 : INFO : PROGRESS: at example #3500000, processed 2268859232 words (5430192 words/s), 8352590 word types, 3500000 tags 2022-04-18 12:13:02,526 : INFO : PROGRESS: at example #4000000, processed 2493668037 words (5715482 words/s), 8977844 word types, 4000000 tags 2022-04-18 12:13:42,550 : INFO : PROGRESS: at example #4500000, processed 2709484503 words (5392235 words/s), 9612299 word types, 4500000 tags 2022-04-18 12:14:21,813 : INFO : PROGRESS: at example #5000000, processed 2932680226 words (5684768 words/s), 10226832 word types, 5000000 tags 2022-04-18 12:14:51,346 : INFO : collected 10469247 word types and 5205168 unique tags from a corpus of 5205168 examples and 3016298486 words 2022-04-18 12:14:55,076 : INFO : Doc2Vec lifecycle event {'msg': 'max_final_vocab=1000000 and min_count=5 resulted in calc_min_count=23, effective_min_count=23', 'datetime': '2022-04-18T12:14:55.076153', 'gensim': '4.1.3.dev0', 'python': '3.8.10 
(default, Nov 26 2021, 20:14:08) \n[GCC 9.3.0]', 'platform': 'Linux-5.4.0-94-generic-x86_64-with-glibc2.29', 'event': 'prepare_vocab'} 2022-04-18 12:14:55,076 : INFO : Creating a fresh vocabulary 2022-04-18 12:14:58,906 : INFO : Doc2Vec lifecycle event {'msg': 'effective_min_count=23 retains 996522 unique words (9.52% of original 10469247, drops 9472725)', 'datetime': '2022-04-18T12:14:58.906148', 'gensim': '4.1.3.dev0', 'python': '3.8.10 (default, Nov 26 2021, 20:14:08) \n[GCC 9.3.0]', 'platform': 'Linux-5.4.0-94-generic-x86_64-with-glibc2.29', 'event': 'prepare_vocab'} 2022-04-18 12:14:58,906 : INFO : Doc2Vec lifecycle event {'msg': 'effective_min_count=23 leaves 2988436691 word corpus (99.08% of original 3016298486, drops 27861795)', 'datetime': '2022-04-18T12:14:58.906730', 'gensim': '4.1.3.dev0', 'python': '3.8.10 (default, Nov 26 2021, 20:14:08) \n[GCC 9.3.0]', 'platform': 'Linux-5.4.0-94-generic-x86_64-with-glibc2.29', 'event': 'prepare_vocab'} 2022-04-18 12:15:01,747 : INFO : deleting the raw counts dictionary of 10469247 items 2022-04-18 12:15:01,860 : INFO : sample=0.001 downsamples 23 most-common words 2022-04-18 12:15:01,861 : INFO : Doc2Vec lifecycle event {'msg': 'downsampling leaves estimated 2431447874.2898555 word corpus (81.4%% of prior 2988436691)', 'datetime': '2022-04-18T12:15:01.861332', 'gensim': '4.1.3.dev0', 'python': '3.8.10 (default, Nov 26 2021, 20:14:08) \n[GCC 9.3.0]', 'platform': 'Linux-5.4.0-94-generic-x86_64-with-glibc2.29', 'event': 'prepare_vocab'} 2022-04-18 12:15:07,001 : INFO : estimated required memory for 996522 words and 200 dimensions: 7297864200 bytes 2022-04-18 12:15:07,002 : INFO : resetting layer weights 2022-04-18 12:15:10,247 : INFO : resetting layer weights
Doc2Vec<dbow+w,d200,n5,w8,mc5,s0.001,t20>
Doc2Vec<dm/m,d200,n5,w8,mc5,s0.001,t20>
Now we’re ready to train Doc2Vec on the entirety of the English Wikipedia. Warning! Training this DBOW model takes ~14 hours, and DM ~6 hours, on my 2020 Linux machine.
# Train DBOW doc2vec incl. word vectors.
# Report progress every ½ hour.
model_dbow.train(documents, total_examples=model_dbow.corpus_count, epochs=model_dbow.epochs, report_delay=30*60)
2022-04-18 12:15:13,503 : INFO : Doc2Vec lifecycle event {'msg': 'training model with 20 workers on 996522 vocabulary and 200 features, using sg=1 hs=0 sample=0.001 negative=5 window=8 shrink_windows=True', 'datetime': '2022-04-18T12:15:13.503265', 'gensim': '4.1.3.dev0', 'python': '3.8.10 (default, Nov 26 2021, 20:14:08) \n[GCC 9.3.0]', 'platform': 'Linux-5.4.0-94-generic-x86_64-with-glibc2.29', 'event': 'train'} 2022-04-18 12:15:14,566 : INFO : EPOCH 0 - PROGRESS: at 0.00% examples, 299399 words/s, in_qsize 38, out_qsize 1 2022-04-18 12:45:14,574 : INFO : EPOCH 0 - PROGRESS: at 20.47% examples, 469454 words/s, in_qsize 39, out_qsize 0 2022-04-18 13:15:14,578 : INFO : EPOCH 0 - PROGRESS: at 61.04% examples, 470927 words/s, in_qsize 39, out_qsize 0 2022-04-18 13:40:53,256 : INFO : EPOCH 0: training on 3016298486 raw words (2421756111 effective words) took 5139.7s, 471184 effective words/s 2022-04-18 13:40:54,274 : INFO : EPOCH 1 - PROGRESS: at 0.00% examples, 401497 words/s, in_qsize 39, out_qsize 0 2022-04-18 14:10:54,283 : INFO : EPOCH 1 - PROGRESS: at 21.90% examples, 488616 words/s, in_qsize 39, out_qsize 0 2022-04-18 14:40:54,290 : INFO : EPOCH 1 - PROGRESS: at 63.73% examples, 485374 words/s, in_qsize 40, out_qsize 0 2022-04-18 15:04:11,566 : INFO : EPOCH 1: training on 3016298486 raw words (2421755370 effective words) took 4998.3s, 484515 effective words/s 2022-04-18 15:04:12,590 : INFO : EPOCH 2 - PROGRESS: at 0.00% examples, 413109 words/s, in_qsize 38, out_qsize 2 2022-04-18 15:34:12,592 : INFO : EPOCH 2 - PROGRESS: at 21.94% examples, 489186 words/s, in_qsize 39, out_qsize 0 2022-04-18 16:04:12,595 : INFO : EPOCH 2 - PROGRESS: at 64.02% examples, 487045 words/s, in_qsize 39, out_qsize 0 2022-04-18 16:27:13,124 : INFO : EPOCH 2: training on 3016298486 raw words (2421749843 effective words) took 4981.6s, 486143 effective words/s 2022-04-18 16:27:14,132 : INFO : EPOCH 3 - PROGRESS: at 0.00% examples, 425720 words/s, in_qsize 37, out_qsize 0 2022-04-18 
16:57:14,170 : INFO : EPOCH 3 - PROGRESS: at 22.16% examples, 492364 words/s, in_qsize 39, out_qsize 0 2022-04-18 17:27:14,181 : INFO : EPOCH 3 - PROGRESS: at 64.36% examples, 489039 words/s, in_qsize 39, out_qsize 0 2022-04-18 17:49:58,875 : INFO : EPOCH 3: training on 3016298486 raw words (2421759041 effective words) took 4965.7s, 487693 effective words/s 2022-04-18 17:49:59,888 : INFO : EPOCH 4 - PROGRESS: at 0.00% examples, 405295 words/s, in_qsize 39, out_qsize 0 2022-04-18 18:19:59,893 : INFO : EPOCH 4 - PROGRESS: at 21.95% examples, 489379 words/s, in_qsize 39, out_qsize 0 2022-04-18 18:49:59,917 : INFO : EPOCH 4 - PROGRESS: at 63.77% examples, 485582 words/s, in_qsize 39, out_qsize 0 2022-04-18 19:13:19,358 : INFO : EPOCH 4: training on 3016298486 raw words (2421753794 effective words) took 5000.5s, 484304 effective words/s 2022-04-18 19:13:20,362 : INFO : EPOCH 5 - PROGRESS: at 0.00% examples, 417569 words/s, in_qsize 38, out_qsize 1 2022-04-18 19:43:20,366 : INFO : EPOCH 5 - PROGRESS: at 22.18% examples, 492529 words/s, in_qsize 40, out_qsize 0 2022-04-18 20:13:20,367 : INFO : EPOCH 5 - PROGRESS: at 64.36% examples, 489058 words/s, in_qsize 39, out_qsize 1 2022-04-18 20:36:01,806 : INFO : EPOCH 5: training on 3016298486 raw words (2421774390 effective words) took 4962.4s, 488021 effective words/s 2022-04-18 20:36:02,845 : INFO : EPOCH 6 - PROGRESS: at 0.00% examples, 376602 words/s, in_qsize 39, out_qsize 0 2022-04-18 21:06:02,845 : INFO : EPOCH 6 - PROGRESS: at 21.77% examples, 486989 words/s, in_qsize 39, out_qsize 0 2022-04-18 21:36:02,858 : INFO : EPOCH 6 - PROGRESS: at 63.44% examples, 483745 words/s, in_qsize 40, out_qsize 0 2022-04-18 21:59:40,920 : INFO : EPOCH 6: training on 3016298486 raw words (2421753569 effective words) took 5019.1s, 482507 effective words/s 2022-04-18 21:59:41,945 : INFO : EPOCH 7 - PROGRESS: at 0.00% examples, 410164 words/s, in_qsize 38, out_qsize 1 2022-04-18 22:29:41,989 : INFO : EPOCH 7 - PROGRESS: at 22.09% examples, 
491334 words/s, in_qsize 39, out_qsize 0 2022-04-18 22:59:42,000 : INFO : EPOCH 7 - PROGRESS: at 64.16% examples, 487826 words/s, in_qsize 39, out_qsize 0 2022-04-18 23:22:40,504 : INFO : EPOCH 7: training on 3016298486 raw words (2421770259 effective words) took 4979.6s, 486340 effective words/s 2022-04-18 23:22:41,509 : INFO : EPOCH 8 - PROGRESS: at 0.00% examples, 294981 words/s, in_qsize 39, out_qsize 0 2022-04-18 23:52:41,532 : INFO : EPOCH 8 - PROGRESS: at 21.64% examples, 485279 words/s, in_qsize 40, out_qsize 0 2022-04-19 00:22:41,533 : INFO : EPOCH 8 - PROGRESS: at 63.05% examples, 481687 words/s, in_qsize 39, out_qsize 0 2022-04-19 00:46:43,879 : INFO : EPOCH 8: training on 3016298486 raw words (2421753439 effective words) took 5043.4s, 480185 effective words/s 2022-04-19 00:46:44,905 : INFO : EPOCH 9 - PROGRESS: at 0.00% examples, 383709 words/s, in_qsize 39, out_qsize 0 2022-04-19 01:16:44,926 : INFO : EPOCH 9 - PROGRESS: at 21.82% examples, 487579 words/s, in_qsize 40, out_qsize 0 2022-04-19 01:46:44,928 : INFO : EPOCH 9 - PROGRESS: at 63.44% examples, 483731 words/s, in_qsize 39, out_qsize 0 2022-04-19 02:10:25,029 : INFO : EPOCH 9: training on 3016298486 raw words (2421762745 effective words) took 5021.1s, 482313 effective words/s 2022-04-19 02:10:25,030 : INFO : Doc2Vec lifecycle event {'msg': 'training on 30162984860 raw words (24217588561 effective words) took 50111.5s, 483274 effective words/s', 'datetime': '2022-04-19T02:10:25.030386', 'gensim': '4.1.3.dev0', 'python': '3.8.10 (default, Nov 26 2021, 20:14:08) \n[GCC 9.3.0]', 'platform': 'Linux-5.4.0-94-generic-x86_64-with-glibc2.29', 'event': 'train'}
# Train DM doc2vec.
model_dm.train(documents, total_examples=model_dm.corpus_count, epochs=model_dm.epochs, report_delay=30*60)
2022-04-19 02:10:25,033 : INFO : Doc2Vec lifecycle event {'msg': 'training model with 20 workers on 996522 vocabulary and 200 features, using sg=0 hs=0 sample=0.001 negative=5 window=8 shrink_windows=True', 'datetime': '2022-04-19T02:10:25.033682', 'gensim': '4.1.3.dev0', 'python': '3.8.10 (default, Nov 26 2021, 20:14:08) \n[GCC 9.3.0]', 'platform': 'Linux-5.4.0-94-generic-x86_64-with-glibc2.29', 'event': 'train'} 2022-04-19 02:10:26,039 : INFO : EPOCH 0 - PROGRESS: at 0.01% examples, 1154750 words/s, in_qsize 0, out_qsize 2 2022-04-19 02:40:26,040 : INFO : EPOCH 0 - PROGRESS: at 83.97% examples, 1182619 words/s, in_qsize 39, out_qsize 0 2022-04-19 02:44:58,625 : INFO : EPOCH 0: training on 3016298486 raw words (2421749575 effective words) took 2073.6s, 1167903 effective words/s 2022-04-19 02:44:59,635 : INFO : EPOCH 1 - PROGRESS: at 0.01% examples, 1565065 words/s, in_qsize 0, out_qsize 0 2022-04-19 03:14:59,636 : INFO : EPOCH 1 - PROGRESS: at 84.22% examples, 1185115 words/s, in_qsize 39, out_qsize 0 2022-04-19 03:19:27,814 : INFO : EPOCH 1: training on 3016298486 raw words (2421738810 effective words) took 2069.2s, 1170383 effective words/s 2022-04-19 03:19:28,819 : INFO : EPOCH 2 - PROGRESS: at 0.01% examples, 1582102 words/s, in_qsize 0, out_qsize 0 2022-04-19 03:49:28,822 : INFO : EPOCH 2 - PROGRESS: at 84.33% examples, 1186338 words/s, in_qsize 39, out_qsize 0 2022-04-19 03:53:55,901 : INFO : EPOCH 2: training on 3016298486 raw words (2421754027 effective words) took 2068.1s, 1171014 effective words/s 2022-04-19 03:53:56,905 : INFO : EPOCH 3 - PROGRESS: at 0.01% examples, 1586215 words/s, in_qsize 0, out_qsize 0 2022-04-19 04:23:56,914 : INFO : EPOCH 3 - PROGRESS: at 84.30% examples, 1186028 words/s, in_qsize 39, out_qsize 0 2022-04-19 04:28:23,932 : INFO : EPOCH 3: training on 3016298486 raw words (2421734506 effective words) took 2068.0s, 1171036 effective words/s 2022-04-19 04:28:24,943 : INFO : EPOCH 4 - PROGRESS: at 0.01% examples, 1594202 words/s, 
in_qsize 0, out_qsize 0 2022-04-19 04:58:24,946 : INFO : EPOCH 4 - PROGRESS: at 84.53% examples, 1188348 words/s, in_qsize 39, out_qsize 0 2022-04-19 05:02:49,190 : INFO : EPOCH 4: training on 3016298486 raw words (2421739011 effective words) took 2065.3s, 1172611 effective words/s 2022-04-19 05:02:50,203 : INFO : EPOCH 5 - PROGRESS: at 0.01% examples, 1590285 words/s, in_qsize 0, out_qsize 0 2022-04-19 05:32:50,205 : INFO : EPOCH 5 - PROGRESS: at 84.51% examples, 1188165 words/s, in_qsize 38, out_qsize 0 2022-04-19 05:37:12,922 : INFO : EPOCH 5: training on 3016298486 raw words (2421759651 effective words) took 2063.7s, 1173488 effective words/s 2022-04-19 05:37:13,928 : INFO : EPOCH 6 - PROGRESS: at 0.01% examples, 1574494 words/s, in_qsize 0, out_qsize 0 2022-04-19 06:07:13,930 : INFO : EPOCH 6 - PROGRESS: at 84.61% examples, 1189231 words/s, in_qsize 40, out_qsize 0 2022-04-19 06:11:35,588 : INFO : EPOCH 6: training on 3016298486 raw words (2421751669 effective words) took 2062.7s, 1174090 effective words/s 2022-04-19 06:11:36,605 : INFO : EPOCH 7 - PROGRESS: at 0.01% examples, 1584768 words/s, in_qsize 0, out_qsize 0 2022-04-19 06:41:36,617 : INFO : EPOCH 7 - PROGRESS: at 84.50% examples, 1188066 words/s, in_qsize 39, out_qsize 0 2022-04-19 06:46:00,286 : INFO : EPOCH 7: training on 3016298486 raw words (2421751802 effective words) took 2064.7s, 1172935 effective words/s 2022-04-19 06:46:01,290 : INFO : EPOCH 8 - PROGRESS: at 0.01% examples, 1610826 words/s, in_qsize 0, out_qsize 0 2022-04-19 07:16:01,295 : INFO : EPOCH 8 - PROGRESS: at 84.71% examples, 1190249 words/s, in_qsize 39, out_qsize 0 2022-04-19 07:20:20,193 : INFO : EPOCH 8: training on 3016298486 raw words (2421731383 effective words) took 2059.9s, 1175653 effective words/s 2022-04-19 07:20:21,198 : INFO : EPOCH 9 - PROGRESS: at 0.01% examples, 1591209 words/s, in_qsize 0, out_qsize 0 2022-04-19 07:50:21,200 : INFO : EPOCH 9 - PROGRESS: at 84.65% examples, 1189549 words/s, in_qsize 39, out_qsize 0 
2022-04-19 07:54:42,812 : INFO : EPOCH 9: training on 3016298486 raw words (2421765551 effective words) took 2062.6s, 1174124 effective words/s 2022-04-19 07:54:42,813 : INFO : Doc2Vec lifecycle event {'msg': 'training on 30162984860 raw words (24217475985 effective words) took 20657.8s, 1172317 effective words/s', 'datetime': '2022-04-19T07:54:42.813436', 'gensim': '4.1.3.dev0', 'python': '3.8.10 (default, Nov 26 2021, 20:14:08) \n[GCC 9.3.0]', 'platform': 'Linux-5.4.0-94-generic-x86_64-with-glibc2.29', 'event': 'train'}
After that, let's test both models! The DBOW model shows results similar to the original paper's.
First, calculate the most similar Wikipedia articles to the "Machine learning" article. The calculated word vectors and document vectors are stored separately, in model.wv and model.dv respectively:
for model in [model_dbow, model_dm]:
    print(model)
    pprint(model.dv.most_similar(positive=["Machine learning"], topn=20))
Doc2Vec<dbow+w,d200,n5,w8,mc5,s0.001,t20> [('Supervised learning', 0.7491602301597595), ('Pattern recognition', 0.7462332844734192), ('Artificial neural network', 0.7142727971076965), ('Data mining', 0.6930587887763977), ('Computer mathematics', 0.686907947063446), ('Deep learning', 0.6868096590042114), ('Multi-task learning', 0.6859176158905029), ('Outline of computer science', 0.6858125925064087), ('Boosting (machine learning)', 0.6807966828346252), ('Linear classifier', 0.6807013154029846), ('Learning classifier system', 0.679194450378418), ('Knowledge retrieval', 0.6765366196632385), ('Perceptron', 0.675654947757721), ('Incremental learning', 0.6712607741355896), ('Support-vector machine', 0.6711161136627197), ('Feature selection', 0.6696343421936035), ('Image segmentation', 0.6688867211341858), ('Neural network', 0.6670624017715454), ('Reinforcement learning', 0.6666402220726013), ('Feature extraction', 0.6657401323318481)] Doc2Vec<dm/m,d200,n5,w8,mc5,s0.001,t20> [('Pattern recognition', 0.7151365280151367), ('Supervised learning', 0.7006939053535461), ('Multi-task learning', 0.6899284720420837), ('Semi-supervised learning', 0.674682080745697), ('Statistical classification', 0.6649825572967529), ('Deep learning', 0.6647047400474548), ('Artificial neural network', 0.66275954246521), ('Feature selection', 0.6612880825996399), ('Statistical learning theory', 0.6528184413909912), ('Naive Bayes classifier', 0.6506016850471497), ('Automatic image annotation', 0.6491228342056274), ('Regularization (mathematics)', 0.6452057957649231), ('Early stopping', 0.6439507007598877), ('Support-vector machine', 0.64285808801651), ('Meta learning (computer science)', 0.6418778300285339), ('Linear classifier', 0.6391816735267639), ('Empirical risk minimization', 0.6339778900146484), ('Anomaly detection', 0.6328380703926086), ('Predictive Model Markup Language', 0.6314322352409363), ('Learning classifier system', 0.6307871341705322)]
Both results are similar to each other, and match the results from the paper's Table 1, although not exactly. That's to be expected: we don't know the exact parameters of the original implementation (see above), and besides, we're training our models 7 years later and the Wikipedia content has changed in the meantime.
Now following the paper's Table 2a), let's calculate the most similar Wikipedia entries to "Lady Gaga" using Paragraph Vector:
for model in [model_dbow, model_dm]:
    print(model)
    pprint(model.dv.most_similar(positive=["Lady Gaga"], topn=10))
Doc2Vec<dbow+w,d200,n5,w8,mc5,s0.001,t20> [('Katy Perry', 0.7450265884399414), ('Miley Cyrus', 0.7275323867797852), ('Ariana Grande', 0.7223592400550842), ('Adele', 0.6982873678207397), ('Taylor Swift', 0.6901045441627502), ('Demi Lovato', 0.6819911003112793), ('Adam Lambert', 0.6552075147628784), ('Nicki Minaj', 0.6513625383377075), ('Selena Gomez', 0.6427122354507446), ('Rihanna', 0.6323978304862976)] Doc2Vec<dm/m,d200,n5,w8,mc5,s0.001,t20> [('Born This Way (album)', 0.6612793803215027), ('Artpop', 0.6428781747817993), ('Beautiful, Dirty, Rich', 0.6408763527870178), ('Lady Gaga videography', 0.6143141388893127), ('Lady Gaga discography', 0.6102882027626038), ('Katy Perry', 0.6046711802482605), ('Beyoncé', 0.6015700697898865), ('List of Lady Gaga live performances', 0.5977909564971924), ('Artpop (song)', 0.5930275917053223), ('Born This Way (song)', 0.5911758542060852)]
The DBOW results are in line with what the paper shows in Table 2a), revealing similar singers in the U.S.
Interestingly, the DM results seem to capture more "facts about Lady Gaga" (her albums, trivia), whereas DBOW recovered "similar artists".
Finally, let's do some of the wilder arithmetic that vector embeddings are famous for. What are the entries most similar to "Lady Gaga" - "American" + "Japanese"? See Table 2b) in the paper.
Note that "American" and "Japanese" are word vectors, but they live in the same space as the document vectors so we can add / subtract them at will, for some interesting results. All word vectors were already lowercased by our tokenizer above, so we look for the lowercased version here:
for model in [model_dbow, model_dm]:
    print(model)
    vec = [model.dv["Lady Gaga"] - model.wv["american"] + model.wv["japanese"]]
    pprint([m for m in model.dv.most_similar(vec, topn=11) if m[0] != "Lady Gaga"])
Doc2Vec<dbow+w,d200,n5,w8,mc5,s0.001,t20> [('Ayumi Hamasaki', 0.6339365839958191), ('Katy Perry', 0.5903329849243164), ('2NE1', 0.5886631608009338), ("Girls' Generation", 0.5769038796424866), ('Flying Easy Loving Crazy', 0.5748921036720276), ('Love Life 2', 0.5738793611526489), ('Ariana Grande', 0.5715743899345398), ('Game (Perfume album)', 0.569789707660675), ('We Are "Lonely Girl"', 0.5696560740470886), ('H (Ayumi Hamasaki EP)', 0.5691372156143188)] Doc2Vec<dm/m,d200,n5,w8,mc5,s0.001,t20> [('Radwimps', 0.548571765422821), ('Chisato Moritaka', 0.5456540584564209), ('Suzuki Ami Around the World: Live House Tour 2005', 0.5375290513038635), ('Anna Suda', 0.5338292121887207), ('Beautiful, Dirty, Rich', 0.5309030413627625), ('Momoiro Clover Z', 0.5304197072982788), ('Pink Lady (duo)', 0.5268998742103577), ('Reol (singer)', 0.5237400531768799), ('Ami Suzuki', 0.5232592225074768), ('Kaela Kimura', 0.5219823122024536)]
As a result, the DBOW model surfaced artists similar to Lady Gaga in Japan, such as Ayumi Hamasaki whose Wiki bio says:
Ayumi Hamasaki is a Japanese singer, songwriter, record producer, actress, model, spokesperson, and entrepreneur.
So that sounds like a success – Ayumi Hamasaki is also the number one hit in the paper we're replicating.
The DM model results are opaque to me, but seem art & Japan related as well. The score deltas between these DM results are marginal, so it's likely they would change if retrained on a different version of Wikipedia. Or even when simply re-run on the same version – the doc2vec training algorithm is stochastic.
These results demonstrate that both training modes employed in the original paper are outstanding for calculating similarity between document vectors, word vectors, or a combination of both. The DM mode has the added advantage of being roughly 2.5x faster to train (compare the epoch timings logged above).
If you wanted to continue working with these trained models, you could save them to disk, to avoid having to re-train the models from scratch every time:
model_dbow.save('doc2vec_dbow.model')
model_dm.save('doc2vec_dm.model')
2022-04-19 07:54:48,399 : INFO : Doc2Vec lifecycle event {'fname_or_handle': 'doc2vec_dbow.model', 'separately': 'None', 'sep_limit': 10485760, 'ignore': frozenset(), 'datetime': '2022-04-19T07:54:48.399560', 'gensim': '4.1.3.dev0', 'python': '3.8.10 (default, Nov 26 2021, 20:14:08) \n[GCC 9.3.0]', 'platform': 'Linux-5.4.0-94-generic-x86_64-with-glibc2.29', 'event': 'saving'} 2022-04-19 07:54:48,400 : INFO : storing np array 'vectors' to doc2vec_dbow.model.dv.vectors.npy 2022-04-19 07:54:49,613 : INFO : storing np array 'vectors' to doc2vec_dbow.model.wv.vectors.npy 2022-04-19 07:54:49,875 : INFO : storing np array 'syn1neg' to doc2vec_dbow.model.syn1neg.npy 2022-04-19 07:54:50,135 : INFO : not storing attribute cum_table 2022-04-19 07:54:53,026 : INFO : saved doc2vec_dbow.model 2022-04-19 07:54:53,027 : INFO : Doc2Vec lifecycle event {'fname_or_handle': 'doc2vec_dm.model', 'separately': 'None', 'sep_limit': 10485760, 'ignore': frozenset(), 'datetime': '2022-04-19T07:54:53.027661', 'gensim': '4.1.3.dev0', 'python': '3.8.10 (default, Nov 26 2021, 20:14:08) \n[GCC 9.3.0]', 'platform': 'Linux-5.4.0-94-generic-x86_64-with-glibc2.29', 'event': 'saving'} 2022-04-19 07:54:53,028 : INFO : storing np array 'vectors' to doc2vec_dm.model.dv.vectors.npy 2022-04-19 07:54:54,556 : INFO : storing np array 'vectors' to doc2vec_dm.model.wv.vectors.npy 2022-04-19 07:54:54,808 : INFO : storing np array 'syn1neg' to doc2vec_dm.model.syn1neg.npy 2022-04-19 07:54:55,058 : INFO : not storing attribute cum_table 2022-04-19 07:54:57,872 : INFO : saved doc2vec_dm.model
To continue your doc2vec explorations, refer to the official API documentation in Gensim: https://radimrehurek.com/gensim/models/doc2vec.html