Facebook Research open-sourced a great project yesterday - fastText, a fast (no surprise) and effective method to learn word representations and perform text classification. I was curious how these embeddings compare to other commonly used embeddings, and word2vec seemed like the obvious baseline, especially since fastText embeddings build upon the word2vec model.
import nltk
nltk.download('brown')  # only the brown corpus is needed
# Alternately, you can simply download the pretrained models below if you wish to avoid downloading and training.

# Generate a brown corpus text file
with open('brown_corp.txt', 'w+') as f:
    for word in nltk.corpus.brown.words():
        f.write('{word} '.format(word=word))
# download the text8 corpus (a 100 MB sample of cleaned wikipedia text)
# alternately, you can simply download the pretrained models below if you wish to avoid downloading and training
!wget http://mattmahoney.net/dc/text8.zip
!unzip text8.zip
# download the file questions-words.txt to be used for comparing word embeddings
!wget https://raw.githubusercontent.com/arfon/word2vec/master/questions-words.txt
If you wish to avoid training, you can download pre-trained models instead in the next section. For training the fastText models yourself, you'll have to follow the setup instructions for fastText and run the training with -
!./fasttext skipgram -input brown_corp.txt -output brown_ft
!./fasttext skipgram -input text8 -output text8_ft
For training the gensim models -
from nltk.corpus import brown
from gensim.models import Word2Vec
from gensim.models.word2vec import Text8Corpus
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s')
logging.root.setLevel(level=logging.INFO)
import os

MODELS_DIR = 'models/'
os.makedirs(MODELS_DIR, exist_ok=True)
brown_gs = Word2Vec(brown.sents())
brown_gs.save_word2vec_format(MODELS_DIR + 'brown_gs.vec')
text8_gs = Word2Vec(Text8Corpus('text8'))
text8_gs.save_word2vec_format(MODELS_DIR + 'text8_gs.vec')
In case you wish to avoid downloading the corpus and training the models, you can download pretrained models with -
# download the fastText and gensim models trained on the brown corpus and text8 corpus
!wget https://www.dropbox.com/s/4kray3epy439gca/models.tar.gz?dl=1 -O models.tar.gz
!tar -xzf models.tar.gz
Once you have downloaded or trained the models (make sure they're in the models/ directory, or that you've appropriately changed MODELS_DIR) and downloaded questions-words.txt, you're ready to run the comparison.
from gensim.models import Word2Vec
def print_accuracy(model, questions_file):
    print('Evaluating...\n')
    acc = model.accuracy(questions_file)

    # Per-section results
    for section in acc:
        correct = len(section['correct'])
        total = len(section['correct']) + len(section['incorrect'])
        total = total if total else 1
        accuracy = 100 * float(correct) / total
        print('{:d}/{:d}, {:.2f}%, Section: {:s}'.format(correct, total, accuracy, section['section']))

    # The first 5 sections of questions-words.txt are semantic analogies;
    # the remaining sections (except the final aggregate 'total') are syntactic.
    sem_correct = sum(len(acc[i]['correct']) for i in range(5))
    sem_total = sum(len(acc[i]['correct']) + len(acc[i]['incorrect']) for i in range(5))
    print('\nSemantic: {:d}/{:d}, Accuracy: {:.2f}%'.format(sem_correct, sem_total, 100 * float(sem_correct) / sem_total))

    syn_correct = sum(len(acc[i]['correct']) for i in range(5, len(acc) - 1))
    syn_total = sum(len(acc[i]['correct']) + len(acc[i]['incorrect']) for i in range(5, len(acc) - 1))
    print('Syntactic: {:d}/{:d}, Accuracy: {:.2f}%\n'.format(syn_correct, syn_total, 100 * float(syn_correct) / syn_total))
MODELS_DIR = 'models/'
word_analogies_file = 'questions-words.txt'
print('\nLoading FastText embeddings')
ft_model = Word2Vec.load_word2vec_format(MODELS_DIR + 'brown_ft.vec')
print('Accuracy for FastText:')
print_accuracy(ft_model, word_analogies_file)
print('\nLoading Gensim embeddings')
gs_model = Word2Vec.load_word2vec_format(MODELS_DIR + 'brown_gs.vec')
print('Accuracy for word2vec:')
print_accuracy(gs_model, word_analogies_file)
Loading FastText embeddings
Accuracy for FastText:
Evaluating...

0/1, 0.00%, Section: capital-common-countries
0/1, 0.00%, Section: capital-world
0/1, 0.00%, Section: currency
0/1, 0.00%, Section: city-in-state
27/182, 14.84%, Section: family
539/702, 76.78%, Section: gram1-adjective-to-adverb
106/132, 80.30%, Section: gram2-opposite
656/1056, 62.12%, Section: gram3-comparative
136/210, 64.76%, Section: gram4-superlative
439/650, 67.54%, Section: gram5-present-participle
0/1, 0.00%, Section: gram6-nationality-adjective
165/1260, 13.10%, Section: gram7-past-tense
327/552, 59.24%, Section: gram8-plural
245/342, 71.64%, Section: gram9-plural-verbs
2640/5086, 51.91%, Section: total

Semantic: 27/182, Accuracy: 14.84%
Syntactic: 2613/4904, Accuracy: 53.28%

Loading Gensim embeddings
Accuracy for word2vec:
Evaluating...

0/1, 0.00%, Section: capital-common-countries
0/1, 0.00%, Section: capital-world
0/1, 0.00%, Section: currency
0/1, 0.00%, Section: city-in-state
53/182, 29.12%, Section: family
8/702, 1.14%, Section: gram1-adjective-to-adverb
0/132, 0.00%, Section: gram2-opposite
75/1056, 7.10%, Section: gram3-comparative
0/210, 0.00%, Section: gram4-superlative
16/650, 2.46%, Section: gram5-present-participle
0/1, 0.00%, Section: gram6-nationality-adjective
30/1260, 2.38%, Section: gram7-past-tense
4/552, 0.72%, Section: gram8-plural
8/342, 2.34%, Section: gram9-plural-verbs
194/5086, 3.81%, Section: total

Semantic: 53/182, Accuracy: 29.12%
Syntactic: 141/4904, Accuracy: 2.88%
Word2vec embeddings seem to be slightly better than fastText embeddings at the semantic tasks, while the fastText embeddings do significantly better on the syntactic analogies. This makes sense, since fastText embeddings are explicitly trained to capture morphological nuances, and most of the syntactic analogies are morphology-based.
Let me explain that better.
According to the paper [1], the embedding for a word is represented by the sum of its character n-gram embeddings. This is meant to be useful for morphologically rich languages - so, theoretically, the embedding for "apparently" would include information from both the character n-grams "apparent" and "ly" (as well as other n-grams), and the n-grams would combine in a simple, linear manner. This is very similar to what most of our syntactic tasks look like.
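The n-gram decomposition itself is easy to sketch. Below is a simplified illustration of fastText-style character n-gram extraction - the n = 3 to 6 range and the < and > word-boundary markers follow the paper's defaults, though the real implementation additionally hashes n-grams into a fixed number of buckets:

```python
def char_ngrams(word, minn=3, maxn=6):
    # Wrap the word in begin/end-of-word markers, as fastText does.
    w = '<' + word + '>'
    # Collect every contiguous character n-gram of length minn..maxn.
    return [w[i:i + n]
            for n in range(minn, maxn + 1)
            for i in range(len(w) - n + 1)]

grams = char_ngrams('apparently')
# The list includes both stem-like and suffix-like pieces,
# e.g. '<app', 'appar', 'ntly', 'tly>'
```

The n-grams shared between "apparently" and "apparent" (such as '<app' and 'appar') are what tie the two words' representations together.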
Example analogy:
amazing amazingly calm calmly
This analogy is marked correct if:
embedding(amazing) - embedding(amazingly) = embedding(calm) - embedding(calmly)

Both these subtractions would result in a very similar set of remaining n-grams. No surprise the fastText embeddings do extremely well on this.
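To make the scoring concrete, here is a toy sketch of the 3CosAdd rule that gensim's accuracy() uses to mark an analogy correct: predict the vector b - a + c (on unit-normalised vectors) and check that its nearest vocabulary word, excluding the three query words, is the expected answer. The vectors and vocabulary below are made up purely for illustration:

```python
import numpy as np

def unit(v):
    return v / np.linalg.norm(v)

# Toy embedding table -- hypothetical 2-d vectors, for illustration only.
vocab = {
    'amazing':   np.array([1.0, 0.0]),
    'amazingly': np.array([1.0, 1.0]),
    'calm':      np.array([0.0, 1.0]),
    'calmly':    np.array([0.0, 2.0]),
    'loud':      np.array([0.5, 0.2]),
}

def solve_analogy(a, b, c):
    """3CosAdd: return the word whose vector has the highest cosine
    similarity to unit(b) - unit(a) + unit(c), excluding a, b and c."""
    target = unit(vocab[b]) - unit(vocab[a]) + unit(vocab[c])
    scores = {w: np.dot(unit(target), unit(v))
              for w, v in vocab.items() if w not in (a, b, c)}
    return max(scores, key=scores.get)

print(solve_analogy('amazing', 'amazingly', 'calm'))  # -> calmly
```

An analogy question counts as correct only if the top-ranked candidate exactly matches the expected word, which is why small models with noisy vectors score so poorly on whole sections.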
A brief note on hyperparameters - the gensim word2vec implementation and the fastText word embedding implementation use largely the same defaults (dim_size = 100, window_size = 5, num_epochs = 5). Of course, they are two completely different models (albeit with a few similarities).
Let's try with a larger corpus now - text8 (a collection of Wikipedia articles). I'm especially curious about the impact on semantic accuracy - for models trained on the brown corpus, the difference in semantic accuracy and the accuracy values themselves are too small to be conclusive. Hopefully a larger corpus helps; the text8 corpus likely has a lot more information about capitals, currencies, cities, etc., which should be relevant to the semantic tasks.
print('Loading FastText embeddings')
ft_model = Word2Vec.load_word2vec_format(MODELS_DIR + 'text8_ft.vec')
print('Accuracy for FastText:')
print_accuracy(ft_model, word_analogies_file)
print('Loading Gensim embeddings')
gs_model = Word2Vec.load_word2vec_format(MODELS_DIR + 'text8_gs.vec')
print('Accuracy for word2vec:')
print_accuracy(gs_model, word_analogies_file)
Loading FastText embeddings
Accuracy for FastText:
Evaluating...

298/506, 58.89%, Section: capital-common-countries
625/1452, 43.04%, Section: capital-world
37/268, 13.81%, Section: currency
291/1511, 19.26%, Section: city-in-state
151/306, 49.35%, Section: family
567/756, 75.00%, Section: gram1-adjective-to-adverb
188/306, 61.44%, Section: gram2-opposite
809/1260, 64.21%, Section: gram3-comparative
303/506, 59.88%, Section: gram4-superlative
528/992, 53.23%, Section: gram5-present-participle
1291/1371, 94.16%, Section: gram6-nationality-adjective
451/1332, 33.86%, Section: gram7-past-tense
853/992, 85.99%, Section: gram8-plural
360/650, 55.38%, Section: gram9-plural-verbs
6752/12208, 55.31%, Section: total

Semantic: 1402/4043, Accuracy: 34.68%
Syntactic: 5350/8165, Accuracy: 65.52%

Loading Gensim embeddings
Accuracy for word2vec:
Evaluating...

138/506, 27.27%, Section: capital-common-countries
248/1452, 17.08%, Section: capital-world
28/268, 10.45%, Section: currency
158/1571, 10.06%, Section: city-in-state
227/306, 74.18%, Section: family
85/756, 11.24%, Section: gram1-adjective-to-adverb
54/306, 17.65%, Section: gram2-opposite
739/1260, 58.65%, Section: gram3-comparative
178/506, 35.18%, Section: gram4-superlative
297/992, 29.94%, Section: gram5-present-participle
718/1371, 52.37%, Section: gram6-nationality-adjective
325/1332, 24.40%, Section: gram7-past-tense
389/992, 39.21%, Section: gram8-plural
200/650, 30.77%, Section: gram9-plural-verbs
3784/12268, 30.84%, Section: total

Semantic: 799/4103, Accuracy: 19.47%
Syntactic: 2985/8165, Accuracy: 36.56%
With the text8 corpus, the semantic accuracy of the fastText model increases significantly, and it surpasses word2vec on both the semantic and the syntactic analogies. However, the gain in syntactic accuracy from the larger corpus is much higher for word2vec.
These preliminary results seem to indicate fastText embeddings might be better than word2vec at encoding semantic and especially syntactic information. It'd be interesting to see how transferable these embeddings are by comparing their performance in a downstream supervised task.