In 2018, Facebook released MUSE, a set of word embeddings that aligns many languages into a single embedding space, and in this notebook I will explore a bit about how language is represented in that space. If you aren't familiar with word embeddings, they are a way to represent words as numbers (or points in space), making them easier for computers to work with. For more about word embeddings, this blog post by Chris McCormick is a good summary.
Most initial work on word embeddings focused on single languages, but parallel work focused on aligning word embeddings from multiple languages into the same space. This alignment promises benefits like translating words without parallel corpora and transferring models trained in one language to another.
Most approaches to this problem start with a bilingual dictionary of words between languages (e.g. "cerveza <-> beer") and use those word pairs as anchors to rotate one embedding space to match the other. The Facebook paper used a more advanced version of this that incorporates adversarial learning. More importantly for me, they released pre-trained, aligned word embeddings for 30 languages!
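The dictionary-anchored rotation can be sketched with the orthogonal Procrustes solution: given a matrix X of source-language vectors and Y of their dictionary translations, the orthogonal map W minimizing ||XW - Y|| comes from the SVD of X-transpose times Y. Below is a minimal NumPy sketch with toy 2-D vectors (this is not the MUSE training code, and `procrustes_align` is my own name for the helper):

```python
import numpy as np

def procrustes_align(X, Y):
    """Find the orthogonal matrix W minimizing ||X @ W - Y||_F.

    X: (n, d) source-language vectors for dictionary words
    Y: (n, d) target-language vectors for their translations
    """
    # The SVD of the cross-covariance gives the optimal rotation
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt

# Toy example: Y is X rotated by 90 degrees in 2-D
theta = np.pi / 2
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
X = np.random.RandomState(0).randn(10, 2)
Y = X @ R

W = procrustes_align(X, Y)
print(np.allclose(X @ W, Y))  # True: the recovered rotation maps X onto Y
```

With a clean synthetic rotation the dictionary anchors recover it exactly; with real embeddings the fit is only approximate, which is part of why MUSE adds the adversarial step.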
In this notebook, I am going to explore how language is represented in the MUSE embeddings. The key takeaways are:
To see the results, skip down to "Exploration of MUSE embedding"
First, mount files from Google drive:
from google.colab import drive
drive.mount('/content/gdrive')
Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).
folder_loc = r'/content/gdrive/My Drive/Colab Notebooks/data/token_embeddings/'
!pip install fuzzywuzzy
Requirement already satisfied: fuzzywuzzy in /usr/local/lib/python3.6/dist-packages (0.17.0)
from fuzzywuzzy import fuzz
import pandas as pd
import numpy as np
from gensim.models import KeyedVectors
import warnings
warnings.filterwarnings('ignore')
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
fs = 22
font = {'family' : 'sans-serif',  # 'normal' is not a valid font family name
        'weight' : 'normal',
        'size'   : fs}
matplotlib.rc('font', **font)
plt.rc('xtick', labelsize=fs-6)
plt.rc('ytick', labelsize=fs-6)
def cos_sim(vec1: np.ndarray, vec2: np.ndarray) -> float:
    '''Cosine similarity between two vectors (used below)'''
    return np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))

def load_bilingual_df(bilingual_dict_loc: str, lang1: str, lang2: str, muse_model, word_count_df: pd.DataFrame):
    ''' Function to load a bilingual dictionary, and calculate metrics between words
    bilingual_dict_loc: location of bilingual dictionary
        - should have two columns of words, one for each language
    lang1/lang2: two-letter names of languages, e.g. "es"
    muse_model: gensim KeyedVectors model for the MUSE embedding
    word_count_df: pandas DF with two columns, "word" and "count"
    '''
    print('Loading dictionary')
    bi_dict = (pd.read_csv(bilingual_dict_loc, delim_whitespace=True,
                           header=None, names=[lang1, lang2])
               .query(f'{lang1} != {lang2}'))
    # count how often each word appears in the dictionary
    count1 = bi_dict[lang1].value_counts().reset_index().rename(columns={lang1: f'{lang1}_count', 'index': lang1})
    count2 = bi_dict[lang2].value_counts().reset_index().rename(columns={lang2: f'{lang2}_count', 'index': lang2})
    bi_dict = bi_dict.merge(count1, on=lang1).merge(count2, on=lang2)
    bi_dict[f'{lang1}_in_muse'] = bi_dict[lang1].apply(lambda word: word in muse_model)
    bi_dict[f'{lang2}_in_muse'] = bi_dict[lang2].apply(lambda word: word in muse_model)
    print('Calculating similarity for words in model')
    # keep only pairs where both words are in the MUSE model
    simi_df = bi_dict.query(f'{lang1}_in_muse and {lang2}_in_muse').copy()
    # calculate cosine similarity between the two words of each pair
    simi_df['simi'] = simi_df.apply(lambda row: muse_model.similarity(row[lang1], row[lang2]), axis=1)
    simi_df = simi_df.merge(word_count_df, left_on=lang1, right_on='word', how='left')
    print('Calculating common language vector')
    # get the difference vector for each pair, and average those differences
    common_word_diffs = np.array([muse_model[row[lang2]] - muse_model[row[lang1]]
                                  for _, row in simi_df.iterrows()])
    common_direction = common_word_diffs.mean(axis=0)
    # similarity of each lang2 word to the common direction
    simi_df['common_direction_simi'] = simi_df[lang2].apply(lambda word: cos_sim(muse_model[word], common_direction))
    # character-level (edit distance) similarity between the pair
    simi_df['edit_similarity'] = simi_df.apply(lambda row: fuzz.ratio(row[lang1], row[lang2]), axis=1)
    return simi_df
%%time
muse_model = KeyedVectors.load_word2vec_format(r'/content/gdrive/My Drive/Colab Notebooks/data/token_embeddings/wiki.multi.en-es-fr-vi.vec')
CPU times: user 2min 25s, sys: 1.93 s, total: 2min 27s Wall time: 2min 27s
First thing we can do is calculate the similarity for a few word pairs across Spanish and English, ranging from common to rare:
for word1, word2 in (['the', 'el'],
['cerveza', 'beer'],
['enzyme', 'enzima']):
print(muse_model.similarity(word1, word2))
0.23508638
0.5158917
0.83669
For these three pairs, at least, the common words ("the", "beer") have lower cross-language similarity than the rare one ("enzyme").
To explore this more rigorously, we need word frequencies for a larger set of words. The two most convenient sources of word frequency I found were from Peter Norvig (English), and from the Colombian newspaper counts at Cavorite (Spanish). Word frequency generally follows a power law (Zipf's law), which can be seen in the example word counts below.
norvig_df = pd.read_csv(folder_loc + 'norvig_word_counts.tsv', sep='\t',names = ['word', 'count'])
cavorite_df = pd.read_csv(folder_loc + 'cavorite_word_counts_es.tsv', sep='\t', names = ['word', 'count'])
norvig_df.iloc[0:9000:2000,:]
|      | word        | count       |
|------|-------------|-------------|
| 0    | the         | 23135851162 |
| 2000 | professor   | 40572636    |
| 4000 | adoption    | 17860735    |
| 6000 | pixels      | 10381558    |
| 8000 | refurbished | 7018259     |
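As a quick sanity check on the power-law claim, we can fit a line to log(count) versus log(rank) using the sample rows above: under Zipf's law, count is roughly proportional to 1/rank, so the fitted slope should be somewhere near -1.

```python
import numpy as np

# Ranks and counts taken from the Norvig sample above (rank = row index + 1)
ranks = np.array([1, 2001, 4001, 6001, 8001])
counts = np.array([23135851162, 40572636, 17860735, 10381558, 7018259])

# Under a power law count ~ C / rank**a, log(count) is linear in log(rank)
slope, intercept = np.polyfit(np.log(ranks), np.log(counts), 1)
print(round(slope, 2))  # about -0.88, in the right range for Zipf's law
```

Five points is obviously a crude fit, but it supports reading the counts as roughly power-law distributed.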
We can combine these word counts with Facebook's bilingual dictionaries, and calculate the similarity for each pair of words using the helper function defined above.
en_es_df = load_bilingual_df(folder_loc + 'bilingual_dictionaries/en-es.txt', 'en', 'es',muse_model, norvig_df)
Loading dictionary
Calculating similarity for words in model
Calculating common language vector
Below is an example of what the data looks like. The columns are:
- `en`/`es`: word pair in English and Spanish
- `en_count`/`es_count`: number of times the word appears in the bilingual dictionary
- `simi`: cosine similarity between this pair of words in the MUSE embedding space
- `count`: word count for the language in the first column (in this case, English)
- `common_direction_simi`: cosine similarity between the word in the second column and the common difference vector
- `edit_similarity`: fuzzywuzzy ratio edit similarity between the pair of words

en_es_df.drop(columns = ['en_in_muse', 'es_in_muse', 'word']).sample(3)
|       | en       | es          | en_count | es_count | simi     | count     | common_direction_simi | edit_similarity |
|-------|----------|-------------|----------|----------|----------|-----------|-----------------------|-----------------|
| 30818 | lefty    | zurdo       | 1        | 1        | 0.468547 | 978333.0  | 0.030474              | 0               |
| 24740 | ruthless | despiadados | 3        | 2        | 0.672333 | 1014804.0 | 0.018294              | 32              |
| 23082 | ignorant | ignorante   | 2        | 1        | 0.747805 | 2724458.0 | 0.037379              | 94              |
Now that we have word frequency data, we can compare word frequency to the similarity between words. In this graph, the x-axis is the word frequency of the English word (log-scale); the y-axis is the cosine similarity between the English word and the Spanish word.
sns.scatterplot(x="count", y="simi", data=en_es_df.sample(5000), alpha = 0.2)
plt.xscale('log')
plt.xlabel('Word frequency (English word)')
plt.ylabel('English-Spanish\nword similarity');
Looking at the y-axis, you can see a majority of words have cosine similarity above 0.5. However, there is a large minority of words that have similarity < 0.3, and these words tend to be the higher frequency words. We can do a simple aggregation to show this:
print('Median word frequency for high similarity words: {}'.format(en_es_df.query('simi > 0.5')['count'].median()))
print('Median word frequency for low similarity words: {}'.format(en_es_df.query('simi < 0.3')['count'].median()))
Median word frequency for high similarity words: 1457470.5
Median word frequency for low similarity words: 8191578.5
I have only read popular linguistics books, but my understanding is that common words' meanings are more malleable and change over time. For example, the word "grow" can refer to plants, animals, cities, or careers; in contrast, "enzyme" refers specifically to a protein with certain properties. Given this, perhaps the lower similarity of common words is due to their polysemous nature.
Looking at the above graph, one idea I had is that the common words might all be pointing in the same direction. To test this idea, I computed the difference vector between the two embeddings of each word pair, averaged these differences into a common difference vector, and measured each word's cosine similarity to that vector (the `common_direction_simi` column from the helper function).
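The idea can be illustrated with synthetic vectors: if one language's embeddings really are the other's plus a shared offset, the per-pair differences should all point along their average. A toy sketch (made-up 10-D vectors, not the real MUSE embeddings; the notebook compares each word vector to the direction, while here I check the difference vectors to keep the toy transparent):

```python
import numpy as np

def cos_sim(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Toy "embeddings": Spanish vectors = English vectors + a shared offset + noise
rng = np.random.RandomState(0)
en_vecs = rng.randn(100, 10)
offset = np.full(10, 0.5)                          # the hidden language direction
es_vecs = en_vecs + offset + 0.1 * rng.randn(100, 10)

# Average the per-pair differences to estimate the common direction
common_direction = (es_vecs - en_vecs).mean(axis=0)

# Each difference vector should point roughly along the common direction
sims = [cos_sim(d, common_direction) for d in es_vecs - en_vecs]
print(np.mean(sims) > 0.9)  # True
```

With real embeddings the offset is much noisier and word-dependent, which is what the scatterplots below probe.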
In the plot below, each point is a word pair; the x-axis is again English word frequency; and the y-axis is the similarity between the Spanish word embedding and the common difference vector.
sns.scatterplot(x="count", y="common_direction_simi", data=en_es_df.sample(5000), alpha = 0.2)
plt.xscale('log')
plt.xlabel('Word frequency (English)')
plt.ylabel('Common vector\nsimilarity');
From this graph, we can see there are two types of words in the MUSE embedding: words that are somewhat similar to the common difference vector (0.4-0.6); and words that generally are not (< 0.3). High frequency words also had higher common vector similarity. My hypothesis that common words are all aligned in the same direction has at least some support.
MUSE embeddings are based on Facebook's fastText embeddings, which incorporate character-level (n-gram) information. To see if character-level similarity was influencing the word embeddings after alignment, I plotted word-pair similarity against common-vector similarity, color-coding each word pair by its character similarity (where 100 = identical). While there might be some tendency for high character-similarity words to have less in common with the language direction, it does not look like a strong bias.
cmap = sns.cubehelix_palette(dark=.3, light=0.9, as_cmap=True)
sns.scatterplot(x="simi", y="common_direction_simi", hue="edit_similarity",
                data=en_es_df.sample(2000), alpha=0.8, palette=cmap)
# relabel the legend with round edit-similarity values
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0., labels=np.arange(0, 101, 25));
plt.xlabel('English-Spanish\nword similarity')
plt.ylabel('Common vector\nsimilarity');
All the analysis above was for Spanish and English. If my hypotheses are correct, we should see similar behaviour for other languages. So I downloaded MUSE embeddings for French and Vietnamese to compare with English; and I also looked at the Spanish-French pair.
en_fr_df = load_bilingual_df(folder_loc + 'bilingual_dictionaries/en-fr.txt', 'en', 'fr', muse_model, norvig_df)
en_vi_df = load_bilingual_df(folder_loc + 'bilingual_dictionaries/en-vi.txt', 'en', 'vi', muse_model, norvig_df)
es_fr_df = load_bilingual_df(folder_loc + 'bilingual_dictionaries/es-fr.txt', 'es', 'fr', muse_model, cavorite_df)
Loading dictionary
Calculating similarity for words in model
Calculating common language vector
Loading dictionary
Calculating similarity for words in model
Calculating common language vector
Loading dictionary
Calculating similarity for words in model
Calculating common language vector
fig, axes = plt.subplots(1,3, figsize= [12, 4])
sns.scatterplot(x="count", y="simi", data=en_fr_df.sample(5000), alpha = 0.2, ax=axes[0])
axes[0].set_xscale('log')
axes[0].set_ylabel('Bilingual word\npair similarity')
axes[0].set_xlabel('Word frequency\n(English)')
axes[0].set_title('En-Fr')
sns.scatterplot(x="count", y="simi", data=en_vi_df.sample(5000), alpha = 0.2, ax=axes[1])
axes[1].set_xscale('log')
axes[1].set_xlabel('Word frequency\n(English)')
axes[1].set_title('En-Vi')
axes[1].set_ylabel('')
sns.scatterplot(x="count", y="simi", data=es_fr_df.sample(5000), alpha = 0.2, ax=axes[2])
axes[2].set_xscale('log')
axes[2].set_title('Es-Fr')
axes[2].set_ylabel('')
axes[2].set_xlabel('Word frequency\n(Spanish)');
For each language pair, we see similar results to English-Spanish: more common words tend to have lower similarity across languages. Spanish-French pairs do seem to have higher cross-language similarity overall, perhaps because both are Romance languages.
We can also look at the common language direction plots across languages:
fig, axes = plt.subplots(1,3, figsize= [12, 4])
sns.scatterplot(x="count", y="common_direction_simi", data=en_fr_df.sample(5000), alpha = 0.2, ax=axes[0])
axes[0].set_xscale('log')
axes[0].set_ylabel('Common language\nsimilarity')
axes[0].set_xlabel('Word frequency\n(English)')
axes[0].set_title('En-Fr')
sns.scatterplot(x="count", y="common_direction_simi", data=en_vi_df.sample(5000), alpha = 0.2, ax=axes[1])
axes[1].set_xscale('log')
axes[1].set_xlabel('Word frequency\n(English)')
axes[1].set_title('En-Vi')
axes[1].set_ylabel('')
sns.scatterplot(x="count", y="common_direction_simi", data=es_fr_df.sample(5000), alpha = 0.2, ax=axes[2])
axes[2].set_xscale('log')
axes[2].set_title('Es-Fr')
axes[2].set_ylabel('')
axes[2].set_xlabel('Word frequency\n(Spanish)');
For these languages, we see the same pattern as before: there is a set of words which are similar to the common language vector, and these words tend to be more common than the others. For Vietnamese, there might also be a third cluster at low English word frequency.
In this notebook I looked at how words are represented across languages in Facebook's MUSE embeddings. I found that more common words have lower similarity across languages than less common words. This could cause problems if you assume that translations of a word will always have similar embeddings: for common words, antonyms within a language may end up more similar than synonyms across languages.
The second thing I found was that, even after alignment (or perhaps due to alignment), language is present in embeddings as a direction. This is particularly true for more common words. This might be a useful feature in multilingual environments, since it would allow models that consume these embeddings to implicitly know which language a word or sentence is coming from.
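If language really is encoded as a direction, one toy consequence is that you could guess a word's language just by projecting its vector onto that direction. A sketch with synthetic vectors (assumed shapes and offsets, not the real MUSE vectors):

```python
import numpy as np

rng = np.random.RandomState(1)
lang_direction = np.full(50, 0.4)   # the assumed per-language offset

# Synthetic vocabularies: language B vectors sit shifted along the direction
lang_a = rng.randn(200, 50)
lang_b = rng.randn(200, 50) + lang_direction

# Guess the language by thresholding the projection onto the direction
threshold = lang_direction @ lang_direction / 2   # midpoint between the two means
pred_a = lang_a @ lang_direction > threshold      # should mostly be False
pred_b = lang_b @ lang_direction > threshold      # should mostly be True

accuracy = (np.sum(~pred_a) + np.sum(pred_b)) / 400
print(accuracy)
```

Even this crude linear probe separates the two synthetic "languages" well, which is the sense in which downstream models could implicitly read off the language of their input.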