![word2vec quiz](https://s3.amazonaws.com/skipgram-images/word2vec-1.png)

You just demonstrated the core machine learning concept behind word vector embedding models!

![word2vec quiz 2](https://s3.amazonaws.com/skipgram-images/word2vec-2.png)

The goal of *word vector embedding models*, or *word vector models* for short, is to learn dense, numerical vector representations for each term in a corpus vocabulary. If the model is successful, the vector it learns for each term should encode some information about the *meaning* or *concept* the term represents, and about the relationships between it and the other terms in the vocabulary. Word vector models are also fully unsupervised — they learn all of these meanings and relationships solely by analyzing the text of the corpus, without any advance knowledge provided.

Perhaps the best-known word vector model is [word2vec](https://arxiv.org/pdf/1301.3781v3.pdf), originally proposed in 2013. The general idea of word2vec is, for a given *focus word*, to use the *context* of the word — i.e., the other words immediately before and after it — to provide hints about what the focus word might mean. To do this, word2vec uses a *sliding window* technique, considering snippets of text only a few tokens long at a time.

At the start of the learning process, the model initializes random vectors for all terms in the corpus vocabulary. The model then slides the window across every snippet of text in the corpus, with each word taking a turn as the focus word. Each time the model considers a new snippet, it tries to learn some information about the focus word based on the surrounding context, and it "nudges" the words' vector representations accordingly. One complete pass sliding the window across all of the corpus text is known as a training *epoch*, and it's common to train a word2vec model for multiple epochs over the corpus. Over time, the model rearranges the terms' vector representations such that terms that frequently appear in similar contexts have vector representations that are *close* to each other in vector space.
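To make the sliding-window idea concrete, here is a minimal sketch in plain Python. The `context_pairs` helper is purely illustrative (it is not part of gensim); it just enumerates the (focus word, context word) pairs that a model like word2vec would train on:

```python
def context_pairs(tokens, window=2):
    """Yield (focus_word, context_word) pairs from a list of tokens,
    looking at up to `window` tokens on each side of the focus word."""
    for i, focus in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                yield focus, tokens[j]

pairs = list(context_pairs([u'the', u'pasta', u'was', u'amazing'], window=2))
```

As the window slides, every word takes its turn as the focus word, and one pass of this loop over the whole corpus is exactly one training epoch.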
For a deeper dive into word2vec's machine learning process, see [here](https://arxiv.org/pdf/1411.2738v4.pdf).

Word2vec has a number of user-defined hyperparameters, including:

- The dimensionality of the vectors. Typical choices range from a few dozen to several hundred.
- The width of the sliding window, in tokens. Five is a common default choice, but narrower and wider windows are possible.
- The number of training epochs.

For using word2vec in Python, [gensim](https://rare-technologies.com/deep-learning-with-word2vec-and-gensim/) comes to the rescue again! It offers a [highly-optimized](https://rare-technologies.com/word2vec-in-python-part-two-optimizing/), [parallelized](https://rare-technologies.com/parallelizing-word2vec-in-python/) implementation of the word2vec algorithm with its [Word2Vec](https://radimrehurek.com/gensim/models/word2vec.html) class.

```python
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

trigram_sentences = LineSentence(trigram_sentences_filepath)
word2vec_filepath = os.path.join(intermediate_directory, 'word2vec_model_all')
```

We'll train our word2vec model using the normalized sentences with our phrase models applied. We'll use 100-dimensional vectors, and set up our training process to run for twelve epochs.
```python
%%time

# this is a bit time consuming - make the if statement True
# if you want to train the word2vec model yourself.
if 0 == 1:

    # initiate the model and perform the first epoch of training
    food2vec = Word2Vec(trigram_sentences, size=100, window=5,
                        min_count=20, sg=1, workers=4)

    food2vec.save(word2vec_filepath)

    # perform another 11 epochs of training
    for i in range(1, 12):

        food2vec.train(trigram_sentences)
        food2vec.save(word2vec_filepath)

# load the finished model from disk
food2vec = Word2Vec.load(word2vec_filepath)
food2vec.init_sims()

print u'{} training epochs so far.'.format(food2vec.train_count)
```

On my four-core machine, each epoch over all the text in the ~1 million Yelp reviews takes about 5-10 minutes.

```python
print u'{:,} terms in the food2vec vocabulary.'.format(len(food2vec.vocab))
```

Let's take a peek at the word vectors our model has learned. We'll create a pandas DataFrame with the terms as the row labels, and the 100 dimensions of the word vector model as the columns.

```python
# build a list of the terms, integer indices,
# and term counts from the food2vec model vocabulary
ordered_vocab = [(term, voc.index, voc.count)
                 for term, voc in food2vec.vocab.iteritems()]

# sort by the term counts, so the most common terms appear first
ordered_vocab = sorted(ordered_vocab, key=lambda (term, index, count): -count)

# unzip the terms, integer indices, and counts into separate lists
ordered_terms, term_indices, term_counts = zip(*ordered_vocab)

# create a DataFrame with the food2vec vectors as data,
# and the terms as row labels
word_vectors = pd.DataFrame(food2vec.syn0norm[term_indices, :],
                            index=ordered_terms)

word_vectors
```

Holy wall of numbers! This DataFrame has 50,835 rows — one for each term in the vocabulary — and 100 columns. Our model has learned a quantitative vector representation for each term, as expected.
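Under the hood, "similarity" between two of these rows means *cosine similarity*: the cosine of the angle between the two vectors. A minimal sketch with made-up 3-dimensional vectors (the names and numbers below are invented for illustration; in practice you'd compare rows of `word_vectors`):

```python
import numpy as np

def cosine_similarity(a, b):
    """cosine of the angle between vectors a and b"""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# toy 3-dimensional "word vectors" -- invented for illustration only
burger_king = np.array([0.9, 0.1, 0.2])
mcdonalds   = np.array([0.8, 0.2, 0.3])
opera       = np.array([0.1, 0.9, 0.1])

# related terms point in similar directions, unrelated terms don't
assert cosine_similarity(burger_king, mcdonalds) > cosine_similarity(burger_king, opera)
```

This is the measure gensim uses when we ask the model for a term's nearest neighbors below.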
Put another way, our model has "embedded" the terms into a 100-dimensional vector space.

### So... what can we do with all these numbers?

The first thing we can use them for is to simply look up related words and phrases for a given term of interest.

```python
def get_related_terms(token, topn=10):
    """
    look up the topn most similar terms to token
    and print them as a formatted list
    """

    for word, similarity in food2vec.most_similar(positive=[token], topn=topn):
        print u'{:20} {}'.format(word, round(similarity, 3))
```

### What things are like Burger King?

```python
get_related_terms(u'burger_king')
```

The model has learned that fast food restaurants are similar to each other! In particular, *mcdonalds* and *wendy's* are the most similar to Burger King, according to this dataset. In addition, the model has found that alternate spellings for the same entities are probably related, such as *mcdonalds*, *mcdonald's* and *mcd's*.

### When is happy hour?

```python
get_related_terms(u'happy_hour')
```

The model has noticed several alternate spellings for happy hour, such as *hh* and *happy hr*, and assesses them as highly related. If you were looking for reviews about happy hour, such alternate spellings would be very helpful to know.

Taking a deeper look — the model has turned up phrases like *3-6pm*, *4-7pm*, and *mon-fri*, too. This is especially interesting, because the model has no advance knowledge at all about what happy hour is, or what time of day it should be. But simply by scanning through restaurant reviews, the model has discovered that the concept of happy hour has something very important to do with that block of time around 3-7pm on weekdays.

### Let's make pasta tonight. Which style do you want?

```python
get_related_terms(u'pasta', topn=20)
```

## Word algebra!

No self-respecting word2vec demo would be complete without a healthy dose of *word algebra*, also known as *analogy completion*.
The core idea is that once words are represented as numerical vectors, you can do math with them. The mathematical procedure goes like this:

1. Provide a set of words or phrases that you'd like to add or subtract.
1. Look up the vectors that represent those terms in the word vector model.
1. Add and subtract those vectors to produce a new, combined vector.
1. Look up the most similar vector(s) to this new, combined vector via cosine similarity.
1. Return the word(s) associated with the similar vector(s).

But more generally, you can think of the vectors that represent each word as encoding some information about the *meaning* or *concepts* of the word. What happens when you ask the model to combine the meaning and concepts of words in new ways? Let's see.

```python
def word_algebra(add=[], subtract=[], topn=1):
    """
    combine the vectors associated with the words provided
    in add= and subtract=, look up the topn most similar
    terms to the combined vector, and print the result(s)
    """
    answers = food2vec.most_similar(positive=add, negative=subtract, topn=topn)

    for term, similarity in answers:
        print term
```

### breakfast + lunch = ?

Let's start with a softball.

```python
word_algebra(add=[u'breakfast', u'lunch'])
```

OK, so the model knows that *brunch* is a combination of *breakfast* and *lunch*. What else?

### lunch - day + night = ?

```python
word_algebra(add=[u'lunch', u'night'], subtract=[u'day'])
```

Now we're getting a bit more nuanced. The model has discovered that:

- Both *lunch* and *dinner* are meals
- The main difference between them is the time of day
- *Day* and *night* are times of day
- Lunch is associated with day, and dinner is associated with night

What else?

### taco - mexican + chinese = ?

```python
word_algebra(add=[u'taco', u'chinese'], subtract=[u'mexican'])
```

Here's an entirely new and different type of relationship that the model has learned.
- It knows that tacos are a characteristic example of Mexican food
- It knows that Mexican and Chinese are both styles of food
- If you subtract *Mexican* from *taco*, you're left with something like the concept of a *"characteristic type of food"*, which is represented as a new vector
- If you add that new *"characteristic type of food"* vector to *Chinese*, you get *dumpling*.

What else?

### bun - american + mexican = ?

```python
word_algebra(add=[u'bun', u'mexican'], subtract=[u'american'])
```

The model knows that both *buns* and *tortillas* are the doughy thing that goes on the outside of your real food, and that the primary difference between them is the style of food they're associated with.

What else?

### filet mignon - beef + seafood = ?

```python
word_algebra(add=[u'filet_mignon', u'seafood'], subtract=[u'beef'])
```

The model has learned a concept of *delicacy*. If you take filet mignon and subtract beef from it, you're left with a vector that roughly corresponds to delicacy. If you add the delicacy vector to *seafood*, you get *raw oyster*.

What else?

### coffee - drink + snack = ?

```python
word_algebra(add=[u'coffee', u'snack'], subtract=[u'drink'])
```

The model knows that if you're on your coffee break, but instead of drinking something, you're eating something... that thing is most likely a pastry.

What else?

### Burger King + fine dining = ?

```python
word_algebra(add=[u'burger_king', u'fine_dining'])
```

Touché. It makes sense, though. The model has learned that both Burger King and Denny's are large chains, and that both serve fast, casual, American-style food. But Denny's has some elements that are slightly more upscale, such as printed menus and table service. Fine dining, indeed.

*What if we keep going?*

### Denny's + fine dining = ?

```python
word_algebra(add=[u"denny_'s", u'fine_dining'])
```

This seems like a good place to land...
what if we explore the vector space around *Applebee's* a bit, in a few different directions? Let's see what we find.

#### Applebee's + italian = ?

```python
word_algebra(add=[u"applebee_'s", u'italian'])
```

#### Applebee's + pancakes = ?

```python
word_algebra(add=[u"applebee_'s", u'pancakes'])
```

#### Applebee's + pizza = ?

```python
word_algebra(add=[u"applebee_'s", u'pizza'])
```

You could do this all day. One last analogy before we move on...

### wine - grapes + barley = ?

```python
word_algebra(add=[u'wine', u'barley'], subtract=[u'grapes'])
```

## Word Vector Visualization with t-SNE

[t-Distributed Stochastic Neighbor Embedding](https://lvdmaaten.github.io/publications/papers/JMLR_2008.pdf), or *t-SNE* for short, is a dimensionality reduction technique for visualizing high-dimensional datasets. It attempts to map high-dimensional data onto a two- or three-dimensional representation such that the relative distances between points are preserved as closely as possible in both high-dimensional and low-dimensional space.

scikit-learn provides a convenient implementation of the t-SNE algorithm with its [TSNE](http://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html) class.

```python
from sklearn.manifold import TSNE
```

Our input for t-SNE will be the DataFrame of word vectors we created before. Let's first:

1. Drop stopwords — it's probably not too interesting to visualize *the*, *of*, *or*, and so on
1. Take only the 5,000 most frequent terms in the vocabulary — no need to visualize all ~50,000 terms right now.
```python
tsne_input = word_vectors.drop(spacy.en.STOPWORDS, errors=u'ignore')
tsne_input = tsne_input.head(5000)

tsne_input.head()
```

```python
tsne_filepath = os.path.join(intermediate_directory, u'tsne_model')

tsne_vectors_filepath = os.path.join(intermediate_directory,
                                     u'tsne_vectors.npy')
```

```python
%%time

if 0 == 1:

    tsne = TSNE()
    tsne_vectors = tsne.fit_transform(tsne_input.values)

    with open(tsne_filepath, 'w') as f:
        pickle.dump(tsne, f)

    pd.np.save(tsne_vectors_filepath, tsne_vectors)

with open(tsne_filepath) as f:
    tsne = pickle.load(f)

tsne_vectors = pd.np.load(tsne_vectors_filepath)

tsne_vectors = pd.DataFrame(tsne_vectors,
                            index=pd.Index(tsne_input.index),
                            columns=[u'x_coord', u'y_coord'])
```

Now we have a two-dimensional representation of our data! Let's take a look.

```python
tsne_vectors.head()
```

```python
tsne_vectors[u'word'] = tsne_vectors.index
```

### Plotting with Bokeh

```python
from bokeh.plotting import figure, show, output_notebook
from bokeh.models import HoverTool, ColumnDataSource, value

output_notebook()
```

```python
# add our DataFrame as a ColumnDataSource for Bokeh
plot_data = ColumnDataSource(tsne_vectors)

# create the plot and configure the
# title, dimensions, and tools
tsne_plot = figure(title=u't-SNE Word Embeddings',
                   plot_width=800, plot_height=800,
                   tools=(u'pan, wheel_zoom, box_zoom,'
                          u'box_select, resize, reset'),
                   active_scroll=u'wheel_zoom')

# add a hover tool to display words on roll-over
tsne_plot.add_tools(HoverTool(tooltips=u'@word'))

# draw the words as circles on the plot
tsne_plot.circle(u'x_coord', u'y_coord', source=plot_data,
                 color=u'blue', line_alpha=0.2, fill_alpha=0.1,
                 size=10, hover_line_color=u'black')

# configure visual elements of the plot
tsne_plot.title.text_font_size = value(u'16pt')
tsne_plot.xaxis.visible = False
tsne_plot.yaxis.visible = False
tsne_plot.grid.grid_line_color = None
tsne_plot.outline_line_color = None

# engage!
show(tsne_plot)
```

## Conclusion

Whew! Let's round up the major components that we've seen:

1. Text processing with **spaCy**
1. Automated **phrase modeling**
1. Topic modeling with **LDA** $\ \longrightarrow\ $ visualization with **pyLDAvis**
1. Word vector modeling with **word2vec** $\ \longrightarrow\ $ visualization with **t-SNE**

#### Why use these models?

Dense vector representations for text like LDA and word2vec can greatly improve performance for a number of common, text-heavy problems like:

- Text classification
- Search
- Recommendations
- Question answering

...and more generally, they are a powerful way machines can help humans make sense of what's in a giant pile of text. They're also often useful as a pre-processing step for many other downstream machine learning applications.

## Data Science @ S&P Global — *we are hiring!*