21 - Word2Vec

by Alejandro Correa Bahnsen

version 1.0, July 2018

Part of the class Applied Deep Learning

This notebook is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.


Word2vec is a two-layer neural net that processes text. Its input is a text corpus and its output is a set of vectors: feature vectors for words in that corpus. While Word2vec is not a deep neural network, it turns text into a numerical form that deep nets can understand.

Word2vec’s applications extend beyond parsing sentences in the wild. It can be applied just as well to genes, code, likes, playlists, social media graphs and other verbal or symbolic series in which patterns may be discerned.

Why? Because words are simply discrete states like the other data mentioned above, and we are simply looking for the transitional probabilities between those states: the likelihood that they will co-occur. So gene2vec, like2vec and follower2vec are all possible. With that in mind, the tutorial below will help you understand how to create neural embeddings for any group of discrete and co-occurring states.

The purpose and usefulness of Word2vec is to group the vectors of similar words together in vector space. That is, it detects similarities mathematically. Word2vec creates vectors that are distributed numerical representations of word features, such as the context of individual words, and it does so without human intervention.
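"Detecting similarities mathematically" usually means comparing word vectors with cosine similarity. A minimal sketch with toy, hand-picked 4-dimensional vectors (illustrative values only, not trained embeddings):

```python
import numpy as np

# Toy "word vectors" (illustrative values, not trained embeddings)
vectors = {
    "king":  np.array([0.9, 0.8, 0.1, 0.2]),
    "queen": np.array([0.85, 0.75, 0.2, 0.1]),
    "apple": np.array([0.1, 0.2, 0.9, 0.8]),
}

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors: 1.0 means identical direction
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(vectors["king"], vectors["queen"]))  # high
print(cosine_similarity(vectors["king"], vectors["apple"]))  # low
```

With real Word2vec embeddings the same computation places "king" near "queen" and far from unrelated words.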

TensorFlow Hub

TensorFlow Hub is a library for the publication, discovery, and consumption of reusable parts of machine learning models. A module is a self-contained piece of a TensorFlow graph, along with its weights and assets, that can be reused across different tasks in a process known as transfer learning. Transfer learning can:

  • Train a model with a smaller dataset,
  • Improve generalization, and
  • Speed up training.
In [1]:
!pip install tensorflow-hub
Collecting tensorflow-hub
  Downloading https://files.pythonhosted.org/packages/aa/8e/ab0d5cf0f2dc2ed1c645bf1cb4acc7beeb0be5a51ebb0c91cbb23eb1fe10/tensorflow_hub-0.1.0-py2.py3-none-any.whl (53kB)
    100% |████████████████████████████████| 61kB 416kB/s ta 0:00:011
Requirement already satisfied: protobuf>=3.4.0 in /home/al/anaconda3/lib/python3.6/site-packages (from tensorflow-hub)
Requirement already satisfied: six>=1.10.0 in /home/al/anaconda3/lib/python3.6/site-packages (from tensorflow-hub)
Requirement already satisfied: numpy>=1.12.0 in /home/al/anaconda3/lib/python3.6/site-packages (from tensorflow-hub)
Requirement already satisfied: setuptools in /home/al/anaconda3/lib/python3.6/site-packages (from protobuf>=3.4.0->tensorflow-hub)
Installing collected packages: tensorflow-hub
Successfully installed tensorflow-hub-0.1.0
You are using pip version 9.0.1, however version 10.0.1 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.
In [10]:
import tensorflow_hub as hub
import tensorflow as tf
import numpy as np

See the available modules on TensorFlow Hub: https://tfhub.dev


In [30]:
hub_model = "https://tfhub.dev/google/Wiki-words-250/1"
In [60]:
def create_text_embedding(hub_model, input_sentences):
    EMBEDDING_DIM = 250  # Wiki-words-250 produces 250-dimensional vectors
    SEQ_LENGTH = max(map(len, input_sentences))

    with tf.Graph().as_default() as g:

        embed_layer = hub.Module(hub_model, trainable=False,
                                 name='text_embedding')

        sentences = tf.placeholder(dtype=tf.string, shape=(None, SEQ_LENGTH))
        batch_size = tf.shape(sentences)[0]

        # The module embeds a flat list of tokens, so flatten, embed, reshape
        flat_sentences = tf.reshape(sentences, [-1])

        embeddings = embed_layer(flat_sentences)

        sentence_embedding = tf.reshape(embeddings,
                                        [batch_size, SEQ_LENGTH, EMBEDDING_DIM])

        with tf.Session(graph=g) as sess:
            # Initialize the module's variables and its token lookup tables
            sess.run([tf.global_variables_initializer(),
                      tf.tables_initializer()])

            output = sess.run(sentence_embedding,
                              feed_dict={sentences: input_sentences})
    return output
In [61]:
sentences = ["cat is on the mat", "dog is in the fog", 
             "padded sentence UNK UNK UNK"]
sentences = [s.split() for s in sentences]
In [62]:
sentences_em = create_text_embedding(hub_model, sentences)
INFO:tensorflow:Initialize variable text_embedding/embeddings/part_0:0,text_embedding/embeddings/part_1:0 from checkpoint b'/tmp/tfhub_modules/1e7e72950396d90315d3d9c57eddf5da44d4dca2/variables/variables' with embeddings
In [63]:
sentences_em.shape
(3, 5, 250)
In [64]:
sentences_em[0]
array([[-0.09022635, -0.05926817,  0.09362827, ..., -0.07516086,
         0.05431775, -0.07635543],
       [-0.02643044, -0.1691547 ,  0.04513412, ..., -0.02164096,
         0.00400387, -0.02354614],
       [-0.00483655, -0.06458678, -0.02172526, ...,  0.02054301,
         0.0117408 ,  0.01072438],
       [-0.03488984, -0.12522927,  0.01865043, ..., -0.01779286,
         0.03092172, -0.01291556],
       [ 0.04126915, -0.01120683, -0.03439177, ...,  0.00522323,
         0.00323421,  0.05423511]], dtype=float32)
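The result is one 250-dimensional vector per word. For models that need a single fixed-size vector per sentence, a common trick is mean-pooling over the word axis. A sketch using random data with the same shape as `sentences_em` (a stand-in, not the real embeddings):

```python
import numpy as np

# Stand-in array with the same shape as sentences_em: (3 sentences, 5 words, 250 dims)
rng = np.random.RandomState(0)
sentences_em = rng.randn(3, 5, 250).astype(np.float32)

# Mean-pooling over the word axis yields one fixed-size vector per sentence
sentence_vectors = sentences_em.mean(axis=1)
print(sentence_vectors.shape)  # (3, 250)
```

The convolutional model below takes a different route: it keeps the per-word vectors and lets a Conv1D + global max-pooling layer learn the aggregation.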


In [118]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from keras.preprocessing.text import text_to_word_sequence
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.utils import np_utils
from keras.layers import Dense, Flatten, Dropout, Activation, BatchNormalization
from keras.optimizers import RMSprop, Adam
from keras.callbacks import History
from keras.layers import Conv1D, GlobalMaxPooling1D
from livelossplot import PlotLossesKeras

%matplotlib inline
import matplotlib.pyplot as plt
In [45]:
# read the tweets and use the tweet id as the index
import zipfile
with zipfile.ZipFile('../datasets/Tweets.zip', 'r') as z:
    f = z.open('Tweets.csv')
    tweets = pd.read_csv(f, index_col=0)
tweets.head()

| tweet_id | airline_sentiment | airline_sentiment_confidence | negativereason | negativereason_confidence | airline | airline_sentiment_gold | name | negativereason_gold | retweet_count | text | tweet_coord | tweet_created | tweet_location | user_timezone |
| 570306133677760513 | neutral | 1.0000 | NaN | NaN | Virgin America | NaN | cairdin | NaN | 0 | @VirginAmerica What @dhepburn said. | NaN | 2015-02-24 11:35:52 -0800 | NaN | Eastern Time (US & Canada) |
| 570301130888122368 | positive | 0.3486 | NaN | 0.0000 | Virgin America | NaN | jnardino | NaN | 0 | @VirginAmerica plus you've added commercials t... | NaN | 2015-02-24 11:15:59 -0800 | NaN | Pacific Time (US & Canada) |
| 570301083672813571 | neutral | 0.6837 | NaN | NaN | Virgin America | NaN | yvonnalynn | NaN | 0 | @VirginAmerica I didn't today... Must mean I n... | NaN | 2015-02-24 11:15:48 -0800 | Lets Play | Central Time (US & Canada) |
| 570301031407624196 | negative | 1.0000 | Bad Flight | 0.7033 | Virgin America | NaN | jnardino | NaN | 0 | @VirginAmerica it's really aggressive to blast... | NaN | 2015-02-24 11:15:36 -0800 | NaN | Pacific Time (US & Canada) |
| 570300817074462722 | negative | 1.0000 | Can't Tell | 1.0000 | Virgin America | NaN | jnardino | NaN | 0 | @VirginAmerica and it's a really big bad thing... | NaN | 2015-02-24 11:14:45 -0800 | NaN | Pacific Time (US & Canada) |
In [79]:
X = tweets['text']
y = tweets['airline_sentiment'].map({'negative':-1,'neutral':0,'positive':1})
Y = np_utils.to_categorical(y, 3)
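The `map` turns each sentiment into an integer code, and `to_categorical` turns the codes into one-hot rows. A numpy-only sketch of what that encoding does (note that fancy indexing with -1 wraps around, so "negative" lands in the last column):

```python
import numpy as np

labels = np.array([-1, 0, 1, -1])  # negative, neutral, positive, negative

# One-hot encode: each row gets a single 1 in the column given by the label.
# As with Keras's to_categorical, the index -1 wraps to the last column,
# so "negative" is encoded as class 2.
one_hot = np.zeros((len(labels), 3))
one_hot[np.arange(len(labels)), labels] = 1
print(one_hot)
```

The resulting three columns match the three units of the model's output layer below.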
In [66]:
# Get words
X_pad = [text_to_word_sequence(x) for x in X.tolist()]
In [72]:
X_pad = pad_sequences(X_pad, dtype='str', value=0)  # left-pad with '0' so all tweets share one length
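A minimal sketch of what `pad_sequences` does here: ragged word lists are left-padded (Keras pads at the start by default) with the string `'0'` until every list has the length of the longest one:

```python
# Left-pad ragged token lists to a common length, as pad_sequences does
tokens = [["cat", "sat"], ["the", "dog", "ran", "fast"]]
maxlen = max(len(t) for t in tokens)
padded = [["0"] * (maxlen - len(t)) + t for t in tokens]
print(padded)
```

This is why every tweet below becomes a row of exactly 36 tokens.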
In [74]:
X_pad.shape
(14640, 36)

Apply word2vec

In [75]:
X_em = create_text_embedding(hub_model, X_pad)
INFO:tensorflow:Initialize variable text_embedding/embeddings/part_0:0,text_embedding/embeddings/part_1:0 from checkpoint b'/tmp/tfhub_modules/1e7e72950396d90315d3d9c57eddf5da44d4dca2/variables/variables' with embeddings
In [76]:
X_em.shape
(14640, 36, 250)

Train neural network

In [132]:
model = Sequential()
model.add(Conv1D(100, 2, padding='same', strides=1, input_shape=X_em.shape[1:]))
# The remaining layers are reconstructed from the summary below; the hidden
# activations ('relu') and the dropout rate are assumed, not confirmed
model.add(Activation('relu'))
model.add(GlobalMaxPooling1D())
model.add(Dense(256))
model.add(Activation('relu'))
model.add(Dropout(0.2))
model.add(Dense(3))
model.add(Activation('softmax'))

adam = Adam(lr=0.0002)
model.compile(loss='categorical_crossentropy', optimizer=adam, metrics=['accuracy'])
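For intuition, `categorical_crossentropy` with a softmax output computes, per sample, the negative log-probability the model assigns to the true class. A numpy sketch with made-up logits for the three sentiment classes:

```python
import numpy as np

# Made-up logits for the 3 sentiment classes of one sample
logits = np.array([2.0, 0.5, -1.0])

# Softmax: subtract the max for numerical stability, exponentiate, normalize
probs = np.exp(logits - logits.max())
probs /= probs.sum()

# Cross-entropy against a one-hot target [1, 0, 0]:
# just the negative log-probability of the true class
true_class = 0
loss = -np.log(probs[true_class])
print(probs, loss)
```

Minimizing this loss pushes the probability of the correct sentiment toward 1.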
In [133]:
model.summary()
Layer (type)                 Output Shape              Param #   
conv1d_7 (Conv1D)            (None, 36, 100)           50100     
activation_33 (Activation)   (None, 36, 100)           0         
global_max_pooling1d_5 (Glob (None, 100)               0         
dense_27 (Dense)             (None, 256)               25856     
activation_34 (Activation)   (None, 256)               0         
dropout_15 (Dropout)         (None, 256)               0         
dense_28 (Dense)             (None, 3)                 771       
activation_35 (Activation)   (None, 3)                 0         
Total params: 76,727
Trainable params: 76,727
Non-trainable params: 0
In [134]:
h = History()
model.fit(X_em, Y,
          epochs=50, batch_size=500, verbose=0,
          callbacks=[h, PlotLossesKeras()])
<keras.callbacks.History at 0x7f7c4fe584e0>