Bringing contextual word representations into your models

In [1]:
__author__ = "Christopher Potts"
__version__ = "CS224u, Stanford, Spring 2019"


This notebook provides a basic introduction to using pre-trained BERT and ELMo representations. It is meant as a practical companion to our lecture on contextual word representations. The goal of this notebook is just to help you use these representations in your own work. The BERT and ELMo teams have done amazing work to make these resources available to the community. Many projects can benefit from them, so it is probably worth your time to experiment.

This notebook should be considered an experimental extension to the regular course materials. It has some special requirements – libraries and data files – that are not part of the core requirements for this repository. All these tools are very new and being updated frequently, so you might need to do some fiddling to get all of this to work. As I said, though, it's probably worth the effort!



Han Xiao's "BERT as a Service" is pretty incredible:

To make use of it, run these two pip installs in your usual course virtual environment:

pip install bert-serving-server
pip install bert-serving-client

After that, you just need to download a BERT model:

In [2]:
from bert_serving.client import BertClient 

Edit the following command by replacing


with the path to your downloaded BERT model directory, and then run the command in a Terminal window:

bert-serving-start -model_dir data/bert/uncased_L-12_H-768_A-12/ -pooling_strategy NONE -max_seq_len NONE -show_tokens_to_client


There are a number of ways to use pre-trained ELMo models. We'll use the simplest of the AllenNLP interfaces. Run the following to install AllenNLP:

pip install allennlp

Mac users: If your installation fails, make sure your Xcode tools are up to date by running xcode-select --install. This is currently a common source of problems when installing AllenNLP.

We'll use the ElmoEmbedder interface, which downloads a default model. See below for instructions on how to use a different model.

In [3]:
from allennlp.commands.elmo import ElmoEmbedder
from nltk.tokenize.treebank import TreebankWordTokenizer


The following are requirements that you'll already have met if you've been working in this repository. As you can see, we'll use the Stanford Sentiment Treebank for illustrations, and we'll try out a few different deep learning models.

In [4]:
import os
import sst
from torch_shallow_neural_classifier import TorchShallowNeuralClassifier
from torch_rnn_classifier import TorchRNNClassifier
from sklearn.metrics import classification_report
In [5]:
SST_HOME = os.path.join("data", "trees")

Using BERT

BERT representations for the SST

With the BERT server running in the background, the following will allow you to process new examples and obtain their BERT representations:

In [6]:
bc = BertClient(check_length=False)

Here we load in the SST train and dev sets, and we flatten the trees into strings of just their leaf nodes. We'll allow BERT to tokenize for us; an alternative is to use is_tokenized=True in the call to bc.encode, but this requires a bit more fussing with the representations and might be suboptimal.

In [7]:
sst_train_reader = sst.train_reader(
    SST_HOME, class_func=sst.ternary_class_func)

sst_train = [(" ".join(t.leaves()), label) for t, label in sst_train_reader]
In [8]:
sst_dev_reader = sst.dev_reader(
    SST_HOME, class_func=sst.ternary_class_func)

sst_dev = [(" ".join(t.leaves()), label) for t, label in sst_dev_reader]
In [9]:
X_str_train, y_train = zip(*sst_train)
In [10]:
X_str_dev, y_dev = zip(*sst_dev)
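The zip(*...) idiom in the two cells above transposes a list of (sentence, label) pairs into two parallel tuples. Here is a minimal illustration; the sentences and labels are made-up stand-ins for SST data, and the toy_ names are hypothetical:

```python
# Hypothetical (sentence, label) pairs standing in for `sst_train`.
toy_pairs = [
    ("It 's a lovely film", "positive"),
    ("A dull , plodding mess", "negative"),
    ("The film opens in Paris", "neutral")]

# `zip(*...)` transposes the list of pairs into two parallel tuples.
toy_X_str, toy_y = zip(*toy_pairs)

print(toy_X_str[1])   # A dull , plodding mess
print(toy_y)          # ('positive', 'negative', 'neutral')

# The results are tuples, not lists, so convert with `list(...)`
# wherever a list is expected.
print(type(toy_X_str).__name__)  # tuple
```

Because the two tuples stay aligned by position, toy_X_str[i] always goes with toy_y[i].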

Now we process the examples into BERT representations. I've set show_tokens=True to help us keep track of what BERT is doing to our texts:

In [11]:
X_bert_train, bert_train_toks = bc.encode(
    list(X_str_train), show_tokens=True)
In [12]:
X_bert_dev, bert_dev_toks = bc.encode(
    list(X_str_dev), show_tokens=True)

BERT sentence-level classifier

As a first illustration, we'll use BERT representations as the input to a classifier model. The first step is to combine the individual word representations into fixed-dimensional vectors, so that we can use them as inputs to a classifier. For this, I'll just average the individual vectors:

In [13]:
def bert_reduce_mean(X):
    return X.mean(axis=1)  

This is very much like what we did when we summed the GloVe representations of these examples, but now the individual word representations are different depending on the context in which they appear.

Note: If you start the BERT server with -pooling_strategy REDUCE_MEAN, then this step is done for you. The "BERT as a Service" documentation discusses the other pooling strategies.
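One thing to watch with a plain mean over axis 1: every example is padded to a common length, so short sentences get averaged together with their padding positions. The following is a minimal sketch of a masked mean, assuming that padding positions come back as all-zero vectors (my understanding of the server's behavior with -pooling_strategy NONE, but worth checking on your own output). The function name masked_reduce_mean and the toy data are hypothetical:

```python
import numpy as np

def masked_reduce_mean(X):
    """Average over token positions, ignoring all-zero (padding) rows.

    X has shape (n_examples, max_len, dim). Assumes padding positions
    are exactly zero vectors; verify this against real server output.
    """
    # True for positions that have at least one nonzero value.
    mask = np.any(X != 0, axis=2)                  # (n_examples, max_len)
    # Number of real tokens per example (at least 1, to avoid 0-division).
    counts = np.maximum(mask.sum(axis=1), 1)
    # Padding rows are zero, so they contribute nothing to the sum.
    return X.sum(axis=1) / counts[:, np.newaxis]   # (n_examples, dim)

# Toy batch: 2 examples, 3 positions, 2 dimensions. The second example
# has one real token followed by two padding positions.
X_toy = np.array([
    [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]],
    [[2.0, 4.0], [0.0, 0.0], [0.0, 0.0]]])

plain = X_toy.mean(axis=1)
masked = masked_reduce_mean(X_toy)

print(plain.shape, masked.shape)
print(plain[1], masked[1])
```

For the fully populated first example the two functions agree; for the padded second example, the plain mean is shrunk toward zero while the masked mean recovers the single real token vector.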

In [14]:
X_bert_train_mean = bert_reduce_mean(X_bert_train)

BERT representations are pretty large:

In [15]:

Now we instantiate and fit a classifier. I picked a TorchShallowNeuralClassifier. Since the input representations are large, I chose a pretty large hidden_dim:

In [16]:
mod = TorchShallowNeuralClassifier(
    max_iter=100, hidden_dim=300)
In [17]:
%time _ = mod.fit(X_bert_train_mean, y_train)
Finished epoch 100 of 100; error is 0.17165078409016132
CPU times: user 2min 18s, sys: 595 ms, total: 2min 18s
Wall time: 20.3 s

Evaluation proceeds as you would expect:

In [18]:
X_bert_dev_mean = bert_reduce_mean(X_bert_dev)
In [19]:
bert_sent_preds = mod.predict(X_bert_dev_mean)
In [20]:
print(classification_report(y_dev, bert_sent_preds, digits=3))
              precision    recall  f1-score   support

    negative      0.713     0.673     0.692       428
     neutral      0.348     0.314     0.330       229
    positive      0.714     0.788     0.749       444

   micro avg      0.645     0.645     0.645      1101
   macro avg      0.592     0.592     0.591      1101
weighted avg      0.638     0.645     0.640      1101
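To make the summary rows of the report above concrete, here is a small sketch that recomputes the per-class F1 scores and the macro and weighted averages from the printed precision, recall, and support values. Since the inputs are already rounded to three digits, the results can differ from the report in the last digit:

```python
# (precision, recall, support) copied from the report above.
report = {
    "negative": (0.713, 0.673, 428),
    "neutral": (0.348, 0.314, 229),
    "positive": (0.714, 0.788, 444)}

def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

f1_by_class = {c: f1(p, r) for c, (p, r, _) in report.items()}
total = sum(n for _, _, n in report.values())

# Macro average: every class counts equally.
macro_f1 = sum(f1_by_class.values()) / len(f1_by_class)

# Weighted average: classes count in proportion to their support.
weighted_f1 = sum(
    f1_by_class[c] * n for c, (_, _, n) in report.items()) / total

for c, score in f1_by_class.items():
    print(c, round(score, 3))
print("macro", round(macro_f1, 3), "weighted", round(weighted_f1, 3))
```

The macro average is lower than the weighted one because the weak neutral class counts as much as the two larger classes.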

Using the SST experimental framework with BERT

It is straightforward to conduct experiments like the above using sst.experiment, which will enable you to do a wider range of experiments without writing or copy-pasting a lot of code.

Per the guidelines at Han Xiao's "BERT as a service", it would be prohibitively slow to call bc.encode on all our sentences individually. To address this, I suggest first creating a look-up for the precomputed BERT representations and then having your feature function simply use this look-up:

In [21]:
bert_lookup = {}

for (sents, reps) in ((X_str_train, X_bert_train_mean), 
                      (X_str_dev, X_bert_dev_mean)):
    assert len(sents) == len(reps)
    for s, rep in zip(sents, reps):
        bert_lookup[s] = rep
In [22]:
def bert_sentence_phi(tree):
    s = " ".join(tree.leaves())
    return bert_lookup[s]
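The look-up pattern can be tried out without a running server. In this sketch, ToyTree is a hypothetical stand-in for an SST tree (only its leaves() method matters here), and the vectors are made up:

```python
class ToyTree:
    """Hypothetical stand-in for an SST tree; only `leaves()` is needed."""
    def __init__(self, words):
        self._words = words

    def leaves(self):
        return list(self._words)

# Pretend these vectors came back from one batched encoding call.
toy_sents = ["a fine film", "not good"]
toy_reps = [[0.1, 0.2], [0.3, 0.4]]

# One dictionary entry per sentence string.
toy_lookup = dict(zip(toy_sents, toy_reps))

def toy_sentence_phi(tree):
    # Rebuild the key exactly as the sentence strings were built.
    return toy_lookup[" ".join(tree.leaves())]

print(toy_sentence_phi(ToyTree(["not", "good"])))  # [0.3, 0.4]

# A sentence that was never encoded raises KeyError instead of
# silently falling back to a slow per-sentence encoding call.
try:
    toy_sentence_phi(ToyTree(["unseen", "text"]))
except KeyError as err:
    print("missing:", err)
```

The key point is that the feature function must rebuild its key with the same " ".join(...) recipe used when the sentences were encoded; any difference in spacing or tokenization leads to a failed look-up.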
In [23]:
def fit_wide_shallow_network(X, y):
    mod = TorchShallowNeuralClassifier(
        max_iter=100, hidden_dim=300)
    mod.fit(X, y)
    return mod
In [24]:
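%%time
# The arguments below are inferred from the cells above; adjust them
# if your version of sst.experiment has a different signature.
_ = sst.experiment(
    SST_HOME,
    bert_sentence_phi,
    fit_wide_shallow_network,
    train_reader=sst.train_reader,
    assess_reader=sst.dev_reader,
    class_func=sst.ternary_class_func,
    vectorize=False)  # The feature function already returns vectors.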
Finished epoch 100 of 100; error is 0.16109364107251167
              precision    recall  f1-score   support

    negative      0.680     0.736     0.707       428
     neutral      0.299     0.192     0.234       229
    positive      0.703     0.777     0.738       444

   micro avg      0.639     0.639     0.639      1101
   macro avg      0.561     0.568     0.560      1101
weighted avg      0.610     0.639     0.621      1101

CPU times: user 2min 18s, sys: 481 ms, total: 2min 18s
Wall time: 21.5 s

BERT word-level representations as RNN features

We can also use BERT representations as the input to an RNN. There is just one key change from how we used these models before:

  • Previously, we would feed in lists of tokens, and they would be converted to indices into a fixed embedding space. This presumes that all words have the same representation no matter what their context is.

  • With BERT, we skip the embedding entirely and just feed in lists of BERT vectors, which means that the same word can be represented in different ways.
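The contrast between the two input styles can be sketched with toy numbers. Everything here (the vocabulary, the vector values, and the function name embed_static) is made up for illustration; real contextual vectors would come from BERT or ELMo:

```python
import numpy as np

# Static embedding: one row per vocabulary item (toy values).
vocab = ["the", "bank", "river", "loan"]
embedding = np.arange(len(vocab) * 2, dtype=float).reshape(len(vocab), 2)
word2index = {w: i for i, w in enumerate(vocab)}

def embed_static(tokens):
    """Look up each token's row; context plays no role."""
    return embedding[[word2index[w] for w in tokens]]

s1 = embed_static(["the", "river", "bank"])
s2 = embed_static(["the", "bank", "loan"])
# "bank" gets the same vector in both sentences.
print(np.array_equal(s1[2], s2[1]))  # True

# Contextual input: each example is already an (n_tokens, dim) array,
# so the same word can have different vectors in different sentences.
c1 = np.array([[0.1, 0.2], [0.5, 0.1], [0.9, 0.3]])  # the river bank
c2 = np.array([[0.1, 0.3], [0.2, 0.8], [0.4, 0.4]])  # the bank loan
print(np.array_equal(c1[2], c2[1]))  # False
```

In the second case there is nothing left for an embedding layer to do, which is why the model can take the arrays directly.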

TorchRNNClassifier supports this via use_embedding=False. In turn, you needn't supply a vocabulary:

In [25]:
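bert_rnn = TorchRNNClassifier(
    vocab=[],  # Assumed placeholder; no vocabulary is used when use_embedding=False.
    max_iter=50,  # Matches the 50 epochs in the training log below.
    use_embedding=False)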
In [26]:
%time _ = bert_rnn.fit(X_bert_train, y_train)
Finished epoch 50 of 50; error is 3.3966610431671143
CPU times: user 31min 6s, sys: 2min 27s, total: 33min 34s
Wall time: 10min 11s
In [27]:
bert_rnn_preds = bert_rnn.predict(X_bert_dev)
In [28]:
print(classification_report(y_dev, bert_rnn_preds, digits=3))
              precision    recall  f1-score   support

    negative      0.778     0.614     0.687       428
     neutral      0.308     0.341     0.324       229
    positive      0.727     0.836     0.778       444

   micro avg      0.647     0.647     0.647      1101
   macro avg      0.605     0.597     0.596      1101
weighted avg      0.660     0.647     0.648      1101

Using ELMo

Using ELMo is very similar to using BERT. I'll illustrate just with an RNN.

ELMo representations for the SST

When first run, the following command downloads


directly from S3 to a local temp directory. Use options_file and weight_file to ask ElmoEmbedder to use a specified pair of model files. For additional details:

In [29]:
elmo = ElmoEmbedder()

The ELMo interface requires tokenized input. I believe the following tokenizer matches the behavior of the one used by the team to create the representations:

In [30]:
tokenizer = TreebankWordTokenizer()
In [31]:
elmo_train_toks = [tokenizer.tokenize(ex) for ex in X_str_train]
In [32]:
elmo_dev_toks = [tokenizer.tokenize(ex) for ex in X_str_dev]

Here we create the representations for the train and dev sets:

In [33]:
X_elmo_train_layers = list(elmo.embed_sentences(elmo_train_toks))
In [34]:
X_elmo_dev_layers = list(elmo.embed_sentences(elmo_dev_toks))

Each example in X_elmo_train_layers is an array with three dimensions (layers, words, and vector size):

In [35]:
(3, 13, 1024)

For each word (second dimension), there are three layers (first dimension), each a vector of length 1024. So ELMo representations are even larger than BERT ones!

ELMo representations as RNN features

There are many ways we could combine the layers available for each word. Here, I'll use the mean:

In [36]:
def elmo_layer_reduce_mean(elmo_vecs):
    return [ex.mean(axis=0) for ex in elmo_vecs]
In [37]:
X_elmo_train = elmo_layer_reduce_mean(X_elmo_train_layers)
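To see what the layer reduction does to the shapes, here is a small sketch on a random stand-in for a single ELMo output. The array sizes are shrunk for readability, and the top-layer and concatenation options are alternatives I am adding for comparison, not something used in this notebook:

```python
import numpy as np

# Toy stand-in for one ELMo output: 3 layers, 4 words, 5 dimensions
# (real ELMo vectors have 1024 dimensions).
rng = np.random.RandomState(0)
toy_layers = rng.normal(size=(3, 4, 5))

# Option 1: average the layers, as in `elmo_layer_reduce_mean`.
layer_mean = toy_layers.mean(axis=0)           # (4, 5)

# Option 2: keep only the top layer.
top_layer = toy_layers[-1]                     # (4, 5)

# Option 3: concatenate the three layers for each word.
concat = np.concatenate(toy_layers, axis=1)    # (4, 15)

print(layer_mean.shape, top_layer.shape, concat.shape)
```

In all three cases the result is one vector per word, which is the format the RNN expects when it is given precomputed representations.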

Now we can fit an RNN as usual:

In [38]:
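elmo_rnn = TorchRNNClassifier(
    vocab=[],  # Assumed placeholder; no vocabulary is used when use_embedding=False.
    max_iter=50,  # Matches the 50 epochs in the training log below.
    use_embedding=False)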
In [39]:
%time _ = elmo_rnn.fit(X_elmo_train, y_train)
Finished epoch 50 of 50; error is 0.09299760637804866
CPU times: user 16min 2s, sys: 1min 32s, total: 17min 34s
Wall time: 4min 42s

Evaluation proceeds in the usual way:

In [40]:
X_elmo_dev = elmo_layer_reduce_mean(X_elmo_dev_layers)
In [41]:
elmo_rnn_preds = elmo_rnn.predict(X_elmo_dev)
In [42]:
print(classification_report(y_dev, elmo_rnn_preds, digits=3))
              precision    recall  f1-score   support

    negative      0.706     0.678     0.691       428
     neutral      0.292     0.258     0.274       229
    positive      0.734     0.806     0.768       444

   micro avg      0.642     0.642     0.642      1101
   macro avg      0.577     0.581     0.578      1101
weighted avg      0.631     0.642     0.635      1101

Using the SST experiment framework with ELMo

To round things out, here's an example of how to use sst.experiment with ELMo, for more compact and maintainable experimental code:

In [43]:
def elmo_sentence_phi(tree):
    vecs = elmo.embed_sentence(tree.leaves())
    return vecs.mean(axis=0) 
In [44]:
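def fit_elmo_rnn(X, y):
    mod = TorchRNNClassifier(
        vocab=[],  # Assumed placeholder; unused when use_embedding=False.
        max_iter=50,  # Matches the 50 epochs in the log below.
        use_embedding=False)
    mod.fit(X, y)
    return mod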
In [45]:
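%%time
# The arguments below are inferred from the cells above; adjust them
# if your version of sst.experiment has a different signature.
_ = sst.experiment(
    SST_HOME,
    elmo_sentence_phi,
    fit_elmo_rnn,
    train_reader=sst.train_reader,
    assess_reader=sst.dev_reader,
    class_func=sst.ternary_class_func,
    vectorize=False)  # The feature function already returns vectors.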
Finished epoch 50 of 50; error is 0.021800976479426026
              precision    recall  f1-score   support

    negative      0.707     0.715     0.711       428
     neutral      0.344     0.245     0.286       229
    positive      0.733     0.833     0.780       444

   micro avg      0.665     0.665     0.665      1101
   macro avg      0.594     0.598     0.592      1101
weighted avg      0.642     0.665     0.650      1101

CPU times: user 3h 22min 56s, sys: 2min 25s, total: 3h 25min 22s
Wall time: 51min 40s