__author__ = "Christopher Potts"
__version__ = "CS224u, Stanford, Spring 2019"
This notebook provides a basic introduction to using pre-trained BERT and ELMo representations. It is meant as a practical companion to our lecture on contextual word representations. The goal of this notebook is just to help you use these representations in your own work. The BERT and ELMo teams have done amazing work to make these resources available to the community. Many projects can benefit from them, so it is probably worth your time to experiment.
This notebook should be considered an experimental extension to the regular course materials. It has some special requirements – libraries and data files – that are not part of the core requirements for this repository. All these tools are very new and being updated frequently, so you might need to do some fiddling to get all of this to work. As I said, though, it's probably worth the effort!
Han Xiao's "BERT as a Service" is pretty incredible:
https://github.com/hanxiao/bert-as-service
To make use of it, run these two pip installs in your usual course virtual environment:
pip install bert-serving-server
pip install bert-serving-client
After that, you just need to download a BERT model:
from bert_serving.client import BertClient
Edit the following command by replacing data/bert/uncased_L-12_H-768_A-12/ with the path to your downloaded BERT model directory, and then run the command in a Terminal window:
bert-serving-start -model_dir data/bert/uncased_L-12_H-768_A-12/ -pooling_strategy NONE -max_seq_len NONE -show_tokens_to_client
There are a number of ways to use pre-trained ELMo models. We'll use the simplest of the AllenNLP interfaces. Run the following to install AllenNLP:
pip install allennlp
Mac users: If your installation fails, make sure your Xcode tools are up to date by running xcode-select --install. This is a common source of problems installing AllenNLP at present.
We'll use the ElmoEmbedder interface, which downloads a default model. See below for instructions on how to use a different model.
from allennlp.commands.elmo import ElmoEmbedder
from nltk.tokenize.treebank import TreebankWordTokenizer
Better speed can be achieved with apex installed from https://www.github.com/nvidia/apex.
The following are requirements that you'll already have met if you've been working in this repository. As you can see, we'll use the Stanford Sentiment Treebank for illustrations, and we'll try out a few different deep learning models.
import os
import sst
from torch_shallow_neural_classifier import TorchShallowNeuralClassifier
from torch_rnn_classifier import TorchRNNClassifier
from sklearn.metrics import classification_report
SST_HOME = os.path.join("data", "trees")
With the BERT server running in the background, the following will allow you to process new examples and obtain their BERT representations:
bc = BertClient(check_length=False)
Here we load in the SST train and dev sets, and we flatten the trees into strings of just their leaf nodes. We'll allow BERT to tokenize for us; an alternative is to use is_tokenized=True in the call to bc.encode, but this requires a bit more fussing with the representations and might be suboptimal.
sst_train_reader = sst.train_reader(
SST_HOME, class_func=sst.ternary_class_func)
sst_train = [(" ".join(t.leaves()), label) for t, label in sst_train_reader]
sst_dev_reader = sst.dev_reader(
SST_HOME, class_func=sst.ternary_class_func)
sst_dev = [(" ".join(t.leaves()), label) for t, label in sst_dev_reader]
X_str_train, y_train = zip(*sst_train)
X_str_dev, y_dev = zip(*sst_dev)
Now we process the examples into BERT representations. I've set show_tokens=True to help us keep track of what BERT is doing to our texts:
X_bert_train, bert_train_toks = bc.encode(
list(X_str_train), show_tokens=True)
X_bert_dev, bert_dev_toks = bc.encode(
list(X_str_dev), show_tokens=True)
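The tokens returned here are word pieces rather than whitespace tokens, which is why the token lists can be longer than the original strings. As a rough illustration of the idea, here is a minimal sketch of greedy longest-match-first word-piece splitting. The function wordpiece and the vocabulary toy_vocab are hypothetical names invented for this example; the real BERT tokenizer also handles lowercasing, punctuation, and a much larger vocabulary.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into word pieces.

    Non-initial pieces carry the '##' prefix, as in BERT vocabularies.
    """
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            cand = word[start:end]
            if start > 0:
                cand = "##" + cand
            if cand in vocab:
                piece = cand
                break
            end -= 1
        if piece is None:
            # No piece matches at this position: the whole word is unknown.
            return [unk]
        pieces.append(piece)
        start = end
    return pieces


# Toy vocabulary, made up for illustration only.
toy_vocab = {"un", "##believ", "##able", "film", "##s"}

print(wordpiece("unbelievable", toy_vocab))  # ['un', '##believ', '##able']
print(wordpiece("films", toy_vocab))         # ['film', '##s']
print(wordpiece("xyz", toy_vocab))           # ['[UNK]']
```

One practical consequence is that the second dimension of the encoded arrays counts word pieces (plus any special tokens), not words.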
As a first illustration, we'll use BERT representations as the input to a classifier model. The first step is to combine the individual word representations into fixed-dimensional vectors, so that we can use them as inputs to a classifier. For this, I'll just average the individual vectors:
def bert_reduce_mean(X):
    return X.mean(axis=1)
This is very much like how we summed the GloVe representations of these examples, but now the individual word representations are different depending on the context in which they appear.
Note: If you start the BERT server with -pooling_strategy REDUCE_MEAN, then this step is done for you. The "BERT as a Service" documentation discusses other pooling strategies.
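One thing to watch with a plain mean over axis 1 is that every example is padded to the same length, so padding positions enter the average. The following is a minimal sketch of a masked mean, under the assumption that padded positions come back as all-zero vectors; check your own server output before relying on that. The name masked_reduce_mean is hypothetical and not part of the course code.

```python
import numpy as np


def masked_reduce_mean(X):
    """Average token vectors over axis 1, skipping all-zero (padding) rows."""
    X = np.asarray(X, dtype=float)
    # A position counts as a real token if any component is nonzero.
    mask = np.any(X != 0, axis=2)                      # (n_examples, seq_len)
    # Guard against division by zero for completely empty examples.
    counts = np.maximum(mask.sum(axis=1, keepdims=True), 1)
    return (X * mask[:, :, None]).sum(axis=1) / counts


# Toy batch: 2 examples, 4 positions, 3 dimensions.
toy = np.zeros((2, 4, 3))
toy[0, :2] = [[1.0, 2.0, 3.0], [3.0, 4.0, 5.0]]   # 2 real tokens, 2 pads
toy[1, :4] = 1.0                                   # 4 real tokens, no pads

plain = toy.mean(axis=1)          # pads dilute the first example
masked = masked_reduce_mean(toy)  # pads are ignored

print(plain[0])   # [1.  1.5 2. ]
print(masked[0])  # [2. 3. 4.]
```

Whether the masked version helps in practice is an empirical question; the plain mean used in this notebook is the simpler baseline.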
X_bert_train_mean = bert_reduce_mean(X_bert_train)
BERT representations are pretty large:
X_bert_train_mean.shape[1]
768
Now we instantiate and fit a classifier. I picked a TorchShallowNeuralClassifier. Since the input representations are large, I chose a pretty large hidden_dim:
mod = TorchShallowNeuralClassifier(
max_iter=100, hidden_dim=300)
%time _ = mod.fit(X_bert_train_mean, y_train)
Finished epoch 100 of 100; error is 0.17165078409016132
CPU times: user 2min 18s, sys: 595 ms, total: 2min 18s
Wall time: 20.3 s
Evaluation proceeds as you would expect:
X_bert_dev_mean = bert_reduce_mean(X_bert_dev)
bert_sent_preds = mod.predict(X_bert_dev_mean)
print(classification_report(y_dev, bert_sent_preds, digits=3))
              precision    recall  f1-score   support
    negative      0.713     0.673     0.692       428
     neutral      0.348     0.314     0.330       229
    positive      0.714     0.788     0.749       444
   micro avg      0.645     0.645     0.645      1101
   macro avg      0.592     0.592     0.591      1101
weighted avg      0.638     0.645     0.640      1101
It is straightforward to conduct experiments like the above using sst.experiment, which will enable you to do a wider range of experiments without writing or copy-pasting a lot of code.
Per the guidelines at Han Xiao's "BERT as a Service", it would be prohibitively slow to call bc.encode on all our sentences individually. To address this, I suggest first creating a look-up for the precomputed BERT representations and then having your feature function simply use this look-up:
bert_lookup = {}
for (sents, reps) in ((X_str_train, X_bert_train_mean),
                      (X_str_dev, X_bert_dev_mean)):
    assert len(sents) == len(reps)
    for s, rep in zip(sents, reps):
        bert_lookup[s] = rep
def bert_sentence_phi(tree):
    s = " ".join(tree.leaves())
    return bert_lookup[s]
def fit_wide_shallow_network(X, y):
    mod = TorchShallowNeuralClassifier(
        max_iter=100, hidden_dim=300)
    mod.fit(X, y)
    return mod
%%time
_ = sst.experiment(
SST_HOME,
bert_sentence_phi,
fit_wide_shallow_network,
train_reader=sst.train_reader,
assess_reader=sst.dev_reader,
class_func=sst.ternary_class_func,
vectorize=False)
Finished epoch 100 of 100; error is 0.16109364107251167
              precision    recall  f1-score   support
    negative      0.680     0.736     0.707       428
     neutral      0.299     0.192     0.234       229
    positive      0.703     0.777     0.738       444
   micro avg      0.639     0.639     0.639      1101
   macro avg      0.561     0.568     0.560      1101
weighted avg      0.610     0.639     0.621      1101
CPU times: user 2min 18s, sys: 481 ms, total: 2min 18s
Wall time: 21.5 s
We can also use BERT representations as the input to an RNN. There is just one key change from how we used these models before:
Previously, we would feed in lists of tokens, and they would be converted to indices into a fixed embedding space. This presumes that all words have the same representation no matter what their context is.
With BERT, we skip the embedding entirely and just feed in lists of BERT vectors, which means that the same word can be represented in different ways.
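The contrast between these two input styles can be sketched with a toy example. Everything here is hypothetical and for illustration only: the vocabulary, the random embedding matrix, and especially toy_contextual, which is a crude stand-in that mixes in the sentence mean and is not how BERT computes its representations.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy static embedding space: one fixed row per vocabulary item.
vocab = ["the", "bank", "river", "money"]
embedding = rng.normal(size=(len(vocab), 4))
word2idx = {w: i for i, w in enumerate(vocab)}


def static_lookup(tokens):
    """Tokens -> indices -> rows of the fixed embedding matrix."""
    return embedding[[word2idx[w] for w in tokens]]


def toy_contextual(tokens):
    """Stand-in for a contextual model: each vector depends on the sentence."""
    vecs = static_lookup(tokens)
    return vecs + vecs.mean(axis=0, keepdims=True)


s1 = ["the", "river", "bank"]
s2 = ["the", "money", "bank"]

# With a static embedding, "bank" gets the same vector in both sentences.
same_static = np.allclose(static_lookup(s1)[2], static_lookup(s2)[2])

# With context-dependent vectors, "bank" differs across the two sentences.
same_contextual = np.allclose(toy_contextual(s1)[2], toy_contextual(s2)[2])

# What the classifier receives in the second style: one
# (n_tokens, dim) array per example, with no vocabulary involved.
X_vectors = [toy_contextual(s1), toy_contextual(s2), toy_contextual(["the", "bank"])]

print(same_static, same_contextual)   # True False
print([x.shape for x in X_vectors])   # [(3, 4), (3, 4), (2, 4)]
```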
TorchRNNClassifier supports this via use_embedding=False. In turn, you needn't supply a vocabulary:
bert_rnn = TorchRNNClassifier(
vocab=[],
max_iter=50,
use_embedding=False)
%time _ = bert_rnn.fit(X_bert_train, y_train)
Finished epoch 50 of 50; error is 3.3966610431671143
CPU times: user 31min 6s, sys: 2min 27s, total: 33min 34s
Wall time: 10min 11s
bert_rnn_preds = bert_rnn.predict(X_bert_dev)
print(classification_report(y_dev, bert_rnn_preds, digits=3))
              precision    recall  f1-score   support
    negative      0.778     0.614     0.687       428
     neutral      0.308     0.341     0.324       229
    positive      0.727     0.836     0.778       444
   micro avg      0.647     0.647     0.647      1101
   macro avg      0.605     0.597     0.596      1101
weighted avg      0.660     0.647     0.648      1101
Using ELMo is very similar to using BERT. I'll illustrate just with an RNN.
When first run, the following command downloads
elmo_2x4096_512_2048cnn_2xhighway_options.json
elmo_2x4096_512_2048cnn_2xhighway_weights.hdf5
directly from S3 to a local temp directory. Use options_file and weight_file to ask ElmoEmbedder to use a specified pair of model files. For additional details:
https://github.com/allenai/allennlp/blob/master/allennlp/commands/elmo.py
elmo = ElmoEmbedder()
The ELMo interface requires tokenized input. I believe the following tokenizer matches the behavior of the one used by the team to create the representations:
tokenizer = TreebankWordTokenizer()
elmo_train_toks = [tokenizer.tokenize(ex) for ex in X_str_train]
elmo_dev_toks = [tokenizer.tokenize(ex) for ex in X_str_dev]
Here we create the representations for the train and dev sets:
X_elmo_train_layers = list(elmo.embed_sentences(elmo_train_toks))
X_elmo_dev_layers = list(elmo.embed_sentences(elmo_dev_toks))
Each example in X_elmo_train_layers and X_elmo_dev_layers is an array with three dimensions:
X_elmo_dev_layers[0].shape
(3, 13, 1024)
For each word (second dimension), there are three layers of length 1024. So ELMo representations are even larger than BERT ones!
There are many ways we could combine the layers available for each word. Here, I'll use the mean:
def elmo_layer_reduce_mean(elmo_vecs):
    return [ex.mean(axis=0) for ex in elmo_vecs]
X_elmo_train = elmo_layer_reduce_mean(X_elmo_train_layers)
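The mean is only one of the ways to combine the three layers. As a sketch of some alternatives, the functions below keep just the top layer, concatenate the layers, or take a fixed weighted sum. All three function names are hypothetical, the weights are hand-set rather than learned, and the toy arrays merely mimic the (3, n_tokens, 1024) shape shown above.

```python
import numpy as np


def elmo_layer_top(elmo_vecs):
    """Keep only the last (top) layer: (3, n, d) -> (n, d)."""
    return [ex[-1] for ex in elmo_vecs]


def elmo_layer_concat(elmo_vecs):
    """Concatenate the layers for each token: (3, n, d) -> (n, 3 * d)."""
    return [np.concatenate(list(ex), axis=1) for ex in elmo_vecs]


def elmo_layer_weighted(elmo_vecs, weights):
    """Fixed weighted sum over layers: (3, n, d) -> (n, d).

    The weights are normalized to sum to 1; equal weights give the mean.
    """
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    return [np.tensordot(w, ex, axes=(0, 0)) for ex in elmo_vecs]


# Toy stand-ins for two sentences of 13 and 5 tokens.
toy_layers = [np.ones((3, 13, 1024)), np.ones((3, 5, 1024))]

print(elmo_layer_top(toy_layers)[0].shape)                       # (13, 1024)
print(elmo_layer_concat(toy_layers)[0].shape)                    # (13, 3072)
print(elmo_layer_weighted(toy_layers, [1.0, 2.0, 1.0])[1].shape)  # (5, 1024)
```

Any of these returns a list of per-example arrays, so the result can be passed to the RNN in the same way as the output of elmo_layer_reduce_mean; concatenation triples the input dimensionality, which will slow training.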
Now we can fit an RNN as usual:
elmo_rnn = TorchRNNClassifier(
vocab=[],
max_iter=50,
use_embedding=False)
%time _ = elmo_rnn.fit(X_elmo_train, y_train)
Finished epoch 50 of 50; error is 0.09299760637804866
CPU times: user 16min 2s, sys: 1min 32s, total: 17min 34s
Wall time: 4min 42s
Evaluation proceeds in the usual way:
X_elmo_dev = elmo_layer_reduce_mean(X_elmo_dev_layers)
elmo_rnn_preds = elmo_rnn.predict(X_elmo_dev)
print(classification_report(y_dev, elmo_rnn_preds, digits=3))
              precision    recall  f1-score   support
    negative      0.706     0.678     0.691       428
     neutral      0.292     0.258     0.274       229
    positive      0.734     0.806     0.768       444
   micro avg      0.642     0.642     0.642      1101
   macro avg      0.577     0.581     0.578      1101
weighted avg      0.631     0.642     0.635      1101
To round things out, here's an example of how to use sst.experiment with ELMo, for more compact and maintainable experimental code:
def elmo_sentence_phi(tree):
    vecs = elmo.embed_sentence(tree.leaves())
    return vecs.mean(axis=0)
def fit_elmo_rnn(X, y):
    mod = TorchRNNClassifier(
        vocab=[],
        max_iter=50,
        use_embedding=False)
    mod.fit(X, y)
    return mod
%%time
_ = sst.experiment(
SST_HOME,
elmo_sentence_phi,
fit_elmo_rnn,
train_reader=sst.train_reader,
assess_reader=sst.dev_reader,
class_func=sst.ternary_class_func,
vectorize=False)
Finished epoch 50 of 50; error is 0.021800976479426026
              precision    recall  f1-score   support
    negative      0.707     0.715     0.711       428
     neutral      0.344     0.245     0.286       229
    positive      0.733     0.833     0.780       444
   micro avg      0.665     0.665     0.665      1101
   macro avg      0.594     0.598     0.592      1101
weighted avg      0.642     0.665     0.650      1101
CPU times: user 3h 22min 56s, sys: 2min 25s, total: 3h 25min 22s
Wall time: 51min 40s