Playing with transformers

By Allison Parrish

(This is a rough draft!)

Transformers is a Python library released by Hugging Face to make it easy to use pre-trained transformer language models. This notebook takes you through the basics of how to generate text with this library, and demonstrates a few simple techniques you can use to exert finer-grained control over the text generation procedure, like logit warping and fine-tuning.

Warning

Before you begin, I recommend reading On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? by Bender et al. The paper outlines the difficulties and potential for harm in large pre-trained language models, and suggests some techniques for training and making use of these models responsibly. For my part, I have the following recommendations:

  • Don't use pre-trained language models in an automated context. By "automated context" I mean things like web apps, Twitter bots, etc. You just can't guarantee that the output of one of these models won't be harmful. (I might make an exception to this rule if your automated service has very robust moderation, but even then, it's iffy.)
  • Always read through all of your language model outputs before publishing them.
  • Don't use language models to trick people.

What is a "transformer" though

"Transformer" is a name applied to neural network architectures that make use of a mechanism called "attention" and can be trained in parallel (rather than sequentially, as is the case with other neural network architectures that are frequently used to model sequences, like recurrent neural networks). The introduction of this architecture initiated a period of tremendous growth in language model capabilities. (Examples of Transformer models include GPT-2 and Google T5.)

This growth is mostly predicated on the fact that the transformer architecture makes it more practical to train language models on larger and larger datasets. As of this writing, state-of-the-art transformer models are often trained on datasets many hundreds of gigabytes in size, and consequently take a tremendous amount of energy (and money, and time) to train. In many cases, it's not practical to train a transformer model from scratch on your own that has the same capabilities. Instead, researchers and artists make use of models that other organizations have trained.

Hugging Face Transformers

That's where the Hugging Face Transformers library comes in. It's an easy interface for downloading pre-trained transformer models and making use of them with a consistent API. (For example, we can use the same code to generate text with GPT-2 and XLNet). To install Transformers, you'll first need to install PyTorch. If you're running this notebook with Anaconda, you can just run the code in the following cell:

In [ ]:
import sys
!conda install --prefix {sys.prefix} -y -c pytorch pytorch

(Note: You can also use Transformers with TensorFlow, but in practice I've found that PyTorch support in Transformers is better.)

Now you can install Transformers by running the following cell:

In [ ]:
import sys
!{sys.executable} -m pip install transformers

Quick start: load a model and generate text

I'm going to show you how to load a pre-trained model from the Hugging Face Models directory. First, you need to import the relevant parts of the Transformers library. I'm using the Auto classes, which automatically load the correct code based on the model that you choose. AutoModelForCausalLM (that's "causal," not "casual") is the class you use for text generation tasks (where you want to generate the next token in a sequence).

In [1]:
from transformers import pipeline, AutoModelForCausalLM, AutoTokenizer

In this notebook, I'm going to use distilgpt2, a "distilled" version of OpenAI's GPT-2 model. The primary benefit of this model is that it is small and fast—it generates text in a speedy fashion even on my old 2013 MacBook Air. You need to load both the model and its associated tokenizer. (We'll talk about the tokenizer in more detail below.) To load, use the .from_pretrained() method of the appropriate Auto class, like so:

In [2]:
tokenizer = AutoTokenizer.from_pretrained('distilgpt2')
model = AutoModelForCausalLM.from_pretrained('distilgpt2')

This might take a little while! The Transformers library downloads the model when you call the .from_pretrained() method, and some of the models are very large. (DistilGPT2, at 300MB, is on the small side.) The library will cache these files for later, so you won't need to download them again on the same machine.

Once you have the tokenizer and the model, you can create a Transformers pipeline. A pipeline groups together and abstracts away the intermediate steps of a machine learning procedure. The Transformers library has many types of pipeline, but we're going to create a text generation pipeline, using the model and tokenizer that we just loaded. Here's what it looks like:

In [3]:
generator = pipeline('text-generation', model=model, tokenizer=tokenizer)

Having created this pipeline, we can use it to generate text by calling the pipeline object as though it were a function. The parameter that you pass in is the "prompt"—i.e., the text whose completion you want to predict.

In [4]:
generator("Two roads diverged in a yellow wood, and")
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Out[4]:
[{'generated_text': 'Two roads diverged in a yellow wood, and the red-colored wooden road was on both sides of the road.\n\n\nA police officer is believed to have responded to a report of a road accident on January 24th near a village near'}]

This call returns a list of dictionaries, where each dictionary has a key generated_text whose value is the text that was generated. If you just want the generated string, you can do this:

In [5]:
generator("Two roads diverged in a yellow wood, and")[0]['generated_text']
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Out[5]:
'Two roads diverged in a yellow wood, and the trees were a little dark on some sides of the road.\n\n\nThe road had been closed for six days.'

There are other parameters that you can pass to the text generation pipeline—basically, you can pass in any of the parameters that you could pass to a causal language model's .generate() method. We'll go over those parameters in a second, but before we do so, I think it'll be helpful to take a step back and examine what exactly is happening in the generation process.

Tokenization

Machine learning models don't work on text directly; instead, they operate on numbers that correspond to parts of a text. Breaking a text up into enumerable parts is called tokenization. In this class, we've already explored several easy and common forms of tokenization, e.g., breaking a text up into characters, or breaking a text up into words. Most machine learning models now use a form of sub-word tokenization, in which a text is broken up into units that don't neatly correspond to either individual characters or whole words. The tokenization procedure itself is derived from statistical properties of the corpus—so tokenizers are, in a sense, "trained" in the same way that a machine learning model is. (This is why you have to load the tokenizer, the same way you load a model.)

A tokenizer has a vocabulary, which is the set of all possible unique tokens that the tokenizer recognizes. You can examine the vocabulary by calling the tokenizer's .get_vocab() method:

In [6]:
vocab = tokenizer.get_vocab()

Use len() to see how many items are in the vocabulary:

In [7]:
len(vocab)
Out[7]:
50257

The vocabulary is returned as a dictionary that maps tokens to their IDs. Let's just take a peek into what that looks like. I'm going to randomly sample a few items from the dictionary, like so:

In [9]:
import random
random.sample(list(vocab.items()), 10)
Out[9]:
[('341', 33660),
 ('Ġplayer', 2137),
 ('roach', 28562),
 ('ĠRobotics', 47061),
 ('isition', 10027),
 ('ĠClothing', 48921),
 ('ĠJac', 8445),
 ('Ġtame', 37812),
 ('acc', 4134),
 ('EDIT', 24706)]

The results look pretty weird, and there are a bunch of things to explain. First off, let's discuss the mysterious Ġ character. Subword tokenizers generally don't start off with information about where word boundaries occur; instead, they "learn" word boundaries as part of the process of "training" the tokenizer. The Ġ character is a special character that represents a space, so a token that begins with Ġ includes the space in front of it. Second, we can see that in many cases, the subword tokenizer does actually end up with tokens in its vocabulary that represent entire words. However, in other cases, we end up with what look like word parts. This is by design! Because some tokens represent word parts, the tokenizer can potentially encode any word, even words that were not present in the original corpus, by tokenizing that word as a sequence of parts.

To demonstrate, let's actually encode a string with the tokenizer using its .encode() method. Just pass in a string, and you'll get back a list of IDs:

In [10]:
src = "Behold! An alabaster anemone. Zzzzap!"
tokenizer.encode(src)
Out[10]:
[3856,
 2946,
 0,
 1052,
 435,
 397,
 1603,
 281,
 368,
 505,
 13,
 1168,
 3019,
 89,
 499,
 0]

The tokenizer encodes this string of five words into sixteen tokens. You can find the token corresponding to the ID using the tokenizer's .decode() method:

In [11]:
tokenizer.decode(1603)
Out[11]:
'aster'

With this, we can see how the tokenizer broke up the original string into units:

In [12]:
for token_id in tokenizer.encode(src):
    print(token_id, "→", "'" + tokenizer.decode(token_id) + "'")
3856 → 'Be'
2946 → 'hold'
0 → '!'
1052 → ' An'
435 → ' al'
397 → 'ab'
1603 → 'aster'
281 → ' an'
368 → 'em'
505 → 'one'
13 → '.'
1168 → ' Z'
3019 → 'zz'
89 → 'z'
499 → 'ap'
0 → '!'

(I included quotation marks in this output to emphasize the fact that the text of the token includes whitespace.)
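
To make the "sequence of parts" idea concrete, here is a toy sketch of a subword tokenizer. It is not how GPT-2's tokenizer works internally (that one uses byte-pair encoding with learned merge rules); it just greedily takes the longest matching piece from a tiny vocabulary that I made up. The names toy_vocab, toy_encode and toy_decode are mine and are not part of Transformers.

```python
# A tiny, hand-made vocabulary (hypothetical; distilgpt2's has 50,257 entries).
# It holds a few multi-character pieces plus single characters as a fallback.
pieces = ["Be", "hold", "!", " An", " al", "ab", "aster", " an", "em", "one", "."]
characters = list(" abcdefghijklmnopqrstuvwxyz")
toy_vocab = {token: idx for idx, token in enumerate(pieces + characters)}

def toy_encode(text, vocab):
    """Greedily split text into the longest pieces found in vocab."""
    ids = []
    pos = 0
    while pos < len(text):
        # try the longest candidate first, then shorter and shorter ones
        for end in range(len(text), pos, -1):
            piece = text[pos:end]
            if piece in vocab:
                ids.append(vocab[piece])
                pos = end
                break
        else:
            raise ValueError(f"no token covers {text[pos]!r}")
    return ids

def toy_decode(ids, vocab):
    """Look each ID up and glue the pieces back together."""
    id_to_token = {idx: token for token, idx in vocab.items()}
    return "".join(id_to_token[i] for i in ids)

ids = toy_encode(" alabaster", toy_vocab)
print(ids)                                   # IDs for ' al', 'ab', 'aster'
print(toy_decode(ids, toy_vocab))            # the original string, restored
print(len(toy_encode("zebra", toy_vocab)))   # no larger piece fits, so one token per letter
```

Because the vocabulary includes single characters, any lowercase word can be encoded, even one the vocabulary has never "seen" as a whole; it just costs more tokens.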

You can decode an entire list of IDs using the .decode() method as well:

In [13]:
token_ids = tokenizer.encode(src)
tokenizer.decode(token_ids)
Out[13]:
'Behold! An alabaster anemone. Zzzzap!'

For fun, get the tokenizer to decode a list of random token IDs:

In [14]:
tokenizer.decode(random.sample(list(vocab.values()), 12))
Out[14]:
' Viktor mimiczanne Bentleypoint Bee Awakening candJP redumann Labor'

Another way to tokenize a text is to call the tokenizer as though it's a function, passing in a list of strings as an argument:

In [15]:
tokenizer(["this is a test", "this is another test"], return_tensors="pt")
Out[15]:
{'input_ids': tensor([[5661,  318,  257, 1332],
        [5661,  318, 1194, 1332]]), 'attention_mask': tensor([[1, 1, 1, 1],
        [1, 1, 1, 1]])}

The value returned here is a dictionary in the format that the model is expecting, if you want to run the model "by hand" instead of using a pipeline, which is what we're going to do below. The return_tensors parameter directs the tokenizer to return the results as a PyTorch tensor instead of a Python list, which is also a requirement for passing the values directly to the model.

Generation in more detail (advanced, but interesting)

So what's actually happening when you ask the model to generate text is this: you encode the prompt as a sequence of IDs using the tokenizer, and then the model assigns a probability to every token in the tokenizer's vocabulary, based on which tokens it thinks are most likely to come next. Here's what it looks like to run that process "by hand," so to speak. First, create the prompt:

In [16]:
prompt = "Two roads diverged in a yellow wood, and"

Then encode the prompt as a sequence of token IDs:

In [17]:
prompt_encoded = tokenizer([prompt], return_tensors="pt")
In [18]:
prompt_encoded
Out[18]:
{'input_ids': tensor([[ 7571,  9725, 12312,  2004,   287,   257,  7872,  4898,    11,   290]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}

Then we call the model as though it were a function, passing in the key/value pairs that the tokenizer returned as parameters (using Python's ** operator):

In [19]:
result = model(**prompt_encoded)

The value returned from calling the model is an object with various attributes that we could examine. I'm most interested in .logits, which is a PyTorch tensor that contains information about the probability that the model assigned to each vocabulary item. ("Tensor" btw is just a fancy word for "array with a bunch of dimensions.") The prediction for the next token can be found in the very last row of this tensor:

In [20]:
next_token_probs = result.logits[0,-1]
next_token_probs
Out[20]:
tensor([-75.0658, -73.7585, -76.3894,  ..., -75.8365, -73.6258, -73.6459],
       grad_fn=<SelectBackward>)

The scores are shown in "raw" form, meaning that they don't have the kinds of values that we would normally associate with a probability distribution (i.e., multiple options all adding up to one). But we can still compare them in this state. Higher numbers mean higher probability.

This tensor has a shape that corresponds to the number of vocabulary items:

In [21]:
next_token_probs.shape
Out[21]:
torch.Size([50257])
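
The [0,-1] index is easier to read once you know how the logits are laid out: one axis for the prompts in the batch, one for the token positions in each prompt, and one for the vocabulary. Here's a sketch of that layout using plain nested lists and made-up numbers (one prompt, three positions, a vocabulary of only four tokens). The variable names are my own.

```python
# Made-up scores shaped like (batch, position, vocabulary):
# one prompt in the batch, three token positions, a four-token vocabulary.
toy_logits = [
    [
        [0.1, 0.2, 0.3, 0.4],    # scores for what follows position 0
        [1.5, 0.5, 0.0, -1.0],   # scores for what follows position 1
        [-2.0, 3.0, 0.5, 1.0],   # scores for what follows the whole prompt
    ]
]

# The nested-list equivalent of result.logits[0,-1]:
# first (and only) prompt in the batch, last position.
next_token_scores = toy_logits[0][-1]
print(len(next_token_scores))  # one score per vocabulary entry

# The index with the highest score is the token considered most likely.
best_id = max(range(len(next_token_scores)), key=lambda i: next_token_scores[i])
print(best_id)
```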

And we can actually inquire about the probability of particular tokens by looking them up. The code in the following cells uses the tokenizer's .encode() method to convert a token to its ID, then looks up the ID by index in the array with the predictions. (The .item() call converts the resulting PyTorch tensor to a native Python value, which just makes the result a bit easier to look at.)

In [22]:
next_token_probs[tokenizer.encode(' the')].item()
Out[22]:
-63.117549896240234
In [23]:
next_token_probs[tokenizer.encode(' x')].item()
Out[23]:
-73.56358337402344
In [24]:
next_token_probs[tokenizer.encode(' an')].item()
Out[24]:
-66.07609558105469

We can see that the tokens ' the' and ' an' have fairly high scores (and so fairly high probability), while the token ' x' has a low one. Interesting!
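
Raw scores like these become probabilities through a function called softmax. The sketch below applies softmax to just the three scores printed above. That's a simplification on my part: the real calculation runs over all 50,257 scores at once, so the numbers here only show how these three tokens compare with each other.

```python
import math

def softmax(scores):
    """Turn raw scores into positive numbers that add up to one."""
    highest = max(scores)  # subtracting the max keeps exp() from overflowing
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Scores copied from the cells above.
scores = {" the": -63.117549896240234,
          " an": -66.07609558105469,
          " x": -73.56358337402344}

relative = dict(zip(scores, softmax(list(scores.values()))))
for token, prob in relative.items():
    print(repr(token).ljust(7), f"{prob:.5f}")
```

A difference of a few points in the raw scores turns into a large difference in probability, because softmax exponentiates the scores before normalizing them.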

Generating text, the home-grown way

Using the PyTorch library, we can get a list of the most likely tokens to come next. (This has some dark magic in it if you're not familiar with PyTorch—or another array processing library like NumPy—so... just trust me for a sec.)

In [25]:
import torch
for idx in reversed(torch.argsort(next_token_probs)[-12:]):
    print("'" + tokenizer.decode(idx) + "'")
' the'
' a'
' two'
' one'
' it'
' were'
' some'
' there'
' they'
' in'
' then'
' another'

(Again, I've added in the quotation marks so you can clearly see that these tokens have whitespace at the beginning.) These are the top twelve tokens to come next in the sequence, as predicted by the model. One way to generate a text would be to take one of these tokens (maybe the top-scoring token, maybe one of the top n picked at random), append it to our original list of tokens, ask the model to make a prediction on that list of tokens, and repeat. The loop would look something like this:

In [26]:
prompt = "Two roads diverged in a yellow wood, and"
for i in range(10):
    # encode the prompt
    prompt_encoded = tokenizer([prompt], return_tensors="pt")
    # run a forward pass on the network
    result = model(**prompt_encoded)
    # get the probabilities for the next word
    next_token_probs = result.logits[0,-1]
    # sort by value, get the top 12 (you can change this number! try 1, or 1000)
    nexts = torch.argsort(next_token_probs)[-12:]
    # append the decoded ID to the current prompt
    prompt += tokenizer.decode(random.choice(nexts))
    print(prompt)
Two roads diverged in a yellow wood, and the
Two roads diverged in a yellow wood, and the water
Two roads diverged in a yellow wood, and the water had
Two roads diverged in a yellow wood, and the water had a
Two roads diverged in a yellow wood, and the water had a high
Two roads diverged in a yellow wood, and the water had a high degree
Two roads diverged in a yellow wood, and the water had a high degree and
Two roads diverged in a yellow wood, and the water had a high degree and high
Two roads diverged in a yellow wood, and the water had a high degree and high levels
Two roads diverged in a yellow wood, and the water had a high degree and high levels to

The .generate() method

Our home-grown solution above does the job, but it's very rudimentary. Because generating text is such a common use-case, the Transformers library provides a .generate() method that is quite fast and also has a bunch of bells and whistles that we can exploit to add expressiveness to our use of the language model. Under the hood, though, the .generate() method is essentially doing exactly what we did above—iteratively constructing a string based on predicted tokens from the model. Use the .generate() method like this:

In [29]:
prompt = "Two roads diverged in a yellow wood, and"
prompt_encoded = tokenizer(prompt, return_tensors="pt") # the "return_tensors" thing is important!
result = model.generate(**prompt_encoded)[0]
tokenizer.decode(result, skip_special_tokens=True)
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Out[29]:
"Two roads diverged in a yellow wood, and over the course of the past month's traffic jam in the United States with more than 1 million drivers from 18 states.\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n"

For a more detailed overview of all of the ways you can use .generate(), see How to generate on the Hugging Face blog. One argument of .generate() that is useful right off the bat is max_length, which sets the maximum length of the result in tokens (the prompt's tokens count toward that total). Generation can stop sooner if the model produces its end-of-text token:

In [30]:
prompt = "Two roads diverged in a yellow wood, and"
prompt_encoded = tokenizer(prompt, return_tensors="pt")
result = model.generate(**prompt_encoded, max_length=250)[0]
tokenizer.decode(result, skip_special_tokens=True)
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Out[30]:
'Two roads diverged in a yellow wood, and over the edge of an ice cube, the police found the body of a boy in the bushes in the middle of the road. The woman is not being identified.\n\n\n\nA 26-year-old man was charged with first-degree murder after authorities said he was found in a snowboard.\n\nPolice say, with an estimated 10,000 residents, the area around the snowboard became unsafe.\n"There could have been more than two dozen people out there and several other people running through, and it was a lot more dangerous that maybe one or two more people from other communities were all moving through," the New York State Police spokesperson said.\nAccording to the New York City Police Department, the driver involved in the incident was identified as a 43-year-old man.\n"The victim was transported to hospitals and transported to Albany Hospital where he is in stable condition," New York State Police said in a statement.\n"There have been many reports of minor injuries and there is no further information as to the cause of the incident."'
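
Here's a toy sketch of how a length limit like max_length interacts with the prompt and with an end-of-text marker. The "model" is a made-up lookup table (toy_next) and toy_generate is my own stand-in, not the library's implementation. What it has in common with .generate() is that the limit counts the prompt's tokens too, and that generation can stop before the limit is reached.

```python
EOS = "[EOS]"  # stand-in for the model's end-of-text token

# Hypothetical "model": each token has exactly one possible successor.
toy_next = {
    "Two": "roads", "roads": "diverged", "diverged": "in",
    "in": "a", "a": "yellow", "yellow": "wood", "wood": EOS,
}

def toy_generate(prompt_tokens, max_length):
    """Extend the prompt until max_length tokens or the end-of-text marker."""
    tokens = list(prompt_tokens)
    while len(tokens) < max_length:       # the prompt counts toward the limit
        next_token = toy_next.get(tokens[-1], EOS)
        if next_token == EOS:             # stop early at end-of-text
            break
        tokens.append(next_token)
    return tokens

short = toy_generate(["Two", "roads"], max_length=4)
full = toy_generate(["Two", "roads"], max_length=50)
print(short)  # cut off by the limit
print(full)   # stopped by the end-of-text marker, well short of 50
```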

Back to the pipeline

The process of encoding the prompt and decoding the results is pretty tedious. That's why the "pipeline" was invented. The text-generation pipeline takes care of encoding and decoding for you. Create a pipeline by calling pipeline with text-generation as the first parameter, and then the model and tokenizer that you want to use:

In [31]:
generator = pipeline('text-generation', model=model, tokenizer=tokenizer)

And then you can generate text with the pipeline. The first argument is the prompt; any remaining parameters will be forwarded to the model's .generate() method.

In [32]:
generator("Two roads diverged in a yellow",
          max_length=100)
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Out[32]:
[{'generated_text': 'Two roads diverged in a yellow Toyota Camry that was reported missing Sunday.\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n'}]

As a reminder, to get the actual generated text, use indexing to get the value of the dictionary in the list returned from the pipeline:

In [33]:
generator("Two roads diverged in a yellow",
          max_length=100)[0]['generated_text']
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Out[33]:
"Two roads diverged in a yellow car. The SUV ended up in the back of the car, then left behind.\n\n\n\n\nThe police say driver of the car crashed into a busy road between the city of Vancouver and Queenstown. One man said three people with yellow plates on their faces ran to the scene.\nTwo vehicles were caught on video smashing into pedestrians.\nAnyone with information is asked to call Detective Inspector Dave O'Brien at 722-370-6278."

Controlling the model

By default, the distilgpt2 model samples from the possible next tokens, weighted by the probability assigned to that token. This strategy leads to text that shows a good deal of variety, but there are strategies that we can use and parameters that we can tweak to exert a little more control over the model's output. In this section, I show a few of these strategies.

The magic of the prompt

Transformer models are often able to follow up on cues you give about the desired content and style of the text in the prompt itself. The smaller transformer models aren't especially good at this, but it's still worth playing around with. For example, to get distilgpt2 to generate something that looks like a movie review:

In [36]:
print(generator("My review of The Road Not Taken, the Movie:", max_length=100)[0]['generated_text'])
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
My review of The Road Not Taken, the Movie: The Next Generation’s most celebrated film. The film, which is a remake of a 1982, 1991, 1993, 2002, 2006 and 2010 film starring James Mazzarri and John Tapper, debuted as the worst box office gross for any movie in 2008. The only other film to get a second Oscar since it became the most controversial movie of all time was a 2009 sequel as well. A box office gross of $24.

You can also generate dialogues and interview transcripts:

In [37]:
print(generator("Allison: I took the road less traveled by.\nRobert Frost:",
                max_length=100)[0]['generated_text'])
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Allison: I took the road less traveled by.
Robert Frost: I went through three hours. I only ate my own steak two days before they finished.
Mason: I ate up the rest of the day.
Mason: Oh my god, this was my first chance at food justice!
The third week, you saw me, I took the road less traveled by. The first week, you saw me, I took the road less traveled by. The second week,

Poetry facts:

In [40]:
print(generator("My favorite facts about poetry:\n\n1.",
                max_length=100)[0]['generated_text'])
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
My favorite facts about poetry:

1. The earliest examples of poetry were of an American poet
2. The writers of Shakespeare and his works
3. The best examples are the first, and in a very short time, the only one that would explain the history of poetry to which it was written. There are only two types of poetry: poetry.
Some of the first is the modern one . It is the most popular of the three. There is the "Pornographic Po

In general, this kind of prompting works best with texts that are likely to have a lot of representation in the training corpus.

Sampling with temperature

As mentioned above, the distilgpt2 model, by default, picks the next token at random, weighted by the probability that the model assigns to the word. To demonstrate how this works, let's imagine that the model only has five tokens in its vocabulary (instead of 50,000+). A schematic illustration of those probabilities might look like this:

prompt: Whose woods these are I think I
probabilities:
    know -> 0.5
    knew -> 0.2
    smell -> 0.15
    see -> 0.1
    am -> 0.05

The probabilities will add up to 1.0. A probability of 0.5 indicates that the token has a 50% probability of coming next; a probability of 0.2 means that the token has a 20% probability of coming next, etc. Here are those probabilities represented in Python as two lists—one with the words, and one with the probabilities that correspond to those words by index:

In [41]:
tokens = ['know', 'knew', 'smell', 'see', 'am']
probs = [0.5, 0.2, 0.15, 0.1, 0.05]

By default, to select the next token, the generation code picks from this list weighted by probability. The code to do this with PyTorch looks like this:

In [42]:
index = torch.multinomial(torch.tensor(probs), 1).item()
print(tokens[index])
know

You don't have to worry about the specifics of this code; I'm just using it to demonstrate how the sampling process works. Run the code a few times and you'll see that about half the time you get "know," the token with the highest probability. Running the code in a loop makes this a bit easier to see:

In [43]:
for i in range(10):
    index = torch.multinomial(torch.tensor(probs), 1).item()
    print(tokens[index])
see
know
know
know
know
know
smell
knew
see
know
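
If you'd rather not involve PyTorch, Python's standard library can make the same kind of weighted draw with random.choices. This sketch draws a large number of samples and counts them, so you can check that the counts roughly track the probabilities. The seed and the number of draws are arbitrary choices of mine.

```python
import random
from collections import Counter

tokens = ['know', 'knew', 'smell', 'see', 'am']
probs = [0.5, 0.2, 0.15, 0.1, 0.05]

random.seed(2021)  # only so that repeated runs give the same counts
draws = random.choices(tokens, weights=probs, k=10000)
counts = Counter(draws)

# Show each token's share of the draws next to its probability.
for token, prob in zip(tokens, probs):
    print(token.ljust(6), prob, counts[token] / len(draws))
```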

The generation process has a parameter called temperature, which lets you shift the probability distribution of the next token before it's sampled. If the temperature parameter is 1.0, then sampling will proceed as normal, with the tokens weighted by their estimated probability. If the temperature parameter is less than 1.0, then tokens that were already probable will get more probable. If the temperature parameter is greater than 1.0, then the probabilities start to even out, approaching a uniform distribution (meaning that no token is more likely to be chosen than any other). To demonstrate this, I've written some code below that applies temperature to the probabilities defined above, and shows the resulting changes:

In [44]:
for temperature in [0.1, 0.35, 1.0, 2.0, 50.0]:
    # divide the log-probabilities by the temperature, then renormalize
    modified = torch.softmax(
        torch.log(torch.tensor(probs)) / temperature, dim=-1)
    print(f"temperature {temperature:.2f}")
    for tok, prob in zip(tokens, modified):
        print(tok.ljust(6), "→", f"{prob:.2f}")
    print()
temperature 0.10
know   → 1.00
knew   → 0.00
smell  → 0.00
see    → 0.00
am     → 0.00

temperature 0.35
know   → 0.90
knew   → 0.07
smell  → 0.03
see    → 0.01
am     → 0.00

temperature 1.00
know   → 0.50
knew   → 0.20
smell  → 0.15
see    → 0.10
am     → 0.05

temperature 2.00
know   → 0.34
knew   → 0.21
smell  → 0.19
see    → 0.15
am     → 0.11

temperature 50.00
know   → 0.20
knew   → 0.20
smell  → 0.20
see    → 0.20
am     → 0.20

You can see that at temperature 1.0, the probabilities are identical to the original. At temperature 0.35, the probability of the most likely token has been boosted, but the other tokens still have a small chance of occurring. At temperature 0.1, the most likely token is all but certain to be selected (the other probabilities round to zero). At temperature 2.0, the most likely token is still the most likely, but the probabilities of the other tokens have been boosted in comparison; at temperature 50.0, the distribution is very nearly uniform, so no token is meaningfully more likely than any other.
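
The softmax-of-log expression in the cell above has a plain-arithmetic equivalent: raise each probability to the power 1/temperature, then rescale so the results add up to one again. Here's a sketch in ordinary Python, with no PyTorch needed. The function name apply_temperature is mine.

```python
def apply_temperature(probs, temperature):
    """Reweight probabilities: p ** (1 / temperature), then renormalize."""
    powered = [p ** (1.0 / temperature) for p in probs]
    total = sum(powered)
    return [p / total for p in powered]

probs = [0.5, 0.2, 0.15, 0.1, 0.05]
for temperature in [0.35, 1.0, 2.0]:
    modified = apply_temperature(probs, temperature)
    print(temperature, [round(p, 2) for p in modified])
```

Seen this way, a temperature below 1.0 means an exponent above 1, which shrinks small probabilities faster than large ones; a temperature above 1.0 means an exponent below 1, which pulls everything closer together.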

To apply temperature sampling to the model when generating text, pass the temperature parameter to the pipeline, like so:

In [45]:
generator("Two roads diverged in a yellow",
          temperature=0.1,
          max_length=100)[0]['generated_text']
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Out[45]:
'Two roads diverged in a yellowish blue sky.\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n'

Low temperatures generally produce predictable, repetitive results. Here's an attempt with high temperature:

In [46]:
generator("Two roads diverged in a yellow",
          temperature=4.0,
          max_length=100)[0]['generated_text']
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Out[46]:
'Two roads diverged in a yellow Tumbalo Lake and the surrounding region, though that might soon cease there." This would explain both changes – perhaps some other reason there is concern," Foulberg tells BuzzFeed on Twitter by telepath and to several in town\'s suburbs."There doesn\'t say very much. This new trailway system allows everyone – no surprise to those of you at HomeStory" to get inside (though she would explain how they chose to be left unpallmed instead on a'

The higher temperature example produces less likely sequences of words, so the text is a bit livelier—sometimes at the cost of coherence.

Adjusting the temperature can be useful when you want the text to be more or less "weird." It can be helpful to adjust the temperature downward when you feel as though the model is producing text that is a bit too unpredictable; it can be helpful to adjust upward when you want the model to take more unexpected turns when generating.

Top-k sampling

By default, the generation process only selects from the top 50 most probable tokens at each step. This is called "top-k filtering." Because of top-k filtering, you're not likely to sample truly unusual tokens even when the temperature is high. You can adjust the threshold for top-k filtering with the top_k parameter of the model. For example, adjusting top_k to the number of items in the vocabulary ensures that every token gets its chance:

In [47]:
generator("Two roads diverged in a yellow",
          top_k=tokenizer.vocab_size,
          max_length=100)[0]['generated_text']
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Out[47]:
'Two roads diverged in a yellow car. Other passengers have been seriously injured. Two men ferry the motorway to Tanong — which ended in the Palbeyan River in the city of Tanong. (Photo: Dan Crafts/Times Free Press) Story Highlights Tanong migrant workers evacuated\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n'

Using this with a temperature greater than 1.0 can yield some unusual turns of phrase:

In [48]:
generator("Two roads diverged in a yellow",
          top_k=tokenizer.vocab_size,
          temperature=1.2,
          max_length=100)[0]['generated_text']
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Out[48]:
'Two roads diverged in a yellow Oueconomy helicopter Apr. 3, 2012 sitting in an unharmed location near coal age swings towards Associated Recording Studios, Calif. As Ulamofar Raiolini shrinks, stuttered civilians evacuate into the hills because militants have been linked there. Photo by Evan Vucci/MHz 2,927 Teenage Leaves Blocks Per Extremely Late Soon Iowa state GOP redistributed voters from dynamrodlot which only drives regional branches of U.S. state. more'

On the other extreme, setting the top_k value to 1 ensures that only the most likely token is chosen at each step. This is the same thing as "greedy decoding":

In [49]:
generator("Two roads diverged in a yellow",
          top_k=1,
          max_length=100)[0]['generated_text']
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Out[49]:
'Two roads diverged in a yellow light, and the police were called to the scene.\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n'

Playing around with top_k and temperature in tandem is a good way to make adjustments to the texture of your generated text.
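
To see what top-k filtering does to a distribution, here's a sketch that reuses the five-token example from the temperature section. It works on probabilities directly, which is a simplification on my part (as far as I know, the library applies the cutoff to the raw scores before they become probabilities, which comes to the same thing). The function name top_k_filter is made up.

```python
def top_k_filter(probs, k):
    """Keep the k most probable tokens, zero out the rest, and renormalize."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep = set(ranked[:k])
    total = sum(probs[i] for i in keep)
    return [probs[i] / total if i in keep else 0.0 for i in range(len(probs))]

tokens = ['know', 'knew', 'smell', 'see', 'am']
probs = [0.5, 0.2, 0.15, 0.1, 0.05]

for k in [1, 2, 5]:
    filtered = top_k_filter(probs, k)
    print(f"top_k={k}", [round(p, 2) for p in filtered])
```

With k set to 1, only one token survives, which is why top_k=1 behaves like greedy decoding; with k equal to the vocabulary size, nothing is filtered out at all.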

Logit warping: Exclude "bad" words

The .generate() method has a parameter called bad_words_ids, which causes the model to zero out the probabilities of tokens associated with words that you pass in. The intended use of this feature is to stop the model from generating offensive or harmful words. But we can also repurpose it for poetic purposes. For example, in the cell below, I make the model complete the prompt "It was a dark and stormy" without using the words "night" or "day":

In [50]:
generator("It was a dark and stormy",
          bad_words_ids=tokenizer([" night", " day"]).input_ids)[0]['generated_text']
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Out[50]:
"It was a dark and stormy storm that set off the town of Nizke. The village of Nizke was deserted and abandoned. The village's people had fled, but they did not understand it. Then when they could finally reach N"

The syntax for specifying the "bad words" is to call the tokenizer on a list of words that you want to exclude, and then get the .input_ids attribute of the value returned from calling the tokenizer. This yields a list of lists that looks like this:

In [51]:
tokenizer(["Allison", "Parrish"]).input_ids
Out[51]:
[[3237, 1653], [47, 3258, 680]]

Note that I passed " night" and " day" with leading spaces. This is necessary because I ended the prompt without whitespace, so the model is likely to generate a token with leading whitespace at the next step. I've found that the bad_words_ids parameter works best if your list of words includes versions both with and without leading whitespace.
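Typing out both versions of every word gets tedious, so here is a small helper that builds the list for you. The function name with_space_variants is my own invention for this sketch and is not part of the Transformers library.

```python
def with_space_variants(words):
    """Return each word both bare and with a single leading space."""
    variants = []
    for word in words:
        bare = word.strip()
        for candidate in (bare, " " + bare):
            # Skip duplicates so repeated input words don't bloat the list.
            if candidate not in variants:
                variants.append(candidate)
    return variants

banned_words = with_space_variants(["night", "day"])
print(banned_words)  # ['night', ' night', 'day', ' day']
```

Assuming the tokenizer object from earlier in the notebook, you would then pass tokenizer(banned_words).input_ids as the bad_words_ids value.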

Here's another example: getting the model to complete a prompt without using any forms of the verb to be:

In [52]:
generator("Once upon a time,",
          bad_words_ids=tokenizer(
              ["be", " be",
               "am", " am",
               "are", " are",
               "is", " is",
               "was", " was",
               "were", " were"]).input_ids,
          max_length=100)[0]['generated_text']
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Out[52]:
'Once upon a time, on the other end of the planet at the top of a galaxy, it became apparent that the star, the Milky Way, had the same gravitational background as the stars in this study. When those planets moved at great distances from the outside, they made much larger changes to their properties. However, once they reached the core of the Milky Way they began to move farther and farther in the opposite direction, and that distance gained increased further. Finally, before reaching the core of the'
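To get a feel for what banning a token does to the model's predictions, here is a rough sketch in plain Python. It is not the library's own code: the four-token vocabulary and its scores are made up, and it only covers the simple case where each banned word is a single token. The idea is to set the banned tokens' scores to negative infinity before converting scores to probabilities, so that the remaining tokens share all of the probability mass.

```python
import math

def softmax(logits):
    """Convert raw scores to probabilities; -inf scores become exactly 0."""
    biggest = max(x for x in logits if x != -math.inf)
    exps = [math.exp(x - biggest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def ban_tokens(logits, banned_ids):
    """Return a copy of `logits` with every banned index set to -inf."""
    return [-math.inf if i in banned_ids else x for i, x in enumerate(logits)]

# Hypothetical next-token scores; the model would "prefer" a form of "to be".
toy_vocab = [" was", " is", " became", " had"]
toy_logits = [4.0, 2.0, 1.5, 1.0]

# Ban " was" (index 0) and " is" (index 1).
probs = softmax(ban_tokens(toy_logits, {0, 1}))
for token, p in zip(toy_vocab, probs):
    print(repr(token), round(p, 3))
```

The banned tokens end up with a probability of exactly zero, and the model's third choice becomes the most likely continuation.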

You can also create a list of token IDs that you want to exclude on the fly. In the following example, I make a list of token IDs that have the letter e in them, and pass that list to the bad_words_ids parameter:

In [53]:
forbidden_ids = []
for key, val in tokenizer.get_vocab().items():
    if 'e' in key:
        forbidden_ids.append([val]) # needs to be a list of lists
print(generator("Last month, I",
          bad_words_ids=forbidden_ids,
          max_length=100)[0]['generated_text'])
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Last month, I saw a lot of this, and I think it shows just how hard it is to avoid using Bitcoin just by looking at how far off that block diagram can go for Bitcoin's full functionality. First off, I had to go through a lot of data mining, which is why I want to start to look at a small part of why Bitcoin is a cash-only, anonymous, and anonymous Bitcoin.

For my first half hour and hour of work, I had to wait
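One thing to notice about the loop above is that it only checks for a lowercase e, so tokens containing an uppercase E are still allowed. Here is a sketch of the same filtering idea with the test factored out into a function you can swap. The helper name forbidden_token_ids and the miniature vocabulary are made up for illustration; the real vocabulary comes from tokenizer.get_vocab().

```python
def forbidden_token_ids(vocab, is_forbidden):
    """Build a bad_words_ids-style list of lists from a token -> ID mapping."""
    return [[token_id] for token, token_id in vocab.items() if is_forbidden(token)]

# Hypothetical miniature vocabulary. "Ġ" is how GPT-2's vocabulary marks a
# leading space.
toy_vocab = {"Ġthe": 0, "Ġcat": 1, "END": 2, "Ġsat": 3, "Ġhere": 4}

# Matches the notebook's test: lowercase "e" only.
lowercase_only = forbidden_token_ids(toy_vocab, lambda tok: "e" in tok)

# Stricter test: catches uppercase "E" as well.
any_case = forbidden_token_ids(toy_vocab, lambda tok: "e" in tok.lower())

print(lowercase_only)  # [[0], [4]]
print(any_case)        # [[0], [2], [4]]
```

Any test on the token string works here, so the same pattern could exclude tokens containing digits, tokens over a certain length, and so on.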

Fine-tuning a model

"Fine-tuning" is a way of slightly modifying a model by training it for a few extra steps on a corpus of your choice. This process adjusts the model's probabilities so that they more closely reflect the probabilities of the source text you train it on. Fine-tuning models with Transformers is a little bit tricky! First, you'll need to install Hugging Face's datasets package:

In [ ]:
import sys
!{sys.executable} -m pip install datasets

And then import it:

In [54]:
import datasets

You'll want to select a text file to fine-tune the model on. Fine-tuning works best on large amounts of text, but fine-tuning is also very slow if you're not using a GPU. For demonstration purposes, I create a special version of Frankenstein that contains only the first 20000 characters, and save it to a file:

In [55]:
# Copy the first 20,000 characters of the full text into a smaller file.
with open("84-0.txt") as source, open("84-0-20k.txt", "w") as fh:
    fh.write(source.read()[:20000])

Then I load this text file as my fine-tuning dataset:

In [56]:
training_data = datasets.load_dataset('text', data_files="84-0-20k.txt")
Using custom data configuration default-8396e67af902a1a6
Downloading and preparing dataset text/default (download: Unknown size, generated: Unknown size, post-processed: Unknown size, total: Unknown size) to /Users/allison/.cache/huggingface/datasets/text/default-8396e67af902a1a6/0.0.0/e16f44aa1b321ece1f87b07977cc5d70be93d69b20486d6dacd62e12cf25c9a5...
Dataset text downloaded and prepared to /Users/allison/.cache/huggingface/datasets/text/default-8396e67af902a1a6/0.0.0/e16f44aa1b321ece1f87b07977cc5d70be93d69b20486d6dacd62e12cf25c9a5. Subsequent calls will reuse this data.

Now, there's a bunch of obligatory processing that we need to do to the data in order to prepare it for the model. This is boilerplate stuff, which I'm not going to go into in detail. If you want details, consult Hugging Face's fine-tuning language models notebook.

First, we tokenize the text:

In [57]:
tokenizer.pad_token = tokenizer.eos_token
tokenized_training_data = training_data.map(
    lambda x: tokenizer(x['text']),
    remove_columns=["text"]
)

Then we break the tokenized text up into fixed-length blocks of block_size tokens (64 here). Leftover tokens that don't fill a complete block are dropped:

In [58]:
block_size = 64
# magic from https://github.com/huggingface/notebooks/blob/master/examples/language_modeling.ipynb
def group_texts(examples):
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    total_length = (total_length // block_size) * block_size
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated_examples.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result
lm_training_data = tokenized_training_data.map(
    group_texts,
    batched=True,
    batch_size=200
)
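The grouping function is easier to follow on a tiny example. The sketch below restates the same logic with block_size passed in as an argument (the name group_into_blocks and the toy token IDs are mine, not from the Hugging Face notebook) and runs it on three short "lines" of made-up token IDs with a block size of 4.

```python
def group_into_blocks(examples, block_size):
    """Concatenate every column's lists, then cut them into equal-sized blocks."""
    # Join all of the per-line token lists end to end.
    concatenated = {k: sum(v, []) for k, v in examples.items()}
    total_length = len(next(iter(concatenated.values())))
    # Round down to a multiple of block_size; the remainder is discarded.
    total_length = (total_length // block_size) * block_size
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated.items()
    }
    # For language modeling, the labels are a copy of the inputs.
    result["labels"] = result["input_ids"].copy()
    return result

# Three tokenized "lines" of different lengths (10 tokens in total).
toy_batch = {"input_ids": [[1, 2, 3], [4, 5], [6, 7, 8, 9, 10]]}
blocks = group_into_blocks(toy_batch, block_size=4)
print(blocks["input_ids"])  # [[1, 2, 3, 4], [5, 6, 7, 8]]
```

Line boundaries disappear in the process, and the last two tokens (9 and 10) are thrown away because they don't fill a whole block.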

Now we import the Trainer class, which implements a training loop.

In [59]:
from transformers import Trainer, TrainingArguments

Running the following cell creates the Trainer object. The output_dir parameter specifies a directory where your fine-tuned model will be saved. The num_train_epochs parameter sets how many "epochs" the trainer will run; one epoch is one pass over the entire dataset. More epochs generally pull the model closer to your corpus (and take proportionally longer), but even one epoch can significantly change the way the model generates text.

In [60]:
trainer = Trainer(model=model,
                  train_dataset=lm_training_data['train'],
                  args=TrainingArguments(
                      output_dir='distilgpt2-finetune-frankenstein20k',
                      num_train_epochs=1,
                      do_train=True,
                      do_eval=False
                  ),
                  tokenizer=tokenizer)

Finally, the cell below will start the training process. If you're running this on a computer without a GPU, it will take a while. You can open this notebook on Google Colab if you want and take advantage of the free GPU that Google lets you use.

In [61]:
trainer.train()
***** Running training *****
  Num examples = 67
  Num Epochs = 1
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 9
[9/9 00:50, Epoch 1/1]
Step Training Loss

Training completed. Do not forget to share your model on huggingface.co/models =)


Out[61]:
TrainOutput(global_step=9, training_loss=4.9099070231119795, metrics={'train_runtime': 56.9456, 'train_samples_per_second': 1.177, 'train_steps_per_second': 0.158, 'total_flos': 2107446755328.0, 'train_loss': 4.9099070231119795, 'epoch': 1.0})
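The "Total optimization steps = 9" figure in the log above can be worked out by hand from the other numbers it reports: 67 examples, a batch size of 8, and 1 epoch. The sketch below assumes a single device, no gradient accumulation (the log reports 1), and that the final partial batch is kept. The function name optimization_steps is made up for this example.

```python
import math

def optimization_steps(num_examples, batch_size, num_epochs):
    """Estimate the number of weight updates, with no gradient accumulation."""
    # The last batch may be smaller than batch_size, hence the ceiling.
    batches_per_epoch = math.ceil(num_examples / batch_size)
    return batches_per_epoch * num_epochs

steps = optimization_steps(num_examples=67, batch_size=8, num_epochs=1)
print(steps)  # 9
```

This is a handy way to estimate how long a run will take before you start it: under the same assumptions, three epochs on this dataset would come to 27 steps.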

Running the cell below will save the model to disk:

In [62]:
trainer.save_model()
Saving model checkpoint to distilgpt2-finetune-frankenstein20k
Configuration saved in distilgpt2-finetune-frankenstein20k/config.json
Model weights saved in distilgpt2-finetune-frankenstein20k/pytorch_model.bin
tokenizer config file saved in distilgpt2-finetune-frankenstein20k/tokenizer_config.json
Special tokens file saved in distilgpt2-finetune-frankenstein20k/special_tokens_map.json

Now you can generate with the fine-tuned model! The fine-tuning process modifies the model in-place, so the pipeline you created before will make use of the fine-tuned model. (Note that if you want to get the original distilgpt2 back, you'll need to reload it with the .from_pretrained() method, as demonstrated at the top of the notebook.)

In [64]:
generator("Two roads diverged in a yellow", max_length=100)[0]['generated_text']
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Out[64]:
'Two roads diverged in a yellow.\n\n\n\nAt a glance, I could not say what I was. I tried to hold my breaths, and felt a sense of peace that seemed to me an odd thing. I tried to keep my composure.\nThe distance between them was far beyond my comprehension, and so, then, seemed to the most of my life. As if I had always wished to stay home, as though the weather was different, I had no space for my own'

You can see that fine-tuning on even a small dataset produces big changes in the model.

If you want to use your fine-tuned model in another project, use the same syntax that we used above to load distilgpt2—just replace distilgpt2 with the name of the directory where you saved your model:

In [65]:
my_tokenizer = AutoTokenizer.from_pretrained('distilgpt2-finetune-frankenstein20k')
my_model = AutoModelForCausalLM.from_pretrained('distilgpt2-finetune-frankenstein20k')
Didn't find file distilgpt2-finetune-frankenstein20k/added_tokens.json. We won't load it.
loading file distilgpt2-finetune-frankenstein20k/vocab.json
loading file distilgpt2-finetune-frankenstein20k/merges.txt
loading file distilgpt2-finetune-frankenstein20k/tokenizer.json
loading file None
loading file distilgpt2-finetune-frankenstein20k/special_tokens_map.json
loading file distilgpt2-finetune-frankenstein20k/tokenizer_config.json
loading configuration file distilgpt2-finetune-frankenstein20k/config.json
Model config GPT2Config {
  "_name_or_path": "distilgpt2",
  "_num_labels": 1,
  "activation_function": "gelu_new",
  "architectures": [
    "GPT2LMHeadModel"
  ],
  "attn_pdrop": 0.1,
  "bos_token_id": 50256,
  "do_sample": true,
  "embd_pdrop": 0.1,
  "eos_token_id": 50256,
  "gradient_checkpointing": false,
  "id2label": {
    "0": "LABEL_0"
  },
  "initializer_range": 0.02,
  "label2id": {
    "LABEL_0": 0
  },
  "layer_norm_epsilon": 1e-05,
  "max_length": 50,
  "model_type": "gpt2",
  "n_ctx": 1024,
  "n_embd": 768,
  "n_head": 12,
  "n_inner": null,
  "n_layer": 6,
  "n_positions": 1024,
  "resid_pdrop": 0.1,
  "scale_attn_weights": true,
  "summary_activation": null,
  "summary_first_dropout": 0.1,
  "summary_proj_to_labels": true,
  "summary_type": "cls_index",
  "summary_use_proj": true,
  "task_specific_params": {
    "text-generation": {
      "do_sample": true,
      "max_length": 50
    }
  },
  "transformers_version": "4.8.2",
  "use_cache": true,
  "vocab_size": 50257
}

loading weights file distilgpt2-finetune-frankenstein20k/pytorch_model.bin
All model checkpoint weights were used when initializing GPT2LMHeadModel.

All the weights of GPT2LMHeadModel were initialized from the model checkpoint at distilgpt2-finetune-frankenstein20k.
If your task is similar to the task the model of the checkpoint was trained on, you can already use GPT2LMHeadModel for predictions without further training.

Now generate with it:

In [66]:
my_generator = pipeline("text-generation", model=my_model, tokenizer=my_tokenizer)
In [68]:
my_generator("Two roads diverged in a yellow")[0]['generated_text']
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Out[68]:
'Two roads diverged in a yellowish haze, and there were only two clear roadways crossing east of the river. As the sky fell and the ground was darkened, the wind blew a terrible wind; but it soon returned to the east where he'