Fine-tune the GPT-2 model on weight loss articles

The GPT-2 model generates text like never before! Until recently, state-of-the-art text generators could not stay coherent for more than a paragraph or so, but GPT-2 is much better at maintaining coherence from paragraph to paragraph. The generated text is so good that OpenAI decided not to release the full model for fear of creating a fake-news generator; instead, they released only a smaller pretrained version of the model.

You might decide you want to train the full model yourself, but be warned: each training run costs thousands of euros, and you often have to experiment with how the model is trained (play with the hyperparameters) to get it to learn well. You have to be a large organization like OpenAI to afford training your own model at the scale that they did.

The model is a language model: it was trained simply by trying to predict the next word across many millions of documents found on the web. This is called unsupervised learning because there is no separate set of labels to predict; the text itself provides the targets.
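
To make the objective concrete, here is a minimal sketch (with made-up toy numbers, not the real model) of how next-word prediction turns raw text into a training signal: the label for each position is simply the token that comes next.

In [0]:
import torch
from torch.nn import functional as F

# Toy illustration of the language-modeling objective (hypothetical numbers)
vocab_size = 8
token_ids = torch.tensor([[3, 1, 4, 1, 5]])  # one tiny "document" of 5 tokens
logits = torch.randn(1, 5, vocab_size)       # stand-in for the model's output

# Predict token t+1 from tokens 0..t: compare the prediction at each position
# with the token that actually comes next
loss = F.cross_entropy(
    logits[:, :-1].reshape(-1, vocab_size),  # predictions for positions 0..3
    token_ids[:, 1:].reshape(-1),            # targets are positions 1..4
)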

The GPT-2 blog post and paper do not go into much detail about how the model was designed. However, we know that it uses a Transformer architecture. At a high level, the Transformer converts input sequences into output sequences and is composed of an encoding component and a decoding component.

[Figure: the Transformer at a high level]

The Transformer is actually composed of stacks of encoders and decoders.

[Figure: stacks of encoders and decoders]

We can see a snapshot of how tensors flow through this encoder-decoder architecture:

[Figure: tensors flowing through the encoder-decoder architecture]

For the GPT-2 model (which uses only the decoder side of the Transformer), the goal isn't to translate French to English but rather to generate text. The input sequences are the tokens (byte-pair-encoded word pieces) at timesteps [0, t - 1] and the target sequences are the tokens at timesteps [1, t].

If the sequence length is 5 and this is the text we want to train on:

the quick brown fox jumped over the lazy dog

We would then prepare the following sequences for the model:

input                   target
the quick brown fox     quick brown fox jumped
quick brown fox jumped  brown fox jumped over
brown fox jumped over   fox jumped over the
fox jumped over the     jumped over the lazy
jumped over the lazy    over the lazy dog
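
A quick sketch of how those input/target pairs can be produced with a sliding window, using whitespace-split words as a stand-in for real BPE tokens and the same windowed helper that the preprocessing code below relies on:

In [0]:
from boltons.iterutils import windowed

words = 'the quick brown fox jumped over the lazy dog'.split()
sequence_length = 5

# Each length-5 window yields an input/target pair shifted by one token
for window in windowed(words, sequence_length):
    inputs, targets = window[:-1], window[1:]
    print(' '.join(inputs), '->', ' '.join(targets))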
In [0]:
!pip install git+https://github.com/huggingface/pytorch-pretrained-BERT.git boltons googledrivedownloader -q
In [2]:
from functools import partial
from pathlib import Path
from textwrap import wrap

import nltk
import pandas as pd
from boltons.iterutils import windowed
from tqdm import tqdm, tqdm_notebook

import torch
from pytorch_pretrained_bert import GPT2Tokenizer, GPT2LMHeadModel, OpenAIAdam
from torch.nn import functional as F
from torch.utils.data import Dataset, DataLoader, RandomSampler

from google_drive_downloader import GoogleDriveDownloader as gdd

tqdm.pandas()
nltk.download('punkt', quiet=True)

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
device
Better speed can be achieved with apex installed from https://www.github.com/nvidia/apex.
Out[2]:
device(type='cuda')
In [0]:
def sample_text(model, seed='Weight loss can be achieved by', n_words=500):
    """Generate text from a trained model by sampling one token at a time."""
    model.eval()
    tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

    text = tokenizer.encode(seed)
    inputs, past = torch.tensor([text]), None

    with torch.no_grad():
        for _ in tqdm_notebook(range(n_words), leave=False):
            logits, past = model(inputs.to(device), past=past)

            # Sample the next token from the model's predicted distribution
            probs = F.softmax(logits[:, -1], dim=-1)
            inputs = torch.multinomial(probs, 1)

            text.append(inputs.item())

    return tokenizer.decode(text)


def pretty_print(text):
    """Wrap text for nice printing."""
    to_print = ''
    for paragraph in text.split('\n'):
        to_print += '\n'.join(wrap(paragraph))
        to_print += '\n'
    print(to_print)

Test sampling the original pretrained model (GPT-2 small)

In [0]:
model = GPT2LMHeadModel.from_pretrained('gpt2')
model = model.to(device)
In [5]:
seed = 'Weight loss can be achieved by'  #@param {type:"string"}
n_words = 500  #@param {type:"integer"}
text = sample_text(model, seed=seed, n_words=n_words)
pretty_print(text)
Weight loss can be achieved by reduction of ionized nucleotides in the
cytoplasm and increasing the cellular hold (Masuka and Stratton 2005),
which increases the solubility of proteins.

Note that Vitamin D posture (almost equivalent to its butter role)
depends on the equilibrium archival state of the strength returning to
cytoplasmic system, where essential growth factors are maintained in a
resting state for a fixed period of time. Inflammation is likely to be
the source of excessive HPD clearance occurred during cationic
phosphate transfer from the cytoplasm to the extracellular matrix
(Roboe et al. 1997; Joslington and Wilson 1997). On the other hand,
chronic CoS and PhoB metabolism can be strongly affected as measured
by the transition of REES. Persistent exocrine responses to pigment
and lipid changed from a post-sedation, acute practical experience to
cumulative results no longer help inhibition of pigment cell fate
(Dessner et al. 2000). Furthermore, sufficient antioxidant actions
against prooxidant proteins (including Ca2+, S….. ) showed to be
responsible for large alteration of cell signalling integrity under S
difficulties in the early days of colony exposure. People with
moderate to severe MM were exposed to red blood cell dying/mismatching
with IL-1α antagonist DL-2 restored brown thickness with glycan
supplementation, and solid tissue glucose infusion was thus regarded
as optimal for rejuvenation of mtDNA. The known photoreceptor
potential of tan pigment is not lacking (Henold et al. 2003). Seminium
Bond, lipid peroxidation of cell membranes from Ca 2+ transgenic mice
and similar "ET/ET syndrome" stem cells (Lackë et al. 2003) of chronic
MS supported current plasma vitamin D requirement (Martinez et al.
2002, Florentine et al. 2003). The interest in such physiological
effects coupled with significance of this treatment put into question
the neoepithelial perocalciatory loop bronclovirus 2A inhibitor 2‐6
transgenic mice circulating in predominantly miferous systems have
been mentioned as among the areas of relevance for disease progression
phase detection and induced starved like T cells (Edison et al. 2002,
Munber and Mohanty 2000), excited situ hybridization, ELPRM, and one
molecule cell permeability assay Kit-72 (Kapuse and Lasina 1996,
Sales, Park and O'Neill 1994).

Van-Emec's solution

See what the fine-tuning data looks like.

In [0]:
DATA_PATH = 'data/weight_loss/articles.jsonl'
if not Path(DATA_PATH).is_file():
    gdd.download_file_from_google_drive(
        file_id='1mafPreWzE-FyLI0K-MUsXPcnUI0epIcI',
        dest_path='data/weight_loss/weight_loss_articles.zip',
        unzip=True,
    )
In [7]:
# Preview the training data
pd.read_json(DATA_PATH)[['author', 'text', 'title']].head()
Out[7]:
   author              text                                                title
0  Jesse L Moore       It is almost not possible to watch any TV, rea...  How Obesity is Determined
1  Erica Logan         If you are reading this article, then I know y...  I Cheated My Way Thin - I Can Now Look At Myse...
2  Acharya Hargreaves  Self hypnosis for weight loss is a very easy p...  Self Hypnosis For Weight Loss is Easy and Rela...
3  Avy Barnes          Are you looking all over for the fastest way t...  Fastest Way to Lose Weight - Melt Away Lbs Of ...
4  Carolyn Anderson    Losing weight is one of the many concerns of m...  How to Lose Weight the Healthy Way - Simple an...

Preprocess fine-tuning data

In [0]:
flatten = lambda lists: [item for sublist in lists for item in sublist]  # flatten one level of nesting

class EzineWeightLossDataset(Dataset):
    """Weight loss articles from ezinearticles.com."""
    def __init__(self, data_filename, sequence_length, n_samples):
        df = pd.read_json(data_filename)[['text']].sample(n_samples)
        
        df = df[df.text.str.len() > 0]
        df.dropna(inplace=True)
        
        self.tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
        
        df['paragraphs'] = df.text.str.split(r'[\n]+')
        
        df['paragraphs_sentences'] = df.paragraphs.progress_apply(
            lambda paragraphs: [nltk.sent_tokenize(paragraph) for paragraph in paragraphs],
        )
        
        # Add newlines to the end of every paragraph
        df.loc[:, 'paragraphs_sentences'] = df.paragraphs_sentences.progress_apply(
            lambda paragraphs: [paragraph[:-1] + [paragraph[-1] + '\n\n'] for paragraph in paragraphs if paragraph]
        )
        
        df.dropna(inplace=True)

        def encode_paragraph(paragraph):
            tokens = flatten([self.tokenizer.encode(sentence) + self.tokenizer.encode(' ') for sentence in paragraph])
            tokens = tokens[:-1]  # Remove extra space at the end
            return tokens
        
        # Tokenize and assign indices to each token
        df['paragraphs_sentences_tokens'] = df.paragraphs_sentences.progress_apply(
            lambda paragraphs: [encode_paragraph(paragraph) for paragraph in paragraphs],
        )
        
        # Flatten to one long sequence
        df['tokens'] = df.paragraphs_sentences_tokens.progress_apply(flatten)
        
        # 50256 is <|endoftext|> (https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-vocab.json)
        # Apply a sliding window per article that will be the sequence
        # length fed into the model
        sequences = flatten([
            windowed(encoded_article + [50256], sequence_length)
            for encoded_article in df['tokens']
        ])
        
        # Combine all of the sequences into one 2-D matrix.
        # Then, split like [A, B, C, D, E] --> ([A, B, C, D], [B, C, D, E])
        data = torch.tensor(sequences)
        self.inputs_lst, self.targets = data[:-1], data[1:]

    def __getitem__(self, i):
        return self.inputs_lst[i], self.targets[i]
    
    def __len__(self):
        return len(self.inputs_lst)
In [9]:
# How long each sequence should be
sequence_length = 128  #@param {type:"slider", min:16, max:512, step:2}

# Train on only a subset of the data to reduce training time
n_samples = 50  #@param {type:"integer"}

dataset = EzineWeightLossDataset(DATA_PATH, sequence_length, n_samples)
100%|██████████| 50/50 [00:00<00:00, 670.14it/s]
100%|██████████| 50/50 [00:00<00:00, 42392.40it/s]
100%|██████████| 50/50 [00:00<00:00, 130.76it/s]
100%|██████████| 50/50 [00:00<00:00, 14617.36it/s]
In [0]:
BATCH_SIZE = 16
loader = DataLoader(dataset, sampler=RandomSampler(dataset), batch_size=BATCH_SIZE)
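
As a quick sanity check, we can peek at one batch from the loader; both tensors should come out with shape (BATCH_SIZE, sequence_length).

In [0]:
# Draw a single batch to confirm the input and target shapes line up
inputs, targets = next(iter(loader))
inputs.shape, targets.shape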

Fine-tune the GPT-2 model

Training a model from scratch can be challenging:

  • It's too expensive to train on the amount of data you need for great results
  • You may not have enough data in the first place

At the same time, the pretrained model may not be great at your specific task. What can you do? You can fine-tune the pretrained model on your own domain-specific data!

Setup

In [0]:
#@title Model Hyperparameters
n_epochs = 1  #@param {type:"slider", min:1, max:10, step:1}
learning_rate = 1e-5  #@param {type:"number"}
warmup_proportion = 0.002  #@param {type:"number"}
max_grad_norm = 0.05  #@param {type:"number"}
weight_decay = 0.01  #@param {type:"number"}
In [0]:
model = GPT2LMHeadModel.from_pretrained('gpt2')
model = model.to(device)
In [0]:
param_optimizer = list(model.named_parameters())
no_decay = ['bias', 'LayerNorm.bias', 'LayerNorm.weight']
# Apply weight decay to all parameters except biases and LayerNorm parameters
optimizer_grouped_parameters = [
    {'params': [p for n, p in param_optimizer if
        not any(nd in n for nd in no_decay)], 'weight_decay': weight_decay},
    {'params': [p for n, p in param_optimizer if
        any(nd in n for nd in no_decay)], 'weight_decay': 0.0}
]
n_train_optimization_steps = len(dataset) * n_epochs // BATCH_SIZE
optimizer = OpenAIAdam(
    optimizer_grouped_parameters,
    lr=learning_rate,
    warmup=warmup_proportion,
    max_grad_norm=max_grad_norm,
    weight_decay=weight_decay,
    t_total=n_train_optimization_steps,
)

Train

In [14]:
nb_tr_steps, tr_loss, exp_average_loss = 0, 0, None
model.train()
for _ in tqdm_notebook(range(n_epochs)):
    tr_loss = 0
    nb_tr_steps = 0
    tqdm_bar = tqdm_notebook(loader, desc='Training')
    for step, batch in enumerate(tqdm_bar):
        input_ids, lm_labels = tuple(t.to(device) for t in batch)
        loss = model(input_ids, lm_labels=lm_labels)
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()  # reset gradients so they don't accumulate across steps
        tr_loss += loss.item()
        exp_average_loss = (
            loss.item() if exp_average_loss is None
            else 0.7 * exp_average_loss + 0.3 * loss.item()
        )
        nb_tr_steps += 1
        tqdm_bar.desc = f'Training loss: {exp_average_loss:.2e} lr: {optimizer.get_lr()[0]:.2e}'

In [21]:
#@title Sample fine-tuned model

seed = 'Weight loss can be achieved by'  #@param {type:"string"}
n_words = 500  #@param {type:"integer"}
text = sample_text(model, seed=seed, n_words=n_words)
pretty_print(text)
Weight loss can be achieved by dieting often. Dieters all over the
world give talks all the time. Sometimes the experts present them but
never mention the experts. Some days some days not.

So how should people cut calories? They can always always add 200mg of
Acai berry to their diet. The Acai berry is one of the least processed
foods. It's one of the few plant-derived vitamins and minerals that
aren't absorbed by our body through digestion.

The addition of acai berries to a healthy diet is a natural, proven
way to lose fat faster. Research shows that simply by retaining the
amount of acai he offers, we quickly burnishes ourselves with our
existing profile.

Acai also has the ability to regenerate itself, allowing for weight-
saving and weight loss as well. When people lose weight, they're
usually trying to accomplish several things:1. To get by on their own
expense-free, and without aid Groups that will consume alcohol, gamble
on solids products, or eat chips and salsa to burnish their hold upon
themselves, instead comparing their ability to achieve physical and
mental capacity?

2. To see real results from a resource that isn't normally available
to those in need of that information? Take marathoners, for
example.,anddiabetics, those with�arthritis, chills, allergies, &
sleep apnea.

Because of the low metabolic rate in the obese, they have to get to
hospital 1 hour before surgery to have the necessary oxygen, nutrients
and food to stay alive.

Because of the high percentage of body fat, and because it is stored
there through the early hours of the morning,andbecause most of the
calories come from saturated fats, it is very hard for the body to
deal with the calories right away. So the only way for the body to get
the necessary amounts it needs in during the day is for the stomach to
become extremely acidic. Thus,the body needs to shuttle carbon dioxide
out of the body. This will prevent water retention and contribute to
the retention of heavy metals in the body.

Because the body needs the carbon dioxide to grow and divide, it needs
to cool down CO2 to below CO2 plus H2O. By doing this, CO2 transfers
heat faster than water which also slows the growth of the growth
factor imbalances. So the body cools itself when it cools off water

In [0]:
original_model = GPT2LMHeadModel.from_pretrained('gpt2')
original_model = original_model.to(device)
In [17]:
#@title Sample original model

seed = 'Weight loss can be achieved by'  #@param {type:"string"}
n_words = 500  #@param {type:"integer"}
text = sample_text(original_model, seed=seed, n_words=n_words)
pretty_print(text)
Weight loss can be achieved by modifying nutrient intake

3-enough* nearest supplementation. Initially research highlights an
increasing need for further research into the impact of initiating
treatments to offset adverse effects. A deficiency of this form of
nutritional modification by most, including vegetarian and low-fat
diets, has been referred to as magical-patients.9<|endoftext|>Away
from the controversial Patriot Act, President Barack Obama has kept to
his agenda of building border lasers and other futuristic tools for
remote sensors and detectors. Today, with more effort on Capitol Hill,
the President has unveiled two lightning-bombers at a joint federal
lobbying event focused on a proposed House proposal that aims to win
the support of all factions of Congress and derail legislation
appropriating money necessary for fencing and insurance stabilization.


This year's "Speaker of the House" meeting in Washington, D.C.'s
Legislative Building was organized through a four-day week of email
correspondence with members—academic allies, preferably from outside
the House, since the meeting is the first anniversary of the Patriot
Act. "Confronting people is the politics of doing anything because it
drives consensus," Obama said. "Politician-in-chiefs need to find a
way to say, 'We're not going to allow cable television to turn on the
lounges. Cut traffic a few feet in front of our cameras.' "
Republicans like Boehner know how politics will play out over the next
year, at least if they hold a majority in the Senate. "Building up the
communication experience through a one-time conference 'vital to the
Wing'" is a common tactic, said Rep. Jeff Henry (R-OK), the co-
chairman of the House Foreign Affairs Committee Pete Sessions'
Transportation Committee. "We've heard from House Democrats who said
this was not going to work over two years ago, but now they're
reminding us, 'Look, take risks together.'" The ranking Democrat on
the Committee for CPI, Ed Halpert (D-CA), told the Washington Examiner
that Boehner rejected the idea of opening the white wall in hopes that
the House would repeat without another session of Congress. That
Republican leader, led by Rep. Adam B. Schiff (D-CA), did the same:
"We homes are now taught to obediently show our president and senator
to follow that symbolic or tactical moves, and that's when I think
larger (bleeping) steps into the American economy, that's when
productive elections begin with Congress, the House, and the floor

Save the fine-tuned model

In [0]:
torch.save(model.state_dict(), 'finetuned_gpt2.pkl')
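
To reuse the fine-tuned weights later (in a fresh session, for example), a minimal sketch for reloading them is to rebuild the architecture and load the saved state dict back in, assuming the finetuned_gpt2.pkl file saved above is available:

In [0]:
# Rebuild the model and load the fine-tuned weights back into it
restored_model = GPT2LMHeadModel.from_pretrained('gpt2')
restored_model.load_state_dict(torch.load('finetuned_gpt2.pkl', map_location=device))
restored_model = restored_model.to(device)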