The GPT-2 model generates text like never before! Previously, state-of-the-art text generators could not stay coherent for more than a paragraph or so, but GPT-2 is much better at maintaining coherence from paragraph to paragraph. The generated text is so convincing that OpenAI decided not to release the full model, for fear of enabling a fake-news generator. Instead, they released only a smaller pretrained version of the model.
You might decide you want to train the full model yourself, but be warned: each training run costs thousands of euros, and you often have to experiment with how the model is trained (play with the hyperparameters) to get it to learn well. You have to be a big organization like OpenAI to afford training your own model at the scale that they did.
The model is a language model: it was trained simply to predict the next word across many millions of documents found on the web. This is called unsupervised learning because we don't have a set of labels we are trying to predict.
The GPT-2 blog post and paper do not go into much detail about how the model was designed. However, we know that it uses a transformer architecture. At a high level, the Transformer converts input sequences into output sequences. It is composed of an encoding component and a decoding component.
The Transformer is actually composed of stacks of encoders and decoders.
We can see a snapshot of how tensors flow through this encoder-decoder architecture:
For the GPT-2 model, the goal isn't to translate French to English but rather to generate text. The input sequences are the tokens (words) at timesteps [0, t - 1], and the target sequences are the tokens at timesteps [1, t].
If the sequence length is 4 and this is the text we want to train on:
the quick brown fox jumped over the lazy dog
We would then prepare the following sequences for the model:
input | target
---|---
the quick brown fox | quick brown fox jumped
quick brown fox jumped | brown fox jumped over
brown fox jumped over | fox jumped over the
fox jumped over the | jumped over the lazy
jumped over the lazy | over the lazy dog
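To make the windowing concrete, here is a minimal sketch of how those input/target pairs could be built. It uses `boltons.iterutils.windowed`, the same helper the dataset class below relies on; the names `words`, `seq_len`, and `pairs` are illustrative and not part of the original notebook:

```python
from boltons.iterutils import windowed

words = 'the quick brown fox jumped over the lazy dog'.split()
seq_len = 4  # length of each training window

# Slide a window of `seq_len` words over the text; each window's target
# is simply the next window, i.e. the same words shifted right by one.
windows = list(windowed(words, seq_len))
pairs = list(zip(windows[:-1], windows[1:]))

for inp, tgt in pairs:
    print(' '.join(inp), '|', ' '.join(tgt))
```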
!pip install git+https://github.com/huggingface/pytorch-pretrained-BERT.git boltons googledrivedownloader -q
from functools import partial
from pathlib import Path
from textwrap import wrap
import nltk
import pandas as pd
from boltons.iterutils import windowed
from tqdm import tqdm, tqdm_notebook
import torch
from pytorch_pretrained_bert import GPT2Tokenizer, GPT2LMHeadModel, OpenAIAdam
from torch.nn import functional as F
from torch.utils.data import Dataset, DataLoader, RandomSampler
from google_drive_downloader import GoogleDriveDownloader as gdd
tqdm.pandas()
nltk.download('punkt', quiet=True)
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
device
def sample_text(model, seed='Weight loss can be achieved by', n_words=500):
    """Generate text from a trained model."""
    model.eval()
    tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
    text = tokenizer.encode(seed)
    inputs, past = torch.tensor([text]), None
    with torch.no_grad():
        for _ in tqdm_notebook(range(n_words), leave=False):
            # `past` caches attention keys/values, so each step only needs
            # to feed the most recently generated token
            logits, past = model(inputs.to(device), past=past)
            probs = F.softmax(logits[:, -1], dim=-1)
            # Sample the next token from the predicted distribution
            inputs = torch.multinomial(probs, 1)
            text.append(inputs.item())
    return tokenizer.decode(text)
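`sample_text` draws each next token from the full softmax distribution. As an aside (not part of the original notebook), a common variation is to sample only from the top-k most likely tokens, which often makes the output less rambly; `k=40` below is just an illustrative default:

```python
def sample_next_token_top_k(logits, k=40):
    """Sketch: sample the next token id from the top-k logits only."""
    top_logits, top_indices = torch.topk(logits, k, dim=-1)
    probs = F.softmax(top_logits, dim=-1)
    choice = torch.multinomial(probs, 1)
    return top_indices.gather(-1, choice)
```

You could swap this helper in for the `torch.multinomial` line inside `sample_text`.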
def pretty_print(text):
    """Wrap text for nice printing."""
    to_print = ''
    for paragraph in text.split('\n'):
        to_print += '\n'.join(wrap(paragraph))
        to_print += '\n'
    print(to_print)
model = GPT2LMHeadModel.from_pretrained('gpt2')
model = model.to(device)
seed = 'Weight loss can be achieved by' #@param {type:"string"}
n_words = 500 #@param {type:"integer"}
text = sample_text(model, seed=seed, n_words=n_words)
pretty_print(text)
See what the fine-tuning data looks like.
DATA_PATH = 'data/weight_loss/articles.jsonl'
if not Path(DATA_PATH).is_file():
    gdd.download_file_from_google_drive(
        file_id='1mafPreWzE-FyLI0K-MUsXPcnUI0epIcI',
        dest_path='data/weight_loss/weight_loss_articles.zip',
        unzip=True,
    )
# Preview the training data
pd.read_json(DATA_PATH)[['author', 'text', 'title']].head()
flatten = lambda lsts: [item for lst in lsts for item in lst]
class EzineWeightLossDataset(Dataset):
    """Weight loss articles from ezinearticles.com."""

    def __init__(self, data_filename, sequence_length, n_samples):
        df = pd.read_json(data_filename)[['text']].sample(n_samples)
        df = df[df.text.str.len() > 0]
        df.dropna(inplace=True)

        self.tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

        df['paragraphs'] = df.text.str.split(r'[\n]+')
        df['paragraphs_sentences'] = df.paragraphs.progress_apply(
            lambda paragraphs: [nltk.sent_tokenize(paragraph) for paragraph in paragraphs],
        )
        # Add newlines to the end of every paragraph
        df.loc[:, 'paragraphs_sentences'] = df.paragraphs_sentences.progress_apply(
            lambda paragraphs: [paragraph[:-1] + [paragraph[-1] + '\n\n']
                                for paragraph in paragraphs if paragraph]
        )
        df.dropna(inplace=True)

        def encode_paragraph(paragraph):
            tokens = flatten([self.tokenizer.encode(sentence) + self.tokenizer.encode(' ')
                              for sentence in paragraph])
            tokens = tokens[:-1]  # Remove extra space at the end
            return tokens

        # Tokenize and assign indices to each token
        df['paragraphs_sentences_tokens'] = df.paragraphs_sentences.progress_apply(
            lambda paragraphs: [encode_paragraph(paragraph) for paragraph in paragraphs],
        )
        # Flatten to one long sequence
        df['tokens'] = df.paragraphs_sentences_tokens.progress_apply(flatten)

        # 50256 is <|endoftext|> (https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-vocab.json)
        # Apply a sliding window per article that will be the sequence
        # length fed into the model
        sequences = flatten([
            windowed(encoded_article + [50256], sequence_length)
            for encoded_article in df['tokens']
        ])

        # Combine all of the sequences into one 2-D matrix.
        # Then, split like [A, B, C, D, E] --> ([A, B, C, D], [B, C, D, E])
        data = torch.tensor(sequences)
        self.inputs_lst, self.targets = data[:-1], data[1:]

    def __getitem__(self, i):
        return self.inputs_lst[i], self.targets[i]

    def __len__(self):
        return len(self.inputs_lst)
# How long each sequence should be
sequence_length = 128 #@param {type:"slider", min:16, max:512, step:2}
# Train on only a subset of the data to reduce training time
n_samples = 50 #@param {type:"integer"}
dataset = EzineWeightLossDataset(DATA_PATH, sequence_length, n_samples)
BATCH_SIZE = 16
loader = DataLoader(dataset, sampler=RandomSampler(dataset), batch_size=BATCH_SIZE)
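As a quick sanity check (not in the original notebook), you can decode one training example back into text to confirm the tokenization and windowing behave as expected; `dataset.tokenizer` is the `GPT2Tokenizer` created inside the dataset class above:

```python
# Decode the first (input, target) pair back to text; the target should
# read as the input shifted one token to the right.
sample_input, sample_target = dataset[0]
print(dataset.tokenizer.decode(sample_input.tolist())[:200])
print(dataset.tokenizer.decode(sample_target.tolist())[:200])
```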
Training a model from scratch can be challenging: as discussed above, it's expensive and takes a lot of hyperparameter experimentation. At the same time, the generic pretrained model isn't great at your specific task. What can you do? You can fine-tune the pretrained model with your own domain-specific data!
#@title Model Hyperparameters
n_epochs = 1 #@param {type:"slider", min:1, max:10, step:1}
learning_rate = 1e-5 #@param {type:"number"}
warmup_proportion = 0.002 #@param {type:"number"}
max_grad_norm = 0.05 #@param {type:"number"}
weight_decay = 0.01 #@param {type:"number"}
model = GPT2LMHeadModel.from_pretrained('gpt2')
model = model.to(device)
param_optimizer = list(model.named_parameters())
no_decay = ['bias', 'LayerNorm.bias', 'LayerNorm.weight']
optimizer_grouped_parameters = [
    {'params': [p for n, p in param_optimizer
                if not any(nd in n for nd in no_decay)],
     'weight_decay': 0.01},
    {'params': [p for n, p in param_optimizer
                if any(nd in n for nd in no_decay)],
     'weight_decay': 0.0},
]
n_train_optimization_steps = len(dataset) * n_epochs // BATCH_SIZE
optimizer = OpenAIAdam(
    optimizer_grouped_parameters,
    lr=learning_rate,
    warmup=warmup_proportion,
    max_grad_norm=max_grad_norm,
    weight_decay=weight_decay,
    t_total=n_train_optimization_steps,
)
nb_tr_steps, tr_loss, exp_average_loss = 0, 0, None
model.train()
for _ in tqdm_notebook(range(n_epochs)):
    tr_loss = 0
    nb_tr_steps = 0
    tqdm_bar = tqdm_notebook(loader, desc='Training')
    for step, batch in enumerate(tqdm_bar):
        input_ids, lm_labels = tuple(t.to(device) for t in batch)
        # When lm_labels is provided, the model returns the language-modeling loss
        loss = model(input_ids, lm_labels=lm_labels)
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()  # Clear gradients so they don't accumulate across steps
        tr_loss += loss.item()
        exp_average_loss = (
            loss.item() if exp_average_loss is None
            else 0.7 * exp_average_loss + 0.3 * loss.item()
        )
        nb_tr_steps += 1
        tqdm_bar.desc = f'Training loss: {exp_average_loss:.2e} lr: {optimizer.get_lr()[0]:.2e}'
#@title Sample fine-tuned model
seed = 'Weight loss can be achieved by' #@param {type:"string"}
n_words = 500 #@param {type:"integer"}
text = sample_text(model, seed=seed, n_words=n_words)
pretty_print(text)
original_model = GPT2LMHeadModel.from_pretrained('gpt2')
original_model = original_model.to(device)
#@title Sample original model
seed = 'Weight loss can be achieved by' #@param {type:"string"}
n_words = 500 #@param {type:"integer"}
text = sample_text(original_model, seed=seed, n_words=n_words)
pretty_print(text)
torch.save(model.state_dict(), 'finetuned_gpt2.pkl')
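To reuse the fine-tuned weights in a later session, here is a minimal sketch using standard PyTorch calls (the filename matches the `torch.save` call above):

```python
# Recreate the architecture, then load the fine-tuned weights into it
reloaded_model = GPT2LMHeadModel.from_pretrained('gpt2')
reloaded_model.load_state_dict(torch.load('finetuned_gpt2.pkl', map_location=device))
reloaded_model = reloaded_model.to(device)
```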