Transfer Learning for NLP: Sentiment Analysis on Amazon Reviews

In this notebook, we show how transfer learning can be applied to sentiment analysis on Amazon reviews, classifying them as positive or negative.

This notebook builds on the work of Howard and Ruder, ULMFiT. The idea of the paper (and its implementation, explained in the fast.ai deep learning course) is to train a language model on a very large corpus, e.g. a Wikipedia dump. The intuition is that if a model is able to predict the next word at each position, it has learnt something about the structure of the language we are using.
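To make that training signal concrete, here is a toy sketch (ours, not the actual AWD-LSTM training code) of what "predicting the next word" means: inputs and targets are simply the same token sequence shifted by one position.

In [ ]:
# Toy illustration of the language-modelling objective: for every position,
# the target is the next token in the sequence.
tokens = 'the camera stopped working after two days'.split()
for current, nxt in zip(tokens[:-1], tokens[1:]):
    print(f'given "... {current}" -> predict "{nxt}"')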

Word2vec and the like have led to large improvements on various NLP tasks. They can be seen as a first step towards transfer learning, where the pre-trained word vectors amount to transferring only the embedding layer. The ambition of ULMFiT (and of others such as ELMo or the recently introduced Transformer language model) is to progressively bring NLP to the state Computer Vision has reached thanks to the ImageNet challenge: today it is easy to download a model pre-trained on a massive dataset of images, remove the last layer and replace it with a classifier or a regressor, depending on the task.

With ULMFiT, the goal is for everyone to be able to take a pre-trained language model and use it as a backbone, on top of which a classifier or a regressor can be added. The game-changing aspect of transfer learning is that we are no longer limited by the size of the training data: with only a fraction of the data that was necessary before, we can train a classifier/regressor and get very good results with few labelled examples.

Given that labelled text data is difficult to obtain, while unlabelled text is almost unlimited, transfer learning is likely to radically change the field of NLP and bring it to a level of maturity closer to that of Computer Vision.

The language model architecture used in ULMFiT is the AWD-LSTM by Merity et al.

While we use this language model for the experiment, we also keep an eye on the recently proposed character-level language model with Contextual String Embeddings by Akbik et al.

Contents of this notebook

This notebook illustrates the power of ULMFiT on a dataset of Amazon reviews available on Kaggle at https://www.kaggle.com/bittlingmayer/amazonreviews/home. We reuse code from the excellent fast.ai course and apply it to a different dataset. The original code is available at https://github.com/fastai/fastai/tree/master/courses/dl2

The data consists of 4M reviews labelled as either positive or negative. Training the FastText classifier on it yields an f1 score of 0.916. We show that, using only a fraction of this dataset, we are able to reach similar and even better results.
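For reference, such a FastText baseline could be trained directly on the raw files, which are already in fastText format. This is a hypothetical sketch using the fasttext Python package (it is not part of this notebook's pipeline and is not run here):

In [ ]:
# Hypothetical FastText baseline on the raw '__label__1/2' files
import fasttext  # pip install fasttext

ft_model = fasttext.train_supervised(input='Amazon/train.ft.txt')
n, p_at_1, r_at_1 = ft_model.test('Amazon/test.ft.txt')
print(f'{n} test examples, precision@1 = {p_at_1:.3f}, recall@1 = {r_at_1:.3f}')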

We encourage you to try it on your own tasks! Note that if you are interested in regression instead of classification, you can also do so following this advice.

The notebook is organized as follows:

  • Tokenize the reviews and create dictionaries
  • Download a pre-trained model and link the dictionary to the embedding layer of the model
  • Fine-tune the language model on the Amazon review texts

We then have the backbone of our algorithm: a pre-trained language model fine-tuned on Amazon reviews.

  • Add a classifier to the language model and train the classifier layer only
  • Gradually unfreeze successive layers to train more of the model on the Amazon reviews
  • Run a full classification task for several epochs
  • Use the model for inference!

We end this notebook by looking at the specific effect of training size on overall performance, to test the hypothesis that ULMFiT does not need much labelled data to perform well.

Data

Before starting, you should download the data from https://www.kaggle.com/bittlingmayer/amazonreviews, put the extracted files into an ./Amazon folder wherever you like, and use that path in this notebook.

Also, we recommend working in a dedicated environment (e.g. mkvirtualenv fastai). Then clone the fastai GitHub repo https://github.com/fastai/fastai and install its requirements.

In [2]:
from fastai.text import *
import html
import os
import pandas as pd
import pickle
import re
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score, classification_report, \
confusion_matrix
from sklearn.model_selection import train_test_split
from time import time
In [2]:
path = '/your/path/to/folder/Amazon'
train = []
with open(os.path.join(path, 'train.ft.txt'), 'r') as file:
    for line in file:
        # one review per line
        train.append(line)

test = []
with open(os.path.join(path, 'test.ft.txt'), 'r') as file:
    for line in file:
        test.append(line)
In [65]:
print(f'The train data contains {len(train)} examples')
print(f'The test data contains {len(test)} examples')
In [3]:
BOS = 'xbos'  # beginning-of-sentence tag
FLD = 'xfld'  # data field tag

PATH=Path('/your/path/to/folder/Amazon')

CLAS_PATH=PATH/'amazon_class'
CLAS_PATH.mkdir(exist_ok=True)

LM_PATH=PATH/'amazon_lm'
LM_PATH.mkdir(exist_ok=True)
In [12]:
# Each line starts with '__label__1' or '__label__2' followed by the review, so we slice it to get labels and texts
trn_texts,trn_labels = [text[10:] for text in train], [text[:10] for text in train]
trn_labels = [0 if label == '__label__1' else 1 for label in trn_labels]
val_texts,val_labels = [text[10:] for text in test], [text[:10] for text in test]
val_labels = [0 if label == '__label__1' else 1 for label in val_labels]
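A quick sanity check of the split above (the raw lines follow the fastText convention, with the 10-character label prefix followed by the review text):

In [ ]:
# Sanity check: the first 10 characters are the label, the rest is the review
print(train[0][:10])
print(trn_labels[0], trn_texts[0][:80])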
In [13]:
# Following fast.ai recommendations we put our data in pandas dataframes
col_names = ['labels','text']

df_trn = pd.DataFrame({'text':trn_texts, 'labels':trn_labels}, columns=col_names)
df_val = pd.DataFrame({'text':val_texts, 'labels':val_labels}, columns=col_names)
In [66]:
df_trn.head(10)
In [16]:
df_trn.to_csv(CLAS_PATH/'train.csv', header=False, index=False)
df_val.to_csv(CLAS_PATH/'test.csv', header=False, index=False)
In [17]:
CLASSES = ['neg', 'pos']
(CLAS_PATH/'classes.txt').open('w').writelines(f'{o}\n' for o in CLASSES)

Language Model

In [11]:
# We're going to fine-tune the language model, so it's OK to include some of the test set in our training data
# for the LM fine-tuning
trn_texts,val_texts = train_test_split(np.concatenate([trn_texts,val_texts]), test_size=0.1)

df_trn = pd.DataFrame({'text':trn_texts, 'labels':[0]*len(trn_texts)}, columns=col_names)
df_val = pd.DataFrame({'text':val_texts, 'labels':[0]*len(val_texts)}, columns=col_names)

df_trn.to_csv(LM_PATH/'train.csv', header=False, index=False)
df_val.to_csv(LM_PATH/'test.csv', header=False, index=False)
In [19]:
# Here we use functions from the fast.ai course to get data

chunksize=24000
re1 = re.compile(r'  +')

def fixup(x):
    x = x.replace('#39;', "'").replace('amp;', '&').replace('#146;', "'").replace(
        'nbsp;', ' ').replace('#36;', '$').replace('\\n', "\n").replace('quot;', "'").replace(
        '<br />', "\n").replace('\\"', '"').replace('<unk>','u_n').replace(' @.@ ','.').replace(
        ' @-@ ','-').replace('\\', ' \\ ')
    return re1.sub(' ', html.unescape(x))

def get_texts(df, n_lbls=1):
    labels = df.iloc[:,range(n_lbls)].values.astype(np.int64)
    texts = f'\n{BOS} {FLD} 1 ' + df[n_lbls].astype(str)
    for i in range(n_lbls+1, len(df.columns)): 
        texts += f' {FLD} {i-n_lbls} ' + df[i].astype(str)
    texts = list(texts.apply(fixup).values)

    tok = Tokenizer().proc_all_mp(partition_by_cores(texts))
    return tok, list(labels)

def get_all(df, n_lbls):
    tok, labels = [], []
    for i, r in enumerate(df):
        print(i)
        tok_, labels_ = get_texts(r, n_lbls)
        tok += tok_;
        labels += labels_
    return tok, labels

df_trn = pd.read_csv(LM_PATH/'train.csv', header=None, chunksize=chunksize)
df_val = pd.read_csv(LM_PATH/'test.csv', header=None, chunksize=chunksize)
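To see what the cleaning step does, here is a small illustrative call to fixup (a toy example of ours; the exact output depends on which artefacts are present in the text):

In [ ]:
# Illustration of the cleaning step: HTML line breaks and escape artefacts are
# normalised and runs of spaces are collapsed.
sample = 'Great battery life.<br />Arrived in  two  days\\nWould buy again'
print(fixup(sample))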
In [21]:
# This cell can take quite some time if your dataset is large
# Run it once and comment it out for later use
tok_trn, trn_labels = get_all(df_trn, 1)
tok_val, val_labels = get_all(df_val, 1)
In [15]:
# Run this cell once and comment out everything but the load statements for later use

(LM_PATH/'tmp').mkdir(exist_ok=True)
np.save(LM_PATH/'tmp'/'tok_trn.npy', tok_trn)
np.save(LM_PATH/'tmp'/'tok_val.npy', tok_val)
tok_trn = np.load(LM_PATH/'tmp'/'tok_trn.npy')
tok_val = np.load(LM_PATH/'tmp'/'tok_val.npy')
In [63]:
# Check the most common tokens
freq = Counter(p for o in tok_trn for p in o)
freq.most_common(25)
In [64]:
# Check the least common tokens
freq.most_common()[-25:]
In [ ]:
# Build the vocabulary by keeping only the tokens that appear frequently enough,
# and constrain its size. We follow the 60k recommendation here.
max_vocab = 60000
min_freq = 2

itos = [o for o,c in freq.most_common(max_vocab) if c>min_freq]
itos.insert(0, '_pad_')
itos.insert(0, '_unk_')

stoi = collections.defaultdict(lambda:0, {v:k for k,v in enumerate(itos)})
len(itos)

trn_lm = np.array([[stoi[o] for o in p] for p in tok_trn])
val_lm = np.array([[stoi[o] for o in p] for p in tok_val])

np.save(LM_PATH/'tmp'/'trn_ids.npy', trn_lm)
np.save(LM_PATH/'tmp'/'val_ids.npy', val_lm)
pickle.dump(itos, open(LM_PATH/'tmp'/'itos.pkl', 'wb'))
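As a quick check of the numericalisation (illustrative only): any token that did not make it into the 60k vocabulary maps to index 0, i.e. the _unk_ token inserted above.

In [ ]:
# Tokens kept in the vocabulary get their own id; anything else falls back to 0 ('_unk_')
print(stoi['the'], itos[stoi['the']])
print(stoi['token_never_seen_before'], itos[stoi['token_never_seen_before']])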
In [10]:
# Load everything back (for later runs, you can start from here)
trn_lm = np.load(LM_PATH/'tmp'/'trn_ids.npy')
val_lm = np.load(LM_PATH/'tmp'/'val_ids.npy')
itos = pickle.load(open(LM_PATH/'tmp'/'itos.pkl', 'rb'))
In [33]:
vs=len(itos)
vs,len(trn_lm)

Using a pre-trained Language Model

In [ ]:
# Uncomment this cell to download the pre-trained model.
# It will be placed into the PATH that you defined earlier.
# ! wget -nH -r -np -P {PATH} http://files.fast.ai/models/wt103/
In [5]:
# Load the weights of the model
em_sz,nh,nl = 400,1150,3

PRE_PATH = PATH/'models'/'wt103'
PRE_LM_PATH = PRE_PATH/'fwd_wt103.h5'

wgts = torch.load(PRE_LM_PATH, map_location=lambda storage, loc: storage)
In [8]:
# Check the word embedding layer and keep a 'mean word' for unknown tokens
enc_wgts = to_np(wgts['0.encoder.weight'])
row_m = enc_wgts.mean(0)

enc_wgts.shape
Out[8]:
(238462, 400)
In [12]:
# Load the vocabulary on which the pre-trained model was trained
# Define an embedding matrix with the vocabulary of our dataset
itos2 = pickle.load((PRE_PATH/'itos_wt103.pkl').open('rb'))
stoi2 = collections.defaultdict(lambda:-1, {v:k for k,v in enumerate(itos2)})

new_w = np.zeros((vs, em_sz), dtype=np.float32)
for i,w in enumerate(itos):
    r = stoi2[w]
    new_w[i] = enc_wgts[r] if r>=0 else row_m
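An optional sanity check (ours, not in the original course code) is to see how much of the Amazon vocabulary is covered by the wt103 vocabulary; tokens that are not found are initialised with the mean embedding row_m:

In [ ]:
# Fraction of our vocabulary that receives a real pre-trained embedding
coverage = np.mean([stoi2[w] >= 0 for w in itos])
print(f'{coverage:.1%} of our {len(itos)} tokens are in the wt103 vocabulary')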
In [16]:
# Use the new embedding matrix for the pre-trained model
wgts['0.encoder.weight'] = T(new_w)
wgts['0.encoder_with_dropout.embed.weight'] = T(np.copy(new_w))
wgts['1.decoder.weight'] = T(np.copy(new_w))
In [17]:
# Define the learner object to do the fine-tuning
# Here we will freeze everything except the embedding layer, so that we can have a better 
# embedding for unknown words than just the mean embedding on which we initialise it.
wd=1e-7
bptt=70
bs=52
opt_fn = partial(optim.Adam, betas=(0.8, 0.99))

trn_dl = LanguageModelLoader(np.concatenate(trn_lm), bs, bptt)
val_dl = LanguageModelLoader(np.concatenate(val_lm), bs, bptt)
md = LanguageModelData(PATH, 1, vs, trn_dl, val_dl, bs=bs, bptt=bptt)

drops = np.array([0.25, 0.1, 0.2, 0.02, 0.15])*0.7

learner= md.get_model(opt_fn, em_sz, nh, nl, 
    dropouti=drops[0], dropout=drops[1], wdrop=drops[2], dropoute=drops[3], dropouth=drops[4])

learner.metrics = [accuracy]
learner.freeze_to(-1)

learner.model.load_state_dict(wgts)

lr=1e-3
lrs = lr
In [22]:
# Run one epoch of fine-tuning 
learner.fit(lrs/2, 1, wds=wd, use_clr=(32,2), cycle_len=1)
In [30]:
# Save the fine-tuned model and unfreeze everything to later fine-tune the whole model
learner.save('lm_last_ft')
learner.load('lm_last_ft')
learner.unfreeze()
In [23]:
learner.lr_find(start_lr=lrs/10, end_lr=lrs*10, linear=True)
In [24]:
learner.sched.plot()
In [ ]:
# Run this if you want to fine-tune the LM thoroughly on the Amazon data (15 epochs)
# use_clr controls the shape of the cyclical (triangular) learning rate
learner.fit(lrs, 1, wds=wd, use_clr=(20,10), cycle_len=15)
In [33]:
# Save the backbone for the classification step
learner.save('lm1')
learner.save_encoder('lm1_enc')
In [25]:
learner.sched.plot_loss()

Going back to classification!

Now that we have spent some time fine-tuning the language model on our Amazon data, let's see whether we can easily classify these reviews. As before, some cells should be run once and then commented out, loading the saved results for later use.

In [35]:
df_trn = pd.read_csv(CLAS_PATH/'train.csv', header=None, chunksize=chunksize)
df_val = pd.read_csv(CLAS_PATH/'test.csv', header=None, chunksize=chunksize)
In [26]:
tok_trn, trn_labels = get_all(df_trn, 1)
tok_val, val_labels = get_all(df_val, 1)
In [36]:
(CLAS_PATH/'tmp').mkdir(exist_ok=True)

np.save(CLAS_PATH/'tmp'/'tok_trn.npy', tok_trn)
np.save(CLAS_PATH/'tmp'/'tok_val.npy', tok_val)

np.save(CLAS_PATH/'tmp'/'trn_labels.npy', trn_labels)
np.save(CLAS_PATH/'tmp'/'val_labels.npy', val_labels)
In [4]:
tok_trn = np.load(CLAS_PATH/'tmp'/'tok_trn.npy')
tok_val = np.load(CLAS_PATH/'tmp'/'tok_val.npy')
itos = pickle.load((LM_PATH/'tmp'/'itos.pkl').open('rb'))
stoi = collections.defaultdict(lambda:0, {v:k for k,v in enumerate(itos)})
len(itos)
Out[4]:
60002
In [38]:
trn_clas = np.array([[stoi[o] for o in p] for p in tok_trn])
val_clas = np.array([[stoi[o] for o in p] for p in tok_val])

np.save(CLAS_PATH/'tmp'/'trn_ids.npy', trn_clas)
np.save(CLAS_PATH/'tmp'/'val_ids.npy', val_clas)

Classifier

In this part, we adopt an unusual train/test hierarchy. While it is common to train on a big dataset and then test on a small one, here we want to test the hypothesis that the model can learn from few training examples. Hence we take less data for training than for testing.

In [5]:
# We randomly sample 'trn_size' reviews for training and 'val_size' reviews for validation
# The paper claims that it's possible to achieve very good results with few labeled examples
# So let's try with 100 examples for training and 5000 examples for validation.
# We encourage you to try different values to see the effect of data size on performance.
trn_size = 100
val_size = 5000
trn_clas = np.load(CLAS_PATH/'tmp'/'trn_ids.npy')
val_clas = np.load(CLAS_PATH/'tmp'/'val_ids.npy')

trn_labels = np.squeeze(np.load(CLAS_PATH/'tmp'/'trn_labels.npy'))
val_labels = np.squeeze(np.load(CLAS_PATH/'tmp'/'val_labels.npy'))

train = random.sample(list(zip(trn_clas, trn_labels)), trn_size)
trn_clas = np.array([item[0] for item in train])
trn_labels = np.array([item[1] for item in train])
del train

validation = random.sample(list(zip(val_clas, val_labels)), val_size)
val_clas = np.array([item[0] for item in validation])
val_labels = np.array([item[1] for item in validation])
del validation


bptt,em_sz,nh,nl = 70,400,1150,3
vs = len(itos)
opt_fn = partial(optim.Adam, betas=(0.8, 0.99))
bs = 48

min_lbl = trn_labels.min()
trn_labels -= min_lbl
val_labels -= min_lbl
c=int(trn_labels.max())+1
In [34]:
# Check that the classes are well balanced, so that accuracy is a good metric
# We'll also look at the other usual metrics for binary classification (precision, recall, f1 score)
len(trn_labels[trn_labels == 1]) / len(trn_labels)
In [48]:
trn_ds = TextDataset(trn_clas, trn_labels)
val_ds = TextDataset(val_clas, val_labels)
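# SortishSampler batches sequences of roughly similar length (with some randomness)
# to limit padding at training time; SortSampler sorts the validation set by length.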
trn_samp = SortishSampler(trn_clas, key=lambda x: len(trn_clas[x]), bs=bs//2)
val_samp = SortSampler(val_clas, key=lambda x: len(val_clas[x]))
trn_dl = DataLoader(trn_ds, bs//2, transpose=True, num_workers=1, pad_idx=1, sampler=trn_samp)
val_dl = DataLoader(val_ds, bs, transpose=True, num_workers=1, pad_idx=1, sampler=val_samp)
In [50]:
# We define the model: a classifier on top of the RNN language model
# We load the language model encoder that we fine-tuned before,
# and freeze everything but the last layer group, so that only the classification layers are trained.

md = ModelData(PATH, trn_dl, val_dl)
dps = np.array([0.4, 0.5, 0.05, 0.3, 0.1])

m = get_rnn_classifier(bptt, 20*70, c, vs, emb_sz=em_sz, n_hid=nh, n_layers=nl, pad_token=1,
          layers=[em_sz*3, 50, c], drops=[dps[4], 0.1],
          dropouti=dps[0], wdrop=dps[1], dropoute=dps[2], dropouth=dps[3])

opt_fn = partial(optim.Adam, betas=(0.7, 0.99))

learn = RNN_Learner(md, TextModel(to_gpu(m)), opt_fn=opt_fn)
learn.reg_fn = partial(seq2seq_reg, alpha=2, beta=1)
learn.clip=25.
learn.metrics = [accuracy]

lr=3e-3
lrm = 2.6
lrs = np.array([lr/(lrm**4), lr/(lrm**3), lr/(lrm**2), lr/lrm, lr])

lrs=np.array([1e-4,1e-4,1e-4,1e-3,1e-2])

wd = 1e-7
wd = 0
learn.load_encoder('lm1_enc')

learn.freeze_to(-1)
In [37]:
learn.lr_find(lrs/1000)
In [38]:
learn.sched.plot()
In [39]:
# Run one epoch on the classification layer
learn.fit(lrs, 1, wds=wd, cycle_len=1, use_clr=(8,3))
In [54]:
# Save the trained model
learn.save('clas_0')
learn.load('clas_0')
In [40]:
# Gradually unfreeze the next layer group to train a few more parameters than just the classifier head
learn.freeze_to(-2)
learn.fit(lrs, 1, wds=wd, cycle_len=1, use_clr=(8,3))
In [56]:
# Save the trained model
learn.save('clas_1')
learn.load('clas_1')
In [41]:
# Unfreeze everything and train for a few epochs on the whole set of parameters of the model
learn.unfreeze()
learn.fit(lrs, 1, wds=wd, cycle_len=14, use_clr=(32,10))
In [42]:
learn.sched.plot_loss()
In [59]:
# Save the model
learn.save('clas_2')

Inference

Now, let's play with the model we have just trained!

In [60]:
m = get_rnn_classifer(bptt, 20*70, c, vs, emb_sz=em_sz, n_hid=nh, n_layers=nl, pad_token=1,
          layers=[em_sz*3, 50, c], drops=[dps[4], 0.1],
          dropouti=dps[0], wdrop=dps[1], dropoute=dps[2], dropouth=dps[3])
opt_fn = partial(optim.Adam, betas=(0.7, 0.99))
learn = RNN_Learner(md, TextModel(to_gpu(m)), opt_fn=opt_fn)
learn.reg_fn = partial(seq2seq_reg, alpha=2, beta=1)
learn.clip=25.
learn.metrics = [accuracy]

lr=3e-3
lrm = 2.6
lrs = np.array([lr/(lrm**4), lr/(lrm**3), lr/(lrm**2), lr/lrm, lr])
wd = 1e-7
wd = 0
learn.load_encoder('lm1_enc')
learn.load('clas_2')
In [6]:
def get_sentiment(input_str: str):

    # predictions are done on arrays of input.
    # We only have a single input, so turn it into a 1x1 array
    texts = [input_str]

    # tokenize using the fastai wrapper around spacy
    tok = [t.split() for t in texts]
    # tok = Tokenizer().proc_all_mp(partition_by_cores(texts))

    # turn into integers for each word
    encoded = [stoi[p] for p in tok[0]]

    idx = np.array(encoded)[None]
    idx = np.transpose(idx)
    tensorIdx = VV(idx)
    m.eval()
    m.reset()
    p = m.forward(tensorIdx)
    return np.argmax(p[0][0].data.cpu().numpy())

def prediction(texts):
    """Do the prediction on a list of texts
    """
    y = []
    
    for i, text in enumerate(texts):
        if i % 1000 == 0:
            print(i)
        encoded = text
        idx = np.array(encoded)[None]
        idx = np.transpose(idx)
        tensorIdx = VV(idx)
        m.eval()
        m.reset()
        p = m.forward(tensorIdx)
        y.append(np.argmax(p[0][0].data.cpu().numpy()))
    return y
In [43]:
sentence = "I like Feedly"
start = time()
print(get_sentiment(sentence))
print(time() - start)
In [44]:
y = prediction(list(val_clas))
In [45]:
# Show the relevant metrics for binary classification
# Note that the sklearn metrics expect (y_true, y_pred), so the ground-truth labels come first
# We encourage you to train the classifier with different data sizes and observe the effect on performance
print(f'Accuracy --> {accuracy_score(val_labels, y)}')
print(f'Precision --> {precision_score(val_labels, y)}')
print(f'F1 score --> {f1_score(val_labels, y)}')
print(f'Recall score --> {recall_score(val_labels, y)}')
print(confusion_matrix(val_labels, y))
print(classification_report(val_labels, y))

What training size do we need?

The language model has already learnt a lot about syntax: it knows a great deal about the contexts in which words appear in sentences. However, it does not contain any notion of meaning. This problem is well summarised in Emily Bender's tweet during a very interesting Twitter thread on meaning in NLP that took place in July. A good summary of this thread can be found in the Hugging Face blog post. Hence meaning is very likely to be learned through supervision, with the help of ground-truth examples.

However, when we perform an NLP task such as sentiment analysis, both syntax and meaning matter! The idea is that you can save a lot of time by first absorbing a lot of "blind" syntax and only then learning meaning. Think of when you start learning a completely new field: it is far easier to learn it in your mother tongue than in a language you master less well.

The big practical gain is that once you "know" a language, you need fewer supervised examples to learn a new task! In our case, it means we need fewer labelled reviews to learn a relevant classifier.

Let's verify this hypothesis by training classifiers with several training sizes and seeing how size affects performance!

In [7]:
trn_clas = np.load(CLAS_PATH/'tmp'/'trn_ids.npy')
val_clas = np.load(CLAS_PATH/'tmp'/'val_ids.npy')

trn_labels = np.squeeze(np.load(CLAS_PATH/'tmp'/'trn_labels.npy'))
val_labels = np.squeeze(np.load(CLAS_PATH/'tmp'/'val_labels.npy'))
In [8]:
def experiment(trn_size, val_size):

    train = random.sample(list(zip(trn_clas, trn_labels)), trn_size)
    aux_trn_clas = np.array([item[0] for item in train])
    aux_trn_labels = np.array([item[1] for item in train])
    del train

    validation = random.sample(list(zip(val_clas, val_labels)), val_size)
    aux_val_clas = np.array([item[0] for item in validation])
    aux_val_labels = np.array([item[1] for item in validation])
    del validation


    bptt,em_sz,nh,nl = 70,400,1150,3
    vs = len(itos)
    opt_fn = partial(optim.Adam, betas=(0.8, 0.99))
    bs = 48

    min_lbl = aux_trn_labels.min()
    aux_trn_labels -= min_lbl
    aux_val_labels -= min_lbl
    c=int(aux_trn_labels.max())+1

    # Load data in relevant structures
    trn_ds = TextDataset(aux_trn_clas, aux_trn_labels)
    val_ds = TextDataset(aux_val_clas, aux_val_labels)
    trn_samp = SortishSampler(aux_trn_clas, key=lambda x: len(aux_trn_clas[x]), bs=bs//2)
    val_samp = SortSampler(aux_val_clas, key=lambda x: len(aux_val_clas[x]))
    trn_dl = DataLoader(trn_ds, bs//2, transpose=True, num_workers=1, pad_idx=1, sampler=trn_samp)
    val_dl = DataLoader(val_ds, bs, transpose=True, num_workers=1, pad_idx=1, sampler=val_samp)

    # Define the model and load the backbone language model
    md = ModelData(PATH, trn_dl, val_dl)
    dps = np.array([0.4, 0.5, 0.05, 0.3, 0.1])

    m = get_rnn_classifier(bptt, 20*70, c, vs, emb_sz=em_sz, n_hid=nh, n_layers=nl, pad_token=1,
              layers=[em_sz*3, 50, c], drops=[dps[4], 0.1],
              dropouti=dps[0], wdrop=dps[1], dropoute=dps[2], dropouth=dps[3])

    opt_fn = partial(optim.Adam, betas=(0.7, 0.99))

    learn = RNN_Learner(md, TextModel(to_gpu(m)), opt_fn=opt_fn)
    learn.reg_fn = partial(seq2seq_reg, alpha=2, beta=1)
    learn.clip=25.
    learn.metrics = [accuracy]

    lr=3e-3
    lrm = 2.6
    lrs = np.array([lr/(lrm**4), lr/(lrm**3), lr/(lrm**2), lr/lrm, lr])

    lrs=np.array([1e-4,1e-4,1e-4,1e-3,1e-2])

    wd = 1e-7
    wd = 0
    learn.load_encoder('lm1_enc')

    learn.freeze_to(-1)

    # Find the learning rate
    learn.lr_find(lrs/1000)

    # Run one epoch on the classification layer
    learn.fit(lrs, 1, wds=wd, cycle_len=1, use_clr=(8,3))

    # Save the trained model
    learn.save(f'{trn_size}clas_0')
    learn.load(f'{trn_size}clas_0')

    # Gradually unfreeze the next layer group to train a few more parameters than just the classifier head
    learn.freeze_to(-2)
    learn.fit(lrs, 1, wds=wd, cycle_len=1, use_clr=(8,3))

    # Save the trained model
    learn.save(f'{trn_size}clas_1')
    learn.load(f'{trn_size}clas_1')

    # Unfreeze everything and train for a few epochs on the whole set of parameters of the model
    learn.unfreeze()
    learn.fit(lrs, 1, wds=wd, cycle_len=14, use_clr=(32,10))

    # Save the model
    learn.sched.plot_loss()
    learn.save(f'{trn_size}clas_2')
In [ ]:
from time import time
val_size = 100000
for trn_size in [50, 100, 500, 1000, 5000, 10000, 20000, 50000]:
    print('#'*50)
    print(f'Experiment with training size {trn_size}')
    start = time()
    experiment(trn_size, val_size)
    t = time() - start
    print(f'Time cost: {t}')
##################################################
Experiment with training size 50
epoch      trn_loss   val_loss   accuracy                
    0      0.739306   0.713452   0.515713  
epoch      trn_loss   val_loss   accuracy                
    0      0.780253   0.682528   0.60368   
epoch      trn_loss   val_loss   accuracy                
    0      0.480616   0.665205   0.64434   
    1      0.60659    0.642443   0.60112                 
    2      0.61519    0.619721   0.69608                 
    3      0.642923   0.626678   0.61732                 
    4      0.652647   0.660426   0.51752                 
    5      0.602682   0.620081   0.5915                  
    6      0.594284   0.584023   0.66818                 
    7      0.58685    0.559354   0.73106                 
    8      0.55382    0.540782   0.77018                 
    9      0.52772    0.527295   0.79862                 
    10     0.518118   0.487917   0.83798                 
    11     0.518521   0.461052   0.84442                 
    12     0.53044    0.453327   0.84558                 
    13     0.510322   0.468408   0.83854                 
Time cost: 3158.540988445282
##################################################
Experiment with training size 100
epoch      trn_loss   val_loss   accuracy                
    0      0.598895   78.284828  0.482651  
epoch      trn_loss   val_loss   accuracy                
    0      0.62033    0.664568   0.71318   
epoch      trn_loss   val_loss   accuracy                
    0      0.602062   0.617134   0.74574   
epoch      trn_loss   val_loss   accuracy                
    0      0.509279   0.616894   0.58494   
    1      0.528293   0.574365   0.69924                 
    2      0.496826   0.544474   0.75798                 
    3      0.478803   0.559163   0.6684                  
    4      0.442439   0.568413   0.64396                 
    5      0.45688    0.435576   0.82176                 
    6      0.438374   0.401803   0.87232                 
    7      0.435346   0.382793   0.86982                 
    8      0.430963   0.38687    0.86138                 
    9      0.421749   0.363613   0.86442                 
    10     0.404818   0.347554   0.87324                 
    11     0.402366   0.34878    0.8688                  
    12     0.420744   0.341431   0.86758                 
    13     0.405834   0.34154    0.86362                 
Time cost: 3164.5589134693146
##################################################
Experiment with training size 500
                                                           
epoch      trn_loss   val_loss   accuracy                  
    0      0.531424   0.558967   0.85856   
epoch      trn_loss   val_loss   accuracy                  
    0      0.427045   0.402448   0.88602   
epoch      trn_loss   val_loss   accuracy                  
    0      0.43276    0.325113   0.88386   
    1      0.439859   0.350954   0.85564                   
    2      0.420882   0.301699   0.88072                   
    3      0.408916   0.243965   0.91232                   
    4      0.385137   0.265443   0.8924                    
    5      0.374238   0.249731   0.89888                   
    6      0.397431   0.265853   0.90392                   
    7      0.388508   0.256725   0.90612                   
    8      0.405042   0.269658   0.90676                   
    9      0.3749     0.278558   0.89718                   
    10     0.378312   0.280107   0.89688                   
    11     0.368829   0.269968   0.90122                   
    12     0.412016   0.274945   0.90104                   
    13     0.399776   0.281551   0.89786                   
Time cost: 3095.5910897254944
##################################################
Experiment with training size 1000
                                                           
epoch      trn_loss   val_loss   accuracy                  
    0      0.538816   0.369876   0.90136   
epoch      trn_loss   val_loss   accuracy                  
    0      0.453464   0.315374   0.88258   
epoch      trn_loss   val_loss   accuracy                  
    0      0.404357   0.259631   0.90256   
    1      0.419865   0.254745   0.89808                   
    2      0.445964   0.268253   0.89904                   
    3      0.427022   0.229095   0.91462                   
    4      0.414167   0.228874   0.91148                   
    5      0.407483   0.219707   0.91912                   
    6      0.381847   0.216046   0.9203                    
    7      0.365503   0.219289   0.91962                   
    8      0.358103   0.213313   0.92152                   
    9      0.328652   0.219443   0.91694                   
    10     0.360773   0.225698   0.9129                    
    11     0.325618   0.216891   0.91786                   
    12     0.358954   0.213793   0.91994                   
    13     0.324676   0.217357   0.91804                   
Time cost: 3222.9498105049133
##################################################
Experiment with training size 5000
 80%|████████  | 168/209 [00:33<00:08,  5.06it/s, loss=2.11] 
epoch      trn_loss   val_loss   accuracy                    
    0      0.476658   0.251208   0.91892   
epoch      trn_loss   val_loss   accuracy                    
    0      0.433952   0.231621   0.92414   
epoch      trn_loss   val_loss   accuracy                    
    0      0.441624   0.26157    0.91548   
    1      0.39728    0.216438   0.92384                     
    2      0.409002   0.224356   0.92368                     
    3      0.422964   0.215129   0.92208                     
    5      0.323477   0.190459   0.92822                     
    6      0.359594   0.204132   0.9299                      
    7      0.364609   0.197063   0.92962                     
    8      0.335434   0.195078   0.93054                     
    9      0.344869   0.193901   0.93174                     
    10     0.355132   0.204457   0.92736                     
    11     0.361977   0.196434   0.92986                     
    12     0.335396   0.200645   0.92896                     
    13     0.327323   0.20609    0.92624                     
Time cost: 4408.779232263565
##################################################
Experiment with training size 10000
 77%|███████▋  | 323/417 [00:54<00:15,  5.95it/s, loss=1.4]  
epoch      trn_loss   val_loss   accuracy                    
    0      0.442663   0.237719   0.91894   
epoch      trn_loss   val_loss   accuracy                    
    0      0.431919   0.23883    0.92334   
epoch      trn_loss   val_loss   accuracy                    
    0      0.423774   0.199739   0.92554   
    1      0.400266   0.206542   0.92344                     
    2      0.327765   0.191927   0.93002                     
    3      0.355688   0.193465   0.92908                     
    4      0.336286   0.182849   0.93128                     
    5      0.324608   0.18046    0.93278                     
    6      0.314902   0.183413   0.93328                     
    7      0.328284   0.178485   0.93288                     
    8      0.337061   0.180216   0.93436                     
    9      0.308937   0.179975   0.9341                      
    10     0.290357   0.178364   0.93366                     
    11     0.301147   0.175089   0.93584                     
    12     0.267383   0.176672   0.934                       
    13     0.305133   0.17432    0.93538                     
Time cost: 5908.472403526306
##################################################
Experiment with training size 20000
 75%|███████▍  | 623/834 [01:46<00:36,  5.84it/s, loss=1.45] 
epoch      trn_loss   val_loss   accuracy                    
    0      0.425248   0.229867   0.91804   
epoch      trn_loss   val_loss   accuracy                    
    0      0.410012   0.210839   0.92766   
epoch      trn_loss   val_loss   accuracy                    
    0      0.418405   0.202191   0.92848   
    1      0.385172   0.21752    0.92934                     
    2      0.341867   0.1879     0.93032                     
    3      0.343511   0.176737   0.93358                     
    4      0.299173   0.169992   0.9357                      
 58%|█████▊    | 480/834 [03:46<02:46,  2.12it/s, loss=0.315]
IOPub message rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_msg_rate_limit`.

Current values:
NotebookApp.iopub_msg_rate_limit=1000.0 (msgs/sec)
NotebookApp.rate_limit_window=3.0 (secs)

    9      0.313465   0.162371   0.93966                     
    10     0.2692     0.162227   0.93946                     
    11     0.272758   0.159716   0.94032                     
 80%|███████▉  | 666/834 [04:43<01:11,  2.35it/s, loss=0.261]
epoch      trn_loss   val_loss   accuracy                      
    0      0.441473   0.254497   0.9168    
 98%|█████████▊| 2044/2084 [07:08<00:08,  4.77it/s, loss=0.414]
    1      0.309567   0.170769   0.9367                        
 80%|███████▉  | 1664/2084 [11:45<02:57,  2.36it/s, loss=0.249]
    4      0.257701   0.153826   0.9416                        
 80%|███████▉  | 1665/2084 [13:25<03:22,  2.07it/s, loss=0.239]
    6      0.24764    0.148807   0.94436                       
    7      0.239934   0.146907   0.9456                        
    8      0.224837   0.156241   0.94496                       
  9%|▉         | 189/2084 [01:18<13:09,  2.40it/s, loss=0.212]
    10     0.212315   0.145792   0.94616                       
    11     0.221374   0.14564    0.9458                        
  8%|▊         | 166/2084 [01:09<13:18,  2.40it/s, loss=0.186]

The notebook output was truncated here by Jupyter's IOPub rate limiting; you might want to run this cell from a Python script instead.

Conclusions

Let's look at the evolution of accuracy as we increase the size of the training data. For each training size, we report the best accuracy across epochs.

In [32]:
import matplotlib.pyplot as plt

best_acc = [0.84558, 0.87324, 0.91232, 0.9203, 0.93174, 0.93584, 0.94032, 0.94616]
sizes = [50, 100, 500, 1000, 5000, 10000, 20000, 50000]
plt.plot(sizes, best_acc)
plt.title('Evolution of performance when increasing the training size')
plt.xlabel('Training size')
plt.ylabel('Accuracy')
plt.show()

plt.plot(sizes, best_acc)
plt.title('Evolution of performance when increasing the training size, Zoom on the [0-10000] size zone')
plt.xlabel('Training size')
plt.ylabel('Accuracy')
plt.xlim([0, 10000])
plt.show()

plt.plot(np.log(sizes)/np.log(10), best_acc)
plt.title('Evolution of performance when increasing the training size, with log scale for size')
plt.xlabel('Training size (log)')
plt.ylabel('Accuracy')
plt.show()
  • The first observation is that, even with only 50 samples, we get a pretty good accuracy of about 0.85!
  • Performance then improves substantially when going from 50 to 1,000 samples.
  • ULMFiT beats the reported FastText score (~0.92) using only 1,000 labelled samples! Note that the FastText score was obtained by training on the whole training set (3.6M samples).
  • Accuracy keeps rising as we increase the training size, but more and more slowly. This is where the trade-off appears: you have to decide whether an extra 0.1% of accuracy is worth the cost of more labelled data.
  • From the log-scale graph, we might expect even better results as the training size grows. With 3.6M training reviews available, we could use orders of magnitude more data, so we might expect to reach an accuracy of 0.95 or more with the full dataset.