# Transfer Learning for NLP: Sentiment Analysis on Amazon Reviews

In this notebook, we show how transfer learning can be applied to sentiment detection on Amazon reviews, classifying each review as positive or negative.

This notebook builds on ULMFiT, the work of Howard and Ruder. The idea of the paper (and of its implementation, explained in the fast.ai deep learning course) is to start from a language model trained on a very large dataset, e.g. a Wikipedia dump. The intuition is that if a model is able to predict the next word at each position in a text, it has learnt something about the structure of the language we are using.

Word2vec and the like have led to huge improvements on various NLP tasks. This can be seen as a first step towards transfer learning, where the pre-trained word vectors correspond to a transfer of the embedding layer only. The ambition of ULMFiT (and of others such as ELMo or the recently introduced Transformer language model) is to progressively bring the NLP field to the state that Computer Vision has reached thanks to the ImageNet challenge. Today it is easy to download a model pre-trained on a massive dataset of images, remove the last layer and replace it with a classifier or a regressor, depending on the task.

With ULMFiT, the goal is for everyone to be able to take a pre-trained language model and use it as a backbone, on top of which we can add a classifier or a regressor. The game-changing aspect of transfer learning is that we are much less limited by the size of the training data! With only a fraction of the data that was necessary before, we can train a classifier/regressor and obtain very good results from few labelled examples.

Given that labelled text data is difficult to obtain, whereas unlabelled text data is almost unlimited, transfer learning is likely to change the field of NLP radically, and to help bring it to a level of maturity closer to that of computer vision.

The architecture of the language model used in ULMFiT is the AWD-LSTM language model by Merity.

While we use this language model for this experiment, we are also keeping an eye on a recently proposed character-level language model, Contextual String Embeddings, by Akbik.

# Content of this notebook

This notebook illustrates the power of ULMFiT on a dataset of Amazon reviews available on Kaggle at https://www.kaggle.com/bittlingmayer/amazonreviews/home. We reuse code from the excellent fastai course and apply it to a different dataset. The original code is available at https://github.com/fastai/fastai/tree/master/courses/dl2

The data consists of 4M reviews, each either positive or negative. Training a FastText classifier on it results in an F1 score of 0.916. We show that, using only a fraction of this dataset, we are able to reach similar and even better results.

We encourage you to try it on your own tasks! Note that if you are interested in Regression instead of classification, you can also do it following this advice.

The notebook is organized as such:

• Tokenize the reviews and create dictionaries
• Fine-tune the language model on the Amazon review texts

We have then the backbone of our algorithm: a pre-trained language model fine-tuned on Amazon reviews

• Add a classifier to the language model and train the classifier layer only
• Gradually unfreeze successive layers to train different layers on the Amazon reviews
• Run a full classification task for several epochs
• Use the model for inference!

We end this notebook by looking at the specific effect of training size on the overall performance. This is to test the hypothesis that the ULMFit model does not need much labeled data to perform well.

# Data

Before starting, download the data from https://www.kaggle.com/bittlingmayer/amazonreviews, put the extracted files into an ./Amazon folder wherever you like, and use that path in this notebook.

Also, we recommend working in a dedicated environment (e.g. mkvirtualenv fastai). Then clone the fastai GitHub repo https://github.com/fastai/fastai and install its requirements.

In [2]:
from fastai.text import *
import html
import os
import pandas as pd
import pickle
import re
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score, classification_report, \
confusion_matrix
from sklearn.model_selection import train_test_split
from time import time

In [2]:
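path = '/your/path/to/folder/Amazon'

# Read one review per line; each line starts with '__label__1' or '__label__2'
train = []
with open(os.path.join(path, 'train.ft.txt'), 'r') as file:
    for line in file:
        train.append(line)

test = []
with open(os.path.join(path, 'test.ft.txt'), 'r') as file:
    for line in file:
        test.append(line)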

In [65]:
print(f'The train data contains {len(train)} examples')
print(f'The test data contains {len(test)} examples')

In [3]:
BOS = 'xbos'  # beginning-of-sentence tag
FLD = 'xfld'  # data field tag

PATH=Path('/your/path/to/folder/Amazon')

CLAS_PATH=PATH/'amazon_class'
CLAS_PATH.mkdir(exist_ok=True)

LM_PATH=PATH/'amazon_lm'
LM_PATH.mkdir(exist_ok=True)

In [12]:
# Each item is '__label__1/2' and then the review so we split to get texts and labels
trn_texts,trn_labels = [text[10:] for text in train], [text[:10] for text in train]
trn_labels = [0 if label == '__label__1' else 1 for label in trn_labels]
val_texts,val_labels = [text[10:] for text in test], [text[:10] for text in test]
val_labels = [0 if label == '__label__1' else 1 for label in val_labels]
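
To make the slicing above concrete, here is a small self-contained sketch of how one fastText-formatted line is split into a label and a text. The sample line and the `parse_line` helper are illustrative assumptions (they are not part of the dataset or of fastai), and the `.strip()` call is an addition that the cell above does not perform.

```python
# Hypothetical helper illustrating the '__label__1/2' prefix convention.
PREFIX_LEN = len('__label__1')  # 10 characters, hence the [:10] / [10:] slices


def parse_line(line):
    """Return (label, text) with label 0 for '__label__1' and 1 otherwise."""
    raw_label, text = line[:PREFIX_LEN], line[PREFIX_LEN:]
    label = 0 if raw_label == '__label__1' else 1
    return label, text.strip()


sample = '__label__2 Great sound: I love these headphones.\n'  # made-up example line
label, text = parse_line(sample)
print(label, text)
```

Mapping `__label__1` to 0 and `__label__2` to 1 keeps the labels aligned with the `['neg', 'pos']` class order used later in the notebook.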

In [13]:
# Following fast.ai recommendations we put our data in pandas dataframes
col_names = ['labels','text']

df_trn = pd.DataFrame({'text':trn_texts, 'labels':trn_labels}, columns=col_names)
df_val = pd.DataFrame({'text':val_texts, 'labels':val_labels}, columns=col_names)

In [66]:
df_trn.head(10)

In [16]:
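df_trn.to_csv(CLAS_PATH/'train.csv', header=False, index=False)
df_val.to_csv(CLAS_PATH/'test.csv', header=False, index=False)  # labelled test set, reused for classification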

In [17]:
CLASSES = ['neg', 'pos']
(CLAS_PATH/'classes.txt').open('w').writelines(f'{o}\n' for o in CLASSES)


# Language Model

In [11]:
# We're going to fine tune the language model so it's ok to take some of the test set in our train data
# for the lm fine-tuning
trn_texts,val_texts = train_test_split(np.concatenate([trn_texts,val_texts]), test_size=0.1)

df_trn = pd.DataFrame({'text':trn_texts, 'labels':[0]*len(trn_texts)}, columns=col_names)
df_val = pd.DataFrame({'text':val_texts, 'labels':[0]*len(val_texts)}, columns=col_names)


In [19]:
# Here we use functions from the fast.ai course to get data

chunksize=24000
re1 = re.compile(r'  +')

def fixup(x):
    # Undo common HTML/escape artefacts left in the raw text
    x = x.replace('#39;', "'").replace('amp;', '&').replace('#146;', "'").replace(
        'nbsp;', ' ').replace('#36;', '$').replace('\\n', "\n").replace('quot;', "'").replace(
        '<br />', "\n").replace('\\"', '"').replace('<unk>','u_n').replace(' @.@ ','.').replace(
        ' @-@ ','-').replace('\\', ' \\ ')
    return re1.sub(' ', html.unescape(x))

def get_texts(df, n_lbls=1):
    # The first n_lbls columns hold the labels, the remaining columns hold text fields
    labels = df.iloc[:,range(n_lbls)].values.astype(np.int64)
    texts = f'\n{BOS} {FLD} 1 ' + df[n_lbls].astype(str)
    for i in range(n_lbls+1, len(df.columns)):
        texts += f' {FLD} {i-n_lbls} ' + df[i].astype(str)
    texts = list(texts.apply(fixup).values)

    tok = Tokenizer().proc_all_mp(partition_by_cores(texts))
    return tok, list(labels)

def get_all(df, n_lbls):
    # df is an iterator over DataFrame chunks (pd.read_csv with chunksize)
    tok, labels = [], []
    for i, r in enumerate(df):
        print(i)
        tok_, labels_ = get_texts(r, n_lbls)
        tok += tok_
        labels += labels_
    return tok, labels
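
The text preparation above can be tried in isolation. The following is a minimal sketch using only the standard library: `clean` is a simplified stand-in for `fixup` (it handles only HTML entities, `<br />` and repeated spaces), and `mark_fields` is a hypothetical helper that mimics how the `xbos` and `xfld` tags are prepended. The input strings are made up.

```python
import html
import re

BOS, FLD = 'xbos', 'xfld'          # same tags as in the notebook
multi_space = re.compile(r'  +')   # two or more consecutive spaces


def clean(text):
    """Simplified stand-in for fixup: decode HTML entities, drop <br />, squeeze spaces."""
    text = text.replace('<br />', '\n')
    return multi_space.sub(' ', html.unescape(text))


def mark_fields(fields):
    """Prefix a document with BOS and number each text field with FLD."""
    marked = ' '.join(f'{FLD} {i} {clean(f)}' for i, f in enumerate(fields, 1))
    return f'\n{BOS} {marked}'


print(repr(mark_fields(['Great &amp; cheap', 'Works   well'])))
```

The tags give the model an explicit signal for where a document starts and where each field begins, which matters once all reviews are concatenated into a single stream for language modelling.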


In [21]:
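# This cell can take a while on a large dataset: run it once, then comment it out
# Assumed step (as in the fastai imdb notebook): get_all iterates over chunks read back from CSV
df_trn.to_csv(LM_PATH/'train.csv', header=False, index=False)
df_val.to_csv(LM_PATH/'test.csv', header=False, index=False)
df_trn = pd.read_csv(LM_PATH/'train.csv', header=None, chunksize=chunksize)
df_val = pd.read_csv(LM_PATH/'test.csv', header=None, chunksize=chunksize)
tok_trn, trn_labels = get_all(df_trn, 1)
tok_val, val_labels = get_all(df_val, 1)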

In [15]:
# Run this cell once and comment everything but the load statements for later use

(LM_PATH/'tmp').mkdir(exist_ok=True)
np.save(LM_PATH/'tmp'/'tok_trn.npy', tok_trn)
np.save(LM_PATH/'tmp'/'tok_val.npy', tok_val)

In [63]:
# Check the most common tokens
freq = Counter(p for o in tok_trn for p in o)
freq.most_common(25)

In [64]:
# Check the least common tokens
freq.most_common()[-25:]

In [ ]:
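# Build the vocabulary by keeping only the tokens that appear frequently enough,
# and cap its size. We follow the 60k recommendation here.
max_vocab = 60000
min_freq = 2

itos = [o for o,c in freq.most_common(max_vocab) if c>min_freq]
# Index 0 is the unknown token and index 1 the padding token (pad_token=1 is used later);
# the '_pad_' entry is assumed here, consistent with the vocabulary size of 60002 shown below
itos.insert(0, '_pad_')
itos.insert(0, '_unk_')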

stoi = collections.defaultdict(lambda:0, {v:k for k,v in enumerate(itos)})
len(itos)

trn_lm = np.array([[stoi[o] for o in p] for p in tok_trn])
val_lm = np.array([[stoi[o] for o in p] for p in tok_val])

np.save(LM_PATH/'tmp'/'trn_ids.npy', trn_lm)
np.save(LM_PATH/'tmp'/'val_ids.npy', val_lm)
pickle.dump(itos, open(LM_PATH/'tmp'/'itos.pkl', 'wb'))
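
The vocabulary logic above is easy to check on a toy corpus. This is a minimal sketch with made-up documents and deliberately tiny `max_vocab` and `min_freq` values; only the unknown token is reserved here, to keep the example short.

```python
from collections import Counter, defaultdict

# Toy tokenized corpus (made up for illustration)
docs = [['i', 'love', 'it'], ['i', 'hate', 'it'], ['i', 'love', 'it'], ['i']]

freq = Counter(tok for doc in docs for tok in doc)
max_vocab, min_freq = 3, 1  # tiny values so the effect is visible

# Keep the most common tokens that were seen strictly more than min_freq times
itos = [tok for tok, count in freq.most_common(max_vocab) if count > min_freq]
itos.insert(0, '_unk_')  # index 0 is reserved for unknown tokens

# Any token outside the vocabulary falls back to index 0
stoi = defaultdict(lambda: 0, {tok: i for i, tok in enumerate(itos)})

ids = [[stoi[tok] for tok in doc] for doc in docs]
print(itos)
print(ids)
```

Because `stoi` is a `defaultdict`, rare words such as `hate` in this toy corpus are silently mapped to the unknown index instead of raising a `KeyError`.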

In [10]:
# Save everything

In [33]:
vs=len(itos)
vs,len(trn_lm)


# Using a pre-trained Language Model

In [ ]:
# Uncomment this cell to download the pre-trained model.
# It will be placed into the PATH that you defined earlier.
# ! wget -nH -r -np -P {PATH} http://files.fast.ai/models/wt103/

In [5]:
# Load the weights of the model
em_sz,nh,nl = 400,1150,3

PRE_PATH = PATH/'models'/'wt103'
PRE_LM_PATH = PRE_PATH/'fwd_wt103.h5'

wgts = torch.load(PRE_LM_PATH, map_location=lambda storage, loc: storage)

In [8]:
# Check the word embedding layer and keep a 'mean word' for unknown tokens
enc_wgts = to_np(wgts['0.encoder.weight'])
row_m = enc_wgts.mean(0)

enc_wgts.shape

Out[8]:
(238462, 400)
In [12]:
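# Load the vocabulary on which the pre-trained model was trained
# (assumed to be the itos_wt103.pkl file downloaded alongside the weights)
itos2 = pickle.load((PRE_PATH/'itos_wt103.pkl').open('rb'))
# Define an embedding matrix with the vocabulary of our dataset
stoi2 = collections.defaultdict(lambda:-1, {v:k for k,v in enumerate(itos2)})

new_w = np.zeros((vs, em_sz), dtype=np.float32)
for i,w in enumerate(itos):
    r = stoi2[w]
    # Reuse the pre-trained vector when the token is known, else fall back to the mean vector
    new_w[i] = enc_wgts[r] if r>=0 else row_m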
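
The remapping of the embedding matrix can be illustrated on a tiny example. The sketch below uses made-up vocabularies and 2-dimensional vectors; the names `pre_itos`, `pre_emb` and `new_itos` are hypothetical and only stand in for the pre-trained and Amazon vocabularies.

```python
import numpy as np

# Made-up "pre-trained" vocabulary and 2-dimensional embeddings
pre_itos = ['the', 'good', 'bad']
pre_emb = np.array([[0.0, 0.0], [1.0, 2.0], [3.0, 4.0]], dtype=np.float32)
pre_stoi = {tok: i for i, tok in enumerate(pre_itos)}

# Vocabulary of the new dataset: 'kindle' is unknown to the pre-trained model
new_itos = ['bad', 'kindle', 'the']

mean_row = pre_emb.mean(axis=0)
new_emb = np.zeros((len(new_itos), pre_emb.shape[1]), dtype=np.float32)
for i, tok in enumerate(new_itos):
    j = pre_stoi.get(tok, -1)
    # Copy the known vector, or use the mean vector for tokens never seen before
    new_emb[i] = pre_emb[j] if j >= 0 else mean_row

print(new_emb)
```

Rows are reordered to follow the new vocabulary, so the same index refers to the same token in both the data and the embedding matrix.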

In [16]:
# Use the new embedding matrix for the pre-trained model
wgts['0.encoder.weight'] = T(new_w)
wgts['0.encoder_with_dropout.embed.weight'] = T(np.copy(new_w))
wgts['1.decoder.weight'] = T(np.copy(new_w))

In [17]:
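# Define the learner object to do the fine-tuning
# Here we will freeze everything except the embedding layer, so that we can learn a better
# embedding for unknown words than just the mean embedding with which we initialise them.
wd=1e-7
bptt=70
bs=52

# Assumed setup, as in the fastai imdb notebook: Adam optimizer and language-model data loaders
opt_fn = partial(optim.Adam, betas=(0.8, 0.99))
trn_dl = LanguageModelLoader(np.concatenate(trn_lm), bs, bptt)
val_dl = LanguageModelLoader(np.concatenate(val_lm), bs, bptt)
md = LanguageModelData(PATH, 1, vs, trn_dl, val_dl, bs=bs, bptt=bptt)

drops = np.array([0.25, 0.1, 0.2, 0.02, 0.15])*0.7

learner= md.get_model(opt_fn, em_sz, nh, nl,
    dropouti=drops[0], dropout=drops[1], wdrop=drops[2], dropoute=drops[3], dropouth=drops[4])

learner.metrics = [accuracy]
learner.freeze_to(-1)
# Load the pre-trained weights prepared above into the model (assumed step)
learner.model.load_state_dict(wgts)

lr=1e-3
lrs = lr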

In [22]:
# Run one epoch of fine-tuning
learner.fit(lrs/2, 1, wds=wd, use_clr=(32,2), cycle_len=1)

In [30]:
# Save the fine-tuned model and unfreeze everything to later fine-tune the whole model
learner.save('lm_last_ft')
learner.unfreeze()

In [23]:
learner.lr_find(start_lr=lrs/10, end_lr=lrs*10, linear=True)

In [24]:
learner.sched.plot()

In [ ]:
# Run this if you want to highly tune the LM to the Amazon data, with 15 epochs
# use_clr controls the shape of the cyclical (triangular) learning rate
learner.fit(lrs, 1, wds=wd, use_clr=(20,10), cycle_len=15)
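
The exact meaning of the two numbers passed to `use_clr` is defined by fastai. The sketch below only assumes one plausible reading of a triangular schedule: the first number (`div`) is the ratio between the peak and the starting learning rate, and the second (`cut_div`) sets the fraction of the cycle spent increasing, namely `1/cut_div`. The function `triangular_lr` is hypothetical and is not a fastai API.

```python
def triangular_lr(it, n_iter, peak_lr, div, cut_div):
    """Learning rate at iteration `it` of a cycle of `n_iter` iterations.

    The rate rises linearly from peak_lr/div to peak_lr during the first
    n_iter/cut_div iterations, then decays linearly back to peak_lr/div.
    """
    cut = max(1, n_iter // cut_div)
    if it <= cut:
        pct = it / cut
    else:
        pct = 1 - (it - cut) / (n_iter - cut)
    return peak_lr * (1 + pct * (div - 1)) / div


n_iter = 100
schedule = [triangular_lr(i, n_iter, peak_lr=1e-3, div=20, cut_div=10)
            for i in range(n_iter + 1)]
print(schedule[0], max(schedule), schedule[-1])
```

Under this reading, a short warm-up followed by a long decay lets the model move quickly to a good region and then refine its weights with progressively smaller steps.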

In [33]:
# Save the Backbone for further classification!!
learner.save('lm1')
learner.save_encoder('lm1_enc')

In [25]:
learner.sched.plot_loss()


# Going back to classification!

Now that we have spent some time fine-tuning the language model on our Amazon data, let's see whether we can easily classify these reviews. As before, some cells should be run only once; afterwards, simply reload the saved files.

In [35]:
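df_trn = pd.read_csv(CLAS_PATH/'train.csv', header=None, chunksize=chunksize)
df_val = pd.read_csv(CLAS_PATH/'test.csv', header=None, chunksize=chunksize)  # assumes the labelled test set was saved as test.csv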

In [26]:
tok_trn, trn_labels = get_all(df_trn, 1)
tok_val, val_labels = get_all(df_val, 1)

In [36]:
(CLAS_PATH/'tmp').mkdir(exist_ok=True)

np.save(CLAS_PATH/'tmp'/'tok_trn.npy', tok_trn)
np.save(CLAS_PATH/'tmp'/'tok_val.npy', tok_val)

np.save(CLAS_PATH/'tmp'/'trn_labels.npy', trn_labels)
np.save(CLAS_PATH/'tmp'/'val_labels.npy', val_labels)

In [4]:
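tok_trn = np.load(CLAS_PATH/'tmp'/'tok_trn.npy')
tok_val = np.load(CLAS_PATH/'tmp'/'tok_val.npy')
# Reuse the vocabulary built for the language model
itos = pickle.load(open(LM_PATH/'tmp'/'itos.pkl', 'rb'))
stoi = collections.defaultdict(lambda:0, {v:k for k,v in enumerate(itos)})
len(itos)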

Out[4]:
60002
In [38]:
trn_clas = np.array([[stoi[o] for o in p] for p in tok_trn])
val_clas = np.array([[stoi[o] for o in p] for p in tok_val])

np.save(CLAS_PATH/'tmp'/'trn_ids.npy', trn_clas)
np.save(CLAS_PATH/'tmp'/'val_ids.npy', val_clas)


# Classifier

In this part, we adopt an unusual train/test split. While it is common to train on a big dataset and then test on a small one, here we want to test the hypothesis that the model can learn from little training data. Hence we use less data for training than for testing.

In [5]:
# We select here the 'size' first reviews of our dataset
# The paper claims that it's possible to achieve very good results with few labeled examples
# So let's try with 100 examples for training, and 5000 examples for validation.
# We encourage you to try different values to see the effect of data size on performance.
trn_size = 100
val_size = 5000

train = random.sample(list(zip(trn_clas, trn_labels)), trn_size)
trn_clas = np.array([item[0] for item in train])
trn_labels = np.array([item[1] for item in train])
del train

validation = random.sample(list(zip(val_clas, val_labels)), val_size)
val_clas = np.array([item[0] for item in validation])
val_labels = np.array([item[1] for item in validation])
del validation

bptt,em_sz,nh,nl = 70,400,1150,3
vs = len(itos)
bs = 48

min_lbl = trn_labels.min()
trn_labels -= min_lbl
val_labels -= min_lbl
c=int(trn_labels.max())+1

In [34]:
# Check that the classes are well balanced, so that accuracy is a good metric
# We'll also check other usual metrics for binary classification (precision, recall, F1 score)
len(trn_labels[trn_labels == 1]) / len(trn_labels)

In [48]:
trn_ds = TextDataset(trn_clas, trn_labels)
val_ds = TextDataset(val_clas, val_labels)
trn_samp = SortishSampler(trn_clas, key=lambda x: len(trn_clas[x]), bs=bs//2)
val_samp = SortSampler(val_clas, key=lambda x: len(val_clas[x]))

In [50]:
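# We define the model: here it is a classifier on top of an RNN language model
# We load the language model encoder that we fine-tuned before
# We freeze everything but the last layer, so that we can train the classification layer only

# Assumed setup, as in the fastai imdb notebook: padded data loaders and Adam optimizer
trn_dl = DataLoader(trn_ds, bs//2, transpose=True, num_workers=1, pad_idx=1, sampler=trn_samp)
val_dl = DataLoader(val_ds, bs, transpose=True, num_workers=1, pad_idx=1, sampler=val_samp)
opt_fn = partial(optim.Adam, betas=(0.7, 0.99))

md = ModelData(PATH, trn_dl, val_dl)
dps = np.array([0.4, 0.5, 0.05, 0.3, 0.1])

m = get_rnn_classifier(bptt, 20*70, c, vs, emb_sz=em_sz, n_hid=nh, n_layers=nl, pad_token=1,
    layers=[em_sz*3, 50, c], drops=[dps[4], 0.1],
    dropouti=dps[0], wdrop=dps[1], dropoute=dps[2], dropouth=dps[3])

learn = RNN_Learner(md, TextModel(to_gpu(m)), opt_fn=opt_fn)
learn.reg_fn = partial(seq2seq_reg, alpha=2, beta=1)
learn.clip=25.
learn.metrics = [accuracy]

# Discriminative learning rates: one rate per layer group, smaller for the earlier layers
lr=3e-3
lrm = 2.6
lrs = np.array([lr/(lrm**4), lr/(lrm**3), lr/(lrm**2), lr/lrm, lr])

# These hand-picked rates override the ones computed just above
lrs=np.array([1e-4,1e-4,1e-4,1e-3,1e-2])

wd = 1e-7
wd = 0  # overrides the previous value: no weight decay

# Load the saved encoder weights from before ('lm1_enc'), and freeze everything but the last layer
learn.load_encoder('lm1_enc')
learn.freeze_to(-1)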
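
The first `lrs` array in the cell above follows a simple rule: the last layer group gets the base rate `lr`, and each group closer to the input gets the rate of the next group divided by `lrm`. A minimal sketch of that arithmetic, assuming 5 layer groups to match the 5 rates above:

```python
lr = 3e-3      # rate for the last layer group (the classifier head)
lrm = 2.6      # divisor applied once per step towards the input
n_groups = 5   # number of layer groups assumed here, matching the 5 rates above

# Group 0 is closest to the input, group n_groups-1 is the head
lrs = [lr / lrm ** (n_groups - 1 - g) for g in range(n_groups)]
for g, rate in enumerate(lrs):
    print(f'group {g}: lr = {rate:.2e}')
```

Lower layers hold general knowledge about the language and are updated cautiously, while the freshly initialised head is allowed to move much faster.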

In [37]:
learn.lr_find(lrs/1000)

In [38]:
learn.sched.plot()

In [39]:
# Run one epoch on the classification layer
learn.fit(lrs, 1, wds=wd, cycle_len=1, use_clr=(8,3))

In [54]:
# Save the trained model
learn.save('clas_0')

In [40]:
# Gradually unfreeze another layer to train a bit more parameters than just the classifier layer
learn.freeze_to(-2)
learn.fit(lrs, 1, wds=wd, cycle_len=1, use_clr=(8,3))

In [56]:
# Save the trained model
learn.save('clas_1')

In [41]:
# Unfreeze everything and train for a few epochs on the whole set of parameters of the model
learn.unfreeze()
learn.fit(lrs, 1, wds=wd, cycle_len=14, use_clr=(32,10))
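
The three training stages above (last group only, then the last two, then everything) can be pictured with a toy model. The sketch below is an illustration of the idea only and is not fastai's implementation: the group names are made up, and `freeze_to` here is a hypothetical stand-in that simply toggles a `trainable` flag.

```python
# Toy model: an ordered list of layer groups, each with a 'trainable' flag.
# The group names are made up for illustration.
groups = [{'name': n, 'trainable': False}
          for n in ['embedding', 'rnn_1', 'rnn_2', 'rnn_3', 'classifier']]


def freeze_to(groups, n):
    """Freeze every group before index n and make the others trainable.

    Negative n counts from the end, as with Python list indexing.
    """
    start = n if n >= 0 else len(groups) + n
    for i, g in enumerate(groups):
        g['trainable'] = i >= start


def trainable_names(groups):
    return [g['name'] for g in groups if g['trainable']]


for n in (-1, -2, 0):  # last group only, then the last two, then everything
    freeze_to(groups, n)
    print(n, trainable_names(groups))
```

Training the new head first avoids sending large, noisy gradients through the pre-trained layers before the classifier has learnt anything useful.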

In [42]:
learn.sched.plot_loss()

In [59]:
# Save the model
learn.save('clas_2')


# Inference

Now, let's play with the model we've just trained!

In [60]:
m = get_rnn_classifier(bptt, 20*70, c, vs, emb_sz=em_sz, n_hid=nh, n_layers=nl, pad_token=1,
    layers=[em_sz*3, 50, c], drops=[dps[4], 0.1],
    dropouti=dps[0], wdrop=dps[1], dropoute=dps[2], dropouth=dps[3])
learn = RNN_Learner(md, TextModel(to_gpu(m)), opt_fn=opt_fn)
learn.reg_fn = partial(seq2seq_reg, alpha=2, beta=1)
learn.clip=25.
learn.metrics = [accuracy]

lr=3e-3
lrm = 2.6
lrs = np.array([lr/(lrm**4), lr/(lrm**3), lr/(lrm**2), lr/lrm, lr])
wd = 1e-7
wd = 0

# Assumed step: load the classifier weights saved after the last training stage
learn.load('clas_2')

In [6]:
def get_sentiment(input_str: str):
    # Predictions are done on arrays of inputs.
    # We only have a single input, so wrap it in a list
    texts = [input_str]

    # Simple whitespace tokenization; the fastai wrapper around spaCy is left commented out
    tok = [t.split() for t in texts]
    # tok = Tokenizer().proc_all_mp(partition_by_cores(texts))

    # Turn each token into its integer id (unknown tokens map to 0)
    encoded = [stoi[p] for p in tok[0]]

    # Shape the ids as a (sequence_length, 1) batch
    idx = np.array(encoded)[None]
    idx = np.transpose(idx)
    tensorIdx = VV(idx)
    m.eval()
    m.reset()
    p = m.forward(tensorIdx)
    return np.argmax(p[0][0].data.cpu().numpy())

def prediction(texts):
    """Do the prediction on a list of already encoded texts (lists of token ids)
    """
    y = []

    for i, text in enumerate(texts):
        if i % 1000 == 0:
            print(i)
        encoded = text
        idx = np.array(encoded)[None]
        idx = np.transpose(idx)
        tensorIdx = VV(idx)
        m.eval()
        m.reset()
        p = m.forward(tensorIdx)
        y.append(np.argmax(p[0][0].data.cpu().numpy()))
    return y
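
Both functions above return the index of the largest output score. As a small sketch of how such an index could be turned into a readable label with a confidence value, the code below applies a softmax to a pair of raw scores. It assumes that the model outputs one unnormalised score per class; the scores and the `to_label` helper are made up for illustration.

```python
import math

CLASSES = ['neg', 'pos']  # same order as in classes.txt


def softmax(scores):
    """Numerically stable softmax over a list of raw scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def to_label(scores):
    """Return the predicted class name and its probability."""
    probs = softmax(scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    return CLASSES[best], probs[best]


print(to_label([-1.2, 2.3]))  # made-up scores for one review
```

The softmax does not change which class wins, so the `argmax` used in the notebook gives the same decision; the probability is only useful for ranking or thresholding predictions.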

In [43]:
sentence = "I like Feedly"
start = time()
print(get_sentiment(sentence))
print(time() - start)

In [44]:
y = prediction(list(val_clas))

In [45]:
# Show relevant metrics for binary classification
# We encourage you to try training the classifier with different data sizes and see the effect on performance
# scikit-learn metrics expect the true labels first, then the predictions
print(f'Accuracy --> {accuracy_score(val_labels, y)}')
print(f'Precision --> {precision_score(val_labels, y)}')
print(f'F1 score --> {f1_score(val_labels, y)}')
print(f'Recall score --> {recall_score(val_labels, y)}')
print(confusion_matrix(val_labels, y))
print(classification_report(val_labels, y))
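
For reference, the four scalar metrics can be computed by hand from the confusion counts. The sketch below uses made-up labels and a hypothetical `binary_metrics` helper. It also shows that precision and recall are not symmetric in their arguments: swapping true labels and predictions exchanges the two, while accuracy and F1 stay the same.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for labels in {0, 1} (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {'accuracy': (tp + tn) / len(pairs),
            'precision': precision, 'recall': recall, 'f1': f1}


y_true = [1, 1, 1, 0, 0, 0, 0, 1]  # made-up labels
y_pred = [1, 1, 0, 0, 0, 1, 1, 1]  # made-up predictions
print(binary_metrics(y_true, y_pred))
print(binary_metrics(y_pred, y_true))  # arguments swapped on purpose
```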


# What training size do we need?

The language model has already learnt a lot about syntax. It is very knowledgeable about the context in which words appear in sentences. However, the language model does not contain any notion of meaning. This problem is well summarised in Emily Bender's tweet from a very interesting Twitter thread about meaning in NLP that took place in July. A nice summary of this thread can be found in the Hugging Face blog post. Hence meaning in language is very likely to be learned through supervision, with the help of ground-truth examples.

However, when we perform an NLP task, sentiment analysis in our example, both syntax and meaning are important! The idea is that you can save a lot of time by first being taught a lot of "blind" syntax, and then learning meaning. Think of when you start learning a completely new field: it is far easier to learn it in your mother tongue than in another language that you master less well.

The big practical gain here is that once you "know" a language, you need fewer supervised examples to learn something new! In our example, it means we need fewer labelled reviews to learn a relevant classifier.

Let's verify this hypothesis by training a classifier with several training sizes and seeing how the size affects performance!

In [7]:
trn_clas = np.load(CLAS_PATH/'tmp'/'trn_ids.npy')


In [8]:
def experiment(trn_size, val_size):
    # Draw random subsets of the encoded reviews for training and validation
    train = random.sample(list(zip(trn_clas, trn_labels)), trn_size)
    aux_trn_clas = np.array([item[0] for item in train])
    aux_trn_labels = np.array([item[1] for item in train])
    del train

    validation = random.sample(list(zip(val_clas, val_labels)), val_size)
    aux_val_clas = np.array([item[0] for item in validation])
    aux_val_labels = np.array([item[1] for item in validation])
    del validation

    bptt,em_sz,nh,nl = 70,400,1150,3
    vs = len(itos)
    bs = 48

    min_lbl = aux_trn_labels.min()
    aux_trn_labels -= min_lbl
    aux_val_labels -= min_lbl
    c=int(aux_trn_labels.max())+1

    # Load data in relevant structures
    trn_ds = TextDataset(aux_trn_clas, aux_trn_labels)
    val_ds = TextDataset(aux_val_clas, aux_val_labels)
    trn_samp = SortishSampler(aux_trn_clas, key=lambda x: len(aux_trn_clas[x]), bs=bs//2)
    val_samp = SortSampler(aux_val_clas, key=lambda x: len(aux_val_clas[x]))
    # Assumed setup, as in the fastai imdb notebook: padded data loaders and Adam optimizer
    trn_dl = DataLoader(trn_ds, bs//2, transpose=True, num_workers=1, pad_idx=1, sampler=trn_samp)
    val_dl = DataLoader(val_ds, bs, transpose=True, num_workers=1, pad_idx=1, sampler=val_samp)
    opt_fn = partial(optim.Adam, betas=(0.7, 0.99))

    # Define the model and load the backbone language model
    md = ModelData(PATH, trn_dl, val_dl)
    dps = np.array([0.4, 0.5, 0.05, 0.3, 0.1])

    m = get_rnn_classifier(bptt, 20*70, c, vs, emb_sz=em_sz, n_hid=nh, n_layers=nl, pad_token=1,
        layers=[em_sz*3, 50, c], drops=[dps[4], 0.1],
        dropouti=dps[0], wdrop=dps[1], dropoute=dps[2], dropouth=dps[3])

    learn = RNN_Learner(md, TextModel(to_gpu(m)), opt_fn=opt_fn)
    learn.reg_fn = partial(seq2seq_reg, alpha=2, beta=1)
    learn.clip=25.
    learn.metrics = [accuracy]

    lr=3e-3
    lrm = 2.6
    lrs = np.array([lr/(lrm**4), lr/(lrm**3), lr/(lrm**2), lr/lrm, lr])

    # These hand-picked rates override the ones computed just above
    lrs=np.array([1e-4,1e-4,1e-4,1e-3,1e-2])

    wd = 1e-7
    wd = 0  # overrides the previous value: no weight decay

    # Assumed step: load the fine-tuned encoder saved earlier as 'lm1_enc'
    learn.load_encoder('lm1_enc')
    learn.freeze_to(-1)

    # Find the learning rate
    learn.lr_find(lrs/1000)

    # Run one epoch on the classification layer
    learn.fit(lrs, 1, wds=wd, cycle_len=1, use_clr=(8,3))

    # Save the trained model
    learn.save(f'{trn_size}clas_0')

    # Gradually unfreeze another layer to train a few more parameters than just the classifier layer
    learn.freeze_to(-2)
    learn.fit(lrs, 1, wds=wd, cycle_len=1, use_clr=(8,3))

    # Save the trained model
    learn.save(f'{trn_size}clas_1')

    # Unfreeze everything and train for a few epochs on the whole set of parameters of the model
    learn.unfreeze()
    learn.fit(lrs, 1, wds=wd, cycle_len=14, use_clr=(32,10))

    # Plot the loss and save the model
    learn.sched.plot_loss()
    learn.save(f'{trn_size}clas_2')

In [ ]:
from time import time
val_size = 100000
for trn_size in [50, 100, 500, 1000, 5000, 10000, 20000, 50000]:
    print('#'*50)
    print(f'Experiment with training size {trn_size}')
    start = time()
    experiment(trn_size, val_size)
    t = time() - start
    print(f'Time cost: {t}')

##################################################
Experiment with training size 50

epoch      trn_loss   val_loss   accuracy
0      0.739306   0.713452   0.515713

epoch      trn_loss   val_loss   accuracy
0      0.780253   0.682528   0.60368

epoch      trn_loss   val_loss   accuracy
0      0.480616   0.665205   0.64434

    1      0.60659    0.642443   0.60112
2      0.61519    0.619721   0.69608
3      0.642923   0.626678   0.61732
4      0.652647   0.660426   0.51752
5      0.602682   0.620081   0.5915
6      0.594284   0.584023   0.66818
7      0.58685    0.559354   0.73106
8      0.55382    0.540782   0.77018
9      0.52772    0.527295   0.79862
10     0.518118   0.487917   0.83798
11     0.518521   0.461052   0.84442
12     0.53044    0.453327   0.84558
13     0.510322   0.468408   0.83854
Time cost: 3158.540988445282
##################################################
Experiment with training size 100

epoch      trn_loss   val_loss   accuracy
0      0.598895   78.284828  0.482651

epoch      trn_loss   val_loss   accuracy
0      0.62033    0.664568   0.71318

epoch      trn_loss   val_loss   accuracy
0      0.602062   0.617134   0.74574

epoch      trn_loss   val_loss   accuracy
0      0.509279   0.616894   0.58494
1      0.528293   0.574365   0.69924
2      0.496826   0.544474   0.75798
3      0.478803   0.559163   0.6684
4      0.442439   0.568413   0.64396
5      0.45688    0.435576   0.82176
6      0.438374   0.401803   0.87232
7      0.435346   0.382793   0.86982
8      0.430963   0.38687    0.86138
9      0.421749   0.363613   0.86442
10     0.404818   0.347554   0.87324
11     0.402366   0.34878    0.8688
12     0.420744   0.341431   0.86758
13     0.405834   0.34154    0.86362
Time cost: 3164.5589134693146
##################################################
Experiment with training size 500



epoch      trn_loss   val_loss   accuracy
0      0.531424   0.558967   0.85856

epoch      trn_loss   val_loss   accuracy
0      0.427045   0.402448   0.88602

epoch      trn_loss   val_loss   accuracy
0      0.43276    0.325113   0.88386
1      0.439859   0.350954   0.85564
2      0.420882   0.301699   0.88072
3      0.408916   0.243965   0.91232
4      0.385137   0.265443   0.8924
5      0.374238   0.249731   0.89888
6      0.397431   0.265853   0.90392
7      0.388508   0.256725   0.90612
8      0.405042   0.269658   0.90676
9      0.3749     0.278558   0.89718
10     0.378312   0.280107   0.89688
11     0.368829   0.269968   0.90122
12     0.412016   0.274945   0.90104
13     0.399776   0.281551   0.89786
Time cost: 3095.5910897254944
##################################################
Experiment with training size 1000



epoch      trn_loss   val_loss   accuracy
0      0.538816   0.369876   0.90136

epoch      trn_loss   val_loss   accuracy
0      0.453464   0.315374   0.88258

epoch      trn_loss   val_loss   accuracy
0      0.404357   0.259631   0.90256
1      0.419865   0.254745   0.89808
2      0.445964   0.268253   0.89904
3      0.427022   0.229095   0.91462
4      0.414167   0.228874   0.91148
5      0.407483   0.219707   0.91912
6      0.381847   0.216046   0.9203
7      0.365503   0.219289   0.91962
8      0.358103   0.213313   0.92152
9      0.328652   0.219443   0.91694
10     0.360773   0.225698   0.9129
11     0.325618   0.216891   0.91786
12     0.358954   0.213793   0.91994
13     0.324676   0.217357   0.91804
Time cost: 3222.9498105049133
##################################################
Experiment with training size 5000

epoch      trn_loss   val_loss   accuracy
0      0.476658   0.251208   0.91892

epoch      trn_loss   val_loss   accuracy
0      0.433952   0.231621   0.92414

epoch      trn_loss   val_loss   accuracy
0      0.441624   0.26157    0.91548
1      0.39728    0.216438   0.92384
2      0.409002   0.224356   0.92368
3      0.422964   0.215129   0.92208
5      0.323477   0.190459   0.92822
6      0.359594   0.204132   0.9299
7      0.364609   0.197063   0.92962
8      0.335434   0.195078   0.93054
9      0.344869   0.193901   0.93174
10     0.355132   0.204457   0.92736
11     0.361977   0.196434   0.92986
12     0.335396   0.200645   0.92896
13     0.327323   0.20609    0.92624
Time cost: 4408.779232263565
##################################################
Experiment with training size 10000

epoch      trn_loss   val_loss   accuracy
0      0.442663   0.237719   0.91894

epoch      trn_loss   val_loss   accuracy
0      0.431919   0.23883    0.92334

epoch      trn_loss   val_loss   accuracy
0      0.423774   0.199739   0.92554
1      0.400266   0.206542   0.92344
2      0.327765   0.191927   0.93002
3      0.355688   0.193465   0.92908
4      0.336286   0.182849   0.93128
5      0.324608   0.18046    0.93278
6      0.314902   0.183413   0.93328
7      0.328284   0.178485   0.93288
8      0.337061   0.180216   0.93436
9      0.308937   0.179975   0.9341
10     0.290357   0.178364   0.93366
11     0.301147   0.175089   0.93584
12     0.267383   0.176672   0.934
13     0.305133   0.17432    0.93538
Time cost: 5908.472403526306
##################################################
Experiment with training size 20000

epoch      trn_loss   val_loss   accuracy
0      0.425248   0.229867   0.91804

epoch      trn_loss   val_loss   accuracy
0      0.410012   0.210839   0.92766

epoch      trn_loss   val_loss   accuracy
0      0.418405   0.202191   0.92848
1      0.385172   0.21752    0.92934
2      0.341867   0.1879     0.93032
3      0.343511   0.176737   0.93358
4      0.299173   0.169992   0.9357
    9      0.313465   0.162371   0.93966
10     0.2692     0.162227   0.93946
11     0.272758   0.159716   0.94032

epoch      trn_loss   val_loss   accuracy
0      0.441473   0.254497   0.9168

    1      0.309567   0.170769   0.9367
    4      0.257701   0.153826   0.9416
    6      0.24764    0.148807   0.94436
7      0.239934   0.146907   0.9456
8      0.224837   0.156241   0.94496
    10     0.212315   0.145792   0.94616
11     0.221374   0.14564    0.9458

Part of the output above was lost because the notebook hit its output rate limit; you might want to run this cell from a Python script instead.

# Conclusions

Let's look at how the accuracy evolves when we increase the size of the training data. For each training size, we report the best accuracy among the different epochs.

In [32]:
import matplotlib.pyplot as plt

best_acc = [0.84558, 0.87324, 0.91232, 0.9203, 0.93174, 0.93584, 0.94032, 0.94616]
sizes = [50, 100, 500, 1000, 5000, 10000, 20000, 50000]
plt.plot(sizes, best_acc)
plt.title('Evolution of performance when increasing the training size')
plt.xlabel('Training size')
plt.ylabel('Accuracy')
plt.show()

plt.plot(sizes, best_acc)
plt.title('Evolution of performance when increasing the training size, Zoom on the [0-10000] size zone')
plt.xlabel('Training size')
plt.ylabel('Accuracy')
plt.xlim([0, 10000])
plt.show()

plt.plot(np.log(sizes)/np.log(10), best_acc)
plt.title('Evolution of performance when increasing the training size, with log scale for size')
plt.xlabel('Training size (log)')
plt.ylabel('Accuracy')
plt.show()

• The first observation is that, even with only 50 samples, we get a pretty good accuracy of 0.85!
• Then we see that the improvement is substantial when going from 50 to 1000 samples
• ULMFiT beats the reported score from FastText (~0.92) when using only 1000 samples! Note that the reported FastText score comes from training on the whole training data (3.6M samples)
• The accuracy continues to rise when we increase the training size, but more slowly. This is where the trade-off comes in: you have to decide whether the extra 0.1% in accuracy is worth paying for more labelled data!
• From the log-scale graph we might expect even better results when raising the training size further. We have 3.6M training reviews, orders of magnitude more than we used here, so we could expect to reach 0.95 accuracy or more with the full dataset.
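
The diminishing returns described in these observations can be quantified directly from the reported numbers. The sketch below reuses the `sizes` and `best_acc` values from the plotting cell and computes, for each pair of consecutive experiments, the accuracy gained and the gain per tenfold increase in training size. The normalisation per factor of 10 is our own choice for this illustration.

```python
import math

sizes = [50, 100, 500, 1000, 5000, 10000, 20000, 50000]
best_acc = [0.84558, 0.87324, 0.91232, 0.9203, 0.93174, 0.93584, 0.94032, 0.94616]

# For each consecutive pair: (from size, to size, absolute gain, gain per 10x more data)
steps = []
for i in range(1, len(sizes)):
    gain = best_acc[i] - best_acc[i - 1]
    per_decade = gain / math.log10(sizes[i] / sizes[i - 1])
    steps.append((sizes[i - 1], sizes[i], gain, per_decade))

for s0, s1, gain, per_decade in steps:
    print(f'{s0:>6} -> {s1:>6}: +{gain:.4f} accuracy ({per_decade:.4f} per 10x more data)')
```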