import gluonnlp as nlp
import mxnet as mx
from mxnet import gluon, nd
import numpy as np
We are going to use the IMDB dataset and try to predict whether a review is positive or negative.
def transform_label(data):
    """
    Transform the 1-10 rating into a binary positive / negative label
    """
    text, label = data
    return text, 1 if label >= 5 else 0
train_dataset = nlp.data.IMDB('train')
test_dataset = nlp.data.IMDB('test')
k = {i+1: 0 for i in range(10)}
for elem in train_dataset:
    k[elem[1]] += 1
print("Distribution of the ratings")
k
Distribution of the ratings
{1: 5100, 2: 2284, 3: 2420, 4: 2696, 5: 0, 6: 0, 7: 2496, 8: 3009, 9: 2263, 10: 4732}
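Note that no review is rated 5 or 6: the IMDB dataset only contains polarized reviews (negative ratings are 4 or less, positive ratings 7 or more), so the label >= 5 threshold in transform_label yields a clean binary split of 12,500 negative and 12,500 positive training examples.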
print("Positive Review:\n{}".format(test_dataset[0][0]))
print()
print("Negative Review:\n{}".format(test_dataset[12501][0]))
Positive Review:
I went and saw this movie last night after being coaxed to by a few friends of mine. I'll admit that I was reluctant to see it because from what I knew of Ashton Kutcher he was only able to do comedy. I was wrong. Kutcher played the character of Jake Fischer very well, and Kevin Costner played Ben Randall with such professionalism. The sign of a good movie is that it can toy with our emotions. This one did exactly that. The entire theater (which was sold out) was overcome by laughter during the first half of the movie, and were moved to tears during the second half. While exiting the theater I not only saw many women in tears, but many full grown men as well, trying desperately not to let anyone see them crying. This movie was great, and I suggest that you go see it before you judge.

Negative Review:
This is an example of why the majority of action films are the same. Generic and boring, there's really nothing worth watching here. A complete waste of the then barely-tapped talents of Ice-T and Ice Cube, who've each proven many times over that they are capable of acting, and acting well. Don't bother with this one, go see New Jack City, Ricochet or watch New York Undercover for Ice-T, or Boyz n the Hood, Higher Learning or Friday for Ice Cube and see the real deal. Ice-T's horribly cliched dialogue alone makes this film grate at the teeth, and I'm still wondering what the heck Bill Paxton was doing in this film? And why the heck does he always play the exact same character? From Aliens onward, every film I've seen with Bill Paxton has him playing the exact same irritating character, and at least in Aliens his character died, which made it somewhat gratifying...<br /><br />Overall, this is second-rate action trash. There are countless better films to see, and if you really want to see this one, watch Judgement Night, which is practically a carbon copy but has better acting and a better script. The only thing that made this at all worth watching was a decent hand on the camera - the cinematography was almost refreshing, which comes close to making up for the horrible film itself - but not quite. 4/10.
train_dataset = train_dataset.transform(transform_label)
test_dataset = test_dataset.transform(transform_label)
print("There are {} training examples and {} test examples".format(len(train_dataset), len(test_dataset)))
There are 25000 training examples and 25000 test examples
Let's use sklearn to build TF-IDF pipelines with word n-grams (unigrams up to tri-grams) as a baseline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import chi2
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
train_x = [elem[0] for elem in train_dataset]
train_y = np.array([elem[1] for elem in train_dataset])
test_x = [elem[0] for elem in test_dataset]
test_y = np.array([elem[1] for elem in test_dataset])

for n_gram in [1, 2, 3]:
    text_clf = Pipeline([
        ('tfidf', TfidfVectorizer(sublinear_tf=True, min_df=2+n_gram, norm='l2',
                                  encoding='latin-1', ngram_range=(1, n_gram),
                                  stop_words='english')),
        ('clf', LogisticRegression(solver='lbfgs')),
    ])
    text_clf.fit(train_x, train_y)
    train_y_hat = text_clf.predict(train_x)
    test_y_hat = text_clf.predict(test_x)
    print("{}-gram Accuracy train:{:.2%}, test:{:.2%}".format(
        n_gram, (train_y_hat == train_y).mean(), (test_y_hat == test_y).mean()))
1-gram Accuracy train:93.56%, test:88.47%
2-gram Accuracy train:94.98%, test:88.63%
3-gram Accuracy train:94.93%, test:88.73%
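As a sanity check on the baseline, we can inspect which n-grams the last fitted pipeline (tri-grams) weighs most heavily. A minimal sketch, assuming text_clf is still the fitted pipeline from the loop above:

# look at the most discriminative n-grams of the logistic regression
vectorizer = text_clf.named_steps['tfidf']
classifier = text_clf.named_steps['clf']
feature_names = np.array(vectorizer.get_feature_names())
sorted_idx = np.argsort(classifier.coef_[0])
print("Most negative n-grams:", feature_names[sorted_idx[:10]])
print("Most positive n-grams:", feature_names[sorted_idx[-10:]])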
We now download a pre-trained BERT model and fine-tune it on the same dataset
ctx = mx.gpu() if mx.context.num_gpus() > 0 else mx.cpu()
bert_base, vocabulary = nlp.model.get_model('bert_24_1024_16',
dataset_name='book_corpus_wiki_en_uncased',
pretrained=True, use_pooler=True,
use_decoder=False, use_classifier=False, ctx=ctx)
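bert_24_1024_16 is BERT-large: 24 transformer layers, a hidden size of 1024 and 16 attention heads, pre-trained on BookCorpus and English Wikipedia (uncased). use_pooler=True keeps the pooled [CLS] representation that we will classify on, while the masked-language-model decoder and the next-sentence classifier used during pre-training are dropped.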
batch_size = 8
We need to process the text the same way it was processed during pre-training. For that we use the BERTTokenizer and the BERTSentenceTransform.
# use the vocabulary from the pre-trained model for tokenization
bert_tokenizer = nlp.data.BERTTokenizer(vocabulary)
max_len = 128
transform = nlp.data.BERTSentenceTransform(bert_tokenizer, max_len, pad=False, pair=False)
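To see what the transform produces, we can run it on a toy sentence (the example text is made up):

# the transform returns token ids, the valid length, and the segment (token type) ids
token_ids, valid_length, segment_ids = transform(['gluon-nlp makes BERT easy to use'])
print(token_ids[:12])
print(valid_length)
print(segment_ids[:12])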
We create a custom network for BERT classification. We take advantage of the pooler output, which is the output of the [CLS] token passed through a dense layer with a non-linearity.
class BERTTextClassifier(gluon.nn.Block):
    def __init__(self, bert, num_classes):
        super(BERTTextClassifier, self).__init__()
        self.bert = bert
        with self.name_scope():
            self.classifier = gluon.nn.Dense(num_classes)

    def forward(self, inputs, token_types, valid_length):
        # the pre-trained BERT model expects (inputs, token_types, valid_length)
        _, pooler = self.bert(inputs, token_types, valid_length)
        return self.classifier(pooler)
net = BERTTextClassifier(bert_base, 2)
net.classifier.initialize(ctx=ctx)
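Only the classifier head needs to be initialized: the parameters of the BERT body were already loaded from the pre-trained checkpoint by get_model.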
Data Loading:
def transform_fn(text, label):
    data, length, token_type = transform([text])
    return data.astype('float32'), length.astype('float32'), token_type.astype('float32'), label
batchify_fn = nlp.data.batchify.Tuple(
    # pad token ids with the index of the [PAD] token from the BERT vocabulary
    nlp.data.batchify.Pad(axis=0, pad_val=vocabulary[vocabulary.padding_token]),
    nlp.data.batchify.Stack(),
    nlp.data.batchify.Pad(axis=0),
    nlp.data.batchify.Stack(np.float32))
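As a quick check of the data pipeline, we can batchify two transformed examples by hand. A minimal sketch, assuming the objects defined above:

# build a small batch manually and inspect the result
sample_batch = batchify_fn([transform_fn(*train_dataset[0]), transform_fn(*train_dataset[1])])
token_ids, valid_lengths, segment_ids, labels = sample_batch
# token ids are padded to the longest review in the batch
print(token_ids.shape, valid_lengths, labels)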
train_data = gluon.data.DataLoader(train_dataset.transform(transform_fn), batchify_fn=batchify_fn,
                                   shuffle=True, batch_size=batch_size, num_workers=8, thread_pool=True)
test_data = gluon.data.DataLoader(test_dataset.transform(transform_fn), batchify_fn=batchify_fn,
                                  shuffle=False, batch_size=batch_size*4, num_workers=8, thread_pool=True)
Training
trainer = gluon.Trainer(net.collect_params(), 'bertadam',
                        {'learning_rate': 5e-6, 'wd': 0.001, 'epsilon': 1e-6})
loss_fn = gluon.loss.SoftmaxCELoss()
net.hybridize(static_alloc=True, static_shape=True)
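We fine-tune with bertadam, GluonNLP's implementation of the Adam variant used by the original BERT code (Adam without bias correction), together with a small learning rate and weight decay. Hybridizing with static_alloc and static_shape lets MXNet cache the computation graph and pre-allocate memory, which speeds up the repeated forward and backward passes.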
num_epoch = 3
Training loop
for epoch in range(num_epoch):
    accuracy = mx.metric.Accuracy()
    running_loss = 0
    for i, (inputs, seq_len, token_types, label) in enumerate(train_data):
        inputs = inputs.as_in_context(ctx)
        seq_len = seq_len.as_in_context(ctx)
        token_types = token_types.as_in_context(ctx)
        label = label.as_in_context(ctx)

        with mx.autograd.record():
            out = net(inputs, token_types, seq_len)
            loss = loss_fn(out, label.astype('float32'))

        loss.backward()
        running_loss += loss.mean()
        trainer.step(batch_size)

        accuracy.update(label, out.softmax())
        if i % 50 == 0:
            print("Batch", i, "Accuracy", accuracy.get()[1], "Loss", running_loss.asscalar()/(i+1))
    print("Epoch {}, Accuracy {}, Loss {}".format(epoch, accuracy.get(), running_loss.asscalar()/(i+1)))
Batch 0 Accuracy 0.5 Loss 0.8113940954208374
Batch 50 Accuracy 0.5465686274509803 Loss 0.6971047345329734
...
Epoch 0, Accuracy ('accuracy', 0.49972), Loss 0.6936875
...
Epoch 1, Accuracy ('accuracy', 0.49672), Loss 0.69359515625
...
Epoch 2, Accuracy ('accuracy', 0.4954), Loss 0.693593359375
Evaluation
accuracy = 0
for i, (inputs, seq_len, token_types, label) in enumerate(test_data):
    inputs = inputs.as_in_context(ctx)
    seq_len = seq_len.as_in_context(ctx)
    token_types = token_types.as_in_context(ctx)
    label = label.as_in_context(ctx)

    out = net(inputs, token_types, seq_len)
    accuracy += (out.argmax(axis=1).squeeze() == label).mean()
    if i % 50 == 0 and i > 0:
        print(accuracy.asscalar()/(i+1))
print("Test Accuracy {}".format(accuracy.asscalar()/(i+1)))
0.9068627450980392
0.9077970297029703
0.9060430463576159
0.9040733830845771
0.9016434262948207
0.9011627906976745
0.9024216524216524
0.9017300498753117
0.9011917960088692
0.9009481037924152
0.9021098003629764
0.9019862728785357
0.901833717357911
0.9028619828815977
0.9029627163781625
Test Accuracy 0.9029731457800512
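We can also run the fine-tuned model on a hand-written review. A minimal sketch (the review text is made up):

# classify a single custom review with the fine-tuned network
review = "This movie was a complete waste of two hours."
token_ids, valid_length, segment_ids = transform([review])
words = mx.nd.array([token_ids], ctx=ctx).astype('float32')
types = mx.nd.array([segment_ids], ctx=ctx).astype('float32')
length = mx.nd.array([valid_length], ctx=ctx).astype('float32')
out = net(words, types, length)
print("positive" if out.argmax(axis=1).asscalar() == 1 else "negative")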
Final accuracies:
Model | Training Accuracy | Testing Accuracy |
---|---|---|
TF-IDF 1-gram | 93.6% | 88.5% |
TF-IDF 2-gram | 95.0% | 88.6% |
TF-IDF 3-gram | 94.9% | 88.7% |
BERT-large (bert_24_1024_16) | 97.0% | 90.3% |