# code for loading the format for the notebook
import os
# path : store the current path to convert back to it later
path = os.getcwd()
os.chdir(os.path.join('..', '..', 'notebook_format'))
from formats import load_style
load_style(plot_style=False)
os.chdir(path)
# 1. magic for inline plot
# 2. magic to print version
# 3. magic so that the notebook will reload external python modules
# 4. magic to enable retina (high resolution) plots
# https://gist.github.com/minrk/3301035
%matplotlib inline
%load_ext watermark
%load_ext autoreload
%autoreload 2
%config InlineBackend.figure_format='retina'
import os
import time
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from typing import List, Tuple
from keras import layers
from keras.models import Model
from keras.preprocessing.text import Tokenizer
from keras.utils.np_utils import to_categorical
from keras.preprocessing.sequence import pad_sequences
# prevent scientific notations
pd.set_option('display.float_format', lambda x: '%.3f' % x)
%watermark -a 'Ethen' -d -t -v -p numpy,pandas,sklearn,keras,sentencepiece
Using TensorFlow backend.
Ethen 2019-12-31 11:20:36 CPython 3.6.4 IPython 7.9.0 numpy 1.16.5 pandas 0.25.0 sklearn 0.21.2 keras 2.2.2 sentencepiece n
In this notebook, we will be experimenting with subword tokenization. Tokenization is often one of the first mandatory steps in an NLP pipeline, where we break a piece of text down into meaningful individual units/tokens.
There are three major ways of performing tokenization.
Character Level
Treats each character (or unicode code point) as one individual token.
Word Level
Performs word segmentation on top of our text data.
The blog post Language modeling a billion words also shared some thoughts comparing character-based tokenization versus word-based tokenization. Taken directly from the post:
Word-level models have an important advantage over char-level models. Take the following sequence as an example (a quote from Robert A. Heinlein):
Progress isn't made by early risers. It's made by lazy men trying to find easier ways to do something.
After tokenization, the word-level model might view this sequence as containing 22 tokens. On the other hand, the char-level will view this sequence as containing 102 tokens. This longer sequence makes the task of the character model harder than the word model, as it must take into account dependencies between more tokens over more time-steps. Another issue with character language models is that they need to learn spelling in addition to syntax, semantics, etc. In any case, word language models will typically have lower error than character models.
The main advantage of character over word language models is that they have a really small vocabulary. For example, the GBW dataset will contain approximately 800 characters compared to 800,000 words (after pruning low-frequency tokens). In practice this means that character models will require less memory and have faster inference than their word counterparts. Another advantage is that they do not require tokenization as a preprocessing step.
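To make the sequence-length difference concrete, here's a quick sketch that tokenizes the quote above at the character level and at the word level. The whitespace split below is deliberately naive; the exact word-token count depends on how punctuation and contractions are handled, but the character sequence is several times longer either way.

```python
quote = ("Progress isn't made by early risers. "
         "It's made by lazy men trying to find easier ways to do something.")

# character level: every character, including spaces and punctuation, is a token
char_tokens = list(quote)

# word level: a naive whitespace split; a real tokenizer would also split off
# punctuation, which is how the post arrives at a slightly higher count
word_tokens = quote.split()

print(len(char_tokens))  # 102
print(len(word_tokens))  # 19 with a plain whitespace split
```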
Subword Level
As we can probably imagine, subword level sits somewhere between character level and word level, hence it tries to bring in the pros from both approaches (e.g. being able to handle out-of-vocabulary or rare words better) while mitigating their drawbacks (e.g. being too fine-grained for downstream tasks). With subword level, what we are aiming for is to represent an open vocabulary through a fixed-size vocabulary of variable-length character sequences, e.g. the word highest might be segmented into the subwords high and est.
There are many different methods for generating these subwords, e.g. byte pair encoding (BPE) and the unigram language model; a minimal BPE sketch is shown below, and later in this notebook we rely on the sentencepiece library's unigram model to do the work for us.
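To build some intuition for one of these methods, here is a minimal sketch of the byte pair encoding (BPE) procedure popularized by Sennrich et al. for neural machine translation: start from characters and repeatedly merge the most frequent adjacent pair of symbols into a new symbol. The toy word counts below are made up purely for illustration; in the rest of the notebook we let sentencepiece learn the subword vocabulary for us.

```python
import re
from collections import Counter

def get_pair_stats(vocab):
    """Count how often each adjacent pair of symbols occurs in the vocabulary."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pairs[symbols[i], symbols[i + 1]] += freq
    return pairs

def merge_pair(pair, vocab):
    """Merge every occurrence of the given symbol pair into a single new symbol."""
    pattern = re.compile(r'(?<!\S)' + re.escape(' '.join(pair)) + r'(?!\S)')
    return {pattern.sub(''.join(pair), word): freq for word, freq in vocab.items()}

# toy corpus statistics: each word is represented as space separated characters
vocab = {'l o w': 5, 'l o w e r': 2, 'n e w e s t': 6, 'w i d e s t': 3}
for _ in range(5):
    pair_stats = get_pair_stats(vocab)
    best_pair = max(pair_stats, key=pair_stats.get)
    vocab = merge_pair(best_pair, vocab)
    print(best_pair)
```

Each merged pair becomes a new entry in the subword vocabulary, so frequent fragments such as est emerge automatically from the data.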
We'll use the movie review sentiment analysis dataset from Kaggle for this example. It's a binary classification problem with AUC as the evaluation metric. The next few code chunks perform the usual text preprocessing, build up the vocabulary and perform a train/test split.
data_dir = 'data'
submission_dir = 'submission'
input_path = os.path.join(data_dir, 'word2vec-nlp-tutorial', 'labeledTrainData.tsv')
df = pd.read_csv(input_path, delimiter='\t')
print(df.shape)
df.head()
(25000, 3)
| | id | sentiment | review |
|---|---|---|---|
| 0 | 5814_8 | 1 | With all this stuff going down at the moment w... |
| 1 | 2381_9 | 1 | \The Classic War of the Worlds\" by Timothy Hi... |
| 2 | 7759_3 | 0 | The film starts with a manager (Nicholas Bell)... |
| 3 | 3630_4 | 0 | It must be assumed that those who praised this... |
| 4 | 9495_8 | 1 | Superbly trashy and wondrously unpretentious 8... |
raw_text = df['review'].iloc[0]
raw_text
"With all this stuff going down at the moment with MJ i've started listening to his music, watching the odd documentary here and there, watched The Wiz and watched Moonwalker again. Maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent. Moonwalker is part biography, part feature film which i remember going to see at the cinema when it was originally released. Some of it has subtle messages about MJ's feeling towards the press and also the obvious message of drugs are bad m'kay.<br /><br />Visually impressive but of course this is all about Michael Jackson so unless you remotely like MJ in anyway then you are going to hate this and find it boring. Some may call MJ an egotist for consenting to the making of this movie BUT MJ and most of his fans would say that he made it for the fans which if true is really nice of him.<br /><br />The actual feature film bit when it finally starts is only on for 20 minutes or so excluding the Smooth Criminal sequence and Joe Pesci is convincing as a psychopathic all powerful drug lord. Why he wants MJ dead so bad is beyond me. Because MJ overheard his plans? Nah, Joe Pesci's character ranted that he wanted people to know it is he who is supplying drugs etc so i dunno, maybe he just hates MJ's music.<br /><br />Lots of cool things in this like MJ turning into a car and a robot and the whole Speed Demon sequence. Also, the director must have had the patience of a saint when it came to filming the kiddy Bad sequence as usually directors hate working with one kid let alone a whole bunch of them performing a complex dance scene.<br /><br />Bottom line, this movie is for people who like MJ on one level or another (which i think is most people). If not, then stay away. It does try and give off a wholesome message and ironically MJ's bestest buddy in this movie is a girl! Michael Jackson is truly one of the most talented people ever to grace this planet but is he guilty? Well, with all the attention i've gave this subject....hmmm well i don't know because people can be different behind closed doors, i know this for a fact. He is either an extremely nice but stupid guy or one of the most sickest liars. I hope he is not the latter."
import re
def clean_str(string: str) -> str:
string = re.sub(r"\\", "", string)
string = re.sub(r"\'", "", string)
string = re.sub(r"\"", "", string)
return string.strip().lower()
from bs4 import BeautifulSoup
def clean_text(df: pd.DataFrame,
               text_col: str,
               label_col: str) -> Tuple[List[str], List[int]]:
    texts = []
    labels = []
    for raw_text, label in zip(df[text_col], df[label_col]):
        text = BeautifulSoup(raw_text).get_text()
        cleaned_text = clean_str(text)
        texts.append(cleaned_text)
        labels.append(label)
    return texts, labels
text_col = 'review'
label_col = 'sentiment'
texts, labels = clean_text(df, text_col, label_col)
print('sample text: ', texts[0])
print('corresponding label:', labels[0])
sample text: with all this stuff going down at the moment with mj ive started listening to his music, watching the odd documentary here and there, watched the wiz and watched moonwalker again. maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent. moonwalker is part biography, part feature film which i remember going to see at the cinema when it was originally released. some of it has subtle messages about mjs feeling towards the press and also the obvious message of drugs are bad mkay.visually impressive but of course this is all about michael jackson so unless you remotely like mj in anyway then you are going to hate this and find it boring. some may call mj an egotist for consenting to the making of this movie but mj and most of his fans would say that he made it for the fans which if true is really nice of him.the actual feature film bit when it finally starts is only on for 20 minutes or so excluding the smooth criminal sequence and joe pesci is convincing as a psychopathic all powerful drug lord. why he wants mj dead so bad is beyond me. because mj overheard his plans? nah, joe pescis character ranted that he wanted people to know it is he who is supplying drugs etc so i dunno, maybe he just hates mjs music.lots of cool things in this like mj turning into a car and a robot and the whole speed demon sequence. also, the director must have had the patience of a saint when it came to filming the kiddy bad sequence as usually directors hate working with one kid let alone a whole bunch of them performing a complex dance scene.bottom line, this movie is for people who like mj on one level or another (which i think is most people). if not, then stay away. it does try and give off a wholesome message and ironically mjs bestest buddy in this movie is a girl! michael jackson is truly one of the most talented people ever to grace this planet but is he guilty? well, with all the attention ive gave this subject....hmmm well i dont know because people can be different behind closed doors, i know this for a fact. he is either an extremely nice but stupid guy or one of the most sickest liars. i hope he is not the latter. corresponding label: 1
random_state = 1234
val_split = 0.2
labels = to_categorical(labels)
texts_train, texts_val, y_train, y_val = train_test_split(
    texts, labels,
    test_size=val_split,
    random_state=random_state)
print('labels shape:', labels.shape)
print('train size: ', len(texts_train))
print('validation size: ', len(texts_val))
labels shape: (25000, 2)
train size:  20000
validation size:  5000
To train our text classifier, we specify a 1D convolutional network. The comparison we'll be running is whether the subword-level model gives better performance than the word-level model.
def simple_text_cnn(max_sequence_len: int, max_features: int, num_classes: int,
                    optimizer: str='adam', metrics: List[str]=['acc']) -> Model:
    sequence_input = layers.Input(shape=(max_sequence_len,), dtype='int32')
    embedded_sequences = layers.Embedding(max_features, 100,
                                          trainable=True)(sequence_input)
    conv1 = layers.Conv1D(128, 5, activation='relu')(embedded_sequences)
    pool1 = layers.MaxPooling1D(5)(conv1)
    conv2 = layers.Conv1D(128, 5, activation='relu')(pool1)
    pool2 = layers.MaxPooling1D(5)(conv2)
    conv3 = layers.Conv1D(128, 5, activation='relu')(pool2)
    pool3 = layers.MaxPooling1D(35)(conv3)
    flatten = layers.Flatten()(pool3)
    dense = layers.Dense(128, activation='relu')(flatten)
    preds = layers.Dense(num_classes, activation='softmax')(dense)

    model = Model(sequence_input, preds)
    model.compile(loss='categorical_crossentropy',
                  optimizer=optimizer,
                  metrics=metrics)
    return model
The next couple of code chunks train the subword vocabulary, encode our original text into these subwords and pad the sequences to a fixed length. Note that the pad_sequences function from keras assumes that index 0 is reserved for padding, hence when learning the subword vocabulary using sentencepiece, we make sure to keep the index consistent.
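As a quick illustration of that convention, pad_sequences fills with 0 on the left by default (the short sequence below is just a made-up example):

```python
# pad_sequences pads/truncates every sequence to maxlen, filling with 0
# (padding='pre' by default), so index 0 must not be a real token
pad_sequences([[5, 2, 8]], maxlen=6)
# expected: array([[0, 0, 0, 5, 2, 8]], dtype=int32)
```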
# write the raw text so that sentencepiece can consume it
temp_file = 'train.txt'
with open(temp_file, 'w') as f:
    f.write('\n'.join(texts))
from sentencepiece import SentencePieceTrainer, SentencePieceProcessor
max_num_words = 30000
model_type = 'unigram'
model_prefix = model_type
pad_id = 0
unk_id = 1
bos_id = 2
eos_id = 3
sentencepiece_params = ' '.join([
    '--input={}'.format(temp_file),
    '--model_type={}'.format(model_type),
    '--model_prefix={}'.format(model_type),
    '--vocab_size={}'.format(max_num_words),
    '--pad_id={}'.format(pad_id),
    '--unk_id={}'.format(unk_id),
    '--bos_id={}'.format(bos_id),
    '--eos_id={}'.format(eos_id)
])
print(sentencepiece_params)
SentencePieceTrainer.train(sentencepiece_params)
--input=train.txt --model_type=unigram --model_prefix=unigram --vocab_size=30000 --pad_id=0 --unk_id=1 --bos_id=2 --eos_id=3
True
sp = SentencePieceProcessor()
sp.load("{}.model".format(model_prefix))
print('Found %s unique tokens.' % sp.get_piece_size())
Found 30000 unique tokens.
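As a quick sanity check that the learned vocabulary keeps the convention we asked for, index 0 should map to the padding piece (the piece strings shown in the comments are sentencepiece's defaults):

```python
# index 0 should be the padding piece, matching keras' pad_sequences convention
print(sp.id_to_piece(pad_id))  # expected: '<pad>'
print(sp.id_to_piece(unk_id))  # expected: '<unk>'
```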
max_sequence_len = 1000
sequences_train = [sp.encode_as_ids(text) for text in texts_train]
x_train = pad_sequences(sequences_train, maxlen=max_sequence_len)
sequences_val = [sp.encode_as_ids(text) for text in texts_val]
x_val = pad_sequences(sequences_val, maxlen=max_sequence_len)
sequences_train[0][:5]
[62, 5086, 4170, 2260, 2520]
print('sample text: ', texts_train[0])
print('sample text: ', sp.encode_as_pieces(sp.decode_ids(x_train[0].tolist())))
sample text: when gundam0079 became the movie trilogy most of us are familiar with, a lot of it was sheer action and less of anything else. this ova is kinda the opposite. though therere only half a dozen episodes, it isnt filled with action, but emotional things. the two main action sequences in this, i believe, are enough to satisfy me. after seeing so many gundam series, movies, and ovas, i was completely ready for a civilian-esquire movie. this movie did a fantastic job of that. what makes this movie stand out is that shows both sides of the war have good and bad people. it made the zeons seem more human rather than the original movies where theyre depicted as the second rise of evil nazis. most people that dont like anime that ive forced to watch this movie (lol), liked it. so, id recommend it to a lot of people just for the anti-war story. if youre a gundam fan, and havent seen this, you shouldnt be reading this; you should already be watching it right now. sample text: ['▁when', '▁gundam', '00', '7', '9', '▁became', '▁the', '▁movie', '▁trilogy', '▁most', '▁of', '▁us', '▁are', '▁familiar', '▁with', ',', '▁a', '▁lot', '▁of', '▁it', '▁was', '▁sheer', '▁action', '▁and', '▁less', '▁of', '▁anything', '▁else', '.', '▁this', '▁ova', '▁is', '▁kinda', '▁the', '▁opposite', '.', '▁though', '▁there', 're', '▁only', '▁half', '▁a', '▁dozen', '▁episodes', ',', '▁it', '▁isnt', '▁filled', '▁with', '▁action', ',', '▁but', '▁emotional', '▁things', '.', '▁the', '▁two', '▁main', '▁action', '▁sequences', '▁in', '▁this', ',', '▁i', '▁believe', ',', '▁are', '▁enough', '▁to', '▁satisfy', '▁me', '.', '▁after', '▁seeing', '▁so', '▁many', '▁gundam', '▁series', ',', '▁movies', ',', '▁and', '▁ova', 's', ',', '▁i', '▁was', '▁completely', '▁ready', '▁for', '▁a', '▁civilian', '-', 'esquire', '▁movie', '.', '▁this', '▁movie', '▁did', '▁a', '▁fantastic', '▁job', '▁of', '▁that', '.', '▁what', '▁makes', '▁this', '▁movie', '▁stand', '▁out', '▁is', '▁that', '▁shows', '▁both', '▁sides', '▁of', '▁the', '▁war', '▁have', '▁good', '▁and', '▁bad', '▁people', '.', '▁it', '▁made', '▁the', '▁zeon', 's', '▁seem', '▁more', '▁human', '▁rather', '▁than', '▁the', '▁original', '▁movies', '▁where', '▁theyre', '▁depicted', '▁as', '▁the', '▁second', '▁rise', '▁of', '▁evil', '▁nazis', '.', '▁most', '▁people', '▁that', '▁dont', '▁like', '▁anime', '▁that', '▁ive', '▁forced', '▁to', '▁watch', '▁this', '▁movie', '▁(', 'lol', '),', '▁liked', '▁it', '.', '▁so', ',', '▁id', '▁recommend', '▁it', '▁to', '▁a', '▁lot', '▁of', '▁people', '▁just', '▁for', '▁the', '▁anti', '-', 'war', '▁story', '.', '▁if', '▁youre', '▁a', '▁gundam', '▁fan', ',', '▁and', '▁havent', '▁seen', '▁this', ',', '▁you', '▁shouldnt', '▁be', '▁reading', '▁this', ';', '▁you', '▁should', '▁already', '▁be', '▁watching', '▁it', '▁right', '▁now', '.']
num_classes = 2
model1 = simple_text_cnn(max_sequence_len, max_num_words + 1, num_classes)
model1.summary()
Model: "model_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
input_1 (InputLayer)         (None, 1000)              0
_________________________________________________________________
embedding_1 (Embedding)      (None, 1000, 100)         3000100
_________________________________________________________________
conv1d_1 (Conv1D)            (None, 996, 128)          64128
_________________________________________________________________
max_pooling1d_1 (MaxPooling1 (None, 199, 128)          0
_________________________________________________________________
conv1d_2 (Conv1D)            (None, 195, 128)          82048
_________________________________________________________________
max_pooling1d_2 (MaxPooling1 (None, 39, 128)           0
_________________________________________________________________
conv1d_3 (Conv1D)            (None, 35, 128)           82048
_________________________________________________________________
max_pooling1d_3 (MaxPooling1 (None, 1, 128)            0
_________________________________________________________________
flatten_1 (Flatten)          (None, 128)               0
_________________________________________________________________
dense_1 (Dense)              (None, 128)               16512
_________________________________________________________________
dense_2 (Dense)              (None, 2)                 258
=================================================================
Total params: 3,245,094
Trainable params: 3,245,094
Non-trainable params: 0
_________________________________________________________________
# time : 120
# performance : 0.92936
start = time.time()
history1 = model1.fit(x_train, y_train,
                      validation_data=(x_val, y_val),
                      batch_size=128,
                      epochs=8)
end = time.time()
elapse1 = end - start
elapse1
Train on 20000 samples, validate on 5000 samples
Epoch 1/8
20000/20000 [==============================] - 7s 363us/step - loss: 0.5963 - acc: 0.6101 - val_loss: 0.3138 - val_acc: 0.8702
Epoch 2/8
20000/20000 [==============================] - 4s 224us/step - loss: 0.2239 - acc: 0.9120 - val_loss: 0.2991 - val_acc: 0.8820
Epoch 3/8
20000/20000 [==============================] - 4s 223us/step - loss: 0.0797 - acc: 0.9738 - val_loss: 0.3427 - val_acc: 0.8852
Epoch 4/8
20000/20000 [==============================] - 4s 224us/step - loss: 0.0193 - acc: 0.9946 - val_loss: 0.5095 - val_acc: 0.8814
Epoch 5/8
20000/20000 [==============================] - 4s 222us/step - loss: 0.0050 - acc: 0.9988 - val_loss: 0.7519 - val_acc: 0.8704
Epoch 6/8
20000/20000 [==============================] - 4s 223us/step - loss: 0.0016 - acc: 0.9999 - val_loss: 0.7487 - val_acc: 0.8840
Epoch 7/8
20000/20000 [==============================] - 4s 223us/step - loss: 2.0759e-04 - acc: 1.0000 - val_loss: 0.8045 - val_acc: 0.8810
Epoch 8/8
20000/20000 [==============================] - 4s 223us/step - loss: 5.2034e-05 - acc: 1.0000 - val_loss: 0.8260 - val_acc: 0.8824
39.04836106300354
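Since the competition is evaluated on AUC, we can also get a rough local estimate on the validation split before submitting. Below is a small sketch using sklearn's roc_auc_score; at this point x_val still holds the subword-encoded sequences, and the same check can be repeated for the word-level model trained next.

```python
from sklearn.metrics import roc_auc_score

# y_val is one-hot encoded, so compare its positive-class column against
# the predicted probability of the positive class
val_pred = model1.predict(x_val)[:, 1]
print('subword-level validation auc:', roc_auc_score(y_val[:, 1], val_pred))
```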
tokenizer = Tokenizer(num_words=max_num_words, oov_token='<unk>')
tokenizer.fit_on_texts(texts_train)
print('Found %s unique tokens.' % len(tokenizer.word_index))
Found 74207 unique tokens.
sequences_train = tokenizer.texts_to_sequences(texts_train)
x_train = pad_sequences(sequences_train, maxlen=max_sequence_len)
sequences_val = tokenizer.texts_to_sequences(texts_val)
x_val = pad_sequences(sequences_val, maxlen=max_sequence_len)
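With both vocabularies built, we can also illustrate the out-of-vocabulary behavior mentioned earlier: the word-level tokenizer can only map an unseen word to the single <unk> index, whereas the subword model decomposes it into known pieces. The word below is made up purely for illustration:

```python
# a made-up word that is unlikely to appear in the training vocabulary
example = 'moonwalkerish'

# word level: the whole word collapses to the <unk> token's index
print(tokenizer.texts_to_sequences([example]))

# subword level: the word is broken down into smaller known pieces instead
print(sp.encode_as_pieces(example))
```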
num_classes = 2
model2 = simple_text_cnn(max_sequence_len, max_num_words + 1, num_classes)
model2.summary()
Model: "model_2" _________________________________________________________________ Layer (type) Output Shape Param # ================================================================= input_2 (InputLayer) (None, 1000) 0 _________________________________________________________________ embedding_2 (Embedding) (None, 1000, 100) 3000100 _________________________________________________________________ conv1d_4 (Conv1D) (None, 996, 128) 64128 _________________________________________________________________ max_pooling1d_4 (MaxPooling1 (None, 199, 128) 0 _________________________________________________________________ conv1d_5 (Conv1D) (None, 195, 128) 82048 _________________________________________________________________ max_pooling1d_5 (MaxPooling1 (None, 39, 128) 0 _________________________________________________________________ conv1d_6 (Conv1D) (None, 35, 128) 82048 _________________________________________________________________ max_pooling1d_6 (MaxPooling1 (None, 1, 128) 0 _________________________________________________________________ flatten_2 (Flatten) (None, 128) 0 _________________________________________________________________ dense_3 (Dense) (None, 128) 16512 _________________________________________________________________ dense_4 (Dense) (None, 2) 258 ================================================================= Total params: 3,245,094 Trainable params: 3,245,094 Non-trainable params: 0 _________________________________________________________________
# time : 120
# performance : 0.92520
start = time.time()
history2 = model2.fit(x_train, y_train,
                      validation_data=(x_val, y_val),
                      batch_size=128,
                      epochs=8)
end = time.time()
elapse2 = end - start
elapse2
Train on 20000 samples, validate on 5000 samples
Epoch 1/8
20000/20000 [==============================] - 5s 257us/step - loss: 0.5386 - acc: 0.6734 - val_loss: 0.3237 - val_acc: 0.8708
Epoch 2/8
20000/20000 [==============================] - 5s 227us/step - loss: 0.2028 - acc: 0.9216 - val_loss: 0.2670 - val_acc: 0.8908
Epoch 3/8
20000/20000 [==============================] - 4s 225us/step - loss: 0.0668 - acc: 0.9785 - val_loss: 0.3612 - val_acc: 0.8886
Epoch 4/8
20000/20000 [==============================] - 5s 225us/step - loss: 0.0205 - acc: 0.9937 - val_loss: 0.4852 - val_acc: 0.8826
Epoch 5/8
20000/20000 [==============================] - 5s 225us/step - loss: 0.0059 - acc: 0.9985 - val_loss: 0.6764 - val_acc: 0.8786
Epoch 6/8
20000/20000 [==============================] - 5s 228us/step - loss: 0.0021 - acc: 0.9995 - val_loss: 0.7321 - val_acc: 0.8788
Epoch 7/8
20000/20000 [==============================] - 5s 226us/step - loss: 0.0022 - acc: 0.9995 - val_loss: 0.8057 - val_acc: 0.8840
Epoch 8/8
20000/20000 [==============================] - 5s 226us/step - loss: 0.0034 - acc: 0.9990 - val_loss: 0.8816 - val_acc: 0.8808
37.271193742752075
For the submission section, we read in and preprocess the test data provided by the competition, then generate the predicted probability column for both the model that uses word-level tokenization and the one that uses subword tokenization, so we can compare their performance.
input_path = os.path.join(data_dir, 'word2vec-nlp-tutorial', 'testData.tsv')
df_test = pd.read_csv(input_path, delimiter='\t')
print(df_test.shape)
df_test.head()
(25000, 2)
| | id | review |
|---|---|---|
| 0 | 12311_10 | Naturally in a film who's main themes are of m... |
| 1 | 8348_2 | This movie is a disaster within a disaster fil... |
| 2 | 5828_4 | All in all, this is a movie for kids. We saw i... |
| 3 | 7186_2 | Afraid of the Dark left me with the impression... |
| 4 | 12128_7 | A very accurate depiction of small time mob li... |
def clean_text_without_label(df: pd.DataFrame, text_col: str) -> List[str]:
    texts = []
    for raw_text in df[text_col]:
        text = BeautifulSoup(raw_text).get_text()
        cleaned_text = clean_str(text)
        texts.append(cleaned_text)
    return texts
texts_test = clean_text_without_label(df_test, text_col)
# word-level
word_sequences_test = tokenizer.texts_to_sequences(texts_test)
word_x_test = pad_sequences(word_sequences_test, maxlen=max_sequence_len)
len(word_x_test)
25000
# subword-level
sentencepiece_sequences_test = [sp.encode_as_ids(text) for text in texts_test]
sentencepiece_x_test = pad_sequences(sentencepiece_sequences_test, maxlen=max_sequence_len)
len(sentencepiece_x_test)
25000
def create_submission(ids, predictions, ids_col, prediction_col, submission_path) -> pd.DataFrame:
    df_submission = pd.DataFrame({
        ids_col: ids,
        prediction_col: predictions
    }, columns=[ids_col, prediction_col])

    if submission_path is not None:
        # create the directory if need be, e.g. if the submission_path = submission/submission.csv
        # we'll create the submission directory first if it doesn't exist
        directory = os.path.split(submission_path)[0]
        if directory not in ('', '.') and not os.path.isdir(directory):
            os.makedirs(directory, exist_ok=True)

        df_submission.to_csv(submission_path, index=False, header=True)

    return df_submission
ids_col = 'id'
prediction_col = 'sentiment'
ids = df_test[ids_col]
predictions_dict = {
    'sentencepiece_cnn': model1.predict(sentencepiece_x_test)[:, 1],  # 0.92936
    'word_cnn': model2.predict(word_x_test)[:, 1]  # 0.92520
}
for model_name, predictions in predictions_dict.items():
    print('generating submission for: ', model_name)
    submission_path = os.path.join(submission_dir, '{}_submission.csv'.format(model_name))
    df_submission = create_submission(ids, predictions, ids_col, prediction_col, submission_path)
# sanity check to make sure the size and the output of the submission makes sense
print(df_submission.shape)
df_submission.head()
generating submission for:  sentencepiece_cnn
generating submission for:  word_cnn
(25000, 2)
| | id | sentiment |
|---|---|---|
| 0 | 12311_10 | 1.000 |
| 1 | 8348_2 | 0.000 |
| 2 | 5828_4 | 0.000 |
| 3 | 7186_2 | 1.000 |
| 4 | 12128_7 | 1.000 |
We've looked at the performance of leveraging subword tokenization for our text classification task. Note that some other ideas that we did not try out are: