BentoML Example: Keras Toxic Comment Classification

BentoML is an open source platform for machine learning model serving and deployment.

This notebook demonstrates how to use BentoML to turn a Keras model into a docker image containing a REST API server serving this model, how to use your ML service built with BentoML as a CLI tool, and how to distribute it a pypi package.

This notebook is built based on: https://www.kaggle.com/sarvajna/keras-sequential-model-lb-0-052

Impression

In [1]:
%reload_ext autoreload
%autoreload 2
%matplotlib inline
In [ ]:
!pip install bentoml
!pip install keras kaggle tensorflow==1.14.0
In [3]:
import bentoml
import numpy as np
import pandas as pd
from keras.preprocessing import text, sequence
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation
from keras.layers import Embedding
from keras.layers import Conv1D, GlobalMaxPooling1D
from sklearn.model_selection import train_test_split
Using TensorFlow backend.
In [4]:
list_of_classes = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]
max_features = 20000
max_text_length = 400
embedding_dims = 50
filters = 250
kernel_size = 3
hidden_dims = 250
batch_size = 32
epochs = 2

Prepare Dataset

Please Download data with Kaggle at https://www.kaggle.com/sarvajna/keras-sequential-model-lb-0-052/data

If you are running this notebook in Google Colab, fill in your kaggle credential below and download the training dataset from Kaggle:

In [5]:
%%bash

export KAGGLE_USERNAME=
export KAGGLE_KEY=

if [ ! -f ./train.csv.zip ]; then
    kaggle competitions download -c jigsaw-toxic-comment-classification-challenge
    unzip train.csv.zip
    unzip sample_submission.csv.zip
    unzip test.csv.zip
    unzip test_labels.csv.zip
fi
In [6]:
train_df = pd.read_csv('./train.csv')

print(train_df.head())
                 id                                       comment_text  toxic  \
0  0000997932d777bf  Explanation\nWhy the edits made under my usern...      0   
1  000103f0d9cfb60f  D'aww! He matches this background colour I'm s...      0   
2  000113f07ec002fd  Hey man, I'm really not trying to edit war. It...      0   
3  0001b41b1c6bb37e  "\nMore\nI can't make any real suggestions on ...      0   
4  0001d958c54c6e35  You, sir, are my hero. Any chance you remember...      0   

   severe_toxic  obscene  threat  insult  identity_hate  
0             0        0       0       0              0  
1             0        0       0       0              0  
2             0        0       0       0              0  
3             0        0       0       0              0  
4             0        0       0       0              0  
In [7]:
x = train_df['comment_text'].values
print(x)
["Explanation\nWhy the edits made under my username Hardcore Metallica Fan were reverted? They weren't vandalisms, just closure on some GAs after I voted at New York Dolls FAC. And please don't remove the template from the talk page since I'm retired now.89.205.38.27"
 "D'aww! He matches this background colour I'm seemingly stuck with. Thanks.  (talk) 21:51, January 11, 2016 (UTC)"
 "Hey man, I'm really not trying to edit war. It's just that this guy is constantly removing relevant information and talking to me through edits instead of my talk page. He seems to care more about the formatting than the actual info."
 ...
 'Spitzer \n\nUmm, theres no actual article for prostitution ring.  - Crunch Captain.'
 'And it looks like it was actually you who put on the speedy to have the first version deleted now that I look at it.'
 '"\nAnd ... I really don\'t think you understand.  I came here and my idea was bad right away.  What kind of community goes ""you have bad ideas"" go away, instead of helping rewrite them.   "']
In [8]:
y = train_df[list_of_classes].values
print(y)
[[0 0 0 0 0 0]
 [0 0 0 0 0 0]
 [0 0 0 0 0 0]
 ...
 [0 0 0 0 0 0]
 [0 0 0 0 0 0]
 [0 0 0 0 0 0]]
In [9]:
x_tokenizer = text.Tokenizer(num_words=max_features)
print(x_tokenizer)
x_tokenizer.fit_on_texts(list(x))
print(x_tokenizer)
x_tokenized = x_tokenizer.texts_to_sequences(x) #list of lists(containing numbers), so basically a list of sequences, not a numpy array
#pad_sequences:transform a list of num_samples sequences (lists of scalars) into a 2D Numpy array of shape 
x_train_val = sequence.pad_sequences(x_tokenized, maxlen=max_text_length)
<keras_preprocessing.text.Tokenizer object at 0x1a3271c7f0>
<keras_preprocessing.text.Tokenizer object at 0x1a3271c7f0>
In [10]:
x_train, x_val, y_train, y_val = train_test_split(x_train_val, y, test_size=0.1, random_state=1)
In [11]:
print('Build model...')
model = Sequential()

# we start off with an efficient embedding layer which maps
# our vocab indices into embedding_dims dimensions
model.add(Embedding(max_features,
                    embedding_dims,
                    input_length=max_text_length))
model.add(Dropout(0.2))

# we add a Convolution1D, which will learn filters
# word group filters of size filter_length:
model.add(Conv1D(filters,
                 kernel_size,
                 padding='valid',
                 activation='relu',
                 strides=1))
# we use max pooling:
model.add(GlobalMaxPooling1D())

# We add a vanilla hidden layer:
model.add(Dense(hidden_dims))
model.add(Dropout(0.2))
model.add(Activation('relu'))

# We project onto 6 output layers, and squash it with a sigmoid:
model.add(Dense(6))
model.add(Activation('sigmoid'))

model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

model.summary()
Build model...
WARNING:tensorflow:From /Users/chaoyuyang/anaconda3/envs/bentoml-dev/lib/python3.7/site-packages/tensorflow/python/framework/op_def_library.py:263: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.
Instructions for updating:
Colocations handled automatically by placer.
WARNING:tensorflow:From /Users/chaoyuyang/anaconda3/envs/bentoml-dev/lib/python3.7/site-packages/keras/backend/tensorflow_backend.py:3721: calling dropout (from tensorflow.python.ops.nn_ops) with keep_prob is deprecated and will be removed in a future version.
Instructions for updating:
Please use `rate` instead of `keep_prob`. Rate should be set to `rate = 1 - keep_prob`.
Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding_1 (Embedding)      (None, 400, 50)           1000000   
_________________________________________________________________
dropout_1 (Dropout)          (None, 400, 50)           0         
_________________________________________________________________
conv1d_1 (Conv1D)            (None, 398, 250)          37750     
_________________________________________________________________
global_max_pooling1d_1 (Glob (None, 250)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 250)               62750     
_________________________________________________________________
dropout_2 (Dropout)          (None, 250)               0         
_________________________________________________________________
activation_1 (Activation)    (None, 250)               0         
_________________________________________________________________
dense_2 (Dense)              (None, 6)                 1506      
_________________________________________________________________
activation_2 (Activation)    (None, 6)                 0         
=================================================================
Total params: 1,102,006
Trainable params: 1,102,006
Non-trainable params: 0
_________________________________________________________________
In [12]:
model.fit(x_train, y_train,
          batch_size=batch_size,
          epochs=epochs,
validation_data=(x_val, y_val))
WARNING:tensorflow:From /Users/chaoyuyang/anaconda3/envs/bentoml-dev/lib/python3.7/site-packages/tensorflow/python/ops/math_ops.py:3066: to_int32 (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.cast instead.
WARNING:tensorflow:From /Users/chaoyuyang/anaconda3/envs/bentoml-dev/lib/python3.7/site-packages/tensorflow/python/ops/math_grad.py:102: div (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Deprecated in favor of operator or tf.math.divide.
Train on 143613 samples, validate on 15958 samples
Epoch 1/2
143613/143613 [==============================] - 256s 2ms/step - loss: 0.0652 - acc: 0.9781 - val_loss: 0.0552 - val_acc: 0.9804
Epoch 2/2
143613/143613 [==============================] - 258s 2ms/step - loss: 0.0460 - acc: 0.9827 - val_loss: 0.0484 - val_acc: 0.9825
Out[12]:
<keras.callbacks.History at 0xb3172fa90>
In [13]:
test_df = pd.read_csv('./test.csv')
In [14]:
x_test = test_df['comment_text'].values
In [15]:
x_test_tokenized = x_tokenizer.texts_to_sequences(x_test)
x_testing = sequence.pad_sequences(x_test_tokenized, maxlen=max_text_length)
In [16]:
y_testing = model.predict(x_testing, verbose = 1)
153164/153164 [==============================] - 52s 340us/step
In [17]:
sample_submission = pd.read_csv("./sample_submission.csv")
sample_submission[list_of_classes] = y_testing
sample_submission.to_csv("toxic_comment_classification.csv", index=False)

Create BentoService for model serving

In [1]:
%%writefile toxic_comment_classifier.py

from bentoml import api, artifacts, env, BentoService
from bentoml.artifact import PickleArtifact
from bentoml.handlers import DataframeHandler

from keras.preprocessing import text, sequence
import numpy as np

list_of_classes = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]
max_text_length = 400

@env(pip_dependencies=['tensorflow==1.14.0', 'keras', 'pandas', 'numpy'])
@artifacts([PickleArtifact('x_tokenizer'), PickleArtifact('model')])
class ToxicCommentClassification(BentoService):
    
    def tokenize_df(self, df):
        comments = df['comment_text'].values
        tokenized = self.artifacts.x_tokenizer.texts_to_sequences(comments)        
        input_data = sequence.pad_sequences(tokenized, maxlen=max_text_length)
        return input_data
    
    @api(DataframeHandler)
    def predict(self, df):
        input_data = self.tokenize_df(df)
        prediction = self.artifacts.model.predict(input_data)
        result = []
        for i in prediction:
            result.append(list_of_classes[np.argmax(i)])
        return result
Overwriting toxic_comment_classifier.py

Save BentoService to file archive

In [19]:
# 1) import the custom BentoService defined above
from toxic_comment_classifier import ToxicCommentClassification

# 2) `pack` it with required artifacts
svc = ToxicCommentClassification.pack(x_tokenizer=x_tokenizer, model=model)

# 3) save your BentoSerivce
saved_path = svc.save()
[2019-09-17 18:25:13,069] INFO - Successfully saved Bento 'ToxicCommentClassification:2019_09_17_53d29fc2' to path: /Users/chaoyuyang/bentoml/repository/ToxicCommentClassification/2019_09_17_53d29fc2

Load BentoService from archive

In [20]:
sample_test = test_df.iloc[40:45]

bento_service = bentoml.load(saved_path)

print(bento_service.predict(sample_test))
[2019-09-17 18:25:13,309] WARNING - Module `toxic_comment_classifier` already loaded, using existing imported module.
['toxic', 'toxic', 'toxic', 'toxic', 'toxic']