by: Keith Qu
Natural-language classification of toxic comments using logistic regression and Keras (running on TensorFlow). This is a broad run-through, ranging from basic linear methods, through modified linear methods, to deep learning.
Methods include logistic regression, NB-SVM, and CNN and RNN (bidirectional LSTM) models built with Keras on a TensorFlow backend.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import re, string
from collections import Counter
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report, roc_curve, roc_auc_score, auc
from scipy import sparse
from sklearn.model_selection import train_test_split
%matplotlib inline
Toxicity categories:
The categories are not mutually exclusive, so we can treat them as six separate binary classification problems.
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
train.tail()
 | id | comment_text | toxic | severe_toxic | obscene | threat | insult | identity_hate |
---|---|---|---|---|---|---|---|---|
159566 | ffe987279560d7ff | ":::::And for the second time of asking, when ... | 0 | 0 | 0 | 0 | 0 | 0 |
159567 | ffea4adeee384e90 | You should be ashamed of yourself \n\nThat is ... | 0 | 0 | 0 | 0 | 0 | 0 |
159568 | ffee36eab5c267c9 | Spitzer \n\nUmm, theres no actual article for ... | 0 | 0 | 0 | 0 | 0 | 0 |
159569 | fff125370e4aaaf3 | And it looks like it was actually you who put ... | 0 | 0 | 0 | 0 | 0 | 0 |
159570 | fff46fc426af1f9a | "\nAnd ... I really don't think you understand... | 0 | 0 | 0 | 0 | 0 | 0 |
Let's take a look at some of our test comments.
test.head()
 | id | comment_text |
---|---|---|
0 | 00001cee341fdb12 | Yo bitch Ja Rule is more succesful then you'll... |
1 | 0000247867823ef7 | == From RfC == \n\n The title is fine as it is... |
2 | 00013b17ad220c46 | " \n\n == Sources == \n\n * Zawe Ashton on Lap... |
3 | 00017563c3f7919a | :If you have a look back at the source, the in... |
4 | 00017695ad8997eb | I don't anonymously edit articles at all. |
test.loc[0].comment_text
"Yo bitch Ja Rule is more succesful then you'll ever be whats up with you and hating you sad mofuckas...i should bitch slap ur pethedic white faces and get you to kiss my ass you guys sicken me. Ja rule is about pride in da music man. dont diss that shit on him. and nothin is wrong bein like tupac he was a brother too...fuckin white boys get things right next time.,"
This looks like we have some obscenity and insult, combined with a dash of identity hate.
They are also completely wrong, since 50 Cent > Ja Rule any day. Well, maybe not his last album...
Only a little.
test.fillna(' ',inplace=True)
We'll vectorize both by word and by character. Internet comments are a cesspool, and some character n-grams carry toxic meanings on their own. We can also try combining the two.
def tok(s):
    return re.compile(f'([{string.punctuation}“”¨«»®´·º½¾¿¡§£₤‘’])').sub(r' \1 ', s).split()
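To see what this tokenizer actually does, here is a minimal standalone sketch of the same pattern (stdlib only, omitting the extra Unicode punctuation for brevity): each punctuation character gets wrapped in spaces before splitting, so punctuation survives as its own token.

```python
import re, string

# Same idea as tok() above: pad punctuation with spaces, then split,
# so each punctuation mark becomes its own token.
punct = re.compile(f'([{string.punctuation}])')

def tok_demo(s):
    return punct.sub(r' \1 ', s).split()

print(tok_demo("you're pathetic!!"))
# → ['you', "'", 're', 'pathetic', '!', '!']
```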
words = TfidfVectorizer(ngram_range=(1,2), lowercase=True,
                        analyzer='word', stop_words='english', tokenizer=tok,
                        min_df=3, max_df=0.9, sublinear_tf=True, smooth_idf=True,
                        dtype=np.float32, strip_accents='unicode')
chars = TfidfVectorizer(ngram_range=(1,5), lowercase=True,
                        analyzer='char', min_df=3,
                        max_df=0.9, sublinear_tf=True, smooth_idf=True, dtype=np.float32)
train_words = words.fit_transform(train['comment_text'])
train_chars = chars.fit_transform(train['comment_text'])
X = sparse.csr_matrix(train_words)
cols = ['toxic','severe_toxic','obscene','threat','insult','identity_hate']
y = train[cols]
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.3,random_state=10101)
logit = LogisticRegression(C=4, dual=True, solver='liblinear')  # dual=True requires liblinear
pred = np.zeros((X_test.shape[0],y_test.shape[1]))
for i, c in enumerate(cols):
    logit.fit(X_train, y_train[c])
    pred[:, i] = logit.predict(X_test)
for i, c in enumerate(cols):
    print('Confusion matrix for', c)
    print(confusion_matrix(y_test[c], pred[:, i]))
Confusion matrix for toxic
[[42996   244]
 [ 1767  2865]]
Confusion matrix for severe_toxic
[[47295    87]
 [  390   100]]
Confusion matrix for obscene
[[45167   148]
 [  892  1665]]
Confusion matrix for threat
[[47720    15]
 [  121    16]]
Confusion matrix for insult
[[45231   269]
 [ 1158  1214]]
Confusion matrix for identity_hate
[[47414    40]
 [  338    80]]
X_c = sparse.csr_matrix(train_chars)
X_train,X_test,y_train,y_test = train_test_split(X_c,y,test_size=0.3,random_state=10101)
pred = np.zeros((X_test.shape[0],y_test.shape[1]))
for i, c in enumerate(cols):
    logit.fit(X_train, y_train[c])
    pred[:, i] = logit.predict(X_test)
for i, c in enumerate(cols):
    print('Confusion matrix for', c)
    print(confusion_matrix(y_test[c], pred[:, i]))
Confusion matrix for toxic
[[42927   313]
 [ 1540  3092]]
Confusion matrix for severe_toxic
[[47284    98]
 [  372   118]]
Confusion matrix for obscene
[[45146   169]
 [  791  1766]]
Confusion matrix for threat
[[47727     8]
 [  109    28]]
Confusion matrix for insult
[[45212   288]
 [ 1006  1366]]
Confusion matrix for identity_hate
[[47417    37]
 [  315   103]]
Character-wise vectorization appears to give better results with n-grams of up to size 5, though this could vary with different splits.
Next, we horizontally stack the word and character matrices to create a larger blended feature set.
X2 = sparse.hstack([train_words,train_chars])
X_train,X_test,y_train,y_test = train_test_split(X2,y,test_size=0.3,random_state=10101)
pred = np.zeros((X_test.shape[0],y_test.shape[1]))
for i, c in enumerate(cols):
    logit.fit(X_train, y_train[c])
    pred[:, i] = logit.predict(X_test)
for i, c in enumerate(cols):
    print('Confusion matrix for', c)
    print(confusion_matrix(y_test[c], pred[:, i]))
Confusion matrix for toxic
[[42935   305]
 [ 1472  3160]]
Confusion matrix for severe_toxic
[[47270   112]
 [  367   123]]
Confusion matrix for obscene
[[45128   187]
 [  757  1800]]
Confusion matrix for threat
[[47716    19]
 [  106    31]]
Confusion matrix for insult
[[45198   302]
 [  982  1390]]
Confusion matrix for identity_hate
[[47404    50]
 [  308   110]]
Putting word and character features together seems to give better results than either does separately. The model does best at identifying toxic, obscene, and insult comments, which are the categories most likely to have specific keywords associated with them. There appears to be heavy subjectivity in what exactly constitutes severe toxicity, threats are very context-sensitive, and identity hate is much easier to identify for some groups than others.
It's also extremely memory-intensive for a machine with a very normal 16 GB of RAM.
Wang & Manning (2012) find that Multinomial Naive Bayes performs better at classifying smaller snippets of text, while SVM is superior with full-length text. By combining the two models with linear interpolation, they create a new model that is robust for a wide variety of text.
There is a very helpful Python implementation by Jeremy Howard.
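Before the full implementation below, the core NB weighting can be illustrated with a toy, entirely made-up example: for each term, take the log ratio of its smoothed frequency in positive versus negative comments; that ratio then rescales the tf-idf features before the SVM (here, logistic regression) is fit.

```python
import math

# Made-up counts for illustration: suppose "idiot" appears in 40 of 100
# toxic comments but only 5 of 900 clean ones. With add-one smoothing:
p = (40 + 1) / (100 + 1)   # smoothed P(term | toxic)
q = (5 + 1) / (900 + 1)    # smoothed P(term | not toxic)
r = math.log(p / q)        # log-count ratio; large and positive here,
                           # so this term's tf-idf weight gets boosted
print(round(r, 2))
# → 4.11
```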
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import string, re
from collections import Counter
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report, roc_curve, roc_auc_score, auc
from scipy import sparse
from sklearn.model_selection import train_test_split
%matplotlib inline
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
cols = ['toxic','severe_toxic','obscene','threat','insult','identity_hate']
test.fillna(' ',inplace=True)
# here's the main part of the implementation by jhoward mentioned above
# (note that pr() reads the global train_words document-term matrix)
def pr(y_i, y):
    p = train_words[y == y_i].sum(0)
    return (p + 1) / ((y == y_i).sum() + 1)
# Get rid of the punctuation. Again, thanks to jhoward for this...
def tok(s):
    return re.compile(f'([{string.punctuation}“”¨«»®´·º½¾¿¡§£₤‘’])').sub(r' \1 ', s).split()
words = TfidfVectorizer(ngram_range=(1,2), lowercase=True,
                        analyzer='word', stop_words='english', tokenizer=tok,
                        min_df=3, max_df=0.9, sublinear_tf=True, smooth_idf=True,
                        use_idf=True, strip_accents='unicode')
train_words = words.fit_transform(train['comment_text'])
test_words = words.transform(test['comment_text'])
pred = np.zeros((test.shape[0],len(cols)))
for i, c in enumerate(cols):
    logit = LogisticRegression(C=4, dual=True, solver='liblinear')
    r = np.log(pr(1, train[c].values) / pr(0, train[c].values))
    X_nb = train_words.multiply(r)
    logit.fit(X_nb, train[c].values)
    pred[:, i] = logit.predict_proba(test_words.multiply(r))[:, 1]
submission = pd.read_csv('sample_submission.csv')
submission[cols] = pred
submission.to_csv('submission.csv',index=False)
This gives a score in the 0.07s (mean column-wise log loss), which is on the high side. But it was quick and painless: no lemmatization, feature engineering, toxic-word dictionaries, spell checking, or pretrained vectorized text was involved. So there's a lot of room for improvement.
submission.head()
id | toxic | severe_toxic | obscene | threat | insult | identity_hate | |
---|---|---|---|---|---|---|---|
0 | 00001cee341fdb12 | 0.999996 | 0.056500 | 0.999964 | 0.002058 | 0.986571 | 0.362056 |
1 | 0000247867823ef7 | 0.006075 | 0.000966 | 0.004201 | 0.000120 | 0.005512 | 0.000392 |
2 | 00013b17ad220c46 | 0.009947 | 0.000812 | 0.004720 | 0.000098 | 0.004124 | 0.000283 |
3 | 00017563c3f7919a | 0.001160 | 0.000283 | 0.001051 | 0.000240 | 0.001028 | 0.000225 |
4 | 00017695ad8997eb | 0.016037 | 0.000392 | 0.001696 | 0.000138 | 0.002533 | 0.000301 |
The first entry (0) is from the enlightened Ja Rule supporter shown above. We have detected the obscenity and insult (so it is definitely toxic), but the model is a bit weak on the identity hate measure. To be fair, calling someone a "fuckin white boy" ranks low on the identity hate ladder, but arguably it should still count. We don't actually know its true classification at this point.
The Keras API is a nice way to access TensorFlow, which will be needed to create convolutional (CNN) and recurrent neural networks (RNNs).
import pandas as pd
import numpy as np
from nltk.corpus import stopwords
from nltk import WordNetLemmatizer
from nltk import pos_tag, word_tokenize
from keras.models import Sequential
from keras.layers import (Dense, Dropout, Input, LSTM, Activation, Flatten,
Convolution1D, MaxPooling1D, Bidirectional,
GlobalMaxPooling1D, Embedding, BatchNormalization,
SpatialDropout1D)
from keras.preprocessing import sequence
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report, roc_curve, roc_auc_score, auc
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
cols = ['toxic','severe_toxic','obscene','threat','insult','identity_hate']
test.fillna(' ',inplace=True)
We'll convert everything to lower case, remove stopwords, lemmatize words (reduce them to their root form), and convert the text into padded sequences of length 200 to run through the learning models.
def clean_up(t):
    t = t.strip().lower()
    words = t.split()
    # first get rid of the stopwords, or a lemmatized stopword might not
    # be recognized as a stopword
    imp_words = ' '.join(w for w in words if w not in set(stopwords.words('english')))
    # lemmatize based on adjectives (J), verbs (V), nouns (N) and adverbs (R) to
    # return only the base words (as opposed to stemming, which can return
    # non-words), e.g. ponies -> poni with stemming but pony with lemmatizing
    final_words = ''
    lemma = WordNetLemmatizer()
    for (w, tag) in pos_tag(word_tokenize(imp_words)):
        if tag.startswith('J'):
            final_words += ' ' + lemma.lemmatize(w, pos='a')
        elif tag.startswith('V'):
            final_words += ' ' + lemma.lemmatize(w, pos='v')
        elif tag.startswith('N'):
            final_words += ' ' + lemma.lemmatize(w, pos='n')
        elif tag.startswith('R'):
            final_words += ' ' + lemma.lemmatize(w, pos='r')
        else:
            final_words += ' ' + w
    return final_words
# what a great name. do_stuff
def do_stuff(df):
    text = df['comment_text'].copy()
    # First get rid of anything that's not a letter. This may not be the greatest idea, since
    # on3 c4n 3451ly substitute numbers in for letters, but keep it like this for now.
    text.replace(to_replace={r'[^\x00-\x7F]': ' '}, inplace=True, regex=True)
    text.replace(to_replace={r'[^a-zA-Z]': ' '}, inplace=True, regex=True)
    # Then lower case, tokenize and lemmatize
    text = text.apply(clean_up)
    return text
def tok_seq(train, test):
    tok = Tokenizer(num_words=100000)
    # fit on the training texts passed in (not the global X_train)
    tok.fit_on_texts(train)
    seq_train = tok.texts_to_sequences(train)
    seq_test = tok.texts_to_sequences(test)
    # pad/truncate every sequence to a fixed length of 200 tokens
    data_train = pad_sequences(seq_train, maxlen=200)
    data_test = pad_sequences(seq_test, maxlen=200)
    return data_train, data_test
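As a rough sketch of what `pad_sequences` does with its defaults (a hypothetical pure-Python stand-in, not the Keras implementation): short sequences are left-padded with zeros, and long ones are truncated from the front so that only the last `maxlen` tokens remain.

```python
def pad(seq, maxlen, value=0):
    # Keras-style defaults: truncate from the front (keep the last
    # maxlen tokens) and left-pad short sequences with `value`.
    seq = seq[-maxlen:]
    return [value] * (maxlen - len(seq)) + seq

print(pad([5, 8, 2], 5))            # → [0, 0, 5, 8, 2]
print(pad([9, 8, 7, 6, 5, 4], 5))   # → [8, 7, 6, 5, 4]
```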
A convolutional model with 25% dropout to help with generalization.
def seq_model(X_train, y_train, test, val='no'):
    model = Sequential()
    model.add(Embedding(100000, 50, input_length=200))
    model.add(Dropout(0.25))
    model.add(Convolution1D(250, activation='relu', padding='valid', kernel_size=3))
    model.add(GlobalMaxPooling1D())
    model.add(Dense(250))
    model.add(Dropout(0.25))
    model.add(Activation('relu'))
    # A sigmoid ensures a bounded output in (0,1) for each of the six labels
    model.add(Dense(6, activation='sigmoid'))
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    # batch_size=1000 seems to be the limit of my 2gb GTX 960m
    # like with all predictive modeling, there is an under/overfitting
    # tradeoff between too few epochs and too many
    if val == 'no':
        model.fit(X_train, y_train, batch_size=1000, epochs=5)
    else:
        model.fit(X_train, y_train, batch_size=1000, epochs=5, validation_split=0.1)
    pred = model.predict(test)
    return pred
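The sigmoid/`binary_crossentropy` pairing fits the multi-label setup: each of the six outputs is treated as an independent binary problem, and the loss is the mean of the six per-label log losses. A toy hand computation, with assumed probabilities:

```python
import math

def binary_crossentropy(y_true, y_pred, eps=1e-7):
    # Mean of independent per-label log losses, one per toxicity column.
    total = 0.0
    for t, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(y_true)

# One comment labelled toxic+insult, with made-up predicted probabilities:
loss = binary_crossentropy([1, 0, 0, 0, 1, 0], [0.9, 0.1, 0.2, 0.05, 0.8, 0.1])
print(round(loss, 4))
```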
Bidirectional LSTM model with similar dropouts.
def bidirect_model(X_train, y_train, test, val='no'):
    model = Sequential()
    model.add(Embedding(100000, 100, input_length=200))
    model.add(Bidirectional(LSTM(50, return_sequences=True)))
    model.add(GlobalMaxPooling1D())
    model.add(Dropout(0.25))
    model.add(Dense(250))
    model.add(Dropout(0.25))
    model.add(Activation('relu'))
    model.add(Dense(6, activation='sigmoid'))
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    if val == 'no':
        model.fit(X_train, y_train, batch_size=1000, epochs=4)
    else:
        model.fit(X_train, y_train, batch_size=1000, epochs=4, validation_split=0.1)
    pred = model.predict(test)
    return pred
First let's take a look at results with a train test split.
X_train = do_stuff(train)
cols = ['toxic','severe_toxic','obscene','threat','insult','identity_hate']
y_train = train[cols].values
Xt, Xv, yt, yv = train_test_split(X_train, y_train, test_size=0.2, random_state=11)
Xt, Xv = tok_seq(Xt,Xv)
pred_seq = seq_model (Xt, yt, Xv)
Epoch 1/5
127656/127656 [==============================] - 33s 260us/step - loss: 0.1777 - acc: 0.9598
Epoch 2/5
127656/127656 [==============================] - 27s 214us/step - loss: 0.1050 - acc: 0.9651
Epoch 3/5
127656/127656 [==============================] - 27s 213us/step - loss: 0.0550 - acc: 0.9803
Epoch 4/5
127656/127656 [==============================] - 27s 213us/step - loss: 0.0456 - acc: 0.9826
Epoch 5/5
127656/127656 [==============================] - 27s 213us/step - loss: 0.0417 - acc: 0.9840
pred_bid = bidirect_model(Xt,yt,Xv)
Epoch 1/4
127656/127656 [==============================] - 166s 1ms/step - loss: 0.1802 - acc: 0.9591
Epoch 2/4
127656/127656 [==============================] - 163s 1ms/step - loss: 0.0769 - acc: 0.9740
Epoch 3/4
127656/127656 [==============================] - 162s 1ms/step - loss: 0.0606 - acc: 0.9785
Epoch 4/4
127656/127656 [==============================] - 162s 1ms/step - loss: 0.0447 - acc: 0.9831
Unsurprisingly, bidirectional adds a hefty amount of computing time.
Using a GPU for computation is kind of like reading Playboy for the articles, where it might seem questionable at first but GPUs are really good for deep learning and every now and then Playboy has a great article about the military industrial complex.
roc_auc_score(yv, pred_seq)
0.965157908356303
roc_auc_score(yv, pred_bid)
0.96750127296574728
for i, c in enumerate(cols):
    print('Correlation between results of', c)
    print(np.corrcoef(pred_bid[:, i], pred_seq[:, i])[0, 1])
Correlation between results of toxic
0.936130675937
Correlation between results of severe_toxic
0.933198514747
Correlation between results of obscene
0.958145180505
Correlation between results of threat
0.901314923493
Correlation between results of insult
0.949524363145
Correlation between results of identity_hate
0.935303371467
The two sets of predictions are fairly highly correlated for our validation split, but we can still try mean ensembling the results.
roc_auc_score(yv,0.5*(pred_seq+pred_bid))
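Mean ensembling is nothing fancier than the element-wise average of the two probability matrices; where the models disagree, the blend lands in between. A toy example with made-up probabilities:

```python
# Two models' predicted probabilities for three comments (made-up numbers)
pred_a = [0.90, 0.10, 0.40]
pred_b = [0.80, 0.30, 0.60]

# Element-wise mean, same operation as 0.5*(pred_seq+pred_bid) above
ensemble = [round(0.5 * (a + b), 6) for a, b in zip(pred_a, pred_b)]
print(ensemble)  # → [0.85, 0.2, 0.5]
```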
X_test = do_stuff(test)
X_train,X_test=tok_seq(X_train,X_test)
y_train = train[['toxic', 'severe_toxic','obscene','threat','insult','identity_hate']].values
finalpred_seq = seq_model(X_train, y_train, X_test,val='yes')
Train on 143613 samples, validate on 15958 samples
Epoch 1/5
143613/143613 [==============================] - 32s 225us/step - loss: 0.1688 - acc: 0.9606 - val_loss: 0.1124 - val_acc: 0.9627
Epoch 2/5
143613/143613 [==============================] - 31s 218us/step - loss: 0.0884 - acc: 0.9706 - val_loss: 0.0607 - val_acc: 0.9790
Epoch 3/5
143613/143613 [==============================] - 31s 218us/step - loss: 0.0506 - acc: 0.9815 - val_loss: 0.0568 - val_acc: 0.9793
Epoch 4/5
143613/143613 [==============================] - 31s 217us/step - loss: 0.0449 - acc: 0.9829 - val_loss: 0.0554 - val_acc: 0.9796
Epoch 5/5
143613/143613 [==============================] - 31s 218us/step - loss: 0.0412 - acc: 0.9842 - val_loss: 0.0575 - val_acc: 0.9784
finalpred_bid = bidirect_model (X_train, y_train, X_test,val='yes')
Train on 143613 samples, validate on 15958 samples
Epoch 1/4
143613/143613 [==============================] - 190s 1ms/step - loss: 0.1739 - acc: 0.9598 - val_loss: 0.0773 - val_acc: 0.9744
Epoch 2/4
143613/143613 [==============================] - 188s 1ms/step - loss: 0.0576 - acc: 0.9796 - val_loss: 0.0536 - val_acc: 0.9804
Epoch 3/4
143613/143613 [==============================] - 188s 1ms/step - loss: 0.0452 - acc: 0.9832 - val_loss: 0.0549 - val_acc: 0.9809
Epoch 4/4
143613/143613 [==============================] - 187s 1ms/step - loss: 0.0397 - acc: 0.9847 - val_loss: 0.0560 - val_acc: 0.9809
submission1 = pd.read_csv('sample_submission.csv')
submission1[cols] = finalpred_seq
submission1.to_csv('submission1.csv',index=False)
submission2 = pd.read_csv('sample_submission.csv')
submission2[cols] = finalpred_bid
submission2.to_csv('submission2.csv',index=False)
submission3 = pd.read_csv('sample_submission.csv')
submission3[cols] = 0.5*(submission1[cols] + submission2[cols])
submission3.to_csv('submission3.csv',index=False)
def onemore_model(X_train, y_train, test, val='no'):
    model = Sequential()
    model.add(Embedding(100000, 100, input_length=200))
    model.add(SpatialDropout1D(0.25))
    model.add(GlobalMaxPooling1D())
    model.add(BatchNormalization())
    model.add(Dense(128))
    model.add(Dropout(0.5))
    model.add(Dense(6, activation='sigmoid'))
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    if val == 'no':
        model.fit(X_train, y_train, batch_size=1000, epochs=5)
    else:
        model.fit(X_train, y_train, batch_size=1000, epochs=5, validation_split=0.1)
    pred = model.predict(test)
    return pred
finalpred_om = onemore_model (X_train, y_train, X_test,val='yes')
Train on 143613 samples, validate on 15958 samples
Epoch 1/5
143613/143613 [==============================] - 14s 97us/step - loss: 0.3350 - acc: 0.8389 - val_loss: 0.0671 - val_acc: 0.9772
Epoch 2/5
143613/143613 [==============================] - 13s 89us/step - loss: 0.0808 - acc: 0.9751 - val_loss: 0.0590 - val_acc: 0.9792
Epoch 3/5
143613/143613 [==============================] - 13s 89us/step - loss: 0.0619 - acc: 0.9794 - val_loss: 0.0567 - val_acc: 0.9797
Epoch 4/5
143613/143613 [==============================] - 13s 89us/step - loss: 0.0523 - acc: 0.9817 - val_loss: 0.0560 - val_acc: 0.9801
Epoch 5/5
143613/143613 [==============================] - 13s 89us/step - loss: 0.0478 - acc: 0.9828 - val_loss: 0.0555 - val_acc: 0.9804
submission4 = pd.read_csv('sample_submission.csv')
submission4[cols] = finalpred_om
submission4.to_csv('submission4.csv',index=False)
This last one scored 0.067. However, dividing the prediction matrix by 1.12 actually improves the score to 0.063. This is due to the unbalanced nature of the dataset, and it has led to some discussion about switching to an AUC scoring system. Regardless, there is still a lot of room for improvement, but I think getting within earshot of the top 25% isn't too shabby considering I have about four days' worth of natural language processing experience.
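The improvement from dividing by 1.12 can be sanity-checked with a toy, made-up example: when most labels are 0, shrinking every prediction toward 0 cuts the log-loss penalty on confident false positives by more than it adds on the few true positives.

```python
import math

def mean_log_loss(y, p, eps=1e-15):
    # Mean binary log loss over one label column.
    return sum(-(t * math.log(max(pi, eps)) + (1 - t) * math.log(max(1 - pi, eps)))
               for t, pi in zip(y, p)) / len(y)

# Toy imbalanced labels and somewhat over-confident predictions (assumed numbers):
y = [0] * 95 + [1] * 5
p = [0.08] * 95 + [0.9] * 5

raw = mean_log_loss(y, p)
scaled = mean_log_loss(y, [pi / 1.12 for pi in p])
print(raw > scaled)  # → True: shrinking toward 0 helps when most labels are 0
```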
The obvious next step would be to use existing vectorized co-occurrence dictionaries such as GloVe or Facebook's fastText. Spell checking and toxic-word dictionaries may also be helpful. Possible engineered features include the use of all caps and the prevalence of symbols, exclamation marks, and question marks within the body of the text; these can be computed prior to tokenization, lemmatization, and forced lowercasing. Comment length could also be useful: anecdotally, there sometimes seems to be a slight correlation between the length of a comment and how angry its writer is. Early stopping could also be incorporated to reduce overfitting.
These will almost certainly be necessary for significant score improvements.