import csv
data = []
with open('irony-labeled.csv') as datafile:
    csvReader = csv.reader(datafile)
    for row in csvReader:
        data.append(row)
# delete the first element (header) in the data list
del data[0]
data[:3]
[["I suspect atheists are projecting their desires when they imagine Obama is one of their number. Does anyone remember the crazy preacher with whom he was associated? \nhttp://www. examiner. com/article/obama-and-wright-throw-each-other-under-the-bus\n\nI can understand a career politician in the USA needing to feign belief to get elected, but for that purpose I'd imagine a more vanilla choice of church. \n\n\nHe's not an atheist. He's not a liberal either.", '-1'], ['It\'s funny how the arguments the shills are making here are still so close to the racist remarks the GOP has already admitted to. Always attacking "lazy minorities and young people. " \n\n>\xe2\x80\x9c[i]f it hurts a bunch of college kids that\xe2\x80\x99s too lazy to get up off their bohunkus [sic] and get a photo ID, so be it,\xe2\x80\x9d and \xe2\x80\x9cif it hurts a bunch of lazy blacks that wants the government to give them everything, so be it. \xe2\x80\x9d \xe2\x80\x9cthe law is going to kick the Democrats in the butt. \xe2\x80\x9d', '-1'], ["We are truly following the patterns of how the mandarins took over empires, not because of the sword, but because control of the endless paper, and regulations, that do more to stagnate most people's lives, then to do anything productive. \n\nBut then because they don't see what else they can do, they write up even more laws and regulations, that either do nothing, or hinder more freedom and production.", '-1']]
# remove url from texts
import re
for row in data:
    row[0] = re.sub(r'^https?:\/\/.*[\r\n]*', '', row[0], flags=re.MULTILINE)
print(data[:3])
[["I suspect atheists are projecting their desires when they imagine Obama is one of their number. Does anyone remember the crazy preacher with whom he was associated? \nI can understand a career politician in the USA needing to feign belief to get elected, but for that purpose I'd imagine a more vanilla choice of church. \n\n\nHe's not an atheist. He's not a liberal either.", '-1'], ['It\'s funny how the arguments the shills are making here are still so close to the racist remarks the GOP has already admitted to. Always attacking "lazy minorities and young people. " \n\n>\xe2\x80\x9c[i]f it hurts a bunch of college kids that\xe2\x80\x99s too lazy to get up off their bohunkus [sic] and get a photo ID, so be it,\xe2\x80\x9d and \xe2\x80\x9cif it hurts a bunch of lazy blacks that wants the government to give them everything, so be it. \xe2\x80\x9d \xe2\x80\x9cthe law is going to kick the Democrats in the butt. \xe2\x80\x9d', '-1'], ["We are truly following the patterns of how the mandarins took over empires, not because of the sword, but because control of the endless paper, and regulations, that do more to stagnate most people's lives, then to do anything productive. \n\nBut then because they don't see what else they can do, they write up even more laws and regulations, that either do nothing, or hinder more freedom and production.", '-1']]
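To see what that substitution does, here is a toy example (made-up text, not from the dataset): with re.MULTILINE, '^' matches at the start of every line, so the pattern strips any line that begins with a URL, trailing newline included.

```python
import re

# Toy example (made-up text) of the URL-stripping substitution above:
# with re.MULTILINE, '^' matches at the start of every line, so any
# line beginning with http:// or https:// is removed, newline included.
pattern = r'^https?:\/\/.*[\r\n]*'
text = "first line\nhttp://www.example.com/some/page\nlast line"
cleaned = re.sub(pattern, '', text, flags=re.MULTILINE)
print(cleaned)
```

Note that this pattern only removes URLs that start a line; a URL embedded mid-sentence would survive.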
data_texts = [] # build a list to store texts
data_labels = [] # build a list to store labels
for row in data:
    data_texts.append(row[0])
    data_labels.append(row[1])
print(data_texts[:3])
print(data_labels[:3])
["I suspect atheists are projecting their desires when they imagine Obama is one of their number. Does anyone remember the crazy preacher with whom he was associated? \nI can understand a career politician in the USA needing to feign belief to get elected, but for that purpose I'd imagine a more vanilla choice of church. \n\n\nHe's not an atheist. He's not a liberal either.", 'It\'s funny how the arguments the shills are making here are still so close to the racist remarks the GOP has already admitted to. Always attacking "lazy minorities and young people. " \n\n>\xe2\x80\x9c[i]f it hurts a bunch of college kids that\xe2\x80\x99s too lazy to get up off their bohunkus [sic] and get a photo ID, so be it,\xe2\x80\x9d and \xe2\x80\x9cif it hurts a bunch of lazy blacks that wants the government to give them everything, so be it. \xe2\x80\x9d \xe2\x80\x9cthe law is going to kick the Democrats in the butt. \xe2\x80\x9d', "We are truly following the patterns of how the mandarins took over empires, not because of the sword, but because control of the endless paper, and regulations, that do more to stagnate most people's lives, then to do anything productive. \n\nBut then because they don't see what else they can do, they write up even more laws and regulations, that either do nothing, or hinder more freedom and production."] ['-1', '-1', '-1']
# check the counts of ironic (labeled as 1) and non-ironic (labeled as -1) texts
print(data_labels.count('1'))
print(data_labels.count('-1'))
537
1412
So in this dataset we have 1412 non-ironic texts and only 537 ironic texts: the classes are noticeably imbalanced.
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(ngram_range=(1, 2), min_df=5, max_df=0.8, sublinear_tf=True, use_idf=True)
features = vectorizer.fit_transform(data_texts)
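A rough sketch of what these parameters do (my reading of scikit-learn's behavior; details may differ slightly): min_df=5 drops terms that appear in fewer than 5 documents, max_df=0.8 drops terms that appear in more than 80% of them, and sublinear_tf=True replaces the raw term count tf with 1 + log(tf). A stdlib illustration of the weighting:

```python
import math

# Stdlib illustration of the tf-idf weighting (my reading of
# scikit-learn's defaults; the real implementation also L2-normalizes
# each document row, which is omitted here).
def sublinear_tf(count):
    # sublinear_tf=True dampens raw counts: tf -> 1 + log(tf)
    return 1.0 + math.log(count) if count > 0 else 0.0

def smooth_idf(n_docs, doc_freq):
    # rare terms get boosted; a term in every document gets idf = 1
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1.0

# e.g. a term occurring 10 times in a document and in 5 of 100 documents:
weight = sublinear_tf(10) * smooth_idf(100, 5)
print(round(weight, 3))
```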
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(features, data_labels, test_size=0.2, random_state=42)
from sklearn import svm
clf = svm.SVC()
clf.fit(X_train, y_train)
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0, decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf', max_iter=-1, probability=False, random_state=None, shrinking=True, tol=0.001, verbose=False)
from sklearn.metrics import accuracy_score
predicted = clf.predict(X_test)
# print the accuracy score
print("Accuracy score of SVM model:\n"+ str(accuracy_score(y_test,predicted)))
# print evaluation report showing precision, recall, f1, support
from sklearn.metrics import classification_report
print(classification_report(y_test, predicted))
Accuracy score of SVM model:
0.725641025641

             precision    recall  f1-score   support

         -1       0.73      1.00      0.84       283
          1       0.00      0.00      0.00       107

avg / total       0.53      0.73      0.61       390
The multinomial Naive Bayes classifier is suitable for classification with discrete features (e.g., word counts for text classification). The multinomial distribution normally requires integer feature counts. However, in practice, fractional counts such as tf-idf may also work.
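The reason fractional counts can work: the multinomial NB decision rule is score(c) = log P(c) + sum_i x_i * log(theta_c,i), and nothing in that sum requires x_i to be an integer. A stdlib sketch with made-up numbers:

```python
import math

# Sketch (stdlib only, made-up numbers) of the multinomial NB scoring
# rule: score(c) = log P(c) + sum_i x_i * log(theta_c,i).
log_prior = {'-1': math.log(0.72), '1': math.log(0.28)}

# hypothetical per-class feature log-probabilities for a 3-term vocabulary
log_theta = {'-1': [math.log(0.5), math.log(0.3), math.log(0.2)],
             '1':  [math.log(0.2), math.log(0.3), math.log(0.5)]}

x = [0.0, 0.41, 1.73]  # fractional tf-idf values, not integer counts

scores = {c: log_prior[c] + sum(xi * lt for xi, lt in zip(x, log_theta[c]))
          for c in log_prior}
print(max(scores, key=scores.get))  # the rule works fine on fractional x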
from sklearn.naive_bayes import MultinomialNB
mnb = MultinomialNB()
mnb.fit(X_train, y_train)
MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
from sklearn.metrics import classification_report
mnb_predict = mnb.predict(X_test)
print("Accuracy score of Naive Bayes model:\n"+ str(accuracy_score(y_test,mnb_predict)))
print(classification_report(y_test, mnb_predict))
Accuracy score of Naive Bayes model:
0.728205128205

             precision    recall  f1-score   support

         -1       0.73      1.00      0.84       283
          1       1.00      0.01      0.02       107

avg / total       0.80      0.73      0.62       390
Accuracy - Accuracy is the most intuitive performance measure: it is simply the ratio of correctly predicted observations to the total number of observations. One might think that high accuracy means the model is good, and accuracy is indeed a useful measure, but only on symmetric datasets where the numbers of false positives and false negatives are roughly the same. Otherwise, you have to look at other metrics to evaluate the performance of your model.
Precision - Precision is the ratio of correctly predicted positive observations to the total predicted positive observations. The question this metric answers is: of all the texts the model labeled as ironic, how many are actually ironic? High precision corresponds to a low false positive rate.
Since my dataset is not symmetric, I take a closer look at the precision score.
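As a sanity check, the reported scores can be recovered from these definitions together with the support counts in the classification reports above:

```python
# Recomputing the reported numbers from the definitions above
# (counts taken from the classification reports printed earlier).
def precision(tp, fp):
    # correctly predicted positives / all predicted positives
    return tp / (tp + fp) if (tp + fp) > 0 else 0.0

# The SVM predicted '-1' for all 390 test examples
# (283 true '-1', 107 true '1'), so only the '-1' examples are correct:
svm_accuracy = 283 / (283 + 107)
print(round(svm_accuracy, 6))  # 0.725641, matching the report

# Naive Bayes predicted exactly one positive, and it was correct, which
# is why its precision for class '1' is 1.00 despite a recall of 0.01:
print(precision(tp=1, fp=0))   # 1.0
```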
In my experiments, the precision score changes as I adjust the ngram_range parameter of the vectorizer.
From the scikit-learn documentation, ngram_range is "the lower and upper boundary of the range of n-values for different n-grams to be extracted. All values of n such that min_n <= n <= max_n will be used."
e.g.
text = "I do not know what you mean"
If we set ngram_range = (1, 1), the vectorizer builds a vocabulary of unigrams only:
"I", "do", "not", "know", "what", "you", "mean"
If we set ngram_range = (2, 2), the vocabulary contains bigrams only:
"I do", "do not", "not know", "know what", "what you", "you mean"
If we set ngram_range = (1, 2), we get both the unigrams and the bigrams:
"I", "do", "not", "know", "what", "you", "mean", "I do", "do not", "not know", "know what", "what you", "you mean"
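The extraction above can be sketched in a few lines (illustrative only: TfidfVectorizer's own tokenizer also lowercases and strips punctuation, so its actual vocabulary differs):

```python
# Stdlib sketch of n-gram extraction over whitespace tokens.
def ngrams(text, min_n, max_n):
    tokens = text.split()
    grams = []
    for n in range(min_n, max_n + 1):        # each n in the range
        for i in range(len(tokens) - n + 1):  # each window of n tokens
            grams.append(' '.join(tokens[i:i + n]))
    return grams

text = "I do not know what you mean"
print(ngrams(text, 2, 2))
# ['I do', 'do not', 'not know', 'know what', 'what you', 'you mean']
print(len(ngrams(text, 1, 2)))  # 7 unigrams + 6 bigrams = 13
```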
SVM: 0.54  NB: 0.54
SVM: 0.54  NB: 0.68
SVM: 0.54  NB: 0.61
SVM: 0.54  NB: 0.54
SVM: 0.54  NB: 0.81
SVM: 0.54  NB: 0.81
SVM: 0.54  NB: 0.68
The precision score of the Naive Bayes model is clearly higher when ngram_range is set to (1, 2) or (1, 3): when the model identifies a text as ironic, the combination of unigrams and bigrams helps it perform better. (The SVM's precision stays at 0.54 throughout.)
Apart from the n-grams, what features can help to detect irony?
How to construct multiple features?
How to use multiple features in a model?
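One possible direction, sketched with made-up numbers: hand-crafted surface cues (punctuation, capitalization, scare quotes) appended to the n-gram vector. With scikit-learn, the concatenation would normally be done with scipy.sparse.hstack or a FeatureUnion; here is a plain-Python sketch.

```python
# Hypothetical hand-crafted cues sometimes associated with irony,
# concatenated onto a (stand-in) n-gram feature vector.
def extra_features(text):
    return [
        text.count('!'),   # exclamation marks
        text.count('?'),   # question marks
        text.count('"'),   # quotation marks (possible scare quotes)
        sum(w.isupper() for w in text.split() if len(w) > 1),  # SHOUTED words
    ]

ngram_vector = [0.0, 0.41, 1.73]  # stand-in for one tf-idf row
combined = ngram_vector + extra_features('So "great". Really GREAT!')
print(combined)
```

Whether any of these cues actually improve irony detection on this dataset is an open question that would need its own experiments.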
# compare the number of predicted positives with the number of actual positives
print(list(predicted).count('1'))    # positives predicted by the SVM
print(list(mnb_predict).count('1'))  # positives predicted by Naive Bayes
print(list(y_test).count('1'))       # actual positives in the test set
0
1
107