import csv
data = []
with open('irony-labeled.csv') as datafile:
    csvReader = csv.reader(datafile)
    for row in csvReader:
        data.append(row)
# delete the first element (header) in the data list
del data[0]
data[:3]
[["I suspect atheists are projecting their desires when they imagine Obama is one of their number. Does anyone remember the crazy preacher with whom he was associated? \nhttp://www. examiner. com/article/obama-and-wright-throw-each-other-under-the-bus\n\nI can understand a career politician in the USA needing to feign belief to get elected, but for that purpose I'd imagine a more vanilla choice of church. \n\n\nHe's not an atheist. He's not a liberal either.", '-1'], ['It\'s funny how the arguments the shills are making here are still so close to the racist remarks the GOP has already admitted to. Always attacking "lazy minorities and young people. " \n\n>\xe2\x80\x9c[i]f it hurts a bunch of college kids that\xe2\x80\x99s too lazy to get up off their bohunkus [sic] and get a photo ID, so be it,\xe2\x80\x9d and \xe2\x80\x9cif it hurts a bunch of lazy blacks that wants the government to give them everything, so be it. \xe2\x80\x9d \xe2\x80\x9cthe law is going to kick the Democrats in the butt. \xe2\x80\x9d', '-1'], ["We are truly following the patterns of how the mandarins took over empires, not because of the sword, but because control of the endless paper, and regulations, that do more to stagnate most people's lives, then to do anything productive. \n\nBut then because they don't see what else they can do, they write up even more laws and regulations, that either do nothing, or hinder more freedom and production.", '-1']]
# remove url from texts
import re
for row in data:
    row[0] = re.sub(r'^https?:\/\/.*[\r\n]*', '', row[0], flags=re.MULTILINE)
print(data[:3])
[["I suspect atheists are projecting their desires when they imagine Obama is one of their number. Does anyone remember the crazy preacher with whom he was associated? \nI can understand a career politician in the USA needing to feign belief to get elected, but for that purpose I'd imagine a more vanilla choice of church. \n\n\nHe's not an atheist. He's not a liberal either.", '-1'], ['It\'s funny how the arguments the shills are making here are still so close to the racist remarks the GOP has already admitted to. Always attacking "lazy minorities and young people. " \n\n>\xe2\x80\x9c[i]f it hurts a bunch of college kids that\xe2\x80\x99s too lazy to get up off their bohunkus [sic] and get a photo ID, so be it,\xe2\x80\x9d and \xe2\x80\x9cif it hurts a bunch of lazy blacks that wants the government to give them everything, so be it. \xe2\x80\x9d \xe2\x80\x9cthe law is going to kick the Democrats in the butt. \xe2\x80\x9d', '-1'], ["We are truly following the patterns of how the mandarins took over empires, not because of the sword, but because control of the endless paper, and regulations, that do more to stagnate most people's lives, then to do anything productive. \n\nBut then because they don't see what else they can do, they write up even more laws and regulations, that either do nothing, or hinder more freedom and production.", '-1']]
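To see what that substitution does, here is a toy example (made-up text, not from the dataset): with re.MULTILINE, '^' matches at the start of every line, so the pattern strips any line that begins with a URL, trailing newline included.

```python
import re

# Toy example (made-up text) of the URL-stripping substitution above:
# with re.MULTILINE, '^' matches at the start of every line, so any
# line beginning with http:// or https:// is removed, newline included.
pattern = r'^https?:\/\/.*[\r\n]*'
text = "first line\nhttp://www.example.com/some/page\nlast line"
cleaned = re.sub(pattern, '', text, flags=re.MULTILINE)
print(cleaned)
```

Note that this pattern only removes URLs that start a line; a URL embedded mid-sentence would survive.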
data_texts = [] # build a list to store texts
data_labels = [] # build a list to store labels
for row in data:
    data_texts.append(row[0])
    data_labels.append(row[1])
print(data_texts[:3])
print(data_labels[:3])
["I suspect atheists are projecting their desires when they imagine Obama is one of their number. Does anyone remember the crazy preacher with whom he was associated? \nI can understand a career politician in the USA needing to feign belief to get elected, but for that purpose I'd imagine a more vanilla choice of church. \n\n\nHe's not an atheist. He's not a liberal either.", 'It\'s funny how the arguments the shills are making here are still so close to the racist remarks the GOP has already admitted to. Always attacking "lazy minorities and young people. " \n\n>\xe2\x80\x9c[i]f it hurts a bunch of college kids that\xe2\x80\x99s too lazy to get up off their bohunkus [sic] and get a photo ID, so be it,\xe2\x80\x9d and \xe2\x80\x9cif it hurts a bunch of lazy blacks that wants the government to give them everything, so be it. \xe2\x80\x9d \xe2\x80\x9cthe law is going to kick the Democrats in the butt. \xe2\x80\x9d', "We are truly following the patterns of how the mandarins took over empires, not because of the sword, but because control of the endless paper, and regulations, that do more to stagnate most people's lives, then to do anything productive. \n\nBut then because they don't see what else they can do, they write up even more laws and regulations, that either do nothing, or hinder more freedom and production."] ['-1', '-1', '-1']
# check the counts of ironic (labeled as 1) and non-ironic (labeled as -1) texts
print(data_labels.count('1'))
print(data_labels.count('-1'))
537
1412
So in this dataset we have 1412 non-ironic texts and only 537 ironic texts: the classes are noticeably imbalanced.
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(ngram_range=(1, 2), min_df=5, max_df=0.8, sublinear_tf=True, use_idf=True)
features = vectorizer.fit_transform(data_texts)
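A rough sketch of what these parameters do (my reading of scikit-learn's behavior; details may differ slightly): min_df=5 drops terms that appear in fewer than 5 documents, max_df=0.8 drops terms that appear in more than 80% of them, and sublinear_tf=True replaces the raw term count tf with 1 + log(tf). A stdlib illustration of the weighting:

```python
import math

# Stdlib illustration of the tf-idf weighting (my reading of
# scikit-learn's defaults; the real implementation also L2-normalizes
# each document row, which is omitted here).
def sublinear_tf(count):
    # sublinear_tf=True dampens raw counts: tf -> 1 + log(tf)
    return 1.0 + math.log(count) if count > 0 else 0.0

def smooth_idf(n_docs, doc_freq):
    # rare terms get boosted; a term in every document gets idf = 1
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1.0

# e.g. a term occurring 10 times in a document and in 5 of 100 documents:
weight = sublinear_tf(10) * smooth_idf(100, 5)
print(round(weight, 3))
```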
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(features, data_labels, test_size=0.2, random_state=42)
from sklearn import svm
clf = svm.SVC()
clf.fit(X_train, y_train)
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0, decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf', max_iter=-1, probability=False, random_state=None, shrinking=True, tol=0.001, verbose=False)
from sklearn.metrics import accuracy_score
predicted = clf.predict(X_test)
# print the accuracy score
print("Accuracy score of SVM model:\n"+ str(accuracy_score(y_test,predicted)))
# print evaluation report showing precision, recall, f1, support
from sklearn.metrics import classification_report
print(classification_report(y_test, predicted))
Accuracy score of SVM model:
0.725641025641

             precision    recall  f1-score   support

         -1       0.73      1.00      0.84       283
          1       0.00      0.00      0.00       107

avg / total       0.53      0.73      0.61       390
The multinomial Naive Bayes classifier is suitable for classification with discrete features (e.g., word counts for text classification). The multinomial distribution normally requires integer feature counts. However, in practice, fractional counts such as tf-idf may also work.
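The reason fractional counts can work: the multinomial NB decision rule is score(c) = log P(c) + sum_i x_i * log(theta_c,i), and nothing in that sum requires x_i to be an integer. A stdlib sketch with made-up numbers:

```python
import math

# Sketch (stdlib only, made-up numbers) of the multinomial NB scoring
# rule: score(c) = log P(c) + sum_i x_i * log(theta_c,i).
log_prior = {'-1': math.log(0.72), '1': math.log(0.28)}

# hypothetical per-class feature log-probabilities for a 3-term vocabulary
log_theta = {'-1': [math.log(0.5), math.log(0.3), math.log(0.2)],
             '1':  [math.log(0.2), math.log(0.3), math.log(0.5)]}

x = [0.0, 0.41, 1.73]  # fractional tf-idf values, not integer counts

scores = {c: log_prior[c] + sum(xi * lt for xi, lt in zip(x, log_theta[c]))
          for c in log_prior}
print(max(scores, key=scores.get))  # the rule works fine on fractional x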
from sklearn.naive_bayes import MultinomialNB
mnb = MultinomialNB()
mnb.fit(X_train, y_train)
MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
from sklearn.metrics import classification_report
mnb_predict = mnb.predict(X_test)
print("Accuracy score of Naive Bayes model:\n"+ str(accuracy_score(y_test,mnb_predict)))
print(classification_report(y_test, mnb_predict))
Accuracy score of Naive Bayes model:
0.728205128205

             precision    recall  f1-score   support

         -1       0.73      1.00      0.84       283
          1       1.00      0.01      0.02       107

avg / total       0.80      0.73      0.62       390
Accuracy - Accuracy is the most intuitive performance measure: it is simply the ratio of correctly predicted observations to the total number of observations. One might think that high accuracy means the model is good, and accuracy is indeed a useful measure, but only on symmetric datasets where the numbers of false positives and false negatives are roughly the same. Otherwise, you have to look at other metrics to evaluate the performance of your model.
Precision - Precision is the ratio of correctly predicted positive observations to the total predicted positive observations. The question this metric answers is: of all the texts the model labeled as ironic, how many are actually ironic? High precision corresponds to a low false positive rate.
Since my dataset is not symmetric, I take a closer look at the precision score.
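As a sanity check, the reported scores can be recovered from these definitions together with the support counts in the classification reports above:

```python
# Recomputing the reported numbers from the definitions above
# (counts taken from the classification reports printed earlier).
def precision(tp, fp):
    # correctly predicted positives / all predicted positives
    return tp / (tp + fp) if (tp + fp) > 0 else 0.0

# The SVM predicted '-1' for all 390 test examples
# (283 true '-1', 107 true '1'), so only the '-1' examples are correct:
svm_accuracy = 283 / (283 + 107)
print(round(svm_accuracy, 6))  # 0.725641, matching the report

# Naive Bayes predicted exactly one positive, and it was correct, which
# is why its precision for class '1' is 1.00 despite a recall of 0.01:
print(precision(tp=1, fp=0))   # 1.0
```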
In my experiments, the precision score changes as I adjust the ngram_range parameter of the vectorizer.
From the scikit-learn documentation, ngram_range is "the lower and upper boundary of the range of n-values for different n-grams to be extracted. All values of n such that min_n <= n <= max_n will be used."
e.g.
text = "I do not know what you mean"
If we set ngram_range = (1, 1), the vectorizer builds a vocabulary of unigrams only:
"I", "do", "not", "know", "what", "you", "mean"
If we set ngram_range = (2, 2), the vocabulary contains bigrams only:
"I do", "do not", "not know", "know what", "what you", "you mean"
If we set ngram_range = (1, 2), we get both the unigrams and the bigrams:
"I", "do", "not", "know", "what", "you", "mean", "I do", "do not", "not know", "know what", "what you", "you mean"
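The extraction above can be sketched in a few lines (illustrative only: TfidfVectorizer's own tokenizer also lowercases and strips punctuation, so its actual vocabulary differs):

```python
# Stdlib sketch of n-gram extraction over whitespace tokens.
def ngrams(text, min_n, max_n):
    tokens = text.split()
    grams = []
    for n in range(min_n, max_n + 1):        # each n in the range
        for i in range(len(tokens) - n + 1):  # each window of n tokens
            grams.append(' '.join(tokens[i:i + n]))
    return grams

text = "I do not know what you mean"
print(ngrams(text, 2, 2))
# ['I do', 'do not', 'not know', 'know what', 'what you', 'you mean']
print(len(ngrams(text, 1, 2)))  # 7 unigrams + 6 bigrams = 13
```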
SVM: 0.54  NB: 0.54
SVM: 0.54  NB: 0.68
SVM: 0.54  NB: 0.61
SVM: 0.54  NB: 0.54
SVM: 0.54  NB: 0.81
SVM: 0.54  NB: 0.81
SVM: 0.54  NB: 0.68
The precision score of the Naive Bayes model is clearly higher when ngram_range is set to (1, 2) or (1, 3): when the model identifies a text as ironic, the combination of unigrams and bigrams helps it perform better. (The SVM's precision stays at 0.54 throughout.)
Apart from the n-grams, what features can help to detect irony?
How to construct multiple features?
How to use multiple features in a model?
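One possible direction, sketched with made-up numbers: hand-crafted surface cues (punctuation, capitalization, scare quotes) appended to the n-gram vector. With scikit-learn, the concatenation would normally be done with scipy.sparse.hstack or a FeatureUnion; here is a plain-Python sketch.

```python
# Hypothetical hand-crafted cues sometimes associated with irony,
# concatenated onto a (stand-in) n-gram feature vector.
def extra_features(text):
    return [
        text.count('!'),   # exclamation marks
        text.count('?'),   # question marks
        text.count('"'),   # quotation marks (possible scare quotes)
        sum(w.isupper() for w in text.split() if len(w) > 1),  # SHOUTED words
    ]

ngram_vector = [0.0, 0.41, 1.73]  # stand-in for one tf-idf row
combined = ngram_vector + extra_features('So "great". Really GREAT!')
print(combined)
```

Whether any of these cues actually improve irony detection on this dataset is an open question that would need its own experiments.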
# compare the number of predicted positives with the number of actual positives
print(list(predicted).count('1'))    # positives predicted by the SVM
print(list(mnb_predict).count('1'))  # positives predicted by Naive Bayes
print(list(y_test).count('1'))       # actual positives in the test set
0
1
107