- Open Machine Learning Course

Author: Yury Kashnitsky. All content is distributed under the Creative Commons CC BY-NC-SA 4.0 license.

Assignment 4 (demo)

Sarcasm detection with logistic regression

The same assignment is available as a Kaggle Kernel, along with a solution.

We'll be using the dataset from the paper "A Large Self-Annotated Corpus for Sarcasm", with more than 1 million comments from Reddit, each labeled as either sarcastic or not. A processed version can be found on Kaggle in the form of a Kaggle Dataset.

Sarcasm detection is easy.

In [1]:
!ls ../input/sarcasm/
test-balanced.csv  test-unbalanced.csv	train-balanced-sarcasm.csv
In [2]:
# some necessary imports
import os
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix
import seaborn as sns
from matplotlib import pyplot as plt
In [3]:
train_df = pd.read_csv('../input/sarcasm/train-balanced-sarcasm.csv')
In [4]:
train_df.head()
|   | label | comment | author | subreddit | score | ups | downs | date | created_utc | parent_comment |
|---|-------|---------|--------|-----------|-------|-----|-------|------|-------------|----------------|
| 0 | 0 | NC and NH. | Trumpbart | politics | 2 | -1 | -1 | 2016-10 | 2016-10-16 23:55:23 | Yeah, I get that argument. At this point, I'd ... |
| 1 | 0 | You do know west teams play against west teams... | Shbshb906 | nba | -4 | -1 | -1 | 2016-11 | 2016-11-01 00:24:10 | The blazers and Mavericks (The wests 5 and 6 s... |
| 2 | 0 | They were underdogs earlier today, but since G... | Creepeth | nfl | 3 | 3 | 0 | 2016-09 | 2016-09-22 21:45:37 | They're favored to win. |
| 3 | 0 | This meme isn't funny none of the "new york ni... | icebrotha | BlackPeopleTwitter | -8 | -1 | -1 | 2016-10 | 2016-10-18 21:03:47 | deadass don't kill my buzz |
| 4 | 0 | I could use one of those tools. | cush2push | MaddenUltimateTeam | 6 | -1 | -1 | 2016-12 | 2016-12-30 17:00:13 | Yep can confirm I saw the tool they use for th... |
In [5]:
train_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1010826 entries, 0 to 1010825
Data columns (total 10 columns):
label             1010826 non-null int64
comment           1010773 non-null object
author            1010826 non-null object
subreddit         1010826 non-null object
score             1010826 non-null int64
ups               1010826 non-null int64
downs             1010826 non-null int64
date              1010826 non-null object
created_utc       1010826 non-null object
parent_comment    1010826 non-null object
dtypes: int64(4), object(6)
memory usage: 77.1+ MB

Some comments are missing, so we drop the corresponding rows.

In [6]:
train_df.dropna(subset=['comment'], inplace=True)

We notice that the dataset is indeed balanced:

In [7]:
train_df['label'].value_counts()
0    505405
1    505368
Name: label, dtype: int64

We split the data into training and validation parts.

In [8]:
train_texts, valid_texts, y_train, y_valid = \
        train_test_split(train_df['comment'], train_df['label'], random_state=17)


Tasks:

  1. Analyze the dataset, make some plots. This Kernel might serve as an example.
  2. Build a Tf-Idf + logistic regression pipeline to predict sarcasm (label) based on the text of a comment on Reddit (comment).
  3. Plot the words/bigrams which are most predictive of sarcasm (you can use eli5 for that).
  4. (optionally) Add subreddits as new features to improve model performance. Apply the Bag of Words approach here, i.e., treat each subreddit as a new feature.