We'll be using the dataset from the paper "A Large Self-Annotated Corpus for Sarcasm," which contains more than 1 million Reddit comments labeled as either sarcastic or not. A processed version is available on Kaggle as a Kaggle Dataset.
Sarcasm detection is easy.
test-balanced.csv
test-unbalanced.csv
train-balanced-sarcasm.csv
# some necessary imports
import os
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix
import seaborn as sns
from matplotlib import pyplot as plt
train_df = pd.read_csv('../input/sarcasm/train-balanced-sarcasm.csv')
| | label | comment | author | subreddit | score | ups | downs | date | created_utc | parent_comment |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | NC and NH. | Trumpbart | politics | 2 | -1 | -1 | 2016-10 | 2016-10-16 23:55:23 | Yeah, I get that argument. At this point, I'd ... |
| 1 | 0 | You do know west teams play against west teams... | Shbshb906 | nba | -4 | -1 | -1 | 2016-11 | 2016-11-01 00:24:10 | The blazers and Mavericks (The wests 5 and 6 s... |
| 2 | 0 | They were underdogs earlier today, but since G... | Creepeth | nfl | 3 | 3 | 0 | 2016-09 | 2016-09-22 21:45:37 | They're favored to win. |
| 3 | 0 | This meme isn't funny none of the "new york ni... | icebrotha | BlackPeopleTwitter | -8 | -1 | -1 | 2016-10 | 2016-10-18 21:03:47 | deadass don't kill my buzz |
| 4 | 0 | I could use one of those tools. | cush2push | MaddenUltimateTeam | 6 | -1 | -1 | 2016-12 | 2016-12-30 17:00:13 | Yep can confirm I saw the tool they use for th... |
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1010826 entries, 0 to 1010825
Data columns (total 10 columns):
label             1010826 non-null int64
comment           1010773 non-null object
author            1010826 non-null object
subreddit         1010826 non-null object
score             1010826 non-null int64
ups               1010826 non-null int64
downs             1010826 non-null int64
date              1010826 non-null object
created_utc       1010826 non-null object
parent_comment    1010826 non-null object
dtypes: int64(4), object(6)
memory usage: 77.1+ MB
Some comments are missing, so we drop the corresponding rows.
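Dropping the rows with missing comments can be done with `dropna` restricted to the `comment` column. A minimal sketch on a toy frame standing in for `train_df` (the values are illustrative, not from the dataset):

```python
import numpy as np
import pandas as pd

# Toy frame mimicking the dataset's structure (illustrative values only)
df = pd.DataFrame({'label': [0, 1, 0],
                   'comment': ['NC and NH.', np.nan, "They're favored to win."]})

# Keep only rows where the comment text is present
df = df.dropna(subset=['comment'])
```

With the real data this is `train_df.dropna(subset=['comment'], inplace=True)`, which removes the 53 rows where `comment` is null.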
We notice that the dataset is indeed balanced:
0    505405
1    505368
Name: label, dtype: int64
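The class counts above come from a `value_counts` call on the label column. A minimal sketch on a toy frame standing in for `train_df`:

```python
import pandas as pd

# Toy frame with two examples per class (illustrative only)
df = pd.DataFrame({'label': [0, 0, 1, 1]})

# Count how many rows fall into each class
counts = df['label'].value_counts()
```

On the real data, `train_df['label'].value_counts()` produces the output shown above.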
We split data into training and validation parts.
train_texts, valid_texts, y_train, y_valid = train_test_split(
    train_df['comment'], train_df['label'], random_state=17)
Our task is to predict whether a comment is sarcastic (`label`) based on the text of a comment on Reddit (`comment`).
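Given the imports above (`TfidfVectorizer`, `LogisticRegression`, `Pipeline`), a natural baseline for this task chains tf-idf features into a logistic regression. The sketch below uses hypothetical hyperparameters and a tiny made-up corpus in place of `train_texts`/`y_train`, just to show the wiring:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Hypothetical hyperparameters -- a sketch, not a tuned model
clf = Pipeline([
    ('tfidf', TfidfVectorizer(ngram_range=(1, 2), max_features=50000)),
    ('logit', LogisticRegression(C=1.0, random_state=17)),
])

# Tiny illustrative corpus; real training would use train_texts / y_train
texts = ['Oh great, another Monday',
         'The weather is nice today',
         'Sure, because that always works',
         'I enjoyed the game']
labels = [1, 0, 1, 0]

clf.fit(texts, labels)
preds = clf.predict(['What a fantastic idea'])
```

With the real split, the same pipeline would be fit via `clf.fit(train_texts, y_train)` and scored on `valid_texts` with `accuracy_score`.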