Sarcasm detection in tweets, Rajagopalan et al., 2019

Paper, Tags: #sarcasm-detection

We classify our set of features into 4 categories and compare the performance of 6 classifiers (SVM, logistic regression, naive Bayes, decision trees, random forests, and neural networks), adding one feature at a time incrementally.
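
A minimal sketch of that comparison, assuming scikit-learn (not necessarily the paper's toolkit), a precomputed feature matrix `X`, and binary labels `y`; the default hyperparameters below are placeholders, not the paper's settings.

```python
# Sketch: compare the 6 classifiers while adding one feature (column) at a
# time. Assumes X is a dense (n_tweets, n_features) array and y holds 0/1
# sarcasm labels; scikit-learn is an assumption, not the paper's stated tool.
from sklearn.svm import LinearSVC
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import cross_val_score

CLASSIFIERS = {
    "SVM": LinearSVC(),
    "logistic regression": LogisticRegression(max_iter=1000),
    "naive Bayes": GaussianNB(),
    "decision tree": DecisionTreeClassifier(),
    "random forest": RandomForestClassifier(),
    "neural network": MLPClassifier(max_iter=500),
}

def incremental_comparison(X, y):
    for k in range(1, X.shape[1] + 1):        # add one feature at a time
        for name, clf in CLASSIFIERS.items():
            acc = cross_val_score(clf, X[:, :k], y, cv=5).mean()
            print(f"{k:>3} features | {name:<20} {acc:.3f}")
```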

We collect our own dataset via the Twitter API and introduce two new features: passive-aggressiveness detection and emoji polarity flips.

We use a balanced dataset (50k sarcastic / 50k non-sarcastic) and an unbalanced one (25k / 75k), both from Ptáček et al.

We remove retweets, mentions, and URLs, limit our study to English, and integrate emojis as a feature.
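
A minimal preprocessing sketch under those rules; the `preprocess` helper and its regexes are illustrative, not the authors' code, and the English-only filter would be a separate language-ID step.

```python
import re

def preprocess(tweet: str):
    """Illustrative cleaner: drop retweets, strip mentions and URLs,
    keep emojis in the text so they can later be used as features."""
    if tweet.startswith("RT "):                  # remove retweets
        return None
    tweet = re.sub(r"https?://\S+", "", tweet)   # remove URLs
    tweet = re.sub(r"@\w+", "", tweet)           # remove mentions
    return re.sub(r"\s+", " ", tweet).strip()
```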

Features

Features: most frequently occurring unigrams, plus:

  • text expression-based features
    • capitalization
    • exclamation marks
    • question marks
    • noun and verb count
    • ellipsis
    • passive aggressiveness
    • interjections
  • emotion-based features
    • emoji sentiment
    • intensifiers
    • words with repeated letters
    • sentiment score
    • skip-grams
    • n-grams
  • contrast-based features
    • flip in polarity between sentence and emoji sentiments (sketched after this list)
    • polarity flip in a sentence
    • hashtag sentiment
    • positive and negative word count
  • context-based features
    • user mentions
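
A toy sketch of the sentence-vs-emoji polarity-flip feature; the word lists and emoji lexicon below are stand-ins for whatever sentiment resources the paper actually used.

```python
import re

# Toy lexicons: stand-ins for a real sentiment scorer and emoji lexicon.
POS_WORDS = {"great", "love", "awesome", "perfect"}
NEG_WORDS = {"hate", "awful", "terrible", "worst"}
EMOJI_SENTIMENT = {"😊": 1, "😍": 1, "🙄": -1, "😒": -1}

def text_polarity(text: str) -> int:
    words = re.findall(r"[a-z']+", text.lower())
    score = sum(w in POS_WORDS for w in words) - sum(w in NEG_WORDS for w in words)
    return (score > 0) - (score < 0)             # sign: +1, 0, or -1

def emoji_polarity_flip(tweet: str) -> bool:
    """True when the text and at least one emoji carry opposite polarities,
    e.g. "perfect, another Monday 🙄" (positive words, negative emoji)."""
    text_pol = text_polarity(tweet)
    emoji_pols = [EMOJI_SENTIMENT[ch] for ch in tweet if ch in EMOJI_SENTIMENT]
    return any(p * text_pol < 0 for p in emoji_pols)
```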

Results

Emotion-based and contrast-based features helped performance.

Sarcasm detection requires unique feature-engineering techniques that consider contextual characteristics. Our study shows that merely increasing the number of features is not enough to achieve high accuracy; selecting the right set of features is the basis of successful classification.

This set of features will change with the dataset and the data source.

Sarcastic and non-sarcastic comments have different linguistic characteristics, so the model needs to be trained on a balanced dataset. An unbalanced dataset may yield higher accuracy but a low F1-score, because the class with more samples dominates training.
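
A tiny numeric illustration of that accuracy-vs-F1 point on the 25k/75k split, using a degenerate majority-class predictor (not any model from the paper):

```python
from sklearn.metrics import accuracy_score, f1_score

y_true = [1] * 25_000 + [0] * 75_000   # 1 = sarcastic (minority class)
y_pred = [0] * 100_000                 # always predict the majority class

print(accuracy_score(y_true, y_pred))                          # 0.75 - looks fine
print(f1_score(y_true, y_pred, pos_label=1, zero_division=0))  # 0.0  - minority class missed
```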