DWAEF: A Deep Weighted Average Ensemble Framework Harnessing Novel Indicators for Sarcasm Detection

Tracking #: 740-1720

Authors:

	Name	ORCID
	Richa Sharma	https://orcid.org/0000-0002-4472-1681
	Simrat Deol	https://orcid.org/0000-0002-6785-9691
	Udit Kaushish	https://orcid.org/0000-0003-0636-4000
	Prakher Pandey	https://orcid.org/0000-0002-3340-8112
	Vishal Maurya	https://orcid.org/0000-0002-5169-209X

Responsible editor:

Jodi Schneider

Submission Type:

Research Paper

Abstract:

Sarcasm is a linguistic phenomenon often indicating a disparity between literal and inferred meanings. Due to its complexity, it is typically difficult to discern it within an online text message. Consequently, in recent years sarcasm detection has received considerable attention from both academia and industry. Nevertheless, the majority of current approaches simply model low-level indicators of sarcasm in various machine learning algorithms. This paper aims to present sarcasm in a new light by utilizing novel indicators in a Deep Weighted Average Ensemble-based Framework (DWAEF). The novel indicators pertain to exploiting the presence of simile and metaphor in text and detecting the subtle shift in tone at a sentence's structural level. A Graph Neural Network (GNN) structure is implemented to detect the presence of simile, Bidirectional Encoder Representations from Transformers (BERT) embeddings are exploited to detect metaphorical instances and Fuzzy Logic is employed to account for the shift of tone. To account for the existence of sarcasm, the DWAEF integrates the inputs from the novel indicators. The performance of the framework is evaluated on a self-curated dataset of online text messages. A comparative report between the results acquired using conventional features and those obtained using proposed indicators is provided. The encouraging findings produced after applying DWAEF demonstrate that the proposed method surpasses the outcomes of previous research that made use of primitive features.

Manuscript:

ds-paper-740.pdf

Revised Version:

DWAEF: a deep weighted average ensemble framework harnessing novel indicators for sarcasm detection

Data repository URLs:

Self curated dataset was used for the present research work

(is made available to reviewers and will be published upon acceptance)

Date of Submission:

Saturday, December 31, 2022

Date of Decision:

Wednesday, March 8, 2023

Nanopublication URLs:

Decision:

Undecided

Solicited Reviews:

Review #1 submitted on 28/Jan/2023

Review Details

Reviewer has chosen to be Anonymous

Overall Impression: Good
Suggested Decision: Accept
Technical Quality of the paper: Good
Presentation: Good
Reviewer`s confidence: Medium
Significance: High significance
Background: Reasonable
Novelty: Limited novelty
Data availability: Not all used and produced data are FAIR and openly available in established data repositories; authors need to fix this
Length of the manuscript: The length of this manuscript is about right

Summary of paper in a few sentences:

Artitcle title: "DWAEF: A Deep Weighted Average Ensemble Framework Harnessing Novel Indicators for Sarcasm Detection"

This article utilizes data from Twitter, News headlines and SARC dataset for the experiments.
They use the primitive features along with the similie, metaphor and polarity to determine the
sarcastic nature of the content. The proposed model uses GNN, BERT, Fuzzy logic and DWAEF frameworks.
They achieve an accuracy of 92.01%

Reasons to accept:

The area of research is ineteresting. Authors idea of using additional features and combining frameworks for better results.
But the paper needs some more information explained for better readability as mentioned in the Further comments section.

Reasons to reject:

Nanopublication comments:

Further comments:

Abstract
"The encouraging findings produced after applying DWAEF demonstrate that the proposed method surpasses the outcomes of previous research that made use of primitive features"
Should more clearly state what exactly is the improvement in terms metrics used for evaluation or appropriate statistics.

What dataset is the GNN framework pre-trained on? The proposed data is combination of Twitter data, News headlines and SARC dataset.
Did the authors check the results for verifying that this did not affect the results?

Comparison made between only using primitive features and combination of primitive and proposed features:
1. Are the experiments using the same settings and parameters?
2. The base experiments reference is not provided. Provide references for the corresponding papers for readers.

The dataset used is from different sources like Twitter, New headlines and SARC dataset. Are these
under the same domain or same topics?
Does domain have effect on the results?
What is the authors view on using different data set for the same experiments?
The dataset size is approx. 2889 instances. What would be the impact on the system using a big dataset?

Review #2 submitted on 11/Feb/2023

By Rasim Çekik ORCID logo

https://orcid.org/0000-0002-7820-413X

Review Details

Reviewer has chosen not to be Anonymous

Overall Impression: Good
Suggested Decision: Accept
Technical Quality of the paper: Good
Presentation: Average
Reviewer`s confidence: Medium
Significance: High significance
Background: Reasonable
Novelty: Clear novelty
Data availability: All used and produced data (if any) are FAIR and openly available in established data repositories
Length of the manuscript: The length of this manuscript is about right

Summary of paper in a few sentences:

The study is a beautiful and up-to-date field in terms of subject matter. That's why I see it as a work worth publishing. However, there are some shortcomings:
1-Is the dataset balanced or not? Considering this situation, maybe an evaluation can be made in terms of f-measure (macro-f1 and micro-f1) as well as accuracy criteria.
2- There are some typographical errors in the study. It would be beneficial to review the study from this aspect.

Reasons to accept:

Reasons to reject:

There is no reason to refuse the study.

Nanopublication comments:

Further comments:

Review #3 submitted on 14/Feb/2023

By ALESSANDRA TERESA ORCID logo

https://orcid.org/0000-0002-4409-6679

Review Details

Reviewer has chosen not to be Anonymous

Overall Impression: Good
Suggested Decision: Accept
Technical Quality of the paper: Good
Presentation: Good
Reviewer`s confidence: High
Significance: Moderate significance
Background: Reasonable
Novelty: Limited novelty
Data availability: Not all used and produced data are FAIR and openly available in established data repositories; authors need to fix this
Length of the manuscript: The length of this manuscript is about right

Summary of paper in a few sentences:

The paper presents sarcasm a new framework for the detection of sarcasm: Deep Weighted Average Ensemble-based Framework (DWAEF). The features are linguistically motivated and are based on the presence of similes and metaphors in text and detecting the subtle shift in tone at a sentence’s structural level. The sarcasm-detection framework is based on deep learning methods such as GNN, BERT combined with Fuzzy Logic. The results demonstrate that the proposed method surpasses the outcomes of previous research that made use only of primitive features.

Reasons to accept:

-- good framing of the computational problem of sarcasm detection together with a complete section on background/motivation that introduces the reader to the issue
-- complex features that are linguistically motivated are exploited in the framework
-- refined combination of GNN + BERT + fuzzy logic used in ensemble
-- provided good examples, for a good understanding of the three conceptual block of the framework
-- well written paper, easy to read

Reasons to reject:

-- the related work on sarcasm is a bit narrow, some foundational work is completely missing
-- uncertainty about the dataset used in this research (how was it collected? In which period? Which keywords were used? Which data is used? What are the characteristic of the dataset, beside its genre and source?)
-- the fact that sarcasm is a language/culture dependent phenomenon is never mentioned in the paper (which is a fundamental characteristic)

Nanopublication comments:

Further comments:

-- In the first page, since the authors all belong to the same affiliation, its name could be written only once

-- technical details for the implementation of the models could be moved to the appendix

-- In section 5.1 the authors mention a dataset of 3000 similes used for pre-training models. More details or a footnote with a link/URL are needed.

-- the paper could be further revised for some typos (even though not in a preponderant measure)

-- the figures need to be adjusted for inclusivity (readability colorblind measures) --> https://www.ascb.org/science-news/how-to-make-scientific-figures-accessi...

-- overall it is a very good paper

Review #4 submitted on 16/Feb/2023

By Kyle Gorman ORCID logo

https://orcid.org/0000-0002-4233-6595

Review Details

Reviewer has chosen not to be Anonymous

Overall Impression: Average
Suggested Decision: Reject
Technical Quality of the paper: Average
Presentation: Weak
Reviewer`s confidence: Medium
Significance: Moderate significance
Background: Incomplete or inappropriate
Novelty: Clear novelty
Data availability: Not all used and produced data are FAIR and openly available in established data repositories; authors need to fix this
Length of the manuscript: The length of this manuscript is about right

Summary of paper in a few sentences:

I have reviewed the submission entitled "DWAEF: A Deep Weighted Average Ensemble Framework Harnessing Novel Indicators for Sarcasm Detection". I have no conflicts of interest to disclose and am happy to provide my honest assessment of the paper. The authors develop a new architecture for sarcasm detection, which they then implement and evaluate. There is some comparison to simpler, shallower methods applied to the same corpus, and these results suggest they have made major progress on the problem.

Reasons to accept:

This is a genuinely interesting take on a classic problem:
* The paper contains useful comparisons to earlier models, and achieves an empirical improvement over them.
* The paper tackles a relatively deep problem in language understanding.
* The paper has some clever thinking about detecting metaphor and simile.

Reasons to reject:

* The literature review is not systematic in a useful way. (See my notes below.)
* It is assumed the reader is highly familiar with fuzzy logic, and I strongly doubt many will be. I was simply unable to follow this section. The authors need to provide much more background to make this work for the average reader.
* Discussion of the ensembling method is very sketchy and unsourced. I have no clue how the ensembling procedure works from the description, nor do I have any hint how I'd go about learning more about it.
* Discussion of the basic facts of the data set, including even what language it is written in (!), are missing. I was unaware of the use of a split into training, development/validation, and test until validation loss curves were reported late in the paper. No information are provided about how it was annotated except by "four expert linguists": did they double-annotate and perform consensus? What's the interannotator agreement? This sort of information seems necessary to even consider that the authors' data set might be valid.
* There is no meaningful error analysis, so there are few clues how one might go about improving their system. Indeed, there are few future directions to be found.
I think most of these issues could be fixed on resubmission without a great deal of effort from the authors, and crucially without the need to rerun any experiments, probably. For this reason I have indicated "revise and resubmit" in my overall recommendation.

Nanopublication comments:

Further comments:

Below, I have some brief notes, mostly about style, for the paper.
The authors have a tendency to capitalize noun phrases which are not proper names (and thus should not be capitalized). I would submit that "DWAEF" is not a proper name (and thus should be written "deep weighted average ensemble-based framework"); certain "graph neural network" is not a proper name, and few people even explain what "BERT" stands for (it's obvious the authors decided it was called "BERT" before they decided what it stood for; the acronym being spelled out gives no real insights into what it is; in fact it's not "bidirectional" at all because it's not an RNN). "Fuzzy Logic" is absolutely not a proper name. Anyways, don't do that. "Python", the name of a programming language, is however a proper name, so it should be written in titlecase.

Usually "vs" is written with a period after.

I am not familiar with the use of "viz", but the authors write it with and without a following period; pick one style and use it throughout.

Most of the figures have text that is a lot smaller than the body. I find a lot of them hard to read. The text should be made to match (roughly) that of the body text, and ideally you'd also match the fonts too.

The literature review reports accuracy numbers from previous work. These numbers are only meaningful if the studies are working on exactly the same corpora, but they are not, and so they should be omitted. For instance the authors' [11] used a new, crowd-sourced corpus; whereas the authors' [7] used two annotators to build their own corpus. [11], which appears to be published later, doesn't mention [7]. One corpus is surely much easier than the other, and that, rather than the methods, explains much of the difference in accuracy numbers across studies.

The literature review does not have much of a logical structure. It is rather brief, which is unfortunate because there is a big literature here and the comparisons across data sets and methods are not common as they should be. I don't understand the distinctions drawn between "machine learning methods", "deep learning and transformer based methods" (transformers are a type of deep learning method, and both deep learning and transformers in particular are examples of machine learning), and "graph neural network based methods" (which are also arguably deep learning too, and certainly machine learning). Also the authors mention that some of the work in the "deep learning" category used SVMs (from the "machine learning" category).

Emoticons, smileys, etc. being labeled "pragmatic features" is a strange notion to me. Pragmatics is concerned with the role of context in linguistic discourse; smileys are not "context" in the relevant sense. I think the term you may want would be something like "paralinguistics", perhaps.

"#sarcasme": is this English or, say, French? I found that misspelling confusing. May just be a typo though.

"fallen short of the speaker's affection": this is not idiomatic English to me. To "fall short" is to disappoint someone, which is not a likely sarcastic interpretation of that expression.

What is the third type of metaphor discussed by study [20]? It is mentioned that this study has a third type, and that it is not handled in this paper. What is it, and why can't it be handled here?

The word called "separator" for subordinate classes is usually known as a "subordinating conjunction" in both traditional and modern grammar.

"Ensemble learning strategies combine multiple machine learning algorithms to produce poor predictive outcomes. These results are then fused together to generate more accurate solutions.": This is a strange description of ensembling to me. The idea is not to find "poor learners", but to combine multiple models whose errors are no more than weakly correlated. What's written here has the unfortunate suggesting that the models that make up the ensemble ought to be bad.

Figure 2 should just be a table. There is a lot of (justified, IMO) hatred for the use of pie charts. In this case there is really very little information being conveyed here and it would be better in tabular form. It would also be better to have (in addition to percentages), raw counts of the data.

"viz, simile... is this a typo? I am not sure how to read this.

I recommend that large counting numbers like "2891" be written with comma separators. Without these, they are much harder to read and it is uncommon to see them written in this comma-free form in published text.

"Twitter is full of redundancy due to the rampant usage of slang, hashtags, emoticons, alterations in spelling, loose usage of punctuation, and so forth": I do not understand what these features of informal text have to do with is "slang" (a rather poorly defined notion in the first place) redundant? I found this an insightful discussion of what these social media features "mean" and how they ought to be handled: https://aclanthology.org/N13-1037/

"overdone": this seems like a value judgment that does not belong here.

"aforesaid": extremely archaic word choice, at least in my variety of English.

Could the authors put the list of intensifiers into an appendix? That way, this work can still be replicated when (as is certain) the Wikipedia page is edited in the future.

Same thing with interjections: that ought to be an appendix, I think.

"syntactical": ungrammatical for me, should just be "syntactic".

How was a GNN used to collect syntactic patterns? It seems like the author just ran an off-the-shelf parser. I would just say that.

I am not interested in the names of the "steps" on page 10. These are implementation details that have no bearing on the science here.

How were the authors able to confirm that the text in question was, say, monolingual English (as it appears to be assumed but never discussed)?

The series of equations on page 11 are meaningless because none of the terms have been defined yet. It might be useful to define them first (instead of later).

"upto": should be "up to".

It is a little difficult to understand how the BERT-based metaphor detection system works. I don't think I could recreate it from this description. Was fine-tuning used or not? I am not sure. The description should be made much more clear.

The list of clause separators (which are a mix of prepositions and conjunctions) are given in Python form on page 14; just write it like normal text, without the [ and ]. Or, move it to the appendix.

"moulded": I am not sure what this is supposed to mean here.

I do not have enough context to understand the use of fuzzy logic as described on the bottom of page 14. It seems to assume more knowledge of this software than the reader is likely to have. For instance, I haven't the foggiest idea what a "trapezoidal membership function" would mean, nor a "non-polygonal fuzzy set". I also don't follow the fuzzy rules on page 15 because once again, I don't know as much as the authors about fuzzy logic, which is not a major part of NLP research in general; it requires extensive introduction if it's an important idea.

I don't know what a "Dirichlet ensemble object" ("Dirichlet" is a person's name so it's capitalized, the rest is not a proper noun though so it should be in lowercase) is, so I can't make sense of the ensembling. No discussion or citations are provided.

It is claimed on page 15 that the model is neither undertrained nor overfit. As far as I can tell I was never told about any dev/validation or testing set though, nor do I know how large they are, so I can't really assess that.

Figure 9 is superfluous. Yes, the accuracy goes up and the loss goes down. Of course it does. Just report the final test accuracy, the only statistic that actually matters. These loss and accuracy curves are useful for the developer, but not useful for the reader who is trying to evaluate your proposal.

In table 5 the hyperparameters are reported like "avg_pooling": write that in English ("average pooling"), not code.

"drop out": this is traditionally called "dropout", one word.

"learning rate reduce factor": I don't know what this is and no context is given.

The authors give no error analysis. However, they do give a few correctly classified examples. I would prefer some useful error analysis instead.

The bibliography is full of typos, punctuation errors, and capitalization errors. Unfortunately, one cannot simply use the auto-generated bibliography entries without a bit of editing.

2 Comments

Study data

Submitted by Jodi Schneider on Sun, 01/29/2023 - 15:49

The journal received this link I from the authors containing
the data of the study:
https://docs.google.com/spreadsheets/d/1_2XQja9_Vpvzan9YQT3gNePJ4u10nYWchqFtlNpmMfU/edit?usp=sharing

Meta-Review by Editor

Submitted by Tobias Kuhn on Wed, 03/08/2023 - 03:24

Please carefully read the 4 reviews and take those into consideration if you resubmit.

Jodi Schneider (https://orcid.org/0000-0002-5098-5667)

Data Science

DWAEF: A Deep Weighted Average Ensemble Framework Harnessing Novel Indicators for Sarcasm Detection

Tracking #: 740-1720

Authors:

Responsible editor:

Submission Type:

Abstract:

Manuscript:

Tags:

Data repository URLs:

Date of Submission:

Date of Decision:

Decision:

2 Comments

Study data

Meta-Review by Editor