Reviewer has chosen not to be Anonymous
Overall Impression: Average
Suggested Decision: Reject
Technical Quality of the paper: Average
Presentation: Weak
Reviewer's confidence: Medium
Significance: Moderate significance
Background: Incomplete or inappropriate
Novelty: Clear novelty
Data availability: Not all used and produced data are FAIR and openly available in established data repositories; authors need to fix this
Length of the manuscript: The length of this manuscript is about right
Summary of paper in a few sentences:
I have reviewed the submission entitled "DWAEF: A Deep Weighted Average Ensemble Framework Harnessing Novel Indicators for Sarcasm Detection". I have no conflicts of interest to disclose and am happy to provide my honest assessment of the paper. The authors develop a new architecture for sarcasm detection, which they then implement and evaluate. There is some comparison to simpler, shallower methods applied to the same corpus, and these results suggest they have made major progress on the problem.
Reasons to accept:
This is a genuinely interesting take on a classic problem:
* The paper contains useful comparisons to earlier models, and achieves an empirical improvement over them.
* The paper tackles a relatively deep problem in language understanding.
* The paper has some clever thinking about detecting metaphor and simile.
Reasons to reject:
* The literature review is not systematic in a useful way. (See my notes below.)
* It is assumed the reader is highly familiar with fuzzy logic, and I strongly doubt many will be. I was simply unable to follow this section. The authors need to provide much more background to make this work for the average reader.
* Discussion of the ensembling method is very sketchy and unsourced. From the description, I have no clue how the ensembling procedure works, nor any hint of how I would go about learning more about it.
* Discussion of the basic facts of the data set, including even what language it is written in (!), is missing. I was unaware of the use of a split into training, development/validation, and test sets until validation loss curves were reported late in the paper. No information is provided about how the data were annotated except that it was done by "four expert linguists": did they double-annotate and adjudicate to consensus? What is the inter-annotator agreement? This sort of information seems necessary to even consider whether the authors' data set is valid.
* There is no meaningful error analysis, so there are few clues how one might go about improving their system. Indeed, there are few future directions to be found.
I think most of these issues could be fixed on resubmission without a great deal of effort from the authors, and, crucially, most likely without the need to rerun any experiments. For this reason I have indicated "revise and resubmit" in my overall recommendation.
Nanopublication comments:
Further comments:
Below, I have some brief notes, mostly about style, for the paper.
The authors have a tendency to capitalize noun phrases which are not proper names (and thus should not be capitalized). I would submit that the expansion of "DWAEF" is not a proper name (and thus should be written "deep weighted average ensemble-based framework"); certainly "graph neural network" is not a proper name; and few people even bother to explain what "BERT" stands for (it is fairly obvious that its creators decided it was called "BERT" before they decided what the letters stood for; spelling out the acronym gives no real insight into what it is, and in fact it is not "bidirectional" at all, because it is not an RNN). "Fuzzy logic" is absolutely not a proper name. In short, don't do that. "Python", the name of a programming language, is however a proper name, and so should be capitalized.
Usually "vs" is written with a period after.
I am not familiar with the use of "viz", but the authors write it with and without a following period; pick one style and use it throughout.
Most of the figures have text that is a lot smaller than the body text, and I find many of them hard to read. The figure text should roughly match the size of the body text, and ideally the fonts should match as well.
The literature review reports accuracy numbers from previous work. These numbers are only meaningful if the studies are working on exactly the same corpora, but they are not, and so they should be omitted. For instance, the authors' [11] used a new, crowd-sourced corpus, whereas the authors' [7] used two annotators to build their own corpus; [11], which appears to have been published later, does not mention [7]. One corpus is surely much easier than the other, and that, rather than the methods, explains much of the difference in accuracy numbers across studies.
The literature review does not have much of a logical structure. It is rather brief, which is unfortunate because there is a large literature here and comparisons across data sets and methods are not as common as they should be. I do not understand the distinctions drawn between "machine learning methods", "deep learning and transformer-based methods" (transformers are a type of deep learning method, and both deep learning in general and transformers in particular are examples of machine learning), and "graph neural network based methods" (which are also arguably deep learning, and certainly machine learning). The authors also mention that some of the work in the "deep learning" category used SVMs (from the "machine learning" category).
Labeling emoticons, smileys, etc. as "pragmatic features" is a strange notion to me. Pragmatics is concerned with the role of context in linguistic discourse; smileys are not "context" in the relevant sense. I think the term you want may be something like "paralinguistic features".
"#sarcasme": is this English or, say, French? I found that misspelling confusing. May just be a typo though.
"fallen short of the speaker's affection": this is not idiomatic English to me. To "fall short" is to disappoint someone, which is not a likely sarcastic interpretation of that expression.
What is the third type of metaphor discussed by study [20]? It is mentioned that this study has a third type, and that it is not handled in this paper. What is it, and why can't it be handled here?
The word the authors call a "separator" for subordinate clauses is usually known as a "subordinating conjunction" in both traditional and modern grammar.
"Ensemble learning strategies combine multiple machine learning algorithms to produce poor predictive outcomes. These results are then fused together to generate more accurate solutions.": This is a strange description of ensembling to me. The idea is not to find "poor learners", but to combine multiple models whose errors are no more than weakly correlated. What's written here has the unfortunate suggesting that the models that make up the ensemble ought to be bad.
Figure 2 should just be a table. There is a lot of (justified, IMO) hatred for the use of pie charts, and in this case very little information is being conveyed, so it would be better in tabular form. It would also be better to report the raw counts in addition to the percentages.
"viz, simile... is this a typo? I am not sure how to read this.
I recommend that large counting numbers like "2891" be written with comma separators ("2,891"). Without these they are harder to read, and it is uncommon to see numbers written comma-free in published text.
"Twitter is full of redundancy due to the rampant usage of slang, hashtags, emoticons, alterations in spelling, loose usage of punctuation, and so forth": I do not understand what these features of informal text have to do with is "slang" (a rather poorly defined notion in the first place) redundant? I found this an insightful discussion of what these social media features "mean" and how they ought to be handled: https://aclanthology.org/N13-1037/
"overdone": this seems like a value judgment that does not belong here.
"aforesaid": extremely archaic word choice, at least in my variety of English.
Could the authors put the list of intensifiers into an appendix? That way, this work can still be replicated when (as is certain) the Wikipedia page is edited in the future.
Same thing with interjections: that ought to be an appendix, I think.
"syntactical": ungrammatical for me, should just be "syntactic".
How was a GNN used to collect syntactic patterns? It seems like the authors just ran an off-the-shelf parser; if so, I would just say that.
I am not interested in the names of the "steps" on page 10. These are implementation details that have no bearing on the science here.
How were the authors able to confirm that the text in question was, say, monolingual English (as it appears to be assumed but never discussed)?
The series of equations on page 11 are meaningless because none of the terms have been defined yet. It might be useful to define them first (instead of later).
"upto": should be "up to".
It is a little difficult to understand how the BERT-based metaphor detection system works; I don't think I could recreate it from this description. Was fine-tuning used or not? I am not sure. The description should be made much clearer.
The list of clause separators (which are a mix of prepositions and conjunctions) is given in Python form on page 14; just write it as normal text, without the [ and ], or move it to the appendix.
"moulded": I am not sure what this is supposed to mean here.
I do not have enough context to understand the use of fuzzy logic as described at the bottom of page 14; it seems to assume more familiarity with this software than the reader is likely to have. For instance, I can only guess at what a "trapezoidal membership function" means, and I have no idea what a "non-polygonal fuzzy set" is. I also cannot follow the fuzzy rules on page 15 because, once again, I do not know as much as the authors about fuzzy logic, which is not a major part of NLP research in general; if it is an important part of this work, it requires extensive introduction.
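For what it is worth, my best guess (and it is only a guess; the paper should spell this out) is that a "trapezoidal membership function" is the standard trapezoid-shaped function determined by four breakpoints a <= b <= c <= d, roughly:

    def trapezoidal_membership(x, a, b, c, d):
        # Degree of membership of x in a fuzzy set with a trapezoid-shaped profile:
        # rises linearly from 0 to 1 on [a, b], stays at 1 on [b, c],
        # and falls linearly back to 0 on [c, d]. Assumes a <= b <= c <= d.
        if x <= a or x >= d:
            return 0.0
        if b <= x <= c:
            return 1.0
        if x < b:
            return (x - a) / (b - a)
        return (d - x) / (d - c)

If something like this is what the authors mean, saying so (and plotting the actual membership functions used) would make the fuzzy rules on page 15 much easier to follow.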
I don't know what a "Dirichlet ensemble object" is ("Dirichlet" is a surname and so is correctly capitalized; the rest is not a proper noun and should be lowercase), so I can't make sense of the ensembling. No discussion or citations are provided.
It is claimed on page 15 that the model is neither undertrained nor overfit. As far as I can tell I was never told about any dev/validation or testing set though, nor do I know how large they are, so I can't really assess that.
Figure 9 is superfluous. Yes, the accuracy goes up and the loss goes down. Of course it does. Just report the final test accuracy, the only statistic that actually matters. These loss and accuracy curves are useful for the developer, but not useful for the reader who is trying to evaluate your proposal.
In table 5 the hyperparameters are reported like "avg_pooling": write that in English ("average pooling"), not code.
"drop out": this is traditionally called "dropout", one word.
"learning rate reduce factor": I don't know what this is and no context is given.
The authors give no error analysis. However, they do give a few correctly classified examples. I would prefer some useful error analysis instead.
The bibliography is full of typos, punctuation errors, and capitalization errors. Unfortunately, one cannot simply use the auto-generated bibliography entries without a bit of editing.
2 Comments
Study data
Submitted by Jodi Schneider on
The journal received this link from the authors containing the data of the study: https://docs.google.com/spreadsheets/d/1_2XQja9_Vpvzan9YQT3gNePJ4u10nYWchqFtlNpmMfU/edit?usp=sharing
Meta-Review by Editor
Submitted by Tobias Kuhn on
Please carefully read the 4 reviews and take those into consideration if you resubmit.
Jodi Schneider (https://orcid.org/0000-0002-5098-5667)