Reviewer has chosen not to be Anonymous
Overall Impression: Accept
Technical Quality of the paper: Clear novelty
Data availability: Not all used and produced data are FAIR and openly available in established data repositories; authors need to fix this
Length of the manuscript: The authors need to elaborate more on certain aspects and the manuscript should therefore be extended (if the general length limit is already reached, I urge the editor to allow for an exception)
Summary of paper in a few sentences:
This paper presents a new methodology that the authors call EDAM (Expert-Driven Automatic Methodology) that supports the process of generating systematic reviews. The method is applied to the domain of software engineering in the paper, but is clearly applicable beyond this domain.
The method attempts to automatize some steps of the systematic review process for two reasons: i) to reduce the burden of manual work involved in selecting relevant primary papers, and ii) to make the selection process more objective. The paper proposes an ontology-based approach that supports the steps of i) selection of relevant papers, ii) keywording, and iii) creation of a classification schema. For this purpose, the authors build on an existing ontology learning method that has been specifically designed for learning topic hierarchies in the scientific domain. The method proposed in the paper is briefly evaluated on a case study in software engineering. Further, the accuracy of the approach in classifying the primary literature is evaluated by comparing it to a number of annotators.
Reasons to accept:
The paper tackles an important problem, that is the problem of supporting the process of generating systematic reviews. It proposes to automatize some parts of the task, in particular the one of selecting relevant papers, filtering them down and classifying the papers into topics.
The method is based on an ontology learning approach that generates a hierarchy of relevant concepts starting from a seed term. Experts can then interact with the ontology that is displayed in terms of a tree diagram in Excel. The fact that experts can directly refine the ontology is a very positive aspect of the method.
The method is evaluated on a use case in software engineering and limitations are discussed. The methodology for evaluation is sound, empirically evaluating the classification step of the approach by comparing automatically generated annotations to those of a number of annotators, some of whom are experts in the domain. The paper describes some typical case studies that the methodology can support as well as limitations (Section 5.2).
Overall, the proposed methodology is novel and sound.
Reasons to reject:
No major reason to reject the paper, see below some comments to improve the paper.
There is no evaluation of the usability of the tool for interacting with the ontology. The paper mentions that the experts were able to modify the ontology, but does not say anything about how usable the Excel-based editing process was from the point of view of experts.
The paper does not explain the ontology learning approach used. It references an existing method, Klink-2, that was developed by the same authors. To make the paper self-contained it would be good to have a concise description of how this ontology learning algorithm works.
It is not clear what language can be used to specify filter criteria. It is clear from the paper that the filtering is based on matching concepts in the ontology over the set of papers. However, it is not clear if one can use operators such as NOT, AND, OR etc. It is not clear if there is a formal language in which a set of matching criteria can be defined.
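To make the point concrete: a sketch of what such a formal filter language could look like, with boolean operators over the set of topics matched against a paper. All names and the expression encoding here are my assumptions for illustration, not EDAM's actual interface.

```python
# Hypothetical sketch of a filter language over ontology concepts.
# An expression is a nested tuple: ("topic", name), ("not", e),
# ("and", e1, e2, ...), or ("or", e1, e2, ...).
def matches(paper_topics, expr):
    """Evaluate a filter expression against the set of topics found in a paper."""
    op = expr[0]
    if op == "topic":
        return expr[1] in paper_topics
    if op == "not":
        return not matches(paper_topics, expr[1])
    if op == "and":
        return all(matches(paper_topics, e) for e in expr[1:])
    if op == "or":
        return any(matches(paper_topics, e) for e in expr[1:])
    raise ValueError(f"unknown operator: {op}")

# e.g. papers on software engineering but not requirements engineering
query = ("and", ("topic", "software engineering"),
                ("not", ("topic", "requirements engineering")))
```

Stating whether such a composable query language exists, or whether criteria are limited to flat concept matching, would clarify the expressiveness of the filtering step.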
The authors mention that any method can be used to map papers into concepts / topics, even a learned classifier. However, it seems that in this paper they rely on a straightforward approach that annotates a paper with a concept from the ontology if that concept or any of its subconcepts appears in the paper. This should be made clear in the evaluation section where the annotations are compared to the set of annotations by experts.
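As I understand it, the labelling rule amounts to something like the following sketch (the ontology encoding and matching by substring are my simplifying assumptions):

```python
# Sketch of the straightforward annotation rule described above: a paper is
# labelled with a concept if the concept or any of its subconcepts occurs
# in the paper's text. The ontology here is a hypothetical toy example.
subconcepts = {
    "machine learning": ["neural networks", "decision trees"],
    "neural networks": [],
    "decision trees": [],
}

def descendants(concept):
    """All subconcepts reachable from `concept`, including the concept itself."""
    out = [concept]
    for child in subconcepts.get(concept, []):
        out.extend(descendants(child))
    return out

def annotate(text, concepts):
    """Return the set of concepts whose descendants appear in the text."""
    text = text.lower()
    return {c for c in concepts
            if any(d in text for d in descendants(c))}
```

Making explicit in the paper that this simple rule (rather than a learned classifier) produced the annotations used in the evaluation would help readers interpret the reported accuracy.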
In this same section, the authors compute agreement in terms of overlap, but ignore that agreement can also occur by chance. Many measures for quantifying inter-annotator agreement, such as Cohen's Kappa, take this into account. I wonder why the authors have not considered using such measures to quantify agreement.
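For reference, chance-corrected agreement for two annotators making binary relevant/irrelevant judgements per paper can be computed as follows (a minimal textbook sketch of Cohen's Kappa, not the authors' code):

```python
# Minimal sketch of Cohen's Kappa for two annotators' label sequences.
def cohen_kappa(a, b):
    """Chance-corrected agreement between two equal-length label lists."""
    assert len(a) == len(b) and len(a) > 0
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n  # observed agreement
    # expected chance agreement, from each annotator's label marginals
    p_e = sum((a.count(c) / n) * (b.count(c) / n) for c in set(a) | set(b))
    return 1.0 if p_e == 1 else (p_o - p_e) / (1 - p_e)
```

Unlike raw overlap, this returns 0 when the observed agreement is exactly what the annotators' label frequencies would produce by chance.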
Page 3: aimed to at improving -> aimed at
Page 4: by both  and  -> bad style, do not use references as words
Page 8: a Semantic Web ontology of 58 topics*s*
Page 18: limitations based on the categorization given in  -> do not use references as words
Page 19: requires human expertize -> expertise
Page 20: performance of entity extraction and linking tooks -> tools
Two comments on the title of the paper: as the methodology is not specific to software engineering, I wonder why the authors do not simply choose the title "Reducing the effort for systematic reviews" without the domain qualification. Instead, the authors could add qualifiers to the title that specify which parts of the SR process their method supports. When I first read the title, I did not have a clear idea what aspects of an SR could be supported. This is clarified only later in the introduction, but could be made more specific in the title of the paper.
A second point: I missed the qualifier "supporting systematic reviews" in the name of the methodology. "Expert-Driven Automatic Methodology" does not say anything about the fact that the methodology is supposed to support SRs. This is odd.