We thank the reviewers for the time and effort they invested in the review of our manuscript and for their helpful comments and suggestions. We addressed the raised concerns in the revised manuscript and highlighted the major changes in blue. In particular, a major new piece of work was added as a new Section (Sec. 6), which contains an evaluation of seven alternative approaches for classifying primary studies on a gold standard of 70 papers.
“(1) for the evaluation of the primary study classification, the dataset (25 papers) is to small to draw a conclusion…” “(2) the authors should compare their method with other possible approaches, e.g., LDA based approaches…”
We agree on both points. Following the suggestion, the paper now includes, in the added Section 6, an evaluation of several approaches for assigning research topics to primary studies. There we compare our approach with other baselines (including LDA and TF-IDF) on a gold standard composed of 70 papers annotated by 21 domain experts (3 for each paper).
“surely, most of the work (and skill) that goes into writing a good and useful SR goes into evaluating these papers, comparing and contrasting them, and summarizing the knowledge gleaned from this collection of content?”
We fully agree that the contribution of domain experts in these phases is critical for a systematic review and, to our knowledge, it cannot be substituted by any automatic approach. However, in current methodologies experts need to spend significant time identifying and classifying the primary studies, at the expense of the analysis phase. Our approach aims to address this specific issue by providing semi-automated support for the steps of keywording and data extraction, ultimately allowing the domain experts to focus primarily (as the reviewer commented) on the evaluation of the papers and on the analysis and extraction of the resulting knowledge.
“Secondly, the use case taken in the paper, Software Engineering, is very unconvincing. The keywords used, such as Service-based Architectures and Software Design are all very vague and seem to be overlapping.”
This is a rather generic comment, based on a personal impression, and it contradicts the well-established terminology adopted in software architecture (see, for example, Chapter 2 of the SWEBOK on Software Design: http://swebokwiki.org/Chapter_2:_Software_Design). We kindly disagree that the terms are vague or overlapping. For example, the two terms mentioned (‘service-oriented architecture’, not service-based architecture, and ‘software design’) have different meanings and denote distinct, well-established entities in the field of software architecture: the first is an architectural style, the second a stage (or artifact) of the software lifecycle.
“Lastly, Table 2 is intended to show that the tool, EDAM, delivers better than human inter-annotator agreement with a set of 6 annotators. To me, the main point of Table 2 is that in fact the human inter-annotator agreement in appalling.”
We kindly disagree with this assessment. These results are in line with several previous studies (e.g., Good et al., 2009; Névéol et al., 2010) on the agreement of users when annotating scientific publications with a set of pre-defined categories. To clarify this point, we revised the relevant parts and added a new evaluation and statistical analysis. The average agreement of the 6 annotators computed with Cohen's kappa is 0.57, which is typically classified as moderate inter-rater agreement (Landis and Koch, 1977). As suggested in the review feedback, we also ran a new and more comprehensive evaluation of the method, testing it on a gold standard composed of 70 papers annotated by 21 domain experts. In this evaluation, each paper could be tagged with multiple topics; the inter-annotator agreement was therefore lower (0.45), but still classified as ‘moderate agreement’.
“I am excited by the premise and careful writing in the introduction, but disappointed by what EDAM and the tool actually deliver to what I was quite excited about learning about, an automated SR authoring tool.”
From this rather general comment, we deduce that some parts of the original manuscript may have been misleading in this respect. We have carefully revised the manuscript to clarify the scope of our work. As announced in the title, we propose an approach for “reducing the effort for systematic reviews in Software Engineering”, not an authoring tool. In particular, as stated in the abstract, the goal of our work is “to introduce a novel methodology that reduces the amount of manual tedious tasks involved in SRs while taking advantage of the value provided by human expertise” by replacing “the steps of keywording and data extraction with an automatic methodology for generating a domain ontology and classifying the primary studies.” An automated SR authoring tool is outside the scope of this work.
“There is no evaluation of the usability of the tool for interacting with the ontology. The paper mentions that the experts were able to modify the ontology, but does not say anything about how usable the Excel-based editing process was from the point of view of experts.”
The annotators reported that they were able to easily correct and suggest changes in the ontology using the spreadsheet. This task was natural to them since the same kind of spreadsheet is typically used in the analysis phase of systematic reviews (e.g., for the keywording step). We clarified this in Section 4.1 (step 4).
“The paper does not explain the ontology learning approach used. It references an existing method, Klink-2, that was developed by the same authors. To make the paper self-contained it would be good to have a concise description of how this ontology learning algorithm works.”
We agree, and we have added to Section 4.2 a description of the algorithm together with its pseudocode.
“It is not clear what language can be used to specify filter criteria. It is clear from the paper that the filtering is based on matching concepts in the ontology over the set of papers. However, it is not clear if one can use operators such as NOT, AND, OR etc.”
Typically, in a scientific paper presenting a “manual” systematic review, the Software Architecture community expresses the search string using simple logic constructs, which support NOT, AND, and OR. The search string is then expressed in a specific query language, such as SQL, depending on the requirements of the study. We clarified this in Section 4.1 (step 5).
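As an illustration of the kind of filtering we mean, boolean criteria can be evaluated over the ontology topics assigned to each primary study. The following is a minimal sketch with invented paper and topic names (not the actual tool implementation):

```python
# Illustrative sketch: filtering primary studies, each annotated with a set
# of ontology topics, using simple boolean criteria (AND, OR, NOT).
# Paper identifiers and topics are invented for this example.

papers = {
    "paper_1": {"software architecture", "service-oriented architecture"},
    "paper_2": {"software design", "model checking"},
    "paper_3": {"service-oriented architecture", "software design"},
}

def matches(topics, include_all=(), include_any=(), exclude=()):
    """True when the topic set satisfies all boolean criteria."""
    return (all(t in topics for t in include_all)
            and (not include_any or any(t in topics for t in include_any))
            and not any(t in topics for t in exclude))

# e.g. ("service-oriented architecture" AND "software design") NOT "model checking"
selected = [p for p, topics in papers.items()
            if matches(topics,
                       include_all={"service-oriented architecture",
                                    "software design"},
                       exclude={"model checking"})]
print(selected)  # → ['paper_3']
```

The same predicate structure can be translated into an SQL WHERE clause when the annotations are stored in a database.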
“Many measures for quantifying inter-annotator agreement such as Kappa take this into account.”
Thanks for the suggestion. The new version of the paper includes the average Cohen's kappa of the annotators (0.57); the Cohen's kappa of each pair of reviewers was also added to the online material. In addition, we conducted a more comprehensive evaluation involving 70 papers, each annotated by three domain experts, which yielded a kappa of 0.45 (indicating moderate inter-rater agreement).
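For clarity, Cohen's kappa corrects the observed agreement between two annotators for the agreement expected by chance. A minimal sketch of the computation (with invented labels, not our annotation data):

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators assigning one category per item."""
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's marginal frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / n ** 2
    return (observed - expected) / (1 - expected)

# Invented example labels for two annotators over five papers.
a = ["design", "testing", "design", "design", "maintenance"]
b = ["design", "testing", "testing", "design", "maintenance"]
print(cohens_kappa(a, b))
```

On the Landis and Koch scale, values in 0.41–0.60 are conventionally read as ‘moderate agreement’, which is how we report 0.57 and 0.45 in the paper.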
Thanks for flagging these issues, we addressed all of them.
Good, B.M., Tennis, J.T., Wilkinson, M.D.: Social tagging in the life sciences: characterizing a new metadata resource for bioinformatics. BMC Bioinformatics 10(1), 313 (2009).
Névéol, A., Doğan, R.I., Lu, Z.: Author keywords in biomedical journal articles. In: AMIA Annual Symposium Proceedings, vol. 2010, p. 537. American Medical Informatics Association (2010).
Landis, J.R., Koch, G.G.: The measurement of observer agreement for categorical data. Biometrics 33, 159–174 (1977).