We thank the reviewers for the time and effort they invested in the review of our manuscript and for their helpful comments and suggestions. We addressed the raised concerns in the revised manuscript and highlighted the major changes in blue. In particular, a major new piece of work was added as a new Section (Sec. 6), which contains an evaluation of seven alternative approaches for classifying primary studies on a gold standard of 70 papers.
“(1) for the evaluation of the primary study classification, the dataset (25 papers) is to small to draw a conclusion…” “(2) the authors should compare their method with other possible approaches, e.g., LDA based approaches…”
We agree on both points. Following the suggestion, the paper now includes, in the added Section 6, an evaluation of several approaches for assigning research topics to primary studies. There we compare our approach with other baselines (including LDA and TF-IDF) on a gold standard composed of 70 papers annotated by 21 domain experts (3 for each paper).
“surely, most of the work (and skill) that goes into writing a good and useful SR goes into evaluating these papers, comparing and contrasting them, and summarizing the knowledge gleaned from this collection of content?”
We fully agree that the contribution of domain experts in these phases is critical for a systematic review and, to our knowledge, it cannot be substituted by any automatic approach. However, in current methodologies experts need to spend significant time identifying and classifying the primary studies, at the expense of the analysis phase. Our approach aims to address this specific issue by providing semi-automated support for the steps of keywording and data extraction, ultimately allowing the domain experts to focus primarily (as the reviewer commented) on the evaluation of the papers and on the analysis and extraction of the resulting knowledge.
“Secondly, the use case taken in the paper, Software Engineering, is very unconvincing. The keywords used, such as Service-based Architectures and Software Design are all very vague and seem to be overlapping.”
This is a rather generic comment, based on a personal impression, and it contradicts the well-established terminology adopted in software architecture (see, for example, Chapter 2 of the SWEBOK on Software Design: http://swebokwiki.org/Chapter_2:_Software_Design). We kindly disagree that the terms are vague or overlapping. For example, the two terms mentioned (‘service-oriented architecture’, not service-based architecture, and ‘software design’) have different meanings and denote distinct, well-established entities in the field of software architecture: the first is an architectural style, the second a stage (or artifact) of the software lifecycle.
“Lastly, Table 2 is intended to show that the tool, EDAM, delivers better than human inter-annotator agreement with a set of 6 annotators. To me, the main point of Table 2 is that in fact the human inter-annotator agreement in appalling.”
We kindly disagree with this assessment. These results are in line with several previous studies (e.g., Good et al., 2009; Névéol et al., 2010) on the agreement of users when annotating scientific publications with a set of pre-defined categories. To clarify this point, we revised the relevant parts and added a new evaluation and statistical analysis. The average agreement of the 6 annotators computed with Cohen's kappa is 0.57, which is typically classified as moderate inter-rater agreement (Landis and Koch, 1977). As suggested in the review feedback, we also ran a new and more comprehensive evaluation of the method, testing it on a gold standard composed of 70 papers annotated by 21 domain experts. In this evaluation, each paper could be tagged with multiple topics; the inter-annotator agreement was therefore lower (0.45), but still classified as ‘moderate agreement’.
“I am excited by the premise and careful writing in the introduction, but disappointed by what EDAM and the tool actually deliver to what I was quite excited about learning about, an automated SR authoring tool.”
From this rather general comment, we deduce that some parts of the original manuscript may have been misleading in this respect. We have carefully revised the manuscript to clarify the scope of our work. As announced in the title, we propose an approach for “reducing the effort for systematic reviews in Software Engineering”, not an authoring tool. In particular, as stated in the abstract, the goal of our work is “to introduce a novel methodology that reduces the amount of manual tedious tasks involved in SRs while taking advantage of the value provided by human expertise” by replacing “the steps of keywording and data extraction with an automatic methodology for generating a domain ontology and classifying the primary studies.” An automated SR authoring tool is outside the scope of this work.
“There is no evaluation of the usability of the tool for interacting with the ontology. The paper mentions that the experts were able to modify the ontology, but does not say anything about how usable the Excel-based editing process was from the point of view of experts.”
The annotators reported that they were able to easily correct and suggest changes in the ontology using the spreadsheet. This task was natural to them since the same kind of spreadsheet is typically used in the analysis phase of systematic reviews (e.g., for the keywording step). We clarified this in Section 4.1 (step 4).
“The paper does not explain the ontology learning approach used. It references an existing method, Klink-2, that was developed by the same authors. To make the paper self-contained it would be good to have a concise description of how this ontology learning algorithm works.”
We agree, and we have added to Section 4.2 a description of the algorithm together with its pseudocode.
“It is not clear what language can be used to specify filter criteria. It is clear from the paper that the filtering is based on matching concepts in the ontology over the set of papers. However, it is not clear if one can use operators such as NOT, AND, OR etc.”
Typically, in a scientific paper presenting a “manual” systematic review, the Software Architecture community expresses the search string using simple logic constructs, which support NOT, AND, and OR. The search string is then expressed in a specific query language, such as SQL, depending on the requirements of the study. We clarified this in Section 4.1 (step 5).
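As an illustration of the kind of filtering we mean, boolean criteria can be evaluated over the ontology topics assigned to each primary study. The following is a minimal sketch with invented paper and topic names (not the actual tool implementation):

```python
# Illustrative sketch: filtering primary studies, each annotated with a set
# of ontology topics, using simple boolean criteria (AND, OR, NOT).
# Paper identifiers and topics are invented for this example.

papers = {
    "paper_1": {"software architecture", "service-oriented architecture"},
    "paper_2": {"software design", "model checking"},
    "paper_3": {"service-oriented architecture", "software design"},
}

def matches(topics, include_all=(), include_any=(), exclude=()):
    """True when the topic set satisfies all boolean criteria."""
    return (all(t in topics for t in include_all)
            and (not include_any or any(t in topics for t in include_any))
            and not any(t in topics for t in exclude))

# e.g. ("service-oriented architecture" AND "software design") NOT "model checking"
selected = [p for p, topics in papers.items()
            if matches(topics,
                       include_all={"service-oriented architecture",
                                    "software design"},
                       exclude={"model checking"})]
print(selected)  # → ['paper_3']
```

The same predicate structure can be translated into an SQL WHERE clause when the annotations are stored in a database.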
“Many measures for quantifying inter-annotator agreement such as Kappa take this into account.”
Thanks for the suggestion. The new version of the paper includes the average Cohen's kappa of the annotators (0.57); the Cohen's kappa of each pair of reviewers was also added to the online material. In addition, we conducted a more comprehensive evaluation involving 70 papers, each annotated by three domain experts, which yielded a kappa of 0.45 (indicating moderate inter-rater agreement).
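For clarity, Cohen's kappa corrects the observed agreement between two annotators for the agreement expected by chance. A minimal sketch of the computation (with invented labels, not our annotation data):

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators assigning one category per item."""
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's marginal frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / n ** 2
    return (observed - expected) / (1 - expected)

# Invented example labels for two annotators over five papers.
a = ["design", "testing", "design", "design", "maintenance"]
b = ["design", "testing", "testing", "design", "maintenance"]
print(cohens_kappa(a, b))
```

On the Landis and Koch scale, values in 0.41–0.60 are conventionally read as ‘moderate agreement’, which is how we report 0.57 and 0.45 in the paper.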
Thanks for flagging these issues, we addressed all of them.
Good, B.M., Tennis, J.T., Wilkinson, M.D.: Social tagging in the life sciences: characterizing a new metadata resource for bioinformatics. BMC Bioinformatics 10(1), 313 (2009).
Névéol, A., Doğan, R.I., Lu, Z.: Author keywords in biomedical journal articles. In: AMIA Annual Symposium Proceedings, vol. 2010, p. 537. American Medical Informatics Association (2010).
Landis, J.R., Koch, G.G.: The measurement of observer agreement for categorical data. Biometrics 33, 159–174 (1977).