Reviewer has chosen not to be Anonymous
Overall Impression: Good
Suggested Decision: Undecided
Technical Quality of the paper: Weak
Presentation: Good
Reviewer's confidence: Medium
Significance: Moderate significance
Background: Reasonable
Novelty: Limited novelty
Data availability: Not all used and produced data are FAIR and openly available in established data repositories; authors need to fix this
Length of the manuscript: The length of this manuscript is about right
Summary of paper in a few sentences:
The paper presents an architecture and a prototype of a system for sharing and reusing models and data from scientific experiments in the chemical kinetics domain. The paper touches upon some issues in aggregating, managing and processing this data. It also mentions possible extensions to other domains and the automated integration of even more of the gathered experimental data, without going into details - I understand that both of these aspects are part of the future work.
Reasons to accept:
The paper is well written, the figures and the screenshots are readable and help to present the material.
The paper deals with the important problem of reusing experimental results and is interdisciplinary, as the authors represent both the data/software engineering and chemical engineering communities.
It also provides a nice introduction to the domain, accessible to a person who is not an expert in chemical kinetics.
Reasons to reject:
I would encourage the authors to clarify the main contribution. The requirements and the architecture do not qualify as a contribution strong enough for a journal paper, as 1) they have already been presented at SAVE-SD 2018 and 2) the design of an architecture does not mean it will work in practice. I would encourage the authors to focus on the "first version of the system" being presented and to limit the discussion of "extension to other domains and new requirements" to the discussion section, as neither has been tested in practice. Right now the paper mentions them several times, but discussing the problems does not solve them...
A possible reason to reject would be that neither the data (or samples of data) nor the software (prototype) presented in the paper are available online. I leave it up to the editors to decide if this is a reason for rejection. For instance, when the authors describe supplementary materials on p. 4, examples of such materials could be given. The same for the "experiment consists of a set of conditions, ... and output variables" - including an example would help a lot!
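To illustrate the kind of example I have in mind, a minimal sketch of such an experiment record is given below; the field names and values are purely hypothetical and are not taken from the paper.

```python
# Purely hypothetical sketch of an experiment record: a set of conditions
# plus the measured output variables (names and values invented for illustration).
experiment = {
    "experiment_type": "ignition delay time",
    "conditions": {
        "fuel": "CH4",
        "oxidizer": "air",
        "equivalence_ratio": 1.0,
        "pressure_atm": 10.0,
        "temperature_K": [1200, 1300, 1400],
    },
    "output_variables": {
        # one measured value per temperature point above
        "ignition_delay_us": [850, 320, 140],
    },
    "source": "bibliographic reference / DOI of the originating publication",
}
```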
It is not the strongest reason for rejection, but before this manuscript is accepted, the authors should clarify the title and the use of terms. [1, 2] use "scholarly data" to mean "the data describing the publications, authors, etc.". This paper deals with "experimental data from the chemical kinetics domain" or "scientific data". Therefore, the title "scholarly data analysis to aid ..." is misleading, as no such analysis happens at the moment (it is just mentioned as a possible source of new data). Such analysis, if already performed, is non-trivial and could constitute a contribution per se. For instance, on p. 17, Section 5.3: "can be retrieved automatically from external sources". Can be? Or "is retrieved"? From which sources?
[1] F. Xia, W. Wang, T. M. Bekele and H. Liu, "Big Scholarly Data: A Survey," IEEE Transactions on Big Data, vol. 3, no. 1, pp. 18-35, March 2017. doi: 10.1109/TBDATA.2016.2641460
[2] http://www.scholarlydata.org/
Nanopublication comments:
Further comments:
1. Please give concrete examples when mentioning other domains besides "chemical kinetics": which domains, and which data will be stored and managed? If you do not have anything concrete at this stage, it is better to just say "we are going to look into extending our system to work with domains A, B, C". Right now the reader might be misled by claims like "can be easily generalized to other domains".
The same applies on p. 6: "The need of a continuous validation of models based on new experiments is shared among most scientific fields, and therefore the activities of acquiring, analyzing and evaluating models and experiments is certainly shared with other scientific domains" - consider giving a couple of examples to be specific.
2. p. 3, one of the contributions: "A data-based and service-based architecture for such a system is proposed and discussed, providing a data model to support data integration and the development of a set of data curation and analysis services." - please be more concrete when describing the data integration in the paper. For instance, in Fig. 1 it is not clear which steps are covered by the system, which are automated, and which are performed manually. Elaborating on these aspects and relating the authors' contribution to this general model refinement procedure would help to clarify the benefits of the presented approach.
3. In the related work section a good analysis of scholarly data and chemical kinetics experimental data is presented. However, the work also seems very relevant to Google Dataset Search (https://toolbox.google.com/datasetsearch), Figshare and other data repositories, as well as to platforms for sharing code/experiments (MyExperiment, Gigantum, Code Ocean). I would strongly recommend relating the solution to those platforms in the Related Work section.
4. Sections 4 and 5 lack a description of what exactly is done (or not done) by the described prototype and within the approach. For instance, "In other cases new experiments could be directly extracted by domain experts and inserted manually" - it is not clear whether this is the main way of filling the repository or only a theoretical possibility. In the former case, how often do experts add experiments manually, and can they really cope with the "big data" announced earlier? Is it scalable? For the automated integration of structured formats - which repositories are already connected and continuously provide new experiments for the repository? How many experiments are added every year/month/...?
Section 4.1.2 describes a lot of difficulties in matching experimental data with different parameters and levels of granularity. However, it is not clear what is managed by the system at the moment. The sentence in 4.1.3 about "handling volumes of data largely beyond those manageable manually" hints that the problem is solved to a certain extent, and the screenshots in Figs. 3-6 suggest this too.
Section 4.1.4, on managing changes in the models, describes a very interesting problem of reusing existing experimental results to predict new ones, or of obtaining new results by re-running only some parts of the experiments. It is not clear, though, which use cases are supported at the moment and which limitations the system still has.
On p. 17, Section 5.4, it is not clear who performs the manual interventions (if anyone). Or is this one more "possibility" for the system rather than an implemented feature?
5. Section 6 should correspond to Sections 4-5. Right now it describes "continuous data integration", which is only explained at a high level in Sections 4 and 5. The same concern applies to the discussion of using domain ontologies vs. OLAP - it would be nice to see in Sections 4 and 5 how and what is used; then such a choice becomes clearer.
In the description of the scientific model (p. 19), and also earlier, the authors mention 10^5-10^6 experimental parameters. It is not clear how the user can then search with so many parameters (Fig. 3). Could you please explain this better?
6. To summarize, I would suggest that the authors focus on what is done and what can be done soon, and present those. Cases where no implementation was done should only be mentioned, if important, in the future work or discussion sections. In particular, the dilemma of "scholarly data" vs "experimental data" should be resolved. Right now the paper leaves the impression that "scholarly data could be used to get more experimental data for the repository", but it is not clear from the paper whether this is the reality in the current version of the system. I would suggest shortening Section 6 and other places where it is written "what could be used" and focusing on "what IS used".
***Minor comments / language errors***
- "a first prorotype" - p.3
- "indipendent" - p.6
- "a simulation software" - p.9
- "In general, it does not exist a pre-defined" - p.10
- Opensmoke vs OpenSMOKE++ - capitalization
- p. 16 "stred" - "stored"?
- p.16 "one respect to the others"
- p. 16 "the lacks"
- p.17 "composed by" - "contains"
Meta-Review by Editor
Submitted by Tobias Kuhn on
We have received three complete and interesting reviews that should help you in expanding the paper and fixing the issues identified in it. I ask you to consider all the concerns raised by all the reviewers, since they are all very sensitive and important. A major concern, which has been raised by all, is the lack of availability of the data and of the software described in the paper, which is not acceptable for a publication in Data Science. I would strongly suggest making them available online, following the FAIR principles and also the guidelines used in the ISWC Resource Track (the most recent ones are available at https://iswc2019.semanticweb.org/call-for-resources-track-papers/). For the latter, see in particular the section related to availability. Data (e.g. by publishing them in Figshare or Zenodo) and software (e.g. by using the GitHub+Zenodo feature for assigning DOIs to code) should be appropriately cited in the paper; see https://peerj.com/articles/cs-1/ and https://peerj.com/articles/cs-86/ for extensive discussion on the topic.
Silvio Peroni (https://orcid.org/0000-0003-0530-4305)