Scholarly data analysis to aid scientific model development

Tracking #: 549-1529

Authors:

	Name	ORCID
	Gabriele Scalia	https://orcid.org/0000-0003-3305-9220
	Matteo Pelucchi	https://orcid.org/0000-0003-3106-0236
	Alessandro Stagni	https://orcid.org/0000-0003-4339-7872
	Alberto Cuoci	https://orcid.org/0000-0001-5653-0180
	Tiziano Faravelli	https://orcid.org/0000-0001-8382-7342
	Barbara Pernici	https://orcid.org/0000-0002-2034-9774

Responsible editor:

Silvio Peroni

Submission Type:

Research Paper

Abstract:

The sharing of scientific and scholarly data has been increasingly promoted over the last decade, leading to open repositories in many different scientific domains. However, data sharing and open data are not final goals by themselves, while the real benefit is in data reuse, which allows leveraging investments in research and enables large-scale data-driven research progresses. Focusing on reuse, this paper discusses the design of an integrated framework to automatically take advantage of large amounts of scholarly scientific data to support research, and in particular scientific model development. Scientific models reproduce and predict complex phenomena and their development is a rather challenging task, within which scientific experiments have a key role in their continuous validation. Starting from the chemical kinetics domain, this paper discusses a set of use cases and a first prototype for such a framework which lead to a set of functional requirements and an architecture that can easily be generalized to other domains. The paper analyzes the needs, the challenges and the research directions for such a framework, in particular those related to data management, automatic scientific model validation, data aggregation and data analysis, to leverage large amounts of scholarly data for new knowledge extraction.

Manuscript:

ds-paper-549.pdf

Revised Version:

Towards a scientific data framework to support scientific model development

Special issue (if applicable):

Special issue of Data Science, including a selection of extended papers from SAVE-SD 2017 and 2018

Data repository URLs:

None

Date of Submission:

Sunday, December 16, 2018

Date of Decision:

Wednesday, February 6, 2019

Nanopublication URLs:

Decision:

Undecided

Solicited Reviews:

Review #1 submitted on 13/Jan/2019

By Hanna Ćwiek-Kupczyńska

https://orcid.org/0000-0001-9113-567X

Review Details

Reviewer has chosen not to be Anonymous

Overall Impression: Good
Suggested Decision: Accept
Technical Quality of the paper: Good
Presentation: Average
Reviewer's confidence: Medium
Significance: High significance
Background: Reasonable
Novelty: Limited novelty
Data availability: All used and produced data (if any) are FAIR and openly available in established data repositories
Length of the manuscript: The length of this manuscript is about right

Summary of paper in a few sentences:

The paper discusses the design of a framework that implements the idea of scientific data reuse. The authors propose a system taking advantage of public datasets to improve scientific models and validate both datasets and the models in a continuous process. Requirements and a prototype of the proposed solution are demonstrated for data from the chemical kinetics domain, but the general considerations seem reasonable for other research domains.
The paper is an extension of work presented at the SAVE-SD workshop at The Web Conference 2018 that presents new requirements and improvements to the design of the system after testing the initial prototype. The authors present the domain-specific problem well, and they describe the general functionality of a system to integrate data and services in order to automate and improve the progress in the domain. The second prototype is partly implemented; its modular architecture is given in the paper along with screenshots for exemplary use cases.

Reasons to accept:

The paper proposes a fairly solid way to implement experimental data reuse. As the authors point out in their motivations, data reuse is an understudied problem. In view of the common focus on making data FAIR, a practical example of reuse is notable work to promote. The use case of chemical kinetics seems interesting and demanding. Some of the requirements formulated in the paper can be applied to a general case of experimental data reuse and tracking the development of scientific progress.

Reasons to reject:

The article needs some improvements, in particular to rethink the title*, and to explicitly provide the basic requirements for the system (from the previous paper) before formulating the additional ones.
The reading flow could be improved by some enhancements to the structure of the article. Initial sections could benefit from shortening and more strict distinction of the content between them, as currently they share some common descriptions.**
Some language corrections *** and other changes**** are recommended.

Nanopublication comments:

Further comments:

* Title and the use of “scholarly data”: I have some doubts about the adequacy of the title to the article content. Authors use the term “scholarly data” few times across the paper, without attracting reader’s attention with its meaning (at times it seems interchangeable with “scientific data” or “experimental data”, e.g. p.2 l.39 or p.6 l.33) and its role in the whole system until the last section, where the use of data from publications is explicitly mentioned (as one of the features of the system only). Perhaps the title should focus more on exposing the fact of scientific data reuse or the proposed system itself.

** Structure and content:
- Introduction – shorter and less details (perhaps the details of the chemical kinetics domain (p.2 l. 26 - p.3 l.7) can be skipped until section 2)
- Scenario – could benefit from clear distinction in parts about the general research area, and about the proposed approach with explicit assumptions about its functionality (i.e. naming the requirements, which can later be generalized and referred to while discussing solutions in section 3)
- Towards an integrated framework – Despite the reference to [44], it would be nice to remind the reader what the basic requirements were (in contrast to New Requirements discussed in section 5). Are the initial functional requirements the “use cases”?
- Proposed architecture – description of DW and OLAP systems repeats the text from section 3
- Concluding remarks (or other section) – It would be advantageous for these considerations to elaborate a bit more on the envisioned role for the domain-specific ontology and human actors (domain experts, operators, manual actions) in the whole system – which steps cannot be automatized?

*** Other improvements
- The information about the existence (yes or no?) of a domain-specific ontology is inconsistent (p.12 l.36, p. 16 l.33, p. 19 l.30).
- Figure 2 is not clear and doesn’t simplify understanding the whole process: the use of nodes, arrows and additional text around the nodes is inconsistent (physical objects, actions, or goals) and confusing.
- e.g.1. The central node (Exp. Data) is confusing – what does it stand for and where does “Compare” fit in the sequence?
- e.g.2. “Analysis tools (… uncertainty quantification)” node lies on alternative path to “Model reduction” while the textual description (p.4 l.39-42) says that uncertainty is a diverging point (“if the model shows … uncertainties, relevant pathways can be identified by analysis tools ...”)

**** Language – some examples:
- verbs in 3rd person (“leadS to” p.1 l.27, “requireS” p.7 l.2, “comeS from” p.7 l.7, “involveS” p.8 l.4, “concernS” p.10 l.21)
- typos and spelling (OpenSMOKE++ and alternatives, ReSpecTh alternatives, “Djangoi” p.12 l.24, “stred” p.16 l.23)
- use of definite / indefinite articles
- other: unnecessary symbols for temperature and pressure p.4 l.12, “industry industry” p.4 l.40, repeated “highlighted in the literature” p.1 l.39 and l.41, “one [variable functions]” p.4 l.32, “increasingly availability” p.6 l.32, “is not be limited” p.6 l.36, “indipendent” p.6 l.37, “as a meanS to” p.6 l.39, “it doesn’t exist” -> “there” p.10 l.3 and l.4, “a changes” p.10 l.43, “since from” p.12 l.11, “impacts on” p.16 l.19, “an automatically and manually querying” p.17 l.40

***** Some formal issues:
- Recommended capitalisation (title, headings) missing
- In-text citations and reference format other than suggested in the guidelines, citation starts with [32] which seems strange
- multiple citations of some papers for the same context, e.g. [6] or [17]

Review #2 submitted on 22/Jan/2019

By Aliaksandr Birukou

https://orcid.org/0000-0002-4925-9131

Review Details

Reviewer has chosen not to be Anonymous

Overall Impression: Good
Suggested Decision: Undecided
Technical Quality of the paper: Weak
Presentation: Good
Reviewer's confidence: Medium
Significance: Moderate significance
Background: Reasonable
Novelty: Limited novelty
Data availability: Not all used and produced data are FAIR and openly available in established data repositories; authors need to fix this
Length of the manuscript: The length of this manuscript is about right

Summary of paper in a few sentences:

The paper presents an architecture and a prototype of the system for sharing and reusing the models and the data of scientific experiments in the chemical kinetics domain. The paper touches upon some issues in aggregating, managing and processing this data. It also mentions possible extensions to other domains and automated integration of even more of the gathered experimental data without going into details - I understand that both these aspects are part of the future work.

Reasons to accept:

The paper is well written, the figures and the screenshots are readable and help to present the material.
The paper deals with an important problem of reusing experimental results and is interdisciplinary, as the authors represent both data/software engineering and chemical engineering communities.
It also provides a nice introduction to the domain, accessible for a person who is not an expert in chemical kinetics.

Reasons to reject:

I would encourage the authors to clarify the main contribution. The requirements and the architecture do not qualify as a good enough contribution, worthy of a journal paper, as 1) they have already been presented at SAVE-SD 2018 and 2) the design of the architecture does not mean it will work in practice. I would encourage the authors to go for the "first version of the system" being presented and limit the discussion of "extension to other domains and new requirements" to the discussion section, as both are not tested in practice. Right now the paper mentions them several times, but discussing the problems does not solve them...

A possible reason to reject would be that neither the data (or samples of data) nor the software (prototype) presented in the paper is available online. I leave it up to the editors to decide if it is a reason for rejection. For instance, when the authors describe supplementary materials on p.4, examples of such materials can be given. The same goes for the "experiment consists of a set of conditions, ... and output variables" - including an example would help a lot!

It is not the strongest reason for rejection, but before accepting this manuscript, the paper should clarify the title and the use of terms. [1, 2] use "scholarly data" as "the data describing the publications, authors, etc.". This paper deals with the "experimental data from the chemical kinetics domain" or "scientific data". Therefore, the title "scholarly data analysis to aid ..." is misleading, as no such analysis happens at the moment (it is just mentioned as a possible source of new data). Such analysis, if already performed, is non-trivial and can constitute a contribution per se. For instance, on p.17, 5.3: "can be retrieved automatically from external sources". Can be? Or "is retrieved"? From which sources?

[1] F. Xia, W. Wang, T. M. Bekele and H. Liu, "Big Scholarly Data: A Survey," in IEEE Transactions on Big Data, vol. 3, no. 1, pp. 18-35, 1 March 2017.
doi: 10.1109/TBDATA.2016.2641460
[2] http://www.scholarlydata.org/

Nanopublication comments:

Further comments:

1. Please give concrete examples when mentioning other domains besides "chemical kinetics": which domains, which data will be stored and managed? If you do not have anything concrete at this stage, it is better to just say "we are going to look into extending our system to work with domains A, B, C". Right now the reader might be misled by claims like "can be easily generalized to other domains".

The same at p.6 "The need of a continuous validation of models based on new experiments is shared among most scientific fields, and therefore the activities of acquiring, analyzing and evaluating models and experiments is certainly shared with other scientific domains" - consider giving a couple of examples to be specific.

2. p.3, one of the contributions: "A data-based and service-based architecture for such a system is proposed and discussed, providing a data model to support data integration and the development of a set of data curation and analysis services." - please be more concrete when describing the data integration in the paper. For instance, in Fig. 1 it is not clear which steps are covered by the system, which are automated, and which are performed manually. Elaborating on these aspects and relating the authors' contribution to this general model refinement procedure would help to clarify the benefits of the presented approach.

3. In the related work section a good analysis of the scholarly data and chemical kinetics experimental data is presented. However, the work seems to be also very relevant to the Google Dataset Search https://toolbox.google.com/datasetsearch, Figshare and other data repositories, as well as to platforms for sharing code/experiments (MyExperiment, Gigantum, Code Ocean). I would strongly recommend relating the solution to those platforms in the Related Work section.

4. Sections 4 and 5 miss the description of what exactly is done (or not done) by the described prototype and within the approach. For instance "In other cases new experiments could be directly extracted by domain experts and inserted manually" - it is not clear if this is the main source of filling the repository or only a theoretical possibility. In the former case, how often do experts add experiments manually, can they really cope with the "big data" announced earlier? Is it scalable? For the automated integration of structured formats - which repositories are already connected and continuously provide new experiments for the repository? How many experiments are added every year/month/...?

Section 4.1.2 describes a lot of difficulties in matching experimental data with different parameters and levels of granularity. However, it is not clear what is managed by the system at the moment. The sentence in 4.1.3 about "handling volumes of data largely beyond those manageable manually" hints that the problem is solved to a certain extent, and the screenshots in Fig 3-6 suggest this too.

4.1.4 and managing changes in the models describe a very interesting problem of reusing the existing experimental results to predict the new ones or to get new results by re-running only some parts of the experiments. It is not clear, though, which use cases are supported at the moment and what limitations the system still has.

On p.17, section 5.4, it is not clear who performs manual interventions (if at all). Or is it one more "possibility" for the system rather than an implemented feature?

5. Section 6 should correspond to 4-5. Right now it describes "continuous data integration" which is only explained at high-level in 4 and 5. The same concern relates to the discussion of using the domain-ontologies vs OLAP - it would be nice to see in 4 and 5 how and what is used, then such choice becomes clearer.

In the description of the scientific model (p.19) and also earlier the authors mention "10^5-10^6" experimental parameters. It is not clear how the user can search with so many parameters (Fig 3) then. Could you please explain this better?

6. To summarize, I would suggest that the authors focus on what is done and what can be done soon, and present those. For cases where no implementation was done, you should only mention the most important ones in the future work or discussion sections. In particular, the dilemma of "scholarly data" vs "experimental data" should be solved. Right now the paper leaves the impression that "scholarly data could be used to get more experimental data for the repository". But it is not clear from the paper whether this is the reality in the current version of the system. I would suggest shortening Section 6 and other places where it is written "what could be used" and focusing on "what IS used".

***Minor comments / language errors***
- "a first prorotype" - p.3
- "indipendent" - p.6
- "a simulation software" - p.9
- "In general, it does not exist a pre-defined" - p.10
- Opensmoke vs OpenSMOKE++ - capitalization
- p. 16 "stred" - "stored"?
- p.16 "one respect to the others"
- p. 16 "the lacks"
- p.17 "composed by" - "contains"

Review #3 submitted on 05/Feb/2019

By Stian Soiland-Reyes

https://orcid.org/0000-0001-9842-9718

Review Details

Reviewer has chosen not to be Anonymous

Overall Impression: Good
Suggested Decision: Accept
Technical Quality of the paper: Good
Presentation: Weak
Reviewer's confidence: Medium
Significance: High significance
Background: Reasonable
Novelty: Limited novelty
Data availability: Not all used and produced data are FAIR and openly available in established data repositories; authors need to fix this
Length of the manuscript: The length of this manuscript is about right

Summary of paper in a few sentences:

The authors address the development of scientific models from experimental data, focusing on automation and semantic data integration from a use case of chemical kinetics models, but deriving requirements for a framework that I would argue is general enough to apply for any model/experiment research across domains (e.g. systems biology).

The paper also presents a service-oriented architecture to address the requirements, which has been partially implemented in a prototype. The prototype is shown by screenshot only (no name, URL or source code cited). Additional requirements for future work are laid out, summarising potential and existing methods from literature.

Reasons to accept:

* Identified requirements are generalizable to any modelling domain.
* Well-founded reasoning behind arguments.
* Domain example (kinetic combustion modelling) explained well.

Reasons to reject:

* Language: Several grammar and phrasing issues (see attached PDF)
* Length: Some repetition across sections (see PDF)
* Confusion between "proposed architecture" and what has been implemented in
* Architectural choices not clearly derived from requirements (e.g. SOA for functions)
* No source code or URL provided for developed prototype

Nanopublication comments:

Further comments:

## Overview

The main value I find in this article is that it identifies and describes well requirements for experiment-based model development, and in particular showing the issues that must be addressed when automating and scaling up such research across multiple open data sources. As I think this would apply across domains, I would have liked some citation to similar work in automating modelling work for other fields, for instance in systems biology.

I think this paper should be accepted following a minor revision. Some more concern must be placed on the language.

A detailed annotated PDF is attached in the web version of this review at

## Language

The presentation of this article is generally good and well reasoned; however, the grammar is of varying quality and so the language can get confusing in places. I have suggested numerous small modifications in the attached PDF (ISO 5776 notation), some of which I hope will simplify the text where I identified repetition or unnecessary phrasings.

As I see was pointed out in the SAVE-SD 2018 open peer reviews, it is odd to use exponential notation for small numbers. I understand the intention is to show scale rather than actual values or proportions, so I suggest changing them to "scale in the hundreds", "..thousands" and "..hundred thousands".

## Architecture section

The wording in the "Proposed Architecture" section floats between describing a potential general architecture ("It could be translated in the future") and features of the existing developed prototype ("the database has been designed..to privilege performance").

While I can read between the lines that the architecture was partially derived from the development of the prototype (which is good), this section attempts to give the opposite picture. This means a tension is artificially introduced that confuses the reader as to which parts of the architecture have been realized.

I suggest being more concrete in the architecture section and focusing on what has been implemented. The other design ideas are well reasoned and should be kept, but I would move them to a new subsection on future architectural work. This will show more clearly the distinction between features you can prove with the prototype and potential benefits whose implementation (e.g. exploratory OLAPs) may have hidden pitfalls yet to be discovered.

I have some questions on the choice of Service-Oriented Architecture. I understand the authors wanted to support multiple modelling systems and data formats, and so argue that individual workflow functions should be services to facilitate interoperability. While I certainly recognize this reasoning (as a developer of the Web Service-based workflow system Apache Taverna), I would also disagree with the argument that simply using SOA means data interoperability is easy.

## (Lack of) availability

The article focuses in large part on the development of a prototype. Yet this prototype seems *not* to be available except for a couple of screenshots.

From

> All relevant data that were used or produced for conducting the work presented in a paper must be made FAIR and compliant with the PLOS data availability guidelines prior to submission.

In addition to a URL, I would highly recommend the authors to provide Open Source code of the developed prototype.

An associated Zenodo DOI can then be used as a Code Citation from the paper.

1 Comment

Meta-Review by Editor

Submitted by Tobias Kuhn on Wed, 02/06/2019 - 03:39

We have received three complete and interesting reviews that should help you in expanding the paper and fixing the issues identified in the paper. I ask you to consider all the concerns raised by all the reviewers, since they are all very sensitive and important. A major concern, which has been raised by all, is the lack of availability of the data and of the software described in the paper, which is not acceptable for a publication in Data Science. I strongly suggest making them available online, following the FAIR principles and also the guidelines used in the ISWC Resource Track (the most recent ones are available at https://iswc2019.semanticweb.org/call-for-resources-track-papers/). For the latter, see in particular the section related to availability. Data (e.g. by publishing them in Figshare or Zenodo) and software (e.g. by using the GitHub+Zenodo feature for assigning DOIs to code) should be appropriately cited in the paper; see https://peerj.com/articles/cs-1/ and https://peerj.com/articles/cs-86/ for extensive discussion on the topic.

Silvio Peroni (https://orcid.org/0000-0003-0530-4305)
