Two real use cases of FAIR maturity indicators in the life sciences

Tracking #: 602-1582

Authors:

	Name	ORCID
	Serena Bonaretti	https://orcid.org/0000-0003-4264-1773
	Egon Willighagen	https://orcid.org/0000-0001-7542-0286

Responsible editor:

Michel Dumontier

Submission Type:

Research Paper

Abstract:

Data sharing and reuse are crucial to enhance scientific progress and maximize return on investment in science. Although attitudes are increasingly favorable, data reuse remains difficult for lack of infrastructures, standards, and policies. The FAIR (findable, accessible, interoperable, reusable) principles aim to provide recommendations to increase data reuse. Because of the broad interpretation of the FAIR principles, maturity indicators are necessary to determine FAIRness of a dataset. In this work, we propose a reproducible computational workflow to assess data FAIRness in the life sciences. Our implementation follows principles and guidelines recommended by the maturity indicator authoring group and integrates concepts from the literature. In addition, we propose a FAIR balloon plot to summarize and compare dataset FAIRness. We evaluated our method on two real use cases where researchers looked for datasets to answer their scientific questions. We retrieved information from repositories (ArrayExpress and Gene Expression Omnibus), a registry of repositories (re3data.org), and a searchable resource (Google Dataset Search) via application programming interface (API) wherever possible. With our analysis, we found that the two datasets met the majority of the criteria defined by the maturity indicators, and we showed areas where improvements can easily be reached. We suggest that use of standard schema for metadata and presence of specific attributes in registries of repositories could increase FAIRness of datasets.

Manuscript:

ds-paper-602.html

Revised Version:

A semi-automated workflow for FAIR maturity indicators in the life sciences

Special issue (if applicable):

Special Issue on FAIR Data, Systems and Analysis

Data repository URLs:

https://github.com/sbonaretti/FAIR_metrics

Date of Submission:

Sunday, August 11, 2019

Date of Decision:

Tuesday, October 22, 2019

Nanopublication URLs:

Decision:

Undecided

Solicited Reviews:

Review #1 submitted on 14/Sep/2019

By Stuart Chalk

https://orcid.org/0000-0002-0703-7776

Review Details

Reviewer has chosen not to be Anonymous

Overall Impression: Excellent
Suggested Decision: Accept
Technical Quality of the paper: Excellent
Presentation: Good
Reviewer's confidence: High
Significance: High significance
Background: Comprehensive
Novelty: Clear novelty
Data availability: All used and produced data (if any) are FAIR and openly available in established data repositories
Length of the manuscript: The length of this manuscript is about right

Summary of paper in a few sentences:

This is a timely paper that automates evaluation of metrics for FAIRness. It is likely to be the first of many papers on this topic and as a result will get heavily cited. The approach is easy to implement in other environments and highlights some potential issues. It would have been nice to see more than two datasets analyzed and reported in the paper.

Reasons to accept:

Clearly addresses relevant issues about assessing FAIR metrics. Provides a Jupyter Notebook to be able to run the code personally.

Reasons to reject:

The approach could have been more exhaustive by using more sites with data and more datasets. I am guessing that the authors felt time limited in order to get the paper into this special issue, so I encourage them to do further analysis and either link additional data to this paper, or publish a follow up paper.

Nanopublication comments:

Further comments:

Review #2 submitted on 16/Oct/2019

By Tina Dohna

https://orcid.org/0000-0002-5948-0980

Review Details

Reviewer has chosen not to be Anonymous

Overall Impression: Good
Suggested Decision: Accept
Technical Quality of the paper: Good
Presentation: Good
Reviewer's confidence: Medium
Significance: High significance
Background: Comprehensive
Novelty: Clear novelty
Data availability: All used and produced data (if any) are FAIR and openly available in established data repositories
Length of the manuscript: The length of this manuscript is about right

Summary of paper in a few sentences:

The authors present a semi-automatic computational workflow that was used to evaluate FAIR maturity indicators for two repositories holding gene-based resources. Working from two use cases, they evaluated the FAIRness of two datasets in these repositories and compared their work with similar approaches and other manual assessments. They produced a visual summary (balloon plot) of the outcome of their assessments and also applied it to another published assessment to show how the plot can add comparability among tested resources.

Reasons to accept:

Automated computational workflows assessing FAIR maturity indicators that can be applied to a broad variety of data and data repositories are needed. Currently, most assessments are manual and therefore do not lend themselves to large-scale and comparable assessment of archived data in scientific data repositories. Metadata quality even within individual repositories often varies greatly. Automated workflows to assess this and visual methods for labeling datasets according to their FAIR character would be helpful for data users and repositories alike. Despite the very limited application of the presented workflow (two datasets from repositories with gene-based resources), the proposed workflow provides a good entry point to build on.

Reasons to reject:

The scalability of the approach in its current form may be limited because the automated workflow and assessment criteria need to be adapted to the repository being addressed. In addition, the starting point for the analysis is keywords used by individual researchers, who used these terms to find their resource. The reproducibility of the assessment is thereby limited, as different researchers may use different keywords for their search. However, this only affects a few of the indicators evaluated.

Nanopublication comments:

Further comments:

- Figures 1 and 2 not visible in html doc
- Periods are missing in several places in R1.2, A1.2, etc.; the text often reads A12, R12 instead
- Consider changing the title to include the main achievement of the authors, the semi automated workflow
- Adding additional, less similar, use cases would greatly increase the impact of the presented work. This could also help solve questions of the scalability of the approach when working across different repositories.

Review #3 submitted on 18/Oct/2019

Review Details

Reviewer has chosen to be Anonymous

Overall Impression: Average
Suggested Decision: Undecided
Technical Quality of the paper: Average
Presentation: Good
Reviewer's confidence: High
Significance: High significance
Background: Comprehensive
Novelty: Limited novelty
Data availability: All used and produced data (if any) are FAIR and openly available in established data repositories
Length of the manuscript: The length of this manuscript is about right

Summary of paper in a few sentences:

The paper tackles the problem of assessing the “FAIRness” of research data and presents a semi-automatic pipeline to evaluate FAIR maturity indicators. The pipeline is demonstrated in a Jupyter notebook and illustrated in two use cases in the domain of Life Sciences. The proposed pipeline follows the principles and guidelines recommended by the maturity indicator authoring group, in addition to integrating concepts from the state of the art. Specifically, the proposed pipeline satisfies 13 FAIR principles and allows for the retrieval of data collections by accessing different data repositories. The metadata describing the process of data collection retrieval is documented in XML. The effectiveness of the proposed pipeline is evaluated in two use cases and the results of the evaluation are illustrated in a FAIR balloon plot. This plot facilitates the visualization of the analysis of the FAIR maturity indicators during the process of data collection retrieval to answer scientific questions. Finally, two users are consulted to analyze the usability of the proposed pipeline.

Overall the paper is well-written and addresses the problem of ensuring FAIR principles during the semi-automatic execution of a scientific pipeline. The implementation provided as a Jupyter notebook enables the execution of the pipeline and reproducibility of the results reported in the paper; the Jupyter notebook is not only published in a GitHub repository but also can be run in Binder. Thus, the proposed pipeline is presented as a resource that also follows the FAIR principles. Nevertheless, because of the lack of description of the proposed pipeline, its full potential and limitations are not transparently presented. In particular, the following points are not clear in the paper.

To conclude, the paper presents an approach that has great potential and is relevant to the scientific community. However, the lack of description of the proposed workflow impedes a clear evaluation of its benefits and limitations. These issues reduce the value of the current version of this work and prevent a positive evaluation in terms of generality and innovation. The recommendation is to address these issues and resubmit the paper.

Reasons to accept:

Strong Points (SP)
A resource for evaluating the FAIRness of the data collections retrieved during the execution of a research question.
Live code of the pipeline accessible via a Jupyter notebook.
Clear visualization of the summary of the values of the results?

Reasons to reject:

Weak Points (WP)
The components of the pipeline are vaguely defined. It is not clear what the innovation of the proposed workflow is from a computational point of view.
There are many issues that are not clearly described, thus reducing the understanding of the potential of the proposed work.

Questions to the authors (QA):
QA1) What is the main component of the workflow implemented in the pipeline presented in this paper?
QA2) How is a research question (e.g., “What are the differentially expressed genes between normal subjects and subjects with Parkinson’s disease in the brain frontal lobe?”) interpreted?
QA3) Why are state-of-the-art Named Entity Recognition (NER) tools not used to support this task?
QA4) Why only two use cases were selected? Two use cases are not enough to show the features of a given approach.
QA5) Why did only two users evaluate the pipeline? Which criteria were followed to select these two evaluators? Under which conditions was the pipeline evaluated?
QA6) Why are controlled vocabularies and semantic enrichment techniques not utilized to describe the metadata of the datasets?
QA7) What will be the behavior of the proposed workflow if several data collections are relevant for answering a research question? How will the reported measures be computed?

Nanopublication comments:

Further comments:

1 Comment

Meta-Review by Editor

Submitted by Tobias Kuhn on Tue, 10/22/2019 - 13:41

The reviewers indicate that the work is novel and important, but there are concerns regarding the limited number of evaluations, the approach to initiate the review, lack of details regarding the expert evaluation, and questions regarding the behaviour and output of the approach.

Michel Dumontier (http://orcid.org/0000-0003-4727-9435)

Data Science
