Reviewer has chosen not to be Anonymous
Overall Impression: Good
Suggested Decision: Accept
Technical Quality of the paper: Good
Presentation: Good
Reviewer's confidence: High
Significance: High significance
Background: Reasonable
Novelty: Clear novelty
Data availability: All used and produced data are FAIR and openly available in established data repositories
Length of the manuscript: The length of this manuscript is about right
Summary of paper in a few sentences (summary of changes and improvements for second round reviews):
The paper introduces the idea of using knowledge graphs as a default model for representing heterogeneous data. Such a representation would allow the design of end-to-end machine learning pipelines.
Reasons to accept:
Some of the overselling points of the original version of this paper have been toned down, which is good. The comparison with XML and relational databases is appropriate, and clarifies the main message.
Most of my original points of critique (e.g., OWA, heterogeneous modeling, etc.) have been addressed.
Reasons to reject:
Some points still need a more thorough discussion (see further comments).
Nanopublication comments:
Further comments:
Some of the overselling points of the original version of this paper have been toned down, which is good. The comparison with XML and relational databases is appropriate, and clarifies the main message.
In order to round off the picture, I would like to see a bit more discussion on when to use a knowledge graph and when not to, perhaps as part of the conclusion section. These pieces are currently scattered (e.g., at the end of Section 3.4). For researchers trying to make sense of this paper, it would be helpful to have a matrix that maps aspects of the data they wish to analyse (e.g., multiple sources, large text literals, mixed media, streaming, time-indexed data) to whether a knowledge graph is suitable for those aspects or not.
Most of my original points of critique (e.g., OWA, heterogeneous modeling, etc.) have been addressed.
For some of the problems, I have my doubts that simply using a deep neural net will solve the issues. For example, when data follow different modeling paradigms, some of the data will follow one paradigm while the rest follow the other, and it may be hard for a learning machine to identify the correspondence if there is no significant overlap between them. Consider the example where one fraction of the dataset uses foaf:based_near while another uses dbo:location. Without a significant overlap of instances that use *both* properties simultaneously (also indirectly, via interlinked instances in both datasets), it will be difficult to learn that the two properties refer to the same relation. For the sake of correctness, I would expect a more thorough discussion of the limitations with respect to the challenges. Here, it might make sense to distinguish what current approaches such as RDF2vec are already capable of, what they could be extended to, and which hard challenges have no straightforward solution.
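To make the overlap argument concrete, the following is a minimal sketch (in Python with rdflib, using hypothetical toy data under an example.org namespace) that counts how many entities are described with both foaf:based_near and dbo:location. If that count is close to zero, an embedding-based approach such as RDF2vec has hardly any co-occurrence signal from which to learn that the two properties express the same relation.

# Minimal sketch (hypothetical toy data): quantify the property overlap that an
# embedding model such as RDF2vec would need in order to learn that
# foaf:based_near and dbo:location express the same relation.
from rdflib import Graph, Namespace

FOAF = Namespace("http://xmlns.com/foaf/0.1/")
DBO = Namespace("http://dbpedia.org/ontology/")
EX = Namespace("http://example.org/")          # hypothetical instance namespace

g = Graph()
# One fraction of the data is modeled with foaf:based_near ...
g.add((EX.alice, FOAF.based_near, EX.berlin))
g.add((EX.bob, FOAF.based_near, EX.paris))
# ... and another fraction is modeled with dbo:location.
g.add((EX.acme, DBO.location, EX.london))
g.add((EX.bob, DBO.location, EX.paris))        # bob is the only overlapping subject

subjects_foaf = set(g.subjects(FOAF.based_near))
subjects_dbo = set(g.subjects(DBO.location))
overlap = subjects_foaf & subjects_dbo

print(f"entities using foaf:based_near: {len(subjects_foaf)}")
print(f"entities using dbo:location:    {len(subjects_dbo)}")
print(f"entities using both:            {len(overlap)}")
# With (near-)zero overlap, co-occurrence-based training gives the learner
# little evidence that the two properties refer to the same relation.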
Overall, the revision is well done. With a little bit of discussion added on top, I would like to see it accepted.
2 Comments
Meta-Review by Editor
Submitted by Tobias Kuhn on
We are pleased to inform you that your paper has been accepted for publication, under the condition that you address the remaining minor issues.
The reviewers found that the revised manuscript largely addressed all of the points raised. In order to be suitable for publication, please address the following two aspects:
Michel Dumontier (http://orcid.org/0000-0003-4727-9435)
Link to Final PDF and JATS/XML Files
Submitted by Tobias Kuhn on
https://github.com/data-science-hub/data/tree/master/publications/1-1-2/ds-1-1-2-ds007