The Knowledge Graph as the Default Data Model for Machine Learning

Tracking #: 465-1445

Authors:

	Name	ORCID
	Xander Wilcke	https://orcid.org/0000-0003-2415-8438
	Peter Bloem	https://orcid.org/0000-0002-0189-5817
	Victor de Boer	https://orcid.org/0000-0001-9079-039X

Responsible editor:

Michel Dumontier

Submission Type:

Position Paper

Abstract:

In modern machine learning, raw data is the preferred input for our models. Where a decade ago data scientists were still engineering features, manually picking out the details they thought salient, they now prefer the data in their raw form. As long as we can assume that all relevant and irrelevant information is present in the input data, we can design deep models that build up intermediate representations to sift out relevant features. However, these models are often domain specific and tailored to the task at hand, and therefore unsuited for learning on heterogeneous knowledge: information of different types and from different domains. If we can develop methods that operate on this form of knowledge, we can dispense with a great deal of ad-hoc feature engineering and train deep models end-to-end in many more domains. To accomplish this, we first need a data model capable of expressing heterogeneous knowledge naturally in various domains, in as usable a form as possible, and satisfying as many use cases as possible. In this position paper, we argue that the knowledge graph is a suitable candidate for this data model. This paper describes current research and discusses some of the promises and challenges of this approach.

Manuscript:

ds-paper-465.pdf

Supplementary Files (optional):

ds-supplementary-465.zip

Previous Version:

The Knowledge Graph as the Default Data Model for Machine Learning

Revised Version:

The Knowledge Graph as the Default Data Model for Learning on Heterogeneous Knowledge

Data repository URLs:

none

Date of Submission:

Tuesday, May 16, 2017

Date of Decision:

Tuesday, June 20, 2017

Nanopublication URLs:

Decision:

Solicited Reviews:

Review #1 submitted on 27/May/2017

By Kody Moodley

https://orcid.org/0000-0001-5666-1658

Review Details

Reviewer has chosen not to be Anonymous

Overall Impression: Average
Suggested Decision: Undecided
Technical Quality of the paper: Average
Presentation: Good
Reviewer's confidence: Medium
Significance: High significance
Background: Reasonable
Novelty: Clear novelty
Data availability: All used and produced data are FAIR and openly available in established data repositories
Length of the manuscript: The length of this manuscript is about right

Summary of paper in a few sentences (summary of changes and improvements for second round reviews):

My review has changed slightly after the second reading.

Reasons to accept:

The idea to use knowledge graphs as a data model for machine learning is compelling. It seems to be a very useful approach for certain machine learning problems and especially for integrating heterogeneous knowledge.
The authors describe the benefits of this idea quite well using clear examples and these benefits are quite attractive because they avoid several problems with manual feature engineering.

Reasons to reject:

The title might have been too optimistic. The authors do acknowledge that using knowledge graphs as the data model for machine learning cannot work for all domains. They concede, for example, that it would not be practical to represent individual pixels as nodes in the graph (at the end of Section 3.4). I also underestimated the magnitude of the challenge of differently-modelled knowledge (Section 4.4). Not being able to recognize that two pieces of information, although modeled differently, may represent the same knowledge, can undermine the benefits of using the knowledge graph for learning substantially. These issues lead me to believe that knowledge graphs, as a data model for machine learning, are probably a good addition to the toolkit of a data scientist, but to claim that they should be the default data model is perhaps too strong a statement.

Nanopublication comments:

Further comments:

Review #2 submitted on 29/May/2017

By Robert Hoehndorf

https://orcid.org/0000-0001-8149-5890

Review Details

Reviewer has chosen not to be Anonymous

Overall Impression: Good
Suggested Decision: Accept
Technical Quality of the paper: Good
Presentation: Good
Reviewer's confidence: High
Significance: High significance
Background: Comprehensive
Novelty: Clear novelty
Data availability: With exceptions that are admissible according to the data availability guidelines, all used and produced data are FAIR and openly available in established data repositories
Length of the manuscript: The length of this manuscript is about right

Summary of paper in a few sentences (summary of changes and improvements for second round reviews):

The authors have adequately addressed this reviewer's comments, except one: the title of the manuscript still is "The Knowledge Graph as the Default Data Model for Machine Learning"; in their response, the authors agree that this is not sufficiently precise and should be restated to refer to machine learning with "heterogeneous knowledge". I would like to see that the authors change the title to more accurately reflect the position they outline in the manuscript.

Reasons to accept:

Good, topical, and somewhat bold position paper suitable for this issue.

Reasons to reject:

The title is too broad and does not accurately reflect the position in the paper.

Nanopublication comments:

Further comments:

Review #3 submitted on 20/Jun/2017

By Heiko Paulheim

https://orcid.org/0000-0003-4386-8195

Review Details

Reviewer has chosen not to be Anonymous

Overall Impression: Good
Suggested Decision: Accept
Technical Quality of the paper: Good
Presentation: Good
Reviewer's confidence: High
Significance: High significance
Background: Reasonable
Novelty: Clear novelty
Data availability: All used and produced data are FAIR and openly available in established data repositories
Length of the manuscript: The length of this manuscript is about right

Summary of paper in a few sentences (summary of changes and improvements for second round reviews):

The paper introduces the idea of using knowledge graphs as a default model for representing heterogeneous data. Those would allow designing end-to-end machine learning pipelines.

Reasons to accept:

Some of the overselling points of the original version of this paper have been toned down, which is good. The comparison with XML and relational databases is appropriate, and clarifies the main message.

Most of my original points of critique (i.e., OWA, heterogeneous modeling etc.) have been addressed.

Reasons to reject:

Some points still need a more thorough discussion (see further comments).

Nanopublication comments:

Further comments:

In order to round off the picture, I would like to see a bit more discussion on when to use a knowledge graph and when not to, maybe as part of the conclusion section. These pieces are scattered (e.g., at the end of 3.4), but for researchers making sense of this paper, it would be good to see a matrix with aspects of the data (e.g., multiple sources, larger text literals, mixed media, streaming, time indexed, etc.) they wish to analyse, and whether a knowledge graph is suitable for those aspects or not.

For some of the problems, I have my doubts that simply using a deep neural net will solve the issues. For example, for data with different modeling paradigms, some data will follow one paradigm, while others will follow the other. It might be hard for a learning machine to identify the correspondence if there is no significant overlap here. Consider the example where a fraction of the dataset uses foaf:based_near, while another uses dbo:location. Without a significant overlap of pairs of instances that use *both* properties simultaneously (also indirectly by interlinking instances in both datasets), it will be difficult to learn that they refer to the same property. For the sake of correctness, I would expect a more thorough discussion of the limitations w.r.t. the challenges. Here, it might make sense to distinguish what current approaches such as RDF2vec are already capable of doing, what they might be extended to, and what might be the hard challenges for which no straightforward solution exists.
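
The overlap the reviewer describes can be made concrete with a small sketch. This is only an illustration under stated assumptions, not something taken from the paper or the review: triples are held as plain (subject, predicate, object) tuples, the example resources (ex:alice and so on) are invented, and the helper names pair_set and property_overlap are hypothetical. The measure is a simple Jaccard ratio over the (subject, object) pairs linked by each property; direct overlap only, without the indirect interlinking the reviewer also mentions.

```python
from typing import List, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)


def pair_set(triples: List[Triple], predicate: str) -> Set[Tuple[str, str]]:
    """Collect the (subject, object) pairs connected by the given predicate."""
    return {(s, o) for s, p, o in triples if p == predicate}


def property_overlap(triples: List[Triple], pred_a: str, pred_b: str) -> float:
    """Jaccard overlap of the (subject, object) pairs of two predicates.

    A value near 0 means almost no instance pair uses both properties,
    which is the situation in which their equivalence is hard to learn.
    """
    pairs_a = pair_set(triples, pred_a)
    pairs_b = pair_set(triples, pred_b)
    union = pairs_a | pairs_b
    if not union:
        return 0.0
    return len(pairs_a & pairs_b) / len(union)


# Invented toy data: only ex:alice is described with both properties.
triples: List[Triple] = [
    ("ex:alice", "foaf:based_near", "ex:Amsterdam"),
    ("ex:alice", "dbo:location", "ex:Amsterdam"),
    ("ex:bob", "foaf:based_near", "ex:Berlin"),
    ("ex:carol", "dbo:location", "ex:Paris"),
]

overlap = property_overlap(triples, "foaf:based_near", "dbo:location")
print(f"overlap = {overlap:.2f}")
```

In this toy graph one pair out of three is shared. A dataset in which the two sources describe disjoint instances would score zero, leaving a learner with no direct evidence that the two properties coincide.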

Overall, the revision is well done. With a little bit of discussion added on top, I would like to see it accepted.

2 Comments

Meta-Review by Editor

Submitted by Tobias Kuhn on Thu, 06/22/2017 - 01:29

We are pleased to inform you that your paper has been accepted for publication, under the condition that you address the remaining minor issues.

The reviewers found that the revised manuscript largely addressed all of the points raised. In order to be suitable for publication, please address the following two aspects:

remove "default" data model from the title.
extend the discussion to further address contextual limitations of knowledge graphs.

Michel Dumontier (http://orcid.org/0000-0003-4727-9435)

Link to Final PDF and JATS/XML Files

Submitted by Tobias Kuhn on Wed, 07/04/2018 - 08:40

https://github.com/data-science-hub/data/tree/master/publications/1-1-2/ds-1-1-2-ds007

Data Science
