TriVec: Knowledge Graph Embeddings for Accurate and Efficient Link Prediction in Real World Application Scenarios

Tracking #: 620-1600

Authors:

	Name	ORCID
	Sameh K.	https://orcid.org/0000-0003-2659-2406
	Vit Novacek	https://orcid.org/0000-0003-4687-6043

Responsible editor:

Frank van Harmelen

Submission Type:

Research Paper

Abstract:

Knowledge graph embedding models are widely used to provide scalable and efficient link prediction for knowledge graphs. They use different techniques to model embedding interactions, where their tensor factorisation based versions are known to provide state-of-the-art results. In recent works, developments on factorisation based knowledge graph embedding models were mostly limited to enhancing the ComplEx and the DistMult models, as they can efficiently provide predictions within linear time and space complexity. The evaluation of these models was also limited to general knowledge benchmarks and it did not include any other applications in specialised domains. In this work, we aim to extend the works of the ComplEx and the DistMult models by proposing a new factorisation model, TriVec, which uses three part embeddings to model a combination of symmetric and asymmetric interactions between embeddings. We perform an empirical evaluation for the TriVec model compared to other tensor factorisation models on different training configurations (loss functions and regularisation terms), and we show that the TriVec model provides the state-of-the-art results in all configurations. In our experiments, we use standard benchmarking datasets (WN18, WN18RR, FB15k, FB15k-237, YAGO10) along with a new NELL based benchmarking dataset (NELL239) that we have developed. To complement the evaluation of our method on standard, but rather artificial datasets, we also present a more realistic benchmark based on the real-world problem of predicting effects of chemical-protein interactions. More specifically, we build a knowledge graph benchmark of chemicals, proteins and the effects of their interactions, and we design an evaluation pipeline that uses knowledge graph embedding to predict new chemical-protein interactions and their effects. We then show by experimental evaluation that our model provides the best results in terms of the area under the ROC and precision recall curves in the prediction of the effects of chemical-protein interactions compared to other knowledge graph embedding models.

Keywords: Knowledge Graph Embedding, Link Prediction, Bioinformatics

Manuscript:

ds-paper-620.pdf

Data repository URLs:

https://figshare.com/s/88ea0f4b8b139a13224f

Date of Submission:

Monday, January 27, 2020

Date of Decision:

Thursday, April 9, 2020

Nanopublication URLs:

Decision:

Reject

Solicited Reviews:

Review #1 submitted on 11/Feb/2020

Review Details

Reviewer has chosen to be Anonymous

Overall Impression: Average
Suggested Decision: Undecided
Technical Quality of the paper: Average
Presentation: Average
Reviewer's confidence: High
Significance: Moderate significance
Background: Reasonable
Novelty: Limited novelty
Data availability: With exceptions that are admissible according to the data availability guidelines, all used and produced data are FAIR and openly available in established data repositories
Length of the manuscript: The length of this manuscript is about right

Summary of paper in a few sentences:

Knowledge graph embedding (KGE) has become a promising approach for link prediction. Among the existing KGE approaches, tensor factorization based models (e.g., ComplEx, DistMult) obtain state-of-the-art performance. The paper proposes a new tensor factorization based KGE model, TriVec, which represents each entity and relation with a three-part embedding. The formulation of the score function enables the model to capture both symmetric and asymmetric relational patterns. Experimental results on standard benchmarks as well as real-world datasets show that TriVec outperforms the existing models.

Reasons to accept:

The analysis of the score function of ComplEx (Table 1) is interesting. The score function of ComplEx has four terms: two symmetric parts and two asymmetric parts. By removing different parts, the results do not change significantly. It shows that the score function of ComplEx has redundant parts.
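The four-term structure mentioned above can be checked numerically. The following is a minimal sketch, assuming the standard ComplEx score Re(<r, s, conj(o)>) over complex-valued embeddings; the function names and the random test vectors are illustrative and are not taken from the paper under review. It groups the four real-valued terms into a part that is unchanged when subject and object are swapped and a part that flips sign.

```python
import random

def triple_dot(a, b, c):
    """Sum over k of a[k] * b[k] * c[k] for real-valued lists."""
    return sum(x * y * z for x, y, z in zip(a, b, c))

def complex_score(r, s, o):
    """Standard ComplEx score Re(<r, s, conj(o)>) for lists of complex numbers."""
    return sum((ri * si * oi.conjugate()).real for ri, si, oi in zip(r, s, o))

def complex_terms(r, s, o):
    """Group the four real-valued terms of the score into two halves."""
    re_r, im_r = [x.real for x in r], [x.imag for x in r]
    re_s, im_s = [x.real for x in s], [x.imag for x in s]
    re_o, im_o = [x.real for x in o], [x.imag for x in o]
    # Unchanged when s and o are swapped.
    symmetric = triple_dot(re_r, re_s, re_o) + triple_dot(re_r, im_s, im_o)
    # Changes sign when s and o are swapped.
    antisymmetric = triple_dot(im_r, re_s, im_o) - triple_dot(im_r, im_s, re_o)
    return symmetric, antisymmetric

def random_embedding(k, rng):
    """Hypothetical random complex embedding, used only for the check."""
    return [complex(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(k)]

rng = random.Random(0)
r, s, o = (random_embedding(8, rng) for _ in range(3))
sym, asym = complex_terms(r, s, o)
sym_swapped, asym_swapped = complex_terms(r, o, s)
```

Dropping terms from either group, as the ablation in Table 1 is described as doing, corresponds to removing one of the `triple_dot` calls in this sketch.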

Reasons to reject:

ComplEx (and its variants such as ComplEx-v3) can model symmetric and asymmetric relation patterns while storing only two vectors for each entity/relation. TriVec has the same capability but requires one additional vector per entity/relation (three parts in total).

In the multi-class loss configuration (Table 4), ComplEx-N3-R obtains 0.79 (MRR) and 0.88 (Hits@10) on FB15k. These results differ from those reported in [5] because a smaller embedding dimension (200) is used. However, in [https://github.com/facebookresearch/kbc], the results with embedding dimension 100 are 83 (MRR) and 89 (Hits@10). Are the differences related to the hyper-parameter search?
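For reference, the two metrics quoted above are both functions of the rank assigned to the correct entity for each test triple. The sketch below uses the usual definitions of mean reciprocal rank and Hits@k; the ranks are hypothetical and are not results from the paper or from the kbc repository.

```python
def mean_reciprocal_rank(ranks):
    """Average of 1/rank over all test triples (ranks start at 1)."""
    return sum(1.0 / rank for rank in ranks) / len(ranks)

def hits_at_k(ranks, k=10):
    """Fraction of test triples whose correct entity is ranked in the top k."""
    return sum(1 for rank in ranks if rank <= k) / len(ranks)

# Hypothetical ranks of the correct entity for four test triples.
example_ranks = [1, 2, 4, 20]
mrr = mean_reciprocal_rank(example_ranks)
hits10 = hits_at_k(example_ranks, k=10)
```

Both values lie between 0 and 1, and some sources report them multiplied by 100.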

According to Table 5, each entity and relation uses three vectors with the dimension of K. Therefore, TriVec uses 3K parameters for each entity/relation. Using K=200, 600 parameters are used per entity/relation. Does ComplEx-N3-R use the same number of parameters in Table 3?
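The parameter-budget question above comes down to simple arithmetic. The sketch below assumes that a model with a given number of embedding parts stores that many real-valued vectors of dimension K per entity or relation (three for TriVec, two for ComplEx, as described in this review); the helper names are illustrative.

```python
def params_per_item(parts, k):
    """Real-valued parameters stored for one entity or relation."""
    return parts * k

def matched_dimension(budget, parts):
    """Dimension K at which a model with `parts` vectors reaches `budget`."""
    if budget % parts != 0:
        raise ValueError("budget is not divisible by the number of parts")
    return budget // parts

trivec_params = params_per_item(parts=3, k=200)      # three vectors of size 200
complex_same_k = params_per_item(parts=2, k=200)     # two vectors of size 200
complex_matched_k = matched_dimension(trivec_params, parts=2)
```

Under these assumptions, a two-vector model would need K=300 to match the 600 parameters per entity/relation that TriVec uses at K=200, whereas at K=200 it stores only 400.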

Were the results of the other models in Table 3 obtained with the same hyper-parameter search?

It would be interesting to compare TriVec with RotatE [Sun, Zhiqing, et al. "Rotate: Knowledge graph embedding by relational rotation in complex space." arXiv preprint arXiv:1902.10197 (2019)] and QuatE [Zhang, Shuai, et al. "Quaternion knowledge graph embeddings." Advances in Neural Information Processing Systems. 2019]

It would be helpful to include the results of ComplEx-V3-N3-R in Table 4.

Nanopublication comments:

Further comments:

The writing needs to be revised.

In Equation 10, \Phi^{TriVec}?

The captions of Figures 3 and 4 are the same.

Review #2 submitted on 21/Feb/2020

Review Details

Reviewer has chosen to be Anonymous

Overall Impression: Weak
Suggested Decision: Reject
Technical Quality of the paper: Average
Presentation: Good
Reviewer's confidence: High
Significance: Moderate significance
Background: Reasonable
Novelty: Lack of novelty
Data availability: All used and produced data (if any) are FAIR and openly available in established data repositories
Length of the manuscript: The length of this manuscript is about right

Summary of paper in a few sentences:

Inspired by the ComplEx model, this paper proposes a novel form of combining symmetric and asymmetric interactions in the score function, representing each entity and relation with three embedding vectors. Two different loss configurations are adopted in the training process, and extensive link prediction experiments are conducted to demonstrate the performance of the TriVec model. The paper also presents experiments on predicting the effects of chemical-protein interactions.

Reasons to accept:

This paper is clearly structured and presented well. I appreciate the analysis of different score and loss functions and the sufficient experiment results presented in this paper.

Reasons to reject:

There are some critical flaws.

1. Lack of novelty in the approach. It seems that the only novel part of the TriVec model is the new form of combining symmetric and asymmetric interactions in the score function, which is rather thin. Although the paper dedicates a significant portion to different loss functions, this does not add novelty, as the loss configurations were already proposed in existing work (Lacroix, T., Usunier, N., & Obozinski, G. (2018). Canonical tensor decomposition for knowledge base completion).

2. Results are only moderately significant. Overall, compared to ComplEx, the experimental improvement of the TriVec model is weak, while the number of parameters increases. In addition, other recent works, such as RotatE (Sun, Z., Deng, Z.-H., Nie, J.-Y., & Tang, J. (2019). RotatE: Knowledge Graph Embedding by Relational Rotation in Complex Space), show better experimental results on some datasets. I think this paper needs a more comprehensive comparison.

3. Weak theoretical analysis. First, I think this paper needs a theoretical explanation about the advantage of the design of three part embeddings and the new interaction form. Second, I don't see any advantage of TriVec model from the Analysis part in this paper.

Other problems and minor mistakes in this paper:
1. The paper does not make clear how the NELL239 dataset was built.
2. In Equation (8), right parentheses are missing.
3. In Equation (12), shouldn't it be o(p+Nr)s?

Nanopublication comments:

Further comments:

2 Comments

Note from the editor-in-chief

Submitted by Tobias Kuhn on Thu, 04/09/2020 - 01:43

First of all, we apologize for the delay with this. The two reviewers raise a number of important points that need to be resolved before this manuscript can be accepted. The authors also should look at the section about "Extended Versions" of the Guidelines for Authors (https://datasciencehub.net/content/guidelines-authors) and make sure these conditions are fulfilled.

Tobias Kuhn (http://orcid.org/0000-0002-1267-0234)

No Revised Version Submitted: Marked as Rejected

Submitted by Tobias Kuhn on Thu, 04/22/2021 - 03:06

As the authors did not submit a revised version, I will mark this submission as rejected.

Data Science