Recommending Scientific Datasets Using Author Networks in Ensemble Methods

Tracking #: 720-1700

Authors:

	Name	ORCID
	Xu Wang	https://orcid.org/0000-0002-7585-759X
	Frank van Harmelen	https://orcid.org/0000-0002-7913-0048
	Zhisheng Huang	https://orcid.org/0000-0003-3794-9829

Responsible editor:

Stephen Pettifer

Submission Type:

Research Paper

Abstract:

Open access to datasets is increasingly driving modern science. Consequently, discovering such datasets is becoming an important functionality for scientists in many different fields. We investigate methods for dataset recommendation: the task of recommending relevant datasets given a dataset that is already known to be relevant. Previous work has used meta-data descriptions of datasets and interest profiles of authors to support dataset recommendation. In this work, we are the first to investigate the use of co-author networks to drive the recommendation of relevant datasets. We also investigate the combination of such co-author networks with existing methods, resulting in three different algorithms for dataset recommendation. We obtain experimental results on a realistic corpus which show that only the ensemble combination of all three algorithms achieves sufficiently high precision for the dataset recommendation task.

Manuscript:

ds-paper-720.pdf

Data repository URLs:

The data, Python implementation code, and a sample experiment can be found at this link.

Some of the RDF/HDT datasets can be found at this link.

Date of Submission:

Tuesday, February 22, 2022

Date of Decision:

Friday, April 22, 2022

Nanopublication URLs:

Decision:

Solicited Reviews:

Review #1 submitted on 07/Mar/2022

By Pasquale Lisena

https://orcid.org/0000-0003-3094-5585

Review Details

Reviewer has chosen not to be Anonymous

Overall Impression: Average
Suggested Decision: Undecided
Technical Quality of the paper: Excellent
Presentation: Average
Reviewer's confidence: High
Significance: High significance
Background: Reasonable
Novelty: Limited novelty
Data availability: All used and produced data (if any) are FAIR and openly available in established data repositories
Length of the manuscript: The length of this manuscript is about right

Summary of paper in a few sentences:

The authors propose 3 methods (based on hop-distance, embeddings, and BM25) to recommend datasets on the basis of co-authorship networks.

Reasons to accept:

The motivation is clear and relevant.
The work is scientifically rigorous, methods and algorithms are well described.
The evaluation seems to show good results.

Reasons to reject:

The main concern is about how this method is performing w.r.t. the SoA, and in particular, the works that the authors mention in Section 2.
Without comparing with SoA, it is hard to say if the presented strategy is effective or not.

Another doubt is about the novelty of the method.
Several works rely on co-author networks. Some of them have been mentioned by the authors, but there are also applications in paper recommendation (just an example: https://doi.org/10.1016/j.knosys.2020.106438).
Given that we are just looking at authors, title and description (=abstract), I wonder if we can see this problem as a paper recommendation problem (so apply paper recommendation techniques), or otherwise why it is different.

In some parts, the paper lacks clarity or details, as better pointed out in the further comments.
Among these parts, it is not clear how dataset pairs in ScholeXplorer are structured (1 link, 2 datasets? What is that link? Can a particular dataset appear in more than 1 pair?) nor what the semantics of these links are (relatedness? look also?). More details about this are needed.
I don't know if these links can be considered "gold standard" (it has not been realised for this purpose, it potentially has lots of missing links), but I believe it is ok for the evaluation (but here again, comparing with other works is crucial).

Nanopublication comments:

Further comments:

The method has some limitations, which should at least be mentioned in the paper:
- it cannot work on datasets with poor or absent metadata
- it cannot recommend datasets by unknown authors (which is the case for some Mendeley and MAKG datasets, as reported in Table 1)

Minor comments:
- Sec.1. "Our working HYPOTHESIS is that we provide a new HYPOTHESIS for ..." should be possibly rephrased
- Sec 3.2. The first time MAKG is mentioned, a small description of it should be given (in particular, what kind of properties we can find in it) + its link
- Sec 3. It can be made clearer that the similarities are computed between authors of the datasets (and not between an author and the person who is searching for recommendations)
- Sec 3.3. BM25 should have a proper citation and (if it is an acronym) be expanded the first time it is mentioned
- Section 4. "We need inputs that contain" => " We need in input"
- Algo 3. VS(As) can be computed once for all outside the foreach
- Fig 4. I think that the lower box should have a 2nd Elsevier Dataset, instead of the ScholeXplorer Dataset. In addition, I suggest calling it Mendeley Dataset, coherently with other mentions in the text
- Tab 2. You should mention that each row of the table has been computed excluding the datasets belonging to the preceding one (if I understood correctly by looking at the percentages)
- Sec 6.1.3. It is not clear if you try to match the 3 resources or only ScholeXplorer and MAKG (as written after Def. 11)
- Sec 6.2. It is hard to follow numbers. Maybe a recap table about how many datasets are finally considered for each resource would help
- Sec 8. "This behaviour is similar to that of most widely used search engines". You should confirm this claim with a citation
- Typos: "max-hop number nn", "coreferencse", "SAPRQL", "overviwe" "ScholeXplorerS", "contiained", "summrise", "AHhger"
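
The Algo 3 comment above is an instance of hoisting a loop-invariant computation. The sketch below is only an illustration of that idea, not the paper's Algorithm 3: it assumes that VS(As) denotes a vector computed from the author set of the source dataset, and the helper names (`mean_vector`, `cosine`, `rank_candidates`) and the toy embeddings are hypothetical.

```python
import math


def mean_vector(author_ids, embeddings):
    """Average the embedding vectors of the given authors (hypothetical helper)."""
    vectors = [embeddings[a] for a in author_ids if a in embeddings]
    if not vectors:
        return None
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return sum(x * y for x, y in zip(u, v)) / (norm_u * norm_v)


def rank_candidates(source_authors, candidates, embeddings):
    """Rank candidate datasets by similarity of their author vectors to the source."""
    # Assumed stand-in for VS(As): it does not depend on the candidate,
    # so it is computed once, before the loop, instead of on every iteration.
    source_vec = mean_vector(source_authors, embeddings)
    if source_vec is None:
        return []
    scores = {}
    for dataset_id, authors in candidates.items():
        candidate_vec = mean_vector(authors, embeddings)
        if candidate_vec is not None:
            scores[dataset_id] = cosine(source_vec, candidate_vec)
    return sorted(scores, key=scores.get, reverse=True)


# Toy data, invented purely for illustration.
toy_embeddings = {"a1": [1.0, 0.0], "a2": [0.0, 1.0], "a3": [1.0, 0.1]}
toy_candidates = {"d1": ["a3"], "d2": ["a2"]}
ranking = rank_candidates(["a1"], toy_candidates, toy_embeddings)
```

The result is unchanged by the hoisting; only the repeated work inside the loop is avoided.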

Review #2 submitted on 12/Apr/2022

By Imran Asif

https://orcid.org/0000-0002-1144-6265

Review Details

Reviewer has chosen not to be Anonymous

Overall Impression: Average
Suggested Decision: Accept
Technical Quality of the paper: Good
Presentation: Good
Reviewer's confidence: Medium
Significance: Moderate significance
Background: Reasonable
Novelty: Lack of novelty
Data availability: Not all used and produced data are FAIR and openly available in established data repositories; authors need to fix this
Length of the manuscript: The length of this manuscript is about right

Summary of paper in a few sentences:

The paper is about discovering relevant datasets for scientists in many different fields. Previous approaches used meta-data to discover relevant datasets. In this paper, the authors recommend relevant datasets using co-author networks in ensemble methods. The paper evaluates and analyses this approach against existing approaches and reports recommendation results with high precision but low recall.

Reasons to accept:

The paper is well structured and provides a reasonable background. The authors build a good story to support the argument of the research. I looked at the algorithms in the paper, but I didn't see any public repository from which to check and compile them. The authors do not provide datasets for running the tests. The paper does not follow the FAIR principles.

Reasons to reject:

The paper provides limited novelty because most search engines already provide the same type of recommendations for discovering relevant datasets, such as https://datasetsearch.research.google.com/. The authors talk about open access, but they failed to provide the datasets and code publicly. Unfortunately, they are not following the FAIR guidelines for making the data accessible and reusable.

Nanopublication comments:

Further comments:

Review #3 submitted on 21/Apr/2022

By Egon Willighagen

https://orcid.org/0000-0001-7542-0286

Review Details

Reviewer has chosen not to be Anonymous

Overall Impression: Good
Suggested Decision: Accept
Technical Quality of the paper: Good
Presentation: Good
Reviewer's confidence: Medium
Significance: High significance
Background: Comprehensive
Novelty: Unable to judge
Data availability: Not all used and produced data are FAIR and openly available in established data repositories; authors need to fix this
Length of the manuscript: The length of this manuscript is about right

Summary of paper in a few sentences:

The article sketches a future of doing research driven by research data. As such, it follows the trend for more FAIR data and for more openly licensed data, answering the call for more reproducible research in reply to the lack of reproducibility. The article introduces a number of related details in limited but sufficient detail for the conclusions made in this article. It introduces us to the field of recommendation systems for datasets and basically zooms in on the question of how to recommend relevant datasets. It compares a number of past approaches and wishes to explore whether co-author networks can be used for this purpose. To do so, it compares three different algorithms to rank datasets based on information in the co-author networks.

Reasons to accept:

The article is clearly written, gives a solid, concise introduction into the field, and covers a timely topic about research dissemination which is frustrated with existing approaches. The work is clear, the discussion fair, and the conclusions at least factual.

Reasons to reject:

I do not recommend rejection, but there are a number of points that are not entirely consistent. In particular, for an article that describes finding archived datasets, it is notable that the data and source code behind this article are not archived (Surf and Triply.cc are not archives). The submission system has links to code and data, but at least the GitHub repository does not seem to be mentioned in the article, and the DataCite standard is not used to cite datasets (re)used in this article.

Nanopublication comments:

Further comments:

Overall, I think the article should be accepted, but can benefit from smaller and bigger changes which I will try to outline here. In major/minor terms I would recommend a major revision, but most are suggestions that would increase the impact. I do not see things intrinsically wrong.

= Introduction

The first paragraph could use a bit more detail. For example, the FAIR community stresses that FAIR and Open data are different things. One could wonder whether recommending closed data even makes sense: as the authors indicate, funders and publishers increasingly expect that the data needed to reach an article's conclusions be shared, which is not possible when closed data is reused in research.

Reference 8 basically is as far as I can see only talking about datasets collecting journal articles. That is a very narrow call for more open data for COVID-19.

But actually, I disagree with the statement that funders made sharing practices "compulsory". While in theory that sounds right, in practice it is not. For example, the H2020 Open Data Pilot has a simple exclusion statement that data sharing expectations can be ignored if there is "commercial interest", which is not further specified. Ergo, the EC requirement is effectively merely a request to share data. And that does not even consider enforcement, which further weakens the idea of "compulsory".

I also disagree with the statement that "[i]t is widely acknowledged that such open datasets contribute to both the transparancy, quality, and reproducibility". This is still a minority opinion and even a niche area when we look at actual practices. Maybe the authors can add one or two references here to clarify the details of their statement.

While it is mentioned later, I recommend mentioning Zenodo in the introduction too. More importantly, one aspect I find missing in the introduction is the following. Later in the article, the specific metadata (meta-data) being used is discussed (obviously with the title, DOI/URL, and authors as important aspects). However, the introduction assumes that the concept of "dataset" is well-defined. Just looking at all the "type" issues in several of the databases used shows that this is actually a huge problem: things that are datasets are not correctly typed as datasets and vice versa. Now, I'm happy that the article takes "dataset" more loosely as "research output", but because it does bring up the topic when discussing the work from Kato, it sounds somewhat relevant.

Regarding the approach, there are two aspects of article/citation networks that interest me, and I am looking forward to hearing the authors' ideas about them. First, with the growing number of 10+ author articles and the growth of interdisciplinary research, the meaning of a "co-authorship" for the recommendation is ill-defined, and a single hop does not always differ in meaning for the recommendation from a 2-hop, I think. This could be discussed in the introduction or in the Discussion.

The second aspect is how citation networks could be and/or have been used for recommendations. These are used a lot for recommending journal articles. Is a dataset mentioned in an article that cites the article describing the dataset you want recommendations for not at least as interesting as one from a co-author? This is just a discussion point, and material for a next paper, not this one, of course.

= Algorithms

In Algorithm 1 please clarify that hop number n is the maximum length of the shortest path. (very minor suggestion)
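
To make the distinction concrete, the following minimal sketch (not the paper's Algorithm 1) treats the hop number n as an upper bound on the shortest-path length in a co-author graph. The function name `authors_within_hops`, the adjacency-dictionary representation, and the toy graph are assumptions made only for this illustration.

```python
from collections import deque


def authors_within_hops(graph, start, max_hops):
    """Return {author: hop distance} for every author whose shortest
    co-authorship path from `start` has length <= max_hops."""
    distances = {start: 0}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        # Do not expand beyond the hop limit.
        if distances[current] == max_hops:
            continue
        for neighbour in graph.get(current, ()):
            # Breadth-first search reaches each author first via a shortest path.
            if neighbour not in distances:
                distances[neighbour] = distances[current] + 1
                queue.append(neighbour)
    return distances


# Toy co-author graph, invented purely for illustration.
coauthors = {
    "A": ["B", "C"],
    "B": ["A", "D"],
    "C": ["A", "D"],
    "D": ["B", "C", "E"],
    "E": ["D"],
}
within_two = authors_within_hops(coauthors, "A", 2)
```

Here author D is reachable from A along two different 2-hop paths but is counted once at distance 2, while E (shortest path of length 3) is excluded when n = 2.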

= Datasets

Please describe all used datasets with the metadata fields used in the article, as specified in Table 1. Second, please use the DataCite approach to cite used datasets.

= Result

The major thing I kind of miss is an easy way to "try" the system with recommendations for me, to get some feeling for the practical recall/precision. The GitHub repository could benefit from a (for example) Google Colab notebook that makes a recommendation.

= GitHub

The Python code does not have a lot of documentation. I think this can be improved. At least add comments in the code indicating which Definitions and Algorithms in the article the code corresponds to.

Please add an AUTHORS file and a CITATION.cff file.

= Overall

Please consider adding an "Availability" section pointing to the archived source code and archived data that support this article, and ideally use DataCite to cite these formally.

= Typos

- page 1: "Mendeley Data(https://", missing space (check rest of paper for similar issues)
- page 2, 2nd paragraph: closing bracket seems missing for "(see our discussion"?
- section 5.1: "data sets and public license", remove the "public" as it is ambiguous and redundant
- page 15: SAPRQL
- page 23: AHhgher

1 Comment

Meta-Review by Editor

Submitted by Tobias Kuhn on Wed, 11/02/2022 - 07:42

We are pleased to inform you that your paper has provisionally been accepted for publication, under the condition that you address the various issues raised by the reviewers. Many of these are minor corrections or suggestions for improvements, but I would draw your attention in particular to the comments by R1 regarding a comparison with SoA, and the comments by the others regarding FAIRness / openness.

Stephen Pettifer (https://orcid.org/0000-0002-1809-5621)

Data Science
