Reviewer has chosen not to be Anonymous
Overall Impression:
Accept
Technical Quality of the paper:
Unable to judge
Data availability:
Not all used and produced data are FAIR and openly available in established data repositories; authors need to fix this
Length of the manuscript:
The length of this manuscript is about right
Summary of paper in a few sentences:
The article sketches a future of doing research driven by research data. As such, it follows the trend for more FAIR data and for more openly licensed data, answering the call for more reproducible research in reply to the lack of reproducibility. The article introduces a number of related topics in limited but sufficient detail for the conclusions made in this article. It introduces us to the field of recommendation systems for datasets and basically zooms in on the question of how to recommend relevant datasets. It compares a number of past approaches and wishes to explore whether co-author networks can be used for this purpose. To do so, it compares three different algorithms to rank datasets based on information in the co-author networks.
Reasons to accept:
The article is clearly written, gives a solid, concise introduction to the field, and covers a timely topic about research dissemination, which is frustrated by existing approaches. The work is clear, the discussion fair, and the conclusions at least factual.
Reasons to reject:
I do not recommend rejection, but there are a number of points that are not entirely consistent. Particularly, for an article that describes finding archived datasets, it is notable that the data and source code behind this article are not archived (Surf and Triply.cc are not archives). The submission system has links to code and data, but at least the GitHub repository does not seem to be mentioned in the article, and the DataCite standard is not used to cite datasets (re)used in this article.
Overall, I think the article should be accepted, but it can benefit from smaller and bigger changes, which I will try to outline here. In major/minor terms I would recommend a major revision, but most of my points are suggestions that would increase the impact. I do not see anything intrinsically wrong.
The first paragraph could use a bit more detail. For example, the FAIR community stresses that FAIR and Open data are different things. One could wonder whether recommending closed data even makes sense: as the authors indicate, funders and publishers increasingly expect that the data needed to reach the conclusions of an article be shared, which is not possible when closed data is reused in the research.
Reference 8, as far as I can see, only talks about datasets collecting journal articles. That is a very narrow call for more open data for COVID-19.
But actually, I disagree with the statement that funders made sharing practices "compulsory". While in theory that sounds right, in practice it is not. For example, the H2020 Open Data Pilot has a simple exclusion statement that data sharing expectations can be ignored if there is "commercial interest", which is not further specified. Ergo, the EC requirement is effectively merely a request to share data. And that does not even consider whether these intentions are enforced, which further weakens the idea of "compulsory".
I also disagree with the statement that "[i]t is widely acknowledged that such open datasets contribute to both the transparancy, quality, and reproducibility". This is still a minority opinion and even a niche area when we look at actual practices. Maybe the authors can add one or two references here to clarify the details of their statement.
While it is mentioned later, I recommend mentioning Zenodo in the introduction too. More importantly, one aspect I find missing in the introduction is the following. Later in the article the specific metadata (meta-data) being used is discussed (obviously with the title, DOI/URL, and authors as important aspects). However, the introduction treats the concept of "dataset" as well-defined. Just looking at all the "type" issues in several of the used databases shows that this is actually a huge problem: things that are datasets are not correctly typed as datasets and vice versa. Now, I'm happy that the article takes "dataset" more loosely as "research output", but because it does bring up the topic when discussing the work from Kato, it sounds somewhat relevant.
Regarding the approach, there are two aspects of article/citation networks that interest me and I am looking forward to hearing the authors' ideas about them. First, with the growing list of 10+ author articles and the growth of interdisciplinary research, the meaning of a "co-authorship" for the recommendation is ill-defined, and a single hop does not always mean something different for the recommendation than a 2-hop, I think. This could be discussed in the introduction or in the Discussion.
The second aspect is how citation networks could be and/or have been used for recommendations. These are used a lot for recommending journal articles. Is a dataset mentioned in an article citing the article that describes the dataset you want recommendations for not at least as interesting as one from a co-author? This is just a discussion point, and material for a next paper, not this one, of course.
In Algorithm 1 please clarify that hop number n is the maximum length of the shortest path. (very minor suggestion)
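To make clear what I mean by "maximum length of the shortest path", here is a minimal sketch in Python. It is my own toy illustration, not the authors' Algorithm 1: the function name `authors_within_hops`, the adjacency-dictionary representation, and the example graph are all hypothetical. A breadth-first search records the shortest-path distance of each author from the starting author and stops expanding once that distance equals n.

```python
from collections import deque


def authors_within_hops(coauthors, start, max_hops):
    """Return {author: shortest-path distance} for all authors whose
    shortest co-authorship path from `start` has length <= max_hops.

    `coauthors` is an assumed adjacency dictionary: author -> co-authors.
    """
    distance = {start: 0}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        # Do not expand beyond the hop limit n.
        if distance[current] == max_hops:
            continue
        for neighbour in coauthors.get(current, ()):
            # BFS visits authors in order of distance, so the first
            # time we see an author is via a shortest path.
            if neighbour not in distance:
                distance[neighbour] = distance[current] + 1
                queue.append(neighbour)
    return distance


# Toy co-author graph (hypothetical data, for illustration only).
coauthors = {
    "A": ["B", "C"],
    "B": ["A", "D"],
    "C": ["A", "D"],
    "D": ["B", "C", "E"],
    "E": ["D"],
}

print(authors_within_hops(coauthors, "A", 2))
```

In this toy graph, author D is reachable from A via two different 2-hop paths but is counted once at distance 2, and author E (shortest path of length 3) is excluded when n = 2. That is the reading of n I would like to see stated explicitly.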
Please describe all used datasets with the metadata fields used in the article, as specified in Table 1. Second, please use the DataCite approach to cite used datasets.
The major thing I miss is an easy way to "try" the system with recommendations for me, to get some feeling for the practical recall/precision. The GitHub repository could benefit from a (for example) Google Colab notebook that makes a recommendation.
The Python code does not have a lot of documentation. I think this can be improved. At least add comments in the code indicating which Definitions and Algorithms in the article the code corresponds to.
Please add an AUTHORS file and a CITATION.cff file.
Please consider adding an "Availability" section pointing to the archived source code and archived data that support this article, and ideally use DataCite to cite these formally.
- page 1: "Mendeley Data(https://", missing space (check rest of paper for similar issues)
- page 2, 2nd paragraph: closing bracket seems missing for "(see our discussion"?
- section 5.1: "data sets and public license", remove the "public" as it is ambiguous and redundant
- page 15: SAPRQL
- page 23: AHhgher