Reviewer has chosen not to be Anonymous
Overall Impression: Average
Suggested Decision: Undecided
Technical Quality of the paper: Weak
Presentation: Average
Reviewer's confidence: High
Significance: High significance
Background: Reasonable
Novelty: Limited novelty
Data availability: All used and produced data (if any) are FAIR and openly available in established data repositories
Length of the manuscript: The length of this manuscript is about right
Summary of paper in a few sentences:
This manuscript discusses persistent identifiers (PIDs) in the context of FAIR. It reviews current identifier schemes (e.g. DOI, ARK, …) and discusses identifier validation. It also proposes a model for PIDs that the authors believe addresses some of the issues raised in that review.
Reasons to accept:
The manuscript provides a useful review of identifier schemes. The authors also provide some insight into the most common interaction with PIDs when they state: “It is suggested that the wide distribution and findability (e.g. by simple 'googling') on the internet may be more important for the usefulness of identifiers, than the resolvability of PID URI-links.”
In addition, the authors raise a potentially important issue regarding the use of PIDs in an overall ecosystem: “Validation of an identifier means ensuring that it is true to its proclaimed type, for example, making sure that what is flagged as an ISBN is not in fact an ISSN (real use case), or that the string-length and check-sum is compliant with its type. A further advantage of promptly validatable identifiers, as against relying exclusively on resolvability, is that validation can be performed also off-line, by means of a more or less simple validation-algorithm, a pattern for the identifier type (expressed by a regular expression), a piece of script (JavaScript, Python, etc.), an HTML form,, a schema (e.g. XSD or Schematron) and a piece of software such as an XML-editor.”
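The off-line validation described in the quoted passage could look roughly like the sketch below. It is only an illustration of the idea, not the authors' method: it uses the standard check-digit rules for ISBN-10, ISBN-13 and ISSN, and the helper names (`matching_types`, `CHECKS`) are invented for this example.

```python
import re
from typing import List


def _clean(s: str) -> str:
    """Drop hyphens and spaces; upper-case a possible 'X' check character."""
    return s.replace("-", "").replace(" ", "").upper()


def _value(ch: str) -> int:
    """'X' stands for 10 in the check position of ISBN-10 and ISSN."""
    return 10 if ch == "X" else int(ch)


def is_isbn10(s: str) -> bool:
    s = _clean(s)
    if not re.fullmatch(r"\d{9}[\dX]", s):
        return False
    # Weights 10, 9, ..., 1; the weighted sum must be divisible by 11.
    return sum((10 - i) * _value(c) for i, c in enumerate(s)) % 11 == 0


def is_isbn13(s: str) -> bool:
    s = _clean(s)
    if not re.fullmatch(r"\d{13}", s):
        return False
    # Alternating weights 1, 3, 1, 3, ...; the sum must be divisible by 10.
    return sum(int(c) * (1 if i % 2 == 0 else 3) for i, c in enumerate(s)) % 10 == 0


def is_issn(s: str) -> bool:
    s = _clean(s)
    if not re.fullmatch(r"\d{7}[\dX]", s):
        return False
    # Weights 8, 7, ..., 1; the weighted sum must be divisible by 11.
    return sum((8 - i) * _value(c) for i, c in enumerate(s)) % 11 == 0


# Hypothetical lookup table used only in this sketch.
CHECKS = {"ISBN-10": is_isbn10, "ISBN-13": is_isbn13, "ISSN": is_issn}


def matching_types(s: str) -> List[str]:
    """Return every identifier type whose length and check digit s satisfies."""
    return [name for name, check in CHECKS.items() if check(s)]


# An ISSN that has been flagged as an ISBN is caught without any network access.
claimed, value = "ISBN-10", "0378-5955"
print(claimed in matching_types(value), matching_types(value))
```

A check of this kind confirms only that a string is well-formed for its claimed type. It says nothing about whether the identifier was ever assigned.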
Reasons to reject:
There are a number of serious inconsistencies within the manuscript that raise significant questions about the manuscript's readiness for publication:
1) Conflating persistence of URIs with link rot
The authors state that “The prevalence of phenomena such as link rot implies that the persistence of URIs cannot be trusted.” The persistence of URIs is independent of link rot, just as the existence of an ISBN or DOI doesn’t guarantee that the book/paper is available. For digital identifiers, this is resolved through the use of tombstone pages.
2) Conflating PIDs with Open Access
The authors state that “another reason resolvability may not be sufficient, even if the metadata is somehow in place, is that the file on the destination page resolved to is behind a paywall. In a case from December 2016, public domain content more than 110 years old was hidden behind a DOI-resolver charging 50$ for release of the content.”
PIDs do not guarantee open access and shouldn't. In addition to the potential cost of material (as noted above), there are many cases where content identified by a PID should not be available without access restrictions (e.g. Protected Health Information). For the examples given in the manuscript, the PIDs do provide metadata for the resources identified.
The authors discuss this further: “However, things have changed since then. When tried again now (Nov. 2018), the replacement unpaywall.org [22] and oadoi-API for 10.1080/00222930908692639 is actually not working anymore for this DOI; the response we get is: best_oa_location: null. But the resource sought, free from paywall, although no longer detected by unpaywall.org, can still be found at biodiversitylibrary.org, at several different URLs.”
PIDs are needed to identify open content and content that has various restrictions (and not only restrictions due to cost). PIDs should not be a guarantee of open access to underlying content – however all PIDs should have appropriate metadata (see 3 below).
3) Don’t need to jump to landing page for metadata:
The authors state that “Again, going back to the question of resolvability, the relationship between identifiers such as DOIs and URIs/IRIs is not always straightforward, and sometimes involves a chain of redirects ('303s'), before reaching eventually a destination holding also the appropriate metadata.” However, that is not the case as the metadata for any DOI is easily available:
https://search.crossref.org/?q=10.1080%2F00222930908692639
VII.—Descriptions of new genera and species of New-Zealand Coleoptera
Journal Article published Jul 1909 in Annals and Magazine of Natural History volume 4 issue 19 on pages 51 to 71
Authors: Major T. Broun
Metadata also exists for an unresolvable DOI:
Showing DOI matching 10.1002/(sici)1520-6297(199601/02)12:1<67::aid-agr6>3.3.co;2-#
Acreage response under policy incompatibilities: The US durum wheat situation
Journal Article published Jan 1996 in Agribusiness volume 12 issue 1 on pages 67 to 77
Authors: Roberto J. Garcia, James E. Quinton
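As a rough illustration of point 3, the lookup URL shown above can be derived from the DOI string alone, without following any redirects to a landing page. The sketch below only builds URLs and performs no network request. The function names are invented for this example, and the `api.crossref.org/works/` form is included on the assumption that a machine-readable record is wanted rather than the search page.

```python
from urllib.parse import quote

CROSSREF_SEARCH = "https://search.crossref.org/"
CROSSREF_API = "https://api.crossref.org/works/"


def crossref_search_url(doi: str) -> str:
    """Build the human-readable search URL used in the review."""
    # safe="" makes quote() percent-encode "/" as well as "<", ">", "#", etc.
    return f"{CROSSREF_SEARCH}?q={quote(doi, safe='')}"


def crossref_api_url(doi: str) -> str:
    """Build a URL for the JSON metadata record of a DOI (assumed use case)."""
    return CROSSREF_API + quote(doi, safe="")


print(crossref_search_url("10.1080/00222930908692639"))
print(crossref_api_url("10.1002/(sici)1520-6297(199601/02)12:1<67::aid-agr6>3.3.co;2-#"))
```

Percent-encoding matters for older DOIs such as the second one, whose suffix contains characters (`<`, `>`, `#`) that would otherwise break the URL.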
4) Potential issues with identifier explosion
The authors contend that “Providing multiple access to, or identification of resources through PIDs, that are capable of serving as trustworthy, competent, valid independent witnesses from different moments in time, at different sites, in different places is a good idea. Thus, we accept “that an object may have multiple PIDs”. Ideally these multiple PIDs should get to "know about" each other as a way towards interoperability.”
This explosion in identifiers potentially poses significant problems. A primary issue would be the systems required to keep the synonyms in check: the overall ecosystem would need another PID-like service to maintain this information. In addition, there already exist independent and trustworthy resources that resolve PIDs. For example, dx.doi.org, n2t.net and identifiers.org (and others) all resolve metadata for DOIs.
5) Incorrect description of DOI requirements
The authors state that “according to the same partial restriction, this entirely fake DOI is equally valid: 10.99999999/xxxxxxxx/x(y)x\:-{=?%%@@@@@”
However, this is incorrect. The initial component of the DOI (10.99999999) identifies the organization that controls the ID space, and 10.99999999 is not a valid DOI prefix.
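The distinction in point 5 is between a string that merely has the shape of a DOI and one whose prefix has actually been assigned. The sketch below separates the two checks under stated assumptions: `DOI_SHAPE` is a deliberately loose pattern, and `KNOWN_PREFIXES` is a hypothetical stand-in for a real registry lookup, seeded only with prefixes of DOIs cited in this review.

```python
import re
from typing import Optional, Tuple

# Loose shape only: "10.", a numeric registrant code, "/", a non-empty suffix.
DOI_SHAPE = re.compile(r"^(10\.\d+(?:\.\d+)*)/(\S+)$")

# Hypothetical stand-in for a registry query; not an authoritative list.
KNOWN_PREFIXES = {"10.1080", "10.1002", "10.1371"}


def split_doi(doi: str) -> Optional[Tuple[str, str]]:
    """Return (prefix, suffix) if the string has the shape of a DOI, else None."""
    m = DOI_SHAPE.match(doi)
    return (m.group(1), m.group(2)) if m else None


def check_doi(doi: str) -> str:
    """Classify a string as malformed, well-formed only, or using a known prefix."""
    parts = split_doi(doi)
    if parts is None:
        return "malformed"
    prefix, _suffix = parts
    if prefix in KNOWN_PREFIXES:
        return "registered prefix"
    return "well-formed, unregistered prefix"


fake = r"10.99999999/xxxxxxxx/x(y)x\:-{=?%%@@@@@"
print(check_doi(fake))
print(check_doi("10.1080/00222930908692639"))
```

Under these assumptions the fake DOI passes the shape test but fails the prefix test, which is the reviewer's point: a pattern match alone does not make a DOI valid.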
6) Misstating the lack of usefulness of certain IDs
The authors state that “UUIDs generated in this way by the gna name resolver, e.g. "707f84e1-e5b8-5063-8256-369ba9d72e13" for Antiaris toxicaria are next to useless as instruments of Findability, often yielding 0 hits by simple googling, all the while a search on the scientific name alone will give plenty of precision hits for the sought after organism, providing rich metadata for the 'thing' itself.”
First, this is not an issue with the PID itself but rather with how it is used. Many PIDs have very good findability via Google because their use entails incorporation in the text of documents (e.g. database accession numbers from PDB or GEO, or RRIDs in methods sections). In addition, googling for names does not always produce proper results, which is a primary reason for the increased interest in using ORCIDs to enable author disambiguation.
7) Model suggested is not differentiated from current models
The authors propose a model for PIDs:
“Model: [namespacePrefix].[objectType].[objectId: 10 positions].[issuedDate: YYYY-MM-DD].[registrant: org.id/ORCID] Example (expression of this paper): fabio.PositionPaper.jPsaveXD17.2018-11-12.0000-0001-5699-994X”
However, this model is very similar to current handles. It seems that the two pieces that determine the “prefix”, that is, who created the identifier and for what purpose, are split in the proposed model. This would be similar to allowing anyone to create a handle with derived prefixes, which is already allowed by the handle infrastructure:
“The GHR is a distributed registry whose operation is managed collaboratively by the DONA Foundation and multiple organizations that are credentialed and authorized by DONA. Each of those credentialed organizations are referred to as MPAs (or Multi-Primary Administrators). A credential is a number, made known to the GHR by DONA and allotted to a given MPA. MPAs, in turn, are authorized to allot derived prefixes from their credential to themselves and to third parties. The MPA and these third parties can provide identifier and resolution services (aka local handle services) for handles under the derived prefixes allotted to them. The GHR specifically communicates to the client, upon request, the location and certain relevant security information of the local handle services that can process the handle that the client is interested in resolving.”
The authors also discuss that the “resulting PIDs should be minted within the corresponding namespaces, who would also be the 'custodians' and resolving authorities of their PIDs, responsible for their uniqueness within their namespace. Another task would be to monitor and assign sameAs-properties to PIDs that refer to the same 'thing' in other namespaces.”
Would this then result in a collection of uncoordinated custodians? This is already an issue with database accession numbers and has required the construction of additional infrastructure to support the resolution of these PIDs. Coordination and management of the sameAs-properties could become problematic as well, and this isn’t discussed.
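To make the structure of the model quoted in point 7 concrete, the sketch below parses the authors' own example into its five dot-separated parts. It is a reading of the quoted template only, not an implementation supplied by the authors: the field names, the `ProposedPid` class and the choice to allow dots inside the final registrant field (for an org.id) are assumptions made for illustration.

```python
import re
from dataclasses import dataclass
from datetime import date
from typing import Optional

# [namespacePrefix].[objectType].[objectId: 10 positions].[issuedDate].[registrant]
MODEL = re.compile(
    r"^(?P<namespace>[^.]+)\."
    r"(?P<object_type>[^.]+)\."
    r"(?P<object_id>[^.]{10})\."
    r"(?P<issued>\d{4}-\d{2}-\d{2})\."
    r"(?P<registrant>.+)$"  # assumed to allow dots, e.g. for an org.id
)


@dataclass(frozen=True)
class ProposedPid:
    namespace: str
    object_type: str
    object_id: str
    issued: date
    registrant: str


def parse_pid(text: str) -> Optional[ProposedPid]:
    """Split a PID of the proposed form into its parts; None if it does not fit."""
    m = MODEL.match(text)
    if m is None:
        return None
    try:
        issued = date.fromisoformat(m.group("issued"))  # rejects e.g. 2018-13-40
    except ValueError:
        return None
    return ProposedPid(
        m.group("namespace"),
        m.group("object_type"),
        m.group("object_id"),
        issued,
        m.group("registrant"),
    )


print(parse_pid("fabio.PositionPaper.jPsaveXD17.2018-11-12.0000-0001-5699-994X"))
```

Parsing of this kind checks form only. Uniqueness within a namespace and the sameAs links across namespaces would still depend on the custodians discussed above.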
Nanopublication comments:
Further comments:
Meta-Review by Editor
Submitted by Tobias Kuhn on
Overall, the majority of reviewers have indicated several weaknesses of the paper, highlighting multiple points that must be addressed before the paper can be considered for publication.
Reviewer #3 was unfortunately very brief, but when asked for more details, indicated:
“I do not believe including extensive semantics in the identifiers is a good idea. We are trying to move away from this even though it is very difficult. I would like to see a common service model to retrieving metadata about identifiers. The paper also proposes a model but there is no technical solution behind this model to find out how difficult or easy to generate or maintain such a semantically rich identifier. The paper does not describe which community is interested in taking up such a semantically rich identifier scheme.”
My overall opinion is that while the paper discusses some potentially interesting points with respect to persistent identifiers and the FAIR principles, it has many drawbacks and confuses some concepts. So, my recommendation is that the paper should undergo a major revision addressing all the reviewers’ comments as well as the points I am listing below. After that, it will be sent for a second round of reviews.
Thus, in addition to all the reviewers’ comments, and in particular the detailed points made by Reviewer #4, I would ask you to also consider and address the following comments:
You mention that “the FAIR principles do not say anything explicitly about validation”. However, to be reusable (R1.3), the FAIR principles require (meta)data to meet domain-relevant community standards. This means precisely that “metadata can be properly validated against a schema, as adhering to an accepted metadata standard”, which you indicate is not covered. For more details, you can check the ongoing work on implementing the FAIR principles (some of which you include as a reference, e.g. FAIR metrics).
You refer to identifier validation and patterns for identifier types. Such patterns are maintained at identifiers.org (see e.g. https://www.ebi.ac.uk/miriam/main/collections/MIR:00000110). Have you considered that?
You refer to ordinary or plain URIs and resolvable URIs. I recommend checking ‘Study on Persistent URIs’ (https://philarcher.org/diary/2013/uripersistence/), which provides information on persistent identifiers in the context of the Web architecture. I recommend you follow the terminology defined there. Another important reference is ‘Cool URIs don’t change’ (https://www.w3.org/Provider/Style/URI).
I also recommend you check the paper “Identifiers for the 21st century: How to design, provision and reuse persistent identifiers to maximize utility and impact of life science data” (https://doi.org/10.1371/journal.pbio.2001414; disclaimer: I’m one of the authors). You suggest adding context to the identifiers. Please check Lesson 4 as well as, e.g., the OBO Foundry identifiers policy (http://www.obofoundry.org/id-policy.html) for resources explaining why identifiers should be opaque.
While the paper aims at discussing identifiers in the context of the FAIR principles, the principles themselves are interpreted without referring to the different elements in their definition (which you included in Figure 1). For example, when discussing findability, you refer to ‘googling’ rather than analysing the four elements that the principles define for findability (F1, F2, F3, F4).
The whole text is driven by examples used to support your arguments, and in some of them conceptual elements are conflated and/or confused (see the points above and Reviewer #4’s points). Instead, I recommend making strong conceptual points first and supporting them with examples, rather than the other way around.
As indicated by the reviewers, you need to differentiate your proposed model from existing ones, explaining how it addresses their drawbacks and what its strengths are. Please also state where the model is or will be applied and which community it is intended for.
Please add a conclusion section summarising the paper’s contributions.
Please also proofread the text (e.g. there are typos such as ‘homonymi’ instead of homonymy).
Many thanks,
Alejandra Gonzalez-Beltran (http://orcid.org/0000-0003-3499-8262)