PIDs, please play FAIR and identify yourselves!

Tracking #: 547-1527

Authors:

	Name	ORCID
	Joakim Philipson	https://orcid.org/0000-0001-5699-994X

Responsible editor:

Alejandra Gonzalez-Beltran

Submission Type:

Position Paper

Abstract:

This is an extended version of [32], first presented at the SAVE-SD 2017 workshop in Perth, Australia. In this comprehensively revised and updated version an example is given describing how scientific names can provide context and meaning, as a backdrop to the ensuing suggestion that PIDs, persistent identifiers - now often failing to do so, should also include contextual, semantic elements. As in the original version, findability and interoperability of some PIDs and their compliance with the FAIR data principles are explored, where ARKs were added in this version. It is suggested that the wide distribution and findability (e.g. by simple 'googling') on the internet may be more important for the usefulness of identifiers, than the resolvability of PID URI-links. New reasoning about how the failure to use PIDs such as DOIs - even when they exist, for citation, is supplied in this version. The prevalence of phenomena such as link rot implies that the persistence of URIs cannot be trusted. By contrast, the well distributed, but seldom directly resolvable ISBN identifier has proved remarkably resilient, with far-reaching persistence, inherent structural meaning and good validatability, by means of fixed string-length, pattern-recognition, restricted character set and check digit. Various examples of regular expressions used for validation of e.g. DOIs are supplied or referenced here. The suggestion to add context and meaning to PIDs, thereby making them "identify themselves", through namespace prefixes and object types is more elaborate in this version. Meaning can also be conferred by means of structural elements, such as well defined, restricted string patterns, that at the same time make PIDs more "validatable". Concluding this version is a generic, refined model for a PID with these properties, in which namespaces are instrumental as custodians, meaning-givers and validation schema providers. 
A draft example of a Schematron schema for validation of new PIDs in accordance with the proposed model is also provided.

Manuscript:

ds-paper-547.zip

Revised Version:

PIDs, please play FAIR and identify yourselves!

Special issue (if applicable):

SAVE-SD 2017/2018

Data repository URLs:

none

Date of Submission:

Friday, November 16, 2018

Date of Decision:

Tuesday, January 29, 2019

Nanopublication URLs:

Decision:

Undecided

Solicited Reviews:

Review #1 submitted on 08/Dec/2018

By Laurel L. Haak

https://orcid.org/0000-0001-5109-3700

Review Details

Reviewer has chosen not to be Anonymous

Overall Impression: Good
Suggested Decision: Accept
Technical Quality of the paper: Good
Presentation: Good
Reviewer's confidence: High
Significance: High significance
Background: Reasonable
Novelty: Clear novelty
Data availability: All used and produced data (if any) are FAIR and openly available in established data repositories
Length of the manuscript: The authors need to elaborate more on certain aspects and the manuscript should therefore be extended (if the general length limit is already reached, I urge the editor to allow for an exception)

Summary of paper in a few sentences:

This paper describes the FAIR principles as they apply to common persistent identifiers: DOIs, ARKs, UUIDs, and Handles.

Reasons to accept:

To this reviewer's knowledge, this is the first paper examining persistent identifiers in the context of FAIR principles. It is engagingly written with clear examples, actionable recommendations, and is well referenced.

Reasons to reject:

None.

Nanopublication comments:

Further comments:

While the author treats content IDs in some depth, there are three mentions of ORCID iDs that are not referenced in any way. This may be confusing to readers not well acquainted with the persistent identifier community, and this reviewer suggests that a short explanation be added at the first mention (Section 5) together with a reference to a paper. Also, in the Model in Section 7, ORCID should be spelled with an "O" not a "0". I would suggest the author consider either specifying that the paper is about persistent identifiers for content ("things") OR adding a section on identifiers for persons and organizations (for which there is currently a very brief aside).

Review #2 submitted on 10/Dec/2018

By Patricia Feeney

https://orcid.org/0000-0002-4011-3590

Review Details

Reviewer has chosen not to be Anonymous

Overall Impression: Weak
Suggested Decision: Undecided
Technical Quality of the paper: Weak
Presentation: Good
Reviewer's confidence: High
Significance: Moderate significance
Background: Reasonable
Novelty: Limited novelty
Data availability: All used and produced data (if any) are FAIR and openly available in established data repositories
Length of the manuscript: The length of this manuscript is about right

Summary of paper in a few sentences:

The author describes issues surrounding persistent identifiers (specifically ARK, DOI, Handle, UUID) and whether or not they adhere to FAIR guidelines, and argues for having the identifiers themselves (as opposed to metadata attached to an identifier) do some identifying work, to “...make "new" PIDs fully recognizable, universally unique, stable, but always in a well-known context, meaningful, and with a good potential for backup.” It's argued that making PIDs non-opaque will make them more discoverable.

Reasons to accept:

The author makes some convincing arguments about the limitations of identifiers. Comparing identifier strings against FAIR principles is a valuable idea.

Reasons to reject:

I think there isn't a strong enough connection between the overall argument (do identifiers adhere to FAIR guidelines?) and the examples and data presented. The paper argues against opaque identifiers, but doesn’t directly tackle commonly accepted reasons to keep identifiers opaque (for example: persistent identifiers need to remain the same over time and the metadata used to describe the item being identified may change; non-opaque identifiers may lead users to assume details about potential identifiers).

There is some discussion in the paper about the shortcomings of identifiers, but it’s unclear how having a non-opaque identifier would help with this. The issues described (Open access content vs. paywalled content) are not related directly to the identifier string, but to the implementation of the identifier. It would be useful to expand on this as well and perhaps focus on how identifiers will be more discoverable and persistent if their names reflect what they are - will this really make the identifiers themselves more 'findable'? How much of this proposed infrastructure change depends on Google indexing the correct piece of metadata?

Nanopublication comments:

Further comments:

I'm employed by Crossref, which has a 'best practice' (not requirement) that identifiers be opaque.

Review #3 submitted on 08/Jan/2019

By Sarala Wimalaratne

https://orcid.org/0000-0002-5355-2576

Review Details

Reviewer has chosen not to be Anonymous

Overall Impression: Average
Suggested Decision: Reject
Technical Quality of the paper: Weak
Presentation: Good
Reviewer's confidence: Medium
Significance: Moderate significance
Background: Reasonable
Novelty: Limited novelty
Data availability: All used and produced data (if any) are FAIR and openly available in established data repositories
Length of the manuscript: The length of this manuscript is about right

Summary of paper in a few sentences:

This paper proposes a new identifier model which includes semantics.

Reasons to accept:

Not accepted

Reasons to reject:

I do not think adding detailed semantics to an identifier is a good solution. What we should have is a way of retrieving semantics for a given identifier.

Nanopublication comments:

Further comments:

Review #4 submitted on 15/Jan/2019

By Jeff Grethe

https://orcid.org/0000-0001-5212-7052

Review Details

Reviewer has chosen not to be Anonymous

Overall Impression: Average
Suggested Decision: Undecided
Technical Quality of the paper: Weak
Presentation: Average
Reviewer's confidence: High
Significance: High significance
Background: Reasonable
Novelty: Limited novelty
Data availability: All used and produced data (if any) are FAIR and openly available in established data repositories
Length of the manuscript: The length of this manuscript is about right

Summary of paper in a few sentences:

This manuscript discusses PIDs, persistent identifiers, in the context of FAIR. It provides a review of current identifier schemes (e.g. DOI, ARK, …) and a discussion of identifier validation. The manuscript also provides a model for PIDs that the authors believe addresses some issues discussed in the review of current identifier schemes.

Reasons to accept:

The manuscript provides a useful review of identifier schemes. The authors also provide some insight into the most common interaction with PIDs when they state: “It is suggested that the wide distribution and findability (e.g. by simple 'googling') on the internet may be more important for the usefulness of identifiers, than the resolvability of PID URI-links.“

In addition, the authors raise a potentially important issue regarding the use of PIDs in an overall ecosystem: “Validation of an identifier means ensuring that it is true to its proclaimed type, for example, making sure that what is flagged as an ISBN is not in fact an ISSN (real use case), or that the string-length and check-sum is compliant with its type. A further advantage of promptly validatable identifiers, as against relying exclusively on resolvability, is that validation can be performed also off-line, by means of a more or less simple validation-algorithm, a pattern for the identifier type (expressed by a regular expression), a piece of script (JavaScript, Python, etc.), an HTML form,, a schema (e.g. XSD or Schematron) and a piece of software such as an XML-editor.”
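The kind of off-line validation described in this quotation can be illustrated with a minimal Python sketch. It is not taken from the manuscript; it assumes the standard ISBN-13 and ISSN check-digit rules, and the function names (`is_valid_isbn13`, `is_valid_issn`, `classify`) are hypothetical.

```python
import re

# Fixed length and restricted character set, checked before the check digit.
ISBN13_RE = re.compile(r"^97[89]\d{10}$")
ISSN_RE = re.compile(r"^\d{4}-\d{3}[\dX]$")


def is_valid_isbn13(value: str) -> bool:
    """Check pattern and check digit of an ISBN-13 (hyphens and spaces ignored)."""
    digits = value.replace("-", "").replace(" ", "")
    if not ISBN13_RE.match(digits):
        return False
    # Weights alternate 1, 3, 1, 3, ...; the weighted sum must be divisible by 10.
    total = sum(int(d) * (3 if i % 2 else 1) for i, d in enumerate(digits))
    return total % 10 == 0


def is_valid_issn(value: str) -> bool:
    """Check pattern and check character of an ISSN written as NNNN-NNNC."""
    value = value.strip().upper()
    if not ISSN_RE.match(value):
        return False
    chars = value.replace("-", "")
    # Weights 8 down to 2 on the first seven digits; the check character
    # ('X' stands for 10) has weight 1; the sum must be divisible by 11.
    total = sum(int(c) * w for c, w in zip(chars[:7], range(8, 1, -1)))
    check = 10 if chars[7] == "X" else int(chars[7])
    return (total + check) % 11 == 0


def classify(value: str):
    """Return the identifier type the string actually conforms to, or None."""
    if is_valid_isbn13(value):
        return "ISBN-13"
    if is_valid_issn(value):
        return "ISSN"
    return None
```

With such a check, a string flagged as an ISBN but shaped like an ISSN (for example `0378-5955`) is classified by what it actually is, without any network access.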

Reasons to reject:

There are a number of serious inconsistencies within the manuscript that raise significant questions about the manuscript's readiness for publication:

1) Conflating persistence of URIs with link rot

The author states that “The prevalence of phenomena such as link rot implies that the persistence of URIs cannot be trusted.” The persistence of URIs is independent of link rot – just as the existence of an ISBN or DOI doesn’t guarantee that the book/paper is available. For digital identifiers, this is resolved through the use of tombstone pages.

2) Conflating PIDs with Open Access

The authors state that “another reason resolvability may not be sufficient, even if the metadata is somehow in place, is that the file on the destination page resolved to is behind a paywall. In a case from December 2016, public domain content more than 110 years old was hidden behind a DOI-resolver charging 50$ for release of the content.”

PIDs do not guarantee open access and shouldn't. In addition to potential costs of material (as noted above) there are many cases where content identified by a PID should not be available without access restrictions (e.g. Protected Health Information). For the examples given in the manuscript, the PIDs described in the manuscript do provide metadata for the resources identified.

The authors continue to discuss this further “However, things have changed since then. When tried again now (Nov. 2018), the replacement unpaywall.org [22] and oadoi-API for 10.1080/00222930908692639 is actually not working anymore for this DOI; the response we get is: best_oa_location: null. But the resource sought, free from paywall, although no longer detected by unpaywall.org, can still be found at biodiversitylibrary.org, at several different URLs.”

PIDs are needed to identify open content and content that has various restrictions (and not only restrictions due to cost). PIDs should not be a guarantee of open access to underlying content – however all PIDs should have appropriate metadata (see 3 below).

3) Don’t need to jump to landing page for metadata:

The authors state that “Again, going back to the question of resolvability, the relationship between identifiers such as DOIs and URIs/IRIs is not always straightforward, and sometimes involves a chain of redirects ('303s'), before reaching eventually a destination holding also the appropriate metadata.” However, that is not the case as the metadata for any DOI is easily available:

https://search.crossref.org/?q=10.1080%2F00222930908692639

VII.—Descriptions of new genera and species of New-Zealand Coleoptera
Journal Article published Jul 1909 in Annals and Magazine of Natural History volume 4 issue 19 on pages 51 to 71
Authors: Major T. Broun

And metadata for un-resolvable DOI exists:

Showing DOI matching 10.1002/(sici)1520-6297(199601/02)12:1<67::aid-agr6>3.3.co;2-#
Acreage response under policy incompatibilities: The US durum wheat situation
Journal Article published Jan 1996 in Agribusiness volume 12 issue 1 on pages 67 to 77
Authors: Roberto J. Garcia, James E. Quinton
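As a small illustration of the lookups above, the following Python sketch builds the same kind of search URL from a DOI string. Only the `https://search.crossref.org/?q=` form shown above is taken from this review; the function name is hypothetical, and the sketch only constructs the URL without contacting the service.

```python
from urllib.parse import quote, unquote

# Base of the search URL as it appears in the example above.
CROSSREF_SEARCH = "https://search.crossref.org/?q="


def crossref_search_url(doi: str) -> str:
    """Percent-encode the DOI (including '/') and append it as the query."""
    return CROSSREF_SEARCH + quote(doi, safe="")


if __name__ == "__main__":
    print(crossref_search_url("10.1080/00222930908692639"))
    # DOIs containing characters such as '<', '>' or '#' must be encoded,
    # otherwise '#' would be read as the start of a URL fragment.
    odd = "10.1002/(sici)1520-6297(199601/02)12:1<67::aid-agr6>3.3.co;2-#"
    url = crossref_search_url(odd)
    print(url)
    # The original DOI string is recoverable from the encoded query.
    print(unquote(url[len(CROSSREF_SEARCH):]) == odd)
```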

4) Potential issues with identifier explosion

The authors contend that “Providing multiple access to, or identification of resources through PIDs, that are capable of serving as trustworthy, competent, valid independent witnesses from different moments in time, at different sites, in different places is a good idea. Thus, we accept “that an object may have multiple PIDs”. Ideally these multiple PIDs should get to "know about" each other as a way towards interoperability.”

This explosion in identifiers potentially poses significant problems. A primary issue would be the systems required to keep the synonyms in check – the overall ecosystem would need another PID-like service that maintains this information. In addition, there already exist independent and trustworthy resources that resolve PIDs. For example, for DOIs the resources dx.DOI, n2t.net and identifiers.org (and others) all resolve metadata for DOIs.

5) Incorrect description of DOI requirements

The authors state that “according to the same partial restriction, this entirely fake DOI is equally valid: 10.99999999/xxxxxxxx/x(y)x\:-{=?%%@@@@@“

However, this is incorrect. The initial component of the DOI (10.99999999) identifies the organization that controls the ID space, and this is not a valid DOI prefix.
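The distinction drawn here, between a string that merely has the shape of a DOI and one whose prefix is actually assigned, can be sketched in Python as follows. This is an illustration under stated assumptions: the shape pattern is a simplification, and `KNOWN_PREFIXES` is a hypothetical stand-in for a real registry lookup, populated only with prefixes of DOIs quoted in this review.

```python
import re

# Simplified shape check (an assumption): "10.", a numeric registrant code,
# a slash, and a non-empty suffix without whitespace.
DOI_SHAPE = re.compile(r"^(10\.\d{4,9})/(\S+)$")

# Hypothetical stand-in for a registry of assigned prefixes.
KNOWN_PREFIXES = {"10.1080", "10.1002", "10.1371"}


def check_doi(doi: str):
    """Return (has_doi_shape, prefix_is_known) for a candidate DOI string."""
    match = DOI_SHAPE.match(doi)
    if match is None:
        return (False, False)
    prefix = match.group(1)
    return (True, prefix in KNOWN_PREFIXES)


if __name__ == "__main__":
    fake = r"10.99999999/xxxxxxxx/x(y)x\:-{=?%%@@@@@"
    print(check_doi(fake))                         # shape passes, prefix unknown
    print(check_doi("10.1080/00222930908692639"))  # shape passes, prefix known
```

A purely syntactic pattern accepts the fake string; only the prefix lookup rejects it, which is why pattern matching alone cannot establish that a DOI is valid.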

6) Misstating the lack of usefulness of certain IDs

The authors state that “UUIDs generated in this way by the gna name resolver, e.g. "707f84e1-e5b8-5063-8256-369ba9d72e13" for Antiaris toxicaria are next to useless as instruments of Findability, often yielding 0 hits by simple googling, all the while a search on the scientific name alone will give plenty of precision hits for the sought after organism, providing rich metadata for the 'thing' itself.”

First, this is not an issue with the PID itself but rather with its utilization. There are many PIDs that have very good findability via Google as their use entails incorporation in the text of documents (e.g. database accession numbers from PDB or GEO, the use of RRIDs in methods sections). In addition, googling for names does not always produce proper results – a primary reason for the increased interest in the use of ORCIDs to enable author disambiguation.

7) Model suggested is not differentiated from current models

The authors propose a model for PIDs:
“Model: [namespacePrefix].[objectType].[objectId: 10 positions].[issuedDate: YYYY-MM-DD].[registrant: org.id/ORCID] Example (expression of this paper): fabio.PositionPaper.jPsaveXD17.2018-11-12.0000-0001-5699-994X”

However, this model is very similar to current handles. It seems that the two pieces that determine the “prefix”, that is, who created it and for what purpose, are split in the proposed model. However, this would be similar to allowing anyone to create a handle with derived prefixes, which is allowed by the handle infrastructure:

“The GHR is a distributed registry whose operation is managed collaboratively by the DONA Foundation and multiple organizations that are credentialed and authorized by DONA. Each of those credentialed organizations are referred to as MPAs (or Multi-Primary Administrators). A credential is a number, made known to the GHR by DONA and allotted to a given MPA. MPAs, in turn, are authorized to allot derived prefixes from their credential to themselves and to third parties. The MPA and these third parties can provide identifier and resolution services (aka local handle services) for handles under the derived prefixes allotted to them. The GHR specifically communicates to the client, upon request, the location and certain relevant security information of the local handle services that can process the handle that the client is interested in resolving.”

The authors also discuss that the “resulting PIDs should be minted within the corresponding namespaces, who would also be the 'custodians' and resolving authorities of their PIDs, responsible for their uniqueness within their namespace. Another task would be to monitor and assign sameAs-properties to PIDs that refer to the same 'thing' in other namespaces.”

Would this then result in a collection of un-coordinated custodians? This is an issue with database accession numbers and has required construction of additional infrastructure to support the resolution of these PIDs. Coordination and management of the sameAs-properties could become problematic as well and isn’t discussed.
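To make the model quoted under point 7 concrete, the following Python sketch parses a PID of that form and checks its structural elements. It is not the authors' Schematron schema; the allowed character sets for the first three segments are assumptions, `parse_pid` is a hypothetical name, and the registrant check uses the ISO 7064 MOD 11-2 check digit that ORCID iDs carry.

```python
import re
from datetime import date

# Segments follow the quoted model; character sets are assumptions.
MODEL_RE = re.compile(
    r"^(?P<namespace>[A-Za-z][A-Za-z0-9]*)\."
    r"(?P<object_type>[A-Za-z][A-Za-z0-9]*)\."
    r"(?P<object_id>[A-Za-z0-9]{10})\."          # exactly 10 positions
    r"(?P<issued>\d{4}-\d{2}-\d{2})\."           # YYYY-MM-DD
    r"(?P<registrant>.+)$"
)
ORCID_RE = re.compile(r"^\d{4}-\d{4}-\d{4}-\d{3}[\dX]$")


def orcid_checksum_ok(orcid: str) -> bool:
    """Verify the ISO 7064 MOD 11-2 check character of an ORCID iD."""
    if not ORCID_RE.match(orcid):
        return False
    chars = orcid.replace("-", "")
    total = 0
    for digit in chars[:15]:
        total = (total + int(digit)) * 2
    expected = (12 - total % 11) % 11
    return chars[15] == ("X" if expected == 10 else str(expected))


def parse_pid(pid: str):
    """Return the parsed segments as a dict, or None if the PID is malformed."""
    match = MODEL_RE.match(pid)
    if match is None:
        return None
    fields = match.groupdict()
    try:
        fields["issued"] = date.fromisoformat(fields["issued"])
    except ValueError:
        return None  # e.g. month 13 or day 40
    fields["registrant_is_orcid"] = orcid_checksum_ok(fields["registrant"])
    return fields


if __name__ == "__main__":
    print(parse_pid(
        "fabio.PositionPaper.jPsaveXD17.2018-11-12.0000-0001-5699-994X"
    ))
```

The sketch shows that the proposed structure is mechanically checkable, but it says nothing about who assigns namespaces or keeps sameAs links consistent, which is the coordination question raised above.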

Nanopublication comments:

Further comments:

1 Comment

Meta-Review by Editor

Submitted by Tobias Kuhn on Fri, 02/01/2019 - 05:08

Overall, the majority of reviewers have indicated several weaknesses of the paper, highlighting multiple points that must be addressed before the paper can be considered for publication.

Reviewer #3 was unfortunately very brief, but when asked for more details, indicated:

“I do not believe including extensive semantics in the identifiers is a good idea. We are trying to move away from this even though it is very difficult. I would like to see a common service model to retrieving metadata about identifiers. The paper also proposes a model but there is no technical solution behind this model to find out how difficult or easy to generate or maintain such a semantically rich identifier. The paper does not describe which community is interested in taking up such a semantically rich identifier scheme.”

My overall opinion is that while the paper discusses some potentially interesting points with respect to persistent identifiers and the FAIR principles, it has many drawbacks and confuses some concepts. So, my recommendation is that the paper should undergo a major revision addressing all the reviewers’ comments as well as the points I am listing below. After that, it will be sent for a second round of reviews.

Thus, in addition to all the reviewers' comments, and in particular the detailed points made by Reviewer #4, I would ask you to also consider and address the following comments:

You mention that “the FAIR principles do not say anything explicitly about validation”. However, to be reusable (in R1.3), the FAIR principles require (meta)data to meet domain-relevant community standards. This means precisely that “metadata can be properly validated against a schema, as adhering to an accepted metadata standard”, which you indicate is not covered. For more details on this, you can check the ongoing work on implementing the FAIR principles (some of which you included as a reference, e.g. FAIR metrics).
You refer to identifier validation and patterns for identifier types. Such patterns are maintained at identifiers.org (see e.g. https://www.ebi.ac.uk/miriam/main/collections/MIR:00000110). Have you considered that?
You refer to ordinary or plain URIs and resolvable URIs. I recommend checking ‘Study on Persistent URIs’ (https://philarcher.org/diary/2013/uripersistence/), which provides information on persistent identifiers in the context of the Web architecture. I recommend you follow the terminology defined there. Another important reference is ‘Cool URIs don’t change’ (https://www.w3.org/Provider/Style/URI).
I also recommend you check the paper “Identifiers for the 21st century: How to design, provision and reuse persistent identifiers to maximize utility and impact of life science data” (https://doi.org/10.1371/journal.pbio.2001414, disclaimer: I’m one of the authors). You suggest adding context to the identifiers; please check Lesson 4 as well as, e.g., the OBO Foundry identifiers policy (http://www.obofoundry.org/id-policy.html) for resources explaining why identifiers should be opaque.
While the paper aims at discussing identifiers in the context of the FAIR principles, the principles themselves are interpreted without referring to the different elements in their definition (that you included in Figure 1). For example, when discussing findability, you refer to ‘googling’ rather than analysing the actual four elements described in the principle to be findable (F1, F2, F3, F4).
The whole text is driven by examples to discuss the arguments you make, where some conceptual elements are conflated and/or confused (see the points above and Reviewer #4's points). Instead, I recommend discussing strong conceptual points, which should be supported by the examples, rather than the other way around.

As indicated by the reviewers, you need to differentiate your proposed model from existing ones, explaining how it addresses their drawbacks and what the strengths of the new model are. Please also address where the model is or will be applied and which community it is addressing.
Please add a conclusion section summarising the paper's contributions.
Please also revise the text (e.g. there are typos such as ‘homonymi’ instead of homonymy).

Many thanks,

Alejandra Gonzalez-Beltran (http://orcid.org/0000-0003-3499-8262)

Data Science
