PIDs, please play FAIR and identify yourselves!

Tracking #: 558-1538

Authors:

	Name	ORCID
	Joakim Philipson	https://orcid.org/0000-0001-5699-994X

Responsible editor:

Alejandra Gonzalez-Beltran

Submission Type:

Position Paper

Abstract:

This is an extended, revised version of [37]. Findability and interoperability of some PIDs and their compliance with the FAIR data principles are explored, with ARKs added in this version. It is suggested that wide distribution and findability (e.g. by simple 'googling') on the internet may be as important for the usefulness of PIDs as the resolvability of PID URIs. This version also includes new reasoning about the failure to use PIDs such as DOIs for citation. The prevalence of phenomena such as link rot implies that the persistence of URIs cannot be trusted. By contrast, the well distributed, but seldom directly resolvable ISBN identifier has proved remarkably resilient, with far-reaching persistence, inherent structural meaning and good validatability, by means of fixed string-length, pattern-recognition, restricted character set and check digit. Examples of regular expressions used for validation of PIDs are supplied or referenced. The suggestion to add context and meaning to PIDs, making them "identify themselves", through namespace prefixes and object types is elaborated further in this version. Meaning can also be conferred by means of structural elements, such as well defined, restricted string patterns, that at the same time make PIDs more "validatable". Concluding this version is a generic, refined model for a PID with these properties, in which namespaces are instrumental as custodians, meaning-givers and validation schema providers. A draft example of a Schematron schema for validation of "new" PIDs in accordance with the proposed model is provided.

Manuscript:

ds-paper-558.html

Supplementary Files (optional):

ds-supplementary-558-853.html

ds-supplementary-558-863.zip

Previous Version:

PIDs, please play FAIR and identify yourselves!

Revised Version:

Identifying PIDs playing FAIR

Special issue (if applicable):

SAVE-SD 2017/2018

Data repository URLs:

None

Date of Submission:

Thursday, February 28, 2019

Date of Decision:

Monday, June 24, 2019

Nanopublication URLs:

Decision:

Solicited Reviews:

Review #1 submitted on 05/Mar/2019

By Laurel L. Haak

https://orcid.org/0000-0001-5109-3700

Review Details

Reviewer has chosen not to be Anonymous

Overall Impression: Good
Suggested Decision: Accept
Technical Quality of the paper: Excellent
Presentation: Good
Reviewer's confidence: High
Significance: High significance
Background: Reasonable
Novelty: Clear novelty
Data availability: All used and produced data (if any) are FAIR and openly available in established data repositories
Length of the manuscript: The length of this manuscript is about right

Summary of paper in a few sentences (summary of changes and improvements for second round reviews):

This paper provides an analysis of the applicability of "FAIR principles" to persistent identifiers, specifically IDs for research objects.

Reasons to accept:

FAIR has largely been applied to data sets; this paper provides an analysis of the identifiers used to locate/navigate to/cite datasets, and is an important contribution to the current dialog.

Reasons to reject:

I see no reasons to reject this manuscript. It needs only very basic copy-editing to be ready for publication.

Nanopublication comments:

Further comments:

Review #2 submitted on 17/Mar/2019

By John Kunze

https://orcid.org/0000-0001-7604-8041

Review Details

Reviewer has chosen not to be Anonymous

Overall Impression: Good
Suggested Decision: Accept
Technical Quality of the paper: Good
Presentation: Good
Reviewer's confidence: High
Significance: High significance
Background: Reasonable
Novelty: Clear novelty
Data availability: All used and produced data (if any) are FAIR and openly available in established data repositories
Length of the manuscript: The length of this manuscript is about right

Summary of paper in a few sentences (summary of changes and improvements for second round reviews):

I have put a copy of the manuscript in a googledoc and made a number of tracked changes: https://docs.google.com/document/d/1NZhEUIQkf10Sw7a7XxrrbV_rWIWfzcoaYdtt...

This paper examines some major PID systems with respect to FAIR principles and introduces the idea that usability and validatability are important for persistence. Finding current systems lacking, the author proposes a model for a new validatable PID that includes context and object type.

Changed in this version: addition of ARK identifiers, more reasoning about the low adoption of PIDs such as DOIs for citation, and more description of how PIDs could "identify themselves".

Reasons to accept:

Novel and important in this paper: (a) application of FAIR principles to identifiers and (b) a new PID model proposal that breaks away from purely opaque PIDs.

Reasons to reject:

As interesting as the ideas are, some of the arguments given are weak, for example, the argument that usability and persistence depend on validatability. Also, it is not clear why the object type, and registrant "modules" proposed for the PID could not live next to, but outside the PID, in a citation.

Nanopublication comments:

Further comments:

Review Document: tracked-changes.docx

Review #3 submitted on 25/Mar/2019

By Patricia Feeney

https://orcid.org/0000-0002-4011-3590

Review Details

Reviewer has chosen not to be Anonymous

Overall Impression: Weak
Suggested Decision: Reject
Technical Quality of the paper: Weak
Presentation: Weak
Reviewer's confidence: High
Significance: Moderate significance
Background: Incomplete or inappropriate
Novelty: Lack of novelty
Data availability: All used and produced data (if any) are FAIR and openly available in established data repositories
Length of the manuscript: The length of this manuscript is about right

Summary of paper in a few sentences (summary of changes and improvements for second round reviews):

The stated goal of the paper is to analyze commonly used PIDs for compliance with FAIR principles; the author advocates for non-opaque identifiers. Some of the points in the revised version have been clarified and the conclusion helps draw the paper together, but overall it

Reasons to accept:

I think some good points are buried in the text and evaluating how/if PIDs meet FAIR principles has value.

Reasons to reject:

My initial review stressed that the case for non-opaque identifiers was not clearly stated, the author still conflates accessibility with discoverability and doesn't address arguments for opaque identifiers.

Nanopublication comments:

Further comments:

Review #4 submitted on 23/Apr/2019

By Sarala Wimalaratne

https://orcid.org/0000-0002-5355-2576

Review Details

Reviewer has chosen not to be Anonymous

Overall Impression: Average
Suggested Decision: Accept
Technical Quality of the paper: Average
Presentation: Good
Reviewer's confidence: Medium
Significance: Moderate significance
Background: Reasonable
Novelty: Limited novelty
Data availability: All used and produced data (if any) are FAIR and openly available in established data repositories
Length of the manuscript: The length of this manuscript is about right

Summary of paper in a few sentences (summary of changes and improvements for second round reviews):

The author has clarified many of the reviewers' comments in the response letter.

Reasons to accept:

The paper provides a good overall view of the current PID landscape.

Reasons to reject:

none

Nanopublication comments:

Further comments:

Review #5 submitted on 13/Jun/2019

By Phil Archer

https://orcid.org/0000-0002-4989-0601

Review Details

Reviewer has chosen not to be Anonymous

Overall Impression: Weak
Suggested Decision: Accept
Technical Quality of the paper: Average
Presentation: Weak
Reviewer's confidence: High
Significance: Moderate significance
Background: Reasonable
Novelty: Limited novelty
Data availability: All used and produced data (if any) are FAIR and openly available in established data repositories
Length of the manuscript: This manuscript is too long for what it presents and should therefore be considerably shortened (below the general length limit)

Summary of paper in a few sentences (summary of changes and improvements for second round reviews):

The paper reviews existing persistent identifiers, their use and usefulness in the context of the FAIR principles. Based on those findings, it presents a new format that seeks to offer the features of the most successful systems.

Reasons to accept:

Overall, as my answers to the specific questions show, I'd say this is a weak accept. There's some good material in the paper. For example, the text includes comparisons of the longevity of different PIDs and the accuracy with which they lead to the identified thing. That's useful information although the paper would benefit from presenting those details as tables, charts etc.

Reasons to reject:

The paper would benefit from better layout, diagrams etc. and none of the references appear in the text due to poor HTML.

More substantive, I'd suggest, is that there's a big dollop of open access/paywall discussion that, for this paper, is irrelevant. In my opinion, this should be removed. The paper is about the discoverability, resolvability and persistence of IDs - stick to that topic and don't go into a discussion of paywalls/access to research etc.

Nanopublication comments:

Further comments:

I'd also suggest a switch around in the order of presentation. The suggested new PID format comes at the end rather than at the beginning: "this paper proposes a new PID structure that addresses a series of issues identified with existing structures" or some such.

Due to my current work at GS1, I found the comment on ISBNs very interesting. They are indeed persistent and pre-date the Web by some decades. Most importantly - they are used throughout the industry that runs the system (the publishing, supply chain and retail industries). The paper refers several times to the - correct - notion that persistence comes from usage rather than design. The check digit is included so that if the point of sale scan fails, the number can be entered into the till manually and there's a good degree of checking that it was entered correctly.
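The manual-entry check mentioned above can be sketched in a few lines of Python. This is only an illustration, assuming the standard ISBN-13 (EAN-13) scheme: 13 digits, a 978 or 979 prefix, and a final check digit computed from alternating weights of 1 and 3. The function names are hypothetical and are not taken from the paper or from any GS1 library.

```python
import re


def isbn13_check_digit(first12: str) -> int:
    """Compute the check digit for the first 12 digits of an ISBN-13."""
    # Weights alternate 1, 3, 1, 3, ... starting from the leftmost digit.
    total = sum(int(d) * (1 if i % 2 == 0 else 3) for i, d in enumerate(first12))
    return (10 - total % 10) % 10


def is_valid_isbn13(candidate: str) -> bool:
    """Validate by fixed length, restricted character set, prefix and check digit."""
    # Hyphens and spaces are presentation only, so strip them first.
    digits = candidate.replace("-", "").replace(" ", "")
    # Exactly 13 ASCII digits, starting with 978 or 979.
    if re.fullmatch(r"97[89][0-9]{10}", digits) is None:
        return False
    return isbn13_check_digit(digits[:12]) == int(digits[12])


print(is_valid_isbn13("978-0-306-40615-7"))  # well-formed example
print(is_valid_isbn13("978-0-306-40615-8"))  # last digit mistyped
```

Any single mistyped digit changes the weighted sum modulo 10, so the check fails; the check says nothing about whether the number has been re-used for a different product.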

However, sadly, it is not true that ISBNs provide a 1-1 mapping. ISBNs are just a special case of the even more widely used UPC/EAN numbering system used on all manner of goods. At the end of the day, it's just a number - and they are cloned/re-used around the world. It's being addressed, sure, and it's not a massive problem, but it's not as perfect as the paper suggests.

One thing - and this I must admit is a personal hobby horse - persistence is a matter of policy, not technical design. Link rot is a real problem because people allow it to happen, not because of an innate property of the Web. I'd have loved to have seen that point included in the paper.

RESPONSE TO REVIEWERS

Please see the supplementary file 'philipson-response2reviews.html' in RASH format. [Please note that, due to the limits on file size and file formats allowed here, it was necessary to leave out some of the files needed to display RASH properly, notably the css, grammar and font files. These should be placed in the same folder as the paper and response2reviews files, together with the img and js folders.]

1 Comment

Meta-Review by Editor

Submitted by Tobias Kuhn on Mon, 06/24/2019 - 01:12

Dear Joakim,

Thanks for your article re-submission and for the responses to reviewers. You will see that from this new round, we received 5 reviews from experts on identifiers from ORCID, ARK, CrossRef, identifiers.org and web architecture/persistent identifiers. There are varying opinions on the significance and novelty of your contribution, and several suggestions for improvements. In particular, some of the arguments made are not well-justified. I agree with those comments and suggestions for improvement, and I am listing below more issues that must be addressed. As a consequence, my recommendation is to accept the contribution, conditional on all the changes being incorporated and the paper being improved. I also expect to see an enumeration of the changes and a justification of how the suggestions were addressed.

I am taking into account that this is a position paper, and the journal's submission guidelines indicate: “We accept position papers presenting discussions and viewpoints around Data Science topics. These papers do not need an evaluation, but need to present relevant and novel discussion points in a thorough manner.” (see https://datasciencehub.net/content/guidelines-authors)

Thus, I strongly encourage you to present the discussion justifying all your arguments in a thorough manner as part of the condition for acceptance.

I find that many of your statements in the paper (as well as in the response to reviewers) are not well-justified. For example, in the section about FAIR principles you indicate “There are several cases where general data repositories, professing to be FAIR and adhere to accepted metadata standards both for their default output and export formats, nevertheless fail to validate against schemas of these same standards.” This statement is justified neither with examples nor with a citation (see also the issue with citations that I raise in the formatting issues below), and this should be addressed. If you say “there are several cases”, you should provide examples of those cases, including what export formats and why they fail validation, and/or a reference that provides those examples.
You refer to “validability” of identifiers, and refer to regular expressions. There are already systems that maintain such expressions for identifier validation, such as identifiers.org (see for example the entry for DOI and its regular expression: https://www.ebi.ac.uk/miriam/main/collections/MIR:00000019). How does this affect your arguments? Justify.

In the text, you are referring to interoperability and then say “This is also used by fairmetrics.org as a measure of Findability.” - how is interoperability used as a measure of findability? I don’t think that is correct and you provide no explanation.
You insist that the FAIR principles don’t refer to findability, but you say: “However, the FAIR principles do not say anything explicitly about validation. Particularly for the principles of Interoperability and Re-usability, it is crucial that metadata can be properly validated against a schema, as adhering to an accepted metadata standard.” So, you are providing a counter-argument that in fact the FAIR principles refer to validation explicitly. Please, clarify.
You introduce examples using Life Science Identifiers (LSID) but do not discuss the issues around them. You can check the Wikipedia entry about LSID (https://en.wikipedia.org/wiki/LSID) and in particular, see “Controversy over the use of LSIDs” - https://lists.w3.org/Archives/Public/www-tag/2006Jul/0041; How does this discussion affect your arguments? Include a justification.
In the section ‘Resolvability or findability?’, you mention that “FAIR principles, the focus is very much on resolvability of identifiers despite the general awareness of phenomena like 'link rot' and 'reference rot'.” What is the basis for this claim? The FAIR guiding principles don’t refer to link rot issues.
You mention ‘When someone in an ensuing Twitter conversation complained about this, an answering tweet seemed to mean, that was the price we have to pay for something as useful as DOIs.’ Twitter could provide some anecdotal material, but a tweet is not a good reference for justifying a claim in a scientific article. In addition, you say that they “seem to mean” - this interpretation again doesn’t help in making a case. Moreover, the tweets are not referenced. Please use more reliable references than tweets to justify your arguments.
About your proposal of a new identifier schema that maintains context in the identifier, and thus is not opaque, I would like to see an explanation of how your scheme would handle the identification of objects that might change or evolve in the future. For instance, consider the identification of genes, whose information may evolve over time as new scientific discoveries are made about them. Also, I would like to see a presentation of how your proposal improves on the other identifier schemes, and how it improves FAIRness.
Please address all the suggestions made by John Kunze and the modifications proposed in the Google document, including addressing his points around weak arguments, such as
- “for example, the argument that usability and persistence depend on validatability.”
- “it is not clear why the object type, and registrant "modules" proposed for the PID could not live next to, but outside the PID, in a citation.”
Also, address Patricia Feeney’s points, including full justification for:

“case for non-opaque identifiers was not clearly stated, the author still conflates accessibility with discoverability and doesn't address arguments for opaque identifiers.”

Also address all the suggestions made by Phil Archer :
- Improve the presentation, including diagrams and tables where relevant (e.g. to show the longevity of different PIDs and the accuracy with which they lead to the identified thing)
- Switch the order of the presentation to show your proposal of the PID format first; I would suggest that you also provide a comparison table to show how your PID format would improve the issues that you highlight
- Address the issue raised around ISBNs

Formatting issues:

Please fix the citations; the HTML is not well-formatted. In some cases your citations appear as a link from an underscore symbol, and it is not even clear which statement the citation corresponds to (e.g. “This happens although there is an understanding that [u]nique identifiers, and metadata describing the data, and its disposition, should persist -- even beyond the lifespan of the data they describe. A recent study of some 40 research data repositories... ” - in this case, I imagine you are citing the data citation principles to justify the first sentence, but it is not clear what citation is included to justify the statement about the 40 data repositories). Please revise all citations and ensure that they can be seen properly.
The problem with the citations may also be the reason why the paper appears to introduce many acronyms without indicating what they stand for. For example, the introduction mentions ORCIDs, RORs, ARK, DOI, UUID, but the references are not included. Please fix these to include citations, especially on the first mention of each acronym. In addition, the paper would benefit from including a glossary listing all the acronyms and their definitions.

I look forward to receiving a thoroughly revised version of your article.

Many thanks,

Alejandra Gonzalez-Beltran (http://orcid.org/0000-0003-3499-8262)
