PIDs, please play FAIR and identify yourselves!

Tracking #: 558-1538

Authors:

Name: Joakim Philipson
ORCID: https://orcid.org/0000-0001-5699-994X


Responsible editor: 

Alejandra Gonzalez-Beltran

Submission Type: 

Position Paper

Abstract: 

This is an extended, revised version of [37]. Findability and interoperability of some PIDs and their compliance with the FAIR data principles are explored, where ARKs were added in this version. It is suggested that the wide distribution and findability (e.g. by simple 'googling') on the internet may be as important for the usefulness of PIDs, as the resolvability of PID URIs. This version also includes new reasoning about the failure to use PIDs such as DOIs for citation. The prevalence of phenomena such as link rot implies that the persistence of URIs cannot be trusted. By contrast, the well distributed, but seldom directly resolvable ISBN identifier has proved remarkably resilient, with far-reaching persistence, inherent structural meaning and good validatability, by means of fixed string-length, pattern-recognition, restricted character set and check digit. Examples of regular expressions used for validation of PIDs are supplied or referenced. The suggestion to add context and meaning to PIDs, making them "identify themselves", through namespace prefixes and object types is more elaborate in this version. Meaning can also be conferred by means of structural elements, such as well defined, restricted string patterns, that at the same time make PIDs more "validatable". Concluding this version is a generic, refined model for a PID with these properties, in which namespaces are instrumental as custodians, meaning-givers and validation schema providers. A draft example of a Schematron schema for validation of "new" PIDs in accordance with the proposed model is provided.
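The validation features that the abstract attributes to the ISBN (fixed string length, restricted character set, pattern recognition and check digit) can be illustrated with a minimal sketch. It is not taken from the manuscript: the function name `is_valid_isbn13` and the choice to ignore hyphens and spaces are assumptions made for illustration, and only the 13-digit form of the ISBN is covered.

```python
import re

# Fixed length (13), restricted character set (digits only) and a known
# prefix (978 or 979) are all checked by a single pattern.
ISBN13_PATTERN = re.compile(r"97[89][0-9]{10}")


def is_valid_isbn13(candidate: str) -> bool:
    """Return True if `candidate` is a well-formed ISBN-13 (hypothetical helper)."""
    # Assumption: hyphens and spaces are display separators and may be dropped.
    digits = candidate.replace("-", "").replace(" ", "")
    if not ISBN13_PATTERN.fullmatch(digits):
        return False
    # Check digit: the first 12 digits are weighted alternately by 1 and 3.
    total = sum(
        int(d) * (1 if i % 2 == 0 else 3) for i, d in enumerate(digits[:12])
    )
    expected_check = (10 - total % 10) % 10
    return expected_check == int(digits[12])


if __name__ == "__main__":
    print(is_valid_isbn13("978-3-16-148410-0"))
    print(is_valid_isbn13("978-3-16-148410-1"))
```

A check of this kind needs no resolver: the identifier string alone is enough to reject most transcription errors.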

Manuscript: 

Supplementary Files (optional): 

Previous Version: 

Tags: 

  • Reviewed

Data repository URLs: 

None

Date of Submission: 

Thursday, February 28, 2019

Date of Decision: 

Monday, June 24, 2019

Decision: 

Accept

Solicited Reviews:


1 Comment

Meta-Review by Editor

Dear Joakim,

 

Thanks for your article re-submission and for the responses to reviewers. You will see that in this new round we received 5 reviews from experts on identifiers from ORCID, ARK, CrossRef, identifiers.org and web architecture/persistent identifiers. There are varying opinions on the significance and novelty of your contribution, and several suggestions for improvements. In particular, some of the arguments made are not well-justified. I agree with those comments and suggestions for improvement, and I am listing below more issues that must be addressed. As a consequence, my recommendation is to accept the contribution, conditional on all the changes being incorporated and the paper being improved. I also expect to see an enumeration of the changes and a justification of how the suggestions were addressed.

 

I note that this is a position paper, and the journal’s submission guidelines indicate: “We accept position papers presenting discussions and viewpoints around Data Science topics. These papers do not need an evaluation, but need to present relevant and novel discussion points in a thorough manner.” (see https://datasciencehub.net/content/guidelines-authors)

 

Thus, I strongly encourage you to present the discussion justifying all your arguments in a thorough manner as part of the condition for acceptance.

 

  • I find that many of your statements in the paper (as well as in the response to reviewers) are not well-justified. For example, in the section about FAIR principles you indicate “There are several cases where general data repositories, professing to be FAIR and adhere to accepted metadata standards both for their default output and export formats, nevertheless fail to validate against schemas of these same standards.” This statement is justified neither with examples nor with a citation (also see the issue with citations that I raise in the formatting issues below), and this should be addressed. If you say “there are several cases”, you should provide examples of those cases, including which export formats are affected and why they fail validation, and/or a reference that provides those examples.

  • You refer to “validatability” of identifiers, and refer to regular expressions. There are already systems that maintain such expressions for identifier validation, such as identifiers.org (see for example the entry for DOI and its regular expression: https://www.ebi.ac.uk/miriam/main/collections/MIR:00000019). How does this affect your arguments? Please justify.

  • In the text, you refer to interoperability and then say “This is also used by fairmetrics.org as a measure of Findability.” How is interoperability used as a measure of findability? I don’t think that is correct, and you provide no explanation.

  • You insist that the FAIR principles don’t refer to validation, but you say: “However, the FAIR principles do not say anything explicitly about validation. Particularly for the principles of Interoperability and Re-usability, it is crucial that metadata can be properly validated against a schema, as adhering to an accepted metadata standard.” So, you are providing a counter-argument that in fact the FAIR principles refer to validation explicitly. Please clarify.

  • You introduce examples using Life Science Identifiers (LSID) but do not discuss the issues around them. You can check the Wikipedia entry about LSID (https://en.wikipedia.org/wiki/LSID) and, in particular, see “Controversy over the use of LSIDs” - https://lists.w3.org/Archives/Public/www-tag/2006Jul/0041. How does this discussion affect your arguments? Include a justification.

  • In the section ‘Resolvability or findability?’, you mention that “FAIR principles, the focus is very much on resolvability of identifiers despite the general awareness of phenomena like 'link rot' and 'reference rot'”. What is the basis for this claim? The FAIR guiding principles don’t refer to link rot issues.

  • You mention ‘When someone in an ensuing Twitter conversation complained about this, an answering tweet seemed to mean, that was the price we have to pay for something as useful as DOIs.’ Twitter could provide some anecdotal material, but a tweet is not a good reference for justifying a claim in a scientific article. In addition, you say that the tweet “seemed to mean” something; this interpretation again doesn’t help in making a case. Moreover, the tweets are not referenced. Please use more reliable references than tweets to justify your arguments.

  • Regarding your proposal of a new identifier schema that maintains context in the identifier, and is thus not opaque, I would like to see an explanation of how your scheme would handle the identification of objects that might change or evolve in the future. For instance, consider the identification of genes, whose information may evolve over time as new scientific discoveries are made about them. Also, I would like to see a presentation of how your proposal improves on the other identifier schemes, and how it improves FAIRness.

  • Please address all the suggestions made by John Kunze and the modifications proposed in the Google document, including addressing his points around weak arguments, such as

    • “for example, the argument that usability and persistence depend on validatability”: it is not clear why this should hold.

    • “it is not clear why the object type, and registrant "modules" proposed for the PID could not live next to, but outside the PID, in a citation.”

  • Also, address Patricia Feeney’s points, including full justification for:

“case for non-opaque identifiers was not clearly stated, the author still conflates accessibility with discoverability and doesn't address arguments for opaque identifiers.”

  • Also address all the suggestions made by Phil Archer:

    • Improve the presentation by including diagrams and tables when relevant (e.g. to show the longevity of different PIDs and the accuracy with which they lead to the identified thing)

    • Switch the order of the presentation to show your proposal of the PID format first; I would suggest that you also provide a comparison table to show how your PID format would improve the issues that you highlight

    • Address the issue raised around ISBNs

 

Formatting issues:

  • Please fix the citations; the HTML is not well-formatted. In some cases your citations appear as a link from an underscore symbol, and it is not even clear which statement the citation corresponds to (e.g. “This happens although there is an understanding that [u]nique identifiers, and metadata describing the data, and its disposition, should persist -- even beyond the lifespan of the data they describe.  A recent study of some 40 research data repositories... ” - in this case, I imagine you are citing the data citation principles to justify the first sentence, but it is not clear what citation is included to justify the statement about the 40 data repositories). Please revise all citations and make sure that they display properly.

  • The problem with the citations may also be the reason why the paper appears to introduce many acronyms without indicating what they stand for. For example, the introduction mentions ORCIDs, RORs, ARK, DOI and UUID, but the references are not included. Please fix these to include citations, especially on the first mention of each acronym. In addition, the paper would benefit from including a glossary listing all the acronyms and their definitions.

 

I look forward to receiving a thoroughly revised version of your article.

 

Many thanks,

Alejandra Gonzalez-Beltran (http://orcid.org/0000-0003-3499-8262)