Reviewer: While the author treats content IDs in some
amount of depth, there are three mentions of ORCID iDs that are not
referenced in any way. This may be confusing to readers not well-aware
of the persistent identifier community and this reviewer suggests that a
short explanation be added at the first mention (Section 5) and a
reference to a paper added. Also, in the Model in Section 7, ORCID
should be spelled with an "O" not a "0". I would suggest the author
consider either specifying that the paper is about persistent identifiers
for content ("things") OR to add a section on identifiers for persons
and organizations (for which there is currently a very brief
aside).
Response: References to a founding paper of ORCIDs and to RORs have been added as well as an introductory remark about the focus of the paper on PIDs for research outputs, 'things'.
Reasons to reject: I think there isn't a strong enough
connection between the overall argument (do identifiers adhere to FAIR
guidelines?) and the examples and data presented. The paper argues
against opaque identifiers, but doesn’t directly tackle commonly
accepted reasons to keep identifiers opaque (for example: persistent
identifiers need to remain the same over time and the metadata used to
describe the item being identified may change; non-opaque identifiers
may lead users to assume details about potential identifiers). There is
some discussion in the paper about the shortcomings of identifiers, but
it’s unclear how having a non-opaque identifier would help with this.
The issues described (Open access content vs. paywalled content) are not
related directly to the identifier string, but to the implementation of
the identifier. It would be useful to expand on this as well and perhaps
focus on how identifiers will be more discoverable and persistent if
their names reflect what they are - will this really make the
identifiers themselves more 'findable'? How much of this proposed
infrastructure change depends on Google indexing the correct piece of
metadata?
Response: I believe the onus is on those advocating opaque PIDs to show somehow that they are also more persistent than PIDs with some sort of semantic content. The argument put forward here is rather based on the observation that it is the continued use of PIDs that makes them persistent, with the possible implication, based on the case of ISBNs, that an inherent semantic structure will potentially enhance the chances of continued use. Admittedly, there is no conclusive proof of this assumption in the paper, but even so, the semantic content proposed here (in section 7) as part of the modular, contextual new PIDs comes primarily from the namespace prefixes and the associated object types, making these PIDs at least easier to interpret also in the future, even if or when they are no longer resolvable. (PIDs are part of metadata, and as has been observed, the challenge for long-term preservation of metadata is to keep them consistently and correctly interpretable over time.) Although the original example given in the model for a "new" PID in section 7 did appear to have some semantic content even in the objectId module, this is by no means a necessary condition, so I have changed the example accordingly to avoid giving that impression. A valid reason, rather than an as yet unproven enhanced persistence, for having at least the objectId opaque, might be to facilitate automated minting of locally unique strings. (The objectId module does not have to be globally unique, only unique within that namespace and / or possibly the objectType. That is, it is rather a Local ID in the anatomy of a web-based identifier. So, resembling the ISBNs in this respect, you might have a simple minting algorithm built on some sort of numerus currens within namespace and / or objectType, possibly reserving certain character combinations for certain object types.)
The case with the now paywalled public domain content, more than 110 years old, seized by a commercial publisher who is demanding 50 USD for 24 hours access, and hidden behind an opaque DOI, for which unpaywall.org is now unable to find an alternative open access version, although known to exist, I believe, does illustrate a possible disadvantage with completely opaque identifiers. Especially when paired with the false belief (ignoring the NUNA - the Non-Unique Naming Assumption of the semantic web), that there can or should be one and only one PID, such as a DOI, for the same content. This false assumption, encountered again from curators at natural history museums at TDWG2018 in Dunedin, will continue to "give commercial publishers the opportunity to pro-actively seize whatever public domain content there is out there on the internet, quickly mint and assign their DOI to it and then lock it up behind paywalls ...it would certainly not promote the use of PIDs instead of simple, ephemeral URLs for citation." And so, it would negatively affect the findability of documents or other digital objects. There is, I believe, an intimate relationship between accessibility and findability in this respect. Ideally, I think commercial agents should be blocked from using common namespaces to assign PIDs to content that is in the public domain, and then locking it up behind a paywall. They should be forced to create their own namespaces for this kind of piracy, and these namespaces could then be put on a public blacklist, possibly maintained by unpaywall.org.
Reasons to reject: I do not think adding detailed
semantics to an identifier is a good solution. What we should have is a
way of retrieving semantics for a given identifier.
Extended review: I do not believe including extensive semantics in the
identifiers is a good idea. We are trying to move away from this even
though it is very difficult. I would like to see a common service model
to retrieving metadata about identifiers. The paper also proposes a
model but there is no technical solution behind this model to find out
how difficult or easy to generate or maintain such a semantically rich
identifier. The paper does not describe which community is interested in
taking up such a semantically rich identifier scheme.
Response: This review somehow begs the question, by relying on the problematic "resolvability" of PIDs, or how else should the semantics of a given identifier be retrieved? For the objection about the difficulty in generating and maintaining a "semantically rich" PID, see in part the response above to Review 2, noting in particular that the semantic content in the proposed model comes primarily from the namespace prefixes and object types. With the further observations that ISBNs have been able to preserve a semantic structure, throughout their lifetime, and with the namespace custodians as designated maintainers and creators minting these "new" PIDs, this author finds it hard to understand why this model should be any more difficult to implement and maintain than the present.
Reasons to reject: There are a number of serious
inconsistencies within the manuscript that raise significant questions
about the manuscript's readiness for publication:
Conflating persistence of URIs with link rot
"The author states that “The prevalence of phenomena such as link rot implies that the persistence of URIs cannot be trusted.” The persistence of URIs is independent of link rot – just as the existence of an ISBN or DOI doesn’t guarantee that the book/paper is available. For digital identifiers, this is resolved through the use of tombstone pages."
Conflating PIDs with Open Access
"The authors state that “another reason resolvability may not be sufficient, even if the metadata is somehow in place, is that the file on the destination page resolved to is behind a paywall. In a case from December 2016, public domain content more than 110 years old was hidden behind a DOI-resolver charging 50$ for release of the content.” PIDs do not guarantee open access and shouldn't. In addition to potential costs of material (as noted above) there are many cases where content identified by a PID should not be available without access restrictions (e.g. Protected Health Information). For the examples given in the manuscript, the PIDs described in the manuscript do provide metadata for the resources identified. The authors continue to discuss this further “However, things have changed since then. When tried again now (Nov. 2018), the replacement unpaywall.org [22] and oadoi-API for 10.1080/00222930908692639 is actually not working anymore for this DOI; the response we get is: best_oa_location: null. But the resource sought, free from paywall, although no longer detected by unpaywall.org, can still be found at biodiversitylibrary.org, at several different URLs.” PIDs are needed to identify open content and content that has various restrictions (and not only restrictions due to cost). PIDs should not be a guarantee of open access to underlying content – however all PIDs should have appropriate metadata (see 3 below)."
Don’t need to jump to landing page for metadata:
"The authors state that “Again, going back to the question of resolvability, the relationship between identifiers such as DOIs and URIs/IRIs is not always straightforward, and sometimes involves a chain of redirects ('303s'), before reaching eventually a destination holding also the appropriate metadata.” However, that is not the case as the metadata for any DOI is easily available: https://search.crossref.org/?q=10.1080%2F00222930908692639 VII.—Descriptions of new genera and species of New-Zealand Coleoptera Journal Article published Jul 1909 in Annals and Magazine of Natural History volume 4 issue 19 on pages 51 to 71 Authors: Major T. Broun And metadata for un-resolvable DOI exists: Showing DOI matching 10.1002/(sici)1520-6297(199601/02)12:1<67::aid-agr6>3.3.co;2-# Acreage response under policy incompatibilities: The US durum wheat situation Journal Article published Jan 1996 in Agribusiness volume 12 issue 1 on pages 67 to 77 Authors: Roberto J. Garcia, James E. Quinton"
Potential issues with identifier explosion
"The authors contend that “Providing multiple access to, or identification of resources through PIDs, that are capable of serving as trustworthy, competent, valid independent witnesses from different moments in time, at different sites, in different places is a good idea. Thus, we accept “that an object may have multiple PIDs”. Ideally these multiple PIDs should get to "know about" each other as a way towards interoperability.” This explosion in identifiers potentially poses significant problems. A primary issue would be the systems required to keep the synonyms in check – the overall ecosystem would need another PID-like service that maintains this information. In addition, there already exist independent and trustworthy resources that resolve PIDs. For example, for DOIs the resources dx.DOI, n2t.net and identifiers.org (and others) all resolve metadata for DOIs."
Incorrect description of DOI requirements
"The authors state that “according to the same partial restriction, this entirely fake DOI is equally valid: 10.99999999/xxxxxxxx/x(y)x\:-{=?%%@@@@@“ However, this is incorrect. The initial component of the DOI (10.99999999) is organization that controls the ID space and this is not a valid DOI prefix."
Misstating the lack of usefulness of certain IDs
"The authors state that “UUIDs generated in this way by the gna name resolver, e.g. "707f84e1-e5b8-5063-8256-369ba9d72e13" for Antiaris toxicaria are next to useless as instruments of Findability, often yielding 0 hits by simple googling, all the while a search on the scientific name alone will give plenty of precision hits for the sought after organism, providing rich metadata for the 'thing' itself.” First this is not the issue with the PID itself - rather its utilization. There are many PIDs that have very good findability via Google as their use entails incorporation in the text of documents (e.g. database accession numbers from PDB or GEO, the use of RRIDs in methods sections). In addition, googling for names does not always produce proper results – a primary reason for the increased interest in the use of ORCIDs to enable author disambiguation."
Model suggested is not differentiated from current models
"The authors propose a model for PIDS: “Model: [namespacePrefix].[objectType].[objectId: 10 positions].[issuedDate: YYYY-MM-DD].[registrant: org.id/ORCID] Example (expression of this paper): fabio.PositionPaper.jPsaveXD17.2018-11-12.0000-0001-5699-994X” However, this model is very similar to current handles. It seems that the two pieces that determine the “prefix” that is who created it and for what purpose are split. In the proposed model. However, this would be similar to allowing anyone to create a handle with derived prefixes which is allowed by the handle infrastructure: “The GHR is a distributed registry whose operation is managed collaboratively by the DONA Foundation and multiple organizations that are credentialed and authorized by DONA. Each of those credentialed organizations are referred to as MPAs (or Multi-Primary Administrators). A credential is a number, made known to the GHR by DONA and allotted to a given MPA. MPAs, in turn, are authorized to allot derived prefixes from their credential to themselves and to third parties. The MPA and these third parties can provide identifier and resolution services (aka local handle services) for handles under the derived prefixes allotted to them. The GHR specifically communicates to the client, upon request, the location and certain relevant security information of the local handle services that can process the handle that the client is interested in resolving.” The authors also discuss that the “resulting PIDs should be minted within the corresponding namespaces, who would also be the 'custodians' and resolving authorities of their PIDs, responsible for their uniqueness within their namespace. Another task would be to monitor and assign sameAs-properties to PIDs that refer to the same 'thing' in other namespaces.” Would this then result in a collection of un-coordinated custodians? 
This is an issue with database accession numbers and has required construction of additional infrastructure to support the resolution of these PIDs. Coordination and management of the sameAs-properties could become problematic as well and isn’t discussed."
Response:
I am not sure I understand this objection. Link rot does sometimes prevent using a URI for direct identification of an object, and that is precisely what identifiers are for. There are not always "tombstone pages" available, and even when there are, as in the cases referred to in objection 3, for 10.1002/(SICI)1097-4571(199510)46:9<646::AID-ASI2>3.0.CO;2-1 and 10.1007/s11192-007-1682-3, these are not so "easily available" as being referred to directly from the destination page of the DOI-URI, where you are only met by the message "DOI Not Found" or "Page not found". This means at least that the PID-URI in these cases cannot be considered to be properly machine-actionable, a pre-requisite for FAIRness; unless you implement an automatic redirect in your workflow for these cases, you have to make a new manual search using the Crossref search-API https://search.crossref.org/?q=, something that may not be obvious to everyone using PIDs (including this author, who is thankful to the Reviewer for this piece of information). Of course, you could say that the PID-URIs are still persistent, while the links are not, but that is a mere play with words. The purpose of a URI = Uniform Resource Identifier, is to identify an object, an entity, and if the link does not work any longer, it simply fails duty. It is for this very reason that the paper argues for PIDs that are well distributed over the Internet, 'googlable' and not dependent on one single link-URI, one single custodian or "parent" to fulfil their mission. It is in this respect that ISBNs serve as an example, not because you will always find a copy of the actual book that they identify - it may be on loan at your local library or out of print, not in store, but because you will most often be able to identify the object that they are supposed to identify, through many online library catalogs, bookstores, reference lists, net resources etc.
I agree with the view that "PIDs do not guarantee open access and shouldn't." However, we are discussing here the A in FAIR, Accessibility, and I do not think it should be the purpose of PIDs to make content that belongs in the public domain in fact less accessible, as in the case referred to here. For the rest, see my response to Review 2 above.
First, it should be noted that the introductory quoted sentence in the remark beginning "Again, going back to the question of resolvability ..." has two references, representing Crossref and DOI, so this statement is not something this author claims originality for. For the rest, see my response to Review 4, point 1 above.
The final sentences under this point put in doubt whether the reviewer has actually read the paper under review carefully, in which, under section 4, the following information is found:
Note that we are not talking here about simply having more than
one proxy server acting as resolvers of the same PIDs. We
already have that; provided the lookup-table is managed
properly, the three different DOI-URIs from three different
proxy-servers all resolve to the same landing-page location:
https://doi.org/10.1007/978-3-319-53637-8_11,
https://hdl.handle.net/10.1007/978-3-319-53637-8_11 and
https://identifiers.org/doi:10.1007/978-3-319-53637-8_11. ARKs
(Archival Resource Keys) are resolved by identifiers.org and
n2t.net, as well as by their "mother institutions", e.g.
n2t.net/ark:/67531/metapth346793/,
identifiers.org/ark:/67531/metapth346793/ and
digital.library.unt.edu/ark:/67531/metapth346793/ resolve the
same content.
As for the propositions quoted in the beginning of the reviewer's point 4, they all lean on, as their main reference, a presentation by Jonathan Clark, representing the International DOI Foundation, at the first PIDapalooza in Reykjavik 2016. So, again, this does not represent a major invention by this author. And if you read further from there, you will also find a reference, again, to the NUNA, the Non-unique Naming Assumption. Maybe this is not enough to dissipate the fear of an "identifier explosion", which I believe has already happened, but I do agree that we need more agents, not only one central PID-like service to "keep synonyms in check", and we need them already today. The paper refers to the SPARQL endpoint of identifiers.org as one such important agent, but this should be a common responsibility of all those namespace custodians who want to mint PIDs of their own, thus to contribute to the FAIR effort of making PIDs, as an essential part of metadata, Re-usable also in the sense that they are "sufficiently well-described and rich that it can be automatically (or with minimal human effort) linked or integrated, like-with-like, with other data sources" (FAIR Guiding Principles 4.2). Perhaps this responsibility should also make them avoid minting new PIDs for the same 'things' in vain, when there are already sufficiently well-known, well described, 'validatable' PIDs in widespread use for the same objects.
Again, the impression is received that the reviewer has not read the paper under review sufficiently carefully to see that "the same partial restriction" here naturally refers back to the preceding regex above (to make this evident it is spelled out in the revised version of the paper submitted with this response to reviews). And the restriction by this regex fully matches with this fake DOI, no matter what the prefix rules for DOIs say, as shown below.
Furthermore, unfortunately, the same goes for the even less restrictive rules for DOI validation used by DataCite, "10\..+/.+", or the pattern registered for DOIs at identifiers.org as "^(doi\:)?\d{2}\.\d{4}.*$" (MIR:00000019), both of which also accept the fake DOI above as valid when tested in regex101.com. It is of course possible to construct a more restrictive regex that rules out false DOI prefixes, and thus would have this fake DOI fail to validate, but that evidently remains to be implemented in the validation schemas used by important agents such as DataCite and identifiers.org. And, as is further argued in the paper, given the quite lax initial rules for DOI structure, once you try to be more restrictive, you will not catch all the now prevalent and permitted DOIs by one single regular expression. So, as is stated in the paper, "the more restrictive the validation rule or regular expression it is based on, the more actually existing DOIs it will leave out."
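The behaviour described in this point can be reproduced with a short Python sketch. The first two patterns are the ones quoted above; the third, STRICT, is an assumption added purely for illustration (a stricter pattern of the kind sometimes suggested for modern DOIs) and is not taken from the paper. The test strings are the fake DOI from the paper, an older SICI-style DOI quoted by the reviewer in point 3, and a DOI used in section 4.

```python
import re

# Patterns quoted in the response above.
DATACITE = re.compile(r"10\..+/.+")
IDENTIFIERS_ORG = re.compile(r"^(doi\:)?\d{2}\.\d{4}.*$")

# Assumption: a stricter, illustrative pattern that is NOT from the paper.
STRICT = re.compile(r"^10\.\d{4,9}/[-._;()/:A-Z0-9]+$", re.IGNORECASE)

FAKE_DOI = r"10.99999999/xxxxxxxx/x(y)x\:-{=?%%@@@@@"
OLD_DOI = "10.1002/(sici)1520-6297(199601/02)12:1<67::aid-agr6>3.3.co;2-#"
MODERN_DOI = "10.1007/978-3-319-53637-8_11"


def check(doi: str) -> dict:
    """Report which of the three patterns accept the given string."""
    return {
        "datacite": DATACITE.fullmatch(doi) is not None,
        "identifiers_org": IDENTIFIERS_ORG.match(doi) is not None,
        "strict": STRICT.match(doi) is not None,
    }


if __name__ == "__main__":
    for doi in (FAKE_DOI, OLD_DOI, MODERN_DOI):
        print(doi, check(doi))
```

Under these assumptions, the two lax patterns accept all three strings, while the stricter pattern rejects the fake DOI but also rejects the older, registered SICI-style DOI, which is the trade-off quoted from the paper.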
One reason for the uselessness of the UUIDs issued by the gna name
resolver as an instrument of Findability for the organism, the
'thing itself' is clearly stated in the paper in the sentences just
preceding the one quoted by the reviewer: "Note, however, that it is
actually the specific name string that is identified here,
not the object, the organism, the 'thing itself'. Thus, the
resulting UUID is completely dependent upon the particular name
string (with its encoding), it cannot be used as a bridge between
different name forms for the same organism, telling us that they are
naming the same object. This is due to the fact that it is
generated by hashing a namespace identifier and name." This is what
the "in this way" in the sentence quoted (in part) by the reviewer
refers to. But I agree that the uselessness of the UUIDs in this
respect is at least partly due also to their not being further
utilized for citations or disseminated to other databases, while
suspecting this fact is closely related to the preceding way in
which they are created as hashes of name strings, rather
than as identifiers of objects, concepts, 'things'. The point I am
trying to make is that PIDs must be used to be persistent, and that the
chances of becoming used and utilized for identification are better
if the PID by its own structure, revealing some semantic content,
somehow reflects what kind of 'thing' it is supposed to identify. In
this sense, I believe a quite long, totally opaque string, such as
the UUID-5, that on the face of it appears to be more or less random
(although in reality there is a strict hashing algorithm behind it),
might possibly be less successful in this respect. I know there are
other databases, with identifiers, often shorter, such as accession
numbers, that are more successful in this respect, e.g. UniProt,
whose PIDs are also resolved by identifiers.org, but then only if
you use the right namespace prefix, e.g.
http://identifiers.org/UNIPROT:P68512, which already tells you
something about what kind of 'thing' you are supposed to identify,
i.e. having some semantic content. One might add that databases such
as UniProt are known to have made an extra effort to be findable by
'googling' through the use of schema.org markup. This is something
that should be encouraged in all PID-custodians or producers, as a
way of disseminating and enhancing the use of these PIDs! Naturally,
'googling' is not the only solution to enhanced findability, and
certainly googling for names does not always produce good results.
In particular, in section 2 of the paper it is argued that "While
scientific names are often useful for describing objects, they have
other drawbacks compared to PIDs, some of which were identified by
. For example, homonymy and
disambiguation should generally be a lesser problem for globally
unique identifiers." So, this remark about ORCIDs seems a little bit
misplaced or at least redundant here. Please note also, that it is
for a reason, i.e. precisely the awareness of the sometimes dubious
results of 'googling', that the estimated precision rates were
included in some of the examples given in the paper. (Recall rate,
unfortunately, is harder, perhaps even impossible, to estimate in
these cases.)
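The dependence of a name-based UUID on the exact name string can be illustrated with Python's standard uuid module. This is a minimal sketch: the namespace below is a hypothetical one derived from "example.org" and is an assumption, since the namespace actually used by the gna name resolver is not given here, so the printed values will not equal the UUID quoted above. The name variants are made up for illustration.

```python
import uuid

# Assumption: a hypothetical namespace, not the one used by the gna resolver.
NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "example.org")


def name_uuid(name: str) -> uuid.UUID:
    """Version-5 UUID: a SHA-1 hash of the namespace identifier and the name string."""
    return uuid.uuid5(NAMESPACE, name)


# Three name forms that a human reader would take to denote the same organism.
VARIANTS = [
    "Antiaris toxicaria",
    "Antiaris toxicaria Lesch.",
    "antiaris toxicaria",
]

if __name__ == "__main__":
    for name in VARIANTS:
        print(f"{name!r:32} -> {name_uuid(name)}")
```

The same string always yields the same UUID, but each variant spelling yields a different one, so the identifier cannot by itself tell us that the three strings name the same 'thing'.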
This objection is also a bit difficult to understand. First, the
paper makes no specific claim of originality in this respect. That
is why in the beginning of section 7 the word model is
emphasized and there are quotation marks surrounding "new" (now in
the revised version also in the section heading, to make it
clearer): So, here we finally suggest a model for a
"new" PID, with a limited character set, at least for the object
id part, defined by namespace specifications and schemas.
The "mission" of the paper is simply to explore the concept of a
PID, looking at real examples, their positive and negative features,
and find out what additional requirements there might be to make
them fully findable, accessible, interoperable and re-usable - FAIR.
The "novelty" of the paper, if any, would then rather be the
"widening" of the FAIR principles to include also Findability as
rate of distribution or dissemination (measured by means of
'googling') and Interoperability or Re-usability to include also
'validatability'. As for the custodianship and minting of PIDs, the
model proposed is that this should be the responsibility of the
namespaces to which they belong. I do not see why these, already
assuming the administration of specifications, validation schemas,
vocabularies or ontologies, should be any less qualified for this
task, than the MPAs in the Handle system. The minting algorithm, the
patterns for PID-recognition, restriction in character set,
string-length (with possible checksum) of objectId module should all
be part of the validation schema. These namespaces should then be
able to register their schemes with n2t.net or
identifiers.org, as already happens. So, why should the
danger of "a collection of un-coordinated custodians" be any more
real than today? Besides, PIDs are meant to work on a
network, essentially by means of links between nodes in
the network. Have we not come further beyond the idea of a central
authority ruling everything? Possibly many different actors and
namespace custodians could contribute to creating "sameAs" links
between PIDs identifying the same 'things'. And there might be
several services such as the SPARQL endpoint of
identifiers.org for registering such links.
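As an illustration of what a validation schema for the model in section 7 might contain, the following Python sketch checks a PID of the form [namespacePrefix].[objectType].[objectId].[issuedDate].[registrant]. The character sets chosen for each module are assumptions, only the ORCID form of the registrant is handled, and the names MODEL, orcid_check_char and parse_pid are hypothetical. The ORCID check character is computed with the ISO 7064 MOD 11-2 algorithm used for ORCID iDs.

```python
import re
from datetime import date

# Assumed character sets per module; the paper leaves these to the namespace schema.
MODEL = re.compile(
    r"^(?P<namespacePrefix>[a-z][a-z0-9]*)\."
    r"(?P<objectType>[A-Za-z]+)\."
    r"(?P<objectId>[A-Za-z0-9]{10})\."
    r"(?P<issuedDate>\d{4}-\d{2}-\d{2})\."
    r"(?P<registrant>\d{4}-\d{4}-\d{4}-\d{3}[\dX])$"
)


def orcid_check_char(base15: str) -> str:
    """ISO 7064 MOD 11-2 check character for the first 15 digits of an ORCID iD."""
    total = 0
    for ch in base15:
        total = (total + int(ch)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)


def parse_pid(pid: str):
    """Return the modules of a well-formed PID as a dict, or None if validation fails."""
    match = MODEL.match(pid)
    if match is None:
        return None
    parts = match.groupdict()
    try:
        date.fromisoformat(parts["issuedDate"])  # rejects impossible dates
    except ValueError:
        return None
    digits = parts["registrant"].replace("-", "")
    if orcid_check_char(digits[:15]) != digits[15]:
        return None
    return parts


if __name__ == "__main__":
    print(parse_pid("fabio.PositionPaper.jPsaveXD17.2018-11-12.0000-0001-5699-994X"))
```

A check of this kind works on the string alone, so it remains usable even when the PID is no longer resolvable.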
Reviewer:You mention that “the FAIR principles do not say
anything explicitly about validation”. However, to be reusable (in
R1.3), the FAIR principles require (meta)data to meet domain-relevant
community standards. This means precisely that “metadata can be properly
validated against a schema, as adhering to an accepted metadata
standard”, which you indicate it is not covered. For more details on
this, you can check the ongoing work on implementing the FAIR principles
(some of which you included as reference: e.g. FAIR metrics).
Response: I do insist that "the FAIR principles do not say anything explicitly about validation." If implied by the FAIR principle R1.3, it is only indirectly so and in reality open to interpretation. There are several cases where general data repositories, while professing to be FAIR and claiming to adhere to accepted metadata standards both for their default output and export formats, still fail to validate against schemas of these same standards. Even the most elaborate explication of the FAIR principles, by fairmetrics.org, describes R1.3 ((meta)data meet domain-relevant community standards) as measuring simply a "Certification, from a recognized body, of the resource meeting community standards" by means of a valid electronic signature, such as a verisign signature. But then one might ask again whether general data repositories such as Harvard's Dataverse, Figshare or Zenodo qualify as "recognized bodies" in this respect, all being part of the test reported in "Evaluation_Of_Metrics/Supplementary Information_ FM Evaluation Results.pdf", but none of which could be evaluated on this measure R1.3. This comes as no surprise, since there is already a comment in fairmetrics.org FM_R1.3 saying that "Such certification services may not exist, but this principle serves to encourage the community to create both the standard(s) and the verification services for those standards." True, in the rationale for FM_R1.3 there is mention of validation: "... As such, data should be (individually) certified as being compliant, likely through some automated process (e.g. submitting the data to the community's online validation service)". But it remains unclear if the "community" referred to here is defined by a certain general metadata standard, or by a repository, using its own standard and validation service.
Some output metadata files from repositories even lack a schemaLocation reference, making it difficult to validate them, or, the schemaLocation given might be erroneous, as observed in one case. We cannot just wait for the repositories themselves to provide verification of compliance with standards. We must use available validation tools, testing to what extent they are keeping their promises, knowing already that they do not always produce valid metadata in compliance with the standards they profess to adhere to.
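A first, very simple test of this kind is to check whether a metadata file declares a schemaLocation at all. The following is a minimal sketch using Python's standard library; the function name schema_locations, the namespace http://example.org/ns and both sample records are hypothetical, and only the root element is inspected.

```python
import xml.etree.ElementTree as ET

# Qualified name of the xsi:schemaLocation attribute.
XSI_SCHEMA_LOCATION = "{http://www.w3.org/2001/XMLSchema-instance}schemaLocation"


def schema_locations(xml_text: str) -> dict:
    """Map each namespace URI to the schema URL declared on the root element."""
    root = ET.fromstring(xml_text)
    tokens = root.get(XSI_SCHEMA_LOCATION, "").split()
    # The attribute value is a whitespace-separated list of (namespace, location) pairs.
    return dict(zip(tokens[0::2], tokens[1::2]))


# Hypothetical sample records, for illustration only.
WITH_LOCATION = (
    '<record xmlns="http://example.org/ns" '
    'xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" '
    'xsi:schemaLocation="http://example.org/ns http://example.org/ns/schema.xsd">'
    "<title>Sample</title></record>"
)
WITHOUT_LOCATION = '<record xmlns="http://example.org/ns"><title>Sample</title></record>'

if __name__ == "__main__":
    print(schema_locations(WITH_LOCATION))
    print(schema_locations(WITHOUT_LOCATION))
```

An empty result means that a validator has to be told by other means which schema to use. A non-empty result still needs to be checked, since the location given may itself be erroneous.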
Reviewer: You refer to identifiers validation and
patterns for identifier types. These patterns are maintained at
identifiers.org (see e.g.
https://www.ebi.ac.uk/miriam/main/collections/MIR:00000110). Have you
considered that?
Response: Well, yes, now I have, but it does not change
anything, except that for ARKs, the pattern deposited there,
^(ark\:)/*[0-9A-Za-z]+(?:/[\w/.=*+@\$-]*)?(?:\?.*)?$, is a little bit more
restrictive than the arkspec.txt, by
excluding the character '#' from the allowed set. Otherwise, it only
confirms the view expressed in the paper that ... apart from the specific
structure, there is no specific pattern or definite string-length of an
ARK. The only restrictions on the Name and Qualifier parts "as strings
of visible ASCII characters" is that they "should be less than 128 bytes
in length" ...
This is also evident from the following image
showing the (Python)-validation of a very long and awkward string fully
matching the pattern in MIRIAM:
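A check of this kind can be sketched in Python as follows. The pattern is the one quoted above from MIRIAM; the long test string and the string containing '#' are made up for illustration, while the third is the ARK cited in section 4 of the paper.

```python
import re

# ARK pattern as deposited at identifiers.org (MIRIAM), quoted above.
ARK_PATTERN = re.compile(r"^(ark\:)/*[0-9A-Za-z]+(?:/[\w/.=*+@\$-]*)?(?:\?.*)?$")

# An ARK cited in section 4 of the paper.
REAL_ARK = "ark:/67531/metapth346793/"

# Hypothetical test strings: a very long, awkward string and one containing '#'.
AWKWARD = "ark:/0/" + "x=*+@$-._/" * 50
WITH_HASH = "ark:/67531/metapth346793/#page"


def is_valid_ark(candidate: str) -> bool:
    """True if the string matches the deposited ARK pattern."""
    return ARK_PATTERN.match(candidate) is not None


if __name__ == "__main__":
    for s in (REAL_ARK, AWKWARD, WITH_HASH):
        print(len(s), is_valid_ark(s))
```

The awkward string is 507 characters long and still matches, since the pattern sets no limit on length. The string containing '#' is rejected.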
Similarly, the DOI-pattern registered at identifiers.org (as shown above, under Review 4, point 5) does not prevent fake DOIs from passing as valid. I do not know who is "maintaining" these patterns in MIRIAM, whether they are only deposited as is, at the time of registration of a prefix in the identifiers.org database by the responsible depositing agent, or whether they are actually managed and updated by identifiers.org staff. There is, indeed, sometimes quite a large time span between Date of creation and Date of last modification in MIRIAM for a Data collection such as DOI, but it is not clear to me if the modifications made also involve these validation patterns. It would be an advantage to record also what modifications were made and when, i.e. including a record history, at least in the RDF/XML version. The thing with DOIs in particular, as shown in the paper, is also that due to the initial lack of pattern restrictions they differ so much from each other that it is practically impossible to find a sufficiently restrictive pattern that does not at the same time exclude many prevalent, resolvable DOIs from validating OK (as shown in the paper and above under Review 4, point 5).
Reviewer: You refer to ordinary or plain URIs and
resolvable URIs. I recommend checking ‘Study on Persistent URIs’
(https://philarcher.org/diary/2013/uripersistence/), which provides
information on the persistent identifiers in the context of the Web
architecture. I recommend you follow the terminology defined there.
Another important reference is ‘Cool URI’s don’t change’
(https://www.w3.org/Provider/Style/URI).
Response: It is not clear to me exactly which parts of the terminology used there should be followed, in particular since even within this document itself and in the next recommended reference in the Metareview different terms seem to be used for the same things, e.g. what is the difference between an "ID string", an "item ID" and a "Local ID" (for which I used the term 'objectId' in the paper)? I would be willing to adopt any terms that were truly felt to represent a common global standard, but in the documents referenced here no such terms appear to be easily discernible. The closest I get is "local id", used - in different forms - in at least two of the referenced documents, so I would be prepared to use that instead of 'objectId' to comply, although I think the latter better describes its place in the model proposed here.
Reviewer: I also recommend you check the paper
“Identifiers for the 21st century: How to design, provision and reuse
persistent identifiers to maximize utility and impact of life science
data” (https://doi.org/10.1371/journal.pbio.2001414, disclaimer: I’m one
of the authors). You suggest adding context to the identifiers - please
check Lesson 4 as well as, e.g., the OBO Foundry identifiers policy
(http://www.obofoundry.org/id-policy.html) for resources explaining why
identifiers should be opaque.
Response: I am grateful for these instructive and helpful
references, of which I was not aware before. However, I do not think they in
essence contradict anything that is said in the paper. I believe I have
answered some of the issues from Lesson 4 regarding the preferable opaqueness
of PIDs above, in the responses to Review 2 and 3. It was never my intention
to rely on embedded meaning for uniqueness in the local ID (corresponding to
the objectId in the proposed model), although I admit this might be the
impression given by the example in section 7 of the paper. This has now been
revised. I note at the same time that Lesson 4 does allow for embedded meaning
under some particular conditions: "Meaning should only be embedded if it is
indisputable, unchangeable and also useful to the data consumer (e.g.,
computer-processable). For instance, the type of entity imparts meaning
to users and may fulfil these 3 criteria. When encountered, typing may
be embedded, either within the local ID (ENSMUSG…), or within the http
URI path (…/gene/12345), or both."
But it is not clear to me why
Lesson 4 further states, without giving a reason: "In any case, if
you opt to include type in the identifiers you issue, avoid relying on
type for uniqueness: that is to say, once a local ID (e.g., 12345) is
assigned it should never be recycled for another entity, even an entity
of a different type (e.g., …/gene/12345 and …/patient/12345)."
Nevertheless, I believe the proposed restricted character set in the model
should be wide enough to guarantee local uniqueness
within most namespaces, without resorting also to the object (or 'entity')
type module to differentiate between local ids. Alternatively, this might be
done by reserving certain characters in the local id (objectId) for specific
object types. Admittedly, though, the argument in Lesson 5 for not relying
on case for uniqueness appears to be compelling, which would reduce the
number of unique permutations to 34e10, using only either lower or upper
case characters. So, the model in the paper has been revised accordingly.
Hopefully, it will still be wide enough to allow for local uniqueness within
the namespace.
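The two uniqueness rules discussed in this response (Lesson 4: never recycle a local ID for an entity of another type; Lesson 5: do not rely on case for uniqueness) can be sketched as a small registry. This is only an illustration under stated assumptions: the class name, the method name and the example local IDs are hypothetical and are not part of the model in the paper.

```python
class LocalIdRegistry:
    """Hypothetical registry for the local IDs (objectIds) of a single namespace."""

    def __init__(self):
        # Maps the case-folded local ID to the object type it was first issued for.
        self._issued = {}

    def mint(self, local_id, object_type):
        """Register a local ID; return False if it is already taken.

        Uniqueness is checked on the case-folded ID alone, so neither
        letter case nor object type can make two IDs distinct.
        """
        key = local_id.casefold()
        if key in self._issued:
            return False
        self._issued[key] = object_type
        return True


registry = LocalIdRegistry()
first = registry.mint("ab12cd34ef", "gene")                # new ID, accepted
other_type = registry.mint("ab12cd34ef", "patient")        # same ID, other type
case_variant = registry.mint("AB12CD34EF", "gene")         # differs only by case
print(first, other_type, case_variant)
```

In this sketch the object type is recorded but plays no part in the uniqueness check, which is the behaviour Lesson 4 asks for.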
Reviewer: While the paper aims at discussing identifiers
in the context of the FAIR principles, the principles themselves are
interpreted without referring to the different elements in their
definition (that you included in Figure 1). For example, when discussing
findability, you refer to ‘googling’ rather than analysing the actual
four elements described in the principle to be findable
(F1,F2,F3,F4).
Response: That is true, to the extent that the paper focuses on what is believed to be missing in the definitions and explications of the FAIR principles, e.g., for Findability, the distribution (dissemination) and use of PIDs for citations. True, F4 could be interpreted as requiring at least open display of a PID as part of the metadata, which might promote the use of PIDs for citation, but this is apparently not sufficient, judging from the case with , as indicated in the paper. However, to first analyse the FAIR principles in detail, before pointing out what is perceived as missing, would have required more space than the 8000-word allowance for a position paper. That said, I do for example stress the "sine qua non" of machine-actionable or machine-readable metadata, which is also used by fairmetrics.org as a measure of Findability. Similarly, for Interoperability or Re-usability, the focus is on being 'validatable', which is perceived as somewhat neglected in the explication of FAIR, as argued above, but which also requires metadata, and PIDs in particular, to be machine-actionable.
Reviewer: The whole text is driven by examples to discuss
the arguments you make, where some conceptual elements are conflated and
or confused (see points above and Reviewers #4 points). Instead, I
recommend discussing strong conceptual points, which should be supported
by the examples, rather than the other way around.
Response: I am unsure what to make of this objection. The main purpose of the paper is to analyse some of the most prevalent PIDs used in scholarly communication, to identify some of their shortcomings, and to find out how PIDs could be made more "FAIR", while explicating the FAIR principles to include also distribution (dissemination) and 'validatability'. I have tried to respond to the points above and those of Reviewer #4 to the best of my knowledge.
Reviewer: As indicated by the reviewers, you need to
differentiate your proposed model from existing ones, explaining how it
addresses their drawbacks and what are the strengths of the new model.
Please also address where the model is/will be applied and what is the
community that is addressing.
Response: Last things first: the model is free to be used by any namespace custodians willing to take on the responsibilities of minting PIDs and maintaining a PID schema, within any scholarly discipline, whether STEM, social sciences or humanities. So, while I am not prescribing or trying to impose the model on any particular community, I believe it could be used by most existing namespaces within scholarly communication today and that, as indicated in section 7, it could also be integrated with already existing PID schemas. The distinguishing features of this model are simply that it makes the connection of object types (or resource types) with namespace prefixes explicit, thereby adding semantic content to the PIDs and allowing them to "identify themselves", and that it adds the formal requirement of 'validatability' (by means of restrictions on string length and character set, pattern recognition, and possibly also a checksum, depending on the minting algorithm used). It might be said that these features can already be found in other existing PID models, e.g. in the IGSN, which I suggest could easily be integrated with this model, in particular since it apparently has only one 'object type'. But I do not find the importance of having PIDs also convey semantic content by means of 'object types' strongly emphasized in any other model, at least not in the other examples of PIDs examined here. And the strong reaction by reviewers to this suggestion indicates that it is not a regular part of other PID models.
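What 'validatability' by string length, character set, pattern and checksum could look like in practice is sketched below in Python. Everything concrete in it is an assumption made for the illustration and is not taken from the model in the paper: the layout namespace:type/local-id, the lengths, the lower-case alphabet, the example prefix "demo" and the simple weighted-sum check character (which is not a standard checksum scheme).

```python
import re

# Hypothetical PID layout: <namespace>:<object type>/<9-char local ID><1 check char>
ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyz"  # assumed lower-case-only character set
PID_PATTERN = re.compile(
    r"^(?P<namespace>[a-z]{2,8}):(?P<type>[a-z]{2,12})/"
    r"(?P<local>[0-9a-z]{9})(?P<check>[0-9a-z])$"
)


def check_char(local_id):
    """Illustrative check character: position-weighted sum modulo the alphabet size."""
    total = sum((pos + 1) * ALPHABET.index(ch) for pos, ch in enumerate(local_id))
    return ALPHABET[total % len(ALPHABET)]


def validate(pid):
    """Return the parsed parts of a well-formed PID, or None if validation fails."""
    match = PID_PATTERN.match(pid)
    if match is None:
        return None  # wrong length, character set or overall pattern
    if check_char(match["local"]) != match["check"]:
        return None  # pattern is fine, but the check character does not fit
    return {
        "namespace": match["namespace"],
        "type": match["type"],
        "local": match["local"],
    }


local = "ab12cd34e"
pid = "demo:dataset/" + local + check_char(local)
print(validate(pid))                       # parsed parts, including the object type
print(validate("demo:dataset/AB12CD34E0")) # upper case is outside the character set
```

In this sketch the object type can be read directly off a valid PID, which is one way of letting a PID "identify itself", while the check character rejects many mistyped strings that a pattern alone would accept.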
Reviewer: Please, add a conclusion section summarising the
paper contributions.
Response: I have now added a short conclusion section. I have also expanded some parts, to answer some of the calls from reviewers, while other parts had to be removed to stay within the given limits of a position paper. (Notably, the part about signposting.org and its references was left out, but kept as "comments" in the RASH-html, as part of the "record history" of the revised paper.)
Reviewer: Please, also revise the text (e.g. there are
typos such as ‘homonymi’ instead of homonymy).
Response: I have revised and partly compressed the text, correcting the typos found; thank you for bringing them to my attention. I also revised and corrected the original Schematron schema code example, which was in fact not well-formed before. (Surprisingly, though, no reviewer made any mention of that.)
California Digital Library (2018). Archival Resource Key (ARK) Identifiers. http://n2t.net/e/ark_ids.html
Kunze, J. & Roberts, R. (2008). The ARK Identifier scheme. http://n2t.net/e/arkspec.txt
Clark, J. (2016). PIDvasive: What's possible when everything has a persistent identifier? PIDapalooza, November 10, 2016. Retrieved Jan 16, 2017. https://doi.org/10.6084/m9.figshare.4233839.v1
Catalogue of Life: Annual Checklist (2015). Asterolibertia gibbosa (Gaillard) Hansf. 1949. http://www.catalogueoflife.org/annual-checklist/2015/details/species/id/4f5bf9e96f36e1c530b147c7105e865b
Coyle, K. et al. (2014). How Semantic Web differs from traditional data processing. RDF Validation in the Cultural Heritage Community. International Conference on Dublin Core and Metadata Applications, Austin, Oct. 2014. Date accessed: 24 Mar. 2017. http://dcevents.dublincore.org/IntConf/dc-2014/paper/view/311
Cruz, M., Kurapati, S., & Turkyilmaz-van der Velden, Y. (2018). The Role of Data Stewardship in Software Sustainability and Reproducibility. Zenodo. 2018-09-14. https://doi.org/10.5281/zenodo.1419085
DataCite Metadata Working Group. (2017). DataCite Metadata Schema 4.1. https://doi.org/10.5438/0015
Doorn, P., Dillo, I. (2017). Assessing the FAIRness of Datasets in Trustworthy Digital Repositories: A Proposal. IDCC Edinburgh, 22 February 2017. http://www.dcc.ac.uk/webfm_send/2481
Duerr, R.E. et al. (2011). On the utility of identification schemes for digital earth science data: an assessment and recommendations. Earth Science Informatics 4:139. ISSN: 1865-0473 (Print) 1865-0481 (Online) https://doi.org/10.1007/s12145-011-0083-6
Dunning, A., de Smaele, M., Böhmer, J. (2017). Are the FAIR Data Principles fair? Practice Paper. 12th International Digital Curation Conference (IDCC 2017), Edinburgh, Scotland, 20 - 23 February 2017. https://doi.org/10.5281/zenodo.321423
Fenner, M. (2016). Cool DOI's. DataCite Blog. https://doi.org/10.5438/55e5-t5c0
Force11 (2016a). The FAIR Data Principles. https://www.force11.org/group/fairgroup/fairprinciples
Force11 (2016b). Guiding Principles for Findable, Accessible, Interoperable and Re-usable Data Publishing version B1.0. https://www.force11.org/fairprinciples
FAIRMetrics (2018). FM-F2 https://purl.org/fair-metrics/FM_F2
FAIRMetrics (2018). FM_R1-3 https://purl.org/fair-metrics/FM_R1.3
Data Citation Synthesis Group, Martone M. (ed.) (2014). Joint Declaration of Data Citation Principles. San Diego, CA: FORCE11. https://www.force11.org/group/joint-declaration-data-citation-principles-final
Force11 (2016). Guiding Principles for Findable, Accessible, Interoperable and Re-usable Data Publishing version b1.0 San Diego, CA: FORCE11 https://www.force11.org/node/6062/#Annex6-9
Gertler, A., Bullock, J. (2017). Reference Rot: An Emerging Threat to Transparency in Political Science. The Profession. http://doi.org/10.1017/S1049096516002353
Gilmartin, A. (2015). DOIs and matching regular expressions. Crossref Blog, 2015-08-11. https://www.crossref.org/blog/dois-and-matching-regular-expressions/
Global Names Architecture - GNA (2015). New UUID v5 Generation Tool -- gn_uuid v0.5.0. http://globalnames.org/news/2015/05/31/gn-uuid-0-5-0/
Global Names Architecture - GNA (2015b). Global Names Resolver http://resolver.globalnames.org/
Guo, Xinjiang (2016). Yale Persistent Linking Service. PIDapalooza, November 10, 2016. Retrieved Jan 16, 2017. https://doi.org/10.6084/m9.figshare.4235822.v1
Guralnick, R. et al. (2015). Community Next Steps for Making Globally Unique Identifiers Work for Biocollections Data. ZooKeys 494: 133–154. https://doi.org/10.3897/zookeys.494.9352
Hayes, C. (2016). oaDOI: A New Tool for Discovering OA Content. Scholars Cooperative, Wayne State University. http://blogs.wayne.edu/scholarscoop/2016/10/25/oadoi-a-new-tool-for-discovering-oa-content/
Hayes, C. (2017). Unpaywall: A New OA Discovery Tool. Scholars Cooperative, Wayne State University. https://blogs.wayne.edu/scholarscoop/2017/03/20/unpaywall/
Hennessey, J., Xijin Ge, S. (2013). A Cross Disciplinary Study of Link Decay and the Effectiveness of Mitigation Techniques. Proceedings of the Tenth Annual MCBIOS Conference. BMC Bioinformatics, 14(Suppl 14):S5. https://doi.org/10.1186/1471-2105-14-S14-S5
Jones, SM., Van de Sompel, H., Shankar, H., Klein, M., Tobin, R., Grover, C. (2016). Scholarly Context Adrift: Three out of Four URI References Lead to Changed Content. PLoS ONE 11(12): e0167475. https://doi.org/10.1371/journal.pone.0167475
Kille, L.W. (2015). The growing problem of Internet "link rot" and best practices for media and online publishers. https://journalistsresource.org/studies/society/internet/website-linking-best-practices-media-online-publishers
Klein, M., Van de Sompel, H., Sanderson, R., Shankar, H., Balakireva L., Zhou, K., Tobin, R. (2014). Scholarly Context Not Found: One in Five Articles Suffers from Reference Rot. PLoS ONE 9(12): e115253. https://doi.org/10.1371/journal.pone.0115253
Kunze, J., Russell, M. (2006). Noid - search.cpan.org. http://search.cpan.org/~jak/Noid/noid
Li, C. & Sugimoto, S. (2014). Provenance Description of Metadata using PROV with PREMIS for Long-term Use of Metadata. Proceedings of the International Conference on Dublin Core and Metadata Applications (Austin TX, 2014). http://dcpapers.dublincore.org/pubs/article/view/3709
McMurry, JA. et al. (2017). Identifiers for the 21st century: How to design, provision, and reuse persistent identifiers to maximize utility and impact of life science data. PLoS Biol 15(6): e2001414. http://doi.org/10.1371/journal.pbio.2001414
Page, R. (2016). Towards a biodiversity knowledge graph. Research Ideas and Outcomes 2: e8767 (07 Apr 2016). http://doi.org/10.3897/rio.2.e8767
Paskin, N. (1999). Toward Unique Identifiers. Proceedings of the IEEE 87(7):1208 - 1227. https://doi.org/10.1109/5.771073
Patterson, D. et al. (2016). Challenges with using names to link digital biodiversity information. Biodiversity Data Journal 4: e8080 (25 May 2016). https://doi.org/10.3897/BDJ.4.e8080
Philipson, J. (2017). About a BUOI: joint custody of persistent universally unique identifiers on the web, or, making PIDs more FAIR. SAVE-SD 2017 http://cs.unibo.it/save-sd/2017/papers/html/philipson-savesd2017.html
Philipson, J. (2019). The Red Queen in the Repository: metadata quality in an ever-changing environment. IDCC 2019. (In press). https://doi.org/10.5281/zenodo.2276777
SESAR - System for Earth Sample Registration (2017). What is the IGSN? http://www.geosamples.org/aboutigsn
Signposting.org (2017). Identifier - Signposting the Scholarly Web http://signposting.org/identifier/
Unpaywall.org (2018). Frequently Asked Questions http://unpaywall.org/faq
Wikipedia (2017a). Link rot. (last modified on 13 March 2017, at 17:46. Retrieved 2017-03-14.) https://en.wikipedia.org/wiki/Link_rot
Wikipedia (2017b). Universally unique identifier. (last modified on 29 January 2017, at 15:28. Retrieved 2017-01-30.) https://en.wikipedia.org/wiki/Universally_unique_identifier
Van de Sompel, H., Klein, M., Jones, S.M. (2016). Persistent URIs Must Be Used To Be Persistent. WWW 2016. arXiv:1602.09102v1 [cs.DL] 29 Feb 2016
Van de Sompel, H. (2016). A Signposting Pattern for PIDs. PIDapalooza, Reykjavik, November 2016. https://doi.org/10.6084/m9.figshare.4249739.v1
Van de Sompel, H. (2018). cite-as: A Link Relation to Convey a Preferred URI for Referencing. https://datatracker.ietf.org/doc/draft-vandesompel-citeas/
Wass, J. (2016). When PIDs aren't there. Tales from Crossref Event Data. PIDapalooza, Reykjavik, November 2016. Retrieved: 11:57, Mar 20, 2017 (GMT). https://doi.org/10.6084/m9.figshare.4220580.v1
Wass, J. (2017). URLs and DOIs: a complicated relationship. CrossRef Blog, 2017 January 31. https://www.crossref.org/blog/urls-and-dois-a-complicated-relationship/
Wilkinson, M. et al. (2016). The FAIR Guiding Principles for scientific data management and stewardship. Scientific Data 3:160018. http://doi.org/10.1038/sdata.2016.18
Wilkinson, M., Schultes, E., Bonino, L., Sansone, S., Doorn, P. & Dumontier, M. (2018, July 4). FAIRMetrics/Metrics: FAIR Metrics, Evaluation results, and initial release of automated evaluator code. Scientific Data. Zenodo. http://doi.org/10.5281/zenodo.1305060
Wimalaratne, S. et al. (2015). SPARQL-enabled identifier conversion with Identifiers.org Bioinformatics, 31(11), 2015, 1875–1877. http://doi.org/10.1093/bioinformatics/btv064
Zhou, K. et al. (2015). No More 404s: Predicting Referenced Link Rot in Scholarly Articles for Pro-Active Archiving. In: Proceedings of the 15th ACM/IEEE-CE on Joint Conference on Digital Libraries. JCDL '15, p. 233-236. http://doi.org/10.1145/2756406.2756940