By Jodi Schneider
Reviewer has chosen not to be Anonymous
Overall Impression: Undecided
Technical Quality of the paper: Limited novelty
Data availability: With exceptions that are admissible according to the data availability guidelines, all used and produced data are FAIR and openly available in established data repositories
Length of the manuscript: The length of this manuscript is about right
Summary of paper in a few sentences:
This paper presents an opinion about the need for "genuine semantic publishing", based on previous work in the field, including the nanopublications work of the authors. The core objection to current approaches is that semantics are provided after the fact and that, according to some definitions, minor improvements in the publication (not necessarily public, and not necessarily semantic) could be viewed as a kind of "semantic publication". A secondary core objection is that narrative is over-privileged in the view of the authors. The paper is accompanied by a homepage linking to multiple representations and a supplement in TriG is deposited. Much of what the authors would call the "essential" content of the TriG appears also in the text of the paper.
Reasons to accept:
The paper presents an informed opinion about semantic publishing.
The vision is well worked out and includes its own semantic representation.
Reasons to reject:
The paper does not distinguish between settled science and the forefront of science. The importance, for the forefront, of representing a paper's arguments (as opposed to its claims) does not seem to be taken seriously. Statements without justification seem to me to be only useful for settled science, not for work that might be contested or counterargued.
The intended use and application of the semantics for the TriG file could be made more clear: is it to refer back to (in a future semantic publication)? To index? The proposed approach to semantic publishing is more detailed than some -- but to my mind, arguments (more detailed than the assertions and sub-assertions from the narrative in the sample TriG) will be needed to serve many aims of publishing at the forefront of science.
While the authors are well-versed in semantic publishing, they miss some current trends. SEPIO is a stand-out:
Brush, Matthew H., Kent Shefchek, and Melissa Haendel. "SEPIO: A Semantic Model for the Integration and Analysis of Scientific Evidence." ICBO/BioCreative. 2016. http://ceur-ws.org/Vol-1747/IT605_ICBO2016.pdf
Dokieli generates RDFa under the hood; problems with this RDFa should be directly addressed, and RASH and dokieli should be considered for Table 1 (maybe they don't fit, but it's not immediately obvious to me either way).
Stronger arguments about WHY this vision has the potential to have a positive impact are needed. What is the longer-term intent and implication of this work? Does it have any practicality and practical impact? Or does it, at the least, drive a research agenda that will lead towards better scientific publishing or better scientific knowledge management, in practice, in the semantic web community or at large? These points are not really addressed but they seem (to this reviewer at least) an essential part of a real vision in this area.
Even taking the paper itself, and admitting its arguments, I think this work could do a much better job of forefronting the key ideas of the proposal. I think that those ideas may go beyond the 5 criteria. For instance this statement is vivid, intriguing, and possibly groundbreaking, but it is not supported by the text: "We will argue below that narrative text necessarily remains an important part of scientific discourse and communication, but it also has to be possible to publish data that is self-explanatory due to its formal semantics without the need for a narrative." To my mind, if this is a point you want to make in THIS paper you should make it strongly.
I wanted to love this paper. I think the topic is important and that you have relevant perspectives. Overall I think that the paper could devote more effort to persuasion and carrying the reader along.
While the first line discusses scientific publishing, "semantic publishing" could be misunderstood to focus on non-scholarly content as well. Consider modifying the title to be more clear.
I took a look at the Berners-Lee/Hendler article from 2001. (Consider adding DOIs; for that one, for instance, it's 10.1038/35074206 ). They say:
"Where a current tool using XML (see http://www.nature.com/nature/webmatters/xml/xml.html) can allow a user to assert that some part of a document is about an 'experiment', the new languages will let the scientist express that the experiment uses certain chemicals and reagents; that the system used involved some particular organic matter; that the experiment produced gels with certain DNA information on them (and that the images of these gels are located in particular places on the web); and so on."
That part of the Berners-Lee/Hendler vision is not fully achieved, true, but this kind of work is really going on. You cite some of it (e.g., isn't this what your reference #18 does?). For instance, one very successful recent example comes from the adoption of RRIDs:
Bandrowski, Anita E., and Maryann E. Martone. "RRIDs: A simple step toward improving reproducibility through rigor and transparency of experimental methods." Neuron 90, no. 3 (2016): 434-436. "Our practical solution asks authors to provide more complete metadata as well as an RRID: a citation convention that provides a simple prefix, RRID, prepended to an alpha numeric string. These strings come from community databases that have been aggregating information for many years. Every time scientists register a new entity, e.g., a new antibody, it gets its own “social security number” in the form of an accession number. ... After our successful pilot (Bandrowski et al., 2016), many additional journals have adopted use of RRIDs, and Neuron has joined this effort by changing their instructions to authors and requesting inclusion of RRIDs in their publications. Neuron authors are now asked to follow resource citation guidelines (see Neuron RRID guidelines, http://www.cell.com/neuron/rrid) such that a resource citation would be reported as follows: BioLegend, cat# 101230, RRID: AB_2129374 (vendor, vendor ID, machine-readable ID)."
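To make concrete why I count this as semantics that authors supply at writing time, here is a minimal sketch (my own illustration, not code from the RRID initiative) of how a citation in the quoted layout could be picked apart by a machine. The pattern and the function name `parse_resource_citation` are assumptions based only on the single example string quoted above; real RRIDs may follow conventions this pattern does not cover.

```python
import re

# Hypothetical pattern for the layout quoted above:
# "<vendor>, cat# <vendor ID>, RRID: <machine-readable ID>"
# It is derived from one example only and is not an official RRID grammar.
CITATION = re.compile(
    r"^\s*(?P<vendor>[^,]+?)\s*,"
    r"\s*cat#\s*(?P<catalog>[^,]+?)\s*,"
    r"\s*RRID:\s*(?P<rrid>[A-Za-z]+_[A-Za-z0-9_:-]+)\s*$"
)


def parse_resource_citation(text):
    """Return vendor, catalog number, and RRID as a dict, or None if no match."""
    match = CITATION.match(text)
    if match is None:
        return None
    return match.groupdict()


# The example string is taken verbatim from the quoted passage.
parsed = parse_resource_citation("BioLegend, cat# 101230, RRID: AB_2129374")
print(parsed)
```

The point of the sketch is only that the identifier is recoverable without any interpretation of the surrounding narrative, which is the property the authors ask for.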
Shotton's 2009 paper is used as a strawman; you have given convincing arguments that its definition is too loose. However, you have not really established why your definition is the best alternative. In particular, your "genuine semantic publishing" will break with all current publishing: few previous publications will have been enriched by their authors. In my view, we should aim for BOTH machine- and human-readable text; while content negotiation is a good thing, managing multiple versions for different types of consumers means they could get out of sync (and there is no expectation that they carry the same content). Shotton is not the only visionary of semantic publishing; among others, you could look to Steve Pettifer's ~8 papers (on Utopia documents, OpenPHACTS, and about semantic publishing in general). (About half of these are currently cited in Wikipedia; for easy reference, see the bibliography of https://en.wikipedia.org/wiki/Semantic_publishing.)
PDFs are not necessarily incompatible with RDF; Adobe's XMP metadata can be embedded in documents or pushed into metadata 'sidecars'. Crossref folks even experimented with XMP and they once released an open source tool for pushing metadata into PDFs given a DOI:
I do not find Figure 1 compelling; your mileage may vary with other readers and reviewers. I disagree that "By only looking at the formal semantics, one can possibly find out the topic of the paper but not what the paper is actually claiming: The main message is missed." This is not inherent in post-hoc annotation, nor in annotation by non-authors. I would point, for instance, to entire industries of paid staff (often PhD-level scientists) who curate scientific knowledge bases and databases (e.g. authors and readers of Oxford's _Database_ journal) as well as write extracts/abstracts for publishing/aggregating companies like EBSCO.
I find this statement unhelpful: "It seems to be a common unquestioned assumption that the semantic representation of knowledge has to start from a textual representation, and therefore writing a statement down in natural language always needs to be the first step." It really doesn't matter which comes first. However the (linguistic) semantics of your narrative are much richer than those of your TriG. Narrative does have an important role. Historians and sociologists of science have argued that writing up work (and specifically writing and revising narrative arguments for presentation and publication) helps form it. Lavoisier provides a vivid example: as Moore summarizes, "Lavoisier wrote at least six drafts of the paper over a period of at least six months. However, his theory of respiration did not appear until the fifth draft. Clearly, Lavoisier's writing helped him refine and understand his ideas." (Moore, Randy. "Language—A Force that Shapes Science." Journal of College Science Teaching 28.6 (1999): 366. http://www.jstor.org/stable/42990615) A longer treatment (complete with facsimiles of Lavoisier's manuscripts) can be found in Holmes, Frederic L. "Scientific writing and scientific discovery." Isis 78.2 (1987): 220-235. http://www.jstor.org/stable/231523
I suppose that in the end my main concern is that your TriG representation does not do the narrative of your paper justice; and I am unsure how "genuine semantic publishing" would do any better, on average, in representing the content of the paper. Hence my concern with the intended use here.
Do you think that we need to do anything with historical papers? (e.g. do we already have the knowledge they represent? do we need it?) It is not clear from the point of view presented. You disparage certain activities (such as the Semantic Publishing Challenges) -- do you see any value in those (even though they don't address publishing per se)?
This seems an overstatement: "The possible use of RDFa to formally represent not just meta data but also high-level claims, hypotheses, and arguments is sometimes proposed, but no concrete solutions are presented." Certainly, integrating nanopublications or micropublications has been proposed -- what is not concrete enough? Regarding RDF in general, older approaches have been taken. This one comes to mind:
Li, Gangmin, Victoria Uren, Enrico Motta, Simon Buckingham Shum, and John Domingue. "ClaiMaker: Weaving a semantic web of research papers." In International Semantic Web Conference, pp. 436-441. Springer, Berlin, Heidelberg, 2002.
If I am misunderstanding your claim, consider how you might make the scope of your statement clearer.
"Structured abstracts" have both a specific and a more general meaning; regarding "structured digital abstracts", I'd suggest that you dwell a bit more on the papers you are citing and discuss what they did.
Alongside RASH I would suggest mentioning and citing dokieli (which anyway you are already using for an alternate representation) where you mention that in the text. And as mentioned above, it is not clear why these are elided from your review proper.
Some statements about micropublications do not fit with my understanding; I see from the acknowledgements that Tim Clark is one of the people you have talked to, and of course an authoritative approach would be to check with the author! In particular here: "They argue that formal representations of scientific claims are often not practically feasible, whereas the structure among them can be captured more easily and is moreover more important and more valuable to help scientists with computer-aided knowledge management." and "In our own previous work, we have proposed a preliminary general approach of representing within nanopublications the structure among informal claims and hypotheses, which are themselves not necessarily formally represented, thereby addressing some of the points raised by micropublications." Note that semantic qualifiers enable indexing claims with existing identifiers. And in micropublications I think a real strength is the attention to arguments within a paper; but the suggestion that micropublications are *limited* to the scope of a paper is not right (e.g. "stick to the article as their unit of publication" -- do you mean something more subtle there?). In the Micropublications JBS article see especially Figure 11 "Connected support relations of three arguments give a Claim network across three publications." I recommend reading this paper for a helpful perspective:
Clark, Tim. "Argument Graphs: Literature-Data Integration for Robust and Reproducible Science." In First International Workshop on Capturing Scientific Knowledge at K-Cap https://www.isi.edu/ikcap/sciknow2015/papers/Clark.pdf
I think that your work on AIDA sentences and the proposal to use them (along with some hedging/uncertainty markers) for nanopub publishing is great -- but I don't think that this is the same thing as representing the internal structure of the argument. I'd be very happy to hear what I'm misunderstanding.
You say SPAR is highly valuable -- how/for what? Who is using it? How should it be used? Similarly for Linked Science Core Vocabulary.
I'm surprised that you don't mention CNL; especially regarding "Explaining a result in a narrative is simpler than formally modeling it, in the sense that natural language allows the writer to remain vague and even ambiguous." (which seems to me not true for CNLs.)
Stating "we argue that" does not give a justification or rationale. Why do you think this? "Furthermore, we argue that the semantic representations need to be a primary component with an existence in their own right, to call it a genuine semantic publication. The main thing that is published needs to have a semantic representation, and this semantic representation needs to have an independent existence." Availability at time of publication seems to go in the other direction: they should be temporally locked to the original.
The notion of "essence" or "main message" is not operationalized.
Data representation of the paper could be stored in a FAIR repository.
(No I am not reading this on a beach. :D )
For "Meta data" personally I would write "metadata".
Explicitly reference the supplement when talking about files (e.g. end of section 5).
Consider writing a longer conclusion.
Table 1 would benefit from shading (e.g. on alternate rows) to aid the eyes.
Figure 2's caption could include the URL to your actual landing page.
Add hyphens: "English-speaking agents", "English-based representation", "RDF-speaking agents", "RDF-based representation"
"who are called authors" (not "which" here)
Reference 15 is missing a venue. Check capitalization especially in #25 and #30.