Genuine Semantic Publishing

Tracking #: 490-1470

Authors:

	Name	ORCID
	Tobias Kuhn	https://orcid.org/0000-0002-1267-0234
	Michel Dumontier	https://orcid.org/0000-0003-4727-9435

Responsible editor:

Silvio Peroni

Submission Type:

Position Paper

Abstract:

Several approaches and systems have been presented for what has been called semantic publishing. Closer inspection, however, reveals that these approaches are mostly not about publishing semantic representations, as the name seems to suggest. Rather, most approaches take the processes and outcomes of the current narrative-based publishing system for granted and only work with the already published papers. This includes semantic annotations, semantic interlinking, semantic integration, and semantic discovery, but with the semantics coming into play only after the publication of the original article. While these are interesting approaches, they fall short of providing a vision to transcend the current publishing paradigm. We argue for taking the term semantic publishing literally and work towards a vision of genuine semantic publishing, where computational tools and algorithms can help us with dealing with the wealth of human knowledge by letting researchers capture their research results with formal semantics from the start. We argue that genuine semantic publications should come with formal semantics as an integral and primary component at the time of publication, that these representations should have essential coverage in the sense that they cover the main results, that they should be authentic in the sense that they originate from the authors, and that they should be fine-grained and light-weight for optimized re-usability and minimized publication overhead. This paper is in fact not just advocating our concept, but is itself a genuine semantic publication, thereby demonstrating and illustrating our points.

Manuscript:

ds-paper-490.zip

Revised Version:

Genuine Semantic Publishing

Data repository URLs:

RDF representations of manuscript content: https://raw.githubusercontent.com/data-science-hub/data/master/rdf/ds-rdf-490.trig

Date of Submission:

Friday, June 16, 2017

Date of Decision:

Thursday, July 13, 2017

Nanopublication URLs:

Decision:

Undecided

Solicited Reviews:

Review #1 submitted on 29/Jun/2017

By Francesco Osborne

https://orcid.org/0000-0001-6557-3131

Review Details

Reviewer has chosen not to be Anonymous

Overall Impression: Excellent
Suggested Decision: Accept
Technical Quality of the paper: Excellent
Presentation: Good
Reviewer's confidence: High
Significance: High significance
Background: Reasonable
Novelty: Limited novelty
Data availability: All used and produced data (if any) are FAIR and openly available in established data repositories
Length of the manuscript: The length of this manuscript is about right

Summary of paper in a few sentences:

This position paper critiques current approaches for “semantic publishing”, arguing that they do not really support a new paradigm for publishing semantic representations. It proposes and defines the concept of “genuine semantic publishing”, which is more adherent to the initial vision of the semantic publishing movement.

Reasons to accept:

The paper is clear and well written. It addresses an important concept with solid and convincing arguments. It should be of interest for a lot of people in the Semantic Web community.

Reasons to reject:

The paper would be more robust if it included more examples of how the criteria of genuine semantic publishing would apply in practical cases (see further comments).

Nanopublication comments:

Further comments:

I would like to see more examples of how the criteria of genuine semantic publishing would apply in practical cases, such as a research paper presenting a scientific experiment and its results. What would qualify as a light-weight and fine-grained representation in this case? Would a natural language description of the results and the claims be enough? I think that some more examples would be very useful as guidelines both to authors/publishers who want to follow the paradigm of genuine semantic publishing and to developers who intend to implement relevant tools.

>Sec 3, Table 1.
I believe that the categories in Table 1 should be explained better. For example, I don’t understand why scholarly HTML could not represent “program code” or “domain data”. It is possible to include both as RDFa.

>“Semantic representations can only be considered authentic if they originate from an agent that is authoritative in a given situation. In the case of publication of scientific results, the only authoritative source are the researchers…“
Here it would be interesting to mention/discuss data provenance and how it relates to your claim.

>“More so than narrative texts, semantic representations can be broken down into independent pieces that can be interpreted independently. “
Context can be of utmost importance in science. Don’t you think that a statement taken by itself would risk being misinterpreted? Let’s take the case of “A causes B”. Does it mean that a study found a statistical correlation? Is it significant? And how strong is the correlation? Is the correlation subject to some other condition?
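
The questions above can be made concrete with a minimal sketch. It assumes a bare claim is stored as a subject-predicate-object triple and that its context is attached as optional qualifiers. All class and field names are hypothetical illustrations; they are not taken from the paper, its TriG file, or any nanopublication schema.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Claim:
    """A bare statement such as 'A causes B' (hypothetical structure)."""
    subject: str
    predicate: str
    obj: str


@dataclass(frozen=True)
class QualifiedClaim:
    """A claim plus the context needed to interpret it in isolation.

    None means "not stated"; an empty tuple for `conditions` means
    "explicitly unconditional". Field names are illustrative assumptions.
    """
    claim: Claim
    evidence_type: Optional[str] = None      # e.g. "statistical correlation"
    p_value: Optional[float] = None          # is the result significant?
    effect_size: Optional[float] = None      # how strong is the correlation?
    conditions: Optional[Tuple[str, ...]] = None  # under which conditions?

    def missing_context(self) -> List[str]:
        """Return the names of the qualifiers that were left unstated."""
        names = ("evidence_type", "p_value", "effect_size", "conditions")
        return [name for name in names if getattr(self, name) is None]


bare = QualifiedClaim(Claim("A", "causes", "B"))
qualified = QualifiedClaim(
    Claim("A", "causes", "B"),
    evidence_type="statistical correlation",
    p_value=0.01,
    effect_size=0.4,
    conditions=(),
)
```

In this sketch, `bare.missing_context()` lists all four qualifiers, which corresponds to a statement that risks being misinterpreted when read on its own, while `qualified.missing_context()` is empty.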

Minor corrections:
non-intuitive way, Instead > non-intuitive way. Instead
claiming: The main message > claiming: the main message

Review #2 submitted on 10/Jul/2017

By Sarven Capadisli

https://orcid.org/0000-0002-0880-9125

Review Details

Reviewer has chosen not to be Anonymous

Overall Impression: Good
Suggested Decision: Accept
Technical Quality of the paper: Good
Presentation: Good
Reviewer's confidence: High
Significance: High significance
Background: Reasonable
Novelty: Limited novelty
Data availability: With exceptions that are admissible according to the data availability guidelines, all used and produced data are FAIR and openly available in established data repositories
Length of the manuscript: The length of this manuscript is about right

Summary of paper in a few sentences:

See "Further comments" for inline feedback. https://linkedresearch.org/annotation/www.tkuhn.org/pub/sempub/sempub.do...

Reasons to accept:

It could be used towards quality check.

Reasons to reject:

The write-up can be improved in a number of areas, but no reason to reject.

Nanopublication comments:

Further comments:

Overall impression *:
- Good

Suggested decision *:
- Accept

Actually somewhere between "Undecided" and "Accept". Not a fan of nearly binary decision making here.

Reviewer's confidence *:
- High

Significance (Does the work address an important problem within the research
fields covered by the journal?) *:
- High significance

Background (Is the work appropriately based on and connected to the relevant
related work?) *:
- Reasonable

Related work was primarily on semantic publishing within scholarly communication, however the semantics that it wanted to improve on was the general "semantic publishing". If the original "semantic publishing" needed to be clarified further or was deemed to be inappropriately used in practice, I would have expected the related work to focus more on the wider use/discussion around "semantic publishing" as opposed to the scholarly communication context.

Novelty (For research papers: Does the work provide new insights or new methods
of a substantial kind? For position papers: Does the work provide a novel and
potentially disruptive view on the given topic? For survey papers: Does the work
provide an overview that is unique in its scope or structure for the given
topic?) *:
- Limited novelty

The proposed criteria tend to focus on "static" published or publishable information. They can be used to check the "quality" of works; however, they don't provide sufficient guidance (or criteria) on how to account for things like interactivity. This may be classified under "essential coverage", but it wasn't clear to me (perhaps I've missed it).

Technical quality (For research papers: Are the methods adequate for the
addressed problem, are they correctly and thoroughly applied, and are their
results interpreted in a sound manner? For position papers: Is the advocated
position supported by sound and thorough arguments? For survey papers: Is the
topic covered in a comprehensive and well balanced manner, are the covered
approaches accurately described and compared, and are they placed in a
convincing common framework?) *:
- Good

It might be useful to further describe what dimensions are included (i.e., the current criteria), and which are intentionally excluded.

Presentation (Are the text, figures, and tables of the work accessible, pleasant
to read, clearly structured, and free of major errors in grammar or style?) *:
- Average

While the illustration of the analogy was clear, I didn't think that it was necessary. The screenshot of the index page and the accompanying text help to understand the example; however, I think it may be more useful to show an abstraction. Is the "index" of available representations/essential coverage an important or required unit to have?

Length of the manuscript *?
- The length of this manuscript is about right

Consider shortening.

Data availability *:
- With exceptions that are admissible according to the data availability
guidelines, all used and produced data are FAIR and openly available in
established data repositories

Summary of paper in a few sentences (summary of changes and improvements for
second round reviews) *:
See "Further comments" for inline feedback.

Reasons to accept *:
It could be used towards quality check

Reasons to reject *:
The write-up can be improved in a number of areas, but no reason to reject.

Further comments:
See https://linkedresearch.org/annotation/www.tkuhn.org/pub/sempub/sempub.do...

By submitting this form, you accept that the content above will be made public
under CC-BY (https://creativecommons.org/licenses/by/4.0/) license. Furthermore,
you accept that your name (Sarven Capadisli) and WebID (http://csarven.ca/#i) will be publicly linked to it,
unless you opt to stay anonymous below.

Review #3 submitted on 12/Jul/2017

By Jodi Schneider

https://orcid.org/0000-0002-5098-5667

Review Details

Reviewer has chosen not to be Anonymous

Overall Impression: Average
Suggested Decision: Undecided
Technical Quality of the paper: Weak
Presentation: Average
Reviewer's confidence: High
Significance: Moderate significance
Background: Reasonable
Novelty: Limited novelty
Data availability: With exceptions that are admissible according to the data availability guidelines, all used and produced data are FAIR and openly available in established data repositories
Length of the manuscript: The length of this manuscript is about right

Summary of paper in a few sentences:

This paper presents an opinion about the need for "genuine semantic publishing", based on previous work in the field, including the nanopublications work of the authors. The core objection to current approaches is that semantics are provided after the fact and that, according to some definitions, minor improvements in the publication (not necessarily public, and not necessarily semantic) could be viewed as a kind of "semantic publication". A secondary core objection is that narrative is over-privileged in the view of the authors. The paper is accompanied by a homepage linking to multiple representations and a supplement in TriG is deposited. Much of what the authors would call the "essential" content of the TriG appears also in the text of the paper.

Reasons to accept:

The paper presents an informed opinion about semantic publishing.

The vision is well worked out and includes its own semantic representation.

Reasons to reject:

The paper does not distinguish between settled science and the forefront of science. The importance, for the forefront, of representing a paper's arguments (as opposed to its claims) does not seem to be taken seriously. Statements without justification seem to me to be only useful for settled science, not for work that might be contested or counterargued.

The intended use and application of the semantics for the TriG file could be made more clear: is it to refer back to (in a future semantic publication)? To index? The proposed approach to semantic publishing is more detailed than some -- but to my mind, arguments (more detailed than the assertions and sub-assertions from the narrative in the sample TriG) will be needed to serve many aims of publishing at the forefront of science.

While the authors are well-versed in semantic publishing, they miss some current trends. SEPIO is a stand-out:
Brush, Matthew H., Kent Shefchek, and Melissa Haendel. "SEPIO: A Semantic Model for the Integration and Analysis of Scientific Evidence." ICBO/BioCreative. 2016. http://ceur-ws.org/Vol-1747/IT605_ICBO2016.pdf
Dokieli generates RDFa under the hood; problems with this RDFa should be directly addressed, and RASH and dokieli should be considered for Table 1 (maybe they don't fit but it's not immediately obvious to me either way).

Stronger arguments about WHY this vision has the potential to have a positive impact are needed. What is the longer-term intent and implication of this work? Does it have any practicality and practical impact? Or does it, at the least, drive a research agenda that will lead towards better scientific publishing or better scientific knowledge management, in practice, in the semantic web community or at large? These points are not really addressed but they seem (to this reviewer at least) an essential part of a real vision in this area.

Even taking the paper itself, and admitting its arguments, I think this work could do a much better job of forefronting the key ideas of the proposal. I think that those ideas may go beyond the 5 criteria. For instance this statement is vivid, intriguing, and possibly groundbreaking, but it is not supported by the text: "We will argue below that narrative text necessarily remains an important part of scientific discourse and communication, but it also has to be possible to publish data that is self-explanatory due to its formal semantics without the need for a narrative." To my mind, if this is a point you want to make in THIS paper you should make it strongly.

Nanopublication comments:

Further comments:

I wanted to love this paper. I think the topic is important and that you have relevant perspectives. Overall I think that the paper could devote more effort to persuasion and carrying the reader along.

While the first line discusses scientific publishing, "semantic publishing" could be misunderstood to focus on non-scholarly content as well. Consider modifying the title to be more clear.

I took a look at the Berners-Lee/Hendler article from 2001. (Consider adding DOIs; for that one, for instance, it's 10.1038/35074206 ). They say:
"Where a current tool using XML (see http://www.nature.com/nature/webmatters/xml/xml.html) can allow a user to assert that some part of a document is about an 'experiment', the new languages will let the scientist express that the experiment uses certain chemicals and reagents; that the system used involved some particular organic matter; that the experiment produced gels with certain DNA information on them (and that the images of these gels are located in particular places on the web); and so on."

That part of the Berners-Lee/Hendler vision is not fully achieved, true, but this kind of work is really going on. You cite some of it (e.g isn't this what your reference #18 does?) For instance, one very successful recent example comes from the adoption of RRIDs:
Bandrowski, Anita E., and Maryann E. Martone. "RRIDs: A simple step toward improving reproducibility through rigor and transparency of experimental methods." Neuron 90, no. 3 (2016): 434-436. "Our practical solution asks authors to provide more complete metadata as well as an RRID: a citation convention that provides a simple prefix, RRID, prepended to an alpha numeric string. These strings come from community databases that have been aggregating information for many years. Every time scientists register a new entity, e.g., a new antibody, it gets its own “social security number” in the form of an accession number. ... After our successful pilot (Bandrowski et al., 2016), many additional journals have adopted use of RRIDs, and Neuron has joined this effort by changing their instructions to authors and requesting inclusion of RRIDs in their publications. Neuron authors are now asked to follow resource citation guidelines (see Neuron RRID guidelines, http://www.cell.com/neuron/rrid) such that a resource citation would be reported as follows: BioLegend, cat# 101230, RRID: AB_2129374 (vendor, vendor ID, machine-readable ID)."
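
As a rough illustration of why such a citation counts as machine-readable, the sketch below parses the quoted pattern "vendor, cat# vendor ID, RRID: machine-readable ID". The regular expression is an assumption generalized from the single example quoted above and does not cover every RRID form; the function and type names are hypothetical.

```python
import re
from typing import NamedTuple, Optional


class ResourceCitation(NamedTuple):
    vendor: str
    catalog_id: str
    rrid: str


# Assumed shape: "<vendor>, cat# <vendor ID>, RRID: <PREFIX>_<identifier>".
# Derived only from the example in the quoted text, not from an official grammar.
_CITATION = re.compile(
    r"^\s*(?P<vendor>[^,]+?)\s*,"
    r"\s*cat#\s*(?P<catalog>[^,]+?)\s*,"
    r"\s*RRID:\s*(?P<rrid>[A-Za-z]+_[A-Za-z0-9_-]+)\s*$"
)


def parse_resource_citation(text: str) -> Optional[ResourceCitation]:
    """Split a resource citation into its three parts, or return None."""
    match = _CITATION.match(text)
    if match is None:
        return None
    return ResourceCitation(
        vendor=match.group("vendor"),
        catalog_id=match.group("catalog"),
        rrid=match.group("rrid"),
    )


example = parse_resource_citation("BioLegend, cat# 101230, RRID: AB_2129374")
```

Here `example.rrid` holds the accession-style identifier, which a tool could then look up in the relevant community database.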

Shotton's 2009 paper is used as a strawman; you have given convincing arguments that its definition is too loose. However you have not really established why your definition is the best alternative. In particular, your "genuine semantic publishing" will break with all current publishing: few previous publications will have been enriched by their authors. In my view, we should aim for BOTH machine and human-readable text; while content-negotiation is a good thing, managing multiple versions for different types of consumers means they could get out of sync (and they have no expectation of carrying the same content). Shotton is not the only visionary of semantic publishing; among others, you could look to Steve Pettifer's ~8 papers (on Utopia documents, OpenPHACTS, and about semantic publishing in general). (About half of these are cited in Wikipedia currently, for easy reference, see the bibliography of https://en.wikipedia.org/wiki/Semantic_publishing )

PDFs are not necessarily incompatible with RDF; Adobe's XMP metadata can be embedded in documents or pushed into metadata 'sidecars'. Crossref folks even experimented with XMP and they once released an open source tool for pushing metadata into PDFs given a DOI:

I do not find Figure 1 compelling; your mileage may vary with other readers and reviewers. I disagree that "By only looking at the formal semantics, one can possibly find out the topic of the paper but not what the paper is actually claiming: The main message is missed." This is not inherent in post-hoc annotation, nor in annotation by non-authors. I would point, for instance, to entire industries with paid curators (often PhD-level scientists) who curate scientific knowledge bases and databases (e.g. authors and readers of Oxford's _Database_ journal) as well as write extracts/abstracts for publishing/aggregating companies like EBSCO.

I find this statement unhelpful: "It seems to be a common unquestioned assumption that the semantic representation of knowledge has to start from a textual representation, and therefore writing a statement down in natural language always needs to be the first step." It really doesn't matter which comes first. However the (linguistic) semantics of your narrative are much richer than those of your TriG. Narrative does have an important role. Historians and sociologists of science have argued that writing up work (and specifically writing and revising narrative arguments for presentation and publication) helps form it. Lavoisier provides a vivid example: as Moore summarizes, "Lavoisier wrote at least six drafts of the paper over a period of at least six months. However, his theory of respiration did not appear until the fifth draft. Clearly, Lavoisier's writing helped him refine and understand his ideas." (Moore, Randy. "Language—A Force that Shapes Science." Journal of College Science Teaching 28.6 (1999): 366. http://www.jstor.org/stable/42990615) A longer treatment (complete with facsimiles of Lavoisier's manuscripts) can be found in Holmes, Frederic L. "Scientific writing and scientific discovery." Isis 78.2 (1987): 220-235. http://www.jstor.org/stable/231523
I suppose that in the end my main concern is that your TriG representation does not do the narrative of your paper justice; and I am unsure how "genuine semantic publishing" would do any better, on average in representing the content of the paper. Hence my concern with the intended use here.

Do you think that we need to do anything with historical papers? (e.g. do we already have the knowledge they represent? do we need it?) It is not clear from the point of view presented. You disparage certain activities (such as the Semantic Publishing Challenges) -- do you see any value in those (even though they don't address publishing per se)?

This seems an overstatement: "The possible use of RDFa to formally represent not just meta data but also high-level claims, hypotheses, and arguments is sometimes proposed, but no concrete solutions are presented." Certainly, integrating nanopublications or micropublications has been proposed -- what is not concrete enough? Regarding RDF in general, older approaches have been taken. This one comes to mind:
Li, Gangmin, Victoria Uren, Enrico Motta, Simon Buckingham Shum, and John Domingue. "Claimaker: Weaving a semantic web of research papers." In International Semantic Web Conference, pp. 436-441. Springer, Berlin, Heidelberg, 2002.
If I am misunderstanding your claim, consider how you might make the scope of your statement clearer.

"Structured abstracts" have a specific and more general meaning; I'd suggest that about "structured digital abstracts" you dwell a bit more on the papers you are citing to discuss what they did.

Alongside RASH I would suggest mentioning and citing dokieli (which anyway you are already using for an alternate representation) where you mention that in the text. And as mentioned above, it is not clear why these are elided from your review proper.

Some statements about micropublications do not fit with my understanding; I see that Tim Clark is one of the people you have talked to from the acknowledgements and of course an authoritative approach would be to check with the author! In particular here: " They argue that formal representations of scientific claims are often not practically feasible, whereas the structure among them can be captured more easily and is moreover more important and more valuable to help scientists with computer-aided knowledge management." and "In our own previous work, we have proposed a preliminary general approach of representing within nanopublications the structure among informal claims and hypotheses, which are themselves not necessarily formally represented [33], thereby addressing some of the points raised by micropublications." Note that semantic qualifiers enable indexing claims with existing identifiers. And in micropublications I think a real strength is the attention to arguments within a paper; but the suggestion that micropublications are *limited* to the scope of a paper is not right (e.g. "stick to the article as their unit of publication" -- do you mean something more subtle there?). In the Micropublications JBS article see especially Figure 11 "Connected support relations of three arguments give a Claim network across three publications." I recommend reading this paper for a helpful perspective:
Clark, Tim. "Argument Graphs: Literature-Data Integration for Robust and Reproducible Science." In First International Workshop on Capturing Scientific Knowledge at K-Cap https://www.isi.edu/ikcap/sciknow2015/papers/Clark.pdf

I think that your work on AIDA sentences and the proposal to use them (along with some hedging/uncertainty markers) for nanopub publishing is great -- but I don't think that this is the same thing as representing the internal structure of the argument. I'd be very happy to hear what I'm misunderstanding.

You say SPAR is highly valuable -- how/for what? Who is using it? How should it be used? Similarly for Linked Science Core Vocabulary.

I'm surprised that you don't mention CNL; especially regarding "Explaining a result in a narrative is simpler than formally modeling it, in the sense that natural language allows the writer to remain vague and even ambiguous." (which seems to me not true for CNLs.)

Stating "we argue that" does not give a justification or rationale. Why do you think this? "Furthermore, we argue that the semantic representations need to be a primary component with an existence in their own right, to call it a genuine semantic publication. The main thing that is published needs to have a semantic representation, and this semantic representation needs to have an independent existence." Availability at time of publication seems to go in the other direction: they should be temporally locked to the original.

The notion of "essence" or "main message" is not operationalized.

Data representation of the paper could be stored in a FAIR repository.

(No I am not reading this on a beach. :D )
For "Meta data" personally I would write "metadata".

Explicitly reference the supplement when talking about files (e.g. end of section 5)
Consider writing a longer conclusion.

Table 1 would benefit from shading (e.g. on alternate rows) to aid the eyes.
Figure 2's caption could include the URL to your actual landing page.

Add hyphens: "English-speaking agents", "English-based representation", "RDF-speaking agents", "RDF-based representation"
"who are called authors" (not "which" here)

Reference 15 is missing a venue. Check capitalization especially in #25 and #30.

1 Comment

Meta-Review by Editor

Submitted by Tobias Kuhn on Thu, 07/13/2017 - 06:53

Overall, according to the reviewers, this is a good paper and it is very appropriate for the journal. All the reviews provide several suggestions for improving the paper in all its parts.

However, Reviewer 3 has highlighted some issues that should be addressed carefully. In particular, some claims in the paper should be clarified and supported with additional evidence.

I'm pretty sure these aspects will be addressed by the authors appropriately, but this will need a new revision. That's why the current decision is "Undecided".

Silvio Peroni (http://orcid.org/0000-0003-0530-4305)

Data Science
