Automating Semantic Publishing

Tracking #: 511-1491


Silvio Peroni

Responsible editor: 

Michel Dumontier

Submission Type: 

Position Paper


Semantic Publishing concerns the use of Web and Semantic Web technologies and standards for enhancing a scholarly work semantically so as to improve its discoverability, interactivity, openness and (re-)usability for both humans and machines. Recently, people have suggested that the semantic enhancement of a scholarly work should be done by the authors of that work, and that it should be considered part of the contribution and reviewed properly. However, the main bottleneck for the concrete adoption of this approach is that authors must spend additional time and effort adding such semantic annotations, and often they do not have that time available. Thus, the most pragmatic way to convince authors to do this additional job is to have services that enable the automatic annotation of their scholarly articles by parsing the content that they have already written, thus reducing the total time they spend to a few clicks for adding the semantic annotations. In this article, I propose a generic approach called compositional and iterative semantic enhancement (CISE) that enables the automatic enhancement of scholarly papers with additional semantic annotations in a way that is independent of the markup used for storing scholarly articles and the natural language used for writing their content.


Previous Version: 


  • Reviewed

RDF representations of manuscript content (optional and experimental): 

Data repository URLs: 


Date of Submission: 

Friday, July 28, 2017

Date of Decision: 

Wednesday, August 23, 2017



Solicited Reviews:



This substantially revised manuscript addresses many of the concerns of the reviewers, but there are still a variety of relatively minor issues that we can trust the author to address in their final version. However, greater elaboration on the "downward causation" concept is needed in order to properly communicate to the intended audience.

Responses to reviewers - Camera Ready version

Rebuttal letter

I would like to thank all reviewers (R1, R2, R3 herein) for their comments and suggestions, and for spotting typos. Please find below specific answers to the reviewers' comments.


Reviewer 1: Tobias Kuhn

R1: The disregard for any kind of NLP is an important limitation by design, but it is not clear how CISE could be combined with NLP (or, alternatively, why it is beneficial to refrain from NLP altogether)

> It is worth mentioning that I consider CISE compatible with, and complementary to, NLP, Machine Learning, etc. Thus, in principle, NLP tools can be fine-tuned considering the outcomes of CISE and, vice versa, CISE can use inputs derived from NLP mechanisms to improve its outcomes. However, exploring these kinds of interactions is out of scope for the article. I've added an explanation at the end of the related works section.

R1: I didn't understand how higher levels affect lower levels by downward causation. The "environment" is a crucial part of downward causation (according to the Wikipedia definition), but I didn't understand what this environment is in the case of CISE, and how it allows for this type of effect. Is the environment the entire article, or even the entire socio-technical system? And how does this environment cause the lower layers to change?

> The Wikipedia page describing downward causation says that the environment is fundamentally involved in downward causation in biological systems. However, the generic specification of this causal relationship, as in Campbell 1974, actually refers to the fact that higher levels can cause changes to lower levels – see also While the environment can indeed cause changes in biological systems, it is less clear what constitutes the environment for publications, and thus I have explicitly avoided discussion of environmental effects in this article. Thus, since this environmental problem is specific to the Wikipedia article, I decided just to omit reference to the Wikipedia article, and I've defined downward causation in the text as mentioned before.

R1: "who, not necessarily, have authored such works": I think this needs to be rephrased to be proper English.

> Reworded.

R1: Parenthesis not closed: "(25th percentile 34, 75th percentile 175"

> Fixed.

R1: "can the pure syntactic organisation ...": should this be "purely syntactic"?

> Fixed.

R1: "restricting the possible input documents to the sole scholarly articles": I didn't understand what "sole" means in this context.

> “Sole” is wrong in this context. Removed.

R1: The article introduces the term "CISE" twice: in Section 1 and Section 4. I think once is enough.

> Honestly, I would prefer to keep both: in the first section I just introduce it for the very first time, while in Section 4 I explain what it is in detail, and I fear that a reader may not remember what “CISE” means at that point.

R1: I didn't understand this: "a layer describing a particular conceptualisation of the parts of a scholarly article is dependent somehow on the conceptualisation of at least another lower/higher layer"

> I've slightly rephrased the sentence. The idea is that there is a strict dependency between the layers – i.e. the fact that parts of the annotations in a layer can be derived by means of the annotations defined in other layers.

R1: "continues until new annotations are added or removed from the current set of annotations available": Should this read "... until *no* new annotations are added ..."?

> Yes, indeed. Fixed.

R1: Wouldn't the while loop of Listing 1 stop if the same number of annotations are added and removed? Shouldn't it keep running until no new annotation is added and no annotation is removed (i.e. a fixpoint is reached)?

> It does keep running until a fixpoint is reached. Line 13 takes care of exactly that: it counts the union of all the annotations that have been added and removed in the last step, and adds the size of this set to the final_annotation variable. Thus any addition or removal of annotations will cause the loop to repeat, because in this case the final_annotation number will be incremented.
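
A minimal sketch of the stopping condition described in this reply may help. Listing 1 is not reproduced in this letter, so everything except the final_annotation counter is an assumption made for illustration: the annotation shape, the rule signature, the two toy rules, and the max_rounds guard are all hypothetical. The point it shows is that the loop compares a running count of changes rather than the size of the annotation set, so a round that adds one annotation and removes another still triggers a further round.

```python
from typing import Callable, List, Set, Tuple

# Hypothetical annotation shape: (document id, element id, label).
Annotation = Tuple[str, str, str]
# Hypothetical rule signature: given the whole document set and the current
# annotations, return (annotations to add, annotations to remove).
Rule = Callable[[Set[str], Set[Annotation]], Tuple[Set[Annotation], Set[Annotation]]]


def enhance(documents: Set[str], rules: List[Rule], max_rounds: int = 100) -> Set[Annotation]:
    """Apply the rules repeatedly until a round changes nothing (a fixpoint)."""
    annotations: Set[Annotation] = set()
    initial_annotation = -1
    final_annotation = 0  # running count of every addition and removal
    rounds = 0
    # max_rounds is an assumed safety guard against rules that oscillate forever.
    while initial_annotation < final_annotation and rounds < max_rounds:
        initial_annotation = final_annotation
        rounds += 1
        for rule in rules:
            to_add, to_remove = rule(documents, annotations)
            added = to_add - annotations        # genuinely new annotations
            removed = to_remove & annotations   # annotations actually present
            annotations = (annotations | added) - removed
            # Count the union of changes, so one addition plus one removal
            # counts as two changes instead of cancelling out.
            final_annotation += len(added | removed)
    return annotations


def seed(documents: Set[str], annotations: Set[Annotation]):
    """Toy rule: give each document without annotations one generic label."""
    annotated = {doc for doc, _, _ in annotations}
    return {(doc, "p1", "block") for doc in documents - annotated}, set()


def refine(documents: Set[str], annotations: Set[Annotation]):
    """Toy rule: swap each generic label for a more specific one."""
    to_remove = {a for a in annotations if a[2] == "block"}
    to_add = {(doc, elem, "paragraph") for doc, elem, _ in to_remove}
    return to_add, to_remove


print(enhance({"d1"}, [seed, refine]))
```

In the first round the refine rule adds one annotation and removes one, leaving the set size unchanged, yet the counter grows and a second round runs. That second round changes nothing, so the loop stops. Because every rule receives the whole document set, a rule written this way could also look for patterns shared across documents.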

R1: The fact that rules are applied to *sets* of documents (according to Listing 1) should be explained better. Are such rules therefore supposed to find common patterns across documents?

> I've added an explanation for clarifying this point. In particular, in principle, a rule can process simultaneously more than one document in the input document set, and it could, thus, find common patterns across documents.

R1: "i.e. it is possible to use them, in principle, to write locally/globally incoherent documents": An example might be helpful here.

> I've added an example (Listing 3) that introduces a DocBook document that is not pattern-based.

R1: "(regardless they are pattern-based or not)" > "(regardless of whether they are …"

> Fixed.

R1: I would drop "although they seem reasonable" from this sentence: "Instead, the other conditions introduced in Section 4 are less strict and need a more careful investigation, although they seem reasonable."

> Removed.


Reviewer 2: Karin Verspoor

R2: The notion of "downward causation" that has been added to the manuscript is not entirely clear and not supported by examples.

> We have added an example in order to clarify what it means and how it can be applied in the context of CISE. Examples of its application/implementation are briefly illustrated in Section 5.3.

R2: The definition of structural patterns is presented in terms of modal constraints ("can or cannot"); it is unclear precisely how these constraints can be inferred from examples. (Should it be "do or do not"?)

> The constraints introduced in the pattern theory presented in the article have been derived from several studies that people in my research group have carried out in the past on XML grammars and documents, in particular [43]. I've added a reference and a brief sentence to explain this aspect.

R2: The code that defines the mapping from structural patterns to structural semantics is underspecified and seems largely to depend on heuristics. Is there a strategy for evaluating such heuristics or validating the assumptions that they are based on?

> Each rule that defines the mapping between structural patterns and structural semantics is implemented by heuristics that have been derived by analysing how the structural patterns are used in scholarly documents, as discussed in more detail in [10]. I've added a paragraph describing this in more detail, accompanied by an example for identifying paragraphs. As already mentioned in the revision, the annotations that result from these mappings on a large set of documents have been compared, via precision and recall, with a gold standard we created by assigning structural characterizations to all the markup elements defined in DocBook.
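
For readers unfamiliar with the comparison mentioned in this reply, the following is a minimal sketch of set-based precision and recall against a gold standard. It is not the evaluation code used for the paper: the function name, the (element, pattern) pair representation, and the sample labels are assumptions chosen only for illustration.

```python
from typing import Set, Tuple

# Hypothetical representation: (markup element, assigned structural pattern).
Assignment = Tuple[str, str]


def precision_recall(predicted: Set[Assignment], gold: Set[Assignment]) -> Tuple[float, float]:
    """Compare automatically produced assignments with a gold standard."""
    true_positives = len(predicted & gold)
    # Precision: share of produced assignments that are in the gold standard.
    precision = true_positives / len(predicted) if predicted else 0.0
    # Recall: share of gold-standard assignments that were produced.
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Illustrative data only; these labels are not taken from the paper's results.
predicted = {("para", "block"), ("emphasis", "inline"), ("title", "block")}
gold = {("para", "block"), ("emphasis", "inline"), ("title", "field"), ("footnote", "popup")}

precision, recall = precision_recall(predicted, gold)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

With this sample data two of the three produced assignments match the gold standard, and two of the four gold-standard assignments are recovered.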

R2: There is somewhat of a normative feel to several of the comments in the paper -- e.g. the discussion of HTML in Section 6. Saying that certain behaviour "should be avoided" doesn't change the fact that people will do it anyway, usually because they don't realise that there might be a different way of doing things, or have reasons to prefer that other way. If you are going to argue that there are consensus patterns that can be inferred from examples, then you should acknowledge that those patterns might vary from some ideal.

> People use such a flat organisation since it is simpler to implement (in particular in tools) and/or to write by hand. However, as mentioned in the paper, it does not explicitly carry the intended hierarchical organisation of the sections, since no sections are defined.

R2: Regarding the comment relating to validation of heuristics, there should be some discussion of how mappings between structural patterns, structural semantics, and rhetorical components are derived. Can this be automated?

> Full details of how these mappings are derived are presented in the related papers, i.e. [10] and [12]. In this article I've provided only the main concepts and intuitions behind the algorithms we have implemented. Honestly, I think that a full and complete discussion of these algorithms/mappings (which would repeat already published articles) is out of scope here, and it would draw the reader's attention away from the crucial aspect of the paper, i.e. CISE. In addition, we have never used any automatic technique for identifying/creating the mapping rules. Each of the rules introduced in those papers is, in fact, the result of careful observation of scholarly articles according to the various layers that have been introduced/implemented.

R2: Some typos/usage issues I picked up:

"people suggest" -> "people have suggested"

"use the sole syntactic organisation" -> "use solely the syntactic organisation"

"pure syntactic organisation" (do you mean "purely the syntactic organisation"?)

"the sole containment" -> "solely the containment"

"Contrarily" -> "In contrast"

"without caring about" -> "without consideration of"

"speculate" -> "propose"

"apply iteratively" -> "iteratively apply"

"In this cases" -> "In these cases"

"all the individual" -> "all the individual"

"experiment possible" -> "experiment with possible"

"brutal" -- strong word; probably you mean something else?

> Fixed.


Reviewer 3: Steve Pettifer

R3: Even though the narrative and examples now make the author's position much clearer, I personally remain unconvinced of the argument towards automation.

> I've clarified some passages about automation, hoping that they are more convincing in the camera-ready version.