Reviewer has chosen not to be Anonymous
Overall Impression: Undecided
Technical Quality of the paper: Limited novelty
Data availability: All used and produced data are FAIR and openly available in established data repositories
Length of the manuscript: The length of this manuscript is about right
Summary of paper in a few sentences:
This paper introduces a methodology called "compositional and iterative semantic enhancement" (CISE) and argues that it is feasible and worthwhile to automatically enhance scholarly papers with semantic markup. This should be done in several stages, progressing from low-level syntactic mappings up to the highest levels of scientific argumentation and the paper's position in the discipline as a whole.
Reasons to accept:
- Interesting and controversial claims
- Built upon an impressive collection of previous work in this area
- Addresses a highly relevant problem
Reasons to reject:
- The theoretical links are not convincing
- Lack of discussion on limitations, possible downsides, and assumptions made
I think this paper addresses a highly relevant problem, and the author is able to build upon an impressive collection of previous work in this area. The paper furthermore makes interesting and controversial claims, which is great for a position paper. However, I found the theoretical connection to the principle of compositionality and the Curry-Howard isomorphism unconvincing. I think Section 4 is the most interesting part, but unfortunately only the lowest layers are described in detail (which, in my view, are the least interesting for the main point the paper is making). I suggest putting less focus on the theoretical links and more on the practical experiences and preliminary findings at the higher levels.
Specifically, I didn't understand why the Curry-Howard isomorphism is relevant to the points of this paper. Unlike the principle of compositionality, it doesn't seem to add anything to the argument (to the extent that I could follow it).
I don't think, however, that the principle of compositionality can really carry all the argumentative weight that is put on it here. After all, there is no conclusive proof that this principle really holds for natural languages in their entirety. In fact, there are many known cases where compositionality doesn't hold, such as idiomatic expressions or sarcasm.
Furthermore, even if we accept the principle of compositionality as a fact, it would only tell us that we could in principle semantically parse papers in an automated fashion; it would not allow us to conclude that this is feasible in any realistic setting. In fact, over and over again, ambitious natural language processing of all kinds has proven to be very difficult and often infeasible with current technology.
This leads me to what I think is the second main shortcoming of the paper: There is basically no discussion on the limitations of the approach and on many of the implicit assumptions made. Specifically:
- At what accuracy do you think we can perform such a full parse of scientific papers? Automated approaches are never perfect (often with accuracy levels below 70% for non-trivial NLP tasks), and this seems to heavily affect the arguments made in the paper.
- When do you think we will be able to perform complete semantic analyses of scientific papers? In 5, 10, 50 years from now? What should we be doing until then?
- Sometimes authors write sentences that are ambiguous (also for human readers), and deliberately or accidentally leave out important information. With your approach, we are stuck with incomplete information in these cases, whereas involving the authors in the process could solve this. This shortcoming of the approach is not discussed.
So, in summary, I think the paper has clear merits as a position paper but needs to improve on the aspects explained above.
Below I list some more minor comments:
- The "iterative" part of the "compositional and iterative semantic enhancement" is not really explained. Is "iterative" referring to applying one layer after the other? To me, this wouldn't be an intuitive use of the term "iterative". I think "iterative" would imply to go through all the layers (or the individual layers) several times.
- I think "the" in the title should be omitted: "Automating Semantic Publishing" instead of "Automating the Semantic Publishing" (and same for the first sentence of the abstract)
- In general, I suggest having a native speaker check the document with respect to grammar and style. In several places, I think that some of the grammatical constructions used are awkward if not incorrect, but not being a native speaker either, I don't feel confident in my own judgment on what might be borderline cases or simply a matter of taste.
- The first paragraph of Section 1 contains many links but no citations (except for the last sentence) that would provide evidence for claims like "... have resulted in ... acceleration of the publishing workflow".
- "... which is very close to the recent proposal of the FAIR principles for scholarly data": Very close in what sense? What are the differences?
- "generally only a very low number of semantic statements (if none at all) is specified by the authors": Can you be more specific? What are the average/median/maximum values?
- "incentives such us prizes" > "such as"
- With respect to the paragraphs connecting to Genuine Semantic Publishing, I am not sure whether an average reader is given enough background to understand this discussion. Maybe the issue of "should we or shouldn't we require authors to make a significant extra effort?" could be stated more clearly and more explicitly.
- "The idea is that the aforementioned approaches can work correctly only if used with documents stored in a particular format ...": Do these *approaches* really only work with a particular format, or is it just the current *implementations* of these approaches? I think this is an important difference.
- Contrary to "... if the text to process is written in a particular language such as English, as happens for FRED ", I read on the linked website that "FRED is [...] able to parse natural language text in 48 different languages". This should be clarified.
- "It is worth mentioning that this approach is not merely theoretical, but rather it has been implemented ...": An important qualification here is that is has been *partially* implemented. None of these grammar correctly represent an entire natural language.
- I didn't understand why "hierarchical markup" is needed as an assumption in Section 3. If you assume that natural language sentences can be automatically parsed with great accuracy (as you seem to be assuming), then you can certainly detect the hierarchical structure of documents automatically as well.
- "there is no need of having a prior knowledge about the particular natural language used for writing the scholarly article": I don't understand what you mean by "prior knowledge" here. Somebody or something would need some knowledge (in fact deep knowledge) about the language to semantically parse the text at all the layers.
- Figure 1: I think I understand the meaning of the colors in this figure, but I failed to understand the meaning of the x and y axes. This should be explained better.
- Section 4: I would have liked to learn a bit more about the ontologies, tools, and existing studies for layers 4 to 8.
- Section 4: I would expect some of the most difficult but also most interesting kind of knowledge to extract from a paper to be domain knowledge, i.e. what the authors have found out about the world (e.g. about living organisms in the case of biology). I don't see this aspect anywhere in the 8 presented layers. This seems to be another limitation that is not discussed.