Reviewer has chosen not to be Anonymous
Overall Impression:
Reject
Technical Quality of the paper:
Incomplete or inappropriate
Novelty:
Limited novelty
Data availability:
Not all used and produced data are FAIR and openly available in established data repositories; authors need to fix this
Length of the manuscript:
The length of this manuscript is about right
Summary of paper in a few sentences:
This paper aims at providing a comprehensive view of similarity measures used in clustering of large data sets.
The authors focus mostly on text and document data and address six different types of similarity measures.
First, the paper summarizes the underlying method for each similarity measure, its advantages and disadvantages, and its specific application domains. Second, the authors compare the presented measures in terms of their efficiency when applied to large collections of text.
Reasons to accept:
Choosing the appropriate similarity measure is indeed a challenging and important step in any clustering process, and the authors invest significant effort into addressing this.
Reasons to reject:
The authors overlook a few categories of similarity metrics. For example, semantic similarity (Latent Semantic Analysis) is not addressed at all. Knowledge-based similarity, which uses information derived from semantic networks (e.g. WordNet), is also overlooked.
More modern techniques such as word embeddings should also be mentioned, particularly in the context of large-scale text clustering.
I believe the current work needs to be extended in order to provide a truly comprehensive survey of similarity metrics for text and document clustering.
The performance evaluation in Section 6 remains completely unclear to me. There is no indication of which data sets were used or what the testing environment was. In this light, the results presented in Table 1 seem unreliable.
I also recommend that the authors check the spelling and sentence formulation throughout. There are many such errors in the text, and they really distract from the actual content.
Meta-Review by Editor
Submitted by Tobias Kuhn on
We have reviewed your submission carefully. The task that you set out to address with the manuscript is an important one. Nonetheless, in its current state the paper is not suitable for publication and requires very significant changes. The reviewer's comments offer useful input to guide improvement of your work. I would emphasize the most important areas: (1) a more in-depth evaluation of the efficiency of algorithms on large-scale datasets, (2) a more comprehensive review of existing methods (or justification of your reduced scope), and (3) differentiating your work from existing reviews. The text also needs numerous corrections to grammar and spelling to improve readability. Finally, a clearer explanation of what data you are using is needed, and should you decide to submit again to this journal, you must ensure the data is FAIR and openly available in established data repositories.
Olivia Woolley-Meza (0000-0003-4517-2765)