Reviewer has chosen not to be Anonymous
Overall Impression: Reject
Technical Quality of the paper: Incomplete or inappropriate
Novelty: Limited novelty
Data availability: Not all used and produced data are FAIR and openly available in established data repositories; authors need to fix this
Length of the manuscript: The length of this manuscript is about right
Summary of paper in a few sentences:
This paper aims at providing a comprehensive view of similarity measures used in clustering of large data sets.
The authors focus mostly on text and document data and address six different types of similarity measures.
First, the paper summarizes the underlying method for each similarity measure, its advantages and disadvantages, and its specific application domains. Second, the authors compare the presented measures in terms of their efficiency when applied to large collections of text.
Reasons to accept:
Choosing the appropriate similarity measure is indeed a challenging and important step in any clustering process, and the authors invest significant effort into addressing this.
Reasons to reject:
The authors overlook several categories of similarity metrics. For example, semantic similarity (Latent Semantic Analysis) is not addressed at all. Knowledge-based similarity, which uses information derived from semantic networks (e.g. WordNet), is also overlooked.
More modern techniques such as word embeddings should also be mentioned, particularly in the context of large-scale text clustering.
I believe the current work needs to be extended in order to provide a truly comprehensive survey of similarity metrics for text and document clustering.
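To make the gap concrete, here is a minimal sketch of why purely lexical measures can miss documents that an embedding-based measure would group together. The two-dimensional vectors in TOY_EMBEDDINGS are hand-picked, hypothetical values chosen only for illustration; they are not taken from the manuscript or from any trained model, and the helper names are likewise assumptions.

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def bag_of_words(text):
    """Term-frequency vector: a purely lexical representation."""
    return Counter(text.lower().split())

# Hypothetical 2-d "embeddings", invented for this example only.
TOY_EMBEDDINGS = {
    "car": (0.9, 0.1), "automobile": (0.85, 0.15),
    "fast": (0.6, 0.4), "quick": (0.55, 0.45),
    "banana": (0.05, 0.95), "ripe": (0.1, 0.9),
}

def mean_embedding(text):
    """Average the toy vectors of the known words in a document."""
    vecs = [TOY_EMBEDDINGS[w] for w in text.lower().split() if w in TOY_EMBEDDINGS]
    dims = len(next(iter(TOY_EMBEDDINGS.values())))
    if not vecs:
        return {i: 0.0 for i in range(dims)}
    return {i: sum(v[i] for v in vecs) / len(vecs) for i in range(dims)}

doc_a = "fast car"
doc_b = "quick automobile"   # same meaning as doc_a, no shared words
doc_c = "ripe banana"        # unrelated topic

lexical_ab = cosine(bag_of_words(doc_a), bag_of_words(doc_b))
embed_ab = cosine(mean_embedding(doc_a), mean_embedding(doc_b))
embed_ac = cosine(mean_embedding(doc_a), mean_embedding(doc_c))

print(f"lexical  a-b: {lexical_ab:.3f}")
print(f"embedded a-b: {embed_ab:.3f}")
print(f"embedded a-c: {embed_ac:.3f}")
```

Under these toy assumptions the bag-of-words cosine between the two paraphrases is zero because they share no terms, while the averaged-vector cosine ranks them as much closer than the unrelated pair. A real comparison would of course need trained embeddings and an actual corpus.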
The performance evaluation in Section 6 remains completely unclear to me. There is no indication of which data sets were used or what the testing environment was. In this light, the results presented in Table 1 seem unreliable.
I also recommend that the authors check the spelling and sentence formulation throughout. There are many such errors in the text, and they distract from the actual content.
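As an illustration of the kind of information that would make such timings interpretable, the following is a minimal sketch of a benchmark that records the data set size and the testing environment alongside the measured time. Jaccard similarity is used here only as a stand-in measure, and the function names, the report layout, and the three-document toy corpus are all assumptions made for this example, not the authors' setup.

```python
import platform
import time

def jaccard(a, b):
    """Jaccard similarity over whitespace-separated tokens (stand-in measure)."""
    sa, sb = set(a.split()), set(b.split())
    union = sa | sb
    return len(sa & sb) / len(union) if union else 1.0

def benchmark(measure, docs, dataset_name, repeats=3):
    """Time all pairwise comparisons and return the result with its context."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        for i in range(len(docs)):
            for j in range(i + 1, len(docs)):
                measure(docs[i], docs[j])
        timings.append(time.perf_counter() - start)
    n = len(docs)
    return {
        "measure": measure.__name__,
        "dataset": {"name": dataset_name, "documents": n, "pairs": n * (n - 1) // 2},
        "environment": {
            "python": platform.python_version(),
            "system": platform.system(),
            "machine": platform.machine(),
        },
        "repeats": repeats,
        "best_seconds": min(timings),  # best of several runs reduces noise
    }

# Placeholder corpus, invented for this example.
docs = [
    "clustering large text collections",
    "similarity measures for text clustering",
    "word embeddings for documents",
]
report = benchmark(jaccard, docs, dataset_name="toy corpus (placeholder)")
for key, value in report.items():
    print(f"{key}: {value}")
```

Reporting at least these fields next to each figure would let a reader judge whether the reported numbers are comparable across measures.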