Efficient Similarity Measures for Clustering a Huge Dataset: A Critical Review

Tracking #: 527-1507

Authors:

	Name	ORCID
	Desmond Bala	https://orcid.org/0000-0002-5723-8429
	Rajesh Prasad	https://orcid.org/0000-0002-3456-6980
	Musa Liman	https://orcid.org/0000-0002-6362-2494

Responsible editor:

Olivia Woolley-Meza

Submission Type:

Survey Paper

Abstract:

The need to apply similarity measures appropriately in clustering has grown over the years as the volume of data keeps increasing. Deciding which similarity measure is best, and for what kind of dataset, has been a cumbersome task in data mining, in data science, and in organizations that depend heavily on knowledge extracted from huge datasets to make crucial decisions. Because different datasets share some common features, a clearer understanding of the various similarity measures for clustering different datasets is needed. This paper presents a critical review of various similarity measures applied in text and data clustering. A theoretical comparison has been made to check the suitability of the measures for different kinds of datasets.

Manuscript:

ds-paper-527.docx

Supplementary Files (optional):

ds-supplementary-527-803.pdf

Data repository URLs:

http://kdd.ics.uci.edu/databases/reuters21578/reuters21578

Date of Submission:

Friday, November 3, 2017

Date of Decision:

Sunday, December 24, 2017

Nanopublication URLs:

Decision:

Reject

Solicited Reviews:

Review #1 submitted on 20/Nov/2017

By Nino Antulov-Fantulin

https://orcid.org/0000-0002-4337-2475

Review Details

Reviewer has chosen not to be Anonymous

Overall Impression: Weak
Suggested Decision: Reject
Technical Quality of the paper: Weak
Presentation: Weak
Reviewer's confidence: Medium
Significance: Low significance
Background: Reasonable
Novelty: Lack of novelty
Data availability: Not all used and produced data are FAIR and openly available in established data repositories; authors need to fix this
Length of the manuscript: The authors need to elaborate more on certain aspects and the manuscript should therefore be extended (if the general length limit is already reached, I urge the editor to allow for an exception)

Summary of paper in a few sentences:

The paper describes five or six different similarity measures, both token-based and edit-based.

The structure of the paper and its technical quality are acceptable.
However, I do not see how this review makes a significant contribution. In my opinion everything is technically correct, but the overall contribution does not reach a medium scientific level.

For example, I do not see any reason why the examples for dot products should be presented.

In my opinion, the only way this paper would deserve to be published is if the authors:
- make a real critical review by experimentally comparing the similarity measures on large-scale datasets and discussing their pros and cons
- elaborate on run-time and space complexity at a more professional level
- elaborate on the real-time online setting for similarity measures
- correct and update the references
- typeset the formulas properly

Reasons to accept:

From my personal view, this review acts more like a good tutorial on the similarity measures than a critical review.

Reasons to reject:

For this kind of review, the authors should have put more emphasis on the computational issues for large-scale datasets.
The only thing that touches on large-scale issues is the run-time and space complexities, which are given in Table 1.
But here I see a lot of technical issues: What is On(1)? Or O(|m| + |n|) described as quadratic?

Nanopublication comments:

Further comments:

Review #2 submitted on 15/Dec/2017

By Izabela Moise

https://orcid.org/0000-0003-0370-6749

Review Details

Reviewer has chosen not to be Anonymous

Overall Impression: Weak
Suggested Decision: Reject
Technical Quality of the paper: Average
Presentation: Bad
Reviewer's confidence: High
Significance: High significance
Background: Incomplete or inappropriate
Novelty: Limited novelty
Data availability: Not all used and produced data are FAIR and openly available in established data repositories; authors need to fix this
Length of the manuscript: The length of this manuscript is about right

Summary of paper in a few sentences:

This paper aims at providing a comprehensive view of similarity measures used in clustering of large data sets.
The authors focus mostly on text and document data and address six different types of similarity measures.
In a first step, the paper summarizes the underlying method for each similarity measure, its advantages and disadvantages, and its specific application domains. In a second step, the authors compare the presented metrics in terms of the efficiency of applying them to large collections of text.

Reasons to accept:

Choosing the appropriate similarity measure is indeed a challenging and important step in any clustering process, and the authors invest significant effort into addressing this.

Reasons to reject:

There are a few categories of similarity metrics that are overlooked by the authors. For example, semantic similarity (Latent Semantic Analysis) is not addressed at all. Knowledge-based similarity, which uses information derived from semantic networks (e.g. WordNet), is also overlooked.
More modern techniques such as word embeddings should also be mentioned, all the more so in the context of large-scale text clustering.

I believe the current work needs to be extended in order to provide a truly comprehensive survey of similarity metrics for text and document clustering.

The performance evaluation in Section 6 remains completely unclear to me. There is no indication of which datasets were used or what the testing environment was. In this light, the results presented in Table 1 seem unreliable.

I also recommend that the authors check the English spelling and the formulation of sentences in general. There are many errors throughout the text, and they really distract from the actual content.

Nanopublication comments:

Further comments:

Review #3 submitted on 22/Dec/2017

Review Details

Reviewer has chosen to be Anonymous

Overall Impression: Weak
Suggested Decision: Reject
Technical Quality of the paper: Weak
Presentation: Weak
Reviewer's confidence: High
Significance: Low significance
Background: Reasonable
Novelty: Lack of novelty
Data availability: Not all used and produced data are FAIR and openly available in established data repositories; authors need to fix this
Length of the manuscript: The length of this manuscript is about right

Summary of paper in a few sentences:

In this paper the authors present some text similarity measures in order to demonstrate which ones are efficient for clustering a "huge" dataset. This review paper starts with very oddly structured English paragraphs that make it hard to read and need immediate fixing. Regardless of the writing quality, the authors do a decent job framing their argument and provide a good-enough overview of background work. In my opinion, the evaluation of efficiency is not sufficient to support any of the paper's arguments, as there are no large-scale experiments or comparisons.

Reasons to accept:

None

Reasons to reject:

1) Not innovative work; there are some eerily similar articles from several years ago:
- A Survey of Text Similarity Approaches (https://pdfs.semanticscholar.org/5b5c/a878c534aee3882a038ef9e82f46e10213...)
- A Review on Text Similarity Technique used in IR and its Application (http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.695.3407&rep=re...)

2) Quality of the paper is not up to the standard of a journal publication

3) Not enough experiments (or none at all) to prove efficiency on a large-scale corpus

4) The clustering aspect of the evaluation is missing (experimentally)

Nanopublication comments:

Further comments:

1 Comment

Meta-Review by Editor

Submitted by Tobias Kuhn on Thu, 01/11/2018 - 01:15

We have reviewed your submission carefully. The task that you set out to address with the manuscript is an important one. Nonetheless, in its current state the paper is not suitable for publication and requires very significant changes. The reviewers' comments offer useful input to guide improvement of your work. I would emphasize the most important areas: (1) a more in-depth evaluation of the efficiency of algorithms on large-scale datasets, (2) a more comprehensive review of existing methods (or justification of your reduced scope), and (3) differentiating your work from existing reviews. The text also needs numerous corrections to grammar and spelling to improve readability. Finally, a clearer explanation of what data you are using is needed, and should you decide to submit again to this journal, you must ensure the data is FAIR and openly available in established data repositories.

Olivia Woolley-Meza (0000-0003-4517-2765)

Data Science
