Review Details
Reviewer has chosen to be Anonymous
Overall Impression: Average
Suggested Decision: Undecided
Technical Quality of the paper: Weak
Presentation: Good
Reviewer's confidence: High
Significance: High significance
Background: Reasonable
Novelty: Clear novelty
Data availability: All used and produced data (if any) are FAIR and openly available in established data repositories
Length of the manuscript: The authors need to elaborate more on certain aspects and the manuscript should therefore be extended (if the general length limit is already reached, I urge the editor to allow for an exception)
Summary of paper in a few sentences:
This paper introduces SS-DBSCAN, a novel hyperparameter selection method for DBSCAN that uses statistical properties of the dataset to derive appropriate values for both epsilon and MinPts parameters.
For a given dataset of high-dimensional embeddings, SS-DBSCAN works as follows:
1. Perform dimensionality reduction on the embeddings with PCA, followed by t-SNE
2. Perform stratified sampling on the average distances between points to find an epsilon value that best accommodates the variety of spatial distributions found in the data
3. Using the epsilon value from step 2, compute the mean and std. of density (number of neighbors found within epsilon of a given point)
4. Perform a fast grid search on MinPts, from mean-std. to mean+std. with an early stopping tolerance of 5 steps, using Silhouette as the metric.
5. Apply standard DBSCAN using the epsilon and MinPts values found in steps 2 and 4
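For concreteness, the five steps above can be sketched as follows. This is only my reading of the method as summarized here; the sampling scheme, the k used for neighbor distances, the quantile strata, and all PCA/t-SNE settings are my assumptions, not details given in the manuscript (which is part of the problem, see below).

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.cluster import DBSCAN
from sklearn.metrics import silhouette_score
from sklearn.neighbors import NearestNeighbors

def ss_dbscan(X, n_pca=10, random_state=0):
    # Step 1: PCA followed by t-SNE (all settings here are assumed defaults).
    X = PCA(n_components=min(n_pca, X.shape[1]), random_state=random_state).fit_transform(X)
    X = TSNE(n_components=2, random_state=random_state).fit_transform(X)

    # Step 2: stratified sampling over average neighbor distances to pick
    # epsilon (quantile strata are my guess at the stratification scheme).
    nn = NearestNeighbors(n_neighbors=5).fit(X)
    dists, _ = nn.kneighbors(X)
    avg_d = dists[:, 1:].mean(axis=1)
    eps = float(np.quantile(avg_d, [0.1, 0.3, 0.5, 0.7, 0.9]).mean())

    # Step 3: density = number of neighbors within eps of each point.
    neigh = nn.radius_neighbors(X, radius=eps, return_distance=False)
    counts = np.asarray([len(idx) - 1 for idx in neigh])  # exclude self
    mu, sigma = counts.mean(), counts.std()

    # Step 4: grid search on MinPts in [mu - sigma, mu + sigma], scored by
    # Silhouette (noise points excluded), early stop after 5 flat steps.
    lo = max(2, int(mu - sigma))
    best_score, best_minpts, patience = -np.inf, lo, 0
    for minpts in range(lo, int(mu + sigma) + 1):
        labels = DBSCAN(eps=eps, min_samples=minpts).fit_predict(X)
        mask = labels != -1
        if mask.sum() < 2 or len(set(labels[mask])) < 2:
            score = -1.0
        else:
            score = silhouette_score(X[mask], labels[mask])
        if score > best_score:
            best_score, best_minpts, patience = score, minpts, 0
        else:
            patience += 1
            if patience >= 5:
                break

    # Step 5: standard DBSCAN with the selected hyperparameters.
    return DBSCAN(eps=eps, min_samples=best_minpts).fit_predict(X), eps, best_minpts

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
labels, eps, minpts = ss_dbscan(X)
```

Even as a sketch, writing this out exposes several underspecified choices (the stratification scheme, neighbor count, whether noise is excluded from Silhouette) that the manuscript should pin down.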
The authors perform experiments with varying data sizes and a diverse collection of five datasets to show that SS-DBSCAN delivers consistent, high-quality clustering results at different data scales and across datasets compared to DBSCAN, HDBSCAN, and OPTICS baselines.
Reasons to accept:
- Novel use of intrinsic dataset statistics to derive good values for epsilon and MinPts for DBSCAN.
- Experiments show SS-DBSCAN outperforms DBSCAN, HDBSCAN, and OPTICS baselines across a diverse set of five datasets.
- Experiments show SS-DBSCAN is more robust to data scale than DBSCAN, HDBSCAN, and OPTICS baselines.
Reasons to reject:
1. Critical information is missing from the manuscript:
- How did you identify the number of principal components to keep during PCA? How did you set the t-SNE hyperparameters (perplexity, learning rate, number of iterations)? Can you show that your results are not just a consequence of the particular PCA / t-SNE settings you used?
- Fig.1 mentions MultiHead Self Attention and use of a Standard Scaler. Neither are mentioned in the manuscript.
- There is no information on how embeddings are constructed for each dataset, nor a description of the contents of the datasets or the expected clustering results. You mention using a sentence embedding model from Sentence-BERT. Is this applied to all five datasets? If so, please specify how you format the text for embedding in each case. If you use tabular features as embeddings please specify the sources of each feature.
- Table 4 does not specify the dataset used in the experiment
- Table 5 repeats identical DBI values for all four clustering algorithms. If this is a copying error, the actual DBI values are missing. The constant data size used in the experiment is also missing
- Figure 7 is missing the dataset used to run the experiment
- Figure 8 is missing the constant data size used in the experiment
- The discussion section refers to SS-DBSCAN having better noise sensitivity than the baselines, but does not specify how SS-DBSCAN handles noise differently from the default DBSCAN behavior. Are noise points excluded from the computation of Silhouette when doing the fast grid search, or are you referring to something else?
- There is no discussion of limitations, as would typically appear in a dedicated limitations section
2. The Silhouette metric is used for both MinPts selection and experimental evaluation when comparing to baselines. However, it has been shown that Silhouette is not an appropriate metric for density-based clustering algorithms such as DBSCAN, specifically when dealing with irregularly (non-spherical) shaped clusters. Instead, the DBCV [1] metric should be used.
[1] Moulavi, D., Jaskowiak, P. A., Campello, R. J., Zimek, A., & Sander, J. (2014, April). Density-based clustering validation. In Proceedings of the 2014 SIAM international conference on data mining (pp. 839-847). Society for Industrial and Applied Mathematics.
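This mismatch is easy to demonstrate. The sketch below (my own illustration, not from the manuscript) uses the standard two-moons dataset: DBSCAN recovers the true non-spherical clusters almost perfectly, yet the Silhouette score stays mediocre because the metric rewards convex, well-separated clusters.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.cluster import DBSCAN
from sklearn.metrics import silhouette_score, adjusted_rand_score

# Two interleaved half-moons: a textbook non-spherical clustering problem.
X, y_true = make_moons(n_samples=500, noise=0.05, random_state=0)
labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

ari = adjusted_rand_score(y_true, labels)  # near-perfect recovery
sil = silhouette_score(X, labels)          # mediocre despite that
```

Here ARI confirms the clustering matches the ground truth, while Silhouette would suggest a poor result. This is exactly the failure mode DBCV [1] was designed to avoid, and it affects both the MinPts grid search and the reported comparisons.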
3. No justification is given that dimensionality reduction is needed for the SS-DBSCAN hyperparameter selection process. It is intuitive since dimensionality reduction is commonly used with clustering in general, but this needs to be shown either empirically or theoretically if you are claiming that PCA + t-SNE is a critical piece of your contribution.
4. Why PCA + t-SNE and not UMAP [2]? UMAP is a dimension reduction algorithm commonly used with clustering - would SS-DBSCAN work with UMAP as well as it does with t-SNE? If not, this would be a critical limitation that potential users would want to know about.
[2] McInnes, L., Healy, J., & Melville, J. (2018). Umap: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426.
Nanopublication comments:
N/A
Further comments:
Overall this is very promising work. A DBSCAN variant that is as robust as these experiments show would be an enormous benefit to the research community. However, the manuscript needs significant revision before it is in shape to be accepted, as discussed above. I look forward to reviewing an updated version of this paper!
meta-review by editor
Submitted by Tobias Kuhn on
We are very intrigued by the approach in this paper, but feel that we need more elaboration. The reviewers found that the manuscript needed better separation of methodology and evaluation, a greater focus on quantitative evaluation, and a discussion of limitations, among other issues. To proceed, please address all the comments in the reviews for resubmission.
Jamie McCusker (https://orcid.org/0000-0003-1085-6059)