Enhanced SS-DBSCAN Clustering Algorithm for High-Dimensional Data

Tracking #: 902-1882

Authors:



Responsible editor: 

Jamie McCusker

Submission Type: 

Research Paper

Abstract: 

This research introduces an enhanced SS-DBSCAN, a scalable and robust density-based clustering algorithm designed to tackle challenges in high-dimensional and complex data analysis. The algorithm integrates advanced parameter optimization techniques to improve clustering accuracy and interpretability. Key innovations include a Fast Grid Search (FGS) method that searches for the optimal MinPts while holding the previously obtained epsilon parameter constant. Notably, this study emphasizes the often-overlooked MinPts parameter, introducing a dynamic approach that begins by calculating density metrics within a specified epsilon distance and then adjusts the MinPts search range based on the standard deviation of these metrics. This approach identifies optimal MinPts values within the maximum allowed range. Comprehensive experiments on five real-world datasets demonstrate SS-DBSCAN's superior performance compared to DBSCAN, HDBSCAN, and OPTICS, evidenced by higher silhouette scores and more favorable Davies-Bouldin Index values. The results highlight SS-DBSCAN's ability to capture intrinsic clustering structures accurately, providing deeper insights across various research domains. SS-DBSCAN's scalability and adaptability to diverse data densities make it a valuable tool for analyzing large, complex datasets.
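The MinPts selection procedure described in the abstract can be sketched roughly as follows. This is a minimal, hypothetical reconstruction, not the authors' implementation: the function name `fast_grid_search_minpts`, the use of neighbour counts as the density metric, and the one-standard-deviation range rule are all assumptions made for illustration.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors
from sklearn.metrics import silhouette_score

def fast_grid_search_minpts(X, eps):
    """Sketch of the dynamic MinPts search: epsilon is held constant,
    per-point neighbour counts within eps serve as density metrics, and
    their standard deviation bounds the MinPts grid."""
    # Density metric: number of neighbours within eps for each point.
    nn = NearestNeighbors(radius=eps).fit(X)
    neighborhoods = nn.radius_neighbors(X, return_distance=False)
    counts = np.array([len(idx) for idx in neighborhoods])
    mean, std = counts.mean(), counts.std()
    # Assumed range rule: centre the grid on the mean density,
    # widen it by one standard deviation on each side.
    lo = max(2, int(mean - std))
    hi = max(lo + 1, int(mean + std))
    best_minpts, best_score = None, -1.0
    for minpts in range(lo, hi + 1):
        labels = DBSCAN(eps=eps, min_samples=minpts).fit_predict(X)
        mask = labels != -1                    # exclude noise points
        if len(set(labels[mask])) < 2:
            continue                           # silhouette needs >= 2 clusters
        score = silhouette_score(X[mask], labels[mask])
        if score > best_score:
            best_minpts, best_score = minpts, score
    return best_minpts, best_score
```

On well-separated synthetic blobs, the search returns the MinPts value that maximises the silhouette score over the density-derived grid; the actual SS-DBSCAN range rule may differ.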

Manuscript: 

Supplementary Files (optional): 

Previous Version: 

Tags: 

  • Reviewed

Data repository URLs: 

Date of Submission: 

Thursday, March 6, 2025

Date of Decision: 

Sunday, March 23, 2025


Nanopublication URLs:

Decision: 

Accept

Solicited Reviews:


1 Comment

meta-review by editor

We need you to specifically address the following issues in order to complete the acceptance of the paper:

  • The description of MultiHead Self-Attention in Fig 1 and the surrounding prose is confusing. Are you referring to the MHSA that exists in the S-BERT encoder, or are you applying a separate attention layer on top of the encoder's outputs? Unless you are doing something new / customized with the attention mechanism itself, it is sufficient just to state that you use S-BERT's encoder to get context-sensitive sentence embeddings. No need to specifically call out the attention mechanism.
  • Although you added more details on the datasets themselves, there is still no information on how you format the datasets for S-BERT. Please add examples to an appendix section where you show example records from a dataset (e.g. MIMIC III) and how the text is formatted for embedding. This is especially important for embeddings of tabular or other structured data that has been converted to text.
  • You did not adequately address the concern over use of Silhouette with DBSCAN. Show that the clusters in these datasets are overall spherical (and thus that Silhouette is a suitable metric). If non-spherical clusters emerge in your experiments, please exclude them from your results and give details on this in the limitations / future work section of your paper. Alternatively, you can use the DBCV metric, which handles non-spherical clusters.
  • You added a section discussing the motivation for dimensionality reduction, specifically addressing the benefits of PCA and comparison of t-SNE and UMAP, but please report what happens if you remove dimensionality reduction altogether, since you are claiming it to be a core requirement of your method.
  • The produced data, including what is needed to reproduce the figures and tables, needs to be accessible in a public repository. Under the condition that this is fixed, I can recommend acceptance of this manuscript.

Jamie McCusker (https://orcid.org/0000-0003-1085-6059)