REVIEWER 1
Comment: How did you identify the number of principal components to keep during PCA? How did you set t-SNE hyperparams (perplexity, learning rate, number of iterations)? Can you show that your results are not just a consequence of the particular PCA / t-SNE settings you used?
Response: We retained 95\% of the variance through PCA to minimize information loss before applying t-SNE. We then used the following t-SNE hyperparameters: TSNE(n_components=2, perplexity=30, learning_rate="auto", n_iter=300).
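For clarity, a minimal sketch of this pipeline in scikit-learn follows; the embedding matrix X and its shape are placeholders, not the actual data.

```python
# Minimal sketch of the PCA + t-SNE pipeline described above (scikit-learn).
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

X = np.random.rand(1000, 384)  # placeholder for the embedding matrix

# Keep the smallest number of components that explains 95% of the variance.
X_pca = PCA(n_components=0.95).fit_transform(X)

# t-SNE with the settings reported above (in scikit-learn >= 1.5 the
# n_iter argument is renamed max_iter).
X_2d = TSNE(n_components=2, perplexity=30, learning_rate="auto",
            n_iter=300).fit_transform(X_pca)
```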
Comment: Fig. 1 mentions MultiHead Self-Attention and the use of a Standard Scaler. Neither is mentioned in the manuscript.
Response: We acknowledge the inconsistency; this information has been added to the Methodology section.
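For reference, a minimal sketch of the standardization step follows; placing it before dimensionality reduction is our assumption for illustration here, and the exact placement is described in the Methodology.

```python
# Minimal sketch of the Standard Scaler step (scikit-learn); applying it
# before PCA is an assumption for illustration.
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.random.rand(1000, 384)                 # placeholder embedding matrix
X_scaled = StandardScaler().fit_transform(X)  # zero mean, unit variance per feature
```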
Comment: There is no information on how embeddings are constructed for each dataset, nor a description of the contents of the datasets or the expected clustering results. You mention using a sentence embedding model from Sentence-BERT. Is this applied to all five datasets? If so, please specify how you format the text for embedding in each case. If you use tabular features as embeddings, please specify the sources of each feature.
Response: Yes, Sentence-BERT was applied to all text datasets. This is now explained in detail in the Methodology section.
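For illustration, a minimal sketch of the embedding step with the sentence-transformers library follows; the checkpoint name is an assumption, and the exact model is specified in the Methodology.

```python
# Minimal sketch of producing Sentence-BERT embeddings; the checkpoint
# "all-MiniLM-L6-v2" is an assumed example, not necessarily the one used.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
texts = [
    "Patient reports nausea after starting the new medication.",
    "No adverse reaction observed during follow-up.",
]
embeddings = model.encode(texts)  # array of shape (len(texts), embedding_dim)
```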
Comment: Table 4 does not specify the dataset used in the experiment
Response: MIMIC-III was the dataset used in Table 4. We created this dataset ourselves, and it was the most complex among those used due to its structure, nature, and the presence of longer sentences. It was derived from patients' clinical notes and was designed to detect whether a patient was experiencing an adverse drug reaction.
Comment: Table 5 repeats the same exact DBI values for all four clustering algorithms. If this is a copying error, the actual DBI values are missing. Also, the constant data size used in the experiment is missing.
Response: We agree; the duplicated DBI values were a copying error, and the datasets' data sizes were missing. Both issues have been corrected, and the revisions are reflected in Table 5.
Comment: Figure 7 is missing the dataset used to run the experiment
Response: MIMIC-III was the dataset used in Figure 7.
Comment: Figure 8 is missing the constant data size used in the experiment
Response: The constant data size used in Figure 8 is 4000.
Comment: The discussion section refers to SS-DBSCAN having better noise sensitivity than the baselines but does not specify how SS-DBSCAN handles noise differently from the default DBSCAN behavior. Are noise points excluded from the computation of Silhouette when doing the fast grid search, or are you referring to something else?
Response: We have clarified this in the Discussion section.
Comment: No discussion of limitations as typically found in a limitations section
Response: We have not yet identified specific limitations; however, we recommend further research to explore the applicability of our approach to other types of data, such as images, audio, and other modalities.
Comment: The Silhouette metric is used for both MinPts selection and experimental evaluation when comparing to baselines. However, it has been shown that Silhouette is not an appropriate metric for density-based clustering algorithms such as DBSCAN, specifically when dealing with irregularly (non-spherical) shaped clusters. Instead, the DBCV [1] metric should be used.
Response: We used both the Silhouette score and the Davies-Bouldin Index (DBI). Other metrics, such as Normalized Mutual Information (NMI) and the Homogeneity Score, are designed for labeled datasets; since our study involves unlabeled datasets, they were not applicable.
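For reference, a minimal sketch of computing both internal metrics with scikit-learn on synthetic data follows; filtering out DBSCAN noise points before scoring is shown here for illustration only, and our exact handling is described in the Discussion.

```python
# Minimal sketch of the two internal metrics on synthetic data (scikit-learn).
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score, davies_bouldin_score

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.6, random_state=0)
labels = DBSCAN(eps=0.8, min_samples=5).fit_predict(X)

# Both metrics assume every point belongs to a cluster, so DBSCAN noise
# points (label -1) are filtered out here for illustration.
mask = labels != -1
print("Silhouette (higher is better):", silhouette_score(X[mask], labels[mask]))
print("Davies-Bouldin Index (lower is better):", davies_bouldin_score(X[mask], labels[mask]))
```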
Comment: No justification is given that dimensionality reduction is needed for the SS-DBSCAN hyperparameter selection process. It is intuitive since dimensionality reduction is commonly used with clustering in general, but this needs to be shown either empirically or theoretically if you are claiming that PCA + t-SNE is a critical piece of your contribution.
Response: We have now justified the necessity of dimensionality reduction in the Methodology section.
Comment: Why PCA + t-SNE and not UMAP [2]? UMAP is a dimension reduction algorithm commonly used with clustering - would SS-DBSCAN work with UMAP as well as it does with t-SNE? If not, this would be a critical limitation that potential users would want to know about.
Response: Our decision to use PCA followed by t-SNE was not meant to suggest that UMAP or other dimensionality reduction methods are inferior; it was simply the approach that yielded the best results for our specific experiment. To verify this, we performed an additional experiment and have attached a document with the results alongside this response. SS-DBSCAN works with UMAP as well, so this is not a limitation.
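For completeness, a minimal sketch of the UMAP substitution with the umap-learn library follows; the parameter values are illustrative library defaults, not the ones used in the attached experiment.

```python
# Minimal sketch of swapping UMAP in for PCA + t-SNE (umap-learn);
# parameter values are illustrative defaults.
import numpy as np
import umap

X = np.random.rand(1000, 384)  # placeholder embedding matrix
X_2d = umap.UMAP(n_components=2, n_neighbors=15, min_dist=0.1,
                 random_state=42).fit_transform(X)
```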
REVIEWER 2
Comment: It is unclear how the contributions under Section 3 relate to each other. Isn't (1) the same as (4)? Are (2) and (3) part of (1)?
Response: The interconnection between the contributions is now clarified in the manuscript. Contribution (1) primarily focuses on complexity and high-dimensional data, while Contribution (4) emphasizes scalability and adaptability.
Comment: Typo: "Contibution" > "Contribution"
Response: This typo has been corrected.
Comment: Methodology and Evaluation are not sufficiently separated. The Methodology section starts with "4.1 Data Preprocessing" detailing what datasets were used, which should be part of the evaluation (or maybe I am misunderstanding the purpose of this, in which case it should be clarified). First the conceptual approach should be fully described (e.g. in a "Methodology" section), before the details about the scientific evaluation are introduced (e.g. in a "Evaluation" section).
Response: This has been addressed in both sections. We mention the datasets and their preprocessing in both places to ensure clarity from a methodological perspective as well as within the experimental context.
Comment: Therefore it's unclear whether the pre-processing in Section 4.1 is part of the Approach or only done for the evaluation.
Response: Data preprocessing is now clearly stated as a crucial step of SS-DBSCAN and all baseline algorithms, not just of the evaluation. Preprocessing is not uniform across studies; it depends on the data used, and we have added more detail to address this.
Comment: "These datasets comprise sequences with lengths ranging from a minimum of 50 to a maximum of 500, with an average length of 250.": What datasets are these? Why were they selected? What's their role? What are "sequences" here? How are these minimums/maximums relevant and how were they chosen?
Response: We now explicitly state the rationale behind dataset selection. Sequences refer to text segments in the dataset. The limits were chosen based on an initial exploratory analysis, ensuring that a majority of sequences retained essential information without unnecessary complexity. The average length of 250 reflects the natural distribution within the selected datasets.
Comment: Missing motivation why PCA, t-SNE, etc. are part of the approach. What purpose do they serve in the bigger picture?
Response: This motivation is now provided in the Methodology section.
Comment: Overall, the approach isn't well introduced on a general level. Section 4 dives right into the details. It should first give a general and intuitive overview and motivation.
Response: An overview and motivation subsection has been added at the beginning of the Methodology section to provide context before delving into the technical details.
Comment: "A novel stratified sampling technique": This should be better motivated too. What's the intuition behind this? Why could we expect this to work better than the alternatives? A conceptual diagram or something like that could help too.
Response: This has been addressed in the Methodology section.
Comment: Section 5, "across multiple datasets, including...": It's unclear how these datasets were selected.
Response: This has been explained in the Experiment Setup section.
Comment: Section 5: How was the dataset used in "varying sizes"? What kind of sampling was applied? This should be explained better.
Response: There is no predefined sample size for this type of research. Instead, we ran the experiments at increasing data sizes to demonstrate the scalability of our SS-DBSCAN algorithm; the comparative algorithms struggled as the data size grew, which highlights the robustness of our approach.
Comment: In Section 5, the paper relies too much on the visualizations. They are nice, but don't give scientific answers. I would only show a couple of the visualizations to give an impression for the reader. But the real results are in the tables, and they should get more prominence.
Response: We apologize if the extensive visualizations were distracting. They were intended to illustrate how cluster formations evolve as the data size increases for each algorithm, and they are supported by the results presented in the tables.
Comment: The tables should better explain what they show. E.g. "more is better" and "less is better". The best results could be shown in bold. The caption should give us more indication what we are looking at.
Response: The best results are now shown in bold to highlight the differences between algorithms.
Comment: For the subsections like "5.1.2. Clustering Results with DBSCAN", it should be made clearer whether this is used as a competing approach or baseline we are comparing your contribution against, OR whether this is a variant of showing the performance of your approach. Specifically, does 5.1.2 applying your approach under the hood too or not?
Response: Our experimental setup compared four algorithms: SS-DBSCAN, DBSCAN, HDBSCAN, and OPTICS. The latter three serve as competing baselines; Section 5.1.2 therefore reports plain DBSCAN without our approach applied under the hood. The objective was to evaluate the performance of SS-DBSCAN against these DBSCAN variants.
Comment: Overall, Section 5 is lacking structure. The short subsections and many images disturb the text flow.
Response: We acknowledge that the images disrupt the text flow. This is a common challenge with LaTeX float placement, which moves figures to fit the page layout. However, we have ensured that all images and tables are properly referenced in the text for easy navigation.
Comment: "The datasets included Emotion-Sentiment, Coronavirus-Tweets, Cancer-Doc, and Sonar.": Again, why these? This should be justified.
Response: This has been justified in the Experiment Setup section.
Comment: Given that there are still many kinds of datasets out there that this approach was not tested on (which is natural and fine), I think the Discussion section should include a part where the generalization to other kinds of datasets/domains is discussed. Can we expect this to work there too? A statement like "we don't know; needs further work" is perfectly fine here, but I think this should be addressed.
Response: We have explicitly mentioned this in the Discussion section, noting that future work is needed to assess its performance across a broader range of datasets.
Comment: Conclusion could be a bit more elaborate. Maybe picking up some points from the introduction again, and quickly summarizing the discussion points from Section 7.
Response: We have expanded the conclusion as suggested.