Review Details
Reviewer has chosen to be Anonymous
Overall Impression: Average
Suggested Decision: Undecided
Technical Quality of the paper: Weak
Presentation: Good
Reviewer's confidence: High
Significance: High significance
Background: Reasonable
Novelty: Clear novelty
Data availability: All used and produced data (if any) are FAIR and openly available in established data repositories
Length of the manuscript: The authors need to elaborate more on certain aspects and the manuscript should therefore be extended (if the general length limit is already reached, I urge the editor to allow for an exception)
Summary of paper in a few sentences:
This paper introduces SS-DBSCAN, a novel hyperparameter selection method for DBSCAN that uses statistical properties of the dataset to derive appropriate values for both epsilon and MinPts parameters.
For a given dataset of high-dimensional embeddings, SS-DBSCAN works as follows:
1. Perform dimensionality reduction on the embeddings with PCA, followed by t-SNE
2. Perform stratified sampling on the average distances between points to find an epsilon value that best accommodates the variety of spatial distributions found in the data
3. Using the epsilon value from step 2, compute the mean and std. of density (number of neighbors found within epsilon of a given point)
4. Perform a fast grid search on MinPts, from mean-std. to mean+std. with an early stopping tolerance of 5 steps, using Silhouette as the metric.
5. Apply standard DBSCAN using the epsilon and MinPts values found in steps 2 and 4
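For concreteness, the five steps above can be sketched as follows. This is only my reading of the method as summarized here; the sampling scheme, the k used for neighbor distances, the quantile strata, and all PCA/t-SNE settings are my assumptions, not details given in the manuscript (which is part of the problem, see below).

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.cluster import DBSCAN
from sklearn.metrics import silhouette_score
from sklearn.neighbors import NearestNeighbors

def ss_dbscan(X, n_pca=10, random_state=0):
    # Step 1: PCA followed by t-SNE (all settings here are assumed defaults).
    X = PCA(n_components=min(n_pca, X.shape[1]), random_state=random_state).fit_transform(X)
    X = TSNE(n_components=2, random_state=random_state).fit_transform(X)

    # Step 2: stratified sampling over average neighbor distances to pick
    # epsilon (quantile strata are my guess at the stratification scheme).
    nn = NearestNeighbors(n_neighbors=5).fit(X)
    dists, _ = nn.kneighbors(X)
    avg_d = dists[:, 1:].mean(axis=1)
    eps = float(np.quantile(avg_d, [0.1, 0.3, 0.5, 0.7, 0.9]).mean())

    # Step 3: density = number of neighbors within eps of each point.
    neigh = nn.radius_neighbors(X, radius=eps, return_distance=False)
    counts = np.asarray([len(idx) - 1 for idx in neigh])  # exclude self
    mu, sigma = counts.mean(), counts.std()

    # Step 4: grid search on MinPts in [mu - sigma, mu + sigma], scored by
    # Silhouette (noise points excluded), early stop after 5 flat steps.
    lo = max(2, int(mu - sigma))
    best_score, best_minpts, patience = -np.inf, lo, 0
    for minpts in range(lo, int(mu + sigma) + 1):
        labels = DBSCAN(eps=eps, min_samples=minpts).fit_predict(X)
        mask = labels != -1
        if mask.sum() < 2 or len(set(labels[mask])) < 2:
            score = -1.0
        else:
            score = silhouette_score(X[mask], labels[mask])
        if score > best_score:
            best_score, best_minpts, patience = score, minpts, 0
        else:
            patience += 1
            if patience >= 5:
                break

    # Step 5: standard DBSCAN with the selected hyperparameters.
    return DBSCAN(eps=eps, min_samples=best_minpts).fit_predict(X), eps, best_minpts

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
labels, eps, minpts = ss_dbscan(X)
```

Even as a sketch, writing this out exposes several underspecified choices (the stratification scheme, neighbor count, whether noise is excluded from Silhouette) that the manuscript should pin down.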
The authors perform experiments with varying data sizes and a diverse collection of five datasets to show that SS-DBSCAN delivers consistent, high-quality clustering results at different data scales and across datasets compared to DBSCAN, HDBSCAN, and OPTICS baselines.
Reasons to accept:
- Novel use of intrinsic dataset statistics to derive good values for epsilon and MinPts for DBSCAN.
- Experiments show SS-DBSCAN outperforms DBSCAN, HDBSCAN, and OPTICS baselines across a diverse set of five datasets.
- Experiments show SS-DBSCAN is more robust to data scale than DBSCAN, HDBSCAN, and OPTICS baselines.
Reasons to reject:
1. Critical information is missing from the manuscript:
- How did you identify the number of principal components to keep during PCA? How did you set the t-SNE hyperparameters (perplexity, learning rate, number of iterations)? Can you show that your results are not just a consequence of the particular PCA / t-SNE settings you used?
- Fig.1 mentions MultiHead Self Attention and use of a Standard Scaler. Neither are mentioned in the manuscript.
- There is no information on how embeddings are constructed for each dataset, nor a description of the contents of the datasets or the expected clustering results. You mention using a sentence embedding model from Sentence-BERT. Is this applied to all five datasets? If so, please specify how you format the text for embedding in each case. If you use tabular features as embeddings please specify the sources of each feature.
- Table 4 does not specify the dataset used in the experiment
- Table 5 repeats identical DBI values for all four clustering algorithms. If this is a copying error, the actual DBI values are missing. The constant data size used in the experiment is also missing
- Figure 7 is missing the dataset used to run the experiment
- Figure 8 is missing the constant data size used in the experiment
- The discussion section refers to SS-DBSCAN having better noise sensitivity than the baselines, but does not specify how SS-DBSCAN handles noise differently from the default DBSCAN behavior. Are noise points excluded from the computation of Silhouette when doing the fast grid search, or are you referring to something else?
- There is no discussion of limitations, as would typically appear in a dedicated limitations section
2. The Silhouette metric is used for both MinPts selection and experimental evaluation when comparing to baselines. However, it has been shown that Silhouette is not an appropriate metric for density-based clustering algorithms such as DBSCAN, specifically when dealing with irregularly (non-spherical) shaped clusters. Instead, the DBCV [1] metric should be used.
[1] Moulavi, D., Jaskowiak, P. A., Campello, R. J., Zimek, A., & Sander, J. (2014, April). Density-based clustering validation. In Proceedings of the 2014 SIAM international conference on data mining (pp. 839-847). Society for Industrial and Applied Mathematics.
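This mismatch is easy to demonstrate. The sketch below (my own illustration, not from the manuscript) uses the standard two-moons dataset: DBSCAN recovers the true non-spherical clusters almost perfectly, yet the Silhouette score stays mediocre because the metric rewards convex, well-separated clusters.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.cluster import DBSCAN
from sklearn.metrics import silhouette_score, adjusted_rand_score

# Two interleaved half-moons: a textbook non-spherical clustering problem.
X, y_true = make_moons(n_samples=500, noise=0.05, random_state=0)
labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

ari = adjusted_rand_score(y_true, labels)  # near-perfect recovery
sil = silhouette_score(X, labels)          # mediocre despite that
```

Here ARI confirms the clustering matches the ground truth, while Silhouette would suggest a poor result. This is exactly the failure mode DBCV [1] was designed to avoid, and it affects both the MinPts grid search and the reported comparisons.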
3. No justification is given that dimensionality reduction is needed for the SS-DBSCAN hyperparameter selection process. It is intuitive since dimensionality reduction is commonly used with clustering in general, but this needs to be shown either empirically or theoretically if you are claiming that PCA + t-SNE is a critical piece of your contribution.
4. Why PCA + t-SNE and not UMAP [2]? UMAP is a dimension reduction algorithm commonly used with clustering - would SS-DBSCAN work with UMAP as well as it does with t-SNE? If not, this would be a critical limitation that potential users would want to know about.
[2] McInnes, L., Healy, J., & Melville, J. (2018). Umap: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426.
Nanopublication comments:
N/A
Further comments:
Overall this is very promising work. A DBSCAN variant that is as robust as these experiments show would be an enormous benefit to the research community. However, the manuscript needs significant revision before it is in shape to be accepted, as discussed above. I look forward to reviewing an updated version of this paper!
meta-review by editor
Submitted by Tobias Kuhn on
We are very intrigued by the approach in this paper, but feel that we need more elaboration. The reviewers found that the manuscript needed better separation of methodology and evaluation, a greater focus on quantitative evaluation, and a discussion of limitations, among other issues. To proceed, please address all the comments in the reviews for resubmission.
Jamie McCusker (https://orcid.org/0000-0003-1085-6059)