Data Decomposition for Outlier Detection coupled with Information Theoretic Validation

Tracking #: 843-1823

Authors:

	Name	ORCID
	GOURANGA DUARI	https://orcid.org/0000-0002-5872-842X
	Rajeev Kumar	https://orcid.org/0000-0003-0233-6563

Responsible editor:

Francesca D. Faraci

Submission Type:

Research Paper

Abstract:

Decomposition for complexity minimization has long been a challenging approach. Yet decomposition for outliers has rarely been experimented with. This paper presents a data decomposition approach as a pre-processor for outlier detection. The decomposition of the data using space partitioning makes homogeneous sub-groups. Consequently, it reduces the complexity of data patterns by isolating possible outliers into the sub-groups from monolithic characters. This approach creates sub-groups of homogeneous data points based on the fitness of purpose. They optimize the outlier patterns in the sub-groups for subsequent mapping of outlier detectors onto the sub-groups. This decomposition strategy is found to be effective in reducing the complexity of learning for the detectors without deterioration in the overall detection rate. We experimented with this approach using different benchmark detectors on eight benchmark datasets. Our data decomposition approach is superior for identifying localized patterns in the partitions and offers a better generalization.

Manuscript:

ds-paper-843.pdf

Data repository URLs:

https://github.com/gourangaduari1995/outlier-decomp

Date of Submission:

Sunday, June 9, 2024

Date of Decision:

Thursday, July 25, 2024

Nanopublication URLs:

Decision:

Reject

Solicited Reviews:

Review #1 submitted on 03/Jul/2024

By Wenyu Zhang

https://orcid.org/0000-0003-3322-9736

Review Details

Reviewer has chosen not to be Anonymous

Overall Impression: Good
Suggested Decision: Undecided
Technical Quality of the paper: Average
Presentation: Good
Reviewer's confidence: High
Significance: High significance
Background: Reasonable
Novelty: Clear novelty
Data availability: All used and produced data (if any) are FAIR and openly available in established data repositories
Length of the manuscript: The length of this manuscript is about right

Summary of paper in a few sentences:

The article presents a k-means clustering based data decomposition as a pre-processing method for outlier detection applications. The article is well organized and clear in structure. The results of the proposed algorithm look interesting and promising.

Reasons to accept:

Revision recommended. (Revision before acceptance)

1. Table 3: Why does data decomposition affect Precision and Recall differently for some datasets? For example, why does Recall improve so much? Also, why does data decomposition affect some datasets more than others? Please add an explanation related to the characteristics of the datasets.
2. Fig. 4: The ROC results look interesting. Can the authors provide an intuitive explanation of why the LOF detector’s performance is not improved by data decomposition?
3. A general question about the proposed method concerns the efficiency of the data decomposition. Since this method adds a step to the data processing, are time and computing cost a concern when it is applied to a larger dataset? This should be addressed in the Discussion section.
4. In Section 4 (p5 L44): “assumption that outliers can only be identified using deviation characters, but it ignores entirely the local structure of data”. How can one quantitatively determine whether the proposed method is applicable to a given dataset?

Minor fixes needed:
1. p11 Sect. 5: dataset names should be the same in the text and in Table 1
2. p16 Fig 6. What does #exclusively mean in the captions?
3. p18 L8 missing word Table?

Reasons to reject:

None

Nanopublication comments:

Further comments:

Review #2 submitted on 07/Jul/2024

Review Details

Reviewer has chosen to be Anonymous

Overall Impression: Weak
Suggested Decision: Reject
Technical Quality of the paper: Average
Presentation: Average
Reviewer's confidence: High
Significance: Low significance
Background: Reasonable
Novelty: Limited novelty
Data availability: All used and produced data (if any) are FAIR and openly available in established data repositories
Length of the manuscript: The authors need to elaborate more on certain aspects and the manuscript should therefore be extended (if the general length limit is already reached, I urge the editor to allow for an exception)

Summary of paper in a few sentences:

The authors introduce a new outlier detection approach that uses data decomposition via k-means clustering to reduce the data space, and then applies existing outlier detection algorithms to this reduced space. The approach has two stages. In stage 1, the authors perform data decomposition using space partitioning to partition the input data into sub-groups. In stage 2, they assign outlier detectors to the sub-groups according to the outlier detection algorithm. The authors evaluate the approach experimentally by combining it with a range of outlier detection techniques and analysing it across a range of datasets.
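
To make the two-stage pipeline summarised above concrete, the following is a minimal sketch and not the authors' implementation. It assumes plain Lloyd's k-means with a deterministic farthest-first initialisation for stage 1 and, as a stand-in for the benchmark detectors, a mean nearest-neighbour distance score compared against the sub-group median for stage 2. The function names (`kmeans`, `knn_scores`, `detect_by_partition`), the parameters `m` and `factor`, and the toy data are all hypothetical.

```python
import math
import statistics


def kmeans(points, k, iters=50):
    """Stage 1: Lloyd's k-means; returns one cluster label per point."""
    # Assumed simplification: deterministic farthest-first initialisation.
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(math.dist(p, c) for c in centers)))
    labels = []
    for _ in range(iters):
        # Assignment step: each point goes to its nearest center.
        labels = [min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points]
        # Update step: recompute each center as the mean of its members.
        updated = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                updated.append(tuple(sum(x) / len(members) for x in zip(*members)))
            else:
                updated.append(centers[c])  # keep the old center if a cluster is empty
        if updated == centers:
            break
        centers = updated
    return labels


def knn_scores(group, m=3):
    """Stand-in detector: mean distance to the m nearest neighbours in the group."""
    scores = []
    for i, p in enumerate(group):
        dists = sorted(math.dist(p, q) for j, q in enumerate(group) if j != i)
        nearest = dists[:m]
        scores.append(sum(nearest) / len(nearest) if nearest else 0.0)
    return scores


def detect_by_partition(points, k, m=3, factor=3.0):
    """Stage 2: run the detector inside each sub-group and flag local outliers."""
    labels = kmeans(points, k)
    flagged = [False] * len(points)
    for c in range(k):
        idx = [i for i, lab in enumerate(labels) if lab == c]
        if not idx:
            continue
        scores = knn_scores([points[i] for i in idx], m)
        median = statistics.median(scores)
        for i, s in zip(idx, scores):
            # Hypothetical rule: flag scores far above the sub-group median.
            flagged[i] = s > factor * median
    return flagged


blob_a = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5)]
blob_b = [(10, 10), (10, 11), (11, 10), (11, 11), (10.5, 10.5)]
points = blob_a + blob_b + [(4, 4)]  # the last point sits apart from both blobs
flags = detect_by_partition(points, k=2)
print([p for p, f in zip(points, flags) if f])
```

In this sketch the threshold is computed per sub-group, so a point is judged against its local neighbourhood rather than against the whole dataset. Singleton sub-groups receive a score of 0 and are never flagged, which a real detector would need to handle differently.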

Reasons to accept:

If the authors can revise the paper so that it motivates the need for a new clustering approach, e.g., by identifying gaps in the literature, then the authors' approach would be justified.

Reasons to reject:

The paper does not motivate the authors' approach well. It does not identify gaps in the literature and then explain how the authors are filling those gaps.

It is not clear what is novel here. Outlier clustering was described in 2004 by Hodge & Austin and in 2012 by Han & Kamber. Why is the method proposed here novel, and what is different from previous approaches? The authors need to discuss previous approaches from the literature and identify the novelty better.
Han, J., Kamber, M., & Pei, J. (2012). Outlier Detection. In Data Mining (pp. 543–584). Elsevier. https://doi.org/10.1016/b978-0-12-381479-1.00012-5
Hodge, V., & Austin, J. (2004). A survey of outlier detection methodologies. Artificial Intelligence Review, 22, 85–126.

Section 3 is a literature review with no motivation and no comparison or contrast of the approaches it covers. What are the gaps in the literature? Why is a new approach necessary? What gap or gaps is the authors' technique filling?

On page 18, how do the authors select k? In the results in the tables and figures, the optimal k value varies across datasets. Does this method work solely with labelled data? On page 2, line 8, the authors seem to introduce their method as unsupervised: they state that unsupervised outlier detection is the exciting research area and then introduce their approach, which leaves the reader to assume it is unsupervised. How would a user of the authors' approach select the optimal k value without labelled data, which is often not available in outlier detection?
See Han, J., Kamber, M., & Pei, J. (2012). Outlier Detection. In Data Mining (pp. 543–584). Elsevier. https://doi.org/10.1016/b978-0-12-381479-1.00012-5
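
The question about choosing k without labels can be made concrete. One common label-free heuristic, assumed here purely for illustration (the review does not indicate whether the manuscript uses it), is to compute the mean silhouette coefficient for each candidate k and prefer the largest value. The sketch below scores a given labelling; the function name `silhouette` and the toy labellings are hypothetical.

```python
import math


def silhouette(points, labels):
    """Mean silhouette coefficient of a labelling (in [-1, 1], higher is better)."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        return 0.0  # undefined for a single cluster; treated as 0 in this sketch
    total = 0.0
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            continue  # singletons contribute 0 by convention
        # a: mean distance to the other members of the same cluster.
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(math.dist(p, q) for q in other) / len(other)
            for key, other in clusters.items()
            if key != lab
        )
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)


pts = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
two_groups = [0, 0, 0, 0, 1, 1, 1, 1]    # one label per blob
three_groups = [0, 0, 2, 2, 1, 1, 1, 1]  # first blob split in two
score_k2 = silhouette(pts, two_groups)
score_k3 = silhouette(pts, three_groups)
print(score_k2 > score_k3)
```

On this toy data the two-group labelling scores higher than the three-group one, which matches the visible structure. Whether such a criterion would pick the k values reported in the manuscript is not something this sketch can establish.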

The paper needs more explanation of ideas and concepts.
Page 1, line 42, the authors introduce supervised, semi-supervised, and unsupervised outlier detection but do not describe them or provide a citation, e.g., Han, J., Kamber, M., & Pei, J. (2012). Outlier Detection. In Data Mining (pp. 543–584). Elsevier. https://doi.org/10.1016/b978-0-12-381479-1.00012-5
Page 2, line 13, how is data decomposition different to clustering?
Page 7, What does figure 3 show? This needs to be explained better. What do 3a, 3b, 3c and 3d show?
Page 8, the caption for figure 3 should explain what the figure shows
Page 11, table 2, are these outliers prescribed in the UCI repository or have the authors chosen which class to choose as outliers? If the authors have chosen then they need to motivate why they have chosen each class in each data set.
Page 16, figure 5, please explain what this shows. What are the numbers on the bars?
Page 18, line 8, Table ??,

The layout needs improving: figures should be placed near where they are first cited. Some are 2 or 3 pages later.
Page 2, line 15, I assume 2 - d should be 2-d?

Nanopublication comments:

Further comments:

1 Comment

meta-review by editor

Submitted by Tobias Kuhn on Thu, 07/25/2024 - 10:53

The paper is rejected due to a lack of novelty, as it fails to present new or groundbreaking findings in its field. Additionally, weaknesses in the methodology and analysis are identified. Furthermore, the paper does not clearly articulate its relation to existing literature, and it requires more thorough clarification on its theories and principles, to enhance understanding and context.

Francesca D. Faraci (https://orcid.org/0000-0002-8720-1256)

Data Science