Review Details

Reviewer has chosen to be *Anonymous*

**Overall Impression:** Average

**Suggested Decision:** Undecided

**Technical Quality of the paper:** Weak

**Presentation:** Weak

**Reviewer's confidence:** Medium

**Significance:** High significance

**Background:** Reasonable

**Novelty:** Lack of novelty

**Data availability:** All used and produced data (if any) are FAIR and openly available in established data repositories

**Length of the manuscript:** The authors need to elaborate more on certain aspects and the manuscript should therefore be extended (if the general length limit is already reached, I urge the editor to allow for an exception)

**Summary of paper in a few sentences:**

This paper describes an algorithm for classifying whether an admitted ICU patient is at high risk of mortality (binary classification), using diagnostic and monitoring data recorded in the first 24 hours following admission to the ICU. The algorithm takes a three-step approach to building a classifier. First, variables are selected by univariate t-tests comparing their values between survivors and non-survivors. Second, for all continuous numerical variables, optimal threshold cut-points are identified in order to discretize them. To do so, chi-square tests of independence are performed between the counts generated by thresholding a variable and the outcome variable (survived / not survived). Essentially, if a proposed threshold value splits the patient data in such a way that the chi-square test is significant (p-value < 0.05), that threshold is kept; otherwise it is rejected. The procedure can be applied recursively to increase the number of cut-points and thus the number of discrete categories of a continuous variable. Finally, logistic regression is performed in a 5-fold cross-validation scheme to learn the model. The main results are the identification of high PO2, old age, eye-opening score, cardiac arrest, and COPD as highly predictive of mortality. Further, the authors report a high AUC score of 0.925, a large improvement over the scores achieved by commonly used mortality risk algorithms such as SAPS II and APACHE II. The authors used the publicly available MIMIC-III dataset to train and evaluate their approach.
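To make the cut-point step concrete, here is a minimal sketch of how I understand it, for a single threshold on one variable. It is not the authors' implementation: the function names, the `value < t` versus `value >= t` split convention, the rule of keeping the significant threshold with the largest statistic, and the toy data are all my own assumptions.

```python
import math

def chi2_2x2(a, b, c, d):
    """Pearson chi-square statistic (no continuity correction) for [[a, b], [c, d]]."""
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    if 0 in (row1, row2, col1, col2):
        return 0.0  # degenerate table: one side of the split or one outcome is empty
    return n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)

def p_value_df1(chi2):
    """Upper-tail p-value of a chi-square distribution with 1 degree of freedom."""
    return math.erfc(math.sqrt(chi2 / 2.0))

def best_cut_point(values, died, alpha=0.05):
    """Return (threshold, chi2, p) for the most significant split, or None.

    Assumed rule: among thresholds with p < alpha, keep the largest statistic.
    """
    best = None
    for t in sorted(set(values))[1:]:  # skip the minimum: it would leave one side empty
        a = sum(1 for v, y in zip(values, died) if v < t and y == 1)
        b = sum(1 for v, y in zip(values, died) if v < t and y == 0)
        c = sum(1 for v, y in zip(values, died) if v >= t and y == 1)
        d = sum(1 for v, y in zip(values, died) if v >= t and y == 0)
        stat = chi2_2x2(a, b, c, d)
        p = p_value_df1(stat)
        if p < alpha and (best is None or stat > best[1]):
            best = (t, stat, p)
    return best

# Toy data invented for illustration only (age in years, 1 = died).
ages = [30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 85]
died = [0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1]
result = best_cut_point(ages, died)
print(result)
```

On this toy sample the selected threshold is 60. With cell counts this small the chi-square approximation is unreliable, so the example only illustrates the mechanics.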

**Reasons to accept:**

1. The authors report a significant increase in predictive accuracy compared to the baseline approaches of SAPS II (AUC = 0.77) and APACHE II (AUC = 0.736).

2. The optimal threshold cut-point technique provides clinicians with thresholds that are easier to interpret in ICU environments. It makes it easier for doctors to apply the results of the classification algorithm to reduce mortality risk.

**Reasons to reject:**

There are some major issues with statistical reporting in this article:

1. In section 2.3, it is stated that non-survivor data were upsampled such that the proportion of surviving and non-surviving patients was almost equal. This was done before 5-fold cross-validation based training and evaluation. However, in my view this is not a valid way to evaluate a model since the distribution of the data has been changed. Upsampling is fine for training the model (i.e. upsampling the non-survivor proportion in any of the training folds) but not for evaluation! In an actual ICU test environment, the model has to deal with the actual data distribution, i.e. a much lower overall mortality risk. AUC, sensitivity and specificity results reported on an already upsampled dataset are not valid.

2. In the Discussion section, paragraph 2, the authors state that "The 5-Fold and Leave-one-out Cross-Validation results showed a significant improvement in performance of the logistics regression model when the partitioned continuous variables were used instead of the raw continuous variables." However, I did not find any tabulated quantitative results in the manuscript that support this claim.

3. The source of the quoted AUC values for the SAPS II, APACHE II and SOFA scoring systems is not provided. Did the authors test these systems on the same dataset themselves? Where are the details?
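The distinction drawn in point 1 can be sketched as follows: upsampling is applied inside each training split only, and every test fold keeps the original class ratio. This is an illustrative sketch, not the authors' pipeline; the helper names, the 10% mortality cohort and the fixed seed are assumptions of mine, and the model fitting step is left as a comment.

```python
import random

def stratified_folds(labels, k, rng):
    """Split indices into k folds, keeping the class ratio roughly equal per fold."""
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        for pos, i in enumerate(idx):
            folds[pos % k].append(i)
    return folds

def upsample_minority(indices, labels, rng):
    """Resample the minority class with replacement until both classes are equal in size."""
    pos = [i for i in indices if labels[i] == 1]
    neg = [i for i in indices if labels[i] == 0]
    minority, majority = (pos, neg) if len(pos) < len(neg) else (neg, pos)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return indices + extra

rng = random.Random(0)
labels = [1] * 20 + [0] * 180  # invented cohort: 10% non-survivors

folds = stratified_folds(labels, k=5, rng=rng)
splits = []
for f, test_idx in enumerate(folds):
    train_idx = [i for g, fold in enumerate(folds) if g != f for i in fold]
    train_idx = upsample_minority(train_idx, labels, rng)  # training data only
    # Fit the model on train_idx here, then score it on the untouched test_idx.
    splits.append((train_idx, test_idx))

for train_idx, test_idx in splits:
    test_rate = sum(labels[i] for i in test_idx) / len(test_idx)
    train_rate = sum(labels[i] for i in train_idx) / len(train_idx)
    print(f"train mortality {train_rate:.2f}, test mortality {test_rate:.2f}")
```

Because the duplicated rows are drawn only from the training indices, no copy of a test patient can end up in the training data, and metrics computed on the test folds reflect the real mortality rate.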

##############

More issues regarding the approach taken:

1. I find that the authors do not show why their method is novel though they claim it to be so. The optimal threshold cut-point technique they use is very similar to the Sheth 2015 method, which itself is not much of a development from Donoho and Jin's paper: "Higher Criticism Thresholding: Optimal Feature Selection when Useful Features are Rare and Weak".

2. The authors do not adequately justify discretizing continuous variables. They do not show why this approach based on statistical testing is better than a decision-tree-based method, which would also partition the continuous variables and provide cut-points without statistical hypothesis testing.

3. The authors themselves show that a three-threshold partition is better than a two-threshold partition in terms of p-values and chi-square test statistics. If one extrapolates this argument, then in the limit of an increasing number of partitions one recovers the continuous variable. So wouldn't the original numerical variable be better under this line of reasoning?

4. Details on the number of features rejected in the initial feature selection are missing.

5. No comparison is made to any other state-of-the-art method, e.g. using deep learning or more powerful tree-based methods.

6. The identified important variables are not discussed in relation to what is already known about their significance in the literature.
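As an illustration of the alternative mentioned in point 2, a depth-one decision tree (a stump) yields a cut-point by minimizing impurity, with no hypothesis test involved. The sketch below uses Gini impurity and midpoints between adjacent values; the function names and the toy data are my own assumptions and nothing here comes from the manuscript.

```python
def gini(pos, neg):
    """Gini impurity of a node holding pos positive and neg negative cases."""
    n = pos + neg
    if n == 0:
        return 0.0
    p = pos / n
    return 2.0 * p * (1.0 - p)

def stump_cut_point(values, died):
    """Return (threshold, weighted Gini) of the best single split on one variable."""
    pairs = sorted(zip(values, died))
    n = len(pairs)
    total_pos = sum(y for _, y in pairs)
    best_t, best_score = None, float("inf")
    left_pos = left_n = 0
    for i in range(1, n):
        left_pos += pairs[i - 1][1]
        left_n += 1
        if pairs[i][0] == pairs[i - 1][0]:
            continue  # no threshold can separate two equal values
        right_pos = total_pos - left_pos
        right_n = n - left_n
        score = (left_n * gini(left_pos, left_n - left_pos)
                 + right_n * gini(right_pos, right_n - right_pos)) / n
        if score < best_score:
            best_t = (pairs[i - 1][0] + pairs[i][0]) / 2.0  # midpoint convention
            best_score = score
    return best_t, best_score

# Toy data invented for illustration only (age in years, 1 = died).
ages = [30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 85]
died = [0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1]
threshold, impurity = stump_cut_point(ages, died)
print(threshold, round(impurity, 4))
```

Applying the same search recursively to each side of the split would give multiple cut-points per variable, which is the comparison I would like to see against the chi-square-based procedure.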

**Further comments:**

## Meta-Review by Editor

Submitted by Tobias Kuhn on

Both reviewers commented on serious flaws in the evaluation of the method. There were questions about the novelty of the approach. The reviewers pointed to the lack of comparison with any other state-of-the-art method and of testing on an independent dataset.

Michael Krauthammer (https://orcid.org/0000-0002-4808-1845)