Reviewer has chosen not to be Anonymous
Overall Impression:
Undecided
Technical Quality of the paper:
Incomplete or inappropriate
Novelty:
Unable to judge
Data availability:
All used and produced data (if any) are FAIR and openly available in established data repositories
Length of the manuscript:
The length of this manuscript is about right
Summary of paper in a few sentences:
This paper proposes an ensemble feature selection algorithm for high-dimensional immunosignature data with a low computational cost, intended to (1) outperform single approaches, and (2) be more robust to noisy data. The ensemble combines three filtering methods (gain ratio, relief-F and the M-statistic) to rank each biomarker and aggregates the ranks as a weighted mean. The weights are based on the accuracy of each method. The approach is applied to a public immunosignature study, both with and without artificial Gaussian noise.
Reasons to accept:
1. The ensemble algorithm is sound and simple, and it simplifies the choice of filter.
Reasons to reject:
1. There is no mention of how the data were partitioned (e.g. holdout/cross-validation/leave-one-out) to estimate the performance while avoiding overfitting.
2. The results section lacks elaboration. The findings deserve a deeper discussion, interpretation and a more formal analysis. Specific, quantitative and statistically sound claims are missing.
Feature selection is an outstanding challenge with high-dimensional data. The proposed method can be useful, but it needs a more thorough description, and the results require further elaboration to support the authors' claims.
Q1. The state of the art shows that ensemble feature selection has already been used in other applications. The idea of aggregating various rankings by a weighted approach has already been explored:
Saeys, Y., Abeel, T., & Van de Peer, Y. (2008, September).
Robust feature selection using ensemble feature selection techniques.
In Joint European Conference on Machine Learning and Knowledge Discovery in Databases (pp. 313-325). Springer, Berlin, Heidelberg.
In a later paper, ensemble feature selection is weighted according to performance on bootstrapped data:
Abeel, T., Helleputte, T., Van de Peer, Y., Dupont, P., & Saeys, Y. (2009).
Robust biomarker identification for cancer diagnosis with ensemble feature selection methods.
Bioinformatics, 26(3), 392-398.
The authors should discuss such efforts.
Q2. The dataset entry in the references is GSE52580. Accessing it shows a series of 240 samples that is inconsistent with Table 1. The correct GEO accession seems to be GSE52581, with 1,516 samples and coherent categories.
Q3. The dataset should be better characterised. The reader does not know the magnitude and range of the features (for instance, I cannot tell whether the 5-sd noise is high or low relative to the data). If possible, low-dimensional representations should be provided. Many of the filtering methods estimate means and variances, which are sensitive to outliers, so I would advocate checking for their presence.
Q4. The details on filtering/wrapping/embedded methods seem a bit out of place in the materials and methods section. At this point, the reader should be aware of the state of the art (see also Q1) and the motivation behind the article.
Q5. Some claims are not obvious and would benefit from a reference or clarification:
* Page 3: "In connection with the absence of dependencies between features due to the biological specificity peptides, it is advisable to use filtration methods as the most computationally efficient". High collinearity between features is a common problem in biological datasets. Please provide a reference showing that this is not the case for peptide data.
* Page 5: "Filtering algorithms have various disadvantages that do not allow to find optimal set of informative features, as a result of which the efficiency of various classifiers varies considerably". Which disadvantages are those? In what mathematical sense should the set of features be optimal?
Q6. The filtering methods need more details:
* Jeffries-Matusita: needs reference. Also, is it univariate or multivariate? If the covariance matrix was computed, only one value of beta would be available for each binary comparison.
* Fisher score: reference is broken. The terms in the formula should be described (S_i, n_j, mu_ij, mu_i, rho_ij, K). The summations should be over j instead of k.
Q7. Why are only gain ratio, relief-F and M-statistic included in the ensemble? Why not just use the five methods?
Q8. There seems to be some circularity in the choice of N and the ensemble weights. In the absence of an external validation dataset or holdout, choosing the best N value based on overall performance will inevitably improve downstream performance in further tests. The same applies to the weights w_A, w_B, w_C, as their choice is performance-driven, endowing the ensemble with an unfair advantage.
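One way to remove this circularity is to repeat every performance-driven choice (the feature ranking and the value of N, and by extension the ensemble weights) inside the training portion of each fold, so that the held-out fold never influences them. The sketch below is illustrative only: the toy data, the mean-difference filter, the nearest-centroid classifier and all function names are assumptions made for this example and are not the authors' pipeline.

```python
import random
from statistics import mean

def make_toy_data(n=60, p=20, informative=3, seed=0):
    """Synthetic two-class data; the first `informative` features carry signal."""
    rng = random.Random(seed)
    X, y = [], []
    for i in range(n):
        label = i % 2
        row = [rng.gauss(0.0, 1.0) for _ in range(p)]
        for j in range(informative):
            row[j] += 2.0 * label
        X.append(row)
        y.append(label)
    return X, y

def rank_features(X, y):
    """Toy filter: order features by absolute difference of class means, best first."""
    def gap(j):
        m0 = mean(r[j] for r, lab in zip(X, y) if lab == 0)
        m1 = mean(r[j] for r, lab in zip(X, y) if lab == 1)
        return abs(m1 - m0)
    return sorted(range(len(X[0])), key=gap, reverse=True)

def centroid_accuracy(X_tr, y_tr, X_te, y_te, feats):
    """Nearest-centroid classifier restricted to the selected features."""
    cents = {c: [mean(r[j] for r, lab in zip(X_tr, y_tr) if lab == c) for j in feats]
             for c in (0, 1)}
    hits = 0
    for r, lab in zip(X_te, y_te):
        dist = {c: sum((r[j] - m) ** 2 for j, m in zip(feats, cents[c])) for c in (0, 1)}
        hits += min(dist, key=dist.get) == lab
    return hits / len(y_te)

def split(X, y, test_idx):
    """Return (X_train, y_train, X_test, y_test) for the given test indices."""
    held = set(test_idx)
    tr = [i for i in range(len(y)) if i not in held]
    return ([X[i] for i in tr], [y[i] for i in tr],
            [X[i] for i in test_idx], [y[i] for i in test_idx])

def folds(n, k):
    """Interleaved k-fold index lists."""
    return [list(range(i, n, k)) for i in range(k)]

def choose_n(X, y, candidates, k=3):
    """Inner loop: pick N using only the data passed in (the outer training set)."""
    def inner_score(n_feat):
        accs = []
        for test_idx in folds(len(y), k):
            X_tr, y_tr, X_te, y_te = split(X, y, test_idx)
            feats = rank_features(X_tr, y_tr)[:n_feat]
            accs.append(centroid_accuracy(X_tr, y_tr, X_te, y_te, feats))
        return mean(accs)
    return max(candidates, key=inner_score)

def nested_cv(X, y, candidates=(1, 3, 5, 10), k=5):
    """Outer loop: the held-out fold is never seen while ranking or choosing N."""
    scores = []
    for test_idx in folds(len(y), k):
        X_tr, y_tr, X_te, y_te = split(X, y, test_idx)
        n_feat = choose_n(X_tr, y_tr, candidates)
        feats = rank_features(X_tr, y_tr)[:n_feat]
        scores.append(centroid_accuracy(X_tr, y_tr, X_te, y_te, feats))
    return scores

X, y = make_toy_data()
outer_scores = nested_cv(X, y)
print("per-fold accuracy:", outer_scores)
```

The outer scores are then an estimate of performance that is not inflated by the tuning of N; the same structure would apply to w_A, w_B and w_C if they were computed inside `choose_n`.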
Q9. How are the hyperparameters of the random forest and the SVM chosen?
Q10. The artificial noise addition/multiplication needs more detail for proper reproduction. What are the expected values of such distributions? Please put in mathematical terms how both (especially the multiplicative) noises are incorporated.
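As an example of the level of detail requested, the two noise models could be stated as follows. This is only one possible formalisation; the notation ($x_{ij}$ for the intensity of peptide $j$ in sample $i$, $\tilde{x}_{ij}$ for its noisy counterpart, $\varepsilon_{ij}$ and $\eta_{ij}$ for the noise terms) is assumed here and is not taken from the manuscript.

```latex
\begin{align}
  \tilde{x}_{ij} &= x_{ij} + \varepsilon_{ij}, & \varepsilon_{ij} &\sim \mathcal{N}(0, \sigma^2) && \text{(additive)} \\
  \tilde{x}_{ij} &= x_{ij}\,\eta_{ij},         & \eta_{ij} &\sim \mathcal{N}(1, \sigma^2)        && \text{(multiplicative)}
\end{align}
```

Stating the expected values explicitly matters: a multiplicative factor centred at 0 rather than 1 would erase the signal instead of perturbing it, and for large $\sigma$ a Gaussian factor frequently changes the sign of the measurement.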
Q11. In the description of Cohen's Kappa, the citation (Andryuschenko et al.) might need double-checking because it does not mention Kappa.
Q12. The weights for the ensemble filtering are based on accuracies. Are those literally accuracies, or Cohen's Kappas? If accuracies, the class imbalance can distort the metrics. If Cohen's Kappas, those can actually be negative, and so can the denominator.
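Both concerns can be reproduced with a small numerical sketch. The labels, predictions and Kappa values below are hypothetical, and `normalised_weights` is only an assumed form of the weighting (each score divided by the sum of the scores), since the manuscript's exact formula is what this question asks about.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of matching labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def cohen_kappa(y_true, y_pred):
    """Cohen's Kappa: (p_o - p_e) / (1 - p_e), with p_e the chance agreement."""
    n = len(y_true)
    p_o = accuracy(y_true, y_pred)
    true_counts, pred_counts = Counter(y_true), Counter(y_pred)
    p_e = sum(true_counts[c] * pred_counts[c] for c in true_counts) / n ** 2
    return (p_o - p_e) / (1 - p_e)

def normalised_weights(scores):
    """Assumed weighting scheme: each score divided by the sum of all scores."""
    total = sum(scores)
    return [s / total for s in scores]

# Hypothetical imbalanced problem: 90 controls (0) and 10 cases (1).
y_true = [0] * 90 + [1] * 10

# A classifier that always predicts the majority class.
majority = [0] * 100
print(accuracy(y_true, majority), cohen_kappa(y_true, majority))

# A classifier that misses every case and mislabels 10 controls.
worse = [1] * 10 + [0] * 90
print(accuracy(y_true, worse), cohen_kappa(y_true, worse))

# Hypothetical Kappas for three filters; their sum is negative.
weights = normalised_weights([0.2, -0.1, -0.3])
print(weights)
```

The majority classifier reaches an accuracy of 0.9 with a Kappa of 0, and the second classifier keeps an accuracy of 0.8 with a negative Kappa. With a negative denominator, the filter with the highest Kappa receives a negative weight, which inverts the intended ordering.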
Q13. N=10 was chosen, but the graph seems flat from N=4 on. Is there a quantitative way to justify N=10 (or any other value)?
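One common quantitative option is the one-standard-error rule: take the smallest N whose mean score lies within one standard error of the best mean. The sketch below uses hypothetical per-fold scores and an assumed function name, `one_se_choice`; it illustrates the rule only and says nothing about the values in the manuscript.

```python
from math import sqrt
from statistics import mean, stdev

def one_se_choice(scores_by_n):
    """Smallest N whose mean score is within one standard error of the best mean."""
    summary = {n: (mean(s), stdev(s) / sqrt(len(s))) for n, s in scores_by_n.items()}
    best_n = max(summary, key=lambda n: summary[n][0])
    threshold = summary[best_n][0] - summary[best_n][1]
    return min(n for n, (m, _) in summary.items() if m >= threshold)

# Hypothetical per-fold Kappas for each candidate N (not data from the paper).
hypothetical_scores = {
    2: [0.60, 0.62, 0.58, 0.61, 0.59],
    4: [0.805, 0.825, 0.785, 0.815, 0.795],
    10: [0.81, 0.83, 0.79, 0.82, 0.80],
    20: [0.808, 0.828, 0.788, 0.818, 0.798],
}
chosen_n = one_se_choice(hypothetical_scores)
print("chosen N:", chosen_n)
```

In this made-up case N=10 has the highest mean, but N=4 is already within one standard error of it and would be preferred as the more parsimonious choice.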
Q14. The plots should show a dispersion measure for each feature selection algorithm for a proper comparison.
Q15. Claiming that an algorithm is "the best" needs the support of a statistical test, to rule out that the observed differences come from the sampling effect alone.
Q16. What is a reasonable amount of noise? Is the scenario with sigma=5 plausible in a real dataset? To rule out biases, it would also be interesting to show that all the classifiers are random (i.e. Kappa around 0) when the dataset breaks down.
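As one possible test, a paired sign-flip permutation test can be applied to the per-fold score differences between two feature selection methods. The sketch below is a minimal illustration with hypothetical Kappa values and an assumed function name; it enumerates all sign patterns, so it is only practical for a small number of folds.

```python
from itertools import product
from statistics import mean

def paired_sign_flip_test(scores_a, scores_b):
    """Two-sided exact sign-flip test on paired differences; returns the p-value."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(mean(diffs))
    extreme = 0
    total = 0
    # Under the null hypothesis each difference is equally likely to have either sign.
    for signs in product((1, -1), repeat=len(diffs)):
        stat = abs(mean(s * d for s, d in zip(signs, diffs)))
        extreme += stat >= observed - 1e-12
        total += 1
    return extreme / total

# Hypothetical per-fold Kappas for two methods on the same 10 folds.
ensemble = [0.81, 0.79, 0.84, 0.80, 0.83, 0.82, 0.78, 0.85, 0.80, 0.82]
single = [0.78, 0.77, 0.80, 0.79, 0.80, 0.79, 0.76, 0.81, 0.78, 0.80]
p_value = paired_sign_flip_test(ensemble, single)
print("p-value:", p_value)
```

Here the ensemble wins on every fold, so only the two uniform sign patterns are as extreme as the observation and the p-value is 2/1024. Cross-validation folds share training data and are not independent, so such a p-value should be read as optimistic; repeated resampling or a corrected test would be more conservative.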
Q17. How does the estimated performance compare to that in the state of the art for immunosignature data?
Q18. The use of English throughout the manuscript would benefit from correction by a native speaker.
- There are several non-English words scattered around (e.g. "энтропия признака", Russian for "feature entropy", page 3)
- Page 3: what does "higher risk of retraining" mean? I was unable to find the word "retrain" in reference .
- Do tables 2-4 display real data? If so, why not write the real rows and column names?
- Please specify how the attributes are sorted (by rows/columns, ascending/descending). Are the best rankings represented by low or high values? The final attributes are prioritised in the interval [0,1]; is 1 the least or the most informative?
- Table 5: first row seems out of place. It could also be more informative by including the function name, the package version and its reference, if any.
- "Technology" sounds somewhat confusing; consider using algorithm/approach/method instead.
- Reference  is actually lacking citations in the text.