Analysis of machine learning methods for COVID-19 detection using serum raman spectroscopy

Tracking #: 691-1671

Authors:

	Name	ORCID
	David Chen	https://orcid.org/0000-0003-3178-519X

Responsible editor:

Manik Sharma

Submission Type:

Research Paper

Abstract:

One of the most challenging aspects of the emergent COVID-19 pandemic caused by infection with SARS-CoV-2 has been the need for massive diagnostic testing to detect and track infection rates at the population level. Current tests such as RT-PCR can be low-throughput and labor intensive. An ultra-fast and accurate mode of detecting COVID-19 infection is crucial for healthcare workers to make informed decisions in fast-paced clinical settings. The high-dimensional, feature-rich components of Raman spectra and their validated predictive power for identifying human disease, cancer, and bacterial and viral infections suggest that a supervised machine-learning classifier could be trained on Raman spectra of patient serum samples to detect COVID-19 infection. We developed a novel stacked subsemble classifier model coupled with an iteratively validated and automated feature selection and engineering workflow to predict COVID-19 infection status from Raman spectra of 250 human serum samples, with a 10-fold cross-validated classification accuracy of 98.4% (98.6% precision and 95.9% sensitivity). Furthermore, we benchmarked 9 machine learning and artificial neural network models using 8 standalone performance metrics to assess whether ensemble methods offered any improvement over baseline machine learning models. Using rank-normalized scores derived from the performance metrics, the stacked subsemble model ranked higher than the Multi-layer Perceptron, which in turn ranked higher than the 8 other machine learning models. This study serves as a proof of concept that stacked ensemble machine learning models are a powerful predictive tool for COVID-19 diagnostics.

Manuscript:

ds-paper-691.docx

Data repository URLs:

https://github.com/davidchen0420/Raman_Spectroscopy_COVID_19

Date of Submission:

Tuesday, March 30, 2021

Date of Decision:

Wednesday, March 8, 2023

Nanopublication URLs:

Decision:

Reject

Solicited Reviews:

Review #1 submitted on 03/May/2021

By Gargi Datta

https://orcid.org/0000-0002-1314-7824

Review Details

Reviewer has chosen not to be Anonymous

Overall Impression: Good
Suggested Decision: Undecided
Technical Quality of the paper: Average
Presentation: Good
Reviewer's confidence: High
Significance: High significance
Background: Reasonable
Novelty: Clear novelty
Data availability: All used and produced data (if any) are FAIR and openly available in established data repositories
Length of the manuscript: This manuscript is too long for what it presents and should therefore be considerably shortened (below the general length limit)

Summary of paper in a few sentences:

The paper discusses a way to identify Covid-19 infections from serum samples using machine learning on the results of Raman spectroscopy. The authors show how 8 different machine learning algorithms performed on classifying Covid-19 using 10-fold CV on 250 Raman spectral samples. They also investigated stacking ensemble algorithms and compared the performance of the ensemble to the weak learners.

Reasons to accept:

The paper investigates the efficacy of machine learning to predict a Covid-19 infection from serum data. Raman spectral information can be used to predict Covid-19 and other diseases, and ensemble methods could be an auxiliary tool for clinical diagnosis of Covid-19. Studies like this are very important for advancing machine learning techniques to augment clinical decision making.

Reasons to reject:

I have some questions and concerns about the methodology used in the paper:

Major concerns

1. The sample size used for training the algorithms is very small (250 samples). I worry that with the more complicated methods (for example, a multi-layer perceptron), there is significant overfitting. The authors have used feature selection and extraction techniques to mitigate the high-dimensionality issues. They have also used cross-validation to help with overfitting. However, there isn't an independent test set on which to report performance metrics. Ideally, metrics on a separate dataset would be reported, but if that is not available, I would suggest the authors set aside, say, 10% of the data as an independent test set and evaluate their methods on that dataset in addition to 10-fold CV.

2. In my opinion, the authors report too many metrics in the results section, and the significance of their work is lost a little. I would suggest the authors restructure the results section and focus on a select few metrics. Additionally, the authors report p-values but don't mention multiple-testing correction. I suggest the authors report Bonferroni-corrected p-values.

3. For PCA, it would be beneficial to show a scree plot of the eigenvalues.

4. Something that would make this paper stronger would be a discussion on the robustness of different models used.
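Major concerns 1 and 2 can be illustrated with a minimal sketch. This is not the authors' code: the function names, the 10% fraction, the class balance of the dummy labels, and the example p-values are all assumptions made for illustration. It shows a stratified hold-out split made before any cross-validation, 10-fold splits drawn only from the remaining training indices, and a plain Bonferroni adjustment.

```python
import random
from collections import defaultdict


def stratified_holdout(labels, test_fraction=0.10, seed=0):
    """Return (train_idx, test_idx), holding out ~test_fraction of each class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = max(1, round(len(indices) * test_fraction))
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


def k_fold(indices, k=10, seed=0):
    """Yield (fit_idx, val_idx) pairs drawn from the given indices only."""
    shuffled = list(indices)
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    for i in range(k):
        val = folds[i]
        fit = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield fit, val


def bonferroni(p_values):
    """Multiply each p-value by the number of tests, capped at 1.0."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


# Hypothetical labels: 250 samples; this class balance is made up.
labels = [1] * 100 + [0] * 150
train_idx, test_idx = stratified_holdout(labels)

# Cross-validation never sees the held-out test indices.
folds = list(k_fold(train_idx))

# Hypothetical raw p-values from three comparisons.
adjusted = bonferroni([0.001, 0.02, 0.04])
```

The point of the design is ordering: feature selection, model tuning, and 10-fold CV would all run on `train_idx`, and the metrics on `test_idx` would be computed once at the end.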

Minor comments:

1. Some citations are missing. For example, subsembles, stacked ensembles, and cross-validation should be cited where they are first mentioned.

2. I noticed that the authors used selectivity in some of their text. Did they mean specificity?

3. Was SNV done before or after feature selection? If it was done after, what is the reasoning behind that?

4. How many initial features were there before feature selection? Please add that information either in methods or results section.
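Minor comment 3 matters because standard normal variate (SNV) scaling uses each spectrum's own mean and standard deviation, so its output depends on which features are present when it is applied. The sketch below is illustrative only: the toy spectrum and the selected indices are hypothetical, and it assumes the usual definition of SNV with the sample standard deviation.

```python
import statistics


def snv(spectrum):
    """Standard normal variate: centre one spectrum and scale by its own SD."""
    mean = statistics.fmean(spectrum)
    sd = statistics.stdev(spectrum)  # sample standard deviation (n - 1)
    if sd == 0:
        raise ValueError("flat spectrum: SNV is undefined")
    return [(x - mean) / sd for x in spectrum]


# Hypothetical five-point spectrum and a hypothetical selection of three features.
full = [2.0, 4.0, 6.0, 8.0, 10.0]
keep = (0, 1, 4)

# Order A: normalise the full spectrum, then keep the selected features.
snv_then_select = [snv(full)[i] for i in keep]

# Order B: keep the selected features first, then normalise what is left.
select_then_snv = snv([full[i] for i in keep])
```

The two orders give different values for the same three features, which is why the manuscript should state where SNV sits in the pipeline.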

Nanopublication comments:

Further comments:

Review #2 submitted on 08/May/2021

By Prableen Kaur

https://orcid.org/0000-0003-4912-527X

Review Details

Reviewer has chosen not to be Anonymous

Overall Impression: Good
Suggested Decision: Accept
Technical Quality of the paper: Good
Presentation: Good
Reviewer's confidence: High
Significance: High significance
Background: Reasonable
Novelty: Clear novelty
Data availability: All used and produced data (if any) are FAIR and openly available in established data repositories
Length of the manuscript: The authors need to elaborate more on certain aspects and the manuscript should therefore be extended (if the general length limit is already reached, I urge the editor to allow for an exception)

Summary of paper in a few sentences:

The manuscript concerns the use of machine learning methods for COVID-19 detection.

Reasons to accept:

The idea seems to be interesting. Different machine learning techniques have been compared across distinct performance metrics.

Reasons to reject:

No reason

Nanopublication comments:

Further comments:

The manuscript concerns the use of machine learning methods for COVID-19 detection. The idea seems to be interesting. Different machine learning techniques have been compared across distinct performance metrics. However, to maintain the general interest of the reader, the manuscript needs to be revised as per the following suggestions:

- First of all, more background work needs to be addressed. The novelty and worth of this work need to be reflected in the introduction section.

- There should be a separate related works section.

- Because the whole paper revolves around machine learning applications, a brief paragraph on the basics and foundations of the machine learning techniques needs to be added to the methodology section. The general applications of the machine learning techniques need to be briefly highlighted. For this, the authors may read and refer to the following manuscripts.

o Kim, Gi Bae, et al. "Machine learning applications in systems metabolic engineering." Current opinion in biotechnology 64 (2020): 1-9.

o Kaur, Prableen, and Manik Sharma. "Analysis of data mining and soft computing techniques in prospecting diabetes disorder in human beings: a review." Int. J. Pharm. Sci. Res 9 (2018): 2700-2719.

o Kaur, Prableen, and Manik Sharma. "A Smart and Promising Neurological Disorder Diagnostic System: An Amalgamation of Big Data, IoT, and Emerging Computing Techniques." Intelligent Data Analysis: From Data Gathering to Data Comprehension (2020): 241-264.

o Arora, Sankalap, Manik Sharma, and Priyanka Anand. "A novel chaotic interior search algorithm for global optimization and feature selection." Applied Artificial Intelligence 34.4 (2020): 292-328.

o Gan, Lirong, Huamao Wang, and Zhaojun Yang. "Machine learning solutions to challenges in finance: An application to the pricing of financial products." Technological Forecasting and Social Change 153 (2020): 119928.

o Saadatmand, Mohammadsaleh, and Tuğrul U. Daim. "Technology Intelligence Map: Finance Machine Learning." Roadmapping Future: Technologies, Products and Services (2021): 337-3

- The results presented in Figure 2 need to be discussed in more detail.

- The strengths and limitations of this work need to be clearly addressed.

Review #3 submitted on 15/May/2021

Review Details

Reviewer has chosen to be Anonymous

Overall Impression: Excellent
Suggested Decision: Accept
Technical Quality of the paper: Good
Presentation: Excellent
Reviewer's confidence: High
Significance: High significance
Background: Comprehensive
Novelty: Limited novelty
Data availability: All used and produced data (if any) are FAIR and openly available in established data repositories
Length of the manuscript: The length of this manuscript is about right

Summary of paper in a few sentences:

The authors propose a classifier comprising a deep learning predictive meta-algorithm to predict COVID-19 infection. The algorithmic details, datasets, and results are good.

Reasons to accept:

The paper is well written. The results are validated. A solution to the problem addressed is needed today.

Reasons to reject:

Nil

Nanopublication comments:

Further comments:

2 Comments

Meta-Review by Editor

Submitted by Tobias Kuhn on Sat, 05/22/2021 - 14:21

The idea presented in the manuscript is interesting. However, the manuscript needs revision. More background work needs to be presented. There should be a separate related work section. More high-quality literature related to the theme of the manuscript needs to be explored and incorporated. The novelty and contribution of this work need to be clearly stated in the introduction section. The discussion section needs to state explicitly how the results have been validated. The author should revise the manuscript following these editor comments and the comments by the reviewers.

Manik Sharma (https://orcid.org/0000-0002-5942-134X)

No Revised Version Submitted: Marked as Rejected

Submitted by Tobias Kuhn on Wed, 03/08/2023 - 03:26

As the authors did not submit a revised version, I will mark this submission as rejected.

Data Science
