Supervised Learning Inspired Fast Forecasting Model of 2019-nCoV Outbreak using Small Dataset

Tracking #: 628-1608

Authors:

	Name	ORCID
	SANKAR MONDAL	https://orcid.org/0000-0003-4690-2598
	ARIJIT CHAKRABORTY	https://orcid.org/0000-0002-7111-7711
	Sajal Mitra	https://orcid.org/0000-0002-8624-641X
	Dipankar Das	https://orcid.org/0000-0001-6618-0962
	Debashis De	https://orcid.org/0000-0002-9688-9806
	Anindya J	https://orcid.org/0000-0001-9277-0456

Responsible editor:

Christine Chichester

Submission Type:

Research Paper

Abstract:

A rapid spread of the 2019-novel Corona Virus (2019-nCoV) epidemic imposes a threat to society and the global economy. The epidemic induced by the contagious coronavirus resulted in the suspension of day-to-day activities such as education, tourism, and community services in provinces of China and its neighboring countries. The real impact of this virus on a society largely depends on its outbreak momentum. Therefore, it is imperative to formulate a robust and accurate prediction model to approximate its disastrous repercussions on human lives. Limited understanding of the 2019-nCoV outbreak with the imprecision involved induces an extraordinary challenge in framing a prudent forecasting model. This publication elucidates a collaborative framework consisting of Machine Learning (ML) and Statistical prediction methods to estimate the adversity of this virus. The suggested framework offers a high degree of accuracy in evaluating the rise in the 2019-nCoV pandemic in Chinese provinces, with a reasonably small Root Mean Square Error (RMSE) on a small dataset rendered by the World Health Organization (WHO).

Manuscript:

ds-paper-628.docx

Supplementary Files (optional):

ds-supplementary-628-969.docx

ds-supplementary-628-970.docx

ds-supplementary-628-971.xlsx

ds-supplementary-628-972.xlsx

ds-supplementary-628-973.xlsx

ds-supplementary-628-974.xlsx

ds-supplementary-628-975.xlsx

ds-supplementary-628-976.xlsx

ds-supplementary-628-977.xlsx

Data repository URLs:

Description of the produced data as follows:

The top two graphs of Fig. 2 were created from the numbers of infections and deaths due to the COVID-2019 outbreak in China, using WHO data together with augmented data generated by the Linear Regression method (provided at: https://github.com/sajalmitra2020/WHO-database/blob/master/Who %2B Predicted Data using Linear Regression.xlsx).
This dataset was also used for classification and for calculating the RMSE values of the RFM and MLP methods, as reflected in Table 2, Table 3, Table 4, Fig. 3, and Fig. 4.
Fig. 5 shows the observed and predicted numbers of deaths caused by the nCoV-2019 outbreak using the ARIMA, ETS, and LR-lag methods; the corresponding datasets are available at https://github.com/sajalmitra2020/WHO-database/blob/master/Who%20%2B%20Predicted%20Data%20using%20ARIMA.xlsx, https://github.com/sajalmitra2020/WHO-database/blob/master/Who%20%2B%20Predicted%20Data%20using%20ETS.xlsx, and https://github.com/sajalmitra2020/WHO-database/blob/master/Who%20%2B%20Predicted%20Data%20using%20LR-lag.xlsx, respectively. The RMSE values calculated from these datasets for the three methods are reported in Table 5.
Fig. 6 was created from the same source datasets, using the optimized RMSE values of the above methods, i.e., ARIMA, ETS, LR-lag, RFM, and MLP.
The observed WHO death data and the data predicted by the MLP-lag method are plotted in Fig. 7; the corresponding dataset is at https://github.com/sajalmitra2020/WHO-database/blob/master/Who%20%2B%20Predicted%20Data%20using%20MLP-lag.xlsx.
Fig. 8 shows the observed WHO death data and the data predicted by the BATS model, from which we calculated the RMSE of this model. The corresponding dataset is at https://github.com/sajalmitra2020/WHO-database/blob/master/Who%20%2B%20Predicted%20Data%20using%20BATS.xlsx.
Finally, the RMSE values of the BATS, MLP-lag, and our CFPSD models are plotted in Fig. 9.

Date of Submission:

Friday, April 3, 2020

Date of Decision:

Tuesday, April 28, 2020

Nanopublication URLs:

Decision:

Reject

Solicited Reviews:

Review #1 submitted on 22/Apr/2020

Review Details

Reviewer has chosen to be Anonymous

Overall Impression: Weak
Suggested Decision: Reject
Technical Quality of the paper: Weak
Presentation: Good
Reviewer's confidence: Medium
Significance: Low significance
Background: Reasonable
Novelty: Lack of novelty
Data availability: All used and produced data (if any) are FAIR and openly available in established data repositories
Length of the manuscript: The length of this manuscript is about right

Summary of paper in a few sentences (summary of changes and improvements for second round reviews):

The manuscript presents several methods to forecast the evolution of the number of deaths due to the Coronavirus. The dataset is publicly available and comes from the World Health Organization. The approach presented is a combination of statistical and machine learning methods.

Reasons to accept:

The idea of combining statistical methods and machine learning is interesting.

Reasons to reject:

The goal of the study and the results are not convincing to me. The authors use data from WHO, from the 21st of January to the 14th of February (one value per day, so an extremely small dataset). They use linear regression to create artificial data points from the 14th to the 29th of February. My first remark concerns the data augmentation. The data is supposed to follow an exponential curve, so it cannot be modeled correctly by a linear model. To me, this way of augmenting the data is wrong.
Secondly, I do not understand what the classification or regression task is. What is the output of their method? The authors claim to split the augmented dataset into a train and a test set. Do they select the data points from the first days to predict the last ones (this seems to be the case in Fig. 5, for example)? This would be forecasting. Or do they pick data points at random and learn to find the others, wherever they may be located in time? This would be a sort of interpolation of the evolution.
In any case, the authors are just evaluating the ability of their methods to fit or model a particular curve. This curve is partly exponential and partly linear (the augmented part of the dataset) and does not model any real coronavirus death rate evolution.
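
The objection to linear augmentation can be illustrated with a minimal sketch. The series below is synthetic (an assumed 20% daily growth, not WHO data), and the helper names `fit_line` and `rmse` are hypothetical. The sketch fits a straight line to the first 25 days, once on the raw counts and once on their logarithms, and compares the extrapolation error over the following 15 days.

```python
import math

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b

def rmse(actual, predicted):
    """Root mean square error between two equally long sequences."""
    return math.sqrt(
        sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
    )

# Synthetic series: assumed 20% daily growth from 10 cases (illustrative only).
days = list(range(1, 41))
counts = [10 * 1.2 ** d for d in days]

train_days, train_counts = days[:25], counts[:25]
future_days, future_counts = days[25:], counts[25:]

# Straight line fitted to the raw counts.
a_lin, b_lin = fit_line(train_days, train_counts)
linear_forecast = [a_lin + b_lin * d for d in future_days]

# Straight line fitted to log counts, i.e. an exponential on the original scale.
a_log, b_log = fit_line(train_days, [math.log(c) for c in train_counts])
exp_forecast = [math.exp(a_log + b_log * d) for d in future_days]

rmse_linear = rmse(future_counts, linear_forecast)
rmse_exp = rmse(future_counts, exp_forecast)
print(f"linear RMSE: {rmse_linear:.1f}, log-linear RMSE: {rmse_exp:.6f}")
```

On this idealized curve the straight-line extrapolation falls far below the true values, while the log-linear fit recovers the growth rate. Real epidemic curves do not stay exponential either, so the sketch only shows why a linear model is a poor generator of "future" points during early growth.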

Nanopublication comments:

Further comments:

Review #2 submitted on 25/Apr/2020

By Julien Herzen

https://orcid.org/0000-0002-5701-0141

Review Details

Reviewer has chosen not to be Anonymous

Overall Impression: Bad
Suggested Decision: Reject
Technical Quality of the paper: Bad
Presentation: Average
Reviewer's confidence: High
Significance: Moderate significance
Background: Incomplete or inappropriate
Novelty: Lack of novelty
Data availability: All used and produced data (if any) are FAIR and openly available in established data repositories
Length of the manuscript: The length of this manuscript is about right

Summary of paper in a few sentences (summary of changes and improvements for second round reviews):

The paper proposes an approach to forecast the spread of the 2019-nCov epidemic. The authors' goal is to find a way to forecast the number of cases (infected) and the number of deaths. To this end, the authors propose a two-step approach: first, "augment" the dataset using linear regression; this allows them to have a few more days of data. Second, the authors fit a couple of ML models (MLP, random forest) and some forecasting models (ARIMA, ETS) to the data, in order to obtain forecasts. The forecasts of the different models are combined, using the RMSE as a metric, in order to obtain some final forecasts. The whole approach is based on less than a month's worth of data, between January 21st and February 14th, 2020.

Reasons to accept:

I unfortunately do not see any reason to recommend this paper for publication (see below).

Reasons to reject:

There are several deep issues with the paper:

* Overall, the approach of trying to fit statistical / ML models on time series data to forecast epidemic trajectories is a very risky exercise. The trajectory of the epidemic will typically not follow a monotonic trend or a periodicity of the kind captured by ARIMA/ETS and the like. Similarly, there is no reason to think that the distributions of the numbers of cases/deaths as a function of time have any stationarity, which is an implicit assumption behind supervised ML models. In fact, epidemic trajectories are strongly impacted by complex combinations of external factors, such as government actions, social habits, vaccine development, temperature (season), etc. None of these can be captured by any model looking only at the history of the time series, which renders the whole exercise quite futile. Evaluating any such attempt would require extreme care, which is not the case with this paper.

* The models presented seem very prone to overfitting: the dataset is really small (about 20-30 data points for the cases & deaths time series), and the number of hyper-parameters and ways to combine models is high. Furthermore, the models (MLP, RF) are too complex for the size of the data/test set. The test set is tiny, which makes it very easy to overfit; and there is no validation set, which indicates likely overfitting on the test set.

* The data used in the paper stops on February 14th, but the paper was submitted on April 3rd. It would at least have been reasonable to check the model predictions against recent actual data.

* There is no comparison with any epidemiological model, nor any reference to such models. The whole field of epidemiology is doing research on how to forecast epidemics; comparisons are necessary to show why any new approach would warrant attention.

* The approach seems to rely heavily on a "data augmentation" step based on linear regression. This seems completely wrong. The whole exercise then reduces to trying to predict something that was synthetically generated by the authors to start with.

* To summarize, the paper proposes no new approach, provides no new result, and has flawed methodology.
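
The concerns above about the tiny test set and the missing validation set are commonly addressed with a chronological split and rolling-origin (walk-forward) evaluation. The following is a minimal sketch of that idea, not the authors' procedure: the series is synthetic, the two baseline forecasters (`last_value` and `drift`) are stand-ins chosen for illustration, and all function names are hypothetical.

```python
import math

def last_value(history):
    """Naive forecast: the next value equals the latest observation."""
    return history[-1]

def drift(history):
    """Naive forecast with drift: repeat the most recent daily change."""
    return history[-1] + (history[-1] - history[-2])

def rolling_origin_rmse(series, forecaster, start, end):
    """One-step-ahead RMSE for the targets series[start:end].

    Each forecast uses only observations strictly before its target,
    so no future value leaks into the prediction.
    """
    errors = [forecaster(series[:t]) - series[t] for t in range(start, end)]
    return math.sqrt(sum(e * e for e in errors) / len(errors))

# Synthetic cumulative series (illustrative only, not WHO data).
series = [round(5 * 1.15 ** d) for d in range(25)]

# Chronological split: choose the model on the validation block,
# then score it once on the held-out test block.
val_start, test_start = 15, 20

candidates = {"last_value": last_value, "drift": drift}
val_scores = {
    name: rolling_origin_rmse(series, f, val_start, test_start)
    for name, f in candidates.items()
}
best = min(val_scores, key=val_scores.get)
test_score = rolling_origin_rmse(series, candidates[best], test_start, len(series))
print(best, val_scores, round(test_score, 2))
```

The property that matters here is that every forecast sees only earlier observations, and that the test block consists of real observations that played no part in model selection.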

Nanopublication comments:

Further comments:

Review #3 submitted on 28/Apr/2020

By Remi Lebret

https://orcid.org/0000-0001-5439-7574

Review Details

Reviewer has chosen not to be Anonymous

Overall Impression: Weak
Suggested Decision: Reject
Technical Quality of the paper: Weak
Presentation: Average
Reviewer's confidence: High
Significance: High significance
Background: Reasonable
Novelty: Lack of novelty
Data availability: All used and produced data (if any) are FAIR and openly available in established data repositories
Length of the manuscript: The length of this manuscript is about right

Summary of paper in a few sentences (summary of changes and improvements for second round reviews):

The paper proposes a semi-supervised approach to train a model for predicting human casualties due to the nCov-2019. Since the dataset is quite small for training machine learning (ML) models, the authors propose to augment it using linear regression. The resulting dataset is then used to train two classifiers based on ML models and to fit three time-series models.

Specific comments:
- Why use linear regression when the COVID-19 evolution is known to be exponential?
- There is no need for Tables 3 and 4 since Fig. 3 and Fig. 4 are included.
- MLP doesn’t work as well as RFM, probably because the chosen learning rates are too high. It would be interesting to also test learning rates lower than 0.01. Fig. 4 clearly indicates that smaller learning rates would help minimise the RMSE.

General comments:
There is too much wording describing known facts (COVID-19) or methodologies (RFM, MLP, ARIMA). I recommend instead that the authors better explain the motivation of the approach.

Reasons to accept:

I don't see any reason to accept this paper.

Reasons to reject:

The goal of the paper is interesting and challenging, especially at this crisis time. However, the methodology needs to be revised. Here are the main reasons for rejection:
- Data augmentation is a good strategy when dealing with data scarcity. In the paper, I think that the data augmentation is not right since all newly created data points are appended after the last data point. Data augmentation for time series should add data points between two existing data points (see reference papers). Moreover, the trend of the time series is clearly not linear. The choice of linear regression is then quite surprising. Since the resulting experiments are based on this strong hypothesis, they are all subject to discussion.
- The testing set is not clear. It should contain only samples from the original dataset. As there are 12 samples in the test set, does it mean that the training set contains only 13 samples? In Fig. 5 and Fig. 7, we see that the predicted samples are from day 25 to day 40. Does it mean that the test set only contains the samples that the authors have created? That would be wrong.
- It is not clear whether the statistical methods are using the regression augmented dataset.
- There is confusion about the machine learning models. It seems that they are used in a regression task minimizing the RMSE loss, but the authors keep mentioning ML classifiers. In Equation 9, a class label is defined, but it is not clear in the rest of the paper whether any classification model has been further trained.
- The authors propose a collaborative framework, but it is not clear how the collaboration is done between the different models.
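
The first two points can be sketched as follows. This is a minimal illustration of the suggestion under stated assumptions, not the authors' method and not a technique taken from the referenced papers: the held-out test set is split off first and contains only original observations from the end of the series, and synthetic points are then created only inside the training portion by linear interpolation at the midpoint between consecutive observations. The values and all names are hypothetical.

```python
def chronological_split(days, values, n_test):
    """Hold out the last n_test original observations as the test set."""
    split = len(days) - n_test
    train = list(zip(days[:split], values[:split]))
    test = list(zip(days[split:], values[split:]))
    return train, test

def augment_by_interpolation(points):
    """Insert one linearly interpolated midpoint between consecutive points.

    Takes (day, value) pairs and returns (day, value, is_synthetic)
    tuples in chronological order.
    """
    augmented = []
    for (d0, v0), (d1, v1) in zip(points, points[1:]):
        augmented.append((d0, v0, False))
        augmented.append(((d0 + d1) / 2, (v0 + v1) / 2, True))
    last_day, last_value = points[-1]
    augmented.append((last_day, last_value, False))
    return augmented

# Hypothetical daily death counts (illustrative only, not WHO data).
days = [1, 2, 3, 4, 5, 6, 7, 8]
deaths = [6, 17, 25, 41, 56, 80, 106, 132]

# Split before augmenting, so that no synthetic point is computed
# from a held-out observation.
train, test = chronological_split(days, deaths, n_test=3)
train_augmented = augment_by_interpolation(train)

n_synthetic = sum(1 for _, _, is_synthetic in train_augmented if is_synthetic)
print(len(train_augmented), n_synthetic, test)
```

With this ordering, every training point precedes every test point in time, and the reported error is measured only against values that were actually observed.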

References:
- Time Series Data Augmentation for Deep Learning: A Survey (https://arxiv.org/abs/2002.12478)
- Data Augmentation for Time Series Classification using Convolutional Neural Networks (https://aaltd16.irisa.fr/files/2016/08/AALTD16_paper_9.pdf)

Nanopublication comments:

Further comments:

RESPONSE TO REVIEWERS

Reference: Article #627-1607

Dear Editor and the Editors-in-Chief,
As suggested in the email sent to us on 01/04/2020 regarding the research paper titled “Supervised Learning Inspired Fast Forecasting Model of 2019-nCoV Outbreak using Small Dataset” (Article #627-1607), we would like to inform you that we have incorporated the following modifications into the previous version of our paper:

1. The 3D bar graphs (Refer: Fig. 6 and Fig. 9) are now replaced by line graphs (Refer: Page 6 -7)

2. The predicted number of deaths on the 40th day in China has been updated for three methods as follows:
2.1 The updated value for the LR-lag method is 2498 instead of 2380 (Refer: Page 6)
2.2 For the ETS, the updated value is 3553 instead of 3353 (Refer: Page 6)
2.3 In the BATS model, the updated value is 4428 instead of 4423 (Refer: Page 7)
3. Fig. 5 has been updated with a re-plot of the predicted values obtained using the LR-lag method (Refer: Page 6)

4. Information about the source datasets is now linked to the "GitHub" repository (https://github.com/sajalmitra2020/WHO-database)

5. The email addresses of both corresponding authors are now given (Refer: Page 1)

Please feel free to get back to us if you need any further clarification or information.

Warm Regards,
Sankar Prasad Mondal (Corresponding Author)

1 Comment

Meta-Review by Editor

Submitted by Tobias Kuhn on Tue, 04/28/2020 - 11:19

We must reject this paper because the methodology for data augmentation, supposedly a main contribution of the paper, is severely flawed. Additionally, the lack of clarity regarding the use and composition of the test set makes the output of the model difficult to evaluate correctly.

Christine Chichester (https://orcid.org/0000-0001-6818-334X)

Data Science
