Towards Time-Evolving Analytics: Online Learning for Time-Dependent Evolving Data Streams

Tracking #: 724-1704

Authors:

	Name	ORCID
	Emanuele Della Valle	https://orcid.org/0000-0002-5176-5885
	Giacomo Ziffer	https://orcid.org/0000-0002-2768-3580
	Alessio Bernardo	https://orcid.org/0000-0002-3492-0345
	Vitor Cerqueira	https://orcid.org/0000-0002-9694-8423
	Albert Bifet	https://orcid.org/0000-0002-8339-7773

Responsible editor:

Robert Hoehndorf

Submission Type:

Position Paper

Abstract:

Traditional historical data analytics is at risk in a world where volatility, uncertainty, complexity, and ambiguity are the new normal. While Streaming Machine Learning (SML) and Time-series Analytics (TSA) attack some aspects of the problem, we are far from a comprehensive solution. SML trains models using fewer data and in a continuous/adaptive way, relaxing the assumption that data points are identically distributed. TSA considers temporal dependence among data points, but it assumes identical distribution. Every Data Scientist fights this battle with ad-hoc solutions. In this paper, we claim that, due to the temporal dependence on the data, the existing solutions do not represent robust solutions to efficiently and automatically keep models relevant even when changes occur, and real-time processing is a must. We propose a novel and solid scientific foundation for Time-Evolving Analytics in this perspective. Such a framework aims to develop the logical, methodological, and algorithmic foundations for fast, scalable, and resilient analytics.

Manuscript:

ds-paper-724.pdf

Data repository URLs:

https://drive.google.com/drive/folders/1ouZgTdvKFBNNzvbX8OL2vpT5H_4lzlI4...

Date of Submission:

Wednesday, July 6, 2022

Date of Decision:

Wednesday, November 2, 2022

Nanopublication URLs:

Decision:

Solicited Reviews:

Review #1 submitted on 15/Sep/2022

By Yuan Yan

https://orcid.org/0000-0002-7602-3589

Review Details

Reviewer has chosen not to be Anonymous

Overall Impression: Good
Suggested Decision: Undecided
Technical Quality of the paper: Good
Presentation: Good
Reviewer's confidence: Medium
Significance: Moderate significance
Background: Comprehensive
Novelty: Limited novelty
Data availability: All used and produced data (if any) are FAIR and openly available in established data repositories
Length of the manuscript: The length of this manuscript is about right

Summary of paper in a few sentences:

This position paper raises the need for a new framework to deal with data streams that are both non-i.i.d. and time-dependent, and proposes Time-Evolving Analytics, which combines the advantages of Streaming Machine Learning and Time-series Analytics. The authors compare many popular classification methods on the Electricity dataset.

Reasons to accept:

This paper offers a good summary and overview of many different types of machine learning methods.

Reasons to reject:

However, I found a few things confusing that need clarification.

Nanopublication comments:

Further comments:

1. As a statistician (forgive my ignorance), I found many terms unfamiliar. Perhaps some definitions could be added, e.g. for Incremental Learning, Streaming Machine Learning, concept drift, and ADWIN (reference?).

2. Table 1: what is the difference between 'evolving data stream' and 'not i.i.d. data stream'? Are they the same, and if so, should the last column of the table be deleted? I ask because SML is said to 'relax the assumption that data points are iid', yet there is a cross under 'not i.i.d. data stream'.

3. The Desiderata and R1-6 seem repetitive. Linking the types of data in Table 1, the shared needs R1-6 on page 2, and the Framework’s Desiderata in Section 6.1:

R2 = time-dependent = Learning Sequences
R5 = i.i.d. data stream = Stateful learning
R4 = not i.i.d. data stream = Graceful forgetting + Selective remembering + Adaptive learning?
R1 = Problem agnostic
R3 = Forecasting alternatives
R6 = No task boundaries

Did I understand correctly? If so, R2, R4, and R5 are related to data types while R1, R3, and R6 are not. Also, the top of page 3 summarizes which data type each model cannot deal with, hence the related R2, R4, and R5. But it does not mention whether ML and Incremental Learning can deal with R2, or whether TSA can deal with R5. Wouldn't it be clearer to simply put R2, R4, and R5 along with the corresponding data type in Table 1? I think it would be very useful to explain the connections and differences.

4. A fuller description of the dataset is needed, e.g. the covariates used (nswdemand, nswprice, vicprice, transfer).

5. The beginning of Section 4 (somewhere on P6) should mention that the TSA methods use only the label and that they are online TSA methods.

6. P6: I don't understand why the last 48 samples of each segment are used for testing ML, while 5-fold distributed prequential cross-validation is used for SML.

7. Fig. 2: I think it would be better to show separate boxplots for the 3 types of SML methods, instead of mixing and sorting them all together.

8. Fig. 4: plot (VFDT vs others) doesn't match the caption (NC vs others).

9. Concept drift is a new term to me, which I think is similar to what we call 'change point detection' in Statistics. Perhaps some connections/comparisons can be made.

Fig. 5: 'horizontal' should be 'vertical'
P7,L27: should be SWT10_ARF, SWT20_ARF
P8,L13: performs as the baseline -> performs similarly to the baseline?

Review #2 submitted on 23/Sep/2022

By Núria Queralt Rosinach

https://orcid.org/0000-0003-0169-8159

Review Details

Reviewer has chosen not to be Anonymous

Overall Impression: Good
Suggested Decision: Accept
Technical Quality of the paper: Good
Presentation: Good
Reviewer's confidence: Low
Significance: High significance
Background: Reasonable
Novelty: Clear novelty
Data availability: Not all used and produced data are FAIR and openly available in established data repositories; authors need to fix this
Length of the manuscript: The length of this manuscript is about right

Summary of paper in a few sentences:

This is a position paper on learning robust models and performing reliable data analytics from real-time data streams, i.e. from data that change over time. Starting from the assumption that a comprehensive solution is lacking, the authors claim the need to develop a foundation for time-evolving analytics and propose a theoretical framework to guarantee learning requirements for scalable, adaptive and solid predictive models.

Reasons to accept:

The manuscript tackles time-evolving analytics, which is a major challenge for data science that affects many scientific and technological areas and has an impact on a plethora of aspects of people's lives, from detecting fake news to disease prognosis or the climate crisis. The authors discuss current challenges and solutions such as 'streaming machine learning' and 'time-series analytics', and claim the need for a unifying theory between these two solutions. They propose a novel theoretical foundation for time-evolving analytics and open a discussion within the community.

This manuscript meets the scope of the Journal and it is well written, logically structured and clear.

Reasons to reject:

First, the requirements for data science on learning over time from non-independent and identically distributed data are well described. However, I missed more and deeper descriptions of real-world cases, their impact, and their problems. For example, I propose adding the federated learning case and a description of its importance, applications, and problems. Related work seems adequate and up to date. The experiments measured the accuracy of the models over time and tracked their resource cost to provide evidence of the risk of learning ineffective predictions at a high resource cost, i.e. to support the authors' assumptions. These experiments seem well designed, but it would benefit the reader if their settings were summarized in a table for clarity and readability. Also, the authors must provide a link to the experiments (code and data) to allow open and FAIR reproducibility of the results by external researchers. I would also suggest having these experiments reviewed by experts on the topic. Afterwards, the authors present and discuss challenges and benefits in a clear way. Then, they introduce their formal framework to develop resilient models and propose a well-supported unifying model and methodology for 'streaming machine learning' and 'time-series analytics' to support scientists in complying with the framework's set of principles.

Nanopublication comments:

Further comments:

I am not an expert on the topic. However, from my perspective as a bioinformatician and data scientist, this is a very important topic for current scientific problems with major societal impact, and it may foster a necessary and relevant discussion within the community. Therefore, I recommend accepting this paper once my comments have been considered.

Review #3 submitted on 26/Oct/2022

By Maxat Kulmanov

https://orcid.org/0000-0003-1710-1820

Review Details

Reviewer has chosen not to be Anonymous

Overall Impression: Good
Suggested Decision: Accept
Technical Quality of the paper: Good
Presentation: Good
Reviewer's confidence: Medium
Significance: Moderate significance
Background: Reasonable
Novelty: Clear novelty
Data availability: Not all used and produced data are FAIR and openly available in established data repositories; authors need to fix this
Length of the manuscript: The length of this manuscript is about right

Summary of paper in a few sentences:

- The authors describe current problems in machine learning with real-time data and propose a novel framework for Time-Evolving Analytics. They envision a framework with setting-free (supervised, unsupervised, semi-supervised) models that continuously adapt to changes, make higher-order predictions into time-dependent space, predict multiple possible outcomes (multi-label), process data in real time, and do not have a fixed structure. They show that existing methods from Streaming Machine Learning (SML) and Time-series Analytics (TSA) can be ineffective by introducing a new method to evaluate their accuracy and resource usage over time. They describe the characteristics of the unifying model for Time-Evolving Analytics and provide a unifying methodology to address the requirements of the proposed framework.

Reasons to accept:

- In general, I think that the questions brought up by the authors are quite important and do not yet have answers that would help to resolve them systematically. The paper is well written and its claims are supported by experiments and data. I have the following comments:
- In my opinion, the section that describes the challenges and benefits is very brief; these should be discussed in more detail. For example, how are the requirements of the framework going to increase the search space and complexity of the models? Or, what is needed to achieve explainability of the outcomes? It would also be good to refer to the requirements when discussing the challenges.
- I suggest putting the data and code in a GitHub repository.

Reasons to reject:

None

Nanopublication comments:

Further comments:

1 Comment

Meta-Review by Editor

Submitted by Tobias Kuhn on Wed, 11/02/2022 - 07:41

The reviewers have provided several suggestions to improve the manuscript. In particular, the clarity of the paper can be improved by clearly introducing the problem that is discussed, introducing technical terms instead of relying on prior knowledge, and removing jargon; the technical and scientific challenges that need to be addressed should be described more clearly. Additionally, the code and data underlying the results shown in the paper should be made available to ensure reproducibility.

Robert Hoehndorf (https://orcid.org/0000-0001-8149-5890)

Data Science
