Mining Timed Sequential Patterns: The Minits-AllOcc Technique

Tracking #: 734-1714

Authors:

	Name	ORCID
	Somayah Karsoum	https://orcid.org/0000-0002-8855-0175
	Clark Barrus	https://orcid.org/0000-0001-8239-8689
	Le Gruenwald	https://orcid.org/0000-0002-5245-4747
	Eleazar Leal	https://orcid.org/0000-0002-3055-1845

Responsible editor:

Richard Mann

Submission Type:

Research Paper

Abstract:

Sequential pattern mining is one of the data mining tasks used to find the subsequences in a sequence dataset that appear together in order based on time. Sequence data can be collected from devices, such as sensors, GPS, or satellites, and ordered based on timestamps, which are the times when they are generated or collected. Mining patterns in such data can be used to support many applications, including weather forecasting and transportation recommendation systems. Numerous techniques have been proposed to address the problem of how to mine subsequences in a sequence dataset; however, traditional algorithms ignore the temporal information between the itemsets in a sequential pattern. This information is essential in many situations. For example, doctors, even if they know a symptom B will appear after symptom A for a specific disease, must know the time interval in which symptom B is expected to appear to reduce the disease's risk and provide a suitable treatment. Considering temporal relationship information for sequential patterns raises new issues to be solved, such as designing a new data structure to save this information and traversing this structure efficiently to discover patterns without re-scanning the database. In this paper, we propose an algorithm called Minits-AllOcc (MINIng Timed Sequential Pattern for All-time Occurrences) to find sequential patterns and the transition time between itemsets based on all occurrences of a pattern in the database. We also propose a parallel multi-core CPU version of this algorithm, called MMinits-AllOcc (Multi-core for MINIng Timed Sequential Pattern for All-time Occurrences), to deal with Big Data. Extensive experiments on real and synthetic datasets show the advantages of this approach over the brute-force method. Also, the multi-core CPU version of the algorithm is shown to outperform the single-core version on Big Data by 2.5X.

Manuscript:

ds-paper-734.pdf

Data repository URLs:

https://www.mesonet.org/index.php/site/research

https://www.microsoft.com/en-us/research/publication/t-drive-trajectory-...

Date of Submission:

Saturday, November 26, 2022

Date of Decision:

Friday, April 28, 2023

Nanopublication URLs:

Decision:

Reject

Solicited Reviews:

Review #1 submitted on 27/Mar/2023

By Emanuele Della Valle

https://orcid.org/0000-0002-5176-5885

Review Details

Reviewer has chosen not to be Anonymous

Overall Impression: Good
Suggested Decision: Undecided
Technical Quality of the paper: Excellent
Presentation: Good
Reviewer's confidence: Medium
Significance: High significance
Background: Incomplete or inappropriate
Novelty: Limited novelty
Data availability: Not all used and produced data are FAIR and openly available in established data repositories; authors need to fix this
Length of the manuscript: The length of this manuscript is about right

Summary of paper in a few sentences:

The paper focuses on sequential pattern mining. The authors claim that finding subsequences in sequences is helpful for many applications, such as weather forecasting and mobility. To address this issue, the authors propose an algorithm called Minits-AllOcc to find sequential patterns and transition times between events. They also introduce a parallel version of the algorithm called MMinits-AllOcc, which runs on a multi-core CPU. The authors bring experimental evidence that Minits-AllOcc outperforms brute-force methods and that MMinits-AllOcc is 2.5X faster than the single-core version.

Reasons to accept:

The paper is very well written. It is well organized, and it reads smoothly. The motivation of the paper is strong. The proposed method is sound. The experimental campaign is extensive and proves the claims of the authors.

Reasons to reject:

I have severe doubts about the novelty of the paper. I am not an expert in sequential pattern mining; thus, I read the reviews referenced in the paper (citations 9 and 10). I discovered that state-of-the-art algorithms ignore the temporal information between itemsets in a sequential pattern.

In stream processing, my area of expertise, temporal information is essential in many situations, and engines are optimized for it. While basic stream processing queries can use the order of events arriving on a stream (e.g., A followed by B, or A followed by B and not C), expressive queries require at least one temporal annotation per event. Temporal annotations allow computing intervals and reasoning about time (e.g., A followed by B within N seconds).
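To make the query "A followed by B within N seconds" concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions: the event representation (a label plus a timestamp in seconds), the function name followed_by_within, and the requirement that events arrive in timestamp order are all hypothetical and are not taken from the manuscript or from any particular stream processing engine.

```python
from typing import Iterable, List, Tuple

# Hypothetical event representation: (label, timestamp in seconds).
Event = Tuple[str, float]


def followed_by_within(
    events: Iterable[Event], first: str, second: str, window: float
) -> List[Tuple[float, float]]:
    """Return (t_first, t_second) pairs where `second` occurs strictly after
    `first` and at most `window` seconds later.

    Assumes events arrive in non-decreasing timestamp order and are read once.
    """
    pending: List[float] = []  # timestamps of `first` that can still match
    matches: List[Tuple[float, float]] = []
    for label, t in events:
        # Forget `first` events that are already too old to match anything.
        pending = [t0 for t0 in pending if t - t0 <= window]
        if label == second:
            matches.extend((t0, t) for t0 in pending if t0 < t)
        if label == first:
            pending.append(t)
    return matches
```

For the toy stream [("A", 0.0), ("C", 2.0), ("B", 4.0), ("A", 10.0), ("B", 30.0)] and a window of 5 seconds, the sketch reports the single match (0.0, 4.0); the second B arrives 20 seconds after the second A and is therefore outside the window.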

Indeed, in stream processing, finite state machines (FSMs) and trees are used to improve the efficiency and accuracy of processing systems by reducing the need for brute-force pattern matching. FSMs and trees represent states and transitions when modeling sequences, time intervals, and causality, and when recognizing complex event patterns [*]. Moreover, as in the authors' setting, stream processing cannot re-scan the data because the input is unbounded and the engine cannot store it in full. Last but not least, latency matters in both settings. In stream processing, patterns of events must be detected in real time since actions must be triggered based on those patterns. In the authors' setting, the real-time requirement does not hold, but faster is clearly better.
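As a rough illustration of the FSM idea, the following Python sketch recognizes "A followed by B and not C" in a single pass, without re-scanning the input. It is an assumption-laden toy: the two states, the function name count_a_then_b_without_c, and the reading of "not C" as "no C between A and B" are hypothetical choices made for this example, not the design of any specific engine or of the manuscript's algorithm.

```python
from enum import Enum, auto
from typing import Iterable


class State(Enum):
    START = auto()   # no open A yet
    SEEN_A = auto()  # an A was seen and no C has cancelled it


def count_a_then_b_without_c(labels: Iterable[str]) -> int:
    """Count matches of 'A followed by B with no C in between'.

    Each label is consumed exactly once, so the input may be an unbounded
    iterator; only the current state and a counter are kept in memory.
    """
    state = State.START
    matches = 0
    for label in labels:
        if state is State.START:
            if label == "A":
                state = State.SEEN_A
        else:  # State.SEEN_A
            if label == "B":
                matches += 1
                state = State.START
            elif label == "C":
                state = State.START  # C cancels the pending A
    return matches
```

On the label stream A, D, B, A, C, B, A, B the sketch counts two matches: the first and third A are each followed by a B, while the second A is cancelled by the C.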

[*] Artikis, A., Margara, A., Ugarte, M., Vansummeren, S. and Weidlich, M., 2017. Tutorial: Complex event recognition languages. In Proceedings of the 11th ACM International Conference on Distributed and Event-based Systems (pp. 7-10). ACM. https://cer.iit.demokritos.gr/publications/papers/2017/2017_debs_tut_pap...

Nanopublication comments:

Further comments:

* The quality of the tables is low: the font is too small and pixelated.
* References 22, 23, 28, 29, and 32 are written in different fonts.
* The first paragraph in each section should be either indented or not indented. Currently, there is a mix.
* Figure 18 and figure 19 have different resolutions.
* I would avoid short sections (e.g., 4.5.*, and 5.* )

Review #2 submitted on 28/Apr/2023

By Tobias Kuhn

https://orcid.org/0000-0002-1267-0234

Review Details

Reviewer has chosen not to be Anonymous

Overall Impression: Average
Suggested Decision: Undecided
Technical Quality of the paper: Unable to judge
Presentation: Average
Reviewer's confidence: Low
Significance: Moderate significance
Background: Reasonable
Novelty: Unable to judge
Data availability: Not all used and produced data are FAIR and openly available in established data repositories; authors need to fix this
Length of the manuscript: The length of this manuscript is about right

Summary of paper in a few sentences:

The paper motivates and introduces the concept of timed sequential patterns. It then reports on the development of two algorithms and implementations, a plain one and one optimized for multi-core processing. Finally, it presents evaluation results in terms of execution time and the effect of different parameters.

Reasons to accept:

- Interesting and relevant topic
- Overall model and approach seem sound and valuable
- Good comparison of two versions of the algorithm/implementation

Reasons to reject:

- The novelty could be better established
- The formal definitions and arguments are quite hard to follow (at least to me) due to sometimes unconventional (at least to me) notation and omissions
- No "sanity check" with other existing implementations of plain (non-timed) sequences, to see whether outputs are consistent and execution times are reasonable

Nanopublication comments:

Further comments:

- The table in Figure 1 could be presented better: unclear why elements are shown in this order; use of color could make it easier to understand; as an introductory illustration, this could possibly also be made more into an actual figure and less like a strict table

- In the introduction about patterns being "extended": It wasn't clear to me why "extending" a pattern isn't treated like creating a new pattern (so then obviously the result has to be calculated again). Are such "extensions" just things that happen frequently and therefore need specific support? Or did I misunderstand what these "extensions" are? This should be clarified either way.

- In the end of the introduction: "The time can be any descriptive statistic based on the user's preference, such as range, average, etc." could have been better introduced in the example and motivations before.

- "The idea of incorporating transition time ...": I'd use a word that's a bit stronger than "idea", e.g. "concept" or "model".

- It took me a bit to understand the first paragraph of Section 2. I was confused by the notation <{a1},{a2},...>, where {a1} etc seem to be sets with a single element. But I think they should just be sequences of sets, so where a1, a2, etc are sets? I believe that the curly brackets are in this case superfluous and confusing. Also, it's not explicitly specified in the text that a1 etc are itemsets.

- "A timed event is a pair e = (I, t), where I am an item set that ...": I suppose it should be "where I *is*"

- "ei.x subsetof I (1 ≤ i ≤ n)": here it's unclear what x and I stand for.

- Definition of "delta": what do p and j stand for here?

- {a, 10} seems to be a tuple (not a set). I might not be familiar with the conventions of this particular community, but to me curly brackets normally indicate a set, and round brackets are more usual for tuples, so (a, 10). But again, there might be different conventions in this community.

- I cannot really judge the novelty, as I am not an expert in the field. Based on the Related Work section, it seems novel, but it would be good to see in a bit more detail (at some point in the paper, maybe based on examples) how the presented approach differs from some other recent approaches.

- "The drawback of these methods is ..." in Related Work: I was wondering what happens if these methods are used with an arbitrarily high ("infinite") time interval. Does it break or would it still work? If the latter, how does it then compare to your method?

- "A node can have multiple parent nodes": this confused me. Does it mean "multiple direct parents"? Then it's not a tree! Or "multiple indirect parents going up the tree hierarchy"? The latter is kind of obvious, and I would remove this part to avoid confusion.

- "<{a,2},<{a,b,19},{d,25}>": this seems malformed.

- I got lost following the algorithm/method somewhere around Figure 8. It wasn't clear to me what exactly the example shows. Where do all these numbers come from? Are the upper ones (<{a}> and <{b}>) just given as an example, and do they then produce the bottom ones? Is TS3 in <{a}> the same as TS3 in <{b}>? If so, why are the trees different? If not, why are they called the same?

- "For air temperature ...": This paragraph (and some other parts) in Section 5 are not very pleasant to read. It would be better to put this into some kind of table or figure.

- "Competing Algorithms": I would have liked to see some rough comparison to other algorithms. For example, I suppose the timed sequences can be used to "simulate" plain sequences, so you could create some such cases, which would allow you to test whether your algorithms give the same results as the established ones do for plain sequences (as a kind of sanity check). Measuring the execution time in this scenario would also be an interesting thing to do. It's perfectly fine if your algorithm is slower in this scenario, but it would be interesting to see how much slower (like doubled execution time or 100x execution time or ...?).
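The sanity check suggested in the last comment, together with the (itemset, timestamp) tuple notation discussed earlier, could be sketched roughly as follows in Python. This is a hedged illustration only: the type aliases, the helper names strip_times, contains, and support, and the brute-force counting are assumptions made for this example. They are neither the authors' Minits-AllOcc algorithm nor any established sequential pattern miner; they merely produce reference support counts against which a timed miner's output, with times ignored, could be compared.

```python
from typing import FrozenSet, List, Sequence, Tuple

Itemset = FrozenSet[str]
# A timed sequence as a list of (itemset, timestamp) tuples (assumed encoding).
TimedSequence = List[Tuple[Itemset, int]]
PlainSequence = List[Itemset]


def strip_times(seq: TimedSequence) -> PlainSequence:
    """Project a timed sequence onto a plain sequence by dropping timestamps."""
    return [items for items, _ in seq]


def contains(seq: PlainSequence, pattern: Sequence[Itemset]) -> bool:
    """True if `pattern` is a subsequence of `seq`, where each pattern itemset
    must be a subset of a distinct, later itemset of `seq` (greedy matching)."""
    pos = 0
    for wanted in pattern:
        while pos < len(seq) and not wanted <= seq[pos]:
            pos += 1
        if pos == len(seq):
            return False
        pos += 1  # the next pattern itemset must match strictly later
    return True


def support(db: List[TimedSequence], pattern: Sequence[Itemset]) -> int:
    """Brute-force support: number of sequences in `db` containing `pattern`."""
    return sum(contains(strip_times(s), pattern) for s in db)


# Made-up toy database, used only to exercise the sketch.
toy_db: List[TimedSequence] = [
    [(frozenset("a"), 2), (frozenset("ab"), 19), (frozenset("d"), 25)],
    [(frozenset("a"), 1), (frozenset("d"), 5)],
    [(frozenset("b"), 3), (frozenset("a"), 4)],
]
```

On this toy database, the plain pattern <a, d> has support 2 and <b, a> has support 1. A timed miner run on the same data should report the same supports for these patterns once its transition times are ignored, and timing both runs would give the rough slowdown factor the comment asks about.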

4 Comments

Review the paper and comment.

Submitted by Malik Jawarneh on Tue, 03/07/2023 - 05:33

Positive:

• The proposed Minits-AllOcc and MMinits-AllOcc algorithms are capable of mining sequential patterns and the transition time between itemsets based on all occurrences of a pattern in the database.
• The proposed parallel multi-core CPU version of this algorithm, MMinits-AllOcc, is able to efficiently deal with Big Data.
• Extensive experiments on real and synthetic datasets show the advantages of this approach over the brute-force method, with the multi-core CPU version of the algorithm shown to outperform the single-core version on Big Data by 2.5X.

Negative:
• Some parts of the proposed algorithm lack clarity and could be better explained.
• The proposed algorithm could be further improved by incorporating more advanced techniques to optimize the computation time.

Review comment

Submitted by Malik Jawarneh on Sat, 03/11/2023 - 09:19

1. Structure your abstract as follows: 1) Background 2) Aim/Objective 3) Methodology 4) Results 5) Conclusion. Write 2-4 lines for each and merge everything into one paragraph without any subheadings.
2. The abstract must contain the motivation and objective of the article. It must be very clear and present the motive of the paper in a nutshell.
3. The introduction should consist of 5-7 solid paragraphs and should outline the structure of the work at its end.
4. Add more contributions to your field of study.
5. The purpose of the study is not clear.
6. Highlight the objectives.
7. Remove any table or figure taken from the web. Otherwise, you have to get approval from the publisher and author using the form provided by Springer.
8. Please avoid writing definitions of terms such as data mining, sequential pattern mining, timed sequential patterns, and single-core and multi-core processors, which are already available on the web; cite existing work for such information instead.
9. In summary, only provide useful content in your work.
10. These are titles related to your area that you may use:

JAWARNEH, M. (2023). Development of Machine Learning Based Security Model for IoT Network.
Jawarneh, M., Alshare, M., Bsoul, Q., & Kalash, H. S. The Impact of Machine Learning On Educational Institutions: An Empirical Study.
Arumugam, K., Swathi, Y., Sanchez, D. T., Mustafa, M., Phoemchalard, C., Phasinam, K., & Okoronkwo, E. (2022). Towards applicability of machine learning techniques in agriculture and energy sector. Materials Today: Proceedings, 51, 2260-2263.
Sajja, G. S., Mustafa, M., Phasinam, K., Kaliyaperumal, K., Ventayen, R. J. M., & Kassanuk, T. (2021, August). Towards application of machine learning in classification and prediction of heart disease. In 2021 Second International Conference on Electronics and Sustainable Communication Systems (ICESC) (pp. 1664-1669). IEEE.

Meta-Review by Editor

Submitted by Tobias Kuhn on Fri, 04/28/2023 - 14:12

Your manuscript has been reviewed by two reviewers. Both reviewers found the paper to be interesting and potentially significant. However, the reviewers also raised concerns about the novelty of the paper, with both noting that it was hard to evaluate the novelty as this was not described clearly enough in the manuscript. I consider this to be the major concern that your revision should address. In particular, you should clearly compare the algorithm proposed in this manuscript to other existing algorithms, give some idea of their relative performance, and delineate precisely where the new algorithm differs from existing approaches. As noted by reviewer 2, it does not necessarily matter if the new algorithm is slower than other methods, but the manuscript should give some idea of the relative performance in this domain.

Please also note that both reviewers highlighted that 'Not all used and produced data are FAIR and openly available in established data repositories; authors need to fix this'. Making all data FAIR and openly available will be a condition of eventual acceptance.

The reviewers also provided detailed comments on other aspects of the manuscript, and I invite you to consider these carefully in formulating your response and revision of the manuscript.

Richard Mann (https://orcid.org/0000-0003-0701-1274)

Withdrawn by the authors

Submitted by Tobias Kuhn on Mon, 07/24/2023 - 00:59

This submission was withdrawn upon request by the authors. Thereby it is now marked Rejected instead of Undecided.

Data Science