
Verifying and Mining Frequent Patterns from Large Windows over Data Streams

Barzan Mozafari, Hetal Thakkar, Carlo Zaniolo
Computer Science Department, University of California, Los Angeles, CA, USA
{barzan,hthakkar,zaniolo}@cs.ucla.edu


Abstract: Mining frequent itemsets from data streams has proved to be very difficult because of computational complexity and the need for real-time response. In this paper, we introduce a novel verification algorithm, which we then use to improve the performance of monitoring and mining tasks for association rules. Thus, we propose a frequent itemset mining method for sliding windows that is faster than the state-of-the-art methods; in fact, its running time, which is nearly constant with respect to the window size, allows the mining of much larger windows than was possible before. The performance of other frequent itemset mining methods (including those on static data) can be improved likewise, by replacing their counting methods (e.g., those using hash trees) with our verification algorithm.

I. INTRODUCTION

Data streams have received much attention in recent years. Furthermore, interest in online stream mining has also dramatically increased [1], [2], [3], [4], [5], [6]. This interest is largely due to the growing set of streaming applications, such as credit card fraud detection and market basket data analysis, where data mining plays a critical role. In this paper, we focus on the problem of mining frequent itemsets on large windows defined over such data streams. This problem appears in many of the applications mentioned above in different forms.

Mining frequent itemsets for association rules has been studied extensively since it was first introduced by Agrawal et al. [1]. Since then many faster algorithms have been proposed [2], [3], [6], [7]. Furthermore, this problem appears as a subproblem in many other mining contexts such as finding sequential patterns [7], [3], clustering [8], and classification [9], [10].

The recent growth of interest in data stream systems and data stream mining is due to the fact that, in many applications, data must be processed continuously, either because of real-time requirements or simply because the stream is too massive for a store-now & process-later approach. However, mining of data streams brings many challenges not encountered in database mining, because of the real-time response requirement and the presence of bursty arrivals and concept shifts (i.e., changes in the statistical properties of data). In order to cope with such challenges, the continuous stream is often divided into windows, thus reducing the size of the data that need to be stored and mined. This allows detecting concept drifts/shifts by monitoring changes between subsequent windows. Even so, association rule mining over such large windows remains a computationally challenging problem requiring algorithms that are faster and lighter than those used on stored data. Thus, algorithms that make multiple scans of the data should be avoided in favor of single-scan, incremental algorithms. In particular, the technique of partitioning large windows into slides (a.k.a. panes) to support incremental computations has proved very valuable in DSMS [11], [12] and will be exploited in our approach.

We will also make use of the following observation: in real-world applications there is an obvious difference between the problem of (i) finding new association rules, and (ii) verifying the continuous validity of existing rules.

Normally, finding new rules requires both machines and domain experts, since the size of the data is too large to be mined by a person and the importance of new rules with respect to the application can only be validated by domain experts. In this situation, delays by the mining algorithms in detecting new frequent itemsets are also acceptable, provided that they add little to the typical time required by the domain experts to validate new rules. Thus, we propose an algorithm for incremental mining of frequent itemsets that compares favorably with existing algorithms when real-time response is required. Furthermore, the performance of the proposed algorithm improves when small delays are acceptable.

Although a real-time introduction of new association rules is neither sensible nor feasible, the on-line verification of old rules is highly desirable in most application scenarios: we need to determine immediately when old rules no longer hold to stop them from pestering customers with improper recommendations.

Therefore, in this paper we propose fast algorithms, called verifiers henceforth, for verifying the frequency of previously frequent itemsets over newly arriving windows. Toward this goal, we use sliding windows, whereby a large window is partitioned into smaller panes [11] and a response is returned promptly at the end of each slide (rather than at the end of each large window). This also leads to a more efficient computation, since the frequency of the itemsets in the whole window can be computed incrementally by counting itemsets in the new incoming (and old expiring) panes. To make this counting efficient, we introduce a novel concept of conditional counting, a.k.a. verification, which can be realized efficiently by the proposed verifiers. Thus, the proposed incremental algorithm for finding frequent
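
The pane-based incremental counting described in the introduction can be illustrated with a minimal Python sketch. It is not the verifier algorithm proposed by the authors: the per-pane counting here is a naive subset test, and the names `count_in_pane`, `PaneWindowVerifier`, `panes_per_window`, and `min_support` are hypothetical, introduced only for illustration. The sketch shows only the general idea that window-level counts for a fixed set of patterns can be maintained by adding the counts of the incoming pane and subtracting those of the expiring pane.

```python
from collections import Counter, deque
from typing import Deque, FrozenSet, Iterable, List, Set, Tuple

Itemset = FrozenSet[str]


def count_in_pane(pane: List[List[str]], patterns: List[Itemset]) -> Counter:
    """Naively count how many transactions in one pane contain each pattern.

    Assumption: a plain subset test stands in for the paper's verifier.
    """
    counts: Counter = Counter()
    for transaction in pane:
        items = set(transaction)
        for pattern in patterns:
            if pattern <= items:  # pattern is a subset of the transaction
                counts[pattern] += 1
    return counts


class PaneWindowVerifier:
    """Maintains counts of given patterns over the last `panes_per_window` panes."""

    def __init__(self, patterns: Iterable[Iterable[str]],
                 panes_per_window: int, min_support: float) -> None:
        self.patterns: List[Itemset] = [frozenset(p) for p in patterns]
        self.panes_per_window = panes_per_window
        self.min_support = min_support  # fraction of transactions in the window
        self.pane_stats: Deque[Tuple[Counter, int]] = deque()
        self.window_counts: Counter = Counter()
        self.window_size = 0  # number of transactions currently in the window

    def add_pane(self, pane: List[List[str]]) -> Set[Itemset]:
        """Slide the window by one pane; return patterns that are still frequent."""
        counts = count_in_pane(pane, self.patterns)
        self.pane_stats.append((counts, len(pane)))
        self.window_counts.update(counts)  # add the incoming pane
        self.window_size += len(pane)
        if len(self.pane_stats) > self.panes_per_window:
            old_counts, old_size = self.pane_stats.popleft()
            self.window_counts.subtract(old_counts)  # remove the expiring pane
            self.window_size -= old_size
        threshold = self.min_support * self.window_size
        return {p for p in self.patterns if self.window_counts[p] >= threshold}


if __name__ == "__main__":
    verifier = PaneWindowVerifier([{"a", "b"}, {"c"}],
                                  panes_per_window=2, min_support=0.5)
    for pane in ([["a", "b"], ["a", "b", "c"]], [["a"], ["c"]], [["c"], ["c"]]):
        still_frequent = verifier.add_pane(pane)
        print(sorted(sorted(p) for p in still_frequent))
```

In this sketch only the newest pane is ever scanned, so the cost of one slide depends on the pane size and the number of monitored patterns rather than on the window size. The memory cost is one small counter per pane currently in the window.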
