Active Mining of Data Streams

Wei Fan 1, Yi-an Huang 2, Haixun Wang 1, Philip S. Yu 1

1 IBM T. J. Watson Research, Hawthorne, NY 10532, {weifan,haixun,psyu}@us.ibm.com
2 College of Computing, Georgia Institute of Technology, Atlanta, GA 30332, yian@cc.gatech.edu

Abstract

Most previously proposed mining methods on data streams make an unrealistic assumption that a “labeled” data stream is readily available and can be mined at any time. However, in most real-world problems, labeled data streams are rarely immediately available. For this reason, models are refreshed periodically, usually in sync with the data availability schedule. There are several undesirable consequences of this “passive periodic refresh”. In this paper, we propose a new concept of demand-driven active data mining. It estimates the error of the model on the new data stream without knowing the true class labels. When significantly higher error is suspected, it investigates the true class labels of a selected number of examples in the most recent data stream to verify the suspected higher error.

1 State-of-the-art Stream Mining

State-of-the-art work on mining data streams concentrates on capturing time-evolving trends and patterns with “labeled” data. However, one important aspect that is often ignored or unrealistically assumed is the availability of “class labels” of data streams. Most algorithms make an implicit and impractical assumption that labeled data is readily available. Most works focus on how to detect the change in patterns and how to update the model to reflect such changes when there are “labeled” instances to be learned. However, for many applications, the class labels are not “immediately” available unless dedicated efforts and substantial costs are spent to investigate these labels right away. If the true class labels were readily available, data mining models would not be very useful: we might just wait. In credit card fraud detection, we usually do not know if a particular transaction is a fraud until at least one month later, after the account holder receives and reviews the monthly statement.

Due to these facts, most current applications obtain class labels and update existing models at a preset frequency, usually synchronized with data refresh. The effectiveness of the passive mode is dictated by some “statutory and static constraints”, not by the “demand” for a better model with a lower loss. Such a passive mode of mining data streams has a number of potentially undesirable consequences that contradict the notions of “streaming” and “continuous”. First, it may incur higher loss due to neglected pattern drifts. If either the concept or the data distribution drifts rapidly, at an unforecasted rate that the statutory constraints cannot keep up with, the models are likely to be out-of-date on the data stream and important business opportunities might be missed. Second, it may refresh the model unnecessarily. If there is neither conceptual nor distributional change, periodic passive model refresh and re-validation is a waste of resources.

1.1 Demand-driven Active Mining of Data Streams

We are proposing a demand-driven active stream data mining process that solves the problems of passive stream data mining. In summary, our particular implementation of active stream data mining has three simple steps:

1. Detect potential changes of data streams “on the fly” when the existing model classifies continuous data streams. The detection process does not use or know any true labels of the stream. One of the change detection methods is a “guess” of the actual loss or error rate of the model on the new data stream.
2. If the guessed loss or error rate of the model in step 1 is much higher than an application-specific tolerable maximum, we choose a small number of data records in the new data stream to investigate their true class labels. With these true class labels, we statistically estimate the true loss of the model.
3. If the statistically estimated loss in step 2 is verified to be higher than the tolerable maximum, we reconstruct the old model by using the same true class labels sampled in the previous step.

In this paper, we concentrate on the first two steps. Our particular implementation extends classification trees.

2 The Framework

We first discuss different types of possible pattern changes in the data stream, then propose a few statistics to monitor in
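
The three-step process of Section 1.1 can be sketched roughly as below. This is a minimal illustration under stated assumptions, not the authors' implementation: the label-free "guess" used here (one minus the model's highest predicted class probability, averaged over the stream) is a stand-in chosen for simplicity, and the names `guessed_error`, `sampled_error`, `needs_rebuild`, `predict`, `true_label`, and `tolerable_max` are all hypothetical. Step 2 uses a one-sided normal approximation for the sampled 0-1 loss, which is likewise an assumption.

```python
import math
import random
from typing import Callable, Hashable, Sequence, Tuple


def guessed_error(probabilities: Sequence[Sequence[float]]) -> float:
    """Step 1: guess the error rate without any true labels.

    Assumption (not from the paper): if the predicted class probabilities
    are roughly calibrated, a prediction is wrong with chance 1 - max(p).
    """
    return sum(1.0 - max(p) for p in probabilities) / len(probabilities)


def sampled_error(
    records: Sequence[Hashable],
    predict: Callable[[Hashable], int],
    true_label: Callable[[Hashable], int],
    sample_size: int,
    rng: random.Random,
) -> Tuple[float, float]:
    """Step 2: estimate the 0-1 loss from a small, freshly labeled sample.

    Returns (mean loss, standard error of the mean).
    """
    n = min(sample_size, len(records))
    sample = rng.sample(list(records), n)
    losses = [0.0 if predict(r) == true_label(r) else 1.0 for r in sample]
    mean = sum(losses) / n
    std_err = math.sqrt(mean * (1.0 - mean) / n)
    return mean, std_err


def needs_rebuild(
    probabilities: Sequence[Sequence[float]],
    records: Sequence[Hashable],
    predict: Callable[[Hashable], int],
    true_label: Callable[[Hashable], int],
    tolerable_max: float,
    sample_size: int = 100,
    z: float = 1.645,
    seed: int = 0,
) -> bool:
    """Return True when step 3 (model reconstruction) should be triggered."""
    # Step 1: no labels are requested unless the guess exceeds the tolerance.
    if guessed_error(probabilities) <= tolerable_max:
        return False
    # Step 2: pay for a few true labels and verify the suspicion.
    mean, std_err = sampled_error(
        records, predict, true_label, sample_size, random.Random(seed)
    )
    # Rebuild only if even the lower confidence bound exceeds the tolerance.
    return mean - z * std_err > tolerable_max
```

The design point mirrored here is that `true_label`, the costly investigation, is called only after the label-free guess raises suspicion, and then only on `sample_size` records.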
