State-of-the-Art in Data Stream Mining
Tutorial Notes for the ECML/PKDD 2007 International Conference
The 18th European Conference on Machine Learning and the 11th European Conference on Principles and Practice of Knowledge Discovery in Databases
Part I presented by João Gama


  1. Approximate Answers

Approximate answers: the actual answer is within 5 ± 1 with probability ≥ 0.9.
Approximation: find an answer correct within some factor, e.g. an answer within 10% of the correct result; more generally, a (1 ± ε) factor approximation.
Randomization: allow a small probability of failure; the answer is correct except with probability 1 in 10,000; more generally, success probability (1 − δ).
Approximation and randomization combined: (ε, δ)-approximations.
The constants ε and δ strongly influence the space used; typically the space is O((1/ε²) log(1/δ)).

  2. Tail Inequalities

Approximate answers involve a trade-off between the accuracy of the answer and the computational resources required to compute it.
Tail inequalities give general bounds on the tail probability of a random variable: the probability that it deviates far from its expectation.

  3. Chebyshev Inequality

If X is a random variable with mean µ and standard deviation σ, the probability that the outcome of X is at least kσ away from its mean is at most 1/k²:

P(|X − µ| ≥ kσ) ≤ 1/k²

No more than 1/4 of the values are more than 2 standard deviations away from the mean, no more than 1/9 are more than 3 standard deviations away, no more than 1/25 are more than 5 standard deviations away, and so on.

  4. Chernoff Bound

Consider a biased coin: one side is more likely to come up than the other, but we don't know which and would like to find out.
Flip it many times and then choose the side that comes up most often.
How many times do we have to flip it to be confident that we have chosen correctly?
A Chernoff-style bound gives, for failure probability δ:

n ≥ ln(1/δ) / (p − 1/2)²

Example: p = 0.6 with 95% confidence (δ = 0.05).
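As a quick check of the bound above, the sample size for the slide's example can be computed directly. A minimal Python sketch; the function name is ours, and δ = 0.05 encodes the 95% confidence level:

```python
import math

def coin_flips_needed(p, delta):
    """Number of flips after which the majority side is the biased one
    with probability at least 1 - delta, per n >= ln(1/delta) / (p - 1/2)^2."""
    return math.ceil(math.log(1 / delta) / (p - 0.5) ** 2)

# For a coin with bias p = 0.6 and 95% confidence (delta = 0.05):
print(coin_flips_needed(0.6, 0.05))  # 300 flips suffice
```

Note how the bound degrades as the bias approaches a fair coin: the required n grows like 1/(p − 1/2)².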

  5. Hoeffding Bound

Characterizes the deviation between the true probability of some event and its frequency over m independent trials:

P(|X̄ − µ| ≥ ε) ≤ 2 exp(−2mε²/R²),

where R is the range of the random variables.
Example: after seeing n = 100 examples of a random variable X, with xᵢ ∈ [0, 1], the sample mean is x̄ = 0.6.
The true mean lies, with confidence 1 − δ, in x̄ ± ε, where

ε = sqrt(R² ln(1/δ) / (2n)).
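The slide's worked example can be completed numerically. A small Python sketch; the slide does not fix δ, so δ = 0.05 below is our assumption:

```python
import math

def hoeffding_epsilon(R, delta, n):
    """Half-width of the Hoeffding interval: the true mean lies within
    epsilon of the sample mean with probability at least 1 - delta."""
    return math.sqrt(R ** 2 * math.log(1 / delta) / (2 * n))

# Slide example: n = 100 observations in [0, 1] (so R = 1), sample mean 0.6.
eps = hoeffding_epsilon(R=1, delta=0.05, n=100)   # delta = 0.05 is assumed
print(f"true mean in 0.6 +/- {eps:.3f}")          # roughly 0.6 +/- 0.122
```

Because ε shrinks as 1/sqrt(n), quadrupling the number of examples halves the interval.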

  6. Time Windows

Instead of computing statistics over the whole stream, use only the most recent data points: the most recent data is more relevant than older data.
Several window models: landmark, sliding, and tilted windows.

  7. Basic Stream Methods

Sampling
Data synopses: sketches, histograms, wavelets

  8. Sampling

To obtain an unbiased sample of the data, we need to know the length of the stream, so in data streams we need to modify the approach.
Strategy: sample instances at periodic time intervals.
Useful to slow down the data, but involves loss of information.

  9. Sampling

To obtain an unbiased sample of the data, we need to know the length of the stream, so in data streams we need to modify the approach.
Strategy: sample instances at periodic time intervals.
Useful to slow down the data, but involves loss of information.
Known problems: with periodic sampling it is not possible to detect changes or anomalies.

  10. The Reservoir Sampling Technique

Vitter, J.; Random Sampling with a Reservoir; ACM Transactions on Mathematical Software, 1985.
Creates a uniform sample of fixed size k:
Insert the first k elements into the sample.
Then insert the i-th element with probability pᵢ = k/i, deleting an instance from the sample at random.
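The two rules above are the whole algorithm (Vitter's Algorithm R). A minimal Python sketch; the function name is ours:

```python
import random

def reservoir_sample(stream, k, rng=random):
    """Vitter's Algorithm R: maintain a uniform sample of k elements
    from a stream of unknown length, using O(k) memory."""
    reservoir = []
    for i, item in enumerate(stream, start=1):
        if i <= k:
            reservoir.append(item)        # the first k elements fill the sample
        else:
            j = rng.randint(1, i)         # uniform position in 1..i
            if j <= k:                    # the i-th element enters w.p. k/i
                reservoir[j - 1] = item   # and evicts a uniform resident
    return reservoir
```

At every point of the stream, each element seen so far has the same probability k/i of being in the reservoir, which is exactly the uniformity guarantee.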

  11. Illustrative Problems

Count the number of distinct values in a stream.
Count the number of 1's in a sliding window over a binary string.
Count the frequent items above a given support.

  12. Illustrative Problems

Count the number of distinct values in a stream.
Count the number of 1's in a sliding window over a binary string.
Count the frequent items above a given support.

Count the number of distinct values in a stream:
Assume the domain of the attribute is {0, 1, ..., M − 1}.
The problem is trivial if we have space linear in M.
Is there an approximate solution in space log(M)?

  13. FM Sketches for Distinct Value Estimation

Flajolet and Martin; Probabilistic Counting Algorithms for Data Base Applications; JCSS, 1985.
Maintain a hash sketch: a BITMAP array of L bits, where L = O(log(M)), initialized to 0.
Assume a hash function h(x) that maps incoming values x ∈ [0, ..., M − 1] uniformly across [0, ..., 2^L − 1].
Let lsb(y) denote the position of the least-significant 1 bit in the binary representation of y.
A value x is mapped to lsb(h(x)).
For each incoming value x, set BITMAP[lsb(h(x))] = 1.

  14. FM Sketches for Distinct Value Estimation

Example:

BITMAP (positions 5 4 3 2 1 0): 0 0 0 0 0 0

x = 5 → h(x) = 101100 → lsb(h(x)) = 2

BITMAP (positions 5 4 3 2 1 0): 0 0 0 1 0 0

  15. FM Sketches for Distinct Value Estimation

By the uniformity of h(x): P(BITMAP[k] = 1) = P(the hash value ends in 1 followed by k zeros) = 1/2^(k+1).
Let R = the position of the rightmost zero in BITMAP; R is an indicator of log(d), where d is the number of distinct values.
Flajolet and Martin [FM85] prove that E[R] = log₂(φd), where φ ≈ 0.7735.
Estimate of the number of distinct values: d ≈ 2^R / φ.
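The bitmap mechanics of the last three slides can be sketched as follows; the hash function here is an arbitrary fixed multiplicative hash chosen for illustration (not the one from the paper), and the class and method names are ours:

```python
class FMSketch:
    """Single-bitmap Flajolet-Martin sketch for approximate distinct counting."""
    PHI = 0.7735

    def __init__(self, L=32):
        self.L = L
        self.bitmap = [0] * L

    def _hash(self, x):
        # Illustrative Knuth-style multiplicative hash, not the paper's h.
        return (2654435761 * x + 12345) % (1 << self.L)

    def _lsb(self, y):
        """Position of the least-significant 1 bit of y."""
        if y == 0:
            return self.L - 1
        pos = 0
        while y & 1 == 0:
            y >>= 1
            pos += 1
        return pos

    def add(self, x):
        self.bitmap[self._lsb(self._hash(x))] = 1

    def estimate(self):
        # R = position of the rightmost (lowest) zero in the bitmap.
        R = next(i for i, b in enumerate(self.bitmap + [0]) if b == 0)
        return round(2 ** R / self.PHI)
```

Duplicates set bits that are already set, so the sketch depends only on the set of distinct values, which is the whole point of the technique.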

  16. Exponential Histograms

Computing statistics in a sliding window over the incoming examples.
Illustrative problem: count the number of 1's in a sliding window over a binary string.
Easy if we can store all the elements inside the window. What if we cannot?

  17. Exponential Histograms

Maintaining Stream Statistics over Sliding Windows; M. Datar, A. Gionis, P. Indyk, R. Motwani; ACM-SIAM Symposium on Discrete Algorithms, 2002.
The basic idea: use buckets of different sizes to hold the data. Each bucket has an associated timestamp, used to decide when the bucket falls out of the window.
Data structures for exponential histograms:
Buckets: counts and timestamps.
LAST: stores the size of the last (oldest) bucket.
TOTAL: keeps the total size of the buckets.
The estimate of the sum of the data elements is proven to be bounded within a user-specified relative error.

  18. Exponential Histograms

Consider a simplified data stream environment where each element comes from the same data source and is either 0 or 1.
When a new data element arrives:
If the new data element is 0, ignore it.
Otherwise, create a new bucket of size 1 with the current timestamp and increment the counter TOTAL.
Given a parameter ε, if there are ⌈1/ε⌉/2 + 2 buckets of the same size, merge the oldest two of these same-size buckets into a single bucket of double the size.
The larger timestamp of the two buckets is used as the timestamp of the newly created bucket.
If the last (oldest) bucket gets merged, update the counter LAST to the size of the merged bucket.

  19. Exponential Histograms

Whenever we want to estimate the moving sum:
Check whether the oldest bucket is still within the sliding window. If not, drop that bucket: subtract its size from TOTAL and set LAST to the size of the new oldest bucket.
Repeat until all buckets with timestamps outside the sliding window have been dropped.
The estimate of the number of 1's in the sliding window is TOTAL − LAST/2.
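The insertion, merge, and estimation rules of the last two slides can be condensed into a small Python sketch. The class and method names are ours; TOTAL and LAST are recomputed on demand rather than kept as counters, and a bucket is expired once its timestamp falls strictly outside the window:

```python
import math

class ExpHistogram:
    """Sketch of the exponential histogram (Datar et al., 2002) for
    counting 1s in a sliding window over a 0/1 stream."""

    def __init__(self, window, eps):
        self.window = window
        # merge once this many buckets share the same size
        self.max_same = math.ceil(1 / eps) // 2 + 2
        self.buckets = []     # [size, timestamp], oldest first
        self.t = 0

    def add(self, bit):
        self.t += 1
        self._expire()
        if bit == 0:
            return            # 0s are ignored
        self.buckets.append([1, self.t])
        size = 1
        # cascade: merging two buckets of size s may overfill size 2s
        while sum(1 for s, _ in self.buckets if s == size) >= self.max_same:
            i = next(k for k, (s, _) in enumerate(self.buckets) if s == size)
            ts = self.buckets[i + 1][1]          # keep the larger timestamp
            self.buckets[i:i + 2] = [[2 * size, ts]]
            size *= 2

    def _expire(self):
        while self.buckets and self.buckets[0][1] < self.t - self.window:
            self.buckets.pop(0)

    def estimate(self):
        self._expire()
        if not self.buckets:
            return 0
        total = sum(s for s, _ in self.buckets)
        # only the oldest bucket may straddle the window boundary
        return total - self.buckets[0][0] // 2
```

On the worked example of the next slide (window 10, ε = 0.5) this sketch reproduces the slide's bucket trace and final estimate of 6 against a true count of 8, within the allowed relative error.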

  20. Exponential Histograms: Analysis

The sizes of the buckets grow exponentially: 2⁰, 2¹, 2², ..., 2^h, so only O(log N) buckets are needed.
For N 1's in the sliding window, O((log N)/ε) buckets suffice to maintain the moving sum.
The estimation error comes from the oldest bucket only.
The moving sum is proven to be bounded within a given relative error ε.

  21. Exponential Histograms: Example

Window length = 10; relative error ε = 0.5; merge whenever there are ⌈1/0.5⌉/2 + 2 = 3 buckets of the same size.

Time:    1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
Element: 1 1 1 1 0 1 0 1 1 1  1  1  1  1  0

Time     Buckets (size_timestamp)       Total  Last
T1       1_1                            1      1
T2       1_1, 1_2                       2      1
T3       1_1, 1_2, 1_3                  3      1
(merge)  2_2, 1_3                       3      2
T4       2_2, 1_3, 1_4                  3      2
...
T11      4_4, 2_8, 2_10, 1_11           9      4
T12      4_4, 2_8, 2_10, 1_11, 1_12    10      4
T13      4_4, 4_10, 2_12, 1_13         11      4
T14      4_4, 4_10, 2_12, 1_13, 1_14   12      4
(T15: the out-of-date bucket 4_4 is removed)
T15      4_10, 2_12, 1_13, 1_14         8      4

Estimate at T15: TOTAL − LAST/2 = 8 − 2 = 6 (the true count is 8, within the relative error of 0.5).

  22. Current Research on Data Streams

Basic stream synopsis computation: samples, equi-depth histograms, wavelets.
Sketch-based computation techniques: self-joins, joins, wavelets, V-optimal histograms.
Advanced techniques: sliding windows, distinct values, hot lists.

  23. Bibliography

Data Streams: Algorithms and Applications; S. Muthukrishnan; 2003.
Stream Data Management; N. Chaudhry, K. Shaw, M. Abdelguerfi; Springer, 2005.
Data Streams and Data Synopses for Massive Data Sets; Yossi Matias; invited talk at ECML/PKDD 2005.
Models and Issues in Data Stream Systems; B. Babcock, S. Babu, M. Datar, R. Motwani, J. Widom; PODS, 2002.
Querying and Mining Data Streams: You Only Get One Look; M. Garofalakis, J. Gehrke, R. Rastogi.
Randomized Algorithms; R. Motwani, P. Raghavan; Cambridge University Press, 1995.
Data Mining: Concepts and Techniques; J. Han, M. Kamber; Morgan Kaufmann, 2006.

  24. Outline

1. Motivation
2. Data Streams
3. Change Detection
4. Clustering Data Streams
5. Predictive Models from Data Streams: Decision Trees, Neural Networks

  25. Introduction

Data flows continuously over time: dynamic environments. Some characteristic properties of the problem can change over time.
Machine learning algorithms commonly assume:
Instances are generated at random according to some probability distribution D.
Instances are independent and identically distributed.
D is stationary.
Examples where these assumptions fail: e-commerce and user modelling, spam email, fraud detection, intrusion detection.

  26. Introduction

Concept drift: the concept about which data is obtained may shift from time to time, each time after some minimum permanence; more generally, any change in the distribution underlying the data.
Context: a set of examples from the data stream over which the underlying distribution is stationary.

  27. The Nature of Change

The causes of change:
Changes due to modifications in the context of learning, caused by changes in hidden variables.
Changes in the characteristic properties of the observed variables.

  28. Change Detection in Predictive Learning

When there is a change in the class distribution of the examples:
The current model no longer corresponds to the actual distribution.
The error rate increases.
Basic idea: monitor the evolution of the error rate.
Main problems: How do we distinguish change from noise? How should we react to drift?

  29. A Framework Based on Statistical Quality Control

Suppose a sequence of examples of the form ⟨xᵢ, yᵢ⟩.
The current decision model classifies each example in the sequence.
Under the 0-1 loss function, predictions are either True or False, so the predictions of the learning algorithm form a sequence: T, F, T, F, T, F, T, T, T, F, ...
The error is a random variable from Bernoulli trials; the binomial distribution gives the general form of the probability of observing an F:

pᵢ = F/i and sᵢ = sqrt(pᵢ(1 − pᵢ)/i), where i is the number of trials.

  30. The P-chart Algorithm

The algorithm maintains two registers, p_min and s_min, such that p_min + s_min = min(pᵢ + sᵢ): the minimum of the error rate, taking into account the variance of the estimator.
At example j, the error of the learning algorithm is:
Out-of-control if pⱼ + sⱼ > p_min + α · s_min
In-control if pⱼ + sⱼ < p_min + β · s_min
Warning level if p_min + α · s_min ≥ pⱼ + sⱼ ≥ p_min + β · s_min
The constants α and β depend on the desired confidence level; admissible values are β = 2 and α = 3.
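The monitoring part of the P-chart reduces to a few lines. A minimal Python sketch; the class and method names are ours, and the reactions described on the next slide (updating, freezing, or re-learning the model) are left to the caller, which acts on the returned status:

```python
import math

class PChart:
    """Monitor a stream of 0/1 prediction errors and flag drift.
    alpha = 3 (out-of-control) and beta = 2 (warning) are the
    admissible constants suggested on the slide."""

    def __init__(self, alpha=3.0, beta=2.0):
        self.alpha, self.beta = alpha, beta
        self.n = 0
        self.errors = 0
        self.p_min = float("inf")
        self.s_min = float("inf")

    def update(self, error):
        """error: 1 if the model mispredicted this example, else 0."""
        self.n += 1
        self.errors += error
        p = self.errors / self.n                 # running error rate p_j
        s = math.sqrt(p * (1 - p) / self.n)      # its standard deviation s_j
        if p + s < self.p_min + self.s_min:      # track min(p_i + s_i)
            self.p_min, self.s_min = p, s
        if p + s > self.p_min + self.alpha * self.s_min:
            return "out-of-control"
        if p + s > self.p_min + self.beta * self.s_min:
            return "warning"
        return "in-control"
```

A stream whose error rate jumps from roughly 10% to a constant 100% drives the monitor from in-control through warning into out-of-control.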

  31. The P-chart Algorithm

At example j, the current decision model classifies the example; compute the error and variance, pⱼ and sⱼ. If the error is:
In-control: the current model is updated; incorporate the example into the decision model.
Warning zone: maintain the current model; the first time the warning level is reached, record the lower limit of the window: L_warning = j.
Out-of-control: re-learn a new model using as training set the examples in [L_warning, j].

  32. Analysis of the P-chart Algorithm

Independent of the learning algorithm.
Resilient to false alarms.
Maintains a single decision model in memory.

  33. Main Characteristics in Change Detection

Data management: characterizes the information about training examples stored in memory.
Detection methods: characterize the techniques and mechanisms for drift detection.
Adaptation methods: adapt the decision model to the current distribution.
Decision model management.

  34. Decision Model Management

Model management characterizes the number of decision models that need to be maintained in memory.
The key issue is the assumption that the data comes from multiple distributions, at least in the transitions between contexts.
Instead of maintaining a single decision model, several authors propose the use of multiple decision models.

  35. Dynamic Weighted Majority

A seminal work is the system presented by Kolter and Maloof (ICDM 2003, ICML 2005).
The Dynamic Weighted Majority (DWM) algorithm is an ensemble method for tracking concept drift:
Maintains an ensemble of base learners.
Predicts using a weighted-majority vote of these experts.
Dynamically creates and deletes experts in response to changes in performance.

  36. Granularity of Decision Models

Occurrences of drift can affect only part of the instance space.
Global models require the reconstruction of the whole decision model (e.g. naive Bayes, SVM).
Granular decision models require the reconstruction of only parts of the decision model (e.g. decision rules, decision trees).

  37. Outline

1. Motivation
2. Data Streams
3. Change Detection
4. Clustering Data Streams
5. Predictive Models from Data Streams: Decision Trees, Neural Networks

  38. Online Divisive-Agglomerative Clustering

Goal: continuously maintain a clustering structure from evolving time-series data streams.
Incremental clustering of streaming time series.
Constructs a hierarchical, tree-shaped structure of clusters using a top-down strategy.
The leaves are the resulting clusters: each leaf groups a set of variables; the union of the leaves is the complete set of variables; the intersection of any two leaves is empty.

  39. Online Divisive-Agglomerative Clustering

Key concept, the diameter of a cluster: the maximum distance between two of its variables.
An incremental system monitors the clusters' diameters:
Performs hierarchical clustering of first-order differences.
Can detect changes in the clustering structure.
Two operators: splitting (expand the structure) and agglomeration (contract the structure).
The splitting and agglomeration criteria are supported by a confidence level given by the Hoeffding bound.

  40. Main Algorithm [Rodrigues, Gama, 2006]

Forever:
  Read the next example.
  Compute the first-order differences.
  For all the clusters, update the sufficient statistics.
  From time to time:
    Check whether to merge clusters.
    Check whether to expand a cluster.

  41. Feeding ODAC

Each example is processed once.
Only the sufficient statistics at the leaves are updated.
Sufficient statistics: a triangular matrix of the correlations between the variables in a leaf, released when a leaf expands into a node.

  42. Similarity Distance

Distance between time series: rnomc(a, b) = sqrt((1 − corr(a, b)) / 2),
where corr(a, b) is the Pearson correlation coefficient:

corr(a, b) = (P − AB/n) / ( sqrt(A2 − A²/n) · sqrt(B2 − B²/n) )

The sufficient statistics needed to compute the correlation are easily updated at each time step:
A = Σaᵢ, B = Σbᵢ, A2 = Σaᵢ², B2 = Σbᵢ², P = Σaᵢbᵢ.
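The incremental bookkeeping behind these formulas can be sketched directly; the class and method names are ours:

```python
import math

class CorrStats:
    """Sufficient statistics for the Pearson correlation between two
    streaming variables, updated in O(1) per observation (ODAC-style)."""

    def __init__(self):
        self.n = 0
        self.A = self.B = self.A2 = self.B2 = self.P = 0.0

    def update(self, a, b):
        self.n += 1
        self.A += a
        self.B += b
        self.A2 += a * a
        self.B2 += b * b
        self.P += a * b

    def corr(self):
        num = self.P - self.A * self.B / self.n
        den = math.sqrt((self.A2 - self.A ** 2 / self.n)
                        * (self.B2 - self.B ** 2 / self.n))
        return num / den

    def rnomc(self):
        """ODAC distance sqrt((1 - corr)/2), which lies in [0, 1]."""
        return math.sqrt((1 - self.corr()) / 2)
```

The distance is 0 for perfectly positively correlated series and 1 for perfectly anti-correlated ones, which is what makes it usable as a cluster diameter.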

  43. Splitting Criteria

When should we expand a leaf?
Let d1 = d(a, b) be the farthest distance between two variables in the leaf, and d2 the second-farthest distance.
Hoeffding bound: split if d1 − d2 > ε, with

ε = sqrt(R² ln(1/δ) / (2n)),

where R is the range of the random variable, δ is a user-set confidence level, and n is the number of observed data points.

  44. Expanding a Leaf

Step 1: Find the pivots x_j, x_k: d(x_j, x_k) > d(a, b) ∀ a, b ≠ j, k.
Step 2: If the splitting criterion applies, generate two new clusters.
Step 3: Each new cluster attracts its nearest variables.

  45. Multiple Time Windows

A multi-window system: each node, including the leaves, receives examples from a different time window.

  46. Change Detection

Time-series concept drift: a change in the distribution generating the observations.
Clustering-analysis concept drift: a change in the way the time series correlate with each other, i.e. a change in the cluster structure.
The splitting criterion guarantees that cluster diameters monotonically decrease.
Assume a cluster c_j with descendants c_k and c_s: if diameter(c_k) − diameter(c_j) > ε or diameter(c_s) − diameter(c_j) > ε, there has been a change in the correlation structure; merge clusters c_k and c_s back into c_j.

  47. Properties of ODAC

For stationary data, the clusters' diameters monotonically decrease.
Constant update time and memory consumption with respect to the number of examples.
Every time a split is reported:
the time to process the next example decreases, and
the space used by the new leaves is less than that used by the parent.

  48. A Snapshot: One Year of Data, 2500 Variables

[Figure: snapshot of the ODAC cluster hierarchy.]

  49. Memory Usage

[Figure: memory usage over time.]

  50. Speed in Processing Time

[Figure: processing speed over time.]

  51. Outline

1. Motivation
2. Data Streams
3. Change Detection
4. Clustering Data Streams
5. Predictive Models from Data Streams: Decision Trees, Neural Networks

  52. Desirable Properties

Processing each example in small constant time, using a fixed amount of main memory.
A single scan of the data, without (or with reduced) revisiting of old records; possibly using a sliding window of the more recent examples.
Processing examples at the speed they arrive.
A classifier available at any time.
Ideally, a model equivalent to the one that would be obtained by a batch data-mining algorithm.
The ability to detect and react to concept drift.

  53. Very Fast Decision Trees

Mining High-Speed Data Streams; P. Domingos, G. Hulten; KDD 2000.
The basic idea: a small sample can often be enough to choose the optimal splitting attribute.
Collect sufficient statistics from a small set of examples.
Estimate the merit of each attribute.
Use the Hoeffding bound to guarantee that the best attribute really is the best: statistical evidence that it is better than the second best.

  54. Very Fast Decision Trees: Main Algorithm

Input: δ, the desired probability level.
Output: T, a decision tree.
Init: T ← empty leaf (root).
While (TRUE):
  Read the next example.
  Propagate the example through the tree from the root to a leaf.
  Update the sufficient statistics at the leaf.
  If the leaf has seen more than N_min examples:
    Evaluate the merit of each attribute.
    Let A1 be the best attribute and A2 the second best.
    Let ε = sqrt(R² ln(1/δ) / (2n)).
    If G(A1) − G(A2) > ε:
      Install a splitting test based on A1.
      Expand the tree with two descendant leaves.
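The split decision at a leaf reduces to a few lines; G here stands for any merit measure (e.g. information gain), and the function names are ours:

```python
import math

def hoeffding_bound(R, delta, n):
    """Epsilon such that the true mean of a range-R variable lies within
    epsilon of the sample mean with probability at least 1 - delta."""
    return math.sqrt(R * R * math.log(1 / delta) / (2 * n))

def should_split(g_best, g_second, R, delta, n):
    """VFDT split rule: split on the best attribute when its merit exceeds
    the runner-up's by more than the Hoeffding epsilon."""
    return g_best - g_second > hoeffding_bound(R, delta, n)
```

After 100 examples with δ = 0.05 and R = 1 the bound is about 0.122, so a merit gap of 0.2 triggers a split while a gap of 0.02 makes the leaf wait for more examples.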

  55. Classification Strategies

Accurate Decision Trees for Mining High-Speed Data Streams; J. Gama, R. Rocha; KDD 2003.
To classify an unlabelled example: the example traverses the tree from the root to a leaf and is classified using the information stored in that leaf.
Two classification strategies:
The standard strategy uses only information about the class distribution: P(Classᵢ).
A more informed strategy uses the sufficient statistics P(xⱼ | Classᵢ) and classifies the example into the class that maximizes P(Cₖ | x). Naive Bayes classifier: P(Cₖ | x) ∝ P(Cₖ) Π P(xⱼ | Cₖ).
VFDT stores sufficient statistics of hundreds of examples in its leaves.
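A toy version of the "more informed" strategy, assuming categorical attributes and using the kind of counts a leaf already stores; Laplace smoothing is our addition to avoid zero probabilities, and all names are ours:

```python
from collections import defaultdict

class NBLeaf:
    """Leaf storing class counts and per-(attribute, value, class) counts,
    classifying with naive Bayes over those sufficient statistics."""

    def __init__(self):
        self.class_count = defaultdict(int)
        self.attr_count = defaultdict(int)   # key: (attr_index, value, class)

    def learn(self, x, y):
        self.class_count[y] += 1
        for j, v in enumerate(x):
            self.attr_count[(j, v, y)] += 1

    def classify(self, x):
        n = sum(self.class_count.values())
        best, best_p = None, -1.0
        for c, nc in self.class_count.items():
            p = nc / n                        # P(C_k)
            for j, v in enumerate(x):
                # Laplace-smoothed P(x_j | C_k), assuming binary attribute values
                p *= (self.attr_count[(j, v, c)] + 1) / (nc + 2)
            if p > best_p:
                best, best_p = c, p
        return best
```

The same counts that the Hoeffding bound needs for split decisions thus double as a naive Bayes model, which is why the informed strategy comes almost for free.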

56. VFDT: Illustrative Evaluation
(figure: evaluation results; not recoverable from the extracted text)

57. VFDT: Analysis
- Low-variance models: stable decisions with statistical support
- Low overfitting: examples are processed only once
- Convergence: the tree induced by VFDT becomes asymptotically close to that of a batch learner. The expected disagreement is δ/p, where p is the probability that an example falls into a leaf.

58. Neural Nets and Data Streams
Multilayer neural networks:
- A general function-approximation method
- A 3-layer ANN can approximate any continuous function with arbitrary precision
- Fast training and prediction: each example is propagated once, and the error is back-propagated once
- No overfitting
- First: predict; second: update the model
- Smoothly adjusts to gradual changes
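The predict-then-update discipline above can be sketched with a single sigmoid unit trained by online gradient descent, a deliberately minimal stand-in for a full multilayer network; the function name and stream layout are illustrative:

```python
import math

def prequential_sgd(stream, dim, lr=0.1):
    """Single pass over the stream: predict on each example first
    (test-then-train error estimate), then apply one gradient step
    of the log loss for a sigmoid unit."""
    w, b, errors = [0.0] * dim, 0.0, 0
    for x, y in stream:                                # y in {0, 1}
        z = b + sum(wi * xi for wi, xi in zip(w, x))
        p = 1.0 / (1.0 + math.exp(-z))                 # sigmoid output
        if (p >= 0.5) != (y == 1):
            errors += 1                                # counted BEFORE the update
        g = p - y                                      # dLoss/dz for log loss
        w = [wi - lr * g * xi for wi, xi in zip(w, x)]
        b -= lr * g
    return w, b, errors
```

Because the error is measured before each update, `errors / len(stream)` is a prequential estimate of the model's generalization error on the stream.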

59. State-of-the-Art in Data Stream Mining (Part II)
Mohamed Medhat Gaber
Tasmanian ICT Centre, CSIRO
Mail: GPO Box 1538, Hobart, TAS 7001, Australia
E-mail: Mohamed.Gaber@csiro.au

60. Outline
- Frequent Pattern Mining in Data Streams
- Time Series Analysis in Data Streams
- Data Stream Mining Systems
- Applications of Mining Data Streams
- Future Directions
  - Open Issues
  - Future Vision
- Resources

61. Outline
- Frequent Pattern Mining in Data Streams
- Time Series Analysis in Data Streams
- Data Stream Mining Systems
- Applications of Mining Data Streams
- Future Directions
  - Open Issues
  - Future Vision
- Resources

62. Frequent Pattern Mining: Definitions
- Frequent pattern mining refers to finding patterns that occur more often than a pre-specified threshold value.
- Patterns refer to items, itemsets, or sequences.
- The threshold refers to the percentage of the pattern's occurrences relative to the total number of transactions. It is termed the Support.

63. Frequent Pattern Mining: Background
- Finding frequent patterns is the first step in the discovery of association rules of the form A → B.
- The Apriori algorithm represents pioneering work on association rule discovery:
  R. Agrawal and R. Srikant, Fast Algorithms for Mining Association Rules, in Proc. of the 20th International Conference on Very Large Data Bases (VLDB), Santiago, Chile, September 1994.
- An important step towards improving the performance of association rule discovery was FP-Growth:
  J. Han, J. Pei, and Y. Yin, Mining Frequent Patterns without Candidate Generation, in Proc. of the 2000 ACM SIGMOD International Conference on Management of Data (SIGMOD'00), Dallas, TX, May 2000.

64. Frequent Pattern Mining: Background (continued)
- Many measures have been proposed for assessing the strength of rules.
- The most frequently used measure is confidence.
- Confidence refers to the probability that set B exists given that A already exists in a transaction.
- Confidence(A → B) = Support(AB) / Support(A)
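The support/confidence definitions above translate directly to code. A minimal sketch over a transaction list (names are illustrative):

```python
def support(transactions, itemset):
    """Fraction of transactions containing every item of the itemset."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= set(t)) / len(transactions)

def confidence(transactions, a, b):
    """Confidence(A -> B) = Support(A union B) / Support(A)."""
    return support(transactions, set(a) | set(b)) / support(transactions, a)
```

For the transactions {milk, bread}, {milk}, {bread}, {milk, bread}: Support(milk) = 0.75, Support(milk, bread) = 0.5, so Confidence(milk → bread) = 0.5 / 0.75 = 2/3.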

65. Frequent Pattern Mining over Data Streams
The process of frequent pattern mining over data streams differs from the conventional one as follows:
- The technique should be linear or sublinear (you have only one look).
- Frequent items (heavy hitters) and itemsets are often the final output.

66. Frequent Pattern Mining over Data Streams: Techniques
Manku and Motwani have two master algorithms in this area:
- Sticky Sampling
- Lossy Counting
G. S. Manku and R. Motwani, Approximate Frequency Counts over Data Streams, in Proceedings of the 28th International Conference on Very Large Data Bases (VLDB), Hong Kong, China, August 2002.

67. Sticky Sampling
- Sticky sampling is a probabilistic technique.
- The user inputs three parameters: support (s), error (ε), and probability of failure (δ).
- A simple data structure is maintained with entries of data elements and their associated frequencies (e, f).
- The sampling rate decreases gradually as the number of processed data elements increases.

68. Sticky Sampling (continued)
- For each incoming element in the data stream, the data structure is checked for an entry.
- If an entry exists, its frequency is incremented.
- Otherwise, the element is sampled at the current sampling rate: if selected, a new entry is added; else the element is ignored.
- With every change in sampling rate, an unbiased coin is tossed for each entry, decrementing the frequency with every unsuccessful toss.
- If the frequency goes down to zero, the entry is released.
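The steps above can be sketched as follows. This is a simplified reading of the Manku and Motwani algorithm, not a faithful reimplementation: per the paper, the first t = (1/ε)·ln(1/(sδ)) elements are sampled at rate 1, the next 2t at rate 2, the next 4t at rate 4, and so on, with the coin-toss thinning applied at each rate change:

```python
import math
import random

def sticky_sampling(stream, s, eps, delta, seed=0):
    """Probabilistic frequency table (element -> estimated count).
    Each estimate undercounts its true frequency, and with probability
    >= 1 - delta no frequency is underestimated by more than eps * N."""
    rng = random.Random(seed)
    t = (1.0 / eps) * math.log(1.0 / (s * delta))
    counts, rate, boundary = {}, 1, 2 * t
    for n, e in enumerate(stream, start=1):
        if n > boundary:                     # sampling rate doubles here
            rate, boundary = 2 * rate, 2 * boundary
            for key in list(counts):         # thin existing entries
                while counts[key] > 0 and rng.random() < 0.5:
                    counts[key] -= 1         # unsuccessful toss: decrement
                if counts[key] == 0:
                    del counts[key]          # frequency hit zero: release
        if e in counts:
            counts[e] += 1                   # tracked elements always count
        elif rng.random() < 1.0 / rate:      # new elements enter by sampling
            counts[e] = 1
    return counts
```

Because tracked elements are always counted, heavy hitters almost surely enter the table early and keep most of their mass through the occasional thinning.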

69. Lossy Counting
- Lossy counting is a deterministic technique.
- The user inputs two parameters: support (s) and error (ε).
- Each entry in the data structure has one more attribute than in sticky sampling: (e, f, Δ), where Δ is the maximum possible error in f.
- The stream is conceptually divided into buckets of width w = 1/ε.
- Each bucket is labelled with a value b = ⌈N/w⌉, where N is the number of elements seen so far; labels start at 1 and increase by 1.

70. Lossy Counting (continued)
- For a new incoming element, the data structure is checked:
- If an entry exists, its frequency is incremented.
- Otherwise, a new entry is added with Δ = b_current − 1, where b_current is the current bucket label.
- When switching to a new bucket, all entries with f + Δ ≤ b_current are deleted.
- Lossy Counting outperforms Sticky Sampling in practice.
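A minimal sketch of these steps (a simplified reading of the Manku and Motwani paper, where pruning at a bucket boundary removes entries with f + Δ ≤ b_current):

```python
import math

def lossy_counting(stream, eps):
    """Deterministic frequency table: element -> (f, max_error).
    The true count of each surviving element lies in [f, f + max_error],
    and every element with true frequency >= eps * N survives."""
    w = int(math.ceil(1.0 / eps))            # bucket width
    counts, bucket = {}, 1
    for n, e in enumerate(stream, start=1):
        if e in counts:
            f, d = counts[e]
            counts[e] = (f + 1, d)
        else:
            counts[e] = (1, bucket - 1)      # max possible undercount so far
        if n % w == 0:                       # bucket boundary: prune
            counts = {k: v for k, v in counts.items() if v[0] + v[1] > bucket}
            bucket += 1
    return counts
```

With ε = 0.1 (w = 10), an element seen once in the first bucket is pruned at the first boundary, while an element occurring in most positions keeps an exact count.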

71. Frequent Itemsets over Data Streams
- Manku and Motwani extended Lossy Counting to find frequent itemsets (Approximate Frequency Counts over Data Streams, VLDB 2002, cited above).
- The technique follows the same steps, with batch processing of transactions according to memory availability.
- All subsets of the stored batch are checked and pruned.
- If the frequency of a new entry is greater than the number of buckets currently in memory, then a new entry is added to the data structure.

72. Outline
- Frequent Pattern Mining in Data Streams
- Time Series Analysis in Data Streams
- Data Stream Mining Systems
- Applications of Mining Data Streams
- Future Directions
  - Open Issues
  - Future Vision
- Resources

73. Introduction to Time Series Analysis
- Time series analysis refers to applying different data analysis techniques to measurements acquired on a temporal basis.
- Data analysis techniques recently applied to time series include clustering, classification, indexing, and association rules.
- The focus of classical time series analysis was on forecasting and pattern identification.

74. Introduction to Time Series Analysis (continued)
- Similarity measures over time series data represent the main step in time series analysis.
- Euclidean distance and dynamic time warping represent the major similarity measures used with time series.
- Longer time series can be computationally hard for the analysis tasks.
- Different time series representations have been proposed to reduce the length of a time series.

75. Time Series Analysis in Data Streams
- When data elements (records) in a data stream are processed based on their temporal dimension, we consider the process time series analysis.
- Time series analysis in data streams differs in two aspects:
  - Several data points are considered to be one entry.
  - The analysis is done in real time, as opposed to traditional time series analysis.

76. Symbolic Approximation (SAX)
- SAX is a fast symbolic approximation of time series:
  J. Lin, E. Keogh, S. Lonardi, and B. Chiu, A Symbolic Representation of Time Series, with Implications for Streaming Algorithms, in Proceedings of the 8th ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery, San Diego, CA, June 13, 2003.
- It allows a time series of length n to be transformed into an approximated time series of an arbitrary length w, where w << n.
- SAX follows three main steps: Piecewise Aggregate Approximation (PAA), symbolic discretization, and distance measurement.
- SAX is generic and can be applied to any time series analysis technique.

77. Piecewise Aggregate Approximation (PAA)
A time series of size n is approximated using PAA to a time series of size w using the following equation:

c̄_i = (w/n) · Σ_{j = (n/w)(i−1) + 1}^{(n/w)·i} c_j

where c̄_i is the i-th element in the approximated time series.
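In code, the PAA equation is just the mean of each of w equal-width frames. A minimal sketch (assumes n is a multiple of w for simplicity; the general case uses fractional frame boundaries):

```python
def paa(series, w):
    """Piecewise Aggregate Approximation: replace each of w equal-width
    frames of the series by its mean (assumes len(series) % w == 0)."""
    k = len(series) // w            # frame width n/w
    return [sum(series[i * k:(i + 1) * k]) / k for i in range(w)]
```

For example, reducing an 8-point series to w = 4 averages each pair of adjacent points.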

78. Symbolic Discretization
- Breakpoints are calculated that produce equal-probability areas under the Gaussian distribution.
- A lookup table can be used.
- According to the output of PAA:
  - If a point is less than the smallest breakpoint, it is denoted "a".
  - Otherwise, if the point is greater than the smallest breakpoint and less than the next larger one, it is denoted "b".
  - etc.
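The breakpoint lookup described above can be sketched with a binary search over the breakpoint table. The breakpoints shown are the (rounded) values that split N(0, 1) into four equiprobable regions for a 4-letter alphabet, assuming the series has been z-normalized first:

```python
import bisect

# Rounded Gaussian breakpoints for a 4-letter alphabet (equiprobable regions).
BREAKPOINTS_4 = [-0.67, 0.0, 0.67]

def sax_word(paa_means, breakpoints=BREAKPOINTS_4, alphabet="abcd"):
    """Map each PAA mean to the letter of the region it falls in:
    below the smallest breakpoint -> 'a', the next region -> 'b', etc."""
    return "".join(alphabet[bisect.bisect_left(breakpoints, v)]
                   for v in paa_means)
```

Larger alphabets simply use longer breakpoint tables from the same Gaussian lookup table.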

79. Distance Measurement
The following distance measure is applied when comparing two different time series:

MINDIST(Q̂, Ĉ) = √(n/w) · √( Σ_{i=1}^{w} dist(q̂_i, ĉ_i)² )

- It returns a lower bound on the distance between the original time series.
- A lookup table is calculated and used to find the distance between every two letters.
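A minimal sketch of MINDIST, assuming a 4-letter alphabet with the rounded Gaussian breakpoints; the per-letter distance follows the SAX convention that equal or adjacent letters are at distance 0:

```python
import math

# Rounded Gaussian breakpoints for a 4-letter alphabet.
BREAKPOINTS = [-0.67, 0.0, 0.67]

def letter_dist(a, b, alphabet="abcd", breakpoints=BREAKPOINTS):
    """Cell of the letter lookup table: 0 for equal or adjacent letters,
    otherwise the gap between the regions' nearest breakpoints."""
    i, j = alphabet.index(a), alphabet.index(b)
    if abs(i - j) <= 1:
        return 0.0
    return breakpoints[max(i, j) - 1] - breakpoints[min(i, j)]

def mindist(q_word, c_word, n):
    """Lower bound on the Euclidean distance between the two original
    series of length n, computed from their SAX words of length w."""
    w = len(q_word)
    cell_sum = sum(letter_dist(a, b) ** 2 for a, b in zip(q_word, c_word))
    return math.sqrt(n / w) * math.sqrt(cell_sum)
```

The lower-bounding property is what lets SAX-based indexes prune candidates without false dismissals.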

80. SAX Applications
SAX has been applied to many data mining techniques, including:
- Clustering (hierarchical and partitioning)
- Classification (nearest neighbour and decision trees)
- Change detection
SAX represents the state of the art in time series data stream analysis due to its generality.

81. Hot SAX
- SAX has been used to discover discords in time series; the technique is termed Hot SAX:
  E. Keogh, J. Lin, and A. Fu, HOT SAX: Efficiently Finding the Most Unusual Time Series Subsequence, in the 5th IEEE International Conference on Data Mining, New Orleans, LA, Nov 27-30, 2005.
- Discords are the time series subsequences that are maximally different from the rest of the time series subsequences.
- It is 3 to 4 times faster than the brute-force technique.
- This makes it a candidate for data streaming applications.

82. Hot SAX (continued)
- The process starts by sliding windows of a fixed size over the whole time series to generate subsequences.
- Each generated subsequence is approximated using SAX.
- The approximated subsequence is then inserted into an array, indexed according to its position in the original time series.
- The number of occurrences of each SAX word is also recorded in the array.

83. Hot SAX (continued)
- The array is then transformed into a trie, where the leaf nodes record the array indices at which each word appears.
- The two data structures (array and trie) complement each other.
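The array of word counts supports the heuristic at the heart of Hot SAX: subsequences whose SAX word is rare are the most promising discord candidates and are examined first. A minimal sketch of that ordering step only (not the full search; `word_fn` is a placeholder for the SAX conversion of a subsequence):

```python
from collections import Counter

def discord_candidates(series, win, word_fn):
    """Slide a fixed-size window over the series, convert each
    subsequence to its word, and return the window positions ordered
    from rarest word (best discord candidate) to most common."""
    words = [word_fn(series[i:i + win])
             for i in range(len(series) - win + 1)]
    freq = Counter(words)
    return sorted(range(len(words)), key=lambda i: freq[words[i]])
```

Visiting candidates in this order lets the discord search terminate early, which is where the reported 3-4x speedup over brute force comes from.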
