QUERYING AND MINING QUERYING AND MINING DATA STREAMS Elena - PowerPoint PPT Presentation

Advanced School on Data Exchange, Integration, and Streams - Dagstuhl, November 2010 QUERYING AND MINING QUERYING AND MINING DATA STREAMS Elena Ikonomovska Jožef Stefan Institute – Department of Knowledge Technologies

Outline  Definitions  Datastream models  Similarity measures  Historical background  Foundations  Estimating the L 2 distance  Estimating the Jaccard similarity: Min-Wise Hashing  Key applications  Maintaining statistics on streams  Hot items  Some advanced results (Appendix)  Estimating rarity and similarity (the windowed model)  Tight bounds for approximate histograms and cluster ‐ based summaries Advanced School on Data Exchange, Integration, and Streams - Dagstuhl, November 2010

Data stream models: Time series model  A stream is a vector / point in space / p p  Items are arriving in order of their indices:   { , , ,...} { , , ,...} x x x x x x x x 1 1 2 2 3 3 … coordinates of the vector 1 2 3 4 x 1 x 2 x 3 x 4  The value of the i-th item is the value of the i-th coordinate of the vector  Distance (similarity) between two streams is the distance between the two points Advanced School on Data Exchange, Integration, and Streams - Dagstuhl, November 2010

Data stream models: Turnstile model  Each arriving item is an update to some component of g p p the vector: 1 2 3 4 1 2 3 4 (2, 4) ⇒ 10 5 24 12 10 9 24 12 (2, x 2 (2 x (5) ) (5) ) indicates the 5 th update to the 2 nd indicates the 5 -th update to the 2 -nd component of the vector (2) + x i  value: x i = x i (1) + x i (3) … i i i i  positive or negative update  only nonnegative updates ⇒ cash register model Advanced School on Data Exchange, Integration, and Streams - Dagstuhl, November 2010

L p distances (p ≥ 0 ) (p ) p  Stream 1 {x 1 ,x 2 ,x 3 ,…} & stream 2 {y 1 ,y 2 ,y 3 ,…} in {1,…,m} { 1 } { 1 } { } 2 3 2 3 p -y i p | 1/p L p = Σ i |x i  L 0 distance (Hamming distance) ⇔ the number of  L 0 distance (Hamming distance) the number of indices i such that x i ≠ y i  A measure of dis(similarity) of two streams [CDI02]  L ∞ = max i |x i - y i | 2 -y i 2 | 1/2 distance  L 2 = Σ i |x i 2 )- for approximating self-join sizes  L 2 norm (f 2 [AGM’99] Q = COUNT(R A R) |dom(A)| = m Advanced School on Data Exchange, Integration, and Streams - Dagstuhl, November 2010

Basic requirements q  Naïve approach: store the points/vectors in memory  Naïve approach: store the points/vectors in memory and compute any distance/similarity measure or a statistic (norm, frequency moment) ( , q y )  Typically:  Large quantities of data – single pass g q g p  Memory is constrained – O(log m)  Real-time answers – linear time algorithms O(n) g ( )  Allowed approximate answers ( ε , δ )  ε & δ are user-specified parameters p p Advanced School on Data Exchange, Integration, and Streams - Dagstuhl, November 2010

Historical background g  [AMS’96] approximate F 2 (inserts only)  [AMS 96] approximate F 2 (inserts only)  [AGM’99] approximate L 2 norm (inserts and deletes)  [FKS’99] approximate L 1 distance [ ] pp 1  [Indyk’00] approximate L p distance for p � (0,2]  p-stable distributions (Caushy is 1-stable, Gaussian is 2-stable )  [CDI’02] efficient approximation of L 0 distance  Approximate distances on windowed streams  [DGI’02] approximate L p distance  [Datar-Muthukrishnan’02] approximate Jaccard similarity Advanced School on Data Exchange, Integration, and Streams - Dagstuhl, November 2010

Estimating the L 2 distance [AGM’99] g [ ] 2  Data streams (x 1 , x 2 …, x n ) and (y 1 , y 2 … y n )  For each i = 1, 2, …n define a i.i.d. random variable X i P[X i = 1] = P[X i = -1] = 1/2 � E[X i ]=0  Base idea: Simply maintain Σ i=1,..,n X i ( x i - y i )  For some i, j and items (i, x i (j) ), (i, y i (j) ) : (j) is added and X i  X i · x i (j) is subtracted · y i E[( Σ i=1,..,n X i (x i -y i )) 2 ] = 1 0 E[ Σ i=1 n X i i ( 2 (x i -y i ) 2 + Σ i ≠ j X i X j (x i -y i )(x j -y j ) ] = i y i ) j ( i y i )( j y j ) ] [ i=1,..,n i ≠ j i Σ i=1,..,n (x i -y i ) 2  The problem amounts to obtaining an unbiased estimate  The problem amounts to obtaining an unbiased estimate Advanced School on Data Exchange, Integration, and Streams - Dagstuhl, November 2010

Standard boosting technique g q  Run the algorithm in parallel k= θ (1/ ε 2 ) times  Run the algorithm in parallel k θ (1/ ε ) times Maintain sums Σ i=1,..,n X i ( x i - y i ) for k different random 1. assignments for the random var. X i,k i,k Take the average of their squares for a given run r 2. ⇒ v (r) (reduce the variance/error!) Chebyshev Repeat the procedure l = θ (log(1/ δ )) times X i,k,l 3. Output the median over {v (1) ,v (2) ,…,v (l) } Chernoff 4. Maintains nkl values in parallel for the random 5. variables Advanced School on Data Exchange, Integration, and Streams - Dagstuhl, November 2010

Result The Chebyshev inequality + Chernoff: The Chebyshev inequality Chernoff: ⇒ this estimates the square of L 2 within (1± ε ) factor with probability > (1 - δ ) p y ( )  Random variables needed: nkl !  The random variables can be four-wise independent p  This is enough so that Chebyshev still holds [AMS’96]  pseudorandomly generated on the fly  O(kl) = O(1/ ε 2 log(1/ δ )) words + a logarithmic-length array of seeds O(log m) Advanced School on Data Exchange, Integration, and Streams - Dagstuhl, November 2010

Estimating the L p distance g p  p -stable distributions [I’00] p [ ] D is a p-stable distribution if:  For all real numbers a 1 , a 2 , …, a k If X 1 , X 2 ,…,X k are i.i.d. random var. drawn from D ⇒ Σ a X has the same distribution as X( Σ | a | p ) 1/p ⇒ Σ a i X i has the same distribution as X( Σ i | a i | p ) 1/p for random variable X with distribution D  Cauchy distribution is 1-stable L 1  Gaussian distribution is 2-stable L 2 Advanced School on Data Exchange, Integration, and Streams - Dagstuhl, November 2010

The algorithm g z 1 , z 2 ,…z is the stream vector z 1 , z 2 ,…z n is the stream vector  Again… run in parallel k= θ (1/ ε 2 log(1/ δ )) procedures & maintain sums Σ i z i X i for each run procedures & maintain sums Σ i z i X i for each run 1,…k  The value of Σ i z i X i in the l -th run is Z ( l ) e va ue o e u s i i i  Z (l) is a random variable itself  Let D is p -stable: e s p s ab e: Z (l) = X (l) ( Σ i | z i | p ) 1/p for some random variable X (l) drawn from D for some random variable X drawn from D Advanced School on Data Exchange, Integration, and Streams - Dagstuhl, November 2010

Estimating the L p distance cont. g p  The output is: p (1/ γ ) median {|Z (1) |, |Z (2) |,…, |Z (k) |}  where γ is the median of |X|, for X random variable distributed according to D D  Chebyshev : This estimate is within a multiplicative factor (1 ± ε ) of the true norm with probability (1- δ ) ( ) p y ( )  Observation [CDI’02]:  L p is a good approximation of the L 0 norm for p sufficiently small ll  p= ε /log(m) where m is the maximum absolute value of any item in the stream Advanced School on Data Exchange, Integration, and Streams - Dagstuhl, November 2010

The Jaccard similarity S A ={a 1 ,a 2 ,..a n } S B ={b 1 ,b 2 ,…,b n }  Let A (and B) denote the set of distinct elements |A ∩ B|/|AUB| = Jaccard similarity  Example: (view sets as columns) m=6 A B item 1 0 1 |AUB|=5 item 2 1 0 1 1 1 1 simJ(A,B) = 2/5 = 0.4 simJ(A,B) 2/5 0.4 0 0 1 1 item 6 item 6 0 0 1 1 Advanced School on Data Exchange, Integration, and Streams - Dagstuhl, November 2010

Signature idea g  Represent the sets A and B by signatures Sig(A) and  Represent the sets A and B by signatures Sig(A) and Sig(B)  Compute the similarity over the signatures p y g  E[simH(Sig(A),Sig(B))]=simJ(A,B)  Simplest approach S p pp  Sample the sets (rows) uniformly at random k times to get k-bit signature Sig ( instead of m bits )  Problems!  Sparsity – sampling might miss important information Advanced School on Data Exchange, Integration, and Streams - Dagstuhl, November 2010

Tool: Min-Wise Hashing  π ‐ randomly chosen permutation over {1,…,m} y p { , , }  For any subset A ⊆ [m] the min-hash of A is:  h π (A) = min i ∊ A { π (i)} π ( ) i ∊ A { ( )}  Index of the first row with value 1  random permutation of the rows  One bit of the k-bit signature of A, Sig(A) O bi f h k bi i f A Si (A)  When π is chosen uniformly at random from the set of all permutations on [m] for any two subsets A B of all permutations on [m] for any two subsets A,B of [m] then: Pr[h (A) = h (B)] = |A ∩ B|/|AUB| Pr[h π (A) h π (B)] |A ∩ B|/|AUB| Advanced School on Data Exchange, Integration, and Streams - Dagstuhl, November 2010

Example p  Consider the following permutations: for m=5  1 = (1 2 3 4 5) k=1  2 = (5 4 3 2 1) k=2  3 = (3 4 5 1 2) k=3  And the sets: A = {1,3,4} B = {1,2,5} The min-hash values are as follows: h  1 (A) = 1 h  1 (B) = 1 k=1 h  2 (A) = 4 h  2 (B) = 5 k=2 h  (A) = 3 h  3 (A) = 3 h  (B) = 5 h  3 (B) = 5 k=3 k=3 � the expectation of the fraction of permutations where min- hash values agree is simJ(A,B) Advanced School on Data Exchange, Integration, and Streams - Dagstuhl, November 2010

QUERYING AND MINING QUERYING AND MINING DATA STREAMS Elena - PowerPoint PPT Presentation

Advanced School on Data Exchange, Integration, and Streams - Dagstuhl, November 2010 QUERYING AND MINING QUERYING AND MINING DATA STREAMS Elena Ikonomovska Joef Stefan Institute Department of Knowledge Technologies Outline

Querying and Mining Data Streams: Querying and Mining Data Streams: You Only Get One Look You

WITH C++ Prof. Amr Goneid AUC Part 9. Streams & Files Prof. amr Goneid, AUC 1 Streams

Data Streams Many large sources of data are generated as streams of updates: IP Network

Data Streams Many large sources of data are generated as streams of updates: IP Network

Stream Algorithmics Albert Bifet March 2012 Data Streams Big Data & Real Time Data Streams

Web Mining Web Mining Web Mining Web Mining Web mining is the use of data mining techniques

Environmental Health Science Data Streams Data Streams Health Data Health Data Brian S.

Stream Bank Stabilization in Open Space Streams in open space There are approximately 35

CSE 143 Streams as C++ Classes Streams are C++ classes Streams have lots of built-in

Introduction What is data mining? to Data Mining: On what kind of data? Data Mining

Comparing Data Streams Using Hamming Norms Graham Cormode, Mayur Datar, Piotr Indyk, S.

A P A P A Proposal for Publishing Data A Proposal for Publishing Data l f l f P bli hi P bli

The problem Combining querying of XML data with ontology queries Example XML document

Mining Data Streams (Part 1) CS345a: Data Mining Jure Leskovec and Anand Rajaraman Stanford

Streams and File I/O Fundamentals of Computer Science Outline Overview of Streams and File

Introduction to data stream querying and mining Georges HEBRAIL Workshop Franco-Brasileiro sobre

STAT 339 Evaluating a Classifier 3 February 2017 Colin Reimer Dawson Questions/Administrative

Honors Combinatorics CMSC-27410 = Math-28410 CMSC-37200 Instructor: Laszlo Babai University

Estimating the Physical Distance between Two Locations with Wi-Fi Received Signal Strength

Mixing Time Def: Total variation distance of distributions and on the same countable set

Parallel Homotopy Algorithms to Solve Polynomial Systems Jan Verschelde Department of Math, Stat

Consensus measures generated by weighted Kemeny distances on linear orders e Luis GARC David

Connecting the dots with common sense and linear models L eon Bottou NEC Labs America COS

The Second-Moment Method Will Perkins January 28, 2013 Markovs Inequality Recall Markovs

QUERYING AND MINING QUERYING AND MINING DATA STREAMS Elena - PowerPoint PPT Presentation

Advanced School on Data Exchange, Integration, and Streams - Dagstuhl, November 2010 QUERYING AND MINING QUERYING AND MINING DATA STREAMS Elena Ikonomovska Joef Stefan Institute Department of Knowledge Technologies Outline

Querying and Mining Data Streams: Querying and Mining Data Streams: You Only Get One Look You

WITH C++ Prof. Amr Goneid AUC Part 9. Streams &amp; Files Prof. amr Goneid, AUC 1 Streams

Data Streams Many large sources of data are generated as streams of updates: IP Network

Data Streams Many large sources of data are generated as streams of updates: IP Network

Stream Algorithmics Albert Bifet March 2012 Data Streams Big Data &amp; Real Time Data Streams

Web Mining Web Mining Web Mining Web Mining Web mining is the use of data mining techniques

Environmental Health Science Data Streams Data Streams Health Data Health Data Brian S.

Stream Bank Stabilization in Open Space Streams in open space There are approximately 35

CSE 143 Streams as C++ Classes Streams are C++ classes Streams have lots of built-in

Introduction What is data mining? to Data Mining: On what kind of data? Data Mining

Comparing Data Streams Using Hamming Norms Graham Cormode, Mayur Datar, Piotr Indyk, S.

A P A P A Proposal for Publishing Data A Proposal for Publishing Data l f l f P bli hi P bli

The problem Combining querying of XML data with ontology queries Example XML document

Mining Data Streams (Part 1) CS345a: Data Mining Jure Leskovec and Anand Rajaraman Stanford

Streams and File I/O Fundamentals of Computer Science Outline Overview of Streams and File

Introduction to data stream querying and mining Georges HEBRAIL Workshop Franco-Brasileiro sobre

STAT 339 Evaluating a Classifier 3 February 2017 Colin Reimer Dawson Questions/Administrative

Honors Combinatorics CMSC-27410 = Math-28410 CMSC-37200 Instructor: Laszlo Babai University

Estimating the Physical Distance between Two Locations with Wi-Fi Received Signal Strength

Mixing Time Def: Total variation distance of distributions and on the same countable set

Parallel Homotopy Algorithms to Solve Polynomial Systems Jan Verschelde Department of Math, Stat

Consensus measures generated by weighted Kemeny distances on linear orders e Luis GARC David

Connecting the dots with common sense and linear models L eon Bottou NEC Labs America COS

The Second-Moment Method Will Perkins January 28, 2013 Markovs Inequality Recall Markovs

WITH C++ Prof. Amr Goneid AUC Part 9. Streams & Files Prof. amr Goneid, AUC 1 Streams

Stream Algorithmics Albert Bifet March 2012 Data Streams Big Data & Real Time Data Streams