QUERYING AND MINING DATA STREAMS
Elena Ikonomovska - PowerPoint PPT Presentation


  1. Advanced School on Data Exchange, Integration, and Streams - Dagstuhl, November 2010
     QUERYING AND MINING DATA STREAMS
     Elena Ikonomovska, Jožef Stefan Institute, Department of Knowledge Technologies

  2. Outline
     - Definitions
       - Data stream models
       - Similarity measures
     - Historical background
     - Foundations
       - Estimating the L_2 distance
       - Estimating the Jaccard similarity: Min-Wise Hashing
     - Key applications
       - Maintaining statistics on streams
       - Hot items
     - Some advanced results (Appendix)
       - Estimating rarity and similarity (the windowed model)
       - Tight bounds for approximate histograms and cluster-based summaries

  3. Data stream models: Time series model
     - A stream is a vector / point in space
     - Items arrive in order of their indices: the values x_1, x_2, x_3, x_4, ... are the coordinates of the vector at positions 1, 2, 3, 4, ...
     - The value of the i-th item is the value of the i-th coordinate of the vector
     - The distance (similarity) between two streams is the distance between the two points

  4. Data stream models: Turnstile model
     - Each arriving item is an update to some component of the vector, e.g. the update (2, 4) turns the vector (10, 5, 24, 12) into (10, 9, 24, 12)
     - (2, x_2^(5)) indicates the 5th update to the 2nd component of the vector
     - Value: x_i = x_i^(1) + x_i^(2) + x_i^(3) + ...
     - Updates may be positive or negative; only nonnegative updates ⇒ cash register model
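A minimal Python sketch of turnstile updates (illustrative: the function name is an assumption, and a real streaming algorithm would never materialize the vector x; it is stored here only to show what the updates mean):

def apply_updates(n, updates):
    x = [0] * n                  # the implicit vector x_1, ..., x_n
    for i, delta in updates:     # each arriving item updates one component
        x[i - 1] += delta        # turnstile model: delta may be negative
        # cash register model: would additionally require delta >= 0
    return x

# The slide's example: building (10, 5, 24, 12), then the update (2, 4)
print(apply_updates(4, [(1, 10), (2, 5), (3, 24), (4, 12), (2, 4)]))
# -> [10, 9, 24, 12]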

  5. L_p distances (p ≥ 0)
     - Stream 1 = {x_1, x_2, x_3, ...} and stream 2 = {y_1, y_2, y_3, ...}, with values in {1, ..., m}
     - L_p = (Σ_i |x_i - y_i|^p)^(1/p)
     - L_0 distance (Hamming distance) ⇔ the number of indices i such that x_i ≠ y_i; a measure of (dis)similarity of two streams [CDI02]
     - L_∞ = max_i |x_i - y_i|
     - L_2 distance = (Σ_i |x_i - y_i|^2)^(1/2)
     - L_2 norm (F_2) for approximating self-join sizes [AGM'99]: Q = COUNT(R ⋈_A R), with |dom(A)| = m
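For reference, the exact (non-streaming) L_p distances can be computed directly; this baseline function is a hypothetical helper, not something from the slides:

import math

def lp_distance(x, y, p):
    if p == 0:                                    # Hamming: count differing indices
        return sum(1 for a, b in zip(x, y) if a != b)
    if math.isinf(p):                             # L_inf: largest coordinate-wise gap
        return max(abs(a - b) for a, b in zip(x, y))
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)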

  6. Basic requirements
     - Naïve approach: store the points/vectors in memory and compute any distance/similarity measure or statistic (norm, frequency moment)
     - Typically:
       - Large quantities of data: a single pass
       - Memory is constrained: O(log m)
       - Real-time answers: linear-time algorithms, O(n)
       - Approximate answers are allowed: (ε, δ)
       - ε and δ are user-specified parameters

  7. Historical background
     - [AMS'96] approximate F_2 (inserts only)
     - [AGM'99] approximate L_2 norm (inserts and deletes)
     - [FKS'99] approximate L_1 distance
     - [Indyk'00] approximate L_p distance for p ∈ (0, 2]
       - p-stable distributions (Cauchy is 1-stable, Gaussian is 2-stable)
     - [CDI'02] efficient approximation of the L_0 distance
     - Approximate distances on windowed streams:
       - [DGI'02] approximate L_p distance
       - [Datar-Muthukrishnan'02] approximate Jaccard similarity

  8. Estimating the L_2 distance [AGM'99]
     - Data streams (x_1, x_2, ..., x_n) and (y_1, y_2, ..., y_n)
     - For each i = 1, 2, ..., n define an i.i.d. random variable X_i with P[X_i = 1] = P[X_i = -1] = 1/2, so E[X_i] = 0
     - Basic idea: simply maintain Z = Σ_{i=1..n} X_i (x_i - y_i)
     - For items (i, x_i^(j)) and (i, y_i^(j)): X_i · x_i^(j) is added and X_i · y_i^(j) is subtracted
     - E[(Σ_{i=1..n} X_i (x_i - y_i))^2] = E[Σ_{i=1..n} X_i^2 (x_i - y_i)^2 + Σ_{i≠j} X_i X_j (x_i - y_i)(x_j - y_j)] = Σ_{i=1..n} (x_i - y_i)^2, since X_i^2 = 1 and E[X_i X_j] = 0 for i ≠ j
     - The problem amounts to obtaining an unbiased estimate
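A minimal sketch of this estimator, using fully independent random signs for clarity (the slides later relax this to four-wise independence); the class and method names are assumptions:

import random

class L2Sketch:
    def __init__(self, n, seed=0):
        rng = random.Random(seed)
        self.sign = [rng.choice((-1, 1)) for _ in range(n)]  # X_1, ..., X_n
        self.z = 0.0                                         # Z = sum_i X_i (x_i - y_i)

    def update(self, i, value, which):
        # which = +1 for an item of stream x, -1 for an item of stream y
        self.z += self.sign[i - 1] * which * value

    def estimate(self):
        return self.z ** 2       # E[Z^2] = sum_i (x_i - y_i)^2

sk = L2Sketch(4)
for i, v in enumerate([10, 5, 24, 12], start=1):
    sk.update(i, v, +1)          # stream x
for i, v in enumerate([10, 9, 24, 12], start=1):
    sk.update(i, v, -1)          # stream y
print(sk.estimate())             # one sample of the L_2^2 estimate (here exactly 16)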

  9. Standard boosting technique
     - Run the algorithm in parallel k = Θ(1/ε^2) times (a code sketch of the scheme follows below):
       1. Maintain the sums Σ_{i=1..n} X_{i,k} (x_i - y_i) for k different random assignments of the random variables X_{i,k}
       2. Take the average of their squares for a given run r ⇒ v^(r) (reduces the variance/error: Chebyshev)
       3. Repeat the procedure l = Θ(log(1/δ)) times, with random variables X_{i,k,l}
       4. Output the median over {v^(1), v^(2), ..., v^(l)} (Chernoff)
       5. This maintains nkl values in parallel for the random variables
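A hedged sketch of the whole wrapper (a median of means); the function name, the exact constants, and the use of full randomness are illustrative assumptions:

import math, random, statistics

def boosted_l2_squared(x, y, eps=0.2, delta=0.05, seed=0):
    rng = random.Random(seed)
    n = len(x)
    k = max(1, int(1 / eps ** 2))              # averaged copies: variance down (Chebyshev)
    l = max(1, int(math.log(1 / delta)) + 1)   # median over groups: confidence up (Chernoff)
    group_means = []
    for _ in range(l):
        squares = []
        for _ in range(k):
            sign = [rng.choice((-1, 1)) for _ in range(n)]
            z = sum(s * (a - b) for s, a, b in zip(sign, x, y))
            squares.append(z * z)              # each Z^2 is unbiased for L_2^2
        group_means.append(sum(squares) / k)   # v^(r)
    return statistics.median(group_means)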

  10. Result
      - The Chebyshev inequality + Chernoff bound ⇒ this estimates the square of L_2 within a (1 ± ε) factor with probability > (1 - δ)
      - Random variables needed: nkl!
      - The random variables can be four-wise independent
        - This is enough so that Chebyshev still holds [AMS'96]
        - They can be pseudorandomly generated on the fly (see the sketch below)
      - O(kl) = O(1/ε^2 · log(1/δ)) words + a logarithmic-length array of seeds, O(log m)
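The slides do not fix a construction for the pseudorandom generation; one standard choice, shown here as an assumption, is a random degree-3 polynomial over a prime field, which yields four-wise independent values from a short, O(log m)-size seed:

import random

P = 2_147_483_647           # a Mersenne prime larger than any index we hash

class FourWiseSigns:
    def __init__(self, seed=0):
        rng = random.Random(seed)
        self.coeffs = [rng.randrange(P) for _ in range(4)]   # the short seed

    def sign(self, i):
        # Horner evaluation of a + b*i + c*i^2 + d*i^3 mod P; the values are
        # four-wise independent, and the low bit gives a (nearly unbiased,
        # since P is odd but huge) +/-1 variable X_i, generated on the fly.
        h = 0
        for c in reversed(self.coeffs):
            h = (h * i + c) % P
        return 1 if h & 1 else -1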

  11. Estimating the L_p distance
      - p-stable distributions [I'00]: D is a p-stable distribution if, for all real numbers a_1, a_2, ..., a_k and i.i.d. random variables X_1, X_2, ..., X_k drawn from D, Σ_i a_i X_i has the same distribution as (Σ_i |a_i|^p)^(1/p) · X, for a random variable X with distribution D
      - The Cauchy distribution is 1-stable ⇒ L_1
      - The Gaussian distribution is 2-stable ⇒ L_2
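An illustrative empirical check of 1-stability (not from the slides): a weighted sum of standard Cauchy variables should be distributed like (Σ_i |a_i|) · X, so with a = (3, -1, 2) the median of |Σ_i a_i X_i| should approach 6, the median of |6 · X|:

import math, random, statistics

rng = random.Random(0)

def cauchy():
    return math.tan(math.pi * (rng.random() - 0.5))   # a standard Cauchy sample

a = [3.0, -1.0, 2.0]                                  # sum of |a_i| = 6
samples = [sum(ai * cauchy() for ai in a) for _ in range(100_000)]
print(statistics.median(abs(s) for s in samples))     # close to 6.0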

  12. The algorithm
      - z_1, z_2, ..., z_n is the stream vector
      - Again, run k = Θ(1/ε^2 · log(1/δ)) procedures in parallel and maintain the sums Σ_i z_i X_i for each run 1, ..., k
      - The value of Σ_i z_i X_i in the l-th run is Z^(l)
      - Z^(l) is a random variable itself
      - Let D be p-stable: Z^(l) = X^(l) · (Σ_i |z_i|^p)^(1/p) for some random variable X^(l) drawn from D

  13. Estimating the L_p distance cont.
      - The output is: (1/γ) · median{|Z^(1)|, |Z^(2)|, ..., |Z^(k)|}
        - where γ is the median of |X|, for X a random variable distributed according to D
      - Chebyshev: this estimate is within a multiplicative factor (1 ± ε) of the true norm with probability (1 - δ)
      - Observation [CDI'02]:
        - L_p is a good approximation of the L_0 norm for p sufficiently small
        - p = ε/log(m), where m is the maximum absolute value of any item in the stream
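A sketch of the full estimator for p = 1, where D is the Cauchy distribution and γ = median|X| = 1; the dense loop over the vector is a simplification for clarity (a true streaming version would regenerate each X_i^(l) pseudorandomly per update):

import math, random, statistics

def l1_norm_estimate(z, k=400, seed=0):
    rng = random.Random(seed)
    sums = [0.0] * k                                      # Z^(1), ..., Z^(k)
    for zi in z:
        for l in range(k):
            x = math.tan(math.pi * (rng.random() - 0.5))  # Cauchy X_i^(l)
            sums[l] += zi * x                             # Z^(l) = sum_i z_i X_i^(l)
    # Each Z^(l) is distributed like (sum_i |z_i|) * Cauchy, so the median of
    # |Z^(l)|, divided by gamma = 1, estimates the L_1 norm.
    return statistics.median(abs(s) for s in sums)

print(l1_norm_estimate([3, -1, 2, 0, 5]))   # true L_1 norm is 11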

  14. The Jaccard similarity
      - S_A = {a_1, a_2, ..., a_n}, S_B = {b_1, b_2, ..., b_n}
      - Let A (and B) denote the set of distinct elements
      - |A ∩ B| / |A ∪ B| = Jaccard similarity
      - Example (view sets as columns), m = 6:

            item    A   B
            item 1  0   1
            item 2  1   0
            item 3  1   1
            item 4  1   1
            item 5  0   0
            item 6  0   1

        |A ∪ B| = 5 and |A ∩ B| = 2, so simJ(A, B) = 2/5 = 0.4
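The example's similarity, computed directly from the columns above (a plain sanity check; the set encodings follow the reconstructed table):

def jaccard(A, B):
    A, B = set(A), set(B)
    return len(A & B) / len(A | B)

print(jaccard({2, 3, 4}, {1, 3, 4, 6}))   # 2/5 = 0.4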

  15. Signature idea
      - Represent the sets A and B by signatures Sig(A) and Sig(B)
      - Compute the similarity over the signatures
      - E[simH(Sig(A), Sig(B))] = simJ(A, B)
      - Simplest approach:
        - Sample the sets (rows) uniformly at random k times to get a k-bit signature Sig (instead of m bits)
      - Problems!
        - Sparsity: sampling might miss important information

  16. Tool: Min-Wise Hashing
      - π: a randomly chosen permutation over {1, ..., m}
      - For any subset A ⊆ [m], the min-hash of A is: h_π(A) = min_{i ∈ A} {π(i)}
        - the index of the first row with value 1 in a random permutation of the rows
        - one bit of the k-bit signature Sig(A) of A
      - When π is chosen uniformly at random from the set of all permutations on [m], then for any two subsets A, B of [m]: Pr[h_π(A) = h_π(B)] = |A ∩ B| / |A ∪ B|
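A minimal min-wise hashing sketch; the k permutations are materialized explicitly, which is only feasible for toy m (practical systems replace them with approximately min-wise independent hash functions):

import random

def minhash_signature(A, perms):
    # One min-hash per permutation: the smallest pi(i) over i in A
    return [min(pi[i] for i in A) for pi in perms]

def estimate_jaccard(sig_a, sig_b):
    # Fraction of signature coordinates on which A and B agree
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

m, k = 6, 200
rng = random.Random(0)
perms = []
for _ in range(k):
    order = list(range(1, m + 1))
    rng.shuffle(order)
    perms.append({i: rank for rank, i in enumerate(order, start=1)})  # pi(i) = rank

A, B = {2, 3, 4}, {1, 3, 4, 6}
print(estimate_jaccard(minhash_signature(A, perms),
                       minhash_signature(B, perms)))   # close to 0.4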

  17. Example
      - Consider the following permutations for m = 5 (each π_j lists the rows in their permuted order, so the min-hash of a set is the first of its rows in that order):
        - π_1 = (1 2 3 4 5)   k = 1
        - π_2 = (5 4 3 2 1)   k = 2
        - π_3 = (3 4 5 1 2)   k = 3
      - And the sets: A = {1, 3, 4}, B = {1, 2, 5}
      - The min-hash values are as follows:
        - h_π1(A) = 1, h_π1(B) = 1   (k = 1)
        - h_π2(A) = 4, h_π2(B) = 5   (k = 2)
        - h_π3(A) = 3, h_π3(B) = 5   (k = 3)
      ⇒ the expected fraction of permutations on which the min-hash values agree is simJ(A, B)
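The slide's numbers, reproduced in code under the same reading (each permutation is the order in which rows are scanned, and the min-hash is the first row of the set in that order):

def minhash_first(order, S):
    return next(i for i in order if i in S)

A, B = {1, 3, 4}, {1, 2, 5}
for order in [(1, 2, 3, 4, 5), (5, 4, 3, 2, 1), (3, 4, 5, 1, 2)]:
    print(minhash_first(order, A), minhash_first(order, B))
# -> (1, 1), (4, 5), (3, 5): agreement on 1 of the 3 permutations; with many
# random permutations the agreement rate converges to simJ(A, B) = 1/5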
