
BRAID: Discovering Lag Correlations in Multiple Streams



  1. BRAID: Discovering Lag Correlations in Multiple Streams
     - Yasushi Sakurai (NTT Cyber Space Labs)
     - Spiros Papadimitriou (Carnegie Mellon Univ.)
     - Christos Faloutsos (Carnegie Mellon Univ.)

  2. Motivation
     - Data-stream applications
       - Network analysis
       - Sensor monitoring
       - Financial data analysis
       - Moving object tracking
     - Goal
       - Monitor multiple numerical streams
       - Determine which pairs are correlated with lags
       - Report the value of each such lag (if any)
     SIGMOD 2005, Y. Sakurai et al.

  3. Lag Correlations
     - Examples
       - A decrease in interest rates typically precedes an increase in house sales by a few months
       - Higher amounts of fluoride in the drinking water lead to fewer dental cavities, some years later
       - High CPU utilization on server 1 precedes high CPU utilization on server 2 by a few minutes

  4. Lag Correlations
     - Example of lag-correlated sequences
       - These sequences are correlated with lag l = 1300 time-ticks
     [Figure: two lag-correlated sequences and their CCF (Cross-Correlation Function)]

  5. Lag Correlations
     - Example of lag-correlated sequences
       - Fast (high performance)
       - Nimble (low memory consumption)
       - Accurate (good approximation)
     [Figure: CCF (Cross-Correlation Function)]

  6. Problem #1: PAIR of sequences
     - Given two co-evolving sequences X and Y, determine
       - Whether there is a lag correlation
       - If yes, what the lag length l is
     - Any time, on semi-infinite streams
     [Figure: input sequences X and Y; example answer "yes; l = 1,300"]

  7. Problem #2: k-way
     - Given k numerical sequences X_1, …, X_k, report
       - Which pairs (if any) have a lag correlation
       - The corresponding lag for such pairs
     - Again, 'any time', in a streaming fashion
     [Figure: input sequences X_1, X_2, …, X_k; example answer "X_1 and X_2; l = 1,300"]

  8. Our solution, BRAID
     - Characteristics:
       - 'Any-time' processing, and fast: computation time per time tick is constant
       - Nimble: memory space requirement is sub-linear in the sequence length
       - Accurate: the approximation introduces only a small error

  9. Related Work
     - Sequence indexing
       - Agrawal et al. (FODO 1993)
       - Faloutsos et al. (SIGMOD 1994)
       - Keogh et al. (SIGMOD 2001)
     - Compression (wavelet and random projections)
       - Gilbert et al. (VLDB 2001)
       - Guha et al. (VLDB 2004)
       - Dobra et al. (SIGMOD 2002)
       - Ganguly et al. (SIGMOD 2003)

  10. Related Work
      - Data Stream Management
        - Abadi et al. (VLDB Journal 2003)
        - Motwani et al. (CIDR 2003)
        - Chandrasekaran et al. (CIDR 2003)
        - Cranor et al. (SIGMOD 2003)

  11. Related Work
      - Pattern discovery
        - Clustering for data streams: Guha et al. (TKDE 2003)
        - Monitoring multiple streams: Zhu et al. (VLDB 2002)
        - Forecasting: Yi et al. (ICDE 2000), Papadimitriou et al. (VLDB 2003)
      - None of the previously published methods focuses on this problem

  12. Overview
      - Introduction / Related work
      - Background
      - Main ideas
      - Theoretical analysis
      - Experimental results

  13. Background
      - Lag correlation
      [Figure: CCF (Cross-Correlation Function), correlation vs. lag, with regions labelled positively correlated (+γ), un-correlated, and anti-correlated (lower than −γ)]

  14. Background details
      - Definition of 'score': the absolute value of R(l), i.e. score(l) = |R(l)|, where

        $$R(l) = \frac{\sum_{t=l+1}^{n} (x_t - \bar{x})(y_{t-l} - \bar{y})}{\sqrt{\sum_{t=l+1}^{n} (x_t - \bar{x})^2}\,\sqrt{\sum_{t=1}^{n-l} (y_t - \bar{y})^2}}$$

      - Lag correlation
        - Given a threshold γ, score(l) > γ
        - A local maximum
        - The earliest such maximum, if more maxima exist
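
The definition on slide 14 can be sketched directly in Python. This is a minimal illustration, not the authors' code: the function names `lag_corr` and `earliest_lag` are hypothetical, and the means are assumed to be taken over the overlapping portions of the two sequences (x from l+1 to n, y from 1 to n−l).

```python
import math

def lag_corr(x, y, lag):
    """R(lag): pair x_t with y_{t-lag}, i.e. x[lag:] against y[:n-lag]."""
    n = len(x)
    xs, ys = x[lag:], y[:n - lag]
    m = len(xs)
    mx, my = sum(xs) / m, sum(ys) / m  # assumption: means over the overlap
    num = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    den = math.sqrt(sum((a - mx) ** 2 for a in xs) *
                    sum((b - my) ** 2 for b in ys))
    return num / den if den > 0 else 0.0

def earliest_lag(x, y, gamma, max_lag):
    """Earliest lag whose score |R| is a local maximum above gamma, else None."""
    scores = [abs(lag_corr(x, y, l)) for l in range(max_lag + 1)]
    for l, s in enumerate(scores):
        left = scores[l - 1] if l > 0 else -1.0
        right = scores[l + 1] if l + 1 < len(scores) else -1.0
        if s > gamma and s >= left and s >= right:
            return l
    return None
```

Scanning every lag this way is the naive approach; it touches the whole history for each lag, which is the cost the later slides set out to avoid.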

  15. Overview
      - Introduction / Related work
      - Background
      - Main ideas
      - Theoretical analysis
      - Experimental results

  16. Why not 'naive'?
      - Naive solution:
        - Compute the correlation coefficient for each lag l = 0, 1, 2, 3, …, n/2
      - But:
        - O(n) space
        - O(n^2) time, or O(n log n) time with FFT
      [Figure: correlation vs. lag (0 to n/2), with the full sequence kept up to time t = n]

  17. Main Idea (1)
      - Incremental computing:
        - The correlation coefficient of two sequences is 'algebraic', so it can be computed incrementally
      - We need to maintain only 6 'sufficient statistics':
        - Sequence length n
        - Sum of X, square sum of X
        - Sum of Y, square sum of Y
        - Inner-product for X and the shifted Y

  18. Main Idea (1) details
      - Incremental computing:
        - Sequence length: n
        - Sum of X: $Sx(1,n) = \sum_{t=1}^{n} x_t$
        - Square sum of X: $Sxx(1,n) = \sum_{t=1}^{n} x_t^2$
        - Inner-product for X and the shifted Y: $Sxy(l) = \sum_{t=l+1}^{n} x_t\, y_{t-l}$
      - Compute R(l) incrementally: $R(l) = \dfrac{C(l)}{\sqrt{Vx(l+1,n) \cdot Vy(1,n-l)}}$
        - Covariance of X and Y: $C(l) = Sxy(l) - \dfrac{Sx(l+1,n) \cdot Sy(1,n-l)}{n-l}$
        - Variance of X: $Vx(l+1,n) = Sxx(l+1,n) - \dfrac{(Sx(l+1,n))^2}{n-l}$
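
As a rough sketch of the incremental computation on slides 17 and 18, the class below keeps the six sufficient statistics for one fixed lag and updates them in constant time per tick. The class name `LagStats` and its layout are assumptions made for illustration. Note that it still buffers the last `lag` values of Y, which is why this step alone does not reduce the space requirement.

```python
import math
from collections import deque

class LagStats:
    """Sufficient statistics for R(lag), updated one (x_t, y_t) pair at a time."""

    def __init__(self, lag):
        self.lag = lag
        self.m = 0  # number of overlapping pairs, n - lag
        self.sx = self.sxx = self.sy = self.syy = self.sxy = 0.0
        self.ybuf = deque()  # the most recent y values not yet paired with an x

    def update(self, x_t, y_t):
        self.ybuf.append(y_t)
        if len(self.ybuf) > self.lag:
            y_old = self.ybuf.popleft()  # this is y_{t-lag}
            self.m += 1
            self.sx += x_t
            self.sxx += x_t * x_t
            self.sy += y_old
            self.syy += y_old * y_old
            self.sxy += x_t * y_old

    def correlation(self):
        if self.m < 2:
            return 0.0
        c = self.sxy - self.sx * self.sy / self.m   # C(l)
        vx = self.sxx - self.sx ** 2 / self.m       # Vx(l+1, n)
        vy = self.syy - self.sy ** 2 / self.m       # Vy(1, n-l)
        d = vx * vy
        return c / math.sqrt(d) if d > 0 else 0.0
```

Calling `update` once per time tick and `correlation` whenever an answer is needed gives the 'any-time' behaviour for a single lag.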

  19. Main Idea (1)
      - Complexity

                     Naive        Naive (incremental)   BRAID
        Space        O(n)         O(n)
        Comp. time   O(n log n)   O(n)

      - Better, but not good enough!

  20. Main Idea (2)
      - Geometric lag probing
      [Figure: correlation vs. lag]

  21. Main Idea (2)
      - Geometric lag probing
      - I.e., compute the correlation coefficient only for lags l = 0, 1, 2, 4, …, 2^h
      - O(log n) estimations
      [Figure: correlation vs. lag, probed at lags 0, 1, 2, 4, 8]
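
The set of probed lags is easy to enumerate. The helper below is a small illustration (the name `geometric_lags` and the `max_lag` parameter are hypothetical); it shows that the number of probes grows only logarithmically with the longest lag.

```python
def geometric_lags(max_lag):
    """Lags 0, 1, 2, 4, ..., 2^h that do not exceed max_lag."""
    lags = [0]
    l = 1
    while l <= max_lag:
        lags.append(l)
        l *= 2
    return lags

print(geometric_lags(8))          # [0, 1, 2, 4, 8]
print(len(geometric_lags(1000)))  # 11 probes instead of 1001
```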

  22. Main Idea (2)
      - Geometric lag probing

                     Naive        Naive (incremental)   BRAID
        Space        O(n)         O(n)
        Comp. time   O(n log n)   O(n)                  O(log n)

      - But, so far, we still need O(n) space because the longest lag is n/2

  23. Main Idea (3)
      - Sequence smoothing
      [Figure: reminder of the naive approach; correlation vs. lag, with the full sequence kept up to time t = n]

  24. Main Idea (3)
      - Sequence smoothing
        - Means of windows for each level
        - Sufficient statistics computed from the means
        - CCF computed from the sufficient statistics
        - But, it allows a partial redundancy
      [Figure: correlation vs. lag and level vs. time; level h = 0, t = n]
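
One way to picture the per-level window means is the streaming sketch below. It assumes non-overlapping windows of length 2^h at level h, built by averaging pairs of means from the level below; the slide does not spell out the window layout, so treat this and the class name `Smoother` as assumptions rather than the method from the talk.

```python
class Smoother:
    """Streaming window means: level h averages 2^h consecutive values."""

    def __init__(self):
        self.pending = []  # pending[h]: an unpaired level-h mean, or None

    def push(self, value):
        """Add one value; return the (level, mean) pairs completed by this tick."""
        out = [(0, value)]
        h, v = 0, value
        while True:
            if h == len(self.pending):
                self.pending.append(None)
            if self.pending[h] is None:
                self.pending[h] = v  # wait for the partner window
                break
            v = (self.pending[h] + v) / 2  # two level-h means form one level-(h+1) mean
            self.pending[h] = None
            h += 1
            out.append((h, v))
        return out
```

Only one pending mean is stored per level, so the state is logarithmic in the stream length, and each new value completes on average a constant number of windows.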

  25. Putting it all together:
      - Geometric lag probing + smoothing
        - Use colored windows
        - Keep track of only a geometric progression of the lag values: l = {0, 1, 2, 4, 8, …, 2^h, …}
      [Figure: correlation vs. lag and level vs. time; level h = 0, t = n]

  26. Putting it all together (animation step of slide 25)
      [Figure: windows of X and Y at level h = 0, lag l = 0, t = n]

  27. Putting it all together (animation step of slide 25)
      [Figure: windows of X and Y at level h = 0, lag l = 1, t = n]

  28. Putting it all together (animation step of slide 25)
      [Figure: windows of X and Y at level h = 1, lag l = 2, t_h = n/2]

  29. Putting it all together (animation step of slide 25)
      [Figure: windows of X and Y at level h = 2, lag l = 4, t_h = n/4]

  30. Putting it all together (animation step of slide 25)
      [Figure: windows of X and Y at level h = 3, lag l = 8, t_h = n/8]

  31. Putting it all together:
      - Geometric lag probing + smoothing
        - Use colored windows
        - Keep track of only a geometric progression of the lag values: l = {0, 1, 2, 4, 8, …, 2^h, …}
        - Use a cubic spline to interpolate
      [Figure: correlation vs. lag and level vs. time; level h = 0, t = n]
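
The interpolation step can be illustrated with a small pure-Python natural cubic spline fitted through the correlation values at the probed lags. The slide only says "cubic spline"; the natural boundary conditions (zero second derivative at both ends) and the function name `natural_spline` are assumptions chosen for this sketch.

```python
from bisect import bisect_right

def natural_spline(xs, ys):
    """Return a function evaluating the natural cubic spline through (xs, ys)."""
    n = len(xs) - 1
    h = [xs[i + 1] - xs[i] for i in range(n)]
    m = [0.0] * (n + 1)  # second derivatives; natural ends keep m[0] = m[n] = 0
    if n >= 2:
        diag = [2.0 * (h[k] + h[k + 1]) for k in range(n - 1)]
        rhs = [6.0 * ((ys[k + 2] - ys[k + 1]) / h[k + 1]
                      - (ys[k + 1] - ys[k]) / h[k]) for k in range(n - 1)]
        for k in range(1, n - 1):        # forward elimination (tridiagonal solve)
            w = h[k] / diag[k - 1]
            diag[k] -= w * h[k]
            rhs[k] -= w * rhs[k - 1]
        for k in range(n - 2, -1, -1):   # back substitution
            m[k + 1] = (rhs[k] - h[k + 1] * m[k + 2]) / diag[k]

    def evaluate(x):
        i = min(max(bisect_right(xs, x) - 1, 0), n - 1)
        a, b = xs[i + 1] - x, x - xs[i]
        return ((m[i] * a ** 3 + m[i + 1] * b ** 3) / (6.0 * h[i])
                + (ys[i] / h[i] - m[i] * h[i] / 6.0) * a
                + (ys[i + 1] / h[i] - m[i + 1] * h[i] / 6.0) * b)

    return evaluate
```

With the probed lags as `xs` and the estimated correlations as `ys`, evaluating the returned function at every integer lag gives an approximate CCF whose maximum can be used as the lag estimate.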

  32. Thus:
      - Complexity

                     Naive        Naive (incremental)   BRAID
        Space        O(n)         O(n)                  O(log n)
        Comp. time   O(n log n)   O(n)                  O(1) (*)

        (*) Computation time: O(log n); and actually, amortized time: O(1)

  33. Overview details
      - Introduction / Related work
      - Background
      - Main ideas
        - Enhancing the accuracy
      - Theoretical analysis
      - Experimental results

  34. Enhanced Probing Scheme
      - Q: How to probe more densely than 2^h?
      [Figure: correlation vs. lag and level vs. time; level h = 0, t = n]

  35. Enhanced Probing Scheme
      - Q: How to probe more densely than 2^h?
      - A: Probe in a mixture of geometric and arithmetic progressions
      [Figure: correlation vs. lag and level vs. time; level h = 0, t = n]

  36. Enhanced Probing Scheme
      - Basic scheme: b = 1 (one number for each level)
      - Enhanced scheme: b > 1
        - Example of b = 4
        - Probing the CCF in a mixture of geometric and arithmetic progressions: l = {0, 1, …, 7; 8, 10, 12, 14; 16, 20, 24, 28; 32, 40, …}
      [Figure: correlation vs. lag with probe steps 1, 2, and 4, and level vs. time; level h = 0, t = n]
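
The b = 4 example suggests a simple rule: level 0 probes lags 0 to 2b−1 with step 1, and each level h ≥ 1 probes b lags starting at 2^h·b with step 2^h. That rule is inferred from the example rather than stated on the slide, and the function name `probe_lags` is hypothetical.

```python
def probe_lags(b, max_lag):
    """Lags probed by the enhanced scheme with b probes per level (rule inferred)."""
    lags = list(range(0, min(2 * b, max_lag + 1)))  # level 0: step 1
    h = 1
    while (2 ** h) * b <= max_lag:
        for j in range(b):                          # level h: step 2^h
            l = (2 ** h) * (b + j)
            if l <= max_lag:
                lags.append(l)
        h += 1
    return lags

print(probe_lags(4, 40))
# [0, 1, 2, 3, 4, 5, 6, 7, 8, 10, 12, 14, 16, 20, 24, 28, 32, 40]
print(probe_lags(1, 8))
# [0, 1, 2, 4, 8]
```

With b = 1 the rule falls back to the basic geometric progression, which matches the description of the basic scheme.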

  37. Overview
      - Introduction / Related work
      - Background
      - Main ideas
      - Theoretical analysis
      - Experimental results
