
slide-1
SLIDE 1

BRAID: Discovering Lag Correlations in Multiple Streams

Yasushi Sakurai (NTT Cyber Space Labs) Spiros Papadimitriou (Carnegie Mellon Univ.) Christos Faloutsos (Carnegie Mellon Univ.)

slide-2
SLIDE 2

SIGMOD 2005

  • Y. Sakurai et al

2

Motivation

  • Data-stream applications
      ◦ Network analysis
      ◦ Sensor monitoring
      ◦ Financial data analysis
      ◦ Moving object tracking
  • Goal
      ◦ Monitor multiple numerical streams
      ◦ Determine which pairs are correlated with lags
      ◦ Report the value of each such lag (if any)

slide-3
SLIDE 3


Lag Correlations

  • Examples
      ◦ A decrease in interest rates typically precedes an increase in house sales by a few months
      ◦ Higher amounts of fluoride in the drinking water lead to fewer dental cavities, some years later
      ◦ High CPU utilization on server 1 precedes high CPU utilization on server 2 by a few minutes

slide-4
SLIDE 4


Lag Correlations

  • Example of lag-correlated sequences
      ◦ These sequences are correlated with lag l = 1300 time-ticks

[Figure: CCF (Cross-Correlation Function)]

slide-5
SLIDE 5


Lag Correlations

  • Example of lag-correlated sequences
      ◦ Fast (high performance)
      ◦ Nimble (low memory consumption)
      ◦ Accurate (good approximation)

[Figure: CCF (Cross-Correlation Function)]

slide-6
SLIDE 6


Problem #1: PAIR of sequences

  • Given two co-evolving sequences X and Y, determine
      ◦ Whether there is a lag correlation
      ◦ If yes, what is the lag length l
  • Any time, on semi-infinite streams

[Figure: sequences X and Y; answer: yes, l = 1,300]

slide-7
SLIDE 7


Problem #2: k-way

  • Given k numerical sequences, X1,…,Xk, report
      ◦ Which pairs (if any) have a lag correlation
      ◦ The corresponding lag for such pairs
  • Again, ‘any time’, in streaming fashion

[Figure: sequences X1, X2, ..., Xk; answer: X1 and X2, l = 1,300]

slide-8
SLIDE 8


Our solution, BRAID

  • Characteristics:
      ◦ ‘Any-time’ processing, and fast: computation time per time tick is constant
      ◦ Nimble: the memory space requirement is sub-linear in the sequence length
      ◦ Accurate: the approximation introduces only a small error

slide-9
SLIDE 9


Related Work

  • Sequence indexing
      ◦ Agrawal et al. (FODO 1993)
      ◦ Faloutsos et al. (SIGMOD 1994)
      ◦ Keogh et al. (SIGMOD 2001)
  • Compression (wavelet and random projections)
      ◦ Gilbert et al. (VLDB 2001)
      ◦ Guha et al. (VLDB 2004)
      ◦ Dobra et al. (SIGMOD 2002)
      ◦ Ganguly et al. (SIGMOD 2003)

slide-10
SLIDE 10


Related Work

  • Data Stream Management
      ◦ Abadi et al. (VLDB Journal 2003)
      ◦ Motwani et al. (CIDR 2003)
      ◦ Chandrasekaran et al. (CIDR 2003)
      ◦ Cranor et al. (SIGMOD 2003)

slide-11
SLIDE 11


Related Work

  • Pattern discovery
      ◦ Clustering for data streams: Guha et al. (TKDE 2003)
      ◦ Monitoring multiple streams: Zhu et al. (VLDB 2002)
      ◦ Forecasting: Yi et al. (ICDE 2000), Papadimitriou et al. (VLDB 2003)
  • None of the previously published methods focuses on the lag-correlation problem

slide-12
SLIDE 12


Overview

  • Introduction / Related work
  • Background
  • Main ideas
  • Theoretical analysis
  • Experimental results

slide-13
SLIDE 13


Background

  • Lag correlation

[Figure: CCF (Cross-Correlation Function) with the lag correlation marked; labels: positively correlated, un-correlated, +γ, anti-correlated (lower than -γ)]

slide-14
SLIDE 14


Background

  • Definition of ‘score’, the absolute value of R(l):

      $\mathrm{score}(l) = |R(l)|$

      $R(l) = \dfrac{\sum_{t=l+1}^{n} (x_t - \bar{x})(y_{t-l} - \bar{y})}{\sqrt{\sum_{t=l+1}^{n} (x_t - \bar{x})^2}\,\sqrt{\sum_{t=1}^{n-l} (y_t - \bar{y})^2}}$

  • Lag correlation
      ◦ Given a threshold γ, $\mathrm{score}(l) > \gamma$
      ◦ A local maximum
      ◦ The earliest such maximum, if more maxima exist

details

slide-15
SLIDE 15


Overview

  • Introduction / Related work
  • Background
  • Main ideas
  • Theoretical analysis
  • Experimental results

slide-16
SLIDE 16


Why not ‘naive’?

  • Naive solution:
      ◦ Compute the correlation coefficient for each lag l = 0, 1, 2, 3, …, n/2
  • But,
      ◦ O(n) space
      ◦ O(n^2) time, or O(n log n) time w/ FFT

[Figure: correlation vs. lag (up to n/2) for a sequence observed until time t=n]
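
The naive solution can be sketched roughly as follows. This is an illustration only, assuming two equal-length Python lists and the definition of R(l) from the Background slide (x_t paired with y_{t-l}); the function names `correlation_at_lag` and `naive_lag_correlation` are hypothetical and do not come from the slides.

```python
import math


def correlation_at_lag(x, y, lag):
    """Correlation coefficient R(lag): pairs x_t with y_{t-lag}."""
    n = len(x)
    xs, ys = x[lag:], y[:n - lag]
    m = len(xs)
    mean_x, mean_y = sum(xs) / m, sum(ys) / m
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
    var_x = sum((a - mean_x) ** 2 for a in xs)
    var_y = sum((b - mean_y) ** 2 for b in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant segment has no defined correlation
    return cov / math.sqrt(var_x * var_y)


def naive_lag_correlation(x, y, threshold):
    """Earliest lag whose score |R(l)| is a local maximum above threshold."""
    scores = [abs(correlation_at_lag(x, y, lag))
              for lag in range(len(x) // 2 + 1)]
    for lag, score in enumerate(scores):
        # Missing neighbours at the two ends are treated as lower scores.
        left = scores[lag - 1] if lag > 0 else -1.0
        right = scores[lag + 1] if lag + 1 < len(scores) else -1.0
        if score > threshold and score >= left and score >= right:
            return lag
    return None  # no lag correlation found
```

Each call to `correlation_at_lag` scans the whole sequence, so probing all n/2 lags costs O(n^2) time, which is the cost the slide objects to.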

slide-17
SLIDE 17


Main Idea (1)

  • Incremental computing:
      ◦ The correlation coefficient of two sequences is ‘algebraic’ -> can be computed incrementally
  • We need to maintain only 6 ‘sufficient statistics’:
      ◦ Sequence length n
      ◦ Sum of X, square sum of X
      ◦ Sum of Y, square sum of Y
      ◦ Inner-product for X and the shifted Y

slide-18
SLIDE 18


Main Idea (1)

  • Incremental computing:
      ◦ Sequence length n
      ◦ Sum of X: $Sx(1,n) = \sum_{t=1}^{n} x_t$
      ◦ Square sum of X: $Sxx(1,n) = \sum_{t=1}^{n} x_t^2$
      ◦ Inner-product for X and the shifted Y: $Sxy(l) = \sum_{t=l+1}^{n} x_t \, y_{t-l}$
  • Compute R(l) incrementally:

      $R(l) = \dfrac{C(l)}{\sqrt{Vx(l+1,n) \cdot Vy(1,n-l)}}$

      ◦ Covariance of X and Y: $C(l) = Sxy(l) - \dfrac{Sx(l+1,n) \cdot Sy(1,n-l)}{n-l}$
      ◦ Variance of X: $Vx(l+1,n) = Sxx(l+1,n) - \dfrac{(Sx(l+1,n))^2}{n-l}$

details
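
A minimal sketch of the incremental update for one fixed lag, assuming the sufficient statistics listed above. The class name `LagStats` and its fields are hypothetical. Note the buffer of the most recent `lag` values of Y: it is one plausible reason why this incremental variant still needs O(n) space when lags up to n/2 are tracked.

```python
import math
from collections import deque


class LagStats:
    """Sufficient statistics for R(lag) at one fixed lag (hypothetical helper)."""

    def __init__(self, lag):
        self.lag = lag
        self.pending_y = deque()  # recent Y values still waiting for their X partner
        self.m = 0                # number of aligned pairs, i.e. n - lag
        self.sx = self.sxx = 0.0  # Sx(lag+1, n) and Sxx(lag+1, n)
        self.sy = self.syy = 0.0  # Sy(1, n-lag) and Syy(1, n-lag)
        self.sxy = 0.0            # Sxy(lag)

    def update(self, x_t, y_t):
        """Consume one time tick; pairs x_t with y_{t-lag} in O(1) time."""
        self.pending_y.append(y_t)
        if len(self.pending_y) <= self.lag:
            return  # fewer than lag+1 ticks seen, nothing to pair yet
        y_old = self.pending_y.popleft()
        self.m += 1
        self.sx += x_t
        self.sxx += x_t * x_t
        self.sy += y_old
        self.syy += y_old * y_old
        self.sxy += x_t * y_old

    def correlation(self):
        """R(lag) = C / sqrt(Vx * Vy), computed from the running sums."""
        if self.m < 2:
            return 0.0
        c = self.sxy - self.sx * self.sy / self.m
        vx = self.sxx - self.sx ** 2 / self.m
        vy = self.syy - self.sy ** 2 / self.m
        if vx <= 0 or vy <= 0:
            return 0.0
        return c / math.sqrt(vx * vy)
```

Usage: create one `LagStats` per probed lag, call `update` on every tick, and call `correlation` whenever an answer is needed.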

slide-19
SLIDE 19


Main Idea (1)

  • Complexity

              Naive        Naive (incremental)   BRAID
Space         O(n)         O(n)
Comp. time    O(n log n)   O(n)

Better, but not good enough!

slide-20
SLIDE 20


Main Idea (2)

  • Geometric lag probing

[Figure: correlation vs. lag]

slide-21
SLIDE 21


Main Idea (2)

  • Geometric lag probing
  • i.e., compute the correlation coefficient for lag l = 0, 1, 2, 4, ..., 2^h
      ◦ O(log n) estimations

[Figure: correlation vs. lag, probed at lags 1, 2, 4, 8]

slide-22
SLIDE 22


Main Idea (2)

  • Geometric lag probing
  • But, so far, we still need O(n) space because the longest lag is n/2

              Naive        Naive (incremental)   BRAID
Space         O(n)         O(n)
Comp. time    O(n log n)   O(n)                  O(log n)

slide-23
SLIDE 23


Main Idea (3)

  • Sequence smoothing

[Figure: reminder of the naive approach; correlation vs. lag, over the time axis up to t=n]

slide-24
SLIDE 24


Main Idea (3)

  • Sequence smoothing
      ◦ Means of windows for each level
      ◦ Sufficient statistics computed from the means
      ◦ CCF computed from the sufficient statistics
      ◦ But, it allows a partial redundancy

[Figure: correlation vs. lag; level h=0 over the time axis up to t=n]

slide-25
SLIDE 25


Putting it all together:

  • Geometric lag probing + smoothing
      ◦ Use colored windows
      ◦ Keep track of only a geometric progression of the lag values: l = {0, 1, 2, 4, 8, …, 2^h, …}

[Figure: correlation vs. lag; level h=0 over the time axis up to t=n]

slide-26
SLIDE 26


Putting it all together:

  • Geometric lag probing + smoothing
      ◦ Use colored windows
      ◦ Keep track of only a geometric progression of the lag values: l = {0, 1, 2, 4, 8, …, 2^h, …}

[Figure: level h=0 (t=n); windows of X and Y compared at lag l=0]

slide-27
SLIDE 27


Putting it all together:

  • Geometric lag probing + smoothing
      ◦ Use colored windows
      ◦ Keep track of only a geometric progression of the lag values: l = {0, 1, 2, 4, 8, …, 2^h, …}

[Figure: level h=0 (t=n); windows of X and Y compared at lag l=1]

slide-28
SLIDE 28


Putting it all together:

  • Geometric lag probing + smoothing
      ◦ Use colored windows
      ◦ Keep track of only a geometric progression of the lag values: l = {0, 1, 2, 4, 8, …, 2^h, …}

[Figure: level h=1 (t_h = n/2); windows of X and Y compared at lag l=2]

slide-29
SLIDE 29


Putting it all together:

  • Geometric lag probing + smoothing
      ◦ Use colored windows
      ◦ Keep track of only a geometric progression of the lag values: l = {0, 1, 2, 4, 8, …, 2^h, …}

[Figure: level h=2 (t_h = n/4); windows of X and Y compared at lag l=4]

slide-30
SLIDE 30


Putting it all together:

  • Geometric lag probing + smoothing
      ◦ Use colored windows
      ◦ Keep track of only a geometric progression of the lag values: l = {0, 1, 2, 4, 8, …, 2^h, …}

[Figure: level h=3 (t_h = n/8); windows of X and Y compared at lag l=8]

slide-31
SLIDE 31


Putting it all together:

  • Geometric lag probing + smoothing
      ◦ Use colored windows
      ◦ Keep track of only a geometric progression of the lag values: l = {0, 1, 2, 4, 8, …, 2^h, …}
      ◦ Use a cubic spline to interpolate

[Figure: correlation vs. lag; level h=0 over the time axis up to t=n]
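
The combination of smoothing and geometric probing can be illustrated with a small batch sketch. It is not the streaming algorithm itself: it assumes non-overlapping windows of length 2^h at level h (consistent with t_h = n/2^h on the earlier slides), recomputes everything from full lists, and leaves out the cubic-spline interpolation. The names `window_means`, `pearson`, and `probe_correlations` are hypothetical.

```python
import math


def window_means(seq, h):
    """Means of consecutive non-overlapping windows of length 2**h."""
    w = 2 ** h
    return [sum(seq[i:i + w]) / w for i in range(0, len(seq) - w + 1, w)]


def pearson(xs, ys):
    """Plain correlation coefficient of two equal-length lists."""
    m = len(xs)
    if m < 2:
        return 0.0
    mean_x, mean_y = sum(xs) / m, sum(ys) / m
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
    var_x = sum((a - mean_x) ** 2 for a in xs)
    var_y = sum((b - mean_y) ** 2 for b in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def probe_correlations(x, y, max_level):
    """Estimate R(l) at l = 0 and l = 2**h for h = 0..max_level."""
    estimates = {0: pearson(x, y)}
    for h in range(max_level + 1):
        sx, sy = window_means(x, h), window_means(y, h)
        # A lag of 2**h ticks in the raw data is a lag of one window at level h.
        estimates[2 ** h] = pearson(sx[1:], sy[:-1])
    return estimates
```

Because level h holds only about n/2^h window means, long lags are estimated from short smoothed sequences, which is where the space saving comes from.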

slide-32
SLIDE 32


Thus:

  • Complexity

              Naive        Naive (incremental)   BRAID
Space         O(n)         O(n)                  O(log n)
Comp. time    O(n log n)   O(n)                  O(1) *

(*) Computation time: O(log n); amortized time: O(1)

slide-33
SLIDE 33


Overview

  • Introduction / Related work
  • Background
  • Main ideas
      ◦ Enhancing the accuracy
  • Theoretical analysis
  • Experimental results

details

slide-34
SLIDE 34


Enhanced Probing Scheme

  • Q: How to probe more densely than 2^h?

[Figure: correlation vs. lag; level h=0 over the time axis up to t=n]

slide-35
SLIDE 35


Enhanced Probing Scheme

  • Q: How to probe more densely than 2^h?
  • A: Probe in a mixture of geometric and arithmetic progressions

[Figure: correlation vs. lag; level h=0 over the time axis up to t=n]

slide-36
SLIDE 36


Enhanced Probing Scheme

  • Basic scheme: b=1 (one number for each level)
  • Enhanced scheme: b>1
      ◦ Example of b=4
      ◦ Probing the CCF in a mixture of geometric and arithmetic progressions: l = {0, 1, …, 7; 8, 10, 12, 14; 16, 20, 24, 28; 32, 40, …}

[Figure: correlation vs. lag with probing steps 1, 2, 4; level h=0 over the time axis up to t=n]
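
The probed lag values can be generated with a short sketch. The rule used here is inferred from the b=4 example and from the basic progression {0, 1, 2, 4, 8, …}, so it is an assumption rather than the stated definition: level 0 probes lags 0 to 2b-1 with step 1, and each level h ≥ 1 probes b lags starting at b·2^h with step 2^h. The function name `probe_lags` is hypothetical.

```python
def probe_lags(b, max_lag):
    """Lags probed by the (assumed) mixed geometric/arithmetic scheme."""
    # Level 0: every lag from 0 to 2b-1.
    lags = list(range(min(2 * b, max_lag + 1)))
    h = 1
    while b * 2 ** h <= max_lag:
        step = 2 ** h       # arithmetic step inside level h
        start = b * step    # first lag of level h
        for i in range(b):
            lag = start + i * step
            if lag > max_lag:
                break
            lags.append(lag)
        h += 1
    return lags


print(probe_lags(1, 8))   # basic scheme:    [0, 1, 2, 4, 8]
print(probe_lags(4, 40))  # enhanced scheme: [0, 1, ..., 7, 8, 10, 12, 14, 16, 20, 24, 28, 32, 40]
```

With b=1 the rule reduces to the basic geometric progression, and larger b trades more probes per level for a denser sampling of the CCF.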

slide-37
SLIDE 37


Overview

  • Introduction / Related work
  • Background
  • Main ideas
  • Theoretical analysis
  • Experimental results

slide-38
SLIDE 38


Theoretical Analysis - Accuracy

  • Effect of smoothing
      ◦ For sequences with low frequencies, smoothing introduces only a small error
  • Effect of geometric lag probing
      ◦ BRAID introduces no error if lag probing satisfies the sampling theorem (Nyquist’s)

slide-39
SLIDE 39


Theoretical Analysis - Accuracy

  • Effect of geometric lag probing
      ◦ Informally, BRAID introduces no error if lag probing satisfies the sampling theorem (Nyquist’s)
      ◦ Formally: Theorem 2
          BRAID will find the lag correlations perfectly, if
          [Equation garbled in extraction; surviving symbols: l, ≤, b, 2, f_R]
          f_R: the Nyquist frequency of the CCF, f_R = min(f_x, f_y)
          f_x, f_y: the Nyquist frequencies of X and Y

details

slide-40
SLIDE 40


Theoretical Analysis - Complexity

  • Naive solution
      ◦ O(n) space
      ◦ O(n) time per time tick
  • BRAID
      ◦ O(log n) space
      ◦ O(1) time for updating sufficient statistics
      ◦ O(log n) time for interpolating (when output is required)

details

slide-41
SLIDE 41


Overview

  • Introduction / Related work
  • Background
  • Main ideas
  • Theoretical analysis
  • Experimental results

slide-42
SLIDE 42


Experimental results

  • Setup
      ◦ Intel Xeon 2.8GHz, 1GB memory, Linux
      ◦ Datasets:
          Synthetic: Sines, SpikeTrains
          Real: Humidity, Light, Temperature, Kursk, Sunspots
      ◦ Enhanced BRAID, b=16

slide-43
SLIDE 43


Experimental results

  • Evaluation
      ◦ Accuracy for CCF
      ◦ Accuracy for the lag estimation
      ◦ Computation time
      ◦ k-way lag correlations

slide-44
SLIDE 44


Accuracy for CCF (1)

  • Sines
      ◦ BRAID perfectly estimates the correlation coefficients of the sinusoidal wave

[Figure: CCF (Cross-Correlation Function)]

slide-45
SLIDE 45


Accuracy for CCF (2)

  • SpikeTrains
      ◦ BRAID closely estimates the correlation coefficients

[Figure: CCF (Cross-Correlation Function)]

slide-46
SLIDE 46


Accuracy for CCF (3)

  • Humidity (Real data)
      ◦ BRAID closely estimates the correlation coefficients

[Figure: CCF (Cross-Correlation Function)]

slide-47
SLIDE 47


Accuracy for CCF (4)

  • Light (Real data)
      ◦ BRAID closely estimates the correlation coefficients

[Figure: CCF (Cross-Correlation Function)]

slide-48
SLIDE 48


Accuracy for CCF (5)

  • Kursk (Real data)
      ◦ BRAID closely estimates the correlation coefficients

[Figure: CCF (Cross-Correlation Function)]

slide-49
SLIDE 49


Accuracy for CCF (6)

  • Sunspots (Real data)
      ◦ BRAID closely estimates the correlation coefficients

[Figure: CCF (Cross-Correlation Function)]

slide-50
SLIDE 50


Experimental results

  • Evaluation
      ◦ Accuracy for CCF
      ◦ Accuracy for the lag estimation
      ◦ Computation time
      ◦ k-way lag correlations

slide-51
SLIDE 51


Estimation Error of Lag Correlations

  • Largest relative error is about 1%

Datasets       Lag correlation       Estimation error (%)
               Naive      BRAID
Sines            716        716      0.000
SpikeTrains     2841       2830      0.387
Humidity        3842       3855      0.338
Light            567        570      0.529
Kursk           1463       1472      0.615
Sunspots        1156       1168      1.038
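
The error column is consistent with the relative difference between the two lag estimates, taking the naive result as the reference. A small check, with the hypothetical helper name `relative_error_percent` and the values copied from the table:

```python
# (naive lag, BRAID lag) per dataset, as reported in the table above.
results = {
    "Sines": (716, 716),
    "SpikeTrains": (2841, 2830),
    "Humidity": (3842, 3855),
    "Light": (567, 570),
    "Kursk": (1463, 1472),
    "Sunspots": (1156, 1168),
}


def relative_error_percent(naive_lag, braid_lag):
    """Relative error of the BRAID lag with respect to the naive lag, in percent."""
    return abs(braid_lag - naive_lag) / naive_lag * 100


for name, (naive_lag, braid_lag) in results.items():
    print(f"{name}: {relative_error_percent(naive_lag, braid_lag):.3f}")
```

For example, SpikeTrains gives |2830 - 2841| / 2841 ≈ 0.387%.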

slide-52
SLIDE 52


Experimental results

  • Evaluation
      ◦ Accuracy for CCF
      ◦ Accuracy for the lag estimation
      ◦ Computation time
      ◦ k-way lag correlations

slide-53
SLIDE 53


Computation time

  • BRAID reduces computation time dramatically
  • Up to 40,000 times faster

slide-54
SLIDE 54


Experimental results

  • Evaluation
      ◦ Accuracy for CCF
      ◦ Accuracy for the lag estimation
      ◦ Computation time
      ◦ k-way lag correlations

slide-55
SLIDE 55


Group Lag Correlations

  • 55 Temperature sequences
  • Two correlated pairs

[Figures: sequences #16, #19, #47, #48; estimation of CCF of #16 and #19; estimation of CCF of #47 and #48]

slide-56
SLIDE 56


Conclusions

  • Automatic lag correlation detection on data streams
      1. ‘Any-time’
      2. Nimble
          ◦ O(log n) space, O(1) time to update the statistics
      3. Fast
          ◦ Up to 40,000 times faster than the naive implementation
      4. Accurate
          ◦ Relative error of about 1% or less

slide-58
SLIDE 58


Effect of Probing

  • Dataset: Sines
  • Lag correlation with b=1
  • l_R = 1024

slide-59
SLIDE 59


Effect of Probing

  • Dataset: Light
  • Lag correlation with b=1
  • l_R = 630