Ranked Subsequence Matching in Time-Series Databases Wook-Shin Han - - PowerPoint PPT Presentation



slide-1
SLIDE 1

1

Ranked Subsequence Matching in Time-Series Databases

Wook-Shin Han (Kyungpook National University, Korea) Jinsoo Lee (Kyungpook National University, Korea) Yang-Sae Moon (Kangwon National University, Korea) Haifeng Jiang (Google Inc., USA)

slide-2
SLIDE 2

2

Contents

  • Introduction
  • Overview of DTW and Existing Lower Bounds
  • Basic Ranked Subsequence Matching Algorithms
  • Minimum-Distance Matching-Window Pair (MDMWP) and mdmwp-Distance Based Pruning
  • Deferred Group Subsequence Retrieval
  • Performance Evaluation
  • Conclusions

slide-3
SLIDE 3

3

Time-Series Databases [AFS93, FRM94, MWL01]

Time-series data

  • Sequences of values sampled at a fixed time interval
  • Examples: music data, stock prices, and network traffic data

Time-series databases

  • Data sequence: time-series data stored in a database
  • Query sequence: time-series data given by a user for similarity search

slide-4
SLIDE 4

4

Similarity Metric

Similarity is measured as the distance between a data sequence and a given query sequence
We use the dynamic time warping (DTW) distance [BC96, SC78]

  • One of the most robust similarity measures
  • Widely used in applications such as query by humming [ZS03], image searching [BCP05], and speech recognition [RJ93]

slide-5
SLIDE 5

5

Motivation

Ranked subsequence matching under DTW

  • Finds the top-k subsequences of the data sequences that are most similar to a query sequence under DTW

All existing methods were developed only for whole matching or for range subsequence matching

slide-6
SLIDE 6

6

Contributions

  • Propose the first approach for ranked subsequence matching
  • Propose the concept of the minimum-distance matching-window pair (MDMWP) and pruning with the mdmwp-distance
  • Propose deferred group subsequence retrieval along with another lower bound, the window-group distance
  • Show the efficiency of the proposed methods using many real and synthetic datasets

slide-7
SLIDE 7

7

Review of DTW

[Figure: warping path restricted by the warping width (Sakoe-Chiba band)]
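
A minimal sketch of DTW restricted by a warping width, assuming equal-length sequences and an L_p cost with p = 2 by default. The function name `dtw_band` and its parameters are illustrative and do not come from the slides.

```python
import math


def dtw_band(q, s, rho, p=2):
    """DTW distance between equal-length sequences q and s.

    Warping paths are restricted to cells with |i - j| <= rho,
    i.e. a Sakoe-Chiba band of warping width rho (assumed form).
    """
    n = len(q)
    if len(s) != n:
        raise ValueError("this sketch assumes equal-length sequences")
    inf = math.inf
    # cost[i][j]: cheapest accumulated |q - s|^p aligning q[:i] with s[:j]
    cost = [[inf] * (n + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        # only cells inside the band are filled; the rest stay infinite
        for j in range(max(1, i - rho), min(n, i + rho) + 1):
            d = abs(q[i - 1] - s[j - 1]) ** p
            cost[i][j] = d + min(cost[i - 1][j],      # repeat s[j-1]
                                 cost[i][j - 1],      # repeat q[i-1]
                                 cost[i - 1][j - 1])  # advance both
    return cost[n][n] ** (1.0 / p)
```

With `rho = 0` the band collapses to the diagonal and the result is the plain L_p distance; a wider band can only lower the distance.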

slide-8
SLIDE 8

8

Query Envelope [Keo02, ZS03]

[Figure: query sequence Q with its upper envelope U and lower envelope L]
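
A minimal sketch of how a query envelope can be computed, assuming the usual definition in which U[i] and L[i] are the maximum and minimum of Q within the warping width ρ around position i. The name `envelope` is illustrative.

```python
def envelope(q, rho):
    """Return (upper, lower) envelopes of q for warping width rho.

    upper[i] = max(q[i - rho : i + rho + 1]) and lower[i] is the matching
    minimum, with the window clipped at both ends of the sequence.
    """
    n = len(q)
    upper = [max(q[max(0, i - rho): i + rho + 1]) for i in range(n)]
    lower = [min(q[max(0, i - rho): i + rho + 1]) for i in range(n)]
    return upper, lower
```

For example, `envelope([1, 3, 2, 5, 4], 1)` yields the upper envelope `[3, 3, 5, 5, 5]` and the lower envelope `[1, 1, 2, 2, 4]`.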

slide-9
SLIDE 9

9

LB_Keogh [Keo02]

  • Distance between a query envelope E(Q) and a data sequence S
  • Lower bounding distance under DTW at the sequence level

[Figure: query Q with its envelope and data sequence S]
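
A minimal sketch of LB_Keogh, assuming the envelope (U, L) has already been computed and an L_p cost with p = 2 by default. Only the parts of S that fall outside the envelope contribute to the distance. The function name and signature are illustrative.

```python
def lb_keogh(upper, lower, s, p=2):
    """Distance between an envelope (upper, lower) and a data sequence s.

    Points of s inside the envelope contribute nothing; points above or
    below it contribute their gap to the nearest envelope boundary.
    """
    total = 0.0
    for u, l, x in zip(upper, lower, s):
        if x > u:
            total += (x - u) ** p
        elif x < l:
            total += (l - x) ** p
    return total ** (1.0 / p)
```

A sequence that lies entirely within the envelope has distance 0, which is why the bound is cheap to use as a first filter before computing DTW.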

slide-10
SLIDE 10

10

Piecewise Aggregate Approximation (PAA) [YF00, Keo02]

Dimension reduction: N dimensions → f dimensions

[Figure: a sequence S and its PAA representation PAA(S)]
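
A minimal sketch of PAA, assuming the sequence length N is a multiple of the reduced dimension f so that every segment has the same length. The name `paa` is illustrative.

```python
def paa(seq, f):
    """Reduce seq from N values to f values by averaging equal segments."""
    n = len(seq)
    if f <= 0 or n % f != 0:
        raise ValueError("this sketch assumes len(seq) is a multiple of f")
    seg = n // f  # number of original values per PAA segment
    return [sum(seq[k * seg:(k + 1) * seg]) / seg for k in range(f)]
```

For example, `paa([1, 3, 2, 4, 6, 8], 3)` averages the pairs (1, 3), (2, 4), and (6, 8).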

slide-11
SLIDE 11

11

PAA(ENV(Q))

[Figure: query Q with PAA(U) and PAA(L), the PAA of its upper and lower envelopes]

slide-12
SLIDE 12

12

LB_PAA [ZS03]

  • Distance between the PAA of the query envelope P(E(Q)) and the PAA of the data sequence P(S)
  • Lower bounding distance under DTW at the index level

[Figure: PAA of the query envelope and PAA of data sequence S]
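
A minimal sketch of LB_PAA, assuming the PAA of the envelope and of the data sequence are already available. The N/f scaling factor follows the commonly stated form of this bound and should be checked against [ZS03]; the function name and signature are illustrative.

```python
def lb_paa(paa_upper, paa_lower, paa_s, n, p=2):
    """Distance between the PAA of an envelope and the PAA of a sequence.

    n is the original sequence length N; f = len(paa_s) is the reduced
    dimension. Each PAA segment stands for N/f original points, hence
    the N/f factor (an assumption of this sketch).
    """
    f = len(paa_s)
    total = 0.0
    for u, l, x in zip(paa_upper, paa_lower, paa_s):
        if x > u:
            total += (x - u) ** p
        elif x < l:
            total += (l - x) ** p
    return (n / f * total) ** (1.0 / p)
```

Because it works on f values instead of N, this bound can be evaluated directly on index entries without touching the stored sequences.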

slide-13
SLIDE 13

13

Lower Boundness of the Two Distances for Whole Matching [Keo02, ZS03]

Lemma 1. Given two sequences Q and S of the same length and a warping width ρ, the following inequality holds:

LB_PAA(P(E(Q)), P(S)) ≤ LB_Keogh(E(Q), S) ≤ DTWρ(Q, S)

We can exploit these lower bounds whenever pruning is possible at the index level or at the sequence level.

slide-14
SLIDE 14

14

Related Work

Range Whole Matching [AFS93]
Ranked Whole Matching

  • Under Euclidean Distance [Keo01, Cha03]
  • Under DTW [Keo02]

Range Subsequence Matching

  • FRM: divides a data sequence into sliding windows and a query sequence into disjoint windows [FRM94]
  • DualMatch: dual approach of FRM [MWL01]
  • General Match [MWH02]
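
A minimal sketch of the two window constructions mentioned above, assuming a fixed window size ω (`w` in the code). FRM applies sliding windows to the data and disjoint windows to the query; the dual approach swaps the two roles. The function names are illustrative.

```python
def disjoint_windows(seq, w):
    """Non-overlapping windows of size w; a trailing partial window is dropped."""
    return [seq[i:i + w] for i in range(0, len(seq) - w + 1, w)]


def sliding_windows(seq, w):
    """All windows of size w, one starting at every offset."""
    return [seq[i:i + w] for i in range(len(seq) - w + 1)]
```

A sequence of length n produces about n/ω disjoint windows but n − ω + 1 sliding windows, so the side that is divided into disjoint windows is much cheaper to index.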

slide-15
SLIDE 15

15

Two Basic Algorithms for Ranked Subsequence Matching

DualMatchTopK

Applies the window construction mechanism of DualMatch [MWL01] to the ranked whole matching algorithm [Cha03, Keo02]

RangeTopK

First obtains top-k entries at the index level using DualMatchTopK and derives an upper bound ε by retrieving the data subsequences that correspond to those entries. Then finds the top-k subsequences using the range subsequence matching algorithm with ε
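
A simplified sketch of ranked search with lower-bound pruning, assuming each candidate already carries a lower bound and that the exact distance (DTW in the slides) is supplied by the caller. It omits the index traversal and only shows how the current k-th best distance, called `delta_cur` here, cuts off the search. All names are illustrative.

```python
import heapq
import math


def top_k_with_pruning(candidates, k, exact_distance):
    """candidates: iterable of (lower_bound, item) pairs.

    Items are visited in ascending lower-bound order. Once the smallest
    remaining lower bound exceeds delta_cur, no remaining item can enter
    the top-k, so the search stops.
    Returns (sorted list of (distance, item), number of exact computations).
    """
    queue = list(candidates)
    heapq.heapify(queue)           # min-heap on the lower bound
    best = []                      # max-heap via negated distances
    delta_cur = math.inf           # k-th best exact distance so far
    computed = 0
    while queue:
        lb, item = heapq.heappop(queue)
        if lb > delta_cur:
            break                  # every remaining lower bound is larger
        d = exact_distance(item)
        computed += 1
        if len(best) < k:
            heapq.heappush(best, (-d, item))
        elif d < -best[0][0]:
            heapq.heapreplace(best, (-d, item))
        if len(best) == k:
            delta_cur = -best[0][0]
    return sorted((-nd, item) for nd, item in best), computed
```

The tighter the lower bound, the earlier the cut-off fires and the fewer exact distances have to be computed.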

slide-16
SLIDE 16

16

  • Pruning at the index level
  • Pruning at the sequence level

slide-17
SLIDE 17

17

[Figure: index with RootNode, nodes R1 and R2, and entries s1 to s4; query Q with window envelopes E(q1), E(q2), E(q3), E(q8)]

Priority queue (one entry per query window):
<RootNode, 0, q1, -1, -1>
<RootNode, 0, q2, -1, -1>
<RootNode, 0, q3, -1, -1>
<RootNode, 0, q8, -1, -1>

δcur = ∞

slide-18
SLIDE 18

18

[Figure: index with RootNode, nodes R1 and R2, and entries s1 to s4]

Priority queue top: <RootNode, 0, q1, -1, -1>

δcur = ∞

slide-19
SLIDE 19

19

[Figure: index with RootNode, nodes R1 and R2, and entries s1 to s4; query window q1]

Priority queue top: <RootNode, 0, q1, -1, -1>

MINDIST(P(E(q1)), R1) = 1.3
MINDIST(P(E(q1)), R2) = 3.2

δcur = ∞

slide-20
SLIDE 20

20

[Figure: index with RootNode, nodes R1 and R2, and entries s1 to s4; query window q1]

Priority queue:
<R1, 1.3, q1, -1, -1>
<R2, 3.2, q1, -1, -1>
…

δcur = ∞

slide-21
SLIDE 21

21

[Figure: index with RootNode, nodes R1 and R2, and entries s1 to s4; query window q1]

Priority queue top: <R1, 1.3, q1, -1, -1>

δcur = 5.3

slide-22
SLIDE 22

22

[Figure: index with RootNode, nodes R1 and R2, and entries s1 to s4; query window q1]

Priority queue top: <R1, 1.3, q1, -1, -1>

LB_PAA(P(E(q1)), s1) = 6.5
LB_PAA(P(E(q1)), s2) = 4.0

δcur = 5.3

slide-23
SLIDE 23

23

[Figure: index with RootNode, nodes R1 and R2, and entries s1 to s4; query window q1]

Priority queue top: <R1, 1.3, q1, -1, -1>

LB_PAA(P(E(q1)), s1) = 6.5
LB_PAA(P(E(q1)), s2) = 4.0

δcur = 5.3

Since 6.5 > δcur, s1 is pruned

slide-24
SLIDE 24

24

[Figure: index with RootNode, nodes R1 and R2, and entries s1 to s4; query window q1]

Priority queue top: <s2, 4.0, q1, 3, 8>

δcur = 5.3

slide-25
SLIDE 25

25

[Figure: index with RootNode, nodes R1 and R2, and entries s1 to s4]

Priority queue top: <s2, 4.0, q1, 3, 8>

δcur = 5.3

slide-26
SLIDE 26

26

[Figure: index with RootNode, nodes R1 and R2, and entries s1 to s4; query window q1 matched with s2]

Priority queue top: <s2, 4.0, q1, 3, 8>

The last two fields give (sid, offset): sid: 3, offset: 8

δcur = 5.3

slide-27
SLIDE 27

27

[Figure: index with RootNode, nodes R1 and R2, and entries s1 to s4; query Q aligned with data subsequence D3[8:8+Len(Q)-1]]

Window pair (q1, s2) with sid: 3, offset: 8

LB_Keogh(E(Q), D3[8:8+Len(Q)-1]) = 5.0 < δcur

δcur = 5.3

slide-28
SLIDE 28

28

[Figure: index with RootNode, nodes R1 and R2, and entries s1 to s4; query Q aligned with data subsequence D3[8:8+Len(Q)-1]]

Window pair (q1, s2) with sid: 3, offset: 8

DTWρ(Q, D3[8:8+Len(Q)-1]) = 5.2 < δcur

δcur = 5.3

slide-29
SLIDE 29

29

[Figure: index with RootNode, nodes R1 and R2, and entries s1 to s4]

Window pair (q1, s2) with sid: 3, offset: 8

Priority queue:
<D3[8:8+Len(Q)-1], 5.2, -1, 3, 8>
…

δcur = 5.3

slide-30
SLIDE 30

30

Comments on DualMatchTopK

  • Many unnecessary subsequences are likely to be retrieved due to the loose lower bound
  • To solve this problem, we propose an approach that prunes the index search space by leveraging the novel notion of the minimum-distance matching-window pair

slide-31
SLIDE 31

31

Minimum-Distance Matching-Window Pair

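[Figure: query Q with envelopes U and L and window envelopes E(q1) to E(q4); subsequence S[i:j] of data sequence S with windows s1 to s4 of size ω]

LB_PAA(P(E(qi)), P(si)) for the four window pairs: 9.2, 11.2, 6.9, 7.1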

slide-32
SLIDE 32

32

MDMWP Distance

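Suppose that the MDMWP of P(E(Q)) and P(S[i:j]) is (P(E(qm)), P(sm)). Then

mdmwp-distance(P(E(Q)), P(S[i:j])) = (r × LB_PAA(P(E(qm)), P(sm))^p)^(1/p), where r is the number of matching window pairs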
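
A worked sketch using the four window-pair distances from the previous slide (9.2, 11.2, 6.9, 7.1). It assumes that the mdmwp-distance has the form (r × d_min^p)^(1/p), where d_min is the LB_PAA distance of the minimum-distance pair and r is the number of window pairs; this form, the choice p = 2, and the function names are assumptions of the sketch.

```python
def mdmwp_distance(window_lbs, p=2):
    """Assumed form: every window pair is charged the minimum pair distance."""
    r = len(window_lbs)
    d_min = min(window_lbs)
    return (r * d_min ** p) ** (1.0 / p)


def summed_window_distance(window_lbs, p=2):
    """L_p combination of all window-pair distances, for comparison."""
    return sum(d ** p for d in window_lbs) ** (1.0 / p)


lbs = [9.2, 11.2, 6.9, 7.1]          # values shown on the MDMWP slide
bound = mdmwp_distance(lbs)          # uses only the minimum pair, 6.9
full = summed_window_distance(lbs)   # needs all four pair distances
```

The point of the bound is that it needs only one window pair: since no pair is closer than the minimum, replacing every pair by the minimum can never overestimate the combined distance.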

slide-33
SLIDE 33

33

Lower Boundness of MDMWP-distance

AdvTopK: the algorithm that incorporates mdmwp-distance based pruning into DualMatchTopK

slide-34
SLIDE 34

34

Correctness of AdvTopK

slide-35
SLIDE 35

35

Deferred Group Subsequence Retrieval

I/O optimization over AdvTopK

  • Avoids excessive random disk I/Os
  • Maximizes buffer utilization

Delays a fixed-size set of subsequence retrieval requests and enables batch retrieval in a sequential access manner
Introduces the group subsequence access list for storing all requests delayed for the next bulk access
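
A minimal sketch of the deferral idea, assuming each retrieval request is identified by a (sid, offset) pair and that sorting a batch by that pair approximates sequential disk access. The class `DeferredRetriever`, its methods, and the `fetch` callback are hypothetical and only illustrate the batching; they are not the authors' data structure.

```python
class DeferredRetriever:
    """Collects retrieval requests and serves them in sorted batches."""

    def __init__(self, fetch, group_size):
        self.fetch = fetch            # callback: (sid, offset) -> subsequence
        self.group_size = group_size  # number of requests delayed per batch
        self.pending = []             # delayed (sid, offset) requests

    def request(self, sid, offset):
        """Delay a request; returns fetched results once a batch is full."""
        self.pending.append((sid, offset))
        if len(self.pending) >= self.group_size:
            return self.flush()
        return []

    def flush(self):
        """Fetch all delayed requests in (sid, offset) order."""
        batch = sorted(self.pending)
        self.pending = []
        return [(sid, off, self.fetch(sid, off)) for sid, off in batch]
```

Requests arriving in the order (3, 8), (1, 40), (3, 2) with a group size of 3 are fetched as (1, 40), (3, 2), (3, 8), so reads within the same sequence happen next to each other.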

slide-36
SLIDE 36

36

Example of Group Subsequence Access List

[Figure labels: Window, Request, Group]

slide-37
SLIDE 37

37

Window-Group Distance

Derived by exploiting both delayed matching windows in each group and the largest distance in the group subsequence access list

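[Figure: query Q with envelopes U and L and window envelopes E(q1) to E(q4); subsequence S[i:j] of data sequence S with windows s1 to s4]

LB_PAA(P(E(qi)), P(si)) for the four window pairs: = 27, = 11, ≥ 38, ≥ 38

WG-dist(P(E(Q)), P(S[i:j])) = (11^p + 27^p + 38^p × (4 − 2))^(1/p)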
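
A worked sketch of the example above: two window pairs have known distances (27 and 11), and the other two are only known to be at least 38, the largest distance in the group subsequence access list. The function name, its parameters, and the choice of p are illustrative assumptions.

```python
def window_group_distance(known, largest, r, p=2):
    """Combine known window-pair distances with a floor for the rest.

    known:   distances of the delayed matching windows in the group
    largest: largest distance in the access list, used as a lower bound
             for each of the r - len(known) windows not yet seen
    """
    unseen = r - len(known)
    total = sum(d ** p for d in known) + unseen * largest ** p
    return total ** (1.0 / p)


wg_p1 = window_group_distance([11, 27], 38, r=4, p=1)   # 11 + 27 + 2 * 38
wg_p2 = window_group_distance([11, 27], 38, r=4, p=2)
```

With p = 1 the example evaluates to 114; with p = 2 it is the square root of 121 + 729 + 2 × 1444 = 3738.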

slide-38
SLIDE 38

38

Experimental Setup

Algorithms compared

  • DualMatchTopK, RangeTopK, AdvTopK, DeferredTopK
  • SeqTopK: sequential scan based algorithm exploiting LB_Keogh

Datasets used

  • UCR-DATA (33 data sets of different characteristics in the UCR time-series archive, 1,055,525 entries)
  • WALK-DATA (random walk data consisting of one million entries)
  • STOCK-DATA (real data set consisting of 329,112 entries)
  • MUSIC-DATA (pitch data set consisting of 2,373,120 entries extracted from 500 MIDI files)

Environment: PC running Linux kernel 2.6 with 512 MB of RAM and a Pentium IV 2.8 GHz CPU

slide-39
SLIDE 39

39

Experimental parameters

slide-40
SLIDE 40

40

Effect of k Using UCR-DATA

  • In terms of the number of candidates, AdvTopK and DeferredTopK significantly outperform RangeTopK and SeqTopK due to mdmwp-distance and WG-distance based pruning.
  • In terms of the number of page accesses, for small k, all index-based algorithms perform much better than SeqTopK and RangeTopK.
  • As k increases, the number of page accesses of all the index-based algorithms increases.
  • We see similar trends in terms of wall clock time.

slide-41
SLIDE 41

41

Effect of Buffer Size Using UCR-DATA

  • As the buffer size increases, both the number of page accesses and the wall clock time decrease for all the index-based algorithms.
  • DeferredTopK shows almost constant performance, and much better performance when the buffer size is very small.

slide-42
SLIDE 42

42

Effect of Window Size Using UCR-DATA

As the window size increases, all three measures (number of candidates, number of page accesses, and wall clock time) decrease for these index-based algorithms due to the window size effect.

slide-43
SLIDE 43

43

Effect of Query Length Using UCR-DATA

As the query length increases, the relative size of the corresponding window decreases, and thus, more candidates occur due to the window size effect.

slide-44
SLIDE 44

44

Experimental Results for WALK-DATA by Varying k

The trend is similar to that for UCR-DATA.

slide-45
SLIDE 45

45

Experimental Result for MUSIC-DATA by Varying k

Again, similar trend for MUSIC-DATA!

slide-46
SLIDE 46

46

Conclusions

  • Proposed a novel notion, the minimum-distance matching-window pair, and derived a lower bound, the mdmwp-distance
  • Proposed deferred group subsequence retrieval to avoid excessive random disk I/Os and poor buffer utilization
  • Derived another lower bound, the window-group distance, that can be used together with deferred group subsequence retrieval
  • Proposed four ranked subsequence matching methods: DualMatchTopK, RangeTopK, AdvTopK, and DeferredTopK
  • Extensive experiments showed that our advanced methods outperform competing methods by up to orders of magnitude

slide-47
SLIDE 47

47

Thank You Very Much! Any Questions?

slide-48
SLIDE 48

48

Appendix

slide-49
SLIDE 49

49

RangeTopK