
Ranked Subsequence Matching in Time-Series Databases

Wook-Shin Han (Kyungpook National University, Korea)
Jinsoo Lee (Kyungpook National University, Korea)
Yang-Sae Moon (Kangwon National University, Korea)
Haifeng Jiang (Google Inc., USA)


  2. Contents
     - Introduction
     - Overview of DTW and Existing Lower Bounds
     - Basic Ranked Subsequence Matching Algorithms
     - Minimum-Distance Matching-Window Pair (MDMWP) and mdmwp-Distance-Based Pruning
     - Deferred Group Subsequence Retrieval
     - Performance Evaluation
     - Conclusions

  3. Time-Series Databases [AFS93, FRM94, MWL01]
     - Time-series data
       - Sequences of values sampled at a fixed time interval
       - Examples: music data, stock prices, and network traffic data
     - Time-series databases
       - Data sequence: time-series data stored in a database
       - Query sequence: time-series data given by a user for similarity search

  4. Similarity Metric
     - Similarity is measured as the distance between a data sequence and a given query sequence
     - We use the dynamic time warping (DTW) distance [BC96, SC78]
       - One of the most robust similarity measures
       - Widely used in applications such as query by humming [ZS03], image searching [BCP05], and speech recognition [RJ93]

  5. Motivation
     - Ranked subsequence matching under DTW
       - Finds the top-k subsequences most similar to a query sequence, from data sequences, under DTW
     - All existing methods were developed only for either whole matching or range subsequence matching

  6. Contributions
     - Propose the first and foremost approach for ranked subsequence matching
     - Propose the concept of the minimum-distance matching-window pair (MDMWP) and pruning with the mdmwp-distance
     - Propose deferred group subsequence retrieval along with another lower bound, the window-group distance
     - Show the efficiency of the proposed methods using many real and synthetic datasets

  7. Review of DTW [figure labels: Sakoe-Chiba band, warping width]
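
The slide only names the Sakoe-Chiba band and the warping width, so the following is a minimal sketch of band-constrained DTW rather than the authors' implementation. The function name `dtw_band`, the parameter `rho` for the warping width, and the choice of an L_p cost with default `p=2` are assumptions made for illustration.

```python
import math

def dtw_band(q, s, rho, p=2):
    """DTW distance between q and s with warping limited to |i - j| <= rho."""
    n, m = len(q), len(s)
    inf = math.inf
    # cost[i][j]: minimal accumulated |.|^p cost of aligning q[:i] with s[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        # Only cells inside the Sakoe-Chiba band are filled in.
        for j in range(max(1, i - rho), min(m, i + rho) + 1):
            d = abs(q[i - 1] - s[j - 1]) ** p
            cost[i][j] = d + min(cost[i - 1][j],      # repeat s[j-1]
                                 cost[i][j - 1],      # repeat q[i-1]
                                 cost[i - 1][j - 1])  # advance both
    return cost[n][m] ** (1.0 / p)

q = [0, 0, 1, 2]
s = [0, 1, 2, 2]
warped = dtw_band(q, s, rho=1)   # warping absorbs the one-step shift
rigid = dtw_band(q, s, rho=0)    # rho = 0 degenerates to the L_p distance
```

With `rho=0` the band collapses to the diagonal, which is a convenient sanity check against the plain Euclidean distance.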

  8. Query Envelope [Keo02, ZS03] [figure: query Q with upper envelope U and lower envelope L]
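
As a rough illustration of the envelope in the figure, the sketch below takes U and L to be the running maximum and minimum of Q over a window of half-width `rho` (the warping width). That definition is the usual one for [Keo02]-style envelopes and is assumed here; the name `envelope` is hypothetical.

```python
def envelope(q, rho):
    """Return (U, L) where U[i] / L[i] are the max / min of q[i-rho .. i+rho]."""
    n = len(q)
    upper, lower = [], []
    for i in range(n):
        # Clip the window at both ends of the sequence.
        window = q[max(0, i - rho): min(n, i + rho + 1)]
        upper.append(max(window))
        lower.append(min(window))
    return upper, lower

U, L = envelope([1, 3, 2, 5, 4], rho=1)
```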

  9. LB_Keogh [Keo02]
     - Distance between a query envelope E(Q) and a data sequence S
     - Lower-bounding distance under DTW at the sequence level
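
A minimal sketch of an envelope-to-sequence distance of this kind, assuming an L_p formulation in which only the parts of S that leave the envelope contribute. The function name `lb_keogh` and its argument order are illustrative, not taken from the slides.

```python
def lb_keogh(upper, lower, s, p=2):
    """Distance from data sequence s to the envelope (upper, lower)."""
    total = 0.0
    for u, l, x in zip(upper, lower, s):
        if x > u:          # s pokes out above the envelope
            total += (x - u) ** p
        elif x < l:        # s pokes out below the envelope
            total += (l - x) ** p
        # values inside [l, u] contribute nothing
    return total ** (1.0 / p)

upper = [3, 3, 5, 5, 5]
lower = [1, 1, 2, 2, 4]
s = [0, 2, 6, 2, 7]
d1 = lb_keogh(upper, lower, s, p=1)   # 1 + 1 + 2
d2 = lb_keogh(upper, lower, s, p=2)   # sqrt(1 + 1 + 4)
```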

  10. Piecewise Aggregate Approximation (PAA) [YF00, Keo02]
      - Dimension reduction: N dimensions → f dimensions
      [figure: sequence S and its approximation PAA(S)]
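
The reduction from N to f dimensions can be sketched as segment averaging. This version assumes N is a multiple of f, which keeps the example short; `paa` is an illustrative name.

```python
def paa(seq, f):
    """Reduce seq (length N) to f values by averaging N/f-long segments."""
    n = len(seq)
    if f <= 0 or n % f != 0:
        raise ValueError("this sketch assumes len(seq) is a positive multiple of f")
    seg = n // f
    return [sum(seq[k * seg:(k + 1) * seg]) / seg for k in range(f)]

reduced = paa([1, 3, 2, 4, 6, 8], f=3)
```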

  11. PAA(ENV(Q)) [figure: query Q with PAA(U) and PAA(L)]

  12. LB_PAA [ZS03]
      - Distance between the PAA of the query envelope, P(E(Q)), and the PAA of the data sequence, P(S)
      - Lower-bounding distance under DTW at the index level
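
A hedged sketch of the index-level distance: it compares the PAA of the envelope with the PAA of S and scales each term by the segment length N/f. The scaling factor and the L_p form are assumptions based on the common formulation; `lb_paa` and its parameters are illustrative names.

```python
def lb_paa(paa_upper, paa_lower, paa_s, n, p=2):
    """Distance between a PAA-reduced envelope and a PAA-reduced sequence.

    n is the original (unreduced) sequence length N.
    """
    f = len(paa_s)
    total = 0.0
    for u, l, x in zip(paa_upper, paa_lower, paa_s):
        if x > u:
            total += (x - u) ** p
        elif x < l:
            total += (l - x) ** p
    # Each reduced value stands for n / f original values.
    return ((n / f) * total) ** (1.0 / p)

b1 = lb_paa([3, 5], [1, 2], [0, 6], n=4, p=1)   # (4/2) * (1 + 1)
b2 = lb_paa([3, 5], [1, 2], [0, 6], n=4, p=2)   # sqrt((4/2) * (1 + 1))
```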

  13. Lower Boundness of the Two Distances for Whole Matching [Keo02, ZS03]
      - Lemma 1. Given two sequences Q and S of the same length and a warping width ρ, the following inequality holds: $LB\_PAA(P(E(Q)), P(S)) \le LB\_Keogh(E(Q), S) \le DTW_\rho(Q, S)$
      - We can exploit these lower bounds whenever pruning is possible at the index level or at the sequence level.
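
The ordering that makes these bounds useful for pruning (index-level bound at most the sequence-level bound, which is at most the DTW distance) can be spot-checked numerically. The sketch below is self-contained and uses simplified L_p definitions of every quantity; all function names are illustrative, and a random check is of course not a proof.

```python
import math
import random

def dtw_band(q, s, rho, p=2):
    """Band-constrained DTW for equal-length sequences."""
    n = len(q)
    cost = [[math.inf] * (n + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(1, i - rho), min(n, i + rho) + 1):
            d = abs(q[i - 1] - s[j - 1]) ** p
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][n] ** (1.0 / p)

def envelope(q, rho):
    n = len(q)
    wins = [q[max(0, i - rho): min(n, i + rho + 1)] for i in range(n)]
    return [max(w) for w in wins], [min(w) for w in wins]

def paa(seq, f):
    seg = len(seq) // f  # assumes len(seq) is a multiple of f
    return [sum(seq[k * seg:(k + 1) * seg]) / seg for k in range(f)]

def gap_norm(upper, lower, s, p, weight=1.0):
    """Weighted L_p norm of how far s leaves the band [lower, upper]."""
    total = 0.0
    for u, l, x in zip(upper, lower, s):
        if x > u:
            total += (x - u) ** p
        elif x < l:
            total += (l - x) ** p
    return (weight * total) ** (1.0 / p)

def lemma_chain(q, s, rho, f, p=2):
    """Return (LB_PAA, LB_Keogh, DTW) for one query/data pair."""
    upper, lower = envelope(q, rho)
    index_level = gap_norm(paa(upper, f), paa(lower, f), paa(s, f), p,
                           weight=len(q) / f)
    sequence_level = gap_norm(upper, lower, s, p)
    return index_level, sequence_level, dtw_band(q, s, rho, p)

random.seed(0)
q = [random.uniform(-1, 1) for _ in range(16)]
s = [random.uniform(-1, 1) for _ in range(16)]
index_level, sequence_level, exact = lemma_chain(q, s, rho=2, f=4)
```

Because the bounds are ordered, a candidate rejected by the cheap index-level bound never needs the sequence-level bound or the full DTW computation.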

  14. Related Work
      - Range whole matching [AFS93]
      - Ranked whole matching
        - Under Euclidean distance [Keo01, Cha03]
        - Under DTW [Keo02]
      - Range subsequence matching
        - Dividing a data sequence into sliding windows and a query sequence into disjoint windows [FRM94]
        - Dual Match: dual approach of FRM [MWL01]
        - General Match [MWH02]

  15. Two Basic Algorithms for Ranked Subsequence Matching
      - DualMatchTopK
        - Applies the window construction mechanism of DualMatch [MWL01] to the ranked whole matching algorithm [Cha03, Keo02]
      - RangeTopK
        - Obtains top-k entries at the index level using DualMatchTopK, and an upper bound ε by retrieving the corresponding data subsequences for those entries
        - Then finds the top-k subsequences using the range subsequence matching algorithm with ε
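
The window construction that DualMatchTopK borrows can be pictured as follows. Since Dual Match is described as the dual of FRM, this sketch assumes data sequences are cut into disjoint windows and the query into sliding windows; the helper names and the `(offset, window)` tuple layout are illustrative only.

```python
def disjoint_windows(seq, w):
    """Non-overlapping windows of length w (a trailing partial window is dropped)."""
    return [(off, seq[off:off + w]) for off in range(0, len(seq) - w + 1, w)]

def sliding_windows(seq, w):
    """Every window of length w, one per starting offset."""
    return [(off, seq[off:off + w]) for off in range(len(seq) - w + 1)]

data_windows = disjoint_windows(list(range(10)), w=4)   # indexed side
query_windows = sliding_windows(list(range(5)), w=4)    # probing side
```

Indexing only disjoint data windows keeps the index small, at the cost of probing it once per sliding query window.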

  16. [figure labels: pruning at the index level; pruning at the sequence level]

  17. [DualMatchTopK walk-through] Index: RootNode, nodes R1 and R2, data windows s1 to s4. Query Q with envelope E(Q) and window envelopes E(q1), E(q2), E(q3), …, E(q8). Priority queue: <RootNode, 0, q1, -1, -1>, <RootNode, 0, q2, -1, -1>, <RootNode, 0, q3, -1, -1>, …, <RootNode, 0, q8, -1, -1>. δ_cur = ∞.

  18. Top of the priority queue: <RootNode, 0, q1, -1, -1>. δ_cur = ∞.

  19. For <RootNode, 0, q1, -1, -1>: MINDIST(P(E(q1)), R1) = 1.3 and MINDIST(P(E(q1)), R2) = 3.2. δ_cur = ∞.

  20. Priority queue now holds <R1, 1.3, q1, -1, -1> and <R2, 3.2, q1, -1, -1>. δ_cur = ∞.

  21. Top of the priority queue: <R1, 1.3, q1, -1, -1>. δ_cur = 5.3.

  22. For <R1, 1.3, q1, -1, -1>: LB_PAA(P(E(q1)), s1) = 6.5 and LB_PAA(P(E(q1)), s2) = 4.0. δ_cur = 5.3.

  23. Since 6.5 > δ_cur, s1 is pruned. δ_cur = 5.3.

  24. Priority queue now holds <s2, 4.0, q1, 3, 8>. δ_cur = 5.3.

  25. Top of the priority queue: <s2, 4.0, q1, 3, 8>. δ_cur = 5.3.

  26. The last two fields of <s2, 4.0, q1, 3, 8> are the sequence id and the offset: sid = 3, offset = 8. δ_cur = 5.3.

  27. Candidate subsequence D3[8:8+Len(Q)-1]: LB_Keogh(E(Q), D3[8:8+Len(Q)-1]) = 5.0 < δ_cur. δ_cur = 5.3.

  28. DTW_ρ(Q, D3[8:8+Len(Q)-1]) = 5.2 < δ_cur. δ_cur = 5.3.

  29. Priority queue now holds <D3[8:8+Len(Q)-1], 5.2, -1, 3, 8>. δ_cur = 5.3.
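
The walk-through applies progressively more expensive tests against the current k-th best distance δ_cur. The sketch below captures only that filtering cascade over a flat list of candidates. It is not the best-first index traversal shown on the slides, and the function name, the dictionary-backed bounds, and the candidate "a" (an invented earlier result that makes δ_cur equal 5.3) are all assumptions for illustration. It expects k ≥ 1.

```python
import heapq
import math

def cascade_topk(candidates, lower_bounds, exact, k):
    """Keep the k smallest exact distances, skipping candidates whose
    lower bound already exceeds the current k-th best distance."""
    best = []  # max-heap of the k best results, stored as (-distance, candidate)
    for cand in candidates:
        delta_cur = -best[0][0] if len(best) == k else math.inf
        # Cheap bounds come first; any() stops at the first one that prunes.
        if any(lb(cand) > delta_cur for lb in lower_bounds):
            continue
        dist = exact(cand)
        if dist < delta_cur:
            if len(best) == k:
                heapq.heapreplace(best, (-dist, cand))
            else:
                heapq.heappush(best, (-dist, cand))
    return sorted((-neg, cand) for neg, cand in best)

# Numbers from the walk-through; "a" is a made-up earlier match at 5.3.
lb_paa = {"a": 0.0, "s1": 6.5, "s2": 4.0}
lb_keogh = {"a": 0.0, "s2": 5.0}   # s1 is pruned before this bound is consulted
dtw = {"a": 5.3, "s2": 5.2}        # s1 never reaches the exact distance
top1 = cascade_topk(["a", "s1", "s2"],
                    [lb_paa.__getitem__, lb_keogh.__getitem__],
                    dtw.__getitem__, k=1)
```

The missing `"s1"` keys in `lb_keogh` and `dtw` double as a check: a lookup would raise `KeyError` if s1 were not pruned at the first bound.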

  30. Comments on DualMatchTopK
      - Many unnecessary subsequences are likely to be retrieved because the lower bound is loose
      - To solve this problem, we propose an approach that prunes the index search space by leveraging the novel notion of the minimum-distance matching-window pair

  31. Minimum-Distance Matching-Window Pair [figure: subsequence S[i:j] of S with windows s1 to s4, matched against query envelope windows E(q1) to E(q4) of Q; labels U, L, ω; LB_PAA(P(E(q_i)), P(s_i)) = 9.2, 11.2, 7.1, 6.9 for i = 1, …, 4]

  32. MDMWP Distance
      - Suppose that the MDMWP of P(E(Q)) and P(S[i:j]) is (P(E(q_m)), P(s_m))
      - mdmwp-distance = [equation image not captured]
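
The formula on this slide did not survive extraction, so the following is only a plausible reading, not the authors' definition. It assumes the minimum-distance pair is the window pair with the smallest LB_PAA value, and that the mdmwp-distance charges that minimum once for each of `r` window pairs under an L_p norm, mirroring the shape of the window-group distance example on slide 37. The names `mdmwp_distance` and `r` are hypothetical.

```python
def mdmwp_distance(window_pair_dists, p=2, r=None):
    """Return (m, distance): index of the minimum-distance pair and an
    assumed bound that counts that minimum r times under an L_p norm."""
    m = min(range(len(window_pair_dists)), key=window_pair_dists.__getitem__)
    d_min = window_pair_dists[m]
    if r is None:
        r = len(window_pair_dists)   # assumption: count every window pair
    return m, (r * d_min ** p) ** (1.0 / p)

# LB_PAA values for the four window pairs shown on slide 31.
pairs = [9.2, 11.2, 7.1, 6.9]
m, bound_p1 = mdmwp_distance(pairs, p=1)
```

Under this reading the bound can never exceed the L_p combination of all window-pair distances, since every pair is replaced by the smallest one.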

  33. Lower Boundness of MDMWP-distance [lemma statement not captured]
      - We call the algorithm that incorporates mdmwp-distance-based pruning into DualMatchTopK "AdvTopK"

  34. Correctness of AdvTopK

  35. Deferred Group Subsequence Retrieval
      - I/O optimization over AdvTopK
        - Avoids excessive random disk I/Os
        - Maximizes buffer utilization
      - Delays a fixed-size set of subsequence retrieval requests and enables batch retrieval in a sequential access manner
      - Introduces the group subsequence access list, which stores all requests delayed for the next bulk access
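
One way to picture the deferral is a small buffer that collects retrieval requests and, once a fixed group size is reached, serves them sorted by sequence id and offset so that storage is read in one forward pass. The class below is a sketch under those assumptions; `DeferredRetriever`, `fetch`, and `group_size` are hypothetical names, and the real access list also carries distance information that is omitted here.

```python
class DeferredRetriever:
    """Buffer subsequence requests and serve them in (sid, offset) order."""

    def __init__(self, fetch, group_size):
        self.fetch = fetch            # callable: (sid, offset, length) -> data
        self.group_size = group_size  # number of requests delayed per batch
        self.pending = []             # stand-in for the group access list

    def request(self, sid, offset, length):
        """Queue a request; returns fetched results only when a batch fires."""
        self.pending.append((sid, offset, length))
        if len(self.pending) >= self.group_size:
            return self.flush()
        return []

    def flush(self):
        """Fetch every delayed request in ascending (sid, offset) order."""
        batch = sorted(self.pending)
        self.pending = []
        return [(sid, off, self.fetch(sid, off, length))
                for sid, off, length in batch]

access_log = []

def fake_fetch(sid, offset, length):
    # Stand-in for a disk read; records the order in which pages are touched.
    access_log.append((sid, offset))
    return list(range(offset, offset + length))

retriever = DeferredRetriever(fake_fetch, group_size=3)
retriever.request(3, 8, 4)
retriever.request(1, 5, 2)
results = retriever.request(3, 2, 3)   # third request triggers the batch
```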

  36. Example of Group Subsequence Access List [figure labels: window, request, group]

  37. Window-Group Distance
      - Derived by exploiting both the delayed matching windows in each group and the largest distance in the group subsequence access list
      [figure: subsequence S[i:j] of S with windows s1 to s4, matched against query envelope windows E(q1) to E(q4) of Q; labels U, L; LB_PAA(P(E(q_i)), P(s_i)) = 27, = 11, ≥ 38, ≥ 38 for i = 1, …, 4]
      - WG-dist(P(E(Q)), P(S[i:j])): $\sqrt[p]{11^p + 27^p + 38^p \times (4 - 2)}$
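
The slide's example can be reproduced with a few lines, reading it as: two window pairs have known LB_PAA values (27 and 11), and the remaining 4 − 2 pairs are each charged the largest distance in the access list (38). The function name and parameter names are illustrative, and the general rule is inferred from this single example.

```python
def wg_distance(known_dists, max_dist, n_windows, p=2):
    """L_p combination of known window distances plus max_dist for each
    window pair whose distance is not yet known."""
    missing = n_windows - len(known_dists)
    total = sum(d ** p for d in known_dists) + missing * max_dist ** p
    return total ** (1.0 / p)

wg_p1 = wg_distance([27, 11], max_dist=38, n_windows=4, p=1)  # 27 + 11 + 2 * 38
wg_p2 = wg_distance([27, 11], max_dist=38, n_windows=4, p=2)  # sqrt(729 + 121 + 2 * 1444)
```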

  38. Experimental Setup
      - Algorithms compared
        - DualMatchTopK, RangeTopK, AdvTopK, DeferredTopK
        - SeqTopK: sequential-scan-based algorithm exploiting LB_Keogh
      - Datasets used
        - UCR-DATA (33 data sets of different characteristics in the UCR time-series archive, 1,055,525 entries)
        - WALK-DATA (random walk data consisting of one million entries)
        - STOCK-DATA (real data set consisting of 329,112 entries)
        - MUSIC-DATA (pitch data set consisting of 2,373,120 entries extracted from 500 MIDI files)
      - PC running Linux kernel 2.6 with 512 Mbytes of RAM and a Pentium IV 2.8 GHz CPU

  39. Experimental parameters [table not captured]

  40. Effect of k Using UCR-DATA
      - In terms of the number of candidates, AdvTopK and DeferredTopK significantly outperform RangeTopK and SeqTopK due to MDMWP-distance-based and WG-distance-based pruning.
      - In terms of the number of page accesses, for small k, all index-based algorithms perform much better than SeqTopK and RangeTopK. As k increases, the number of page accesses of all the index-based algorithms increases.
      - We see similar trends in terms of wall clock time.

  41. Effect of Buffer Size Using UCR-DATA
      - As the buffer size increases, both the number of page accesses and wall clock time decrease for all the index-based algorithms.
      - DeferredTopK shows almost constant performance, and much better performance with a very small buffer size.
