Exact Indexing of Dynamic Time Warping
Eamonn Keogh
Computer Science & Engineering Department University of California - Riverside Riverside,CA 92521 eamonn@cs.ucr.edu
Fair Use
If you use these slides (or any part thereof) for any lecture or class, please send me an email, if possible with a pointer to the relevant web page or document. eamonn@cs.ucr.edu
Clustering, Classification, Rule Discovery
s = 0.5 c = 0.3
Query by Content
Euclidean Distance
Sequences are aligned “one to one”.
“Warped” Time Axis
Nonlinear alignments are possible.
Training data consists of 10 exemplars from each class.
Euclidean Distance Error rate
Dynamic Time Warping Error rate
Classification Experiment on Cylinder-Bell-Funnel Dataset
Classification
Clustering
Monday Tuesday Wednesday Thursday Friday Saturday Sunday
Wednesday was a national holiday
Euclidean Dynamic Time Warping
Because of the robustness of Dynamic Time Warping compared to Euclidean Distance, it is used in…
Bioinformatics: Aach, J. and Church, G. (2001). Aligning gene expression time series with time warping algorithms. Bioinformatics. Volume 17, pp 495-508.
Robotics: Schmill, M., Oates, T. & Cohen, P. (1999). Learned models for continuous planning. In 7th International Workshop on Artificial Intelligence and Statistics.
Medicine: Caiani, E.G., et al. (1998) Warped-average template technique to track on a cycle-by-cycle basis the cardiac filling phases on left ventricular
Chemistry: Gollmer, K., & Posten, C. (1995) Detection of distorted pattern using dynamic time warping algorithm and application for supervision of bioprocesses. IFAC CHEMFAS-4
Gesture Recognition: Gavrila, D. M. & Davis, L. S. (1995). Towards 3-d model-based tracking and recognition of human movement: a multi-view approach. In IEEE IWAFGR
Meteorology / Tracking / Biometrics / Astronomy / Finance / Manufacturing …
$$DTW(Q, C) = \min\left\{ \frac{\sqrt{\sum_{k=1}^{K} w_k}}{K} \right\}$$
γ(i,j) = d(qi,cj) + min{ γ(i-1,j-1) , γ(i-1,j ) , γ(i,j-1) }
Warping path w
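The recurrence for γ(i,j) above can be filled in with dynamic programming. The following is a minimal sketch, not the author's code: it assumes d(qi,cj) is the squared difference, returns the square root of γ(n,m) without the division by the path length K, and adds an optional parameter `r` (a hypothetical name) that restricts the path to |i - j| ≤ r.

```python
import math

def dtw(q, c, r=None):
    """DTW distance from gamma(i,j) = d(q_i,c_j) + min of the three predecessors.

    d is the squared difference (an assumption). r is an optional band
    half-width; r=None means the warping path is unconstrained.
    """
    n, m = len(q), len(c)
    if r is None:
        r = max(n, m)
    r = max(r, abs(n - m))  # the band must be wide enough to reach cell (n, m)
    # gamma[i][j] is the cheapest cumulative cost of a path ending at (i, j)
    gamma = [[math.inf] * (m + 1) for _ in range(n + 1)]
    gamma[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(1, i - r), min(m, i + r) + 1):
            d = (q[i - 1] - c[j - 1]) ** 2
            gamma[i][j] = d + min(gamma[i - 1][j - 1],
                                  gamma[i - 1][j],
                                  gamma[i][j - 1])
    return math.sqrt(gamma[n][m])
```

With `r=0` only the diagonal is allowed, so the result reduces to the Euclidean distance; the table has n × m cells, which is where the O(n²) cost comes from.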
DTW is much better than Euclidean distance for classification, clustering, query by content etc. But is it not true that “dynamic time warping cannot be speeded up by indexing *”, and is O(n²)? Dooh
* Agrawal, R., Lin, K. I., Sawhney, H. S., & Shim, K. (1995). Fast similarity search in the presence of noise, scaling, and translation in times-series
[Figure: global constraints on the warping path between C and Q: Sakoe-Chiba Band and Itakura Parallelogram]
Intuition
Try to use a cheap lower bounding calculation as often as possible.
Only do the expensive, full calculations when it is absolutely necessary.
Algorithm Lower_Bounding_Sequential_Scan(Q)
best_so_far = infinity;
for all sequences in database
    LB_dist = lower_bound_distance(Ci, Q);
    if LB_dist < best_so_far
        true_dist = DTW(Ci, Q);
        if true_dist < best_so_far
            best_so_far = true_dist;
            index_of_best_match = i;
        endif
    endif
endfor
We can speed up similarity search under DTW by using a lower bounding function.
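The scan above can be sketched in Python as follows. This is an illustration under assumptions: `endpoint_lower_bound` is a hypothetical stand-in for `lower_bound_distance` (it is valid because every warping path must align the two first points and the two last points), and the counter `full_computations` is added only to show how often the expensive DTW call is actually made.

```python
import math

def dtw(q, c):
    """Unconstrained DTW with squared point distance; square root of the total cost."""
    n, m = len(q), len(c)
    gamma = [[math.inf] * (m + 1) for _ in range(n + 1)]
    gamma[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = (q[i - 1] - c[j - 1]) ** 2
            gamma[i][j] = d + min(gamma[i - 1][j - 1], gamma[i - 1][j], gamma[i][j - 1])
    return math.sqrt(gamma[n][m])

def endpoint_lower_bound(c, q):
    """Stand-in lower bound: the first and last points are always aligned."""
    return max(abs(q[0] - c[0]), abs(q[-1] - c[-1]))

def lower_bounding_sequential_scan(query, database, lower_bound=endpoint_lower_bound):
    best_so_far = math.inf
    index_of_best_match = None
    full_computations = 0
    for i, candidate in enumerate(database):
        # Cheap test first: skip the candidate if it cannot beat the best match.
        if lower_bound(candidate, query) < best_so_far:
            full_computations += 1
            true_dist = dtw(candidate, query)
            if true_dist < best_so_far:
                best_so_far = true_dist
                index_of_best_match = i
    return index_of_best_match, best_so_far, full_computations

database = [[0, 1, 2, 3], [5, 6, 7, 8], [0, 0, 1, 3], [9, 9, 9, 9]]
best_index, best_dist, n_full = lower_bounding_sequential_scan([0, 1, 1, 3], database)
```

On this toy database only two of the four candidates need the full DTW computation; the other two are pruned by the lower bound.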
LB_Kim
The maximum squared difference between the two sequences' first (A), last (D), minimum (B) and maximum (C) points is returned as the lower bound.
Kim, S, Park, S, & Chu, W. An index-based approach for similarity search supporting time warping in large sequence
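A minimal sketch of the LB_Kim idea, using the four features named on the slide. One assumption is made loudly here: the function returns the square root of the largest squared feature difference, so that the value is on the same scale as a DTW distance defined as the square root of the accumulated squared differences.

```python
import math

def lb_kim(q, c):
    """Lower bound from four features: first, last, minimum and maximum point."""
    features_q = (q[0], q[-1], min(q), max(q))
    features_c = (c[0], c[-1], min(c), max(c))
    largest_squared = max((a - b) ** 2 for a, b in zip(features_q, features_c))
    # Square root taken to match a square-rooted DTW distance (assumption).
    return math.sqrt(largest_squared)
```

The cost is linear in the sequence length, but only four numbers per sequence are used, which is why the bound tends to be loose.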
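LB_Yi
The sum of the squared lengths of the gray lines represents the minimum contribution of the corresponding points to the overall DTW distance, and thus can be returned as the lower bounding measure. The gray lines measure how far points lie above max(Q) or below min(Q).
Yi, B, Jagadish, H & Faloutsos, C. Efficient retrieval of similar time sequences under time warping.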
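The LB_Yi description can be sketched as below. This is an illustrative reading of the slide, not the original implementation: every point of C that lies above max(Q) or below min(Q) contributes its squared excess, and the square root is taken to match a square-rooted DTW distance.

```python
import math

def lb_yi(q, c):
    """Sum the squared amounts by which points of c fall outside [min(q), max(q)]."""
    highest, lowest = max(q), min(q)
    total = 0.0
    for x in c:
        if x > highest:
            total += (x - highest) ** 2
        elif x < lowest:
            total += (lowest - x) ** 2
    return math.sqrt(total)
```

Each point of C must be aligned with at least one point of Q, and no point of Q is closer to it than the nearer of min(Q) and max(Q), which is why the sum cannot exceed the true warping cost.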
Dynamic Time Warping is a very robust technique for measuring time series similarity.
Techniques to speed up similarity search have been introduced, including global constraints and two different lower bounding techniques.
[Figure: the query Q enclosed by its upper envelope U and lower envelope L, shown for the Sakoe-Chiba Band and the Itakura Parallelogram, together with a candidate sequence C]
$U_i = \max(q_{i-r} : q_{i+r})$
$L_i = \min(q_{i-r} : q_{i+r})$
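The envelope definitions Ui = max(qi-r : qi+r) and Li = min(qi-r : qi+r) translate directly into a sliding-window maximum and minimum. A minimal sketch, assuming a constant reach r (the Sakoe-Chiba case) and windows truncated at the ends of the sequence:

```python
def envelope(q, r):
    """Return (U, L): running max and min of q over the window [i - r, i + r]."""
    n = len(q)
    upper, lower = [], []
    for i in range(n):
        window = q[max(0, i - r): min(n, i + r + 1)]
        upper.append(max(window))
        lower.append(min(window))
    return upper, lower

U, L = envelope([1, 3, 2, 5, 4], r=1)
```

For this input U is [3, 3, 5, 5, 5] and L is [1, 1, 2, 2, 4]; with r = 0 both envelopes collapse onto Q itself.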
LB_Keogh
$$LB\_Keogh(Q, C) = \sqrt{\sum_{i=1}^{n} \begin{cases} (c_i - U_i)^2 & \text{if } c_i > U_i \\ (c_i - L_i)^2 & \text{if } c_i < L_i \\ 0 & \text{otherwise} \end{cases}}$$
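A minimal sketch of LB_Keogh for two sequences of equal length, assuming a Sakoe-Chiba style envelope with constant reach `r`. Only the parts of C that fall outside the envelope of Q contribute to the sum.

```python
import math

def lb_keogh(q, c, r):
    """Distance from c to the envelope [L, U] built around q with reach r."""
    n = len(q)
    total = 0.0
    for i, ci in enumerate(c):
        window = q[max(0, i - r): min(n, i + r + 1)]
        u_i, l_i = max(window), min(window)
        if ci > u_i:
            total += (ci - u_i) ** 2
        elif ci < l_i:
            total += (ci - l_i) ** 2
    return math.sqrt(total)

q = [1, 3, 2, 5, 4]
c = [4, 0, 2, 6, 3]
bound = lb_keogh(q, c, r=1)
```

Here four points of C each miss the envelope by 1, so the bound is 2.0. With `r=0` the envelope is Q itself and the bound equals the Euclidean distance. The bound is only meaningful for DTW computed under the same constraint r.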
LB_Keogh (Sakoe-Chiba), LB_Keogh (Itakura), LB_Yi, LB_Kim
The tightness of the lower bound for each technique is proportional to the length of gray lines used in the illustrations.
finance, medicine, biometrics, chemistry, astronomy, robotics, networking and industry. The datasets cover the complete spectrum of stationary/non-stationary, noisy/smooth, cyclical/non-cyclical, symmetric/asymmetric etc.
saved every random number, every setting and all data.
created by a quantum mechanical process.
worst case for us (the Itakura Parallelogram would give us much better results).
extracted 50 sequences of length 256. We compared each sequence to the 49
average ratio from the 1,225 (50*49/2) comparisons made.
$$T = \frac{\text{Lower Bound Estimate of Dynamic Time Warp Distance}}{\text{True Dynamic Time Warp Distance}}$$
The larger the better. Query length of 256 is about the mean in the literature.
[Figure: tightness of lower bound T for LB_Kim, LB_Yi and LB_Keogh; x-axis 1 to 32, y-axis 0 to 1.0]
[Figure: Effect of Query Length on Tightness of Lower Bounds. X-axis: Query Length (16 to 1024); Y-axis: Tightness of Lower Bound T (0 to 1.0); series: LB_Kim, LB_Yi, LB_Keogh]
These experiments suggest we can use the new lower bounding technique to speed up sequential search.
But what we really need is a technique to index the time series.
A Dimensionality Reduction Technique
Piecewise Aggregate Approximation (PAA)
[Figure: a time series C and its PAA approximation with segments c1 to c8; x-axis 0 to 140]
$$\bar{c}_i = \frac{N}{n} \sum_{j=\frac{n}{N}(i-1)+1}^{\frac{n}{N} i} c_j$$
Keogh, E., Chakrabarti, K., Pazzani, M. & Mehrotra, S. (2000). Dimensionality reduction for fast similarity search in large time series databases.
Yi, B. K., & Faloutsos, C. (2000). Fast time sequence indexing for arbitrary Lp norms.
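PAA replaces a sequence of length n by the means of N equal-width segments. A minimal sketch, assuming for simplicity that N divides n exactly (the function rejects other inputs rather than guessing how to split them):

```python
def paa(c, N):
    """Piecewise Aggregate Approximation: mean of each of N equal-width segments."""
    n = len(c)
    if N <= 0 or n % N != 0:
        raise ValueError("this sketch assumes N divides len(c)")
    width = n // N
    return [sum(c[i * width:(i + 1) * width]) / width for i in range(N)]

reduced = paa([1, 3, 2, 4, 6, 8, 5, 5], N=4)
```

For the example, eight points are reduced to the four segment means [2.0, 3.0, 7.0, 5.0].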
Advantages of PAA (for Euclidean Indexing)
wavelets and Fourier transform (empirically)
index
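We create special PAA representations of U and L, which we will denote $\hat{U}$ and $\hat{L}$.
$$\hat{U}_i = \max\left(U_{\frac{n}{N}(i-1)+1}, \ldots, U_{\frac{n}{N} i}\right)$$
$$\hat{L}_i = \min\left(L_{\frac{n}{N}(i-1)+1}, \ldots, L_{\frac{n}{N} i}\right)$$
[Figure: the query Q with its envelopes U and L and their piecewise approximations $\hat{U}$ and $\hat{L}$]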
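The reduced envelopes take the maximum of U and the minimum of L over each PAA segment, so they always enclose the original envelope. The sketch below also computes LB_PAA, the reduced-space analogue of LB_Keogh; its formula is not shown on the slides and is reconstructed here as an assumption (each segment's squared miss is weighted by n/N). It assumes N divides n.

```python
import math

def envelope(q, r):
    n = len(q)
    windows = [q[max(0, i - r): min(n, i + r + 1)] for i in range(n)]
    return [max(w) for w in windows], [min(w) for w in windows]

def paa(c, N):
    width = len(c) // N
    return [sum(c[i * width:(i + 1) * width]) / width for i in range(N)]

def paa_envelope(q, r, N):
    """U-hat takes the max of U per segment, L-hat the min of L per segment."""
    upper, lower = envelope(q, r)
    width = len(q) // N
    u_hat = [max(upper[i * width:(i + 1) * width]) for i in range(N)]
    l_hat = [min(lower[i * width:(i + 1) * width]) for i in range(N)]
    return u_hat, l_hat

def lb_paa(q, c_bar, r):
    """Lower bound computed from the PAA of c only (assumed form, weight n/N)."""
    n, N = len(q), len(c_bar)
    u_hat, l_hat = paa_envelope(q, r, N)
    total = 0.0
    for ci, u, l in zip(c_bar, u_hat, l_hat):
        if ci > u:
            total += (n / N) * (ci - u) ** 2
        elif ci < l:
            total += (n / N) * (ci - l) ** 2
    return math.sqrt(total)

q = [1, 3, 2, 5, 4, 4, 0, 1]
c = [6, 6, 3, 3, 2, 2, -2, -2]
u_hat, l_hat = paa_envelope(q, r=1, N=4)
bound = lb_paa(q, paa(c, 4), r=1)
```

On this example the reduced-space bound is √26, slightly looser than the full-resolution envelope bound of √30, which is the price paid for storing only N numbers per sequence.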
We have seen how to define $\hat{U}$ and $\hat{L}$. We can now define the MINDIST function, which returns the distance between a query Q and an MBR R.
Our index structure contains a leaf node U. Let R = (L, H) be the MBR associated with U, where L = {l1, l2, …, lN} and H = {h1, h2, …, hN}.
$$MINDIST(Q, R) = \sqrt{\sum_{i=1}^{N} \frac{n}{N} \begin{cases} (l_i - \hat{U}_i)^2 & \text{if } l_i > \hat{U}_i \\ (h_i - \hat{L}_i)^2 & \text{if } h_i < \hat{L}_i \\ 0 & \text{otherwise} \end{cases}}$$
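MINDIST compares the reduced query envelope with the corners of an MBR, dimension by dimension. A minimal sketch under assumptions: the parameter names `low` and `high` stand for the MBR corners L and H, and `n` is the original sequence length, so n/N is the segment width.

```python
import math

def mindist(u_hat, l_hat, low, high, n):
    """Distance from the reduced query envelope to the MBR (low, high)."""
    N = len(u_hat)
    total = 0.0
    for u, l, lo, hi in zip(u_hat, l_hat, low, high):
        if lo > u:        # the whole MBR lies above the envelope in this dimension
            total += (n / N) * (lo - u) ** 2
        elif hi < l:      # the whole MBR lies below the envelope in this dimension
            total += (n / N) * (hi - l) ** 2
    return math.sqrt(total)

u_hat, l_hat = [3, 5, 5, 4], [1, 2, 0, 0]
d_box = mindist(u_hat, l_hat, low=[4, 1, -3, -3], high=[6, 3, -1, 2], n=8)
d_point = mindist(u_hat, l_hat, low=[6, 3, 2, -2], high=[6, 3, 2, -2], n=8)
```

When the MBR degenerates to a single PAA point (`low == high`), the value is the same as the reduced-space lower bound for that point, which is what lets one priority queue hold both nodes and points.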
Having defined the MINDIST function we can use (slightly modified) classic K-Nearest Neighbor and Range Queries.
Algorithm KNNSearch(Q,K)
Variable queue: MinPriorityQueue;
Variable list: temp;
queue.push(root_node_of_index, 0);
while not queue.IsEmpty() do
    top = queue.Top();
    for each time series C in temp such that DTW(Q,C) ≤ top.dist
        Remove C from temp;
        Add C to result;
        if |result| = K return result;
    queue.Pop();
    if top is a PAA point C
        Retrieve full sequence C from database;
        temp.insert(C, DTW(Q,C));
    else if top is a leaf node
        for each data item C in top
            queue.push(C, LB_PAA(Q, C̄));
    else // top is a non-leaf node
        for each child node U in top
            queue.push(U, MINDIST(Q,R)) // R is MBR associated with U.
Seidl, T. & Kriegel, H. (1998). Optimal multi-step k-nearest neighbor search.
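The queue-and-temp logic of KNNSearch can be illustrated without an index tree. The sketch below is a simplification, not the algorithm as given: it treats the whole database as one flat leaf, uses the full-resolution envelope bound in place of LB_PAA and MINDIST, assumes equal-length sequences, and adds a final step for the case where the queue empties before K results are confirmed.

```python
import heapq
import math

def dtw(q, c, r):
    """DTW restricted to the band |i - j| <= r (equal-length sequences)."""
    n, m = len(q), len(c)
    gamma = [[math.inf] * (m + 1) for _ in range(n + 1)]
    gamma[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(1, i - r), min(m, i + r) + 1):
            d = (q[i - 1] - c[j - 1]) ** 2
            gamma[i][j] = d + min(gamma[i - 1][j - 1], gamma[i - 1][j], gamma[i][j - 1])
    return math.sqrt(gamma[n][m])

def lb_keogh(q, c, r):
    total = 0.0
    for i, ci in enumerate(c):
        window = q[max(0, i - r): i + r + 1]
        u_i, l_i = max(window), min(window)
        if ci > u_i:
            total += (ci - u_i) ** 2
        elif ci < l_i:
            total += (ci - l_i) ** 2
    return math.sqrt(total)

def knn_search(query, database, k, r):
    """Return the k nearest (distance, index) pairs under banded DTW."""
    queue = [(lb_keogh(query, c, r), i) for i, c in enumerate(database)]
    heapq.heapify(queue)          # ordered by lower bound
    temp, result = [], []         # temp: true distances not yet confirmed
    while queue:
        top_dist, i = queue[0]
        temp.sort()
        # A candidate no farther than the smallest remaining bound is final.
        while temp and temp[0][0] <= top_dist:
            result.append(temp.pop(0))
            if len(result) == k:
                return result
        heapq.heappop(queue)
        temp.append((dtw(query, database[i], r), i))
    temp.sort()                   # queue exhausted: report what is left
    return result + temp[:k - len(result)]

database = [[0, 1, 2, 3, 2, 1], [0, 0, 1, 2, 3, 2], [5, 5, 5, 5, 5, 5],
            [1, 2, 3, 4, 3, 2], [9, 8, 7, 6, 5, 4]]
neighbours = knn_search([0, 1, 2, 3, 2, 1], database, k=2, r=1)
```

In this run the two nearest neighbours are confirmed after only two full DTW computations; the remaining three sequences are never retrieved because their lower bounds already exceed the second-best true distance.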
Algorithm RangeSearch(Q, ε, T)
if T is a non-leaf node
    for each child U of T
        if MINDIST(Q,R) ≤ ε
            RangeSearch(Q, ε, U); // R is MBR of U
else // T is a leaf node
    for each PAA point C in T
        if LB_PAA(Q, C̄) ≤ ε
            Retrieve full sequence C from database;
            if DTW(Q,C) ≤ ε
                Add C to result;
each of the 50 sequences we separate out the sequence from the other 49 sequences, then find the nearest match to our withheld sequence among the remaining 49 sequences using the sequential scan
lower bounding functions to prune away the quadratic-time computation of the full DTW algorithm.
for each approach.
$$P = \frac{\text{Number of objects that do not require full DTW}}{\text{Number of objects in database}}$$
The larger the better. Query length of 256 is about the mean in the literature.
[Figure: pruning power P for LB_Kim, LB_Yi and LB_Keogh; x-axis 1 to 32, y-axis 0 to 1.0]
[Figure: Pruning Power P versus Database Size (4 to 512) for LB_Kim, LB_Yi and LB_Keogh]
Metric Definition: The Normalized CPU cost: the ratio of the average CPU time to execute a query using the index to the average CPU time required to perform a linear (sequential) scan. The normalized cost of linear scan is 1.0.
Datasets
pooled together. 763,270 items
common test dataset in the
System: AMD Athlon 1.4 GHz processor, with 512 MB of physical memory and 57.2 GB of secondary storage. The index used was the R-Tree.
Algorithms: We compare the proposed technique to linear scan. LB_Yi does not have an index method and LB_Kim never beats linear scan.
[Figure: Normalized CPU Cost (0 to 1) versus database size (2^10 to 2^20) for LScan and LB_Keogh, on the Random Walk II and Mixed Bag datasets. Note that the X-axis is logarithmic.]
Thanks to Kaushik Chakrabarti, Dennis DeCoste, Sharad Mehrotra, Michalis Vlachos and the VLDB reviewers for their useful comments. Datasets and code used in this paper can be found at www.cs.ucr.edu/~eamonn/TSDMA/index.html