Information Systems M
Prof. Paolo Ciaccia
http://www-db.deis.unibo.it/courses/SI-M/

Time series, that is, sequences of observations (samples) made through time, are present in everyday life:
- Temperature, rainfall, seismic traces
- Weblogs
- Stock prices
- EEG, ECG, blood pressure
- Enrolled students at the Engineering Fac.
- …

This as well as many of the following figures/examples are taken from the tutorial given by Eamonn Keogh at SBBD 2002 (XVII Brazilian Symposium on Databases): www.cs.ucr.edu/~eamonn/

[Figure: an example time series of 500 samples, with values between about 23 and 29]
Similarity search can help you in:
- Looking for the occurrence of known patterns
- Discovering unknown patterns
- Putting "things together" (clustering)
- Classifying new data
- Predicting/extrapolating future behaviors
- …

Why is similarity search difficult?

Consider a large time series DB, e.g.:
- 1 hour of ECG data: 1 GByte
- Typical Weblog: 5 GBytes per week
- Space Shuttle DB: 158 GBytes
- MACHO Astronomical DB: 2 TBytes, updated with 3 GBytes a day (20 million stars recorded nightly for 4 years), http://wwwmacho.anu.edu.au/

First problem: large database size
Second problem: subjectivity of similarity evaluation

How to measure similarity

Given two time series s and q of equal length D, the usual way to measure their (dis-)similarity is based on the Euclidean distance:

$$L_2(s, q) = \sqrt{\sum_{t=0}^{D-1} (s_t - q_t)^2}$$

However, with Euclidean distance we have to face two basic problems:
1. High dimensionality: (very) large D values
2. Sensitivity to "alignment of values"

For problem 1 we need to define effective lower-bounding techniques that work in a (much) lower-dimensional space.
For problem 2 we will introduce a new similarity criterion.

Pre-processing

- Directly comparing two time series (e.g., using Euclidean distance) might lead to counter-intuitive results
- This is because Euclidean distance, like other measures, is very sensitive to some data features that might conflict with the subjective notion of similarity
- Thus, a good idea is to pre-process time series in order to remove such features
Offset translation

Subtract from each sample the mean value of the series:

q = q - mean(q)
s = s - mean(s)

[Figure: q and s, with their distance d(q,s), before and after offset translation]

Amplitude scaling

Normalize the amplitude (divide by the standard deviation of the series):

q = (q - mean(q)) / std(q)
s = (s - mean(s)) / std(s)

[Figure: q and s before and after amplitude scaling]
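The two transformations above, and their effect on the Euclidean distance, can be sketched in a few lines of Python. This is only an illustrative sketch: the function names and the toy series are hypothetical, and `std` is assumed to be the population standard deviation (dividing by D), which the slides do not specify.

```python
import math

def euclidean(s, q):
    """L2 distance between two equal-length series."""
    if len(s) != len(q):
        raise ValueError("series must have the same length D")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(s, q)))

def offset_translate(x):
    """Offset translation: subtract the mean from every sample."""
    mean = sum(x) / len(x)
    return [v - mean for v in x]

def amplitude_scale(x):
    """Offset translation followed by division by the standard deviation."""
    centered = offset_translate(x)
    # Assumption: population standard deviation (divide by D, not D - 1).
    std = math.sqrt(sum(v * v for v in centered) / len(x))
    if std == 0:
        raise ValueError("a constant series has zero standard deviation")
    return [v / std for v in centered]

# Toy data: q has the same shape as s, but doubled in amplitude and shifted by 10.
s = [1.0, 2.0, 3.0, 2.0]
q = [12.0, 14.0, 16.0, 14.0]

print(euclidean(s, q))                                      # raw distance
print(euclidean(offset_translate(s), offset_translate(q)))  # offset removed
print(euclidean(amplitude_scale(s), amplitude_scale(q)))    # close to 0
```

On this toy pair the raw distance is dominated by the offset, while after both transformations the two series essentially coincide, which matches the intuition that they have the same shape.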
Noise removal

Average the value of each sample with those of its neighbors (smoothing):

q = smooth(q)
s = smooth(s)

[Figure: q and s before and after smoothing]

Dimensionality reduction: DFT

The first method for reducing the dimensionality of time series, proposed in [AFS93], was based on the Discrete Fourier Transform (DFT).

Recall: given a time series s, the Fourier coefficients are complex numbers (amplitude, phase), defined as:

$$S_f = \frac{1}{\sqrt{D}} \sum_{t=0}^{D-1} s_t \exp(-j 2 \pi f t / D), \qquad f = 0, \ldots, D-1$$

Parseval theorem: DFT preserves the energy of the signal:

$$E(s) = \sum_{t=0}^{D-1} |s_t|^2 = E(S) = \sum_{f=0}^{D-1} |S_f|^2$$

where $|S_f| = |\mathrm{Re}(S_f) + j\,\mathrm{Img}(S_f)| = \sqrt{\mathrm{Re}(S_f)^2 + \mathrm{Img}(S_f)^2}$ is the absolute value of $S_f$ (its amplitude), which equals the Euclidean distance of $S_f$ from the origin.

[Figure: S_f as a point in the complex plane (axes Re, Img)]
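As a quick numerical check of Parseval's theorem, the following sketch computes the DFT directly from its definition. It assumes the 1/sqrt(D) normalization, which is the one under which the energy is preserved without extra factors. The names `dft` and `energy` and the sample values are hypothetical, and the O(D^2) loop is meant for illustration only, not for large series.

```python
import cmath
import math

def dft(s):
    """DFT with 1/sqrt(D) scaling: S_f = (1/sqrt(D)) * sum_t s_t * exp(-j 2 pi f t / D)."""
    D = len(s)
    scale = 1.0 / math.sqrt(D)
    return [
        scale * sum(s[t] * cmath.exp(-2j * cmath.pi * f * t / D) for t in range(D))
        for f in range(D)
    ]

def energy(x):
    """Sum of squared absolute values; works for real and complex sequences."""
    return sum(abs(v) ** 2 for v in x)

# Toy series (made-up values).
s = [0.5, 0.8, 1.1, 0.9, 0.4, 0.2, 0.3, 0.6]
S = dft(s)

# The two energies agree up to floating-point rounding.
print(energy(s), energy(S))
```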
Dimensionality reduction: DFT

The key observation is that the DFT is a linear transformation. Thus, we have:

$$L_2^2(s, q) = \sum_{t=0}^{D-1} |s_t - q_t|^2 = E(s - q) = E(S - Q) = \sum_{f=0}^{D-1} |S_f - Q_f|^2 = L_2^2(S, Q)$$

where $|S_f - Q_f|$ is the Euclidean distance between $S_f$ and $Q_f$ in the complex plane.

The above just says that DFT also preserves the Euclidean distance.
What can we gain from such a transformation?

[Figure: S_f and Q_f as points in the complex plane (axes Re, Img), at distance |S_f - Q_f|]

Dimensionality reduction: DFT

- The key observation is that, by keeping only a small set of Fourier coefficients, we can obtain a good approximation of the original signal
- This is because most of the energy of many real-world signals concentrates in the low frequencies ([AFS93])
- More precisely, the energy spectrum ($|S_f|^2$ vs. f) behaves as $O(f^{-b})$, b > 0:
  - b = 2 (random walk or brown noise): used to model the behavior of stock movements and currency exchange rates
  - b > 2 (black noise): suitable to model slowly varying natural phenomena (e.g., water levels of rivers)
  - b = 1 (pink noise): according to Birkhoff's theory, musical scores follow this energy pattern
- Thus, by only keeping the first few coefficients (D' << D), an effective dimensionality reduction can be obtained
- Note: this is the basic idea used by well-known compression standards, such as JPEG (which is based on the Discrete Cosine Transform)
- From what we have seen, this "projection" technique satisfies the L-B lemma: dropping coefficients only removes non-negative terms $|S_f - Q_f|^2$ from the sum, so the distance computed on the first D' coefficients never exceeds $L_2(s, q)$
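The lower-bounding property can be checked numerically with a small sketch. It again assumes the 1/sqrt(D)-normalized DFT. The helper names, the parameter `k` (playing the role of D') and the toy series are hypothetical, chosen only for illustration.

```python
import cmath
import math

def dft(s):
    """DFT with 1/sqrt(D) scaling, computed directly from the definition."""
    D = len(s)
    return [
        sum(s[t] * cmath.exp(-2j * cmath.pi * f * t / D) for t in range(D)) / math.sqrt(D)
        for f in range(D)
    ]

def euclidean(a, b):
    """L2 distance; abs() makes it valid for complex coefficients too."""
    return math.sqrt(sum(abs(x - y) ** 2 for x, y in zip(a, b)))

def reduced_distance(s, q, k):
    """Distance computed on the first k Fourier coefficients only."""
    return euclidean(dft(s)[:k], dft(q)[:k])

# Toy series (made-up values).
s = [0.5, 0.8, 1.1, 0.9, 0.4, 0.2, 0.3, 0.6]
q = [0.4, 0.9, 1.0, 1.2, 0.7, 0.1, 0.2, 0.5]

true_distance = euclidean(s, q)
for k in (1, 2, 4, len(s)):
    # Each reduced distance is a lower bound of the true distance.
    print(k, reduced_distance(s, q, k), "<=", true_distance)
```

With k equal to D no coefficient is dropped, so the reduced distance coincides with the true one, up to rounding. For smaller k it can only be smaller, which is what allows filtering in the reduced space without false dismissals.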
An example: EEG data

Sampling rate: 128 Hz

[Figure: EEG time series (4 secs, 512 points) and its energy spectrum]

Another example

[Figure: a time series s (128 points) and s', the approximation of s with 4 Fourier coefficients]

data values   Fourier coefficients   First 4 Fourier coefficients
0.4995        1.5698                 1.5698
0.5264        1.0485                 1.0485
0.5523        0.7160                 0.7160
0.5761        0.8406                 0.8406
0.5973        0.3709                 0.3709
0.6153        0.4670                 0.4670
0.6301        0.2667                 0.2667
0.6420        0.1928                 0.1928
0.6515        0.1635
0.6596        0.1602
0.6672        0.0992
0.6751        0.1282
0.6843        0.1438
0.6954        0.1416
0.7086        0.1400
0.7240        0.1412
0.7412        0.1530
0.7595        0.0795
0.7780        0.1013
0.7956        0.1150
0.8115        0.1801
0.8247        0.1082
0.8345        0.0812
0.8407        0.0347
0.8431        0.0052
0.8423        0.0017
0.8387        0.0002
…             ...
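An approximation such as s' can be rebuilt from a few coefficients roughly as follows. This is a sketch under stated assumptions, not necessarily how the figure was produced: it uses the 1/sqrt(D)-normalized DFT, keeps the first k coefficients together with their mirrored conjugates (needed for the result to be real-valued), and runs on a synthetic series rather than on the data in the table. All names are hypothetical.

```python
import cmath
import math

def dft(s):
    """DFT with 1/sqrt(D) scaling, computed directly from the definition."""
    D = len(s)
    return [
        sum(s[t] * cmath.exp(-2j * cmath.pi * f * t / D) for t in range(D)) / math.sqrt(D)
        for f in range(D)
    ]

def approximate(s, k):
    """Rebuild s from coefficients 0..k-1 and their mirrors D-f; all others are zeroed."""
    D = len(s)
    S = dft(s)
    keep = set(range(k)) | {(D - f) % D for f in range(1, k)}
    kept = [S[f] if f in keep else 0j for f in range(D)]
    # Inverse transform; the kept set is conjugate-symmetric, so the result is real.
    return [
        (sum(kept[f] * cmath.exp(2j * cmath.pi * f * t / D) for f in range(D)) / math.sqrt(D)).real
        for t in range(D)
    ]

def approximation_error(s, k):
    """Euclidean distance between s and its k-coefficient approximation."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(s, approximate(s, k))))

# Synthetic series made of two low-frequency sinusoids (frequencies 1 and 3).
s = [math.sin(2 * math.pi * t / 32) + 0.3 * math.sin(2 * math.pi * 3 * t / 32) for t in range(32)]

for k in (1, 2, 4):
    print(k, approximation_error(s, k))
```

Because this synthetic series contains only frequencies 1 and 3, four coefficients (f = 0..3) are enough to recover it almost exactly, while two coefficients miss the faster component.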