Information Systems M
Prof. Paolo Ciaccia
http://www-db.deis.unibo.it/courses/SI-M/


SLIDE 1

  • Information Systems M
  • Prof. Paolo Ciaccia

http://www-db.deis.unibo.it/courses/SI-M/

  • Time series, that is, sequences of observations (samples) made through time, are present in everyday life:

  • Temperature, rainfalls, seismic traces
  • Weblogs
  • Stock prices
  • EEG, ECG, blood pressure
  • Enrolled students at the Engineering Faculty

This as well as many of the following figures/examples are taken from the tutorial given by Eamonn Keogh at SBBD 2002 (XVII Brazilian Symposium on Databases) www.cs.ucr.edu/~eamonn/

SLIDE 2

  • Similarity search can help you in:
  • Looking for the occurrence of known patterns
  • Discovering unknown patterns
  • Putting “things together” (clustering)
  • Classifying new data
  • Predicting/extrapolating future behaviors
  • Consider a large time series DB, e.g.:
  • 1 hour of ECG data: 1 GByte
  • Typical Weblog: 5 GBytes per week
  • Space Shuttle DB: 158 GBytes
  • MACHO Astronomical DB: 2 Tbytes, updated with 3 GBytes a day

(20 million stars recorded nightly for 4 years) http://wwwmacho.anu.edu.au/

  • First problem: large database size
  • Second problem: subjectivity of similarity evaluation
SLIDE 3

  • Given two time series of equal length D, the usual way to measure their

(dis-)similarity is based on Euclidean distance

  • However, with Euclidean distance we have to face two basic problems
  • 1. High dimensionality: (very) large D values
  • 2. Sensitivity to “alignment of values”
  • For problem 1. we need to define

effective lower-bounding techniques that work in a (much) lower dimensional space

  • For problem 2. we will introduce

a new similarity criterion

L2(q, s) = √( Σ_{t=1}^{D} (q_t − s_t)² )

  • Directly comparing two time series (e.g., using Euclidean distance) might lead

to counter-intuitive results

  • This is because Euclidean distance, like other measures, is very

sensitive to some data features that might conflict with the subjective notion of similarity

  • Thus, a good idea is to pre-process time series in order to remove such

features

SLIDE 4

  • Subtract from each sample the mean value of the series
q = q - mean(q)
s = s - mean(s)

[Figure: the two series before and after offset translation, with distance d(q,s)]


  • Normalize the amplitude (divide by the standard deviation of the series)
q = (q - mean(q)) / std(q)
s = (s - mean(s)) / std(s)

[Figure: the two series before and after amplitude scaling]
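The two pre-processing steps above (offset translation followed by amplitude scaling, i.e., z-normalization) can be sketched in a few lines of Python; the function name z_normalize and the toy series are ours, not from the slides:

```python
import statistics

def z_normalize(series):
    # Offset translation: subtract the mean of the series;
    # amplitude scaling: divide by its standard deviation.
    mu = statistics.fmean(series)
    sigma = statistics.pstdev(series)
    return [(x - mu) / sigma for x in series]

q = [3.0, 2.0, 4.0, 6.0, 5.0, 6.0, 2.0, 2.0]
qn = z_normalize(q)
# qn now has mean 0 and (population) standard deviation 1
```

After this transformation any two series are compared on shape alone, regardless of their original offset and scale.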

SLIDE 5

  • Average the values of each sample with those of its neighbors (smoothing)
q = smooth(q)
s = smooth(s)

[Figure: the two series before and after smoothing]
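A minimal moving-average smoother along these lines (the name smooth, the default of one neighbor per side, and the border-clipping policy are our own choices):

```python
def smooth(series, k=1):
    # Average each sample with its k neighbors on each side;
    # the window is clipped at the borders of the series.
    out = []
    for i in range(len(series)):
        lo, hi = max(0, i - k), min(len(series), i + k + 1)
        out.append(sum(series[lo:hi]) / (hi - lo))
    return out

s = [1.0, 5.0, 1.0, 5.0, 1.0]
s_smooth = smooth(s)  # high-frequency oscillation is damped
```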


  • The first method for reducing the dimensionality of time series, proposed in

[AFS93], was based on the Discrete Fourier Transform (DFT)

  • Recall: given a time series s, the Fourier coefficients are complex numbers (amplitude, phase), defined as:

S_f = (1/√D) Σ_{t=0}^{D−1} s_t · exp(−j 2π f t / D),   f = 0, …, D−1

  • Parseval theorem: DFT preserves the energy of the signal:

E(s) = Σ_{t=0}^{D−1} s_t² = Σ_{f=0}^{D−1} |S_f|² = E(S)

where |S_f| = √(Re(S_f)² + Im(S_f)²) is the absolute value of S_f (its amplitude), which equals the Euclidean distance of S_f from the origin

[Figure: S_f in the complex plane, with axes Re and Im and amplitude |S_f|]
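A direct O(D²) sketch of this definition, checking Parseval numerically; the helper name dft and the toy series are ours, and we assume the energy-preserving 1/√D normalization (which is what makes E(s) = E(S) hold exactly):

```python
import cmath, math

def dft(s):
    # S_f = (1/sqrt(D)) * sum_{t=0}^{D-1} s_t * exp(-j*2*pi*f*t/D)
    D = len(s)
    return [sum(s[t] * cmath.exp(-2j * math.pi * f * t / D)
                for t in range(D)) / math.sqrt(D)
            for f in range(D)]

s = [3.0, 2.0, 4.0, 6.0, 5.0, 6.0, 2.0, 2.0]
S = dft(s)
energy_time = sum(x * x for x in s)          # E(s)
energy_freq = sum(abs(Sf) ** 2 for Sf in S)  # E(S); equal by Parseval
```

In practice one would use an FFT routine instead of this quadratic loop; the loop only serves to mirror the formula.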

SLIDE 6


  • The key observation is that the DFT is a linear transformation
  • Thus, we have:

where |Sf - Qf | is the Euclidean distance between Sf and Qf in the complex plane

  • The above just says that DFT also preserves the Euclidean distance

What can we gain from such transformation?

L2²(s, q) = Σ_{t=0}^{D−1} (s_t − q_t)² = E(s − q) = E(S − Q) = Σ_{f=0}^{D−1} |S_f − Q_f|² = L2²(S, Q)

[Figure: S_f and Q_f in the complex plane, with their distance |S_f − Q_f|]
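A small self-contained numerical check of this distance-preservation property (the dft helper re-implements the 1/√D-normalized definition from the previous slide; series values are our own toy data):

```python
import cmath, math

def dft(s):
    # DFT with the energy-preserving 1/sqrt(D) normalization
    D = len(s)
    return [sum(s[t] * cmath.exp(-2j * math.pi * f * t / D)
                for t in range(D)) / math.sqrt(D)
            for f in range(D)]

def l2(a, b):
    # Euclidean distance; abs() also handles complex coordinates
    return math.sqrt(sum(abs(x - y) ** 2 for x, y in zip(a, b)))

s = [3.0, 2.0, 4.0, 6.0, 5.0, 6.0, 2.0, 2.0]
q = [1.0, 2.0, 2.0, 4.0, 7.0, 6.0, 3.0, 1.0]
d_time = l2(s, q)            # L2(s, q) in the time domain
d_freq = l2(dft(s), dft(q))  # L2(S, Q) in the frequency domain: identical
```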


  • The key observation is that, by keeping only a small set of Fourier coefficients,

we can obtain a good approximation of the original signal

  • This is because most of the energy of many real-world signals concentrates in

the low frequencies ([AFS93]):

  • More precisely, the energy spectrum (|S_f|² vs. f) behaves as O(f^(−b)), b > 0:
  • b = 2 (random walk or brown noise): used to model the behavior of stock

movements and currency exchange rates

  • b > 2 (black noise): suitable to model slowly varying natural phenomena

(e.g., water levels of rivers)

  • b = 1 (pink noise): according to Birkhoff’s theory, musical scores follow this

energy pattern

  • Thus, by only keeping the first few coefficients (D’ << D), an effective

dimensionality reduction can be obtained

  • Note: this is the basic idea used by well-known compression standards, such

as JPEG (which is based on Discrete Cosine Transform)

  • From what we have seen, this “projection” technique satisfies the lower-bounding (L-B) lemma
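A minimal illustration of why keeping the first D' coefficients lower-bounds the true distance (same assumptions as before: 1/√D-normalized DFT, toy series of our own choosing): dropping terms from a sum of squares can only shrink it, so the truncated distance never overestimates.

```python
import cmath, math

def dft(s):
    # DFT with the energy-preserving 1/sqrt(D) normalization
    D = len(s)
    return [sum(s[t] * cmath.exp(-2j * math.pi * f * t / D)
                for t in range(D)) / math.sqrt(D)
            for f in range(D)]

def l2(a, b):
    return math.sqrt(sum(abs(x - y) ** 2 for x, y in zip(a, b)))

s = [3.0, 2.0, 4.0, 6.0, 5.0, 6.0, 2.0, 2.0]
q = [1.0, 2.0, 2.0, 4.0, 7.0, 6.0, 3.0, 1.0]

D_prime = 2  # keep only the first D' = 2 coefficients
d_approx = l2(dft(s)[:D_prime], dft(q)[:D_prime])
d_true = l2(s, q)
# d_approx <= d_true: no false dismissals when filtering with d_approx
```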
SLIDE 7

  • Sampling rate: 128 Hz
  • Time series (4 secs, 512 points)

Energy spectrum

  • 128 points
  • s’ = approximation of s with 4 Fourier coefficients

[Figure: s and its approximation s’]

data values (first samples):
0.4995 0.5264 0.5523 0.5761 0.5973 0.6153 0.6301 0.6420 0.6515 0.6596 0.6672 0.6751 0.6843 0.6954 0.7086 0.7240 0.7412 0.7595 0.7780 0.7956 0.8115 0.8247 0.8345 0.8407 0.8431 0.8423 0.8387 …

Fourier coefficients:
1.5698 1.0485 0.7160 0.8406 0.3709 0.4670 0.2667 0.1928 0.1635 0.1602 0.0992 0.1282 0.1438 0.1416 0.1400 0.1412 0.1530 0.0795 0.1013 0.1150 0.1801 0.1082 0.0812 0.0347 0.0052 0.0017 0.0002 ...

First 4 Fourier coefficients:
1.5698 1.0485 0.7160 0.8406 0.3709 0.4670 0.2667 0.1928

SLIDE 8

  • ☺ Can be computed in O(D log D) time using FFT (provided D is a power of 2)
  • Difficult to use if one wants to deal with sequences of different length
  • Not really amenable to deal with “signals with spots” (time-varying energy)

An alternative to DFT is to use wavelets, which take a different perspective:

  • A signal can be represented as a sum of contributions, each at a different

resolution level

  • Discrete Wavelet Transform (DWT) can be computed in O(D) time

Also used in the JPEG2000 standard for compressing images

Experimental results however show that the superiority of DWT w.r.t. DFT is

dependent on the specific dataset

[Figure: two example series, one “good for wavelets, bad for Fourier” and one “good for Fourier, bad for wavelets”]

  • As with DFT, the time series is decomposed into a linear combination of base elements
  • The Haar DWT, applied to a series of length D = 2^n, pairs up samples, stores their difference and passes their average to the next stage
  • This process is repeated recursively, which leads to 2^n − 1 difference values and one final average (which is 0 for normalized series)

s = (3, 2, 4, 6, 5, 6, 2, 2)
Averages:    (2.5, 5, 5.5, 2) → (3.75, 3.75) → (3.75)
Differences: (1, −2, −1, 0) → (−2.5, 3.5) → (0)

[Figure: the Haar basis elements Haar 0 … Haar 7, and a series X with its DWT approximation X']
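The recursive pairing can be sketched as follows, reproducing the example s = (3, 2, 4, 6, 5, 6, 2, 2); note this keeps plain differences and averages as on the slide, not the orthonormally scaled Haar coefficients:

```python
def haar_dwt(s):
    # One full Haar decomposition: pair up samples, store the
    # pairwise differences, and recurse on the pairwise averages.
    diffs, avgs = [], list(s)
    while len(avgs) > 1:
        pairs = list(zip(avgs[0::2], avgs[1::2]))
        diffs.append([a - b for a, b in pairs])
        avgs = [(a + b) / 2 for a, b in pairs]
    return diffs, avgs[0]

diffs, final_avg = haar_dwt([3, 2, 4, 6, 5, 6, 2, 2])
# diffs: [[1, -2, -1, 0], [-2.5, 3.5], [0.0]], final_avg: 3.75
```

A series of length D = 2^n yields 4 + 2 + 1 = 7 = 2^n − 1 differences here, plus the single final average.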

SLIDE 9

  • PAA (Piecewise Aggregate Approximation) [KCP+00,YF00] is a very simple,

intuitive and fast, O(D), method to approximate time series

  • Its performance is comparable to that of DFT and DWT
  • We take a window of size W and segment our time series into D’ = D/W

“pieces” (sub-sequences), each of size W

  • For each piece, we compute the average of its values, i.e.:

s'_i = (1/W) Σ_{t=(i−1)·W+1}^{i·W} s_t

  • Our approximation is therefore s' = (s'_1, …, s'_D')
  • We have √W · L2(s', q') ≤ L2(s, q) (arguments generalize those used for the “global average” example)
  • The same can be generalized to work with arbitrary Lp-norms [YF00]

[Figure: a series s, its PAA approximation s', and the window size W]
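PAA is short enough to sketch in full, together with a check of the √W lower bound (function names and toy series are ours; we assume D is a multiple of W):

```python
import math

def paa(s, W):
    # Segment s into D' = D/W contiguous pieces of size W
    # and replace each piece by its average.
    return [sum(s[i:i + W]) / W for i in range(0, len(s), W)]

def l2(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

s = [3.0, 2.0, 4.0, 6.0, 5.0, 6.0, 2.0, 2.0]
q = [1.0, 2.0, 2.0, 4.0, 7.0, 6.0, 3.0, 1.0]
W = 2
lower_bound = math.sqrt(W) * l2(paa(s, W), paa(q, W))
# lower_bound <= l2(s, q), so PAA filtering never causes false dismissals
```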

  • The graph shows the fraction of data (indexed by an R-tree) that must be retrieved from disk to answer a 1-NN query
  • “mixed” dataset

[Figure: fraction of retrieved data vs. number of kept coefficients (16 to 1024), for DFT, DWT, and PAA]

  • On most datasets,

the three methods yield very similar results

  • The actual response

time may vary, even for a given method, depending on the implementation…

SLIDE 10

  • Consider a sequential 1-NN algorithm based on Euclidean distance, and let r

be the lowest distance found so far

  • Two basic optimization techniques are
  • 1. Avoid taking the square root, i.e., use squared Euclidean distance

This does not influence the result

  • 2. Early-terminate the evaluation of the distance on a series s if the so-far accumulated distance is ≥ r

  • Clearly, these optimizations are not

peculiar to time series

[Figure: response time (seconds) vs. number of objects (10,000 to 100,000) for Euclid, Opt1, and Opt2]
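The sequential scan with both optimizations can be sketched as follows (the function name nn_search and the toy database are ours; the running threshold is kept squared, matching optimization 1):

```python
def nn_search(query, database):
    # Sequential 1-NN scan with the two optimizations:
    # 1. accumulate the squared distance (no square root);
    # 2. early-terminate on a series as soon as the partial sum
    #    reaches the best squared distance found so far.
    best_sq, best = float('inf'), None
    for s in database:
        acc = 0.0
        for qt, st in zip(query, s):
            acc += (qt - st) ** 2
            if acc >= best_sq:   # early termination: s cannot win
                break
        else:                    # loop completed: s is the new best
            best_sq, best = acc, s
    return best, best_sq ** 0.5  # take the root once, at the end

best, dist = nn_search([1.0, 2.0, 3.0],
                       [[1.0, 2.0, 5.0], [9.0, 9.0, 9.0], [1.0, 2.0, 3.5]])
# best -> [1.0, 2.0, 3.5], dist -> 0.5
```

Neither optimization changes the result: squaring is monotone, and an abandoned series is provably no closer than the current best.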