Information Systems M
Prof. Paolo Ciaccia
http://www-db.deis.unibo.it/courses/SI-M/


SLIDE 1

  • Information Systems M
  • Prof. Paolo Ciaccia

http://www-db.deis.unibo.it/courses/SI-M/

  • Time series, that is, sequences of observations (samples) made through time, are present in everyday life:

  • Temperature, rainfalls, seismic traces
  • Weblogs
  • Stock prices
  • EEG, ECG, blood pressure
  • Enrolled students at the Engineering Faculty

This as well as many of the following figures/examples are taken from the tutorial given by Eamonn Keogh at SBBD 2002 (XVII Brazilian Symposium on Databases) www.cs.ucr.edu/~eamonn/

SLIDE 2

  • Similarity search can help you in:
  • Looking for the occurrence of known patterns
  • Discovering unknown patterns
  • Putting “things together” (clustering)
  • Classifying new data
  • Predicting/extrapolating future behaviors
  • Consider a large time series DB, e.g.:
  • 1 hour of ECG data: 1 GByte
  • Typical Weblog: 5 GBytes per week
  • Space Shuttle DB: 158 GBytes
  • MACHO Astronomical DB: 2 Tbytes, updated with 3 GBytes a day

(20 million stars recorded nightly for 4 years) http://wwwmacho.anu.edu.au/

  • First problem: large database size
  • Second problem: subjectivity of similarity evaluation
SLIDE 3

  • Given two time series of equal length D, the usual way to measure their

(dis-)similarity is based on Euclidean distance

  • However, with Euclidean distance we have to face two basic problems
  • 1. High dimensionality: (very) large D values
  • 2. Sensitivity to “alignment of values”
  • For problem 1. we need to define

effective lower-bounding techniques that work in a (much) lower dimensional space

  • For problem 2. we will introduce

a new similarity criterion

L2(q, s) = √( Σ_{t=1}^{D} (q_t − s_t)² )

  • Directly comparing two time series (e.g., using Euclidean distance) might lead

to counter-intuitive results

  • This is because Euclidean distance, like other measures, is very

sensitive to some data features that might conflict with the subjective notion of similarity

  • Thus, a good idea is to pre-process time series in order to remove such

features

SLIDE 4

  • Subtract from each sample the mean value of the series
q = q - mean(q)
s = s - mean(s)

[Figure: the two series before and after offset translation, with distance d(q,s)]


  • Normalize the amplitude (divide by the standard deviation of the series)
q = (q - mean(q)) / std(q)
s = (s - mean(s)) / std(s)

[Figure: the two series before and after amplitude scaling]
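The two pre-processing steps above (offset translation followed by amplitude scaling, i.e., z-normalization) can be sketched in a few lines of Python; the function name z_normalize and the toy series are ours, not from the slides:

```python
import statistics

def z_normalize(series):
    # Offset translation: subtract the mean of the series;
    # amplitude scaling: divide by its standard deviation.
    mu = statistics.fmean(series)
    sigma = statistics.pstdev(series)
    return [(x - mu) / sigma for x in series]

q = [3.0, 2.0, 4.0, 6.0, 5.0, 6.0, 2.0, 2.0]
qn = z_normalize(q)
# qn now has mean 0 and (population) standard deviation 1
```

After this transformation any two series are compared on shape alone, regardless of their original offset and scale.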

SLIDE 5

  • Average the values of each sample with those of its neighbors (smoothing)
q = smooth(q)
s = smooth(s)

[Figure: the two series before and after smoothing]
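A minimal moving-average smoother along these lines (the name smooth, the default of one neighbor per side, and the border-clipping policy are our own choices):

```python
def smooth(series, k=1):
    # Average each sample with its k neighbors on each side;
    # the window is clipped at the borders of the series.
    out = []
    for i in range(len(series)):
        lo, hi = max(0, i - k), min(len(series), i + k + 1)
        out.append(sum(series[lo:hi]) / (hi - lo))
    return out

s = [1.0, 5.0, 1.0, 5.0, 1.0]
s_smooth = smooth(s)  # high-frequency oscillation is damped
```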


  • The first method for reducing the dimensionality of time series, proposed in

[AFS93], was based on the Discrete Fourier Transform (DFT)

  • Recall: given a time series s, the Fourier coefficients are complex numbers (amplitude, phase), defined as:

S_f = (1/√D) Σ_{t=0}^{D−1} s_t · exp(−j 2π f t / D),   f = 0, …, D−1

  • Parseval theorem: DFT preserves the energy of the signal:

E(s) = Σ_{t=0}^{D−1} s_t² = Σ_{f=0}^{D−1} |S_f|² = E(S)

where |S_f| = √(Re(S_f)² + Im(S_f)²) is the absolute value of S_f (its amplitude), which equals the Euclidean distance of S_f from the origin

[Figure: S_f in the complex plane, with axes Re and Im and amplitude |S_f|]
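A direct O(D²) sketch of this definition, checking Parseval numerically; the helper name dft and the toy series are ours, and we assume the energy-preserving 1/√D normalization (which is what makes E(s) = E(S) hold exactly):

```python
import cmath, math

def dft(s):
    # S_f = (1/sqrt(D)) * sum_{t=0}^{D-1} s_t * exp(-j*2*pi*f*t/D)
    D = len(s)
    return [sum(s[t] * cmath.exp(-2j * math.pi * f * t / D)
                for t in range(D)) / math.sqrt(D)
            for f in range(D)]

s = [3.0, 2.0, 4.0, 6.0, 5.0, 6.0, 2.0, 2.0]
S = dft(s)
energy_time = sum(x * x for x in s)          # E(s)
energy_freq = sum(abs(Sf) ** 2 for Sf in S)  # E(S); equal by Parseval
```

In practice one would use an FFT routine instead of this quadratic loop; the loop only serves to mirror the formula.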

SLIDE 6


  • The key observation is that the DFT is a linear transformation
  • Thus, we have:

where |Sf - Qf | is the Euclidean distance between Sf and Qf in the complex plane

  • The above just says that DFT also preserves the Euclidean distance

What can we gain from such transformation?

L2²(s, q) = Σ_{t=0}^{D−1} (s_t − q_t)² = E(s − q) = E(S − Q) = Σ_{f=0}^{D−1} |S_f − Q_f|² = L2²(S, Q)

[Figure: S_f and Q_f in the complex plane, with their distance |S_f − Q_f|]
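A small self-contained numerical check of this distance-preservation property (the dft helper re-implements the 1/√D-normalized definition from the previous slide; series values are our own toy data):

```python
import cmath, math

def dft(s):
    # DFT with the energy-preserving 1/sqrt(D) normalization
    D = len(s)
    return [sum(s[t] * cmath.exp(-2j * math.pi * f * t / D)
                for t in range(D)) / math.sqrt(D)
            for f in range(D)]

def l2(a, b):
    # Euclidean distance; abs() also handles complex coordinates
    return math.sqrt(sum(abs(x - y) ** 2 for x, y in zip(a, b)))

s = [3.0, 2.0, 4.0, 6.0, 5.0, 6.0, 2.0, 2.0]
q = [1.0, 2.0, 2.0, 4.0, 7.0, 6.0, 3.0, 1.0]
d_time = l2(s, q)            # L2(s, q) in the time domain
d_freq = l2(dft(s), dft(q))  # L2(S, Q) in the frequency domain: identical
```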


  • The key observation is that, by keeping only a small set of Fourier coefficients,

we can obtain a good approximation of the original signal

  • This is because most of the energy of many real-world signals concentrates in

the low frequencies ([AFS93]):

  • More precisely, the energy spectrum (|S_f|² vs. f) behaves as O(f^(−b)), b > 0:
  • b = 2 (random walk or brown noise): used to model the behavior of stock

movements and currency exchange rates

  • b > 2 (black noise): suitable to model slowly varying natural phenomena

(e.g., water levels of rivers)

  • b = 1 (pink noise): according to Birkhoff’s theory, musical scores follow this

energy pattern

  • Thus, by only keeping the first few coefficients (D’ << D), an effective

dimensionality reduction can be obtained

  • Note: this is the basic idea used by well-known compression standards, such

as JPEG (which is based on Discrete Cosine Transform)

  • From what we have seen, this “projection” technique satisfies the lower-bounding (L-B) lemma
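A minimal illustration of why keeping the first D' coefficients lower-bounds the true distance (same assumptions as before: 1/√D-normalized DFT, toy series of our own choosing): dropping terms from a sum of squares can only shrink it, so the truncated distance never overestimates.

```python
import cmath, math

def dft(s):
    # DFT with the energy-preserving 1/sqrt(D) normalization
    D = len(s)
    return [sum(s[t] * cmath.exp(-2j * math.pi * f * t / D)
                for t in range(D)) / math.sqrt(D)
            for f in range(D)]

def l2(a, b):
    return math.sqrt(sum(abs(x - y) ** 2 for x, y in zip(a, b)))

s = [3.0, 2.0, 4.0, 6.0, 5.0, 6.0, 2.0, 2.0]
q = [1.0, 2.0, 2.0, 4.0, 7.0, 6.0, 3.0, 1.0]

D_prime = 2  # keep only the first D' = 2 coefficients
d_approx = l2(dft(s)[:D_prime], dft(q)[:D_prime])
d_true = l2(s, q)
# d_approx <= d_true: no false dismissals when filtering with d_approx
```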
SLIDE 7

  • Sampling rate: 128 Hz
  • Time series (4 secs, 512 points)

Energy spectrum

  • 128 points
  • s’ = approximation of s with 4 Fourier coefficients

[Figure: s and its approximation s’]

data values (first samples):
0.4995 0.5264 0.5523 0.5761 0.5973 0.6153 0.6301 0.6420 0.6515 0.6596 0.6672 0.6751 0.6843 0.6954 0.7086 0.7240 0.7412 0.7595 0.7780 0.7956 0.8115 0.8247 0.8345 0.8407 0.8431 0.8423 0.8387 …

Fourier coefficients:
1.5698 1.0485 0.7160 0.8406 0.3709 0.4670 0.2667 0.1928 0.1635 0.1602 0.0992 0.1282 0.1438 0.1416 0.1400 0.1412 0.1530 0.0795 0.1013 0.1150 0.1801 0.1082 0.0812 0.0347 0.0052 0.0017 0.0002 ...

First 4 Fourier coefficients:
1.5698 1.0485 0.7160 0.8406 0.3709 0.4670 0.2667 0.1928

SLIDE 8

  • ☺ Can be computed in O(D log D) time using FFT (provided D is a power of 2)
  • Difficult to use if one wants to deal with sequences of different length
  • Not really amenable to deal with “signals with spots” (time-varying energy)

An alternative to DFT is to use wavelets, which take a different perspective:

  • A signal can be represented as a sum of contributions, each at a different

resolution level

  • Discrete Wavelet Transform (DWT) can be computed in O(D) time

Also used in the JPEG2000 standard for compressing images

Experimental results however show that the superiority of DWT w.r.t. DFT is

dependent on the specific dataset

[Figure: two example series, one “good for wavelets, bad for Fourier” and one “good for Fourier, bad for wavelets”]

  • As with DFT, the time series is decomposed into a linear combination of base elements
  • The Haar DWT, applied to a series of length D = 2^n, pairs up samples, stores their difference and passes their average to the next stage
  • This process is repeated recursively, which leads to 2^n − 1 difference values and one final average (which is 0 for normalized series)

s = (3, 2, 4, 6, 5, 6, 2, 2)
Averages:    (2.5, 5, 5.5, 2) → (3.75, 3.75) → (3.75)
Differences: (1, −2, −1, 0) → (−2.5, 3.5) → (0)

[Figure: the Haar basis elements Haar 0 … Haar 7, and a series X with its DWT approximation X']
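The recursive pairing can be sketched as follows, reproducing the example s = (3, 2, 4, 6, 5, 6, 2, 2); note this keeps plain differences and averages as on the slide, not the orthonormally scaled Haar coefficients:

```python
def haar_dwt(s):
    # One full Haar decomposition: pair up samples, store the
    # pairwise differences, and recurse on the pairwise averages.
    diffs, avgs = [], list(s)
    while len(avgs) > 1:
        pairs = list(zip(avgs[0::2], avgs[1::2]))
        diffs.append([a - b for a, b in pairs])
        avgs = [(a + b) / 2 for a, b in pairs]
    return diffs, avgs[0]

diffs, final_avg = haar_dwt([3, 2, 4, 6, 5, 6, 2, 2])
# diffs: [[1, -2, -1, 0], [-2.5, 3.5], [0.0]], final_avg: 3.75
```

A series of length D = 2^n yields 4 + 2 + 1 = 7 = 2^n − 1 differences here, plus the single final average.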

SLIDE 9

  • PAA (Piecewise Aggregate Approximation) [KCP+00,YF00] is a very simple,

intuitive and fast, O(D), method to approximate time series

  • Its performance is comparable to that of DFT and DWT
  • We take a window of size W and segment our time series into D’ = D/W

“pieces” (sub-sequences), each of size W

  • For each piece, we compute the average of its values, i.e.:

s'_i = (1/W) Σ_{t=(i−1)·W+1}^{i·W} s_t

  • Our approximation is therefore s' = (s'_1, …, s'_D')
  • We have √W · L2(s', q') ≤ L2(s, q) (arguments generalize those used for the “global average” example)
  • The same can be generalized to work with arbitrary Lp-norms [YF00]

[Figure: a series s, its PAA approximation s', and the window size W]
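PAA is short enough to sketch in full, together with a check of the √W lower bound (function names and toy series are ours; we assume D is a multiple of W):

```python
import math

def paa(s, W):
    # Segment s into D' = D/W contiguous pieces of size W
    # and replace each piece by its average.
    return [sum(s[i:i + W]) / W for i in range(0, len(s), W)]

def l2(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

s = [3.0, 2.0, 4.0, 6.0, 5.0, 6.0, 2.0, 2.0]
q = [1.0, 2.0, 2.0, 4.0, 7.0, 6.0, 3.0, 1.0]
W = 2
lower_bound = math.sqrt(W) * l2(paa(s, W), paa(q, W))
# lower_bound <= l2(s, q), so PAA filtering never causes false dismissals
```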

  • The graph shows the fraction of data (indexed by an R-tree) that must be retrieved from disk to answer a 1-NN query
  • “mixed” dataset

[Figure: fraction of retrieved data vs. number of kept coefficients (16 to 1024), for DFT, DWT, and PAA]

  • On most datasets,

the three methods yield very similar results

  • The actual response

time may vary, even for a given method, depending on the implementation…

SLIDE 10

  • Consider a sequential 1-NN algorithm based on Euclidean distance, and let r

be the lowest distance found so far

  • Two basic optimization techniques are
  • 1. Avoid taking the square root, i.e., use squared Euclidean distance

This does not influence the result

  • 2. Early-terminate the evaluation of the distance on a series s if the so-far accumulated distance is ≥ r

  • Clearly, these optimizations are not

peculiar to time series

[Figure: response time (seconds) vs. number of objects (10,000 to 100,000) for Euclid, Opt1, and Opt2]
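The sequential scan with both optimizations can be sketched as follows (the function name nn_search and the toy database are ours; the running threshold is kept squared, matching optimization 1):

```python
def nn_search(query, database):
    # Sequential 1-NN scan with the two optimizations:
    # 1. accumulate the squared distance (no square root);
    # 2. early-terminate on a series as soon as the partial sum
    #    reaches the best squared distance found so far.
    best_sq, best = float('inf'), None
    for s in database:
        acc = 0.0
        for qt, st in zip(query, s):
            acc += (qt - st) ** 2
            if acc >= best_sq:   # early termination: s cannot win
                break
        else:                    # loop completed: s is the new best
            best_sq, best = acc, s
    return best, best_sq ** 0.5  # take the root once, at the end

best, dist = nn_search([1.0, 2.0, 3.0],
                       [[1.0, 2.0, 5.0], [9.0, 9.0, 9.0], [1.0, 2.0, 3.5]])
# best -> [1.0, 2.0, 3.5], dist -> 0.5
```

Neither optimization changes the result: squaring is monotone, and an abandoned series is provably no closer than the current best.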