SLIDE 1 CSE 6242 / CX 4242
Time Series Mining and Forecasting
Duen Horng (Polo) Chau, Georgia Tech
Slides based on Prof. Christos Faloutsos’s materials
SLIDE 2 Outline
- Motivation
- Similarity search – distance functions
- Linear Forecasting
- Non-linear forecasting
- Conclusions
SLIDE 3 Problem definition
- Given: one or more sequences x1, x2, …, xt, … (y1, y2, …, yt, …) (…)
- Find: similar sequences; forecasts; patterns; clusters; outliers
SLIDE 4 Motivation - Applications
- Financial, sales, economic series
- Medical
– ECGs; blood pressure, etc. monitoring
– reactions to new drugs
– elderly care
SLIDE 5 Motivation - Applications (cont’d)
– sensors monitor temperature, humidity, air quality
SLIDE 6 Motivation - Applications (cont’d)
- Weather, environment/anti-pollution
– volcano monitoring
– air/water pollutant monitoring
SLIDE 7 Motivation - Applications (cont’d)
– ‘Active Disks’ (buffering, prefetching)
– web servers (ditto)
– network traffic monitoring
– ...
SLIDE 8
Stream Data: Disk accesses
[plot: #bytes vs. time]
SLIDE 9 Problem #1:
Goal: given a signal (e.g., #packets over time)
Find: patterns, periodicities, and/or compress
[plot: count of lynx caught per year vs. year (similarly: packets per day; temperature per day)]
SLIDE 10 Problem#2: Forecast
Given xt, xt-1, …, forecast xt+1
[plot: number of packets sent vs. time tick, with ‘??’ at the next tick]
SLIDE 11 Problem#2’: Similarity search
E.g., find a 3-tick pattern, similar to the last one
[plot: number of packets sent vs. time tick, with ‘??’ over the last 3 ticks]
SLIDE 12 Problem #3:
- Given: A set of correlated time sequences
- Forecast ‘Sent(t)’
[plot: number of packets vs. time tick for three correlated sequences: ‘sent’, ‘lost’, ‘repeated’]
SLIDE 13 Important observations
Patterns, rules, forecasting and similarity indexing are closely related:
- To do forecasting, we need
– to find patterns/rules
– to find similar settings in the past
- To find outliers, we need to have forecasts
– (outlier = too far away from our forecast)
SLIDE 14 Outline
- Motivation
- Similarity Search and Indexing
- Linear Forecasting
- Non-linear forecasting
- Conclusions
SLIDE 15 Outline
- Motivation
- Similarity search and distance functions
– Euclidean
– Time-warping
SLIDE 16 Importance of distance functions
Subtle, but absolutely necessary:
- A ‘must’ for similarity indexing (-> forecasting)
- Two major families:
– Euclidean and Lp norms
– Time warping and variations
SLIDE 17 Euclidean and Lp
D(x, y) = sqrt( Σ_{i=1..n} (x_i − y_i)^2 )

L_p(x, y) = ( Σ_{i=1..n} |x_i − y_i|^p )^(1/p)

L1: city-block = Manhattan; L2 = Euclidean; L∞
[figure: two aligned sequences x(t), y(t), compared tick by tick]
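A minimal sketch of this distance family in Python, assuming plain lists of equal length (the example sequences are made up for illustration):

```python
# L_p distances between two equal-length sequences (standard library only).
import math

def lp_distance(x, y, p=2):
    """L_p distance: (sum_i |x_i - y_i|^p)^(1/p).
    p=1 -> city-block (Manhattan), p=2 -> Euclidean."""
    assert len(x) == len(y)
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

def linf_distance(x, y):
    """L_inf distance: the largest per-tick difference."""
    return max(abs(a - b) for a, b in zip(x, y))

x = [1.0, 2.0, 3.0]
y = [1.0, 4.0, 3.0]
print(lp_distance(x, y, p=1))  # 2.0 (Manhattan)
print(lp_distance(x, y, p=2))  # 2.0 (Euclidean; only one tick differs)
print(linf_distance(x, y))     # 2.0
```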
SLIDE 18 Observation #1
Treat each sequence of n values as one vector.
[figure: a length-n sequence as a point in n-dimensional space, axes Day-1, Day-2, …, Day-n]
SLIDE 19 Observation #2
Euclidean distance is closely related to
– cosine similarity
– dot product
– ‘cross-correlation’ function
[figure: a length-n sequence as a vector, axes Day-1, Day-2, …, Day-n]
SLIDE 20 Time Warping
- Allow accelerations and decelerations
– (with or without penalty)
- THEN compute the (Euclidean) distance (+ penalty)
- Related to the string-editing distance
SLIDE 21
Time Warping
‘stutters’: one element of one sequence matched against several consecutive elements of the other
[figure: two warped sequences aligned with stutters]
SLIDE 22 Time warping
Q: how to compute it?
A: dynamic programming. D(i, j) = cost to match prefix of length i of first sequence x with prefix of length j of second sequence y
SLIDE 23
Thus, with no penalty for stutter, for sequences x1, x2, …, xi; y1, y2, …, yj:

D(i, j) = ||x[i] − y[j]|| + min{ D(i, j−1) (x-stutter), D(i−1, j) (y-stutter), D(i−1, j−1) (no stutter) }

Time warping
SLIDE 24
VERY SIMILAR to the string-editing distance:

D(i, j) = ||x[i] − y[j]|| + min{ D(i, j−1) (x-stutter), D(i−1, j) (y-stutter), D(i−1, j−1) (no stutter) }

Time warping
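The time-warping recurrence can be computed with a small dynamic-programming table; a minimal sketch with no stutter penalty (the example sequences are made up):

```python
# Time-warping distance via dynamic programming (no penalty for stutters).
def dtw(x, y):
    INF = float('inf')
    n, m = len(x), len(y)
    # D[i][j] = cost to match prefix of length i of x with prefix of length j of y
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = abs(x[i - 1] - y[j - 1]) + min(
                D[i][j - 1],      # x-stutter: x[i] matched again
                D[i - 1][j],      # y-stutter: y[j] matched again
                D[i - 1][j - 1],  # no stutter
            )
    return D[n][m]  # O(n*m) time, quadratic in the sequence lengths

print(dtw([1, 2, 3], [1, 2, 2, 3]))  # 0.0 -- warping absorbs the repeated 2
print(dtw([1, 2, 3], [1, 2, 4]))     # 1.0
```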
SLIDE 25 Time warping
- Complexity: O(M×N), quadratic in the length of the sequences
- Many variations (penalty for stutters; limit on the number/percentage of stutters; …)
- Popular in voice processing [Rabiner + Juang]
SLIDE 26 Other Distance functions
- Piece-wise linear/flat approximations; compare pieces [Keogh+01] [Faloutsos+97]
- ‘Cepstrum’ (for voice [Rabiner + Juang])
– do DFT; take log of amplitude; do DFT again!
- Allow for small gaps [Agrawal+95]
See tutorial by [Gunopulos + Das, SIGMOD01]
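The cepstrum recipe above (DFT, log of amplitude, DFT again) can be sketched with the standard library only; the test signal and lengths are made up for illustration:

```python
import cmath, math

def dft(x):
    """Naive O(n^2) discrete Fourier transform (fine for a small sketch)."""
    n = len(x)
    return [sum(x[k] * cmath.exp(-2j * math.pi * j * k / n) for k in range(n))
            for j in range(n)]

def cepstrum(signal, eps=1e-12):
    spectrum = dft(signal)                                # 1) do DFT
    log_amp = [math.log(abs(c) + eps) for c in spectrum]  # 2) log of amplitude
    return [abs(c) for c in dft(log_amp)]                 # 3) do DFT again

sig = [math.sin(2 * math.pi * t / 8) for t in range(32)]
print(cepstrum(sig)[:4])
```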
SLIDE 27 Other Distance functions
- In [Keogh+, KDD’04]: parameter-free, MDL-based
SLIDE 28
Conclusions
Prevailing distances:
– Euclidean, and
– time-warping
SLIDE 29 Outline
- Motivation
- Similarity search and distance functions
- Linear Forecasting
- Non-linear forecasting
- Conclusions
SLIDE 30
Linear Forecasting
SLIDE 31 Forecasting
“Prediction is very difficult, especially about the future.”
– Niels Bohr, Danish physicist and Nobel Prize laureate
SLIDE 32 Outline
- Motivation
- ...
- Linear Forecasting
– Auto-regression: Least Squares; RLS
– Co-evolving time sequences
– Examples
– Conclusions
SLIDE 33
Reference
[Yi+00] Byoung-Kee Yi et al.: Online Data Mining for Co-Evolving Time Sequences, ICDE 2000. (Describes MUSCLES and Recursive Least Squares)
SLIDE 34 Problem#2: Forecast
- Example: given xt-1, xt-2, …, forecast xt
[plot: number of packets sent vs. time tick, with ‘??’ at the next tick]
SLIDE 35 Forecasting: Preprocessing
MANUALLY: remove trends; spot periodicities
[plots: a series vs. time with its trend removed; a series vs. time with a 7-day periodicity]
SLIDE 36 Problem#2: Forecast
Try to express xt as a linear function of the past: xt-1, xt-2, … (up to a window of w).
Formally: x_t ≈ a_1 × x_{t-1} + a_2 × x_{t-2} + … + a_w × x_{t-w}
[plot: number of packets sent vs. time tick, with ‘??’ at the next tick]
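The windowed formulation can be sketched by building the lag matrix X and target vector y from a single series (the series values below are made up for illustration):

```python
# Cast forecasting as auto-regression: each row of X holds the w most
# recent past values; y holds the value to be predicted.
def make_lag_matrix(series, w):
    """Row for tick t holds [x_{t-1}, ..., x_{t-w}]; y holds x_t."""
    X, y = [], []
    for t in range(w, len(series)):
        X.append([series[t - k] for k in range(1, w + 1)])
        y.append(series[t])
    return X, y

series = [10, 20, 30, 40, 50, 60]
X, y = make_lag_matrix(series, w=2)
print(X)  # [[20, 10], [30, 20], [40, 30], [50, 40]]
print(y)  # [30, 40, 50, 60]
```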
SLIDE 37 (Problem: Back-cast; interpolate)
- Solution - interpolate: try to express xt as a linear function of the past AND the future:
xt+1, xt+2, …, xt+wfuture; xt-1, …, xt-wpast
(up to windows of wpast, wfuture)
[plot: number of packets sent vs. time tick, with ‘??’ at an interior tick]
SLIDE 38-41 Refresher: Linear Regression
[scatter plot: body weight vs. body height, with a fitted line]
patient | weight | height
1 | 27 | 43
2 | 43 | 54
3 | 54 | 72
… | … | …
N | 25 | ??
- express what we don’t know (= “dependent variable”)
- as a linear function of what we know (= “independent variable(s)”)
SLIDE 42 Linear Auto Regression
Time | Packets Sent (t-1) | Packets Sent (t)
1 |  |
2 | 43 | 54
3 | 54 | 72
… | … | …
N | 25 | ??
SLIDE 43-46 Linear Auto Regression
Lag w = 1
Dependent variable = # of packets sent (S[t])
Independent variable = # of packets sent (S[t-1])
‘lag-plot’: #packets sent at time t vs. #packets sent at time t-1
Time | Packets Sent (t-1) | Packets Sent (t)
1 |  |
2 | 43 | 54
3 | 54 | 72
… | … | …
N | 25 | ??
SLIDE 47 Outline
- Motivation
- ...
- Linear Forecasting
– Auto-regression: Least Squares; RLS
– Co-evolving time sequences
– Examples
– Conclusions
SLIDE 48-50 More details:
- Q1: Can it work with window w > 1?
- A1: YES! (we’ll fit a hyper-plane, then!)
[3-D plot: axes xt-2, xt-1, xt]
SLIDE 51 More details:
- Q1: Can it work with window w > 1?
- A1: YES! The problem becomes:
X[N×w] × a[w×1] = y[N×1]
– a is the vector of the regression coefficients
– X has the N values of the w indep. variables
– y has the N values of the dependent variable
SLIDE 52-53 More details:
- X[N×w] × a[w×1] = y[N×1]

[ X11 X12 … X1w ]   [ a1 ]   [ y1 ]
[ X21 X22 … X2w ] × [ a2 ] = [ y2 ]
[  …   …  …  …  ]   [ …  ]   [ …  ]
[ XN1 XN2 … XNw ]   [ aw ]   [ yN ]

(row i of X = time tick i; column j = ind-var-j, j = 1 … w)
SLIDE 54 More details
- Q2: How to estimate a1, a2, …, aw = a?
- A2: with Least Squares fit (Moore-Penrose pseudo-inverse)
- a is the vector that minimizes the RMSE from y
- a = (X^T × X)^-1 × (X^T × y)
SLIDE 55 More details
- Straightforward solution: a = (X^T × X)^-1 × (X^T × y)
– a: Regression Coeff. Vector; X: Sample Matrix (X_N is N×w)
- Observations:
– Sample matrix X grows over time
– needs matrix inversion
– O(N×w^2) computation
– O(N×w) storage
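The straightforward batch solution a = (X^T × X)^-1 × (X^T × y) can be sketched with numpy; here the pseudo-inverse is used for numerical stability, and the series is a made-up, exactly-AR(2) signal so the fitted coefficients are known:

```python
# Batch Least Squares fit of an auto-regressive model of window w.
import numpy as np

def fit_ar(series, w):
    X = np.array([[series[t - k] for k in range(1, w + 1)]
                  for t in range(w, len(series))], dtype=float)
    y = np.array(series[w:], dtype=float)
    # pinv(X) @ y equals (X.T @ X)^-1 @ (X.T @ y) when X.T @ X is invertible
    return np.linalg.pinv(X) @ y

# Synthetic series (made up): x_t = 0.6*x_{t-1} - 0.8*x_{t-2}
series = [1.0, 0.0]
for _ in range(20):
    series.append(0.6 * series[-1] - 0.8 * series[-2])

a = fit_ar(series, w=2)
print(a)  # approximately [0.6, -0.8]
```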
SLIDE 56-57 Even more details
- Q3: Can we estimate a incrementally?
- A3: Yes, with the brilliant, classic method of “Recursive Least Squares” (RLS) (see, e.g., [Yi+00], for details).
- We can get the effect of the matrix inversion WITHOUT inverting! (How is that possible?!)
- A: our matrix has special form: (X^T X)
SLIDE 58 More details
[figure: X_{N+1} is X_N (N×w) with the new row x_{N+1} appended at time tick N+1]
SKIP
SLIDE 59 More details: key ideas
- G_N ≡ (X_N^T × X_N)^-1 (the “gain matrix”, w×w)
- G_{N+1} can be computed recursively from G_N without matrix inversion
SKIP
SLIDE 60 Comparison:
- Least Squares
– Needs huge matrix (growing in size): O(N×w)
– Costly matrix operation: O(N×w^2)
- Recursive LS
– Needs much smaller, fixed-size matrix: O(w×w)
– Fast, incremental computation: O(1×w^2)
– no matrix inversion
(N = 10^6, w = 1-100)
SLIDE 61 EVEN more details:

G_{N+1} = G_N − c^-1 × [G_N × x_{N+1}^T] × [x_{N+1} × G_N]

c = [1 + x_{N+1} × G_N × x_{N+1}^T]

(x_{N+1}: 1×w row vector)
Let’s elaborate (VERY IMPORTANT, VERY VALUABLE!) SKIP
SLIDE 62 EVEN more details:

a_{N+1} = [X_{N+1}^T × X_{N+1}]^-1 × [X_{N+1}^T × y_{N+1}]

SKIP
SLIDE 63 EVEN more details:

a_{N+1} = [X_{N+1}^T × X_{N+1}]^-1 × [X_{N+1}^T × y_{N+1}]

(dimensions: a: [w×1]; X^T: [w×(N+1)]; X: [(N+1)×w]; y: [(N+1)×1])
SKIP
SLIDE 64 EVEN more details:

a_{N+1} = [X_{N+1}^T × X_{N+1}]^-1 × [X_{N+1}^T × y_{N+1}]

(focus on the [w×(N+1)] × [(N+1)×w] product, i.e., X_{N+1}^T × X_{N+1})
SKIP
SLIDE 65 EVEN more details:

a_{N+1} = [X_{N+1}^T × X_{N+1}]^-1 × [X_{N+1}^T × y_{N+1}]

G_{N+1} ≡ [X_{N+1}^T × X_{N+1}]^-1   (‘gain matrix’)

G_{N+1} = G_N − c^-1 × [G_N × x_{N+1}^T] × [x_{N+1} × G_N]
(dimensions: [w×w] = [w×w] − [1×1] × [w×w × w×1] × [1×w × w×w])

c = [1 + x_{N+1} × G_N × x_{N+1}^T]   (1×1: a SCALAR!)

SKIP
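The RLS recursion can be sketched in a few lines of numpy. This is a hedged illustration: the gain-matrix update follows the slides, the coefficient update a_{N+1} = a_N + G_{N+1} × x_{N+1}^T × (y_{N+1} − x_{N+1} × a_N) is the standard RLS form from the literature, and the AR(2) series is made up:

```python
# Recursive Least Squares: keep the w-by-w gain matrix G = (X^T X)^-1 and
# update it (and the coefficients a) per sample, with no matrix inversion
# after the initialization G_0 = delta * I.
import numpy as np

def rls_init(w, delta=1e4):
    # G_0 = delta * I (delta: a large positive number; I: w x w identity)
    return delta * np.eye(w), np.zeros(w)

def rls_update(G, a, x_new, y_new):
    """One RLS step. x_new: length-w row of lagged values; y_new: target."""
    x = np.asarray(x_new, dtype=float).reshape(1, -1)
    c = 1.0 + (x @ G @ x.T).item()             # c = [1 + x G x^T], a SCALAR
    G = G - (G @ x.T) @ (x @ G) / c            # G_{N+1} = G_N - c^-1 [G x^T][x G]
    a = a + (G @ x.T).ravel() * (y_new - float(x @ a))  # coefficient update
    return G, a

# Synthetic series (made up): x_t = 0.6*x_{t-1} - 0.8*x_{t-2}
series = [1.0, 0.0]
for _ in range(40):
    series.append(0.6 * series[-1] - 0.8 * series[-2])

G, a = rls_init(w=2)
for t in range(2, len(series)):
    G, a = rls_update(G, a, [series[t - 1], series[t - 2]], series[t])
print(a)  # close to [0.6, -0.8], matching the batch solution
```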
SLIDE 66 Altogether:
G_0 ≡ δ × I
where I: w×w identity matrix; δ: a large positive number
SKIP
SLIDE 67 Comparison:
- Least Squares
– Needs huge matrix (growing in size): O(N×w)
– Costly matrix operation: O(N×w^2)
- Recursive LS
– Needs much smaller, fixed-size matrix: O(w×w)
– Fast, incremental computation: O(1×w^2)
– no matrix inversion
(N = 10^6, w = 1-100)
SLIDE 68-70 Pictorially:
[scatter: dependent vs. independent variable, with a fitted line; a new point arrives; RLS quickly computes the new best fit]
SLIDE 71 Even more details
- Q4: can we ‘forget’ the older samples?
- A4: Yes - RLS can easily handle that [Yi+00]:
SLIDE 72-74 Adaptability - ‘forgetting’
[scatter: dependent variable (e.g., #bytes sent) vs. independent variable (e.g., #packets sent); after a trend change, (R)LS with no forgetting keeps fitting the old trend, while (R)LS with forgetting tracks the new one]
- RLS can *trivially* handle ‘forgetting’
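One common way to implement ‘forgetting’ is an exponential forgetting factor λ in (0, 1]: each old sample’s weight decays by λ per tick (λ = 1 recovers plain RLS). A hedged sketch, with a made-up one-variable series whose slope changes half-way, as in the trend-change picture:

```python
import numpy as np

def rls_update_forget(G, a, x_new, y_new, lam=0.9):
    """One exponentially-weighted RLS step (lam = forgetting factor)."""
    x = np.asarray(x_new, dtype=float).reshape(1, -1)
    c = lam + (x @ G @ x.T).item()
    G = (G - (G @ x.T) @ (x @ G) / c) / lam  # dividing by lam down-weights old data
    a = a + (G @ x.T).ravel() * (y_new - float(x @ a))
    return G, a

G, a = 1e4 * np.eye(1), np.zeros(1)   # G_0 = delta * I, as before
for t in range(100):
    slope = 2.0 if t < 50 else 5.0    # trend change at t = 50
    x = [1.0 + (t % 7)]               # made-up independent variable
    G, a = rls_update_forget(G, a, x, slope * x[0])
print(a)  # near [5.0]: the pre-change slope has been 'forgotten'
```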