SLIDE 1 http://poloclub.gatech.edu/cse6242
CSE6242 / CX4242: Data & Visual Analytics
Time Series
Mining and Forecasting
Duen Horng (Polo) Chau
Assistant Professor
Associate Director, MS Analytics
Georgia Tech
Partly based on materials by
Professors Guy Lebanon, Jeffrey Heer, John Stasko, Christos Faloutsos, Parikshit Ram (GT PhD alum; SkyTree), Alex Gray
SLIDE 2 Outline
- Motivation
- Similarity search – distance functions
- Linear Forecasting
- Non-linear forecasting
- Conclusions
SLIDE 3 Problem definition
- Given: one or more sequences
x1, x2, …, xt, …; (y1, y2, …, yt, …); (…)
- Find: similar sequences; forecasts; patterns; clusters; outliers
SLIDE 4 Motivation - Applications
- Financial, sales, economic series
- Medical
– ECG, blood pressure, etc. monitoring
– reactions to new drugs
– elderly care
SLIDE 5 Motivation - Applications (cont’d)
– sensors monitor temperature, humidity, air quality
SLIDE 6 Motivation - Applications (cont’d)
- Weather, environment/anti-pollution
– volcano monitoring
– air/water pollutant monitoring
SLIDE 7 Motivation - Applications (cont’d)
- Computer systems
– ‘Active Disks’ (buffering, prefetching)
– web servers (ditto)
– network traffic monitoring
– ...
SLIDE 8
Stream Data: Disk accesses
[Plot: #bytes accessed vs. time]
SLIDE 9 Problem #1:
Goal: given a signal (e.g., #packets over time)
Find: patterns, periodicities, and/or compress

[Plot: count of lynx caught per year vs. year (similarly: packets per day; temperature per day)]
SLIDE 10 Problem#2: Forecast
Given xt, xt-1, …, forecast xt+1
[Plot: number of packets sent vs. time tick; forecast the next value ('??')]
SLIDE 11 Problem#2’: Similarity search
E.g., find a 3-tick pattern similar to the last one
[Plot: number of packets sent vs. time tick; find a past 3-tick pattern similar to the most recent one ('??')]
SLIDE 12 Problem #3:
- Given: A set of correlated time sequences
- Forecast ‘Sent(t)’
[Plot: number of packets vs. time tick, for three correlated sequences: sent, lost, repeated]
SLIDE 13 Important observations
Patterns, rules, forecasting and similarity indexing are closely related:
- To do forecasting, we need
– to find patterns/rules
– to find similar settings in the past
- To find outliers, we need to have forecasts
– (outlier = too far away from our forecast)
SLIDE 14 Outline
- Motivation
- Similarity search and distance functions
– Euclidean
– Time-warping
SLIDE 15 Importance of distance functions
Subtle, but absolutely necessary:
- A ‘must’ for similarity indexing (-> forecasting)
- Two major families:
– Euclidean and Lp norms
– Time warping and variations
SLIDE 16 Euclidean and Lp
D(\vec{x}, \vec{y}) = \sum_{i=1}^{n} (x_i - y_i)^2

[Figure: two sequences x(t), y(t), compared point by point]

L_p(\vec{x}, \vec{y}) = \sum_{i=1}^{n} |x_i - y_i|^p

L1: city-block = Manhattan; L2 = Euclidean; L∞ (max)
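A minimal NumPy sketch of this family (the function name and toy sequences are mine; note it applies the 1/p root, which the slide's formulas omit):

```python
import numpy as np

def lp_distance(x, y, p=2):
    """L_p distance between two equal-length sequences.

    p=1: city-block (Manhattan); p=2: Euclidean; p=np.inf: max norm.
    """
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    if np.isinf(p):
        return np.max(np.abs(x - y))
    return np.sum(np.abs(x - y) ** p) ** (1.0 / p)

# Toy "daily" sequences
x = [1.0, 2.0, 3.0, 4.0]
y = [1.5, 2.5, 2.0, 4.0]
print(lp_distance(x, y, p=1))       # city-block
print(lp_distance(x, y, p=2))       # Euclidean
print(lp_distance(x, y, p=np.inf))  # L-infinity
```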
SLIDE 17 Observation #1
Time sequence -> n-d vector
[Figure: an n-day sequence (Day-1, Day-2, …, Day-n) viewed as a point in n-d space]
SLIDE 18 Observation #2
Euclidean distance is closely related to
– cosine similarity
– dot product

[Figure: the same n-day sequence/vector (Day-1, Day-2, …, Day-n)]
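One way to see the connection (a standard identity, not spelled out on the slide):

\|\vec{x} - \vec{y}\|^2 = \|\vec{x}\|^2 + \|\vec{y}\|^2 - 2\,(\vec{x} \cdot \vec{y})

So for sequences normalized to unit length, squared Euclidean distance = 2 × (1 - cosine similarity): ranking neighbors by one is the same as ranking by the other.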
SLIDE 19 Time Warping
- Allow accelerations and decelerations
– (with or without penalty)
- THEN compute the (Euclidean) distance (+ penalty)
- Related to the string-editing distance
SLIDE 20
Time Warping
[Figure: two sequences aligned by time warping; ‘stutters’ = one point matched to several points of the other sequence]
SLIDE 21 Time warping
Q: how to compute it?
A: dynamic programming
D(i, j) = cost to match prefix of length i of first sequence x with prefix of length j of second sequence y
SLIDE 22 http://www.psb.ugent.be/cbd/papers/gentxwarper/DTWalgorithm.htm
Time warping
SLIDE 23
Time warping

Thus, with no penalty for stutter, for sequences x1, x2, …, xi; y1, y2, …, yj:

D(i, j) = \|x[i] - y[j]\| + \min\begin{cases} D(i-1, j-1) & \text{(no stutter)} \\ D(i, j-1) & \text{(x-stutter)} \\ D(i-1, j) & \text{(y-stutter)} \end{cases}
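A minimal Python sketch of this recurrence (the function name and toy sequences are mine; real implementations usually add a warping-window constraint):

```python
import numpy as np

def dtw(x, y):
    """Time-warping distance via the slide's recurrence; O(M*N)."""
    M, N = len(x), len(y)
    D = np.full((M + 1, N + 1), np.inf)  # D[i, j]: cost to match prefixes
    D[0, 0] = 0.0
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            cost = abs(x[i - 1] - y[j - 1])
            D[i, j] = cost + min(D[i - 1, j - 1],  # no stutter
                                 D[i, j - 1],      # x-stutter
                                 D[i - 1, j])      # y-stutter
    return D[M, N]

# y is a 'stuttered' (slowed-down) version of x
x = [1.0, 2.0, 3.0, 4.0]
y = [1.0, 2.0, 2.0, 3.0, 4.0]
print(dtw(x, y))  # 0.0 -- the warping absorbs the repeated 2.0
```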
SLIDE 24
Time warping

VERY SIMILAR to the string-editing distance:

D(i, j) = \|x[i] - y[j]\| + \min\begin{cases} D(i-1, j-1) & \text{(no stutter)} \\ D(i, j-1) & \text{(x-stutter)} \\ D(i-1, j) & \text{(y-stutter)} \end{cases}
SLIDE 25 Time warping
- Complexity: O(M*N), quadratic in the length of the sequences
- Many variations (penalty for stutters; limit on the number/percentage of stutters; …)
- Popular in voice processing [Rabiner + Juang]
SLIDE 26 Other Distance functions
- Piece-wise linear/flat approximation; compare pieces [Keogh+01] [Faloutsos+97]
- ‘Cepstrum’ (for voice [Rabiner + Juang]); see the sketch after this list
– do DFT; take log of amplitude; do DFT again!
- Allow for small gaps [Agrawal+95]
See tutorial by [Gunopulos + Das, SIGMOD01]
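A minimal NumPy sketch of that cepstrum recipe, taking the slide's "do DFT again" literally (the epsilon guard against log(0) is my addition):

```python
import numpy as np

def cepstrum(signal):
    """do DFT; take log of amplitude; do DFT again!"""
    spectrum = np.fft.fft(signal)
    log_amplitude = np.log(np.abs(spectrum) + 1e-12)  # eps avoids log(0)
    return np.fft.fft(log_amplitude)
```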
SLIDE 27 Other Distance functions
- In [Keogh+, KDD’04]: parameter-free, MDL-based
SLIDE 28
Conclusions
Prevailing distances:
– Euclidean and
– time-warping
SLIDE 29 Outline
- Motivation
- Similarity search and distance functions
- Linear Forecasting
- Non-linear forecasting
- Conclusions
SLIDE 30
Linear Forecasting
SLIDE 31 Outline
- Motivation
- ...
- Linear Forecasting
– Auto-regression: Least Squares; RLS
– Co-evolving time sequences
– Examples
– Conclusions
SLIDE 32 Problem#2: Forecast
- Example: given xt-1, xt-2, …, forecast xt
[Plot: number of packets sent vs. time tick; forecast the next value ('??')]
SLIDE 33 Forecasting: Preprocessing
MANUALLY: remove trends, spot periodicities (see the sketch below)

[Two plots vs. time: a rising linear trend (to remove); a periodic pattern repeating every 7 days (to spot)]
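A minimal NumPy sketch of both manual steps on a made-up series (the toy signal, its 7-tick period, and the variable names are assumptions for illustration):

```python
import numpy as np

t = np.arange(98, dtype=float)
x = 0.5 * t + 10 * np.sin(2 * np.pi * t / 7) + np.random.randn(98)  # toy series

# 1. Remove the (linear) trend
slope, intercept = np.polyfit(t, x, deg=1)
detrended = x - (slope * t + intercept)

# 2. Spot periodicities: peaks in the DFT amplitude spectrum
amp = np.abs(np.fft.rfft(detrended))
freqs = np.fft.rfftfreq(len(detrended), d=1.0)  # cycles per time tick
peak = freqs[np.argmax(amp[1:]) + 1]            # skip the DC term
print(1.0 / peak)                               # ~7: the period, in ticks
```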
SLIDE 34 Problem#2: Forecast
Express xt as a linear function of the past: xt-1, xt-2, … (up to a window of w)

Formally:

x_t \approx a_1 x_{t-1} + a_2 x_{t-2} + \dots + a_w x_{t-w}

[Plot: number of packets sent vs. time tick; forecast the next value ('??')]
SLIDE 35 (Problem: Back-cast; interpolate)
- Solution: interpolate: try to express xt as a linear function of the past AND the future:
xt+1, xt+2, …, xt+w_future; xt-1, …, xt-w_past
(up to windows of w_past, w_future)

[Plot: number of packets sent vs. time tick; estimate the missing value in the middle ('??')]
SLIDE 36 Refresher: Linear Regression

[Scatter plot: body height (y-axis) vs. body weight (x-axis), with a fitted line]

patient | weight | height
1       | 27     | 43
2       | 43     | 54
3       | 54     | 72
…       | …      | …
N       | 25     | ??

Express what we don’t know (= “dependent variable”)
as a linear function of what we know (= “independent variable(s)”)

(Slides 37-39 repeat this slide.)
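A minimal NumPy sketch of the refresher, fitting height as a linear function of weight from the table's three complete rows (including an intercept is my choice):

```python
import numpy as np

weight = np.array([27.0, 43.0, 54.0])  # independent variable
height = np.array([43.0, 54.0, 72.0])  # dependent variable

# Fit height ~ a * weight + b by least squares
A = np.column_stack([weight, np.ones_like(weight)])
(a, b), *_ = np.linalg.lstsq(A, height, rcond=None)

# Forecast the missing height of patient N (weight 25)
print(a * 25.0 + b)
```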
SLIDE 40 Linear Auto Regression

Time | Packets Sent (t-1) | Packets Sent (t)
1    | -                  | 43
2    | 43                 | 54
3    | 54                 | 72
…    | …                  | …
N    | 25                 | ??
SLIDE 41 Linear Auto Regression
[Lag-plot: #packets sent at time t (y-axis) vs. #packets sent at time t-1 (x-axis); table as on Slide 40]

Lag w = 1
Dependent variable = # of packets sent (S[t])
Independent variable = # of packets sent (S[t-1])
‘lag-plot’
(Slides 42-44 repeat Slide 41.)
SLIDE 45 More details:
- Q1: Can it work with window w > 1?
- A1: YES!
[3-d lag-plot: axes xt-2, xt-1 (independent) and xt (dependent)]
SLIDE 46 More details:
- Q1: Can it work with window w > 1?
- A1: YES! (we’ll fit a hyper-plane, then!)
[3-d lag-plot: axes xt-2, xt-1 and xt, with a fitted hyper-plane]
(Slide 47 repeats Slide 46.)
SLIDE 48 More details:
- Q1: Can it work with window w > 1?
- A1: YES! The problem becomes:
X[N×w] × a[w×1] = y[N×1]
– a: the vector of the regression coefficients
– X: the N values of the w independent variables
– y: the N values of the dependent variable
SLIDE 49 More details:
- X[N×w] × a[w×1] = y[N×1]

\begin{bmatrix} X_{11} & X_{12} & \cdots & X_{1w} \\ X_{21} & X_{22} & \cdots & X_{2w} \\ \vdots & \vdots & \ddots & \vdots \\ X_{N1} & X_{N2} & \cdots & X_{Nw} \end{bmatrix} \times \begin{bmatrix} a_1 \\ a_2 \\ \vdots \\ a_w \end{bmatrix} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{bmatrix}

(one row per time tick; columns: Ind-var-1 … Ind-var-w)

(Slide 50 repeats Slide 49.)
SLIDE 51 More details
- Q2: How to estimate a1, a2, …, aw = a?
- A2: with the Least Squares fit:
a = (X^T × X)^{-1} × (X^T × y)
(Moore-Penrose pseudo-inverse)
- a is the vector that minimizes the RMSE from y
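A minimal NumPy sketch tying it together: build the lag matrix X and target y from a series, then solve for a (the function name and toy series are mine; `np.linalg.lstsq` returns the same solution as the pseudo-inverse formula, computed more stably):

```python
import numpy as np

def fit_ar(series, w):
    """Fit x_t ~ a_1 x_{t-1} + ... + a_w x_{t-w} by least squares."""
    s = np.asarray(series, dtype=float)
    rows, targets = [], []
    for t in range(w, len(s)):
        rows.append(s[t - w : t][::-1])  # row: [x_{t-1}, ..., x_{t-w}]
        targets.append(s[t])
    X, y = np.array(rows), np.array(targets)
    a, *_ = np.linalg.lstsq(X, y, rcond=None)  # = (X^T X)^{-1} (X^T y)
    return a

s = [10.0, 20.0, 30.0, 40.0, 50.0, 60.0]  # toy series
a = fit_ar(s, w=2)
print(a)                 # ~[2, -1] for this linear series
print(a @ [60.0, 50.0])  # forecast for the next time tick: ~70
```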
SLIDE 52 More details
- Straightforward solution:
a = (X^T × X)^{-1} × (X^T × y)
a: regression coefficient vector; X: the N×w sample matrix
- Observations:
– sample matrix X grows over time
– needs matrix inversion
– O(N×w^2) computation
– O(N×w) storage
SLIDE 53 Even more details
- Q3: Can we estimate a incrementally?
- A3: Yes, with the brilliant, classic method of “Recursive Least Squares” (RLS) (see, e.g., [Yi+00], for details).
- We can do the matrix inversion WITHOUT inversion! (How is that possible?!)
SLIDE 54 Even more details
- Q3: Can we estimate a incrementally?
- A3: Yes, with the brilliant, classic method of “Recursive Least Squares” (RLS) (see, e.g., [Yi+00], for details).
- We can do the matrix inversion WITHOUT inversion! (How is that possible?!)
- A: our matrix has special form: (X^T X)
SLIDE 55 More details
At the N+1 time tick:

[Diagram: X_{N+1} = the N×w matrix X_N with the new 1×w row x_{N+1} appended]

SKIP
SLIDE 56
More details: key ideas

– G_N ≡ (X_N^T × X_N)^{-1} (the w×w “gain matrix”)
– G_{N+1} can be computed recursively from G_N, without matrix inversion

SKIP
SLIDE 57 Comparison:
- Straightforward Least Squares:
– needs huge matrix (growing in size): O(N×w)
– costly matrix operation: O(N×w^2)
- Recursive LS:
– needs much smaller, fixed-size matrix: O(w×w)
– fast, incremental computation: O(1×w^2)
– no matrix inversion
(e.g., N = 10^6, w = 1-100)
SLIDE 58 EVEN more details:
Let’s elaborate (VERY IMPORTANT, VERY VALUABLE!)

G_{N+1} = G_N - c^{-1} \, [G_N \, x_{N+1}^T] \, [x_{N+1} \, G_N]

c = [1 + x_{N+1} \, G_N \, x_{N+1}^T]

(x_{N+1}: 1×w row vector)

SKIP
SLIDE 59 EVEN more details:
a_{N+1} = [X_{N+1}^T \, X_{N+1}]^{-1} \times [X_{N+1}^T \, y_{N+1}]

SKIP
SLIDE 60 EVEN more details:
a_{N+1} = [X_{N+1}^T \, X_{N+1}]^{-1} \times [X_{N+1}^T \, y_{N+1}]

[w×1] = ([w×(N+1)] [(N+1)×w])^{-1} × [w×(N+1)] [(N+1)×1]

SKIP
SLIDE 61 EVEN more details:
a_{N+1} = [X_{N+1}^T \, X_{N+1}]^{-1} \times [X_{N+1}^T \, y_{N+1}]

(note the [w×(N+1)] [(N+1)×w] product)

SKIP
SLIDE 62 EVEN more details:
a_{N+1} = [X_{N+1}^T \, X_{N+1}]^{-1} \times [X_{N+1}^T \, y_{N+1}]

G_{N+1} ≡ [X_{N+1}^T \, X_{N+1}]^{-1}   (‘gain matrix’)

G_{N+1} = G_N - c^{-1} \, [G_N \, x_{N+1}^T] \, [x_{N+1} \, G_N]
(w×w)     (w×w) (1×1)  (w×1)            (1×w)

c = [1 + x_{N+1} \, G_N \, x_{N+1}^T]   (1×1: SCALAR!)

SKIP
SLIDE 63 Altogether:
G_0 ≡ δ I
where I: w×w identity matrix; δ: a large positive number

SKIP
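Altogether, a minimal NumPy sketch of the recursion (the class name is mine; keeping the running vector p = X^T y next to G is bookkeeping the slides leave implicit):

```python
import numpy as np

class RLS:
    """Recursive least squares, following the slides:

        c  = 1 + x G x^T            (scalar)
        G <- G - [G x^T][x G] / c   (no matrix inversion!)
        p <- p + x^T y              (running X^T y)
        a  = G p                    (regression coefficients)
    """

    def __init__(self, w, delta=1e6):
        self.G = delta * np.eye(w)  # G_0 = delta * I, delta large
        self.p = np.zeros(w)        # running X^T y
        self.a = np.zeros(w)        # current coefficient estimate

    def update(self, x, y):
        x = np.asarray(x, dtype=float)   # new sample row (length w)
        Gx = self.G @ x
        c = 1.0 + x @ Gx                 # the 1x1 SCALAR
        self.G -= np.outer(Gx, Gx) / c   # O(w^2), fixed size
        self.p += x * y
        self.a = self.G @ self.p
        return self.a

# Feed the AR(2) lag rows one at a time
rls = RLS(w=2)
for x_row, y in [([20, 10], 30), ([30, 20], 40), ([40, 30], 50), ([50, 40], 60)]:
    a = rls.update(x_row, y)
print(a)  # ~[2, -1], matching the batch least-squares fit
```

Each update costs O(w^2) no matter how large N grows, which is exactly the win claimed in the comparison slide.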
SLIDE 64 Comparison:
- Straightforward Least Squares:
– needs huge matrix (growing in size): O(N×w)
– costly matrix operation: O(N×w^2)
- Recursive LS:
– needs much smaller, fixed-size matrix: O(w×w)
– fast, incremental computation: O(1×w^2)
– no matrix inversion
(e.g., N = 10^6, w = 1-100)
SLIDE 65 Pictorially:
[Scatter: dependent variable (y-axis) vs. independent variable (x-axis), with fitted line]
SLIDE 66 Pictorially:
[Same scatter: a new point arrives]
SLIDE 67 Pictorially:
[Same scatter: the fit updated for the new point]

RLS: quickly compute new best fit
SLIDE 68 Even more details
- Q4: can we ‘forget’ the older samples?
- A4: Yes - RLS can easily handle that [Yi+00]:
SLIDE 69 Adaptability - ‘forgetting’
[Scatter: dependent variable (e.g., #bytes sent) vs. independent variable (e.g., #packets sent)]
SLIDE 70 Adaptability - ‘forgetting’
[Same scatter, after a trend change]
(R)LS with no forgetting
SLIDE 71 Adaptability - ‘forgetting’
[Same scatter, after a trend change: (R)LS with no forgetting vs. (R)LS with forgetting]
- RLS: can *trivially* handle ‘forgetting’
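One common way to implement forgetting (my assumption as to the exact variant; [Yi+00] has the details) is an exponential factor lam in (0, 1] that down-weights old samples, a drop-in change to the RLS sketch above:

```python
import numpy as np

def update_with_forgetting(rls, x, y, lam=0.98):
    """Variant of RLS.update with exponential forgetting.

    lam = 1.0 recovers plain RLS; smaller lam forgets faster,
    so the fit tracks trend changes."""
    x = np.asarray(x, dtype=float)
    Gx = rls.G @ x
    c = lam + x @ Gx                        # was: 1 + x G x^T
    rls.G = (rls.G - np.outer(Gx, Gx) / c) / lam
    rls.p = lam * rls.p + x * y             # old contributions decay
    rls.a = rls.G @ rls.p
    return rls.a
```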