

SLIDE 1

http://poloclub.gatech.edu/cse6242


CSE6242 / CX4242: Data & Visual Analytics


Time Series

Mining and Forecasting

Duen Horng (Polo) Chau


Assistant Professor
 Associate Director, MS Analytics
 Georgia Tech

Partly based on materials by 
 Professors Guy Lebanon, Jeffrey Heer, John Stasko, Christos Faloutsos, Parishit Ram (GT PhD alum; SkyTree), Alex Gray

SLIDE 2

Outline

  • Motivation
  • Similarity search – distance functions
  • Linear Forecasting
  • Non-linear forecasting
  • Conclusions
SLIDE 3

Problem definition

  • Given: one or more sequences

x_1, x_2, …, x_t, …;  (y_1, y_2, …, y_t, …);  (…)

  • Find

– similar sequences; forecasts
– patterns; clusters; outliers

SLIDE 4

Motivation - Applications

  • Financial, sales, economic series
  • Medical

– ECGs; blood pressure, etc. monitoring
– reactions to new drugs
– elderly care

SLIDE 5

Motivation - Applications (cont’d)

  • ‘Smart house’

– sensors monitor temperature, humidity, air quality

  • video surveillance
SLIDE 6

Motivation - Applications (cont’d)

  • Weather, environment/anti-pollution

– volcano monitoring
– air/water pollutant monitoring

SLIDE 7

Motivation - Applications (cont’d)

  • Computer systems

– ‘Active Disks’ (buffering, prefetching)
– web servers (ditto)
– network traffic monitoring
– ...

SLIDE 8

Stream Data: Disk accesses

[Plot: disk-access stream, #bytes vs. time]

SLIDE 9

Problem #1:

Goal: given a signal (e.g., #packets over time)
Find: patterns, periodicities, and/or compress

[Plot: lynx caught per year vs. year (similarly: packets per day; temperature per day)]

SLIDE 10

Problem #2: Forecast

Given x_t, x_{t-1}, …, forecast x_{t+1}

[Plot: number of packets sent vs. time tick; ‘??’ marks the next value to forecast]

SLIDE 11

Problem #2’: Similarity search

E.g., find a 3-tick pattern similar to the last one

[Plot: number of packets sent vs. time tick, highlighting the 3-tick query pattern]

SLIDE 12

Problem #3:

  • Given: A set of correlated time sequences
  • Forecast ‘Sent(t)’

[Plot: three correlated sequences (‘sent’, ‘lost’, ‘repeated’), number of packets vs. time tick]

SLIDE 13

Important observations

Patterns, rules, forecasting and similarity indexing are closely related:

  • To do forecasting, we need

– to find patterns/rules
– to find similar settings in the past

  • To find outliers, we need to have forecasts

– (outlier = too far away from our forecast)

SLIDE 14

Outline

  • Motivation
  • Similarity search and distance functions

– Euclidean
– Time-warping

  • ...
SLIDE 15

Importance of distance functions

Subtle, but absolutely necessary:

  • A ‘must’ for similarity indexing (-> forecasting)
  • A ‘must’ for clustering

Two major families

– Euclidean and Lp norms
– Time warping and variations

SLIDE 16

Euclidean and Lp

$$D(\vec{x}, \vec{y}) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$

$$L_p(\vec{x}, \vec{y}) = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p}$$

L1: city-block = Manhattan
L2 = Euclidean
L∞

[Plot: two sequences x(t), y(t) overlaid, compared point by point]
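To make these concrete, here is a minimal Python sketch (our illustration, not from the deck; the helper name lp_distance is made up):

```python
# Minimal sketch: L_p distance between two equal-length sequences,
# each treated as an n-dimensional vector (see Observation #1 below).
import numpy as np

def lp_distance(x, y, p=2):
    """L_p distance; p=1 is city-block, p=2 is Euclidean, p=inf is L-infinity."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    if np.isinf(p):
        return float(np.max(np.abs(x - y)))   # L-infinity: largest coordinate gap
    return float(np.sum(np.abs(x - y) ** p) ** (1.0 / p))

x = [1.0, 2.0, 4.0, 3.0]
y = [1.0, 3.0, 4.0, 2.0]
print(lp_distance(x, y, p=1), lp_distance(x, y, p=2), lp_distance(x, y, p=np.inf))
```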

SLIDE 17

Observation #1

Time sequence -> n-d vector

[Plot: a sequence of daily values, Day-1 … Day-n, viewed as one n-dimensional vector]

SLIDE 18

Observation #2

Euclidean distance is closely related to

– cosine similarity
– dot product

[Plot: the same Day-1 … Day-n vector view]

SLIDE 19

Time Warping

  • allow accelerations - decelerations

– (with or without penalty)

  • THEN compute the (Euclidean) distance (+ penalty)
  • related to the string-editing distance
SLIDE 20

Time Warping

‘stutters’:

SLIDE 21

Time warping

Q: How to compute it?
A: Dynamic programming! D(i, j) = cost to match prefix of length i of the first sequence x with prefix of length j of the second sequence y.
SLIDE 22

http://www.psb.ugent.be/cbd/papers/gentxwarper/DTWalgorithm.htm

Time warping

SLIDE 23

Time warping

Thus, with no penalty for stutter, for sequences x_1, x_2, …, x_i; y_1, y_2, …, y_j:

$$D(i, j) = \lVert x[i] - y[j] \rVert + \min \begin{cases} D(i, j-1) & \text{(x-stutter)} \\ D(i-1, j) & \text{(y-stutter)} \\ D(i-1, j-1) & \text{(no stutter)} \end{cases}$$
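A small Python sketch of this recurrence (our illustration, assuming scalar cost |x[i] - y[j]| and no stutter penalty; the function name dtw is ours):

```python
# Minimal sketch of the dynamic-programming recurrence above:
# D(i, j) = |x[i] - y[j]| + min(x-stutter, y-stutter, no stutter).
import numpy as np

def dtw(x, y):
    n, m = len(x), len(y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(x[i - 1] - y[j - 1])
            D[i, j] = cost + min(D[i, j - 1],      # x-stutter
                                 D[i - 1, j],      # y-stutter
                                 D[i - 1, j - 1])  # no stutter
    return D[n, m]

# A stretched copy matches its original at zero cost ('free' stutters):
print(dtw([1, 2, 3, 3, 3, 4], [1, 2, 3, 4]))  # -> 0.0
```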

SLIDE 24

Time warping

VERY SIMILAR to the string-editing distance:

$$D(i, j) = \lVert x[i] - y[j] \rVert + \min \begin{cases} D(i, j-1) & \text{(x-stutter)} \\ D(i-1, j) & \text{(y-stutter)} \\ D(i-1, j-1) & \text{(no stutter)} \end{cases}$$

SLIDE 25

Time warping

  • Complexity: O(M×N) - quadratic in the length of the strings
  • Many variations (penalty for stutters; limit on the number/percentage of stutters; …)
  • popular in voice processing


[Rabiner + Juang]

SLIDE 26

Other Distance functions

  • piece-wise linear / piece-wise flat approximations; compare the pieces [Keogh+01] [Faloutsos+97]
  • ‘cepstrum’ (for voice [Rabiner + Juang])

– do DFT; take log of amplitude; do DFT again! (see the sketch below)

  • Allow for small gaps [Agrawal+95]

See tutorial by [Gunopulos + Das, SIGMOD01]
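A rough Python sketch of the cepstrum recipe (our literal reading of "do DFT; take log of amplitude; do DFT again"; the eps guard against log(0) is our addition):

```python
# Rough sketch of the cepstrum recipe from the slide:
# DFT -> log of amplitude -> DFT again.
import numpy as np

def cepstrum(x, eps=1e-12):
    spectrum = np.fft.fft(x)                        # do DFT
    log_amplitude = np.log(np.abs(spectrum) + eps)  # take log of amplitude
    return np.abs(np.fft.fft(log_amplitude))        # do DFT again

signal = np.sin(np.linspace(0.0, 20.0 * np.pi, 256))
print(cepstrum(signal)[:5])
```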

SLIDE 27

Other Distance functions

  • In [Keogh+, KDD’04]: parameter-free, MDL-based

SLIDE 28

Conclusions

Prevailing distances:

– Euclidean, and
– time-warping

SLIDE 29

Outline

  • Motivation
  • Similarity search and distance functions
  • Linear Forecasting
  • Non-linear forecasting
  • Conclusions
SLIDE 30

Linear Forecasting

SLIDE 31

Outline

  • Motivation
  • ...
  • Linear Forecasting

– Auto-regression: Least Squares; RLS
– Co-evolving time sequences
– Examples
– Conclusions

SLIDE 32

Problem #2: Forecast

  • Example: given x_{t-1}, x_{t-2}, …, forecast x_t

[Plot: number of packets sent vs. time tick; ‘??’ marks the next value to forecast]

SLIDE 33

Forecasting: Preprocessing

MANUALLY: remove trends; spot periodicities

[Plots: a series with a linear trend (remove it) and a series with a 7-day periodicity (spot it)]

SLIDE 34

Problem#2: Forecast

  • Solution: try to express x_t as a linear function of the past x_{t-1}, x_{t-2}, … (up to a window of w).

Formally:

x_t ≈ a_1 x_{t-1} + a_2 x_{t-2} + … + a_w x_{t-w}

[Plot: number of packets sent vs. time tick; ‘??’ marks the next value to forecast]

SLIDE 35

(Problem: Back-cast; interpolate)

  • Solution - interpolate: try to express x_t as a linear function of the past AND the future:

x_{t+1}, x_{t+2}, …, x_{t+w_future};  x_{t-1}, …, x_{t-w_past}

(up to windows of w_past, w_future)

  • EXACTLY the same algorithms apply

[Plot: number of packets sent vs. time tick; ‘??’ marks the value to interpolate]

SLIDE 36

Refresher: Linear Regression

| patient | weight | height |
|---------|--------|--------|
| 1       | 27     | 43     |
| 2       | 43     | 54     |
| 3       | 54     | 72     |
| …       | …      | …      |
| N       | 25     | ??     |

Express what we don’t know (= the “dependent variable”, here body height)
as a linear function of what we know (= the “independent variable(s)”, here body weight).

[Scatter plot: body height vs. body weight, with the least-squares line]
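As a concrete illustration of the refresher (ours, not part of the deck), a least-squares line fit to the table above:

```python
# Fit height = slope * weight + intercept by least squares,
# then forecast patient N's unknown height from weight = 25.
import numpy as np

weight = np.array([27.0, 43.0, 54.0])   # independent variable
height = np.array([43.0, 54.0, 72.0])   # dependent variable

slope, intercept = np.polyfit(weight, height, deg=1)
print(slope * 25.0 + intercept)         # predicted height for weight = 25
```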


SLIDE 40

Linear Auto Regression

| Time | Packets Sent (t-1) | Packets Sent (t) |
|------|--------------------|------------------|
| 1    | -                  | 43               |
| 2    | 43                 | 54               |
| 3    | 54                 | 72               |
| …    | …                  | …                |
| N    | 25                 | ??               |

SLIDE 41

Linear Auto Regression

| Time | Packets Sent (t-1) | Packets Sent (t) |
|------|--------------------|------------------|
| 1    | -                  | 43               |
| 2    | 43                 | 54               |
| 3    | 54                 | 72               |
| …    | …                  | …                |
| N    | 25                 | ??               |

Lag w = 1

Dependent variable = # of packets sent (S[t])
Independent variable = # of packets sent (S[t-1])

[‘lag-plot’: #packets sent at time t vs. #packets sent at time t-1]
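A minimal Python sketch (the lag_pairs helper is our own) that builds these lag-plot pairs from a raw series:

```python
# Build the lag-plot data: row t pairs the last w values
# (the independent variables) with S[t] (the dependent variable).
import numpy as np

def lag_pairs(s, w=1):
    s = np.asarray(s, dtype=float)
    X = np.column_stack([s[i:len(s) - w + i] for i in range(w)])
    return X, s[w:]

sent = [43, 54, 72, 65, 80]
X, y = lag_pairs(sent, w=1)
print(np.column_stack([X, y]))  # rows: (#packets at t-1, #packets at t)
```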


SLIDE 45

More details:

  • Q1: Can it work with window w > 1?
  • A1: YES! (we’ll fit a hyper-plane, then!)

[3-D plot: x_t as a function of x_{t-1} and x_{t-2}]


SLIDE 48

More details:

  • Q1: Can it work with window w > 1?
  • A1: YES! The problem becomes:

X[N × w] × a[w × 1] = y[N × 1]

  • OVER-CONSTRAINED:

– a is the vector of the regression coefficients
– X has the N values of the w indep. variables
– y has the N values of the dependent variable

SLIDE 49

More details:

  • X[N × w] × a[w × 1] = y[N × 1]

$$\begin{bmatrix} X_{11} & X_{12} & \cdots & X_{1w} \\ X_{21} & X_{22} & \cdots & X_{2w} \\ \vdots & \vdots & \ddots & \vdots \\ X_{N1} & X_{N2} & \cdots & X_{Nw} \end{bmatrix} \times \begin{bmatrix} a_1 \\ a_2 \\ \vdots \\ a_w \end{bmatrix} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{bmatrix}$$

(each row of X is one time tick; the columns are the independent variables 1 … w)


SLIDE 51

More details

  • Q2: How to estimate a_1, a_2, …, a_w = a?
  • A2: with the Least Squares fit (Moore-Penrose pseudo-inverse)
  • a is the vector that minimizes the RMSE from y:

a = (X^T × X)^{-1} × (X^T × y)
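A hedged Python sketch of this fit for the auto-regression setting (our illustration; the toy series is made up). numpy's lstsq is a numerically safer stand-in for the explicit normal-equations formula:

```python
# Fit AR coefficients a for window w: build the lag matrix X and
# target y, then solve a = (X^T X)^{-1} (X^T y).
import numpy as np

sent = np.array([43.0, 54.0, 72.0, 65.0, 80.0, 77.0, 91.0])
w = 2
X = np.column_stack([sent[i:len(sent) - w + i] for i in range(w)])
y = sent[w:]

a = np.linalg.inv(X.T @ X) @ (X.T @ y)          # textbook formula
a_safe, *_ = np.linalg.lstsq(X, y, rcond=None)  # better-conditioned equivalent

print(a)               # regression coefficients a_1 ... a_w
print(sent[-w:] @ a)   # one-step forecast for the next time tick
```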

SLIDE 52

More details

  • Straightforward solution:

a = (X^T × X)^{-1} × (X^T × y)

a: regression coefficient vector
X: the N × w sample matrix

  • Observations:

– Sample matrix X grows over time
– needs matrix inversion
– O(N × w^2) computation
– O(N × w) storage


SLIDE 54

Even more details

  • Q3: Can we estimate a incrementally?
  • A3: Yes, with the brilliant, classic method of “Recursive Least Squares” (RLS) (see, e.g., [Yi+00], for details).
  • We can do the matrix inversion, WITHOUT inversion! (How is that possible?!)
  • A: our matrix has a special form: (X^T X)
slide-55
SLIDE 55

More details

At the (N+1)-st time tick, the new row x_{N+1} arrives:

[Figure: the N × w sample matrix X_N grows into X_{N+1} by appending the row x_{N+1}]

SKIP

SLIDE 56
More details: key ideas

  • Let G_N = (X_N^T × X_N)^{-1} (the “gain matrix”, w × w)
  • G_{N+1} can be computed recursively from G_N, without matrix inversion

SKIP

SLIDE 57

Comparison:

  • Straightforward Least Squares

– Needs huge matrix (growing in size): O(N × w)
– Costly matrix operation: O(N × w^2)

  • Recursive LS

– Needs much smaller, fixed-size matrix: O(w × w)
– Fast, incremental computation: O(1 × w^2)
– No matrix inversion

(e.g., N = 10^6, w = 1-100)

SLIDE 58

EVEN more details:

Let’s elaborate (VERY IMPORTANT, VERY VALUABLE!):

$$G_{N+1} = G_N - \frac{1}{c} \, G_N \, x_{N+1}^T \, x_{N+1} \, G_N$$

$$c = 1 + x_{N+1} \, G_N \, x_{N+1}^T$$

(x_{N+1}: the new 1 × w row vector)

SKIP

SLIDE 59

EVEN more details:

$$a_{N+1} = (X_{N+1}^T \times X_{N+1})^{-1} \times (X_{N+1}^T \times y_{N+1})$$

Dimensions: [w × 1] = ( [w × (N+1)] × [(N+1) × w] )^{-1} × [w × (N+1)] × [(N+1) × 1]

SKIP


SLIDE 62

EVEN more details:

$$a_{N+1} = (X_{N+1}^T \times X_{N+1})^{-1} \times (X_{N+1}^T \times y_{N+1})$$

$$G_{N+1} \equiv (X_{N+1}^T \times X_{N+1})^{-1} \quad \text{(the ‘gain matrix’, } w \times w \text{)}$$

$$G_{N+1} = G_N - \frac{1}{c} \, G_N \, x_{N+1}^T \, x_{N+1} \, G_N \qquad c = 1 + x_{N+1} \, G_N \, x_{N+1}^T$$

(G_N × x_{N+1}^T is w × 1; x_{N+1} × G_N is 1 × w; c is a SCALAR!)

SKIP

SLIDE 63

Altogether:

$$G_0 \equiv \delta \, I$$

where I: the w × w identity matrix; δ: a large positive number

SKIP
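Putting the pieces together, a compact Python sketch of the RLS loop (our own code, not from [Yi+00]; variable names mirror the slides and the toy series is made up):

```python
# RLS: initialize G_0 = delta * I, then for each new row x (1 x w)
# and target y, update the gain matrix G and the coefficients a,
# with no matrix inversion anywhere.
import numpy as np

def rls_init(w, delta=1e6):
    return np.eye(w) * delta, np.zeros(w)     # G_0 = delta * I, a_0 = 0

def rls_update(G, a, x, y):
    Gx = G @ x                      # G_N x_{N+1}^T  (w-vector)
    c = 1.0 + x @ Gx                # scalar
    G = G - np.outer(Gx, Gx) / c    # G_{N+1}
    a = a + G @ x * (y - x @ a)     # refit coefficients incrementally
    return G, a

series = [43.0, 54.0, 72.0, 65.0, 80.0, 77.0, 91.0]
w = 2
G, a = rls_init(w)
for t in range(w, len(series)):
    G, a = rls_update(G, a, np.array(series[t - w:t]), series[t])
print(a)  # approaches the batch least-squares coefficients
```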


SLIDE 65

Pictorially:

  • Given:

[Scatter plot: dependent variable vs. independent variable, with the current best-fit line]

SLIDE 66

Pictorially:

[Same scatter plot: a new point arrives]

SLIDE 67

Pictorially:

[Same scatter plot: RLS quickly computes the new best fit for the new point]

SLIDE 68

Even more details

  • Q4: can we ‘forget’ the older samples?
  • A4: Yes - RLS can easily handle that [Yi+00]:
SLIDE 69

Adaptability - ‘forgetting’

[Scatter plot: dependent variable (e.g., #bytes sent) vs. independent variable (e.g., #packets sent)]

SLIDE 70

Adaptability - ‘forgetting’

[Same scatter plot: after a trend change, (R)LS with no forgetting keeps tracking the old trend]

SLIDE 71

Adaptability - ‘forgetting’

[Same scatter plot: after the trend change, (R)LS with forgetting adapts to the new trend; (R)LS with no forgetting does not]

  • RLS can *trivially* handle ‘forgetting’ (see the sketch below)
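One standard way to implement forgetting (a textbook exponential forgetting factor λ; the slides don't spell out the exact [Yi+00] variant, so treat this as an assumption) tweaks the earlier RLS update:

```python
# RLS with exponential forgetting: lam in (0, 1]; lam = 1 recovers
# plain RLS, smaller lam down-weights older samples faster.
import numpy as np

def rls_update_forgetting(G, a, x, y, lam=0.98):
    Gx = G @ x
    c = lam + x @ Gx                       # scalar
    G = (G - np.outer(Gx, Gx) / c) / lam   # older samples decay by 1/lam
    a = a + G @ x * (y - x @ a)
    return G, a
```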