SLIDE 1 CSE 6242 / CX 4242
Time Series Mining and Forecasting
Duen Horng (Polo) Chau, Georgia Tech
Slides based on Prof. Christos Faloutsos’s materials
SLIDE 2 Outline
- Motivation
- Similarity search – distance functions
- Linear Forecasting
- Non-linear forecasting
- Conclusions
SLIDE 3 Problem definition
- Given: one or more sequences x1, x2, …, xt, … (y1, y2, …, yt, …) (…)
- Find: similar sequences; forecasts; patterns; clusters; outliers
SLIDE 4 Motivation - Applications
- Financial, sales, economic series
- Medical
– ECGs; blood pressure, etc. monitoring
– reactions to new drugs
– elderly care
SLIDE 5 Motivation - Applications (cont’d)
– sensors monitor temperature, humidity, air quality
SLIDE 6 Motivation - Applications (cont’d)
- Weather, environment/anti-pollution
– volcano monitoring
– air/water pollutant monitoring
SLIDE 7 Motivation - Applications (cont’d)
– ‘Active Disks’ (buffering, prefetching)
– web servers (ditto)
– network traffic monitoring
– ...
SLIDE 8
Stream Data: Disk accesses
[plot: #bytes vs. time]
SLIDE 9 Problem #1:
Goal: given a signal (e.g., #packets over time)
Find: patterns, periodicities, and/or compress
[plot: count of lynx caught per year vs. year (similarly: packets per day; temperature per day)]
SLIDE 10 Problem#2: Forecast
Given xt, xt-1, …, forecast xt+1
[plot: number of packets sent vs. time tick, with ‘??’ at the next tick]
SLIDE 11 Problem#2’: Similarity search
E.g., find a 3-tick pattern, similar to the last one
[plot: number of packets sent vs. time tick, with ‘??’ over the last 3 ticks]
SLIDE 12 Problem #3:
- Given: A set of correlated time sequences
- Forecast ‘Sent(t)’
[plot: number of packets vs. time tick for three correlated sequences: ‘sent’, ‘lost’, ‘repeated’]
SLIDE 13 Important observations
Patterns, rules, forecasting and similarity indexing are closely related:
- To do forecasting, we need
– to find patterns/rules
– to find similar settings in the past
- To find outliers, we need to have forecasts
– (outlier = too far away from our forecast)
SLIDE 14 Outline
- Motivation
- Similarity Search and Indexing
- Linear Forecasting
- Non-linear forecasting
- Conclusions
SLIDE 15 Outline
- Motivation
- Similarity search and distance functions
– Euclidean
– Time-warping
SLIDE 16 Importance of distance functions
Subtle, but absolutely necessary:
- A ‘must’ for similarity indexing (-> forecasting)
- Two major families:
– Euclidean and Lp norms
– Time warping and variations
SLIDE 17 Euclidean and Lp
D(x, y) = sqrt( Σ_{i=1..n} (x_i − y_i)^2 )

L_p(x, y) = ( Σ_{i=1..n} |x_i − y_i|^p )^(1/p)

L1: city-block = Manhattan; L2 = Euclidean; L∞
[figure: two aligned sequences x(t), y(t), compared tick by tick]
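A minimal sketch of this distance family in Python, assuming plain lists of equal length (the example sequences are made up for illustration):

```python
# L_p distances between two equal-length sequences (standard library only).
import math

def lp_distance(x, y, p=2):
    """L_p distance: (sum_i |x_i - y_i|^p)^(1/p).
    p=1 -> city-block (Manhattan), p=2 -> Euclidean."""
    assert len(x) == len(y)
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

def linf_distance(x, y):
    """L_inf distance: the largest per-tick difference."""
    return max(abs(a - b) for a, b in zip(x, y))

x = [1.0, 2.0, 3.0]
y = [1.0, 4.0, 3.0]
print(lp_distance(x, y, p=1))  # 2.0 (Manhattan)
print(lp_distance(x, y, p=2))  # 2.0 (Euclidean; only one tick differs)
print(linf_distance(x, y))     # 2.0
```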
SLIDE 18 Observation #1
Treat each sequence of n values as one vector.
[figure: a length-n sequence as a point in n-dimensional space, axes Day-1, Day-2, …, Day-n]
SLIDE 19 Observation #2
Euclidean distance is closely related to
– cosine similarity
– dot product
– ‘cross-correlation’ function
[figure: a length-n sequence as a vector, axes Day-1, Day-2, …, Day-n]
SLIDE 20 Time Warping
- Allow accelerations and decelerations
– (with or without penalty)
- THEN compute the (Euclidean) distance (+ penalty)
- Related to the string-editing distance
SLIDE 21
Time Warping
‘stutters’: one element of one sequence matched against several consecutive elements of the other
[figure: two warped sequences aligned with stutters]
SLIDE 22 Time warping
Q: how to compute it?
A: dynamic programming. D(i, j) = cost to match prefix of length i of first sequence x with prefix of length j of second sequence y
SLIDE 23
Thus, with no penalty for stutter, for sequences x1, x2, …, xi; y1, y2, …, yj:

D(i, j) = ||x[i] − y[j]|| + min{ D(i, j−1) (x-stutter), D(i−1, j) (y-stutter), D(i−1, j−1) (no stutter) }

Time warping
SLIDE 24
VERY SIMILAR to the string-editing distance:

D(i, j) = ||x[i] − y[j]|| + min{ D(i, j−1) (x-stutter), D(i−1, j) (y-stutter), D(i−1, j−1) (no stutter) }

Time warping
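The time-warping recurrence can be computed with a small dynamic-programming table; a minimal sketch with no stutter penalty (the example sequences are made up):

```python
# Time-warping distance via dynamic programming (no penalty for stutters).
def dtw(x, y):
    INF = float('inf')
    n, m = len(x), len(y)
    # D[i][j] = cost to match prefix of length i of x with prefix of length j of y
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = abs(x[i - 1] - y[j - 1]) + min(
                D[i][j - 1],      # x-stutter: x[i] matched again
                D[i - 1][j],      # y-stutter: y[j] matched again
                D[i - 1][j - 1],  # no stutter
            )
    return D[n][m]  # O(n*m) time, quadratic in the sequence lengths

print(dtw([1, 2, 3], [1, 2, 2, 3]))  # 0.0 -- warping absorbs the repeated 2
print(dtw([1, 2, 3], [1, 2, 4]))     # 1.0
```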
SLIDE 25 Time warping
- Complexity: O(M×N), quadratic in the length of the sequences
- Many variations (penalty for stutters; limit on the number/percentage of stutters; …)
- Popular in voice processing [Rabiner + Juang]
SLIDE 26 Other Distance functions
- Piece-wise linear/flat approximations; compare pieces [Keogh+01] [Faloutsos+97]
- ‘Cepstrum’ (for voice [Rabiner + Juang])
– do DFT; take log of amplitude; do DFT again!
- Allow for small gaps [Agrawal+95]
See tutorial by [Gunopulos + Das, SIGMOD01]
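The cepstrum recipe above (DFT, log of amplitude, DFT again) can be sketched with the standard library only; the test signal and lengths are made up for illustration:

```python
import cmath, math

def dft(x):
    """Naive O(n^2) discrete Fourier transform (fine for a small sketch)."""
    n = len(x)
    return [sum(x[k] * cmath.exp(-2j * math.pi * j * k / n) for k in range(n))
            for j in range(n)]

def cepstrum(signal, eps=1e-12):
    spectrum = dft(signal)                                # 1) do DFT
    log_amp = [math.log(abs(c) + eps) for c in spectrum]  # 2) log of amplitude
    return [abs(c) for c in dft(log_amp)]                 # 3) do DFT again

sig = [math.sin(2 * math.pi * t / 8) for t in range(32)]
print(cepstrum(sig)[:4])
```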
SLIDE 27 Other Distance functions
- In [Keogh+, KDD’04]: parameter-free, MDL-based
SLIDE 28
Conclusions
Prevailing distances:
– Euclidean, and
– time-warping
SLIDE 29 Outline
- Motivation
- Similarity search and distance functions
- Linear Forecasting
- Non-linear forecasting
- Conclusions
SLIDE 30
Linear Forecasting
SLIDE 31 Forecasting
“Prediction is very difficult, especially about the future.”
– Niels Bohr, Danish physicist and Nobel Prize laureate
SLIDE 32 Outline
- Motivation
- ...
- Linear Forecasting
– Auto-regression: Least Squares; RLS
– Co-evolving time sequences
– Examples
– Conclusions
SLIDE 33
Reference
[Yi+00] Byoung-Kee Yi et al.: Online Data Mining for Co-Evolving Time Sequences, ICDE 2000. (Describes MUSCLES and Recursive Least Squares)
SLIDE 34 Problem#2: Forecast
- Example: given xt-1, xt-2, …, forecast xt
[plot: number of packets sent vs. time tick, with ‘??’ at the next tick]
SLIDE 35 Forecasting: Preprocessing
MANUALLY: remove trends; spot periodicities
[plots: a series vs. time with its trend removed; a series vs. time with a 7-day periodicity]
SLIDE 36 Problem#2: Forecast
Try to express xt as a linear function of the past: xt-1, xt-2, … (up to a window of w).
Formally: x_t ≈ a_1 × x_{t-1} + a_2 × x_{t-2} + … + a_w × x_{t-w}
[plot: number of packets sent vs. time tick, with ‘??’ at the next tick]
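The windowed formulation can be sketched by building the lag matrix X and target vector y from a single series (the series values below are made up for illustration):

```python
# Cast forecasting as auto-regression: each row of X holds the w most
# recent past values; y holds the value to be predicted.
def make_lag_matrix(series, w):
    """Row for tick t holds [x_{t-1}, ..., x_{t-w}]; y holds x_t."""
    X, y = [], []
    for t in range(w, len(series)):
        X.append([series[t - k] for k in range(1, w + 1)])
        y.append(series[t])
    return X, y

series = [10, 20, 30, 40, 50, 60]
X, y = make_lag_matrix(series, w=2)
print(X)  # [[20, 10], [30, 20], [40, 30], [50, 40]]
print(y)  # [30, 40, 50, 60]
```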
SLIDE 37 (Problem: Back-cast; interpolate)
- Solution - interpolate: try to express xt as a linear function of the past AND the future:
xt+1, xt+2, …, xt+wfuture; xt-1, …, xt-wpast
(up to windows of wpast, wfuture)
[plot: number of packets sent vs. time tick, with ‘??’ at an interior tick]
SLIDE 38-41 Refresher: Linear Regression
[scatter plot: body weight vs. body height, with a fitted line]
patient | weight | height
1 | 27 | 43
2 | 43 | 54
3 | 54 | 72
… | … | …
N | 25 | ??
- express what we don’t know (= “dependent variable”)
- as a linear function of what we know (= “independent variable(s)”)
SLIDE 42 Linear Auto Regression
Time | Packets Sent (t-1) | Packets Sent (t)
1 |  |
2 | 43 | 54
3 | 54 | 72
… | … | …
N | 25 | ??
SLIDE 43-46 Linear Auto Regression
Lag w = 1
Dependent variable = # of packets sent (S[t])
Independent variable = # of packets sent (S[t-1])
‘lag-plot’: #packets sent at time t vs. #packets sent at time t-1
Time | Packets Sent (t-1) | Packets Sent (t)
1 |  |
2 | 43 | 54
3 | 54 | 72
… | … | …
N | 25 | ??
SLIDE 47 Outline
- Motivation
- ...
- Linear Forecasting
– Auto-regression: Least Squares; RLS
– Co-evolving time sequences
– Examples
– Conclusions
SLIDE 48-50 More details:
- Q1: Can it work with window w > 1?
- A1: YES! (we’ll fit a hyper-plane, then!)
[3-D plot: axes xt-2, xt-1, xt]
SLIDE 51 More details:
- Q1: Can it work with window w > 1?
- A1: YES! The problem becomes:
X[N×w] × a[w×1] = y[N×1]
– a is the vector of the regression coefficients
– X has the N values of the w indep. variables
– y has the N values of the dependent variable
SLIDE 52-53 More details:
- X[N×w] × a[w×1] = y[N×1]

[ X11 X12 … X1w ]   [ a1 ]   [ y1 ]
[ X21 X22 … X2w ] × [ a2 ] = [ y2 ]
[  …   …  …  …  ]   [ …  ]   [ …  ]
[ XN1 XN2 … XNw ]   [ aw ]   [ yN ]

(row i of X = time tick i; column j = ind-var-j, j = 1 … w)
SLIDE 54 More details
- Q2: How to estimate a1, a2, …, aw = a?
- A2: with Least Squares fit (Moore-Penrose pseudo-inverse)
- a is the vector that minimizes the RMSE from y
- a = (X^T × X)^-1 × (X^T × y)
SLIDE 55 More details
- Straightforward solution: a = (X^T × X)^-1 × (X^T × y)
– a: Regression Coeff. Vector; X: Sample Matrix (X_N is N×w)
- Observations:
– Sample matrix X grows over time
– needs matrix inversion
– O(N×w^2) computation
– O(N×w) storage
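The straightforward batch solution a = (X^T × X)^-1 × (X^T × y) can be sketched with numpy; here the pseudo-inverse is used for numerical stability, and the series is a made-up, exactly-AR(2) signal so the fitted coefficients are known:

```python
# Batch Least Squares fit of an auto-regressive model of window w.
import numpy as np

def fit_ar(series, w):
    X = np.array([[series[t - k] for k in range(1, w + 1)]
                  for t in range(w, len(series))], dtype=float)
    y = np.array(series[w:], dtype=float)
    # pinv(X) @ y equals (X.T @ X)^-1 @ (X.T @ y) when X.T @ X is invertible
    return np.linalg.pinv(X) @ y

# Synthetic series (made up): x_t = 0.6*x_{t-1} - 0.8*x_{t-2}
series = [1.0, 0.0]
for _ in range(20):
    series.append(0.6 * series[-1] - 0.8 * series[-2])

a = fit_ar(series, w=2)
print(a)  # approximately [0.6, -0.8]
```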
SLIDE 56-57 Even more details
- Q3: Can we estimate a incrementally?
- A3: Yes, with the brilliant, classic method of “Recursive Least Squares” (RLS) (see, e.g., [Yi+00], for details).
- We can get the effect of the matrix inversion WITHOUT inverting! (How is that possible?!)
- A: our matrix has special form: (X^T X)
SLIDE 58 More details
[figure: X_{N+1} is X_N (N×w) with the new row x_{N+1} appended at time tick N+1]
SKIP
SLIDE 59 More details: key ideas
- G_N ≡ (X_N^T × X_N)^-1 (the “gain matrix”, w×w)
- G_{N+1} can be computed recursively from G_N without matrix inversion
SKIP
SLIDE 60 Comparison:
- Least Squares
– Needs huge matrix (growing in size): O(N×w)
– Costly matrix operation: O(N×w^2)
- Recursive LS
– Needs much smaller, fixed-size matrix: O(w×w)
– Fast, incremental computation: O(1×w^2)
– no matrix inversion
(N = 10^6, w = 1-100)
SLIDE 61 EVEN more details:

G_{N+1} = G_N − c^-1 × [G_N × x_{N+1}^T] × [x_{N+1} × G_N]

c = [1 + x_{N+1} × G_N × x_{N+1}^T]

(x_{N+1}: 1×w row vector)
Let’s elaborate (VERY IMPORTANT, VERY VALUABLE!) SKIP
SLIDE 62 EVEN more details:

a_{N+1} = [X_{N+1}^T × X_{N+1}]^-1 × [X_{N+1}^T × y_{N+1}]

SKIP
SLIDE 63 EVEN more details:

a_{N+1} = [X_{N+1}^T × X_{N+1}]^-1 × [X_{N+1}^T × y_{N+1}]

(dimensions: a: [w×1]; X^T: [w×(N+1)]; X: [(N+1)×w]; y: [(N+1)×1])
SKIP
SLIDE 64 EVEN more details:

a_{N+1} = [X_{N+1}^T × X_{N+1}]^-1 × [X_{N+1}^T × y_{N+1}]

(focus on the [w×(N+1)] × [(N+1)×w] product, i.e., X_{N+1}^T × X_{N+1})
SKIP
SLIDE 65 EVEN more details:

a_{N+1} = [X_{N+1}^T × X_{N+1}]^-1 × [X_{N+1}^T × y_{N+1}]

G_{N+1} ≡ [X_{N+1}^T × X_{N+1}]^-1   (‘gain matrix’)

G_{N+1} = G_N − c^-1 × [G_N × x_{N+1}^T] × [x_{N+1} × G_N]
(dimensions: [w×w] = [w×w] − [1×1] × [w×w × w×1] × [1×w × w×w])

c = [1 + x_{N+1} × G_N × x_{N+1}^T]   (1×1: a SCALAR!)

SKIP
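The RLS recursion can be sketched in a few lines of numpy. This is a hedged illustration: the gain-matrix update follows the slides, the coefficient update a_{N+1} = a_N + G_{N+1} × x_{N+1}^T × (y_{N+1} − x_{N+1} × a_N) is the standard RLS form from the literature, and the AR(2) series is made up:

```python
# Recursive Least Squares: keep the w-by-w gain matrix G = (X^T X)^-1 and
# update it (and the coefficients a) per sample, with no matrix inversion
# after the initialization G_0 = delta * I.
import numpy as np

def rls_init(w, delta=1e4):
    # G_0 = delta * I (delta: a large positive number; I: w x w identity)
    return delta * np.eye(w), np.zeros(w)

def rls_update(G, a, x_new, y_new):
    """One RLS step. x_new: length-w row of lagged values; y_new: target."""
    x = np.asarray(x_new, dtype=float).reshape(1, -1)
    c = 1.0 + (x @ G @ x.T).item()             # c = [1 + x G x^T], a SCALAR
    G = G - (G @ x.T) @ (x @ G) / c            # G_{N+1} = G_N - c^-1 [G x^T][x G]
    a = a + (G @ x.T).ravel() * (y_new - float(x @ a))  # coefficient update
    return G, a

# Synthetic series (made up): x_t = 0.6*x_{t-1} - 0.8*x_{t-2}
series = [1.0, 0.0]
for _ in range(40):
    series.append(0.6 * series[-1] - 0.8 * series[-2])

G, a = rls_init(w=2)
for t in range(2, len(series)):
    G, a = rls_update(G, a, [series[t - 1], series[t - 2]], series[t])
print(a)  # close to [0.6, -0.8], matching the batch solution
```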
SLIDE 66 Altogether:
G_0 ≡ δ × I
where I: w×w identity matrix; δ: a large positive number
SKIP
SLIDE 67 Comparison:
- Least Squares
– Needs huge matrix (growing in size): O(N×w)
– Costly matrix operation: O(N×w^2)
- Recursive LS
– Needs much smaller, fixed-size matrix: O(w×w)
– Fast, incremental computation: O(1×w^2)
– no matrix inversion
(N = 10^6, w = 1-100)
SLIDE 68-70 Pictorially:
[scatter: dependent vs. independent variable, with a fitted line; a new point arrives; RLS quickly computes the new best fit]
SLIDE 71 Even more details
- Q4: can we ‘forget’ the older samples?
- A4: Yes - RLS can easily handle that [Yi+00]:
SLIDE 72-74 Adaptability - ‘forgetting’
[scatter: dependent variable (e.g., #bytes sent) vs. independent variable (e.g., #packets sent); after a trend change, (R)LS with no forgetting keeps fitting the old trend, while (R)LS with forgetting tracks the new one]
- RLS can *trivially* handle ‘forgetting’
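One common way to implement ‘forgetting’ is an exponential forgetting factor λ in (0, 1]: each old sample’s weight decays by λ per tick (λ = 1 recovers plain RLS). A hedged sketch, with a made-up one-variable series whose slope changes half-way, as in the trend-change picture:

```python
import numpy as np

def rls_update_forget(G, a, x_new, y_new, lam=0.9):
    """One exponentially-weighted RLS step (lam = forgetting factor)."""
    x = np.asarray(x_new, dtype=float).reshape(1, -1)
    c = lam + (x @ G @ x.T).item()
    G = (G - (G @ x.T) @ (x @ G) / c) / lam  # dividing by lam down-weights old data
    a = a + (G @ x.T).ravel() * (y_new - float(x @ a))
    return G, a

G, a = 1e4 * np.eye(1), np.zeros(1)   # G_0 = delta * I, as before
for t in range(100):
    slope = 2.0 if t < 50 else 5.0    # trend change at t = 50
    x = [1.0 + (t % 7)]               # made-up independent variable
    G, a = rls_update_forget(G, a, x, slope * x[0])
print(a)  # near [5.0]: the pre-change slope has been 'forgotten'
```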