SLIDE 1 http://poloclub.gatech.edu/cse6242
CSE6242 / CX4242: Data & Visual Analytics
Time Series
Mining and Forecasting
Duen Horng (Polo) Chau
Assistant Professor
Associate Director, MS Analytics
Georgia Tech
Partly based on materials by
Professors Guy Lebanon, Jeffrey Heer, John Stasko, Christos Faloutsos, Parikshit Ram (GT PhD alum; SkyTree), Alex Gray
SLIDE 2 Outline
- Motivation
- Similarity search – distance functions
- Linear Forecasting
- Non-linear forecasting
- Conclusions
SLIDE 3 Problem definition
- Given: one or more sequences
x1, x2, …, xt, …; (y1, y2, …, yt, …); (…)
- Find: similar sequences; forecasts; patterns; clusters; outliers
SLIDE 4 Motivation - Applications
- Financial, sales, economic series
- Medical
– ECG, blood pressure, etc. monitoring
– reactions to new drugs
– elderly care
SLIDE 5 Motivation - Applications (cont’d)
– sensors monitor temperature, humidity, air quality
SLIDE 6 Motivation - Applications (cont’d)
- Weather, environment/anti-pollution
– volcano monitoring
– air/water pollutant monitoring
SLIDE 7 Motivation - Applications (cont’d)
- Computer systems
– ‘Active Disks’ (buffering, prefetching)
– web servers (ditto)
– network traffic monitoring
– ...
SLIDE 8
Stream Data: Disk accesses
[Plot: #bytes accessed vs. time]
SLIDE 9 Problem #1:
Goal: given a signal (e.g., #packets over time)
Find: patterns, periodicities, and/or compress

[Plot: count of lynx caught per year vs. year (similarly: packets per day; temperature per day)]
SLIDE 10 Problem#2: Forecast
Given xt, xt-1, …, forecast xt+1
[Plot: number of packets sent vs. time tick; forecast the next value ('??')]
SLIDE 11 Problem#2’: Similarity search
E.g., find a 3-tick pattern similar to the last one
[Plot: number of packets sent vs. time tick; find a past 3-tick pattern similar to the most recent one ('??')]
SLIDE 12 Problem #3:
- Given: A set of correlated time sequences
- Forecast ‘Sent(t)’
[Plot: number of packets vs. time tick, for three correlated sequences: sent, lost, repeated]
SLIDE 13 Important observations
Patterns, rules, forecasting and similarity indexing are closely related:
- To do forecasting, we need
– to find patterns/rules
– to find similar settings in the past
- To find outliers, we need to have forecasts
– (outlier = too far away from our forecast)
SLIDE 14 Outline
- Motivation
- Similarity search and distance functions
– Euclidean
– Time-warping
SLIDE 15 Importance of distance functions
Subtle, but absolutely necessary:
- A ‘must’ for similarity indexing (-> forecasting)
- Two major families:
– Euclidean and Lp norms
– Time warping and variations
SLIDE 16 Euclidean and Lp
D(\vec{x}, \vec{y}) = \sum_{i=1}^{n} (x_i - y_i)^2

[Figure: two sequences x(t), y(t), compared point by point]

L_p(\vec{x}, \vec{y}) = \sum_{i=1}^{n} |x_i - y_i|^p

L1: city-block = Manhattan; L2 = Euclidean; L∞ (max)
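A minimal NumPy sketch of this family (the function name and toy sequences are mine; note it applies the 1/p root, which the slide's formulas omit):

```python
import numpy as np

def lp_distance(x, y, p=2):
    """L_p distance between two equal-length sequences.

    p=1: city-block (Manhattan); p=2: Euclidean; p=np.inf: max norm.
    """
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    if np.isinf(p):
        return np.max(np.abs(x - y))
    return np.sum(np.abs(x - y) ** p) ** (1.0 / p)

# Toy "daily" sequences
x = [1.0, 2.0, 3.0, 4.0]
y = [1.5, 2.5, 2.0, 4.0]
print(lp_distance(x, y, p=1))       # city-block
print(lp_distance(x, y, p=2))       # Euclidean
print(lp_distance(x, y, p=np.inf))  # L-infinity
```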
SLIDE 17 Observation #1
Time sequence -> n-d vector
[Figure: an n-day sequence (Day-1, Day-2, …, Day-n) viewed as a point in n-d space]
SLIDE 18 Observation #2
Euclidean distance is closely related to
– cosine similarity
– dot product

[Figure: the same n-day sequence/vector (Day-1, Day-2, …, Day-n)]
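One way to see the connection (a standard identity, not spelled out on the slide):

\|\vec{x} - \vec{y}\|^2 = \|\vec{x}\|^2 + \|\vec{y}\|^2 - 2\,(\vec{x} \cdot \vec{y})

So for sequences normalized to unit length, squared Euclidean distance = 2 × (1 - cosine similarity): ranking neighbors by one is the same as ranking by the other.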
SLIDE 19 Time Warping
- Allow accelerations and decelerations
– (with or without penalty)
- THEN compute the (Euclidean) distance (+ penalty)
- Related to the string-editing distance
SLIDE 20
Time Warping
[Figure: two sequences aligned by time warping; ‘stutters’ = one point matched to several points of the other sequence]
SLIDE 21 Time warping
Q: how to compute it?
A: dynamic programming
D(i, j) = cost to match prefix of length i of first sequence x with prefix of length j of second sequence y
SLIDE 22 http://www.psb.ugent.be/cbd/papers/gentxwarper/DTWalgorithm.htm
Time warping
SLIDE 23
Time warping

Thus, with no penalty for stutter, for sequences x1, x2, …, xi; y1, y2, …, yj:

D(i, j) = \|x[i] - y[j]\| + \min\begin{cases} D(i-1, j-1) & \text{(no stutter)} \\ D(i, j-1) & \text{(x-stutter)} \\ D(i-1, j) & \text{(y-stutter)} \end{cases}
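A minimal Python sketch of this recurrence (the function name and toy sequences are mine; real implementations usually add a warping-window constraint):

```python
import numpy as np

def dtw(x, y):
    """Time-warping distance via the slide's recurrence; O(M*N)."""
    M, N = len(x), len(y)
    D = np.full((M + 1, N + 1), np.inf)  # D[i, j]: cost to match prefixes
    D[0, 0] = 0.0
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            cost = abs(x[i - 1] - y[j - 1])
            D[i, j] = cost + min(D[i - 1, j - 1],  # no stutter
                                 D[i, j - 1],      # x-stutter
                                 D[i - 1, j])      # y-stutter
    return D[M, N]

# y is a 'stuttered' (slowed-down) version of x
x = [1.0, 2.0, 3.0, 4.0]
y = [1.0, 2.0, 2.0, 3.0, 4.0]
print(dtw(x, y))  # 0.0 -- the warping absorbs the repeated 2.0
```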
SLIDE 24
Time warping

VERY SIMILAR to the string-editing distance:

D(i, j) = \|x[i] - y[j]\| + \min\begin{cases} D(i-1, j-1) & \text{(no stutter)} \\ D(i, j-1) & \text{(x-stutter)} \\ D(i-1, j) & \text{(y-stutter)} \end{cases}
SLIDE 25 Time warping
- Complexity: O(M*N), quadratic in the length of the sequences
- Many variations (penalty for stutters; limit on the number/percentage of stutters; …)
- Popular in voice processing [Rabiner + Juang]
SLIDE 26 Other Distance functions
- Piece-wise linear/flat approximation; compare pieces [Keogh+01] [Faloutsos+97]
- ‘Cepstrum’ (for voice [Rabiner + Juang]); see the sketch after this list
– do DFT; take log of amplitude; do DFT again!
- Allow for small gaps [Agrawal+95]
See tutorial by [Gunopulos + Das, SIGMOD01]
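A minimal NumPy sketch of that cepstrum recipe, taking the slide's "do DFT again" literally (the epsilon guard against log(0) is my addition):

```python
import numpy as np

def cepstrum(signal):
    """do DFT; take log of amplitude; do DFT again!"""
    spectrum = np.fft.fft(signal)
    log_amplitude = np.log(np.abs(spectrum) + 1e-12)  # eps avoids log(0)
    return np.fft.fft(log_amplitude)
```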
SLIDE 27 Other Distance functions
- In [Keogh+, KDD’04]: parameter-free, MDL-based
SLIDE 28
Conclusions
Prevailing distances:
– Euclidean and
– time-warping
SLIDE 29 Outline
- Motivation
- Similarity search and distance functions
- Linear Forecasting
- Non-linear forecasting
- Conclusions
SLIDE 30
Linear Forecasting
SLIDE 31 Outline
- Motivation
- ...
- Linear Forecasting
– Auto-regression: Least Squares; RLS
– Co-evolving time sequences
– Examples
– Conclusions
SLIDE 32 Problem#2: Forecast
- Example: given xt-1, xt-2, …, forecast xt
[Plot: number of packets sent vs. time tick; forecast the next value ('??')]
SLIDE 33 Forecasting: Preprocessing
MANUALLY: remove trends, spot periodicities (see the sketch below)

[Two plots vs. time: a rising linear trend (to remove); a periodic pattern repeating every 7 days (to spot)]
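A minimal NumPy sketch of both manual steps on a made-up series (the toy signal, its 7-tick period, and the variable names are assumptions for illustration):

```python
import numpy as np

t = np.arange(98, dtype=float)
x = 0.5 * t + 10 * np.sin(2 * np.pi * t / 7) + np.random.randn(98)  # toy series

# 1. Remove the (linear) trend
slope, intercept = np.polyfit(t, x, deg=1)
detrended = x - (slope * t + intercept)

# 2. Spot periodicities: peaks in the DFT amplitude spectrum
amp = np.abs(np.fft.rfft(detrended))
freqs = np.fft.rfftfreq(len(detrended), d=1.0)  # cycles per time tick
peak = freqs[np.argmax(amp[1:]) + 1]            # skip the DC term
print(1.0 / peak)                               # ~7: the period, in ticks
```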
SLIDE 34 Problem#2: Forecast
Express xt as a linear function of the past: xt-1, xt-2, … (up to a window of w)

Formally:

x_t \approx a_1 x_{t-1} + a_2 x_{t-2} + \dots + a_w x_{t-w}

[Plot: number of packets sent vs. time tick; forecast the next value ('??')]
SLIDE 35 (Problem: Back-cast; interpolate)
- Solution: interpolate: try to express xt as a linear function of the past AND the future:
xt+1, xt+2, …, xt+w_future; xt-1, …, xt-w_past
(up to windows of w_past, w_future)

[Plot: number of packets sent vs. time tick; estimate the missing value in the middle ('??')]
SLIDE 36 Refresher: Linear Regression

[Scatter plot: body height (y-axis) vs. body weight (x-axis), with a fitted line]

patient | weight | height
1       | 27     | 43
2       | 43     | 54
3       | 54     | 72
…       | …      | …
N       | 25     | ??

Express what we don’t know (= “dependent variable”)
as a linear function of what we know (= “independent variable(s)”)

(Slides 37-39 repeat this slide.)
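A minimal NumPy sketch of the refresher, fitting height as a linear function of weight from the table's three complete rows (including an intercept is my choice):

```python
import numpy as np

weight = np.array([27.0, 43.0, 54.0])  # independent variable
height = np.array([43.0, 54.0, 72.0])  # dependent variable

# Fit height ~ a * weight + b by least squares
A = np.column_stack([weight, np.ones_like(weight)])
(a, b), *_ = np.linalg.lstsq(A, height, rcond=None)

# Forecast the missing height of patient N (weight 25)
print(a * 25.0 + b)
```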
SLIDE 40 Linear Auto Regression

Time | Packets Sent (t-1) | Packets Sent (t)
1    | -                  | 43
2    | 43                 | 54
3    | 54                 | 72
…    | …                  | …
N    | 25                 | ??
SLIDE 41 Linear Auto Regression
[Lag-plot: #packets sent at time t (y-axis) vs. #packets sent at time t-1 (x-axis); table as on Slide 40]

Lag w = 1
Dependent variable = # of packets sent (S[t])
Independent variable = # of packets sent (S[t-1])
‘lag-plot’
(Slides 42-44 repeat Slide 41.)
SLIDE 45 More details:
- Q1: Can it work with window w > 1?
- A1: YES!
[3-d lag-plot: axes xt-2, xt-1 (independent) and xt (dependent)]
SLIDE 46 More details:
- Q1: Can it work with window w > 1?
- A1: YES! (we’ll fit a hyper-plane, then!)
[3-d lag-plot: axes xt-2, xt-1 and xt, with a fitted hyper-plane]
(Slide 47 repeats Slide 46.)
SLIDE 48 More details:
- Q1: Can it work with window w > 1?
- A1: YES! The problem becomes:
X[N×w] × a[w×1] = y[N×1]
– a: the vector of the regression coefficients
– X: the N values of the w independent variables
– y: the N values of the dependent variable
SLIDE 49 More details:
- X[N×w] × a[w×1] = y[N×1]

\begin{bmatrix} X_{11} & X_{12} & \cdots & X_{1w} \\ X_{21} & X_{22} & \cdots & X_{2w} \\ \vdots & \vdots & \ddots & \vdots \\ X_{N1} & X_{N2} & \cdots & X_{Nw} \end{bmatrix} \times \begin{bmatrix} a_1 \\ a_2 \\ \vdots \\ a_w \end{bmatrix} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{bmatrix}

(one row per time tick; columns: Ind-var-1 … Ind-var-w)

(Slide 50 repeats Slide 49.)
SLIDE 51 More details
- Q2: How to estimate a1, a2, …, aw = a?
- A2: with the Least Squares fit:
a = (X^T × X)^{-1} × (X^T × y)
(Moore-Penrose pseudo-inverse)
- a is the vector that minimizes the RMSE from y
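A minimal NumPy sketch tying it together: build the lag matrix X and target y from a series, then solve for a (the function name and toy series are mine; `np.linalg.lstsq` returns the same solution as the pseudo-inverse formula, computed more stably):

```python
import numpy as np

def fit_ar(series, w):
    """Fit x_t ~ a_1 x_{t-1} + ... + a_w x_{t-w} by least squares."""
    s = np.asarray(series, dtype=float)
    rows, targets = [], []
    for t in range(w, len(s)):
        rows.append(s[t - w : t][::-1])  # row: [x_{t-1}, ..., x_{t-w}]
        targets.append(s[t])
    X, y = np.array(rows), np.array(targets)
    a, *_ = np.linalg.lstsq(X, y, rcond=None)  # = (X^T X)^{-1} (X^T y)
    return a

s = [10.0, 20.0, 30.0, 40.0, 50.0, 60.0]  # toy series
a = fit_ar(s, w=2)
print(a)                 # ~[2, -1] for this linear series
print(a @ [60.0, 50.0])  # forecast for the next time tick: ~70
```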
SLIDE 52 More details
- Straightforward solution:
a = (X^T × X)^{-1} × (X^T × y)
a: regression coefficient vector; X: the N×w sample matrix
- Observations:
– sample matrix X grows over time
– needs matrix inversion
– O(N×w^2) computation
– O(N×w) storage
SLIDE 53 Even more details
- Q3: Can we estimate a incrementally?
- A3: Yes, with the brilliant, classic method of “Recursive Least Squares” (RLS) (see, e.g., [Yi+00], for details).
- We can do the matrix inversion WITHOUT inversion! (How is that possible?!)
SLIDE 54 Even more details
- Q3: Can we estimate a incrementally?
- A3: Yes, with the brilliant, classic method of “Recursive Least Squares” (RLS) (see, e.g., [Yi+00], for details).
- We can do the matrix inversion WITHOUT inversion! (How is that possible?!)
- A: our matrix has special form: (X^T X)
SLIDE 55 More details
At the N+1 time tick:

[Diagram: X_{N+1} = the N×w matrix X_N with the new 1×w row x_{N+1} appended]

SKIP
SLIDE 56
More details: key ideas

– G_N ≡ (X_N^T × X_N)^{-1} (the w×w “gain matrix”)
– G_{N+1} can be computed recursively from G_N, without matrix inversion

SKIP
SLIDE 57 Comparison:
- Straightforward Least Squares:
– needs huge matrix (growing in size): O(N×w)
– costly matrix operation: O(N×w^2)
- Recursive LS:
– needs much smaller, fixed-size matrix: O(w×w)
– fast, incremental computation: O(1×w^2)
– no matrix inversion
(e.g., N = 10^6, w = 1-100)
SLIDE 58 EVEN more details:
Let’s elaborate (VERY IMPORTANT, VERY VALUABLE!)

G_{N+1} = G_N - c^{-1} \, [G_N \, x_{N+1}^T] \, [x_{N+1} \, G_N]

c = [1 + x_{N+1} \, G_N \, x_{N+1}^T]

(x_{N+1}: 1×w row vector)

SKIP
SLIDE 59 EVEN more details:
a_{N+1} = [X_{N+1}^T \, X_{N+1}]^{-1} \times [X_{N+1}^T \, y_{N+1}]

SKIP
SLIDE 60 EVEN more details:
a_{N+1} = [X_{N+1}^T \, X_{N+1}]^{-1} \times [X_{N+1}^T \, y_{N+1}]

[w×1] = ([w×(N+1)] [(N+1)×w])^{-1} × [w×(N+1)] [(N+1)×1]

SKIP
SLIDE 61 EVEN more details:
a_{N+1} = [X_{N+1}^T \, X_{N+1}]^{-1} \times [X_{N+1}^T \, y_{N+1}]

(note the [w×(N+1)] [(N+1)×w] product)

SKIP
SLIDE 62 EVEN more details:
a_{N+1} = [X_{N+1}^T \, X_{N+1}]^{-1} \times [X_{N+1}^T \, y_{N+1}]

G_{N+1} ≡ [X_{N+1}^T \, X_{N+1}]^{-1}   (‘gain matrix’)

G_{N+1} = G_N - c^{-1} \, [G_N \, x_{N+1}^T] \, [x_{N+1} \, G_N]
(w×w)     (w×w) (1×1)  (w×1)            (1×w)

c = [1 + x_{N+1} \, G_N \, x_{N+1}^T]   (1×1: SCALAR!)

SKIP
SLIDE 63 Altogether:
G_0 ≡ δ I
where I: w×w identity matrix; δ: a large positive number

SKIP
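Altogether, a minimal NumPy sketch of the recursion (the class name is mine; keeping the running vector p = X^T y next to G is bookkeeping the slides leave implicit):

```python
import numpy as np

class RLS:
    """Recursive least squares, following the slides:

        c  = 1 + x G x^T            (scalar)
        G <- G - [G x^T][x G] / c   (no matrix inversion!)
        p <- p + x^T y              (running X^T y)
        a  = G p                    (regression coefficients)
    """

    def __init__(self, w, delta=1e6):
        self.G = delta * np.eye(w)  # G_0 = delta * I, delta large
        self.p = np.zeros(w)        # running X^T y
        self.a = np.zeros(w)        # current coefficient estimate

    def update(self, x, y):
        x = np.asarray(x, dtype=float)   # new sample row (length w)
        Gx = self.G @ x
        c = 1.0 + x @ Gx                 # the 1x1 SCALAR
        self.G -= np.outer(Gx, Gx) / c   # O(w^2), fixed size
        self.p += x * y
        self.a = self.G @ self.p
        return self.a

# Feed the AR(2) lag rows one at a time
rls = RLS(w=2)
for x_row, y in [([20, 10], 30), ([30, 20], 40), ([40, 30], 50), ([50, 40], 60)]:
    a = rls.update(x_row, y)
print(a)  # ~[2, -1], matching the batch least-squares fit
```

Each update costs O(w^2) no matter how large N grows, which is exactly the win claimed in the comparison slide.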
SLIDE 64 Comparison:
- Straightforward Least Squares:
– needs huge matrix (growing in size): O(N×w)
– costly matrix operation: O(N×w^2)
- Recursive LS:
– needs much smaller, fixed-size matrix: O(w×w)
– fast, incremental computation: O(1×w^2)
– no matrix inversion
(e.g., N = 10^6, w = 1-100)
SLIDE 65 Pictorially:
[Scatter: dependent variable (y-axis) vs. independent variable (x-axis), with fitted line]
SLIDE 66 Pictorially:
[Same scatter: a new point arrives]
SLIDE 67 Pictorially:
[Same scatter: the fit updated for the new point]

RLS: quickly compute new best fit
SLIDE 68 Even more details
- Q4: can we ‘forget’ the older samples?
- A4: Yes - RLS can easily handle that [Yi+00]:
SLIDE 69 Adaptability - ‘forgetting’
[Scatter: dependent variable (e.g., #bytes sent) vs. independent variable (e.g., #packets sent)]
SLIDE 70 Adaptability - ‘forgetting’
[Same scatter, after a trend change]
(R)LS with no forgetting
SLIDE 71 Adaptability - ‘forgetting’
[Same scatter, after a trend change: (R)LS with no forgetting vs. (R)LS with forgetting]
- RLS: can *trivially* handle ‘forgetting’
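One common way to implement forgetting (my assumption as to the exact variant; [Yi+00] has the details) is an exponential factor lam in (0, 1] that down-weights old samples, a drop-in change to the RLS sketch above:

```python
import numpy as np

def update_with_forgetting(rls, x, y, lam=0.98):
    """Variant of RLS.update with exponential forgetting.

    lam = 1.0 recovers plain RLS; smaller lam forgets faster,
    so the fit tracks trend changes."""
    x = np.asarray(x, dtype=float)
    Gx = rls.G @ x
    c = lam + x @ Gx                        # was: 1 + x G x^T
    rls.G = (rls.G - np.outer(Gx, Gx) / c) / lam
    rls.p = lam * rls.p + x * y             # old contributions decay
    rls.a = rls.G @ rls.p
    return rls.a
```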