
SLIDE 1

CS6220: DATA MINING TECHNIQUES

Instructor: Yizhou Sun

yzsun@ccs.neu.edu November 30, 2015

Mining Time Series Data

SLIDE 2

Announcement

  • No class next week; see you on Dec. 14.
  • The final report and presentation guideline will be released soon.

  • Office hour:
  • Tuesday: 3:30-5:00pm
  • Friday: 2:30-4:00pm

SLIDE 3

Methods to Learn


Tasks by data type (matrix, text, set, sequence, time series, graph & network, images):

  • Classification: Decision Tree; Naïve Bayes; Logistic Regression; SVM; kNN (matrix data) | HMM (sequence data) | Label Propagation* (graph & network) | Neural Network (images)
  • Clustering: K-means; hierarchical clustering; DBSCAN; Mixture Models; kernel k-means* (matrix data) | PLSA (text data) | SCAN*; Spectral Clustering* (graph & network)
  • Frequent Pattern Mining: Apriori; FP-growth (set data) | GSP; PrefixSpan (sequence data)
  • Prediction: Linear Regression (matrix data) | Autoregression (time series)
  • Similarity Search: DTW (time series) | P-PageRank (graph & network)
  • Ranking: PageRank (graph & network)

SLIDE 4

Mining Time Series Data

  • Basic Concepts
  • Time Series Prediction and Forecasting
  • Time Series Similarity Search
  • Summary

SLIDE 5

Example: Inflation Rate Time Series

SLIDE 6

Example: Unemployment Rate Time Series

SLIDE 7

Example: Stock

SLIDE 8

Example: Product Sale

SLIDE 9

Time Series

  • A time series is a sequence of numerical data points, typically measured at successive times, spaced at (often uniform) time intervals
  • Random variables for a time series are represented as:
  • Y = Y_1, Y_2, …, Y_T
  • Y = {Y_t : t ∈ T}, where T is the index set
  • An observation of a time series with length N is represented as:
  • Y = {y_1, y_2, …, y_N}

SLIDE 10

Mining Time Series Data

  • Basic Concepts
  • Time Series Prediction and Forecasting
  • Time Series Similarity Search
  • Summary

SLIDE 11

Categories of Time-Series Movements

  • Categories of time-series movements (T, C, S, I)
  • Long-term or trend movements (trend curve): the general direction in which a time series is moving over a long interval of time
  • Cyclic movements or cyclic variations: long-term oscillations about a trend line or curve
  • e.g., business cycles; may or may not be periodic
  • Seasonal movements or seasonal variations
  • e.g., almost identical patterns that a time series appears to follow during corresponding months of successive years
  • Irregular or random movements

SLIDE 12

(figure)

SLIDE 13

Lag, Difference

  • The first lag of Y_t is Y_{t−1}; the j-th lag of Y_t is Y_{t−j}
  • The first difference of a time series: ΔY_t = Y_t − Y_{t−1}
  • Sometimes the difference in logarithms is used: Δln(Y_t) = ln(Y_t) − ln(Y_{t−1})
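These operations can be sketched with numpy; the series values below are made up for illustration, not taken from the slides.

```python
import numpy as np

# Toy series; values are illustrative only.
y = np.array([100.0, 102.0, 101.0, 105.0, 110.0])

first_lag = y[:-1]             # Y_{t-1}, aligned with y[1:]
first_diff = np.diff(y)        # delta Y_t = Y_t - Y_{t-1}
log_diff = np.diff(np.log(y))  # delta ln(Y_t), approx. growth rate
```

`np.diff` pairs each value with its first lag, so `first_diff` has length T−1.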

SLIDE 14

Example: First Lag and First Difference

SLIDE 15

Autocorrelation

  • Autocorrelation: the correlation between a time series and its lagged values
  • The first autocorrelation: ρ_1
  • The j-th autocorrelation: ρ_j

Autocovariance

SLIDE 16

Sample Autocorrelation Calculation

  • The j-th sample autocorrelation:
  • ρ_j = cov(Y_t, Y_{t−j}) / var(Y_t)
  • where cov(Y_t, Y_{t−j}) is calculated over the overlapping pairs:
  • i.e., considering the two time series Y(1,…,T−j) and Y(j+1,…,T)


๐‘

๐‘ข

๐‘

๐‘ขโˆ’๐‘˜

๐‘ง๐‘˜+1 ๐‘ง1 ๐‘ง๐‘˜+2 ๐‘ง2 โ‹ฎ โ‹ฎ ๐‘ง๐‘ˆโˆ’1 ๐‘ง๐‘ˆโˆ’๐‘˜โˆ’1 ๐‘ง๐‘ˆ ๐‘ง๐‘ˆโˆ’๐‘˜

SLIDE 17

Example of Autocorrelation

  • For inflation and its change

ρ_1 = 0.85, very high: last quarter's inflation rate contains much information about this quarter's inflation rate

SLIDE 18

Focus on Stationary Time Series

  • Stationarity is key for time series regression: the future is similar to the past in terms of distribution

SLIDE 19

Autoregression

  • Use past values Y_{t−1}, Y_{t−2}, … to predict Y_t
  • An autoregression is a regression model in which Y_t is regressed against its own lagged values
  • The number of lags used as regressors is called the order of the autoregression
  • In a first-order autoregression, Y_t is regressed against Y_{t−1}
  • In a p-th order autoregression, Y_t is regressed against Y_{t−1}, Y_{t−2}, …, Y_{t−p}

SLIDE 20

The First Order Autoregression Model AR(1)

  • AR(1) model: Y_t = β_0 + β_1 Y_{t−1} + u_t
  • The AR(1) model can be estimated by OLS regression of Y_t against Y_{t−1}
  • Testing β_1 = 0 vs. β_1 ≠ 0 provides a test of the hypothesis that Y_{t−1} is not useful for forecasting Y_t
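A sketch of estimating AR(1) by OLS with numpy. The simulated data and the "true" coefficients (1.0 and 0.6) are made up for illustration; the slides do not specify them.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate an AR(1) process: Y_t = b0 + b1 * Y_{t-1} + u_t
b0, b1, T = 1.0, 0.6, 500
y = np.zeros(T)
for t in range(1, T):
    y[t] = b0 + b1 * y[t - 1] + rng.normal(scale=0.5)

# OLS regression of Y_t on an intercept and Y_{t-1}
X = np.column_stack([np.ones(T - 1), y[:-1]])
beta, *_ = np.linalg.lstsq(X, y[1:], rcond=None)
# beta[1] estimates b1; testing whether it is zero asks whether
# Y_{t-1} helps forecast Y_t
```

With 500 observations the OLS estimate of the slope should land close to the simulated 0.6.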

SLIDE 21

Prediction vs. Forecast

  • A predicted value refers to the value of Y predicted (using a regression) for an observation in the sample used to estimate the regression; this is the usual definition
  • Predicted values are "in sample"
  • A forecast refers to the value of Y forecasted for an observation not in the sample used to estimate the regression
  • Forecasts are forecasts of the future, which cannot have been used to estimate the regression

SLIDE 22

Time Series Regression with Additional Predictors

  • So far we have considered forecasting models that use only past values of Y
  • It makes sense to add other variables (X) that might be useful predictors of Y, above and beyond the predictive value of lagged values of Y
SLIDE 23

Mining Time Series Data

  • Basic Concepts
  • Time Series Prediction and Forecasting
  • Time Series Similarity Search
  • Summary

23

slide-24
SLIDE 24

Why Similarity Search?

  • Wide applications
  • Find a time period with similar inflation rate and unemployment time series
  • Find a similar stock to Facebook
  • Find a similar product to a query one according to its sales time series
  • …

SLIDE 24

Example

VanEck International Fund
Fidelity Selective Precious Metal and Mineral Fund

Two similar mutual funds in different fund groups

SLIDE 26

Similarity Search for Time Series Data

  • Time Series Similarity Search
  • Euclidean distance and L_p norms
  • Dynamic Time Warping (DTW)
  • Time Domain vs. Frequency Domain

SLIDE 27

Euclidean Distance and Lp Norms

  • Given two time series with equal length n:
  • C = c_1, c_2, …, c_n
  • Q = q_1, q_2, …, q_n
  • d(C, Q) = (Σ_i |c_i − q_i|^p)^(1/p)
  • When p = 2, it is the Euclidean distance
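The formula translates directly into a short numpy sketch:

```python
import numpy as np

def lp_distance(c, q, p=2):
    """L_p distance between two equal-length series; p=2 is Euclidean."""
    c, q = np.asarray(c, dtype=float), np.asarray(q, dtype=float)
    return np.sum(np.abs(c - q) ** p) ** (1.0 / p)

d = lp_distance([0.0, 0.0], [3.0, 4.0])  # 5.0, the familiar Euclidean case
```

Setting p = 1 gives the Manhattan distance over the same pair of series.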

SLIDE 28

Enhanced Lp Norm-based Distance

  • Issues with the L_p norm: cannot deal with offset and scaling in the Y-axis
  • Solution: normalize the time series:
  • c_i′ = (c_i − μ(C)) / σ(C)

SLIDE 29

Dynamic Time Warping (DTW)

  • For two sequences that do not line up well on the X-axis but share a roughly similar shape
  • We need to warp the time axis to achieve a better alignment

SLIDE 30

Goal of DTW

  • Given
  • Two sequences (with possibly different lengths):
  • X = {x_1, x_2, …, x_N}
  • Y = {y_1, y_2, …, y_M}
  • A local distance (cost) measure between x_n and y_m
  • Goal:
  • Find an alignment between X and Y such that the overall cost is minimized

SLIDE 31

Cost Matrix of Two Time Series

SLIDE 32

Represent an Alignment by Warping Path

  • An (N, M)-warping path is a sequence p = (p_1, p_2, …, p_L) with p_l = (n_l, m_l), satisfying three conditions:
  • Boundary condition: p_1 = (1, 1), p_L = (N, M)
  • Start at the first point and end at the last point
  • Monotonicity condition: n_l and m_l are non-decreasing in l
  • Step size condition: p_{l+1} − p_l ∈ {(0, 1), (1, 0), (1, 1)}
  • Move one step right, up, or diagonally up-right
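The three conditions can be checked mechanically. A small sketch (the function name is ours; paths are lists of 1-indexed (n, m) pairs):

```python
def is_warping_path(path, N, M):
    """Check the boundary, monotonicity, and step-size conditions."""
    if path[0] != (1, 1) or path[-1] != (N, M):
        return False  # boundary condition violated
    steps = {(0, 1), (1, 0), (1, 1)}  # right, up, or diagonal
    # Step sizes being non-negative also enforces monotonicity
    return all((b[0] - a[0], b[1] - a[1]) in steps
               for a, b in zip(path, path[1:]))

# Valid: moves only right/up/diagonally from (1,1) to (3,4)
is_warping_path([(1, 1), (2, 2), (2, 3), (3, 4)], 3, 4)   # True
# Invalid: the jump from (1,1) to (3,3) is not an allowed step
is_warping_path([(1, 1), (3, 3), (3, 4)], 3, 4)           # False
```

A checker like this answers the quiz on the next slide: trace each candidate path and see which condition fails.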

SLIDE 33

Q: Which Path is a Warping Path?

SLIDE 34

Optimal Warping Path

  • The total cost given a warping path p:
  • c_p(X, Y) = Σ_l c(x_{n_l}, y_{m_l})
  • The optimal warping path p*:
  • c_{p*}(X, Y) = min { c_p(X, Y) : p is an (N, M)-warping path }
  • The DTW distance between X and Y is defined as the optimal cost c_{p*}(X, Y)

SLIDE 35

How to Find p*?

  • Naïve solution:
  • Enumerate all possible warping paths
  • Exponential in N and M!

SLIDE 36

Dynamic Programming for DTW

  • Dynamic programming:
  • Let D(n, m) denote the DTW distance between X(1,…,n) and Y(1,…,m)
  • D is called the accumulated cost matrix
  • Note D(N, M) = DTW(X, Y)
  • Recursively calculate D(n, m):
  • D(n, m) = min{ D(n−1, m), D(n, m−1), D(n−1, m−1) } + c(x_n, y_m)
  • When m or n = 1:
  • D(n, 1) = Σ_{k=1:n} c(x_k, y_1)
  • D(1, m) = Σ_{k=1:m} c(x_1, y_k)


Time complexity: O(MN)
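The recursion translates into a short sketch; the choice of absolute difference as the local cost c is an assumption, since the slides leave the local measure generic.

```python
import numpy as np

def dtw_distance(x, y):
    """Fill the accumulated cost matrix D; the last cell is DTW(X, Y)."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    N, M = len(x), len(y)
    c = np.abs(np.subtract.outer(x, y))   # local cost matrix
    D = np.zeros((N, M))
    D[0, 0] = c[0, 0]
    for n in range(1, N):                 # first column: D(n, 1)
        D[n, 0] = D[n - 1, 0] + c[n, 0]
    for m in range(1, M):                 # first row: D(1, m)
        D[0, m] = D[0, m - 1] + c[0, m]
    for n in range(1, N):                 # O(MN) fill, as noted above
        for m in range(1, M):
            D[n, m] = c[n, m] + min(D[n - 1, m], D[n, m - 1], D[n - 1, m - 1])
    return D[N - 1, M - 1]

# Warping absorbs the repeated 2, so these sequences match exactly
dtw_distance([1.0, 2.0, 3.0], [1.0, 2.0, 2.0, 3.0])  # 0.0
```

Keeping the full matrix D also allows the trace-back on the next slide: starting from (N, M), repeatedly step to whichever of the three predecessors achieved the minimum.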

SLIDE 37

Trace back to Get p* from D

SLIDE 38

Example

SLIDE 39

Time Domain vs. Frequency Domain

  • Many techniques for signal analysis require the data to be in the frequency domain
  • Usually data-independent transformations are used
  • The transformation matrix is determined a priori
  • discrete Fourier transform (DFT)
  • discrete wavelet transform (DWT)
  • The distance between two signals in the time domain is the same as their Euclidean distance in the frequency domain

SLIDE 40

Example of DFT

SLIDE 41

SLIDE 42

Example of DWT (with Haar Wavelet)

SLIDE 43

SLIDE 44

*Discrete Fourier Transformation

  • DFT does a good job of concentrating energy in the first few coefficients
  • If we keep only the first few coefficients of the DFT, we can compute a lower bound on the actual distance
  • Feature extraction: keep the first few coefficients (F-index) as representative of the sequence
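A sketch of this idea with numpy's FFT. The helper names are ours; the 1/√n scaling makes the transform orthonormal so that Parseval's theorem holds with equal norms on both sides.

```python
import numpy as np

def f_index(x, k=3):
    """First k DFT coefficients as a feature vector (F-index)."""
    x = np.asarray(x, dtype=float)
    return np.fft.fft(x)[:k] / np.sqrt(len(x))  # orthonormal scaling

def feature_distance(s, q, k=3):
    """Distance on the first k coefficients. By Parseval this
    lower-bounds the true Euclidean distance, so range search on
    the F-index has no false dismissals."""
    return np.sqrt(np.sum(np.abs(f_index(s, k) - f_index(q, k)) ** 2))

rng = np.random.default_rng(1)
s, q = rng.normal(size=16), rng.normal(size=16)
full = feature_distance(s, q, k=16)  # all coefficients: exact distance
lb = feature_distance(s, q, k=3)     # first 3: lb <= full
```

Because the FFT is linear, the coefficient distance with all n coefficients equals the time-domain Euclidean distance exactly; truncating to k coefficients can only drop non-negative terms from the sum.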

SLIDE 45

*DFT (Cont.)

  • Parseval's Theorem
  • The Euclidean distance between two signals in the time domain is the same as their distance in the frequency domain
  • Keeping only the first few (say, 3) coefficients underestimates the distance, so there are no false dismissals!

Σ_{t=0}^{n−1} |x[t]|² = Σ_{f=0}^{n−1} |X[f]|²

Σ_t |S[t] − Q[t]|² ≤ ε ⟹ Σ_{f=1}^{3} |F(S)[f] − F(Q)[f]|² ≤ ε

SLIDE 46

Mining Time Series Data

  • Basic Concepts
  • Time Series Prediction and Forecasting
  • Time Series Similarity Search
  • Summary

SLIDE 47

Summary

  • Time Series Prediction and Forecasting
  • Autocorrelation; autoregression
  • Time series similarity search
  • Euclidean distance and Lp norm
  • Dynamic time warping
  • Time domain vs. frequency domain
