CS6220: DATA MINING TECHNIQUES: Mining Time Series Data


CS6220: DATA MINING TECHNIQUES Mining Time Series Data Instructor: Yizhou Sun yzsun@ccs.neu.edu November 12, 2013 Mining Time Series Data Basic Concepts Time Series Prediction and Forecasting Time Series Similarity Search


SLIDE 1

CS6220: DATA MINING TECHNIQUES

Instructor: Yizhou Sun

yzsun@ccs.neu.edu November 12, 2013

Mining Time Series Data

SLIDE 2

Mining Time Series Data

  • Basic Concepts
  • Time Series Prediction and Forecasting
  • Time Series Similarity Search
  • Summary

SLIDE 3

Example: Inflation Rate Time Series

SLIDE 4

Example: Unemployment Rate Time Series

SLIDE 5

Example: Stock

SLIDE 6

Example: Product Sale

SLIDE 7

Time Series

  • A time series is a sequence of numerical data points, typically measured at successive times, spaced at (often uniform) time intervals
  • Random variables for a time series are represented as:
  • $Y = \{Y_1, Y_2, \ldots\}$, or
  • $Y = \{Y_t : t \in T\}$, where $T$ is the index set
  • An observation of a time series with length $N$ is represented as:
  • $Y = \{y_1, y_2, \ldots, y_N\}$

SLIDE 8

Mining Time Series Data

  • Basic Concepts
  • Time Series Prediction and Forecasting
  • Time Series Similarity Search
  • Summary

SLIDE 9

Categories of Time-Series Movements

  • Categories of Time-Series Movements (T, C, S, I)
  • Long-term or trend movements (trend curve): the general direction in which a time series is moving over a long interval of time
  • Cyclic movements or cycle variations: long-term oscillations about a trend line or curve
  • e.g., business cycles; these may or may not be periodic
  • Seasonal movements or seasonal variations
  • i.e., almost identical patterns that a time series appears to follow during corresponding months of successive years
  • Irregular or random movements

SLIDE 10

SLIDE 11

Lag, Difference

  • The first lag of $Y_t$ is $Y_{t-1}$; the $j$th lag of $Y_t$ is $Y_{t-j}$
  • The first difference of a time series: $\Delta Y_t = Y_t - Y_{t-1}$
  • Sometimes the difference in logarithms is used: $\Delta \ln(Y_t) = \ln(Y_t) - \ln(Y_{t-1})$

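The lag and difference definitions on this slide can be sketched in a few lines of Python. This is only an illustration: the function names `lag`, `first_difference`, and `log_difference` and the sample values are made up here and do not come from the lecture.

```python
import math

def lag(series, j=1):
    """Return the j-th lag: position t holds series[t - j]; the first j slots are None."""
    if not 0 <= j <= len(series):
        raise ValueError("j must be between 0 and len(series)")
    return [None] * j + list(series[:len(series) - j])

def first_difference(series):
    """Return Y_t - Y_{t-1} for t = 2..T (one element shorter than the input)."""
    return [curr - prev for prev, curr in zip(series, series[1:])]

def log_difference(series):
    """Return ln(Y_t) - ln(Y_{t-1}); every value must be strictly positive."""
    return [math.log(curr) - math.log(prev) for prev, curr in zip(series, series[1:])]

y = [100.0, 102.0, 101.0, 105.0]  # hypothetical observations
print(lag(y, 1))
print(first_difference(y))
print(log_difference(y))
```

For small changes, the log difference is approximately the relative change between consecutive observations, which is one common reason for using it.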

SLIDE 12

Example: First Lag and First Difference

SLIDE 13

Autocorrelation

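  • Autocorrelation: the correlation between a time series and its lagged values
  • The first autocorrelation: $\rho_1$
  • The $j$th autocorrelation: $\rho_j$
  • Autocovariance: the covariance between a time series and its lagged values, e.g., $\mathrm{cov}(Y_t, Y_{t-j})$ for the $j$th lag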

SLIDE 14

Sample Autocorrelation Calculation

  • The $j$th sample autocorrelation:
  • $\rho_j = \dfrac{\mathrm{cov}(Y_t, Y_{t-j})}{\mathrm{var}(Y_t)}$
  • where $\mathrm{cov}(Y_t, Y_{t-j})$ is calculated as the sample covariance of two time series: $Y(1, \ldots, T-j)$ and $Y(j+1, \ldots, T)$
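A minimal sketch of the sample autocorrelation follows. The slide's covariance formula did not survive extraction, so the details here are assumptions: the covariance is taken between the aligned sub-series Y(j+1,…,T) and Y(1,…,T-j), each centered on its own mean, and both covariance and variance are divided by T. Other normalizations (for example T-j) are also in common use. The function name `sample_autocorrelation` is hypothetical.

```python
def sample_autocorrelation(y, j):
    """j-th sample autocorrelation: cov(Y_t, Y_{t-j}) / var(Y_t).

    Assumption: sub-series means and a 1/T divisor, one common textbook choice.
    """
    T = len(y)
    if not 0 <= j < T:
        raise ValueError("j must satisfy 0 <= j < len(y)")
    head = y[:T - j]          # Y(1, ..., T-j), the lagged values
    tail = y[j:]              # Y(j+1, ..., T), the current values
    mean_head = sum(head) / len(head)
    mean_tail = sum(tail) / len(tail)
    cov = sum((cur - mean_tail) * (lagged - mean_head)
              for cur, lagged in zip(tail, head)) / T
    mean_all = sum(y) / T
    var = sum((v - mean_all) ** 2 for v in y) / T
    if var == 0:
        raise ValueError("a constant series has no defined autocorrelation")
    return cov / var

trend = [1.0, 2.0, 3.0, 4.0, 5.0]              # hypothetical trending series
zigzag = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0]     # hypothetical alternating series
print(sample_autocorrelation(trend, 1))
print(sample_autocorrelation(zigzag, 1))
```

A trending series yields a positive first autocorrelation, while a series that alternates in sign yields a negative one.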

SLIDE 15

Example of Autocorrelation

  • For inflation and its change

$\rho_1 = 0.85$, very high: last quarter's inflation rate contains much information about this quarter's inflation rate

SLIDE 16

Focus on Stationary Time Series

  • Stationarity is key for time series regression: the future is similar to the past in terms of distribution

SLIDE 17

Autoregression

  • Use past values $Y_{t-1}, Y_{t-2}, \ldots$ to predict $Y_t$
  • An autoregression is a regression model in which $Y_t$ is regressed against its own lagged values.
  • The number of lags used as regressors is called the order of the autoregression.
  • In a first-order autoregression, $Y_t$ is regressed against $Y_{t-1}$
  • In a $p$th-order autoregression, $Y_t$ is regressed against $Y_{t-1}, Y_{t-2}, \ldots, Y_{t-p}$

SLIDE 18

The First Order Autoregression Model AR(1)

  • AR(1) model: $Y_t = \beta_0 + \beta_1 Y_{t-1} + u_t$
  • The AR(1) model can be estimated by OLS regression of $Y_t$ against $Y_{t-1}$
  • Testing $\beta_1 = 0$ vs. $\beta_1 \neq 0$ provides a test of the hypothesis that $Y_{t-1}$ is not useful for forecasting $Y_t$
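As a rough sketch of the OLS estimation mentioned above, the snippet below fits an AR(1) model using the closed-form solution for a regression with one regressor. It assumes the usual form Y_t = β0 + β1·Y_{t-1} + u_t; the function names `fit_ar1` and `forecast_next` and the sample data are hypothetical, and no hypothesis test on β1 is attempted.

```python
def fit_ar1(y):
    """Estimate (beta0, beta1) by OLS regression of Y_t on Y_{t-1}."""
    if len(y) < 3:
        raise ValueError("need at least three observations")
    lagged, current = y[:-1], y[1:]          # pairs (Y_{t-1}, Y_t)
    n = len(lagged)
    mean_x = sum(lagged) / n
    mean_z = sum(current) / n
    sxx = sum((x - mean_x) ** 2 for x in lagged)
    if sxx == 0:
        raise ValueError("lagged values are constant; slope is not identified")
    sxz = sum((x - mean_x) * (z - mean_z) for x, z in zip(lagged, current))
    beta1 = sxz / sxx
    beta0 = mean_z - beta1 * mean_x
    return beta0, beta1

def forecast_next(y, beta0, beta1):
    """One-step-ahead forecast of Y_{T+1} from the last observation Y_T."""
    return beta0 + beta1 * y[-1]

# Hypothetical series generated exactly by Y_t = 1 + 0.5 * Y_{t-1}.
y = [0.0, 1.0, 1.5, 1.75, 1.875]
b0, b1 = fit_ar1(y)
print(b0, b1, forecast_next(y, b0, b1))
```

On real data the fit is not exact, and the size of the estimated β1 relative to its standard error is what the test on this slide examines.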

SLIDE 19

Prediction vs. Forecast

  • A predicted value refers to the value of Y predicted (using a regression) for an observation in the sample used to estimate the regression; this is the usual definition
  • Predicted values are "in sample"
  • A forecast refers to the value of Y forecasted for an observation not in the sample used to estimate the regression.
  • Forecasts are forecasts of the future, which cannot have been used to estimate the regression.
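The distinction can be made concrete with a small sketch: fit a first-order autoregression on the first `n_train` observations, compute in-sample predicted values for those observations, and then produce an out-of-sample forecast for the next one. The helper names (`fit_line`, `predicted_and_forecast`) and the data are illustrative assumptions, not part of the lecture.

```python
def fit_line(x, z):
    """Closed-form OLS for z = b0 + b1 * x with a single regressor."""
    n = len(x)
    mean_x, mean_z = sum(x) / n, sum(z) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxz = sum((a - mean_x) * (b - mean_z) for a, b in zip(x, z))
    b1 = sxz / sxx
    return mean_z - b1 * mean_x, b1

def predicted_and_forecast(y, n_train):
    """Return (in-sample predicted values, out-of-sample forecast).

    The regression only ever sees y[:n_train]; the forecast targets
    y[n_train], which was not used for estimation.
    """
    train = y[:n_train]
    b0, b1 = fit_line(train[:-1], train[1:])
    predicted = [b0 + b1 * prev for prev in train[:-1]]  # fitted values for train[1:]
    forecast = b0 + b1 * train[-1]                       # value for the unseen y[n_train]
    return predicted, forecast

y = [0.0, 1.0, 1.5, 1.75, 1.875, 1.9375]  # hypothetical data
predicted, forecast = predicted_and_forecast(y, n_train=5)
print(predicted)            # compare with y[1:5]
print(forecast, y[5])       # compare forecast with the held-out observation
```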

SLIDE 20

Time Series Regression with Additional Predictors

  • So far we have considered forecasting models that use only past values of Y
  • It makes sense to add other variables (X) that might be useful predictors of Y, above and beyond the predictive value of lagged values of Y
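The slide's model equation is not present in this transcript, so the following is only a sketch under an assumed specification: one lag of Y and one lag of X, Y_t = β0 + β1·Y_{t-1} + δ1·X_{t-1} + u_t, estimated by OLS through the normal equations. The names `solve_linear_system` and `fit_adl11`, and the symbols β and δ, are assumptions made for illustration.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system: regressors are collinear")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [vr - factor * vc for vr, vc in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def fit_adl11(y, x):
    """OLS for Y_t = b0 + b1 * Y_{t-1} + d1 * X_{t-1}; returns [b0, b1, d1]."""
    if len(x) != len(y):
        raise ValueError("y and x must have the same length")
    rows = [[1.0, y[t - 1], x[t - 1]] for t in range(1, len(y))]
    targets = y[1:]
    k = 3
    # Normal equations: (X'X) beta = X'y
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(rows, targets)) for i in range(k)]
    return solve_linear_system(xtx, xty)

# Hypothetical data generated exactly by Y_t = 1 + 0.5 * Y_{t-1} + 2 * X_{t-1}.
x = [1.0, 0.0, 2.0, 1.0, 3.0, 0.0, 2.0, 1.0]
y = [0.0]
for t in range(1, len(x)):
    y.append(1.0 + 0.5 * y[-1] + 2.0 * x[t - 1])
print(fit_adl11(y, x))
```

Solving the normal equations directly keeps the sketch dependency-free; a numerical library's least-squares routine would normally be preferred for stability.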

SLIDE 21

Mining Time Series Data

  • Basic Concepts
  • Time Series Prediction and Forecasting
  • Time Series Similarity Search
  • Summary

SLIDE 22

Why Similarity Search?

  • Wide applications
  • Find a time period with similar inflation rate and unemployment time series?
  • Find a similar stock to Facebook?
  • Find a similar product to a query one according to sales time series?
  • …

SLIDE 23

Example

Figure: VanEck International Fund and Fidelity Selective Precious Metal and Mineral Fund

Two similar mutual funds in different fund groups

SLIDE 24

Similarity Search for Time Series Data

  • Time Series Similarity Search
  • Euclidean distances and $L_p$ norms
  • Dynamic Time Warping (DTW)
  • Time Domain vs. Frequency Domain

SLIDE 25

Euclidean Distance and Lp Norms

  • Given two time series with equal length $n$
  • $C = \{c_1, c_2, \ldots, c_n\}$
  • $Q = \{q_1, q_2, \ldots, q_n\}$
  • $d(C, Q) = \left(\sum_{i=1}^{n} |c_i - q_i|^p\right)^{1/p}$
  • When $p = 2$, it is the Euclidean distance
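The Lp distance above translates directly into code. The sketch below is illustrative only; the function name `lp_distance` and the sample series are not from the lecture.

```python
def lp_distance(c, q, p=2):
    """Lp distance between two equal-length series: (sum |c_i - q_i|^p)^(1/p)."""
    if len(c) != len(q):
        raise ValueError("series must have equal length")
    if p < 1:
        raise ValueError("p must be at least 1")
    return sum(abs(a - b) ** p for a, b in zip(c, q)) ** (1.0 / p)

c = [0.0, 0.0]   # hypothetical series
q = [3.0, 4.0]
print(lp_distance(c, q, p=2))  # Euclidean distance
print(lp_distance(c, q, p=1))  # Manhattan distance
```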

SLIDE 26

Enhanced Lp Norm-based Distance

  • Issues with the Lp norm: cannot deal with offset and scaling in the Y-axis
  • Solution: normalizing the time series
  • $c_i' = \dfrac{c_i - \mu(C)}{\sigma(C)}$

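A short sketch of the normalization step: subtract the mean and divide by the standard deviation before comparing. It assumes the population standard deviation (divisor n); the names `z_normalize` and `euclidean` and the data are illustrative.

```python
import math

def z_normalize(series):
    """Return (c_i - mean) / std for each point, using the population std."""
    n = len(series)
    mean = sum(series) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in series) / n)
    if std == 0:
        raise ValueError("a constant series cannot be normalized")
    return [(v - mean) / std for v in series]

def euclidean(a, b):
    """Euclidean distance between two equal-length series."""
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))

c = [1.0, 2.0, 3.0, 2.0]                 # hypothetical series
q = [10.0 * v + 5.0 for v in c]          # same shape, scaled and shifted
print(euclidean(c, q))                             # large: raw values differ
print(euclidean(z_normalize(c), z_normalize(q)))   # near zero: shapes match
```

After normalization, two series that differ only by a positive scale factor and an offset become numerically identical up to rounding.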

SLIDE 27

Dynamic Time Warping (DTW)

  • For two sequences that do not line up well along the X-axis but share a roughly similar shape
  • We need to warp the time axis to achieve a better alignment

SLIDE 28

Goal of DTW

  • Given
  • Two sequences (with possibly different lengths):
  • $X = \{x_1, x_2, \ldots, x_N\}$
  • $Y = \{y_1, y_2, \ldots, y_M\}$
  • A local distance (cost) measure between $x_n$ and $y_m$
  • Goal:
  • Find an alignment between X and Y such that the overall cost is minimized

SLIDE 29

Cost Matrix of Two Time Series

SLIDE 30

Represent an Alignment by Warping Path

  • An (N,M)-warping path is a sequence $p = (p_1, p_2, \ldots, p_L)$ with $p_l = (n_l, m_l)$, satisfying three conditions:
  • Boundary condition: $p_1 = (1, 1)$, $p_L = (N, M)$
  • Starting from the first point and ending at the last point
  • Monotonicity condition: $n_l$ and $m_l$ are non-decreasing with $l$
  • Step size condition: $p_{l+1} - p_l \in \{(0, 1), (1, 0), (1, 1)\}$
  • Move one step right, up, or up-right
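The three conditions can be checked mechanically. In the sketch below (the function name `is_warping_path` is made up for illustration), a path is a list of 1-based (n, m) pairs. Only the boundary and step size conditions are tested explicitly, because each allowed step is non-negative in both coordinates, so the step size condition already implies monotonicity.

```python
def is_warping_path(path, N, M):
    """Check the boundary and step size conditions for an (N, M)-warping path."""
    if not path or path[0] != (1, 1) or path[-1] != (N, M):
        return False                       # boundary condition violated
    allowed_steps = {(0, 1), (1, 0), (1, 1)}
    for (n1, m1), (n2, m2) in zip(path, path[1:]):
        if (n2 - n1, m2 - m1) not in allowed_steps:
            return False                   # step size condition violated
    return True

print(is_warping_path([(1, 1), (2, 2), (2, 3), (3, 4)], 3, 4))  # valid path
print(is_warping_path([(1, 1), (3, 4)], 3, 4))                  # jump too large
print(is_warping_path([(1, 1), (2, 1), (1, 2), (2, 2)], 2, 2))  # moves backwards
```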

SLIDE 31

Q: Which Path is a Warping Path?

SLIDE 32

Optimal Warping Path

  • The total cost given a warping path p
  • $c_p(X, Y) = \sum_{l=1}^{L} c(x_{n_l}, y_{m_l})$
  • The optimal warping path p*
  • $c_{p^*}(X, Y) = \min\{c_p(X, Y) \mid p \text{ is an } (N, M)\text{-warping path}\}$
  • DTW distance between X and Y is defined as:
  • the optimal cost $c_{p^*}(X, Y)$

SLIDE 33

How to Find p*?

  • Naïve solution:
  • Enumerate all the possible warping paths
  • Exponential in N and M!
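To make the naive approach tangible, the sketch below enumerates every warping path recursively and takes the cheapest one. It is meant only for tiny inputs, since the number of paths grows very quickly. The function names and the absolute-difference local cost are assumptions chosen for illustration.

```python
def all_warping_paths(N, M):
    """Yield every (N, M)-warping path as a list of 1-based (n, m) pairs."""
    def extend(path):
        n, m = path[-1]
        if (n, m) == (N, M):
            yield list(path)               # copy, because path is mutated below
            return
        for dn, dm in ((1, 0), (0, 1), (1, 1)):
            if n + dn <= N and m + dm <= M:
                path.append((n + dn, m + dm))
                yield from extend(path)
                path.pop()
    yield from extend([(1, 1)])

def path_cost(path, x, y, c):
    """Total cost of one warping path: sum of local costs along it."""
    return sum(c(x[n - 1], y[m - 1]) for n, m in path)

def brute_force_dtw(x, y, c=lambda a, b: abs(a - b)):
    """DTW distance by exhaustive enumeration (exponential time)."""
    return min(path_cost(p, x, y, c) for p in all_warping_paths(len(x), len(y)))

for size in (2, 3, 4, 5):
    print(size, sum(1 for _ in all_warping_paths(size, size)))  # path counts
print(brute_force_dtw([1, 2, 3], [1, 2, 2, 3]))
```

Printing the path counts for growing sizes shows how fast the search space expands, which motivates the dynamic programming approach.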

SLIDE 34

Dynamic Programming for DTW

  • Dynamic programming:
  • Let D(n,m) denote the DTW distance between X(1,…,n) and Y(1,…,m)
  • D is called the accumulated cost matrix
  • Note D(N,M) = DTW(X,Y)
  • Recursively calculate D(n,m)
  • $D(n, m) = \min\{D(n-1, m),\ D(n, m-1),\ D(n-1, m-1)\} + c(x_n, y_m)$
  • When m or n = 1
  • $D(n, 1) = \sum_{k=1}^{n} c(x_k, y_1)$
  • $D(1, m) = \sum_{k=1}^{m} c(x_1, y_k)$

Time complexity: O(MN)
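The recurrence can be sketched as a double loop that fills the accumulated cost matrix row by row. The code uses 0-based indices, so `D[n][m]` here corresponds to D(n+1, m+1) on the slide. The absolute difference as local cost and the function names are illustrative assumptions.

```python
def accumulated_cost_matrix(x, y, c=lambda a, b: abs(a - b)):
    """Fill D so that D[n][m] is the DTW distance between x[:n+1] and y[:m+1]."""
    if not x or not y:
        raise ValueError("both sequences must be non-empty")
    N, M = len(x), len(y)
    D = [[0.0] * M for _ in range(N)]
    for n in range(N):
        for m in range(M):
            cost = c(x[n], y[m])
            if n == 0 and m == 0:
                D[n][m] = cost
            elif n == 0:                      # first row: only horizontal moves
                D[n][m] = D[n][m - 1] + cost
            elif m == 0:                      # first column: only vertical moves
                D[n][m] = D[n - 1][m] + cost
            else:
                D[n][m] = min(D[n - 1][m], D[n][m - 1], D[n - 1][m - 1]) + cost
    return D

def dtw_distance(x, y, c=lambda a, b: abs(a - b)):
    """DTW distance is the bottom-right entry of the accumulated cost matrix."""
    return accumulated_cost_matrix(x, y, c)[-1][-1]

print(dtw_distance([1, 2, 3], [1, 2, 2, 3]))
print(dtw_distance([1, 3, 4], [1, 2, 3, 4]))
```

Each of the N·M cells is computed once from at most three neighbours, which gives the O(MN) running time stated above.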

SLIDE 35

Trace back to Get p* from D
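One way to recover an optimal path, sketched under the same assumptions as the recurrence (absolute-difference local cost, hypothetical function names): start at the last cell of D and repeatedly step to the predecessor with the smallest accumulated cost until the first cell is reached. When several predecessors tie, this sketch simply takes the smallest tuple, so it returns one optimal path out of possibly many.

```python
def accumulated_cost_matrix(x, y, c):
    """D[n][m] = DTW distance between x[:n+1] and y[:m+1] (0-based indices)."""
    N, M = len(x), len(y)
    D = [[0.0] * M for _ in range(N)]
    for n in range(N):
        for m in range(M):
            cost = c(x[n], y[m])
            if n == 0 and m == 0:
                D[n][m] = cost
            elif n == 0:
                D[n][m] = D[n][m - 1] + cost
            elif m == 0:
                D[n][m] = D[n - 1][m] + cost
            else:
                D[n][m] = min(D[n - 1][m], D[n][m - 1], D[n - 1][m - 1]) + cost
    return D

def optimal_warping_path(x, y, c=lambda a, b: abs(a - b)):
    """Return (path, cost); path is a list of 1-based (n, m) pairs."""
    D = accumulated_cost_matrix(x, y, c)
    n, m = len(x) - 1, len(y) - 1
    path = [(n + 1, m + 1)]
    while (n, m) != (0, 0):
        if n == 0:
            m -= 1                            # stuck on first row: move left
        elif m == 0:
            n -= 1                            # stuck on first column: move down
        else:
            # pick the predecessor with the smallest accumulated cost
            _, n, m = min((D[n - 1][m - 1], n - 1, m - 1),
                          (D[n - 1][m], n - 1, m),
                          (D[n][m - 1], n, m - 1))
        path.append((n + 1, m + 1))
    path.reverse()
    return path, D[-1][-1]

print(optimal_warping_path([1, 2, 3], [1, 2, 2, 3]))
```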

SLIDE 36

Example

SLIDE 37

Time Domain vs. Frequency Domain

  • Many techniques for signal analysis require the data to be in the frequency domain
  • Usually data-independent transformations are used
  • The transformation matrix is determined a priori
  • discrete Fourier transform (DFT)
  • discrete wavelet transform (DWT)
  • The Euclidean distance between two signals in the time domain is the same as their Euclidean distance in the frequency domain

SLIDE 38

Example of DFT

SLIDE 39

SLIDE 40

Example of DWT (with Haar Wavelet)

SLIDE 41

SLIDE 42

Discrete Fourier Transformation

  • DFT does a good job of concentrating energy in the first few coefficients
  • If we keep only the first few coefficients of the DFT, we can compute a lower bound of the actual distance
  • Feature extraction: keep the first few coefficients (F-index) as representative of the sequence

SLIDE 43

DFT (Cont.)

  • Parseval's Theorem: $\sum_{t=0}^{n-1} |x_t|^2 = \sum_{f=0}^{n-1} |X_f|^2$
  • The Euclidean distance between two signals in the time domain is the same as their distance in the frequency domain
  • Keeping only the first few (say, 3) coefficients underestimates the distance, so there will be no false dismissals!
  • $\sum_{t=0}^{n-1} |S[t] - Q[t]|^2 \le \varepsilon \ \Rightarrow\ \sum_{f=0}^{2} |F(S)[f] - F(Q)[f]|^2 \le \varepsilon$
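The lower-bounding property can be checked numerically with a small sketch. It assumes the unitary DFT convention (a 1/√n scale factor), under which Parseval's theorem holds without extra constants; a plain O(n²) transform is used to avoid dependencies, and all names and sample signals are illustrative.

```python
import cmath
import math

def dft(x):
    """Unitary DFT: X_f = (1/sqrt(n)) * sum_t x_t * exp(-2*pi*i*f*t/n)."""
    n = len(x)
    scale = 1.0 / math.sqrt(n)
    return [scale * sum(x[t] * cmath.exp(-2j * cmath.pi * f * t / n)
                        for t in range(n))
            for f in range(n)]

def squared_distance(a, b):
    """Squared Euclidean distance; works for real and complex sequences."""
    return sum(abs(u - v) ** 2 for u, v in zip(a, b))

def lower_bound_squared(s, q, k=3):
    """Squared distance using only the first k DFT coefficients."""
    return squared_distance(dft(s)[:k], dft(q)[:k])

s = [1.0, 2.0, 3.0, 4.0, 3.0, 2.0, 1.0, 0.0]   # hypothetical signals
q = [0.0, 1.0, 3.0, 5.0, 3.0, 1.0, 0.0, 0.0]
print(squared_distance(s, q))            # time domain
print(squared_distance(dft(s), dft(q)))  # frequency domain, all coefficients
print(lower_bound_squared(s, q, k=3))    # truncated: never larger than the above
```

Because the truncated sum drops only non-negative terms, a candidate whose truncated distance already exceeds the threshold can be discarded safely, and the remaining candidates are verified with the full distance.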

SLIDE 44

Mining Time Series Data

  • Basic Concepts
  • Time Series Prediction and Forecasting
  • Time Series Similarity Search
  • Summary

SLIDE 45

Summary

  • Time Series Prediction and Forecasting
  • Autocorrelation; autoregression
  • Time series similarity search
  • Euclidean distance and Lp norm
  • Dynamic time warping
  • Time domain vs. frequency domain
