CS145: INTRODUCTION TO DATA MINING Sequence Data: Similarity Search - - PowerPoint PPT Presentation

SLIDE 1

CS145: INTRODUCTION TO DATA MINING

Instructor: Yizhou Sun

yzsun@cs.ucla.edu November 27, 2017

Sequence Data: Similarity Search

SLIDE 2

Methods to be Learnt

  • Classification
    • Vector data: Logistic Regression; Decision Tree; KNN; SVM; NN
    • Text data: Naïve Bayes for Text
  • Clustering
    • Vector data: K-means; hierarchical clustering; DBSCAN; Mixture Models
    • Text data: PLSA
  • Prediction
    • Vector data: Linear Regression; GLM*
  • Frequent Pattern Mining
    • Set data: Apriori; FP growth
    • Sequence data: GSP; PrefixSpan
  • Similarity Search
    • Sequence data: DTW

SLIDE 3

Similarity Search on Time Series Data

  • Basic Concepts
  • Time Series Similarity Search
  • *Time Series Prediction and Forecasting
  • Summary

SLIDE 4

Example: Inflation Rate Time Series

SLIDE 5

Example: Unemployment Rate Time Series

SLIDE 6

Example: Stock

SLIDE 7

Example: Product Sale

SLIDE 8

Time Series

  • A time series is a sequence of numerical data points, typically measured at successive times spaced at (often uniform) time intervals
  • Random variables for a time series are represented as:
    • $Y = Y_1, Y_2, \ldots$, or
    • $Y = \{Y_t : t \in T\}$, where $T$ is the index set
  • An observation of a time series with length N is represented as:
    • $Y = \{y_1, y_2, \ldots, y_N\}$

SLIDE 9

Similarity Search on Time Series Data

  • Basic Concepts
  • Time Series Similarity Search
  • *Time Series Prediction and Forecasting
  • Summary

SLIDE 10

Why Similarity Search?

  • Wide applications
    • Find a time period with similar inflation rate and unemployment time series?
    • Find a similar stock to Facebook?
    • Find a similar product to a query one according to sale time series?
    • …

SLIDE 11

Example

  • VanEck International Fund
  • Fidelity Selective Precious Metal and Mineral Fund

Two similar mutual funds in different fund groups

SLIDE 12

Similarity Search for Time Series Data

  • Time Series Similarity Search
  • Euclidean distances and $L_p$ norms
  • Dynamic Time Warping (DTW)
  • Time Domain vs. Frequency Domain

SLIDE 13

Euclidean Distance and Lp Norms

  • Given two time series with equal length n
    • $C = c_1, c_2, \ldots, c_n$
    • $Q = q_1, q_2, \ldots, q_n$
  • $d(C, Q) = \left(\sum_{i=1}^{n} |c_i - q_i|^p\right)^{1/p}$
  • When p=2, it is Euclidean distance
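
The Lp distance on this slide can be sketched in a few lines of Python. The function name `lp_distance` and the sample values are illustrative assumptions, not part of the lecture.

```python
def lp_distance(c, q, p=2.0):
    """L_p distance between two equal-length sequences (hypothetical helper)."""
    if len(c) != len(q):
        raise ValueError("sequences must have equal length")
    if p < 1:
        raise ValueError("p must be at least 1 for an L_p norm")
    # Sum the p-th powers of the pointwise absolute differences, then take the p-th root.
    return sum(abs(ci - qi) ** p for ci, qi in zip(c, q)) ** (1.0 / p)


C = [1.0, 2.0, 3.0]
Q = [2.0, 4.0, 6.0]
manhattan = lp_distance(C, Q, p=1)  # |1| + |2| + |3|
euclidean = lp_distance(C, Q, p=2)  # sqrt(1 + 4 + 9)
```

With p=1 this is the Manhattan distance, and with p=2 it is the Euclidean distance named on the slide.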

SLIDE 14

Enhanced Lp Norm-based Distance

  • Issues with Lp Norm: cannot deal with offset and scaling in the Y-axis
  • Solution: normalizing the time series
    • $c_i' = \dfrac{c_i - \mu(C)}{\sigma(C)}$
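
A minimal sketch of the normalization step, assuming the population standard deviation (dividing by n) for sigma; the slide does not say which estimator is intended. The helper name `z_normalize` is hypothetical.

```python
import math


def z_normalize(c):
    """Subtract the mean and divide by the standard deviation (hypothetical helper)."""
    n = len(c)
    mu = sum(c) / n
    # Assumption: population standard deviation (divide by n, not n - 1).
    sigma = math.sqrt(sum((x - mu) ** 2 for x in c) / n)
    if sigma == 0:
        raise ValueError("a constant series cannot be normalized")
    return [(x - mu) / sigma for x in c]


base = [1.0, 2.0, 3.0, 4.0]
shifted_scaled = [10.0 + 5.0 * x for x in base]  # same shape, different offset and scale
norm_a = z_normalize(base)
norm_b = z_normalize(shifted_scaled)
```

After normalization the two series coincide, so an Lp distance between them is zero even though the raw series differ in offset and scale.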

SLIDE 15

Dynamic Time Warping (DTW)

  • For two sequences that do not line up well along the X-axis, but share a roughly similar shape
    • We need to warp the time axis to get a better alignment

SLIDE 16

Goal of DTW

  • Given
    • Two sequences (with possibly different lengths):
      • $X = \{x_1, x_2, \ldots, x_N\}$
      • $Y = \{y_1, y_2, \ldots, y_M\}$
    • A local distance (cost) measure between $x_n$ and $y_m$: $c(x_n, y_m)$
  • Goal:
    • Find an alignment between X and Y such that the overall cost is minimized

SLIDE 17

Cost Matrix of Two Time Series

[Figure: cost matrix of the two time series, with entries $c(x_n, y_m)$]

SLIDE 18

Represent an Alignment by Warping Path

  • An (N,M)-warping path is a sequence $p = (p_1, p_2, \ldots, p_L)$ with $p_l = (n_l, m_l)$, satisfying the three conditions:
    • Boundary condition: $p_1 = (1, 1)$, $p_L = (N, M)$
      • Starting from the first point and ending at the last point
    • Monotonicity condition: $n_l$ and $m_l$ are non-decreasing with $l$
    • Step size condition: $p_{l+1} - p_l \in \{(0, 1), (1, 0), (1, 1)\}$
      • Move one step right, up, or up-right
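
The three conditions can be checked mechanically. The sketch below assumes a path is given as a list of 1-based (n, m) index pairs; the function name `is_warping_path` is an illustrative choice.

```python
def is_warping_path(path, N, M):
    """Check the boundary and step size conditions for a 1-based (N, M)-warping path."""
    path = [tuple(step) for step in path]
    # Boundary condition: start at (1, 1) and end at (N, M).
    if not path or path[0] != (1, 1) or path[-1] != (N, M):
        return False
    allowed_steps = {(0, 1), (1, 0), (1, 1)}
    # Step size condition; it also implies monotonicity, since every allowed step is non-negative.
    for (n0, m0), (n1, m1) in zip(path, path[1:]):
        if (n1 - n0, m1 - m0) not in allowed_steps:
            return False
    return True


valid = is_warping_path([(1, 1), (2, 1), (2, 2), (3, 3)], N=3, M=3)
skips_a_point = is_warping_path([(1, 1), (3, 2), (3, 3)], N=3, M=3)
wrong_end = is_warping_path([(1, 1), (2, 2)], N=3, M=3)
```

Only the first path is a warping path: the second violates the step size condition and the third violates the boundary condition.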

SLIDE 19

Q: Which Path is a Warping Path?

SLIDE 20

Optimal Warping Path

  • The total cost given a warping path p
    • $c_p(X, Y) = \sum_{l=1}^{L} c(x_{n_l}, y_{m_l})$
  • The optimal warping path p*
    • $c_{p^*}(X, Y) = \min\{c_p(X, Y) \mid p \text{ is an } (N, M)\text{-warping path}\}$
  • DTW distance between X and Y is defined as:
    • the optimal cost $c_{p^*}(X, Y)$

SLIDE 21

How to Find p*?

  • Naïve solution:
    • Enumerate all the possible warping paths
    • Exponential in N and M!
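
To see how quickly enumeration blows up, the number of warping paths can itself be counted with a small recurrence: each cell is reachable from its left, lower, and lower-left neighbours. This is a sketch, and `count_warping_paths` is a hypothetical helper.

```python
def count_warping_paths(N, M):
    """Count the (N, M)-warping paths that use steps (0, 1), (1, 0), and (1, 1)."""
    if N < 1 or M < 1:
        raise ValueError("N and M must be at least 1")
    # paths[n][m] = number of warping paths from (1, 1) to (n, m); row and column 0 are padding.
    paths = [[0] * (M + 1) for _ in range(N + 1)]
    paths[1][1] = 1
    for n in range(1, N + 1):
        for m in range(1, M + 1):
            if (n, m) != (1, 1):
                paths[n][m] = paths[n - 1][m] + paths[n][m - 1] + paths[n - 1][m - 1]
    return paths[N][M]


small = count_warping_paths(3, 3)
larger = count_warping_paths(10, 10)
```

Two sequences of length 3 already admit 13 warping paths, and the count for length 10 is in the millions, which is why exhaustive enumeration is impractical.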

SLIDE 22

Dynamic Programming for DTW

  • Dynamic programming:
    • Let D(n,m) denote the DTW distance between X(1,…,n) and Y(1,…,m)
    • D is called the accumulated cost matrix
    • Note D(N,M) = DTW(X,Y)
  • Recursively calculate D(n,m)
    • $D(n, m) = \min\{D(n-1, m),\ D(n, m-1),\ D(n-1, m-1)\} + c(x_n, y_m)$
    • When m or n = 1
      • $D(n, 1) = \sum_{k=1}^{n} c(x_k, y_1)$
      • $D(1, m) = \sum_{k=1}^{m} c(x_1, y_k)$

Time complexity: O(MN)
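
The recurrence for D(n,m) translates almost directly into code. The sketch below uses 0-based list indices and assumes the absolute difference as the local cost; the slides leave the cost measure open, and `dtw_distance` is an illustrative name.

```python
def dtw_distance(x, y, cost=lambda a, b: abs(a - b)):
    """Return (DTW distance, accumulated cost matrix D) for sequences x and y."""
    N, M = len(x), len(y)
    if N == 0 or M == 0:
        raise ValueError("sequences must be non-empty")
    D = [[0.0] * M for _ in range(N)]
    D[0][0] = cost(x[0], y[0])
    # First column and first row: only one way to reach each cell.
    for n in range(1, N):
        D[n][0] = D[n - 1][0] + cost(x[n], y[0])
    for m in range(1, M):
        D[0][m] = D[0][m - 1] + cost(x[0], y[m])
    # Interior cells: cheapest predecessor plus the local cost.
    for n in range(1, N):
        for m in range(1, M):
            D[n][m] = min(D[n - 1][m], D[n][m - 1], D[n - 1][m - 1]) + cost(x[n], y[m])
    return D[N - 1][M - 1], D


distance, D = dtw_distance([1.0, 3.0, 4.0], [1.0, 2.0, 4.0])
stretched, _ = dtw_distance([1.0, 2.0, 3.0, 4.0], [1.0, 1.0, 2.0, 3.0, 4.0])
```

The second call compares a series with a copy whose first value is repeated; warping absorbs the repetition, so the DTW distance is zero even though the lengths differ.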

SLIDE 23

Trace back to Get p* from D
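
One way to recover p* is to start at D(N,M) and repeatedly step to the cheapest of the three possible predecessors until (1,1) is reached. The sketch below is self-contained, assumes the absolute difference as local cost, and breaks ties in favour of the diagonal step; both choices are assumptions rather than something the slides specify.

```python
def dtw_with_path(x, y):
    """Return (DTW distance, optimal warping path as 1-based (n, m) pairs)."""
    N, M = len(x), len(y)
    if N == 0 or M == 0:
        raise ValueError("sequences must be non-empty")
    # Accumulated cost matrix, filled with the same recurrence as D(n, m).
    D = [[0.0] * M for _ in range(N)]
    for n in range(N):
        for m in range(M):
            local = abs(x[n] - y[m])
            if n == 0 and m == 0:
                D[n][m] = local
            elif n == 0:
                D[n][m] = D[n][m - 1] + local
            elif m == 0:
                D[n][m] = D[n - 1][m] + local
            else:
                D[n][m] = min(D[n - 1][m], D[n][m - 1], D[n - 1][m - 1]) + local
    # Trace back from the last cell to the first one.
    n, m = N - 1, M - 1
    path = [(n + 1, m + 1)]
    while (n, m) != (0, 0):
        if n == 0:
            m -= 1
        elif m == 0:
            n -= 1
        else:
            candidates = [(D[n - 1][m - 1], n - 1, m - 1),
                          (D[n - 1][m], n - 1, m),
                          (D[n][m - 1], n, m - 1)]
            _, n, m = min(candidates, key=lambda cand: cand[0])
        path.append((n + 1, m + 1))
    path.reverse()
    return D[N - 1][M - 1], path


dist, path = dtw_with_path([1.0, 2.0, 3.0, 4.0], [1.0, 1.0, 2.0, 3.0, 4.0])
```

The traceback visits at most N + M cells, so it adds little to the O(MN) cost of filling D.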

SLIDE 24

Example

SLIDE 25

Time Domain vs. Frequency Domain

  • Many techniques for signal analysis require the data to be in the frequency domain
  • Usually data-independent transformations are used
    • The transformation matrix is determined a priori
      • discrete Fourier transform (DFT)
      • discrete wavelet transform (DWT)
  • The Euclidean distance between two signals in the time domain is the same as their Euclidean distance in the frequency domain

SLIDE 26

Example of DFT

SLIDE 27

SLIDE 28

Example of DWT (with Haar Wavelet)
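
The slide's figure is not available in this transcript, so here is a small stand-in: a sketch of an orthonormal Haar transform, built from repeated pairwise averages and differences scaled by 1/sqrt(2). The function name `haar_dwt`, the coefficient ordering, and the sample values are assumptions made for illustration.

```python
import math


def haar_dwt(x):
    """Orthonormal Haar wavelet transform of a sequence whose length is a power of two."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    approx = list(x)
    details = []
    while len(approx) > 1:
        pairs = [(approx[i], approx[i + 1]) for i in range(0, len(approx), 2)]
        # Coarser-level details go in front of the finer ones collected so far.
        details = [(a - b) / math.sqrt(2) for a, b in pairs] + details
        approx = [(a + b) / math.sqrt(2) for a, b in pairs]
    return approx + details


signal = [4.0, 6.0, 10.0, 12.0]
coeffs = haar_dwt(signal)
energy_time = sum(v * v for v in signal)
energy_wavelet = sum(v * v for v in coeffs)
```

Because the scaling makes the transform orthonormal, the energy of the signal (and hence Euclidean distance between signals) is the same before and after the transform.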

SLIDE 29

SLIDE 30

*Discrete Fourier Transformation

  • DFT does a good job of concentrating energy in the first few coefficients
  • If we keep only the first few coefficients of the DFT, we can compute a lower bound of the actual distance
  • Feature extraction: keep the first few coefficients (F-index) as representative of the sequence

SLIDE 31

*DFT (Cont.)

  • Parseval's Theorem
    • $\sum_{t=0}^{n-1} |x_t|^2 = \sum_{f=0}^{n-1} |X_f|^2$
  • The Euclidean distance between two signals in the time domain is the same as their distance in the frequency domain
  • Keeping only the first few (say, 3) coefficients underestimates the distance, so there will be no false dismissals!
    • $\sum_{t=1}^{n} |S[t] - Q[t]|^2 \le \varepsilon \;\Rightarrow\; \sum_{f=1}^{3} |F(S)[f] - F(Q)[f]|^2 \le \varepsilon$
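
Both claims on this slide can be checked numerically. The sketch below uses a naive O(n^2) DFT scaled by 1/sqrt(n), an assumed normalization under which Parseval's theorem holds without an extra factor; the helper names and the sample series are illustrative.

```python
import cmath
import math


def dft(x):
    """Naive discrete Fourier transform with 1/sqrt(n) scaling (assumed normalization)."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * cmath.pi * f * t / n) for t in range(n)) / math.sqrt(n)
        for f in range(n)
    ]


def squared_distance(a, b):
    """Squared Euclidean distance; abs() also handles complex coefficients."""
    return sum(abs(ai - bi) ** 2 for ai, bi in zip(a, b))


S = [1.0, 2.0, 4.0, 3.0, 2.0, 1.0, 0.0, 1.0]
Q = [0.0, 2.0, 3.0, 5.0, 2.0, 2.0, 1.0, 0.0]
FS, FQ = dft(S), dft(Q)

time_dist = squared_distance(S, Q)            # distance in the time domain
freq_dist = squared_distance(FS, FQ)          # same distance in the frequency domain
lower_bound = squared_distance(FS[:3], FQ[:3])  # only the first 3 coefficients
```

Since dropping coefficients only removes non-negative terms from the sum, the truncated distance never exceeds the true one. A search that discards candidates whose truncated distance is above the threshold therefore cannot discard a true match.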

SLIDE 32

Similarity Search on Time Series Data

  • Basic Concepts
  • Time Series Similarity Search
  • *Time Series Prediction and Forecasting
  • Summary

SLIDE 33

Categories of Time-Series Movements

  • Categories of Time-Series Movements (T, C, S, I)
    • Long-term or trend movements (trend curve): general direction in which a time series is moving over a long interval of time
    • Cyclic movements or cycle variations: long-term oscillations about a trend line or curve
      • e.g., business cycles, may or may not be periodic
    • Seasonal movements or seasonal variations
      • e.g., almost identical patterns that a time series appears to follow during corresponding months of successive years
    • Irregular or random movements

SLIDE 34

SLIDE 35

Lag, Difference

  • The first lag of $Y_t$ is $Y_{t-1}$; the jth lag of $Y_t$ is $Y_{t-j}$
  • The first difference of a time series: $\Delta Y_t = Y_t - Y_{t-1}$
  • Sometimes the difference in logarithms is used: $\Delta \ln(Y_t) = \ln(Y_t) - \ln(Y_{t-1})$
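
Lags and differences are simple index shifts. The following sketch uses hypothetical helper names and made-up values to show the three operations on a short series.

```python
import math


def lagged_pairs(y, j=1):
    """Pair each Y_t with its j-th lag Y_{t-j}, for t = j+1, ..., T."""
    if not 1 <= j < len(y):
        raise ValueError("lag must satisfy 1 <= j < len(y)")
    return list(zip(y[j:], y[:-j]))


def first_difference(y):
    """Delta Y_t = Y_t - Y_{t-1}."""
    return [y[t] - y[t - 1] for t in range(1, len(y))]


def log_difference(y):
    """Delta ln(Y_t) = ln(Y_t) - ln(Y_{t-1}); requires positive values."""
    return [math.log(y[t]) - math.log(y[t - 1]) for t in range(1, len(y))]


y = [100.0, 110.0, 121.0]           # grows by 10% per period
pairs = lagged_pairs(y, 1)          # (Y_t, Y_{t-1}) pairs
diffs = first_difference(y)         # absolute changes
log_diffs = log_difference(y)       # each close to ln(1.1)
```

For a series growing at a constant rate, the first differences keep increasing while the log differences stay constant, which is one reason the log difference is often preferred.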

SLIDE 36

Example: First Lag and First Difference

SLIDE 37

Autocorrelation

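  • Autocorrelation: the correlation between a time series and its lagged values
    • The first autocorrelation $\rho_1 = \mathrm{corr}(Y_t, Y_{t-1})$
    • The jth autocorrelation $\rho_j = \mathrm{corr}(Y_t, Y_{t-j})$
    • The covariance $\mathrm{cov}(Y_t, Y_{t-j})$ in the numerator of $\rho_j$ is the jth autocovariance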

SLIDE 38

Sample Autocorrelation Calculation

  • The jth sample autocorrelation
    • $\hat{\rho}_j = \dfrac{\widehat{\mathrm{cov}}(Y_t, Y_{t-j})}{\widehat{\mathrm{var}}(Y_t)}$
  • Where $\widehat{\mathrm{cov}}(Y_t, Y_{t-j})$ is calculated from the paired values below
    • i.e., considering two time series: Y(1,…,T-j) and Y(j+1,…,T)

    $Y_t$        $Y_{t-j}$
    $y_{j+1}$    $y_1$
    $y_{j+2}$    $y_2$
    ⋮            ⋮
    $y_{T-1}$    $y_{T-j-1}$
    $y_T$        $y_{T-j}$
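
A minimal sketch of the sample autocorrelation follows. The slide's exact covariance formula did not survive extraction, so this version assumes one common estimator: the overall sample mean for both series and division by T in both numerator and denominator. The name `sample_autocorrelation` is hypothetical.

```python
def sample_autocorrelation(y, j):
    """j-th sample autocorrelation: estimated cov(Y_t, Y_{t-j}) / estimated var(Y_t)."""
    T = len(y)
    if not 0 <= j < T:
        raise ValueError("lag must satisfy 0 <= j < len(y)")
    mean = sum(y) / T
    # Assumption: both estimates use the overall mean and are divided by T.
    variance = sum((v - mean) ** 2 for v in y) / T
    if variance == 0:
        raise ValueError("autocorrelation is undefined for a constant series")
    # Pairs (y[t], y[t - j]) for t = j, ..., T - 1 correspond to Y(j+1..T) and Y(1..T-j).
    covariance = sum((y[t] - mean) * (y[t - j] - mean) for t in range(j, T)) / T
    return covariance / variance


y = [1.0, 2.0, 3.0, 4.0, 5.0]
rho_0 = sample_autocorrelation(y, 0)
rho_1 = sample_autocorrelation(y, 1)
```

Other estimators use separate means for the two sub-series or divide by T - j; they give slightly different values on short series.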

SLIDE 39

Example of Autocorrelation

  • For inflation and its change
    • $\rho_1 = 0.85$, very high: last quarter's inflation rate contains much information about this quarter's inflation rate

SLIDE 40

Focus on Stationary Time Series

  • Stationarity is key for time series regression: the future is similar to the past in terms of distribution

SLIDE 41

Autoregression

  • Use past values $Y_{t-1}, Y_{t-2}, \ldots$ to predict $Y_t$
  • An autoregression is a regression model in which $Y_t$ is regressed against its own lagged values.
  • The number of lags used as regressors is called the order of the autoregression.
    • In a first order autoregression, $Y_t$ is regressed against $Y_{t-1}$
    • In a pth order autoregression, $Y_t$ is regressed against $Y_{t-1}, Y_{t-2}, \ldots, Y_{t-p}$

SLIDE 42

The First Order Autoregression Model AR(1)

  • AR(1) model: $Y_t = \beta_0 + \beta_1 Y_{t-1} + u_t$
  • The AR(1) model can be estimated by OLS regression of $Y_t$ against $Y_{t-1}$
  • Testing $\beta_1 = 0$ vs. $\beta_1 \neq 0$ provides a test of the hypothesis that $Y_{t-1}$ is not useful for forecasting $Y_t$
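
The OLS estimate for a first order autoregression has a closed form, sketched below in plain Python. The standard error assumes homoskedastic errors, which is a simplifying assumption; the function name `fit_ar1` and the sample series are made up for illustration.

```python
import math


def fit_ar1(y):
    """OLS regression of Y_t on Y_{t-1}; returns (beta0, beta1, se_beta1)."""
    lagged, current = y[:-1], y[1:]
    n = len(current)
    if n < 3:
        raise ValueError("need at least 4 observations")
    lag_mean = sum(lagged) / n
    cur_mean = sum(current) / n
    sxx = sum((a - lag_mean) ** 2 for a in lagged)
    if sxx == 0:
        raise ValueError("lagged values are constant")
    sxy = sum((a - lag_mean) * (b - cur_mean) for a, b in zip(lagged, current))
    beta1 = sxy / sxx
    beta0 = cur_mean - beta1 * lag_mean
    # Residual sum of squares, used for the homoskedastic standard error (assumption).
    rss = sum((b - beta0 - beta1 * a) ** 2 for a, b in zip(lagged, current))
    se_beta1 = math.sqrt(rss / (n - 2) / sxx)
    return beta0, beta1, se_beta1


series = [2.0, 2.3, 1.9, 2.4, 2.6, 2.2, 2.5, 2.9, 2.7, 3.0]  # illustrative values
beta0, beta1, se_beta1 = fit_ar1(series)
# t-statistic for the null hypothesis that the slope is zero.
t_stat = beta1 / se_beta1 if se_beta1 > 0 else float("inf")
```

A t-statistic that is large in absolute value is evidence against the hypothesis that the lagged value carries no information about the current one.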

SLIDE 43

Prediction vs. Forecast

  • A predicted value refers to the value of Y predicted (using a regression) for an observation in the sample used to estimate the regression; this is the usual definition
    • Predicted values are "in sample"
  • A forecast refers to the value of Y forecasted for an observation not in the sample used to estimate the regression.
    • Forecasts are forecasts of the future, which cannot have been used to estimate the regression.
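
The distinction can be made concrete with a first order autoregression. In this sketch, fitted values for periods inside the sample are predicted values, while the value computed for period T+1 is a forecast. The helper `ols_ar1` and the data are illustrative assumptions.

```python
def ols_ar1(y):
    """OLS intercept and slope for the regression of Y_t on Y_{t-1}."""
    lagged, current = y[:-1], y[1:]
    n = len(current)
    lag_mean = sum(lagged) / n
    cur_mean = sum(current) / n
    sxx = sum((a - lag_mean) ** 2 for a in lagged)
    if sxx == 0:
        raise ValueError("lagged values are constant")
    slope = sum((a - lag_mean) * (b - cur_mean) for a, b in zip(lagged, current)) / sxx
    return cur_mean - slope * lag_mean, slope


sample = [2.0, 2.3, 1.9, 2.4, 2.6, 2.2, 2.5, 2.9, 2.7, 3.0]  # illustrative values
beta0, beta1 = ols_ar1(sample)

# Predicted values: in sample, for periods 2..T that were used in the estimation.
predicted = [beta0 + beta1 * prev for prev in sample[:-1]]
residuals = [actual - fit for actual, fit in zip(sample[1:], predicted)]

# Forecast: out of sample, for period T+1, which was not used in the estimation.
forecast = beta0 + beta1 * sample[-1]
```

The in-sample residuals sum to zero by construction of OLS with an intercept; the forecast error can only be measured once the next observation arrives.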

SLIDE 44

Time Series Regression with Additional Predictors

  • So far we have considered forecasting models that use only past values of Y
  • It makes sense to add other variables (X) that might be useful predictors of Y, above and beyond the predictive value of lagged values of Y

SLIDE 45

Similarity Search on Time Series Data

  • Basic Concepts
  • Time Series Similarity Search
  • *Time Series Prediction and Forecasting
  • Summary

SLIDE 46

Summary

  • Time series similarity search
    • Euclidean distance and Lp norm
    • Dynamic time warping
    • Time domain vs. frequency domain
  • *Time Series Prediction and Forecasting
    • Autocorrelation; autoregression
