Chapter 7-1: Sequential Data — Jilles Vreeken (IRDM 15/16)



SLIDE 1

IRDM ‘15/16

Jilles Vreeken

Chapter 7-1: Sequential Data

24 Nov 2015
Revision 1 (November 26th): definition of smoothing clarified

SLIDE 2

IRDM Chapter 7, overview

 Time Series
1. Basic Ideas
2. Prediction
3. Motif Discovery

 Discrete Sequences
4. Basic Ideas
5. Pattern Discovery
6. Hidden Markov Models

You'll find this covered in Aggarwal Ch. 3.4, 14, 15

SLIDE 3

IRDM Chapter 7, today

 Time Series
1. Basic Ideas
2. Prediction
3. Motif Discovery

 Discrete Sequences
4. Basic Ideas
5. Pattern Discovery
6. Hidden Markov Models

You'll find this covered in Aggarwal Ch. 3.4, 14, 15

SLIDE 4

Chapter 7.1: Basic Ideas

Aggarwal Ch. 14.1-14.2

SLIDE 5

Temperature Data

Temp (°C): 28.2  25.4  30.5  15.7  33.4  29.4  28.6  16.1  28.5  27.9  15.5  31.4

SLIDE 6

Temperature Data

Time     Temp (°C)
June-15  28.2
June-16  25.4
June-17  30.5
June-18  15.7
June-19  33.4
June-20  29.4
June-22  28.6
June-23  16.1
June-24  28.5
June-25  27.9
June-26  15.5
June-27  31.4

[Figure: Daily Temperature]


SLIDE 8

Temperature Data

Time     Temp (°C)
June-15  28.2
June-16  25.4
June-17  30.5
Sept-18  15.7
June-19  33.4
June-20  29.4
June-22  28.6
Sept-23  16.1
Sept-24  28.5
June-25  27.9
Sept-26  15.5
June-27  31.4

[Figure: Daily Temperature]

SLIDE 9

Applications

Stock analysis · Weather forecasting · Health monitoring · Social network analysis

SLIDE 10

Definition

A time series of length n consists of n tuples (t_1, Y_1), (t_2, Y_2), …, (t_n, Y_n), where for a tuple (t_j, Y_j), t_j is the time stamp and Y_j is the data at time t_j, and we have a total order on the time stamps: t_1 < t_2 < ⋯ < t_n

Length
 may either be finite or infinite

Time stamps
 may be continuous; in practice, integers are easier

Data
 when talking about time series: usually numeric, continuous real-valued
 may be univariate (one attribute) or multivariate (multiple attributes)

SLIDE 11

Probabilistic Model of Time Series

Consider the data Y_j at time t_j as a random variable
 the actual data we observe at t_j is a realization of Y_j

Some probabilistic properties can be stable over time
 e.g. the mean μ_j of Y_j does not change (much)
 the covariance between pairs (Y_j, Y_{j+h}) is (almost) the same as between (Y_1, Y_{1+h}), i.e. the autocovariance of Y does not change (much)

A time series is stationary if the process behind it does not change:
 μ_s = μ_t = μ for all s, t, and
 C_YY(s, t) = C_YY(t − s) = C_YY(h), where h = |t − s| is the amount of time by which the signal is shifted

Stationary time series are easy to model and predict
 most real-world time series, however, are anything but stationary

(recall: if Y_t has mean μ_t = E[Y_t], then C_YY(s, t) = cov(Y_s, Y_t) = E[Y_s · Y_t] − μ_s · μ_t)
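As a quick illustration of these definitions, the sample mean and sample autocovariance can be estimated directly from one observed realization; for a stationary series they look roughly the same over different stretches of the data. A minimal sketch in Python (the helper names are ours, not from the lecture):

```python
def sample_mean(Y):
    return sum(Y) / len(Y)

def sample_autocov(Y, h):
    """Estimate C_YY(h) = cov(Y_t, Y_{t+h}) from a single realization."""
    mu = sample_mean(Y)
    n = len(Y)
    return sum((Y[t] - mu) * (Y[t + h] - mu) for t in range(n - h)) / (n - h)

# A perfectly periodic series with period 2: the mean is stable, the
# autocovariance is positive at the period length and negative at half of it.
Y = [1.0, -1.0] * 50
print(sample_autocov(Y, 2) > 0, sample_autocov(Y, 1) < 0)  # True True
```

This estimator only makes sense under the stationarity assumption above: it pools all pairs at lag h, which is justified exactly when C_YY depends on the lag alone.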

SLIDE 12

Stationarity of Time Series

[Figure: Daily Temperature vs. Monthly Temperature]

SLIDE 13

Seasonality & trend

[Figure: Monthly Temperature, 2011–2013]

SLIDE 14

Formulation

Classically, we assume a time series Y is composed of

    Y_j = seasonality_j + trend_j + noise_j

where noise_j is stationary. To make Y stationary, we simply have to remove seasonality and trend.


SLIDE 16

Seasonality

Seasonality is essentially periodicity
 seasonality is a periodic function of time with period d: seasonality_j = seasonality_{j−d}

How to find the seasonality function?
1. by fitting a sine or cosine function
    difficult – the signal may also only be sine-ish
2. by differencing:
       Y_j     = seasonality_j     + trend_j     + noise_j
       Y_{j−d} = seasonality_{j−d} + trend_{j−d} + noise_{j−d}
       Y'_j    = Y_j − Y_{j−d}
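Seasonal differencing is a one-liner in code. A small sketch (the helper name is ours) showing that differencing at the period d removes a purely seasonal component:

```python
def seasonal_diff(Y, d):
    """Y'_j = Y_j - Y_{j-d}: removes a seasonal component with period d."""
    return [Y[j] - Y[j - d] for j in range(d, len(Y))]

# Three years of a purely seasonal monthly signal (period d = 12):
season = [5, 6, 10, 15, 20, 25, 28, 27, 22, 16, 10, 6]
Y = season * 3
print(seasonal_diff(Y, 12))  # all zeros: the seasonality is gone
```

On real data the result is not zero, of course; it is the trend-plus-noise remainder of the decomposition above.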

SLIDE 17

[Figure: Monthly Temperature, 2011–2013]

Y'_j = Y_j − Y_{j−d}, where d = 12

SLIDE 18

Example: Removing Seasonality

[Figure: Monthly Temperature]

This is the time series we obtained by removing seasonality

SLIDE 19

Trend

Trend is a polynomial function of time (assumption)

How to find the trend function?
1. by fitting functions
    difficult to do – up to what order, when to stop?
2. by differencing:
       Y'_j  = Y_j − Y_{j−1}
       Y''_j = Y'_j − Y'_{j−1}
    usually differencing 2 times is enough
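Each round of differencing reduces the degree of a polynomial trend by one, which is why two rounds usually suffice. A small sketch (helper name is ours) on a quadratic trend:

```python
def diff(Y):
    """Y'_j = Y_j - Y_{j-1}: one round of differencing."""
    return [Y[j] - Y[j - 1] for j in range(1, len(Y))]

# A quadratic trend: one difference leaves a linear trend,
# a second difference leaves a constant.
Y = [j * j for j in range(10)]
print(diff(Y))        # [1, 3, 5, ...] — linear
print(diff(diff(Y)))  # [2, 2, 2, ...] — constant
```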

SLIDE 20

Example: Removing Trend

[Figure: Monthly Temperature]

This is the time series we obtained by removing seasonality

SLIDE 21

Example: Removing Trend

[Figure: Monthly Temperature]

This is the time series we obtained by removing seasonality and trend

Y'_j = Y_j − Y_{j−1}

SLIDE 22

Example: Removing Trend

[Figure: Monthly Temperature]

The left-over fluctuations are either noise or non-trivial patterns

Y'_j = Y_j − Y_{j−1}


SLIDE 24

Pre-processing

We can infer missing values by interpolation:

    Y_l = Y_j + (t_l − t_j) / (t_k − t_j) × (Y_k − Y_j)

where t_j < t_l < t_k

    Time     Temp (°C)
1   June-19  33.4
2   June-20  29.4
4   June-22  (missing)
5   June-23  16.1

Temperature on June-22:

    Y_4 = Y_2 + (t_4 − t_2)/(t_5 − t_2) × (Y_5 − Y_2)
        = 29.4 + (4 − 2)/(5 − 2) × (16.1 − 29.4)
        ≈ 20.5
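The worked example translates directly into code; a minimal sketch of linear interpolation (the function name is ours):

```python
def interpolate(t_j, y_j, t_k, y_k, t_l):
    """Linearly interpolate the missing value Y_l, assuming t_j < t_l < t_k."""
    return y_j + (t_l - t_j) / (t_k - t_j) * (y_k - y_j)

# June-22 (t = 4) from June-20 (t = 2, 29.4 C) and June-23 (t = 5, 16.1 C):
print(round(interpolate(2, 29.4, 5, 16.1, 4), 1))  # 20.5
```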

SLIDE 25

Smoothing

We can remove noise by smoothing. Standard options include averaging

    Y'_j = avg(Y_{j−w}, …, Y_j)

where the window length w is a user-specified parameter.

We can give more weight to recent values by exponential smoothing

    Y'_j = (1 − β)^j · Y'_0 + β · Σ_{k=1}^{j} (1 − β)^{j−k} · Y_k

where the user chooses the decay factor β.

(updated on Nov 26th: we now average explicitly over past values)
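The closed form above is the unrolled version of the familiar one-step update Y'_j = β·Y_j + (1 − β)·Y'_{j−1}. A sketch (our own helper names, assuming Y'_0 = Y_0) verifying that both give the same smoothed series:

```python
def exp_smooth_recursive(Y, beta):
    """One-step update: Y'_j = beta * Y_j + (1 - beta) * Y'_{j-1}."""
    out = [Y[0]]
    for y in Y[1:]:
        out.append(beta * y + (1 - beta) * out[-1])
    return out

def exp_smooth_closed(Y, beta):
    """Closed form: Y'_j = (1-beta)^j Y_0 + beta * sum_{k=1}^{j} (1-beta)^(j-k) Y_k."""
    return [(1 - beta) ** j * Y[0]
            + beta * sum((1 - beta) ** (j - k) * Y[k] for k in range(1, j + 1))
            for j in range(len(Y))]

Y = [28.2, 25.4, 30.5, 15.7, 33.4, 29.4]
a = exp_smooth_recursive(Y, 0.3)
b = exp_smooth_closed(Y, 0.3)
print(all(abs(x - y) < 1e-9 for x, y in zip(a, b)))  # True
```

The recursive form is what one implements in practice: O(1) work per new observation, no history needed.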

SLIDE 26

Chapter 7.2: Forecasting

Aggarwal Ch. 14.3

SLIDE 27

Principle of Forecasting

If we wish to make predictions, then clearly we must assume that something is stable over time.

SLIDE 28

Autoregressive (AR) model

Future values depend on past values + random noise
 assumption: the time series exhibits autocorrelation

Which past values?
 the w immediately previous values

What relation between past and future?
 a linear combination

What kind of noise?
 Gaussian

SLIDE 29

AR, formally

Future value is a linear combination of past values + white noise:

    Y_t = Σ_{i=1}^{w} a_i · Y_{t−i} + c + ε_t

where ε_t ~ 𝒩(0, σ²)

(linear combination of past values, plus noise with shifted mean)

SLIDE 30

Least-squares regression

    ε_t = Y_t − (a_1 · Y_{t−1} + a_2 · Y_{t−2} + ⋯ + a_w · Y_{t−w} + c)

(actual value minus predicted value)

Given data D of N training instances, we want to find a_1, …, a_w and c that minimise the mean squared error

    1/(N − w) · Σ_{t=w+1}^{N} ε_t²

the prediction error is simply the Gaussian noise in the AR model; the smaller we can get this value, the better!
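Minimising this mean squared error is an ordinary least-squares problem over the lagged values. A self-contained sketch (all helper names are ours; a noise-free toy series so the fit recovers the coefficients exactly, whereas real data would include the ε_t term):

```python
def solve_linear(A, b):
    """Solve a small square system A x = b by Gauss-Jordan elimination."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(n):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [x - f * y for x, y in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]

def fit_ar(Y, w):
    """Ordinary least squares for Y_t ~ a_1 Y_{t-1} + ... + a_w Y_{t-w} + c,
    solved via the normal equations (X^T X) beta = X^T y."""
    X = [[Y[t - i] for i in range(1, w + 1)] + [1.0] for t in range(w, len(Y))]
    y = Y[w:]
    p = w + 1
    XtX = [[sum(r[i] * r[j] for r in X) for j in range(p)] for i in range(p)]
    Xty = [sum(r[i] * yt for r, yt in zip(X, y)) for i in range(p)]
    beta = solve_linear(XtX, Xty)
    return beta[:w], beta[w]          # (a_1, ..., a_w) and c

# Noise-free toy series generated by Y_t = 0.5 Y_{t-1} - 0.3 Y_{t-2} + 1.0:
Y = [0.0, 1.0]
for _ in range(30):
    Y.append(0.5 * Y[-1] - 0.3 * Y[-2] + 1.0)
a, c = fit_ar(Y, 2)                   # recovers a ~ (0.5, -0.3), c ~ 1.0
```

In practice one would of course call a library routine (the slide after this one names the MATLAB and R functions) rather than solve the normal equations by hand.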

SLIDE 31

Solving AR

Find a_1, …, a_w and c that minimize 1/(N − w) · Σ_{t=w+1}^{N} ε_t²

There are different solving strategies available
 ordinary least squares: assumes ε_t and Y_t are uncorrelated
 generalized least squares: assumes correlation exists but is known
 iteratively reweighted least squares: assumes correlation is unknown

Many standard tools available to do AR
 MATLAB: the ar function
 R: the arima function

SLIDE 32

Example: AR

Monthly temperature measured above the ground in a province of Vietnam from 1971 to 2001

[Figure: Original Data, Jan-71 to Oct-01]

SLIDE 33

[Figure: Original Data · Season Removed · Differencing 1 (monthly temperature, 1971–2001)]

SLIDE 34

[Figure: MSE vs. w for Original Data, Season Removed, and Differencing 1 (w = 1, …, 20)]

These plots show how the MSE behaves with respect to w, i.e. they help to choose w.

SLIDE 35

Moving Average (MA) model

Future values depend on a deterministic factor + noise
 assumption: the time series depends on a history of shocks

What deterministic factor?
 the mean of the time series

Noise over what past values?
 the current value and the q immediately previous values

What kind of noise?
 Gaussian

SLIDE 36

MA, formally

The MA(q) model is defined as

    Y_t = μ + ε_t + Σ_{i=1}^{q} b_i · ε_{t−i}

where ε_t ~ 𝒩(0, σ²)

(mean, plus current noise, plus past noise)

Recall, for the AR(w) model we had

    Y_t = c + ε_t + Σ_{i=1}^{w} a_i · Y_{t−i}

SLIDE 37

Solving MA

Find those b_1, …, b_q that minimize the error. Unlike for AR, this problem is not linear
 to identify the noise terms, we need to know b_1, …, b_q
 to identify b_1, …, b_q, we need to know the noise terms
 typically we use an iterative non-linear fitting approach instead of linear least squares

SLIDE 38

The ARMA model

ARMA combines the AR model with the MA model. Future values depend on past values + a history of noise
 the time series depends on both autocorrelation and a history of shocks

The ARMA model has two parameters, w and q
 window length w for autocorrelation
 history length q for noise

What kind of noise?
 Gaussian

SLIDE 39

ARMA, formally

ARMA combines the AR model with the MA model.

Autoregressive model, AR(w):

    Y_t = c + ε_t + Σ_{i=1}^{w} a_i · Y_{t−i}

Moving Average model, MA(q):

    Y_t = μ + ε_t + Σ_{i=1}^{q} b_i · ε_{t−i}

Autoregressive Moving Average model, ARMA(w, q):

    Y_t = c + ε_t + Σ_{i=1}^{w} a_i · Y_{t−i} + Σ_{i=1}^{q} b_i · ε_{t−i}
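To make the roles of the two sums concrete, here is a small simulation sketch (helper name and defaults are ours, not from the lecture) that draws ε_t ~ 𝒩(0, σ²) and applies the ARMA(w, q) recursion:

```python
import random

def simulate_arma(a, b, c, n, sigma=1.0, seed=0):
    """Generate Y_t = c + eps_t + sum_i a_i Y_{t-i} + sum_i b_i eps_{t-i}."""
    rng = random.Random(seed)
    Y, eps = [], []
    for t in range(n):
        e = rng.gauss(0.0, sigma)
        y = c + e
        y += sum(a[i] * Y[t - 1 - i] for i in range(len(a)) if t - 1 - i >= 0)
        y += sum(b[i] * eps[t - 1 - i] for i in range(len(b)) if t - 1 - i >= 0)
        eps.append(e)
        Y.append(y)
    return Y

# ARMA(1, 1): one autoregressive lag and one noise lag
Y = simulate_arma(a=[0.7], b=[0.4], c=1.0, n=200)
```

With sigma = 0 and b empty this degenerates to a deterministic AR recursion, which is a handy sanity check for the implementation.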

SLIDE 40

Solving ARMA

Find those a_i, b_i, and c that minimize the error. We need non-linear least-squares regression
 many standard tools to do this
 MATLAB and R implement ARMA as 'arma' resp. 'arima'

How to set w and q?
 as small as possible, so that the model still fits the data well
 aka, good luck

SLIDE 41

Chapter 7.3: Motif Discovery

Aggarwal Ch. 14.4, 3.4

SLIDE 42

Motifs

A motif is a shape that frequently repeats in a time series
 a shape can also be called a 'pattern'

Many variations of motif discovery exist
 contiguous versus non-contiguous shapes
 low versus high granularities
 a single time series versus databases of time series

SLIDE 43

What is a motif?

When does a motif belong to a time series? There are two main methods for deciding:

1. distance-based support
   A segment Y[a, b] of a sequence Y is said to support a motif M when the distance d(Y[a, b], M) between the segment and the motif is below some threshold ε.

2. discrete-matching-based support
   First we discretise the time series Y into a discrete sequence S. A motif is now a (frequent) subsequence of S.

SLIDE 44

Distance-based motifs, formally

A motif, a sequence M = M_1, …, M_w of real values, is said to approximately match a contiguous subsequence of length w in time series Y if the distance between (M_1, …, M_w) and (Y_j, …, Y_{j+w−1}) is at most ε
 commonly, Euclidean distance or Dynamic Time Warping

The frequency of a motif is its number of occurrences
 the number of matches of a motif M = M_1, …, M_w to the time series Y_1, …, Y_n at threshold ε equals the number of windows of length w in Y for which the distance is at most ε

SLIDE 45

Top-k motifs

Nobody wants all motifs
 there are lots of ε-similar matches even for a single true occurrence
 instead, we aim for the top-k best motifs

As with frequent itemset mining, redundancy is an issue
 we need to keep the top-k diverse
 distances between any pair of motifs must be at least 2 · ε

SLIDE 46

FINDBESTMOTIF(Y, w, ε)

begin
  for i = 1 to n − w + 1 do begin
    Candidate = (Y_i, …, Y_{i+w−1})
    for j = 1 to n − w + 1 do begin
      Comparison = (Y_j, …, Y_{j+w−1})
      d = distance(Candidate, Comparison)
      if d < ε and (non-trivial match) then
        increment support count of Candidate
    endfor
    if Candidate has the highest count found so far then
      update BestCandidate
  endfor
  return BestCandidate
end

(trivially expanded to top-k)
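The pseudocode above, made runnable in Python. We use Euclidean distance and the common convention that a match is "trivial" when the two windows overlap; the variable names are ours:

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def find_best_motif(Y, w, eps):
    """Return the length-w window with the most eps-close, non-trivial matches."""
    n = len(Y)
    best, best_count = None, -1
    for i in range(n - w + 1):
        cand = Y[i:i + w]
        count = 0
        for j in range(n - w + 1):
            if abs(i - j) < w:                 # skip trivial (overlapping) matches
                continue
            if euclidean(cand, Y[j:j + w]) < eps:
                count += 1
        if count > best_count:
            best, best_count = cand, count
    return best, best_count

# The shape (0, 5, 0) is planted three times:
Y = [0, 5, 0, 1, 1, 1, 0, 5, 0, 2, 2, 2, 0, 5, 0]
print(find_best_motif(Y, 3, 0.5))  # ([0, 5, 0], 2): two non-trivial matches
```

Note the two nested loops over all windows: this is exactly the O(n²) distance computations discussed on the complexity slide.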

SLIDE 47

Computational Complexity

Finding the best motif takes O(n²) distance computations. Practical complexity largely depends on the distance function
 Euclidean distance is fast
 Dynamic Time Warping is often better, but much slower

Lower bounds are our friend
 if a lower bound on the distance between a motif and a window is greater than ε, the window can never support the motif
 piecewise aggregate approximation (PAA) allows fast computation of lower bounds by considering simplified (compressed) time series
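PAA replaces a series by the means of m equal-length segments; the suitably scaled Euclidean distance between the PAA representations lower-bounds the true Euclidean distance, so it can prune windows safely. A minimal sketch (helper names are ours; for simplicity we assume the length is divisible by m):

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def paa(x, m):
    """Piecewise aggregate approximation: mean of each of m equal segments."""
    s = len(x) // m
    return [sum(x[i * s:(i + 1) * s]) / s for i in range(m)]

def paa_lower_bound(x, y, m):
    """sqrt(n/m) * ||paa(x) - paa(y)|| <= ||x - y||  (n divisible by m)."""
    n = len(x)
    return math.sqrt(n / m) * euclidean(paa(x, m), paa(y, m))

x = [1.0, 2.0, 4.0, 3.0, 0.0, 1.0, 2.0, 5.0]
y = [2.0, 2.0, 3.0, 1.0, 1.0, 0.0, 4.0, 4.0]
print(paa_lower_bound(x, y, 4) <= euclidean(x, y))  # True: safe to prune on it
```

Since the PAA vectors have only m entries, the bound costs O(m) per comparison instead of O(w), which is where the speed-up comes from.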

SLIDE 48

Conclusions

Prediction over time is one of the most important and most used data analysis problems – predictive analytics.

There exist two main types of sequential data
 continuous real-valued time series and discrete event sequences
 for both, specialised algorithms exist

In practice, despite its many assumptions, ARMA is powerful
 often used in industry; learn how to use it, and learn when to use it

Patterns in time series are called motifs
 by choosing a distance function, they can be mined directly from the time series

SLIDE 49

Thank you!
