

SLIDE 1

Chapter 7-2: Discrete Sequential Data

Jilles Vreeken

IRDM ‘15/16

26 Nov 2015

SLIDE 2

IRDM Chapter 7, overview

 Time Series
  1. Basic Ideas
  2. Prediction
  3. Motif Discovery

 Discrete Sequences
  4. Basic Ideas
  5. Pattern Discovery
  6. Hidden Markov Models

You’ll find this covered in Aggarwal Ch. 3.4, 14, 15

SLIDE 3

IRDM Chapter 7, today

[same chapter outline as on the previous slide]

You’ll find this covered in Aggarwal Ch. 3.4, 14, 15

SLIDE 4

Chapter 7.3, ctd:

Motif Discovery

Aggarwal Ch. 14.4, 3.4

SLIDE 5

Dynamic Time Warping

DTW stretches the time axis of one series to enable better matches

(Aggarwal Ch. 3.4)

SLIDE 6

DTW, formally

Let DTW(i, j) be the optimal distance between the first i elements of time series X of length m and the first j elements of time series Y of length n:

DTW(i, j) = dist(x_i, y_j) + min{ DTW(i, j−1),      (repeat x_i)
                                  DTW(i−1, j),      (repeat y_j)
                                  DTW(i−1, j−1) }   (repeat neither)

We initialise as follows
 DTW(0, 0) = 0
 DTW(0, j) = ∞ for all j ∈ {1, …, n}
 DTW(i, 0) = ∞ for all i ∈ {1, …, m}

We can then simply iterate by increasing i and j

(Aggarwal Ch. 3.4)

SLIDE 7

Computing DTW (1)

Let DTW(i, j) be the optimal distance between the first i elements of time series X of length m and the first j elements of time series Y of length n:

DTW(i, j) = dist(x_i, y_j) + min{ DTW(i, j−1), DTW(i−1, j), DTW(i−1, j−1) }

From the initialised values, we can simply iterate by increasing i and j:

for i = 1 to m
  for j = 1 to n
    compute DTW(i, j)

We can also compute it recursively, by dynamic programming. Both naïve strategies cost O(mn), however.

(Aggarwal Ch. 3.4)
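As a concrete illustration of the table-filling loop above, here is a minimal dynamic-programming sketch of DTW in Python; the function and parameter names are illustrative, not from the slides.

import numpy as np

def dtw_distance(x, y, dist=lambda a, b: (a - b) ** 2):
    """Naive O(mn) dynamic-programming DTW between series x and y."""
    m, n = len(x), len(y)
    D = np.full((m + 1, n + 1), np.inf)
    D[0, 0] = 0.0                       # DTW(0, 0) = 0; the borders stay at infinity
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = dist(x[i - 1], y[j - 1])
            # repeat x_i, repeat y_j, or repeat neither
            D[i, j] = cost + min(D[i, j - 1], D[i - 1, j], D[i - 1, j - 1])
    return D[m, n]

# e.g. dtw_distance([1, 2, 3, 4], [1, 1, 2, 3, 4])  ->  0.0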

SLIDE 8

Computing DTW (2)

Let DTW(i, j) be the optimal distance between the first i elements of time series X of length m and the first j elements of time series Y of length n:

DTW(i, j) = dist(x_i, y_j) + min{ DTW(i, j−1), DTW(i−1, j), DTW(i−1, j−1) }

We can speed up computation by imposing constraints.
 e.g. a window constraint: compute DTW(i, j) only when |i − j| ≤ w
 we then only need the inner loop to run from max{0, i − w} to min{n, i + w}

(Aggarwal Ch. 3.4)
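A sketch of the window-constrained variant, assuming a Sakoe-Chiba style band of width w; the names and the band-widening guard are illustrative assumptions.

import numpy as np

def dtw_windowed(x, y, w, dist=lambda a, b: (a - b) ** 2):
    """DTW restricted to cells with |i - j| <= w."""
    m, n = len(x), len(y)
    w = max(w, abs(m - n))              # the band must be wide enough to reach cell (m, n)
    D = np.full((m + 1, n + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, m + 1):
        for j in range(max(1, i - w), min(n, i + w) + 1):
            cost = dist(x[i - 1], y[j - 1])
            D[i, j] = cost + min(D[i, j - 1], D[i - 1, j], D[i - 1, j - 1])
    return D[m, n]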

SLIDE 9

Lower bounds on DTW

Even smarter is to speed up DTW using a lower bound.

LB_Keogh(X, Y) = Σ_{i=1}^{n} of
  (x_i − U_i)²   if x_i > U_i
  (x_i − L_i)²   if x_i < L_i
  0              otherwise

where U_i = max{y_{i−w}, …, y_{i+w}} and L_i = min{y_{i−w}, …, y_{i+w}},
and w is the reach, the allowed range of warping.

[figure: series X and Y with the lower and upper envelope L and U around Y]
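A small Python sketch of this Keogh-style lower bound, assuming equal-length series and a reach of w; all names are illustrative.

import numpy as np

def lb_keogh(x, y, w):
    """Lower bound on DTW(x, y): envelope of width w around y, summed
    squared deviation of x outside that envelope."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    total = 0.0
    for i in range(len(x)):
        lo, hi = max(0, i - w), min(len(y), i + w + 1)
        U, L = y[lo:hi].max(), y[lo:hi].min()   # upper and lower envelope at position i
        if x[i] > U:
            total += (x[i] - U) ** 2
        elif x[i] < L:
            total += (x[i] - L) ** 2
    return total

Because the bound never exceeds the true DTW distance, candidates whose bound already exceeds the best distance found so far can be discarded without running the full O(mn) computation.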

SLIDE 10

Discrete Sequences

SLIDE 11

Chapter 7.4:

Basic Ideas

Aggarwal Ch. 14.1-14.2

SLIDE 12

Trouble in Time Series Paradise

Continuous real-valued time series have their downsides
 mining results rely on either a distance function or assumptions
 indexing, pattern mining, summarisation, clustering, classification, and outlier detection results hence rely on arbitrary choices

Discrete sequences are often easier to deal with
 mining results rely mostly on counting

How to transform a time series into an event sequence?
 discretisation

SLIDE 13

Approximating a Time Series

(Lin et al. 2002, 2007)

SLIDE 14

SAX

Symbolic Aggregate Approximation (SAX)
 most well-known approach to discretise a time series
 type of piece-wise aggregated approximation (PAA)

How to do SAX
 divide the data into w frames
 compute the mean per frame
 perform equal-height binning over the means, to obtain an alphabet of a characters

(Lin et al. 2002, 2007)
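A toy sketch of these three steps, following the slide's equal-height binning over the frame means (classic SAX instead uses breakpoints from a normal distribution on z-normalised data); all names are illustrative.

import numpy as np

def sax(series, n_frames, alphabet="abcd"):
    """Discretise a real-valued series: PAA frame means + equal-height binning."""
    series = np.asarray(series, float)
    # 1. divide the series into n_frames frames and 2. compute the mean per frame
    means = np.array([frame.mean() for frame in np.array_split(series, n_frames)])
    # 3. equal-height binning of the means: quantile breakpoints, one symbol per frame
    a = len(alphabet)
    breakpoints = np.quantile(means, [i / a for i in range(1, a)])
    return "".join(alphabet[s] for s in np.searchsorted(breakpoints, means))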

SLIDE 15

Definitions

A discrete sequence Y_1 … Y_n of length n and dimensionality d contains d discrete feature values at each of n different timestamps t_1 … t_n. Each of the n components Y_i contains d discrete behavioural attributes (y_i^1 … y_i^d) collected at the i-th timestamp.

The actual timestamps are usually ignored – they only induce an order on the components, or events.

SLIDE 16

Types of discrete sequences

In many applications, the dimensionality is 1
 e.g. strings, such as text or genomes
 for AATCGTAC over an alphabet Σ = {A, C, G, T}, each Y_i ∈ Σ

In some applications, each Y_i is not a vector, but a set
 e.g. a supermarket transaction, Y_i ⊆ Σ
 there is no order within Y_i

We will consider the set setting, as it is most general

SLIDE 17

Chapter 7.5:

Frequent Patterns

Aggarwal Ch. 15.2

SLIDE 18

Sequential patterns

A sequential pattern is a sequence.
 to occur in the data, it has to be a subsequence of the data

Definition: Given two sequences 𝒴 = Y_1 … Y_n and 𝒶 = a_1 … a_k, where all elements Y_i and a_i are sets, the sequence 𝒶 is a subsequence of 𝒴 if k elements Y_{i_1} … Y_{i_k} can be found in 𝒴 such that i_1 < i_2 < ⋯ < i_k and a_r ⊆ Y_{i_r} for each r ∈ {1, …, k}.

[figure: an example pattern 𝒶 and its embedding in an example sequence 𝒴 over the symbols a, b, c, d]
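This containment test is easy to code; a minimal greedy sketch (names are illustrative), which later snippets in this transcript reuse:

def is_subsequence(pattern, sequence):
    """Check whether pattern (a list of sets) occurs in sequence (a list of sets):
    elements must appear in order, gaps are allowed, and each pattern element
    must be a subset of the data element it is matched to."""
    i = 0
    for element in sequence:
        if i < len(pattern) and set(pattern[i]) <= set(element):
            i += 1
    return i == len(pattern)

# e.g. is_subsequence([{'a'}, {'b', 'c'}], [{'a'}, {'d'}, {'b', 'c', 'e'}])  ->  True

Greedily matching each pattern element at the earliest possible position suffices, since any later match could only postpone the remaining elements.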

SLIDE 19

Support

Depending on whether we have a database 𝑫 of sequences or a single long sequence, we have to define the support of a sequential pattern differently.

Standard, or ‘per sequence’ support counting
 given a database 𝑫 = {𝒴_1, …, 𝒴_N}, the support of a subsequence 𝒶 is the number of sequences in 𝑫 that contain 𝒶

Window-based support counting
 given a single sequence 𝒴, the support of a subsequence 𝒶 is the number of windows over 𝒴 that contain 𝒶

(we can analogously define frequency as relative support)

SLIDE 20

Windows

A window 𝒴[t; s] is a strict subsequence of sequence 𝒴:
𝒴[t; s] = { Y_i ∈ 𝒴 ∣ t ≤ i ≤ s }

Window-based support counting
 we can choose a window length w, and sweep over the data
 support is now dependent on w – what happens with longer w?

[figure, built up over several slides: a window of length w swept over an example sequence 𝒴, incrementing the support count of pattern 𝒶 whenever a window contains it]
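A direct sketch of window-based support counting, reusing the is_subsequence check from above; names are illustrative.

def window_support(pattern, sequence, w):
    """Count how many length-w windows of `sequence` contain `pattern`
    as a (gapped, subset-matching) subsequence."""
    return sum(is_subsequence(pattern, sequence[start:start + w])
               for start in range(len(sequence) - w + 1))

The slide's question – what happens with longer w – is taken up on the next slide: fixed window lengths lead to double counting, and longer windows make it worse.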

SLIDE 26

Minimal windows

Fixed window lengths lead to double counting
 if 𝒴[t; s] supports sequence 𝒶, so do 𝒴[t; s + k] and 𝒴[t − k; s]

We can avoid this by counting only minimal windows
 𝒴[t; s] is a minimal window of pattern 𝒶 if it contains 𝒶, but no proper sub-window of 𝒴[t; s] contains 𝒶
 for efficiency or fun, we may want to set a maximal window size

[figure, built up over several slides: the minimal windows of pattern 𝒶 in the example sequence 𝒴]
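A brute-force sketch for counting minimal windows, again reusing is_subsequence; the names are illustrative, and real miners use cleverer bookkeeping.

def minimal_window_support(pattern, sequence, max_len=None):
    """Count the minimal windows of `sequence` that contain `pattern`,
    optionally ignoring windows longer than max_len."""
    count = 0
    for start in range(len(sequence)):
        # earliest end such that sequence[start..end] contains the pattern
        i, end = 0, None
        for j in range(start, len(sequence)):
            if set(pattern[i]) <= set(sequence[j]):
                i += 1
                if i == len(pattern):
                    end = j
                    break
        if end is None:
            break  # no embedding starts here, so none starts later either
        # sequence[start..end-1] cannot contain the pattern (end is the earliest end);
        # the window is minimal iff dropping the first position also breaks containment
        if not is_subsequence(pattern, sequence[start + 1:end + 1]):
            if max_len is None or end - start + 1 <= max_len:
                count += 1
    return count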

SLIDE 29

Mining Frequent Sequential Patterns

Like for itemsets, the per-sequence and per-window definitions of support are also monotone
 we can employ level-wise search!

We can modify
 APRIORI to get GSP (Agrawal & Srikant, 1995; Mannila, Toivonen & Verkamo, 1995)
 ECLAT to get SPADE (Zaki, 2000)
 FP-GROWTH to get PREFIXSPAN (Pei et al., 2001)

SLIDE 30

Generalised Sequential Pattern Mining

Algorithm GSP(sequence database 𝑫, minimal support τ)
begin
  k ← 1
  F_k ← {all frequent 1-item elements}
  while F_k is not empty do
    generate C_{k+1} by joining pairs of sequences in F_k, such that removing an item from
      the first element of one sequence matches the sequence obtained by removing an item
      from the last element of the other
    prune sequences from C_{k+1} that violate downward closure
    determine F_{k+1} by support counting on (C_{k+1}, 𝑫), retaining the sequences from C_{k+1}
      with support at least τ
    k ← k + 1
  end
  return ⋃_{i=1}^{k} F_i
end

(Agrawal & Srikant, 1995; Mannila, Toivonen & Verkamo, 1995)

[figure: two sequences from F_k joined into a candidate in C_{k+1}]
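A deliberately simplified, GSP-flavoured level-wise miner as a sketch: it handles only patterns whose elements are single items, counts per-sequence support, and extends candidates by one item instead of performing GSP's join and downward-closure prune. All names are illustrative; is_subsequence is the helper from above.

def mine_sequential_patterns(database, min_support, max_len=5):
    """Level-wise mining of frequent single-item sequential patterns."""
    def support(pattern):
        return sum(1 for seq in database if is_subsequence(pattern, seq))

    items = sorted({item for seq in database for element in seq for item in element})
    level = [[{i}] for i in items if support([{i}]) >= min_support]   # frequent length-1 patterns
    frequent, length = list(level), 1
    while level and length < max_len:
        # monotonicity: only frequent patterns are worth extending
        candidates = [p + [{i}] for p in level for i in items]
        level = [c for c in candidates if support(c) >= min_support]
        frequent.extend(level)
        length += 1
    return frequent

# e.g. mine_sequential_patterns([[{'a'}, {'b'}], [{'a'}, {'c'}, {'b'}]], min_support=2)
#   -> [[{'a'}], [{'b'}], [{'a'}, {'b'}]]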

SLIDE 31

Episodes

There are many types of sequential patterns. The most well-known are
 n-grams, k-mers, or strict subsequences, where we do not allow gaps
 serial episodes, or subsequences, where we do allow gaps

[figure: gap-free and gapped occurrences of example patterns in a sequence 𝒴]

SLIDE 32

Episodes

Each element can contain one or more items

[figure: episodes whose elements contain one or more items, matched in a sequence 𝒴]

SLIDE 33

Parallel episodes

Serial episodes are still restrictive
 not everything always happens exactly in sequential order

Parallel episodes acknowledge this
 a parallel episode defines a partial order: for a match it requires all parallel events to happen, but does not specify their exact order
 e.g. first one event, then two further events in any order, and then a final event

We can also combine the two into generalised episodes

[figure: an example parallel episode and sequences that match it]

SLIDE 34

Chapter 7.6:

Hidden Markov Models

Aggarwal Ch. 15.5

SLIDE 35

Informal definition

Hidden Markov Models are probabilistic, generative models for discrete sequences. An HMM is a graphical model in which nodes correspond to system states, and edges to state changes. In an HMM the states of the system are hidden, i.e. not directly visible to the user. We only observe a sequence over symbols Σ that the system generates as it switches between states.

SLIDE 36

Example HMM

This HMM can generate sequences such as
 VVVVVVVMVV – Veggie (common)
 MVMVVMMVM – Omni (common)
 MMVMVVVVVV – Omni-turned-Veggie (not very common)
 MMMMMMMM – Carnivore (rare)

[figure: two-state HMM with states Vegetarian (meal distribution V = 99%, M = 1%) and Omnivore (meal distribution V = 50%, M = 50%); transition probabilities 0.99 and 0.90 for staying, 0.01 and 0.10 for switching]
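A toy generative sketch of a two-state meal HMM like the one pictured; the pairing of the self-loop probabilities with the states and the uniform initial state are assumptions for illustration.

import random

def sample_meals(n_steps, seed=None):
    """Sample a meal sequence (V/M symbols) from a two-state HMM."""
    rng = random.Random(seed)
    stay = {"Veg": 0.99, "Omni": 0.90}                # assumed self-loop probabilities
    emit = {"Veg": {"V": 0.99, "M": 0.01},            # meal distribution per hidden state
            "Omni": {"V": 0.50, "M": 0.50}}
    state, symbols = rng.choice(["Veg", "Omni"]), []  # assumed uniform initial state
    for _ in range(n_steps):
        meals, probs = zip(*emit[state].items())
        symbols.append(rng.choices(meals, probs)[0])  # emit a symbol from the hidden state
        if rng.random() > stay[state]:                # possibly switch to the other state
            state = "Omni" if state == "Veg" else "Veg"
    return "".join(symbols)

Only the V/M string is observed; the Veg/Omni states remain hidden, which is exactly what the evaluation and explanation tasks on the following slides are about.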

SLIDE 37

Example HMM (2)

[figure: four-state HMM with states Flexitarian (meal distribution V = 80%, M = 20%), Omnivore (V = 50%, M = 50%), Vegetarian (V = 99%, M = 1%), and Carnivore (V = 1%, M = 99%); the transition probabilities are shown on the edges]

SLIDE 38

Formal definition

A Hidden Markov Model over an alphabet Σ = {σ_1, …, σ_|Σ|} is a directed graph H(S, E) consisting of n states S = {s_1, …, s_n}. The initial state probabilities are ρ_1, …, ρ_n. The (directed) edges correspond to state transitions; the probability of a transition from state s_j to state s_k is denoted by q_jk.

For every visit to a state, a symbol from Σ is generated with probability Q(σ_i ∣ s_k).

SLIDE 39

What to do with an HMM

There are three main things to do with an HMM

1. Training.
   Given topology H and database 𝑫, learn the initial state probabilities, the transition probabilities, and the symbol emission probabilities.

2. Explanation.
   Given an HMM, determine the most likely state sequence that generated test sequence 𝒶.

3. Evaluation.
   Given an HMM, determine the probability of test sequence 𝒶.

SLIDE 40

Using an HMM for Evaluation

We want to know the fit probability that sequence 𝒴 = Y_1 … Y_m was generated by the given HMM.

Naïve approach
 compute all n^m possible paths over H
 for each, determine the probability of generating 𝒴
 sum these probabilities; this is the fit probability of 𝒴

SLIDE 41

Recursive Evaluation

The fit probability of the first t symbols¹ can be computed recursively from the fit probability of the first (t − 1) symbols².

Let β_t(𝒴, s_k) be the probability that the first t symbols of 𝒴 are generated by the model, and the last state is s_k:

β_t(𝒴, s_k) = Σ_{j=1}^{n} β_{t−1}(𝒴, s_j) ⋅ q_jk ⋅ Q(Y_t ∣ s_k)

That is, we sum over all paths up to different final nodes.

¹ and a fixed value of the t-th state    ² and a fixed value of the (t−1)-th state

SLIDE 42

Forward Algorithm

We initialise with β_1(𝒴, s_k) = ρ_k ⋅ Q(Y_1 ∣ s_k) and then iteratively compute β_t for each t = 2 … m.

The fit probability of 𝒴 is the sum over all end states,

F(𝒴) = Σ_{k=1}^{n} β_m(𝒴, s_k)

The complexity of the Forward Algorithm is O(n²·m)
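A compact sketch of the forward recursion, with the model given as arrays: rho for the initial probabilities, q for the transitions, Q for the emissions; all names are illustrative.

import numpy as np

def forward_fit_probability(rho, q, Q, observations):
    """Fit probability of a sequence of symbol indices, in O(n^2 * m)."""
    rho, q, Q = map(np.asarray, (rho, q, Q))
    beta = rho * Q[:, observations[0]]        # beta_1(Y, s_k) = rho_k * Q(Y_1 | s_k)
    for y in observations[1:]:
        beta = (beta @ q) * Q[:, y]           # sum over predecessor states, then emit y
    return beta.sum()                         # sum over all end states

In practice the products underflow quickly, so implementations usually rescale each β_t or work in log space.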

SLIDE 43

But, why?

Good question to ask: why compute the fit probability?
 classification
 clustering
 anomaly detection

For the first two, we can now create group-specific HMMs, and assign each sequence to the group under whose HMM it is most likely. For the third, we have an HMM for our training data, and can now report poorly fitting sequences.

SLIDE 44

Using an HMM for Explanation

We want to know why a sequence 𝒴 fits our data. The most likely state sequence gives an intuitive explanation.

Naïve approach
 compute all n^m possible paths over the HMM
 for each, determine the probability of generating 𝒴
 report the path with maximum probability

Instead of doing this naïvely, can we re-use the recursive approach?

SLIDE 45

Viterbi Algorithm

Any subpath of an optimal state path must also be optimal for generating the corresponding subsequence.

Let ε_t(𝒴, s_k) be the probability of the best state sequence generating the first t symbols of 𝒴 and ending at state s_k, with

ε_t(𝒴, s_k) = max_{j ∈ [1, n]} ε_{t−1}(𝒴, s_j) ⋅ q_jk ⋅ Q(Y_t ∣ s_k)

That is, we recursively compute the maximum-probability path over all n different paths for different final nodes.

Overall, we initialise the recursion with ε_1(𝒴, s_k) = ρ_k ⋅ Q(Y_1 ∣ s_k), and then iteratively compute for t = 2 … m.
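A sketch of the Viterbi recursion with back-pointers to recover the most likely state path, using the same array conventions as the forward sketch above; names are illustrative.

import numpy as np

def viterbi_path(rho, q, Q, observations):
    """Most likely state sequence for a list of symbol indices."""
    rho, q, Q = map(np.asarray, (rho, q, Q))
    n, m = len(rho), len(observations)
    eps = np.zeros((m, n))                     # eps[t, k]: best path probability ending in s_k
    back = np.zeros((m, n), dtype=int)         # back-pointers for path reconstruction
    eps[0] = rho * Q[:, observations[0]]
    for t in range(1, m):
        scores = eps[t - 1][:, None] * q       # scores[j, k] = eps_{t-1}(s_j) * q_jk
        back[t] = scores.argmax(axis=0)        # best predecessor for each state k
        eps[t] = scores.max(axis=0) * Q[:, observations[t]]
    path = [int(eps[-1].argmax())]             # best final state
    for t in range(m - 1, 0, -1):
        path.append(int(back[t, path[-1]]))    # walk the back-pointers
    return path[::-1], float(eps[-1].max())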

SLIDE 46

Training an HMM

So far, we assumed the given HMM was already trained. How do we train an HMM in practice?

Learning the parameters of an HMM is difficult
 no known algorithm is guaranteed to give the global optimum

There do exist methods for reasonably effective solutions
 e.g. the Forward-Backward (Baum-Welch) algorithm

SLIDE 47

Backward

We already know how to calculate the forward probability β_t(𝒴, s_k) for the first t symbols of a sequence 𝒴, ending at s_k.

Now, let γ_t(𝒴, s_k) be the backward probability for the part of the sequence after, and not including, the t-th symbol, conditioned on the t-th state being s_k. We initialise γ_m(𝒴, s_k) = 1, and compute γ_t(𝒴, s_k) just as β_t(𝒴, s_k), but from back to front.

For the Baum-Welch algorithm, we will also need
 δ_t(𝒴, s_j) for the probability that the t-th state corresponds to s_j, and
 ω_t(𝒴, s_j, s_k) for the probability of the t-th state being s_j and the (t+1)-th state being s_k
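A sketch of the backward pass, mirroring the forward sketch above; names are illustrative, and a later snippet reuses this function.

import numpy as np

def backward_probabilities(q, Q, observations):
    """gamma[t, k]: probability of the symbols after position t, given the t-th state is s_k."""
    q, Q = np.asarray(q), np.asarray(Q)
    n, m = q.shape[0], len(observations)
    gamma = np.ones((m, n))                            # gamma_m(Y, s_k) = 1
    for t in range(m - 2, -1, -1):
        # sum over the next state: transition, emit Y_{t+1}, then the rest of the sequence
        gamma[t] = q @ (Q[:, observations[t + 1]] * gamma[t + 1])
    return gamma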

SLIDE 48

Baum-Welch

We initialise the model parameters randomly. We then iteratively

(E-step) estimate β(⋅), γ(⋅), ω(⋅), and δ(⋅) from the current model parameters
(M-step) estimate the model parameters ρ, Q(⋅ ∣ ⋅), and q from the current β(⋅), γ(⋅), ω(⋅), and δ(⋅)

until the parameters converge. This is simply the EM strategy!

SLIDE 49

Estimating parameters

β(⋅)  Easy. We estimate these using the Forward algorithm.

γ(⋅)  Easy. We estimate these using the Backward algorithm.

SLIDE 50

Estimating parameters (2)

ω(⋅)  We can split this value into the part up to and including the t-th symbol, the transition to the (t+1)-th state together with its emission, and the part after the (t+1)-th symbol:

ω_t(𝒴, s_j, s_k) = β_t(𝒴, s_j) ⋅ q_jk ⋅ Q(Y_{t+1} ∣ s_k) ⋅ γ_{t+1}(𝒴, s_k)

and normalise to probabilities over all pairs (j, k). So, easy after all.

δ(⋅)  Easy. For δ_t(𝒴, s_j), just fix s_j and sum ω_t(𝒴, s_j, s_k) over all s_k.
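One E-step as a sketch, combining the forward pass, the backward_probabilities helper from above, and the ω and δ estimates of this slide; names are illustrative.

import numpy as np

def e_step(rho, q, Q, observations):
    """Estimate beta, gamma, omega and delta for one observed symbol-index sequence."""
    rho, q, Q = map(np.asarray, (rho, q, Q))
    n, m = len(rho), len(observations)
    beta = np.zeros((m, n))                                  # forward probabilities
    beta[0] = rho * Q[:, observations[0]]
    for t in range(1, m):
        beta[t] = (beta[t - 1] @ q) * Q[:, observations[t]]
    gamma = backward_probabilities(q, Q, observations)       # backward probabilities
    omega = np.zeros((m - 1, n, n))
    for t in range(m - 1):
        omega[t] = beta[t][:, None] * q * Q[:, observations[t + 1]] * gamma[t + 1]
        omega[t] /= omega[t].sum()                           # normalise over all pairs (j, k)
    delta = omega.sum(axis=2)                                # fix s_j, sum over s_k
    return beta, gamma, omega, delta

The M-step then re-estimates ρ from δ_1, the transition probabilities from the ω values, and the emission probabilities from the δ values, and the two steps are repeated until convergence.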

SLIDE 51

But, why?

SLIDE 52

Conclusions

Discrete sequences are a fun aspect of time series

 many interesting problems

Mining sequential patterns

 more expressive than itemsets, more difficult to define support

Hidden Markov Models

 can be used to predict, explain, evaluate discrete sequences

