SLIDE 1

Course of Pattern Recognition

Stochastic models dealing with sequences

Amaury Habrard

LABORATOIRE HUBERT CURIEN, UMR CNRS 5516, Université Jean Monnet, Saint-Étienne. amaury.habrard@univ-st-etienne.fr, http://labh-curien.univ-st-etienne.fr/~habrard/Slides/

SLIDE 2

Introduction

We have seen non-parametric Bayesian methods.

Remarks

Common feature: the objects were "mainly" represented by numerical features → used in Bayesian classifiers, nearest-neighbor algorithms, artificial neural networks, decision trees, clustering methods. In this part, the objects we deal with are no longer numerical vectors, but rather structured data such as sequences of symbols. We follow the line of the edit distance.

How to achieve a pattern recognition task from this type of data?

SLIDE 3

Introduction

Definition

Pattern recognition deals with the design of automatic systems able to interpret signals from the real world.

Definition

To carry out its objectives, a pattern recognition process has to achieve the following three steps:

1. Pattern encoding: e.g., analog signals → digital signals
2. Pattern representation: features describing the patterns
3. Discrimination: generation of decision rules, requiring a machine learning stage and a decision stage.

SLIDE 4

Toy example

Question

How to discriminate between men and women?

[Figure: men and women plotted in the (weight, height) plane; a linear decision boundary ax + by + c = 0 separates the two classes, and an unlabeled point "?" has to be classified]

SLIDE 5

Learning from a d-dimensional space

In a numerical space, you are able to train different kinds of models (called classifiers or hypotheses)...

[Figure: labeled points in a two-dimensional space together with some unlabeled points "?"; one possible hypothesis is a linear separator y = ax + b]

SLIDE 6

Quality of an hypothesis

Use of a loss function

Definition

The empirical error ε̂h of a hypothesis h ∈ H is the proportion of errors that h makes over the learning sample S = {(x1, y1), ..., (xn, yn)} (where 1[a] = 1 if a is true, 0 otherwise):

ε̂h = (1/|S|) Σ_i 1[h(xi) ≠ yi]

Definition

The (often unknown) generalization error εh of a hypothesis h is the error probability of h over X:

εh = P_{xi ∈ X, DX}[h(xi) ≠ yi]
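As a small illustration (not taken from the slides), here is a minimal Python sketch of the empirical error, assuming a hypothesis h given as a function and a toy labeled sample; the data values are made up.

    # Minimal sketch (illustrative): empirical error of a hypothesis h over a
    # learning sample S = [(x1, y1), ..., (xn, yn)].

    def empirical_error(h, sample):
        """Proportion of examples (x, y) in `sample` for which h(x) != y."""
        errors = sum(1 for x, y in sample if h(x) != y)
        return errors / len(sample)

    # Hypothetical usage: a threshold classifier on a 1-D feature.
    h = lambda x: 1 if x > 0.5 else 0
    S = [(0.2, 0), (0.7, 1), (0.9, 1), (0.4, 1)]   # made-up data
    print(empirical_error(h, S))                   # -> 0.25 (one error out of four)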

SLIDE 7

From numerical to structured data

Question 1

What kind of models can we apply on structured data, such as strings (e.g. words of a language), trees (e.g. XML documents) or graphs (e.g. biological molecules)?

Question 2

What kind of objective function should be optimized when the training set is only composed of positive examples (e.g. learning of the mother tongue)?

SLIDE 8

Example of graph-based structured data

Many real-world applications directly provide us with structured data.

Question

What are the subgraphs shared by both molecules? → Subgraph isomorphism.

SLIDE 9

Example of tree-based structured data

Question

How to automatically transform an HTML document into an XML one? How to classify web sites?

SLIDE 10

Example of string-based structured data

Question

What are the most frequent patterns? → String alignment.

SLIDE 11

Introduction

Many problems can be represented by structured data. We can even encounter situations where it is relevant to transform the original numerical representation into a structured one.

This can be the case, for instance, in image recognition.

SLIDE 12

Example in image recognition

Usual way in image recognition: comparison of histograms. Another strategy can consist in using a structured representation → integration of topological and geometrical relationships between the objects of the image.

SLIDE 13

Example in image recognition

→ String-based representation

[Figure: an image described by 16 characteristic points labeled with letters; reading them in order yields the corresponding string AEEPBKBHIJFLANOD]

SLIDE 14

Example in image recognition

→ Tree-based representation

[Figure: the same 16 characteristic points organized as a corresponding tree, with a root node and the points arranged on a 1st, 2nd and 3rd level]

SLIDE 15

Example in image recognition

→ Graph-based representation

[Figure: the same 16 characteristic points encoded as a corresponding graph linking neighboring points]

SLIDE 16

Example handwritten digit recognition

→ Another string-based representation

[Figure: a handwritten digit traced from a starting point and encoded with Freeman codes (eight directions around each pixel); the resulting coding string is 222234445533445666660222217760021107666501]

SLIDE 17

Example Hilbert-Peano transform: Another string-based representation

[Figure: Hilbert-Peano scans of an image at 4, 16 and 64 pixels; following the curve turns the image into a sequence of pixels y1, ..., yn−1, yn, yn+1, ..., yN]

SLIDE 18

Example: Sequence of images in a video

→ Encode the sequence of images as a string

SLIDE 19

Introduction

Many pattern recognition tasks are (or can be...) described by structured data. In this course, we will only deal with strings (or sequences of symbols)...

SLIDE 20

Example

Given two sets of first names (male and female), how to infer a model able to predict the class of a new instance?

[Figure: the two sets of first names (John, Kevin, Thomas, Bob, Mathias, Stephen, Philip versus Patricia, Rosana, Demi, Sharon, Lindsey, Dana) and an unknown instance "?"; a linear separator y = ax + b cannot be drawn on such data]

Remark

Standard algorithms cannot be used anymore.

SLIDE 21

Numerical data versus structured data

A solution is to represent a string in a p-dimensional vector. Example: given an alphabet Σ = {a, b}, a feature corresponds to the number of occurrences of substrings of size ≤ 2: < #a, #b, #aa, #ab, #ba, #bb > → the string aababa is encoded by: < 4, 2, 1, 2, 2, 0 >. We lose a lot of information! ⇒ not the best solution.
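A minimal Python sketch of this encoding (illustrative, not part of the original slides; the function name substring_features is ours):

    # Sketch: encode a string over the alphabet {a, b} by the number of occurrences
    # of substrings of size <= 2, i.e. the vector <#a, #b, #aa, #ab, #ba, #bb>.

    def substring_features(s, alphabet="ab"):
        feats = [s.count(c) for c in alphabet]                  # counts of single symbols
        bigrams = [x + y for x in alphabet for y in alphabet]   # aa, ab, ba, bb
        for bg in bigrams:
            # count overlapping occurrences of each bigram
            feats.append(sum(1 for i in range(len(s) - 1) if s[i:i+2] == bg))
        return feats

    print(substring_features("aababa"))   # -> [4, 2, 1, 2, 2, 0], as on the slide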

SLIDE 22

Numerical data versus structured data

To correctly deal with sequences, the induced model must be able to characterize:

1. the nature (the class) of the sequence (e.g. John is a male first name)
2. the way the events (the symbols) occur (e.g. John is correct while Ojhn is not...)

Dealing with sequences is more difficult than handling numerical features. Two possible solutions:

1. Use of structured data-based algorithms → e.g. Hidden Markov Models, grammatical inference.
2. Use of specific similarity measures → e.g. to be able to use kNN algorithms.

SLIDE 23

Outline

1. Introduction
2. Markov Models
   Observable Markov Models
   Hidden Markov Models
   Expectation-Maximization (EM) Algorithm
3. Learning Edit Distances with EM
   How to learn edit probabilities with EM?

SLIDE 24

References

References

1. Apprentissage Artificiel, Cornuéjols, Miclet, Eyrolles, 2002 (in French).
2. Machine Learning, Tom Mitchell, McGraw Hill, 1997.
3. Pattern Recognition, Sergios Theodoridis, Konstantinos Koutroumbas, Academic Press, 2006.

Important notice

We will see an instance of EM, which is a very important algorithm for learning the parameters of probabilistic models. See also: The Expectation Maximization Algorithm - A Short Tutorial, Sean Borman, technical report.

SLIDE 25

Markov Models

SLIDE 26

Many real-world problems require knowing the probability of a sequence of events: natural language processing, text recognition, weather forecasting, etc.

Definition

Given a sequence of events a1 . . . ak, the joint probability p(a1 . . . ak) is obtained as follows: p(a1 . . . ak) = p(a1) × p(a2|a1) × p(a3|a1a2) . . . × p(ak|a1 . . . ak−1) where a1 . . . ak−1 is the history of symbol ak. For example, p(he reads a book) = p(he) × p(reads|he) × p(a|he reads) × p(book|he reads a) × p(#|he reads a book).

SLIDE 27

N-grams

Question

How to model sequence probabilities? A natural way to simplify the previous calculation consists in bounding the history → n-grams

Definition

An n-gram is a type of probabilistic model for predicting the next item ai given the n − 1 previously observed elements of the sequence, such that p(ai|a1 . . . ai−1) = p(ai|ai−(n−1) ai−(n−2) . . . ai−1). For example, with n = 3, we get the following 3-gram: p(ai|a1 . . . ai−1) = p(ai|ai−2 ai−1)

SLIDE 28

N-grams

Question

How to train the n-gram model from a learning sample? Let us take the following learning sequence: aabaacaab. It is possible to estimate the conditional probabilities such that: p(b|aa) = p(aab)/p(aa) = 2/3, p(c|aa) = p(aac)/p(aa) = 1/3, ...
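A small Python sketch of this estimation on the learning sequence aabaacaab (illustrative; it simply counts overlapping substrings):

    # Sketch: estimate 3-gram conditional probabilities from "aabaacaab",
    # using p(s | h) = #(h + s) / #(h) with overlapping substring counts.
    from collections import Counter

    def ngram_counts(seq, n):
        return Counter(seq[i:i+n] for i in range(len(seq) - n + 1))

    seq = "aabaacaab"
    c2, c3 = ngram_counts(seq, 2), ngram_counts(seq, 3)

    def cond_prob(symbol, history):      # p(symbol | history), history of length 2
        return c3[history + symbol] / c2[history]

    print(cond_prob("b", "aa"))   # -> 0.666... (= 2/3, as on the slide)
    print(cond_prob("c", "aa"))   # -> 0.333... (= 1/3)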

SLIDE 29

N-grams

From the estimated probabilities, we can deduce an automaton modeling the 3-grams. When n = 2 (bigram), the probability of a word (or symbol) only depends on the preceding one. A bigram is a first-order observable Markov model.

SLIDE 30

N-grams - Exercise

Exercise

Let us consider the three following sequences of weather states composed of the allowed symbols C (Cloudy), S (Sunny) and R (Rainy): CCRCSSS SSCRSC RRCCRS

1. Build the probabilistic automaton modeling the weather forecast with bigrams (one way to check your counts is sketched below).
2. Compute the probability of the sequence RRCCRS.
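The sketch below shows one possible way to set up the counts for this exercise in Python, assuming the initial probability is estimated from the first symbol of each sequence and no termination symbol is used; other conventions are possible and change the numerical result.

    # Sketch: bigram counts from the three weather sequences, then P(RRCCRS)
    # under the stated convention.
    from collections import Counter

    sequences = ["CCRCSSS", "SSCRSC", "RRCCRS"]

    starts = Counter(s[0] for s in sequences)
    trans = Counter((s[i], s[i+1]) for s in sequences for i in range(len(s) - 1))
    from_state = Counter()
    for (a, b), c in trans.items():
        from_state[a] += c

    def p_init(x):
        return starts[x] / sum(starts.values())

    def p_trans(b, a):                   # P(b | a) estimated from bigram counts
        return trans[(a, b)] / from_state[a]

    def p_sequence(s):
        p = p_init(s[0])
        for i in range(len(s) - 1):
            p *= p_trans(s[i+1], s[i])
        return p

    print(p_sequence("RRCCRS"))          # -> about 0.00178 with this convention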

SLIDE 31

Observable Markov Models

SLIDE 32

Formalism of Observable Markov Models

Definition

A stochastic model is a random process which can change state si at each time step; the state occupied at time t is noted qt. The probability P(q1, q2, ..., qT) of a sequence of states q1, q2, ..., qT is computed as follows: P(q1, q2, ..., qT) = P(q1)·P(q2|q1)·P(q3|q2 q1)···P(qT|q1, ..., qT−1)

Definition

A stochastic process fulfills the first-order Markov property if: ∀t, P(qt = si|qt−1 = sj, qt−2 = sk...) = P(qt = si|qt−1 = sj)

SLIDE 33

Observable Markov Models

...that means that the global evolution is determined by the initial probability and the successive transitions. Thus,

P(q1, q2, ..., qT) = P(q1)·P(q2|q1)···P(qT|qT−1) = P(q1)·Π_{k=2}^{T} P(qk|qk−1)

Definition

A Markov model is stationary if and only if ∀t, ∀k, P(qt = si|qt−1 = sj) = P(qt+k = si|qt+k−1 = sj) ...that means that, whatever the time t, the probability of a transition between two states si and sj remains the same.

SLIDE 34

Observable Markov Models

Definition

An observable Markov model is a finite state automaton in which each state is associated with a given observation. Transition probabilities between two states are defined by a matrix A = [aij], where aij expresses the probability of going from state si to state sj. More formally, aij = P(qt = sj|qt−1 = si)

SLIDE 35

Observable Markov Models

Toy Example: Let us (try to) model the weather (rainy, sunny or cloudy) with an observable Markov model. To each possible observation is associated a state, such that s1 = 1 → rainy, s2 = 2 → cloudy, s3 = 3 → sunny. Let us suppose that the transition probabilities of matrix A have been estimated from a training sample:

t \ t+1   rainy   cloudy   sunny
rainy      0.2     0.3      0.5
cloudy     0.6     0.2      0.2
sunny      0.1     0.1      0.8

a1,2 = 0.3 means that the probability of observing a cloudy weather after a rainy day is 0.3.

SLIDE 36

Observable Markov Models

One can deduce the corresponding observable Markov chain.

[Figure: the Markov chain over the states Rainy, Cloudy and Sunny, whose edges carry the transition probabilities of matrix A]

SLIDE 37

Observable Markov Models

Exercise

A couple spending a week-end together plans to go to the museum and to the beach during the following two days. It is Friday and the weather is cloudy.

Question

What is the best schedule for that week-end?

SLIDE 38

Example of application

P(CCC) = 1 × P(C|C) × P(C|C) = 1 × 0.2 × 0.2 = 0.04
P(CCR) = 1 × 0.2 × 0.6 = 0.12
P(CCS) = 1 × 0.2 × 0.2 = 0.04
P(CRR) = 1 × 0.6 × 0.2 = 0.12
P(CRC) = 1 × 0.6 × 0.3 = 0.18
P(CRS) = 1 × 0.6 × 0.5 = 0.30
P(CSS) = 1 × 0.2 × 0.8 = 0.16
P(CSC) = 1 × 0.2 × 0.1 = 0.02
P(CSR) = 1 × 0.2 × 0.1 = 0.02

The best schedule is Saturday = museum and Sunday = beach! Note that the probabilities sum up to 1.
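A small Python sketch that reproduces these nine values from the transition matrix A (illustrative only):

    # Sketch: the nine two-day forecasts starting from a cloudy Friday, using the
    # transition matrix A of the observable Markov model.
    from itertools import product

    A = {"R": {"R": 0.2, "C": 0.3, "S": 0.5},
         "C": {"R": 0.6, "C": 0.2, "S": 0.2},
         "S": {"R": 0.1, "C": 0.1, "S": 0.8}}

    probs = {}
    for sat, sun in product("RCS", repeat=2):
        probs["C" + sat + sun] = A["C"][sat] * A[sat][sun]   # start from Cloudy with prob. 1

    for seq, p in sorted(probs.items(), key=lambda kv: -kv[1]):
        print(seq, round(p, 3))
    print("total:", sum(probs.values()))   # -> 1.0 (a distribution over all sequences of size 3)
    # The most probable sequence is CRS (0.30): museum on (rainy) Saturday, beach on (sunny) Sunday.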

SLIDE 39

Property of a Markov Model

Property

Given a Markov model M and a constant c, if ∀i, Σ_j aij = 1, then M describes a statistical distribution over all the sequences of size c, such that Σ_{x ∈ {C,R,S}^c} P(x) = 1

SLIDE 40

Limitations of Observable MM: From Observable to Hidden MM

SLIDE 41

From Observable to Hidden Markov Models

Let's take an example in ornithology! Let θ be a theory stating that about 30 swans and 80 geese land every day on a given lake. Let us observe the sequence of arrivals on a given day.

At time t: some birds are arriving (20 swans and 5 geese) → the theory seems to be wrong, but... At time t + 1: about 10 birds are landing, equally distributed among the two classes of birds. At time t + 2: many geese arrive, accompanied by some swans.

SLIDE 42

From Observable to Hidden Markov Models

Remark

It would not be relevant to only state that P(G) = 80/110 ≈ 0.73 and P(S) = 30/110 ≈ 0.27.

Indeed, we can note that the arrival time of birds is important to characterize the species of birds. It is important to take into account the three periods of time and wrap them in the model.

Question

How to model such a phenomenon? → Hidden Markov Models

SLIDE 43

From Observable to Hidden Markov Models

Question

Why are Hidden Markov Models important? A possible observable Markov model would be the following:

[Figure: a two-state observable Markov model, state 1 = Goose, state 2 = Swan; from any state, the transition towards Goose has probability P(G) = p and the transition towards Swan has probability P(S) = 1 − p]

SLIDE 44

Limitations of observable markov models

Each state corresponds to a given observation: Swan or Goose. Observing a goose G (resp. a swan S) means being in state 1 (resp. 2). The probability of a sequence does not depend on the order in which the birds appear. Indeed, P(GGGS) = P(SGGG) = p^3(1 − p).

Remark

We have seen that the arrival time is important. So, a simple observable Markov model is not sufficient to model this phenomenon. There is no longer any reason to assume that the number of states is equal to the number of possible observations.

SLIDE 45

Discovering Markov Models by example

A possible way for modeling this phenomenon could be:

          P(S)   P(G)
time 1    0.8    0.2
time 2    0.5    0.5
time 3    0.1    0.9

[Figure: a three-state model (time 1, time 2, time 3) with the emission probabilities above and transition probabilities 25/26, 1/26, 8/9, 1/9, 88/89, 1/89 between the states]

Question

How to learn such a model?

SLIDE 46

Features of such a model

It is represented in the form of a stochastic finite state automaton. It models a statistical distribution over the sequences of birds landing on that lake. It can be viewed as a generative model from which sequences of arrivals can be sampled. It can be learned from examples that only belong to the target concept (positive examples). It can allow us to classify new instances.

SLIDE 47

Hidden Markov model: Formalism

Two sets of random variables are available: one is observable, the other one is hidden. The observable variables are the observations themselves O1, O2, ..., OT (e.g. GGGSGS). The hidden variables are the states q1, q2, ..., qT in which O1, O2, ..., OT have been observed (e.g. time 1, time 2 or time 3).

Example

Another example: 3 possible states (no revision, a little revision, a lot of revision for the exams), 2 possible observations (to pass or to fail an exam). One can only observe the result (pass or fail) without having any information about the level of revision.

SLIDE 48

Hidden Markov model: Formalism

A Hidden Markov Model (HMM) λ = (A, B, π) is defined by:

1. a set S of N states, S = {s1, s2, ..., sN}. The state at a given time t is noted qt (qt ∈ S).
2. M observable symbols in each state, V = {v1, v2, ..., vM}. An element Ot of V represents the symbol observed at time t.
3. a matrix A of transition probabilities between states: aij = A(i, j) = P(qt+1 = sj|qt = si), ∀i, j ∈ [1..N], ∀t ∈ [1..T]
4. a matrix B of observation probabilities of the symbols: bj(k) = P(Ot = vk|qt = sj), 1 ≤ j ≤ N, 1 ≤ k ≤ M
5. a vector of probabilities π representing the initial probability of each state: π = {πi}, i = 1, 2, ..., N, with πi = P(q1 = si), 1 ≤ i ≤ N
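As an illustration of this formalism (not part of the original slides), a minimal sketch holding λ = (A, B, π) as numpy arrays; the numbers are those of the weather/leisure HMM used in the exercises further on, with the state order assumed to be s1 = Cloudy, s2 = Rainy, s3 = Sunny and observations v1 = Museum, v2 = Beach.

    # Sketch: lambda = (A, B, pi) as numpy arrays (weather/leisure HMM, assumed ordering).
    import numpy as np

    A = np.array([[0.2, 0.6, 0.2],    # a_ij = P(q_{t+1} = s_j | q_t = s_i)
                  [0.3, 0.2, 0.5],
                  [0.1, 0.1, 0.8]])

    B = np.array([[0.7, 0.3],         # b_j(k) = P(O_t = v_k | q_t = s_j)
                  [1.0, 0.0],
                  [0.1, 0.9]])

    pi = np.array([1.0, 0.0, 0.0])    # pi_i = P(q_1 = s_i)

    assert np.allclose(A.sum(axis=1), 1) and np.allclose(B.sum(axis=1), 1) and pi.sum() == 1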

SLIDE 49

Links between HMMs and probabilistic automata

Let V be a finite alphabet of symbols. An HMM defines a distribution over the strings of size n (V^n). HMMs are equivalent to probabilistic non-deterministic finite automata (PNFA) without final probabilities. To define a distribution over all the possible (finite) strings (V* = ∪_n V^n), one needs to add a termination symbol (#) to V.

In this case, HMMs with final probabilities define the same class of distributions as PNFA:
for any HMM there exists an equivalent PNFA with the same number of states;
for any PNFA with m states there exists an equivalent HMM with a number of states less than or equal to min(m², m × |V|).

Important notice: this slide and the next two slides are outside the scope of the course.

SLIDE 50

Example: conversion of an HMM into a PNFA

[Figure (Fig. 10): transformation of an HMM into an equivalent PNFA; the two HMM states give rise to the PNFA states 11, 12, 21, 22]

SLIDE 51

Example: conversion of a PNFA into an HMM

[Figure: transformation of a two-state PNFA into an equivalent HMM, the reverse direction of the previous slide]

SLIDE 52

What are the problems to solve?

P1. The assessment, from the model λ, of the probability of a sequence of observations O = {O1, ..., OT}.

Example

Given an HMM modeling the English language, what is the probability that the sentence "the cat is sleeping on the bed" is syntactically correct?

Example

Given an HMM modeling the arrival of swans and geese on the lake, what is the probability that a given sequence of birds follows this model?

SLIDE 53

What are the problems to solve?

P2. The search for the most probable path (i.e. the estimation of the hidden part of the model). Given a sequence of observations O and a given model λ, what is the sequence of states followed by O in λ?

Example

What is the most probable sequence of states in which a nuclear plant has been before a given accident?

SLIDE 54

What are the problems to solve?

P3. The learning of the parameters A, B, π of λ.

This requires an objective function. In machine learning, several inductive principles are available (you have already studied empirical risk minimization, ERM). Since we only have positive examples, we cannot apply the ERM principle. We rather optimize, as objective function, the likelihood of the learning set:

L = Π_{O ∈ LS} P(O|λ)

Note that we often consider the log-likelihood because of the properties of the log (it avoids numerical problems):

log L = Σ_{O ∈ LS} log P(O|λ)

SLIDE 55

The HMM and Bayesian classification

Question

How to use the HMM to deal with a multi-class problem?

1. One HMM λ(k) is learned for each class ck ∈ C = {c1, ..., cK}.
2. Given an observation Oi ∈ O, one selects the model (i.e. the class) that returns the maximum probability P(λ(k)|Oi), where P(λ(k)|Oi) = P(Oi|λ(k)) · P(λ(k)) / P(Oi).
3. One needs to be able to compute P(Oi|λ(k)).

SLIDE 56

P1/ Computation of P(O|λ)

SLIDE 57

Computation of the probability P(O|λ)

Assumption: the learned model λ is available. One aims to compute P(O = O1, ..., OT|λ), i.e. the probability of observing O1 at t = 1, then O2 at t = 2, ..., and OT at t = T.

If one tests all the possible paths over a set of N states, the algorithmic complexity for computing the probability of a sequence of size T is Θ(N^T). Fortunately, one can use dynamic programming to improve the algorithmic complexity (Θ(N²T)).

SLIDE 58

Forward Algorithm

Forward Algorithm

Let αt(i) = P(O1 O2 ... Ot, qt = si|λ) be the probability of being in state si while having observed the first t symbols of O.

Initialization: α1(i) = πi bi(O1), 1 ≤ i ≤ N
Induction: α_{t+1}(j) = [Σ_{i=1}^{N} αt(i) aij] bj(O_{t+1}), 1 ≤ t ≤ T − 1, 1 ≤ j ≤ N
Final step: P(O|λ) = Σ_{i=1}^{N} αT(i)
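A minimal Python sketch of this forward recursion (not part of the original slides). For concreteness it uses the weather/leisure HMM of the following exercise, with the state order assumed to be Cloudy, Rainy, Sunny and observations Museum = 0, Beach = 1; it recovers P(Museum, Beach|λ) = 0.168, the value given with the backward exercise further on.

    # Sketch of the forward algorithm on the weather/leisure HMM (assumed ordering).
    import numpy as np

    A  = np.array([[0.2, 0.6, 0.2], [0.3, 0.2, 0.5], [0.1, 0.1, 0.8]])
    B  = np.array([[0.7, 0.3], [1.0, 0.0], [0.1, 0.9]])
    pi = np.array([1.0, 0.0, 0.0])

    def forward(obs, A, B, pi):
        """Return P(O|lambda) and the full alpha table (T x N)."""
        T, N = len(obs), len(pi)
        alpha = np.zeros((T, N))
        alpha[0] = pi * B[:, obs[0]]                            # initialization
        for t in range(T - 1):
            alpha[t + 1] = (alpha[t] @ A) * B[:, obs[t + 1]]    # induction
        return alpha[-1].sum(), alpha                           # final step

    p, alpha = forward([0, 1], A, B, pi)                        # O = (Museum, Beach)
    print(p)                                                    # -> 0.168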

SLIDE 59

Graphical explanation

[Figure: trellis over the states and the time steps t = 1, 2, 3 illustrating the forward recursion: α2(1) is obtained by summing the α1(i) ai1 over all states i and multiplying by b1(O2)]

SLIDE 60

Exercise

Let us assume that the following HMM models the leisure activities (Museum or Beach) according to the weather.

Forward Algorithm (recap)
Let αt(i) be the probability of being in state si at time t.
Init: α1(i) = πi bi(O1), 1 ≤ i ≤ N
Induction: α_{t+1}(j) = [Σ_{i=1}^{N} αt(i) aij] bj(O_{t+1})
Final step: P(O|λ) = Σ_{i=1}^{N} αT(i)

[Figure: the weather/leisure HMM, with states 1 = Cloudy (Museum: 0.7, Beach: 0.3), 2 = Rainy (Museum: 1, Beach: 0), 3 = Sunny (Museum: 0.1, Beach: 0.9), the weather transition probabilities of matrix A, and initial probability 1 on Cloudy]

What is the probability of going first to the museum and then to the beach?

SLIDE 61

P2/ Computation of the optimal path

SLIDE 62

Computation of the optimal path: The Viterbi algorithm

The Viterbi algorithm (Viterbi 1967) is a dynamic programming algorithm for finding the most likely sequence of hidden states. The aim is to determine the best path that corresponds to the observation O, i.e. argmax_Q P(Q, O|λ).

Let δt(i) be the probability of the current best path leading to state si at time t given the first t observations:

δt(i) = max_{q1,...,q_{t−1}} P(q1, q2, ..., qt = si, O1, O2, ..., Ot|λ)

By induction, one computes the next quantity δ_{t+1}(j) while storing the best states in a matrix φ:

δ_{t+1}(j) = [max_i δt(i) aij] bj(O_{t+1})
φ_{t+1}(j) ← argmax_i[δt(i) aij]

SLIDE 63

Viterbi Algorithm

Viterbi Algorithm
foreach state i do
    δ1(i) ← πi bi(O1); φ1(i) ← 0;
t ← 2;
while t ≤ T do
    j ← 1;
    while j ≤ N do
        δt(j) ← max_i[δ_{t−1}(i) aij] bj(Ot);
        φt(j) ← argmax_i[δ_{t−1}(i) aij];
        j ← j + 1;
    t ← t + 1;
q*_T ← argmax_i[δT(i)];
t ← T − 1;
while t ≥ 1 do
    q*_t ← φ_{t+1}(q*_{t+1});
    t ← t − 1;
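A minimal Python sketch of this algorithm (not part of the original slides), applied to the weather/leisure HMM of the examples (assumed state order: 1 = Cloudy, 2 = Rainy, 3 = Sunny; observations Museum = 0, Beach = 1); it recovers the two optimal paths computed on the next slides.

    # Sketch of the Viterbi algorithm on the weather/leisure HMM (assumed ordering).
    import numpy as np

    A  = np.array([[0.2, 0.6, 0.2], [0.3, 0.2, 0.5], [0.1, 0.1, 0.8]])
    B  = np.array([[0.7, 0.3], [1.0, 0.0], [0.1, 0.9]])
    pi = np.array([1.0, 0.0, 0.0])

    def viterbi(obs, A, B, pi):
        T, N = len(obs), len(pi)
        delta = np.zeros((T, N))
        phi = np.zeros((T, N), dtype=int)
        delta[0] = pi * B[:, obs[0]]
        for t in range(1, T):
            scores = delta[t - 1][:, None] * A       # scores[i, j] = delta_{t-1}(i) * a_ij
            phi[t] = scores.argmax(axis=0)
            delta[t] = scores.max(axis=0) * B[:, obs[t]]
        path = [int(delta[-1].argmax())]              # backtracking
        for t in range(T - 1, 0, -1):
            path.insert(0, int(phi[t][path[0]]))
        return [s + 1 for s in path], delta[-1].max() # states numbered from 1 as on the slides

    print(viterbi([0, 1], A, B, pi))       # -> ([1, 3], 0.126): (cloudy, sunny)
    print(viterbi([1, 1, 0], A, B, pi))    # -> ([1, 1, 2], 0.0108): (cloudy, cloudy, rainy)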

SLIDE 64

Example
What is the optimal path for the sequence of activities O = (Museum, Beach)?

[Figure: the weather/leisure HMM (states 1 = Cloudy, 2 = Rainy, 3 = Sunny) with its emission and transition probabilities]

δ1(1) = π1 b1(M) = 0.7 → φ1(1) = 0
δ1(2) = π2 b2(M) = 0.0 → φ1(2) = 0
δ1(3) = π3 b3(M) = 0.0 → φ1(3) = 0
δ2(1) = max(δ1(1) a11, δ1(2) a21, δ1(3) a31) × b1(B) = max(0.7 × 0.2, 0 × 0.3, 0 × 0.1) × 0.3 = 0.042 → φ2(1) = 1
δ2(2) = max(δ1(1) a12, δ1(2) a22, δ1(3) a32) × b2(B) = max(0.7 × 0.6, 0 × 0.2, 0 × 0.1) × 0 = 0 → φ2(2) = 1
δ2(3) = max(δ1(1) a13, δ1(2) a23, δ1(3) a33) × b3(B) = max(0.7 × 0.2, 0 × 0.5, 0 × 0.8) × 0.9 = 0.126 → φ2(3) = 1

SLIDE 65

Example

Finally, we get:
φ1(1) = 0, φ2(1) = 1
φ1(2) = 0, φ2(2) = 1
φ1(3) = 0, φ2(3) = 1
max(δ2(1), δ2(2), δ2(3)) = 0.126
q*_2 = 3
q*_1 = φ2(3) = 1

We deduce that the best sequence of states is: (cloudy, sunny)

SLIDE 66

Exercise

What is the optimal path for O = (Beach, Beach, Museum)?

Viterbi Algorithm (recap)
foreach state i do
    δ1(i) ← πi bi(O1); φ1(i) ← 0;
t ← 2;
while t ≤ T do
    j ← 1;
    while j ≤ N do
        δt(j) ← max_i[δ_{t−1}(i) aij] bj(Ot);
        φt(j) ← argmax_i[δ_{t−1}(i) aij];
        j ← j + 1;
    t ← t + 1;
q*_T ← argmax_i[δT(i)];
t ← T − 1;
while t ≥ 1 do
    q*_t ← φ_{t+1}(q*_{t+1});
    t ← t − 1;

[Figure: the weather/leisure HMM (states 1 = Cloudy, 2 = Rainy, 3 = Sunny) with its emission and transition probabilities]

SLIDE 67

Example

           Beach (t = 1)        Beach (t = 2)                        Museum (t = 3)
state 1    1 × 0.3 = 0.3        δ2(1) = 0.3 × a11 × b1(B)            δ3(1) = max(0.018 × a11, 0, 0.054 × a31) × b1(M)
                                = 0.3 × 0.2 × 0.3 = 0.018            = max(0.00252, 0, 0.00378) = 0.00378
state 2    0 × 0 = 0            δ2(2) = 0.3 × a12 × b2(B) = 0        δ3(2) = max(0.018 × a12, 0, 0.054 × a32) × b2(M)
                                                                     = max(0.0108, 0, 0.0054) = 0.0108
state 3    0 × 0.9 = 0          δ2(3) = 0.3 × a13 × b3(B)            δ3(3) = max(0.018 × a13, 0, 0.054 × a33) × b3(M)
                                = 0.3 × 0.2 × 0.9 = 0.054            = max(0.00036, 0, 0.00432) = 0.00432

We deduce that the best sequence of states is: (cloudy (state 1), cloudy (state 1), rainy (state 2) ) (see next slide).

SLIDE 68

Example

Finally, we get:
φ1(1) = 0, φ2(1) = 1, φ3(1) = 3
φ1(2) = 0, φ2(2) = 1, φ3(2) = 1
φ1(3) = 0, φ2(3) = 1, φ3(3) = 3
max(δ3(1), δ3(2), δ3(3)) = max(0.00378, 0.0108, 0.00432) = 0.0108
q*_3 = 2
q*_2 = φ3(2) = 1
q*_1 = φ2(1) = 1

We deduce that the best sequence of states is: (cloudy, cloudy, rainy)

SLIDE 69

Viterbi Algorithm

Complexity

Time complexity: O(N²T), where N is the number of states and T the size of the observation. Space complexity: O(NT) (one path per state).

SLIDE 70

P3/ Learning of the model λ

SLIDE 71

Learning of the model λ

Aim: learning of the parameters (A, B, π) that maximize the likelihood of a training set of observations O = {O1, ..., Om}. ...or in other words, we wish to estimate the model parameters for which the observed data are the most likely.

SLIDE 72

Learning of the model λ

Let O = {O1, ..., Om} be a learning set. One aims to maximize the likelihood:

P(O|λ) = Π_{k=1}^{m} P(Ok|λ)

One searches for the parameters of λ such that argmax_λ [P(O|λ) = Π_{k=1}^{m} P(Ok|λ)].

SLIDE 73

Baum-Welch Algorithm

The Baum-Welch algorithm (named after Leonard E. Baum and Lloyd R. Welch) is also called the forward-backward algorithm. It allows the computation of the probability of a sequence using two procedures, called forward and backward:

1. αt(i) (already used!) is the probability of being in state si while having observed the first t symbols of the current observation Ok.
2. βt(i) is the probability of observing, from state si, the last symbols (from t + 1 to T) of Ok.

SLIDE 74

Backward Algorithm

Backward Algorithm

βt(i) = P(O_{t+1}, ..., OT|qt = si)

Initialization: βT(i) = 1, 1 ≤ i ≤ N
Induction: βt(j) = Σ_{i=1}^{N} aji bi(O_{t+1}) β_{t+1}(i), 1 ≤ t ≤ T − 1, 1 ≤ j ≤ N
Final step: P(O|λ) = Σ_{i=1}^{N} πi bi(O1) β1(i)
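A minimal Python sketch of the backward procedure (not part of the original slides), on the same weather/leisure HMM and with the same assumptions as the forward sketch above; it finds again P(Museum, Beach|λ) = 0.168.

    # Sketch of the backward algorithm on the weather/leisure HMM (assumed ordering).
    import numpy as np

    A  = np.array([[0.2, 0.6, 0.2], [0.3, 0.2, 0.5], [0.1, 0.1, 0.8]])
    B  = np.array([[0.7, 0.3], [1.0, 0.0], [0.1, 0.9]])
    pi = np.array([1.0, 0.0, 0.0])

    def backward(obs, A, B, pi):
        T, N = len(obs), len(pi)
        beta = np.ones((T, N))                                 # initialization: beta_T(i) = 1
        for t in range(T - 2, -1, -1):
            beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])     # induction
        return (pi * B[:, obs[0]] * beta[0]).sum(), beta       # final step

    p, beta = backward([0, 1], A, B, pi)                       # O = (Museum, Beach)
    print(p)                                                   # -> 0.168, as with the forward pass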

SLIDE 75

Graphical explanation

[Figure: trellis illustrating the backward recursion between t = 1 and t = 2: each β1(j) combines the aji, the bi(O2) and the β2(i) of the successor states]

Backward: "right-to-left" - Forward: "left-to-right" - Final values are equal

SLIDE 76

Exercise

Given the following hidden Markov model λ = (A, B, π).

Backward Algorithm (recap)
Init: βT(i) = 1, 1 ≤ i ≤ N
Induction: βt(j) = Σ_{i=1}^{N} aji bi(O_{t+1}) β_{t+1}(i)
Final step: P(O|λ) = Σ_{i=1}^{N} πi bi(O1) β1(i)

[Figure: the weather/leisure HMM (states 1 = Cloudy, 2 = Rainy, 3 = Sunny) with its emission and transition probabilities and initial probability 1 on Cloudy]

Question

Compute P(Museum, Beach|λ) using the backward function. (=0.168)

SLIDE 77

Baum-Welch Algorithm

We have two functions (forward and backward) at our disposal to learn A, B, π. The Baum-Welch algorithm is based on the following steps:

1. Choose an initial set of parameters λ0.
   Advice: use non-zero values to prevent some parameters from never being used during the learning process.
   Principle: reinforcement of some paths given the training set and the initialized parameters.
2. Compute λ1 from λ0 using expectation and maximization steps.
   Goal: increase of the likelihood.
3. Repeat this process until convergence.
   Use a threshold or a statistical test.

SLIDE 78

Baum-Welch Algorithm

It is a generalized expectation-maximization (EM) algorithm.

Definition

The expectation-maximization (EM) algorithm is a powerful computational technique for locating maxima of functions. It is widely used in statistics for maximum likelihood estimation of parameters in incomplete or hidden data models.

Remark

In an HMM context, the parameters are A, B and π and the hidden variables are the states reached during the induction.

SLIDE 79

EM Algorithm

Question

Why is this algorithm called expectation-maximization? Answering this question requires some background in optimization...

SLIDE 80

EM Algorithm

Let O be an observable sequence of events, and λ = (A, B, π) the parameters we want to learn. The objective is to find λ such that P(O|λ) is maximum. Rather than optimizing P(O|λ), we maximize L(λ) = ln P(O|λ):

since ln(x) is a strictly increasing function, the value of λ which maximizes P(O|λ) also maximizes ln P(O|λ); ln(x) is a concave function to which Jensen's inequality can be applied (we will see later the interest of this inequality).

SLIDE 81

EM Algorithm

Theorem (Jensen’s inequality)

Let f be a concave function defined on an interval I. If x1, x2, ..., xn ∈ I and γ1, γ2, ..., γn ≥ 0 with Σ_{i=1}^{n} γi = 1, then

f(Σ_{i=1}^{n} γi xi) ≥ Σ_{i=1}^{n} γi f(xi)

SLIDE 82

EM Algorithm

The EM algorithm is an iterative procedure. Assume that after the nth iteration the current estimate for λ is given by λn. Since the objective is to maximize ln P(O|λ), we wish to compute an updated estimate λ that maximizes

L(λ) − L(λn) = ln P(O|λ) − ln P(O|λn)   (1)

So far, we have not considered any unobserved variables z. To integrate z into the optimization process, we can note that:

P(O|λ) = Σ_z P(O|z, λ) × P(z|λ)

SLIDE 83

EM Algorithm

Equation (1) can be rewritten as

L(λ) − L(λn) = ln [Σ_z P(O|z, λ) × P(z|λ)] − ln P(O|λn)   (2)

Jensen's inequality can then be applied to Equation (2), where the constants γi take the form of P(z|O, λn):

L(λ) − L(λn) = ln [Σ_z P(O|z, λ) × P(z|λ)] − ln P(O|λn)
             = ln [Σ_z P(O|z, λ) × P(z|λ) × P(z|O, λn)/P(z|O, λn)] − ln P(O|λn)
             = ln [Σ_z P(z|O, λn) × P(O|z, λ) P(z|λ)/P(z|O, λn)] − ln P(O|λn)
             ≥ Σ_z P(z|O, λn) ln [P(O|z, λ) P(z|λ)/P(z|O, λn)] − ln P(O|λn)
             = Σ_z P(z|O, λn) ln [P(O|z, λ) P(z|λ)/(P(z|O, λn) P(O|λn))]

SLIDE 84

EM Algorithm

Therefore, our objective is to choose λ such that

L(λ) ≥ L(λn) + Σ_z P(z|O, λn) ln [P(O|z, λ) P(z|λ)/(P(z|O, λn) P(O|λn))]

is maximized. Let λ_{n+1} be this updated value:

λ_{n+1} = argmax_λ { L(λn) + Σ_z P(z|O, λn) ln [P(O|z, λ) P(z|λ)/(P(z|O, λn) P(O|λn))] }

Now drop the terms which are constant w.r.t. λ:

λ_{n+1} = argmax_λ { Σ_z P(z|O, λn) ln [P(O|z, λ) P(z|λ)] }
        = argmax_λ { Σ_z P(z|O, λn) ln [P(O, z, λ)/P(z, λ) × P(z, λ)/P(λ)] }
        = argmax_λ { Σ_z P(z|O, λn) ln P(O, z|λ) }
        = argmax_λ { E_{z|O,λn}{ln P(O, z|λ)} }
SLIDE 85

EM Algorithm

Conclusion

In λ_{n+1} = argmax_λ { E_{z|O,λn}{ln P(O, z|λ)} }, the expectation and maximization steps are apparent. The EM algorithm thus consists of iterating:

1. E-step: determine the conditional expectation E_{z|O,λn}{ln P(O, z|λ)}. This is done using the αt(i) (forward) and βt(i) (backward) procedures.
2. M-step: maximize this expression with respect to λ.

Following this principle, one can analytically define the update rules of the parameters A, B, π.
SLIDE 86

Baum-Welch Algorithm

Definition

Let p^k_t(i, j) be the probability of using the transition going from state si (emitting the symbol O^k_t) to state sj (emitting O^k_{t+1}) with the kth observation of the learning set.

p^k_t(i, j) = P(qt = si, qt+1 = sj | O^k, λ)
            = P(qt = si, qt+1 = sj, O^k | λ) / P(O^k|λ)
            = α^k_t(i) · aij · bj(O^k_{t+1}) · β^k_{t+1}(j) / P(O^k|λ)

p^k_t(i, j) will be useful to estimate matrix A.

SLIDE 87

Baum-Welch Algorithm

Definition

Let γ^k_t(i) be the probability that the tth symbol of O^k is emitted in si.

γ^k_t(i) = P(qt = si | O^k, λ)
         = Σ_{j=1}^{N} P(qt = si, qt+1 = sj | O^k, λ)
         = Σ_{j=1}^{N} P(qt = si, qt+1 = sj, O^k | λ) / P(O^k|λ)
         = Σ_{j=1}^{N} p^k_t(i, j) = α^k_t(i) · β^k_t(i) / P(O^k|λ)

γ^k_t(i) will be useful to estimate matrix B.

SLIDE 88

Baum-Welch Algorithm

Definition

One can now estimate the new parameters of the model:

πi = (1/m) Σ_{k=1}^{m} γ^k_1(i)

...that means that πi is the proportion of times that state si is used to emit the first symbol of a sequence.

SLIDE 89

Baum-Welch Algorithm

Definition

aij = P(qt+1 = sj|qt = si) = [Σ_{k=1}^{m} Σ_{t=1}^{|O^k|} p^k_t(i, j)] / [Σ_{k=1}^{m} Σ_{t=1}^{|O^k|} γ^k_t(i)]

...that means that aij is the proportion of times that the transition from si to sj is used in the learning set.

SLIDE 90

Baum-Welch Algorithm

Definition

bj(l) = P(Ot = vl|qt = sj) = [Σ_{k=1}^{m} Σ_{t : O^k_t = vl} γ^k_t(j)] / [Σ_{k=1}^{m} Σ_{t=1}^{|O^k|} γ^k_t(j)]

...that means that bj(l) is the proportion of times that the HMM is in state sj and emits the symbol vl.

SLIDE 91

Baum-Welch Algorithm

Definition

All the parameters of the HMM can thus be estimated and computed:

πi = (1/m) Σ_{k=1}^{m} γ^k_1(i)

aij = P(qt+1 = sj|qt = si) = [Σ_{k=1}^{m} Σ_{t=1}^{|O^k|} p^k_t(i, j)] / [Σ_{k=1}^{m} Σ_{t=1}^{|O^k|} γ^k_t(i)]

bj(l) = P(Ot = vl|qt = sj) = [Σ_{k=1}^{m} Σ_{t : O^k_t = vl} γ^k_t(j)] / [Σ_{k=1}^{m} Σ_{t=1}^{|O^k|} γ^k_t(j)]
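A compact Python sketch of one re-estimation pass built from these quantities (not part of the original slides; it follows the standard Baum-Welch convention where the transition denominator sums γ over t = 1..T−1). The two-state HMM and the training sequences are made up for illustration and are not the six-state example of the next slide.

    # Sketch: one Baum-Welch (EM) re-estimation pass. Observations are integers 0..M-1.
    import numpy as np

    def forward(obs, A, B, pi):
        T, N = len(obs), len(pi)
        alpha = np.zeros((T, N)); alpha[0] = pi * B[:, obs[0]]
        for t in range(T - 1):
            alpha[t + 1] = (alpha[t] @ A) * B[:, obs[t + 1]]
        return alpha

    def backward(obs, A, B):
        T, N = len(obs), A.shape[0]
        beta = np.ones((T, N))
        for t in range(T - 2, -1, -1):
            beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
        return beta

    def baum_welch_step(observations, A, B, pi):
        N, M = B.shape
        num_pi = np.zeros(N)
        num_A, den_A = np.zeros((N, N)), np.zeros(N)
        num_B, den_B = np.zeros((N, M)), np.zeros(N)
        for obs in observations:
            alpha, beta = forward(obs, A, B, pi), backward(obs, A, B)
            pO = alpha[-1].sum()
            gamma = alpha * beta / pO                          # gamma_t(i)
            num_pi += gamma[0]
            for t in range(len(obs) - 1):                      # p_t(i, j)
                p_t = alpha[t][:, None] * A * B[:, obs[t + 1]] * beta[t + 1] / pO
                num_A += p_t
                den_A += gamma[t]
            for t, o in enumerate(obs):
                num_B[:, o] += gamma[t]
                den_B += gamma[t]
        return (num_A / den_A[:, None],
                num_B / den_B[:, None],
                num_pi / len(observations))

    # Hypothetical toy run: 2 states, alphabet {0, 1}, a few short sequences.
    A0 = np.array([[0.6, 0.4], [0.5, 0.5]])
    B0 = np.array([[0.7, 0.3], [0.2, 0.8]])
    pi0 = np.array([0.5, 0.5])
    obs_set = [[0, 1, 1, 0], [0, 0, 1], [1, 1, 0]]
    A1, B1, pi1 = baum_welch_step(obs_set, A0, B0, pi0)
    print(np.round(A1, 3), np.round(B1, 3), np.round(pi1, 3), sep="\n")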

SLIDE 92

Example

Let us consider the following HMM with 6 states with π0 = {1, 0, 0, 0, 0, 0} and O = {bca#, cca#, bbba#, bcba#, cbba#, ccba#}

[Figure: the structure of the six-state HMM, with its initial transition matrix A0 (entries 0, 0.33, 0.5 and 1) and its initial emission matrix B0 over the alphabet {a, b, c, #}]

SLIDE 93

Example

Update of a12, the probability of the transition between s1 and s2:

a12 = P(q2 = s2|q1 = s1) = [Σ_{k=1}^{m} p^k_1(1, 2)] / [Σ_{k=1}^{m} γ^k_1(1)]

We know that p^k_t(i, j) = α^k_t(i) · aij · bj(O^k_{t+1}) · β^k_{t+1}(j) / P(O^k|λ).

Let us consider the first example O1 = bca# (probabilities equal to 1 are omitted):

p^1_1(1, 2) = α^1_1(1) · a12 · b2(O^1_2) · β^1_2(2) / P(O1|λ) = (1/3 × 1/2 × 1/3 × 1/3) / (1/54) = 1
Amaury Habrard Pattern Recognition & Machine Learning

slide-94
SLIDE 94

lhc-logo Introduction Markov Models Learning Edit Distances with EM Observable Markov Models Hidden Markov Models Expectation-Maximization (EM) Algorithm

Example

[Figure: the trellis of the observation O1 = bca# through the states s1 to s6, annotated with the forward/backward values 1, 1/2, 1/3, 1/18 and 1/54]

SLIDE 95

Conclusion

Conclusion

HMMs are generative models that describe a probability distribution over sequences. There is no theoretical result dealing with the optimal number of states: the structure of the HMM is provided by the user. The states do not correspond to real physical phenomena.

SLIDE 96

HMMs are models that allow us to deal with strings. Another way to handle such structured data is to call on specific metrics that enable us to use standard methods such as the k-nearest-neighbor algorithm. The edit distance is probably the most widely used metric to compare two strings. However, it highly depends on the costs assigned to each edit operation.

The next part deals with a method for learning those parameters.

SLIDE 97

String Edit Distance

Definition

The Levenshtein distance (or edit distance) between two strings x = x1...xT and y = y1...yV is given by the minimum number of edit operations needed to transform x into y, where an operation is an insertion, a deletion, or a substitution of a single character. Rather than counting the number of edit operations, one can assign an edit cost to each edit operation and search for the least costly transformation: c(xi, yj) is the cost of the substitution of xi by yj, c(xi, λ) is the cost of the deletion (xi into the empty symbol λ), c(λ, yj) is the cost of the insertion of yj.
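A classic dynamic-programming sketch of the Levenshtein distance with unit costs (Python, illustrative):

    # Sketch: Levenshtein distance with unit costs for insertion, deletion and substitution.

    def levenshtein(x, y):
        T, V = len(x), len(y)
        D = [[0] * (V + 1) for _ in range(T + 1)]
        for t in range(1, T + 1):
            D[t][0] = t                      # t deletions
        for v in range(1, V + 1):
            D[0][v] = v                      # v insertions
        for t in range(1, T + 1):
            for v in range(1, V + 1):
                sub = 0 if x[t - 1] == y[v - 1] else 1
                D[t][v] = min(D[t - 1][v] + 1,        # deletion of x_t
                              D[t][v - 1] + 1,        # insertion of y_v
                              D[t - 1][v - 1] + sub)  # substitution (or match)
        return D[T][V]

    print(levenshtein("John", "Ojhn"))   # -> 2 (two substitutions), cf. the earlier example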

SLIDE 98

Edit Distance

Remark

We can make the following remarks: the impact of the choice of the edit costs is very important. Modifying the edit costs = changing the neighborhood of a given string = modifying its classification. Three possible solutions to tune the edit costs:

1. Arbitrary choice...
2. Use of background knowledge (e.g. on a QWERTY keyboard, the key "w" is more often changed into a "q" or an "e" than into an "m").
3. Learn the edit parameters from a training set.

SLIDE 99

Question

How to learn edit costs?

Remark

A string can be changed into another one according to different edit scripts. Therefore, the edit scripts can be considered as hidden parameters. The observable parameters are the symbols of a pair of (input, output) strings. The EM algorithm can be used to learn those parameters!

SLIDE 100

Learning framework of the edit distance

One can use the EM algorithm and the forward and backward procedures to learn an edit distance. Since EM is a probabilistic method, one first learns the probability P(x, y) of a pair of strings x and y. A stochastic edit distance (in fact an edit similarity) can be deduced from the negative logarithm of P(x, y), such that: D_E(x, y) = −log P(x, y). The symmetry property of the distance can be lost: distance → similarity.

SLIDE 101

Learning framework of the edit distance

Rather than setting edit costs c(xi, yj), we aim to learn a matrix of edit probabilities δ(xi, yj).

So far: c(xi, yj)
        λ      a      b
λ       -     0.5    0.5
a      0.5     0      1
b      0.5     1      0

From now on: δ(xi, yj)
        λ      a      b
λ       -     0.05   0.05
a      0.05   0.7    0.15
b      0.05   0.05   0.8

SLIDE 102

Adaptation of the forward procedure

In the standard forward procedure of an HMM, αt(i) corresponded to the probability of all the possible paths before emitting, in state i, the tth symbol Ot of O. In the forward procedure adapted to the edit distance context, one has to compute αt,v, that is the probability of the paths before emitting the pair of symbols (xt, yv) of the string pair (x, y).

SLIDE 103

Forward Procedure

Input: two strings x(T) and y(V)
Output: probability αT,V of the pair of strings x(T) and y(V)

α0,0 = 1;
for t = 0 to T do
    for v = 0 to V do
        if (v > 0 ∨ t > 0) [αt,v = 0];
        if (v > 0) [αt,v += αt,v−1 δ(ε, yv)];
        if (t > 0) [αt,v += αt−1,v δ(xt, ε)];
        if (v > 0 ∧ t > 0) [αt,v += αt−1,v−1 δ(xt, yv)];
αT,V *= δ(#);
P(x, y) = αT,V
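A Python sketch of this procedure (not part of the original slides); the edit-probability table delta and the termination probability delta_end are hypothetical values chosen only so that the code runs and the probabilities (including termination) sum to 1.

    # Sketch: forward procedure for the stochastic edit distance.
    import math

    EPS = ""   # the empty symbol (written epsilon on the slide)

    delta = {  # delta(a, b): substitution, (a, ""): deletion, ("", b): insertion (hypothetical)
        ("a", "a"): 0.30, ("a", "b"): 0.05, ("b", "a"): 0.05, ("b", "b"): 0.30,
        ("a", EPS): 0.05, ("b", EPS): 0.05, (EPS, "a"): 0.05, (EPS, "b"): 0.05,
    }
    delta_end = 0.10   # delta(#)

    def forward_edit(x, y):
        T, V = len(x), len(y)
        alpha = [[0.0] * (V + 1) for _ in range(T + 1)]
        alpha[0][0] = 1.0
        for t in range(T + 1):
            for v in range(V + 1):
                if v > 0:
                    alpha[t][v] += alpha[t][v - 1] * delta[(EPS, y[v - 1])]          # insertion
                if t > 0:
                    alpha[t][v] += alpha[t - 1][v] * delta[(x[t - 1], EPS)]          # deletion
                if v > 0 and t > 0:
                    alpha[t][v] += alpha[t - 1][v - 1] * delta[(x[t - 1], y[v - 1])] # substitution
        return alpha[T][V] * delta_end                                               # P(x, y)

    p = forward_edit("ab", "aab")
    print(p, -math.log(p))        # stochastic edit similarity D_E(x, y) = -log P(x, y)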

SLIDE 104

Backward Procedure

Input: two strings x(T) and y(V)
Output: probability β0,0 of the pair of strings x(T) and y(V)

βT,V = δ(#);
for t = T to 0 do
    for v = V to 0 do
        if (v < V ∨ t < T) [βt,v = 0];
        if (v < V) [βt,v += δ(ε, yv+1) βt,v+1];
        if (t < T) [βt,v += δ(xt+1, ε) βt+1,v];
        if (v < V ∧ t < T) [βt,v += δ(xt+1, yv+1) βt+1,v+1];
Return β0,0;
P(x, y) = β0,0

SLIDE 105

Expectation Step

Input: two strings x(T) and y(V)

for t = 0 to T do
    for v = 0 to V do
        if (t > 0) [γ(xt, ε) += αt−1,v δ(xt, ε) βt,v / αT,V];
        if (v > 0) [γ(ε, yv) += αt,v−1 δ(ε, yv) βt,v / αT,V];
        if (t > 0 ∧ v > 0) [γ(xt, yv) += αt−1,v−1 δ(xt, yv) βt,v / αT,V];

Let us recall that for an HMM: p^k_t(i, j) = α^k_t(i) · aij · bj(O^k_{t+1}) · β^k_{t+1}(j) / P(O^k|λ)
