Robust Hidden Markov Models Inference in the Presence of Label Noise
Benoît Frénay 25 August 2014
Segmentation of electrocardiogram signals:
- goal: allow automated diagnosis of heart disease;
- tools: hidden Markov models and the wavelet transform;
- issue: robustness to label noise (i.e. expert errors);
- solution: modelling of expert behaviour.
An ECG is a measure of the electrical activity of the human heart. Patterns of interest: P wave, QRS complex, T wave, baseline.
The ECG results from the superposition of several signals.
Real ECGs are polluted by various sources of noise.
Task: split/segment an entire ECG into patterns.
Available data: a few manual segmentations from experts.
Issue: some of the experts' annotations are incorrect.
Probabilistic model of sequences with labels: hidden Markov models (with the wavelet transform).
Hidden Markov models (HMMs) are probabilistic models of sequences.
S1, . . . , ST is the sequence of annotations (e.g. the state of the heart), with transition probabilities P(St = st | St−1 = st−1).
O1, . . . , OT is the sequence of observations (e.g. the measured voltage), with emission probabilities P(Ot = ot | St = st).
Markov hypothesis: the next state only depends on the current state.
Observations are conditionally independent given the hidden states:
$P(O_1, \dots, O_T \mid S_1, \dots, S_T) = \prod_{t=1}^{T} P(O_t \mid S_t)$
Learning an HMM means estimating probabilities:
- P(St) are prior probabilities;
- P(St|St−1) are transition probabilities;
- P(Ot|St) are emission probabilities.
Parameters Θ = (q, a, b):
- qi is the prior of state i;
- aij is the transition probability from state i to state j;
- bi is the observation distribution for state i.
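For concreteness, the parameters Θ = (q, a, b) can be stored as plain arrays; the sketch below assumes one-dimensional Gaussian emissions (the talk itself uses GMM emissions, introduced later), and all names are illustrative.

```python
import numpy as np

# Minimal container for the HMM parameters Theta = (q, a, b), assuming
# 1-D Gaussian emissions for simplicity (the talk uses GMM emissions).
n_states = 4                                         # e.g. P, QRS, T, baseline
q = np.full(n_states, 1.0 / n_states)                # priors q_i
A = np.full((n_states, n_states), 1.0 / n_states)    # transitions a_ij
means, stds = np.zeros(n_states), np.ones(n_states)  # emission parameters b_i

def log_b(i, o):
    """Log emission density log b_i(o) of a 1-D Gaussian."""
    return -0.5 * ((o - means[i]) / stds[i]) ** 2 - np.log(
        stds[i] * np.sqrt(2.0 * np.pi))
```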
Supervised learning:
- assumes the observed labels are correct;
- maximises the likelihood P(S, O|Θ);
- learns the correct concepts;
- sensitive to label noise.
Baum-Welch algorithm:
- unsupervised, i.e. observed labels are discarded;
- iteratively (i) labels samples and (ii) learns a model;
- may learn concepts which differ significantly;
- theoretically insensitive to label noise.
Supervised: uses annotations, which are assumed to be reliable. Maximises the likelihood
$P(S, O \mid \Theta) = q_{s_1} \prod_{t=2}^{T} a_{s_{t-1} s_t} \prod_{t=1}^{T} b_{s_t}(o_t)$
Transition probabilities P(St|St−1) are estimated by counting:
$a_{ij} = \frac{\#(\text{transitions from } i \text{ to } j)}{\#(\text{transitions from } i)}$
Emission probabilities P(Ot|St) are obtained by PDF estimation; the standard models in ECG analysis are Gaussian mixture models (GMMs).
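A minimal sketch of this supervised estimation, assuming integer-coded annotations and using scikit-learn's GaussianMixture for the emission PDFs (the function name and interface are illustrative, not the authors' code):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def supervised_hmm_fit(states, observations, n_states, n_components=5):
    """Supervised HMM estimation from one labelled sequence.

    states: annotations s_1..s_T (integers in 0..n_states-1).
    observations: array of shape (T, d) with the o_t feature vectors.
    """
    states = np.asarray(states)

    # Transition probabilities by counting:
    # a_ij = #(transitions from i to j) / #(transitions from i).
    counts = np.zeros((n_states, n_states))
    for s_prev, s_next in zip(states[:-1], states[1:]):
        counts[s_prev, s_next] += 1
    A = counts / counts.sum(axis=1, keepdims=True)

    # Emission densities by PDF estimation: one GMM per state,
    # fitted on the observations annotated with that state.
    emissions = [GaussianMixture(n_components=n_components)
                 .fit(observations[states == i]) for i in range(n_states)]
    return A, emissions
```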
Unsupervised: uses only observations, guesses hidden states. Maximises the likelihood
$P(O \mid \Theta) = \sum_{S} P(S, O \mid \Theta)$
Non-convex function to optimise:
$\log P(O \mid \Theta) = \log \sum_{S} q_{s_1} \prod_{t=2}^{T} a_{s_{t-1} s_t} \prod_{t=1}^{T} b_{s_t}(o_t)$
The log-likelihood is intractable, but what about a tractable lower bound?
Source: Pattern Recognition and Machine Learning, C. Bishop, 2006.
Two steps:
- find a tractable lower bound;
- maximise this lower bound w.r.t. Θ.
Idea: use Jensen's inequality to find a lower bound on the log-likelihood.
$\log P(O \mid \Theta) = \log \sum_{S} P(S, O \mid \Theta) = \log \sum_{S} q(S) \frac{P(S, O \mid \Theta)}{q(S)} \geq \sum_{S} q(S) \log \frac{P(S, O \mid \Theta)}{q(S)} = \sum_{S} q(S) \log \frac{P(S \mid O, \Theta)}{q(S)} + \text{const}$
The best lower bound is obtained with q(S) = P(S|O, Θ), i.e. the bound becomes tight when q(S) is the posterior over state sequences.
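A small numerical sanity check of this bound, on a hypothetical joint distribution over four state sequences (toy values, not from the talk):

```python
import numpy as np

# Toy check of the bound: for any distribution q(S),
# log sum_S P(S, O) >= sum_S q(S) * log(P(S, O) / q(S)),
# with equality when q(S) = P(S | O).
joint = np.array([0.10, 0.05, 0.20, 0.05])  # hypothetical values of P(S, O)
log_evidence = np.log(joint.sum())          # log P(O)

posterior = joint / joint.sum()             # P(S | O)
uniform = np.full(len(joint), 1.0 / len(joint))

def lower_bound(q_dist):
    return np.sum(q_dist * (np.log(joint) - np.log(q_dist)))

print(lower_bound(uniform) <= log_evidence)               # True: strict bound
print(np.isclose(lower_bound(posterior), log_evidence))   # True: tight bound
```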
Expectation step: estimate the posteriors
$\gamma_t(i) = P(S_t = i \mid O, \Theta^{\text{old}})$
$\xi_t(i, j) = P(S_{t-1} = i, S_t = j \mid O, \Theta^{\text{old}})$
Maximisation step for qi and aij:
$q_i = \frac{\gamma_1(i)}{\sum_{i=1}^{|S|} \gamma_1(i)} \qquad a_{ij} = \frac{\sum_{t=2}^{T} \xi_t(i, j)}{\sum_{t=2}^{T} \sum_{j=1}^{|S|} \xi_t(i, j)}$
The hidden states are estimated and used to compute the parameters.
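A compact sketch of these two steps, using the standard rescaled forward-backward recursions (as in Bishop's treatment); B[t, i] holds the precomputed emission likelihoods b_i(o_t), and all names are illustrative:

```python
import numpy as np

def e_step(q, A, B):
    """Forward-backward pass computing gamma_t(i) and xi_t(i, j).

    B[t, i] = b_i(o_t) are precomputed emission likelihoods, shape (T, n).
    Uses the usual per-step rescaling to avoid numerical underflow.
    """
    T, n = B.shape
    alpha = np.zeros((T, n))
    beta = np.zeros((T, n))
    c = np.zeros(T)                              # scaling factors
    alpha[0] = q * B[0]
    c[0] = alpha[0].sum()
    alpha[0] /= c[0]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[t]
        c[t] = alpha[t].sum()
        alpha[t] /= c[t]
    beta[-1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = (A @ (B[t + 1] * beta[t + 1])) / c[t + 1]
    gamma = alpha * beta                         # gamma_t(i)
    xi = (alpha[:-1, :, None] * A[None, :, :]    # xi_t(i, j) for t = 2..T
          * (B[1:] * beta[1:])[:, None, :] / c[1:, None, None])
    return gamma, xi

def m_step(gamma, xi):
    """Closed-form updates for the priors q_i and the transitions a_ij."""
    q_new = gamma[0] / gamma[0].sum()
    A_new = xi.sum(axis=0)
    A_new /= A_new.sum(axis=1, keepdims=True)
    return q_new, A_new
```

The emission parameters would be re-estimated in the same M-step, e.g. by refitting the GMMs with the γt(i) as sample weights.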
Using HMMs with raw ECG signals gives 70% accuracy. The Markov and conditional independence hypotheses are strong:
- in reality, transitions do not depend only on the current state;
- emissions are not independent, even when the states are given.
Solution: use a multi-dimensional representation of the ECG signal. Example: O(t) → (O(t), O′(t), O′′(t)):
- the observation vector contains contextual information;
- however, numerical estimations of the derivative are unstable.
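For illustration, such a derivative-based representation could be built as follows (a hypothetical helper; the noise amplification of numerical derivatives is what motivates the wavelet representation below):

```python
import numpy as np

def derivative_features(signal, fs):
    """Stack the signal with its first and second numerical derivatives,
    i.e. O(t) -> (O(t), O'(t), O''(t)). Hypothetical helper.

    np.gradient amplifies high-frequency noise, which is precisely why
    a wavelet representation is preferred in the talk.
    """
    d1 = np.gradient(signal) * fs               # O'(t)
    d2 = np.gradient(d1) * fs                   # O''(t)
    return np.column_stack([signal, d1, d2])    # shape (T, 3)
```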
Signals can be studied at different time scales (or frequencies). The Fourier transform only considers the whole signal (no localisation):
$\hat{f}(\omega) = \int_{-\infty}^{\infty} f(t)\, e^{-2\pi i \omega t}\, dt$
The wavelet transform uses a localised function ψ (a.k.a. a wavelet):
$\hat{f}_\psi(a, b) = \frac{1}{\sqrt{a}} \int_{-\infty}^{\infty} f(t)\, \psi\!\left(\frac{t - b}{a}\right) dt$
where b is the translation factor and a is the scale factor.
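A naive discrete approximation of this transform; the Mexican-hat mother wavelet and the sampled implementation are illustrative choices, since the slides do not fix them:

```python
import numpy as np

def cwt(signal, scales):
    """Naive continuous wavelet transform with a Mexican-hat wavelet.

    Approximates f_psi(a, b) = (1/sqrt(a)) * integral f(t) psi((t-b)/a) dt
    on a sampled signal, with the scale a expressed in samples. One output
    row per scale, one column per translation b. A didactic sketch; libraries
    such as PyWavelets provide optimised implementations.
    """
    out = np.zeros((len(scales), len(signal)))
    for k, a in enumerate(scales):
        # Mexican-hat (Ricker) wavelet on a support of +/- 5a samples;
        # assumes this support is shorter than the signal.
        u = np.arange(-5 * a, 5 * a + 1)
        psi = (1 - (u / a) ** 2) * np.exp(-0.5 * (u / a) ** 2)
        out[k] = np.convolve(signal, psi[::-1], mode="same") / np.sqrt(a)
    return out
```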
Source: A Wavelet Tour of Signal Processing, Stéphane Mallat, 1999.
The ECG signal is preprocessed as follows:
- filtered using a 3-30 Hz band-pass filter;
- transformed using a continuous wavelet transform;
- dyadic scales from $2^1$ to $2^7$ are kept and normalised.
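A sketch of this preprocessing pipeline, assuming SciPy for the band-pass filter and reusing the cwt() sketch above (treating the dyadic scales as expressed in samples, which is an assumption):

```python
import numpy as np
from scipy.signal import butter, filtfilt

def preprocess_ecg(ecg, fs):
    """Sketch of the described pipeline: 3-30 Hz band-pass filtering, then
    a continuous wavelet transform on dyadic scales 2^1..2^7, each scale
    normalised. Reuses the cwt() sketch above; names are illustrative."""
    # 3-30 Hz band-pass (4th-order Butterworth, applied forwards and
    # backwards for zero phase distortion).
    b, a = butter(4, [3.0, 30.0], btype="bandpass", fs=fs)
    filtered = filtfilt(b, a, ecg)

    # Dyadic scales 2^1 .. 2^7 (here expressed in samples).
    scales = 2.0 ** np.arange(1, 8)
    coeffs = cwt(filtered, scales)       # shape (7, T)

    # Normalise each scale to zero mean and unit variance.
    coeffs -= coeffs.mean(axis=1, keepdims=True)
    coeffs /= coeffs.std(axis=1, keepdims=True)
    return coeffs.T                      # one 7-D observation vector per sample
```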
For real datasets, perfect labelling is difficult:
- subjectivity of the labelling task;
- lack of information;
- communication noise.
In particular, label noise arises in biomedical applications. Previous works by e.g. Lawrence et al. incorporated a noise model into a generative model for i.i.d. observations (classification).
S1, . . . , ST is the sequence of true states (e.g. the state of the heart), with P(St|St−1).
O1, . . . , OT is the sequence of observations (e.g. the measured voltage), with P(Ot = ot|St = st).
Y1, . . . , YT is the sequence of observed annotations (e.g. P, QRS or T), with P(Yt|St).
Two distinct sequences of states:
- the observed, noisy annotations Y;
- the hidden, true labels S.
The annotation probability is
$d_{ij} = P(Y_t = j \mid S_t = i) = \begin{cases} 1 - p_i & \text{if } i = j \\ \dfrac{p_i}{|S| - 1} & \text{if } i \neq j \end{cases}$
where pi is the expert error probability in state i.
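Building this annotation matrix is straightforward; a sketch with illustrative names:

```python
import numpy as np

def annotation_matrix(p):
    """Build d_ij = P(Y_t = j | S_t = i) from per-state error rates p_i:
    the expert keeps the true label with probability 1 - p_i and otherwise
    picks one of the |S| - 1 other labels uniformly."""
    p = np.asarray(p, dtype=float)
    n = len(p)
    D = np.tile((p / (n - 1))[:, None], (1, n))   # off-diagonal: p_i / (|S|-1)
    np.fill_diagonal(D, 1 - p)                    # diagonal: 1 - p_i
    return D

# Example: 4 states, 5% expert error everywhere.
print(annotation_matrix([0.05] * 4))
```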
Compromise between supervised learning and Baum-Welch:
- assumes the observed labels are potentially noisy;
- maximises the likelihood P(Y, O|Θ);
- learns the correct concepts;
- less sensitive to label noise.
Non-convex function to optimise:
$\log P(O, Y \mid \Theta) = \log \sum_{S} P(O, Y, S \mid \Theta)$
Expectation step: estimate the posteriors
$\gamma_t(i) = P(S_t = i \mid O, Y, \Theta^{\text{old}})$
$\xi_t(i, j) = P(S_{t-1} = i, S_t = j \mid O, Y, \Theta^{\text{old}})$
Maximisation step for pi:
$p_i = \frac{\sum_{t=1}^{T} \gamma_t(i)\, \mathbb{1}(y_t \neq i)}{\sum_{t=1}^{T} \gamma_t(i)}$
The true labels are estimated and used to compute the parameters.
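In code, the only changes with respect to the Baum-Welch sketch above are (i) multiplying each emission term by d_{i, y_t} in the E-step and (ii) the extra update for p_i; a sketch reusing the earlier hypothetical helpers and relying on the reconstructed p_i update above:

```python
import numpy as np

def noisy_e_step(q, A, B, D, y):
    """E-step of the label noise-tolerant HMM: a standard forward-backward
    pass in which each emission term is multiplied by the annotation
    probability, i.e. the effective likelihood of state i at time t is
    b_i(o_t) * d_{i, y_t}. Reuses e_step() from the earlier sketch."""
    y = np.asarray(y)
    B_eff = B * D[:, y].T                # B_eff[t, i] = b_i(o_t) * d_{i, y_t}
    return e_step(q, A, B_eff)

def update_error_rates(gamma, y, n_states):
    """M-step for p_i: posterior fraction of the time spent in state i
    while the expert annotated a different label."""
    y = np.asarray(y)
    p = np.zeros(n_states)
    for i in range(n_states):
        mass = gamma[:, i].sum()
        p[i] = gamma[y != i, i].sum() / mass if mass > 0 else 0.0
    return p
```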
EM algorithms:
- GMM with 5 components;
- EM algorithms are repeated 10 times.
Electrocardiograms:
- a set of 10 artificial ECGs;
- 10 ECGs selected from the sinus MIT-QT database;
- 10 ECGs selected from the arrhythmia MIT-QT database.
Comparison:
- learning with addition of artificial label noise;
- comparison on original signals;
- label noise moves the boundaries of the P and T waves.
[Results figures: supervised learning, Baum-Welch and the label noise-tolerant algorithm compared.]
Supervised learning:
- affected by increasing label noise.
Baum-Welch:
- worst results for small levels of noise;
- less affected by the label noise;
- better than supervised learning for large levels of noise.
Label noise-tolerant algorithm:
- affected by increasing label noise;
- most often better than Baum-Welch;
- better than supervised learning for large levels of noise.
Proper probabilistic modelling of label noise improves results. Label noise-tolerant HMMs give good results for ECG segmentation.
Results published in:
Frénay, B., de Lannoy, G., Verleysen, M. Label Noise-Tolerant Hidden Markov Models for Segmentation: Application to ECGs. In Proc. ECML-PKDD 2011, pp. 455-470.
More on label noise (literature survey):
Frénay, B., Verleysen, M. Classification in the Presence of Label Noise: a Survey. IEEE Trans. Neural Networks and Learning Systems, 25(5), 2014, pp. 845-869.
Tutorial:
Frénay, B., Kabán, A. A Comprehensive Introduction to Label Noise. In Proc. ESANN 2014.
ECG analysis:
de Lannoy, G., Frénay, B., Verleysen, M. Supervised ECG delineation using the wavelet transform and hidden Markov models. In Proc. MBEC, Antwerp, Belgium, 23-27 November 2008, pp. 22-25.
Frénay, B., de Lannoy, G., Verleysen, M. Improving the transition modelling in hidden Markov models for ECG segmentation. In Proc. ESANN 2009.
Label noise / anomalous observations:
Frénay, B., Doquire, G., Verleysen, M. Estimating mutual information for feature selection in the presence of label noise. Computational Statistics & Data Analysis, 71, 2014, pp. 832-848.
Frénay, B., Verleysen, M. Pointwise Probability Reinforcements for Robust Statistical Inference. Neural Networks, 50, 2014, pp. 124-141.