SLIDE 1

Robust Hidden Markov Models Inference in the Presence of Label Noise

Benoît Frénay
February 7, 2014

SLIDE 2

What is Machine Learning?

SLIDE 3

What is Machine Learning?

Machine learning is about learning from data. A model is inferred from a training set to make predictions.

SLIDE 4

What is Machine Learning?

Machine learning is about learning from data. A model is inferred from a training set to make predictions.

SLIDE 5

What is Machine Learning?

Machine learning is about learning from data. A model is inferred from a training set to make predictions.

SLIDE 6

Examples of Tasks: Regression

Example: predict children's weight from anthropometric measures.

SLIDE 7

Examples of Tasks: Regression

Example: predict children's weight from anthropometric measures.

SLIDE 8

Examples of Tasks: Regression

Example: predict children's weight from anthropometric measures.

SLIDE 9

Examples of Tasks: Classification

Examples: disease diagnosis, spam filtering, image classification.

SLIDE 10

Examples of Tasks: Classification

Examples: disease diagnosis, spam filtering, image classification.

SLIDE 11

Examples of Tasks: Classification

Examples: disease diagnosis, spam filtering, image classification.

SLIDE 12

What does it Mean for a Machine to Learn?

Machine learning studies how machines can learn automatically. Learning means finding a model of the data. Three steps: (i) specify a type of model (e.g. a linear model); (ii) specify a criterion (e.g. mean square error); (iii) find the best model w.r.t. the criterion.

SLIDE 13

Example of Learning Process: Linear Regression

Model: linear model $f(\mathbf{x}) = w_1 x_1 + \cdots + w_d x_d + w_0$
Criterion: mean square error $\sum_{i=1}^{n} (y_i - f(\mathbf{x}_i))^2$
Algorithm: linear regression $\mathbf{w} = \arg\min_{\mathbf{w}} \sum_{i=1}^{n} (y_i - f(\mathbf{x}_i))^2 = (X'X)^{-1} X'\mathbf{y}$
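A minimal NumPy sketch of this closed form (illustrative names, not from the slides; np.linalg.lstsq stands in for the explicit inverse, which is numerically safer):

```python
import numpy as np

def fit_linear_regression(X, y):
    """Least-squares fit: w = argmin_w sum_i (y_i - f(x_i))^2 = (X'X)^{-1} X'y."""
    # Prepend a column of ones so that w[0] plays the role of the intercept w_0.
    Xb = np.column_stack([np.ones(len(X)), X])
    w, *_ = np.linalg.lstsq(Xb, y, rcond=None)
    return w

# Toy usage: X holds n rows of d anthropometric measures, y the n target weights.
rng = np.random.default_rng(0)
X = rng.random((100, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 0.3 + 0.01 * rng.standard_normal(100)
w = fit_linear_regression(X, y)  # w[0] close to 0.3, w[1:] close to [2, -1, 0.5]
```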

SLIDE 14

Overview of the Presentation

Segmentation of electrocardiogram signals:

SLIDE 15

Overview of the Presentation

Segmentation of electrocardiogram signals:
goal: allow automated diagnosis of heart disease

SLIDE 16

Overview of the Presentation

Segmentation of electrocardiogram signals:
goal: allow automated diagnosis of heart disease
tools: hidden Markov models and wavelet transform

SLIDE 17

Overview of the Presentation

Segmentation of electrocardiogram signals:
goal: allow automated diagnosis of heart disease
tools: hidden Markov models and wavelet transform
issue: robustness to label noise (i.e. expert errors)

SLIDE 18

Overview of the Presentation

Segmentation of electrocardiogram signals:
goal: allow automated diagnosis of heart disease
tools: hidden Markov models and wavelet transform
issue: robustness to label noise (i.e. expert errors)
solution: modelling of expert behaviour

SLIDE 19

Overview of the Presentation

Segmentation of electrocardiogram signals:

SLIDE 20

Overview of the Presentation

Segmentation of electrocardiogram signals:
goal: allow automated diagnosis of heart disease

SLIDE 21

Overview of the Presentation

Segmentation of electrocardiogram signals:
goal: allow automated diagnosis of heart disease
tools: hidden Markov models and wavelet transform

SLIDE 22

Overview of the Presentation

Segmentation of electrocardiogram signals:
goal: allow automated diagnosis of heart disease
tools: hidden Markov models and wavelet transform
issue: robustness to label noise (i.e. expert errors)

SLIDE 23

Overview of the Presentation

Segmentation of electrocardiogram signals:
goal: allow automated diagnosis of heart disease
tools: hidden Markov models and wavelet transform
issue: robustness to label noise (i.e. expert errors)
solution: modelling of expert behaviour

SLIDE 24

Electrocardiogram Signal Segmentation

SLIDE 25

What is an Electrocardiogram Signal?

An ECG is a measure of the electrical activity of the human heart. Patterns of interest: P wave, QRS complex, T wave, baseline.

SLIDE 26

Where Does it Come from?

The ECG results from the superposition of several signals.

SLIDE 27

What it Looks Like in Real-World Cases

Real ECGs are polluted by various sources of noise.

SLIDE 28

What is our Goal in ECG Segmentation?

Task: split/segment an entire ECG into patterns.
Available data: a few manual segmentations from experts.
Issue: some of the expert annotations are incorrect.
Probabilistic model of sequences with labels: hidden Markov models (with wavelet transform).

SLIDE 29

Hidden Markov Models

SLIDE 30

Hidden Markov Models in a Nutshell

Hidden Markov models (HMMs) are probabilistic models of sequences. $S_1, \dots, S_T$ is the sequence of annotations (ex.: state of the heart), with transitions $P(S_t = s_t \mid S_{t-1} = s_{t-1})$.

SLIDE 31

Hidden Markov Models in a Nutshell

Hidden Markov models (HMMs) are probabilistic models of sequences. $S_1, \dots, S_T$ is the sequence of annotations (ex.: state of the heart), with transitions $P(S_t = s_t \mid S_{t-1} = s_{t-1})$. $O_1, \dots, O_T$ is the sequence of observations (ex.: measured voltage), with emissions $P(O_t = o_t \mid S_t = s_t)$.

SLIDE 32

Hypotheses Behind Hidden Markov Models (1)

Markov hypothesis: the next state depends only on the current state.

SLIDE 33

Hypotheses Behind Hidden Markov Models (2)

Observations are conditionally independent w.r.t. the hidden states:
$P(O_1, \dots, O_T \mid S_1, \dots, S_T) = \prod_{t=1}^{T} P(O_t \mid S_t)$

SLIDE 34

Learning Hidden Markov Models

Learning an HMM means estimating probabilities:
$P(S_t)$ are prior probabilities
$P(S_t \mid S_{t-1})$ are transition probabilities
$P(O_t \mid S_t)$ are emission probabilities.
Parameters $\Theta = (q, a, b)$:
$q_i$ is the prior of state $i$
$a_{ij}$ is the transition probability from state $i$ to state $j$
$b_i$ is the observation distribution for state $i$
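The parameter triple $\Theta = (q, a, b)$ maps directly onto a small container. A Python sketch, with the per-state emission models represented abstractly as density callables (an assumption for illustration; the ECG slides use GMMs):

```python
from dataclasses import dataclass
from typing import Callable, Sequence
import numpy as np

@dataclass
class HMMParams:
    q: np.ndarray                                # (K,) priors q_i
    a: np.ndarray                                # (K, K) transition matrix a_ij
    b: Sequence[Callable[[np.ndarray], float]]   # emission densities b_i(o), e.g. GMMs
```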

SLIDE 35

Standard Inference Algorithms for HMMs

Supervised learning: assumes the observed labels are correct; maximises the likelihood $P(S, O \mid \Theta)$; learns the correct concepts; sensitive to label noise.
Baum-Welch algorithm: unsupervised, i.e. the observed labels are discarded; iteratively (i) labels samples and (ii) learns a model; may learn concepts which differ significantly; theoretically insensitive to label noise.

SLIDE 36

Supervised Learning for Hidden Markov Models

Supervised: uses annotations, which are assumed to be reliable. Maximises the likelihood
$P(S, O \mid \Theta) = q_{s_1} \prod_{t=2}^{T} a_{s_{t-1} s_t} \prod_{t=1}^{T} b_{s_t}(o_t)$.

SLIDE 37

Supervised Learning for Hidden Markov Models

Supervised: uses annotations, which are assumed to be reliable. Maximises the likelihood
$P(S, O \mid \Theta) = q_{s_1} \prod_{t=2}^{T} a_{s_{t-1} s_t} \prod_{t=1}^{T} b_{s_t}(o_t)$.

Transition probabilities $P(S_t \mid S_{t-1})$ are estimated by counting:
$a_{ij} = \#(\text{transitions from } i \text{ to } j) \,/\, \#(\text{transitions from } i)$
Emission probabilities $P(O_t \mid S_t)$ are obtained by PDF estimation; the standard models in ECG analysis are Gaussian mixture models (GMMs).
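A minimal sketch of the counting estimator for $a_{ij}$ (NumPy; the optional additive smoothing is an assumption to avoid empty rows, not something the slides mention):

```python
import numpy as np

def estimate_transitions(state_seqs, n_states, alpha=0.0):
    """Counting estimator: a_ij = #(transitions from i to j) / #(transitions from i)."""
    counts = np.full((n_states, n_states), alpha, dtype=float)
    for seq in state_seqs:
        for prev, nxt in zip(seq[:-1], seq[1:]):
            counts[prev, nxt] += 1
    return counts / counts.sum(axis=1, keepdims=True)  # row-normalise

A = estimate_transitions([[0, 0, 1, 2, 0], [0, 1, 1, 2]], n_states=3)
```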

SLIDE 38

Unsupervised Learning for Hidden Markov Models (1)

Unsupervised: uses only observations, guesses hidden states. Maximises the likelihood
$P(O \mid \Theta) = \sum_S P(S, O \mid \Theta)$.

SLIDE 39

Unsupervised Learning for Hidden Markov Models (1)

Unsupervised: uses only observations, guesses hidden states. Maximises the likelihood
$P(O \mid \Theta) = \sum_S P(S, O \mid \Theta)$.

Non-convex function to optimise:
$\log P(O \mid \Theta) = \log \sum_S q_{s_1} \prod_{t=2}^{T} a_{s_{t-1} s_t} \prod_{t=1}^{T} b_{s_t}(o_t)$
Solution: expectation-maximisation algorithm (a.k.a. Baum-Welch).
SLIDE 40

Unsupervised Learning for Hidden Markov Models (2)

The log-likelihood is intractable, but what about a convex lower bound?

Source: Pattern Recognition and Machine Learning, C. Bishop, 2006.

Two steps: find a tractable lower bound, then maximise this lower bound w.r.t. $\Theta$.

SLIDE 41

Unsupervised Learning for Hidden Markov Models (3)

Idea: use Jensen's inequality to find a lower bound to the log-likelihood.
$\log P(O \mid \Theta) = \log \sum_S P(S, O \mid \Theta)$

SLIDE 42

Unsupervised Learning for Hidden Markov Models (3)

Idea: use Jensen's inequality to find a lower bound to the log-likelihood.
$\log P(O \mid \Theta) = \log \sum_S P(S, O \mid \Theta) = \log \sum_S q(S) \frac{P(S, O \mid \Theta)}{q(S)}$

SLIDE 43

Unsupervised Learning for Hidden Markov Models (3)

Idea: use Jensen's inequality to find a lower bound to the log-likelihood.
$\log P(O \mid \Theta) = \log \sum_S P(S, O \mid \Theta) = \log \sum_S q(S) \frac{P(S, O \mid \Theta)}{q(S)} \geq \sum_S q(S) \log \frac{P(S, O \mid \Theta)}{q(S)}$

SLIDE 44

Unsupervised Learning for Hidden Markov Models (3)

Idea: use Jensen's inequality to find a lower bound to the log-likelihood.
$\log P(O \mid \Theta) = \log \sum_S P(S, O \mid \Theta) = \log \sum_S q(S) \frac{P(S, O \mid \Theta)}{q(S)} \geq \sum_S q(S) \log \frac{P(S, O \mid \Theta)}{q(S)} = \sum_S q(S) \log \frac{P(S \mid O, \Theta)}{q(S)} + \text{const}$

SLIDE 45

Unsupervised Learning for Hidden Markov Models (3)

Idea: use Jensen's inequality to find a lower bound to the log-likelihood.
$\log P(O \mid \Theta) = \log \sum_S P(S, O \mid \Theta) = \log \sum_S q(S) \frac{P(S, O \mid \Theta)}{q(S)} \geq \sum_S q(S) \log \frac{P(S, O \mid \Theta)}{q(S)} = \sum_S q(S) \log \frac{P(S \mid O, \Theta)}{q(S)} + \text{const}$
Best lower bound with $q(S) = P(S \mid O, \Theta)$.

SLIDE 46

The Expectation-Maximisation / Baum-Welch Algorithm

Expectation step: estimate the posteriors
$\gamma_t(i) = P(S_t = i \mid O, \Theta^{\text{old}})$
$\epsilon_t(i, j) = P(S_{t-1} = i, S_t = j \mid O, \Theta^{\text{old}})$

SLIDE 47

The Expectation-Maximisation / Baum-Welch Algorithm

Expectation step: estimate the posteriors
$\gamma_t(i) = P(S_t = i \mid O, \Theta^{\text{old}})$
$\epsilon_t(i, j) = P(S_{t-1} = i, S_t = j \mid O, \Theta^{\text{old}})$
Maximisation step for $q_i$ and $a_{ij}$:
$q_i = \frac{\gamma_1(i)}{\sum_{i'=1}^{|S|} \gamma_1(i')} \qquad a_{ij} = \frac{\sum_{t=2}^{T} \epsilon_t(i, j)}{\sum_{t=2}^{T} \sum_{j'=1}^{|S|} \epsilon_t(i, j')}$
The hidden states are estimated and used to compute the parameters.
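A compact sketch of one Baum-Welch iteration (NumPy; gamma and xi below are the posteriors $\gamma_t(i)$ and $\epsilon_t(i, j)$, and for brevity the code assumes discrete emissions stored in a table B rather than the GMM emissions used on ECGs):

```python
import numpy as np

def baum_welch_step(obs, q, A, B):
    """One EM iteration for a discrete-emission HMM.

    obs: (T,) integer observations; q: (K,) priors;
    A: (K, K) transitions a_ij; B: (K, M) emission table b_i(o).
    """
    obs = np.asarray(obs)
    T, K = len(obs), len(q)
    # E-step: scaled forward-backward recursions.
    alpha = np.zeros((T, K)); beta = np.zeros((T, K)); c = np.zeros(T)
    alpha[0] = q * B[:, obs[0]]
    c[0] = alpha[0].sum(); alpha[0] /= c[0]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
        c[t] = alpha[t].sum(); alpha[t] /= c[t]
    beta[-1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = (A @ (B[:, obs[t + 1]] * beta[t + 1])) / c[t + 1]
    gamma = alpha * beta                         # gamma_t(i) = P(S_t = i | O)
    xi = (alpha[:-1, :, None] * A[None, :, :]    # xi_t(i, j) = P(S_{t-1}=i, S_t=j | O)
          * (B[:, obs[1:]].T * beta[1:])[:, None, :]) / c[1:, None, None]
    # M-step: re-estimate the parameters from the posteriors.
    q_new = gamma[0] / gamma[0].sum()
    A_new = xi.sum(axis=0) / xi.sum(axis=(0, 2))[:, None]
    B_new = np.vstack([gamma[obs == m].sum(axis=0) for m in range(B.shape[1])]).T
    B_new /= B_new.sum(axis=1, keepdims=True)
    return q_new, A_new, B_new, np.log(c).sum()  # last value: log P(O | Theta)
```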

SLIDE 48

Wavelet Transform

SLIDE 49

Why do we Need High-Dimensional Representations?

Using HMMs with raw ECG signals gives 70% accuracy. The Markov and conditional-independence hypotheses are strong:
transitions do not depend only on the current state
emissions are not independent, even when states are given

SLIDE 50

Why do we Need High-Dimensional Representations?

Using HMMs with raw ECG signals gives 70% accuracy. The Markov and conditional-independence hypotheses are strong:
transitions do not depend only on the current state
emissions are not independent, even when states are given
Solution: use a multi-dimensional representation of the ECG signal. Example: $O(t) \to (O(t), O'(t), O''(t))$:
the observation vector contains contextual information
numerical estimation of derivatives is unstable
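For instance, a derivative-augmented representation can be sketched as follows (NumPy; np.gradient is a finite-difference estimate, precisely the unstable numerical derivative mentioned above):

```python
import numpy as np

def augment_with_derivatives(o, dt=1.0):
    """Stack the signal with its first and second finite-difference derivatives."""
    d1 = np.gradient(o, dt)               # O'(t), already noisy on raw ECGs
    d2 = np.gradient(d1, dt)              # O''(t), noisier still
    return np.stack([o, d1, d2], axis=1)  # shape (T, 3): one vector per sample
```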

SLIDE 51

Wavelet Transform in a Nutshell

Signals can be studied at different time scales (or frequencies). The Fourier transform only considers the whole signal (no localisation):
$\hat{f}(\omega) = \int_{-\infty}^{\infty} f(t)\, e^{-2\pi i \omega t}\, dt$

SLIDE 52

Wavelet Transform in a Nutshell

Signals can be studied at different time scales (or frequencies). The Fourier transform only considers the whole signal (no localisation):
$\hat{f}(\omega) = \int_{-\infty}^{\infty} f(t)\, e^{-2\pi i \omega t}\, dt$
The wavelet transform uses a localised function $\psi$ (a.k.a. the wavelet):
$\hat{f}_\psi(a, b) = \frac{1}{\sqrt{|a|}} \int_{-\infty}^{\infty} \psi\!\left(\frac{t - b}{a}\right) f(t)\, dt$
where $b$ is the translation factor and $a$ is the scale factor.
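A direct discretisation of this integral, with a Ricker (Mexican-hat) wavelet as an illustrative choice of $\psi$ (the slides do not name one):

```python
import numpy as np

def ricker(t):
    """Ricker (Mexican-hat) wavelet, one possible psi."""
    return (1 - t**2) * np.exp(-t**2 / 2)

def cwt(f, scales, dt=1.0):
    """Naive O(T^2) continuous wavelet transform of a sampled signal f."""
    t = np.arange(len(f)) * dt
    out = np.empty((len(scales), len(f)))
    for k, a in enumerate(scales):          # scale factor a
        for j, b in enumerate(t):           # translation factor b
            out[k, j] = (ricker((t - b) / a) * f).sum() * dt / np.sqrt(abs(a))
    return out                              # rows = scales, columns = translations
```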

SLIDE 53

Example of Time-Frequency Analysis (1)

Source: A Wavelet Tour of Signal Processing, Stéphane Mallat, 1999.

SLIDE 54

Example of Time-Frequency Analysis (2)

Source: A Wavelet Tour of Signal Processing, Stéphane Mallat, 1999.

SLIDE 55

Information Extraction with Wavelet Transform

The ECG signal is:
filtered using a 3-30 Hz band-pass filter
transformed using a continuous wavelet transform
dyadic scales from $2^1$ to $2^7$ are kept and normalised
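A sketch of this pipeline (SciPy; the Butterworth design and filter order are assumptions, the slides only give the 3-30 Hz band; cwt_fn can be the cwt() sketch above):

```python
import numpy as np
from scipy.signal import butter, filtfilt

def preprocess_ecg(ecg, fs, cwt_fn):
    """Band-pass 3-30 Hz, CWT at dyadic scales 2^1..2^7, per-scale normalisation."""
    b, a = butter(4, [3.0, 30.0], btype="bandpass", fs=fs)
    filtered = filtfilt(b, a, ecg)                # zero-phase band-pass filtering
    scales = 2.0 ** np.arange(1, 8)               # dyadic scales 2, 4, ..., 128
    coeffs = cwt_fn(filtered, scales, dt=1.0 / fs)
    coeffs -= coeffs.mean(axis=1, keepdims=True)  # normalise each scale:
    coeffs /= coeffs.std(axis=1, keepdims=True)   # zero mean, unit variance
    return coeffs.T                               # one 7-D observation per sample
```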

SLIDE 56

Label Noise-Tolerant Hidden Markov Models

SLIDE 57

Motivation

For real datasets, perfect labelling is difficult: subjectivity of the labelling task; lack of information; communication noise. In particular, label noise arises in biomedical applications. Previous work by e.g. Lawrence et al. incorporated a noise model into a generative model for i.i.d. observations (classification).

SLIDE 58

Example of Label Noise in ECGs

SLIDE 59

Label Noise-Tolerant Hidden Markov Models

$S_1, \dots, S_T$ is the sequence of true states (ex.: state of the heart), with transitions $P(S_t \mid S_{t-1})$.

SLIDE 60

Label Noise-Tolerant Hidden Markov Models

$S_1, \dots, S_T$ is the sequence of true states (ex.: state of the heart), with transitions $P(S_t \mid S_{t-1})$. $O_1, \dots, O_T$ is the sequence of observations (ex.: measured voltage), with emissions $P(O_t = o_t \mid S_t = s_t)$.

SLIDE 61

Label Noise-Tolerant Hidden Markov Models

$S_1, \dots, S_T$ is the sequence of true states (ex.: state of the heart), with transitions $P(S_t \mid S_{t-1})$. $O_1, \dots, O_T$ is the sequence of observations (ex.: measured voltage), with emissions $P(O_t = o_t \mid S_t = s_t)$. $Y_1, \dots, Y_T$ is the sequence of observed annotations (ex.: P, QRS or T), with annotation model $P(Y_t \mid S_t)$.

SLIDE 62

Label Noise Model for Expert Annotations

Two distinct sequences of states:
the observed, noisy annotations $Y$
the hidden, true labels $S$
The annotation probability is
$d_{ij} = \begin{cases} 1 - p_i & (i = j) \\ p_i / (|S| - 1) & (i \neq j) \end{cases}$
where $p_i$ is the expert error probability in state $i$.
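Viewed as a matrix, the noise model is a one-liner to build (NumPy sketch, illustrative names):

```python
import numpy as np

def annotation_matrix(p):
    """d[i, j] = P(Y = j | S = i): 1 - p_i on the diagonal, p_i spread uniformly elsewhere."""
    p = np.asarray(p, dtype=float)
    K = len(p)
    d = np.tile((p / (K - 1))[:, None], (1, K))  # off-diagonal mass p_i / (|S| - 1)
    np.fill_diagonal(d, 1 - p)                   # diagonal mass 1 - p_i
    return d

D = annotation_matrix([0.05, 0.10, 0.02])  # each row sums to 1
```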

SLIDE 63

Label Noise-Tolerant Learning

Compromise between supervised learning and Baum-Welch:
assumes the observed labels are potentially noisy;
maximises the likelihood $P(Y, O \mid \Theta)$;
learns the correct concepts;
less sensitive to label noise.

SLIDE 64

Expectation-Maximisation Algorithm

Non-convex function to optimise:
$\log P(O, Y \mid \Theta) = \log \sum_S P(O, Y, S \mid \Theta)$
Expectation step: estimate the posteriors
$\gamma_t(i) = P(S_t = i \mid O, Y, \Theta^{\text{old}})$
$\epsilon_t(i, j) = P(S_{t-1} = i, S_t = j \mid O, Y, \Theta^{\text{old}})$
Maximisation step for $p_i$ (an error probability, hence the disagreement $Y_t \neq i$ in the numerator):
$p_i = \frac{\sum_{t \,:\, Y_t \neq i} \gamma_t(i)}{\sum_{t=1}^{T} \gamma_t(i)}$
The true labels are estimated and used to compute the parameters.
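The $p_i$ update itself is a short computation once the posteriors are available (NumPy sketch; gamma is the (T, K) posterior matrix from the E-step, y the integer-encoded annotations):

```python
import numpy as np

def update_error_probabilities(gamma, y):
    """M-step for p_i: posterior mass of frames whose annotation disagrees with i."""
    T, K = gamma.shape
    p = np.empty(K)
    for i in range(K):
        p[i] = gamma[y != i, i].sum() / gamma[:, i].sum()
    return p
```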

SLIDE 65

Experiments

SLIDE 66

Experimental Settings

EM algorithms: GMM with 5 components; EM algorithms are repeated 10 times.
Electrocardiograms: a set of 10 artificial ECGs; 10 ECGs selected in the sinus MIT-QT database; 10 ECGs selected in the arrhythmia MIT-QT database.
Comparison: learning with addition of artificial label noise; comparison on original signals; label noise moves the boundaries of P and T waves.

SLIDE 67

Experimental Results on Artificial ECGs

Supervised learning, Baum-Welch and label noise-tolerant.

SLIDE 68

Experimental Results for Sinus ECGs

Supervised learning, Baum-Welch and label noise-tolerant.

SLIDE 69

Experimental Results for Arrhythmia ECGs

Supervised learning, Baum-Welch and label noise-tolerant.

SLIDE 70

Discussion

Supervised learning: affected by increasing label noise.
Baum-Welch: worst results for small levels of noise; less affected by the label noise; better than supervised learning for large levels of noise.
Label noise-tolerant algorithm: affected by increasing label noise; most often better than Baum-Welch; better than supervised learning for large levels of noise.

SLIDE 71

Conclusion

SLIDE 72

Conclusion

Proper probabilistic modelling of label noise improves results. Label noise-tolerant HMMs give good results for ECG segmentation.

Results published in: Frénay, B., de Lannoy, G., Verleysen, M. Label Noise-Tolerant Hidden Markov Models for Segmentation: Application to ECGs. In Proc. ECML-PKDD 2011, pp. 455-470.

More on label noise: Frénay, B., Verleysen, M. Classification in the Presence of Label Noise: a Survey. IEEE TNN-LS, in press, 25 pages.

SLIDE 73

Thank you for your attention! I hope it was interesting for everyone. Any questions?