
Lecture 4: Hidden Markov Models. Michael Picheny, Bhuvana Ramabhadran, Stanley F. Chen, Markus Nussbaum-Thom. Watson Group, IBM T.J. Watson Research Center, Yorktown Heights, New York, USA. {picheny,bhuvana,stanchen,nussbaum}@us.ibm.com


  1. The Markov Property, Order n

Holds if:

P(x_1, ..., x_N) = \prod_{i=1}^N P(x_i | x_1, ..., x_{i-1})
                 = \prod_{i=1}^N P(x_i | x_{i-n}, x_{i-n+1}, ..., x_{i-1})

e.g., if we know the weather for the past n days, knowing more doesn't help predict future weather. i.e., if the data satisfies this property, there is no loss from just remembering the past n items!

  2. A Non-Hidden Markov Model, Order 1

Let's assume knowing yesterday's weather is enough:

P(x_1, ..., x_N) = \prod_{i=1}^N P(x_i | x_{i-1})

Before (no state): a single multinomial P(x_i).
After (with state): a separate multinomial P(x_i | x_{i-1}) for each x_{i-1} ∈ {rainy, windy, calm}.
Model P(x_i | x_{i-1}) with parameter p_{x_{i-1}, x_i}.
What about P(x_1 | x_0)? Assume x_0 = start, a special value; this adds one more multinomial, P(x_i | start).
Constraint: \sum_{x_i} p_{x_{i-1}, x_i} = 1 for all x_{i-1}.

  3. A Picture

After observing x, go to the state labeled x. Is the state non-hidden?

[Transition diagram: states start, R, W, C. Each arc is labeled with its observation and parameter, x / p_{x_{i-1}, x}; e.g., the arc start → W carries W / p_{start,W}, and the arc C → R carries R / p_{C,R}.]

  4. Computing the Likelihood of Data

Some data: x = W, W, C, C, W, W, C, R, C, R.

[Transition diagram with example probabilities, recovered from the arc labels:
 start: R/0.2, W/0.3, C/0.5;  R: R/0.1, W/0.5, C/0.4;
 W: R/0.6, W/0.3, C/0.1;  C: R/0.7, W/0.2, C/0.1]

The likelihood:

P(x_1, ..., x_{10}) = \prod_{i=1}^{10} P(x_i | x_{i-1}) = \prod_{i=1}^{10} p_{x_{i-1}, x_i}
                    = p_{start,W} × p_{W,W} × p_{W,C} × ...
                    = 0.3 × 0.3 × 0.1 × 0.1 × ... = 1.06 × 10^{-6}
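The product above can be checked with a short script. This is a sketch, not part of the original lecture; the transition table is the one reconstructed from the diagram's arc labels (it reproduces the 1.06 × 10^{-6} result).

```python
# Order-1 Markov model for weather; transition probabilities as
# reconstructed from the slide's diagram.
p = {
    ("start", "R"): 0.2, ("start", "W"): 0.3, ("start", "C"): 0.5,
    ("R", "R"): 0.1, ("R", "W"): 0.5, ("R", "C"): 0.4,
    ("W", "R"): 0.6, ("W", "W"): 0.3, ("W", "C"): 0.1,
    ("C", "R"): 0.7, ("C", "W"): 0.2, ("C", "C"): 0.1,
}

def likelihood(xs, p):
    """P(x_1..x_N) = prod_i p[x_{i-1}, x_i], with x_0 = start."""
    prob, prev = 1.0, "start"
    for x in xs:
        prob *= p[(prev, x)]
        prev = x
    return prob

x = ["W", "W", "C", "C", "W", "W", "C", "R", "C", "R"]
lik = likelihood(x, p)  # ≈ 1.06e-6
```

Note that each row of the table sums to 1, as the multinomial constraint requires.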

  5. Computing the Likelihood of Data

More generally:

P(x_1, ..., x_N) = \prod_{i=1}^N P(x_i | x_{i-1}) = \prod_{i=1}^N p_{x_{i-1}, x_i}
                 = \prod_{x_{i-1}, x_i} p_{x_{i-1}, x_i}^{c(x_{i-1}, x_i)}

log P(x_1, ..., x_N) = \sum_{x_{i-1}, x_i} c(x_{i-1}, x_i) log p_{x_{i-1}, x_i}

where x_0 = start and c(x_{i-1}, x_i) is the count of x_i following x_{i-1}. The likelihood depends only on counts of pairs (bigrams).

  6. Maximum Likelihood Estimation

Choose p_{x_{i-1}, x_i} to optimize the log likelihood:

L(x_1^N) = \sum_{x_{i-1}, x_i} c(x_{i-1}, x_i) log p_{x_{i-1}, x_i}
         = \sum_{x_i} c(start, x_i) log p_{start, x_i} + \sum_{x_i} c(R, x_i) log p_{R, x_i}
           + \sum_{x_i} c(W, x_i) log p_{W, x_i} + \sum_{x_i} c(C, x_i) log p_{C, x_i}

Each sum is the log likelihood of a multinomial, and each multinomial has a nonoverlapping parameter set, so we can optimize each sum independently!

p^{MLE}_{x_{i-1}, x_i} = c(x_{i-1}, x_i) / \sum_x c(x_{i-1}, x)

  7. Example: Maximum Likelihood Estimation

Some raw data: W, W, C, C, W, W, C, R, C, R, W, C, C, C, R, R, R, R, C, C, R, R, R, R, R, R, R, R, C, C, C, C, C, R, R, R, R, R, R, R, C, C, C, W, C, C, C, C, C, C, R, C, C, C, C

Counts and ML estimates:

    c(·,·)   R    W    C    sum        p^MLE    R      W      C
    start    0    1    0    1          start    0.000  1.000  0.000
    R        16   1    5    22         R        0.727  0.045  0.227
    W        0    2    4    6          W        0.000  0.333  0.667
    C        6    2    18   26         C        0.231  0.077  0.692

p^{MLE}_{x_{i-1}, x_i} = c(x_{i-1}, x_i) / \sum_x c(x_{i-1}, x),  e.g.  p^{MLE}_{R,C} = 5 / (16 + 1 + 5) = 0.227
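Count-and-normalize is short enough to verify directly. A sketch (not from the original slides) that recomputes the table above from the raw data:

```python
from collections import Counter

# Raw data from the slide.
data = ("W W C C W W C R C R W C C C R R R R C C R R R R R R R R "
        "C C C C C R R R R R R R C C C W C C C C C C R C C C C").split()

# Count bigrams, with the special start symbol prepended.
counts = Counter(zip(["start"] + data[:-1], data))

def mle(prev, cur):
    """p_MLE(cur | prev) = c(prev, cur) / sum_x c(prev, x)."""
    total = sum(n for (p, _), n in counts.items() if p == prev)
    return counts[(prev, cur)] / total

p_RC = mle("R", "C")  # 5 / 22 ≈ 0.227
```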

  8. Example: Maximum Likelihood Estimation

[Transition diagram annotated with counts (left) and ML estimates (right); e.g., start → W: count 1, probability 1.000; R → R: count 16, probability 0.727; C → C: count 18, probability 0.692; W → C: count 4, probability 0.667; R → C: count 5, probability 0.227.]

  9. Example: Orders

Some raw data: W, W, C, C, W, W, C, R, C, R, W, C, C, C, R, R, R, R, C, C, R, R, R, R, R, R, R, R, C, C, C, C, C, R, R, R, R, R, R, R, C, C, C, W, C, C, C, C, C, C, R, C, C, C, C

Data sampled from the MLE Markov model, order 1: W, W, C, R, R, R, R, R, R, R, R, R, R, R, R, R, R, R, R, C, C, C, C, C, C, W, W, C, C, C, R, R, R, C, C, W, C, C, C, C, C, R, R, R, R, R, C, R, R, C, R, R, R, R, R

Data sampled from the MLE Markov model, order 0: C, R, C, R, R, R, R, C, R, R, C, C, R, C, C, R, R, R, R, C, C, C, R, C, R, W, R, C, C, C, W, C, R, C, C, W, C, C, C, C, R, R, C, C, C, R, C, R, R, C, R, C, R, W, R
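The order-1 samples above were drawn from the fitted model. A minimal sketch of how such sampling works, using the slide's rounded ML estimates (not the lecture's original sampling code):

```python
import random

# ML estimates from the earlier slide (order-1 model); rows are rounded,
# so they may not sum to exactly 1 -- random.choices normalizes weights.
p = {
    "start": {"R": 0.000, "W": 1.000, "C": 0.000},
    "R":     {"R": 0.727, "W": 0.045, "C": 0.227},
    "W":     {"R": 0.000, "W": 0.333, "C": 0.667},
    "C":     {"R": 0.231, "W": 0.077, "C": 0.692},
}

def sample(p, n, rng):
    """Draw n symbols from the order-1 Markov model."""
    out, prev = [], "start"
    for _ in range(n):
        symbols, weights = zip(*p[prev].items())
        prev = rng.choices(symbols, weights=weights)[0]
        out.append(prev)
    return out

seq = sample(p, 55, random.Random(0))
```

Because p_{start,W} = 1.000 in this fit, every sampled sequence begins with W, just like the sample shown on the slide.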

  10. Recap: Non-Hidden Markov Models

Use states to encode a limited amount of information about the past. The current state is known. The log likelihood depends only on pair counts:

L(x_1^N) = \sum_{x_{i-1}, x_i} c(x_{i-1}, x_i) log p_{x_{i-1}, x_i}

MLE: count and normalize.

p^{MLE}_{x_{i-1}, x_i} = c(x_{i-1}, x_i) / \sum_x c(x_{i-1}, x)

Easy beezy.

  11. Part II: Discrete Hidden Markov Models

  12. Case Study: Austin Weather 2.0

Ignore rain; one sample every two weeks: C, W, C, C, C, C, W, C, C, C, C, C, C, W, W, C, W, C, W, W, C, W, C, W, C, C, C, C, C, C, C, C, C, C, C, C, C, C, W, C, C, C, W, W, C, C, W, W, C, W, C, W, C, C, C, C, C, C, C, C, C, C, C, C, C, W, C, W, C, C, W, W, C, W, W, W, C, W, C, C, C, C, C, C, C, C, C, C, W, C, W, W, W, C, C, C, C, C, W, C, C, W, C, C, C, C, C, C, C, C, C, C, C, W

Does the system have state/memory?

  13. Another View

C W C C C C W C C C C C C W W C W C W W C W C W C C C C C C C C C C C C C C W C C C W W C C W W C W C W C C C C C C C C C C C C C W C W C C W W C W W W C W C C C C C C C C C C W C W W W C C C C C W C C W C C C C C C C C C C C W C C W W C W C C C W C W C W C C C C C C C C W C C C C C W C C C W C W C W C C W C W C C C C C C C C C C C C C W C C C W W C C C W C W C

Does the system have memory? How many states?

  14. A Hidden Markov Model

For simplicity, no separate start state; always start in the calm state c.

[Transition diagram: states c and w.
 c → c: C/0.6, W/0.2;  c → w: C/0.1, W/0.1;
 w → c: C/0.1, W/0.1;  w → w: C/0.2, W/0.6]

Why is the state "hidden"? What are the conditions for the state to be non-hidden?

  15. Contrast: Non-Hidden Markov Models

[Transition diagram for the non-hidden model fit earlier: states start, R, W, C, with the ML estimates, e.g. start → W: 1.000; R → R: 0.727; R → C: 0.227; W → W: 0.333; W → C: 0.667; C → C: 0.692; C → R: 0.231.]

  16. Back to Coins: Hidden Information

A memory-less example:
Coin 0: p_H = 0.7, p_T = 0.3
Coin 1: p_H = 0.9, p_T = 0.1
Coin 2: p_H = 0.2, p_T = 0.8

Experiment: flip Coin 0. If the outcome is H, flip Coin 1 and record; else flip Coin 2 and record. The Coin 0 flip outcomes are hidden!

What is the probability of the sequence H T T T?

p(H) = 0.9 × 0.7 + 0.2 × 0.3;  p(T) = 0.1 × 0.7 + 0.8 × 0.3

An example with memory: 2 coins, flip each twice. Record the first flip; use the second to determine which coin to flip next. There is no way to know the outcome of the even flips. Order matters now, and we cannot uniquely determine which state sequence produced the observed output sequence.
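In the memory-less experiment the recorded flips are i.i.d., so the sequence probability is a plain product of the marginals. A quick sketch of the arithmetic left implicit on the slide:

```python
# Memory-less coin experiment: Coin 0 picks which of Coin 1 / Coin 2 to
# flip; only that second flip is recorded.
p0_H = 0.7             # Coin 0
p1_H, p2_H = 0.9, 0.2  # Coin 1, Coin 2

# Marginal probabilities of a recorded outcome.
p_H = p1_H * p0_H + p2_H * (1 - p0_H)                # 0.9*0.7 + 0.2*0.3 = 0.69
p_T = (1 - p1_H) * p0_H + (1 - p2_H) * (1 - p0_H)    # 0.1*0.7 + 0.8*0.3 = 0.31

# Recorded flips are independent, so:
p_HTTT = p_H * p_T ** 3
```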

  17. Why Hidden State?

There is no "simple" way to determine the state given the observed: if we see "W", that doesn't mean the windy season has started.

Speech recognition: one HMM per word, where each state represents a different sound in the word. How can we tell from the observed data when the state switches?

Hidden models can model the same stuff as non-hidden ones using far fewer states. Pop quiz: name a hidden model with no memory.

  18. The Problem With Hidden State

For observed x = x_1, ..., x_N, what is the hidden state? The corresponding state sequence is h = h_1, ..., h_{N+1}.

In the non-hidden model, how many h are possible given x? In the hidden model, which h are possible given x?

[Transition diagram: states c and w, as before.]

This makes everything difficult.

  19. Three Key Tasks for HMMs

1. Find the single best path in the HMM given observed x.
   e.g., when did the windy season begin? When did each sound in a word begin?
2. Find the total likelihood P(x) of the observed data.
   e.g., to pick which word assigns the highest likelihood.
3. Find ML estimates for the parameters of the HMM.
   i.e., estimate arc probabilities to match the training data.

These problems are easy to solve for a state-observable Markov model. They are more complicated for an HMM, as we have to consider all possible state sequences.

  20. Where Are We?

1. Computing the Best Path
2. Computing the Likelihood of Observations
3. Estimating Model Parameters
4. Discussion

  21. What We Want to Compute

Given observed, e.g., x = C, W, C, C, W, ...
Find the state sequence h* with the highest likelihood:

h* = arg max_h P(h, x)

Why is this easy for a non-hidden model?
Given a state sequence h, how do we compute P(h, x)? Same as for the non-hidden model: multiply all arc probabilities along the path.

  22. Likelihood of Single State Sequence

Some data: x = W, C, C. A state sequence: h = c, c, c, w.

[Transition diagram: states c and w, as before.]

Likelihood of the path:

P(h, x) = 0.2 × 0.6 × 0.1 = 0.012
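Multiplying arc probabilities along a path is a three-line function. A sketch (mine, not the lecture's) using the two-state weather HMM, with arc probabilities as read off the diagram:

```python
# Two-state HMM (c = calm, w = windy); each key is
# (source state, observation, destination state).
arc = {
    ("c", "C", "c"): 0.6, ("c", "W", "c"): 0.2,
    ("c", "C", "w"): 0.1, ("c", "W", "w"): 0.1,
    ("w", "C", "c"): 0.1, ("w", "W", "c"): 0.1,
    ("w", "C", "w"): 0.2, ("w", "W", "w"): 0.6,
}

def path_likelihood(h, x, arc):
    """P(h, x): product of arc probabilities along the path."""
    prob = 1.0
    for src, obs, dst in zip(h[:-1], x, h[1:]):
        prob *= arc[(src, obs, dst)]
    return prob

lik = path_likelihood(["c", "c", "c", "w"], ["W", "C", "C"], arc)  # 0.012
```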

  23. What We Want to Compute

Given observed, e.g., x = C, W, C, C, W, ...
Find the state sequence h* with the highest likelihood:

h* = arg max_h P(h, x)

Let's start with a simpler problem: find the likelihood of the best state sequence, P_best(x), and worry about the identity of the best sequence later.

P_best(x) = max_h P(h, x)

  24. What's the Problem?

P_best(x) = max_h P(h, x)

For an observation sequence of length N, how many different possible state sequences h are there?

[Transition diagram: states c and w, as before.]

How in blazes can we do a max over an exponential number of state sequences?

  25. Dynamic Programming

Let S_0 be the start state, e.g., the calm season c.
Let 𝒫(S, t) be the set of paths of length t starting at the start state S_0, ending at S, and consistent with the observed x_1, ..., x_t.
Any path p ∈ 𝒫(S, t) must be composed of a path of length t−1 to a predecessor state S', followed by an arc from S' to S labeled with x_t (written S' →(x_t) S). This decomposition is unique.

𝒫(S, t) = ⋃_{S' →(x_t) S} 𝒫(S', t−1) · (S' →(x_t) S)

  26. Dynamic Programming

𝒫(S, t) = ⋃_{S' →(x_t) S} 𝒫(S', t−1) · (S' →(x_t) S)

Let α̂(S, t) = likelihood of the best path of length t starting at the start state S_0 and ending at S. P(p) = probability of path p = product of its arc probabilities.

α̂(S, t) = max_{p ∈ 𝒫(S, t)} P(p)
        = max_{p' ∈ 𝒫(S', t−1), S' →(x_t) S} P(p' · (S' →(x_t) S))
        = max_{S' →(x_t) S} P(S' →(x_t) S) × max_{p' ∈ 𝒫(S', t−1)} P(p')
        = max_{S' →(x_t) S} P(S' →(x_t) S) × α̂(S', t−1)

  27. What Were We Computing Again?

Assume observed x of length T. We want the likelihood of the best path of length T starting at the start state S_0 and ending anywhere:

P_best(x) = max_h P(h, x) = max_S α̂(S, T)

If we can compute α̂(S, T), we are done. If we know α̂(S, t−1) for all S, it is easy to compute α̂(S, t):

α̂(S, t) = max_{S' →(x_t) S} P(S' →(x_t) S) × α̂(S', t−1)

This looks promising ...

  28. The Viterbi Algorithm

α̂(S, 0) = 1 for S = S_0, 0 otherwise.
For t = 1, ..., T:
    For each state S:
        α̂(S, t) = max_{S' →(x_t) S} P(S' →(x_t) S) × α̂(S', t−1)
The end:

P_best(x) = max_h P(h, x) = max_S α̂(S, T)

  29. Viterbi and Shortest Path

Viterbi is equivalent to a shortest-path problem.

[Figure: a small weighted directed graph illustrating shortest path.]

One "state" for each state/time pair (S, t). Iterate through "states" in topological order: all arcs go forward in time, so ordering "states" by time gives a valid ordering.

d(S) = min_{S' → S} { d(S') + distance(S', S) }
α̂(S, t) = max_{S' →(x_t) S} P(S' →(x_t) S) × α̂(S', t−1)

  30. Identifying the Best Path

Wait! We can calculate the likelihood of the best path:

P_best(x) = max_h P(h, x)

What we really wanted: the identity of the best path, i.e., the best state sequence h.
Basic idea: for each S, t, record the identity S_prev(S, t) of the previous state S' in the best path of length t ending at state S. Find the best final state, then backtrace the best previous states until reaching the start state.

  31. The Viterbi Algorithm With Backtrace

α̂(S, 0) = 1 for S = S_0, 0 otherwise.
For t = 1, ..., T:
    For each state S:
        α̂(S, t) = max_{S' →(x_t) S} P(S' →(x_t) S) × α̂(S', t−1)
        S_prev(S, t) = arg max_{S' →(x_t) S} P(S' →(x_t) S) × α̂(S', t−1)
The end:

P_best(x) = max_S α̂(S, T)
S_final(x) = arg max_S α̂(S, T)

  32. The Backtrace

S_cur ← S_final(x)
for t = T, ..., 1:
    S_cur ← S_prev(S_cur, t)

The best state sequence is the list of states traversed, in reverse order.

  33. Illustration with a Trellis

[Figure: a three-state HMM unrolled in time as a trellis. States 1, 2, 3; times 0-4; observations ∅, a, aa, aab, aabb. Arcs carry transition × emission probabilities; recoverable labels include .5×.2, .5×.8, .3×.3, .3×.7, .4×.5, .5×.3, .5×.7, and self-loop probabilities .2 and .1.]

  34. Illustration with a Trellis (cont'd)

Accumulating scores (summing over incoming arcs at each node):

[Figure: the same trellis with accumulated scores. Recoverable values include: state 1: 1, 0.4, .16; state 2: .2, .21 + .04 + .08 = .33, .084 + .066 + .032 = .182; state 3: .02, .033 + .03 = .063, .0495 + .0182 = .0677.]

  35. Viterbi Algorithm

Accumulating scores, now taking the max over incoming arcs instead of the sum:

[Figure: the same trellis with Viterbi scores. Recoverable values include: state 1: 1, 0.4, .16, .016, .0016; state 2: max(.08, .21, .04), max(.084, .042, .032), .0168, .00336; state 3: max(.03, .021), max(.0084, .0315), .0294, .00588.]

  36. Best Path Through the Trellis

[Figure: the same trellis with the best path highlighted. Recoverable scores include: state 1: 1, 0.4, .16, .016, .0016; state 2: 0.2, .21, .084, .0168, .00336; state 3: 0.02, .03, .0315, .0294, .00588.]

  37. Example

Some data: C, C, W, W.

[Transition diagram: states c and w, as before.]

    α̂    0      1      2      3      4
    c    1.000  0.600  0.360  0.072  0.014
    w    0.000  0.100  0.060  0.036  0.022

α̂(c, 2) = max{ P(c →(C) c) × α̂(c, 1), P(w →(C) c) × α̂(w, 1) }
        = max{ 0.6 × 0.6, 0.1 × 0.1 } = 0.36

  38. Example: The Backtrace

    S_prev   1  2  3  4
    c        c  c  c  c
    w        c  c  c  w

h* = arg max_h P(h, x) = (c, c, c, w, w)

The data: C, C, W, W. The calm season switching to the windy season.
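The whole worked example fits in a short function. A sketch (mine, not the lecture's code) of Viterbi with backtrace on the two-state HMM; the arc probabilities are those read off the diagram, and ties are broken toward the first candidate, which reproduces the slide's backpointer table:

```python
# Arcs keyed by (source state, observation, destination state).
arc = {
    ("c", "C", "c"): 0.6, ("c", "W", "c"): 0.2,
    ("c", "C", "w"): 0.1, ("c", "W", "w"): 0.1,
    ("w", "C", "c"): 0.1, ("w", "W", "c"): 0.1,
    ("w", "C", "w"): 0.2, ("w", "W", "w"): 0.6,
}
states = ["c", "w"]

def viterbi(x, arc, states, start="c"):
    """Return alpha-hat table and the best state sequence."""
    alpha = {(S, 0): (1.0 if S == start else 0.0) for S in states}
    prev = {}  # S_prev(S, t) backpointers
    for t, obs in enumerate(x, start=1):
        for S in states:
            best_val, best_prev = -1.0, None
            for Sp in states:  # candidate predecessors S' --obs--> S
                v = arc.get((Sp, obs, S), 0.0) * alpha[(Sp, t - 1)]
                if v > best_val:
                    best_val, best_prev = v, Sp
            alpha[(S, t)], prev[(S, t)] = best_val, best_prev
    # Backtrace from the best final state.
    T = len(x)
    best = max(states, key=lambda S: alpha[(S, T)])
    path = [best]
    for t in range(T, 0, -1):
        path.append(prev[(path[-1], t)])
    return alpha, path[::-1]

alpha, path = viterbi(["C", "C", "W", "W"], arc, states)
# path matches the slide's h* = (c, c, c, w, w)
```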

  39. Recap: The Viterbi Algorithm

Given observed x, there is an exponential number of hidden sequences h. We can find the likelihood and identity of the best path efficiently using dynamic programming. What is the time complexity?

  40. Where Are We?

1. Computing the Best Path
2. Computing the Likelihood of Observations
3. Estimating Model Parameters
4. Discussion

  41. What We Want to Compute

Given observed, e.g., x = C, W, C, C, W, ...
Find the total likelihood P(x). We need to sum the likelihood over all hidden sequences:

P(x) = \sum_h P(h, x)

Given a state sequence h, how do we compute P(h, x)? Multiply all arc probabilities along the path. Why is this sum easy for a non-hidden model?

  42. What's the Problem?

P(x) = \sum_h P(h, x)

For an observation sequence of length N, how many different possible state sequences h are there?

[Transition diagram: states c and w, as before.]

How in blazes can we do a sum over an exponential number of state sequences?

  43. Dynamic Programming

Let 𝒫(S, t) be the set of paths of length t starting at the start state S_0, ending at S, and consistent with the observed x_1, ..., x_t.
Any path p ∈ 𝒫(S, t) must be composed of a path of length t−1 to a predecessor state S', followed by an arc from S' to S labeled with x_t.

𝒫(S, t) = ⋃_{S' →(x_t) S} 𝒫(S', t−1) · (S' →(x_t) S)

  44. Dynamic Programming

𝒫(S, t) = ⋃_{S' →(x_t) S} 𝒫(S', t−1) · (S' →(x_t) S)

Let α(S, t) = sum of the likelihoods of paths of length t starting at the start state S_0 and ending at S.

α(S, t) = \sum_{p ∈ 𝒫(S, t)} P(p)
        = \sum_{p' ∈ 𝒫(S', t−1), S' →(x_t) S} P(p' · (S' →(x_t) S))
        = \sum_{S' →(x_t) S} P(S' →(x_t) S) \sum_{p' ∈ 𝒫(S', t−1)} P(p')
        = \sum_{S' →(x_t) S} P(S' →(x_t) S) × α(S', t−1)

  45. What Were We Computing Again?

Assume observed x of length T. We want the sum of the likelihoods of paths of length T starting at the start state S_0 and ending anywhere:

P(x) = \sum_h P(h, x) = \sum_S α(S, T)

If we can compute α(S, T), we are done. If we know α(S, t−1) for all S, it is easy to compute α(S, t):

α(S, t) = \sum_{S' →(x_t) S} P(S' →(x_t) S) × α(S', t−1)

This looks promising ...

  46. The Forward Algorithm

α(S, 0) = 1 for S = S_0, 0 otherwise.
For t = 1, ..., T:
    For each state S:
        α(S, t) = \sum_{S' →(x_t) S} P(S' →(x_t) S) × α(S', t−1)
The end:

P(x) = \sum_h P(h, x) = \sum_S α(S, T)

  47. Viterbi vs. Forward

The goal:

P_best(x) = max_h P(h, x) = max_S α̂(S, T)
P(x) = \sum_h P(h, x) = \sum_S α(S, T)

The invariant:

α̂(S, t) = max_{S' →(x_t) S} P(S' →(x_t) S) × α̂(S', t−1)
α(S, t) = \sum_{S' →(x_t) S} P(S' →(x_t) S) × α(S', t−1)

Just replace all max's with sums (any semiring will do).

  48. Example

Some data: C, C, W, W.

[Transition diagram: states c and w, as before.]

    α    0      1      2      3      4
    c    1.000  0.600  0.370  0.082  0.025
    w    0.000  0.100  0.080  0.085  0.059

α(c, 2) = P(c →(C) c) × α(c, 1) + P(w →(C) c) × α(w, 1)
        = 0.6 × 0.6 + 0.1 × 0.1 = 0.37
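As the previous slide says, Forward is Viterbi with max replaced by sum. A sketch (mine) on the same two-state HMM, with arc probabilities as read off the diagram:

```python
# Arcs keyed by (source state, observation, destination state).
arc = {
    ("c", "C", "c"): 0.6, ("c", "W", "c"): 0.2,
    ("c", "C", "w"): 0.1, ("c", "W", "w"): 0.1,
    ("w", "C", "c"): 0.1, ("w", "W", "c"): 0.1,
    ("w", "C", "w"): 0.2, ("w", "W", "w"): 0.6,
}
states = ["c", "w"]

def forward(x, arc, states, start="c"):
    """Return the alpha table and the total likelihood P(x)."""
    alpha = {(S, 0): (1.0 if S == start else 0.0) for S in states}
    for t, obs in enumerate(x, start=1):
        for S in states:
            alpha[(S, t)] = sum(arc.get((Sp, obs, S), 0.0) * alpha[(Sp, t - 1)]
                                for Sp in states)
    T = len(x)
    return alpha, sum(alpha[(S, T)] for S in states)

alpha, p_x = forward(["C", "C", "W", "W"], arc, states)
# alpha reproduces the table above; p_x = P(x) is the sum of the last column
```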

  49. Recap: The Forward Algorithm

We can find the total likelihood P(x) of the observed data using an algorithm very similar to the Viterbi algorithm: just replace the max's with sums. Same time complexity.

  50. Where Are We?

1. Computing the Best Path
2. Computing the Likelihood of Observations
3. Estimating Model Parameters
4. Discussion

  51. Training the Parameters of an HMM

Given training data x, estimate the parameters of the model to maximize the likelihood of the training data:

P(x) = \sum_h P(h, x)

  52. What Are The Parameters?

One parameter for each arc:

[Transition diagram: states c and w, as before.]

Identify an arc by its source S, destination S', and label x: p_{S →(x) S'}. The probabilities of the arcs leaving the same state must sum to 1:

\sum_{x, S'} p_{S →(x) S'} = 1 for all S

  53. What Did We Do For Non-Hidden Again?

The likelihood of a single path is the product of its arc probabilities. The log likelihood can be written as:

L(x_1^N) = \sum_{S →(x) S'} c(S →(x) S') log p_{S →(x) S'}

It depends only on the counts c(S →(x) S') of each arc. Each source state corresponds to a multinomial with nonoverlapping parameters. ML estimation for multinomials: count and normalize!

p^{MLE}_{S →(x) S'} = c(S →(x) S') / \sum_{x, S'} c(S →(x) S')

  54. Example: Non-Hidden Estimation

[Transition diagram annotated with counts and ML estimates, as in the earlier example; e.g., start → W: count 1, probability 1.000; R → R: count 16, probability 0.727; C → C: count 18, probability 0.692.]

  55. How Do We Train Hidden Models?

Hmmm, I know this one ...

  56. Review: The EM Algorithm

A general way to train parameters in hidden models to optimize likelihood. Guaranteed to improve the likelihood in each iteration, but only finds a local optimum, so seeding matters.

  57. The EM Algorithm

Initialize parameter values somehow. For each iteration:

Expectation step: compute the posterior (count) of each h.

P̃(h | x) = P(h, x) / \sum_h P(h, x)

Maximization step: update the parameters. Instead of data x with unknown h, pretend we have non-hidden data where the (fractional) count of each (h, x) is P̃(h | x).

  58. Applying EM to HMMs: The E Step

Compute the posterior (count) of each h:

P̃(h | x) = P(h, x) / \sum_h P(h, x)

How do we compute the probability of a single path, P(h, x)? Multiply the arc probabilities along the path. How do we compute the denominator? It is just the total likelihood of the observed data, P(x):

P(x) = \sum_h P(h, x)

This looks vaguely familiar.
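For a very short observation the E step can be done by brute force, which makes the definitions concrete before the efficient version. A sketch (mine, not the lecture's): enumerate every state path h on the two-state HMM, compute P(h, x), and normalize by P(x):

```python
from itertools import product

# Arcs keyed by (source state, observation, destination state).
arc = {
    ("c", "C", "c"): 0.6, ("c", "W", "c"): 0.2,
    ("c", "C", "w"): 0.1, ("c", "W", "w"): 0.1,
    ("w", "C", "c"): 0.1, ("w", "W", "c"): 0.1,
    ("w", "C", "w"): 0.2, ("w", "W", "w"): 0.6,
}
states = ["c", "w"]

def path_likelihood(h, x, arc):
    prob = 1.0
    for src, obs, dst in zip(h[:-1], x, h[1:]):
        prob *= arc.get((src, obs, dst), 0.0)
    return prob

x = ["C", "W"]
# All paths start in c, then visit one state per observation.
paths = [("c",) + h for h in product(states, repeat=len(x))]
joint = {h: path_likelihood(h, x, arc) for h in paths}
p_x = sum(joint.values())                            # denominator P(x)
posterior = {h: p / p_x for h, p in joint.items()}   # P~(h | x)
```

This enumeration is exponential in the sequence length, which is exactly why the next slides replace it with forward and backward probabilities.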

  59. Applying EM to HMMs: The M Step

Non-hidden case: a single path h with count 1. The total count of an arc is its count in h:

c(S →(x) S') = c_h(S →(x) S')

Normalize:

p^{MLE}_{S →(x) S'} = c(S →(x) S') / \sum_{x, S'} c(S →(x) S')

Hidden case: every path h has count P̃(h | x). The total count of an arc is the weighted sum of its count in each h:

c(S →(x) S') = \sum_h P̃(h | x) c_h(S →(x) S')

Normalize as before.

  60. What's the Problem?

We need to sum over an exponential number of h:

c(S →(x) S') = \sum_h P̃(h | x) c_h(S →(x) S')

If only we had an algorithm for doing this type of thing.

  61. The Game Plan

Decompose the sum by time (i.e., position in x): find the count of each arc at each "time" t.

c(S →(x) S') = \sum_{t=1}^T c(S →(x) S', t) = \sum_{t=1}^T \sum_{h ∈ 𝒫(S →(x) S', t)} P̃(h | x)

𝒫(S →(x) S', t) is the set of paths whose arc at time t is S →(x) S'; it is empty if x ≠ x_t. Otherwise, use dynamic programming to compute

c(S →(x_t) S', t) ≡ \sum_{h ∈ 𝒫(S →(x_t) S', t)} P̃(h | x)

  62. Let's Rearrange Some

Recall we can compute P(x) using the Forward algorithm:

P̃(h | x) = P(h, x) / P(x)

Some paraphrasing:

c(S →(x_t) S', t) = \sum_{h ∈ 𝒫(S →(x_t) S', t)} P̃(h | x)
                  = (1 / P(x)) \sum_{h ∈ 𝒫(S →(x_t) S', t)} P(h, x)
                  = (1 / P(x)) \sum_{p ∈ 𝒫(S →(x_t) S', t)} P(p)

  63. What We Need

Goal: sum over all paths p ∈ 𝒫(S →(x_t) S', t), i.e., paths whose arc at time t is S →(x_t) S'.
Let 𝒫_i(S, t) be the set of (initial) paths of length t starting at the start state S_0 and ending at S, consistent with the observed x_1, ..., x_t.
Let 𝒫_f(S, t) be the set of (final) paths of length T − t starting at state S and ending at any state, consistent with the observed x_{t+1}, ..., x_T.
Then:

𝒫(S →(x_t) S', t) = 𝒫_i(S, t−1) · (S →(x_t) S') · 𝒫_f(S', t)

  64. Translating Path Sets to Probabilities

𝒫(S →(x_t) S', t) = 𝒫_i(S, t−1) · (S →(x_t) S') · 𝒫_f(S', t)

Let α(S, t) = sum of the likelihoods of paths of length t starting at the start state S_0 and ending at S.
Let β(S, t) = sum of the likelihoods of paths of length T − t starting at state S and ending at any state.

c(S →(x_t) S', t) = (1 / P(x)) \sum_{p ∈ 𝒫(S →(x_t) S', t)} P(p)
                  = (1 / P(x)) \sum_{p_i ∈ 𝒫_i(S, t−1), p_f ∈ 𝒫_f(S', t)} P(p_i · (S →(x_t) S') · p_f)
                  = (1 / P(x)) × p_{S →(x_t) S'} \sum_{p_i ∈ 𝒫_i(S, t−1)} P(p_i) \sum_{p_f ∈ 𝒫_f(S', t)} P(p_f)
                  = (1 / P(x)) × p_{S →(x_t) S'} × α(S, t−1) × β(S', t)

  65. Mini-Recap

To do ML estimation in the M step, we need the count of each arc, c(S →(x) S'). Decompose the count of an arc by time:

c(S →(x) S') = \sum_{t=1}^T c(S →(x) S', t)

We can compute the count at each time efficiently if we have the forward probabilities α(S, t) and the backward probabilities β(S, t):

c(S →(x_t) S', t) = (1 / P(x)) × p_{S →(x_t) S'} × α(S, t−1) × β(S', t)

  66. The Forward-Backward Algorithm (1 iteration)

Apply the Forward algorithm to compute α(S, t) and P(x). Apply the Backward algorithm to compute β(S, t).
For each arc S →(x_t) S' and time t, compute the posterior count of the arc at time t if x = x_t:

c(S →(x_t) S', t) = (1 / P(x)) × p_{S →(x_t) S'} × α(S, t−1) × β(S', t)

Sum to get the total counts for each arc:

c(S →(x) S') = \sum_{t=1}^T c(S →(x) S', t)

For each arc, find the ML estimate of the parameter:

p^{MLE}_{S →(x) S'} = c(S →(x) S') / \sum_{x, S'} c(S →(x) S')
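The steps above can be sketched end to end on the two-state HMM. This is my sketch of one Forward-Backward iteration, not the lecture's code; the arc probabilities are those read off the diagram:

```python
# Arcs keyed by (source state, observation, destination state).
arc = {
    ("c", "C", "c"): 0.6, ("c", "W", "c"): 0.2,
    ("c", "C", "w"): 0.1, ("c", "W", "w"): 0.1,
    ("w", "C", "c"): 0.1, ("w", "W", "c"): 0.1,
    ("w", "C", "w"): 0.2, ("w", "W", "w"): 0.6,
}
states = ["c", "w"]

def forward(x, arc, states, start="c"):
    alpha = {(S, 0): (1.0 if S == start else 0.0) for S in states}
    for t, obs in enumerate(x, start=1):
        for S in states:
            alpha[(S, t)] = sum(arc.get((Sp, obs, S), 0.0) * alpha[(Sp, t - 1)]
                                for Sp in states)
    return alpha

def backward(x, arc, states):
    T = len(x)
    beta = {(S, T): 1.0 for S in states}
    for t in range(T - 1, -1, -1):
        obs = x[t]  # observation x_{t+1} in the slides' 1-based indexing
        for S in states:
            beta[(S, t)] = sum(arc.get((S, obs, Sn), 0.0) * beta[(Sn, t + 1)]
                               for Sn in states)
    return beta

x = ["C", "C", "W", "W"]
T = len(x)
alpha, beta = forward(x, arc, states), backward(x, arc, states)
p_x = sum(alpha[(S, T)] for S in states)

# E step: posterior arc counts c(S --x_t--> S', t), summed over t.
count = {a: 0.0 for a in arc}
for t, obs in enumerate(x, start=1):
    for (S, lab, Sn), p in arc.items():
        if lab == obs:
            count[(S, lab, Sn)] += p * alpha[(S, t - 1)] * beta[(Sn, t)] / p_x

# M step: normalize the counts over the arcs leaving each state.
new_arc = {}
for (S, lab, Sn), c in count.items():
    total = sum(c2 for (S2, _, _), c2 in count.items() if S2 == S)
    new_arc[(S, lab, Sn)] = c / total
```

Two useful sanity checks: the posterior counts over all arcs sum to T (one arc is traversed per time step), and the re-estimated probabilities leaving each state sum to 1.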

  67. The Forward Algorithm

α(S, 0) = 1 for S = S_0, 0 otherwise.
For t = 1, ..., T:
    For each state S:
        α(S, t) = \sum_{S' →(x_t) S} p_{S' →(x_t) S} × α(S', t−1)
The end:

P(x) = \sum_h P(h, x) = \sum_S α(S, T)
