Lecture 4: Hidden Markov Models — Michael Picheny, Bhuvana Ramabhadran, Stanley F. Chen, Markus Nussbaum-Thom (PowerPoint PPT Presentation)



slide-1
SLIDE 1

Lecture 4

Hidden Markov Models Michael Picheny, Bhuvana Ramabhadran, Stanley F. Chen, Markus Nussbaum-Thom

Watson Group IBM T.J. Watson Research Center Yorktown Heights, New York, USA {picheny,bhuvana,stanchen,nussbaum}@us.ibm.com

10 February 2016

slide-2
SLIDE 2

Administrivia

Lab 1 is due Friday at 6pm! Use Piazza for questions/discussion.
Thanks to everyone for answering questions!
Late policy: can be late on one lab up to two days for free. After that, penalized 0.5 for every two days (4 days max).
Lab 2 posted on web site by Friday.

2 / 157

slide-3
SLIDE 3

Administrivia

Feedback: clear (11); mostly clear (6). Pace: OK (7), slow (1).
Muddiest points: EM (5); hidden variables (2); GMM formulae/params (2); MLE (1).
Comments (2+ votes): good jokes/enjoyable (3); lots of good examples (2).
Quote: "Great class, loved it. Left me speech-less."

3 / 157

slide-4
SLIDE 4

Components of a speech recognition system

W∗ = arg max_W P(X | W, Θ) P(W | Θ)

[Diagram: audio → feature extraction → search, using the acoustic model P(X | W, Θ) and the language model P(W | Θ) → words. Today’s subject: the acoustic model.]

4 / 157

slide-5
SLIDE 5

Recap: Probabilistic Modeling for ASR

Old paradigm: DTW.

w∗ = arg min_{w∈vocab} distance(A′test, A′w)

New paradigm: Probabilities.

w∗ = arg max_{w∈vocab} P(A′test | w)

P(A′|w) is (relative) frequency with which w . . . Is realized as feature vector A′. The more “accurate” P(A′|w) is . . . The more accurate classification is.

5 / 157

slide-6
SLIDE 6

Recap: Gaussian Mixture Models

Probability distribution over . . . Individual (e.g., 40d) feature vectors.

P(x) = Σ_j p_j · 1/((2π)^{d/2} |Σj|^{1/2}) · exp(−(1/2)(x − µj)^T Σj^{−1} (x − µj))

Can model arbitrary (non-Gaussian) data pretty well. Can use EM algorithm to do ML estimation of parameters of the Gaussian distributions. Finds local optimum in likelihood, iteratively.

6 / 157
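The mixture density above can be evaluated directly; here is a minimal 1-D sketch (the weights, means, and variances are made-up examples, not from the lecture):

```python
import math

# 1-D Gaussian mixture density P(x) = sum_j p_j * N(x; mu_j, var_j).
def gmm_density(x, weights, means, variances):
    total = 0.0
    for p, mu, var in zip(weights, means, variances):
        norm = 1.0 / math.sqrt(2.0 * math.pi * var)   # (2π)^(d/2)|Σ|^(1/2) for d = 1
        total += p * norm * math.exp(-0.5 * (x - mu) ** 2 / var)
    return total

# Single standard Gaussian: density at the mean is 1/sqrt(2π) ≈ 0.3989.
print(gmm_density(0.0, [1.0], [0.0], [1.0]))
```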

slide-7
SLIDE 7

Example: Modeling Acoustic Data With GMM

7 / 157

slide-8
SLIDE 8

What We Have And What We Want

What we have: P(x). GMM is distribution over indiv feature vectors x. What we want: P(A′ = x1, . . . , xT). Distribution over sequences of feature vectors. Build separate model P(A′|w) for each word w. There you go.

w∗ = arg max_{w∈vocab} P(A′test | w)

8 / 157

slide-9
SLIDE 9

Today’s Lecture

Introduce a general probabilistic framework for speech recognition Explain how Hidden Markov Models fit in this overall framework Review some of the concepts of ML estimation in the context of an HMM framework Describe how the three basic HMM operations are computed

9 / 157

slide-10
SLIDE 10

Probabilistic Model for Speech Recognition

w∗ = arg max_{w∈vocab} P(w | x, θ) = arg max_{w∈vocab} P(x | w, θ) P(w | θ) / P(x) = arg max_{w∈vocab} P(x | w, θ) P(w | θ)

w∗: best sequence of words. x: sequence of acoustic vectors. θ: model parameters.

10 / 157

slide-11
SLIDE 11

Sequence Modeling

Hidden Markov Models . . . To model sequences of feature vectors (compute probabilities) i.e., how feature vectors evolve over time. Probabilistic counterpart to DTW. How things fit together. GMM’s: for each sound, what are likely feature vectors? e.g., why the sound “b” is different from “d”. HMM’s: what “sounds” are likely to follow each other? e.g., why rat is different from tar.

11 / 157

slide-12
SLIDE 12

Acoustic Modeling

Assume a word is made up from a sequence of speech sounds Cat: K AE T Dog: D AO G Fish: F IH SH When a speech sound is uttered, a sequence of feature vectors is produced according to a GMM associated with each sound However, the distributions of speech sounds overlap! So you cannot identify which speech sound produced the feature vectors If you did, you could just use the techniques we discussed last week Solution is the HMM

12 / 157

slide-13
SLIDE 13

Simplification: Discrete Sequences

Goal: continuous data. e.g., P(x1, . . . , xT) for x ∈ R^40. Most of today: discrete data. P(x1, . . . , xT) for x ∈ finite alphabet. Discrete HMM’s vs. continuous HMM’s.

13 / 157

slide-14
SLIDE 14

Vector Quantization

Before continuous HMM’s and GMM’s (∼1990) . . . People used discrete HMM’s and VQ (1980’s). Convert multidimensional feature vector . . . To discrete symbol {1, . . . , V} using codebook. Each symbol has representative feature vector µj. Convert each feature vector . . . To symbol j with nearest µj.

14 / 157

slide-15
SLIDE 15

The Basic Idea

How to pick the µj?

[Figure: five codebook vectors µ1, . . . , µ5 in feature space.]

15 / 157

slide-16
SLIDE 16

The Basic Idea

[Figure: codebook vectors µ1, . . . , µ5 with feature vectors x1, . . . , x6.]

x1, x2, x3, x4, x5, x6, . . . ⇒ 4, 2, 2, 5, 5, 5, . . .

16 / 157
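The quantization step can be sketched in a few lines; the codebook and feature vectors below are made-up 2-D stand-ins for the µj and xi in the figure:

```python
# Vector quantization: map each feature vector to the index of the
# nearest codeword (squared Euclidean distance).
def quantize(vectors, codebook):
    def nearest(v):
        dists = [sum((a - b) ** 2 for a, b in zip(v, mu)) for mu in codebook]
        return dists.index(min(dists)) + 1        # symbols numbered 1..V
    return [nearest(v) for v in vectors]

codebook = [(0.0, 0.0), (1.0, 1.0), (4.0, 4.0)]   # hypothetical µ1, µ2, µ3
feats = [(0.2, -0.1), (3.8, 4.1), (0.9, 1.2)]
print(quantize(feats, codebook))                  # [1, 3, 2]
```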

slide-17
SLIDE 17

Recap

Need probabilistic sequence modeling for ASR. Let’s start with discrete sequences. Simpler than continuous. What was used first in ASR. Let’s go!

17 / 157

slide-18
SLIDE 18

Part I Nonhidden Sequence Models

18 / 157

slide-19
SLIDE 19

Case Study: Coin Flipping

Let’s flip (unfair) coin 10 times: x1, . . . x10 ∈ {T, H}, e.g., T, T, H, H, H, H, T, H, H, H Design P(x1, . . . xT) matching actual frequencies . . . Of sequences (x1, . . . xT). What should form of distribution be? How to estimate its parameters?

19 / 157

slide-20
SLIDE 20

Where Are We?

1. Models Without State
2. Models With State

20 / 157

slide-21
SLIDE 21

Independence

Coin flips are independent! Outcome of previous flips doesn’t influence . . . Outcome of future flips (given parameters).

P(x1, . . . x10) = ∏_{i=1}^{10} P(xi)

System has no memory or state. Example of dependence: draws from deck of cards. e.g., if last card was A♠, next card isn’t. State: all cards seen.

21 / 157

slide-22
SLIDE 22

Modeling a Single Coin Flip P(xi)

Multinomial distribution. One parameter for each outcome: pH, pT ≥ 0 . . . Modeling frequency of that outcome, i.e., P(x) = px. Parameters must sum to 1: pH + pT = 1. Where have we seen this before?

22 / 157

slide-23
SLIDE 23

Computing the Likelihood of Data

Some parameters: pH = 0.6, pT = 0.4. Some data: T, T, H, H, H, H, T, H, H, H. The likelihood:

P(x1, . . . x10) = ∏_{i=1}^{10} P(xi) = ∏_{i=1}^{10} pxi = pT × pT × pH × pH × pH × · · · = 0.6^7 × 0.4^3 = 0.00179

How many such sequences are possible? What is the likelihood of T, H, H, T, H, T, H, H, H, H?

23 / 157
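The likelihood computation above, checked in code:

```python
# Likelihood of an i.i.d. coin-flip sequence: product of per-flip probs.
p = {'H': 0.6, 'T': 0.4}
data = ['T', 'T', 'H', 'H', 'H', 'H', 'T', 'H', 'H', 'H']

likelihood = 1.0
for x in data:
    likelihood *= p[x]

print(round(likelihood, 5))   # 0.6^7 * 0.4^3 ≈ 0.00179
```

Note that the slide’s second sequence, T, H, H, T, H, T, H, H, H, H, also has 7 heads and 3 tails, so it gets exactly the same likelihood: order doesn’t matter.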

slide-24
SLIDE 24

Computing the Likelihood of Data

More generally:

P(x1, . . . xN) = ∏_{i=1}^{N} pxi = ∏_x p_x^{c(x)}

log P(x1, . . . xN) = Σ_x c(x) log px

where c(x) is count of outcome x. Likelihood only depends on counts of outcomes . . . Not on order of outcomes.

24 / 157

slide-25
SLIDE 25

Estimating Parameters

Choose parameters that maximize likelihood of data . . . Because ML estimation is awesome! If H heads and T tails in N = H + T flips, log likelihood is:

L(x1^N) = log (pH)^H (pT)^T = H log pH + T log(1 − pH)

Taking derivative w.r.t. pH and setting to 0:

H/pH − T/(1 − pH) = 0
H − H × pH = T × pH
pH = H/(H + T) = H/N
pT = 1 − pH = T/N

25 / 157

slide-26
SLIDE 26

Maximum Likelihood Estimation

MLE of multinomial parameters is an intuitive estimate! Just relative frequencies: pH = H/N, pT = T/N. Count and normalize, baby! The MLE is the probability that maximizes the likelihood of the sequence.

[Figure: likelihood P(x1^N) as a function of pH, peaking at pH = H/N.]

26 / 157

slide-27
SLIDE 27

Example: Maximum Likelihood Estimation

Training data: 50 samples. T, T, H, H, H, H, T, H, H, H, T, H, H, T, H, H, T, T, T, T, H, T, T, H, H, H, H, H, T, T, H, T, H, T, H, H, T, H, T, H, H, H, T, H, H, T, H, H, H, T. Counts: 30 heads, 20 tails.

pMLE_H = 30/50 = 0.6    pMLE_T = 20/50 = 0.4

Sample from MLE distribution: H, H, T, T, H, H, H, T, T, T, H, H, H, H, T, H, T, H, T, H, T, T, H, T, H, H, T, H, T, T, H, T, H, T, H, H, T, H, H, H, H, T, H, T, H, T, T, H, H, H

27 / 157
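Count-and-normalize on the 50 samples above, in code:

```python
from collections import Counter

# The 50 training flips from the slide (30 H, 20 T).
data = ('T T H H H H T H H H T H H T H H T T T T H T T H H H H H T T '
        'H T H T H H T H T H H H T H H T H H H T').split()

counts = Counter(data)
n = len(data)
p_mle = {x: c / n for x, c in counts.items()}   # count and normalize
print(p_mle['H'], p_mle['T'])                   # 0.6 0.4
```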

slide-28
SLIDE 28

Recap: Multinomials, No State

Log likelihood just depends on counts.

L(x1^N) = Σ_x c(x) log px

MLE: count and normalize.

pMLE_x = c(x)/N

Easy peasy.

28 / 157

slide-29
SLIDE 29

Where Are We?

1. Models Without State
2. Models With State

29 / 157

slide-30
SLIDE 30

Case Study: Two coins

Consider 2 coins:
Coin 1: pH = 0.9, pT = 0.1
Coin 2: pH = 0.2, pT = 0.8
Experiment: Flip Coin 1. If outcome is H, flip Coin 1; else flip Coin 2.
H H T T (0.0648)    H T H T (0.0018)
Sequence has memory! Order matters. Order matters for speech too (rat vs. tar).

30 / 157

slide-31
SLIDE 31

A Picture: State space representation

[State diagram: states 1 and 2; arcs 1→1: H/0.9, 1→2: T/0.1, 2→2: T/0.8, 2→1: H/0.2.]

The state sequence can be uniquely determined from the observations given the initial state. The output probability is the product of the transition probabilities.

Example:
Obs: H T T T
St:  1 1 2 2
P: 0.9 × 0.1 × 0.8 × 0.8

31 / 157
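Since the state sequence is recoverable from the observations, the likelihood is a single product; a sketch:

```python
# State-observable two-coin model: the previous outcome determines the
# next coin (H -> Coin 1, T -> Coin 2); start with Coin 1.
emit = {1: {'H': 0.9, 'T': 0.1},
        2: {'H': 0.2, 'T': 0.8}}

def likelihood(obs):
    state, p = 1, 1.0
    for x in obs:
        p *= emit[state][x]
        state = 1 if x == 'H' else 2   # outcome picks the next coin
    return p

print(round(likelihood('HTTT'), 4))    # 0.9 * 0.1 * 0.8 * 0.8 = 0.0576
print(round(likelihood('HHTT'), 4))    # 0.0648, as on the previous slide
```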

slide-32
SLIDE 32

Case Study: Austin Weather

From National Climate Data Center. R = rainy = precipitation > 0.00 in. W = windy = not rainy; avg. wind ≥ 10 mph. C = calm = not rainy and not windy. Some data: W, W, C, C, W, W, C, R, C, R, W, C, C, C, R, R, R, R, C, C, R, R, R, R, R, R, R, R, C, C, C, C, C, R, R, R, R, R, R, R, C, C, C, W, C, C, C, C, C, C, R, C, C, C, C Does system have state/memory? Does yesterday’s outcome influence today’s?

32 / 157

slide-33
SLIDE 33

State and the Markov Property

How much state to remember? How much past information to encode in state? Independent events/no memory: remember nothing.

P(x1, . . . , xN) ?= ∏_{i=1}^{N} P(xi)

General case: remember everything (always holds).

P(x1, . . . , xN) = ∏_{i=1}^{N} P(xi | x1, . . . , xi−1)

Something in between?

33 / 157

slide-34
SLIDE 34

The Markov Property, Order n

Holds if:

P(x1, . . . , xN) = ∏_{i=1}^{N} P(xi | x1, . . . , xi−1) = ∏_{i=1}^{N} P(xi | xi−n, xi−n+1, · · · , xi−1)

e.g., if know weather for past n days . . . Knowing more doesn’t help predict future weather. i.e., if data satisfies this property . . . No loss from just remembering past n items!

34 / 157

slide-35
SLIDE 35

A Non-Hidden Markov Model, Order 1

Let’s assume: knowing yesterday’s weather is enough.

P(x1, . . . , xN) = ∏_{i=1}^{N} P(xi | xi−1)

Before (no state): single multinomial P(xi). After (with state): separate multinomial P(xi | xi−1) . . . For each xi−1 ∈ {rainy, windy, calm}. Model P(xi | xi−1) with parameter pxi−1,xi. What about P(x1 | x0)? Assume x0 = start, a special value. One more multinomial: P(xi | start). Constraint: Σ_{xi} pxi−1,xi = 1 for all xi−1.

35 / 157

slide-36
SLIDE 36

A Picture

After observe x, go to state labeled x. Is state non-hidden?

[State diagram: states R, W, C, and start; each arc from state xi−1 to state xi is labeled xi/p_{xi−1,xi}.]

36 / 157

slide-37
SLIDE 37

Computing the Likelihood of Data

Some data: x = W, W, C, C, W, W, C, R, C, R.

[State diagram: states R, W, C, start; arc labels include W/0.5, R/0.6, C/0.4, R/0.7, W/0.2, C/0.1, C/0.5, R/0.2, C/0.1, W/0.3, R/0.1, W/0.3.]

The likelihood:

P(x1, . . . , x10) = ∏_{i=1}^{N} P(xi | xi−1) = ∏_{i=1}^{N} pxi−1,xi = pstart,W × pW,W × pW,C × . . . = 0.3 × 0.3 × 0.1 × 0.1 × . . . = 1.06 × 10^{−6}

37 / 157

slide-38
SLIDE 38

Computing the Likelihood of Data

More generally:

P(x1, . . . , xN) = ∏_{i=1}^{N} P(xi | xi−1) = ∏_{i=1}^{N} pxi−1,xi = ∏_{xi−1,xi} p_{xi−1,xi}^{c(xi−1,xi)}

log P(x1, . . . xN) = Σ_{xi−1,xi} c(xi−1, xi) log pxi−1,xi

x0 = start. c(xi−1, xi) is count of xi following xi−1. Likelihood only depends on counts of pairs (bigrams).

38 / 157

slide-39
SLIDE 39

Maximum Likelihood Estimation

Choose pxi−1,xi to optimize log likelihood:

L(x1^N) = Σ_{xi−1,xi} c(xi−1, xi) log pxi−1,xi
        = Σ_{xi} c(start, xi) log pstart,xi + Σ_{xi} c(R, xi) log pR,xi + Σ_{xi} c(W, xi) log pW,xi + Σ_{xi} c(C, xi) log pC,xi

Each sum is log likelihood of multinomial. Each multinomial has nonoverlapping parameter set. Can optimize each sum independently!

pMLE_{xi−1,xi} = c(xi−1, xi) / Σ_x c(xi−1, x)

39 / 157

slide-40
SLIDE 40

Example: Maximum Likelihood Estimation

Some raw data: W, W, C, C, W, W, C, R, C, R, W, C, C, C, R, R, R, R, C, C, R, R, R, R, R, R, R, R, C, C, C, C, C, R, R, R, R, R, R, R, C, C, C, W, C, C, C, C, C, C, R, C, C, C, C. Counts and ML estimates:

c(·, ·)     R    W    C   sum
start       0    1    0     1
R          16    1    5    22
W           0    2    4     6
C           6    2   18    26

pMLE        R      W      C
start    0.000  1.000  0.000
R        0.727  0.045  0.227
W        0.000  0.333  0.667
C        0.231  0.077  0.692

pMLE_{xi−1,xi} = c(xi−1, xi) / Σ_x c(xi−1, x)

pMLE_{R,C} = 5 / (16 + 1 + 5) = 0.227

40 / 157
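Bigram count-and-normalize on the raw data above reproduces the table; a sketch:

```python
from collections import defaultdict

# The 55 weather samples from the slide.
data = ('W W C C W W C R C R W C C C R R R R C C R R R R R R R R C C '
        'C C C R R R R R R R C C C W C C C C C C R C C C C').split()

counts = defaultdict(lambda: defaultdict(int))
prev = 'start'
for x in data:
    counts[prev][x] += 1       # bigram count c(x_{i-1}, x_i)
    prev = x

# Normalize each row: pMLE(x | s) = c(s, x) / sum_x' c(s, x').
p_mle = {s: {x: c / sum(row.values()) for x, c in row.items()}
         for s, row in counts.items()}
print(round(p_mle['R']['C'], 3))   # 5 / 22 = 0.227
print(round(p_mle['C']['C'], 3))   # 18 / 26 = 0.692
```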

slide-41
SLIDE 41

Example: Maximum Likelihood Estimation

[State diagrams for the weather model: left, arc counts (e.g., R→R/16, R→C/5, C→C/18, start→W/1) with state totals 22, 6, 26, 1; right, the corresponding ML-estimated arc probabilities (e.g., R→R/0.727, C→C/0.692, start→W/1.000).]

41 / 157

slide-42
SLIDE 42

Example: Orders

Some raw data: W, W, C, C, W, W, C, R, C, R, W, C, C, C, R, R, R, R, C, C, R, R, R, R, R, R, R, R, C, C, C, C, C, R, R, R, R, R, R, R, C, C, C, W, C, C, C, C, C, C, R, C, C, C, C Data sampled from MLE Markov model, order 1: W, W, C, R, R, R, R, R, R, R, R, R, R, R, R, R, R, R, R, C, C, C, C, C, C, W, W, C, C, C, R, R, R, C, C, W, C, C, C, C, C, R, R, R, R, R, C, R, R, C, R, R, R, R, R Data sampled from MLE Markov model, order 0: C, R, C, R, R, R, R, C, R, R, C, C, R, C, C, R, R, R, R, C, C, C, R, C, R, W, R, C, C, C, W, C, R, C, C, W, C, C, C, C, R, R, C, C, C, R, C, R, R, C, R, C, R, W, R

42 / 157

slide-43
SLIDE 43

Recap: Non-Hidden Markov Models

Use states to encode limited amount of information . . . About the past. Current state is known. Log likelihood just depends on pair counts.

L(x1^N) = Σ_{xi−1,xi} c(xi−1, xi) log pxi−1,xi

MLE: count and normalize.

pMLE_{xi−1,xi} = c(xi−1, xi) / Σ_x c(xi−1, x)

Easy beezy.

43 / 157

slide-44
SLIDE 44

Part II Discrete Hidden Markov Models

44 / 157

slide-45
SLIDE 45

Case Study: Austin Weather 2.0

Ignore rain; one sample every two weeks: C, W, C, C, C, C, W, C, C, C, C, C, C, W, W, C, W, C, W, W, C, W, C, W, C, C, C, C, C, C, C, C, C, C, C, C, C, C, W, C, C, C, W, W, C, C, W, W, C, W, C, W, C, C, C, C, C, C, C, C, C, C, C, C, C, W, C, W, C, C, W, W, C, W, W, W, C, W, C, C, C, C, C, C, C, C, C, C, W, C, W, W, W, C, C, C, C, C, W, C, C, W, C, C, C, C, C, C, C, C, C, C, C, W Does system have state/memory?

45 / 157

slide-46
SLIDE 46

Another View

C W C C C C W C C C C C C W W C W C W W C W C W C C C C C C C C C C C C C C W C C C W W C C W W C W C W C C C C C C C C C C C C C W C W C C W W C W W W C W C C C C C C C C C C W C W W W C C C C C W C C W C C C C C C C C C C C W C C W W C W C C C W C W C W C C C C C C C C W C C C C C W C C C W C W C W C C W C W C C C C C C C C C C C C C W C C C W W C C C W C W C

Does system have memory? How many states?

46 / 157

slide-47
SLIDE 47

A Hidden Markov Model

For simplicity, no separate start state. Always start in calm state c.

[HMM diagram: states c and w; arcs c→c: C/0.6, W/0.2; c→w: C/0.1, W/0.1; w→c: C/0.1, W/0.1; w→w: C/0.2, W/0.6.]

Why is state “hidden”? What are conditions for state to be non-hidden?

47 / 157

slide-48
SLIDE 48

Contrast: Non-Hidden Markov Models

[Diagrams: a memoryless (order-0) model with arcs C/0.4, W/0.6, contrasted with the order-1 weather model from before — states R, W, C, start with ML arc probabilities (e.g., R→R/0.727, C→C/0.692).]

48 / 157

slide-49
SLIDE 49

Back to Coins: Hidden Information

Memory-less example: Coin 0: pH = 0.7, pT = 0.3. Coin 1: pH = 0.9, pT = 0.1. Coin 2: pH = 0.2, pT = 0.8. Experiment: flip Coin 0. If outcome is H, flip Coin 1 and record; else flip Coin 2 and record. Coin 0 flip outcomes are hidden!

What is the probability of the sequence H T T T?
p(H) = 0.9 × 0.7 + 0.2 × 0.3 = 0.69; p(T) = 0.1 × 0.7 + 0.8 × 0.3 = 0.31

An example with memory: 2 coins, flip each twice. Record first flip, use second to determine which coin to flip. No way to know the outcome of even flips. Order matters now and . . . Cannot uniquely determine which state sequence produced the observed output sequence.

49 / 157
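The hidden Coin 0 marginalizes out of the per-flip probabilities; a quick check:

```python
# Hidden-coin example: Coin 0 (hidden) picks which of Coin 1 / Coin 2
# is flipped and recorded; the observed symbol's probability sums over
# the hidden outcome.
p0 = {'H': 0.7, 'T': 0.3}                    # hidden Coin 0
emit = {'H': {'H': 0.9, 'T': 0.1},           # Coin 1, used when Coin 0 = H
        'T': {'H': 0.2, 'T': 0.8}}           # Coin 2, used when Coin 0 = T

def p_obs(x):
    return sum(p0[h] * emit[h][x] for h in ('H', 'T'))

print(round(p_obs('H'), 2), round(p_obs('T'), 2))   # 0.69 0.31
```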

slide-50
SLIDE 50

Why Hidden State?

No “simple” way to determine state given observed. If see “W”, doesn’t mean windy season started. Speech recognition: one HMM per word. Each state represents different sound in word. How to tell from observed when state switches? Hidden models can model same stuff as non-hidden . . . Using far fewer states. Pop quiz: name a hidden model with no memory.

50 / 157

slide-51
SLIDE 51

The Problem With Hidden State

For observed x = x1, . . . , xN, what is hidden state h? Corresponding state sequence h = h1, . . . , hN+1. In non-hidden model, how many h possible given x? In hidden model, what h are possible given x?

[HMM diagram: states c and w; arcs c→c: C/0.6, W/0.2; c→w: C/0.1, W/0.1; w→c: C/0.1, W/0.1; w→w: C/0.2, W/0.6.]

This makes everything difficult.

51 / 157

slide-52
SLIDE 52

Three Key Tasks for HMM’s

1. Find single best path in HMM given observed x. e.g., when did windy season begin? e.g., when did each sound in word begin?
2. Find total likelihood P(x) of observed. e.g., to pick which word assigns highest likelihood.
3. Find ML estimates for parameters of HMM. i.e., estimate arc probabilities to match training data.

These problems are easy to solve for a state-observable Markov model. More complicated for an HMM, as we have to consider all possible state sequences.

52 / 157

slide-53
SLIDE 53

Where Are We?

1. Computing the Best Path
2. Computing the Likelihood of Observations
3. Estimating Model Parameters
4. Discussion

53 / 157

slide-54
SLIDE 54

What We Want to Compute

Given observed, e.g., x = C, W, C, C, W, . . . Find state sequence h∗ with highest likelihood.

h∗ = arg max_h P(h, x)

Why is this easy for non-hidden model? Given state sequence h, how to compute P(h, x)? Same as for non-hidden model. Multiply all arc probabilities along path.

54 / 157

slide-55
SLIDE 55

Likelihood of Single State Sequence

Some data: x = W, C, C. A state sequence: h = c, c, c, w.

[HMM diagram: states c and w; arcs c→c: C/0.6, W/0.2; c→w: C/0.1, W/0.1; w→c: C/0.1, W/0.1; w→w: C/0.2, W/0.6.]

Likelihood of path: P(h, x) = 0.2 × 0.6 × 0.1 = 0.012

55 / 157

slide-56
SLIDE 56

What We Want to Compute

Given observed, e.g., x = C, W, C, C, W, . . . Find state sequence h∗ with highest likelihood.

h∗ = arg max_h P(h, x)

Let’s start with simpler problem: Find likelihood of best state sequence Pbest(x). Worry about identity of best sequence later.

Pbest(x) = max_h P(h, x)

56 / 157

slide-57
SLIDE 57

What’s the Problem?

Pbest(x) = max_h P(h, x)

For observation sequence of length N . . . How many different possible state sequences h?

[HMM diagram: states c and w; arcs c→c: C/0.6, W/0.2; c→w: C/0.1, W/0.1; w→c: C/0.1, W/0.1; w→w: C/0.2, W/0.6.]

How in blazes can we do max . . . Over exponential number of state sequences?

57 / 157

slide-58
SLIDE 58

Dynamic Programming

Let S0 be start state; e.g., the calm season c. Let P(S, t) be set of paths of length t . . . Starting at start state S0 and ending at S . . . Consistent with observed x1, . . . , xt. Any path p ∈ P(S, t) must be composed of . . . Path of length t − 1 to predecessor state S′ → S . . . Followed by arc from S′ to S labeled with xt. This decomposition is unique.

P(S, t) = ⋃_{S′ −xt→ S} P(S′, t − 1) · (S′ −xt→ S)

58 / 157

slide-59
SLIDE 59

Dynamic Programming

P(S, t) = ⋃_{S′ −xt→ S} P(S′, t − 1) · (S′ −xt→ S)

Let α̂(S, t) = likelihood of best path of length t . . . Starting at start state S0 and ending at S. P(p) = prob of path p = product of arc probs.

α̂(S, t) = max_{p∈P(S,t)} P(p)
        = max_{p′∈P(S′,t−1), S′ −xt→ S} P(p′ · (S′ −xt→ S))
        = max_{S′ −xt→ S} P(S′ −xt→ S) × max_{p′∈P(S′,t−1)} P(p′)
        = max_{S′ −xt→ S} P(S′ −xt→ S) × α̂(S′, t − 1)

59 / 157

slide-60
SLIDE 60

What Were We Computing Again?

Assume observed x of length T. Want likelihood of best path of length T . . . Starting at start state S0 and ending anywhere.

Pbest(x) = max_h P(h, x) = max_S α̂(S, T)

If can compute α̂(S, T), we are done. If know α̂(S, t − 1) for all S, easy to compute α̂(S, t):

α̂(S, t) = max_{S′ −xt→ S} P(S′ −xt→ S) × α̂(S′, t − 1)

This looks promising . . .

60 / 157

slide-61
SLIDE 61

The Viterbi Algorithm

α̂(S, 0) = 1 for S = S0, 0 otherwise.
For t = 1, . . . , T:
  For each state S:
    α̂(S, t) = max_{S′ −xt→ S} P(S′ −xt→ S) × α̂(S′, t − 1)
The end.

Pbest(x) = max_h P(h, x) = max_S α̂(S, T)

61 / 157

slide-62
SLIDE 62

Viterbi and Shortest Path

Equivalent to shortest path problem.

[Figure: small weighted directed graph with nodes 1–4 and edge weights 19, 1, 3, 3, 10, 1, 1.]

One “state” for each state/time pair (S, t). Iterate through “states” in topological order: All arcs go forward in time. If order “states” by time, valid ordering.

d(S) = min_{S′→S} {d(S′) + distance(S′, S)}

α̂(S, t) = max_{S′ −xt→ S} P(S′ −xt→ S) × α̂(S′, t − 1)

62 / 157

slide-63
SLIDE 63

Identifying the Best Path

Wait! We can calc likelihood of best path:

Pbest(x) = max_h P(h, x)

What we really wanted: identity of best path. i.e., the best state sequence h. Basic idea: for each S, t . . . Record identity Sprev(S, t) of previous state S′ . . . In best path of length t ending at state S. Find best final state. Backtrace best previous states until reach start state.

63 / 157

slide-64
SLIDE 64

The Viterbi Algorithm With Backtrace

α̂(S, 0) = 1 for S = S0, 0 otherwise.
For t = 1, . . . , T:
  For each state S:
    α̂(S, t) = max_{S′ −xt→ S} P(S′ −xt→ S) × α̂(S′, t − 1)
    Sprev(S, t) = arg max_{S′ −xt→ S} P(S′ −xt→ S) × α̂(S′, t − 1)
The end.

Pbest(x) = max_S α̂(S, T)
Sfinal(x) = arg max_S α̂(S, T)

64 / 157

slide-65
SLIDE 65

The Backtrace

Scur ← Sfinal(x)
for t in T, . . . , 1:
  Scur ← Sprev(Scur, t)
The best state sequence is . . . List of states traversed in reverse order.

65 / 157

slide-66
SLIDE 66

Illustration with a trellis

State transition diagram in time

[Trellis: a 3-state HMM unrolled over times 0–4; columns labeled Obs: f, a, aa, aab, aabb; arc weights are products of transition and emission probabilities, e.g., .5×.8, .5×.2, .3×.7, .3×.3, .4×.5, .5×.3, .5×.7, plus arc weights .2 and .1.]

66 / 157

slide-67
SLIDE 67

Illustration with a trellis (contd.)

Accumulating scores

[Same trellis, with accumulated forward scores: 1, .2, .02 (time 0); 0.4, .16 (time 1); then .21+.04+.08 = .33, .033+.03 = .063, .084+.066+.032 = .182, .0495+.0182 = .0677.]

67 / 157

slide-68
SLIDE 68

Viterbi algorithm

Accumulating scores

[Same trellis, with Viterbi (max) scores: 1, 0.4, .16; max(.08, .21, .04) = .21; max(.03, .021) = .03; max(.0084, .0315) = .0315; max(.084, .042, .032) = .084; other entries .016, .0294, .0016, .00336, .00588, .0168.]

68 / 157

slide-69
SLIDE 69

Best path through the trellis

Accumulating scores

[Same trellis, with the best path highlighted through the Viterbi scores; values along and around it include 1, 0.4, 0.2, 0.02, .21, .16, .084, .0315, .0168.]

69 / 157

slide-70
SLIDE 70

Example

Some data: C, C, W, W.

[HMM diagram: states c and w; arcs c→c: C/0.6, W/0.2; c→w: C/0.1, W/0.1; w→c: C/0.1, W/0.1; w→w: C/0.2, W/0.6.]

α̂     t=0    t=1    t=2    t=3    t=4
c    1.000  0.600  0.360  0.072  0.014
w    0.000  0.100  0.060  0.036  0.022

α̂(c, 2) = max{P(c −C→ c) × α̂(c, 1), P(w −C→ c) × α̂(w, 1)} = max{0.6 × 0.6, 0.1 × 0.1} = 0.36

70 / 157

slide-71
SLIDE 71

Example: The Backtrace

Sprev   t=1  t=2  t=3  t=4
c        c    c    c    c
w        c    c    c    w

h∗ = arg max_h P(h, x) = (c, c, c, w, w)

The data: C, C, W, W. Calm season switching to windy season.

71 / 157
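The Viterbi recursion with backtrace from the preceding slides, run on this example; a sketch (ties are broken toward state c, matching the Sprev table):

```python
# Viterbi with backtrace on the two-state weather HMM (start state c).
arcs = {('c', 'c', 'C'): 0.6, ('c', 'c', 'W'): 0.2,
        ('c', 'w', 'C'): 0.1, ('c', 'w', 'W'): 0.1,
        ('w', 'c', 'C'): 0.1, ('w', 'c', 'W'): 0.1,
        ('w', 'w', 'C'): 0.2, ('w', 'w', 'W'): 0.6}
states = ['c', 'w']

def viterbi(obs, start='c'):
    alpha = {s: 1.0 if s == start else 0.0 for s in states}
    back = []                                  # back[t][S]: best predecessor
    for x in obs:
        new_alpha, prev = {}, {}
        for s in states:
            best_p, best_sp = -1.0, None
            for sp in states:                  # strict > breaks ties toward c
                p = alpha[sp] * arcs.get((sp, s, x), 0.0)
                if p > best_p:
                    best_p, best_sp = p, sp
            new_alpha[s], prev[s] = best_p, best_sp
        alpha = new_alpha
        back.append(prev)
    p_best = max(alpha.values())
    s = max(states, key=lambda q: alpha[q])    # best final state
    path = [s]
    for prev in reversed(back):                # backtrace
        s = prev[s]
        path.append(s)
    return p_best, path[::-1]

p_best, h_best = viterbi(['C', 'C', 'W', 'W'])
print(round(p_best, 4), h_best)               # 0.0216 ['c', 'c', 'c', 'w', 'w']
```

The final scores (0.0144 for c, 0.0216 for w) match the rounded α̂ values 0.014 and 0.022 in the table on slide 70.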

slide-72
SLIDE 72

Recap: The Viterbi Algorithm

Given observed x, . . . Exponential number of hidden sequences h. Can find likelihood and identity of best path . . . Efficiently using dynamic programming. What is time complexity?

72 / 157

slide-73
SLIDE 73

Where Are We?

1. Computing the Best Path
2. Computing the Likelihood of Observations
3. Estimating Model Parameters
4. Discussion

73 / 157

slide-74
SLIDE 74

What We Want to Compute

Given observed, e.g., x = C, W, C, C, W, . . . Find total likelihood P(x). Need to sum likelihood over all hidden sequences:

P(x) = Σ_h P(h, x)

Given state sequence h, how to compute P(h, x)? Multiply all arc probabilities along path. Why is this sum easy for non-hidden model?

74 / 157

slide-75
SLIDE 75

What’s the Problem?

P(x) = Σ_h P(h, x)

For observation sequence of length N . . . How many different possible state sequences h?

[HMM diagram: states c and w; arcs c→c: C/0.6, W/0.2; c→w: C/0.1, W/0.1; w→c: C/0.1, W/0.1; w→w: C/0.2, W/0.6.]

How in blazes can we do sum . . . Over exponential number of state sequences?

75 / 157

slide-76
SLIDE 76

Dynamic Programming

Let P(S, t) be set of paths of length t . . . Starting at start state S0 and ending at S . . . Consistent with observed x1, . . . , xt. Any path p ∈ P(S, t) must be composed of . . . Path of length t − 1 to predecessor state S′ → S . . . Followed by arc from S′ to S labeled with xt.

P(S, t) = ⋃_{S′ −xt→ S} P(S′, t − 1) · (S′ −xt→ S)

76 / 157

slide-77
SLIDE 77

Dynamic Programming

P(S, t) = ⋃_{S′ −xt→ S} P(S′, t − 1) · (S′ −xt→ S)

Let α(S, t) = sum of likelihoods of paths of length t . . . Starting at start state S0 and ending at S.

α(S, t) = Σ_{p∈P(S,t)} P(p)
        = Σ_{p′∈P(S′,t−1), S′ −xt→ S} P(p′ · (S′ −xt→ S))
        = Σ_{S′ −xt→ S} P(S′ −xt→ S) Σ_{p′∈P(S′,t−1)} P(p′)
        = Σ_{S′ −xt→ S} P(S′ −xt→ S) × α(S′, t − 1)

77 / 157

slide-78
SLIDE 78

What Were We Computing Again?

Assume observed x of length T. Want sum of likelihoods of paths of length T . . . Starting at start state S0 and ending anywhere.

P(x) = Σ_h P(h, x) = Σ_S α(S, T)

If can compute α(S, T), we are done. If know α(S, t − 1) for all S, easy to compute α(S, t):

α(S, t) = Σ_{S′ −xt→ S} P(S′ −xt→ S) × α(S′, t − 1)

This looks promising . . .

78 / 157

slide-79
SLIDE 79

The Forward Algorithm

α(S, 0) = 1 for S = S0, 0 otherwise.
For t = 1, . . . , T:
  For each state S:
    α(S, t) = Σ_{S′ −xt→ S} P(S′ −xt→ S) × α(S′, t − 1)
The end.

P(x) = Σ_h P(h, x) = Σ_S α(S, T)

79 / 157

slide-80
SLIDE 80

Viterbi vs. Forward

The goal: Pbest(x) = max

h

P(h, x) = max

S

ˆ α(S, T) P(x) =

  • h

P(h, x) =

  • S

α(S, T) The invariant. ˆ α(S, t) = max

S′ xt →S

P(S′

xt

→ S) × ˆ α(S′, t − 1) α(S, t) =

  • S′ xt

→S

P(S′

xt

→ S) × α(S′, t − 1) Just replace all max’s with sums (any semiring will do).

80 / 157

slide-81
SLIDE 81

Example

Some data: C, C, W, W.

❝ ✇

❈✴✵✳✶ ❲✴✵✳✶ ❈✴✵✳✶ ❲✴✵✳✶ ❈✴✵✳✻ ❲✴✵✳✷ ❈✴✵✳✷ ❲✴✵✳✻

α 1 2 3 4 c 1.000 0.600 0.370 0.082 0.025 w 0.000 0.100 0.080 0.085 0.059 α(c, 2) = P(c

C

→ c) × α(c, 1) + P(w

C

→ c) × α(w, 1) = 0.6 × 0.6 + 0.1 × 0.1 = 0.37

81 / 157
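The forward pass on this example, in code (same two-state model; just the Viterbi recursion with max replaced by sum):

```python
# Forward algorithm on the two-state weather HMM (start state c).
arcs = {('c', 'c', 'C'): 0.6, ('c', 'c', 'W'): 0.2,
        ('c', 'w', 'C'): 0.1, ('c', 'w', 'W'): 0.1,
        ('w', 'c', 'C'): 0.1, ('w', 'c', 'W'): 0.1,
        ('w', 'w', 'C'): 0.2, ('w', 'w', 'W'): 0.6}
states = ['c', 'w']

def forward(obs, start='c'):
    alpha = {s: 1.0 if s == start else 0.0 for s in states}
    for x in obs:
        alpha = {s: sum(alpha[sp] * arcs.get((sp, s, x), 0.0)
                        for sp in states)
                 for s in states}
    return sum(alpha.values())                # P(x) = sum over final states

print(round(forward(['C', 'C', 'W', 'W']), 4))   # 0.025 + 0.059 ≈ 0.0841
```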

slide-82
SLIDE 82

Recap: The Forward Algorithm

Can find total likelihood P(x) of observed . . . Using very similar algorithm to Viterbi algorithm. Just replace max’s with sums. Same time complexity.

82 / 157

slide-83
SLIDE 83

Where Are We?

1. Computing the Best Path
2. Computing the Likelihood of Observations
3. Estimating Model Parameters
4. Discussion

83 / 157

slide-84
SLIDE 84

Training the Parameters of an HMM

Given training data x . . . Estimate parameters of model . . . To maximize likelihood of training data.

P(x) = Σ_h P(h, x)

84 / 157

slide-85
SLIDE 85

What Are The Parameters?

One parameter for each arc:

[HMM diagram: states c and w; arcs c→c: C/0.6, W/0.2; c→w: C/0.1, W/0.1; w→c: C/0.1, W/0.1; w→w: C/0.2, W/0.6.]

Identify arc by source S, destination S′, and label x: p_{S −x→ S′}. Probs of arcs leaving same state must sum to 1:

Σ_{x,S′} p_{S −x→ S′} = 1   for all S

85 / 157

slide-86
SLIDE 86

What Did We Do For Non-Hidden Again?

Likelihood of single path: product of arc probabilities. Log likelihood can be written as:

L(x1^N) = Σ_{S −x→ S′} c(S −x→ S′) log p_{S −x→ S′}

Just depends on counts c(S −x→ S′) of each arc. Each source state corresponds to multinomial . . . With nonoverlapping parameters. ML estimation for multinomials: count and normalize!

pMLE_{S −x→ S′} = c(S −x→ S′) / Σ_{x,S′} c(S −x→ S′)

86 / 157

slide-87
SLIDE 87

Example: Non-Hidden Estimation

[State diagrams for the non-hidden weather model: arc counts with state totals 22, 6, 26, 1, and the resulting ML arc probabilities, as on slide 41.]

87 / 157

slide-88
SLIDE 88

How Do We Train Hidden Models?

Hmmm, I know this one . . .

88 / 157

slide-89
SLIDE 89

Review: The EM Algorithm

General way to train parameters in hidden models . . . To optimize likelihood. Guaranteed to improve likelihood in each iteration. Only finds local optimum. Seeding matters.

89 / 157

slide-90
SLIDE 90

The EM Algorithm

Initialize parameter values somehow. For each iteration . . . Expectation step: compute posterior (count) of each h.

P̃(h|x) = P(h, x) / Σ_h P(h, x)

Maximization step: update parameters. Instead of data x with unknown h, pretend . . . Non-hidden data where . . . (Fractional) count of each (h, x) is P̃(h|x).

90 / 157

slide-91
SLIDE 91

Applying EM to HMM’s: The E Step

Compute posterior (count) of each h.

P̃(h|x) = P(h, x) / Σ_h P(h, x)

How to compute prob of single path P(h, x)? Multiply arc probabilities along path. How to compute denominator? This is just total likelihood of observed P(x).

P(x) = Σ_h P(h, x)

This looks vaguely familiar.

91 / 157

slide-92
SLIDE 92

Applying EM to HMM’s: The M Step

Non-hidden case: single path h with count 1. Total count of arc is count of arc in h:

c(S −x→ S′) = c_h(S −x→ S′)

Normalize:

pMLE_{S −x→ S′} = c(S −x→ S′) / Σ_{x,S′} c(S −x→ S′)

Hidden case: every path h has count P̃(h|x). Total count of arc is weighted sum . . . Of count of arc in each h.

c(S −x→ S′) = Σ_h P̃(h|x) c_h(S −x→ S′)

Normalize as before.

92 / 157

slide-93
SLIDE 93

What’s the Problem?

Need to sum over exponential number of h:

c(S −x→ S′) = Σ_h P̃(h|x) c_h(S −x→ S′)

If only we had an algorithm for doing this type of thing.

93 / 157

slide-94
SLIDE 94

The Game Plan

Decompose sum by time (i.e., position in x). Find count of each arc at each “time” t.

c(S −x→ S′) = Σ_{t=1}^{T} c(S −x→ S′, t) = Σ_{t=1}^{T} Σ_{h∈P(S −x→ S′, t)} P̃(h|x)

P(S −x→ S′, t) are paths where arc at time t is S −x→ S′. P(S −x→ S′, t) is empty if x ≠ xt. Otherwise, use dynamic programming to compute

c(S −xt→ S′, t) ≡ Σ_{h∈P(S −xt→ S′, t)} P̃(h|x)

94 / 157

slide-95
SLIDE 95

Let’s Rearrange Some

Recall we can compute P(x) using Forward algorithm:

P̃(h|x) = P(h, x) / P(x)

Some paraphrasing:

c(S −xt→ S′, t) = Σ_{h∈P(S −xt→ S′, t)} P̃(h|x)
               = (1/P(x)) Σ_{h∈P(S −xt→ S′, t)} P(h, x)
               = (1/P(x)) Σ_{p∈P(S −xt→ S′, t)} P(p)

95 / 157

slide-96
SLIDE 96

What We Need

Goal: sum over all paths p ∈ P(S −xt→ S′, t). Arc at time t is S −xt→ S′. Let Pi(S, t) be set of (initial) paths of length t . . . Starting at start state S0 and ending at S . . . Consistent with observed x1, . . . , xt. Let Pf(S, t) be set of (final) paths of length T − t . . . Starting at state S and ending at any state . . . Consistent with observed xt+1, . . . , xT. Then:

P(S −xt→ S′, t) = Pi(S, t − 1) · (S −xt→ S′) · Pf(S′, t)

96 / 157

slide-97
SLIDE 97

Translating Path Sets to Probabilities

P(S −xt→ S′, t) = Pi(S, t − 1) · (S −xt→ S′) · Pf(S′, t)

Let α(S, t) = sum of likelihoods of paths of length t . . . Starting at start state S0 and ending at S. Let β(S, t) = sum of likelihoods of paths of length T − t . . . Starting at state S and ending at any state.

c(S −xt→ S′, t) = (1/P(x)) Σ_{p∈P(S −xt→ S′, t)} P(p)
              = (1/P(x)) Σ_{pi∈Pi(S,t−1), pf∈Pf(S′,t)} P(pi · (S −xt→ S′) · pf)
              = (1/P(x)) × p_{S −xt→ S′} Σ_{pi∈Pi(S,t−1)} P(pi) Σ_{pf∈Pf(S′,t)} P(pf)
              = (1/P(x)) × p_{S −xt→ S′} × α(S, t − 1) × β(S′, t)

97 / 157

slide-98
SLIDE 98

Mini-Recap

To do ML estimation in M step . . . Need count of each arc: c(S −x→ S′). Decompose count of arc by time:

c(S −x→ S′) = Σ_{t=1}^{T} c(S −x→ S′, t)

Can compute count at time efficiently . . . If have forward probabilities α(S, t) . . . And backward probabilities β(S, t).

c(S −xt→ S′, t) = (1/P(x)) × p_{S −xt→ S′} × α(S, t − 1) × β(S′, t)

98 / 157

slide-99
SLIDE 99

The Forward-Backward Algorithm (1 iter)

Apply Forward algorithm to compute α(S, t), P(x). Apply Backward algorithm to compute β(S, t). For each arc S −xt→ S′ and time t . . . Compute posterior count of arc at time t if x = xt:

c(S −xt→ S′, t) = (1/P(x)) × p_{S −xt→ S′} × α(S, t − 1) × β(S′, t)

Sum to get total counts for each arc:

c(S −x→ S′) = Σ_{t=1}^{T} c(S −x→ S′, t)

For each arc, find ML estimate of parameter:

pMLE_{S −x→ S′} = c(S −x→ S′) / Σ_{x,S′} c(S −x→ S′)

99 / 157

slide-100
SLIDE 100

The Forward Algorithm

α(S, 0) = 1 for S = S0, 0 otherwise.
For t = 1, . . . , T, for each state S:

α(S, t) = Σ_{S′ →xt S} p_{S′ →xt S} × α(S′, t − 1)

The end:

P(x) = Σ_h P(h, x) = Σ_S α(S, T)

100 / 157

slide-101
SLIDE 101

The Backward Algorithm

β(S, T) = 1 for all S.
For t = T − 1, . . . , 0, for each state S:

β(S, t) = Σ_{S →xt+1 S′} p_{S →xt+1 S′} × β(S′, t + 1)

Pop quiz: how to compute P(x) from the β’s?

101 / 157

slide-102
SLIDE 102

Example: The Forward Pass

Some data: C, C, W, W.

[HMM diagram: states c, w; arcs c →C c/0.6, c →W c/0.2, c →C w/0.1, c →W w/0.1, w →C w/0.2, w →W w/0.6, w →C c/0.1, w →W c/0.1.]

α    t=0     t=1     t=2     t=3     t=4
c    1.000   0.600   0.370   0.082   0.025
w    0.000   0.100   0.080   0.085   0.059

α(c, 2) = p_{c →C c} × α(c, 1) + p_{w →C c} × α(w, 1)
        = 0.6 × 0.6 + 0.1 × 0.1 = 0.37

102 / 157
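The forward pass above fits in a few lines of Python. This is an illustrative sketch (not from the lecture); the arc table transcribes the two-state weather HMM of this example, keyed by (source, output, destination).

```python
# Forward algorithm for a discrete HMM: alpha[t][S] sums the likelihoods
# of all length-t paths from the start state to S consistent with x[0:t].

# Arcs (source, output, destination) -> probability, from the diagram.
arcs = {
    ("c", "C", "c"): 0.6, ("c", "W", "c"): 0.2,
    ("c", "C", "w"): 0.1, ("c", "W", "w"): 0.1,
    ("w", "C", "w"): 0.2, ("w", "W", "w"): 0.6,
    ("w", "C", "c"): 0.1, ("w", "W", "c"): 0.1,
}
states = ["c", "w"]

def forward(x, start="c"):
    """Return (alpha table, P(x)) for observation sequence x."""
    alpha = [{S: (1.0 if S == start else 0.0) for S in states}]
    for t, out in enumerate(x, start=1):
        alpha.append({
            S: sum(arcs.get((Sp, out, S), 0.0) * alpha[t - 1][Sp]
                   for Sp in states)
            for S in states
        })
    return alpha, sum(alpha[-1].values())

alpha, px = forward(["C", "C", "W", "W"])
# Matches the table above: alpha[2]["c"] = 0.37, P(x) ≈ 0.084.
```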

slide-103
SLIDE 103

The Backward Pass

The data: C, C, W, W.

[Same HMM diagram as on the previous slide.]

β    t=0     t=1     t=2     t=3     t=4
c    0.084   0.123   0.130   0.300   1.000
w    0.033   0.103   0.450   0.700   1.000

β(c, 2) = p_{c →W c} × β(c, 3) + p_{c →W w} × β(w, 3)
        = 0.2 × 0.3 + 0.1 × 0.7 = 0.13

103 / 157
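The backward pass can be sketched the same way (again an illustrative implementation, not from the lecture). Note that β at the start state and t = 0 reproduces P(x), which is one answer to the pop quiz a few slides back.

```python
# Backward algorithm: beta[t][S] sums the likelihoods of all paths that
# start at S at time t and account for the remaining outputs x[t:].
arcs = {
    ("c", "C", "c"): 0.6, ("c", "W", "c"): 0.2,
    ("c", "C", "w"): 0.1, ("c", "W", "w"): 0.1,
    ("w", "C", "w"): 0.2, ("w", "W", "w"): 0.6,
    ("w", "C", "c"): 0.1, ("w", "W", "c"): 0.1,
}
states = ["c", "w"]

def backward(x):
    T = len(x)
    beta = [None] * T + [{S: 1.0 for S in states}]
    for t in range(T - 1, -1, -1):
        beta[t] = {
            S: sum(arcs.get((S, x[t], Sp), 0.0) * beta[t + 1][Sp]
                   for Sp in states)
            for S in states
        }
    return beta

beta = backward(["C", "C", "W", "W"])
# beta[2]["c"] = 0.13 as on the slide; beta[0]["c"] ≈ 0.084 = P(x).
```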

slide-104
SLIDE 104

Computing Arc Posteriors

α, β    t=0     t=1     t=2     t=3     t=4
α: c    1.000   0.600   0.370   0.082   0.025
α: w    0.000   0.100   0.080   0.085   0.059
β: c    0.084   0.123   0.130   0.300   1.000
β: w    0.033   0.103   0.450   0.700   1.000

c(S →x S′, t)   p_{S →x S′}   t=1     t=2     t=3     t=4
c →C c          0.6           0.878   0.556   0.000   0.000
c →W c          0.2           0.000   0.000   0.264   0.195
c →C w          0.1           0.122   0.321   0.000   0.000
c →W w          0.1           0.000   0.000   0.308   0.098
w →C w          0.2           0.000   0.107   0.000   0.000
w →W w          0.6           0.000   0.000   0.400   0.606
w →C c          0.1           0.000   0.015   0.000   0.000
w →W c          0.1           0.000   0.000   0.029   0.101

104 / 157

slide-105
SLIDE 105

Computing Arc Posteriors

α, β    t=0     t=1     t=2     t=3     t=4
α: c    1.000   0.600   0.370   0.082   0.025
α: w    0.000   0.100   0.080   0.085   0.059
β: c    0.084   0.123   0.130   0.300   1.000
β: w    0.033   0.103   0.450   0.700   1.000

c(S →x S′, t)   p_{S →x S′}   t=1     t=2     t=3     t=4
c →C c          0.6           0.878   0.556   0.000   0.000
c →W c          0.2           0.000   0.000   0.264   0.195
· · ·           · · ·         · · ·   · · ·   · · ·   · · ·

c(c →C c, 2) = (1/P(x)) × p_{c →C c} × α(c, 1) × β(c, 2)
             = (1/0.084) × 0.6 × 0.600 × 0.130 = 0.556

105 / 157
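The arc posterior above is easy to check numerically. This sketch (not from the lecture) plugs the α and β tables from the preceding slides straight into the formula; the unrounded P(x) ≈ 0.0841 is used so the result matches the table.

```python
# Posterior count of an arc at a time:
# c(S -x-> S', t) = (1/P(x)) * p(S -x-> S') * alpha(S, t-1) * beta(S', t).
alpha = {  # alpha[t][state], from the forward-pass slide
    0: {"c": 1.000, "w": 0.000}, 1: {"c": 0.600, "w": 0.100},
    2: {"c": 0.370, "w": 0.080}, 3: {"c": 0.082, "w": 0.085},
    4: {"c": 0.025, "w": 0.059},
}
beta = {  # beta[t][state], from the backward-pass slide
    0: {"c": 0.084, "w": 0.033}, 1: {"c": 0.123, "w": 0.103},
    2: {"c": 0.130, "w": 0.450}, 3: {"c": 0.300, "w": 0.700},
    4: {"c": 1.000, "w": 1.000},
}
px = 0.0841  # unrounded P(x) from the forward pass

def arc_posterior(S, Sp, p_arc, t):
    """c(S -xt-> S', t) for the arc with probability p_arc."""
    return (1.0 / px) * p_arc * alpha[t - 1][S] * beta[t][Sp]

# c(c -C-> c, 2) = (1/0.0841) * 0.6 * 0.600 * 0.130 ≈ 0.556, as in the table.
```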

slide-106
SLIDE 106

Summing Arc Counts and Reestimation

arc     t=1     t=2     t=3     t=4     c(S →x S′)   pMLE(S →x S′)
c →C c  0.878   0.556   0.000   0.000   1.434        0.523
c →W c  0.000   0.000   0.264   0.195   0.459        0.167
c →C w  0.122   0.321   0.000   0.000   0.444        0.162
c →W w  0.000   0.000   0.308   0.098   0.405        0.148
w →C w  0.000   0.107   0.000   0.000   0.107        0.085
w →W w  0.000   0.000   0.400   0.606   1.006        0.800
w →C c  0.000   0.015   0.000   0.000   0.015        0.012
w →W c  0.000   0.000   0.029   0.101   0.130        0.103

Σ_{x,S′} c(c →x S′) = 2.742        Σ_{x,S′} c(w →x S′) = 1.258

106 / 157

slide-107
SLIDE 107

Summing Arc Counts and Reestimation

arc     t=1     t=2     t=3     t=4     c(S →x S′)   pMLE(S →x S′)
c →C c  0.878   0.556   0.000   0.000   1.434        0.523
c →W c  0.000   0.000   0.264   0.195   0.459        0.167
· · ·   · · ·   · · ·   · · ·   · · ·   · · ·        · · ·

Σ_{x,S′} c(c →x S′) = 2.742        Σ_{x,S′} c(w →x S′) = 1.258

c(c →C c) = Σ_{t=1..T} c(c →C c, t) = 0.878 + 0.556 + 0.000 + 0.000 = 1.434

pMLE(c →C c) = c(c →C c) / Σ_{x,S′} c(c →x S′) = 1.434 / 2.742 = 0.523

107 / 157
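The M-step bookkeeping can be sketched as follows (illustrative, not from the lecture): sum the per-time posterior counts from the table into total arc counts, then normalize over the arcs leaving each state. Totals match the slide up to rounding of the per-time counts.

```python
# Per-time posterior counts c(S -x-> S', t), transcribed from the table:
counts = {
    ("c", "C", "c"): [0.878, 0.556, 0.000, 0.000],
    ("c", "W", "c"): [0.000, 0.000, 0.264, 0.195],
    ("c", "C", "w"): [0.122, 0.321, 0.000, 0.000],
    ("c", "W", "w"): [0.000, 0.000, 0.308, 0.098],
    ("w", "C", "w"): [0.000, 0.107, 0.000, 0.000],
    ("w", "W", "w"): [0.000, 0.000, 0.400, 0.606],
    ("w", "C", "c"): [0.000, 0.015, 0.000, 0.000],
    ("w", "W", "c"): [0.000, 0.000, 0.029, 0.101],
}
# Total count of each arc: sum over time.
total = {arc: sum(per_t) for arc, per_t in counts.items()}
# Normalizer: total count of arcs leaving each state.
norm = {}
for (S, x, Sp), c in total.items():
    norm[S] = norm.get(S, 0.0) + c
# ML reestimate for each arc.
p_mle = {arc: c / norm[arc[0]] for arc, c in total.items()}
# total[("c","C","c")] ≈ 1.434, norm["c"] ≈ 2.742, p_mle ≈ 0.523.
```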

slide-108
SLIDE 108

Slide for Quiet Contemplation

108 / 157

slide-109
SLIDE 109

Another Example

Same initial HMM. Training data: instead of one sequence, many. Each sequence is 26 samples ⇔ 1 year.

C W C C C C W C C C C C C W W C W C W W C W C W C C C C C C C C C C C C C C W C C C W W C C W W C W C W C C C C C C C C C C C C C W C W C C W W C W W W C W C C C C C C C C C C W C W W W C C C C C W C C W C C C C C C C C C C C W C C W W C W C C C W C W C W C C C C C C C C W C C C C C W C C C W C W C W C C W C W C C C C C C C C C C C C C W C C C W W C C C W C W C

109 / 157

slide-110
SLIDE 110

Before and After

[Before: states c, w; arcs c →C c/0.6, c →W c/0.2, c →C w/0.1, c →W w/0.1, w →C w/0.2, w →W w/0.6, w →C c/0.1, w →W c/0.1.]

[After: arcs c →C c/0.86, c →W c/0.01, c →C w/0.13, c →W w/0.00, w →C w/0.62, w →W w/0.38, w →C c/0.00, w →W c/0.00.]

110 / 157

slide-111
SLIDE 111

Another Starting Point

[Before: arcs c →C c/0.03, c →W c/0.07, c →C w/0.44, c →W w/0.46, w →C w/0.04, w →W w/0.06, w →C c/0.42, w →W c/0.48.]

[After: arcs c →C c/0.09, c →W c/0.00, c →C w/0.91, c →W w/0.00, w →C w/0.07, w →W w/0.20, w →C c/0.44, w →W c/0.30.]

111 / 157

slide-112
SLIDE 112

Recap: The Forward-Backward Algorithm

Also called Baum-Welch algorithm. Instance of EM algorithm. Uses dynamic programming to efficiently sum over . . . Exponential number of hidden state sequences. Don’t explicitly compute posterior of every h. Compute posteriors of counts needed in M step. What is time complexity? Finds local optimum for parameters in likelihood. Ending point depends on starting point.

112 / 157

slide-113
SLIDE 113

Recap

Given observed data, e.g., x = a, a, b, b, find the total likelihood P(x). Need to sum the likelihood over all hidden sequences:

P(x) = Σ_h P(h, x)

The obvious way is to enumerate all state sequences that produce x. That computation is exponential in the length of the sequence.

113 / 157

slide-114
SLIDE 114

Examples

Enumerate all possible ways of producing observation a starting from state 1.

[Figure: path tree for the three-state model below. Emitting steps (transition × output): 1 →a 1 (0.5 × 0.8 = 0.4); 1 →a 2 (0.3 × 0.7 = 0.21); null 1 → 2 (0.2) then 2 →a 2 (0.4 × 0.5), giving 0.04; null 1 → 2 then 2 →a 3 (0.5 × 0.3), giving 0.03; plus null extensions such as 0.4 × 0.2 = 0.08 into state 2, and 0.21 × 0.1 = 0.021, 0.04 × 0.1 = 0.004, 0.08 × 0.1 = 0.008 into state 3.]

[Model: transitions from state 1: →1 0.5, →2 0.3, →2 (null) 0.2; from state 2: →2 0.4, →3 0.5, →3 (null) 0.1. Output probabilities (a, b): 1→1 (0.8, 0.2); 1→2 (0.7, 0.3); 2→2 (0.5, 0.5); 2→3 (0.3, 0.7).]

114 / 157

slide-115
SLIDE 115

Examples (contd.)

Enumerate ways of producing observation aa for all paths from state 2 after seeing the first observation a

[Figure: the three state-2 nodes after the first a, with scores 0.21, 0.04, and 0.08, each branch again into states 2 and 3 for the second observation.]
115 / 157

slide-116
SLIDE 116

Examples (contd.)

Save some computation using the Markov property by combining paths:

[Figure: the three state-2 nodes merge into a single node with score 0.21 + 0.04 + 0.08 = 0.33, which then extends via 2 →a 2 (0.4 × 0.5), 2 →a 3 (0.5 × 0.3), and the null arc 2 → 3 (0.1).]

116 / 157

slide-117
SLIDE 117

Examples (contd.)

State transition diagram where each state transition is represented exactly once

[Figure: trellis over times 0–4 (observations ∅, a, aa, aab, aabb) for states 1, 2, 3, with each state transition represented exactly once per time step: emitting arcs 1→1 (.5×.8 on a, .5×.2 on b), 1→2 (.3×.7 on a, .3×.3 on b), 2→2 (.4×.5), 2→3 (.5×.3 on a, .5×.7 on b), and null arcs 1→2 (.2), 2→3 (.1).]

117 / 157

slide-118
SLIDE 118

Examples (contd.)

Now let’s accumulate the scores (α).

[Figure: the same trellis with accumulated α scores. Time 0: state 1 = 1, state 2 = .2, state 3 = .02. Time 1 (a): state 1 = .4, state 2 = .21 + .04 + .08 = .33, state 3 = .033 + .03 = .063. Time 2 (aa): state 1 = .16, state 2 = .084 + .066 + .032 = .182, state 3 = .0495 + .0182 = .0677.]

118 / 157

slide-119
SLIDE 119

Parameter Estimation: Examples (contd.)

Estimate the parameters (transition and output probabilities) such that the probability of the output sequence is maximized:

  • 1. Start with some initial values for the parameters.
  • 2. Compute the probability of each path.
  • 3. Assign fractional path counts to each transition along the paths, proportional to these probabilities.
  • 4. Reestimate the parameter values.
  • 5. Iterate till convergence.

119 / 157

slide-120
SLIDE 120

Examples (contd.)

Consider this model; estimate the transition and output probabilities for the sequence a, b, a, a.

[Figure: HMM with five arcs a1–a5.]

120 / 157

slide-121
SLIDE 121

Examples (contd.)

[Model: initial transition probabilities 1/3, 1/3, 1/3 for the arcs leaving state 1 (a1, a2, a3) and 1/2, 1/2 for the arcs leaving state 2 (a4, a5); each emitting arc outputs a and b with probability 1/2 each.]

There are 7 paths corresponding to an output X of abaa:

  • 1. p(X, path1) = 1/3 × 1/2 × 1/3 × 1/2 × 1/3 × 1/2 × 1/3 × 1/2 × 1/2 = .000385
  • 2. p(X, path2) = 1/3 × 1/2 × 1/3 × 1/2 × 1/3 × 1/2 × 1/2 × 1/2 × 1/2 = .000578
  • 3. p(X, path3) = 1/3 × 1/2 × 1/3 × 1/2 × 1/3 × 1/2 × 1/2 × 1/2 = .001157
  • 4. p(X, path4) = 1/3 × 1/2 × 1/3 × 1/2 × 1/2 × 1/2 × 1/2 × 1/2 × 1/2 = .000868

121 / 157

slide-122
SLIDE 122

Examples (contd.)

The remaining paths:

  • 5. p(X, path5) = 1/3 × 1/2 × 1/3 × 1/2 × 1/2 × 1/2 × 1/2 × 1/2 = .001736
  • 6. p(X, path6) = 1/3 × 1/2 × 1/2 × 1/2 × 1/2 × 1/2 × 1/2 × 1/2 × 1/2 = .001302
  • 7. p(X, path7) = 1/3 × 1/2 × 1/2 × 1/2 × 1/2 × 1/2 × 1/2 × 1/2 = .002604

P(X) = Σi p(X, pathi) = .008632

122 / 157

slide-123
SLIDE 123

Examples (contd.)

Fractional counts. Posterior probability of each path: Ci = p(X, pathi)/P(X):
C1 = 0.045, C2 = 0.067, C3 = 0.134, C4 = 0.100, C5 = 0.201, C6 = 0.150, C7 = 0.301
Ca1 = 3C1 + 2C2 + 2C3 + C4 + C5 = 0.838
Ca2 = C3 + C5 + C7 = 0.637
Ca3 = C1 + C2 + C4 + C6 = 0.363
Normalize to get new estimates: a1 = 0.46, a2 = 0.34, a3 = 0.20
Ca1,‘a’ = 2C1 + C2 + C3 + C4 + C5 = 0.592
Ca1,‘b’ = C1 + C2 + C3 = 0.246
pa1,‘a’ = 0.71, pa1,‘b’ = 0.29

123 / 157
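The fractional-count arithmetic above can be reproduced exactly with rational arithmetic. This is an illustrative sketch (not from the lecture); the seven path likelihoods are transcribed factor-by-factor from the path list, and the Ca formulas follow the slide.

```python
from fractions import Fraction as F

# The seven path likelihoods for output X = a,b,a,a, as products of the
# 1/3 and 1/2 factors enumerated on the earlier slides.
half, third = F(1, 2), F(1, 3)
paths = [
    third*half * third*half * third*half * third*half * half,  # ≈ .000385
    third*half * third*half * third*half * half*half * half,   # ≈ .000578
    third*half * third*half * third*half * half*half,          # ≈ .001157
    third*half * third*half * half*half * half*half * half,    # ≈ .000868
    third*half * third*half * half*half * half*half,           # ≈ .001736
    third*half * half*half * half*half * half*half * half,     # ≈ .001302
    third*half * half*half * half*half * half*half,            # ≈ .002604
]
PX = sum(paths)                      # P(X) = 179/20736 ≈ 0.008632
C = [float(p / PX) for p in paths]   # posterior count of each path

# Fractional counts of each state-1 arc (usage-weighted, per the slide):
Ca1 = 3*C[0] + 2*C[1] + 2*C[2] + C[3] + C[4]   # ≈ 0.838
Ca2 = C[2] + C[4] + C[6]                        # ≈ 0.637
Ca3 = C[0] + C[1] + C[3] + C[5]                 # ≈ 0.363
a1 = Ca1 / (Ca1 + Ca2 + Ca3)                    # ≈ 0.46
```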

slide-124
SLIDE 124

Examples (contd.)

New parameters:

[Model after one iteration: transition probabilities a1 = .46, a2 = .34, a3 = .20, a4 = .60, a5 = .40; output probabilities (a, b): (.71, .29), (.68, .32), (.64, .36), and 1.]

124 / 157

slide-125
SLIDE 125

Examples (contd.)

Iterate till convergence

Step    P(X)
1       0.008632
2       0.02438
3       0.02508
100     0.03125004
600     0.037037037 (converged)

125 / 157

slide-126
SLIDE 126

Examples (contd.)

The Forward-Backward algorithm improves on this enumerative algorithm. Instead of computing path counts, we compute counts for each transition in the trellis. Computation is now linear in the length of the sequence!

[Figure: a trellis transition from state Si to state Sj at time t, combining αt−1(i), the output xt, and βt(j).]

126 / 157

slide-127
SLIDE 127

Examples (contd.)

α computation

[Figure: α trellis over times 0–4 (observations ∅, a, ab, aba, abaa); accumulated values include 1, .33, .167, .306, .027, .076, .113, .0046, .035, .028, .00077, .0097, with final total P(X) = .008632.]

127 / 157

slide-128
SLIDE 128

Examples (contd.)

β computation

[Figure: β trellis over the same times; accumulated values include .0086, .0039, .028, .016, .076, .0625, .083, .25, 1; β at the start state and time 0 equals P(X) ≈ .0086.]

128 / 157

slide-129
SLIDE 129

Examples (contd.)

How do we use α and β in the computation of fractional counts? The fractional count of transition i → j at time t is

αt−1(i) × aij × bij(xt) × βt(j) / P(X)

[Figure: trellis with per-transition fractional counts, e.g. .167 × .0625 × .333 × .5 / .008632 ≈ .201; values include .547, .246, .045, .151, .101, .067, .302, .201, .134, .553, .821.]

Ca1 = 0.547 + 0.246 + 0.045; Ca2 = 0.302 + 0.201 + 0.134; Ca3 = 0.151 + 0.101 + 0.067 + 0.045

129 / 157

slide-130
SLIDE 130

Points to remember

Re-estimation converges to a local maximum Final solution depends on your starting point Speed of convergence depends on the starting point

130 / 157

slide-131
SLIDE 131

Where Are We?

1

Computing the Best Path

2

Computing the Likelihood of Observations

3

Estimating Model Parameters

4

Discussion

131 / 157

slide-132
SLIDE 132

HMM’s and ASR

Old paradigm: DTW.

w∗ = arg min_{w ∈ vocab} distance(A′test, A′w)

New paradigm: probabilities.

w∗ = arg max_{w ∈ vocab} P(A′test | w)

Vector quantization: A′test ⇒ xtest. Convert from a sequence of 40d feature vectors to a sequence of values from a discrete alphabet.

132 / 157

slide-133
SLIDE 133

The Basic Idea

For each word w, build an HMM modeling P(x|w) = Pw(x).

Training phase: for each w, pick an HMM topology and initial parameters; take all instances of w in the training data; run Forward-Backward on the data to update the parameters.

Testing: the Forward algorithm.

w∗ = arg max_{w ∈ vocab} Pw(xtest)

Alignment: the Viterbi algorithm tells us when each sound begins and ends.

133 / 157

slide-134
SLIDE 134

Recap: Discrete HMM’s

HMM’s are powerful tool for making probabilistic models . . . Of discrete sequences. Three key algorithms for HMM’s: The Viterbi algorithm. The Forward algorithm. The Forward-Backward algorithm. Each algorithm has important role in ASR. Can do ASR within probabilistic paradigm . . . Using just discrete HMM’s and vector quantization.

134 / 157

slide-135
SLIDE 135

Part III Continuous Hidden Markov Models

135 / 157

slide-136
SLIDE 136

Going from Discrete to Continuous Outputs

What we have: a way to assign likelihoods . . . To discrete sequences, e.g., C, W, R, C, . . . What we want: a way to assign likelihoods . . . To sequences of 40d (or so) feature vectors.

136 / 157

slide-137
SLIDE 137

Variants of Discrete HMM’s

Our convention: single output on each arc.

[Diagram: arcs labeled W/0.3, C/0.7, W/1.0.]

Another convention: output distribution on each arc.

[Diagram: arcs labeled /0.3 with output distribution (0.2, 0.8), /0.7 with (0.7, 0.3), /1.0 with (0.4, 0.6).]

(Another convention: output distribution on each state.)

[Diagram: states with output distributions (0.2, 0.8) and (0.7, 0.3); arc probabilities 0.3, 0.7, 1.0.]

137 / 157

slide-138
SLIDE 138

Moving to Continuous Outputs

Idea: replace discrete output distribution . . . With continuous output distribution. What’s our favorite continuous distribution? Gaussian mixture models.

138 / 157

slide-139
SLIDE 139

Where Are We?

1

The Basics

2

Discussion

139 / 157

slide-140
SLIDE 140

Moving to Continuous Outputs

Discrete HMM’s: finite vocabulary of outputs; each arc labeled with a single output x.

[Diagram: arcs labeled W/0.3, C/0.7, W/1.0.]

Continuous HMM’s: finite number of GMM’s, g = 1, . . . , G; each arc labeled with a single GMM identity g.

[Diagram: arcs labeled 2/0.3, 1/0.7, 3/1.0.]

140 / 157

slide-141
SLIDE 141

What Are The Parameters?

Assume a single start state as before. Old: one parameter for each arc, p_{S →g S′}. Identify an arc by its source S, destination S′, and GMM g. Probabilities of arcs leaving the same state must sum to 1:

Σ_{g,S′} p_{S →g S′} = 1   for all S

New: GMM parameters for g = 1, . . . , G: pg,j, µg,j, Σg,j.

Pg(x) = Σ_j pg,j × (2π)^{−d/2} |Σg,j|^{−1/2} exp(−½ (x − µg,j)ᵀ Σg,j⁻¹ (x − µg,j))

141 / 157
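The GMM likelihood Pg(x) can be sketched for the common diagonal-covariance case. This is illustrative, not from the lecture, and the parameter values at the bottom are made up for the example.

```python
import math

# Likelihood of one feature vector under a diagonal-covariance GMM:
# P_g(x) = sum_j p_gj * N(x; mu_gj, Sigma_gj).
def gmm_likelihood(x, weights, means, variances):
    """x, means[j], variances[j]: equal-length lists (diagonal covariance)."""
    total = 0.0
    for p, mu, var in zip(weights, means, variances):
        # Accumulate the log of the Gaussian density dimension by dimension.
        log_n = 0.0
        for xd, md, vd in zip(x, mu, var):
            log_n += -0.5 * math.log(2 * math.pi * vd) - (xd - md) ** 2 / (2 * vd)
        total += p * math.exp(log_n)
    return total

# Two-component GMM in 2d (illustrative parameters):
px = gmm_likelihood([0.5, -0.2],
                    weights=[0.4, 0.6],
                    means=[[0.0, 0.0], [1.0, -1.0]],
                    variances=[[1.0, 1.0], [0.5, 0.5]])
```

With a single zero-mean, unit-variance component in 1d this reduces to the standard normal density, 1/√(2π) ≈ 0.399 at x = 0.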

slide-142
SLIDE 142

Computing the Likelihood of a Path

Multiply arc and output probabilities along the path. Discrete HMM:

Arc probabilities: p_{S →x S′}.
Output probability is 1 if the output of the arc matches, and 0 otherwise (i.e., the path is disallowed).

e.g., consider x = C, C, W, W.

[HMM diagram: states c, w; arcs c →C c/0.6, c →W c/0.2, c →C w/0.1, c →W w/0.1, w →C w/0.2, w →W w/0.6, w →C c/0.1, w →W c/0.1.]

142 / 157

slide-143
SLIDE 143

Computing the Likelihood of a Path

Multiply arc and output probabilities along the path. Continuous HMM:

Arc probabilities: p_{S →g S′}.
Every arc matches any output. The output probability is the GMM probability:

Pg(x) = Σ_j pg,j × (2π)^{−d/2} |Σg,j|^{−1/2} exp(−½ (x − µg,j)ᵀ Σg,j⁻¹ (x − µg,j))

143 / 157

slide-144
SLIDE 144

Example: Computing Path Likelihood

Single 1d GMM with a single component: µ1,1 = 0, σ²1,1 = 1.

[Diagram: states 1, 2; arcs 1/0.7 (1 → 1), 1/0.3 (1 → 2), 1/1.0 (2 → 2).]

Observed: x = 0.3, −0.1; state sequence: h = 1, 1, 2.

P(h, x) = p_{1 →1} × (1/(√(2π) σ1,1)) e^{−(0.3 − µ1,1)²/(2σ²1,1)} × p_{1 →2} × (1/(√(2π) σ1,1)) e^{−(−0.1 − µ1,1)²/(2σ²1,1)}
        = 0.7 × 0.381 × 0.3 × 0.397 = 0.0318

144 / 157
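The path-likelihood computation above can be sketched directly (illustrative, not from the lecture): each arc probability is multiplied by the Gaussian output density.

```python
import math

# 1d Gaussian density with mu = 0, sigma^2 = 1 by default.
def gauss(x, mu=0.0, var=1.0):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

# Path h = 1, 1, 2 for x = 0.3, -0.1: arc probabilities from the diagram.
p_11, p_12 = 0.7, 0.3
lik = p_11 * gauss(0.3) * p_12 * gauss(-0.1)
# = 0.7 * 0.381 * 0.3 * 0.397 ≈ 0.0318, matching the slide.
```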

slide-145
SLIDE 145

The Three Key Algorithms

The main change: wherever an arc probability p_{S →x S′} appears, replace it with the arc probability times the output probability:

p_{S →g S′} × Pg(x)

The other change: in Forward-Backward, we also need to reestimate the GMM parameters.

145 / 157

slide-146
SLIDE 146

Example: The Forward Algorithm

α(S, 0) = 1 for S = S0, 0 otherwise.
For t = 1, . . . , T, for each state S:

α(S, t) = Σ_{S′ →g S} p_{S′ →g S} × Pg(xt) × α(S′, t − 1)

The end:

P(x) = Σ_h P(h, x) = Σ_S α(S, T)

146 / 157
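A sketch of the continuous-output forward pass (not from the lecture), reusing the two-state model of the path-likelihood example a few slides back, with a single shared 1d Gaussian (µ = 0, σ² = 1) as the only GMM; the arc assignment follows that example. The only change from the discrete case is the extra factor Pg(xt).

```python
import math

def gauss(x, mu=0.0, var=1.0):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

# Arcs (source, gmm, destination) -> probability; GMM 1 is the Gaussian above.
arcs = {(1, 1, 1): 0.7, (1, 1, 2): 0.3, (2, 1, 2): 1.0}
states = [1, 2]

def forward(x, start=1):
    alpha = {S: (1.0 if S == start else 0.0) for S in states}
    for xt in x:
        # Discrete recursion, but each arc is weighted by the output density.
        alpha = {
            S: sum(p * gauss(xt) * alpha[Sp]
                   for (Sp, g, S2), p in arcs.items() if S2 == S)
            for S in states
        }
    return sum(alpha.values())  # P(x) = sum_S alpha(S, T)

px = forward([0.3, -0.1])
# Sums over all paths, so P(x) exceeds the single-path likelihood 0.0318.
```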

slide-147
SLIDE 147

The Forward-Backward Algorithm

Compute the posterior count of each arc at time t as before:

c(S →g S′, t) = (1/P(x)) × p_{S →g S′} × Pg(xt) × α(S, t − 1) × β(S′, t)

Use these to get total counts of each arc as before:

c(S →g S′) = Σ_{t=1..T} c(S →g S′, t)

pMLE(S →g S′) = c(S →g S′) / Σ_{g,S′} c(S →g S′)

But also use them to estimate the GMM parameters: send the c(S →g S′, t) counts for point xt to the reestimation of GMM g.

147 / 157

slide-148
SLIDE 148

Where Are We?

1

The Basics

2

Discussion

148 / 157

slide-149
SLIDE 149

An HMM/GMM Recognizer

For each word w, build HMM modeling P(x|w) = Pw(x). Training phase. For each w, pick HMM topology, initial parameters, . . . Number of components in each GMM. Take all instances of w in training data. Run Forward-Backward on data to update parameters. Testing: the Forward algorithm. w∗ = arg max

w∈vocab

Pw(xtest)

149 / 157

slide-150
SLIDE 150

What HMM Topology, Initial Parameters?

A standard topology (three states per phoneme):

[Diagram: left-to-right HMM, arcs labeled 1/0.5, 2/0.5, 3/0.5, 4/0.5, 5/0.5, 6/0.5; each GMM appears on a self-loop and a forward arc, each with probability 0.5.]

How many Gaussians per mixture? Set all means to 0; variances to 1 (flat start). That’s everything!

150 / 157

slide-151
SLIDE 151

HMM/GMM vs. DTW

Old paradigm: DTW.

w∗ = arg min_{w ∈ vocab} distance(A′test, A′w)

New paradigm: probabilities.

w∗ = arg max_{w ∈ vocab} P(A′test | w)

In fact, we can design an HMM such that

distance(A′test, A′w) ≈ − log P(A′test | w)

See Holmes, Sec. 9.13, p. 155.

151 / 157

slide-152
SLIDE 152

The More Things Change . . .

DTW                  HMM
template             HMM
frame in template    state in HMM
DTW alignment        HMM path
local path cost      transition (log)prob
frame distance       output (log)prob
DTW search           Viterbi algorithm

152 / 157

slide-153
SLIDE 153

What Have We Gained?

Principles! Probability theory; maximum likelihood estimation. Can choose path scores and parameter values in a non-arbitrary manner. Fewer ways to screw up! Scalability. Can extend the HMM/GMM framework to lots of data; continuous speech; large vocabularies; etc. Generalization. An HMM can assign high probability to a sample even if the sample is not close to any one training example.

153 / 157

slide-154
SLIDE 154

The Markov Assumption

Everything we need to know about the past is encoded in the identity of the state, i.e., conditional independence of the future and the past. What information do we encode in the state? What information don’t we encode? Issue: the more states, the more parameters, e.g., the weather. Solutions: more states; condition on more stuff, e.g., graphical models.

154 / 157

slide-155
SLIDE 155

Recap: HMM’s

Together with GMM’s, good way to model likelihood . . . Of sequences of 40d acoustic feature vectors. Use state to capture information about past. Lets you model how data evolves over time. Not nearly as ad hoc as dynamic time warping. Need three basic algorithms for ASR. Viterbi, Forward, Forward-Backward. All three are efficient: dynamic programming. Know enough to build basic GMM/HMM recognizer.

155 / 157

slide-156
SLIDE 156

Part IV Epilogue

156 / 157

slide-157
SLIDE 157

What’s Next

Lab 2: Build simple HMM/GMM system. Training and decoding. Lecture 5: Language modeling. Moving from isolated to continuous word ASR. Lecture 6: Pronunciation modeling. Moving from small to large vocabulary ASR.

157 / 157