The Markov Property, Order n
Holds if:
P(x_1, …, x_N) = ∏_{i=1}^N P(x_i | x_1, …, x_{i−1}) = ∏_{i=1}^N P(x_i | x_{i−n}, x_{i−n+1}, …, x_{i−1})
e.g., if know weather for past n days . . . Knowing more doesn't help predict future weather.
i.e., if data satisfies this property . . . No loss from just remembering past n items!
34 / 157
A Non-Hidden Markov Model, Order 1
Let's assume: knowing yesterday's weather is enough.
P(x_1, …, x_N) = ∏_{i=1}^N P(x_i | x_{i−1})
Before (no state): single multinomial P(x_i).
After (with state): separate multinomial P(x_i | x_{i−1}) . . . for each x_{i−1} ∈ {rainy, windy, calm}.
Model P(x_i | x_{i−1}) with parameter p_{x_{i−1}, x_i}.
What about P(x_1 | x_0)? Assume x_0 = start, a special value. One more multinomial: P(x_i | start).
Constraint: ∑_{x_i} p_{x_{i−1}, x_i} = 1 for all x_{i−1}.
35 / 157
A Picture
After observing x, go to the state labeled x. Is the state non-hidden?
[State diagram: states start, R, W, C; each arc is labeled x/p_{x_{i−1},x}, e.g. the arc from start to W is labeled W/p_{start,W} and the arc from C to R is labeled R/p_{C,R}.]
36 / 157
Computing the Likelihood of Data
Some data: x = W, W, C, C, W, W, C, R, C, R.
[State diagram with example arc probabilities; the product below uses p_{start,W} = 0.3, p_{W,W} = 0.3, p_{W,C} = 0.1, p_{C,C} = 0.1.]
The likelihood:
P(x_1, …, x_10) = ∏_{i=1}^N P(x_i | x_{i−1}) = ∏_{i=1}^N p_{x_{i−1}, x_i}
= p_{start,W} × p_{W,W} × p_{W,C} × … = 0.3 × 0.3 × 0.1 × 0.1 × … = 1.06 × 10^{−6}
37 / 157
Computing the Likelihood of Data
More generally:
P(x_1, …, x_N) = ∏_{i=1}^N P(x_i | x_{i−1}) = ∏_{i=1}^N p_{x_{i−1}, x_i} = ∏_{x_{i−1}, x_i} p_{x_{i−1}, x_i}^{c(x_{i−1}, x_i)}
log P(x_1, …, x_N) = ∑_{x_{i−1}, x_i} c(x_{i−1}, x_i) log p_{x_{i−1}, x_i}
x_0 = start. c(x_{i−1}, x_i) is the count of x_i following x_{i−1}.
Likelihood only depends on counts of pairs (bigrams).
38 / 157
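The identity above — log likelihood as a sum over bigram counts — can be sketched in a few lines. The function name and the toy transition table are ours, not from the slides:

```python
from collections import Counter
from math import exp, log

def bigram_log_likelihood(seq, p):
    """log P(x) = sum over bigram pairs (a, b) of c(a, b) * log p[a][b], with x_0 = start."""
    # Prepend the special start state, then count each (previous, current) pair.
    pairs = Counter(zip(["start"] + seq[:-1], seq))
    return sum(c * log(p[a][b]) for (a, b), c in pairs.items())

# Toy transition table (hypothetical, for illustration only):
p = {"start": {"A": 0.5, "B": 0.5},
     "A": {"A": 0.9, "B": 0.1},
     "B": {"A": 0.4, "B": 0.6}}
ll = bigram_log_likelihood(["A", "A", "B"], p)   # log(0.5 * 0.9 * 0.1)
```

Exponentiating the result recovers the plain product of arc probabilities, matching the slide's derivation.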
Maximum Likelihood Estimation
Choose p_{x_{i−1}, x_i} to optimize log likelihood:
L(x_1^N) = ∑_{x_{i−1}, x_i} c(x_{i−1}, x_i) log p_{x_{i−1}, x_i}
= ∑_{x_i} c(start, x_i) log p_{start, x_i} + ∑_{x_i} c(R, x_i) log p_{R, x_i} + ∑_{x_i} c(W, x_i) log p_{W, x_i} + ∑_{x_i} c(C, x_i) log p_{C, x_i}
Each sum is the log likelihood of a multinomial.
Each multinomial has a nonoverlapping parameter set.
Can optimize each sum independently!
p^{MLE}_{x_{i−1}, x_i} = c(x_{i−1}, x_i) / ∑_x c(x_{i−1}, x)
39 / 157
Example: Maximum Likelihood Estimation
Some raw data: W, W, C, C, W, W, C, R, C, R, W, C, C, C, R, R, R, R, C, C, R, R, R, R, R, R, R, R, C, C, C, C, C, R, R, R, R, R, R, R, C, C, C, W, C, C, C, C, C, C, R, C, C, C, C
Counts and ML estimates:

c(·,·)    R   W   C  sum      p^{MLE}    R      W      C
start     0   1   0    1      start    0.000  1.000  0.000
R        16   1   5   22      R        0.727  0.045  0.227
W         0   2   4    6      W        0.000  0.333  0.667
C         6   2  18   26      C        0.231  0.077  0.692

p^{MLE}_{x_{i−1}, x_i} = c(x_{i−1}, x_i) / ∑_x c(x_{i−1}, x)      p^{MLE}_{R,C} = 5 / (16 + 1 + 5) = 0.227
40 / 157
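Count-and-normalize is mechanical enough to transcribe directly; this sketch (the `mle_bigram` name is ours) reproduces the table above from the raw data:

```python
from collections import Counter

def mle_bigram(seq):
    """MLE for an order-1 Markov chain: count bigrams, then normalize per source state."""
    counts = Counter(zip(["start"] + seq[:-1], seq))   # x_0 = start
    totals = Counter()
    for (a, _), c in counts.items():
        totals[a] += c                                 # row sums: sum_x c(x_{i-1}, x)
    return {(a, b): c / totals[a] for (a, b), c in counts.items()}

# The raw data from the slide:
data = ("W W C C W W C R C R W C C C R R R R C C R R R R R R R R "
        "C C C C C R R R R R R R C C C W C C C C C C R C C C C").split()
p = mle_bigram(data)
```

Running this gives p[("R", "C")] = 5/22 ≈ 0.227 and p[("C", "C")] = 18/26 ≈ 0.692, matching the table.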
Example: Maximum Likelihood Estimation
[State diagram annotated with counts and ML estimates, e.g. the arc from start to W carries W/1 → W/1.000, and the C self-loop carries C/18 → C/0.692; state totals: start 1, R 22, W 6, C 26.]
41 / 157
Example: Orders Some raw data: W, W, C, C, W, W, C, R, C, R, W, C, C, C, R, R, R, R, C, C, R, R, R, R, R, R, R, R, C, C, C, C, C, R, R, R, R, R, R, R, C, C, C, W, C, C, C, C, C, C, R, C, C, C, C Data sampled from MLE Markov model, order 1: W, W, C, R, R, R, R, R, R, R, R, R, R, R, R, R, R, R, R, C, C, C, C, C, C, W, W, C, C, C, R, R, R, C, C, W, C, C, C, C, C, R, R, R, R, R, C, R, R, C, R, R, R, R, R Data sampled from MLE Markov model, order 0: C, R, C, R, R, R, R, C, R, R, C, C, R, C, C, R, R, R, R, C, C, C, R, C, R, W, R, C, C, C, W, C, R, C, C, W, C, C, C, C, R, R, C, C, C, R, C, R, R, C, R, C, R, W, R 42 / 157
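Samples like the ones above can be drawn from a fitted chain with a short helper (the `sample_markov` function is our illustration; the transition table is the slide's MLE, order 1):

```python
import random

def sample_markov(p, n, state="start"):
    """Draw n symbols from an order-1 chain; p[state] maps next symbol -> probability."""
    out = []
    for _ in range(n):
        syms = list(p[state])
        sym = random.choices(syms, weights=[p[state][s] for s in syms])[0]
        out.append(sym)
        state = sym          # order 1: the state is just the last observed symbol
    return out

# ML estimates from the earlier slide:
p = {"start": {"R": 0.0, "W": 1.0, "C": 0.0},
     "R": {"R": 16/22, "W": 1/22, "C": 5/22},
     "W": {"R": 0.0, "W": 2/6, "C": 4/6},
     "C": {"R": 6/26, "W": 2/26, "C": 18/26}}
sample = sample_markov(p, 55)
```

An order-0 sampler is the same loop with the state held fixed, which is why its output above loses the long runs of R's.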
Recap: Non-Hidden Markov Models
Use states to encode a limited amount of information . . . about the past.
Current state is known.
Log likelihood just depends on pair counts:
L(x_1^N) = ∑_{x_{i−1}, x_i} c(x_{i−1}, x_i) log p_{x_{i−1}, x_i}
MLE: count and normalize.
p^{MLE}_{x_{i−1}, x_i} = c(x_{i−1}, x_i) / ∑_x c(x_{i−1}, x)
Easy beezy.
43 / 157
Part II Discrete Hidden Markov Models 44 / 157
Case Study: Austin Weather 2.0 Ignore rain; one sample every two weeks: C, W, C, C, C, C, W, C, C, C, C, C, C, W, W, C, W, C, W, W, C, W, C, W, C, C, C, C, C, C, C, C, C, C, C, C, C, C, W, C, C, C, W, W, C, C, W, W, C, W, C, W, C, C, C, C, C, C, C, C, C, C, C, C, C, W, C, W, C, C, W, W, C, W, W, W, C, W, C, C, C, C, C, C, C, C, C, C, W, C, W, W, W, C, C, C, C, C, W, C, C, W, C, C, C, C, C, C, C, C, C, C, C, W Does system have state/memory? 45 / 157
Another View C W C C C C W C C C C C C W W C W C W W C W C W C C C C C C C C C C C C C C W C C C W W C C W W C W C W C C C C C C C C C C C C C W C W C C W W C W W W C W C C C C C C C C C C W C W W W C C C C C W C C W C C C C C C C C C C C W C C W W C W C C C W C W C W C C C C C C C C W C C C C C W C C C W C W C W C C W C W C C C C C C C C C C C C C W C C C W W C C C W C W C Does system have memory? How many states? 46 / 157
A Hidden Markov Model
For simplicity, no separate start state. Always start in calm state c.
[HMM diagram: states c and w. Self-loops: c with C/0.6 and W/0.2; w with C/0.2 and W/0.6. Cross arcs: c → w with C/0.1 and W/0.1; w → c with C/0.1 and W/0.1.]
Why is state "hidden"? What are the conditions for the state to be non-hidden?
47 / 157
Contrast: Non-Hidden Markov Models
[State diagram with the ML estimates from before, e.g. start → W with W/1.000, the C self-loop with C/0.692, the R self-loop with R/0.727, W → C with C/0.667. Every arc label determines the next state, so the state is always known.]
48 / 157
Back to Coins: Hidden Information
Memory-less example:
Coin 0: p_H = 0.7, p_T = 0.3
Coin 1: p_H = 0.9, p_T = 0.1
Coin 2: p_H = 0.2, p_T = 0.8
Experiment: Flip Coin 0. If the outcome is H, flip Coin 1 and record; else flip Coin 2 and record.
Coin 0's outcomes are hidden!
What is the probability of the sequence H T T T?
p(H) = 0.9 × 0.7 + 0.2 × 0.3 = 0.69;  p(T) = 0.1 × 0.7 + 0.8 × 0.3 = 0.31
An example with memory: 2 coins, flip each twice. Record the first flip; use the second to determine which coin to flip next. No way to know the outcome of the even flips.
Order matters now and . . . we cannot uniquely determine which state sequence produced the observed output sequence.
49 / 157
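Plugging in the numbers for the memory-less case: since the flips are i.i.d., the sequence probability is just a product of the marginals.

```python
# Memoryless case: Coin 0 (p_H = 0.7) selects Coin 1 (p_H = 0.9) or Coin 2 (p_H = 0.2).
p_h = 0.7 * 0.9 + 0.3 * 0.2          # marginal probability of heads = 0.69
p_t = 0.7 * 0.1 + 0.3 * 0.8          # marginal probability of tails = 0.31
p_httt = p_h * p_t ** 3              # flips are i.i.d., so P(HTTT) is a product
```

So P(HTTT) = 0.69 × 0.31³ ≈ 0.0206. In the example with memory this factorization fails, which is exactly what makes the state hidden.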
Why Hidden State?
No "simple" way to determine state given observed.
If we see "W", it doesn't mean the windy season started.
Speech recognition: one HMM per word. Each state represents a different sound in the word. How to tell from the observed when the state switches?
Hidden models can model the same stuff as non-hidden . . . using far fewer states.
Pop quiz: name a hidden model with no memory.
50 / 157
The Problem With Hidden State
For observed x = x_1, …, x_N, what is the hidden state h?
Corresponding state sequence h = h_1, …, h_{N+1}.
In a non-hidden model, how many h are possible given x?
In a hidden model, what h are possible given x?
[HMM diagram from before: states c, w; self-loops c: C/0.6, W/0.2 and w: C/0.2, W/0.6; cross arcs C/0.1, W/0.1 each way.]
This makes everything difficult.
51 / 157
Three Key Tasks for HMMs
1. Find the single best path in the HMM given observed x. e.g., when did the windy season begin? When did each sound in a word begin?
2. Find the total likelihood P(x) of the observed. e.g., to pick which word assigns the highest likelihood.
3. Find ML estimates for the parameters of the HMM. i.e., estimate arc probabilities to match training data.
These problems are easy to solve for a state-observable Markov model. They are more complicated for an HMM, as we have to consider all possible state sequences.
52 / 157
Where Are We?
1. Computing the Best Path
2. Computing the Likelihood of Observations
3. Estimating Model Parameters
4. Discussion
53 / 157
What We Want to Compute
Given observed, e.g., x = C, W, C, C, W, . . .
Find the state sequence h* with highest likelihood: h* = arg max_h P(h, x)
Why is this easy for a non-hidden model?
Given state sequence h, how to compute P(h, x)? Same as for the non-hidden model: multiply all arc probabilities along the path.
54 / 157
Likelihood of Single State Sequence
Some data: x = W, C, C. A state sequence: h = c, c, c, w.
[HMM diagram from before.]
Likelihood of path (arcs c →[W] c, c →[C] c, c →[C] w):
P(h, x) = 0.2 × 0.6 × 0.1 = 0.012
55 / 157
What We Want to Compute
Given observed, e.g., x = C, W, C, C, W, . . .
Find the state sequence h* with highest likelihood: h* = arg max_h P(h, x)
Let's start with a simpler problem: find the likelihood of the best state sequence, P_best(x) = max_h P(h, x).
Worry about the identity of the best sequence later.
56 / 157
What's the Problem?
P_best(x) = max_h P(h, x)
For an observation sequence of length N . . . how many different possible state sequences h?
[HMM diagram from before.]
How in blazes can we do the max . . . over an exponential number of state sequences?
57 / 157
Dynamic Programming
Let S_0 be the start state; e.g., the calm season c.
Let P(S, t) be the set of paths of length t . . . starting at start state S_0 and ending at S . . . consistent with observed x_1, …, x_t.
Any path p ∈ P(S, t) must be composed of . . . a path of length t − 1 to a predecessor state S′ . . . followed by an arc from S′ to S labeled with x_t.
This decomposition is unique.
P(S, t) = ⋃_{S′ →[x_t] S} P(S′, t − 1) · (S′ →[x_t] S)
58 / 157
Dynamic Programming
P(S, t) = ⋃_{S′ →[x_t] S} P(S′, t − 1) · (S′ →[x_t] S)
Let α̂(S, t) = likelihood of the best path of length t . . . starting at start state S_0 and ending at S.
P(p) = prob of path p = product of arc probs.
α̂(S, t) = max_{p ∈ P(S, t)} P(p)
= max_{p′ ∈ P(S′, t−1), S′ →[x_t] S} P(p′ · (S′ →[x_t] S))
= max_{S′ →[x_t] S} P(S′ →[x_t] S) × max_{p′ ∈ P(S′, t−1)} P(p′)
= max_{S′ →[x_t] S} P(S′ →[x_t] S) × α̂(S′, t − 1)
59 / 157
What Were We Computing Again?
Assume observed x of length T.
Want the likelihood of the best path of length T . . . starting at start state S_0 and ending anywhere.
P_best(x) = max_h P(h, x) = max_S α̂(S, T)
If we can compute α̂(S, T), we are done.
If we know α̂(S, t − 1) for all S, it is easy to compute α̂(S, t):
α̂(S, t) = max_{S′ →[x_t] S} P(S′ →[x_t] S) × α̂(S′, t − 1)
This looks promising . . .
60 / 157
The Viterbi Algorithm
α̂(S, 0) = 1 for S = S_0, 0 otherwise.
For t = 1, …, T: For each state S:
α̂(S, t) = max_{S′ →[x_t] S} P(S′ →[x_t] S) × α̂(S′, t − 1)
The end.
P_best(x) = max_h P(h, x) = max_S α̂(S, T)
61 / 157
Viterbi and Shortest Path
Equivalent to a shortest path problem.
[Figure: small weighted directed graph (edge weights 3, 1, 1, 4, 1, 3, 1, 2, …) illustrating shortest path.]
One "state" for each state/time pair (S, t).
Iterate through "states" in topological order: all arcs go forward in time, so ordering "states" by time is a valid ordering.
d(S) = min_{S′ → S} { d(S′) + distance(S′, S) }
α̂(S, t) = max_{S′ →[x_t] S} P(S′ →[x_t] S) × α̂(S′, t − 1)
62 / 157
Identifying the Best Path
Wait! We can calc the likelihood of the best path: P_best(x) = max_h P(h, x)
What we really wanted: the identity of the best path, i.e., the best state sequence h.
Basic idea: for each S, t . . . record the identity S_prev(S, t) of the previous state S′ . . . in the best path of length t ending at state S.
Find the best final state.
Backtrace best previous states until reaching the start state.
63 / 157
The Viterbi Algorithm With Backtrace
α̂(S, 0) = 1 for S = S_0, 0 otherwise.
For t = 1, …, T: For each state S:
α̂(S, t) = max_{S′ →[x_t] S} P(S′ →[x_t] S) × α̂(S′, t − 1)
S_prev(S, t) = arg max_{S′ →[x_t] S} P(S′ →[x_t] S) × α̂(S′, t − 1)
The end.
P_best(x) = max_S α̂(S, T)
S_final(x) = arg max_S α̂(S, T)
64 / 157
The Backtrace
S_cur ← S_final(x)
for t in T, …, 1:
  S_cur ← S_prev(S_cur, t)
The best state sequence is . . . the list of states traversed, in reverse order.
65 / 157
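The recurrence plus backtrace fits in a short function. This is a sketch, not the lecture's code: the arc-list representation and function name are ours, and the transition probabilities are as read off the two-state weather HMM in the earlier diagram.

```python
def viterbi(obs, arcs, start):
    """Best path. arcs: state -> list of (label, next_state, prob)."""
    alpha = {s: (1.0 if s == start else 0.0) for s in arcs}   # alpha-hat(S, 0)
    backptr = []                                  # backptr[t-1][S] = S_prev(S, t)
    for x in obs:
        new, prev = {s: 0.0 for s in arcs}, {}
        for s in arcs:
            for label, s2, p in arcs[s]:
                if label == x and alpha[s] * p > new[s2]:
                    new[s2], prev[s2] = alpha[s] * p, s
        alpha = new
        backptr.append(prev)
    best = max(alpha, key=alpha.get)              # S_final(x)
    path = [best]
    for prev in reversed(backptr):                # trace best predecessors backward
        path.append(prev[path[-1]])
    return alpha[best], path[::-1]

# Two-state weather HMM from the slides (start in calm state c):
arcs = {"c": [("C", "c", 0.6), ("W", "c", 0.2), ("C", "w", 0.1), ("W", "w", 0.1)],
        "w": [("C", "c", 0.1), ("W", "c", 0.1), ("C", "w", 0.2), ("W", "w", 0.6)]}
prob, path = viterbi(list("CCWW"), arcs, "c")
```

On the C, C, W, W example this recovers the slide's answer: best path (c, c, c, w, w) with likelihood ≈ 0.022.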
Illustration with a trellis
State transition diagram unrolled in time: states 1, 2, 3; times 0–4; observed prefixes ∅, a, aa, aab, aabb.
[Trellis figure: arcs labeled with transition × emission probabilities such as .5×.2, .5×.8, .3×.3, .3×.7, .4×.5, .5×.3, .5×.7.]
66 / 157
Illustration with a trellis (contd.)
Accumulating scores: each trellis node sums the likelihoods of its incoming arcs (the Forward computation).
[Trellis figure; e.g. the state-2 score at one step is .21 + .04 + .08 = .33, and the state-3 score is .033 + .03 = .063.]
67 / 157
Viterbi algorithm
Accumulating scores: the same trellis, but each node takes the max over incoming arcs instead of the sum.
[Trellis figure; e.g. max(.08, .21, .04) = .21, max(.084, .042, .032) = .084, max(.0084, .0315) = .0315, max(.03, .021) = .03.]
68 / 157
Best path through the trellis
[Trellis figure: starting from the highest-scoring final node, the best path is traced back through the recorded best predecessors.]
69 / 157
Example
Some data: C, C, W, W.
[HMM diagram from before.]

α̂     0      1      2      3      4
c    1.000  0.600  0.360  0.072  0.014
w    0.000  0.100  0.060  0.036  0.022

α̂(c, 2) = max{ P(c →[C] c) × α̂(c, 1), P(w →[C] c) × α̂(w, 1) } = max{ 0.6 × 0.6, 0.1 × 0.1 } = 0.36
70 / 157
Example: The Backtrace

S_prev   1  2  3  4
c        c  c  c  c
w        c  c  c  w

h* = arg max_h P(h, x) = (c, c, c, w, w)
The data: C, C, W, W. Calm season switching to windy season.
71 / 157
Recap: The Viterbi Algorithm Given observed x , . . . Exponential number of hidden sequences h . Can find likelihood and identity of best path . . . Efficiently using dynamic programming. What is time complexity? 72 / 157
Where Are We?
1. Computing the Best Path
2. Computing the Likelihood of Observations
3. Estimating Model Parameters
4. Discussion
73 / 157
What We Want to Compute
Given observed, e.g., x = C, W, C, C, W, . . .
Find the total likelihood P(x).
Need to sum the likelihood over all hidden sequences: P(x) = ∑_h P(h, x)
Given state sequence h, how to compute P(h, x)? Multiply all arc probabilities along the path.
Why is this sum easy for a non-hidden model?
74 / 157
What's the Problem?
P(x) = ∑_h P(h, x)
For an observation sequence of length N . . . how many different possible state sequences h?
[HMM diagram from before.]
How in blazes can we do the sum . . . over an exponential number of state sequences?
75 / 157
Dynamic Programming
Let P(S, t) be the set of paths of length t . . . starting at start state S_0 and ending at S . . . consistent with observed x_1, …, x_t.
Any path p ∈ P(S, t) must be composed of . . . a path of length t − 1 to a predecessor state S′ . . . followed by an arc from S′ to S labeled with x_t.
P(S, t) = ⋃_{S′ →[x_t] S} P(S′, t − 1) · (S′ →[x_t] S)
76 / 157
Dynamic Programming
P(S, t) = ⋃_{S′ →[x_t] S} P(S′, t − 1) · (S′ →[x_t] S)
Let α(S, t) = sum of likelihoods of paths of length t . . . starting at start state S_0 and ending at S.
α(S, t) = ∑_{p ∈ P(S, t)} P(p)
= ∑_{p′ ∈ P(S′, t−1), S′ →[x_t] S} P(p′ · (S′ →[x_t] S))
= ∑_{S′ →[x_t] S} P(S′ →[x_t] S) ∑_{p′ ∈ P(S′, t−1)} P(p′)
= ∑_{S′ →[x_t] S} P(S′ →[x_t] S) × α(S′, t − 1)
77 / 157
What Were We Computing Again?
Assume observed x of length T.
Want the sum of likelihoods of paths of length T . . . starting at start state S_0 and ending anywhere.
P(x) = ∑_h P(h, x) = ∑_S α(S, T)
If we can compute α(S, T), we are done.
If we know α(S, t − 1) for all S, it is easy to compute α(S, t):
α(S, t) = ∑_{S′ →[x_t] S} P(S′ →[x_t] S) × α(S′, t − 1)
This looks promising . . .
78 / 157
The Forward Algorithm
α(S, 0) = 1 for S = S_0, 0 otherwise.
For t = 1, …, T: For each state S:
α(S, t) = ∑_{S′ →[x_t] S} P(S′ →[x_t] S) × α(S′, t − 1)
The end.
P(x) = ∑_h P(h, x) = ∑_S α(S, T)
79 / 157
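As the next slide notes, Forward is Viterbi with max replaced by sum; a sketch using the same arc-list representation as before (names ours, weather-HMM probabilities as read off the example diagram):

```python
def forward(obs, arcs, start):
    """Total likelihood P(x), summed over all state sequences."""
    alpha = {s: (1.0 if s == start else 0.0) for s in arcs}   # alpha(S, 0)
    for x in obs:
        new = {s: 0.0 for s in arcs}
        for s in arcs:
            for label, s2, p in arcs[s]:
                if label == x:
                    new[s2] += alpha[s] * p       # sum instead of max
        alpha = new
    return sum(alpha.values())                    # sum over final states

# Two-state weather HMM from the slides (start in calm state c):
arcs = {"c": [("C", "c", 0.6), ("W", "c", 0.2), ("C", "w", 0.1), ("W", "w", 0.1)],
        "w": [("C", "c", 0.1), ("W", "c", 0.1), ("C", "w", 0.2), ("W", "w", 0.6)]}
px = forward(list("CCWW"), arcs, "c")
```

On the C, C, W, W example this gives P(x) = 0.0249 + 0.0592 = 0.0841, matching the α table on the example slide (0.025 + 0.059 after rounding).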
Viterbi vs. Forward
The goal:
P_best(x) = max_h P(h, x) = max_S α̂(S, T)
P(x) = ∑_h P(h, x) = ∑_S α(S, T)
The invariant:
α̂(S, t) = max_{S′ →[x_t] S} P(S′ →[x_t] S) × α̂(S′, t − 1)
α(S, t) = ∑_{S′ →[x_t] S} P(S′ →[x_t] S) × α(S′, t − 1)
Just replace all max's with sums (any semiring will do).
80 / 157
Example
Some data: C, C, W, W.
[HMM diagram from before.]

α     0      1      2      3      4
c   1.000  0.600  0.370  0.082  0.025
w   0.000  0.100  0.080  0.085  0.059

α(c, 2) = P(c →[C] c) × α(c, 1) + P(w →[C] c) × α(w, 1) = 0.6 × 0.6 + 0.1 × 0.1 = 0.37
81 / 157
Recap: The Forward Algorithm Can find total likelihood P ( x ) of observed . . . Using very similar algorithm to Viterbi algorithm. Just replace max’s with sums. Same time complexity. 82 / 157
Where Are We?
1. Computing the Best Path
2. Computing the Likelihood of Observations
3. Estimating Model Parameters
4. Discussion
83 / 157
Training the Parameters of an HMM
Given training data x . . . estimate the parameters of the model . . . to maximize the likelihood of the training data:
P(x) = ∑_h P(h, x)
84 / 157
What Are The Parameters?
One parameter for each arc:
[HMM diagram from before.]
Identify an arc by source S, destination S′, and label x: p_{S →[x] S′}.
Probs of arcs leaving the same state must sum to 1:
∑_{x, S′} p_{S →[x] S′} = 1 for all S
85 / 157
What Did We Do For Non-Hidden Again?
Likelihood of a single path: product of arc probabilities.
Log likelihood can be written as:
L(x_1^N) = ∑_{S →[x] S′} c(S →[x] S′) log p_{S →[x] S′}
Just depends on the counts c(S →[x] S′) of each arc.
Each source state corresponds to a multinomial . . . with nonoverlapping parameters.
ML estimation for multinomials: count and normalize!
p^{MLE}_{S →[x] S′} = c(S →[x] S′) / ∑_{x, S′} c(S →[x] S′)
86 / 157
Example: Non-Hidden Estimation
[State diagram annotated with counts and ML estimates, as in the earlier example: e.g. start → W with W/1 → W/1.000, the C self-loop with C/18 → C/0.692; state totals: start 1, R 22, W 6, C 26.]
87 / 157
How Do We Train Hidden Models? Hmmm, I know this one . . . 88 / 157
Review: The EM Algorithm General way to train parameters in hidden models . . . To optimize likelihood. Guaranteed to improve likelihood in each iteration. Only finds local optimum. Seeding matters. 89 / 157
The EM Algorithm
Initialize parameter values somehow.
For each iteration . . .
Expectation step: compute the posterior (count) of each h:
P̃(h | x) = P(h, x) / ∑_h P(h, x)
Maximization step: update parameters.
Instead of data x with unknown h, pretend . . . non-hidden data where . . . the (fractional) count of each (h, x) is P̃(h | x).
90 / 157
Applying EM to HMMs: The E Step
Compute the posterior (count) of each h:
P̃(h | x) = P(h, x) / ∑_h P(h, x)
How to compute the prob of a single path, P(h, x)? Multiply arc probabilities along the path.
How to compute the denominator? This is just the total likelihood of the observed, P(x):
P(x) = ∑_h P(h, x)
This looks vaguely familiar.
91 / 157
Applying EM to HMMs: The M Step
Non-hidden case: single path h with count 1.
Total count of an arc is its count in h: c(S →[x] S′) = c_h(S →[x] S′)
Normalize: p^{MLE}_{S →[x] S′} = c(S →[x] S′) / ∑_{x, S′} c(S →[x] S′)
Hidden case: every path h has count P̃(h | x).
Total count of an arc is a weighted sum . . . of the count of the arc in each h:
c(S →[x] S′) = ∑_h P̃(h | x) c_h(S →[x] S′)
Normalize as before.
92 / 157
What's the Problem?
Need to sum over an exponential number of h:
c(S →[x] S′) = ∑_h P̃(h | x) c_h(S →[x] S′)
If only we had an algorithm for doing this type of thing.
93 / 157
The Game Plan
Decompose the sum by time (i.e., position in x).
Find the count of each arc at each "time" t:
c(S →[x] S′) = ∑_{t=1}^T c(S →[x] S′, t) = ∑_{t=1}^T ∑_{h ∈ P(S →[x] S′, t)} P̃(h | x)
P(S →[x] S′, t) are paths where the arc at time t is S →[x] S′.
P(S →[x] S′, t) is empty if x ≠ x_t.
Otherwise, use dynamic programming to compute
c(S →[x_t] S′, t) ≡ ∑_{h ∈ P(S →[x_t] S′, t)} P̃(h | x)
94 / 157
Let's Rearrange Some
Recall we can compute P(x) using the Forward algorithm:
P̃(h | x) = P(h, x) / P(x)
Some paraphrasing:
c(S →[x_t] S′, t) = ∑_{h ∈ P(S →[x_t] S′, t)} P̃(h | x)
= (1 / P(x)) ∑_{h ∈ P(S →[x_t] S′, t)} P(h, x)
= (1 / P(x)) ∑_{p ∈ P(S →[x_t] S′, t)} P(p)
95 / 157
What We Need
Goal: sum over all paths p ∈ P(S →[x_t] S′, t) . . . whose arc at time t is S →[x_t] S′.
Let P_i(S, t) be the set of (initial) paths of length t . . . starting at start state S_0 and ending at S . . . consistent with observed x_1, …, x_t.
Let P_f(S, t) be the set of (final) paths of length T − t . . . starting at state S and ending at any state . . . consistent with observed x_{t+1}, …, x_T.
Then:
P(S →[x_t] S′, t) = P_i(S, t − 1) · (S →[x_t] S′) · P_f(S′, t)
96 / 157
Translating Path Sets to Probabilities
P(S →[x_t] S′, t) = P_i(S, t − 1) · (S →[x_t] S′) · P_f(S′, t)
Let α(S, t) = sum of likelihoods of paths of length t . . . starting at start state S_0 and ending at S.
Let β(S, t) = sum of likelihoods of paths of length T − t . . . starting at state S and ending at any state.
c(S →[x_t] S′, t) = (1 / P(x)) ∑_{p ∈ P(S →[x_t] S′, t)} P(p)
= (1 / P(x)) ∑_{p_i ∈ P_i(S, t−1), p_f ∈ P_f(S′, t)} P(p_i · (S →[x_t] S′) · p_f)
= (1 / P(x)) × p_{S →[x_t] S′} ∑_{p_i ∈ P_i(S, t−1)} P(p_i) ∑_{p_f ∈ P_f(S′, t)} P(p_f)
= (1 / P(x)) × p_{S →[x_t] S′} × α(S, t − 1) × β(S′, t)
97 / 157
Mini-Recap
To do ML estimation in the M step . . . need the count of each arc: c(S →[x] S′).
Decompose the count of the arc by time:
c(S →[x] S′) = ∑_{t=1}^T c(S →[x] S′, t)
Can compute the count at a time efficiently . . . if we have forward probabilities α(S, t) . . . and backward probabilities β(S, t):
c(S →[x_t] S′, t) = (1 / P(x)) × p_{S →[x_t] S′} × α(S, t − 1) × β(S′, t)
98 / 157
The Forward-Backward Algorithm (1 iter)
Apply the Forward algorithm to compute α(S, t), P(x).
Apply the Backward algorithm to compute β(S, t).
For each arc S →[x_t] S′ and time t . . . compute the posterior count of the arc at time t if x = x_t:
c(S →[x_t] S′, t) = (1 / P(x)) × p_{S →[x_t] S′} × α(S, t − 1) × β(S′, t)
Sum to get total counts for each arc:
c(S →[x] S′) = ∑_{t=1}^T c(S →[x] S′, t)
For each arc, find the ML estimate of the parameter:
p^{MLE}_{S →[x] S′} = c(S →[x] S′) / ∑_{x, S′} c(S →[x] S′)
99 / 157
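The E step above can be sketched directly from these formulas (same arc-list representation as the earlier sketches; the function is our illustration, not the lecture's code, and the weather-HMM probabilities are as read off the example diagram):

```python
def forward_backward_counts(obs, arcs, start):
    """One E step: posterior count of each arc c(S -x-> S'), summed over time."""
    T = len(obs)
    # Forward pass: alpha[t][S] = summed likelihood of length-t paths ending at S.
    alpha = [{s: (1.0 if s == start else 0.0) for s in arcs}]
    for x in obs:
        new = {s: 0.0 for s in arcs}
        for s in arcs:
            for label, s2, p in arcs[s]:
                if label == x:
                    new[s2] += alpha[-1][s] * p
        alpha.append(new)
    total = sum(alpha[T].values())                 # P(x)
    # Backward pass: beta[t][S] = summed likelihood of completions from S at time t.
    beta = [None] * T + [{s: 1.0 for s in arcs}]
    for t in range(T, 0, -1):
        beta[t - 1] = {s: 0.0 for s in arcs}
        for s in arcs:
            for label, s2, p in arcs[s]:
                if label == obs[t - 1]:
                    beta[t - 1][s] += p * beta[t][s2]
    # Posterior count of arc at time t: alpha(S, t-1) * p * beta(S', t) / P(x).
    counts = {}
    for t in range(1, T + 1):
        for s in arcs:
            for label, s2, p in arcs[s]:
                if label == obs[t - 1]:
                    key = (s, label, s2)
                    counts[key] = (counts.get(key, 0.0)
                                   + alpha[t - 1][s] * p * beta[t][s2] / total)
    return counts, total

arcs = {"c": [("C", "c", 0.6), ("W", "c", 0.2), ("C", "w", 0.1), ("W", "w", 0.1)],
        "w": [("C", "c", 0.1), ("W", "c", 0.1), ("C", "w", 0.2), ("W", "w", 0.6)]}
counts, total = forward_backward_counts(list("CCWW"), arcs, "c")
```

A useful sanity check: the posterior counts at each time step sum to one by construction, so over T observations they sum to T; normalizing the counts per source state then gives the M-step update.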
The Forward Algorithm
α(S, 0) = 1 for S = S_0, 0 otherwise.
For t = 1, …, T: For each state S:
α(S, t) = ∑_{S′ →[x_t] S} p_{S′ →[x_t] S} × α(S′, t − 1)
The end.
P(x) = ∑_h P(h, x) = ∑_S α(S, T)
100 / 157