
INF4820: Algorithms for Artificial Intelligence and Natural Language Processing
Hidden Markov Models
Murhaf Fares & Stephan Oepen, Language Technology Group (LTG)
October 27, 2016

Recap topics: probabilistic language models, basic probability.


  1. Labelled Sequences
     ◮ We are interested in the probability of sequences like:
       flies like the wind        or        flies like the wind
       NNS   VB   DT  NN                    VBZ   P    DT  NN
     ◮ In normal text, we see the words, but not the tags.
     ◮ Consider the POS tags to be the underlying skeleton of the sentence: unseen, but influencing the sentence's shape.
     ◮ A structure like this, consisting of a hidden state sequence and a related observation sequence, can be modelled as a Hidden Markov Model.

  2.–12. Hidden Markov Models
     The generative story (built up one step at a time across these slides): starting from ⟨S⟩, the model repeatedly chooses the next tag given the previous tag and emits a word from that tag, until it reaches ⟨/S⟩. For the sentence "the cat eats mice" with tags DT NN VBZ NNS:

     P(S, O) = P(DT | ⟨S⟩) P(the | DT) P(NN | DT) P(cat | NN) P(VBZ | NN) P(eats | VBZ) P(NNS | VBZ) P(mice | NNS) P(⟨/S⟩ | NNS)

  13.–15. Hidden Markov Models
     For a bi-gram HMM over an observation sequence O = o_1 … o_N:

     P(S, O) = ∏_{i=1..N+1} P(s_i | s_{i−1}) P(o_i | s_i)    where s_0 = ⟨S⟩ and s_{N+1} = ⟨/S⟩

     ◮ The transition probabilities P(s_i | s_{i−1}) model the probability of moving from state to state.
     ◮ The emission probabilities P(o_i | s_i) model the probability that a state emits a particular observation.
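     To make the factorisation concrete, here is a minimal sketch (mine, not from the slides) of computing the joint probability P(S, O) from transition and emission tables; the dictionaries trans and emit below hold purely hypothetical toy numbers, only to show the shape of the data.

        def joint_probability(states, observations, trans, emit, start="<s>", end="</s>"):
            """P(S, O) for a bigram HMM: product of transition and emission probabilities."""
            p = 1.0
            prev = start
            for tag, word in zip(states, observations):
                p *= trans.get((prev, tag), 0.0) * emit.get((tag, word), 0.0)
                prev = tag
            return p * trans.get((prev, end), 0.0)  # final transition into </s>

        # Hypothetical toy parameters, just to illustrate the table layout.
        trans = {("<s>", "DT"): 0.5, ("DT", "NN"): 0.6, ("NN", "</s>"): 0.3}
        emit = {("DT", "the"): 0.4, ("NN", "cat"): 0.01}
        print(joint_probability(["DT", "NN"], ["the", "cat"], trans, emit))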

  16.–19. Using HMMs
     The HMM models the process of generating the labelled sequence. We can use this model for a number of tasks:
     ◮ computing P(S, O), given S and O
     ◮ computing P(O), given O
     ◮ finding the S that maximises P(S | O), given O
     ◮ We can also learn the model parameters, given a set of observations.
     Our observations will be words (w_i), and our states PoS tags (t_i).

  20.–21. Estimation
     As so often in NLP, we learn an HMM from labelled data.
     Transition probabilities: based on a training corpus of previously tagged text, with tags as our states, the MLE can be computed from the counts of observed tags:

     P(t_i | t_{i−1}) = C(t_{i−1}, t_i) / C(t_{i−1})

     Emission probabilities: computed from relative frequencies in the same way, with the words as observations:

     P(w_i | t_i) = C(t_i, w_i) / C(t_i)
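     A minimal sketch of these counts in code (not from the slides); it assumes a hypothetical input format in which a tagged corpus is a list of sentences, each a list of (word, tag) pairs.

        from collections import Counter

        def estimate_hmm(tagged_corpus, start="<s>", end="</s>"):
            """MLE transition and emission probabilities from a tagged corpus."""
            tag_counts, trans_counts, emit_counts = Counter(), Counter(), Counter()
            for sentence in tagged_corpus:            # sentence: list of (word, tag) pairs
                prev = start
                tag_counts[start] += 1
                for word, tag in sentence:
                    trans_counts[(prev, tag)] += 1
                    emit_counts[(tag, word)] += 1
                    tag_counts[tag] += 1
                    prev = tag
                trans_counts[(prev, end)] += 1
            trans = {bigram: c / tag_counts[bigram[0]] for bigram, c in trans_counts.items()}
            emit = {pair: c / tag_counts[pair[0]] for pair, c in emit_counts.items()}
            return trans, emit

        # Hypothetical toy corpus of a single tagged sentence.
        trans, emit = estimate_hmm([[("the", "DT"), ("cat", "NN"), ("eats", "VBZ"), ("mice", "NNS")]])
        print(trans[("<s>", "DT")], emit[("NN", "cat")])   # both 1.0 in this tiny example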

  22.–25. Implementation Issues

     P(S, O) = P(s_1 | ⟨S⟩) P(o_1 | s_1) P(s_2 | s_1) P(o_2 | s_2) P(s_3 | s_2) P(o_3 | s_3) …
             = 0.0429 × 0.0031 × 0.0044 × 0.0001 × 0.0072 × …

     ◮ Multiplying many small probabilities → underflow
     ◮ Solution: work in log(arithmic) space:
       ◮ log(AB) = log(A) + log(B)
       ◮ hence P(A) P(B) = exp(log(A) + log(B))
       ◮ log(P(S, O)) = −1.368 + −2.509 + −2.357 + −4 + −2.143 + …
     The issues related to MLE / smoothing that we discussed for n-gram models also apply here.
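     As a small illustration (mine, not the slides'), the same product computed naively and in log space; the probability values are the ones shown on the slide, logged in base 10 to match its numbers.

        import math

        # Probability factors from the slide.
        probs = [0.0429, 0.0031, 0.0044, 0.0001, 0.0072]

        # Naive product: fine for five factors, but underflows for very long sequences.
        naive = 1.0
        for p in probs:
            naive *= p

        # Log space: sum the logs instead of multiplying the probabilities.
        log_total = sum(math.log10(p) for p in probs)   # ≈ -1.368 + -2.509 + -2.357 + -4 + -2.143
        print(naive, 10 ** log_total)                   # the two agree up to floating-point error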

  26.–27. Ice Cream and Global Warming
     Missing records of weather in Baltimore for Summer 2007:
     ◮ Jason likes to eat ice cream.
     ◮ He records his daily ice cream consumption in his diary.
     ◮ The number of ice creams he ate was influenced, but not entirely determined, by the weather.
     ◮ Today's weather is partially predictable from yesterday's.
     A Hidden Markov Model! with:
     ◮ Hidden states: {H, C} (plus pseudo-states ⟨S⟩ and ⟨/S⟩)
     ◮ Observations: {1, 2, 3}

  28. Ice Cream and Global Warming
     (State-transition diagram on the slide; its parameters are:)
     Transitions: P(H | ⟨S⟩) = 0.8, P(C | ⟨S⟩) = 0.2, P(H | H) = 0.6, P(C | H) = 0.2, P(H | C) = 0.3, P(C | C) = 0.5, P(⟨/S⟩ | H) = 0.2, P(⟨/S⟩ | C) = 0.2
     Emissions: P(1 | H) = 0.2, P(2 | H) = 0.4, P(3 | H) = 0.4, P(1 | C) = 0.5, P(2 | C) = 0.4, P(3 | C) = 0.1
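     Written out as data (a sketch of mine, reusing the hypothetical joint_probability helper above), the same parameters look like this; the final line reproduces the joint probability of the path ⟨S⟩ H H H ⟨/S⟩ for observations 3 1 3 that appears on the decoding slides below.

        # Transition probabilities P(next | previous), including the pseudo-states.
        trans = {
            ("<s>", "H"): 0.8, ("<s>", "C"): 0.2,
            ("H", "H"): 0.6, ("H", "C"): 0.2,
            ("C", "H"): 0.3, ("C", "C"): 0.5,
            ("H", "</s>"): 0.2, ("C", "</s>"): 0.2,
        }
        # Emission probabilities P(observation | state).
        emit = {
            ("H", 1): 0.2, ("H", 2): 0.4, ("H", 3): 0.4,
            ("C", 1): 0.5, ("C", 2): 0.4, ("C", 3): 0.1,
        }

        print(joint_probability(["H", "H", "H"], [3, 1, 3], trans, emit))  # 0.0018432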

  29. Using HMMs
     The HMM models the process of generating the labelled sequence. We can use this model for a number of tasks:
     ◮ computing P(S, O), given S and O
     ◮ computing P(O), given O
     ◮ finding the S that maximises P(S | O), given O
     ◮ computing P(s_x | O), given O
     ◮ We can also learn the model parameters, given a set of observations.

  30.–33. Part-of-Speech Tagging
     We want to find the tag sequence, given a word sequence. With tags as our states and words as our observations, we know:

     P(S, O) = ∏_{i=1..N+1} P(s_i | s_{i−1}) P(o_i | s_i)

     We want: P(S | O) = P(S, O) / P(O)

     Actually, we want the state sequence Ŝ that maximises P(S | O):

     Ŝ = argmax_S P(S, O) / P(O)

     Since P(O) is always the same, we can drop the denominator:

     Ŝ = argmax_S P(S, O)

  34.–46. Decoding
     Task: what is the most likely state sequence S, given an observation sequence O and an HMM?

     HMM:
     P(H | ⟨S⟩) = 0.8     P(C | ⟨S⟩) = 0.2
     P(H | H) = 0.6       P(C | H) = 0.2
     P(H | C) = 0.3       P(C | C) = 0.5
     P(⟨/S⟩ | H) = 0.2    P(⟨/S⟩ | C) = 0.2
     P(1 | H) = 0.2       P(1 | C) = 0.5
     P(2 | H) = 0.4       P(2 | C) = 0.4
     P(3 | H) = 0.4       P(3 | C) = 0.1

     If O = 3 1 3, the joint probabilities of the eight possible state sequences are:
     ⟨S⟩ H H H ⟨/S⟩   0.0018432
     ⟨S⟩ H H C ⟨/S⟩   0.0001536
     ⟨S⟩ H C H ⟨/S⟩   0.0007680
     ⟨S⟩ H C C ⟨/S⟩   0.0003200
     ⟨S⟩ C H H ⟨/S⟩   0.0000576
     ⟨S⟩ C H C ⟨/S⟩   0.0000048
     ⟨S⟩ C C H ⟨/S⟩   0.0001200
     ⟨S⟩ C C C ⟨/S⟩   0.0000500
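     This table can be reproduced by brute-force enumeration; here is a small sketch of mine doing exactly that, reusing the hypothetical joint_probability helper and the trans/emit dictionaries defined above.

        from itertools import product

        obs = [3, 1, 3]
        scored = {seq: joint_probability(list(seq), obs, trans, emit)
                  for seq in product("HC", repeat=len(obs))}

        for seq, p in sorted(scored.items(), key=lambda kv: -kv[1]):
            print(" ".join(seq), f"{p:.7f}")
        # The best sequence is H H H, with joint probability 0.0018432.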

  47.–50. Dynamic Programming
     For (only) two states and a (short) observation sequence of length three, comparing all possible sequences is workable, but …
     ◮ for N observations and L states, there are L^N sequences
     ◮ we do the same partial calculations over and over again
     Dynamic programming:
     ◮ records sub-problem solutions for further re-use
     ◮ useful when a complex problem can be described recursively
     ◮ examples: Dijkstra's shortest path, minimum edit distance, longest common subsequence, the Viterbi algorithm
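     As a tiny illustration of the idea (mine, not part of the slides), a recursive minimum edit distance, one of the examples listed above, where each sub-problem result is cached and re-used:

        from functools import lru_cache

        def edit_distance(a: str, b: str) -> int:
            """Minimum number of insertions, deletions and substitutions turning a into b."""
            @lru_cache(maxsize=None)           # records sub-problem solutions for re-use
            def d(i: int, j: int) -> int:
                if i == 0:
                    return j
                if j == 0:
                    return i
                cost = 0 if a[i - 1] == b[j - 1] else 1
                return min(d(i - 1, j) + 1,         # deletion
                           d(i, j - 1) + 1,         # insertion
                           d(i - 1, j - 1) + cost)  # substitution (or match)
            return d(len(a), len(b))

        print(edit_distance("intention", "execution"))  # 5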

  51.–54. Viterbi Algorithm
     Recall our problem: maximise
     P(s_1 … s_n | o_1 … o_n) = P(s_1 | s_0) P(o_1 | s_1) P(s_2 | s_1) P(o_2 | s_2) …

     Our recursive sub-problem:

     v_i(x) = max_{k=1..L} [ v_{i−1}(k) · P(x | k) · P(o_i | x) ]

     The variable v_i(x) represents the maximum probability that the i-th state is x, given that we have seen O_1^i.
     At each step, we record backpointers showing which previous state led to the maximum probability.

  55.–72. An Example of the Viterbi Algorithm
     (Trellis diagram on the slides, filled in column by column for O = 3 1 3, i.e. o_1 o_2 o_3:)

     v_1(H) = P(H | ⟨S⟩) P(3 | H) = 0.8 · 0.4 = 0.32
     v_1(C) = P(C | ⟨S⟩) P(3 | C) = 0.2 · 0.1 = 0.02

     v_2(H) = max(0.32 · 0.12, 0.02 · 0.06) = 0.0384      (backpointer: H)
     v_2(C) = max(0.32 · 0.1, 0.02 · 0.25) = 0.032        (backpointer: H)

     v_3(H) = max(0.0384 · 0.24, 0.032 · 0.12) = 0.009216   (backpointer: H)
     v_3(C) = max(0.0384 · 0.02, 0.032 · 0.05) = 0.0016     (backpointer: C)

     v_f(⟨/S⟩) = max(0.009216 · 0.2, 0.0016 · 0.2) = 0.0018432

     Following the backpointers from ⟨/S⟩ gives the best path H H H.

  73. Pseudocode for the Viterbi Algorithm
     Input: observations of length N, state set of size L
     Output: best-path

     create a path probability matrix viterbi[N, L+2]
     create a path backpointer matrix backpointer[N, L+2]
     for each state s from 1 to L do
         viterbi[1, s] ← trans(⟨S⟩, s) × emit(o_1, s)
         backpointer[1, s] ← 0
     end
     for each time step i from 2 to N do
         for each state s from 1 to L do
             viterbi[i, s] ← max_{s′=1..L} viterbi[i−1, s′] × trans(s′, s) × emit(o_i, s)
             backpointer[i, s] ← argmax_{s′=1..L} viterbi[i−1, s′] × trans(s′, s)
         end
     end
     viterbi[N, L+1] ← max_{s=1..L} viterbi[N, s] × trans(s, ⟨/S⟩)
     backpointer[N, L+1] ← argmax_{s=1..L} viterbi[N, s] × trans(s, ⟨/S⟩)
     return the path obtained by following backpointers from backpointer[N, L+1]
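     A runnable sketch of the same algorithm in Python (my wording, not the slides'), using the hypothetical trans/emit dictionaries for the ice cream model defined earlier; on O = 3 1 3 it reproduces the path and probability from the worked example.

        def viterbi(observations, states, trans, emit, start="<s>", end="</s>"):
            """Most likely state sequence and its joint probability with the observations."""
            # v[i][s]: best probability of any path ending in state s after observation i.
            v = [{s: trans.get((start, s), 0.0) * emit.get((s, observations[0]), 0.0)
                  for s in states}]
            backpointers = [{s: None for s in states}]

            for o in observations[1:]:
                prev = v[-1]
                column, pointers = {}, {}
                for s in states:
                    best_prev = max(states, key=lambda k: prev[k] * trans.get((k, s), 0.0))
                    column[s] = prev[best_prev] * trans.get((best_prev, s), 0.0) * emit.get((s, o), 0.0)
                    pointers[s] = best_prev
                v.append(column)
                backpointers.append(pointers)

            # Final transition into the end pseudo-state.
            last = max(states, key=lambda s: v[-1][s] * trans.get((s, end), 0.0))
            best_prob = v[-1][last] * trans.get((last, end), 0.0)

            # Follow the backpointers to recover the best path.
            path = [last]
            for pointers in reversed(backpointers[1:]):
                path.append(pointers[path[-1]])
            return list(reversed(path)), best_prob

        print(viterbi([3, 1, 3], ["H", "C"], trans, emit))  # (['H', 'H', 'H'], ≈0.0018432)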

  74. Diversion: Complexity and O(N)
     Big-O notation describes the complexity of an algorithm:
     ◮ it describes the worst-case order of growth in terms of the size of the input
     ◮ only the largest-order term is represented
     ◮ constant factors are ignored
     ◮ it is determined by looking at the loops in the code
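     For instance (my observation, applying the loop-counting rule above to the Viterbi pseudocode), counting the nested loops gives the order of growth directly:

        def viterbi_step_count(num_observations: int, num_states: int) -> int:
            """Count the inner-loop evaluations performed by the Viterbi recursion."""
            steps = 0
            for _ in range(num_observations):       # one trellis column per observation: N
                for _ in range(num_states):         # one cell per state: L
                    for _ in range(num_states):     # one candidate predecessor per cell: L
                        steps += 1
            return steps                            # N * L * L, i.e. O(N * L^2)

        # Far fewer steps than the O(L^N) sequences examined by brute-force enumeration.
        print(viterbi_step_count(20, 10), 10 ** 20)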
