Hidden Markov Models


Machine Learning for Signal Processing: Hidden Markov Models. Bhiksha Raj, 10 Nov 2016. 11755/18797.

Prediction: a holy grail. Physical trajectories (automobiles, rockets, heavenly bodies); natural phenomena (weather); financial.


  1. Probability that the HMM will follow a particular state sequence

     P(s_1, s_2, s_3, ...) = P(s_1) P(s_2 | s_1) P(s_3 | s_2) ...

   • P(s_1) is the probability that the process will initially be in state s_1
   • P(s_j | s_i) is the transition probability of moving to state s_j at the next time instant when the system is currently in s_i
     – Also denoted by T_ij earlier
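The product rule above is easy to check numerically. A minimal sketch with a hypothetical 2-state HMM (states indexed 0 and 1; all probabilities below are made-up illustrative values, not from the slides):

```python
import numpy as np

# Hypothetical 2-state HMM (illustrative numbers only):
P_init = np.array([1.0, 0.0])      # P(s_1) for each state
T = np.array([[0.7, 0.3],          # T[i, j] = P(next state = j | current state = i)
              [0.4, 0.6]])

def state_sequence_prob(seq, P_init, T):
    """P(s_1, s_2, s_3, ...) = P(s_1) P(s_2 | s_1) P(s_3 | s_2) ..."""
    p = P_init[seq[0]]
    for prev, nxt in zip(seq, seq[1:]):
        p *= T[prev, nxt]
    return p
```

For instance, `state_sequence_prob([0, 0, 1], P_init, T)` multiplies 1.0, 0.7 and 0.3, matching the chain of factors in the equation.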

  2. Generating Observations from States
     [diagram: HMM assumed to be generating data: state sequence, state distributions, observation sequence]
   • At each time it generates an observation from the state it is in at that time

  3. Probability that the HMM will generate a particular observation sequence given a state sequence (state sequence known)

     P(o_1, o_2, o_3, ... | s_1, s_2, s_3, ...) = P(o_1 | s_1) P(o_2 | s_2) P(o_3 | s_3) ...

   • P(o_i | s_i) is the probability of generating observation o_i when the system is in state s_i
     – Computed from the Gaussian or Gaussian mixture for that state

  4. Proceeding through States and Producing Observations
     [diagram: HMM assumed to be generating data: state sequence, state distributions, observation sequence]
   • At each time it produces an observation and makes a transition

  5. Probability that the HMM will generate a particular state sequence and, from it, a particular observation sequence

     P(o_1, o_2, o_3, ..., s_1, s_2, s_3, ...)
       = P(o_1, o_2, o_3, ... | s_1, s_2, s_3, ...) P(s_1, s_2, s_3, ...)
       = P(o_1 | s_1) P(o_2 | s_2) P(o_3 | s_3) ... P(s_1) P(s_2 | s_1) P(s_3 | s_2) ...

  6. Probability of Generating an Observation Sequence
   • The precise state sequence is not known
   • All possible state sequences must be considered

     P(o_1, o_2, o_3, ...) = Σ_{all possible state sequences} P(o_1, o_2, o_3, ..., s_1, s_2, s_3, ...)
                           = Σ_{all possible state sequences} P(o_1 | s_1) P(o_2 | s_2) P(o_3 | s_3) ... P(s_1) P(s_2 | s_1) P(s_3 | s_2) ...
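The summation over all state sequences can be written down directly for tiny problems. A sketch with a hypothetical 2-state, discrete-observation HMM (all numbers made up; `B[s, o]` stands in for P(o | s)):

```python
import itertools
import numpy as np

# Hypothetical discrete-observation HMM (illustrative numbers only):
P_init = np.array([0.8, 0.2])
T = np.array([[0.6, 0.4],
              [0.3, 0.7]])
B = np.array([[0.9, 0.1],   # B[s, o] = P(o | s)
              [0.2, 0.8]])

def brute_force_obs_prob(obs, P_init, T, B):
    """Sum P(obs, states) over every possible state sequence."""
    total = 0.0
    for states in itertools.product(range(len(P_init)), repeat=len(obs)):
        p = P_init[states[0]] * B[states[0], obs[0]]
        for t in range(1, len(obs)):
            p *= T[states[t - 1], states[t]] * B[states[t], obs[t]]
        total += p
    return total
```

With N states and a length-T observation this enumerates N^T sequences, which is exactly why the next slide introduces the forward algorithm.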

  7. Computing it Efficiently
   • Explicit summing over all state sequences is not tractable: there is a very large number of possible state sequences
   • Instead we use the forward algorithm, a dynamic programming technique

  8. Illustrative Example
   • Example: a generic HMM with 5 states and a "terminating state"
     – Left-to-right topology
   • P(s_i) = 1 for state 1 and 0 for the others
     – The arrows represent transitions for which the probability is not 0
   • Notation:
     – P(s_j | s_i) = T_ij
     – We represent P(o_t | s_i) = b_i(t) for brevity

  9. Diversion: The Trellis
     [figure: trellis with state index on the Y axis, feature vectors (time) on the X axis, and a node marked α(s, t)]
   • The trellis is a graphical representation of all possible paths through the HMM to produce a given observation
   • The Y axis represents HMM states, the X axis represents observations
   • Every edge in the graph represents a valid transition in the HMM over a single time step
   • Every node represents the event of a particular observation being generated from a particular state

  10. The Forward Algorithm

      α(s, t) = P(x_1, x_2, ..., x_t, state(t) = s)

   • α(s, t) is the total probability of ALL state sequences that end at state s at time t, and all observations until x_t

  11. The Forward Algorithm

      α(s, t) = P(x_1, x_2, ..., x_t, state(t) = s)

      α(s, t) = Σ_{s'} α(s', t-1) P(s | s') P(x_t | s)

   • α(s, t) can be recursively computed in terms of α(s', t-1), the forward probabilities at time t-1
   • The recursion is started from the first time instant (forward recursion)

  12. The Forward Algorithm

      TotalProb = Σ_s α(s, T)

   • At the final observation, the alpha at each state gives the probability of all state sequences ending at that state
   • General model: the total probability of the observation is the sum of the alpha values at all states

  13. The absorbing state
   • Observation sequences are assumed to end only when the process arrives at an absorbing state
     – No observations are produced from the absorbing state

  14. The Forward Algorithm

      TotalProb = α(s_absorbing, T+1)

      α(s_absorbing, T+1) = Σ_{s'} α(s', T) P(s_absorbing | s')

   • Absorbing-state model: the total probability is the alpha computed at the absorbing state after the final observation
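The forward recursion for the general (non-absorbing) model can be sketched in a few lines. The HMM below is hypothetical, with discrete observations so that P(x_t | s) is a table lookup (the slides use Gaussians instead); all numbers are made up:

```python
import numpy as np

# Hypothetical discrete-observation HMM (illustrative numbers only):
P_init = np.array([1.0, 0.0])
T = np.array([[0.7, 0.3],
              [0.4, 0.6]])
B = np.array([[0.9, 0.1],   # B[s, o] = P(o | s)
              [0.2, 0.8]])

def forward(obs, P_init, T, B):
    """alpha[t, s] = P(x_1..x_t, state(t) = s); total prob = alpha[-1].sum()."""
    alpha = np.zeros((len(obs), len(P_init)))
    alpha[0] = P_init * B[:, obs[0]]
    for t in range(1, len(obs)):
        # alpha(s, t) = sum_{s'} alpha(s', t-1) P(s | s') P(x_t | s)
        alpha[t] = (alpha[t - 1] @ T) * B[:, obs[t]]
    return alpha

alpha = forward([0, 1, 1], P_init, T, B)
total_prob = alpha[-1].sum()
```

Each step is one matrix-vector product, so the cost is O(T N^2) rather than O(N^T) for explicit enumeration.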

  15. Problem 2: State segmentation
   • Given only a sequence of observations, how do we determine which sequence of states was followed in producing it?

  16. The HMM as a generator
     [diagram: HMM assumed to be generating data: state sequence, state distributions, observation sequence]
   • The process goes through a series of states and produces observations from them

  17. States are hidden
     [diagram: HMM assumed to be generating data: state sequence, state distributions, observation sequence]
   • The observations do not reveal the underlying state

  18. The state segmentation problem
     [diagram: HMM assumed to be generating data: state sequence, state distributions, observation sequence]
   • State segmentation: estimate the state sequence given the observations

  19. Estimating the State Sequence
   • Many different state sequences are capable of producing the observation
   • Solution: identify the most probable state sequence
     – The state sequence for which the probability of progressing through that sequence and generating the observation sequence is maximum
     – i.e. the sequence for which P(o_1, o_2, o_3, ..., s_1, s_2, s_3, ...) is maximum

  20. Estimating the state sequence
   • Once again, exhaustive evaluation is impossibly expensive
   • But once again a simple dynamic-programming solution is available

      P(o_1, o_2, o_3, ..., s_1, s_2, s_3, ...)
        = P(o_1 | s_1) P(o_2 | s_2) P(o_3 | s_3) ... P(s_1) P(s_2 | s_1) P(s_3 | s_2) ...

   • Needed:
      argmax_{s_1, s_2, s_3, ...} P(o_1 | s_1) P(s_1) P(o_2 | s_2) P(s_2 | s_1) P(o_3 | s_3) P(s_3 | s_2) ...


  22. The HMM as a generator
   • In the expression above, each enclosed term, P(o_t | s_t) P(s_t | s_{t-1}), represents one forward transition and a subsequent emission

  23. The state sequence
   • The probability of a state sequence ?, ?, ?, ?, s_x, s_y ending at time t and producing all observations until o_t:
     – P(o_1..t-1, ?, ?, ?, ?, s_x, o_t, s_y) = P(o_1..t-1, ?, ?, ?, ?, s_x) P(o_t | s_y) P(s_y | s_x)
   • The best state sequence that ends with s_x, s_y at t will have a probability equal to the probability of the best state sequence ending at t-1 at s_x, times P(o_t | s_y) P(s_y | s_x)

  24. Extending the state sequence
     [diagram: states s_x, s_y in the state sequence at time t, with state distributions and the observation sequence]
   • The probability of a state sequence ?, ?, ?, ?, s_x, s_y ending at time t and producing observations until o_t:
     – P(o_1..t-1, o_t, ?, ?, ?, ?, s_x, s_y) = P(o_1..t-1, ?, ?, ?, ?, s_x) P(o_t | s_y) P(s_y | s_x)

  25. Trellis
     [figure: trellis showing the set of all possible state sequences through this HMM in five time instants]

  26. The cost of extending a state sequence
   • The cost of extending a state sequence ending at s_x is only dependent on the transition from s_x to s_y, and the observation probability at s_y:

      P(o_t | s_y) P(s_y | s_x)

  27. The cost of extending a state sequence
   • The best path to s_y through s_x is simply an extension of the best path to s_x:

      BestP(o_1..t-1, ?, ?, ?, ?, s_x) P(o_t | s_y) P(s_y | s_x)

  28. The Recursion
   • The overall best path to s_y is an extension of the best path to one of the states at the previous time

  29. The Recursion

      Prob. of best path to s_y = max_{s_x} [ BestP(o_1..t-1, ?, ?, ?, ?, s_x) P(o_t | s_y) P(s_y | s_x) ]

  30. Finding the best state sequence
   • The simple algorithm just presented is called the VITERBI algorithm in the literature
     – After A. J. Viterbi, who invented this dynamic programming algorithm for a completely different purpose: decoding error-correction codes!

  31. Viterbi Search (contd.)
   • The initial state is initialized with path score P_1(1) = P(s_1) b_1(1)
   • In this example all other states have score 0, since P(s_i) = 0 for them

  32. Viterbi Search (contd.)

      P_j(t) = max_i [ P_i(t-1) t_ij b_j(t) ]

      where t_ij is the state transition probability from i to j, b_j(t) is the score for state j given the input at time t, and P_j(t) is the total path score ending up at state j at time t

  33. Viterbi Search (contd.)
     [animation: slides 33 through 40 advance the search one time step at a time, applying the recursion P_j(t) = max_i [ P_i(t-1) t_ij b_j(t) ] at each step and keeping the best incoming path to every state]

  41. Viterbi Search (contd.)
   • The best state sequence is the estimate of the state sequence followed in generating the observation
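The Viterbi search just illustrated can be sketched compactly. As in the earlier forward sketch, the HMM below is a hypothetical discrete-observation model (made-up numbers), with `B[s, o]` standing in for b_s(t):

```python
import numpy as np

# Hypothetical discrete-observation HMM (illustrative numbers only):
P_init = np.array([1.0, 0.0])
T = np.array([[0.7, 0.3],
              [0.4, 0.6]])
B = np.array([[0.9, 0.1],
              [0.2, 0.8]])

def viterbi(obs, P_init, T, B):
    """Return the most probable state sequence and its probability."""
    score = P_init * B[:, obs[0]]        # P_j(1) = P(s_j) b_j(1)
    back = []                            # backpointers, one array per step
    for o in obs[1:]:
        # P_j(t) = max_i [ P_i(t-1) t_ij ] * b_j(t)
        cand = score[:, None] * T        # cand[i, j] = P_i(t-1) t_ij
        back.append(cand.argmax(axis=0))
        score = cand.max(axis=0) * B[:, o]
    # Trace back from the best final state
    path = [int(score.argmax())]
    for bp in reversed(back):
        path.append(int(bp[path[-1]]))
    path.reverse()
    return path, score.max()
```

The backpointers are what make the final trace-back possible: each one records, for every state, which predecessor gave the best incoming path.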

  42. Problem 3: Training HMM parameters
   • We can compute the probability of an observation, and the best state sequence given an observation, using the HMM's parameters
   • But where do the HMM parameters come from?
   • They must be learned from a collection of observation sequences

  43. Learning HMM parameters: simple procedure by counting
   • Given a set of training instances
   • Iteratively:
     1. Initialize the HMM parameters
     2. Segment all training instances into states
     3. Estimate the transition probabilities and state output probability parameters by counting

  44. Learning by counting: example
   • Explanation by example in the next few slides
   • 2-state HMM, Gaussian PDF at each state, 3 observation sequences
   • The example shows ONE iteration: how to count after the state sequences are obtained

  45. Example: Learning HMM Parameters
   • We have an HMM with two states S1 and S2
   • Observations are vectors x_ij (i-th sequence, j-th vector)
   • We are given the following three observation sequences, and have already estimated the state sequences:

     Observation 1:
       Time:  1    2    3    4    5    6    7    8    9    10
       State: S1   S1   S2   S2   S2   S1   S1   S2   S1   S1
       Obs:   Xa1  Xa2  Xa3  Xa4  Xa5  Xa6  Xa7  Xa8  Xa9  Xa10

     Observation 2:
       Time:  1    2    3    4    5    6    7    8    9
       State: S2   S2   S1   S1   S2   S2   S2   S2   S1
       Obs:   Xb1  Xb2  Xb3  Xb4  Xb5  Xb6  Xb7  Xb8  Xb9

     Observation 3:
       Time:  1    2    3    4    5    6    7    8
       State: S1   S2   S1   S1   S1   S2   S2   S2
       Obs:   Xc1  Xc2  Xc3  Xc4  Xc5  Xc6  Xc7  Xc8

  46. Example: Learning HMM Parameters
   • Initial state probabilities (usually denoted as π):
     – We have 3 observation sequences
     – 2 of these begin with S1, and one with S2
     – π(S1) = 2/3, π(S2) = 1/3

  47. Example: Learning HMM Parameters
   • Transition probabilities out of S1:
     – State S1 occurs 11 times in non-terminal locations
     – Of these, it is followed immediately by S1 6 times
     – It is followed immediately by S2 5 times
     – P(S1 | S1) = 6/11; P(S2 | S1) = 5/11

  51. Example: Learning HMM Parameters
   • Transition probabilities out of S2:
     – State S2 occurs 13 times in non-terminal locations
     – Of these, it is followed immediately by S1 5 times
     – It is followed immediately by S2 8 times
     – P(S1 | S2) = 5/13; P(S2 | S2) = 8/13

  55. Parameters learnt so far
   • State initial probabilities, often denoted as π:
     – π(S1) = 2/3 ≈ 0.667
     – π(S2) = 1/3 ≈ 0.333
   • State transition probabilities:
     – P(S1 | S1) = 6/11 ≈ 0.545; P(S2 | S1) = 5/11 ≈ 0.455
     – P(S1 | S2) = 5/13 ≈ 0.385; P(S2 | S2) = 8/13 ≈ 0.615
   • Represented as a transition matrix:

       A = [ P(S1|S1)  P(S2|S1) ]  =  [ 0.545  0.455 ]
           [ P(S1|S2)  P(S2|S2) ]     [ 0.385  0.615 ]

     Each row of this matrix must sum to 1.0
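The counts on the preceding slides can be re-derived mechanically from the three segmented state sequences; a short sketch of that counting:

```python
from collections import Counter

# The three segmented state sequences given in the example:
seqs = [
    ["S1","S1","S2","S2","S2","S1","S1","S2","S1","S1"],   # observation 1
    ["S2","S2","S1","S1","S2","S2","S2","S2","S1"],        # observation 2
    ["S1","S2","S1","S1","S1","S2","S2","S2"],             # observation 3
]

# Initial-state probability: fraction of sequences starting in S1
pi_S1 = sum(s[0] == "S1" for s in seqs) / len(seqs)

# Transition counts over consecutive pairs in every sequence
trans = Counter((a, b) for s in seqs for a, b in zip(s, s[1:]))
n_from_S1 = trans[("S1", "S1")] + trans[("S1", "S2")]      # 11 non-terminal S1s
n_from_S2 = trans[("S2", "S1")] + trans[("S2", "S2")]      # 13 non-terminal S2s
P_S1_S1 = trans[("S1", "S1")] / n_from_S1                  # 6/11
P_S2_S2 = trans[("S2", "S2")] / n_from_S2                  # 8/13
```

Running this reproduces the slide's numbers: 6 and 5 transitions out of S1, 5 and 8 out of S2, and π(S1) = 2/3.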

  56. Example: Learning HMM Parameters
   • State output probability for S1
     – There are 13 observations in S1

  57. Example: Learning HMM Parameters
   • State output probability for S1
     – There are 13 observations in S1
     – Segregate them out and count
   • Compute the parameters (mean μ1 and variance Θ1) of the Gaussian output density for state S1:

       P(X | S1) = 1 / sqrt((2π)^d |Θ1|) exp(-0.5 (X - μ1)^T Θ1^{-1} (X - μ1))

     The vectors segmented into S1 are:
       From observation 1 (times 1, 2, 6, 7, 9, 10): Xa1, Xa2, Xa6, Xa7, Xa9, Xa10
       From observation 2 (times 3, 4, 9):           Xb3, Xb4, Xb9
       From observation 3 (times 1, 3, 4, 5):        Xc1, Xc3, Xc4, Xc5

       μ1 = (1/13) (Xa1 + Xa2 + Xa6 + Xa7 + Xa9 + Xa10 + Xb3 + Xb4 + Xb9 + Xc1 + Xc3 + Xc4 + Xc5)

       Θ1 = (1/13) [ (Xa1 - μ1)(Xa1 - μ1)^T + (Xa2 - μ1)(Xa2 - μ1)^T + ... + (Xc5 - μ1)(Xc5 - μ1)^T ]

  58. Example: Learning HMM Parameters
   • State output probability for S2
     – There are 14 observations in S2

  59. Example: Learning HMM Parameters
   • State output probability for S2
     – There are 14 observations in S2
     – Segregate them out and count
   • Compute the parameters (mean μ2 and variance Θ2) of the Gaussian output density for state S2:

       P(X | S2) = 1 / sqrt((2π)^d |Θ2|) exp(-0.5 (X - μ2)^T Θ2^{-1} (X - μ2))

     The vectors segmented into S2 are:
       From observation 1 (times 3, 4, 5, 8):       Xa3, Xa4, Xa5, Xa8
       From observation 2 (times 1, 2, 5, 6, 7, 8): Xb1, Xb2, Xb5, Xb6, Xb7, Xb8
       From observation 3 (times 2, 6, 7, 8):       Xc2, Xc6, Xc7, Xc8

       μ2 = (1/14) (Xa3 + Xa4 + Xa5 + Xa8 + Xb1 + Xb2 + Xb5 + Xb6 + Xb7 + Xb8 + Xc2 + Xc6 + Xc7 + Xc8)

       Θ2 = (1/14) [ (Xa3 - μ2)(Xa3 - μ2)^T + ... + (Xc8 - μ2)(Xc8 - μ2)^T ]
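The per-state Gaussian fit is just a sample mean and a maximum-likelihood covariance over the segregated vectors. A sketch using random stand-ins for the actual X_ij vectors (which the slides never give numerically); only the estimation formulas matter here:

```python
import numpy as np

# Stand-ins for the 13 d=4 vectors segmented into S1 (hypothetical data):
rng = np.random.default_rng(0)
X_s1 = rng.normal(size=(13, 4))

# mu = (1/N) sum_k X_k
mu1 = X_s1.mean(axis=0)

# Theta = (1/N) sum_k (X_k - mu)(X_k - mu)^T
diff = X_s1 - mu1
Theta1 = (diff.T @ diff) / len(X_s1)
```

Note the 1/N (not 1/(N-1)) normalization: this is the maximum-likelihood covariance estimate, matching the slide's formula.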

  60. We have learnt all the HMM parameters
   • State initial probabilities, often denoted as π:
     – π(S1) = 2/3 ≈ 0.667; π(S2) = 1/3 ≈ 0.333
   • State transition probabilities:

       A = [ 0.545  0.455 ]
           [ 0.385  0.615 ]

   • State output probabilities:

       P(X | S1) = 1 / sqrt((2π)^d |Θ1|) exp(-0.5 (X - μ1)^T Θ1^{-1} (X - μ1))
       P(X | S2) = 1 / sqrt((2π)^d |Θ2|) exp(-0.5 (X - μ2)^T Θ2^{-1} (X - μ2))

  61. Update rules at each iteration

      π(s_i) = (no. of observation sequences that start at state s_i) / (total no. of observation sequences)

      P(s_j | s_i) = Σ_obs Σ_{t : state(t) = s_i and state(t+1) = s_j} 1  /  Σ_obs Σ_{t : state(t) = s_i} 1

      μ_i = Σ_obs Σ_{t : state(t) = s_i} X_obs,t  /  Σ_obs Σ_{t : state(t) = s_i} 1

      Θ_i = Σ_obs Σ_{t : state(t) = s_i} (X_obs,t - μ_i)(X_obs,t - μ_i)^T  /  Σ_obs Σ_{t : state(t) = s_i} 1

   • Assumes the state output PDF is Gaussian
     – For GMMs, estimate the GMM parameters from the collection of observations at each state

  62. Training by segmentation: Viterbi training
     [flowchart: initial models → segmentations → models → converged? if yes, stop; if no, iterate]
   • Initialize all HMM parameters
   • Segment all training observation sequences into states using the Viterbi algorithm with the current models
   • Using the estimated state sequences and the training observation sequences, re-estimate the HMM parameters
   • This method is also called a "segmental k-means" learning procedure

  63. Alternative to counting: SOFT counting
   • Expectation maximization
   • Every observation contributes to every state

  64. Update rules at each iteration

      π(s_i) = Σ_Obs P(state(1) = s_i | Obs)  /  (total no. of observation sequences)

      P(s_j | s_i) = Σ_Obs Σ_t P(state(t) = s_i, state(t+1) = s_j | Obs)  /  Σ_Obs Σ_t P(state(t) = s_i | Obs)

      μ_i = Σ_Obs Σ_t P(state(t) = s_i | Obs) X_Obs,t  /  Σ_Obs Σ_t P(state(t) = s_i | Obs)

      Θ_i = Σ_Obs Σ_t P(state(t) = s_i | Obs) (X_Obs,t - μ_i)(X_Obs,t - μ_i)^T  /  Σ_Obs Σ_t P(state(t) = s_i | Obs)

   • Every observation contributes to every state
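The soft-count mean update can be sketched for a single observation sequence: each vector contributes to every state, weighted by its posterior. The `gamma` and `X` arrays below are made-up stand-ins (how gamma is actually computed is the subject of the following slides):

```python
import numpy as np

# Hypothetical data for one observation sequence: T=10 vectors of dimension 3,
# and a made-up posterior gamma[t, i] = P(state(t) = s_i | Obs) for 2 states.
rng = np.random.default_rng(1)
X = rng.normal(size=(10, 3))
gamma = rng.random(size=(10, 2))
gamma /= gamma.sum(axis=1, keepdims=True)   # posteriors sum to 1 over states

# mu_i = sum_t gamma[t, i] X_t / sum_t gamma[t, i]   (soft-count mean update)
mu = (gamma.T @ X) / gamma.sum(axis=0)[:, None]
```

If gamma were hard 0/1 assignments, this would reduce exactly to the counting update on the earlier slide; soft posteriors just spread each vector's contribution across the states.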

  65. Update rules at each iteration
   • (The same update rules as on the previous slide)
   • Where did these terms come from?

  66. P(state(t) = s | Obs)
   • The probability that the process was at s when it generated X_t, given the entire observation
   • Dropping the "Obs" subscript for brevity:

      P(state(t) = s | X_1, X_2, ..., X_T) ∝ P(state(t) = s, X_1, X_2, ..., X_T)

   • We will compute P(state(t) = s, X_1, X_2, ..., X_T) first
     – This is the probability that the process visited s at time t while producing the entire observation

  67. P(state(t) = s, x_1, x_2, ..., x_T)
   • The probability that the HMM was in a particular state s when generating the observation sequence is the probability that it followed a state sequence that passed through s at time t

  68. P(state(t) = s, x_1, x_2, ..., x_T)
   • This can be decomposed into two multiplicative sections
     – The section of the lattice leading into state s at time t, and the section leading out of it

  69. The Forward Paths
   • The probability of the red section is the total probability of all state sequences ending at state s at time t
     – This is simply α(s, t)
     – Can be computed using the forward algorithm

  70. The Backward Paths
   • The blue portion represents the probability of all state sequences that began at state s at time t
     – Like the red portion, it can be computed using a backward recursion

  71. The Backward Recursion

      β(s, t) = P(x_{t+1}, x_{t+2}, ..., x_T | state(t) = s)

      β(s, t) = Σ_{s'} β(s', t+1) P(s' | s) P(x_{t+1} | s')

   • β(s, t) is the total probability of ALL state sequences that depart from s at time t, and all observations after x_t
     – β(s, T) = 1 at the final time instant for all valid final states
   • Can be recursively estimated starting from the final time instant (backward recursion)
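The backward recursion can be sketched in the same style as the earlier forward sketch, reusing the same hypothetical discrete-observation HMM (made-up numbers):

```python
import numpy as np

# Hypothetical discrete-observation HMM (illustrative numbers only):
T_mat = np.array([[0.7, 0.3],
                  [0.4, 0.6]])
B = np.array([[0.9, 0.1],     # B[s, o] = P(o | s)
              [0.2, 0.8]])

def backward(obs, T_mat, B):
    """beta[t, s] = P(x_{t+1}, ..., x_T | state(t) = s), with beta[T] = 1."""
    beta = np.ones((len(obs), T_mat.shape[0]))
    for t in range(len(obs) - 2, -1, -1):
        # beta(s, t) = sum_{s'} beta(s', t+1) P(s' | s) P(x_{t+1} | s')
        beta[t] = T_mat @ (B[:, obs[t + 1]] * beta[t + 1])
    return beta
```

As a sanity check, combining β at t = 1 with the initial probabilities and the first emission recovers the same total observation probability as the forward pass.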

  72. The complete probability

      α(s, t) β(s, t) = P(x_1, x_2, ..., x_T, state(t) = s)

  73. Posterior probability of a state
   • The probability that the process was in state s at time t, given that we have observed the data, is obtained by simple normalization:

      P(state(t) = s | Obs) = P(state(t) = s, x_1, ..., x_T) / Σ_{s'} P(state(t) = s', x_1, ..., x_T)
                            = α(s, t) β(s, t) / Σ_{s'} α(s', t) β(s', t)

   • This term is often referred to as the gamma term and denoted by γ_{s,t}
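Putting the pieces together, the gamma term is computed by running the forward and backward recursions and normalizing their product. A sketch on the same hypothetical discrete-observation HMM used in the earlier sketches (made-up numbers):

```python
import numpy as np

# Hypothetical discrete-observation HMM (illustrative numbers only):
P_init = np.array([1.0, 0.0])
T_mat = np.array([[0.7, 0.3],
                  [0.4, 0.6]])
B = np.array([[0.9, 0.1],
              [0.2, 0.8]])
obs = [0, 1, 1]

# Forward pass: alpha[t, s] = P(x_1..x_t, state(t) = s)
alpha = np.zeros((len(obs), 2))
alpha[0] = P_init * B[:, obs[0]]
for t in range(1, len(obs)):
    alpha[t] = (alpha[t - 1] @ T_mat) * B[:, obs[t]]

# Backward pass: beta[t, s] = P(x_{t+1}..x_T | state(t) = s)
beta = np.ones((len(obs), 2))
for t in range(len(obs) - 2, -1, -1):
    beta[t] = T_mat @ (B[:, obs[t + 1]] * beta[t + 1])

# gamma(s, t) = alpha(s, t) beta(s, t) / sum_s' alpha(s', t) beta(s', t)
gamma = alpha * beta
gamma = gamma / gamma.sum(axis=1, keepdims=True)
```

A useful check: Σ_s α(s, t) β(s, t) is the same at every t (it equals the total observation probability), so each row of gamma sums to 1.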
