
Probabilistic & Unsupervised Learning: Latent Variable Models for Time Series
Maneesh Sahani (maneesh@gatsby.ucl.ac.uk)
Gatsby Computational Neuroscience Unit, and MSc ML/CSML, Dept. of Computer Science, University College London
Term 1, Autumn


1. HMMs and SSMs (Linear Gaussian)

State space models are the continuous-state analogue of hidden Markov models.

[Figure: the two graphical models are the same chain: latent states $s_1 \to s_2 \to \cdots \to s_T$ (HMM) or $y_1 \to y_2 \to \cdots \to y_T$ (SSM), each emitting an observation $x_t$.]

◮ A continuous vector state is a very powerful representation. For an HMM to communicate $N$ bits of information about the past, it needs $2^N$ states! But a real-valued state vector can store an arbitrary number of bits in principle.
◮ Linear-Gaussian output/dynamics are very weak. The types of dynamics linear SSMs can capture are very limited, whereas HMMs can in principle represent arbitrary stochastic dynamics and output mappings.

2. Many Extensions
◮ Constrained HMMs
◮ Continuous state models with discrete outputs, for time series and static data
◮ Hierarchical models
◮ Hybrid systems ⇔ mixed continuous & discrete states, switching state-space models

3. Richer state representations

[Figure: factorial HMM / dynamic Bayesian network, with several parallel latent chains $A_t, B_t, C_t, D_t$ evolving over time.]

◮ These are hidden Markov models with many state variables (i.e. a distributed representation of the state).
◮ The state can capture many more bits of information about the sequence (linear in the number of state variables).

4-7. Chain models: ML Learning with EM

LGSSM (left) and HMM (right):

$$y_1 \sim \mathcal{N}(\mu_0, Q_0) \qquad\qquad s_1 \sim \pi$$
$$y_t \mid y_{t-1} \sim \mathcal{N}(A y_{t-1}, Q) \qquad\qquad s_t \mid s_{t-1} \sim \Phi_{s_{t-1},\,\cdot}$$
$$x_t \mid y_t \sim \mathcal{N}(C y_t, R) \qquad\qquad x_t \mid s_t \sim A_{s_t}$$

The structure of learning and inference for both models is dictated by the factored structure:

$$P(x_1, \ldots, x_T, y_1, \ldots, y_T) = P(y_1) \prod_{t=2}^{T} P(y_t \mid y_{t-1}) \prod_{t=1}^{T} P(x_t \mid y_t)$$

Learning (M-step):

$$\operatorname{argmax}\ \big\langle \log P(x_1, \ldots, x_T, y_1, \ldots, y_T) \big\rangle_{q(y_1, \ldots, y_T)} = \operatorname{argmax} \Big[ \langle \log P(y_1) \rangle_{q(y_1)} + \sum_{t=2}^{T} \langle \log P(y_t \mid y_{t-1}) \rangle_{q(y_t, y_{t-1})} + \sum_{t=1}^{T} \langle \log P(x_t \mid y_t) \rangle_{q(y_t)} \Big]$$

So the expectations needed in the E-step are derived from singleton and pairwise marginals.

8-10. Chain models: Inference

Three general inference problems:

Filtering: $P(y_t \mid x_1, \ldots, x_t)$
Smoothing: $P(y_t \mid x_1, \ldots, x_T)$ (also $P(y_t, y_{t-1} \mid x_1, \ldots, x_T)$ for learning)
Prediction: $P(y_t \mid x_1, \ldots, x_{t-\Delta t})$

Naively, these marginal posteriors seem to require very large integrals (or sums):

$$P(y_t \mid x_1, \ldots, x_t) = \int \cdots \int dy_1 \ldots dy_{t-1}\; P(y_1, \ldots, y_t \mid x_1, \ldots, x_t)$$

but again the factored structure of the distributions will help us. The algorithms rely on a form of temporal updating or message passing.

11-16. Crawling the HMM state-lattice

[Figure: a $K \times T$ lattice of states ($K = 4$, times $s_1, \ldots, s_6$); each left-to-right path through the lattice is a state sequence.]

Consider an HMM, where we want to find $P(s_t = k \mid x_1 \ldots x_t)$:

$$P(s_t = k \mid x_1 \ldots x_t) = \sum_{k_1, \ldots, k_{t-1}} P(s_1 = k_1, \ldots, s_t = k \mid x_1 \ldots x_t) \propto \sum_{k_1, \ldots, k_{t-1}} \pi_{k_1} A_{k_1}(x_1) \Phi_{k_1 k_2} A_{k_2}(x_2) \cdots \Phi_{k_{t-1} k} A_k(x_t)$$

Naïve algorithm:
◮ start a "bug" at each of the $K$ states at $t = 1$, holding value 1
◮ move each bug forward in time: make copies of each bug at each subsequent state and multiply the value of each copy by transition prob. × output emission prob.
◮ repeat until all bugs have reached time $t$
◮ sum up the values of all $K^{t-1}$ bugs that reach state $s_t = k$ (one bug per state path)

Clever recursion:
◮ at every step, replace the bugs at each node with a single bug carrying the sum of their values

17-22. Probability updating: "Bayesian filtering"

$$P(y_t \mid x_{1:t}) = \int P(y_t, y_{t-1} \mid x_t, x_{1:t-1}) \, dy_{t-1}$$
$$= \int \frac{P(x_t, y_t, y_{t-1} \mid x_{1:t-1})}{P(x_t \mid x_{1:t-1})} \, dy_{t-1}$$
$$\propto \int P(x_t \mid y_t, y_{t-1}, x_{1:t-1}) \, P(y_t \mid y_{t-1}, x_{1:t-1}) \, P(y_{t-1} \mid x_{1:t-1}) \, dy_{t-1}$$
$$= \int P(x_t \mid y_t) \, P(y_t \mid y_{t-1}) \, P(y_{t-1} \mid x_{1:t-1}) \, dy_{t-1} \qquad \text{(Markov property)}$$

This is a forward recursion based on Bayes' rule.

23-28. The HMM: Forward pass

The forward recursion for the HMM is a form of dynamic programming.

Define: $\alpha_t(i) = P(x_1, \ldots, x_t, s_t = i \mid \theta)$

Then, much like the Bayesian filtering updates, we have:

$$\alpha_1(i) = \pi_i A_i(x_1) \qquad \alpha_{t+1}(i) = \Big[ \sum_{j=1}^{K} \alpha_t(j) \Phi_{ji} \Big] A_i(x_{t+1})$$

We've defined $\alpha_t(i)$ to be a joint rather than a posterior. It's easy to obtain the posterior by normalisation:

$$P(s_t = i \mid x_1, \ldots, x_t, \theta) = \frac{\alpha_t(i)}{\sum_k \alpha_t(k)}$$

This form enables us to compute the likelihood for $\theta = \{A, \Phi, \pi\}$ efficiently in $O(TK^2)$ time:

$$P(x_1 \ldots x_T \mid \theta) = \sum_{s_1, \ldots, s_T} P(x_1, \ldots, x_T, s_1, \ldots, s_T \mid \theta) = \sum_{k=1}^{K} \alpha_T(k)$$

avoiding the exponential number of paths in the naïve sum (number of paths $= K^T$).
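The recursion is only a few lines of NumPy. A minimal sketch (unscaled, so it will underflow on long sequences; the scaling fix appears on the practicalities slide later; the names `hmm_forward` and `px` are illustrative):

```python
import numpy as np

def hmm_forward(pi, Phi, px):
    """Unscaled forward pass.
    pi : (K,) initial state distribution
    Phi: (K, K) transitions, Phi[j, i] = P(s_{t+1} = i | s_t = j)
    px : (T, K) emission likelihoods, px[t, i] = A_i(x_t)
    Returns alpha (T, K) and log P(x_{1:T})."""
    T, K = px.shape
    alpha = np.zeros((T, K))
    alpha[0] = pi * px[0]                        # alpha_1(i) = pi_i A_i(x_1)
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ Phi) * px[t]  # [sum_j alpha_t(j) Phi_ji] A_i(x_{t+1})
    return alpha, np.log(alpha[-1].sum())        # P(x_{1:T}) = sum_k alpha_T(k)
```

Each step costs $O(K^2)$, giving the $O(TK^2)$ total above.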

29-36. The LGSSM: Kalman Filtering

$$y_1 \sim \mathcal{N}(\mu_0, Q_0) \qquad y_t \mid y_{t-1} \sim \mathcal{N}(A y_{t-1}, Q) \qquad x_t \mid y_t \sim \mathcal{N}(C y_t, R)$$

For the SSM, the sums become integrals. Let $\hat{y}_1^0 = \mu_0$ and $\hat{V}_1^0 = Q_0$; then (cf. FA)

$$P(y_1 \mid x_1) = \mathcal{N}\big( \hat{y}_1^0 + K_1 (x_1 - C \hat{y}_1^0),\ \hat{V}_1^0 - K_1 C \hat{V}_1^0 \big) \qquad K_1 = \hat{V}_1^0 C^{\mathsf T} (C \hat{V}_1^0 C^{\mathsf T} + R)^{-1}$$

which defines $\hat{y}_1^1$ and $\hat{V}_1^1$. In general, we define $\hat{y}_t^{\tau} \equiv E[y_t \mid x_1, \ldots, x_{\tau}]$ and $\hat{V}_t^{\tau} \equiv V[y_t \mid x_1, \ldots, x_{\tau}]$. Then:

$$P(y_t \mid x_{1:t-1}) = \int dy_{t-1}\, P(y_t \mid y_{t-1}) P(y_{t-1} \mid x_{1:t-1}) = \mathcal{N}\big( \hat{y}_t^{t-1},\ \hat{V}_t^{t-1} \big), \quad \hat{y}_t^{t-1} = A \hat{y}_{t-1}^{t-1}, \quad \hat{V}_t^{t-1} = A \hat{V}_{t-1}^{t-1} A^{\mathsf T} + Q$$

$$P(y_t \mid x_{1:t}) = \mathcal{N}\big( \hat{y}_t^{t-1} + K_t (x_t - C \hat{y}_t^{t-1}),\ \hat{V}_t^{t-1} - K_t C \hat{V}_t^{t-1} \big) = \mathcal{N}\big( \hat{y}_t^{t},\ \hat{V}_t^{t} \big)$$

$$K_t = \hat{V}_t^{t-1} C^{\mathsf T} (C \hat{V}_t^{t-1} C^{\mathsf T} + R)^{-1} \qquad \text{(the Kalman gain)}$$
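A matching NumPy sketch of the filter loop, alternating the prediction and correction steps above (names illustrative; `xs` holds the observed sequence):

```python
import numpy as np

def kalman_filter(xs, mu0, Q0, A, Q, C, R):
    """Returns the filtered moments E[y_t | x_{1:t}] and V[y_t | x_{1:t}]."""
    y_pred, V_pred = mu0, Q0                    # \hat y_1^0, \hat V_1^0
    means, covs = [], []
    for x in xs:
        # Correction: condition the predictive Gaussian on x_t.
        K = V_pred @ C.T @ np.linalg.inv(C @ V_pred @ C.T + R)   # Kalman gain
        y_filt = y_pred + K @ (x - C @ y_pred)
        V_filt = V_pred - K @ C @ V_pred
        means.append(y_filt)
        covs.append(V_filt)
        # Prediction: push the filtered estimate through the dynamics.
        y_pred, V_pred = A @ y_filt, A @ V_filt @ A.T + Q
    return means, covs
```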

37-39. The marginal posterior: "Bayesian smoothing"

$$P(y_t \mid x_{1:T}) = \frac{P(y_t, x_{t+1:T} \mid x_{1:t})}{P(x_{t+1:T} \mid x_{1:t})} = \frac{P(x_{t+1:T} \mid y_t)\, P(y_t \mid x_{1:t})}{P(x_{t+1:T} \mid x_{1:t})}$$

The marginal combines a backward message with the forward message found by filtering.

40. The HMM: Forward–Backward Algorithm

State estimation: compute the marginal posterior distribution over the state at time $t$:

$$\gamma_t(i) \equiv P(s_t = i \mid x_{1:T}) = \frac{P(s_t = i, x_{1:t})\, P(x_{t+1:T} \mid s_t = i)}{P(x_{1:T})} = \frac{\alpha_t(i)\, \beta_t(i)}{\sum_j \alpha_t(j)\, \beta_t(j)}$$

where there is a simple backward recursion for

$$\beta_t(i) \equiv P(x_{t+1:T} \mid s_t = i) = \sum_{j=1}^{K} P(s_{t+1} = j, x_{t+1}, x_{t+2:T} \mid s_t = i)$$
$$= \sum_{j=1}^{K} P(s_{t+1} = j \mid s_t = i)\, P(x_{t+1} \mid s_{t+1} = j)\, P(x_{t+2:T} \mid s_{t+1} = j) = \sum_{j=1}^{K} \Phi_{ij} A_j(x_{t+1}) \beta_{t+1}(j)$$

$\alpha_t(i)$ gives the total inflow of probability to node $(t, i)$; $\beta_t(i)$ gives the total outflow of probability.

[Figure: the state lattice again.]

Bugs again: the bugs run forward from time 0 to $t$ and backward from time $T$ to $t$.
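Continuing the NumPy sketch from the forward pass, the backward recursion and the $\gamma$ marginals might look as follows (again unscaled, for clarity):

```python
def hmm_backward(Phi, px):
    """Backward pass: beta[t, i] = P(x_{t+1:T} | s_t = i)."""
    T, K = px.shape
    beta = np.ones((T, K))                         # beta_T(i) = 1
    for t in range(T - 2, -1, -1):
        # sum_j Phi_ij A_j(x_{t+1}) beta_{t+1}(j)
        beta[t] = Phi @ (px[t + 1] * beta[t + 1])
    return beta

def state_marginals(alpha, beta):
    """gamma[t, i] = P(s_t = i | x_{1:T}), by normalising alpha * beta."""
    gamma = alpha * beta
    return gamma / gamma.sum(axis=1, keepdims=True)
```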

41-47. Viterbi decoding

◮ The numbers $\gamma_t(i)$ computed by forward–backward give the marginal posterior distribution over states at each time.
◮ By choosing the state $i_t^*$ with the largest $\gamma_t(i)$ at each time, we can make a "best" state path. This is the path with the maximum expected number of correct states.
◮ But it is not the single path with the highest probability of generating the data. In fact it may be a path of probability zero!
◮ To find the single best path, we use the Viterbi decoding algorithm, which is just Bellman's dynamic programming algorithm applied to this problem. It is an inference algorithm that computes the most probable state sequence: $\operatorname{argmax}_{s_{1:T}} P(s_{1:T} \mid x_{1:T}, \theta)$
◮ The recursions look the same as forward–backward, except with max instead of $\sum$ (see the sketch below).
◮ Bugs once more: the same trick, except at each step kill all bugs but the one with the highest value at the node.
◮ There is also a modified EM training based on the Viterbi decoder (assignment).
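As promised above, a sketch of the max-product recursion with backtracking, done in log space to avoid underflow (names illustrative):

```python
def viterbi(pi, Phi, px):
    """Most probable state path, argmax_{s_{1:T}} P(s_{1:T} | x_{1:T})."""
    T, K = px.shape
    logPhi = np.log(Phi)
    delta = np.log(pi) + np.log(px[0])        # best log-prob of paths ending in each state
    back = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + logPhi      # scores[j, i]: best path into j, then j -> i
        back[t] = scores.argmax(axis=0)       # remember the best predecessor of each state
        delta = scores.max(axis=0) + np.log(px[t])
    path = np.zeros(T, dtype=int)
    path[-1] = delta.argmax()
    for t in range(T - 2, -1, -1):            # trace the argmaxes backwards
        path[t] = back[t + 1, path[t + 1]]
    return path
```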

48. The LGSSM: Kalman smoothing

We use a slightly different decomposition:

$$P(y_t \mid x_{1:T}) = \int P(y_t, y_{t+1} \mid x_{1:T}) \, dy_{t+1}$$
$$= \int P(y_t \mid y_{t+1}, x_{1:T})\, P(y_{t+1} \mid x_{1:T}) \, dy_{t+1}$$
$$= \int P(y_t \mid y_{t+1}, x_{1:t})\, P(y_{t+1} \mid x_{1:T}) \, dy_{t+1} \qquad \text{(Markov property)}$$

This gives the additional backward recursion:

$$J_t = \hat{V}_t^t A^{\mathsf T} (\hat{V}_{t+1}^t)^{-1} \qquad \hat{y}_t^T = \hat{y}_t^t + J_t (\hat{y}_{t+1}^T - A \hat{y}_t^t) \qquad \hat{V}_t^T = \hat{V}_t^t + J_t (\hat{V}_{t+1}^T - \hat{V}_{t+1}^t) J_t^{\mathsf T}$$
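A sketch of this backward sweep, starting from the filtered moments produced by the `kalman_filter` sketch above (the gains $J_t$ are returned as well, since the EM slides below need them):

```python
def rts_smoother(filt_means, filt_covs, A, Q):
    """Kalman (Rauch-Tung-Striebel) smoother: E[y_t | x_{1:T}], V[y_t | x_{1:T}]."""
    T = len(filt_means)
    ys, Vs, Js = list(filt_means), list(filt_covs), [None] * (T - 1)
    for t in range(T - 2, -1, -1):
        V_pred = A @ filt_covs[t] @ A.T + Q                  # \hat V_{t+1}^t
        Js[t] = filt_covs[t] @ A.T @ np.linalg.inv(V_pred)   # J_t
        ys[t] = filt_means[t] + Js[t] @ (ys[t + 1] - A @ filt_means[t])
        Vs[t] = filt_covs[t] + Js[t] @ (Vs[t + 1] - V_pred) @ Js[t].T
    return ys, Vs, Js
```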

49. ML Learning for SSMs using batch EM

Parameters: $\theta = \{\mu_0, Q_0, A, Q, C, R\}$

Free energy:
$$F(q, \theta) = \int dy_{1:T}\, q(y_{1:T}) \big( \log P(x_{1:T}, y_{1:T} \mid \theta) - \log q(y_{1:T}) \big)$$

E-step: Maximise $F$ w.r.t. $q$ with $\theta$ fixed: $q^*(y) = p(y \mid x, \theta)$. This can be achieved with a two-state extension of the Kalman smoother.

M-step: Maximise $F$ w.r.t. $\theta$ with $q$ fixed. This boils down to solving a few weighted least squares problems, since all the variables in

$$p(y, x \mid \theta) = p(y_1)\, p(x_1 \mid y_1) \prod_{t=2}^{T} p(y_t \mid y_{t-1})\, p(x_t \mid y_t)$$

form a multivariate Gaussian.
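As a sketch of what the E-step hands to those least-squares problems, the following assembles the posterior sufficient statistics from the smoother output above. The cross-covariance line is not derived on these slides; it uses the standard identity $\mathrm{Cov}(y_{t+1}, y_t \mid x_{1:T}) = \hat{V}_{t+1}^T J_t^{\mathsf T}$:

```python
def lgssm_suff_stats(ys, Vs, Js):
    """Expected statistics under q(y) = p(y | x, theta) for the M-step."""
    T = len(ys)
    Ey = ys                                                   # <y_t>
    Eyy = [Vs[t] + np.outer(ys[t], ys[t]) for t in range(T)]  # <y_t y_t^T>
    Eyy1 = [Vs[t + 1] @ Js[t].T + np.outer(ys[t + 1], ys[t])  # <y_{t+1} y_t^T>
            for t in range(T - 1)]
    return Ey, Eyy, Eyy1
```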

50-56. The M step for C

$$p(x_t \mid y_t) \propto \exp\Big\{ -\tfrac{1}{2} (x_t - C y_t)^{\mathsf T} R^{-1} (x_t - C y_t) \Big\} \quad \Rightarrow$$

$$C^{\text{new}} = \operatorname{argmax}_C\, \Big\langle \sum_t \ln p(x_t \mid y_t) \Big\rangle_q$$
$$= \operatorname{argmax}_C\, \Big\langle \sum_t -\tfrac{1}{2} (x_t - C y_t)^{\mathsf T} R^{-1} (x_t - C y_t) \Big\rangle_q + \text{const}$$
$$= \operatorname{argmax}_C\, \sum_t -\tfrac{1}{2} \Big( x_t^{\mathsf T} R^{-1} x_t - 2\, x_t^{\mathsf T} R^{-1} C \langle y_t \rangle + \langle y_t^{\mathsf T} C^{\mathsf T} R^{-1} C y_t \rangle \Big)$$
$$= \operatorname{argmax}_C\, -\tfrac{1}{2} \Big( -2 \operatorname{Tr}\Big[ C \sum_t \langle y_t \rangle x_t^{\mathsf T} R^{-1} \Big] + \operatorname{Tr}\Big[ C^{\mathsf T} R^{-1} C \sum_t \langle y_t y_t^{\mathsf T} \rangle \Big] \Big)$$

Using $\partial \operatorname{Tr}[AB] / \partial A = B^{\mathsf T}$, we have

$$\frac{\partial \{\cdot\}}{\partial C} = R^{-1} \sum_t x_t \langle y_t \rangle^{\mathsf T} - R^{-1} C \sum_t \langle y_t y_t^{\mathsf T} \rangle$$
$$\Rightarrow \quad C^{\text{new}} = \Big( \sum_t x_t \langle y_t \rangle^{\mathsf T} \Big) \Big( \sum_t \langle y_t y_t^{\mathsf T} \rangle \Big)^{-1}$$

Notice that this is exactly the same equation as in factor analysis and linear regression!
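In code, with the expected statistics from the E-step sketch above, the update is a single solve:

```python
def m_step_C(xs, Ey, Eyy):
    """C update: (sum_t x_t <y_t>^T) (sum_t <y_t y_t^T>)^{-1}."""
    num = sum(np.outer(x, m) for x, m in zip(xs, Ey))
    return num @ np.linalg.inv(sum(Eyy))
```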

57-63. The M step for A

$$p(y_{t+1} \mid y_t) \propto \exp\Big\{ -\tfrac{1}{2} (y_{t+1} - A y_t)^{\mathsf T} Q^{-1} (y_{t+1} - A y_t) \Big\} \quad \Rightarrow$$

$$A^{\text{new}} = \operatorname{argmax}_A\, \Big\langle \sum_t \ln p(y_{t+1} \mid y_t) \Big\rangle_q$$
$$= \operatorname{argmax}_A\, \Big\langle \sum_t -\tfrac{1}{2} (y_{t+1} - A y_t)^{\mathsf T} Q^{-1} (y_{t+1} - A y_t) \Big\rangle_q + \text{const}$$
$$= \operatorname{argmax}_A\, \sum_t -\tfrac{1}{2} \Big( \langle y_{t+1}^{\mathsf T} Q^{-1} y_{t+1} \rangle - 2 \langle y_{t+1}^{\mathsf T} Q^{-1} A y_t \rangle + \langle y_t^{\mathsf T} A^{\mathsf T} Q^{-1} A y_t \rangle \Big)$$
$$= \operatorname{argmax}_A\, -\tfrac{1}{2} \Big( -2 \operatorname{Tr}\Big[ A \sum_t \langle y_t y_{t+1}^{\mathsf T} \rangle Q^{-1} \Big] + \operatorname{Tr}\Big[ A^{\mathsf T} Q^{-1} A \sum_t \langle y_t y_t^{\mathsf T} \rangle \Big] \Big)$$

Using $\partial \operatorname{Tr}[AB] / \partial A = B^{\mathsf T}$, we have

$$\frac{\partial \{\cdot\}}{\partial A} = Q^{-1} \sum_t \langle y_{t+1} y_t^{\mathsf T} \rangle - Q^{-1} A \sum_t \langle y_t y_t^{\mathsf T} \rangle$$
$$\Rightarrow \quad A^{\text{new}} = \Big( \sum_t \langle y_{t+1} y_t^{\mathsf T} \rangle \Big) \Big( \sum_t \langle y_t y_t^{\mathsf T} \rangle \Big)^{-1}$$

This is still analogous to factor analysis and linear regression, but with expected correlations.
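And the corresponding one-liner (note the denominator sums only over $t = 1, \ldots, T-1$):

```python
def m_step_A(Eyy, Eyy1):
    """A update: (sum_t <y_{t+1} y_t^T>) (sum_t <y_t y_t^T>)^{-1}."""
    return sum(Eyy1) @ np.linalg.inv(sum(Eyy[:-1]))
```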

64. Learning (online gradient)

Time series data must often be processed in real time, and we may want to update parameters online as observations arrive. We can do so by updating a local version of the likelihood based on the Kalman filter estimates. Consider the log-likelihood contributed by each data point ($\ell_t$):

$$\ell = \sum_{t=1}^{T} \ln p(x_t \mid x_1, \ldots, x_{t-1}) = \sum_{t=1}^{T} \ell_t$$

Then,

$$\ell_t = -\frac{D}{2} \ln 2\pi - \frac{1}{2} \ln |\Sigma| - \frac{1}{2} (x_t - C \hat{y}_t^{t-1})^{\mathsf T} \Sigma^{-1} (x_t - C \hat{y}_t^{t-1})$$

where $D$ is the dimension of $x$, and:

$$\hat{y}_t^{t-1} = A \hat{y}_{t-1}^{t-1} \qquad \hat{V}_t^{t-1} = A \hat{V}_{t-1}^{t-1} A^{\mathsf T} + Q \qquad \Sigma = C \hat{V}_t^{t-1} C^{\mathsf T} + R$$

We differentiate $\ell_t$ to obtain gradient rules for $A$, $C$, $Q$, $R$. The size of the gradient step (learning rate) reflects our expectation about nonstationarity.
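Each term $\ell_t$ is computable from the filter's one-step predictive moments, e.g.:

```python
def step_loglik(x, y_pred, V_pred, C, R):
    """l_t = log p(x_t | x_{1:t-1}) from the predictive moments y_t^{t-1}, V_t^{t-1}."""
    Sigma = C @ V_pred @ C.T + R
    resid = x - C @ y_pred
    return -0.5 * (len(x) * np.log(2 * np.pi)
                   + np.linalg.slogdet(Sigma)[1]
                   + resid @ np.linalg.solve(Sigma, resid))
```

An online scheme would take a gradient step on $\ell_t$ (e.g. via automatic differentiation) after each observation.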

65. Learning HMMs using EM

Parameters: $\theta = \{\pi, \Phi, A\}$

Free energy:
$$F(q, \theta) = \sum_{s_{1:T}} q(s_{1:T}) \big( \log P(x_{1:T}, s_{1:T} \mid \theta) - \log q(s_{1:T}) \big)$$

E-step: Maximise $F$ w.r.t. $q$ with $\theta$ fixed: $q^*(s_{1:T}) = P(s_{1:T} \mid x_{1:T}, \theta)$. We will only need the marginal probabilities $q(s_t, s_{t+1})$, which can also be obtained from the forward–backward algorithm.

M-step: Maximise $F$ w.r.t. $\theta$ with $q$ fixed. We can re-estimate the parameters by computing the expected number of times the HMM was in state $i$, emitted symbol $k$, and transitioned to state $j$. This is the Baum–Welch algorithm, and it predates the (more general) EM algorithm.

66-69. M step: Parameter updates are given by ratios of expected counts

We can derive the following updates by taking derivatives of $F$ w.r.t. $\theta$.

◮ The initial state distribution is the expected number of times in state $i$ at $t = 1$:
$$\hat{\pi}_i = \gamma_1(i)$$

◮ The expected number of transitions from state $i$ to $j$ which begin at time $t$ is:
$$\xi_t(i \to j) \equiv P(s_t = i, s_{t+1} = j \mid x_{1:T}) = \alpha_t(i)\, \Phi_{ij}\, A_j(x_{t+1})\, \beta_{t+1}(j) \,/\, P(x_{1:T})$$
so the estimated transition probabilities are:
$$\hat{\Phi}_{ij} = \sum_{t=1}^{T-1} \xi_t(i \to j) \Big/ \sum_{t=1}^{T-1} \gamma_t(i)$$

◮ The output distributions are the expected number of times we observe a particular symbol in a particular state:
$$\hat{A}_{ik} = \sum_{t: x_t = k} \gamma_t(i) \Big/ \sum_{t=1}^{T} \gamma_t(i)$$
(or the state-probability-weighted mean and variance for a Gaussian output model).
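A single-sequence NumPy sketch of these count ratios, assuming discrete symbols coded as integers (the multi-sequence version on the pseudocode slide below just pools numerators and denominators):

```python
def m_step_hmm(gamma, xi, xs, n_symbols):
    """gamma: (T, K) state marginals; xi: (T-1, K, K) pairwise marginals;
    xs: (T,) integer array of observed symbols."""
    pi = gamma[0]
    Phi = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    A = np.stack([gamma[xs == k].sum(axis=0)            # expected counts of symbol k
                  for k in range(n_symbols)], axis=1)
    A /= gamma.sum(axis=0)[:, None]
    return pi, Phi, A
```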

70-72. HMM practicalities

◮ Numerical scaling: the conventional message definition is in terms of a large joint: $\alpha_t(i) = P(x_{1:t}, s_t = i) \to 0$ as $t$ grows, and so can easily underflow. Rescale:
$$\alpha_t(i) = A_i(x_t) \sum_j \tilde{\alpha}_{t-1}(j)\, \Phi_{ji} \qquad \rho_t = \sum_{i=1}^{K} \alpha_t(i) \qquad \tilde{\alpha}_t(i) = \alpha_t(i) / \rho_t$$
Exercise: show that
$$\rho_t = P(x_t \mid x_{1:t-1}, \theta) \qquad \prod_{t=1}^{T} \rho_t = P(x_{1:T} \mid \theta)$$
What does this make $\tilde{\alpha}_t(i)$?
◮ Multiple observed sequences: average numerators and denominators in the ratios of updates.
◮ Local optima (random restarts, annealing; see discussion later).

73. HMM pseudocode: inference (E step)

Forward–backward including scaling tricks. [◦ is the element-by-element (Hadamard/Schur) product: '.*' in matlab.]

for t = 1 : T, i = 1 : K
    p_t(i) = A_i(x_t)
α_1 = π ◦ p_1;  ρ_1 = Σ_{i=1}^K α_1(i);  α_1 = α_1 / ρ_1
for t = 2 : T
    α_t = (Φᵀ ∗ α_{t−1}) ◦ p_t;  ρ_t = Σ_{i=1}^K α_t(i);  α_t = α_t / ρ_t
β_T = 1
for t = T−1 : 1
    β_t = Φ ∗ (β_{t+1} ◦ p_{t+1}) / ρ_{t+1}
log P(x_{1:T}) = Σ_{t=1}^T log(ρ_t)
for t = 1 : T
    γ_t = α_t ◦ β_t
for t = 1 : T−1
    ξ_t = Φ ◦ (α_t ∗ (β_{t+1} ◦ p_{t+1})ᵀ) / ρ_{t+1}
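A direct NumPy transcription of this pseudocode might look like this (one sequence; `px[t, i]` $= A_i(x_t)$):

```python
def forward_backward(pi, Phi, px):
    """Scaled forward-backward. Returns gamma (T,K), xi (T-1,K,K), log P(x_{1:T})."""
    T, K = px.shape
    alpha, beta, rho = np.zeros((T, K)), np.ones((T, K)), np.zeros(T)
    alpha[0] = pi * px[0]
    rho[0] = alpha[0].sum()
    alpha[0] /= rho[0]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ Phi) * px[t]
        rho[t] = alpha[t].sum()
        alpha[t] /= rho[t]
    for t in range(T - 2, -1, -1):
        beta[t] = Phi @ (px[t + 1] * beta[t + 1]) / rho[t + 1]
    gamma = alpha * beta                         # rows already sum to one
    xi = np.stack([Phi * np.outer(alpha[t], px[t + 1] * beta[t + 1]) / rho[t + 1]
                   for t in range(T - 1)])
    return gamma, xi, np.log(rho).sum()
```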

74. HMM pseudocode: parameter re-estimation (M step)

Baum–Welch parameter updates: for each sequence $l = 1 : L$, run forward–backward to get $\gamma^{(l)}$ and $\xi^{(l)}$, then

$$\pi_i = \frac{1}{L} \sum_{l=1}^{L} \gamma_1^{(l)}(i)$$

$$\Phi_{ij} = \frac{\sum_{l=1}^{L} \sum_{t=1}^{T^{(l)}-1} \xi_t^{(l)}(ij)}{\sum_{l=1}^{L} \sum_{t=1}^{T^{(l)}-1} \gamma_t^{(l)}(i)}$$

$$A_{ik} = \frac{\sum_{l=1}^{L} \sum_{t=1}^{T^{(l)}} \delta(x_t = k)\, \gamma_t^{(l)}(i)}{\sum_{l=1}^{L} \sum_{t=1}^{T^{(l)}} \gamma_t^{(l)}(i)}$$

75-81. Degeneracies

Recall that the FA likelihood is conserved with respect to orthogonal transformations of $y$:

$$P(y) = \mathcal{N}(0, I), \quad P(x \mid y) = \mathcal{N}(\Lambda y, \Psi) \qquad \tilde{y} = U y \ \&\ \tilde{\Lambda} = \Lambda U^{\mathsf T}$$
$$\Rightarrow \quad P(\tilde{y}) = \mathcal{N}(U 0, U I U^{\mathsf T}) = \mathcal{N}(0, I), \quad P(x \mid \tilde{y}) = \mathcal{N}(\Lambda U^{\mathsf T} U y, \Psi) = \mathcal{N}(\tilde{\Lambda} \tilde{y}, \Psi)$$

Similarly, a mixture model is invariant to permutations of the latent.

The LGSSM likelihood is conserved with respect to any invertible transform of the latent:

$$P(y_{t+1} \mid y_t) = \mathcal{N}(A y_t, Q), \quad P(x_t \mid y_t) = \mathcal{N}(C y_t, R) \qquad \tilde{y} = G y \ \&\ \tilde{A} = G A G^{-1},\ \tilde{Q} = G Q G^{\mathsf T},\ \tilde{C} = C G^{-1}$$
$$\Rightarrow \quad P(\tilde{y}_{t+1} \mid \tilde{y}_t) = \mathcal{N}(G A G^{-1} G y_t, G Q G^{\mathsf T}) = \mathcal{N}(\tilde{A} \tilde{y}_t, \tilde{Q}), \quad P(x_t \mid \tilde{y}_t) = \mathcal{N}(C G^{-1} G y_t, R) = \mathcal{N}(\tilde{C} \tilde{y}_t, R)$$
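A quick numerical check of the LGSSM degeneracy, using the innovation decomposition of the likelihood from the online-learning slide (a self-contained sketch; the model parameters and $G$ are arbitrary):

```python
import numpy as np

def lgssm_loglik(xs, mu0, Q0, A, Q, C, R):
    """Exact log-likelihood via the Kalman filter innovations."""
    y, V, ll = mu0, Q0, 0.0
    for x in xs:
        S = C @ V @ C.T + R
        resid = x - C @ y
        ll += -0.5 * (len(x) * np.log(2 * np.pi)
                      + np.linalg.slogdet(S)[1]
                      + resid @ np.linalg.solve(S, resid))
        K = V @ C.T @ np.linalg.inv(S)           # measurement update
        y, V = y + K @ resid, V - K @ C @ V
        y, V = A @ y, A @ V @ A.T + Q            # time update
    return ll

rng = np.random.default_rng(0)
d, D, T = 2, 3, 20
A, Q = 0.9 * np.eye(d), np.eye(d)
C, R = rng.standard_normal((D, d)), np.eye(D)
mu0, Q0 = np.zeros(d), np.eye(d)
xs = rng.standard_normal((T, D))                 # any sequence will do for the check
G = rng.standard_normal((d, d))                  # invertible with probability 1
Gi = np.linalg.inv(G)
ll1 = lgssm_loglik(xs, mu0, Q0, A, Q, C, R)
ll2 = lgssm_loglik(xs, G @ mu0, G @ Q0 @ G.T, G @ A @ Gi, G @ Q @ G.T, C @ Gi, R)
print(np.isclose(ll1, ll2))                      # True: the likelihood is unchanged
```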
