  1. The Infinite Markov Model. Daichi Mochihashi, NTT Communication Science Laboratories, Japan. daichi@cslab.kecl.ntt.co.jp. NIPS 2007.

  2. Overview. (Figure: a fixed n-th order Markov model versus an infinitely variable-order suffix tree over contexts such as "is", "of", "will", "states", "he", "she", "united", "language".)
  • Fixed-order Markov dependency ⇒ infinitely variable Markov orders.
  • A simple prior for stochastic trees (other than Coalescents): the infinitely variable-order Markov model.
  ◦ How can we draw an inference based only on the output sequences?

  3. Markov Models.
      p("mama I want to sing") = p(mama) × p(I | mama)                                  (1st order)
                                 × p(want | mama I) × p(to | I want) × p(sing | want to)  (2nd order; n-gram, here 3-gram)
  • The "n-gram" ((n−1)-th order Markov) model is prevalent in speech recognition and natural language processing.
  • It is also used in music processing, bioinformatics, compression, and more.
  • Note: an HMM is a first-order Markov model over hidden states.
    ◦ Emission is a unigram on the hidden state.
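
A minimal sketch of the kind of fixed-order model this slide describes: a 2nd-order (trigram) model estimated by maximum likelihood from raw counts. The `train_trigram` helper and the `<s>`/`</s>` padding symbols are illustrative choices, not something from the talk.

```python
from collections import defaultdict

def train_trigram(sentences):
    """Fit a 2nd-order (trigram) Markov model by maximum likelihood (a sketch)."""
    ctx_counts = defaultdict(int)   # counts of 2-word contexts
    tri_counts = defaultdict(int)   # counts of (context, next word) pairs
    for words in sentences:
        padded = ["<s>", "<s>"] + list(words) + ["</s>"]
        for i in range(2, len(padded)):
            ctx = tuple(padded[i - 2:i])
            ctx_counts[ctx] += 1
            tri_counts[(ctx, padded[i])] += 1

    def prob(word, ctx):
        ctx = tuple(ctx)
        total = ctx_counts.get(ctx, 0)
        return tri_counts.get((ctx, word), 0) / total if total else 0.0

    return prob

# p("mama I want to sing") then factors exactly as on the slide:
# p(mama|<s>,<s>) * p(I|<s>,mama) * p(want|mama,I) * p(to|I,want) * p(sing|want,to)
```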

  4. Estimating a Markov Model. (Figure: a suffix tree of contexts such as "will", "of", "and", "she will", "he will", "states of", "bread and", each node holding a predictive distribution.)
  • Each Markov state is a node in a suffix tree (Ron+ 1994, Pereira+ 1995, Buhlmann 1999).
    ◦ Depth = Markov order.
    ◦ Each node has a predictive distribution over the next word.
  • Problem: the number of states explodes as the order n gets larger.
    ◦ Restrict to a small Markov order (n = 3~5 in speech and NLP)?
    ◦ Distributions get sparser and sparser ⇒ use hierarchical Bayes?
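
As a rough illustration of the suffix-tree view, here is a hedged sketch of a context-trie node and a lookup that descends from the root along the reversed history. `ContextNode` and `lookup` are hypothetical names, and creating nodes on demand is a simplification for brevity.

```python
class ContextNode:
    """One Markov state: a node in the suffix tree of contexts."""
    def __init__(self):
        self.children = {}      # previous word -> deeper context node
        self.next_counts = {}   # predictive counts over the next word

def lookup(root, history, max_depth):
    """Descend the suffix tree along the reversed history
    (most recent word first), up to max_depth, creating nodes on demand."""
    node, path = root, [root]
    for w in reversed(history[-max_depth:]):
        node = node.children.setdefault(w, ContextNode())
        path.append(node)
    return path   # nodes from the root (order 0) down to the deepest context
```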

  5. Hierarchical (Poisson-)Dirichlet Process. (Figure: the suffix tree over contexts such as "of", "and", "will", "states of", "bread and", with customers added for the training text; proxy (imaginary) customers are sent to the parent nodes.)
  • Teh (2006) and Goldwater+ (2006) mapped a hierarchical Dirichlet process onto Markov models.
    ◦ The n-th order predictive distribution is a Dirichlet process draw from the (n−1)-th distribution.
    ◦ Chinese restaurant process representation: a customer = a count in the training data.
    ◦ Hierarchical Pitman-Yor Language Model (HPYLM).
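
The slide does not spell the predictive rule out, but a standard hierarchical Pitman-Yor predictive recursion, with each node backing off to its parent restaurant, would look roughly like the sketch below. A single shared discount `d` and strength `theta`, and the `counts`/`tables`/`parent` node attributes, are simplifying assumptions, not the authors' implementation.

```python
def hpy_predict(node, word, d, theta, vocab_size):
    """Predictive probability of `word` at a suffix-tree node, recursively
    backing off to the parent node (a sketch of the hierarchical Pitman-Yor
    rule; node.counts[w] = customers of word w, node.tables[w] = their tables)."""
    if node is None:                       # above the root: uniform base measure
        return 1.0 / vocab_size
    parent_p = hpy_predict(node.parent, word, d, theta, vocab_size)
    c = sum(node.counts.values())          # total customers at this node
    t = sum(node.tables.values())          # total tables at this node
    if c == 0:
        return parent_p
    cw = node.counts.get(word, 0)
    tw = node.tables.get(word, 0)
    return (max(cw - d * tw, 0.0) + (theta + d * t) * parent_p) / (theta + c)
```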

  6. Problem with HPYLM. (Figure: the same suffix tree, with all customers sitting at depth-2 nodes such as "she will", "he will", "states of", "bread and".)
  • All the real customers reside at depth (n−1) (say, 2) of the suffix tree.
    ◦ This corresponds to a fixed Markov order.
    ◦ But useful contexts have very different lengths: "less than" versus "the united states of america", or a character model for "supercalifragilisticexpialidocious"!
  • How can we deploy customers at suitably different depths?

  7. Infinite-Depth Hierarchical CRP. (Figure: a path of nodes i, j, k descended with pass probabilities 1−q_i, 1−q_j, 1−q_k.)
  • Add a customer by stochastically descending the suffix tree from its root.
  • Each node i has a probability q_i of stopping at that node (1 − q_i is the "penetration" probability), with
        q_i ~ Be(α, β)  i.i.d.                                        (1)
  • Therefore a customer stops at depth n with probability
        p(n | h) = q_n ∏_{i=0}^{n−1} (1 − q_i).                       (2)
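
A small sketch of the descent this slide describes, assuming we simply draw each node's stop probability q_i ~ Be(α, β) on demand. In the actual model the q_i are integrated out, so this is only the generative picture behind Eqs. (1)-(2).

```python
import random

def sample_depth(alpha, beta, q=None, max_depth=100):
    """Sample a Markov order n by descending from the root: stop at node i
    with probability q_i, pass with 1 - q_i, where q_i ~ Be(alpha, beta).
    `q` caches the q_i already drawn along this path (a sketch)."""
    q = [] if q is None else q
    for n in range(max_depth):
        if n == len(q):
            q.append(random.betavariate(alpha, beta))
        if random.random() < q[n]:     # stop at depth n
            return n
    return max_depth                   # truncation, for illustration only
```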

  8. Variable-Order Pitman-Yor Language Model (VPYLM).
  • For the training data w = w_1 w_2 ... w_T, latent Markov orders n = n_1 n_2 ... n_T exist:
        p(w) = Σ_n Σ_s p(w, n, s),                                                            (3)
    where s = s_1 s_2 ... s_T are the seatings of proxy customers in the parent nodes.
  • Gibbs sample n for inference:
        p(n_t | w, n_{−t}, s_{−t}) ∝ p(w_t | n_t, w_{−t}, n_{−t}, s_{−t}) · p(n_t | w_{−t}, n_{−t}, s_{−t}),   (4)
    i.e. the n_t-gram prediction probability times the probability of reaching depth n_t.
    ◦ The two terms trade off against each other (a penalty for deep n_t).
    ◦ How do we compute the second term, p(n_t | w_{−t}, n_{−t}, s_{−t})?
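
A hedged sketch of one Gibbs step for n_t following Eq. (4). The `model` object and its `remove_customer`/`add_customer`/`ngram_prob`/`depth_prob` methods are hypothetical stand-ins for the bookkeeping the talk leaves implicit, and the depth sum is truncated at `max_depth` for illustration.

```python
import random

def gibbs_resample_order(t, w, orders, model, max_depth=20):
    """One Gibbs step for the latent order n_t (a sketch)."""
    model.remove_customer(w, t, orders[t])       # take w_t out of the tree
    scores = []
    for n in range(max_depth + 1):
        pred = model.ngram_prob(w, t, n)         # p(w_t | n_t = n, rest), Eq. (4) left factor
        reach = model.depth_prob(w, t, n)        # prob. of stopping at depth n, Eq. (4) right factor
        scores.append(pred * reach)
    total = sum(scores)
    r, acc, new_n = random.random() * total, 0.0, max_depth
    for n, s in enumerate(scores):               # sample proportionally to the scores
        acc += s
        if r <= acc:
            new_n = n
            break
    orders[t] = new_n
    model.add_customer(w, t, new_n)              # seat w_t at the sampled depth
```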

  9. Inference of VPYLM (2). (Figure: the context path ε ← w_{t−1} ← w_{t−2} ← w_{t−3} for predicting w_t, with node counts (a, b) = (100, 900), (10, 70), (30, 20), (5, 0) and the corresponding pass probabilities, e.g. (900+β)/(1000+α+β) at the root.)
  • We can estimate q_i of node i through the depths of the other customers.
  • Let a_i = the number of times node i was stopped at, and b_i = the number of times node i was passed by. Then
        p(n_t = n | w_{−t}, n_{−t}, s_{−t}) = q_n ∏_{i=0}^{n−1} (1 − q_i)                                          (5)
                                            = (a_n + α)/(a_n + b_n + α + β) · ∏_{i=0}^{n−1} (b_i + β)/(a_i + b_i + α + β).   (6)
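
Eq. (6) is straightforward to compute from the stop/pass counts along the context path. A minimal sketch, with `path_counts` as a hypothetical list of (a_i, b_i) pairs from the root downward:

```python
def depth_prob(path_counts, n, alpha, beta):
    """Eq. (6): probability of stopping at depth n given the stop/pass counts
    (a_i, b_i) of the nodes on the context path, under a Be(alpha, beta) prior."""
    a_n, b_n = path_counts[n]
    p = (a_n + alpha) / (a_n + b_n + alpha + beta)        # expected q_n (stop here)
    for i in range(n):
        a_i, b_i = path_counts[i]
        p *= (b_i + beta) / (a_i + b_i + alpha + beta)    # expected 1 - q_i (pass node i)
    return p
```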

  10. Estimated Markov Orders. (Figure: Hinton diagram over depths n = 0..9 for words such as "consuming", "opposition", "delegates", "european", "nations", and function words like "the", "of", "to", "is", "a", and EOS.)
  • Hinton diagram of p(n_t | w) used in Gibbs sampling for the training data.
  • Estimated Markov orders from which each word has been generated.
  • NAB Wall Street Journal corpus of 10,007,108 words.

  11. Prediction.
  • We do not know the Markov order n beforehand ⇒ sum it out:
        p(w | h) = Σ_{n=0}^{∞} p(w, n | h) = Σ_{n=0}^{∞} p(w | n, h) p(n | h).                     (7)
  • We can rewrite this expression recursively:
        p(w | h) = p(0 | h)·p(w | h, 0) + p(1 | h)·p(w | h, 1) + p(2 | h)·p(w | h, 2) + ...
                 = q_0·p(w | h, 0) + (1−q_0) q_1·p(w | h, 1) + (1−q_0)(1−q_1) q_2·p(w | h, 2) + ...
                 = q_0·p(w | h, 0) + (1−q_0) [ q_1·p(w | h, 1) + (1−q_1) [ q_2·p(w | h, 2) + ... ] ].   (8)
  • Therefore,
        p(w | h, n+) ≡ q_n·p(w | h, n) + (1−q_n)·p(w | h, (n+1)+),                                 (9)
        p(w | h) = p(w | h, 0+).                                                                   (10)
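
Eqs. (9)-(10) suggest a simple recursion down the context path. The sketch below assumes each node on the path exposes its stop/pass counts `a`, `b` and an n-gram predictive probability `predict(word)`, and it truncates the infinite recursion with a caller-supplied `fallback` once the observed path runs out; all of these names are illustrative, not the authors' API.

```python
def predict(path, word, alpha, beta, fallback):
    """Eqs. (9)-(10) as a sketch: path[n] is the depth-n node on the context
    path; q_n is replaced by its posterior mean under Be(alpha, beta)."""
    def rec(n):
        if n == len(path):                    # past the observed path: truncate
            return fallback(word)
        node = path[n]
        q_n = (node.a + alpha) / (node.a + node.b + alpha + beta)   # E[q_n]
        return q_n * node.predict(word) + (1.0 - q_n) * rec(n + 1)
    return rec(0)                             # p(w | h) = p(w | h, 0+)
```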

  12. Prediction (2).
        p(w | h, n+) ≡ q_n·p(w | h, n) + (1−q_n)·p(w | h, (n+1)+),
                       [prediction at depth n]  [prediction at depths > n]
        p(w | h) = p(w | h, 0+),   q_n ~ Be(α, β).
  • This is a stick-breaking process on an infinite tree, where the breaking proportions differ from branch to branch.
  • It is a Bayesian sophistication of the CTW (context tree weighting) algorithm (Willems+ 1995) from information theory (⇒ poster).

  13. Perplexity and Number of Nodes in the Tree.

        n    HPYLM    VPYLM    Nodes (H)    Nodes (V)
        3    113.60   113.74     1,417K      1,344K
        5    101.08   101.69    12,699K      7,466K
        7    N/A      100.68    27,193K*    10,182K
        8    N/A      100.58    34,459K*    10,434K
        ∞    --       100.36       --       10,629K

  • Perplexity = 1 / average predictive probability (lower is better).
  • VPYLM causes no memory overflow even for large n (*: expected number of nodes, shown in italics on the slide).
  • Essentially identical performance to HPYLM, but with far fewer nodes.
  • The ∞-gram performed the best (ε = 1e−8).
