SLIDE 1

The Infinite Markov Model

Daichi Mochihashi
NTT Communication Science Laboratories, Japan
daichi@cslab.kecl.ntt.co.jp
NIPS 2007

SLIDE 2

Overview

[Figure: suffix trees (with node labels such as "is", "will", "he", "she", "of", "states", "united", "language") contrasting a fixed n-th order Markov model with an infinitely variable-order Markov model]

  • Fixed-order Markov dependency

⇓ Infinitely variable Markov orders

  • Simple prior for stochastic trees (other than Coalescents)

  • How can we draw inferences based only on the output sequences?

SLIDE 3

Markov Models

[Figure: 1st- and 2nd-order dependency arcs over the example sentence]

    p("mama I want to sing") = p(mama) × p(I | mama) × p(want | mama I) × p(to | I want) × p(sing | want to)     (n-gram, here a 3-gram)

  • The “n-gram” ((n−1)’th order Markov) model is prevalent in speech recognition and natural language processing

  • Music processing, Bioinformatics, compression, · · ·
  • Notice: HMM is a first order Markov model over hidden states
  • Emission is a unigram on the hidden state
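To make the factorization above concrete, the following sketch (not from the talk) estimates an unsmoothed 3-gram model from counts and scores a word given its two-word context; the toy corpus and the helper names are illustrative assumptions.

    from collections import defaultdict

    def train_trigram(sentences):
        """Count trigrams and their two-word contexts (maximum likelihood, no smoothing)."""
        tri, bi = defaultdict(int), defaultdict(int)
        for s in sentences:
            words = ["<s>", "<s>"] + s.split() + ["</s>"]
            for i in range(2, len(words)):
                bi[(words[i-2], words[i-1])] += 1
                tri[(words[i-2], words[i-1], words[i])] += 1
        return tri, bi

    def trigram_prob(tri, bi, w, u, v):
        """p(w | u v) = count(u v w) / count(u v); zero if the context was never seen."""
        return tri[(u, v, w)] / bi[(u, v)] if bi[(u, v)] else 0.0

    # p("mama I want to sing") factorizes as on the slide:
    # p(mama) * p(I | mama) * p(want | mama I) * p(to | I want) * p(sing | want to)
    tri, bi = train_trigram(["mama I want to sing"])
    print(trigram_prob(tri, bi, "sing", "want", "to"))   # 1.0 on this toy corpus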

SLIDE 4

Estimating a Markov Model

"will" "she will" "he will" "of" "states of" "and" "bread and" ǫ butter Predictive Distributions

  • Each Markov state is a node in a Suffix Tree (Ron+ 1994, Pereira+ 1995, Bühlmann 1999)

  • Depth = Markov order
  • Each node has a predictive distribution over the next word
  • Problem: # of states will explode as the order n gets larger
  • Restrict to a small Markov order (n = 3∼5 in speech and NLP)
  • Distributions get sparser and sparser ⇒ use hierarchical Bayes?
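As a concrete picture of the suffix tree described above (a minimal sketch with assumed names, not the paper's implementation): each node is reached by reading the context backwards from the current position, and it stores the counts that make up its predictive distribution over the next word.

    from collections import defaultdict

    class ContextNode:
        """One Markov state: a suffix-tree node keyed by the preceding words (nearest first)."""
        def __init__(self):
            self.children = {}                    # previous word -> deeper ContextNode
            self.next_counts = defaultdict(int)   # next word -> count (predictive distribution)

    def add_observation(root, context, word, max_order=2):
        """Walk down the tree along the reversed context, counting `word` at every depth."""
        node = root
        node.next_counts[word] += 1               # depth 0: the root ǫ (unigram)
        for prev in reversed(context[-max_order:]):
            node = node.children.setdefault(prev, ContextNode())
            node.next_counts[word] += 1           # deeper node = higher Markov order

    root = ContextNode()
    words = "bread and butter".split()
    for i, w in enumerate(words):
        add_observation(root, words[:i], w)
    # The node for the context "bread and" now predicts "butter":
    print(dict(root.children["and"].children["bread"].next_counts))   # {'butter': 1}

Raising max_order grows the tree one level per order, which is exactly the state explosion the bullets above describe.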

SLIDE 5

Hierarchical (Poisson-) Dirichlet Process

  • Teh (2006), Goldwater+ (2006) mapped a hierarchical Dirichlet

process to Markov Models

"will" "she will" "he will" "of" "states of" "and" ǫ america butter " bread and " Text is usually a · · · · · · · · · a bread and butter

  • n’th order predictive distribution is a Dirichlet process draw

from the (n−1)’th distribution

  • Chinese restaurant process representation:

a customer = a count (in the training data)

  • Hierarchical Pitman-Yor Language Model (HPYLM)
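The bullets above can be turned into a small sketch of the hierarchical Pitman-Yor predictive rule (Teh 2006). This is a simplified illustration, not the talk's code: `customers`, `tables`, and `parent` are assumed field names for the CRP bookkeeping at each suffix-tree node, and the root backs off to a uniform base measure.

    def pitman_yor_predict(node, word, vocab_size, d=0.75, theta=0.5):
        """Predictive probability p(word | context node) under a hierarchical Pitman-Yor prior.

        node.customers[w] / node.tables[w]: CRP customer and table counts for w at this
        restaurant; node.parent: the (n-1)-gram node, i.e. the context shortened by one word.
        """
        if node is None:
            return 1.0 / vocab_size                      # base measure G0: uniform over words
        parent_p = pitman_yor_predict(node.parent, word, vocab_size, d, theta)
        c = sum(node.customers.values())                 # total customers in this restaurant
        t = sum(node.tables.values())                    # total tables in this restaurant
        if c == 0:
            return parent_p                              # empty restaurant: pure back-off
        c_w = node.customers.get(word, 0)
        t_w = node.tables.get(word, 0)
        # discounted count at this depth + mass handed back to the shorter context
        return (max(c_w - d * t_w, 0.0) + (theta + d * t) * parent_p) / (theta + c)

With discount d = 0 this reduces to the hierarchical Dirichlet process case.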

SLIDE 9

Problem with HPYLM

"will" "she will" "he will" "of" "states of" "and" ǫ america butter " bread and " Text is usually a · · · · · · · · · a bread and butter

  • All the real customers reside at depth n−1 (say, 2) in the suffix tree

  • Corresponds to a fixed Markov order
  • Appropriate context lengths differ: “less than” vs. “the united states of america”
  • Character model for “supercalifragilisticexpialidocious”!
  • How can we deploy customers at suitably different depths?

SLIDE 10

Infinite-depth Hierarchical CRP

[Figure: descending a branch through nodes i, j, k, passing each with probability 1−q_i, 1−q_j, 1−q_k]

  • Add a customer by stochastically descending the suffix tree from its root

  • Each node i has a probability q_i of stopping at that node
    (1−q_i equals the “penetration” probability):

      q_i ∼ Be(α, β)   i.i.d.   (1)

  • Therefore, a customer will stop at depth n with probability

      p(n|h) = q_n · ∏_{i=0}^{n−1} (1 − q_i) .   (2)
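A minimal sketch of equations (1)–(2): every node on the branch picked out by the context h carries an independent Beta-distributed stopping probability, and a customer's depth is found by descending until a stop. The node layout and hyperparameter values here are illustrative assumptions.

    import random

    def sample_depth(path_q):
        """Sample the depth n at which a customer stops, i.e. n ~ p(n|h) of eq. (2).

        path_q holds the stopping probabilities q_0, q_1, ... along the branch for context h.
        """
        for depth, q in enumerate(path_q):
            if random.random() < q:        # stop at node `depth` with probability q_i
                return depth
        return len(path_q)                 # in practice deeper nodes are created on demand

    alpha, beta = 1.0, 1.0                 # illustrative values; alpha, beta are hyperparameters
    path_q = [random.betavariate(alpha, beta) for _ in range(10)]   # q_i ~ Be(alpha, beta), eq. (1)
    print(sample_depth(path_q))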

SLIDE 11

Variable-order Pitman-Yor language model (VPYLM)

  • For the training data w = w_1 w_2 ··· w_T , latent Markov orders n = n_1 n_2 ··· n_T exist:

      p(w) = Σ_n Σ_s p(w, n, s)   (3)

  • s = s_1 s_2 ··· s_T : seatings of proxy customers in parent nodes
  • Gibbs sample n for inference:

      p(n_t | w, n_−t, s_−t) ∝ p(w_t | n_t, w_−t, n_−t, s_−t) · p(n_t | w_−t, n_−t, s_−t)   (4)

    where the first term is the n_t-gram prediction and the second is the probability of reaching depth n_t
  • Trade-off between the two terms (a penalty for deep n_t)
  • How to compute the second term p(n_t | w_−t, n_−t, s_−t)?

SLIDE 12

Inference of VPYLM (2)

[Figure: example branch ǫ → w_{t−1} → w_{t−2} → w_{t−3} with (stop, pass) counts (a, b) = (100, 900), (10, 70), (30, 20), (5, 0); the corresponding pass probabilities are (900+β)/(1000+α+β), (70+β)/(80+α+β), (20+β)/(50+α+β) and β/(5+α+β). Sampled orders n_t (e.g. 4, 2, 3, 2) are attached to the words ··· w_{t+1} w_t w_{t−1} w_{t−2} ···]

  • We can estimate q_i of node i through the depths of the other customers
  • Let a_i = # of times node i was stopped at, b_i = # of times node i was passed by:

      p(n_t = n | w_−t, n_−t, s_−t) = q_n · ∏_{i=0}^{n−1} (1 − q_i)   (5)

                                    = (a_n + α)/(a_n + b_n + α + β) · ∏_{i=0}^{n−1} (b_i + β)/(a_i + b_i + α + β) .   (6)
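Equation (6) in code form: a sketch (with an assumed data layout) that computes the probability of stopping at depth n from the stop/pass counts (a_i, b_i) collected along the branch, using the counts from the figure above as a worked example.

    def depth_prior(a, b, n, alpha=1.0, beta=1.0):
        """p(n_t = n | ...) from eq. (6): stop at node n after passing through nodes 0..n-1.

        a[i] = times node i on this branch was stopped at; b[i] = times it was passed by.
        """
        p = (a[n] + alpha) / (a[n] + b[n] + alpha + beta)          # stop at depth n
        for i in range(n):
            p *= (b[i] + beta) / (a[i] + b[i] + alpha + beta)      # pass through depth i
        return p

    # (a, b) counts for the branch epsilon -> w_{t-1} -> w_{t-2} -> w_{t-3} from the figure
    a = [100, 10, 30, 5]
    b = [900, 70, 20, 0]
    print([round(depth_prior(a, b, n), 4) for n in range(4)])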

SLIDE 13

Estimated Markov Orders

[Figure: Hinton diagram of the estimated Markov order n (0–9) for each word of "while key european consuming nations appear unfazed about the prospects of a producer cartel that will attempt to fix prices | the pact is likely to meet strong opposition from u.s. delegates this week EOS"]

  • Hinton diagram of p(nt|w) used in Gibbs sampling for the

training data

  • Estimated Markov orders from which each word has been

generated.

  • NAB Wall Street Journal corpus of 10,007,108 words

SLIDE 14

Prediction

  • We don’t know the Markov order n beforehand ⇒ sum it out:

      p(w|h) = Σ_{n=0}^{∞} p(w, n|h) = Σ_{n=0}^{∞} p(w|n, h) p(n|h) .   (7)

  • We can rewrite the above expression recursively:

      p(w|h) = p(0|h)·p(w|h, 0) + p(1|h)·p(w|h, 1) + p(2|h)·p(w|h, 2) + ···
             = q_0·p(w|h, 0) + (1−q_0)q_1·p(w|h, 1) + (1−q_0)(1−q_1)q_2·p(w|h, 2) + ···
             = q_0·p(w|h, 0) + (1−q_0)[ q_1·p(w|h, 1) + (1−q_1)q_2·p(w|h, 2) + ··· ]   (8)

  • Therefore,

      p(w|h, n+) ≡ q_n · p(w|h, n) + (1 − q_n) · p(w|h, (n+1)+) ,   (9)

      p(w|h) = p(w|h, 0+) .   (10)
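Equations (9)–(10) translate directly into a recursion down the suffix tree: at each depth, mix the depth-n prediction with the prediction from deeper contexts using the stop probability q_n. This is a sketch; the node interface (`q`, `predict`, `child_for`) is an assumption, not the paper's API.

    def predict(node, word, history, depth=0):
        """p(word | history, depth+) as in eq. (9); call with the root node for eq. (10).

        node.q           : stopping probability q_n at this node (e.g. its posterior mean)
        node.predict(w)  : p(w | h, n), the depth-n predictive distribution
        node.child_for(v): the deeper node for previous word v, or None if it does not exist
        """
        p_here = node.predict(word)                               # p(w | h, n)
        child = node.child_for(history[-(depth + 1)]) if depth < len(history) else None
        if child is None:
            return p_here                                         # no deeper context available
        p_deeper = predict(child, word, history, depth + 1)       # p(w | h, (n+1)+)
        return node.q * p_here + (1.0 - node.q) * p_deeper        # eq. (9)

    # p(w|h) = p(w|h, 0+): start the recursion at the root node (eq. 10)
    # p_w = predict(root, w, h)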

SLIDE 15

Prediction (2)

    p(w|h, n+) ≡ q_n · p(w|h, n)            ··· prediction at depth n
               + (1−q_n) · p(w|h, (n+1)+)    ··· prediction at depths > n

    p(w|h) = p(w|h, 0+) ,   q_n ∼ Be(α, β) .

  • Stick-breaking process on an infinite tree, where

breaking proportions will differ from branch to branch.

  • A Bayesian refinement of the CTW (context tree weighting) algorithm (Willems+ 1995) from information theory (⇒ Poster)

SLIDE 16

Perplexity and Number of Nodes in the Tree

      n    HPYLM    VPYLM    Nodes(H)    Nodes(V)
      3   113.60   113.74     1,417K      1,344K
      5   101.08   101.69    12,699K      7,466K
      7     N/A    100.68    27,193K     10,182K
      8     N/A    100.58    34,459K     10,434K
      ∞      —     100.36         —      10,629K

  • Perplexity = inverse of the geometric average of the predictive probabilities (lower is better)
  • VPYLM causes no memory overflow even for large n
  • Italic : expected number of nodes
  • Performance identical to HPYLM, but with far fewer nodes

  • ∞-gram performed the best (ǫ=1e−8)
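For reference, a small sketch of how the perplexity numbers above are obtained from per-word predictive probabilities (the probability list here is a placeholder, not corpus output).

    import math

    def perplexity(word_probs):
        """exp(-mean log p): the inverse geometric mean of the predictive probabilities."""
        return math.exp(-sum(math.log(p) for p in word_probs) / len(word_probs))

    print(perplexity([0.01, 0.02, 0.005]))   # placeholder probabilities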

SLIDE 17

“Stochastic phrases” from VPYLM (1/2)

  • p(w, n|h) = p(w|h, n) p(n|h) ··· probability of generating w using the last n words of h as the context

  • For example, generate “Gaussians” after “mixture of”

↓ “mixture of Gaussians”: a phrase

  • p(w, n|h) = cohesion strength of the stochastic phrase
  • Does not necessarily decay with phrase length (as an empirical probability would)

  • Enumerated by traversing the suffix tree in depth-first order
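A sketch of that enumeration (node fields are assumed for illustration): walk the suffix tree depth-first and score each (context, next word) pair by p(w, n|h) = p(w|h, n) · p(n|h), keeping phrases whose cohesion exceeds a threshold.

    def stochastic_phrases(node, context=(), threshold=0.5):
        """Depth-first traversal yielding (phrase, cohesion) pairs with cohesion >= threshold.

        node.predictive : dict word -> p(w | h, n) at this node (context h of length n)
        node.stop_prob  : p(n | h), the probability of using exactly this depth
        node.children   : dict previous word -> deeper node
        """
        for word, p_word in node.predictive.items():
            cohesion = p_word * node.stop_prob                # p(w, n | h) = p(w|h, n) p(n|h)
            if cohesion >= threshold:
                yield (*context, word), cohesion              # e.g. ("mixture", "of", "Gaussians")
        for prev, child in node.children.items():
            # a child extends the context by one more word further back in the history
            yield from stochastic_phrases(child, (prev, *context), threshold)

    # for phrase, p in sorted(stochastic_phrases(root), key=lambda x: -x[1]):
    #     print(round(p, 4), " ".join(phrase))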

SLIDE 18

“Stochastic phrases” from VPYLM (2/2)

       p      Stochastic phrase in the suffix tree
    0.9784    primary new issues
    0.9726    ˆ at the same time
    0.9556    american telephone &
    0.9512    is a unit of
    0.9394    to # % from # %
    0.8896    in a number of
    0.8831    in new york stock exchange composite trading
    0.8696    a merrill lynch & co.
    0.7566    mechanism of the european monetary
    0.7134    increase as a result of
    0.6617    tiffany & co.
      :       :

  • “ˆ” = beginning-of-sentence, “#” = numbers

SLIDE 19

Random Walk generation from the language model

it was a singular man , fierce and quick-tempered , very foul-mouthed when he was angry , and of her muff and began to sob in a high treble key . “ it seems to have made you , ” said he . ’what have i to his invariable success that the very possibility of something happening on the very morning of the wedding . ” ...

  • Random walk generation from the 5-gram VPYLM

trained on “The Adventures of Sherlock Holmes.”

  • We begin with an infinite number of

“beginning-of-sentence” special symbols as the context.

  • If we use vanilla 5-grams, overfitting will lead to

a mere reproduction of the training data.
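A sketch of the generation loop (interface names assumed): start from a run of beginning-of-sentence symbols and repeatedly sample the next word from the model's predictive distribution given the history so far.

    import random

    def generate(model_predict, vocab, max_len=50, bos="^", eos="EOS"):
        """Random-walk generation: sample w_t ~ p(w | history) until EOS or max_len words.

        model_predict(word, history) returns p(word | history), e.g. the VPYLM prediction
        p(w|h) = p(w|h, 0+) sketched earlier (an assumed interface, not the paper's API).
        """
        history = [bos] * 5                    # stands in for the infinite run of "^" symbols
        out = []
        for _ in range(max_len):
            weights = [model_predict(w, history) for w in vocab]
            word = random.choices(vocab, weights=weights)[0]
            if word == eos:
                break
            out.append(word)
            history.append(word)
        return " ".join(out)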

SLIDE 20

Infinite Character Markov Model

(a) Random-walk generation from a character model:

    ‘how queershaped little children drawling-desks, which would get through that dormouse!’ said alice; ‘let us all for anything the secondly, but it to have and another question, but i shalled out, ‘you are old,’ said the you’re trying to far out to sea.

(b) Markov orders used to generate each character above:

    Character    : s a i d   a l i c e ;   ‘ l e t   u s   a l l   f o r   a n y t h i n g ···
    Markov order : 56547 106543714824465544556456777533459 ···

  • Character-based Markov model trained on “Alice in Wonderland”.
  • Lowercase letters + space
  • OCR, compression, morphology, ···

SLIDE 21

Final Remarks

  • Hyperparameter sensitivity and empirical Bayes optimization

⇒ Paper

  • LDA extension ⇒ Paper (only partially successful)
  • Comparison with Entropy Pruning (Stolcke 1998) ⇒ Poster
  • Poster: W24 (near the escalator).

SLIDE 22

Summary


  • We introduced the Infinite Markov model where the orders are

unspecified and unbounded but can be learned from data.

  • We defined a simple prior for stochastic infinite trees.
  • We expect to use it for latent trees:
  • Variable resolution hierarchical clustering (cf. hLDA)
  • Deep semantic categories just when needed.
  • Also for variable order HMM (pruning approach: Wang+, ICDM

2006)

SLIDE 23

Future Work

  • Fast variational inference
  • Obviates Gibbs for inference and prediction
  • CVB for HDP: Teh et al. (this NIPS)
  • More elaborate tree prior than a single Beta
  • Relationship to tail-free processes (Fabius 1964; Ferguson 1974)
