Neural Probabilistic Language Model for System Combination


  1. Neural Probabilistic Language Model for System Combination. Tsuyoshi Okita, Dublin City University

  2. DCU-NPLM Overview [architecture diagram]: standard system combination (green) consists of backbone selection by MBR/BLEU decoding over the input systems (including Lucy), monolingual word alignment (IHMM, TERalign), confusion network construction, and monotonic consensus decoding; it is extended with QE, topic model, NPLM, and external knowledge components, giving baseline, DA, NPLM, and DA+NPLM decoding variants. 2 / 26

  3. System Combination Overview
◮ System combination [Matusov et al., 05; Rosti et al., 07]
◮ Given: a set of MT outputs
  1. Build a confusion network
     ◮ Select a backbone with the Minimum Bayes Risk (MBR) decoder (with MERT tuning)
     ◮ Run the monolingual word aligner
  2. Run the monotonic (consensus) decoder (with MERT tuning)
◮ We focus on three technical topics:
  1. Minimum Bayes Risk (MBR) decoder (with MERT tuning)
  2. Monolingual word aligner
  3. Monotonic (consensus) decoder (with MERT tuning)
3 / 26

  4. System Combination Overview
Input 1: they are normally on a week .
Input 2: these are normally made in a week .
Input 3: este himself go normally in a week .
Input 4: these do usually in a week .
⇓ 1. MBR decoding
Backbone(2): these are normally made in a week .
⇓ 2. monolingual word alignment
Backbone(2): these are normally made in a week .
hyp(1): they S are normally on S a week . ***** D
hyp(3): este S himself S go S normally S in a week .
hyp(4): these do S usually S in a week . ***** D
⇓ 3. monotonic consensus decoding
Output: these are normally in a week . *****
4 / 26

  5. 1. MBR Decoding
◮ Given the MT outputs, choose one sentence:

$$
\hat{E}_{\mathrm{MBR}}
= \operatorname*{argmin}_{E' \in \mathcal{E}} R(E')
= \operatorname*{argmin}_{E' \in \mathcal{E}} \sum_{E \in \mathcal{E}} L(E, E')\, P(E \mid F)
= \operatorname*{argmin}_{E' \in \mathcal{E}} \sum_{E \in \mathcal{E}} \bigl(1 - \mathrm{BLEU}_{E}(E')\bigr)\, P(E \mid F)
$$

In matrix form (for four outputs, with $B_{E_j}(E_i)$ the BLEU score of $E_i$ using $E_j$ as the reference):

$$
\hat{E}_{\mathrm{MBR}}
= \operatorname*{argmin}
\left( 1 -
\begin{bmatrix}
B_{E_1}(E_1) & B_{E_2}(E_1) & B_{E_3}(E_1) & B_{E_4}(E_1) \\
B_{E_1}(E_2) & B_{E_2}(E_2) & B_{E_3}(E_2) & B_{E_4}(E_2) \\
\vdots & & & \vdots \\
B_{E_1}(E_4) & B_{E_2}(E_4) & B_{E_3}(E_4) & B_{E_4}(E_4)
\end{bmatrix}
\begin{bmatrix}
P(E_1 \mid F) \\ P(E_2 \mid F) \\ P(E_3 \mid F) \\ P(E_4 \mid F)
\end{bmatrix}
\right)
$$

5 / 26
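As a small illustration of the matrix form above (a sketch, not the author's code), the Bayes risk vector can be computed as $(1 - B)\,P$, where $B[i, j]$ is the BLEU of candidate $i$ scored against candidate $j$ as reference; the backbone is the argmin.

```python
# Sketch of MBR backbone selection in matrix form (illustrative only).
import numpy as np

def mbr_backbone(bleu_matrix, posteriors):
    """bleu_matrix[i, j]: BLEU of candidate i scored against candidate j as reference.
    posteriors[j]: P(E_j | F). Returns (index of the MBR backbone, risk vector)."""
    B = np.asarray(bleu_matrix, dtype=float)
    p = np.asarray(posteriors, dtype=float)
    risk = (1.0 - B) @ p            # risk[i] = sum_j (1 - B[i, j]) * P(E_j | F)
    return int(np.argmin(risk)), risk

# Tiny hypothetical example with two candidates and uniform posteriors.
idx, risk = mbr_backbone([[1.0, 0.4],
                          [0.5, 1.0]], [0.5, 0.5])
print(idx, risk)   # -> 1, since candidate 2 has the lower expected loss
```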

  6. 1. MBR Decoding
Input 1: they are normally on a week .
Input 2: these are normally made in a week .
Input 3: este himself go normally in a week .
Input 4: these do usually in a week .

$$
\operatorname*{argmin}
\left( 1 -
\begin{bmatrix}
1.0 & 0.259 & 0.221 & 0.245 \\
0.267 & 1.0 & 0.366 & 0.377 \\
\vdots & & & \vdots \\
0.245 & 0.366 & 0.346 & 1.0
\end{bmatrix}
\begin{bmatrix}
0.25 \\ 0.25 \\ 0.25 \\ 0.25
\end{bmatrix}
\right)
= \operatorname*{argmin}\ [\,0.565,\ 0.502,\ 0.517,\ 0.506\,]
= \text{Input 2}
$$

Backbone(2): these are normally made in a week .
6 / 26
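As a hedged sketch of how such a BLEU matrix could be populated (the slide does not say which BLEU implementation was used), the snippet below scores the four toy outputs pairwise with NLTK's smoothed sentence-level BLEU and applies the same risk computation with the uniform posterior 0.25; the exact numbers depend on the BLEU variant and smoothing, so they will not reproduce the slide's values precisely.

```python
# Populate a pairwise sentence-BLEU matrix for the toy outputs and pick the MBR backbone.
# Illustrative only: the scores depend on the BLEU smoothing and need not match the slide.
import numpy as np
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

outputs = [
    "they are normally on a week .".split(),
    "these are normally made in a week .".split(),
    "este himself go normally in a week .".split(),
    "these do usually in a week .".split(),
]
smooth = SmoothingFunction().method1

# B[i, j] = BLEU of candidate i, with candidate j taken as the reference.
B = np.array([[sentence_bleu([ref], hyp, smoothing_function=smooth)
               for ref in outputs] for hyp in outputs])
posteriors = np.full(len(outputs), 0.25)       # uniform P(E | F), as on the slide
risk = (1.0 - B) @ posteriors
print("risks:", risk.round(3), "-> backbone: Input", int(np.argmin(risk)) + 1)
```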

  7. 2. Monolingual Word Alignment
◮ TER-based monolingual word alignment
◮ Identical words in different sentences are aligned
◮ Performed in a pairwise manner: Input 1 and the backbone, Input 3 and the backbone, Input 4 and the backbone

Backbone(2): these are normally made in a week .
hyp(1): they S are normally on S a week . ***** D

Backbone(2): these are normally made in a week .
hyp(3): este S himself S go S normally S in a week .

Backbone(2): these are normally made in a week .
hyp(4): these do S usually S in a week . ***** D
7 / 26
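The following is a simplified sketch of this alignment step, assuming plain Levenshtein (edit-distance) alignment rather than full TER with block shifts: identical words are matched, substitutions are tagged S, unmatched backbone words are shown as ***** with tag D, and extra hypothesis words as insertions I.

```python
# Simplified TER-style monolingual word alignment (no block shifts), as a sketch.

def edit_align(backbone, hyp):
    """Align hyp to backbone with minimum edit distance; yield (backbone_word, hyp_word, tag)."""
    n, m = len(backbone), len(hyp)
    # dp[i][j] = edit distance between backbone[:i] and hyp[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i
    for j in range(m + 1):
        dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if backbone[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j - 1] + cost,   # match / substitution
                           dp[i - 1][j] + 1,          # deletion (backbone word unmatched)
                           dp[i][j - 1] + 1)          # insertion (extra hypothesis word)
    # Trace back to recover the alignment.
    align, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + (0 if backbone[i - 1] == hyp[j - 1] else 1):
            tag = "" if backbone[i - 1] == hyp[j - 1] else "S"
            align.append((backbone[i - 1], hyp[j - 1], tag))
            i, j = i - 1, j - 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + 1:
            align.append((backbone[i - 1], "*****", "D"))
            i -= 1
        else:
            align.append(("", hyp[j - 1], "I"))
            j -= 1
    return list(reversed(align))

backbone = "these are normally made in a week .".split()
hyp1 = "they are normally on a week .".split()
for b, h, tag in edit_align(backbone, hyp1):
    print(f"{b:10s} {h:10s} {tag}")
```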

  8. 3. Monotonic Consensus Decoding
◮ Monotonic consensus decoding is a limited version of MAP decoding
◮ monotonic (position dependent)
◮ phrase selection depends on the position (local TMs + global LM)

$$
e_{\text{best}}
= \operatorname*{argmax}_{e} \prod_{i=1}^{I} \phi(i \mid \bar{e}_i)\, p_{\mathrm{LM}}(e)
= \operatorname*{argmax}_{e} \{\, \phi(1 \mid \text{these})\, \phi(2 \mid \text{are})\, \phi(3 \mid \text{normally})\, \phi(4 \mid \varnothing)\, \phi(5 \mid \text{in})\, \phi(6 \mid \text{a})\, \phi(7 \mid \text{week})\, p_{\mathrm{LM}}(e),\ \ldots \,\}
= \text{these are normally in a week} \qquad (1)
$$

1 ||| these ||| 0.50      2 ||| are ||| 0.50        3 ||| normally ||| 0.50
1 ||| they ||| 0.25       2 ||| himself ||| 0.25    ...
1 ||| este ||| 0.25       2 ||| ∅ ||| 0.25          ...
8 / 26
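A minimal sketch of position-wise consensus decoding, assuming the language-model term is dropped so that decoding reduces to an argmax over φ at each confusion-network position. The entries for the first positions follow the slide's table where given; the remaining values are hypothetical placeholders, since the slide truncates its table.

```python
# Position-wise consensus decoding over a confusion network (LM term omitted).
# Positions 1-2 use the slide's phi values; position 3 is partly inferred from the
# alignment; positions 4-8 are hypothetical placeholders chosen so that the epsilon
# arc wins at position 4, dropping the backbone word "made".

confusion_network = [
    {"these": 0.50, "they": 0.25, "este": 0.25},       # position 1 (from the slide)
    {"are": 0.50, "himself": 0.25, "<eps>": 0.25},     # position 2 (from the slide)
    {"normally": 0.50, "go": 0.25, "usually": 0.25},   # position 3 (0.50 from the slide; rest inferred)
    {"<eps>": 0.50, "made": 0.25, "on": 0.25},         # position 4 (hypothetical)
    {"in": 0.75, "<eps>": 0.25},                        # position 5 (hypothetical)
    {"a": 1.00},                                        # position 6 (hypothetical)
    {"week": 1.00},                                     # position 7 (hypothetical)
    {".": 1.00},                                        # position 8 (hypothetical)
]

output = []
for slot in confusion_network:
    word = max(slot, key=slot.get)   # best word at this position
    if word != "<eps>":              # epsilon arcs emit nothing
        output.append(word)
print(" ".join(output))              # -> "these are normally in a week ."
```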

  9. Table of Contents
1. Overview of System Combination with Latent Variable with NPLM
2. Neural Probabilistic Language Model
3. Experiments
4. Conclusions and Further Work
9 / 26

  10. Overview
1. N-gram language model
2. Smoothing methods for the n-gram language model [Kneser and Ney, 95; Chen and Goodman, 98; Teh, 06]
   ◮ Of particular interest: unseen data
3. Neural probabilistic language model (NPLM) [Bengio, 00; Bengio et al., 2005]
◮ Perplexity improves from 1 to 2 to 3 (the NPLM achieves the lowest perplexity)
10 / 26

  11. N-gram Language Model
◮ N-gram language model $p(W)$ (where $W$ is a string $w_1, \ldots, w_n$)
◮ $p(W)$ is the probability that, if we pick a sentence of English words at random, it turns out to be $W$.
◮ Markov assumption
  ◮ Markov chain: $p(w_1, \ldots, w_n) = p(w_1)\, p(w_2 \mid w_1) \cdots p(w_n \mid w_1, \ldots, w_{n-1})$
  ◮ History under $m$ words: $p(w_n \mid w_1, \ldots, w_{n-1}) \approx p(w_n \mid w_{n-m}, \ldots, w_{n-1})$
◮ Perplexity (this measure is used when one tries to model an unknown probability distribution $p$, based on a training sample that was drawn from $p$)
  ◮ Given a proposed model $q$, the perplexity
    $$ 2^{-\frac{1}{N} \sum_{i=1}^{N} \log_2 q(x_i)} $$
    suggests how well it predicts a separate test sample $x_1, \ldots, x_N$ also drawn from $p$.
11 / 26
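A tiny sketch of the perplexity definition above, assuming the model $q$ simply supplies a probability for each test token; the probabilities here are made up for illustration.

```python
# Perplexity of a test sample under a proposed model q: 2^(-(1/N) * sum_i log2 q(x_i)).
import math

def perplexity(token_probs):
    """token_probs: the probabilities q(x_i) that the model assigns to each test token."""
    N = len(token_probs)
    avg_log2 = sum(math.log2(p) for p in token_probs) / N
    return 2.0 ** (-avg_log2)

# Hypothetical probabilities for a 5-token test sample.
print(perplexity([0.10, 0.20, 0.05, 0.10, 0.25]))   # roughly 8.3
```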

  12. Language Model Smoothing (1)
◮ Motivation: the unseen n-gram problem
  ◮ An n-gram that did not appear in the training set may appear in the test set.
    1. The probability of n-grams in the training set is too big.
    2. The probability of unseen n-grams is zero.
  ◮ (Some n-grams that could reasonably appear, judging from lower- or higher-order n-grams, may not appear in the training set.)
◮ A smoothing method is
  1. to adjust the empirical counts observed in the training set toward the expected counts of n-grams in previously unseen text.
  2. to estimate the expected counts of unseen n-grams in the test set. (Often no treatment)
12 / 26

  13. Language Model Smoothing (2)

Maximum likelihood:
$$ P(w_i \mid w_{i-1}) = \frac{c(w_{i-1} w_i)}{\sum_w c(w_{i-1} w)} $$

Add one:
$$ P(w_i \mid w_{i-1}) = \frac{c(w_{i-1} w_i) + 1}{\sum_w c(w_{i-1} w) + V} $$

Absolute discounting:
$$ P(w_i \mid w_{i-1}) = \frac{c(w_{i-1} w_i) - D}{\sum_w c(w_{i-1} w)} $$

Kneser-Ney:
$$ P(w_i \mid w_{i-1}) = \frac{c(w_{i-1} w_i) - D}{\sum_w c(w_{i-1} w)} + \alpha(w_i)\, \frac{N_{1+}(\bullet\, w_i)}{\sum_w N_{1+}(\bullet\, w)} $$

Interpolated modified Kneser-Ney:
$$ P(w_i \mid w_{i-1}) = \frac{c(w_{i-1} w_i) - D_i}{\sum_w c(w_{i-1} w)} + \beta(w_i)\, \frac{N_{1+}(\bullet\, w_i)}{\sum_w N_{1+}(\bullet\, w)} $$
$$ D_1 = 1 - 2Y N_2 / N_1, \quad D_2 = 2 - 3Y N_3 / N_2, \quad D_{3+} = 3 - 4Y N_4 / N_3, \quad Y = N_1 / (N_1 + 2 N_2) $$

Hierarchical Pitman-Yor:
$$ P(w_i \mid w_{i-1}) = \frac{c(w_{i-1} w_i) - d \cdot t_{hw}}{\theta + \sum_w c(w_{i-1} w)} + \delta(w_i)\, \frac{N_{1+}(\bullet\, w_i)}{\sum_w N_{1+}(\bullet\, w)}, \qquad \delta(w_i) = \frac{\theta + d \cdot t_{h \bullet}}{\theta + \sum_w c(w_{i-1} w)} $$

Table: Smoothing Methods for the Language Model
13 / 26
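As a hedged illustration of one row of the table (interpolated Kneser-Ney with a single fixed discount D, not the modified-KN or hierarchical Pitman-Yor variants), here is a toy bigram implementation; the corpus and discount value are made up.

```python
# Interpolated Kneser-Ney bigram smoothing with a single fixed discount D (toy sketch).
from collections import Counter, defaultdict

corpus = "these are normally made in a week . these do usually in a week .".split()
D = 0.75

bigrams = list(zip(corpus, corpus[1:]))
bigram_count = Counter(bigrams)
context_count = Counter(w1 for w1, _ in bigrams)
followers = defaultdict(set)      # N_{1+}(w1 *): distinct words following w1
histories = defaultdict(set)      # N_{1+}(* w2): distinct words preceding w2
for w1, w2 in bigram_count:
    followers[w1].add(w2)
    histories[w2].add(w1)
total_bigram_types = len(bigram_count)

def p_kn(w2, w1):
    """Interpolated Kneser-Ney estimate P(w2 | w1) for a seen context w1."""
    discounted = max(bigram_count[(w1, w2)] - D, 0.0) / context_count[w1]
    backoff_weight = D * len(followers[w1]) / context_count[w1]
    continuation = len(histories[w2]) / total_bigram_types
    return discounted + backoff_weight * continuation

print(p_kn("week", "a"))   # seen bigram
print(p_kn("week", "in"))  # unseen bigram: nonzero mass via the continuation probability
```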

  14. Neural Probabilistic Language Model
◮ Learns a representation of the data in order to make the probability distribution of word sequences more compact
◮ Focuses on words with similar semantic and syntactic roles
◮ For example, the two sentences
  ◮ "The cat is walking in the bedroom" and
  ◮ "A dog was running in a room"
  show the similarity between (the, a), (bedroom, room), (is, was), and (running, walking).
◮ Bengio's implementation [00]
  ◮ Implementation using a multi-layer neural network
  ◮ 20% to 35% better perplexity than a language model with the modified Kneser-Ney method
14 / 26
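The slide describes Bengio's feedforward architecture only at a high level, so below is a minimal PyTorch sketch of a Bengio-style NPLM (not the DCU implementation): shared word embeddings for the (n-1)-word context, a tanh hidden layer, and a softmax over the vocabulary.

```python
# Minimal Bengio-style feedforward NPLM (illustrative sketch).
import torch
import torch.nn as nn

class BengioNPLM(nn.Module):
    def __init__(self, vocab_size, context_size=4, embed_dim=64, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)            # shared word feature vectors
        self.hidden = nn.Linear(context_size * embed_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, context):                  # context: (batch, context_size) word ids
        x = self.embed(context)                   # (batch, context_size, embed_dim)
        x = x.view(context.size(0), -1)           # concatenate the context embeddings
        h = torch.tanh(self.hidden(x))
        return torch.log_softmax(self.out(h), dim=-1)   # log P(next word | context)

# Toy usage: predict the next word after a 4-word context.
model = BengioNPLM(vocab_size=1000)
context = torch.randint(0, 1000, (2, 4))          # batch of 2 contexts
log_probs = model(context)                         # (2, 1000)
loss = nn.NLLLoss()(log_probs, torch.tensor([5, 42]))   # next-word targets
```

Training would minimize this negative log-likelihood over (context, next word) pairs extracted from monolingual data.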

  15. Neural Probabilistic Language Model (2)
◮ Captures semantically and syntactically similar words in such a way that a latent word depends on the context (idealized example below):

a japanese electronics executive was kidnapped
the u.s. tobacco director is abducted
its german sales manager were killed
one british consulting economist be found
russian spokesman are abduction
15 / 26
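Once such a model is trained, the intuition on this slide can be checked by looking at nearest neighbours in the embedding space. The snippet below is a hypothetical sketch (toy vocabulary, untrained embeddings standing in for a trained table), so it only shows the mechanics, not actual similarities.

```python
# Inspect word similarity in an NPLM embedding table via cosine similarity (sketch).
import torch
import torch.nn.functional as F

vocab = ["kidnapped", "abducted", "killed", "was", "is", "were", "executive", "director"]
word2id = {w: i for i, w in enumerate(vocab)}
embed = torch.nn.Embedding(len(vocab), 64)   # stands in for a trained NPLM embedding table

def nearest(word, k=3):
    """Return the k words whose embeddings are most cosine-similar to `word`."""
    target = embed.weight[word2id[word]]
    sims = F.cosine_similarity(target.unsqueeze(0), embed.weight, dim=-1)
    best = sims.topk(k + 1).indices.tolist()      # +1 because the word itself ranks first
    return [vocab[i] for i in best if i != word2id[word]][:k]

print(nearest("kidnapped"))
```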
