Neural Probabilistic Language Model for System Combination Tsuyoshi - PDF document

Neural Probabilistic Language Model for System Combination Tsuyoshi Okita Dublin City University

DCU-NPLM Overview MBR BLEU QE Lucy backbone decoder a b c confusion topic network QE Alignment NPLM construction monolingual d e f external knowledge word alignment IHMM TERalign monotonic A B C D consensus baseline DA NPLM DA+NPLM decoding Standard system combination (green) 2 / 26

System Combination Overview ◮ System combination [Matusov et al., 05; Rosti et al., 07] ◮ We focus on three technical topics 1. Minimum-Bayes Risk (MBR) decoder (with MERT tuning) 2. Monolingual word aligner 3. Monotonic (consensus) decoder (with MERT tuning) 3 / 26

System Combination Overview ◮ System combination [Matusov et al., 05; Rosti et al., 07] ◮ Given: Set of MT outputs ◮ We focus on three technical topics 1. Minimum-Bayes Risk (MBR) decoder (with MERT tuning) 2. Monolingual word aligner 3. Monotonic (consensus) decoder (with MERT tuning) 3 / 26

System Combination Overview ◮ System combination [Matusov et al., 05; Rosti et al., 07] ◮ Given: Set of MT outputs 1. Build a confusion network ◮ We focus on three technical topics 1. Minimum-Bayes Risk (MBR) decoder (with MERT tuning) 2. Monolingual word aligner 3. Monotonic (consensus) decoder (with MERT tuning) 3 / 26

System Combination Overview ◮ System combination [Matusov et al., 05; Rosti et al., 07] ◮ Given: Set of MT outputs 1. Build a confusion network ◮ Select a backbone by Minimum-Bayes Risk (MBR) decoder (with MERT tuning) ◮ We focus on three technical topics 1. Minimum-Bayes Risk (MBR) decoder (with MERT tuning) 2. Monolingual word aligner 3. Monotonic (consensus) decoder (with MERT tuning) 3 / 26

System Combination Overview ◮ System combination [Matusov et al., 05; Rosti et al., 07] ◮ Given: Set of MT outputs 1. Build a confusion network ◮ Select a backbone by Minimum-Bayes Risk (MBR) decoder (with MERT tuning) ◮ Run monolingual word aligner ◮ We focus on three technical topics 1. Minimum-Bayes Risk (MBR) decoder (with MERT tuning) 2. Monolingual word aligner 3. Monotonic (consensus) decoder (with MERT tuning) 3 / 26

System Combination Overview ◮ System combination [Matusov et al., 05; Rosti et al., 07] ◮ Given: Set of MT outputs 1. Build a confusion network ◮ Select a backbone by Minimum-Bayes Risk (MBR) decoder (with MERT tuning) ◮ Run monolingual word aligner 2. Run monotonic (consensus) decoder (with MERT tuning) ◮ We focus on three technical topics 1. Minimum-Bayes Risk (MBR) decoder (with MERT tuning) 2. Monolingual word aligner 3. Monotonic (consensus) decoder (with MERT tuning) 3 / 26

System Combination Overview they are normally on a week . Input 1 these are normally made in a week . Input 2 este himself go normally in a week . Input 3 these do usually in a week . Input 4 ⇓ 1. MBR decoding these are normally made in a week . Backbone(2) ⇓ 2. monolingual word alignment these are normally made in a week . Backbone(2) hyp(1) they S are normally on S a week . ***** D hyp(3) este S himself S go S normally S in a week . hyp(4) these do S usually S in a week . ***** D ⇓ 3. monotonic consensus decoding these are normally in a week . Output ***** 4 / 26

1. MBR Decoding 1. Given MT outputs, choose 1 sentence. ˆ E MBR = argmin E ′ ∈E R ( E � ) best � L ( E , E � ) P ( E | F ) = argmin E ′ ∈E E ′ ∈E E � (1 − BLEU E ( E � )) P ( E | F ) = argmin E ′ ∈E E ′ ∈E E = argmin E ′ ∈E ⎡ ⎡ ⎤ ⎤ ⎡ ⎤ B E 1 ( E 1 ) B E 2 ( E 1 ) B E 3 ( E 1 ) B E 4 ( E 1 ) P ( E 1 | F ) B E 1 ( E 2 ) B E 2 ( E 2 ) B E 3 ( E 2 ) B E 4 ( E 2 ) P ( E 2 | F ) ⎢ ⎢ ⎥ ⎥ ⎢ ⎥ ⎣ 1 − ⎢ ⎢ ⎥ ⎥ ⎢ ⎥ . . . . . . P ( E 3 | F ) ⎣ ⎦ ⎦ ⎣ ⎦ B E 1 ( E 4 ) B E 2 ( E 4 ) B E 3 ( E 4 ) B E 4 ( E 4 ) P ( E 4 | F ) 5 / 26

1. MBR Decoding they are normally on a week . Input 1 these are normally made in a week . Input 2 este himself go normally in a week . Input 3 these do usually in a week . Input 4 ⎡ ⎡ ⎤ ⎤ ⎡ ⎤ 1 . 0 0 . 259 0 . 221 0 . 245 0 . 25 0 . 267 1 . 0 0 . 366 0 . 377 0 . 25 ⎢ ⎢ ⎥ ⎥ ⎢ ⎥ = argmin ⎣ 1 − ⎢ ⎢ ⎥ ⎥ ⎢ ⎥ . . . . . . 0 . 25 ⎣ ⎦ ⎦ ⎣ ⎦ 0 . 245 0 . 366 0 . 346 1 . 0 0 . 25 = argmin [0 . 565 , 0 . 502 , 0 . 517 , 0 . 506] = ( Input 2) these are normally made in a week . Backbone(2) 6 / 26

2. Monolingual Word Alignment ◮ TER-based monolingual word alignment ◮ Same words in different sentence are aligned ◮ Proceeded in a pairwise manner: Input 1 and backbone, Input 3 and backbone, Input 4 and backbone. these are normally made in a week . Backbone(2) hyp(1) they S are normally on S a week . ***** D these are normally made in a week . Backbone(2) hyp(3) este S himself S go S normally S in a week . these are normally made in a week . Backbone(2) hyp(4) these do S usually S in a week . ***** D 7 / 26

3. Monotonic Consensus Decoding ◮ Monotonic consensus decoding is limited version of MAP decoding ◮ monotonic (position dependent) ◮ phrase selection depends on the position (local TMs + global LM) I � φ ( i | ¯ e best = arg max e i ) p LM ( e ) e i =1 = arg max e { φ (1 | these) φ (2 | are) φ (3 | normally) φ (4 |∅ ) φ (5 | in) φ (6 | a) φ (7 | week) p LM ( e ) , . . . } = these are normally in a week (1) 1 ||| these ||| 0.50 2 ||| are ||| 0.50 3 ||| normally ||| 0.50 1 ||| they ||| 0.25 2 ||| himself ||| 0.25 ... 1 ||| este ||| 0.25 2 ||| ∅ ||| 0.25 ... 8 / 26

Table Of Contents 1. Overview of System Combination with Latent Variable with NPLM 2. Neural Probabilistic Language Model 3. Experiments 4. Conclusions and Further Works 9 / 26

Overview 1. N-gram language model 2. Smoothing methods for n-gram language model [Kneser and Ney, 95; Chen and Goodman, 98; Teh, 06] ◮ Particular interest on unseen data 3. Neural probabilistic language model ( NPLM )[Bengio, 00;Bengio et al., 2005] ◮ Perplexity: 1 < 2 < 3 10 / 26

N-gram Language Model ◮ N-gram Language Model p ( W ) (where W is a string w 1 , . . . , w n ) ◮ p ( W ) is the probability that if we pick a sentence of English words at random, it turns out to be W . ◮ Markov assumption ◮ Markov chain: p ( w 1 , . . . , w n ) = p ( w 1 ) p ( w 2 | w 1 ) . . . p ( w n | w 1 , . . . , w n − 1 ) ◮ History under m words: p ( w n | w 1 , . . . , w n − 1 ) ≈ p ( w n | w n − m , . . . , w n − 1 ) ◮ Perplexity (This measure is used when one tries to model an unknown probability distribution p , based on a training sample that was drawn from p .) ◮ Given a proposed model q , the perplexity, defined as P N 1 N log 2 q ( x i ) , 2 i =1 suggests how well it predicts a separate test sample x 1 , . . . , x N also drawn from p . 11 / 26

Language Model Smoothing(1) ◮ Motivation: unseen n-gram problem ◮ An n-gram which was not appeared in the training set may appear in the test set. 1. The probability of n-grams in training set is too big. 2. The probability of unseen n-grams is zero. ◮ (Some n-grams which will be reasonably appeared based on the lower- / higher-order n-grams may not appeared in the training set.) ◮ Smoothing method is 1. to adjust the empirical counts that we observe in the training set to the expected counts of n-grams in previously unseen text. 2. to estimate the expected counts of unseen n-grams included in test set. (Often no treatment) 12 / 26

Language Model Smoothing (2) c ( w i − 1 w i ) P ( w i | w i − 1 ) = maximum likelihood P w c ( w i − 1 w ) c ( w i − 1 w i )+1 add one P ( w i | w i − 1 ) = P w c ( w i − 1 w )+ v P ( w i | w i − 1 ) = c ( w i − 1 w i ) − D absolute discounting P w c ( w i − 1 w ) P ( w i | w i − 1 ) = c ( w i − 1 w i ) − D N 1+ ( • w ) Kneser-Ney w c ( w i − 1 w ) , α ( w i ) P N 1+ ( w i − 1 w ) P ( w i | w i − 1 ) = c ( w i − 1 w i ) − D i N 1+ ( • w ) interpolated modified KN w c ( w i − 1 w ) + β ( w i ) P N 1+ ( w i − 1 w ) D 1 = 1 − 2 YN 2 / N 1 , D 2 = 2 − 3 YN 3 / N 2 D 3+ = 3 − 4 YN 4 / N 3 , Y = N 1 / ( N 1 + 2 N 2 ) P ( w i | w i − 1 ) = c ( w i − 1 w i ) − d · t hw N 1+ ( • w ) hierarchical PY w c ( w i − 1 w )+ θ + δ ( w i ) P N 1+ ( w i − 1 w ) θ + d · t h · δ ( w i ) = θ + P w c ( w i − 1 w ) Table: Smoothing Method for Language Model 13 / 26

Neural Probabilistic Language Model ◮ Learning representation of data in order to make the probability distribution of word sequences more compact ◮ Focus on similar semantical and syntactical roles of words. ◮ For example, two sentences ◮ “The cat is walking in the bedroom” and ◮ “A dog was running in a room” ◮ Similarity between (the, a), (bedroom, room), (is, was), and (running, walking). ◮ Bengio’s implementation [00]. ◮ Implemention using multi-layer neural network. ◮ 20% to 35% better perplexity than the language model with the modified Kneser-Ney methods. 14 / 26

Neural Probabilistic Language Model (2) ◮ to capture the semantically and syntactically similar words in a way that a latent word depends on the context (Below ideal situation) a japanese electronics executive was kidnapped the u.s. tabacco director is abducted its german sales manager were killed one british consulting economist be found russian spokesman are abduction 15 / 26

Neural Probabilistic Language Model for System Combination Tsuyoshi - PDF document

Neural Probabilistic Language Model for System Combination Tsuyoshi Okita Dublin City University DCU-NPLM Overview MBR BLEU QE Lucy backbone decoder a b c confusion topic network QE Alignment NPLM

Probabilistic model Probabilistic model c Probabilistic model Probabilistic model c c

MT System Combination Silja Hildebrand MT System Combination System Combination in MT

Neural Information Retrieval Wassila Lalouani 1 Plan Neural network architectures Neural

CS 4110 Probabilistic Programming Probabilistic Programming It's not about writing software.

Neural Networks and Handwriting Recognition Background Neural Networks Neural Network Steven

Learning Neural Networks Learning Neural Networks Neural Networks can represent complex Neural

Introduction to Neural Machine Translation Gongbo Tang 16 September 2019 Outline Why Neural

Outline Multi-Engine Machine Translation 1 Alignment Search Space Features Match Model

Running Probabilistic Running Probabilistic Running Probabilistic Programs Backwards Programs

Probabilistic Tracking and Probabilistic Tracking and Probabilistic Tracking and Thesis

Probabilistic Computation Lecture 13 BPP vs. PH 1 Recap 2 Recap Probabilistic computation 2

Table of Contents I Probabilistic Reasoning Classical Probabilistic Models Basic Probabilistic

Probabilistic Computation Lecture 12 Flipping coins, taking chances PP, BPP 1 Probabilistic

Probabilistic Tracking and Probabilistic Tracking and Probabilistic Tracking and Reconstruction

Probabilistic Computation Lecture 13 Understanding BPP 1 Recap 2 Recap Probabilistic

From Probabilistic Circuits to Probabilistic Programs and Back Guy Van den Broeck PROBPROG - Oct

CAP Developments CAP Developments in Washington State in Washington State Don Miller

Facilitation of Fine Motor and Visual Skills, Building a Foundation for Success Stacy Krueger

Quill: A Collaborative Design Quill: A Collaborative Design Assistant for Cross Platform Web

Emergency Action Pl ans What should you do in an emergency situation? An Emergency Action Plan

Advancing the Wireless Emergency Alerts (WEA) 3.0 System Steve Barclay (Moderator) Sr.

Logical minimisation of metarules in meta-interpretive learning Andrew Cropper and Stephen

Bayesian Reasoning Todays Class Posteriors and priors We dont (cant!) know

Abductive reasoning with explicit justification Advisors Ph.D. Francisco Hernndez Quiroz (UNAM,