SLIDE 1

Improving the Minimum Bayes’ Risk Combination of Machine Translation Systems

Jesús González-Rubio, Francisco Casacuberta

{jegonzalez,fcn}@dsic.upv.es

Pattern Recognition and Human Language Technology Group, Universitat Politècnica de València (Spain)

Work supported by the EU 7th Framework Programme (FP7/2007-2013) under the CasMaCat project (grant no. 287576). JGR, FCN. IWSLT’13

SLIDE 2

Overview

  • Introduction
  • Minimum Bayes’ Risk System Combination
  • Dynamic Programming Decoding for MBRSC
  • Evaluation
  • Conclusions

SLIDE 3

Introduction

SLIDE 4

Motivation

  • MT technology is still far from human translation quality
  • Different MT approaches have complementary strengths and limitations
  • Focus on Minimum Bayes’ Risk System Combination (MBRSC)

– Conceptually simple, with competitive empirical results

  • Our contributions:

– New decoding algorithms based on Dynamic Programming (DP)
– An MBRSC formulation based on linear BLEU [Tromble et al., 2008]

SLIDE 5

Minimum Bayes’ Risk System Combination

SLIDE 6

Model and Decision Function

  • Weighted ensemble of K probability distributions (translation models):

    P(y | x) = Σ_{k=1}^{K} α_k · P_k(y | x)

  • The minimum Bayes’ risk classifier for BLEU is given by:

    ŷ = argmax_{y ∈ Y} Σ_{k=1}^{K} α_k · Σ_{y′ ∈ Y} P_k(y′ | x) · BLEU(y, y′)

    where the inner sum over y′ is the system-specific expected BLEU of y
  • Complex decoding problem
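As an illustrative sketch (not the paper's implementation), the decision rule can be applied directly to small N-best lists. Here `unigram_gain` is a stand-in for sentence-level BLEU, kept short for readability; all names are hypothetical:

```python
from collections import Counter

def unigram_gain(y, y_ref):
    """Stand-in for sentence-level BLEU: unigram precision of y against y_ref."""
    if not y:
        return 0.0
    ref_counts = Counter(y_ref)
    matched = sum(min(c, ref_counts[w]) for w, c in Counter(y).items())
    return matched / len(y)

def mbr_combination(nbest_lists, alphas, gain=unigram_gain):
    """MBR system combination: pick the candidate maximizing the weighted
    sum of the K system-specific expected gains.

    nbest_lists: K lists of (hypothesis tuple, posterior probability) pairs.
    alphas: the K ensemble weights."""
    # Candidate pool Y: here, the union of all systems' hypotheses.
    pool = {y for nbest in nbest_lists for y, _ in nbest}

    def score(y):
        return sum(alpha * sum(p * gain(y, y2) for y2, p in nbest)
                   for alpha, nbest in zip(alphas, nbest_lists))

    return max(pool, key=score)
```

Enumerating the pool explicitly is exactly what makes the direct implementation quadratic in |Y|, which motivates the approximations on the next slide.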

SLIDE 7

Decoding

  • Direct implementation has a temporal complexity in O(max(|y|) · |Y|²)
  • Practical approach: divide decoding into gain computation and search
  • Expected BLEU is approximated by BLEU over expected n-gram counts:

    Σ_{y′ ∈ Y} P(y′ | x) · BLEU(y, y′)  ≈  BLEU(y, { Σ_{y′ ∈ Y} P(y′ | x) · #_w(y′) }_w)

    i.e., the expected BLEU of y is approximated by the BLEU of y computed over the expected count of each n-gram w
  • Search is implemented as a gradient ascent algorithm
  • Final temporal complexity in O(max(|y|)² · |Σ| · S)
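The gain-computation step can be illustrated as follows. This is a sketch, not the authors' code, under the assumption that hypotheses arrive as (token list, posterior probability) pairs:

```python
from collections import Counter, defaultdict

def ngrams(words, n):
    """All n-grams of a token sequence, as tuples."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

def expected_ngram_counts(hyps, max_n=4):
    """Expected n-gram counts over a set of weighted hypotheses.

    hyps: list of (token list, posterior probability P(y'|x)) pairs.
    Returns {ngram w: sum over y' of P(y'|x) * #_w(y')}."""
    expected = defaultdict(float)
    for words, prob in hyps:
        for n in range(1, max_n + 1):
            for w, c in Counter(ngrams(words, n)).items():
                expected[w] += prob * c
    return dict(expected)
```

Computing this table once per source sentence is what lets the search step score candidates against fixed expected counts instead of looping over all of Y for every candidate.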

SLIDE 8

Dynamic Programming Decoding for MBRSC

SLIDE 9

Dynamic Programming Decoding

  • Gradient ascent decoding is sensitive to the initial solution

– Prone to get stuck in local optima

  • Dynamic programming provides a more sophisticated solution
  • Basic idea: iterative generation of new translation hypotheses

– Start with an empty hypothesis
– Repeatedly generate hypotheses of size i + 1 by extending hypotheses of size i with one more target word

SLIDE 10

Dynamic Programming Decoding II

  • Graph structure, nodes store hypotheses with the same n-grams

[Diagram: layered search graph; columns of nodes for hypothesis sizes i − 3, ..., i − 2, i − 1, i, i + 1]

  • Unfortunately, the number of nodes is exponential in |Σ|
  • In practice, DP decoding is implemented as a beam search algorithm

SLIDE 11

Beam Search Implementation

  • Key idea: keep the M best-scoring hypotheses at each step
  • Breadth-first exploration to avoid repeated computations
  • Upper bound (I) on the size of the consensus translations
  • Rest-score estimation to better compare the potential of each hypothesis
  • Final complexity in O(I² · M · D), where D ≪ |Σ|

SLIDE 12

Dynamic Programming Decoding for Linear BLEU

SLIDE 13

Why Linear BLEU?

  • Count clipping forbids the incremental computation of BLEU

    – We cannot exploit the full potential of the DP framework

  • Linear BLEU approximates the logarithm of BLEU [Tromble et al., 2008]:

    log(BLEU(y, y′)) ≈ λ₀ · |y| + Σ_{w ∈ W(y)} λ_w · #_w(y) · δ_w(y′)   (1)

    where W(y) is the set of n-grams of y, #_w(y) is the count of w in y, and δ_w(y′) = 1 if w occurs in y′

  • Expected linear BLEU gain can be computed incrementally
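For illustration: taking the expectation of (1) replaces δ_w(y′) by the posterior probability that w occurs in y′, and the resulting score decomposes over the n-grams of y, so appending one word only adds terms for the n-grams ending at the new position. A sketch with assumed weights (all names hypothetical):

```python
def linear_bleu_gain(y, lam0, lam, p_occurs, max_n=4):
    """Expected linear BLEU of hypothesis y (a tuple of words):
    lam0 * |y| + sum over n-gram occurrences w in y of lam[w] * p_occurs[w],
    where p_occurs[w] = sum over y' of P(y'|x) * [w occurs in y']."""
    score = lam0 * len(y)
    for n in range(1, max_n + 1):
        for i in range(len(y) - n + 1):
            w = y[i:i + n]
            score += lam.get(w, 0.0) * p_occurs.get(w, 0.0)
    return score

def extend_gain(prev_score, y_ext, lam0, lam, p_occurs, max_n=4):
    """Incremental update after appending one word: add lam0 for the new
    word plus the terms of the n-grams ending at the new final position."""
    score = prev_score + lam0
    L = len(y_ext)
    for n in range(1, min(max_n, L) + 1):
        w = y_ext[L - n:L]
        score += lam.get(w, 0.0) * p_occurs.get(w, 0.0)
    return score
```

Because the incremental update touches at most max_n n-grams, each extension is O(1) in the hypothesis length, which is what makes exact DP decoding feasible on the next slide.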

SLIDE 14

DP Decoding for Linear BLEU

  • Search nodes contain hypotheses that share their last three words

– |Σ|³ nodes in the search graph
– DP decoding can be implemented exactly (no pruning)

  • Breadth-first exploration and upper bound (I) for translation size
  • No need for rest-score estimation
  • Implementation has a complexity in O(I · |Σ|³ · D)
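A sketch of the exact DP recursion under this state definition (illustrative, not the paper's code): since the incremental linear BLEU gain of appending a word depends only on the last three words plus the new one, each state need only keep its best-scoring hypothesis:

```python
def exact_dp(vocab, max_len, step_gain):
    """Exact DP for a gain that decomposes over up-to-4-grams. States are
    the last three words of a hypothesis; each state keeps only the best
    (score, full hypothesis) reaching it, so no pruning is needed.

    step_gain(state, w) -> gain of appending w after the 3-word state."""
    states = {(): (0.0, ())}           # state -> (score, full hypothesis)
    best = (float("-inf"), ())
    for _ in range(max_len):           # upper bound I on translation length
        new_states = {}
        for state, (score, hyp) in states.items():
            for w in vocab:
                s = score + step_gain(state, w)
                ns = (state + (w,))[-3:]   # new state: last three words
                if ns not in new_states or s > new_states[ns][0]:
                    new_states[ns] = (s, hyp + (w,))
        states = new_states
        top = max(states.values())
        if top[0] > best[0]:
            best = top
    return best[1]
```

Each iteration touches at most |Σ|³ states times |Σ| extensions, matching the stated O(I · |Σ|³ · D) complexity when the effective branching factor D replaces the full vocabulary.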

SLIDE 15

Empirical Evaluation

SLIDE 16

Experimental Setup

  • WMT 2009 French-English corpus
  • Combine translations of the five systems that submitted N-best lists

– 450 translations on average for each source sentence

  • Maximum length (I) equal to the longest provided translation
  • Uniform ensemble weights

– Controlled environment to compare different setups
– Initial results showed that weights did not deviate much from uniform

SLIDE 17

Preliminary Experiments

[Plot: translation quality (BLEU [%]) and decoding time ([min]) as a function of the number of hypotheses kept after pruning (M)]

  • We chose M = 10 as the pruning value for the next experiments

SLIDE 18

Translation Quality Results

  System setup              BLEU [%]   TER [%]
  worst single system         24.8       60.4
  best single system          26.4       56.0
  Gradient ascent     EC      27.7       55.4
                      LB      26.3       59.6
  DP beam search      EC      27.8       55.1
                      LB      26.8       57.8

EC stands for BLEU over expected counts, and LB stands for linear BLEU

  • Scarce quality improvements but better score for 53% of the sentences

SLIDE 19

Decoding Time Results

  • Estimated by the number of calls to compute the expected BLEU

– Factors out potential effects of the particular implementations

  • Beam search made ∼15 million calls (∼1.3 s. per sentence)

– Gradient ascent made ∼20 million calls

  • DP-based decoding also improved the efficiency of MBRSC

SLIDE 20

Analysis of Linear BLEU Results

EC: we have made great progress .
LB: we have made great progress . we have made

EC: it seems to be clear that it is better to buy only a phone .
LB: to be clear that it seems to be clear that it is better to buy only a phone .

EC: i am curious to know if i could see here .
LB: am curious to know if i am curious to know if i could see here .

  • The lack of count clipping results in repetitions of common n-grams

– Explains the observed degradation in translation quality

SLIDE 21

Conclusions

SLIDE 22

Conclusions

  • DP-based decoding outperformed previous gradient ascent search

– Better-scoring translations with less temporal complexity
– However, improvements in translation quality were scarce

  • Linear BLEU boosts efficiency but penalizes translation quality
  • An extended linear BLEU score may mitigate this effect

– For example, by including a language model score

SLIDE 23

Thank you, questions?

SLIDE 24

References

  • R. Tromble, S. Kumar, F. Och, and W. Macherey. Lattice Minimum Bayes-Risk decoding for statistical machine translation. In Proc. of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 620–629, 2008.
