Improved Minimum Error Rate Training in Moses
Nicola Bertoldi, Barry Haddow and Jean-Baptiste Fouet
Third MT Marathon, Prague, 28th January 2009

Outline
- MERT background
- The need for a new MERT in Moses
- The design of the new MERT
- Evaluation
- Conclusions and future work

Discriminative Models
- State-of-the-art performance in statistical machine translation
- Combine the outputs of several probabilistic models
- Features can be any function of source and target, e.g. forward/backward translation log probability, language model score, word penalty, etc.
- Linear model:

$$e^*(\lambda) = \arg\max_{e} \sum_{i=1}^{r} \lambda_i h_i(e, f)$$

  with feature weights $\lambda_1 \ldots \lambda_r$ and feature functions $h_1 \ldots h_r$.
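As an illustration of the linear model, the sketch below picks the highest-scoring hypothesis from a small n-best list. The function names, the toy hypotheses and the feature values are all made up for this example; it is not code from Moses.

```python
from typing import List, Sequence, Tuple

# A hypothesis is a candidate translation e plus its feature values h_1..h_r.
Hypothesis = Tuple[str, List[float]]


def model_score(weights: Sequence[float], features: Sequence[float]) -> float:
    """Linear model score: sum over i of lambda_i * h_i(e, f)."""
    return sum(w * h for w, h in zip(weights, features))


def best_hypothesis(weights: Sequence[float], nbest: List[Hypothesis]) -> str:
    """Return the translation e* that maximises the linear model score."""
    return max(nbest, key=lambda item: model_score(weights, item[1]))[0]


# Toy n-best list; the three features stand in for, say, a translation
# log probability, a language model score and a word penalty (illustrative).
nbest = [
    ("the house is small", [-2.0, -5.0, -4.0]),
    ("the home is little", [-1.5, -7.0, -4.0]),
]
weights = [1.0, 0.5, -0.1]
print(best_hypothesis(weights, nbest))
```

Changing the weights changes which hypothesis wins, which is exactly what weight optimisation exploits.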

Weight Optimisation
- How do we choose the best lambda?
- We want the weights that produce the best translations
  - where "best" is measured by some automatic metric, e.g. BLEU, PER, etc.
- The most popular method is minimum error rate training (MERT), proposed by Och (2003)
  - A form of coordinate ascent
  - Uses n-best lists from the tuning set to approximate the decoder output
  - Generally works well for small numbers of features (up to 20 or 30)
  - Implementation available in Moses
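The coordinate-wise idea can be sketched as follows. This is a simplified stand-in, not Och's algorithm: it tries a fixed grid of values for one weight at a time instead of the exact line search, and it assumes a per-sentence error count that can simply be summed (BLEU is not decomposable in this way). All names and data are hypothetical.

```python
from typing import List, Sequence, Tuple

# Each hypothesis: (feature values, error count against the reference).
Scored = Tuple[List[float], int]


def corpus_error(weights: Sequence[float], nbests: List[List[Scored]]) -> int:
    """Total error of the hypotheses the model ranks first in each n-best list."""
    total = 0
    for nbest in nbests:
        top = max(nbest, key=lambda h: sum(w * f for w, f in zip(weights, h[0])))
        total += top[1]
    return total


def coordinate_search(nbests, weights, grid, sweeps=3):
    """Change one weight at a time, keeping any value that lowers the error."""
    weights = list(weights)
    best = corpus_error(weights, nbests)
    for _ in range(sweeps):
        improved = False
        for i in range(len(weights)):
            for value in grid:
                trial = weights[:i] + [value] + weights[i + 1:]
                err = corpus_error(trial, nbests)
                if err < best:
                    best, weights, improved = err, trial, True
        if not improved:  # no single-weight change helps any more
            break
    return weights, best


# Toy tuning set: two sentences with two hypotheses each.
nbests = [
    [([-1.0, -3.0], 2), ([-2.0, -1.0], 0)],
    [([-1.0, -4.0], 1), ([-3.0, -2.0], 0)],
]
result = coordinate_search(nbests, [1.0, 0.0], grid=[0.0, 0.5, 1.0, 2.0])
print(result)
```

On this toy data the starting weights select the wrong hypothesis for both sentences, and raising the second weight fixes both.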

The Need for a New MERT
- The existing Moses MERT implementation has a number of issues
  - Lack of modularity in the design makes it difficult to, e.g., replace BLEU with another automatic metric
  - The mix of programming languages in the implementation hinders experimentation
- At MTM2, a reimplementation of MERT was started with the following goals:
  - Clean, modular design to facilitate extension and experimentation
  - Separation of translation metric and optimisation code
  - Standalone open-source software, isolated from Moses
  - Improved efficiency

Architecture of new MERT
[Figure: architecture diagram. Labels: input, models, Moses, n-best, refs, scoretype, Scorer, statistics, Optimizer, weights, optimal weights, inner loop, outer loop.]
- MERT consists of an inner and an outer loop
- The outer loop runs the decoder over the tuning set and produces n-best lists
- The inner loop does the weight optimisation
- Iterate the outer loop until convergence
- The inner loop was replaced in the new MERT
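The interplay of the two loops can be sketched as below. Everything here is an assumption made for illustration: `decode` and `optimise` are hypothetical callables standing in for the decoder run and the inner loop, the convergence test (largest weight change below a tolerance) is only one possible criterion, and `previous_lists` is a guess at how a limit on earlier n-best lists might be expressed.

```python
from typing import Callable, List, Optional, Tuple

Hypothesis = Tuple[str, List[float]]


def mert_outer_loop(
    decode: Callable[[List[float]], List[Hypothesis]],
    optimise: Callable[[List[Hypothesis], List[float]], List[float]],
    weights: List[float],
    previous_lists: Optional[int] = None,  # None means keep all earlier lists
    max_iterations: int = 20,
    tolerance: float = 1e-4,
) -> Tuple[List[float], int]:
    history: List[List[Hypothesis]] = []
    iteration = 0
    for iteration in range(1, max_iterations + 1):
        # Outer loop: decode the tuning set with the current weights.
        history.append(decode(weights))
        if previous_lists is not None:
            history = history[-(previous_lists + 1):]
        merged = [hyp for nbest in history for hyp in nbest]
        # Inner loop: optimise the weights on the combined n-best lists.
        new_weights = optimise(merged, weights)
        change = max(abs(a - b) for a, b in zip(new_weights, weights))
        weights = new_weights
        if change < tolerance:  # assumed convergence criterion
            break
    return weights, iteration


# Stubs so the sketch runs: a fake decoder and an optimiser that moves
# each weight halfway towards 1.0 on every call.
def toy_decode(weights: List[float]) -> List[Hypothesis]:
    return [("hypothesis", list(weights))]


def toy_optimise(merged: List[Hypothesis], weights: List[float]) -> List[float]:
    return [(w + 1.0) / 2.0 for w in weights]


final_weights, iterations = mert_outer_loop(
    toy_decode, toy_optimise, [0.0], previous_lists=3
)
print(final_weights, iterations)
```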

MERT Design: Inner loop
- Uses the n-best lists and references
- N-best lists of previous iterations are merged
- Aims to find the weight set that maximises the translation score on the tuning set
- Consists of two main components:
  - Scorer: calculates the translation metric
  - Optimiser: searches for the best weight set
- These are implemented as separate classes
- A new Scorer/Optimiser can be added by implementing a new subclass
- For efficiency, some scoring statistics are pre-calculated in a separate extraction phase
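The separation into Scorer and Optimiser classes might look like the following sketch. The class and method names are hypothetical and do not reproduce the actual Moses interfaces; the PER formula is one common formulation, and random search is used only as a simple placeholder optimiser.

```python
import random
from abc import ABC, abstractmethod
from collections import Counter


class Scorer(ABC):
    @abstractmethod
    def extract(self, hypothesis: str, reference: str) -> tuple:
        """Pre-calculate sufficient statistics for one hypothesis."""

    @abstractmethod
    def score(self, stats: list) -> float:
        """Combine per-sentence statistics into a score (higher is better)."""


class PerScorer(Scorer):
    """1 - PER, with errors = max(|hyp|, |ref|) - bag-of-words matches."""

    def extract(self, hypothesis, reference):
        hyp, ref = hypothesis.split(), reference.split()
        matches = sum((Counter(hyp) & Counter(ref)).values())
        return (max(len(hyp), len(ref)) - matches, len(ref))

    def score(self, stats):
        return 1.0 - sum(s[0] for s in stats) / sum(s[1] for s in stats)


class Optimiser(ABC):
    @abstractmethod
    def optimise(self, objective, start):
        """Return (weights, score) maximising objective(weights)."""


class RandomSearchOptimiser(Optimiser):
    def __init__(self, trials=200, seed=0):
        self.trials, self.seed = trials, seed

    def optimise(self, objective, start):
        rng = random.Random(self.seed)
        best, best_score = list(start), objective(start)
        for _ in range(self.trials):
            candidate = [rng.uniform(-1.0, 1.0) for _ in start]
            candidate_score = objective(candidate)
            if candidate_score > best_score:
                best, best_score = candidate, candidate_score
        return best, best_score


def tune(nbests, references, scorer, optimiser, start):
    # Extraction phase: statistics are computed once per hypothesis.
    stats = [[scorer.extract(hyp, ref) for hyp, _ in nbest]
             for nbest, ref in zip(nbests, references)]

    def objective(weights):
        chosen = []
        for nbest, sentence_stats in zip(nbests, stats):
            scores = [sum(w * h for w, h in zip(weights, feats))
                      for _, feats in nbest]
            chosen.append(sentence_stats[scores.index(max(scores))])
        return scorer.score(chosen)

    return optimiser.optimise(objective, start)


nbests = [[("a b d", [-1.0, -2.0]), ("a b c", [-2.0, -1.0])]]
references = ["a b c"]
weights, score = tune(nbests, references, PerScorer(),
                      RandomSearchOptimiser(), [1.0, 0.0])
print(weights, score)
```

Because `tune` only talks to the two abstract interfaces, swapping in a different metric or search strategy means writing one new subclass.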

Evaluation: Translation performance on Heldout Test Sets
- Evaluation was performed on two different tasks from WMT08
- Standard Moses system with 100-best lists for tuning
- Scores in the tables are all BLEU

            nc-devtest07  nc-test07  newstest08
old MERT    24.42         25.55      15.50
new MERT    24.87         25.70      15.54
Table: Comparison using the news commentary task.

            devtest06  test06  test07
old MERT    32.75      32.67   33.23
new MERT    32.86      32.79   33.19
Table: Comparison using the europarl task.

Evaluation: Iterations
- This shows the variation of BLEU on the tuning and heldout sets against iterations of the outer loop
- Standard Moses setup, with europarl training, and WMT dev06 and test08 for tuning and heldout test, respectively
- Compares old MERT with new MERT using 1, 3 or all previous n-best lists
- Note: the ability to specify the previous list count is new
[Figure: two panels, Development and Test, plotting BLEU (%) against iteration (0 to 20) for old, new-all, new-3 and new-1.]

Evaluation: Disk usage
- The graph below shows the on-disk usage of MERT for each iteration
[Figure: Size (Mb) against iteration (0 to 20) for old, new-all, new-3 and new-1.]
- The new MERT implementation uses more disk space because duplicate removal has not yet been implemented

Evaluation: Execution time
- The graphs below compare execution time for old MERT and three different configurations of new MERT
[Figure, left panel: total accumulated inner-loop time, in seconds, against iteration, for old, new-all, new-3 and new-1.]
[Figure, right panel: time for phase 1 (extraction) and phase 2 (optimisation), in seconds, against iteration, for old and new-all.]

Evaluation: Execution Time - Comments
- New MERT concatenates the latest n-best list to the previous ones
- Old MERT merges the lists, removing duplicates
- This means that new MERT has a very short extraction phase, which does not increase with the number of iterations
- But optimisation time increases more quickly with the number of iterations
  - It is roughly linear in the size of the combined n-best list
- And disk usage increases too
- Duplicate removal should reduce execution time
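The difference between the two strategies can be illustrated with a short sketch. The function names are hypothetical, and treating two entries as duplicates when both the hypothesis string and its feature values match is an assumption about what "duplicate" means here.

```python
from typing import List, Tuple

Hypothesis = Tuple[str, List[float]]


def concatenate(previous: List[Hypothesis], latest: List[Hypothesis]) -> List[Hypothesis]:
    """Append the latest n-best list; cheap, but the list keeps growing."""
    return previous + latest


def merge_unique(previous: List[Hypothesis], latest: List[Hypothesis]) -> List[Hypothesis]:
    """Merge two n-best lists, keeping the first copy of each duplicate."""
    seen = set()
    merged = []
    for hyp, feats in previous + latest:
        key = (hyp, tuple(feats))  # lists are unhashable, so use a tuple
        if key not in seen:
            seen.add(key)
            merged.append((hyp, feats))
    return merged


previous = [("the house", [-1.0, -2.0]), ("a house", [-1.5, -2.5])]
latest = [("the house", [-1.0, -2.0]), ("the home", [-2.0, -1.0])]

print(len(concatenate(previous, latest)))
print(len(merge_unique(previous, latest)))
```

Since optimisation time is roughly linear in the size of the combined list, every duplicate dropped here is work saved in each later pass of the optimiser.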

Evaluation: Extensibility
- The principal aim of rewriting MERT was to provide a more flexible design
  - So it should be easier to incorporate new features
- Cer, Jurafsky and Manning (WMT 2008) showed how MERT could be improved by "regularisation"
  - Smoothing out the error surface helps to avoid local maxima in the translation metric
  - This is done by taking either an average or a minimum over a neighbourhood
- This smoothing was added to the scorer base class
  - Making it available to any scorer
- The smoothing was tested on fr-en and de-en WMT08 europarl data
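A minimal sketch of the two smoothing variants follows. It assumes the metric has already been evaluated at a sequence of neighbouring points along one search direction, and that the neighbourhood of a point is simply the `window` points on either side; the exact definitions used by Cer et al. and by the scorer base class are not given in these slides.

```python
from typing import List


def smooth(scores: List[float], window: int, method: str = "average") -> List[float]:
    """Replace each score by the average or minimum over its neighbourhood."""
    smoothed = []
    for i in range(len(scores)):
        lo = max(0, i - window)
        hi = min(len(scores), i + window + 1)
        neighbourhood = scores[lo:hi]  # truncated at the ends of the list
        if method == "average":
            smoothed.append(sum(neighbourhood) / len(neighbourhood))
        elif method == "minimum":
            smoothed.append(min(neighbourhood))
        else:
            raise ValueError(f"unknown smoothing method: {method}")
    return smoothed


# Toy error surface: a narrow spike at index 2 and a broad plateau near index 6.
scores = [0.20, 0.21, 0.30, 0.21, 0.22, 0.24, 0.25, 0.24]

raw_best = scores.index(max(scores))
min_smoothed = smooth(scores, window=1, method="minimum")
smoothed_best = min_smoothed.index(max(min_smoothed))
print(raw_best, smoothed_best)
```

Without smoothing the search would settle on the isolated spike; with minimum smoothing it prefers the plateau, where nearby weight settings also score well.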

Smoothing Experiments: BLEU scores

                  fr-en                         de-en
Method   Window   devtest06  test06  test07    devtest06  test06  test07
none     n/a      32.86      32.79   33.19     27.54      27.67   28.07
minimum  ±1       32.70      32.65   33.20     27.51      27.79   28.00
         ±2       32.81      32.75   33.21     27.75      27.85   28.10
         ±3       32.83      32.76   32.93     27.70      27.92   27.96
         ±4       32.88      32.77   33.24     27.70      27.87   28.02
average  ±1       32.79      32.77   33.29     27.44      27.81   28.00
         ±2       32.89      32.83   33.28     27.63      27.73   27.98
         ±3       32.78      32.67   33.19     27.53      27.67   27.87
         ±4       32.81      32.79   33.25     27.81      28.01   28.22

- The gains of 0.5-1.0 BLEU reported by Cer et al. were not reproduced
  - They used a Chinese-English translation task
  - The error surface may have more noise

Conclusions
- We have described a new open-source implementation of MERT
  - It is distributed within Moses, but is standalone
- Modularity allows easy replacement of the optimiser or the translation metric
  - Currently both PER and BLEU scorers are available
- The translation performance of systems tuned by the new MERT is similar to that of systems tuned by the old MERT
- The new MERT is slightly slower and uses more disk space than the old
  - This is a known problem which will be rectified

Possible Enhancements
- Implement duplicate removal when merging n-best lists
- Add new automatic metrics, e.g. WER, METEOR, combinations of metrics
- Add the ability to constrain the feature weights
- Add priors to the weights
- Investigate parallelisation of the algorithm
- Implement the lattice optimisation proposed by Macherey et al. (EMNLP 2008)

Questions?
Thank you!
