SLIDE 1

Statistical Machine Translation: Rapid Development with Limited Resources

George Foster, Simona Gandrabur, Philippe Langlais, Pierre Plamondon, Graham Russell and Michel Simard

RALI-DIRO, Université de Montréal

C.P. 6128, succursale Centre-ville
Montréal (Québec) Canada, H3C 3J7
www-rali.iro.umontreal.ca

MT-Summit IX — New Orleans

SLIDE 2

Motivation

What progress can a small team of developers expect to achieve in creating a statistical MT system for an unfamiliar language, using only data and technology readily available in-house, or at short notice from external sources?

  • Work conducted within the NIST 2003 MT evaluation task

http://www.nist.gov/speech/tests/mt/

  • Chinese-to-English task
  • Computing resources: Pentium-4 class PCs with a maximum of 1 GB RAM

SLIDE 3

We had a plan...

Rescoring approach built on top of a roughly state-of-the-art translation model such as IBM Model 4

  • Extensively used in automatic speech recognition on n-best lists or word-graphs (Ortmanns et al., 1997; Rose and Riccardi, 1999)
  • More recently proposed for use in SMT (Och and Ney, 2002; Soricut et al., 2002; Ueffing et al., 2002); sketched below.
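A minimal sketch of such a rescoring step: a log-linear combination of feature scores reranks an n-best list. The feature functions and weights below are hypothetical toy stand-ins, not the models used in this work.

    def rescore_nbest(nbest, features, weights):
        """Rerank an n-best list by a weighted sum of feature log-scores."""
        def combined(hyp):
            return sum(w * f(hyp) for w, f in zip(weights, features))
        return sorted(nbest, key=combined, reverse=True)

    # Toy stand-in features: a length penalty and a fake LM log-probability.
    length_penalty = lambda hyp: -abs(len(hyp.split()) - 10)
    fake_lm_logprob = lambda hyp: -0.5 * len(hyp.split())
    ranked = rescore_nbest(["a short hypothesis",
                            "a somewhat longer candidate translation here"],
                           [length_penalty, fake_lm_logprob], [0.3, 0.7])
    print(ranked[0])  # the candidate with the best combined score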

SLIDE 4

Step 1 – Preparing with Canadian Hansards

Alegrías

  • Install necessary packages
  • Train translation and language models
  • Write IBM model 2 & 4 decoders (see the pipeline sketch below)

↪ IBM4 models trained with GIZA++ and mkcls
  www-i6.informatik.rwth-aachen.de/Colleagues/och
↪ Language and IBM model 2 models trained with in-house packages
↪ Multiple search strategies (Nießen et al., 1998; Germann et al., 2001)
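A sketch of how this training step can be scripted, assuming GIZA++ and mkcls are on the PATH and the corpus has already been converted to GIZA++'s .vcb/.snt format. The file names are hypothetical, and the flag values follow the conventional invocation; check them against your local build.

    import subprocess

    def train_word_classes(corpus_txt, out_classes, n_classes=50):
        # mkcls: -cN classes, -n2 optimization runs, -p input, -V output.
        subprocess.run(["mkcls", f"-c{n_classes}", "-n2",
                        f"-p{corpus_txt}", f"-V{out_classes}", "opt"],
                       check=True)

    def train_ibm4(src_vcb, tgt_vcb, snt, out_prefix):
        # GIZA++: -S/-T vocabulary files, -C sentence-pair file, -o prefix.
        subprocess.run(["GIZA++", "-S", src_vcb, "-T", tgt_vcb,
                        "-C", snt, "-o", out_prefix], check=True)

    train_word_classes("corpus.zh", "corpus.zh.classes")
    train_word_classes("corpus.en", "corpus.en.classes")
    train_ibm4("corpus.zh.vcb", "corpus.en.vcb", "corpus.zh_en.snt", "zh2en")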

SLIDE 5

Step 1 – 3-4 weeks later...

Seguiriyas

We've got it! Our first English-French GIZA++ model was ready to use.

  • Establishing the limits of the package (maximum input size, etc.)
    ↪ 2-3 days of computation to train on a corpus of 1 million sentence pairs
    ↪ can't train with more data (memory problems)

  • Running mkcls
    ↪ around 10 hours of computation to cluster a vocabulary into 50 classes

  • Writing wrappers for the model data structures

SLIDE 6

Step 2 – Corpus Preprocessing

SMT is not exactly language-blind...

The Linguistic Data Consortium (LDC) distributed the training data for the NIST task (at least partially). http://www.ldc.upenn.edu/
↪ A surprising variety of formats (sounds nice but is not)
↪ Word boundaries inserted by means of a revised version of the mansegment program supplied by the LDC
↪ Doubt: is our sentence aligner supposed to work for Chinese/English corpora?

One person-month of effort for a judicious mixture of automatic and semi-automatic approaches

SLIDE 7

Step 2 – Corpus Preprocessing

Take one

  • For the NIST exercise, only pre-aligned texts were used.
    ↪ Some regions acknowledged by the supplier to be potentially unreliable were omitted.
  • Instead of recompiling GIZA++ to account for sentences longer than 40 words, we devised a knowledge-poor splitter relying heavily on punctuation (see the sketch below).
    ↪ In cases where no suitable punctuation existed, sentences were split at an arbitrary token boundary.
    ↪ Mostly a mix of un and hansard was used to train language and translation models.
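A minimal sketch of such a splitter, assuming tokenized input and GIZA++'s 40-word limit. The punctuation set and the preference for a cut near the midpoint are illustrative choices, not necessarily the exact rules we used.

    MAX_LEN = 40                              # GIZA++'s sentence-length limit
    SPLIT_PUNCT = {",", ";", ":", "!", "?"}   # illustrative choice

    def split_long(tokens):
        """Recursively split a token list so each piece has <= MAX_LEN tokens."""
        if len(tokens) <= MAX_LEN:
            return [tokens]
        cuts = [i for i, tok in enumerate(tokens[1:-1], 1) if tok in SPLIT_PUNCT]
        # Prefer the punctuation token closest to the midpoint; otherwise fall
        # back to an arbitrary token boundary, as described above.
        cut = (min(cuts, key=lambda i: abs(i - len(tokens) // 2)) + 1
               if cuts else len(tokens) // 2)
        return split_long(tokens[:cut]) + split_long(tokens[cut:])

    pieces = split_long(("w " * 90).split())
    print([len(p) for p in pieces])           # every piece is at most 40 tokens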

SLIDE 8

Step 3 – Decoders

The joy of diversity

Three different decoders, all previously described in the statistical MT literature, were implemented. Sounds odd to do that under time pressure, but we found possible advantages:

  • Detection of certain bugs (actually useful)
  • Competition between coders: “mine is better than yours” (that worked too)

  • Could be fruitful in a rescoring strategy (details later)

Detail: explicit enumeration of the candidate translations (n-best lists)

SLIDE 9

Step 3 – Decoders

Greedy decoder (Germann et al., 2001)

  • ISI ReWrite Decoder available(a) at:
    http://www.isi.edu/licensed-sw/rewrite-decoder
  • Requires the language model to be trained with the CMU-Cambridge Statistical Language Modeling Toolkit (Clarkson and Rosenfeld, 1997)
    ↪ We found it easier to rewrite the ReWrite Decoder

Hypotheses generated by the hill-climbing search were collected into an n-best list (see the sketch below).

(a) At least for Canadian residents.
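A minimal sketch of the greedy hill-climbing loop, collecting every visited hypothesis into an n-best list; neighbours() and score() are hypothetical stand-ins for the decoder's local edit operations and IBM-4 scoring.

    def hill_climb(seed, neighbours, score, max_steps=100):
        """Greedy search: repeatedly move to the best-scoring neighbour."""
        current, visited = seed, [seed]
        for _ in range(max_steps):
            candidates = list(neighbours(current))
            if not candidates:
                break
            best = max(candidates, key=score)
            if score(best) <= score(current):
                break                          # local optimum reached
            current = best
            visited.append(current)
        # Every hypothesis seen along the way goes into the n-best list.
        return current, sorted(visited, key=score, reverse=True)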
SLIDE 10

Step 3 – Decoders

Inverted Alignment Decoder (Nießen et al., 1998)

Shame on us: we also tested the performance of a DP decoder designed for IBM model 2.

    for all target positions i = 1, 2, ..., Imax do
        prune(i - 1)
        for all live hypotheses h_i do
            for all words w in the active vocabulary do
                for all fertilities f ∈ {1, 2, 3} do
                    for all uncovered source positions j, ..., j + f - 1 do
                        consider h', the extension of h_i with w (at target position i) aligned with j, ..., j + f - 1
                        if score(h') > Score(i, j, f, c) then
                            keep h' and record back-track information

Best live hypotheses are kept in an n-best list.

SLIDE 11

Step 3 – Decoders

Stack-based Decoder (FST)

  • loop until time limit reached (see the skeleton after this list):
    – pop best node from the stack
    – if final, add hypothesis to n-best list
    – else
      * expand “exhaustively”
      * add resulting hypotheses to graph and stack
  • main properties:
    – all nodes retained in graph
    – fast output of initial hypotheses, with successive refinement
    – precise time control
    – no heuristic function on suffixes
SLIDE 12

Step 3 – Decoders

Stack-based Decoder (FST)

  • graph properties:
    – 30M nodes in ≈ 1 GB
    – nodes retain trigram state and source alignments
    – retroactive score correction
  • prefix heuristics:
    – multiple stacks to correct for prefix-score bias:
      * number of source and target words
      * unigram logprob
    – pop depends on stack and gain over parent
  • timing: max 3 minutes per source sentence; more time gives better model scores but worse NIST scores

SLIDE 13

Step 3 – Decoders

The cost of diversity

    decoder   coding   tuning
    greedy         2        1
    FST            3        3
    IBM2           2        3
    total          7        7

Approximate number of person-weeks of development.
But it was worth it! (the price of a decent IBM-4 stack-based decoder)

SLIDE 14

Step 3 – Decoders

Finally

  • Two decoders for IBM model 4, one for IBM model 2.
  • Existential questions: “is it good or bad?”, “why?”, etc.
  • Tuning the compromise between speed and quality is difficult
  • Incremental improvements

↪ We compared the decoders by translating 100 sentences (of at most 20 words)

  • greedy     (results within 10 minutes or so)
  • fst        (results within half an hour or so)
  • ibm2-fast  (results within a few seconds)
  • ibm2-slow  (results within half an hour or so)

SLIDE 15

The bigger the better?

Main factor is decoder type

                            1-best                      100-best
    corpus    decoder     wer     nist     nist%     wer     nist     nist%
    hansard   greedy     68.93    2.41448  24.20    61.71    3.68806  37.00
              ibm2-fast  65.87    3.22954  32.30    59.22    4.42125  44.20
              ibm2-slow  63.85    3.85769  38.50    53.03    5.28764  52.80
              fst        62.86    4.19043  41.90    55.24    5.10464  51.00
    un        greedy     70.35    2.76181  26.10    62.97    3.98415  37.70
              ibm2-fast  69.80    3.19254  30.20    63.04    4.38660  41.50
              ibm2-slow  68.77    4.39036  41.50    58.65    5.77882  54.60
              fst        65.57    4.56739  43.20    57.18    5.80536  54.90
    sinorama  greedy     86.89    0.79860   7.80    82.16    1.37465  13.40
              ibm2-fast  87.55    1.09399  10.30    82.45    1.68875  15.80
              ibm2-slow  87.56    1.46096  13.70    81.55    2.44893  23.00
              fst        88.97    1.72001  16.10    85.40    2.35273  22.00
    xinhua    greedy     89.64    1.30970  12.70    85.10    2.00496  19.40
              ibm2-fast  91.09    1.08899  10.30    85.90    1.86932  17.70
              ibm2-slow  89.13    1.34132  12.70    83.86    2.29718  21.80
              fst        90.82    1.08510  10.30    87.98    1.56167  14.80
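For reference, wer in these tables is the word error rate: word-level edit distance divided by reference length. A minimal sketch of the computation:

    def wer(hyp, ref):
        """Word error rate: word-level edit distance / reference length, in %."""
        h, r = hyp.split(), ref.split()
        prev = list(range(len(h) + 1))             # row for an empty reference
        for j, rw in enumerate(r, 1):
            cur = [j]
            for i, hw in enumerate(h, 1):
                cur.append(min(prev[i] + 1,                # deletion
                               cur[i - 1] + 1,             # insertion
                               prev[i - 1] + (hw != rw)))  # substitution/match
            prev = cur
        return 100.0 * prev[len(h)] / len(r)

    print(wer("the cat sat", "the cat sat on the mat"))   # 50.0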

SLIDE 16

The bigger the better?

Search space is also important

[Same results table as on slide 15.]

SLIDE 17

The bigger the better?

Even a small nbest list seems beneficial

[Same results table as on slide 15.]

SLIDE 18

Why so bad on xinhua and sinorama?

Language model perplexities

                          test
    train       hansard        un   sinorama     xinhua
    hansard       20.14    296.10     451.68    1005.51
    un           500.14     25.01     860.34    1007.62
    sinorama     769.53   1058.98      33.85    1970.16
    xinhua      1393.69    856.07    1269.27      15.44

  • unlikely that this is due to major bugs in the process
  • train/test data mismatch is more significant on xinhua and sinorama (a sketch of the perplexity computation follows)
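A minimal sketch of the perplexity computation, using a unigram model with add-one smoothing for brevity; the figures above come from real (higher-order, properly smoothed) language models.

    import math
    from collections import Counter

    def perplexity(train_tokens, test_tokens):
        counts = Counter(train_tokens)
        total = sum(counts.values())
        vocab = len(counts) + 1                    # +1 class for unseen words
        logprob = sum(math.log((counts[w] + 1) / (total + vocab))
                      for w in test_tokens)
        return math.exp(-logprob / len(test_tokens))

    train = "the cat sat on the mat".split()
    print(perplexity(train, train))                           # low: in-domain
    print(perplexity(train, "markets fell sharply".split()))  # high: mismatch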

SLIDE 19

Why so bad on xinhua and sinorama?

Poor target vocabulary coverage

                   used              complete corpus
                 ov      rank          ov      rank
    hansard     11.29     1.4         0.29     44.9
    un           8.21     1.3         0.53     25.4
    sinorama    29.80     1.8         3.81    143.5
    xinhua      43.31     2.1         4.06    162.9

ov:   percentage of reference words uncovered by the target vocabulary
rank: average rank (in the translation table) of the expected translation

Figures are given for the actually used target vocabulary and for the complete one (that is, taking all the target words in the transfer table corresponding to any word of the source sentence, plus fertility-0 words).
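A minimal sketch of the ov computation, with the transfer table represented as a hypothetical dict mapping each source word to its candidate target translations:

    def out_of_vocabulary(reference, source, transfer_table, fertility0=()):
        """Percentage of reference words not covered by the target vocabulary."""
        target_vocab = set(fertility0)             # fertility-0 words
        for src in source.split():
            target_vocab.update(transfer_table.get(src, []))
        ref_words = reference.split()
        uncovered = sum(1 for w in ref_words if w not in target_vocab)
        return 100.0 * uncovered / len(ref_words)

    table = {"s1": ["the", "cat"], "s2": ["sat", "sit"]}
    print(out_of_vocabulary("the cat sat down", "s1 s2", table))  # 25.0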

SLIDE 20

Multi-engine translation

Experiment assuming an oracle

  • Merge the 25 best translations of each decoder
    ⇒ list of 100 translations with possible repetitions
  • Lower wer than the minimum wer measured for each decoder:

        hansard    4.7    (48.30 instead of 53.03)
        un         4.3
        sinorama   1.8
        xinhua     0.15

    Absolute improvement in wer over the minimum observed
  • Consistent with the idea promoted within the Pangloss system (Frederking et al., 1994); a sketch of the oracle computation follows
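A minimal sketch of the oracle experiment, reusing the wer() function sketched under the results tables; the decoder n-best lists named in the usage comment are hypothetical placeholders.

    def oracle_wer(nbest_lists, reference, wer, k=25):
        """Pool each decoder's k-best list and let an oracle pick, per
        sentence, the candidate with the lowest WER against the reference."""
        pooled = [hyp for nbest in nbest_lists for hyp in nbest[:k]]
        return min(wer(hyp, reference) for hyp in pooled)

    # Hypothetical usage with two decoders' n-best lists:
    # oracle_wer([greedy_nbest, fst_nbest], reference, wer)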

SLIDE 21

A few hours before the NIST deadline

Chaos was leading us

  • We've got a Chinese-English UN-kind translation engine in hand,
  • No test sentences translated yet (that would have been too easy).

24:00  Time to really translate the test sentences
09:00  !@# Where did Piet put the translations?
09:30  What the UNK is that?
10:00  Patch a de-UNKer
10:45  Forgot to de-tokenize the output (i.e. “doesn ’t”)
10:57  Press the send button to submit

SLIDE 22

Why so bad?

Influence of the corpus

    corpus     type       |lm|       |tm|        nist
    fbis       reports     68,848     36,586    5.0288
    sinorama   magazine   103,250     53,877    4.2273
    un         reports    720,000    662,360    4.0975
    xinhua     news        65,000     39,364    3.7751
    hansard    hansards   351,514    335,910    3.5069
    mix        mixed      166,529    166,529    5.6875
    high       mixed      166,529    166,529    5.5938
    take1      mixed      818,937  1,225,488    4.3437

take1 (the NIST exercise setting): 934,508 sentence pairs from un, 205,368 from the hansard, 50,378 from sinorama, and 35,234 from xinhua.
mix (after the NIST exercise): 166,529 sentence pairs from the fbis, the xinhua and the sinorama corpora, plus the first 50,000 pairs of the hansard corpus.
high (after the NIST exercise): same as mix, but the fst decoder was allowed to explore a larger space.

SLIDE 23

Conclusions

  • Despite the availability of very good toolkits and published decoding algorithms, the process of building an SMT system from scratch was not as painless and effortless as we had hoped.
  • Essential practical aspects of the process, time-consuming although of little inherent scientific interest, often become the bottleneck.
    – Training corpora, although available in large quantities, often require a significant amount of work even when they are ostensibly clean.
    – Known decoding techniques and heuristics cannot simply be applied without extensive experimentation on accuracy/runtime trade-offs.
  • Why so bad?

SLIDE 24

References

Philip Clarkson and Ronald Rosenfeld. 1997. Statistical Language Modeling Using the CMU-Cambridge Toolkit. In Eurospeech '97, volume 5, pages 2707–2710.

Robert Frederking, Sergei Nirenburg, David Farwell, Stephen Helmreich, Eduard Hovy, Kevin Knight, Stephen Beale, Constantine Domashnev, Donalee Attardo, Dean Grannes, and Ralf Brown. 1994. Integrating Translations from Multiple Sources within the Pangloss Mark III Machine Translation System. In Proceedings of the First Conference of the Association for Machine Translation in the Americas (AMTA), pages 73–80, Columbia.

Ulrich Germann, Michael Jahr, Kevin Knight, Daniel Marcu, and Kenji Yamada. 2001. Fast Decoding and Optimal Decoding for Machine Translation. In Proceedings of the 39th Annual Meeting of the Association for Computational Linguistics, pages 228–235, Toulouse.

Sonja Nießen, Stephan Vogel, Hermann Ney, and Christof Tillmann. 1998. A DP-based Search Algorithm for Statistical Machine Translation. In Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics and the 17th International Conference on Computational Linguistics, volume 2, pages 960–967, Montreal.

Franz Josef Och and Hermann Ney. 2002. Discriminative Training and Maximum Entropy Models for Statistical Machine Translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 295–302, Philadelphia.

Stefan Ortmanns, Hermann Ney, and Xavier Aubert. 1997. A Word Graph Algorithm for Large Vocabulary Continuous Speech Recognition. Computer Speech and Language, 11(1):43–72.

Richard C. Rose and Giuseppe Riccardi. 1999. Automatic Speech Recognition Using Acoustic Confidence Conditioned Language Models. In Proceedings of the 6th European Conference on Speech Communication and Technology (EUROSPEECH), pages 303–306.

Radu Soricut, Kevin Knight, and Daniel Marcu. 2002. Using a Large Monolingual Corpus to Improve Translation Accuracy. In Proceedings of the Conference of the Association for Machine Translation in the Americas (AMTA-2002), Tiburon, CA.

Nicola Ueffing, Franz Josef Och, and Hermann Ney. 2002. Generation of Word Graphs in Statistical Machine Translation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 156–163.