SLIDE 1

Transductive learning for statistical machine translation

Nicola Ueffing¹, Gholamreza Haffari², Anoop Sarkar²

¹ Interactive Language Technologies Group

National Research Council Canada, Gatineau, QC, Canada
nicola.ueffing@nrc.gc.ca

² School of Computing Science

Simon Fraser University, Vancouver, Canada
{ghaffar1,anoop}@cs.sfu.ca

ACL 2007: June 25

SLIDE 2

Outline

1. Motivation
2. Transductive Machine Translation
3. Experimental Results
   • SMT System
   • EuroParl French–English
   • NIST Chinese–English

SLIDE 5

Motivation

[Diagram: source sentence f → MT System → translation e. Training resources feeding the system: bilingual data ((e,f) pairs) and monolingual target-language data E; monolingual source-language data F is marked with a question mark.]

Here: we explore monolingual source-language data to improve translation quality

SLIDE 7

Where would it be useful?

In some cases the amount of bilingual data is limited and expensive to create. Use monolingual source-language data to:

  • adapt to a new domain, topic or style
  • overcome training/testing data mismatch, e.g. text vs. speech

Examples:

  training data             testing data   effect
  newswire                  web text       adapt to domain and style
  written text              speech         adapt to speech characteristics
  written text and speech   speech         identify parts of model relevant for speech

SLIDE 8

1. Motivation
2. Transductive Machine Translation
3. Experimental Results
   • SMT System
   • EuroParl French–English
   • NIST Chinese–English

SLIDE 11

Transductive SMT

[Diagram: the transductive loop. The MT system, trained on the bilingual data (with translation model parameters and LM estimated from it), decodes the test data into translations; the candidate translations s1, s2, s3, s4, ... are scored ("Score & Select"), and the good translations are fed back to re-estimate the models.]
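The decode / score / select / estimate loop on this slide can be written out as a toy, self-contained Python sketch. `toy_decode`, `toy_score` and the sentence-pair "model" are illustrative stand-ins, not the PORTAGE decoder or its confidence model:

```python
def self_training_loop(parallel, mono_src, decode, score, threshold, iters=2):
    """Transductive loop: decode -> score -> select -> re-estimate.

    `decode` and `score` stand in for the real SMT decoder and confidence
    model; the real system re-estimates phrase tables rather than merely
    appending sentence pairs to a list.
    """
    data = list(parallel)
    for _ in range(iters):
        # Decode the monolingual source data with the current model
        hyps = [(f, decode(f, data)) for f in mono_src]
        # Select: keep only translations whose score clears the threshold
        selected = [(f, e) for f, e in hyps if score(f, e) > threshold]
        # Estimate: augment the original training data with them
        data = list(parallel) + selected
    return data

# Toy stand-ins: a word-for-word dictionary "decoder" and a "confidence"
# score measuring how much of the source was actually translated.
def toy_decode(f, data):
    lex = {fw: ew for fs, es in data for fw, ew in zip(fs.split(), es.split())}
    return " ".join(lex.get(w, w) for w in f.split())

def toy_score(f, e):
    pairs = list(zip(f.split(), e.split()))
    return sum(fw != ew for fw, ew in pairs) / max(len(pairs), 1)

parallel = [("le chat", "the cat"), ("un chien", "a dog")]
augmented = self_training_loop(parallel, ["le chien"], toy_decode, toy_score, 0.5)
print(augmented[-1])  # -> ('le chien', 'the dog')
```

The loop never discards the original bilingual data; the selected translations only augment it, mirroring the slide's "Estimate" step.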

SLIDE 13

Scoring Translations


1. Confidence estimation

  • log-linear combination of different posterior probabilities and LM probability
  • posterior probabilities for words and phrases, calculated over the N-best list
  • combination optimized w.r.t. sentence classification error rate

2. Normalized sentence score assigned by the SMT system
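A minimal sketch of the first scoring method, assuming the decoder provides an N-best list with log-probabilities. It uses bag-of-words posteriors rather than the position-dependent word and phrase posteriors of the actual system, and a single hand-set weight instead of weights optimized for sentence classification error rate:

```python
import math
from collections import Counter

def word_posteriors(nbest):
    """Posterior probability of each word: the share of the N-best
    probability mass carried by hypotheses containing that word."""
    total = sum(math.exp(lp) for _, lp in nbest)
    mass = Counter()
    for hyp, lp in nbest:
        for w in set(hyp.split()):
            mass[w] += math.exp(lp)
    return {w: m / total for w, m in mass.items()}

def confidence(hyp, nbest, weight=1.0):
    """Sentence confidence: a weighted average log word posterior.
    The real score log-linearly combines several posterior features
    and an LM probability, with weights tuned on a dev set."""
    post = word_posteriors(nbest)
    words = hyp.split()
    return weight * sum(math.log(post.get(w, 1e-10)) for w in words) / len(words)

nbest = [("the cat sat", -1.0), ("a cat sat", -1.5), ("the cat sit", -2.0)]
# the 1-best hypothesis is better supported by the list than a variant
print(confidence("the cat sat", nbest) > confidence("the cat sit", nbest))  # -> True
```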

SLIDE 16

Selection

1. Importance sampling: sample with replacement, probability distribution based on scores

2. Threshold: select all translations with score above a threshold; optimize the threshold on the dev set beforehand

3. Keep all translations: comparative experiment
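The two automatic selection schemes above can be sketched as follows (hypothetical helper names; scores are assumed non-negative):

```python
import random

def select_by_threshold(scored, threshold):
    """Keep every translation whose score clears a threshold that
    was optimized on the dev set beforehand."""
    return [(f, e) for f, e, s in scored if s >= threshold]

def select_by_sampling(scored, k, rng=random):
    """Importance sampling: draw k translations with replacement,
    with probability proportional to their scores."""
    pairs = [(f, e) for f, e, _ in scored]
    weights = [s for _, _, s in scored]
    return rng.choices(pairs, weights=weights, k=k)

scored = [("f1", "e1", 0.9), ("f2", "e2", 0.1), ("f3", "e3", 0.6)]
print(select_by_threshold(scored, 0.5))  # -> [('f1', 'e1'), ('f3', 'e3')]
```

With sampling, a high-scoring translation can be drawn several times, which weights it more heavily in re-training; that is the point of sampling with replacement.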

SLIDE 21

Estimate

We extract “good” translations and use these to augment our SMT system. Different choices are used to estimate a new model:

1. Add the new translations to the training set and do full re-training (can be made efficient; details in the paper)

2. A mixture model of phrase pair probabilities from the training set combined with phrase pairs from the dev/test set

3. Use the new phrase pairs to train an additional phrase table and use it as a new feature function in the SMT log-linear model (feature weights learned using the dev corpus)
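Option 2, the mixture model, can be sketched as a linear interpolation of the two phrase-table distributions. `lam` is a hypothetical interpolation weight tuned on dev data; the paper's exact combination may differ:

```python
def mixture_phrase_prob(p_train, p_adapt, lam=0.9):
    """Interpolate phrase-translation probabilities estimated from the
    original training set with those from the selected dev/test
    translations."""
    phrases = set(p_train) | set(p_adapt)
    return {pp: lam * p_train.get(pp, 0.0) + (1 - lam) * p_adapt.get(pp, 0.0)
            for pp in phrases}

# Toy distributions for the source phrase "chat"
p_train = {("chat", "cat"): 0.8, ("chat", "pussycat"): 0.2}
p_adapt = {("chat", "cat"): 1.0}
mixed = mixture_phrase_prob(p_train, p_adapt, lam=0.9)
print(round(mixed[("chat", "cat")], 2))  # -> 0.82
```

The in-domain evidence sharpens the distribution toward the translations actually observed in the dev/test data.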

SLIDE 22

Estimate (additional phrase table)

[Diagram: the dev/test corpus is decoded into an N-best list; scoring and selection filter out bad translations; the source text plus the reliable translations are used to train an additional phrase table, which joins the distortion model(s), language model(s), and phrase table(s) trained on the source- and target-language text in the SMT system.]

SLIDE 23

Why does it work?

Reinforces parts of the phrase translation model which are relevant for the test corpus; we obtain a more focused probability distribution

Composes new phrases, for example:

  original parallel corpus   additional source data   possible new phrases
  ’A B’, ’C D E’             ’A B C D E’              ’A B C’, ’B C D E’, ’A B C D E’, . . .

SLIDE 24

Limitations of the approach

No learning of translations of unknown source-language words occurring in the new data

Only learning of compositional phrases; the system will not learn translations of idioms:

  “it is raining” + “cats and dogs”        → “it is raining cats and dogs”
  “es regnet” + “Katzen und Hunde”         → “es regnet in Strömen”
  “il pleut” + “des chats et des chiens”   → “il pleut des cordes”

SLIDE 25

1. Motivation
2. Transductive Machine Translation
3. Experimental Results
   • SMT System
   • EuroParl French–English
   • NIST Chinese–English

SLIDE 27

Experimental setting: Baseline & SMT system

PORTAGE: state-of-the-art phrase-based system (NRC, Canada)

Decoder models:
  • several (smoothed) phrase table(s), translation direction p(s₁ᴶ | t₁ᴵ)
  • several 4-gram language model(s), trained with the SRILM toolkit
  • distortion penalty based on the number of skipped source words
  • word penalty

Additional rescoring models:
  • two different IBM-1 features in both translation directions
  • posterior probabilities for words, phrases, n-grams, and sentence length: calculated over the N-best list, using the sentence probabilities assigned by the baseline system

Our approach also works with other phrase-based MT systems, e.g. Moses
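For reference, these models are combined under the standard log-linear decision rule of phrase-based SMT, with source sentence s₁ᴶ, target sentence t₁ᴵ, feature functions h_m, and weights λ_m tuned on the dev set:

```latex
\hat{t}_1^{I} \;=\; \operatorname*{argmax}_{t_1^{I}} \; \sum_{m=1}^{M} \lambda_m \, h_m\!\left(s_1^{J}, t_1^{I}\right)
```

The additional phrase table of the previous slides simply enters this sum as one more feature function h_{M+1} with its own weight.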

SLIDE 28

EuroParl French–English

Setup and evaluation:
  • French → English translation
  • training and testing conditions: WMT 2006 shared task
  • 688k sentence pairs for training
  • 2,000 / 3,064 sentences in dev / test set
  • evaluate with BLEU-4, mWER, mPER, using 1 reference
  • 95% confidence intervals, using bootstrap resampling

SLIDE 29

Results EuroParl French–English

Translation quality for importance sampling based on normalized sentence scores, full re-training of phrase table

[Plot: BLEU score vs. iteration (1–18) for Train100k; BLEU axis from 24.05 to 24.45]

[Plot: BLEU score vs. iteration (1–16) for Train150k; BLEU axis from 24.45 to 24.85]

Transductive learning provides improvement in accuracy equivalent to adding 50k training examples

SLIDE 30

EuroParl translation examples

baseline   but it will be agreed on what we are putting into this constitution .
adapted    but it must be agreed upon what we are putting into the constitution .
reference  but we must reach agreement on what to put in this constitution .

baseline   this does not want to say first of all , as a result .
adapted    it does not mean that everything is going on .
reference  this does not mean that everything has to happen at once .

SLIDE 31

NIST Chinese–English

Setup and evaluation:
  • Chinese → English translation
  • training conditions: NIST 2006 eval, large data track
  • testing: 2006 eval corpus with 3,940 sentences
  • 4 different genres, partially not covered by training data (broadcast conversations, . . . )
  • evaluate with BLEU-4, mWER, mPER, using 4 / 1 references
  • 95% confidence intervals, using bootstrap resampling

SLIDE 35

Results: NIST Chinese–English

Translation quality on NIST 2006 Chinese–English, NIST part. Different versions of selection and scoring method.

  selection       scoring      BLEU[%]    mWER[%]    mPER[%]
  baseline                     27.9±0.7   67.2±0.6   44.0±0.5
  keep all                     28.1       66.5       44.2
  import. sampl.  norm. score  28.7       66.1       43.6
                  confidence   28.4       65.8       43.2
  threshold       norm. score  28.3       66.1       43.5
                  confidence   29.3       65.6       43.2

SLIDE 38

NIST translation examples

baseline      [the report said] [that the] [united states] [is] [a potential] [problem] [, the] [practice of] [china ’s] [foreign policy] [is] [likely to] [weaken us] [influence] [.]
transductive  [the report] [said that] [this is] [a potential] [problem] [in] [the united states] [,] [china] [is] [likely to] [weaken] [the impact of] [american foreign policy] [.]
reference     the report said that this is a potential problem for america . china ’s course of action could possibly weaken the influence of american foreign policy .

baseline      [what we advocate] [his] [name]
transductive  [we] [advocate] [him] [.]
reference     we advocate him .

baseline      [”] [we should] [really be] [male] [nominees] [..] [....]
transductive  [he] [should] [be] [nominated] [male] [,] [really] [.]
reference     he should be nominated as the best actor , really .

SLIDE 43

Conclusion

Explore monolingual source-language data to improve an existing MT system:

  • translate data using the MT system
  • automatically identify reliable translations
  • learn new models on these

Introduced transductive learning approach for statistical MT

  • filtering training data for re-training
  • using an additional phrase table from test data as a feature in the MT log-linear model
  • confidence estimation for accurate detection of good translations
  • importance sampling with thresholding to obtain multiple good translations even for a single sentence

Translation quality improves through transductive learning

Discarding bad translations is important

Approach applicable to other types of statistical MT systems

SLIDE 44

Literature

Transductive learning / unsupervised training: Yarowsky [ACL, 1995], Abney [CompLing 30-03, 2004], Vapnik, “Statistical Learning Theory” [Wiley, 1998]

Self-training for SMT: Ueffing [IWSLT, 2006]

PORTAGE: Ueffing et al. [ACL WMT Workshop, 2007]

Confidence measures: Blatz et al. [CoLing, 2004], Ueffing and Ney [CompLing 33-01, 2007]

SLIDE 45

Acknowledgment

This material is based upon work supported by the Defense Advanced Research Projects Agency (DARPA) under Contract No. HR0011-06-C-0023. Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the Defense Advanced Research Projects Agency (DARPA). This research was partially supported by NSERC, Canada (RGPIN: 264905).

SLIDE 46

END

SLIDE 47

Filtering the training corpus

If the training corpus is huge, full re-training takes very long; filter the training corpus based on n-gram coverage with the dev/test corpus to find the relevant parts
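Such coverage-based filtering can be sketched as follows, here with source-side bigram overlap; the function names and the exact overlap criterion are illustrative, not taken from the paper:

```python
def ngrams(sentence, n):
    """Set of word n-grams of a whitespace-tokenized sentence."""
    words = sentence.split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def filter_by_coverage(train_pairs, test_src, n=2, min_overlap=1):
    """Keep only training sentence pairs sharing at least `min_overlap`
    source-side n-grams with the dev/test corpus."""
    test_ngrams = set()
    for s in test_src:
        test_ngrams |= ngrams(s, n)
    return [(f, e) for f, e in train_pairs
            if len(ngrams(f, n) & test_ngrams) >= min_overlap]

train = [("le chat dort", "the cat sleeps"), ("la maison rouge", "the red house")]
kept = filter_by_coverage(train, ["le chat mange"], n=2)
print(kept)  # -> [('le chat dort', 'the cat sleeps')]
```

Only sentence pairs sharing material with the dev/test source survive, so re-training touches just the relevant part of the corpus.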

SLIDE 48

Results NIST Chinese–English

Statistics of the phrase tables trained on the genres of the NIST test corpora.

Chinese–English eval-04       editorials   newswire   speeches
sentences                     449          901        438
selected translations         101          187        113
size of adapted phrase table  1,981        3,591      2,321
new phrases in phrase table   679          1,359      657
adapted phrases used          707          1,314      815
new phrases used              23           47         25

Chinese–English eval-06       broadcast       broadcast   newsgroup   newswire
                              conversations   news
sentences                     979             1,083       898         980
selected translations         477             274         226         172
size of adapted phrase table  2,155           4,027       2,905       2,804
new phrases in phrase table   1,058           1,645       1,259       1,058
adapted phrases used          759             1,479       1,077       1,115
new phrases used              90              86          88          41

SLIDE 49

Results: NIST Chinese–English

Translation quality on the NIST 2006 Chinese–English task. Different versions of selection and scoring method.

corpus     selection       scoring      BLEU[%]    mWER[%]    mPER[%]
GALE       baseline                     12.7±0.5   75.8±0.6   54.6±0.6
(1 ref.)   keep all                     12.9       75.7       55.0
           import. sampl.  norm. score  13.2       74.7       54.1
                           confidence   12.9       74.4       53.5
           threshold       norm. score  12.7       75.2       54.2
                           confidence   13.6       73.4       53.2
NIST       baseline                     27.9±0.7   67.2±0.6   44.0±0.5
(4 refs.)  keep all                     28.1       66.5       44.2
           import. sampl.  norm. score  28.7       66.1       43.6
                           confidence   28.4       65.8       43.2
           threshold       norm. score  28.3       66.1       43.5
                           confidence   29.3       65.6       43.2

SLIDE 50

More NIST translation examples (1)

baseline      [the capitalist] [system] [, because] [it] [is] [immoral] [to] [criticize] [china] [for years] [, capitalism] [, so] [it] [didn’t] [have] [a set of] [moral values] [.]
transductive  [capitalism] [has] [a set] [of] [moral values] [,] [because] [china] [has] [denounced] [capitalism] [,] [so it] [does not] [have] [a set] [of moral] [.]
reference     capitalism , its set of morals , because china has criticized capitalism for many years , this set of morals is no longer there .

baseline      [the fact] [that this] [is] [.]
transductive  [this] [is] [the point] [.]
reference     that is actually the point .

SLIDE 51

Results EuroParl French–English

Translation quality for importance sampling with full re-training, normalized sentence scores, filtered 100k training sentence pairs

[Plot: BLEU score vs. iteration (1–18); BLEU axis from 24.05 to 24.45]
