Machine Translation for Human Translators: Carnegie Mellon Ph.D. Thesis (presentation slides)



slide-1
SLIDE 1

Machine Translation for Human Translators

Carnegie Mellon Ph.D. Thesis Michael Denkowski

Language Technologies Institute School of Computer Science Carnegie Mellon University

April 20, 2015 Thesis Committee:

Alon Lavie (chair), Carnegie Mellon University Chris Dyer, Carnegie Mellon University Jaime Carbonell, Carnegie Mellon University Gregory Shreve, Kent State University

1

slide-2
SLIDE 2

Motivating Examples

When is a translation “good enough”?

2

slide-3
SLIDE 3

Machine Translation

3

slide-4
SLIDE 4

Machine Translation

4

slide-5
SLIDE 5

Human Translation

International Organizations Global Businesses Community Projects Require human-quality translation of complex content Machine translation currently unable to deliver quality and consistency

5

slide-6
SLIDE 6

Human Translation

International Organizations Global Businesses Community Projects

$37 billion in 2014

Require human-quality translation of complex content Machine translation currently unable to deliver quality and consistency

5

slide-7
SLIDE 7

MT with Human Post-Editing

Source Document Machine Translation Human Editing Translated Document

Use machine translation to improve speed of human translation

6

slide-8
SLIDE 8

MT with Human Post-Editing

Source Document Machine Translation Human Editing Translated Document

Use machine translation to improve speed of human translation Increasing adoption by government organizations and businesses

6

slide-9
SLIDE 9

Post-Editing Example

Son comportement ne peut être qualifié que d'irréprochable .

7

slide-10
SLIDE 10

Post-Editing Example

Son comportement ne peut être qualifié que d'irréprochable . His behavior cannot be described as d'irréprochable .

7

slide-11
SLIDE 11

Post-Editing Example

Son comportement ne peut être qualifié que d'irréprochable . His behavior cannot be described as d'irréprochable . Its behavior can only be described as flawless .

7

slide-12
SLIDE 12

Post-Editing Example

Son comportement ne peut être qualifié que d'irréprochable . His behavior cannot be described as d'irréprochable . Its behavior can only be described as flawless . MT task: minimize work for human translators

7

slide-13
SLIDE 13

Machine Translation with Human Post-Editing

Post-editing faster and more accurate than unaided translation (Guerberof, 2009; Carl et al., 2011; Koehn, 2012; Zhechev, 2012; inter alia) Productivity gains but MT systems not engineered for human post-editing How can we extend MT systems to target post-editing?

8

slide-14
SLIDE 14

Thesis Statement

While general improvements in MT quality have led to improved performance and increased interest in this application, there has been relatively little work on designing translation systems specifically for post-editing. We present extensions to key components of MT pipelines that significantly reduce the amount of work required from human translators.

9

slide-15
SLIDE 15

Thesis Claims

We claim that:

10

slide-16
SLIDE 16

Thesis Claims

We claim that: The amount of work required of human translators can be reduced by translation systems that immediately learn from editor feedback.

10

slide-17
SLIDE 17

Thesis Claims

We claim that: The amount of work required of human translators can be reduced by translation systems that immediately learn from editor feedback. The usability of machine translations can be improved by automatically identifying the most costly types of translation errors and tuning MT systems to avoid them.

10

slide-18
SLIDE 18

Thesis Claims

We claim that: The amount of work required of human translators can be reduced by translation systems that immediately learn from editor feedback. The usability of machine translations can be improved by automatically identifying the most costly types of translation errors and tuning MT systems to avoid them. The most significant gains in post-editing productivity are realized when several system components can adapt in unison.

10

slide-19
SLIDE 19

Research Contributions

To support our claims, we make the following contributions to the research community:

11

slide-20
SLIDE 20

Research Contributions

To support our claims, we make the following contributions to the research community: a method for immediately incorporating post-editing data into a translation model

11

slide-21
SLIDE 21

Research Contributions

To support our claims, we make the following contributions to the research community: a method for immediately incorporating post-editing data into a translation model a technique for running an online learning algorithm that continuously updates feature weights during decoding

11

slide-22
SLIDE 22

Research Contributions

To support our claims, we make the following contributions to the research community: a method for immediately incorporating post-editing data into a translation model a technique for running an online learning algorithm that continuously updates feature weights during decoding a workflow for training and deploying adaptive MT systems for human translators using only normal training data

11

slide-23
SLIDE 23

Research Contributions

To support our claims, we make the following contributions to the research community: a method for immediately incorporating post-editing data into a translation model a technique for running an online learning algorithm that continuously updates feature weights during decoding a workflow for training and deploying adaptive MT systems for human translators using only normal training data an advanced automatic MT evaluation metric capable of fitting various measures of editing effort

11

slide-24
SLIDE 24

Research Contributions

To support our claims, we make the following contributions to the research community: a method for immediately incorporating post-editing data into a translation model a technique for running an online learning algorithm that continuously updates feature weights during decoding a workflow for training and deploying adaptive MT systems for human translators using only normal training data an advanced automatic MT evaluation metric capable of fitting various measures of editing effort an end-to-end post-editing pipeline that demonstrates the effectiveness of our adaptive systems in live post-editing scenarios

11

slide-25
SLIDE 25

Overview

Online learning for statistical MT Translation model review Real time model adaptation Simulated post-editing Post-editing software and experiments Kent State live post-editing Automatic metrics for post-editing Meteor automatic metric Evaluation and optimization for post-editing Conclusion and Future Work

12

slide-26
SLIDE 26

Overview

Online learning for statistical MT Translation model review Real time model adaptation Simulated post-editing Post-editing software and experiments Kent State live post-editing Automatic metrics for post-editing Meteor automatic metric Evaluation and optimization for post-editing Conclusion and Future Work

13

slide-27
SLIDE 27

Online Learning for MT

Statistical translation models built from bilingual data

14

slide-28
SLIDE 28

Online Learning for MT

Statistical translation models built from bilingual data Post-editing generates new bilingual data

14

slide-29
SLIDE 29

Online Learning for MT

Statistical translation models built from bilingual data Post-editing generates new bilingual data Goal: incorporate post-editing data back into model in real time Learn from feedback: avoid repeating the same translation errors

14

slide-30
SLIDE 30

Online Learning for MT

Batch learning (standard MT): Estimation Prediction

15

slide-31
SLIDE 31

Online Learning for MT

Batch learning (standard MT): Estimation Prediction Online learning (this work): Prediction Truth Update

15

slide-32
SLIDE 32

Online Learning for MT

Batch learning (standard MT): Estimation → Prediction. Online learning (this work): Prediction → Truth → Update. Requirement: all system components operate at the sentence level

15
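The prediction–truth–update cycle above can be sketched as a minimal adaptive system that folds each post-edit straight back into its model. This is a toy illustration with hypothetical component names, not the actual cdec or Moses API:

```python
# Minimal sketch of the sentence-level online learning loop:
# predict a translation, receive the post-edited truth, update immediately.

class AdaptiveMT:
    def __init__(self):
        self.seen = []  # (source, post_edit) pairs accumulated so far

    def predict(self, source):
        # Stand-in for decoding: reuse a stored post-edit if one exists.
        for src, tgt in reversed(self.seen):
            if src == source:
                return tgt
        return "<MT output for: %s>" % source

    def update(self, source, post_edit):
        # The truth (post-edited sentence) feeds straight back into the model.
        self.seen.append((source, post_edit))


system = AdaptiveMT()
hyp1 = system.predict("la vérité")        # first pass: no feedback yet
system.update("la vérité", "the truth")   # editor corrects the output
hyp2 = system.predict("la vérité")        # second pass benefits immediately
```

The point of the sketch is the requirement stated on the slide: every component must be able to update after a single sentence, not only after a full batch retrain.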

slide-33
SLIDE 33

Overview

Online learning for statistical MT Translation model review Real time model adaptation Simulated post-editing Post-editing software and experiments Kent State live post-editing Automatic metrics for post-editing Meteor automatic metric Evaluation and optimization for post-editing Conclusion and Future Work

16

slide-34
SLIDE 34

Post-Editing with Standard MT

[System diagram: Large LM, Grammar (X → f̄/ē), Weights (w_1 ... w_n), all Static → Decoder ← Input Sentence → Post-Editing]

17

slide-35
SLIDE 35

Machine Translation Formalism

Phrase-based machine translation (Koehn et al., 2003): la vérité → the truth. Match spans of input text against phrases we know how to translate.

18

slide-36
SLIDE 36

Machine Translation Formalism

Phrase-based machine translation (Koehn et al., 2003): la vérité → the truth
Match spans of input text against phrases we know how to translate
Hierarchical phrase-based MT (Chiang, 2007):
X → la X1 • the X1
X → vérité • truth
Generalization where phrases can contain other phrases
Phrases become rules in a synchronous context-free grammar

18
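The synchronous rules above can be illustrated with a toy derivation: applying a rule substitutes the same nonterminal on the source and target sides in parallel. A minimal sketch (hypothetical helper, not part of any MT toolkit):

```python
# Toy synchronous CFG: each rule pairs a source pattern with a target
# pattern sharing the nonterminal X1; applying it rewrites both sides.

def apply_rule(pair, rule):
    src_pat, tgt_pat = rule
    src, tgt = pair
    return src.replace("X1", src_pat), tgt.replace("X1", tgt_pat)

# X -> <la X1, the X1> followed by X -> <vérité, truth>
derivation = ("X1", "X1")
derivation = apply_rule(derivation, ("la X1", "the X1"))
derivation = apply_rule(derivation, ("vérité", "truth"))
# derivation == ("la vérité", "the truth")
```

The key property is that the two sides rewrite in lockstep, which is what lets a hierarchical system reorder arbitrarily large spans.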

slide-37
SLIDE 37

Hierarchical Phrase-Based Translation Example

Input sentence: Pourtant , la vérité est ailleurs selon moi .
Translation Grammar:
X → X1 est ailleurs X2 . • X2 , X1 lies elsewhere .
X → Pourtant , • Yet
X → la vérité • the truth
X → selon moi • in my view
Glue Grammar:
S → S1 X2 • S1 X2
S → X1 • X1

19

slide-38
SLIDE 38

Hierarchical Phrase-Based Translation Example

F: Pourtant , la vérité est ailleurs selon moi .  E:

20

slide-39
SLIDE 39

Hierarchical Phrase-Based Translation Example

F: Pourtant , la vérité est ailleurs selon moi .  E: the truth  (applied: X → la vérité • the truth)

20

slide-40
SLIDE 40

Hierarchical Phrase-Based Translation Example

F: Pourtant , la vérité est ailleurs selon moi .  E: Yet in my view the truth  (applied: X → Pourtant , • Yet ; X → la vérité • the truth ; X → selon moi • in my view)

20

slide-41
SLIDE 41

Hierarchical Phrase-Based Translation Example

F: Pourtant , la vérité est ailleurs selon moi .  E: Yet in my view , the truth lies elsewhere .  (adds: X → X1 est ailleurs X2 . • X2 , X1 lies elsewhere . plus glue rules S)

20

slide-42
SLIDE 42

Model Parameterization

Ambiguity: many ways to translate the same source phrase
Add feature scores that encode properties of translation:
X → devis • quote  (0.5, 10, 137, ...)
X → devis • estimate  (0.4, 13, 261, ...)
X → devis • specifications  (0.2, 5, 407, ...)
Decoder uses feature scores and weights to select the most likely translation derivation.

21

slide-43
SLIDE 43

Linear Translation Models

Single feature score for a translation derivation D with rule-local features h_i:
H_i(D) = Σ_{X → f̄/ē ∈ D} h_i(X → f̄/ē)
Score for a derivation using several features H_i ∈ H with weight vector w_i ∈ W:
S(D) = Σ_{i=1..|H|} w_i H_i(D)
Decoder selects translation with largest product W · H

22

slide-44
SLIDE 44

Linear Translation Models

Single feature score for a translation derivation D with rule-local features h_i:
H_i(D) = Σ_{X → f̄/ē ∈ D} h_i(X → f̄/ē)
Score for a derivation using several features H_i ∈ H with weight vector w_i ∈ W:
S(D) = Σ_{i=1..|H|} w_i H_i(D)
Decoder selects translation with largest product W · H
⇒ sentence-level prediction step

22
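The linear model above is just a dot product: sum each rule-local feature over the rules in a derivation, then weight and sum. A minimal sketch with illustrative feature names and values (not the toolkit's internal representation):

```python
# Linear translation model: H_i(D) sums feature i over the rules used in
# derivation D; S(D) is the weighted sum over all features.

def derivation_score(derivation, weights):
    # H_i(D): accumulate each rule-local feature over the derivation
    H = {}
    for rule_features in derivation:
        for name, value in rule_features.items():
            H[name] = H.get(name, 0.0) + value
    # S(D) = sum_i w_i * H_i(D)
    return sum(weights[name] * H.get(name, 0.0) for name in weights)

# Two rules in the derivation, each carrying a log-probability and a count
rules = [{"log_p": -0.5, "count": 1.0}, {"log_p": -1.2, "count": 1.0}]
W = {"log_p": 1.0, "count": 0.1}
score = derivation_score(rules, W)  # (-0.5 + -1.2) * 1.0 + 2 * 0.1 = -1.5
```

The decoder's job is then to search for the derivation maximizing this score.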

slide-45
SLIDE 45

Learning Translations

Learning translations

23

slide-46
SLIDE 46

Translation Model Estimation

Sentence-parallel bilingual text

F: Devis de garage en quatre étapes. Avec l'outil Auda-Taller, l'entreprise Audatex garantit que l'usager obtient un devis en seulement quatre étapes : identifier le véhicule, chercher la pièce de rechange, créer un devis et le générer. La facilité d'utilisation est un élément essentiel de ces systèmes, surtout pour convaincre les professionnels les plus âgés qui, dans une plus ou moins grande mesure, sont rétifs à l'utilisation de nouvelles techniques de gestion. E: A shop's estimate in four steps. With the AudaTaller tool, Audatex guarantees that the user gets an estimate in only 4 steps: identify the vehicle, look for the spare part, create an estimate and generate an estimate. User friendliness is an essential condition for these systems, especially to convincing older technicians, who, to varying degrees, are usually more reluctant to use new management techniques.

24

slide-47
SLIDE 47

Translation Model Estimation

Sentence-parallel bilingual text

F: Devis de garage en quatre étapes. Avec l'outil Auda-Taller, l'entreprise Audatex garantit que l'usager obtient un devis en seulement quatre étapes : identifier le véhicule, chercher la pièce de rechange, créer un devis et le générer. La facilité d'utilisation est un élément essentiel de ces systèmes, surtout pour convaincre les professionnels les plus âgés qui, dans une plus ou moins grande mesure, sont rétifs à l'utilisation de nouvelles techniques de gestion. E: A shop's estimate in four steps. With the AudaTaller tool, Audatex guarantees that the user gets an estimate in only 4 steps: identify the vehicle, look for the spare part, create an estimate and generate an estimate. User friendliness is an essential condition for these systems, especially to convincing older technicians, who, to varying degrees, are usually more reluctant to use new management techniques.

Each sentence is a training instance

24

slide-48
SLIDE 48

Model Estimation: Word Alignment

Brown et al. (1993), Dyer et al. (2013)

F: Devis de garage en quatre étapes  E: A shop 's estimate in four steps

25

slide-49
SLIDE 49

Model Estimation: Word Alignment

Brown et al. (1993), Dyer et al. (2013)

F: Devis de garage en quatre étapes  E: A shop 's estimate in four steps

25
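The word-alignment step cited above (Brown et al., 1993) can be sketched with a few EM iterations of an IBM Model 1 flavour on a toy bitext. This is an illustrative sketch, not the cited implementations:

```python
# IBM Model 1 sketch: EM estimates word translation probabilities t(e|f)
# from sentence-aligned bitext, with alignments as hidden variables.
from collections import defaultdict

bitext = [(["devis", "garage"], ["estimate", "shop"]),
          (["devis"], ["estimate"])]

t = defaultdict(lambda: 0.25)  # uniform initialisation of t(e|f)
for _ in range(10):
    count = defaultdict(float)
    total = defaultdict(float)
    for fs, es in bitext:
        for e in es:
            z = sum(t[(e, f)] for f in fs)   # normaliser for this target word
            for f in fs:
                c = t[(e, f)] / z            # expected alignment count
                count[(e, f)] += c
                total[f] += c
    for (e, f), c in count.items():
        t[(e, f)] = c / total[f]

# The second sentence pair disambiguates: "devis" aligns to "estimate"
```

Because "devis" co-occurs with "estimate" in both pairs but with "shop" in only one, EM concentrates probability mass on the correct link, which is exactly the signal the phrase extractor consumes next.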

slide-50
SLIDE 50

Model Estimation: Phrase Extraction

Koehn et al. (2003), Och and Ney (2004), Och et al. (1999)

F: Devis de garage en quatre étapes
E: A shop 's estimate in four steps
(word alignment grid)

26
slide-51
SLIDE 51

Model Estimation: Phrase Extraction

Koehn et al. (2003), Och and Ney (2004), Och et al. (1999)

F: Devis de garage en quatre étapes
E: A shop 's estimate in four steps
(word alignment grid)
Extracted phrase pairs include:
de garage → a shop 's
en quatre étapes → in four steps

26
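Phrase extraction (Koehn et al., 2003) keeps a source/target span pair only when it is consistent with the alignment, meaning no link crosses the box boundary. A minimal sketch of that check:

```python
# Consistent phrase-pair extraction: a span pair is extractable when no
# alignment link leaves the rectangle it defines.

def extract_phrases(f_words, e_words, alignment, max_len=4):
    pairs = set()
    for f1 in range(len(f_words)):
        for f2 in range(f1, min(f1 + max_len, len(f_words))):
            # target positions linked to the source span
            es = [e for f, e in alignment if f1 <= f <= f2]
            if not es:
                continue
            e1, e2 = min(es), max(es)
            # consistency: no link from inside the target span to outside f1..f2
            if all(f1 <= f <= f2 for f, e in alignment if e1 <= e <= e2):
                pairs.add((" ".join(f_words[f1:f2 + 1]),
                           " ".join(e_words[e1:e2 + 1])))
    return pairs

f = "en quatre étapes".split()
e = "in four steps".split()
links = [(0, 0), (1, 1), (2, 2)]
pairs = extract_phrases(f, e, links)
# contains ("quatre", "four") and ("en quatre étapes", "in four steps")
```

The hierarchical case on the following slides generalizes this by subtracting an already-extracted sub-phrase and replacing it with a nonterminal.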

slide-52
SLIDE 52

Model Estimation: Hierarchical Phrase Extraction

Chiang (2007)
F: Pourtant , la vérité est ailleurs selon moi .
E: Yet in my view , the truth lies elsewhere .
(word alignment grid)

27

slide-53
SLIDE 53

Model Estimation: Hierarchical Phrase Extraction

Chiang (2007)
F: Pourtant , la vérité est ailleurs selon moi .
E: Yet in my view , the truth lies elsewhere .
(word alignment grid)
Extracted: la vérité est ailleurs selon moi . → in my view , the truth lies elsewhere .

27

slide-54
SLIDE 54

Model Estimation: Hierarchical Phrase Extraction

Chiang (2007)
F: Pourtant , la vérité est ailleurs selon moi .
E: Yet in my view , the truth lies elsewhere .
(word alignment grid)
la vérité est ailleurs selon moi . → in my view , the truth lies elsewhere .
X1 est ailleurs X2 . → X2 , X1 lies elsewhere .

27

slide-55
SLIDE 55

Model Estimation: Hierarchical Phrase Extraction

Chiang (2007)
F: Pourtant , la vérité est ailleurs selon moi .
E: Yet in my view , the truth lies elsewhere .
(word alignment grid)
la vérité est ailleurs selon moi . → in my view , the truth lies elsewhere .
X1 est ailleurs X2 . → X2 , X1 lies elsewhere .

sentence-level rule learning

27

slide-56
SLIDE 56

Parameterization: Feature Scoring

Add feature functions to rules X → f̄/ē:
[Pipeline: Training Data (sentences 1..N) → Corpus Stats for X → f̄/ē → Scored Grammar (Global, Static) → Translate Sentence ← Input Sentence]

28

slide-57
SLIDE 57

Parameterization: Feature Scoring

Add feature functions to rules X → f̄/ē:
[Pipeline: Training Data (sentences 1..N) → Corpus Stats for X → f̄/ē → Scored Grammar (Global, Static) → Translate Sentence ← Input Sentence]
× corpus-level rule scoring

28

slide-58
SLIDE 58

Suffix Array Grammar Extraction

Brown (1996), Callison-Burch et al. (2005), Lopez (2008)

[Pipeline: Training Data → Suffix Array (Static) → Sample (per sentence 1..N) → Sample Stats for X → f̄/ē → Grammar (Sentence) → Translate Sentence ← Input Sentence]

29

slide-59
SLIDE 59

Scoring via Sampling

Suffix array statistics available in sample S for each source f̄:
c_S(f̄, ē): count of instances where f̄ is aligned to ē (co-occurrence count)
c_S(f̄): count of instances where f̄ is aligned to any target
|S|: total number of instances (equal to occurrences of f̄ in training data, up to the sample size)
Used to calculate feature scores for each rule at the time of extraction

30

slide-60
SLIDE 60

Scoring via Sampling

Suffix array statistics available in sample S for each source f̄:
c_S(f̄, ē): count of instances where f̄ is aligned to ē (co-occurrence count)
c_S(f̄): count of instances where f̄ is aligned to any target
|S|: total number of instances (equal to occurrences of f̄ in training data, up to the sample size)
Used to calculate feature scores for each rule at the time of extraction
× sentence-level grammar extraction, but static training data

30
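The "coherent" estimate computed from these sample statistics divides the co-occurrence count by the whole sample size |S|, so occurrences of f̄ that yielded no extractable rule still count against the probability. A minimal sketch of that calculation:

```python
# Coherent translation probability from a suffix-array sample S:
# CoherentP(e|f) = c_S(f̄, ē) / |S|, where |S| includes occurrences of f̄
# that were not aligned to any extractable target phrase.

def coherent_p(sample, target):
    """sample: list of aligned targets (None = f̄ occurred but unaligned)."""
    c_fe = sum(1 for t in sample if t == target)  # c_S(f̄, ē)
    return c_fe / len(sample)                     # divide by |S|, not c_S(f̄)

S = ["quote", "estimate", None, "quote"]  # 4 sampled occurrences of "devis"
p = coherent_p(S, "quote")  # 2/4 = 0.5; the unaligned instance still counts
```

Dividing by |S| rather than c_S(f̄) penalizes source phrases that frequently fail to produce a usable rule, which is the property the Lopez (2008) feature set relies on.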

slide-61
SLIDE 61

Overview

Online learning for statistical MT Translation model review Real time model adaptation Simulated post-editing Post-editing software and experiments Kent State live post-editing Automatic metrics for post-editing Meteor automatic metric Evaluation and optimization for post-editing Conclusion and Future Work

31

slide-62
SLIDE 62

Online Grammar Extraction

Denkowski et al. (EACL 2014)

[Pipeline: Training Data → Suffix Array (Static) → Sample → Sample Stats for X → f̄/ē → Grammar (Sentence) → Translate Sentence ← Input Sentence]

32

slide-63
SLIDE 63

Online Grammar Extraction

Denkowski et al. (EACL 2014)

[Pipeline: Training Data → Suffix Array (Static) → Sample → Sample Stats for X → f̄/ē → Grammar (Sentence) → Translate Sentence ← Input Sentence; Post-Edit Sentence → Lookup Table (Dynamic) → Sample Stats]

32

slide-64
SLIDE 64

Online Grammar Extraction

Denkowski et al. (EACL 2014)

Maintain dynamic lookup table for post-edit data
Pair each sample S from suffix array with exhaustive lookup L from lookup table
Parallel statistics available at grammar scoring time:
c_L(f̄, ē): count of instances where f̄ is aligned to ē (co-occurrence count)
c_L(f̄): count of instances where f̄ is aligned to any target
|L|: total number of instances (equal to occurrences of f̄ in post-edit data, no limit)

33
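The parallel statistics above can be sketched as a small dynamic structure that grows with each post-edited sentence and is merged with the static sample at scoring time. This is an illustrative sketch with hypothetical names, not the cdec implementation:

```python
# Dynamic lookup table L for post-edit data, merged with a static
# suffix-array sample S when scoring: CoherentP = (c_S + c_L) / (|S| + |L|).

class DynamicLookup:
    def __init__(self):
        self.instances = []  # aligned (f, e) phrase instances from post-edits

    def add_post_edit(self, phrase_pairs):
        self.instances.extend(phrase_pairs)

    def counts(self, f, e):
        c_fe = sum(1 for pf, pe in self.instances if pf == f and pe == e)
        size = sum(1 for pf, _ in self.instances if pf == f)  # |L| for f
        return c_fe, size

def coherent_p(c_S_fe, size_S, c_L_fe, size_L):
    # combined estimate over background sample and post-edit lookup
    return (c_S_fe + c_L_fe) / (size_S + size_L)

L = DynamicLookup()
L.add_post_edit([("devis", "quote"), ("devis", "quote")])
c_fe, size_L = L.counts("devis", "quote")
p = coherent_p(c_S_fe=1, size_S=4, c_L_fe=c_fe, size_L=size_L)  # (1+2)/(4+2)
```

Because the lookup is exhaustive rather than sampled, a translation confirmed even once by an editor immediately shifts the rule's score for every later sentence.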

slide-65
SLIDE 65

Rule Scoring

Denkowski et al. (EACL 2014)

Suffix array feature set (Lopez 2008)
Phrase features encode likelihood of translation rule given training data
Features scored with S:
CoherentP(e|f) = c_S(f̄, ē) / |S|
Count(f,e) = c_S(f̄, ē)
SampleCount(f) = |S|

34

slide-66
SLIDE 66

Rule Scoring

Denkowski et al. (EACL 2014)

Suffix array feature set (Lopez 2008)
Phrase features encode likelihood of translation rule given training data
Features scored with S and L:
CoherentP(e|f) = (c_S(f̄, ē) + c_L(f̄, ē)) / (|S| + |L|)
Count(f,e) = c_S(f̄, ē) + c_L(f̄, ē)
SampleCount(f) = |S| + |L|

34

slide-67
SLIDE 67

Rule Scoring

Denkowski et al. (EACL 2014)

Indicator features identify certain classes of rules
Features scored with S:
Singleton(f) = 1 if c_S(f̄) = 1, else 0
Singleton(f,e) = 1 if c_S(f̄, ē) = 1, else 0

35

slide-68
SLIDE 68

Rule Scoring

Denkowski et al. (EACL 2014)

Indicator features identify certain classes of rules
Features scored with S and L:
Singleton(f) = 1 if c_S(f̄) + c_L(f̄) = 1, else 0
Singleton(f,e) = 1 if c_S(f̄, ē) + c_L(f̄, ē) = 1, else 0
PostEditSupport(f,e) = 1 if c_L(f̄, ē) > 0, else 0

35

slide-69
SLIDE 69

Parameter Optimization

Denkowski et al. (EACL 2014)

Choose feature weights that maximize objective function (BLEU score) on a development corpus
Minimum error rate training (MERT) (Och, 2003): Translate → Optimize

36

slide-70
SLIDE 70

Parameter Optimization

Denkowski et al. (EACL 2014)

Choose feature weights that maximize objective function (BLEU score) on a development corpus
Minimum error rate training (MERT) (Och, 2003): Translate → Optimize
Margin infused relaxed algorithm (MIRA) (Chiang, 2012): Translate → Truth → Update

36
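The translate–truth–update loop can be sketched with a plain structured-perceptron step standing in for the full MIRA update (which additionally solves a margin-constrained quadratic program). A toy sketch with illustrative feature names:

```python
# Online weight update after one sentence: move weights toward the
# reference translation's features and away from the hypothesis'.
# (Perceptron-style stand-in for the MIRA update described above.)

def perceptron_update(weights, hyp_feats, truth_feats, lr=0.1):
    for name in set(hyp_feats) | set(truth_feats):
        delta = truth_feats.get(name, 0.0) - hyp_feats.get(name, 0.0)
        weights[name] = weights.get(name, 0.0) + lr * delta
    return weights

w = {"lm": 0.5, "tm": 0.5}
w = perceptron_update(w, hyp_feats={"lm": 2.0, "tm": 1.0},
                      truth_feats={"lm": 1.0, "tm": 2.0})
# w == {"lm": 0.4, "tm": 0.6}: the translation-model feature gains weight
```

Each sentence triggers one such update, which is what makes the optimizer compatible with the sentence-level requirement stated earlier.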

slide-71
SLIDE 71

Post-Editing with Standard MT

Denkowski et al. (EACL 2014)

[System diagram: Large LM, Grammar (X → f̄/ē), Weights (w_1 ... w_n), all Static → Decoder ← Input Sentence → Post-Editing]

37

slide-72
SLIDE 72

Post-Editing with Adaptive MT

Denkowski et al. (EACL 2014)

[System diagram: Large Bitext LM (Static) + PE Data, TM (X → f̄/ē, Dynamic), Weights (w_1 ... w_n, Dynamic) → Decoder ← Input Sentence → Post-Editing feeds back into the dynamic components]

38

slide-73
SLIDE 73

Overview

How can we build systems without translators in the loop?

39

slide-74
SLIDE 74

Overview

Online learning for statistical MT Translation model review Real time model adaptation Simulated post-editing Post-editing software and experiments Kent State live post-editing Automatic metrics for post-editing Meteor automatic metric Evaluation and optimization for post-editing Conclusion and Future Work

40

slide-75
SLIDE 75

Simulated Post-Editing

Denkowski et al. (EACL 2014)

Incremental training data: Source → Target (Reference)
Hola contestadora ... → Hello voicemail, my old ...
He llamado a servicio ... → I've called for tech ...
Ignoré la advertencia ... → I ignored my boss' ...
Ahora anochece, y mi ... → Now it's evening, and ...
Todavía sigo en espera ... → I'm still on hold. I'm ...
No creo que me hayas ... → I don't think you ...
Ya he presionado cada ... → I punched every touch ...

Use pre-generated references in place of post-editing (Hardt and Elming, 2010)
Build, evaluate, and deploy adaptive systems using only standard training data

41

slide-76
SLIDE 76

Simulated Post-Editing Experiments

Denkowski et al. (EACL 2014)

MT System (cdec) Hierarchical phrase-based model using suffix arrays Large 4-gram language model MIRA optimization Model Adaptation Update TM and weights independently and in conjunction Training Data WMT12 Spanish–English and NIST 2012 Arabic–English Evaluation Data WMT/NIST news (standard test sets) TED talks (totally blind out-of-domain test)

42

slide-77
SLIDE 77

Simulated Post-Editing Experiments

Denkowski et al. (EACL 2014)

Spanish–English and Arabic–English
[Bar charts: BLEU scores on the WMT/NIST, TED1, and TED2 test sets for the Baseline, Grammar, MIRA, and Both configurations]

43

slide-78
SLIDE 78

Simulated Post-Editing Experiments

Denkowski et al. (EACL 2014)

Spanish–English and Arabic–English
[Bar charts: BLEU scores on the WMT/NIST, TED1, and TED2 test sets for the Baseline, Grammar, MIRA, and Both configurations]

Up to 1.7 BLEU improvement over static baseline

43

slide-79
SLIDE 79

Recent Work

How can we better leverage incremental data?

44

slide-80
SLIDE 80

Translation Model Combination

Denkowski (AMTA 2014 Workshop on Interactive and Adaptive MT)

cdec (Dyer et al., 2010) Single translation model updated with new data Single feature set that changes over time (summation) Moses (Koehn et al., 2007) Multiple translation models: background and post-editing Per-feature linear interpolation in context of full system Recent additions to Moses toolkit Dynamic suffix array phrase tables (Germann, 2014) Fast MIRA implementation (Cherry and Foster, 2012) Multiple phrase tables with runtime weight updates (Denkowski, 2014)

45

slide-81
SLIDE 81

Translation Model Combination

Denkowski (AMTA 2014 Workshop on Interactive and Adaptive MT)

Spanish–English and Arabic–English
[Bar charts: BLEU scores on the WMT/NIST, TED1, and TED2 test sets for the Baseline, PE Support, Multi Model, and +MIRA configurations]

46

slide-82
SLIDE 82

Translation Model Combination

Denkowski (AMTA 2014 Workshop on Interactive and Adaptive MT)

Spanish–English and Arabic–English
[Bar charts: BLEU scores on the WMT/NIST, TED1, and TED2 test sets for the Baseline, PE Support, Multi Model, and +MIRA configurations]

Up to 4.9 BLEU improvement over static baseline

46

slide-83
SLIDE 83

Related Work: Learning from Post-Editing

Updating translation grammars with post-editing data
Cache-based translation and language models (Nepveu et al., 2004; Bertoldi et al., 2013)
Store sufficient statistics in grammar (Ortiz-Martínez et al., 2010)
Distinguish between background and post-editing data (Hardt and Elming, 2010)
Updating feature weights during decoding
Various online learning algorithms to update MERT weights (Martínez-Gómez et al., 2012; López-Salcedo et al., 2012)
Algorithm for learning from binary classification examples (Saluja et al., 2012)

47

slide-84
SLIDE 84

Overview

Online learning for statistical MT Translation model review Real time model adaptation Simulated post-editing Post-editing software and experiments Kent State live post-editing Automatic metrics for post-editing Meteor automatic metric Evaluation and optimization for post-editing Conclusion and Future Work

48

slide-85
SLIDE 85

Tools for Human Translators

49

slide-86
SLIDE 86

TransCenter Post-Editing Interface

Denkowski and Lavie (AMTA 2012), Denkowski et al. (HaCat 2014)

50

slide-87
SLIDE 87

TransCenter Post-Editing Interface

Denkowski and Lavie (AMTA 2012), Denkowski et al. (HaCat 2014)

51

slide-88
SLIDE 88

TransCenter Post-Editing Interface

Denkowski and Lavie (AMTA 2012), Denkowski et al. (HaCat 2014)

52

slide-89
SLIDE 89

TransCenter Post-Editing Interface

Denkowski and Lavie (AMTA 2012), Denkowski et al. (HaCat 2014)

53

slide-90
SLIDE 90

Overview

Online learning for statistical MT Translation model review Real time model adaptation Simulated post-editing Post-editing software and experiments Kent State live post-editing Automatic metrics for post-editing Meteor automatic metric Evaluation and optimization for post-editing Conclusion and Future Work

54

slide-91
SLIDE 91

Post-Editing Field Test

Denkowski et al. (HaCat 2014)

Experimental Setup Six translation studies students from Kent State University post-edited MT output Text: 4 excerpts from TED talks translated from Spanish into English (100 sentences total) Two excerpts translated by static system, two by adaptive system (shuffled by user) Record post-editing effort (HTER) and translator rating

55

slide-92
SLIDE 92

Post-Editing Field Test

Denkowski et al. (HaCat 2014)

Results
Adaptive system significantly outperforms static baseline
Small improvement in simulated scenario leads to significant improvement in production

System    HTER ↓  Rating ↑  Sim PE BLEU ↑
Baseline  19.26   4.19      34.50
Adaptive  17.01   4.31      34.95

56

slide-93
SLIDE 93

Related Work: Computer-Aided Translation Tools

Translation software suites CASMACAT project: full-featured open source translator's workbench software (Ortiz-Martínez et al., 2012) MateCat project: enterprise-grade workbench with MT integration and project management (Federico, 2014; Cattelan, 2014) Novel CAT approaches Streamlined interface with both phrase prediction and post-editing (Green, 2014) Effectiveness of monolingual post-editing assisted by word alignments (Schwartz, 2014)

57

slide-94
SLIDE 94

Overview

Online learning for statistical MT Translation model review Real time model adaptation Simulated post-editing Post-editing software and experiments Kent State live post-editing Automatic metrics for post-editing Meteor automatic metric Evaluation and optimization for post-editing Conclusion and Future Work

58

slide-95
SLIDE 95

System Optimization

Parameter optimization (MIRA): choose feature weights W that maximize objective on tuning set
Automatic metrics approximate human evaluation of MT output against reference translations
Adequacy-based evaluation: good translations should be semantically similar to references
Several adequacy-driven research efforts: ACL WMT (Callison-Burch et al., 2011), NIST OpenMT (Przybocki et al., 2009)

59

slide-96
SLIDE 96

Standard MT Evaluation

Standard BLEU metric based on N-gram precision (P) (Papineni et al., 2002)
Matches spans of hypothesis E′ against reference E
Surface forms only, depends on multiple references to capture translation variation (expensive)
Jointly measures word choice and order
BLEU = BP × exp(Σ_{n=1..N} (1/N) log P_n)
BP = 1 if |E′| > |E|, else e^(1 − |E|/|E′|)

60
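The formula above can be sketched as a sentence-level BLEU computation (unsmoothed; real evaluations aggregate n-gram counts at the corpus level):

```python
# Sentence-level BLEU: geometric mean of clipped n-gram precisions,
# scaled by the brevity penalty BP.
import math
from collections import Counter

def bleu(hyp, ref, N=4):
    precisions = []
    for n in range(1, N + 1):
        h = Counter(tuple(hyp[i:i + n]) for i in range(len(hyp) - n + 1))
        r = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
        match = sum(min(c, r[g]) for g, c in h.items())  # clipped matches
        precisions.append(match / max(sum(h.values()), 1))
    if min(precisions) == 0:
        return 0.0  # any zero precision zeroes the geometric mean
    bp = 1.0 if len(hyp) > len(ref) else math.exp(1 - len(ref) / len(hyp))
    return bp * math.exp(sum(math.log(p) for p in precisions) / N)

hyp = "the truth lies elsewhere".split()
assert bleu(hyp, hyp) == 1.0  # identical output scores 1
assert bleu("a big house".split(), "the large home".split()) == 0.0
```

The second assertion is exactly the shortcoming discussed on the next slides: a perfectly acceptable paraphrase with no surface overlap scores zero.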

slide-97
SLIDE 97

Standard MT Evaluation

Shortcomings of BLEU metric (Banerjee and Lavie 2005, Callison-Burch et al., 2007): Evaluating surface forms misses correct translations N-grams have no notion of global coherence

61

slide-98
SLIDE 98

Standard MT Evaluation

Shortcomings of BLEU metric (Banerjee and Lavie 2005, Callison-Burch et al., 2007): Evaluating surface forms misses correct translations N-grams have no notion of global coherence E: The large home

61

slide-99
SLIDE 99

Standard MT Evaluation

Shortcomings of BLEU metric (Banerjee and Lavie 2005, Callison-Burch et al., 2007): Evaluating surface forms misses correct translations N-grams have no notion of global coherence
E: The large home
E′1: A big house  (BLEU = 0)
E′2: I am a dinosaur  (BLEU = 0)

61

slide-100
SLIDE 100

Post-Editing

Final translations must be human quality (editing required)
Good MT output should require less work for humans to edit
Human-targeted translation edit rate (HTER, Snover et al., 2006):
1. Human translators correct MT output
2. Automatically calculate number of edits using TER
TER = # edits / |E|
Edits: insertion, deletion, substitution, block shift
"Better" translations not always easier to post-edit

62
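The edit-counting step can be sketched with word-level edit distance. Block shifts are omitted here for brevity, so this sketch gives an upper bound on true TER:

```python
# TER sketch: word-level edit distance (insert/delete/substitute only)
# divided by reference length |E|.

def ter(hyp, ref):
    d = [[0] * (len(ref) + 1) for _ in range(len(hyp) + 1)]
    for i in range(len(hyp) + 1):
        d[i][0] = i
    for j in range(len(ref) + 1):
        d[0][j] = j
    for i in range(1, len(hyp) + 1):
        for j in range(1, len(ref) + 1):
            sub = 0 if hyp[i - 1] == ref[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution / match
    return d[-1][-1] / len(ref)

ref = "the truth lies elsewhere".split()
hyp = "the truth is elsewhere".split()
assert ter(hyp, ref) == 0.25  # one substitution over four reference words
```

For HTER, the "reference" is the post-edited output itself, so the score directly measures how much the editor had to change.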

slide-101
SLIDE 101

Translation Example

WMT 2011 Czech–English Track

Translations scored by BLEU E: The problem is that life of the lines is two to four years.

63

slide-102
SLIDE 102

Translation Example

WMT 2011 Czech–English Track

Translations scored by BLEU
E: The problem is that life of the lines is two to four years.
E′1: The problem is that life is two lines, up to four years.
E′2: The problem is that the durability of lines is two or four years.

63

slide-103
SLIDE 103

Translation Example

WMT 2011 Czech–English Track

Translations scored by BLEU
E: The problem is that life of the lines is two to four years.
E′1: The problem is that life is two lines, up to four years.  (BLEU 0.49)
E′2: The problem is that the durability of lines is two or four years.  (BLEU 0.34)

63

slide-104
SLIDE 104

Translation Example

WMT 2011 Czech–English Track

Translations scored by BLEU, then post-edited and scored by edit rate
E: The problem is that life of the lines is two to four years.
E′1: The problem is that life is two lines, up to four years.  (BLEU 0.49, edit rate 0.29)
E′2: The problem is that the durability of lines is two or four years.  (BLEU 0.34, edit rate 0.14)
The higher-BLEU translation is harder to post-edit.

63

slide-105
SLIDE 105

Overview

Online learning for statistical MT Translation model review Real time model adaptation Simulated post-editing Post-editing software and experiments Kent State live post-editing Automatic metrics for post-editing Meteor automatic metric Evaluation and optimization for post-editing Conclusion and Future Work

64

slide-106
SLIDE 106

Meteor

Banerjee and Lavie (2005), Lavie and Denkowski (2009), Denkowski and Lavie (2011)

Motivation: address shortcomings of BLEU Flexible matching to capture translation variation Measure word choice and order separately, combine with tunable scoring function Measure sentence coherence globally Meteor: alignment-based tunable evaluation metric Align hypothesis E′ to reference E Compute score based on alignment quality

65

slide-107
SLIDE 107

Meteor Alignment

Denkowski and Lavie (2011)

E′: The United States embassy know that dependable source .
E: The American embassy knows this from a reliable source .

66

slide-108
SLIDE 108

Meteor Alignment

Denkowski and Lavie (2011)

E′: The United States embassy know that dependable source .
E: The American embassy knows this from a reliable source .
Match type: exact

66

slide-109
SLIDE 109

Meteor Alignment

Denkowski and Lavie (2011)

E′: The United States embassy know that dependable source .
E: The American embassy knows this from a reliable source .
Match types: exact, stem

66

slide-110
SLIDE 110

Meteor Alignment

Denkowski and Lavie (2011)

E′: The United States embassy know that dependable source .
E: The American embassy knows this from a reliable source .
Match types: exact, stem, synonym

66

slide-111
SLIDE 111

Meteor Alignment

Denkowski and Lavie (2011)

E′: The United States embassy know that dependable source .
E: The American embassy knows this from a reliable source .
Match types: exact, stem, synonym, paraphrase

66

slide-112
SLIDE 112

Meteor Alignment

Denkowski and Lavie (2011)

E′: The United States embassy know that dependable source .
E: The American embassy knows this from a reliable source .
P, R (P and R weighted by match type, content vs function words)

66

slide-113
SLIDE 113

Meteor Alignment

Denkowski and Lavie (2011)

E′: The United States embassy know that dependable source .
E: The American embassy knows this from a reliable source .
P, R; Chunks = 2
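The chunk count on this slide can be sketched as counting maximal runs of matches that are contiguous and in order on both sides. The alignment representation here, a sorted list of (hypothesis index, reference index) pairs, is an assumption for illustration.

```python
def count_chunks(alignment):
    """Count chunks: maximal runs of word matches that are contiguous
    and monotone in both hypothesis and reference.
    `alignment` is a list of (hyp_index, ref_index) pairs sorted by
    hyp_index (an illustrative representation, not Meteor's internal one)."""
    chunks = 0
    prev = None
    for h_i, r_i in alignment:
        # Start a new chunk unless this match directly extends the last one.
        if prev is None or h_i != prev[0] + 1 or r_i != prev[1] + 1:
            chunks += 1
        prev = (h_i, r_i)
    return chunks
```

A fully monotone, contiguous alignment yields a single chunk; every reordering or gap adds one, which is how Meteor measures word-order quality.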

66

slide-114
SLIDE 114

Meteor Scoring

Denkowski and Lavie (2011)

P and R weighted by match type (w1, ..., wn) and content-function word weight (δ)

Fα = (P × R) / (α × P + (1 − α) × R)

Frag = Chunks / AvgMatches

Meteor = (1 − γ × Frag^β) × Fα

Tunable parameters:
• W = w1, ..., wn: weights for flexible match types
• α: balance between precision and recall
• β, γ: weight and severity of fragmentation
• δ: relative contribution of content versus function words
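The scoring function above translates directly into code. This is a minimal sketch: the default parameter values are illustrative placeholders, not the tuned values shipped with any Meteor release, and P and R are assumed to be precomputed with the match-type and content/function weights already applied.

```python
def meteor(P, R, chunks, avg_matches, alpha=0.85, beta=1.4, gamma=0.6):
    """Meteor scoring function: fragmentation-penalized weighted harmonic
    mean of precision and recall. `avg_matches` is the average number of
    matched words between hypothesis and reference.
    Default alpha/beta/gamma are illustrative, not tuned values."""
    if P == 0.0 or R == 0.0:
        return 0.0
    # Weighted harmonic mean of precision and recall.
    f_alpha = (P * R) / (alpha * P + (1 - alpha) * R)
    # Fragmentation: fewer, longer chunks mean better word order.
    frag = chunks / avg_matches
    return (1 - gamma * frag ** beta) * f_alpha
```

Holding P and R fixed, increasing the chunk count lowers the score, so the penalty isolates word-order (reordering) errors from word-choice errors.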

67

slide-115
SLIDE 115

Meteor and Post-Editing

Denkowski (AMTA 2014 Workshop on Interactive and Adaptive MT)

Casting Meteor’s scoring features as post-editing measures:
• Precision → incorrect content (deletion)
• Recall → missing content (insertion)
• Fragmentation → incorrectly ordered content (reordering)
• Match types → partially correct content (minor edits)
• Content vs function → content vs grammaticality edits

Advantage over edit distance: error types are identified separately and combined with a parameterized scoring function

68

slide-116
SLIDE 116

Overview

• Online learning for statistical MT
  • Translation model review
  • Real time model adaptation
  • Simulated post-editing
• Post-editing software and experiments
  • Kent State live post-editing
• Automatic metrics for post-editing
  • Meteor automatic metric
  • Evaluation and optimization for post-editing
• Conclusion and Future Work

69

slide-117
SLIDE 117

Metrics Targeting Post-Editing

Denkowski (AMTA 2014 Workshop on Interactive and Adaptive MT)

Startup
• Deploy system tuned with simulated post-editing and BLEU
• Collect enough data for a post-editing dev set

Retuning (second stage booster rocket)
• Tune Meteor to fit post-editing effort (keystroke, very close to rating)
• Tune system to new Meteor on new dev set
• Continue to adapt to Meteor in production

70

slide-118
SLIDE 118

Second Field Test

Denkowski (AMTA 2014 Workshop on Interactive and Adaptive MT)

Results
• Repeat post-editing experiments with second set of students and TED talks
• Compare BLEU- and Meteor-tuned adaptive systems (both optimized on TED talk data)
• Adapting to Meteor lowers BLEU but yields significant improvement in live post-editing
• Feasible in production: significant data and editing records

              HTER ↓   Rating ↑   Sim PE BLEU ↑
Adapt BLEU     20.1     4.16         27.3
Adapt Meteor   18.9     4.24         26.6

71

slide-119
SLIDE 119

Related Work: Automatic Metrics

Evaluation
• Shared metrics tasks at workshops on statistical machine translation (Callison-Burch et al., 2008, 2009, 2010, ...)
• TER-plus: extended version of TER with flexible matching and tunable weights (Snover et al., 2009)
• Stanford probabilistic edit distance metric with linguistic features (Wang and Manning, 2012)

Optimization
• Tuning to a metric tends to improve quality according to that metric (Cer et al., 2010)
• Effectiveness of tuning to a more sophisticated metric than BLEU (Liu et al., 2011)

72

slide-120
SLIDE 120

Overview

• Online learning for statistical MT
  • Translation model review
  • Real time model adaptation
  • Simulated post-editing
• Post-editing software and experiments
  • Kent State live post-editing
• Automatic metrics for post-editing
  • Meteor automatic metric
  • Evaluation and optimization for post-editing
• Conclusion and Future Work

73

slide-121
SLIDE 121

Conclusion

Real time adaptive MT systems
• Immediately incorporate post-editing data into translation models
• Run an online optimizer that continuously updates feature weights during decoding
• Simulate post-editing to train on normal system building data
• Best results when combining techniques: up to +4.9 BLEU
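The online weight update mentioned above can be sketched with a simple perceptron-style step toward the features of the post-edited reference. This is an illustrative stand-in, assuming a sparse feature-dict representation; the thesis uses a margin-based (MIRA-style) optimizer, which this sketch does not reproduce exactly.

```python
def online_update(weights, hyp_features, ref_features, lr=0.1):
    """One perceptron-style online update: move feature weights toward
    the post-edited reference's features and away from the model's own
    hypothesis. Features are sparse dicts {name: value}; `lr` is a
    hypothetical learning rate, not a tuned value."""
    updated = dict(weights)
    for f in set(hyp_features) | set(ref_features):
        # Gradient of a linear model score difference (reference - hypothesis).
        grad = ref_features.get(f, 0.0) - hyp_features.get(f, 0.0)
        updated[f] = updated.get(f, 0.0) + lr * grad
    return updated
```

Applied after every edited sentence, updates like this let feature weights track the translator's preferences without a full retuning run.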

74

slide-122
SLIDE 122

Conclusion

Live post-editing experiments
• TransCenter: interface that simplifies and records post-editing tasks
• Live experiments show a reduction in human labor when working with adaptive systems

75

slide-123
SLIDE 123

Conclusion

Automatic metrics for post-editing
• Meteor: MT evaluation metric capable of fitting various measures of editing effort
• Live experiments show further gains in translator productivity when systems adapt to Meteor

76

slide-124
SLIDE 124

Future Work

Adaptive MT systems
• Sparse features for more rapid, fine-grained adaptation (Chiang et al., 2009)
• Danger of overfitting; opportunity for more sophisticated optimizers

77

slide-125
SLIDE 125

Future Work

Adaptive MT systems
• Sparse features for more rapid, fine-grained adaptation (Chiang et al., 2009)
• Danger of overfitting; opportunity for more sophisticated optimizers

End-to-end workflows
• Integrate adaptive MT with advanced post-editing interfaces (Green et al., 2014; Schwartz et al., 2014)

77

slide-126
SLIDE 126

Future Work

Adaptive MT systems
• Sparse features for more rapid, fine-grained adaptation (Chiang et al., 2009)
• Danger of overfitting; opportunity for more sophisticated optimizers

End-to-end workflows
• Integrate adaptive MT with advanced post-editing interfaces (Green et al., 2014; Schwartz et al., 2014)

Automatic metrics
• Tune metrics to editing time (bottom-line cost)
• Requires a significant amount of data from a fixed pool of translators

77

slide-127
SLIDE 127

Open Source Software www.cs.cmu.edu/˜mdenkows

Building adaptive MT systems
• cdec Realtime: adaptive MT systems with cdec
• RTA: Realtime adaptive MT framework using Moses

Live post-editing
• TransCenter: post-editing data collection interface
• All Kent State post-editing data

Targeted automatic metrics
• Meteor: tunable MT evaluation metric

78

slide-128
SLIDE 128

Machine Translation for Human Translators

Carnegie Mellon Ph.D. Thesis Michael Denkowski

Language Technologies Institute School of Computer Science Carnegie Mellon University

April 20, 2015 Thesis Committee:

Alon Lavie (chair), Carnegie Mellon University Chris Dyer, Carnegie Mellon University Jaime Carbonell, Carnegie Mellon University Gregory Shreve, Kent State University

79