

SLIDE 1

Machine Translation for Human Translators

Carnegie Mellon Ph.D. Thesis Proposal Michael Denkowski

Language Technologies Institute School of Computer Science Carnegie Mellon University

May 30, 2013 Thesis Committee:

Alon Lavie (chair), Carnegie Mellon University Chris Dyer, Carnegie Mellon University Jaime Carbonell, Carnegie Mellon University Gregory Shreve, Kent State University

SLIDE 2

Motivating Examples

When is translation “good enough”?

SLIDE 3

Machine Translation

SLIDE 4

Human Translation

International organizations, global businesses, community projects:
Require human-quality translation of complex content
Machine translation currently unable to deliver quality and consistency

SLIDE 5

MT with Human Post-Editing

[Pipeline: source document → machine translation → human editing → translated document]

Use machine translation to improve speed of human translation

SLIDE 6

MT with Human Post-Editing

[Pipeline: source document → machine translation → human editing → translated document]

Use machine translation to improve the speed of human translation
Increasing adoption by government organizations and businesses

SLIDE 7

Post-Editing Example

Son comportement ne peut être qualifié que d'irréprochable .

SLIDE 8

Post-Editing Example

Son comportement ne peut être qualifié que d'irréprochable .
His behaviour cannot be described as d'irréprochable .

SLIDE 9

Post-Editing Example

Son comportement ne peut être qualifié que d'irréprochable .
His behaviour cannot be described as d'irréprochable .
Its behavior can only be described as flawless .

SLIDE 10

Post-Editing Example

Son comportement ne peut être qualifié que d'irréprochable .
His behaviour cannot be described as d'irréprochable .
Its behavior can only be described as flawless .
MT task: minimize work for human translators

SLIDE 11

Thesis Statement

While general improvements in MT quality have led to increased interest and productivity in post-editing, there has been little work on designing translation systems specifically for this task. We propose improvements to key components of MT pipelines aimed at significantly reducing the amount of work required from human translators.

SLIDE 12

Thesis Claims

We claim that:

SLIDE 13

Thesis Claims

We claim that:
Human editing demands can be reduced by translation models that immediately learn from feedback.

SLIDE 14

Thesis Claims

We claim that:
Human editing demands can be reduced by translation models that immediately learn from feedback.
Human editing demands can be reduced by identifying and minimizing costly types of translation errors.

SLIDE 15

Thesis Claims

We claim that:
Human editing demands can be reduced by translation models that immediately learn from feedback.
Human editing demands can be reduced by identifying and minimizing costly types of translation errors.
Human editing effort can be better quantified with more accurate statistical measures.

SLIDE 16

Thesis Proposal

To support our claims, we propose to:

SLIDE 17

Thesis Proposal

To support our claims, we propose to:
develop an online translation model that immediately incorporates post-editor feedback

SLIDE 18

Thesis Proposal

To support our claims, we propose to:
develop an online translation model that immediately incorporates post-editor feedback
assemble an extended translation feature set that allows an optimizer to learn when to trust different feedback sources

SLIDE 19

Thesis Proposal

To support our claims, we propose to:
develop an online translation model that immediately incorporates post-editor feedback
assemble an extended translation feature set that allows an optimizer to learn when to trust different feedback sources
design advanced automatic metrics capable of predicting post-editing effort for MT system optimization and evaluation

SLIDE 20

Thesis Proposal

To support our claims, we propose to:
develop an online translation model that immediately incorporates post-editor feedback
assemble an extended translation feature set that allows an optimizer to learn when to trust different feedback sources
design advanced automatic metrics capable of predicting post-editing effort for MT system optimization and evaluation
directly investigate the impact of online learning on post-editing requirements in a real-time scenario with human translators

SLIDE 21

Outline

Introduction
Online translation model adaptation
Metrics for system optimization and evaluation
Post-editing data collection and analysis
Research timeline

SLIDE 22

Online Translation Model Adaptation

Statistical translation models built from bilingual data

SLIDE 23

Online Translation Model Adaptation

Statistical translation models built from bilingual data
Post-editing generates new bilingual data

SLIDE 24

Online Translation Model Adaptation

Statistical translation models built from bilingual data
Post-editing generates new bilingual data
Goal: incorporate post-editing data back into the model in real time
Learn from feedback: avoid repeating the same translation errors

SLIDE 25

Online Translation Model Adaptation

Batch learning (standard MT): model estimation → prediction
Learn translation model from all available data
Translate new data with a static model

SLIDE 26

Online Translation Model Adaptation

Batch learning (standard MT): model estimation → prediction
Learn translation model from all available data
Translate new data with a static model
Online learning (this work): translation as a series of trials:
1. Model makes a prediction (translation hypothesis)
2. Model sees the true answer (edited translation)
3. Model updates its parameters (incremental adaptation)

SLIDE 27

Online Translation Model Adaptation

Batch learning (standard MT): model estimation → prediction
Learn translation model from all available data
Translate new data with a static model
Online learning (this work): translation as a series of trials:
1. Model makes a prediction (translation hypothesis)
2. Model sees the true answer (edited translation)
3. Model updates its parameters (incremental adaptation)
Requirement: sentence-level prediction and model update steps
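To make the trial structure concrete, here is a minimal sketch of the loop in Python. The `model` object and its `translate`/`update` methods are assumptions for illustration, not the proposed system's actual API.

```python
# Minimal sketch of the online post-editing loop (illustrative API).
def post_editing_session(model, source_sentences, get_human_edit):
    """One trial per sentence: predict, observe the edit, adapt."""
    for source in source_sentences:
        hypothesis = model.translate(source)         # 1. prediction
        edited = get_human_edit(source, hypothesis)  # 2. true answer
        model.update(source, edited)                 # 3. incremental adaptation
```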

SLIDE 28

Translation Model Review

SLIDE 29

Machine Translation Formalism

Phrase-based machine translation (Koehn et al., 2003): la vérité → the truth
Match spans of input text against phrases we know how to translate

SLIDE 30

Machine Translation Formalism

Phrase-based machine translation (Koehn et al., 2003): la vérité → the truth
Match spans of input text against phrases we know how to translate
Hierarchical phrase-based translation (Chiang, 2007): X → la vérité X₁ / the truth X₁
Phrases become rules in a synchronous context-free grammar
Generalization where phrases can contain other phrases
Parse source sentence, generate target sentence

SLIDE 31

Hierarchical Phrase-Based Translation Example

Input sentence: Pourtant , la vérité est ailleurs selon moi .
Translation grammar:
X → X₁ est ailleurs X₂ . / X₂ , X₁ lies elsewhere .
X → Pourtant , / Yet
X → la vérité / the truth
X → selon moi / in my view
Glue grammar:
S → S₁ X₂ / S₁ X₂
S → X₁ / X₁

SLIDE 32

Hierarchical Phrase-Based Translation Example

[Derivation diagram]
F: Pourtant , la vérité est ailleurs selon moi .
E: (derivation not yet started)

SLIDE 33

Hierarchical Phrase-Based Translation Example

[Derivation diagram]
F: Pourtant , la vérité est ailleurs selon moi .
E: ... the truth ... (rule applied: X → la vérité / the truth)

SLIDE 34

Hierarchical Phrase-Based Translation Example

[Derivation diagram]
F: Pourtant , la vérité est ailleurs selon moi .
E: Yet ... the truth ... in my view (rules applied: X → Pourtant , / Yet ; X → la vérité / the truth ; X → selon moi / in my view)

SLIDE 35

Hierarchical Phrase-Based Translation Example

[Derivation diagram]
F: Pourtant , la vérité est ailleurs selon moi .
E: Yet in my view , the truth lies elsewhere . (rule applied: X → X₁ est ailleurs X₂ . / X₂ , X₁ lies elsewhere . plus glue rules)

SLIDE 36

Model Parameterization

Ambiguity: many ways to translate the same source phrase
Add feature scores that encode properties of translation:

X → devis / quote           0.5  10  137  ...
X → devis / estimate        0.4  12  284  ...
X → devis / estimate        0.4  13  261  ...
X → devis / specifications  0.2   5  407  ...

Decoder uses feature scores and weights to select the most likely translation derivation.

SLIDE 37

Linear Translation Models

Single feature score for a translation derivation $D$ with rule-local features $h_i$:
$$H_i(D) = \sum_{X \to \bar{f}/\bar{e} \in D} h_i(X \to \bar{f}/\bar{e})$$
Score for a derivation using several features $H_i \in H$ with weight vector $w_i \in W$:
$$S(D) = \sum_{i=1}^{|H|} w_i H_i(D)$$
Decoder selects the translation with the largest product $W \cdot H$
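As a rough illustration, the linear model above reduces to summing rule-local features over the derivation and taking a dot product with the weights. A minimal sketch, assuming each rule carries a dict of feature values (illustrative data structures, not the actual decoder):

```python
# Minimal sketch of linear derivation scoring: S(D) = sum_i w_i * H_i(D),
# where H_i(D) sums the rule-local feature h_i over rules used in D.
def derivation_score(derivation_rules, weights):
    totals = {}                                   # accumulates H_i(D)
    for rule in derivation_rules:                 # rules X -> f/e in D
        for name, h in rule.features.items():     # rule-local h_i values
            totals[name] = totals.get(name, 0.0) + h
    return sum(weights.get(name, 0.0) * H for name, H in totals.items())
```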

SLIDE 38

Linear Translation Models

Single feature score for a translation derivation $D$ with rule-local features $h_i$:
$$H_i(D) = \sum_{X \to \bar{f}/\bar{e} \in D} h_i(X \to \bar{f}/\bar{e})$$
Score for a derivation using several features $H_i \in H$ with weight vector $w_i \in W$:
$$S(D) = \sum_{i=1}^{|H|} w_i H_i(D)$$
Decoder selects the translation with the largest product $W \cdot H$
✓ sentence-level prediction step

SLIDE 39

Learning Translations

SLIDE 40

Translation Model Estimation

Sentence-parallel bilingual text

F: Devis de garage en quatre étapes. Avec l'outil Auda-Taller, l'entreprise Audatex garantit que l'usager obtient un devis en seulement quatre étapes : identifier le véhicule, chercher la pièce de rechange, créer un devis et le générer. La facilité d'utilisation est un élément essentiel de ces systèmes, surtout pour convaincre les professionnels les plus âgés qui, dans une plus ou moins grande mesure, sont rétifs à l'utilisation de nouvelles techniques de gestion.

E: A shop's estimate in four steps. With the AudaTaller tool, Audatex guarantees that the user gets an estimate in only 4 steps: identify the vehicle, look for the spare part, create an estimate and generate an estimate. User friendliness is an essential condition for these systems, especially to convincing older technicians, who, to varying degrees, are usually more reluctant to use new management techniques.

SLIDE 41

Translation Model Estimation

Sentence-parallel bilingual text

F: Devis de garage en quatre étapes. Avec l'outil Auda-Taller, l'entreprise Audatex garantit que l'usager obtient un devis en seulement quatre étapes : identifier le véhicule, chercher la pièce de rechange, créer un devis et le générer. La facilité d'utilisation est un élément essentiel de ces systèmes, surtout pour convaincre les professionnels les plus âgés qui, dans une plus ou moins grande mesure, sont rétifs à l'utilisation de nouvelles techniques de gestion.

E: A shop's estimate in four steps. With the AudaTaller tool, Audatex guarantees that the user gets an estimate in only 4 steps: identify the vehicle, look for the spare part, create an estimate and generate an estimate. User friendliness is an essential condition for these systems, especially to convincing older technicians, who, to varying degrees, are usually more reluctant to use new management techniques.

Each sentence is a training instance

SLIDE 42

Model Estimation: Word Alignment

Brown et al. (1993), Dyer et al. (2013)

[Word alignment diagram]
F: Devis de garage en quatre étapes
E: A shop 's estimate in four steps

SLIDE 43

Model Estimation: Word Alignment

Brown et al. (1993), Dyer et al. (2013)

[Word alignment diagram with alignment links]
F: Devis de garage en quatre étapes
E: A shop 's estimate in four steps

SLIDE 44

Model Estimation: Phrase Extraction

Koehn et al. (2003), Och and Ney (2004), Och et al. (1999)

[Alignment grid]
F: Devis de garage en quatre étapes
E: A shop 's estimate in four steps

SLIDE 45

Model Estimation: Phrase Extraction

Koehn et al. (2003), Och and Ney (2004), Och et al. (1999)

[Alignment grid]
F: Devis de garage en quatre étapes
E: A shop 's estimate in four steps
Extracted phrase pairs: de garage → a shop 's ; en quatre étapes → in four steps

SLIDE 46

Model Estimation: Hierarchical Phrase Extraction

Chiang (2007)
[Alignment grid]
F: Pourtant , la vérité est ailleurs selon moi .
E: Yet in my view , the truth lies elsewhere .

SLIDE 47

Model Estimation: Hierarchical Phrase Extraction

Chiang (2007)
[Alignment grid]
F: Pourtant , la vérité est ailleurs selon moi .
E: Yet in my view , the truth lies elsewhere .
Extracted phrase pair: la vérité est ailleurs selon moi . → in my view , the truth lies elsewhere .

SLIDE 48

Model Estimation: Hierarchical Phrase Extraction

Chiang (2007)
[Alignment grid]
F: Pourtant , la vérité est ailleurs selon moi .
E: Yet in my view , the truth lies elsewhere .
Extracted phrase pair: la vérité est ailleurs selon moi . → in my view , the truth lies elsewhere .
Extracted hierarchical rule: X₁ est ailleurs X₂ . → X₂ , X₁ lies elsewhere .

SLIDE 49

Model Estimation: Hierarchical Phrase Extraction

Chiang (2007)
[Alignment grid]
F: Pourtant , la vérité est ailleurs selon moi .
E: Yet in my view , the truth lies elsewhere .
Extracted phrase pair: la vérité est ailleurs selon moi . → in my view , the truth lies elsewhere .
Extracted hierarchical rule: X₁ est ailleurs X₂ . → X₂ , X₁ lies elsewhere .
✓ sentence-level rule learning

SLIDE 50

Parameterization: Feature Scoring

Add feature functions to rules X → f̄/ē:
[Pipeline diagram: training data (sentences i = 1 .. N) → corpus statistics for X → f̄/ē → scored grammar (global, static) → translate sentence ← input sentence]

SLIDE 51

Parameterization: Feature Scoring

Add feature functions to rules X → f̄/ē:
[Pipeline diagram: training data (sentences i = 1 .. N) → corpus statistics for X → f̄/ē → scored grammar (global, static) → translate sentence ← input sentence]
× corpus-level rule scoring

SLIDE 52

Suffix Array Grammar Extraction

Callison-Burch et al. (2005), Lopez (2008)

[Pipeline diagram: static training data → suffix array → sample S (over i = 1 .. N occurrences) → sample statistics for X → f̄/ē → sentence-level grammar → translate sentence ← input sentence]

SLIDE 53

Scoring via Sampling

Suffix array statistics available in sample $S$ for each source phrase $\bar{f}$:
$c_S(\bar{f}, \bar{e})$: count of instances where $\bar{f}$ is aligned to $\bar{e}$ (co-occurrence count)
$c_S(\bar{f})$: count of instances where $\bar{f}$ is aligned to any target
$|S|$: total number of instances (equal to occurrences of $\bar{f}$ in the training data, up to the sample size)
Used to calculate feature scores for each rule at the time of extraction
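A minimal sketch of these statistics, assuming the sample S is materialized as a list of aligned (source phrase, target phrase) occurrences; the real suffix array machinery never builds such a list explicitly:

```python
# Minimal sketch of suffix-array sample statistics for one source phrase.
def sample_stats(sample, f_bar, e_bar):
    c_fe = sum(1 for src, tgt in sample if src == f_bar and tgt == e_bar)
    c_f = sum(1 for src, tgt in sample if src == f_bar)
    return c_fe, c_f, len(sample)   # c_S(f,e), c_S(f), |S|
```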

SLIDE 54

Scoring via Sampling

Suffix array statistics available in sample $S$ for each source phrase $\bar{f}$:
$c_S(\bar{f}, \bar{e})$: count of instances where $\bar{f}$ is aligned to $\bar{e}$ (co-occurrence count)
$c_S(\bar{f})$: count of instances where $\bar{f}$ is aligned to any target
$|S|$: total number of instances (equal to occurrences of $\bar{f}$ in the training data, up to the sample size)
Used to calculate feature scores for each rule at the time of extraction
× sentence-level grammar extraction, but static training data

SLIDE 55

Online Grammar Extraction

Contribution 1: online grammar extraction for MT (completed work)

SLIDE 56

Online Grammar Extraction

[Pipeline diagram: static training data → suffix array → sample S (over i = 1 .. N occurrences) → sample statistics for X → f̄/ē → sentence-level grammar → translate sentence ← input sentence]

SLIDE 57

Online Grammar Extraction

[Pipeline diagram: static training data → suffix array → sample S → sample statistics for X → f̄/ē → sentence-level grammar → translate sentence ← input sentence; a dynamic lookup table, fed by each post-edited sentence, contributes parallel statistics]

SLIDE 58

Online Grammar Extraction

Maintain a dynamic lookup table for post-edit data
Pair each sample $S$ from the suffix array with an exhaustive lookup $L$ from the lookup table
Parallel statistics available at grammar scoring time:
$c_L(\bar{f}, \bar{e})$: count of instances where $\bar{f}$ is aligned to $\bar{e}$ (co-occurrence count)
$c_L(\bar{f})$: count of instances where $\bar{f}$ is aligned to any target
$|L|$: total number of instances (equal to occurrences of $\bar{f}$ in the post-edit data, no limit)

SLIDE 59

Online Grammar Extraction

Maintaining the lookup table:
Word-align post-edit sentence pairs with the existing model (Dyer et al., 2013)
Pre-calculate statistics for fast lookups
Benefits to translation:
No lookup limit: biases the model toward highly relevant training instances
Parallel statistics allow rule scoring with minimal modifications
Minimal impact on extraction time: still practical for real-time translation

SLIDE 60

Rule Scoring

Suffix array feature set (Lopez, 2008)
Phrase features encode the likelihood of a translation rule given the training data
Features scored with $S$:
$$\mathrm{CoherentP}(e|f) = \frac{c_S(\bar{f}, \bar{e})}{|S|} \qquad \mathrm{Count}(f,e) = c_S(\bar{f}, \bar{e}) \qquad \mathrm{SampleCount}(f) = |S|$$

SLIDE 61

Rule Scoring

Suffix array feature set (Lopez, 2008)
Phrase features encode the likelihood of a translation rule given the training data
Features scored with $S$ and $L$:
$$\mathrm{CoherentP}(e|f) = \frac{c_S(\bar{f}, \bar{e}) + c_L(\bar{f}, \bar{e})}{|S| + |L|} \qquad \mathrm{Count}(f,e) = c_S(\bar{f}, \bar{e}) + c_L(\bar{f}, \bar{e}) \qquad \mathrm{SampleCount}(f) = |S| + |L|$$
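A minimal sketch of the combined scoring, assuming the counts from the static sample S and the dynamic lookup L are already computed (names illustrative):

```python
# Minimal sketch: phrase features over pooled sample + post-edit counts.
def combined_phrase_features(c_S_fe, c_L_fe, size_S, size_L):
    total = size_S + size_L
    return {
        "CoherentP(e|f)": (c_S_fe + c_L_fe) / total if total else 0.0,
        "Count(f,e)": c_S_fe + c_L_fe,
        "SampleCount(f)": total,
    }
```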

SLIDE 62

Rule Scoring

Indicator features identify certain classes of rules
Features scored with $S$:
$$\mathrm{Singleton}(f) = \begin{cases} 1 & c_S(\bar{f}) = 1 \\ 0 & \text{otherwise} \end{cases} \qquad \mathrm{Singleton}(f,e) = \begin{cases} 1 & c_S(\bar{f}, \bar{e}) = 1 \\ 0 & \text{otherwise} \end{cases}$$

SLIDE 63

Rule Scoring

Indicator features identify certain classes of rules
Features scored with $S$ and $L$:
$$\mathrm{Singleton}(f) = \begin{cases} 1 & c_S(\bar{f}) + c_L(\bar{f}) = 1 \\ 0 & \text{otherwise} \end{cases} \qquad \mathrm{Singleton}(f,e) = \begin{cases} 1 & c_S(\bar{f}, \bar{e}) + c_L(\bar{f}, \bar{e}) = 1 \\ 0 & \text{otherwise} \end{cases}$$
$$\mathrm{PostEditSupport}(f,e) = \begin{cases} 1 & c_L(\bar{f}, \bar{e}) > 0 \\ 0 & \text{otherwise} \end{cases}$$
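A corresponding sketch for the indicator features over pooled counts; each feature is a 0/1 value as defined above (implementation illustrative):

```python
# Minimal sketch: indicator features over pooled counts from S and L.
def indicator_features(c_S_f, c_L_f, c_S_fe, c_L_fe):
    return {
        "Singleton(f)": 1 if c_S_f + c_L_f == 1 else 0,
        "Singleton(f,e)": 1 if c_S_fe + c_L_fe == 1 else 0,
        "PostEditSupport(f,e)": 1 if c_L_fe > 0 else 0,
    }
```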

SLIDE 64

Experiments

Compare our online model against a static (suffix array) baseline
Language pairs (both directions):
- English–Spanish: WMT 2011 Europarl and news (2M sentences)
- English–Arabic: NIST 2012 news (5M sentences)
Evaluation sets:
- News: WMT 2010 and 2011, NIST OpenMT 2008 and 2009
- TED talks: 2 test sets of 10 talks each (open domain)
Systems tuned on news data, not re-tuned for the blind out-of-domain test

SLIDE 65

Experiments

Simulated post-editing (Hardt and Elming, 2010):
- Use reference translations as a stand-in for post-editing
- Available for both tuning and evaluation
All incremental adaptation encoded in grammars:
- No modification to the decoder
- Optimize with standard MERT
Additional features:
- 4-gram language model probability and OOV count
- Arity: count of non-terminals X_i in rules
- Glue rule count
- Pass-through count
- Word count

SLIDE 66

Experiments

BLEU scores (averaged over 3 MERT runs):

              Spanish–English              English–Spanish
              WMT10  WMT11  TED1  TED2     WMT10  WMT11  TED1  TED2
Suffix Array   29.2   27.9  32.8  29.6      27.4   29.1  26.1  25.6
Online         30.2   28.8  34.8  31.0      28.5   30.1  27.8  27.0

              Arabic–English               English–Arabic
              MT08   MT09   TED1  TED2     MT08   MT09   TED1  TED2
Suffix Array   38.0   41.6  10.5  10.5      18.9   23.8   7.5   7.9
Online         38.5   42.3  11.3  11.7      19.2   24.1   8.0   8.7

SLIDE 67

Experiments

Percentages of new and supported rules in online grammars:

                    News              TED Talks
                  New  Supported    New  Supported
Spanish–English   15%     19%       14%     18%
English–Spanish   12%     16%        9%     13%
Arabic–English     9%     12%       23%     28%
English–Arabic     5%      8%       17%     20%

Trend: a mix of learning new translation choices and disambiguating existing choices
Grammar size is not significantly increased: no noticeable impact on decoding time

SLIDE 68

Online Grammar Extraction

Contribution 1 summary: online grammar extraction for MT (completed work)
Cast MT for post-editing as an online learning problem
Define an online translation model that incorporates human feedback after each sentence is edited
Significant improvement over the baseline with no modification to decoder or optimizer

SLIDE 69

Extended Feature Sets

Contribution 2: extended feature sets for online grammars (proposed work)

SLIDE 70

Extended Feature Sets

Shortcomings of the current online translation model:
- All post-edit data stored in a single table (translator, domain, document)
- A single weight for features that become more reliable over time (0 versus 3000 sentences of post-edit data)
Proposed solutions:
- Generalize to an arbitrary number of data sources
- Copy the online feature set for data size ranges

SLIDE 71

Multiple Data Source Features

Motivation: text is organized into documents that fall into domains
Lookup table extension: one table for each data source:
- current document
- current domain
- current translator
Feature set extension: sample only matching data sources
Generalized statistics for sampling $J$ sources:
$$\sum_j c_{S_j}(\bar{f}, \bar{e}) \qquad \sum_j c_{S_j}(\bar{f}) \qquad \sum_j |S_j|$$
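A minimal sketch of pooling statistics over the J matching sources; `per_source` maps a source name (document, domain, translator) to an assumed (c(f̄,ē), c(f̄), |S_j|) triple for the current rule:

```python
# Minimal sketch: generalized statistics summed over J data sources.
def pooled_stats(per_source):
    c_fe = sum(c_fe_j for c_fe_j, _, _ in per_source.values())
    c_f = sum(c_f_j for _, c_f_j, _ in per_source.values())
    size = sum(n_j for _, _, n_j in per_source.values())
    return c_fe, c_f, size
```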

SLIDE 72

Multiple Data Source Features

Data source-specific feature sets:
Copy the feature set for each domain (Daumé III, 2007; Clark, 2012)
Each copy estimated from only in-domain data
Include a general feature set (all data)
Multiplies the feature set: general, same-document, same-domain, same-translator
6 × 4 = 24 features (nears the limit of MERT optimization)

SLIDE 73

Multiple Data Source Features

Generalized phrase features:
$$\mathrm{CoherentP}(e|f)_J = \frac{\sum_j c_{S_j}(\bar{f}, \bar{e})}{\sum_j |S_j|} \qquad \mathrm{Count}(f,e)_J = \sum_j c_{S_j}(\bar{f}, \bar{e}) \qquad \mathrm{SampleCount}(f)_J = \sum_j |S_j|$$

SLIDE 74

Multiple Data Source Features

Generalized indicator features:
$$\mathrm{Singleton}(f)_J = \mathbb{1}\Big[\textstyle\sum_j c_{S_j}(\bar{f}) = 1\Big] \qquad \mathrm{Singleton}(f,e)_J = \mathbb{1}\Big[\textstyle\sum_j c_{S_j}(\bar{f}, \bar{e}) = 1\Big]$$
$$\mathrm{DataSupport}(f,e)_J = \mathbb{1}\Big[\textstyle\sum_j c_{S_j}(\bar{f}, \bar{e}) > 0\Big]$$

SLIDE 75

Data Size Features

Motivation: features become more reliable when estimated from larger data
Lookup table extension: count instances added to each data source (document, domain, translator)
Feature set extension: multiple copies of each feature
- Copy the feature set for each data source, binned by data size (0–10, 10+, ...)
- Features only fire when the data size matches the bin:
$$H_j^k(X \to \bar{f}/\bar{e}, i) = \begin{cases} H(X \to \bar{f}/\bar{e}) & j \le i \le k \\ 0 & \text{otherwise} \end{cases}$$
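A minimal sketch of the binning mechanism: one copy of each feature per bin [j, k], firing only when the source currently holds i instances with j ≤ i ≤ k (bin edges here are illustrative, not the proposed ranges):

```python
# Minimal sketch: data-size-binned copies of a feature set.
BINS = [(0, 10), (11, float("inf"))]   # illustrative bin edges

def binned_features(base_features, i):
    out = {}
    for name, value in base_features.items():
        for j, k in BINS:
            out[f"{name}[{j},{k}]"] = value if j <= i <= k else 0.0
    return out
```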

SLIDE 76

Parameter Optimization with Extended Feature Sets

Online feature set with 4 data source sets and 3 time bins: 6 × 4 × 3 = 72 features
Minimum error rate training (MERT) (Och, 2003):
× optimizes feature weights with line search; struggles with large feature sets and correlated features

SLIDE 77

Parameter Optimization with Extended Feature Sets

Online feature set with 4 data source sets and 3 time bins: 6 × 4 × 3 = 72 features
Minimum error rate training (MERT) (Och, 2003):
× optimizes feature weights with line search; struggles with large feature sets and correlated features
Pairwise ranking optimization (PRO) (Hopkins and May, 2011):
✓ optimizes feature weights with binary classification of hypothesis rankings; shown to scale to thousands of features
Margin-infused relaxed algorithm with cutting planes (MIRA) (Eidelman, 2012):
✓ optimizes feature weights with parallelized online learning; shown to be highly stable and scalable

SLIDE 78

Experiments

Repeat the simulated post-editing experiments with the same data sets:
- English–Spanish and English–Arabic
- WMT/NIST news and TED talks
Compare the following configurations to the on-demand and initial online systems:
- Data source-specific extended feature sets
- Data size-specific extended feature sets
Experiment with PRO and MIRA:
- Compare the best to MERT on the initial system to form a baseline
- Use the best to optimize the extended systems

SLIDE 79

Extended Feature Sets

Contribution 2 summary: extended feature sets for online grammars (proposed work)
Extend the feature set to independently weight multiple data sources
Extend the feature set to weight individual sources by data size
Explore new optimizers for the extended online feature sets

SLIDE 80

Outline

Introduction
Online translation model adaptation
Metrics for system optimization and evaluation
Post-editing data collection and analysis
Research timeline

SLIDE 81

System Optimization

Parameter optimization (MERT, PRO, MIRA):
- Choose the set of feature weights W that maximizes an objective function on a tuning set
- Objectives depend on automatic metrics that score model predictions E′ against reference translations E
- Metrics approximate human judgments of translation quality
Assumption: MT output is evaluated on adequacy:
- Good translations should be semantically similar to reference translations
Several adequacy-driven research efforts:
- ACL WMT (Callison-Burch et al., 2011)
- NIST OpenMT (Przybocki et al., 2009)

SLIDE 82

Standard MT Evaluation

Papineni et al. (2002)

Standard BLEU metric based on N-gram precision (P)
Matches spans of hypothesis E′ against reference E
Surface forms only; depends on multiple references to capture translation variation (expensive)
Jointly measures word choice and order
$$\mathrm{BLEU} = \mathrm{BP} \times \exp\left(\sum_{n=1}^{4} \frac{1}{N} \log P_n\right) \qquad \mathrm{BP} = \begin{cases} 1 & |E'| > |E| \\ e^{1 - \frac{|E|}{|E'|}} & |E'| \le |E| \end{cases}$$
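A minimal sketch of single-reference, sentence-level BLEU following the formula above (no smoothing; real evaluations typically pool counts at the corpus level):

```python
import math
from collections import Counter

# Minimal sketch of BLEU with brevity penalty (single reference).
def bleu(hyp, ref, max_n=4):
    if not hyp:
        return 0.0
    precisions = []
    for n in range(1, max_n + 1):
        hyp_ngrams = Counter(tuple(hyp[i:i + n]) for i in range(len(hyp) - n + 1))
        ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
        matched = sum((hyp_ngrams & ref_ngrams).values())   # clipped matches
        precisions.append(matched / max(sum(hyp_ngrams.values()), 1))
    if min(precisions) == 0.0:
        return 0.0
    bp = 1.0 if len(hyp) > len(ref) else math.exp(1.0 - len(ref) / len(hyp))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)
```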

SLIDE 83

Standard MT Evaluation

Shortcomings of the BLEU metric (Banerjee and Lavie, 2005; Callison-Burch et al., 2007):
Evaluating surface forms misses correct translations
N-grams have no notion of global coherence

SLIDE 84

Standard MT Evaluation

Shortcomings of the BLEU metric (Banerjee and Lavie, 2005; Callison-Burch et al., 2007):
Evaluating surface forms misses correct translations
N-grams have no notion of global coherence
E: The large home

SLIDE 85

Standard MT Evaluation

Shortcomings of the BLEU metric (Banerjee and Lavie, 2005; Callison-Burch et al., 2007):
Evaluating surface forms misses correct translations
N-grams have no notion of global coherence
E: The large home
E′₁: A big house (BLEU = 0)
E′₂: I am a dinosaur (BLEU = 0)

SLIDE 86

Post-Editing

Final translations must be human quality (editing required)
Good MT output should require less work for humans to edit
Human-targeted translation edit rate (HTER, Snover et al., 2006):
1. Human translators correct the MT output
2. Automatically calculate the number of edits using TER:
$$\mathrm{TER} = \frac{\#\,\text{edits}}{|E|}$$
Edits: insertion, deletion, substitution, block shift
"Better" translations are not always easier to post-edit
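For intuition, a minimal sketch of the TER idea via plain edit distance; real TER also allows block shifts and uses a dedicated aligner, so this simplified version overestimates edits for reordered output:

```python
# Minimal sketch: edit distance (ins/del/sub only) normalized by |E|.
def simple_ter(hyp, ref):
    m, n = len(hyp), len(ref)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0 if hyp[i - 1] == ref[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + sub)   # substitution / match
    return d[m][n] / max(n, 1)                     # edits / |E|
```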

SLIDE 87

Translation Example

WMT 2011 Czech–English Track

Translations judged by humans
E: He was supposed to pay half a million to Luboš G.

SLIDE 88

Translation Example

WMT 2011 Czech–English Track

Translations judged by humans
E: He was supposed to pay half a million to Luboš G.
E′₁: He had for Luboši G. to pay half a million crowns.
E′₂: He had to pay luboši G. half a million kronor.

SLIDE 89

Translation Example

WMT 2011 Czech–English Track

Translations judged by humans
E: He was supposed to pay half a million to Luboš G.
✓ E′₁: He had for Luboši G. to pay half a million crowns. (preferred by human judges)
E′₂: He had to pay luboši G. half a million kronor.

SLIDE 90

Translation Example

WMT 2011 Czech–English Track

Translations judged by humans
E: He was supposed to pay half a million to Luboš G.
✓ E′₁: He had for Luboši G. to pay half a million crowns. (post-edited: He had to pay Luboš G. half a million crowns.) HTER = 0.27
E′₂: He had to pay luboši G. half a million kronor. (post-edited: He had to pay Luboš G. half a million kronor.) HTER = 0.09

SLIDE 91

Translation Example

WMT 2011 Czech–English Track

Translations scored by BLEU
E: The problem is that life of the lines is two to four years.

SLIDE 92

Translation Example

WMT 2011 Czech–English Track

Translations scored by BLEU
E: The problem is that life of the lines is two to four years.
E′₁: The problem is that life is two lines, up to four years.
E′₂: The problem is that the durability of lines is two or four years.

SLIDE 93

Translation Example

WMT 2011 Czech–English Track

Translations scored by BLEU
E: The problem is that life of the lines is two to four years.
✓ E′₁: The problem is that life is two lines, up to four years. BLEU = 0.49
E′₂: The problem is that the durability of lines is two or four years. BLEU = 0.34

SLIDE 94

Translation Example

WMT 2011 Czech–English Track

Translations scored by BLEU
E: The problem is that life of the lines is two to four years.
✓ E′₁: The problem is that life is two lines, up to four years. BLEU = 0.49, HTER = 0.29
E′₂: The problem is that the durability of lines is two or four years. BLEU = 0.34, HTER = 0.14
(The translation preferred by BLEU requires more post-editing.)

SLIDE 95

Improved Metrics for MT in Post-Editing Tasks

Contribution 3: improved metrics for MT in post-editing tasks (partially completed work)

SLIDE 96

Preliminary Post-Editing Experiment

Denkowski and Lavie (2012)

90 sentences from Google Docs documentation
Translated from English to Spanish by two systems:
- Microsoft Translator
- Moses system (Europarl)
180 MT outputs total
Post-edited by translators at the Kent State Institute for Applied Linguistics
Translators never saw the reference translations

SLIDE 97

Preliminary Post-Editing Experiment

Denkowski and Lavie (2012)

Data collected from translators:
- Post-edited translations
- Expert post-editing ratings:
  1: No editing required
  2: Minor editing, meaning preserved
  3: Major editing, meaning lost
  4: Re-translate
From parallel data: independent reference translations

SLIDE 98

Preliminary Post-Editing Experiment

Denkowski and Lavie (2012)

Task 1: predict post-editing utility with automatic metric scores
Goal: metrics used to select the best system configuration should be consistent with human preference
Average expert rating: 1.69
Average HTER: 12.4

SLIDE 99

Preliminary Post-Editing Experiment

Denkowski and Lavie (2012)

Task 1: predict post-editing utility with automatic metric scores
Goal: metrics used to select the best system configuration should be consistent with human preference
Average expert rating: 1.69
Average HTER: 12.4

Corpus-level BLEU:
MT vs. post-edited     79.2
MT vs. reference       31.7
Post-edited vs. ref    34.1

SLIDE 100

Preliminary Post-Editing Experiment

Denkowski and Lavie (2012)

Task 2: discriminate between usable and non-usable translations
Goal: metrics rank hypotheses during optimization and should prefer translations suitable for post-editing
Divide translations into two groups:
- Suitable for post-editing (ratings 1–2)
- Not suitable for post-editing (ratings 3–4)
Examine the metric score distribution of each group
Metrics should be able to separate the two classes of translations

SLIDE 101

Usability Experiment Results

Denkowski and Lavie (2012)

[Histograms: number of sentences by sentence-level BLEU score and by HTER, for usable vs. non-usable translations]
Impossible to separate translations with BLEU

SLIDE 102

Usability Experiment Results

Denkowski and Lavie (2012)

Are the results skewed by the small size of the data (180 sentences)?
Repeat the experiment with publicly available WMT12 quality estimation task data (Callison-Burch et al., 2012):
- 1832 English-to-Spanish MT outputs
- HTER scores and 5-point multiple-expert ratings

SLIDE 103

Usability Experiment Results (WMT 2012)

Denkowski and Lavie (2012)

[Histograms: number of sentences by sentence-level BLEU score and by HTER, for usable vs. non-usable translations, WMT12 data]
Distribution overlap more apparent with larger data

SLIDE 104

Meteor

Banerjee and Lavie (2005), Lavie and Denkowski (2009), Denkowski and Lavie (2011)

Motivation: address the shortcomings of BLEU:
- Flexible matching to capture translation variation
- Measure word choice and order separately, combine with a tunable scoring function
- Measure sentence coherence globally
Meteor: alignment-based tunable evaluation metric:
- Align hypothesis E′ to reference E
- Compute a score based on alignment quality

SLIDE 105

Meteor Alignment

Denkowski and Lavie (2011)

E′: The United States embassy know that dependable source .
E: The American embassy knows this from a reliable source .

SLIDE 106

Meteor Alignment

Denkowski and Lavie (2011)

E′: The United States embassy know that dependable source .
E: The American embassy knows this from a reliable source .
(exact matches)

SLIDE 107

Meteor Alignment

Denkowski and Lavie (2011)

E′: The United States embassy know that dependable source .
E: The American embassy knows this from a reliable source .
(exact, stem matches)

SLIDE 108

Meteor Alignment

Denkowski and Lavie (2011)

E′: The United States embassy know that dependable source .
E: The American embassy knows this from a reliable source .
(exact, stem, synonym matches)

SLIDE 109

Meteor Alignment

Denkowski and Lavie (2011)

E′: The United States embassy know that dependable source .
E: The American embassy knows this from a reliable source .
(exact, stem, synonym, paraphrase matches)

SLIDE 110

Meteor Alignment

Denkowski and Lavie (2011)

E′: The United States embassy know that dependable source .
E: The American embassy knows this from a reliable source .
P, R (P and R weighted by match type, content vs. function words)

SLIDE 111

Meteor Alignment

Denkowski and Lavie (2011)

E′: The United States embassy know that dependable source .
E: The American embassy knows this from a reliable source .
P, R; Chunks = 2

SLIDE 112

Meteor Scoring

Denkowski and Lavie (2011)

P and R weighted by match type ($w_1, \ldots, w_n$) and content/function word weight ($\delta$):
$$F_\alpha = \frac{P \times R}{\alpha \times P + (1 - \alpha) \times R} \qquad \mathrm{Frag} = \frac{\mathrm{Chunks}}{\mathrm{AvgMatches}} \qquad \mathrm{Meteor} = \left(1 - \gamma \times \mathrm{Frag}^\beta\right) \times F_\alpha$$
Tunable parameters:
- $W = w_1, \ldots, w_n$: weights for flexible match types
- $\alpha$: balance between precision and recall
- $\beta$, $\gamma$: weight and severity of fragmentation
- $\delta$: relative contribution of content versus function words
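A minimal sketch of this scoring function given alignment statistics; the parameter defaults below are placeholders for illustration, not the tuned values from the cited papers:

```python
# Minimal sketch of Meteor scoring from alignment statistics.
def meteor_score(p, r, chunks, avg_matches,
                 alpha=0.85, beta=1.5, gamma=0.45):
    if p == 0.0 or r == 0.0:
        return 0.0
    f_alpha = (p * r) / (alpha * p + (1.0 - alpha) * r)
    frag = chunks / avg_matches          # fragmentation of the alignment
    return (1.0 - gamma * frag ** beta) * f_alpha
```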

SLIDE 113

Meteor and Post-Editing

Casting Meteor's scoring features as post-editing measures:
- Precision: incorrect content (deletion)
- Recall: missing content (insertion)
- Fragmentation: incorrectly ordered content (reordering)
- Match types: partially correct content (minor edits)
- Content vs. function words: content vs. grammaticality edits
Advantage over edit distance: error types are identified separately and combined with a parameterized scoring function

SLIDE 114

MT Evaluation Experiments

Denkowski and Lavie (2010a, 2010b)

Task: use Meteor to predict post-editing effort
HTER data from the DARPA GALE project (Olive et al., 2011)
Cross-validation with data from Phase 2 and Phase 3
Training: maximize correlation between Meteor and HTER scores

SLIDE 115

MT Evaluation Experiments

Denkowski and Lavie (2010a, 2010b)

Task: use Meteor to predict post-editing effort
HTER data from the DARPA GALE project (Olive et al., 2011)
Cross-validation with data from Phase 2 and Phase 3
Training: maximize correlation between Meteor and HTER scores

Metric   Tuning Data   P2 r    P3 r
BLEU     N/A           0.545   0.489
TER      N/A           0.592   0.515
Meteor   P2            0.642   0.594
Meteor   P3            0.625   0.612

SLIDE 116

MT System Optimization Experiments

Denkowski and Lavie (2011)

Task: use Meteor to tune MT systems (initial experiments limited to adequacy)
Standard phrase-based SMT system
Data sets:
- WMT French–English: 12 million bilingual sentences
- NIST Urdu–English: 87 thousand bilingual sentences (less than 1% of WMT scale)

SLIDE 117

MT System Optimization Experiments

Denkowski and Lavie (2011)

Task: use Meteor to tune MT systems (initial experiments limited to adequacy)
Standard phrase-based SMT system
Data sets:
- WMT French–English: 12 million bilingual sentences
- NIST Urdu–English: 87 thousand bilingual sentences (less than 1% of WMT scale)

French–English:
Tuning Metric   BLEU    Meteor
BLEU            28.27   54.07
Meteor          28.14   54.11

Urdu–English:
Tuning Metric   BLEU    Meteor
BLEU            23.67   50.45
Meteor          24.89   51.29

SLIDE 118

Proposed Meteor Experiments

Tune on more reliable post-edit data (more to come)
Use a more precise post-edit version of Meteor to tune online post-editing systems with PRO and MIRA
Evaluate Meteor tuning and evaluation for all language directions and domains
Goal: a combination of post-editing-targeted translation model and optimization objective to significantly improve translation usability for human translators

SLIDE 119

Improved Metrics for MT in Post-Editing Tasks

Contribution 3 summary: improved metrics for MT in post-editing tasks (partially completed work)
Demonstrate differences between adequacy and post-editing translation/evaluation tasks
Design a version of the Meteor metric with improved capacity to predict editing effort
Combine the online translation model with a Meteor-driven optimizer for a dedicated post-editing MT pipeline

SLIDE 120

Outline

Introduction
Online translation model adaptation
Metrics for system optimization and evaluation
Post-editing data collection and analysis
Research timeline

SLIDE 121

State of Post-Editing Data

Two types of judgments:
- HTER: approximate editing from initial and final translations
- Expert assessments: evaluate the amount of editing from the same data
[Diagram: MT output → post-editing process → corrected output]

SLIDE 122

State of Post-Editing Data

Two types of judgments:
- HTER: approximate editing from initial and final translations
- Expert assessments: evaluate the amount of editing from the same data
[Diagram: MT output → post-editing process → corrected output]
Both make lossy approximations: intermediate edits are lost
HTER relies on the coarse TER aligner

SLIDE 123

State of Post-Editing Data

HTER inherits the limitations of the TER aligner:
- All non-monotonic, non-surface-form word matches require edits
- All edits are treated equally:
  insertion, deletion, substitution, block reordering
  morphological variation (tense, agreement)
  content, grammaticality, untranslated words
  even punctuation
Goal: collect more accurate post-edit data, eliminating reliance on coarse measures

SLIDE 124

Improved Post-Editing Data Collection

Contribution 4: improved post-editing data collection with TransCenter (partially completed work)

SLIDE 125

TransCenter Data Collection Framework

Denkowski and Lavie (2012)

Objectives:
- Provide a controlled environment for translation post-editing
- Accurately record all editing activity
- Simplicity: avoid usage barriers and experimental confounds
Design:
- Web-based translation editing interface
- All user interaction reported to the server
- Automatic edit information extraction

SLIDE 126

TransCenter Data Collection Framework

Denkowski and Lavie (2012)

[Diagram: TransCenter server exchanges source text & MT output, translated text, and detailed edit reports with a web-based translation editor used by translators]
Translate from any computer with an Internet connection
Full support for any language pair

SLIDE 127

TransCenter Data Collection Framework

Denkowski and Lavie (2012)

Browser-based editor interface

SLIDE 128

TransCenter Data Collection Framework

Denkowski and Lavie (2012)

Track intermediate edits

SLIDE 129

TransCenter Data Collection Framework

Denkowski and Lavie (2012)

Generate and export edit reports

SLIDE 130

Proposed Human Post-Editing Experiments

Extend TransCenter to support full integration of real-time MT with feedback
Run real-time post-editing experiments with translators from Kent State University:
- Use TransCenter for both data collection and empirical evaluation
- Collect post-editing data for both the baseline (on-demand) system and the extended online system
- Select the best system from simulated experiments for human editing evaluation
- Fully cover at least one language direction for both systems
- Implement average pause ratio (Lacruz et al., 2012) in addition to existing measures

SLIDE 131

Improved Post-Editing Data Collection

Contribution 4 summary: improved post-editing data collection with TransCenter (partially completed work)
Full integration of the online translation system with a web-based translation editor
Collect highly accurate post-editing data for system evaluation and metric development
Extensive analysis of different measures of human editing effort

SLIDE 132

Outline

Introduction
Online translation model adaptation
Metrics for system optimization and evaluation
Post-editing data collection and analysis
Research timeline

SLIDE 133

Thesis Research Progress Review

Translation Modeling
  Online grammar extraction                                   Completed
  Multiple context feature sets                               Proposed
  Data size feature sets                                      Proposed
Automatic Metrics
  Extended version of Meteor metric                           Completed
  Initial Meteor optimization and evaluation for adequacy     Completed
  Meteor evaluation for post-editing                          Proposed
  Meteor optimization for post-editing                        Proposed
Data Collection
  Basic version of TransCenter                                Completed
  Initial analysis of adequacy versus post-editing tasks      Completed
  Initial post-editing data analysis                          Completed
  Extended version of TransCenter with live MT and feedback   Proposed
  Real-time post-editing experiments with TransCenter         Proposed
  Full post-editing data evaluation and analysis              Proposed

SLIDE 134

Thesis Research Timeline

Jun–Jul 2013: Multiple context feature sets for online grammar learning; data size feature sets for online grammar learning
Aug–Sep 2013: Extended version of TransCenter with live MT and feedback; start real-time post-editing experiments
Oct–Nov 2013: Continue real-time post-editing experiments; Meteor evaluation for post-editing as data arrives
Dec 2013: Meteor optimization for post-editing
Jan 2014: Post-editing data evaluation and analysis as data is available
Feb–Mar 2014: Additional translation experiments and analysis for as many language scenarios as resources allow
Apr–May 2014: Writing thesis document
May 2014: Thesis defense

SLIDE 135

Machine Translation for Human Translators

Carnegie Mellon Ph.D. Thesis Proposal Michael Denkowski

Language Technologies Institute School of Computer Science Carnegie Mellon University

May 30, 2013 Thesis Committee:

Alon Lavie (chair), Carnegie Mellon University Chris Dyer, Carnegie Mellon University Jaime Carbonell, Carnegie Mellon University Gregory Shreve, Kent State University
