Fast and Adaptive Online Training of Feature-Rich Translation Models
SLIDE 1

Fast and Adaptive Online Training of Feature-Rich Translation Models

Spence Green, Sida Wang, Daniel Cer, Christopher D. Manning
Stanford University
ACL 2013

SLIDE 2

Feature-Rich Research vs. Industry/Evaluations

Research (feature-rich training): Liang et al. 2006; Tillmann and Zhang 2006; Arun and Koehn 2007; Ittycheriah and Roukos 2007; Watanabe et al. 2007; Chiang et al. 2008; Chiang et al. 2009; Haddow et al. 2011; Hopkins and May 2011; Xiang and Ittycheriah 2011; Cherry and Foster 2012; Chiang 2012; Gimpel 2012; Simianer et al. 2012; Watanabe 2012

Industry/evaluations: n-best/lattice MERT; MIRA (ISI)


SLIDE 4

Feature-rich Shared Task Submissions

       Task         # Feature-rich
2012   WMT, IWSLT   1
2013   WMT          2?
2013   IWSLT        TBD

SLIDE 5

Speculation: Entrenchment of MERT

  • Feature-rich on small tuning sets?
  • Implementation complexity
  • Open source availability

[Image: top-selling phone of 2003]

SLIDE 7

Motivation: Why Feature-Rich MT?

  • Make MT more like other machine learning settings
  • Features for specific errors
  • Domain adaptation

SLIDE 8

Motivation: Why Online MT Tuning?

  • Search: decode more often, find better solutions (see Liang and Klein 2009)
  • Computer-aided translation: incremental updating

SLIDE 9

Benefits of Our Method

  • Fast and scalable
  • Adapts to a dense/sparse feature mix
  • Not complicated

SLIDE 10

Online Algorithm Overview

  • Updating with an adaptive learning rate
  • Automatic feature selection via L1 regularization
  • Loss function: pairwise ranking

SLIDE 11

Notation

  • $t$: time/update step
  • $w_t \in \mathbb{R}^n$: weight vector
  • $\eta$: learning rate
  • $\ell_t(w)$: loss of the $t$'th example
  • $z_{t-1} \in \partial \ell_t(w_{t-1})$: subgradient set (subdifferential)
  • $z_{t-1} = \nabla \ell_t(w_{t-1})$ for differentiable loss functions
  • $r(w)$: regularization function

SLIDE 18

Warm-up: Stochastic Gradient Descent

Per-instance update:

$w_t = w_{t-1} - \eta z_{t-1}$

Issue #1: the learning rate schedule.

  • $\eta / t$?
  • $\eta / \sqrt{t}$?
  • $\eta / (1 + \gamma t)$?

Yuck.
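To make the schedule problem concrete, here is a minimal Python sketch (ours, not the authors' Phrasal code) of the per-instance update under the three schedules above; `grad_loss` is a hypothetical subgradient oracle:

```python
import numpy as np

# Hand-tuned schedules from the slide; eta0 (and gamma) must be picked
# by trial and error, which is exactly the annoyance being pointed at.
SCHEDULES = {
    "1/t":       lambda eta0, t: eta0 / t,
    "1/sqrt(t)": lambda eta0, t: eta0 / np.sqrt(t),
    "1/(1+gt)":  lambda eta0, t, gamma=0.1: eta0 / (1.0 + gamma * t),
}

def sgd(grad_loss, w0, examples, eta0=0.1, schedule="1/sqrt(t)"):
    """Per-instance SGD: w_t = w_{t-1} - eta_t * z_{t-1}."""
    w = w0.copy()
    for t, example in enumerate(examples, start=1):
        z = grad_loss(w, example)              # subgradient at w_{t-1}
        w -= SCHEDULES[schedule](eta0, t) * z  # scheduled step size
    return w
```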

SLIDE 22

Warm-up: Stochastic Gradient Descent

SGD update:

$w_t = w_{t-1} - \eta z_{t-1}$

Issue #2: the same step size for every coordinate. Intuitively, we might want:

  • Frequent feature: small steps, e.g. $\eta / t$
  • Rare feature: large steps, e.g. $\eta / \sqrt{t}$

SLIDE 24

SGD: Learning Rate Adaptation

SGD update:

$w_t = w_{t-1} - \eta z_{t-1}$

Scale the learning rate with $A^{-1} \in \mathbb{R}^{n \times n}$:

$w_t = w_{t-1} - \eta A^{-1} z_{t-1}$

Choices:

  • $A^{-1} = I$ (SGD)
  • $A^{-1} = H^{-1}$ (batch: Newton step)

SLIDE 27

AdaGrad

[Duchi et al. 2011]

Update:

$w_t = w_{t-1} - \eta A^{-1} z_{t-1}$

Set $A^{-1} = G_t^{-1/2}$, where

$G_t = G_{t-1} + z_{t-1} z_{t-1}^\top$

SLIDE 28

AdaGrad: Approximations and Intuition

For high-dimensional $w_t$, use a diagonal $G_t$:

$w_t = w_{t-1} - \eta\, G_t^{-1/2} z_{t-1}$

Intuition:

  • $1/\sqrt{t}$ schedule on a constant gradient
  • Small steps for frequent features
  • Big steps for rare features

[Duchi et al. 2011]
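A minimal numpy sketch of this diagonal update; the `eps` guard against division by zero is our addition, not part of the slide:

```python
import numpy as np

def adagrad_step(w, z, G_diag, eta=0.1, eps=1e-8):
    """One diagonal-AdaGrad step: w_t = w_{t-1} - eta * G_t^{-1/2} z_{t-1}.

    G_diag holds the running sum of squared gradients per coordinate,
    so frequently-firing features take small steps and rare ones take
    large steps, with no hand-set schedule.
    """
    G_diag += z * z                           # diagonal of G_t
    w -= eta * z / (np.sqrt(G_diag) + eps)    # eps avoids divide-by-zero
    return w, G_diag
```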

SLIDE 30

AdaGrad vs. SGD: 2D Illustration

[Figure: 2D optimization paths comparing SGD and AdaGrad]

SLIDE 31

Feature Selection

  • Traditional approach: frequency cutoffs. Unattractive for large tuning sets (e.g. bitext)
  • More principled: L1 regularization, $r(w) = \|w\|_1$

SLIDE 33

Feature Selection: FOBOS

Two-step update:

$w_{t-\frac{1}{2}} = w_{t-1} - \eta z_{t-1}$  (1)

$w_t = \arg\min_w \; \underbrace{\tfrac{1}{2} \|w - w_{t-\frac{1}{2}}\|_2^2}_{\text{proximal term}} \; + \; \underbrace{\lambda \cdot r(w)}_{\text{regularization}}$  (2)

[Duchi and Singer 2009]

Extension: AdaGrad update in step (1)

SLIDE 35

Feature Selection: FOBOS

For L1, FOBOS becomes soft thresholding:

$w_t = \operatorname{sign}(w_{t-\frac{1}{2}}) \left[\, |w_{t-\frac{1}{2}}| - \lambda \,\right]_+$

Squared-L2 also has a simple closed form.
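The closed form is a few lines; a sketch that follows the slide's threshold $\lambda$ (implementations often fold the step size $\eta$ into the threshold):

```python
import numpy as np

def fobos_l1_step(w, z, eta=0.1, lam=1e-4):
    """FOBOS with r(w) = ||w||_1: gradient step, then soft thresholding.

    Coordinates whose magnitude falls below lam are zeroed, which is
    what performs the automatic feature selection.
    """
    w_half = w - eta * z                                            # step (1)
    return np.sign(w_half) * np.maximum(np.abs(w_half) - lam, 0.0)  # step (2)
```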

SLIDE 37

Feature Selection: Lazy Regularization

  • Lazy updating: only update active coordinates
  • Big speedup in the MT setting
  • Easy with FOBOS: let $t'_j$ be the last update of dimension $j$, and use $\lambda (t - t'_j)$
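A sketch of the bookkeeping, assuming an array `last_update` that records $t'_j$ per coordinate:

```python
import numpy as np

def catch_up(w, j, t, last_update, lam=1e-4):
    """Bring coordinate j current before touching it.

    While j was inactive its weight never moved, so k deferred
    soft-threshold steps of size lam collapse exactly into one step
    of size lam * k, with k = t - t'_j.
    """
    skipped = t - last_update[j]
    if skipped > 0:
        w[j] = np.sign(w[j]) * max(abs(w[j]) - lam * skipped, 0.0)
    last_update[j] = t
```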

SLIDE 39

AdaGrad+FOBOS: Full Algorithm

  • 1. Additive update: $G_t$
  • 2. Additive update: $w_{t-\frac{1}{2}}$
  • 3. Closed-form regularization: $w_t$

Not complicated. Very fast.
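Putting the three steps and the lazy catch-up together, as a self-contained sketch with hypothetical names (the real implementation ships in Phrasal):

```python
import numpy as np

def adagrad_fobos_step(w, z, G_diag, t, last_update,
                       eta=0.1, lam=1e-4, eps=1e-8):
    """One sparse AdaGrad+FOBOS update, touching only active coordinates."""
    for j in np.flatnonzero(z):
        # Deferred L1 shrinkage since this coordinate's last touch
        skipped = t - last_update[j]
        if skipped > 0:
            w[j] = np.sign(w[j]) * max(abs(w[j]) - lam * skipped, 0.0)
        last_update[j] = t
        # 1. Additive update: G_t
        G_diag[j] += z[j] ** 2
        # 2. Additive update: w_{t-1/2}, with the adaptive step size
        w_half = w[j] - eta * z[j] / (np.sqrt(G_diag[j]) + eps)
        # 3. Closed-form regularization: soft threshold gives w_t
        w[j] = np.sign(w_half) * max(abs(w_half) - lam, 0.0)
    return w
```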

SLIDE 44

Recap: Pairwise Ranking

For derivation $d$, feature map $\phi(d)$, references $e_{1:k}$:

  • Metric: $B(d, e_{1:k})$ (e.g. BLEU+1)
  • Model score: $M(d) = w \cdot \phi(d)$

Pairwise consistency:

$M(d^+) > M(d^-) \iff B(d^+, e_{1:k}) > B(d^-, e_{1:k})$

[Hopkins and May 2011]

SLIDE 46

Loss Function: Pairwise Ranking

$M(d^+) > M(d^-) \iff w \cdot (\phi(d^+) - \phi(d^-)) > 0$

Loss formulation:

  • Difference vector: $x = \phi(d^+) - \phi(d^-)$
  • Find $w$ so that $w \cdot x > 0$
  • Binary classification problem between $x$ and $-x$
  • Logistic loss: convex, differentiable

[Hopkins and May 2011]

SLIDE 49

Parallelization

Online algorithms are inherently sequential. Out-of-order updating:

$w_7 = w_6 - \eta z_4$
$w_8 = w_7 - \eta z_6$
$w_9 = w_8 - \eta z_5$

Low-latency regret bound: $O(\sqrt{T})$ [Langford et al. 2009]
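A toy simulation of out-of-order updating, deliberately tolerating stale reads of `w`; this is not the actual Phrasal threading code:

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor, as_completed

def async_sgd(grad_loss, w, examples, eta=0.1, workers=16):
    """Workers read whatever w is current when they run, and updates
    land in completion order, so w_8 may be built from z_6 before z_5
    arrives. The bounded-delay analysis tolerates this staleness."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(grad_loss, w, ex) for ex in examples]
        for f in as_completed(futures):   # completion order, not submission
            w -= eta * f.result()
    return w
```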

SLIDE 52

Translation Quality Experiments

  • Arabic-English (Ar–En) and Chinese-English (Zh–En)
  • Newswire and mixed-genre experiments
  • BOLT bitexts: data up to 2012

         Bilingual              Monolingual
         Sentences    Tokens    Tokens
Ar–En    6.6M         375M      990M
Zh–En    9.3M         538M

SLIDE 53

MT System

  • Phrase-based MT: Phrasal [Cer et al. 2010]
  • Dense baseline: MERT with the Cer et al. 2008 line search
  • Accumulates n-best lists, random starting points, etc.

SLIDE 54

Feature-Rich Baseline: PRO

  • Pairwise Ranking Optimization (PRO): batch log-loss minimization [Hopkins and May 2011]
  • Phrasal implementation: L-BFGS with L2 regularization
  • Sanity check: Moses PRO and kb-MIRA (batch) implementations

SLIDE 56

Dense Features

 8  Hierarchical lex. reordering
 5  Moses phrase table features
 1  Rule bitext count
 1  Unique rule indicator
 1  Word penalty
 1  Linear distortion
 1  LM
 1  Unknown word

19  total

SLIDE 59

Sparse Feature Templates

  • Discriminative phrase table (PT). Rule indicator: (source phrase) ⇒ space program
  • Discriminative alignments (AL). Source word deletion; word alignments: (source word) ⇒ space
  • Discriminative lex. reordering (LO). Phrase orientation: swap((source phrase) ⇒ space)
SLIDE 62

Evaluation: NIST OpenMT

  • Small tuning set: MT06
  • "Large" tuning set: MT0568 (≈4,200 segments)
  • BLEU-4 uncased, four references
  • Paper: mixed-genre (bitext) experiments

SLIDE 64

Results: Small Tuning Set (Dense)

             Ar–En               Zh–En
             Tune    Test Avg.   Tune    Test Avg.
MERT         45.08   50.51       33.73   34.49
This paper   43.16   50.11       32.20   35.25

SLIDE 65

Results: Add More Features

                   Ar–En               Zh–En
                   Tune    Test Avg.   Tune    Test Avg.
MERT (Dense)       45.08   50.51       33.73   34.49
This paper +PT     50.61   50.52       34.92   35.12
This paper +All    60.85   50.97       39.43   35.31

(MT06 tuning set)

SLIDE 67

Results: Add More Data

                           Ar–En            Zh–En
                           Test Avg.        Test Avg.
MERT (mt06)                50.51            34.49
MERT (mt0568)              50.74            34.55
This paper +All (mt06)     50.97            35.31
This paper +All (mt0568)   52.34 (+1.60)    36.61 (+2.06)

PRO +All is worse than MERT tuned on mt0568.

SLIDE 71

Analysis: Zh–En MT06 Tuning

(16 threads)        Epochs   Min/epoch
MERT (Dense)        22       180
PRO +PT             25       35
kb-MIRA* +PT        26       25
This paper +PT      10       10
PRO +All            13       100
This paper +All     5        15

MERT tuning on mt0568 takes about 5 days.

SLIDE 75

Analysis: Runtime

  • Online regret bounds depend on the number of updates
  • Large datasets: more updates per epoch, fewer epochs to converge
  • Lazy updating helps: $w_t$ has ≈100k features, while $z_{t-1}$ has ≈500

SLIDE 77

Analysis: Reordering

  • Arabic matrix clauses are often verb-initial
  • Manually selected 208 verb-initial segments (MT09)
  • 32 translations differed between MERT (Dense) and +All

SLIDE 80

Analysis: Reordering

+All correct            18   56.3%
MERT (Dense) correct     4   12.5%
Both wrong              10   31.3%
Total                   32

ref:   the newspaper and television reported
MERT:  she said the newspaper and television
+All:  television and newspaper said

SLIDE 82

Analysis: Domain Adaptation

(source word) ⇒ program, programme

             # bitext–5k   # MT0568
programme    185
program      19            449

+PT rules:

             bitext–5k     MT0568
programme    353           79
program      9             31

SLIDE 85

Caveats and Next Steps

  • Single-reference setting: BLEU+1 is unreliable
  • Lexicalized features cause overfitting

Current work:

  • Bitext tuning
  • Different loss functions

SLIDE 87

Conclusion

  • Fast, adaptive, online tuning for MT
  • Easy to implement
  • Works as well as MERT for dense features
  • Sane feature engineering

SLIDE 91

Fast and Adaptive Online Training of Feature-Rich Translation Models

Spence Green, Sida Wang, Daniel Cer, Christopher D. Manning
Stanford University

Try the code in Phrasal: nlp.stanford.edu/software/phrasal/

SLIDE 92

En–De Learning Curve

[Figure: BLEU on newstest2008–2011 (y axis: 7.5–17.5) vs. tuning epoch (x axis: 1–10), for dense and feature-rich models]

SLIDE 93

Sparse Features: Negative Results

  • Discriminative LM: Jane called Sally
  • Phrase boundary features: Jane || called Sally
  • Alignment constellation: 1-0 0-1
  • Target word insertion: Jane called the Sally