MaxForce: Max-Violation Perceptron and Forced Decoding for Scalable MT Training (PowerPoint presentation)



slide-1
SLIDE 1

MaxForce: Max-Violation Perceptron and Forced Decoding for Scalable MT Training

Heng Yu (Chinese Acad. of Sciences)

Liang Huang (CUNY)

Kai Zhao (CUNY)

Haitao Mi (IBM T. J. Watson)

[figure: phrase lattice over positions 1-6 producing "Bush held talks with Sharon"]

slide-2
SLIDE 2

MaxForce: Max-Violation Perceptron and Forced Decoding for Scalable MT Training

Scalable Training for MT Finally Made Successful

slide-3
SLIDE 3

Discriminative Training for SMT

  • discriminative training is dominant in parsing / tagging
  • it can use arbitrary, overlapping, lexicalized features
  • but it has not yet been very successful in machine translation
  • most MT training efforts tune feature weights on the small dev set (~1k sentences), not the training set!
  • as a result they can only use ~10 dense features (MERT) or ~10k rather impoverished features (MIRA/PRO)
  • Liang et al. (2006) trained on the training set but failed

[figure: data splits: training set (>100k sentences), dev set (~1k sents), test set (~1k sents)]


slide-10
SLIDE 10

Timeline for MT Training

[figure: timeline over the data splits: training set (>100k sentences), dev set (~1k sents), test set (~1k sents)]

  • MERT (Och '02): dense features
  • Standard Perceptron, a noble failure (Liang et al. 2006)
  • MIRA (Watanabe+ '07; Chiang+ '08-'12): pseudo-sparse features
  • PRO (Hopkins+May '11)
  • Regression (Bazrafshan+ '12)
  • HOLS (Flanigan+ '13): sparse features as one dense feature
  • our work (2013): violation-fixing perceptron with truly sparse features

slide-11
SLIDE 11

Why previous work fails

  • their learning methods are based on exact search
  • MT has huge search spaces => severe search errors
  • learning algorithms should fix search errors
  • full updates (perceptron/MIRA/PRO) can't fix search errors
  • MT involves latent variables (derivations are not annotated)
  • perceptron/MIRA were not designed for latent variables
  • we need better perceptron variants

slide-12
SLIDE 12

Why our approach works

  • use a variant of perceptron tailored for inexact search
  • fix search errors in the middle of the search
  • “partial updates” instead of “full updates”
  • use forced decoding lattice as the target to update to
  • use parallelized minibatch to speed up learning
  • result: scaled to a large portion of the training data
  • 20M sparse features => +2.0 BLEU over MERT/PRO

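The minibatch bullet above can be sketched in code. This is a minimal, hypothetical sketch (not the paper's implementation): within a minibatch every example's update is computed from the same weight vector, so those computations can run in parallel, and the summed delta is applied in one synchronized step. The `decode_and_delta` callback and the toy example format are assumptions for illustration.

```python
from collections import defaultdict

def minibatch_perceptron(minibatches, decode_and_delta, epochs=1):
    """Minibatch perceptron sketch: deltas within a batch all use the same
    weights (parallelizable), then one averaged update is applied."""
    w = defaultdict(float)
    for _ in range(epochs):
        for batch in minibatches:
            delta = defaultdict(float)
            for example in batch:               # parallelizable loop
                for f, v in decode_and_delta(w, example).items():
                    delta[f] += v
            for f, v in delta.items():          # single synchronized update
                w[f] += v / len(batch)
    return w

# Toy usage with a hypothetical update function returning +1 for the gold
# feature and -1 for the predicted feature of each example:
def toy_delta(w, example):
    gold, pred = example
    return {gold: 1.0, pred: -1.0}

w = minibatch_perceptron([[("good", "bad"), ("good", "worse")]], toy_delta)
print(dict(w))  # {'good': 1.0, 'bad': -0.5, 'worse': -0.5}
```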

slide-19
SLIDE 19

MT as Structured Classification

  • with latent variables (hidden derivations)

[figure: training example x = "那 人 咬 了 狗" with reference y = "the man bit the dog" and all its gold derivations; the model's best derivation produces the wrong translation "the dog bit the man"]

update: penalize the best derivation and reward the best gold derivation
slide-20
SLIDE 20

Outline

  • Motivations
  • Phrase-based Translation and Forced Decoding
  • Violation-Fixing Perceptron for SMT
  • Update Strategies: Early Update and Max-Violation
  • Feature Design
  • Experiments


slide-21
SLIDE 21

Phrase-based translation

[figure: phrase-based translation of "布什 (Bushi) 与 (yu) 沙龙 (Shalong) 举行 (juxing) 了 (le) 会谈 (huitan)": Bushi → "Bush", yu Shalong → "with Sharon", juxing le huitan → "held talks"; competing phrase options include "meetings", "Sharon", "held", "with"]
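The coverage vectors (the dot/underscore rows) in these slides can be sketched as a tiny decoder. This is a minimal toy, not the paper's system: the phrase table and scores below are invented, and hypotheses are (score, coverage bitmask, output) triples expanded by beam search, binned by how many source words they cover.

```python
# Minimal phrase-based beam search over coverage bitmasks.
# Toy phrase table for "Bushi yu Shalong juxing le huitan"; scores are made up.
PHRASES = {
    (0, 1): [("Bush", 0.0)],                      # Bushi
    (1, 3): [("with Sharon", -0.2)],              # yu Shalong
    (2, 3): [("Sharon", -0.1)],                   # Shalong
    (3, 6): [("held talks", -0.3),                # juxing le huitan
             ("held meetings", -0.9)],
}
N = 6  # source length

def decode(beam_size=5):
    beams = [[] for _ in range(N + 1)]            # beams[i]: hyps covering i words
    beams[0] = [(0.0, 0, ())]                     # (score, coverage, output phrases)
    for i in range(N):
        beams[i] = sorted(beams[i], key=lambda h: -h[0])[:beam_size]  # prune
        for score, cov, out in beams[i]:
            for (s, e), options in PHRASES.items():
                mask = ((1 << (e - s)) - 1) << s
                if cov & mask:                    # span already covered
                    continue
                for phrase, pscore in options:
                    beams[i + e - s].append(
                        (score + pscore, cov | mask, out + (phrase,)))
    best = max(beams[N], key=lambda h: h[0])
    return best[0], best[2]

score, phrases = decode()
print(score, phrases)
```

Ties among reorderings of the same phrases score identically here; a real decoder breaks them with the language model and distortion costs.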

slide-28
SLIDE 28

Language Model and Beam Search

  • split each -LM state into many +LM states

[figure: the -LM state for "Bush" (coverage ● _ _ _ _ _) splits into +LM states ending in "... talks", "... talk", "... meeting", "... Sharon", "... Shalong"]
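The state-splitting bullet can be made concrete with a toy sketch (all names and numbers below are invented, not the paper's): without a language model the search state is just the coverage vector, but with a bigram LM it must also carry the last English word generated, so one -LM state splits into many +LM states.

```python
# Without an LM, a search state is just the coverage bitmask; with a bigram
# LM it must also remember the last English word, splitting each -LM state
# into many +LM states. Toy bigram log-probs, invented for illustration.
BIGRAM = {
    ("<s>", "Bush"): -0.1,
    ("Bush", "held"): -0.2,
    ("held", "talks"): -0.2,
    ("held", "meetings"): -1.0,
}

def lm_score(prev, word):
    return BIGRAM.get((prev, word), -5.0)   # unseen-bigram penalty

def split_state(coverage, last_words):
    """One -LM state (coverage) -> many +LM states (coverage, last word)."""
    return {(coverage, w) for w in last_words}

states = split_state(0b000111, ["talks", "talk", "meeting"])
print(len(states))  # 3 +LM states for one -LM state
```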

slide-33
SLIDE 33

Forced Decoding

Bushi yu Shalong juxing le huitan → Bush held talks with Sharon

  • serves both as data selection (keeping more literal sentence pairs) and as a source of oracle derivations
  • there can be more than one gold derivation

[figure: gold derivation lattice over coverage vectors (positions 1-6) producing "Bush held talks with Sharon"]
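Forced decoding can be sketched as prefix-constrained search: only derivations whose English output is a prefix of the reference are explored, and the survivors form the gold derivation lattice. The phrase table below is a toy assumption, not the paper's grammar.

```python
# Forced decoding sketch: keep only derivations whose output is a prefix of
# the reference. Toy phrase table for "Bushi yu Shalong juxing le huitan".
PHRASES = {
    (0, 1): [("Bush",)],
    (1, 3): [("with", "Sharon")],
    (3, 6): [("held", "talks"), ("held", "meetings")],
}
N = 6
REF = ("Bush", "held", "talks", "with", "Sharon")

def force_decode():
    """Return all derivations (lists of (span, phrase)) yielding exactly REF."""
    results = []
    def expand(cov, out, deriv):
        if out == REF and cov == (1 << N) - 1:
            results.append(deriv)
            return
        for (s, e), options in PHRASES.items():
            mask = ((1 << (e - s)) - 1) << s
            if cov & mask:
                continue                          # span already covered
            for phrase in options:
                new_out = out + phrase
                if REF[:len(new_out)] == new_out:  # prefix check = forcing
                    expand(cov | mask, new_out, deriv + [((s, e), phrase)])
    expand(0, (), [])
    return results

golds = force_decode()
print(len(golds))  # 1: this toy table admits exactly one gold derivation
```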


slide-40
SLIDE 40

Unreachable Sentences and Prefix

[figure: alignment with distortion jumps: "U.N. sent 50 observers to monitor the 1st election since Bolivia restored democracy" aligned to "玻利维亚 恢复 民主 政治 以来 首次 全国 大选 联合国 派遣 50名 观察员 监督"]

  • the distortion limit causes unreachability (hiero would fare better)
  • but we can still use the reachable prefix-pairs of unreachable pairs
slide-44
SLIDE 44

Sentence/Word Reachability Ratio

  • how many sentence pairs pass forced decoding?
  • the ratio drops dramatically as sentences get longer
  • prefixes boost coverage

[figure: ratio of complete coverage (0%-100%) vs sentence length (10-70) for distortion-unlimited and distortion limits 6, 4, 2, 0; a second panel plots the same ratio for dist-6, dist-4, dist-2, dist-0]

slide-45
SLIDE 45

Number of Gold Derivations

  • exponential in sentence length (on fully reachable sentences)
  • these are the "latent variables" in learning

[figure: average number of gold derivations (10,000-100,000) vs sentence length (5-50) for dist-6, dist-4, dist-2, dist-0]

slide-46
SLIDE 46

Outline

  • Background: Phrase-based Translation (Koehn, 2004)
  • Forced Decoding
  • Violation-Fixing Perceptron for MT Training
  • Update strategy
  • Feature design
  • Experiments


slide-51
SLIDE 51

Structured Perceptron (Collins 02)

  • challenges in applying the perceptron to MT
  • the inference (decoding) is vastly inexact (beam search)
  • we know the standard perceptron doesn't work for MT
  • intuition: the learner should fix the search error first

[figure: binary classification (constant # of classes) vs structured classification (exponential # of classes): given input x, (inexact) inference produces z, and the weights w are updated if y ≠ z]
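The Collins-style update in the figure can be written in a few lines. A minimal sketch with dict-valued indicator features; the feature names below are hypothetical, chosen to echo the bit-the-dog example.

```python
from collections import defaultdict

def score(w, feats):
    return sum(w[f] * v for f, v in feats.items())

def perceptron_update(w, feats_gold, feats_pred):
    """Collins (2002): if the 1-best z differs from y, w += Phi(x,y) - Phi(x,z)."""
    for f, v in feats_gold.items():
        w[f] += v
    for f, v in feats_pred.items():
        w[f] -= v

w = defaultdict(float)
gold = {"bit(man->dog)": 1.0}        # hypothetical indicator features of y
pred = {"bit(dog->man)": 1.0}        # features of the model's wrong 1-best z
perceptron_update(w, gold, pred)
print(score(w, gold) > score(w, pred))  # True: gold now outranks the mistake
```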

slide-58
SLIDE 58

Search Error: Gold Derivations Pruned

[figure: real-decoding beam search over coverage vectors (positions 1-6) beside the gold derivation lattice for "Bush held talks with Sharon"; as the beam advances, the gold derivations are pruned]

should fix search errors here!

slide-64
SLIDE 64

Fixing Search Error 1: Early Update

  • early update (Collins & Roark '04): update when the correct hypothesis falls off the beam
  • up to this point the incorrect prefix must score higher
  • that's a "violation", which we want to fix
  • the standard perceptron does not guarantee a violation
  • with pruning, the correct sequence might still score higher at the end!
  • such an update is called "invalid" because it doesn't fix the search error

[figure: model score along the search: when the correct sequence falls off the beam (is pruned), the incorrect prefix is guaranteed to score higher up to that point, so the early update fires there; the standard update at sentence end has no such guarantee]
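The early-update trigger can be sketched as a scan over per-step beams. A toy, assuming the beams are already computed and kept sorted best-first, and a predicate marking prefixes that lie on the gold lattice; both are hypothetical stand-ins for the real decoder.

```python
def early_update_point(beam_steps, is_gold_prefix):
    """Return the first step where every gold prefix falls off the beam.

    beam_steps: list of beams, one per step; each beam lists prefixes best-first.
    is_gold_prefix: predicate marking prefixes on the gold derivation lattice.
    Returns (step index, best incorrect prefix), or None if gold survives.
    """
    for i, beam in enumerate(beam_steps):
        if not any(is_gold_prefix(p) for p in beam):
            return i, beam[0]          # update against the best incorrect prefix
    return None

# Toy run: by step 2 the pruned beam no longer contains any gold prefix.
beams = [
    ["Bush"],
    ["Bush held", "Bush talks"],
    ["Bush talks with", "Bush held a"],   # gold "Bush held talks" was pruned
]
gold_prefixes = {"Bush", "Bush held", "Bush held talks"}
print(early_update_point(beams, gold_prefixes.__contains__))
# -> (2, 'Bush talks with'): a violation is guaranteed at this point
```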

slide-71
SLIDE 71

Early Update w/ Latent Variable

  • the gold-standard derivations are not annotated
  • we treat any reference-producing derivation as correct

[figure: gold derivation lattice over coverage vectors (positions 1-6) producing "Bush held talks with Sharon"; when all correct derivations fall off the beam, stop decoding and make an early update against the incorrect prefix, whose higher score guarantees a violation]

slide-72
SLIDE 72

Fixing Search Error 2: Max-Violation

  • early update works but learns slowly due to partial updates
  • max-violation: update at the prefix where the violation is maximum
  • the "worst mistake" in the search space
  • we call these methods "violation-fixing perceptrons" (Huang et al. 2012)

[figure: scores of the best gold prefix d+ and the best/worst prefixes d- in the beam along the search; the early update fires where the gold derivation falls off, the max-violation update at the step i* where the gap between d- and d+ is largest, while the standard (full) update at |x| can be invalid]

slide-83
SLIDE 83

Latent-Variable Perceptron

[figure: update strategies along the correct sequence: early (where it falls off the beam), max-violation (biggest violation), latest (last valid update), and full/standard (at sentence end, an invalid update); the beam's best and worst scores are shown under model w]
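Choosing the max-violation point is a one-line argmax once per-step scores are available. A sketch with invented per-step scores: gold_scores[i] is the score of the best gold prefix of length i (d+), model_scores[i] the best prefix in the beam (d-); the update fires where the gap is largest.

```python
def max_violation_step(gold_scores, model_scores):
    """Pick i* = argmax_i model(d-_i) - gold(d+_i), the "worst mistake"."""
    gaps = [m - g for g, m in zip(gold_scores, model_scores)]
    i_star = max(range(len(gaps)), key=gaps.__getitem__)
    return i_star, gaps[i_star]

# Hypothetical per-step scores along one sentence. Note the gap at the final
# step (the full/standard update) is small and could even go negative,
# which is exactly when the full update becomes invalid.
gold = [0.0, 1.0, 1.5, 2.0, 4.0]
model = [0.0, 1.2, 2.5, 3.5, 4.1]
print(max_violation_step(gold, model))  # -> (3, 1.5): update on length-3 prefixes
```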

slide-88
SLIDE 88

Roadmap of the techniques

[diagram: structured perceptron (Collins, 2002) branches into the latent-variable perceptron (Zettlemoyer and Collins, 2005; Sun et al., 2009) and the perceptron w/ inexact search (Collins & Roark, 2004; Huang et al. 2012), which combine into the latent-variable perceptron w/ inexact search (Yu et al. 2013); applications: hiero, syntactic parsing, semantic parsing, transliteration]

slide-89
SLIDE 89

Feature Design

  • Dense features:
  • standard phrase-based features (Koehn, 2004)
  • Sparse features:
  • rule-identification features (a unique id for each rule)
  • word-edges features: lexicalized local translation context within a rule
  • non-local features: dependencies between consecutive rules
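Sparse features of this kind are just string-keyed indicators. A sketch of extracting the rule-id, word-edges, and a combo feature for one rule application; the function name, rule format, and feature-name templates are all hypothetical.

```python
def rule_features(rule_id, src_words, tgt_words, left_ctx, right_ctx):
    """Sparse indicator features for one rule application (hypothetical names).

    src_words/tgt_words: the Chinese/English words inside the rule;
    left_ctx/right_ctx: the Chinese words immediately surrounding the rule.
    """
    feats = {f"rule={rule_id}": 1.0}                 # rule identification
    feats[f"src_first={src_words[0]}"] = 1.0         # word-edges features
    feats[f"src_last={src_words[-1]}"] = 1.0
    feats[f"tgt_first={tgt_words[0]}"] = 1.0
    feats[f"tgt_last={tgt_words[-1]}"] = 1.0
    feats[f"ctx={left_ctx}|{right_ctx}"] = 1.0       # surrounding Chinese words
    # combo feature pairing a rule-internal Chinese word with an English word
    feats[f"combo={src_words[-1]}|{tgt_words[0]}"] = 1.0
    return feats

f = rule_features("r2", ["与", "沙龙", "举行", "了", "会谈"],
                  ["held", "a", "few", "talks"], "布什", "</s>")
print("combo=会谈|held" in f)  # True
```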

slide-97
SLIDE 97

WordEdges Features (local)

24

与 沙⻰龚

举行 了 会谈

held a few talks

</s> r2

布什 Bush

r1 <s> <s>

  • the first and last Chinese words in the rule
  • the first and last English words in the rule
  • the two Chinese words surrounding the rule

Combo Features:

100010=沙⻰龚|held

slide-98
SLIDE 98

WordEdges Features (local)

24

与 沙⻰龚

举行 了 会谈

held a few talks

</s> r2

布什 Bush

r1 <s> <s>

  • the first and last Chinese words in the rule
  • the first and last English words in the rule
  • the two Chinese words surrounding the rule

Combo Features:

100010=沙⻰龚|held

slide-99
SLIDE 99

WordEdges Features (local)

24

与 沙⻰龚

举行 了 会谈

held a few talks

</s> r2

布什 Bush

r1 <s> <s>

  • the first and last Chinese words in the rule
  • the first and last English words in the rule
  • the two Chinese words surrounding the rule

Combo Features:

100010=沙⻰龚|held

slide-100
SLIDE 100

WordEdges Features (local)

24

与 沙⻰龚

举行 了 会谈

held a few talks

</s> r2

布什 Bush

r1 <s> <s>

  • the first and last Chinese words in the rule
  • the first and last English words in the rule
  • the two Chinese words surrounding the rule

Combo Features:

100010=沙⻰龚|held 010001=举行|talks

slide-101
SLIDE 101

WordEdges Features (local)

24

与 沙⻰龚

举行 了 会谈

held a few talks

</s> r2

布什 Bush

r1 <s> <s>

  • the first and last Chinese words in the rule
  • the first and last English words in the rule
  • the two Chinese words surrounding the rule

Combo Features:

100010=沙⻰龚|held 010001=举行|talks
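The 6-bit templates above (e.g. 100010=沙龙|held) can be read as selecting a subset of the six boundary words and conjoining them into one sparse feature; a minimal sketch, where the exact bit-to-word ordering is an assumption:

```python
def combo_feature(template, boundary_words):
    """Conjoin the boundary words selected by the 1-bits of `template`
    (a 6-character bit string) into a single sparse feature string."""
    picked = [w for bit, w in zip(template, boundary_words) if bit == "1"]
    return "%s=%s" % (template, "|".join(picked))
```

Under this reading, a template like "100010" picks out the 1st and 5th boundary words of the rule application.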

slide-102
SLIDE 102

Lexical backoffs and combos

  • lexical features are often too sparse
  • 6 kinds of lexical backoffs with various budgets
  • the total budget can't exceed 10 (bilexical)

[Figure: the same example derivation, with rules r1 (布什 → "Bush") and r2 (举行 了 会谈 → "held a few talks").]

P00010=NN|held (POS-tag backoff)  0c0001=举|talks (first-character backoff)  100010=沙龙|held  010001=举行|talks
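The backed-off items P00010=NN|held (a POS-tag backoff, presumably of 沙龙 → NN) and 0c0001=举|talks (a first-character backoff of 举行) suggest a backoff generator along these lines (semantics assumed, not the paper's code):

```python
def lexical_backoffs(word, pos):
    """Progressively coarser stand-ins for a sparse lexical item:
    the surface form itself, its POS tag, and its first character.
    Each backed-off form can then fill a slot in a combo template."""
    return [word, pos, word[0]]
```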

slide-109
SLIDE 109

Non-Local Features (trivial)

[Figure: the same example derivation, with rules r1 and r2.]

  • two consecutive rule ids (a rule bigram model)
  • the last two English words and the current rule
  • should explore a lot more!
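The two non-local templates above can be sketched as follows (the feature names and argument shapes are hypothetical):

```python
def nonlocal_features(prev_rule_id, cur_rule_id, last_two_eng):
    """Rule-bigram feature over consecutive rule ids, plus the last two
    generated English words conjoined with the current rule id."""
    w1, w2 = last_two_eng
    return ["rulebigram=%s_%s" % (prev_rule_id, cur_rule_id),
            "engctx=%s_%s|%s" % (w1, w2, cur_rule_id)]
```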
slide-112
SLIDE 112

Experiments

  • Data sets

Scale  Language  sent.            dev           tst
Small  Ch-En     30k (10x dev)    nist06 news   nist08 news
Large  Ch-En     240k (120x dev)  nist06 news   nist08 news
Large  Sp-En     170k (31x dev)   newstest2012  newstest2013

Sp-En reachable ratio: 55% of sentences, 43.9% of words

slide-121
SLIDE 121

Perceptron: std, early, and max-violation

  • standard perceptron (Liang et al.'s "bold") works poorly
  • because its invalid-update ratio is very high (search quality is low)
  • max-violation converges faster than early update
  • this explains why Liang et al. '06 failed (their "bold" ~ std; their "local" ~ local)

[Figure: BLEU (17-26) vs. number of iterations (1-20) for MaxForce (max-violation), early, local, and standard updates, with the MERT baseline.]

[Figure: ratio of invalid updates (50%-90%) vs. beam size (2-30) for the standard perceptron with non-local features.]
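The max-violation update can be sketched as follows; the prefix feature vectors are assumed inputs (in the talk's setting, the "correct" prefixes come from the forced-decoding lattice):

```python
def dot(w, fv):
    """Sparse dot product between weight dict and feature dict."""
    return sum(w.get(f, 0.0) * v for f, v in fv.items())

def max_violation_update(w, viterbi_prefix_feats, gold_prefix_feats):
    """One perceptron update at the step i* where the model-best prefix
    in the beam out-scores the best correct (reference-reachable) prefix
    by the largest margin; this is a valid update even when the standard
    full-sequence update at step |x| would not be."""
    istar = max(range(len(gold_prefix_feats)),
                key=lambda i: dot(w, viterbi_prefix_feats[i]) - dot(w, gold_prefix_feats[i]))
    # reward the correct prefix, penalize the model-best prefix, at i*
    for f, v in gold_prefix_feats[istar].items():
        w[f] = w.get(f, 0.0) + v
    for f, v in viterbi_prefix_feats[istar].items():
        w[f] = w.get(f, 0.0) - v
    return istar
```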

slide-125
SLIDE 125

Parallelized Perceptron

  • mini-batch perceptron (Zhao and Huang, 2013) is much faster than iterative parameter mixing (McDonald et al., 2010)
  • 6 CPUs => ~4x speedup; 24 CPUs => ~7x speedup

[Figure: BLEU (22-24) vs. wall-clock time for MERT, PRO-dense, and minibatch perceptron on 1, 6, and 24 cores, against a single processor.]
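A sequential sketch of the minibatch perceptron; `decode_delta` is a hypothetical callback returning the sparse update for one sentence under frozen weights, and in practice the inner loop over the batch is what gets farmed out to 6-24 CPUs (e.g. via `multiprocessing.Pool.map`):

```python
def minibatch_perceptron(examples, decode_delta, w, batch_size=24, epochs=1):
    """Decode each minibatch with frozen weights, sum the per-sentence
    updates, then apply them once, so parallel workers never race on `w`."""
    for _ in range(epochs):
        for b in range(0, len(examples), batch_size):
            frozen = dict(w)  # weights are fixed within a minibatch
            deltas = [decode_delta(frozen, ex) for ex in examples[b:b + batch_size]]
            for d in deltas:
                for f, v in d.items():
                    w[f] = w.get(f, 0.0) + v
    return w
```

Because updates are applied only at minibatch boundaries, the parallel version computes exactly what this sequential one does.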

slide-126
SLIDE 126

Internal comparison with different features

  • dense: 11 standard features for phrase-based MT
  • +ruleid: rule-identification features (0.1% of all features): +0.9 BLEU
  • +word-edges: word-edges features with back-offs (99.6% of all features): +2.3 BLEU
  • +non-local: non-local features with back-offs (0.3% of all features): +0.7 BLEU

[Figure: BLEU (18-26) vs. number of iterations for dense, +ruleid, +word-edges, and +non-local, with the MERT baseline.]

slide-131
SLIDE 131

External comparison with MERT & PRO

  • MERT and PRO-dense/medium/sparse all tune on the dev set
  • PRO-sparse uses the same features as ours

[Figure: BLEU (10-26) vs. number of iterations for MaxForce, MERT, PRO-dense, PRO-medium, and PRO-large.]

slide-132
SLIDE 132

Final Results on FBIS Data

System  Alg.      Tune on    Features  Dev   Test
Moses   MERT      dev set    11        25.5  22.5
Cubit   MERT      dev set    11        25.4  22.5
Cubit   PRO       dev set    11        25.6  22.6
Cubit   PRO       dev set    3k        26.3  23.0
Cubit   PRO       dev set    36k       17.7  14.3
Cubit   MaxForce  train set  23M       27.8  24.5  (+2.3 / +2.0 over MERT)

  • Moses: state-of-the-art phrase-based system in C++
  • Cubit: phrase-based system (Huang and Chiang, 2007) in Python
  • almost identical baseline scores with MERT
  • max-violation takes ~47 hours on 24 CPUs (23M features)

slide-140
SLIDE 140

Results on Spanish-English set

  • data set: Europarl corpus, 170k sentences
  • dev/test set: newstest2012 / newstest2013 (one reference only)
  • +1 in 1-ref BLEU ~ +2 in 4-ref BLEU
  • the BLEU improvement is comparable to Chinese w/ 4 refs

System  Algorithm  #feat.  Dev   Test
Moses   MERT       11      27.4  24.4
Cubit   MaxForce   21M     28.7  25.5  (+1.3 / +1.1)

Sp-En reachable ratio: 55% of sentences, 43.9% of words

slide-142
SLIDE 142

Conclusion

  • a simple yet effective online learning approach for MT
  • scaled to (a large portion of) the training set for the first time
  • able to incorporate 20M sparse lexicalized features
  • no need to define BLEU+1, or hope/fear derivations
  • no learning rate or hyperparameters
  • +2.3/+2.0 BLEU points better than MERT/PRO
  • the three ingredients that made it work:
  • violation-fixing perceptron: early-update and max-violation
  • forced decoding: the lattice of reference-reachable derivations supplies the latent gold
  • minibatch parallelization scales it up to big data

34

slide-143
SLIDE 143

Roadmap of the techniques

structured perceptron (Collins, 2002)

latent-variable perceptron (Zettlemoyer and Collins, 2005; Sun et al., 2009): replacing EM for partially-observed data

perceptron w/ inexact search (Collins & Roark, 2004; Huang et al., 2012)

latent-variable perceptron w/ inexact search (Yu et al., 2013): applied to Hiero, syntactic parsing, semantic parsing, transliteration
slide-149
SLIDE 149

20 years of Statistical MT

  • word alignment: IBM models (Brown et al 90, 93)
  • translation model (choose one from below)
  • SCFG (ITG: Wu 95, 97; Hiero: Chiang 05, 07) or STSG (GHKM 04, 06; Liu+ 06; Huang+ 06)
  • PBMT (Och+Ney 02; Koehn et al 03)
  • evaluation metric: BLEU (Papineni et al 02)
  • decoding algorithm: cube pruning (Chiang 07; Huang+Chiang 07)
  • training algorithm (choose one from below)
  • MERT (Och 03): ~10 dense features on dev set
  • MIRA (Chiang et al 08-12) or PRO (Hopkins+May 11): ~10k feats on dev set
  • MaxForce: 20M+ feats on training set; +2/+1.5 BLEU over MERT/PRO
  • Max-Violation Perceptron with Forced Decoding: fixes search errors
  • first successful effort of online large-scale discriminative training for MT
slide-150
SLIDE 150

When learning with vastly inexact search, you should use a principled method such as max-violation. Thank you!
