MaxForce: Max-Violation Perceptron and Forced Decoding for Scalable MT Training (PowerPoint presentation)



slide-1
SLIDE 1

MaxForce: Max-Violation Perceptron and Forced Decoding for Scalable MT Training

Heng Yu (Chinese Acad. of Sciences)

Liang Huang (CUNY)

Kai Zhao (CUNY)

Haitao Mi (IBM T. J. Watson)

[figure: phrase lattice over positions 1-6 producing "Bush held talks with Sharon"]

slide-2
SLIDE 2

MaxForce: Max-Violation Perceptron and Forced Decoding for Scalable MT Training

Scalable Training for MT Finally Made Successful

slide-3
SLIDE 3

Discriminative Training for SMT

  • discriminative training is dominant in parsing / tagging
  • it can use arbitrary, overlapping, lexicalized features
  • but it has not yet been very successful in machine translation
  • most MT training efforts tune feature weights on the small dev set (~1k sentences), not the training set!
  • as a result they can only use ~10 dense features (MERT) or ~10k rather impoverished features (MIRA/PRO)
  • Liang et al. (2006) trained on the training set but failed

[figure: data splits: training set (>100k sentences), dev set (~1k sents), test set (~1k sents)]


slide-10
SLIDE 10

Timeline for MT Training

[figure: timeline over the data splits: training set (>100k sentences), dev set (~1k sents), test set (~1k sents)]

  • MERT (Och '02): dense features
  • Standard Perceptron, a noble failure (Liang et al. 2006)
  • MIRA (Watanabe+ '07; Chiang+ '08-'12): pseudo-sparse features
  • PRO (Hopkins+May '11)
  • Regression (Bazrafshan+ '12)
  • HOLS (Flanigan+ '13): sparse features as one dense feature
  • our work (2013): violation-fixing perceptron with truly sparse features

slide-11
SLIDE 11

Why previous work fails

  • their learning methods are based on exact search
  • MT has huge search spaces => severe search errors
  • learning algorithms should fix search errors
  • full updates (perceptron/MIRA/PRO) can't fix search errors
  • MT involves latent variables (derivations are not annotated)
  • perceptron/MIRA were not designed for latent variables
  • we need better perceptron variants

slide-12
SLIDE 12

Why our approach works

  • use a variant of perceptron tailored for inexact search
  • fix search errors in the middle of the search
  • “partial updates” instead of “full updates”
  • use forced decoding lattice as the target to update to
  • use parallelized minibatch to speed up learning
  • result: scaled to a large portion of the training data
  • 20M sparse features => +2.0 BLEU over MERT/PRO

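The minibatch bullet above can be sketched in code. This is a minimal, hypothetical sketch (not the paper's implementation): within a minibatch every example's update is computed from the same weight vector, so those computations can run in parallel, and the summed delta is applied in one synchronized step. The `decode_and_delta` callback and the toy example format are assumptions for illustration.

```python
from collections import defaultdict

def minibatch_perceptron(minibatches, decode_and_delta, epochs=1):
    """Minibatch perceptron sketch: deltas within a batch all use the same
    weights (parallelizable), then one averaged update is applied."""
    w = defaultdict(float)
    for _ in range(epochs):
        for batch in minibatches:
            delta = defaultdict(float)
            for example in batch:               # parallelizable loop
                for f, v in decode_and_delta(w, example).items():
                    delta[f] += v
            for f, v in delta.items():          # single synchronized update
                w[f] += v / len(batch)
    return w

# Toy usage with a hypothetical update function returning +1 for the gold
# feature and -1 for the predicted feature of each example:
def toy_delta(w, example):
    gold, pred = example
    return {gold: 1.0, pred: -1.0}

w = minibatch_perceptron([[("good", "bad"), ("good", "worse")]], toy_delta)
print(dict(w))  # {'good': 1.0, 'bad': -0.5, 'worse': -0.5}
```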

slide-19
SLIDE 19

MT as Structured Classification

  • with latent variables (hidden derivations)

[figure: training example x = "那 人 咬 了 狗" with reference y = "the man bit the dog" and all its gold derivations; the model's best derivation produces the wrong translation "the dog bit the man"]

update: penalize the best derivation and reward the best gold derivation
slide-20
SLIDE 20

Outline

  • Motivations
  • Phrase-based Translation and Forced Decoding
  • Violation-Fixing Perceptron for SMT
  • Update Strategies: Early Update and Max-Violation
  • Feature Design
  • Experiments


slide-21
SLIDE 21

Phrase-based translation

[figure: phrase-based translation of "布什 (Bushi) 与 (yu) 沙龙 (Shalong) 举行 (juxing) 了 (le) 会谈 (huitan)": Bushi → "Bush", yu Shalong → "with Sharon", juxing le huitan → "held talks"; competing phrase options include "meetings", "Sharon", "held", "with"]
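The coverage vectors (the dot/underscore rows) in these slides can be sketched as a tiny decoder. This is a minimal toy, not the paper's system: the phrase table and scores below are invented, and hypotheses are (score, coverage bitmask, output) triples expanded by beam search, binned by how many source words they cover.

```python
# Minimal phrase-based beam search over coverage bitmasks.
# Toy phrase table for "Bushi yu Shalong juxing le huitan"; scores are made up.
PHRASES = {
    (0, 1): [("Bush", 0.0)],                      # Bushi
    (1, 3): [("with Sharon", -0.2)],              # yu Shalong
    (2, 3): [("Sharon", -0.1)],                   # Shalong
    (3, 6): [("held talks", -0.3),                # juxing le huitan
             ("held meetings", -0.9)],
}
N = 6  # source length

def decode(beam_size=5):
    beams = [[] for _ in range(N + 1)]            # beams[i]: hyps covering i words
    beams[0] = [(0.0, 0, ())]                     # (score, coverage, output phrases)
    for i in range(N):
        beams[i] = sorted(beams[i], key=lambda h: -h[0])[:beam_size]  # prune
        for score, cov, out in beams[i]:
            for (s, e), options in PHRASES.items():
                mask = ((1 << (e - s)) - 1) << s
                if cov & mask:                    # span already covered
                    continue
                for phrase, pscore in options:
                    beams[i + e - s].append(
                        (score + pscore, cov | mask, out + (phrase,)))
    best = max(beams[N], key=lambda h: h[0])
    return best[0], best[2]

score, phrases = decode()
print(score, phrases)
```

Ties among reorderings of the same phrases score identically here; a real decoder breaks them with the language model and distortion costs.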

slide-28
SLIDE 28

Language Model and Beam Search

  • split each -LM state into many +LM states

[figure: the -LM state for "Bush" (coverage ● _ _ _ _ _) splits into +LM states ending in "... talks", "... talk", "... meeting", "... Sharon", "... Shalong"]
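The state-splitting bullet can be made concrete with a toy sketch (all names and numbers below are invented, not the paper's): without a language model the search state is just the coverage vector, but with a bigram LM it must also carry the last English word generated, so one -LM state splits into many +LM states.

```python
# Without an LM, a search state is just the coverage bitmask; with a bigram
# LM it must also remember the last English word, splitting each -LM state
# into many +LM states. Toy bigram log-probs, invented for illustration.
BIGRAM = {
    ("<s>", "Bush"): -0.1,
    ("Bush", "held"): -0.2,
    ("held", "talks"): -0.2,
    ("held", "meetings"): -1.0,
}

def lm_score(prev, word):
    return BIGRAM.get((prev, word), -5.0)   # unseen-bigram penalty

def split_state(coverage, last_words):
    """One -LM state (coverage) -> many +LM states (coverage, last word)."""
    return {(coverage, w) for w in last_words}

states = split_state(0b000111, ["talks", "talk", "meeting"])
print(len(states))  # 3 +LM states for one -LM state
```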

slide-33
SLIDE 33

Forced Decoding

Bushi yu Shalong juxing le huitan → Bush held talks with Sharon

  • serves both as data selection (keeping more literal sentence pairs) and as a source of oracle derivations
  • there can be more than one gold derivation

[figure: gold derivation lattice over coverage vectors (positions 1-6) producing "Bush held talks with Sharon"]
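Forced decoding can be sketched as prefix-constrained search: only derivations whose English output is a prefix of the reference are explored, and the survivors form the gold derivation lattice. The phrase table below is a toy assumption, not the paper's grammar.

```python
# Forced decoding sketch: keep only derivations whose output is a prefix of
# the reference. Toy phrase table for "Bushi yu Shalong juxing le huitan".
PHRASES = {
    (0, 1): [("Bush",)],
    (1, 3): [("with", "Sharon")],
    (3, 6): [("held", "talks"), ("held", "meetings")],
}
N = 6
REF = ("Bush", "held", "talks", "with", "Sharon")

def force_decode():
    """Return all derivations (lists of (span, phrase)) yielding exactly REF."""
    results = []
    def expand(cov, out, deriv):
        if out == REF and cov == (1 << N) - 1:
            results.append(deriv)
            return
        for (s, e), options in PHRASES.items():
            mask = ((1 << (e - s)) - 1) << s
            if cov & mask:
                continue                          # span already covered
            for phrase in options:
                new_out = out + phrase
                if REF[:len(new_out)] == new_out:  # prefix check = forcing
                    expand(cov | mask, new_out, deriv + [((s, e), phrase)])
    expand(0, (), [])
    return results

golds = force_decode()
print(len(golds))  # 1: this toy table admits exactly one gold derivation
```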


slide-40
SLIDE 40

Unreachable Sentences and Prefix

[figure: alignment with distortion jumps: "U.N. sent 50 observers to monitor the 1st election since Bolivia restored democracy" aligned to "玻利维亚 恢复 民主 政治 以来 首次 全国 大选 联合国 派遣 50名 观察员 监督"]

  • the distortion limit causes unreachability (hiero would fare better)
  • but we can still use the reachable prefix-pairs of unreachable pairs
slide-44
SLIDE 44

Sentence/Word Reachability Ratio

  • how many sentence pairs pass forced decoding?
  • the ratio drops dramatically as sentences get longer
  • prefixes boost coverage

[figure: ratio of complete coverage (0%-100%) vs sentence length (10-70) for distortion-unlimited and distortion limits 6, 4, 2, 0; a second panel plots the same ratio for dist-6, dist-4, dist-2, dist-0]

slide-45
SLIDE 45

Number of Gold Derivations

  • exponential in sentence length (on fully reachable sentences)
  • these are the "latent variables" in learning

[figure: average number of gold derivations (10,000-100,000) vs sentence length (5-50) for dist-6, dist-4, dist-2, dist-0]

slide-46
SLIDE 46

Outline

  • Background: Phrase-based Translation (Koehn, 2004)
  • Forced Decoding
  • Violation-Fixing Perceptron for MT Training
  • Update strategy
  • Feature design
  • Experiments


slide-51
SLIDE 51

Structured Perceptron (Collins 02)

  • challenges in applying the perceptron to MT
  • the inference (decoding) is vastly inexact (beam search)
  • we know the standard perceptron doesn't work for MT
  • intuition: the learner should fix the search error first

[figure: binary classification (constant # of classes) vs structured classification (exponential # of classes): given input x, (inexact) inference produces z, and the weights w are updated if y ≠ z]
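The Collins-style update in the figure can be written in a few lines. A minimal sketch with dict-valued indicator features; the feature names below are hypothetical, chosen to echo the bit-the-dog example.

```python
from collections import defaultdict

def score(w, feats):
    return sum(w[f] * v for f, v in feats.items())

def perceptron_update(w, feats_gold, feats_pred):
    """Collins (2002): if the 1-best z differs from y, w += Phi(x,y) - Phi(x,z)."""
    for f, v in feats_gold.items():
        w[f] += v
    for f, v in feats_pred.items():
        w[f] -= v

w = defaultdict(float)
gold = {"bit(man->dog)": 1.0}        # hypothetical indicator features of y
pred = {"bit(dog->man)": 1.0}        # features of the model's wrong 1-best z
perceptron_update(w, gold, pred)
print(score(w, gold) > score(w, pred))  # True: gold now outranks the mistake
```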

slide-58
SLIDE 58

Search Error: Gold Derivations Pruned

[figure: real-decoding beam search over coverage vectors (positions 1-6) beside the gold derivation lattice for "Bush held talks with Sharon"; as the beam advances, the gold derivations are pruned]

should fix search errors here!

slide-64
SLIDE 64

Fixing Search Error 1: Early Update

  • early update (Collins & Roark '04): update when the correct hypothesis falls off the beam
  • up to this point the incorrect prefix must score higher
  • that's a "violation", which we want to fix
  • the standard perceptron does not guarantee a violation
  • with pruning, the correct sequence might still score higher at the end!
  • such an update is called "invalid" because it doesn't fix the search error

[figure: model score along the search: when the correct sequence falls off the beam (is pruned), the incorrect prefix is guaranteed to score higher up to that point, so the early update fires there; the standard update at sentence end has no such guarantee]
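The early-update trigger can be sketched as a scan over per-step beams. A toy, assuming the beams are already computed and kept sorted best-first, and a predicate marking prefixes that lie on the gold lattice; both are hypothetical stand-ins for the real decoder.

```python
def early_update_point(beam_steps, is_gold_prefix):
    """Return the first step where every gold prefix falls off the beam.

    beam_steps: list of beams, one per step; each beam lists prefixes best-first.
    is_gold_prefix: predicate marking prefixes on the gold derivation lattice.
    Returns (step index, best incorrect prefix), or None if gold survives.
    """
    for i, beam in enumerate(beam_steps):
        if not any(is_gold_prefix(p) for p in beam):
            return i, beam[0]          # update against the best incorrect prefix
    return None

# Toy run: by step 2 the pruned beam no longer contains any gold prefix.
beams = [
    ["Bush"],
    ["Bush held", "Bush talks"],
    ["Bush talks with", "Bush held a"],   # gold "Bush held talks" was pruned
]
gold_prefixes = {"Bush", "Bush held", "Bush held talks"}
print(early_update_point(beams, gold_prefixes.__contains__))
# -> (2, 'Bush talks with'): a violation is guaranteed at this point
```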

slide-71
SLIDE 71

Early Update w/ Latent Variable

  • the gold-standard derivations are not annotated
  • we treat any reference-producing derivation as correct

[figure: gold derivation lattice over coverage vectors (positions 1-6) producing "Bush held talks with Sharon"; when all correct derivations fall off the beam, stop decoding and make an early update against the incorrect prefix, whose higher score guarantees a violation]

slide-72
SLIDE 72

Fixing Search Error 2: Max-Violation

  • early update works but learns slowly due to partial updates
  • max-violation: update at the prefix where the violation is maximum
  • the "worst mistake" in the search space
  • we call these methods "violation-fixing perceptrons" (Huang et al. 2012)

[figure: scores of the best gold prefix d+ and the best/worst prefixes d- in the beam along the search; the early update fires where the gold derivation falls off, the max-violation update at the step i* where the gap between d- and d+ is largest, while the standard (full) update at |x| can be invalid]

slide-83
SLIDE 83

Latent-Variable Perceptron

[figure: update strategies along the correct sequence: early (where it falls off the beam), max-violation (biggest violation), latest (last valid update), and full/standard (at sentence end, an invalid update); the beam's best and worst scores are shown under model w]
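Choosing the max-violation point is a one-line argmax once per-step scores are available. A sketch with invented per-step scores: gold_scores[i] is the score of the best gold prefix of length i (d+), model_scores[i] the best prefix in the beam (d-); the update fires where the gap is largest.

```python
def max_violation_step(gold_scores, model_scores):
    """Pick i* = argmax_i model(d-_i) - gold(d+_i), the "worst mistake"."""
    gaps = [m - g for g, m in zip(gold_scores, model_scores)]
    i_star = max(range(len(gaps)), key=gaps.__getitem__)
    return i_star, gaps[i_star]

# Hypothetical per-step scores along one sentence. Note the gap at the final
# step (the full/standard update) is small and could even go negative,
# which is exactly when the full update becomes invalid.
gold = [0.0, 1.0, 1.5, 2.0, 4.0]
model = [0.0, 1.2, 2.5, 3.5, 4.1]
print(max_violation_step(gold, model))  # -> (3, 1.5): update on length-3 prefixes
```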

slide-88
SLIDE 88

Roadmap of the techniques

[diagram: structured perceptron (Collins, 2002) branches into the latent-variable perceptron (Zettlemoyer and Collins, 2005; Sun et al., 2009) and the perceptron w/ inexact search (Collins & Roark, 2004; Huang et al. 2012), which combine into the latent-variable perceptron w/ inexact search (Yu et al. 2013); applications: hiero, syntactic parsing, semantic parsing, transliteration]

slide-89
SLIDE 89

Feature Design

  • Dense features:
  • standard phrase-based features (Koehn, 2004)
  • Sparse features:
  • rule-identification features (a unique id for each rule)
  • word-edges features: lexicalized local translation context within a rule
  • non-local features: dependencies between consecutive rules
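Sparse features of this kind are just string-keyed indicators. A sketch of extracting the rule-id, word-edges, and a combo feature for one rule application; the function name, rule format, and feature-name templates are all hypothetical.

```python
def rule_features(rule_id, src_words, tgt_words, left_ctx, right_ctx):
    """Sparse indicator features for one rule application (hypothetical names).

    src_words/tgt_words: the Chinese/English words inside the rule;
    left_ctx/right_ctx: the Chinese words immediately surrounding the rule.
    """
    feats = {f"rule={rule_id}": 1.0}                 # rule identification
    feats[f"src_first={src_words[0]}"] = 1.0         # word-edges features
    feats[f"src_last={src_words[-1]}"] = 1.0
    feats[f"tgt_first={tgt_words[0]}"] = 1.0
    feats[f"tgt_last={tgt_words[-1]}"] = 1.0
    feats[f"ctx={left_ctx}|{right_ctx}"] = 1.0       # surrounding Chinese words
    # combo feature pairing a rule-internal Chinese word with an English word
    feats[f"combo={src_words[-1]}|{tgt_words[0]}"] = 1.0
    return feats

f = rule_features("r2", ["与", "沙龙", "举行", "了", "会谈"],
                  ["held", "a", "few", "talks"], "布什", "</s>")
print("combo=会谈|held" in f)  # True
```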

slide-97
SLIDE 97

WordEdges Features (local)

24

与 沙⻰龚

举行 了 会谈

held a few talks

</s> r2

布什 Bush

r1 <s> <s>

  • the first and last Chinese words in the rule
  • the first and last English words in the rule
  • the two Chinese words surrounding the rule

Combo Features:

100010=沙⻰龚|held

slide-98
SLIDE 98

WordEdges Features (local)

24

与 沙⻰龚

举行 了 会谈

held a few talks

</s> r2

布什 Bush

r1 <s> <s>

  • the first and last Chinese words in the rule
  • the first and last English words in the rule
  • the two Chinese words surrounding the rule

Combo Features:

100010=沙⻰龚|held

slide-99
SLIDE 99

WordEdges Features (local)

24

与 沙⻰龚

举行 了 会谈

held a few talks

</s> r2

布什 Bush

r1 <s> <s>

  • the first and last Chinese words in the rule
  • the first and last English words in the rule
  • the two Chinese words surrounding the rule

Combo Features:

100010=沙⻰龚|held

slide-100
SLIDE 100

WordEdges Features (local)

24

与 沙⻰龚

举行 了 会谈

held a few talks

</s> r2

布什 Bush

r1 <s> <s>

  • the first and last Chinese words in the rule
  • the first and last English words in the rule
  • the two Chinese words surrounding the rule

Combo Features:

100010=沙⻰龚|held 010001=举行|talks

slide-101
SLIDE 101

WordEdges Features (local)

24

与 沙⻰龚

举行 了 会谈

held a few talks

</s> r2

布什 Bush

r1 <s> <s>

  • the first and last Chinese words in the rule
  • the first and last English words in the rule
  • the two Chinese words surrounding the rule

Combo Features:

100010=沙⻰龚|held 010001=举行|talks
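The 6-bit templates above (e.g. 100010=沙龙|held) can be read as selecting a subset of the six boundary words and conjoining them into one sparse feature; a minimal sketch, where the exact bit-to-word ordering is an assumption:

```python
def combo_feature(template, boundary_words):
    """Conjoin the boundary words selected by the 1-bits of `template`
    (a 6-character bit string) into a single sparse feature string."""
    picked = [w for bit, w in zip(template, boundary_words) if bit == "1"]
    return "%s=%s" % (template, "|".join(picked))
```

Under this reading, a template like "100010" picks out the 1st and 5th boundary words of the rule application.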

slide-102
SLIDE 102

Lexical backoffs and combos

  • lexical features are often too sparse
  • 6 kinds of lexical backoffs with various budgets
  • the total budget can't exceed 10 (bilexical)

[Figure: the same example derivation, with rules r1 (布什 → "Bush") and r2 (举行 了 会谈 → "held a few talks").]

P00010=NN|held (POS-tag backoff)  0c0001=举|talks (first-character backoff)  100010=沙龙|held  010001=举行|talks
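The backed-off items P00010=NN|held (a POS-tag backoff, presumably of 沙龙 → NN) and 0c0001=举|talks (a first-character backoff of 举行) suggest a backoff generator along these lines (semantics assumed, not the paper's code):

```python
def lexical_backoffs(word, pos):
    """Progressively coarser stand-ins for a sparse lexical item:
    the surface form itself, its POS tag, and its first character.
    Each backed-off form can then fill a slot in a combo template."""
    return [word, pos, word[0]]
```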

slide-109
SLIDE 109

Non-Local Features (trivial)

[Figure: the same example derivation, with rules r1 and r2.]

  • two consecutive rule ids (a rule bigram model)
  • the last two English words and the current rule
  • should explore a lot more!
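The two non-local templates above can be sketched as follows (the feature names and argument shapes are hypothetical):

```python
def nonlocal_features(prev_rule_id, cur_rule_id, last_two_eng):
    """Rule-bigram feature over consecutive rule ids, plus the last two
    generated English words conjoined with the current rule id."""
    w1, w2 = last_two_eng
    return ["rulebigram=%s_%s" % (prev_rule_id, cur_rule_id),
            "engctx=%s_%s|%s" % (w1, w2, cur_rule_id)]
```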
slide-112
SLIDE 112

Experiments

  • Data sets

Scale  Language  sent.            dev           tst
Small  Ch-En     30k (10x dev)    nist06 news   nist08 news
Large  Ch-En     240k (120x dev)  nist06 news   nist08 news
Large  Sp-En     170k (31x dev)   newstest2012  newstest2013

Sp-En reachable ratio: 55% of sentences, 43.9% of words

slide-121
SLIDE 121

Perceptron: std, early, and max-violation

  • standard perceptron (Liang et al.'s "bold") works poorly
  • because its invalid-update ratio is very high (search quality is low)
  • max-violation converges faster than early update
  • this explains why Liang et al. '06 failed (their "bold" ~ std; their "local" ~ local)

[Figure: BLEU (17-26) vs. number of iterations (1-20) for MaxForce (max-violation), early, local, and standard updates, with the MERT baseline.]

[Figure: ratio of invalid updates (50%-90%) vs. beam size (2-30) for the standard perceptron with non-local features.]
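The max-violation update can be sketched as follows; the prefix feature vectors are assumed inputs (in the talk's setting, the "correct" prefixes come from the forced-decoding lattice):

```python
def dot(w, fv):
    """Sparse dot product between weight dict and feature dict."""
    return sum(w.get(f, 0.0) * v for f, v in fv.items())

def max_violation_update(w, viterbi_prefix_feats, gold_prefix_feats):
    """One perceptron update at the step i* where the model-best prefix
    in the beam out-scores the best correct (reference-reachable) prefix
    by the largest margin; this is a valid update even when the standard
    full-sequence update at step |x| would not be."""
    istar = max(range(len(gold_prefix_feats)),
                key=lambda i: dot(w, viterbi_prefix_feats[i]) - dot(w, gold_prefix_feats[i]))
    # reward the correct prefix, penalize the model-best prefix, at i*
    for f, v in gold_prefix_feats[istar].items():
        w[f] = w.get(f, 0.0) + v
    for f, v in viterbi_prefix_feats[istar].items():
        w[f] = w.get(f, 0.0) - v
    return istar
```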

slide-125
SLIDE 125

Parallelized Perceptron

  • mini-batch perceptron (Zhao and Huang, 2013) is much faster than iterative parameter mixing (McDonald et al., 2010)
  • 6 CPUs => ~4x speedup; 24 CPUs => ~7x speedup

[Figure: BLEU (22-24) vs. wall-clock time for MERT, PRO-dense, and minibatch perceptron on 1, 6, and 24 cores, against a single processor.]
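A sequential sketch of the minibatch perceptron; `decode_delta` is a hypothetical callback returning the sparse update for one sentence under frozen weights, and in practice the inner loop over the batch is what gets farmed out to 6-24 CPUs (e.g. via `multiprocessing.Pool.map`):

```python
def minibatch_perceptron(examples, decode_delta, w, batch_size=24, epochs=1):
    """Decode each minibatch with frozen weights, sum the per-sentence
    updates, then apply them once, so parallel workers never race on `w`."""
    for _ in range(epochs):
        for b in range(0, len(examples), batch_size):
            frozen = dict(w)  # weights are fixed within a minibatch
            deltas = [decode_delta(frozen, ex) for ex in examples[b:b + batch_size]]
            for d in deltas:
                for f, v in d.items():
                    w[f] = w.get(f, 0.0) + v
    return w
```

Because updates are applied only at minibatch boundaries, the parallel version computes exactly what this sequential one does.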

slide-126
SLIDE 126

Internal comparison with different features

  • dense: 11 standard features for phrase-based MT
  • +ruleid: rule-identification features (0.1% of all features): +0.9 BLEU
  • +word-edges: word-edges features with back-offs (99.6% of all features): +2.3 BLEU
  • +non-local: non-local features with back-offs (0.3% of all features): +0.7 BLEU

[Figure: BLEU (18-26) vs. number of iterations for dense, +ruleid, +word-edges, and +non-local, with the MERT baseline.]

slide-131
SLIDE 131

External comparison with MERT & PRO

  • MERT and PRO-dense/medium/sparse all tune on the dev set
  • PRO-sparse uses the same features as ours

[Figure: BLEU (10-26) vs. number of iterations for MaxForce, MERT, PRO-dense, PRO-medium, and PRO-large.]

slide-132
SLIDE 132

Final Results on FBIS Data

System  Alg.      Tune on    Features  Dev   Test
Moses   MERT      dev set    11        25.5  22.5
Cubit   MERT      dev set    11        25.4  22.5
Cubit   PRO       dev set    11        25.6  22.6
Cubit   PRO       dev set    3k        26.3  23.0
Cubit   PRO       dev set    36k       17.7  14.3
Cubit   MaxForce  train set  23M       27.8  24.5  (+2.3 / +2.0 over MERT)

  • Moses: state-of-the-art phrase-based system in C++
  • Cubit: phrase-based system (Huang and Chiang, 2007) in Python
  • almost identical baseline scores with MERT
  • max-violation takes ~47 hours on 24 CPUs (23M features)

slide-140
SLIDE 140

Results on Spanish-English set

  • data set: Europarl corpus, 170k sentences
  • dev/test set: newstest2012 / newstest2013 (one reference only)
  • +1 in 1-ref BLEU ~ +2 in 4-ref BLEU
  • the BLEU improvement is comparable to Chinese w/ 4 refs

System  Algorithm  #feat.  Dev   Test
Moses   MERT       11      27.4  24.4
Cubit   MaxForce   21M     28.7  25.5  (+1.3 / +1.1)

Sp-En reachable ratio: 55% of sentences, 43.9% of words

slide-142
SLIDE 142

Conclusion

  • a simple yet effective online learning approach for MT
  • scaled to (a large portion of) the training set for the first time
  • able to incorporate 20M sparse lexicalized features
  • no need to define BLEU+1, or hope/fear derivations
  • no learning rate or hyperparameters
  • +2.3/+2.0 BLEU points better than MERT/PRO
  • the three ingredients that made it work:
  • violation-fixing perceptron: early-update and max-violation
  • forced decoding: the lattice of reference-reachable derivations supplies the latent gold
  • minibatch parallelization scales it up to big data

34

slide-143
SLIDE 143

Roadmap of the techniques

structured perceptron (Collins, 2002)

latent-variable perceptron (Zettlemoyer and Collins, 2005; Sun et al., 2009): replacing EM for partially-observed data

perceptron w/ inexact search (Collins & Roark, 2004; Huang et al., 2012)

latent-variable perceptron w/ inexact search (Yu et al., 2013): applied to Hiero, syntactic parsing, semantic parsing, transliteration
slide-149
SLIDE 149

20 years of Statistical MT

  • word alignment: IBM models (Brown et al 90, 93)
  • translation model (choose one from below)
  • SCFG (ITG: Wu 95, 97; Hiero: Chiang 05, 07) or STSG (GHKM 04, 06; Liu+ 06; Huang+ 06)
  • PBMT (Och+Ney 02; Koehn et al 03)
  • evaluation metric: BLEU (Papineni et al 02)
  • decoding algorithm: cube pruning (Chiang 07; Huang+Chiang 07)
  • training algorithm (choose one from below)
  • MERT (Och 03): ~10 dense features on dev set
  • MIRA (Chiang et al 08-12) or PRO (Hopkins+May 11): ~10k feats on dev set
  • MaxForce: 20M+ feats on training set; +2/+1.5 BLEU over MERT/PRO
  • Max-Violation Perceptron with Forced Decoding: fixes search errors
  • first successful effort of online large-scale discriminative training for MT
slide-150
SLIDE 150

When learning with vastly inexact search, you should use a principled method such as max-violation. Thank you!
