
SLIDE 1

Statistical Phrase-Based Translation

Philipp Koehn, Franz Och, Daniel Marcu

koehn@isi.edu, och@isi.edu, marcu@isi.edu

Information Sciences Institute, University of Southern California


SLIDE 2

Motivation

Phrase-based translation is the best way to do statistical machine translation

– best performance in recent DARPA evaluations
– also fairly simple
– tools are freely available

How do I construct a phrase translation table?


SLIDE 3

Goals

Compare different approaches to learn phrases
Examine properties of phrase-based translation
Syntax and phrases


SLIDE 4

Overview

Evaluation framework

– unified model
– decoder
– corpus

Three methods for learning phrases

– word-alignment induced phrases
– syntactic phrases
– phrase-alignment

Experiments


SLIDE 5

Model

Morgen fliege ich nach Kanada zur Konferenz
Tomorrow I will fly to the conference in Canada

Bayes rule: $\arg\max_{e}\, p(e \mid f) = \arg\max_{e}\, p(f \mid e)\, p(e)$

Foreign sentence $f$ is segmented into $I$ phrases $\bar{f}_1^I$

Each phrase is translated with $\phi(\bar{f}_i \mid \bar{e}_i)$

Phrases are reordered with $d(\cdot)$

Use of language model $p_{\mathrm{LM}}(e)$ and word penalty $\omega^{|e|}$
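
As a rough illustration of how the components above combine, the sketch below scores one segmented translation in log space. The toy phrase probabilities, the stand-in language model probability, the exponential distortion form `alpha ** abs(jump)`, and all function and parameter names are assumptions made for this example; they are not taken from the slides.

```python
import math

def score(segments, phrase_table, lm_prob, alpha=0.5, omega=1.0):
    """Log-score of one segmentation under the four model components.

    segments: list of (f_phrase, e_phrase, f_start, f_end) in English order,
              where f_start/f_end are 0-based inclusive foreign positions.
    phrase_table: dict (f_phrase, e_phrase) -> translation probability (toy).
    lm_prob: language model probability of the whole English output (stand-in).
    """
    log_p = math.log(lm_prob)                  # language model
    prev_end = -1                              # end of previously translated phrase
    n_words = 0
    for f_phrase, e_phrase, f_start, f_end in segments:
        log_p += math.log(phrase_table[(f_phrase, e_phrase)])  # phrase translation
        jump = f_start - prev_end - 1          # 0 when translation is monotone
        log_p += abs(jump) * math.log(alpha)   # assumed distortion penalty
        prev_end = f_end
        n_words += len(e_phrase.split())
    log_p += n_words * math.log(omega)         # word penalty
    return log_p

# Hypothetical segmentation of the example sentence pair.
segments = [
    ("Morgen", "Tomorrow", 0, 0),
    ("fliege ich", "I will fly", 1, 2),
    ("zur Konferenz", "to the conference", 5, 6),
    ("nach Kanada", "in Canada", 3, 4),
]
toy_table = {(f, e): 0.5 for f, e, _, _ in segments}
result = score(segments, toy_table, lm_prob=1e-3)
```

The first two phrases are translated monotonically and incur no distortion cost; the last two are swapped, so they pay for jumps of 2 and 4 foreign positions.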


SLIDE 6

Decoder: Beam Search

e: Mary      f: *--------   p: .534
e: witch     f: -------*-   p: .182
e:           f: ----------  p: 1
e: ... did   f: *--------   p: .122
e: ... slap  f: *-***----   p: .043

Build English by hypothesis expansion

– from left to right
– search space is exponential in sentence length
– reduced by pruning weak hypotheses, aided by a future cost estimate
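
The hypothesis expansion described above can be sketched as a small stack decoder. This is a simplified illustration, not the decoder used in the experiments: it has no language model, no future cost estimate, and no hypothesis recombination, and the phrase table, probabilities, and names such as `beam_decode` and `Hyp` are hypothetical.

```python
import math
from collections import namedtuple

# One partial translation: score, covered foreign positions,
# last foreign position translated, English words produced so far.
Hyp = namedtuple("Hyp", "logp covered last_end words")

def beam_decode(f_words, phrase_table, beam_size=10, alpha=0.5, max_len=3):
    n = len(f_words)
    # stacks[k] holds hypotheses that have covered k foreign words
    stacks = [[] for _ in range(n + 1)]
    stacks[0].append(Hyp(0.0, frozenset(), -1, ()))
    for k in range(n):
        # pruning: keep only the best beam_size hypotheses per stack
        stacks[k] = sorted(stacks[k], key=lambda h: h.logp, reverse=True)[:beam_size]
        for hyp in stacks[k]:
            for start in range(n):
                for end in range(start + 1, min(start + max_len, n) + 1):
                    span = set(range(start, end))
                    if span & hyp.covered:
                        break  # longer spans from this start overlap as well
                    f_phrase = " ".join(f_words[start:end])
                    for e_phrase, prob in phrase_table.get(f_phrase, []):
                        jump = abs(start - hyp.last_end - 1)
                        logp = hyp.logp + math.log(prob) + jump * math.log(alpha)
                        new = Hyp(logp, hyp.covered | span, end - 1,
                                  hyp.words + tuple(e_phrase.split()))
                        stacks[len(new.covered)].append(new)
    return max(stacks[n], key=lambda h: h.logp, default=None)

# Toy phrase table with made-up probabilities.
table = {
    "Maria": [("Mary", 0.9)],
    "no": [("did not", 0.6), ("not", 0.3)],
    "daba una bofetada": [("slap", 0.7)],
    "a la": [("the", 0.6)],
    "bruja": [("witch", 0.8)],
    "verde": [("green", 0.8)],
    "bruja verde": [("green witch", 0.7)],
}
best = beam_decode("Maria no daba una bofetada a la bruja verde".split(), table)
```

English output is built left to right, while foreign words may be picked in any order at the price of the assumed distortion penalty `alpha`.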


SLIDE 7

Evaluation on Europarl Corpus

Collected from the European Parliament Proceedings

– Available at http://www.isi.edu/~koehn/

– 11 languages, 20 million words each

Test set

– German-English
– 1755 sentences of length 5-15


SLIDE 8

Three Methods for Learning Phrases

Word-alignment induced phrases

– similar to alignment templates [Och et al., 1999]

Syntactic phrases

– only syntactic phrases are learned
– same restriction as in recently proposed syntactic transfer models

Phrase-alignment

– joint model [Marcu and Wong, 2002]


SLIDE 9

Word Alignment Induced Phrases

Word alignment is generated using IBM Model 4

– bidirectional alignments e → f, f → e
– intersect alignments
– grow additional alignment points with heuristics

Collect phrase pairs consistent with word alignment
This is alignment templates without word classes [Och et al., 1999]


SLIDE 10

Word Alignment Induced Phrases (2)

[Figure: word alignment matrix for "Maria no daba una bofetada a la bruja verde" / "Mary did not slap the green witch"]

(Maria, Mary), (no, did not), (daba una bofetada, slap), (a la, the), (bruja, witch), (verde, green),
(Maria no, Mary did not), (no daba una bofetada, did not slap), (daba una bofetada a la, slap the), (bruja verde, green witch),
(Maria no daba una bofetada, Mary did not slap), (no daba una bofetada a la, did not slap the), (a la bruja verde, the green witch),
(Maria no daba una bofetada a la, Mary did not slap the), (daba una bofetada a la bruja verde, slap the green witch),
(no daba una bofetada a la bruja verde, did not slap the green witch),
(Maria no daba una bofetada a la bruja verde, Mary did not slap the green witch)
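
Collecting phrase pairs that are consistent with a word alignment can be sketched as follows. The alignment points below are an assumption: the alignment figure did not survive, so the links were inferred from the phrase pairs listed above. The function name `extract_phrases` is hypothetical, and because every word in this example is aligned, the sketch does not extend spans over unaligned boundary words as a fuller implementation would.

```python
def extract_phrases(f_words, e_words, alignment, max_len=None):
    """Return all phrase pairs consistent with the word alignment.

    alignment: set of (f_index, e_index) links. A pair of spans is consistent
    when no link connects a word inside one span to a word outside the other.
    """
    pairs = []
    for f_start in range(len(f_words)):
        for f_end in range(f_start, len(f_words)):
            if max_len and f_end - f_start + 1 > max_len:
                break
            # English positions linked to the foreign span
            linked = [e for f, e in alignment if f_start <= f <= f_end]
            if not linked:
                continue
            e_start, e_end = min(linked), max(linked)
            if max_len and e_end - e_start + 1 > max_len:
                continue
            # every link touching the English span must stay inside the foreign span
            if all(f_start <= f <= f_end
                   for f, e in alignment if e_start <= e <= e_end):
                pairs.append((" ".join(f_words[f_start:f_end + 1]),
                              " ".join(e_words[e_start:e_end + 1])))
    return pairs

f_words = "Maria no daba una bofetada a la bruja verde".split()
e_words = "Mary did not slap the green witch".split()
# Assumed links (foreign index, English index), inferred from the phrase list.
alignment = {(0, 0), (1, 1), (1, 2), (2, 3), (3, 3), (4, 3),
             (5, 4), (6, 4), (7, 6), (8, 5)}
pairs = extract_phrases(f_words, e_words, alignment)
```

Under these assumed links the sketch yields the same 17 phrase pairs as the list above; passing `max_len=3` would restrict extraction to short phrases.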


SLIDE 11

Syntactic Phrases

Syntactic phrases span whole constituents in parse tree
Motivation

– only these phrases are used in syntactic transfer models, e.g., [Yamada and Knight, 2002]
– does syntax help or hurt?

Extract syntactic phrase pairs

– parse both sides (with statistical parsers)
– use word alignment as before
– limit phrases to syntactic constituents in parse tree
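
The restriction to constituents can be sketched as a filter over extracted span pairs. The nested-list tree format, the hand-written bracketings (the slides use statistical parsers instead), and the names `constituent_spans` and `syntactic_only` are assumptions made for this illustration.

```python
def constituent_spans(tree, start=0):
    """Collect (start, end) word spans of all nodes in a nested-list tree.

    A tree is either a word (str) or a list of subtrees.
    Returns (set of inclusive spans, position after the last word).
    """
    if isinstance(tree, str):
        return {(start, start)}, start + 1
    spans, pos = set(), start
    for child in tree:
        child_spans, pos = constituent_spans(child, pos)
        spans |= child_spans
    spans.add((start, pos - 1))
    return spans, pos

def syntactic_only(span_pairs, f_spans, e_spans):
    """Keep only pairs whose spans are constituents on both sides."""
    return [(f, e) for f, e in span_pairs if f in f_spans and e in e_spans]

# Hypothetical bracketings of the example sentence pair.
f_tree = ["Maria", ["no", ["daba", "una", "bofetada"],
                    [["a", "la"], ["bruja", "verde"]]]]
e_tree = ["Mary", ["did", "not", ["slap", ["the", "green", "witch"]]]]
f_spans, _ = constituent_spans(f_tree)
e_spans, _ = constituent_spans(e_tree)

# (foreign span, English span) for four of the phrase pairs from the example.
candidates = [
    ((0, 0), (0, 0)),  # Maria / Mary
    ((1, 1), (1, 2)),  # no / did not
    ((5, 8), (4, 6)),  # a la bruja verde / the green witch
    ((2, 6), (3, 4)),  # daba una bofetada a la / slap the
]
kept = syntactic_only(candidates, f_spans, e_spans)
```

With these bracketings, "did not" and "slap the" are not constituents, so those pairs are dropped even though they are consistent with the word alignment.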


SLIDE 12

Phrase Alignment

Morgen fliege ich nach Kanada zur Konferenz
Tomorrow I will fly to the conference in Canada

[Figure: phrase alignment of the sentence pair, with aligned phrases numbered 1 to 5]

Direct Phrase Alignment of Parallel Corpus [Marcu and Wong, 2002]

Generative Story

– a number of concepts are created
– each concept generates a foreign and English phrase


SLIDE 13

Experiments

Comparison of core methods
Maximum phrase length
Lexical weighting
Phrase extraction heuristics
Simpler word alignment models
Other language pairs


SLIDE 14

Comparison of Core Methods

Same decoder, same training data, same language model

– except for IBM Model 4: uses greedy decoder [Germann et al., 2001]

WAIPh (word-alignment induced phrases) best, syntactic phrases very bad

[Figure: BLEU against training corpus size (10k to 320k) for WAIPh, Joint, Syn, and M4]

All following experiments on WAIPh only


SLIDE 15

Maximum Phrase Length

Maximum limit on length of phrases

– higher limit → larger phrase translation table
– all tables still fit into memory of modern machines

Phrase translation table size by maximum phrase length and training corpus size:

Max. Length   10k    20k    40k    80k     160k    320k
2             37k    70k    135k   250k    474k    882k
3             63k    128k   261k   509k    1028k   1996k
4             84k    176k   370k   736k    1536k   3152k
5             101k   215k   459k   925k    1968k   4119k
7             130k   278k   605k   1217k   2657k   5663k


SLIDE 16

Maximum Phrase Length (2)

Impact of limit on translation quality

– not much improvement if maximum length is extended beyond 3
– independent of training corpus size

[Figure: BLEU against training corpus size (10k to 320k) for maximum phrase lengths max2, max3, max4, max5, max7]


SLIDE 17

Lexical Weighting

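Augment phrase translation probability $\phi(\bar{f} \mid \bar{e})$ with lexical translation probabilities $w(f \mid e)$

[Figure: word alignment for "la bruja verde" / "the green witch": la–the, bruja–witch, verde–green]

Lexical weight: $p_w = w(\text{la} \mid \text{the}) \times w(\text{bruja} \mid \text{witch}) \times w(\text{verde} \mid \text{green})$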
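
A minimal sketch of the lexical weight computation for this example follows. The `w` values are made up, and the function name `lexical_weight` is hypothetical. The averaging over several linked English words and the `"NULL"` fallback for unaligned foreign words go beyond the one-to-one case on the slide; they are included as an assumption about how the general case is usually handled.

```python
import math

def lexical_weight(f_words, e_words, alignment, w):
    """Product over foreign words of the average w(f|e) over linked English words.

    alignment: set of (f_index, e_index) links.
    w: dict (f_word, e_word) -> lexical translation probability.
    Unaligned foreign words are scored against a "NULL" English token.
    """
    weight = 1.0
    for i, f in enumerate(f_words):
        linked = [e_words[j] for (fi, j) in alignment if fi == i] or ["NULL"]
        weight *= sum(w[(f, e)] for e in linked) / len(linked)
    return weight

# Made-up lexical translation probabilities for the slide's example.
w = {("la", "the"): 0.4, ("bruja", "witch"): 0.8, ("verde", "green"): 0.5}
alignment = {(0, 0), (1, 2), (2, 1)}  # la-the, bruja-witch, verde-green
p_w = lexical_weight(["la", "bruja", "verde"],
                     ["the", "green", "witch"], alignment, w)
```

With one link per foreign word, the result reduces to the plain product of the three lexical probabilities.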


SLIDE 18

Lexical Weighting

Improves translation quality

[Figure: BLEU against training corpus size (10k to 320k) for no-lex and lex]


SLIDE 19

Phrase Extraction Heuristics

Recall: word alignment based on intersection of bidirectional IBM Model 4 alignments + heuristics

[Figure: word alignment matrix for "Maria no daba una bofetada a la bruja verde" / "Mary did not slap the green witch"]


SLIDE 20

Phrase Extraction Heuristics (2)

Different phrases are learned if the heuristic used to create the word alignment is changed.

Variations in heuristics:

– only to directly neighboring
– also to diagonally neighboring
– also to non-neighboring
– prefer English-to-foreign or foreign-to-English
– use lexical probabilities or frequencies
– extend only to unaligned words
– ...
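
One possible combination of these variations can be sketched as follows: start from the intersection of the two directional alignments and grow it with points from their union that neighbor an existing point. The acceptance rule (a candidate must touch a word that is still unaligned), the `diagonal` switch, the toy alignments, and the function name `symmetrize` are assumptions chosen for the illustration, not the exact heuristics compared in the experiments.

```python
def symmetrize(e2f, f2e, diagonal=False):
    """Grow the intersection of two alignments toward their union.

    e2f, f2e: sets of (f_index, e_index) links from the two directions.
    diagonal: also consider diagonally neighboring points.
    """
    alignment = e2f & f2e
    union = e2f | f2e
    steps = [(-1, 0), (1, 0), (0, -1), (0, 1)]
    if diagonal:
        steps += [(-1, -1), (-1, 1), (1, -1), (1, 1)]
    added = True
    while added:  # repeat until a full pass adds nothing
        added = False
        for f, e in sorted(alignment):
            for df, de in steps:
                cand = (f + df, e + de)
                if cand in union and cand not in alignment:
                    f_free = all(p[0] != cand[0] for p in alignment)
                    e_free = all(p[1] != cand[1] for p in alignment)
                    if f_free or e_free:  # touches an unaligned word
                        alignment.add(cand)
                        added = True
    return alignment

# Toy directional alignments.
e2f = {(0, 0), (1, 1), (1, 2), (2, 3)}
f2e = {(0, 0), (1, 2), (2, 3), (3, 3), (4, 4)}
direct = symmetrize(e2f, f2e)
with_diag = symmetrize(e2f, f2e, diagonal=True)
```

In this toy case the point (4, 4) is reachable only through a diagonal step, so it is added when `diagonal=True` and left out otherwise.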


SLIDE 21

Phrase Extraction Heuristics (3)

No clear advantage to any strategy

– large differences, but ...
– ... depending on corpus size
– ... depending on language pair

[Figure: BLEU against training corpus size (10k to 320k) for heuristics diag-and, diag, base, e2f, f2e, union]


SLIDE 22

Simpler Word Alignment Models

Using simpler IBM Models for word alignment

– not much impact if simpler models are used
– simpler models computationally much cheaper

[Figure: BLEU against training corpus size (10k to 320k) for word alignments from m4, m3, m2, m1]


SLIDE 23

Other Language Pairs

Findings hold for other language pairs, other corpora

– Phrase translation better than IBM Model 4
– Lexicalization helps (about +0.01 BLEU)

Language Pair     Model4   Phrase   Lex
English-German    0.2040   0.2361   0.2449
French-English    0.2787   0.3294   0.3389
English-French    0.2555   0.3145   0.3247
Finnish-English   0.2178   0.2742   0.2806
Swedish-English   0.3137   0.3459   0.3554
Chinese-English   0.1190   0.1395   0.1418
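
The two claims can be checked against the table with a few lines of arithmetic. The scores are copied from the table; the variable names are illustrative only.

```python
# BLEU scores from the table: (Model4, Phrase, Lex)
bleu = {
    "English-German":  (0.2040, 0.2361, 0.2449),
    "French-English":  (0.2787, 0.3294, 0.3389),
    "English-French":  (0.2555, 0.3145, 0.3247),
    "Finnish-English": (0.2178, 0.2742, 0.2806),
    "Swedish-English": (0.3137, 0.3459, 0.3554),
    "Chinese-English": (0.1190, 0.1395, 0.1418),
}

# Gain of phrase-based translation over Model 4, and of lexical weighting
# over plain phrase translation, per language pair.
phrase_gain = {pair: phrase - m4 for pair, (m4, phrase, lex) in bleu.items()}
lex_gain = {pair: lex - phrase for pair, (m4, phrase, lex) in bleu.items()}

mean_lex_gain = sum(lex_gain.values()) / len(lex_gain)

for pair in bleu:
    print(f"{pair:16s} phrase vs. M4: {phrase_gain[pair]:+.4f}  "
          f"lex vs. phrase: {lex_gain[pair]:+.4f}")
```

Both gains are positive for every language pair in the table, and the lexical weighting gain averages a little under 0.01 BLEU.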


SLIDE 24

Conclusions

Phrase-based translation better than word-based translation

Limit to syntactic phrases hurts a lot
Small phrases (up to 3 words) good enough
Lexical weighting helpful
Phrase extraction heuristics matter, but the best heuristics vary with corpus size and language pair
