

SLIDE 1

Tree-based and Forest-Based Translation

Liang Huang

Joint work with Kevin Knight (ISI), Aravind Joshi (Penn), Haitao Mi and Qun Liu (ICT)

UC Berkeley, Feb 6, 2009

SLIDE 2

Translation is hard!

zi zhu zhong duan
自 助 终 端
self help terminal device

machine output: "help oneself terminating machine"
intended: "self-service terminal" (ATM)


SLIDE 7

  • clear evidence that MT is used in real life.


SLIDE 11

How do people translate?

  • 1. understand the source language sentence
  • 2. generate the target language translation

布什  与  沙龙  举行  了  会谈
Bùshí  yǔ  Shālóng  jǔxíng  le  huìtán
Bush  and/with  Sharon  hold  [past]  meeting

"Bush held a meeting with Sharon"


SLIDE 15

How do compilers translate?

  • 1. parse high-level language program into a syntax tree
  • 2. generate intermediate or machine code accordingly

x3 = y + 3;

LD   R1, id2
ADDF R1, R1, #3.0  // add float
RTOI R2, R1        // real to int
ST   id1, R2

syntax-directed translation (~1960)

SLIDE 16

Syntax-Directed Machine Translation

  • 1. parse the source-language sentence into a tree
  • 2. recursively convert it into a target-language sentence

(Irons 1961; Lewis, Stearns 1968; Aho, Ullman 1972) ==> (Huang, Knight, Joshi 2006)

SLIDE 18

Syntax-Directed Machine Translation

  • recursive rewrite by pattern-matching

(Huang, Knight, Joshi 2006); rules from (Galley et al., 04)

SLIDE 22

Syntax-Directed Machine Translation?

  • recursively solve unfinished subproblems

(partial translations grow step by step: "with" => "with Bush" => "held with Bush")

(Huang, Knight, Joshi 2006); rules from (Galley et al., 04)

SLIDE 27

Syntax-Directed Machine Translation?

  • continue pattern-matching

(final output: "Bush held a meeting with Sharon")

(Huang, Knight, Joshi 2006); rules from (Galley et al., 04)
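The recursive rewrite sketched on these slides can be illustrated with a toy tree-to-string translator. This is a minimal sketch: the tree encoding, rule inventory, and lexicon below are invented stand-ins for the Bush/Sharon example, not the actual (Galley et al., 04) rule format.

```python
from functools import lru_cache

# Toy parse tree as nested tuples: (label, child, child, ...);
# preterminals are (label, word). Labels are illustrative.
TREE = ("S",
        ("NP", "Bushi"),
        ("VP",
         ("PP", ("P", "yu"), ("NP", "Shalong")),
         ("VPB", ("VV", "juxing"), ("AS", "le"), ("NP", "huitan"))))

# One-level rules keyed by (root, child labels); each template reorders
# the children's translations by index -- e.g. the VP rule moves the PP
# after the verb phrase, and the VPB rule drops the aspect marker "le".
RULES = {
    ("S",   ("NP", "VP")):        "{0} {1}",
    ("VP",  ("PP", "VPB")):       "{1} {0}",
    ("PP",  ("P", "NP")):         "{0} {1}",
    ("VPB", ("VV", "AS", "NP")):  "{0} {2}",
}

LEXICON = {"Bushi": "Bush", "yu": "with", "Shalong": "Sharon",
           "juxing": "held", "le": "", "huitan": "a meeting"}

@lru_cache(maxsize=None)
def translate(node):
    """Top-down pattern-matching with memoization (linear time)."""
    if isinstance(node[1], str):              # preterminal: look up word
        return LEXICON[node[1]]
    children = node[1:]
    template = RULES[(node[0], tuple(c[0] for c in children))]
    return template.format(*(translate(c) for c in children))

print(translate(TREE))   # -> Bush held a meeting with Sharon
```

Memoization here is trivial for a tree, but it is exactly the hook that later lets the same traversal run over a shared forest.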

SLIDE 31

Pros: simple, fast, and expressive

  • simple architecture: separate parsing and translation
  • efficient linear-time dynamic programming
  • “soft decision” at each node on which rule to use
  • (trivial) depth-first traversal with memoization
  • expressive multi-level rules for syntactic divergence (beyond CFG)

(Huang, Knight, Joshi 2006); rules from (Galley et al., 04)

SLIDE 35

Cons: Parsing Errors

  • ambiguity is a fundamental problem in natural languages
  • we will probably never have perfect parsers (unlike in compiling)
  • parsing errors affect translation quality!

emergency exit => “safe exports”?
mind your head => “meet cautiously”?
SLIDE 38

Exponential Explosion of Ambiguity

I saw her duck.

  • how about...
  • I saw her duck with a telescope.
  • I saw her duck with a telescope in the garden...

NLP == dealing with ambiguities.


SLIDE 47

Tackling Ambiguities in Translation

  • simplest idea: take top-k trees rather than 1-best parse
  • but only covers tiny fraction of the exponential space
  • and these k-best trees are very similar
  • e.g., 50-best trees ~ 5-6 binary ambiguities (2^5 < 50 < 2^6)
  • very inefficient to translate on these very similar trees
  • most ambitious idea: combining parsing and translation
  • start from the input string, rather than 1-best tree
  • essentially considering all trees (search space too big)
  • our approach: packed forest (poly. encoding of exp. space)
  • almost as fast as 1-best, almost as good as combined
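The "5-6 binary ambiguities" point is just arithmetic: n independent binary ambiguities yield 2^n parses, so a k-best list resolves only about log2(k) of them. A one-line check:

```python
import math

# n independent binary ambiguities give 2**n parses, so a k-best list
# covers only ~log2(k) of them: a 50-best list resolves just 5-6.
def ambiguities_covered(k):
    return math.log2(k)

print(round(ambiguities_covered(50), 2))   # between 5 and 6, since 2**5 < 50 < 2**6
```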

SLIDE 48

Outline

  • Overview: Tree-based Translation
  • Forest-based Translation
  • Packed Forest
  • Translation on a Forest
  • Experiments
  • Forest-based Rule Extraction
  • Large-scale Experiments

16

SLIDE 49

From Lattices to Forests

  • common theme: polynomial encoding of exponential space
  • forest generalizes “lattice/graph” from finite-state world
  • paths => trees (in DP: knapsack vs. matrix-chain multiplication)
  • graph => hypergraph; regular grammar => CFG

17

(Earley 1970; Billot and Lang 1989)


SLIDE 51

Packed Forest

  • a compact representation of many, many parses
  • by sharing common sub-derivations
  • polynomial-space encoding of an exponentially large set

(Klein and Manning, 2001; Huang and Chiang, 2005)

0 I 1 saw 2 him 3 with 4 a 5 mirror 6

(a hypergraph: nodes and hyperedges)
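As a concrete illustration, a packed forest can be stored as a hypergraph: a map from each node to its incoming hyperedges, where a hyperedge lists its ordered child (tail) nodes. The mini-forest below is a hand-made sketch of the two PP-attachment readings of "I saw him with a mirror" (node names are invented); a bottom-up DP then counts the packed trees in time linear in the number of hyperedges.

```python
# Hypothetical packed forest for "I saw him with a mirror":
# node -> list of hyperedges; each hyperedge is the ordered list of
# its child (tail) nodes. Leaves get one hyperedge with no children.
FOREST = {
    "NP[0,1]": [[]], "V[1,2]": [[]], "NP[2,3]": [[]], "PP[3,6]": [[]],
    "VP[1,3]": [["V[1,2]", "NP[2,3]"]],
    "NP[2,6]": [["NP[2,3]", "PP[3,6]"]],          # him + with a mirror
    "VP[1,6]": [["VP[1,3]", "PP[3,6]"],           # (saw him)(with a mirror)
                ["V[1,2]",  "NP[2,6]"]],          # saw (him with a mirror)
    "S[0,6]":  [["NP[0,1]", "VP[1,6]"]],
}

def count_trees(node, memo=None):
    """Trees packed under `node`: sum over incoming hyperedges of the
    product of the children's counts (linear in forest size)."""
    if memo is None:
        memo = {}
    if node not in memo:
        total = 0
        for tails in FOREST[node]:
            prod = 1
            for t in tails:
                prod *= count_trees(t, memo)
            total += prod
        memo[node] = total
    return memo[node]

print(count_trees("S[0,6]"))   # -> 2 (both readings share all subparts)
```

With one extra hyperedge per additional ambiguity the count doubles, while the hypergraph grows only by a constant: that is the polynomial-space encoding of an exponential set.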

SLIDE 52

Forest-based Translation

  • pattern-matching on the forest (linear-time in forest size)

布什 与 沙龙 举行 了 会谈

(与 is ambiguous between “and” and “with”)

SLIDE 60

Translation Forest

(figure: translation hypergraph with nodes for “Bush”, “held a meeting”, “Sharon”, yielding “Bush held a meeting with Sharon”)

SLIDE 64

The Whole Pipeline

input sentence => parser => parse forest => forest pruning => pruned forest => pattern-matching w/ translation rules (exact) => translation forest => integrating language models (cube pruning) => translation+LM forest => 1-best translation / k-best translations

packed forests throughout (Huang and Chiang, 2005; 2007; Chiang, 2007)
SLIDE 66

Parse Forest Pruning

  • prune unpromising hyperedges
  • principled way: inside-outside
  • first compute Viterbi inside cost β and outside cost α
  • then αβ(e) = α(v) + c(e) + β(u) + β(w)
  • the cost of the best derivation that traverses hyperedge e = (u, w → v)
  • similar to “expected count” in EM
  • prune away hyperedges with αβ(e) - αβ(TOP) > p for some threshold p

(figure: hyperedge e from tails u, w to head v; α(v) outside, β(u) and β(w) inside)

Jonathan Graehl: relatively useless pruning
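The inside-outside pruning above can be sketched in a few lines. This is a minimal sketch on a toy forest with made-up costs (negative log probabilities, lower is better); node names and the forest shape are illustrative, not from the actual system.

```python
import math

# Toy weighted forest: head -> [(cost, [tail nodes]), ...]
FOREST = {
    "TOP": [(0.0, ["VP1"]), (0.0, ["VP2"])],
    "VP1": [(1.0, ["A", "B"])],
    "VP2": [(3.0, ["A", "C"])],
    "A": [(0.5, [])], "B": [(0.5, [])], "C": [(0.5, [])],
}

def inside(node, beta):
    """Viterbi inside cost: best derivation under `node`."""
    if node not in beta:
        beta[node] = min(c + sum(inside(t, beta) for t in tails)
                         for c, tails in FOREST[node])
    return beta[node]

def heads_first(root):
    """Topological order with every head before its tails
    (reverse post-order over the acyclic hypergraph)."""
    order, seen = [], set()
    def visit(v):
        if v in seen:
            return
        seen.add(v)
        for _, tails in FOREST[v]:
            for t in tails:
                visit(t)
        order.append(v)
    visit(root)
    return order[::-1]

def prune(root, p):
    """Drop hyperedges whose best traversing derivation is worse than
    the global best by more than p: alpha-beta(e) - alpha-beta(TOP) > p."""
    beta = {}
    best = inside(root, beta)                  # = alpha-beta(TOP)
    alpha = {n: math.inf for n in FOREST}
    alpha[root] = 0.0
    for v in heads_first(root):                # Viterbi outside pass
        for c, tails in FOREST[v]:
            ab = alpha[v] + c + sum(beta[t] for t in tails)
            for t in tails:                    # child outside = ab minus its inside
                alpha[t] = min(alpha[t], ab - beta[t])
    return {v: [(c, tails) for c, tails in FOREST[v]
                if alpha[v] + c + sum(beta[t] for t in tails) - best <= p]
            for v in FOREST}

pruned = prune("TOP", 1.0)   # keeps the VP1 derivation, drops VP2's edges
```

Note that αβ(TOP) equals the Viterbi inside cost of the root, so a single inside pass and a single outside pass suffice before the threshold test.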

SLIDE 67

Small-Scale Experiments

  • Chinese-to-English translation
  • on a tree-to-string system similar to (Liu et al., 2006)
  • 31k sentence pairs (0.8M Chinese & 0.9M English words)
  • GIZA++ aligned
  • trigram language model trained on the English side
  • dev: NIST 2002 (878 sent.); test: NIST 2005 (1082 sent.)
  • Chinese side parsed by the parser of Xiong et al. (2005)
  • modified to output a forest for each sentence (Huang 2008)
  • BLEU score: 1-best baseline: 0.2430 vs. Pharaoh: 0.2297

SLIDE 68

k-best trees vs. forest-based

1.7 BLEU improvement over 1-best, 0.8 over 30-best, and even faster!

k = ~6.1×10^8 trees vs. ~2×10^4 trees

SLIDE 69

forest as virtual ∞-best list

  • how often is the i-th-best tree picked by the decoder?

32% beyond 100-best; 20% beyond 1000-best

(the forest is effectively a ~6.1×10^8-best list)

suggested by Mark Johnson


SLIDE 73

wait a sec... where are the rules from?

xiǎoxīn
小心 X <=> be careful not to X
小心 VP <=> be careful not to VP
小心 NP <=> be careful of NP
. . .

xiǎoxīn gǒu
小心 狗 <=> be aware of dog
SLIDE 74

Outline

  • Overview: Tree-based Translation
  • Forest-based Translation
  • Forest-based Rule Extraction
  • background: tree-based rule extraction (Galley et al., 2004)
  • extension to forest-based
  • large-scale experiments

30

SLIDE 75

Where are the rules from?

  • source parse tree, target sentence, and alignment
  • compute target spans

GHKM (Galley et al. 2004; 2006)

SLIDE 76

Where are the rules from?

  • source parse tree, target sentence, and alignment
  • well-formed fragment: contiguous and faithful t-span
  • admissible set

GHKM (Galley et al. 2004; 2006)
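The span computation behind "contiguous and faithful" can be sketched directly. This is a toy: the alignment below is hand-made for the Bush/Sharon example, and the function names are invented for illustration, not GHKM's actual implementation.

```python
# Hypothetical word alignment (source position -> target positions).
# source: 0:Bushi 1:yu 2:Shalong 3:juxing 4:le 5:huitan
# target: 0:Bush 1:held 2:a 3:meeting 4:with 5:Sharon
ALIGN = {0: [0], 1: [4], 2: [5], 3: [1], 4: [], 5: [2, 3]}
N_SRC = 6

def t_span(src_positions):
    """Closure span [lo, hi] of target words aligned to these source words."""
    pts = [t for s in src_positions for t in ALIGN[s]]
    return (min(pts), max(pts)) if pts else None

def admissible(src_positions):
    """Well-formed fragment: the t-span exists and is faithful, i.e. no
    source word outside the fragment aligns into the span."""
    span = t_span(src_positions)
    if span is None:
        return False
    lo, hi = span
    inside = set(src_positions)
    return all(not any(lo <= t <= hi for t in ALIGN[s])
               for s in range(N_SRC) if s not in inside)

print(admissible([1, 2]))      # PP "yu Shalong" -> span (4,5): True
print(admissible([1, 2, 3]))   # span (1,5) contains huitan's words: False
```

A tree (or forest) node is then admissible exactly when the source positions of its yield pass this check, which is what licenses cutting a rule at that node.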

SLIDE 81

Forest-based Rule Extraction

  • same cut set computation; different fragmentation

also in (Wang, Knight, Marcu, 2007)

SLIDE 85

Forest-based Rule Extraction

  • same admissible set definition; different fragmentation

SLIDE 89

Forest-based Rule Extraction

  • forest can extract smaller chunks of rules


SLIDE 93

The Forest² Pipeline

training time:
source sentence => parser => 1-best/forest
source + target sentences => aligner => word alignment
1-best/forest + word alignment => rule extractor => translation ruleset

translation time:
source sentence => parser => 1-best/forest => pattern-matcher (w/ ruleset) => target sentence

SLIDE 94

Forest vs. k-best Extraction

1.0 BLEU improvement over 1-best, twice as fast as 30-best extraction

(~10^8 trees)

SLIDE 95

Forest²

  • FBIS: 239k sentence pairs (7M/9M Chinese/English words)
  • forest in both extraction and decoding
  • forest² result is 2.5 points better than 1-best²
  • and outperforms Hiero (Chiang 2007) by quite a bit

BLEU scores (rules from × translating on):

rules from \ translating on    1-best tree    30-best trees    forest
1-best tree                    0.2560         0.2634           0.2679
forest                         0.2674         0.2767           0.2816
Hiero                          0.2738

SLIDE 96

Translation Examples

  • src 鲍威尔 说 与 阿拉法特 会谈 很 重要
    Bàowēiěr shuō yǔ Ālāfǎtè huìtán hěn zhòngyào
    Powell say with Arafat talk very important
  • 1-best² Powell said the very important talks with Arafat
  • forest² Powell said his meeting with Arafat is very important
  • hiero Powell said very important talks with Arafat

SLIDE 97

Conclusions

  • main theme: efficient syntax-directed translation
  • forest-based translation
  • forest = “underspecified syntax”: polynomial vs. exponential
  • still fast (with pruning), yet does not commit to the 1-best tree
  • translating on millions of trees is faster than on just the top-k trees
  • forest-based rule extraction: improving rule set quality
  • very simple idea, but works well in practice
  • significant improvement over 1-best syntax-directed
  • final result outperforms Hiero by quite a bit

SLIDE 98

Forest is your friend in machine translation. Help save the forest.

More “forest-based” algorithms in my thesis (this talk is about Chap. 6).

SLIDE 99

self-service terminals carefully slide http://translate.google.com

SLIDE 102

Larger Decoding Experiments (ACL)

  • 2.2M sentence pairs (57M Chinese and 62M English words)
  • larger trigram models (1/3 of Xinhua Gigaword)
  • also use bilingual phrases (BP) as flat translation rules
  • phrases that are consistent with syntactic constituents
  • forest enables larger improvement with BP

BLEU scores:

                 T2S      T2S+BP
1-best tree      0.2666   0.2939
30-best trees    0.2755   0.3084
forest           0.2839   0.3149
improvement      +1.7     +2.1