Faster Decoding for Phrases and Syntax Kenneth Heafield Translation - - PowerPoint PPT Presentation



SLIDE 1

Faster Decoding for Phrases and Syntax

Kenneth Heafield

SLIDE 2

Translation is Expensive

“speed-up in tuning time but affects the performance” “18 days using 12 cores”

[Williams et al, WMT 2014]

“Time-sensitive BLEU score”

[Chung and Galley, 2012]

“Due to time constraints, this procedure was not used”

[Servan et al, WMT 2012]

⇒ Routine Quality Compromises

Introduction Problem Cube Pruning Incremental Conclusion

2

SLIDE 3

SLIDE 4

Blame the Language Model

“LM queries often account for more than 50% of the CPU”

[Green et al, WMT 2014]

SLIDE 5

Blame the Language Model

“LM queries often account for more than 50% of the CPU”

[Green et al, WMT 2014]

Faster queries (KenLM) More effective queries

SLIDE 8

1 Decoding problem
2 Cube pruning
3 Incremental

SLIDE 9

Decoding Example: Input

Le garçon a vu l'homme avec un télescope

SLIDE 10

Decoding Example: Parse with SCFG

S:S X:NP X:VP X:VP X:PP X:V X:NP (parse-tree labels)
Le garçon a vu l'homme avec un télescope

SLIDE 11

Decoding Example: Read Target Side

S:S X:NP X:VP X:VP X:PP X:V X:NP (parse-tree labels)
Le garçon → The boy | A boy
a vu → seen | saw | view
l'homme → man | the man | some men
avec un télescope → with the telescope | to an telescope | with a telescope

SLIDE 12

Decoding Example: One Constituent

S:S X:NP X:VP X:VP X:PP X:V X:NP (parse-tree labels)
Le garçon → The boy | A boy
a vu → seen | saw | view
l'homme → man | the man | some men
avec un télescope → with the telescope | to an telescope | with a telescope

SLIDE 13

X:VP → X:V X:NP
a vu: Hyp = seen | saw | view
l'homme: Hyp = man | the man | some men

SLIDE 14

X:VP → X:V X:NP
a vu: Hyp = seen | saw | view
l'homme: Hyp = man | the man | some men
a vu l'homme: Hypothesis = seen man | seen the man | seen some men | saw man | saw the man | saw some men | view man | view the man | view some men

SLIDE 15

X:VP X:V X:NP a vu Hyp Score seen

  • 3.8

saw

  • 4.0

view

  • 4.0

l’homme Hyp Score man

  • 3.6

the man

  • 4.3

some men

  • 6.3

X:VP a vu l’homme Hypothesis Score seen man

  • 8.8

seen the man

  • 7.6

seen some men

  • 9.5

saw man

  • 8.3

saw the man

  • 6.9

saw some men

  • 8.5

view man

  • 8.5

view the man

  • 8.9

view some men

  • 10.8

SLIDE 16

X:VP → X:V X:NP

a vu (X:V):
  Hyp    Score
  seen   3.8
  saw    4.0
  view   4.0

l'homme (X:NP):
  Hyp       Score
  man       3.6
  the man   4.3
  some men  6.3

a vu l'homme (X:VP), sorted by score:
  Hypothesis      Score
  saw the man     6.9
  seen the man    7.6
  saw man         8.3
  saw some men    8.5
  view man        8.5
  seen man        8.8
  view the man    8.9
  seen some men   9.5
  view some men   10.8

SLIDE 17

(tables repeated from Slide 16)

Scores do not sum

SLIDE 18

(tables repeated from Slide 16)

Pruning is Approximate

SLIDE 19

Appending Strings

Hypotheses are built by string concatenation. Language model probability changes when this is done:

p(saw the man) / [p(saw) p(the man)] = [p(the | saw) p(man | saw the)] / [p(the) p(man | the)]

SLIDE 20

Appending Strings

Hypotheses are built by string concatenation. Language model probability changes when this is done:

p(saw the man) / [p(saw) p(the man)] = [p(the | saw) p(man | saw the)] / [p(the) p(man | the)]

Log probability is part of the score
⇒ Scores do not sum
⇒ Local decisions may not be globally optimal
⇒ Search is hard.
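The boundary correction behind this chain of implications can be sketched with a toy bigram model (all probabilities below are invented for illustration): concatenating two scored strings only requires re-scoring words at the seam, where p(the | saw) replaces p(the).

```python
import math

# Toy bigram probabilities (hypothetical numbers, illustration only).
bigram = {("saw", "the"): 0.20, ("the", "man"): 0.10}
unigram = {"saw": 0.01, "the": 0.05, "man": 0.02}

def p(word, prev=None):
    # Back off to the unigram if the bigram was never seen.
    if prev is not None and (prev, word) in bigram:
        return bigram[(prev, word)]
    return unigram[word]

def score(words):
    # Log probability of a string under the bigram model.
    total = math.log(p(words[0]))
    for prev, word in zip(words, words[1:]):
        total += math.log(p(word, prev))
    return total

left, right = ["saw"], ["the", "man"]
naive = score(left) + score(right)
# Concatenation revises the boundary: p(the | saw) replaces p(the).
correction = math.log(p("the", "saw")) - math.log(p("the"))
assert abs(naive + correction - score(left + right)) < 1e-12
```

With a bigram model only the first word of the right-hand string is revised; a longer-order model would revise more boundary words, which is exactly why boundary words dominate the state.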

SLIDE 21

1 Decoding problem
2 Cube pruning
3 Incremental

SLIDE 22

Beam Search

            man 3.6          the man 4.3        some men 6.3
seen 3.8    seen man 8.8     seen the man 7.6   seen some men 9.5
saw 4.0     saw man 8.3      saw the man 6.9    saw some men 8.5
view 4.0    view man 8.5     view the man 8.9   view some men 10.8

[Lowerre, 1976; Chiang, 2005]
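The grid above can be reproduced with a brute-force beam search sketch. The combined costs are copied from the slide (the decoder computes them with the language model; they do not simply sum):

```python
# Beam search over the slide's example: build every combination of the
# verb and noun-phrase hypotheses, then keep only the best (lowest-cost) k.
verbs = {"seen": 3.8, "saw": 4.0, "view": 4.0}
nps = {"man": 3.6, "the man": 4.3, "some men": 6.3}
combined_cost = {
    "seen man": 8.8, "seen the man": 7.6, "seen some men": 9.5,
    "saw man": 8.3, "saw the man": 6.9, "saw some men": 8.5,
    "view man": 8.5, "view the man": 8.9, "view some men": 10.8,
}

def beam_search(k):
    # Make every dish, keep the best k, throw the rest out.
    all_hyps = [(combined_cost[f"{v} {n}"], f"{v} {n}") for v in verbs for n in nps]
    return sorted(all_hyps)[:k]

print(beam_search(3))  # [(6.9, 'saw the man'), (7.6, 'seen the man'), (8.3, 'saw man')]
```

The cost is building all nine combinations before discarding most of them; cube pruning below avoids exactly that.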

SLIDE 23

Cube Pruning

Grid rows: seen 3.8, saw 4.0, view 4.0; columns: man 3.6, the man 4.3, some men 6.3
Queue: seen man (3.8 + 3.6 = 7.4)
[Chiang, 2007]

SLIDE 24

Cube Pruning

Grid rows: seen 3.8, saw 4.0, view 4.0; columns: man 3.6, the man 4.3, some men 6.3
Popped so far: seen man 8.8
Queue: saw man (4.0 + 3.6 = 7.6), seen the man (3.8 + 4.3 = 8.1)
[Chiang, 2007]

SLIDE 25

Cube Pruning

Grid rows: seen 3.8, saw 4.0, view 4.0; columns: man 3.6, the man 4.3, some men 6.3
Popped so far: seen man 8.8, saw man 8.3
Queue: view man (4.0 + 3.6 = 7.6), seen the man (3.8 + 4.3 = 8.1), saw the man (4.0 + 4.3 = 8.3)
[Chiang, 2007]

SLIDE 26

Cube Pruning

Grid rows: seen 3.8, saw 4.0, view 4.0; columns: man 3.6, the man 4.3, some men 6.3
Popped so far: seen man 8.8, saw man 8.3, view man 8.5
Queue: seen the man (3.8 + 4.3 = 8.1), saw the man (4.0 + 4.3 = 8.3), view the man (4.0 + 4.3 = 8.3)
[Chiang, 2007]
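The frames above can be sketched as a priority-queue loop over the two sorted lists, a minimal Python rendering of Chiang's cube pruning. Here true_cost stands in for the decoder's real combined model score, which the grid sum only approximates:

```python
import heapq

# Cube pruning sketch over the slide's example (Chiang, 2007): the two
# ingredient lists are sorted by cost, and a priority queue explores
# grid neighbours of each popped cell instead of building every dish.
verbs = [("seen", 3.8), ("saw", 4.0), ("view", 4.0)]
nps = [("man", 3.6), ("the man", 4.3), ("some men", 6.3)]
true_cost = {
    "seen man": 8.8, "seen the man": 7.6, "seen some men": 9.5,
    "saw man": 8.3, "saw the man": 6.9, "saw some men": 8.5,
    "view man": 8.5, "view the man": 8.9, "view some men": 10.8,
}

def cube_prune(k):
    queue = [(verbs[0][1] + nps[0][1], 0, 0)]  # start at the top-left corner
    visited, popped = {(0, 0)}, []
    while queue and len(popped) < k:
        _, i, j = heapq.heappop(queue)
        phrase = f"{verbs[i][0]} {nps[j][0]}"
        popped.append((true_cost[phrase], phrase))
        # Push the two grid neighbours, ranked by their optimistic sum.
        for ni, nj in ((i + 1, j), (i, j + 1)):
            if ni < len(verbs) and nj < len(nps) and (ni, nj) not in visited:
                visited.add((ni, nj))
                heapq.heappush(queue, (verbs[ni][1] + nps[nj][1], ni, nj))
    return popped

print(cube_prune(4))
```

Popping four hypotheses visits seen man, saw man, view man, then seen the man, matching the queue frames above; only k cells of the grid are ever scored with the full model.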

SLIDE 27

Beam Search: Make every dish. Keep the best k, throw the rest out.
Cube pruning: Combine the best ingredients. Only make k dishes.

SLIDE 28

Cube Pruning Hypotheses are Atomic

Strings on the left: is a | are a
Strings on the right: countries that | countries which | country
Combined strings: is a countries that | are a countries that | are a countries which | . . .

No notion that “a countries” is bad.

SLIDE 29

Beam Search: Make every dish. Keep the best k, throw the rest out.
Cube pruning: Combine the best ingredients. Only make k dishes.
Coarse-to-Fine: Make small portions, taste, and order the best ones.

SLIDE 30

Coarse-to-Fine

Decode multiple times, adding detail each time:

Increased LM order, words instead of classes

Detect and prune “a countries” with a bigram LM.

[Zhang et al, 2008; Petrov et al, 2008]

SLIDE 31

Coarse-to-Fine (continued)

Requires tuning each pruning pass. Operates in lock step.

SLIDE 32

Coarse-to-Fine (continued)

Can coarse-to-fine be done on the fly?

SLIDE 33

1 Decoding problem
2 Cube pruning
3 Incremental

SLIDE 34

Observations

Competing translations have words in common: is a, are a

SLIDE 35

Observations

Competing translations have words in common: is a, are a
Words at the boundary matter most: a + country, a + countries

SLIDE 36

Observations (continued)

Emphasize boundary words

SLIDE 37

Beam Search: Make every dish. Keep the best k, throw the rest out.
Cube pruning: Combine the best ingredients. Only make k dishes.
Coarse-to-Fine: Make small portions, taste, and order the best ones.
Incremental: Taste during cooking. Share ingredients.

SLIDE 38

Boundary Words

1 Left-to-right phrase-based: one side
2 Bottom-up syntax: both sides

SLIDE 39

Partial Translations

Plain text

The United Kingdom is a + . . .
Scotland and Wales are a + . . .

Tree

ε
└─ a
   ├─ is ← The United Kingdom
   └─ are ← Scotland and Wales

SLIDE 40

Phrase Continuations

Plain text

. . . + countries that . . . + countries which . . . + country

Tree

ε
├─ country
└─ countries
   ├─ which
   └─ that

SLIDE 41

ε ─ a ─ { is ← The United Kingdom ; are ← Scotland and Wales }
+
ε ─ { country ; countries ─ { which ; that } }

SLIDE 44

ε ─ a ─ { is ← The United Kingdom ; are ← Scotland and Wales }
+
ε ─ { country ; countries ─ { which ; that } }

Does the model like “a + countries”?

SLIDE 45

Exploring and Backtracking

Does the model like “a + countries”?
Yes: Try more detail.
No: Consider alternatives.

SLIDE 46

Exploring and Backtracking

Does the model like “a + countries”?
Yes: Try more detail.
No: Consider alternatives.
Formally: best-first search with a priority queue.

SLIDE 47

The queue entry

“a + ε”

splits into

Best Child: “a + countries”
Other Children: “a + country”

SLIDE 48

Scores come from the best descendant:

Score(a) = max{Score(is a), Score(are a)}

SLIDE 49

Scores come from the best descendant:

Score(a) = max{Score(is a), Score(are a)}

The language model updates scores:

Score(a + countries) < Score(a) + Score(countries)

SLIDE 50


Score(a + countries) < Score(a) + Score(countries)
Formally: p(countries | a) replaces p(countries)

SLIDE 51

Best-First Algorithm Summary

1 Populate the queue with ε + ε
2 Loop until k complete options have been found: split the top-scoring option
3 Build a tree from the k complete options
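A minimal sketch of this splitting loop, with invented scores throughout: a queue entry is either a concrete hypothesis or a breadcrumb "prefix + remaining continuations", scored optimistically by its best remaining child. The exact() function is a hypothetical stand-in for the language model update in which p(countries | a) replaces p(countries).

```python
import heapq

# Best-first search with breadcrumbs (illustrative sketch; all scores
# are made up). Splitting the top entry peels off its best child at the
# true LM-adjusted score and re-queues a breadcrumb for the rest.
continuations = [("countries", -1.2), ("country", -2.5)]  # sorted, best first
prefixes = {"is a": -1.0, "are a": -1.4}

def exact(prefix, word, base):
    # Hypothetical stand-in for log p(word | prefix) - log p(word).
    bonus = -0.2 if word == "countries" and prefix == "is a" else -1.3
    return base + bonus

def best_first(k):
    queue, complete = [], []
    for prefix, pscore in prefixes.items():
        # Optimistic score: best child, as if probabilities simply summed.
        heapq.heappush(queue, (-(pscore + continuations[0][1]), prefix, 0, False))
    while queue and len(complete) < k:
        neg, prefix, i, is_exact = heapq.heappop(queue)
        if is_exact:  # a concrete hypothesis reached the top: it is done
            complete.append((f"{prefix} {continuations[i][0]}", -neg))
            continue
        word, wscore = continuations[i]
        # Best child: re-score exactly with the language model.
        heapq.heappush(queue, (-exact(prefix, word, prefixes[prefix] + wscore), prefix, i, True))
        if i + 1 < len(continuations):  # breadcrumb for the other children
            heapq.heappush(queue, (-(prefixes[prefix] + continuations[i + 1][1]), prefix, i + 1, False))
    return complete

print(best_first(2))
```

Because the two groups compete in one queue, a bad boundary pair ("are a" + its best child here) is only refined if everything better has already been explored.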

SLIDE 52

Summary

Translations are assembled from left to right. Partial translations often share suffixes. Phrases often share prefixes. Test suffixes and prefixes before full combinations.

SLIDE 53

Experiment

Task: Chinese–English
Source: Stanford
Model: Phrase-based
Software: My own decoder, mtplz, versus Moses

SLIDE 54

Phrase-Based Results

[Plot: average model score (27.5 to 29.5) vs CPU seconds/sentence (1 to 4); curves: mtplz with Incremental, Moses with Cube Pruning]

SLIDE 55

Phrase-Based Results

[Plot: uncased BLEU (13 to 15) vs CPU seconds/sentence (1 to 4); curves: mtplz with Incremental, Moses with Cube Pruning]

SLIDE 56

Search

The language model cares most about adjacent words. Test them first.

Share work for shared words.

SLIDE 57

Boundary Words

1 Left-to-right phrase-based: one side
2 Bottom-up syntax: both sides

SLIDE 58

Bottom-Up Syntax: Both Sides

is a X:NP1 </s>
is a X:NP1 that
How do we find the best value to substitute?

Manage words on both sides.

SLIDE 59

Example Hypotheses

Hypotheses (boundary words form the Left State and Right State):
countries that maintain diplomatic relations with North Korea .
countries that have an embassy in DPR Korea .
country that maintains some diplomatic ties in North Korea .
nations which has some diplomatic ties with DPR Korea .
country that maintains some diplomatic ties with DPR Korea .

SLIDE 60

Example Hypotheses

Left State and Right State only:
(countries that ⇧ with North Korea .)
(nations which has ⇧ with DPR Korea .)
(countries that have ⇧ DPR Korea .)
(country ⇧ in North Korea .)
(country ⇧ with DPR Korea .)
⇧ = words the language model does not care about
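One way to render this state is a small value type holding only the visible boundary words (field names here are illustrative, not mtplz's actual API):

```python
from dataclasses import dataclass

# A hypothesis keeps only the words the language model can still see:
# a few boundary words on each side, with the middle elided (the "⇧").
@dataclass(frozen=True)
class State:
    left: tuple   # leftmost words, context for whatever attaches on the left
    right: tuple  # rightmost words, context for whatever attaches on the right

    def __str__(self):
        return f"({' '.join(self.left)} ⇧ {' '.join(self.right)})"

hyp = State(("countries", "that"), ("with", "North", "Korea", "."))
print(hyp)  # (countries that ⇧ with North Korea .)
```

Making the state hashable (frozen, tuples) is what lets hypotheses that agree on boundary words be grouped and share work.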

SLIDE 61

Idea: alternate between the left and right sides

SLIDE 62

Group by Leftmost Word

(ε ⇧ ε)
├─ (country ⇧ Korea .)
│  ├─ (country ⇧ with DPR Korea .)
│  └─ (country ⇧ in North Korea .)
├─ (nations which has ⇧ with DPR Korea .)
└─ (countries that ⇧ Korea .)
   ├─ (countries that have ⇧ DPR Korea .)
   └─ (countries that ⇧ with North Korea .)

SLIDE 63

Reveal Common Words in Each Group

(ε ⇧ ε)
├─ (country ⇧ Korea .)
│  ├─ (country ⇧ with DPR Korea .)
│  └─ (country ⇧ in North Korea .)
├─ (nations which has ⇧ with DPR Korea .)
└─ (countries that ⇧ Korea .)
   ├─ (countries that have ⇧ DPR Korea .)
   └─ (countries that ⇧ with North Korea .)

SLIDE 64

Alternate Sides Until Tree is Full

(ε ⇧ ε)
├─ (country ⇧ Korea .)
│  ├─ (country ⇧ with DPR Korea .)
│  └─ (country ⇧ in North Korea .)
├─ (nations which has ⇧ with DPR Korea .)
└─ (countries that ⇧ Korea .)
   ├─ (countries that have ⇧ DPR Korea .)
   └─ (countries that ⇧ with North Korea .)
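The first grouping step can be sketched as a bucketing pass over the states (a hypothetical helper, not the decoder's actual code): hypotheses that share a leftmost boundary word form one group, so the language model scores that word once for the whole group.

```python
from collections import defaultdict

# Group hypothesis states by their leftmost boundary word, as on the
# slides. Each state is (left boundary words, right boundary words).
states = [
    ("country", "with DPR Korea ."),
    ("country", "in North Korea ."),
    ("nations which has", "with DPR Korea ."),
    ("countries that have", "DPR Korea ."),
    ("countries that", "with North Korea ."),
]

def group_by_leftmost(states):
    groups = defaultdict(list)
    for left, right in states:
        groups[left.split()[0]].append((left, right))
    return dict(groups)

groups = group_by_leftmost(states)
print(sorted(groups))  # ['countries', 'country', 'nations']
```

Applying the same bucketing recursively, alternating sides, yields the tree above: each level reveals one more boundary word only for the groups the search actually visits.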

SLIDE 65

Using Rules

Rule: is a X:NP1 </s>  turns into  is a (ε ⇧ ε) </s>
Rule: X:V1 the X:N2    turns into  (ε ⇧ ε) the (ε ⇧ ε)
where the first (ε ⇧ ε) stands for X:V1 and the second for X:N2

SLIDE 66

Exploring and Backtracking

Does the LM like “is a (countries that ⇧ Korea .) </s>”?
Yes: Try more detail.
No: Consider alternatives.

SLIDE 67

Exploring and Backtracking

Does the LM like “is a (countries that ⇧ Korea .) </s>”?
Yes: Try more detail.
No: Consider alternatives.
Formally: priority queue containing breadcrumbs.

SLIDE 68

Split and Leave Breadcrumbs

(ε ⇧ ε)
├─ (country ⇧ Korea .)
│  ├─ (country ⇧ with DPR Korea .)
│  └─ (country ⇧ in North Korea .)
├─ (nations which has ⇧ with DPR Korea .)
└─ (countries that ⇧ Korea .)
   ├─ (countries that have ⇧ DPR Korea .)
   └─ (countries that ⇧ with North Korea .)

SLIDE 69

Split and Leave Breadcrumbs

(ε ⇧ ε)
├─ (country ⇧ Korea .)
│  ├─ (country ⇧ with DPR Korea .)
│  └─ (country ⇧ in North Korea .)
├─ (nations which has ⇧ with DPR Korea .)
└─ (countries that ⇧ Korea .)
   ├─ (countries that have ⇧ DPR Korea .)
   └─ (countries that ⇧ with North Korea .) [1+]

SLIDE 70

The queue entry

is a (ε ⇧ ε) </s>

splits into

Zeroth Child: “is a (countries that ⇧ Korea .) </s>”
Other Children: “is a (ε ⇧ ε)[1+] </s>” (children except the zeroth)

SLIDE 71

A priority queue contains competing entries:
is a (countries that ⇧ Korea .) </s>
(ε ⇧ ε) the (ε ⇧ ε)
is a (ε ⇧ ε)[1+] </s>
The algorithm pops the top entry, splits a non-terminal, and pushes.

SLIDE 72

Best-First Algorithm

1 Populate the queue with rules like “is a (ε ⇧ ε) </s>”
2 Loop until k complete options have been found: split the top-scoring option, leave a breadcrumb
3 Build a tree from the k complete options

SLIDE 73

Syntax

Same as phrase-based, just concatenate on left and right.

SLIDE 74

Experiment

Task: WMT 2011 German-English
Model: Hierarchical
Decoder: Moses

SLIDE 75

Moses Hierarchical

[Plot: average model score (axis ticks 101.4 to 101.9) vs CPU seconds/sentence (0.5 to 2.5); curves: Incremental, Cube pruning, Additive cube pruning]

SLIDE 76

Moses Hierarchical

[Plot: uncased BLEU (21.4 to 22.2) vs CPU seconds/sentence (0.5 to 2.5); curves: Incremental, Cube pruning, Additive cube pruning]

SLIDE 77

[Bar chart: speed ratio, 0.0 to 4.0, for Hiero zh-en, Hiero en-de, Hiero de-en, Hiero de-en cdec, Syntax en-de, Syntax de-en, Tree-to-tree fr-en cdec]

SLIDE 78

[Bar chart: speed ratio, 0.0 to 4.0, for Hiero zh-en, Hiero en-de, Hiero de-en, Hiero de-en cdec, Syntax en-de, Syntax de-en, Tree-to-tree fr-en cdec; bars split by beam size: Beam 20 and Beam < 20]

SLIDE 79

Incremental

A series of coarse-to-fine estimates. Continually taste the dish and adjust.

SLIDE 80

Takeaway

Search limits what translation can do.

Long-distance models like gender and number are harder.

Open the black box.

Language models can produce intermediate scores.
