SLIDE 1

Variational Decoding for Statistical Machine Translation

Zhifei Li, Jason Eisner, and Sanjeev Khudanpur

Center for Language and Speech Processing, Computer Science Department, Johns Hopkins University

Monday, August 17, 2009
SLIDE 2

Spurious Ambiguity

  • Statistical models in MT exhibit spurious ambiguity
  • Many different derivations (e.g., trees or segmentations) generate the same translation string
  • Regular phrase-based MT systems: phrase segmentation ambiguity
  • Tree-based MT systems: derivation tree ambiguity
SLIDES 3-15

Spurious Ambiguity in Phrase Segmentations

Source: 机器 翻译 软件

[Figure: the source sentence segmented in three different ways, using phrase pairs such as 机器 → "machine", 翻译 → "translation", 软件 → "software", 机器 翻译 → "machine translation", and 翻译 软件 → "translation software"; the figure also shows the alternative output "machine transfer software".]

  • Same output: "machine translation software"
  • Three different phrase segmentations
SLIDES 16-25

Spurious Ambiguity in Derivation Trees

Source: 机器 翻译 软件

[Figure: three derivation trees over the source, built from rules such as S→(机器, machine), S→(翻译, translation), S→(软件, software), S→(S0 S1, S0 S1), and S→(S0 翻译 S1, S0 translation S1).]

  • Same output: "machine translation software"
  • Three different derivation trees
SLIDES 26-42

Maximum A Posteriori (MAP) Decoding

[Figure: a toy distribution over eight derivations (probabilities 0.16, 0.14, 0.14, 0.13, 0.12, 0.11, 0.10, 0.10), grouped into three translation strings: a red, a blue, and a green translation. Summing the derivation probabilities within each string gives 0.28, 0.28, and 0.44 for the red, blue, and green translations, so exact MAP decoding picks the green one.]

  • Exact MAP decoding
  • x: foreign sentence
  • y: English sentence
  • d: derivation

y∗ = argmax_{y∈Trans(x)} p(y|x) = argmax_{y∈Trans(x)} Σ_{d∈D(x,y)} p(y, d|x)
SLIDES 43-50

Hypergraph as a Search Space

Source: dianzi0 shang1 de2 mao3

[Figure: a hypergraph over the source, with nodes S(0,4), X(0,4) "the ... cat", X(0,4) "a ... mat", X(0,2) "the ... mat", X(3,4) "a ... cat", and hyperedges labeled by rules X→(mao, a cat), X→(dianzi shang, the mat), X→(X0 de X1, X0 X1), X→(X0 de X1, X1 on X0), X→(X0 de X1, X1 of X0), X→(X0 de X1, X0 's X1), S→(X0, X0).]

A hypergraph is a compact structure that encodes exponentially many trees.

Probabilistic hypergraph: the hypergraph defines a probability distribution over derivation trees, i.e. p(y, d | x), and also an (implicit) distribution over strings, i.e. p(y | x).

  • Exact MAP decoding is NP-hard (Sima'an 1996): the set of distinct translation strings is exponential in size

y∗ = argmax_{y∈HG(x)} p(y|x) = argmax_{y∈HG(x)} Σ_{d∈D(x,y)} p(y, d|x)
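For readers who want something concrete, here is a minimal sketch (Python; not from the talk, and not the Joshua decoder's actual data structures) of how such a packed forest and its total derivation weight might be represented.

```python
import math
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class Hyperedge:
    head: str          # node this edge derives, e.g. "X[0,4]"
    tails: List[str]   # child nodes consumed by the rule ([] for lexical rules)
    rule: str          # SCFG rule, e.g. "X -> (X0 de X1, X1 on X0)"
    weight: float      # probability of applying this rule

@dataclass
class Hypergraph:
    root: str
    incoming: Dict[str, List[Hyperedge]] = field(default_factory=dict)

    def add_edge(self, edge: Hyperedge) -> None:
        self.incoming.setdefault(edge.head, []).append(edge)

    def inside(self, node: str) -> float:
        # Total probability mass of all derivations rooted at `node`
        # (every node is assumed to be derived by at least one edge).
        # Memoize this in practice: the forest is a DAG.
        return sum(e.weight * math.prod(self.inside(t) for t in e.tails)
                   for e in self.incoming[node])
```

Under this representation, inside(root) sums p(y, d | x) over every derivation in the forest; spurious ambiguity is exactly the fact that many of those derivations yield the same string y.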
SLIDE 51

Decoding with spurious ambiguity?

  • Maximum a posteriori (MAP) decoding
  • Viterbi approximation
  • N-best approximation (crunching) (May and Knight 2006)
SLIDES 52-56

Viterbi Approximation

  • Viterbi approximation: replace the sum over derivations with a max, i.e. output the string of the single best derivation

y∗ = argmax_{y∈Trans(x)} max_{d∈D(x,y)} p(y, d|x) = Y(argmax_{d∈D(x)} p(y, d|x))

where Y(d) denotes the translation string yielded by derivation d.

[Figure: on the toy example, the Viterbi scores of the red, blue, and green translations are 0.16, 0.14, and 0.13 (each string's single best derivation), so Viterbi decoding picks the red translation, whereas exact MAP (0.28, 0.28, 0.44) picks the green one.]
SLIDES 57-61

N-best Approximation

  • N-best approximation (crunching) (May and Knight 2006): sum p(y, d|x) only over the N best derivations

y∗ = argmax_{y∈Trans(x)} Σ_{d∈D(x,y)∩ND(x)} p(y, d|x)

where ND(x) is the set of the N highest-probability derivations of x.

[Figure: on the toy example, 4-best crunching keeps only the four best derivations (0.16, 0.14, 0.14, 0.13); the crunched scores of the red, blue, and green translations are 0.16, 0.28, and 0.13, so crunching picks the blue translation.]
SLIDES 62-65

MAP vs. Approximations

[Figure: the toy distribution again, with string scores MAP = 0.28 / 0.28 / 0.44, Viterbi = 0.16 / 0.14 / 0.13, and 4-best crunching = 0.16 / 0.28 / 0.13 for the red, blue, and green translations.]

  • Exact MAP decoding under spurious ambiguity is intractable
  • Viterbi and crunching are efficient, but ignore most derivations
  • Our goal: develop an approximation that considers all the derivations but still allows tractable decoding (a small worked example of the three decision rules follows below)
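A small worked sketch (Python) of the three decision rules on the toy distribution above; the probabilities are those shown on the slides, and the grouping of derivations into the red/blue/green strings is the one implied by the string totals 0.28 / 0.28 / 0.44.

```python
from collections import defaultdict

# Toy posterior over derivations, grouped by the translation string they yield.
derivations = [
    ("red",   0.16), ("red",   0.12),
    ("blue",  0.14), ("blue",  0.14),
    ("green", 0.13), ("green", 0.11), ("green", 0.10), ("green", 0.10),
]

def map_decode(ds):
    totals = defaultdict(float)
    for y, p in ds:
        totals[y] += p                       # sum over all derivations of each string
    return max(totals, key=totals.get)       # -> "green" (0.44)

def viterbi_decode(ds):
    return max(ds, key=lambda yp: yp[1])[0]  # single best derivation -> "red" (0.16)

def crunch_decode(ds, n=4):
    best_n = sorted(ds, key=lambda yp: yp[1], reverse=True)[:n]
    return map_decode(best_n)                # sum within the N-best list -> "blue" (0.28)

print(map_decode(derivations), viterbi_decode(derivations), crunch_decode(derivations))
```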
SLIDES 66-68

Variational Decoding

Decoding using a variational approximation, i.e., decoding using a sentence-specific approximate distribution.
SLIDES 69-87

Variational Decoding for MT: an Overview

Sentence-specific decoding, in three steps:

1. Generate a hypergraph: the foreign sentence x is decoded by the SMT system into a hypergraph encoding p(y, d | x). MAP decoding of p(y | x) on this hypergraph is intractable.

2. Estimate a model from the hypergraph: fit q*(y | x) ≈ Σ_{d∈D(x,y)} p(y, d|x), where q* is an n-gram model over output strings.

3. Decode using q* on the hypergraph.

[Figure: the dianzi shang de mao hypergraph from Slides 43-50, annotated with p(y, d | x), q*(y | x), and the three steps.]
SLIDES 88-99

Variational Inference

  • We want to do inference under p, but it is intractable:

    y∗ = argmax_y p(y|x)

  • Instead, we derive a simpler distribution q*:

    q∗ = argmin_{q∈Q} KL(p||q)

  • Then, we use q* as a surrogate for p in inference:

    y∗ = argmax_y q∗(y | x)

[Figure: the space P of all distributions, with p lying outside the tractable family Q, and q* the member of Q closest to p.]
SLIDES 100-108

Variational Approximation

  • q*: the member of a family of distributions Q having minimum distance to p

q∗ = argmin_{q∈Q} KL(p||q)
   = argmin_{q∈Q} Σ_{y∈Trans(x)} p log (p / q)
   = argmin_{q∈Q} Σ_{y∈Trans(x)} (p log p − p log q)
   = argmax_{q∈Q} Σ_{y∈Trans(x)} p log q        (the Σ p log p term is constant in q)

  • Three questions
  • how to parameterize q?
  • how to estimate q*?
  • how to use q* for decoding?
SLIDES 109-117

Parameterization of q∈Q

  • Naturally, we parameterize q as an n-gram model
  • The probability of a string is the product of the probabilities of the n-grams appearing in that string
  • Example (3-gram model), y: a b c d e f

    q(y) = q(a) · q(b|a) · q(c|ab) · q(d|bc) · q(e|cd) · q(f|de)

  • Other parameterizations are possible!
  • How do we estimate these n-gram probabilities? (A small sketch of the factorization appears below.)
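A minimal sketch (Python; the helper name and the cond_prob table are illustrative, not part of the talk) of this factorization: a string's probability under q is the product of its n-gram conditional probabilities.

```python
import math

def ngram_log_prob(tokens, cond_prob, n=3):
    """log q(y) for an n-gram model: sum of log q(word | previous n-1 words).

    `cond_prob` maps (history_tuple, word) -> probability.  As on the slide,
    histories are simply truncated at the start of the string (no <s> padding)."""
    logp = 0.0
    for i, word in enumerate(tokens):
        history = tuple(tokens[max(0, i - (n - 1)):i])  # up to n-1 previous words
        logp += math.log(cond_prob[(history, word)])
    return logp
```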
SLIDES 118-125

Estimation of q*∈Q

  • Variational approximation:

    q∗ = argmax_{q∈Q} Σ_{y∈Trans(x)} p log q

  • q* is a maximum likelihood estimate (MLE), where p plays the role of the empirical distribution
  • But in our case, p is defined not by a corpus, but by a hypergraph for a given test sentence!
  • So we estimate the n-gram model (e.g., a bigram model) directly from the hypergraph, either
  • by brute force, or
  • by dynamic programming
SLIDES 126-137

Estimating q* from a hypergraph: brute force

Bigram estimation:

  • unpack the hypergraph into its output strings: "the mat a cat", "a cat on the mat", "a cat of the mat", "the mat 's a cat", with posterior probabilities 2/8, 1/8, 3/8, and 2/8
  • accumulate the soft count of each bigram, weighted by the probability of the string it occurs in
  • normalize the counts

For example, the counts of what follows "cat" normalize to Pr(on | cat) = 1/8, Pr(of | cat) = 2/8, Pr(</s> | cat) = 5/8.

(A sketch of this procedure appears below.)
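A sketch (Python) of the brute-force estimator on this example. The pairing of probabilities with strings below is an assumption chosen so that the result matches the conditional probabilities reported on the slide; the original figure layout does not survive extraction.

```python
from collections import defaultdict

# Output strings from the unpacked hypergraph with their posterior probabilities.
# NOTE: this particular pairing of probabilities to strings is an assumption,
# chosen to reproduce the conditional probabilities reported on the slide.
strings = [
    ("the mat a cat".split(),    2 / 8),
    ("a cat on the mat".split(), 1 / 8),
    ("a cat of the mat".split(), 2 / 8),
    ("the mat 's a cat".split(), 3 / 8),
]

counts = defaultdict(float)          # soft count of each bigram (w1, w2)
history_counts = defaultdict(float)  # soft count of each history word w1

for words, p in strings:
    padded = ["<s>"] + words + ["</s>"]
    for w1, w2 in zip(padded, padded[1:]):
        counts[(w1, w2)] += p        # accumulate soft counts, weighted by p(y|x)
        history_counts[w1] += p

def q(w2, w1):
    """Estimated bigram probability q(w2 | w1) = count(w1 w2) / count(w1)."""
    return counts[(w1, w2)] / history_counts[w1]

print(q("on", "cat"), q("of", "cat"), q("</s>", "cat"))   # 0.125, 0.25, 0.625
```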
SLIDES 138-142

Estimating q* from a hypergraph: dynamic programming

Bigram estimation (sketched below):

  • run inside-outside on the hypergraph
  • accumulate the soft count of each bigram at each hyperedge
  • normalize the counts
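For intuition, a hedged sketch (Python, reusing the hypothetical Hypergraph class from the earlier sketch) of the dynamic-programming view: inside and outside scores give each hyperedge a posterior weight, and the n-grams that the edge introduces are counted with that weight instead of unpacking the forest.

```python
import math

def edge_posteriors(hg, inside, outside):
    """Posterior probability of using each hyperedge, from inside/outside scores.

    inside[v]  = total weight of derivations below node v (bottom-up pass)
    outside[v] = total weight of derivation contexts above v (symmetric top-down pass)

    The posterior of an edge e with head h is
        outside[h] * weight(e) * prod(inside[t] for t in e.tails) / inside[root].
    Each bigram that edge e introduces (this depends on the boundary words of its
    tail subforests, which the full algorithm tracks and which is simplified away
    here) then receives this posterior as its soft count, exactly as in the
    brute-force estimator, but without ever enumerating the strings."""
    z = inside[hg.root]
    return [(e, outside[head] * e.weight * math.prod(inside[t] for t in e.tails) / z)
            for head, edges in hg.incoming.items() for e in edges]
```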
SLIDES 143-149

Decoding using q*∈Q

  • Rescore the hypergraph HG(x):

    y∗ = argmax_{y∈HG(x)} q∗(y|x)

  • q* is an n-gram model, and we have efficient dynamic programming algorithms to score a hypergraph with an n-gram model (a small sketch follows)
  • John already told you how to do this ☺
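For intuition, a sketch (Python) of the decision rule over an explicit candidate list, using the bigram estimate q from the earlier sketch; on the real hypergraph the same argmax is computed with the usual dynamic program for scoring a forest with an n-gram model, not by enumeration.

```python
import math

def q_log_prob(words, q):
    """log q*(y | x) under a bigram model q(w2 | w1) (see the estimation sketch).
    Assumes every bigram of the candidate was observed in the forest, which holds
    by construction when candidates come from the same hypergraph."""
    padded = ["<s>"] + words + ["</s>"]
    return sum(math.log(q(w2, w1)) for w1, w2 in zip(padded, padded[1:]))

def variational_decode(candidates, q):
    # candidates: list of tokenized output strings found in the hypergraph
    return max(candidates, key=lambda y: q_log_prob(y, q))
```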
SLIDES 150-154

KL divergences under different variational models

q∗ = argmin_{q∈Q} KL(p||q),   where KL(p||q) = H(p, q) − H(p)

Measure   H(p)   KL(p||q*_1)   KL(p||q*_2)   KL(p||q*_3)   KL(p||q*_4)    (bits/word)
MT'04     1.36   0.97          0.32          0.21          0.17
MT'05     1.37   0.94          0.32          0.21          0.17

  • The larger the order n is, the smaller the KL divergence is!
  • The reduction in KL divergence happens mostly when switching from unigram to bigram
  • How to compute these quantities on a hypergraph? See (Li and Eisner, EMNLP'09)
SLIDES 155-159

BLEU scores when using a single variational n-gram model

Decoding scheme   MT'04   MT'05
Viterbi           35.4    32.6
1-gram            25.9    24.5
2-gram            36.1    33.4
3-gram            36.0    33.1
4-gram            35.8    32.9

  • unigram performs very badly
  • bigram achieves the best BLEU scores. Why not 3-gram or 4-gram? Modeling error in p.
SLIDES 160-165

  • BLEU cares about both low- and high-order n-gram matches
  • Interpolating variational n-gram models for different n:

    y∗ = argmax_{y∈HG(x)} Σ_n θ_n · log q∗_n(y | x)

  • Viterbi and variational are different ways of approximating p, so we can also interpolate with the Viterbi score:

    y∗ = argmax_{y∈HG(x)} Σ_n θ_n · log q∗_n(y | x) + θ_v · log p_Viterbi(y | x)
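A small sketch (Python; q_models, thetas, and viterbi_log_prob are hypothetical helpers, not the talk's implementation) of the interpolated decision rule.

```python
def interpolated_score(y, q_models, thetas, viterbi_log_prob=None, theta_v=0.0):
    """Sum_n theta_n * log q*_n(y|x), optionally plus theta_v * log p_Viterbi(y|x).

    q_models: dict mapping order n -> a function tokens -> log q*_n(y|x)
    thetas:   dict mapping order n -> its interpolation weight theta_n"""
    score = sum(thetas[n] * q_models[n](y) for n in q_models)
    if viterbi_log_prob is not None:
        score += theta_v * viterbi_log_prob(y)
    return score

def decode(candidates, q_models, thetas, viterbi_log_prob=None, theta_v=0.0):
    return max(candidates, key=lambda y: interpolated_score(
        y, q_models, thetas, viterbi_log_prob, theta_v))
```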
SLIDE 166

Minimum Bayes Risk (MBR) decoding? (Tromble et al. 2008; DeNero et al. 2009)

SLIDE 167

Minimum Risk Decoding

  • Maximum a posteriori (MAP) decoding: find the most probable translation string

    y∗ = argmax_{y∈HG(x)} p(y|x)

  • Minimum risk decoding: find the consensus translation string

    Risk(y) = Σ_{y′} L(y, y′) p(y′|x)

    y∗ = argmin_{y∈HG(x)} Risk(y)
SLIDES 168-174

Variational Decoding (VD) vs. MBR (Tromble et al. 2008)

[Figure: a diagram relating spurious ambiguity, consensus, VD, MBR, and interpolated VD.]

Both the BLEU metric and our variational distributions happen to use n-gram dependencies.
SLIDES 175-180

  • Variational decoding with interpolation

    n-gram probability:  q(r(w) | h(w), x) = Σ_{y′} c_w(y′) p(y′ | x) / Σ_{y′} c_{h(w)}(y′) p(y′ | x)

    n-gram model:        q_n(y | x) = Π_{w∈W_n} q(r(w) | h(w), x)^{c_w(y)}

    decision rule:       y∗ = argmax_{y∈HG(x)} Σ_n θ_n · log q∗_n(y | x)

  • Minimum risk decoding (Tromble et al. 2008)

    n-gram probability:  g(w | x) = Σ_{y′} δ_w(y′) p(y′ | x)   (non-probabilistic; very expensive to compute)

    n-gram model:        g_n(y | x) = Σ_{w∈W_n} g(w | x) · c_w(y)

    decision rule:       y∗ = argmax_{y∈HG(x)} Σ_n θ_n · g_n(y | x)

Here w ranges over the n-grams W_n, h(w) is the history (first n−1 words) of w, r(w) its last word, c_w(y) the count of w in y, and δ_w(y′) indicates whether w occurs in y′.
SLIDES 181-182

BLEU Results on Chinese-English NIST MT Tasks

Decoding scheme                 MT'04   MT'05
Viterbi                         35.4    32.6
MBR (K=1000)                    35.8    32.7
Crunching (N=10000)             35.7    32.8
Crunching+MBR (N=10000)         35.8    32.7
Variational (1to4gram+wp+vt)    36.6    33.5

  • variational decoding improves over Viterbi, MBR, and crunching
SLIDE 183

Conclusions

  • Exact MAP decoding with spurious ambiguity is intractable
  • Viterbi or N-best approximations are efficient, but ignore most derivations
  • We developed a variational approximation, which considers all derivations but still allows tractable decoding
  • Our variational decoding improves a state-of-the-art baseline
SLIDE 184

Future Directions

  • The MT pipeline is full of intractable problems; variational approximation is a principled way to tackle them
  • Decoding with spurious ambiguity is a common problem in many other NLP applications
  • models with latent variables
  • Data-Oriented Parsing (DOP)
  • Hidden Markov Models (HMMs)
  • ...
SLIDE 185

Thank you! 谢谢!
SLIDE 186

Joshua

SLIDE 187 (backup)

[Backup slide: the three-step variational decoding overview from Slides 69-87, repeated.]