Variational Decoding for Statistical Machine Translation
Zhifei Li, Jason Eisner, and Sanjeev Khudanpur
Center for Language and Speech Processing Computer Science Department Johns Hopkins University
Monday, August 17, 2009
Spurious Ambiguity

Spurious ambiguity: multiple derivations (e.g., different phrase segmentations or derivation trees) generate the same translation string.
Spurious Ambiguity in Phrase Segmentations

Source sentence: 机器 翻译 软件

Several different phrase segmentations produce the same translation string "machine translation software":

  机器 | 翻译 | 软件  →  machine | translation | software
  机器 | 翻译 软件    →  machine | translation software
  机器 翻译 | 软件    →  machine translation | software

Other choices can also yield a different string, e.g. "machine transfer software".
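To make the spurious ambiguity concrete, here is a small sketch that enumerates every segmentation of the source under a hypothetical phrase table (only the entries suggested by this slide), translates each phrase, and counts how many derivations produce each output string:

```python
from collections import Counter
from itertools import product

def segmentations(words):
    """Yield every way of splitting the word list into contiguous phrases."""
    if not words:
        yield []
        return
    for i in range(1, len(words) + 1):
        for rest in segmentations(words[i:]):
            yield [tuple(words[:i])] + rest

# Hypothetical phrase table; each source phrase maps to one or more options.
phrase_table = {
    ("机器",): ["machine"],
    ("翻译",): ["translation", "transfer"],
    ("软件",): ["software"],
    ("机器", "翻译"): ["machine translation"],
    ("翻译", "软件"): ["translation software"],
}

counts = Counter()  # translation string -> number of derivations
for seg in segmentations(["机器", "翻译", "软件"]):
    if all(ph in phrase_table for ph in seg):
        for choice in product(*(phrase_table[ph] for ph in seg)):
            counts[" ".join(choice)] += 1

print(counts)  # 3 derivations of "machine translation software", 1 of "machine transfer software"
```

Three distinct derivations yield the same string "machine translation software" (the ambiguity is spurious), while one yields the competing string "machine transfer software".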
Spurious Ambiguity in Derivation Trees

Source sentence: 机器 翻译 软件

SCFG rules:
  S → (机器, machine)
  S → (翻译, translation)
  S → (软件, software)
  S → (S0 S1, S0 S1)
  S → (S0 翻译 S1, S0 translation S1)

Distinct derivation trees over these rules — e.g., combining the three words left-to-right vs. right-to-left with S → (S0 S1, S0 S1), or using the lexicalized rule S → (S0 翻译 S1, S0 translation S1) — all yield the same translation string "machine translation software".
Maximum A Posteriori (MAP) Decoding

A toy example: eight derivations with probabilities

  0.16  0.14  0.14  0.13  0.12  0.11  0.10  0.10

each yielding one of three translation strings (a red, a blue, or a green translation). MAP decoding maximizes the probability of the translation string, summing over its derivations:

  y* = arg max_{y ∈ Trans(x)} p(y|x)
     = arg max_{y ∈ Trans(x)} Σ_{d ∈ D(x,y)} p(y, d|x)

Summing gives 0.28 for the red string, 0.28 for the blue string, and 0.44 for the green string, so MAP decoding selects the green translation, even though no single green derivation is the most probable one.
Hypergraph as a Search Space

Example source: dianzi0 shang1 de2 mao3

SCFG rules (hyperedges):
  X → (mao, a cat)
  X → (dianzi shang, the mat)
  X → (X0 de X1, X0 X1)
  X → (X0 de X1, X1 on X0)
  X → (X0 de X1, X1 of X0)
  X → (X0 de X1, X0 's X1)
  S → (X0, X0)

A hypergraph is a compact structure that encodes exponentially many trees.

Probabilistic Hypergraph

The hypergraph defines a probability distribution p(d|x) over derivations, and also an implicit distribution p(y|x) over translation strings:

  y* = arg max_{y ∈ HG(x)} p(y|x) = arg max_{y ∈ HG(x)} Σ_{d ∈ D(x,y)} p(y, d|x)

MAP decoding over the hypergraph is NP-hard (Sima'an 1996): a single string may have exponentially many derivations.
Decoding with Spurious Ambiguity: the Viterbi Approximation

The Viterbi approximation replaces the sum over derivations with a max, and outputs the yield Y(·) of the single best derivation:

  y* ≈ arg max_{y ∈ Trans(x)} max_{d ∈ D(x,y)} p(y, d|x)
     = Y( arg max_{d ∈ D(x)} p(y, d|x) )

In the toy example, the best single derivations of the three strings score 0.16 (red), 0.14 (blue), and 0.13 (green), so Viterbi decoding outputs the red translation rather than the MAP translation (green, total probability 0.44).
N-best Approximation (Crunching)

Crunching (May and Knight, 2006) also sums derivation probabilities per string, but only over an N-best list of derivations:

  y* ≈ arg max_{y ∈ Trans(x)} Σ_{d ∈ NBest(x) ∩ D(x,y)} p(y, d|x)

With 4-best crunching (the derivations 0.16, 0.14, 0.14, 0.13), the string scores are 0.16 (red), 0.28 (blue), and 0.13 (green), so crunching outputs the blue translation.
MAP vs. Approximations

  Scheme            red   blue  green  output
  MAP               0.28  0.28  0.44   green
  Viterbi           0.16  0.14  0.13   red
  4-best crunching  0.16  0.28  0.13   blue

Neither approximation recovers the MAP translation. We want an approximation that considers all the derivations but still allows tractable decoding.
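The three decoding schemes above can be sketched in a few lines; the per-derivation probabilities and their grouping into red/blue/green strings are reconstructed to match the totals reported on the slides:

```python
from collections import defaultdict

# Eight derivations, each yielding one of three translation strings.
derivations = [  # (translation string, derivation probability)
    ("red", 0.16), ("blue", 0.14), ("blue", 0.14), ("green", 0.13),
    ("red", 0.12), ("green", 0.11), ("green", 0.10), ("green", 0.10),
]

def map_decode(ds):
    """Sum over ALL derivations of each string, return the argmax string."""
    totals = defaultdict(float)
    for y, p in ds:
        totals[y] += p
    return max(totals, key=totals.get), dict(totals)

def viterbi_decode(ds):
    """Output the yield of the single best derivation."""
    y, _ = max(ds, key=lambda d: d[1])
    return y

def crunch_decode(ds, n):
    """Sum derivation probabilities, but only within the n-best list."""
    top = sorted(ds, key=lambda d: d[1], reverse=True)[:n]
    return map_decode(top)[0]

print(map_decode(derivations)[0])     # green (0.28, 0.28, 0.44)
print(viterbi_decode(derivations))    # red   (0.16 vs 0.14 vs 0.13)
print(crunch_decode(derivations, 4))  # blue  (0.16, 0.28, 0.13)
```

The three schemes disagree: MAP picks green, Viterbi picks red, and 4-best crunching picks blue.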
Variational Decoding

Variational decoding decodes using a variational approximation: a sentence-specific approximate distribution in place of the intractable p.
Variational Decoding for MT: an Overview

Decoding is sentence-specific and proceeds in three steps.

Step 1: Generate a hypergraph. Given a foreign sentence x, the SMT system produces a hypergraph encoding the distribution p(y, d | x) over (translation, derivation) pairs. This implicitly defines p(y | x), but MAP decoding under p is intractable.
Step 2: Estimate a model q* from the hypergraph. Here q* is an n-gram model, fit so that

  q*(y | x) ≈ Σ_{d ∈ D(x,y)} p(y, d|x)

Step 3: Decode using q*, searching the same hypergraph for the string that q* scores highest.
Variational Inference

Exact MAP decoding maximizes the intractable distribution:

  y* = arg max_y p(y|x)

Variational inference instead picks, from a tractable family Q, the distribution closest to p in KL divergence, and decodes under it:

  q* = arg min_{q ∈ Q} KL(p || q)
  y* = arg max_y q*(y | x)

(The slide's picture: p lies outside the family Q; q* is the member of Q closest to p.)
Variational Approximation

Minimizing the KL divergence to p over a family Q of distributions:

  q* = arg min_{q ∈ Q} KL(p || q)
     = arg min_{q ∈ Q} Σ_y p(y|x) log ( p(y|x) / q(y|x) )
     = arg min_{q ∈ Q} ( Σ_y p(y|x) log p(y|x) − Σ_y p(y|x) log q(y|x) )
     = arg max_{q ∈ Q} Σ_y p(y|x) log q(y|x)

The first term is constant with respect to q, so minimizing the KL divergence is equivalent to maximizing the expected log-likelihood of q under p.
Parameterization of q ∈ Q

q is an n-gram model: the probability of a string is the product of the probabilities of the n-grams appearing in that string. For a 3-gram model and y = a b c d e f:

  q(y) = q(a) · q(b|a) · q(c|ab) · q(d|bc) · q(e|cd) · q(f|de)

Other parameterizations are possible! But how do we estimate these n-gram probabilities?
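As a sketch, scoring a string under this parameterization is a single product over positions; the probability table below is hypothetical, invented only to illustrate the lookup:

```python
def q_score(tokens, q, n=3):
    """q(y) = product over i of q(w_i | w_{i-n+1} ... w_{i-1})."""
    score = 1.0
    for i, w in enumerate(tokens):
        context = tuple(tokens[max(0, i - n + 1):i])  # up to n-1 previous words
        score *= q[(context, w)]
    return score

# Hypothetical 3-gram probabilities for y = a b c d e f.
q = {((), "a"): 0.9, (("a",), "b"): 0.5, (("a", "b"), "c"): 0.4,
     (("b", "c"), "d"): 0.25, (("c", "d"), "e"): 0.5, (("d", "e"), "f"): 0.2}

print(q_score("a b c d e f".split(), q))  # 0.9 * 0.5 * 0.4 * 0.25 * 0.5 * 0.2
```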
Estimation of q* ∈ Q

  q* = arg max_{q ∈ Q} Σ_y p(y|x) log q(y|x)

This is just maximum-likelihood estimation of an n-gram model, where p plays the role of the empirical distribution. But in our case, p is defined not by a corpus, but by a hypergraph for a given test sentence! So we estimate the n-gram model (e.g., a bigram model) directly from the hypergraph.
Estimating q* from a hypergraph: brute force

Bigram estimation, by enumerating the translations in the hypergraph with their probabilities:

  the mat a cat       p = 2/8
  a cat on the mat    p = 1/8
  the mat 's a cat    p = 3/8
  a cat of the mat    p = 2/8

Collecting probability-weighted bigram counts and normalizing gives, for example:

  Pr(on | cat) = 1/8    Pr(of | cat) = 2/8    Pr(</s> | cat) = 5/8
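This brute-force estimate can be reproduced directly; the assignment of the four probabilities to individual strings is reconstructed here to be consistent with the conditional probabilities reported on the slide:

```python
from collections import Counter, defaultdict
from fractions import Fraction

# The four translations encoded by the toy hypergraph, with hypothetical
# posterior probabilities matching the slide's totals.
translations = [
    ("the mat a cat",    Fraction(2, 8)),
    ("a cat on the mat", Fraction(1, 8)),
    ("the mat 's a cat", Fraction(3, 8)),
    ("a cat of the mat", Fraction(2, 8)),
]

# Collect probability-weighted bigram counts (with sentence boundaries).
counts = Counter()
for y, p in translations:
    toks = ["<s>"] + y.split() + ["</s>"]
    for g in zip(toks, toks[1:]):
        counts[g] += p

# Normalize per left context to get the bigram model q*.
context_mass = defaultdict(Fraction)
for (w1, _), c in counts.items():
    context_mass[w1] += c
q = {(w1, w2): c / context_mass[w1] for (w1, w2), c in counts.items()}

print(q[("cat", "on")])    # 1/8
print(q[("cat", "of")])    # 1/4  (i.e., 2/8)
print(q[("cat", "</s>")])  # 5/8
```

Enumerating every translation is of course intractable in general, which motivates the dynamic-programming estimate on the next slide.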
Estimating q* from a hypergraph: dynamic programming

Instead of enumerating translations, bigram estimation can be done by dynamic programming on the hypergraph: run the inside-outside algorithm to compute the expected count of each bigram at each hyperedge, then normalize the expected counts.
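Below is a sketch of this inside-outside computation on a miniature two-level hypergraph for the running example. The rule probabilities are hypothetical, chosen to match the brute-force slide; to keep the bigram bookkeeping simple, sentence boundaries are folded into the top edges and every tail node has a unique yield:

```python
from collections import Counter, defaultdict
from fractions import Fraction
from math import prod

# Each hyperedge: (head, tails, prob, template); the template is the edge's
# target side, with {0}, {1} slots for the tail yields.
edges = [
    ("Xmat", (), Fraction(1), "the mat"),
    ("Xcat", (), Fraction(1), "a cat"),
    ("S", ("Xmat", "Xcat"), Fraction(2, 8), "<s> {0} {1} </s>"),    # X0 X1
    ("S", ("Xmat", "Xcat"), Fraction(1, 8), "<s> {1} on {0} </s>"), # X1 on X0
    ("S", ("Xmat", "Xcat"), Fraction(2, 8), "<s> {1} of {0} </s>"), # X1 of X0
    ("S", ("Xmat", "Xcat"), Fraction(3, 8), "<s> {0} 's {1} </s>"), # X0 's X1
]
root = "S"

def bigrams(s):
    toks = s.split()
    return Counter(zip(toks, toks[1:]))

# Inside pass (edges are listed bottom-up).
inside = defaultdict(Fraction)
yield_of = {}  # tail nodes have unique yields in this toy example
for head, tails, p, tmpl in edges:
    inside[head] += p * prod((inside[t] for t in tails), start=Fraction(1))
    yield_of.setdefault(head, tmpl.format(*(yield_of[t] for t in tails)))

# Outside pass (edges in reverse order).
outside = defaultdict(Fraction)
outside[root] = Fraction(1)
for head, tails, p, tmpl in reversed(edges):
    for i, t in enumerate(tails):
        rest = prod((inside[s] for j, s in enumerate(tails) if j != i),
                    start=Fraction(1))
        outside[t] += outside[head] * p * rest

# Expected bigram counts: for each hyperedge, its posterior probability
# times the NEW bigrams it introduces (its realized string's bigrams
# minus those internal to its tails).
Z = inside[root]
expected = Counter()
for head, tails, p, tmpl in edges:
    post = outside[head] * p * prod((inside[t] for t in tails),
                                    start=Fraction(1)) / Z
    new = bigrams(tmpl.format(*(yield_of[t] for t in tails)))
    for t in tails:
        new.subtract(bigrams(yield_of[t]))
    for g, c in new.items():
        expected[g] += post * c

# Normalize per left context to get the variational bigram model q*.
mass = defaultdict(Fraction)
for (w1, _), c in expected.items():
    mass[w1] += c
q = {g: c / mass[g[0]] for g, c in expected.items()}

print(q[("cat", "on")])  # matches the brute-force estimate of 1/8
```

The dynamic program never enumerates whole translations, yet its expected counts reproduce the brute-force estimates exactly.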
Decoding using q* ∈ Q

  y* = arg max_{y ∈ HG(x)} q*(y | x)

q* is an n-gram model.
John already told you how to do this ☺
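Once q* is an n-gram model, decoding is ordinary n-gram-scored search. A toy sketch over an explicit candidate list (the real decoder maximizes over the whole hypergraph by dynamic programming; the model values here are made up):

```python
import math

# Hypothetical bigram model q* and a small candidate list standing in
# for HG(x); decoding returns the candidate that q* scores highest.
q = {("<s>", "a"): 0.4, ("<s>", "the"): 0.6,
     ("a", "cat"): 1.0, ("the", "mat"): 1.0,
     ("cat", "</s>"): 0.625, ("mat", "</s>"): 0.5,
     ("cat", "on"): 0.125, ("on", "the"): 1.0}

def logscore(y, q):
    """Sum of log bigram probabilities, with a floor for unseen bigrams."""
    toks = ["<s>"] + y.split() + ["</s>"]
    return sum(math.log(q.get(bg, 1e-12)) for bg in zip(toks, toks[1:]))

candidates = ["a cat", "the mat", "a cat on the mat"]
best = max(candidates, key=lambda y: logscore(y, q))
print(best)  # the mat
```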
KL divergences under different variational models

  q* = arg min_{q ∈ Q} KL(p || q) = H(p, q) − H(p)

  Measure   H(p)   KL(p || q*_1)   KL(p || q*_2)   KL(p || q*_3)   KL(p || q*_4)   (bits/word)
  MT'04     1.36   0.97            0.32            0.21            0.17
  MT'05     1.37   0.94            0.32            0.21            0.17

The higher the order of the variational model, the smaller the divergence is! The drop comes mostly when switching from unigram to bigram.
How to compute these on a hypergraph? See (Li and Eisner, EMNLP'09).
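The identity KL(p||q) = H(p, q) − H(p) can be checked numerically on the toy four-string distribution from earlier slides (the string-probability pairing is an assumption); the bigram q* gives a strictly positive divergence because it cannot represent p exactly:

```python
import math
from collections import defaultdict

# Toy distribution p and the bigram model q* estimated from it.
p = {"the mat a cat": 3/8, "the mat 's a cat": 2/8,
     "a cat of the mat": 2/8, "a cat on the mat": 1/8}

def bigram_model(p):
    pair, hist = defaultdict(float), defaultdict(float)
    for y, prob in p.items():
        toks = ["<s>"] + y.split() + ["</s>"]
        for a, b in zip(toks, toks[1:]):
            pair[(a, b)] += prob
            hist[a] += prob
    return {(a, b): c / hist[a] for (a, b), c in pair.items()}

def q_prob(y, q):
    toks = ["<s>"] + y.split() + ["</s>"]
    return math.prod(q[bg] for bg in zip(toks, toks[1:]))

q = bigram_model(p)
H_p  = -sum(pr * math.log2(pr) for pr in p.values())               # entropy of p
H_pq = -sum(pr * math.log2(q_prob(y, q)) for y, pr in p.items())   # cross-entropy
kl = H_pq - H_p
print(kl >= 0)  # True: cross-entropy never beats entropy
```

Here the entropies are total bits per sentence; the slide's table reports bits per word.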
BLEU scores when using a single variational n-gram model

  Decoding scheme   MT'04   MT'05
  Viterbi           35.4    32.6
  1gram             25.9    24.5
  2gram             36.1    33.4
  3gram             36.0    33.1
  4gram             35.8    32.9

Why do the higher-order models not keep improving? Modeling error in p.
BLEU cares about both low- and high-order n-gram matches. Viterbi and variational decoding are different ways of approximating p.

Single variational model:
  y* = arg max_{y ∈ HG(x)} Σ_n θ_n · log q*_n(y | x)

Interpolated with Viterbi:
  y* = arg max_{y ∈ HG(x)} Σ_n θ_n · log q*_n(y | x) + θ_v · log p_Viterbi(y | x)
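A sketch of the interpolated rule over an explicit candidate list; all component scores and weights below are invented for illustration:

```python
# Hypothetical per-candidate component scores: log q*_n(y|x) for n = 1, 2
# and log p_Viterbi(y|x); the interpolated rule takes a weighted sum.
scores = {
    "a cat on the mat": {"logq": [-2.0, -1.2], "logp_vit": -0.5},
    "the mat a cat":    {"logq": [-1.8, -1.5], "logp_vit": -1.5},
}
theta_n = [0.3, 0.7]  # weights on the variational n-gram models (assumed)
theta_v = 1.0         # weight on the Viterbi approximation (assumed)

def interp(s):
    return sum(t * lq for t, lq in zip(theta_n, s["logq"])) + theta_v * s["logp_vit"]

best = max(scores, key=lambda y: interp(scores[y]))
print(best)  # a cat on the mat
```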
Minimum Bayes Risk (MBR) decoding? (Tromble et al. 2008; DeNero et al. 2009)

Minimum Risk Decoding

  Risk(y) = Σ_{y'} L(y, y') p(y' | x)

Instead of maximizing the probability,
  y* = arg max_{y ∈ HG(x)} p(y | x),
minimize the risk:
  y* = arg min_{y ∈ HG(x)} Risk(y)
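A toy sketch of minimum-risk decoding over an n-best list. The loss below is a made-up unigram-F1 stand-in for 1 − BLEU, and the candidate distribution is invented; note how the risk-minimizing choice can differ from the most probable string:

```python
from collections import Counter

# Invented candidate distribution p(y'|x); the first entry is the Viterbi pick.
p = {"a cat on the mat": 0.40,
     "a cat of the mat": 0.35,
     "the mat a cat": 0.25}

def loss(y, y2):
    """Toy loss: 1 minus unigram F1 (a crude stand-in for 1 - BLEU)."""
    c1, c2 = Counter(y.split()), Counter(y2.split())
    overlap = sum((c1 & c2).values())
    return 1 - 2 * overlap / (sum(c1.values()) + sum(c2.values()))

def risk(y):
    # expected loss of hypothesis y under the candidate distribution
    return sum(loss(y, y2) * pr for y2, pr in p.items())

best = min(p, key=risk)
print(best)  # the mat a cat: a consensus string, not the Viterbi pick
```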
Variational Decoding (VD) vs. MBR (Tromble et al. 2008)

  VD: addresses spurious ambiguity.
  MBR: seeks a consensus translation.
  Interpolated VD: does both.

Both the BLEU metric and our variational distributions happen to use n-gram dependencies.
Monday, August 17, 2009q(r(w) | h(w), x) =
qn(y | x) =
q(r(w) | h(w), x)cw(y) y∗ = arg max
y∈HG(x)
θn · log q∗
n(y | x)
gn(y | x) =
g(w | x)cw(y) g(w | x) =
δw(y′)p(y′ | x) y∗ = arg max
y∈HG(x)
θn · gn(y | x)
38 Monday, August 17, 2009q(r(w) | h(w), x) =
qn(y | x) =
q(r(w) | h(w), x)cw(y) y∗ = arg max
y∈HG(x)
θn · log q∗
n(y | x)
gn(y | x) =
g(w | x)cw(y) g(w | x) =
δw(y′)p(y′ | x) y∗ = arg max
y∈HG(x)
θn · gn(y | x)
38decision rule decision rule
Monday, August 17, 2009q(r(w) | h(w), x) =
qn(y | x) =
q(r(w) | h(w), x)cw(y) y∗ = arg max
y∈HG(x)
θn · log q∗
n(y | x)
gn(y | x) =
g(w | x)cw(y) g(w | x) =
δw(y′)p(y′ | x) y∗ = arg max
y∈HG(x)
θn · gn(y | x)
38decision rule decision rule n-gram model n-gram model
Monday, August 17, 2009q(r(w) | h(w), x) =
qn(y | x) =
q(r(w) | h(w), x)cw(y) y∗ = arg max
y∈HG(x)
θn · log q∗
n(y | x)
gn(y | x) =
g(w | x)cw(y) g(w | x) =
δw(y′)p(y′ | x) y∗ = arg max
y∈HG(x)
θn · gn(y | x)
38decision rule decision rule n-gram model n-gram model n-gram probability n-gram probability
Monday, August 17, 2009q(r(w) | h(w), x) =
qn(y | x) =
q(r(w) | h(w), x)cw(y) y∗ = arg max
y∈HG(x)
θn · log q∗
n(y | x)
gn(y | x) =
g(w | x)cw(y) g(w | x) =
δw(y′)p(y′ | x) y∗ = arg max
y∈HG(x)
θn · gn(y | x)
39 Monday, August 17, 2009q(r(w) | h(w), x) =
qn(y | x) =
q(r(w) | h(w), x)cw(y) y∗ = arg max
y∈HG(x)
θn · log q∗
n(y | x)
gn(y | x) =
g(w | x)cw(y) g(w | x) =
δw(y′)p(y′ | x) non-probabilistic very expensive to compute y∗ = arg max
y∈HG(x)
θn · gn(y | x)
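The MBR-side quantity g(w|x) can be sketched over an explicit candidate list (the distribution is invented). On a full hypergraph this sum is what makes g so expensive: δ_w(y') asks whether w occurs anywhere in the string, a global property.

```python
# g(w|x) = sum over candidates of delta_w(y') * p(y'|x): the posterior
# probability that n-gram w occurs in the translation. Candidates invented.
p = {"a cat on the mat": 0.40,
     "a cat of the mat": 0.35,
     "the mat a cat": 0.25}

def bigrams(y):
    toks = y.split()
    return set(zip(toks, toks[1:]))

def g(w, p):
    # delta_w(y') = 1 iff w occurs in y'
    return sum(pr for y2, pr in p.items() if w in bigrams(y2))

print(g(("cat", "on"), p))  # 0.4: only the first candidate contains "cat on"
```

By contrast, the VD probabilities q(r(w) | h(w), x) come from local expected counts, which dynamic programming computes cheaply.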
BLEU Results on Chinese-English NIST MT Tasks

  Decoding scheme                MT'04   MT'05
  Viterbi                        35.4    32.6
  MBR (K=1000)                   35.8    32.7
  Crunching (N=10000)            35.7    32.8
  Crunching+MBR (N=10000)        35.8    32.7
  Variational (1to4gram+wp+vt)   36.6    33.5
Conclusions

  Decoding under spurious ambiguity is intractable.
  Viterbi and crunching approximations are tractable, but ignore most derivations.
  Variational decoding considers all derivations but still allows tractable decoding.
  It improves over a state-of-the-art baseline.
Future directions

  Spurious ambiguity is a problem in many other NLP applications; variational approximations may help with these problems too.

Thank you! 谢谢! ("Thank you!")
Recap: q* is an n-gram model; decode using q*.

  1. Generate a hypergraph: p(y, d | x)
  2. Estimate a model from the hypergraph: q*(y | x) ≈ Σ_{d ∈ D(x,y)} p(y, d | x)
  3. Decode using q*

[Figure: the toy hypergraph for "dianzi shang de mao" again, with its translation rules.]