SLIDE 1

Variational Decoding for Statistical Machine Translation

Zhifei Li, Jason Eisner, and Sanjeev Khudanpur

Center for Language and Speech Processing, Computer Science Department, Johns Hopkins University

Monday, August 17, 2009
SLIDE 2

Spurious Ambiguity

  • Statistical models in MT exhibit spurious ambiguity
  • Many different derivations (e.g., trees or segmentations) generate the same translation string
  • Regular phrase-based MT systems: phrase segmentation ambiguity
  • Tree-based MT systems: derivation tree ambiguity
SLIDES 3-15

Spurious Ambiguity in Phrase Segmentations

Source: 机器 翻译 软件

[Figure: the source sentence segmented in three different ways, using phrase pairs such as 机器 → "machine", 翻译 → "translation", 软件 → "software", 机器 翻译 → "machine translation", and 翻译 软件 → "translation software"; the figure also shows the alternative output "machine transfer software".]

  • Same output: "machine translation software"
  • Three different phrase segmentations
SLIDES 16-25

Spurious Ambiguity in Derivation Trees

Source: 机器 翻译 软件

[Figure: three derivation trees over the source, built from rules such as S→(机器, machine), S→(翻译, translation), S→(软件, software), S→(S0 S1, S0 S1), and S→(S0 翻译 S1, S0 translation S1).]

  • Same output: "machine translation software"
  • Three different derivation trees
SLIDES 26-42

Maximum A Posteriori (MAP) Decoding

[Figure: a toy distribution over eight derivations (probabilities 0.16, 0.14, 0.14, 0.13, 0.12, 0.11, 0.10, 0.10), grouped into three translation strings: a red, a blue, and a green translation. Summing the derivation probabilities within each string gives 0.28, 0.28, and 0.44 for the red, blue, and green translations, so exact MAP decoding picks the green one.]

  • Exact MAP decoding
  • x: foreign sentence
  • y: English sentence
  • d: derivation

y∗ = argmax_{y∈Trans(x)} p(y|x) = argmax_{y∈Trans(x)} Σ_{d∈D(x,y)} p(y, d|x)
SLIDES 43-50

Hypergraph as a Search Space

Source: dianzi0 shang1 de2 mao3

[Figure: a hypergraph over the source, with nodes S(0,4), X(0,4) "the ... cat", X(0,4) "a ... mat", X(0,2) "the ... mat", X(3,4) "a ... cat", and hyperedges labeled by rules X→(mao, a cat), X→(dianzi shang, the mat), X→(X0 de X1, X0 X1), X→(X0 de X1, X1 on X0), X→(X0 de X1, X1 of X0), X→(X0 de X1, X0 's X1), S→(X0, X0).]

A hypergraph is a compact structure that encodes exponentially many trees.

Probabilistic hypergraph: the hypergraph defines a probability distribution over derivation trees, i.e. p(y, d | x), and also an (implicit) distribution over strings, i.e. p(y | x).

  • Exact MAP decoding is NP-hard (Sima'an 1996): the set of distinct translation strings is exponential in size

y∗ = argmax_{y∈HG(x)} p(y|x) = argmax_{y∈HG(x)} Σ_{d∈D(x,y)} p(y, d|x)
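For readers who want something concrete, here is a minimal sketch (Python; not from the talk, and not the Joshua decoder's actual data structures) of how such a packed forest and its total derivation weight might be represented.

```python
import math
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class Hyperedge:
    head: str          # node this edge derives, e.g. "X[0,4]"
    tails: List[str]   # child nodes consumed by the rule ([] for lexical rules)
    rule: str          # SCFG rule, e.g. "X -> (X0 de X1, X1 on X0)"
    weight: float      # probability of applying this rule

@dataclass
class Hypergraph:
    root: str
    incoming: Dict[str, List[Hyperedge]] = field(default_factory=dict)

    def add_edge(self, edge: Hyperedge) -> None:
        self.incoming.setdefault(edge.head, []).append(edge)

    def inside(self, node: str) -> float:
        # Total probability mass of all derivations rooted at `node`
        # (every node is assumed to be derived by at least one edge).
        # Memoize this in practice: the forest is a DAG.
        return sum(e.weight * math.prod(self.inside(t) for t in e.tails)
                   for e in self.incoming[node])
```

Under this representation, inside(root) sums p(y, d | x) over every derivation in the forest; spurious ambiguity is exactly the fact that many of those derivations yield the same string y.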
SLIDE 51

Decoding with spurious ambiguity?

  • Maximum a posteriori (MAP) decoding
  • Viterbi approximation
  • N-best approximation (crunching) (May and Knight 2006)
SLIDES 52-56

Viterbi Approximation

  • Viterbi approximation: replace the sum over derivations with a max, i.e. output the string of the single best derivation

y∗ = argmax_{y∈Trans(x)} max_{d∈D(x,y)} p(y, d|x) = Y(argmax_{d∈D(x)} p(y, d|x))

where Y(d) denotes the translation string yielded by derivation d.

[Figure: on the toy example, the Viterbi scores of the red, blue, and green translations are 0.16, 0.14, and 0.13 (each string's single best derivation), so Viterbi decoding picks the red translation, whereas exact MAP (0.28, 0.28, 0.44) picks the green one.]
SLIDES 57-61

N-best Approximation

  • N-best approximation (crunching) (May and Knight 2006): sum p(y, d|x) only over the N best derivations

y∗ = argmax_{y∈Trans(x)} Σ_{d∈D(x,y)∩ND(x)} p(y, d|x)

where ND(x) is the set of the N highest-probability derivations of x.

[Figure: on the toy example, 4-best crunching keeps only the four best derivations (0.16, 0.14, 0.14, 0.13); the crunched scores of the red, blue, and green translations are 0.16, 0.28, and 0.13, so crunching picks the blue translation.]
SLIDES 62-65

MAP vs. Approximations

[Figure: the toy distribution again, with string scores MAP = 0.28 / 0.28 / 0.44, Viterbi = 0.16 / 0.14 / 0.13, and 4-best crunching = 0.16 / 0.28 / 0.13 for the red, blue, and green translations.]

  • Exact MAP decoding under spurious ambiguity is intractable
  • Viterbi and crunching are efficient, but ignore most derivations
  • Our goal: develop an approximation that considers all the derivations but still allows tractable decoding (a small worked example of the three decision rules follows below)
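A small worked sketch (Python) of the three decision rules on the toy distribution above; the probabilities are those shown on the slides, and the grouping of derivations into the red/blue/green strings is the one implied by the string totals 0.28 / 0.28 / 0.44.

```python
from collections import defaultdict

# Toy posterior over derivations, grouped by the translation string they yield.
derivations = [
    ("red",   0.16), ("red",   0.12),
    ("blue",  0.14), ("blue",  0.14),
    ("green", 0.13), ("green", 0.11), ("green", 0.10), ("green", 0.10),
]

def map_decode(ds):
    totals = defaultdict(float)
    for y, p in ds:
        totals[y] += p                       # sum over all derivations of each string
    return max(totals, key=totals.get)       # -> "green" (0.44)

def viterbi_decode(ds):
    return max(ds, key=lambda yp: yp[1])[0]  # single best derivation -> "red" (0.16)

def crunch_decode(ds, n=4):
    best_n = sorted(ds, key=lambda yp: yp[1], reverse=True)[:n]
    return map_decode(best_n)                # sum within the N-best list -> "blue" (0.28)

print(map_decode(derivations), viterbi_decode(derivations), crunch_decode(derivations))
```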
SLIDES 66-68

Variational Decoding

Decoding using a variational approximation, i.e., decoding using a sentence-specific approximate distribution.
SLIDES 69-87

Variational Decoding for MT: an Overview

Sentence-specific decoding, in three steps:

1. Generate a hypergraph: the foreign sentence x is decoded by the SMT system into a hypergraph encoding p(y, d | x). MAP decoding of p(y | x) on this hypergraph is intractable.

2. Estimate a model from the hypergraph: fit q*(y | x) ≈ Σ_{d∈D(x,y)} p(y, d|x), where q* is an n-gram model over output strings.

3. Decode using q* on the hypergraph.

[Figure: the dianzi shang de mao hypergraph from Slides 43-50, annotated with p(y, d | x), q*(y | x), and the three steps.]
SLIDES 88-99

Variational Inference

  • We want to do inference under p, but it is intractable:

    y∗ = argmax_y p(y|x)

  • Instead, we derive a simpler distribution q*:

    q∗ = argmin_{q∈Q} KL(p||q)

  • Then, we use q* as a surrogate for p in inference:

    y∗ = argmax_y q∗(y | x)

[Figure: the space P of all distributions, with p lying outside the tractable family Q, and q* the member of Q closest to p.]
SLIDES 100-108

Variational Approximation

  • q*: the member of a family of distributions Q having minimum distance to p

q∗ = argmin_{q∈Q} KL(p||q)
   = argmin_{q∈Q} Σ_{y∈Trans(x)} p log (p / q)
   = argmin_{q∈Q} Σ_{y∈Trans(x)} (p log p − p log q)
   = argmax_{q∈Q} Σ_{y∈Trans(x)} p log q        (the Σ p log p term is constant in q)

  • Three questions
  • how to parameterize q?
  • how to estimate q*?
  • how to use q* for decoding?
SLIDES 109-117

Parameterization of q∈Q

  • Naturally, we parameterize q as an n-gram model
  • The probability of a string is the product of the probabilities of the n-grams appearing in that string
  • Example (3-gram model), y: a b c d e f

    q(y) = q(a) · q(b|a) · q(c|ab) · q(d|bc) · q(e|cd) · q(f|de)

  • Other parameterizations are possible!
  • How do we estimate these n-gram probabilities? (A small sketch of the factorization appears below.)
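A minimal sketch (Python; the helper name and the cond_prob table are illustrative, not part of the talk) of this factorization: a string's probability under q is the product of its n-gram conditional probabilities.

```python
import math

def ngram_log_prob(tokens, cond_prob, n=3):
    """log q(y) for an n-gram model: sum of log q(word | previous n-1 words).

    `cond_prob` maps (history_tuple, word) -> probability.  As on the slide,
    histories are simply truncated at the start of the string (no <s> padding)."""
    logp = 0.0
    for i, word in enumerate(tokens):
        history = tuple(tokens[max(0, i - (n - 1)):i])  # up to n-1 previous words
        logp += math.log(cond_prob[(history, word)])
    return logp
```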
SLIDES 118-125

Estimation of q*∈Q

  • Variational approximation:

    q∗ = argmax_{q∈Q} Σ_{y∈Trans(x)} p log q

  • q* is a maximum likelihood estimate (MLE), where p plays the role of the empirical distribution
  • But in our case, p is defined not by a corpus, but by a hypergraph for a given test sentence!
  • So we estimate the n-gram model (e.g., a bigram model) directly from the hypergraph, either
  • by brute force, or
  • by dynamic programming
SLIDES 126-137

Estimating q* from a hypergraph: brute force

Bigram estimation:

  • unpack the hypergraph into its output strings: "the mat a cat", "a cat on the mat", "a cat of the mat", "the mat 's a cat", with posterior probabilities 2/8, 1/8, 3/8, and 2/8
  • accumulate the soft count of each bigram, weighted by the probability of the string it occurs in
  • normalize the counts

For example, the counts of what follows "cat" normalize to Pr(on | cat) = 1/8, Pr(of | cat) = 2/8, Pr(</s> | cat) = 5/8.

(A sketch of this procedure appears below.)
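A sketch (Python) of the brute-force estimator on this example. The pairing of probabilities with strings below is an assumption chosen so that the result matches the conditional probabilities reported on the slide; the original figure layout does not survive extraction.

```python
from collections import defaultdict

# Output strings from the unpacked hypergraph with their posterior probabilities.
# NOTE: this particular pairing of probabilities to strings is an assumption,
# chosen to reproduce the conditional probabilities reported on the slide.
strings = [
    ("the mat a cat".split(),    2 / 8),
    ("a cat on the mat".split(), 1 / 8),
    ("a cat of the mat".split(), 2 / 8),
    ("the mat 's a cat".split(), 3 / 8),
]

counts = defaultdict(float)          # soft count of each bigram (w1, w2)
history_counts = defaultdict(float)  # soft count of each history word w1

for words, p in strings:
    padded = ["<s>"] + words + ["</s>"]
    for w1, w2 in zip(padded, padded[1:]):
        counts[(w1, w2)] += p        # accumulate soft counts, weighted by p(y|x)
        history_counts[w1] += p

def q(w2, w1):
    """Estimated bigram probability q(w2 | w1) = count(w1 w2) / count(w1)."""
    return counts[(w1, w2)] / history_counts[w1]

print(q("on", "cat"), q("of", "cat"), q("</s>", "cat"))   # 0.125, 0.25, 0.625
```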
SLIDES 138-142

Estimating q* from a hypergraph: dynamic programming

Bigram estimation (sketched below):

  • run inside-outside on the hypergraph
  • accumulate the soft count of each bigram at each hyperedge
  • normalize the counts
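For intuition, a hedged sketch (Python, reusing the hypothetical Hypergraph class from the earlier sketch) of the dynamic-programming view: inside and outside scores give each hyperedge a posterior weight, and the n-grams that the edge introduces are counted with that weight instead of unpacking the forest.

```python
import math

def edge_posteriors(hg, inside, outside):
    """Posterior probability of using each hyperedge, from inside/outside scores.

    inside[v]  = total weight of derivations below node v (bottom-up pass)
    outside[v] = total weight of derivation contexts above v (symmetric top-down pass)

    The posterior of an edge e with head h is
        outside[h] * weight(e) * prod(inside[t] for t in e.tails) / inside[root].
    Each bigram that edge e introduces (this depends on the boundary words of its
    tail subforests, which the full algorithm tracks and which is simplified away
    here) then receives this posterior as its soft count, exactly as in the
    brute-force estimator, but without ever enumerating the strings."""
    z = inside[hg.root]
    return [(e, outside[head] * e.weight * math.prod(inside[t] for t in e.tails) / z)
            for head, edges in hg.incoming.items() for e in edges]
```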
SLIDES 143-149

Decoding using q*∈Q

  • Rescore the hypergraph HG(x):

    y∗ = argmax_{y∈HG(x)} q∗(y|x)

  • q* is an n-gram model, and we have efficient dynamic programming algorithms to score a hypergraph with an n-gram model (a small sketch follows)
  • John already told you how to do this ☺
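For intuition, a sketch (Python) of the decision rule over an explicit candidate list, using the bigram estimate q from the earlier sketch; on the real hypergraph the same argmax is computed with the usual dynamic program for scoring a forest with an n-gram model, not by enumeration.

```python
import math

def q_log_prob(words, q):
    """log q*(y | x) under a bigram model q(w2 | w1) (see the estimation sketch).
    Assumes every bigram of the candidate was observed in the forest, which holds
    by construction when candidates come from the same hypergraph."""
    padded = ["<s>"] + words + ["</s>"]
    return sum(math.log(q(w2, w1)) for w1, w2 in zip(padded, padded[1:]))

def variational_decode(candidates, q):
    # candidates: list of tokenized output strings found in the hypergraph
    return max(candidates, key=lambda y: q_log_prob(y, q))
```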
SLIDES 150-154

KL divergences under different variational models

q∗ = argmin_{q∈Q} KL(p||q),   where KL(p||q) = H(p, q) − H(p)

Measure   H(p)   KL(p||q*_1)   KL(p||q*_2)   KL(p||q*_3)   KL(p||q*_4)    (bits/word)
MT'04     1.36   0.97          0.32          0.21          0.17
MT'05     1.37   0.94          0.32          0.21          0.17

  • The larger the order n is, the smaller the KL divergence is!
  • The reduction in KL divergence happens mostly when switching from unigram to bigram
  • How to compute these quantities on a hypergraph? See (Li and Eisner, EMNLP'09)
SLIDES 155-159

BLEU scores when using a single variational n-gram model

Decoding scheme   MT'04   MT'05
Viterbi           35.4    32.6
1-gram            25.9    24.5
2-gram            36.1    33.4
3-gram            36.0    33.1
4-gram            35.8    32.9

  • unigram performs very badly
  • bigram achieves the best BLEU scores. Why not 3-gram or 4-gram? Modeling error in p.
SLIDES 160-165

  • BLEU cares about both low- and high-order n-gram matches
  • Interpolating variational n-gram models for different n:

    y∗ = argmax_{y∈HG(x)} Σ_n θ_n · log q∗_n(y | x)

  • Viterbi and variational are different ways of approximating p, so we can also interpolate with the Viterbi score:

    y∗ = argmax_{y∈HG(x)} Σ_n θ_n · log q∗_n(y | x) + θ_v · log p_Viterbi(y | x)
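A small sketch (Python; q_models, thetas, and viterbi_log_prob are hypothetical helpers, not the talk's implementation) of the interpolated decision rule.

```python
def interpolated_score(y, q_models, thetas, viterbi_log_prob=None, theta_v=0.0):
    """Sum_n theta_n * log q*_n(y|x), optionally plus theta_v * log p_Viterbi(y|x).

    q_models: dict mapping order n -> a function tokens -> log q*_n(y|x)
    thetas:   dict mapping order n -> its interpolation weight theta_n"""
    score = sum(thetas[n] * q_models[n](y) for n in q_models)
    if viterbi_log_prob is not None:
        score += theta_v * viterbi_log_prob(y)
    return score

def decode(candidates, q_models, thetas, viterbi_log_prob=None, theta_v=0.0):
    return max(candidates, key=lambda y: interpolated_score(
        y, q_models, thetas, viterbi_log_prob, theta_v))
```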
SLIDE 166

Minimum Bayes Risk (MBR) decoding? (Tromble et al. 2008; DeNero et al. 2009)

SLIDE 167

Minimum Risk Decoding

  • Maximum a posteriori (MAP) decoding: find the most probable translation string

    y∗ = argmax_{y∈HG(x)} p(y|x)

  • Minimum risk decoding: find the consensus translation string

    Risk(y) = Σ_{y′} L(y, y′) p(y′|x)

    y∗ = argmin_{y∈HG(x)} Risk(y)
SLIDES 168-174

Variational Decoding (VD) vs. MBR (Tromble et al. 2008)

[Figure: a diagram relating spurious ambiguity, consensus, VD, MBR, and interpolated VD.]

Both the BLEU metric and our variational distributions happen to use n-gram dependencies.
SLIDES 175-180

  • Variational decoding with interpolation

    n-gram probability:  q(r(w) | h(w), x) = Σ_{y′} c_w(y′) p(y′ | x) / Σ_{y′} c_{h(w)}(y′) p(y′ | x)

    n-gram model:        q_n(y | x) = Π_{w∈W_n} q(r(w) | h(w), x)^{c_w(y)}

    decision rule:       y∗ = argmax_{y∈HG(x)} Σ_n θ_n · log q∗_n(y | x)

  • Minimum risk decoding (Tromble et al. 2008)

    n-gram probability:  g(w | x) = Σ_{y′} δ_w(y′) p(y′ | x)   (non-probabilistic; very expensive to compute)

    n-gram model:        g_n(y | x) = Σ_{w∈W_n} g(w | x) · c_w(y)

    decision rule:       y∗ = argmax_{y∈HG(x)} Σ_n θ_n · g_n(y | x)

Here w ranges over the n-grams W_n, h(w) is the history (first n−1 words) of w, r(w) its last word, c_w(y) the count of w in y, and δ_w(y′) indicates whether w occurs in y′.
SLIDES 181-182

BLEU Results on Chinese-English NIST MT Tasks

Decoding scheme                 MT'04   MT'05
Viterbi                         35.4    32.6
MBR (K=1000)                    35.8    32.7
Crunching (N=10000)             35.7    32.8
Crunching+MBR (N=10000)         35.8    32.7
Variational (1to4gram+wp+vt)    36.6    33.5

  • variational decoding improves over Viterbi, MBR, and crunching
SLIDE 183

Conclusions

  • Exact MAP decoding with spurious ambiguity is intractable
  • Viterbi or N-best approximations are efficient, but ignore most derivations
  • We developed a variational approximation, which considers all derivations but still allows tractable decoding
  • Our variational decoding improves a state-of-the-art baseline
SLIDE 184

Future Directions

  • The MT pipeline is full of intractable problems; variational approximation is a principled way to tackle them
  • Decoding with spurious ambiguity is a common problem in many other NLP applications
  • models with latent variables
  • Data-Oriented Parsing (DOP)
  • Hidden Markov Models (HMMs)
  • ...
SLIDE 185

Thank you! 谢谢!
SLIDE 186

Joshua

SLIDE 187 (backup)

[Backup slide: the three-step variational decoding overview from Slides 69-87, repeated.]