CRF Word Alignment & Noisy Channel Translation
Machine Translation Lecture 6 Instructor: Chris Callison-Burch TAs: Mitchell Stern, Justin Chiu Website: mt-class.org/penn
Last Time ...
$$p(\text{Translation}) = \sum_{\text{Alignment}} p(\text{Translation}, \text{Alignment}) = \sum_{\text{Alignment}} p(\text{Alignment}) \times p(\text{Translation} \mid \text{Alignment})$$
Here an alignment is a vector $a \in [0, n]^m$, and the model factors as a product over positions $i = 1, \ldots, m$.
[Figure: IBM Model 4 alignment vs. our model's alignment]
Alignments $a \in [0, n]^m$ are scored by $p(a \mid e, f, m)$.
[Figure: directed graphical model over A, B, C, X, Y, Z]
$$p(A, B, C, X, Y, Z) = p(A) \times p(B \mid A) \times p(C \mid B) \times p(X \mid A) \times p(Y \mid B) \times p(Z \mid C)$$
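[Figure: undirected graphical model over A, B, C, X, Y, Z]
$$p(A, B, C, X, Y, Z) = \frac{1}{Z} \times \Psi_1(A, B) \times \Psi_2(B, C) \times \Psi_3(C, D) \times \Psi_4(X) \times \Psi_5(Y) \times \Psi_6(Z)$$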
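Computing Z for two variables X and Y:
$$\mathcal{X} = \{a, b, c\}, \quad X \in \mathcal{X}, \quad Y \in \mathcal{X}$$
$$Z = \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{X}} \Psi_1(x, y)\, \Psi_2(x)\, \Psi_3(y)$$
$$Z = \sum_{x \in \mathcal{X}} \Psi_2(x) \sum_{y \in \mathcal{X}} \Psi_1(x, y)\, \Psi_3(y)$$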
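The two expressions for Z can be checked numerically. The following is a minimal sketch: the potential values are made up for illustration (the slides give none), and the names `psi1`, `psi2`, `psi3`, `z_naive`, and `z_factored` are hypothetical.

```python
import itertools
import math

# Domain from the slide: both X and Y take values in {a, b, c}.
DOMAIN = ["a", "b", "c"]

# Hypothetical unary potentials (made-up values, for illustration only).
psi2 = {"a": 1.0, "b": 2.0, "c": 0.5}  # factor on x
psi3 = {"a": 3.0, "b": 1.0, "c": 2.0}  # factor on y


def psi1(x, y):
    """Hypothetical pairwise potential that favors x == y."""
    return 4.0 if x == y else 1.0


def z_naive():
    """Sum the full product of factors over every (x, y) pair."""
    return sum(
        psi1(x, y) * psi2[x] * psi3[y]
        for x, y in itertools.product(DOMAIN, DOMAIN)
    )


def z_factored():
    """Same sum, with psi2(x) pulled out of the inner sum over y."""
    return sum(
        psi2[x] * sum(psi1(x, y) * psi3[y] for y in DOMAIN)
        for x in DOMAIN
    )


print(z_naive(), z_factored())
```

Both functions return the same value. With only two variables the saving is small, but the same rearrangement is what makes Z tractable on longer chains.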
$$\Psi_{1,2,3}(x, y) = \exp \sum_k w_k f_k(x, y)$$
Factors can be sensitive to arbitrary features (functions) of the variables
The cost: computing Z (often over and over again!)
$$p(y \mid x) = \frac{1}{Z_w(x)} \exp \sum_{F \in G} \sum_k w_k f_k(F, x)$$
where $G$ is the set of all factors in the graph of $y$.
p(a | e, f)
$$\hat{w}_{\mathrm{MLE}} = \arg\max_w \prod_{(x_i, y_i) \in \mathcal{D}} p(y_i \mid x_i; w)$$
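As a worked step (a sketch that assumes the log-linear form of $p(y \mid x)$ given above; $G(y)$ is shorthand introduced here for the factors of the graph under assignment $y$), taking logs turns the product into a sum, and the gradient becomes observed feature counts minus expected feature counts:

```latex
\begin{align*}
\mathcal{L}(w) &= \sum_{(x_i, y_i) \in \mathcal{D}} \log p(y_i \mid x_i; w)
  = \sum_{(x_i, y_i) \in \mathcal{D}} \Big[ \sum_{F \in G(y_i)} \sum_k w_k f_k(F, x_i) - \log Z_w(x_i) \Big] \\
\frac{\partial \mathcal{L}}{\partial w_k} &= \sum_{(x_i, y_i) \in \mathcal{D}}
  \Big[ \sum_{F \in G(y_i)} f_k(F, x_i)
  - \mathbb{E}_{y \sim p(\cdot \mid x_i; w)} \sum_{F \in G(y)} f_k(F, x_i) \Big]
\end{align*}
```

The expectation term requires the normalizer for every training pair at every update, which is why Z ends up being computed repeatedly during training.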
Cohn (2006)
models (still make a one-to-many assumption)
$$p(a \mid e, f) = \frac{1}{Z_w(e, f)} \exp \sum_{i=1}^{|e|} \sum_k w_k f_k(a_i, a_{i-1}, i, e, f)$$
Inference complexity: $O(n^2 m) \approx O(n^3)$
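Because each feature looks only at a_i and a_{i-1}, the normalizer can be computed with the forward algorithm rather than by enumerating every alignment. The sketch below is illustrative only: the two features, the weights, and all function names are hypothetical, alignments to NULL are left out for brevity, and each position i of e is assumed to choose one position a_i in f.

```python
import itertools
import math


def logsumexp(xs):
    """Numerically stable log(sum(exp(x) for x in xs))."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))


def score(a_i, a_prev, i, e, f, w):
    """Hypothetical log-potential sum_k w_k f_k(a_i, a_prev, i, e, f)."""
    identical = 1.0 if e[i] == f[a_i] else 0.0
    monotone = 1.0 if a_prev is not None and a_i == a_prev + 1 else 0.0
    return w["identical"] * identical + w["monotone"] * monotone


def path_score(a, e, f, w):
    """Unnormalized log score of one complete alignment a."""
    total, prev = 0.0, None
    for i, a_i in enumerate(a):
        total += score(a_i, prev, i, e, f, w)
        prev = a_i
    return total


def log_z_forward(e, f, w):
    """log Z_w(e, f): m positions, n^2 transitions each."""
    n = len(f)
    # alpha[j]: log total weight of alignment prefixes that end with a_i = j
    alpha = [score(j, None, 0, e, f, w) for j in range(n)]
    for i in range(1, len(e)):
        alpha = [
            logsumexp([alpha[jp] + score(j, jp, i, e, f, w) for jp in range(n)])
            for j in range(n)
        ]
    return logsumexp(alpha)


def log_z_brute_force(e, f, w):
    """Same quantity by enumerating all n^m alignments (for checking only)."""
    alignments = itertools.product(range(len(f)), repeat=len(e))
    return logsumexp([path_score(a, e, f, w) for a in alignments])


def log_prob(a, e, f, w):
    """log p(a | e, f) under the toy model."""
    return path_score(a, e, f, w) - log_z_forward(e, f, w)


e = ["Obama", "spoke", "today"]
f = ["Obama", "sprach", "heute"]
w = {"identical": 2.0, "monotone": 1.0}  # made-up weights
print(log_z_forward(e, f, w), log_z_brute_force(e, f, w))
```

The forward pass and the brute-force sum agree, but the forward pass does n^2 work per position instead of enumerating n^m alignments.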
Identical word
Matching prefix
Matching suffix
Orthographic similarity
In dictionary
...
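The features listed above could be implemented as simple functions of a word pair. This is a hedged sketch, not the definitions used in the lecture: the affix length `k=3`, the use of `difflib.SequenceMatcher` as the orthographic similarity measure, the lowercasing, and the dictionary format are all assumptions made for illustration.

```python
import difflib


def identical_word(e_word, f_word):
    """1.0 if the two words are the same string (case-insensitive)."""
    return 1.0 if e_word.lower() == f_word.lower() else 0.0


def matching_prefix(e_word, f_word, k=3):
    """1.0 if both words share their first k characters (k is an assumption)."""
    if len(e_word) < k or len(f_word) < k:
        return 0.0
    return 1.0 if e_word[:k].lower() == f_word[:k].lower() else 0.0


def matching_suffix(e_word, f_word, k=3):
    """1.0 if both words share their last k characters (k is an assumption)."""
    if len(e_word) < k or len(f_word) < k:
        return 0.0
    return 1.0 if e_word[-k:].lower() == f_word[-k:].lower() else 0.0


def orthographic_similarity(e_word, f_word):
    """Similarity in [0, 1]; SequenceMatcher is a stand-in measure."""
    return difflib.SequenceMatcher(None, e_word.lower(), f_word.lower()).ratio()


def in_dictionary(e_word, f_word, dictionary):
    """1.0 if the pair appears in a set of (e, f) translation pairs."""
    return 1.0 if (e_word.lower(), f_word.lower()) in dictionary else 0.0


def extract_features(e_word, f_word, dictionary):
    """Collect all feature values for one candidate alignment link."""
    return {
        "identical": identical_word(e_word, f_word),
        "prefix": matching_prefix(e_word, f_word),
        "suffix": matching_suffix(e_word, f_word),
        "ortho": orthographic_similarity(e_word, f_word),
        "dict": in_dictionary(e_word, f_word, dictionary),
    }


toy_dictionary = {("house", "haus")}  # hypothetical bilingual dictionary
print(extract_features("national", "nationale", toy_dictionary))
```

Each function returns a real value, so the outputs can be used directly as the f_k values in a log-linear score.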
p(e) p(e | f, m) p(e, a | f, m) p(a | e, f)
Warren Weaver to Norbert Wiener, March, 1947
Claude Shannon. “A Mathematical Theory of Communication” 1948.
[Figure: Message M → Encoder → sent transmission Y → “Noisy” channel → received transmission X → Decoder → recovered message M′]
[Figure: noisy-channel diagram, repeated across slide builds: sent transmission Y → “Noisy” channel → received transmission X → Decoder → recovered message M′, with added labels Y′, y, e, and “source”; one build shows the example sentence “I can help.”]
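Source model $p(e)$, channel model $p(f \mid e)$:
$$e^* = \arg\max_e p(e \mid f) = \arg\max_e \frac{p(f \mid e) \times p(e)}{p(f)} = \arg\max_e p(f \mid e) \times p(e)$$
The denominator $p(f)$ does not depend on $e$, so it drops out of the arg max.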
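The decision rule can be illustrated with a toy decoder that scores a fixed list of candidate translations. Everything in this sketch is hypothetical: the candidate sentences, the probability values, and the names `LANGUAGE_MODEL`, `CHANNEL_MODEL`, and `decode` are made up, and a real decoder searches a far larger space than three candidates.

```python
import math

# Hypothetical source model p(e) over three candidate English sentences.
LANGUAGE_MODEL = {
    "I can help.": 0.60,
    "I help can.": 0.05,
    "I am able to assist.": 0.35,
}

# Hypothetical channel model p(f | e) for one fixed observed sentence f.
CHANNEL_MODEL = {
    "I can help.": 0.30,
    "I help can.": 0.50,
    "I am able to assist.": 0.10,
}


def decode(candidates, language_model, channel_model):
    """Return arg max over e of p(f | e) * p(e), computed in log space."""
    return max(
        candidates,
        key=lambda e: math.log(channel_model[e]) + math.log(language_model[e]),
    )


candidates = list(LANGUAGE_MODEL)
best = decode(candidates, LANGUAGE_MODEL, CHANNEL_MODEL)
channel_only = max(candidates, key=lambda e: CHANNEL_MODEL[e])
print(best)          # choice under p(f | e) * p(e)
print(channel_only)  # choice under p(f | e) alone
```

In this toy setup the channel model alone prefers the disfluent word order, while the product with p(e) selects the fluent sentence, which is the division of labor the noisy-channel formulation is meant to provide.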