 
              CRF Word Alignment & Noisy Channel Translation January 31, 2013 Tuesday, February 19, 13
Last Time ...
$$p(\text{Translation}) = \sum_{\text{Alignment}} p(\text{Translation}, \text{Alignment}) = \sum_{\text{Alignment}} p(\text{Alignment}) \times p(\text{Translation} \mid \text{Alignment})$$
$$p(\mathbf{e} \mid \mathbf{f}, m) = \sum_{\mathbf{a} \in [0,n]^m} p(\mathbf{a} \mid \mathbf{f}, m) \times \prod_{i=1}^{m} p(e_i \mid f_{a_i})$$
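The sum over alignments has (n+1)^m terms, but when the alignment prior is uniform (the IBM Model 1 assumption, which this slide does not itself require) it collapses into a product of per-word sums. The sketch below checks that on a toy example; the table `T` and its numbers are made up for illustration.

```python
import itertools
import math

# Toy lexical translation table: T[(e_word, f_word)] = p(e_word | f_word).
# All values are invented for illustration.
T = {
    ("the", "NULL"): 0.3, ("the", "das"): 0.6, ("the", "haus"): 0.1,
    ("house", "NULL"): 0.1, ("house", "das"): 0.1, ("house", "haus"): 0.8,
}

def likelihood_brute_force(e, f):
    """Sum over every alignment a in [0, n]^m, assuming uniform p(a | f, m)."""
    src = ["NULL"] + f                       # position 0 is the NULL word
    n, m = len(f), len(e)
    p_a = 1.0 / (n + 1) ** m                 # assumption: uniform alignment prior
    total = 0.0
    for a in itertools.product(range(n + 1), repeat=m):
        total += p_a * math.prod(T[(e[i], src[a[i]])] for i in range(m))
    return total

def likelihood_factored(e, f):
    """Same quantity, computed as a product of per-word sums."""
    src = ["NULL"] + f
    n = len(f)
    return math.prod(sum(T[(ei, fj)] for fj in src) / (n + 1) for ei in e)
```

The brute-force version costs O((n+1)^m) while the factored one costs O(nm); both should agree up to floating-point error.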
MAP alignment
[Figure: two panels, labeled "IBM Model 4 alignment" and "Our model's alignment"]
A few tricks...
[Figure: panels labeled p(f|e) and p(e|f)]
Another View
With this model:
$$p(\mathbf{e} \mid \mathbf{f}, m) = \sum_{\mathbf{a} \in [0,n]^m} p(\mathbf{a} \mid \mathbf{f}, m) \times \prod_{i=1}^{m} p(e_i \mid f_{a_i})$$
The problem of word alignment is:
$$\mathbf{a}^* = \arg\max_{\mathbf{a} \in [0,n]^m} p(\mathbf{a} \mid \mathbf{e}, \mathbf{f}, m)$$
Can we model this distribution directly?
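For contrast with modeling the posterior directly, here is a minimal sketch of the arg max under the generative model, assuming a uniform alignment prior (the IBM Model 1 case). Under that assumption the posterior factorizes over target positions, so each a_i can be chosen independently. The table `T` is a made-up toy, not data from the lecture.

```python
# Toy lexical translation table: T[(e_word, f_word)] = p(e_word | f_word).
# All values are invented for illustration.
T = {
    ("the", "NULL"): 0.3, ("the", "das"): 0.6, ("the", "haus"): 0.1,
    ("house", "NULL"): 0.1, ("house", "das"): 0.1, ("house", "haus"): 0.8,
}

def map_alignment(e, f):
    """Return a* under a uniform prior: a_i = argmax_j p(e_i | f_j).

    Index 0 denotes the NULL source word; indices 1..n index into f.
    """
    src = ["NULL"] + f
    return [max(range(len(src)), key=lambda j: T[(ei, src[j])]) for ei in e]
```

With a non-uniform prior p(a | f, m) this per-position shortcut would no longer be valid in general.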
Markov Random Fields (MRFs)
[Graph: nodes A, B, C and X, Y, Z]
$$p(A, B, C, X, Y, Z) = p(A) \times p(B \mid A) \times p(C \mid B) \times p(X \mid A)\, p(Y \mid B)\, p(Z \mid C)$$
[Graph: nodes A, B, C and X, Y, Z]
$$p(A, B, C, X, Y, Z) = \frac{1}{Z} \times \Psi_1(A, B) \times \Psi_2(B, C) \times \Psi_3(C, D) \times \Psi_4(X) \times \Psi_5(Y) \times \Psi_6(Z)$$
The $\Psi_j$ terms are the "factors".
Computing Z
[Graph: X, Y] with $\mathcal{X} = \{a, b, c\}$, $X \in \mathcal{X}$, $Y \in \mathcal{X}$
$$Z = \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{X}} \Psi_1(x, y)\, \Psi_2(x)\, \Psi_3(y)$$
When the graph has certain structures (e.g., chains), you can factor to get polytime DP algorithms.
$$Z = \sum_{x \in \mathcal{X}} \Psi_2(x) \sum_{y \in \mathcal{X}} \Psi_1(x, y)\, \Psi_3(y)$$
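A small numeric check of the two expressions for Z may help. The potentials below are arbitrary non-negative functions chosen for illustration; only the domain {a, b, c} comes from the slide.

```python
import itertools

DOMAIN = ["a", "b", "c"]

# Hypothetical potentials; any non-negative functions would do.
def psi1(x, y):
    return 2.0 if x == y else 1.0

def psi2(x):
    return {"a": 1.0, "b": 2.0, "c": 3.0}[x]

def psi3(y):
    return {"a": 0.5, "b": 1.0, "c": 1.5}[y]

def z_brute_force():
    """Sum the product of all factors over every joint assignment (x, y)."""
    return sum(psi1(x, y) * psi2(x) * psi3(y)
               for x, y in itertools.product(DOMAIN, DOMAIN))

def z_factored():
    """Pull psi2(x) out of the inner sum over y."""
    return sum(psi2(x) * sum(psi1(x, y) * psi3(y) for y in DOMAIN)
               for x in DOMAIN)
```

With only two variables both versions do the same amount of work; the payoff of pushing sums inward appears on longer chains, where it turns an exponential enumeration into a polynomial-time dynamic program.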
Log-linear models
[Graph: nodes A, B, C and X, Y, Z]
$$p(A, B, C, X, Y, Z) = \frac{1}{Z} \times \Psi_1(A, B) \times \Psi_2(B, C) \times \Psi_3(C, D) \times \Psi_4(X) \times \Psi_5(Y) \times \Psi_6(Z)$$
$$\Psi_{1,2,3}(x, y) = \exp \sum_k w_k f_k(x, y)$$
$w_k$: weights (learned)
$f_k$: feature functions (specified)
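As a minimal sketch of a log-linear factor, the snippet below evaluates exp(sum_k w_k f_k(x, y)) for two invented feature functions. The feature names and weight values are hypothetical; in practice the weights would be learned.

```python
import math

# Hypothetical feature functions (specified by the modeler).
def f_same(x, y):
    return 1.0 if x == y else 0.0

def f_x_is_a(x, y):
    return 1.0 if x == "a" else 0.0

FEATURES = [f_same, f_x_is_a]
WEIGHTS = [1.5, -0.5]          # made-up values standing in for learned weights

def psi(x, y, weights=WEIGHTS, features=FEATURES):
    """Log-linear factor: exp of the weighted sum of feature values."""
    return math.exp(sum(w * f(x, y) for w, f in zip(weights, features)))
```

Because the factor is an exponential, it is always positive, and its logarithm is linear in the weights.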
Random Fields
• Benefits
  • Potential functions can be defined with respect to arbitrary features (functions) of the variables
  • Great way to incorporate knowledge
• Drawbacks
  • Likelihood involves computing Z
  • Maximizing likelihood usually requires computing Z (often over and over again!)
Conditional Random Fields
• Use MRFs to parameterize a conditional distribution. Very easy: let feature functions look at anything they want in the "input"
$$p(\mathbf{y} \mid \mathbf{x}) = \frac{1}{Z_{\mathbf{w}}(\mathbf{x})} \exp \sum_{F \in G} \sum_k w_k f_k(F, \mathbf{x})$$
$G$: all factors in the graph of $\mathbf{y}$
Parameter Learning
• CRFs are trained to maximize conditional likelihood
$$\hat{\mathbf{w}}_{\text{MLE}} = \arg\max_{\mathbf{w}} \prod_{(\mathbf{x}_i, \mathbf{y}_i) \in D} p(\mathbf{y}_i \mid \mathbf{x}_i; \mathbf{w})$$
• Recall we want to directly model $p(\mathbf{a} \mid \mathbf{e}, \mathbf{f})$
• The likelihood of what alignments? Gold reference alignments!
CRF for Alignment
• One of many possibilities, due to Blunsom & Cohn (2006)
$$p(\mathbf{a} \mid \mathbf{e}, \mathbf{f}) = \frac{1}{Z_{\mathbf{w}}(\mathbf{e}, \mathbf{f})} \exp \sum_{i=1}^{|\mathbf{e}|} \sum_k w_k f_k(a_i, a_{i-1}, i, \mathbf{e}, \mathbf{f})$$
• $\mathbf{a}$ has the same form as in the lexical translation models (still make a one-to-many assumption)
• $w_k$ are the model parameters
• $f_k$ are the feature functions
$O(n^2 m) \approx O(n^3)$
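Because each feature sees only a_i and a_{i-1}, the normalizer can be computed with a forward recursion over the chain in O(n^2 m) time. The sketch below is an illustration under stated assumptions, not Blunsom & Cohn's implementation: the two features (`ident`, `mono`), their weights, and the handling of the first position (no previous label) are all hypothetical.

```python
import itertools
import math

def log_sum_exp(xs):
    """Numerically stable log(sum(exp(x) for x in xs))."""
    mx = max(xs)
    return mx + math.log(sum(math.exp(x - mx) for x in xs))

def score(a_i, a_prev, i, e, f, w):
    """sum_k w_k f_k(a_i, a_prev, i, e, f) for two hypothetical features."""
    src = ["NULL"] + f                      # label 0 aligns to NULL
    s = 0.0
    if src[a_i] == e[i]:                    # identical-word feature
        s += w["ident"]
    if a_prev is not None and a_i == a_prev + 1:   # monotone-step feature
        s += w["mono"]
    return s

def log_z_forward(e, f, w):
    """log Z_w(e, f) via the forward algorithm: O(n^2 m)."""
    labels = range(len(f) + 1)
    # Assumption: the first position has no previous label.
    alpha = [score(j, None, 0, e, f, w) for j in labels]
    for i in range(1, len(e)):
        alpha = [
            log_sum_exp([alpha[jp] + score(j, jp, i, e, f, w) for jp in labels])
            for j in labels
        ]
    return log_sum_exp(alpha)

def log_z_brute_force(e, f, w):
    """Same quantity by enumerating all (n+1)^m alignments; for checking only."""
    labels = range(len(f) + 1)
    totals = []
    for a in itertools.product(labels, repeat=len(e)):
        s = score(a[0], None, 0, e, f, w)
        s += sum(score(a[i], a[i - 1], i, e, f, w) for i in range(1, len(e)))
        totals.append(s)
    return log_sum_exp(totals)
```

Replacing `log_sum_exp` with `max` (and keeping back-pointers) in the same recursion would give the highest-scoring alignment rather than the normalizer.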
Model
• Labels (one per target word) index source sentence
• Train model (e,f) and (f,e) [inverting the reference alignments]
Experiments
[Figure: example sentence pair annotated with feature types]
pervez musharrafs langer abschied
pervez musharraf ’s long goodbye
Features illustrated: identical word, matching prefix, matching suffix, orthographic similarity, in dictionary, ...
Lexical Features
• Word ↔ word indicator features
• Various word ↔ word co-occurrence scores
  • IBM Model 1 probabilities (t → s, s → t)
  • Geometric mean of Model 1 probabilities
  • Dice’s coefficient [binned]
  • Products of the above
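One of the co-occurrence scores above, Dice's coefficient, can be sketched as follows. Counts are taken over sentence pairs (each word type counted once per sentence), and the score is turned into an indicator feature by binning. The bin edges and the feature-name format are assumptions; the slides do not specify them.

```python
from collections import Counter

def dice_table(bitext):
    """Dice(e, f) = 2 * count(e, f) / (count(e) + count(f)) over sentence pairs."""
    c_e, c_f, c_ef = Counter(), Counter(), Counter()
    for e_sent, f_sent in bitext:
        e_types, f_types = set(e_sent), set(f_sent)
        c_e.update(e_types)
        c_f.update(f_types)
        c_ef.update((e, f) for e in e_types for f in f_types)
    return {(e, f): 2.0 * c / (c_e[e] + c_f[f]) for (e, f), c in c_ef.items()}

def binned_feature(name, value, edges=(0.25, 0.5, 0.75)):
    """Map a score in [0, 1] to an indicator feature name (hypothetical bin edges)."""
    b = sum(value >= edge for edge in edges)   # number of edges at or below value
    return f"{name}_bin{b}"
```

Binning lets a log-linear model assign a separate weight to each score range instead of assuming the score acts linearly.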
Lexical Features
• Word class ↔ word class indicator
  • NN translates as NN (NN_NN=1)
  • NN does not translate as MD (NN_MD=1)
• Identical word feature
  • 2010 = 2010 (IdentWord=1 IdentNum=1)
• Identical prefix feature
  • Obama ~ Obamu (IdentPrefix=1)
• Orthographic similarity measure [binned]
  • Al-Qaeda ~ Al-Kaida (OrthoSim050_080=1)
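The word-level features above could be extracted along these lines. This is a sketch under assumptions: the slides do not define the similarity measure or the prefix length, so `difflib.SequenceMatcher.ratio()` stands in for the orthographic similarity, the prefix length is fixed at 4 characters, and the bins other than 0.50 to 0.80 are invented.

```python
from difflib import SequenceMatcher

PREFIX_LEN = 4   # assumption: the slides do not say how long a prefix must be

def ortho_bin(sim):
    """Bin a similarity in [0, 1]; only the 050_080 bin name comes from the slide."""
    if sim < 0.5:
        return "OrthoSim000_050"
    if sim < 0.8:
        return "OrthoSim050_080"
    return "OrthoSim080_100"

def lexical_features(e_word, f_word):
    """Return a dict of indicator features that fire for this word pair."""
    feats = {}
    if e_word == f_word:
        feats["IdentWord"] = 1
        if e_word.isdigit():
            feats["IdentNum"] = 1
    if (len(e_word) >= PREFIX_LEN and len(f_word) >= PREFIX_LEN
            and e_word[:PREFIX_LEN] == f_word[:PREFIX_LEN]):
        feats["IdentPrefix"] = 1
    # Stand-in similarity measure (assumption), computed case-insensitively.
    sim = SequenceMatcher(None, e_word.lower(), f_word.lower()).ratio()
    feats[ortho_bin(sim)] = 1
    return feats
```

For instance, `lexical_features("Obama", "Obamu")` fires `IdentPrefix` but not `IdentWord`, matching the slide's example.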
Other Features
• Compute features from large amounts of unlabeled text
  • Does the Model 4 alignment contain this alignment point?
  • What is the Model 1 posterior probability of this alignment point?
Results
Summary
Unfortunately, you need gold alignments!
Putting the pieces together
• We have seen how to model the following:
$$p(\mathbf{e}) \qquad p(\mathbf{e} \mid \mathbf{f}, m) \qquad p(\mathbf{e}, \mathbf{a} \mid \mathbf{f}, m) \qquad p(\mathbf{a} \mid \mathbf{e}, \mathbf{f})$$
• Goal: a better model of $p(\mathbf{e} \mid \mathbf{f}, m)$ that knows about $p(\mathbf{e})$
One naturally wonders if the problem of translation could conceivably be treated as a problem in cryptography. When I look at an article in Russian, I say: ‘This is really written in English, but it has been coded in some strange symbols. I will now proceed to decode.’
Warren Weaver to Norbert Wiener, March, 1947