Discriminative Training
February 19, 2013
Noisy Channels Again

An English source generates e with probability p(e); the channel turns it into German g with probability p(g | e). The decoder inverts the channel:

e* = argmax_e p(e | g)
   = argmax_e p(g | e) × p(e) / p(g)
   = argmax_e p(g | e) × p(e)
Taking logs (the argmax is unchanged) and writing the objective as a dot product:

e* = argmax_e log p(g | e) + log p(e)
   = argmax_e w^⊤ h(g, e),  where w = (1, 1) and h(g, e) = (log p(g | e), log p(e))
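The last step can be checked numerically: with w = (1, 1) and h(g, e) = (log p(g | e), log p(e)), the score w^⊤h equals log p(g | e) + log p(e), so the argmax matches the noisy-channel argmax. A minimal sketch with made-up probabilities:

```python
import math

# Hypothetical candidate translations e with (p(g|e), p(e)) pairs.
# The numbers are invented for illustration only.
candidates = {
    "man bites dog": (0.30, 0.010),
    "dog bites man": (0.25, 0.020),
    "man bite dog":  (0.35, 0.001),
}

def h(pg_e, pe):
    """Feature vector h(g, e) = (log p(g|e), log p(e))."""
    return (math.log(pg_e), math.log(pe))

w = (1.0, 1.0)  # the plain noisy channel corresponds to uniform weights

def score(feats):
    return sum(wi * hi for wi, hi in zip(w, feats))

best = max(candidates, key=lambda e: score(h(*candidates[e])))
# Same argmax as p(g|e) * p(e), since log is monotonic.
assert best == max(candidates, key=lambda e: candidates[e][0] * candidates[e][1])
print(best)  # → dog bites man
```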
Improvement 1: change w to find better translations.
Improvement 2: add dimensions to make points separable.
This generalizes the noisy channel in two ways, via the weight vector w and the feature vector h(g, e):

e* = argmax_e w^⊤ h(g, e)
x BITES y

Mann beißt Hund ("man bites dog")
- Are lexical choices appropriate? (bank = "river bank" vs. "financial institution")
- Are semantic/syntactic relations preserved? ("Dog bites man" vs. "Man bites dog")
- Is the output fluent / well-formed? ("Man bites dog" vs. "Man bite dog")
What do lexical features look like?

Example: Mann beißt Hund

First attempt:

score(g, e) = w^⊤ h(g, e)
h_15342(g, e) = 1 if ∃ i, j : g_i = Hund, e_j = cat, else 0

But what if a cat is being chased by a Hund?
Latent variables enable more precise features:

score(g, e, a) = w^⊤ h(g, e, a)
h_15342(g, e, a) = Σ_{(i,j) ∈ a} (1 if g_i = Hund, e_j = cat, else 0)
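The contrast between the two feature definitions can be sketched as follows. The feature index 15342 and the word pair come from the slides; the example sentence and the alignment are invented for illustration:

```python
# Naive lexical feature: fires if "Hund" appears anywhere in the source
# and "cat" anywhere in the target, regardless of whether they correspond.
def h_naive(g, e):
    return 1 if "Hund" in g and "cat" in e else 0

# Alignment-aware feature: counts only aligned pairs (i, j) with
# g[i] == "Hund" and e[j] == "cat".
def h_aligned(g, e, a):
    return sum(1 for (i, j) in a if g[i] == "Hund" and e[j] == "cat")

g = "die Katze wird von einem Hund gejagt".split()  # "the cat is chased by a dog"
e = "the cat is chased by a dog".split()
a = {(1, 1), (5, 6)}  # hypothetical alignment: Katze↔cat, Hund↔dog

print(h_naive(g, e))      # → 1: fires spuriously ("cat" and "Hund" merely co-occur)
print(h_aligned(g, e, a)) # → 0: Hund is aligned to "dog", not "cat"
```

The alignment-restricted feature stays silent on the chased cat, which is exactly the precision the slide is after.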
[Figure: hypotheses and reference translations plotted as points in feature space (h1, h2)]
We assume a decoder that computes:

⟨e*, a*⟩ = argmax_{⟨e,a⟩} w^⊤ h(g, e, a)

and K-best lists, that is:

{⟨e*_i, a*_i⟩}_{i=1}^{K} = arg ith-max_{⟨e,a⟩} w^⊤ h(g, e, a)

Standard, efficient algorithms exist for this.
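Real decoders extract K-best lists efficiently from a packed search space; as a stand-in, here is a brute-force sketch over an explicit candidate set (the feature vectors are invented):

```python
import numpy as np

def kbest(candidates, w, k):
    """Return the k highest-scoring (e, a) pairs under score = w . h(g, e, a).

    candidates: list of ((e, a), h) pairs, h a feature vector.
    A real decoder uses k-best extraction over the search hypergraph
    rather than enumerating and sorting all candidates.
    """
    scored = sorted(candidates, key=lambda ea_h: -np.dot(w, ea_h[1]))
    return [ea for ea, h in scored[:k]]

w = np.array([1.0, 0.5])
cands = [
    (("man bites dog", "a1"), np.array([-1.0, -2.0])),  # score -2.0
    (("dog bites man", "a2"), np.array([-1.2, -1.0])),  # score -1.7
    (("man bite dog",  "a3"), np.array([-0.5, -4.0])),  # score -2.5
]
print(kbest(cands, w, 2))  # → [('dog bites man', 'a2'), ('man bites dog', 'a1')]
```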
One idea: learn w so as to separate the reference translations from all other translations by the maximal margin.
Problem: this gives credit only when the model exactly produces the reference, and no credit otherwise. What about other good translations?
Alternative: use an evaluation metric that assigns a score for how good/bad a translation is, and delegate the comparison with the reference to this function.
[Figure: the 10-best hypotheses (#1–#10) under weight vector w, plotted in (h1, h2) space and shaded by metric score band: 0.8 ≤ · < 1.0, 0.6 ≤ · < 0.8, 0.4 ≤ · < 0.6, 0.2 ≤ · < 0.4, 0.0 ≤ · < 0.2]
Goal: rank the high-metric hypotheses highest with a linear model.
[Figure: candidate weight vectors tried on the 10-best list; relative to the metric shading, some rankings come out worse, others better]
Goal: find the weight vector that directly optimizes the evaluation metric.
Given weight vector w, any hypothesis ⟨e, a⟩ has a (scalar) score:

m = w^⊤ h(g, e, a)

Now pick a search vector v, and consider how this score changes as we move to w_new = w + γv:

m = (w + γv)^⊤ h(g, e, a)
  = w^⊤ h(g, e, a) + γ · v^⊤ h(g, e, a)
  = aγ + b,  where b = w^⊤ h(g, e, a) and a = v^⊤ h(g, e, a)

A linear function of γ!
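The identity m(γ) = aγ + b is easy to verify numerically (vectors invented for illustration):

```python
import numpy as np

w = np.array([1.0, 0.5, -0.3])  # current weights
v = np.array([0.0, 1.0,  1.0])  # search direction
h = np.array([2.0, -1.0, 4.0])  # h(g, e, a) for one hypothesis

b = np.dot(w, h)  # intercept: score at gamma = 0
a = np.dot(v, h)  # slope along v

# The direct score (w + gamma*v).h agrees with a*gamma + b for any gamma.
for gamma in (0.0, 0.5, 2.0, -1.0):
    direct = np.dot(w + gamma * v, h)
    assert abs(direct - (a * gamma + b)) < 1e-9
```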
Recall our k-best set {⟨e*_i, a*_i⟩}_{i=1}^{K}.

[Figure: each k-best hypothesis (e.g. ⟨e*_162, a*_162⟩, ⟨e*_28, a*_28⟩, ⟨e*_73, a*_73⟩) is a line in the (γ, m) plane; at each γ the decoder picks the hypothesis whose line is uppermost]
[Figure: the selected hypothesis changes where lines cross, so the error count is a piecewise-constant function of γ]
Let γ* be the value of γ that minimizes errors; then update:

w_new = γ* v + w
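Putting the pieces together for one sentence: each hypothesis is a line m = aγ + b, the top line changes only at intersections, and γ* is chosen to minimize error. A simplified sketch, using a grid over γ and per-hypothesis error counts (a real implementation finds the exact interval boundaries from the upper envelope of the lines, and accumulates metric statistics over all sentences):

```python
import numpy as np

def line_search(hyps, w, v, gammas):
    """Toy 1-D minimum-error line search along direction v.

    hyps: list of (features, error_count) pairs for one sentence's k-best list.
    Returns the gamma with the lowest error among the sampled values.
    """
    best_gamma, best_err = None, None
    for gamma in gammas:
        # Which hypothesis would the decoder pick at w + gamma * v?
        scores = [np.dot(w + gamma * v, h) for h, _ in hyps]
        err = hyps[int(np.argmax(scores))][1]
        if best_err is None or err < best_err:
            best_gamma, best_err = gamma, err
    return best_gamma, best_err

w = np.array([1.0, 1.0])
v = np.array([1.0, -1.0])
hyps = [
    (np.array([2.0, 0.0]), 3),  # (h(g, e, a), error count) — invented numbers
    (np.array([0.0, 2.0]), 1),
    (np.array([1.0, 1.0]), 0),
]
gamma_star, err = line_search(hyps, w, v, np.linspace(-2, 2, 81))
w_new = w + gamma_star * v
```

Note that the hypothesis with zero errors is never on top for any γ here; the search can only pick among hypotheses the model is able to rank first, which is why fresh k-best lists are gathered and the search repeated.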
This requires sufficient statistics for evaluation metrics (e.g., BLEU), which can be accumulated efficiently across the γ intervals (with dynamic programming).
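For BLEU, the sufficient statistics per sentence are the clipped n-gram match counts, the hypothesis n-gram totals, and the two lengths; they add across sentences, which is what makes interval-wise bookkeeping possible. A minimal sketch (sentence-level statistics only; corpus BLEU combines them with precisions and a brevity penalty):

```python
from collections import Counter

def bleu_stats(hyp, ref, max_n=4):
    """Sufficient statistics for BLEU over one sentence pair.

    Returns [len(hyp), len(ref)] followed by (clipped matches, total
    hypothesis n-grams) for n = 1..max_n. Additive over sentences.
    """
    stats = [len(hyp), len(ref)]
    for n in range(1, max_n + 1):
        h_ngrams = Counter(tuple(hyp[i:i + n]) for i in range(len(hyp) - n + 1))
        r_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
        matches = sum(min(c, r_ngrams[g]) for g, c in h_ngrams.items())
        stats += [matches, max(len(hyp) - n + 1, 0)]
    return stats

hyp = "man bites dog".split()
ref = "man bites dog".split()
print(bleu_stats(hyp, ref))  # → [3, 3, 3, 3, 2, 2, 1, 1, 0, 0]
```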