Sparse Feature Learning
Philipp Koehn 1 March 2016
Multiple Component Models
– Translation Model
– Language Model
– Reordering Model
p(a | to) = 0.18
p(casa | house) = 0.35
p(azul | blue) = 0.77
p(la | the) = 0.32
– each phrase translation score
– each language model n-gram
– etc.
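With millions of such features, a feature vector is almost entirely zero, so it is usually stored sparsely, keyed by feature name. A minimal sketch in Python; the feature name scheme is hypothetical, not from any particular decoder:

from collections import defaultdict

def sparse_features(phrase_pairs, lm_ngrams):
    """One indicator feature per phrase pair and per language model
    n-gram used in a derivation; only non-zero entries are stored."""
    feats = defaultdict(float)
    for src, tgt in phrase_pairs:
        feats["pp:%s|%s" % (src, tgt)] += 1.0
    for ngram in lm_ngrams:
        feats["lm:%s" % " ".join(ngram)] += 1.0
    return feats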
[Och et al., 2003]
[Och et al., 2003]
[Och et al., 2003] [Hopkins and May, 2011]
[Och et al., 2003] [Wuebker et al., 2012] [Hopkins and May, 2011] [Yu et al., 2013]
– optimize towards most similar translation
– or, only process sentence pair partially
BLEU( { e_i^best = argmax_{e_i} Σ_j λ_j h_j(e_i, f_i) }, { e_i^ref } )

Σ_i BLEU'( argmax_{e_i} Σ_j λ_j h_j(e_i, f_i), e_i^ref )
– number of source words
– number of target words
– number of source and target words
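These count features can be spelled out concretely. A sketch, assuming a derivation is given as a list of (source_phrase, target_phrase) token lists; the feature names are illustrative:

def length_features(derivation):
    """Indicator features for the number of source words, target words,
    and the combination of both."""
    src = sum(len(s) for s, t in derivation)
    tgt = sum(len(t) for s, t in derivation)
    return {
        "num-source-words=%d" % src: 1.0,
        "num-target-words=%d" % tgt: 1.0,
        "num-source-target-words=%d,%d" % (src, tgt): 1.0,
    }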
– lexicalized by first or last word of phrase (source or target)
– word representation replaced by word class
– orientation type
– number of nodes of a particular kind
– matching of source and target constituents
– reordering within syntactic constituents
score(λ, d_i) = Σ_j λ_j h_j(d_i)

Reference derivation d_i^ref for sentence pair i and its score:
score(λ, d_i^ref) = Σ_j λ_j h_j(d_i^ref)

Model-best derivation:
d_i^best = argmax_d score(λ, d) = argmax_d Σ_j λ_j h_j(d)
and its score:
score(λ, d_i^best) = Σ_j λ_j h_j(d_i^best)

error(λ, d_i^best, d_i^ref) = score(λ, d_i^best) − score(λ, d_i^ref)
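With sparse feature dictionaries, these definitions translate directly into code. A minimal sketch, using dicts as in the earlier feature examples:

def score(weights, feats):
    """score(λ, d) = Σ_j λ_j h_j(d); absent weights default to zero."""
    return sum(weights.get(j, 0.0) * h_j for j, h_j in feats.items())

def error(weights, feats_best, feats_ref):
    """error(λ, d_best, d_ref) = score(λ, d_best) − score(λ, d_ref)"""
    return score(weights, feats_best) - score(weights, feats_ref)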
[Figure: error(λ) as a function of λ, with the gradient of error(λ) at the current λ]
– gradient negative ⇒ we need to move right
– gradient positive ⇒ we need to move left

[Figure: three error(λ) curves where the gradient at the current λ is 2, 1, and 0.2]
– gradient high (steep) ⇒ move a lot
– gradient medium ⇒ move some
– gradient low (flat) ⇒ move little

The gradient d/dλ error(λ) can be computed at any point.
error(λ, d_i^best, d_i^ref) = score(λ, d_i^best) − score(λ, d_i^ref)

We need its gradient d/dλ error(λ, d_i^best, d_i^ref). For each weight λ_j:

d/dλ_j error(λ, d_i^best, d_i^ref)
= d/dλ_j [ score(λ, d_i^best) − score(λ, d_i^ref) ]
= d/dλ_j [ Σ_{j'} λ_{j'} h_{j'}(d_i^best) − Σ_{j'} λ_{j'} h_{j'}(d_i^ref) ]

For all j' ≠ j, the terms λ_{j'} h_{j'}(d_i) are constant, so they disappear:

d/dλ_j error(λ, d_i^best, d_i^ref)
= d/dλ_j [ λ_j h_j(d_i^best) − λ_j h_j(d_i^ref) ]
= h_j(d_i^best) − h_j(d_i^ref)

⇒ Our model update is λ_j^new = λ_j − ( h_j(d_i^best) − h_j(d_i^ref) )
– promote features whose value is bigger in the reference
– demote features whose value is bigger in the model best
Input: set of sentence pairs (e, f), set of features
Output: set of weights λ, one for each feature

λ_i = 0 for all i
while not converged do
  for all foreign sentences f do
    d_best = best derivation according to model
    d_ref = reference derivation
    if d_best ≠ d_ref then
      for all features h_i do
        λ_i += h_i(d_ref) − h_i(d_best)
      end for
    end if
  end for
end while
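A direct Python rendering of this algorithm. The two decoder hooks are stand-ins for a real decoder (finding the model-best and the reference derivation), not part of the pseudocode above:

def train_perceptron(corpus, decode_best, decode_reference, epochs=10):
    """corpus: list of (f, e) sentence pairs.
    decode_best(f, weights) -> sparse feature dict of model-best derivation.
    decode_reference(f, e, weights) -> sparse feature dict of reference derivation."""
    weights = {}
    for _ in range(epochs):  # stands in for "while not converged"
        for f, e in corpus:
            h_best = decode_best(f, weights)
            h_ref = decode_reference(f, e, weights)
            if h_best != h_ref:
                # promote reference features, demote model-best features
                for j in set(h_best) | set(h_ref):
                    weights[j] = weights.get(j, 0.0) + h_ref.get(j, 0.0) - h_best.get(j, 0.0)
    return weights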
[Venn diagram: all English sentences ⊃ producible by the model ⊃ covered by search]
– alignment points too distant → phrase pair too big to extract
– required reordering distance too large → exceeds distortion limit of decoder
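A sketch of these two checks for a candidate reference derivation; the limit values are typical decoder defaults, used here only as assumptions:

def reachable(derivation, max_phrase_len=7, distortion_limit=6):
    """derivation: phrase pairs given as (src_start, src_end) source
    spans, listed in the order they are translated."""
    prev_end = 0
    for src_start, src_end in derivation:
        if src_end - src_start > max_phrase_len:
            return False   # phrase pair too big to extract
        if abs(src_start - prev_end) > distortion_limit:
            return False   # required jump exceeds the distortion limit
        prev_end = src_end
    return True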
– add one free n-gram count to the statistics → avoids a BLEU score of 0;
  however: wrong balance between 1–4 grams, too drastic brevity penalty
– leave all other sentence translations fixed, collect n-gram matches and totals
  from them, and add the n-gram matches and total from the current candidate
  → considers the impact on the overall BLEU score
– maintain decaying statistics for n-gram matches and total n-grams:
  count_t = (9/10) count_{t−1} + current-count_t
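The first strategy (one free n-gram count per order, as in the BLEU+1 scores used later) is small enough to sketch in full:

import math
from collections import Counter

def bleu_plus_one(hyp, ref, max_n=4):
    """Sentence-level BLEU with add-one smoothed n-gram precisions,
    so a missing n-gram order no longer forces a score of 0."""
    def ngrams(tokens, n):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    log_precisions = 0.0
    for n in range(1, max_n + 1):
        hyp_counts, ref_counts = ngrams(hyp, n), ngrams(ref, n)
        matches = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
        total = max(len(hyp) - n + 1, 0)
        log_precisions += math.log((matches + 1.0) / (total + 1.0))  # one free count
    brevity = min(1.0, math.exp(1.0 - len(ref) / max(len(hyp), 1)))
    return brevity * math.exp(log_precisions / max_n)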
Arabic source (in romanization) with English glosses:

sd       qTEp    mn    AlkEk     AlmmlH    "brytzl"     Hlqh .
blocked  piece   of    biscuit   salted    "pretzel"    his-throat

Literal rendering: A piece of a salted biscuit, a "pretzel," blocked his throat.
Reference translation: A pretzel, a salted biscuit, became lodged in his throat.

Note: example from Chiang (2012)
[Figure: translation quality vs. model score for an n-best list, marking hope, fear, and utopian translations]
Hope translation: d^hope = argmax_d [ BLEU(d) + score(λ, d) ]

– Metric difference (should be big): ΔBLEU(d^hope, d) = BLEU(d^hope) − BLEU(d)
– Score difference (should be small or negative): Δscore(λ, d^hope, d) = score(λ, d^hope) − score(λ, d)
– Margin: v(λ, d^hope, d) = ΔBLEU(d^hope, d) − Δscore(λ, d^hope, d)
– Fear translation: d^fear = argmax_d v(λ, d^hope, d)
Weight update:
λ_j^new = λ_j − δ_i ( h_j(d_i^fear) − h_j(d_i^hope) )

with step size
δ_i = min( C, [ ΔBLEU(d_i^hope, d_i^fear) − Δscore(d_i^hope, d_i^fear) ] / ||Δh||² )

where Δh = h(d_i^hope) − h(d_i^fear) and C caps the step size.
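A sketch of one hope/fear update over an n-best list. The cap C, its default value, and the helper names are assumptions of this sketch, not fixed by the slides:

def mira_update(weights, nbest, C=0.01):
    """nbest: list of (feats, bleu) with sparse feature dicts and
    sentence-level BLEU scores."""
    def sc(feats):
        return sum(weights.get(j, 0.0) * v for j, v in feats.items())
    # hope: high BLEU and high model score
    hope_f, hope_b = max(nbest, key=lambda fb: fb[1] + sc(fb[0]))
    # fear: maximizes the margin v = ΔBLEU − Δscore
    fear_f, fear_b = max(nbest, key=lambda fb: (hope_b - fb[1]) - (sc(hope_f) - sc(fb[0])))
    delta_h = {j: hope_f.get(j, 0.0) - fear_f.get(j, 0.0)
               for j in set(hope_f) | set(fear_f)}
    norm_sq = sum(v * v for v in delta_h.values())
    if norm_sq == 0.0:
        return
    loss = (hope_b - fear_b) - (sc(hope_f) - sc(fear_f))
    step = min(C, loss / norm_sq)
    if step > 0.0:
        # equivalent to λ_j − δ (h_j(fear) − h_j(hope))
        for j, v in delta_h.items():
            weights[j] = weights.get(j, 0.0) + step * v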
⇒ Adaptive Regularization of Weights (AROW)
– record confidence in weights over time
– include this in the learning rate for each feature
⇒ Break up training data into batches
– use operations as in Gibbs sampling
– vary one translation option choice
– vary one reordering decision
– vary one phrase segmentation decision
– adopt new translation based on relative score
→ apply a MIRA update if the more costly translation has higher BLEU (see the sketch below)
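A sketch of one such sampling step; neighbors(), bleu(), and mira_update_pair() are stand-ins for the operators and the update described above:

import random

def samplerank_step(weights, current, neighbors, bleu, mira_update_pair):
    """current: (feats, translation). Propose a neighbor that differs
    in one decision, compare by model score, and update when the model
    preference and BLEU disagree."""
    def sc(feats):
        return sum(weights.get(j, 0.0) * v for j, v in feats.items())
    proposal = random.choice(neighbors(current))  # vary one decision
    better, worse = ((proposal, current) if sc(proposal[0]) >= sc(current[0])
                     else (current, proposal))
    if bleu(worse[1]) > bleu(better[1]):
        # the lower-scoring translation has higher BLEU: fix the ranking
        mira_update_pair(weights, hope=worse, fear=better)
    # adopt the new translation based on relative score
    return better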
– repeated decoding needed
– computationally very expensive
– n-best list or search graph (lattice)
– straightforward parallelization
– does not seem to harm performance
Translation              BLEU+1
it is not under house    27.3%
he is not under house    30.2%
it is not a home         30.2%
it is not to go home     31.2%
it is not for house      27.3%
he is not to go home     31.2%
he does not home         36.2%
it is not packing        21.8%
he is not packing        24.2%
he is not for home       32.5%
– bad = (−31.75, −17.25, −20.43, −4.90, −6.90, −5)
– good = (−36.70, −13.52, −17.09, −6.22, −7.82, −5)

Training examples from difference vectors:
– bad − good → negative example
– good − bad → positive example
e.g., MegaM (http://www.umiacs.umd.edu/~hal/megam/)
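A sketch of this pairwise ranking setup. PRO additionally filters the sampled pairs by their BLEU difference; that refinement is omitted here, and scikit-learn's LogisticRegression stands in for MegaM:

import random
import numpy as np
from sklearn.linear_model import LogisticRegression

def pro_examples(nbest, num_pairs=100):
    """nbest: list of (feature_vector, bleu) with dense numpy vectors.
    Builds difference-vector examples: good − bad is positive,
    bad − good is negative."""
    X, y = [], []
    for _ in range(num_pairs):
        (f1, b1), (f2, b2) = random.sample(nbest, 2)
        if b1 == b2:
            continue
        good, bad = (f1, f2) if b1 > b2 else (f2, f1)
        X.append(good - bad); y.append(1)
        X.append(bad - good); y.append(0)
    return np.array(X), np.array(y)

# X, y = pro_examples(nbest)
# weights = LogisticRegression(fit_intercept=False).fit(X, y).coef_[0]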
require decoder to match the reference translation
do not use translation rules originally collected from current sentence pair
– 90% of training data used for rule collection
– 10% to validate rules
– rotate
– allow rules originally collected from current sentence pair
– very costly → only used if everything else fails
⇒ Much smaller model
⇒ Sometimes better model
– yes: ignoring them leaves out large amounts of training data
– no: data selection; non-literal translations are of lower quality
[Figure: Reachability by distortion limit and sentence length, Chinese–English NIST; Yu et al., 2013]
[Figure: decoding stacks grouped by number of words translated (none, one, two, three), with hypotheses such as "it", "he", "yes", "are", "goes", "does not"]
[Figure: the same decoding stacks, with the reference translation "he does not go home" traced through them]
[Figure: decoding stacks 1–5]
– perceptron update between partial derivations
– best derivation vs. best reference derivation outside the beam
→ only stop when there is no hope that the reference derivation will be in a future stack
[Figure: decoding stacks 1–7 plus a final stack]
– find the stack with the maximal model score difference between
  ∗ best derivation
  ∗ best reference derivation
– update between those two derivations (see the sketch below)
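A sketch of this max-violation update; each stack is summarized here by the feature vectors of its best derivation and its best reference (prefix) derivation, an assumption of this sketch:

def max_violation_update(weights, stacks):
    """stacks: list of (best_feats, ref_feats) sparse dicts per stack."""
    def sc(feats):
        return sum(weights.get(j, 0.0) * v for j, v in feats.items())
    # stack where the model most strongly prefers its best derivation
    # over the best reference derivation
    best, ref = max(stacks, key=lambda br: sc(br[0]) - sc(br[1]))
    if sc(best) > sc(ref):
        # perceptron update between those two derivations
        for j in set(best) | set(ref):
            weights[j] = weights.get(j, 0.0) + ref.get(j, 0.0) - best.get(j, 0.0)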
– optimization over the full training corpus
– over 20 million features
– relatively small data conditions (5–9 million words)
– gain: +2 BLEU points
– rule id
– word edge features (first and last word of phrase), defined over words, word clusters, or POS tags
– combinations of word edge features
– non-local features: ids of consecutive rules, rule id + last two English words
Summary of tuning methods, by number of features and what they are trained on:

Method                                 Features      Trained on
MERT [Och et al., 2003]                a handful     iterative n-best
MIRA [Chiang, 2007]                    millions      n-best
PRO [Hopkins and May, 2011]            thousands     n-best
SampleRank [Haddow et al., 2011]       millions      search space
Leave One Out [Wuebker et al., 2012]   rule scores   aligned corpus
MaxViolation [Yu et al., 2013]         millions      aligned corpus, search space