Tuning
Philipp Koehn, presented by Gaurav Kumar, 28 September 2017
Philipp Koehn / Gaurav Kumar Machine Translation: Tuning 28 September 2017

The Story so Far: Generative Models
Noisy channel model:

    argmax_e p(e|f) = argmax_e p(f|e) p(e)

Component models are estimated generatively, for instance IBM Model 1, where word translations are independent of each other:

    p(e|f, a) = (1/Z) ∏_i p(e_i | f_a(i))

– maximum likelihood estimation of the component probability distributions
– EM algorithm for discovering hidden variables (alignment)
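The Model 1 independence assumption above can be sketched in a few lines. This is a minimal illustration with a made-up toy lexical table, not the lecture's data; `model1_prob` uses the standard Model 1 form with a NULL source word and a small floor for unseen pairs.

```python
# Toy lexical translation probabilities t(e|f) -- illustrative values only
t = {
    ("the", "das"): 0.7, ("house", "Haus"): 0.8,
    ("the", "Haus"): 0.1, ("house", "das"): 0.05,
}

def model1_prob(e_words, f_words, epsilon=1.0):
    """IBM Model 1: p(e|f) = eps / (l_f+1)^l_e * prod_i sum_j t(e_i|f_j).
    Word translations are independent, so we multiply per-word sums."""
    f_with_null = ["NULL"] + f_words          # Model 1 adds a NULL source word
    prob = epsilon / (len(f_with_null) ** len(e_words))
    for e in e_words:
        # sum over all alignment choices for this output word
        prob *= sum(t.get((e, f), 1e-6) for f in f_with_null)
    return prob

p = model1_prob(["the", "house"], ["das", "Haus"])
```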
The components are combined in a weighted product. Instead of the plain product

    p_LM × p_TM × p_D

each component gets a weight:

    p_LM^λ_LM × p_TM^λ_TM × p_D^λ_D

In general, for feature functions p_i with weights λ_i:

    ∏_i p_i^λ_i

Working in log-space:

    Σ_i λ_i log(p_i) = log ∏_i p_i^λ_i
– language model
– reordering (distortion) model
– phrase translation model
– word translation model
– word count
– phrase count
– character count
– drop word feature
– phrase pair frequency
– additional language models
Features may be arbitrary functions of the translation, for example a binary verb feature:

    h(e, f, a) = 1 if ∃ e_i ∈ e such that e_i is a verb, 0 otherwise
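The weighted combination Σ_i λ_i log(p_i) can be sketched directly. Feature names, probabilities, and weights below are illustrative, not from the slides:

```python
import math

def loglinear_score(features, weights):
    """Log-linear model score: sum_i lambda_i * log(p_i)."""
    return sum(weights[name] * math.log(p) for name, p in features.items())

features = {"lm": 0.01, "tm": 0.02, "distortion": 0.5}   # component probabilities p_i
weights = {"lm": 1.0, "tm": 0.8, "distortion": 0.3}      # feature weights lambda_i

score = loglinear_score(features, weights)
# exponentiating recovers the product form: exp(score) = prod_i p_i ** lambda_i
```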
– manual setting of weights: try a few, take the best
– automate this process:
  – set aside a development corpus
  – set the weights so that translation performance on this development corpus is optimal
  – requires an automatic scoring method (e.g., BLEU)
Generative modeling:
– the translation process is broken down into steps
– each step is modeled by a probability distribution
– each probability distribution is estimated from data by maximum likelihood

Discriminative training:
– the model consists of a number of features (e.g., the language model score)
– each feature has a weight, measuring its value for judging a translation as correct
– feature weights are optimized on development data, so that the system output matches correct translations as closely as possible
– feature score: value of the property (e.g., the language model probability)
– feature weight: importance of the feature (e.g., the language model feature is more important than the word count feature)
– first pass: run the decoder to generate a set of candidate translations
– second pass:
  ∗ add features
  ∗ rescore translations

– integrate all features into the decoder
– learn feature weights that lead the decoder to the best translation

– thousands or millions of features
– optimization over the entire training corpus
– requires different training methods
– do not list all possible translations
– beam search for the best one
– dynamic programming and pruning
[Figure: search graph over partial translations with model scores, e.g. "it" (p:-0.484), "he" (p:-0.556), "goes" (p:-1.648), "does not" (p:-1.664), "go" (p:-2.743), "home" (p:-4.182), "to house" (p:-4.334)]

The decoder proceeds by expanding the most promising partial translations ⇒ search graph
[Figure: the same search graph, with recombination of matching hypotheses]
Recombination:
– without: total number of paths = number of full translation hypotheses
– with: combinatorial expansion of paths through the graph

In the example:
– without: 4 full translation hypotheses
– with: 10 different full paths
[Figure: the search graph redrawn as a weighted finite state machine; transitions emit phrases such as "he", "it", "are", "does not", "goes not", "go home", "to house", each with its score]
– states: partial translations
– transitions: applications of phrase translations
– weights: scores added by each phrase translation
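Finding the best path through such a weighted graph is a small dynamic program. The states, phrases, and scores below are illustrative (only a few of the slide's values are reused), not the full search graph:

```python
# edges of a tiny search graph: (from_state, to_state, phrase, log_prob)
edges = [
    ("<s>", "he", "he", -0.556),
    ("<s>", "it", "it", -0.484),
    ("he", "he/does not", "does not", -1.108),
    ("it", "it/goes", "goes", -0.904),
    ("he/does not", "final", "go home", -1.0),
    ("it/goes", "final", "not home", -2.0),
]

def best_path(edges, start="<s>", goal="final"):
    """Max-score path by relaxation; fine for small acyclic graphs."""
    best = {start: (0.0, [])}            # state -> (score, phrases so far)
    changed = True
    while changed:
        changed = False
        for src, dst, phrase, w in edges:
            if src in best:
                cand = (best[src][0] + w, best[src][1] + [phrase])
                if dst not in best or cand[0] > best[dst][0]:
                    best[dst] = cand
                    changed = True
    return best[goal]

score, phrases = best_path(edges)
```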
A weighted finite state machine (Σ, S, s0, δ, F):
– Σ is the alphabet of output symbols (in our case, the emitted phrases)
– S is a finite set of states
– s0 ∈ S is the initial state (in our case, the initial hypothesis)
– δ : S × Σ → S is the state transition function
– F is the set of final states (in our case, hypotheses that have covered all input words)
plus scores for the emissions from each transition: π : S × Σ × S → R
rank  sentence
  1   he does not go home
  2   he does not go to house
  3   he goes not to house
  4   it goes not to house
  5   he goes not home
  6   it goes not home
  7   it does not go home
  8   it does not go to house
  9   he does not to go house
 10   it does not to go house

⇒ Extract the n best translations
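For a small graph, an n-best list can be extracted by enumerating all full paths and sorting by score. This brute-force sketch uses made-up states and scores; real decoders extract n-best paths with dynamic programming and detours rather than full enumeration:

```python
def all_paths(edges, state="<s>", goal="final", score=0.0, words=()):
    """Yield (score, sentence) for every complete path through the graph."""
    if state == goal:
        yield score, " ".join(words)
        return
    for src, dst, phrase, w in edges:
        if src == state:
            yield from all_paths(edges, dst, goal, score + w, words + (phrase,))

edges = [  # illustrative search graph
    ("<s>", "he", "he", -0.556),
    ("<s>", "it", "it", -0.484),
    ("he", "h1", "does not", -1.108),
    ("it", "h2", "goes not", -0.904),
    ("h1", "final", "go home", -1.0),
    ("h1", "final", "go to house", -1.8),
    ("h2", "final", "home", -2.2),
    ("h2", "final", "to house", -2.5),
]

nbest = sorted(all_paths(edges), reverse=True)[:4]   # 4-best list
```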
[Figure: the search graph again, now used to find the best path]
[Figure: the best path through the search graph: <s> → he → does not → go home]

⇒ Best path: he does not go home

Base path   Detour state
final       to house
final       not home
final       does not
final       go house
[Figure: taking the cheapest detour ("to house") yields the second-best path: <s> → he → does not → go to house]

Base path   Detour state
to house    goes not
final       not home
final       does not
to house    it
final       go house
[Figure: the next detour yields the third-best path: <s> → he → goes not → to house]

Base path            Detour state
to house / goes not  it goes
final                not home
final                does not
to house             it
final                go house
– model score: what the machine translation system thinks is good
– error score: what is actually a good translation

– recall: lecture on evaluation
– canonical metric: BLEU score
– commonly used: BLEU+1
– adjusted n-gram precision: (correct matches + 1) / (total + 1)
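The add-one adjustment gives a usable sentence-level score, since a single sentence often has zero higher-order n-gram matches. A minimal sketch, assuming the simple (correct+1)/(total+1) smoothing from the slide and omitting the brevity penalty (variants of BLEU+1 differ in such details):

```python
import math

def ngrams(words, n):
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

def bleu_plus_one(hyp, ref, max_n=4):
    """Sentence-level BLEU with add-one smoothed n-gram precisions."""
    hyp, ref = hyp.split(), ref.split()
    log_prec = 0.0
    for n in range(1, max_n + 1):
        h, r = ngrams(hyp, n), ngrams(ref, n)
        # clipped matches, then add-one smoothing: (correct+1)/(total+1)
        matches = sum(min(h.count(g), r.count(g)) for g in set(h))
        log_prec += math.log((matches + 1) / (len(h) + 1))
    return math.exp(log_prec / max_n)   # geometric mean of precisions
```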
Translation              BLEU+1
it is not under house    27.3%
he is not under house    30.2%
it is not a home         30.2%
it is not to go home     31.2%
it is not for house      27.3%
he is not to go home     31.2%
he does not home         36.2%
it is not packing        21.8%
he is not packing        24.2%
he is not for home       32.5%
[Diagram: reranking pipeline.
Training: decode tuning input sentences with the base model → n-best list of translations → combine with additional features → with the reference translations, this yields labeled training data → learn reranker.
Testing: decode the test input sentence with the base model → n-best list of translations → combine with additional features → rerank → translation]
Log-linear model with parameters λ_i:

    p(x) = exp Σ_{i=1..n} λ_i h_i(x)

Methods to learn the parameters λ_i:
– Powell search
– Simplex algorithm
– only a few weights to set (say, 10–15)
– a tuning set of 1000s of sentence pairs is sufficient
For each tuning sentence we have an n-best list of translations.
→ with different weights, different translations come out on top
→ changing the BLEU score
Re-decoding with the new weights may produce completely different translations
⇒ Iterate the optimization, accumulating n-best lists
[Diagram: the tuning loop. Initial parameters → decoder → n-best list of translations → optimize → new parameters; if changed, decode again with the new parameters; if converged, apply them as final parameters]
given: sentences with n-best lists of translations
iterate n times:
    randomize starting feature weights
    iterate until convergence:
        for each feature:
            find best feature weight
            update if different from current
return best feature weights found in any iteration
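The loop above is coordinate ascent with random restarts. A skeleton of it, assuming a given `score_fn` that evaluates tuning-set BLEU for a weight vector; the fixed candidate grid per weight is a simplification of the exact line search described later:

```python
import random

def coordinate_ascent(score_fn, n_weights, restarts=5, sweeps=10, seed=0):
    """Optimize one weight at a time, restart from random weight vectors."""
    rng = random.Random(seed)
    best_w, best_s = None, float("-inf")
    for _ in range(restarts):
        w = [rng.uniform(-1, 1) for _ in range(n_weights)]  # random start
        for _ in range(sweeps):
            changed = False
            for i in range(n_weights):           # one weight at a time
                for cand in [w[i] + d for d in (-0.5, -0.1, 0.1, 0.5)]:
                    trial = w[:i] + [cand] + w[i + 1:]
                    if score_fn(trial) > score_fn(w):
                        w, changed = trial, True
            if not changed:                      # converged for this restart
                break
        s = score_fn(w)
        if s > best_s:
            best_w, best_s = w, s
    return best_w, best_s

# toy objective standing in for tuning-set BLEU (maximum at all weights = 0.3)
w, s = coordinate_ascent(lambda w: -sum((x - 0.3) ** 2 for x in w), 3)
```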
– find the optimal value for one parameter weight λ
– ... while leaving all other weights constant

The model score of each translation is then linear in λ:

    p(e_i|f) = λ a_i + b_i

– we deal with 100s of translations e_i per sentence f
– we deal with 100s or 1000s of sentences f
– we are trying to find the value of λ so that, over all sentences, the error score is optimal
[Figures: each translation's score p(e_i|f) = λ a_i + b_i is a line in λ; plotting all lines, the upper envelope determines which translation comes out on top in each interval of λ]
⇒ Algorithm:
– find the threshold points
– for each interval between threshold points:
  ∗ find the best translations
  ∗ compute the error score
– pick the interval with the best error score
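For one sentence, the threshold points can be found from the pairwise line intersections. A brute-force sketch with illustrative lines (a = the tuned feature's value, b = the contribution of all other features); the slides' method walks the envelope from the steepest line instead, which is more efficient:

```python
def thresholds(lines):
    """lines: list of (a, b, label), score(lam) = a*lam + b.
    Returns [(lam, label), ...]: from each threshold on, `label` is the
    top-scoring translation."""
    def top(lam):
        return max(lines, key=lambda l: l[0] * lam + l[1])[2]
    # candidate thresholds: all pairwise intersections of non-parallel lines
    xs = sorted({(b2 - b1) / (a1 - a2)
                 for i, (a1, b1, _) in enumerate(lines)
                 for (a2, b2, _) in lines[i + 1:] if a1 != a2})
    result = [(float("-inf"), top(min(xs, default=0.0) - 1.0))]
    for x in xs:
        label = top(x + 1e-9)            # winner just past the intersection
        if label != result[-1][1]:       # keep only real changes of winner
            result.append((x, label))
    return result

lines = [(1.0, 0.0, "t1"), (-1.0, 1.0, "t2"), (0.0, 0.8, "t3")]
segs = thresholds(lines)
```

Each returned segment is one interval of the line sweep; evaluating the error score once per interval gives the best λ.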
[Plot: BLEU as a function of one parameter value; left: full range, right: zoomed in on the peak]
Input: sentences with n-best lists of translations, initial parameter values
 1: repeat
 2:   for all parameters do
 3:     set of threshold points T = {}
 4:     for all sentences do
 5:       for all translations do
 6:         compute line l: parameter value → score
 7:       end for
 8:       find line l with steepest descent
 9:       while there is a line l2 that intersects with l first do
10:         add parameter value at intersection to set of threshold points T
11:         l = l2
12:       end while
13:     end for
14:     sort set of threshold points T by parameter value
15:     compute score for value before first threshold point
16:     for all threshold points t ∈ T do
17:       compute score for value after threshold point t
18:       if highest, record max score and threshold point t
19:     end for
20:     if max score is higher than current, update parameter value
21:   end for
22: until no changes to parameter values applied
– recall: the error is computed over the entire tuning set
– the brute force method requires reranking 1000s of n-best lists

– try to find the direction in which the optimum lies
– here: we cannot compute a derivative
– high-dimensional space: each dimension is one of the λ_i parameters
– a point in the space: each parameter set to a value
[Figure: three parameter settings, labeled worst, good, and best]
– use each parameter setting to rerank all the n-best lists
– compute the overall tuning set score (BLEU)
– mid point: M = ½ (best + good)
– reflection point: R = M + (M − worst)
– extension point: R = M + 2 (M − worst)
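These update points are simple vector arithmetic. A minimal sketch with illustrative two-dimensional parameter vectors (not the slides' values):

```python
def midpoint(best, good):
    """M = (best + good) / 2, elementwise over parameter vectors."""
    return [(b + g) / 2 for b, g in zip(best, good)]

def reflection(m, worst):
    """R = M + (M - worst): mirror worst through the midpoint."""
    return [mi + (mi - wi) for mi, wi in zip(m, worst)]

def extension(m, worst):
    """R = M + 2(M - worst): step twice as far past the midpoint."""
    return [mi + 2 * (mi - wi) for mi, wi in zip(m, worst)]

best, good, worst = [1.0, 1.0], [0.0, 1.0], [0.0, 0.0]
M = midpoint(best, good)
R = reflection(M, worst)
E = extension(M, worst)
```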
[Figure: the reflection point may land on a terrible setting; then contract towards the midpoint instead]
– contraction point C1, between worst and M: C1 = M − ½ (M − worst)
– contraction point C2, between M and R: C2 = M + ½ (M − worst)
– shrinkage point S, between worst and best: S = ½ (best + worst)
In higher dimensions:
– n + 1 points for n parameters
– the midpoint M is the center of all points except the worst
– in the final stage, all good points are moved towards midpoints closer to the best

Overall procedure:
– generate n-best lists
– iterate