What can Statistical Machine Translation teach Neural Text Generation about Optimization?
Graham Neubig
@ NAACL Workshop on Methods for Optimizing and Evaluating Neural Language Generation 6/6/2019
or How to Optimize your Neural Generation
... Neubig & Watanabe, Computational Linguistics (2016)
[Figure: "kono eiga ga kirai" translated phrase by phrase as "movie", "this", "I hate"]
translation accuracy [Och 2004]
Minimum Error Rate Training in Statistical Machine Translation (Och 2004)
$\log P(Y \mid X) = \sum_i \lambda_i \phi_i(X, Y) - \log Z$
[Figure: encoder-decoder model. The encoder reads "kono eiga ga kirai"; the decoder (dec) generates "I hate this movie </s>".]
likelihood (not accuracy!)
[Figure: the decoder classifies the next word at each step, given the encoder output and the previous reference words: "I", "hate", "this", "movie", "</s>"]
$P(E \mid F) = \prod_{t=1}^{T} P(e_t \mid F, e_1, \ldots, e_{t-1})$
in the reference given the previous words
$\ell(E \mid F) = -\log P(E \mid F) = -\sum_{t=1}^{T} \log P(e_t \mid F, e_1, \ldots, e_{t-1})$
but at test time we may make mistakes that propagate
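The loss above is just the sum of per-word negative log-probabilities of the reference. A minimal sketch, assuming the per-step probabilities are already available from some model (the function name and the numbers are made up for illustration):

```python
import math

def sequence_nll(step_probs):
    """Negative log-likelihood of one reference sentence.

    step_probs[t] is a hypothetical model probability
    P(e_t | F, e_1, ..., e_{t-1}) of the t-th reference word.
    """
    return -sum(math.log(p) for p in step_probs)

# Made-up probabilities for the reference "I hate this movie </s>".
probs = [0.5, 0.25, 0.5, 0.8, 1.0]
loss = sequence_nll(probs)

# Summing log-probabilities is the same as taking the log of the product.
product = 0.5 * 0.25 * 0.5 * 0.8 * 1.0
assert abs(loss - (-math.log(product))) < 1e-12
```

Note that nothing in this loss looks at the evaluation metric or at the model's own test-time outputs.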
training, and cannot deal with them at test
phenomena such as repeating.
[Figure: at test time the decoder is fed its own predictions and outputs "I" at every step]
e.g. BLEU or METEOR
Translation 2016 [Neubig 16]
[Figure: two bar charts comparing MLE, MLE+Length, and MinRisk. Left panel: BLEU (axis 23 to 27). Right panel: Length Ratio (axis 80 to 100).]
length problems, and does much better than heuristics
Lexicons and Minimum Risk Training for Neural Machine Translation: NAIST-CMU at WAT2016 (Neubig 16)
conducive to gradient-based optimization
$\hat{E} = \operatorname{argmax}_{\tilde{E}} P(\tilde{E} \mid F)$
$\mathrm{error}(E, \hat{E}) = 1 - \mathrm{BLEU}(E, \hat{E})$
(over an n-best list for every hypothesis)
[Table: n-best candidates E1,1 to E1,3 for source F1 and E2,1 to E2,3 for source F2, with columns φ1, φ2, φ3, err]
[Figure, panels (a) to (d): MERT line search over the n-best lists, plotting F1 candidates, F2 candidates, F1 error, F2 error, and total error against the step size α. Starting weights λ1=-1, λ2=1, λ3=0; search direction d1=0, d2=0, d3=1; chosen step α ← 1.25; resulting weights λ1=-1, λ2=1, λ3=1.25.]
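The line search can be sketched in a few lines. This is a brute-force illustration, not Och's envelope sweep: it probes every interval between pairwise line crossings, and it sums sentence-level errors rather than computing a corpus-level metric. The feature values and errors below are made up, and `line_search` is a hypothetical helper.

```python
def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def line_search(nbest_lists, weights, direction):
    """Find a step size alpha that minimizes total 1-best error.

    nbest_lists: one list per source sentence of (features, error) pairs.
    Along weights + alpha * direction, each candidate's model score is a
    line a + b * alpha, so the 1-best candidate (and hence the error)
    can only change where two lines of the same sentence cross.
    """
    lines = [[(dot(weights, f), dot(direction, f), err) for f, err in cands]
             for cands in nbest_lists]

    # Collect every pairwise crossing point within a sentence.
    points = set()
    for sent in lines:
        for i, (a1, b1, _) in enumerate(sent):
            for a2, b2, _ in sent[i + 1:]:
                if b1 != b2:
                    points.add((a2 - a1) / (b1 - b2))
    points = sorted(points)

    # Probe one alpha inside each interval between crossing points.
    if points:
        probes = ([points[0] - 1.0]
                  + [(p + q) / 2 for p, q in zip(points, points[1:])]
                  + [points[-1] + 1.0])
    else:
        probes = [0.0]

    def total_error(alpha):
        # Error of the highest-scoring candidate, summed over sentences.
        return sum(max(sent, key=lambda l: l[0] + l[1] * alpha)[2]
                   for sent in lines)

    best = min(probes, key=total_error)
    return best, total_error(best)

# Made-up n-best lists: (feature vector, error) for two source sentences.
nbest = [
    [((1, 0, -1), 0.6), ((0, 1, 0), 0.0), ((1, 0, 1), 1.0)],
    [((1, 0, -2), 0.8), ((3, 0, 1), 0.3), ((3, 1, 2), 0.0)],
]
weights = (-1.0, 1.0, 0.0)    # starting lambda
direction = (0.0, 0.0, 1.0)   # only the third weight moves
alpha, err = line_search(nbest, weights, direction)
new_weights = [w + alpha * d for w, d in zip(weights, direction)]
```

Because the error is piecewise constant in α, any point inside the best interval is equally good; practical implementations usually take the interval midpoint.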
$\mathrm{risk}(F, E, \theta) = \sum_{\tilde{E}} P(\tilde{E} \mid F; \theta)\, \mathrm{error}(E, \tilde{E})$
function -> differentiable!
Minimum Risk Annealing for Training Log-Linear Models (Smith and Eisner 2006)
Minimum risk training for neural machine translation (Shen et al. 2015)
calculate risk over that
n-best search
$\mathrm{risk}(F, E, S) = \sum_{\tilde{E} \in S} \frac{P(\tilde{E} \mid F)}{Z}\, \mathrm{error}(E, \tilde{E})$
minimizing risk
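A minimal sketch of risk over a sampled candidate set S, assuming we already have each candidate's model log-probability and its error against the reference (the function name and numbers are made up). Z is simply the sum of the candidate probabilities within S:

```python
import math

def sampled_risk(log_probs, errors):
    """Expected error over a candidate set S.

    log_probs[k]: model log P(E_k | F) of the k-th candidate.
    errors[k]:    error(E, E_k), e.g. 1 - BLEU against the reference.
    Probabilities are renormalized over S (the Z in the formula).
    """
    m = max(log_probs)  # subtract the max for numerical stability
    weights = [math.exp(lp - m) for lp in log_probs]
    z = sum(weights)
    return sum(w / z * e for w, e in zip(weights, errors))

# Three made-up candidates with unnormalized mass 0.2, 0.1, 0.1.
log_probs = [math.log(0.2), math.log(0.1), math.log(0.1)]
errors = [0.0, 0.5, 1.0]
risk = sampled_risk(log_probs, errors)  # weights 0.5, 0.25, 0.25 -> 0.375
```

Since the risk is a smooth function of the log-probabilities, it can be minimized with ordinary gradient-based training in an autodiff framework.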
weight
$\ell_{\mathrm{reinforce}}(X, Y) = -R(\hat{Y}, Y) \log P(\hat{Y} \mid X)$
Chances are, this is you 😕
Minimum risk training for neural machine translation (Shen et al. 2015)
(not latent variables or standard RL settings)
MLE to the full objective
for a particular sentence
Sentence | Reward | Baseline | R-B
“This is an easy sentence” | 0.8 | 0.95 | -0.15
“Buffalo Buffalo Buffalo” | 0.3 | 0.1 | 0.2
reflect when we did better or worse than expected
$\ell_{\mathrm{baseline}}(X) = -(R(\hat{Y}, Y) - B(\hat{Y})) \log P(\hat{Y} \mid X)$
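A small sketch of the baseline-adjusted loss for a single sampled output, using the reward and baseline values from the two example sentences. The log-probability is a made-up number and `reinforce_loss` is a hypothetical helper:

```python
import math

def reinforce_loss(log_prob, reward, baseline=0.0):
    """-(R - B) * log P(Y_hat | X) for one sampled output Y_hat."""
    return -(reward - baseline) * log_prob

log_prob = math.log(0.5)  # made-up model log-probability of the sample

# Easy sentence: reward 0.8 is below the baseline 0.95, so the
# coefficient on log P is positive and minimizing the loss lowers log P.
easy = reinforce_loss(log_prob, reward=0.8, baseline=0.95)

# Hard sentence: reward 0.3 beats the baseline 0.1, so the coefficient
# is negative and minimizing the loss raises log P.
hard = reinforce_loss(log_prob, reward=0.3, baseline=0.1)
```

Without the baseline, both samples would have their probability increased, only by different amounts.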
distribution
accounts for unsampled hypotheses that should be in the denominator
$\mathrm{risk}(F, E, \theta, \tau, S) = \sum_{\tilde{E} \in S} \frac{P(\tilde{E} \mid F; \theta)^{1/\tau}}{Z}\, \mathrm{error}(E, \tilde{E})$
[Figure: four panels, one each for τ = 1, τ = 0.5, τ = 0.25, τ = 0.05]
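To see what the temperature does, here is a minimal sketch that raises a made-up candidate distribution to the power 1/τ and renormalizes (the helper name `sharpen` is an assumption; with real sentence probabilities this would be done in log space):

```python
def sharpen(probs, tau):
    """Return probs ** (1 / tau), renormalized to sum to 1."""
    powered = [p ** (1.0 / tau) for p in probs]
    z = sum(powered)
    return [p / z for p in powered]

probs = [0.4, 0.3, 0.2, 0.1]  # made-up distribution over four candidates
for tau in (1.0, 0.5, 0.25, 0.05):
    top = sharpen(probs, tau)[0]
    print(f"tau={tau}: mass on the best candidate = {top:.3f}")
```

As τ shrinks, the mass concentrates on the highest-probability candidate, so the risk approaches the error of the 1-best output.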
 | NMT+MinRisk | PBMT+MERT
Model | NMT | PBMT
Optimized Parameters | Millions | 5-30 Log-linear Weights (others MLE)
Objective | Risk | Error
Metric Granularity | Sentence Level | Corpus Level
n-best Lists | Re-generated | Accumulated
model?
Freezing Subnetworks to Analyze Domain Adaptation in
models as a linear combination of a few hyper-parameters?
Contextualized Parameter Generation for Universal NMT. Platanios et al. 2018.
$W = \sum_i \alpha_i W_i$
want to do in the first place?
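One way to read the formula: keep a few basis matrices W_i fixed and tune only the mixing coefficients α_i, so the number of optimized parameters is small. A plain-Python sketch with made-up 2x2 matrices (`combine` is a hypothetical helper, not any library's API):

```python
def combine(alphas, matrices):
    """Return W = sum_i alphas[i] * matrices[i] for same-shaped matrices."""
    rows, cols = len(matrices[0]), len(matrices[0][0])
    return [[sum(a * m[r][c] for a, m in zip(alphas, matrices))
             for c in range(cols)]
            for r in range(rows)]

W1 = [[1.0, 0.0], [0.0, 1.0]]  # made-up basis matrix
W2 = [[0.0, 2.0], [2.0, 0.0]]  # made-up basis matrix
alphas = [0.5, 0.25]           # the only values that would be tuned
W = combine(alphas, [W1, W2])
```

With only a handful of α values to tune, an SMT-style optimizer such as the line search used in MERT becomes conceivable for a neural model.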
move towards a peakier distribution?
Minimum risk annealing for training log-linear models. Smith and Eisner 2006.
[Figure: four panels, one each for τ = 1, τ = 0.5, τ = 0.25, τ = 0.05]
Training progression
The Best Lexical Metric for Phrase-Based Statistical MT System
MEANT: an inexpensive, high-accuracy, semi-automatic metric for evaluating translation utility via semantic frames. Lo and Wu, 2011.
measures improves MT:
Beyond BLEU: Training Neural Machine Translation with Semantic
average
Optimizing for sentence-level BLEU+1 yields short translations. Nakov et al. 2012.
statistics to approximate corpus BLEU?
Online large-margin training of syntactic and structural translation features. Chiang et
lists across epochs:
[Figure: n-best lists accumulated across epochs. Epoch 1: n-best 1. Epoch 2: n-best 1 plus new n-best 2. Epoch 3: n-best 1 and n-best 2 plus new n-best 3.]
parameters, it still has good hypotheses from which to recover.
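Accumulating n-best lists can be sketched as merging each epoch's new hypotheses into a growing pool for the sentence. The hypotheses are made up and `accumulate` is a hypothetical helper:

```python
def accumulate(pool, new_nbest):
    """Merge a new n-best list into the pool, skipping duplicates."""
    seen = set(pool)
    for hyp in new_nbest:
        if hyp not in seen:
            pool.append(hyp)
            seen.add(hyp)
    return pool

pool = []
epochs = [
    ["I hate this movie", "I hate movie"],       # epoch 1
    ["I hate this movie", "this movie I hate"],  # epoch 2
    ["hate hate hate"],                          # epoch 3, after a bad update
]
for nbest in epochs:
    accumulate(pool, nbest)
    # The objective for this epoch would be computed over the whole pool.
```

Even though the epoch 3 model only proposes a degenerate output, the pool still contains the good hypotheses from earlier epochs.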
experience replay in RL:
Self-improving reactive agents based on reinforcement learning, planning and teaching. Lin 1992.
Memory Augmented Policy Optimization for Program Synthesis and Semantic Parsing. Liang et al. 2018.
Optimization for Statistical Machine Translation, a Survey (Neubig and Watanabe 2016)