What can Statistical Machine Translation teach Neural Text - - PowerPoint PPT Presentation


What can Statistical Machine Translation teach Neural Text Generation about Optimization? Graham Neubig @ NAACL Workshop on Methods for Optimizing and Evaluating Neural Language Generation 6/6/2019 or How to Optimize your Neural Generation


slide-1
SLIDE 1

What can Statistical Machine Translation teach Neural Text Generation about Optimization?

Graham Neubig

@ NAACL Workshop on Methods for Optimizing and Evaluating Neural Language Generation
 6/6/2019

slide-2
SLIDE 2


or How to Optimize your Neural Generation System towards your Evaluation Function
slide-3
SLIDE 3

... Neubig & Watanabe, Computational Linguistics (2016)

slide-4
SLIDE 4

Then: Symbolic Translation Models

[Figure: word alignment between "kono eiga ga kirai" and "I hate this movie"]

  • First step: learn component models to maximize likelihood
  • Translation model P(y|x) -- e.g. P( movie | eiga )
  • Language model P(Y) -- e.g. P(hate | I)
  • Reordering model -- e.g. P(<swap> | eiga, ga kirai)
  • Length model P(|Y|) -- e.g. word penalty for each word added
  • Second step: learn a log-linear combination to maximize translation accuracy [Och 2003]

Minimum Error Rate Training in Statistical Machine Translation (Och 2003)

$$ \log P(Y \mid X) = \sum_i \lambda_i \phi_i(X, Y) - \log Z $$
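
A minimal sketch of how such a log-linear combination scores candidates. The feature layout (translation-model log-probability, language-model log-probability, word count), the feature values, and the weights are all made up for illustration; the function name `log_linear_probs` is hypothetical.

```python
import math

def log_linear_probs(weights, candidates):
    """candidates maps each translation Y to its feature vector phi(X, Y)."""
    # Unnormalized score: sum_i lambda_i * phi_i(X, Y).
    scores = {
        y: sum(w * f for w, f in zip(weights, feats))
        for y, feats in candidates.items()
    }
    # log Z normalizes over the candidate set only (an assumption of this sketch).
    log_z = math.log(sum(math.exp(s) for s in scores.values()))
    return {y: math.exp(s - log_z) for y, s in scores.items()}

# Hypothetical features: [log P_TM, log P_LM, number of words].
candidates = {
    "I hate this movie": [-1.0, -2.0, 4],
    "this movie I hate": [-1.0, -4.0, 4],
}
weights = [1.0, 1.0, 0.0]  # the lambdas that the second step would tune
probs = log_linear_probs(weights, candidates)
```

Only the handful of weights in `weights` is tuned in the second step; the component models that produce the features stay fixed.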
slide-5
SLIDE 5

Now: Auto-regressive Neural Networks

[Figure: encoder-decoder network. The encoder reads "kono eiga ga kirai"; the decoder emits "I hate this movie </s>" one token at a time.]

  • All parameters trained end-to-end, usually to maximize likelihood (not accuracy!)

slide-6
SLIDE 6

Standard MT System Training/Decoding

slide-7
SLIDE 7

Decoder Structure

[Figure: decoder structure. Starting from the encoder output, a chain of "classify" steps emits "I", "hate", "this", "movie", "</s>", each step taking the previous word as input.]

$$ P(E \mid F) = \prod_{t=1}^{T} P(e_t \mid F, e_1, \ldots, e_{t-1}) $$
slide-8
SLIDE 8

Maximum Likelihood Training

  • Maximize the likelihood of predicting the next word in the reference given the previous words

$$ \ell(E \mid F) = -\log P(E \mid F) = -\sum_{t=1}^{T} \log P(e_t \mid F, e_1, \ldots, e_{t-1}) $$
  • Also called "teacher forcing"
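
A small sketch of the loss above, assuming the model's per-step distributions are already available as dictionaries. The probabilities are invented, and `teacher_forced_nll` is a hypothetical helper name.

```python
import math

def teacher_forced_nll(step_distributions, reference):
    """Negative log-likelihood of the reference.

    step_distributions[t] maps words to P(e_t | F, e_1..e_{t-1}), where the
    context e_1..e_{t-1} is always the reference prefix (teacher forcing).
    """
    return -sum(
        math.log(dist[word])
        for dist, word in zip(step_distributions, reference)
    )

reference = ["I", "hate", "this", "movie", "</s>"]
# Hypothetical model outputs, one distribution per time step.
step_distributions = [
    {"I": 0.8, "this": 0.2},
    {"hate": 0.5, "like": 0.5},
    {"this": 0.5, "movie": 0.5},
    {"movie": 0.4, "film": 0.6},
    {"</s>": 1.0},
]
loss = teacher_forced_nll(step_distributions, reference)
```

Note that the model never sees its own predictions as context here, which is the setup the next slide criticizes.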
slide-9
SLIDE 9

Problem 1: Exposure Bias

  • Teacher forcing assumes feeding the correct previous input, but at test time we may make mistakes that propagate
  • Exposure bias: The model is not exposed to mistakes during training, and cannot deal with them at test time
  • Really important! One main source of commonly witnessed phenomena such as repetition.

[Figure: decoder structure in which every "classify" step outputs "I", illustrating repetition.]

slide-10
SLIDE 10

Problem 2: Disregard for Evaluation Metrics

  • In the end, we want good translations
  • Good translations can be measured with metrics, e.g. BLEU or METEOR
  • Really important! Causes systematic problems:
      • Hypothesis-reference length mismatch
      • Dropped/repeated content
slide-11
SLIDE 11

A Clear Example

  • My (winning) submission to Workshop on Asian Translation 2016 [Neubig 16]

[Figure: two bar charts comparing MLE, MLE+Length, and MinRisk: BLEU (axis 23 to 27) and Length Ratio (axis 80 to 100).]

  • Just training for (sentence-level) BLEU largely fixes length problems, and does much better than heuristics

Lexicons and Minimum Risk Training for Neural Machine Translation: NAIST-CMU at WAT2016 (Neubig 16)

slide-12
SLIDE 12

Error and Risk

slide-13
SLIDE 13

Error

  • Generate a translation

$$ \hat{E} = \operatorname{argmax}_{\tilde{E}} P(\tilde{E} \mid F) $$

  • Calculate its "badness" (e.g. 1-BLEU, 1-METEOR)

$$ \mathrm{error}(E, \hat{E}) = 1 - \mathrm{BLEU}(E, \hat{E}) $$

  • We would like to minimize error
  • Problem: argmax is not differentiable, and thus not conducive to gradient-based optimization
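
To make the non-differentiability concrete, here is a sketch with a toy metric. `unigram_f1` is only a stand-in for BLEU or METEOR, and the candidate probabilities are invented.

```python
def unigram_f1(reference, hypothesis):
    """Toy stand-in for BLEU/METEOR: F1 over the sets of unique words."""
    ref, hyp = set(reference.split()), set(hypothesis.split())
    overlap = len(ref & hyp)
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(hyp), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

def one_best_error(reference, candidates):
    """candidates maps each hypothesis to its model probability."""
    best = max(candidates, key=candidates.get)  # the argmax step
    return 1.0 - unigram_f1(reference, best)

reference = "I hate this movie"
before = {"I hate this movie": 0.40, "I hate movie": 0.45, "this movie": 0.15}
nudged = {"I hate this movie": 0.44, "I hate movie": 0.45, "this movie": 0.11}
flipped = {"I hate this movie": 0.46, "I hate movie": 0.44, "this movie": 0.10}

err_before = one_best_error(reference, before)
err_nudged = one_best_error(reference, nudged)
err_flipped = one_best_error(reference, flipped)
```

Moving probability mass towards the correct hypothesis (`nudged`) leaves the error unchanged until the argmax flips (`flipped`), so the error gives no gradient signal in between.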
slide-14
SLIDE 14

In Phrase-based MT:
 Minimum Error Rate Training

  • A clever trick for gradient-free optimization of linear models
  • Pick a single direction in feature space
  • Exactly calculate the loss surface in this direction only (over an n-best list for every hypothesis)

[Table: n-best candidates E1,1 to E1,3 for source F1 and E2,1 to E2,3 for source F2, each with features φ1, φ2, φ3 and an error value (err).]

[Figure, panels (a)-(d), with plots titled "F1 candidates", "F2 candidates", "F1 error", "F2 error", and "total error": line search along the direction d1=0, d2=0, d3=1, starting from λ1=-1, λ2=1, λ3=0. The search selects α ← 1.25, giving λ1=-1, λ2=1, λ3=1.25.]
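
A simplified sketch of the line search described above. It is not Och's efficient envelope algorithm: it collects every pairwise crossing point and probes one α per interval, which is exact but slower. The n-best lists are invented (the feature values in the slide's table are not reproduced here); only the starting weights and direction follow the slide.

```python
from itertools import combinations

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def total_error(nbests, weights):
    """Sum the error of the highest-scoring candidate in each n-best list."""
    total = 0.0
    for candidates in nbests:
        _, error = max(candidates, key=lambda c: dot(weights, c[0]))
        total += error
    return total

def mert_line_search(nbests, weights, direction):
    """Return (alpha, error) minimizing total error on weights + alpha * direction."""
    # Along the direction, each candidate's score is a line: intercept + slope * alpha.
    breakpoints = set()
    for candidates in nbests:
        lines = [(dot(weights, f), dot(direction, f)) for f, _ in candidates]
        for (a1, b1), (a2, b2) in combinations(lines, 2):
            if b1 != b2:  # parallel lines never swap rank
                breakpoints.add((a2 - a1) / (b1 - b2))
    points = sorted(breakpoints)
    if not points:
        return 0.0, total_error(nbests, weights)

    # The error is constant between crossings, so one probe per interval suffices.
    probes = [points[0] - 1.0]
    probes += [(lo + hi) / 2 for lo, hi in zip(points, points[1:])]
    probes.append(points[-1] + 1.0)

    def error_at(alpha):
        shifted = [w + alpha * d for w, d in zip(weights, direction)]
        return total_error(nbests, shifted)

    best = min(probes, key=error_at)
    return best, error_at(best)

# Hypothetical n-best lists: (feature vector, error) per candidate.
nbests = [
    [((1, 0, -1), 0.6), ((0, 1, 0), 0.0), ((1, 0, 1), 1.0)],
    [((1, 0, -2), 0.8), ((3, 1, 0), 0.3), ((3, 1, 2), 0.1)],
]
start = (-1, 1, 0)
direction = (0, 0, 1)
alpha, err = mert_line_search(nbests, start, direction)
```

With this made-up data the total error drops from 0.8 at the starting weights to 0.1, and any α between 0.5 and 2 is equally good; the sketch returns the midpoint of the first optimal interval.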

slide-15
SLIDE 15

A Smooth Approximation: Risk [Smith+ 2006, Shen+ 2015]

  • Risk is defined as the expected error

$$ \mathrm{risk}(F, E, \theta) = \sum_{\tilde{E}} P(\tilde{E} \mid F; \theta)\, \mathrm{error}(E, \tilde{E}) $$

  • This includes the probability in the objective function -> differentiable!

Minimum Risk Annealing for Training Log-Linear Models (Smith and Eisner 2006)
Minimum risk training for neural machine translation (Shen et al. 2015)
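
A tiny sketch of risk as an expectation, assuming the full hypothesis space is just three candidates. All probabilities and error values are invented.

```python
def risk(candidates):
    """candidates: list of (probability, error) pairs over the hypothesis space."""
    return sum(p * err for p, err in candidates)

# Hypothetical (P(E~ | F), error(E, E~)) pairs; the second candidate is the argmax.
before = [(0.40, 0.0), (0.45, 0.2), (0.15, 0.6)]
# Shift a little probability mass from the worst to the best hypothesis.
nudged = [(0.44, 0.0), (0.45, 0.2), (0.11, 0.6)]

risk_before = risk(before)
risk_nudged = risk(nudged)
```

The argmax is the same in both cases, so the 1-best error would not move, yet the risk decreases. That smooth response to the probabilities is what makes gradient-based training possible.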

slide-16
SLIDE 16

Sub-sampling

  • Create a small sample of sentences (5-50), and calculate risk over that
  • Samples can be created using random sampling or n-best search
  • If random sampling, make sure to deduplicate

$$ \mathrm{risk}(F, E, S) = \sum_{\tilde{E} \in S} \frac{P(\tilde{E} \mid F)}{Z}\, \mathrm{error}(E, \tilde{E}) $$
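
A sketch of the sampled version, assuming Z is the sum of the probabilities of the unique hypotheses in the sample. The sample contents are invented and `sampled_risk` is a hypothetical name.

```python
def sampled_risk(samples):
    """samples: list of (hypothesis, probability, error); duplicates allowed."""
    unique = {}
    for hyp, prob, err in samples:
        unique[hyp] = (prob, err)  # deduplicate: keep one entry per hypothesis
    # Z renormalizes the probabilities over the sample (assumption of this sketch).
    z = sum(prob for prob, _ in unique.values())
    return sum(prob / z * err for prob, err in unique.values())

# Hypothetical random sample in which "I hate movie" was drawn twice.
samples = [
    ("I hate this movie", 0.4, 0.0),
    ("I hate movie", 0.3, 0.2),
    ("I hate movie", 0.3, 0.2),
    ("this movie", 0.1, 0.6),
]
value = sampled_risk(samples)
```

Without the deduplication step, the repeated hypothesis would be counted twice in both the numerator and Z, skewing the estimate.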
slide-17
SLIDE 17

Policy Gradient/REINFORCE

  • Alternative way of maximizing expected reward, minimizing risk
  • Outputs that get a bigger reward will get a higher weight
  • Can show this converges to minimum-risk solution

$$ \ell_{\mathrm{reinforce}}(X, Y) = -R(\hat{Y}, Y) \log P(\hat{Y} \mid X) $$
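
A sketch of the loss for a single sampled output, assuming a toy policy over three complete translations. The policy, the rewards, and the fixed seed are all illustrative.

```python
import math
import random

def reinforce_loss(prob_of_sample, reward):
    """-R(Y^, Y) * log P(Y^ | X) for one sampled output Y^."""
    return -reward * math.log(prob_of_sample)

# Hypothetical policy P(Y^ | X) and rewards R(Y^, Y), e.g. sentence-level BLEU.
policy = {"I hate this movie": 0.5, "I hate movie": 0.3, "this movie": 0.2}
rewards = {"I hate this movie": 1.0, "I hate movie": 0.8, "this movie": 0.4}

rng = random.Random(0)  # fixed seed so the sketch is repeatable
sample = rng.choices(list(policy), weights=list(policy.values()), k=1)[0]
loss = reinforce_loss(policy[sample], rewards[sample])
```

The reward acts as a per-sample weight on the usual log-likelihood term, so high-reward samples are pushed up more strongly than low-reward ones.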
slide-18
SLIDE 18

But Wait, why is Everyone Using MLE for NMT?

slide-19
SLIDE 19

When Training goes Bad...

Chances are, this is you 😕

Minimum risk training for neural machine translation (Shen et al. 2015)

slide-20
SLIDE 20

It Happens to the Best of Us

  • Email from a famous MT researcher:



 "we also re-implemented MRT, but so far, training has been very unstable, and after a improving for a bit, our models develop a bias towards producing ever-shorter translations..."

slide-21
SLIDE 21

My Current Recipe for Stabilizing MRT/Reinforcement Learning

slide-22
SLIDE 22

Warm-start

  • Start training with maximum likelihood, then switch over to REINFORCE
  • Works only in the scenarios where we can run MLE (not latent variables or standard RL settings)
  • MIXER (Ranzato et al. 2016) gradually transitions from MLE to the full objective

slide-23
SLIDE 23

Adding a Baseline

  • Basic idea: we have expectations about our reward for a particular sentence

                                Reward   Baseline   R-B
    "This is an easy sentence"  0.8      0.95       -0.15
    "Buffalo Buffalo Buffalo"   0.3      0.1        0.2

  • We can instead weight our likelihood by R-B to reflect when we did better or worse than expected

$$ \ell_{\mathrm{baseline}}(X) = -(R(\hat{Y}, Y) - B(\hat{Y})) \log P(\hat{Y} \mid X) $$
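
A sketch of the baseline-adjusted loss using the reward and baseline numbers from the slide. The sample probability of 0.5 is an assumption, and how the baseline itself is predicted is outside this sketch.

```python
import math

def advantage(reward, baseline):
    """How much better (positive) or worse (negative) than expected we did."""
    return reward - baseline

def baseline_loss(prob_of_sample, reward, baseline):
    """-(R - B) * log P(Y^ | X) for one sampled output."""
    return -advantage(reward, baseline) * math.log(prob_of_sample)

# Reward/baseline values from the slide; 0.5 is a made-up sample probability.
easy = baseline_loss(0.5, reward=0.8, baseline=0.95)
hard = baseline_loss(0.5, reward=0.3, baseline=0.1)
```

For the easy sentence the weight is negative even though the raw reward is high, so minimizing the loss lowers the probability of that sample; for the hard sentence a modest reward still counts as a success.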

slide-24
SLIDE 24

Increasing Batch Size

  • If we use a single sentence, the variance is high
  • Solution: increase the number of examples (roll-outs) done before an update to stabilize
slide-25
SLIDE 25

Adding Temperature

  • Temperature adjusts the peakiness of the distribution
  • With a small sample, setting temperature > 1 accounts for unsampled hypotheses that should be in the denominator

$$ \mathrm{risk}(F, E, \theta, \tau, S) = \sum_{\tilde{E} \in S} \frac{P(\tilde{E} \mid F; \theta)^{1/\tau}}{Z}\, \mathrm{error}(E, \tilde{E}) $$

[Figure: four panels for τ = 1, τ = 0.5, τ = 0.25, and τ = 0.05; x-axis from -4 to 4, y-axis from 0.5 to 2.]
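
A sketch of the temperature-scaled risk, assuming Z is the sum of the scaled probabilities over the sample. The sample is invented and `tempered_risk` is a hypothetical name.

```python
def tempered_risk(samples, tau):
    """samples: list of (probability, error); tau: temperature (> 0)."""
    scaled = [p ** (1.0 / tau) for p, _ in samples]
    z = sum(scaled)  # renormalize after scaling
    return sum(s / z * err for s, (_, err) in zip(scaled, samples))

# Hypothetical (P(E~ | F), error) pairs; the most probable one has zero error.
samples = [(0.5, 0.0), (0.3, 0.2), (0.2, 0.6)]

risk_flat = tempered_risk(samples, tau=2.0)    # flatter distribution
risk_plain = tempered_risk(samples, tau=1.0)   # unchanged distribution
risk_sharp = tempered_risk(samples, tau=0.05)  # nearly one-hot on the argmax
```

As τ shrinks, the scaled distribution concentrates on the most probable hypothesis and the risk approaches that hypothesis's error; as τ grows, lower-probability hypotheses contribute more.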

slide-26
SLIDE 26

Contrasting Phrase-based SMT and NMT

slide-27
SLIDE 27

Phrase-based SMT MERT and NMT MinRisk/REINFORCE

                        NMT+MinRisk       PBMT+MERT
Model                   NMT               PBMT
Optimized Parameters    Millions          5-30 Log-linear Weights (others MLE)
Objective               Risk              Error
Metric Granularity      Sentence Level    Corpus Level
n-best Lists            Re-generated      Accumulated

slide-28
SLIDE 28

Optimized Parameters

  • Can we reduce the number of parameters optimized for NMT?
  • Maybe we can optimize only some parts of the model?

Freezing Subnetworks to Analyze Domain Adaptation in NMT. Thompson et al. 2018.

  • Maybe we can express models as a linear combination of a few hyper-parameters?

$$ W = \sum_i \alpha_i W_i $$

Contextualized Parameter Generation for Universal NMT. Platanios et al. 2018.
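
A sketch of the linear-combination idea only, not the method of Platanios et al.: fixed basis matrices are mixed with a few coefficients, and only the coefficients would be tuned towards the metric. The matrices and coefficients are invented.

```python
def combine(alphas, matrices):
    """Return W = sum_i alpha_i * W_i for equally shaped nested-list matrices."""
    rows, cols = len(matrices[0]), len(matrices[0][0])
    return [
        [sum(a * m[r][c] for a, m in zip(alphas, matrices)) for c in range(cols)]
        for r in range(rows)
    ]

# Hypothetical basis matrices W_1, W_2 (e.g. trained with MLE and then frozen).
w1 = [[1.0, 0.0], [0.0, 1.0]]
w2 = [[0.0, 1.0], [1.0, 0.0]]
alphas = [0.75, 0.25]  # the only parameters optimized in this sketch
w = combine(alphas, [w1, w2])
```

Here two numbers control a four-entry matrix; with realistic layer sizes the reduction in tuned parameters would be far larger.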
slide-29
SLIDE 29

Objective

  • Can we move closer to minimizing error, which is what we want to do in the first place?
  • Maybe we can gradually anneal the temperature to move towards a peakier distribution?

Minimum risk annealing for training log-linear models. Smith and Eisner 2006.

[Figure: four panels labeled "Training progression", for τ = 1, τ = 0.5, τ = 0.25, and τ = 0.05; x-axis from -4 to 4, y-axis from 0.5 to 2.]

slide-30
SLIDE 30

Metric

  • We have lots of metrics! BLEU, METEOR, ROUGE, CIDEr
  • Depending on the metric you optimize, results differ.

The Best Lexical Metric for Phrase-Based Statistical MT System Optimization. Cer et al. 2010.

  • Maybe a metric that considers semantic roles?

MEANT: an inexpensive, high-accuracy, semi-automatic metric for evaluating translation utility via semantic frames. Lo and Wu, 2011.

  • New! Optimizing towards neural semantic similarity measures improves MT:

Beyond BLEU: Training Neural Machine Translation with Semantic Similarity. Wieting et al. 2019.
slide-31
SLIDE 31

Metric Granularity

  • Two ways of measuring metrics
      • Sentence-level: Measure sentence-by-sentence, average
      • Corpus: Sum sufficient statistics, calculate score
  • Regular BLEU is corpus-level, but mini-batch NMT optimization algorithms calculate sentence level
  • This causes problems, e.g. in sentence length!

Optimizing for sentence-level BLEU+1 yields short translations. Nakov et al. 2012.

  • Maybe we can keep a running average of the sufficient statistics to approximate corpus BLEU?

Online large-margin training of syntactic and structural translation features. Chiang et al. 2008.
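
A sketch of the running-average idea, loosely inspired by the decayed pseudo-document of Chiang et al. (2008) but not a reimplementation of it. It assumes bigram BLEU with a single reference and a decay of 0.9; the names `RunningBleu`, `sufficient_stats`, and `bleu_from_stats` are hypothetical.

```python
import math
from collections import Counter

MAX_N = 2  # assumption: bigram BLEU keeps the example short

def ngram_counts(words, n):
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))

def sufficient_stats(reference, hypothesis):
    """Return [match_1, total_1, ..., match_N, total_N, hyp_len, ref_len]."""
    ref, hyp = reference.split(), hypothesis.split()
    stats = []
    for n in range(1, MAX_N + 1):
        ref_counts, hyp_counts = ngram_counts(ref, n), ngram_counts(hyp, n)
        stats.append(sum((ref_counts & hyp_counts).values()))  # clipped matches
        stats.append(max(len(hyp) - n + 1, 0))                 # hypothesis n-grams
    return stats + [len(hyp), len(ref)]

def bleu_from_stats(stats):
    matches, totals = stats[0:2 * MAX_N:2], stats[1:2 * MAX_N:2]
    hyp_len, ref_len = stats[-2], stats[-1]
    if min(matches) <= 0 or min(totals) <= 0:  # also covers an empty hypothesis
        return 0.0
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / MAX_N
    brevity = min(0.0, 1.0 - ref_len / hyp_len)  # log of the brevity penalty
    return math.exp(brevity + log_precision)

class RunningBleu:
    """Scores each sentence in the context of decayed statistics from the past."""

    def __init__(self, decay=0.9):
        self.decay = decay
        self.history = [0.0] * (2 * MAX_N + 2)

    def score(self, reference, hypothesis):
        current = sufficient_stats(reference, hypothesis)
        return bleu_from_stats([h + c for h, c in zip(self.history, current)])

    def update(self, reference, hypothesis):
        current = sufficient_stats(reference, hypothesis)
        self.history = [self.decay * (h + c) for h, c in zip(self.history, current)]

reference = "I hate this movie"
running = RunningBleu()
alone = running.score(reference, "I hate")    # plain sentence-level score
running.update(reference, reference)          # one earlier, full-length output
in_context = running.score(reference, "I hate")
```

In this toy case the short hypothesis scores about 0.37 on its own and about 0.70 once earlier statistics are included, since the brevity penalty is computed over the accumulated lengths rather than one sentence.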
slide-32
SLIDE 32

N-best Lists

  • In MERT for PBMT, we would accumulate n-best lists across epochs:

[Figure: Epoch 1 uses n-best 1; Epoch 2 uses n-best 1 plus new n-best 2; Epoch 3 uses n-best 1 and n-best 2 plus new n-best 3.]

  • Greatly stabilizes training! Even if model learns horrible parameters, it still has good hypotheses from which to recover.
  • Maybe we could do the same for NMT? Analogous to experience replay in RL:

Self-improving reactive agents based on reinforcement learning, planning and teaching. Lin 1992.
Memory Augmented Policy Optimization for Program Synthesis and Semantic Parsing. Liang et al. 2018.
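
A sketch of accumulating hypotheses across epochs, assuming hypotheses are plain strings keyed by their source sentence. `NBestPool` is a hypothetical name, and a real system would also store scores or features.

```python
class NBestPool:
    """Keeps every unique hypothesis seen so far for each source sentence."""

    def __init__(self):
        self.pool = {}

    def add(self, source, hypotheses):
        """Merge a newly generated n-best list and return the accumulated list."""
        seen = self.pool.setdefault(source, [])
        for hyp in hypotheses:
            if hyp not in seen:
                seen.append(hyp)
        return seen

pool = NBestPool()
source = "kono eiga ga kirai"
# Epoch 1: a reasonable model produces good hypotheses.
pool.add(source, ["I hate this movie", "I hate movie"])
# Epoch 2: suppose the parameters got worse; the new list is poor.
accumulated = pool.add(source, ["movie movie", "I hate movie"])
```

The risk or error in epoch 2 would be computed over `accumulated`, so the good hypothesis from epoch 1 remains available even after a bad update.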

slide-33
SLIDE 33

Summary

slide-34
SLIDE 34

Summary

  • Neural MT has come a long way, and we can optimize for accuracy
  • This is important, fixes lots of problems that we'd otherwise use heuristic hacks for
  • But no-one does it... Problems of stability and speed.
  • Still lots to remember from the past!


Optimization for Statistical Machine Translation, a Survey (Neubig and Watanabe 2016)

Thanks! Questions?