What can Statistical Machine Translation teach Neural Text Generation about Optimization?



  1. What can Statistical Machine Translation teach Neural Text Generation about Optimization? Graham Neubig @ NAACL Workshop on Methods for Optimizing and Evaluating Neural Language Generation, 6/6/2019

  2. or: How to Optimize your Neural Generation System towards your Evaluation Function. Graham Neubig @ NAACL Workshop on Methods for Optimizing and Evaluating Neural Language Generation, 6/6/2019

  3. ... Neubig & Watanabe, Computational Linguistics (2016)

  4. Then: Symbolic Translation Models
  kono eiga ga kirai → this movie I hate
  • First step: learn component models to maximize likelihood
    • Translation model P(y|x) -- e.g. P( movie | eiga )
    • Language model P(Y) -- e.g. P( hate | I )
    • Reordering model -- e.g. P( <swap> | eiga, ga kirai )
    • Length model P(|Y|) -- e.g. a word penalty for each word added
  • Second step: learn a log-linear combination of these models to maximize translation accuracy [Och 2004] (a toy version is sketched below):
    log P(Y|X) = Σ_i λ_i φ_i(X, Y) − log Z
  Minimum Error Rate Training in Statistical Machine Translation (Och 2004)
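  To make the two-step recipe concrete, here is a hedged sketch of the log-linear scoring step. It is not Och's actual MERT line search: the feature names, feature values, and weights below are all made up for illustration, and "decoding" is just an argmax over two hand-written candidates.

```python
# Illustrative sketch only: score candidate translations with a log-linear
# combination of component-model features, as in log P(Y|X) above (the
# constant log Z is dropped, since it does not affect the argmax).

def loglinear_score(features, lambdas):
    """sum_i lambda_i * phi_i(X, Y) for one candidate Y."""
    return sum(lambdas[name] * value for name, value in features.items())

# Hypothetical feature values phi_i(X, Y) for two candidate translations of
# "kono eiga ga kirai".
candidates = {
    "this movie I hate": {"tm": -2.1, "lm": -3.0, "reorder": 0.0, "wordpen": -4},
    "I hate this movie": {"tm": -2.3, "lm": -1.2, "reorder": -0.7, "wordpen": -4},
}

# Step 2 (MERT) would tune these lambdas to maximize e.g. BLEU on a dev set;
# here they are simply fixed by hand.
lambdas = {"tm": 1.0, "lm": 0.8, "reorder": 0.5, "wordpen": 0.2}

best = max(candidates, key=lambda y: loglinear_score(candidates[y], lambdas))
print(best)  # decoding = argmax_Y sum_i lambda_i * phi_i(X, Y)
```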

  5. Now: Auto-regressive Neural Networks
  [Figure: "kono eiga ga kirai" is read by an Encoder; a Decoder then generates "I hate this movie </s>" one word at a time, feeding each predicted word back in as the next input]
  • All parameters trained end-to-end, usually to maximize likelihood (not accuracy!)

  6. Standard MT System Training/Decoding

  7. Decoder Structure
  [Figure: the encoder output initializes the decoder; at each step a classifier over the vocabulary predicts the next word of "I hate this movie </s>" given the previous words]
  P(E|F) = Π_{t=1..T} P(e_t | F, e_1, ..., e_{t−1})
  (A minimal code sketch of this step-by-step factorization follows below.)
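  Below is a minimal sketch of that factorization as code. The ToyDecoder class, its layer sizes, the GRU choice, the start-symbol id 0, and the reference token ids are all invented for illustration; it is not the architecture from the slides, only the same step-by-step structure.

```python
# A toy PyTorch decoder illustrating P(E|F) = prod_t P(e_t | F, e_1..e_{t-1}).
# All sizes, token ids, and the GRU cell are assumptions made for this sketch.
import torch
import torch.nn as nn

class ToyDecoder(nn.Module):
    def __init__(self, vocab_size=1000, hidden=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.rnn = nn.GRUCell(hidden, hidden)
        self.classify = nn.Linear(hidden, vocab_size)

    def step(self, prev_token, state):
        # One "dec"/"classify" step: consume the previous word, update the
        # hidden state, and output a distribution over the next word.
        state = self.rnn(self.embed(prev_token), state)
        return torch.log_softmax(self.classify(state), dim=-1), state

decoder = ToyDecoder()
state = torch.zeros(1, 256)            # stand-in for the encoder summary of F
token = torch.tensor([0])              # start symbol (</s> in the slides)
log_prob_E = torch.tensor(0.0)
for gold in [17, 42, 5, 99]:           # hypothetical ids for "I hate this movie"
    log_probs, state = decoder.step(token, state)
    log_prob_E = log_prob_E + log_probs[0, gold]   # log P(e_t | F, e_<t)
    token = torch.tensor([gold])       # condition on the reference word
print(log_prob_E)                      # = log P(E|F) under the factorization
```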

  8. Maximum Likelihood Training
  • Maximize the likelihood of predicting the next word in the reference, given the previous reference words:
    ℓ(E|F) = − log P(E|F) = − Σ_{t=1..T} log P(e_t | F, e_1, ..., e_{t−1})
  • Also called "teacher forcing" (a loss sketch follows below)
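  A hedged sketch of that loss, reusing the hypothetical ToyDecoder from the previous sketch (the decoder, hidden size, and token ids are all assumptions): the reference word at each step is both the training target and the next input, which is exactly teacher forcing.

```python
# Teacher-forcing MLE loss: -sum_t log P(e_t | F, e_1..e_{t-1}), computed by
# feeding the *reference* prefix into the decoder at every step.
# Assumes `decoder` (a ToyDecoder) from the sketch above is already defined.
import torch

def mle_loss(decoder, reference, state, start_id=0):
    token = torch.tensor([start_id])
    loss = torch.tensor(0.0)
    for gold in reference:
        log_probs, state = decoder.step(token, state)
        loss = loss - log_probs[0, gold]      # -log P(e_t | F, e_<t)
        token = torch.tensor([gold])          # teacher forcing: feed the gold word
    return loss

print(mle_loss(decoder, [17, 42, 5, 99], torch.zeros(1, 256)))
```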

  9. Problem 1: Exposure Bias
  • Teacher forcing assumes the correct previous word is always fed in, but at test time the model makes mistakes, and those mistakes propagate
  [Figure: the same decoder fed its own previous predictions; one early mistake ("I") is fed back and repeated at every following step]
  • Exposure bias: the model is never exposed to its own mistakes during training, so it cannot recover from them at test time
  • Really important! One main source of commonly observed failures such as repetition
  (A contrast between teacher-forced and free-running decoding is sketched below.)
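  To see where the mismatch comes from, the sketch below (again reusing the hypothetical ToyDecoder) runs the same decoder free-running, the way it is used at test time: each step conditions on the model's own previous guess, so an early error changes every later conditioning context, something teacher-forced training never shows the model.

```python
# Free-running (test-time) decoding: feed back the model's own predictions.
# With an untrained toy model the outputs are garbage; the point is the shape
# of the loop -- no reference words appear anywhere, unlike during training.
# Assumes `decoder` (a ToyDecoder) from the earlier sketch is already defined.
import torch

token, state = torch.tensor([0]), torch.zeros(1, 256)
output = []
for _ in range(4):
    log_probs, state = decoder.step(token, state)
    token = log_probs.argmax(dim=-1)   # the model's own (possibly wrong) guess
    output.append(token.item())        # one early mistake corrupts later contexts
print(output)
```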

  10. Problem 2: Disregard for Evaluation Metrics
  • In the end, we want good translations
  • Translation quality can be measured with automatic metrics, e.g. BLEU or METEOR (a sentence-BLEU example is sketched below)
  • Really important! Optimizing only likelihood causes systematic problems:
    • Hypothesis-reference length mismatch
    • Dropped/repeated content
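  As one concrete metric, sentence-level BLEU can be computed with NLTK as below. The example sentences are made up, and smoothing choices matter for short sentences, so treat this as a sketch rather than any particular shared task's exact scoring setup.

```python
# Sentence-level BLEU between a hypothesis and a reference, via NLTK.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference  = ["I", "hate", "this", "movie"]
hypothesis = ["I", "hate", "this", "this", "movie"]   # repeated content

smooth = SmoothingFunction().method1  # smoothing for short sentences
print(sentence_bleu([reference], hypothesis, smoothing_function=smooth))
```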

  11. A Clear Example
  • My (winning) submission to the Workshop on Asian Translation 2016 [Neubig 16]
  [Charts: length ratio (y-axis 80-100) and BLEU (y-axis 23-27) for three systems: MLE, MLE+Length, and MinRisk]
  • Just training for (sentence-level) BLEU largely fixes the length problems, and does much better than heuristics (a minimum risk training sketch follows below)
  Lexicons and Minimum Risk Training for Neural Machine Translation: NAIST-CMU at WAT2016 (Neubig 16)
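  "MinRisk" here refers to sentence-level minimum risk training. Below is a hedged sketch of the core objective: the sample size, the scale factor alpha, and all numbers are placeholders, and this is the general sample-based formulation rather than necessarily the exact setup of the WAT2016 paper.

```python
# Minimum risk training sketch: draw K hypotheses for a source sentence,
# renormalize their model probabilities over the sample, and minimize the
# expected cost (here, 1 - sentence BLEU). Gradients flow through log_probs.
import torch

def min_risk_loss(log_probs, bleus, alpha=0.005):
    """log_probs: (K,) model log-probs of K sampled hypotheses (requires grad);
    bleus: (K,) sentence-BLEU of each hypothesis against the reference."""
    q = torch.softmax(alpha * log_probs, dim=0)   # renormalized over the sample
    risk = 1.0 - bleus                            # cost = 1 - BLEU
    return (q * risk).sum()                       # expected risk

# Placeholder numbers for three sampled hypotheses.
loss = min_risk_loss(torch.tensor([-3.2, -4.1, -5.0], requires_grad=True),
                     torch.tensor([0.62, 0.40, 0.18]))
print(loss)
```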

  12. Error and Risk
