Language modeling, Machine translation
CS 4803 / 7643 Deep Learning Georgia Tech, 31 March 2020 James Cross, Jean Maillard Facebook AI
Language modelling aims to assign probabilities to word sequences. By the chain rule, the probability of a sequence factorises into a product of next-word probabilities, each conditioned on the history:

$$p(w_1, \ldots, w_n) = p(w_1)\, p(w_2 \mid w_1)\, p(w_3 \mid w_1, w_2) \cdots p(w_n \mid w_1, \ldots, w_{n-1}) = \prod_{i=1}^{n} p(w_i \mid w_1, \ldots, w_{i-1})$$
Search
Example: autocomplete. Typing "how do I stop my backpack from ruining" brings up completions such as "ruining my life", "ruining my relationship", "ruining my meal", and "ruining your day": a language model ranking likely continuations.
Bigram language model
Make a conditional independence (Markov) assumption, so that each word depends only on the previous one:

$$p(w_1, \ldots, w_n) = p(w_1)\, p(w_2 \mid w_1)\, p(w_3 \mid w_1, w_2)\, p(w_4 \mid w_1, w_2, w_3) \cdots \approx p(w_1)\, p(w_2 \mid w_1)\, p(w_3 \mid w_2)\, p(w_4 \mid w_3) \cdots$$

Each factor is estimated from counts in a large body of text, like Wikipedia:

$$p(w_i \mid w_{i-1}) = \frac{p(w_{i-1}, w_i)}{p(w_{i-1})} \approx \frac{\text{how many times } w_{i-1}\, w_i \text{ appears}}{\text{how many times } w_{i-1} \text{ appears}}$$
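To make the counting concrete, here is a minimal sketch (not from the slides; the function name and toy corpus are illustrative) of estimating bigram probabilities from raw counts:

```python
from collections import Counter

def train_bigram_lm(corpus):
    """Estimate p(w_i | w_{i-1}) as count(w_{i-1} w_i) / count(w_{i-1})."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in corpus:
        tokens = ["<s>"] + sentence + ["</s>"]
        unigrams.update(tokens[:-1])                  # histories
        bigrams.update(zip(tokens[:-1], tokens[1:]))  # adjacent word pairs
    return {(prev, w): c / unigrams[prev] for (prev, w), c in bigrams.items()}

corpus = [["we", "the", "people"], ["we", "are"]]
p = train_bigram_lm(corpus)
print(p[("we", "the")])  # count("we the") / count("we") = 1/2
```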
<latexit sha1_base64="UmklOflgxR8UvG/xGutumsks+gA=">ACpnicjVDNbhMxGHSWvxJ+msKRi0WElEqraDcN4lq1FXAoUFC3idSNlm+9X1Ir9trYTlO02qfjKXgErvACOGmK0vbCnEYznvnkybXg1kXRz0Zw5+69+w82HjYfPX7ydLO19ezEqplhmDAlBnmYFHwEhPHncChNgyFzjIp/sLf3COxnJVHrvGkcSJiUfcwbOS1lrlFou8RtNe3Ms3h7SidZz2aSl7Qf9o827lSeuG63L+Sd8J1jxXK2azVjrREvQ2iVekTVY4yrYacVoNpNYOibA2tM40m5UgXGcCaybzXRmUQObwgRPi3OubQkS7ai6WG5Rr/sVSCvBnYW59MFXdIiHcIxDOufujL4FhrlSU/oFoUDT9X5i0f9J5tVFTcfK0FyJgnIHgrNrtYtOq5H5Yxa9LSTwcqxKZzsHfMKdDQ/9umX4ziBOt6sbly5Ty45F5v8izfQA/R4GP/jcJ40GnDJVqutK137k+Oakt8lJrxv3u68/9u7e6u5N8gL8pJ0SEzekF3ynhyRhDyg/wiv8mfoBN8DJgcPk0aKwyz8k1BF/AtiCyWM=</latexit>RNN cell
RNN cell
RNN cell
RNN cell
(|, , , )
<latexit sha1_base64="oTDo3PUehJQ5qM0T1Jo8JgA=">ACZXicjVBNTxsxFHSWtkBaSvhQL62ERVQpSFG0C0FcUYtKD1RQRCASG63eOi/Bir2bIdQbfSX9Nr+2/4BfwNvCGHApe+w2j0xjNPnlQLbl0Y3laCuRcvX80vLFZfv1l6u1xbWT23amwYdpgSynRTsCh4h3HncCuNgyFXiRj6X+sU1GstVduZ+aOxJGZ8wBk4v0pqH2LdmCS79CedJO2mh50StkuItpJaPWyF06HPSTQjdTKbk2SlEsV9xcYSM8cEWHsZhdr1cjCOM4FtRqPLWpgIxjiZf+a5uBRNvLb6ZfKf7Vc5BWgrtqptIbP9IuHsEZdumEuyv6BRimSo3oKUIfTcvrHYs01jLNbwo6UIamSvQpdyA4exRbZlqNzB+z6GUhgWcDlTnbOBD7mzyJeTNQ8N4mgrf3LpwTXNKD3/Z6nGB+j7MPjN+41GnDK5LEucl34kqOnlT4n59utqN3a/d6u73+a1b1A3pN0iAR2SP75Cs5IR3CyC/ym/whfyt3wVKwHrx7eBpUZp418miCjXveq7RE</latexit>RNN language models don’t need a Markov assumption. The more context we have to predict the next word, the better!
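As a concrete sketch (the class name and layer sizes are illustrative, not from the slides), an RNN language model in PyTorch follows the diagram: embed each word, run the RNN over the history, and project each hidden state to vocabulary-sized logits:

```python
import torch
import torch.nn as nn

class RNNLanguageModel(nn.Module):
    def __init__(self, vocab_size, embedding_size=64, hidden_size=128):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_size)
        self.rnn = nn.LSTM(embedding_size, hidden_size, batch_first=True)
        self.projection = nn.Linear(hidden_size, vocab_size)

    def forward(self, tokens):
        # tokens: (batch, seq_len) word ids
        hidden_states, _ = self.rnn(self.embedding(tokens))
        # logits over the next word at every position: (batch, seq_len, vocab_size)
        return self.projection(hidden_states)

model = RNNLanguageModel(vocab_size=1000)
logits = model(torch.randint(0, 1000, (2, 5)))
```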
Inference
Given a trained RNN language model:
- at every time step t, use the predictive distribution to choose the next word
- at time step t+1, feed the chosen word back in as input
- stop when the special end symbol </s> is predicted
[Figure: inference, finding p(w | "We the"). The RNN cells consume <s>, "We", "the" (the user input); a projection from the hidden state (embedding size) to the vocabulary size yields a distribution over possible next words, with continuations such as "People" ranked highest.]
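A sketch of the greedy inference loop, assuming a model like the `RNNLanguageModel` above that maps token ids to next-word logits (for simplicity the whole prefix is re-encoded at each step):

```python
import torch

@torch.no_grad()
def generate(model, prefix_ids, eos_id, max_len=50):
    """Repeatedly pick the most likely next word until </s> is predicted."""
    tokens = list(prefix_ids)
    for _ in range(max_len):
        logits = model(torch.tensor([tokens]))  # (1, len(tokens), vocab_size)
        next_id = int(logits[0, -1].argmax())   # most likely next word
        tokens.append(next_id)
        if next_id == eos_id:                   # stop at the end symbol
            break
    return tokens
```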
How do we know how well our RNN models language? If our model is good, it will assign high probability to real sentences.
Cross entropy measures how well the model distribution p matches a reference distribution p*:

$$H(p^*, p) = -\sum_{x \in \mathcal{X}} p^*(x) \log p(x)$$

It is the expected number of bits required to represent an event drawn from p*, when using a coding scheme optimal for p. For a held-out text, the per-word cross entropy

$$H = -\frac{1}{n} \sum_{i=1}^{n} \log p(w_i \mid w_1, \ldots, w_{i-1})$$

is commonly used as the loss function.
How do we know how well our RNN models language? Perplexity is the geometric mean of the inverse probability of a sequence of words (according to the model).
$$\text{ppl}(w_1, \ldots, w_n) = p(w_1, \ldots, w_n)^{-1/n} = \left( \prod_{i=1}^{n} p(w_i \mid w_1, \ldots, w_{i-1}) \right)^{-1/n}$$

Note:

$$\log \text{ppl} = -\frac{1}{n} \sum_{i=1}^{n} \log p(w_i \mid w_{i-1}, \ldots)$$

so perplexity is the exponential of the per-word cross entropy. It is commonly used as an evaluation metric.
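A small sketch showing that perplexity is just the exponential of the per-word cross entropy, given the model's log-probability for each word of a held-out sequence:

```python
import math

def perplexity(log_probs):
    """log_probs[i] = log p(w_i | w_1 .. w_{i-1}) for each held-out word."""
    cross_entropy = -sum(log_probs) / len(log_probs)  # per-word cross entropy
    return math.exp(cross_entropy)

# a model assigning probability 1/4 to every word has perplexity 4
print(perplexity([math.log(0.25)] * 10))  # 4.0
```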
Training
Given a long list of words:
- at every time step t, compute the predictive distribution p_t
- the loss at each step is the cross entropy between the predictive distribution and the actual next word
- minimisation of this loss is equivalent to maximum likelihood estimation
Training, given the sentence "On the 24th of February 1810, …":

[Figure: the RNN cells consume <start>, "On", "the", "24th", …, and a per-step loss L_t is computed at every position against the actual next word.]

The total loss is the sum of the per-step losses:

$$\mathcal{L} = \mathcal{L}_1 + \mathcal{L}_2 + \mathcal{L}_3 + \cdots = -\sum_t \log p_t(w_t)$$
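A sketch of one training step in PyTorch (the function and batch layout are illustrative): shifting the batch by one position makes every word the prediction target of the words before it, and `F.cross_entropy` averages exactly the per-step losses above:

```python
import torch.nn.functional as F

def training_step(model, optimizer, batch):
    """batch: (batch_size, seq_len) token ids."""
    inputs, targets = batch[:, :-1], batch[:, 1:]   # predict each next word
    logits = model(inputs)                          # (batch, seq_len-1, vocab)
    loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           targets.reshape(-1))     # mean of per-step losses
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```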
Applications
Conditional language models
Like a language model, but conditioned on extra context c:

$$p(w_1, \ldots, w_n \mid c) = \prod_{i=1}^{n} p(w_i \mid w_1, \ldots, w_{i-1}, c)$$
Masked language models
[Figure: a masked language model. Some input tokens are replaced by <mask>; each token's word embedding is summed with a position embedding (positions 1-8) and fed to a transformer encoder, which predicts the original tokens at the masked positions, here "take", "drink", and "</s>".]
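A simplified sketch of the masking step (real BERT-style training also sometimes keeps the original token or substitutes a random one; this version only ever inserts <mask>):

```python
import random

def mask_tokens(tokens, mask_token="<mask>", mask_prob=0.15):
    """Replace a random subset of tokens; the model must predict the originals."""
    masked, targets = [], []
    for tok in tokens:
        if random.random() < mask_prob:
            masked.append(mask_token)
            targets.append(tok)     # loss is computed at this position
        else:
            masked.append(tok)
            targets.append(None)    # no loss at unmasked positions
    return masked, targets

print(mask_tokens(["<s>", "take", "a", "seat", "and", "have", "a", "drink", "</s>"]))
```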
Masked language models
A single masked language model can be trained that works on 100 languages:
allons enfants de la patrie, le jour de gloire (French: "arise, children of the fatherland, the day of glory")
ee mungu nguvu yetu, ilete baraka kwetu (Swahili: "O God, our strength, bring blessings to us")
… and many more languages
Model size in perspective
[Figure: GLUE performance (50-100) against number of parameters (millions, log scale from 10 to 100,000) for CBoW (2013), ELMo (2018), BERT (2018), RoBERTa (2019), and T5 (2019). Types of model: embedding-based, RNN, transformer; the human baseline is marked for reference.]
Knowledge distillation
Idea: these models work very well, but they are often too slow! Can we make them faster without affecting their NLP performance?
Standard training: input text goes through the model to produce a prediction, and the training loss compares the prediction with the target.
Knowledge distillation, first attempt: the input text is fed to both a pretrained teacher model and a smaller student model, and a distillation loss compares the student's prediction with the teacher's.
Knowledge distillation, second attempt: in addition to the distillation loss against the teacher, the student is also trained against the target with a student loss.
Knowledge distillation, final version: the distillation loss compares the soft predictions (the full output distributions) of teacher and student, combined with the student loss against the target.
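A sketch of the combined objective in PyTorch; the temperature `T` and mixing weight `alpha` are illustrative hyperparameters. Softening both distributions with `T` and matching them via KL divergence is the classic recipe (Hinton et al.), not necessarily the exact losses used on the slides:

```python
import torch.nn.functional as F

def distillation_objective(student_logits, teacher_logits, targets, T=2.0, alpha=0.5):
    # student loss: ordinary cross entropy against the target labels
    student_loss = F.cross_entropy(student_logits, targets)
    # distillation loss: match the teacher's soft predictions at temperature T
    soft_teacher = F.softmax(teacher_logits / T, dim=-1)
    log_soft_student = F.log_softmax(student_logits / T, dim=-1)
    distill_loss = F.kl_div(log_soft_student, soft_teacher,
                            reduction="batchmean") * T * T
    return alpha * student_loss + (1 - alpha) * distill_loss
```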
References
Conneau et al. (2019). Unsupervised Cross-lingual Representation Learning at Scale. arXiv preprint 1911.02116.
Lewis et al. (2019). BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension. arXiv preprint 1910.13461.
Languages are different
Translation is a Conditional Language Model
Sequence-to-Sequence: Naive RNN
[Figure: the encoder reads x1 x2 x3 x4 into a single state θ; the decoder then generates y1, y2, y3, <end>, feeding each output word back in as the next input.]
Sequence-to-Sequence: Attention
[Figure: as before, the encoder reads x1 x2 x3 x4 and the decoder generates y1, y2, y3, <end>, but at every decoding step the decoder also attends over all of the encoder outputs.]
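A minimal sketch of the mechanism: score every encoder output against the current decoder state, normalise with a softmax, and take the weighted sum (plain dot-product attention with no learned parameters, just for illustration):

```python
import torch
import torch.nn.functional as F

def attention(decoder_state, encoder_outputs):
    # decoder_state: (hidden,); encoder_outputs: (src_len, hidden)
    scores = encoder_outputs @ decoder_state  # one score per source position
    weights = F.softmax(scores, dim=0)        # soft alignment over the source
    context = weights @ encoder_outputs       # weighted sum of encoder outputs
    return context, weights

encoder_outputs = torch.randn(4, 8)           # outputs for x1 .. x4
context, weights = attention(torch.randn(8), encoder_outputs)
print(weights)                                # e.g. one position may dominate
```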
Attention as Alignment
[Figure: attention weights act as a soft alignment between target and source words, e.g. putting p = 0.9 on one source word and p = 0.08 and p = 0.02 on the others.]
Transformer NMT
figure from Vaswani et al. 2017
transformer architecture
encoder-decoder attention considers all encoder outputs
generation is still autoregressive (left-to-right), but the decoder can be trained in parallel
Beam Search
[Figure: a beam-search tree. Several hypotheses are kept alive at each step, branching over alternatives such as "watched"/"saw"/"watch", then "the"/"a", then "movie", each hypothesis ending in _EOS.]
Beam Search
- decoding is still sequential: each step's self-attention takes the input for all previous steps
- the hypotheses in the beam can be scored together as a batch (efficient on GPU)
- keeping several hypotheses alive delays the final decision (less short-sighted than a greedy decision)
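A sketch of the search itself, with `next_log_probs` standing in for a real decoder that returns log-probabilities of next tokens given a prefix (the toy model below is invented for the demo):

```python
import math

def beam_search(next_log_probs, bos, eos, beam_size=3, max_len=10):
    """Keep the beam_size best hypotheses each step instead of one greedy choice."""
    beams = [([bos], 0.0)]            # (token list, total log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:   # expand every live hypothesis
            for tok, lp in next_log_probs(tokens).items():
                candidates.append((tokens + [tok], score + lp))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for tokens, score in candidates[:beam_size]:
            (finished if tokens[-1] == eos else beams).append((tokens, score))
        if not beams:                 # every kept hypothesis has ended
            break
    return max(finished + beams, key=lambda c: c[1])

def toy(prefix):  # stand-in decoder: prefers "a movie", then ends the sentence
    if prefix[-1] == "movie":
        return {"<eos>": math.log(0.9), "a": math.log(0.1)}
    return {"a": math.log(0.6), "movie": math.log(0.3), "<eos>": math.log(0.1)}

print(beam_search(toy, "<s>", "<eos>"))  # (['<s>', 'movie', '<eos>'], ...)
```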
Non-Autoregressive Machine Translation (NAT)
figure from Gu et al. 2017
Variants of Non-Autoregressive Translation (NAT)
How do we handle unseen words?
- character-level models: fully general, but too slow in practice
- subword units: split words based on frequency
- Byte-Pair Encoding (BPE), illustrated below (example from Sennrich et al. 2016)
[Table: token frequencies for a small dictionary, each word split into characters and ending in the end-of-word marker </w>.]

learned merge operations (• marks the end-of-word symbol </w>):
r • → r•
l o → lo
lo w → low
e r• → er•

example from Sennrich et al. 2016
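The learning loop itself is short. This sketch follows the style of the reference code in Sennrich et al. 2016, run on the toy dictionary from that paper (note: the merges learned for this dictionary differ from the ones in the figure above, which come from a different word list):

```python
import re
from collections import Counter

def get_pair_stats(vocab):
    """Count adjacent symbol pairs across the vocabulary, weighted by frequency."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for pair in zip(symbols, symbols[1:]):
            pairs[pair] += freq
    return pairs

def merge_pair(pair, vocab):
    """Apply one merge operation: join every occurrence of the symbol pair."""
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    return {pattern.sub("".join(pair), word): freq for word, freq in vocab.items()}

# words split into characters, with the end-of-word marker </w>
vocab = {"l o w </w>": 5, "l o w e r </w>": 2,
         "n e w e s t </w>": 6, "w i d e s t </w>": 3}
for _ in range(4):
    stats = get_pair_stats(vocab)
    best = max(stats, key=stats.get)
    vocab = merge_pair(best, vocab)
    print("merged", best)  # ('e','s'), ('es','t'), ('est','</w>'), ('l','o')
```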
Ongoing challenges for machine translation
References
(2014)
Models (2019)