Language modeling, Machine translation. CS 4803 / 7643 Deep Learning



SLIDE 1

Language modeling, Machine translation

CS 4803 / 7643 Deep Learning
 Georgia Tech, 31 March 2020 James Cross, Jean Maillard Facebook AI

SLIDE 2

$$P(w_1, \dots, w_n) = P(w_1)\, P(w_2 \mid w_1)\, P(w_3 \mid w_1, w_2) \cdots P(w_n \mid w_{n-1}, \dots, w_1) = \prod_i P(w_i \mid w_{i-1}, \dots, w_1)$$

where $w_i$ is the next word and $w_{i-1}, \dots, w_1$ is its history.

Language modelling aims to assign probabilities to word sequences
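The chain rule factorisation above can be checked numerically in a couple of lines; the conditional probabilities here are invented purely for illustration:

```python
# Chain rule: p(w1, w2, w3) = p(w1) * p(w2 | w1) * p(w3 | w1, w2).
# All probabilities below are made up for illustration.
p_w1 = 0.1               # p("we")
p_w2_given_w1 = 0.2      # p("the" | "we")
p_w3_given_w1_w2 = 0.05  # p("people" | "we", "the")

p_sequence = p_w1 * p_w2_given_w1 * p_w3_given_w1_w2
print(round(p_sequence, 6))  # 0.001
```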

SLIDE 3

[Figure: a search engine autocompleting the query "how do I stop my backpack from ruining …": ruining my life, ruining my relationship, ruining my meal, ruining your day, ruining my hair, ruining my carpet, ruining your hair, ruining my garden]

SLIDE 4

Bigram language model:

$$P(w_1, \dots, w_n) = \prod_i P(w_i \mid w_{i-1}, \dots, w_1) \simeq \prod_i P(w_i \mid w_{i-1})$$

Conditional independence (Markov assumption):

$$P(w_1)\, P(w_2 \mid w_1)\, P(w_3 \mid w_2, w_1)\, P(w_4 \mid w_3, w_2, w_1) \cdots \simeq P(w_1)\, P(w_2 \mid w_1)\, P(w_3 \mid w_2)\, P(w_4 \mid w_3) \cdots$$

Bigram probabilities can be estimated by counting:

$$P(w_i \mid w_{i-1}) = \frac{P(w_i, w_{i-1})}{P(w_{i-1})} \simeq \frac{\mathrm{count}(w_{i-1}, w_i)}{\mathrm{count}(w_{i-1})}$$

i.e. how many times "wi-1 wi" appears versus how many times "wi-1" appears, counted in a large body of text, like Wikipedia.
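The counting estimate above can be sketched in a few lines of Python; the toy corpus here is just a stand-in for a large body of text like Wikipedia:

```python
from collections import Counter

# Toy corpus standing in for a large body of text like Wikipedia.
corpus = "we the people of the united states in order to form a more perfect union".split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def p_bigram(w, prev):
    """Estimate p(w | prev) as count(prev, w) / count(prev)."""
    if unigrams[prev] == 0:
        return 0.0
    return bigrams[(prev, w)] / unigrams[prev]

print(p_bigram("people", "the"))  # 0.5: "the" occurs twice, once followed by "people"
```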
SLIDE 5

[Figure: four RNN cells chained left to right, each consuming one word and passing its hidden state forward; the final cell predicts $p(w_5 \mid w_4, w_3, w_2, w_1)$]

RNN language models don’t need a Markov assumption. The more context we have to predict the next word, the better!

SLIDE 6

Inference. Given a trained RNN language model:

  • Feed the words one by one, starting with a special start symbol <s>.
  • At every time step t, to predict:
    • Project the RNN output to the vocabulary space.
    • Normalise using softmax to obtain a valid distribution.
    • Sample from this distribution to predict the next word.
  • At time step t+1: feed the embedding corresponding to the word predicted at t.
  • Stop when the special end symbol </s> is predicted.
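The procedure above can be sketched as a generation loop. Here `next_word_distribution` is a hypothetical stand-in for the trained RNN plus projection and softmax, and for determinism we take the argmax instead of sampling:

```python
# Autoregressive generation sketch. `next_word_distribution` stands in for
# the trained RNN: it maps the history so far to a distribution over words.
def next_word_distribution(history):
    # Hypothetical hard-coded model, for illustration only.
    table = {
        ("<s>",): {"we": 0.6, "the": 0.4},
        ("<s>", "we"): {"the": 0.7, "are": 0.3},
        ("<s>", "we", "the"): {"people": 0.5, "best": 0.2, "</s>": 0.3},
        ("<s>", "we", "the", "people"): {"</s>": 0.9, "of": 0.1},
    }
    return table[tuple(history)]

def generate(max_len=10):
    history = ["<s>"]                   # start with the special start symbol
    while len(history) < max_len:
        dist = next_word_distribution(history)
        word = max(dist, key=dist.get)  # greedy pick (a real decoder could sample)
        if word == "</s>":              # stop at the special end symbol
            break
        history.append(word)
    return history[1:]

print(generate())  # ['we', 'the', 'people']
```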
SLIDE 7

Inference: finding p(w | "We the")

[Figure: <s>, "We", "the" are fed to successive RNN cells; the final RNN output (of embedding size) is projected to the vocabulary size and normalised with a softmax, giving a distribution over possible next words such as "People"]

SLIDE 8

How do we know how well our RNN models language? If our model is good, it will assign high probability to real sentences.

$$H(p^*, p) = -\sum_{w \in X} p^*(w)\, \log p(w)$$

Cross entropy: with $p^*$ the reference distribution, this is the expected number of bits required to represent an event drawn from $p^*$, when using a coding scheme optimal for $p$.

For a word sequence, the per-word cross entropy

$$-\frac{1}{n} \sum_i \log p(w_i \mid w_{i-1}, \dots)$$

is commonly used as a loss function.
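The per-word cross entropy can be computed directly from the probabilities the model assigned to each actual next word; the numbers below are invented:

```python
import math

# Probabilities the model assigned to each actual next word (invented numbers).
step_probs = [0.25, 0.5, 0.125]

# Per-word cross entropy, in bits: -(1/n) * sum(log2 p_i).
n = len(step_probs)
cross_entropy = -sum(math.log2(p) for p in step_probs) / n
print(cross_entropy)  # (2 + 1 + 3) / 3 = 2.0 bits per word
```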

SLIDE 9

How do we know how well our RNN models language? Perplexity is the geometric mean of the inverse probability of a sequence of words (according to the model).

$$PP(w_1, \dots, w_n) = \sqrt[n]{\prod_i \frac{1}{p(w_i \mid w_{i-1}, \dots)}} = 2^{-\frac{1}{n} \sum_i \log_2 p(w_i \mid w_{i-1}, \dots)}$$

Note:

  • The exponent is our loss function, the per-word cross entropy!
  • The lower the perplexity, the better our model is.

Perplexity is commonly used as an evaluation metric.
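Both forms of the definition give the same number, as a quick computation shows; the per-step probabilities are invented:

```python
import math

# Probabilities the model assigned to each actual next word (invented numbers).
step_probs = [0.25, 0.5, 0.125]
n = len(step_probs)

# Perplexity: geometric mean of the inverse probabilities.
product_inv = 1.0
for p in step_probs:
    product_inv *= 1.0 / p
perplexity = product_inv ** (1.0 / n)

# Equivalently, 2 to the power of the per-word cross entropy (in bits).
cross_entropy = -sum(math.log2(p) for p in step_probs) / n
print(perplexity, 2 ** cross_entropy)  # both are approximately 4.0
```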

SLIDE 10

Training. Given a long list of words:

  • Feed them one by one, enclosing every sentence with <s> and </s>.
  • At every time step t, compute the predictive distribution pt.
  • Compute ℒt, the loss at time step t, as the cross entropy between the predictive distribution and the actual next word.
  • The total loss ℒ is the average of the loss at every time step.

Minimisation of this loss is equivalent to maximum likelihood estimation.
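The per-step losses and their average can be sketched as follows. The predictive distributions are invented, and because the target at each step is a single actual word (a one-hot distribution), the cross entropy reduces to the negative log probability of that word:

```python
import math

# One (predictive distribution, actual next word) pair per time step.
# The distributions are invented for illustration.
steps = [
    ({"we": 0.5, "the": 0.5}, "we"),
    ({"the": 0.8, "are": 0.2}, "the"),
    ({"people": 0.25, "best": 0.75}, "people"),
]

# L_t: cross entropy between the prediction and the actual next word,
# which is just -log p_t(actual word) for a one-hot target.
per_step_losses = [-math.log2(dist[actual]) for dist, actual in steps]

# Total loss: the average of the loss at every time step.
total_loss = sum(per_step_losses) / len(per_step_losses)
print(per_step_losses, total_loss)
```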
SLIDE 11

[Figure: training, given the sentence "On the 24th of February 1810, …": starting from <start>, each word is fed to an RNN cell in turn; a per-step loss $\mathcal{L}_t$ is computed at every step, and the total loss is the average $\mathcal{L} = \frac{1}{T} \sum_t \mathcal{L}_t$]
SLIDE 12

Applications

  • Predictive typing:
    • In search fields
    • For keyboards
    • For assisted typing, e.g. sentence completion
  • Automatic speech recognition:
    • How likely is the user to have said "My hair is wet" vs. "My hairy sweat"?
  • Basic grammar correction:
    • p(They’re happy together) > p(Their happy together)
SLIDE 13

Conditional language models Like a language model, but conditioned on extra context c.

$$P(s \mid c) = \prod_i P(w_i \mid c, w_{i-1}, \dots, w_1)$$
  • Sentiment analysis: c = {positive, negative, neutral}, s = text.
  • Text summarisation: c = a long document, s = its summary.
  • Machine translation: c = French text, s = English text.
  • Image captioning: c = an image, s = its caption.
  • Optical character recognition: c = image of a line, s = its content.
  • Speech recognition: c = a recording, s = its content.
SLIDE 14

Masked language models

[Figure: a partially masked input sequence of word embeddings, summed with position embeddings (1–8), is fed to a transformer encoder; the model predicts the words at the masked positions, e.g. "take", "drink", "</s>"]

SLIDE 15

Masked language models

  • We don’t have to limit ourselves to just English. XLM-R is a single model that works on 100 languages:
    • French: allons enfants de la patrie, le jour de gloire …
    • English: O say, can you see, by the dawn’s early …
    • Swahili: ee mungu nguvu yetu ilete baraka kwetu …
    • Hindi: जन-गण-मन अिधनायक जय हे, भारत भाग्य िवधाता! पंजाब …

  • No need to explicitly tag the input language.
  • For many cases, it performs better than language-specific models.
  • Achieves state-of-the-art results on a variety of tasks.
SLIDE 16

Model size in perspective

[Figure: performance on GLUE (50–100) against number of parameters in millions (10 to 100,000, log scale) for CBoW (2013), ELMo (2018), BERT (2018), RoBERTa (2019), and T5 (2019); model types: embedding-based, RNN, transformer; a human baseline is shown for reference]

SLIDE 17

Knowledge distillation

These models work very well, but they are often too slow! Can we make them faster without affecting their NLP performance?

Idea:

  • Take a big model, use it to teach a smaller (and therefore faster) model.
  • Can we make the smaller model work as well as the larger one?

SLIDE 18

Knowledge distillation

[Figure: standard training: input text → model → prediction; the training loss compares the prediction with the target]

SLIDE 19

Knowledge distillation

[Figure: standard training, as above, plus knowledge distillation, first attempt: the input text is fed to both a pretrained teacher model and the student model, and a distillation loss compares their two predictions]

SLIDE 20

Knowledge distillation

[Figure: standard training and the first attempt, as above, plus knowledge distillation, second attempt: a distillation loss compares the teacher's and student's predictions, and a separate student loss compares the student's prediction with the target]

SLIDE 21

Knowledge distillation

[Figure: the previous setups, plus the full form of knowledge distillation: the pretrained teacher model and the student model both produce soft predictions; the distillation loss compares these soft predictions, while the student loss compares the student's prediction with the target]
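A minimal sketch of the distillation objective described above: the student is trained to match the teacher's temperature-softened distribution, plus a standard loss on the hard target. The logits, temperature, and weighting below are all invented for illustration:

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax with a temperature; higher temperatures give softer distributions."""
    exps = [math.exp(z / temperature) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, target_index,
                      temperature=2.0, alpha=0.5):
    """alpha * cross entropy(teacher soft predictions, student soft predictions)
    + (1 - alpha) * cross entropy(hard target, student prediction)."""
    soft_teacher = softmax(teacher_logits, temperature)
    soft_student = softmax(student_logits, temperature)
    distill = -sum(t * math.log(s) for t, s in zip(soft_teacher, soft_student))
    student = -math.log(softmax(student_logits)[target_index])
    return alpha * distill + (1 - alpha) * student

# Invented logits over a 3-word vocabulary; the hard target is word 0.
loss = distillation_loss([2.0, 0.5, 0.1], [3.0, 1.0, 0.2], target_index=0)
print(loss)
```

A student that matches the teacher's logits gets a lower loss than one that disagrees with both the teacher and the target.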

SLIDE 22

References

  • Bengio et al. (2003). A Neural Probabilistic Language Model. In JMLR.
  • Sutskever et al. (2014). Sequence to Sequence Learning with Neural Networks. In NIPS 2014.
  • Devlin et al. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In NAACL 2019.
  • Liu, Ott, Goyal, Du et al. (2019). RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv preprint 1907.11692.
  • Conneau, Khandelwal et al. (2019). Unsupervised Cross-lingual Representation Learning at Scale. arXiv preprint 1911.02116.
  • Lewis et al. (2019). BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension. arXiv preprint 1910.13461.
  • Raffel et al. (2019). Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. arXiv preprint 1910.10683.
SLIDE 23

Neural Machine Translation

SLIDE 24

Languages are different

  • I saw a movie
SLIDE 25

Translation is a Conditional Language Model

French source: mon sac à dos ruine ma vie !
English continuation: my backpack is ruining my _____

SLIDE 26

Sequence-to-Sequence: Naive RNN

[Figure: an encoder RNN reads x1, x2, x3, x4 into a single state θ, from which a decoder RNN generates y1, y2, y3, <end>, feeding each predicted word back in as the next input]

SLIDE 27

Sequence-to-Sequence: Attention

[Figure: the same encoder-decoder RNN, but the decoder now also attends over all encoder outputs at every step]

SLIDE 28

Attention as Alignment

[Figure: predicting the next target word of "I saw a ___": the attention weights place p = 0.9, 0.08, 0.02 over the source words, acting as a soft alignment]
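Attention weights like the ones above come from a softmax over alignment scores between the decoder state and each encoder output. A minimal dot-product attention sketch, with tiny made-up vectors:

```python
import math

def attention(query, keys, values):
    """Dot-product attention: softmax over query-key scores,
    then a weighted sum of the values."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    context = [sum(w * v[i] for w, v in zip(weights, values))
               for i in range(len(values[0]))]
    return weights, context

# Tiny invented vectors: one decoder query, three encoder outputs.
query = [1.0, 0.0]
encoder_outputs = [[2.0, 0.0], [0.0, 2.0], [-1.0, 1.0]]
weights, context = attention(query, encoder_outputs, encoder_outputs)
print([round(w, 3) for w in weights])  # most weight on the first source position
```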

SLIDE 29

Transformer NMT

figure from Vaswani et al. 2017

  • NMT is the original application of the transformer architecture
  • Encoder-decoder attention considers all encoder outputs
  • Decoder self-attention considers only previous outputs
  • Inference is autoregressive (left-to-right), but the decoder can be trained in parallel

SLIDE 30

Beam Search

[Figure: a beam search tree starting from "I", branching over "saw" / "watched" / "watch", then "the" / "a" / "movie", with hypotheses ending in _EOS]

SLIDE 31

Beam Search

  • Each beam element has different state (for the transformer decoder, this is the self-attention input for previous steps)
  • Total computation scales linearly with beam width
  • However, computation is highly parallelizable over the beam (efficient on GPU)
  • Inference is autoregressive (each step depends on the previous decision)
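The procedure can be sketched as follows. Here `step_fn` is a hypothetical stand-in for one decoder step returning next-word probabilities, and the table of distributions is invented:

```python
import math

def beam_search(step_fn, beam_width=2, max_len=5):
    """step_fn(prefix) -> dict mapping next token to its probability.
    Keeps the beam_width highest-scoring hypotheses by log probability."""
    beams = [((), 0.0)]  # (prefix, log probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for prefix, logp in beams:
            for token, p in step_fn(prefix).items():
                candidates.append((prefix + (token,), logp + math.log(p)))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for prefix, logp in candidates[:beam_width]:
            if prefix[-1] == "_EOS":
                finished.append((prefix, logp))  # hypothesis is complete
            else:
                beams.append((prefix, logp))
        if not beams:
            break
    return max(finished + beams, key=lambda c: c[1])

# Hypothetical decoder distributions for a tiny translation.
table = {
    (): {"I": 0.9, "We": 0.1},
    ("I",): {"saw": 0.6, "watched": 0.4},
    ("We",): {"saw": 0.5, "watched": 0.5},
    ("I", "saw"): {"a": 0.8, "the": 0.2},
    ("I", "watched"): {"the": 0.7, "a": 0.3},
    ("I", "saw", "a"): {"movie": 0.9, "_EOS": 0.1},
    ("I", "watched", "the"): {"movie": 0.9, "_EOS": 0.1},
    ("I", "saw", "a", "movie"): {"_EOS": 1.0},
    ("I", "watched", "the", "movie"): {"_EOS": 1.0},
}
best, logp = beam_search(lambda prefix: table[prefix])
print(best)  # ('I', 'saw', 'a', 'movie', '_EOS')
```

Note how each step expands every beam element before pruning back to the beam width, which is why total computation scales linearly with the width but parallelizes well across the beam.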

SLIDE 32

Non-Autoregressive Machine Translation (NAT)

figure from Gu et al. 2017

SLIDE 33

Variants of Non-Autoregressive Translation (NAT)

  • Original approach: Gu et al. 2018
    • Fertility prediction
    • Noisy parallel decoding
  • Conditional Masked Language Models: Ghazvininejad et al. 2019
    • Training with randomly masked target tokens
    • Inference: iteratively predict highest-probability target tokens
  • Levenshtein Transformer: Gu et al. 2019
    • Separate decoders for insertion, prediction, and deletion
    • Trained with reinforcement learning
SLIDE 34
  • Problem: how to model rare or unseen words?
  • Character-level models are general but too slow in practice
  • Subword models break up words based on frequency
  • One widely used algorithm is Byte-Pair Encoding (BPE), illustrated at right

Byte-Pair Encoding (BPE)

example from Sennrich et al. 2016

SLIDE 35
  • Problem: how to model rare or unseen words?
  • Character-level models are general but too slow in practice
  • Subword models break up words based on frequency
  • One widely used algorithm is Byte-Pair Encoding (BPE), illustrated at right

Byte-Pair Encoding (BPE)

token | frequency
l o w </w> | 5
l o w e r </w> | 2
n e w e s t </w> | 6
w i d e s t </w> | 3

learned merge operations: r </w> → r</w>, l o → lo, lo w → low, e r</w> → er</w>

example from Sennrich et al. 2016
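The merge-learning step of BPE can be sketched directly on the slide's toy vocabulary: each word is a sequence of symbols ending in </w>, and the most frequent adjacent pair is merged repeatedly. With these particular counts the first learned merges are the "es" family (merges such as l o → lo and lo w → low come later in the sequence):

```python
from collections import Counter

# The slide's toy vocabulary: symbol sequences with their frequencies.
vocab = {
    ("l", "o", "w", "</w>"): 5,
    ("l", "o", "w", "e", "r", "</w>"): 2,
    ("n", "e", "w", "e", "s", "t", "</w>"): 6,
    ("w", "i", "d", "e", "s", "t", "</w>"): 3,
}

def most_frequent_pair(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for word, freq in vocab.items():
        for pair in zip(word, word[1:]):
            pairs[pair] += freq
    return max(pairs, key=pairs.get)  # ties broken by first occurrence

def merge_pair(vocab, pair):
    """Replace every occurrence of the pair with a single merged symbol."""
    merged = {}
    for word, freq in vocab.items():
        out, i = [], 0
        while i < len(word):
            if i + 1 < len(word) and (word[i], word[i + 1]) == pair:
                out.append(word[i] + word[i + 1])
                i += 2
            else:
                out.append(word[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

merges = []
for _ in range(3):
    pair = most_frequent_pair(vocab)
    merges.append(pair)
    vocab = merge_pair(vocab, pair)
print(merges)  # [('e', 's'), ('es', 't'), ('est', '</w>')]
```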

SLIDE 36

Ongoing challenges for machine translation

  • Low-resource language directions
    • Massively multilingual models
    • Iterative back-translation
    • Data mining
  • Noisy input text (such as internet slang)
    • Character-aware translation
    • Adversarial training techniques
  • Catastrophic failures
    • Improved quality estimation, identification techniques
SLIDE 37

References

  • Bahdanau et al. (2014). Neural Machine Translation by Jointly Learning to Align and Translate.
  • Ghazvininejad et al. (2019). Mask-Predict: Parallel Decoding of Conditional Masked Language Models.
  • Gu et al. (2017). Non-Autoregressive Neural Machine Translation.
  • Gu et al. (2019). Levenshtein Transformer.
  • Ott et al. (2018). Scaling Neural Machine Translation.
  • Sennrich et al. (2016). Neural Machine Translation of Rare Words with Subword Units.
  • Sutskever et al. (2014). Sequence to Sequence Learning with Neural Networks.
  • Vaswani et al. (2017). Attention is All You Need.