
Lecture 13: Attention. Justin Johnson, October 23, 2019.

Lecture 13: Attention (Justin Johnson, October 23, 2019). Midterm: grades will be out in ~1 week. Please do not discuss midterm questions on Piazza. Someone left a water bottle in the exam room; post on Piazza if it is yours.


1. Sequence-to-Sequence with RNNs and Attention: Repeat: use s1 to compute a new context vector c2; use c2 to compute s2 and y2. Intuition: the context vector attends to the relevant part of the input sequence. "comiendo" = "eating", so maybe a2,1 = a2,4 = 0.05, a2,2 = 0.1, a2,3 = 0.8. [Diagram: encoder states h1-h4 over "we are eating bread"; alignment scores e2,1-e2,4 are softmaxed into attention weights a2,1-a2,4, which form context vector c2 feeding decoder state s2 to produce "comiendo" after "estamos".] Bahdanau et al, "Neural machine translation by jointly learning to align and translate", ICLR 2015.

2. Sequence-to-Sequence with RNNs and Attention: Use a different context vector at each timestep of the decoder. The input sequence is not bottlenecked through a single vector, and at each decoder timestep the context vector "looks at" different parts of the input sequence. [Diagram: full decode of "estamos comiendo pan [STOP]" from "we are eating bread", with context vectors c1-c4.] A minimal sketch of one such decoder step follows below.
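The following is a hypothetical sketch (PyTorch, not from the slides) of one decoder step of this attention mechanism; the alignment function f_att and the tensor layout are assumptions.

```python
import torch
import torch.nn.functional as F

def decoder_step_with_attention(s_prev, h, f_att):
    """One attention step: s_prev is the previous decoder state (D_s,),
    h holds the encoder hidden states (T, D_h), f_att scores one (state, h_i) pair."""
    # Alignment scores e_{t,i} = f_att(s_{t-1}, h_i), one per encoder timestep
    e = torch.stack([f_att(s_prev, h_i) for h_i in h])   # (T,)
    a = F.softmax(e, dim=0)                              # attention weights, sum to 1
    c = (a.unsqueeze(1) * h).sum(dim=0)                  # context vector (D_h,)
    return c, a
```

The context vector c would then be fed, together with the previous output token, into the decoder RNN to produce the next state and output word.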

3. Sequence-to-Sequence with RNNs and Attention: Visualize the attention weights a_{t,i}. Example: English-to-French translation. Input: "The agreement on the European Economic Area was signed in August 1992." Output: "L'accord sur la zone économique européenne a été signé en août 1992."

4. Sequence-to-Sequence with RNNs and Attention: [Same example.] Diagonal attention means the words correspond in order.

5. Sequence-to-Sequence with RNNs and Attention: [Same example.] Attention figures out different word orders ("European Economic Area" vs "la zone économique européenne").

6. Sequence-to-Sequence with RNNs and Attention: [Same example.] Verb conjugation ("was signed" vs "a été signé").

7. Sequence-to-Sequence with RNNs and Attention: The decoder doesn't use the fact that the h_i form an ordered sequence; it just treats them as an unordered set {h_i}. We can use a similar architecture given any set of input hidden vectors {h_i}!

8. Image Captioning with RNNs and Attention: Use a CNN to compute a grid of features for an image. [Diagram: CNN produces a 3x3 grid of features h_{1,1}-h_{3,3} and the initial decoder state s0.] Cat image is free to use under the Pixabay License. Xu et al, "Show, Attend, and Tell: Neural Image Caption Generation with Visual Attention", ICML 2015.

9. Image Captioning with RNNs and Attention: Alignment scores: e_{t,i,j} = f_att(s_{t-1}, h_{i,j}), one per grid position.

10. Image Captioning with RNNs and Attention: Attention weights: a_{t,:,:} = softmax(e_{t,:,:}), normalized over all grid positions.

11. Image Captioning with RNNs and Attention: Context vector: c_t = ∑_{i,j} a_{t,i,j} h_{i,j}.

12. Image Captioning with RNNs and Attention: Feed the context vector c1 and the start token y0 = [START] into the decoder to produce the first state s1 and the first output word y1 = "cat". (See the sketch below.)
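Below is a minimal sketch (assumed PyTorch code, not from the slides) of the grid-attention equations above: score every grid feature against the previous decoder state, softmax into weights, and sum into a context vector. The scoring function f_att is left abstract.

```python
import torch
import torch.nn.functional as F

def image_attention_step(s_prev, grid, f_att):
    """s_prev: (D_s,) previous decoder state; grid: (H, W, D) CNN feature grid."""
    H, W, D = grid.shape
    flat = grid.reshape(H * W, D)
    # Alignment scores e_{t,i,j} = f_att(s_{t-1}, h_{i,j}) for every grid position
    e = torch.stack([f_att(s_prev, h_ij) for h_ij in flat])   # (H*W,)
    a = F.softmax(e, dim=0)                                   # attention weights over the grid
    c = (a.unsqueeze(1) * flat).sum(dim=0)                    # context vector c_t (D,)
    return c, a.reshape(H, W)
```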

13. Image Captioning with RNNs and Attention: [Diagram: decoder state s1 produced from c1 and y0 = [START], emitting y1 = "cat".]

14. Image Captioning with RNNs and Attention: Compute new alignment scores e_{2,i,j} = f_att(s1, h_{i,j}) over the feature grid.

15. Image Captioning with RNNs and Attention: New attention weights a_{2,:,:} = softmax(e_{2,:,:}).

16. Image Captioning with RNNs and Attention: New context vector c2 = ∑_{i,j} a_{2,i,j} h_{i,j}.

17. Image Captioning with RNNs and Attention: Use c2 and y1 = "cat" to compute s2 and y2 = "sitting".

18. Image Captioning with RNNs and Attention: Each timestep of the decoder uses a different context vector that looks at different parts of the input image. [Diagram: full caption "cat sitting outside [STOP]".]

19. Image Captioning with RNNs and Attention: [Figure: example captions and attention maps from Xu et al., ICML 2015.]

20. Image Captioning with RNNs and Attention: [Figure: more examples from Xu et al., ICML 2015.]

21. Human Vision: Fovea. Light enters the eye; the retina detects light. [Figure: acuity graph, licensed under CC A-SA 3.0 Unported.]

22. Human Vision: Fovea. The fovea is a tiny region of the retina that can see with high acuity. [Figures: acuity graph (CC A-SA 3.0 Unported, no changes made); eye diagram (CC A-SA 3.0 Unported; added black arrow, green arc, and white circle).]

23. Human Vision: Saccades. The fovea is a tiny region of the retina that can see with high acuity; human eyes are constantly moving, so we don't notice. [Saccade video licensed under CC A-SA 4.0 International, no changes made.]

24. Image Captioning with RNNs and Attention: Attention weights at each timestep are kind of like the saccades of the human eye. Xu et al, "Show, Attend, and Tell", ICML 2015; saccade video licensed under CC A-SA 4.0 International (no changes made).

25. X, Attend, and Y
- "Show, attend, and tell" (Xu et al, ICML 2015): look at an image, attend to image regions, produce a caption.
- "Ask, attend, and answer" (Xu and Saenko, ECCV 2016) and "Show, ask, attend, and answer" (Kazemi and Elqursh, 2017): read the text of a question, attend to image regions, produce an answer.
- "Listen, attend, and spell" (Chan et al, ICASSP 2016): process raw audio, attend to audio regions while producing text.
- "Listen, attend, and walk" (Mei et al, AAAI 2016): process text, attend to text regions, output navigation commands.
- "Show, attend, and interact" (Qureshi et al, ICRA 2017): process an image, attend to image regions, output robot control commands.
- "Show, attend, and read" (Li et al, AAAI 2019): process an image, attend to image regions, output text.

26. Attention Layer
Inputs: query vector q (shape: D_Q); input vectors X (shape: N_X x D_X); similarity function f_att.
Computation:
- Similarities: e (shape: N_X), e_i = f_att(q, X_i)
- Attention weights: a = softmax(e) (shape: N_X)
- Output vector: y = ∑_i a_i X_i (shape: D_X)
[Left: recap of the image-captioning attention example.]

27. Attention Layer. Change: use dot product for similarity.
Inputs: query vector q (shape: D_Q); input vectors X (shape: N_X x D_Q); similarity function: dot product.
Computation:
- Similarities: e (shape: N_X), e_i = q · X_i
- Attention weights: a = softmax(e) (shape: N_X)
- Output vector: y = ∑_i a_i X_i (shape: D_X)

28. Attention Layer. Change: use scaled dot product for similarity.
Inputs: query vector q (shape: D_Q); input vectors X (shape: N_X x D_Q); similarity function: scaled dot product.
Computation:
- Similarities: e (shape: N_X), e_i = q · X_i / sqrt(D_Q)
- Attention weights: a = softmax(e) (shape: N_X)
- Output vector: y = ∑_i a_i X_i (shape: D_X)
(A short code sketch of this single-query case follows.)
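The snippet below is a minimal sketch (assumed PyTorch, not from the slides) of the single-query attention layer with scaled dot-product similarity just described.

```python
import math
import torch
import torch.nn.functional as F

def attention_single_query(q, X):
    """q: (D_Q,) query vector; X: (N_X, D_Q) input vectors."""
    e = X @ q / math.sqrt(q.shape[0])   # similarities e_i = q . X_i / sqrt(D_Q)
    a = F.softmax(e, dim=0)             # attention weights (N_X,)
    y = a @ X                           # output: attention-weighted sum of the inputs
    return y, a
```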

29. Attention Layer. Why scale? Large similarities will cause the softmax to saturate and give vanishing gradients. Recall a · b = |a| |b| cos(angle). Suppose a and b are constant vectors of dimension D (all entries equal to a scalar a); then |a| = (∑_i a²)^(1/2) = a sqrt(D), so dot products grow with the dimension. Computation (unchanged): e_i = q · X_i / sqrt(D_Q); a = softmax(e); y = ∑_i a_i X_i.

30. Attention Layer. Changes: use dot product for similarity; multiple query vectors.
Inputs: query vectors Q (shape: N_Q x D_Q); input vectors X (shape: N_X x D_Q).
Computation:
- Similarities: E = QX^T (shape: N_Q x N_X), E_{i,j} = Q_i · X_j / sqrt(D_Q)
- Attention weights: A = softmax(E, dim=1) (shape: N_Q x N_X)
- Output vectors: Y = AX (shape: N_Q x D_X), Y_i = ∑_j A_{i,j} X_j

31. Attention Layer. Changes: use dot product for similarity; multiple query vectors; separate key and value.
Inputs: query vectors Q (shape: N_Q x D_Q); input vectors X (shape: N_X x D_X); key matrix W_K (shape: D_X x D_Q); value matrix W_V (shape: D_X x D_V).
Computation:
- Key vectors: K = X W_K (shape: N_X x D_Q)
- Value vectors: V = X W_V (shape: N_X x D_V)
- Similarities: E = QK^T (shape: N_Q x N_X), E_{i,j} = Q_i · K_j / sqrt(D_Q)
- Attention weights: A = softmax(E, dim=1) (shape: N_Q x N_X)
- Output vectors: Y = AV (shape: N_Q x D_V), Y_i = ∑_j A_{i,j} V_j
(A code sketch of this full layer follows.)
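Here is a compact sketch (assumed PyTorch, mirroring the slide notation) of the full attention layer with separate keys and values.

```python
import math
import torch
import torch.nn.functional as F

def attention_layer(Q, X, W_K, W_V):
    """Q: (N_Q, D_Q) queries; X: (N_X, D_X) inputs; W_K: (D_X, D_Q); W_V: (D_X, D_V)."""
    K = X @ W_K                               # key vectors   (N_X, D_Q)
    V = X @ W_V                               # value vectors (N_X, D_V)
    E = Q @ K.t() / math.sqrt(Q.shape[1])     # similarities  (N_Q, N_X)
    A = F.softmax(E, dim=1)                   # one attention distribution per query
    Y = A @ V                                 # outputs       (N_Q, D_V)
    return Y, A
```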

32. Attention Layer: [Diagram: input vectors X1-X3 (left) and query vectors Q1-Q4 (bottom); inputs and computation as on the previous slide.]

33. Attention Layer: [Diagram: key vectors K1-K3 computed from the inputs.]

34. Attention Layer: [Diagram: grid of similarities E_{i,j} between every query and key.]

35. Attention Layer: [Diagram: attention weights A = softmax over the similarities for each query.]

36. Attention Layer: [Diagram: value vectors V1-V3 computed from the inputs.]

37. Attention Layer: [Diagram: output vectors Y1-Y4 formed by product and sum of attention weights with the values.]

38. Self-Attention Layer. One query per input vector.
Inputs: input vectors X (shape: N_X x D_X); key matrix W_K (shape: D_X x D_Q); value matrix W_V (shape: D_X x D_V); query matrix W_Q (shape: D_X x D_Q).
Computation:
- Query vectors: Q = X W_Q
- Key vectors: K = X W_K (shape: N_X x D_Q)
- Value vectors: V = X W_V (shape: N_X x D_V)
- Similarities: E = QK^T (shape: N_X x N_X), E_{i,j} = Q_i · K_j / sqrt(D_Q)
- Attention weights: A = softmax(E, dim=1) (shape: N_X x N_X)
- Output vectors: Y = AV (shape: N_X x D_V), Y_i = ∑_j A_{i,j} V_j
(A module-style sketch follows below.)
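A minimal module-style sketch of this self-attention layer (assumed PyTorch; names mirror the slide's W_Q, W_K, W_V):

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttention(nn.Module):
    def __init__(self, d_in, d_q, d_v):
        super().__init__()
        self.W_Q = nn.Linear(d_in, d_q, bias=False)   # query matrix
        self.W_K = nn.Linear(d_in, d_q, bias=False)   # key matrix
        self.W_V = nn.Linear(d_in, d_v, bias=False)   # value matrix

    def forward(self, X):                              # X: (N, d_in) set of input vectors
        Q, K, V = self.W_Q(X), self.W_K(X), self.W_V(X)
        E = Q @ K.t() / math.sqrt(Q.shape[1])          # (N, N) similarities
        A = F.softmax(E, dim=1)                        # attention weights per query
        return A @ V                                   # (N, d_v) outputs
```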

39. Self-Attention Layer: [Diagram: query vectors Q1-Q3 computed from inputs X1-X3.]

40. Self-Attention Layer: [Diagram: key vectors K1-K3 computed from the same inputs.]

41. Self-Attention Layer: [Diagram: grid of similarities E_{i,j} between every query and key.]

42. Self-Attention Layer: [Diagram: attention weights A = softmax(↑) of the similarities.]

43. Self-Attention Layer: [Diagram: value vectors V1-V3 computed from the inputs.]

44. Self-Attention Layer: [Diagram: output vectors Y1-Y3 formed by product(→) and sum(↑) of attention weights with the values.]

45. Self-Attention Layer: Consider permuting the input vectors (X3, X1, X2).

46. Self-Attention Layer: Queries and keys will be the same, but permuted.

47. Self-Attention Layer: Similarities will be the same, but permuted.

48. Self-Attention Layer: Attention weights will be the same, but permuted.

49. Self-Attention Layer: Values will be the same, but permuted.

50. Self-Attention Layer: Outputs will be the same, but permuted.

51. Self-Attention Layer: The self-attention layer is permutation equivariant: f(s(x)) = s(f(x)). A self-attention layer works on sets of vectors.

52. Self-Attention Layer: Self-attention doesn't "know" the order of the vectors it is processing!

53. Self-Attention Layer: Self-attention doesn't "know" the order of the vectors it is processing! To make processing position-aware, concatenate the input with a positional encoding E(1), E(2), E(3), ...; E can be a learned lookup table or a fixed function. (See the sketch below.)
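One common choice of fixed function is the sinusoidal encoding of Vaswani et al.; the sketch below (assumed PyTorch, with an even feature dimension) builds such a table, which could then be concatenated to (or added to) the inputs as described above.

```python
import math
import torch

def positional_encoding(n_positions, dim):
    """Return an (n_positions, dim) table of sin/cos position features; dim assumed even."""
    pos = torch.arange(n_positions, dtype=torch.float32).unsqueeze(1)   # (N, 1) positions
    i = torch.arange(0, dim, 2, dtype=torch.float32)                    # even feature indices
    freq = torch.exp(-math.log(10000.0) * i / dim)                      # (dim/2,) frequencies
    pe = torch.zeros(n_positions, dim)
    pe[:, 0::2] = torch.sin(pos * freq)
    pe[:, 1::2] = torch.cos(pos * freq)
    return pe

# Hypothetical usage: X_pos = torch.cat([X, positional_encoding(X.shape[0], d_pos)], dim=1)
```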

54. Masked Self-Attention Layer: Don't let vectors "look ahead" in the sequence: set the corresponding similarities E_{i,j} to -∞ so their attention weights become 0. Computation otherwise as before: Q = X W_Q; K = X W_K; V = X W_V; E = QK^T / sqrt(D_Q) with future positions masked to -∞; A = softmax(E, dim=1); Y = AV.

55. Masked Self-Attention Layer: Don't let vectors "look ahead" in the sequence. Used for language modeling (predict the next word). [Diagram: inputs "[START] Big cat" predict outputs "Big cat [END]".] (A sketch of the masking follows.)
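A small sketch (assumed PyTorch) of the masking step: similarities where the key position comes after the query position are set to -∞ before the softmax, so those attention weights are exactly zero.

```python
import math
import torch
import torch.nn.functional as F

def masked_self_attention(Q, K, V):
    """Q, K: (N, D_Q); V: (N, D_V). Causal mask: position i attends only to j <= i."""
    N = Q.shape[0]
    E = Q @ K.t() / math.sqrt(Q.shape[1])                    # (N, N) similarities
    future = torch.triu(torch.ones(N, N, dtype=torch.bool), diagonal=1)
    E = E.masked_fill(future, float('-inf'))                 # block "looking ahead"
    A = F.softmax(E, dim=1)                                  # masked positions get weight 0
    return A @ V
```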

56. Multihead Self-Attention Layer: Use H independent "attention heads" in parallel: split the inputs, run a self-attention layer per head, and concatenate the outputs. Hyperparameters: query dimension D_Q; number of heads H. (A sketch follows below.)
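A rough sketch (assumed PyTorch) of multi-head self-attention: project, split the feature dimension into H heads, attend per head, and concatenate. The final output projection is standard in Vaswani et al. but is an addition beyond what the slide shows.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadSelfAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        super().__init__()
        assert d_model % num_heads == 0
        self.h, self.d_head = num_heads, d_model // num_heads
        self.W_Q = nn.Linear(d_model, d_model, bias=False)
        self.W_K = nn.Linear(d_model, d_model, bias=False)
        self.W_V = nn.Linear(d_model, d_model, bias=False)
        self.proj = nn.Linear(d_model, d_model, bias=False)   # output projection (assumption)

    def forward(self, X):                                      # X: (N, d_model)
        N = X.shape[0]
        # Project, then split the feature dimension into H heads
        Q = self.W_Q(X).view(N, self.h, self.d_head).transpose(0, 1)   # (H, N, d_head)
        K = self.W_K(X).view(N, self.h, self.d_head).transpose(0, 1)
        V = self.W_V(X).view(N, self.h, self.d_head).transpose(0, 1)
        E = Q @ K.transpose(1, 2) / math.sqrt(self.d_head)             # (H, N, N) similarities
        A = F.softmax(E, dim=2)                                        # weights per head
        Y = (A @ V).transpose(0, 1).reshape(N, self.h * self.d_head)   # concatenate heads
        return self.proj(Y)
```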

57. Example: CNN with Self-Attention: Input image → CNN → features of shape C x H x W. Cat image is free to use under the Pixabay License. Zhang et al, "Self-Attention Generative Adversarial Networks", ICML 2018.

58. Example: CNN with Self-Attention: Compute queries, keys, and values from the features with 1x1 convolutions, each of shape C' x H x W.

59. Example: CNN with Self-Attention: Attention weights: transpose the queries, multiply with the keys, and take a softmax, giving an (H x W) x (H x W) matrix.

60. Example: CNN with Self-Attention: Multiply the attention weights with the values to get an output of shape C' x H x W.

61. Example: CNN with Self-Attention: Apply another 1x1 convolution to map the output back to C x H x W.

62. Example: CNN with Self-Attention: Add a residual connection from the input features; together this forms the self-attention module. Zhang et al, "Self-Attention Generative Adversarial Networks", ICML 2018. (A sketch of the module follows.)
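Below is a hedged sketch (assumed PyTorch, after Zhang et al.) of this self-attention module for CNN features: 1x1 convolutions for queries, keys, and values, softmax attention over spatial positions, a final 1x1 convolution, and a residual connection.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CNNSelfAttention(nn.Module):
    def __init__(self, c, c_mid):
        super().__init__()
        self.to_q = nn.Conv2d(c, c_mid, kernel_size=1)
        self.to_k = nn.Conv2d(c, c_mid, kernel_size=1)
        self.to_v = nn.Conv2d(c, c_mid, kernel_size=1)
        self.proj = nn.Conv2d(c_mid, c, kernel_size=1)

    def forward(self, x):                                  # x: (B, C, H, W) features
        B, C, H, W = x.shape
        q = self.to_q(x).flatten(2)                        # (B, C', H*W)
        k = self.to_k(x).flatten(2)
        v = self.to_v(x).flatten(2)
        attn = F.softmax(q.transpose(1, 2) @ k, dim=2)     # (B, H*W, H*W) attention weights
        out = v @ attn.transpose(1, 2)                     # attend over spatial positions
        out = self.proj(out.view(B, -1, H, W))             # back to (B, C, H, W)
        return x + out                                     # residual connection
```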

63. Three Ways of Processing Sequences. Recurrent Neural Network: works on ordered sequences. (+) Good at long sequences: after one RNN layer, h_T "sees" the whole sequence. (-) Not parallelizable: need to compute hidden states sequentially.

64. Three Ways of Processing Sequences. 1D Convolution: works on multidimensional grids. (-) Bad at long sequences: need to stack many conv layers for outputs to "see" the whole sequence. (+) Highly parallel: each output can be computed in parallel.

65. Three Ways of Processing Sequences. Self-Attention: works on sets of vectors. (+) Good at long sequences: after one self-attention layer, each output "sees" all inputs! (+) Highly parallel: each output can be computed in parallel. (-) Very memory intensive.

66. Three Ways of Processing Sequences: "Attention is all you need." Vaswani et al, NeurIPS 2017.

67. The Transformer: [Diagram: input vectors x1-x4.] Vaswani et al, "Attention is all you need", NeurIPS 2017.

68. The Transformer: Self-attention: all vectors interact with each other.

69. The Transformer: Residual connection around the self-attention.

70. The Transformer: Add Layer Normalization after the residual connection. Recall Layer Normalization (Ba et al, 2016): given h_1, ..., h_N (shape: D), with learned scale γ (shape: D) and shift β (shape: D):
μ_i = (1/D) ∑_j h_{i,j} (scalar)
σ_i = (∑_j (h_{i,j} - μ_i)²)^(1/2) (scalar)
z_i = (h_i - μ_i) / σ_i
y_i = γ * z_i + β
(A small code sketch follows.)
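A small sketch (assumed PyTorch) of the layer normalization recalled above; the per-vector standard deviation here includes the usual 1/D factor and a small eps for numerical stability.

```python
import torch

def layer_norm(h, gamma, beta, eps=1e-5):
    """h: (N, D) vectors; gamma, beta: (D,) learned scale and shift."""
    mu = h.mean(dim=1, keepdim=True)                      # per-vector mean
    sigma = h.std(dim=1, unbiased=False, keepdim=True)    # per-vector std
    z = (h - mu) / (sigma + eps)                          # normalize each vector
    return gamma * z + beta                               # scale and shift
```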

71. The Transformer: Apply an MLP independently on each vector.

72. The Transformer: Residual connection around the MLP.

73. The Transformer: Layer Normalization after the second residual connection, producing outputs y1-y4.

74. The Transformer. Transformer block: input is a set of vectors x, output is a set of vectors y. Self-attention is the only interaction between vectors! Layer norm and MLP work independently per vector. Highly scalable, highly parallelizable. (See the block sketch below.)
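A sketch (assumed PyTorch, with the post-norm ordering drawn on the slides) of the Transformer block just summarized; MultiHeadSelfAttention refers to the earlier sketch, and the hidden width of the MLP is an assumption.

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    def __init__(self, d_model, num_heads, d_hidden):
        super().__init__()
        self.attn = MultiHeadSelfAttention(d_model, num_heads)  # sketched earlier
        self.norm1 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(                               # applied per vector
            nn.Linear(d_model, d_hidden),
            nn.ReLU(),
            nn.Linear(d_hidden, d_model),
        )
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):                      # x: (N, d_model) set of vectors
        x = self.norm1(x + self.attn(x))       # self-attention is the only interaction
        x = self.norm2(x + self.mlp(x))        # MLP and LayerNorm act per vector
        return x
```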

75. The Transformer: A Transformer is a sequence of transformer blocks. Vaswani et al: 12 blocks, D_Q = 512, 6 heads.
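For illustration only, stacking the sketched blocks with the slide's configuration might look like the snippet below (the MLP hidden width of 2048 is an assumption, following Vaswani et al.).

```python
model = nn.Sequential(*[TransformerBlock(d_model=512, num_heads=6, d_hidden=2048)
                        for _ in range(12)])
```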

76. The Transformer: Transfer Learning. "ImageNet moment for natural language processing."
Pretraining: download a lot of text from the internet; train a giant Transformer model for language modeling.
Finetuning: fine-tune the Transformer on your own NLP task.
Devlin et al, "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding", EMNLP 2018.

77. Scaling up Transformers
Model | Layers | Width | Heads | Params | Data | Training
Transformer-Base | 12 | 512 | 8 | 65M | - | 8x P100 (12 hours)
Transformer-Large | 12 | 1024 | 16 | 213M | - | 8x P100 (3.5 days)
Vaswani et al, "Attention is all you need", NeurIPS 2017.

78. Scaling up Transformers (adding to the table above):
BERT-Base | 12 | 768 | 12 | 110M | 13 GB | -
BERT-Large | 24 | 1024 | 16 | 340M | 13 GB | -
Devlin et al, "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding", EMNLP 2018.

79. Scaling up Transformers (adding to the table above):
XLNet-Large | 24 | 1024 | 16 | ~340M | 126 GB | 512x TPU-v3 (2.5 days)
RoBERTa | 24 | 1024 | 16 | 355M | 160 GB | 1024x V100 GPU (1 day)
Yang et al, "XLNet: Generalized Autoregressive Pretraining for Language Understanding", 2019; Liu et al, "RoBERTa: A Robustly Optimized BERT Pretraining Approach", 2019.

80. Scaling up Transformers (adding to the table above):
GPT-2 | 12 | 768 | ? | 117M | 40 GB | -
GPT-2 | 24 | 1024 | ? | 345M | 40 GB | -
GPT-2 | 36 | 1280 | ? | 762M | 40 GB | -
GPT-2 | 48 | 1600 | ? | 1.5B | 40 GB | -
Radford et al, "Language models are unsupervised multitask learners", 2019.

81. Scaling up Transformers (adding to the table above):
Megatron-LM | 40 | 1536 | 16 | 1.2B | 174 GB | 64x V100 GPU
Megatron-LM | 54 | 1920 | 20 | 2.5B | 174 GB | 128x V100 GPU
Megatron-LM | 64 | 2304 | 24 | 4.2B | 174 GB | 256x V100 GPU (10 days)
Megatron-LM | 72 | 3072 | 32 | 8.3B | 174 GB | 512x V100 GPU (9 days)
Shoeybi et al, "Megatron-LM: Training Multi-Billion Parameter Language Models using Model Parallelism", 2019.
