Lecture 13: Attention

Lecture 13: Attention — Justin Johnson, October 14, 2020

Lecture 13: Attention — Justin Johnson, October 14, 2020. Reminder: Assignment 4 is released: https://web.eecs.umich.edu/~justincj/teaching/eecs498/FA2020/assignment4.html — due Friday October 30, 11:59pm EDT.


1. Sequence-to-Sequence with RNNs and Attention — Use a different context vector in each timestep of the decoder: the input sequence is not bottlenecked through a single vector, and at each decoder timestep the context vector "looks at" different parts of the input sequence. (Diagram: encoder states h_1..h_4 over "we are eating bread", decoder states s_0..s_4 with context vectors c_1..c_4 and previous words y_0..y_3, producing "estamos comiendo pan [STOP]".) Bahdanau et al, "Neural machine translation by jointly learning to align and translate", ICLR 2015
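A minimal PyTorch sketch of one decoder timestep of this attention mechanism. The specific form of the alignment function f_att (a one-hidden-layer MLP) follows Bahdanau et al, but the names W_s, W_h, v are illustrative, not from the slides:

```python
import torch
import torch.nn.functional as F

def attention_step(s_prev, h, W_s, W_h, v):
    """One decoder timestep of seq2seq attention (illustrative sketch).

    s_prev: previous decoder state s_{t-1}, shape (D_s,)
    h:      encoder hidden states h_1..h_T, shape (T, D_h)
    W_s, W_h, v: parameters of the alignment MLP f_att (assumed form)
    """
    # Alignment scores e_{t,i} = f_att(s_{t-1}, h_i)
    e = torch.tanh(h @ W_h + s_prev @ W_s) @ v        # shape (T,)
    a = F.softmax(e, dim=0)                           # attention weights, sum to 1
    c = (a.unsqueeze(1) * h).sum(dim=0)               # context vector c_t, shape (D_h,)
    return c, a
```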

2. Sequence-to-Sequence with RNNs and Attention — Visualize the attention weights a_{t,i}. Example: English-to-French translation. Input: "The agreement on the European Economic Area was signed in August 1992." Output: "L'accord sur la zone économique européenne a été signé en août 1992." Bahdanau et al, "Neural machine translation by jointly learning to align and translate", ICLR 2015

3. Sequence-to-Sequence with RNNs and Attention — Same example: diagonal attention means the words correspond in order. Bahdanau et al, ICLR 2015

4. Sequence-to-Sequence with RNNs and Attention — Same example: attention also figures out different word orders (e.g. "European Economic Area" ↔ "zone économique européenne"). Bahdanau et al, ICLR 2015

5. Sequence-to-Sequence with RNNs and Attention — Same example: attention also handles verb conjugation ("was signed" ↔ "a été signé"). Bahdanau et al, ICLR 2015

6. Sequence-to-Sequence with RNNs and Attention — The decoder doesn't use the fact that the h_i form an ordered sequence; it just treats them as an unordered set {h_i}. A similar architecture can be used given any set of input hidden vectors {h_i}! Bahdanau et al, ICLR 2015

7. Image Captioning with RNNs and Attention — Use a CNN to compute a grid of features h_{i,j} for an image. (Cat image is free to use under the Pixabay License.) Xu et al, "Show, Attend, and Tell: Neural Image Caption Generation with Visual Attention", ICML 2015

8.–17. Image Captioning with RNNs and Attention — The same attention mechanism, applied to the feature grid at each decoder timestep:
- Alignment scores: e_{t,i,j} = f_att(s_{t-1}, h_{i,j})
- Attention weights: a_{t,:,:} = softmax(e_{t,:,:})
- Context vector: c_t = ∑_{i,j} a_{t,i,j} h_{i,j}
The decoder combines c_t with the previous word y_{t-1} to compute the next state s_t and output word y_t. Slides 8–17 step through this process on the feature grid h_{1,1}..h_{3,3} for successive timesteps, generating the caption "cat sitting outside [STOP]". Each timestep of the decoder uses a different context vector that looks at different parts of the input image. Xu et al, "Show, Attend, and Tell: Neural Image Caption Generation with Visual Attention", ICML 2015
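A minimal sketch of one timestep of this spatial attention over the CNN feature grid; the callable f_att and the tensor layouts are illustrative assumptions, not the paper's exact implementation:

```python
import torch
import torch.nn.functional as F

def spatial_attention(s_prev, feats, f_att):
    """Context vector over a CNN feature grid (illustrative sketch).

    s_prev: decoder state s_{t-1}, shape (D_s,)
    feats:  CNN feature grid h_{i,j}, shape (H, W, D_h)
    f_att:  callable scoring a (state, feature) pair -> scalar tensor
    """
    H, W, D_h = feats.shape
    flat = feats.reshape(H * W, D_h)                       # treat the grid as a set of vectors
    e = torch.stack([f_att(s_prev, hij) for hij in flat])  # e_{t,i,j}, shape (H*W,)
    a = F.softmax(e, dim=0)                                # a_{t,:,:} = softmax over all positions
    c = (a.unsqueeze(1) * flat).sum(dim=0)                 # c_t = sum_{i,j} a_{t,i,j} h_{i,j}
    return c, a.reshape(H, W)
```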

18. Image Captioning with RNNs and Attention — (Figure: example results from the paper.) Xu et al, "Show, Attend, and Tell: Neural Image Caption Generation with Visual Attention", ICML 2015

19. Image Captioning with RNNs and Attention — (Figure: more example results from the paper.) Xu et al, ICML 2015

20. Human Vision: Fovea — Light enters the eye; the retina detects light. (Acuity graph is licensed under CC BY-SA 3.0 Unported.)

21. Human Vision: Fovea — The fovea is a tiny region of the retina that can see with high acuity. (Acuity graph licensed under CC BY-SA 3.0 Unported, no changes made; eye image licensed under CC BY-SA 3.0 Unported, with added black arrow, green arc, and white circle.)

22. Human Vision: Saccades — The fovea is a tiny region of the retina that can see with high acuity; human eyes are constantly moving (saccades), so we don't notice. (Saccade video licensed under CC BY-SA 4.0 International, no changes made; acuity graph licensed under CC BY-SA 3.0 Unported, no changes made.)

23. Image Captioning with RNNs and Attention — The attention weights at each timestep are somewhat like the saccades of the human eye. Xu et al, ICML 2015. (Saccade video licensed under CC BY-SA 4.0 International, no changes made.)

24. X, Attend, and Y
- "Show, attend, and tell" (Xu et al, ICML 2015): look at an image, attend to image regions, produce a caption
- "Ask, attend, and answer" (Xu and Saenko, ECCV 2016); "Show, ask, attend, and answer" (Kazemi and Elqursh, 2017): read the text of a question, attend to image regions, produce an answer
- "Listen, attend, and spell" (Chan et al, ICASSP 2016): process raw audio, attend to audio regions while producing text
- "Listen, attend, and walk" (Mei et al, AAAI 2016): process text, attend to text regions, output navigation commands
- "Show, attend, and interact" (Qureshi et al, ICRA 2017): process an image, attend to image regions, output robot control commands
- "Show, attend, and read" (Li et al, AAAI 2019): process an image, attend to image regions, output text

25. Attention Layer — Generalize the image-captioning attention (e_{t,i,j} = f_att(s_{t-1}, h_{i,j}), a_{t,:,:} = softmax(e_{t,:,:}), c_t = ∑_{i,j} a_{t,i,j} h_{i,j}) into a standalone layer.
Inputs: query vector q (shape D_Q); input vectors X (shape N_X × D_X); similarity function f_att.
Computation: similarities e (shape N_X), e_i = f_att(q, X_i); attention weights a = softmax(e) (shape N_X); output vector y = ∑_i a_i X_i (shape D_X).

26. Attention Layer — Change: use dot product for similarity.
Inputs: query vector q (shape D_Q); input vectors X (shape N_X × D_Q).
Computation: similarities e (shape N_X), e_i = q · X_i; attention weights a = softmax(e) (shape N_X); output vector y = ∑_i a_i X_i (shape D_X).

27. Attention Layer — Change: use scaled dot product for similarity.
Inputs: query vector q (shape D_Q); input vectors X (shape N_X × D_Q).
Computation: similarities e (shape N_X), e_i = q · X_i / √D_Q; attention weights a = softmax(e) (shape N_X); output vector y = ∑_i a_i X_i (shape D_X).

28. Attention Layer — Why scale? Large similarities will cause the softmax to saturate and give vanishing gradients. Recall a · b = |a||b| cos(angle). Suppose a and b are constant vectors of dimension D (every entry equal to some constant a); then |a| = (∑_i a²)^{1/2} = a√D, so the magnitude of the similarities grows with the dimension unless we rescale. Hence the change: e_i = q · X_i / √D_Q (scaled dot product for similarity).

29. Attention Layer — Changes: use dot product for similarity; allow multiple query vectors.
Inputs: query vectors Q (shape N_Q × D_Q); input vectors X (shape N_X × D_Q).
Computation: similarities E = QXᵀ / √D_Q (shape N_Q × N_X), E_{i,j} = (Q_i · X_j) / √D_Q; attention weights A = softmax(E, dim=1) (shape N_Q × N_X); output vectors Y = AX (shape N_Q × D_X), Y_i = ∑_j A_{i,j} X_j.

30. Attention Layer — Changes: use dot product for similarity; multiple query vectors; separate key and value transformations of the inputs.
Inputs: query vectors Q (shape N_Q × D_Q); input vectors X (shape N_X × D_X); key matrix W_K (shape D_X × D_Q); value matrix W_V (shape D_X × D_V).
Computation: key vectors K = XW_K (shape N_X × D_Q); value vectors V = XW_V (shape N_X × D_V); similarities E = QKᵀ / √D_Q (shape N_Q × N_X), E_{i,j} = (Q_i · K_j) / √D_Q; attention weights A = softmax(E, dim=1) (shape N_Q × N_X); output vectors Y = AV (shape N_Q × D_V), Y_i = ∑_j A_{i,j} V_j.
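A minimal PyTorch sketch of this attention layer, following the shapes on the slide (the function name is illustrative):

```python
import torch
import torch.nn.functional as F

def attention_layer(Q, X, W_K, W_V):
    """General attention layer (minimal sketch).

    Q:   query vectors, shape (N_Q, D_Q)
    X:   input vectors, shape (N_X, D_X)
    W_K: key matrix,    shape (D_X, D_Q)
    W_V: value matrix,  shape (D_X, D_V)
    """
    K = X @ W_K                              # key vectors,   (N_X, D_Q)
    V = X @ W_V                              # value vectors, (N_X, D_V)
    E = Q @ K.t() / (Q.shape[1] ** 0.5)      # similarities E = QK^T / sqrt(D_Q), (N_Q, N_X)
    A = F.softmax(E, dim=1)                  # attention weights; each row sums to 1
    Y = A @ V                                # output vectors, (N_Q, D_V)
    return Y, A
```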

31.–36. Attention Layer — These slides step through the computation graphically for N_X = 3 inputs and N_Q = 4 queries: the inputs X_1..X_3 produce keys K_1..K_3 and values V_1..V_3; each query Q_1..Q_4 is compared with every key to form the similarity matrix E; a softmax over the keys for each query gives the attention weights A; and each output Y_1..Y_4 is the attention-weighted sum of the values (Product, Sum).
Inputs: query vectors Q (shape N_Q × D_Q); input vectors X (shape N_X × D_X); key matrix W_K (shape D_X × D_Q); value matrix W_V (shape D_X × D_V).
Computation: K = XW_K (shape N_X × D_Q); V = XW_V (shape N_X × D_V); E = QKᵀ / √D_Q (shape N_Q × N_X), E_{i,j} = (Q_i · K_j) / √D_Q; A = softmax(E, dim=1) (shape N_Q × N_X); Y = AV (shape N_Q × D_V), Y_i = ∑_j A_{i,j} V_j.

37. Self-Attention Layer — One query per input vector. Same inputs and computation as the attention layer (query vectors Q, input vectors X, key matrix W_K, value matrix W_V; K = XW_K, V = XW_V, E = QKᵀ/√D_Q, A = softmax(E, dim=1), Y = AV), but now each query corresponds to one of the inputs X_1..X_3.

38. Self-Attention Layer — One query per input vector: the queries are now computed from the inputs themselves.
Inputs: input vectors X (shape N_X × D_X); key matrix W_K (shape D_X × D_Q); value matrix W_V (shape D_X × D_V); query matrix W_Q (shape D_X × D_Q).
Computation: query vectors Q = XW_Q; key vectors K = XW_K (shape N_X × D_Q); value vectors V = XW_V (shape N_X × D_V); similarities E = QKᵀ / √D_Q (shape N_X × N_X), E_{i,j} = (Q_i · K_j) / √D_Q; attention weights A = softmax(E, dim=1) (shape N_X × N_X); output vectors Y = AV (shape N_X × D_V), Y_i = ∑_j A_{i,j} V_j.

39.–43. Self-Attention Layer — These slides step through the same computation graphically for three inputs X_1..X_3: compute queries Q_i, keys K_i, and values V_i from the inputs; form the similarity matrix E; apply a softmax over each column (Softmax(↑)) to get the attention weights A; then each output Y_i is the attention-weighted sum of the values (Product(→), Sum(↑)).
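A minimal PyTorch sketch of this self-attention layer as a module (class and argument names are illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttention(nn.Module):
    """Single-head self-attention, following the slide's shapes (minimal sketch)."""
    def __init__(self, d_x, d_q, d_v):
        super().__init__()
        self.W_Q = nn.Linear(d_x, d_q, bias=False)   # query matrix W_Q
        self.W_K = nn.Linear(d_x, d_q, bias=False)   # key matrix W_K
        self.W_V = nn.Linear(d_x, d_v, bias=False)   # value matrix W_V
        self.scale = d_q ** 0.5

    def forward(self, X):                            # X: (N_X, D_X)
        Q, K, V = self.W_Q(X), self.W_K(X), self.W_V(X)
        E = Q @ K.t() / self.scale                   # similarities, (N_X, N_X)
        A = F.softmax(E, dim=1)                      # one attention distribution per query
        return A @ V                                 # outputs, (N_X, D_V)
```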

44.–50. Self-Attention Layer — Consider permuting the input vectors (X_3, X_1, X_2 instead of X_1, X_2, X_3): the queries and keys will be the same, but permuted; the similarities will be the same, but permuted; the attention weights will be the same, but permuted; the values will be the same, but permuted; and therefore the outputs will be the same, but permuted. The self-attention layer is permutation equivariant: f(s(x)) = s(f(x)). A self-attention layer works on sets of vectors.
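A quick numerical check of permutation equivariance, assuming the SelfAttention sketch given after slide 43 (sizes are arbitrary):

```python
import torch

layer = SelfAttention(d_x=8, d_q=4, d_v=8)     # sketch defined earlier
X = torch.randn(5, 8)
perm = torch.randperm(5)
out1 = layer(X[perm])                          # permute inputs, then apply self-attention
out2 = layer(X)[perm]                          # apply self-attention, then permute outputs
print(torch.allclose(out1, out2, atol=1e-6))   # expect: True, i.e. f(s(x)) == s(f(x))
```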

51. Self-Attention Layer — Self-attention doesn't "know" the order of the vectors it is processing!

52. Self-Attention Layer — Self-attention doesn't "know" the order of the vectors it is processing. To make processing position-aware, concatenate the input with a positional encoding E(i) for each position i. E can be a learned lookup table, or a fixed function.
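A sketch of one common fixed positional encoding, the sinusoidal function from Vaswani et al (the slide leaves the choice open between this and a learned lookup table; the function name is illustrative):

```python
import math
import torch

def sinusoidal_positions(n, d):
    """Fixed sinusoidal positional encodings E(1)..E(n), each of dimension d.
    One common fixed-function choice; assumes d is even."""
    pos = torch.arange(n, dtype=torch.float32).unsqueeze(1)    # (n, 1) positions
    idx = torch.arange(0, d, 2, dtype=torch.float32)           # even feature indices
    freq = torch.exp(-math.log(10000.0) * idx / d)             # 1 / 10000^(idx/d)
    pe = torch.zeros(n, d)
    pe[:, 0::2] = torch.sin(pos * freq)
    pe[:, 1::2] = torch.cos(pos * freq)
    return pe

# To make self-attention position-aware, concatenate E(i) onto each input vector:
# X_pos = torch.cat([X, sinusoidal_positions(X.shape[0], d_pos)], dim=1)
```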

53. Masked Self-Attention Layer — Don't let vectors "look ahead" in the sequence: set E_{i,j} = −∞ for positions j that come after i, so the corresponding attention weights become 0 after the softmax and each output Y_i depends only on inputs up to X_i.

54. Masked Self-Attention Layer — Don't let vectors "look ahead" in the sequence. Used for language modeling (predict the next word): given inputs "[START] Big cat", the outputs predict "Big cat [END]".
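A minimal sketch of the masking step: fill the "future" entries of the similarity matrix with −∞ before the softmax (function name is illustrative):

```python
import torch
import torch.nn.functional as F

def masked_self_attention(Q, K, V):
    """Masked (causal) self-attention: position i may only attend to positions j <= i."""
    N = Q.shape[0]
    E = Q @ K.t() / (Q.shape[1] ** 0.5)                    # similarities, (N, N)
    mask = torch.triu(torch.ones(N, N, dtype=torch.bool), diagonal=1)
    E = E.masked_fill(mask, float('-inf'))                 # -inf where j > i
    A = F.softmax(E, dim=1)                                # masked entries become 0
    return A @ V
```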

55. Multihead Self-Attention Layer — Use H independent "attention heads" in parallel: split the input vectors into H chunks, run a separate self-attention layer on each chunk, then concatenate the outputs. Hyperparameters: query dimension D_Q; number of heads H.
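A minimal sketch of multihead self-attention. The fused query/key/value projection and the final output projection are common implementation choices (as in Vaswani et al), not spelled out on the slide:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadSelfAttention(nn.Module):
    """H parallel attention heads whose outputs are concatenated (minimal sketch)."""
    def __init__(self, d_model, num_heads):
        super().__init__()
        assert d_model % num_heads == 0
        self.h, self.d_head = num_heads, d_model // num_heads
        self.qkv = nn.Linear(d_model, 3 * d_model, bias=False)  # W_Q, W_K, W_V fused
        self.proj = nn.Linear(d_model, d_model)                  # mix the concatenated heads

    def forward(self, X):                                        # X: (N, d_model)
        N, D = X.shape
        q, k, v = self.qkv(X).chunk(3, dim=-1)
        # Split the feature dimension into H heads of size d_head: (H, N, d_head)
        q, k, v = (t.reshape(N, self.h, self.d_head).transpose(0, 1) for t in (q, k, v))
        E = q @ k.transpose(1, 2) / (self.d_head ** 0.5)         # (H, N, N) similarities
        A = F.softmax(E, dim=-1)                                 # per-head attention weights
        Y = A @ v                                                # (H, N, d_head)
        Y = Y.transpose(0, 1).reshape(N, D)                      # concatenate heads
        return self.proj(Y)
```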

56. Example: CNN with Self-Attention — Input image → CNN → features of shape C × H × W. (Cat image is free to use under the Pixabay License.) Zhang et al, "Self-Attention Generative Adversarial Networks", ICML 2018

57. Example: CNN with Self-Attention — From the C × H × W features, compute queries (C' × H × W), keys (C' × H × W), and values (C' × H × W), each with a 1×1 convolution. Zhang et al, ICML 2018

58.–61. Example: CNN with Self-Attention — Multiply the (transposed) queries with the keys and take a softmax to get attention weights of shape (H × W) × (H × W); multiply the attention weights with the values to get a C' × H × W output; apply another 1×1 convolution to get back to C × H × W; and add a residual connection from the input features. Together these operations form a self-attention module that can be inserted into a CNN. Zhang et al, "Self-Attention Generative Adversarial Networks", ICML 2018
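A minimal sketch of such a self-attention module over CNN feature maps, in the spirit of the slide's diagram; channel sizes and layer names are illustrative, and extras used in SAGAN itself (e.g. a learned residual gate) are omitted:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CNNSelfAttention(nn.Module):
    """Self-attention over a C x H x W feature map (illustrative sketch)."""
    def __init__(self, c, c_reduced):
        super().__init__()
        self.query = nn.Conv2d(c, c_reduced, kernel_size=1)   # 1x1 conv -> queries (C' x H x W)
        self.key   = nn.Conv2d(c, c_reduced, kernel_size=1)   # 1x1 conv -> keys
        self.value = nn.Conv2d(c, c_reduced, kernel_size=1)   # 1x1 conv -> values
        self.out   = nn.Conv2d(c_reduced, c, kernel_size=1)   # 1x1 conv back to C channels

    def forward(self, x):                                      # x: (B, C, H, W)
        B, C, H, W = x.shape
        q = self.query(x).reshape(B, -1, H * W)                # (B, C', H*W)
        k = self.key(x).reshape(B, -1, H * W)
        v = self.value(x).reshape(B, -1, H * W)
        attn = F.softmax(q.transpose(1, 2) @ k, dim=-1)        # (B, H*W, H*W) attention weights
        y = (v @ attn.transpose(1, 2)).reshape(B, -1, H, W)    # weighted sum of values
        return x + self.out(y)                                  # residual connection
```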

62. Three Ways of Processing Sequences — Recurrent Neural Network: works on ordered sequences. (+) Good at long sequences: after one RNN layer, h_T "sees" the whole sequence. (−) Not parallelizable: need to compute hidden states sequentially.

63. Three Ways of Processing Sequences — 1D Convolution: works on multidimensional grids. (−) Bad at long sequences: need to stack many conv layers for outputs to "see" the whole sequence. (+) Highly parallel: each output can be computed in parallel.

64. Three Ways of Processing Sequences — Self-Attention: works on sets of vectors. (+) Good at long sequences: after one self-attention layer, each output "sees" all inputs! (+) Highly parallel: each output can be computed in parallel. (−) Very memory intensive.

65. Three Ways of Processing Sequences — Same comparison, with the punchline: "Attention is all you need" — Vaswani et al, NeurIPS 2017.

66. The Transformer — Input: a set of vectors x_1 … x_4. Vaswani et al, "Attention is all you need", NeurIPS 2017

67. The Transformer — Self-attention: the only step where all vectors interact with each other. Vaswani et al, NeurIPS 2017

68. The Transformer — Add a residual connection around the self-attention. Vaswani et al, NeurIPS 2017

69. The Transformer — Add Layer Normalization after the residual connection.
Recall Layer Normalization (Ba et al, 2016): given vectors h_1, …, h_N (each of shape D), with learned scale γ (shape D) and shift β (shape D):
μ_i = (∑_j h_{i,j}) / D (scalar)
σ_i = (∑_j (h_{i,j} − μ_i)² / D)^{1/2} (scalar)
z_i = (h_i − μ_i) / σ_i
y_i = γ * z_i + β
Vaswani et al, "Attention is all you need", NeurIPS 2017
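A minimal sketch of the layer-norm formula above (a small eps is added to the denominator for numerical stability, which the slide's formula omits):

```python
import torch

def layer_norm(h, gamma, beta, eps=1e-5):
    """LayerNorm over the feature dimension of each vector h_i (sketch of the slide's formula).

    h: (N, D) vectors; gamma, beta: (D,) learned scale and shift.
    """
    mu = h.mean(dim=-1, keepdim=True)                            # per-vector mean mu_i
    sigma = h.var(dim=-1, unbiased=False, keepdim=True).sqrt()   # per-vector std sigma_i
    z = (h - mu) / (sigma + eps)
    return gamma * z + beta                                       # y_i = gamma * z_i + beta
```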

70. The Transformer — Add an MLP applied independently to each vector (same Layer Normalization recap as above). Vaswani et al, NeurIPS 2017

71. The Transformer — Add a residual connection around the MLP. Vaswani et al, NeurIPS 2017

72. The Transformer — Add a second Layer Normalization to produce the outputs y_1 … y_4. Vaswani et al, NeurIPS 2017

73. The Transformer — Transformer Block: input is a set of vectors x, output is a set of vectors y. Self-attention is the only interaction between vectors! Layer norm and the MLP work independently per vector. Highly scalable, highly parallelizable. Vaswani et al, "Attention is all you need", NeurIPS 2017
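A minimal sketch of one Transformer block with the post-norm layout drawn on the slides (self-attention → residual → layer norm → per-vector MLP → residual → layer norm). It assumes the MultiHeadSelfAttention sketch from slide 55; the hidden size and ReLU inside the MLP are illustrative choices:

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    """One post-norm Transformer block (minimal sketch)."""
    def __init__(self, d_model, num_heads, d_hidden):
        super().__init__()
        self.attn = MultiHeadSelfAttention(d_model, num_heads)  # sketch defined earlier
        self.norm1 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(                 # applied independently to each vector
            nn.Linear(d_model, d_hidden),
            nn.ReLU(),
            nn.Linear(d_hidden, d_model),
        )
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):                         # x: (N, d_model), a set of vectors
        x = self.norm1(x + self.attn(x))          # self-attention + residual + LayerNorm
        x = self.norm2(x + self.mlp(x))           # per-vector MLP + residual + LayerNorm
        return x
```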

74. The Transformer — A Transformer is a sequence of Transformer blocks (Vaswani et al: 12 blocks, D_Q = 512, 6 heads). Self-attention is the only interaction between vectors; layer norm and the MLP work independently per vector; highly scalable, highly parallelizable. Vaswani et al, "Attention is all you need", NeurIPS 2017

75. The Transformer: Transfer Learning — "ImageNet moment for natural language processing." Pretraining: download a lot of text from the internet; train a giant Transformer model for language modeling. Finetuning: fine-tune the Transformer on your own NLP task. Devlin et al, "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding", EMNLP 2018

76.–80. Scaling up Transformers — Successive models grow in depth, width, parameters, and training data:

Model             | Layers | Width | Heads | Params | Data   | Training
Transformer-Base  | 12     | 512   | 8     | 65M    |        | 8x P100 (12 hours)
Transformer-Large | 12     | 1024  | 16    | 213M   |        | 8x P100 (3.5 days)
BERT-Base         | 12     | 768   | 12    | 110M   | 13 GB  |
BERT-Large        | 24     | 1024  | 16    | 340M   | 13 GB  |
XLNet-Large       | 24     | 1024  | 16    | ~340M  | 126 GB | 512x TPU-v3 (2.5 days)
RoBERTa           | 24     | 1024  | 16    | 355M   | 160 GB | 1024x V100 GPU (1 day)
GPT-2             | 48     | 1600  | ?     | 1.5B   | 40 GB  |
Megatron-LM       | 72     | 3072  | 32    | 8.3B   | 174 GB | 512x V100 GPU (9 days)

The Megatron-LM training run would cost roughly $430,000 on Amazon AWS!

Vaswani et al, "Attention is all you need", NeurIPS 2017
Devlin et al, "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding", EMNLP 2018
Yang et al, "XLNet: Generalized Autoregressive Pretraining for Language Understanding", 2019
Liu et al, "RoBERTa: A Robustly Optimized BERT Pretraining Approach", 2019
Radford et al, "Language models are unsupervised multitask learners", 2019
Shoeybi et al, "Megatron-LM: Training Multi-Billion Parameter Language Models using Model Parallelism", 2019
