Attention, Transformers, BERT, and ViLBERT. Arjun Majumdar, Georgia Tech. Presentation transcript.

  1. Attention, Transformers, BERT, and ViLBERT. Arjun Majumdar, Georgia Tech. Slide Credits: Andrej Karpathy, Justin Johnson, Dhruv Batra

  2. Recall: Recurrent Neural Networks. Image Credit: Andrej Karpathy

  3. Sequence-to-Sequence with RNNs. Input: sequence x_1, …, x_T. Output: sequence y_1, …, y_{T'}. Encoder: h_t = f_W(x_t, h_{t-1}). [Diagram: encoder RNN unrolled over the input "we are eating bread", producing hidden states h_1 … h_4.] Sutskever et al., "Sequence to sequence learning with neural networks", NeurIPS 2014. Slide Credit: Justin Johnson

  4. Sequence-to-Sequence with RNNs. Input: sequence x_1, …, x_T. Output: sequence y_1, …, y_{T'}. Encoder: h_t = f_W(x_t, h_{t-1}). From the final hidden state, predict the initial decoder state s_0 and the context vector c (often c = h_T). [Diagram: encoder over "we are eating bread"; s_0 and c are computed from h_4.] Sutskever et al., "Sequence to sequence learning with neural networks", NeurIPS 2014. Slide Credit: Justin Johnson
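
To make the encoder recurrence concrete, here is a minimal PyTorch sketch (not from the lecture): a vanilla nn.RNNCell stands in for f_W, and the toy sizes and the linear layer that predicts s_0 from h_T are illustrative assumptions only.

    import torch
    import torch.nn as nn

    # Toy sizes, chosen only for illustration.
    input_dim, hidden_dim, T = 16, 32, 4

    # Encoder recurrence h_t = f_W(x_t, h_{t-1}); a vanilla RNN cell plays the role of f_W.
    f_W = nn.RNNCell(input_dim, hidden_dim)
    predict_s0 = nn.Linear(hidden_dim, hidden_dim)   # predicts s_0 from the final hidden state

    x = torch.randn(T, input_dim)                    # stand-in for "we are eating bread"
    h = torch.zeros(1, hidden_dim)                   # h_0
    hidden_states = []
    for t in range(T):
        h = f_W(x[t:t+1], h)                         # h_t = f_W(x_t, h_{t-1})
        hidden_states.append(h)

    h_enc = torch.cat(hidden_states, dim=0)          # all encoder states, shape (T, hidden_dim)
    c = hidden_states[-1]                            # context vector c = h_T
    s0 = torch.tanh(predict_s0(c))                   # initial decoder state s_0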

  5. Sequence-to-Sequence with RNNs. Decoder: s_t = g_U(y_{t-1}, s_{t-1}, c). Starting from y_0 = [START], the initial state s_0, and the context vector c, the decoder produces s_1 and the first output y_1 = "estamos". [Diagram: first decoder step attached to the encoder.] Sutskever et al., "Sequence to sequence learning with neural networks", NeurIPS 2014. Slide Credit: Justin Johnson

  6. Sequence-to-Sequence with RNNs. Decoder: s_t = g_U(y_{t-1}, s_{t-1}, c). Feeding y_1 = "estamos" back in produces s_2 and the next output y_2 = "comiendo". [Diagram: two decoder steps unrolled.] Sutskever et al., "Sequence to sequence learning with neural networks", NeurIPS 2014. Slide Credit: Justin Johnson

  7. Sequence-to-Sequence with RNNs. Decoder: s_t = g_U(y_{t-1}, s_{t-1}, c). Unrolling the decoder in this way produces the full output sequence "estamos comiendo pan [STOP]". [Diagram: decoder unrolled for four steps, states s_1 … s_4.] Sutskever et al., "Sequence to sequence learning with neural networks", NeurIPS 2014. Slide Credit: Justin Johnson
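
A matching sketch of the decoder loop under the same assumptions: the vocabulary size, embedding, g_U cell, output layer, and token ids below are hypothetical stand-ins, and the fixed context vector c from the encoder sketch above is fed in at every step.

    # Toy decoder modules; all names here are illustrative stand-ins, not the lecture's.
    vocab_size, embed_dim = 100, 16
    embed = nn.Embedding(vocab_size, embed_dim)
    g_U = nn.RNNCell(embed_dim + hidden_dim, hidden_dim)   # consumes [y_{t-1}; c]
    out = nn.Linear(hidden_dim, vocab_size)                # predicts y_t from s_t
    START, STOP = 0, 1                                     # assumed token ids

    y_prev, s = torch.tensor([START]), s0
    outputs = []
    for _ in range(10):                                    # cap length for the toy example
        inp = torch.cat([embed(y_prev), c], dim=1)         # the same fixed c at every step
        s = g_U(inp, s)                                    # s_t = g_U(y_{t-1}, s_{t-1}, c)
        y_prev = out(s).argmax(dim=1)                      # greedy choice of y_t
        outputs.append(y_prev.item())
        if y_prev.item() == STOP:                          # stop once [STOP] is emitted
            break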

  8. Sequence-to-Sequence with RNNs. Decoder: s_t = g_U(y_{t-1}, s_{t-1}, c). Problem: the entire input sequence is bottlenecked through the single fixed-sized context vector c. [Diagram: full encoder-decoder with the bottleneck at c.] Sutskever et al., "Sequence to sequence learning with neural networks", NeurIPS 2014. Slide Credit: Justin Johnson

  9. Sequence-to-Sequence with RNNs. Decoder: s_t = g_U(y_{t-1}, s_{t-1}, c). Problem: the entire input sequence is bottlenecked through the single fixed-sized context vector c. Idea: use a new context vector at each step of the decoder! [Diagram: full encoder-decoder with the bottleneck at c.] Sutskever et al., "Sequence to sequence learning with neural networks", NeurIPS 2014. Slide Credit: Justin Johnson

  10. Sequence-to-Sequence with RNNs and Attention. Input: sequence x_1, …, x_T. Output: sequence y_1, …, y_{T'}. Encoder: h_t = f_W(x_t, h_{t-1}). From the final hidden state, predict the initial decoder state s_0. [Diagram: encoder over "we are eating bread" with s_0.] Bahdanau et al., "Neural machine translation by jointly learning to align and translate", ICLR 2015. Slide Credit: Justin Johnson

  11. Sequence-to-Sequence with RNNs and Attention. Compute (scalar) alignment scores e_{t,i} = f_att(s_{t-1}, h_i), where f_att is an MLP. [Diagram: s_0 scored against h_1 … h_4 to produce e_{1,1} … e_{1,4}.] Bahdanau et al., "Neural machine translation by jointly learning to align and translate", ICLR 2015. Slide Credit: Justin Johnson

  12. Sequence-to-Sequence with RNNs and Attention. Compute (scalar) alignment scores e_{t,i} = f_att(s_{t-1}, h_i), where f_att is an MLP. Normalize the alignment scores with a softmax to get attention weights a_{t,i}, with 0 < a_{t,i} < 1 and ∑_i a_{t,i} = 1. [Diagram: softmax over e_{1,1} … e_{1,4} producing a_{1,1} … a_{1,4}.] Bahdanau et al., "Neural machine translation by jointly learning to align and translate", ICLR 2015. Slide Credit: Justin Johnson

  13. Sequence-to-Sequence with RNNs and Attention. Compute (scalar) alignment scores e_{t,i} = f_att(s_{t-1}, h_i), where f_att is an MLP. Normalize the scores with a softmax to get attention weights a_{t,i} (0 < a_{t,i} < 1, ∑_i a_{t,i} = 1). Compute the context vector as a linear combination of the encoder hidden states: c_t = ∑_i a_{t,i} h_i. [Diagram: weighted sum of h_1 … h_4 producing c_1.] Bahdanau et al., "Neural machine translation by jointly learning to align and translate", ICLR 2015. Slide Credit: Justin Johnson
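
The three attention equations on slides 11-13 (alignment scores, softmax, weighted sum) take only a few lines. The sketch below reuses h_enc, T, and s0 from the encoder sketch above; the particular two-layer MLP used for f_att is an assumption, not the exact form from Bahdanau et al.

    # f_att: an MLP that scores each encoder state h_i against the previous decoder state s_{t-1}.
    f_att = nn.Sequential(
        nn.Linear(2 * hidden_dim, hidden_dim),
        nn.Tanh(),
        nn.Linear(hidden_dim, 1),
    )

    s_prev = s0                                                  # decoding step t = 1, so s_{t-1} = s_0
    pairs = torch.cat([s_prev.expand(T, -1), h_enc], dim=1)      # pair s_{t-1} with every h_i
    e = f_att(pairs).squeeze(1)                                  # alignment scores e_{t,i}
    a = torch.softmax(e, dim=0)                                  # attention weights: positive, sum to 1
    c_t = (a.unsqueeze(1) * h_enc).sum(dim=0, keepdim=True)      # c_t = sum_i a_{t,i} h_i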

  14. Sequence-to-Sequence with RNNs and Attention. Compute alignment scores e_{t,i} = f_att(s_{t-1}, h_i), normalize them with a softmax to get attention weights a_{t,i}, and compute the context vector c_t = ∑_i a_{t,i} h_i. Use the context vector in the decoder: s_t = g_U(y_{t-1}, s_{t-1}, c_t). With y_0 = [START] and c_1, the decoder produces s_1 and the first output y_1 = "estamos". [Diagram: first attention-based decoder step.] Bahdanau et al., "Neural machine translation by jointly learning to align and translate", ICLR 2015. Slide Credit: Justin Johnson

  15. Sequence-to-Sequence with RNNs and Attention. Same computation as the previous slide: score, softmax, weighted sum, then s_t = g_U(y_{t-1}, s_{t-1}, c_t). This is all differentiable! Do not supervise the attention weights; backpropagate through everything. [Diagram: first attention-based decoder step.] Bahdanau et al., "Neural machine translation by jointly learning to align and translate", ICLR 2015. Slide Credit: Justin Johnson

  16. Sequence-to-Sequence with RNNs and Attention. Intuition: the context vector attends to the relevant part of the input sequence. For "estamos" = "we are", the attention weights concentrate on the first two words: a_{1,1} = 0.45, a_{1,2} = 0.45, a_{1,3} = 0.05, a_{1,4} = 0.05. This is all differentiable! Do not supervise the attention weights; backpropagate through everything. [Diagram: first attention-based decoder step with the weights overlaid.] Bahdanau et al., "Neural machine translation by jointly learning to align and translate", ICLR 2015. Slide Credit: Justin Johnson
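
One attention-based decoder step, continuing the sketches above (embed, g_U, out, s0, START, and the c_t just computed): the only change from the fixed-context decoder is that the freshly computed c_t replaces c.

    y_prev = torch.tensor([START])                         # y_0 = [START]
    s1 = g_U(torch.cat([embed(y_prev), c_t], dim=1), s0)   # s_1 = g_U(y_0, s_0, c_1)
    y1 = out(s1).argmax(dim=1)                             # predicted first output token ("estamos")
    # Everything above is differentiable: the attention weights are learned by
    # backpropagating the ordinary sequence loss, never supervised directly.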

  17. Sequence-to-Sequence with RNNs and Attention. Repeat: use s_1 to compute new alignment scores e_{2,i}, new attention weights a_{2,i}, and a new context vector c_2. [Diagram: second attention step over h_1 … h_4.] Bahdanau et al., "Neural machine translation by jointly learning to align and translate", ICLR 2015. Slide Credit: Justin Johnson

  18. Sequence-to-Sequence with RNNs and Attention. Repeat: use s_1 to compute a new context vector c_2, then use c_2 to compute s_2 and the next output y_2 = "comiendo". [Diagram: second attention-based decoder step.] Bahdanau et al., "Neural machine translation by jointly learning to align and translate", ICLR 2015. Slide Credit: Justin Johnson

  19. Sequence-to-Sequence with RNNs and Attention. Intuition: the context vector attends to the relevant part of the input sequence: "comiendo" = "eating". [Diagram: second attention-based decoder step with the weights overlaid.] Bahdanau et al., "Neural machine translation by jointly learning to align and translate", ICLR 2015. Slide Credit: Justin Johnson
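
Putting the pieces together, a hypothetical greedy decoding loop that recomputes the context vector from the current decoder state at every step, as these slides describe (again reusing the modules defined in the earlier sketches):

    def attend(s_prev):
        # Recompute c_t = sum_i a_{t,i} h_i for the given previous decoder state.
        pairs = torch.cat([s_prev.expand(T, -1), h_enc], dim=1)
        a = torch.softmax(f_att(pairs).squeeze(1), dim=0)
        return (a.unsqueeze(1) * h_enc).sum(dim=0, keepdim=True)

    y_prev, s = torch.tensor([START]), s0
    tokens = []
    for _ in range(10):                                    # cap length for the toy example
        c_t = attend(s)                                    # fresh context vector at every step
        s = g_U(torch.cat([embed(y_prev), c_t], dim=1), s) # s_t = g_U(y_{t-1}, s_{t-1}, c_t)
        y_prev = out(s).argmax(dim=1)                      # greedy choice of y_t
        tokens.append(y_prev.item())
        if y_prev.item() == STOP:
            break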
