Attention: The problem


  1. Attention

  2. The problem • For very long sentences, the machine translation score drops sharply after 30-40 words; with attention, this performance degradation is avoided. Bahdanau et al. 2014. Neural machine translation by jointly learning to align and translate.

  3. Basic structure of an RNN • We want a notion of "time" or "sequence": the hidden state is computed from the previous hidden state and the current input. [Christopher Olah, Understanding LSTMs]

  4. Basic structure of an RNN • We want a notion of "time" or "sequence": the hidden state update has parameters to be learned.

  5. Basic structure of an RNN • We want a notion of "time" or "sequence": the output is computed from the hidden state, and the same parameters are used for each time step = generalization!

  6. Basic structure of an RNN • Unrolling RNNs: the hidden state update is the same at every time step. [Christopher Olah, Understanding LSTMs]

  7. Basic structure of an RNN • Unrolling RNNs (a code sketch follows below)
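To make the shared-parameter structure of slides 3-7 concrete, here is a minimal NumPy sketch of an unrolled vanilla RNN; the weight names (Whh, Wxh, Why) and the tanh nonlinearity are common conventions, not taken from the slides:

```python
import numpy as np

def rnn_forward(xs, h0, Whh, Wxh, Why, bh, by):
    """Unrolled vanilla RNN: the SAME parameters are reused at every time step."""
    h = h0
    hs, ys = [], []
    for x in xs:                              # one iteration per element of the sequence
        h = np.tanh(Whh @ h + Wxh @ x + bh)   # new hidden state from previous state + input
        hs.append(h)
        ys.append(Why @ h + by)               # output computed from the current hidden state
    return hs, ys
```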

  8. Long-term dependencies: "I moved to Germany … so I speak German fluently."

  9. Attention: intuition • ATTENTION: Which hidden states are more important to predict my output? "I moved to Germany … so I speak German fluently."

  10. Attention: intuition • Context vector built from attention weights $\alpha_{1,t+1}, \ldots, \alpha_{t,t+1}, \alpha_{t+1,t+1}$ over the hidden states of "I moved to Germany … so I speak German fluently."

  11. Attention: architecture • A decoder processes the information. • The decoder takes as input: – the previous decoder hidden state – the previous output – the attention context

  12. Attention • $\alpha_{k,t+1}$ indicates how much the word in position $k$ matters for translating the word in position $t+1$. • The context aggregates the attention: $c_{t+1} = \sum_{k=1}^{t+1} \alpha_{k,t+1} a_k$ • Soft attention: all attention weights $\alpha$ sum up to 1.

  13. Computing the attention mask • We can train a small neural network: $f_{1,t+1} = \mathrm{NN}(d_t, a_1)$, where $d_t$ is the previous state of the decoder and $a_1$ a hidden state of the encoder. • Normalize: $\alpha_{1,t+1} = \frac{\exp f_{1,t+1}}{\sum_{k=1}^{t+1} \exp f_{k,t+1}}$
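A NumPy sketch of slides 12-13, assuming an additive (Bahdanau-style) scoring network; the parameter names W1, W2, v are invented for illustration:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_context(d_t, enc_states, W1, W2, v):
    """Score every encoder hidden state a_k against the previous decoder state d_t
    with a small neural network, normalize with a softmax, and aggregate the context."""
    scores = np.array([v @ np.tanh(W1 @ d_t + W2 @ a_k)   # f_{k,t+1} = NN(d_t, a_k)
                       for a_k in enc_states])
    alphas = softmax(scores)                    # attention weights, sum to 1
    c = alphas @ np.array(enc_states)           # c_{t+1} = sum_k alpha_{k,t+1} a_k
    return c, alphas
```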

  14. Attention for vision

  15. Why do we need attention? • We use the whole image to make the classification ("BIRD"). • Are all pixels equally important?

  16. Why do we need attention? • Wouldn't it be easier and computationally more efficient to just run our classification network on the patch?

  17. Soft attention for captioning

  18. Image captioning • Xu et al. 2015. Show, attend and tell: Neural image caption generation with visual attention.

  19. Image captioning • Input: an image. • Output: a sentence describing the image. • Encoder: a classification CNN (VGGNet, AlexNet) that computes a feature map over the image. • Decoder: an attention-based RNN – At each time step, the decoder computes an attention map over the entire image, effectively deciding which regions to focus on. – It receives a context vector, which is the weighted average of the conv net features.
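The pipeline on this slide can be summarized as a loop; this is a sketch only, and the helper names (cnn_features, attention, rnn_step, word_logits, embed) are placeholders rather than the paper's API:

```python
import numpy as np

def caption_image(image, h0, max_len, cnn_features, attention, rnn_step, word_logits, embed):
    """Encoder: a classification CNN computes a feature map over the image.
    Decoder: an attention-based RNN that, at each time step, computes an attention
    map over all image regions and consumes the resulting context vector
    (a weighted average of the conv features)."""
    feats = cnn_features(image)                # (num_regions, d) flattened spatial feature map
    h, word = h0, 0                            # initial decoder state and <start> token id
    caption = []
    for _ in range(max_len):
        z, _alphas = attention(h, feats)       # attention map over regions -> context vector
        h = rnn_step(h, embed(word), z)        # previous state, previous output, context
        word = int(np.argmax(word_logits(h)))  # greedy choice of the next word
        caption.append(word)
    return caption
```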

  20. Conventional captioning • The LSTM only sees the image once! Figure: encoder/decoder pipeline, from https://blog.heuritech.com/2016/01/20/attention-mechanism/

  21.-24. Attention mechanism • "A girl is throwing a frisbee in the park" (the same example, developed over four slides)

  25. Attention mechanism • $y_i$: output of the encoder, i.e. the image features, which still retain spatial information (no FC layer!) • $z_t$: output of the attention model • $h_t$: hidden state of the LSTM

  26. Attention mechanism • What does the attention model look like?

  27. Attention model • Attention architecture: the visual features and any past hidden state go in, the attention comes out. Image: https://blog.heuritech.com/2016/01/20/attention-mechanism/

  28. Attention model • Inputs = a feature descriptor for each image patch

  29. Attention model • Inputs = a feature descriptor for each image patch, still tied to the spatial locations of the image

  30. Attention model • We want a bounded output: $m_i = \tanh(W_h h + W_y y_i)$

  31. Attention model • Softmax to create attention values between 0 and 1

  32. Attention model • Multiplied by the image features → ranking by importance
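Slides 30-32 together form the soft attention pipeline; a minimal sketch, with assumed weight names Wh, Wy, w:

```python
import numpy as np

def soft_attention(h, ys, Wh, Wy, w):
    """Soft attention over an (n, d) array of image features ys: tanh for a bounded
    activation, softmax for weights in [0, 1] that sum to 1, then the features
    are weighted by importance."""
    m = np.array([np.tanh(Wh @ h + Wy @ y) for y in ys])  # bounded output per region
    s = m @ w                                             # scalar score per region
    e = np.exp(s - s.max())
    alphas = e / e.sum()                                  # attention values in [0, 1]
    z = alphas @ ys                                       # features weighted by importance
    return z, alphas
```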

  33. Hard attention model • Choosing one of the features by sampling with probabilities $s_i$
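For contrast with the soft version above, a sketch of the hard attention sampling step (names assumed; how to get gradients through this stochastic choice is the topic of the next slides):

```python
import numpy as np

rng = np.random.default_rng(0)

def hard_attention(ys, s):
    """Hard attention: sample ONE feature, with probabilities s_i,
    instead of taking a weighted average of all of them."""
    i = rng.choice(len(ys), p=s)   # stochastic selection of a single region
    return ys[i], i
```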

  34. Types of attention • Soft attention: a deterministic process that can be backpropagated through. • Hard attention: a stochastic process; the gradient is estimated through Monte Carlo sampling. • Soft attention is the most commonly used, since it can be incorporated into the optimization more easily.

  35. Types of attention • Soft vs. hard attention

  36. Types of attention: soft • Can be backpropagated through • Uses all of the image • The attention module produces the final context. Image: Stanford CS231n lecture

  37. Types of attention: hard • You can view it as image cropping! • If we cannot use gradient descent, what alternative could we use to train this function? Reinforcement learning. Image: Stanford CS231n lecture

  38. Image captioning with attention • Xu et al. 2015. Show, attend and tell: Neural image caption generation with visual attention.

  39. Interesting works on attention • Luong et al., "Effective Approaches to Attention-based Neural Machine Translation", EMNLP 2015 • Chan et al., "Listen, Attend, and Spell", arXiv 2015 • Chorowski et al., "Attention-based models for Speech Recognition", NIPS 2015 • Yao et al., "Describing Videos by Exploiting Temporal Structure", ICCV 2015 • Xu and Saenko, "Ask, Attend and Answer: Exploring Question-Guided Spatial Attention for Visual Question Answering", arXiv 2015 • Zhu et al., "Visual7W: Grounded Question Answering in Images", arXiv 2015 • Chu et al., "Online Multi-Object Tracking Using CNN-based Single Object Tracker with Spatial-Temporal Attention Mechanism", ICCV 2017

  40. Conditioning

  41. When do we need conditioning? • Scene understanding from an image and an audio source: both need to be processed!

  42.-43. When do we need conditioning? • Visual Question Answering: the sentence (question) needs to be understood, and the image is needed to create the answer.

  44. When do we need conditioning? • We have two sources: can we process one in the context of the other? • Conditioning: the computation carried out by a model is conditioned, or modulated, by information extracted from an auxiliary input. • Note: a similar effect can be obtained with attention (see slide 39).

  45. When do we need conditioning? • Generate images based on a word. • Do we need to retrain a model for each word? Image: https://distill.pub/2018/feature-wise-transformations/ (a sketch of this idea follows below)
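One widely used way to implement such conditioning is a feature-wise transformation in the spirit of the distill.pub article linked above (FiLM-style scaling and shifting); this is a sketch under that assumption, and every name in it is invented:

```python
import numpy as np

def condition(features, aux, Wg, bg, Wb, bb):
    """Modulate the main network's feature maps with information extracted from an
    auxiliary input (e.g. a word embedding), so one model serves many words.
    features: (channels, H, W); aux: embedding of the auxiliary input."""
    gamma = Wg @ aux + bg   # per-channel scale predicted from the auxiliary input
    beta = Wb @ aux + bb    # per-channel shift predicted from the auxiliary input
    return gamma[:, None, None] * features + beta[:, None, None]
```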
