  1. Day 4 Lecture 6 Attention Models

  2. Attention Models: Motivation. An image (H x W x 3) is classified as "bird": the whole input volume is used to predict the output...

  3. Attention Models: Motivation. An image (H x W x 3) is classified as "bird": the whole input volume is used to predict the output... despite the fact that not all pixels are equally important.

  4. Attention Models: Motivation. Attention models can relieve the computational burden. Helpful when processing big images!

  5. Attention Models: Motivation. Attention models can relieve the computational burden. Helpful when processing big images! (Example prediction: "bird".)

  6. Encoder & Decoder From previous lecture... The whole input sentence is used to produce the translation Kyunghyun Cho, “Introduction to Neural Machine Translation with GPUs” (2015) 6

  7. Attention Models. Bahdanau et al., Neural Machine Translation by Jointly Learning to Align and Translate. ICLR 2015. Credit: Kyunghyun Cho, "Introduction to Neural Machine Translation with GPUs" (2015).
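
A minimal sketch of the additive attention score from Bahdanau et al., assuming illustrative sizes and weight names (W_h, W_s, v): each encoder state is scored against the current decoder state, the scores are softmax-normalized, and the context vector is the weighted sum of encoder states.

import torch
import torch.nn.functional as F

T, d_h, d_s, d_a = 10, 256, 256, 128   # assumed sizes: source length, encoder/decoder/attention dims
H = torch.randn(T, d_h)                # encoder hidden states, one per source word
s = torch.randn(d_s)                   # current decoder state

W_h = torch.randn(d_a, d_h)            # illustrative (untrained) weights
W_s = torch.randn(d_a, d_s)
v = torch.randn(d_a)

# e_j = v^T tanh(W_h h_j + W_s s): one scalar score per source position
e = torch.tanh(H @ W_h.T + s @ W_s.T) @ v    # shape (T,)
alpha = F.softmax(e, dim=0)                  # attention weights over source words, sum to 1
context = alpha @ H                          # weighted sum of encoder states, shape (d_h,)

The decoder then consumes this context vector together with the previously generated word to predict the next target word.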

  8. Attention Models. Idea: focus on different parts of the input as you make/refine predictions over time. Example, image captioning: "A bird flying over a body of water".

  9. LSTM Decoder. A CNN provides D-dimensional features that are fed to an LSTM decoder, which then emits the caption word by word ("A bird flying ... <EOS>"). The LSTM decoder "sees" the input only at the beginning!
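
For contrast with the attention-based decoder on the following slides, here is a minimal sketch of the baseline this slide describes, with assumed sizes, layer names, and a <BOS> id of 0: the D-dimensional CNN feature only initializes the LSTM state and is never looked at again.

import torch
import torch.nn as nn

D, emb, hid, vocab = 512, 128, 256, 10000   # assumed sizes
init_h = nn.Linear(D, hid)                  # map CNN feature to initial hidden state
init_c = nn.Linear(D, hid)                  # map CNN feature to initial cell state
embed = nn.Embedding(vocab, emb)
lstm = nn.LSTMCell(emb, hid)
to_vocab = nn.Linear(hid, vocab)

feat = torch.randn(1, D)                    # global CNN feature of the image
h, c = init_h(feat), init_c(feat)           # the only place the image enters the decoder
word = torch.zeros(1, dtype=torch.long)     # assumed <BOS> token id 0
for _ in range(5):                          # greedy decoding for a few steps
    h, c = lstm(embed(word), (h, c))
    word = to_vocab(h).argmax(dim=1)        # next word; the image is never re-read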

  10. Attention for Image Captioning. A CNN maps the image (H x W x 3) to a grid of features (L x D). Slide credit: CS231n.

  11. Attention for Image Captioning. From the initial hidden state h0, the model predicts attention weights a1 over the L x D feature grid computed by the CNN from the image (H x W x 3). Slide credit: CS231n.

  12. Attention for Image Captioning. The weights a1 produce a weighted combination of the features, the D-dimensional context vector z1; together with the first word y1 it updates the state to h1, which predicts the next word y2 and the next attention weights a2. Slide credit: CS231n.

  13. Attention for Image Captioning. The loop continues: h1, the context vector z2, and the word y2 give h2, which predicts the word y3 and the attention weights a3. Slide credit: CS231n.
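
A minimal sketch of one step of this captioning loop, with assumed sizes and an assumed (simple linear) scoring layer rather than the paper's exact one: the hidden state scores each of the L feature vectors, the softmax gives the weights a_t, their weighted sum gives the context z_t, and the LSTM consumes z_t together with the previous word to predict the next one.

import torch
import torch.nn as nn
import torch.nn.functional as F

L, D, emb, hid, vocab = 196, 512, 128, 256, 10000   # e.g. a 14x14 grid of D-dim features
attn = nn.Linear(D + hid, 1)       # scores one (feature, hidden state) pair
embed = nn.Embedding(vocab, emb)
lstm = nn.LSTMCell(emb + D, hid)   # consumes word embedding + context vector
to_vocab = nn.Linear(hid, vocab)

feats = torch.randn(L, D)                            # CNN feature grid, flattened to L locations
h, c = torch.zeros(1, hid), torch.zeros(1, hid)
word = torch.zeros(1, dtype=torch.long)              # previous word id (assumed 0 = <BOS>)

e = attn(torch.cat([feats, h.expand(L, hid)], dim=1)).squeeze(1)   # one score per location, shape (L,)
a = F.softmax(e, dim=0)                                            # attention weights a_t
z = (a.unsqueeze(1) * feats).sum(dim=0, keepdim=True)              # context vector z_t, shape (1, D)
h, c = lstm(torch.cat([embed(word), z], dim=1), (h, c))
next_word = to_vocab(h).argmax(dim=1)                              # predicted word, fed back next step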

  14. Attention for Image Captioning. Xu et al., Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. ICML 2015.

  15. Attention for Image Captioning. Xu et al., Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. ICML 2015.

  16. Attention for Image Captioning. Xu et al., Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. ICML 2015.

  17. Soft Attention. Soft attention summarizes ALL locations: z = p_a*a + p_b*b + p_c*c + p_d*d, where a, b, c, d is the grid of features from the CNN (each D-dimensional, image H x W x 3) and p_a, p_b, p_c, p_d is a distribution over grid locations coming from the RNN, with p_a + p_b + p_c + p_d = 1. The context vector z is D-dimensional and the derivative dz/dp is nice, so train with gradient descent. Xu et al., Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. ICML 2015. Slide credit: CS231n.

  18. Soft Attention. Soft attention summarizes ALL locations: z = p_a*a + p_b*b + p_c*c + p_d*d, a differentiable function, so train with gradient descent. Drawbacks: it still uses the whole input, and it is constrained to a fixed grid. Xu et al., Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. ICML 2015. Slide credit: CS231n.
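
The weighted sum on these two slides in code form, a minimal sketch with assumed sizes: because z is a smooth function of the distribution p, autograd gives dz/dp directly and the model trains with plain gradient descent.

import torch

D = 512
grid = torch.randn(4, D)                       # features a, b, c, d from the CNN
scores = torch.randn(4, requires_grad=True)    # unnormalized scores, e.g. from the RNN state
p = torch.softmax(scores, dim=0)               # p_a + p_b + p_c + p_d = 1
z = (p.unsqueeze(1) * grid).sum(dim=0)         # z = p_a*a + p_b*b + p_c*c + p_d*d, shape (D,)

z.sum().backward()                             # gradients flow back through the soft selection
print(scores.grad.shape)                       # torch.Size([4])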

  19. Hard Attention. Hard attention: sample a subset of the input. The input image (H x W x 3) is cropped and rescaled (X x Y x 3) according to box coordinates (xc, yc, w, h). This is not a differentiable function: the gradient is 0 almost everywhere and undefined at x = 0, so we can't train with backprop :( and need reinforcement learning.
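
A minimal sketch of the hard-attention glimpse described here, with assumed sizes: the predicted box is rounded to integer pixel indices before cropping, so the output carries no useful gradient with respect to (xc, yc, w, h), which is why training falls back to reinforcement learning.

import torch
import torch.nn.functional as F

H, W = 224, 224
image = torch.randn(3, H, W)              # input image (channels-first H x W x 3)
xc, yc, w, h = 112.0, 112.0, 64.0, 64.0   # predicted box, e.g. sampled from an RNN policy

x0, y0 = int(xc - w / 2), int(yc - h / 2)          # rounding to integers: gradient is lost here
crop = image[:, y0:y0 + int(h), x0:x0 + int(w)]    # discrete selection of pixels
patch = F.interpolate(crop.unsqueeze(0), size=(32, 32),
                      mode='bilinear', align_corners=False)   # cropped and rescaled glimpse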

  20. Hard Attention. Classify images by attending to arbitrary regions of the input; generate images by attending to arbitrary regions of the output. Gregor et al., DRAW: A Recurrent Neural Network For Image Generation. ICML 2015.

  21. Hard Attention. Gregor et al., DRAW: A Recurrent Neural Network For Image Generation. ICML 2015.

  22. Hard Attention. Read text and generate handwriting using an RNN that attends to different arbitrary regions over time (real vs. generated samples). Graves, Generating Sequences with Recurrent Neural Networks. arXiv 2013.

  23. Hard Attention. The input image (H x W x 3) is cropped and rescaled (X x Y x 3) using box coordinates (xc, yc, w, h), and a CNN classifies the crop ("bird"). Not a differentiable function! Can't train with backprop :(

  24. Spatial Transformer Networks. Same pipeline: the input image (H x W x 3) is cropped and rescaled (X x Y x 3) using box coordinates (xc, yc, w, h), and a CNN classifies the crop ("bird"). Not a differentiable function, so we can't train with backprop :( Make it differentiable, and we can train with backprop :) Jaderberg et al., Spatial Transformer Networks. NIPS 2015.

  25. Spatial Transformer Networks. Idea: the network attends to the input by predicting box coordinates (xc, yc, w, h), i.e. a function mapping pixel coordinates (xt, yt) of the output to pixel coordinates (xs, ys) of the input (from the H x W x 3 image to the X x Y x 3 cropped and rescaled image). Can we make this function differentiable? Repeat for all pixels in the output to get a sampling grid, then use bilinear interpolation to compute the output. Jaderberg et al., Spatial Transformer Networks. NIPS 2015. Slide credit: CS231n.
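
A minimal sketch of this differentiable sampler using PyTorch's affine_grid and grid_sample (my library choice, not necessarily the authors' code): a 2x3 matrix theta maps output pixel coordinates to input coordinates, and bilinear interpolation makes the sampled crop differentiable with respect to both theta and the image.

import torch
import torch.nn.functional as F

image = torch.randn(1, 3, 224, 224, requires_grad=True)
theta = torch.tensor([[[0.5, 0.0, 0.1],                 # arbitrary zoom-and-shift, i.e. a soft crop
                       [0.0, 0.5, 0.1]]], requires_grad=True)

grid = F.affine_grid(theta, size=(1, 3, 64, 64), align_corners=False)   # (xs, ys) for every output pixel
out = F.grid_sample(image, grid, mode='bilinear', align_corners=False)  # bilinear interpolation

out.sum().backward()          # gradients reach theta, so the sampler trains with backprop
print(theta.grad.shape)       # torch.Size([1, 2, 3])

In a spatial transformer, theta would come from a small localization network rather than being fixed by hand.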

  26. Spatial Transformer Networks. Insert spatial transformers into a classification network and it learns to attend to and transform the input. A differentiable module, easy to incorporate in any network, anywhere! Jaderberg et al., Spatial Transformer Networks. NIPS 2015.

  27. Spatial Transformer Networks. Fine-grained classification. Jaderberg et al., Spatial Transformer Networks. NIPS 2015.

  28. Visual Attention. Visual Question Answering. Zhu et al., Visual7w: Grounded Question Answering in Images. arXiv 2016.

  29. Visual Attention. Action recognition in videos: Sharma et al., Action Recognition Using Visual Attention. arXiv 2016. Salient object detection: Kuen et al., Recurrent Attentional Networks for Saliency Detection. CVPR 2016.

  30. Other Examples. Attention to scale for semantic segmentation: Chen et al., Attention to Scale: Scale-aware Semantic Image Segmentation. CVPR 2016. Semantic attention for image captioning: You et al., Image Captioning with Semantic Attention. CVPR 2016.

  31. Resources
      ● CS231n Lecture @ Stanford [slides][video]
      ● More on Reinforcement Learning
      ● Soft vs Hard attention
      ● Handwriting generation demo
      ● Spatial Transformer Networks - Slides & Video by Victor Campos
      ● Attention implementations:
        ○ Seq2seq in Keras
        ○ DRAW & Spatial Transformers in Keras
        ○ DRAW in Lasagne
        ○ DRAW in Tensorflow
