

  1. Show, Attend and Tell: Neural Image Caption Generation with Visual Attention Kelvin Xu*, Jimmy Ba†, Ryan Kiros†, Kyunghyun Cho*, Aaron Courville*, Ruslan Salakhutdinov†, Richard Zemel†, Yoshua Bengio* (*Université de Montréal, †University of Toronto; some figures from Hugo Larochelle) July 8, 2015 1 / 46

  2. Caption generation is another building block of scene understanding. Figure: levels of human activity understanding over task time, from low-level features and descriptors (~90 ms), through detection, segmentation, shape, and tracking of objects and actions (~150 ms), up to high-level understanding of situations: causality, functionality, social roles, goals and intentions (~1 sec); adapted from a figure by Fei-Fei Li 2 / 46

  3. What our model does: Figure: A bird flying over a body of water. 3 / 46

  4. Overview ◮ Recent work in image caption generation: encoder-decoder models from a number of groups (Berkeley, Google, Stanford, Toronto, others) ◮ Motivating our architecture from human attention ◮ Our proposed attention model: model description, quantitative/qualitative results 4 / 46

  5. This talk: ◮ Recent work in image caption generation: encoder-decoder models from a number of groups (Berkeley, Google, Stanford, Toronto, others) ◮ Motivating our architecture from human attention ◮ Our proposed attention model: model description, quantitative/qualitative results 5 / 46

  6. Recent surge of interest in image captioning ◮ Submissions on this topic at CVPR 2015 (from groups at Google, Berkeley, Stanford, Microsoft, etc.) ◮ Inspired by some successes in machine translation (Kalchbrenner et al. 2013, Sutskever et al. 2014, Cho et al. 2014) 6 / 46

  7. Theme: use a convnet to condition the caption generator Figure: from Karpathy et al. (2015) 7 / 46

  8. Figure: the Vinyals et al. (2015) model is quite similar 8 / 46

  9. This talk: ◮ Recent work in image caption generation: encoder-decoder models from a number of groups (Berkeley, Google, Stanford, Toronto, others) ◮ Motivating our architecture from human attention ◮ Our proposed attention model: model description, quantitative/qualitative results 9 / 46

  10. What are some things we know about human attention? 10 / 46

  11. (1) human vision is foveated & sequential ◮ Particular parts of an image come to the forefront ◮ It is a sequential decision process (“saccades”, glimpses) 11 / 46

  12. (2) bottom-up input influences attention Figure: from Borji and Itti (2013) [2] 12 / 46

  13. (3) top-down task-level control: mechanisms at work Figure: from Yarbus (1967) 13 / 46

  14. Summary: useful aspects of attention ◮ foveated visual field (spatial focus) ◮ sequential decision making (temporal dynamics) ◮ bottom-up input influence ◮ task-specific top-down modulation 14 / 46

  15. This talk: ◮ Recent work in image caption generation: encoder-decoder models from a number of groups (Berkeley, Google, Stanford, Toronto, others) ◮ Motivating our architecture from human attention ◮ Our proposed attention model: model description, quantitative/qualitative results 15 / 46

  16. Our proposed attention model ◮ “Low-level” convolutional feature extraction: $a = \{a_1, a_2, \ldots, a_L\}$ ◮ Compute the importance of each of these regions: $\alpha = \{\alpha_1, \alpha_2, \ldots, \alpha_L\}$ ◮ Combine $\alpha$ and $a$ to represent the image (context vector $\hat{z}_t$) 16 / 46

  17. A little more specifically: output = (a, man, is, jumping, into, a, lake, .) 17 / 46

  18. Convolutional feature extraction: output = (a, man, is, jumping, into, a, lake, .) [Diagram: a Convolutional Neural Network produces annotation vectors a_j] 18 / 46
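As a concrete sketch of this step: the annotation vectors come from the feature map of a convolutional layer (a 14×14×512 VGG map in the paper's setup), so each spatial position becomes one vector. A minimal numpy version, with a random array standing in for real CNN output:

```python
import numpy as np

def to_annotation_vectors(feature_map):
    """Flatten an (H, W, D) conv feature map into L = H*W annotation vectors."""
    H, W, D = feature_map.shape
    return feature_map.reshape(H * W, D)    # a: one a_j per spatial region

feature_map = np.random.randn(14, 14, 512)  # stand-in for CNN features
a = to_annotation_vectors(feature_map)      # shape (196, 512): L=196, D=512
```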

  19. Given an initial hidden state (predicted from the image): output = (a, man, is, jumping, into, a, lake, .) [Diagram: the annotation vectors a_j feed a recurrent state h_i] 19 / 46
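In the paper, that initial state is predicted from the mean of the annotation vectors via small MLPs, one for the hidden state and one for the memory cell. A rough sketch (single-layer here for brevity; the weight names are illustrative):

```python
import numpy as np

def init_state(a, W_init_h, W_init_c):
    """Predict the initial recurrent state from the image."""
    a_mean = a.mean(axis=0)          # average over the L regions
    h0 = np.tanh(a_mean @ W_init_h)  # initial hidden state
    c0 = np.tanh(a_mean @ W_init_c)  # initial memory cell
    return h0, c0
```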

  20. Predict the “importance” of each region: output = (a, man, is, jumping, into, a, lake, .) [Diagram: an attention mechanism scores the annotation vectors a_j given the recurrent state h_i, producing weights α_j with Σ α_j = 1] 20 / 46
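A sketch of one common form of the scoring function f_att: a small MLP over each region and the previous state, followed by a softmax, in the style of Bahdanau et al. The weight names W_a, W_h, v are illustrative, not the released code:

```python
import numpy as np

def softmax(e):
    e = e - e.max()                  # subtract max for numerical stability
    p = np.exp(e)
    return p / p.sum()

def attention_weights(a, h_prev, W_a, W_h, v):
    """Score each region a_i (bottom-up) against h_{t-1} (top-down)."""
    e = np.tanh(a @ W_a + h_prev @ W_h) @ v  # e_ti = f_att(a_i, h_{t-1})
    return softmax(e)                        # alpha_t: L weights summing to 1
```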

  21. Combine with the annotation vectors: output = (a, man, is, jumping, into, a, lake, .) [Diagram: the attention weights α_j are combined with the annotation vectors a_j into a single weighted image representation] 21 / 46
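Continuing the numpy sketch, the soft (deterministic) version of this combination is just the expected annotation vector under α:

```python
def soft_context(a, alpha):
    """Deterministic context: z_hat_t = sum_i alpha_ti * a_i."""
    return alpha @ a                 # (L,) @ (L, D) -> (D,)
```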

  22. Feed into the next hidden state and predict the next word: output = (a, man, is, jumping, into, a, lake, .) [Diagram: the weighted context enters the recurrent state h_i, from which the next word y_i is sampled] 22 / 46
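One full decode step, continuing the sketch: take the context ẑ_t from soft_context above, update the LSTM, and sample a word. The single stacked gate matrix W_lstm and the output matrix W_out are illustrative simplifications of the paper's parameterization:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_word_step(z_hat, y_emb, h, c, W_lstm, W_out):
    """LSTM update driven by [word, state, context], then a softmax
    over the vocabulary to pick the next word."""
    x = np.concatenate([y_emb, h, z_hat])         # inputs seen by every gate
    i, f, o, g = np.split(W_lstm @ x, 4)          # gate pre-activations
    c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)  # memory cell update
    h = sigmoid(o) * np.tanh(c)                   # new hidden state
    logits = W_out @ h
    p = np.exp(logits - logits.max())
    p /= p.sum()                                  # word distribution
    y = np.random.choice(len(p), p=p)             # sample the next word y_i
    return y, h, c
```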

  23. In the next step, we use the new hidden state: output = (a, man, is, jumping, into, a, lake, .) [Diagram: the same attend-combine-predict loop, now conditioned on the updated hidden state] 23 / 46

  24. Continue until the end of the sequence: output = (a, man, is, jumping, into, a, lake, .) [Diagram: the loop repeats, emitting one word y_i per step] 24 / 46

  25. The attention is driven by the recurrent state + image ◮ At every time step, compute the importance of each region from the top-down + bottom-up signals: $e_{ti} = f_{att}(a_i, h_{t-1})$, $\alpha_{ti} = \exp(e_{ti}) / \sum_{k=1}^{L} \exp(e_{tk})$ ◮ We use a softmax to constrain these weights to sum to 1 ◮ We explore two different ways to use this distribution to compute a meaningful image representation 25 / 46

  26. Stochastic or Deterministic? output = (a, man, is, jumping, into, a, lake, .) [Diagram: the attention weights α_j are used either stochastically or deterministically to form the context that drives h_i and the sampled word y_i] 26 / 46

  27. Quick note on our decoder: the LSTM (Hochreiter & Schmidhuber, 1997) [Diagram: LSTM cell with input gate i, forget gate f, output gate o, input modulator, and memory cell c producing h_t; each gate reads the context z_t, the previous hidden state h_{t-1}, and the previous word embedding Ey_{t-1}] 27 / 46
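Spelled out, the cell update has the usual LSTM form with the context ẑ_t entering every gate alongside the previous word embedding and hidden state (a reconstruction in the paper's spirit; weight names are illustrative):

```latex
\begin{aligned}
i_t &= \sigma(W_i E y_{t-1} + U_i h_{t-1} + Z_i \hat{z}_t + b_i) && \text{input gate}\\
f_t &= \sigma(W_f E y_{t-1} + U_f h_{t-1} + Z_f \hat{z}_t + b_f) && \text{forget gate}\\
o_t &= \sigma(W_o E y_{t-1} + U_o h_{t-1} + Z_o \hat{z}_t + b_o) && \text{output gate}\\
c_t &= f_t \odot c_{t-1} + i_t \odot \tanh(W_c E y_{t-1} + U_c h_{t-1} + Z_c \hat{z}_t + b_c)\\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
```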

  28. Deterministic (Soft) Attention ◮ Feed in an attention-weighted image input: $\hat{z}_t = \sum_{i=1}^{L} \alpha_{t,i} a_i$ ◮ This is what Graves (2013) [3] and Bahdanau et al. (2014) [1] did in handwriting generation and machine translation 28 / 46

  29. Alternatively: Stochastic (Hard) Attention ◮ Sample the attended location stochastically at every time step ◮ In RL terms, think of the softmax α as a Boltzmann policy: maximize the variational lower bound $L_s = \sum_s p(s \mid a) \log p(y \mid s, a) \le \log p(y \mid a)$ with the Monte Carlo gradient estimate $\frac{\partial L_s}{\partial W} \approx \frac{1}{N} \sum_{n=1}^{N} \left[ \frac{\partial \log p(y \mid \tilde{s}^n, a)}{\partial W} + \log p(y \mid \tilde{s}^n, a) \frac{\partial \log p(\tilde{s}^n \mid a)}{\partial W} \right]$ ◮ By Williams (1992), re-popularized recently by Mnih et al. (2014) and Ba et al. (2015) 29 / 46
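A minimal sketch of that estimator: sample attention locations from α and weight the score-function term by the resulting log-likelihood. The two gradient callables stand in for backprop through the decoder and through the attention distribution; in practice the paper also uses a baseline and an entropy term to reduce variance:

```python
import numpy as np

def hard_attention_grad(alpha, log_py_given_s, grad_log_py, grad_log_ps, N=10):
    """Monte Carlo estimate of dL_s/dW from N sampled locations s ~ alpha."""
    grad = 0.0
    for _ in range(N):
        s = np.random.choice(len(alpha), p=alpha)  # sample location s^n
        grad += grad_log_py(s) + log_py_given_s(s) * grad_log_ps(s)
    return grad / N                                # average over N samples
```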

  30. Quantitative Results 30 / 46

  31. A footnote on these metrics 31 / 46

  32. Under automatic metrics, humans are not great :( 32 / 46

  33. But human evaluation (Amazon Mechanical Turk) is quite different 33 / 46

  34. Stochastic or Deterministic? output = (a, man, is, jumping, into, a, lake, .) [Diagram: same schematic as slide 26, stochastic vs. deterministic use of the attention weights α_j] 34 / 46

  35. Visualizing our learned attention: the good 35 / 46

  36. Visualizing our learned attention: the bad 36 / 46

  37. Other fun things you can do: 37 / 46

  38. A soccer ball .. 38 / 46

  39. Two cakes on a plate.. 39 / 46

  40. Important previous work 40 / 46

  41. attention in machine translation Figure: also from UdeM lab (Bahdanau et al. 2014) [1] 41 / 46

  42. attention mechanism in handwritten character generation Figure: from Graves (2013) [3] 42 / 46

  43. Recently, many more.. 43 / 46

  44. Thanks for attending! 44 / 46

  45. Thanks for attending! Code: https://github.com/kelvinxu/arctic-captions 45 / 46

  46. [1] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014. [2] Ali Borji and Laurent Itti. State-of-the-art in visual attention modeling. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(1):185–207, 2013. [3] Alex Graves. Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850, 2013. 46 / 46
