


  1. Tutorial on Recent Advances in Visual Captioning Luowei Zhou 06/15/2020 1

  2. Outline • Problem Overview • Visual Captioning Taxonomy • Image Captioning • Datasets and Evaluation • Video Description • Grounded Caption Generation • Dense Caption Generation • Conclusion • Q&A 2

  3. Problem Overview • Visual Captioning – Describe the content of an image or video with a natural language sentence. Example captions: “A cat is sitting next to a pine tree, looking up.” “A dog is playing piano with a girl.” 3 Cat image is free to use under the Pixabay License. Dog video is free to use under the Creative Commons license.

  4. Applications of Visual Captioning • Alt-text generation (from PowerPoint) • Content-based image retrieval (CBIR) • Or just for fun! 4

  5. A fun video of a visual captioning model running in real time, made by Kyle McDonald. Source: https://vimeo.com/146492001 5

  6. Visual Captioning Taxonomy [taxonomy diagram] The field splits along two axes. Methods: Template-based, Retrieval-based, and Deep Learning-based (CNN-LSTM, Soft Attention, Region Attention, Semantic Attention, Transformer-based). Domain: Image Captioning; Video Captioning (short clips, long videos). 6

  7. Image Captioning with CNN-LSTM: “Show and Tell” • Problem Formulation • The Encoder-Decoder framework: a Visual Encoder feeds a Language Decoder, which outputs “Cat sitting outside”. 7 Image credit: Vinyals et al. “Show and Tell: A Neural Image Caption Generator”, CVPR 2015.
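A minimal PyTorch sketch of the encoder-decoder setup on this slide (illustrative only, not Vinyals et al.’s implementation; the ResNet-18 backbone, the dimensions, and the module names are assumptions):

    # Encoder-decoder captioner in the spirit of "Show and Tell": a CNN encodes
    # the image into one feature vector, which is fed to an LSTM language decoder
    # as if it were the first "word". All sizes are illustrative.
    import torch
    import torch.nn as nn
    import torchvision.models as models

    class ShowAndTellSketch(nn.Module):
        def __init__(self, vocab_size, embed_dim=512, hidden_dim=512):
            super().__init__()
            cnn = models.resnet18(weights=None)                        # any CNN backbone works
            self.encoder = nn.Sequential(*list(cnn.children())[:-1])   # drop the classifier head
            self.img_proj = nn.Linear(512, embed_dim)                  # CNN feature -> embedding space
            self.embed = nn.Embedding(vocab_size, embed_dim)
            self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
            self.out = nn.Linear(hidden_dim, vocab_size)

        def forward(self, images, captions):
            # images: (B, 3, H, W); captions: (B, T) token ids
            feats = self.encoder(images).flatten(1)              # (B, 512) global image feature
            img_emb = self.img_proj(feats).unsqueeze(1)          # (B, 1, E), the "image word"
            word_emb = self.embed(captions)                      # (B, T, E)
            inputs = torch.cat([img_emb, word_emb], dim=1)       # image first, then the words
            hidden, _ = self.lstm(inputs)
            return self.out(hidden)                              # (B, T+1, vocab) word logits

Training would then minimize cross-entropy between these logits and the shifted ground-truth caption.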

  8. Image Captioning with Soft Attention • Soft Attention – Dynamically attend to the input content based on a query. • Basic elements: a query, keys, and values. • In our case, keys and values are usually identical; they come from the CNN activation map. • The query is determined by the global image feature or the LSTM’s hidden states. Bahdanau et al. “Neural Machine Translation by Jointly Learning to Align and Translate”, ICLR 2015. Xu et al. “Show, Attend and Tell”, ICML 2015. 8

  9. Image Captioning with Soft Attention [figure: a CNN computes a 3×3 grid of features h_{1,1} … h_{3,3} for the image; the decoder starts from initial hidden state s_0] Use a CNN to compute a grid of features for an image. 9 Slide credit: UMich EECS 498/598 DeepVision course by Justin Johnson. Method: “Show, Attend and Tell” by Xu et al., ICML 2015.

  10. Image Captioning with Soft Attention [figure: alignment scores e_{1,i,j} are computed between the initial state s_0 and every grid feature] Alignment scores: e_{t,i,j} = f_att(s_{t-1}, h_{i,j}). Use a CNN to compute a grid of features for an image. 10 Slide credit: UMich EECS 498/598 DeepVision course by Justin Johnson. Method: “Show, Attend and Tell” by Xu et al., ICML 2015.

  11. Image Captioning with Soft Attention [figure: the alignment scores are normalized into attention weights a_{1,i,j}] Alignment scores: e_{t,i,j} = f_att(s_{t-1}, h_{i,j}). Attention weights: a_{t,:,:} = softmax(e_{t,:,:}). Use a CNN to compute a grid of features for an image. 11 Slide credit: UMich EECS 498/598 DeepVision course by Justin Johnson. Method: “Show, Attend and Tell” by Xu et al., ICML 2015.

  12. Image Captioning with Soft Attention [figure: the attention weights give a weighted sum of the grid features, the context vector c_1] Alignment scores: e_{t,i,j} = f_att(s_{t-1}, h_{i,j}). Attention weights: a_{t,:,:} = softmax(e_{t,:,:}). Context vector: c_t = Σ_{i,j} a_{t,i,j} h_{i,j}. Use a CNN to compute a grid of features for an image. 12 Slide credit: UMich EECS 498/598 DeepVision course by Justin Johnson. Method: “Show, Attend and Tell” by Xu et al., ICML 2015.
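A hedged sketch of the three equations above as one attention step (the slide does not fix the exact form of f_att; an additive, Bahdanau-style scorer is assumed here, and all names and dimensions are illustrative):

    # One soft-attention step: score every grid feature against the previous
    # decoder state, softmax the scores into weights, and take the weighted sum
    # of features as the context vector c_t.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SoftAttention(nn.Module):
        def __init__(self, feat_dim, hidden_dim, attn_dim=256):
            super().__init__()
            self.w_h = nn.Linear(feat_dim, attn_dim)    # projects grid features h_{i,j}
            self.w_s = nn.Linear(hidden_dim, attn_dim)  # projects decoder state s_{t-1}
            self.v = nn.Linear(attn_dim, 1)

        def forward(self, grid_feats, s_prev):
            # grid_feats: (B, N, feat_dim), the flattened H'×W' grid; s_prev: (B, hidden_dim)
            e = self.v(torch.tanh(self.w_h(grid_feats) + self.w_s(s_prev).unsqueeze(1)))  # (B, N, 1)
            a = F.softmax(e, dim=1)                     # attention weights a_{t,:,:}
            c = (a * grid_feats).sum(dim=1)             # context vector c_t, (B, feat_dim)
            return c, a.squeeze(-1)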

  13. Image Captioning with Soft Attention [figure: the first decoder step consumes c_1 and y_0 = [START], producing hidden state s_1 and the first word y_1 = “cat”] (Same equations as above.) Use a CNN to compute a grid of features for an image. 13 Slide credit: UMich EECS 498/598 DeepVision course by Justin Johnson. Method: “Show, Attend and Tell” by Xu et al., ICML 2015.

  14. Image Captioning with Soft Attention [figure: the first decoder step again, shown without the score and weight grids] (Same equations as above.) Use a CNN to compute a grid of features for an image. 14 Slide credit: UMich EECS 498/598 DeepVision course by Justin Johnson. Method: “Show, Attend and Tell” by Xu et al., ICML 2015.

  15. Image Captioning with Soft Attention [figure: at the second timestep, alignment scores e_{2,i,j} are computed between s_1 and the grid features] (Same equations as above.) Use a CNN to compute a grid of features for an image. 15 Slide credit: UMich EECS 498/598 DeepVision course by Justin Johnson. Method: “Show, Attend and Tell” by Xu et al., ICML 2015.

  16. Image Captioning with Soft Attention [figure: the second-timestep scores are normalized into attention weights a_{2,i,j}] (Same equations as above.) Use a CNN to compute a grid of features for an image. 16 Slide credit: UMich EECS 498/598 DeepVision course by Justin Johnson. Method: “Show, Attend and Tell” by Xu et al., ICML 2015.

  17. Image Captioning with Soft Attention [figure: the second-timestep weights produce a new context vector c_2] (Same equations as above.) Use a CNN to compute a grid of features for an image. 17 Slide credit: UMich EECS 498/598 DeepVision course by Justin Johnson. Method: “Show, Attend and Tell” by Xu et al., ICML 2015.

  18. Image Captioning with Soft Attention [figure: the second decoder step consumes c_2 and y_1 = “cat”, producing state s_2 and the next word y_2 = “sitting”] (Same equations as above.) Use a CNN to compute a grid of features for an image. 18 Slide credit: UMich EECS 498/598 DeepVision course by Justin Johnson. Method: “Show, Attend and Tell” by Xu et al., ICML 2015.

  19. Image Captioning with Soft Attention Each timestep of the decoder uses a different context vector that looks at different parts of the input image. [figure: the full unrolled decoder produces “cat sitting outside [STOP]” from states s_1 … s_4 and context vectors c_1 … c_4] Use a CNN to compute a grid of features for an image. 19 Slide credit: UMich EECS 498/598 DeepVision course by Justin Johnson. Method: “Show, Attend and Tell” by Xu et al., ICML 2015.
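The unrolled decoding on this slide can be written as a loop. Below is a hedged greedy-decoding sketch that reuses the SoftAttention module above; the LSTMCell input size, the embed/out modules, and the start/stop token ids are assumptions, not the authors’ code:

    # Greedy decoding: at every step the previous hidden state queries the grid,
    # giving a fresh context vector c_t that is concatenated with the previous
    # word embedding and fed to an LSTM cell; the most likely word is emitted.
    import torch

    def greedy_decode(grid_feats, attend, lstm_cell, embed, out,
                      start_id, stop_id, max_len=20):
        B = grid_feats.size(0)
        s = torch.zeros(B, lstm_cell.hidden_size)             # s_0
        cell = torch.zeros_like(s)
        word = torch.full((B,), start_id, dtype=torch.long)   # y_0 = [START]
        tokens = []
        for _ in range(max_len):
            c, _ = attend(grid_feats, s)                # a different context vector each step
            x = torch.cat([embed(word), c], dim=-1)     # previous word + context
            s, cell = lstm_cell(x, (s, cell))           # lstm_cell: nn.LSTMCell(embed_dim + feat_dim, hidden_dim)
            word = out(s).argmax(dim=-1)                # greedy choice of y_t
            tokens.append(word)
            if (word == stop_id).all():
                break
        return torch.stack(tokens, dim=1)               # (B, T) generated token ids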

  20. Image Captioning with Soft Attention 20 Slide credit: UMich EECS 498/598 DeepVision course by Justin Johnson. Method: “Show, Attend and Tell” by Xu et al. ICML 2015.

  21. Image Captioning with Soft Attention 21 Slide credit: UMich EECS 498/598 DeepVision course by Justin Johnson. Method: “Show, Attend and Tell” by Xu et al. ICML 2015.

  22. Image Captioning with Region Attention • Variants of Soft Attention based on the feature input: • Grid activation features (covered) • Region proposal features (Faster R-CNN) 22
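One point worth making explicit: the attention module is agnostic to where its keys/values come from, so moving from grid features to Faster R-CNN region proposal features only changes the first input. A small usage sketch reusing the SoftAttention class from above (all shapes are illustrative; 36 regions per image is just a common choice, not a requirement):

    import torch

    attend = SoftAttention(feat_dim=2048, hidden_dim=512)
    s_prev = torch.randn(2, 512)                 # decoder state for a batch of 2

    grid_feats = torch.randn(2, 7 * 7, 2048)     # flattened CNN activation grid
    region_feats = torch.randn(2, 36, 2048)      # Faster R-CNN region proposal features

    c_grid, a_grid = attend(grid_feats, s_prev)        # attention over grid cells
    c_region, a_region = attend(region_feats, s_prev)  # same module, attention over regions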

  23. Image Captioning with “Fancier” Attention • Semantic attention: visual attributes • Adaptive Attention: knowing when (and when not) to attend to the image You et al. “Image Captioning with Semantic Attention”, CVPR 2016. Yao et al. “Boosting Image Captioning with Attributes”, ICCV 2017. Lu et al. “Knowing When to Look: Adaptive Attention via a Visual Sentinel for Image Captioning”, CVPR 2017. 23

  24. Image Captioning with “Fancier” Attention • Attention on Attention • X-Linear Attention: spatial and channel-wise bilinear attention Huang et al. “Attention on Attention for Image Captioning”, ICCV 2019. Pan et al. “X-Linear Attention Networks for Image Captioning”, CVPR 2020. 24

  25. Image Captioning with “Fancier” Attention • Hierarchy Parsing and GCNs: hierarchical tree structure in the image • Auto-Encoding Scene Graphs: scene graphs in image and text Yao et al. “Hierarchy Parsing for Image Captioning”, ICCV 2019. Yang et al. “Auto-Encoding Scene Graphs for Image Captioning”, CVPR 2019. 25
