Tutorial on Recent Advances in Visual Captioning (Luowei Zhou)

SLIDE 1

Tutorial on Recent Advances in Visual Captioning

Luowei Zhou 06/15/2020

SLIDE 2

Outline

  • Problem Overview
  • Visual Captioning Taxonomy
  • Image Captioning
  • Datasets and Evaluation
  • Video Description
  • Grounded Caption Generation
  • Dense Caption Generation
  • Conclusion
  • Q&A

SLIDE 3

Problem Overview

  • Visual Captioning – Describe the content of an image or video with a natural language sentence.

Cat image is free to use under the Pixabay License. Dog video is free to use under the Creative Commons license.

Example captions: “A cat is sitting next to a pine tree, looking up.” “A dog is playing piano with a girl.”

SLIDE 4

Applications of Visual Captioning

  • Alt-text generation (e.g., in PowerPoint)
  • Content-based image retrieval (CBIR)
  • Or just for fun!

SLIDE 5

A fun video by Kyle McDonald of a visual captioning model running in real time. Source: https://vimeo.com/146492001

SLIDE 6

Visual Captioning Taxonomy

Visual Captioning

  • Domain
    • Image
    • Video (short clips, long videos)
  • Methods
    • Template-based
    • Retrieval-based
    • Deep Learning-based
      • CNN-LSTM (soft attention, semantic attention, region attention)
      • Transformer-based

SLIDE 7

Image Captioning with CNN-LSTM

  • Problem Formulation
  • The Encoder-Decoder framework


Image credit: Vinyals et al. “Show and Tell: A Neural Image Caption Generator”, CVPR 2015.

Diagram (“Show and Tell”): image → Visual Encoder → Language Decoder → “Cat sitting outside”
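As a concrete illustration, here is a minimal PyTorch sketch of this encoder-decoder idea; the layer sizes and module names are illustrative assumptions, not the “Show and Tell” implementation.

```python
# Minimal CNN-LSTM captioner sketch (illustrative only; hyperparameters are assumed).
import torch
import torch.nn as nn
import torchvision.models as models

class CNNLSTMCaptioner(nn.Module):
    def __init__(self, vocab_size, embed_dim=512, hidden_dim=512):
        super().__init__()
        # Visual encoder: a pretrained CNN backbone; its pooled feature seeds the decoder.
        backbone = models.resnet50(weights="IMAGENET1K_V1")
        self.encoder = nn.Sequential(*list(backbone.children())[:-1])  # drop the classifier head
        self.feat_proj = nn.Linear(2048, embed_dim)
        # Language decoder: an LSTM that predicts one word per timestep.
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, images, captions):
        feats = self.feat_proj(self.encoder(images).flatten(1))   # (B, embed_dim) global image feature
        words = self.embed(captions)                              # (B, T, embed_dim) word embeddings
        inputs = torch.cat([feats.unsqueeze(1), words], dim=1)    # image feature acts as the first "word"
        hidden, _ = self.lstm(inputs)
        return self.out(hidden)                                   # (B, T+1, vocab_size) next-word logits
```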

SLIDE 8

Image Captioning with Soft Attention

  • Soft Attention – Dynamically attend to input content based on query.
  • Basic elements: query 𝑟, keys 𝐿, and values 𝑊
  • In our case, keys and values are usually identical; they come from the CNN activation map.
  • Query 𝑟 is determined by the global image feature or the LSTM’s hidden states.

Bahdanau et al. “Neural Machine Translation by Jointly Learning to Align and Translate”, ICLR 2015. Xu et al. “Show, Attend and Tell”, ICML 2015.
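A minimal sketch of this query/key/value computation follows; dot-product scoring is assumed here for brevity, whereas the actual f_att in “Show, Attend and Tell” is a small MLP.

```python
# Generic soft attention: score each key with the query, softmax, weighted-sum the values.
import torch
import torch.nn.functional as F

def soft_attention(query, keys, values):
    """query: (B, d); keys, values: (B, N, d). For captioning, keys == values == CNN features."""
    scores = torch.bmm(keys, query.unsqueeze(-1)).squeeze(-1)        # (B, N) alignment scores
    weights = F.softmax(scores, dim=-1)                              # attention distribution over locations
    context = torch.bmm(weights.unsqueeze(1), values).squeeze(1)     # (B, d) attended context vector
    return context, weights
```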

SLIDE 9

Image Captioning with Soft Attention


Slide credit: UMich EECS 498/598 DeepVision course by Justin Johnson. Method: “Show, Attend and Tell” by Xu et al. ICML 2015.

Use a CNN to compute a grid of features h_{i,j} for the image; s_0 is the decoder’s initial hidden state.

SLIDE 10

Image Captioning with Soft Attention


Slide credit: UMich EECS 498/598 DeepVision course by Justin Johnson. Method: “Show, Attend and Tell” by Xu et al. ICML 2015.

From the initial state s_0 and each grid feature, compute alignment scores e_{t,i,j} = f_att(s_{t-1}, h_{i,j}).

SLIDE 11

Image Captioning with Soft Attention


Slide credit: UMich EECS 498/598 DeepVision course by Justin Johnson. Method: “Show, Attend and Tell” by Xu et al. ICML 2015.

Normalize the alignment scores with a softmax over all grid positions to obtain attention weights a_{t,:,:} = softmax(e_{t,:,:}).

SLIDE 12

Image Captioning with Soft Attention


Slide credit: UMich EECS 498/598 DeepVision course by Justin Johnson. Method: “Show, Attend and Tell” by Xu et al. ICML 2015.

Compute the context vector as the attention-weighted sum of the grid features: c_t = ∑_{i,j} a_{t,i,j} h_{i,j}.

SLIDE 13

Image Captioning with Soft Attention


Slide credit: UMich EECS 498/598 DeepVision course by Justin Johnson. Method: “Show, Attend and Tell” by Xu et al. ICML 2015.

Feed the context vector c_1 and the start token y_0 = [START] to the decoder to produce the next state s_1 and the first word, “cat”.

SLIDE 14

Image Captioning with Soft Attention


Slide credit: UMich EECS 498/598 DeepVision course by Justin Johnson. Method: “Show, Attend and Tell” by Xu et al. ICML 2015.

The same attention computation now repeats, with the new decoder state s_1 acting as the query.

SLIDE 15

Image Captioning with Soft Attention


Slide credit: UMich EECS 498/598 DeepVision course by Justin Johnson. Method: “Show, Attend and Tell” by Xu et al. ICML 2015.

Compute new alignment scores e_{2,i,j} = f_att(s_1, h_{i,j}) from the updated decoder state.

SLIDE 16

Image Captioning with Soft Attention


Slide credit: UMich EECS 498/598 DeepVision course by Justin Johnson. Method: “Show, Attend and Tell” by Xu et al. ICML 2015.

Apply the softmax to obtain the new attention weights a_{2,:,:}.

SLIDE 17

Image Captioning with Soft Attention


Slide credit: UMich EECS 498/598 DeepVision course by Justin Johnson. Method: “Show, Attend and Tell” by Xu et al. ICML 2015.

Form the new context vector c_2 as the attention-weighted sum of the grid features.

SLIDE 18

Image Captioning with Soft Attention


Slide credit: UMich EECS 498/598 DeepVision course by Justin Johnson. Method: “Show, Attend and Tell” by Xu et al. ICML 2015.

Feed c_2 and the previous word y_1 (“cat”) to the decoder to produce s_2 and the next word, “sitting”.

SLIDE 19

Image Captioning with Soft Attention


Slide credit: UMich EECS 498/598 DeepVision course by Justin Johnson. Method: “Show, Attend and Tell” by Xu et al. ICML 2015.

Decoding continues (“cat”, “sitting”, “outside”, [STOP]). Each timestep of the decoder uses a different context vector that looks at a different part of the input image:

e_{t,i,j} = f_att(s_{t-1}, h_{i,j}),  a_{t,:,:} = softmax(e_{t,:,:}),  c_t = ∑_{i,j} a_{t,i,j} h_{i,j}
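Putting the three equations together, here is a sketch of one decoding step; the dimensions and the MLP form of f_att are assumptions for illustration.

```python
# One "Show, Attend and Tell"-style decoding step: score the grid, attend, then update the LSTM.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionDecoderStep(nn.Module):
    def __init__(self, feat_dim=512, hidden_dim=512, embed_dim=512, vocab_size=10000):
        super().__init__()
        self.f_att = nn.Sequential(nn.Linear(feat_dim + hidden_dim, hidden_dim),
                                   nn.Tanh(),
                                   nn.Linear(hidden_dim, 1))
        self.cell = nn.LSTMCell(embed_dim + feat_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, h_grid, s_prev, c_prev, y_prev_embed):
        """h_grid: (B, N, feat_dim) flattened grid features; s_prev, c_prev: (B, hidden_dim) LSTM state."""
        B, N, _ = h_grid.shape
        query = s_prev.unsqueeze(1).expand(B, N, s_prev.size(-1))
        e = self.f_att(torch.cat([h_grid, query], dim=-1)).squeeze(-1)   # e_{t,i,j} = f_att(s_{t-1}, h_{i,j})
        a = F.softmax(e, dim=-1)                                         # a_{t,:,:} = softmax(e_{t,:,:})
        ctx = (a.unsqueeze(-1) * h_grid).sum(dim=1)                      # c_t = sum_{i,j} a_{t,i,j} h_{i,j}
        s, c = self.cell(torch.cat([y_prev_embed, ctx], dim=-1), (s_prev, c_prev))
        return self.out(s), s, c, a                                      # word logits, new state, attention map
```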

SLIDE 20

Image Captioning with Soft Attention


Slide credit: UMich EECS 498/598 DeepVision course by Justin Johnson. Method: “Show, Attend and Tell” by Xu et al. ICML 2015.

SLIDE 21

Image Captioning with Soft Attention


Slide credit: UMich EECS 498/598 DeepVision course by Justin Johnson. Method: “Show, Attend and Tell” by Xu et al. ICML 2015.

SLIDE 22

Image Captioning with Region Attention

  • Variants of Soft Attention based on the feature input
  • Grid activation features (covered)
  • Region proposal features (e.g., from a Faster R-CNN detector)

SLIDE 23

Image Captioning with “Fancier” Attention

Semantic attention

  • Visual attributes

Adaptive Attention

  • Knowing when (and when not) to attend to the image

You et al. “Image captioning with semantic attention”, CVPR 2016. Yao et al. “Boosting Image Captioning with Attributes”, ICCV 2017. Lu et al. “Knowing When to Look: Adaptive Attention via A Visual Sentinel for Image Captioning”, CVPR 2017.

SLIDE 24

Image Captioning with “Fancier” Attention

Attention on Attention

X-Linear Attention

  • Spatial and channel-wise bilinear attention

Huang et al. “Attention on Attention for Image Captioning”, ICCV 2019. Pan et al. “X-Linear Attention Networks for Image Captioning”, CVPR 2020.

SLIDE 25

Image Captioning with “Fancier” Attention

Hierarchy Parsing and GCNs

  • Hierarchical tree structure in the image

Auto-Encoding Scene Graphs

  • Scene graphs in image and text

Yao et al. “Hierarchy Parsing for Image Captioning”, ICCV 2019. Yang et al. “Auto-encoding scene graphs for image captioning”, CVPR 2019.

SLIDE 26

Image Captioning with Transformer

  • Transformer performs sequence-to-sequence generation.
  • Self-Attention – A type of soft attention that “attends to itself”.
  • Self-Attention is a special case of Graph Neural Networks (GNNs) on a fully-connected graph.
  • Self-attention is sometimes used to model relationships between object regions, similar to GCNs.

Vaswani et al. “Attention is all you need”, NIPS 2017. Yao et al. “Exploring visual relationship for image captioning”, ECCV 2018. Further readings: https://graphdeeplearning.github.io/post/transformers-are-gnns/
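As a minimal sketch, single-head self-attention over a set of region features looks as follows; real captioning Transformers add multi-head attention, feed-forward layers, residual connections, and layer normalization.

```python
# Single-head self-attention: every region attends to every other region (a fully-connected graph).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttention(nn.Module):
    def __init__(self, dim=512):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        self.scale = dim ** -0.5

    def forward(self, x):                                                # x: (B, N, dim) region/grid features
        q, k, v = self.q(x), self.k(x), self.v(x)
        attn = F.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)   # (B, N, N) pairwise weights
        return attn @ v                                                  # contextualized region features
```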

SLIDE 27

Image Captioning with Transformer

  • The Transformer was first adapted for captioning in Zhou et al.
  • Others: Object Relation Transformer, Meshed-Memory Transformer

Zhou et al. “End-to-end dense video captioning with masked transformer”, CVPR 2018. Herdade et al. “Image Captioning: Transforming Objects into Words”, NeurIPS 2019. Cornia et al. “Meshed-Memory Transformer for Image Captioning”, CVPR 2020.

SLIDE 28

Vision-Language Pre-training (VLP)

  • Two-stage training strategy: pre-training and fine-tuning.
  • Pre-training is performed on a large dataset, usually with auto-generated captions; the training objective is unsupervised.
  • Fine-tuning is task-specific supervised training on downstream tasks.
  • All methods are based on BERT (a variant of the Transformer).

Zhou et al. “Unified vision-language pre-training for image captioning and vqa”, AAAI 2020. Devlin et al. “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding”, NAACL 2019.
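As a rough sketch of the unsupervised objective, a BERT-style masked-word prediction step might look like the following; the model interface, masking ratio, and loss details are assumptions, and each VLP method differs in its exact objectives.

```python
# Masked-language-modeling step on (image regions, caption) pairs -- illustrative only.
import torch
import torch.nn.functional as F

def masked_lm_step(model, region_feats, caption_ids, mask_id, mask_prob=0.15):
    """region_feats: (B, N, d) detector features; caption_ids: (B, T) token ids."""
    labels = caption_ids.clone()
    mask = torch.rand_like(caption_ids, dtype=torch.float) < mask_prob
    inputs = caption_ids.masked_fill(mask, mask_id)           # replace some words with [MASK]
    labels[~mask] = -100                                       # score only the masked positions
    logits = model(region_feats, inputs)                       # (B, T, vocab) -- assumed interface
    return F.cross_entropy(logits.view(-1, logits.size(-1)), labels.view(-1), ignore_index=-100)
```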

SLIDE 29

Vision-Language Pre-training (VLP)

Separate Encoder-Decoder

  • Methods: VideoBERT and Oscar
  • Only the encoder is pre-trained

Unified Encoder-Decoder

  • Methods: Unified VLP
  • Both encoder and decoder are pre-trained

Sun et al. “VideoBERT: A joint model for video and language representation learning,” ICCV 2019. Li et al. “Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks,” arXiv 2020. Zhou et al. “Unified vision-language pre-training for image captioning and VQA”, AAAI 2020.

SLIDE 30

Evaluation – Benchmark Dataset

COCO Captions

  • Train / val / test: 113k / 5k / 5k
  • Hidden test (leaderboard): 40k
  • Vocabulary (≥ 5 occurrences): 9,587
  • Most-adopted!

Flickr30K

  • Train / val / test: 29k / 1k / 1k
  • Vocabulary (≥ 5 occurrences): 6,864

Chen et al. “Microsoft coco captions: Data collection and evaluation server”, arXiv 2015.

SLIDE 31


Image credit: Chen et al. “Microsoft coco captions: Data collection and evaluation server”, arXiv 2015.

SLIDE 32

Evaluation – Metrics

  • Most commonly-used: BLEU / METEOR / CIDEr / SPICE
  • BLEU: n-gram precision
  • METEOR: sensitive to word ordering, via unigram matching
  • CIDEr: gives more weight to important n-grams through TF-IDF
  • SPICE: F1-score over caption scene-graph tuples
  • Further reading: Sanja Fidler’s lecture slides http://www.cs.toronto.edu/~fidler/slides/2017/CSC2539/Kaustav_slides.pdf
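As a toy illustration of an n-gram metric, BLEU@4 for one hypothesis can be computed with NLTK as below; published COCO numbers use the official coco-caption toolkit instead.

```python
# Toy BLEU@4 computation for a single caption against two references.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

references = [["a", "cat", "is", "sitting", "next", "to", "a", "tree"],
              ["a", "cat", "sits", "by", "a", "pine", "tree"]]
hypothesis = ["a", "cat", "is", "sitting", "by", "a", "tree"]

# Geometric mean of 1- to 4-gram precisions, with smoothing for short sentences.
score = sentence_bleu(references, hypothesis,
                      weights=(0.25, 0.25, 0.25, 0.25),
                      smoothing_function=SmoothingFunction().method1)
print(f"BLEU@4 = {score:.3f}")
```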

SLIDE 33

Evaluation – Results on COCO

Method                       BLEU@4   METEOR   CIDEr   SPICE
CNN-LSTM                     20.3     –        –       –
Soft Attention               24.3     23.9     –       –
Semantic Attention           30.4     24.3     –       –
Adaptive Attention           32.5     26.6     108.5   19.5
Region Attention*            36.3     27.7     120.1   21.4
Attention on Attention*      38.9     29.2     129.8   22.4
Transformer (vanilla)*       38.2     28.9     128.4   22.2
M² Transformer*              39.1     29.2     131.2   22.6
X-Transformer*               39.7     29.5     132.8   23.4
VLP (with pre-training)*     39.5     29.3     129.3   23.2
Oscar (with pre-training)*   41.7     30.6     140.0   24.5

Note that all methods use a single model. * Indicates with CIDEr optimization

SLIDE 34

Image Captioning – Other Topics

  • Dense Captioning
  • Novel Object Captioning
  • Stylized Captioning (GAN)
  • RL-based (e.g., SCST)

SLIDE 35

Video Captioning

  • Now, we extend our scope to the video domain.

Image credit: Aafaq et al. “Video Description: A Survey of Methods, Datasets and Evaluation Metrics” ACM Computing Surveys 2019.

SLIDE 36

Video Description

  • Method-wise, almost no difference! (encoder-decoder, attention, Transformer, etc.)
  • The temporal information is aggregated through the methods illustrated in the survey figure cited below.

Image credit: Aafaq et al. “Video Description: A Survey of Methods, Datasets and Evaluation Metrics” ACM Computing Surveys 2019.
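Two common aggregation choices (a rough sketch, not tied to a specific paper) are mean pooling over frames and temporal soft attention driven by the decoder state:

```python
# Aggregate per-frame features into a single video representation.
import torch
import torch.nn.functional as F

def aggregate_frames(frame_feats, query=None):
    """frame_feats: (B, T, d) per-frame CNN features; query: (B, d) decoder state, or None."""
    if query is None:
        return frame_feats.mean(dim=1)                                    # simple mean pooling over time
    scores = torch.bmm(frame_feats, query.unsqueeze(-1)).squeeze(-1)      # (B, T) frame relevance scores
    weights = F.softmax(scores, dim=-1)
    return torch.bmm(weights.unsqueeze(1), frame_feats).squeeze(1)        # temporal soft attention
```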

SLIDE 37

Description alone might fail...

See notes for the image credits.

Description: A bottle of ketchup and a bottle of sriracha are on a table.

SLIDE 38

Description alone might fail...

See notes for the image credits.

Description: A bottle of ketchup and a bottle of sriracha are on a table.

SLIDE 39

Grounded Visual Description

  • Essentially, visual description + object grounding or detection
  • In the image domain, Neural Baby Talk
  • In the video domain, Grounded Video Description
  • Requires a special dataset that has both descriptions and bounding boxes

Lu et al. “Neural Baby Talk”, CVPR 2018. Zhou et al. “Grounded video description”, CVPR 2019.

SLIDE 40

Single-Frame Annotation


From ActivityNet-Entities dataset. Zhou et al. “Grounded video description”, CVPR 2019.

We see a man playing a saxophone in front of microphones.

SLIDE 41

Multi-Frame Annotation


From ActivityNet-Entities dataset. Zhou et al. “Grounded video description”, CVPR 2019.

Two women are on a tennis court, showing the technique to posing and hitting the ball.

SLIDE 42

Grounded Video Description (GVD) model

  • Architecture: grounding module + caption decoder
  • Grounding happens simultaneously with caption generation.
  • GVD adopts three proxy tasks to leverage the BBox annotations:
  • Supervised attention
  • Supervised grounding
  • Region classification
  • Details: https://www.youtube.com/watch?v=7AVCgn21noM


Zhou et al. “Grounded video description”, CVPR 2019.

SLIDE 43

Video Description

  • The Encoder-Decoder framework works fairly well for images and short video clips.

  • How about long videos?
  • The average video length on YouTube is 4.4 minutes!

SLIDE 44

Video Paragraph Description


Yu et al. “Video Paragraph Captioning Using Hierarchical Recurrent Neural Networks” CVPR 2016.

Add chopped bacon to a hot pan and stir. Remove the bacon from the pan. Place the beef into a hot pan to brown. Add onion and carrots to the pan. Pour the meat back into the pan and add flour. Place the pan into the oven. Add bay leaves thyme red wine beef stock garlic and tomato paste to the pan and boil. Add pearl onions to a hot pan and add beef stock bay leaf and thyme. Add mushrooms to a hot pan. Add the mushrooms and pearl onions to the meat...

SLIDE 45

Dense Video Description

  • Objective – Localize and describe events from a video.
  • Input: video. Output: triplets of (event start time, end time, description)

Image credit: Aafaq et al. “Video Description: A Survey of Methods, Datasets and Evaluation Metrics”, ACM Computing Surveys 2019.

SLIDE 46

Dense Video Description

  • Existing methods usually contain two modules: event proposal and video description.

Krishna et al. “Dense-captioning events in videos”, ICCV 2017. Zhou et al. “End-to-end dense video captioning with masked transformer”, CVPR 2018.

Figure: three training schemes for the two modules – Separate Training (the language decoder is trained on ground-truth video clips, independently of the proposal module), Alternating Training (a shared video encoder, with the proposal decoder and the language decoder trained in alternation), and End-to-End Training (a shared video encoder with the proposal and language decoders trained jointly).
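A minimal sketch of the two-module pipeline is shown below; the encoder and decoder interfaces are assumed for illustration and do not correspond to a specific implementation.

```python
# Dense video captioning: propose event segments, then caption each proposed segment.
def dense_caption(video_feats, encoder, proposal_decoder, language_decoder):
    """Returns triplets of (start frame, end frame, description)."""
    enc = encoder(video_feats)                    # (T, d) shared video encoding
    proposals = proposal_decoder(enc)             # [(start, end), ...] frame-index event spans
    results = []
    for start, end in proposals:
        clip_enc = enc[start:end]                 # features restricted to the proposed event
        results.append((start, end, language_decoder(clip_enc)))
    return results
```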

SLIDE 47

Dense Video Description

  • End-to-End Masked Transformer


Zhou et al. “End-to-end dense video captioning with masked transformer”, CVPR 2018.

SLIDE 48

Conclusions

  • We have seen rapid progress in the field…
  • On COCO Captions, CIDEr goes from <100 to 140
  • Motivation is important. Avoid piling up “Legos”.
  • To achieve better result interpretability, we need grounding.
  • Towards generalizable and robust models, pre-training is one option.

SLIDE 49

Limitations

  • Still a long way to go before production-ready due to…
  • Recognition failure
  • Object hallucination
  • Model bias
  • Remedies: better features, better grounding/detection, alleviating biases

Rohrbach et al. “Object hallucination in image captioning,” EMNLP 2018. Burns et al. “Women also Snowboard: Overcoming Bias in Captioning Models,” ECCV 2018.

SLIDE 50

Future Directions

  • Evaluation metrics that correlate better with human judgement.
  • Revisit grid features and simplify the model pipeline.
  • In vision-language pre-training, how to close the gap between the pre-training domain and the downstream domain.

SLIDE 51

Thank you! Any questions?
