Tutorial on Recent Advances in Visual Captioning (Luowei Zhou)

SLIDE 1

Tutorial on Recent Advances in Visual Captioning

Luowei Zhou 06/15/2020

SLIDE 2

Outline

  • Problem Overview
  • Visual Captioning Taxonomy
  • Image Captioning
  • Datasets and Evaluation
  • Video Description
  • Grounded Caption Generation
  • Dense Caption Generation
  • Conclusion
  • Q&A

SLIDE 3

Problem Overview

  • Visual Captioning – Describe the content of an image or video with a natural language sentence.

Cat image is free to use under the Pixabay License. Dog video is free to use under the Creative Commons license.

Example captions: “A cat is sitting next to a pine tree, looking up.” “A dog is playing piano with a girl.”

SLIDE 4

Applications of Visual Captioning

  • Alt-text generation (e.g., in PowerPoint)
  • Content-based image retrieval (CBIR)
  • Or just for fun!

SLIDE 5

A fun video by Kyle McDonald of a visual captioning model running in real time. Source: https://vimeo.com/146492001

SLIDE 6

Visual Captioning Taxonomy

Visual Captioning

  • Domain
    • Image
    • Video (short clips, long videos)
  • Methods
    • Template-based
    • Retrieval-based
    • Deep Learning-based
      • CNN-LSTM (soft attention, semantic attention, region attention)
      • Transformer-based

SLIDE 7

Image Captioning with CNN-LSTM

  • Problem Formulation
  • The Encoder-Decoder framework


Image credit: Vinyals et al. “Show and Tell: A Neural Image Caption Generator”, CVPR 2015.

Diagram (“Show and Tell”): image → Visual Encoder → Language Decoder → “Cat sitting outside”
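As a concrete illustration, here is a minimal PyTorch sketch of this encoder-decoder idea; the layer sizes and module names are illustrative assumptions, not the “Show and Tell” implementation.

```python
# Minimal CNN-LSTM captioner sketch (illustrative only; hyperparameters are assumed).
import torch
import torch.nn as nn
import torchvision.models as models

class CNNLSTMCaptioner(nn.Module):
    def __init__(self, vocab_size, embed_dim=512, hidden_dim=512):
        super().__init__()
        # Visual encoder: a pretrained CNN backbone; its pooled feature seeds the decoder.
        backbone = models.resnet50(weights="IMAGENET1K_V1")
        self.encoder = nn.Sequential(*list(backbone.children())[:-1])  # drop the classifier head
        self.feat_proj = nn.Linear(2048, embed_dim)
        # Language decoder: an LSTM that predicts one word per timestep.
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, images, captions):
        feats = self.feat_proj(self.encoder(images).flatten(1))   # (B, embed_dim) global image feature
        words = self.embed(captions)                              # (B, T, embed_dim) word embeddings
        inputs = torch.cat([feats.unsqueeze(1), words], dim=1)    # image feature acts as the first "word"
        hidden, _ = self.lstm(inputs)
        return self.out(hidden)                                   # (B, T+1, vocab_size) next-word logits
```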

SLIDE 8

Image Captioning with Soft Attention

  • Soft Attention – Dynamically attend to input content based on query.
  • Basic elements: query 𝑟, keys 𝐿, and values 𝑊
  • In our case, keys and values are usually identical; they come from the CNN activation map.
  • Query 𝑟 is determined by the global image feature or the LSTM’s hidden states.

Bahdanau et al. “Neural Machine Translation by Jointly Learning to Align and Translate”, ICLR 2015. Xu et al. “Show, Attend and Tell”, ICML 2015.
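A minimal sketch of this query/key/value computation follows; dot-product scoring is assumed here for brevity, whereas the actual f_att in “Show, Attend and Tell” is a small MLP.

```python
# Generic soft attention: score each key with the query, softmax, weighted-sum the values.
import torch
import torch.nn.functional as F

def soft_attention(query, keys, values):
    """query: (B, d); keys, values: (B, N, d). For captioning, keys == values == CNN features."""
    scores = torch.bmm(keys, query.unsqueeze(-1)).squeeze(-1)        # (B, N) alignment scores
    weights = F.softmax(scores, dim=-1)                              # attention distribution over locations
    context = torch.bmm(weights.unsqueeze(1), values).squeeze(1)     # (B, d) attended context vector
    return context, weights
```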

SLIDE 9

Image Captioning with Soft Attention


Slide credit: UMich EECS 498/598 DeepVision course by Justin Johnson. Method: “Show, Attend and Tell” by Xu et al. ICML 2015.

Use a CNN to compute a grid of features h_{i,j} for the image; s_0 is the decoder’s initial hidden state.

SLIDE 10

Image Captioning with Soft Attention


Slide credit: UMich EECS 498/598 DeepVision course by Justin Johnson. Method: “Show, Attend and Tell” by Xu et al. ICML 2015.

From the initial state s_0 and each grid feature, compute alignment scores e_{t,i,j} = f_att(s_{t-1}, h_{i,j}).

SLIDE 11

Image Captioning with Soft Attention


Slide credit: UMich EECS 498/598 DeepVision course by Justin Johnson. Method: “Show, Attend and Tell” by Xu et al. ICML 2015.

Normalize the alignment scores with a softmax over all grid positions to obtain attention weights a_{t,:,:} = softmax(e_{t,:,:}).

SLIDE 12

Image Captioning with Soft Attention


Slide credit: UMich EECS 498/598 DeepVision course by Justin Johnson. Method: “Show, Attend and Tell” by Xu et al. ICML 2015.

Compute the context vector as the attention-weighted sum of the grid features: c_t = ∑_{i,j} a_{t,i,j} h_{i,j}.

SLIDE 13

Image Captioning with Soft Attention


Slide credit: UMich EECS 498/598 DeepVision course by Justin Johnson. Method: “Show, Attend and Tell” by Xu et al. ICML 2015.

Feed the context vector c_1 and the start token y_0 = [START] to the decoder to produce the next state s_1 and the first word, “cat”.

SLIDE 14

Image Captioning with Soft Attention


Slide credit: UMich EECS 498/598 DeepVision course by Justin Johnson. Method: “Show, Attend and Tell” by Xu et al. ICML 2015.

The same attention computation now repeats, with the new decoder state s_1 acting as the query.

SLIDE 15

Image Captioning with Soft Attention


Slide credit: UMich EECS 498/598 DeepVision course by Justin Johnson. Method: “Show, Attend and Tell” by Xu et al. ICML 2015.

Compute new alignment scores e_{2,i,j} = f_att(s_1, h_{i,j}) from the updated decoder state.

SLIDE 16

Image Captioning with Soft Attention


Slide credit: UMich EECS 498/598 DeepVision course by Justin Johnson. Method: “Show, Attend and Tell” by Xu et al. ICML 2015.

Apply the softmax to obtain the new attention weights a_{2,:,:}.

SLIDE 17

Image Captioning with Soft Attention


Slide credit: UMich EECS 498/598 DeepVision course by Justin Johnson. Method: “Show, Attend and Tell” by Xu et al. ICML 2015.

Form the new context vector c_2 as the attention-weighted sum of the grid features.

SLIDE 18

Image Captioning with Soft Attention


Slide credit: UMich EECS 498/598 DeepVision course by Justin Johnson. Method: “Show, Attend and Tell” by Xu et al. ICML 2015.

Feed c_2 and the previous word y_1 (“cat”) to the decoder to produce s_2 and the next word, “sitting”.

SLIDE 19

Image Captioning with Soft Attention


Slide credit: UMich EECS 498/598 DeepVision course by Justin Johnson. Method: “Show, Attend and Tell” by Xu et al. ICML 2015.

Decoding continues (“cat”, “sitting”, “outside”, [STOP]). Each timestep of the decoder uses a different context vector that looks at a different part of the input image:

e_{t,i,j} = f_att(s_{t-1}, h_{i,j}),  a_{t,:,:} = softmax(e_{t,:,:}),  c_t = ∑_{i,j} a_{t,i,j} h_{i,j}
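Putting the three equations together, here is a sketch of one decoding step; the dimensions and the MLP form of f_att are assumptions for illustration.

```python
# One "Show, Attend and Tell"-style decoding step: score the grid, attend, then update the LSTM.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionDecoderStep(nn.Module):
    def __init__(self, feat_dim=512, hidden_dim=512, embed_dim=512, vocab_size=10000):
        super().__init__()
        self.f_att = nn.Sequential(nn.Linear(feat_dim + hidden_dim, hidden_dim),
                                   nn.Tanh(),
                                   nn.Linear(hidden_dim, 1))
        self.cell = nn.LSTMCell(embed_dim + feat_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, h_grid, s_prev, c_prev, y_prev_embed):
        """h_grid: (B, N, feat_dim) flattened grid features; s_prev, c_prev: (B, hidden_dim) LSTM state."""
        B, N, _ = h_grid.shape
        query = s_prev.unsqueeze(1).expand(B, N, s_prev.size(-1))
        e = self.f_att(torch.cat([h_grid, query], dim=-1)).squeeze(-1)   # e_{t,i,j} = f_att(s_{t-1}, h_{i,j})
        a = F.softmax(e, dim=-1)                                         # a_{t,:,:} = softmax(e_{t,:,:})
        ctx = (a.unsqueeze(-1) * h_grid).sum(dim=1)                      # c_t = sum_{i,j} a_{t,i,j} h_{i,j}
        s, c = self.cell(torch.cat([y_prev_embed, ctx], dim=-1), (s_prev, c_prev))
        return self.out(s), s, c, a                                      # word logits, new state, attention map
```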

SLIDE 20

Image Captioning with Soft Attention


Slide credit: UMich EECS 498/598 DeepVision course by Justin Johnson. Method: “Show, Attend and Tell” by Xu et al. ICML 2015.

SLIDE 21

Image Captioning with Soft Attention


Slide credit: UMich EECS 498/598 DeepVision course by Justin Johnson. Method: “Show, Attend and Tell” by Xu et al. ICML 2015.

SLIDE 22

Image Captioning with Region Attention

  • Variants of Soft Attention based on the feature input
  • Grid activation features (covered)
  • Region proposal features (e.g., from a Faster R-CNN detector)

SLIDE 23

Image Captioning with “Fancier” Attention

Semantic attention

  • Visual attributes

Adaptive Attention

  • Knowing when (and when not) to attend to the image

You et al. “Image captioning with semantic attention”, CVPR 2016. Yao et al. “Boosting Image Captioning with Attributes”, ICCV 2017. Lu et al. “Knowing When to Look: Adaptive Attention via A Visual Sentinel for Image Captioning”, CVPR 2017.

SLIDE 24

Image Captioning with “Fancier” Attention

Attention on Attention

X-Linear Attention

  • Spatial and channel-wise bilinear attention

Huang et al. “Attention on Attention for Image Captioning”, ICCV 2019. Pan et al. “X-Linear Attention Networks for Image Captioning”, CVPR 2020.

SLIDE 25

Image Captioning with “Fancier” Attention

Hierarchy Parsing and GCNs

  • Hierarchical tree structure in the image

Auto-Encoding Scene Graphs

  • Scene graphs in image and text

Yao et al. “Hierarchy Parsing for Image Captioning”, ICCV 2019. Yang et al. “Auto-encoding scene graphs for image captioning”, CVPR 2019.

SLIDE 26

Image Captioning with Transformer

  • Transformer performs sequence-to-sequence generation.
  • Self-Attention – A type of soft attention that “attends to itself”.
  • Self-Attention is a special case of Graph Neural Networks (GNNs) on a fully-connected graph.
  • Self-attention is sometimes used to model relationships between object regions, similar to GCNs.

Vaswani et al. “Attention is all you need”, NIPS 2017. Yao et al. “Exploring visual relationship for image captioning”, ECCV 2018. Further readings: https://graphdeeplearning.github.io/post/transformers-are-gnns/
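As a minimal sketch, single-head self-attention over a set of region features looks as follows; real captioning Transformers add multi-head attention, feed-forward layers, residual connections, and layer normalization.

```python
# Single-head self-attention: every region attends to every other region (a fully-connected graph).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttention(nn.Module):
    def __init__(self, dim=512):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        self.scale = dim ** -0.5

    def forward(self, x):                                                # x: (B, N, dim) region/grid features
        q, k, v = self.q(x), self.k(x), self.v(x)
        attn = F.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)   # (B, N, N) pairwise weights
        return attn @ v                                                  # contextualized region features
```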

SLIDE 27

Image Captioning with Transformer

  • The Transformer was first adapted for captioning in Zhou et al.
  • Others: Object Relation Transformer, Meshed-Memory Transformer

Zhou et al. “End-to-end dense video captioning with masked transformer”, CVPR 2018. Herdade et al. “Image Captioning: Transforming Objects into Words”, NeurIPS 2019. Cornia et al. “Meshed-Memory Transformer for Image Captioning”, CVPR 2020.

SLIDE 28

Vision-Language Pre-training (VLP)

  • Two-stage training strategy: pre-training and fine-tuning.
  • Pre-training is performed on a large dataset, usually with auto-generated captions; the training objective is unsupervised.
  • Fine-tuning is task-specific supervised training on downstream tasks.
  • All methods are based on BERT (a variant of the Transformer).

Zhou et al. “Unified vision-language pre-training for image captioning and vqa”, AAAI 2020. Devlin et al. “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding”, NAACL 2019.
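As a rough sketch of the unsupervised objective, a BERT-style masked-word prediction step might look like the following; the model interface, masking ratio, and loss details are assumptions, and each VLP method differs in its exact objectives.

```python
# Masked-language-modeling step on (image regions, caption) pairs -- illustrative only.
import torch
import torch.nn.functional as F

def masked_lm_step(model, region_feats, caption_ids, mask_id, mask_prob=0.15):
    """region_feats: (B, N, d) detector features; caption_ids: (B, T) token ids."""
    labels = caption_ids.clone()
    mask = torch.rand_like(caption_ids, dtype=torch.float) < mask_prob
    inputs = caption_ids.masked_fill(mask, mask_id)           # replace some words with [MASK]
    labels[~mask] = -100                                       # score only the masked positions
    logits = model(region_feats, inputs)                       # (B, T, vocab) -- assumed interface
    return F.cross_entropy(logits.view(-1, logits.size(-1)), labels.view(-1), ignore_index=-100)
```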

SLIDE 29

Vision-Language Pre-training (VLP)

Separate Encoder-Decoder

  • Methods: VideoBERT and Oscar
  • Only the encoder is pre-trained

Unified Encoder-Decoder

  • Methods: Unified VLP
  • Both encoder and decoder are pre-trained

Sun et al. “VideoBERT: A joint model for video and language representation learning,” ICCV 2019. Li et al. “Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks,” arXiv 2020. Zhou et al. “Unified vision-language pre-training for image captioning and VQA”, AAAI 2020.

SLIDE 30

Evaluation – Benchmark Dataset

COCO Captions

  • Train / val / test: 113k / 5k / 5k
  • Hidden test (leaderboard): 40k
  • Vocabulary (≥ 5 occurrences): 9,587
  • Most-adopted!

Flickr30K

  • Train / val / test: 29k / 1k / 1k
  • Vocabulary (≥ 5 occurrences): 6,864

Chen et al. “Microsoft coco captions: Data collection and evaluation server”, arXiv 2015.

SLIDE 31


Image credit: Chen et al. “Microsoft coco captions: Data collection and evaluation server”, arXiv 2015.

SLIDE 32

Evaluation – Metrics

  • Most commonly-used: BLEU / METEOR / CIDEr / SPICE
  • BLEU: n-gram precision
  • METEOR: sensitive to word ordering, via unigram matching
  • CIDEr: gives more weight to important n-grams through TF-IDF
  • SPICE: F1-score over caption scene-graph tuples
  • Further reading: Sanja Fidler’s lecture slides http://www.cs.toronto.edu/~fidler/slides/2017/CSC2539/Kaustav_slides.pdf
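As a toy illustration of an n-gram metric, BLEU@4 for one hypothesis can be computed with NLTK as below; published COCO numbers use the official coco-caption toolkit instead.

```python
# Toy BLEU@4 computation for a single caption against two references.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

references = [["a", "cat", "is", "sitting", "next", "to", "a", "tree"],
              ["a", "cat", "sits", "by", "a", "pine", "tree"]]
hypothesis = ["a", "cat", "is", "sitting", "by", "a", "tree"]

# Geometric mean of 1- to 4-gram precisions, with smoothing for short sentences.
score = sentence_bleu(references, hypothesis,
                      weights=(0.25, 0.25, 0.25, 0.25),
                      smoothing_function=SmoothingFunction().method1)
print(f"BLEU@4 = {score:.3f}")
```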

SLIDE 33

Evaluation – Results on COCO

Method                       BLEU@4   METEOR   CIDEr   SPICE
CNN-LSTM                     20.3     –        –       –
Soft Attention               24.3     23.9     –       –
Semantic Attention           30.4     24.3     –       –
Adaptive Attention           32.5     26.6     108.5   19.5
Region Attention*            36.3     27.7     120.1   21.4
Attention on Attention*      38.9     29.2     129.8   22.4
Transformer (vanilla)*       38.2     28.9     128.4   22.2
M² Transformer*              39.1     29.2     131.2   22.6
X-Transformer*               39.7     29.5     132.8   23.4
VLP (with pre-training)*     39.5     29.3     129.3   23.2
Oscar (with pre-training)*   41.7     30.6     140.0   24.5

Note that all methods use a single model. * Indicates with CIDEr optimization

SLIDE 34

Image Captioning – Other Topics

  • Dense Captioning
  • Novel Object Captioning
  • Stylized Captioning (GAN)
  • RL-based (e.g., SCST)

SLIDE 35

Video Captioning

  • Now, we extend our scope to the video domain.

Image credit: Aafaq et al. “Video Description: A Survey of Methods, Datasets and Evaluation Metrics” ACM Computing Surveys 2019.

SLIDE 36

Video Description

  • Method-wise, almost no difference! (encoder-decoder, attention, Transformer, etc.)
  • The temporal information is aggregated through the methods illustrated in the survey figure cited below.

Image credit: Aafaq et al. “Video Description: A Survey of Methods, Datasets and Evaluation Metrics” ACM Computing Surveys 2019.
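Two common aggregation choices (a rough sketch, not tied to a specific paper) are mean pooling over frames and temporal soft attention driven by the decoder state:

```python
# Aggregate per-frame features into a single video representation.
import torch
import torch.nn.functional as F

def aggregate_frames(frame_feats, query=None):
    """frame_feats: (B, T, d) per-frame CNN features; query: (B, d) decoder state, or None."""
    if query is None:
        return frame_feats.mean(dim=1)                                    # simple mean pooling over time
    scores = torch.bmm(frame_feats, query.unsqueeze(-1)).squeeze(-1)      # (B, T) frame relevance scores
    weights = F.softmax(scores, dim=-1)
    return torch.bmm(weights.unsqueeze(1), frame_feats).squeeze(1)        # temporal soft attention
```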

SLIDE 37

Description alone might fail...

See notes for the image credits.

Description: A bottle of ketchup and a bottle of sriracha are on a table.

SLIDE 38

Description alone might fail...

See notes for the image credits.

Description: A bottle of ketchup and a bottle of sriracha are on a table.

SLIDE 39

Grounded Visual Description

  • Essentially, visual description + object grounding or detection
  • In the image domain, Neural Baby Talk
  • In the video domain, Grounded Video Description
  • Requires a special dataset that has both descriptions and bounding boxes

Lu et al. “Neural Baby Talk”, CVPR 2018. Zhou et al. “Grounded video description”, CVPR 2019.

SLIDE 40

Single-Frame Annotation


From ActivityNet-Entities dataset. Zhou et al. “Grounded video description”, CVPR 2019.

We see a man playing a saxophone in front of microphones.

SLIDE 41

Multi-Frame Annotation


From ActivityNet-Entities dataset. Zhou et al. “Grounded video description”, CVPR 2019.

Two women are on a tennis court, showing the technique to posing and hitting the ball.

SLIDE 42

Grounded Video Description (GVD) model

  • Architecture: grounding module + caption decoder
  • Grounding happens simultaneously with caption generation.
  • GVD adopts three proxy tasks to leverage the BBox annotations:
  • Supervised attention
  • Supervised grounding
  • Region classification
  • Details: https://www.youtube.com/watch?v=7AVCgn21noM


Zhou et al. “Grounded video description”, CVPR 2019.

SLIDE 43

Video Description

  • The Encoder-Decoder framework works fairly well for images and short video clips.

  • How about long videos?
  • The average video length on YouTube is 4.4 minutes!

SLIDE 44

Video Paragraph Description


Yu et al. “Video Paragraph Captioning Using Hierarchical Recurrent Neural Networks” CVPR 2016.

Add chopped bacon to a hot pan and stir. Remove the bacon from the pan. Place the beef into a hot pan to brown. Add onion and carrots to the pan. Pour the meat back into the pan and add flour. Place the pan into the oven. Add bay leaves thyme red wine beef stock garlic and tomato paste to the pan and boil. Add pearl onions to a hot pan and add beef stock bay leaf and thyme. Add mushrooms to a hot pan. Add the mushrooms and pearl onions to the meat...

SLIDE 45

Dense Video Description

  • Objective – Localize and describe events from a video.
  • Input: video. Output: triplets of (event start time, end time, description)

Image credit: Aafaq et al. “Video Description: A Survey of Methods, Datasets and Evaluation Metrics”, ACM Computing Surveys 2019.

SLIDE 46

Dense Video Description

  • Existing methods usually contain two modules: event proposal and video description.

Krishna et al. “Dense-captioning events in videos”, ICCV 2017. Zhou et al. “End-to-end dense video captioning with masked transformer”, CVPR 2018.

Figure: three training schemes for the two modules – Separate Training (the language decoder is trained on ground-truth video clips, independently of the proposal module), Alternating Training (a shared video encoder, with the proposal decoder and the language decoder trained in alternation), and End-to-End Training (a shared video encoder with the proposal and language decoders trained jointly).
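A minimal sketch of the two-module pipeline is shown below; the encoder and decoder interfaces are assumed for illustration and do not correspond to a specific implementation.

```python
# Dense video captioning: propose event segments, then caption each proposed segment.
def dense_caption(video_feats, encoder, proposal_decoder, language_decoder):
    """Returns triplets of (start frame, end frame, description)."""
    enc = encoder(video_feats)                    # (T, d) shared video encoding
    proposals = proposal_decoder(enc)             # [(start, end), ...] frame-index event spans
    results = []
    for start, end in proposals:
        clip_enc = enc[start:end]                 # features restricted to the proposed event
        results.append((start, end, language_decoder(clip_enc)))
    return results
```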

SLIDE 47

Dense Video Description

  • End-to-End Masked Transformer


Zhou et al. “End-to-end dense video captioning with masked transformer”, CVPR 2018.

SLIDE 48

Conclusions

  • We have seen rapid progress in the field…
  • On COCO Captions, CIDEr goes from <100 to 140
  • Motivation is important. Avoid piling up “Legos”.
  • To achieve better result interpretability, we need grounding.
  • Towards generalizable and robust models, pre-training is one option.

SLIDE 49

Limitations

  • Still a long way to go before production-ready due to…
  • Recognition failure
  • Object hallucination
  • Model bias
  • Remedies: better features, better grounding/detection, alleviating biases

Rohrbach et al. “Object hallucination in image captioning,” EMNLP 2018. Burns et al. “Women also Snowboard: Overcoming Bias in Captioning Models,” ECCV 2018.

SLIDE 50

Future Directions

  • Evaluation metrics that correlate better with human judgement.
  • Revisit grid features and simplify the model pipeline.
  • In vision-language pre-training, how to close the gap between the pre-training domain and the downstream domain.

SLIDE 51

Thank you! Any questions?
