Vision and Language Representation Learning: Self-Supervised Pretraining and Multi-Task Learning – PowerPoint PPT Presentation


SLIDE 1

Vision and Language Representation Learning

– Self-Supervised Pretraining and Multi-Task Learning

Jiasen Lu, April 21, 2020

SLIDE 2

Vision and Language

Image Captioning, Visual Question Answering, Visual Commonsense Reasoning, Referring Expression

[Antol et al. 2015, Vinyals et al. 2015, Zellers et al. 2018, Yu et al. 2018]

SLIDE 3

Vision and Language

Referring Expression, Image Captioning, Visual Question Answering, Visual Commonsense Reasoning

[Antol et al. 2015, Vinyals et al. 2015, Zellers et al. 2018, Yu et al. 2018]

SLIDE 4

Visual Grounding

C: A bunch of red and yellow flowers on a branch. Q: What type of plant is this? A: Banana

[Shen et al. 2018]

Goal: a common model for visual grounding that can be leveraged across a wide array of vision-and-language tasks.

SLIDE 5

Pretrain-Transfer

Vision: Object Detection, Semantic Segmentation, Pose Estimation. Language: Question Answering, Commonsense Inference, Sentiment Analysis.

[Deng et al. 2009, Devlin et al. 2018]

SLIDE 6

Pretrain-Transfer

Alt-text: Musician Justin Timberlake performs at the 2017 Pilgrimage Music & Cultural Festival on September 23, 2017 in Franklin, Tennessee. Conceptual Captions: pop artist performs at the festival in a city.

Conceptual Captions Dataset

  • Aligned image-caption pairs.
  • 3.3 million images, compared to 0.12 million in COCO Captions.
  • Automatically collected.

[Sharma et al. 2018]

SLIDE 7

BERT

Conceptual Captions example (as on Slide 6).

[Figure: BERT takes two token sequences as input (<CLS> Tok 1 … Tok N <SEP> for Sentence A, Tok 1 Tok 2 … <SEP> for Sentence B) and produces one hidden state per token.]

[Sharma et al. 2018, Devlin et al. 2018]

SLIDE 8

ViLBERT

Conceptual Captions example (as on Slide 6).

[Figure: BERT with the masked-LM objective: some input tokens are replaced by <MASK> and the model predicts the original tokens (T1, T2) from their hidden states.]

[Sharma et al. 2018, Devlin et al. 2018]

SLIDE 9

Single-Stream Model

Conceptual Captions example (as on Slide 6).

[Figure: single-stream model: image region features and sentence tokens are concatenated into one sequence (<CLS>, image regions, <SEP>, Tok 1 Tok 2 …, <SEP>) and processed by a single BERT.]

[Sharma et al. 2018, Devlin et al. 2018]

SLIDE 10

Single-Stream Model

Conceptual Captions example (as on Slide 6).

[Figure: single-stream model with masked inputs: masked image regions and masked tokens (<MASK>) are predicted jointly by the single BERT.]

[Sharma et al. 2018, Devlin et al. 2018]

SLIDE 11

ViLBERT

Problem: different modalities may require different levels of abstraction.

  • Linguistic stream: the input is raw word tokens (e.g., "artist"), which need deep processing.
  • Visual stream: region features are already the output of a deep pretrained network, followed by a linear projection [He et al. 2015].

SLIDE 12

ViLBERT

Solution: a two-stream model that processes the visual and linguistic inputs separately.

[Figure: linguistic stream (L-BERT, l layers) over <CLS> Tok 1 Tok 2 … <SEP>, and visual stream (V-BERT, m layers) over <IMG> and image region features, each producing its own per-token hidden states.]

SLIDE 13

ViLBERT

Problem: how to fuse the two modalities? Solution: use co-attention [Lu et al. 2016] to exchange information between the two streams.

SLIDE 14

ViLBERT

[Figure: co-attention transformer block; co-attention [Lu et al. 2016] fuses information between the two streams. A minimal sketch follows.]
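To make the exchange concrete, here is a minimal PyTorch sketch of one co-attention layer in the spirit of ViLBERT's co-attention blocks: each stream's queries attend over the other stream's keys and values. The dimensions, head count, and use of nn.MultiheadAttention are illustrative assumptions; the actual blocks also include residual connections, layer norm, and feed-forward sublayers.

```python
import torch
import torch.nn as nn

class CoAttentionLayer(nn.Module):
    """One co-attention exchange: queries from one modality attend over
    the keys/values of the other modality (hyperparameters are assumptions)."""

    def __init__(self, dim=768, heads=8):
        super().__init__()
        self.txt_attends_img = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.img_attends_txt = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, txt, img):
        # Linguistic stream: text queries, image keys/values.
        txt_out, _ = self.txt_attends_img(txt, img, img)
        # Visual stream: image queries, text keys/values.
        img_out, _ = self.img_attends_txt(img, txt, txt)
        return txt_out, img_out

# Usage: a batch of 20 word features and 36 region features, both projected to 768-d.
txt = torch.randn(2, 20, 768)
img = torch.randn(2, 36, 768)
txt_out, img_out = CoAttentionLayer()(txt, img)
```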

SLIDE 15

Pre-training Objectives

Masked multi-modal modelling; multi-modal alignment prediction.

  • Follows the masked LM objective in BERT.
  • 15% of the words or image regions are selected for prediction.
  • Linguistic stream:
    • 80% of the time, replace with [MASK].
    • 10% of the time, replace with a random word.
    • 10% of the time, keep the word unchanged.
  • Visual stream:
    • 80% of the time, replace the region feature with a zero vector.
  • Alignment prediction: predict whether the image and caption are aligned (a sketch of the masking procedure follows).
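A minimal sketch of the masking procedure, assuming BERT-style token ids and precomputed region features. The mask token id, ignore index, and function names are illustrative assumptions, and the alignment objective (a binary classifier on the pooled joint representation) is not shown:

```python
import random
import torch

MASK_TOKEN_ID = 103   # [MASK] id in bert-base-uncased; an assumption here
IGNORE_INDEX = -100   # positions not selected are ignored by the LM loss

def mask_words(token_ids, vocab_size, select_prob=0.15):
    """Linguistic stream: select 15% of tokens; 80% -> [MASK],
    10% -> random word, 10% -> unchanged."""
    inputs, labels = token_ids.clone(), torch.full_like(token_ids, IGNORE_INDEX)
    for i in range(inputs.size(0)):
        if random.random() < select_prob:
            labels[i] = inputs[i]                  # predict the original word
            r = random.random()
            if r < 0.8:
                inputs[i] = MASK_TOKEN_ID
            elif r < 0.9:
                inputs[i] = random.randrange(vocab_size)
            # else: keep the original word
    return inputs, labels

def mask_regions(region_feats, select_prob=0.15, zero_prob=0.8):
    """Visual stream: select 15% of regions; 80% of the time the
    selected region's feature vector is zeroed out."""
    feats = region_feats.clone()
    selected = torch.zeros(feats.size(0), dtype=torch.bool)
    for i in range(feats.size(0)):
        if random.random() < select_prob:
            selected[i] = True
            if random.random() < zero_prob:
                feats[i] = 0.0
    return feats, selected
```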

SLIDE 16

Visualizations

A boat covered in flowers near the market.

[Sharma et al. 2018]

SLIDE 17

Sentence → Image

[Attention visualization: Layer 0 and Layer 5, heads H0 and H7; rendered with BertViz: https://github.com/jessevig/bertviz]

SLIDE 18

Sentence → Image

[Attention visualization: Layer 0 and Layer 5, heads H0 and H7; rendered with BertViz: https://github.com/jessevig/bertviz]

SLIDE 19

Image → Sentence

[Attention visualization: Layer 0 and Layer 5, heads H0 and H7; rendered with BertViz: https://github.com/jessevig/bertviz]

SLIDE 20

Image → Sentence

[Attention visualization: Layer 0 and Layer 5, heads H0 and H7; rendered with BertViz: https://github.com/jessevig/bertviz]

SLIDE 21

Pre-training and Fine-Tuning

[Figure: fine-tuning procedure. Pre-training: a masked sentence ("Man shopping for …", with <MASK> tokens) and masked image regions from a Conceptual Captions pair are fed to the Vision & Language BERT, which predicts the masked words and regions. Fine-tuning: an image-question pair ("What is the …") is fed to the same model, whose outputs drive task heads for VQA, VCR, and Referring Expression.]

SLIDE 22

Tasks

[Antol 2015, Zellers 2018, Yu 2016, Plummer 2015]

SLIDE 23

Results

[Charts:
  VQA (test-dev): 70.22, 65.9, 68.85, 68.93, 70.55
  VCR Q→A (val): 43.1, 47.27, 52.73, 49.48, 54.04
  RefCOCO+ (val): 65.33, 65.64, 69.21, 68.61, 72.34
  Image Retrieval (test): 48.6, 45.5, 58.2]

SLIDE 24

Concurrent Work

[Li 2019, Tan 2019, Li 2019, Su 2019, Zhou 2019, Chen 2019]

SLIDE 25

Summary

Task-agnostic pretraining of visiolinguistic representations for visual grounding:

  • Introduces pretrain-transfer to vision-and-language tasks.
  • Achieves SOTA on multiple vision-and-language tasks.

Limitation: the model can still learn inconsistent grounding through task-specific fine-tuning.

  • Next step: training multiple vision-and-language tasks together (multi-task V&L).

SLIDE 26

Multi-Task V&L Learning

One model for V&L: ViLBERT. Problems:

  • Inconsistent grounding from task-specific fine-tuning.
  • Only four V&L tasks.
  • The model is huge and prone to overfitting.

What we want:

  • Test on more tasks.
  • Consistent grounding across tasks.
  • Explore the limits of the model.

Task groups:

VQA
  • VQA
  • Genome QA
  • GQA

Image Description
  • Caption-based Retrieval (COCO)

Referring Expression
  • RefCOCO
  • RefCOCO+
  • RefCOCOg
  • Visual7W
  • GuessWhat

V&L Verification
  • NLVR2
  • Visual Entailment
SLIDE 27

Multi-Task V&L Learning

Model improvements over ViLBERT

SLIDE 28

Multi-Task V&L Learning

Model improvements over ViLBERT

  • Masked multi-modal modelling only for aligned image-caption pairs.

[Figure: two-stream model (L-BERT over the aligned caption, V-BERT over image regions) with <MASK> inputs.]

SLIDE 29

Multi-Task V&L Learning

Model improvements over ViLBERT

  • Masked multi-modal modelling only for aligned image-caption pairs.
  • Masking overlapping image regions (IoU > 0.4); see the sketch below.

[Figure: two-stream model with masked regions in the visual stream.]
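A small sketch of the overlap rule, assuming region boxes in (x1, y1, x2, y2) format; the helper names are hypothetical. When a region is chosen for masking, every region overlapping it with IoU above 0.4 is masked as well, so the model cannot recover a masked region's features from a near-duplicate proposal:

```python
def iou(a, b):
    """Intersection-over-union of two boxes in (x1, y1, x2, y2) format."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-8)

def expand_mask_by_overlap(boxes, masked_idx, iou_thresh=0.4):
    """Also mask any region whose box overlaps a masked region
    with IoU above the threshold."""
    extra = set()
    for i in masked_idx:
        for j in range(len(boxes)):
            if j not in masked_idx and iou(boxes[i], boxes[j]) > iou_thresh:
                extra.add(j)
    return sorted(set(masked_idx) | extra)
```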

SLIDE 30

Multi-Task V&L Learning

Multi-Task Vision and Language Learning

  • Use a different head per task, but similar tasks share the same head.

[Figure: two-stream model (L-BERT with a <TSK> token in the text input, V-BERT over image regions) feeding shared task heads: VQA/Genome QA, GQA, Retrieval, NLVR, Visual Entailment.]

SLIDE 31

Multi-Task V&L Learning

Multi-Task Vision and Language Learning

  • Use a different head per task, but similar tasks share the same head.

[Figure: as above, with an additional head: VQA/Genome QA, GQA, Retrieval, NLVR, Visual Entailment, Referring Expression.]

SLIDE 32

Multi-Task V&L Learning

Multi-Task Vision and Language Learning

  • Add a <TSK> token for multi-task training.
  • Use a different head per task, but similar tasks share the same head (a sketch follows).

[Figure: two-stream model with the <TSK> token in the linguistic stream.]
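A minimal sketch of both ideas, assuming a 768-d joint representation; the task names, grouping, head shapes, and mean-pooling stand-in are illustrative assumptions, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

# Illustrative grouping: similar tasks route to one shared head.
TASK_GROUPS = {
    "vqa": "qa", "genome_qa": "qa", "gqa": "qa",
    "retrieval_coco": "retrieval",
    "nlvr2": "verification", "visual_entailment": "verification",
}
TASK_IDS = {name: i for i, name in enumerate(TASK_GROUPS)}

class MultiTaskHeads(nn.Module):
    def __init__(self, dim=768, num_answers=3129):
        super().__init__()
        # One learned <TSK> embedding per task, prepended to the text input.
        self.tsk_embedding = nn.Embedding(len(TASK_IDS), dim)
        # One output head per task *group*, shared across similar tasks.
        self.heads = nn.ModuleDict({
            "qa": nn.Linear(dim, num_answers),  # answer classification
            "retrieval": nn.Linear(dim, 1),     # image-caption alignment score
            "verification": nn.Linear(dim, 2),  # true / false
        })

    def forward(self, text_embeds, task_name):
        # Prepend this task's <TSK> embedding to the token embeddings.
        tsk = self.tsk_embedding.weight[TASK_IDS[task_name]]
        tsk = tsk.view(1, 1, -1).expand(text_embeds.size(0), 1, -1)
        x = torch.cat([tsk, text_embeds], dim=1)
        pooled = x.mean(dim=1)  # stand-in for the model's pooled output
        return self.heads[TASK_GROUPS[task_name]](pooled)
```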
SLIDE 33

Multi-Task V&L Learning

Multi-Task Vision and Language Learning

  • Add a <TSK> token for multi-task training.
  • Use a different head per task, but similar tasks share the same head.
  • Dynamic Stop and Go.
SLIDE 34

Multi-Task V&L Learning

Multi-Task Vision and Language Learning

  • Add a <TSK> token for multi-task training.
  • Use a different head per task, but similar tasks share the same head.
  • Dynamic Stop and Go.
SLIDE 35

Multi-Task V&L Learning

Multi-Task Vision and Language Learning

  • Add a <TSK> token for multi-task training.
  • Use a different head per task, but similar tasks share the same head.
  • Dynamic Stop and Go (a sketch follows the training procedure below).

Training Procedure

  • Conceptual Captions pre-training.
  • Multi-task training.
  • Fine-tune on a single task.
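The slides name Dynamic Stop and Go without details; the idea is to pause tasks that have converged and resume them if they degrade, so no task dominates or overfits. A hedged sketch, where the patience, tolerance, and revisit interval are pure assumptions:

```python
class DynamicStopAndGo:
    """Sketch of dynamic stop-and-go task scheduling: a task whose
    validation score has stopped improving enters a 'stop' state and is
    trained only occasionally; if its score then degrades, it returns
    to 'go'. All thresholds here are illustrative assumptions."""

    def __init__(self, patience=2, drop_tol=0.5, revisit_every=10):
        self.patience = patience          # epochs without improvement before stopping
        self.drop_tol = drop_tol          # score drop that reactivates a task
        self.revisit_every = revisit_every
        self.best = {}                    # task -> best validation score seen
        self.stale = {}                   # task -> epochs since last improvement
        self.state = {}                   # task -> "go" | "stop"

    def update(self, task, val_score):
        if val_score > self.best.get(task, float("-inf")):
            self.best[task] = val_score
            self.stale[task] = 0
        else:
            self.stale[task] = self.stale.get(task, 0) + 1

        if self.state.get(task, "go") == "go" and self.stale[task] >= self.patience:
            self.state[task] = "stop"     # converged: visit it only occasionally
        elif self.state.get(task) == "stop" and val_score < self.best[task] - self.drop_tol:
            self.state[task] = "go"       # performance decayed: resume full training
            self.stale[task] = 0

    def should_train(self, task, iteration):
        # "stop" tasks are still visited once every `revisit_every` iterations.
        if self.state.get(task, "go") == "go":
            return True
        return iteration % self.revisit_every == 0
```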
SLIDE 36

Compare with SOTA

[Charts:
  VQA (test-dev): 70.55, 70.8, 71.16, 71.24, 72.42, 72.27, 72.06, 73.08
  RefCOCO+ (test): 70.57, 69.36, 71.12, 72.9, 73.4, 74.12
  Legend columns: Model, Pretrained]

[Li 2019, Tan 2019, Li 2019, Su 2019, Chen 2019]

SLIDE 37

Compare with SOTA

[Charts as on Slide 36. Legend rows so far (Model, Pretrained): 2-stream, CC.]

[Li 2019, Tan 2019, Li 2019, Su 2019, Chen 2019]

SLIDE 38

Compare with SOTA

[Charts as on Slide 36. Legend rows so far (Model, Pretrained): 2-stream, CC; 1-stream, CC+COCO.]

[Li 2019, Tan 2019, Li 2019, Su 2019, Chen 2019]

SLIDE 39

Compare with SOTA

[Charts as on Slide 36. Legend rows so far (Model, Pretrained): 2-stream, CC; 1-stream, CC+COCO; 1-stream, CC+wiki.]

[Li 2019, Tan 2019, Li 2019, Su 2019, Chen 2019]

SLIDE 40

Compare with SOTA

[Charts as on Slide 36. Legend rows so far (Model, Pretrained): 2-stream, CC; 1-stream, CC+COCO; 1-stream, CC+wiki; 2-stream, CC.]

[Li 2019, Tan 2019, Li 2019, Su 2019, Chen 2019]

SLIDE 41

Compare with SOTA

[Charts as on Slide 36. Legend rows so far (Model, Pretrained): 2-stream, CC; 1-stream, CC+COCO; 1-stream, CC+wiki; 2-stream, CC; 2-stream, COCO+VG.]

[Li 2019, Tan 2019, Li 2019, Su 2019, Chen 2019]

SLIDE 42

Compare with SOTA

[Charts as on Slide 36. Legend rows so far (Model, Pretrained): 2-stream, CC; 1-stream, CC+COCO; 1-stream, CC+wiki; 2-stream, CC; 2-stream, COCO+VG; 1-stream, COCO+VG+CC+SBU.]

[Li 2019, Tan 2019, Li 2019, Su 2019, Chen 2019]

SLIDE 43

Compare with SOTA

[Charts as on Slide 36. Legend rows so far (Model, Pretrained): 2-stream, CC; 1-stream, CC+COCO; 1-stream, CC+wiki; 2-stream, CC; 2-stream, COCO+VG; 1-stream, COCO+VG+CC+SBU; 2-stream, CC.]

[Li 2019, Tan 2019, Li 2019, Su 2019, Chen 2019]

SLIDE 44

Compare with SOTA

[Charts:
  VQA (test-dev): 70.55, 70.8, 71.16, 71.24, 72.42, 72.27, 72.06, 73.08
  RefCOCO+ (test): 70.57, 69.36, 71.12, 72.9, 73.4, 74.12
  Full legend, reconstructed (Model, Pretrained): 2-stream, CC; 1-stream, CC+COCO; 1-stream, CC+wiki; 2-stream, CC; 2-stream, COCO+VG; 1-stream, COCO+VG+CC+SBU; 2-stream, CC; 2-stream, CC + MT.]

[Li 2019, Tan 2019, Li 2019, Su 2019, Chen 2019]

SLIDE 45

Ablation Study

[Charts (four model variants per task):
  VQA (test-dev): 71.24, 72.03, 72.06, 73.08
  Genome QA (test): 34.1, 36.18, 35.05, 36.38
  GQA (test): 59.09, 59.6, 59.81, 60.72
  Retrieval COCO (R1): 64.8, 65.06, 63.02, 67.36
  Retrieval Flickr (R1): 61.46, 66, 63.19, 66.12
  Visual7W (test): 80.51, 81.54, 82.52, 83.06
  GuessWhat (test): 62.53, 64.78, 64.43, 65.19
  NLVR2 (test): 74.25, 74.62, 77.72, 78.24
  SNLI-VE (test): 76.53, 76.52, 76.63, 77.32
  RefCOCO+ (test): 69.47, 72.8, 73.4, 74.12]

SLIDE 46

Ablation Study

[Charts as on Slide 45.]

SLIDE 47

Ablation Study

[Charts as on Slide 45.]

SLIDE 48

Ablation Study

Task Performance with Different Groups

[Table (reconstructed from chart): relative performance of each task group when trained together with each other group. Header columns: "Trained with" Average, VQA, Image Retrieval, Refer Expression, V&L Verification; each row lists its four chart values in their original order (the exact column mapping is not recoverable from the extraction):
  VQA: 0.38, 0.38, 0.2, 0.19
  Image Retrieval: 0.46, 0.23, 4.13, 1.15
  Refer Expression: 0.39, 0.78, 0.24, 0.47
  V&L Verification: 2.29, 1.47, 0.67, 1.48
  Average: 1.04, 0.88, 0.43, 1.36]
SLIDE 49

Ablation Study

Task Performance with Different Groups

[Table as on Slide 48.]
SLIDE 50

Ablation Study

Task Performance with Different Groups

[Table as on Slide 48.]
SLIDE 51

DEMO

https://vilbert.cloudcv.org/

SLIDE 52

Summary

Explore multi-task vision-and-language learning:

  • Introduce several tricks to improve ViLBERT.
  • Add <TSK> tokens to improve multi-task learning.
  • Fine-tuning from the multi-task representation leads to a new SOTA.
  • Study how the different task groups interact with each other.

Potential directions

  • How to use the information shared across tasks for XAI?
  • How to incorporate more modalities?
  • How to make the model smaller and more efficient?
  • How to combine symbolic reasoning with representation learning?
SLIDE 53

The End

Questions?