Vision and Language Representation Learning
– Self-Supervised Pretraining and Multi-Task Learning
Jiasen Lu, April 21, 2020
Vision and Language tasks: Visual Question Answering, Image Captioning, Visual Commonsense Reasoning, Referring Expression
[Antol et al. 2015; Vinyals et al. 2015; Zellers et al. 2018; Yu et al. 2018]
Example. C: A bunch of red and yellow flowers on a branch. Q: What type of plant is this? A: Banana
[Shen et al. 2018]
Goal: a common model for visual grounding that can be leveraged across a wide array of vision-and-language tasks.
The analogy: ImageNet pretraining transfers across vision tasks (object detection, semantic segmentation, pose estimation), and BERT pretraining transfers across language tasks (question answering, commonsense inference, sentiment analysis).
[Deng et al. 2009; Devlin et al. 2018]
The Conceptual Captions dataset [Sharma et al. 2018] pairs web images with cleaned-up alt-text:
Alt-text: Musician Justin Timberlake performs at the 2017 Pilgrimage Music & Cultural Festival on September 23, 2017 in Franklin, Tennessee.
Conceptual Caption: pop artist performs at the festival in a city.
BERT [Devlin et al. 2018]: a Transformer encoder over <CLS> Sentence A <SEP> Sentence B <SEP> that produces a hidden state for every position. For pretraining, random input tokens are replaced with <MASK> and the model must predict the original tokens from their final hidden states (masked language modeling).
[Figure: BERT over Sentence A and Sentence B; masked tokens T1 and T2 are recovered from their hidden states.]
[Sharma et al. 2018; Devlin et al. 2018]
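As a concrete illustration of masked language modeling (not part of the talk), here is a minimal sketch using the Hugging Face transformers library; the model checkpoint and the example sentence are just for illustration:

```python
import torch
from transformers import BertForMaskedLM, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased").eval()

# Mask one word of a Conceptual-Captions-style sentence.
text = "pop artist performs at the [MASK] in a city."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits  # (1, seq_len, vocab_size)

# Read off BERT's top prediction at the masked position.
pos = (inputs.input_ids[0] == tokenizer.mask_token_id).nonzero().item()
print(tokenizer.decode(logits[0, pos].argmax()))
```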
A first attempt at a visual BERT: keep the same single-stream architecture, but replace Sentence B with the image, feeding region features in place of word tokens (<CLS> Sentence <SEP> Image <SEP>). Masked words and masked image regions are then predicted from their hidden states, just as in text-only BERT.
[Figure: single-stream BERT over a sentence and image regions, with masked inputs recovered from their hidden states.]
[Sharma et al. 2018; Devlin et al. 2018]
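To feed image regions into such a transformer, detector features must first be projected into the token embedding space. A minimal sketch under common assumptions (2048-d Faster R-CNN region features plus a 5-d normalized box vector; the class name is hypothetical):

```python
import torch.nn as nn

class RegionEmbedding(nn.Module):
    """Project detector region features (+ box geometry) into the
    transformer's embedding space. Dimensions are illustrative."""
    def __init__(self, feat_dim=2048, box_dim=5, hidden=768):
        super().__init__()
        self.feat_proj = nn.Linear(feat_dim, hidden)
        self.box_proj = nn.Linear(box_dim, hidden)
        self.norm = nn.LayerNorm(hidden)

    def forward(self, feats, boxes):
        # feats: (B, R, 2048), boxes: (B, R, 5) -> (B, R, 768)
        return self.norm(self.feat_proj(feats) + self.box_proj(boxes))
```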
Problem: different modalities may require different levels of abstraction. A single shared stack applies the same processing depth to words and to image regions, even though the region features already come out of a deep visual network [He et al. 2015].
[Figure: predicting the masked word "artist" with a linear readout.]
Solution: a two-stream model that processes the visual and linguistic inputs separately: a language stream (L-BERT, l layers) over <CLS> Tok1 Tok2 ... <SEP>, and a visual stream (V-BERT, m layers) over <IMG> and the image region features.
[Figure: two parallel transformer stacks, one per modality.]
Problem: how do we fuse the two modalities?
Solution: use co-attention [Lu et al. 2016] to exchange information between the streams: each stream computes queries from its own modality but takes keys and values from the other.
[Figure: co-attentional transformer layers connecting the visual and linguistic streams.]
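A minimal sketch of one such co-attention step in PyTorch (class and variable names are mine; the real model adds feed-forward sub-layers, per-sub-layer residuals, and stacks several of these):

```python
import torch.nn as nn

class CoAttention(nn.Module):
    """Each stream uses its own queries but the *other* stream's
    keys and values, so information flows across modalities."""
    def __init__(self, hidden=768, heads=12):
        super().__init__()
        self.vis_attends_txt = nn.MultiheadAttention(hidden, heads, batch_first=True)
        self.txt_attends_vis = nn.MultiheadAttention(hidden, heads, batch_first=True)
        self.norm_v = nn.LayerNorm(hidden)
        self.norm_t = nn.LayerNorm(hidden)

    def forward(self, v, t):
        # v: (B, R, H) image-region states, t: (B, T, H) token states
        v2, _ = self.vis_attends_txt(query=v, key=t, value=t)
        t2, _ = self.txt_attends_vis(query=t, key=v, value=v)
        return self.norm_v(v + v2), self.norm_t(t + t2)
```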
Two pretraining tasks: masked multi-modal modelling (reconstruct masked words and masked image regions) and multi-modal alignment prediction (does this caption describe this image?).
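A sketch of the corresponding output heads, under illustrative assumptions (shared hidden size, BERT vocabulary, a Visual Genome-style detector label set; in ViLBERT the masked-region target is the detector's class distribution, trained with a KL objective):

```python
import torch.nn as nn

class PretrainingHeads(nn.Module):
    """Heads for masked multi-modal modelling and alignment prediction."""
    def __init__(self, hidden=768, vocab=30522, obj_classes=1601):
        super().__init__()
        self.word_head = nn.Linear(hidden, vocab)          # masked words
        self.region_head = nn.Linear(hidden, obj_classes)  # masked regions
        self.align_head = nn.Linear(hidden, 2)             # aligned vs. not

    def forward(self, t_states, v_states, h_cls, h_img):
        word_logits = self.word_head(t_states)      # (B, T, vocab)
        region_logits = self.region_head(v_states)  # (B, R, obj_classes)
        # Alignment prediction from the fused summary vectors
        align_logits = self.align_head(h_cls * h_img)
        return word_logits, region_logits, align_logits
```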
Example image-caption pair from Conceptual Captions: "A boat covered in flowers near the market." [Sharma et al. 2018]
[Attention visualizations for this example (Layer 0 vs. Layer 5, heads H0 and H7), rendered with BertViz: https://github.com/jessevig/bertviz]
Pre-training and fine-tuning.
Pre-training: the Vision & Language BERT is trained on image and text pairs from Conceptual Captions, with masked words (e.g. "Man shopping for ..." with "Man" and "shopping" masked) and masked regions reconstructed from the surrounding text and image.
Fine-tuning: the same pretrained model is then fine-tuned on image-question pairs (e.g. "What is the ...") for downstream tasks: VQA, VCR, and referring expressions.
[Figure: the shared Vision & Language BERT applied to a masked sentence and masked regions during pre-training, and to an image-question pair during fine-tuning.]
[Antol et al. 2015; Zellers et al. 2018; Yu et al. 2016; Plummer et al. 2015]
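For a sense of how light the fine-tuning machinery can be, here is a sketch of a VQA head in the spirit of ViLBERT: it fuses the <IMG> and <CLS> summary vectors by element-wise product and scores a fixed answer vocabulary (the hidden size and the class name are illustrative assumptions):

```python
import torch.nn as nn

class VQAHead(nn.Module):
    """Fine-tuning head sketch: classify over a fixed set of frequent
    answers; VQA is commonly trained as multi-label classification."""
    def __init__(self, hidden=768, num_answers=3129):
        super().__init__()
        self.classifier = nn.Sequential(
            nn.Linear(hidden, 2 * hidden),
            nn.GELU(),
            nn.Linear(2 * hidden, num_answers),
        )

    def forward(self, h_img, h_cls):
        return self.classifier(h_img * h_cls)  # (B, num_answers) logits
```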
[Results charts:
VQA (test-dev): 70.22, 65.9, 68.85, 68.93, 70.55
VCR Q->A (val): 43.1, 47.27, 52.73, 49.48, 54.04
RefCOCO+ (val): 65.33, 65.64, 69.21, 68.61, 72.34
Image Retrieval (test, R@1): 48.6, 45.5, 58.2]
[Li 2019; Tan 2019; Li 2019; Su 2019; Zhou 2019; Chen 2019]
Summary: task-agnostic pretraining of visiolinguistic representations for visual grounding, transferable to many downstream tasks.
Limitation: the model can still learn inconsistent grounding through task-specific fine-tuning.
One Model for V&L: ViLBERT
Problem: each downstream task gets its own separately fine-tuned copy of the model, and grounding can become inconsistent under task-specific fine-tuning.
What we want: a single shared model trained across many vision-and-language tasks.
Task groups: VQA; Image Description / Retrieval (COCO); Referring Expression; V&L Verification.
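A minimal sketch of what multi-task training over such groups can look like (hypothetical helper names; the 12-in-1 recipe adds refinements such as dataset-size-aware sampling and per-task schedules):

```python
import random

def multitask_train(model, loaders, optimizer, num_steps):
    """Sample a task each step, pull its next batch, and run the
    shared trunk with that task's head; small datasets recycle."""
    iters = {name: iter(dl) for name, dl in loaders.items()}
    names = list(loaders)
    for _ in range(num_steps):
        task = random.choice(names)
        try:
            batch = next(iters[task])
        except StopIteration:
            iters[task] = iter(loaders[task])
            batch = next(iters[task])
        loss = model(batch, task=task)  # per-task head selected inside
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```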
The shared backbone is the pretrained two-stream ViLBERT: L-BERT over the aligned caption (<CLS> Tok1 Tok2 ... <SEP>) and V-BERT over the image (<IMG> plus region features), trained with masked inputs on aligned image-caption pairs.
For multi-task training, a task token <TSK> is added to the text input so that one shared model knows which task it is solving, and lightweight heads are attached per task family: VQA/Genome QA, GQA, Retrieval, NLVR, Visual Entailment, and Referring Expression.
[Figure: two-stream ViLBERT with a <TSK> token and per-task output heads.]
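One simple way to realize the <TSK> token is a learned per-task embedding prepended to the text input; a sketch under that assumption (class name is mine):

```python
import torch
import torch.nn as nn

class TaskTokenEmbedding(nn.Module):
    """Prepend a learned <TSK> embedding so one shared encoder can be
    steered toward VQA, retrieval, NLVR, etc."""
    def __init__(self, num_tasks, hidden=768):
        super().__init__()
        self.task_emb = nn.Embedding(num_tasks, hidden)

    def forward(self, token_embs, task_id):
        # token_embs: (B, T, H); task_id: (B,) -> (B, T+1, H)
        tsk = self.task_emb(task_id).unsqueeze(1)
        return torch.cat([tsk, token_embs], dim=1)
```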
[Comparison chart across pretrained models differing in architecture (1-stream vs. 2-stream) and pretraining data (CC, CC+wiki, COCO+VG, CC+SBU+CC+COCO, CC, CC + MT):
VQA (test-dev): 70.55, 70.8, 71.16, 71.24, 72.42, 72.27, 72.06, 73.08
RefCOCO+ (test): 70.57, 69.36, 71.12, 72.9, 73.4, 74.12]
[Li 2019; Tan 2019; Li 2019; Su 2019; Chen 2019]
Per-task results (four settings per task, as shown):
VQA (test-dev): 71.24, 72.03, 72.06, 73.08
Genome QA (test): 34.1, 36.18, 35.05, 36.38
GQA (test): 59.09, 59.6, 59.81, 60.72
Retrieval COCO (R@1): 64.8, 65.06, 63.02, 67.36
Retrieval Flickr (R@1): 61.46, 66, 63.19, 66.12
Visual7W (test): 80.51, 81.54, 82.52, 83.06
GuessWhat (test): 62.53, 64.78, 64.43, 65.19
NLVR2 (test): 74.25, 74.62, 77.72, 78.24
SNLI-VE (test): 76.53, 76.52, 76.63, 77.32
RefCOCO+ (test): 69.47, 72.8, 73.4, 74.12
Task performance when trained with different groups:
[Table: relative performance of each task (rows: VQA, Image Retrieval, Refer Expression, V&L Verification, Average) when trained together with each other group (columns: VQA, Image Retrieval, Refer Expression, V&L Verification); values shown: 0.38, 0.19, 0.46, 0.39, 0.78, 0.47, 2.29, 1.47, 0.67, 1.04, 0.88, 0.43.]
Demo: https://vilbert.cloudcv.org/
Summary: exploring multi-task vision-and-language learning, with a single shared model serving many tasks.
Potential directions