Vision and Language Representation Learning
– Self-Supervised Pretraining and Multi-Task Learning
Jiasen Lu, April 21, 2020
Vision and Language tasks: Visual Question Answering, Image Captioning, Visual Commonsense Reasoning, Referring Expression
[Antol et al. 2015; Vinyals et al. 2015; Zellers et al. 2018; Yu et al. 2018]
Example. C: A bunch of red and yellow flowers on a branch. Q: What type of plant is this? A: Banana
[Shen et al. 2018]
Goal: a common model for visual grounding that can be leveraged across a wide array of vision-and-language tasks.
The analogy: ImageNet pretraining transfers across vision tasks (object detection, semantic segmentation, pose estimation), and BERT pretraining transfers across language tasks (question answering, commonsense inference, sentiment analysis).
[Deng et al. 2009; Devlin et al. 2018]
The Conceptual Captions dataset [Sharma et al. 2018] pairs web images with cleaned-up alt-text:
Alt-text: Musician Justin Timberlake performs at the 2017 Pilgrimage Music & Cultural Festival on September 23, 2017 in Franklin, Tennessee.
Conceptual Caption: pop artist performs at the festival in a city.
BERT [Devlin et al. 2018]: a Transformer encoder over <CLS> Sentence A <SEP> Sentence B <SEP> that produces a hidden state for every position. For pretraining, random input tokens are replaced with <MASK> and the model must predict the original tokens from their final hidden states (masked language modeling).
[Figure: BERT over Sentence A and Sentence B; masked tokens T1 and T2 are recovered from their hidden states.]
[Sharma et al. 2018; Devlin et al. 2018]
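As a concrete illustration of masked language modeling (not part of the talk), here is a minimal sketch using the Hugging Face transformers library; the model checkpoint and the example sentence are just for illustration:

```python
import torch
from transformers import BertForMaskedLM, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased").eval()

# Mask one word of a Conceptual-Captions-style sentence.
text = "pop artist performs at the [MASK] in a city."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits  # (1, seq_len, vocab_size)

# Read off BERT's top prediction at the masked position.
pos = (inputs.input_ids[0] == tokenizer.mask_token_id).nonzero().item()
print(tokenizer.decode(logits[0, pos].argmax()))
```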
A first attempt at a visual BERT: keep the same single-stream architecture, but replace Sentence B with the image, feeding region features in place of word tokens (<CLS> Sentence <SEP> Image <SEP>). Masked words and masked image regions are then predicted from their hidden states, just as in text-only BERT.
[Figure: single-stream BERT over a sentence and image regions, with masked inputs recovered from their hidden states.]
[Sharma et al. 2018; Devlin et al. 2018]
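To feed image regions into such a transformer, detector features must first be projected into the token embedding space. A minimal sketch under common assumptions (2048-d Faster R-CNN region features plus a 5-d normalized box vector; the class name is hypothetical):

```python
import torch.nn as nn

class RegionEmbedding(nn.Module):
    """Project detector region features (+ box geometry) into the
    transformer's embedding space. Dimensions are illustrative."""
    def __init__(self, feat_dim=2048, box_dim=5, hidden=768):
        super().__init__()
        self.feat_proj = nn.Linear(feat_dim, hidden)
        self.box_proj = nn.Linear(box_dim, hidden)
        self.norm = nn.LayerNorm(hidden)

    def forward(self, feats, boxes):
        # feats: (B, R, 2048), boxes: (B, R, 5) -> (B, R, 768)
        return self.norm(self.feat_proj(feats) + self.box_proj(boxes))
```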
Problem: different modalities may require different levels of abstraction. A single shared stack applies the same processing depth to words and to image regions, even though the region features already come out of a deep visual network [He et al. 2015].
[Figure: predicting the masked word "artist" with a linear readout.]
Solution: a two-stream model that processes the visual and linguistic inputs separately: a language stream (L-BERT, l layers) over <CLS> Tok1 Tok2 ... <SEP>, and a visual stream (V-BERT, m layers) over <IMG> and the image region features.
[Figure: two parallel transformer stacks, one per modality.]
Problem: how do we fuse the two modalities?
Solution: use co-attention [Lu et al. 2016] to exchange information between the streams: each stream computes queries from its own modality but takes keys and values from the other.
[Figure: co-attentional transformer layers connecting the visual and linguistic streams.]
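A minimal sketch of one such co-attention step in PyTorch (class and variable names are mine; the real model adds feed-forward sub-layers, per-sub-layer residuals, and stacks several of these):

```python
import torch.nn as nn

class CoAttention(nn.Module):
    """Each stream uses its own queries but the *other* stream's
    keys and values, so information flows across modalities."""
    def __init__(self, hidden=768, heads=12):
        super().__init__()
        self.vis_attends_txt = nn.MultiheadAttention(hidden, heads, batch_first=True)
        self.txt_attends_vis = nn.MultiheadAttention(hidden, heads, batch_first=True)
        self.norm_v = nn.LayerNorm(hidden)
        self.norm_t = nn.LayerNorm(hidden)

    def forward(self, v, t):
        # v: (B, R, H) image-region states, t: (B, T, H) token states
        v2, _ = self.vis_attends_txt(query=v, key=t, value=t)
        t2, _ = self.txt_attends_vis(query=t, key=v, value=v)
        return self.norm_v(v + v2), self.norm_t(t + t2)
```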
Two pretraining tasks: masked multi-modal modelling (reconstruct masked words and masked image regions) and multi-modal alignment prediction (does this caption describe this image?).
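A sketch of the corresponding output heads, under illustrative assumptions (shared hidden size, BERT vocabulary, a Visual Genome-style detector label set; in ViLBERT the masked-region target is the detector's class distribution, trained with a KL objective):

```python
import torch.nn as nn

class PretrainingHeads(nn.Module):
    """Heads for masked multi-modal modelling and alignment prediction."""
    def __init__(self, hidden=768, vocab=30522, obj_classes=1601):
        super().__init__()
        self.word_head = nn.Linear(hidden, vocab)          # masked words
        self.region_head = nn.Linear(hidden, obj_classes)  # masked regions
        self.align_head = nn.Linear(hidden, 2)             # aligned vs. not

    def forward(self, t_states, v_states, h_cls, h_img):
        word_logits = self.word_head(t_states)      # (B, T, vocab)
        region_logits = self.region_head(v_states)  # (B, R, obj_classes)
        # Alignment prediction from the fused summary vectors
        align_logits = self.align_head(h_cls * h_img)
        return word_logits, region_logits, align_logits
```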
Example image-caption pair from Conceptual Captions: "A boat covered in flowers near the market." [Sharma et al. 2018]
[Attention visualizations for this example (Layer 0 vs. Layer 5, heads H0 and H7), rendered with BertViz: https://github.com/jessevig/bertviz]
Pre-training and fine-tuning.
Pre-training: the Vision & Language BERT is trained on image and text pairs from Conceptual Captions, with masked words (e.g. "Man shopping for ..." with "Man" and "shopping" masked) and masked regions reconstructed from the surrounding text and image.
Fine-tuning: the same pretrained model is then fine-tuned on image-question pairs (e.g. "What is the ...") for downstream tasks: VQA, VCR, and referring expressions.
[Figure: the shared Vision & Language BERT applied to a masked sentence and masked regions during pre-training, and to an image-question pair during fine-tuning.]
[Antol et al. 2015; Zellers et al. 2018; Yu et al. 2016; Plummer et al. 2015]
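For a sense of how light the fine-tuning machinery can be, here is a sketch of a VQA head in the spirit of ViLBERT: it fuses the <IMG> and <CLS> summary vectors by element-wise product and scores a fixed answer vocabulary (the hidden size and the class name are illustrative assumptions):

```python
import torch.nn as nn

class VQAHead(nn.Module):
    """Fine-tuning head sketch: classify over a fixed set of frequent
    answers; VQA is commonly trained as multi-label classification."""
    def __init__(self, hidden=768, num_answers=3129):
        super().__init__()
        self.classifier = nn.Sequential(
            nn.Linear(hidden, 2 * hidden),
            nn.GELU(),
            nn.Linear(2 * hidden, num_answers),
        )

    def forward(self, h_img, h_cls):
        return self.classifier(h_img * h_cls)  # (B, num_answers) logits
```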
[Results charts:
VQA (test-dev): 70.22, 65.9, 68.85, 68.93, 70.55
VCR Q->A (val): 43.1, 47.27, 52.73, 49.48, 54.04
RefCOCO+ (val): 65.33, 65.64, 69.21, 68.61, 72.34
Image Retrieval (test, R@1): 48.6, 45.5, 58.2]
[Li 2019; Tan 2019; Li 2019; Su 2019; Zhou 2019; Chen 2019]
Summary: task-agnostic pretraining of visiolinguistic representations for visual grounding, transferable to many downstream tasks.
Limitation: the model can still learn inconsistent grounding through task-specific fine-tuning.
One Model for V&L: ViLBERT
Problem: each downstream task gets its own separately fine-tuned copy of the model, and grounding can become inconsistent under task-specific fine-tuning.
What we want: a single shared model trained across many vision-and-language tasks.
Task groups: VQA; Image Description / Retrieval (COCO); Referring Expression; V&L Verification.
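A minimal sketch of what multi-task training over such groups can look like (hypothetical helper names; the 12-in-1 recipe adds refinements such as dataset-size-aware sampling and per-task schedules):

```python
import random

def multitask_train(model, loaders, optimizer, num_steps):
    """Sample a task each step, pull its next batch, and run the
    shared trunk with that task's head; small datasets recycle."""
    iters = {name: iter(dl) for name, dl in loaders.items()}
    names = list(loaders)
    for _ in range(num_steps):
        task = random.choice(names)
        try:
            batch = next(iters[task])
        except StopIteration:
            iters[task] = iter(loaders[task])
            batch = next(iters[task])
        loss = model(batch, task=task)  # per-task head selected inside
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```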
The shared backbone is the pretrained two-stream ViLBERT: L-BERT over the aligned caption (<CLS> Tok1 Tok2 ... <SEP>) and V-BERT over the image (<IMG> plus region features), trained with masked inputs on aligned image-caption pairs.
For multi-task training, a task token <TSK> is added to the text input so that one shared model knows which task it is solving, and lightweight heads are attached per task family: VQA/Genome QA, GQA, Retrieval, NLVR, Visual Entailment, and Referring Expression.
[Figure: two-stream ViLBERT with a <TSK> token and per-task output heads.]
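One simple way to realize the <TSK> token is a learned per-task embedding prepended to the text input; a sketch under that assumption (class name is mine):

```python
import torch
import torch.nn as nn

class TaskTokenEmbedding(nn.Module):
    """Prepend a learned <TSK> embedding so one shared encoder can be
    steered toward VQA, retrieval, NLVR, etc."""
    def __init__(self, num_tasks, hidden=768):
        super().__init__()
        self.task_emb = nn.Embedding(num_tasks, hidden)

    def forward(self, token_embs, task_id):
        # token_embs: (B, T, H); task_id: (B,) -> (B, T+1, H)
        tsk = self.task_emb(task_id).unsqueeze(1)
        return torch.cat([tsk, token_embs], dim=1)
```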
[Comparison chart across pretrained models differing in architecture (1-stream vs. 2-stream) and pretraining data (CC, CC+wiki, COCO+VG, CC+SBU+CC+COCO, CC, CC + MT):
VQA (test-dev): 70.55, 70.8, 71.16, 71.24, 72.42, 72.27, 72.06, 73.08
RefCOCO+ (test): 70.57, 69.36, 71.12, 72.9, 73.4, 74.12]
[Li 2019; Tan 2019; Li 2019; Su 2019; Chen 2019]
Per-task results (four settings per task, as shown):
VQA (test-dev): 71.24, 72.03, 72.06, 73.08
Genome QA (test): 34.1, 36.18, 35.05, 36.38
GQA (test): 59.09, 59.6, 59.81, 60.72
Retrieval COCO (R@1): 64.8, 65.06, 63.02, 67.36
Retrieval Flickr (R@1): 61.46, 66, 63.19, 66.12
Visual7W (test): 80.51, 81.54, 82.52, 83.06
GuessWhat (test): 62.53, 64.78, 64.43, 65.19
NLVR2 (test): 74.25, 74.62, 77.72, 78.24
SNLI-VE (test): 76.53, 76.52, 76.63, 77.32
RefCOCO+ (test): 69.47, 72.8, 73.4, 74.12
Task performance when trained with different groups:
[Table: relative performance of each task (rows: VQA, Image Retrieval, Refer Expression, V&L Verification, Average) when trained together with each other group (columns: VQA, Image Retrieval, Refer Expression, V&L Verification); values shown: 0.38, 0.19, 0.46, 0.39, 0.78, 0.47, 2.29, 1.47, 0.67, 1.04, 0.88, 0.43.]
Demo: https://vilbert.cloudcv.org/
Summary: exploring multi-task vision-and-language learning, with a single shared model serving many tasks.
Potential directions