Vision and Language Representation Learning: Self-Supervised Pretraining and Multi-Task Learning - PowerPoint PPT Presentation



  1. Vision and Language Representation Learning – Self-Supervised Pretraining and Multi-Task Learning. Jiasen Lu, April 21, 2020.

  2. Vision and Language: Visual Question Answering, Image Captioning, Visual Commonsense Reasoning, Referring Expression. [Antol et al. 2015, Vinyals et al. 2015, Zellers et al. 2018, Yu et al. 2018]

  3. Vision and Language (cont.): the same set of tasks, with Visual Commonsense Reasoning highlighted. [Antol et al. 2015, Vinyals et al. 2015, Zellers et al. 2018, Yu et al. 2018]

  4. Visual Grounding. Caption: "A bunch of red and yellow flowers on a branch." Q: "What type of plant is this?" A: "Banana." Goal: build a common model for visual grounding and leverage it on a wide array of vision-and-language tasks. [Shen et al. 2018]

  5. Pretrain-Transfer. Vision: Object Detection, Semantic Segmentation, Pose Estimation; Language: Question Answering, Commonsense Inference, Sentiment Analysis. [Deng et al. 2009, Devlin 2018]

  6. Pretrain-Transfer: the Conceptual Captions dataset [Sharma et al. 2018]. Aligned image-caption pairs; 3.3 million images compared to 0.12 million in COCO captions; automatically collected. Example alt-text: "Musician Justin Timberlake performs at the 2017 Pilgrimage Music & Cultural Festival on September 23, 2017 in Franklin, Tennessee." Conceptual Caption: "pop artist performs at the festival in a city."

  7. BERT. A single BERT encoder over a sentence pair (Sentence A / Sentence B), illustrated on the Conceptual Captions example above: alt-text "Musician Justin Timberlake performs at the 2017 Pilgrimage Music & Cultural Festival on September 23, 2017 in Franklin, Tennessee." paired with the Conceptual Caption "pop artist performs at the festival in a city." [Figure: BERT over [CLS] Tok1 … TokN [SEP] Tok1 Tok2 … [SEP].] [Sharma et al. 2018, Devlin et al. 2018]
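
For context on the BERT setup the next slides build on, here is a toy sketch of how a sentence pair (e.g. the alt-text and its Conceptual Caption) is packed into a single input; the tokenization is simplified and the function name is mine.

```python
# Toy sketch of BERT-style sentence-pair packing (Devlin et al. 2018);
# the token strings below are simplified, not real WordPiece output.

def pack_sentence_pair(tokens_a, tokens_b):
    """Build the [CLS] A [SEP] B [SEP] sequence with segment ids."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # Segment 0 covers [CLS], sentence A and its [SEP]; segment 1 covers B and its [SEP].
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    return tokens, segment_ids

if __name__ == "__main__":
    alt_text = "musician justin timberlake performs at the festival".split()
    caption = "pop artist performs at the festival in a city".split()
    tokens, segments = pack_sentence_pair(alt_text, caption)
    print(tokens)
    print(segments)
```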

  8. ViLBERT. [Figure: the same BERT setup with some input tokens replaced by <MASK> and the model predicting the masked tokens (T1, T2), on the Conceptual Captions example.] [Sharma et al. 2018, Devlin et al. 2018]

  9. Single-Stream model. [Figure: a single BERT encoder over the concatenation of the sentence and the image ([CLS] sentence tokens [SEP] image regions [SEP]), trained on the Conceptual Captions example.] [Sharma et al. 2018, Devlin et al. 2018]

  10. Single-Stream model. [Figure: the same single-stream encoder with masked text tokens and image regions, predicting the masked elements (T1).] [Sharma et al. 2018, Devlin et al. 2018]

  11. ViLBERT. Problem: different modalities may require different levels of abstraction. The linguistic stream starts from a shallow (linear) embedding per word (e.g. "artist"), while the visual stream starts from deep convolutional features. [He et al. 2015]

  12. ViLBERT. Solution: a two-stream model that processes the visual and linguistic inputs separately. [Figure: an L-BERT stack over [CLS] Tok1 Tok2 … [SEP] and a V-BERT stack over <IMG> and the image regions, with separate numbers of layers (l and m) per stream.]
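
A minimal PyTorch-style sketch of that idea (class name, layer counts, and widths are illustrative, not the actual ViLBERT configuration):

```python
import torch
import torch.nn as nn

class TwoStreamEncoder(nn.Module):
    """Illustrative two-stream encoder in the spirit of ViLBERT's L-BERT / V-BERT
    split: text tokens and image regions are refined by separate transformer
    stacks before any cross-modal fusion. Layer counts and widths are
    placeholders, not the paper's configuration."""

    def __init__(self, d_text=768, d_vis=1024, n_text_layers=6, n_vis_layers=2):
        super().__init__()
        self.text_layers = nn.ModuleList(
            [nn.TransformerEncoderLayer(d_text, nhead=8, batch_first=True)
             for _ in range(n_text_layers)])
        self.vis_layers = nn.ModuleList(
            [nn.TransformerEncoderLayer(d_vis, nhead=8, batch_first=True)
             for _ in range(n_vis_layers)])

    def forward(self, text_emb, region_feats):
        # Each modality is processed separately, at its own level of abstraction;
        # cross-modal fusion (co-attention) happens afterwards.
        for layer in self.text_layers:
            text_emb = layer(text_emb)
        for layer in self.vis_layers:
            region_feats = layer(region_feats)
        return text_emb, region_feats
```

With text_emb of shape (batch, num_tokens, 768) and region_feats of shape (batch, num_regions, 1024), each stream keeps its own width and depth.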

  13. ViLBERT. Problem: how to fuse the two modalities? Solution: use co-attention [Lu et al. 2016] to fuse information between the two streams.

  14. ViLBERT. Co-attention [Lu et al. 2016] fuses information between the two streams: each stream attends over the other stream's representations.
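
A minimal sketch of that co-attention exchange (names and dimensions are mine; in ViLBERT such blocks are interleaved with standard transformer layers in both streams):

```python
import torch.nn as nn

class CoAttentionBlock(nn.Module):
    """Illustrative co-attention block: each stream's queries attend over the
    other stream's keys and values, so language features are conditioned on
    the image and vice versa. Dimensions are placeholders."""

    def __init__(self, d_model=768, n_heads=8):
        super().__init__()
        self.text_attends_image = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.image_attends_text = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, text, image):
        # Queries from one modality, keys/values from the other.
        text_out, _ = self.text_attends_image(query=text, key=image, value=image)
        image_out, _ = self.image_attends_text(query=image, key=text, value=text)
        return text_out, image_out
```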

  15. Pre-training Objectives.
  Masked multi-modal modelling:
  • Follows masked LM in BERT: 15% of words or image regions are selected for prediction.
  • Linguistic stream: 80% of the time replace with [MASK], 10% of the time replace with a random word, 10% of the time keep unchanged.
  • Visual stream: 80% of the time replace the region feature with a zero vector.
  Multi-modal alignment prediction:
  • Predict whether the image and caption are aligned.
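
A rough sketch of that masking recipe (function names are mine; the real implementation also handles padding, word pieces, and the prediction targets for regions):

```python
import random
import torch

MASK_TOKEN = "[MASK]"

def mask_text(tokens, vocab, p=0.15):
    """15% of word positions are chosen; of those, 80% -> [MASK],
    10% -> a random word, 10% -> kept unchanged."""
    out, targets = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if random.random() >= p:
            continue
        targets[i] = tok                      # the model must predict the original word
        r = random.random()
        if r < 0.8:
            out[i] = MASK_TOKEN
        elif r < 0.9:
            out[i] = random.choice(vocab)     # random replacement
        # else: keep the original token
    return out, targets

def mask_regions(region_feats, p=0.15):
    """15% of image regions are chosen; per the slide, 80% of those have
    their features replaced with a zero vector."""
    feats = region_feats.clone()              # (num_regions, feat_dim)
    chosen = torch.rand(feats.size(0)) < p
    zeroed = chosen & (torch.rand(feats.size(0)) < 0.8)
    feats[zeroed] = 0.0
    return feats, chosen
```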

  16. Visualizations. Example: "A boat covered in flowers near the market." [Sharma et al. 2018]

  17. Sentence → Image attention. [Figure: attention from sentence tokens to image regions across heads H0–H7, at Layer 0 and Layer 5; visualized with BertViz: https://github.com/jessevig/bertviz]

  18. Sentence → Image attention. [Figure: attention heads H0–H7, Layer 0 and Layer 5; BertViz: https://github.com/jessevig/bertviz]

  19. Image → Sentence attention. [Figure: attention heads H0–H7, Layer 0 and Layer 5; BertViz: https://github.com/jessevig/bertviz]

  20. Image → Sentence attention. [Figure: attention heads H0–H7, Layer 0 and Layer 5; BertViz: https://github.com/jessevig/bertviz]

  21. Fine-tuning Procedure. Pre-training: image and text pairs from Conceptual Captions, with masked image regions and masked sentence tokens, fed to the Vision & Language BERT. Fine-tuning: task-specific image-question pairs (e.g. Referring Expressions, VCR, VQA) on the same Vision & Language BERT. [Figure: pre-training vs. fine-tuning input layouts.]

  22. Tasks. [Antol et al. 2015, Zellers et al. 2018, Yu et al. 2016, Plummer et al. 2015]

  23. Results. [Table: accuracy on VQA (test-dev), VCR Q→A (val), RefCOCO+ (val), and image retrieval (test); ViLBERT reaches 70.55 on VQA test-dev, 72.34 on RefCOCO+ val, and 58.2 R@1 on image retrieval, ahead of the single-stream and non-pretrained variants.]

  24. Concurrent Work. [Li 2019, Tan 2019, Li 2019, Su 2019, Zhou 2019, Chen 2019]

  25. Summary. Task-agnostic visiolinguistic representation pretraining for visual grounding: introduces pretrain-transfer to vision-and-language tasks and achieves state of the art on multiple vision-and-language tasks. Limitation: the model can still learn inconsistent grounding through task-specific fine-tuning. Next step: train multiple vision-and-language tasks together (multi-task V&L).

  26. Multi-Task V&L Learning. One model for V&L: ViLBERT.
  Problems:
  • Inconsistent grounding from task-specific fine-tuning.
  • Only four V&L tasks.
  • The model is huge and prone to overfitting.
  What we want:
  • Test on more tasks.
  • Consistent grounding across tasks.
  • Explore the limits of the model.
  Task groups: Referring Expression (RefCOCO, RefCOCO+, RefCOCOg, Visual7W, GuessWhat); VQA (VQA, Genome QA, GQA); Image Description (caption-based retrieval on COCO); V&L Verification (NLVR2, Visual Entailment).

  27. Multi-Task V&L Learning: model improvements over ViLBERT.

  28. Multi-Task V&L Learning. Model improvements over ViLBERT: masked multi-modal modelling is applied only to aligned image-caption pairs. [Figure: L-BERT / V-BERT over an image and its aligned caption containing a <MASK> token.]

  29. Multi-Task V&L Learning. Model improvements over ViLBERT: masked multi-modal modelling only for aligned image-caption pairs, and masking of overlapping image regions (IoU > 0.4) along with the selected region. [Figure: same L-BERT / V-BERT setup.]
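
A minimal sketch of the overlapping-region rule as I read it (box format and helper names are assumptions): when a region is selected for masking, every other proposal whose IoU with a masked region exceeds 0.4 is masked too, so near-duplicate boxes cannot leak the masked content.

```python
import torch

def pairwise_iou(boxes):
    """Pairwise IoU for boxes in (x1, y1, x2, y2) format; boxes has shape (N, 4)."""
    area = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    lt = torch.max(boxes[:, None, :2], boxes[None, :, :2])      # intersection top-left
    rb = torch.min(boxes[:, None, 2:], boxes[None, :, 2:])      # intersection bottom-right
    wh = (rb - lt).clamp(min=0)
    inter = wh[..., 0] * wh[..., 1]
    return inter / (area[:, None] + area[None, :] - inter)

def expand_mask_by_overlap(boxes, masked, iou_thresh=0.4):
    """Also mask any region that overlaps a masked region with IoU > iou_thresh."""
    iou = pairwise_iou(boxes)                                    # (N, N)
    overlaps_masked = (iou > iou_thresh) & masked[None, :]       # region i vs. masked region j
    return masked | overlaps_masked.any(dim=1)
```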

  30. Multi-Task V&L Learning. Multi-task vision-and-language learning: use different output heads, but similar tasks share the same head. [Figure: heads for VQA/Genome QA, GQA, Retrieval, NLVR, and Visual Entailment on top of the shared L-BERT / V-BERT encoder.]

  31. Multi-Task V&L Learning. Multi-task vision-and-language learning: use different output heads, but similar tasks share the same head. [Figure: heads for VQA/Genome QA, GQA, Retrieval, NLVR, Visual Entailment, and Referring Expression on the shared encoder.]

  32. Multi-Task V&L Learning. Use different heads (similar tasks share the same head) and add a <TSK> token for multi-task training. [Figure: the <TSK> token inserted into the text input alongside <CLS> and <SEP>.]
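
A minimal sketch of the shared-head and <TSK>-token idea (the task-to-head grouping and output sizes below are illustrative, not the paper's exact configuration):

```python
import torch
import torch.nn as nn

class MultiTaskOutput(nn.Module):
    """Illustrative multi-task wrapper: each task gets a learned <TSK> embedding
    prepended to the text input, and similar tasks share one output head."""

    TASK_TO_HEAD = {                      # example grouping, not the paper's exact one
        "vqa": "qa", "genome_qa": "qa",   # VQA-style tasks share a head
        "gqa": "gqa",
        "retrieval_coco": "retrieval",
        "nlvr2": "verification", "visual_entailment": "verification",
        "refcoco": "grounding", "refcoco+": "grounding",
    }
    HEAD_DIMS = {"qa": 3000, "gqa": 1800, "retrieval": 1,
                 "verification": 2, "grounding": 1}               # illustrative sizes

    def __init__(self, d_model=768):
        super().__init__()
        self.task_ids = {t: i for i, t in enumerate(self.TASK_TO_HEAD)}
        self.task_tokens = nn.Embedding(len(self.TASK_TO_HEAD), d_model)
        self.heads = nn.ModuleDict(
            {name: nn.Linear(d_model, out) for name, out in self.HEAD_DIMS.items()})

    def prepend_task_token(self, text_emb, task):
        # text_emb: (batch, seq_len, d_model); prepend this task's <TSK> embedding.
        tsk = self.task_tokens(torch.tensor([self.task_ids[task]]))   # (1, d_model)
        return torch.cat([tsk.expand(text_emb.size(0), 1, -1), text_emb], dim=1)

    def forward(self, pooled, task):
        # pooled: (batch, d_model) joint representation; route it to the shared head.
        return self.heads[self.TASK_TO_HEAD[task]](pooled)
```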

  33. Multi-Task V&L Learning. Use different heads (similar tasks share the same head), add a <TSK> token for multi-task training, and use Dynamic Stop and Go scheduling across tasks.
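
The slide does not elaborate on "Dynamic Stop and Go"; my understanding is that a task whose validation metric has stopped improving is paused ("stop") and resumed ("go") if its metric later degrades, which keeps converged tasks from being over-trained. A rough sketch under that reading (thresholds and bookkeeping are made up):

```python
class DynamicStopAndGo:
    """Rough sketch of a per-task stop-and-go scheduler (my reading of the idea,
    not the authors' code): a task that stops improving is paused, and resumed
    if its validation metric later drops."""

    def __init__(self, tasks, patience=2, drop_tolerance=0.005):
        self.state = {t: "go" for t in tasks}        # "go" = trained this epoch
        self.best = {t: float("-inf") for t in tasks}
        self.stale = {t: 0 for t in tasks}
        self.patience = patience                      # epochs without improvement before pausing
        self.drop_tolerance = drop_tolerance          # allowed metric drop while paused

    def update(self, task, val_metric):
        if val_metric > self.best[task]:
            self.best[task], self.stale[task] = val_metric, 0
        else:
            self.stale[task] += 1
        if self.state[task] == "go" and self.stale[task] >= self.patience:
            self.state[task] = "stop"                 # task has converged: pause it
        elif self.state[task] == "stop" and val_metric < self.best[task] - self.drop_tolerance:
            self.state[task], self.stale[task] = "go", 0   # performance slipped: resume training

    def is_active(self, task):
        return self.state[task] == "go"
```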
