
Vision and Language Learning with Graph Neural Networks, Linchao Zhu



  1. Vision and Language Learning with Graph Neural Networks Linchao Zhu 22 Apr, 2020 Recognition, LEarning, Reasoning UTS CRICOS 00099F

  2. Overview • RNNs for Image Captioning • Transformer for Image Captioning • Graph Network for Visual Commonsense Reasoning

  3. Image Captioning • How to generate descriptions for unseen words? • Zero-shot novel object captioning: the model needs to caption novel objects without additional training sentence data about the object. Wu et al., Decoupled Novel Object Captioner, ACM MM 2018.

  4. Image Captioning. Wu et al., Decoupled Novel Object Captioner, ACM MM 2018.

  5. Novel Image Captioning • Results on eight novel objects in the held-out MS COCO dataset • A larger dataset: nocaps: novel object captioning at scale, ICCV 2019. Wu et al., Decoupled Novel Object Captioner, ACM MM 2018.

  6. Image Captioning • Semantic attributes are useful • Visual regions are useful. Li et al., Entangled Transformer for Image Captioning, ICCV 2019.

  7. Image Captioning. Li et al., Entangled Transformer for Image Captioning, ICCV 2019.

  8. Image Captioning • EnTangled Attention. Li et al., Entangled Transformer for Image Captioning, ICCV 2019.
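
The slide names EnTangled Attention without giving its equations, so the following is only a minimal sketch of the general idea of one modality attending over another, using standard scaled dot-product attention. It is not the paper's formulation. The function name `cross_attention` and the toy attribute and region vectors are assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(query, keys, values):
    """Scaled dot-product attention of one query vector over key/value vectors.

    query:  vector from one modality (e.g. a semantic-attribute embedding)
    keys:   vectors from the other modality (e.g. visual region features)
    values: vectors aligned with keys that get averaged by the weights
    """
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    context = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return context, weights


# Hypothetical toy inputs: one attribute query attending over three regions.
attribute_query = [1.0, 0.0]
region_keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
region_values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
context, weights = cross_attention(attribute_query, region_keys, region_values)
```

Swapping the roles (a visual query over attribute keys) gives attention in the other direction; how the two directions are actually combined in the paper is not recoverable from the slide.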

  9. Image Captioning • Gated Bilateral Controller. Li et al., Entangled Transformer for Image Captioning, ICCV 2019.
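
The slide gives only the name of the Gated Bilateral Controller. As a rough illustration of what a two-sided gate can look like, the sketch below fuses a visual vector and a semantic vector with one sigmoid gate per stream. The gate parametrisation (scalar weights `w_v`, `w_s` and biases `b_v`, `b_s`) is an assumption for this example and may differ from the paper's.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(visual, semantic, w_v=1.0, b_v=0.0, w_s=1.0, b_s=0.0):
    """Element-wise fusion of two streams, each scaled by its own sigmoid gate.

    The scalar gate parameters are hypothetical; a trained model would
    normally learn full weight matrices instead.
    """
    fused = []
    for v, s in zip(visual, semantic):
        g_v = sigmoid(w_v * v + b_v)  # how much of the visual signal passes
        g_s = sigmoid(w_s * s + b_s)  # how much of the semantic signal passes
        fused.append(g_v * v + g_s * s)
    return fused


# Hypothetical toy vectors.
visual_state = [2.0, 4.0]
semantic_state = [6.0, 8.0]
fused_state = gated_fusion(visual_state, semantic_state)
```

With both gates near 1 the result approaches a plain sum of the streams, and with one gate near 0 that stream is suppressed, which is the behaviour a gating module is meant to learn per dimension.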

  10. Image Captioning • Results on MSCOCO (Karpathy’s split) • Fuse two models. Li et al., Entangled Transformer for Image Captioning, ICCV 2019.

  11. Image Captioning • Results on MSCOCO (Karpathy’s split) with sequence-level optimization • Transformer v: visual input only (w/o GBC, the Gated Bilateral Controller) • Transformer s: semantic attributes only (w/o GBC) • Parallel: no ETA (EnTangled Attention) but uses GBC • Stacked v: two stacked visual layers (w/o GBC) • Stacked s: two stacked semantic layers (w/o GBC) • ETA: ours. Li et al., Entangled Transformer for Image Captioning, ICCV 2019.

  12. Visual Commonsense Reasoning • Question -> Answer -> Rationale. Zellers et al., From Recognition to Cognition: Visual Commonsense Reasoning, CVPR 2019.

  13. Visual Commonsense Reasoning. Wu et al., Connective Cognition Network for Directional Visual Commonsense Reasoning, NeurIPS 2019.

  14. Visual Commonsense Reasoning • Figure labels: local features, global features. Wu et al., Connective Cognition Network for Directional Visual Commonsense Reasoning, NeurIPS 2019.

  15. Visual Commonsense Reasoning • Directional Reasoning • Figure labels: local features, global features, Conv, attention. Wu et al., Connective Cognition Network for Directional Visual Commonsense Reasoning, NeurIPS 2019.

  16. Visual Commonsense Reasoning • Loss: multi-class cross-entropy loss • Results on the VCR dataset. Wu et al., Connective Cognition Network for Directional Visual Commonsense Reasoning, NeurIPS 2019.
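
As a worked illustration of the multi-class cross-entropy loss named on this slide, the sketch below scores four candidate responses and computes the loss for the correct one. VCR is a four-way multiple-choice task; the score values and function names here are made up for the example.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(scores, target_index):
    """Negative log-probability of the correct class under softmax(scores)."""
    probs = softmax(scores)
    return -math.log(probs[target_index])


# Hypothetical model scores for four candidate answers; index 0 is correct.
candidate_scores = [2.0, 0.5, -1.0, 0.1]
loss = cross_entropy(candidate_scores, 0)

# An uninformative model (all scores equal) gives loss log(4) on four classes.
uniform_loss = cross_entropy([0.0, 0.0, 0.0, 0.0], 2)
```

The loss drops toward zero as the correct candidate's score rises above the others, and `log(4)` is the value a model reaches by guessing uniformly.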

  17. Visual Commonsense Reasoning • Conditional centers. Wu et al., Connective Cognition Network for Directional Visual Commonsense Reasoning, NeurIPS 2019.
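
The slide does not say how the centers are conditioned, so the sketch below only illustrates the underlying VLAD-style aggregation with soft assignment (as in NetVLAD), plus one hypothetical way to make centers depend on a query: adding an offset vector. The `condition_centers` helper and the `alpha` sharpness parameter are assumptions for illustration, not the paper's GraphVLAD.

```python
import math


def soft_assign(x, centers, alpha=1.0):
    """Softmax over negative squared distances from x to each center."""
    logits = [-alpha * sum((xi - ci) ** 2 for xi, ci in zip(x, c)) for c in centers]
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


def vlad(features, centers, alpha=1.0):
    """Sum of assignment-weighted residuals (x - c_k) for every center k."""
    dim = len(centers[0])
    out = [[0.0] * dim for _ in centers]
    for x in features:
        a = soft_assign(x, centers, alpha)
        for k, c in enumerate(centers):
            for d in range(dim):
                out[k][d] += a[k] * (x[d] - c[d])
    return out


def condition_centers(base_centers, offset):
    """Hypothetical conditioning: shift every center by a query-derived offset."""
    return [[c + o for c, o in zip(center, offset)] for center in base_centers]


# Hypothetical toy data: one local feature, two centers.
local_features = [[1.0, 0.0]]
base_centers = [[0.0, 0.0], [10.0, 10.0]]
descriptor = vlad(local_features, base_centers)
shifted = vlad(local_features, condition_centers(base_centers, [1.0, 0.0]))
```

After the shift the first center coincides with the feature, so its residual vanishes; this shows how changing the centers changes the descriptor for the same image features.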

  18. Visual Commonsense Reasoning • Ablation studies for GraphVLAD: No-C: no conditional center; No-G: no graph convolution • Ablation studies for directional reasoning: No-R: no reasoning module; LSTM-R: use LSTM for reasoning; GCN: use GCN for reasoning; D-GCN: directional GCN for reasoning. Wu et al., Connective Cognition Network for Directional Visual Commonsense Reasoning, NeurIPS 2019.
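
To make the GCN versus D-GCN contrast in the ablation concrete, here is a minimal sketch of one message-passing step with mean aggregation, run once on an undirected graph and once respecting edge direction. It omits learned weights and nonlinearities, and it is a generic illustration rather than the paper's directional GCN; the `propagate` function and the three-node chain are assumptions for the example.

```python
def propagate(features, edges, directed):
    """One mean-aggregation step over a graph with self-loops.

    features: list of node feature vectors
    edges:    list of (src, dst) pairs; information flows src -> dst
    directed: if False, every edge also passes information dst -> src
    """
    n = len(features)
    incoming = [{i} for i in range(n)]  # each node always sees itself
    for src, dst in edges:
        incoming[dst].add(src)
        if not directed:
            incoming[src].add(dst)
    out = []
    for i in range(n):
        dim = len(features[i])
        count = len(incoming[i])
        out.append([sum(features[j][d] for j in incoming[i]) / count for d in range(dim)])
    return out


# Hypothetical chain 0 -> 1 -> 2 where only node 0 carries a signal.
node_features = [[1.0], [0.0], [0.0]]
chain_edges = [(0, 1), (1, 2)]
undirected_out = propagate(node_features, chain_edges, directed=False)
directed_out = propagate(node_features, chain_edges, directed=True)
```

In the directed run node 0 keeps its full signal because nothing flows back into it, while in the undirected run it is averaged with node 1. That asymmetry is what lets a directional model encode the order of reasoning steps.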

  19. Conclusion • Visual reasoning is challenging • Graph networks are powerful, but more study is needed • One model solves them all?
