UTS CRICOS 00099F
Vision and Language Learning with Graph Neural Networks
Linchao Zhu
22 Apr, 2020
Recognition, LEarning, Reasoning
Overview
- RNNs for Image Captioning
- Transformer for Image Captioning
- Graph Network for Visual Commonsense Reasoning
Image Captioning
Wu et al., Decoupled Novel Object Captioner, ACM MM 2018.
- How to generate descriptions for unseen words?
- Zero-shot novel object captioning: the model must caption novel objects without any additional training sentences about those objects.
Image Captioning
Wu et al., Decoupled Novel Object Captioner, ACM MM 2018.
Novel Image Captioning
Wu et al., Decoupled Novel Object Captioner, ACM MM 2018.
- Results on eight novel objects in the held-out MS COCO dataset
- A larger dataset: Agrawal et al., nocaps: novel object captioning at scale, ICCV 2019.
Image Captioning
- Semantic attributes are useful
- Visual regions are useful
Li et al., Entangled Transformer for Image Captioning, ICCV 2019.
Image Captioning
Li et al., Entangled Transformer for Image Captioning, ICCV 2019.
Image Captioning
Li et al., Entangled Transformer for Image Captioning, ICCV 2019.
- EnTangled Attention (ETA)
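The slide names the attention module without showing how it works. The sketch below is only a loose illustration of the general idea of letting one modality guide attention over the other, built from standard scaled dot-product attention. It is not the formulation of Li et al.; the two-step ordering, the toy feature vectors, and the names `attend`, `visual`, `semantic`, and `decoder_state` are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Scaled dot-product attention for a single query vector."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    context = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return context, weights


# Hypothetical toy features: three visual regions and two semantic attributes.
visual = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
semantic = [[0.5, 0.5], [1.0, 0.0]]
decoder_state = [1.0, 0.0]

# Step 1 (assumed ordering): attend to the semantic attributes.
sem_context, sem_weights = attend(decoder_state, semantic, semantic)

# Step 2: reuse the semantic context as the query over the visual regions,
# so the semantic pathway steers where the visual pathway looks.
vis_context, vis_weights = attend(sem_context, visual, visual)
```

Each call returns a context vector and a weight list that sums to one; swapping the two steps would give the mirrored variant in which visual features guide the semantic attention.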
Image Captioning
Li et al., Entangled Transformer for Image Captioning, ICCV 2019.
- Gated Bilateral Controller (GBC)
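As a rough intuition for what a gate does when combining a visual and a semantic representation, here is a minimal sketch of a generic per-dimension sigmoid gate. This is an assumption-laden illustration, not the controller defined by Li et al.: the function `gated_fusion`, its scalar per-dimension weights, and the convex-blend form are all hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(visual, semantic, gate_weights, gate_bias):
    """Blend two same-length vectors with a per-dimension sigmoid gate.

    gate_weights holds one (w_visual, w_semantic) pair per dimension.
    A gate near 1 keeps the visual value; a gate near 0 keeps the semantic one.
    """
    fused = []
    for v, s, (wv, ws), b in zip(visual, semantic, gate_weights, gate_bias):
        g = sigmoid(wv * v + ws * s + b)
        fused.append(g * v + (1.0 - g) * s)
    return fused


# With zero weights and bias the gate is exactly 0.5, so the result is the mean.
balanced = gated_fusion([1.0, 2.0], [3.0, 4.0], [(0.0, 0.0), (0.0, 0.0)], [0.0, 0.0])

# A large positive bias pushes the gate towards 1, favouring the visual input.
visual_heavy = gated_fusion([1.0, 2.0], [3.0, 4.0], [(0.0, 0.0), (0.0, 0.0)], [20.0, 20.0])
```

In a trained model the gate parameters would be learned, letting the network decide per dimension how much each modality contributes.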
Image Captioning
Li et al., Entangled Transformer for Image Captioning, ICCV 2019.
- Results on MSCOCO (Karpathy’s split)
Fuse two models
Image Captioning
Li et al., Entangled Transformer for Image Captioning, ICCV 2019.
- Results on MSCOCO (Karpathy’s split) with sequence-level optimization
Transformer_v: visual input only (w/o GBC)
Transformer_s: semantic attributes only (w/o GBC)
Parallel: no ETA but use GBC
Stacked_v: stacked two visual layers (w/o GBC)
Stacked_s: stacked two semantic layers (w/o GBC)
ETA: ours
Visual Commonsense Reasoning
Zellers et al., From Recognition to Cognition: Visual Commonsense Reasoning, CVPR 2019.
- Question -> Answer -> Rationale
Visual Commonsense Reasoning
Wu et al., Connective Cognition Network for Directional Visual Commonsense Reasoning, NeurIPS 2019.
Visual Commonsense Reasoning
Wu et al., Connective Cognition Network for Directional Visual Commonsense Reasoning, NeurIPS 2019.
- Local features
- Global features
Visual Commonsense Reasoning
Wu et al., Connective Cognition Network for Directional Visual Commonsense Reasoning, NeurIPS 2019.
- Local features
- Global features
- Directional Reasoning
Attention, Conv
Visual Commonsense Reasoning
Wu et al., Connective Cognition Network for Directional Visual Commonsense Reasoning, NeurIPS 2019.
- Loss: multi-class cross-entropy loss
- Results on the VCR dataset
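To make the multi-class cross-entropy loss named above concrete, here is a minimal sketch for a single example. It assumes the model emits one raw score per candidate response and that there are four candidates per question, as in the VCR benchmark; the function name `cross_entropy` and the toy scores are illustrative only.

```python
import math


def cross_entropy(logits, target):
    """Multi-class cross-entropy for one example, computed from raw scores.

    Equivalent to -log(softmax(logits)[target]), using the log-sum-exp
    trick for numerical stability.
    """
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]


# Four candidate responses, correct one at index 2 (toy values).
uniform_loss = cross_entropy([0.0, 0.0, 0.0, 0.0], target=2)    # equals log(4)
confident_loss = cross_entropy([0.0, 0.0, 5.0, 0.0], target=2)  # much smaller
```

An uninformed model that scores all four candidates equally pays a loss of log 4, and the loss shrinks as the score of the correct candidate rises above the others.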
Visual Commonsense Reasoning
Wu et al., Connective Cognition Network for Directional Visual Commonsense Reasoning, NeurIPS 2019.
- Conditional centers
Visual Commonsense Reasoning
Wu et al., Connective Cognition Network for Directional Visual Commonsense Reasoning, NeurIPS 2019.
- Ablation studies for GraphVLAD
No-C: no conditional center
No-G: no graph convolution
- Ablation studies for directional reasoning
No-R: no reasoning module
LSTM-R: use LSTM for reasoning
GCN: use GCN for reasoning
D-GCN: directional GCN for reasoning
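The difference between the GCN and D-GCN variants can be pictured with a toy message-passing step. The sketch below is a simplified stand-in, not the network of Wu et al.: it uses plain mean aggregation with self-loops and no learned weights, and the function `propagate`, the three-node chain, and the edge list are all assumptions chosen for illustration.

```python
def propagate(features, edges, directed):
    """One mean-aggregation message-passing step without learned weights.

    edges is a list of (source, target) pairs. In the directed case a node
    only receives messages along incoming edges; in the undirected case
    every edge carries messages both ways. Each node also keeps a self-loop.
    """
    n = len(features)
    incoming = {i: [i] for i in range(n)}
    for src, dst in edges:
        incoming[dst].append(src)
        if not directed:
            incoming[src].append(dst)
    dim = len(features[0])
    out = []
    for i in range(n):
        nbrs = incoming[i]
        out.append([sum(features[j][d] for j in nbrs) / len(nbrs) for d in range(dim)])
    return out


# Toy chain 0 -> 1 -> 2 with a signal only on node 0.
features = [[1.0], [0.0], [0.0]]
edges = [(0, 1), (1, 2)]

undirected = propagate(features, edges, directed=False)
directed = propagate(features, edges, directed=True)
```

In the directed run node 0 keeps its full signal because nothing flows back into it, whereas the undirected run dilutes it with node 1's value; information therefore moves in one direction only along the chain.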
Conclusion
- Visual reasoning is challenging
- Graph networks are powerful, but more studies are needed.
- Can one model solve them all?