
SLIDE 1

UTS CRICOS 00099F

Vision and Language Learning with Graph Neural Networks

Linchao Zhu 22 Apr, 2020

Recognition, LEarning, Reasoning

SLIDE 2

Overview

  • RNNs for Image Captioning
  • Transformer for Image Captioning
  • Graph Network for Visual Commonsense Reasoning


SLIDE 3

Image Captioning

Wu et al., Decoupled Novel Object Captioner, ACM MM 2018.

  • How to generate descriptions for unseen words? Zero-shot novel object captioning: the model must caption novel objects without any additional training sentences describing those objects.
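The core decoupling idea can be sketched in a few lines: the language model generates a caption template containing a placeholder token, and a separately trained (zero-shot) object recognizer fills the placeholder in. This is a minimal illustration, not the paper's full key-value object-memory mechanism; the placeholder token and example tokens are assumptions.

```python
# Hedged sketch of placeholder-based novel-object captioning:
# a caption template is produced first, then the detected object
# word is substituted for the placeholder token.

def fill_template(template_tokens, detected_object, placeholder="<PL>"):
    """Replace each placeholder token with the detected object word."""
    return [detected_object if t == placeholder else t
            for t in template_tokens]

# Illustrative template; "zebra" stands in for a zero-shot detection.
template = ["a", "<PL>", "is", "sitting", "on", "a", "table"]
caption = fill_template(template, "zebra")
# caption -> ["a", "zebra", "is", "sitting", "on", "a", "table"]
```

Decoupling the sentence generator from the object recognizer is what lets the captioner handle objects it never saw in paired training sentences.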

SLIDE 4

Image Captioning

Wu et al., Decoupled Novel Object Captioner, ACM MM 2018.

SLIDE 5

Novel Image Captioning


Wu et al., Decoupled Novel Object Captioner, ACM MM 2018.

  • Results on eight novel objects in the held-out MS COCO dataset

A larger benchmark: Agrawal et al., nocaps: novel object captioning at scale, ICCV 2019.

SLIDE 6

Image Captioning


  • Semantic attributes are useful
  • Visual regions are useful

Li et al., Entangled Transformer for Image Captioning, ICCV 2019.

SLIDE 7

Image Captioning

Li et al., Entangled Transformer for Image Captioning, ICCV 2019.

SLIDE 8

Image Captioning

Li et al., Entangled Transformer for Image Captioning, ICCV 2019.

  • EnTangled Attention
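A minimal numpy sketch of the intuition behind EnTangled Attention: the decoder's query attends over visual region features and semantic attribute embeddings in parallel, producing one context vector per modality. Shapes and feature sources here are illustrative assumptions; the actual ETA layer entangles the two streams inside a multi-head Transformer block rather than attending to each independently.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attend(query, feats):
    """Scaled dot-product attention of one query over a feature set."""
    d = feats.shape[-1]
    weights = softmax(feats @ query / np.sqrt(d))  # (n,) attention weights
    return weights @ feats                         # (d,) context vector

rng = np.random.default_rng(0)
q = rng.normal(size=8)               # decoder hidden state (query), assumed dim
visual = rng.normal(size=(5, 8))     # detected region features (assumed)
semantic = rng.normal(size=(3, 8))   # semantic attribute embeddings (assumed)

v_ctx = attend(q, visual)    # visual context
s_ctx = attend(q, semantic)  # semantic context
```

The two resulting contexts are exactly what the gated controller on the next slide has to combine.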
SLIDE 9

Image Captioning

Li et al., Entangled Transformer for Image Captioning, ICCV 2019.

  • Gated Bilateral Controller
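A bilateral gate can be sketched as a learned per-dimension sigmoid that decides how much visual versus semantic context flows to the decoder. This is a simplified assumption of the mechanism (a single linear gate over the concatenated contexts), not the paper's exact parameterization.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_fusion(v_ctx, s_ctx, w, b):
    """Per-dimension gate between visual and semantic contexts:
    out = g * v_ctx + (1 - g) * s_ctx, with g in (0, 1)."""
    gate = sigmoid(np.concatenate([v_ctx, s_ctx]) @ w + b)  # (d,)
    return gate * v_ctx + (1.0 - gate) * s_ctx

rng = np.random.default_rng(1)
d = 8
v_ctx = rng.normal(size=d)              # visual context (illustrative)
s_ctx = rng.normal(size=d)              # semantic context (illustrative)
w = rng.normal(size=(2 * d, d)) * 0.1   # assumed learned gate weights
b = np.zeros(d)

fused = gated_fusion(v_ctx, s_ctx, w, b)
```

Because the gate is a convex combination per dimension, the fused vector always lies between the two contexts, which keeps either modality from being silently discarded.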
SLIDE 10

Image Captioning

Li et al., Entangled Transformer for Image Captioning, ICCV 2019.

  • Results on MSCOCO (Karpathy’s split)

The ensemble result fuses two models.

SLIDE 11

Image Captioning

Li et al., Entangled Transformer for Image Captioning, ICCV 2019.

  • Results on MSCOCO (Karpathy’s split) with sequence-level optimization

  • Transformer_v: visual input only (w/o GBC)
  • Transformer_s: semantic attributes only (w/o GBC)
  • Parallel: no ETA, but uses GBC
  • Stacked_v: two stacked visual layers (w/o GBC)
  • Stacked_s: two stacked semantic layers (w/o GBC)
  • ETA: ours

SLIDE 12


Visual Commonsense Reasoning

Zellers et al., From Recognition to Cognition: Visual Commonsense Reasoning, CVPR 2019. Question -> Answer -> Rationale

SLIDE 13


Visual Commonsense Reasoning

Wu et al., Connective Cognition Network for Directional Visual Commonsense Reasoning, NeurIPS 2019.

SLIDE 14


Visual Commonsense Reasoning

Wu et al., Connective Cognition Network for Directional Visual Commonsense Reasoning, NeurIPS 2019.

(Figure: local vs. global feature extraction)

SLIDE 15


Visual Commonsense Reasoning

Wu et al., Connective Cognition Network for Directional Visual Commonsense Reasoning, NeurIPS 2019.


  • Directional Reasoning

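Directional reasoning is implemented in the paper with a directional GCN (the D-GCN variant in the ablations). A minimal sketch, assuming one common convention: the adjacency matrix is asymmetric, so messages propagate only along directed edges; one layer row-normalizes the adjacency, aggregates neighbour features, and applies a projection and ReLU. The tiny graph and identity weights below are purely illustrative.

```python
import numpy as np

def dgcn_layer(H, A, W):
    """One directional graph-convolution step. A may be asymmetric,
    so information flows only along directed edges (i -> j means
    node i aggregates from node j under this convention)."""
    deg = A.sum(axis=1, keepdims=True)
    deg[deg == 0] = 1.0                       # avoid division by zero
    return np.maximum((A / deg) @ H @ W, 0.0)  # normalize, aggregate, ReLU

# Tiny directed chain 0 -> 1 -> 2 with one-hot node features.
A = np.array([[0., 1., 0.],
              [0., 0., 1.],
              [0., 0., 0.]])
H = np.eye(3)   # one-hot features, for clarity
W = np.eye(3)   # identity projection, for clarity

H1 = dgcn_layer(H, A, W)
# Node 0 picks up node 1's feature; node 2 has no out-edges and stays zero.
```

The asymmetry is the point: with a symmetric adjacency this reduces to an ordinary GCN layer, which is exactly the GCN-vs-D-GCN comparison in the ablation table.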

SLIDE 16


Visual Commonsense Reasoning

Wu et al., Connective Cognition Network for Directional Visual Commonsense Reasoning, NeurIPS 2019.

  • Loss: multi-class cross-entropy
  • Results on the VCR dataset
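VCR scores a small set of candidate answers per question (four in the benchmark), so the loss is plain multi-class cross-entropy over the candidates' logits. A self-contained numpy version:

```python
import numpy as np

def cross_entropy(logits, target):
    """Multi-class cross-entropy over candidate-answer logits."""
    logits = logits - logits.max()                  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum())
    return -log_probs[target]

# Correct answer at index 0: a high score there gives a low loss,
# while scoring a wrong candidate highest gives a large loss.
good = cross_entropy(np.array([4.0, 0.0, 0.0, 0.0]), target=0)
bad = cross_entropy(np.array([0.0, 4.0, 0.0, 0.0]), target=0)
```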
SLIDE 17


Visual Commonsense Reasoning

Wu et al., Connective Cognition Network for Directional Visual Commonsense Reasoning, NeurIPS 2019.

  • Conditional centers
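The conditional-center idea behind GraphVLAD can be sketched as NetVLAD-style aggregation whose cluster centers are shifted by a query-conditioned offset. Everything below is an assumption for illustration: the conditioning is reduced to a single linear map of the query, and the paper's graph convolution over the centers is omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def conditional_vlad(features, centers, query, Wq):
    """VLAD-style aggregation with query-conditioned centers:
    soft-assign each local feature to the shifted centers and
    accumulate residuals, yielding one residual vector per center."""
    cond_centers = centers + query @ Wq            # (k, d), broadcast offset
    assign = softmax(features @ cond_centers.T)    # (n, k) soft assignment
    resid = features[:, None, :] - cond_centers[None, :, :]  # (n, k, d)
    return (assign[..., None] * resid).sum(axis=0)           # (k, d)

rng = np.random.default_rng(2)
n, k, d = 6, 4, 8                       # local features, centers, dims (assumed)
features = rng.normal(size=(n, d))      # local visual features
centers = rng.normal(size=(k, d))       # base cluster centers
query = rng.normal(size=d)              # e.g. a question representation
Wq = rng.normal(size=(d, d)) * 0.1      # assumed conditioning weights

agg = conditional_vlad(features, centers, query, Wq)
```

This matches the ablation naming: dropping the offset recovers the No-C variant, and the omitted graph convolution corresponds to No-G.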
SLIDE 18


Visual Commonsense Reasoning

Wu et al., Connective Cognition Network for Directional Visual Commonsense Reasoning, NeurIPS 2019.

  • Ablation studies for GraphVLAD

  • No-C: no conditional center
  • No-G: no graph convolution

  • Ablation studies for directional reasoning

  • No-R: no reasoning module
  • LSTM-R: LSTM for reasoning
  • GCN: GCN for reasoning
  • D-GCN: directional GCN for reasoning

SLIDE 19

Conclusion

  • Visual reasoning is challenging
  • Graph networks are powerful; many directions remain to be investigated.
  • One model solves them all?