Multi-modal Reasoning: Bridging Vision and Language, by Heming Zhang


slide-1
SLIDE 1

Multi-modal Reasoning: Bridging Vision and Language

Heming Zhang

Media Communications Lab University of Southern California

slide-2
SLIDE 2

Personal Assistant – AI Touchstone

slide-3
SLIDE 3

The mass of an electron is approximately 9.109 × 10⁻³¹ kg.

slide-4
SLIDE 4

Has Personal Assistant Come True?

Illustration by Fiona Carswell

slide-5
SLIDE 5

Vision & Language in MCL

Vision

  • Object detection
  • Semantic segmentation
  • Video segmentation

Language

  • Text classification
  • Language graph learning

Vision & Language

  • Visual dialogue
  • Vision & Language navigation
  • Multi-modal machine translation

slide-6
SLIDE 6

What is Visual Dialogue?

  • Dialogue that is grounded in vision

Caption: A man wearing leather jacket standing next to a motorcycle
Q: Is it colored leather?
A: Yes, it is.
Q: What color is his leather?

slide-7
SLIDE 7

Why Visual Dialogue?

  • Aiding visually impaired users
    – "Daisy just sent you some pictures of her new house."
    – "Great, is the living room large?"
    – "Yes, there is a large living room with fireplace."
  • Aiding analysts
    – "Did anyone pass the gate yesterday?"
    – "Yes, 45 instances logged on camera."
    – "Were any of them carrying a cardboard box?"

slide-8
SLIDE 8

From an Information Point of View

Image Text

slide-9
SLIDE 9

Previous Work

  • Encoder-decoder framework (Das et al., 2017; Lu et al., 2017; Wu et al., 2018; etc.)
    – Encoder
      • Embeds image, question and dialogue history
    – Decoder
      • Decodes the embedding to answers in natural language

[Diagram: inputs Q_t, I, H_t → Encoder → embedding E_t → Decoder → predicted answer Â_t]

slide-10
SLIDE 10

Previous multi-modal encoders

  • Lu et al., 2017, Wu et al., 2018, etc.
    – Use one input as guidance to compute attention on another input

slide-11
SLIDE 11

Attention

  • Weighted-sum over features

[Figure: attention weights applied to a w × h × c feature map, giving a c-dimensional vector]
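The weighted sum on this slide can be sketched in a few lines of plain Python. This is only an illustration, assuming the w × h feature map is flattened into a list of regions and that each region already has a raw relevance score; the names `softmax`, `attend`, `features` and `scores` are hypothetical and not taken from the talk.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(features, scores):
    """Weighted sum over region features.

    features: list of N region vectors of length c
              (a w x h feature map flattened to N = w * h regions).
    scores:   list of N raw relevance scores, one per region.
    Returns the attention weights and the attended c-dimensional vector.
    """
    weights = softmax(scores)
    channels = len(features[0])
    attended = [
        sum(w * region[k] for w, region in zip(weights, features))
        for k in range(channels)
    ]
    return weights, attended

# Toy 2 x 2 feature map with c = 3 channels, flattened to 4 regions.
features = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [1.0, 1.0, 1.0],
]
scores = [0.0, 0.0, 0.0, 0.0]  # equal scores give uniform weights
weights, attended = attend(features, scores)
```

With equal scores every region gets the same weight, so the result is simply the mean of the region vectors; unequal scores shift the sum toward the higher-scoring regions.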

slide-12
SLIDE 12

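Attention with Guidance

[Figure: a guidance vector g is used to compute the attention weights over a w × h × c feature map, giving a c-dimensional vector]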
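One common way to let a guidance vector drive the weights is scaled dot-product scoring. The slide does not say which scoring function is used, so the sketch below is an assumption, and `guided_attention`, `features` and `guidance` are hypothetical names.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def guided_attention(features, guidance):
    """Attention whose weights depend on a guidance vector.

    Assumed scoring (not from the slides):
        score_n = dot(feature_n, guidance) / sqrt(c)
    """
    scale = math.sqrt(len(guidance))
    scores = [
        sum(f * g for f, g in zip(region, guidance)) / scale
        for region in features
    ]
    weights = softmax(scores)
    attended = [
        sum(w * region[k] for w, region in zip(weights, features))
        for k in range(len(guidance))
    ]
    return weights, attended

# Three regions with c = 2 channels; the guidance points along channel 0.
features = [[4.0, 0.0], [0.0, 4.0], [2.0, 2.0]]
guidance = [1.0, 0.0]
weights, attended = guided_attention(features, guidance)
```

Changing `guidance` to `[0.0, 1.0]` moves the largest weight from the first region to the second, which is the sense in which one input steers attention over another.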

slide-13
SLIDE 13

Previous multi-modal encoders

  • Lu et al., 2017, Wu et al., 2018, etc.
    – Use one input as guidance to compute attention on another input
    – Process inputs sequentially in pre-defined orders

slide-14
SLIDE 14

Encoders with Sequential Attention

  • Lu et al., 2017

Question: What color is his leather?
Caption: A man wearing leather jacket standing next to a motorcycle
History: Is it colored leather? Yes, it is.

slide-15
SLIDE 15

Encoders with Sequential Attention

  • Wu et al., 2018

Question: What color is his leather?
Caption: A man wearing leather jacket standing next to a motorcycle
History: Is it colored leather? Yes, it is.

slide-16
SLIDE 16

Previous multi-modal encoders

  • Cannot adapt to different scenarios
    – How many people are there in the image?
    – Is there anything else on the table?

slide-17
SLIDE 17

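Adaptive reasoning

[Diagram: inputs F_I, F_Q, F_H. At step i, a Reasoning RNN outputs the guidance f_g,i; three Guided Attention blocks produce f_I,i, f_Q,i and f_H,i, which are combined into f_QIH,i. If i = i_max (Yes) the loop stops and outputs E; otherwise (No) it continues. Stage labels: Comprehension, Exploration.]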
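The loop in this diagram can be approximated as follows. This is a rough sketch under several assumptions that the slide does not confirm: the initial guidance is taken to be the mean question feature, fusion is an element-wise mean, and a `tanh` update stands in for the reasoning RNN. All function and variable names are hypothetical.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def guided_attention(features, guidance):
    """Weighted sum of feature vectors, weighted by dot-product with guidance."""
    weights = softmax(
        [sum(f * g for f, g in zip(vec, guidance)) for vec in features]
    )
    return [
        sum(w * vec[k] for w, vec in zip(weights, features))
        for k in range(len(guidance))
    ]

def adaptive_encode(F_I, F_Q, F_H, i_max=2):
    """Run i_max rounds of guided attention over image, question and history."""
    dim = len(F_Q[0])
    # Assumption: the first guidance vector is the mean question feature.
    f_g = [sum(vec[k] for vec in F_Q) / len(F_Q) for k in range(dim)]
    state = [0.0] * dim
    for _ in range(i_max):
        f_I = guided_attention(F_I, f_g)
        f_Q = guided_attention(F_Q, f_g)
        f_H = guided_attention(F_H, f_g)
        # Assumption: fuse the three attended features by element-wise mean.
        f_QIH = [(q + i + h) / 3.0 for q, i, h in zip(f_Q, f_I, f_H)]
        # Stand-in for the reasoning RNN: a simple tanh state update.
        state = [math.tanh(s + x) for s, x in zip(state, f_QIH)]
        f_g = state  # the new state guides the next round of attention
    return state  # plays the role of the final embedding E

# Toy features with dimension 2 (values are illustrative only).
F_I = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
F_Q = [[1.0, 0.0], [0.8, 0.2]]
F_H = [[0.0, 1.0], [0.3, 0.7]]
E = adaptive_encode(F_I, F_Q, F_H, i_max=2)
```

The point of the sketch is the control flow: the guidance is recomputed at every step, so the order in which the inputs influence each other is not fixed in advance.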

slide-18
SLIDE 18

Attention Visualization

Is the little boy on a beach? How old does he look?

slide-19
SLIDE 19

Attention Visualization

How old does he look? What color hair does he have?

slide-20
SLIDE 20

Attention Visualization

What color hair does he have? Is he dressed for summer?

slide-21
SLIDE 21

Attention Visualization

What color is the airplane? Time step i=1

slide-22
SLIDE 22

Attention Visualization

What color is the airplane? Time step i=2

slide-23
SLIDE 23

Qualitative results

4 ducks are in a grassy island of a parking lot with their heads down

slide-24
SLIDE 24

Qualitative results

Caption: 4 ducks are in a grassy island of a parking lot with their heads down

Question            Human                                   Ours
Any grass?          Yes                                     Yes, a lot of grass
What color grass?   It is green with brownish dead spots    Green and brown

slide-25
SLIDE 25

Qualitative results

Caption: 4 ducks are in a grassy island of a parking lot with their heads down

Question                    Human            Ours
Any vehicles on the lot?    Yes              Yes, there are a lot of cars
Do they look new or old?    They look new    They look new

slide-26
SLIDE 26

Generative Visual Dialogue System via Weighted Likelihood Estimation

Heming Zhang, Shalini Ghosh, Larry Heck, Stephen Walsh, Junting Zhang, Jie Zhang, C.-C. Jay Kuo

Thursday Aug. 15th 09:30 - 10:30 AM CV|LV - Language and Vision 2 (2501-2502)

IJCAI 2019

slide-27
SLIDE 27

Vision-grounded Problems Revisited

  • What is visual dialogue?
  • Dialogue that is grounded in vision

Caption: A man wearing leather jacket standing next to a motorcycle
Q: Is it colored leather?
A: Yes, it is.
Q: What color is his leather?

slide-28
SLIDE 28

Vision-grounded Problems Revisited

  • From an information point of view

Image Text

slide-29
SLIDE 29

Vision-grounded Problems Revisited

  • No alignment between image & text manifolds

Image: SIFT, BoW, CNN, …
Text: RNN, Transformer, …

slide-30
SLIDE 30

Bridging Vision & Language

Image Text

?

slide-31
SLIDE 31

Bridging Vision & Language

Image Text Joint

  • Manifold alignment
slide-32
SLIDE 32

Bridging Vision & Language

  • Usually one-to-one mapping in other manifold alignment problems
    – E.g. machine translation

English and Dutch in a joint space:
  I like you ↔ Ik hou van jou
  I take you with me ↔ Ik neem je mee
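A one-to-one alignment in a joint space can be illustrated with nearest-neighbor matching by cosine similarity. The two-dimensional embeddings below are made up for the example and are not from the talk; `cosine`, `align`, `english` and `dutch` are hypothetical names.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def align(source, target):
    """Map each source sentence to its most similar target sentence."""
    return {
        s: max(target, key=lambda t: cosine(source[s], target[t]))
        for s in source
    }

# Toy joint-space embeddings (illustrative values only).
english = {
    "I like you": [0.9, 0.1],
    "I take you with me": [0.1, 0.9],
}
dutch = {
    "Ik hou van jou": [0.8, 0.2],
    "Ik neem je mee": [0.2, 0.8],
}
pairs = align(english, dutch)
```

Here every source sentence lands on a different target sentence, which is the one-to-one case. Between images and text, many sentences can describe the same image, so a single nearest neighbor is no longer enough.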

slide-33
SLIDE 33

Bridging Vision & Language

  • Alignment between vision and language
    – No one-to-one mapping

Image Text Joint

slide-34
SLIDE 34

Attention Revisited

  • Weighted-sum over features

[Figure: attention weights applied to a w × h × c feature map, giving a c-dimensional vector]

slide-35
SLIDE 35

Bridging Vision & Language

  • Alignment by attention
    – Joint learning of attention and alignment

Image Text Joint

slide-36
SLIDE 36

Related Research in MCL

Vision

  • Object detection
  • Semantic segmentation
  • Video segmentation

Language

  • Text classification
  • Language graph learning

Vision & Language

  • Visual dialogue
  • Vision & Language navigation
  • Multi-modal machine translation

slide-37
SLIDE 37

Vision-and-language Navigation

  • Instructions in natural language
    – Walk down and turn right.
  • Surrounding environment in vision

slide-38
SLIDE 38

Co-attention between Vision & Language

  • Leave the room into the hall and go straight.
  • Head towards the stairs.
  • Stop on the round rug next to the flowers.
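The slide only names co-attention, so the following is a generic sketch rather than the method used in the talk. It assumes a dot-product affinity between instruction-word vectors and visual-view vectors, normalized in both directions; `co_attention`, `words` and `views` are hypothetical names and the vectors are toy values.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def co_attention(words, views):
    """Attend from words to views and from views to words.

    affinity[i][j] = dot(words[i], views[j])  (assumed scoring function)
    """
    affinity = [
        [sum(a * b for a, b in zip(w, v)) for v in views]
        for w in words
    ]
    # For each word: a distribution over views (normalize each row).
    word_to_view = [softmax(row) for row in affinity]
    # For each view: a distribution over words (normalize each column).
    view_to_word = [
        softmax([affinity[i][j] for i in range(len(words))])
        for j in range(len(views))
    ]
    return word_to_view, view_to_word

# Toy vectors: word 0 stands for "stairs", word 1 for "rug";
# view 0 resembles stairs, view 1 resembles a rug.
words = [[1.0, 0.0], [0.0, 1.0]]
views = [[3.0, 0.0], [0.0, 3.0]]
word_to_view, view_to_word = co_attention(words, views)
```

Each word ends up focusing on the view it matches, and each view on the word that describes it, which lets the instruction and the scene ground each other.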

slide-39
SLIDE 39

Unsupervised Multi-modal Neural Machine Translation

slide-40
SLIDE 40

Media Communications Lab

  • Lab director: Prof. C.-C. Jay Kuo
  • Visiting scholars
  • PhD students
  • Master's students

slide-41
SLIDE 41

Visit us at http://mcl.usc.edu/

Thank you for listening