Multi-modal Reasoning: Bridging Vision and Language (Heming Zhang)

Dec 31, 2022

SLIDE 1

Multi-modal Reasoning: Bridging Vision and Language

Heming Zhang

Media Communications Lab University of Southern California

SLIDE 2

Personal Assistant – AI Touchstone

SLIDE 3

The mass of an electron is approximately 9.109 × 10⁻³¹ kg.

SLIDE 4

Has Personal Assistant Come True?

Illustration by Fiona Carswell

SLIDE 5

Vision & Language in MCL

Vision

Object detection
Semantic segmentation
Video segmentation

Language

Text classification
Language graph learning

Vision & Language

Visual dialogue
Vision & Language navigation
Multi-modal machine translation

SLIDE 6

Dialogue that is grounded in vision

What is Visual Dialogue?

A man wearing leather jacket standing next to a motorcycle
Is it colored leather?
Yes, it is.
What color is his leather?

SLIDE 7

Aiding visually impaired users
Aiding analysts

Why Visual Dialogue?

Daisy just sent you some pictures of her new house.
Great, is the living room large?
Yes, there is a large living room with fireplace.

Did anyone pass the gate yesterday?
Yes, 45 instances logged on camera.
Were any of them carrying a cardboard box?

SLIDE 8

From Information Point of View

Image Text

SLIDE 9

Encoder-decoder framework

(Das et al., 2017, Lu et al., 2017, Wu et al., 2018, etc.)

– Encoder

Embeds image, question and dialogue history

– Decoder

Decodes the embedding to answers in natural language

Previous Work

[Diagram: (Q_t, I, H_t) → Encoder → embedding E_t → Decoder → Â_t]

SLIDE 10

Lu et al., 2017, Wu et al., 2018, etc.

– Use one input as guidance to compute attention on another input

Previous multi-modal encoders

SLIDE 11

Weighted-sum over features

Attention

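[Figure: attention weights applied over a feature map of size w × h with c channels, producing a weighted-sum feature of size c]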
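
A minimal sketch of the weighted-sum view of attention on this slide. It assumes the feature map (the c, w, h labels in the figure) is flattened into a list of w·h vectors of length c, and that unnormalized scores are already given; the slide does not say how they are computed. The names `softmax` and `attend` are hypothetical.

```python
import math

def softmax(scores):
    """Turn unnormalized scores into weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(features, scores):
    """Weighted sum over features.

    features: list of N vectors of length c (N = w * h locations).
    scores:   list of N unnormalized relevance scores (assumed given).
    Returns (pooled vector of length c, attention weights).
    """
    weights = softmax(scores)
    c = len(features[0])
    pooled = [sum(w * f[k] for w, f in zip(weights, features)) for k in range(c)]
    return pooled, weights

# Toy example: 3 locations, c = 2 channels, equal scores.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
pooled, weights = attend(features, scores=[0.0, 0.0, 0.0])
```

With equal scores every location gets the same weight, so the result is simply the mean of the feature vectors.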

SLIDE 12

Attention with Guidance

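[Figure: a guidance vector g is used to compute the attention weights over the w × h feature map with c channels, producing a weighted-sum feature of size c]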
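
A minimal sketch of attention with guidance, under stated assumptions: the guidance vector has the same length c as each feature vector, and each location is scored by a dot product with the guidance. The slide does not specify the scoring function, so the dot product and the name `guided_attention` are illustrative choices only.

```python
import math

def guided_attention(features, guidance):
    """Attend over `features`, using `guidance` to score each location.

    features: list of N vectors of length c.
    guidance: one vector of length c (assumed to share the feature space).
    Returns (pooled vector of length c, attention weights).
    """
    # Dot-product scoring is an assumption made for this sketch.
    scores = [sum(g * x for g, x in zip(guidance, f)) for f in features]
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    c = len(features[0])
    pooled = [sum(w * f[k] for w, f in zip(weights, features)) for k in range(c)]
    return pooled, weights

# Toy example: the guidance points along the first channel,
# so the first location should receive almost all of the weight.
features = [[1.0, 0.0], [0.0, 1.0]]
guidance = [5.0, 0.0]
pooled, weights = guided_attention(features, guidance)
```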

SLIDE 13

Lu et al., 2017, Wu et al., 2018, etc.

– Use one input as guidance to compute attention on another input

– Process inputs sequentially in pre-defined orders

Previous multi-modal encoders
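
To illustrate what "sequentially in pre-defined orders" can mean, here is a rough sketch in which the question first guides attention over the dialogue history, and the result then guides attention over the image. The fixed order question → history → image, the dot-product scoring, the additive guidance update, and all names are assumptions for illustration; this is not the exact architecture of the cited encoders.

```python
import math

def guided_attention(features, guidance):
    """Softmax attention over `features`, scored by dot product with `guidance`."""
    scores = [sum(g * x for g, x in zip(guidance, f)) for f in features]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    c = len(features[0])
    return [sum(w * f[k] for w, f in zip(weights, features)) for k in range(c)]

def sequential_encode(question, history, image):
    """Fixed-order encoder: question -> history -> image.

    question: vector of length c.
    history:  list of vectors of length c (one per past dialogue round).
    image:    list of vectors of length c (one per image region).
    Returns the concatenation [question, attended history, attended image].
    """
    h_att = guided_attention(history, question)      # step 1: question guides history
    guide = [q + h for q, h in zip(question, h_att)]  # assumed way to merge guidance
    i_att = guided_attention(image, guide)            # step 2: merged guidance attends image
    return question + h_att + i_att                   # list concatenation

question = [1.0, 0.0]
history = [[1.0, 0.0], [0.0, 1.0]]
image = [[2.0, 0.0], [0.0, 2.0]]
embedding = sequential_encode(question, history, image)
```

Because the order is hard-coded, the same pipeline runs for every question, whether or not the question needs the history or the image first.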

SLIDE 14

Lu et al., 2017

Encoders with Sequential Attention

What color is his leather?
A man wearing leather jacket standing next to a motorcycle
Is it colored leather? Yes, it is.

SLIDE 15

Wu et al., 2018

Encoders with Sequential Attention

What color is his leather?
A man wearing leather jacket standing next to a motorcycle
Is it colored leather? Yes, it is.

SLIDE 16

Cannot accommodate different scenarios
How many people are there in the image?
Is there anything else on the table?

Previous multi-modal encoders

SLIDE 17

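Adaptive reasoning

[Figure: features F_I, F_Q, F_H feed a reasoning RNN. At step i, the RNN output f_g,i drives three guided-attention modules that produce f_I,i, f_H,i and f_Q,i, which are combined into f_QIH,i. A check "i = i_max?" has No and Yes branches, and the final embedding is E. Stage labels: Comprehension, Exploration.]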
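
A rough sketch of the loop suggested by the figure labels (F_I, F_Q, F_H, f_g,i, i_max), under heavy assumptions: the reasoning RNN is replaced by a parameter-free tanh update, fusion is an element-wise average, scoring is a dot product, and the initial guidance is the mean question feature. None of these choices come from the slides, and `adaptive_reason` is a hypothetical name.

```python
import math

def guided_attention(features, guidance):
    """Softmax attention over `features`, scored by dot product with `guidance`."""
    scores = [sum(g * x for g, x in zip(guidance, f)) for f in features]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    c = len(features[0])
    return [sum(w * f[k] for w, f in zip(weights, features)) for k in range(c)]

def adaptive_reason(F_I, F_Q, F_H, i_max=2):
    """Run i_max reasoning steps; return (final state, fused feature per step)."""
    c = len(F_Q[0])
    # Initial guidance: mean question feature (an assumption).
    state = [sum(f[k] for f in F_Q) / len(F_Q) for k in range(c)]
    trace = []
    for _ in range(i_max):
        # The current state guides attention over all three inputs.
        f_I = guided_attention(F_I, state)
        f_H = guided_attention(F_H, state)
        f_Q = guided_attention(F_Q, state)
        # Fuse the attended features (element-wise average, assumed).
        fused = [(a + b + d) / 3.0 for a, b, d in zip(f_I, f_H, f_Q)]
        trace.append(fused)
        # Stand-in for the RNN update: no learned weights.
        state = [math.tanh(s + x) for s, x in zip(state, fused)]
    return state, trace

F_I = [[1.0, 0.0], [0.0, 1.0]]  # image region features
F_Q = [[1.0, 0.0]]              # question features
F_H = [[0.5, 0.5]]              # history features
E, trace = adaptive_reason(F_I, F_Q, F_H, i_max=2)
```

Unlike a fixed-order encoder, the guidance here is recomputed at every step from what was attended so far, so which input dominates can differ from question to question.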

SLIDE 18

Attention Visualization

Is the little boy on a beach? How old does he look?

SLIDE 19

Attention Visualization

How old does he look? What color hair does he have?

SLIDE 20

Attention Visualization

What color hair does he have? Is he dressed for summer?

SLIDE 21

Attention Visualization

What color is the airplane? Time step i=1

SLIDE 22

Attention Visualization

What color is the airplane? Time step i=2

SLIDE 23

Qualitative results

4 ducks are in a grassy island of a parking lot with their heads down

SLIDE 24

Qualitative results

4 ducks are in a grassy island of a parking lot with their heads down

Question | Human | Ours
Any grass? | Yes | Yes, a lot of grass
What color grass? | It is green with brownish dead spots | Green and brown

SLIDE 25

Qualitative results

4 ducks are in a grassy island of a parking lot with their heads down

Question | Human | Ours
Any vehicles on the lot? | Yes | Yes, there are a lot of cars
Do they look new or old? | They look new | They look new

SLIDE 26

Generative Visual Dialogue System via Weighted Likelihood Estimation

Heming Zhang, Shalini Ghosh, Larry Heck, Stephen Walsh, Junting Zhang, Jie Zhang, C.-C. Jay Kuo

Thursday Aug. 15th 09:30 - 10:30 AM CV|LV - Language and Vision 2 (2501-2502)

IJCAI 2019

SLIDE 27

What is visual dialogue?
Dialogue that is grounded in vision

Vision-grounded Problems Revisited

A man wearing leather jacket standing next to a motorcycle
Is it colored leather?
Yes, it is.
What color is his leather?

SLIDE 28

From information point of view

Vision-grounded Problems Revisited

Image Text

SLIDE 29

Vision-grounded Problems Revisited

No alignment between image & text manifolds

Image: SIFT, BoW, CNN, …
Text: RNN, Transformer, …

SLIDE 30

Bridging Vision & Language

Image Text

?

SLIDE 31

Bridging Vision & Language

Image Text Joint

Manifold alignment

SLIDE 32

Usually one-to-one mapping in other manifold alignment problems

– E.g. machine translation

Bridging Vision & Language

[Figure: English and Dutch sentences mapped into a joint space]
English | Dutch
I like you | Ik hou van jou
I take you with me | Ik neem je mee

SLIDE 33

Alignment between vision and language

– No one-to-one mapping

Bridging Vision & Language

Image Text Joint

SLIDE 34

Weighted-sum over features

Attention Revisited

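[Figure: attention weights applied over a feature map of size w × h with c channels, producing a weighted-sum feature of size c]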

SLIDE 35

Alignment by attention

– Joint learning of attention and alignment

Bridging Vision & Language

Image Text Joint

SLIDE 36

Related Research in MCL

Vision

Object detection
Semantic segmentation
Video segmentation

Language

Text classification
Language graph learning

Vision & Language

Visual dialogue
Vision & Language navigation
Multi-modal machine translation

SLIDE 37

Instructions in natural language

– Walk down and turn right.

Surrounding environment in vision

Vision-and-language Navigation

SLIDE 38

Leave the room into the hall and go straight.
Head towards the stairs.
Stop on the round rug next to the flowers.

Co-attention between Vision & Language

SLIDE 39

Unsupervised Multi-modal Neural Machine Translation

SLIDE 40

Lab director: Prof. C.-C. Jay Kuo
Visiting scholars
PhD students
Master students

Media Communications Lab

SLIDE 41

Visit us at http://mcl.usc.edu/

Thank you for listening