Lecture #4 – Multimodality
Aykut Erdem // Hacettepe University // Spring 2019
CMP722
ADVANCED COMPUTER VISION
Illustration: Detail from Fritz Kahn’s Der Mensch als Industriepalast
Previously on CMP722
Illustration: DeepMind
Lecture overview
— Louis-Philippe Morency and Tadas Baltrusaitis's CMU 11-777 class
— Qi Wu's slides for the ACL 2018 Tutorial on Connecting Language and Vision to Actions
3
What is Multimodal?
Multimodal distribution
➢ Multiple modes, i.e., distinct “peaks” (local maxima) in the probability density function
4
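The "multiple peaks" idea can be sketched numerically. A minimal numpy example (the mixture parameters are invented for illustration) that counts the local maxima of a two-component Gaussian mixture density:

```python
import numpy as np

# Density of a two-component Gaussian mixture: a multimodal (here, bimodal)
# distribution with two distinct "peaks" (local maxima) in its pdf.
def mixture_pdf(x, means=(-2.0, 3.0), stds=(1.0, 1.0), weights=(0.5, 0.5)):
    pdf = np.zeros_like(x)
    for m, s, w in zip(means, stds, weights):
        pdf += w * np.exp(-0.5 * ((x - m) / s) ** 2) / (s * np.sqrt(2 * np.pi))
    return pdf

x = np.linspace(-6.0, 7.0, 1301)
p = mixture_pdf(x)

# A mode on the grid is a point higher than both of its neighbors.
modes = [x[i] for i in range(1, len(x) - 1) if p[i] > p[i - 1] and p[i] > p[i + 1]]
print(len(modes))  # 2 -- one peak near each component mean
```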
What is Multimodal?
Sensory Modalities
5
What is Multimodal?
Modality: The way in which something happens or is experienced.
➢ Modality refers to a certain type of information and/or the representation format in which information is stored.
➢ Sensory modality: one of the primary forms of sensation, such as vision or touch; a channel of communication.
Medium (“middle”): A means or instrumentality for storing or communicating information; a system of communication/transmission.
➢ A medium concerns the means of transmission, while a modality concerns the senses of the interpreter.
6
Examples of Modalities
7
Multiple Communities and Modalities
8
Psychology Medical Speech Vision Language Multimedia Robotics Learning
9
Prior Research on “Multimodal”
Four eras of multimodal research
10
The “Behavioral” Era (1970s until late 1980s)
Multi-sensory integration (in psychology):
[1983]
neglected preschoolers [1984]
➢ TRIVIA: Geoffrey Hinton received his B.A. in Psychology
11
Multimodal Behavior Therapy by Arnold Lazarus [1973] ➢ 7 dimensions of personality (or modalities)
The McGurk Effect (1976)
“Hearing lips and seeing voices” [McGurk & MacDonald, Nature 1976]
12
The “Computational” Era (Late 1980s until 2000)
1) Audio-Visual Speech Recognition (AVSR)
“Automatic lipreading to enhance speech recognition”
“Recent Advances in the Automatic Recognition of Audio-Visual Speech”
➢ TRIVIA: The first multimodal deep learning paper was about audio-visual speech recognition [ICML 2011]
13
The “Computational” Era (Late 1980s until 2000)
2) Multimodal/multisensory interfaces
(HCI) “Study of how to design and evaluate new computer systems where humans interact through multiple modalities, including both input and output modalities.”
14
Glove-talk: A neural network interface between a data-glove and a speech synthesizer By Sidney Fels & Geoffrey Hinton [CHI’95]
The “Computational” Era (Late 1980s until 2000)
2) Multimodal/multisensory interfaces
15
Affective Computing [Rosalind Picard]
➢ “Computing that relates to, arises from, or deliberately influences emotion or other affective phenomena.”
The “Computational” Era (Late 1980s until 2000)
3) Multimedia Computing
16
“The Informedia Digital Video Library Project automatically combines speech, image and natural language understanding to create a full-content searchable digital video library.”
[1994-2010]
The “Computational” Era (Late 1980s until 2000)
3) Multimedia Computing: multimedia content analysis
17
The “Computational” Era (Late 1980s until 2000)
18
Multimodal Computational Models
19
The “Interaction” Era (2000s)
1) Modeling Human Multimodal Interaction
➢ TRIVIA: Samy Bengio started at IDIAP working on the AMI project
20
CHIL Project [Alex Waibel]
The “Interaction” Era (2000s)
1) Modeling Human Multimodal Interaction
21
CALO Project [2003-2008, SRI]
SSP Project [2008-2011, IDIAP]
The “Interaction” Era (2000s)
2) Multimedia Information Retrieval
Research tasks and challenges:
22
“Yearly competition to promote progress in content-based retrieval from digital video via open, metrics-based evaluation” [Hosted by NIST, 2001-2006]
Multimodal Computational Models
Audio-visual speech segmentation
23
▪ Kevin Murphy’s PhD thesis and Matlab toolbox on dynamic Bayesian networks
Multimodal Computational Models
24
▪ Conditional random fields [Lafferty et al., 2001]
▪ Dynamic CRF
The “deep learning” era (2010s until ...)
Representation learning (a.k.a. deep learning)
Visual Attention [ICML 2015]
Key enablers for multimodal research:
25
The topic of this lecture!
Real World tasks tackled by MMML
26
27
Multimodal Machine Learning
Core Technical Challenges: Representation, Translation, Alignment, Fusion, Co-Learning
28
Verbal Vocal Visual
Core Challenge 1: Representation
Definition: Learning how to represent and summarize multimodal data in a way that exploits the complementarity and redundancy of multiple modalities.
29
Joint representations: A
Modality 1 Modality 2
Representation
Joint Multimodal Representation
30
“I like it!”
Joyful tone
Tensed voice
“Wow!”
Joint Representation
(Multimodal Space)
Joint Multimodal Representation
Audio-visual speech recognition
[Ngiam et al., ICML 2011]
Image captioning
[Srivastava and Salakhutdinov, NIPS 2012]
Audio-visual emotion recognition
[Kim et al., ICASSP 2013]
31
Visual Multimodal Representation
Multimodal Vector Space Arithmetic
32
[Kiros et al., Unifying Visual-Semantic Embeddings with Multimodal Neural Language Models, 2014]
Core Challenge 1: Representation
Definition: Learning how to represent and summarize multimodal data in a way that exploits the complementarity and redundancy of multiple modalities.
33
Modality 1 Modality 2
Representation
Joint representations: A Coordinated representations: B
Modality 1 Modality 2
Representation 1 Representation 2
Coordinated Representation: Deep CCA
34
Text input 𝑿𝒚 (view 𝐼𝑦) and image input 𝑿𝒛 (view 𝐼𝑧) are passed through deep networks with parameters 𝑾 and 𝑽 to produce top-layer representations 𝒀 and 𝒁. Linear projections 𝒗 and 𝒘 are then learned to maximize the correlation between the projected views:

𝒗∗, 𝒘∗ = argmax𝒗,𝒘 corr(𝒗ᵀ𝒀, 𝒘ᵀ𝒁)

[Andrew et al., ICML 2013]
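Behind the deep networks, the objective is classical canonical correlation analysis. A minimal numpy sketch (synthetic data, linear CCA only, not the deep version) of the top canonical correlation between two views that share a latent signal:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two "views" of the same underlying latent signal, plus view-specific noise.
n = 2000
latent = rng.normal(size=(n, 1))
U_Y = np.hstack([latent + 0.1 * rng.normal(size=(n, 1)) for _ in range(4)])
U_Z = np.hstack([latent + 0.1 * rng.normal(size=(n, 1)) for _ in range(3)])

def top_canonical_corr(A, B, reg=1e-6):
    """First canonical correlation: max over v, w of corr(A v, B w)."""
    A = A - A.mean(0)
    B = B - B.mean(0)
    Saa = A.T @ A / len(A) + reg * np.eye(A.shape[1])
    Sbb = B.T @ B / len(B) + reg * np.eye(B.shape[1])
    Sab = A.T @ B / len(A)
    # Whiten each view; the singular values of the whitened cross-covariance
    # are exactly the canonical correlations.
    Wa = np.linalg.inv(np.linalg.cholesky(Saa))
    Wb = np.linalg.inv(np.linalg.cholesky(Sbb))
    return np.linalg.svd(Wa @ Sab @ Wb.T, compute_uv=False)[0]

rho = top_canonical_corr(U_Y, U_Z)
print(rho)  # close to 1: the views share a strongly correlated latent factor
```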
Core Challenge 2: Alignment
Definition: Identify the direct relations between (sub)elements from two or more different modalities.
35
Explicit Alignment
The goal is to directly find correspondences between elements of different modalities
A Implicit Alignment
Uses internally latent alignment of modalities in order to better solve a different problem
B
Fancy algorithm
Modality 1 Modality 2
Implicit Alignment
36
Karpathy et al., Deep Fragment Embeddings for Bidirectional Image Sentence Mapping, https://arxiv.org/pdf/1406.5679.pdf
Attention Models for Image Captioning
37
Distribution over locations → Expectation over features (soft attention)
First word
Output word
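The distribution-then-expectation step can be sketched in a few lines. A minimal numpy example; the dot-product scoring and the feature dimensions are illustrative assumptions, not the exact model of the ICML 2015 paper:

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

rng = np.random.default_rng(0)
L, D = 14 * 14, 512                  # 196 spatial locations, 512-dim features
features = rng.normal(size=(L, D))   # flattened CNN feature map (assumed sizes)
hidden = rng.normal(size=D)          # decoder state when emitting the next word

scores = features @ hidden           # one relevance score per image location
alpha = softmax(scores)              # distribution over locations
context = alpha @ features           # expectation: attention-weighted feature

print(alpha.shape, context.shape)    # (196,) (512,)
```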
Core Challenge 3: Fusion
38
Definition: To join information from two or more modalities to perform a prediction task.
Model-Agnostic Approaches A 1) Early Fusion 2) Late Fusion
Early fusion: Modality 1 + Modality 2 → Classifier
Late fusion: Modality 1 → Classifier; Modality 2 → Classifier → combined decision
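The two model-agnostic strategies can be contrasted on toy data. The least-squares "classifiers" and the feature dimensions below are invented stand-ins for any unimodal model:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy two-modality classification data: the label depends on one feature
# from each modality, so neither modality alone carries the full signal.
n = 400
x1 = rng.normal(size=(n, 5))   # modality 1 (e.g., audio features)
x2 = rng.normal(size=(n, 3))   # modality 2 (e.g., visual features)
y = (x1[:, 0] + x2[:, 0] > 0).astype(float)

def fit_linear(X, y):
    """Least-squares scorer with a bias term, standing in for any model."""
    Xb = np.hstack([X, np.ones((len(X), 1))])
    w, *_ = np.linalg.lstsq(Xb, y, rcond=None)
    return lambda Z: np.hstack([Z, np.ones((len(Z), 1))]) @ w

# Early fusion: concatenate the features, train a single model.
early = fit_linear(np.hstack([x1, x2]), y)
acc_early = ((early(np.hstack([x1, x2])) > 0.5) == y).mean()

# Late fusion: train one model per modality, then average their scores.
f1, f2 = fit_linear(x1, y), fit_linear(x2, y)
acc_late = ((0.5 * (f1(x1) + f2(x2)) > 0.5) == y).mean()

print(acc_early, acc_late)  # training accuracies of the two fusion schemes
```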
Core Challenge 3: Fusion
39
Definition: To join information from two or more modalities to perform a prediction task.
Model-Based (Intermediate) Approaches B 1) Deep neural networks 2) Kernel-based methods 3) Graphical models
Examples: Multiple kernel learning; Multi-View Hidden CRF
Core Challenge 4: Translation
40
Definition: Process of changing data from one modality to another, where the translation relationship can often be open-ended or subjective.
Example-based A Model-driven B
Core Challenge 4: Translation
41
Transcriptions + audio streams → visual gestures (both speaker and listener gestures)
Marsella et al., Virtual character performance from speech, SIGGRAPH/Eurographics Symposium on Computer Animation, 2013
Core Challenge 5: Co-Learning
Definition: Transfer knowledge between modalities, including their representations and predictive models.
43
Modality 1 Prediction Modality 2
Core Challenge 5: Co-Learning
44
Parallel A Non-Parallel B Hybrid C
Taxonomy of Multimodal Research
Representation
Translation
Alignment
Fusion
Co-learning
45
Tadas Baltrusaitis, Chaitanya Ahuja, and Louis-Philippe Morency, Multimodal Machine Learning: A Survey and Taxonomy https://arxiv.org/abs/1705.09406
46
Multimodal representations
§ What if relationship is complex? § Doesn’t exploit joint information, especially at lower/intermediate levels
47
Multimodal representations
➢ Enforce similarity between corresponding concepts
➢ Useful for many tasks: retrieval, mapping, fusion, etc. (mapping between modalities)
48
Modality 2 Modality 3
Fancy representation
Modality 1 Modality 2 Modality 3
Fancy representation
Prediction Modality 1 Modality 2
Fancy representation
Multimodal representation types
➢ Simplest version: modality concatenation (early fusion)
➢ Can be learned supervised or unsupervised
➢ Multimodal factor analysis
➢ Similarity-based methods (e.g., cosine distance)
➢ Structure constraints (e.g., …)
49
Modality 1 Modality 2
Representation
Modality 1 Modality 2
Representation 1 Representation 2
Joint representations: A Coordinated representations: B
50
Shallow multimodal representations
51
Shallow Autoencoder
Deep multimodal autoencoders
▪ An autoencoder-based approach, applied to audio-visual speech recognition
52
[Ngiam et al., Multimodal Deep Learning, 2011]
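The cross-modal idea behind the bimodal autoencoder can be sketched in closed form. This linear toy (a PCA "encoder", a least-squares "decoder", and synthetic shared factors) only illustrates why a code computed from one modality can reconstruct both; it is not the deep model of Ngiam et al.:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "audio" and "video" generated from the same k shared factors.
n, k = 1000, 4
latent = rng.normal(size=(n, k))
audio = latent @ rng.normal(size=(k, 10))
video = latent @ rng.normal(size=(k, 12))
both = np.hstack([audio, video])

# "Encoder": a k-dim code computed from the VIDEO modality only (PCA).
video_c = video - video.mean(0)
_, _, Vt = np.linalg.svd(video_c, full_matrices=False)
code = video_c @ Vt[:k].T

# "Decoder": a linear map from the video-only code to BOTH modalities.
both_c = both - both.mean(0)
W, *_ = np.linalg.lstsq(code, both_c, rcond=None)
rel_err = np.linalg.norm(code @ W - both_c) / np.linalg.norm(both_c)
print(rel_err < 1e-6)  # True: the video code reconstructs audio as well
```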
Deep multimodal autoencoders – training
▪ Individual modality networks can be pre-trained
▪ The model is trained to reconstruct both modalities, including reconstructing one modality from the other
53
Multimodal Encoder-Decoder
▪ Image encoded using a CNN
▪ Language decoded using an LSTM
▪ A multilayer perceptron is used to translate from the visual (CNN) representation to the language (LSTM) representation
55
Multimodal Joint Representation
▪ Combining unimodal representations (e.g., by summation)
▪ How can we model both unimodal and bimodal interactions?
56
e.g. Sentiment
Deep multimodal Boltzmann machines
▪ A generative model (related to DBNs), trained using variational approaches
▪ Used for multimedia retrieval
▪ Generating one modality from another is a bit more “natural” than in an autoencoder representation (e.g., generating text from images)
57
[Srivastava and Salakhutdinov, Multimodal Learning with Deep Boltzmann Machines, 2012, 2014]
Deep multimodal Boltzmann machines
58
Srivastava and Salakhutdinov, “Multimodal Learning with Deep Boltzmann Machines”, NIPS 2012
59
Coordinated multimodal embeddings
unimodal embeddings
60
Coordinated multimodal embeddings
Learn two or more coordinated representations from multiple modalities; a loss function is defined to bring these representations closer together.
61
[Diagram: a text network and an image network map to representations 𝒀 and 𝒁, coordinated by a similarity metric (e.g., cosine distance)]
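The "bring closer" objective is typically a margin-based ranking loss, as in Kiros et al. (2014). A minimal numpy sketch; the margin value and the toy orthonormal embeddings are invented:

```python
import numpy as np

def cosine(a, b):
    return (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

def ranking_loss(img, txt, margin=0.2):
    """Pairwise hinge ranking loss: matching image-text pairs (row i of each)
    should out-score mismatched pairs by at least the margin."""
    n = len(img)
    loss = 0.0
    for i in range(n):
        pos = cosine(img[i], txt[i])
        for j in range(n):
            if j != i:
                loss += max(0.0, margin - pos + cosine(img[i], txt[j]))  # mismatched text
                loss += max(0.0, margin - pos + cosine(img[j], txt[i]))  # mismatched image
    return loss / n

# Perfectly coordinated toy embeddings: identical, mutually orthogonal pairs.
E = np.eye(4, 8)
print(ranking_loss(E, E))  # 0.0 -- every pair beats every mismatch by the margin
```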
Coordinated multimodal embeddings
unimodal embeddings
62
[Kiros et al., Unifying Visual-Semantic Embeddings with Multimodal Neural Language Models, 2014] [Frome et al., DeViSE: A Deep Visual-Semantic Embedding Model, 2013]
Coordinated multimodal embeddings
63
[Huang et al., Learning Deep Structured Semantic Models for Web Search using Clickthrough Data, 2013]
Multimodal Vector Space Arithmetic
64
[Kiros et al., Unifying Visual-Semantic Embeddings with Multimodal Neural Language Models, 2014]
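The arithmetic can be illustrated with toy vectors. The concept vocabulary below is invented, constructed so that each concept embedding is attribute vector + object vector:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16

# Invented toy vocabulary: each concept embedding = attribute + object vector,
# so "blue car" - "blue" + "red" lands on "red car".
attrs = {a: rng.normal(size=d) for a in ("blue", "red")}
objs = {o: rng.normal(size=d) for o in ("car", "house")}
concepts = {f"{a} {o}": attrs[a] + objs[o] for a in attrs for o in objs}

query = concepts["blue car"] - attrs["blue"] + attrs["red"]

def nearest(q, table):
    """Concept whose embedding has the highest cosine similarity to q."""
    return max(table, key=lambda k: (table[k] @ q) /
               (np.linalg.norm(table[k]) * np.linalg.norm(q)))

print(nearest(query, concepts))  # red car
```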
Structured coordinated embeddings
66
[Vendrov et al., Order-Embeddings of Images and Language, 2016] [Jiang and Li, Deep Cross-Modal Hashing]
Recap: Multimodal representations
67
68
Fusing the CNN image feature with the LSTM question feature (“What’s the mustache made of?”) for VQA:
▪ Concatenate
▪ Element-wise sum
▪ Element-wise product
▪ Bilinear pooling (outer product): 2048 × 2048 feature interactions × 3000 classes ≈ 12.5 billion parameters!!!
▪ Multimodal Compact Bilinear Pooling (MCB)
Fukui et al. EMNLP 2016
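MCB avoids the explicit outer product by combining Count Sketch projections with an FFT-based circular convolution. A minimal numpy sketch (the output dimension d = 1024 is chosen for illustration; the real model uses a larger d and learned layers around this step):

```python
import numpy as np

rng = np.random.default_rng(0)

def count_sketch_params(n, d):
    """Random hash index and sign per input dimension (Count Sketch)."""
    return rng.integers(0, d, size=n), rng.choice([-1.0, 1.0], size=n)

def count_sketch(x, h, s, d):
    out = np.zeros(d)
    np.add.at(out, h, s * x)   # scatter-add each signed coordinate
    return out

def mcb(x, y, d=1024):
    """Compact bilinear pooling: sketch both vectors, then a circular
    convolution via FFT approximates a sketch of their outer product."""
    hx, sx = count_sketch_params(len(x), d)
    hy, sy = count_sketch_params(len(y), d)
    fx = np.fft.fft(count_sketch(x, hx, sx, d))
    fy = np.fft.fft(count_sketch(y, hy, sy, d))
    return np.real(np.fft.ifft(fx * fy))

visual = rng.normal(size=2048)    # e.g., CNN image feature
textual = rng.normal(size=2048)   # e.g., LSTM question feature
fused = mcb(visual, textual)
print(fused.shape)  # (1024,) instead of the 2048 x 2048 explicit outer product
```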
74