Lecture #4 – Multimodality
Aykut Erdem // Hacettepe University // Spring 2019
CMP722
ADVANCED COMPUTER VISION
Illustration: Detail from Fritz Kahn’s Der Mensch als Industriepalast
Previously on CMP722
Illustration: DeepMind
Lecture overview
— Louis-Philippe Morency and Tadas Baltrusaitis's CMU 11-777 class
— Qi Wu's slides for the ACL 2018 Tutorial on Connecting Language and Vision to Actions
3
What is Multimodal?
Multimodal distribution
➢ Multiple modes, i.e., distinct “peaks” (local maxima) in the probability density function
4
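The "multiple peaks" idea can be sketched numerically. A minimal numpy example (the mixture parameters are invented for illustration) that counts the local maxima of a two-component Gaussian mixture density:

```python
import numpy as np

# Density of a two-component Gaussian mixture: a multimodal (here, bimodal)
# distribution with two distinct "peaks" (local maxima) in its pdf.
def mixture_pdf(x, means=(-2.0, 3.0), stds=(1.0, 1.0), weights=(0.5, 0.5)):
    pdf = np.zeros_like(x)
    for m, s, w in zip(means, stds, weights):
        pdf += w * np.exp(-0.5 * ((x - m) / s) ** 2) / (s * np.sqrt(2 * np.pi))
    return pdf

x = np.linspace(-6.0, 7.0, 1301)
p = mixture_pdf(x)

# A mode on the grid is a point higher than both of its neighbors.
modes = [x[i] for i in range(1, len(x) - 1) if p[i] > p[i - 1] and p[i] > p[i + 1]]
print(len(modes))  # 2 -- one peak near each component mean
```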
What is Multimodal?
Sensory Modalities
5
What is Multimodal?
Modality: The way in which something happens or is experienced.
➢ Modality refers to a certain type of information and/or the representation format in which information is stored.
➢ Sensory modality: one of the primary forms of sensation, such as vision or touch; a channel of communication.
Medium (“middle”): A means or instrumentality for storing or communicating information; a system of communication/transmission.
➢ A medium concerns the means of transmission, while a modality concerns the senses of the interpreter.
6
Examples of Modalities
7
Multiple Communities and Modalities
8
Psychology Medical Speech Vision Language Multimedia Robotics Learning
9
Prior Research on “Multimodal”
Four eras of multimodal research
10
The “Behavioral” Era (1970s until late 1980s)
Multi-sensory integration (in psychology):
[1983]
neglected preschoolers [1984]
➢ TRIVIA: Geoffrey Hinton received his B.A. in Psychology
11
Multimodal Behavior Therapy by Arnold Lazarus [1973] ➢ 7 dimensions of personality (or modalities)
The McGurk Effect (1976)
“Hearing lips and seeing voices” [McGurk & MacDonald, Nature 1976]
12
The “Computational” Era (Late 1980s until 2000)
1) Audio-Visual Speech Recognition (AVSR)
“Automatic lipreading to enhance speech recognition”
“Recent Advances in the Automatic Recognition of Audio-Visual Speech”
➢ TRIVIA: The first multimodal deep learning paper was about audio-visual speech recognition [ICML 2011]
13
The “Computational” Era (Late 1980s until 2000)
2) Multimodal/multisensory interfaces
(HCI) “Study of how to design and evaluate new computer systems where humans interact through multiple modalities, including both input and output modalities.”
14
Glove-talk: A neural network interface between a data-glove and a speech synthesizer By Sidney Fels & Geoffrey Hinton [CHI’95]
The “Computational” Era (Late 1980s until 2000)
2) Multimodal/multisensory interfaces
15
Affective Computing [Rosalind Picard]
➢ “Computing that relates to, arises from, or deliberately influences emotion or other affective phenomena.”
The “Computational” Era (Late 1980s until 2000)
3) Multimedia Computing
16
“The Informedia Digital Video Library Project automatically combines speech, image and natural language understanding to create a full-content searchable digital video library.”
[1994-2010]
The “Computational” Era (Late 1980s until 2000)
3) Multimedia Computing: multimedia content analysis
17
The “Computational” Era (Late 1980s until 2000)
18
Multimodal Computational Models
19
The “Interaction” Era (2000s)
1) Modeling Human Multimodal Interaction
➢ TRIVIA: Samy Bengio started at IDIAP working on the AMI project
20
CHIL Project [Alex Waibel]
The “Interaction” Era (2000s)
1) Modeling Human Multimodal Interaction
21
CALO Project [2003-2008, SRI]
SSP Project [2008-2011, IDIAP]
The “Interaction” Era (2000s)
2) Multimedia Information Retrieval
Research tasks and challenges:
22
“Yearly competition to promote progress in content-based retrieval from digital video via open, metrics-based evaluation” [Hosted by NIST, 2001-2006]
Multimodal Computational Models
Audio-visual speech segmentation
23
▪ Kevin Murphy’s PhD thesis and Matlab toolbox on dynamic Bayesian networks
Multimodal Computational Models
24
▪ Conditional random fields [Lafferty et al., 2001]
▪ Dynamic CRF
The “deep learning” era (2010s until ...)
Representation learning (a.k.a. deep learning)
Visual Attention [ICML 2015]
Key enablers for multimodal research:
25
The topic of this lecture!
Real World tasks tackled by MMML
26
27
Multimodal Machine Learning
Core Technical Challenges: Representation, Translation, Alignment, Fusion, Co-Learning
28
Verbal Vocal Visual
Core Challenge 1: Representation
Definition: Learning how to represent and summarize multimodal data in a way that exploits the complementarity and redundancy of multiple modalities.
29
Joint representations: A
Modality 1 Modality 2
Representation
Joint Multimodal Representation
30
“I like it!”
Joyful tone
Tensed voice
“Wow!”
Joint Representation
(Multimodal Space)
Joint Multimodal Representation
Audio-visual speech recognition
[Ngiam et al., ICML 2011]
Image captioning
[Srivastava and Salakhutdinov, NIPS 2012]
Audio-visual emotion recognition
[Kim et al., ICASSP 2013]
31
Visual Multimodal Representation
Multimodal Vector Space Arithmetic
32
[Kiros et al., Unifying Visual-Semantic Embeddings with Multimodal Neural Language Models, 2014]
Core Challenge 1: Representation
Definition: Learning how to represent and summarize multimodal data in a way that exploits the complementarity and redundancy of multiple modalities.
33
Modality 1 Modality 2
Representation
Joint representations: A Coordinated representations: B
Modality 1 Modality 2
Representation 1 Representation 2
Coordinated Representation: Deep CCA
34
Text input 𝑿𝒚 (view 𝐼𝑦) and image input 𝑿𝒛 (view 𝐼𝑧) are passed through deep networks with parameters 𝑾 and 𝑽 to produce top-layer representations 𝒀 and 𝒁. Linear projections 𝒗 and 𝒘 are then learned to maximize the correlation between the projected views:

𝒗∗, 𝒘∗ = argmax𝒗,𝒘 corr(𝒗ᵀ𝒀, 𝒘ᵀ𝒁)

[Andrew et al., ICML 2013]
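Behind the deep networks, the objective is classical canonical correlation analysis. A minimal numpy sketch (synthetic data, linear CCA only, not the deep version) of the top canonical correlation between two views that share a latent signal:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two "views" of the same underlying latent signal, plus view-specific noise.
n = 2000
latent = rng.normal(size=(n, 1))
U_Y = np.hstack([latent + 0.1 * rng.normal(size=(n, 1)) for _ in range(4)])
U_Z = np.hstack([latent + 0.1 * rng.normal(size=(n, 1)) for _ in range(3)])

def top_canonical_corr(A, B, reg=1e-6):
    """First canonical correlation: max over v, w of corr(A v, B w)."""
    A = A - A.mean(0)
    B = B - B.mean(0)
    Saa = A.T @ A / len(A) + reg * np.eye(A.shape[1])
    Sbb = B.T @ B / len(B) + reg * np.eye(B.shape[1])
    Sab = A.T @ B / len(A)
    # Whiten each view; the singular values of the whitened cross-covariance
    # are exactly the canonical correlations.
    Wa = np.linalg.inv(np.linalg.cholesky(Saa))
    Wb = np.linalg.inv(np.linalg.cholesky(Sbb))
    return np.linalg.svd(Wa @ Sab @ Wb.T, compute_uv=False)[0]

rho = top_canonical_corr(U_Y, U_Z)
print(rho)  # close to 1: the views share a strongly correlated latent factor
```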
Core Challenge 2: Alignment
Definition: Identify the direct relations between (sub)elements from two or more different modalities.
35
Explicit Alignment
The goal is to directly find correspondences between elements of different modalities
A Implicit Alignment
Uses internally latent alignment of modalities in order to better solve a different problem
B
Fancy algorithm
Modality 1 Modality 2
Implicit Alignment
36
Karpathy et al., Deep Fragment Embeddings for Bidirectional Image Sentence Mapping, https://arxiv.org/pdf/1406.5679.pdf
Attention Models for Image Captioning
37
Distribution over locations → Expectation over features (soft attention)
First word
Output word
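The distribution-then-expectation step can be sketched in a few lines. A minimal numpy example; the dot-product scoring and the feature dimensions are illustrative assumptions, not the exact model of the ICML 2015 paper:

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

rng = np.random.default_rng(0)
L, D = 14 * 14, 512                  # 196 spatial locations, 512-dim features
features = rng.normal(size=(L, D))   # flattened CNN feature map (assumed sizes)
hidden = rng.normal(size=D)          # decoder state when emitting the next word

scores = features @ hidden           # one relevance score per image location
alpha = softmax(scores)              # distribution over locations
context = alpha @ features           # expectation: attention-weighted feature

print(alpha.shape, context.shape)    # (196,) (512,)
```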
Core Challenge 3: Fusion
38
Definition: To join information from two or more modalities to perform a prediction task.
Model-Agnostic Approaches A 1) Early Fusion 2) Late Fusion
Early fusion: Modality 1 + Modality 2 → Classifier
Late fusion: Modality 1 → Classifier; Modality 2 → Classifier → combined decision
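The two model-agnostic strategies can be contrasted on toy data. The least-squares "classifiers" and the feature dimensions below are invented stand-ins for any unimodal model:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy two-modality classification data: the label depends on one feature
# from each modality, so neither modality alone carries the full signal.
n = 400
x1 = rng.normal(size=(n, 5))   # modality 1 (e.g., audio features)
x2 = rng.normal(size=(n, 3))   # modality 2 (e.g., visual features)
y = (x1[:, 0] + x2[:, 0] > 0).astype(float)

def fit_linear(X, y):
    """Least-squares scorer with a bias term, standing in for any model."""
    Xb = np.hstack([X, np.ones((len(X), 1))])
    w, *_ = np.linalg.lstsq(Xb, y, rcond=None)
    return lambda Z: np.hstack([Z, np.ones((len(Z), 1))]) @ w

# Early fusion: concatenate the features, train a single model.
early = fit_linear(np.hstack([x1, x2]), y)
acc_early = ((early(np.hstack([x1, x2])) > 0.5) == y).mean()

# Late fusion: train one model per modality, then average their scores.
f1, f2 = fit_linear(x1, y), fit_linear(x2, y)
acc_late = ((0.5 * (f1(x1) + f2(x2)) > 0.5) == y).mean()

print(acc_early, acc_late)  # training accuracies of the two fusion schemes
```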
Core Challenge 3: Fusion
39
Definition: To join information from two or more modalities to perform a prediction task.
Model-Based (Intermediate) Approaches B 1) Deep neural networks 2) Kernel-based methods 3) Graphical models
Examples: Multiple kernel learning; Multi-View Hidden CRF
Core Challenge 4: Translation
40
Definition: Process of changing data from one modality to another, where the translation relationship can often be open-ended or subjective.
Example-based A Model-driven B
Core Challenge 4: Translation
41
Transcriptions + audio streams → visual gestures (both speaker and listener gestures)
Marsella et al., Virtual character performance from speech, SIGGRAPH/Eurographics Symposium on Computer Animation, 2013
Core Challenge 5: Co-Learning
Definition: Transfer knowledge between modalities, including their representations and predictive models.
43
Modality 1 Prediction Modality 2
Core Challenge 5: Co-Learning
44
Parallel A Non-Parallel B Hybrid C
Taxonomy of Multimodal Research
Representation
Translation
Alignment
Fusion
Co-learning
45
Tadas Baltrusaitis, Chaitanya Ahuja, and Louis-Philippe Morency, Multimodal Machine Learning: A Survey and Taxonomy https://arxiv.org/abs/1705.09406
46
Multimodal representations
§ What if relationship is complex? § Doesn’t exploit joint information, especially at lower/intermediate levels
47
Multimodal representations
➢ Enforce similarity between corresponding concepts
➢ Useful for many tasks: retrieval, mapping, fusion, etc. (mapping between modalities)
48
Modality 2 Modality 3
Fancy representation
Modality 1 Modality 2 Modality 3
Fancy representation
Prediction Modality 1 Modality 2
Fancy representation
Multimodal representation types
➢ Simplest version: modality concatenation (early fusion)
➢ Can be learned supervised or unsupervised
➢ Multimodal factor analysis
➢ Similarity-based methods (e.g., cosine distance)
➢ Structure constraints (e.g., …)
49
Modality 1 Modality 2
Representation
Modality 1 Modality 2
Representation 1 Representation 2
Joint representations: A Coordinated representations: B
50
Shallow multimodal representations
51
Shallow Autoencoder
Deep multimodal autoencoders
▪ An autoencoder-based approach, applied to audio-visual speech recognition
52
[Ngiam et al., Multimodal Deep Learning, 2011]
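The cross-modal idea behind the bimodal autoencoder can be sketched in closed form. This linear toy (a PCA "encoder", a least-squares "decoder", and synthetic shared factors) only illustrates why a code computed from one modality can reconstruct both; it is not the deep model of Ngiam et al.:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "audio" and "video" generated from the same k shared factors.
n, k = 1000, 4
latent = rng.normal(size=(n, k))
audio = latent @ rng.normal(size=(k, 10))
video = latent @ rng.normal(size=(k, 12))
both = np.hstack([audio, video])

# "Encoder": a k-dim code computed from the VIDEO modality only (PCA).
video_c = video - video.mean(0)
_, _, Vt = np.linalg.svd(video_c, full_matrices=False)
code = video_c @ Vt[:k].T

# "Decoder": a linear map from the video-only code to BOTH modalities.
both_c = both - both.mean(0)
W, *_ = np.linalg.lstsq(code, both_c, rcond=None)
rel_err = np.linalg.norm(code @ W - both_c) / np.linalg.norm(both_c)
print(rel_err < 1e-6)  # True: the video code reconstructs audio as well
```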
Deep multimodal autoencoders – training
▪ Individual modality networks can be pre-trained
▪ The model is trained to reconstruct both modalities, including reconstructing one modality from the other
53
Multimodal Encoder-Decoder
▪ Image encoded using a CNN
▪ Language decoded using an LSTM
▪ A multilayer perceptron is used to translate from the visual (CNN) representation to the language (LSTM) representation
55
Multimodal Joint Representation
▪ Combining unimodal representations (e.g., by summation)
▪ How can we model both unimodal and bimodal interactions?
56
e.g. Sentiment
Deep multimodal Boltzmann machines
▪ A generative model (related to DBNs), trained using variational approaches
▪ Used for multimedia retrieval
▪ Generating one modality from another is a bit more “natural” than in an autoencoder representation (e.g., generating text from images)
57
[Srivastava and Salakhutdinov, Multimodal Learning with Deep Boltzmann Machines, 2012, 2014]
Deep multimodal Boltzmann machines
58
Srivastava and Salakhutdinov, “Multimodal Learning with Deep Boltzmann Machines”, NIPS 2012
59
Coordinated multimodal embeddings
unimodal embeddings
60
Coordinated multimodal embeddings
Learn two or more coordinated representations from multiple modalities; a loss function is defined to bring these representations closer together.
61
[Diagram: a text network and an image network map to representations 𝒀 and 𝒁, coordinated by a similarity metric (e.g., cosine distance)]
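The "bring closer" objective is typically a margin-based ranking loss, as in Kiros et al. (2014). A minimal numpy sketch; the margin value and the toy orthonormal embeddings are invented:

```python
import numpy as np

def cosine(a, b):
    return (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

def ranking_loss(img, txt, margin=0.2):
    """Pairwise hinge ranking loss: matching image-text pairs (row i of each)
    should out-score mismatched pairs by at least the margin."""
    n = len(img)
    loss = 0.0
    for i in range(n):
        pos = cosine(img[i], txt[i])
        for j in range(n):
            if j != i:
                loss += max(0.0, margin - pos + cosine(img[i], txt[j]))  # mismatched text
                loss += max(0.0, margin - pos + cosine(img[j], txt[i]))  # mismatched image
    return loss / n

# Perfectly coordinated toy embeddings: identical, mutually orthogonal pairs.
E = np.eye(4, 8)
print(ranking_loss(E, E))  # 0.0 -- every pair beats every mismatch by the margin
```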
Coordinated multimodal embeddings
unimodal embeddings
62
[Kiros et al., Unifying Visual-Semantic Embeddings with Multimodal Neural Language Models, 2014] [Frome et al., DeViSE: A Deep Visual-Semantic Embedding Model, 2013]
Coordinated multimodal embeddings
63
[Huang et al., Learning Deep Structured Semantic Models for Web Search using Clickthrough Data, 2013]
Multimodal Vector Space Arithmetic
64
[Kiros et al., Unifying Visual-Semantic Embeddings with Multimodal Neural Language Models, 2014]
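The arithmetic can be illustrated with toy vectors. The concept vocabulary below is invented, constructed so that each concept embedding is attribute vector + object vector:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16

# Invented toy vocabulary: each concept embedding = attribute + object vector,
# so "blue car" - "blue" + "red" lands on "red car".
attrs = {a: rng.normal(size=d) for a in ("blue", "red")}
objs = {o: rng.normal(size=d) for o in ("car", "house")}
concepts = {f"{a} {o}": attrs[a] + objs[o] for a in attrs for o in objs}

query = concepts["blue car"] - attrs["blue"] + attrs["red"]

def nearest(q, table):
    """Concept whose embedding has the highest cosine similarity to q."""
    return max(table, key=lambda k: (table[k] @ q) /
               (np.linalg.norm(table[k]) * np.linalg.norm(q)))

print(nearest(query, concepts))  # red car
```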
Structured coordinated embeddings
66
[Vendrov et al., Order-Embeddings of Images and Language, 2016] [Jiang and Li, Deep Cross-Modal Hashing]
Recap: Multimodal representations
67
68
Fusing the CNN image feature with the LSTM question feature (“What’s the mustache made of?”) for VQA:
▪ Concatenate
▪ Element-wise sum
▪ Element-wise product
▪ Bilinear pooling (outer product): 2048 × 2048 feature interactions × 3000 classes ≈ 12.5 billion parameters!!!
▪ Multimodal Compact Bilinear Pooling (MCB)
Fukui et al. EMNLP 2016
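MCB avoids the explicit outer product by combining Count Sketch projections with an FFT-based circular convolution. A minimal numpy sketch (the output dimension d = 1024 is chosen for illustration; the real model uses a larger d and learned layers around this step):

```python
import numpy as np

rng = np.random.default_rng(0)

def count_sketch_params(n, d):
    """Random hash index and sign per input dimension (Count Sketch)."""
    return rng.integers(0, d, size=n), rng.choice([-1.0, 1.0], size=n)

def count_sketch(x, h, s, d):
    out = np.zeros(d)
    np.add.at(out, h, s * x)   # scatter-add each signed coordinate
    return out

def mcb(x, y, d=1024):
    """Compact bilinear pooling: sketch both vectors, then a circular
    convolution via FFT approximates a sketch of their outer product."""
    hx, sx = count_sketch_params(len(x), d)
    hy, sy = count_sketch_params(len(y), d)
    fx = np.fft.fft(count_sketch(x, hx, sx, d))
    fy = np.fft.fft(count_sketch(y, hy, sy, d))
    return np.real(np.fft.ifft(fx * fy))

visual = rng.normal(size=2048)    # e.g., CNN image feature
textual = rng.normal(size=2048)   # e.g., LSTM question feature
fused = mcb(visual, textual)
print(fused.shape)  # (1024,) instead of the 2048 x 2048 explicit outer product
```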
74