

SLIDE 1

Lecture #4 – Multimodality

Aykut Erdem // Hacettepe University // Spring 2019

CMP722

ADVANCED COMPUTER VISION

Illustration: Detail from Fritz Kahn’s Der Mensch als Industriepalast

SLIDE 2

Previously on CMP722

  • sequential data
  • convolutions in time
  • recurrent neural networks (RNNs)
  • autoregressive generative models
  • attention models
  • case study: transformer model

Illustration: DeepMind

SLIDE 3

Lecture overview

  • what is multimodality
  • a historical view on multimodal research
  • core technical challenges
  • joint representations
  • coordinated representations
  • multimodal fusion
  • Disclaimer: Much of the material and the slides for this lecture were borrowed from Louis-Philippe Morency and Tadas Baltrušaitis's CMU 11-777 class, and from Qi Wu's slides for the ACL 2018 Tutorial on Connecting Language and Vision to Actions.

SLIDE 4

What is Multimodal?

  • Multiple modes, i.e., distinct “peaks” (local maxima) in the probability density function

Multimodal distribution
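As a minimal added illustration (not from the slides), a two-component Gaussian mixture

$$p(x) = \tfrac{1}{2}\,\mathcal{N}(x;\,-2,\,1) + \tfrac{1}{2}\,\mathcal{N}(x;\,2,\,1)$$

is bimodal: its density has distinct local maxima near $x = -2$ and $x = 2$.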

SLIDE 5

What is Multimodal?

Sensory Modalities


SLIDE 6

What is Multimodal?

Modality: The way in which something happens or is experienced.

  • Modality refers to a certain type of information and/or the representation format in which information is stored.
  • Sensory modality: one of the primary forms of sensation, such as vision or touch; a channel of communication.

Medium (“middle”): A means or instrumentality for storing or communicating information; a system of communication/transmission.

  • The medium is the means whereby this information is delivered to the senses of the interpreter.

SLIDE 7

Examples of Modalities

  • Natural language (both spoken and written)
  • Visual (from images or videos)
  • Auditory (including voice, sounds and music)
  • Haptics / touch
  • Smell, taste and self-motion
  • Physiological signals: electrocardiogram (ECG), skin conductance
  • Other modalities: infrared images, depth images, fMRI

SLIDE 8

Multiple Communities and Modalities

Psychology · Medical · Speech · Vision · Language · Multimedia · Robotics · Learning

SLIDE 9

A Historical View


SLIDE 10

Prior Research on “Multimodal”

Four eras of multimodal research

  • The “behavioral” era (1970s until late 1980s)
  • The “computational” era (late 1980s until 2000)
  • The “interaction” era (2000 - 2010)
  • The “deep learning” era (2010s until ...): the main focus of this lecture

SLIDE 11

The “Behavioral” Era (1970s until late 1980s)

Multi-sensory integration (in psychology):

  • Multimodal signal detection: Independent decisions vs. integration [1980]
  • Infants' perception of substance and temporal synchrony in multimodal events [1983]
  • A multimodal assessment of behavioral and cognitive deficits in abused and neglected preschoolers [1984]

Multimodal Behavior Therapy by Arnold Lazarus [1973]
➢ 7 dimensions of personality (or modalities)

➢ TRIVIA: Geoffrey Hinton received his B.A. in Psychology

SLIDE 12

The McGurk Effect (1976)

Hearing lips and seeing voices – Nature


SLIDE 13

The “Computational” Era (Late 1980s until 2000)

1) Audio-Visual Speech Recognition (AVSR)

  • Motivated by the McGurk effect
  • First AVSR System in 1986

“Automatic lipreading to enhance speech recognition”

  • Good survey paper [2002]

“Recent Advances in the Automatic Recognition of Audio-Visual Speech”

➢ TRIVIA: The first multimodal deep learning paper was about audio-visual speech recognition [ICML 2011]

SLIDE 14

The “Computational” Era (Late 1980s until 2000)

2) Multimodal/multisensory interfaces

  • Multimodal Human-Computer Interaction (HCI): “Study of how to design and evaluate new computer systems where humans interact through multiple modalities, including both input and output modalities.”

Glove-Talk: A neural network interface between a data-glove and a speech synthesizer, by Sidney Fels & Geoffrey Hinton [CHI'95]

SLIDE 15

The “Computational” Era (Late 1980s until 2000)

2) Multimodal/multisensory interfaces

Rosalind Picard: “Affective Computing is computing that relates to, arises from, or deliberately influences emotion or other affective phenomena.”

SLIDE 16

The “Computational” Era (Late 1980s until 2000)

3) Multimedia Computing

“The Informedia Digital Video Library Project automatically combines speech, image and natural language understanding to create a full-content searchable digital video library.” [1994-2010]

SLIDE 17

The “Computational” Era (Late 1980s until 2000)

3) Multimedia Computing: multimedia content analysis

  • Shot-boundary detection (1991 - ): parsing a video into continuous camera shots
  • Still and dynamic video abstracts (1992 - ): making video browsable via representative frames (keyframes), and generating short clips carrying the essence of the video content
  • High-level parsing (1997 - ): parsing a video into semantically meaningful segments
  • Automatic annotation (indexing) (1999 - ): detecting prespecified events/scenes/objects in video

SLIDE 18

The “Computational” Era (Late 1980s until 2000)

  • Hidden Markov Models [1960s]
  • Factorial Hidden Markov Models [1996]
  • Coupled Hidden Markov Models [1997]
SLIDE 19

Multimodal Computation Models

  • Artificial Neural Networks [1940s]
  • Backpropagation [1975]
  • Convolutional neural networks [1980s]
SLIDE 20

The “Interaction” Era (2000s)

1) Modeling Human Multimodal Interaction

➢ TRIVIA: Samy Bengio started at IDIAP working on the AMI project

  • AMI Project [2001-2006, IDIAP]
  • 100+ hours of meeting recordings
  • Fully synchronized audio-video
  • Transcribed and annotated

CHIL Project [Alex Waibel]

  • Computers in the Human Interaction Loop
  • Multi-sensor multimodal processing
  • Face-to-face interactions
SLIDE 21

The “Interaction” Era (2000s)

1) Modeling Human Multimodal Interaction

➢ TRIVIA: LP Morency's PhD research was partially funded by CALO ☺

CALO Project [2003-2008, SRI]

  • Cognitive Assistant that Learns and Organizes
  • Personalized Assistant that Learns (PAL)
  • Siri was a spinoff from this project

SSP Project [2008-2011, IDIAP]

  • Social Signal Processing
  • First coined by Sandy Pentland in 2007
  • Great dataset repository: http://sspnet.eu/

SLIDE 22

The “Interaction” Era (2000s)

2) Multimedia Information Retrieval

Research tasks and challenges:

  • Shot boundary, story segmentation, search
  • “High-level feature extraction”: semantic event detection
  • Introduced in 2008: copy detection and surveillance events
  • Introduced in 2010: Multimedia event detection (MED)

“Yearly competition to promote progress in content-based retrieval from digital video via open, metrics-based evaluation” [Hosted by NIST, 2001-2006]
SLIDE 23

Multimodal Computational Models

  • Dynamic Bayesian Networks
  • Kevin Murphy's PhD thesis and Matlab toolbox
  • Asynchronous HMM for multimodal data [Samy Bengio, 2007]: audio-visual speech segmentation

SLIDE 24

Multimodal Computational Models

  • Discriminative sequential models
  • Conditional random fields [Lafferty et al., 2001]
  • Latent-dynamic CRF [Morency et al., 2007]


SLIDE 25

The “deep learning” era (2010s until ...)

Representation learning (a.k.a. deep learning)

  • Multimodal deep learning [ICML 2011]
  • Multimodal Learning with Deep Boltzmann Machines [NIPS 2012]
  • Visual attention: Show, Attend and Tell: Neural Image Caption Generation with Visual Attention [ICML 2015]

Key enablers for multimodal research:

  • New large-scale multimodal datasets
  • Faster computers and GPUs
  • High-level visual features
  • “Dimensional” linguistic features

➢ The topic of our next lecture
SLIDE 26

Real-world tasks tackled by MMML

  • Affect recognition: emotion, persuasion, personality traits
  • Media description: image captioning, video captioning, visual question answering
  • Event recognition: action recognition, segmentation
  • Multimedia information retrieval: content-based / cross-media retrieval

SLIDE 27

Core Technical Challenges


SLIDE 28

Multimodal Machine Learning

Core Technical Challenges: Representation, Translation, Alignment, Fusion, Co-Learning

SLIDE 29

Core Challenge 1: Representation

Definition: Learning how to represent and summarize multimodal data in a way that exploits the complementarity and redundancy.

A) Joint representations: Modality 1 and Modality 2 are mapped into a single shared representation.

SLIDE 30

Joint Multimodal Representation

A joint representation (multimodal space) maps inputs from different modalities, e.g., the utterances “I like it!” and “Wow!” together with a joyful tone or a tensed voice, into a common space.

SLIDE 31

Joint Multimodal Representation

  • Audio-visual speech recognition [Ngiam et al., ICML 2011]: Bimodal Deep Belief Network
  • Image captioning [Srivastava and Salakhutdinov, NIPS 2012]: Multimodal Deep Boltzmann Machine
  • Audio-visual emotion recognition [Kim et al., ICASSP 2013]: Deep Boltzmann Machine

SLIDE 32

Multimodal Vector Space Arithmetic


[Kiros et al., Unifying Visual-Semantic Embeddings with Multimodal Neural Language Models, 2014]
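As a hedged illustration of this arithmetic (toy random vectors standing in for the learned visual-semantic embeddings of Kiros et al.; all names and file labels below are hypothetical):

    import numpy as np

    rng = np.random.default_rng(0)

    # Toy stand-ins for a learned joint embedding space; in Kiros et al.
    # these come from a CNN (images) and an LSTM (text) trained jointly.
    word_emb = {w: rng.normal(size=300) for w in ["blue", "red"]}
    image_emb = {name: rng.normal(size=300)
                 for name in ["blue_car.jpg", "red_car.jpg", "dog.jpg"]}

    def cosine(a, b):
        return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

    # Vector arithmetic in the shared space: image(blue car) - "blue" + "red"
    query = image_emb["blue_car.jpg"] - word_emb["blue"] + word_emb["red"]

    # Nearest-neighbour retrieval over the image embeddings
    best = max(image_emb, key=lambda k: cosine(image_emb[k], query))
    print(best)  # with real embeddings this should retrieve a red-car image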

SLIDE 33

Core Challenge 1: Representation

Definition: Learning how to represent and summarize multimodal data in a way that exploits the complementarity and redundancy.

A) Joint representations: both modalities are projected into a single shared representation.
B) Coordinated representations: each modality keeps its own representation (Repres. 1, Repres. 2), and the two spaces are coordinated through a constraint.
SLIDE 34

Coordinated Representation: Deep CCA

Two deep networks (with weights 𝑾 and 𝑽) map the two views, e.g., text 𝑿_𝒚 and image 𝑿_𝒛, to top-layer representations 𝑼_𝒀 and 𝑼_𝒁.

  • Learn linear projections that are maximally correlated:

$$(\mathbf{v}^*, \mathbf{w}^*) = \operatorname*{argmax}_{\mathbf{v}, \mathbf{w}} \; \operatorname{corr}\!\left(\mathbf{v}^{\top}\mathbf{U}_Y, \; \mathbf{w}^{\top}\mathbf{U}_Z\right)$$

[Andrew et al., ICML 2013]
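As a minimal numerical sketch of this final step (plain linear CCA on the two network outputs, with toy data; not the full Deep CCA training procedure):

    import numpy as np

    rng = np.random.default_rng(0)

    # Toy stand-ins for the top-layer network outputs U_Y, U_Z
    # (n samples x d features); in Deep CCA these come from the two networks.
    n, d = 500, 10
    shared = rng.normal(size=(n, 1))            # latent signal shared by both views
    U_Y = shared @ rng.normal(size=(1, d)) + 0.1 * rng.normal(size=(n, d))
    U_Z = shared @ rng.normal(size=(1, d)) + 0.1 * rng.normal(size=(n, d))

    # Center, form (regularized) covariances
    Yc, Zc = U_Y - U_Y.mean(0), U_Z - U_Z.mean(0)
    Syy = Yc.T @ Yc / n + 1e-4 * np.eye(d)
    Szz = Zc.T @ Zc / n + 1e-4 * np.eye(d)
    Syz = Yc.T @ Zc / n

    # Whiten each view, then SVD the cross-covariance
    iL = np.linalg.inv(np.linalg.cholesky(Syy))
    iM = np.linalg.inv(np.linalg.cholesky(Szz))
    A, s, Bt = np.linalg.svd(iL @ Syz @ iM.T)

    v = iL.T @ A[:, 0]    # v*, w* maximizing corr(v^T U_Y, w^T U_Z)
    w = iM.T @ Bt[0]
    print("top canonical correlation ~", s[0])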
SLIDE 35

Core Challenge 2: Alignment

Definition: Identify the direct relations between (sub)elements from two or more different modalities.

A) Explicit alignment: the goal is to directly find correspondences between elements of different modalities.
B) Implicit alignment: uses internal, latent alignment of the modalities in order to better solve a different problem.

SLIDE 36

Implicit Alignment


Karpathy et al., Deep Fragment Embeddings for Bidirectional Image Sentence Mapping, https://arxiv.org/pdf/1406.5679.pdf

SLIDE 37

Attention Models for Image Captioning

  • Distribution over L locations
  • Expectation over features

(Figure: the captioning model attends over image locations and emits one output word at a time, starting from the first word; a sketch of this attention step follows below.)
slide-38
SLIDE 38

Core Challenge 3: Fusion

Definition: To join information from two or more modalities to perform a prediction task.

A) Model-agnostic approaches (both sketched below):
1) Early fusion: the modality features are combined first and a single classifier is trained on them
2) Late fusion: a separate classifier is trained per modality and their predictions are combined
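A compact sketch of the two model-agnostic schemes, with hypothetical audio/video feature arrays (scikit-learn is used only as a generic classifier):

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    X_audio = rng.normal(size=(200, 40))   # toy per-sample audio features
    X_video = rng.normal(size=(200, 64))   # toy per-sample video features
    y = rng.integers(0, 2, size=200)       # toy labels

    # 1) Early fusion: concatenate the features, train one classifier
    early = LogisticRegression(max_iter=1000).fit(np.hstack([X_audio, X_video]), y)

    # 2) Late fusion: one classifier per modality, average their probabilities
    clf_a = LogisticRegression(max_iter=1000).fit(X_audio, y)
    clf_v = LogisticRegression(max_iter=1000).fit(X_video, y)
    p_late = (clf_a.predict_proba(X_audio) + clf_v.predict_proba(X_video)) / 2
    y_late = p_late.argmax(axis=1)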

SLIDE 39

Core Challenge 3: Fusion

Definition: To join information from two or more modalities to perform a prediction task.

B) Model-based (intermediate) approaches:
1) Deep neural networks
2) Kernel-based methods (e.g., multiple kernel learning)
3) Graphical models (e.g., Multi-View Hidden CRF)

SLIDE 40

Core Challenge 4: Translation

Definition: Process of changing data from one modality to another, where the translation relationship can often be open-ended or subjective.

A) Example-based
B) Model-driven

SLIDE 41

Core Challenge 4: Translation

Transcriptions + audio streams → visual gestures (both speaker and listener gestures)

Marsella et al., Virtual character performance from speech, SIGGRAPH/Eurographics Symposium on Computer Animation, 2013


SLIDE 43

Core Challenge 5: Co-Learning

Definition: Transfer knowledge between modalities, including their representations and predictive models.

(Figure: a prediction for Modality 1 is helped by knowledge from Modality 2.)

SLIDE 44

Core Challenge 5: Co-Learning

A) Parallel
B) Non-parallel
C) Hybrid

SLIDE 45

Taxonomy of Multimodal Research

Representation
  • Joint
      • Neural networks
      • Graphical models
      • Sequential
  • Coordinated
      • Similarity
      • Structured

Translation
  • Example-based
      • Retrieval
      • Combination
  • Model-based
      • Grammar-based
      • Encoder-decoder
      • Online prediction

Alignment
  • Explicit
      • Unsupervised
      • Supervised
  • Implicit
      • Graphical models
      • Neural networks

Fusion
  • Model agnostic
      • Early fusion
      • Late fusion
      • Hybrid fusion
  • Model-based
      • Kernel-based
      • Graphical models
      • Neural networks

Co-learning
  • Parallel data
      • Co-training
      • Transfer learning
  • Non-parallel data
      • Zero-shot learning
      • Concept grounding
      • Transfer learning
  • Hybrid data
      • Bridging

Tadas Baltrusaitis, Chaitanya Ahuja, and Louis-Philippe Morency, Multimodal Machine Learning: A Survey and Taxonomy https://arxiv.org/abs/1705.09406

SLIDE 46

Multimodal Representations


SLIDE 47

Multimodal representations

  • Why do we need multimodal representations?
  • Couldn't we just have unimodal ones and fuse them?
      • What if the relationship between modalities is complex?
      • Doesn't exploit joint information, especially at lower/intermediate levels

SLIDE 48

Multimodal representations

  • What do we want from a multimodal representation?
      • Similarity in that space implies similarity in the corresponding concepts
      • Useful for various discriminative tasks: retrieval, mapping, fusion, etc.
      • Possible to obtain in the absence of one or more modalities
      • Fill in missing modalities given the others (map between modalities)

(Figures: a “fancy representation” computed from one, two, or three modalities, optionally feeding a prediction.)

SLIDE 49

Multimodal representation types

➢ Simplest version: modality concatenation (early fusion)
➢ Can be learned supervised or unsupervised
➢ Multimodal factor analysis
➢ Similarity-based methods (e.g., cosine distance)
➢ Structure constraints (e.g., orthogonality, sparseness)

A) Joint representations
B) Coordinated representations

SLIDE 50

Joint representations


SLIDE 51

Shallow multimodal representations

  • We want deep multimodal representations
  • Shallow representations do not capture complex relationships
  • Often the shared layer only maps to the shared section directly

(Figures: a shallow RBM and a shallow autoencoder)

SLIDE 52

Deep multimodal autoencoders

  • A deep representation learning approach
  • A bimodal autoencoder
  • Used for audio-visual speech recognition (a sketch follows below)

[Ngiam et al., Multimodal Deep Learning, 2011]
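A skeletal PyTorch sketch of such a bimodal autoencoder (layer sizes and module choices are illustrative, not the exact architecture of Ngiam et al.):

    import torch
    import torch.nn as nn

    class BimodalAutoencoder(nn.Module):
        # Separate audio/video encoders, a shared hidden layer, and
        # separate decoders that reconstruct both modalities.
        def __init__(self, d_audio=100, d_video=300, d_shared=128):
            super().__init__()
            self.enc_a = nn.Sequential(nn.Linear(d_audio, 128), nn.ReLU())
            self.enc_v = nn.Sequential(nn.Linear(d_video, 128), nn.ReLU())
            self.shared = nn.Sequential(nn.Linear(256, d_shared), nn.ReLU())
            self.dec_a = nn.Linear(d_shared, d_audio)
            self.dec_v = nn.Linear(d_shared, d_video)

        def forward(self, audio, video):
            h = self.shared(torch.cat([self.enc_a(audio), self.enc_v(video)], dim=-1))
            return self.dec_a(h), self.dec_v(h)   # reconstruct both modalities

    model = BimodalAutoencoder()
    audio, video = torch.randn(8, 100), torch.randn(8, 300)
    rec_a, rec_v = model(audio, video)
    loss = nn.functional.mse_loss(rec_a, audio) + nn.functional.mse_loss(rec_v, video)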

SLIDE 53

Deep multimodal autoencoders – training

  • Individual modalities can be pre-trained: RBMs, denoising autoencoders
  • To train the model to reconstruct the other modality:
      • Use both
      • Remove audio

SLIDE 54

Deep multimodal autoencoders – training

  • Individual modalities can be pre-trained: RBMs, denoising autoencoders
  • To train the model to reconstruct the other modality (see the sketch below):
      • Use both
      • Remove audio
      • Remove video
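One way to sketch this training trick is as modality dropout for the autoencoder sketched above (an illustrative formulation of "use both / remove audio / remove video", not the paper's exact recipe):

    import torch

    def training_step(model, audio, video):
        # Randomly keep both inputs (0), zero out audio (1), or zero out
        # video (2), but always reconstruct both modalities, forcing the
        # shared layer to carry cross-modal information.
        mode = torch.randint(0, 3, (1,)).item()
        a_in = torch.zeros_like(audio) if mode == 1 else audio   # "remove audio"
        v_in = torch.zeros_like(video) if mode == 2 else video   # "remove video"
        rec_a, rec_v = model(a_in, v_in)
        return (torch.nn.functional.mse_loss(rec_a, audio)
                + torch.nn.functional.mse_loss(rec_v, video))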

SLIDE 55

Multimodal Encoder-Decoder

  • Visual modality is often encoded using a CNN
  • Language modality is decoded using an LSTM
  • A simple multilayer perceptron is used to translate from the visual representation (CNN) to the language one (LSTM); a sketch follows below
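A skeletal PyTorch sketch of this CNN → MLP → LSTM pipeline (all sizes and module choices are illustrative):

    import torch
    import torch.nn as nn

    class MultimodalEncoderDecoder(nn.Module):
        def __init__(self, vocab_size=10000, d_img=2048, d_hidden=512):
            super().__init__()
            # The MLP bridge from CNN features (a pretrained CNN in practice)
            # to the LSTM's state space
            self.translate = nn.Sequential(nn.Linear(d_img, d_hidden), nn.ReLU())
            self.embed = nn.Embedding(vocab_size, d_hidden)
            self.lstm = nn.LSTM(d_hidden, d_hidden, batch_first=True)
            self.out = nn.Linear(d_hidden, vocab_size)

        def forward(self, img_feats, captions):
            h0 = self.translate(img_feats).unsqueeze(0)        # init hidden from image
            c0 = torch.zeros_like(h0)
            y, _ = self.lstm(self.embed(captions), (h0, c0))   # decode word sequence
            return self.out(y)                                 # per-step vocab logits

    model = MultimodalEncoderDecoder()
    logits = model(torch.randn(4, 2048), torch.randint(0, 10000, (4, 12)))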

SLIDE 56

Multimodal Joint Representation

  • For supervised learning tasks (e.g., sentiment prediction)
  • Joining the unimodal representations:
      • Simple concatenation
      • Element-wise multiplication or summation
      • Multilayer perceptron
  • How to explicitly model both unimodal and bimodal interactions?

SLIDE 57

Deep multimodal Boltzmann machines

  • Generative model
  • Individual modalities are trained like a DBN
  • The multimodal representation is trained using variational approaches
  • Used for image tagging and cross-media retrieval
  • Reconstruction of one modality from another is a bit more “natural” than in the autoencoder representation
  • Can actually sample text and images

[Srivastava and Salakhutdinov, Multimodal Learning with Deep Boltzmann Machines, 2012, 2014]

SLIDE 58

Deep multimodal Boltzmann machines

[Srivastava and Salakhutdinov, Multimodal Learning with Deep Boltzmann Machines, NIPS 2012]

SLIDE 59

Coordinated Representations


SLIDE 60

Coordinated multimodal embeddings

  • Instead of projecting to a joint space, enforce similarity between the unimodal embeddings (Repres. 1 and Repres. 2)
SLIDE 61

Coordinated multimodal embeddings

  • Learn (unsupervised) two or more coordinated representations from multiple modalities. A loss function is defined to bring these representations closer.

A similarity metric (e.g., cosine distance) couples the text and image embeddings; a sketch of such a loss follows below.
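A sketch of one such coordinated loss: a cosine-similarity pairwise ranking loss in the style of Kiros et al. (the margin value and toy inputs are illustrative):

    import torch
    import torch.nn.functional as F

    def ranking_loss(img_emb, txt_emb, margin=0.2):
        # Matched image/text pairs (same row index) should score higher
        # than mismatched pairs by at least `margin` (hinge ranking loss).
        img = F.normalize(img_emb, dim=-1)
        txt = F.normalize(txt_emb, dim=-1)
        sim = img @ txt.t()          # cosine similarity matrix
        pos = sim.diag()             # similarities of the true pairs
        cost_txt = (margin + sim - pos.unsqueeze(1)).clamp(min=0)  # wrong texts per image
        cost_img = (margin + sim - pos.unsqueeze(0)).clamp(min=0)  # wrong images per text
        eye = torch.eye(sim.size(0), dtype=torch.bool)
        return cost_txt.masked_fill(eye, 0).sum() + cost_img.masked_fill(eye, 0).sum()

    # Toy usage with random "encoder outputs"
    loss = ranking_loss(torch.randn(32, 300), torch.randn(32, 300))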

SLIDE 62

Coordinated multimodal embeddings

  • Instead of projecting to a joint space, enforce similarity between the unimodal embeddings
  • Often referred to as semantic space embeddings

[Kiros et al., Unifying Visual-Semantic Embeddings with Multimodal Neural Language Models, 2014] [Frome et al., DeViSE: A Deep Visual-Semantic Embedding Model, 2013]

SLIDE 63

Coordinated multimodal embeddings


[Huang et al., Learning Deep Structured Semantic Models for Web Search using Clickthrough Data, 2013]

SLIDE 64

Multimodal Vector Space Arithmetic


[Kiros et al., Unifying Visual-Semantic Embeddings with Multimodal Neural Language Models, 2014]

SLIDE 65

Multimodal Vector Space Arithmetic


[Kiros et al., Unifying Visual-Semantic Embeddings with Multimodal Neural Language Models, 2014]

SLIDE 66

Structured coordinated embeddings

  • Instead of, or in addition to, similarity, add an alternative structure to the coordinated space

[Vendrov et al., Order-Embeddings of Images and Language, 2016] [Jiang and Li, Deep Cross-Modal Hashing]

SLIDE 67

Recap: Multimodal representations

  • Joint representations:
      • Project the modalities into the same space
      • Use when all the modalities are present at test time
      • Suitable for multimodal fusion
  • Coordinated representations:
      • Project the modalities into their own coordinated spaces
      • Use when only one of the modalities is present at test time
      • Suitable for multimodal translation; good for retrieval

SLIDE 68

Multimodal Fusion


SLIDE 69

Concatenate

Example (VQA): the question “What's the mustache made of?” is encoded by an LSTM and the image by a CNN; the two feature vectors are concatenated.

SLIDE 70

Element-wise sum

Same CNN + LSTM setup; the two feature vectors are combined by element-wise summation.

SLIDE 71

Element-wise product

Same CNN + LSTM setup; the two feature vectors are combined by element-wise product. The three operators are sketched below.
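The three operators of slides 69-71, over hypothetical CNN image features and LSTM question features:

    import torch

    v = torch.randn(8, 512)   # toy CNN image features
    q = torch.randn(8, 512)   # toy LSTM question features

    fused_concat = torch.cat([v, q], dim=-1)   # concatenation: (8, 1024)
    fused_sum = v + q                          # element-wise sum: (8, 512)
    fused_prod = v * q                         # element-wise product: (8, 512)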

SLIDE 72

Bilinear pooling

Same CNN + LSTM setup; the two feature vectors are combined via their outer product.

A 2048-dim image feature and a 2048-dim question feature give a 2048 × 2048 outer product; a linear classifier over 3000 answers on top of it would need roughly 12.5 billion parameters!
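A quick check of that parameter count (the 3000 stands for the answer-vocabulary size on the slide):

    import torch

    v = torch.randn(2048)                   # image feature
    q = torch.randn(2048)                   # question feature
    bilinear = torch.outer(v, q).flatten()  # 2048 * 2048 = 4,194,304 dims

    n_answers = 3000
    params = bilinear.numel() * n_answers   # weights of a linear classifier on top
    print(f"{params:,}")                    # 12,582,912,000 ~ 12.5 billion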

SLIDE 73

Multimodal Compact Bilinear Pooling (MCB)

Same CNN + LSTM setup; the two feature vectors are combined by Multimodal Compact Bilinear pooling.

[Fukui et al., EMNLP 2016]
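A compact sketch of MCB itself: count-sketch each feature vector, then compute their circular convolution as an element-wise product in the FFT domain (d = 16000 follows the paper; the hash and sign vectors are the sketch's fixed random parameters):

    import torch

    torch.manual_seed(0)
    n, d = 2048, 16000
    h1, h2 = torch.randint(0, d, (n,)), torch.randint(0, d, (n,))
    s1 = torch.randint(0, 2, (n,)).float() * 2 - 1
    s2 = torch.randint(0, 2, (n,)).float() * 2 - 1

    def count_sketch(x, h, s):
        # Project x (batch, n) to d dims: input index i adds x[:, i] * s[i]
        # to output index h[i].
        out = torch.zeros(x.size(0), d)
        out.index_add_(1, h, x * s)
        return out

    v, q = torch.randn(4, n), torch.randn(4, n)   # toy image/question features
    # Circular convolution of the two sketches, computed via FFT
    fv = torch.fft.rfft(count_sketch(v, h1, s1))
    fq = torch.fft.rfft(count_sketch(q, h2, s2))
    mcb = torch.fft.irfft(fv * fq, n=d)           # (4, 16000) fused feature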

SLIDE 74

Next Lecture: Language and Vision
