Multimodality
Learning from Text, Speech, and Vision
CMU 11-4/611 Natural Language Processing Lecture 28 April 14, 2020 Shruti Palaskar
Outline
I. What is multimodality?
II. Types of modalities
III. Commonly used Models
IV. Multimodal Fusion and Representation Learning
V. Multimodal Tasks: Use Cases
JARVIS The Matrix
Human: A Small Dogs Ears Stick Up As It Runs In The Grass.
Model: A Black And White Dog Is Running On Grass With A Frisbee In Its Mouth
Human: A Young Girl In A White Dress Standing In Front Of A Fence And Fountain.
Model: Two Men Are Standing In Front Of A Fountain
QUESTIONS
A. there is only one person and a dog .
A. she walks in from outside with the towel around her neck .
A. she does not interact with the dog
A. she dropped the towel on the floor at the end of the video .
Common challenges based on the tasks we just saw
How do we teach machines to perceive?
IMAGE/VIDEO
SPEECH/AUDIO
TEXT
EMOTION/AFFECT/SENTIMENT
○ Mammal -> Placental -> Carnivore -> Canine -> Dog -> Working Dog -> Husky
Deng et al. 2009
Sanabria et al. 2018
Wei et al. 2016
Single Perceptron
Limitation #1: Very large inputs (many input values x_i) require a gigantic number of model parameters.
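To make the scale of this limitation concrete, here is a minimal arithmetic sketch. The image size (224x224 RGB) and the layer width (1000 units) are assumptions chosen for illustration; they do not come from the slides.

```python
def dense_param_count(n_inputs: int, n_units: int) -> int:
    """Weights plus biases of one fully connected layer."""
    return n_inputs * n_units + n_units


# Hypothetical example: a 224x224 RGB image flattened into one vector.
n_inputs = 224 * 224 * 3   # 150,528 input values x_i
n_units = 1000             # assumed width of the first hidden layer

print(dense_param_count(n_inputs, n_units))  # prints 150529000
```

Under these assumptions a single fully connected layer already needs about 150 million parameters, before any further layers are added.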
Translation invariance: we can use the same parameters to capture a specific “feature” in any area of the image. We can use different sets of parameters to capture different features. These operations are equivalent to performing convolutions with different filters.
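The parameter sharing described above can be sketched in a few lines of NumPy. This is an illustrative toy, not code from the lecture: the 2x2 pattern, the 8x8 image size, and the helper names `correlate2d_valid` and `place` are all assumptions. As in most deep learning libraries, the "convolution" here is implemented as a cross-correlation (the filter is not flipped).

```python
import numpy as np


def correlate2d_valid(image, kernel):
    """Slide one shared filter over every position of the image."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # The same kernel weights are reused at every location (i, j).
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out


def place(pattern, top, left, size=8):
    """Hypothetical helper: put a small pattern into an empty image."""
    img = np.zeros((size, size))
    img[top:top + pattern.shape[0], left:left + pattern.shape[1]] = pattern
    return img


# Assumed toy "feature": a 2x2 vertical-edge pattern, used as the filter too.
pattern = np.array([[1.0, -1.0], [1.0, -1.0]])
kernel = pattern.copy()

resp_a = correlate2d_valid(place(pattern, 1, 1), kernel)
resp_b = correlate2d_valid(place(pattern, 4, 5), kernel)
peak_a = np.unravel_index(np.argmax(resp_a), resp_a.shape)
peak_b = np.unravel_index(np.argmax(resp_b), resp_b.shape)
print(peak_a, peak_b)  # the peak follows the feature to its new location

# Different filters capture different features: one feature map per filter.
image = place(pattern, 4, 5)
feature_maps = np.stack([correlate2d_valid(image, k) for k in (kernel, kernel.T)])
print(feature_maps.shape)
```

The filter has only four weights regardless of image size, and the strongest response moves with the feature, which is the point of sharing parameters across locations.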
LeCun et al. 1998
Krizhevsky et al. 2012
Limitation #1: Very large inputs (many input values x_i) require a gigantic number of model parameters.
Limitation #2: Does not naturally handle input data of variable length (e.g. audio/video/word sequences).
Build specific connections capturing the temporal evolution → Shared weights in time
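A minimal sketch of "shared weights in time", assuming a plain Elman-style recurrent cell written in NumPy. The dimensions, the random initialization, and the name `rnn_forward` are illustrative assumptions, not the lecture's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
input_dim, hidden_dim = 4, 3   # assumed sizes for illustration

# One set of parameters, reused at every time step.
W_xh = rng.normal(scale=0.1, size=(hidden_dim, input_dim))
W_hh = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
b_h = np.zeros(hidden_dim)


def rnn_forward(sequence):
    """Run the recurrent cell over a sequence of any length."""
    h = np.zeros(hidden_dim)
    for x_t in sequence:
        # The hidden state carries information forward in time.
        h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)
    return h


short_seq = rng.normal(size=(5, input_dim))
long_seq = rng.normal(size=(50, input_dim))
h_short = rnn_forward(short_seq)
h_long = rnn_forward(long_seq)
print(h_short.shape, h_long.shape)  # both (3,)
```

Because the same three parameter arrays are applied at each step, the parameter count does not grow with sequence length, and sequences of 5 or 50 steps are handled by the same model.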
Combination is commonly implemented as a small NN on top of a pooling operation (e.g. sum, average).
Recurrent Neural Networks are well suited for processing sequences.
Donahue et al. 2015
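The sum/average combination followed by a small network can be sketched as below. This assumes the items being combined are per-frame feature vectors from a video clip; the feature size, layer sizes, random weights, and the name `classify_clip` are hypothetical and chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
feat_dim, hidden_dim, n_classes = 8, 16, 3   # assumed sizes

# Small two-layer network applied on top of the pooled features.
W1 = rng.normal(scale=0.1, size=(hidden_dim, feat_dim))
b1 = np.zeros(hidden_dim)
W2 = rng.normal(scale=0.1, size=(n_classes, hidden_dim))
b2 = np.zeros(n_classes)


def classify_clip(frame_features, pool="average"):
    """Pool per-frame features, then classify the pooled vector."""
    if pool == "average":
        pooled = frame_features.mean(axis=0)
    elif pool == "sum":
        pooled = frame_features.sum(axis=0)
    else:
        raise ValueError(f"unknown pooling: {pool}")
    hidden = np.maximum(0.0, W1 @ pooled + b1)   # ReLU
    logits = W2 @ hidden + b2
    exp = np.exp(logits - logits.max())          # numerically stable softmax
    return exp / exp.sum()


frames = rng.normal(size=(10, feat_dim))   # 10 frames of assumed features
probs = classify_clip(frames)
probs_reversed = classify_clip(frames[::-1])
print(probs.shape, np.allclose(probs, probs_reversed))
```

Reversing the frame order leaves the prediction unchanged, since sum and average pooling discard temporal order. A recurrent model, by contrast, processes the frames in sequence.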
Olah and Carter 2016
Slide by LP Morency
Slide by LP Morency
Slide by LP Morency
Word2Vec
Mikolov et al. 2013
Cho et al. 2014
1. Vision and Language
2. Speech, Vision and Language
3. Multimedia
4. Emotion and Affect
Vinyals et al. 2015
Karpathy et al. 2015
Slides by Marc Bolaños
Xu et al. 2015
Johnson et al. 2016
Donahue et al. 2015
Pan et al. 2016
Slides by Marc Bolaños
Transcript (290 words on avg):
very little . a little bit more then this maybe . you can use red peppers if you like to get a little bit color in your omelet . some people do and some people do n't . but i find that some of the people that are mexicans who are friends of mine that have a mexican she like to put red peppers and green peppers and yellow peppers in hers and with a lot of onions . that is the way they make there spanish omelets that is what she says . i loved it , it actually tasted really good . you are going to take the onion also and dice it really small . you do n't want big chunks of onion in there cause it is just pops out of the omelet . so we are going to dice the up also very very small . so we have small pieces of onions and peppers ready to go .
“Teaser” (33 words on avg):
how to cut peppers to make a spanish [...] making cuban breakfast recipes in this free cooking video .
~1.5 minutes of audio and video
Palaskar et al. 2019
Assael et al. 2016
Chung et al. 2017
○ Privacy
○ Security
○ Regulation
○ Storage
○ Highly open-ended research!
Thank you! spalaska@cs.cmu.edu