Multimodality: Learning from Text, Speech, and Vision (CMU 11-4/611) - PowerPoint PPT Presentation

SLIDE 1

Multimodality

Learning from Text, Speech, and Vision

CMU 11-4/611 Natural Language Processing, Lecture 28, April 14, 2020. Shruti Palaskar

SLIDE 2

Outline

I. What is multimodality?
II. Types of modalities
III. Commonly used Models
IV. Multimodal Fusion and Representation Learning
V. Multimodal Tasks: Use Cases

SLIDE 3
  • I. What is Multimodality?

SLIDE 4

Human Interaction is Inherently Multimodal

SLIDE 5

How We Perceive

SLIDE 6

How We Perceive

SLIDE 7

The Dream: Sci-Fi Movies

JARVIS, The Matrix
SLIDE 8

Reality?

SLIDE 9

Give a caption.

SLIDE 10

Give a caption.

Human: A small dog's ears stick up as it runs in the grass.
Model: A black and white dog is running on grass with a frisbee in its mouth.
SLIDE 11

Single sentence image description -> Captioning

SLIDE 12

Give a caption.

SLIDE 13

Give a caption.

Human: A young girl in a white dress standing in front of a fence and fountain.
Model: Two men are standing in front of a fountain.
SLIDE 14

Reality?

SLIDE 15

Watch the video and answer questions.

  • Q. Is there only one person?
  • Q. Does she walk in with a towel around her neck?
  • Q. Does she interact with the dog?
  • Q. Does she drop the towel on the floor?
SLIDE 16

Watch the video and answer questions.

  • Q. Is there only one person?

A. There is only one person and a dog.

  • Q. Does she walk in with a towel around her neck?

A. She walks in from outside with the towel around her neck.

  • Q. Does she interact with the dog?

A. She does not interact with the dog.

  • Q. Does she drop the towel on the floor?

A. She dropped the towel on the floor at the end of the video.
SLIDE 17

Simple questions, simple answers -> Video Question Answering

SLIDE 18

Reality? Baby Steps. Still a long way to go.

SLIDE 19

...Challenges

Common challenges based on the tasks we just saw

  • Training dataset bias
  • Very complicated tasks
  • Lack of common-sense reasoning within models
  • No world knowledge available, unlike humans
    ○ Physics, Nature, Memory, Experience

How do we teach machines to perceive?

SLIDE 20

Outline

I. What is multimodality?
II. Types of modalities
III. Commonly used Models
IV. Multimodal Fusion and Representation Learning
V. Multimodal Tasks: Use Cases

SLIDE 21
  • II. Types of modalities

SLIDE 22

Types of Modalities

IMAGE/VIDEO, SPEECH/AUDIO, TEXT, EMOTION/AFFECT/SENTIMENT
SLIDE 23

Example Dataset: ImageNet

  • Object Recognition
  • Image Tagging/Categorization
  • ~14M images
  • Knowledge Ontology
  • Hierarchical Tags

○ Mammal -> Placental -> Carnivore -> Canine -> Dog -> Working Dog -> Husky

Deng et al. 2009

SLIDE 24

Example Dataset: How2 Dataset

  • Speech
  • Video
  • English Transcript
  • Portuguese Transcript
  • Summary

Sanabria et al. 2018

SLIDE 25

Example Dataset: OpenPose

  • Action Recognition
  • Pose Estimation
  • Human Body Dynamics

Wei et al. 2016

SLIDE 26
  • III. Commonly Used Models

SLIDE 27

Multilayer Perceptrons


Single Perceptron
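A single perceptron computes a weighted sum of its inputs plus a bias and passes it through a nonlinearity. A minimal sketch; the step nonlinearity and the AND-gate weights below are illustrative choices, not from the slides:

```python
import numpy as np

def perceptron(x, w, b):
    """Single perceptron: weighted sum of inputs plus bias, then a step nonlinearity."""
    return 1.0 if w @ x + b > 0 else 0.0

# Illustrative weights that make the unit compute logical AND on binary inputs.
w = np.array([1.0, 1.0])
b = -1.5
outputs = [perceptron(np.array(p), w, b) for p in [(0, 0), (0, 1), (1, 0), (1, 1)]]
```

Stacking layers of such units (with differentiable nonlinearities) gives the multilayer perceptron.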

SLIDE 28

Multilayer Perceptrons

Single Perceptron
SLIDE 29

Multilayer Perceptrons: Uses in Multimedia

SLIDE 30

Multilayer Perceptrons: Limitations

Limitation #1: High-dimensional input samples (xi) require a huge number of model parameters.
SLIDE 31

Convolutional Neural Networks (CNNs)


Translation invariance: we can use the same parameters to capture a specific "feature" in any area of the image, and different sets of parameters to capture different features. These operations are equivalent to performing convolutions with different filters.
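The weight-sharing idea can be sketched with an explicit 2D convolution; the toy edge filter and image below are illustrative, not from the slides:

```python
import numpy as np

def conv2d(image, kernel):
    """Slide one shared kernel over the image: the same weights
    detect the same feature at every spatial position."""
    H, W = image.shape
    kh, kw = kernel.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# A vertical-edge filter fires wherever a dark-to-bright edge occurs,
# regardless of position: translation invariance from weight sharing.
edge_filter = np.array([[-1.0, 1.0],
                        [-1.0, 1.0]])
image = np.zeros((5, 5))
image[:, 3:] = 1.0          # bright region on the right half
response = conv2d(image, edge_filter)
```

The filter's response is identical at every row, because the same two-by-two weights are reused everywhere.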

SLIDE 32

Convolutional Neural Networks (CNNs)

LeCun et al. 1998
SLIDE 33

Convolutional Neural Networks (CNNs) for Image Encoding

Krizhevsky et al. 2012
SLIDE 34

Multilayer Perceptrons: Limitations

Limitation #1: High-dimensional input samples (xi) require a huge number of model parameters.

Limitation #2: Does not naturally handle input data of variable dimension (e.g., audio/video/word sequences).
SLIDE 35

Recurrent Neural Networks


Build specific connections capturing the temporal evolution → Shared weights in time
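The shared-weights-in-time idea can be sketched as a loop that applies the same matrices at every step; the sizes and the tanh nonlinearity below are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
W_xh = rng.normal(size=(4, 3)) * 0.1   # input -> hidden (shared at every step)
W_hh = rng.normal(size=(4, 4)) * 0.1   # hidden -> hidden (shared at every step)

def rnn_encode(sequence):
    """Apply the SAME weights at each time step; the hidden state
    carries the temporal context forward."""
    h = np.zeros(4)
    for x in sequence:
        h = np.tanh(W_xh @ x + W_hh @ h)
    return h

seq = [rng.normal(size=3) for _ in range(5)]   # e.g. 5 audio/word frames
h_final = rnn_encode(seq)
```

Because the weights are reused, the parameter count is independent of the sequence length, addressing Limitation #2 above.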

SLIDE 36

Recurrent Neural Networks

SLIDE 37

Recurrent Neural Networks for Video Encoding


Combination is commonly implemented as a small NN on top of a pooling operation (e.g., max, sum, average). Recurrent Neural Networks are well suited for processing sequences.

Donahue et al. 2015
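A rough sketch of that combination step: pool per-frame features over time, then apply a small NN on top. The feature dimensions and random features below are placeholders:

```python
import numpy as np

# Hypothetical per-frame CNN features for an 8-frame clip (feature dim 5).
rng = np.random.default_rng(1)
frame_feats = rng.normal(size=(8, 5))

mean_pool = frame_feats.mean(axis=0)   # order-insensitive average over time
max_pool = frame_feats.max(axis=0)     # strongest response per feature

# A small NN "combination" layer on top of the pooled vector.
W = rng.normal(size=(3, 5)) * 0.1
combined = np.tanh(W @ mean_pool)
```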

SLIDE 38

Attention Mechanism


Olah and Carter 2016
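One common variant, dot-product attention, can be sketched as follows; this is an illustrative formulation, not necessarily the exact one on the slide:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def attend(query, encoder_states):
    """Dot-product attention: score each encoder state against the
    query, normalize with softmax, and return the weighted sum."""
    scores = encoder_states @ query
    weights = softmax(scores)
    context = weights @ encoder_states
    return context, weights

states = np.eye(3)                    # three toy encoder states
query = np.array([10.0, 0.0, 0.0])    # strongly matches state 0
context, weights = attend(query, states)
```

The attention weights form a distribution over time steps, so the decoder can focus on different parts of the input at each output step.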

SLIDE 39

Loss Function: Softmax


Slide by LP Morency
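The softmax turns raw scores into a probability distribution, and the training loss is the negative log-probability of the correct class. A minimal sketch with illustrative logits:

```python
import numpy as np

def softmax(logits):
    e = np.exp(logits - logits.max())   # subtract max for numerical stability
    return e / e.sum()

def cross_entropy(logits, target_index):
    """Negative log-probability of the correct class under the
    softmax distribution -- the usual classification loss."""
    return -np.log(softmax(logits)[target_index])

logits = np.array([2.0, 0.5, -1.0])
loss_good = cross_entropy(logits, 0)   # correct class has the highest logit
loss_bad = cross_entropy(logits, 2)    # correct class has the lowest logit
```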

SLIDE 40
  • IV. Multimodal Fusion & Representation Learning

SLIDE 41

Fusion: Model Agnostic

Slide by LP Morency
SLIDE 42

Fusion: Model Based


Slide by LP Morency
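Model-agnostic fusion is often illustrated as early fusion (concatenate features before a single model) versus late fusion (combine per-modality decisions). A toy sketch, with placeholder features and a stand-in unimodal scorer:

```python
import numpy as np

rng = np.random.default_rng(2)
text_feat = rng.normal(size=16)    # hypothetical text encoding
image_feat = rng.normal(size=32)   # hypothetical image encoding

# Early fusion: concatenate modality features, then feed one model.
early = np.concatenate([text_feat, image_feat])

# Late fusion: run one model per modality, then combine the decisions.
def toy_classifier(feat):
    """Stand-in unimodal scorer: a sigmoid over the mean activation."""
    return 1.0 / (1.0 + np.exp(-feat.mean()))

late = 0.5 * (toy_classifier(text_feat) + toy_classifier(image_feat))
```

Model-based fusion, in contrast, builds the cross-modal interaction into the architecture itself rather than at the input or output boundary.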

SLIDE 43

Representation Learning: Encoder-Decoder

SLIDE 44

Representation Learning


Word2Vec

Mikolov et al. 2013
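Word2Vec's skip-gram variant trains a model to predict context words from a center word. A sketch of how the (center, context) training pairs are generated; the window size of 2 is an illustrative choice:

```python
def skipgram_pairs(tokens, window=2):
    """Emit (center, context) pairs for every word and each neighbor
    within the window -- the training signal behind skip-gram word2vec."""
    pairs = []
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

pairs = skipgram_pairs(["learning", "from", "text", "speech", "vision"])
```

The embedding matrix learned from these pairs places words with similar contexts near each other.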

SLIDE 45

Representation Learning: RNNs

Cho et al. 2014
SLIDE 46

Representation Learning: Self-Supervised

SLIDE 47

Representation Learning: Transfer Learning

SLIDE 48

Representation Learning: Joint Learning

SLIDE 49

Representation Learning: Joint Learning (Similarity)
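Similarity-based joint learning pushes matched image/text pairs closer together than mismatched pairs in a shared space. A toy sketch with hand-picked placeholder embeddings, assumed to be already projected into that shared space:

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Placeholder shared-space embeddings: a matched image/caption pair
# should score higher than a mismatched one.
img_dog = np.array([0.9, 0.1, 0.0])
txt_dog = np.array([0.8, 0.2, 0.1])
txt_car = np.array([0.0, 0.1, 0.9])

match = cosine(img_dog, txt_dog)
mismatch = cosine(img_dog, txt_car)
```

Training typically enforces this ordering with a ranking or contrastive loss over such pairs.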

SLIDE 50
  • V. Common Tasks, Use Cases

SLIDE 51
  • V. Common Tasks

1. Vision and Language
2. Speech, Vision and Language
3. Multimedia
4. Emotion and Affect


  • Image/Video Captioning
  • Visual Question Answering
  • Visual Dialog
  • Video Summarization
  • Lip Reading
  • Audio Visual Speech Recognition
  • Visual Speech Synthesis
SLIDE 52
  • 1. Vision and Language Common Tasks

SLIDE 53

Image Captioning

Vinyals et al. 2015
SLIDE 54

Image Captioning

Karpathy et al. 2015

Slides by Marc Bolaños
SLIDE 55

Image Captioning: Show, Attend and Tell

Xu et al. 2015
SLIDE 56

Image Captioning and Detection

Johnson et al. 2016
SLIDE 57

Video Captioning

Donahue et al. 2015
SLIDE 58

Video Captioning

Pan et al. 2016

Slides by Marc Bolaños
SLIDE 59

Visual Question Answering

SLIDE 60

Visual Question Answering

SLIDE 61

Visual Question Answering

SLIDE 62

Visual Question Answering

SLIDE 63

Visual Question Answering

SLIDE 64

Video Summarization

Transcript (290 words on avg):

on behalf of expert village my name is lizbeth muller and today we are going to show you how to make spanish omelet . i 'm going to dice a little bit of peppers here . i 'm not going to use a lot , i 'm going to use very very little . a little bit more then this maybe . you can use red peppers if you like to get a little bit color in your omelet . some people do and some people do n't . but i find that some of the people that are mexicans who are friends of mine that have a mexican she like to put red peppers and green peppers and yellow peppers in hers and with a lot of onions . that is the way they make there spanish omelets that is what she says . i loved it , it actually tasted really good . you are going to take the onion also and dice it really small . you do n't want big chunks of onion in there cause it is just pops out of the omelet . so we are going to dice the up also very very small . so we have small pieces of onions and peppers ready to go .

"Teaser" (33 words on avg):

how to cut peppers to make a spanish omelette ; get expert tips and advice on making cuban breakfast recipes in this free cooking video .

~1.5 minutes of audio and video

SLIDE 65

Video Summarization: Hierarchical Model

Palaskar et al. 2019

SLIDE 66

Action Recognition

SLIDE 67
  • 2. Speech, Vision and Language Common Tasks

SLIDE 68

Audio Visual Speech Recognition: Lip Reading

Assael et al. 2016
SLIDE 69

Lip Reading: Watch, Listen, Attend and Spell

Chung et al. 2017
SLIDE 70
  • 3. Multimedia Common Tasks

SLIDE 71

Multimedia Retrieval

SLIDE 72

Multimedia Retrieval

SLIDE 73

Multimedia Retrieval: Shared Multimodal Representation
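With a shared multimodal representation, retrieval reduces to nearest-neighbor search: embed the text query into the same space as the media, then rank items by similarity. A toy sketch with placeholder 2-D embeddings:

```python
import numpy as np

# Hypothetical shared-space embeddings for a tiny video collection.
video_index = {
    "cooking": np.array([1.0, 0.0]),
    "sports":  np.array([0.0, 1.0]),
    "music":   np.array([0.7, 0.7]),
}

def retrieve(query_vec, index):
    """Rank items by cosine similarity to the query embedding."""
    def cos(u, v):
        return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
    return sorted(index, key=lambda k: cos(query_vec, index[k]), reverse=True)

# A text query embedded near the "cooking" direction.
ranking = retrieve(np.array([0.9, 0.1]), video_index)
```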

SLIDE 74

Multimedia Retrieval

SLIDE 75
  • 4. Emotion and Affect

SLIDE 76

Affect Recognition: Emotion, Sentiment, Persuasion, Personality

SLIDE 77

Outline

I. What is multimodality?
II. Types of modalities
III. Commonly used Models
IV. Multimodal Fusion and Representation Learning
V. Multimodal Tasks: Use Cases

SLIDE 78

Takeaways

  • Lots of multimodal data generated every day
  • Need automatic ways to understand it

○ Privacy
○ Security
○ Regulation
○ Storage

  • Different models used for different downstream tasks

○ Highly open-ended research!

  • Try it out for fun on Kaggle!

Thank you! spalaska@cs.cmu.edu
