CMP722 ADVANCED COMPUTER VISION Lecture #5 Language and Vision - - PowerPoint PPT Presentation


Illustration: William Joel CMP722 ADVANCED COMPUTER VISION Lecture #5 Language and Vision Aykut Erdem // Hacettepe University // Spring 2019 Illustration: Detail from Fritz Kahns Der Mensch als Industriepalast Previously on CMP722


slide-1
SLIDE 1

Lecture #5 – Language and Vision

Aykut Erdem // Hacettepe University // Spring 2019

CMP722

ADVANCED COMPUTER VISION

Illustration: William Joel

slide-2
SLIDE 2
  • what is multimodality
  • a historical view on multimodal research

  • core technical challenges
  • joint representations
  • coordinated representations
  • multimodal fusion

Previously on CMP722

Illustration: Detail from Fritz Kahn’s Der Mensch als Industriepalast

slide-3
SLIDE 3

Lecture overview

  • image captioning
  • visual question answering
  • case study: neural module networks
  • Disclaimer: Much of the material and slides for this lecture were borrowed from Bill Freeman, Antonio Torralba and Phillip Isola’s MIT 6.869 class

3

slide-4
SLIDE 4

“A flock of birds against a gray sky”

Image captioning

4

slide-5
SLIDE 5

Recipe for deep learning in a new domain

  • 1. Transform your data into numbers (e.g., a vector)
  • 2. Transform your goal into an equation (objective function)
  • 3. #1 and #2 specify the “learning problem”
  • 4. Use a generic optimizer (SGD) and an appropriate architecture (e.g., CNN or RNN) to solve the learning problem

5

slide-6
SLIDE 6

Training data: “Fish”, “Grizzly”, “Chameleon”, …

Class index: 1, 2, 3, …

One-hot vector: [0,0,1], [0,1,0], [1,0,0], …

How to represent words as numbers?
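A minimal Python sketch of one-hot encoding for a handful of classes. The helper name `one_hot` and the zero-based index convention are assumptions made for illustration; the slide numbers its classes from 1 and lays out the vector positions in its own order.

```python
def one_hot(index, num_classes):
    """Return a list of length num_classes with a single 1 at position index."""
    if not 0 <= index < num_classes:
        raise ValueError("index must be in [0, num_classes)")
    vec = [0] * num_classes
    vec[index] = 1
    return vec


# Hypothetical label set; the order defines the class index.
labels = ["fish", "grizzly", "chameleon"]
class_index = {label: i for i, label in enumerate(labels)}
encoded = {label: one_hot(i, len(labels)) for label, i in class_index.items()}

for label in labels:
    print(label, class_index[label], encoded[label])
```

Each label maps to a vector with exactly one non-zero entry, so the vector length K equals the number of classes.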

6

slide-7
SLIDE 7

Prediction

dolphin cat grizzly bear angel fish chameleon iguana elephant clown fish

1

How to represent words as numbers?

7

slide-8
SLIDE 8

Prediction

a aardvark absolve accurate adapt after aghast aether

1

… Rather than having just a handful of possible object classes, we can represent all words in a large vocabulary using a very large K (e.g., K=100,000).

How to represent words as numbers?

8

slide-9
SLIDE 9

Prediction

a b c d e g h f

1

… Or, represent each character as a class (e.g., K=26 for English letters), and represent words as a sequence of characters.
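The character-level alternative can be sketched in a few lines of Python. The function name `encode_word` and the restriction to lowercase ASCII letters are assumptions for this example; a real system would also need classes for spaces, punctuation and an end marker.

```python
import string

ALPHABET = string.ascii_lowercase  # K = 26 classes, one per English letter


def encode_word(word):
    """Encode a word as a sequence of one-hot rows, one row per character."""
    rows = []
    for ch in word.lower():
        pos = ALPHABET.index(ch)  # raises ValueError for non-letters
        row = [0] * len(ALPHABET)
        row[pos] = 1
        rows.append(row)
    return rows


fish = encode_word("fish")
print(len(fish), "characters, each a vector of length", len(fish[0]))
```

The vocabulary stays tiny (K=26), at the cost of longer sequences: a word of n letters becomes n vectors instead of one.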

How to represent words as numbers?

9

slide-10
SLIDE 10

Hidden Outputs Inputs

“A” “clown” “fish” “swimming” “in” “open” “seas”

This problem is called image captioning

10

slide-11
SLIDE 11

Hidden Outputs Input

“A” “clown” “fish” “swimming” “in” “open” “seas”

LSTM LSTM CNN LSTM LSTM LSTM LSTM LSTM

“A” “clown” “swimming” “fish” “in” “open” END

LSTM

“seas”

11

slide-12
SLIDE 12

Hidden Outputs Input

LSTM LSTM CNN LSTM LSTM LSTM LSTM LSTM

“A” “clown” “swimming” “fish” “in” “open”

LSTM

“seas” “A” “clown” “fish” “swimming” “in” “open” “seas” END

Targets

Max-likelihood objective: maximize the probability the model assigns to each target word.
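Written out for one image-caption pair, this objective is usually expressed as a sum of log-probabilities. The notation is assumed here rather than taken from the slide: $I$ is the image, $y_1, \dots, y_N$ are the target words (the last one being END), and $\theta$ collects the CNN and LSTM parameters. In training, the sum also runs over all image-caption pairs.

```latex
\theta^{*} = \arg\max_{\theta} \sum_{t=1}^{N} \log p\left(y_t \mid I,\, y_1, \dots, y_{t-1};\, \theta\right)
```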

12

slide-13
SLIDE 13

Hidden Outputs Input

LSTM LSTM CNN LSTM LSTM LSTM LSTM LSTM LSTM

Targets

Max-likelihood objective: minimize the cross-entropy between model outputs and one-hot encoded targets.
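A small sketch of why the two formulations agree: with a one-hot target, the cross-entropy reduces to the negative log-probability of the target word. The function names and the toy scores are assumptions; the per-step scores stand in for the LSTM outputs.

```python
import math


def softmax(scores):
    """Turn raw scores into probabilities (shifted by the max for stability)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, target_one_hot):
    """Cross-entropy between a predicted distribution and a one-hot target."""
    return -sum(t * math.log(p) for p, t in zip(probs, target_one_hot) if t > 0)


def caption_loss(step_scores, target_indices):
    """Sum the per-word cross-entropy over all time steps of one caption."""
    loss = 0.0
    for scores, target in zip(step_scores, target_indices):
        probs = softmax(scores)
        one_hot = [1 if i == target else 0 for i in range(len(scores))]
        loss += cross_entropy(probs, one_hot)
    return loss


# Toy example: two time steps over a 4-word vocabulary.
step_scores = [[2.0, 0.5, 0.1, -1.0], [0.0, 0.0, 3.0, 0.0]]
targets = [0, 2]
print(caption_loss(step_scores, targets))
```

Minimizing this loss is the same as maximizing the summed log-probability of the target words.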

13

slide-14
SLIDE 14

LSTM CNN

Hidden Outputs Input Samples

LSTM

“clown”

LSTM

“A”

LSTM

“fish”

LSTM

“swimming”

LSTM

“in”

LSTM

“open”

LSTM

“seas” “A” “clown” “fish” “swimming” “in” “open” “seas” END

Sample from the predicted distribution over words. Alternatively, pick the most likely word at each step.
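The two decoding choices for a single time step can be sketched as follows. The vocabulary, the probabilities and the function names are made up for illustration; in the full model the chosen word is fed back as the next LSTM input until END is produced.

```python
import random


def greedy_pick(probs):
    """Index of the most likely word."""
    return max(range(len(probs)), key=lambda i: probs[i])


def sample_pick(probs, rng):
    """Index drawn at random according to the predicted distribution."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


# Hypothetical predicted distribution for one time step.
vocab = ["A", "clown", "fish", "END"]
probs = [0.1, 0.6, 0.25, 0.05]

greedy_word = vocab[greedy_pick(probs)]

rng = random.Random(0)  # fixed seed so the run is repeatable
sampled_words = [vocab[sample_pick(probs, rng)] for _ in range(1000)]

print("greedy:", greedy_word)
print("sampled counts:", {w: sampled_words.count(w) for w in vocab})
```

Greedy picking always returns the same word for the same distribution, while sampling produces varied captions whose word frequencies follow the predicted probabilities.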

14

slide-15
SLIDE 15

It was very popular a few years ago

15

slide-16
SLIDE 16

Show and Tell: A Neural Image Caption Generator [Vinyals et al., CVPR 2015]

16

slide-17
SLIDE 17

Show and Tell: A Neural Image Caption Generator [Vinyals et al., CVPR 2015]

17

slide-18
SLIDE 18

18

slide-19
SLIDE 19

  • Input: no sequence. Output: no sequence. Example: “standard” classification/regression problems
  • Input: no sequence. Output: sequence. Example: Im2Caption
  • Input: sequence. Output: no sequence. Example: sentence classification, multiple-choice question answering
  • Input: sequence. Output: sequence. Example: machine translation, video captioning, open-ended question answering, video question answering

http://karpathy.github.io/2015/05/21/rnn-effectiveness/

How do we model sequences?

19

slide-20
SLIDE 20

https://arxiv.org/pdf/1505.00468v6.pdf

2016

20

slide-21
SLIDE 21

http://www.visualqa.org/challenge.html

21

slide-22
SLIDE 22

22

slide-23
SLIDE 23

Questions and answers collected with AMT

23

slide-24
SLIDE 24

Image Question Answer

Architecture

24

slide-25
SLIDE 25

“Fish”   [1,0,0,0,0,0,0,…]
“Water”  [0,1,0,0,0,0,0,…]
“Shark”  [0,0,1,0,0,0,0,…]
“Whale”  [0,0,0,1,0,0,0,…]
“Cat”    [0,0,0,0,1,0,0,…]
“Couch”  [0,0,0,0,0,1,0,…]
“Sun”    [0,0,0,0,0,0,1,…]

How to represent words as numbers

25

slide-26
SLIDE 26

Image

layer 3 representation of image layer 1 representation of image

Represent image as a vector of neural activations

(perhaps representing a vector of detected texture patterns or object parts)

im2vec

26

slide-27
SLIDE 27

“Elephant”

dense vector representation of word …

one-hot vector representation of word

X2vec methods are also called embeddings of X, e.g., a word embedding

word2vec

27

slide-28
SLIDE 28

“Fish” “Water” “Shark” “Whale” “Cat” “Couch” “Sun” “Tuna”

Dim 1 Dim 2

Words with similar meanings should be near each other

28

slide-29
SLIDE 29

Words with similar meanings should be near each other

Proxy: words that are used in the same context tend to have similar meanings

“Meaning is use” (Wittgenstein)

Words with similar contexts should be near each other

word2vec

29

slide-30
SLIDE 30

Next to the ___ is a desk, and a ___ is sitting behind it.

First blank (e.g., sofa): 'sofa', 'armchair', 'bench', 'chair', 'deck chair', 'ottoman', 'seat', 'stool', 'swivel chair', 'loveseat', …

Second blank (e.g., person): 'person', 'man', 'woman', 'child', 'teenager', 'girl', 'boy', 'baby', 'daughter', 'son', …

30

slide-31
SLIDE 31

word2vec

  • T. Mikolov, K. Chen, G. Corrado, J. Dean. Efficient Estimation of Word Representations in Vector Space. arXiv:1301.3781, 2013

I parked the car in a nearby street. It is a red car with two doors, …

I parked the vehicle in a nearby street…

31

slide-32
SLIDE 32

word2vec

  • T. Mikolov, K. Chen, G. Corrado, J. Dean. Efficient Estimation of Word Representations in Vector Space. arXiv:1301.3781, 2013

I parked the car in a nearby street. It is a red car with two doors, …

car

encoder

w

decoder

List of words in the context of “car”
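A short sketch of how the "list of words in the context" of a word can be extracted from text. The function name `context_pairs`, the whitespace tokenization and the window size of 2 are assumptions for illustration.

```python
def context_pairs(tokens, window):
    """Return (center, context) pairs for every word and each neighbour
    at most `window` positions away on either side."""
    pairs = []
    for t, center in enumerate(tokens):
        lo = max(0, t - window)
        hi = min(len(tokens), t + window + 1)
        for j in range(lo, hi):
            if j != t:
                pairs.append((center, tokens[j]))
    return pairs


tokens = "i parked the car in a nearby street".split()
pairs = context_pairs(tokens, window=2)
car_context = [ctx for center, ctx in pairs if center == "car"]
print(car_context)
```

These (center, context) pairs serve as the training examples: the center word is the input and each context word is a prediction target.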

32

slide-33
SLIDE 33

word2vec

Word = ‘car’

Hidden layer Soft-max classifier

Output: probability that each word is in the context of the input word

  • T. Mikolov, K. Chen, G. Corrado, J. Dean. Efficient Estimation of Word Representations in Vector Space. arXiv:1301.3781, 2013

Encoder Decoder

33

slide-34
SLIDE 34

word2vec, training

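Linear layer (A), soft-max classifier (S)

“car” → [0, 0, 1, 0, …, 0] (one-hot vector of length V) → A → w → S → P

Output P: probability that each word is in the context of the input word

$w = A\,[0, 0, 1, 0, 0, \dots, 0]^\top$, where $A$ is a $d \times V$ matrix

$x = S\,w$, where $S$ is a $V \times d$ matrix

$p_i = e^{x_i} / \sum_j e^{x_j}$

  • T. Mikolov, K. Chen, G. Corrado, J. Dean. Efficient Estimation of Word Representations in Vector Space. arXiv:1301.3781, 2013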
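The forward pass of this model can be sketched in plain Python. The toy sizes, the random initialization and the function names are assumptions; the shapes follow the usual skip-gram setup, with an embedding matrix A of size d × V and a soft-max weight matrix S of size V × d.

```python
import math
import random

V, d = 5, 3  # toy vocabulary size and embedding dimension (assumed)
rng = random.Random(0)
A = [[rng.uniform(-1, 1) for _ in range(V)] for _ in range(d)]  # d x V
S = [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(V)]  # V x d


def matvec(M, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def softmax(x):
    """p_i = exp(x_i) / sum_j exp(x_j), shifted by max(x) for stability."""
    m = max(x)
    exps = [math.exp(xi - m) for xi in x]
    total = sum(exps)
    return [e / total for e in exps]


def context_probabilities(word_index):
    """Probability, for each vocabulary word, of being in the context."""
    one_hot = [1.0 if i == word_index else 0.0 for i in range(V)]
    w = matvec(A, one_hot)  # d-dimensional embedding of the input word
    x = matvec(S, w)        # one score per vocabulary word
    return softmax(x)


p = context_probabilities(2)
print(p)
```

The output has one entry per vocabulary word and sums to 1, which is what allows it to be trained with a log-likelihood objective.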

34

slide-35
SLIDE 35

word2vec, training

$w = A\,[0, 0, 1, 0, 0, \dots, 0]^\top$, where $A$ is a $d \times V$ matrix

$x = S\,w$, where $S$ is a $V \times d$ matrix

$p_i = e^{x_i} / \sum_j e^{x_j}$

  • T. Mikolov, K. Chen, G. Corrado, J. Dean. Efficient Estimation of Word Representations in Vector Space. arXiv:1301.3781, 2013

  • In training, maximize the log-likelihood over the training set, where T is the training set size and c is the context window size.
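The formula itself did not survive on this slide. The skip-gram objective as it is commonly stated, using the same T and c, is given below. The symbol $u_t$ for the t-th word of the training text is an assumption, chosen to avoid a clash with the embedding $w$.

```latex
\frac{1}{T} \sum_{t=1}^{T} \; \sum_{\substack{-c \le j \le c \\ j \ne 0}} \log p\left(u_{t+j} \mid u_t\right)
```

Each term is the log-probability that the soft-max assigns to a word observed within c positions of the input word.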

35

slide-36
SLIDE 36

word2vec, test time

Linear layer

“car” → [0, 0, 1, 0, …, 0] → A → w

  • T. Mikolov, K. Chen, G. Corrado, J. Dean. Efficient Estimation of Word Representations in Vector Space.

arXiv:1301.3781, 2013

At test time, w is our word embedding. The encoding is just a look up table.
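The lookup-table remark can be checked directly: multiplying A by a one-hot vector selects one column of A. This is a minimal sketch with a made-up 2 × 4 matrix; the function names are hypothetical.

```python
# Toy embedding matrix with d = 2 rows and V = 4 columns (values made up).
A = [
    [0.1, 0.2, 0.3, 0.4],
    [0.5, 0.6, 0.7, 0.8],
]


def embed_by_matmul(A, index):
    """Compute w = A @ one_hot(index) explicitly."""
    vocab_size = len(A[0])
    one_hot = [1.0 if i == index else 0.0 for i in range(vocab_size)]
    return [sum(a * x for a, x in zip(row, one_hot)) for row in A]


def embed_by_lookup(A, index):
    """Read column `index` of A directly, without any multiplication."""
    return [row[index] for row in A]


print(embed_by_matmul(A, 2))
print(embed_by_lookup(A, 2))
```

In practice only the lookup is used, since it avoids building length-V vectors for every word.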

$w = A\,[0, 0, 1, 0, 0, \dots, 0]^\top$ for the word “car”, where $A$ is a $d \times V$ matrix

36

slide-37
SLIDE 37

Algebraic operations with the vector representation of words

X = vector(“Paris”) - vector(“France”) + vector(“Italy”)

The nearest neighbor of X is vector(“Rome”)
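A toy illustration of this vector arithmetic. The 2-D vectors are hand-made so that the first coordinate encodes the country and the second encodes "is a capital"; they are not learned embeddings. Squared Euclidean distance is used for simplicity, whereas cosine similarity is the more common choice with real embeddings.

```python
# Hand-made toy vectors (assumed, for illustration only).
vectors = {
    "France": (1, 0), "Paris": (1, 1),
    "Italy": (2, 0), "Rome": (2, 1),
    "Germany": (3, 0), "Berlin": (3, 1),
}


def analogy(a, b, c):
    """Return the word nearest to vector(a) - vector(b) + vector(c),
    excluding the three query words themselves."""
    x = tuple(va - vb + vc
              for va, vb, vc in zip(vectors[a], vectors[b], vectors[c]))
    candidates = [word for word in vectors if word not in (a, b, c)]

    def sq_dist(word):
        return sum((p - q) ** 2 for p, q in zip(vectors[word], x))

    return min(candidates, key=sq_dist)


result = analogy("Paris", "France", "Italy")
print(result)  # Rome
```

Excluding the query words from the candidates matters: with real embeddings, the nearest vector to X is often one of the inputs.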

37

slide-38
SLIDE 38

Examples from https://www.tensorflow.org/tutorials/representation/word2vec

38

slide-39
SLIDE 39

Image Question Answer

  • Often, we work with word embeddings, rather than one-hot representations of words

Architecture

39

slide-40
SLIDE 40

Architecture

There are 1000 possible answers in this system. Questions are unlimited.
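Treating answers as a fixed set of classes can be sketched as follows. It assumes the answer set is built from the most frequent training answers; the function name and the tiny answer list are made up, and k would be 1000 in the system described here.

```python
from collections import Counter


def build_answer_vocab(train_answers, k):
    """Keep the k most frequent training answers as the output classes."""
    return [answer for answer, _ in Counter(train_answers).most_common(k)]


# Hypothetical training answers.
train_answers = ["yes", "no", "yes", "2", "yes", "no", "blue", "2", "no", "yes"]
answer_vocab = build_answer_vocab(train_answers, k=3)
answer_to_class = {answer: i for i, answer in enumerate(answer_vocab)}

print(answer_vocab)
print(answer_to_class.get("blue"))  # None: outside the answer set
```

Answering then becomes K-way classification, while the question side stays open-ended because questions are encoded word by word.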

40

slide-41
SLIDE 41

41

slide-42
SLIDE 42

42

slide-43
SLIDE 43

43

slide-44
SLIDE 44

44

slide-45
SLIDE 45

45

slide-46
SLIDE 46

46

slide-47
SLIDE 47

Neural Module Networks

47

Slides credit: Jacob Andreas

slide-48
SLIDE 48

Grounded question answering

name        type    coastal
Columbia    city    no
Cooper      river   yes
Charleston  city    yes

What rivers are in South Carolina? Cooper

48

slide-49
SLIDE 49

Grounded question answering

What color is the necktie? yellow

49

slide-50
SLIDE 50

Grounded question answering

yes Is there a red shape above a circle?

50

slide-51
SLIDE 51

Neural nets learn lexical groundings

yes

[Iyyer et al. 2014, Bordes et al. 2014, Yang et al. 2015, Malinowski et al., 2015]

Is there a red shape above a circle?

51

slide-52
SLIDE 52

Semantic parsers learn composition

yes

[Wong & Mooney 2007, Kwiatkowski et al. 2010, Liang et al. 2011, A et al. 2013]

Is there a red shape above a circle?

52

slide-53
SLIDE 53

Neural module networks learn both!

yes Is there a red shape above a circle?

red and and

53

slide-54
SLIDE 54

Neural module networks

Is there a red shape above a circle?

red exists

true

↦ ↦

above

54

slide-55
SLIDE 55

Neural module networks

Is there a red shape above a circle?

red exists

true

↦ ↦

above

circle red above exists and

55

slide-56
SLIDE 56

Neural module networks

yes Is there a red shape above a circle?

red exists

true

↦ ↦

above

circle red above exists and

56

slide-57
SLIDE 57

Representing meaning

Is there a red shape above a circle?

57

slide-58
SLIDE 58

Representing meaning

Is there a red shape above a circle?

58

slide-59
SLIDE 59

Sets encode meaning

Is there a red shape above a circle?

59

slide-60
SLIDE 60

Is there a red shape above a circle?

Sets encode meaning

60

slide-61
SLIDE 61

Is there a red shape above a circle?

Set transformations encode meaning

61

slide-62
SLIDE 62

Set transformations encode meaning

Is there a red shape above a circle?

62

slide-63
SLIDE 63

Is there a red shape above a circle?

exists and red above circle

Sentence meanings are computations

63

slide-64
SLIDE 64

exists and red above circle

Sentence meanings are computations

Is there a red shape above a circle?

64

slide-65
SLIDE 65

exists and red above circle red exists

true

↦ ↦

above

Computations are built from set functions
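A minimal sketch of the question "Is there a red shape above a circle?" evaluated with ordinary set functions. The toy world, the field names and the function names are assumptions for illustration; row 0 is taken to be the top of the image.

```python
# Hypothetical world: three shapes stacked vertically, row 0 at the top.
world = [
    {"id": 0, "color": "red", "shape": "square", "row": 0},
    {"id": 1, "color": "blue", "shape": "circle", "row": 1},
    {"id": 2, "color": "green", "shape": "triangle", "row": 2},
]


def find(value):
    """Set of object ids whose color or shape matches `value`."""
    return {o["id"] for o in world if value in (o["color"], o["shape"])}


def above(targets):
    """Set of object ids located above at least one object in `targets`."""
    return {o["id"] for o in world for t in world
            if t["id"] in targets and o["row"] < t["row"]}


def and_(a, b):
    """Set intersection."""
    return a & b


def exists(s):
    """True if the set is non-empty."""
    return len(s) > 0


answer = exists(and_(find("red"), above(find("circle"))))
print(answer)  # True
```

The nesting of the calls mirrors the layout exists(and(red, above(circle))): each word contributes a set or a set transformation.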

65

slide-66
SLIDE 66

…or relaxed to vector functions

exists and red above circle red exists

true

↦ ↦

above

0.0 0.9 1.0
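One way to relax the set functions is to replace each set with a vector of membership scores in [0, 1], one per object. The sketch below uses min for "and" and max for "exists"; these are common soft-logic choices made here for illustration and are not necessarily the parameterization used by neural module networks. The scores and function names are made up.

```python
# One score per object; rows[i] is the vertical position (0 = top).
rows = [0, 1, 2]
red_score = [0.9, 0.1, 0.0]      # how "red" each object looks (assumed)
circle_score = [0.0, 1.0, 0.1]   # how "circle-like" each object looks (assumed)


def soft_above(att, rows):
    """Score of object i = strongest attended object that lies below it."""
    out = []
    for i in range(len(rows)):
        below = [att[j] for j in range(len(rows)) if rows[i] < rows[j]]
        out.append(max(below, default=0.0))
    return out


def soft_and(a, b):
    """Elementwise minimum as a soft intersection."""
    return [min(x, y) for x, y in zip(a, b)]


def soft_exists(att):
    """Maximum score as a soft 'is the set non-empty?'."""
    return max(att)


answer = soft_exists(soft_and(red_score, soft_above(circle_score, rows)))
print(answer)  # 0.9
```

With scores restricted to exactly 0 and 1, these functions reduce to the crisp set operations; with fractional scores they return a graded answer that can be compared with a target.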

66

slide-67
SLIDE 67

Composing vector functions

exists and red above circle red exists

true

↦ ↦

above

67

slide-68
SLIDE 68

Composing vector functions

exists and red above circle red exists

true

↦ ↦

above

68

slide-69
SLIDE 69

Composing vector functions

circle red above and exists red exists

true

↦ ↦

above

69

slide-70
SLIDE 70

Compositions of vector functions are neural nets

true

↦ ↦ ↦

70

slide-71
SLIDE 71

circle red above and exists red exists

true

↦ ↦

above

Compositions of vector functions are neural nets

71

slide-72
SLIDE 72

Outline

yes Is there a red shape above a circle?

red exists

true

↦ ↦

above

circle red above exists and

72

slide-73
SLIDE 73

Outline

yes Is there a red shape above a circle?

red exists

true

↦ ↦

above

circle red above exists and

73

slide-74
SLIDE 74

Outline

yes Is there a red shape above a circle?

red exists

true

↦ ↦

above

circle red above exists and

74

slide-75
SLIDE 75

Outline

yes Is there a red shape above a circle?

red exists

true

↦ ↦

above

circle red above exists and

75

slide-76
SLIDE 76

Outline

yes Is there a red shape above a circle?

red exists

true

↦ ↦

above

circle red above exists and

76

slide-77
SLIDE 77

Anatomy of a module

above

77

slide-78
SLIDE 78

Anatomy of a module

color

red

78

slide-79
SLIDE 79

The [find] module

red

79

slide-80
SLIDE 80

The [find] module

necktie

[Xu et al. 2015]

80

slide-81
SLIDE 81

The [find] module

city

name          type    coastal
Columbia      city    no
Cooper        river   yes
Myrtle Beach  city    yes

Columbia Cooper Myrtle Beach

0.9 0.8 0.1

81

slide-82
SLIDE 82

The [find] module

red

82

slide-83
SLIDE 83

The [find] module

red

83

slide-84
SLIDE 84

The [find] module

red

red

84

slide-85
SLIDE 85

red

The [find] module

red

0.9

85

slide-86
SLIDE 86

red

The [find] module

red

0.9

86

slide-87
SLIDE 87

red

The [find] module

red

0.1

87

slide-88
SLIDE 88

The [describe] module

color

red

88

slide-89
SLIDE 89

The [describe] module

what

necktie

89

slide-90
SLIDE 90

The [describe] module

color

red

90

slide-91
SLIDE 91

The [describe] module

color

red

. . .

91

slide-92
SLIDE 92

The [describe] module

color

red

. . .

92

slide-93
SLIDE 93

What modules do we need?

Is there a red shape above a circle? Who is running in the grass? What cities are south of San Diego? What color is the triangle?

93

slide-94
SLIDE 94

A module for predicates

Is there a red shape above a circle? Who is running in the grass? What cities are south of San Diego?

[find]

94

What color is the triangle?

slide-95
SLIDE 95

Who is running in the grass? Is there a red shape above a circle?

A module for relations

What cities are south of San Diego? What color is the triangle?

[find] [relate]

95

slide-96
SLIDE 96

Module inventory

Is there a red shape above a circle? Who is running in the grass? What cities are south of San Diego? What color is the triangle?

[find] [relate] [exists]

[describe]

[and]

96

slide-97
SLIDE 97

Outline

yes Is there a red shape above a circle?

red exists

true

↦ ↦

above

circle red above exists and

97

slide-98
SLIDE 98

Learning

Is there a red shape above a circle? What color is the shape right of a circle?

circle red above exists and circle right_of color

yes blue

98

slide-99
SLIDE 99

Learning

yes blue

Is there a red shape above a circle? What color is the shape right of a circle?

99

slide-100
SLIDE 100

Parameter tying

circle circle

yes blue

Is there a red shape above a circle? What color is the shape right of a circle?

100

slide-101
SLIDE 101

Parameter tying

circle circle

yes blue

Is there a red shape above a circle? What color is the shape right of a circle?
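Parameter tying across questions can be sketched with a registry that hands out one shared parameter object per module. Keying the registry by (module type, word), the nested-tuple layout format and the placeholder weights are all assumptions for illustration.

```python
class ModuleRegistry:
    """Hands out one shared parameter object per (module type, word)."""

    def __init__(self):
        self._params = {}

    def get(self, module_type, word):
        key = (module_type, word)
        if key not in self._params:
            # Placeholder parameters; a real module would hold weight tensors.
            self._params[key] = {"weights": [0.0, 0.0, 0.0]}
        return self._params[key]

    def __len__(self):
        return len(self._params)


def collect_params(layout, registry):
    """Walk a layout given as (module_type, word, *children) and return
    the parameter object used by every module in it."""
    module_type, word, *children = layout
    params = {(module_type, word): registry.get(module_type, word)}
    for child in children:
        params.update(collect_params(child, registry))
    return params


registry = ModuleRegistry()
# "Is there a red shape above a circle?"
layout_1 = ("exists", None,
            ("and", None,
             ("find", "red"),
             ("relate", "above", ("find", "circle"))))
# "What color is the shape right of a circle?"
layout_2 = ("describe", "color",
            ("relate", "right_of", ("find", "circle")))

params_1 = collect_params(layout_1, registry)
params_2 = collect_params(layout_2, registry)
shared = params_1[("find", "circle")] is params_2[("find", "circle")]
print(shared, len(registry))
```

Because both networks hold the same object for the circle module, a gradient update made while training on one question also changes the module used by the other.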

101

slide-102
SLIDE 102

Extreme parameter tying

red

102

circle above exists and circle right_of color square right_of shape circle above red exists and

left_of

slide-103
SLIDE 103

Learning with fixed layouts is easy!

$\arg\max_W \sum p(\text{answer} \mid \text{layout}, \text{input};\, W)$

(where every root module outputs a distribution over answers, e.g., “yes”, and W is the set of all module parameters)

103

slide-104
SLIDE 104

Maximum likelihood estimation

104

slide-105
SLIDE 105

Outline

yes Is there a red shape above a circle?

red exists

true

↦ ↦

above

circle red above exists and

105

slide-106
SLIDE 106

Where do layouts come from?

Is there a red shape above a circle?

be red shape there any circle above a

[Reddy et al. 2016]

106

slide-107
SLIDE 107

Is there a red shape above a circle?

be red shape circle above

Where do layouts come from?

107

slide-108
SLIDE 108

Is there a red shape above a circle?

be

circle red above shape

Where do layouts come from?

108

slide-109
SLIDE 109

Is there a red shape above a circle?

circle red above shape

Where do layouts come from?

109

slide-110
SLIDE 110

Is there a red shape above a circle?

circle red above shape and

Where do layouts come from?

110

slide-111
SLIDE 111

Experiments

name        type    coastal
Columbia    city    no
Cooper      river   yes
Charleston  city    yes

111

slide-112
SLIDE 112

Experiments: VQA dataset

What is in the sheep’s ear? tag What color is the necktie? yellow

[Antol et al. 2015]

112

slide-113
SLIDE 113

Experiments: VQA dataset

Bar chart (accuracy, %; axis from 50 to 60):

  • Zhou (2015): 55.9
  • Noh (2015): 57.4
  • Yang (2015): 58.9
  • Ours: 59.4

113

slide-114
SLIDE 114

Experiments: VQA dataset

Bar chart (accuracy, %; axis from 50 to 60):

  • Zhou (2015): 55.9
  • Noh (2015): 57.4
  • Yang (2015): 58.9
  • Ours: 59.4

114

slide-115
SLIDE 115

Experiments: SHAPES dataset

Bar chart (accuracy, %; axis from 50 to 100). Methods: *Zhou, *Yang, Ours. Values: 65.3, 76.5, 90.6.

115

slide-116
SLIDE 116

Experiments: VQA Dataset

sheep ear and what and

What is in the sheep’s ear? tag

116

slide-117
SLIDE 117

Experiments: VQA Dataset

sheep ear and what and

What is in the sheep’s ear? tag

117

slide-118
SLIDE 118

Experiments: VQA Dataset

sheep ear and what and

What is in the sheep’s ear? tag

118

slide-119
SLIDE 119

Neural module networks

yes Is there a red shape above a circle?

red exists

true

↦ ↦

above

circle red above exists and

119

slide-120
SLIDE 120

Neural module networks

Combines advantages of:

  • Representation learning (like a neural net)
  • Compositionality (like a semantic parser)

circle red above exists and

Linguistic structure dynamically generates model structure

120

slide-121
SLIDE 121

Deep nets are popular for a few reasons:

  • 1. High capacity
  • 2. Easy to optimize (differentiable)
  • 3. Compositional “block based programming”

An emerging term for general models with these properties is differentiable programming.

Differentiable programming

121

slide-122
SLIDE 122

Deep learning Differentiable programming

Differentiable programming

122

slide-123
SLIDE 123

Differentiable programming

123

slide-124
SLIDE 124

Next Lecture: Deep Reinforcement Learning

124