Lecture #5 – Language and Vision
Aykut Erdem // Hacettepe University // Spring 2019
CMP722
ADVANCED COMPUTER VISION
Illustration: William Joel
research
Previously on CMP722
Illustration: Detail from Fritz Kahn’s Der Mensch als Industriepalast
Lecture overview
—Bill Freeman, Antonio Torralba and Phillip Isola’s MIT 6.869 class
3
“A flock of birds against a gray sky”
Image captioning
4
Recipe for deep learning in a new domain
(e.g., CNN or RNN) to solve the learning problem
5
Training data …
“Fish” “Grizzly” “Chameleon”
Training data …
1 2 3
One-hot vector Training data …
[0,0,1] [0,1,0] [1,0,0]
How to represent words as numbers?
6
Prediction
dolphin cat grizzly bear angel fish chameleon iguana elephant clown fish
1
…
How to represent words as numbers?
7
Prediction
a aardvark absolve accurate adapt after aghast aether
1
…
Rather than having just a handful of possible object classes, we can treat every word in a large vocabulary as its own class, using a very large K (e.g., K = 100,000).
How to represent words as numbers?
8
Prediction
a b c d e g h f
1
…
Or, represent each character as a class (e.g., K = 26 for English letters), and represent words as sequences of characters.
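A small sketch of the two encodings discussed on these slides, for illustration only. The toy vocabulary and the helper names `one_hot`, `encode_word` and `encode_chars` are made up for this example and are not part of the lecture.

```python
import string
import numpy as np

# Hypothetical toy vocabulary; a real one could have K on the order of 100,000 words.
VOCAB = ["a", "aardvark", "absolve", "accurate", "adapt", "after"]
WORD_TO_INDEX = {word: i for i, word in enumerate(VOCAB)}


def one_hot(index, size):
    """Return a length-`size` vector with a single 1 at position `index`."""
    vec = np.zeros(size, dtype=np.float32)
    vec[index] = 1.0
    return vec


def encode_word(word):
    """Word-level encoding: one class per vocabulary entry (K = len(VOCAB))."""
    return one_hot(WORD_TO_INDEX[word], len(VOCAB))


def encode_chars(word):
    """Character-level encoding: one K = 26 one-hot vector per letter."""
    letters = string.ascii_lowercase
    return np.stack([one_hot(letters.index(ch), len(letters)) for ch in word.lower()])


word_vec = encode_word("adapt")   # shape (6,)
char_seq = encode_chars("fish")   # shape (4, 26)
```

The word-level vector grows with the vocabulary, while the character-level encoding keeps K small at the cost of longer sequences.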
How to represent words as numbers?
9
Hidden Outputs Inputs
“A” “clown” “fish” “swimming” “in” “open” “seas”
This problem is called image captioning
10
Hidden Outputs Input
“A” “clown” “fish” “swimming” “in” “open” “seas”
LSTM LSTM CNN LSTM LSTM LSTM LSTM LSTM
“A” “clown” “swimming” “fish” “in” “open” END
LSTM
“seas”
11
Hidden Outputs Input
LSTM LSTM CNN LSTM LSTM LSTM LSTM LSTM
“A” “clown” “swimming” “fish” “in” “open”
LSTM
“seas” “A” “clown” “fish” “swimming” “in” “open” “seas” END
Targets
Max-likelihood objective: maximize the probability the model assigns to each target word.
12
Hidden Outputs Input
LSTM LSTM CNN LSTM LSTM LSTM LSTM LSTM LSTM
Targets Max-likelihood objective: minimize cross-entropy between model outputs and one-hot encoded targets.
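The two statements of the objective are the same thing: with one-hot targets, cross-entropy reduces to the negative log-probability of the target word. A minimal NumPy sketch, with made-up scores and a toy vocabulary of four words (the function names are chosen for this example):

```python
import numpy as np


def softmax(logits):
    """Row-wise soft-max, shifted for numerical stability."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


def caption_loss(logits, target_ids):
    """Mean cross-entropy between per-step predictions and one-hot targets.

    logits:     (T, K) unnormalised scores, one row per time step
    target_ids: (T,)   index of the target word at each step
    """
    probs = softmax(logits)
    steps = np.arange(len(target_ids))
    # With one-hot targets, cross-entropy is just -log p(target word).
    return -np.log(probs[steps, target_ids]).mean()


# Toy example: 3 time steps, vocabulary of K = 4 words (values are made up).
logits = np.array([[2.0, 0.5, 0.1, 0.1],
                   [0.2, 3.0, 0.3, 0.1],
                   [0.1, 0.2, 0.3, 2.5]])
targets = np.array([0, 1, 3])
loss = caption_loss(logits, targets)

# The same value computed from explicit one-hot target vectors.
one_hot_targets = np.eye(4)[targets]
loss_explicit = -(one_hot_targets * np.log(softmax(logits))).sum(axis=1).mean()
```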
13
LSTM CNN
Hidden Outputs Input Samples
LSTM
“clown”
LSTM
“A”
LSTM
“fish”
LSTM
“swimming”
LSTM
“in”
LSTM
“open”
LSTM
“seas” “A” “clown” “fish” “swimming” “in” “open” “seas” END
Sample from the predicted distribution over words. Alternatively, pick the most likely word at each step.
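A sketch of the two decoding choices for a single step. The vocabulary and the probabilities are invented for the example; in a full decoder the chosen word would be fed back as the next input until END is produced.

```python
import numpy as np

# Hypothetical vocabulary and predicted distribution for one decoding step.
vocab = ["A", "clown", "fish", "swimming", "END"]
probs = np.array([0.05, 0.60, 0.25, 0.08, 0.02])

rng = np.random.default_rng(seed=0)

# Option 1: sample the next word from the predicted distribution.
sampled_word = vocab[rng.choice(len(vocab), p=probs)]

# Option 2: greedy decoding, i.e. take the most likely word.
greedy_word = vocab[int(np.argmax(probs))]
```

Sampling gives varied captions across runs, while the greedy choice is deterministic.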
14
It was very popular a few years ago
15
Show and Tell: A Neural Image Caption Generator [Vinyals et al., CVPR 2015]
16
Show and Tell: A Neural Image Caption Generator [Vinyals et al., CVPR 2015]
17
18
Input: No sequence. Output: No sequence. Example: “standard” classification/regression problems
Input: No sequence. Output: Sequence. Example: Im2Caption
Input: Sequence. Output: No sequence. Example: sentence classification, multiple-choice question answering
Input: Sequence. Output: Sequence. Example: machine translation, video captioning, open-ended question answering, video question answering
http://karpathy.github.io/2015/05/21/rnn-effectiveness/
How do we model sequences?
19
https://arxiv.org/pdf/1505.00468v6.pdf
2016
20
http://www.visualqa.org/challenge.html
21
22
Questions and answers collected with AMT
23
Image Question Answer
Architecture
24
“Fish” [1,0,0,0,0,0,0,…]
“Water” [0,1,0,0,0,0,0,…]
“Shark” [0,0,1,0,0,0,0,…]
“Whale” [0,0,0,1,0,0,0,…]
“Cat” [0,0,0,0,1,0,0,…]
“Couch” [0,0,0,0,0,1,0,…]
“Sun” [0,0,0,0,0,0,1,…]
How to represent words as numbers
25
Image
layer 3 representation of image layer 1 representation of image
Represent image as a vector of neural activations
(perhaps representing a vector of detected texture patterns or object parts)
im2vec
26
“Elephant”
dense vector representation of word …
X2vec methods are also called embeddings of X, e.g., a word embedding
word2vec
27
“Fish” “Water” “Shark” “Whale” “Cat” “Couch” “Sun” “Tuna”
Dim 1 Dim 2
Words with similar meanings should be near each other
28
Proxy: words that are used in the same context tend to have similar meanings
Words with similar meanings should be near each other
“Meaning is use” (Wittgenstein)
Words with similar contexts should be near each other
word2vec
29
Next to the ___ is a desk, and a ___ is sitting behind it.
First blank (e.g., sofa): 'sofa' 'armchair' 'bench' 'chair' 'deck chair' 'ottoman' 'seat' 'stool' 'swivel chair' 'loveseat' …
Second blank (e.g., person): 'person' 'man' 'woman' 'child' 'teenager' 'girl' 'boy' 'baby' 'daughter' 'son' …
30
word2vec
arXiv:1301.3781, 2013
I parked the car in a nearby
doors, … I parked the vehicle in a nearby street…
31
word2vec
arXiv:1301.3781, 2013
I parked the car in a nearby
doors, …
car
encoder
w
decoder
List of words in the context of “car”
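To make "words in the context of car" concrete, here is a sketch that collects the words within a fixed window around a target word. The example sentence, the window size and the function name `context_words` are assumptions for illustration.

```python
def context_words(tokens, target, window=2):
    """Collect words within `window` positions of each occurrence of `target`."""
    contexts = []
    for i, token in enumerate(tokens):
        if token != target:
            continue
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        # Words before and after the target, excluding the target itself.
        contexts.extend(tokens[lo:i] + tokens[i + 1:hi])
    return contexts


# Made-up example sentence; the window size is an arbitrary choice.
tokens = "i parked the car in a nearby street".split()
car_context = context_words(tokens, "car", window=2)
```

Each (target, context word) pair produced this way can serve as one training example for the encoder and decoder.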
32
word2vec
Word = ‘car’
Hidden layer Soft-max classifier
Output: probability that each word is in the context of the input word
arXiv:1301.3781, 2013
Encoder Decoder
33
word2vec, training
Linear layer
Soft-max classifier
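Output: probability that each word is in the context of the input word
Input: “car” as a one-hot vector of length V, [0, 0, 1, 0, 0, …, 0]
Pipeline: car → A → w → S → P
$w = A\,[0, 0, 1, 0, 0, \dots, 0]^\top$, where A is a d × V matrix
$x = S\,w$
$p_i = e^{x_i} / \sum_j e^{x_j}$
arXiv:1301.3781, 2013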
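A sketch of the forward pass on this slide, assuming the encoder is a d × V matrix A and the decoder a V × d matrix S. The shape of S, the toy sizes and the random weights (standing in for trained ones) are assumptions of this example.

```python
import numpy as np

V, d = 10, 4                      # toy vocabulary size and embedding dimension
rng = np.random.default_rng(0)
A = rng.normal(size=(d, V))       # encoder: one-hot (V) -> embedding w (d)
S = rng.normal(size=(V, d))       # decoder: embedding w (d) -> scores x (V)


def softmax(x):
    e = np.exp(x - x.max())       # shift for numerical stability
    return e / e.sum()


def context_probabilities(word_index):
    one_hot = np.zeros(V)
    one_hot[word_index] = 1.0
    w = A @ one_hot               # hidden layer, i.e. the word embedding
    x = S @ w                     # one score per vocabulary word
    return softmax(x)             # p_i = exp(x_i) / sum_j exp(x_j)


p = context_probabilities(2)      # index 2 plays the role of "car"
```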
34
word2vec, training
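Input: one-hot vector of length V, [0, 0, 1, 0, 0, …, 0]
$w = A\,[0, 0, 1, 0, 0, \dots, 0]^\top$, where A is a d × V matrix
$x = S\,w$
$p_i = e^{x_i} / \sum_j e^{x_j}$
arXiv:1301.3781, 2013
T: training set size
c: context window size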
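For reference, the skip-gram objective in its usual form, which is what T and c typically refer to: maximize the average log-probability of the words within c positions of each training word. Here $w_t$ denotes the t-th word of the training text (not the embedding vector w), and this indexing is the standard formulation rather than something taken from the slide.

```latex
\[
\frac{1}{T} \sum_{t=1}^{T} \; \sum_{-c \le j \le c,\; j \ne 0} \log p(w_{t+j} \mid w_t)
\]
```

Each term $p(w_{t+j} \mid w_t)$ is one entry of the soft-max output computed for the input word $w_t$.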
35
word2vec, test time
Linear layer
[0, 0, 1, 0, … 0] car A w
arXiv:1301.3781, 2013
At test time, w is our word embedding. The encoding is just a lookup table.
$w = A\,[0, 0, 1, 0, 0, \dots, 0]^\top$ for “car”, where A is a d × V matrix, so w is the column of A that the one-hot vector selects
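Why the encoder is a lookup table: multiplying a d × V matrix by a one-hot vector just reads out one column. A quick check with stand-in weights (the sizes and the index used for "car" are arbitrary choices for this example):

```python
import numpy as np

V, d = 10, 4
A = np.arange(d * V, dtype=float).reshape(d, V)   # stand-in for trained weights

car_index = 2
one_hot = np.zeros(V)
one_hot[car_index] = 1.0

w_matmul = A @ one_hot        # multiply by the one-hot vector
w_lookup = A[:, car_index]    # read the same column directly
```

In practice the lookup is what gets implemented, since it avoids the full matrix product.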
36
Algebraic operations with the vector representation of words
X = vector(“Paris”) - vector(“France”) + vector(“Italy”)
The nearest neighbor of X is vector(“Rome”)
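A toy version of this vector arithmetic. The 4-dimensional vectors below are hand-made so that the analogy holds exactly; real word embeddings are learned and only approximately behave this way. The helper names `cosine` and `nearest` are chosen for this example.

```python
import numpy as np

# Hypothetical dimensions: [French, Italian, German, is-capital-city]
emb = {
    "France":  np.array([1.0, 0.0, 0.0, 0.0]),
    "Italy":   np.array([0.0, 1.0, 0.0, 0.0]),
    "Germany": np.array([0.0, 0.0, 1.0, 0.0]),
    "Paris":   np.array([1.0, 0.0, 0.0, 1.0]),
    "Rome":    np.array([0.0, 1.0, 0.0, 1.0]),
    "Berlin":  np.array([0.0, 0.0, 1.0, 1.0]),
}


def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


def nearest(x, exclude=()):
    """Word whose vector has the highest cosine similarity to x."""
    candidates = [word for word in emb if word not in exclude]
    return max(candidates, key=lambda word: cosine(x, emb[word]))


x = emb["Paris"] - emb["France"] + emb["Italy"]
answer = nearest(x, exclude=("Paris", "France", "Italy"))
```

The query words are excluded from the search, as is common when evaluating analogies.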
37
Examples from https://www.tensorflow.org/tutorials/representation/word2vec
38
Image Question Answer
than one-hot representations of words
Architecture
39
Architecture
There are 1000 possible answers in this system. Questions are unlimited.
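One way to read this: the system is a classifier over a fixed list of 1000 answers, while the question is free-form input. The sketch below is not the lecture's architecture; it assumes an image vector and a question vector of the same size fused by an elementwise product, with random stand-in weights and an arbitrary dimension.

```python
import numpy as np

NUM_ANSWERS = 1000   # fixed answer vocabulary, as stated above
DIM = 64             # assumed size of the image and question vectors

rng = np.random.default_rng(0)
W_out = rng.normal(scale=0.1, size=(NUM_ANSWERS, DIM))   # untrained stand-in weights


def answer_distribution(image_vec, question_vec):
    """Fuse the two vectors and classify over the fixed answer set."""
    fused = image_vec * question_vec      # assumed fusion: elementwise product
    scores = W_out @ fused
    exp = np.exp(scores - scores.max())
    return exp / exp.sum()


image_vec = rng.normal(size=DIM)       # stand-in for CNN activations
question_vec = rng.normal(size=DIM)    # stand-in for an encoded question
p_answer = answer_distribution(image_vec, question_vec)
predicted_answer_id = int(p_answer.argmax())
```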
40
Slides credit: Jacob Andreas
Grounded question answering
name        type   coastal
Columbia    city   no
Cooper      river  yes
Charleston  city   yes
What rivers are in South Carolina? Cooper
48
Grounded question answering
What color is the necktie? yellow
49
Grounded question answering
yes Is there a red shape above a circle?
50
Neural nets learn lexical groundings
yes
[Iyyer et al. 2014, Bordes et al. 2014, Yang et al. 2015, Malinowski et al., 2015]
Is there a red shape above a circle?
51
Semantic parsers learn composition
yes
[Wong & Mooney 2007, Kwiatkowski et al. 2010, Liang et al. 2011, A et al. 2013]
Is there a red shape above a circle?
52
Neural module networks learn both!
yes Is there a red shape above a circle?
red and and
53
Neural module networks
Is there a red shape above a circle?
red exists
true
↦ ↦
above
↦
54
Neural module networks
Is there a red shape above a circle?
red exists
true
↦ ↦
above
↦
circle red above exists and
55
Neural module networks
yes Is there a red shape above a circle?
red exists
true
↦ ↦
above
↦
circle red above exists and
56
Representing meaning
Is there a red shape above a circle?
57
Representing meaning
Is there a red shape above a circle?
58
Sets encode meaning
Is there a red shape above a circle?
59
Is there a red shape above a circle?
Sets encode meaning
60
Is there a red shape above a circle?
Set transformations encode meaning
61
Set transformations encode meaning
Is there a red shape above a circle?
62
Is there a red shape above a circle?
exists and red above circle
Sentence meanings are computations
63
exists and red above circle
Sentence meanings are computations
Is there a red shape above a circle?
64
exists and red above circle red exists
true
↦ ↦
above
↦
Computations are built from set functions
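A sketch of the computation "exists and red above circle" written as plain set functions. The three-object world and its attributes are invented for this example.

```python
# Toy world: each object has a colour, a shape and a row (smaller row = higher up).
WORLD = {
    "o1": {"color": "red",  "shape": "square", "row": 0},
    "o2": {"color": "blue", "shape": "circle", "row": 1},
    "o3": {"color": "red",  "shape": "circle", "row": 2},
}


def red():
    return {o for o, attrs in WORLD.items() if attrs["color"] == "red"}


def circle():
    return {o for o, attrs in WORLD.items() if attrs["shape"] == "circle"}


def above(objs):
    """Objects that sit above at least one object in `objs`."""
    return {o for o, attrs in WORLD.items()
            if any(attrs["row"] < WORLD[other]["row"] for other in objs)}


def and_(left, right):
    return left & right


def exists(objs):
    return len(objs) > 0


# "Is there a red shape above a circle?"
answer = exists(and_(red(), above(circle())))
```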
65
…or relaxed to vector functions
exists and red above circle red exists
true
↦ ↦
above
↦
0.0 0.9 1.0
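The same computation with sets relaxed to vectors of soft membership scores in [0, 1]. Using min for "and" and max for "exists" is one simple choice picked for illustration; in neural module networks these functions are parameterised and learned. All values below are made up.

```python
import numpy as np

# Soft membership scores for three objects (made-up values in [0, 1]).
red    = np.array([0.9, 0.0, 1.0])
circle = np.array([0.1, 0.9, 0.8])

# ABOVE[i, j] = 1 if object i is above object j (hard-coded toy geometry).
ABOVE = np.array([[0, 1, 1],
                  [0, 0, 1],
                  [0, 0, 0]], dtype=float)


def above(scores):
    """Soft version of 'is above some object in the set'."""
    return (ABOVE * scores).max(axis=1)


def and_(a, b):
    return np.minimum(a, b)      # soft intersection


def exists(scores):
    return scores.max()          # soft 'is the set non-empty?'


answer_score = exists(and_(red, above(circle)))
```

Every step now operates on vectors of scores rather than discrete sets.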
66
Composing vector functions
exists and red above circle red exists
true
↦ ↦
above
↦
67
Composing vector functions
exists and red above circle red exists
true
↦ ↦
above
↦
68
Composing vector functions
circle red above and exists red exists
true
↦ ↦
above
↦
69
Compositions of vector functions are neural nets
true
↦ ↦ ↦
70
circle red above and exists red exists
true
↦ ↦
above
↦
Compositions of vector functions are neural nets
71
Outline
yes Is there a red shape above a circle?
red exists
true
↦ ↦
above
↦
circle red above exists and
72
Anatomy of a module
above
77
Anatomy of a module
color
red
78
The [find] module
red
79
The [find] module
necktie
[Xu et al. 2015]
80
The [find] module
city
name          type   coastal
Columbia      city   no
Cooper        river  yes
Myrtle Beach  city   yes
Columbia Cooper Myrtle Beach
0.9 0.8 0.1
81
The [find] module
red
82
The [find] module
red
83
The [find] module
red
red
84
red
The [find] module
red
0.9
85
red
The [find] module
red
0.9
86
red
The [find] module
red
0.1
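A rough sketch of what a find module could look like: score every image location against one concept and return an attention map. The linear-plus-sigmoid form, the RGB "features" and the hand-set weights are assumptions for illustration; in practice the weights are learned and the features come from a CNN.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def find(features, weight, bias=0.0):
    """Score every image location against one concept (e.g. "red").

    features: (H, W, D) grid of image features
    weight:   (D,) concept-specific parameters
    returns:  (H, W) attention map with values in (0, 1)
    """
    return sigmoid(features @ weight + bias)


# Toy 2x2 image whose 3-d features are plain RGB values.
features = np.array([[[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]],
                     [[0.0, 1.0, 0.0], [1.0, 0.1, 0.1]]])
w_red = np.array([4.0, -4.0, -4.0])   # hand-set stand-in for learned weights
attention = find(features, w_red, bias=-2.0)
```

Locations with reddish features get scores near 1, the others scores near 0.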
87
The [describe] module
color
red
88
The [describe] module
what
necktie
89
The [describe] module
color
red
90
The [describe] module
color
red
. . .
91
The [describe] module
color
red
. . .
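A matching sketch of a describe module: average the image features under an attention map, then classify the pooled vector into a label. The pooling-plus-linear-classifier form, the toy features and the hand-set weights are assumptions for illustration.

```python
import numpy as np


def describe(features, attention, W, labels):
    """Average features under the attention map, then classify.

    features:  (H, W, D) image features
    attention: (H, W) non-negative weights, e.g. from a find module
    W:         (len(labels), D) classifier weights (learned in practice)
    """
    weights = attention / attention.sum()
    pooled = np.tensordot(weights, features, axes=([0, 1], [0, 1]))   # shape (D,)
    scores = W @ pooled
    exp = np.exp(scores - scores.max())
    probs = exp / exp.sum()
    return labels[int(probs.argmax())], probs


labels = ["red", "green", "blue"]
features = np.array([[[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]],
                     [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]])   # toy RGB "features"
attention = np.array([[0.9, 0.05],
                      [0.05, 0.0]])      # focused on the top-left (red) cell
W = 5.0 * np.eye(3)                      # hand-set stand-in for learned weights
answer, probs = describe(features, attention, W, labels)
```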
92
What modules do we need?
Is there a red shape above a circle? Who is running in the grass? What cities are south of San Diego? What color is the triangle?
93
A module for predicates
Is there a red shape above a circle? Who is running in the grass? What cities are south of San Diego?
[find]
94
What color is the triangle?
Who is running in the grass? Is there a red shape above a circle?
A module for relations
What cities are south of San Diego? What color is the triangle?
[find] [relate]
95
Module inventory
Is there a red shape above a circle? Who is running in the grass? What cities are south of San Diego? What color is the triangle?
[find] [relate] [exists]
[describe]
[and]
96
Outline
yes Is there a red shape above a circle?
red exists
true
↦ ↦
above
↦
circle red above exists and
97
Learning
Is there a red shape above a circle? What color is the shape right of a circle?
circle red above exists and circle right_of color
yes blue
98
Learning
yes blue
Is there a red shape above a circle? What color is the shape right of a circle?
99
Parameter tying
circle circle
yes blue
Is there a red shape above a circle? What color is the shape right of a circle?
100
Parameter tying
circle circle
yes blue
Is there a red shape above a circle? What color is the shape right of a circle?
101
Extreme parameter tying
red
102
circle above exists and circle right_of color square right_of shape circle above red exists and
left_of
Learning with fixed layouts is easy!
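$\arg\max_{W} \sum \log p(\text{answer} \mid \text{layout}, \text{world};\, W)$, e.g. answer = yes for the first layout
(where every root module outputs a distribution over answers and W is the set of all module parameters)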
103
Maximum likelihood estimation
104
Outline
yes Is there a red shape above a circle?
red exists
true
↦ ↦
above
↦
circle red above exists and
105
Where do layouts come from?
Is there a red shape above a circle?
be red shape there any circle above a
[Reddy et al. 2016]
106
Is there a red shape above a circle?
be red shape circle above
Where do layouts come from?
107
Is there a red shape above a circle?
be
circle red above shape
Where do layouts come from?
108
Is there a red shape above a circle?
circle red above shape
Where do layouts come from?
109
Is there a red shape above a circle?
circle red above shape and
Where do layouts come from?
110
Experiments
name        type   coastal
Columbia    city   no
Cooper      river  yes
Charleston  city   yes
111
Experiments: VQA dataset
What is in the sheep’s ear? tag What color is the necktie? yellow
[Antol et al. 2015]
112
Experiments: VQA dataset
Bar chart, accuracy (%): Zhou (2015) 55.9, Noh (2015) 57.4, Yang (2015) 58.9, Ours 59.4
113
Experiments: SHAPES dataset
Bar chart (axis from 50 to 100): bar labels *Zhou, *Yang, Ours; bar values 65.3, 76.5, 90.6
115
Experiments: VQA Dataset
sheep ear and what and
What is in the sheep’s ear? tag
116
Neural module networks
yes Is there a red shape above a circle?
red exists
true
↦ ↦
above
↦
circle red above exists and
119
Neural module networks
Combines advantages of:
circle red above exists and
Linguistic structure dynamically generates model structure
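A sketch of how a layout can generate the model structure at run time: a nested tuple is turned recursively into a composed network. The modules here are toy set functions standing in for neural modules, and the tuple format and the name `assemble` are assumptions of this example.

```python
# Toy modules over sets of object ids (stand-ins for neural modules).
RED, CIRCLE = {"o1", "o3"}, {"o2", "o3"}
ABOVE = {("o1", "o2"), ("o1", "o3"), ("o2", "o3")}   # (upper, lower) pairs

MODULES = {
    "red":    lambda: RED,
    "circle": lambda: CIRCLE,
    "above":  lambda objs: {upper for (upper, lower) in ABOVE if lower in objs},
    "and":    lambda a, b: a & b,
    "exists": lambda objs: len(objs) > 0,
}


def assemble(layout):
    """Turn a nested-tuple layout into a zero-argument callable network."""
    name, *children = layout
    child_nets = [assemble(child) for child in children]
    return lambda: MODULES[name](*(net() for net in child_nets))


# Layout for "Is there a red shape above a circle?"
layout = ("exists", ("and", ("red",), ("above", ("circle",))))
network = assemble(layout)
answer = network()
```

A different question only changes the tuple; the module inventory stays the same.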
120
An emerging term for general models with these properties is differentiable programming. Deep nets are popular for a few reasons:
programming”
Differentiable programming
121
Deep learning Differentiable programming
Differentiable programming
122
Differentiable programming
123
124