Hideki Nakayama
The University of Tokyo, Grad School of IST
GTC technology conference 2017 @ San Jose, May 11th
Hideki Nakayama, Assistant Professor @ The University of Tokyo AI research center
Research topics: Computer Vision
Object discovery
Representation learning for vision
Wearable interface
Fine-grained recognition
Large-scale image tagging
Vision-based recommendation
Medical image analysis
Goal: to learn a function that transforms data in one modality into another
How: statistical estimation from a lot of paired examples
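As a toy illustration of estimating a function from paired examples, the sketch below fits a line to (x, y) pairs by least squares. The function name `fit_linear` and the data are hypothetical; the models in this talk are neural networks, not linear fits, so this only shows the general idea of recovering a mapping from pairs.

```python
def fit_linear(pairs):
    """Least-squares fit of y = a * x + b from a list of (x, y) pairs."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    # Slope is covariance(x, y) divided by variance(x).
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    a = cov_xy / var_x
    b = mean_y - a * mean_x
    return a, b


# Hypothetical paired examples generated by y = 2x + 1.
pairs = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]
a, b = fit_linear(pairs)
```

With more pairs the estimate becomes less sensitive to noise in any single example, which is the reason the slide asks for a lot of paired data.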
A brown dog in front of a door. A black and white cow standing in a field.
(e.g., recurrent neural network)
(e.g., convolutional neural network)
(convolutional neural network)
(e.g., recurrent neural network)
Multimodal Neural Language Models”, TACL, 2015.
Image recognition / captioning
Machine translation
Multimedia synthesis
私は学生です。 I am a student.
This is a dog.
Learn multimodal representation of X and Y from indirect data
Assumption: Z is a “common” modality (e.g., image, English text)
Typical approach: Japanese - English
Our approach (image pivot): Japanese - Image - English
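[Equations: source-side training set (superscript s) and target-side training set (superscript t), indexed by k = 1, ..., N^s and k = 1, ..., N^t]
[Figure: image encoder (CNN), source language encoder (RNN), target language encoder (RNN), and target language decoder (RNN) arranged around a shared multimodal space, with representations labeled v, s, and t]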
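The data flow between the four components named on the slide can be sketched as follows. This is a toy under stated assumptions, not the network from the talk: the class `PivotModel`, the lookup-table "encoders", the two-dimensional vectors, and the nearest-neighbor "decoder" are all hypothetical stand-ins for the CNN and RNNs. It also assumes that translation chains the source language encoder with the target language decoder through the shared space.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


class PivotModel:
    """Four components that share one vector space (hypothetical stand-ins)."""

    def __init__(self, encode_image, encode_source, encode_target, target_sentences):
        self.encode_image = encode_image      # image -> vector (a CNN on the slide)
        self.encode_source = encode_source    # source sentence -> vector (an RNN)
        self.encode_target = encode_target    # target sentence -> vector (an RNN)
        self.target_sentences = target_sentences

    def decode_target(self, vector):
        # Stand-in for the RNN decoder: return the closest known target sentence.
        return max(
            self.target_sentences,
            key=lambda t: cosine(vector, self.encode_target(t)),
        )

    def translate(self, source_sentence):
        # Source encoder followed by target decoder; no image is used here.
        return self.decode_target(self.encode_source(source_sentence))


# Hypothetical embeddings: axis 0 is "dog-like", axis 1 is "cow-like".
IMAGE_VECS = {"dog.jpg": [0.95, 0.05], "cow.jpg": [0.05, 0.95]}
SOURCE_VECS = {"小さな犬": [0.9, 0.1], "黒と白の牛": [0.1, 0.9]}
TARGET_VECS = {"a small dog": [1.0, 0.0], "a black and white cow": [0.0, 1.0]}

model = PivotModel(IMAGE_VECS.get, SOURCE_VECS.get, TARGET_VECS.get, list(TARGET_VECS))
translation = model.translate("黒と白の牛")
caption = model.decode_target(model.encode_image("dog.jpg"))
```

Because every encoder writes into the same space, the one decoder serves both captioning (image in) and translation (source sentence in).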
[Training example k, source side] 白い壁の隣に座っている小さな犬。 (A small dog sitting next to a white wall.)
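[Equation: loss over the N^s source-side training pairs, summed over k and over mismatched indices i ≠ k, comparing image representations v with source-language representations s; includes a hyperparameter] [Frome+, NIPS’13]
白い壁の隣に座っている小さな犬。 (A small dog sitting next to a white wall.)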
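The slide combines an i ≠ k condition and a hyperparameter with a citation to [Frome+, NIPS’13]. One loss of that kind is a pairwise hinge ranking loss with a margin, sketched below. This is an assumption about the family of loss, not a reconstruction of the slide's formula: the function names, the use of cosine similarity, and the margin value 0.1 are illustrative choices.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def ranking_loss(image_vecs, text_vecs, margin=0.1):
    """Sum over k and i != k of max(0, margin - sim(v_k, s_k) + sim(v_k, s_i)).

    image_vecs[k] and text_vecs[k] are assumed to describe the same example.
    """
    total = 0.0
    for k, v_k in enumerate(image_vecs):
        matched = cosine(v_k, text_vecs[k])
        for i, s_i in enumerate(text_vecs):
            if i == k:
                continue  # only mismatched pairs contribute
            total += max(0.0, margin - matched + cosine(v_k, s_i))
    return total


# Hypothetical 2-D embeddings for two image/caption pairs.
images = [[1.0, 0.0], [0.0, 1.0]]
aligned = ranking_loss(images, [[1.0, 0.0], [0.0, 1.0]])
swapped = ranking_loss(images, [[0.0, 1.0], [1.0, 0.0]])
```

The loss is zero once every matched pair is more similar than every mismatched pair by at least the margin, so training pushes captions toward their own image and away from the others.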
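[Equation: loss over the N^t target-side training pairs, summed over k and over mismatched indices i ≠ k, comparing image representations v with target-language representations t]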
A black dog sitting on grass next to a sidewalk.
[Equation: objective over the N^t target-side training pairs, k = 1, ..., N^t]
A black dog sitting on grass next to a sidewalk.
[Equation: objective over the N^t target-side training pairs, k = 1, ..., N^t]
A black dog sitting on grass next to a sidewalk. A black dog sitting on grass next to a sidewalk.
A black and white cow standing in a grassy field. 草地に立っている黒と白の牛。
IAPR-TC12 [Grubinger+, 2006]
Multi30K [Elliott+, 2016]
We randomly split data into our zero-shot setup and
a photo of a brown sandy beach; the dark blue sea with small breaking waves behind it; a dark green palm tree in the foreground on the left; a blue sky with clouds on the horizon in the background;
ein Photo eines braunen Sandstrands; das dunkelblaue Meer mit kleinen brechenden Wellen dahinter; eine dunkelgrüne Palme im Vordergrund links; ein blauer Himmel mit Wolken am Horizont im Hintergrund;
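The random split mentioned above could be sketched as follows. This assumes that a zero-shot setup means no image is seen with both languages, so the model never observes a direct source-target sentence pair. That reading, the function name `zero_shot_split`, the half-and-half ratio, and the tuple layout are all assumptions made for illustration.

```python
import random


def zero_shot_split(examples, seed=0):
    """Split (image, source_text, target_text) triples into two disjoint halves.

    The first half keeps only (image, source_text); the second half keeps only
    (image, target_text). No image appears on both sides.
    """
    rng = random.Random(seed)  # local generator so the split is reproducible
    shuffled = list(examples)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    source_side = [(img, src) for img, src, _ in shuffled[:half]]
    target_side = [(img, tgt) for img, _, tgt in shuffled[half:]]
    return source_side, target_side


# Hypothetical triples standing in for image / German / English descriptions.
examples = [("img%d" % i, "de%d" % i, "en%d" % i) for i in range(10)]
source_side, target_side = zero_shot_split(examples, seed=0)
```

Fixing the seed keeps the split identical across runs, which matters when comparing against a supervised baseline on the same data.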
Zero-shot results are comparable to supervised models using
(https://ja.wikipedia.org/wiki/航空会社)