SLIDE 1

Hideki Nakayama

The University of Tokyo, Grad School of IST

GTC (GPU Technology Conference) 2017 @ San Jose, May 11th

SLIDE 2

 Hideki Nakayama

  • Assistant Professor @The University of Tokyo
  • AI research center

 Research topics:

  • Computer Vision
  • Natural Language Processing
  • Deep Learning
SLIDE 3

  • Object discovery
  • Representation learning for vision
  • Wearable interface
  • Fine-grained recognition
  • Large-scale image tagging
  • Vision-based recommendation
  • Medical image analysis

SLIDE 4

  • Word representation learning
  • Flexible attention mechanism
  • Automatic question generation

SLIDE 5

 Multimodal deep models

  • Image/video caption generation (example generated caption: “a cat is trying to eat the food”)
  • Multimodal machine translation

SLIDE 6

 1. Background: cross-modal encoder-decoder learning with supervised data
 2. Proposed idea: pivot-based learning
 3. Zero-shot learning of machine translation system using image pivots

SLIDE 7

 Goal: to learn a function that transforms data in one modality $\mathbf{x}$ (source) into another modality $\mathbf{y}$ (target)

 How: statistical estimation from a large number of paired examples $\{(\mathbf{x}_i, \mathbf{y}_i)\}_{i=1,\dots,N}$, yielding $f(\mathbf{x})$

 [Figure: example with $X$ = images and $Y$ = labels (cat / dog / bird); $f(\mathbf{x})$ outputs scores over the labels, e.g., (.99, .01, .01)]

SLIDE 8

 Derive a hidden multimodal representation (vector) that aligns the coupled source and target data

 [Figure: Multimodal Space linking $X$ (images, via an image encoder, e.g., convolutional neural network) and $Y$ (text, via a text encoder/decoder, e.g., recurrent neural network); example captions: “A brown dog in front of a door.”, “A black and white cow standing in a field.”]

SLIDE 9

 Prediction can be realized by encoding an input into the multimodal space and then decoding it

 [Figure: input image $\mathbf{x}$ passes through the image encoder (convolutional neural network) into the Multimodal Space, then through the text decoder (recurrent neural network) to produce $\hat{\mathbf{y}}$, e.g., “A black dog sitting on grass.”]

SLIDE 10

 [Kiros et al., 2014]

  • R. Kiros et al., “Unifying Visual-Semantic Embeddings with Multimodal Neural Language Models”, TACL, 2015.

SLIDE 11

 [Kiros et al., 2014]

SLIDE 12

 As long as we have enough parallel data, we can now build many attractive applications

 [Figure: Multimodal Space linking X and Y]

  • Image recognition / captioning (e.g., “This is a dog.”)
  • Machine translation (e.g., 私は学生です。 → “I am a student.”)
  • Multimedia synthesis

SLIDE 13

 Supervised parallel data (X, Y) is not always available in real situations!

 Annotating data is very expensive…

  • 1M parallel sentences (machine translation)
  • 15M images in 10K categories (object recognition)
  • etc.

 What can we do when NO direct parallel data is available?

SLIDE 14

 Semi-supervised learning
  • [Figure: labeled pairs (X, Y) supplemented with unlabeled X and unlabeled Y]

 Transfer learning
  • [Figure: labeled pairs (X′, Y′) from another domain, transferred to the target domain (X, Y)]

SLIDE 15

 Learn multimodal representation of X and Y from indirect data (X, Z) and (Z, Y), where Z is the “pivot” (third modality)

 Assumption: Z is a “common” modality (e.g., image, English text), and therefore (X, Z) and (Z, Y) are relatively easy to obtain

 [Figure: X and Y connect to the Multimodal Space only through the pivot Z, via the paired data (X, Z) and (Z, Y)]

SLIDE 16

 1. Background: cross-modal encoder-decoder learning with supervised data
 2. Proposed idea: pivot-based learning
 3. Zero-shot learning of machine translation system using image pivots

[1] R. Funaki and H. Nakayama, “Image-mediated Learning for Zero-shot Cross-lingual Document Retrieval”, In Proc. of EMNLP, 2015.
[2] H. Nakayama and N. Nishida, “Toward Zero-resource Machine Translation by Multimodal Embedding with Multimedia Pivot”, Machine Translation Journal, 2017. (in press)

SLIDE 17

 Typical approach: Japanese ↔ English parallel documents (X, Y)
  • Parallel documents are hard to obtain…

 Our approach (image pivot): Japanese ↔ Image ↔ English (X, Z, Y)
  • We can find abundant monolingual documents with images!
  • E.g., blog, SNS, web news

 Training data: $\mathcal{T}_s = \{(\mathbf{x}_k, \mathbf{z}_k^s)\}_{k=1}^{N_s}$ (source text–image pairs), $\mathcal{T}_t = \{(\mathbf{z}_k^t, \mathbf{y}_k)\}_{k=1}^{N_t}$ (image–target text pairs)

SLIDE 18

 [Architecture figure: image encoder $E_v$ (CNN), source language encoder $E_s$ (RNN), target language encoder $E_t$ (RNN), and target language decoder $D_t$ (RNN), all connected through a shared multimodal space]

  • Multimodal embedding using image pivots
  • Puts the target language decoder on top of the multimodal space
  • End-to-end learning with neural networks (deep learning)

 Training data: $\mathcal{T}_s = \{(\mathbf{x}_k, \mathbf{z}_k^s)\}_{k=1}^{N_s}$, $\mathcal{T}_t = \{(\mathbf{z}_k^t, \mathbf{y}_k)\}_{k=1}^{N_t}$

SLIDE 19

 [Architecture figure as in Slide 18]

  • Align source language texts and images in the multimodal space

 Example pair $(\mathbf{x}_k, \mathbf{z}_k^s)$: an image and its Japanese caption 白い壁の隣に座っている小さな犬。 (“A small dog sitting next to a white wall.”)

SLIDE 20

 [Architecture figure as in Slide 18]

  • Align source language texts and images in the multimodal space

 Pair-wise rank loss [Frome+, NIPS’13] over $\mathcal{T}_s = \{(\mathbf{x}_k, \mathbf{z}_k^s)\}_{k=1}^{N_s}$:

 $\mathcal{L}_s = \sum_{k}\sum_{i} \max\left\{0,\ \alpha - s\big(E_v(\mathbf{z}_k^s), E_s(\mathbf{x}_k)\big) + s\big(E_v(\mathbf{z}_k^s), E_s(\mathbf{x}_i)\big)\right\}$

 where $\alpha$ is the margin (hyperparameter), $s(\cdot,\cdot)$ is a similarity score function, $\mathbf{x}_k$ is the paired text, and $\mathbf{x}_i$ is a negative (not paired) text. Example image–caption pair: 白い壁の隣に座っている小さな犬。 (“A small dog sitting next to a white wall.”)
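The pair-wise rank loss can be sketched in a few lines of numpy. This is an illustrative re-implementation, not the authors' code: the embeddings are assumed to be precomputed and L2-normalized, so the similarity $s(\cdot,\cdot)$ reduces to a dot product.

```python
import numpy as np

# `img[k]` and `txt[k]` stand for the embeddings E_v(z_k^s) and E_s(x_k);
# rows k are paired, every other row i != k acts as a negative.
def pairwise_rank_loss(img, txt, margin=0.1):
    sim = img @ txt.T                  # sim[k, i] = s(E_v(z_k), E_s(x_i))
    pos = np.diag(sim)[:, None]        # similarity with the paired text x_k
    hinge = np.maximum(0.0, margin - pos + sim)
    np.fill_diagonal(hinge, 0.0)       # i = k is the positive pair, skip it
    return float(hinge.sum())

# Perfectly aligned, mutually orthogonal embeddings violate no margin:
e = np.eye(4, 8)                       # 4 orthonormal 8-dim embeddings
print(pairwise_rank_loss(e, e))        # → 0.0
```

The loss only pushes: once every paired image–text similarity beats every negative by the margin $\alpha$, the gradient vanishes.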

SLIDE 21

 [Architecture figure as in Slide 18]

  • Align target language texts and images in the multimodal space

 Pair-wise rank loss over $\mathcal{T}_t = \{(\mathbf{z}_k^t, \mathbf{y}_k)\}_{k=1}^{N_t}$:

 $\mathcal{L}_t = \sum_{k}\sum_{i} \max\left\{0,\ \alpha - s\big(E_v(\mathbf{z}_k^t), E_t(\mathbf{y}_k)\big) + s\big(E_v(\mathbf{z}_k^t), E_t(\mathbf{y}_i)\big)\right\}$

 Example image–caption pair $(\mathbf{z}_k^t, \mathbf{y}_k)$: “A black dog sitting on grass next to a sidewalk.”

SLIDE 22

 [Architecture figure as in Slide 18]

  • Feedforward the images in the target data and decode them into texts
  • Cross-entropy loss over $\mathcal{T}_t = \{(\mathbf{z}_k^t, \mathbf{y}_k)\}_{k=1}^{N_t}$

 Example: image $\mathbf{z}_k^t$ decoded into its paired caption $\mathbf{y}_k$ = “A black dog sitting on grass next to a sidewalk.”
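The cross-entropy decoding loss can be sketched as follows. The toy vocabulary and random logits are stand-ins for the RNN decoder's step-by-step outputs; only the loss computation itself is the point.

```python
import numpy as np

# Score the paired caption y_k token by token under the decoder softmax.
def cross_entropy(logits_per_step, target_ids):
    total = 0.0
    for logits, t in zip(logits_per_step, target_ids):
        p = np.exp(logits - logits.max())      # numerically stable softmax
        p /= p.sum()
        total -= np.log(p[t])                  # -log P(correct next token)
    return total / len(target_ids)

vocab = ["<s>", "a", "black", "dog", "</s>"]   # toy vocabulary
target = [1, 2, 3, 4]                          # "a black dog </s>"
rng = np.random.default_rng(0)
logits = rng.normal(size=(len(target), len(vocab)))
print(cross_entropy(logits, target))           # average NLL, > 0
```

A useful sanity check: with uniform (all-zero) logits the per-token loss is exactly $\log |V|$, here $\log 5$.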

SLIDE 23

 [Architecture figure as in Slide 18]

  • Reconstruction loss on texts in the target language (encode $\mathbf{y}_k$ with $E_t$, decode it back with $D_t$)
  • This can also improve decoder performance

 Example: “A black dog sitting on grass next to a sidewalk.” → (encode, then decode) → “A black dog sitting on grass next to a sidewalk.”
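Slides 20–23 define four losses in total; a sketch of how they might compose into one training objective is below. The weights are hypothetical, since the deck does not say how (or whether) the terms are weighted.

```python
# Illustrative composition of the four losses from slides 20-23.
def total_loss(L_src_align, L_tgt_align, L_decode, L_reconstruct,
               w_align=1.0, w_dec=1.0, w_rec=1.0):
    # L_src_align: rank loss aligning (x, z^s) pairs (slide 20)
    # L_tgt_align: rank loss aligning (z^t, y) pairs (slide 21)
    # L_decode:    cross-entropy decoding z^t into y (slide 22)
    # L_reconstruct: reconstruction of y through E_t and D_t (slide 23)
    return (w_align * (L_src_align + L_tgt_align)
            + w_dec * L_decode
            + w_rec * L_reconstruct)

print(total_loss(0.5, 0.4, 1.2, 0.9))  # → 3.0
```

All four terms share parameters through the multimodal space, which is what lets the source and target sides influence each other without any direct (X, Y) pairs.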

SLIDE 24

 [Architecture figure as in Slide 18]

 Just feedforward through $E_s$ and $D_t$: $\hat{\mathbf{y}}_q = D_t(E_s(\mathbf{x}_q))$

 We don’t need images in the testing phase!

 Example: input 草地に立っている黒と白の牛。 (Japanese: “A black and white cow standing in a grassy field.”) → output “A black and white cow standing in a grassy field.”

SLIDE 25

 IAPR-TC12 [Grubinger+, 2006]
  • 20,000 images with English/German captions

 Multi30K [Elliott+, 2016]
  • 30,000 images with English/German captions

 We randomly split the data into our zero-shot setup and perform German-to-English translation

 Example caption pair:
  • English: “a photo of a brown sandy beach; the dark blue sea with small breaking waves behind it; a dark green palm tree in the foreground on the left; a blue sky with clouds on the horizon in the background;”
  • German: “ein Photo eines braunen Sandstrands; das dunkelblaue Meer mit kleinen brechenden Wellen dahinter; eine dunkelgrüne Palme im Vordergrund links; ein blauer Himmel mit Wolken am Horizont im Hintergrund;”

SLIDE 26

SLIDE 27

 Evaluation metric: BLEU scores (larger is better)

 [Table: Ours (zero-shot learning) vs. supervised baselines (parallel corpus)]

 Zero-shot results are comparable to supervised models using parallel corpora roughly 20% as large as our monolingual ones.

SLIDE 28

SLIDE 29

 Cross-camera person identification
  • [Figure: cam 1, cam 2, cam 3 as modalities X, Z, Y]

 Recognizing other sensory data
  • [Figure: depth image and caption (“A black sofa in a room.”) as end modalities X and Y, linked through a pivot Z]

 All we need is the two losses $\mathcal{L}_{XZ}$ and $\mathcal{L}_{ZY}$! (the data itself stays encapsulated)

SLIDE 30

 [Figure: X – Z – Y pivot diagram]

SLIDE 31

SLIDE 32

 [Figure: modalities X and Y]

SLIDE 33

 [Figure: network of modalities X, Y, … connected by edges, each carrying a loss L]

SLIDE 34

 [Figure: network of modalities X, Y, … connected by edges, each carrying a loss L]

  • Routing “knowledge”
  • Edge-side loss computation
  • No need to open the data itself!
SLIDE 35

 Numerous new modalities, in different types of data and different environments (≒ airports)

 A “direct flight” (≒ supervised learning) for each pair is theoretically possible but practically infeasible
  • Annotation cost, privacy, or company-side issues

 The “hub airport” (pivot) plays the key role!

 [Figure: world airline routes (https://ja.wikipedia.org/wiki/航空会社)]

SLIDE 36

 Pivot-based learning
  • Train a target function $f: \mathbf{x} \rightarrow \mathbf{y}$ from only indirect data (X, Z) and (Z, Y) when no direct parallel data (X, Y) is available
  • Joint multimodal embedding + encoder-decoder model

 Example applications
  • Zero-resource (i.e., no direct parallel corpora) machine translation using images as the pivot
  • Person identification in multi-camera networks

 Future directions
  • Routing “knowledge” in modality networks

SLIDE 37

SLIDE 38

 Visual encoder: VGG-16 network [Simonyan+, 2015]
  • Powerful pre-trained network for object recognition

 Language encoders/decoders: RNNs with long short-term memory (LSTM)
  • 512-dim word embeddings and 1024-dim hidden units

 Solver: Adam optimizer [Kingma+, 2014]
  • Minibatch size 32
  • Early stopping when the validation loss no longer improves
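The optimization recipe above (minibatches of 32, early stopping on validation loss) can be sketched as a loop. The model here is a toy scalar fit with plain SGD rather than Adam, and everything except the loop structure is illustrative.

```python
import random

random.seed(0)
w, lr, batch_size = 5.0, 0.1, 32               # toy parameter, step size
best, patience, bad_epochs = float("inf"), 3, 0

for epoch in range(100):
    for _ in range(10):                        # minibatches per epoch
        batch = [random.gauss(2.0, 0.5) for _ in range(batch_size)]
        grad = sum(2 * (w - x) for x in batch) / batch_size
        w -= lr * grad                         # (Adam would adapt this step)
    val_loss = (w - 2.0) ** 2                  # held-out objective
    if val_loss < best - 1e-6:                 # still improving
        best, bad_epochs = val_loss, 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:             # stop early
            break

print(round(w, 2))                             # converges near 2.0
```

The patience counter is the part the slide is about: training halts a few epochs after the validation loss stops improving, rather than running a fixed schedule.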