  1. GTC (GPU Technology Conference) 2017 @ San Jose, May 11th. Hideki Nakayama, The University of Tokyo, Graduate School of IST

  2.  Hideki Nakayama ◦ Assistant Professor @ The University of Tokyo ◦ AI Research Center. Research topics: ◦ Computer Vision ◦ Natural Language Processing ◦ Deep Learning

  3. Large-scale image tagging, fine-grained recognition, wearable interface, medical image analysis, representation learning, object discovery for vision, vision-based recommendation

  4. Automatic question generation, word representation learning, flexible attention mechanism

  5. Multimodal deep models: image/video caption generation (e.g., "a cat is trying to eat the food") and multimodal machine translation

  6.  1. Background: cross-modal encoder-decoder learning with supervised data  2. Proposed idea: pivot-based learning  3. Zero-shot learning of a machine translation system using image pivots

  7.  Goal: learn a function $f$ that transforms data in one modality $x$ (source) into another modality $y$ (target).  How: statistical estimation from many paired examples $\{(x_i, y_i)\}_{i=1,\dots,N}$. Example: an image classifier over the labels {cat, dog, bird} maps an input image $x$ to $f(x) = $ (cat: 0.99, dog: 0.01, bird: 0.01).

  8.  Derive a hidden multimodal representation (vector) that aligns the coupled source and target data: a text encoder/decoder (e.g., a recurrent neural network) and an image encoder (e.g., a convolutional neural network) map both modalities X and Y into a shared multimodal space. Example image-caption pairs: "A brown dog in front of a door.", "A black and white cow standing in a field."

  9.  Prediction can be realized by encoding an input into the multimodal space and then decoding it: the image encoder (convolutional neural network) maps an input image $x$ into the multimodal space, and the text decoder (e.g., a recurrent neural network) generates the caption $\hat{y}$, e.g., "A black dog sitting on grass."

  10. R. Kiros et al., "Unifying Visual-Semantic Embeddings with Multimodal Neural Language Models", TACL, 2015. [Kiros et al., 2014]

  11. (Figure from [Kiros et al., 2014].)

  12.  As long as we have enough parallel data, we can build many attractive applications around the multimodal space: image recognition/captioning, machine translation (e.g., 私は学生です。 → "I am a student."), and multimedia synthesis (e.g., "This is a dog.").

  13.  Supervised parallel data (X, Y) is not always available in real situations!  Annotating data is very expensive: ◦ 1M parallel sentences (machine translation) ◦ 15M images in 10K categories (object recognition) ◦ etc.  What can we do when NO direct parallel data is available?

  14.  Semi-supervised learning: labeled pairs (X, Y) plus additional unlabeled data $X_0$ and $Y_0$.  Transfer learning: labeled pairs (X′, Y′) from another domain in addition to the target domain (X, Y).

  15.  Learn a multimodal representation of X and Y from indirect data (X, Z) and (Z, Y), where Z is the "pivot" (a third modality).  Assumption: Z is a "common" modality (e.g., images, English text), so (X, Z) and (Z, Y) are relatively easy to obtain; both pairs are embedded into a shared multimodal space connecting X, Y, and Z.
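
To make the data setting concrete before the model details, here is a toy Python sketch of what the indirect training sets look like; the image file names and the particular sentence-image pairings below are hypothetical, chosen only to mirror the examples shown on later slides.

```python
# Toy illustration of pivot-based learning data (hypothetical examples; the
# image file names and pairings are made up for illustration only).
# No direct (X, Y) pairs exist; we only have (X, Z) and (Z, Y), with images as Z.
source_pivot_pairs = [
    # (x, z): source-language sentence paired with an image
    ("白い壁の隣に座っている小さな犬。", "img_0001.jpg"),
]
pivot_target_pairs = [
    # (z, y): an image paired with a target-language sentence
    ("img_0999.jpg", "A black dog sitting on grass next to a sidewalk."),
]
# Goal: learn x -> y (source -> target translation) although no (x, y) pair is
# ever observed, by embedding x, z, and y in a shared multimodal space.
```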

  16.  1. Background: cross-modal encoder-decoder learning with supervised data  2. Proposed idea: pivot-based learning  3. Zero-shot learning of a machine translation system. [1] R. Funaki and H. Nakayama, "Image-mediated Learning for Zero-shot Cross-lingual Document Retrieval", In Proc. of EMNLP, 2015. [2] H. Nakayama and N. Nishida, "Toward Zero-resource Machine Translation by Multimodal Embedding with Multimedia Pivot", Machine Translation Journal, 2017 (in press).

  17.  Typical approach: a parallel corpus (Japanese-English) is hard to obtain.  Our approach (image pivot): we can find abundant monolingual documents with images (e.g., blogs, SNS, web news). This gives a source-side set $T^s = \{(x_k^s, z_k^s)\}_{k=1}^{N^s}$ of Japanese sentences paired with images, and a target-side set $T^t = \{(z_k^t, y_k^t)\}_{k=1}^{N^t}$ of images paired with English sentences.

  18. ◦ Multimodal embedding using image pivots ◦ Put the target language decoder on top of the multimodal space ◦ End-to-end learning with neural networks (deep learning). Training data: $T^s = \{(x_k^s, z_k^s)\}_{k=1}^{N^s}$ and $T^t = \{(z_k^t, y_k^t)\}_{k=1}^{N^t}$. Components: a source language encoder RNN $E^s$, an image encoder CNN $E^v$, a target language encoder RNN $E^t$, and a target language decoder RNN $D^t$, all connected through the multimodal space (a sketch of these modules follows below).
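
Below is a minimal PyTorch sketch of these four modules, assuming GRU sentence encoders, a linear projection of pre-extracted CNN image features, and a 512-dimensional multimodal space; these choices are illustrative assumptions, not the configuration reported in the paper.

```python
# Illustrative PyTorch sketch of the four modules E^s, E^v, E^t, D^t.
# Dimensions, GRU cells, and the use of pre-extracted CNN features are
# assumptions, not the authors' exact configuration.
import torch
import torch.nn as nn

EMB = 512  # size of the shared multimodal space

class SentenceEncoder(nn.Module):
    """E^s / E^t: encodes a sentence (token ids) into the multimodal space."""
    def __init__(self, vocab_size, hidden=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, EMB)
        self.rnn = nn.GRU(EMB, hidden, batch_first=True)
        self.proj = nn.Linear(hidden, EMB)

    def forward(self, token_ids):                   # token_ids: (B, T)
        _, h = self.rnn(self.embed(token_ids))      # h: (1, B, hidden)
        return torch.tanh(self.proj(h.squeeze(0)))  # (B, EMB)

class ImageEncoder(nn.Module):
    """E^v: projects pre-extracted CNN image features into the multimodal space."""
    def __init__(self, feat_dim=2048):
        super().__init__()
        self.proj = nn.Linear(feat_dim, EMB)

    def forward(self, feats):                       # feats: (B, feat_dim)
        return torch.tanh(self.proj(feats))         # (B, EMB)

class SentenceDecoder(nn.Module):
    """D^t: decodes a multimodal vector into a target-language sentence."""
    def __init__(self, vocab_size, hidden=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, EMB)
        self.rnn = nn.GRU(EMB, hidden, batch_first=True)
        self.init_h = nn.Linear(EMB, hidden)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, multimodal_vec, target_ids):  # teacher forcing
        h = self.init_h(multimodal_vec).unsqueeze(0)
        out, _ = self.rnn(self.embed(target_ids), h)
        return self.out(out)                        # logits: (B, T, vocab_size)

E_s = SentenceEncoder(vocab_size=30000)  # source language encoder RNN
E_v = ImageEncoder()                     # image encoder (CNN features -> space)
E_t = SentenceEncoder(vocab_size=20000)  # target language encoder RNN
D_t = SentenceDecoder(vocab_size=20000)  # target language decoder RNN
```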

  19. ◦ Align source language texts and images in the multimodal space: a source sentence $x_k^s$ (e.g., 白い壁の隣に座っている小さな犬。, "A small dog sitting next to a white wall.") is encoded by the source language encoder RNN $E^s$, and the paired image $z_k^s$ by the image encoder CNN $E^v$.

  20. ◦ The alignment of source language texts and images uses a pair-wise rank loss [Frome+, NIPS'13] over $T^s = \{(x_k^s, z_k^s)\}_{k=1}^{N^s}$:
$L^s = \sum_{k} \sum_{i \neq k} \max\{0,\ \alpha - s(E^v(z_k^s), E^s(x_k^s)) + s(E^v(z_k^s), E^s(x_i^s))\}$
where $s(\cdot,\cdot)$ is a similarity score function, $\alpha$ is a margin (hyperparameter), $x_k^s$ is the text paired with the image $z_k^s$, and $x_i^s$ ($i \neq k$) is a negative (not paired) text.
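
A short sketch of this rank loss in PyTorch, assuming cosine similarity for $s(\cdot,\cdot)$ and using all other items in a mini-batch as negatives; both choices are assumptions, since the slide leaves the similarity function and the negative sampling unspecified.

```python
# Sketch of the pair-wise rank loss on slide 20 (batched negatives and cosine
# similarity are illustrative assumptions).
import torch
import torch.nn.functional as F

def pairwise_rank_loss(img_emb, txt_emb, margin=0.2):
    """L = sum_k sum_{i != k} max(0, margin - s(z_k, x_k) + s(z_k, x_i)).

    img_emb, txt_emb: (N, d) embeddings in the multimodal space; row k of each
    tensor is a paired image/text example.
    """
    img_emb = F.normalize(img_emb, dim=1)
    txt_emb = F.normalize(txt_emb, dim=1)
    sim = img_emb @ txt_emb.t()                 # s(E^v(z_k), E(x_i)) for all k, i
    pos = sim.diag().unsqueeze(1)               # s(E^v(z_k), E(x_k)), shape (N, 1)
    hinge = (margin - pos + sim).clamp(min=0)   # max(0, alpha - pos + neg)
    off_diag = ~torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    return hinge[off_diag].sum()                # exclude the i == k terms

# Toy usage with 4 random image/text pairs in a 512-d multimodal space:
loss = pairwise_rank_loss(torch.randn(4, 512), torch.randn(4, 512))
```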

  21. ◦ Align target language texts and images in the multimodal space in the same way: a target sentence $y_k^t$ from $T^t = \{(z_k^t, y_k^t)\}_{k=1}^{N^t}$ (e.g., "A black dog sitting on grass next to a sidewalk.") is encoded by the target language encoder RNN $E^t$, the paired image $z_k^t$ by $E^v$, and they are aligned with the rank loss
$L^t = \sum_{k} \sum_{i \neq k} \max\{0,\ \alpha - s(E^v(z_k^t), E^t(y_k^t)) + s(E^v(z_k^t), E^t(y_i^t))\}$.

  22. ◦ Feed images in the target data forward and decode them into texts: an image $z_k^t$ is encoded by $E^v$ and decoded by the target language decoder RNN $D^t$ into the paired sentence $y_k^t$ (e.g., "A black dog sitting on grass next to a sidewalk.") ◦ Cross-entropy loss.

  23. ◦ Reconstruction loss of texts in the target language: a target sentence $y_k^t$ is encoded by the target language encoder RNN $E^t$ and decoded back by $D^t$ into the same sentence ◦ This can also improve decoder performance.
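
Putting slides 20-23 together, a single training step would combine the two rank losses with the two decoder cross-entropy terms. The sketch below reuses the hypothetical E_s, E_v, E_t, D_t modules and the pairwise_rank_loss function from the earlier sketches; the unweighted sum of the four losses and the shift-by-one teacher-forcing convention are assumptions.

```python
# One hypothetical training step combining the four losses of slides 20-23.
# Reuses the module classes and pairwise_rank_loss sketched above; the
# unweighted sum of the losses is an illustrative assumption.
import torch.nn.functional as F

def training_step(E_s, E_v, E_t, D_t, src_batch, tgt_batch, margin=0.2):
    src_text, src_img = src_batch  # (x_k^s, z_k^s): source token ids + image features
    tgt_img, tgt_text = tgt_batch  # (z_k^t, y_k^t): image features + target token ids

    # Slide 20: align source texts and images in the multimodal space.
    loss_s = pairwise_rank_loss(E_v(src_img), E_s(src_text), margin)
    # Slide 21: align target texts and images in the multimodal space.
    loss_t = pairwise_rank_loss(E_v(tgt_img), E_t(tgt_text), margin)

    # Slide 22: decode images in the target data into texts (cross-entropy).
    img_logits = D_t(E_v(tgt_img), tgt_text[:, :-1])
    loss_dec = F.cross_entropy(img_logits.flatten(0, 1), tgt_text[:, 1:].flatten())
    # Slide 23: reconstruct target texts through E^t and D^t (autoencoding).
    rec_logits = D_t(E_t(tgt_text), tgt_text[:, :-1])
    loss_rec = F.cross_entropy(rec_logits.flatten(0, 1), tgt_text[:, 1:].flatten())

    return loss_s + loss_t + loss_dec + loss_rec
```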

  24.  At test time, just feed a source sentence $x_q$ forward through $E^s$ and $D^t$: $\hat{y}_q = D^t(E^s(x_q))$.  We don't need images in the testing phase! Example: 草地に立っている黒と白の牛。 → "A black and white cow standing in a grassy field."
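
A greedy-decoding sketch of this zero-shot test phase, again reusing the hypothetical E_s and D_t modules from the earlier sketch; the BOS/EOS token ids and the greedy (rather than beam) search are assumptions.

```python
# Zero-shot translation at test time (slide 24): a source sentence goes through
# E^s and D^t only -- no image is needed. Greedy decoding and the BOS/EOS ids
# are illustrative assumptions; E_s and D_t are the modules sketched above.
import torch

@torch.no_grad()
def translate(E_s, D_t, src_ids, bos_id=1, eos_id=2, max_len=30):
    vec = E_s(src_ids)                             # (1, EMB) multimodal vector
    h = D_t.init_h(vec).unsqueeze(0)               # initial decoder state
    token = torch.tensor([[bos_id]])
    output = []
    for _ in range(max_len):
        step, h = D_t.rnn(D_t.embed(token), h)     # one decoding step
        token = D_t.out(step[:, -1]).argmax(dim=-1, keepdim=True)
        if token.item() == eos_id:
            break
        output.append(token.item())
    return output                                  # target-language token ids
```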

  25.  IAPR-TC12 [Grubinger+, 2006] ◦ 20,000 images with English/German captions, e.g., "a photo of a brown sandy beach; the dark blue sea with small breaking waves behind it; a dark green palm tree in the foreground on the left; a blue sky with clouds on the horizon in the background;" (German: "ein Photo eines braunen Sandstrands; das dunkelblaue Meer mit kleinen brechenden Wellen dahinter; eine dunkelgrüne Palme im Vordergrund links; ein blauer Himmel mit Wolken am Horizont im Hintergrund;")  Multi30K [Elliott+, 2016] ◦ 30,000 images with English/German captions  We randomly split the data into our zero-shot setup and perform German-to-English translation.

  27.  Evaluation metric: BLEU scores (larger is better), comparing ours (zero-shot learning) against supervised baselines (parallel corpus).  Zero-shot results are comparable to supervised models using parallel corpora roughly 20% as large as our monolingual ones.

  29. Other possible applications:  Cross-camera person identification: cameras (cam 1, cam 2, cam 3) are linked through a pivot camera Z, so all we need is the two losses $L_{XZ}$ and $L_{ZY}$ (the data itself is still encapsulated).  Recognizing other sensory data: e.g., depth data X aligned to captions Y ("A black sofa in a room.") through images as the pivot Z.

  34. • Routing "knowledge" • Edge-side loss computation • No need to open the data itself! (Figure: many data sources connected to X and Y through losses L.)

  35.  Numerous new modalities arise in different types of data and different environments (≒ airports).  A "direct flight" (≒ supervised learning) for each pair is theoretically possible but practically infeasible ◦ annotation cost, privacy, or company-side issues.  The "hub airport" (pivot) plays the key role! (Figure: world airline routes, https://ja.wikipedia.org/wiki/航空会社)
