  1. Recognize, Describe, and Generate: Introduction of Recent Work at MIL The University of Tokyo, NVAIL Partner Yoshitaka Ushiku

  2. MIL: Machine Intelligence Laboratory. Beyond Human Intelligence Based on Cyber-Physical Systems. Varying research topics: ICCV, CVPR, ECCV, ICML, NIPS, ICASSP, SIGdial, ACM Multimedia, ICME, ICRA, IROS, etc. Members: one Professor (Prof. Harada), one Lecturer (me), one Assistant Professor, one Postdoc, two Office Administrators, 11 Ph.D. students, 23 Master students, 8 Bachelor students, and 5 interns. The most important thing: we are hiring!

  3. Journalist Robot • Born in 2006 • Objective: publishing news automatically – Recognize • Objects, people, actions – Describe • What is happening – Generate • Contents as humans do

  4. Outline • Journalist Robot: the ancestor of current work at MIL; our research originates with this robot – Recognize • Basics: Framework for DL, Domain Adaptation • Classification: Single modality, Multiple modalities – Describe • Image Captioning • Video Captioning – Generate • Image Reconstruction • Video Generation

  5. Recognize

  6. MILJS: JavaScript × Deep Learning [Hidaka+, ICLR Workshop 2017]

  7. MILJS: JavaScript × Deep Learning [Hidaka+, ICLR Workshop 2017] • Supports both learning and inference • Supports nodes with GPGPUs – currently WebCL is utilized; WebGPU support is in progress • Supports nodes without GPGPUs • No software installation required – even ResNet with 152 layers can be trained. Let me show you a preliminary demonstration using MNIST!

  8. Asymmetric Tri-training for Domain Adaptation [Saito+, submitted to ICML 2017] • Unsupervised domain adaptation: trained on MNIST → does it work on SVHN? – Ground-truth labels are available for the source (MNIST) – However, there are no labels for the target (SVHN)

  9. Asymmetric Tri-training for Domain Adaptation [Saito+, submitted to ICML 2017] • Asymmetric Tri-training: pseudo labels for target domain

  10. Asymmetric Tri-training for Domain Adaptation [Saito+, submitted to ICML 2017] 1st round: train on MNIST → add pseudo labels for easy target samples (e.g., digits confidently recognized as "eight" or "nine"). 2nd round onward: train on MNIST plus the pseudo-labeled samples → add more pseudo labels.
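The loop on this slide can be summarized in a short sketch. Below is a minimal, hypothetical Python version using scikit-learn classifiers as stand-ins for the paper's three networks; the names F1, F2, Ft, the confidence threshold, and the use of different regularization to create the asymmetry are illustrative assumptions, not the paper's code (the paper enforces asymmetry with a weight constraint between the two labelers).

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def pseudo_label(F1, F2, X_target, threshold=0.95):
        """Keep target samples on which both labelers agree confidently."""
        p1, p2 = F1.predict_proba(X_target), F2.predict_proba(X_target)
        y1, y2 = p1.argmax(1), p2.argmax(1)
        keep = (y1 == y2) & (np.maximum(p1.max(1), p2.max(1)) > threshold)
        return X_target[keep], y1[keep]

    def asymmetric_tri_training(X_src, y_src, X_tgt, rounds=5):
        F1 = LogisticRegression(max_iter=1000)
        F2 = LogisticRegression(max_iter=1000, C=0.1)  # asymmetry via regularization (assumption)
        Ft = LogisticRegression(max_iter=1000)         # target-specific classifier
        X_pl = np.empty((0, X_src.shape[1])); y_pl = np.empty(0, dtype=int)
        for _ in range(rounds):
            # Round 1: labelers see the source only; later rounds add pseudo-labeled samples.
            X_train, y_train = np.vstack([X_src, X_pl]), np.concatenate([y_src, y_pl])
            F1.fit(X_train, y_train); F2.fit(X_train, y_train)
            X_pl, y_pl = pseudo_label(F1, F2, X_tgt)   # the "easy sample" pool grows each round
            if len(np.unique(y_pl)) > 1:
                Ft.fit(X_pl, y_pl)                     # trained on target-domain samples only
        return Ft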

  11. End-to-end learning for environmental sound classification [Tokozume+, ICASSP 2017] Existing methods for speech / sound recognition: ① feature extraction via the Fourier transform (log-mel features); ② classification by a CNN on the extracted feature map. Log-mel features are suitable for human speech, but what about environmental sounds?

  12. End-to-end learning for environmental sound classification [Tokozume+, ICASSP 2017] Proposed approach (EnvNet): a single CNN performs both ① feature-map extraction and ② classification, end to end. (Figure: the extracted "feature map".)
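A minimal PyTorch sketch of this idea follows: 1-D convolutions on the raw waveform replace the hand-crafted log-mel front end, and their output is treated as a 2-D "feature map" that a 2-D CNN classifies. All layer sizes here are illustrative assumptions, not the EnvNet architecture from the paper.

    import torch
    import torch.nn as nn

    class EnvNetSketch(nn.Module):
        def __init__(self, n_classes=50):          # ESC-50 has 50 classes
            super().__init__()
            # (1) learned front end: raw waveform -> multi-channel 1-D response
            self.frontend = nn.Sequential(
                nn.Conv1d(1, 40, kernel_size=8), nn.ReLU(),
                nn.Conv1d(40, 40, kernel_size=8), nn.ReLU(),
                nn.MaxPool1d(160),                  # pool over time
            )
            # (2) classifier: treat channels x time as a 2-D feature map
            self.classifier = nn.Sequential(
                nn.Conv2d(1, 16, kernel_size=3), nn.ReLU(),
                nn.AdaptiveAvgPool2d((4, 4)), nn.Flatten(),
                nn.Linear(16 * 4 * 4, n_classes),
            )

        def forward(self, wave):                    # wave: (batch, 1, samples)
            fmap = self.frontend(wave)              # (batch, 40, frames)
            fmap = fmap.unsqueeze(1)                # one "image" channel: (batch, 1, 40, frames)
            return self.classifier(fmap)

    logits = EnvNetSketch()(torch.randn(2, 1, 24000))  # ~1.5 s of 16 kHz audio
    print(logits.shape)                                # torch.Size([2, 50])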

  13. End-to-end learning for environmental sound classification [Tokozume+, ICASSP 2017] Comparison of accuracy [%] on ESC-50 [Piczak, ACM MM 2015]: log-mel feature + CNN [Piczak, MLSP 2015]: 64.5 | End-to-end CNN (Ours): 64.0 | End-to-end CNN & log-mel feature + CNN (Ours): 71.0. EnvNet can extract discriminative features for environmental sounds.

  14. Visual Question Answering (VQA) [Saito+, ICME 2017] Question answering system given • an associated image • a question in natural language. Q: Is it going to rain soon? Ground truth A: yes. Q: Why is there snow on one side of the stream and clear grass on the other? Ground truth A: shade.

  15. Visual Question Answering (VQA) [Saito+, ICME 2017] VQA = multi-class classification: the image is encoded into an image feature v, the question (e.g., "What objects are found on the bed?") into a question feature q, and the two are integrated into a vector x_int, which is classified into an answer (e.g., "bed sheets, pillow"). After integrating into x_int, it is the usual classification.

  16. Visual Question Answering [Saito+, ICME 2017] Current advancement: improving how to integrate v and q into x_int • Concatenation: x_int = [v; q], e.g., [Antol+, ICCV 2015] • Summation: x_int = v + q (image feature with attention + question feature), e.g., [Xu+Saenko, ECCV 2016] • Multiplication: x_int = v ⊙ q, e.g., bilinear multiplication [Fukui+, EMNLP 2016] • This work: DualNet, which combines sum, multiplication, and concatenation (see the sketch below).
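Since the slide names the three operations DualNet combines, a minimal sketch is easy to write. The fusion below is an illustrative reading of "sum, multiply and concatenation" in PyTorch; the dimensions and the classifier head are assumptions, not the paper's configuration.

    import torch
    import torch.nn as nn

    class DualNetFusion(nn.Module):
        def __init__(self, dim=1024, n_answers=3000):
            super().__init__()
            # sum (dim) + product (dim) + concatenation (2*dim) = 4*dim fused vector
            self.head = nn.Sequential(
                nn.Linear(4 * dim, dim), nn.ReLU(),
                nn.Linear(dim, n_answers),          # VQA as multi-class classification
            )

        def forward(self, v, q):                    # v, q: (batch, dim)
            fused = torch.cat([v + q, v * q, v, q], dim=1)
            return self.head(fused)

    v, q = torch.randn(8, 1024), torch.randn(8, 1024)
    print(DualNetFusion()(v, q).shape)              # torch.Size([8, 3000])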

  17. Visual Question Answering (VQA) [Saito+, ICME 2017] VQA Challenge 2016 (at CVPR 2016): won 1st place on abstract images without an attention mechanism. Q: What fruit is yellow and brown? A: banana. Q: How many screens are there? A: 2. Q: What is the boy playing with? A: teddy bear. Q: Are there any animals swimming in the pond? A: no.

  18. Describe

  19. Automatic Image Captioning [Ushiku+, ACM MM 2011]

  20. Training dataset of captioned images: "A small white dog wearing a flannel warmer." / "A white van parked in an empty lot." / "A small gray dog standing on a leash." / "A white cat rests its head on a stone." / "A black dog standing in a grassy area." / "White and gray kitten lying on its side." / "Silver car parked on side of road." / "A woman posing on a red scooter." Nearest captions retrieved for the input image: "A small white dog wearing a flannel warmer.", "A small gray dog on a leash.", "A black dog standing in a grassy area."
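A caption-transfer baseline like the one this slide illustrates can be sketched in a few lines: embed the input image, find the nearest training images, and return their captions. The cosine-similarity retrieval below is a generic stand-in; the actual features and ranking in [Ushiku+, ACM MM 2011] differ.

    import numpy as np

    def nearest_captions(query_feat, train_feats, train_captions, k=3):
        """Return the captions attached to the k nearest training images."""
        # cosine similarity between the query and every training image
        sims = train_feats @ query_feat / (
            np.linalg.norm(train_feats, axis=1) * np.linalg.norm(query_feat) + 1e-8)
        top = np.argsort(-sims)[:k]
        return [train_captions[i] for i in top]

    # toy usage with random vectors standing in for real image descriptors
    rng = np.random.default_rng(0)
    feats = rng.normal(size=(4, 128))
    caps = ["A small white dog wearing a flannel warmer.",
            "A small gray dog on a leash.",
            "A black dog standing in a grassy area.",
            "A white van parked in an empty lot."]
    print(nearest_captions(rng.normal(size=128), feats, caps, k=2))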

  21. Automatic Image Captioning [ACM MM 2012, ICCV 2015] Group of people sitting at a table with a dinner. Tourists are standing on the middle of a flat desert.

  22. Image Captioning + Sentiment Terms [Andrew+, BMVC 2016] Generated captions: "A confused man in a blue shirt is sitting on a bench." / "A man in a blue shirt and blue jeans is standing in the dirty overlooked water." / "A zebra standing in a field with a tree in the background."

  23. Image Captioning + Sentiment Terms [Andrew+, BMVC 2016] Two steps for adding a sentiment term 1. Usual image captioning using CNN+RNN The most probable noun is memorized

  24. Image Captioning + Sentiment Terms [Andrew+, BMVC 2016] Two steps for adding a sentiment term: 1. Usual image captioning using CNN+RNN; 2. The model is then forced to predict the sentiment term just before the memorized noun (see the sketch below).
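Step 2 can be illustrated with a toy sketch: given the base caption and the memorized noun, force a predicted sentiment adjective in directly before that noun. The function below is purely illustrative; in the paper the sentiment term is predicted by the RNN during decoding rather than inserted as post-processing.

    def insert_sentiment(caption, target_noun, sentiment_word):
        """Re-emit a caption with a sentiment word forced before the target noun."""
        out = []
        for w in caption.split():
            # force the sentiment term immediately before the memorized noun
            if w.rstrip(".,") == target_noun:
                out.append(sentiment_word)
            out.append(w)
        return " ".join(out)

    base = "A man in a blue shirt is sitting on a bench."
    print(insert_sentiment(base, "man", "confused"))
    # -> "A confused man in a blue shirt is sitting on a bench."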

  25. Beyond Caption to Narrative [Andrew+, ICIP 2016] A man is holding a box of doughnuts. Then he and a woman are standing next each other. Then she is holding a plate of food.

  26. Beyond Caption to Narrative [Andrew+, ICIP 2016] Narrative = per-segment captions in sequence: "A man is holding a box of doughnuts." / "he and a woman are standing next each other." / "she is holding a plate of food."

  27. Beyond Caption to Narrative [Andrew+, ICIP 2016] A boat is floating on the water near a mountain. And a man riding a wave on top of a surfboard. Then he on the surfboard in the water.
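The narratives above read as per-segment captions joined by temporal connectives, which the following toy sketch mimics. The stitching logic, connective list, and lower-casing rule are assumptions for illustration; the actual ICIP 2016 model generates the narrative jointly rather than by post-hoc concatenation.

    def stitch_narrative(segment_captions):
        """Join per-segment captions into one narrative with connectives."""
        parts = []
        for i, cap in enumerate(segment_captions):
            if i == 0:
                parts.append(cap)                       # first sentence: no connective
            else:
                parts.append("Then " + cap[0].lower() + cap[1:])
        return " ".join(parts)

    caps = ["A man is holding a box of doughnuts.",
            "He and a woman are standing next to each other.",
            "She is holding a plate of food."]
    print(stitch_narrative(caps))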

  28. Generate

  29. Image Reconstruction [Kato+, CVPR 2014] Traditional pipeline for image classification: ① extract local descriptors d_1, ..., d_N from images; ② collect the descriptors; ③ calculate a global feature from the descriptor distribution p(d; θ); ④ classify the feature into a label such as "Camera" or "Cat".

  30. Image Reconstruction [Kato+, CVPR 2014] This work addresses the inverse problem: image reconstruction from a label (e.g., "Pot"), i.e., running the classification pipeline above backwards.

  31. Image Reconstruction [Kato+, CVPR 2014] For the label "Pot", the arrangement of local patches is optimized using: global location cost + adjacency cost. Other examples: cat (Bombay), camera, grand piano, gramophone, headphone, pyramid, joshua tree, wheelchair.
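As a rough illustration of the two costs named on this slide, the sketch below scores a candidate patch arrangement: a global location term keeps each patch near where that descriptor typically occurs in the class, and an adjacency term makes spatially close patches look consistent. Both terms, the neighborhood radius, and the weight lam are assumptions; the paper's actual objective differs in detail.

    import numpy as np

    def arrangement_cost(positions, prior_positions, patches, lam=1.0):
        """positions: (N, 2) proposed patch coordinates;
        prior_positions: (N, 2) typical coordinates of each descriptor;
        patches: (N, D) patch appearance vectors."""
        # global location cost: patches should stay near their typical locations
        location = np.sum((positions - prior_positions) ** 2)
        # adjacency cost: neighboring patches should have similar appearance
        adjacency = 0.0
        for i in range(len(positions)):
            for j in range(i + 1, len(positions)):
                if np.linalg.norm(positions[i] - positions[j]) < 1.5:  # grid neighbors
                    adjacency += np.sum((patches[i] - patches[j]) ** 2)
        return location + lam * adjacency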

  32. Video Generation [Yamamoto+, ACMMM 2016] • Image generation is still challenging; it is only successful in controlled settings: human faces (BEGAN [Berthelot+, Mar. 2017]), birds and flowers (StackGAN [Zhang+, Dec. 2016]) • Video generation additionally requires temporal consistency and is extremely challenging [Vondrick+, NIPS 2016]

  33. Video Generation [Yamamoto+, ACMMM 2016] • This work: generating simple videos – C3D (a 3D convolutional neural network) for conditional generation given an input label – tempCAE (a temporal convolutional auto-encoder) for regularizing the video to improve its naturalness (a sketch of the generator side follows below)
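A minimal PyTorch sketch of the generator side: a label embedding is decoded into a short clip with 3-D transposed convolutions, so frames are produced jointly rather than one by one. All sizes are illustrative assumptions, and the tempCAE regularizer is omitted; this is not the paper's C3D configuration.

    import torch
    import torch.nn as nn

    class LabelToVideo(nn.Module):
        def __init__(self, n_labels=10, z_dim=64):
            super().__init__()
            self.embed = nn.Embedding(n_labels, z_dim)
            self.decode = nn.Sequential(           # (z_dim, 1, 1, 1) -> (1, T, H, W)
                nn.ConvTranspose3d(z_dim, 64, kernel_size=4), nn.ReLU(),                    # 4x4x4
                nn.ConvTranspose3d(64, 32, kernel_size=4, stride=2, padding=1), nn.ReLU(),  # 8x8x8
                nn.ConvTranspose3d(32, 1, kernel_size=4, stride=2, padding=1), nn.Tanh(),   # 16x16x16
            )

        def forward(self, label):                  # label: (batch,)
            z = self.embed(label)[:, :, None, None, None]  # (batch, z_dim, 1, 1, 1)
            return self.decode(z)

    clip = LabelToVideo()(torch.tensor([3]))
    print(clip.shape)   # torch.Size([1, 1, 16, 16, 16]): 16 frames of 16x16 video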

  34. Video Generation [Yamamoto+, ACMMM 2016] Generated examples: "Car runs to left" and "Rocket flies up", each compared between Ours (C3D+tempCAE) and only C3D.

  35. Conclusion • MIL: Machine Intelligence Laboratory, Beyond Human Intelligence Based on Cyber-Physical Systems • This talk introduced some of our current research – Recognize • Basics: Framework for DL, Domain Adaptation • Classification: Single modality, Multiple modalities – Describe • Image Captioning, Video Captioning – Generate • Image Reconstruction, Video Generation
