

  1. Deep Learning in Computer Vision Yikang Li MMLab, The Chinese University of Hong Kong Sep 22nd, 2017 @Microsoft Research Asia, China

  2. Outline 1. Introduction 2. Roadmap of Deep Learning 3. DL in CV: Object detection 4. DL in CV: Image Captioning 5. DL in CV: Visual Question Answering 6. DL in CV: Visual Relations

  3. Introduction - DL in the press [1] http://news.stanford.edu/press-releases/2017/04/03/deep-learning-aldrug-development/ [2] https://www.cnbc.com/2017/09/08/a-i-can-detect-the-sexual-orientation-of-a-person-based-on-one-photo-research-shows.html

  4. Introduction - DL in the press CVPR 2017 (2600+ submissions, 4200+ registrants, 120+ sponsors) http://cvpr2017.thecvf.com/

  5. Introduction - Investment in AI http://business.financialpost.com/technology/federal-and-ontario-governments-invest-up-to-100-million-in-new-artificial-intelligence-vector-institute/wcm/ceb9218f-cbaf-4968-a6a6-cceff5ec3754

  6. Renowned Researchers/Groups - Trevor Darrell, BAIR, UC Berkeley - Recognition, detection - Yangqing Jia (Caffe), Jeff Donahue (DeepMind), Ross Girshick (Fast R-CNN) - Fei-Fei Li, Stanford University - ImageNet, emerging topics - Jia Li (Snapchat, Google), Jia Deng (UMich), Andrej Karpathy (Tesla, OpenAI) - Antonio Torralba, CSAIL, MIT - Scene understanding, multimodality-based computer vision - Facebook Artificial Intelligence Research (FAIR) - DeepMind, Google Brain, Google Research - Microsoft Research

  7. Outline 1. Introduction 2. Roadmap of Deep Learning 3. DL in CV: Object detection 4. DL in CV: Image Captioning 5. DL in CV: Visual Question Answering 6. DL in CV: Visual Relations

  8. Roadmap of Deep Learning - Depth

  9. Basic Block - Convolution Convolution operation: f(x) = Wx + b, where the output f is called a feature map and W is the shared weight (the kernel/filter of the layer); its weights are shared across spatial locations. Convolution is applied in a sliding-window fashion, which saves parameters and gives translation invariance, a property that is very important for vision tasks. A deep convolutional network is essentially a stack of convolutional layers. Rule of thumb: deeper usually means better.
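As a rough illustration of the stacked-convolution idea above (not from the original slides), here is a minimal PyTorch sketch; the layer sizes are arbitrary assumptions:

```python
# Illustrative only: a small stack of convolutional layers in PyTorch.
# Each nn.Conv2d slides a shared kernel W over the input and adds a bias b,
# i.e. f(x) = Wx + b at every spatial location.
import torch
import torch.nn as nn

conv_stack = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=3, padding=1),   # shared 3x3 kernel, 64 feature maps
    nn.ReLU(inplace=True),
    nn.Conv2d(64, 128, kernel_size=3, padding=1),
    nn.ReLU(inplace=True),
    nn.MaxPool2d(2),                              # downsample spatially
)

x = torch.randn(1, 3, 224, 224)                   # dummy RGB image batch
features = conv_stack(x)                          # -> (1, 128, 112, 112) feature maps
print(features.shape)
```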

  10. Roadmap of Deep Learning - Network Structure (cont’d)

  11. Outline 1. Introduction 2. Roadmap of Deep Learning 3. DL in CV: Object detection 4. DL in CV: Image Captioning 5. DL in CV: Visual Question Answering 6. DL in CV: Visual Relations

  12. What is object detection? http://mp7.watson.ibm.com/ICCV2015/slides/iccv15_tutorial_training_rbg.pdf

  13. Why object detection? It is a fundamental task in vision: + Detection in general classes + Face detection, crowd analysis + Car/signal detection

  14. R-CNN -> Fast R-CNN -> Faster R-CNN

  15. R-CNN -> Fast R-CNN -> Faster R-CNN

  16. R-CNN -> Fast R-CNN -> Faster R-CNN

  17. R-CNN -> Fast R-CNN -> Faster R-CNN
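For orientation, a hedged usage sketch of the end point of this progression, running a pretrained Faster R-CNN from torchvision (illustrative only, not the pipeline shown on the slides):

```python
# Run a pretrained Faster R-CNN detector on a dummy image (illustrative usage).
import torch
import torchvision

model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True)
model.eval()

image = torch.rand(3, 480, 640)          # stand-in for a real RGB image scaled to [0, 1]
with torch.no_grad():
    outputs = model([image])             # list of dicts with boxes, labels, scores

print(outputs[0]["boxes"].shape, outputs[0]["scores"][:5])
```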

  18. Detection Results

  19. Back into the General Picture: Deep Learning for Computer Vision

  20. Outline 1. Introduction 2. Roadmap of Deep Learning 3. DL in CV: Object detection 4. DL in CV: Image Captioning 5. DL in CV: Visual Question Answering 6. DL in CV: Visual Relations

  21. Dataset: Image Captioning - Describe an image with a natural sentence
  - PASCAL SENTENCE DATASET: 1000 images & 5 sents/im; designed for image classification, object detection and segmentation; no filtering, complex scenes, scaling, view points of different objects.
  - FLICKR 8K: 8108 images & 5 sents/im; obtained from the Flickr website by the University of Illinois at Urbana-Champaign.
  - FLICKR 30K: extension of Flickr 8K.
  - MS COCO: largest caption dataset; includes captions & object annotations; 328,000 images & 5 sents/im.
  - Visual Genome: densely-annotated dataset; includes objects, scene graphs, region captions (grounded), Q&As (grounded), attributes; 108,077 images with full annotations; not very clean, needs a little pre-processing.
  Example captions for one image:
  - Two gentleman talking in front of propeller plane.
  - Two men are conversing next to a small airplane.
  - Two men talking in front of a plane.
  - Two men talking in front of a small plane.
  - Two men talk while standing next to a small passenger plane at an airport.
  Metrics: BLEU, METEOR, ROUGE, CIDEr, human-based measurement (see the BLEU sketch below).
  Exploring Image Captioning Datasets: http://sidgan.me/technical/2016/01/09/Exploring-Datasets
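As a side note on the metrics listed above, a minimal sketch of sentence-level BLEU using NLTK (COCO-style evaluation normally uses the coco-caption toolkit; the captions here are paraphrased placeholders):

```python
# Compute a smoothed sentence-level BLEU score for one candidate caption
# against multiple reference captions (illustrative only).
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

references = [
    "two men are conversing next to a small airplane".split(),
    "two men talking in front of a small plane".split(),
]
candidate = "two men talk beside a small plane".split()

score = sentence_bleu(references, candidate,
                      smoothing_function=SmoothingFunction().method1)
print(f"BLEU: {score:.3f}")
```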

  22. A simple Baseline: NeuralTalk A simple NeuralTalk Demo: https://github.com/karpathy/neuraltalk
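A compressed PyTorch sketch of the CNN-encoder / RNN-decoder idea behind NeuralTalk-style captioning (this is not the actual NeuralTalk code; the backbone, dimensions, and the image-as-first-token trick are illustrative assumptions):

```python
# Encode the image with a pretrained CNN, then let an LSTM generate the caption
# word by word, conditioned on the image feature fed in as the first input step.
import torch
import torch.nn as nn
import torchvision

class CaptionBaseline(nn.Module):
    def __init__(self, vocab_size, embed_dim=256, hidden_dim=512):
        super().__init__()
        cnn = torchvision.models.resnet18(pretrained=True)
        self.encoder = nn.Sequential(*list(cnn.children())[:-1])   # drop the classifier head
        self.img_proj = nn.Linear(512, embed_dim)
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, images, captions):
        feats = self.encoder(images).flatten(1)        # (B, 512) image features
        img_token = self.img_proj(feats).unsqueeze(1)  # image as the first "word"
        words = self.embed(captions)                   # (B, T, embed_dim)
        seq = torch.cat([img_token, words], dim=1)
        hidden, _ = self.lstm(seq)
        return self.out(hidden)                        # next-word logits per time step
```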

  23. Attention Mechanism: Show, Attend and Tell Show, Attend and Tell: Neural Image Caption Generation with Visual Attention: https://arxiv.org/abs/1502.03044
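A rough sketch of soft attention over k spatial feature vectors, in the spirit of Show, Attend and Tell (simplified; all dimensions and layer names are assumptions):

```python
# Score each of the k image locations against the decoder's hidden state,
# softmax the scores into attention weights, and take a weighted sum as context.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SoftAttention(nn.Module):
    def __init__(self, feat_dim=512, hidden_dim=512, attn_dim=256):
        super().__init__()
        self.feat_proj = nn.Linear(feat_dim, attn_dim)
        self.hidden_proj = nn.Linear(hidden_dim, attn_dim)
        self.score = nn.Linear(attn_dim, 1)

    def forward(self, feats, hidden):
        # feats: (B, k, feat_dim) spatial features; hidden: (B, hidden_dim) LSTM state
        e = self.score(torch.tanh(self.feat_proj(feats) +
                                  self.hidden_proj(hidden).unsqueeze(1)))  # (B, k, 1)
        alpha = F.softmax(e, dim=1)               # attention weights over k locations
        context = (alpha * feats).sum(dim=1)      # (B, feat_dim) attended context
        return context, alpha.squeeze(-1)
```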

  24. Modified Attention Mechanism: Know When to Look Adaptive Attention module: determines how to mix visual and linguistic information via a visual sentinel (a softmax over k feature-map vectors plus 1 linguistic vector). Knowing When to Look: Adaptive Attention via A Visual Sentinel for Image Captioning: https://arxiv.org/abs/1612.01887
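A minimal sketch of the sentinel idea just described, where the softmax runs over k visual scores plus one sentinel score (names and shapes are assumptions, not the paper's code):

```python
# Mix attended visual features with a linguistic "sentinel" vector; the last
# softmax weight tells how much the model relies on language instead of the image.
import torch
import torch.nn.functional as F

def adaptive_attention(visual_scores, sentinel_score, feats, sentinel):
    # visual_scores: (B, k); sentinel_score: (B, 1)
    # feats: (B, k, d) image features; sentinel: (B, d) linguistic sentinel vector
    alpha = F.softmax(torch.cat([visual_scores, sentinel_score], dim=1), dim=1)  # (B, k+1)
    visual_part = (alpha[:, :-1].unsqueeze(-1) * feats).sum(dim=1)               # (B, d)
    context = visual_part + alpha[:, -1:] * sentinel                             # blend in sentinel
    return context, alpha[:, -1]
```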

  25. Concept-driven Image Captioning Semantic Compositional Networks for Visual Captioning: https://arxiv.org/abs/1611.08002

  26. Dense Captioning Localize and describe salient regions with natural sentences DenseCap: Fully Convolutional Localization Networks for Dense Captioning: http://cs.stanford.edu/people/karpathy/densecap/

  27. DenseCap: Fully Convolutional Localization Networks for Dense Captioning DenseCap: Fully Convolutional Localization Networks for Dense Captioning: http://cs.stanford.edu/people/karpathy/densecap/

  28. Outline 1. Introduction 2. Roadmap of Deep Learning 3. DL in CV: Object detection 4. DL in CV: Image Captioning 5. DL in CV: Visual Question Answering 6. DL in CV: Visual Relations

  29. Dataset: Visual Q&A - DAQUAR Answer an image-based question - first dataset and benchmark released for the VQA task - Images are from NYU Depth V2 dataset with semantic segmentations - 1449 images (795 training, 654 test), 12468 question (auto-generated & human-annotated) - COCO-QA - Automatically generated from image captions. - 123287 images, 78736 train questions, 38948 test questions - 4 types of questions: object, number, color, location - Answers are all one-word - VQA - Most widely-used VQA dataset - two parts: one contains images from COCO, the other contains abstract scenes - 204,721 COCO and 50,000 abstract images with ~5.4 questions/im - CLEVR - A Diagnostic Dataset for the reasoning ability of VQA models - rendered images and automatically-generated questions with functional programs and scene graphs - 100,000 images (70,000 train & 15,000 val & 15,000 test) with ~10 questions/im - Visual Genome Question : What color is the man's tie? - Densely-annotated dataset Answer : Brown - Includes objects, scene graphs, region captions (grounded), Q&As (grounded), attributes - 108,077 images with 1.7M grounded Q&A pairs - Not very clean, need a little pre-processing Survey of Visual Question Answering: Datasets and Techniques: https://arxiv.org/abs/1705.03865

  30. Simple Baseline Method Simple Baseline for Visual Question Answering: https://arxiv.org/abs/1512.02167
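A hedged sketch of the concatenate-and-classify idea behind this baseline: image CNN features joined with a bag-of-words question representation, then a linear answer classifier (illustrative PyTorch, not the released code):

```python
# Fuse image and question by simple concatenation and predict the answer class.
import torch
import torch.nn as nn

class VQABaseline(nn.Module):
    def __init__(self, vocab_size, num_answers, img_dim=2048, embed_dim=300):
        super().__init__()
        self.word_embed = nn.Embedding(vocab_size, embed_dim)
        self.classifier = nn.Linear(img_dim + embed_dim, num_answers)

    def forward(self, img_feats, question_tokens):
        # img_feats: (B, img_dim) from a pretrained CNN; question_tokens: (B, T) word ids
        q_feat = self.word_embed(question_tokens).mean(dim=1)   # bag-of-words average
        fused = torch.cat([img_feats, q_feat], dim=1)
        return self.classifier(fused)                           # scores over candidate answers
```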

  31. A Strong Baseline: Attention (1) Where To Look: Focus Regions for Visual Question Answering: https://arxiv.org/abs/1511.07394

  32. A Strong Baseline: Attention (2) Multiple glimpses Show, Ask, Attend, and Answer: A Strong Baseline For Visual Question Answering: https://arxiv.org/abs/1704.03162

  33. Co-Attention Mechanism for Image & Question Parallel Co-Attention Alternating Co-Attention Hierarchical Question-Image Co-Attention for Visual Question Answering: https://arxiv.org/abs/1606.00061

  34. Hierarchical Question Encoding Hierarchical Question Encoding Scheme Encoding for Answer prediction Hierarchical Question-Image Co-Attention for Visual Question Answering: https://arxiv.org/abs/1606.00061

  35. Multimodal Fusion: Bilinear interaction modeling MUTAN: Multimodal Tucker Fusion for Visual Question Answering: https://arxiv.org/abs/1705.06676
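A heavily simplified sketch of low-rank bilinear fusion between question and image features, in the spirit of (but much reduced from) MUTAN's Tucker decomposition; the rank and dimensions are illustrative assumptions:

```python
# Approximate a full bilinear interaction q^T W v with a rank constraint:
# project both modalities to a shared rank-sized space and multiply element-wise.
import torch
import torch.nn as nn

class LowRankBilinearFusion(nn.Module):
    def __init__(self, q_dim=2400, v_dim=2048, rank=512, num_answers=3000):
        super().__init__()
        self.q_proj = nn.Linear(q_dim, rank)
        self.v_proj = nn.Linear(v_dim, rank)
        self.out = nn.Linear(rank, num_answers)

    def forward(self, q_feat, v_feat):
        fused = torch.tanh(self.q_proj(q_feat)) * torch.tanh(self.v_proj(v_feat))
        return self.out(fused)       # answer scores from the fused representation
```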

  36. Duality of Question Answering and Question Generation Visual Question Generation as Dual Task of Visual Question Answering

  37. Duality of Question Answering and Question Generation: Dual MUTAN Visual Question Generation as Dual Task of Visual Question Answering

  38. Learning to Reason: Compositional Network End-to-End Training with policy gradient Learning to Reason: End-to-End Module Networks for Visual Question Answering: https://arxiv.org/abs/1704.05526
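A toy sketch of the REINFORCE-style policy-gradient loss used to train a non-differentiable module-layout choice end to end (illustrative only):

```python
# Policy gradient: increase the log-probability of sampled layouts in proportion
# to how much reward (e.g. answer accuracy) they earned above a baseline.
import torch

def reinforce_loss(log_probs, rewards, baseline=0.0):
    # log_probs: (B,) log-probability of the sampled module layout
    # rewards:   (B,) task reward obtained with that layout
    advantage = rewards - baseline
    return -(advantage.detach() * log_probs).mean()
```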

  39. Outline 1. Introduction 2. Roadmap of Deep Learning 3. DL in CV: Object detection 4. DL in CV: Image Captioning 5. DL in CV: Visual Question Answering 6. DL in CV: Visual Relations

  40. Visual Relations Describe the image with object nodes and their interactions Scene Graph Generation from Objects, Phrases and Region Captions: https://arxiv.org/abs/1707.09700

  41. Baseline: Visual Relationship Detection with Language Prior Using word2vec as extra information for predicate recognition Visual Relationship Detection with Language Priors: http://cs.stanford.edu/people/ranjaykrishna/vrd/
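A small NumPy sketch of modulating a visual predicate score with a language prior built from word embeddings, loosely following the idea above; the projection W, bias b, and the embedding vectors are hypothetical placeholders, not the paper's implementation:

```python
# Combine a visual predicate score with a language prior computed from the
# subject/object word vectors (e.g. word2vec embeddings).
import numpy as np

def relationship_score(visual_score, subj_vec, obj_vec, W, b):
    # subj_vec, obj_vec: word embeddings of the subject/object categories (e.g. 300-d)
    # W, b: a learned linear map from the concatenated embeddings to predicate scores
    lang_prior = W @ np.concatenate([subj_vec, obj_vec]) + b   # language-based scores
    return visual_score * np.maximum(lang_prior, 1e-6)         # modulate vision with the prior
```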
