

SLIDE 1

Deep Learning in Computer Vision

Yikang Li

MMLab, The Chinese University of Hong Kong
Sep 22nd, 2017 @ Microsoft Research Asia, China

SLIDE 2

Outline

1. Introduction
2. Roadmap of Deep Learning
3. DL in CV: Object detection
4. DL in CV: Image Captioning
5. DL in CV: Visual Question Answering
6. DL in CV: Visual Relations

SLIDE 3

Introduction - DL in the press

[1] http://news.stanford.edu/press-releases/2017/04/03/deep-learning-aldrug-development/
[2] https://www.cnbc.com/2017/09/08/a-i-can-detect-the-sexual-orientation-of-a-person-based-on-one-photo-research-shows.html
SLIDE 4

Introduction - DL in the press

CVPR 2017 (2600+ submissions, 4200+ registrants, 120+ sponsors) http://cvpr2017.thecvf.com/

SLIDE 5

Introduction - Investment in AI

http://business.financialpost.com/technology/federal-and-ontario-governments-invest-up-to-100-million-in-new-artificial-intelligence-vector-institute/wcm/ceb9218f-cbaf-4968-a6a6-cceff5ec3754

SLIDE 6

Renowned Researchers/Groups

• Trevor Darrell, BAIR, UC Berkeley
  - Recognition, detection
  - Yangqing Jia (Caffe), Jeff Donahue (DeepMind), Ross Girshick (Fast R-CNN)
• Fei-Fei Li, Stanford University
  - ImageNet, emerging topics
  - Jia Li (Snapchat, Google), Jia Deng (UMich), Andrej Karpathy (Tesla, OpenAI)
• Antonio Torralba, CSAIL, MIT
  - Scene understanding, multimodality-based computer vision
• Facebook Artificial Intelligence Research (FAIR)
• DeepMind, Google Brain, Google Research
• Microsoft Research
SLIDE 7

Outline

1. Introduction
2. Roadmap of Deep Learning
3. DL in CV: Object detection
4. DL in CV: Image Captioning
5. DL in CV: Visual Question Answering
6. DL in CV: Visual Relations

SLIDE 8

Roadmap of Deep Learning - Depth

SLIDE 9

Basic Block - Convolution

Convolution operation: f(x) = Wx + b, where the output f(x) is called a feature map and W is the shared weight (the kernel/filter of the layer). The same weights are applied at every spatial location: convolution is computed in a sliding-window fashion, which saves parameters and gives translation invariance, a property that is very important for vision tasks. A deep convolutional network is essentially a stack of such convolutional layers. Rule of thumb: deeper usually means better.

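To make the sliding-window computation concrete, here is a minimal NumPy sketch of a single-channel 2D convolution (strictly speaking cross-correlation, as implemented in most deep-learning frameworks); the function name and shapes are illustrative, not from the slides:

```python
import numpy as np

def conv2d(x, kernel, b=0.0):
    """Valid-mode, single-channel convolution: slide one shared kernel over x."""
    kh, kw = kernel.shape
    out_h = x.shape[0] - kh + 1
    out_w = x.shape[1] - kw + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # the same weights W are reused at every location (weight sharing)
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * kernel) + b
    return out  # one feature map: f(x) = Wx + b evaluated at each window
```

Stacking such layers, with nonlinearities in between, yields the deep networks on the roadmap.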

SLIDE 10

Roadmap of Deep Learning - Network Structure (cont’d)

SLIDE 11

Outline

1. Introduction
2. Roadmap of Deep Learning
3. DL in CV: Object detection
4. DL in CV: Image Captioning
5. DL in CV: Visual Question Answering
6. DL in CV: Visual Relations

SLIDE 12

What is object detection?

http://mp7.watson.ibm.com/ICCV2015/slides/iccv15_tutorial_training_rbg.pdf

SLIDE 13

Why object detection?

It is a fundamental task in vision:

+ Detection in general object classes
+ Face detection, crowd analysis
+ Car/signal detection

SLIDE 14

R-CNN -> Fast R-CNN -> Faster R-CNN

SLIDE 15

R-CNN -> Fast R-CNN -> Faster R-CNN

SLIDE 16

R-CNN -> Fast R-CNN -> Faster R-CNN

SLIDE 17

R-CNN -> Fast R-CNN -> Faster R-CNN

SLIDE 18

Detection Results

SLIDE 19

Back to the General Picture: Deep Learning for Computer Vision

SLIDE 20

Outline

1. Introduction
2. Roadmap of Deep Learning
3. DL in CV: Object detection
4. DL in CV: Image Captioning
5. DL in CV: Visual Question Answering
6. DL in CV: Visual Relations

SLIDE 21

Image Captioning

Describe an image with a natural sentence:

❏ Two gentlemen talking in front of a propeller plane.
❏ Two men are conversing next to a small airplane.
❏ Two men talking in front of a plane.
❏ Two men talking in front of a small plane.
❏ Two men talk while standing next to a small passenger plane at an airport.

Dataset:

• PASCAL Sentence Dataset
  - 1,000 images, 5 sentences/image
  - Designed for image classification, object detection, and segmentation
  - No filtering; complex scenes, varied scales and viewpoints of different objects
• Flickr 8K
  - 8,108 images, 5 sentences/image
  - Obtained from the Flickr website by the University of Illinois at Urbana-Champaign
• Flickr 30K
  - An extension of Flickr 8K
• MS COCO
  - Largest captioning dataset
  - Includes captions & object annotations
  - 328,000 images, 5 sentences/image
• Visual Genome
  - Densely annotated dataset
  - Includes objects, scene graphs, grounded region captions, grounded Q&As, and attributes
  - 108,077 images with full annotations
  - Not very clean; needs a little pre-processing

Metric:

• BLEU, METEOR, ROUGE, CIDEr, human-based measurement (a toy BLEU sketch follows below)

Exploring Image Captioning Datasets: http://sidgan.me/technical/2016/01/09/Exploring-Datasets
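To make the n-gram metrics concrete, here is a toy pure-Python sketch of BLEU (clipped n-gram precision with a brevity penalty). It is a simplification for illustration, not the official evaluation script; among other things, the reference-length convention is simplified:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, references, max_n=4):
    """Toy BLEU: geometric mean of clipped n-gram precisions, times a brevity penalty."""
    if not candidate:
        return 0.0
    precisions = []
    for n in range(1, max_n + 1):
        cand = ngrams(candidate, n)
        max_ref = Counter()  # clip each n-gram count by the max over all references
        for ref in references:
            for g, c in ngrams(ref, n).items():
                max_ref[g] = max(max_ref[g], c)
        clipped = sum(min(c, max_ref[g]) for g, c in cand.items())
        precisions.append(clipped / max(1, sum(cand.values())))
    if min(precisions) == 0:
        return 0.0
    ref_len = min(len(r) for r in references)  # simplified reference-length choice
    bp = min(1.0, math.exp(1 - ref_len / len(candidate)))  # penalize short captions
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

# e.g. bleu("two men talking".split(), ["two men are talking near a plane".split()])
```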

SLIDE 22

A simple Baseline: NeuralTalk

A simple NeuralTalk Demo: https://github.com/karpathy/neuraltalk
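The core pattern behind NeuralTalk is a CNN image encoder feeding an RNN decoder that emits the caption word by word. Below is a minimal PyTorch sketch of that encoder-decoder pattern (NeuralTalk itself is plain NumPy; all dimensions and names here are illustrative):

```python
import torch
import torch.nn as nn

class CaptionDecoder(nn.Module):
    def __init__(self, feat_dim=4096, embed_dim=256, hidden_dim=512, vocab_size=10000):
        super().__init__()
        self.img_proj = nn.Linear(feat_dim, embed_dim)  # CNN feature -> word space
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, img_feat, captions):
        # the image feature acts as the first "word"; training uses teacher forcing
        v = self.img_proj(img_feat).unsqueeze(1)    # (B, 1, E)
        w = self.embed(captions)                    # (B, T, E)
        h, _ = self.lstm(torch.cat([v, w], dim=1))  # (B, T+1, H)
        return self.out(h)                          # per-step logits over the vocabulary
```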

SLIDE 23

Attention Mechanism: Show, Attend and Tell

Show, Attend and Tell: Neural Image Caption Generation with Visual Attention: https://arxiv.org/abs/1502.03044
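The key step is a soft attention map over the K spatial feature vectors of a conv layer, recomputed at every decoding step. A sketch under assumed shapes and layer names (ours, not the paper's code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SoftAttention(nn.Module):
    def __init__(self, feat_dim, hid_dim, att_dim):
        super().__init__()
        self.W_f = nn.Linear(feat_dim, att_dim)  # score each spatial location
        self.W_h = nn.Linear(hid_dim, att_dim)   # conditioned on the decoder state
        self.w = nn.Linear(att_dim, 1)

    def forward(self, feats, h):
        # feats: (B, K, feat_dim) conv feature vectors; h: (B, hid_dim) LSTM state
        e = self.w(torch.tanh(self.W_f(feats) + self.W_h(h).unsqueeze(1)))  # (B, K, 1)
        alpha = F.softmax(e.squeeze(-1), dim=1)             # where to look
        context = (alpha.unsqueeze(-1) * feats).sum(dim=1)  # expected visual feature
        return context, alpha
```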

SLIDE 24

Modified Attention Mechanism: Know when to look

Knowing When to Look: Adaptive Attention via A Visual Sentinel for Image Captioning: https://arxiv.org/abs/1612.01887

Adaptive Attention module

Determines how to mix the visual and linguistic information with a visual sentinel (a softmax over the k feature-map vectors plus 1 linguistic sentinel vector).
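Concretely, the sentinel just adds one extra candidate to the attention softmax; its weight beta acts as a gate toward linguistic information. A simplified rendering of the paper's equations (function and variable names are ours):

```python
import torch
import torch.nn.functional as F

def adaptive_attention(z_visual, z_sentinel, feats, sentinel):
    # z_visual: (B, K) scores for the K visual vectors; z_sentinel: (B, 1) score
    # for the linguistic sentinel; the softmax runs over all K + 1 candidates
    alpha_hat = F.softmax(torch.cat([z_visual, z_sentinel], dim=1), dim=1)
    beta = alpha_hat[:, -1:]  # gate: how much to rely on language rather than vision
    # since alpha_hat_i = (1 - beta) * alpha_i, the mix beta*s + (1 - beta)*c
    # collapses into a single weighted sum over the K + 1 candidates:
    c_hat = (alpha_hat[:, :-1].unsqueeze(-1) * feats).sum(dim=1) + beta * sentinel
    return c_hat, beta
```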

SLIDE 25

Concept-driven Image Captioning

Semantic Compositional Networks for Visual Captioning: https://arxiv.org/abs/1611.08002

SLIDE 26

Dense Captioning

Localize and describe each salient region with a natural sentence

DenseCap: Fully Convolutional Localization Networks for Dense Captioning: http://cs.stanford.edu/people/karpathy/densecap/

SLIDE 27

DenseCap: Fully Convolutional Localization Networks for Dense Captioning

DenseCap: Fully Convolutional Localization Networks for Dense Captioning: http://cs.stanford.edu/people/karpathy/densecap/

SLIDE 28

Outline

1. Introduction
2. Roadmap of Deep Learning
3. DL in CV: Object detection
4. DL in CV: Image Captioning
5. DL in CV: Visual Question Answering
6. DL in CV: Visual Relations

SLIDE 29

Visual Q&A

Answer an image-based question.

Question: What color is the man's tie?
Answer: Brown

Dataset:

• DAQUAR
  - The first dataset and benchmark released for the VQA task
  - Images are from the NYU Depth V2 dataset with semantic segmentations
  - 1,449 images (795 training, 654 test), 12,468 questions (auto-generated & human-annotated)
• COCO-QA
  - Automatically generated from image captions
  - 123,287 images; 78,736 training questions, 38,948 test questions
  - 4 types of questions: object, number, color, location
  - Answers are all one word
• VQA
  - The most widely used VQA dataset
  - Two parts: one contains images from COCO, the other contains abstract scenes
  - 204,721 COCO images and 50,000 abstract images, with ~5.4 questions/image
• CLEVR
  - A diagnostic dataset for the reasoning ability of VQA models
  - Rendered images and automatically generated questions, with functional programs and scene graphs
  - 100,000 images (70,000 train, 15,000 val, 15,000 test), with ~10 questions/image
• Visual Genome
  - Densely annotated dataset
  - Includes objects, scene graphs, grounded region captions, grounded Q&As, and attributes
  - 108,077 images with 1.7M grounded Q&A pairs
  - Not very clean; needs a little pre-processing

Survey of Visual Question Answering: Datasets and Techniques: https://arxiv.org/abs/1705.03865

SLIDE 30

Simple Baseline Method

Simple Baseline for Visual Question Answering: https://arxiv.org/abs/1512.02167
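The baseline concatenates a bag-of-words question vector with a CNN image feature and applies a softmax classifier over candidate answers. A minimal PyTorch sketch of that recipe (dimensions are illustrative):

```python
import torch
import torch.nn as nn

class BowImgBaseline(nn.Module):
    """Bag-of-words question vector + image feature -> linear answer classifier."""
    def __init__(self, vocab_size, img_dim=1024, num_answers=3000):
        super().__init__()
        self.classifier = nn.Linear(vocab_size + img_dim, num_answers)

    def forward(self, bow_question, img_feat):
        # one linear layer over the concatenated features is the entire model
        return self.classifier(torch.cat([bow_question, img_feat], dim=1))
```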

SLIDE 31

A Strong Baseline: Attention (1)

Where To Look: Focus Regions for Visual Question Answering: https://arxiv.org/abs/1511.07394

SLIDE 32

A Strong Baseline: Attention (2)

Show, Ask, Attend, and Answer: A Strong Baseline For Visual Question Answering: https://arxiv.org/abs/1704.03162

Multiple glimpses: several attention maps are computed over the image features, and the attended vectors are concatenated.

SLIDE 33

Co-Attention Mechanism for Image & Question

Hierarchical Question-Image Co-Attention for Visual Question Answering: https://arxiv.org/abs/1606.00061

Two variants: Parallel Co-Attention and Alternating Co-Attention.
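One way to sketch the parallel variant: build a word-region affinity matrix and pool it along each axis to obtain the two attention distributions. The max-pooling form below is a simplification of the paper's formulation; shapes and names are ours:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ParallelCoAttention(nn.Module):
    def __init__(self, q_dim, v_dim):
        super().__init__()
        self.W = nn.Parameter(0.01 * torch.randn(q_dim, v_dim))  # affinity weights

    def forward(self, Q, V):
        # Q: (B, T, q_dim) question word features; V: (B, K, v_dim) image regions
        C = torch.tanh(Q @ self.W @ V.transpose(1, 2))  # (B, T, K) affinity matrix
        a_q = F.softmax(C.max(dim=2).values, dim=1)     # attention over words
        a_v = F.softmax(C.max(dim=1).values, dim=1)     # attention over regions
        q_att = (a_q.unsqueeze(-1) * Q).sum(dim=1)      # attended question vector
        v_att = (a_v.unsqueeze(-1) * V).sum(dim=1)      # attended image vector
        return q_att, v_att
```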

SLIDE 34

Hierarchical Question Encoding

Hierarchical Question-Image Co-Attention for Visual Question Answering: https://arxiv.org/abs/1606.00061

Hierarchical question encoding scheme; the resulting encodings are combined for answer prediction.

SLIDE 35

Multimodal Fusion: Bilinear interaction modeling

MUTAN: Multimodal Tucker Fusion for Visual Question Answering: https://arxiv.org/abs/1705.06676
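At its core, MUTAN approximates a full bilinear interaction between the question and image embeddings with a Tucker-factorized tensor whose core has constrained rank. A heavily simplified rank-R sketch of that bilinear core (omitting the input/output factor matrices; names are ours):

```python
import torch
import torch.nn as nn

class LowRankBilinearFusion(nn.Module):
    """Sum of R elementwise products approximating a full bilinear interaction."""
    def __init__(self, q_dim, v_dim, out_dim, rank=5):
        super().__init__()
        self.q_proj = nn.ModuleList([nn.Linear(q_dim, out_dim) for _ in range(rank)])
        self.v_proj = nn.ModuleList([nn.Linear(v_dim, out_dim) for _ in range(rank)])

    def forward(self, q, v):
        # each rank-1 term lets every question dimension interact with every
        # image dimension, at a fraction of the full tensor's parameter count
        return sum(wq(q) * wv(v) for wq, wv in zip(self.q_proj, self.v_proj))
```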

SLIDE 36

Duality of Question Answering and Question Generation

Visual Question Generation as Dual Task of Visual Question Answering

SLIDE 37

Duality of Question Answering and Question Generation: Dual MUTAN

Visual Question Generation as Dual Task of Visual Question Answering

SLIDE 38

Learning to Reason: Compositional Network

Learning to Reason: End-to-End Module Networks for Visual Question Answering: https://arxiv.org/abs/1704.05526

End-to-end training with policy gradient: the choice of module layout is discrete, so it is optimized with REINFORCE-style gradients.

SLIDE 39

Outline

1. Introduction
2. Roadmap of Deep Learning
3. DL in CV: Object detection
4. DL in CV: Image Captioning
5. DL in CV: Visual Question Answering
6. DL in CV: Visual Relations

SLIDE 40

Visual Relations

Describe the image with object nodes and their interactions.

Scene Graph Generation from Objects, Phrases and Region Captions: https://arxiv.org/abs/1707.09700

SLIDE 41

Baseline: Visual Relationship Detection with Language Prior

Using word2vec as extra information for predicate recognition

Visual Relationship Detection with Language Priors: http://cs.stanford.edu/people/ranjaykrishna/vrd/
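A sketch of the idea: a visual module scores predicates from the union-box appearance, while a language module scores them from the word2vec embeddings of the subject and object nouns; the two scores are combined. This is a simplification under assumed names, not the paper's exact model:

```python
import torch
import torch.nn as nn

class RelationshipScorer(nn.Module):
    def __init__(self, vis_dim, w2v_dim=300, num_predicates=70):
        super().__init__()
        self.visual = nn.Linear(vis_dim, num_predicates)        # appearance cue
        self.language = nn.Linear(2 * w2v_dim, num_predicates)  # word2vec prior

    def forward(self, union_feat, subj_vec, obj_vec):
        # the language prior favors plausible predicates (e.g. person-rides-horse)
        prior = self.language(torch.cat([subj_vec, obj_vec], dim=1))
        return self.visual(union_feat) + prior
```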

SLIDE 42

Jointly detect objects and relations

Leverage the dependencies among the objects and their relationships as extra constraints.

Triplet Proposal:

• Region proposal: an RPN generates object proposals
• Triplet proposal: group the object proposals into <subject-phrase-object> triplet proposals
• Triplet NMS: removes redundant triplet proposals (see the NMS sketch below)

Phrase Detection:

• Branch-based detection model
• ROI pooling helps different branches focus on different components
• A message-passing structure (VPRS) helps the branches share information and consider the three components as a whole

ViP-CNN: Visual Phrase Guided Convolutional Neural Network: http://cvboy.com/publication/cvpr2017_vip_cnn/
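Triplet NMS generalizes the standard greedy box NMS used throughout the detection pipelines above to grouped subject/object boxes. For reference, a minimal NumPy sketch of standard IoU-based NMS (illustrative, not the ViP-CNN implementation):

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy NMS: keep the top-scoring box, drop boxes overlapping it, repeat."""
    x1, y1, x2, y2 = boxes.T
    areas = (x2 - x1) * (y2 - y1)
    order = scores.argsort()[::-1]  # indices sorted by descending score
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        # IoU between the kept box and all remaining candidates
        xx1 = np.maximum(x1[i], x1[order[1:]])
        yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])
        inter = np.maximum(0, xx2 - xx1) * np.maximum(0, yy2 - yy1)
        iou = inter / (areas[i] + areas[order[1:]] - inter)
        order = order[1:][iou <= iou_thresh]  # suppress high-overlap boxes
    return keep
```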

SLIDE 43

Relations as an intermediate level between Objects and Region Captions

Leverage the dependencies among the objects and their relationships as extra constraints.

Scene Graph Generation from Objects, Phrases and Region Captions: http://cvboy.com/publication/iccv2017_msdn/

SLIDE 44

Emerging Topics: Human-Object Interaction

Detecting and Recognizing Human-Object Interactions: https://arxiv.org/abs/1704.07333

SLIDE 45

Q&A

allen.li.thu@gmail.com