Deep Learning in Computer Vision
Yikang Li
MMLab, The Chinese University of Hong Kong Sep 22nd, 2017 @Microsoft Research Asia, China
Outline
1. Introduction
2. Roadmap of Deep Learning
3. DL in CV: Object detection
4. DL in CV: Image Captioning
5. DL in CV: Visual Question Answering
6. DL in CV: Visual Relations
[1]http://news.stanford.edu/press-releases/2017/04/03/deep-learning-aldrug-development/ [2]https://www.cnbc.com/2017/09/08/a-i-can-detect-the-sexual-orientation-of-a-person-based-on-
CVPR 2017 (2600+ submissions, 4200+ registrants, 120+ sponsors) http://cvpr2017.thecvf.com/
http://business.financialpost.com/technology/federal-and-ontario-governments-invest-up-to-100-million-in-new-artificial-intelligence-vect
Roadmap of Deep Learning - Depth
Basic Block - Convolution
Convolution operation: f(x) = Wx + b, where the output f is called a feature map. W is the kernel (filter), and its weights are shared across spatial locations. Convolution is applied in a sliding-window fashion, which saves parameters and gives translation invariance, a property that is very important for vision tasks. A deep convolutional network is essentially a stack of such convolutional layers. Rule of thumb: deeper usually means better.
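The sliding-window operation above can be sketched in a few lines (a toy single-channel example with stride 1 and no padding; sizes are illustrative):

```python
import numpy as np

def conv2d(x, w, b):
    """Slide a single kernel w over a 2-D input x (valid padding, stride 1).

    The same weights w are reused at every location; this weight
    sharing is what makes convolution parameter-efficient.
    """
    kh, kw = w.shape
    oh, ow = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    out = np.empty((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(x[i:i+kh, j:j+kw] * w) + b
    return out

x = np.arange(16, dtype=float).reshape(4, 4)
w = np.ones((3, 3)) / 9.0          # a 3x3 averaging kernel
feature_map = conv2d(x, w, b=0.0)  # shape (2, 2)
```

Real frameworks vectorize this loop and stack many kernels per layer, but the arithmetic per output cell is exactly this weighted sum.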
Roadmap of Deep Learning - Network Structure (cont’d)
http://mp7.watson.ibm.com/ICCV2015/slides/iccv15_tutorial_training_rbg.pdf
+ Detection in general classes
+ Face detection, crowd analysis
+ Car/signal detection
RCNN -> Fast RCNN -> Faster RCNN
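All three R-CNN variants score many overlapping box proposals; a standard post-processing step they share is non-maximum suppression (NMS). A minimal sketch with toy boxes:

```python
import numpy as np

def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter)

def nms(boxes, scores, thresh=0.5):
    """Greedily keep the highest-scoring box, drop boxes that overlap it."""
    order = np.argsort(scores)[::-1]
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= thresh for j in keep):
            keep.append(int(i))
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
kept = nms(boxes, np.array([0.9, 0.8, 0.7]))  # the second box is suppressed
```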
Detection Results
Back into the General Picture: Deep Learning for Computer Vision
Describe an image with a natural sentence, e.g.:
❏ Two gentleman talking in front of propeller plane.
❏ Two men are conversing next to a small airplane.
❏ Two men talking in front of a plane.
❏ Two men talking in front of a small plane.
❏ Two men talk while standing next to a small passenger plane at an airport.
Datasets (image-caption pairs, some with attribute annotations) and evaluation metrics:
Exploring Image Captioning Datasets: http://sidgan.me/technical/2016/01/09/Exploring-Datasets
A simple Baseline: NeuralTalk
A simple NeuralTalk Demo: https://github.com/karpathy/neuraltalk
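NeuralTalk pairs a CNN image encoder with an RNN language model. A toy greedy-decoding sketch (random untrained weights; the image feature here simply initializes the hidden state, a simplifying assumption):

```python
import numpy as np

rng = np.random.default_rng(0)
V, H = 10, 8                          # toy vocabulary and hidden sizes
Wxh = rng.normal(size=(H, V)) * 0.1   # input-to-hidden (one-hot words)
Whh = rng.normal(size=(H, H)) * 0.1   # hidden-to-hidden
Why = rng.normal(size=(V, H)) * 0.1   # hidden-to-vocabulary
BOS, EOS = 0, 1

def caption(img_feat, max_len=5):
    """Greedy decoding: the image feature seeds the RNN state,
    then each step feeds back the previously emitted word."""
    h = np.tanh(img_feat)
    word, out = BOS, []
    for _ in range(max_len):
        x = np.eye(V)[word]
        h = np.tanh(Wxh @ x + Whh @ h)
        word = int(np.argmax(Why @ h))  # greedy argmax over vocabulary
        if word == EOS:
            break
        out.append(word)
    return out

tokens = caption(rng.normal(size=H))    # a list of word indices
```

The real model uses an LSTM and is trained with cross-entropy on ground-truth captions; the decoding loop is the part sketched here.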
Attention Mechanism: Show, Attend and Tell
Show, Attend and Tell: Neural Image Caption Generation with Visual Attention: https://arxiv.org/abs/1502.03044
Modified Attention Mechanism: Know when to look
Knowing When to Look: Adaptive Attention via A Visual Sentinel for Image Captioning: https://arxiv.org/abs/1612.01887
Adaptive Attention module
Determines how to mix visual and linguistic information via a visual sentinel (a softmax over the k feature-map vectors plus 1 linguistic sentinel vector).
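A simplified sketch of that mixing step (dot-product scores stand in for the paper's learned attention MLP; all features are random placeholders):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
d, k = 8, 4
feats = rng.normal(size=(k, d))    # k spatial feature vectors
sentinel = rng.normal(size=d)      # visual sentinel from the decoder LSTM
h = rng.normal(size=d)             # current decoder hidden state

# One score per spatial location plus one for the sentinel
scores = np.array([f @ h for f in feats] + [sentinel @ h])
alpha = softmax(scores)            # (k + 1,) mixing weights

# Context is mostly visual when the sentinel weight alpha[-1] is small,
# mostly linguistic when it is large
context = alpha[:k] @ feats + alpha[-1] * sentinel
```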
Semantic Compositional Networks for Visual Captioning: https://arxiv.org/abs/1611.08002
Localize and describe salient region with a natural sentence
DenseCap: Fully Convolutional Localization Networks for Dense Captioning: http://cs.stanford.edu/people/karpathy/densecap/
Answer an image-based question.
Question: What color is the man's tie?
Answer: Brown
Datasets: human-annotated question-answer pairs; synthetic data with programs and scene graphs; attribute annotations.
Survey of Visual Question Answering: Datasets and Techniques: https://arxiv.org/abs/1705.03865
Simple Baseline Method
Simple Baseline for Visual Question Answering: https://arxiv.org/abs/1512.02167
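The simple baseline concatenates a bag-of-words question vector with the image feature and feeds it to a softmax classifier over answers. A minimal sketch with a toy vocabulary and random weights:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = {'what': 0, 'color': 1, 'is': 2, 'the': 3, 'tie': 4}
d_img, n_ans = 16, 3               # toy image-feature size, 3 candidate answers

def features(question, img_feat):
    """Concatenate a bag-of-words question vector with the image feature."""
    bow = np.zeros(len(vocab))
    for w in question.split():
        if w in vocab:
            bow[vocab[w]] += 1
    return np.concatenate([bow, img_feat])

W = rng.normal(size=(n_ans, len(vocab) + d_img)) * 0.1  # linear classifier
x = features('what color is the tie', rng.normal(size=d_img))
scores = W @ x                     # one logit per candidate answer
```

Despite its simplicity, this kind of linear model over concatenated features is a surprisingly strong reference point for VQA.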
A Strong Baseline: Attention (1)
Where To Look: Focus Regions for Visual Question Answering: https://arxiv.org/abs/1511.07394
A Strong Baseline: Attention (2)
Show, Ask, Attend, and Answer: A Strong Baseline For Visual Question Answering: https://arxiv.org/abs/1704.03162
Multiple glimpses
Co-Attention Mechanism for Image & Question
Hierarchical Question-Image Co-Attention for Visual Question Answering: https://arxiv.org/abs/1606.00061
Parallel Co-Attention
Alternating Co-Attention
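A sketch of parallel co-attention: an affinity matrix links every question word to every image region, and each modality attends to the other. Max-pooling the affinity, used below, is a simplification of the paper's learned attention maps; all features are random placeholders:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
d, T, N = 8, 4, 6                  # feature dim, question words, image regions
Q = rng.normal(size=(d, T))        # word features
V = rng.normal(size=(d, N))        # region features
Wb = rng.normal(size=(d, d)) * 0.1

# Affinity between every word and every region
C = np.tanh(Q.T @ Wb @ V)          # shape (T, N)

# Attend to regions given the words, and to words given the regions
a_v = softmax(C.max(axis=0))       # (N,) image attention
a_q = softmax(C.max(axis=1))       # (T,) question attention
v_hat, q_hat = V @ a_v, Q @ a_q    # attended summaries of each modality
```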
Hierarchical Question Encoding
Hierarchical Question Encoding Scheme
Encoding for Answer Prediction
Multimodal Fusion: Bilinear interaction modeling
MUTAN: Multimodal Tucker Fusion for Visual Question Answering: https://arxiv.org/abs/1705.06676
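Tucker fusion projects both modalities and contracts them with a small core tensor, so every question-image feature pair can interact. A toy sketch with random weights and tiny dimensions (MUTAN additionally factorizes the core tensor into low-rank slices, which is omitted here):

```python
import numpy as np

rng = np.random.default_rng(0)
dq, dv, tq, tv, to = 12, 16, 5, 5, 7      # toy dimensions
Wq = rng.normal(size=(tq, dq)) * 0.1      # question projection
Wv = rng.normal(size=(tv, dv)) * 0.1      # image projection
Tc = rng.normal(size=(tq, tv, to)) * 0.1  # core tensor
Wo = rng.normal(size=(3, to)) * 0.1       # classifier over 3 toy answers

def fuse(q, v):
    """Tucker-style bilinear fusion: project both modalities, then
    contract with the core tensor to model pairwise interactions."""
    qt, vt = Wq @ q, Wv @ v
    z = np.einsum('i,j,ijo->o', qt, vt, Tc)
    return Wo @ z                         # answer logits

scores = fuse(rng.normal(size=dq), rng.normal(size=dv))
```

The point of the decomposition is budget control: a full bilinear interaction of size dq x dv x n_ans would be far too large, while the projected core keeps the parameter count small.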
Duality of Question Answering and Question Generation
Visual Question Generation as Dual Task of Visual Question Answering
Duality of Question Answering and Question Generation: Dual MUTAN
Learning to Reason: Compositional Network
Learning to Reason: End-to-End Module Networks for Visual Question Answering: https://arxiv.org/abs/1704.05526
End-to-End Training with policy gradient
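Because the discrete layout choice is non-differentiable, the layout policy is trained with a policy gradient (REINFORCE). A toy sketch for a softmax policy over three candidate layouts, with a hypothetical reward that favours layout 2:

```python
import numpy as np

rng = np.random.default_rng(0)
theta = np.zeros(3)                      # logits over 3 candidate layouts

def reward(action):
    return 1.0 if action == 2 else 0.0   # toy: layout 2 answers correctly

for _ in range(200):
    p = np.exp(theta - theta.max())
    p /= p.sum()                         # softmax policy
    a = rng.choice(3, p=p)               # sample a layout
    # REINFORCE: grad of log pi(a) is one_hot(a) - p for a softmax policy
    grad = (np.eye(3)[a] - p) * reward(a)
    theta += 0.5 * grad                  # ascend the expected reward

best = int(np.argmax(theta))             # the policy learns to pick layout 2
```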
Describe the image with object nodes and their interactions.
Scene Graph Generation from Objects, Phrases and Region Captions: https://arxiv.org/abs/1707.09700
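A scene graph can be represented simply as object nodes plus directed, labelled relationship edges; a minimal sketch with hypothetical triples:

```python
# A scene graph: objects as nodes, relationships as directed labelled edges
# stored as (subject, predicate, object) triples.
objects = ['man', 'tie', 'plane']
relations = [('man', 'wears', 'tie'), ('man', 'stands near', 'plane')]

def neighbours(graph, node):
    """All (predicate, object) pairs whose subject is `node`."""
    return [(p, o) for s, p, o in graph if s == node]
```

Generating such graphs means jointly predicting the object nodes (detection) and the predicate on each edge (relationship recognition).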
Baseline: Visual Relationship Detection with Language Prior
Using word2vec as extra information for predicate recognition
Visual Relationship Detection with Language Priors: http://cs.stanford.edu/people/ranjaykrishna/vrd/
Jointly detect objects and relations
Leverage the dependencies within the objects and their relationships as extra constraints
Triplet Proposal: generate <subject-phrase-object> triplet proposals.
Phrase Detection: exchange information among the three branches and consider the three components as a whole.
ViP-CNN: Visual Phrase Guided Convolutional Neural Network: http://cvboy.com/publication/cvpr2017_vip_cnn/
Relations as an intermediate level of Objects and Region Captions
Scene Graph Generation from Objects, Phrases and Region Captions: http://cvboy.com/publication/iccv2017_msdn/
Emerging Topics: Human-Object Interaction
Detecting and Recognizing Human-Object Interactions: https://arxiv.org/abs/1704.07333
allen.li.thu@gmail.com