

SLIDE 1

Deep Learning in Computer Vision

Yikang Li

MMLab, The Chinese University of Hong Kong
Sep 22nd, 2017 @ Microsoft Research Asia, China

SLIDE 2

Outline

1. Introduction
2. Roadmap of Deep Learning
3. DL in CV: Object detection
4. DL in CV: Image Captioning
5. DL in CV: Visual Question Answering
6. DL in CV: Visual Relations

SLIDE 3

Introduction - DL in the press

[1] http://news.stanford.edu/press-releases/2017/04/03/deep-learning-aldrug-development/
[2] https://www.cnbc.com/2017/09/08/a-i-can-detect-the-sexual-orientation-of-a-person-based-on-one-photo-research-shows.html
SLIDE 4

Introduction - DL in the press

CVPR 2017 (2600+ submissions, 4200+ registrants, 120+ sponsors) http://cvpr2017.thecvf.com/

SLIDE 5

Introduction - Investment in AI

http://business.financialpost.com/technology/federal-and-ontario-governments-invest-up-to-100-million-in-new-artificial-intelligence-vector-institute/wcm/ceb9218f-cbaf-4968-a6a6-cceff5ec3754

SLIDE 6

Renowned Researchers/Groups

• Trevor Darrell, BAIR, UC Berkeley
  - Recognition, detection
  - Yangqing Jia (Caffe), Jeff Donahue (DeepMind), Ross Girshick (Fast R-CNN)
• Fei-Fei Li, Stanford University
  - ImageNet, emerging topics
  - Jia Li (Snapchat, Google), Jia Deng (UMich), Andrej Karpathy (Tesla, OpenAI)
• Antonio Torralba, CSAIL, MIT
  - Scene understanding, multimodality-based computer vision
• Facebook Artificial Intelligence Research (FAIR)
• DeepMind, Google Brain, Google Research
• Microsoft Research
SLIDE 7

Outline

1. Introduction
2. Roadmap of Deep Learning
3. DL in CV: Object detection
4. DL in CV: Image Captioning
5. DL in CV: Visual Question Answering
6. DL in CV: Visual Relations

SLIDE 8

Roadmap of Deep Learning - Depth

SLIDE 9

Basic Block - Convolution

Convolution operation: f(x) = Wx + b, where the output f(x) is called a feature map and W is the shared weight (the kernel/filter of the layer). The same weights are applied at every spatial location: convolution is computed in a sliding-window fashion, which saves parameters and gives translation invariance, a property that is very important for vision tasks. A deep convolutional network is essentially a stack of such convolutional layers. Rule of thumb: deeper usually means better.

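To make the sliding-window computation concrete, here is a minimal NumPy sketch of a single-channel 2D convolution (strictly speaking cross-correlation, as implemented in most deep-learning frameworks); the function name and shapes are illustrative, not from the slides:

```python
import numpy as np

def conv2d(x, kernel, b=0.0):
    """Valid-mode, single-channel convolution: slide one shared kernel over x."""
    kh, kw = kernel.shape
    out_h = x.shape[0] - kh + 1
    out_w = x.shape[1] - kw + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # the same weights W are reused at every location (weight sharing)
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * kernel) + b
    return out  # one feature map: f(x) = Wx + b evaluated at each window
```

Stacking such layers, with nonlinearities in between, yields the deep networks on the roadmap.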

SLIDE 10

Roadmap of Deep Learning - Network Structure (cont’d)

SLIDE 11

Outline

1. Introduction
2. Roadmap of Deep Learning
3. DL in CV: Object detection
4. DL in CV: Image Captioning
5. DL in CV: Visual Question Answering
6. DL in CV: Visual Relations

SLIDE 12

What is object detection?

http://mp7.watson.ibm.com/ICCV2015/slides/iccv15_tutorial_training_rbg.pdf

SLIDE 13

Why object detection?

It is a fundamental task in vision:

+ Detection in general object classes
+ Face detection, crowd analysis
+ Car/signal detection

SLIDE 14

R-CNN -> Fast R-CNN -> Faster R-CNN

SLIDE 15

R-CNN -> Fast R-CNN -> Faster R-CNN

SLIDE 16

R-CNN -> Fast R-CNN -> Faster R-CNN

SLIDE 17

R-CNN -> Fast R-CNN -> Faster R-CNN

SLIDE 18

Detection Results

SLIDE 19

Back to the General Picture: Deep Learning for Computer Vision

SLIDE 20

Outline

1. Introduction
2. Roadmap of Deep Learning
3. DL in CV: Object detection
4. DL in CV: Image Captioning
5. DL in CV: Visual Question Answering
6. DL in CV: Visual Relations

SLIDE 21

Image Captioning

Describe an image with a natural sentence:

❏ Two gentlemen talking in front of a propeller plane.
❏ Two men are conversing next to a small airplane.
❏ Two men talking in front of a plane.
❏ Two men talking in front of a small plane.
❏ Two men talk while standing next to a small passenger plane at an airport.

Dataset:

• PASCAL Sentence Dataset
  - 1,000 images, 5 sentences/image
  - Designed for image classification, object detection, and segmentation
  - No filtering; complex scenes, varied scales and viewpoints of different objects
• Flickr 8K
  - 8,108 images, 5 sentences/image
  - Obtained from the Flickr website by the University of Illinois at Urbana-Champaign
• Flickr 30K
  - An extension of Flickr 8K
• MS COCO
  - Largest captioning dataset
  - Includes captions & object annotations
  - 328,000 images, 5 sentences/image
• Visual Genome
  - Densely annotated dataset
  - Includes objects, scene graphs, grounded region captions, grounded Q&As, and attributes
  - 108,077 images with full annotations
  - Not very clean; needs a little pre-processing

Metric:

• BLEU, METEOR, ROUGE, CIDEr, human-based measurement (a toy BLEU sketch follows below)

Exploring Image Captioning Datasets: http://sidgan.me/technical/2016/01/09/Exploring-Datasets
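To make the n-gram metrics concrete, here is a toy pure-Python sketch of BLEU (clipped n-gram precision with a brevity penalty). It is a simplification for illustration, not the official evaluation script; among other things, the reference-length convention is simplified:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, references, max_n=4):
    """Toy BLEU: geometric mean of clipped n-gram precisions, times a brevity penalty."""
    if not candidate:
        return 0.0
    precisions = []
    for n in range(1, max_n + 1):
        cand = ngrams(candidate, n)
        max_ref = Counter()  # clip each n-gram count by the max over all references
        for ref in references:
            for g, c in ngrams(ref, n).items():
                max_ref[g] = max(max_ref[g], c)
        clipped = sum(min(c, max_ref[g]) for g, c in cand.items())
        precisions.append(clipped / max(1, sum(cand.values())))
    if min(precisions) == 0:
        return 0.0
    ref_len = min(len(r) for r in references)  # simplified reference-length choice
    bp = min(1.0, math.exp(1 - ref_len / len(candidate)))  # penalize short captions
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

# e.g. bleu("two men talking".split(), ["two men are talking near a plane".split()])
```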

SLIDE 22

A simple Baseline: NeuralTalk

A simple NeuralTalk Demo: https://github.com/karpathy/neuraltalk
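The core pattern behind NeuralTalk is a CNN image encoder feeding an RNN decoder that emits the caption word by word. Below is a minimal PyTorch sketch of that encoder-decoder pattern (NeuralTalk itself is plain NumPy; all dimensions and names here are illustrative):

```python
import torch
import torch.nn as nn

class CaptionDecoder(nn.Module):
    def __init__(self, feat_dim=4096, embed_dim=256, hidden_dim=512, vocab_size=10000):
        super().__init__()
        self.img_proj = nn.Linear(feat_dim, embed_dim)  # CNN feature -> word space
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, img_feat, captions):
        # the image feature acts as the first "word"; training uses teacher forcing
        v = self.img_proj(img_feat).unsqueeze(1)    # (B, 1, E)
        w = self.embed(captions)                    # (B, T, E)
        h, _ = self.lstm(torch.cat([v, w], dim=1))  # (B, T+1, H)
        return self.out(h)                          # per-step logits over the vocabulary
```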

SLIDE 23

Attention Mechanism: Show, Attend and Tell

Show, Attend and Tell: Neural Image Caption Generation with Visual Attention: https://arxiv.org/abs/1502.03044
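The key step is a soft attention map over the K spatial feature vectors of a conv layer, recomputed at every decoding step. A sketch under assumed shapes and layer names (ours, not the paper's code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SoftAttention(nn.Module):
    def __init__(self, feat_dim, hid_dim, att_dim):
        super().__init__()
        self.W_f = nn.Linear(feat_dim, att_dim)  # score each spatial location
        self.W_h = nn.Linear(hid_dim, att_dim)   # conditioned on the decoder state
        self.w = nn.Linear(att_dim, 1)

    def forward(self, feats, h):
        # feats: (B, K, feat_dim) conv feature vectors; h: (B, hid_dim) LSTM state
        e = self.w(torch.tanh(self.W_f(feats) + self.W_h(h).unsqueeze(1)))  # (B, K, 1)
        alpha = F.softmax(e.squeeze(-1), dim=1)             # where to look
        context = (alpha.unsqueeze(-1) * feats).sum(dim=1)  # expected visual feature
        return context, alpha
```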

SLIDE 24

Modified Attention Mechanism: Know when to look

Knowing When to Look: Adaptive Attention via A Visual Sentinel for Image Captioning: https://arxiv.org/abs/1612.01887

Adaptive Attention module

Determines how to mix the visual and linguistic information with a visual sentinel (a softmax over the k feature-map vectors plus 1 linguistic sentinel vector).
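Concretely, the sentinel just adds one extra candidate to the attention softmax; its weight beta acts as a gate toward linguistic information. A simplified rendering of the paper's equations (function and variable names are ours):

```python
import torch
import torch.nn.functional as F

def adaptive_attention(z_visual, z_sentinel, feats, sentinel):
    # z_visual: (B, K) scores for the K visual vectors; z_sentinel: (B, 1) score
    # for the linguistic sentinel; the softmax runs over all K + 1 candidates
    alpha_hat = F.softmax(torch.cat([z_visual, z_sentinel], dim=1), dim=1)
    beta = alpha_hat[:, -1:]  # gate: how much to rely on language rather than vision
    # since alpha_hat_i = (1 - beta) * alpha_i, the mix beta*s + (1 - beta)*c
    # collapses into a single weighted sum over the K + 1 candidates:
    c_hat = (alpha_hat[:, :-1].unsqueeze(-1) * feats).sum(dim=1) + beta * sentinel
    return c_hat, beta
```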

SLIDE 25

Concept-driven Image Captioning

Semantic Compositional Networks for Visual Captioning: https://arxiv.org/abs/1611.08002

SLIDE 26

Dense Captioning

Localize and describe each salient region with a natural sentence

DenseCap: Fully Convolutional Localization Networks for Dense Captioning: http://cs.stanford.edu/people/karpathy/densecap/

SLIDE 27

DenseCap: Fully Convolutional Localization Networks for Dense Captioning

DenseCap: Fully Convolutional Localization Networks for Dense Captioning: http://cs.stanford.edu/people/karpathy/densecap/

SLIDE 28

Outline

1. Introduction
2. Roadmap of Deep Learning
3. DL in CV: Object detection
4. DL in CV: Image Captioning
5. DL in CV: Visual Question Answering
6. DL in CV: Visual Relations

SLIDE 29

Visual Q&A

Answer an image-based question.

Question: What color is the man's tie?
Answer: Brown

Dataset:

• DAQUAR
  - The first dataset and benchmark released for the VQA task
  - Images are from the NYU Depth V2 dataset with semantic segmentations
  - 1,449 images (795 training, 654 test), 12,468 questions (auto-generated & human-annotated)
• COCO-QA
  - Automatically generated from image captions
  - 123,287 images; 78,736 training questions, 38,948 test questions
  - 4 types of questions: object, number, color, location
  - Answers are all one word
• VQA
  - The most widely used VQA dataset
  - Two parts: one contains images from COCO, the other contains abstract scenes
  - 204,721 COCO images and 50,000 abstract images, with ~5.4 questions/image
• CLEVR
  - A diagnostic dataset for the reasoning ability of VQA models
  - Rendered images and automatically generated questions, with functional programs and scene graphs
  - 100,000 images (70,000 train, 15,000 val, 15,000 test), with ~10 questions/image
• Visual Genome
  - Densely annotated dataset
  - Includes objects, scene graphs, grounded region captions, grounded Q&As, and attributes
  - 108,077 images with 1.7M grounded Q&A pairs
  - Not very clean; needs a little pre-processing

Survey of Visual Question Answering: Datasets and Techniques: https://arxiv.org/abs/1705.03865

SLIDE 30

Simple Baseline Method

Simple Baseline for Visual Question Answering: https://arxiv.org/abs/1512.02167
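The baseline concatenates a bag-of-words question vector with a CNN image feature and applies a softmax classifier over candidate answers. A minimal PyTorch sketch of that recipe (dimensions are illustrative):

```python
import torch
import torch.nn as nn

class BowImgBaseline(nn.Module):
    """Bag-of-words question vector + image feature -> linear answer classifier."""
    def __init__(self, vocab_size, img_dim=1024, num_answers=3000):
        super().__init__()
        self.classifier = nn.Linear(vocab_size + img_dim, num_answers)

    def forward(self, bow_question, img_feat):
        # one linear layer over the concatenated features is the entire model
        return self.classifier(torch.cat([bow_question, img_feat], dim=1))
```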

SLIDE 31

A Strong Baseline: Attention (1)

Where To Look: Focus Regions for Visual Question Answering: https://arxiv.org/abs/1511.07394

SLIDE 32

A Strong Baseline: Attention (2)

Show, Ask, Attend, and Answer: A Strong Baseline For Visual Question Answering: https://arxiv.org/abs/1704.03162

Multiple glimpses: several attention maps are computed over the image features, and the attended vectors are concatenated.

SLIDE 33

Co-Attention Mechanism for Image & Question

Hierarchical Question-Image Co-Attention for Visual Question Answering: https://arxiv.org/abs/1606.00061

Two variants: Parallel Co-Attention and Alternating Co-Attention.
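One way to sketch the parallel variant: build a word-region affinity matrix and pool it along each axis to obtain the two attention distributions. The max-pooling form below is a simplification of the paper's formulation; shapes and names are ours:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ParallelCoAttention(nn.Module):
    def __init__(self, q_dim, v_dim):
        super().__init__()
        self.W = nn.Parameter(0.01 * torch.randn(q_dim, v_dim))  # affinity weights

    def forward(self, Q, V):
        # Q: (B, T, q_dim) question word features; V: (B, K, v_dim) image regions
        C = torch.tanh(Q @ self.W @ V.transpose(1, 2))  # (B, T, K) affinity matrix
        a_q = F.softmax(C.max(dim=2).values, dim=1)     # attention over words
        a_v = F.softmax(C.max(dim=1).values, dim=1)     # attention over regions
        q_att = (a_q.unsqueeze(-1) * Q).sum(dim=1)      # attended question vector
        v_att = (a_v.unsqueeze(-1) * V).sum(dim=1)      # attended image vector
        return q_att, v_att
```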

SLIDE 34

Hierarchical Question Encoding

Hierarchical Question-Image Co-Attention for Visual Question Answering: https://arxiv.org/abs/1606.00061

Hierarchical question encoding scheme; the resulting encodings are combined for answer prediction.

SLIDE 35

Multimodal Fusion: Bilinear interaction modeling

MUTAN: Multimodal Tucker Fusion for Visual Question Answering: https://arxiv.org/abs/1705.06676
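At its core, MUTAN approximates a full bilinear interaction between the question and image embeddings with a Tucker-factorized tensor whose core has constrained rank. A heavily simplified rank-R sketch of that bilinear core (omitting the input/output factor matrices; names are ours):

```python
import torch
import torch.nn as nn

class LowRankBilinearFusion(nn.Module):
    """Sum of R elementwise products approximating a full bilinear interaction."""
    def __init__(self, q_dim, v_dim, out_dim, rank=5):
        super().__init__()
        self.q_proj = nn.ModuleList([nn.Linear(q_dim, out_dim) for _ in range(rank)])
        self.v_proj = nn.ModuleList([nn.Linear(v_dim, out_dim) for _ in range(rank)])

    def forward(self, q, v):
        # each rank-1 term lets every question dimension interact with every
        # image dimension, at a fraction of the full tensor's parameter count
        return sum(wq(q) * wv(v) for wq, wv in zip(self.q_proj, self.v_proj))
```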

SLIDE 36

Duality of Question Answering and Question Generation

Visual Question Generation as Dual Task of Visual Question Answering

SLIDE 37

Duality of Question Answering and Question Generation: Dual MUTAN

Visual Question Generation as Dual Task of Visual Question Answering

SLIDE 38

Learning to Reason: Compositional Network

Learning to Reason: End-to-End Module Networks for Visual Question Answering: https://arxiv.org/abs/1704.05526

End-to-end training with policy gradient: the choice of module layout is discrete, so it is optimized with REINFORCE-style gradients.

SLIDE 39

Outline

1. Introduction
2. Roadmap of Deep Learning
3. DL in CV: Object detection
4. DL in CV: Image Captioning
5. DL in CV: Visual Question Answering
6. DL in CV: Visual Relations

SLIDE 40

Visual Relations

Describe the image with object nodes and their interactions.

Scene Graph Generation from Objects, Phrases and Region Captions: https://arxiv.org/abs/1707.09700

SLIDE 41

Baseline: Visual Relationship Detection with Language Prior

Using word2vec as extra information for predicate recognition

Visual Relationship Detection with Language Priors: http://cs.stanford.edu/people/ranjaykrishna/vrd/
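A sketch of the idea: a visual module scores predicates from the union-box appearance, while a language module scores them from the word2vec embeddings of the subject and object nouns; the two scores are combined. This is a simplification under assumed names, not the paper's exact model:

```python
import torch
import torch.nn as nn

class RelationshipScorer(nn.Module):
    def __init__(self, vis_dim, w2v_dim=300, num_predicates=70):
        super().__init__()
        self.visual = nn.Linear(vis_dim, num_predicates)        # appearance cue
        self.language = nn.Linear(2 * w2v_dim, num_predicates)  # word2vec prior

    def forward(self, union_feat, subj_vec, obj_vec):
        # the language prior favors plausible predicates (e.g. person-rides-horse)
        prior = self.language(torch.cat([subj_vec, obj_vec], dim=1))
        return self.visual(union_feat) + prior
```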

SLIDE 42

Jointly detect objects and relations

Leverage the dependencies among the objects and their relationships as extra constraints.

Triplet Proposal:

• Region proposal: an RPN generates object proposals
• Triplet proposal: group the object proposals into <subject-phrase-object> triplet proposals
• Triplet NMS: removes redundant triplet proposals (see the NMS sketch below)

Phrase Detection:

• Branch-based detection model
• ROI pooling helps different branches focus on different components
• A message-passing structure (VPRS) helps the branches share information and consider the three components as a whole

ViP-CNN: Visual Phrase Guided Convolutional Neural Network: http://cvboy.com/publication/cvpr2017_vip_cnn/
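Triplet NMS generalizes the standard greedy box NMS used throughout the detection pipelines above to grouped subject/object boxes. For reference, a minimal NumPy sketch of standard IoU-based NMS (illustrative, not the ViP-CNN implementation):

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy NMS: keep the top-scoring box, drop boxes overlapping it, repeat."""
    x1, y1, x2, y2 = boxes.T
    areas = (x2 - x1) * (y2 - y1)
    order = scores.argsort()[::-1]  # indices sorted by descending score
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        # IoU between the kept box and all remaining candidates
        xx1 = np.maximum(x1[i], x1[order[1:]])
        yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])
        inter = np.maximum(0, xx2 - xx1) * np.maximum(0, yy2 - yy1)
        iou = inter / (areas[i] + areas[order[1:]] - inter)
        order = order[1:][iou <= iou_thresh]  # suppress high-overlap boxes
    return keep
```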

SLIDE 43

Relations as an intermediate level between Objects and Region Captions

Leverage the dependencies among the objects and their relationships as extra constraints.

Scene Graph Generation from Objects, Phrases and Region Captions: http://cvboy.com/publication/iccv2017_msdn/

SLIDE 44

Emerging Topics: Human-Object Interaction

Detecting and Recognizing Human-Object Interactions: https://arxiv.org/abs/1704.07333

SLIDE 45

Q&A

allen.li.thu@gmail.com