CNN Applications in Computer Vision ELEG 5491 Tutorial Xihui Liu

Table of Contents
● Image Representation & Pre-processing
● Object detection
● Semantic Segmentation
● Instance Segmentation

Image Representation
● Grayscale image
  − Can be represented by 2D matrices
  − By default, we use 8 bits per pixel

Image Representation
● An image is a 2D array of pixels (picture elements) with a FIXED number of samples: N x M
Examples: N x M = 256 x 256 and N x M = 30 x 30

Color Image Representation
● Color image
  − Each pixel is specified by three values, (R, G, B), in the range of [0,255] (8-bit integers)

Color Image Representation
● Color image
  − Color images are stored in a 3 x M x N tensor
  − [0,255] is usually mapped to [0.0,1.0] in PyTorch (a deep learning library)
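As a rough illustration of the two points above, the sketch below rearranges a tiny image of (R, G, B) pixels into a channel-first 3 x H x W layout and rescales [0,255] to [0.0,1.0]. It uses plain Python lists so that it runs without any library; the function name `to_chw_float` and the 2 x 2 image are made up for this example. In practice this conversion is usually done by a library routine such as torchvision's `ToTensor` transform.

```python
def to_chw_float(image_hwc):
    """Convert an H x W grid of (R, G, B) 8-bit tuples into a
    3 x H x W nested list with values scaled to [0.0, 1.0]."""
    height = len(image_hwc)
    width = len(image_hwc[0])
    return [
        [[image_hwc[y][x][c] / 255.0 for x in range(width)] for y in range(height)]
        for c in range(3)  # one H x W plane per color channel: R, G, B
    ]


# Hypothetical 2 x 2 color image: red, green on the first row; blue, white on the second.
image = [
    [(255, 0, 0), (0, 255, 0)],
    [(0, 0, 255), (255, 255, 255)],
]
tensor = to_chw_float(image)
red_plane, green_plane, blue_plane = tensor
```

Here `red_plane` holds the R value of every pixel, so the red and white pixels map to 1.0 and the others to 0.0.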

CNN Applications in Computer Vision
● Image Classification
  − Given an input image, classify it into a predefined class
● Other computer vision tasks
  − Semantic Segmentation
  − Object Detection

Object Detection: Impact of Deep Learning
● PASCAL VOC is a classical object detection benchmark

Object Detection as Classification: Sliding Window
● Apply a CNN to many different crops of the image, CNN classifies each crop as object or background
Problem: Need to apply CNN to huge number of locations and scales, very computationally expensive!
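To get a feeling for how quickly the number of crops grows, the sketch below simply enumerates sliding-window positions. The image size, window sizes, and stride are illustrative assumptions, not values from the slides, and `sliding_windows` is a hypothetical helper.

```python
def sliding_windows(image_h, image_w, window_sizes, stride):
    """Yield (top, left, height, width) for every crop that a
    sliding-window detector would have to classify."""
    for win_h, win_w in window_sizes:
        for top in range(0, image_h - win_h + 1, stride):
            for left in range(0, image_w - win_w + 1, stride):
                yield (top, left, win_h, win_w)


# Assumed setup: a 256 x 256 image, four window shapes, stride of 8 pixels.
window_sizes = [(64, 64), (128, 128), (128, 64), (64, 128)]
crops = list(sliding_windows(256, 256, window_sizes, stride=8))
num_crops = len(crops)  # 1764 CNN forward passes for one small image
```

Even this small configuration needs 1764 forward passes, and a realistic detector would use more scales, more aspect ratios, and a larger image.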

Region Proposals
● Find plausible image regions that are likely to contain objects
● Relatively fast to run; e.g. Selective Search gives 1000 region proposals in a few seconds on CPU
Alexe et al, “Measuring the objectness of image windows”, TPAMI 2012
Uijlings et al, “Selective Search for Object Recognition”, IJCV 2013
Cheng et al, “BING: Binarized normed gradients for objectness estimation at 300fps”, CVPR 2014
Zitnick and Dollar, “Edge boxes: Locating object proposals from edges”, ECCV 2014

R-CNN
Girshick et al, “Rich feature hierarchies for accurate object detection and semantic segmentation”, CVPR 2014.

R-CNN: Problems
● Ad hoc training objectives
  − Fine-tune network with softmax classifier (log loss)
  − Train post-hoc linear SVMs (hinge loss)
  − Train post-hoc bounding-box regressions (least squares)
● Training is slow (84h), takes a lot of disk space
● Inference (detection) is slow
  − 47s / image with VGG16 [Simonyan & Zisserman. ICLR15]
  − Fixed by SPP-net [He et al. ECCV14]
Girshick et al, “Rich feature hierarchies for accurate object detection and semantic segmentation”, CVPR 2014.

Fast R-CNN
Girshick et al, “Fast R-CNN”, ICCV 2015.

Fast R-CNN: ROI Pooling
Girshick et al, “Fast R-CNN”, ICCV 2015.

R-CNN vs SPP vs Fast R-CNN
He et al, “Spatial pyramid pooling in deep convolutional networks for visual recognition”, ECCV 2014
Girshick et al, “Fast R-CNN”, ICCV 2015.

Faster R-CNN
● Make CNN do proposals!
● Insert Region Proposal Network (RPN) to predict proposals from features
● Jointly train with 4 losses:
  − RPN classify object / not object
  − RPN regress box coordinates
  − Final classification score (object classes)
  − Final box coordinates
Ren et al, “Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks”, NIPS 2015

Faster R-CNN
Ren et al, “Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks”, NIPS 2015

One-stage Methods without Proposals: YOLO / SSD
Redmon et al, “You Only Look Once: Unified, Real-Time Object Detection”, CVPR 2016
Liu et al, “SSD: Single-Shot MultiBox Detector”, ECCV 2016

Object Detection: Lots of variables ...
● Base Network: VGG16, ResNet-101, Inception V2, Inception V3, Inception ResNet, MobileNet
● Object Detection architecture: Faster R-CNN, R-FCN, SSD
● Image Size
● # Region Proposals
● ….
Takeaways
● Faster R-CNN is slower but more accurate
● SSD is much faster but not as accurate
Huang et al, “Speed/accuracy trade-offs for modern convolutional object detectors”, CVPR 2017
R-FCN: Dai et al, “R-FCN: Object Detection via Region-based Fully Convolutional Networks”, NIPS 2016
Inception-V2: Ioffe and Szegedy, “Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift”, ICML 2015
Inception V3: Szegedy et al, “Rethinking the Inception Architecture for Computer Vision”, arXiv 2016
Inception ResNet: Szegedy et al, “Inception-V4, Inception-ResNet and the Impact of Residual Connections on Learning”, arXiv 2016
MobileNet: Howard et al, “Efficient Convolutional Neural Networks for Mobile Vision Applications”, arXiv 2017

Semantic Segmentation
● Classical Computer Vision problem
● Label each pixel in the image with a class label
● Does not differentiate instances, only cares about pixels

Some Public Semantic Segmentation Datasets

Semantic Segmentation Idea: Sliding Window
Problem: Very inefficient! Not reusing shared features between overlapping patches
Farabet et al, “Learning Hierarchical Features for Scene Labeling,” TPAMI 2013
Pinheiro and Collobert, “Recurrent Convolutional Neural Networks for Scene Labeling”, ICML 2014

Semantic Segmentation Idea: Fully Convolutional
Design a network as a bunch of convolutional layers to make predictions for pixels all at once!
Problem: convolutions at original image resolution will be very expensive ...
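"Predictions for pixels all at once" can be pictured as follows: the network outputs one score per class at every pixel, and the label map is the per-pixel argmax. The sketch below assumes an H x W x C nested list of scores; the function name and the numbers are invented for illustration.

```python
def predict_label_map(scores):
    """scores: H x W x C nested list of per-pixel class scores.
    Returns an H x W nested list holding the best class index per pixel."""
    return [
        [max(range(len(pixel)), key=pixel.__getitem__) for pixel in row]
        for row in scores
    ]


# Hypothetical scores for a 2 x 2 image with 3 classes.
scores = [
    [[0.1, 0.9, 0.0], [0.8, 0.1, 0.1]],
    [[0.2, 0.2, 0.6], [0.3, 0.4, 0.3]],
]
label_map = predict_label_map(scores)
```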

Semantic Segmentation Idea: Fully Convolutional
Design network as a bunch of convolutional layers, with downsampling and upsampling inside the network!
Downsampling: Pooling, strided convolution
Upsampling: ???
Apply cross-entropy loss at every pixel of the predicted label map
Long, Shelhamer, and Darrell, “Fully Convolutional Networks for Semantic Segmentation”, CVPR 2015
Noh et al, “Learning Deconvolution Network for Semantic Segmentation”, ICCV 2015
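A minimal sketch of the per-pixel cross-entropy loss, written with the standard library only. It assumes the network outputs raw class scores (logits) of shape H x W x C and that the per-pixel losses are averaged; the averaging, the function name, and the toy values are assumptions of this example rather than details from the slides.

```python
import math


def pixelwise_cross_entropy(scores, labels):
    """Mean cross-entropy over all pixels.

    scores: H x W x C nested list of raw class scores (logits).
    labels: H x W nested list of ground-truth class indices.
    """
    total = 0.0
    count = 0
    for score_row, label_row in zip(scores, labels):
        for pixel_scores, label in zip(score_row, label_row):
            # Stable log-sum-exp: subtract the max score before exponentiating.
            max_score = max(pixel_scores)
            log_sum_exp = max_score + math.log(
                sum(math.exp(s - max_score) for s in pixel_scores)
            )
            # Cross-entropy for one pixel = -log softmax(scores)[label].
            total += log_sum_exp - pixel_scores[label]
            count += 1
    return total / count


# Hypothetical 1 x 2 image with 2 classes; both pixels belong to class 0.
scores = [[[0.0, 0.0], [2.0, 0.0]]]
labels = [[0, 0]]
loss = pixelwise_cross_entropy(scores, labels)
```

The first pixel is undecided between the two classes and contributes log 2, while the second pixel already favors the correct class and contributes a smaller loss.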

Convolution Layer
Typical 3 x 3 convolution, stride 2 pad 1
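The downsampling effect of this layer follows from the usual output-size formula, floor((n + 2p - k) / s) + 1, applied per spatial dimension. The sketch below evaluates it for a 3 x 3 kernel with stride 2 and pad 1; the input sizes are assumptions chosen for illustration.

```python
def conv_output_size(input_size, kernel_size, stride, padding):
    """Spatial output size of a convolution along one dimension."""
    return (input_size + 2 * padding - kernel_size) // stride + 1


# 3 x 3 convolution, stride 2, pad 1, on assumed input sizes.
small = conv_output_size(4, kernel_size=3, stride=2, padding=1)    # 4 -> 2
large = conv_output_size(224, kernel_size=3, stride=2, padding=1)  # 224 -> 112
```

With these settings, an even input size is halved in each spatial dimension.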

“Deconvolution” Layer for Upsampling
Filter moves 2 pixels in the output for every one pixel in the input
Stride gives ratio between movement in output and input
Other names:
-Deconvolution (bad)
-Upconvolution
-Fractionally strided convolution
-Backward strided convolution

Transpose Convolution: 1D Example
Output contains copies of the filter weighted by the input, summing where they overlap in the output
Need to crop one pixel from output to make output exactly 2x input
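The 1D example can be reproduced in a few lines. The sketch assumes a 2-element input, a 3-element filter, and stride 2; the numeric values are hypothetical and chosen so that each contribution is easy to spot. Which end to crop is also an assumption here (the last pixel).

```python
def transpose_conv1d(inputs, kernel, stride):
    """1D transpose convolution: place a copy of the kernel, scaled by each
    input value, every `stride` positions in the output and sum overlaps."""
    out_len = (len(inputs) - 1) * stride + len(kernel)
    output = [0.0] * out_len
    for i, x in enumerate(inputs):
        for k, w in enumerate(kernel):
            output[i * stride + k] += x * w
    return output


inputs = [1.0, 2.0]           # hypothetical input values
kernel = [1.0, 10.0, 100.0]   # hypothetical filter weights
full = transpose_conv1d(inputs, kernel, stride=2)
# full == [1.0, 10.0, 102.0, 20.0, 200.0]; position 2 sums both kernel copies.
cropped = full[:-1]           # drop one pixel so the output is exactly 2x the input
```

The uncropped output has (2 - 1) * 2 + 3 = 5 elements, and the middle element 102 = 1 * 100 + 2 * 1 is where the two weighted copies of the filter overlap.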

Instance Segmentation
● Not only segment each pixel but also differentiate different instances of the same class
● Idea: combining object detection and semantic segmentation for instance segmentation

Mask R-CNN
● Idea: combining object detection and semantic segmentation for instance segmentation
He et al, “Mask R-CNN”, ICCV 2017

Mask R-CNN: Very Good Results
He et al, “Mask R-CNN”, ICCV 2017

Mask R-CNN: Also Can Estimate Human Poses
He et al, “Mask R-CNN”, ICCV 2017

Thanks! ELEG 5491 Tutorial Xihui Liu
