CSE 152: Computer Vision, Hao Su. Lecture 10: Object Recognition



SLIDE 1

Lecture 10: Object Recognition

CSE 152: Computer Vision

Hao Su

SLIDE 2

How do we represent objects

  • Bounding box

Figures from https://github.com/facebookresearch/detectron2

SLIDE 3

How do we represent objects

  • Bounding box
  • Instance mask

Figures from https://github.com/facebookresearch/detectron2

SLIDE 4

How do we represent objects

  • Bounding box
  • Instance mask
  • Keypoint

Figures from https://github.com/facebookresearch/detectron2

SLIDE 5

How do we represent objects

  • Bounding box
  • Instance mask
  • Keypoint

Figures from https://github.com/facebookresearch/detectron2

SLIDE 6

Object Detection with Bounding Boxes

What? - Recognition / Classification
Where? - Localization / Regression

Slides modified from Ross Girshick tutorial at CVPR 2019

SLIDE 7

Object Detection with Segmentation Masks

What? - Recognition
Where? - Segmentation

Slides modified from Ross Girshick tutorial at CVPR 2019

SLIDE 8

Semantic Segmentation

Predict a pixel-wise class label

  • Stuff: walls, buildings, sky, road
  • Things: human, cars, bikes

Figures from Panoptic Segmentation, CVPR 2019

SLIDE 9

Datasets

Microsoft COCO

SLIDE 10

Object Detection

SLIDE 11

Object Detection → Object Classification

Slides modified from Ross Girshick tutorial at CVPR 2019

Input: an image → enumerate / heuristic algorithm → Proposals/Candidates → crop and resize (warp) → Cropped image

We’ve already reduced object detection to object classification!
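The crop-and-resize step can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function name `crop_and_resize`, the `(x0, y0, x1, y1)` box convention with an exclusive end, and nearest-neighbour sampling are choices made here, not something the slides specify. Each warped crop would then be passed to an ordinary image classifier.

```python
import numpy as np

def crop_and_resize(image, box, out_size=(4, 4)):
    """Crop box = (x0, y0, x1, y1) from an H x W (x C) image and warp it to
    out_size using nearest-neighbour sampling (hypothetical helper)."""
    x0, y0, x1, y1 = box
    out_h, out_w = out_size
    # Sample the centre of each output cell inside the box.
    ys = y0 + ((np.arange(out_h) + 0.5) * (y1 - y0) / out_h).astype(int)
    xs = x0 + ((np.arange(out_w) + 0.5) * (x1 - x0) / out_w).astype(int)
    return image[np.ix_(ys, xs)]

# Toy image and two hand-written proposals (illustrative values only).
image = np.arange(64).reshape(8, 8)
proposals = [(2, 1, 6, 5), (0, 0, 8, 8)]
crops = [crop_and_resize(image, box, out_size=(2, 2)) for box in proposals]
```

Every proposal ends up with the same fixed size, which is what lets a single classifier handle boxes of arbitrary shape.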

SLIDE 12

R-CNN (Region-based ConvNet)

Input: an image → enumerate / heuristic algorithm → Proposals/Candidates: Regions of Interest (RoIs) → Cropped image → ConvNet → outputs:

  • Class Probability: how probable is it that this is a human?
  • BBox Regression: how can we modify this bounding box?

Computationally expensive: the ConvNet runs once per cropped proposal.

Slides modified from Ross Girshick tutorial at CVPR 2019
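A minimal sketch of the two outputs, assuming the commonly used R-CNN box parameterization (centre offsets scaled by the box size, log-space width and height); the slides do not spell out the parameterization, and the names `apply_box_deltas` and `softmax` are hypothetical helpers introduced for illustration.

```python
import numpy as np

def softmax(scores):
    """Turn raw class scores into class probabilities."""
    shifted = np.exp(scores - np.max(scores))  # subtract max for stability
    return shifted / shifted.sum()

def apply_box_deltas(box, deltas):
    """Refine box = (x0, y0, x1, y1) with predicted deltas = (dx, dy, dw, dh).
    Assumed parameterization: shift the centre by (dx * w, dy * h) and
    scale the size by (exp(dw), exp(dh))."""
    x0, y0, x1, y1 = box
    dx, dy, dw, dh = deltas
    w, h = x1 - x0, y1 - y0
    cx, cy = x0 + 0.5 * w, y0 + 0.5 * h
    new_cx, new_cy = cx + dx * w, cy + dy * h
    new_w, new_h = w * np.exp(dw), h * np.exp(dh)
    return (new_cx - 0.5 * new_w, new_cy - 0.5 * new_h,
            new_cx + 0.5 * new_w, new_cy + 0.5 * new_h)

probs = softmax(np.array([2.0, 1.0, 0.1]))               # class probability
shifted_box = apply_box_deltas((0, 0, 10, 20), (0.1, 0.0, 0.0, 0.0))
```

Zero deltas leave the proposal unchanged, so the regressor only has to learn a small correction to each candidate box.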

SLIDE 13

Faster R-CNN

  • Input: an image
  • ConvNet → Feature map for an image
  • Region Proposal Network (RPN) → Proposals/Candidates: Regions of Interest (RoIs)
  • RoI-Pool (similar to Crop & Resize) → Feature map for a RoI
  • ConvNet / Multilayer Perceptron (MLP) → BBox Regression, Class Probability

Slides modified from Ross Girshick tutorial at CVPR 2019
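RoI-Pool can be sketched as cropping a region out of the shared feature map and max-pooling it onto a fixed grid. The sketch below is a simplified illustration, not the exact operator from the paper: it assumes integer RoI coordinates already expressed in feature-map units, and the name `roi_max_pool` is hypothetical.

```python
import numpy as np

def roi_max_pool(feature_map, roi, out_size=(2, 2)):
    """feature_map: D x H x W array; roi = (x0, y0, x1, y1) in feature-map
    coordinates (integers, exclusive end). Returns a D x out_h x out_w array."""
    x0, y0, x1, y1 = roi
    out_h, out_w = out_size
    region = feature_map[:, y0:y1, x0:x1]
    # Split the region into an out_h x out_w grid of bins.
    y_edges = np.linspace(0, region.shape[1], out_h + 1).astype(int)
    x_edges = np.linspace(0, region.shape[2], out_w + 1).astype(int)
    pooled = np.zeros((feature_map.shape[0], out_h, out_w), dtype=feature_map.dtype)
    for i in range(out_h):
        for j in range(out_w):
            # Each bin covers at least one cell, even for tiny RoIs.
            y_end = max(y_edges[i + 1], y_edges[i] + 1)
            x_end = max(x_edges[j + 1], x_edges[j] + 1)
            cell = region[:, y_edges[i]:y_end, x_edges[j]:x_end]
            pooled[:, i, j] = cell.max(axis=(1, 2))
    return pooled

feature_map = np.arange(36).reshape(1, 6, 6)
roi_features = roi_max_pool(feature_map, (1, 1, 5, 5))
```

Because the ConvNet runs once per image and only this cheap pooling runs per RoI, the per-proposal cost is much lower than cropping and re-encoding each proposal.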

SLIDE 14

Faster R-CNN

  • At each location, consider boxes of many different sizes and aspect ratios
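Enumerating boxes of several sizes and aspect ratios at every feature-map location might look like the following sketch. The stride, scales, and ratios are illustrative assumptions, and `make_anchors` is a hypothetical helper rather than an API from any library.

```python
import numpy as np

def make_anchors(feat_h, feat_w, stride=16,
                 scales=(64, 128, 256), ratios=(0.5, 1.0, 2.0)):
    """Return an (N, 4) array of boxes (x0, y0, x1, y1) in image coordinates,
    with N = feat_h * feat_w * len(scales) * len(ratios)."""
    anchors = []
    for y in range(feat_h):
        for x in range(feat_w):
            # Centre of this feature-map cell in image coordinates.
            cx, cy = (x + 0.5) * stride, (y + 0.5) * stride
            for s in scales:
                for r in ratios:
                    # ratio r = width / height, area stays s * s.
                    w, h = s * np.sqrt(r), s / np.sqrt(r)
                    anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return np.array(anchors)

anchors = make_anchors(feat_h=2, feat_w=3)
```

Each location contributes `len(scales) * len(ratios)` candidate boxes, which the network then scores and refines.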

SLIDE 15

Faster R-CNN

  • At each location, consider boxes of many different sizes and aspect ratios

SLIDE 16

Object Segmentation

SLIDE 17

Semantic Segmentation Idea: Fully Convolutional

Design a network as a bunch of convolutional layers to make predictions for pixels all at once!

Input: 3 x H x W → Conv → Conv → Conv → Conv (Convolutions: D x H x W) → Scores: C x H x W → argmax → Predictions: H x W

Lecture 11 - May 10, 2017
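The last two steps of this pipeline (scores, then argmax) can be illustrated with NumPy. This sketch assumes the final layer is a 1 x 1 convolution, written here as a per-pixel matrix product; the random arrays stand in for real features and learned weights, and all sizes are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
C, D, H, W = 3, 8, 4, 5                      # classes, channels, height, width

features = rng.normal(size=(D, H, W))        # D x H x W output of the conv layers
classifier = rng.normal(size=(C, D))         # weights of an assumed 1 x 1 conv

# Apply the same linear classifier at every pixel: C x H x W scores.
scores = np.einsum("cd,dhw->chw", classifier, features)

# Pick the highest-scoring class per pixel: H x W predictions.
predictions = scores.argmax(axis=0)
```

The spatial size is preserved from input to output, so every pixel gets a label in a single forward pass.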

SLIDE 18

Semantic Segmentation Idea: Fully Convolutional

Design network as a bunch of convolutional layers, with downsampling and upsampling inside the network!

Input: 3 x H x W → High-res: D1 x H/2 x W/2 → Med-res: D2 x H/4 x W/4 → Low-res: D3 x H/4 x W/4 → Med-res: D2 x H/4 x W/4 → High-res: D1 x H/2 x W/2 → Predictions: H x W

Long, Shelhamer, and Darrell, “Fully Convolutional Networks for Semantic Segmentation”, CVPR 2015
Noh et al., “Learning Deconvolution Network for Semantic Segmentation”, ICCV 2015

SLIDE 19

Semantic Segmentation Idea: Fully Convolutional

Design network as a bunch of convolutional layers, with downsampling and upsampling inside the network!

Input: 3 x H x W → High-res: D1 x H/2 x W/2 → Med-res: D2 x H/4 x W/4 → Low-res: D3 x H/4 x W/4 → Med-res: D2 x H/4 x W/4 → High-res: D1 x H/2 x W/2 → Predictions: H x W

Downsampling: Pooling, strided convolution
Upsampling: ???

Long, Shelhamer, and Darrell, “Fully Convolutional Networks for Semantic Segmentation”, CVPR 2015
Noh et al., “Learning Deconvolution Network for Semantic Segmentation”, ICCV 2015
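As a small illustration of the downsampling side, 2 x 2 max pooling with stride 2 halves each spatial dimension. The helper name `max_pool_2x2` is hypothetical, and the sketch assumes H and W are even.

```python
import numpy as np

def max_pool_2x2(x):
    """Downsample a D x H x W array to D x H/2 x W/2 by taking the maximum
    over non-overlapping 2 x 2 windows (H and W assumed even)."""
    d, h, w = x.shape
    windows = x.reshape(d, h // 2, 2, w // 2, 2)
    return windows.max(axis=(2, 4))

x = np.arange(16).reshape(1, 4, 4)
pooled = max_pool_2x2(x)    # shape (1, 2, 2)
```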

SLIDE 20

Learnable Upsampling: Transpose Convolution

3 x 3 transpose convolution, stride 2 pad 1

Input: 2 x 2
Output: 4 x 4

  • Input gives weight for filter
  • Sum where output overlaps
  • Filter moves 2 pixels in the output for every one pixel in the input
  • Stride gives ratio between movement in output and input

SLIDE 21

Learnable Upsampling: Transpose Convolution

3 x 3 transpose convolution, stride 2 pad 1

Input: 2 x 2
Output: 4 x 4

  • Input gives weight for filter
  • Sum where output overlaps
  • Filter moves 2 pixels in the output for every one pixel in the input
  • Stride gives ratio between movement in output and input

Other names:

  • Deconvolution (bad)
  • Upconvolution
  • Fractionally strided convolution
  • Backward strided convolution
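
The mechanics described above (each input value scales a copy of the filter, copies are placed `stride` pixels apart, and overlaps are summed) can be sketched in NumPy for a single channel. One assumption is needed to reach the 4 x 4 output on the slide: with a 3 x 3 filter, stride 2, and pad 1, a 2 x 2 input gives 3 x 3 unless one extra row and column are kept, so this sketch includes an `output_padding` parameter, similar in spirit to the argument of the same name in PyTorch's `ConvTranspose2d`. The function name `transpose_conv2d` is hypothetical.

```python
import numpy as np

def transpose_conv2d(x, kernel, stride=2, pad=1, output_padding=1):
    """Single-channel transpose convolution.
    Output size per dimension: (in - 1) * stride + k + output_padding - 2 * pad."""
    in_h, in_w = x.shape
    k_h, k_w = kernel.shape
    canvas = np.zeros(((in_h - 1) * stride + k_h + output_padding,
                       (in_w - 1) * stride + k_w + output_padding))
    for i in range(in_h):
        for j in range(in_w):
            top, left = i * stride, j * stride
            # The input value gives the weight for the filter;
            # overlapping copies are summed.
            canvas[top:top + k_h, left:left + k_w] += x[i, j] * kernel
    # Padding in a transpose convolution crops the border of the output.
    return canvas[pad:canvas.shape[0] - pad, pad:canvas.shape[1] - pad]

x = np.ones((2, 2))
kernel = np.ones((3, 3))
upsampled = transpose_conv2d(x, kernel)      # 2 x 2 input -> 4 x 4 output
```

With an all-ones input and filter, the output simply counts how many filter copies cover each pixel, which makes the overlap regions easy to see.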
SLIDE 22

Semantic vs. Instance Segmentation

Slides modified from Ross Girshick tutorial at CVPR 2019

SLIDE 23

Mask R-CNN

  • First do object detection using the Faster R-CNN architecture, and then do semantic segmentation inside the cropped region
  • Share features of the first few layers for detection and segmentation
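
One way to picture "segmentation inside the cropped region" is the final step where a small per-RoI mask is resized to its detected box and pasted into a full-image mask. The sketch below is an assumption-laden illustration: `paste_mask`, the nearest-neighbour resize, and the 0.5 threshold are choices made here, not details given on the slide.

```python
import numpy as np

def paste_mask(roi_mask, box, image_shape, threshold=0.5):
    """Resize a small per-RoI probability mask to box = (x0, y0, x1, y1)
    (integers, exclusive end) and paste it into a full-size boolean mask."""
    x0, y0, x1, y1 = box
    box_h, box_w = y1 - y0, x1 - x0
    m_h, m_w = roi_mask.shape
    # Nearest-neighbour resize from mask resolution to box resolution.
    ys = (np.arange(box_h) * m_h / box_h).astype(int)
    xs = (np.arange(box_w) * m_w / box_w).astype(int)
    resized = roi_mask[np.ix_(ys, xs)]
    full = np.zeros(image_shape, dtype=bool)
    full[y0:y1, x0:x1] = resized >= threshold
    return full

roi_mask = np.array([[0.9, 0.1],
                     [0.2, 0.8]])
instance_mask = paste_mask(roi_mask, box=(2, 1, 6, 5), image_shape=(6, 8))
```

Each detected object gets its own mask this way, which is what distinguishes instance segmentation from a single per-pixel class map.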