SLIDE 1

Lecture 11: Object detection

Contains slides from S. Lazebnik, R. Girshick, B. Hariharan

SLIDE 2

Object detection with bounding boxes

“Object detection”

What? Where?

Source: R. Girshick

SLIDE 3

Evaluating an object detector

  • At test time, predict bounding boxes, class labels, and confidence scores
  • For each detection, determine whether it is a true or false positive
  • Intersection over union (IoU): Area(GT ∩ Det) / Area(GT ∪ Det) > 0.5

(Figure: ground truth (GT) boxes for "cat" and "dog"; detections "cat: 0.8", "dog: 0.6", "dog: 0.55")

Source: S. Lazebnik
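The IoU criterion above can be sketched in Python. This is a minimal helper, assuming boxes are given as (x1, y1, x2, y2) corner coordinates:

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    # Intersection rectangle (empty if the boxes do not overlap)
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# A detection counts as a true positive when IoU with a GT box exceeds 0.5:
print(iou((0, 0, 10, 10), (5, 0, 15, 10)))  # 50/150 = 0.333... -> false positive
```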

SLIDE 4

Evaluating an object detector

Intersection over union (also known as Jaccard similarity)

Source: B. Hariharan

SLIDE 5

Evaluating an object detector

  • For each class, plot the recall-precision curve and compute Average Precision (AP), the area under the curve
  • Take the mean of AP over classes to get mAP

Precision: true positive detections / total detections
Recall: true positive detections / total positive test instances

Source: S. Lazebnik
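A minimal sketch of the AP computation described above, using a simple rectangle-rule area under the recall-precision curve (some benchmarks use an interpolated variant):

```python
def average_precision(scores, is_tp, num_gt):
    """AP for one class: sort detections by confidence, sweep down the list,
    accumulate precision/recall points, and integrate the area under the curve."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    tp = fp = 0
    precisions, recalls = [], []
    for i in order:
        if is_tp[i]:
            tp += 1
        else:
            fp += 1
        precisions.append(tp / (tp + fp))   # TP / total detections so far
        recalls.append(tp / num_gt)         # TP / total positive instances
    # Area under the recall-precision curve (rectangle rule)
    ap, prev_recall = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += p * (r - prev_recall)
        prev_recall = r
    return ap

# Toy example: 3 detections, 2 GT objects; the 0.8 detection is a false positive.
print(average_precision([0.9, 0.8, 0.7], [True, False, True], num_gt=2))  # 0.833...
```

mAP is then just the mean of this value over all classes.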

SLIDE 6

Average precision

(Figure: precision-recall curve)

Source: B. Hariharan

SLIDE 7

Average precision

(Figure: precision-recall curve, with AP as the area under it)

Source: B. Hariharan

SLIDE 8

Detection as classification

  • Run through every possible box and classify: well-localized object of class k or not?
  • How many boxes? Every pair of pixels defines one box, so O(N²) boxes
  • For a 300 x 500 image, N = 150K pixels, giving 2.25 x 10^10 boxes!
  • Related challenge: almost all boxes are negative!

Source: B. Hariharan
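The box count on this slide works out as simple arithmetic:

```python
# Counting candidate boxes: with every (ordered) pair of pixels defining a box,
# a W x H image with N = W*H pixels yields on the order of N^2 boxes.
W, H = 300, 500
N = W * H                  # 150,000 pixels
num_boxes = N ** 2         # 2.25 x 10^10 candidate boxes
print(N, num_boxes)        # 150000 22500000000
```

This is why exhaustive classification over all boxes is infeasible and region proposals are needed.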

SLIDE 9

Selective search

Stage 1: generate candidate bounding boxes
Stage 2: apply the classifier only to each candidate bounding box

[Uijlings et al., "Selective Search for Object Recognition", 2013]

(Figure: input image → edge detection → bounding box proposals)

[Zitnick and Dollar, "Edge Boxes…", 2014]

Source: Torralba, Freeman, Isola

SLIDE 10

R-CNN: Region proposals + CNN features

(Pipeline: input image → region proposals from selective search (~2K rectangles that are likely to contain objects) → warped image regions → forward each region through a ConvNet → classify regions with linear classifiers)

R. Girshick, J. Donahue, T. Darrell, and J. Malik, Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation, CVPR 2014.

Source: R. Girshick

SLIDE 11

R-CNN at test time

Input image → extract region proposals (~2k / image) → compute CNN features

  • a. Crop

Source: R. Girshick

SLIDE 12

R-CNN at test time

Input image → extract region proposals (~2k / image) → compute CNN features

  • a. Crop
  • b. Scale (anisotropic) to 227 x 227

Source: R. Girshick

SLIDE 13

R-CNN at test time

Input image → extract region proposals (~2k / image) → compute CNN features

  • a. Crop
  • b. Scale (anisotropic)
  • c. Forward propagate

Output: “fc7” features

Source: R. Girshick

SLIDE 14

R-CNN at test time

Input image → extract region proposals (~2k / image) → compute CNN features → classify regions

Warped proposal → 4096-dimensional fc7 feature vector → linear classifiers (SVM or softmax)

person? 1.6
horse? -0.3
...

Source: R. Girshick
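The per-region scoring step can be sketched with NumPy. The weights and features below are random stand-ins, not trained values; only the shapes match the slide:

```python
import numpy as np

rng = np.random.default_rng(0)
fc7 = rng.standard_normal(4096)            # 4096-d fc7 feature for one warped proposal
num_classes = 20                           # e.g. the PASCAL VOC classes
W = rng.standard_normal((num_classes, 4096)) * 0.01
b = np.zeros(num_classes)

scores = W @ fc7 + b                       # one linear score per class ("person? 1.6", ...)
best = int(np.argmax(scores))              # highest-scoring class for this region
print(scores.shape, best)
```

In R-CNN this scoring is repeated for each of the ~2k proposals, which is exactly the cost problem Fast R-CNN addresses later.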

SLIDE 15

R-CNN at test time: proposal refinement

Bounding-box regression: linear regression on CNN features

Original proposal → predicted object bounding box

Source: R. Girshick

SLIDE 16

Bounding-box regression

Original box: position (x, y), width w, height h
Predicted box: position (Δx × w + x, Δy × h + y), width Δw × w + w, height Δh × h + h

Source: R. Girshick
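As a sketch, applying the regressed deltas to a proposal looks like this. The helper is hypothetical and follows the slide's linear parameterization; the R-CNN paper actually uses a log-space form for the scale terms:

```python
def apply_box_deltas(box, deltas):
    """Refine a proposal (x, y, w, h) with regressed offsets (dx, dy, dw, dh),
    using the linear parameterization from the slide."""
    x, y, w, h = box
    dx, dy, dw, dh = deltas
    return (dx * w + x,   # shift x by a fraction of the width
            dy * h + y,   # shift y by a fraction of the height
            dw * w + w,   # grow/shrink the width
            dh * h + h)   # grow/shrink the height

print(apply_box_deltas((10, 20, 100, 50), (0.1, -0.2, 0.5, 0.0)))
# (20.0, 10.0, 150.0, 50.0)
```

Predicting offsets relative to the box size makes the regression targets roughly scale-invariant.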

SLIDE 17

Non-maximum suppression

If two boxes overlap significantly (e.g. > 50% IoU), drop the one with the lower score. A greedy algorithm is usually used.

(Figure: two overlapping detections with scores 0.9 and 0.8)

Source: B. Hariharan
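The greedy algorithm can be sketched as follows, with boxes as (x1, y1, x2, y2) corners and one confidence score per box:

```python
def nms(boxes, scores, iou_thresh=0.5):
    """Greedy non-maximum suppression: repeatedly keep the highest-scoring
    box and drop every remaining box that overlaps it by more than iou_thresh."""
    def iou(a, b):
        x1, y1 = max(a[0], b[0]), max(a[1], b[1])
        x2, y2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0, x2 - x1) * max(0, y2 - y1)
        area_a = (a[2] - a[0]) * (a[3] - a[1])
        area_b = (b[2] - b[0]) * (b[3] - b[1])
        return inter / (area_a + area_b - inter)

    order = sorted(range(len(boxes)), key=lambda i: -scores[i])
    keep = []
    while order:
        i = order.pop(0)          # highest-scoring remaining box
        keep.append(i)
        order = [j for j in order if iou(boxes[i], boxes[j]) <= iou_thresh]
    return keep

# Two heavily overlapping detections plus one far away: only the 0.9 box
# survives the overlap, and the disjoint box is kept.
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
print(nms(boxes, [0.9, 0.8, 0.7]))  # [0, 2]
```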

SLIDE 18

Problems with R-CNN

  • 1. Slow! Have to run the CNN per window
  • 2. Hand-crafted mechanism for region proposal might be suboptimal

(Figure: a separate ConvNet + linear classifier pass per proposal)

SLIDE 19

“Fast” R-CNN: reuse features between proposals

(Pipeline: forward the whole image through a ConvNet → conv5 feature map of the image → RoI pooling layer over the region proposals → fully-connected layers → softmax classifier + linear bounding-box regressors)

R. Girshick, Fast R-CNN, ICCV 2015

Source: R. Girshick

SLIDE 20

ROI Pooling

  • How do we crop from a feature map?
  • Step 1: Resize boxes to account for subsampling

(Figure: spatial subsampling across layers 1-3)

Source: B. Hariharan

SLIDE 21

ROI Pooling

  • How do we crop from a feature map?
  • Step 2: Snap to feature map grid

Source: B. Hariharan

SLIDE 22

ROI Pooling

  • How do we crop from a feature map?
  • Step 3: Overlay a new grid of fixed size

Source: B. Hariharan

SLIDE 23

ROI Pooling

  • How do we crop from a feature map?
  • Step 4: Take max in each cell

Source: B. Hariharan

See more here: https://deepsense.ai/region-of-interest-pooling-explained/
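The four steps above can be sketched with NumPy. This is a toy 2-D version (a real implementation pools each channel of a 3-D feature map):

```python
import numpy as np

def roi_pool(feature_map, box, stride, out_size=2):
    """RoI pooling sketch over a 2-D feature map.
    box is (x1, y1, x2, y2) in image coordinates."""
    # Step 1: resize the box to feature-map coordinates (account for subsampling)
    # Step 2: snap the corners to the feature-map grid
    x1, y1 = int(box[0] / stride), int(box[1] / stride)
    x2 = int(np.ceil(box[2] / stride))
    y2 = int(np.ceil(box[3] / stride))
    roi = feature_map[y1:y2, x1:x2]
    # Step 3: overlay a fixed out_size x out_size grid on the snapped region
    ys = np.linspace(0, roi.shape[0], out_size + 1).astype(int)
    xs = np.linspace(0, roi.shape[1], out_size + 1).astype(int)
    # Step 4: take the max in each grid cell
    out = np.empty((out_size, out_size))
    for i in range(out_size):
        for j in range(out_size):
            out[i, j] = roi[ys[i]:ys[i + 1], xs[j]:xs[j + 1]].max()
    return out

fmap = np.arange(64, dtype=float).reshape(8, 8)   # toy 8x8 feature map
pooled = roi_pool(fmap, box=(0, 0, 64, 64), stride=16, out_size=2)
print(pooled.shape)  # (2, 2): fixed-size output regardless of the box size
```

The fixed output size is the point: every proposal, whatever its shape, becomes the same-sized input for the fully-connected layers.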


SLIDE 24

“Faster” R-CNN: learn region proposals

(Figure: the Region Proposal Network and the detector share the same CNN feature map, replacing external region proposals)

S. Ren, K. He, R. Girshick, and J. Sun, Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks, NIPS 2015

SLIDE 25

RPN: Region Proposal Network

(Figure: a small fully convolutional network (FCN) applied on top of the conv feature map)

Source: R. Girshick

SLIDE 26

RPN: Region Proposal Network

A 3x3 “sliding window” (implemented as an FCN) scans the conv feature map looking for objects

Source: R. Girshick

SLIDE 27

RPN: Anchor Box

A 3x3 “sliding window” scans the conv feature map looking for objects

Anchor box: predictions are w.r.t. this box, not the 3x3 sliding window

Source: R. Girshick

SLIDE 28

RPN: Anchor Box

At each 3x3 “sliding window” position:
  ➢ Objectness classifier [0, 1]
  ➢ Box regressor predicting (dx, dy, dh, dw)

Anchor box: predictions are w.r.t. this box, not the 3x3 sliding window

Source: R. Girshick

SLIDE 29

RPN: Prediction (on object)

  ➢ Objectness classifier [0, 1]
  ➢ Box regressor predicting (dx, dy, dh, dw)

Objectness score: P(object) = 0.94

Source: R. Girshick

SLIDE 30

RPN: Prediction (on object)

Anchor box: transformed by the box regressor; P(object) = 0.94

Source: R. Girshick

SLIDE 31

RPN: Prediction (off object)

Anchor box: transformed by the box regressor; objectness score P(object) = 0.02

Source: R. Girshick

SLIDE 32

RPN: Multiple Anchors

At each 3x3 “sliding window” position:
  ➢ K objectness classifiers
  ➢ K box regressors

Anchor boxes: K anchors per location with different scales and aspect ratios

Source: R. Girshick
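Generating the K anchors for one location can be sketched as follows. The scale and ratio values here are illustrative, not the paper's exact defaults, but Faster R-CNN does use 3 scales x 3 aspect ratios, giving K = 9:

```python
def make_anchors(center, scales=(64, 128, 256), ratios=(0.5, 1.0, 2.0)):
    """Generate K = len(scales) * len(ratios) anchor boxes (cx, cy, w, h)
    centered at one sliding-window location."""
    cx, cy = center
    anchors = []
    for s in scales:
        for r in ratios:
            # Keep the anchor's area at s*s while varying its aspect ratio w/h = r
            w = s * (r ** 0.5)
            h = s / (r ** 0.5)
            anchors.append((cx, cy, w, h))
    return anchors

anchors = make_anchors((100, 100))
print(len(anchors))  # 9 anchors per location
```

Each anchor then gets its own objectness score and box deltas, so one feature-map location can propose several differently shaped boxes at once.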

SLIDE 33

One network, four losses

(Pipeline: image → CNN feature map → Region Proposal Network → proposals → RoI pooling → classifier heads. The RPN and the final detector each have a classification loss and a bounding-box regression loss.)

Source: R. Girshick, K. He, S. Lazebnik

SLIDE 34

Faster R-CNN results

Source: S. Lazebnik

SLIDE 35

Object detection progress

(Chart: performance on PASCAL VOC over time, before and after CNNs: R-CNN v1, Fast R-CNN, Faster R-CNN)

Source: S. Lazebnik

SLIDE 36

Streamlined detection architectures

  • The Faster R-CNN pipeline separates proposal generation and region classification
  • Is it possible to do detection in one shot?

(Figure: two-stage pipeline (conv feature map of the entire image → RPN → RoI pooling → classification + regression → detections) vs. one-shot pipeline (conv feature map of the entire image → classification + regression → detections))

Source: S. Lazebnik

SLIDE 37

Single-stage object detector

  • Divide the image into a coarse grid and directly predict a class label and a few candidate boxes for each grid cell

J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, You Only Look Once: Unified, Real-Time Object Detection, CVPR 2016

Source: S. Lazebnik

SLIDE 38

YOLO detector

  • 1. Take conv feature maps at 7x7 resolution
  • 2. Predict, at each location, a score for each class and 2 bboxes w/ confidences
  • For PASCAL, output is 7x7x30 (30 = 20 + 2*(4+1))
  • 7x speedup over Faster R-CNN (45-155 FPS vs. 7-18 FPS), but less accurate (e.g. 65% vs. 72% mAP)

J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, You Only Look Once: Unified, Real-Time Object Detection, CVPR 2016

Source: S. Lazebnik
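The output dimensions from the slide can be checked directly:

```python
# YOLO's output tensor for PASCAL VOC: an S x S grid, each cell predicting
# C class scores plus B boxes with (x, y, w, h) and a confidence each.
S, B, C = 7, 2, 20
per_cell = C + B * (4 + 1)      # 20 + 2*5 = 30
output_shape = (S, S, per_cell)
print(output_shape)             # (7, 7, 30)
```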

SLIDE 39

Challenges in object detection

SLIDE 40

Beyond bounding boxes: instance segmentation

Predict a segmentation mask for each object (examples from COCO [Lin et al., 2014])

Source: B. Hariharan

SLIDE 41

Instance segmentation

Faster R-CNN + an extra “head” on the network that predicts a binary mask

[He et al., “Mask R-CNN”, 2017]

RoI pooling with a tiny change (“RoIAlign”): bilinear interpolation instead of max

SLIDE 42

Example Mask Training Targets

(Figure: images with training proposals and their corresponding 28x28 mask targets)

Source: R. Girshick

SLIDES 43-45

Example Mask Training Targets (further examples of images with training proposals and 28x28 mask targets)

Source: R. Girshick

SLIDE 46

SLIDE 47

SLIDE 48

Human Pose

  ➢ Add a keypoint head (28x28x17)
  ➢ Predict one “mask” for each of the 17 keypoints
  ➢ Softmax over spatial locations (encodes the one-keypoint-per-mask “prior”)

(Not shown: the head architecture is slightly different for keypoints)

(Figure: 17 keypoint “mask” predictions shown as heatmaps, with OKS scores from argmax positions)

Source: R. Girshick
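The softmax-over-locations step can be sketched as follows, using toy logits in place of a trained head's output:

```python
import numpy as np

def keypoint_location(logits):
    """Softmax over the 28x28 spatial locations of one keypoint 'mask',
    then take the argmax position as the predicted keypoint."""
    flat = logits.reshape(-1)
    probs = np.exp(flat - flat.max())   # numerically stable softmax
    probs /= probs.sum()                # one distribution over all locations
    idx = int(probs.argmax())
    return divmod(idx, logits.shape[1])  # (row, col) of the peak

logits = np.zeros((28, 28))
logits[12, 7] = 5.0                     # pretend the head fires at this location
print(keypoint_location(logits))        # (12, 7)
```

Normalizing over locations (rather than per-pixel sigmoids, as in the mask head) is what encodes the prior that each of the 17 maps contains exactly one keypoint.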

SLIDE 49

SLIDE 50

SLIDE 51

Panoptic Segmentation

Predict a class label + instance id per pixel [Kirillov et al. 2018]: combines semantic segmentation (“stuff”) with instance detection (“things”)

SLIDE 52

[Xiong et al. 2019]

SLIDE 53

We still need lots of labeled examples

(Chart: Mask R-CNN on COCO with different training set sizes)

Image source: R. Girshick

SLIDE 54

Handle the long tail of the distribution

(Chart: frequency vs. object categories; common head classes like person, dog, table vs. rare tail classes like teacup, wreath, birdfeeder)

SLIDE 55

Handle the “long tail” of the distribution

From COCO (80 categories) [Lin et al., 2014] to the LVIS dataset (1000+ categories), where many classes are “few shot” (e.g. < 20 examples) [Gupta et al., 2019]

Image source: R. Girshick

SLIDE 56

Next time: Action recognition