

SLIDE 1

R-CNN minus R

Karel Lenc, Andrea Vedaldi

Visual Geometry Group, Department of Engineering Science

SLIDE 2

Object detection

Goal: tightly enclose objects of a certain type in a bounding box

“bikes” “planes” “birds” “horses”

SLIDE 3

Top performer: Region proposals + CNN

[Diagram: a proposal generator answers WHERE; a CNN evaluated on each proposal answers WHAT, labelling regions as "chair", "potted plant", or background. Girshick et al. 2013; He et al. 2014, 2015]

SLIDE 4

Top performer: Region proposals + CNN

WHAT: convolutional neural networks, e.g. region classification with AlexNet or VGG-VD [Krizhevsky et al. 2012, Simonyan & Zisserman 2014].

WHERE: a segmentation algorithm, e.g. region proposals from selective search [Uijlings et al. 2013].

SLIDE 5

Can a CNN understand where as well as what?

[Diagram: can the convolutional neural network answer the WHERE question itself, replacing proposal generation, in addition to the WHAT question?]

SLIDE 6

Approaches to object detection

Scanning windows
- Sliding windows: HOG detector [Dalal & Triggs 2005], DPM [Felzenszwalb et al. 2008]
- Cascaded windows: AdaBoost [Viola & Jones 2004], MKL [Vedaldi et al. 2009], Branch and Bound [Lampert et al. 2009]
- Jumping windows [Sivic et al. 2008]

Selective windows
- Region proposals [Endres and Hoiem 2010, Uijlings et al. 2011, Alexe et al. 2012, Gu et al. 2012]

Hough voting
- Implicit shape models [Amit & Geman 1997, Leibe et al. 2003]
- Max margin [Maji & Berg 2009], Random Forests [Gall & Lempitsky 2009]

Classifiers & features
- Linear SVMs, kernel SVMs, Fisher vectors, … [Cinbis et al. 2013, …]
- Convolutional neural networks [Sermanet et al. 2014, Girshick et al. 2014, …]
- HOG, SIFT, C-SIFT, … [van de Sande et al. 2010, …]
- Segmentation cues, … [Shotton et al. 2008, Cinbis et al. 2013, …]

SLIDE 7

Evolution of object detection

[Chart: mAP [%] (10–70) on PASCAL VOC 2007 by year (2008–2015) for DPM [Felzenszwalb et al.], MKL [Vedaldi et al.], DPMv5 [Girshick et al.], Regionlet [Wang et al.], RCNN-Alex [Girshick et al.], and RCNN-VGG [Girshick et al.].]

SLIDE 8

R-CNN [Girshick et al. 2013]

Pros: simple and effective. Cons: slow, as the CNN is re-evaluated for each tested region.

[Diagram: every proposal is cropped and passed separately through the full CNN (layers c1–c5, f6–f8) and an SVM, producing labels such as "chair", "potted plant", or background.]
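The per-region cost is easy to see in pseudocode. The sketch below is illustrative, not the authors' code: `cnn_forward` stands in for a full AlexNet/VGG forward pass, and the classifier is a hypothetical stub.

```python
def cnn_forward(patch):
    # Stand-in for a full CNN forward pass; a real R-CNN would warp the
    # patch to 224x224 and run AlexNet/VGG here.
    return sum(sum(row) for row in patch)  # dummy scalar "feature"

def rcnn_detect(image, proposals, classify):
    detections = []
    for (x1, y1, x2, y2) in proposals:       # ~2000 proposals per image
        patch = [row[x1:x2] for row in image[y1:y2]]
        feature = cnn_forward(patch)         # one forward pass PER region
        detections.append(((x1, y1, x2, y2), classify(feature)))
    return detections

image = [[1] * 8 for _ in range(8)]
proposals = [(0, 0, 4, 4), (2, 2, 8, 8)]
dets = rcnn_detect(image, proposals,
                   classify=lambda f: "object" if f > 20 else "background")
print(dets)
```

With ~2000 proposals, the expensive convolutional layers run ~2000 times per image; sharing them across regions is what SPP addresses.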

SLIDE 9

SPP R-CNN [He et al. 2014]

Convolutional features = local features; region descriptor = pooled local features.

Pooling encoders: spatial pyramid + max pooling [He et al. 2014]; bag of words, Fisher vectors, VLAD, … [Cimpoi et al. 2015].

Result: an order-of-magnitude speedup, since the convolutional layers (c1–c5) run once per image and only the pooling and fully connected layers (f6–f8, SVM) run per region.

[Diagram: shared convolutional features are pooled into one descriptor per region ("chair", "bowl", "potted plant"), each classified by the fully connected layers.]
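The key operation is pooling a fixed-length descriptor from an arbitrarily sized region of the shared feature map. A minimal single-channel sketch (illustrative only; real SPP pools every channel of the conv5 map):

```python
def spp_max_pool(fmap, box, levels=(1, 2)):
    # Max-pool the features inside `box` over a spatial pyramid: level L
    # splits the box into an L x L grid, one max per cell. The descriptor
    # length (sum of L*L over levels) is fixed regardless of the box size.
    x1, y1, x2, y2 = box
    desc = []
    for L in levels:
        for i in range(L):          # grid rows
            for j in range(L):      # grid columns
                ry1 = y1 + (y2 - y1) * i // L
                ry2 = y1 + (y2 - y1) * (i + 1) // L
                rx1 = x1 + (x2 - x1) * j // L
                rx2 = x1 + (x2 - x1) * (j + 1) // L
                desc.append(max(fmap[r][c]
                                for r in range(ry1, ry2)
                                for c in range(rx1, rx2)))
    return desc

fmap = [[1, 2, 3, 4],
        [5, 6, 7, 8],
        [9, 10, 11, 12],
        [13, 14, 15, 16]]
print(spp_max_pool(fmap, (0, 0, 4, 4)))  # 1 + 4 = 5 values
```

Any region, large or small, yields the same 5-value descriptor, so one shared convolutional pass can serve every proposal.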

SLIDE 10

Computational cost

SPP-CNN yields a significant test-time speedup; however, region proposal extraction becomes the new bottleneck. R-CNN minus R: can we get rid of region proposal extraction altogether?

[Chart: avg. detection time per image [ms] (up to ~12,000) for R-CNN vs. SPP-CNN, broken down into selective search and CNN evaluation.]

SLIDE 11

Streamlining R-CNN and SPP-CNN
Dropping proposal generation

SLIDE 12

Streamlining R-CNN and SPP-CNN
Dropping proposal generation

SLIDE 13

A complex learning pipeline

(SPP) R-CNN training comprises many steps:

1. Pre-train a large CNN (on ImageNet).
2. Extract region proposals (on PASCAL VOC).
3. Use the pre-processed regions to:
   a. fine-tune the CNN;
   b. learn an SVM to rank regions;
   c. learn a bounding-box regressor to refine localization.

[Diagram: the CNN (c1–c5, f6–f8) is fine-tuned with classification labels, an SVM is trained on ranking labels, and a linear regressor predicts the bounding box.]
SLIDE 14

A complex learning pipeline

With the SPP R-CNN of [He et al. 2014], fine-tuning is limited to the fully connected layers; the convolutional layers (c1–c5) stay frozen.

[Diagram: as on the previous slide, but with the convolutional layers marked as frozen during fine-tuning.]

SLIDE 15

Streamlining R-CNN: removing the SVM phase

Up to a simple transformation, softmax is just as good as the hinge loss for box ranking.

  Phase           Score(s)                               Learning loss                               mAP
  fine tuning     T_d = exp(⟨x_d, φ(y)⟩ + c_d)           −log( T_{d0} / (T_0 + T_1 + … + T_D) )      38.1
  region ranking  R_d = ⟨x_d, φ(y)⟩ + c_d, d = 1, …, D   max{0, 1 − z R_d}, d = 1, …, D              59.8
  region ranking  R_d = log( T_d / T_0 )                 from fine-tuning                            58.4
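The "simple transformation" in the last row of the table is a log-ratio against the background class: since T_d = exp(z_d), R_d = log(T_d / T_0) = z_d − z_0, i.e. subtracting the background logit. A small numeric sketch (function and variable names are mine, not the slides'):

```python
import math

def softmax_to_ranking(logits):
    # logits[0] is the background class. T_d = exp(logits[d]);
    # ranking score R_d = log(T_d / T_0), which reduces to
    # logits[d] - logits[0].
    T = [math.exp(z) for z in logits]
    return [math.log(t / T[0]) for t in T[1:]]

scores = softmax_to_ranking([1.0, 3.0, 0.5])
print(scores)
```

Because the transformation is monotone per class, ranking boxes by R_d needs no separately trained SVM, which is how the SVM phase can be dropped.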

SLIDE 16

Streamlining R-CNN and SPP-CNN

SPP and bounding-box regression can easily be implemented in a CNN (with a DAG topology) and trained jointly in one step. See also [Fast R-CNN and Faster R-CNN].

[Diagram: the separate pipeline (frozen conv layers, SVM, linear regressor) collapses into a single network with an SPP layer and a regression branch (f_reg) trained end to end.]

SLIDE 17

Streamlining R-CNN and SPP-CNN
Dropping proposal generation

SLIDE 18

A constant-time region proposal generator

Proposals are now very fast but very inaccurate; we let the CNN compensate through the bounding-box regressor.

Algorithm

Preprocessing:
1. Collect all the training bounding boxes (x1, y1, x2, y2).
2. Use K-means to extract K clusters in (x1, y1, x2, y2) space.

Proposal generation:
Regardless of the image, return the same K cluster centers.
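A minimal sketch of this generator (illustrative: function names are mine, and a naive first-K initialization stands in for a proper K-means seeding):

```python
def kmeans_boxes(boxes, K, iters=20):
    # Lloyd's algorithm over (x1, y1, x2, y2) vectors.
    centers = [tuple(float(v) for v in b) for b in boxes[:K]]  # naive init
    for _ in range(iters):
        clusters = [[] for _ in range(K)]
        for b in boxes:
            k = min(range(K),
                    key=lambda k: sum((b[i] - centers[k][i]) ** 2
                                      for i in range(4)))
            clusters[k].append(b)
        centers = [tuple(sum(b[i] for b in cl) / len(cl) for i in range(4))
                   if cl else centers[k]
                   for k, cl in enumerate(clusters)]
    return centers

def propose(image, centers):
    # Constant-time "proposal generation": ignore the image entirely.
    return list(centers)

training_boxes = [(0, 0, 10, 10), (50, 50, 80, 80),
                  (2, 2, 12, 12), (52, 52, 82, 82),
                  (4, 4, 14, 14), (54, 54, 84, 84)]
centers = kmeans_boxes(training_boxes, K=2)
print(propose(None, centers))
```

The cluster centers are computed once from the training set; at test time every image receives the same few thousand fixed boxes, and the regressor snaps them onto the objects.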

SLIDE 19

Proposal statistics on PASCAL VOC

[Figure: distributions of boxes for ground truth, selective search (2K boxes), sliding windows (7K boxes), and clustering (3K boxes).]

SLIDE 20

Information pathways [see also Lenc & Vedaldi, CVPR 2015]

[Diagram: shared local features (c1–c5) feed two pathways: a "where" path, an equivariant representation ending in a linear regressor that outputs the bounding box, and a "what" path, an invariant representation (f6–f8) that outputs the label.]

SLIDE 21

CNN-based bounding box regression

[Figure: dashed lines show the fixed proposals; solid lines show the boxes after correction by the CNN's bounding-box regressor.]
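The regressor's output is usually parameterized relative to the proposal. A sketch of applying R-CNN-style targets (dx, dy, dw, dh), which shift the box center and rescale its size (illustrative; not necessarily the exact parameterization used in the talk):

```python
import math

def apply_bbox_regression(box, deltas):
    # box = (x1, y1, x2, y2); deltas = (dx, dy, dw, dh) predicted by the CNN.
    x1, y1, x2, y2 = box
    w, h = x2 - x1, y2 - y1
    cx, cy = x1 + w / 2.0, y1 + h / 2.0
    dx, dy, dw, dh = deltas
    cx, cy = cx + dx * w, cy + dy * h          # shift center
    w, h = w * math.exp(dw), h * math.exp(dh)  # rescale size (log-space)
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)

print(apply_bbox_regression((10, 10, 30, 50), (0.0, 0.0, 0.0, 0.0)))
```

Because the correction is relative, even a coarse, image-independent proposal can be pulled onto a nearby object, which is what lets the fixed cluster-center proposals compete with selective search.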

SLIDE 22

Performance

Observations:
- Selective search is much better than the fixed generators.
- However, bounding-box regression almost eliminates the difference.
- Clustering allows using significantly fewer boxes than sliding windows.

[Chart: mAP on VOC07 (0.42–0.60), baseline vs. bounding-box regression (BBR), for Sel. Search (2K boxes), Slid. Win. (7K boxes), Clusters (2K boxes), and Clusters (7K boxes).]

SLIDE 23

Timings

Finding (1): streamlining accelerates SPP.

[Chart: avg. time per image [ms] (up to ~450) for SPP vs. streamlined SPP, broken down into image preparation, GPU↔CPU transfer, conv layers, spatial pooling, FC layers, and bbox regression.]

SLIDE 24

Timings

Finding (2): dropping selective search is a huge benefit.

[Chart: avg. time per image [ms] (up to ~3,000) for SPP, streamlined SPP, and Minus R, broken down into selective search, image preparation, GPU↔CPU transfer, conv layers, spatial pooling, FC layers, and bbox regression.]

SLIDE 25

Timings

Finding (2): dropping selective search is a huge benefit.

[Chart: as on the previous slide but including R-CNN, with avg. time per image [ms] up to ~12,000.]

SLIDE 26

Timings

Test-time speedups relative to R-CNN:

  R-CNN             1.0x
  SPP               4.5x
  Streamlined SPP   5.0x
  Minus R          67.5x

SLIDE 27

Conclusions

Current CNNs can localize objects well; external segmentation cues bring only a minor benefit at a great expense.

Benefits of CNN-only solutions:
- Much faster, particularly at test time.
- Much simpler and more streamlined implementations.

Future steps:
- Eliminate the remaining accuracy gap (essentially achieved in Faster R-CNN [Ren et al. 2015]).
- Beyond bounding boxes.
- Beyond detection.