

SLIDE 1

R-CNN minus R

Karel Lenc, Andrea Vedaldi

Visual Geometry Group, Department of Engineering Science

SLIDE 2

Object detection

Goal: tightly enclose objects of a certain type in a bounding box

“bikes” “planes” “birds” “horses”

SLIDE 3

Top performer: Region proposals + CNN

[Diagram: a proposal generator answers WHERE; a CNN evaluated on each proposal answers WHAT, labelling regions as "chair", "potted plant", or background. Girshick et al. 2013; He et al. 2014, 2015]

SLIDE 4

Top performer: Region proposals + CNN

WHAT: convolutional neural networks, e.g. region classification with AlexNet or VGG-VD [Krizhevsky et al. 2012, Simonyan & Zisserman 2014].

WHERE: a segmentation algorithm, e.g. region proposals from selective search [Uijlings et al. 2013].

SLIDE 5

Can a CNN understand where as well as what?

[Diagram: can the convolutional neural network answer the WHERE question itself, replacing proposal generation, in addition to the WHAT question?]

SLIDE 6

Approaches to object detection

Scanning windows
- Sliding windows: HOG detector [Dalal & Triggs 2005], DPM [Felzenszwalb et al. 2008]
- Cascaded windows: AdaBoost [Viola & Jones 2004], MKL [Vedaldi et al. 2009], Branch and Bound [Lampert et al. 2009]
- Jumping windows [Sivic et al. 2008]

Selective windows
- Region proposals [Endres and Hoiem 2010, Uijlings et al. 2011, Alexe et al. 2012, Gu et al. 2012]

Hough voting
- Implicit shape models [Amit & Geman 1997, Leibe et al. 2003]
- Max margin [Maji & Berg 2009], Random Forests [Gall & Lempitsky 2009]

Classifiers & features
- Linear SVMs, kernel SVMs, Fisher vectors, … [Cinbis et al. 2013, …]
- Convolutional neural networks [Sermanet et al. 2014, Girshick et al. 2014, …]
- HOG, SIFT, C-SIFT, … [van de Sande et al. 2010, …]
- Segmentation cues, … [Shotton et al. 2008, Cinbis et al. 2013, …]

SLIDE 7

Evolution of object detection

[Chart: mAP [%] (10–70) on PASCAL VOC 2007 by year (2008–2015) for DPM [Felzenszwalb et al.], MKL [Vedaldi et al.], DPMv5 [Girshick et al.], Regionlet [Wang et al.], RCNN-Alex [Girshick et al.], and RCNN-VGG [Girshick et al.].]

SLIDE 8

R-CNN [Girshick et al. 2013]

Pros: simple and effective. Cons: slow, as the CNN is re-evaluated for each tested region.

[Diagram: every proposal is cropped and passed separately through the full CNN (layers c1–c5, f6–f8) and an SVM, producing labels such as "chair", "potted plant", or background.]
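The per-region cost is easy to see in pseudocode. The sketch below is illustrative, not the authors' code: `cnn_forward` stands in for a full AlexNet/VGG forward pass, and the classifier is a hypothetical stub.

```python
def cnn_forward(patch):
    # Stand-in for a full CNN forward pass; a real R-CNN would warp the
    # patch to 224x224 and run AlexNet/VGG here.
    return sum(sum(row) for row in patch)  # dummy scalar "feature"

def rcnn_detect(image, proposals, classify):
    detections = []
    for (x1, y1, x2, y2) in proposals:       # ~2000 proposals per image
        patch = [row[x1:x2] for row in image[y1:y2]]
        feature = cnn_forward(patch)         # one forward pass PER region
        detections.append(((x1, y1, x2, y2), classify(feature)))
    return detections

image = [[1] * 8 for _ in range(8)]
proposals = [(0, 0, 4, 4), (2, 2, 8, 8)]
dets = rcnn_detect(image, proposals,
                   classify=lambda f: "object" if f > 20 else "background")
print(dets)
```

With ~2000 proposals, the expensive convolutional layers run ~2000 times per image; sharing them across regions is what SPP addresses.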

SLIDE 9

SPP R-CNN [He et al. 2014]

Convolutional features = local features; region descriptor = pooled local features.

Pooling encoders: spatial pyramid + max pooling [He et al. 2014]; bag of words, Fisher vectors, VLAD, … [Cimpoi et al. 2015].

Result: an order-of-magnitude speedup, since the convolutional layers (c1–c5) run once per image and only the pooling and fully connected layers (f6–f8, SVM) run per region.

[Diagram: shared convolutional features are pooled into one descriptor per region ("chair", "bowl", "potted plant"), each classified by the fully connected layers.]
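The key operation is pooling a fixed-length descriptor from an arbitrarily sized region of the shared feature map. A minimal single-channel sketch (illustrative only; real SPP pools every channel of the conv5 map):

```python
def spp_max_pool(fmap, box, levels=(1, 2)):
    # Max-pool the features inside `box` over a spatial pyramid: level L
    # splits the box into an L x L grid, one max per cell. The descriptor
    # length (sum of L*L over levels) is fixed regardless of the box size.
    x1, y1, x2, y2 = box
    desc = []
    for L in levels:
        for i in range(L):          # grid rows
            for j in range(L):      # grid columns
                ry1 = y1 + (y2 - y1) * i // L
                ry2 = y1 + (y2 - y1) * (i + 1) // L
                rx1 = x1 + (x2 - x1) * j // L
                rx2 = x1 + (x2 - x1) * (j + 1) // L
                desc.append(max(fmap[r][c]
                                for r in range(ry1, ry2)
                                for c in range(rx1, rx2)))
    return desc

fmap = [[1, 2, 3, 4],
        [5, 6, 7, 8],
        [9, 10, 11, 12],
        [13, 14, 15, 16]]
print(spp_max_pool(fmap, (0, 0, 4, 4)))  # 1 + 4 = 5 values
```

Any region, large or small, yields the same 5-value descriptor, so one shared convolutional pass can serve every proposal.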

SLIDE 10

Computational cost

SPP-CNN yields a significant test-time speedup; however, region proposal extraction becomes the new bottleneck. R-CNN minus R: can we get rid of region proposal extraction altogether?

[Chart: avg. detection time per image [ms] (up to ~12,000) for R-CNN vs. SPP-CNN, broken down into selective search and CNN evaluation.]

SLIDE 11

Streamlining R-CNN and SPP-CNN
Dropping proposal generation

SLIDE 12

Streamlining R-CNN and SPP-CNN
Dropping proposal generation

SLIDE 13

A complex learning pipeline

(SPP) R-CNN training comprises many steps:

1. Pre-train a large CNN (on ImageNet).
2. Extract region proposals (on PASCAL VOC).
3. Use the pre-processed regions to:
   a. fine-tune the CNN;
   b. learn an SVM to rank regions;
   c. learn a bounding-box regressor to refine localization.

[Diagram: the CNN (c1–c5, f6–f8) is fine-tuned with classification labels, an SVM is trained on ranking labels, and a linear regressor predicts the bounding box.]
SLIDE 14

A complex learning pipeline

With the SPP R-CNN of [He et al. 2014], fine-tuning is limited to the fully connected layers; the convolutional layers (c1–c5) stay frozen.

[Diagram: as on the previous slide, but with the convolutional layers marked as frozen during fine-tuning.]

SLIDE 15

Streamlining R-CNN: removing the SVM phase

Up to a simple transformation, softmax is just as good as the hinge loss for box ranking.

  Phase           Score(s)                               Learning loss                               mAP
  fine tuning     T_d = exp(⟨x_d, φ(y)⟩ + c_d)           −log( T_{d0} / (T_0 + T_1 + … + T_D) )      38.1
  region ranking  R_d = ⟨x_d, φ(y)⟩ + c_d, d = 1, …, D   max{0, 1 − z R_d}, d = 1, …, D              59.8
  region ranking  R_d = log( T_d / T_0 )                 from fine-tuning                            58.4
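The "simple transformation" in the last row of the table is a log-ratio against the background class: since T_d = exp(z_d), R_d = log(T_d / T_0) = z_d − z_0, i.e. subtracting the background logit. A small numeric sketch (function and variable names are mine, not the slides'):

```python
import math

def softmax_to_ranking(logits):
    # logits[0] is the background class. T_d = exp(logits[d]);
    # ranking score R_d = log(T_d / T_0), which reduces to
    # logits[d] - logits[0].
    T = [math.exp(z) for z in logits]
    return [math.log(t / T[0]) for t in T[1:]]

scores = softmax_to_ranking([1.0, 3.0, 0.5])
print(scores)
```

Because the transformation is monotone per class, ranking boxes by R_d needs no separately trained SVM, which is how the SVM phase can be dropped.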

SLIDE 16

Streamlining R-CNN and SPP-CNN

SPP and bounding-box regression can easily be implemented in a CNN (with a DAG topology) and trained jointly in one step. See also [Fast R-CNN and Faster R-CNN].

[Diagram: the separate pipeline (frozen conv layers, SVM, linear regressor) collapses into a single network with an SPP layer and a regression branch (f_reg) trained end to end.]

SLIDE 17

Streamlining R-CNN and SPP-CNN
Dropping proposal generation

SLIDE 18

A constant-time region proposal generator

Proposals are now very fast but very inaccurate; we let the CNN compensate through the bounding-box regressor.

Algorithm

Preprocessing:
1. Collect all the training bounding boxes (x1, y1, x2, y2).
2. Use K-means to extract K clusters in (x1, y1, x2, y2) space.

Proposal generation:
Regardless of the image, return the same K cluster centers.
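A minimal sketch of this generator (illustrative: function names are mine, and a naive first-K initialization stands in for a proper K-means seeding):

```python
def kmeans_boxes(boxes, K, iters=20):
    # Lloyd's algorithm over (x1, y1, x2, y2) vectors.
    centers = [tuple(float(v) for v in b) for b in boxes[:K]]  # naive init
    for _ in range(iters):
        clusters = [[] for _ in range(K)]
        for b in boxes:
            k = min(range(K),
                    key=lambda k: sum((b[i] - centers[k][i]) ** 2
                                      for i in range(4)))
            clusters[k].append(b)
        centers = [tuple(sum(b[i] for b in cl) / len(cl) for i in range(4))
                   if cl else centers[k]
                   for k, cl in enumerate(clusters)]
    return centers

def propose(image, centers):
    # Constant-time "proposal generation": ignore the image entirely.
    return list(centers)

training_boxes = [(0, 0, 10, 10), (50, 50, 80, 80),
                  (2, 2, 12, 12), (52, 52, 82, 82),
                  (4, 4, 14, 14), (54, 54, 84, 84)]
centers = kmeans_boxes(training_boxes, K=2)
print(propose(None, centers))
```

The cluster centers are computed once from the training set; at test time every image receives the same few thousand fixed boxes, and the regressor snaps them onto the objects.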

SLIDE 19

Proposal statistics on PASCAL VOC

[Figure: distributions of boxes for ground truth, selective search (2K boxes), sliding windows (7K boxes), and clustering (3K boxes).]

SLIDE 20

Information pathways [see also Lenc & Vedaldi, CVPR 2015]

[Diagram: shared local features (c1–c5) feed two pathways: a "where" path, an equivariant representation ending in a linear regressor that outputs the bounding box, and a "what" path, an invariant representation (f6–f8) that outputs the label.]

SLIDE 21

CNN-based bounding box regression

[Figure: dashed lines show the fixed proposals; solid lines show the boxes after correction by the CNN's bounding-box regressor.]
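The regressor's output is usually parameterized relative to the proposal. A sketch of applying R-CNN-style targets (dx, dy, dw, dh), which shift the box center and rescale its size (illustrative; not necessarily the exact parameterization used in the talk):

```python
import math

def apply_bbox_regression(box, deltas):
    # box = (x1, y1, x2, y2); deltas = (dx, dy, dw, dh) predicted by the CNN.
    x1, y1, x2, y2 = box
    w, h = x2 - x1, y2 - y1
    cx, cy = x1 + w / 2.0, y1 + h / 2.0
    dx, dy, dw, dh = deltas
    cx, cy = cx + dx * w, cy + dy * h          # shift center
    w, h = w * math.exp(dw), h * math.exp(dh)  # rescale size (log-space)
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)

print(apply_bbox_regression((10, 10, 30, 50), (0.0, 0.0, 0.0, 0.0)))
```

Because the correction is relative, even a coarse, image-independent proposal can be pulled onto a nearby object, which is what lets the fixed cluster-center proposals compete with selective search.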

SLIDE 22

Performance

Observations:
- Selective search is much better than the fixed generators.
- However, bounding-box regression almost eliminates the difference.
- Clustering allows using significantly fewer boxes than sliding windows.

[Chart: mAP on VOC07 (0.42–0.60), baseline vs. bounding-box regression (BBR), for Sel. Search (2K boxes), Slid. Win. (7K boxes), Clusters (2K boxes), and Clusters (7K boxes).]

SLIDE 23

Timings

Finding (1): streamlining accelerates SPP.

[Chart: avg. time per image [ms] (up to ~450) for SPP vs. streamlined SPP, broken down into image preparation, GPU↔CPU transfer, conv layers, spatial pooling, FC layers, and bbox regression.]

SLIDE 24

Timings

Finding (2): dropping selective search is a huge benefit.

[Chart: avg. time per image [ms] (up to ~3,000) for SPP, streamlined SPP, and Minus R, broken down into selective search, image preparation, GPU↔CPU transfer, conv layers, spatial pooling, FC layers, and bbox regression.]

SLIDE 25

Timings

Finding (2): dropping selective search is a huge benefit.

[Chart: as on the previous slide but including R-CNN, with avg. time per image [ms] up to ~12,000.]

SLIDE 26

Timings

Test-time speedups relative to R-CNN:

  R-CNN             1.0x
  SPP               4.5x
  Streamlined SPP   5.0x
  Minus R          67.5x

SLIDE 27

Conclusions

Current CNNs can localize objects well; external segmentation cues bring only a minor benefit at a great expense.

Benefits of CNN-only solutions:
- Much faster, particularly at test time.
- Much simpler and more streamlined implementations.

Future steps:
- Eliminate the remaining accuracy gap (essentially achieved in Faster R-CNN [Ren et al. 2015]).
- Beyond bounding boxes.
- Beyond detection.