R-CNN minus R
Karel Lenc, Andrea Vedaldi
Visual Geometry Group, Department of Engineering Science
Object detection
Goal: tightly enclose objects of a certain type in a bounding box
[Figure: example detections of bikes, planes, horses, birds]
[Figure: detection pipeline: proposal generation, then a CNN evaluated on each proposed region]
Scanning windows
▶ Sliding windows
  ▶ HOG detector [Dalal Triggs 2005]
  ▶ DPM [Felzenszwalb et al. 2008]
▶ Cascaded windows
  ▶ AdaBoost [Viola Jones 2004]
  ▶ MKL [Vedaldi et al. 2009]
▶ Jumping windows
  ▶ [Sivic et al. 2008]
▶ Selective windows
Hough voting
▶ Implicit shape models [Amit Geman 1997, Leibe et al. 2003]
▶ Max margin [Maji Berg 2009], Random Forests [Gall Lempitsky 2009]
Classifiers & features
▶ linear SVMs, kernel SVM, Fisher Vectors, … [Cinbis et al. 2013, …]
▶ convolutional neural networks [Sermanet et al. 2014, Girshick et al. 2014, …]
▶ HOG, SIFT, C-SIFT, … [van de Sande et al. 2010, …]
▶ Segmentation cues, … [Shotton et al. 2008, Cinbis et al. 2013, …]
[Figure: mAP [%] on PASCAL VOC 2007 data by year, 2008 to 2015. Methods plotted: DPM [Felzenszwalb et al.], MKL [Vedaldi et al.], DPMv5 [Girshick et al.], Regionlet [Wang et al.], RCNN-Alex [Girshick et al.], RCNN-VGG [Girshick et al.]]
[Girshick et al. 2013]
Pros: simple and effective
Cons: slow, as the CNN is re-evaluated for each tested region
[Figure: one CNN (c1 c2 c3 c4 c5, f6 f7 f8, SVM) evaluated per region]
Convolutional features = local features
Region descriptor = pooled local features
▶ Spatial pyramid + max pooling [He et al. 2014]
▶ Bag of words, Fisher vector, VLAD, … [Cimpoi et al. 2015]
Orders of magnitude speedup [He et al. 2014]
[Figure: c1 c2 c3 c4 c5 computed once (local features), then a pooling encoder and f6 f7 f8 (SVM) applied per region]
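The idea that a region descriptor is just pooled local features can be sketched as follows. This is a minimal NumPy illustration, not the implementation of [He et al. 2014]: the function name `spp_region_descriptor`, the pyramid `levels=(1, 2)`, and the convention that boxes are given in feature-map coordinates are all assumptions made for the example.

```python
import numpy as np

def spp_region_descriptor(feature_map, box, levels=(1, 2)):
    """Max-pool the conv features inside `box` over a spatial pyramid.

    feature_map: array of shape (H, W, C), e.g. the last conv layer output.
    box: (x0, y0, x1, y1) in feature-map cells, with x1 and y1 exclusive.
    levels: assumed pyramid grid sizes; level n splits the box into n x n cells.
    Returns a vector of length C * sum(n * n for n in levels).
    """
    x0, y0, x1, y1 = box
    region = feature_map[y0:y1, x0:x1, :]
    h, w, _ = region.shape
    if h == 0 or w == 0:
        raise ValueError("box must cover at least one feature-map cell")
    pooled = []
    for n in levels:
        idx = np.arange(n)
        # floor/ceil boundaries keep every cell non-empty, even when n > h or n > w
        ys0 = np.floor(idx * h / n).astype(int)
        ys1 = np.ceil((idx + 1) * h / n).astype(int)
        xs0 = np.floor(idx * w / n).astype(int)
        xs1 = np.ceil((idx + 1) * w / n).astype(int)
        for i in range(n):
            for j in range(n):
                cell = region[ys0[i]:ys1[i], xs0[j]:xs1[j], :]
                pooled.append(cell.max(axis=(0, 1)))  # per-channel max over the cell
    return np.concatenate(pooled)

# The feature map is computed once; each region only costs a cheap pooling step.
fm = np.arange(32, dtype=float).reshape(4, 4, 2)
desc_full = spp_region_descriptor(fm, (0, 0, 4, 4))
desc_small = spp_region_descriptor(fm, (1, 1, 3, 3))
```

Both descriptors have the same length regardless of box size, which is what lets the fully connected layers run on regions of arbitrary shape.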
SPP-CNN results in a significant test-time speedup
However, region proposal extraction is the new bottleneck
R-CNN minus R: can we get rid of region proposal extraction?
[Figure: detection time, SPP-CNN vs. R-CNN; legend: CNN evaluation]
(SPP) R-CNN training comprises many steps
[Figure: c1 c2 c3 c4 c5, f6 f7 f8, SVM, linear regression]
With the SPP R-CNN of [He et al. 2014], fine-tuning is limited to the fully connected layers
[Figure: same pipeline, with part of the network marked as frozen]
SPP and bounding box regression can be easily implemented in a CNN (with a DAG topology) and trained jointly in one step
See also [Fast R-CNN and Faster R-CNN]
[Figure: c1 c2 c3 c4 c5, f6 f7 f8 with a separate linear regressor (frozen), compared with a single network c1 c2 c3 c4 c5, SPP, f6 f7 f8, freg]
Proposals are now very fast but very inaccurate
We let the CNN compensate with the bounding box regressor
[See also Lenc Vedaldi CVPR 2015]
[Figure: c1 c2 c3 c4 c5, f6 f7 f8, linear regression; labels: equivariant representation, invariant representation]
[Figure: example detections. Dashed line: proposals. Solid line: corrected by the CNN]
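How a regressor turns a coarse proposal (dashed) into a corrected box (solid) can be sketched as below. The slides do not state the parameterization; this example assumes the centre-offset and log-scale form commonly used with R-CNN, and the function name `apply_box_regression` is hypothetical.

```python
import numpy as np

def apply_box_regression(boxes, deltas):
    """Correct proposals with predicted offsets (assumed R-CNN-style parameterization).

    boxes:  (N, 4) array of proposals as (x0, y0, x1, y1).
    deltas: (N, 4) array of predictions as (dx, dy, dw, dh), where dx, dy shift
            the centre in units of box width/height and dw, dh are log-scale factors.
    Returns an (N, 4) array of corrected boxes as (x0, y0, x1, y1).
    """
    boxes = np.asarray(boxes, dtype=float)
    deltas = np.asarray(deltas, dtype=float)
    w = boxes[:, 2] - boxes[:, 0]
    h = boxes[:, 3] - boxes[:, 1]
    cx = boxes[:, 0] + 0.5 * w
    cy = boxes[:, 1] + 0.5 * h
    new_cx = cx + deltas[:, 0] * w          # shift centre proportionally to box size
    new_cy = cy + deltas[:, 1] * h
    new_w = w * np.exp(deltas[:, 2])        # rescale width and height
    new_h = h * np.exp(deltas[:, 3])
    return np.stack(
        [new_cx - 0.5 * new_w, new_cy - 0.5 * new_h,
         new_cx + 0.5 * new_w, new_cy + 0.5 * new_h],
        axis=1,
    )

proposal = np.array([[0.0, 0.0, 10.0, 20.0]])
unchanged = apply_box_regression(proposal, np.zeros((1, 4)))
corrected = apply_box_regression(proposal, np.array([[0.1, 0.0, np.log(2.0), 0.0]]))
```

With zero offsets the proposal is returned unchanged; a non-zero `dx` and `dw` move the centre and widen the box, which is the kind of correction that lets inaccurate fixed proposals still yield tight detections.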
Observations
▶ Selective search is much better than fixed generators
▶ However, bounding box regression almost eliminates the difference
▶ Clustering allows using significantly fewer boxes than sliding windows
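A fixed, image-independent generator of the sliding-window kind mentioned above can be sketched as follows. The scales, aspect ratios, and stride here are hypothetical values chosen for illustration, not the settings used in the talk, and the clustering-based generator is not covered.

```python
import numpy as np

def sliding_window_boxes(img_w, img_h, scales=(0.25, 0.5, 1.0),
                         aspects=(0.5, 1.0, 2.0), stride_frac=0.5):
    """Enumerate boxes on a regular grid, independent of image content.

    scales: box size as a fraction of the shorter image side (assumed values).
    aspects: width / height ratios (assumed values).
    stride_frac: step between windows as a fraction of the box width/height.
    Returns an (N, 4) array of (x0, y0, x1, y1) boxes inside the image.
    """
    short = min(img_w, img_h)
    boxes = []
    for s in scales:
        for a in aspects:
            w = s * short * np.sqrt(a)
            h = s * short / np.sqrt(a)
            if w > img_w or h > img_h:
                continue  # skip shapes that do not fit in the image
            # small epsilon so the last window flush with the border is included
            xs = np.arange(0.0, img_w - w + 1e-9, stride_frac * w)
            ys = np.arange(0.0, img_h - h + 1e-9, stride_frac * h)
            for y in ys:
                for x in xs:
                    boxes.append((x, y, x + w, y + h))
    return np.array(boxes, dtype=float).reshape(-1, 4)

boxes = sliding_window_boxes(100, 100, scales=(0.5,), aspects=(1.0,))
all_boxes = sliding_window_boxes(640, 480)
```

Because the boxes depend only on the image size, they can be computed once and reused for every image, which is why this kind of proposal generation is essentially free at test time.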
[Figure: mAP (VOC07), axis from 0.42 to 0.6, Baseline vs. BBR for each proposal generator; legible labels include Clusters (2K Boxes) and Clusters (7K Boxes)]
Finding (1): Streamlining accelerates SPP
[Figure: time breakdown for SPP vs. Streamlined SPP; components: GPU↔CPU, CONV Layers, FC Layers, Bbox Regr.]
Finding (2): Dropping selective search is a huge benefit
[Figure: time breakdown for SPP, Streamlined SPP, Minus R; components: GPU↔CPU, CONV Layers, FC Layers, Bbox Regr.]
[Figure: same breakdown with RCNN added for comparison]
Test-time speedups
[Figure: times faster than R-CNN for RCNN, SPP, Streamlined SPP, Minus R]
Current CNNs can localize objects well
▶ External segmentation cues bring only a minor benefit at a great expense
Benefits of CNN-only solutions
▶ Much faster, particularly at test time
▶ Much simpler and streamlined implementations
Future steps
▶ Essentially achieved in [Faster R-CNN, Ren et al. 2015]
▶ Beyond bounding boxes