Improving Object Detection with Deep Convolutional Networks via Bayesian Optimization and Structured Prediction
Yuting Zhang, Kihyuk Sohn, Ruben Villegas, Gang Pan, Honglak Lee
[LeCun et al. 1989; Sermanet et al. 2013; Girshick et al. 2014; Simonyan et al. 2014; Lin et al. 2014; and many others]
Girshick et al., "Region-based Convolutional Networks for Accurate Object Detection and Semantic Segmentation", PAMI 2015 & CVPR 2014. Image adapted from Girshick et al., 2014.
CNN outputs: Aeroplane? No. Car? Yes. Person? No. …
Pipeline: input image → region proposal → cropping → CNN feature extraction → classification
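As a rough illustration (not the authors' code), the pipeline above can be sketched in Python; `propose_regions`, `cnn_features`, and the per-class `classifiers` are hypothetical stand-ins for selective search, the CNN, and the per-class scorers:

```python
def detect(image, propose_regions, cnn_features, classifiers):
    """R-CNN-style detection: score every region proposal with per-class classifiers."""
    detections = []
    for box in propose_regions(image):        # e.g. selective search proposals
        crop = image.crop(box)                # crop/warp the proposal region
        feat = cnn_features(crop)             # CNN feature extraction on the crop
        for cls, clf in classifiers.items():  # one score function per category
            score = clf(feat)
            if score > 0:                     # keep positively scored boxes
                detections.append((box, cls, score))
    return detections
```

Non-maximum suppression and bounding-box regression would normally follow as post-processing.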
1) 1000-category classification
2) 20 categories
A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. NIPS, 2012.
→ bounding box
K. E. A. van de Sande, J. R. R. Uijlings, T. Gevers, and A. W. M. Smeulders. Segmentation as selective search for object recognition. ICCV, 2011.
Images from Krizhevsky et al. 2012 & van de Sande et al. 2011
Classification confidence for sampled bounding boxes
K. E. A. van de Sande, J. R. R. Uijlings, T. Gevers, and A. W. M. Smeulders. Segmentation as selective search for object recognition. ICCV, 2011.
33.4%
53.7%
The image is from the KITTI dataset.
e.g., a CNN-based classifier, or any score function used by a detection method.
Let $y_j = f(x, b_j)$, $j = 1, \dots, N$, be the known solutions (the detection scores of the boxes observed so far). We place a Gaussian process prior on $f$, so that
$$p(f \mid \mathcal{D}_N) \propto p(\mathcal{D}_N \mid f)\, p(f),$$
and the predictive distribution $p(y_{N+1} \mid b_{N+1}, \mathcal{D}_N)$ can be expressed as a multivariate Gaussian. The next box to evaluate maximizes the expected improvement over the current best score $\hat{y} = \max_j y_j$:
$$b_{N+1} = \operatorname*{argmax}_{b} \int_{\hat{y}}^{\infty} (y - \hat{y})\; p(y \mid b, \mathcal{D}_N)\, dy.$$
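The expected-improvement search over box coordinates can be sketched with a minimal Gaussian-process model. This is an illustrative numpy implementation, not the paper's: it assumes a zero-mean GP with a squared-exponential kernel of unit length scale over the 4-D box coordinates, and evaluates the closed-form expected improvement.

```python
import numpy as np
from math import erf, sqrt, pi

def rbf(A, B, ell=1.0):
    """Squared-exponential kernel between rows of A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / ell ** 2)

def gp_posterior(B_obs, y_obs, B_new, ell=1.0, noise=1e-6):
    """GP predictive mean and variance at candidate boxes B_new."""
    K = rbf(B_obs, B_obs, ell) + noise * np.eye(len(B_obs))
    Ks = rbf(B_new, B_obs, ell)
    mu = Ks @ np.linalg.solve(K, y_obs)
    var = 1.0 - np.einsum('ij,ji->i', Ks, np.linalg.solve(K, Ks.T))
    return mu, np.maximum(var, 1e-12)

def expected_improvement(B_obs, y_obs, B_new, ell=1.0):
    """Closed-form EI of each candidate box against the best observed score."""
    mu, var = gp_posterior(B_obs, y_obs, B_new, ell)
    s = np.sqrt(var)
    z = (mu - y_obs.max()) / s
    Phi = 0.5 * (1.0 + np.vectorize(erf)(z / sqrt(2.0)))  # standard normal CDF
    phi = np.exp(-0.5 * z ** 2) / sqrt(2.0 * pi)          # standard normal PDF
    return (mu - y_obs.max()) * Phi + s * phi
```

In practice the kernel, its hyperparameters, and the candidate set would all be tuned; here they are fixed for clarity.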
The image is from PASCAL VOC 2007.
Neither gives good localization. Take this as ONE starting point.
(centerX, centerY, height, width)
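Converting between corner and center parameterizations is a one-liner each way; this small helper (hypothetical, for illustration) assumes axis-aligned boxes given as (x1, y1, x2, y2):

```python
def corners_to_center(x1, y1, x2, y2):
    """(x1, y1, x2, y2) -> (centerX, centerY, height, width)."""
    return ((x1 + x2) / 2.0, (y1 + y2) / 2.0, y2 - y1, x2 - x1)

def center_to_corners(cx, cy, h, w):
    """(centerX, centerY, height, width) -> (x1, y1, x2, y2)."""
    return (cx - w / 2.0, cy - h / 2.0, cx + w / 2.0, cy + h / 2.0)
```

The center parameterization makes the search space for Bayesian optimization roughly translation- and scale-separable.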
(…alized by max to visualize EI in 2D)
$$y^*(x; w) = \operatorname*{argmax}_{y \in \mathcal{Y}} f(x, y; w), \qquad f(x, y; w) = w^\top \psi(x, y),$$
$$\psi(x, y) = \begin{cases} \phi(x, y), & o = +1 \\ 0, & o = -1 \end{cases}$$
$$w^* = \operatorname*{argmin}_{w} \sum_{i=1}^{M} \Delta\big(y^*(x_i; w),\, y_i\big),$$
$$\Delta(y, y_i) = \begin{cases} 1 - \mathrm{IoU}(y, y_i), & \text{if } o = o_i = 1 \\ 0, & \text{if } o = o_i = -1 \\ 1, & \text{if } o \neq o_i \end{cases}$$
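The loss Δ above combines the intersection-over-union overlap with label agreement; a direct Python transcription (helper names are ours, not the paper's) might look like:

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def structured_loss(y, y_i, o, o_i):
    """Delta(y, y_i): localization loss for two positives, zero for two
    negatives, constant 1 for a label mismatch."""
    if o != o_i:
        return 1.0
    if o == -1:
        return 0.0
    return 1.0 - iou(y, y_i)
```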
Blaschko and Lampert, "Learning to localize objects with structured output regression", ECCV 2008. CNN features.
Other related work: LeCun et al. 1989; Taskar et al. 2005; Joachims et al. 2005; Vedaldi et al. 2014; Thomson et al. 2014; and many others.
Learning $w$ using the structured SVM framework:
$$\min_{w}\;\; \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{M} \xi_i,$$
subject to
$$w^\top \psi(x_i, y_i) \ge w^\top \psi(x_i, y) + \Delta(y, y_i) - \xi_i, \quad \forall y \in \mathcal{Y},\; \forall i, \qquad \xi_i \ge 0,\; \forall i.$$
Equivalently, splitting the constraints by sample type:
$$w^\top \phi(x_i, y_i) \ge 1 - \xi_i, \quad \forall i \in \mathcal{P}_{\mathrm{pos}},$$
$$w^\top \phi(x_i, y) \le -1 + \xi_i, \quad \forall y \in \mathcal{Y},\; \forall i \in \mathcal{N}_{\mathrm{neg}},$$
$$w^\top \phi(x_i, y_i) \ge w^\top \phi(x_i, y) + \Delta_{\mathrm{loc}}(y, y_i) - \xi_i, \quad \forall y \in \mathcal{Y},\; \forall i \in \mathcal{P}_{\mathrm{pos}},$$
where $\Delta_{\mathrm{loc}}(y, y_i) = 1 - \mathrm{IoU}(y, y_i)$.
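The three constraint sets translate into per-sample hinge slacks. The following sketch (our naming, assuming a linear score f(x, y; w) = wᵀφ(x, y) already evaluated for the ground-truth box and every candidate) shows how the slack ξ_i would be computed:

```python
def hinge_slacks(score_gt, cand_scores, cand_losses, is_positive):
    """Per-sample slack xi implied by the three constraint sets above.

    score_gt    : f(x_i, y_i; w) for the ground-truth box
    cand_scores : f(x_i, y; w) for candidate boxes y in Y (assumed non-empty)
    cand_losses : Delta_loc(y, y_i) = 1 - IoU(y, y_i) per candidate
    is_positive : whether sample i belongs to the positive set
    """
    if is_positive:
        # recognition constraint: score the GT box above +1
        xi_rec = max(0.0, 1.0 - score_gt)
        # localization constraint: loss-rescaled margin over all candidates
        xi_loc = max(0.0, max((s + d - score_gt
                               for s, d in zip(cand_scores, cand_losses)),
                              default=0.0))
        return max(xi_rec, xi_loc)
    # negative image: every candidate must be scored below -1
    return max(0.0, 1.0 + max(cand_scores))
```

The candidate achieving the maximum inside `xi_loc` is exactly the "most violated" box that cutting-plane training would add to the working set.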
Recognition / Localization
L-BFGS for learning the classification layer
SGD for fine-tuning the whole CNN
Finding the most violated sample (loss-augmented inference), based on the overlap $\mathrm{IoU}(y, y_i)$.
Ground truth (GT). IoU = 0.3. IoU = 0.7.
More region proposal methods: fast (default) / extended / quality.
Randomly generate extra boxes without Bayesian optimization.
Alexe, B., Deselaers, T., & Ferrari, V. (2012). Measuring the objectness of image windows. IEEE TPAMI.
[Plot: mAP (%) vs. IoU threshold for true positives (0.1–0.9), comparing SS (~2000 boxes per image), SS + Objectness (~3000), SS extended (~3500), SS quality (~10000), SS + Local random search (~2100), and SS + FGS (~2100).]
Results: different IoU thresholds for accepting a true positive; mean average precision (mAP).
[Plot: actual time consumption in seconds (and as a ratio, 0–16%) vs. maximum FGS iteration number $t_{max}$ = 1–8, broken down into feature extraction and GP regression, etc.]
Bounding box regression is always taken as a post-processing step.

Mean Average Precision:

Method                  IoU>0.5 (standard)   IoU>0.7 (more accurate)
R-CNN (AlexNet)         58.5                 35.2
R-CNN (VGGNet)          65.4                 35.2
+ StructObj             66.6                 40.5
+ StructObj-FT          66.9                 41.8
+ FGS                   67.2                 42.7
+ FGS + StructObj       68.5                 43.0
+ FGS + StructObj-FT    68.4                 43.7

Gains over the R-CNN (VGGNet) baseline highlighted on the slides: +1.2%, +1.8%, and +3.1% at IoU>0.5; +7.8% and +8.6% at IoU>0.7.
Mean Average Precision:

Method                  IoU>0.5
R-CNN (AlexNet)         53.3
R-CNN (VGGNet)          63.0
+ StructObj             65.1
+ FGS                   64.0
+ FGS + StructObj       66.4
Network in Network*     63.8

+ FGS + StructObj gains +3.4% over the R-CNN (VGGNet) baseline and +2.6% over Network in Network.
*M. Lin, Q. Chen, and S. Yan. Network In Network. ICLR, 2014.
Good examples
Original image. Red boxes: R-CNN (VGGNet) baseline. Green boxes: ground truth (GT). Yellow boxes: ours (+ StructObj + FGS). Numbers: overlap (IoU) with GT.
Good examples (more images; same legend as above).
1. Find better bounding boxes via Bayesian optimization. 2. Improve localization sensitivity via a structured objective.
B. Alexe, T. Deselaers, and V. Ferrari. Measuring the objectness of image windows. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(11):2189–2202, Nov 2012.
Y. Bengio, P. Lamblin, D. Popovici, H. Larochelle, et al. Greedy layer-wise training of deep networks. In NIPS, 2007.
Y. Bengio, A. Courville, and P. Vincent. Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2013.
J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.
J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell. DeCAF: A deep convolutional activation feature for generic visual recognition. CoRR, abs/1310.1531, 2013.
M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes Challenge 2007 (VOC2007) Results, 2007.
M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes Challenge 2012 (VOC2012) Results, 2012.
P. Felzenszwalb, R. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part-based models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(9):1627–1645, 2010.
R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, 2014.
R. Girshick, J. Donahue, T. Darrell, and J. Malik. Region-based convolutional networks for accurate object detection and semantic segmentation. IEEE PAMI, 2015.
C. Gu, J. J. Lim, P. Arbelaez, and J. Malik. Recognition using regions. In CVPR, 2009.
Y. Jia et al. Caffe: Convolutional architecture for fast feature embedding. CoRR, abs/1408.5093, 2014.
D. R. Jones. A taxonomy of global optimization methods based on response surfaces. Journal of Global Optimization, 21(4):345–383, 2001.
A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.
Y. LeCun et al. Backpropagation applied to handwritten zip code recognition. Neural Computation, 1(4):541–551, 1989.
H. Lee, R. Grosse, R. Ranganath, and A. Y. Ng. Unsupervised learning of hierarchical representations with convolutional deep belief networks. Communications of the ACM, 54(10):95–103, 2011.
J. Mockus, V. Tiesis, and A. Zilinskas. The application of Bayesian methods for seeking the extremum. Towards Global Optimization, 2(117–129):2, 1978.
C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning (Adaptive Computation and Machine Learning). The MIT Press, 2006.
O. Russakovsky, J. Deng, H. Su, et al., A. C. Berg, and L. Fei-Fei. ImageNet Large Scale Visual Recognition Challenge, 2014.
S. Schulter, P. Wohlhart, P. M. Roth, and H. Bischof. Accurate object detection with joint classification-regression random forests. In CVPR, 2014.
P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun. OverFeat: Integrated recognition, localization and detection using convolutional networks. In ICLR, 2014.
J. Snoek, H. Larochelle, and R. P. Adams. Practical Bayesian optimization of machine learning algorithms. In NIPS, 2012.
C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. arXiv preprint arXiv:1409.4842, 2014.
J. R. R. Uijlings, K. E. A. van de Sande, T. Gevers, and A. W. M. Smeulders. Selective search for object recognition. International Journal of Computer Vision, 104(2):154–171, 2013.