Fast R-CNN, Ross Girshick, Facebook AI Research (FAIR) - PowerPoint PPT Presentation



SLIDE 1

Fast R-CNN

Ross Girshick, Facebook AI Research (FAIR)

Work done at Microsoft Research

http://git.io/vBqm5

Reproducible research – get the code!

SLIDE 2

Fast Region-based ConvNets (R-CNNs) for Object Detection

Recognition: What?

[Figure: detections with class scores, e.g. car: 1.000, dog: 0.997, person: 0.992, person: 0.979, horse: 0.993]

Localization: Where?

Figure adapted from Kaiming He

SLIDE 3

Object detection renaissance (2013-present)

[Chart: mean Average Precision (mAP) on PASCAL VOC by year, 2006-2016, split into results before deep convnets and results using deep convnets]

SLIDE 4

Object detection renaissance (2013-present)

[Same PASCAL VOC mAP-by-year chart, with R-CNN v1 highlighted among the deep-convnet results]

SLIDE 5

Object detection renaissance (2013-present)

[Same PASCAL VOC mAP-by-year chart. R-CNN v1: + Accurate, - Slow, - Inelegant. Fast R-CNN: + Accurate, + Fast, + Streamlined]

SLIDE 6

Region-based convnets (R-CNNs)

  • R-CNN (aka "slow R-CNN") [Girshick et al. CVPR14]
  • SPP-net [He et al. ECCV14]

SLIDE 7

Slow R-CNN

Girshick et al. CVPR14. Input image

SLIDE 8

Slow R-CNN

Girshick et al. CVPR14. Input image; Regions of Interest (RoIs) from a proposal method (~2k)

SLIDE 9

Slow R-CNN

Girshick et al. CVPR14. Input image; RoIs from a proposal method (~2k); warped image regions

SLIDE 10

Slow R-CNN

Girshick et al. CVPR14. Input image; RoIs from a proposal method (~2k); warped image regions; forward each region through a ConvNet

SLIDE 11

Slow R-CNN

Girshick et al. CVPR14. Input image; RoIs from a proposal method (~2k); warped image regions; forward each region through a ConvNet; classify regions with SVMs (post hoc component)

SLIDE 12

Slow R-CNN

Girshick et al. CVPR14. Input image; RoIs from a proposal method (~2k); warped image regions; forward each region through a ConvNet; classify regions with SVMs and apply bounding-box regressors (both post hoc components)

SLIDE 13

What's wrong with slow R-CNN?

SLIDE 14

What's wrong with slow R-CNN?

  • Ad hoc training objectives
    • Fine-tune network with softmax classifier (log loss)
    • Train post-hoc linear SVMs (hinge loss)
    • Train post-hoc bounding-box regressors (squared loss)

SLIDE 15

What's wrong with slow R-CNN?

  • Ad hoc training objectives
    • Fine-tune network with softmax classifier (log loss)
    • Train post-hoc linear SVMs (hinge loss)
    • Train post-hoc bounding-box regressors (squared loss)
  • Training is slow (84h), takes a lot of disk space

SLIDE 16

What's wrong with slow R-CNN?

  • Ad hoc training objectives
    • Fine-tune network with softmax classifier (log loss)
    • Train post-hoc linear SVMs (hinge loss)
    • Train post-hoc bounding-box regressions (least squares)
  • Training is slow (84h), takes a lot of disk space
  • Inference (detection) is slow
    • 47s / image with VGG16 [Simonyan & Zisserman. ICLR15]
    • ~2000 ConvNet forward passes per image
    • Fixed by SPP-net [He et al. ECCV14]

SLIDE 17

SPP-net

He et al. ECCV14. Input image

SLIDE 18

SPP-net

He et al. ECCV14. Forward whole image through ConvNet; "conv5" feature map of image

SLIDE 19

SPP-net

He et al. ECCV14. Forward whole image through ConvNet; "conv5" feature map of image; Regions of Interest (RoIs) from a proposal method

SLIDE 20

SPP-net

He et al. ECCV14. Forward whole image through ConvNet; "conv5" feature map; RoIs from a proposal method; Spatial Pyramid Pooling (SPP) layer

SLIDE 21

SPP-net

He et al. ECCV14. Forward whole image through ConvNet; "conv5" feature map; RoIs from a proposal method; SPP layer; fully-connected layers (FCs); classify regions with SVMs (post hoc component)

SLIDE 22

SPP-net

He et al. ECCV14. Forward whole image through ConvNet; "conv5" feature map; RoIs from a proposal method; SPP layer; FCs; classify regions with SVMs and apply bounding-box regressors (both post hoc components)

SLIDE 23

What's good about SPP-net?

  • Fixes one issue with R-CNN: makes testing fast

[Diagram: the ConvNet is image-wise computation (shared); the FCs are region-wise computation; SVMs and bbox reg are post hoc components]

SLIDE 24

What's wrong with SPP-net?

  • Inherits the rest of R-CNN's problems
    • Ad hoc training objectives
    • Training is slow (25h), takes a lot of disk space

SLIDE 25

What's wrong with SPP-net?

  • Inherits the rest of R-CNN's problems
    • Ad hoc training objectives
    • Training is slow (though faster), takes a lot of disk space
  • Introduces a new problem: cannot update parameters below the SPP layer during training

SLIDE 26

SPP-net: the main limitation

He et al. ECCV14. [Diagram: only the FCs (3 layers) are trainable; the ConvNet (13 layers) is frozen; SVMs and bbox reg are post hoc components]

SLIDE 27

Fast R-CNN

  • Fast test-time, like SPP-net

SLIDE 28

Fast R-CNN

  • Fast test-time, like SPP-net
  • One network, trained in one stage

SLIDE 29

Fast R-CNN

  • Fast test-time, like SPP-net
  • One network, trained in one stage
  • Higher mean average precision than slow R-CNN and SPP-net

SLIDE 30

Fast R-CNN (test time)

Forward whole image through ConvNet; "conv5" feature map of image; Regions of Interest (RoIs) from a proposal method

SLIDE 31

Fast R-CNN (test time)

Forward whole image through ConvNet; "conv5" feature map; RoIs from a proposal method; "RoI pooling" (single-level SPP) layer

SLIDE 32

Fast R-CNN (test time)

Forward whole image through ConvNet; "conv5" feature map; RoIs from a proposal method; RoI pooling layer; fully-connected layers (FCs); linear + softmax classifier

SLIDE 33

Fast R-CNN (test time)

Forward whole image through ConvNet; "conv5" feature map; RoIs from a proposal method; RoI pooling layer; FCs; two heads: linear + softmax classifier and linear bounding-box regressors

SLIDE 34

Fast R-CNN (training)

[Diagram: same network as test time - ConvNet, FCs, linear + softmax head, linear bbox head]

SLIDE 35

Fast R-CNN (training)

Multi-task loss: log loss + smooth L1 loss
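The smooth L1 loss is less sensitive to outliers than the squared loss used by slow R-CNN's post-hoc regressors. A minimal numpy sketch of the two loss terms, combined as in the multi-task objective (function and variable names here are illustrative, not from the released code; the weight `lam` between the terms is an assumption):

```python
import numpy as np

def smooth_l1(x):
    """Smooth L1 loss, elementwise: 0.5*x^2 if |x| < 1, else |x| - 0.5."""
    x = np.asarray(x, dtype=float)
    absx = np.abs(x)
    return np.where(absx < 1.0, 0.5 * x ** 2, absx - 0.5)

def multitask_loss(cls_scores, true_class, bbox_pred, bbox_target, lam=1.0):
    """Log loss on the softmax classifier + smooth L1 over the 4 box coords."""
    # softmax log loss for the true class (shift scores for stability)
    p = np.exp(cls_scores - cls_scores.max())
    p /= p.sum()
    cls_loss = -np.log(p[true_class])
    # box loss only for non-background RoIs (class index 0 = background)
    box_loss = smooth_l1(bbox_pred - bbox_target).sum() if true_class > 0 else 0.0
    return cls_loss + lam * box_loss
```

Because smooth L1 grows linearly (not quadratically) beyond |x| = 1, a badly mislocalized box cannot dominate the gradient.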

SLIDE 36

Fast R-CNN (training)

Multi-task loss (log loss + smooth L1 loss); the entire network (ConvNet, FCs, and both heads) is trainable

SLIDE 37

Obstacle #1: Differentiable RoI pooling

Region of Interest (RoI) pooling must be (sub-)differentiable to train the conv layers
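A toy numpy sketch of what the RoI pooling forward pass computes, recording the argmax "switch" per output cell for the backward pass. This is simplified to one channel and integer bin edges; the released Caffe layer differs in detail:

```python
import numpy as np

def roi_pool_forward(feat, roi, out_size=2):
    """Max-pool one RoI of a 2-D feature map into an out_size x out_size grid.

    feat: (H, W) feature map; roi: (y0, x0, y1, x1) in feature-map coords.
    Returns the pooled output and, per output cell, the flat index of the
    winning input (the max-pooling "switch" / argmax back-pointer).
    """
    y0, x0, y1, x1 = roi
    ys = np.linspace(y0, y1, out_size + 1).astype(int)  # bin edges (rows)
    xs = np.linspace(x0, x1, out_size + 1).astype(int)  # bin edges (cols)
    out = np.zeros((out_size, out_size))
    switch = np.zeros((out_size, out_size), dtype=int)
    for i in range(out_size):
        for j in range(out_size):
            # guarantee each bin covers at least one cell
            sub = feat[ys[i]:max(ys[i + 1], ys[i] + 1),
                       xs[j]:max(xs[j + 1], xs[j] + 1)]
            dy, dx = np.unravel_index(np.argmax(sub), sub.shape)
            out[i, j] = sub[dy, dx]
            # map the argmax back to a flat index into the full feature map
            switch[i, j] = (ys[i] + dy) * feat.shape[1] + (xs[j] + dx)
    return out, switch
```

Recording the switches is what makes the layer (sub-)differentiable: the backward pass only needs to know which input each output came from.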

SLIDE 38

Obstacle #1: Differentiable RoI pooling

RoI pooling routes gradients through the max-pooling "switches" (i.e. argmax back-pointers). Over regions $r$ and pooled locations $j$, the partial derivative for input $x_i$ accumulates the partial from the next layer whenever $i$ was the argmax:

$$\frac{\partial L}{\partial x_i} = \sum_{r} \sum_{j} \left[\, i = i^{*}(r, j) \,\right] \frac{\partial L}{\partial y_{rj}}$$

where $y_{rj}$ is the pooled output for region $r$ at location $j$, $i^{*}(r, j)$ is the max-pooling switch, and $[\cdot]$ is 1 if input $x_i$ was "pooled" into $y_{rj}$ and 0 otherwise.
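This gradient routing amounts to a scatter-add over the recorded switches; a simplified single-channel toy (not the released implementation), where each switch is a flat index into the feature map:

```python
import numpy as np

def roi_pool_backward(grad_out, switches, feat_shape):
    """Scatter-add pooled-output gradients back through the max switches.

    grad_out: list of (out_h, out_w) gradient arrays, one per RoI.
    switches: matching list of flat argmax indices into the feature map.
    Computes dL/dx_i = sum over regions r and locations j of
    [i == i*(r, j)] * dL/dy_{rj}.
    """
    grad_in = np.zeros(feat_shape)
    flat = grad_in.ravel()  # view into grad_in, so writes land in place
    for g, sw in zip(grad_out, switches):
        # np.add.at accumulates correctly when several outputs (possibly
        # from overlapping RoIs) share the same argmax input
        np.add.at(flat, sw.ravel(), g.ravel())
    return grad_in
```

Note the accumulation over regions: with overlapping RoIs from the same image, one input cell can win the max for several pooled outputs, and its gradient is the sum of all of them.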

SLIDE 39

Obstacle #2: efficient SGD steps

Slow R-CNN and SPP-net use region-wise sampling to make mini-batches:

  • Sample 128 example RoIs uniformly at random
  • Examples will come from different images with high probability

[Diagram: one SGD mini-batch drawn from many different images]

SLIDE 40

Obstacle #2: efficient SGD steps

Note that the receptive field for one example RoI is often very large:

  • Worst case: the receptive field is the entire image

[Diagram: example RoIs and their receptive fields in the image]

SLIDE 41

Obstacle #2: efficient SGD steps

Worst-case cost per mini-batch (crude model of computational complexity): 128*600*1000 / (128*224*224) ≈ 12x more computation than slow R-CNN, where 600*1000 is the input size for Fast R-CNN and 224*224 is the input size for slow R-CNN.

[Diagram: example RoIs and their receptive fields in the image]

SLIDE 42

Obstacle #2: efficient SGD steps

Solution: use hierarchical sampling to build mini-batches

SLIDE 43

Obstacle #2: efficient SGD steps

Solution: use hierarchical sampling to build mini-batches

  • Sample a small number of images (2)

SLIDE 44

Obstacle #2: efficient SGD steps

Solution: use hierarchical sampling to build mini-batches

  • Sample a small number of images (2)
  • Sample many examples from each image (64)

SLIDE 45

Obstacle #2: efficient SGD steps

Use the test-time trick from SPP-net during training:

  • Share computation between overlapping examples from the same image

[Diagram: example RoIs 1-3 and the union of their receptive fields (shared computation)]

SLIDE 46

Obstacle #2: efficient SGD steps

Cost per mini-batch compared to slow R-CNN (same crude cost model):

  • 2*600*1000 / (128*224*224) ≈ 0.19x the computation of slow R-CNN, where 600*1000 is the input size for Fast R-CNN and 224*224 is the input size for slow R-CNN

[Diagram: example RoIs 1-3 and the union of their receptive fields (shared computation)]
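The crude cost model just counts input pixels pushed through the convnet per mini-batch, relative to slow R-CNN's 128 warped 224x224 crops; both ratios from the slides check out arithmetically:

```python
def relative_cost(n_images, h, w, batch_size=128, crop=224):
    """Pixels through the convnet per mini-batch, relative to slow R-CNN's
    batch_size warped crop x crop regions (a crude proxy for FLOPs)."""
    return (n_images * h * w) / (batch_size * crop * crop)

# Region-wise sampling, worst case: 128 RoIs from 128 different ~600x1000 images
worst = relative_cost(128, 600, 1000)   # roughly 12x slow R-CNN
# Hierarchical sampling: the whole 128-RoI batch comes from 2 images
best = relative_cost(2, 600, 1000)      # roughly 0.19x slow R-CNN
```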

SLIDE 47

Main results

                     Fast R-CNN   R-CNN [1]   SPP-net [2]
Train time (h)       9.5          84          25
Train speedup        8.8x         1x          3.4x
Test time / image    0.32s        47.0s       2.3s
Test speedup         146x         1x          20x
mAP                  66.9%        66.0%       63.1%

Timings exclude object proposal time, which is equal for all methods. All methods use VGG16 from Simonyan and Zisserman. [1] Girshick et al. CVPR14. [2] He et al. ECCV14.


SLIDE 50

Further test-time speedups

Fully connected layers take 45% of the forward-pass time

SLIDE 51

Further test-time speedups

Compress these layers with truncated SVD

J. Xue, J. Li, and Y. Gong. Restructuring of deep neural network acoustic models with singular value decomposition. Interspeech, 2013.
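Truncated SVD factors an FC layer's u x v weight matrix as W ≈ U_t Σ_t V_t^T, replacing one layer of u*v parameters with two layers totalling t(u + v). A numpy sketch; the fc6 shapes below follow VGG16 (25088 → 4096), and t = 1024 is one plausible truncation level:

```python
import numpy as np

def compress_fc(W, t):
    """Approximate W (u x v) by two smaller factors via truncated SVD.

    Returns (A, B) with A = U_t * Sigma_t (u x t) and B = V_t^T (t x v),
    so the single FC layer y = W @ x becomes y ≈ A @ (B @ x): two FC
    layers with t*(u + v) parameters instead of u*v, and no nonlinearity
    in between.
    """
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :t] * s[:t]   # fold the singular values into the first factor
    B = Vt[:t, :]
    return A, B

# Parameter saving for a VGG16 fc6-shaped layer at truncation t = 1024
u, v, t = 4096, 25088, 1024
saving = t * (u + v) / (u * v)   # fraction of the original parameter count
```

Because the FC forward pass is just a matrix product, the compressed layer trades a small mAP drop for a proportional reduction in FC compute.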

SLIDE 52

Further test-time speedups

[Chart: per-layer timing breakdown, without SVD vs. with SVD]

SLIDE 53

Other findings

SLIDE 54

End-to-end training matters

Fast R-CNN (VGG16):

Fine-tuned layers    ≥ fc6   ≥ conv3_1   ≥ conv2_1
VOC07 mAP            61.4%   66.9%       67.2%
Test time / image    0.32s   0.32s       0.32s

Fine-tuning down to conv2_1 makes training 1.4x slower.

SLIDE 55

Multi-task training helps

Fast R-CNN (VGG16):

Multi-task training?    -       Y       -       Y
Stage-wise training?    -       -       Y       -
Test-time bbox reg.?    -       -       Y       Y
VOC07 mAP             62.6%   63.4%   64.0%   66.9%

SLIDE 56

Multi-task training helps

Same table as the previous slide. The 62.6% column: trained without a bbox regressor.

SLIDE 57

Multi-task training helps

Same table as the previous slide. The 63.4% column: trained with a bbox regressor, but it's disabled at test time.

SLIDE 58

Multi-task training helps

Same table as the previous slide. The 64.0% column: post hoc bbox regressor, used at test time.

SLIDE 59

Multi-task training helps

Same table as the previous slide. The 66.9% column: multi-task objective, using bbox regressors at test time.

SLIDE 60

What's still wrong?

  • Out-of-network region proposals
    • Selective search: 2s / im; EdgeBoxes: 0.2s / im
  • Fortunately, we have a solution
    • Our follow-up work was presented last week at NIPS

Shaoqing Ren, Kaiming He, Ross Girshick & Jian Sun. "Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks." NIPS 2015.

SLIDE 61

Fast R-CNN take-aways

  • End-to-end training of deep ConvNets for detection
  • Fast training times
  • Open source for easy experimentation
  • A large number of ImageNet detection and COCO detection methods are built on Fast R-CNN

"I think [the Fast R-CNN] code is average to somewhat above average for what it is." – sporkles on r/MachineLearning

Check out the ImageNet / COCO Challenge workshop on Thursday!

SLIDE 62

Thanks!

rbg@fb.com
http://git.io/vBqm5

Reproducible research – get the code!

SLIDE 63

Softmax works well (vs. post hoc SVMs)

Method (VGG16)   Classifier     VOC07 mAP
Slow R-CNN       Post hoc SVM   66.0%
Fast R-CNN       Post hoc SVM   66.8%
Fast R-CNN       Softmax        66.9%

SLIDE 64

More proposals can be harmful