Two-stage object detectors | CV3DST - PowerPoint PPT Presentation



SLIDE 1

Two-stage object detectors

CV3DST | Prof. Leal-Taixé

SLIDE 2

Types of object detectors

  • One-stage detectors
  • Two-stage detectors

One-stage: Image → Feature extraction → Classification (class score: cat, dog, person) + Localization (bounding box (x, y, w, h)).
Two-stage: Image → Feature extraction → Extraction of object proposals → Classification (class score: cat, dog, person) + Localization (refine bounding box (Δx, Δy, Δw, Δh)).


SLIDE 4

Localization

  • Bounding box regression

Image → Feature extraction (this time with a neural network) → Output: box coordinates (x, y, w, h). Ground truth: box coordinates. Trained with an L2 loss function.

SLIDE 5

Localization

  • Bounding box regression

Image → Convolutional Neural Network → Output: box coordinates (x, y, w, h). Ground truth: box coordinates. Trained with an L2 loss function.

SLIDE 6

Localization and classification

  • Bounding box regression

Image → Convolutional Neural Network → fully connected layers → Output: box coordinates (x, y, w, h).

SLIDE 7

Localization and classification

  • Bounding box regression

Image → Convolutional Neural Network → fully connected layers → Output: box coordinates (x, y, w, h), trained with an L2 loss, and class scores, trained with a softmax loss.

SLIDE 8

Localization and classification

  • Bounding box regression

Image → Convolutional Neural Network → regression head (Output: box coordinates) + classification head (Output: class scores).
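The two-head design above can be sketched in plain numpy; random weights and a hypothetical 512-dimensional feature vector stand in for a trained CNN backbone, so this is a shape-level sketch, not a working detector:

```python
import numpy as np

rng = np.random.default_rng(0)

def two_head_forward(features, W_reg, W_cls):
    """Shared backbone features feed two heads: a regression head
    (4 box coordinates) and a classification head (class scores)."""
    box = features @ W_reg                 # (4,) box coordinates (x, y, w, h)
    logits = features @ W_cls              # (num_classes,) class scores
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                   # softmax over class scores
    return box, probs

features = rng.standard_normal(512)        # hypothetical backbone output
W_reg = rng.standard_normal((512, 4))
W_cls = rng.standard_normal((512, 3))      # e.g. cat, dog, person

box, probs = two_head_forward(features, W_reg, W_cls)

# Training losses from the slide: L2 on the box, softmax loss on the class.
gt_box = np.array([50.0, 40.0, 100.0, 80.0])
gt_class = 1
l2_loss = np.sum((box - gt_box) ** 2)
ce_loss = -np.log(probs[gt_class])
```

The key point the sketch illustrates is that both heads consume the same feature vector, so the backbone computation is shared between localization and classification.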

SLIDE 9

Localization and classification

  • It was typical to train the classification head first, then freeze the layers
  • Then train the regression head
  • At test time, we use both!

Sermanet et al., "Integrated Recognition, Localization and Detection using Convolutional Networks", ICLR 2014

SLIDE 10

Overfeat

  • Sliding window + box regression + classification

Image (221 x 221 x 3) → Convolutional Neural Network → feature map (5 x 5 x 1024) → class scores (1000) and boxes (1000 x 4).

SLIDE 11

Overfeat

  • Sliding window + box regression + classification

Image (468 x 356 x 3)




SLIDE 15

Overfeat

  • Sliding window + box regression + classification

Image (468 x 356 x 3)

We end up with many predictions, and we have to combine them into a final detection (in Overfeat they use a greedy method).

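Overfeat's exact greedy merge is its own scheme; as an illustrative stand-in, the greedy combination step most later detectors use is non-maximum suppression (NMS). A minimal numpy sketch, assuming boxes in (x1, y1, x2, y2) format:

```python
import numpy as np

def iou(box, boxes):
    """IoU between one box and an array of boxes, all (x1, y1, x2, y2)."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = lambda b: (b[..., 2] - b[..., 0]) * (b[..., 3] - b[..., 1])
    return inter / (area(box) + area(boxes) - inter)

def nms(boxes, scores, iou_thresh=0.5):
    """Greedily keep the highest-scoring box, drop boxes that overlap it."""
    order = np.argsort(scores)[::-1]       # highest score first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        overlaps = iou(boxes[i], boxes[order[1:]])
        order = order[1:][overlaps <= iou_thresh]
    return keep

boxes = np.array([[0, 0, 10, 10], [1, 1, 11, 11], [20, 20, 30, 30]], float)
scores = np.array([0.9, 0.8, 0.7])
print(nms(boxes, scores))  # the two near-identical boxes collapse to one
```

This captures the spirit of the slide: many overlapping window predictions are reduced to one detection per object by a greedy rule.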

SLIDE 17

Overfeat

  • In practice: use many sliding window locations and multiple scales

Window positions + score maps → box regression outputs → final predictions

SLIDE 18

Overfeat

  • Sliding window + box regression + classification

Image (221 x 221 x 3) → Convolutional Neural Network → feature map (5 x 5 x 1024) → class scores (1000) and boxes (1000 x 4).

What prevents us from dealing with any image size?

SLIDE 19

What about multiple objects?

  • Localization: regression
  • How about detection?
SLIDE 20

What about multiple objects?

  • Localization: regression
  • How about detection?

3 objects means having an output of 12 numbers (3 x 4)

SLIDE 21

What about multiple objects?

  • Localization: regression
  • How about detection?

14 objects means having an output of 56 numbers (14 x 4)

SLIDE 22

What about multiple objects?

  • Localization: regression
  • How about detection?
  • Having a variable-sized output is not optimal for neural networks
  • There are a couple of workarounds:

– RNN: Romera-Paredes and Torr, "Recurrent Instance Segmentation", ECCV 2016.
– Set prediction: Rezatofighi, Kaskman, Motlagh, Shi, Cremers, Leal-Taixé, Reid, "Deep Perm-Set Net: Learn to predict sets with unknown permutation and cardinality using deep neural networks", arXiv:1805.00613.

SLIDE 23

Detection as classification?

  • Localization: regression
  • How about detection?

Is this a flamingo? NO


SLIDE 25

Detection as classification?

  • Localization: regression
  • How about detection?

Is this a flamingo? YES!

SLIDE 26

Detection as classification?

  • Localization: regression
  • How about detection? Classification
  • Problem:

– Expensive to try all possible positions, scales and aspect ratios
– How about trying only a subset of boxes with the most potential?

SLIDE 27

Region proposals

  • We have already seen a method that gives us "interesting" regions in an image that potentially contain an object
  • Step 1: Obtain region proposals
  • Step 2: Classify them

SLIDE 28

The R-CNN family

SLIDE 29

R-CNN

Girshick et al., "Rich feature hierarchies for accurate object detection and semantic segmentation", CVPR 2014

SLIDE 30

R-CNN

Warp each proposal to a fixed size (227 x 227) → extract features → classification head + regression head to refine the bounding box location.

SLIDE 31

R-CNN

  • Training scheme:

– 1. Pre-train the CNN on ImageNet
– 2. Fine-tune the CNN on the number of classes the detector aims to classify (softmax loss)
– 3. Train a linear Support Vector Machine classifier to classify image regions. One SVM per class! (hinge loss)
– 4. Train the bounding box regressor (L2 loss)

SLIDE 32

R-CNN

  • PROS:

– The pipeline of proposals, feature extraction and SVM classification is well known and tested. Only the features change (CNN instead of HOG).
– The CNN summarizes each proposal into a 4096-dimensional vector (a much more compact representation than HOG).
– Leverages transfer learning: the CNN can be pre-trained for image classification with C classes. One only needs to change the FC layers to deal with Z classes.

SLIDE 33

R-CNN

  • CONS:

– Slow! 47 s/image with a VGG16 backbone. One considers around 2000 proposals per image; they need to be warped and forwarded through the CNN.
– Training is also slow and complex.
– The object proposal algorithm is fixed. Feature extraction and the SVM classifier are trained separately → not exploiting learning to its full potential.

Let us try to solve this first.

SLIDE 34

SPP-Net

He et al., "Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition", ECCV 2014

How do we "pool" these features into a common size? (The convolutional feature extractor stays frozen.)

SLIDE 35

SPP-Net

  • It solved the R-CNN problem of being slow at test time
  • It still has some problems inherited from R-CNN:

– Training is still slow (a bit faster than R-CNN)
– The training scheme is still complex
– Still no end-to-end training

SLIDE 36

Fast R-CNN

SLIDE 37

Fast R-CNN

Girshick, "Fast R-CNN", ICCV 2015. Slide credit: Ross Girshick

Shared computation at test time (like SPP)

SLIDE 38

Fast R-CNN

Region of Interest Pooling

SLIDE 39

Fast R-CNN: RoI Pooling

  • Region of Interest Pooling

Image (N x M x 3) → Convolutional Neural Network → feature map (L x K x C); boxes (1000 x 4) → class scores (1000). The FC layers expect a fixed size (H x W x C).

SLIDE 40

Fast R-CNN: RoI Pooling

  • Region of Interest Pooling

The FC layers expect a fixed size (H x W x C), so we have to transform the feature map of each box into size (H x W x C).

SLIDE 41

Fast R-CNN: RoI Pooling

  • Region of Interest Pooling

Zoom in on the region of the feature map (L x K x C) that corresponds to one box.

SLIDE 42

Fast R-CNN: RoI Pooling

  • Region of Interest Pooling

We put an H x W grid on top of the region.

SLIDE 43

Fast R-CNN: RoI Pooling

  • Region of Interest Pooling

Pooling over each grid cell gives a feature map of size (H x W x C).

SLIDE 44

Fast R-CNN: RoI Pooling

  • RoI Pooling: how do you do backpropagation? Like max-pooling!
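The grid-plus-max-pooling step of the last few slides can be sketched in numpy. This assumes integer box coordinates in feature-map units; real implementations handle sub-pixel boxes and are vectorized:

```python
import numpy as np

def roi_pool(feature_map, box, out_h, out_w):
    """Crop box = (x1, y1, x2, y2) from feature_map (L x K x C) and
    max-pool it onto an out_h x out_w grid, producing the fixed
    (H x W x C) size that the FC layers expect."""
    x1, y1, x2, y2 = box
    crop = feature_map[y1:y2, x1:x2]               # region for this box
    h, w, _ = crop.shape
    ys = np.linspace(0, h, out_h + 1).astype(int)  # H x W grid on top
    xs = np.linspace(0, w, out_w + 1).astype(int)
    out = np.zeros((out_h, out_w, feature_map.shape[2]))
    for i in range(out_h):
        for j in range(out_w):
            cell = crop[ys[i]:max(ys[i + 1], ys[i] + 1),
                        xs[j]:max(xs[j + 1], xs[j] + 1)]
            out[i, j] = cell.max(axis=(0, 1))      # max-pool each grid cell
    return out

fm = np.random.default_rng(0).standard_normal((32, 48, 8))  # (L, K, C)
pooled = roi_pool(fm, (5, 3, 29, 27), out_h=7, out_w=7)
print(pooled.shape)  # (7, 7, 8)
```

Because each output value is a max over a cell, backpropagation routes the gradient to the arg-max location, exactly as in ordinary max-pooling.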

SLIDE 45

Fast R-CNN results

  • VGG-16 CNN on the Pascal VOC 2007 dataset

Training time: R-CNN 84 hours vs. Fast R-CNN 9.5 hours (speedup: 8.8x). Faster!

SLIDE 46

Fast R-CNN results

  • VGG-16 CNN on the Pascal VOC 2007 dataset

Training time: R-CNN 84 hours vs. Fast R-CNN 9.5 hours (speedup: 8.8x). Faster!
Test time per image: R-CNN 47 seconds vs. Fast R-CNN 0.32 seconds (speedup: 146x). FASTER!

SLIDE 47

Fast R-CNN results

  • VGG-16 CNN on the Pascal VOC 2007 dataset

Training time: R-CNN 84 hours vs. Fast R-CNN 9.5 hours (speedup: 8.8x). Faster!
Test time per image: R-CNN 47 seconds vs. Fast R-CNN 0.32 seconds (speedup: 146x). FASTER!
mAP (VOC 2007): R-CNN 66.0 vs. Fast R-CNN 66.9. Better!

SLIDE 48

Fast R-CNN results

  • VGG-16 CNN on the Pascal VOC 2007 dataset

Training time: R-CNN 84 hours vs. Fast R-CNN 9.5 hours (speedup: 8.8x).
Test time per image: R-CNN 47 seconds vs. Fast R-CNN 0.32 seconds (speedup: 146x).
mAP (VOC 2007): R-CNN 66.0 vs. Fast R-CNN 66.9.

The test times do not include proposal generation!

SLIDE 49

Fast R-CNN results

  • VGG-16 CNN on the Pascal VOC 2007 dataset

With proposals included, the test time per image is: R-CNN 50 seconds vs. Fast R-CNN 2 seconds (speedup: 25x). mAP (VOC 2007): 66.0 vs. 66.9.

SLIDE 50

Faster R-CNN

SLIDE 51

Faster R-CNN

  • Solution: have the proposal generation integrated with the rest of the pipeline
  • A Region Proposal Network (RPN) is trained to produce region proposals directly
  • After the RPN, everything is like Fast R-CNN

Ren et al., "Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks", NIPS 2015. Slide credit: Ross Girshick

SLIDE 52

Region proposal network

  • How to extract proposals?

Image (N x M x 3) → feature map (H x W x 4096)

  • How many proposals? We need to decide on a fixed number
  • Where are they placed? Densely

SLIDE 53

Region proposal network

  • We fix the number of proposals by using a set of n = 9 anchors per location
  • 9 anchors = 3 scales and 3 aspect ratios

(Zoomed-in feature map)
SLIDE 54

Region proposal network

  • We fix the number of proposals by using a set of n = 9 anchors per location
  • 9 anchors = 3 scales and 3 aspect ratios
  • We extract a descriptor per location

(Zoomed-in feature map)
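The 9 anchors per location (3 scales x 3 aspect ratios) can be generated as below; the particular scale and ratio values are illustrative choices, not necessarily the course's exact configuration:

```python
import numpy as np

def make_anchors(cx, cy, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """9 anchors centred at (cx, cy): 3 scales x 3 aspect ratios.
    Each anchor is (xa, ya, wa, ha); ratio = height / width."""
    anchors = []
    for s in scales:
        for r in ratios:
            # keep the anchor area ~ s*s while varying the aspect ratio
            wa = s / np.sqrt(r)
            ha = s * np.sqrt(r)
            anchors.append((cx, cy, wa, ha))
    return np.array(anchors)

anchors = make_anchors(100, 100)
print(anchors.shape)  # (9, 4) -> n = 9 anchors at this location
```

Repeating this at every feature-map location yields the dense grid of H x W x n candidate boxes that the RPN scores and refines.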

SLIDE 55

Region proposal network

  • How to extract proposals?

Image (N x M x 3) → feature map (H x W x 4096) → 3x3 conv → (H x W x 256) → (H x W x n). How many anchors per image? n per location, so H x W x n in total.

SLIDE 56

Region proposal network

  • How to extract proposals?

Image (N x M x 3) → feature map (H x W x 4096) → 3x3 conv → (H x W x 256) → 1x1 conv → (H x W x (2n + 4n)): one classification score per proposal (object/non-object) and an anchor regression to the proposal box.

SLIDE 57

Region proposal network

  • How to extract proposals?

Image (N x M x 3) → feature map (H x W x 4096) → 3x3 conv → (H x W x 256) → 1x1 conv → (H x W x (2n + 4n)). Per feature-map location, we get a set of anchor corrections and a classification into object/non-object. This is the RPN.

SLIDE 58

RPN: training and losses

  • Classification ground truth: we compute p∗, which indicates how much an anchor overlaps with the ground-truth bounding boxes:

– p∗ = 1 if IoU > 0.7: the anchor represents an object (foreground)
– p∗ = 0 if IoU < 0.3: the anchor represents background
– The rest do not contribute to the training
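The labelling rule above, written out in numpy; −1 marks the anchors that do not contribute to training, and the helper names are ours:

```python
import numpy as np

def box_iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(x2 - x1, 0) * max(y2 - y1, 0)
    area = lambda c: (c[2] - c[0]) * (c[3] - c[1])
    return inter / (area(a) + area(b) - inter)

def label_anchors(anchors, gt_boxes, hi=0.7, lo=0.3):
    """p* per anchor: 1 = foreground, 0 = background, -1 = ignored."""
    labels = np.full(len(anchors), -1)
    for i, anchor in enumerate(anchors):
        best = max(box_iou(anchor, gt) for gt in gt_boxes)
        if best > hi:
            labels[i] = 1      # overlaps a ground-truth object well
        elif best < lo:
            labels[i] = 0      # clearly background
    return labels

anchors = [(0, 0, 10, 10), (100, 100, 110, 110), (0, 0, 20, 10)]
gt = [(0, 0, 10, 10)]
print(label_anchors(anchors, gt))  # [ 1  0 -1]
```

The third anchor falls in the 0.3–0.7 IoU band, so it is ignored, matching the "the rest do not contribute" rule on the slide.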

SLIDE 59

RPN: training and losses

  • For an image, we randomly sample 256 anchors to form a mini-batch (balanced objects vs. non-objects)
  • We calculate the classification loss (binary cross-entropy)
  • Those anchors that do contain an object are also used to compute the regression loss
SLIDE 60

RPN: training and losses

  • Each anchor is described by its center position, width and height: xa, ya, wa, ha
SLIDE 61

RPN: training and losses

  • Each anchor is described by its center position, width and height: xa, ya, wa, ha
  • What the network actually predicts are tx, ty, tw, th, with a smooth L1 loss on the regression targets:

– tx = (x − xa)/wa (normalized x)
– ty = (y − ya)/ha (normalized y)
– tw = log(w/wa) (normalized width)
– th = log(h/ha) (normalized height)
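The regression targets and the smooth L1 loss from this slide, written out in numpy (the threshold of 1 in the smooth L1 form is the conventional choice):

```python
import numpy as np

def regression_targets(box, anchor):
    """Targets from the slide: (x, y, w, h) ground truth vs.
    (xa, ya, wa, ha) anchor."""
    x, y, w, h = box
    xa, ya, wa, ha = anchor
    tx = (x - xa) / wa        # normalized x
    ty = (y - ya) / ha        # normalized y
    tw = np.log(w / wa)       # normalized width
    th = np.log(h / ha)       # normalized height
    return np.array([tx, ty, tw, th])

def smooth_l1(pred, target):
    """Smooth L1: quadratic for small errors, linear for large ones."""
    d = np.abs(pred - target)
    return np.where(d < 1, 0.5 * d ** 2, d - 0.5).sum()

t = regression_targets((52, 48, 110, 95), (50, 50, 100, 100))
loss = smooth_l1(np.zeros(4), t)   # loss of an all-zero prediction
```

Dividing by the anchor size and taking logs makes the targets scale-invariant, so the same regressor works for large and small anchors.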

SLIDE 62

Faster R-CNN: Training

  • In the first implementation, the RPN was trained separately from the rest
  • Now we can train jointly!
  • Four losses:
  • 1. RPN classification (object/non-object)
  • 2. RPN regression (anchor → proposal)
  • 3. Fast R-CNN classification (type of object)
  • 4. Fast R-CNN regression (proposal → box)

Slide credit: Ross Girshick
SLIDE 63

Faster R-CNN

  • 10x faster at test time w.r.t. Fast R-CNN
  • Trained end-to-end, including feature extraction, region proposals, classifier and regressor
  • More accurate, since the proposals are learned. The RPN is fully convolutional

SLIDE 64

Faster R-CNN: Results

Test time per image (with proposals): R-CNN 50 seconds, Fast R-CNN 2 seconds, Faster R-CNN 0.2 seconds (speedups: 1x, 25x, 250x). mAP (VOC 2007): 66.0, 66.9, 66.9.

SLIDE 65

Two-stage object detectors

SLIDE 66

Related works

  • Shrivastava, Gupta, Girshick. "Training region-based object detectors with online hard example mining". CVPR 2016.
  • Dai, Li, He and Sun. "R-FCN: Object detection via region-based fully convolutional networks". 2016.
  • Dai, Qi, Xiong, Li, Zhang, Hu and Wei. "Deformable convolutional networks". ICCV 2017.
  • Lin, Dollár, Girshick, He, Hariharan and Belongie. "Feature Pyramid Networks for object detection". CVPR 2017.