Two-stage object detectors



  1. Two-stage object detectors
     CV3DST | Prof. Leal-Taixé

  2. Types of object detectors
     • One-stage detectors: Image → Feature extraction → Classification (class score: cat, dog, person) and Localization (bounding box: x, y, w, h)
     • Two-stage detectors: Image → Feature extraction → Extraction of object proposals → Classification (class score: cat, dog, person) and Localization (refine bounding box: Δx, Δy, Δw, Δh)
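
The difference between predicting (x, y, w, h) directly and refining a proposal with (Δx, Δy, Δw, Δh) can be sketched as follows. The slide does not say how the offsets are applied; this sketch assumes an R-CNN-style parameterization (center shifts scaled by the box size, size changes in log space), and `apply_deltas` is a hypothetical helper name.

```python
import numpy as np

def apply_deltas(proposal, deltas):
    """Refine a proposal (x, y, w, h) with offsets (dx, dy, dw, dh).

    Assumption (not stated on the slide): (x, y) is the box center,
    center shifts are scaled by the box size, and width/height changes
    are applied in log space.
    """
    x, y, w, h = proposal
    dx, dy, dw, dh = deltas
    return np.array([x + w * dx, y + h * dy, w * np.exp(dw), h * np.exp(dh)])

# A proposal from the first stage and a small correction from the second stage.
proposal = np.array([100.0, 80.0, 50.0, 40.0])
refined = apply_deltas(proposal, np.array([0.1, -0.05, 0.0, 0.0]))
```

With all offsets at zero the proposal is returned unchanged, so the second stage only has to learn small corrections.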

  4. Localization
     • Bounding box regression
     • Image → Feature extraction (this time with a Neural Network) → Output: box coordinates (x, y, w, h)
     • L2 loss function between the output and the ground truth box coordinates

  5. Localization
     • Bounding box regression
     • Image → Convolutional Neural Network → Output: box coordinates (x, y, w, h)
     • L2 loss function between the output and the ground truth box coordinates
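
A minimal sketch of the L2 loss used for bounding box regression, assuming the loss is the plain sum of squared coordinate differences (the slides do not specify any normalization). The function name and the example boxes are illustrative.

```python
import numpy as np

def l2_box_loss(pred, target):
    """Sum of squared differences between two (x, y, w, h) boxes."""
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    return float(np.sum((pred - target) ** 2))

# Predicted box vs. ground truth box (made-up numbers).
loss = l2_box_loss([48, 52, 100, 80], [50, 50, 100, 90])
```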

  6. Localization and classification
     • Bounding box regression
     • Image → Convolutional Neural Network → Fully connected → Output: box coordinates (x, y, w, h)

  7. Localization and classification
     • Bounding box regression
     • Image → Convolutional Neural Network → Fully connected
     • Output: box coordinates (x, y, w, h), trained with an L2 loss
     • Output: class scores, trained with a softmax loss

  8. Localization and classification
     • Bounding box regression
     • Image → Convolutional Neural Network
     • Regression head → Output: box coordinates (x, y, w, h)
     • Classification head → Output: class scores
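
The two-head design can be sketched with plain NumPy: one shared feature vector feeds a regression head (L2 loss) and a classification head (softmax loss). All sizes, the random weights, and the equal weighting of the two losses are assumptions made for illustration; a random vector stands in for the CNN features.

```python
import numpy as np

rng = np.random.default_rng(0)
NUM_CLASSES, FEAT_DIM = 3, 8  # hypothetical sizes

# Stand-in for the CNN features of one image.
features = rng.normal(size=FEAT_DIM)

# Two linear heads that share the same features.
W_reg = rng.normal(size=(4, FEAT_DIM)) * 0.1            # regression head
W_cls = rng.normal(size=(NUM_CLASSES, FEAT_DIM)) * 0.1  # classification head

def softmax(z):
    z = z - z.max()  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

box = W_reg @ features             # predicted (x, y, w, h)
probs = softmax(W_cls @ features)  # predicted class probabilities

gt_box = np.array([0.5, 0.5, 0.2, 0.3])  # made-up ground truth
gt_class = 1

l2_loss = np.sum((box - gt_box) ** 2)
softmax_loss = -np.log(probs[gt_class])
total_loss = l2_loss + softmax_loss  # equal weighting is an assumption
```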

  9. Localization and classification
     • It was typical to train the classification head first, then freeze the layers
     • Then train the regression head
     • At test time, we use both!
     Sermanet et al., “Integrated Recognition, Localization and Detection using Convolutional Networks”, ICLR 2014

  10. Overfeat
      • Sliding window + box regression + classification
      • Image (221 x 221 x 3) → Convolutional Neural Network → Feature map (5 x 5 x 1024)
      • Outputs: Boxes (1000 x 4) and Class scores (1000)
      Sermanet et al., “Integrated Recognition, Localization and Detection using Convolutional Networks”, ICLR 2014

  11. Overfeat
      • Sliding window + box regression + classification
      • Image (468 x 356 x 3)
      Sermanet et al., “Integrated Recognition, Localization and Detection using Convolutional Networks”, ICLR 2014
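
A rough sketch of how sliding window positions could be enumerated on the larger image. The window size of 221 comes from the network input on the previous slide; the stride of 32 pixels and the reading of 468 x 356 as width x height are assumptions, and `sliding_windows` is a hypothetical helper.

```python
def sliding_windows(img_w, img_h, win=221, stride=32):
    """Return all (x, y, w, h) windows of size win x win that fit in the image.

    (x, y) is the top-left corner. The stride is a hypothetical choice.
    """
    windows = []
    for y in range(0, img_h - win + 1, stride):
        for x in range(0, img_w - win + 1, stride):
            windows.append((x, y, win, win))
    return windows

windows = sliding_windows(468, 356)
```

Each window would then be passed through the network, giving one class score vector and one box per window.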

  15. Overfeat
      • Sliding window + box regression + classification
      • We end up with many predictions and we have to combine them for a final detection (in Overfeat they have a greedy method)
      • Image (468 x 356 x 3)
      Sermanet et al., “Integrated Recognition, Localization and Detection using Convolutional Networks”, ICLR 2014
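
To give a feel for how many overlapping predictions can be reduced to a few detections, here is a sketch of greedy non-maximum suppression. This is a commonly used greedy scheme, swapped in for illustration; it is not the merging procedure described in the Overfeat paper. Boxes are assumed to be in corner format (x1, y1, x2, y2) rather than the (x, y, w, h) format on the slides, and the threshold of 0.5 is arbitrary.

```python
import numpy as np

def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def greedy_nms(boxes, scores, iou_thresh=0.5):
    """Keep the best-scoring box, drop boxes overlapping it, repeat."""
    order = list(np.argsort(scores)[::-1])  # indices, highest score first
    keep = []
    while order:
        best = order.pop(0)
        keep.append(int(best))
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_thresh]
    return keep

# Two heavily overlapping predictions and one separate prediction.
boxes = [(10, 10, 110, 110), (15, 12, 112, 108), (200, 200, 260, 260)]
scores = [0.9, 0.8, 0.7]
kept = greedy_nms(boxes, scores)
```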

  17. Overfeat
      • In practice: use many sliding window locations and multiple scales
      • Window positions + score maps → Box regression outputs → Final predictions
      Sermanet et al., “Integrated Recognition, Localization and Detection using Convolutional Networks”, ICLR 2014

  18. Overfeat
      • Sliding window + box regression + classification
      • Image (221 x 221 x 3) → Convolutional Neural Network → Feature map (5 x 5 x 1024)
      • Outputs: Boxes (1000 x 4) and Class scores (1000)
      • What prevents us from dealing with any image size?
      Sermanet et al., “Integrated Recognition, Localization and Detection using Convolutional Networks”, ICLR 2014

  19. What about multiple objects?
      • Localization: Regression
      • How about detection?

  20. What about multiple objects?
      • Localization: Regression
      • How about detection? 3 objects means having an output of 12 numbers (3 x 4)

  21. What about multiple objects?
      • Localization: Regression
      • How about detection? 14 objects means having an output of 56 numbers (14 x 4)

  22. What about multiple objects?
      • Localization: Regression
      • How about detection?
      • Having a variable-sized output is not optimal for Neural Networks
      • There are a couple of workarounds:
        – RNN: Romera-Paredes and Torr. Recurrent Instance Segmentation. ECCV 2016.
        – Set prediction: Rezatofighi, Kaskman, Motlagh, Shi, Cremers, Leal-Taixé, Reid. Deep Perm-Set Net: Learn to predict sets with unknown permutation and cardinality using deep neural networks. arXiv:1805.00613

  23. Detection as classification?
      • Localization: Regression
      • How about detection? Regression
      • Is this a Flamingo? NO

  25. Detection as classification?
      • Localization: Regression
      • How about detection? Regression
      • Is this a Flamingo? YES!

  26. Detection as classification?
      • Localization: Regression
      • How about detection? Classification
      • Problem:
        – Expensive to try all possible positions, scales and aspect ratios
        – How about trying only on a subset of boxes with most potential?
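
A quick count shows why trying every box is expensive. The sketch below counts all axis-aligned boxes with integer pixel extents in an image, which covers every position, scale and aspect ratio at once. The image size is borrowed from the Overfeat slides and the function name is hypothetical.

```python
def count_all_boxes(width, height):
    """Number of axis-aligned boxes with integer pixel extents in an image.

    A box is fixed by choosing a first and last column and a first and
    last row, so the two counts multiply.
    """
    column_ranges = width * (width + 1) // 2
    row_ranges = height * (height + 1) // 2
    return column_ranges * row_ranges

total = count_all_boxes(468, 356)
```

For an image of this size the count runs into the billions, which motivates classifying only a small subset of promising boxes.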

  27. Region Proposals
      • We have already seen a method that gives us “interesting” regions in an image that potentially contain an object
      • Step 1: Obtain region proposals
      • Step 2: Classify them.
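
The two steps can be sketched as a small pipeline. Both `propose_regions` and `classify_region` are hypothetical callables standing in for a real proposal method and a real classifier; the "background" label and the score threshold are also assumptions.

```python
def detect(image, propose_regions, classify_region, score_thresh=0.5):
    """Step 1: obtain region proposals. Step 2: classify each of them."""
    detections = []
    for box in propose_regions(image):              # Step 1
        label, score = classify_region(image, box)  # Step 2
        if label != "background" and score >= score_thresh:
            detections.append((box, label, score))
    return detections

# Toy stand-ins so the sketch runs end to end.
def toy_proposals(image):
    return [(0, 0, 10, 10), (5, 5, 20, 20)]  # boxes as (x, y, w, h)

def toy_classifier(image, box):
    # Pretend that only the larger box contains an object.
    return ("cat", 0.9) if box[2] > 10 else ("background", 0.8)

result = detect(None, toy_proposals, toy_classifier)
```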

  28. The R-CNN family

  29. R-CNN
      Girshick et al., “Rich feature hierarchies for accurate object detection and semantic segmentation”, CVPR 2014

  30. R-CNN
      • Warping to a fixed size 227 x 227
      • Extract features
      • Classification head
      • Regression head to refine the bounding box location
      Girshick et al., “Rich feature hierarchies for accurate object detection and semantic segmentation”, CVPR 2014
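
The warping step can be sketched as a crop followed by a resize that ignores the aspect ratio. Nearest-neighbour sampling is used here only to keep the example dependency-free; the interpolation method, the (x, y, w, h) top-left box convention, and the helper name are assumptions.

```python
import numpy as np

def warp_proposal(image, box, size=227):
    """Crop box (x, y, w, h) from an H x W x C image and resize it to size x size.

    Uses nearest-neighbour sampling; the aspect ratio of the crop is not preserved.
    """
    x, y, w, h = box
    crop = image[y:y + h, x:x + w]
    rows = np.arange(size) * h // size  # source row for each output row
    cols = np.arange(size) * w // size  # source column for each output column
    return crop[rows][:, cols]

image = np.zeros((356, 468, 3), dtype=np.uint8)  # dummy image
warped = warp_proposal(image, (40, 30, 120, 60))
```

Every proposal ends up with the same shape, so all of them can be fed to the same CNN.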

  31. R-CNN
      • Training scheme:
        – 1. Pre-train the CNN on ImageNet
        – 2. Finetune the CNN on the number of classes the detector is aiming to classify (softmax loss)
        – 3. Train a linear Support Vector Machine classifier to classify image regions. One SVM per class! (hinge loss)
        – 4. Train the bounding box regressor (L2 loss)
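
Step 3 uses one SVM per class with a hinge loss. A minimal sketch of that loss for a single region, assuming a standard binary hinge with margin 1 per class and made-up SVM scores:

```python
import numpy as np

def one_vs_rest_hinge(scores, true_class):
    """Sum of binary hinge losses, one linear SVM per class.

    The true class gets target +1, every other class gets target -1.
    """
    targets = -np.ones_like(scores)
    targets[true_class] = 1.0
    return float(np.sum(np.maximum(0.0, 1.0 - targets * scores)))

scores = np.array([2.0, -0.5, -3.0])  # hypothetical outputs of 3 SVMs
loss = one_vs_rest_hinge(scores, true_class=0)
```

Only the second SVM contributes here: its score of -0.5 is on the correct side but inside the margin.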

  32. R-CNN
      • PROS:
        – The pipeline of proposals, feature extraction and SVM classification is well-known and tested. Only the features are changed (CNN instead of HOG).
        – The CNN summarizes each proposal into a 4096-dimensional vector (a much more compact representation compared to HOG)
        – Leverage transfer learning: the CNN can be pre-trained for image classification with C classes. One only needs to change the FC layers to deal with Z classes.

  33. R-CNN
      • CONS:
        – Slow! 47 s/image with a VGG16 backbone. One considers around 2000 proposals per image, and each needs to be warped and forwarded through the CNN. (Let us try to solve this first.)
        – Training is also slow and complex
        – The object proposal algorithm is fixed. Feature extraction and SVM classifier are trained separately → not exploiting learning to its full potential.
