SSD: Single Shot MultiBox Detector

SLIDE 1

SSD: Single Shot MultiBox Detector

Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, Alexander C. Berg Slides by: Sulabh Shrestha

SLIDE 2

Receptive Field

Ref: https://cv-tricks.com/object-detection/single-shot-multibox-detector-ssd/

▪ Deep feature maps
  ▪ Smaller spatial size
  ▪ Larger receptive fields
  ▪ May miss small objects

▪ Shallow feature maps
  ▪ Larger spatial size
  ▪ Smaller receptive fields
  ▪ May not be able to see larger objects

▪ Use multiple feature maps, each for objects whose size matches its receptive field
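To make the "deeper means larger receptive field" point concrete, here is a minimal sketch that tracks the receptive field through a stack of layers using the standard recurrence (receptive field grows by (kernel - 1) times the accumulated stride). The layer stack below is hypothetical and is not the actual VGG configuration.

```python
def receptive_field(layers):
    """Return the receptive field size after each layer.

    layers: list of (kernel_size, stride) tuples, applied in order.
    """
    rf, jump = 1, 1  # jump = accumulated stride in input pixels
    sizes = []
    for kernel, stride in layers:
        rf += (kernel - 1) * jump
        jump *= stride
        sizes.append(rf)
    return sizes


# Hypothetical stack: two 3x3 convs, a 2x2 stride-2 pool, two more 3x3 convs
layers = [(3, 1), (3, 1), (2, 2), (3, 1), (3, 1)]
print(receptive_field(layers))
```

Each later layer sees a wider patch of the input, which is why the deeper, smaller feature maps are better suited to larger objects.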

SLIDE 3

Architecture

▪ Base network + extra feature layers
▪ No FC layer
▪ Specific feature maps are responsive to particular scales of objects
  ▪ Not necessarily the same as the receptive field
  ▪ A hyper-parameter
  ▪ Dependent on the data

[Figure labels: VGG; 8x8 feature map; 4x4 feature map]

SLIDE 4

Base Network

▪ VGG-16
▪ Pool5 changed:
  ▪ 3x3 kernel instead of 2x2
  ▪ Stride 1 instead of 2
▪ First 2 FC layers replaced by convolutional layers
  ▪ As in DeepLab-LargeFOV
▪ Last FC layer removed altogether
▪ No dropout used
▪ Conv4_3 also used for prediction
  ▪ 4th group of conv layers, 3rd layer

Ref: Very Deep Convolutional Networks for Large-Scale Image Recognition

SLIDE 5

Multiple Default Boxes

▪ Similar to the anchor boxes of Faster R-CNN
▪ Example feature map:
  ▪ Size m x n
  ▪ p channels
▪ For each location (i, j):
  ▪ Multiple default boxes (k)
  ▪ 3 x 3 x p convolutional kernels predict, for each box:
    ▪ Confidence of each class, c_i, i ∈ [1, C]
    ▪ x, y, w, h
    ▪ (C + 4) outputs in total
▪ Total outputs for 1 feature map:
  ▪ m * n * k * (#classes + 4)
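The output count above can be checked with a few lines of arithmetic. This is only a sketch: the feature-map sizes and boxes per location in `SSD300_LAYOUT` are an assumption taken from the SSD300 configuration described in the SSD paper, not from these slides, and `NUM_CLASSES = 21` assumes 20 PASCAL VOC classes plus background.

```python
def outputs_per_feature_map(m, n, k, num_classes):
    """Predicted values for one m x n feature map with k default boxes per location."""
    return m * n * k * (num_classes + 4)


# Assumed SSD300 layout: (feature map side, default boxes per location)
SSD300_LAYOUT = [(38, 4), (19, 6), (10, 6), (5, 6), (3, 4), (1, 4)]
NUM_CLASSES = 21  # assumption: 20 VOC classes + background

total_boxes = sum(side * side * k for side, k in SSD300_LAYOUT)
total_outputs = sum(
    outputs_per_feature_map(side, side, k, NUM_CLASSES) for side, k in SSD300_LAYOUT
)
print(total_boxes, total_outputs)
```

Under these assumptions the six feature maps together yield 8732 default boxes per image.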

SLIDE 6

Scale and Aspect ratio

▪ How many default boxes per location?
▪ Scale
  ▪ Related to, but not exactly the same as, the receptive field
  ▪ If m feature maps are used for prediction, scales range from s_min to s_max:
    ▪ s_min = 0.2
    ▪ s_max = 0.9
  ▪ E.g.
    ▪ s = 0.2
    ▪ img-size = 300
    ▪ Corresponding default box size = 0.2 * 300 = 60
▪ Aspect ratios (a_r)
  ▪ a_r ∈ {1, 2, 3, 1/2, 1/3}, which determines k, the number of default boxes per location
  ▪ Width: $w_k^a = s_k \sqrt{a_r}$
  ▪ Height: $h_k^a = s_k / \sqrt{a_r}$
  ▪ E.g. s = 0.2, img-size = 300
    ▪ a_r = 1 -> w = 0.2 * 300 = 60, h = 0.2 * 300 = 60
    ▪ a_r = 2 -> w = 0.2 * √2 * 300 ≈ 85, h = 0.2 / √2 * 300 ≈ 42
    ▪ a_r = 1/2 -> w = 0.2 * √(1/2) * 300 ≈ 42, h = 0.2 / √(1/2) * 300 ≈ 85
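The scale and aspect-ratio arithmetic on this slide can be reproduced with a short sketch. The width and height formulas are the ones given above; the linear spacing of scales between s_min and s_max in `scales` is an assumption based on the rule in the SSD paper, since the slide lists only the two endpoints. The function names are illustrative.

```python
import math


def scales(m, s_min=0.2, s_max=0.9):
    """Scales s_1..s_m, assumed to be linearly spaced between s_min and s_max."""
    if m == 1:
        return [s_min]
    return [s_min + (s_max - s_min) * (k - 1) / (m - 1) for k in range(1, m + 1)]


def box_size(s, ar, img_size=300):
    """Default box width and height in pixels for scale s and aspect ratio ar."""
    w = s * math.sqrt(ar) * img_size
    h = s / math.sqrt(ar) * img_size
    return w, h


for ar in (1, 2, 3, 1 / 2, 1 / 3):
    w, h = box_size(0.2, ar)
    print(f"ar={ar:.2f}: w={w:.0f}, h={h:.0f}")
```

Note that width times height is the same for every aspect ratio at a given scale, so the aspect ratio changes the box shape but not its area.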

SLIDE 7

Training

  • Base network pre-trained on the ImageNet CLS-LOC dataset
  • Fine-tuned for the respective dataset
  • Matching strategy
    • Any default box with IoU(default box, ground truth) > 0.5 → positive
    • Simplifies the learning problem
    • Can detect an object in multiple overlapping default boxes
  • Loss
    • Confidence loss (c)
      • Softmax loss over multiple classes
    • Localization loss (x, y, w, h)
      • Smooth L1 loss
      • Predicted box (l) vs ground truth box (g), both expressed relative to the default box
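The threshold part of the matching strategy can be sketched as follows. This is a minimal illustration, assuming boxes are given as (x_min, y_min, x_max, y_max) tuples. It covers only the "IoU > 0.5 is positive" rule from the slide, and the helper names are hypothetical.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def match_default_boxes(default_boxes, gt_boxes, threshold=0.5):
    """Return (default_index, gt_index) pairs whose IoU exceeds the threshold."""
    return [
        (d, g)
        for d, default_box in enumerate(default_boxes)
        for g, gt_box in enumerate(gt_boxes)
        if iou(default_box, gt_box) > threshold
    ]


defaults = [(0, 0, 10, 10), (2, 0, 12, 10), (20, 20, 30, 30)]
ground_truth = [(1, 0, 11, 10)]
print(match_default_boxes(defaults, ground_truth))
```

In this example two overlapping default boxes are both marked positive for the same ground truth box, which is the behaviour the slide describes.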

Ref: https://github.com/rbgirshick/py-faster-rcnn/files/764206/SmoothL1Loss.1.pdf
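For reference, the Smooth L1 loss used for localization can be written in a few lines. This sketch uses the common form with the switch point at 1 (quadratic below it, linear above it). Summing over the four box parameters in `localization_loss` is an illustrative choice, and the function names are hypothetical.

```python
def smooth_l1(x):
    """Smooth L1: 0.5 * x^2 if |x| < 1, otherwise |x| - 0.5."""
    abs_x = abs(x)
    return 0.5 * x * x if abs_x < 1.0 else abs_x - 0.5


def localization_loss(predicted, target):
    """Sum of Smooth L1 over the box parameters (e.g. x, y, w, h)."""
    return sum(smooth_l1(p - t) for p, t in zip(predicted, target))


print(localization_loss([0.5, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 2.0]))
```

The quadratic region keeps gradients small near zero error, while the linear region limits the influence of large errors.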

SLIDE 8

Results

[Table: PASCAL VOC2007 test detection results]

[Table: PASCAL VOC2012 test detection results]

SLIDE 9

Inference

  • Filter out boxes with low confidence
  • Apply NMS with an IoU threshold of 0.45
  • Keep the top 200 detections
  • Better mAP
  • Higher FPS

[Table: VOC2007 test data]
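The three inference steps (confidence filtering, NMS, top-200) can be sketched for a single class as below. This is a simplified illustration, not the reference implementation: boxes are assumed to be (x_min, y_min, x_max, y_max) tuples, the confidence threshold of 0.01 is an assumption not given on the slide, and `postprocess` is a hypothetical name.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def postprocess(boxes, scores, conf_threshold=0.01, iou_threshold=0.45, top_k=200):
    """Return indices of kept boxes, highest score first."""
    # 1. Filter out low-confidence boxes (threshold value is an assumption)
    candidates = [i for i, score in enumerate(scores) if score > conf_threshold]
    # 2. Greedy NMS: visit boxes by descending score, drop heavy overlaps
    candidates.sort(key=lambda i: scores[i], reverse=True)
    keep = []
    for i in candidates:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    # 3. Keep at most top_k detections
    return keep[:top_k]


boxes = [(0, 0, 10, 10), (1, 0, 11, 10), (20, 20, 30, 30), (0, 0, 5, 5)]
scores = [0.9, 0.8, 0.7, 0.005]
print(postprocess(boxes, scores))
```

In a multi-class detector this routine would typically be run once per class on that class's scores.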

SLIDE 10

Analysis

  • Better than 2-stage networks:
    • Single network for localization and classification
  • Better than YOLO:
    • Uses multiple feature maps
    • Uses many more default boxes
    • No FC layer
  • Faster inference
    • Fewer parameters
    • Smaller input size
      • Faster R-CNN: minimum size of 600
      • YOLO: 448 x 448

SLIDE 11

Ablation Studies - 1

  • Data augmentation helps; each training image is one of:
    • The original image
    • A randomly sampled patch
    • A patch sampled so that the minimum IoU with the objects is 0.1, 0.3, 0.5, 0.7, or 0.9
  • More default boxes help
  • Using FC layers instead of atrous convolution
    • Similar result
    • About 20% slower

SLIDE 12

Ablation Studies - 2

  • Use different numbers of feature maps
    • Similar number of default boxes, to keep the comparison fair
  • More feature maps are better
    • Up to a certain extent
  • Not using boundary default boxes is better
    • Avoids default boxes lying outside the image

SLIDE 13

Thank you

Questions?