LOGO DETECTION WITH VGG/MOBILENET SSD (Michael Sun, July – September 2019) – PowerPoint PPT presentation



SLIDE 1

LOGO DETECTION WITH VGG/MOBILENET SSD

Michael Sun July – September, 2019

SLIDE 2

SLIDE 3

DATASETS FOR FINE-TUNING

  • BelgaLogos
  • FlickrLogos
  • LogosInTheWild
  • Combined

SLIDE 4

DATASETS FOR FINE-TUNING (POST-CLEANING)

Dataset              | BelgaLogos        | BelgaLogos-top    | LogosInTheWild | LogosInTheWild-top | FlickrLogos
Available Instances  | 9844              | 1063              | 20085          | 6775               | 5968
Images               | 2650              | 688               | 6692           | 2090               | 2235
Classes              | 37 (all positive) | 10 (all positive) | 102            | 11                 | 47

SLIDE 5

NOTE ON CLEANING

  • ADDRESSING CLASS IMBALANCE (KL-DIVERGENCE)
  • SCRIPT TO RANDOMLY SAMPLE INSTANCES
  • NEGATIVE SAMPLES (POS-NEG RATIO, NEGATIVE MINING)
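The slides don't show the cleaning code itself; a minimal sketch of what the balance check and the sampling script could look like (the function names and the `(class, annotation)` layout are my assumptions):

```python
import math
import random
from collections import Counter

def kl_from_uniform(class_counts):
    """KL divergence of the empirical class distribution from uniform:
    0.0 means perfectly balanced, larger means more imbalanced."""
    total = sum(class_counts.values())
    k = len(class_counts)
    return sum((n / total) * math.log((n / total) * k)
               for n in class_counts.values())

def subsample(instances, cap, seed=0):
    """Randomly keep at most `cap` instances per class.
    `instances` is a list of (class_name, annotation) pairs."""
    rng = random.Random(seed)
    by_class = {}
    for cls, ann in instances:
        by_class.setdefault(cls, []).append(ann)
    kept = []
    for cls, anns in by_class.items():
        rng.shuffle(anns)
        kept.extend((cls, a) for a in anns[:cap])
    return kept
```

Measuring KL against the uniform distribution before and after sampling gives a single number to track while trimming over-represented classes.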
SLIDE 6

HYPERPARAMETERS

AT TRAINING TIME

  • LEARNING RATE SCHEDULE (EPOCHS, RATES)
  • BATCH SIZE
  • MOMENTUM
  • AUGMENTATION PIPELINE

AT MODEL COMPILE

  • LAYER FREEZING
  • LOSS FUNCTION
  • IMAGE DIM, SCALES, AR, MISC.
SLIDE 7

FREEZING LAYERS

(BELGALOGOS-TOP) mAP across freezing configurations: 0.662/0.353, 0.623/0.332, 0.634/0.367, 0.396/0.285, 0.341/0.224

SLIDE 8

MOMENTUM

(LOGOSINTHEWILD) mAP across momentum settings: 0.013/0.008, 0.001/0.0, 0.019/0.007, 0.033/0.022, 0.043/0.012

  • INHERENT DIFFICULTY OF THIS DATASET
  • BIAS-VARIANCE TRADEOFF

SLIDE 9

LOSS FUNCTION

SSD

  • TOTAL LOSS = (CLASSIFICATION + ALPHA * LOCALIZATION) / N_POSITIVE
  • CLASSIFICATION = SUM_(POS BOXES) CROSS-ENTROPY + SUM_(HARD NEG BOXES) CROSS-ENTROPY
  • LOCALIZATION = SUM_(POS BOXES) SMOOTH-L1
  • NEG_POS_RATIO AND N_NEG_MIN CONTROL HOW MANY NEGATIVES ARE MINED
  • DEFAULT: ALPHA=1, NP=3, N_NEG_MIN=0

[Figure: classification loss (cross-entropy) and localization loss (smooth L1), computed over positive and negative ground truths]
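A single-image numpy sketch of this loss (the array layouts and names are my assumptions; the actual repo computes it on batched Keras tensors):

```python
import numpy as np

def smooth_l1(x):
    """Smooth L1 (Huber), applied elementwise."""
    ax = np.abs(x)
    return np.where(ax < 1.0, 0.5 * x ** 2, ax - 0.5)

def ssd_loss(loc_err, cls_loss, is_pos, alpha=1.0, neg_pos_ratio=3, n_neg_min=0):
    """loc_err: (boxes, 4) box-offset errors; cls_loss: (boxes,) per-box
    cross-entropy; is_pos: (boxes,) boolean mask of positive anchors."""
    n_pos = int(is_pos.sum())
    # Localization: smooth L1 over positive boxes only.
    loc = smooth_l1(loc_err[is_pos]).sum()
    # Classification: all positives plus the hardest negatives,
    # capped at neg_pos_ratio * n_pos but at least n_neg_min.
    n_neg = max(neg_pos_ratio * n_pos, n_neg_min)
    hard_negs = np.sort(cls_loss[~is_pos])[::-1][:n_neg]
    cls = cls_loss[is_pos].sum() + hard_negs.sum()
    # Normalize by the number of positive boxes, as in the SSD paper.
    return (cls + alpha * loc) / max(1, n_pos)
```

Hard negative mining keeps only the negatives the model currently gets most wrong, which is what `neg_pos_ratio` and `n_neg_min` bound.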

SLIDE 10

BATCH SIZE

  • BIAS OF EACH MODEL
  • EXPLANATION OF LOSS SPIKES AT LR DROPS

mAP across batch sizes: 0.158/0.169, 0.257/0.232, 0.327/0.308

SLIDE 11

CLASSIFICATION LOSS

  • (METHOD 1, ORIGINAL REPO) REGULAR CROSS-ENTROPY
  • (METHOD 2) WEIGH EACH CLASS C BY (TOTAL - #SAMPLES IN C)
  • (METHOD 3) METHOD 2 BUT ONLY FOR POSITIVE CLASSES
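A sketch of methods 2 and 3 (the normalization by the total count and the weight of 1.0 for non-positive classes are my choices, not from the slides):

```python
from collections import Counter

def class_weights(labels, positive_classes=None):
    """Weight each class c by (total - count(c)), so rarer classes
    weigh more (method 2). If `positive_classes` is given (method 3),
    classes outside it (e.g. background) keep weight 1.0."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {
        c: 1.0 if positive_classes is not None and c not in positive_classes
        else (total - n) / total
        for c, n in counts.items()
    }
```

These weights would multiply the per-class cross-entropy terms in the classification loss.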
SLIDE 12

WEIGHTING METHOD

(ALL CARS COMBINED DATASET)

SLIDE 13

ALPHA AND NEG-POS RATIO

(NBA LOGOS: KIA AND ADIDAS)

SLIDE 14

EXPERIMENTS: LR SCHEDULE

(BEST CARS DATASET: CITROEN, FERRARI, KIA, MERCEDES, AUDI, BMW)

Epochs of LR drop | LR schedule | Val mAP
0, 10             | 5e-5, 1e-5  |
0, 10             | 2e-5, 1e-5  | 0.029
(constant)        | 5e-5        | 0.125
(constant)        | 3e-5        | 0.083
(constant)        | 1e-5        | 0.019

SLIDE 15

EXPERIMENTS: MOMENTUM

(BEST CARS DATASET: CITROEN, FERRARI, KIA, MERCEDES, AUDI, BMW)

Momentum (LR 5e-5 throughout) | Val mAP
0.9                           |
0.7                           | 0.045
0.5                           | 0.02
0.3                           | 0.0
0.1                           | 0.015

SLIDE 16

EXPERIMENTS: NEG-POS-RATIO & ALPHA

(BEST CARS DATASET: CITROEN, FERRARI, KIA, MERCEDES, AUDI, BMW)

Alpha | Val mAP
1     | 0.115
0.9   | 0.145
0.8   | 0.169
0.7   | 0.168

NP | Val mAP
4  | 0.03
5  | 0.115
6  | 0.015

SLIDE 17

EXPERIMENTS: MISC

(BEST CARS DATASET: CITROEN, FERRARI, KIA, MERCEDES, AUDI, BMW)

  • CONF THRESH (DEFAULT 1E-2); 1E-3 GOT 0.11 VAL MAP
  • IMG DIM (300, 350, 400, …); 400 DID WORSE ON VAL; > 400 THE GPU CAN’T HANDLE AT BATCH SIZE 32

  • MOMENTUM SCHEDULER; TRIED ONCE, COULDN’T CONVERGE
  • XAVIER-INITIALIZE CLASS/LOC LAYERS GOT 0.132 VAL MAP
SLIDE 18

SLIDE 19
ASPECT-RATIOS, SCALES

  • AFTER ANALYSIS

scales = [0.05, 0.10, 0.15, 0.2, 0.25, 0.37, 0.50]
aspect_ratios = [[1.0, 1.5, 2.0/3.0],
                 [1.0, 1.5, 2.0/3.0, 1.25, 0.8],
                 [1.0, 1.5, 2.0/3.0, 1.25, 0.8],
                 [1.0, 1.5, 2.0/3.0, 1.25, 0.8],
                 [1.0, 1.5, 2.0/3.0],
                 [1.0, 1.5, 2.0/3.0]]

  • DEFAULT

scales_pascal = [0.1, 0.2, 0.37, 0.54, 0.71, 0.88, 1.05]  # anchor box scaling factors used in the original SSD300 for the Pascal VOC datasets
aspect_ratios = [[1.0, 2.0, 0.5],
                 [1.0, 2.0, 0.5, 3.0, 1.0/3.0],
                 [1.0, 2.0, 0.5, 3.0, 1.0/3.0],
                 [1.0, 2.0, 0.5, 3.0, 1.0/3.0],
                 [1.0, 2.0, 0.5],
                 [1.0, 2.0, 0.5]]
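To make the two configurations concrete, here is a sketch of how one predictor layer turns a scale and its aspect-ratio list into per-cell anchor sizes, following the SSD300 convention with an extra ar=1 box at the geometric mean of adjacent scales (function and parameter names are mine):

```python
import math

def anchor_sizes(scale, next_scale, aspect_ratios, img_size=300):
    """Per-cell anchor (width, height) pairs, in pixels, for one
    predictor layer: w = s*sqrt(ar), h = s/sqrt(ar), plus one extra
    square box at sqrt(s * s_next)."""
    boxes = []
    for ar in aspect_ratios:
        s = scale * img_size
        boxes.append((s * math.sqrt(ar), s / math.sqrt(ar)))
    # Extra square box for ar = 1, using the geometric mean of the
    # layer's scale and the next layer's scale.
    s_prime = math.sqrt(scale * next_scale) * img_size
    boxes.append((s_prime, s_prime))
    return boxes
```

Note that the tuned config drops the smallest scale from 0.1 to 0.05 of the image side, presumably to catch the smaller logos in these datasets.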

SLIDE 20
  • PHOTOMETRIC (BRIGHTNESS, CONTRAST, SATURATION, HUE, RANDOM CHANNEL SWAP)

  • EXPAND
  • RANDOM CROP
  • RANDOM FLIP
  • RESIZE

AUGMENTATION PIPELINE

  • ORDER OF TRANSFORMATIONS
  • MIN-SCALE, MAX-SCALE
  • FREQUENCY
  • INTERPOLATION
SLIDE 21

DOUBLE FINE-TUNING

VGG-COCO SSD -> All-cars model -> Final model
  • All-cars stage: dataset of all cars, 16 classes, ~1500 images
  • Final stage: dataset of cars of interest, 3 classes, ~250 images

  • 0.018 val mAP; 10.08 val loss
  • 0.021 val mAP; 9.63 val loss
  • 0.0 val mAP; 10.43 val loss
  • Class-diff: ~0.4 AP vs 0.0 AP

SLIDE 22

WEIGHTING METHOD

(ALL CARS FINE TUNED DATASET)

  • WHY I TUNED THIS HYPERPARAMETER

After 50 epochs (per-class APs in parentheses):
  • 0.34 mAP (0.325, 0.337, 0.358)
  • 0.337 mAP (0.256, 0.354, 0.4)
  • 0.34 mAP (0.327, 0.337, 0.339)

SLIDE 23

REAL-TIME INFERENCE PIPELINE

Training (Keras and numpy; TensorFlow):
VGG/MNet SSD -> (#frames, #boxes, 12) -> Anchor Boxes -> (#frames, #boxes, 6) -> Initial filtering (0.01) -> (#frames, ?, 6) -> NMS -> (#frames, ?, 6) -> Top-k/pad-k -> (#frames, k, 6) -> Final Filter (0.5) -> User

TFLite graph -> Android:
VGG/MNet SSD -> (#frames, #boxes, 6) -> Top-k -> (#frames, k, 6) -> Filtering (0.5) -> (#frames, ?, 6) -> NMS -> (#frames,) -> User

SLIDE 24

TOP-K (TFLITE)

  • IN: (1, 8732, 6); OUT: (1, 20, 6)

VGG/MNet SSD -> (#frames, #boxes, 6) -> Top-k
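A numpy sketch of this top-k step (the column layout, with confidence in column 1, is my assumption):

```python
import numpy as np

def top_k(preds, k=20):
    """Keep the k highest-confidence boxes per frame.
    preds: (frames, boxes, 6), with confidence assumed in column 1
    of a (class_id, conf, xmin, ymin, xmax, ymax) layout."""
    out = np.empty((preds.shape[0], k, preds.shape[2]))
    for f, frame in enumerate(preds):
        # Sort descending by confidence and take the first k rows.
        idx = np.argsort(frame[:, 1])[::-1][:k]
        out[f] = frame[idx]
    return out
```

Fixing k up front gives the TFLite graph a static output shape, which is why top-k runs before the later, variable-size filtering steps.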

SLIDE 25

FILTERING (TFLITE)

  • IN: (1, 20, 6); OUT: (1, ?, 6)
  • WHY 0.5 (OURS) VS 0.01 THRESHOLD (ORIGINAL) IS OK

Top-k -> (#frames, k, 6) -> Filtering (0.5)

SLIDE 26

NON-MAX-SUPPRESSION (TFLITE -> ANDROID)

  • IN: (1, ?, 6); OUT: (1, ?, 6)
  • LIMIT TO K=10
  • WHILE BOXES REMAIN & WE HAVE FEWER THAN K:
      • ADD THE HIGHEST-CONFIDENCE BOX B
      • FOR EACH REMAINING BOX:
          • COMPUTE IOU WITH B
          • IF > 0.5, REMOVE IT

Filtering (0.5) -> (#frames, ?, 6) -> Non-max-suppression (NMS)
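The greedy NMS described by the bullets above, sketched in Python (the `(conf, xmin, ymin, xmax, ymax)` box layout is my assumption; in the deck this step runs on Android):

```python
def iou(a, b):
    """IoU of two boxes given as (xmin, ymin, xmax, ymax)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, k=10, iou_thresh=0.5):
    """Greedy NMS: repeatedly keep the highest-confidence box and
    drop every remaining box overlapping it by more than the
    threshold, until k boxes are kept or none remain.
    boxes: list of (conf, xmin, ymin, xmax, ymax)."""
    remaining = sorted(boxes, key=lambda b: b[0], reverse=True)
    kept = []
    while remaining and len(kept) < k:
        best = remaining.pop(0)
        kept.append(best)
        remaining = [b for b in remaining if iou(best[1:], b[1:]) <= iou_thresh]
    return kept
```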

SLIDE 27
  • Combination of alpha=0.5, np=5 and alpha=1.5, np=3
  • A lot of trial-and-error with LR drops and early stopping

SLIDE 28
  • (Left) Candidate for final; has mistakes
  • (Bottom) Final video sent for annotations

SLIDE 29

NOTES ON DEPTHWISE

  • FINAL 2 CONVS
  • FINAL 4 CONVS (~5 SECONDS/FRAME)
  • FINAL 6 CONVS
  • ALL CLASS LAYERS
  • ALL LOC LAYERS
  • FIRST 2 LAYERS (0 mAP)
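For context on why depthwise-separable convolutions help on-device, a quick parameter-count comparison (no-bias counts; the 3x3x512x512 layer size in the usage note is illustrative, not taken from the model):

```python
def conv_params(k, c_in, c_out):
    """Parameter count of a standard k x k convolution (no bias)."""
    return k * k * c_in * c_out

def depthwise_separable_params(k, c_in, c_out):
    """Depthwise k x k conv (one filter per input channel) followed
    by a 1 x 1 pointwise conv (no bias)."""
    return k * k * c_in + c_in * c_out
```

For a 3x3 conv with 512 input and output channels, the separable version needs 266,752 parameters versus 2,359,296, roughly a 9x reduction.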
SLIDE 30

SUGGESTIONS ON IMPROVEMENT

  • CURRENTLY FRAME-BY-FRAME TAKES 10S/FRAME
  • BATCH SIZE INFERENCE – FRAME RATE/DELAY TRADEOFF
  • (OPTIMISTIC) – SEQUENCE MODELLING
SLIDE 31

PROGRESS ON MOBILENET V2

[Architecture diagram: VGG/Mobilenet base layers (pretrained on ILSVRC/ImageNet and COCO/VOC) -> Conv-block 1 through Conv-block 6 (conv's / depthwise-separable) -> classification and localization layers]

  • Base layers
  • Conv blocks
  • Classification/localization layers

SLIDE 32

FINAL THOUGHTS

  • DIFFICULTY OF NOT HAVING A FIXED DATASET AND OF WORKING WITH NOISY DATA
  • THANKS TO UTKARSH AND GAURAV