

SLIDE 1

S9551 | Mar 20, 2019 | 14:00, RM 231

Turbo-boosting Neural Networks for Object Detection

Hongyang Li

The Chinese University of Hong Kong / Microsoft Research Asia

SLIDE 2

Hongyang

CUHK Ph.D. candidate / Microsoft Intern

Research Timeline

2015: Ph.D. student start; ImageNet Challenge (PAMI), Object Attributes (ICCV)
2016: Multi-bias Activation (ICML)
2017: Recurrent Design for Detection (ICCV), COCO Loss (NIPS)
2018: Zoom-out-and-in Network (IJCV), Capsule Nets (ECCV)
2019: Feature Intertwiner (ICLR), Few-shot Learning (CVPR)

First-author Papers

SLIDE 3

Outline

1. Introduction to Object Detection

a. Pipeline overview
b. Dataset and evaluation
c. Popular methods
d. Existing problems

2. Solution: A Feature Intertwiner Module

3. Detection in Reality

a. Implementation on GPUs
b. Efficiency and accuracy tradeoff

4. Future of Object Detection

SLIDE 4
  • 1. Introduction to Object Detection
SLIDE 5

Object Detection: core and fundamental task in computer vision

He et al.

Mask-RCNN

ICCV 2017 Best paper

SLIDE 6

Object Detection is everywhere

OBJECT DETECTION

SLIDE 7

How to solve it?

A naive solution: place many boxes on top of the image/feature maps and classify each one (person / not person)!

SLIDE 8

How to solve it?

And yet the challenges are:

1. Variations in shape/appearance/size, e.g., a baseball helmet vs. a cotton hat

2. Ambiguity in cluttered scenarios
SLIDE 9

How to solve it?

(a) Place as many anchors as possible and (b) make the network deeper and deeper.

(a) place anchors (b) network design
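The anchor-placement idea can be sketched in a few lines. This is an illustrative toy (the function name, scales, and aspect ratios below are made up for the example; real detectors tune strides and scales per pyramid level):

```python
import numpy as np

def make_anchors(feat_h, feat_w, stride, scales=(32, 64), ratios=(0.5, 1.0, 2.0)):
    """Place anchor boxes (x1, y1, x2, y2) at every feature-map cell.

    `ratios` is width/height; area is kept at scale**2 for every ratio.
    """
    anchors = []
    for y in range(feat_h):
        for x in range(feat_w):
            # Center of this cell in image coordinates.
            cx, cy = (x + 0.5) * stride, (y + 0.5) * stride
            for s in scales:
                for r in ratios:
                    w, h = s * np.sqrt(r), s / np.sqrt(r)
                    anchors.append([cx - w / 2, cy - h / 2,
                                    cx + w / 2, cy + h / 2])
    return np.array(anchors)

boxes = make_anchors(feat_h=2, feat_w=2, stride=16)
print(boxes.shape)  # (24, 4): 2*2 cells x 2 scales x 3 ratios
```

Even on a tiny 2x2 map this yields dozens of candidate boxes, which is why "as many anchors as possible" quickly becomes expensive.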

SLIDE 10

Popular methods at a glance

Pipeline/system design

One-stage: YOLO and variants; SSD and variants.
Two-stage: R-CNN family (Fast RCNN, Faster RCNN, etc.)

Component/structure/loss design

Feature Pyramid Network
Focal loss (RetinaNet)
Online Hard Example Mining (OHEM)
Zoom-out-and-in Network (ours)
Recurrent Scale Approximation (ours)
Feature Intertwiner (ours)

SLIDE 11

Pipeline: a roadmap of R-CNN family (two-stage detector)

P_l is the feature map output at level l; P_m is from a higher level m.

SLIDE 12

Pipeline: a roadmap of R-CNN family (two-stage detector)

P_l is the feature map output at level l; P_m is from a higher level m.

Small anchors are cropped out of P_l by the RoI operation into a fixed-size output.

SLIDE 13

Pipeline: a roadmap of R-CNN family (two-stage detector)

P_l is the feature map output at level l; P_m is from a higher level m.

The RoI features at level l yield a detection: person detected!

SLIDE 14

Pipeline: a roadmap of R-CNN family (two-stage detector)

P_l is the feature map output at level l; P_m is from a higher level m.

Large anchors are cropped out of P_m.

SLIDE 15

Pipeline: a roadmap of R-CNN family (two-stage detector)

P_l is the feature map output at level l; P_m is from a higher level m.

An RPN loss is attached at each level.

SLIDE 16

Side: what is the RoI (region of interest) operation?

The RoI operation* takes a region of a feature map of arbitrary size and pools it into a fixed-size output.

*Achieved by pooling; no learned parameters here. Many variants of RoI operations exist.
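A minimal sketch of the idea, assuming simple max-pooling over a coarse grid (the helper below is hypothetical and for illustration only; real detectors use RoIPool/RoIAlign with sub-bin sampling):

```python
import numpy as np

def roi_pool(feature, box, out_size=2):
    """Naive RoI max-pooling: crop `box` = (x1, y1, x2, y2) from a 2-D
    feature map and max-pool it into a fixed (out_size x out_size) grid.
    No learned parameters, matching the slide.
    """
    x1, y1, x2, y2 = box
    region = feature[y1:y2, x1:x2]
    h, w = region.shape
    out = np.empty((out_size, out_size), dtype=feature.dtype)
    for i in range(out_size):
        for j in range(out_size):
            # Split the region into a coarse out_size x out_size grid;
            # max() guards against empty bins for tiny regions.
            ys = slice(i * h // out_size,
                       max((i + 1) * h // out_size, i * h // out_size + 1))
            xs = slice(j * w // out_size,
                       max((j + 1) * w // out_size, j * w // out_size + 1))
            out[i, j] = region[ys, xs].max()
    return out

fmap = np.arange(36.0).reshape(6, 6)
print(roi_pool(fmap, (0, 0, 4, 4)))  # always a 2x2 output, whatever the box size
```

Whatever the input region size, the output shape is fixed, which is exactly what lets the downstream fully connected head run on proposals of any size.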

SLIDE 17

R-CNN family (two-stage detector) vs. YOLO (one-stage detector)

Two-stage (R-CNN family): the RPN solves a two-class classification problem (object or not?), followed by a K-class classification problem (dog, cat, etc.). The image size can vary.

SLIDE 18

R-CNN family (two-stage detector) vs. YOLO (one-stage detector)

Two-stage (R-CNN family): RPN two-class classification (object or not?) followed by K-class classification (dog, cat, etc.). The image size can vary. More accurate.

One-stage (YOLO/SSD): multiple K-class classifiers (dog, cat, etc.) applied directly. The image size can NOT vary. Faster.

SLIDE 19

Both R-CNN and SSD models have been widely adopted in academia and industry.

In this talk, we focus on the two-stage detector with RoI operation.

SLIDE 20

Datasets

COCO dataset

http://mscoco.org/

YouTube-8M dataset

https://research.google.com/youtube8m/

And many others

ImageNet, VisualGenome, Pascal VOC, KITTI, etc.

SLIDE 21

Evaluation - mean AP

If the IoU (intersection over union) between a prediction and the ground truth, e.g., 0.65, exceeds the threshold, the prediction is counted as correct.

For the category person:

1. Collect the set of correct/incorrect predictions and compute precision/recall.
2. Compute the average precision (AP) from the precision/recall curve.
3. Average over all categories: that's mAP (under the given threshold).
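The IoU test above is simple to write down; the helper below is an illustrative sketch (not the official COCO evaluation code):

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

# A prediction counts as correct when IoU exceeds the threshold (e.g. 0.5).
print(iou((0, 0, 10, 10), (5, 0, 15, 10)))  # 50 / 150 -> 0.333...
```

COCO mAP averages this decision over IoU thresholds from 0.5 to 0.95, which is why it is a stricter metric than the single-threshold Pascal VOC AP.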

SLIDE 22

What is uncomfortable in current pipelines?

Assume the RoI output size is 20.

Large objects: RoI input 40 → 20, accurate features from down-sampling.
Small objects: RoI input 7 → 20, inaccurate features due to up-sampling!

SLIDE 23

What percentage of objects suffer from this?

Table 3 in our paper.

Proposal assignment on each level before the RoI operation. 'below #' indicates how many proposals have a size below the RoI output size.

We define the small set to be the anchors on the current level and the large set to be all anchors above the current level.

SLIDE 24
  • 2. Solution: A Feature Intertwiner Module
SLIDE 25

Our assumption

The semantic features among instances (large or small) within the same class should be the same.

SLIDE 26

Our motivation

Small objects yield inaccurate maps/features.

Intuition: let reliable features supervise/guide the learning of the less reliable ones.

Naive feature intertwiner concept: suppose we already have two sets of features; one is from large objects and the other is from small ones.
SLIDE 27

The Feature Intertwiner

For the current level l:

  • Cls. loss
  • Reg. loss (bbox)

Make-up layer: fuels back the information lost during the RoI operation and compensates necessary details for small instances (one conv. layer). Applied to small objects.

SLIDE 28

The Feature Intertwiner

For the current level l:

  • Cls. loss
  • Reg. loss (bbox)
  • Intertwiner loss

Critic layer: transfers features to a larger channel size and reduces the spatial size to one (two conv. layers). Applied to large objects; its output is the input to the intertwiner.

SLIDE 29

The Feature Intertwiner

  • Cls. loss
  • Reg. loss (bbox)
  • Intertwiner loss

For the current level l, total loss = (intertwiner + cls. + reg.), summed over all levels.
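As a rough sketch of the supervision idea: pull each small-object feature toward a reliable per-class target. The squared-L2 form and all names below are assumptions for illustration, not the paper's exact loss:

```python
import numpy as np

def intertwiner_loss(small_feats, labels, buffer):
    """Toy intertwiner loss: distance between each less-reliable
    (small-object) feature and the buffered, reliable (large-object)
    feature of its class. Squared L2 is an illustrative choice.
    """
    targets = np.stack([buffer[c] for c in labels])  # class-wise soft targets
    return np.mean(np.sum((small_feats - targets) ** 2, axis=1))

buffer = {0: np.zeros(4), 1: np.ones(4)}            # per-class reliable features
feats = np.array([[0.0, 0, 0, 0], [1.0, 1, 1, 1]])  # already match their targets
print(intertwiner_loss(feats, [0, 1], buffer))      # 0.0
```

Minimizing this term drives the small-object features toward the class statistics gathered from large objects, which is the "reliable guides less reliable" intuition above.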

SLIDE 30

The Feature Intertwiner

Anchors are placed at various levels. What if there are no large instances in this mini-batch for the current level? We define the small set to be the anchors on the current level and the large set to be all anchors above the current level.

SLIDE 31

The Feature Intertwiner - class buffer

We use a class buffer (a historical logger) to store the accurate feature set from large instances. How do we generate the buffer?

One simple idea: take the average of the features of all large objects during training. The buffer is shared across all levels (level 2, level 3, ...).
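The running average can be maintained incrementally. A toy sketch, assuming a simple per-class incremental mean (the class name and the exact update rule are illustrative; the paper's logger may differ):

```python
import numpy as np

class ClassBuffer:
    """Running per-class average of reliable (large-object) features,
    mirroring the 'take the average during training' idea on the slide."""
    def __init__(self, num_classes, dim):
        self.mean = np.zeros((num_classes, dim))
        self.count = np.zeros(num_classes)

    def update(self, feats, labels):
        for f, c in zip(feats, labels):
            self.count[c] += 1
            # Incremental mean: m += (x - m) / n
            self.mean[c] += (f - self.mean[c]) / self.count[c]

buf = ClassBuffer(num_classes=2, dim=3)
buf.update(np.array([[1.0, 1, 1], [3.0, 3, 3]]), labels=[0, 0])
print(buf.mean[0])  # running mean of the two class-0 features: [2. 2. 2.]
```

Treating the buffered means as detached "soft targets" (no gradient flowing into the buffer) matches the replay-memory analogy discussed on the next slide.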
SLIDE 32

Discussions on Feature Intertwiner

  • The intertwiner is proposed to optimize feature learning of the less reliable set. During test, the buffer branch (the green part in the figure) is removed.

  • It can be seen as a teacher-student guidance in the self-supervised domain.

  • Detaching the gradient update in the buffer obtains better results: "soft targets", similar to the replay memory in RL.

  • The buffer is level-agnostic. Improvements over all levels/sizes of objects are observed.

SLIDE 33

The Feature Intertwiner - choosing optimal feature maps

How do we choose the appropriate maps for large objects as input to the intertwiner?

One simple solution:

(a) Use the feature map directly on the current level.

This is inappropriate. Why? Recall that we define the small set to be the anchors on the current level and the large set to be all anchors above the current level.

SLIDE 34

The Feature Intertwiner - choosing optimal feature maps

How do we choose the appropriate maps for large objects as input to the intertwiner?

Other options:

(b) Use the feature maps on a higher level.
(c) Upsample higher-level maps to the current level, with learnable parameters (or not).

We will empirically analyze these options later.

SLIDE 35

The Feature Intertwiner - choosing optimal feature maps

How do we choose the appropriate maps for large objects as input to the intertwiner?

Our final option is based on (c):

(d) Build a better alignment between the upsampled feature map and the current map.

SLIDE 36

The Feature Intertwiner - choosing optimal feature maps

How do we choose the appropriate maps for large objects as input to the intertwiner?

The approach for (d) is optimal transport (OT). In a nutshell, OT optimally moves one distribution (P_m|l) to the other (P_l).

SLIDE 37

The Feature Intertwiner - choosing optimal feature maps

How do we choose the appropriate maps for large objects as input to the intertwiner?

The approach is optimal transport (OT). In a nutshell, OT optimally moves one distribution (P_m|l) to the other (P_l). Q is a cost matrix (distance); P is a proxy matrix satisfying some constraints.

SLIDE 38

The Feature Intertwiner - choosing optimal feature maps

How do we choose the appropriate maps for large objects as input to the intertwiner?

How to compute optimal transport (OT):

1. Pass P_m through the embeddings F and H to build the cost matrix Q.
2. Run Sinkhorn iterations on Q to obtain the proxy matrix P.
3. The OT loss follows from P and Q.
SLIDE 39

The Feature Intertwiner - choosing optimal feature maps

How do we choose the appropriate maps for large objects as input to the intertwiner?

The same computation, component by component: the embeddings F and H of P_m define the cost matrix Q; Sinkhorn iterations turn Q into the proxy matrix P; P and Q give the OT loss.
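The Q → Sinkhorn → P → loss pipeline can be sketched with entropy-regularized OT. The uniform marginals, epsilon, and iteration count below are illustrative choices, not the paper's settings:

```python
import numpy as np

def sinkhorn(Q, eps=0.1, n_iters=200):
    """Entropy-regularized OT: given cost matrix Q, run Sinkhorn
    iterations to get transport (proxy) matrix P; OT loss is <P, Q>.
    """
    n, m = Q.shape
    K = np.exp(-Q / eps)                   # Gibbs kernel
    a, b = np.ones(n) / n, np.ones(m) / m  # uniform marginals (assumed)
    u = np.ones(n) / n
    for _ in range(n_iters):
        v = b / (K.T @ u)                  # alternate scaling updates
        u = a / (K @ v)
    P = np.diag(u) @ K @ np.diag(v)
    return P, np.sum(P * Q)

# Cheap diagonal, expensive off-diagonal: mass should stay on the diagonal.
Q = np.array([[0.0, 1.0], [1.0, 0.0]])
P, loss = sinkhorn(Q)
print(P.round(3))
```

With a cost matrix that favors the diagonal, nearly all transported mass lands there and the OT loss stays near zero, which is the "move one distribution onto the other as cheaply as possible" behavior described above.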

SLIDE 40

The Feature Intertwiner - choosing optimal feature maps

Why is optimal transport (OT) better than the alternatives? Hence, the final loss combines the intertwiner (OT) loss with the detection losses.

  • The OT metric converges while other variants (KL or JS) do not.
  • It provides sensible cost functions when learning distributions supported by low-dimensional manifolds (p_l and p_m|l).

SLIDE 41

Summary of our method

SLIDE 42

Experiments

SLIDE 43

Setup

  • Evaluate our algorithm on the COCO dataset
  • Train set: trainval-35k; test set: minival
  • Network structure: ResNet-50 or ResNet-101 with FPN
  • Based on the Mask-RCNN framework without the segmentation branch
  • Evaluation metric: mean AP under different thresholds and object sizes

The rest of the details are given in Sec. 6.5 of the paper.

SLIDE 44

Ablation on module design

Table 2 in the paper; gray background is the chosen default.

Different anchor placements

Observation #1: the Feature Intertwiner module is better than the baseline (~2% mAP increase). Large objects also improve. Why? Does the intertwiner module work better than expected?

SLIDE 45

Ablation on module design

Table 2 in the paper; gray background is the chosen default.

Observation #2: by optimizing the make-up layer, the linearly combined features further boost performance. How does the intertwiner module affect feature learning?

Gradient flow

SLIDE 46

Ablation on module design

Table 2 in the paper; gray background is the chosen default.

Observation #3: recording the entire history of the large/reliable set achieves better results (and saves memory); one unified buffer is enough. Does the buffer size matter? Unified or level-based buffer? How should the buffer be designed?

SLIDE 47

Ablation on OT unit

Table 1 in the paper

Different input sources for the reliable set

SLIDE 48

Visualization on samples within class

w/o intertwiner vs. with intertwiner

SLIDE 49

Comparison with state-of-the-arts (I)

Figure 4 in the paper

Improvement per category after embedding the feature intertwiner

32.8 mAP (baseline) vs. 35.2 (ours). Most small-sized objects improve!

SLIDE 50

Comparison with state-of-the-arts (I)

The most distinctive improvements are microwave, truck, cow, car, and zebra.

SLIDE 51

Comparison with state-of-the-arts (I)

Some categories witness a performance drop: couch, baseball bat, broccoli. For couch, the feature set of large couches is less accurate due to noise from other classes.

SLIDE 52

Comparison with state-of-the-arts (II)

Single-model performance, bounding box AP (Table 4 in the paper):

  • SSD: 33.2
  • Fast-RCNN variants: 36.8
  • Same backbone: 39.1
  • Proposed: 44.2

SLIDE 53

This work is published at ICLR 2019

Paper: https://openreview.net/forum?id=SyxZJn05YX

Check out our poster at GTC! P9108, AI/Deep Learning Research, near the gear store.

Code: https://github.com/hli2020/feature_intertwiner

SLIDE 54
  • 3. Detection in Reality
SLIDE 55

Practical issues on multi-GPUs

1. Batch normalization

Standard implementations of BN in public frameworks (such as Caffe, MXNet, Torch, TF, PyTorch) are unsynchronized, which means the data are normalized within each GPU.

https://hangzhang.org/PyTorch-Encoding/notes/syncbn.html

Synchronized BN
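The effect is easy to see in NumPy. Below is a toy simulation of two "GPUs" holding different data shards, not an actual multi-GPU implementation; the shard distributions are exaggerated to make the gap obvious:

```python
import numpy as np

rng = np.random.default_rng(0)
# Two "GPUs", each holding a different half of the batch.
shard_a, shard_b = rng.normal(0, 1, 64), rng.normal(5, 1, 64)

# Unsynchronized BN: each card normalizes with its OWN mean/std.
unsync = [(shard_a - shard_a.mean()) / shard_a.std(),
          (shard_b - shard_b.mean()) / shard_b.std()]

# Synchronized BN: statistics are computed over the WHOLE batch.
full = np.concatenate([shard_a, shard_b])
sync = (full - full.mean()) / full.std()

# The two schemes disagree whenever the shards have different statistics.
print(np.abs(np.concatenate(unsync) - sync).max())
```

When the per-GPU shards are statistically similar (large, well-shuffled batches), the gap shrinks, which is exactly the rule of thumb on the next slide.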

SLIDE 56

Practical issues on multi-GPUs

1. Batch normalization: does it matter? As long as the batch size on each GPU is not too small, unsynchronized BN is fine.

Note that the batch size in the "deeper" part (the head) is the number of RoIs/boxes on each card, while the batch size in the backbone is the number of images!

Another rule of thumb: freeze BN in the backbone when fine-tuning the network on your task.

SLIDE 57

Practical issues on multi-GPUs

  • 2. Wrap the loss computation into forward() on each card.

Otherwise GPU 0 can take too much memory in some cases, causing memory imbalance and lowering the utilization of the other GPUs.

SLIDE 58

Practical issues on multi-GPUs

  • 3. Different images must have the same size of targets as input
  • 4. What if GPU utilization is low?
  • The dataloader may be slow
  • Move operations to Tensors
  • Or change to another workstation
  • (Utilization is often low during inference)
SLIDE 59

Trade-off between accuracy and efficiency

Additional model capacity introduced by our method:

  • Critic/make-up layers
  • Buffer
  • OT module

But these new designs are light-weight.

SLIDE 60

Trade-off between accuracy and efficiency

More facts:

  • Training: 8 GPUs, batch size = 8, 3.4 days
  • Memory cost: 9.6 GB/GPU (baseline: 8.3 GB)
  • Test (input 800, on a Titan X): 325 ms/image (baseline: 308 ms/image)
  • On the accuracy-speed trade-off plot: Mask-RCNN 39.2 vs. InterNet 42.5

SLIDE 61
  • 4. Future of Object Detection
SLIDE 62

Any alternatives? Abandoning the current anchor-based pipeline

Idea: current solutions are all based on anchors (one-stage or two-stage). Is a bounding box really adequate for detecting all objects, e.g., densely cluttered persons?

How about detecting objects with bottom-up approaches, like pixel-wise segmentation? In this way, we can work around the box-detection pipeline.

SLIDE 63

Take-away Messages

1. Object detection is the basic and core task underlying other high-level vision problems.
2. The feature engine (backbone) and detector design (domain knowledge) are both important.
3. Beyond the current pipeline (dense anchors): solve detection via bottom-up approaches or the 3D structure of objects.
4. Beyond detection only, one model to learn them all: detection, segmentation, pose estimation, captioning, zero-shot detection, curriculum learning, ...

SLIDE 64

Thank you! Questions?

Collaborators:

Yu Liu Bo Dai Xiaoyang Shaoshuai Wanli Xiaogang

Email: yangli@ee.cuhk.edu.hk | Slides: http://www.ee.cuhk.edu.hk/~yangli/ | Twitter: @francislee2020