The Glimpse of Detectron: Dynamic Forwarding and Routing in Modern Detectors (PowerPoint presentation)


SLIDE 1

The Glimpse of Detectron: Dynamic Forwarding and Routing in Modern Detectors

Ziwei Liu Multimedia Lab (MMLAB) The Chinese University of Hong Kong

SLIDE 2

Dynamic Forwarding

A neurobiological model of visual attention and invariant pattern recognition based on dynamic routing of information

  • Content-Aware
  • Resolution-Adaptive
SLIDE 3

Dynamic Routing

A neurobiological model of visual attention and invariant pattern recognition based on dynamic routing of information

  • Information Flow
  • Selection & Fusion
SLIDES 4-7

Overview

  • 1. We proposed a new backbone, FishNet. (NIPS 2018)
  • 2. We designed a feature-guided anchoring scheme that improves the average recall (AR) of RPN by 10 points. (CVPR 2019)
  • 3. We proposed a new upsampling operator, CARAFE. (ICCV 2019)
  • 4. We developed a hybrid cascading and branching pipeline for detection and segmentation. (CVPR 2019)

(Diagram: Backbone → Proposal → Upsampling → Detection & Segmentation)

SLIDE 8

FishNet: A Versatile Backbone for Image, Region, and Pixel Level Prediction (NIPS 2018)

SLIDE 9

Motivation

  • The basic principles for designing CNNs for region- and pixel-level tasks are diverging from the principles for image classification.
  • FishNet unifies the advantages of networks designed for region- and pixel-level tasks in obtaining deep features with high resolution.

(Diagram: image classification vs. region- and pixel-level tasks such as segmentation, pose estimation and detection.)

SLIDE 10

Motivation

  • Traditional consecutive down-sampling prevents the very shallow layers from being directly connected to the final layers, which may exacerbate the vanishing gradient problem.
  • Features from varying depths can be used to refine each other.

FishNet: A Versatile Backbone for Image, Region, and Pixel Level Prediction, NIPS 2018.

SLIDE 11

FishNet

(Chart: Top-1 (Top-5) classification error on ImageNet vs. parameter count (10M-70M) for FishNet, DenseNet and ResNet; Top-1 errors range from 21.20% to 23.75%.)

SLIDE 12

FishNet

MS COCO val-2017 detection and instance segmentation results.

SLIDE 13

FishNet

  • Fish tail, fish body, fish head
  • More flexible information flow
  • Adaptive feature resolution preservation
SLIDE 14

Region Proposal by Guided Anchoring (CVPR 2019)

SLIDE 15

Overview

  • We introduce a guided anchoring scheme to generate anchors and build a Guided Anchoring Region Proposal Network (GA-RPN).
  • GA-RPN achieves 9.1% higher average recall (AR) on MS COCO with 90% fewer anchors than the RPN baseline.
  • GA-RPN improves Fast R-CNN, Faster R-CNN and RetinaNet by 2.2%, 2.7% and 1.2%, respectively.

SLIDE 16

Baseline

Region Proposal Network (RPN)

RPN adopts a uniform anchoring scheme: anchors with predefined scales and aspect ratios are generated uniformly over the whole image by sliding a window over the image feature. (Diagram: image feature → sliding window → base anchors → prediction.)

Ren S, He K, Girshick R, et al. Faster R-CNN: Towards real-time object detection with region proposal networks. Advances in Neural Information Processing Systems, 2015: 91-99.
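The uniform scheme can be sketched in a few lines of NumPy (a simplified illustration, not the actual RPN/Detectron code; the stride, scales and ratios below are example values):

```python
import numpy as np

def uniform_anchors(feat_h, feat_w, stride=16,
                    scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Generate anchors (x1, y1, x2, y2) at every feature-map location.

    Every location gets the same len(scales) * len(ratios) base anchors,
    regardless of image content -- the drawback guided anchoring targets.
    """
    base = []
    for s in scales:
        for r in ratios:
            w, h = s * np.sqrt(1.0 / r), s * np.sqrt(r)  # area s*s, ratio r
            base.append([-w / 2, -h / 2, w / 2, h / 2])
    base = np.array(base)                                # (A, 4)

    # Shift the base anchors to the center of every feature-map cell.
    xs = (np.arange(feat_w) + 0.5) * stride
    ys = (np.arange(feat_h) + 0.5) * stride
    cx, cy = np.meshgrid(xs, ys)
    shifts = np.stack([cx, cy, cx, cy], axis=-1).reshape(-1, 1, 4)
    return (shifts + base).reshape(-1, 4)                # (H*W*A, 4)

anchors = uniform_anchors(2, 3)
print(anchors.shape)  # (54, 4): 2*3 locations x 9 anchors each
```

Even on this toy 2x3 feature map the scheme emits 54 anchors; on a real feature pyramid the count is in the hundreds of thousands, which is why so few anchors end up positive.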

SLIDE 17

Baseline

The uniform anchoring scheme has intrinsic drawbacks:

  • Most generated anchors are irrelevant to the objects (fewer than 0.01% of anchors are positive samples).
  • The conventional method is unaware of object shapes.
SLIDE 18

Baseline

How to overcome these drawbacks:

  • Anchors should be distributed over the feature maps according to how likely each location is to contain an object.
  • Anchor shapes should be predicted rather than pre-defined.
SLIDE 19

Guided Anchoring

The Guided Anchoring component has the following steps:

  • The first step identifies the locations where objects are likely to exist.
  • The second step predicts the shapes of the anchors.
  • In addition, a feature adaption module further refines the features according to the anchor shapes.
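How the location and shape predictions combine into sparse anchors can be sketched as follows (a toy NumPy illustration with a made-up threshold and hand-filled probability/shape maps; in GA-RPN these maps come from small conv branches):

```python
import numpy as np

def guided_anchors(loc_prob, pred_w, pred_h, stride=16, thr=0.5):
    """Turn per-location objectness and shape predictions into anchors.

    loc_prob: (H, W) probability that a location contains an object.
    pred_w, pred_h: (H, W) predicted anchor width/height per location.
    Only locations above `thr` emit an anchor, so the anchor set is
    sparse and non-uniform, and shapes are learned, not predefined.
    """
    ys, xs = np.nonzero(loc_prob > thr)          # sparse locations
    cx, cy = (xs + 0.5) * stride, (ys + 0.5) * stride
    w, h = pred_w[ys, xs], pred_h[ys, xs]
    return np.stack([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2], axis=1)

loc = np.array([[0.9, 0.1],
                [0.2, 0.8]])      # only two confident locations
w = np.full((2, 2), 64.0)
h = np.full((2, 2), 32.0)
anchors = guided_anchors(loc, w, h)
print(anchors.shape)  # (2, 4): anchors only where objects are likely
```

Contrast this with the uniform scheme: instead of 9 anchors at every cell, only confident locations produce anchors, each with its own predicted width and height.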

SLIDE 20

Guided Anchoring

Anchor Location Prediction (diagram: a 1x1 conv predicts how likely each location is to contain an object)

SLIDE 22

Guided Anchoring

Anchor Shape Prediction (diagram: 1x1 conv layers predict anchor shapes, e.g., wide or tall)

SLIDE 23

Guided Anchoring

Feature Adaption (diagram: a 3x3 deformable conv adapts the features to the predicted anchor shapes)

SLIDE 24

Guided Anchoring

Why feature adaption?

Method           AR100  AR300  AR1000  ARS   ARM   ARL
RPN              47.5   54.7   59.4    31.7  55.1  64.6
GA-RPN w/o F.A.  54.0   60.1   63.8    36.7  63.1  71.5
GA-RPN w/ F.A.   59.2   65.2   68.5    40.9  67.8  79.0

A feature and an anchor at the same location should be consistent.

SLIDE 25

Guided Anchoring

Experiment Results

(Chart: AR1000 (0.58-0.72) vs. runtime on TITAN X (fps) for RPN with ResNet-50, ResNet-152, ResNeXt-101 and SENet-154, and for GA-RPN with ResNet-50 and SENet-154.)

SLIDE 26

Guided Anchoring

Experiment Results

Detector        AP    AP50  AP75  APS   APM   APL
Fast R-CNN      37.1  59.6  39.7  20.7  39.5  47.1
GA-Fast-RCNN    39.4  59.4  42.8  21.6  41.9  50.4
Faster R-CNN    37.1  59.1  40.1  21.3  39.8  46.5
GA-Faster-RCNN  39.8  59.2  43.5  21.8  42.6  50.7
RetinaNet       35.9  55.4  38.8  19.4  38.9  46.5
GA-RetinaNet    37.1  56.9  40.0  20.1  40.1  48.0

Detection results on MS COCO 2017 test-dev with a ResNet-50 backbone.

SLIDE 27

Guided Anchoring

Examples (qualitative comparison of proposals: RPN vs. GA-RPN)

SLIDE 28

Guided Anchoring

  • From sliding window to sparse, non-uniform distribution
  • From predefined shapes to learnable, arbitrary shapes
  • Refine features based on anchor shapes
SLIDE 29

CARAFE: Content-Aware ReAssembly of Features (ICCV 2019 Oral)

SLIDE 30

Background

  • Feature upsampling is a key operation in a number of modern convolutional network architectures, e.g., Feature Pyramid Networks, U-Net and Stacked Hourglass Networks.
  • Its design is critical for dense prediction tasks such as object detection and semantic/instance segmentation.

(Examples: object detection, instance segmentation, semantic segmentation.)

SLIDES 31-33

Background

Nearest Neighbor (NN) / Bilinear Interpolation: leverages distances to measure the correlations between pixels; hand-crafted upsampling kernels are used. (Pros: low cost / Cons: hand-crafted upsampling kernels)

Deconvolution (Transposed Convolution): an inverse operator of a convolution, which uses a fixed kernel for all samples within a limited receptive field. (Pros: learnable kernel / Cons: not content-aware, limited receptive field)

Pixel Shuffle: reshapes depth in the channel space into width and height in the spatial space; it brings high computational overhead when expanding the channel space. (Pros: learnable kernel / Cons: not content-aware, limited receptive field, high cost)

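To make "hand-crafted kernels" concrete, here is a toy nearest-neighbor upsampler in NumPy; the point is that the kernel is fixed by geometry alone and never looks at the content (illustration only, not the operator proposed in the paper):

```python
import numpy as np

def upsample_nn(x, scale=2):
    """Nearest-neighbor upsampling: each output pixel copies its nearest
    source pixel. The 'kernel' depends only on pixel positions, never on
    the values in x -- the hand-crafted-kernel drawback noted above."""
    return x.repeat(scale, axis=0).repeat(scale, axis=1)

x = np.array([[1.0, 2.0],
              [3.0, 4.0]])
y = upsample_nn(x)
print(y)
# [[1. 1. 2. 2.]
#  [1. 1. 2. 2.]
#  [3. 3. 4. 4.]
#  [3. 3. 4. 4.]]
```

Bilinear interpolation replaces the copy with a fixed distance-weighted average of the nearest source pixels, but the weights are still content-independent, which is exactly what CARAFE changes.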
SLIDES 34-35

Overview

Content-Aware ReAssembly of FEatures (CARAFE) is a universal, lightweight and highly effective upsampling operator.

  • Large field of view. CARAFE can aggregate contextual information within a large receptive field.
  • Content-aware handling. CARAFE enables instance-specific, content-aware handling, generating adaptive kernels on-the-fly.
  • Lightweight and fast to compute. CARAFE introduces little computational overhead and can be readily integrated into modern network architectures.

CARAFE shows consistent and substantial gains across object detection, instance/semantic segmentation and inpainting (+1.2%, +1.3%, +1.8% and +1.1 dB, respectively) with negligible computational overhead.

SLIDE 36

CARAFE

At each location, CARAFE leverages the content at that location to predict reassembly kernels and reassemble the features inside a predefined nearby region. 1) The first step predicts a reassembly kernel for each destination location according to its content, over the k x k sub-region of the source feature centered at the corresponding location, i.e., the neighborhood of the destination value. 2) The second step reassembles the features with the predicted kernels.
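The two steps can be sketched for a single-channel map as follows (a toy NumPy illustration: the reassembly kernels are given as inputs here, whereas CARAFE predicts them with its kernel prediction module, and `sigma`/`k` are example values):

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def carafe(x, kernels, sigma=2, k=3):
    """Content-aware reassembly for a single-channel feature map.

    x:       (H, W) source feature map.
    kernels: (sigma*H, sigma*W, k*k) raw reassembly kernels, one per
             destination location (in CARAFE these come from the kernel
             prediction module; here they are simply given).
    Each destination pixel is a softmax-weighted sum of the k x k
    source neighborhood around its corresponding source location.
    """
    H, W = x.shape
    r = k // 2
    xp = np.pad(x, r, mode='edge')
    out = np.zeros((sigma * H, sigma * W))
    for i in range(sigma * H):
        for j in range(sigma * W):
            si, sj = i // sigma, j // sigma           # source location
            patch = xp[si:si + k, sj:sj + k]          # k x k neighborhood
            w = softmax(kernels[i, j]).reshape(k, k)  # kernel normalizer
            out[i, j] = (patch * w).sum()             # reassembly
    return out

x = np.arange(9, dtype=float).reshape(3, 3)
kernels = np.zeros((6, 6, 9))   # all-zero logits -> uniform kernels
y = carafe(x, kernels)
print(y.shape)  # (6, 6): upsampled by sigma = 2
```

Because each kernel is softmax-normalized, every output value is a convex combination of its source neighborhood; with uniform kernels the operator degenerates to local averaging, and content-aware kernels are what distinguish CARAFE from the fixed kernels of bilinear upsampling.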

SLIDE 37

CARAFE

Framework

SLIDE 38

CARAFE

Kernel Prediction Module

SLIDE 39

CARAFE

Content-aware Reassembly Module

SLIDE 40

CARAFE

  • Each source location on the input feature map corresponds to σ² destination locations on the upsampled map, where σ is the upsample ratio.
  • Each destination location requires an l_up x l_up reassembly kernel, where l_up is the reassembly kernel size.

SLIDES 41-43

CARAFE

Kernel Prediction Module

1) Channel Compressor: a 1 x 1 convolution layer that compresses the input feature channels from C to Cm. The goal of this step is speed-up without harming the performance.

2) Content Encoder: a convolution layer of kernel size l_encoder that generates reassembly kernels based on the content of the input features. The empirical formula l_encoder = l_up - 2 is a good trade-off between performance and efficiency in our study.

3) Kernel Normalizer: each l_up x l_up reassembly kernel is normalized with a softmax function.

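The three steps imply some simple shape bookkeeping, sketched below with toy numbers (the channel counts are hypothetical, not the paper's configuration; the sketch also shows why the channel compressor pays off):

```python
# Shape bookkeeping for the kernel prediction module (toy values).
C, Cm = 256, 64          # channel compressor: 1x1 conv maps C -> Cm
sigma, k_up = 2, 5       # upsample ratio and reassembly kernel size
k_encoder = k_up - 2     # empirical encoder kernel size from the slides

# The content encoder maps (Cm, H, W) -> (sigma^2 * k_up^2, H, W):
# one raw l_up x l_up kernel per destination location, which the
# kernel normalizer then passes through a softmax.
out_channels = sigma ** 2 * k_up ** 2

# Encoder weight count with and without the channel compressor.
params_encoder = Cm * out_channels * k_encoder ** 2
params_no_compress = C * out_channels * k_encoder ** 2
print(out_channels, params_encoder, params_no_compress)
# 100 57600 230400 -> compression cuts encoder weights by 4x here
```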
SLIDE 44

CARAFE

Content-aware Reassembly Module

SLIDE 45

Applications

CARAFE introduces little computational overhead and can be readily integrated into modern network architectures.

  • Object Detection (Faster R-CNN w/ FPN)
  • Instance Segmentation (Mask R-CNN w/ FPN)
  • Semantic Segmentation (UperNet)
  • Image Inpainting (Global&Local, Partial Conv)
SLIDE 46

Applications

Object Detection & Instance Segmentation: 1) Feature Pyramid Network (Faster R-CNN, Mask R-CNN); 2) Mask Head (Mask R-CNN)

(Diagrams: Feature Pyramid Network (FPN) and Mask Head.)

SLIDE 47

Experiments

Object Detection & Instance Segmentation:

SLIDE 48

Experiments

Semantic Segmentation:

SLIDE 49

Experiments

Image Inpainting:

SLIDE 50

Experiments

Compare with previous upsamplers:

SLIDE 51

Experiments

How CARAFE works:

SLIDE 52

CARAFE

  • Universal operator
  • Content-aware upsampling
  • Fast to compute
SLIDE 53

Hybrid Task Cascade for Instance Segmentation (CVPR 2019)

SLIDE 54

Pipeline

A hybrid architecture with interleaved task branching and cascade. (Pipeline diagram: backbone and RPN produce proposals; each stage performs cls. + reg. and passes the regressed box on; the mask branches receive mask features and a semantic feature from a semantic head.)

SLIDES 55-56

Pipeline

Baseline: Cascade R-CNN. (Diagram: backbone → RPN → proposals; Stage 1 cls. + reg. → regressed box → Stage 2 cls. + reg.)

Problem: designed for detection, not segmentation.

SLIDES 57-59

Pipeline

Baseline: Cascade R-CNN + Mask R-CNN. (Diagram: each stage adds a mask branch alongside cls. + reg.)

Problem: mismatch of the training and testing pipelines.

SLIDES 60-61

Pipeline

Task cascade: ordinal bbox prediction and mask prediction. (Diagram: within each stage, cls. + reg. runs first and the mask branch then uses the regressed box.)

Problem: no connection between the mask branches of different stages.

SLIDES 62-63

Pipeline

Interleaved execution: box cascade & mask cascade. (Diagram: the Stage 1 mask feature is fed into the Stage 2 mask branch.)

Problem: contextual information is not much explored.

SLIDE 64

Pipeline

Hybrid branching: an additional semantic segmentation branch. (Diagram: a semantic head on the backbone features provides a semantic feature to the mask branches.)

SLIDE 65

Hybrid Task Cascade

  • Cascade between different tasks
  • Interleaved execution
  • Contextual information fusion
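The three ideas above can be summarized as a forward-pass sketch (plain Python with toy stub heads standing in for the real box/mask branches; all names and values are illustrative, not mmdetection's API):

```python
def htc_forward(feat, proposals, box_stages, mask_stages, semantic_head):
    """Interleaved cascade: in each stage the box head runs first and the
    mask head then uses the *regressed* boxes; each mask head also
    receives the previous stage's mask feature and a shared semantic
    feature, giving inter-stage and contextual information flow."""
    sem = semantic_head(feat)                 # hybrid branch: context
    boxes, mask_feat, outputs = proposals, None, []
    for box_head, mask_head in zip(box_stages, mask_stages):
        boxes = box_head(feat, boxes)                       # cls. + reg.
        mask, mask_feat = mask_head(feat, boxes, mask_feat, sem)
        outputs.append((boxes, mask))
    return outputs

# Toy stubs that only demonstrate the information flow.
def box_head(feat, boxes):
    return [b + 1 for b in boxes]             # pretend box refinement

def mask_head(feat, boxes, prev_mask_feat, sem):
    carried = 0 if prev_mask_feat is None else prev_mask_feat
    mask_feat = carried + sum(boxes) + sem    # fuse prev stage + context
    return f"mask({mask_feat})", mask_feat

outs = htc_forward(feat=10, proposals=[1, 2],
                   box_stages=[box_head] * 3, mask_stages=[mask_head] * 3,
                   semantic_head=lambda f: f // 10)
print(len(outs))  # 3 stages, each yielding refined boxes and a mask
```

The key difference from a plain Cascade R-CNN + Mask R-CNN combination is the `mask_feat` variable threaded through the loop and the shared `sem` feature: they realize the inter-stage mask connection and the contextual fusion listed above.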
SLIDES 66-75

Experiments

Mask AP on test-dev, adding one component at a time (chart, axis 35-49):

  • baseline (R-50 cascade with mask): 36.7
  • + interleaved cascade: 37.3 (+0.6)
  • + semantic branch: 38.1 (+0.8)
  • + deformable conv: 39.5 (+1.4)
  • + synchronized BN: 40.7 (+1.2)
  • + multi-scale training: 42.5 (+1.8)
  • + better backbone: 44.3 (+1.8)
  • + GA-RPN finetune: 45.3 (+1.0)
  • + multi-scale & flip testing: 47.4 (+2.1)
  • + model ensemble: 49.0 (+1.6)

SLIDE 76

Visualization

SLIDE 77

mmdetection (Open-MMLAB)

SLIDE 78

Codebase

  • 10+ research institutes
  • 20+ supported methods
  • 200+ pre-trained models

GitHub: mmdet

SLIDE 79

Codebase

GitHub: mmdet

The entries ranked 1st, 2nd and 3rd in iMaterialist (Fashion) 2019 at FGVC6 (CVPR 2019 Workshop) are all based on HTC.

SLIDE 80

Thank you!

Dynamic forwarding and routing as a computational strategy for detection and beyond