SLIDE 1

Recent Progresses in Visual Segmentation

Yunchao Wei ReLER, Australian Artificial Intelligence Institute University of Technology Sydney

VALSE Webinar

SLIDE 2

UTS ReLER Lab · VALSE Webinar

The importance of visual segmentation

Agriculture Robotics Autonomous Vehicle Satellite Imagery Medical Imagery Video Editing

SLIDE 3

Part I: Semantic Segmentation
Part II: Interactive Image Segmentation
Part III: Video Object Segmentation

Outline

SLIDE 4

Part I: Semantic Segmentation

SLIDE 5

Semantic Segmentation

Pascal VOC · LIP · ADE20K · Cityscapes

SLIDE 6

Context Modeling in FCN Structures

[Long et al. CVPR 2015] [Chen et al. PAMI 2018] [Chen et al. ECCV 2018] [Zhao et al. CVPR 2017] [Ronneberger et al. MICCAI 2015]

Non-adaptive context modeling
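Non-adaptive here means the context pattern is fixed by the architecture, e.g. dilated (atrous) convolutions with preset rates. As a rough illustration (mine, not from the slides), the receptive field of stacked stride-1 dilated convolutions grows as:

```python
def receptive_field(kernel_sizes, dilations):
    """Receptive field of a stack of stride-1 dilated (atrous) convolutions:
    each layer adds (k - 1) * d pixels of context on either side combined."""
    rf = 1
    for k, d in zip(kernel_sizes, dilations):
        rf += (k - 1) * d
    return rf

receptive_field([3, 3, 3], [1, 2, 4])  # -> 15
```

The pattern (and hence the context each pixel sees) is the same for every input image, which is what the adaptive methods below address.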

SLIDE 7

Graph Neural Networks

[Wang et al. CVPR 2018]

but high computational complexity

Adaptive context modeling

SLIDE 8

Criss-Cross Attention

Criss-cross attention block, i.e., a sparsely connected form of self-attention

[Huang et al. ICCV 2019]
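A minimal NumPy sketch of the idea (my illustration, not the paper's CUDA implementation): each position attends only to positions on its own row and column. For simplicity the center pixel appears in both the row and the column slice, whereas the paper attends over the H + W − 1 distinct criss-cross positions.

```python
import numpy as np

def criss_cross_attention(q, k, v):
    """Sketch of a criss-cross attention block: each position attends only to
    positions on its own row and column (the criss-cross path), instead of
    all H * W positions as in full self-attention.
    q, k, v: (H, W, C) feature maps."""
    H, W, C = q.shape
    out = np.zeros_like(v, dtype=np.float64)
    for i in range(H):
        for j in range(W):
            # keys/values on the criss-cross path of (i, j): row i + column j
            ks = np.concatenate([k[i, :, :], k[:, j, :]], axis=0)  # (H+W, C)
            vs = np.concatenate([v[i, :, :], v[:, j, :]], axis=0)
            logits = ks @ q[i, j] / np.sqrt(C)
            w = np.exp(logits - logits.max())
            w = w / w.sum()                      # softmax over H+W positions
            out[i, j] = w @ vs
    return out
```

Each query touches H + W keys rather than H × W, which is where the complexity saving comes from.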

SLIDE 9

Recurrent Criss-Cross Attention

Criss-cross attention with R = 2 is equivalent to the non-local block; time & space complexity drop from O((H×W)×(H×W)) to O((H×W)×(H+W−1))
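The R = 2 claim can be checked with a tiny reachability experiment (my sketch): one criss-cross pass spreads information along a pixel's row and column, and a second pass relays it from that path to every pixel in the image.

```python
import numpy as np

def criss_cross_reach(mask):
    """One criss-cross pass, viewed as boolean reachability: a position
    receives information if anything on its row or column carries it."""
    H, W = mask.shape
    out = np.zeros_like(mask)
    for i in range(H):
        for j in range(W):
            out[i, j] = mask[i, :].any() or mask[:, j].any()
    return out

m = np.zeros((4, 5), dtype=bool)
m[1, 2] = True                      # information starts at a single pixel
after1 = criss_cross_reach(m)       # reaches row 1 and column 2 only
after2 = criss_cross_reach(after1)  # reaches every pixel: full image context
```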

SLIDE 10

CCNet: Criss-cross Network

SLIDE 11

Results on Cityscapes

More accurate than the non-local block, at ~15% of its FLOPs and ~9% of its GPU memory cost

SLIDE 12

Results on ADE20K, LIP & COCO

Scene parsing results on ADE20K Human parsing results on LIP Instance segmentation results on COCO

SLIDE 13

From Image to Video

Video semantic segmentation results on CamVid (CCNet-3D)

[Huang et al. PAMI 2020]

SLIDE 14

Visualization of the Learned Context on Cityscapes

Image R=1 R=2 Ground Truth

SLIDE 15

Follow-up Works

Axial Attention [Ho et al. arXiv 2019]; Axial-DeepLab [Wang et al. ECCV 2020]

SLIDE 16

Recent Hotspots: Boundary modeling for better segmentation

[Cheng et al. CVPR 2020] [Kirillov et al. CVPR 2020] [Cheng et al. ECCV 2020] [Li et al. ECCV 2020] [Chen et al. CVPR 2020]

SLIDE 17

Part II: Interactive Image Segmentation

SLIDE 18

  • Semi-automated, class-agnostic segmentation
  • Target object depends on the user inputs (e.g. points)
  • Allows iterative refinement until result is satisfactory


What is Interactive Image Segmentation?

Target object Unrelated region

SLIDE 19

Why should we consider interactive image segmentation?

≈ 60s per instance ≈ 1.5 hours per image

Unaffordable!!

SLIDE 20

Why should we consider interactive image segmentation?

Accurately & Efficiently

SLIDE 21

  • RGB image and user interactions are used as the network input
  • Trained end-to-end with FCNs (e.g., the DeepLab series, PSPNet)


Standard pipeline

Image User interactions Ground-truth Fully convolutional network (FCN)

[Xu et al. CVPR 2016]

SLIDE 22

  • Sparse clicks
  • Bounding box
  • Scribbles


Common types of user interaction


SLIDE 24

  • Sparse clicks
  • Bounding box
  • Scribbles


Common types of user interaction

≈ 2 s / ≈ 7 s / ≈ 17 s per instance for the interaction types above, vs. manual annotation at ≈ 60 s per instance

SLIDE 25

  • DEXTR (Deep Extreme Cut)
  • Takes 4 extreme points (the top, bottom, leftmost, and rightmost pixels) as inputs

Existing State-of-the-Art Method: DEXTR

[Maninis et al. CVPR 2018]

SLIDE 26

  • DEXTR (Deep Extreme Cut)
  • Takes 4 extreme points (the top, bottom, leftmost, and rightmost pixels) as inputs

Existing State-of-the-Art Method: DEXTR

Segmentation Network

Cropped image Location cues

[Maninis et al. CVPR 2018]

SLIDE 27

  • DEXTR (Deep Extreme Cut)
  • Takes 4 extreme points (the top, bottom, leftmost, and rightmost pixels) as inputs
  • Problems
  • Multiple extreme points may appear at similar locations
  • An unrelated object may lie inside the target object

Existing State-of-the-Art Method: DEXTR

[Maninis et al. CVPR 2018]

SLIDE 28

  • Inside guidance (1 click)
  • Interior point located roughly at the object center
  • Disambiguate the segmentation target
  • Outside guidance (2 clicks)
  • 2 corner clicks of a box enclosing the object
  • Indicate the background region
  • The remaining 2 corners can be inferred automatically


Inside-Outside Guidance (IOG)

[Zhang et al. CVPR 2020]
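Inferring the remaining two corners is simple coordinate arithmetic; a sketch (the function name is mine):

```python
def infer_remaining_corners(c1, c2):
    """Given two diagonally opposite corner clicks (x, y) of the enclosing
    box, the other two corners of the axis-aligned box follow automatically."""
    (x1, y1), (x2, y2) = c1, c2
    return (x1, y2), (x2, y1)

infer_remaining_corners((10, 20), (110, 220))  # -> ((10, 220), (110, 20))
```

This is why two outside clicks suffice to specify the full bounding box.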

SLIDE 29

  • Click on a corner point
  • Click on the symmetrical corner
  • Click on the object center


Clicking Paradigm

Clicks / time: outside clicks ≈ 6.7 s, inside click ≈ 1.5 s

SLIDE 30

  • Follow the practice of DEXTR
  • Enlarge the bounding box by 10 pixels to include context
  • Crop and resize the inputs to 512x512
  • Input representation
  • 2 separate Gaussian heatmaps for the inside and outside clicks


Input Representation

RGB Image Inside Guidance Outside Guidance Segmentation Network
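Following this practice, the preprocessing can be sketched as follows (function names are mine; the 10-pixel margin and the two Gaussian click channels follow the bullets above):

```python
import numpy as np

def enlarged_crop_box(x1, y1, x2, y2, h, w, margin=10):
    """Enlarge the click-defined box by a fixed margin to keep context,
    clipped to the image bounds (DEXTR-style practice)."""
    return (max(0, x1 - margin), max(0, y1 - margin),
            min(w, x2 + margin), min(h, y2 + margin))

def click_heatmap(h, w, clicks, sigma=10.0):
    """Render clicks (x, y) as a Gaussian heatmap channel in [0, 1]; inside
    and outside clicks each get their own channel, stacked with RGB."""
    ys, xs = np.mgrid[0:h, 0:w]
    hm = np.zeros((h, w), dtype=np.float32)
    for cx, cy in clicks:
        g = np.exp(-((ys - cy) ** 2 + (xs - cx) ** 2) / (2.0 * sigma ** 2))
        hm = np.maximum(hm, g.astype(np.float32))
    return hm

inside = click_heatmap(512, 512, [(256, 256)])             # 1 interior click
outside = click_heatmap(512, 512, [(40, 60), (470, 400)])  # 2 corner clicks
# network input = np.dstack([rgb, inside, outside])  -> 5 channels
```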

SLIDE 31

  • Segmentation errors mostly occur around the object boundaries


Network Architecture

SLIDE 32

  • Segmentation errors mostly occur around the object boundaries
  • Use a coarse-to-fine network structure


Network Architecture

(a) CoarseNet (b) FineNet

[Chen et al. CVPR 2018]


SLIDE 34

  • Our IOG naturally supports interactively adding new clicks
  • Add a lightweight branch to accept the additional inputs
  • Train with an iterative training strategy


Beyond Three Clicks

(a) CoarseNet (b) FineNet Refinement

Optional click for refinement

SLIDE 35

  • Observation
  • IOG is more effective than extreme points across different backbones

IOG vs. Extreme Clicks

SLIDE 36

  • Observation
  • IOG is more effective than extreme points across different backbones
  • Using a coarse-to-fine network structure further improves the performance

IOG vs. Extreme Clicks

SLIDE 37

Comparison with SOTA

IoU (%) on PASCAL and GrabCut:

Method            PASCAL  GrabCut
Graph cut           41.1    59.3
Random walker       55.1    56.9
Geodesic matting    45.9    55.6
iFCN                75.2    84.0
RIS-Net             80.7    85.0
DEXTR               91.5    94.4
IOG (3 clicks)      93.2    96.3
IOG (4 clicks)      94.4    96.9

SLIDE 38

  • Our IOG performs well even on unseen categories
  • Performs well across different domains even without fine-tuning
  • Can be further improved by fine-tuning on 10% of the target-domain data


Generalization

Seen vs. unseen categories (chart values: 80.3%, 79.9%, 82.1%, 81.7%)

Cross-domain results on medical imagery, aerial imagery, and autonomous driving, with and without fine-tuning (chart values: 60.9 / 81.4, 78.2 / 68.3, 92.8 / 90.7)

PASCAL -> COCO: Curve-GCN 79.4, DEXTR 80.2, IOG 83.8

SLIDE 39

Qualitative Results

Cityscapes Agriculture-Vision Rooftop ssTEM

General object scenes

SLIDE 40

Demo

[Youtube] [Bilibili]

SLIDE 41

Automated Mode of IOG

RGB Image Inside Guidance Outside Guidance Segmentation Network

SLIDE 42

Automated Mode of IOG

RGB Image Outside Guidance Segmentation Network

  • Without user interaction, our IOG can still harvest high-quality masks from off-the-shelf datasets with box annotations (e.g., ImageNet)

  • Solution: two-stage training
    (S1) Train a network that takes the box as input
    (S2) Infer interior clicks from the masks produced in S1 and apply IOG

Inputs / IoU (PASCAL): w/ human 93.2; w/o human 91.1
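The two-stage pipeline can be sketched as follows; `box_net` and `iog_net` stand in for the S1 and S2 networks (placeholders passed as callables, not the authors' code):

```python
import numpy as np

def mask_center(mask):
    """Interior click inferred from a binary mask: its center of mass."""
    ys, xs = np.nonzero(mask)
    return int(ys.mean()), int(xs.mean())

def automated_iog(image, box, box_net, iog_net):
    """Two-stage automated mode. S1: a network trained with box-only
    guidance produces a coarse mask; S2: an interior click inferred from
    that mask drives standard IOG inference."""
    coarse = box_net(image, box)      # S1: mask from box guidance only
    click = mask_center(coarse)       # infer the interior click
    return iog_net(image, box, click) # S2: standard IOG inference
```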

SLIDE 43

Pixel-ImageNet

Possible applications: image classification, instance segmentation, semantic segmentation, salient object detection, and more

Characteristics
  • #Classes: 1,000
  • #Instances: >600K

https://github.com/shiyinzhang/Pixel-ImageNet

SLIDE 44

Failure Cases

SLIDE 45

Part III: Video Object Segmentation

SLIDE 46

  • Semi-supervised VOS


What is video object segmentation (VOS)?

Given the object masks in the first frame, predict the object masks in all subsequent video frames
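Predicted masks are typically scored with the region-similarity measure J, the IoU between prediction and ground truth (standard DAVIS evaluation practice, not stated on this slide):

```python
import numpy as np

def region_similarity(pred, gt):
    """Region similarity J used in DAVIS-style VOS evaluation:
    intersection-over-union between predicted and ground-truth masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 1.0  # both masks empty: treat as a perfect match
    return np.logical_and(pred, gt).sum() / union
```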

SLIDE 47

Applications

Video Conferencing Video Editing & Fashion

Autonomous Vehicle

SLIDE 48

  • DAVIS 2016 (ETH)
  • A single object as foreground
  • DAVIS 2017 (ETH)
  • Multiple objects as foreground (~120 sequences)
  • YouTube-VOS (UIUC & ByteDance, 2018)
  • Multiple objects as foreground (~4k sequences)


Datasets

SLIDE 49

  • Fine-tuning based solutions
  • Online propagation based solutions
  • Matching the current frame with the reference frame via feature concatenation
  • Matching pixels in the reference and current frames using deep metric learning
  • Matching objects in the reference and current frames using region proposals
  • Matching pixels in the reference and current frames implicitly using self-attention
  • Matching pixels in the reference and current frames using explicit global and local metrics


The roadmap of SOTA VOS methods

(speed axis: very slow → fast)


SLIDE 51

  • Explicit matching with pixel-wise embedding


Previous Solution

[Voigtlaender et al. CVPR 2019]

SLIDE 52

  • Global matching
  • First frame (t = 1) -> current frame (t = T)
  • All pixels
  • Local matching
  • Previous frame (t = T-1) -> current frame (t = T)
  • Pixels within a k-sized window


Previous Solution

[Voigtlaender et al. CVPR 2019]
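A minimal NumPy sketch of this FEELVOS-style matching (my illustration of the idea, not the authors' implementation): each current-frame pixel gets its distance to the nearest foreground embedding, globally from the first frame and locally from the previous frame.

```python
import numpy as np

def global_match(ref_emb, ref_mask, cur_emb):
    """For each current-frame pixel, distance to the nearest foreground
    pixel embedding in the reference (first) frame. All pixels compared."""
    H, W, C = cur_emb.shape
    fg = ref_emb[ref_mask]                          # (N, C) fg embeddings
    cur = cur_emb.reshape(-1, C)
    d = np.linalg.norm(cur[:, None, :] - fg[None, :, :], axis=-1)
    return d.min(axis=1).reshape(H, W)

def local_match(prev_emb, prev_mask, cur_emb, k=2):
    """Same distance, restricted to a (2k+1)-sized window in frame T-1."""
    H, W, C = cur_emb.shape
    out = np.full((H, W), np.inf)
    for i in range(H):
        for j in range(W):
            y0, y1 = max(0, i - k), min(H, i + k + 1)
            x0, x1 = max(0, j - k), min(W, j + k + 1)
            win = prev_emb[y0:y1, x0:x1][prev_mask[y0:y1, x0:x1]]
            if win.size:
                out[i, j] = np.linalg.norm(win - cur_emb[i, j], axis=-1).min()
    return out
```

Low distance suggests foreground; the maps are fed to a segmentation head rather than thresholded directly.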

slide-53
SLIDE 53

UT UTS ReLER ER Lab Lab VALSE We Webinar

53

Previous Solution

[Voigtlaender et al. CVPR 2019]

SLIDE 54

Motivation: Background Matters

Prediction (t=T) Reference (t=1)

[Yang et al. ECCV 2020]
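The intuition can be sketched as matching each pixel against both foreground and background reference embeddings (a simplified illustration of the CFBI idea, not the authors' implementation):

```python
import numpy as np

def min_dist_map(ref_emb, sel_mask, cur_emb):
    """Per-pixel distance from the current frame to the nearest reference
    embedding selected by sel_mask."""
    H, W, C = cur_emb.shape
    sel = ref_emb[sel_mask]
    if sel.size == 0:
        return np.full((H, W), np.inf)
    d = np.linalg.norm(cur_emb.reshape(-1, 1, C) - sel[None], axis=-1)
    return d.min(axis=1).reshape(H, W)

def fg_bg_match(ref_emb, fg_mask, cur_emb):
    """Collaborative matching sketch: match against foreground AND
    background; a pixel leans foreground when it is closer to the
    foreground embeddings than to the background ones."""
    d_fg = min_dist_map(ref_emb, fg_mask, cur_emb)
    d_bg = min_dist_map(ref_emb, ~fg_mask, cur_emb)
    return d_bg - d_fg   # > 0 -> closer to foreground
```

Matching the background explicitly suppresses distractor pixels that happen to resemble the foreground.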

SLIDE 55

Collaborative VOS by Foreground-Background Integration (CFBI)

SLIDE 56

Robust to different moving rates between two consecutive frames

Fast moving rate vs. slow moving rate

SLIDE 57

FG & BG Global & Multi-local Matching

for improving global consistency and robustness to different moving rates
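One way to picture the multi-local part (my sketch; CFBI's actual window sizes and feature design differ): compute local matching maps at several window sizes and hand all of them to the downstream network as channels.

```python
import numpy as np

def local_fg_dist(prev_emb, prev_mask, cur_emb, k):
    """Per-pixel distance to the nearest previous-frame foreground
    embedding within a (2k+1) x (2k+1) window."""
    H, W, C = cur_emb.shape
    out = np.full((H, W), np.inf)
    for i in range(H):
        for j in range(W):
            rows, cols = slice(max(0, i - k), i + k + 1), slice(max(0, j - k), j + k + 1)
            sel = prev_emb[rows, cols][prev_mask[rows, cols]]
            if sel.size:
                out[i, j] = np.linalg.norm(sel - cur_emb[i, j], axis=-1).min()
    return out

def multi_local_match(prev_emb, prev_mask, cur_emb, windows=(2, 4, 8)):
    """Stack per-window distance maps as channels: small windows stay
    precise under slow motion, large windows still find correspondences
    under fast motion, without paying the full global matching cost."""
    return np.stack([local_fg_dist(prev_emb, prev_mask, cur_emb, k)
                     for k in windows], axis=-1)
```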

SLIDE 58

Robust to objects of various scales

SLIDE 59

Instance-level Attention

for perceiving large objects better

SLIDE 60

Collaborative Ensembler

for building large receptive fields to aggregate and process all the information

SLIDE 61

  • Balanced random crop


Training Tricks

SLIDE 62

  • Sequential Training


Training Tricks

SLIDE 63

Ablation Experiments

Single-model ablation study on DAVIS-2017 validation set

SLIDE 64

Comparison with SOTA

Youtube-VOS DAVIS 2016 DAVIS 2017 CFBI+ denotes using a multi-scale and flip strategy in testing.

SLIDE 65

Demo

[Bilibili] [Youtube]

SLIDE 66

More Experiments on GitHub

SLIDE 67

  • Efficiency?
  • High resolution?
  • Boundary?
  • Panoptic interactive segmentation?
  • Long-tailed or few shot?
  • Domain adaptation?
  • Video scene parsing?


Future works

SLIDE 68

Conclusions

  • CCNet for Semantic Segmentation

https://github.com/speedinghzl/CCNet

  • IOG for interactive object segmentation

https://github.com/shiyinzhang/Inside-Outside-Guidance

  • CFBI for video object segmentation

https://github.com/z-x-yang/CFBI

SLIDE 69

Thanks

VALSE Webinar