Recent Progresses in Visual Segmentation
Yunchao Wei ReLER, Australian Artificial Intelligence Institute University of Technology Sydney
VALSE Webinar
The importance of visual segmentation
Agriculture, Robotics, Autonomous Vehicles, Satellite Imagery, Medical Imagery, Video Editing
Outline
Part I: Semantic Segmentation
Part II: Interactive Image Segmentation
Part III: Video Object Segmentation
Part I: Semantic Segmentation
Semantic Segmentation
PASCAL VOC, LIP, ADE20K, Cityscapes
Context Modeling in FCN Structures
[Long et al. CVPR 2015] [Chen et al. PAMI 2018] [Chen et al. ECCV 2018] [Zhao et al. CVPR 2017] [Ronneberger et al. MICCAI 2015]
Non-adaptive context modeling
Graph Neural Networks
[Wang et al. CVPR 2018]
Adaptive context modeling, but with high computational complexity
Criss-Cross Attention
Criss-cross attention block, a.k.a. sparsely connected self-attention
[Huang et al. ICCV 2019]
Recurrent Criss-Cross Attention
Criss-cross attention with R = 2 recurrences covers the same full-image context as a non-local block. Time & space complexity: reduced from O(N²) to O(N√N), where N = H × W.
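The idea can be sketched in a few lines of NumPy. This is a simplified illustration without the learned query/key/value projections of the actual CCNet module: each pixel attends only to the H + W - 1 pixels in its own row and column, and applying the block twice (R = 2) lets information reach every position.

```python
import numpy as np

def criss_cross_attention(feat):
    """One criss-cross pass over a (H, W, C) feature map: each pixel
    aggregates the H + W - 1 features in its own row and column,
    weighted by softmax-normalized dot-product similarity."""
    H, W, C = feat.shape
    out = np.zeros_like(feat)
    for i in range(H):
        for j in range(W):
            col = feat[:, j, :]                        # same column: H pixels
            row = np.delete(feat[i, :, :], j, axis=0)  # same row minus (i, j)
            neigh = np.concatenate([col, row], axis=0)
            scores = neigh @ feat[i, j]                # similarity to the query pixel
            weights = np.exp(scores - scores.max())
            weights /= weights.sum()
            out[i, j] = weights @ neigh
    return out

# With R = 2 recurrent passes, information propagates between every
# pair of positions, mimicking the full context of a non-local block.
np.random.seed(0)
x = np.random.randn(4, 5, 8)
y = criss_cross_attention(criss_cross_attention(x))
print(y.shape)  # (4, 5, 8)
```

Each pass touches H + W - 1 neighbours per pixel instead of all H × W, which is where the O(N√N) cost comes from.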
CCNet: Criss-cross Network
Results on Cityscapes
More accurate than the non-local block, at only ~15% of the FLOPs and ~9% of the memory cost
Results on ADE20K, LIP & COCO
Scene parsing results on ADE20K Human parsing results on LIP Instance segmentation results on COCO
From Image to Video
CCNet-3D: video semantic segmentation results on CamVid
[Huang et al. PAMI 2020]
Visualization of the Learned Context on Cityscapes
Columns: Image, R = 1, R = 2, Ground Truth
Follow-up Works
Axial Attention [Ho et al. arXiv 2019], Axial-DeepLab [Wang et al. ECCV 2020]
Recent Hotspots: Boundary modeling for better segmentation
[Cheng et al. CVPR 2020] [Kirillov et al. CVPR 2020] [Cheng et al. ECCV 2020] [Li et al. ECCV 2020] [Chen et al. CVPR 2020]
Part II: Interactive Image Segmentation
What is Interactive Image Segmentation?
Target object vs. unrelated region
Why should we consider interactive image segmentation?
Manual annotation: ≈ 60 s per instance, ≈ 1.5 hours per image
Unaffordable!!
Accurately & Efficiently
Standard pipeline
Image + user interactions → fully convolutional network (FCN) → mask, supervised by the ground truth
[Xu et al. CVPR 2016]
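A minimal sketch of this input encoding, assuming the distance-map representation of Xu et al.: positive and negative clicks are turned into truncated Euclidean distance maps and stacked with the RGB channels (the truncation value and image size below are illustrative).

```python
import numpy as np

def click_distance_map(clicks, h, w, truncate=255.0):
    """Encode a set of user clicks as a truncated Euclidean distance
    map, a common input representation for interactive FCNs."""
    ys, xs = np.mgrid[0:h, 0:w]
    dist = np.full((h, w), np.inf)
    for (cy, cx) in clicks:
        d = np.sqrt((ys - cy) ** 2 + (xs - cx) ** 2)
        dist = np.minimum(dist, d)   # distance to the nearest click
    return np.minimum(dist, truncate)

# Stack RGB with positive- and negative-click maps -> 5-channel input.
h, w = 64, 64
rgb = np.zeros((h, w, 3))
pos = click_distance_map([(32, 32)], h, w)           # clicks on the object
neg = click_distance_map([(2, 2), (60, 60)], h, w)   # clicks on the background
net_input = np.dstack([rgb, pos, neg])
print(net_input.shape)  # (64, 64, 5)
```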
Common types of user interaction
Common types of user interaction
Manual annotation: ≈ 60 s per instance; interaction-based alternatives: ≈ 2 s, ≈ 7 s, or ≈ 17 s per instance, depending on the interaction type
Existing State-of-the-Art Method: DEXTR
DEXTR takes the object's four extreme points (the top-most, bottom-most, left-most and right-most pixels) as inputs [Maninis et al. CVPR 2018]
Cropped image + location cues → segmentation network
Inside-Outside Guidance (IOG)
[Zhang et al. CVPR 2020]
Clicking Paradigm
Clicks and annotation time: outside clicks ≈ 6.7 s, inside click ≈ 1.5 s
Input Representation
RGB image + inside guidance + outside guidance → segmentation network
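The guidance channels can be rendered as Gaussian heatmaps centred on the clicks, as in DEXTR-style encodings; the sigma value below is illustrative, not necessarily the paper's exact setting.

```python
import numpy as np

def gaussian_heatmap(points, h, w, sigma=10.0):
    """Render a set of clicks as a single-channel Gaussian heatmap
    (the click encoding popularized by DEXTR-style methods)."""
    ys, xs = np.mgrid[0:h, 0:w]
    hm = np.zeros((h, w))
    for (cy, cx) in points:
        g = np.exp(-((ys - cy) ** 2 + (xs - cx) ** 2) / (2 * sigma ** 2))
        hm = np.maximum(hm, g)   # keep the strongest response per pixel
    return hm

h, w = 96, 96
inside = gaussian_heatmap([(48, 48)], h, w)             # 1 click inside the object
outside = gaussian_heatmap([(10, 10), (86, 86)], h, w)  # 2 corner clicks outside
rgb = np.zeros((h, w, 3))
net_input = np.dstack([rgb, inside, outside])           # 5-channel network input
print(net_input.shape)  # (96, 96, 5)
```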
Network Architecture
(a) CoarseNet (b) FineNet
[Chen et al. CVPR 2018]
Beyond Three Clicks
(a) CoarseNet (b) FineNet Refinement
Optional click for refinement
IOG vs. Extreme Clicks
Comparison with SOTA
Mean IoU (%):
Method              PASCAL   GrabCut
Graph cut             41.1     59.3
Random walker         55.1     56.9
Geodesic matting      45.9     55.6
iFCN                  75.2     84.0
RIS-Net               80.7     85.0
DEXTR                 91.5     94.4
IOG (3 clicks)        93.2     96.3
IOG (4 clicks)        94.4     96.9
Generalization
PASCAL → COCO: accuracy on unseen categories (79.9%, 81.7%) stays close to that on seen categories (80.3%, 82.1%).
Cross-domain transfer to medical imagery, aerial imagery and autonomous driving; fine-tuning (W FT) further improves over direct transfer (W/O FT).
Method comparison: Curve-GCN 79.4, DEXTR 80.2, IOG 83.8.
Qualitative Results
General object scenes, Cityscapes, Agriculture-Vision, Rooftop, ssTEM
Demo
[Youtube] [Bilibili]
Automated Mode of IOG
RGB image + inside guidance + outside guidance → segmentation network
Automated Mode of IOG
RGB image + outside guidance → segmentation network
Leverage off-the-shelf datasets with box annotations (e.g. ImageNet)
(S1) Train a network that takes a box as input
(S2) Infer interior clicks from the masks produced in S1 and apply IOG

Inputs       IoU (PASCAL)
w/ human     93.2
w/o human    91.1
Possible Applications
Image classification, instance segmentation, semantic segmentation, salient object detection, ... and more
https://github.com/shiyinzhang/Pixel-ImageNet
Failure Cases
Part III: Video Object Segmentation
What is video object segmentation (VOS)?
Given the object masks in the first frame, predict the object masks in the subsequent video frames
Applications
Video Conferencing Video Editing & Fashion
Autonomous Vehicle
Datasets
The roadmap of SOTA VOS methods
(methods plotted from very slow to fast)
Previous Solution
[Voigtlaender et al. CVPR 2019]
Motivation: Background Matters
Reference (t = 1) vs. prediction (t = T)
[Yang et al. ECCV 2020]
Collaborative VOS by Foreground-Background Integration (CFBI)
Robust to different moving rates between two consecutive frames
Fast moving rates Slow moving rates
FG & BG Global & Multi-local Matching
for improving global consistency and robustness to different moving rates
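A toy NumPy sketch of foreground-background global matching, heavily simplified from CFBI: no learned embedding network, no multi-local windows, and the distance-to-score conversion below is an illustrative choice rather than the paper's formula.

```python
import numpy as np

def global_matching(cur, ref, ref_mask):
    """For every pixel embedding in the current frame, take the minimum
    distance to the reference foreground and background embeddings, and
    turn the two distances into a soft foreground score in [0, 1]."""
    H, W, C = cur.shape
    flat = cur.reshape(-1, C)
    fg = ref.reshape(-1, C)[ref_mask.reshape(-1)]
    bg = ref.reshape(-1, C)[~ref_mask.reshape(-1)]

    def min_dist(points):
        # pairwise squared distances, then min over the reference pixels
        d = ((flat[:, None, :] - points[None, :, :]) ** 2).sum(-1)
        return d.min(axis=1)

    d_fg, d_bg = min_dist(fg), min_dist(bg)
    score = d_bg / (d_fg + d_bg + 1e-8)   # high where closer to foreground
    return score.reshape(H, W)

np.random.seed(1)
ref = np.random.randn(8, 8, 4)
ref_mask = np.zeros((8, 8), dtype=bool)
ref_mask[2:6, 2:6] = True
score = global_matching(ref, ref, ref_mask)  # matching a frame to itself
print(score[3, 3] > 0.5, score[0, 0] < 0.5)  # True True
```

Matching against background as well as foreground is what distinguishes this from foreground-only embedding matching: a pixel that resembles the background is suppressed even if it is also somewhat similar to the foreground.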
Robust to objects of various scales
Instance-level Attention
for perceiving large objects better
Collaborative Ensembler
for building large receptive fields that aggregate and process all the information
Training Tricks
Ablation Experiments
Single-model ablation study on DAVIS-2017 validation set
Comparison with SOTA
Results on YouTube-VOS, DAVIS 2016 and DAVIS 2017. CFBI+ denotes using a multi-scale and flip strategy at test time.
Demo
[Bilibili] [Youtube]
More Experiments at Github
Future Work
Conclusions
https://github.com/speedinghzl/CCNet
https://github.com/shiyinzhang/Inside-Outside-Guidance
https://github.com/z-x-yang/CFBI