SLIDE 1

Shu Kong

CS, ICS, UCI

Scene Parsing through Per-Pixel Labeling: a better and faster way

SLIDE 2

Image Understanding --> Scene Parsing

SLIDE 3

semantic segmentation

classifying each pixel into one of the defined categories

Scene Parsing
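As a concrete illustration of "classifying each pixel", here is a minimal sketch; the class names and logits are toy values for illustration, not the actual label set or model:

```python
import numpy as np

CLASSES = ["road", "car", "sky"]  # toy label set, not the paper's categories

def parse(logits):
    """Semantic segmentation as per-pixel classification:
    class logits of shape (K, H, W) -> label map of shape (H, W)."""
    return logits.argmax(axis=0)

# one 2x2 image, K = 3 classes
logits = np.array([[[0.9, 0.1], [0.2, 0.1]],   # road
                   [[0.0, 0.8], [0.1, 0.2]],   # car
                   [[0.1, 0.1], [0.7, 0.7]]])  # sky
labels = parse(logits)  # -> [[0, 1], [2, 2]]
```

Every pixel gets exactly one label; the interesting design questions are in how the logits are computed.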

SLIDE 4

semantic segmentation (what & where); localization (where); support, surface normal (relation)

Scene Parsing

SLIDE 5

1. Background
2. Attention to Perspective: Depth-aware Pooling Module
3. Recurrent Refining with Perspective Understanding in the Loop
4. Attention to Perspective Again
5. Pixel-wise Attentional Gating (PAG)
6. Pixel-Level Dynamic Routing
7. Conclusion

Outline

SLIDE 6

1. Background
2. Attention to Perspective: Depth-aware Pooling Module
3. Recurrent Refining with Perspective Understanding in the Loop
4. Attention to Perspective Again
5. Pixel-wise Attentional Gating (PAG)
6. Pixel-Level Dynamic Routing
7. Conclusion

Outline

SLIDE 7

semantic segmentation

classifying each pixel into one of the defined categories

Scene Parsing

SLIDE 8

Scene Parsing from Perspective Image

large scale variation

car, pole / car vs. train; white board, chair / chair vs. white board

SLIDE 9

None of them consider “perspective” explicitly.

Tons of (deep) scene parsers, but...

SLIDE 10

1. Background
2. Attention to Perspective: Depth-aware Pooling Module
3. Recurrent Refining with Perspective Understanding in the Loop
4. Attention to Perspective Again
5. Pixel-wise Attentional Gating (PAG)
6. Pixel-Level Dynamic Routing
7. Conclusion

Outline

SLIDE 11

For each pixel, deciding the size of the field of view (FoV) to aggregate information

Attention to Perspective: Depth-aware Pooling

SLIDE 12

For each pixel, deciding the size of the field of view (FoV) to aggregate information. The closer the object is to the camera, the larger it appears in the image, and the larger the FoV the network should “pool”.

Attention to Perspective: Depth-aware Pooling

SLIDE 13

Depth conveys the scale information.

The closer the object is to the camera, the larger it appears in the image, and the larger the FoV the network should “pool”.

Depth-aware Pooling Module

SLIDE 14

How to use depth to choose the FoV size?

Depth-aware Pooling Module

SLIDE 15

How to use depth to choose the FoV size? How about making the pooling size adaptive w.r.t. depth?

Depth-aware Pooling Module

SLIDE 16

How to use depth to choose the FoV size? How about making the pooling size adaptive w.r.t. depth? We turn to dilated convolution (atrous convolution).

Depth-aware Pooling Module

SLIDE 17

Atrous convolution (skipping inputs / inserting zeros); “a trous” (French) -- “with holes” (English)

Depth-aware Pooling Module

DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs

SLIDE 18

2D atrous convolution with different dilation rates.

Depth-aware Pooling Module
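To make the "holes" concrete, here is a minimal single-channel atrous convolution in NumPy; this is an illustrative sketch, not the DeepLab implementation. A k×k kernel with dilation rate r covers an effective field of view of (k-1)·r + 1 pixels with no extra parameters:

```python
import numpy as np

def dilated_conv2d(x, kernel, rate):
    """Single-channel 2D atrous convolution ("convolution with holes").

    The kernel samples the input with gaps of `rate` pixels, so a 3x3
    kernel sees an effective window of 2*rate + 1 pixels."""
    k = kernel.shape[0]                 # assume square, odd-sized kernel
    eff = (k - 1) * rate + 1            # effective kernel extent
    pad = eff // 2
    xp = np.pad(x, pad)                 # zero padding keeps output size
    out = np.zeros_like(x, dtype=float)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            # take k x k samples with holes of size `rate` between them
            patch = xp[i:i + eff:rate, j:j + eff:rate]
            out[i, j] = (patch * kernel).sum()
    return out
```

With rate = 1 this reduces to an ordinary convolution; increasing the rate enlarges the FoV without adding weights, which is exactly why it suits depth-adaptive pooling.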

SLIDE 19

quantize the depth into five scales with dilation rates {1, 2, 4, 8, 16}

Depth-aware Pooling Module
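The quantization step can be sketched as a per-pixel lookup from depth to dilation rate. Only the rates {1, 2, 4, 8, 16} come from the slides; the bin edges below are assumed values for illustration:

```python
import numpy as np

# near pixels -> large apparent size -> large FoV -> large dilation rate
RATES = np.array([16, 8, 4, 2, 1])
# hypothetical depth thresholds (e.g., meters); NOT the paper's actual bins
EDGES = np.array([2.0, 5.0, 10.0, 20.0])

def depth_to_rate(depth_map):
    """Map per-pixel depth to one of five dilation rates."""
    bins = np.digitize(depth_map, EDGES)  # bin index in 0..4
    return RATES[bins]
```

At training/inference time, each pixel's features would then be pooled by the atrous branch whose rate matches its depth bin.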

SLIDE 20

Alternatively, learn a depth estimator and test without depth: quantized depth-scale classification, whose softmax weights serve as multiplicative gating

Depth-aware Pooling Module

  • S. Kong, C. Fowlkes, Recurrent Scene Parsing with Perspective Understanding in the Loop, CVPR, 2018
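A sketch of the multiplicative gating described above: the depth-scale classifier's per-pixel softmax weights average the features pooled at each dilation rate. Shapes and names are illustrative, not the paper's code:

```python
import numpy as np

def softmax(z, axis=0):
    z = z - z.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def depth_gated_pool(branch_feats, scale_logits):
    """Multiplicative gating over multi-scale branches.

    branch_feats: (S, C, H, W) features pooled at S dilation rates
    scale_logits: (S, H, W) logits from the depth-scale classifier
    returns:      (C, H, W) per-pixel convex combination of the branches
    """
    w = softmax(scale_logits, axis=0)          # (S, H, W), sums to 1 per pixel
    return (branch_feats * w[:, None]).sum(0)  # weight and merge branches
```

Because the weights are a softmax, the module degenerates to simple averaging when the classifier is uncertain and to hard selection when it is confident.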
SLIDE 21

Alternatively, learn a depth estimator and test without depth -- this relies on reliable monocular depth estimation

Depth-aware pooling module

SLIDE 22

many possibilities to explore --

1. sharing the parameters in this pooling module (multiPool)
2. averaging the feature vs. attention vs. depth-aware gating
3. MultiPool vs. MultiScale (input)

Depth-aware pooling module

SLIDE 23

many possibilities to explore --

1. sharing the parameters in this pooling module (multiPool)

Depth-aware pooling module

SLIDE 24

Cityscapes dataset; metric: Intersection over Union (IoU); using the ground-truth disparity map, 5 discrete bins for 5 scales {1, 2, 4, 8, 16}

Depth-aware pooling module

SLIDE 25

Cityscapes dataset; metric: Intersection over Union (IoU); using the ground-truth disparity map, 5 discrete bins for 5 scales {1, 2, 4, 8, 16}

Depth-aware pooling module

      deepLab (baseline)  avg.   gtDepth tiedKernel  gtDepth untiedKernel
IoU   0.738               0.747  0.748               0.753

SLIDE 26

train the depth estimation branch to see if the estimated depth also helps

Depth-aware pooling module

      deepLab (baseline)  avg.   gtDepth tiedKernel  gtDepth untiedKernel
IoU   0.738               0.747  0.748               0.753

SLIDE 27

Cityscapes dataset; metric: Intersection over Union (IoU); using the ground-truth disparity map, 5 discrete bins for 5 scales {1, 2, 4, 8, 16}

Depth-aware pooling module

      deepLab (baseline)  avg.   gtDepth tiedKernel  gtDepth untiedKernel  predDepth untiedKernel
IoU   0.738               0.747  0.748               0.753                 0.759

SLIDE 28

Cityscapes dataset; metric: Intersection over Union (IoU); using the ground-truth disparity map, 5 discrete bins for 5 scales {1, 2, 4, 8, 16}

Why better?

Depth-aware pooling module

      deepLab (baseline)  avg.   gtDepth tiedKernel  gtDepth untiedKernel  predDepth untiedKernel
IoU   0.738               0.747  0.748               0.753                 0.759

SLIDE 29

many possibilities to explore --

1. sharing the parameters in this pooling module (multiPool)
2. averaging the feature vs. attention vs. depth-aware gating

Depth-aware pooling module

SLIDE 30

many possibilities to explore --

1. sharing the parameters in this pooling module (multiPool)
2. averaging the feature vs. attention vs. depth-aware gating

Depth-aware pooling module

SLIDE 31

many possibilities to explore --

1. sharing the parameters in this pooling module (multiPool)
2. averaging the feature vs. attention vs. depth-aware gating
3. MultiPool vs. MultiScale (input)

Depth-aware pooling module

SLIDE 32

many possibilities to explore --

1. sharing the parameters in this pooling module (multiPool)
2. averaging the feature vs. attention vs. depth-aware gating
3. MultiPool vs. MultiScale (input)

Depth-aware pooling module

  • S. Kong, C. Fowlkes, Recurrent Scene Parsing with Perspective Understanding in the Loop, CVPR, 2018
SLIDE 33

Qualitative Results -- street images

Depth-aware pooling module

SLIDE 34

Qualitative Results -- panorama images

Depth-aware pooling module

SLIDE 35

Good enough?

Depth-aware pooling module

SLIDE 36

Recurrent Refining with Perspective Understanding in the Loop

Recurrent Refining Module

SLIDE 37

1. Background
2. Attention to Perspective: Depth-aware Pooling Module
3. Recurrent Refining with Perspective Understanding in the Loop
4. Attention to Perspective Again
5. Pixel-wise Attentional Gating (PAG)
6. Pixel-Level Dynamic Routing
7. Conclusion

Recurrent Refining Module

SLIDE 38

Recurrently refining the results by adapting the predicted depth

Recurrent Refinement Module

SLIDE 39

unrolling the recurrent module during training
adding a loss to each unrolled loop
embedding the depth-aware gating module in the loops

Recurrent Refinement Module

  • S. Kong, C. Fowlkes, Recurrent Scene Parsing with Perspective Understanding in the Loop, CVPR, 2018
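The unrolling-with-per-loop-loss idea can be sketched as follows; `refine` here is a toy stand-in for the actual refinement network (which re-runs depth-aware pooling on the features), the point is only the training structure:

```python
import numpy as np

def refine(pred, feat):
    """Stand-in for one pass of the refinement network: move the current
    prediction halfway toward the (toy) evidence carried by feat."""
    return pred + 0.5 * (feat - pred)

def unrolled_loss(feat, target, steps=3):
    """Unroll the recurrent module `steps` times and attach a loss term to
    every loop (deep supervision), as the slides describe."""
    pred = np.zeros_like(target)
    losses = []
    for _ in range(steps):
        pred = refine(pred, feat)                 # one refinement loop
        losses.append(np.mean((pred - target) ** 2))  # per-loop loss
    return pred, losses
```

Supervising every unrolled step keeps early loops useful on their own, so inference can stop after fewer loops when compute is tight.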
SLIDE 40

Recurrently refining the results by adapting the predicted depth

Recurrent Refinement Module

  • S. Kong, C. Fowlkes, Recurrent Scene Parsing with Perspective Understanding in the Loop, CVPR, 2018

SLIDE 41

Qualitative Results -- NYU-depth-v2 indoor

blue --> closer --> larger pooling size

Recurrent Refinement Module

  • S. Kong, C. Fowlkes, Recurrent Scene Parsing with Perspective Understanding in the Loop, CVPR, 2018

SLIDE 42

Qualitative Results -- Cityscapes

yellow --> closer --> larger pooling size

Recurrent Refinement Module

  • S. Kong, C. Fowlkes, Recurrent Scene Parsing with Perspective Understanding in the Loop, CVPR, 2018

SLIDE 43

Qualitative Results -- Stanford-2D-3D (panoramas)

Recurrent Refinement Module

  • S. Kong, C. Fowlkes, Recurrent Scene Parsing with Perspective Understanding in the Loop, CVPR, 2018

SLIDE 44

Qualitative Results -- Stanford-2D-3D (panoramas)

Holes are filled!

Recurrent Refinement Module

  • S. Kong, C. Fowlkes, Recurrent Scene Parsing with Perspective Understanding in the Loop, CVPR, 2018

SLIDE 45

1. Background
2. Attention to Perspective: Depth-aware Pooling Module
3. Recurrent Refining with Perspective Understanding in the Loop
4. Attention to Scale Again
5. Pixel-wise Attentional Gating (PAG)
6. Pixel-Level Dynamic Routing
7. Conclusion

Outline

SLIDE 46

Attention to Scale Again

SLIDE 47

Attentional maps prevent the model from pooling across different segments.

Attention to Scale Again

SLIDE 48

Attentional maps prevent the model from pooling across different segments. Some scales are rarely used.

Attention to Scale Again

SLIDE 49

learning an attentional module to aggregate information; six scales with dilation rates {1, 2, 4, 6, 8, 10}; NYU-depth-v2 dataset (indoor scene parsing); ResNet50 backbone

Attention to Scale Again

SLIDE 50

learning an attentional module to choose the “correct” pooling scale; six scales with dilation rates {1, 2, 4, 6, 8, 10}; NYU-depth-v2 dataset (indoor scene parsing); ResNet50 backbone

Attention to Scale Again

      baseline  res6
IoU   0.4205    0.4599

SLIDE 51

Which layer to insert this attentional gating module?

Attention to Scale Again

res1 res2 res3 res4 res5 res6

SLIDE 52

Which layer to insert this attentional gating module?

Attention to Scale Again

res1 res2 res3 res4 res5 res6

      baseline  res6    res5    res4    res3
IoU   0.4205    0.4599  0.4652  0.4567  0.4413

  • S. Kong, C. Fowlkes, Pixel-wise Attentional Gating for Parsimonious Pixel Labeling, 2018

SLIDE 53

Which layer to insert this attentional gating module?

Attention to Scale Again

      res{5,6}  res{4,5}  res{3,4,5}  res{4,5,6}  res{3,4,5,6}
IoU   0.4644    0.4548    0.4483      0.4497      0.4402

res1 res2 res3 res4 res5 res6

      baseline  res6    res5    res4    res3
IoU   0.4205    0.4599  0.4652  0.4567  0.4413

  • S. Kong, C. Fowlkes, Pixel-wise Attentional Gating for Parsimonious Pixel Labeling, 2018

SLIDE 54

It achieves the best performance when the attentional gating module is inserted at the second-to-last residual block.

Attention to Scale Again

      baseline  res5
IoU   0.4205    0.4652

SLIDE 55

Qualitative Results -- res6

Attention to Scale Again

SLIDE 56

Qualitative Results -- res5

Attention to Scale Again

SLIDE 57

Qualitative Results -- res4

Attention to Scale Again

SLIDE 58

Qualitative Results -- res3

Attention to Scale Again

SLIDE 59

Qualitative Results -- res{3,4,5,6}

Attention to Scale Again

SLIDE 60

Qualitative Results -- res{5,6}

Attention to Scale Again

SLIDE 61

Qualitative Results -- res{5,6}

Attention to Scale Again

SLIDE 62

Can we choose the region to process at a specific scale, instead of computing over the whole feature maps?

Attention to Scale Again

SLIDE 63

Attention to Scale Again

SLIDE 64

1. Background
2. Attention to Perspective: Depth-aware Pooling Module
3. Recurrent Refining with Perspective Understanding in the Loop
4. Attention to Perspective Again
5. Pixel-wise Attentional Gating (PAG)
6. Pixel-Level Dynamic Routing
7. Conclusion

Outline

SLIDE 65

The difficulty is how to produce binary masks while still allowing backpropagation for end-to-end training.

Pixel-wise Attentional Gating (PAG)

SLIDE 66

using the Gumbel-Max trick for discrete (binary) masks

Pixel-wise Attentional Gating (PAG)

Gumbel, E.J.: Statistics of Extremes. Courier Corporation (2012)
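A minimal NumPy sketch of the trick: Gumbel-Max draws exact discrete samples, and the Gumbel-softmax (concrete) relaxation keeps a differentiable surrogate. This is a generic illustration, not the paper's per-pixel gating code:

```python
import numpy as np

def gumbel_max_sample(logits, rng):
    """Gumbel-Max trick: argmax(logits + Gumbel noise) is an exact sample
    from the categorical distribution softmax(logits)."""
    g = -np.log(-np.log(rng.uniform(size=logits.shape)))  # Gumbel(0, 1) noise
    return int(np.argmax(logits + g))

def gumbel_softmax(logits, tau, rng):
    """Concrete / Gumbel-softmax relaxation: a differentiable distribution
    that approaches a one-hot sample as the temperature tau goes to 0."""
    g = -np.log(-np.log(rng.uniform(size=logits.shape)))
    y = np.exp((logits + g) / tau)
    return y / y.sum()
```

In practice a straight-through estimator uses the hard argmax in the forward pass and the soft relaxation for gradients, which is what makes end-to-end training with binary masks possible.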

SLIDE 67

using the Gumbel-Max trick for discrete (binary) masks

Pixel-wise Attentional Gating (PAG)

Categorical Reparameterization with Gumbel-Softmax, ICLR, 2017
The Concrete Distribution: A Continuous Relaxation of Discrete Random Variables, ICLR, 2017

SLIDE 68

using the Gumbel-Max trick for discrete (binary) masks

Pixel-wise Attentional Gating (PAG)

Categorical Reparameterization with Gumbel-Softmax, ICLR, 2017
The Concrete Distribution: A Continuous Relaxation of Discrete Random Variables, ICLR, 2017

SLIDE 69

using the Gumbel-Max trick for discrete (binary) masks

Pixel-wise Attentional Gating (PAG)

Categorical Reparameterization with Gumbel-Softmax, ICLR, 2017
The Concrete Distribution: A Continuous Relaxation of Discrete Random Variables, ICLR, 2017

SLIDE 70

Multiplicative gating acts as a weighted average; attentional gating selects.

Pixel-wise Attentional Gating (PAG)

SLIDE 71

Perforated convolution in the low-level implementation

Pixel-wise Attentional Gating (PAG)

PerforatedCNNs: Acceleration through Elimination of Redundant Convolutions, NIPS 2016
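The idea of perforated convolution can be sketched as evaluating the kernel only at masked pixels; this is an illustrative loop, whereas a real implementation gathers the active positions into dense batches for speed:

```python
import numpy as np

def perforated_conv2d(x, kernel, mask):
    """Sketch of a perforated 3x3 convolution: the kernel is evaluated only
    where mask == 1; everywhere else the input passes through unchanged
    (the "copy", i.e. rate-0, branch)."""
    xp = np.pad(x, 1)                 # zero padding for the 3x3 window
    out = x.astype(float).copy()      # inactive positions keep the input
    for i, j in zip(*np.nonzero(mask)):
        out[i, j] = (xp[i:i + 3, j:j + 3] * kernel).sum()
    return out
```

The compute cost scales with the number of active pixels, which is what lets sparse binary masks translate into real savings.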

SLIDE 72

pooling using a set of 3×3 kernels with dilation rates [0, 1, 2, 4, 6, 8, 10]; rate 0 means the input feature is simply copied into the output feature map

Pixel-wise Attentional Gating (PAG)

  • S. Kong, C. Fowlkes, Pixel-wise Attentional Gating for Parsimonious Pixel Labeling, 2018

SLIDE 73

semantic segmentation

Pixel-wise Attentional Gating (PAG)

  • S. Kong, C. Fowlkes, Pixel-wise Attentional Gating for Parsimonious Pixel Labeling, 2018

SLIDE 74

monocular depth estimation

Pixel-wise Attentional Gating (PAG)

  • S. Kong, C. Fowlkes, Pixel-wise Attentional Gating for Parsimonious Pixel Labeling, 2018

SLIDE 75

surface normal estimation

Pixel-wise Attentional Gating (PAG)

  • S. Kong, C. Fowlkes, Pixel-wise Attentional Gating for Parsimonious Pixel Labeling, 2018

SLIDE 76

Visual summary of three tasks on three different datasets

Pixel-wise Attentional Gating (PAG)

  • S. Kong, C. Fowlkes, Pixel-wise Attentional Gating for Parsimonious Pixel Labeling, 2018

SLIDE 77

More qualitative results on NYU-depth-v2

Pixel-wise Attentional Gating (PAG)

  • S. Kong, C. Fowlkes, Pixel-wise Attentional Gating for Parsimonious Pixel Labeling, 2018

SLIDE 78

More qualitative results on the Stanford-2D-3D dataset

Pixel-wise Attentional Gating (PAG)

  • S. Kong, C. Fowlkes, Pixel-wise Attentional Gating for Parsimonious Pixel Labeling, 2018

SLIDE 79

More qualitative results on Cityscapes

Pixel-wise Attentional Gating (PAG)

  • S. Kong, C. Fowlkes, Pixel-wise Attentional Gating for Parsimonious Pixel Labeling, 2018
SLIDE 80

PAG achieves better performance while keeping computation roughly constant.

Pixel-Level Dynamic Routing

  • S. Kong, C. Fowlkes, Pixel-wise Attentional Gating for Parsimonious Pixel Labeling, 2018

SLIDE 81

PAG achieves better performance while keeping computation roughly constant. It also offers parsimonious inference under a limited computation budget.

Pixel-Level Dynamic Routing

  • S. Kong, C. Fowlkes, Pixel-wise Attentional Gating for Parsimonious Pixel Labeling, 2018

SLIDE 82

1. Background
2. Attention to Perspective: Depth-aware Pooling Module
3. Recurrent Refining with Perspective Understanding in the Loop
4. Attention to Perspective Again
5. Pixel-wise Attentional Gating (PAG)
6. Pixel-Level Dynamic Routing
7. Conclusion

Outline

SLIDE 83

Parsimonious inference as dynamic computation

Dynamic Computation

SLIDE 84

Parsimonious inference as dynamic computation

Dynamic Computation

[1] BlockDrop: Dynamic Inference Paths in Residual Networks
[2] Convolutional Networks with Adaptive Computation Graphs
[3] SkipNet: Learning Dynamic Routing in Convolutional Networks
[4] Spatially Adaptive Computation Time for Residual Networks

SLIDE 85

More generally, can we allocate dynamic computation time to each pixel of each image instance?

Pixel-Level Dynamic Routing

SLIDE 86

Pixel-Level Dynamic Routing

SLIDE 87

Inserting PAG at each residual block for fine-tuning

Dynamic Computation

  • S. Kong, C. Fowlkes, Pixel-wise Attentional Gating for Parsimonious Pixel Labeling, 2018

SLIDE 88

sparse binary masks for perforated convolution; a KL-divergence term encourages sparse masks

Dynamic Computation

  • S. Kong, C. Fowlkes, Pixel-wise Attentional Gating for Parsimonious Pixel Labeling, 2018
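One way to write such a KL sparsity term, as a sketch: penalize the divergence between the average per-pixel "on" probability and a target Bernoulli activation rate. The target-rate value below is illustrative, not taken from the paper:

```python
import numpy as np

def kl_sparsity_loss(gate_probs, target_rate=0.3):
    """KL(Bernoulli(target_rate) || Bernoulli(mean gate probability)).

    Pushing the mean per-pixel 'on' probability toward a small target rate
    encourages sparse binary masks, hence cheaper perforated convolutions."""
    p = np.clip(gate_probs.mean(), 1e-6, 1 - 1e-6)  # avoid log(0)
    t = target_rate
    return t * np.log(t / p) + (1 - t) * np.log((1 - t) / (1 - p))
```

The term is zero exactly when the network's average activation matches the budget and grows as the masks become denser (or sparser) than requested.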
SLIDE 89

Perforated convolution in the low-level implementation

Pixel-wise Attentional Gating (PAG)

PerforatedCNNs: Acceleration through Elimination of Redundant Convolutions, NIPS 2016

SLIDE 90

Semantic segmentation on the NYU-depth-v2 dataset

Dynamic Computation

  • S. Kong, C. Fowlkes, Pixel-wise Attentional Gating for Parsimonious Pixel Labeling, 2018

SLIDE 91

Boundary detection on BSDS500

Dynamic Computation

  • S. Kong, C. Fowlkes, Pixel-wise Attentional Gating for Parsimonious Pixel Labeling, 2018

SLIDE 92

Semantic segmentation on NYU-depth-v2; boundary detection on BSDS500

Dynamic Computation

  • S. Kong, C. Fowlkes, Pixel-wise Attentional Gating for Parsimonious Pixel Labeling, 2018

SLIDE 93

Boundary detection on the BSDS500 dataset

Dynamic Computation

  • S. Kong, C. Fowlkes, Pixel-wise Attentional Gating for Parsimonious Pixel Labeling, 2018

SLIDE 94

NYU-depth-v2 dataset

Dynamic Computation

SLIDE 95

Stanford-2D-3D dataset

Dynamic Computation

[1] BlockDrop: Dynamic Inference Paths in Residual Networks
[2] Convolutional Networks with Adaptive Computation Graphs
[3] SkipNet: Learning Dynamic Routing in Convolutional Networks
[4] Spatially Adaptive Computation Time for Residual Networks

SLIDE 96

Cityscapes dataset

Dynamic Computation

[1] BlockDrop: Dynamic Inference Paths in Residual Networks
[2] Convolutional Networks with Adaptive Computation Graphs
[3] SkipNet: Learning Dynamic Routing in Convolutional Networks
[4] Spatially Adaptive Computation Time for Residual Networks

SLIDE 97

1. Background
2. Attention to Perspective: Depth-aware Pooling Module
3. Recurrent Refining with Perspective Understanding in the Loop
4. Pixel-wise Attentional Gating (PAG)
5. Pixel-Level Dynamic Routing
6. Conclusion

Outline

SLIDE 98

1. Scene parsing means more than semantic segmentation: it also involves geometry and inter-object relations.

Conclusion and Future Work

semantic segmentation (what); localization (where); support, surface normal (relation)

SLIDE 99

1. Scene parsing means more than semantic segmentation: it also involves geometry and inter-object relations.
2. Potentially a unified model for all these tasks. But how to share knowledge learned from different tasks? How to wire them up?

Conclusion and Future Work

SLIDE 100

1. Scene parsing means more than semantic segmentation: it also involves geometry and inter-object relations.
2. Potentially a unified model for all these tasks.
3. The Pixel-wise Attentional Gating unit (PAG) allocates dynamic computation per pixel; it is general and agnostic to architectures and problems.

Conclusion and Future Work

SLIDE 101

1. Scene parsing means more than semantic segmentation: it also involves geometry and inter-object relations.
2. Potentially a unified model for all these tasks.
3. The Pixel-wise Attentional Gating unit (PAG) allocates dynamic computation per pixel; it is general and agnostic to architectures and problems.
4. PAG reduces computation by 10% without noticeable loss in accuracy, and performance degrades gracefully under stronger computational constraints.

Conclusion and Future Work

SLIDE 102

1. Scene parsing means more than semantic segmentation: it also involves geometry and inter-object relations.
2. Potentially a unified model for all these tasks.
3. The Pixel-wise Attentional Gating unit (PAG) allocates dynamic computation per pixel; it is general and agnostic to architectures and problems.
4. PAG reduces computation by 10% without noticeable loss in accuracy, and performance degrades gracefully under stronger computational constraints.

But for real-time inference...?

Conclusion and Future Work

SLIDE 103

Thanks

Q&A

Charless Fowlkes, Shu Kong