SLIDE 1

LEARNING AFFINITY VIA SPATIAL PROPAGATION NETWORK

Sifei Liu, Shalini De Mello, Jinwei Gu, Guangyu Zhong, Ming-Hsuan Yang, Jan Kautz NVIDIA Research, March 27, 2018

SLIDE 2

WHAT IS AFFINITY?

  • The pairwise relation between two pixels/regions
  • A typical model combines pixel locations and intensities:

$w_{ij} = K_G(i, j) \cdot K_I(I_i, I_j)$

where $K_G$ measures geometric closeness and $K_I$ measures intensity closeness; both are manually designed kernel functions (e.g., Gaussians).
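To make the kernel form concrete, here is a minimal sketch (ours, not from the slides) of such a hand-crafted affinity with Gaussian kernels, as used in bilateral filtering and dense CRFs; the sigma values are illustrative:

```python
import numpy as np

def bilateral_affinity(p_i, p_j, I_i, I_j, sigma_s=3.0, sigma_r=0.1):
    """Hand-crafted affinity w_ij = K_G(i, j) * K_I(I_i, I_j):
    a geometric-closeness kernel times an intensity-closeness kernel,
    both Gaussian (illustrative sigmas)."""
    p_i, p_j = np.asarray(p_i, float), np.asarray(p_j, float)
    I_i, I_j = np.asarray(I_i, float), np.asarray(I_j, float)
    k_g = np.exp(-np.sum((p_i - p_j) ** 2) / (2 * sigma_s ** 2))  # K_G: locations
    k_i = np.exp(-np.sum((I_i - I_j) ** 2) / (2 * sigma_r ** 2))  # K_I: intensities
    return k_g * k_i

# Nearby, similar pixels get high affinity; distant or dissimilar pixels do not.
print(bilateral_affinity((10, 10), (11, 10), 0.50, 0.52))  # close in space and intensity
print(bilateral_affinity((10, 10), (40, 10), 0.50, 0.52))  # far apart: affinity ~ 0
```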

SLIDE 3

WHY AFFINITY?

A semantic-level example

[Figure: an image and its semantic labels (nose, hair, eye), with pixel pairs that are spatially adjacent; far away but with similar texture; and far away with different shape and appearance.]

$w_{ij} = K_G(i, j) \cdot K_I(I_i, I_j)$

SLIDE 4

[Figure: an image patch processed by two models.]

  • a CNN: deep; pixels are independent
  • a propagation network: shallow; pixels are correlated


SLIDE 6

HOW TO USE IT?

[Figure: CNN-based semantic segmentation: image → CNN → pixel-wise probability → softmax → segmentation.]

  • Standard CNN-based image segmentation does not explicitly model the pairwise relations of pixels.

SLIDE 7

WHY AFFINITY?

[Figure: image, CNN-based segmentation, our result.]

Refine the segmentation

SLIDE 8

WHY AFFINITY?

[Figure: image, CNN-based segmentation, our result.]

Improve the context

SLIDE 9

PROPOSED ARCHITECTURE

[Figure: a guidance network (deep CNN) maps the RGB image to the affinity $w_t$; the SPN takes the coarse mask through conv layers and outputs the refined result.]

SLIDE 10

PROPOSED ARCHITECTURE

Spatial propagation networks (SPN)

[Figure: coarse mask → conv → … → conv → refined result.]

SLIDE 11

SPATIAL PROPAGATION NETWORK

ℎ" = $ − &" '" + )"ℎ"*+, . ∈ 2, 1

  • ℎ": the ."2 column in the SPN hidden layer
  • )": an 1×1 transform matrix between t-1 and t
  • &": the degree matrix normalizing the response of ℎ"
  • &" 4, 4 = ∑

)" 6,7

8 79+,7:6

Row/column-wise linear propagation

[Figure: left-to-right propagation from $h_{t-1}$ to $h_t$.]
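A minimal numpy sketch of this one-directional pass (ours, not the authors' code), assuming a full $n \times n$ transform matrix per column:

```python
import numpy as np

def spn_left_to_right(X, W):
    """Row/column-wise linear propagation h_t = (I - d_t) x_t + w_t h_{t-1}.
    X: (n, T) feature map with columns x_1..x_T of height n.
    W: (T, n, n) transform matrices; W[t] links column t-1 to column t.
    d_t is diagonal with d_t(i, i) = sum_{j != i} w_t(i, j)."""
    n, T = X.shape
    H = np.zeros_like(X, dtype=float)
    H[:, 0] = X[:, 0]                              # h_1 = x_1
    I = np.eye(n)
    for t in range(1, T):
        w = W[t]
        d = np.diag(w.sum(axis=1) - np.diag(w))    # degree matrix of w_t
        H[:, t] = (I - d) @ X[:, t] + w @ H[:, t - 1]
    return H

rng = np.random.default_rng(0)
X = rng.random((8, 16))
W = rng.random((16, 8, 8)) * 0.1   # small weights keep the recursion stable
print(spn_left_to_right(X, W).shape)   # (8, 16)
```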

SLIDE 12

SPATIAL PROPAGATION NETWORK

  • ℎ" = $",
  • ℎ& = ' − )& $& + +&ℎ",
  • ℎ, = ' − ), $, + +,ℎ,-&,
  • ℎ. = ' − ). $. + +.ℎ.-&,

Row/column-wise linear propagation

… …

  • /0: the span of ℎ&-. 23 ℎ&, ℎ4, … , ℎ.

6

  • 70: the span of $&-. 23 $&, $4, … , $.

6

  • 8: an 9×9 9 = ;4 matrix
  • <, = ' − ),

Affinity: the off-diagonal entities of G
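Unrolling the recursion makes $G$ explicit: block $(t, s)$ of $G$ is $w_t w_{t-1} \cdots w_{s+1} \lambda_s$ for $s < t$ and $\lambda_t$ on the diagonal. The sketch below (ours; it reuses `spn_left_to_right` from the previous snippet) builds $G$ block by block and checks it against the recursive pass:

```python
import numpy as np

def unroll_to_G(W):
    """Build G (shape (n*T, n*T)) such that vec(H) = G vec(X), by unrolling
    h_t = lam_t x_t + w_t h_{t-1} with lam_t = I - d_t and lam_1 = I."""
    T, n, _ = W.shape
    I = np.eye(n)
    lam = [I] + [I - np.diag(W[t].sum(axis=1) - np.diag(W[t])) for t in range(1, T)]
    G = np.zeros((n * T, n * T))
    for t in range(T):
        G[t*n:(t+1)*n, t*n:(t+1)*n] = lam[t]       # diagonal block: lam_t
        P = np.eye(n)
        for s in range(t - 1, -1, -1):             # accumulate w_t ... w_{s+1}
            P = P @ W[s + 1]
            G[t*n:(t+1)*n, s*n:(s+1)*n] = P @ lam[s]
    return G

# Sanity check: the unrolled linear map matches the recursive propagation.
rng = np.random.default_rng(1)
n, T = 4, 5
X = rng.random((n, T))
W = rng.random((T, n, n)) * 0.1
H = (unroll_to_G(W) @ X.T.ravel()).reshape(T, n).T   # vec(X) stacks the columns
assert np.allclose(H, spn_left_to_right(X, W))
# The learned pairwise affinities are the off-diagonal entries of G.
```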

SLIDE 13

SPATIAL PROPAGATION NETWORK

Advantage: a compact representation

Affinity: the off-diagonal entries of $G$

  • Parameters:
  • Dense affinity (dense CRF): $n^2 \times n^2$
  • Affinity in SPN (for all $w_t$): $n^2 \times n$

learnable affinity matrix = learn all $w_1, w_2, \dots, w_N$
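For a concrete sense of scale (our arithmetic, assuming a 128×128 image, so $n^2$ = 16,384 pixels):

```python
# Dense CRF: one weight per pixel pair vs. SPN: n weights per pixel (all w_t combined).
n = 128
dense_crf_entries = (n * n) ** 2    # n^2 x n^2 = 268,435,456 pairwise terms
spn_entries = (n * n) * n           # n^2 x n  =   2,097,152 learned weights
print(f"dense CRF: {dense_crf_entries:,}  SPN: {spn_entries:,}")
```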

SLIDE 14

LEARNING THE AFFINITY MATRIX

  • Basic idea: learn all the entries of $w_t$ w.r.t. the 4 directions
  • Disadvantage: huge dimension of the network output (number of channels $= n$, one weight for every pixel of the previous row/column, per direction)

[Figure: image → CNN → a fully-connected spatial propagation over $h_t, h_{t-1}, h_{t-2}, \dots$]

$\{w_1, w_2, \dots, w_N\}$ learned for each of the four directions: $\rightarrow$, $\leftarrow$, $\uparrow$, $\downarrow$

SLIDE 15
REDUCING CONNECTIONS

Three-way connection

  • Each pixel is connected to 3 adjacent pixels in the previous row/column
  • Tridiagonal transform matrix $w_t$

ℎ$." = 1 − ) *$,"

$∈-

.$," + ) *$,"ℎ$,"01

$∈-

where 2 = 3 ℎ" = 1 − 4" ." + !"ℎ"01, 5 ∈ 2, 7

!" =

ℎ" ℎ"01 ℎ" ℎ"01

  • For each pixel, only 3 scalar weights need to be learned for each direction.
  • The total CNN output is: $n^2 \times 12$ (3 weights $\times$ 4 directions per pixel).
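A direct, unvectorized sketch of the three-way update (ours; the weight layout `P[k, t, offset]` for offsets $-1, 0, +1$ is our assumption):

```python
import numpy as np

def spn_three_way(X, P):
    """Three-way propagation: pixel k of column t receives from pixels
    k-1, k, k+1 of column t-1, i.e. w_t is tridiagonal.
    X: (n, T) map; P: (n, T, 3) scalar weights for offsets (-1, 0, +1)."""
    n, T = X.shape
    H = np.zeros_like(X, dtype=float)
    H[:, 0] = X[:, 0]
    for t in range(1, T):
        for k in range(n):
            acc = wsum = 0.0
            for o, off in enumerate((-1, 0, 1)):
                i = k + off
                if 0 <= i < n:                      # skip out-of-image neighbors
                    acc += P[k, t, o] * H[i, t - 1]
                    wsum += P[k, t, o]
            # h_{k,t} = (1 - sum_i p_{i,t}) x_{k,t} + sum_i p_{i,t} h_{i,t-1}
            H[k, t] = (1.0 - wsum) * X[k, t] + acc
    return H
```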

SLIDE 16

PROPOSED ARCHITECTURE

Learnable affinity through a guidance network (CNN)

[Figure: the guidance network (deep CNN) maps the RGB image to the affinity weights $w_t$ (3 scalars per pixel per direction) that feed the SPN.]

SLIDE 17
REDUCING CONNECTIONS

Three-way connection

  • The integration of 4 directions results in dense connections between all pixels.

SLIDE 18

IMAGE SEGMENTATION REFINEMENT

  • Helen dataset: high-resolution face parsing with 11 classes
  • VOC 2012 dataset: general object semantic segmentation with 21 classes

SLIDE 19

A TYPICAL CNN IMAGE SEGMENTATION

[Figure: a fully convolutional network applied to an RGB image.]

Jonathan Long, Evan Shelhamer, Trevor Darrell. Fully convolutional networks for semantic segmentation. CVPR 2015.

SLIDE 20

SPATIAL PROPAGATION NETWORK

[Figure: the coarse mask (128×128×1) is downsampled by 4×4×16 convolutions with stride 2 to 64×64×16, propagated by the SPN with node-wise max-pooling over the directions, and upsampled back to the 128×128×1 refined result.]

  • Input and output: the probability map of all classes (we show one class as an example).
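The four directional passes are combined by the node-wise max-pooling mentioned above. This sketch (ours; it reuses `spn_three_way` from the earlier snippet and assumes a square map so the transposed passes share shapes) flips and transposes the input to reuse the single left-to-right routine:

```python
import numpy as np

def integrate_directions(X, P4):
    """Run the three-way SPN in all four directions and merge the results
    with a node-wise (element-wise) max. X: (n, n) map; P4: (4, n, n, 3)."""
    passes = [
        spn_three_way(X, P4[0]),                              # left -> right
        np.fliplr(spn_three_way(np.fliplr(X), P4[1])),        # right -> left
        spn_three_way(X.T, P4[2]).T,                          # top -> bottom
        np.flipud(spn_three_way(np.flipud(X).T, P4[3]).T),    # bottom -> top
    ]
    return np.maximum.reduce(passes)

rng = np.random.default_rng(2)
X = rng.random((64, 64))              # one class's coarse probability map
P4 = rng.random((4, 64, 64, 3)) * 0.3
print(integrate_directions(X, P4).shape)   # (64, 64)
```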
SLIDE 21

GUIDANCE NETWORK

[Figure: a symmetric CNN with skip links maps the RGB image to $w_t$ of one direction (64×64×16), fed to the propagation module.]

  • Helen: a relatively small network, learned from scratch.
  • VOC: VGG-16 conv1~pool5 with pre-trained weights; symmetric upsampling layers learned from scratch.

SLIDE 22

IMPLEMENTATION

  • Baseline CNN
  • Helen: we train a CNN-base network whose output is 1/2 the size of the input image.
  • VOC: we directly use FCN-8s.
  • Train a universal SPN
  • Coarse mask: segmentation results on the training set by CNN-base.
  • Guidance network: an independent deep CNN.
  • Test on any base network (for VOC only)
  • We directly apply the SPN to any CNN-base network (e.g., DeepLab VGG-16 and ResNet-101).
SLIDE 23

HELEN FACE PARSING

  • CNN-base: the baseline CNN network
  • SPN: the three-way model, with the two different connections and the same guidance network

[Figure: Original, CNN-base, CNN+[1], Ours.]

[1] Sifei Liu, Jinshan Pan and Ming-Hsuan Yang. Learning recursive filters for low-level vision via a hybrid neural network. ECCV 2016.


SLIDE 25

QUANTITATIVE EVALUATION

f-score           skin   brows  eyes   nose   mouth  lip_upper  lip_lower  lip_inner  overall
Multi-obj [1]     90.87  69.89  74.74  90.23  82.07  59.22      66.30      81.70      83.68
CNN base          90.53  70.09  74.86  89.16  83.83  55.61      64.88      71.27      82.89
CNN Highres       91.78  71.84  74.46  89.42  81.83  68.15      72.00      71.95      83.21
CNN + [2]         92.26  75.05  85.44  91.51  88.13  77.61      70.81      79.95      87.09
CNN + SPN (ours)  93.10  78.53  87.71  92.62  91.08  80.17      71.53      83.13      89.30

[1] Sifei Liu, Jimei Yang, Chang Huang and Ming-Hsuan Yang. Multi-objective convolutional learning for face labeling. CVPR 2015.
[2] Sifei Liu, Jinshan Pan and Ming-Hsuan Yang. Learning recursive filters for low-level vision via a hybrid neural network. ECCV 2016.

SLIDE 26

VOC2012 OBJECT SEGMENTATION

  • Segmentation networks
  • FCN-8s
  • DeepLab VGG-16
  • DeepLab ResNet-101
  • Refinement models
  • Dense CRF
  • SPN (trained on FCN-8s)

Experimental comparison

[Figure: a universal SPN refines FCN, DeepLab VGG-16 and DeepLab ResNet-101.]

SLIDE 27

VOC2012 OBJECT SEGMENTATION

  • CNN base: ResNet-101 with dilated convolution (DeepLab pre-trained model)
  • Dense CRF: CNN base + Dense CRF
  • SPN: pre-trained CNN base + SPN

[Figure: Original, CNN base, Dense CRF, SPN.]


SLIDE 29

VOC2012 OBJECT SEGMENTATION

SLIDE 30

VOC2012 OBJECT SEGMENTATION

Quantitative results on the validation set

model       F      F+[2]  F+SPN  V      V+[2]  V+SPN  R      R+[2]  R+SPN
overall AC  91.22  90.64  92.90  92.61  92.16  93.83  94.63  94.12  95.49
mean AC     77.61  70.64  79.49  80.97  73.53  83.15  84.16  77.46  86.09
mean IoU    65.51  60.95  69.86  68.97  64.42  73.12  76.46  72.02  79.76

  • F: FCN-8s
  • V: VGG-16
  • R: ResNet-101

[2] Sifei Liu, Jinshan Pan and Ming-Hsuan Yang. Learning recursive filters for low-level vision via a hybrid neural network. ECCV 2016

SLIDE 31

VOC2012 OBJECT SEGMENTATION

Quantitative results on the validation set

mean IoU           CNN base  +Dense CRF  +SPN
VGG-16 (val)       68.97     71.57       73.12
ResNet-101 (val)   76.40     77.69       79.76
ResNet-101 (test)  -         79.70       80.22

[1] Liang-Chieh Chen*, George Papandreou*, Iasonas Kokkinos, Kevin Murphy, and Alan L. Yuille. DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs. arXiv preprint, 2016.
[2] Sifei Liu, Jinshan Pan and Ming-Hsuan Yang. Learning recursive filters for low-level vision via a hybrid neural network. ECCV 2016.

SLIDE 32

RUNTIME ANALYSIS (VOC MODEL)

Computational efficiency on 512×512 images

method          runtime
Dense CRF [1]   ~1 s
Dense CRF [2]   3.2 s
CRF as RNN [3]  4.4 s
Ours            0.08 s

[1] Philipp Krähenbühl, Vladlen Koltun. Efficient Inference in Fully Connected CRFs with Gaussian Edge Potentials. NIPS 2011.
[2] Liang-Chieh Chen*, George Papandreou*, Iasonas Kokkinos, Kevin Murphy, and Alan L. Yuille. DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs. arXiv preprint arXiv:1606.00915, 2016.
[3] Shuai Zheng*, Sadeep Jayasumana*, Bernardino Romera-Paredes, Vibhav Vineet, Zhizhong Su, Dalong Du, Chang Huang, and Philip H. S. Torr. Conditional Random Fields as Recurrent Neural Networks. ICCV 2015.

SLIDE 33

RUNTIME ANALYSIS

Spatial propagation network

[Figure: runtime breakdown. The guidance network (VGG-16; conv1~conv3) maps the image to the guidance in 57 ms; the coarse mask passes through SPN1 and SPN2 (1.3 ms each) at 64×64×32, followed by bilinear upsampling to the refined mask.]

SLIDE 34

CONCLUSIVE COMPARISON

Graphical model (dense CRF) vs. SPN

                         Dense CRF [1]             SPN
computation              iterative convolution     propagation
learning/inference       several iterations        one pass
pairwise connection      manually designed kernel  end-to-end learned weights
improvement (on VGG-16)  3.77%                     6.02%
run-time (512×512)       3.2 s                     0.08 s

[1] Liang-Chieh Chen*, George Papandreou*, Iasonas Kokkinos, Kevin Murphy, and Alan L. Yuille. DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs. arXiv preprint, 2016

SLIDE 35

TEMPORAL PROPAGATION NETWORK

Extension to the spatial-temporal domain

" # # + % propagated

  • riginal video

propagated k k+10 k+20 color propagation

  • riginal video

propagated HDR propagation

…… ……

key- frame property guided informatio n

SLIDE 36
  • Input: a colored key-frame + the gray-scale frames in between key-frames
  • Output: colored sequences (K = 30)

SLIDE 37
  • Input: HDR key-frames + LDR frames (we only show LDR here)
  • Output: HDR sequences (K = 20)

SLIDE 38

[Figure: gray-scale video, a key-frame, and its sparse annotation (automatically generated from superpixels).]

  • We use the SPN to propagate the very sparse annotation, generating the following color images.
  • During training, we sample one pixel per superpixel at random, where 300 superpixels are produced from a 256×512 gray-scale image.
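A small sketch of this sampling step (ours; it assumes a precomputed superpixel label map, e.g., from SLIC):

```python
import numpy as np

def sample_sparse_annotation(labels, rng=None):
    """Keep exactly one randomly chosen pixel per superpixel.
    labels: (H, W) integer superpixel label map (e.g., ~300 SLIC segments
    on a 256 x 512 gray-scale frame). Returns a boolean annotation mask."""
    rng = rng or np.random.default_rng()
    mask = np.zeros(labels.shape, dtype=bool)
    for lab in np.unique(labels):
        ys, xs = np.nonzero(labels == lab)
        pick = rng.integers(len(ys))       # one random pixel of this superpixel
        mask[ys[pick], xs[pick]] = True
    return mask
```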

SLIDE 39

[Figure: gray-scale video, a key-frame, sparse annotation; results at the 0th, 20th and 200th frames, by SPN and by TPN.]

Our propagation-network-based method (SPN + TPN) performs image-edit propagation as a full pipeline.

SLIDE 40

THANK YOU