LEARNING AFFINITY VIA SPATIAL PROPAGATION NETWORK
Sifei Liu, Shalini De Mello, Jinwei Gu, Guangyu Zhong, Ming-Hsuan Yang, Jan Kautz NVIDIA Research, March 27, 2018
WHAT IS AFFINITY?
The relation between two pixels/regions. A typical example is a bilateral-filter-style kernel that combines geometric closeness and intensity closeness:

w(i, j) = c(p_i, p_j) · s(I_i, I_j)

where c and s are manually designed kernel functions, e.g., Gaussians exp(−x²/σ²).
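As an illustration, such a hand-designed affinity can be sketched in a few lines. The Gaussian kernels and the bandwidths `sigma_s`, `sigma_r` are illustrative choices, not values from the slides:

```python
import numpy as np

def bilateral_affinity(p_i, p_j, I_i, I_j, sigma_s=5.0, sigma_r=0.1):
    """Hand-designed affinity between pixels i and j: the product of a
    geometric (spatial distance) and an intensity (range) Gaussian kernel."""
    d2 = np.sum((np.asarray(p_i, float) - np.asarray(p_j, float)) ** 2)
    geometric = np.exp(-d2 / (2 * sigma_s ** 2))
    intensity = np.exp(-(I_i - I_j) ** 2 / (2 * sigma_r ** 2))
    return geometric * intensity

# Nearby pixels with similar intensity get high affinity ...
a_near = bilateral_affinity((0, 0), (1, 1), 0.5, 0.52)
# ... while distant pixels with different intensity get low affinity.
a_far = bilateral_affinity((0, 0), (40, 40), 0.5, 0.9)
assert a_near > a_far
```

The product form means both kernels must agree before two pixels are considered related, which is exactly what a learned affinity aims to replace.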
[Figure: a face image with semantic labels — nose, hair, eye]
- spatially adjacent
- far away, with similar texture
- far away, different in shape and appearance
[Figure: image patches — some pairs are independent, others correlated]
[Figure: image → CNN → pixel-wise probability (softmax) → segmentation]
Standard CNN-based image segmentation does not explicitly model the pairwise relations of pixels.
[Figure: input images and their CNN-based segmentations]
[Figure: SPN overview — a guidance network (deep CNN) takes the RGB image and outputs affinity weights w; a propagation module (conv layers) refines the coarse mask into the refined result]
One-dimensional linear propagation (e.g., left to right): each column h_t is computed from the input column x_t and the previous hidden column h_{t−1}:

h_t = (I − d_t) x_t + w_t h_{t−1},   t ∈ [2, N]

where d_t(i, i) = Σ_{j=1, j≠i}^{n} w_t(i, j) is the diagonal matrix collecting each row's weights.

[Figure: left-to-right propagation from h_{t−1} to h_t]
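A minimal sketch of this left-to-right sweep, in the simplest case of one scalar weight per pixel (the shapes and the uniform weight 0.5 are illustrative, not from the slides):

```python
import numpy as np

def propagate_left_to_right(x, w):
    """Linear left-to-right propagation, column by column:
    h_t = (1 - w_t) * x_t + w_t * h_{t-1}.
    x, w: arrays of shape (H, W); w holds per-pixel propagation weights."""
    h = np.empty_like(x)
    h[:, 0] = x[:, 0]                      # first column: no previous state
    for t in range(1, x.shape[1]):
        h[:, t] = (1 - w[:, t]) * x[:, t] + w[:, t] * h[:, t - 1]
    return h

x = np.zeros((1, 5)); x[0, 0] = 1.0        # a single seed pixel
w = np.full((1, 5), 0.5)                   # uniform weight 0.5
h = propagate_left_to_right(x, w)
# the seed decays geometrically: 1, 0.5, 0.25, 0.125, 0.0625
```

Each output blends the local input with the propagated state, so information from the seed pixel diffuses rightward at a rate set by the weights.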
Unrolling the recurrence over all columns expresses the whole propagation as a single linear transform H = G X (G is an n²×n² matrix acting on the n²×1 vectorized input).
Affinity: the off-diagonal entries of G.
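This linearity can be checked numerically: since the recurrence is linear in the input, the matrix G can be recovered column by column by propagating standard basis vectors (a 1-D toy case with random weights, for illustration only):

```python
import numpy as np

def propagate_1d(x, w):
    """h_t = (1 - w_t) x_t + w_t h_{t-1}, with h_0 = x_0."""
    h = np.empty_like(x)
    h[0] = x[0]
    for t in range(1, len(x)):
        h[t] = (1 - w[t]) * x[t] + w[t] * h[t - 1]
    return h

n = 6
rng = np.random.default_rng(0)
w = rng.uniform(0, 1, n)

# Because the recurrence is linear in x, h = G @ x for some matrix G.
# Recover G column by column by propagating the standard basis vectors.
G = np.stack([propagate_1d(np.eye(n)[:, k], w) for k in range(n)], axis=1)

x = rng.normal(size=n)
assert np.allclose(G @ x, propagate_1d(x, w))    # same linear map
assert np.allclose(G, np.tril(G))                # causal: lower triangular
# The off-diagonal entries of G carry the pairwise affinities.
```

G is lower triangular here because a single left-to-right sweep is causal; combining sweeps in all four directions densifies the connectivity.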
[Figure: image → CNN → a fully-connected spatial propagation over hidden columns h_t, h_{t−1}, h_{t−2}, …]
Four directional sweeps cover the image: x_1, x_2, …, x_n scanned left-to-right (→), right-to-left (←), top-to-bottom (↓), and bottom-to-top (↑).
Three-way connection: each pixel (k, t) is connected to a set N_k of three nearest neighbors in the previous column:

h_{k,t} = (1 − Σ_{i∈N_k} w_{i,t}) x_{k,t} + Σ_{i∈N_k} w_{i,t} h_{i,t−1},   where |N_k| = 3

In matrix form this is again h_t = (I − λ_t) x_t + ω_t h_{t−1}, t ∈ [2, N], where ω_t is a tridiagonal matrix collecting the three-way weights.

[Figure: three-way connections between columns h_{t−1} and h_t]

Three scalar weights need to be learned per pixel for each direction; with four directions the guidance network outputs n²×12 weights in total.
[Figure: the guidance network (deep CNN) maps the 3-channel RGB image to the affinity weights w for the SPN]
Datasets:
- Helen: high-resolution face parsing with 11 classes
- VOC 2012: general object semantic segmentation with 21 classes
Jonathan Long, Evan Shelhamer, Trevor Darrell. Fully convolutional networks for semantic segmentation, CVPR 2015
[Figure: refinement architecture — the RGB image feeds the guidance network (a symmetric CNN with skip links), which outputs w of one direction (64×64×16) to the propagation module; the coarse mask (128×128×1) is downsampled by 4×4×16 stride-2 convolutions to 64×64×16, propagated, node-wise max-pooled, and upsampled back to 128×128×1 as the refined result]
- Helen: a relatively small guidance network, learned from scratch
- VOC: VGG16 conv1~pool5 with pre-trained weights; symmetric upsampling layers learned from scratch
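The node-wise max-pooling that fuses the four directional sweeps is a one-liner; the shapes here are illustrative:

```python
import numpy as np

def fuse_directions(h_lr, h_rl, h_tb, h_bt):
    """Node-wise max-pooling over the four directional propagation results:
    each pixel keeps the maximum response among the four sweeps."""
    return np.maximum.reduce([h_lr, h_rl, h_tb, h_bt])

rng = np.random.default_rng(2)
outs = [rng.normal(size=(4, 4)) for _ in range(4)]
fused = fuse_directions(*outs)
assert fused.shape == (4, 4)
assert np.all(fused >= outs[0])    # the max dominates every single direction
```

Taking the per-pixel maximum lets each pixel be dominated by whichever scanning direction carried the strongest evidence to it.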
Ablation: two different connections are compared under the same guidance network, with the coarse mask as the input image to the propagation module.
[Figure: qualitative comparison — Original image, CNN baseline, CNN+[1]]
[1] Sifei Liu, Jinshan Pan and Ming-Hsuan Yang. Learning recursive filters for low-level vision via a hybrid neural network. ECCV 2016
F-score on Helen:

| method           | skin  | brows | eyes  | nose  | mouth | lip_upper | lip_lower | lip_inner | overall |
| Multi-obj [1]    | 90.87 | 69.89 | 74.74 | 90.23 | 82.07 | 59.22     | 66.30     | 81.70     | 83.68   |
| CNN base         | 90.53 | 70.09 | 74.86 | 89.16 | 83.83 | 55.61     | 64.88     | 71.27     | 82.89   |
| CNN Highres      | 91.78 | 71.84 | 74.46 | 89.42 | 81.83 | 68.15     | 72.00     | 71.95     | 83.21   |
| CNN + [2]        | 92.26 | 75.05 | 85.44 | 91.51 | 88.13 | 77.61     | 70.81     | 79.95     | 87.09   |
| CNN + SPN (ours) | 93.10 | 78.53 | 87.71 | 92.62 | 91.08 | 80.17     | 71.53     | 83.13     | 89.30   |
[1] Sifei Liu, Jimei Yang, Chang Huang and Ming-Hsuan Yang. Multi-objective convolutional learning for face labeling. CVPR 2015. [2] Sifei Liu, Jinshan Pan and Ming-Hsuan Yang. Learning recursive filters for low-level vision via a hybrid neural network. ECCV 2016
A universal SPN module is applied on top of different base models: FCN, DeepLab VGG-16, and DeepLab ResNet-101 (using DeepLab pre-trained convolutional weights).
[Figure: qualitative VOC results — Original image, CNN base, Dense CRF, SPN]
VOC 2012 results (F = FCN, V = DeepLab VGG-16, R = DeepLab ResNet-101):

| model    | F     | F+[2] | F+SPN | V     | V+[2] | V+SPN | R     | R+[2] | R+SPN |
| pixel AC | 91.22 | 90.64 | 92.90 | 92.61 | 92.16 | 93.83 | 94.63 | 94.12 | 95.49 |
| mean AC  | 77.61 | 70.64 | 79.49 | 80.97 | 73.53 | 83.15 | 84.16 | 77.46 | 86.09 |
| mean IoU | 65.51 | 60.95 | 69.86 | 68.97 | 64.42 | 73.12 | 76.46 | 72.02 | 79.76 |
[2] Sifei Liu, Jinshan Pan and Ming-Hsuan Yang. Learning recursive filters for low-level vision via a hybrid neural network. ECCV 2016
| mean IoU          | CNN base | +Dense CRF | +SPN  |
| VGG-16 (val)      | 68.97    | 71.57      | 73.12 |
| ResNet-101 (val)  | 76.40    | 77.69      | 79.76 |
| ResNet-101 (test) | –        | –          | 80.22 |
[1] Liang-Chieh Chen*, George Papandreou*, Iasonas Kokkinos, Kevin Murphy, and Alan L. Yuille. DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs. arXiv preprint, 2016 [2] Sifei Liu, Jinshan Pan and Ming-Hsuan Yang. Learning recursive filters for low-level vision via a hybrid neural network. ECCV 2016
[1] Philipp Krähenbühl, Vladlen Koltun. Efficient Inference in Fully Connected CRFs with Gaussian Edge Potentials. NIPS 2011.
[2] Liang-Chieh Chen*, George Papandreou*, Iasonas Kokkinos, Kevin Murphy, and Alan L. Yuille. DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs. arXiv preprint arXiv:1606.00915, 2016.
[3] Shuai Zheng*, Sadeep Jayasumana*, Bernardino Romera-Paredes, Vibhav Vineet, Zhizhong Su, Dalong Du, Chang Huang, and Philip H. S. Torr. Conditional Random Fields as Recurrent Neural Networks. ICCV 2015.
[Chart: run-time comparison — Dense CRF [1]: ~1 s; Dense CRF [2]: 3.2 s; CRF as RNN [3]: 4.4 s; ours: 0.08 s]
[Figure: inference pipeline — the image passes through the guidance network (VGG-16: conv1, conv2, conv3; 57 ms) to produce a 64×64×32 guidance map; the coarse mask passes through SPN1 and SPN2 (1.3 ms each) and a bilinear upsample to yield the refined mask]
Comparison with Dense CRF:

|                          | Dense CRF [1]            | SPN                        |
| computation              | iterative convolution    | propagation                |
| learning/inference       | several iterations       | –                          |
| pairwise connection      | manually designed kernel | end-to-end learned weights |
| improvement (on VGG-16)  | 3.77%                    | 6.02%                      |
| run-time (512×512)       | 3.2 s                    | 0.08 s                     |
[1] Liang-Chieh Chen*, George Papandreou*, Iasonas Kokkinos, Kevin Murphy, and Alan L. Yuille. DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs. arXiv preprint, 2016
[Figure: temporal propagation — the property of frame k is propagated to frames k+10, k+20; examples include color propagation and HDR propagation]
A key-frame property is propagated through the video with guided information:
- Colorization: colored key-frame + grayscale frames in between key-frames → colored sequences (K = 30)
- HDR: HDR key-frames + LDR frames (we only show LDR here) → HDR sequences (K = 20)
Guided colorization from sparse annotations: given a gray-scale video, a key-frame carries a sparse annotation (automatically generated by superpixels). This very sparse annotation is used to generate the following colorized frames: one pixel is sampled randomly per superpixel, where 300 superpixels are produced from a 256×512 gray-scale image.
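The sampling step above can be sketched as follows. A coarse grid partition stands in for real superpixels (e.g., SLIC), and the 256-cell grid, random colors, and function names are illustrative, not from the slides:

```python
import numpy as np

def sample_sparse_annotation(labels, color, rng):
    """Pick one random pixel per superpixel and keep only its color.
    labels: (H, W) integer superpixel ids; color: (H, W, 3) reference colors."""
    sparse = np.zeros_like(color)
    mask = np.zeros(labels.shape, dtype=bool)
    for sp in np.unique(labels):
        ys, xs = np.nonzero(labels == sp)
        k = rng.integers(len(ys))          # one random pixel in this superpixel
        mask[ys[k], xs[k]] = True
        sparse[ys[k], xs[k]] = color[ys[k], xs[k]]
    return sparse, mask

# Stand-in for real superpixels: a grid partition of a 256x512 frame
# into 16 x 16 = 256 cells (close to the ~300 superpixels described above).
H, W = 256, 512
gy, gx = np.meshgrid(np.arange(H) // 16, np.arange(W) // 32, indexing="ij")
labels = gy * (W // 32) + gx
rng = np.random.default_rng(0)
color = rng.random((H, W, 3))              # placeholder reference colors
sparse, mask = sample_sparse_annotation(labels, color, rng)
assert mask.sum() == len(np.unique(labels))   # one pixel per superpixel
```

The resulting `sparse` map is all zeros except for one colored pixel per superpixel, which is the kind of very sparse key-frame annotation the propagation network then spreads over the frame.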
[Figure: propagation results at the 0th, 20th, and 200th frames — by SPN vs. by TPN]
Our propagation-network-based approach (SPN + TPN) supports image-edit propagation as a full pipeline.