DATA DRIVEN AFFORDANCE LEARNING IN BOTH 2D AND 3D SCENES
Sifei Liu, NVIDIA Research Xueting Li, University of California, Merced March 19, 2019
UNDERSTANDING SCENE AND HUMAN
scene image segmentation, human pose estimation
semantic segmentation from the Cityscapes dataset
pose estimation via OpenCV
rendered scene from the SUNCG dataset
Videos from: Learning Rigidity in Dynamic Scenes with a Moving Camera for 3D Motion Field Estimation
scene image: human, car
indoor environment: sitting, standing
Input Image Generated Poses
NeurIPS 2018: Donghoon Lee, Sifei Liu, Jinwei Gu, Ming-Yu Liu, Ming-Hsuan Yang, Jan Kautz
CVPR 2019: Xueting Li, Sifei Liu, Kihwan Kim, Xiaolong Wang, Ming-Hsuan Yang, Jan Kautz
Add a person
Image-to-image translation, Image editing, ...
experiencing a virtual world
Image is from Stephan R. Richter, Zeeshan Hayder, and Vladlen Koltun, “Playing for Benchmarks”, ICCV 2017
Rendering, Semantic map, Visualization
p=0.2, p=0, p=0.8
Object, Removed Object, Inpainting
Good box, Bad box
Why a box? 1) We do not want to model the object shape for now. 2) Every object can be covered by a bounding box.
Unit box, Affine transform
Why not use (x, y, w, h) directly? Placing a box by writing to pixel indices is not differentiable.
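The contrast between index-based placement and an affine-transformed unit box can be illustrated with a minimal sketch. It assumes STN-style bilinear sampling of an all-ones template with zero padding; the function names, grid sizes, and the (cx, cy, sx, sy) parameterization are hypothetical and not taken from the slides.

```python
import math

def sample_unit_box(u, v, size=8):
    """Bilinearly sample an all-ones template covering [-1, 1]^2, zero outside."""
    def tap(i, j):
        return 1.0 if 0 <= i < size and 0 <= j < size else 0.0
    px = (u + 1.0) * 0.5 * (size - 1)
    py = (v + 1.0) * 0.5 * (size - 1)
    x0, y0 = math.floor(px), math.floor(py)
    wx, wy = px - x0, py - y0
    return ((1 - wx) * (1 - wy) * tap(x0, y0)
            + wx * (1 - wy) * tap(x0 + 1, y0)
            + (1 - wx) * wy * tap(x0, y0 + 1)
            + wx * wy * tap(x0 + 1, y0 + 1))

def render_box(cx, cy, sx, sy, height=32, width=32):
    """Soft box mask: each output pixel is mapped back into unit-box coordinates."""
    mask = []
    for r in range(height):
        y = -1.0 + 2.0 * r / (height - 1)
        row = []
        for c in range(width):
            x = -1.0 + 2.0 * c / (width - 1)
            # Inverse of the affine map (scale by sx, sy, then shift by cx, cy).
            row.append(sample_unit_box((x - cx) / sx, (y - cy) / sy))
        mask.append(row)
    return mask

def area(mask):
    """Sum of mask values; varies continuously with the box parameters."""
    return sum(sum(row) for row in mask)

def hard_box_area(cx, cy, sx, sy, height=32, width=32):
    """Index-based placement: rounding to pixel indices removes the gradient."""
    def to_col(x):
        return round((x + 1.0) * 0.5 * (width - 1))
    def to_row(y):
        return round((y + 1.0) * 0.5 * (height - 1))
    cols = to_col(cx + sx) - to_col(cx - sx) + 1
    rows = to_row(cy + sy) - to_row(cy - sy) + 1
    return cols * rows
```

A small change in sx changes the soft mask's area but leaves the index-based area untouched, which is why a loss can propagate gradients through the first form and not the second.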
Diagram labels (built up over several slides): bbox, affine transform, STN, tile, concat, real, fake, real/fake loss.
Later builds add: “Ignored”; “Lazy”, z1, z2; a second STN (shared) with a supervised path and an unsupervised path; encoder-decoder + reconstruct + supervision.
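The slides name only a “real/fake loss”. As a hedged illustration, the sketch below uses the common binary cross-entropy formulation with a non-saturating generator term; this exact form is an assumption, and the function names are hypothetical.

```python
import math

def bce(prob, label, eps=1e-7):
    """Binary cross-entropy for one discriminator output in (0, 1)."""
    prob = min(max(prob, eps), 1.0 - eps)
    return -(label * math.log(prob) + (1.0 - label) * math.log(1.0 - prob))

def discriminator_loss(d_real, d_fake):
    """Mean BCE with real samples labeled 1 and generated samples labeled 0."""
    losses = [bce(p, 1.0) for p in d_real] + [bce(p, 0.0) for p in d_fake]
    return sum(losses) / len(losses)

def generator_loss(d_fake):
    """Non-saturating form: generated samples should be scored as real."""
    return sum(bce(p, 1.0) for p in d_fake) / len(d_fake)
```

Here d_real and d_fake are lists of discriminator probabilities for real and generated box placements respectively.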
(red: person, blue: car)
“Where” module (diagram labels): tile, concat, unsupervised path, supervised path, (shared), real, fake, encoder-decoder + supervision.
Diagram labels: Input, Unit box, Affine, Bounding box prediction, Object shape generation, Output, “Where” discriminator, “What” discriminator.
Ideal: 50%. Our result: 43%.
Baseline 1: Input, Encoder-decoder, Generated, Result, Real
Baseline 2: Input, Generator, Encoder, STN, Generated, Result, Real
Where are they? What do they look like?
Xueting Li, Sifei Liu, Kihwan Kim, Xiaolong Wang, Ming-Hsuan Yang, Jan Kautz
➢ Affordance: opportunities for interaction in the scene, i.e., what actions an object can be used for.
➢ Robot navigation
➢ Game development
Image Credit: David F. Fouhey et al. In Defense of the Direct Perception of Affordances, CoRR abs/1505.01085 (2015)
The floor can be used for standing. The desk can be used for sitting.
semantic knowledge geometry knowledge
fuse
A data-driven way?
pose synthesizer.
synthesized by stage I to train a data-driven and end-to-end 3D pose prediction model.
semantic knowledge geometry knowledge
pose synthesizer
…
where what
[1] Wang X, Girdhar R, Gupta A. Binge watching: Scaling affordance learning from sitcoms. CVPR, 2017.
[2] Song S, Yu F, Zeng A, et al. Semantic scene completion from a single depth image. CVPR, 2017.
The Sitcom [1] dataset (no 3D annotations): semantic knowledge
The SUNCG [2] dataset (no human poses): geometry knowledge
Combine?
Pipeline (figure labels): input image, location heat map, generated poses, mapping from image to voxel, mapped pose, adjusted pose.
Stages: semantic knowledge adaptation, domain adaptation, geometry adjustment.
Coordinate axes: $X, Y, Z$; $x, y$; $U, V, W$.
input image, ResNet-18 (convolution, deconvolution), location heat map
input image, location heat map, generated poses
Binge Watching: Scaling Affordance Learning from Sitcoms, Xiaolong Wang et al., CVPR, 2017
Domain adaptation
generated pose mapped pose
$d = \dfrac{H \times f}{H_p \times r}$
32
$H \sim \mathcal{N}(1.7, 0.1)$
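As a worked illustration of the depth relation d = H × f / (H_p × r) with H drawn from N(1.7, 0.1), the sketch below makes several assumptions the slide does not state: H is read as a human height in meters, 0.1 as a standard deviation, f as a focal length in pixels, H_p as the pose height in image pixels, and r as a scale factor. All function and parameter names are hypothetical.

```python
import random

def sample_height(rng, mean=1.7, std=0.1):
    """Draw H from N(mean, std); treating 0.1 as a standard deviation is an assumption."""
    return rng.gauss(mean, std)

def pose_depth(height, focal, pose_height_px, scale):
    """Evaluate d = H * f / (H_p * r)."""
    if pose_height_px <= 0 or scale <= 0:
        raise ValueError("pose_height_px and scale must be positive")
    return height * focal / (pose_height_px * scale)

rng = random.Random(0)
depth = pose_depth(sample_height(rng), focal=500.0, pose_height_px=170.0, scale=1.0)
```

Under this reading, a pose that appears twice as tall in the image is placed at half the depth.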
mapping from image to voxel
Figure labels: axes $X, Y, Z$; $x, y$; $U, V, W$; and $H_p$, $H$.
geometry adjustment
mapped pose adjusted pose
as furniture.
⨂
[1] A. Gupta et al. From 3D scene geometry to human workspace. CVPR, 2011.
generated pose, adjusted pose
Support constraint [2]. The human pose should be supported by a surface of surrounding objects (e.g., floor, bed).
(a) generated pose (b) pose in voxel (c) scene voxel (d) sittable surface (e) positive response (f) adjusted 3D pose
⨂, 3D Gaussian, sampling
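The support constraint stated above, together with a free-space check suggested by the truncated “as furniture” fragment, can be sketched on a voxel grid. This is an illustration under stated assumptions, not the authors' procedure: voxels are integer (x, y, z) tuples with z pointing up, support is tested only under the lowest pose voxels, and the adjustment draws integer offsets from an isotropic 3D Gaussian. All names are hypothetical.

```python
import random

def violates_free_space(pose_voxels, occupied):
    """True if any pose voxel coincides with an occupied scene voxel."""
    return any(v in occupied for v in pose_voxels)

def is_supported(pose_voxels, occupied):
    """True if a scene voxel lies directly below one of the lowest pose voxels."""
    lowest = min(z for _, _, z in pose_voxels)
    return any((x, y, z - 1) in occupied
               for x, y, z in pose_voxels if z == lowest)

def adjust_pose(pose_voxels, occupied, rng, sigma=2.0, trials=500):
    """Sample offsets from a 3D Gaussian until both checks pass, else return None."""
    for _ in range(trials):
        dx, dy, dz = (round(rng.gauss(0.0, sigma)) for _ in range(3))
        moved = {(x + dx, y + dy, z + dz) for x, y, z in pose_voxels}
        if not violates_free_space(moved, occupied) and is_supported(moved, occupied):
            return moved
    return None

# Toy scene: a 5 x 5 floor at z = 0 and a pose floating two voxels too high.
floor = {(x, y, 0) for x in range(5) for y in range(5)}
pose = {(2, 2, 3), (2, 2, 4)}
adjusted = adjust_pose(pose, floor, random.Random(0))
```

In the toy scene the floating pose fails the support check, and any accepted adjustment rests directly on the floor without intersecting it.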
data-driven 3D affordance generative model training and 700 scenes for testing.
what module for pose gestures prediction.
Model diagram (labels): images $\{I\}$; where module with $(x, y, d), p_c$, latent $z \sim \mathcal{N}(\mu, \sigma)$, and output $(x', y', d'), p_c'$; what module with $p$, latent $z \sim \mathcal{N}(\mu, \sigma)$, and output $p'$; discriminator (real/fake) with loss $L_{adv}$.
where module: inputs $(x, y, d), p_c$ and $\{I\}$; latent $z \sim \mathcal{N}(\mu, \sigma)$; outputs $(x', y', d'), p_c'$; losses $L_{mse}$ (squared error) and $L_{kld}$.
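The latent z ~ N(μ, σ) paired with an MSE term and a KLD term matches the usual variational autoencoder setup. The sketch below assumes a diagonal Gaussian posterior and a standard normal prior, which the slides suggest but do not spell out; the function names are hypothetical.

```python
import math
import random

def reparameterize(mu, sigma, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, 1), per dimension."""
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]

def kl_to_standard_normal(mu, sigma):
    """KL( N(mu, sigma^2) || N(0, 1) ), summed over dimensions."""
    return sum(0.5 * (m * m + s * s - 1.0) - math.log(s)
               for m, s in zip(mu, sigma))

def mse(pred, target):
    """Mean squared error between predicted and target vectors."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

z = reparameterize([0.0, 0.0], [1.0, 1.0], random.Random(0))
```

The KLD term is zero exactly when the posterior equals the standard normal, which is what allows sampling z ~ N(0, 1) at test time.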
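what module: inputs $p$, $(x', y', d'), p_c'$ and $I$; latent $z \sim \mathcal{N}(\mu, \sigma)$; output $p'$; losses $L_{mse}$ and $L_{kld}$.
Squared-error term over joints $j$: $\big((x_j, y_j, d_j) - M_e M_i (x_j', y_j', d_j')\big)^2$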
Architecture (labels): $p$, $p'$; $(x, y, d), p_c$; $(x', y', d'), p_c'$; $I$.
supervised path: encoder and decoder, with $z \sim \mathcal{N}(\mu, \sigma)$
unsupervised path: decoder only, with $z \sim \mathcal{N}(0, 1)$
discriminator: real/fake, $L_{adv}$; depth, heat map
D. Lee, S. Liu, J. Gu, M.-Y. Liu, M.-H. Yang, and J. Kautz. Context-aware synthesis and placement of object instances. In NIPS, 2018.
[1] Binge Watching: Scaling Affordance Learning from Sitcoms, Xiaolong Wang et al. CVPR, 2017
classifier: plausible? (0/1)
Method          Semantic Score (%)
Baseline [1]    72.53
Ours (RGB)      91.69
Ours (RGBD)     91.14
Ours (Depth)    89.86
(a) User study interface. (b) User study result.
User study chart (axis 0% to 100%; comparisons: GT vs. ours, GT vs. baseline; legend: GT, baseline). Plotted values: 46.43, 74.45, 53.57, 72.36, 25.55, 27.64.
space and support constraint.
Metric (%)        Baseline [1]   Ours (RGB)   Ours (RGBD)   Ours (Depth)
geometry score    23.25          66.40        71.17         72.11
Input Image Generated Poses
Poses that are not semantically plausible. Poses that violate geometry rules in the 3D world.
human pose synthesizer that leverages the pose distributions learned from the 2D world, and the physical feasibility extracted from the 3D world
for 3D affordance prediction which generates plausible human poses with full 3D information, from a single scene image.
Locations for sitting poses; locations for standing poses; locations for cars; locations for people
Transform a template object to fit in the input image
Chen-Hsuan Lin, Ersin Yumer, Oliver Wang, Eli Shechtman, and Simon Lucey, “ST-GAN: Spatial Transformer Generative Adversarial Networks for Image Compositing”, CVPR 2018
(side view car) Limitation
Xi Ouyang, Yu Cheng, Yifan Jiang, Chun-Liang Li, and Pan Zhou “Pedestrian-Synthesis-GAN: Generating Pedestrian Data in Real Scene and Beyond”, arXiv 2018
Add a new pedestrian at a target region
Limitation: the location and size of a pedestrian have to be given by a user
Seunghoon Hong, Dingdong Yang, Jongwook Choi, and Honglak Lee, “Inferring Semantic Layout for Hierarchical Text-to-Image Synthesis”, CVPR 2018
Limitation: layout prediction and image generation networks are not end-to-end trainable