SLIDE 1

DATA DRIVEN AFFORDANCE LEARNING IN BOTH 2D AND 3D SCENES

Sifei Liu, NVIDIA Research
Xueting Li, University of California, Merced
March 19, 2019

SLIDE 2

UNDERSTANDING SCENE AND HUMAN

[Figure: scene image segmentation and human pose estimation. Semantic segmentation from the Cityscapes dataset; pose estimation via OpenCV.]

SLIDE 3

CREATING A SCENE OR A HUMAN?

[Figure: instance placement and human placement, both marked ✘. Semantic segmentation from the Cityscapes dataset; rendered scene from the SUNCG dataset.]

SLIDE 4

LET’S MAKE IT MORE CHALLENGING!

Shape synthesis
[Figure: semantic segmentation from the Cityscapes dataset.]

SLIDE 5

LET’S MAKE IT MORE CHALLENGING!

Shape synthesis
[Figure: a “?” marks the shape to be synthesized.]

SLIDE 6

LET’S MAKE IT MORE CHALLENGING!

Placement in the real world
Videos from: Learning Rigidity in Dynamic Scenes with a Moving Camera for 3D Motion Field Estimation

SLIDE 7

WHAT IS AFFORDANCE?

Where are they?
[Figure: a scene image of an indoor environment, with labels: human, car, sitting, standing.]

SLIDE 8

WHAT IS AFFORDANCE?

What do they look like?
[Figure: a scene image of an indoor environment.]

SLIDE 9

WHAT IS AFFORDANCE?

How do they interact with each other?
[Figure: input image and generated poses.]

SLIDE 10

OUTLINE

Context-Aware Synthesis and Placement of Object Instances
NeurIPS 2018. Donghoon Lee, Sifei Liu, Jinwei Gu, Ming-Yu Liu, Ming-Hsuan Yang, Jan Kautz

Putting Humans in a Scene: Learning Affordance in 3D Indoor Environments
CVPR 2019. Xueting Li, Sifei Liu, Kihwan Kim, Xiaolong Wang, Ming-Hsuan Yang, Jan Kautz

SLIDE 11

QUIZ

Which object is the fake one?

SLIDE 12

SEQUENTIAL EDITING

Insert new objects one by one

SLIDE 13

PROBLEM DEFINITION

Semantic map manipulation by inserting objects (e.g., “add a person”).

SLIDE 14

WHY SEMANTIC MAP?

  • Editing an RGB image directly is difficult.

[Figure: Image 1 and Image 2; related tasks: image-to-image translation, image editing, ...]

SLIDE 15

WHY SEMANTIC MAP?

  • We don’t have real RGB images when using a simulator, playing a game, or experiencing a virtual world.

[Figure: rendering, semantic map, visualization. Image from Stephan R. Richter, Zeeshan Hayder, and Vladlen Koltun, “Playing for Benchmarks”, ICCV 2017.]

SLIDE 16

MAIN GOALS

  • 1. Learn “where” and “what” jointly.
  • 2. End-to-end trainable network.
  • 3. Diverse outputs given the same input.

SLIDE 17

“WHERE” MODULE

How can we learn where to put a new object?

SLIDE 18

“WHERE” MODULE

Pixel-wise annotation: almost impossible to get.
[Figure: example placement probabilities, p = 0.2, p = 0, p = 0.8.]

SLIDE 19

“WHERE” MODULE

Existing objects: need to remove and inpaint objects.
[Figure: an existing object in the scene.]

SLIDE 20

“WHERE” MODULE

Existing objects: need to remove and inpaint objects.
[Figure: the object removed.]

SLIDE 21

“WHERE” MODULE

Existing objects: need to remove and inpaint objects.
[Figure: inpainting the removed region; result quality is uncertain (“?”).]

SLIDE 22

“WHERE” MODULE

Our approach: place a box and check whether it is reasonable.
[Figure: a good box vs. a bad box.]
Why a box? 1) We don’t need to care about the object’s shape yet. 2) Any object can be covered by a bounding box.

SLIDE 23

“WHERE” MODULE

How do we place a box? Warp a unit box with an affine transform (see the sketch below).
Why not use (x, y, w, h) directly? Placing a box by indexing is not differentiable, while sampling through an affine grid is.
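To make the differentiability point concrete, here is a minimal PyTorch sketch (our own illustration, not the paper’s code): a predicted affine matrix warps a canonical unit box through a spatial transformer, so gradients reach the placement parameters.

```python
# Minimal sketch: placing a box via an affine transform is differentiable,
# unlike slicing an image with (x, y, w, h) indices.
import torch
import torch.nn.functional as F

def place_unit_box(theta, out_size=(1, 1, 64, 128)):
    """theta: (N, 2, 3) affine parameters, e.g. predicted by a network."""
    unit_box = torch.ones(out_size[0], out_size[1], 8, 8)  # canonical box
    grid = F.affine_grid(theta, out_size, align_corners=False)
    # Bilinear sampling keeps the whole operation differentiable in theta.
    return F.grid_sample(unit_box, grid, align_corners=False)

theta = torch.tensor([[[0.5, 0.0, 0.2],      # hypothetical scale/shift values
                       [0.0, 0.5, 0.1]]], requires_grad=True)
mask = place_unit_box(theta)                  # (1, 1, 64, 128) soft box mask
mask.sum().backward()                         # gradients flow into theta
print(theta.grad.shape)                       # torch.Size([1, 2, 3])
```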

SLIDE 24

“WHERE” MODULE

[Diagram: an affine transform maps the unit box to a bounding box (bbox) in the scene.]

SLIDE 25

“WHERE” MODULE

[Diagram: a random vector is tiled and concatenated with the semantic map; an STN (spatial transformer network) places the bbox.]

SLIDE 26

“WHERE” MODULE

[Diagram: as above, with a discriminator judging the placed bbox real or fake (real/fake loss).]

SLIDE 27

“WHERE” MODULE

Results with 100 different random vectors

SLIDE 28

“WHERE” MODULE

[Diagram: with only the real/fake loss, the input random vector is ignored by the generator.]

SLIDES 29–30

“WHERE” MODULE

[Diagram: the where-module pipeline (tile, concat, STN, real/fake loss), built up step by step.]

SLIDE 31

“WHERE” MODULE

Results with 100 different random vectors

SLIDE 32

“WHERE” MODULE

[Diagram: the pipeline with two codes z1 and z2; a “lazy” generator maps both to the same box.]

SLIDE 33

“WHERE” MODULE

[Diagram: adding a second bbox path to the pipeline.]

SLIDE 34

“WHERE” MODULE

[Diagram: two parallel paths with shared STN generators; each tiles a code, concatenates it with the map, places a bbox, and feeds the real/fake loss.]

SLIDE 35

“WHERE” MODULE

Encoder-decoder + reconstruction + supervision: a supervised path (encode an existing box, reconstruct it) and an unsupervised path (sample a code, fool the discriminator) are trained in parallel with shared weights, as sketched below.
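A hedged sketch of the parallel-path idea (the module shapes and names are stand-ins, not the released implementation): the supervised path reconstructs a ground-truth box through an encoder, while the unsupervised path pushes a random code through the same shared generator against a discriminator.

```python
import torch
import torch.nn as nn

# Stand-in modules: the real model uses conv nets over the semantic map;
# linear layers on a pooled feature keep this sketch small and runnable.
enc  = nn.Linear(6 + 16, 8)   # (gt affine, map feature) -> latent code z
gen  = nn.Linear(8 + 16, 6)   # shared generator: (z, map feature) -> affine
disc = nn.Linear(6 + 16, 1)   # scores (affine, map feature) as real/fake
opt  = torch.optim.Adam(list(enc.parameters()) + list(gen.parameters()))

feat, gt_theta = torch.randn(4, 16), torch.randn(4, 6)

# Supervised path: encode the existing box, then reconstruct it.
z_sup = enc(torch.cat([gt_theta, feat], dim=1))
loss_rec = (gen(torch.cat([z_sup, feat], dim=1)) - gt_theta).abs().mean()

# Unsupervised path: a random code must still fool the discriminator.
z_uns = torch.randn(4, 8)
fake_theta = gen(torch.cat([z_uns, feat], dim=1))
loss_adv = -disc(torch.cat([fake_theta, feat], dim=1)).mean()

(loss_rec + loss_adv).backward()
opt.step()
```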

SLIDE 36

“WHERE” MODULE

Results with 100 different random vectors (red: person, blue: car).

SLIDE 37

“WHERE” MODULE

Results from epoch 0 to 30

SLIDE 38

MAIN GOALS

  • 1. Learn “where” and “what” jointly.
  • 2. End-to-end trainable network.
  • 3. Diverse outputs given the same input.

SLIDE 39

“WHAT” MODULE

[Diagram: tile and concatenate the code with the input to generate the object shape.]

SLIDE 40

“WHAT” MODULE

[Diagram: the “what” module consumes the box predicted by the “where” module.]

SLIDE 41

“WHAT” MODULE

[Diagram: like the “where” module, the “what” module trains a supervised (encoder-decoder) path and an unsupervised (adversarial) path with shared weights, on top of the “where” module’s box.]

SLIDE 42

OVERALL ARCHITECTURE

Forward pass
[Diagram: input → bounding box prediction (affine on a unit box) → object shape generation → output.]

SLIDE 43

OVERALL ARCHITECTURE

Backward pass for the “where” loss
[Diagram: the “where” discriminator’s gradient flows back through the bounding box prediction.]

SLIDE 44

OVERALL ARCHITECTURE

Backward pass for the “what” loss
[Diagram: the “what” discriminator’s gradient flows back through the object shape generation. A composition sketch follows.]
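To summarize the forward composition, here is a compact sketch (the nets are placeholders with the right shapes, not the actual architecture): the “where” net emits an affine matrix, the “what” net emits a shape in the unit box, and the shape is pasted into the map through the affine grid.

```python
import torch
import torch.nn.functional as F

def insert_object(sem_map, where_net, what_net, z1, z2):
    """where_net -> 2x3 affine for the bbox; what_net -> shape in the unit box."""
    n, _, h, w = sem_map.shape
    theta = where_net(sem_map, z1).view(n, 2, 3)              # "where"
    shape = what_net(sem_map, z2)                             # "what"
    grid = F.affine_grid(theta, (n, 1, h, w), align_corners=False)
    mask = F.grid_sample(shape, grid, align_corners=False)    # placed shape
    # Paste the new object's label (here simply 1.0) into the semantic map.
    return torch.where(mask > 0.5, torch.ones_like(sem_map), sem_map)

# Toy stand-ins with the right output shapes, just to run the sketch.
sem_map   = torch.zeros(1, 1, 64, 128)
where_net = lambda m, z: torch.tensor([[0.25, 0.0, 0.3, 0.0, 0.25, 0.2]])
what_net  = lambda m, z: torch.ones(1, 1, 32, 32)
print(insert_object(sem_map, where_net, what_net, None, None).shape)
```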

SLIDE 45

“WHAT” MODULE

Fix “where”, change “what”

SLIDE 46

EXPERIMENTS

[Figure: input, generated semantic map, and RGB synthesized with pix2pixHD.]

SLIDE 47

EXPERIMENTS

[Figure: generated result vs. nearest neighbor; RGB synthesized by nearest-neighbor lookup.]

SLIDE 48

EXPERIMENTS

[Figure: input, generated semantic map, and RGB synthesized with pix2pixHD.]

SLIDE 49

EXPERIMENTS

[Figure: generated result vs. nearest neighbor; RGB synthesized by nearest-neighbor lookup.]

SLIDES 50–53
[Images only; no text content.]

SLIDE 54

USER STUDY

Users mistook our generated maps for real ones 43% of the time (ideal: 50%).

SLIDE 55

BASELINES

[Diagram: Baseline 1: input → encoder-decoder → generated object → result, judged against real data. Baseline 2: input → generator (encoder + STN) → generated object → result, judged against real data.]

SLIDE 56

CONCLUSION

Learning affordance in 2D: where are they? What do they look like?

SLIDE 57

PUTTING HUMANS IN A SCENE: LEARNING AFFORDANCE IN 3D INDOOR ENVIRONMENTS

Xueting Li, Sifei Liu, Kihwan Kim, Xiaolong Wang, Ming-Hsuan Yang, Jan Kautz

SLIDE 58

WHAT IS AFFORDANCE IN 3D?

  • General definition:
➢ Opportunities for interaction in the scene, i.e., what actions an object can be used for.
  • Applications:
➢ Robot navigation ➢ Game development

[Figure: “The floor can be used for standing. The desk can be used for sitting.” Image credit: David F. Fouhey et al., In Defense of the Direct Perception of Affordances, CoRR abs/1505.01085 (2015).]

SLIDE 59

AFFORDANCE IN THE 3D WORLD

  • Given a single image of a 3D scene, generate reasonable human poses in that scene.

SLIDE 60

LEARNING 3D AFFORDANCE

How do we define a “reasonable” human pose in indoor scenes?

  • Semantically plausible: the human should take common actions in the indoor environment.
  • Physically stable: the human should be well supported by the surrounding objects.
SLIDE 61

LEARNING 3D AFFORDANCE

Fuse semantic knowledge with geometry knowledge, in a data-driven way?

SLIDE 62

LEARNING 3D AFFORDANCE

  • Stage I: Build a fully-automatic 3D pose synthesizer.
  • Stage II: Use the dataset synthesized in Stage I to train a data-driven, end-to-end 3D pose prediction model.

[Diagram: semantic knowledge + geometry knowledge → pose synthesizer → where/what.]

SLIDE 63

LEARNING 3D AFFORDANCE

[Same overview as Slide 62, highlighting Stage I.]

SLIDE 64

A FULLY-AUTOMATIC 3D POSE SYNTHESIZER

Fusing semantic & geometry knowledge: the Sitcom [1] dataset provides human poses (but no 3D annotations); the SUNCG [2] dataset provides 3D scenes (but no human poses). How can we combine them?

[1] Wang X, Girdhar R, Gupta A. Binge Watching: Scaling Affordance Learning from Sitcoms. CVPR 2017.
[2] Song S, Yu F, Zeng A, et al. Semantic Scene Completion from a Single Depth Image. CVPR 2017.

SLIDE 65

A FULLY-AUTOMATIC 3D POSE SYNTHESIZER

[Pipeline: input image → location heat map → generated poses (semantic knowledge adaptation, via domain adaptation) → mapping from image to voxel → mapped pose → geometry adjustment → adjusted pose.]

SLIDE 66

A FULLY-AUTOMATIC 3D POSE SYNTHESIZER

[Pipeline recap, highlighting semantic knowledge adaptation.]

SLIDE 67

A FULLY-AUTOMATIC 3D POSE SYNTHESIZER

Semantic knowledge adaptation: a ResNet-18 encoder (convolution) with a deconvolution decoder predicts a location heat map from the input image. A sketch follows.
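A minimal sketch of such a heat-map predictor (layer sizes and the decoder design are assumptions, not the paper’s exact architecture; assumes the torchvision API):

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

class HeatmapNet(nn.Module):
    """ResNet-18 encoder + deconvolution decoder -> per-pixel location heat map."""
    def __init__(self):
        super().__init__()
        backbone = resnet18(weights=None)
        # Keep everything up to the last conv stage (drop avgpool/fc).
        self.encoder = nn.Sequential(*list(backbone.children())[:-2])
        self.decoder = nn.Sequential(   # five stride-2 deconvs undo the 32x downsampling
            nn.ConvTranspose2d(512, 256, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(256, 128, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 1, 4, stride=2, padding=1),
        )

    def forward(self, img):                # img: (N, 3, H, W)
        return torch.sigmoid(self.decoder(self.encoder(img)))

heatmap = HeatmapNet()(torch.randn(1, 3, 224, 224))   # -> (1, 1, 224, 224)
print(heatmap.shape)
```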

SLIDE 68

A FULLY-AUTOMATIC 3D POSE SYNTHESIZER

Semantic knowledge adaptation: pose generation follows Binge Watching: Scaling Affordance Learning from Sitcoms, Xiaolong Wang et al., CVPR 2017.
[Figure: input image → location heat map → generated poses.]

SLIDE 69

A FULLY-AUTOMATIC 3D POSE SYNTHESIZER

Semantic knowledge adaptation: domain adaptation.
[Figure: input image → location heat map → generated poses, with domain adaptation.]

SLIDE 70

A FULLY-AUTOMATIC 3D POSE SYNTHESIZER

[Pipeline recap, highlighting the mapping from image to voxel.]

SLIDE 71

A FULLY-AUTOMATIC 3D POSE SYNTHESIZER

Mapping from image to voxel: a generated 2D pose is lifted into the scene voxel grid by estimating its depth from the person’s height,

d = (H × f) / (H_p × r), with H ~ N(1.7, 0.1),

where H is the sampled real-world height, H_p the pose’s height in pixels, f the focal length, and r a scale ratio.
[Figure: generated pose → mapped pose in the voxel grid.]
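A small numeric sketch of this depth-from-height mapping (a best-effort reconstruction of the slide’s formula; the role of the scale ratio r is an assumption):

```python
import numpy as np

def pose_depth(pixel_height, focal_length, r=1.0,
               rng=np.random.default_rng(0)):
    """Depth of a person from pinhole geometry: d = (H * f) / (H_p * r).

    H is sampled from N(1.7, 0.1) as on the slide (assumed to be metres,
    with 0.1 read as the standard deviation); r is a scale ratio.
    """
    H = rng.normal(1.7, 0.1)
    return (H * focal_length) / (pixel_height * r)

# e.g. a 150-pixel-tall pose under a 500-pixel focal length -> ~5.7 m away
print(pose_depth(pixel_height=150, focal_length=500))
```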

SLIDE 72

A FULLY-AUTOMATIC 3D POSE SYNTHESIZER

[Pipeline recap, highlighting geometry adjustment.]

SLIDE 73

A FULLY-AUTOMATIC 3D POSE SYNTHESIZER

Geometry adjustment (mapped pose → adjusted pose):

  • Free-space constraint [1]: no human body part may intersect with any object in the scene, such as furniture.

[1] A. Gupta et al. From 3D Scene Geometry to Human Workspace. CVPR 2011.

SLIDE 74

A FULLY-AUTOMATIC 3D POSE SYNTHESIZER

Geometry adjustment (generated pose → adjusted pose):

  • Support constraint [2]: the human pose should be supported by a surface of the surrounding objects (e.g., floor, bed).

[Figure: (a) generated pose, (b) pose in voxel, (c) scene voxel, (d) sittable surface, (e) positive response, (f) adjusted 3D pose, obtained via 3D Gaussian sampling.]

[1] A. Gupta et al. From 3D Scene Geometry to Human Workspace. CVPR 2011.

A sketch of both geometric tests follows.
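A minimal sketch of the two tests (our simplification: a boolean occupancy voxel grid, integer joint coordinates, and z taken as the vertical axis):

```python
import numpy as np

def free_space_ok(joints, occupied):
    """Free-space constraint: no joint may fall inside an occupied voxel."""
    return not any(occupied[x, y, z] for x, y, z in joints)

def supported_ok(pelvis, surface, tol=2):
    """Support constraint: the pelvis must sit within `tol` voxels above
    a voxel marked as a supporting (e.g., sittable) surface."""
    x, y, z = pelvis
    return bool(surface[x, y, max(z - tol, 0):z + 1].any())

occupied = np.zeros((32, 32, 32), dtype=bool)          # empty toy scene
surface = np.zeros_like(occupied)
surface[:, :, 10] = True                               # a flat "seat" at z=10
joints = [(16, 16, 14), (16, 17, 12), (15, 16, 13)]    # toy joint voxels
print(free_space_ok(joints, occupied), supported_ok((16, 16, 12), surface))
```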

SLIDE 75

A FULLY-AUTOMATIC 3D POSE SYNTHESIZER

[Pipeline recap: the full synthesizer, from input image to adjusted pose.]

SLIDE 76

A FULLY-AUTOMATIC 3D POSE SYNTHESIZER

  • In total, we generate ~1.5 million poses in 13,774 scenes: 13,074 scenes for training the data-driven 3D affordance generative model and 700 for testing.

[Figure: synthesized poses visualization.]

SLIDE 77

A DATA-DRIVEN 3D POSE PREDICTION MODEL

  • Our 3D pose prediction model includes a where module for pose location prediction and a what module for pose gesture prediction.

[Framework: scene image I → where module, sampling z ~ N(μ, σ), predicts location (x′, y′, d′) and p_c → what module, sampling z ~ N(μ, σ), generates the pose p′; a discriminator on the prediction provides the adversarial real/fake loss L_adv.]

SLIDE 78

A DATA-DRIVEN 3D POSE PREDICTION MODEL

[Framework recap, highlighting the where module.]

SLIDE 79

A DATA-DRIVEN 3D POSE PREDICTION MODEL

The where module, for pose location prediction:
  • It should be able to sample during inference.
  • It should condition on the scene image.
  • Losses (a conditional-VAE objective; a sketch follows):

L = λ_GEO L_GEO + λ_KLD L_KLD + λ_MSE L_MSE
L_MSE = (x − x′)² + (y − y′)² + (d − d′)² + (p_c − p_c′)²
L_KLD = KL( Q(z | μ(x, y, d, I), σ(x, y, d, I)) ‖ N(0, 1) )
L_GEO = (M(x, y, d) − M(x′, y′, d′))², with M(·) the depth heat-map response

[Diagram: the where module takes the scene image {I}, samples z ~ N(μ, σ), and predicts (x′, y′, d′), p_c to match the ground truth (x, y, d), p_c under L_MSE and L_KLD.]
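Below is a hedged sketch of this three-term objective (tensor layouts, loss weights, and the `geo` stand-in are assumptions; the what module reuses the same structure with the pose p in place of the location):

```python
import torch

def where_loss(pred, gt, mu, logvar, geo, lam=(1.0, 0.1, 1.0)):
    """pred/gt: (N, 4) tensors holding (x, y, d, p_c); geo(.) maps a
    location to its depth heat-map response (toy stand-in for M(.))."""
    l_mse = ((pred - gt) ** 2).sum(dim=1).mean()
    # KL(Q(z | mu, sigma) || N(0, 1)) in closed form for a diagonal Gaussian.
    l_kld = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(dim=1).mean()
    l_geo = ((geo(pred[:, :3]) - geo(gt[:, :3])) ** 2).mean()
    return lam[0] * l_mse + lam[1] * l_kld + lam[2] * l_geo

geo = lambda loc: loc[:, 2:]   # toy stand-in: the depth coordinate itself
loss = where_loss(torch.randn(8, 4), torch.randn(8, 4),
                  torch.zeros(8, 16), torch.zeros(8, 16), geo)
print(loss)
```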

SLIDE 80

A DATA-DRIVEN 3D POSE PREDICTION MODEL

[Framework recap, highlighting the what module.]

SLIDE 81

A DATA-DRIVEN 3D POSE PREDICTION MODEL

The what module, for pose gesture prediction. Losses (same three-term structure as the where module):

L = λ_GEO L_GEO + λ_KLD L_KLD + λ_MSE L_MSE
L_MSE = (p − p′)²
L_KLD = KL( Q(z | μ(x, y, d, p_c, I), σ(x, y, d, p_c, I)) ‖ N(0, 1) )
L_GEO = Σ_k (M(x_k, y_k, d_k) − M(x_k′, y_k′, d_k′))², summed over joints k

[Diagram: given (x′, y′, d′), p_c′ and I, the what module samples z ~ N(μ, σ) and generates the pose p′, trained with L_MSE and L_KLD.]

SLIDE 82

A DATA-DRIVEN 3D POSE PREDICTION MODEL

[Framework recap, highlighting the discriminator.]

SLIDE 83

A DATA-DRIVEN 3D POSE PREDICTION MODEL

Geometry-aware discriminator:
[Diagram: a supervised path encodes the ground-truth pose p with (x, y, d), p_c and decodes it back (z ~ N(μ, σ)); an unsupervised path decodes z ~ N(0, 1) into (x′, y′, d′), p_c. The discriminator, conditioned on a depth heat map, scores real vs. fake and provides L_adv.]

  • D. Lee, S. Liu, J. Gu, M.-Y. Liu, M.-H. Yang, and J. Kautz. Context-Aware Synthesis and Placement of Object Instances. In NeurIPS, 2018.

SLIDE 84

A DATA-DRIVEN 3D POSE PREDICTION MODEL

[Framework recap: where module → what module → generated pose p′ at (x′, y′, d′).]

SLIDE 85

QUANTITATIVE RESULTS

  • Semantic score: a pre-trained classifier [1] scores the “plausibility” of each generated pose (plausible: 1, implausible: 0); see the sketch below.

Method          Semantic Score (%)
Baseline [1]    72.53
Ours (RGB)      91.69
Ours (RGBD)     91.14
Ours (Depth)    89.86

[1] Binge Watching: Scaling Affordance Learning from Sitcoms, Xiaolong Wang et al., CVPR 2017.
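As a small illustration of this scoring protocol (the classifier here is a toy stand-in, not the pre-trained network from [1]):

```python
import torch

def semantic_score(classifier, images):
    """Fraction of pose renderings the classifier judges plausible, in %."""
    with torch.no_grad():
        plausible = (classifier(images) > 0.5).float()
    return 100.0 * plausible.mean().item()

# Toy stand-in classifier returning a plausibility probability per image.
classifier = lambda x: torch.sigmoid(x.mean(dim=(1, 2, 3)))
print(semantic_score(classifier, torch.randn(100, 3, 224, 224)))
```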

SLIDE 86

QUANTITATIVE RESULTS

  • Semantic score via user study: users pick the more plausible result in pairwise comparisons.

GT vs. ours:       GT 46.43% / ours 53.57%
GT vs. baseline:   GT 74.45% / baseline 25.55%
Ours vs. baseline: ours 72.36% / baseline 27.64%

[(a) User study interface. (b) User study result.]

SLIDE 87

QUANTITATIVE RESULTS

  • Geometry score: map each generated pose into the 3D scene voxel grid and check whether it satisfies the free-space and support constraints.

Geometry score (%): Baseline [1] 23.25 | Ours (RGB) 66.40 | Ours (RGBD) 71.17 | Ours (Depth) 72.11

SLIDE 88

QUALITATIVE RESULTS

[Figure: input image, generated poses, and generated poses in the scene voxel grid.]

SLIDE 89

QUALITATIVE RESULTS

[Figure: more examples of input images, generated poses, and poses in the scene voxel grid.]

SLIDE 90

FAILURE CASES

Poses that are not semantically plausible; poses that violate geometric rules in the 3D world.

SLIDE 91

CONCLUSION

  • We propose a fully-automatic 3D human pose synthesizer that leverages pose distributions learned from the 2D world and physical feasibility extracted from the 3D world.
  • We develop a generative model for 3D affordance prediction that generates plausible human poses with full 3D information from a single scene image.

[Diagram: semantic knowledge + geometry knowledge → pose synthesizer → where/what.]

SLIDE 92

HIGHLIGHT

Three aspects of affordance modeling:
  • Where
  • What
  • Interaction

[Figure: locations for sitting poses, standing poses, cars, and people.]

SLIDES 93–94

HIGHLIGHT

[Recap: where / what / interaction, with further examples.]

SLIDE 95

HIGHLIGHT

Parallel-path training: effective regularization by introducing adversarial training on the unsupervised path.

SLIDE 96

THANK YOU
Q&A

SLIDE 97

RELATED WORK

“Where” problem: transform a template object to fit into the input image.
Chen-Hsuan Lin, Ersin Yumer, Oliver Wang, Eli Shechtman, and Simon Lucey, “ST-GAN: Spatial Transformer Generative Adversarial Networks for Image Compositing”, CVPR 2018.

SLIDE 98

RELATED WORK

“Where” problem, limitations of ST-GAN:
  • 1. A template has to be given (it cannot generate templates).
  • 2. Sometimes there is no way to fit the template (e.g., a side-view car) into the input scene by applying affine transforms.

Chen-Hsuan Lin, Ersin Yumer, Oliver Wang, Eli Shechtman, and Simon Lucey, “ST-GAN: Spatial Transformer Generative Adversarial Networks for Image Compositing”, CVPR 2018.

SLIDE 99

RELATED WORK

“What” problem: add a new pedestrian at a target region. Limitation: the location and size of the pedestrian must be given by the user.
Xi Ouyang, Yu Cheng, Yifan Jiang, Chun-Liang Li, and Pan Zhou, “Pedestrian-Synthesis-GAN: Generating Pedestrian Data in Real Scene and Beyond”, arXiv 2018.

SLIDE 100

RELATED WORK

“Where” and “what” problem. Limitation: the layout prediction and image generation networks are not end-to-end trainable.
Seunghoon Hong, Dingdong Yang, Jongwook Choi, and Honglak Lee, “Inferring Semantic Layout for Hierarchical Text-to-Image Synthesis”, CVPR 2018.