DATA DRIVEN AFFORDANCE LEARNING IN BOTH 2D AND 3D SCENES
Sifei Liu, NVIDIA Research
Xueting Li, University of California, Merced
March 19, 2019

UNDERSTANDING SCENE AND HUMAN
Figure labels: scene image, segmentation, human, pose estimation.
Image credits: semantic segmentation from the Cityscapes dataset; pose estimation via OpenCV.

CREATING SCENE OR HUMAN?
Figure labels: instance placement ✘, human placement ✘.
Image credits: semantic segmentation from the Cityscapes dataset; rendered scene from the SUNCG dataset.

LET’S MAKE IT MORE CHALLENGING!
Shape synthesis
Image credit: semantic segmentation from the Cityscapes dataset.

LET’S MAKE IT MORE CHALLENGING!
Shape synthesis?

LET’S MAKE IT MORE CHALLENGING!
Placement in the real world
Videos from: Learning Rigidity in Dynamic Scenes with a Moving Camera for 3D Motion Field Estimation

WHAT IS AFFORDANCE?
Where are they?
Figure labels: scene image, indoor environment, sitting, standing, human, car.

WHAT IS AFFORDANCE?
What do they look like?
Figure labels: scene image, indoor environment.

WHAT IS AFFORDANCE?
How do they interact with others?
Figure labels: input image, generated poses.

OUTLINE
Context-Aware Synthesis and Placement of Object Instances, NeurIPS 2018
Donghoon Lee, Sifei Liu, Jinwei Gu, Ming-Yu Liu, Ming-Hsuan Yang, Jan Kautz
Putting Humans in a Scene: Learning Affordance in 3D Indoor Environments, CVPR 2019
Xueting Li, Sifei Liu, Kihwan Kim, Xiaolong Wang, Ming-Hsuan Yang, Jan Kautz

QUIZ
Which object is a fake one?

SEQUENTIAL EDITING
Insert new objects one by one

PROBLEM DEFINITION
Semantic map manipulation by inserting objects
Add a person

WHY SEMANTIC MAP?
• Editing an RGB image is difficult: image-to-image translation, image editing, ...
Figure labels: Image 1, Image 2.

WHY SEMANTIC MAP?
• We don’t have real RGB images when using a simulator, playing a game, or experiencing a virtual world.
Figure labels: semantic map, rendering, visualization.
Image from: Stephan R. Richter, Zeeshan Hayder, and Vladlen Koltun, “Playing for Benchmarks”, ICCV 2017

MAIN GOALS
1. Learn “where” and “what” jointly
2. End-to-end trainable network
3. Diverse outputs given the same input

“WHERE” MODULE
How can we learn where to put a new object?

“WHERE” MODULE
Pixel-wise annotation: almost impossible to get
Figure labels: p=0, p=0.8, p=0.2.

“WHERE” MODULE
Existing objects: need to remove and inpaint objects
Figure label: object.

“WHERE” MODULE
Existing objects: need to remove and inpaint objects
Figure label: removed object.

“WHERE” MODULE
Existing objects: need to remove and inpaint objects
Figure label: inpainting?

“WHERE” MODULE
Our approach: put a box and see if it is reasonable
Figure labels: bad box, good box.
Why a box?
1) We don’t want to care about the object shape for now.
2) All objects can be covered by a bounding box.

“WHERE” MODULE
How to put a box? Apply an affine transform to a unit box.
Why not use (x, y, w, h) directly? Placing a box by indexing is not differentiable.

“WHERE” MODULE
Figure labels: affine transform, bbox.
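
The affine-transform idea can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes an axis-aligned affine map with hypothetical parameters (sx, sy, tx, ty) in normalized [-1, 1] image coordinates, and it renders the unit box with soft edges instead of bilinear sampling. The function names and the `sharpness` constant are illustrative. The point is that each mask value is a smooth function of the parameters, which integer indexing with (x, y, w, h) cannot provide.

```python
import math

def soft_unit_box(u, v, sharpness=20.0):
    """Soft indicator of the unit box [-1, 1] x [-1, 1] (illustrative)."""
    def inside(t):
        # Smooth step: close to 1 when |t| < 1, close to 0 when |t| > 1.
        return 1.0 / (1.0 + math.exp(sharpness * (abs(t) - 1.0)))
    return inside(u) * inside(v)

def render_box(sx, sy, tx, ty, height=8, width=8):
    """Render the unit box warped by x = sx*u + tx, y = sy*v + ty.

    Each output pixel is mapped back into the unit-box frame, similar in
    spirit to an STN sampling grid, so the mask varies smoothly with the
    four parameters.
    """
    mask = []
    for row in range(height):
        y = -1.0 + 2.0 * (row + 0.5) / height    # pixel centre in [-1, 1]
        line = []
        for col in range(width):
            x = -1.0 + 2.0 * (col + 0.5) / width
            u = (x - tx) / sx                     # inverse affine transform
            v = (y - ty) / sy
            line.append(soft_unit_box(u, v))
        mask.append(line)
    return mask

mask = render_box(sx=0.5, sy=0.25, tx=0.25, ty=-0.5)
nudged = render_box(sx=0.5, sy=0.25, tx=0.26, ty=-0.5)
# A pixel near the box edge changes slightly when tx moves by 0.01.
edge_delta = nudged[1][3] - mask[1][3]
```

Because `edge_delta` is small but nonzero, a gradient with respect to the box parameters exists, which is what a real/fake loss on the placed box would need.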

“WHERE” MODULE
Figure labels: concat, STN, tile, bbox.

“WHERE” MODULE
Figure labels: concat, fake, STN, real/fake loss, tile, bbox, real.

“WHERE” MODULE
Results with 100 different random vectors

“WHERE” MODULE
Figure labels: concat, fake, STN, real/fake loss, tile, bbox, ignored, real.

“WHERE” MODULE
Results with 100 different random vectors

“WHERE” MODULE
Figure labels: concat, fake, STN, real/fake loss, tile, bbox, lazy, lazy, z1, real, z2.

“WHERE” MODULE
Figure labels: concat, fake, STN, real/fake loss, tile, bbox, real, bbox.

“WHERE” MODULE
Figure labels: concat, fake, STN, real/fake loss, tile, bbox, (shared), (shared), real, concat, STN, bbox, tile.

“WHERE” MODULE
Figure labels: concat, fake, STN, encoder-decoder, tile, bbox, unsupervised path, (shared), (shared), supervised path, + reconstruct, real, concat, STN, bbox, tile, + supervision.

“WHERE” MODULE
Results with 100 different random vectors (red: person, blue: car)

“WHERE” MODULE
Results from epoch 0 to 30

MAIN GOALS
1. Learn “where” and “what” jointly
2. End-to-end trainable network
3. Diverse outputs given the same input

“WHAT” MODULE
Figure labels: concat, tile.

“WHAT” MODULE
Figure labels: “where” module, concat, tile.

“WHAT” MODULE
Figure labels: “where” module, concat, tile, fake, unsupervised path, encoder-decoder, (shared), supervised path, concat, real, + supervision, tile.

OVERALL ARCHITECTURE
Forward pass
Figure labels: input, affine prediction, unit box, bounding box, object shape generation, output.
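
The forward pass can be read as: predict a bounding box, generate an object shape, and write that shape into the semantic map. The sketch below covers only the last step, under simplifying assumptions: a hard integer box (top, left, height, width), a binary shape mask resized by nearest-neighbor lookup, and hypothetical label ids. The slides describe a differentiable affine warp (STN) for this step, so this hard paste is a stand-in for illustration only.

```python
def paste_shape(semantic_map, shape_mask, box, class_id):
    """Return a copy of semantic_map with shape_mask pasted inside box.

    box = (top, left, height, width) in pixels; shape_mask is a binary grid
    of any size that is resized to the box by nearest-neighbor lookup.
    """
    top, left, box_h, box_w = box
    mask_h, mask_w = len(shape_mask), len(shape_mask[0])
    out = [row[:] for row in semantic_map]         # do not modify the input
    for dy in range(box_h):
        for dx in range(box_w):
            y, x = top + dy, left + dx
            if not (0 <= y < len(out) and 0 <= x < len(out[0])):
                continue                           # clip to the map bounds
            src_y = dy * mask_h // box_h           # nearest-neighbor resize
            src_x = dx * mask_w // box_w
            if shape_mask[src_y][src_x]:
                out[y][x] = class_id
    return out

ROAD, PERSON = 0, 1                                # hypothetical label ids
scene = [[ROAD] * 6 for _ in range(6)]
shape = [[0, 1],
         [1, 1]]                                   # toy 2x2 object shape
edited = paste_shape(scene, shape, box=(1, 2, 4, 2), class_id=PERSON)
```

Only the pixels where the resized shape is set receive the new label, so the inserted object keeps its silhouette inside the box while the rest of the map is untouched.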

OVERALL ARCHITECTURE
Backward pass for “where” loss
Figure labels: input, affine prediction, unit box, bounding box, object shape generation, output, “where” discriminator.

OVERALL ARCHITECTURE
Backward pass for “what” loss
Figure labels: input, affine prediction, unit box, bounding box, object shape generation, output, “what” discriminator.

“WHAT” MODULE
Fix “where”, change “what”

EXPERIMENTS
Synthesized RGB (pix2pixHD)
Figure labels: input, generated.

EXPERIMENTS
Synthesized RGB (nearest-neighbor)
Figure labels: nearest neighbor, generated.

EXPERIMENTS
Synthesized RGB (pix2pixHD)
Figure labels: input, generated.

EXPERIMENTS
Synthesized RGB (nearest-neighbor)
Figure labels: nearest neighbor, generated.

USER STUDY
Ideal: 50%
Our result: 43%

BASELINES
Figure labels (Baseline 1 and Baseline 2 diagrams): input, encoder-decoder, encoder, generator, STN, generated object, result, real.
Result panel labels: real, real, real, Baseline 1, Baseline 2.

CONCLUSION
Learning Affordance in 2D
Where are they?
What do they look like?

PUTTING HUMANS IN A SCENE: LEARNING AFFORDANCE IN 3D INDOOR ENVIRONMENTS
Xueting Li, Sifei Liu, Kihwan Kim, Xiaolong Wang, Ming-Hsuan Yang, Jan Kautz

WHAT IS AFFORDANCE IN 3D?
• General definition:
➢ Opportunities for interaction in the scene, i.e. what actions an object can be used for.
The floor can be used for standing. The desk can be used for sitting.
• Applications:
➢ Robot navigation
➢ Game development
Image Credit: David F. Fouhey et al. In Defense of the Direct Perception of Affordances, CoRR abs/1505.01085 (2015)

AFFORDANCE IN 3D WORLD
• Given a single image of a 3D scene, generate reasonable human poses in the 3D scene.

LEARNING 3D AFFORDANCE
How to define a “reasonable” human pose in indoor scenes?
• Semantically plausible: the human should take common actions in indoor environments.
• Physically stable: the human should be well supported by its surrounding objects.
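
One way to make “physically stable” concrete is a voxel test: reject a pose if any joint falls inside an occupied voxel, and accept it only if each supporting joint has an occupied voxel directly beneath it. This is an illustrative assumption, not the criterion used in the talk. The joint names, the choice of supporting joints, and the z-up voxel convention are all hypothetical.

```python
def is_supported(joints, occupied, support_joints):
    """Check a pose against a voxel occupancy set.

    joints: dict name -> (x, y, z) integer voxel coordinates, z pointing up.
    occupied: set of (x, y, z) voxels filled by scene geometry.
    support_joints: names of joints that must rest on geometry.
    """
    # No joint may penetrate scene geometry.
    if any(pos in occupied for pos in joints.values()):
        return False
    # Every supporting joint needs an occupied voxel directly below it.
    for name in support_joints:
        x, y, z = joints[name]
        if (x, y, z - 1) not in occupied:
            return False
    return True

# Toy scene: a floor at z = 0 and a chair seat at z = 2.
floor = {(x, y, 0) for x in range(5) for y in range(5)}
seat = {(2, 2, 2)}
occupied = floor | seat

sitting = {"pelvis": (2, 2, 3), "left_foot": (1, 2, 1), "right_foot": (1, 3, 1)}
floating = {"pelvis": (4, 4, 3), "left_foot": (4, 3, 2), "right_foot": (4, 4, 2)}

sitting_ok = is_supported(sitting, occupied, ["pelvis", "left_foot", "right_foot"])
floating_ok = is_supported(floating, occupied, ["left_foot", "right_foot"])
```

In this toy scene the sitting pose passes because the pelvis rests on the seat and both feet rest on the floor, while the floating pose fails because nothing lies under its feet.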

LEARNING 3D AFFORDANCE
Fuse semantic knowledge and geometry knowledge.
A data-driven way?

LEARNING 3D AFFORDANCE
• Stage I: Build a fully automatic 3D pose synthesizer.
• Stage II: Use the dataset synthesized by Stage I to train a data-driven, end-to-end 3D pose prediction model.
Figure labels: semantic knowledge, geometry knowledge, pose synthesizer, where, what, …

A FULLY-AUTOMATIC 3D POSE SYNTHESIZER
Fusing semantic & geometry knowledge
Semantic knowledge: the Sitcom [1] dataset (no 3D annotations).
Geometry knowledge: the SUNCG [2] dataset (no human poses).
Combine?
[1] Wang X, Girdhar R, Gupta A. Binge watching: Scaling affordance learning from sitcoms. CVPR, 2017.
[2] Song S, Yu F, Zeng A, et al. Semantic scene completion from a single depth image. CVPR, 2017.

A FULLY-AUTOMATIC 3D POSE SYNTHESIZER
Stages: semantic knowledge adaptation, geometry adjustment.
Figure labels: input image, location heat map, generated poses, domain adaptation, mapping from image to voxel, mapped pose, adjusted pose.

A FULLY-AUTOMATIC 3D POSE SYNTHESIZER
Semantic knowledge adaptation
Figure labels: input image, ResNet-18, convolution, deconvolution, location heat map.
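
A location heat map can be turned into candidate positions by treating it as a distribution over pixels and sampling from it. The sketch below assumes the network outputs unnormalized scores and applies a softmax. Both that assumption and the function name are illustrative; the slides do not state how locations are drawn from the heat map.

```python
import math
import random

def sample_locations(heat_map, count, seed=0):
    """Sample (row, col) positions with probability softmax(heat_map)."""
    rng = random.Random(seed)
    cells = [(r, c) for r, row in enumerate(heat_map) for c in range(len(row))]
    scores = [heat_map[r][c] for r, c in cells]
    peak = max(scores)                              # subtract for stability
    weights = [math.exp(s - peak) for s in scores]
    return rng.choices(cells, weights=weights, k=count)

# Toy 3x4 score map with one strong peak at row 1, column 2.
heat_map = [
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 8.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
]
locations = sample_locations(heat_map, count=200)
peak_hits = sum(1 for loc in locations if loc == (1, 2))
```

Sampling rather than taking the single highest-scoring pixel keeps some diversity in the proposed locations, at the cost of occasionally picking a low-scoring cell.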

A FULLY-AUTOMATIC 3D POSE SYNTHESIZER
Semantic knowledge adaptation
Figure labels: input image, location heat map, generated poses.
Binge Watching: Scaling Affordance Learning from Sitcoms, Xiaolong Wang et al., CVPR, 2017

A FULLY-AUTOMATIC 3D POSE SYNTHESIZER
Semantic knowledge adaptation
Figure labels: input image, location heat map, generated poses, domain adaptation.

A FULLY-AUTOMATIC 3D POSE SYNTHESIZER
Stages: semantic knowledge adaptation, geometry adjustment.
Figure labels: input image, location heat map, generated poses, domain adaptation, mapping from image to voxel, mapped pose, adjusted pose.
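
The “mapping from image to voxel” step can be sketched with a pinhole camera model: a pixel with known depth is back-projected into camera coordinates and then binned into a voxel grid. The intrinsics (fx, fy, cx, cy), the grid origin, and the voxel size below are made-up values for illustration, and the sketch ignores the camera-to-world rotation that a real scene would need.

```python
import math

def pixel_to_voxel(u, v, depth, fx, fy, cx, cy, origin, voxel_size):
    """Back-project pixel (u, v) at the given depth and return its voxel index."""
    # Pinhole model: camera-space point in metres.
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    z = depth
    # Bin the point into a regular grid whose corner sits at `origin`.
    return tuple(
        math.floor((coord - start) / voxel_size)
        for coord, start in zip((x, y, z), origin)
    )

index = pixel_to_voxel(
    u=400, v=300, depth=2.0,
    fx=500.0, fy=500.0, cx=320.0, cy=240.0,
    origin=(-2.0, -2.0, 0.0), voxel_size=0.25,
)
```

With a 2D joint location and a depth value, the same mapping would place each joint of a generated pose into the voxel grid, where a geometry check can then adjust it.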
