Dark, Beyond Deep --- Rethink About Computer Vision

Song-Chun Zhu


SLIDE 1

Dark, Beyond Deep

  • -- Rethink About Computer Vision

Song-Chun Zhu

SLIDE 2

Outline

I. Rethink about Vision:

  • -- Task-oriented representation.

II. Functionality and Causality:

  • -- Understanding objects, not merely classifying!

III. Utility Learning:

  • -- Learning inner and outer utilities from observations.
SLIDE 3

I. Rethink Computer Vision

Computer Vision is to “compute what is where by looking” --- [Marr, 1982]

Dorsal Pathway (“where”)

Ventral Pathway (“what”)

Human visual pathways

What: categorical recognition of objects and scenes.
Where: reconstructing depth, shape, and scene layout; visually guided actions, …

SLIDE 4

But, What Is Vision For ?

In the past 20 years, CVPR research has been driven mostly by:

  • video surveillance (recognition, tracking, re-identification, …);
  • image search (category classification);
  • other, smaller applications: image processing (denoising, enhancement, style transfer, …) and multimedia (geo-localization, beautification, …).

Frankly, these are not what our biological vision systems were designed (evolved) to do …

SLIDE 5

[Michael Land et al., Perception, 1999]

What is vision for? A wide range of tasks!

Making coffee, from the perspective of an agent.

SLIDE 6

The robot needs to infer the mind (beliefs, attention, intents, etc.) of humans to form a joint task plan.

Example of Human Robot Collaboration

SLIDE 7

Robot Opens Medicine Bottles

[Gao, Edmonds, et al., IROS 2017]

SLIDE 8

Social Interactions

[Shu et al., ICRA 2017]

SLIDE 9

Vision: Task-centered representation, learning and inference

“Dark Matter and Dark Energy”

Three levels of representations

I. View-centered (appearance-based, 2D, 1995-now)
II. Object-centered (geometry-based, 3D, 1970-1995)
III. Task-centered (functionality, physics, intentionality, causality, utility)

SLIDE 10

[K. Ikeuchi, M. Hebert, IROS 1992]

Task: grasp an object.
Object attributes: center, radius, axis direction, position and orientation of points.

Task-oriented representation: different grasp strategies (tasks) require the object to afford different functional capabilities. Thus the representation of even the same object can vary according to the task.

Example: grasp the mug

  • - cylindrical grasp: the mug body
  • - hook grasp: the mug handle

Task-oriented Representation: Review

SLIDE 11

Psychological studies suggest that human vision organizes its representations, and thus the inference process, around the task at hand, even for categorical recognition tasks.

[G. L. Malcolm, A. Nuthmann, P. G. Schyns, Psychological Science, 2014]

Task-oriented Representation: Review

SLIDE 12

My interpretation: people represent various activities (tasks) for different scene categories, imagine the typical tasks (see the hallucinated poses), and search for their associated objects for quick verification.

Task-oriented Representation: Review

[Zhao and Zhu, CVPR 2014; IJCV 2016]

SLIDE 13

We asked two groups of people (familiar and unfamiliar with the room) to finish the same task in the same room within a time limit.

Sample tasks:
1. Heat food in the microwave.
2. Find a cup to fetch water from the dispenser.

Rooms: office, kitchen, living room, …

Sensors: RGB-D sensor; Pivothead (egocentric glasses).

Human Study: Performing Real Tasks in a 3D Scene

The 3D room is reconstructed, segmented, and labelled.

SLIDE 14

Task 1: Heat food in microwave

Recorded video in 1st person view. The human subject is not familiar with the room.

SLIDE 15

Task 1: Heat food in microwave

Recorded video in 1st person view. The human subject is familiar with the room.

SLIDE 16

(Side-by-side videos: not familiar vs. familiar.)

Task 1: Heat food in microwave

SLIDE 17

Task 2: Find a mug to get water from dispenser

Recorded video in 1st person view. The human subject is not familiar with the room.

SLIDE 18

Task 2: Find a mug to get water from dispenser

Recorded video in 1st person view. The human subject is familiar with the room.

SLIDE 19

(Side-by-side videos: not familiar vs. familiar.)

Task 2: Find a mug to get water from dispenser

SLIDE 20

II. Understanding Objects in the Context of a Task

Why and how, beyond what and where!

SLIDE 21

Object understanding is way beyond object recognition.

Example: Open a beer

Understanding objects in the context of a task

SLIDE 22

For example, objects used as an “opener” in the task of “opening a beer”.

Understanding objects in the context of a task

Yixin Zhu, VCLA@UCLA

Object understanding is much more general than object recognition, which memorizes 1000s of examples for each category.
SLIDE 23

Modeling Human-Object Interactions at 2 Levels

Modeling 4D body-object interactions; modeling hand-object interactions.

[P. Wei et al., ICCV 2013, PAMI 2017; Y. Zhu, Y.B. Zhao, and S.C. Zhu, CVPR 2015]
SLIDE 24

Object Recognition → Object Understanding

Test: generalization and innovation! Learning from one example

Yixin Zhu et al, “Understanding Tools …”, CVPR 2015.

Using objects as tools for various tasks.

SLIDE 25

Given a task and a set of objects, imagine (with other areas in the brain):

Task-centered representation

How/where to grasp? Where to crack the nut? How to calculate the physics that changes the fluent?

SLIDE 26

Task-oriented representation: joint spatial, temporal and causal parse graph

(Figure: the parse graph spans the spatial and temporal spaces.)

What you see is 5%; the remaining 95% needs your reasoning!

SLIDE 27

Task-oriented representation: joint spatial, temporal and causal parse graph

(Figure: at time t1, the scene parse combines a spatial parse graph (S-pg), a temporal parse graph (T-pg), and a causal parse graph (C-pg), linking the human poses, the hand, the tool, and the nut, with attributes such as material, mass, hardness, velocity, and momentum; at time t2, the imagined action has changed the object's state.)

Imagined action: cracking a nut.

Causal structure equation:

X_{t+1}(O) ::= f( X_t(O), X_t(T), X_t(A) )

SLIDE 28

Estimating physical concepts from the observed/simulated actions.

(Figure: a network of physical concepts: material, density, volume, mass, velocity, acceleration, momentum, displacement, contact area, impulse, pressure, force, work.)

Causal structure equation: X_{t+1}(O) ::= f( X_t(O), X_t(T), X_t(A) )

Joint Physical and Causal Reasoning
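To make the causal structure equation concrete, here is a minimal Python sketch of one forward step for the nut-cracking example. It is an illustration under assumed impulse-based physics, not the authors' implementation; all class names, fields, and numbers are hypothetical.

from dataclasses import dataclass, replace

@dataclass(frozen=True)
class ObjectState:           # X_t(O): the nut
    mass: float              # kg
    hardness: float          # assumed breaking-pressure threshold, Pa
    cracked: bool = False

@dataclass(frozen=True)
class ToolState:             # X_t(T): the hammer-like tool
    mass: float              # kg
    contact_area: float      # m^2

@dataclass(frozen=True)
class ActionState:           # X_t(A): the swing
    velocity: float          # tool speed at impact, m/s
    dt: float = 0.01         # assumed impact duration, s

def f(obj, tool, act):
    """One step of X_{t+1}(O) ::= f(X_t(O), X_t(T), X_t(A)): derive the
    intermediate concepts (momentum, force, pressure) and flip the
    'cracked' fluent when pressure exceeds the nut's hardness."""
    momentum = tool.mass * act.velocity    # p = m v
    force = momentum / act.dt              # F = dp/dt, impulse over the impact
    pressure = force / tool.contact_area   # P = F / A
    return replace(obj, cracked=obj.cracked or pressure >= obj.hardness)

nut = ObjectState(mass=0.01, hardness=1.0e6)
hammer = ToolState(mass=0.5, contact_area=1e-4)
swing = ActionState(velocity=4.0)
print(f(nut, hammer, swing).cracked)       # True: 2e6 Pa >= 1e6 Pa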

SLIDE 29

Reasoning and Simulation

  • Affordance basis (green): where to grasp.
  • Functional basis (red): where to apply the tool to the third object.
  • A dictionary of typical poses and actions.

SLIDE 30

Selecting the underlying physical concept from 1 demonstration

Assumption: the human makes rational (near-optimal) choices; other objects and actions will not outperform the human's choice in the task.

(Figure: the human demonstration ranked against other ways of performing the task.)

pg is the spatial, temporal, and causal parse graph.

Selecting the top physical concepts and adjusting parameters.
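As an illustration of the rationality assumption, a hedged sketch (not the paper's algorithm) of concept selection: score each candidate physical concept by how many simulated alternatives would outperform the human demonstration if that concept were the true objective, and pick the concept under which the demonstration is near-optimal. The parse-graph summaries and concept definitions below are hypothetical.

from typing import Callable, Dict, List

def select_concept(demo: dict,
                   alternatives: List[dict],
                   concepts: Dict[str, Callable[[dict], float]]) -> str:
    violations = {}
    for name, value in concepts.items():
        # Count alternatives that strictly beat the human demonstration.
        violations[name] = sum(value(alt) > value(demo) for alt in alternatives)
    return min(violations, key=violations.get)

# Hypothetical summaries of parse graphs (pg) for cracking a nut:
demo = {"force": 200.0, "pressure": 2e6}           # human demonstration
alternatives = [                                   # "other ways"
    {"force": 350.0, "pressure": 1.0e6},           # big flat object
    {"force": 120.0, "pressure": 1.2e6},           # light object
]
concepts = {"force": lambda pg: pg["force"],
            "pressure": lambda pg: pg["pressure"]}
print(select_concept(demo, alternatives, concepts))  # "pressure"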

SLIDE 31

(Figure: the distribution of physical concepts (force, pressure, contact size), with examples that outperform and examples that underperform the human demonstration.)

Selecting the underlying physical concept from 1 demonstration

SLIDE 32

Experiment: Task-oriented Object Understanding

  • -- in contrast to memorizing examples

I am afraid that the apes using stone tools have strong reasoning capabilities; our tools are too specific, which reduces tool use to a recognition problem.

SLIDE 33

Summary: Call for a Paradigm Shift

Going from the current “big data, small task” setting to a “small data, big task” setting.

(Figure: representation, data, and task(s) under the two settings.)

Next time you review a paper: don't ask for big data, ask for small data!!

SLIDE 34

III. Learning Human Utility (Values)

Assumption I (principle of rationality): the actions of rational agents (humans or robots) are driven by their utilities.

Assumption II: people share common utilities for commonsense tasks (as distinct from social choices).

So, we can learn human utility/values from observing human choices/activities in video. The utility of an agent includes:

(i) Loss or gain on changing external fluents: which states does an agent prefer, e.g., clothes folded in a certain state?

(ii) Cost of actions on inner fluents: how much does each action cost the human body parts or the robot joints/actuators?

SLIDE 35

Human Utility Is Defined on the Space of Fluents

Physical fluents: internal fluents (force, pain, …).
Social fluents: social relations.

Fluents: time-varying states. The goal of a task is to change some fluents to desired states,

  • -- hierarchically organized.
SLIDE 36

(Figure: candidate chairs labelled A-F in an office and A-G in a lab.)

Take a simple example: where would you like to sit, among a number of chairs? (The concept of a chair is a generalized one here.) If a human chooses chair A over B, then A must have a higher value than B in some terms. From a small number (10-20) of examples, we can learn the common human utility function.

Sitting preference in an office and a lab during a discussion task.

Example 1: Learning Human Utility on Inner Fluents
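To show how such pairwise choices can drive learning, a small illustrative sketch (my assumptions, not the paper's model): a linear utility over hypothetical chair features, fitted from "chose A over B" pairs with a logistic ranking objective.

import math

FEATURES = ["back_support", "softness", "near_discussion"]  # hypothetical

def u(w, chair):
    return sum(w[f] * chair[f] for f in FEATURES)

def learn(choices, lr=0.1, steps=500):
    """Logistic pairwise ranking: ascend the likelihood P(chosen > rejected)."""
    w = {f: 0.0 for f in FEATURES}
    for _ in range(steps):
        for chosen, rejected in choices:
            p = 1.0 / (1.0 + math.exp(u(w, rejected) - u(w, chosen)))
            for f in FEATURES:
                w[f] += lr * (1.0 - p) * (chosen[f] - rejected[f])
    return w

# 10-20 observed choices suffice in this setting; three are shown here.
stool = {"back_support": 0, "softness": 0.2, "near_discussion": 1}
office_chair = {"back_support": 1, "softness": 0.6, "near_discussion": 1}
sofa = {"back_support": 1, "softness": 0.9, "near_discussion": 0}
choices = [(office_chair, stool), (office_chair, sofa), (sofa, stool)]
w = learn(choices)
print(max(FEATURES, key=lambda f: w[f]))  # which feature matters most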

SLIDE 37

Simulating All Plausible Poses as Negative Examples

(Figure: candidate poses enumerated over translations (x, y, z), orientations, and different pose types.)

– Synthesize (simulate) negative examples in the situation: things you could, but didn't, do.
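A minimal sketch of this idea, with an assumed grid (pose labels are hypothetical, and the z-axis is omitted for brevity): enumerate plausible candidate poses; the observed pose is the positive example and every other candidate is a negative.

import itertools

def simulate_candidate_poses():
    xs = ys = [0.0, 0.5, 1.0, 1.5]          # ground-plane translation grid (m)
    yaws = [0, 90, 180, 270]                # orientation grid (degrees)
    pose_types = ["sit_upright", "lean_back", "perch"]
    for x, y, yaw, pose in itertools.product(xs, ys, yaws, pose_types):
        yield {"x": x, "y": y, "yaw": yaw, "pose": pose}

observed = {"x": 0.5, "y": 1.0, "yaw": 90, "pose": "sit_upright"}
negatives = [c for c in simulate_candidate_poses() if c != observed]
print(len(negatives))  # 191 candidates: things you could, but didn't, do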

SLIDE 38

Learning human utilities (on the preferred force range) from observations and simulations. The learned parameters of U() are in fact the utility functions (illustrated by the red curve), which drive human motion.

Learning Human Utility on Inner Fluents

Yixin Zhu, et al. Inferring Forces and Learning Human Utilities from Video, CVPR 2016.

SLIDE 39

Example 2: Learning Human Utility (Values) on Outer Fluents

Folding clothes: a hypothetical utility function (like a phenotype landscape in biology) on the space of fluents.

Of course, only certain moves (changes of fluents by actions) are actionable, i.e., causally plausible. A path corresponds to a task plan. Given utility and causality, we should be able to derive other knowledge of man-made environments.
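As a toy illustration of "a path corresponds to a task plan" (assumed fluents, actions, and utility; not the system's planner): treat states as fluent vectors, let edges be causally plausible actions, and search for the path to the reachable state of highest utility.

from collections import deque

# Fluent vector for a t-shirt: (left_sleeve_folded, right_sleeve_folded, body_folded)
ACTIONS = {
    "fold_left_sleeve":  lambda s: (True, s[1], s[2]),
    "fold_right_sleeve": lambda s: (s[0], True, s[2]),
    # Causal constraint: the body can only be folded after both sleeves.
    "fold_top_to_bottom": lambda s: (s[0], s[1], True) if s[0] and s[1] else None,
}

def utility(s):
    return sum(s)  # toy utility: higher for more fully folded states

def plan(start):
    """Breadth-first search; return the action sequence to the best state."""
    best_state, best_plan = start, []
    queue, seen = deque([(start, [])]), {start}
    while queue:
        state, path = queue.popleft()
        if utility(state) > utility(best_state):
            best_state, best_plan = state, path
        for name, act in ACTIONS.items():
            nxt = act(state)
            if nxt is not None and nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [name]))
    return best_plan

print(plan((False, False, False)))
# ['fold_left_sleeve', 'fold_right_sleeve', 'fold_top_to_bottom']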

SLIDE 40

Learning Utility from observations

Given videos of human demonstrations:

Fluent-Space

SLIDE 41

Fluents of clothes are attributes/features.

Attributes or fluents over the observed states:

Fluent        state 1  state 2  state 3  state 4  state 5
Thickness     0.11     0.24     0.24     0.38     0.50
X-symmetry    0.8      0.4      0.9      0.8      0.6
Width         0.83     0.71     0.60     0.61     0.57
# Layers      1        2        2        3        4
...

Fluents are defined as time-varying attributes, and will be selected in the learning process.

SLIDE 42

Utility Function Learned from a small number of examples

Although the cloth pg (and its fluents) is high-dimensional, each 1D fluent needs only a small number of ranked pairs to learn.
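A small sketch of that claim (my illustration, not the paper's learner): fit a 1D utility u(x) = w*x per fluent from ranked pairs with a Bradley-Terry style objective, using the fluent values from the table on the previous slide and assuming, purely for illustration, that later (more folded) states are preferred.

import math

def fit_1d(ranked_pairs, lr=0.5, steps=200):
    """Fit u(x) = w*x from (preferred, other) pairs by gradient ascent
    on the logistic likelihood P(a > b) = sigmoid(w * (a - b))."""
    w = 0.0
    for _ in range(steps):
        g = sum((1 - 1 / (1 + math.exp(-w * (a - b)))) * (a - b)
                for a, b in ranked_pairs) / len(ranked_pairs)
        w += lr * g
    return w

fluents = {
    "thickness":  [0.11, 0.24, 0.24, 0.38, 0.50],
    "num_layers": [1, 2, 2, 3, 4],
    "width":      [0.83, 0.71, 0.60, 0.61, 0.57],
}
for name, xs in fluents.items():
    # Rank consecutive states: the later (more folded) state is preferred.
    pairs = [(xs[i + 1], xs[i]) for i in range(len(xs) - 1)]
    print(name, fit_1d(pairs))  # sign of w: utility rises or falls with the fluent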

By Nishant Shukla, UCLA

SLIDE 43

Experimental Results

Visualizing the utility function in 2D by MDS projection.

SLIDE 44

Deductive Planning

(Figure: the temporal parse graph (T-pg) and causal parse graph (C-pg) for folding a t-shirt, with actions such as fold left sleeve, fold right sleeve, fold sleeves, fold top to bottom, fold bottom to top, and the value of each resulting state.)

SLIDE 45

Robot Rating Human Actions

(video contains audio)

SLIDE 46

Summary

Vision:

  • -- Paradigm shift to Task-oriented representations.

Functionality and Causality:

  • -- Understanding objects/scenes, not merely classifying!

Utility learning:

  • -- Learning inner and outer utilities from observations.

Thanks for the support of an ONR MURI project on Visual Commonsense Reasoning, an NSF project on Dark Matter, and a DARPA SIMPLEX project on Robot Autonomy.