Bypassing the Language Bottleneck, Alexei (Alyosha) Efros, UC Berkeley (PowerPoint PPT presentation)



SLIDE 1

Visual Understanding without Naming:

Bypassing the “Language Bottleneck”

Alexei (Alyosha) Efros UC Berkeley

SLIDE 2

Collaborators

Josef Sivic Abhinav Gupta Mathieu Aubry Bryan Russell Scott Satkin Martial Hebert David Fouhey Natasha Kholgade Vincent Delaitre Ivan Laptev Yaser Sheikh Jun-Yan Zhu Yong Jae Lee

SLIDE 3

What do we mean by Visual Understanding?

slide by Fei-Fei, Fergus & Torralba

SLIDE 4

Object naming -> Object categorization

(image labels: sky, building, flag, wall, banner, bus, cars, bus, face, street lamp)

slide by Fei-Fei, Fergus & Torralba

SLIDE 5

Image Labeling

(image labels: sky, building, flag, wall, banner, bus, cars, bus, face, street lamp)

SLIDE 6

Hays and Efros, “Where in the World?”, 2009

SLIDE 7
  • Not one-to-one:
    – Much is unnamed

(diagram: words vs. the Visual World)

SLIDE 8
  • Not one-to-one:
    – Much is unnamed

(diagram: words vs. the Visual World; example word: CITY)

SLIDE 9

Verbs (actions)

sitting

SLIDE 10

Visual “sitting”

Visual Context

SLIDE 11

The Language Bottleneck

(diagram: Visual World → words → scene understanding, spatial reasoning, prediction, image retrieval, image synthesis, etc.)

SLIDE 12

Visual World

  • 1. 3D Human Affordances
  • 2. 3D Object Correspondence
  • 3. User-in-the-visual-loop

Scene understanding, spatial reasoning, prediction, image retrieval, image synthesis, etc.

SLIDE 13

From 3D Scene Geometry to Human Workspaces

Abhinav Gupta, Scott Satkin, Alexei Efros and Martial Hebert CVPR’11

SLIDE 14

Object Naming

(image labels: couch, table, couch, lamp)

SLIDE 15

Is there a couch in the image?

(image labels: couch, table, couch, lamp)

SLIDE 16

Where can I sit?

(image labels: couch, table, couch, lamp)

SLIDE 17

3D Indoor Image Understanding

Spatial Layout Objects

Hoiem et al. IJCV'07, Delage et al. CVPR'06, Hedau et al. ICCV'09, Lee et al. NIPS'10, Wang et al. ECCV'10

SLIDE 18

Human Centric Scene Understanding

Reasoning in terms of a set of allowable actions

Can Sit Can Walk Can Move Can Push

SLIDE 19

Sitting

SLIDE 20

Pose-defined Vocabulary

Sitting Motion Capture Poses

SLIDE 21

(diagram: 3D Scene Geometry + Joint Space of Human-Scene Interactions → Human Workspace)

SLIDE 22

Qualitative Representation

SLIDE 23

3D Scene Geometry

  • Each scene is modeled by:
    – Layout of the room
    – Layout of the objects
  • Room represented by an inside-out box
  • Objects represented by occupied voxels

References: Hedau et al. ICCV'09, Lee et al. NIPS'10, Wang et al. ECCV'10
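The representation above can be sketched as a boolean voxel grid; the class and method names here are illustrative, not the paper's implementation:

```python
import numpy as np

class Scene:
    """Sketch of the slide's representation: the room is an axis-aligned
    box of voxels, and objects are marked as occupied voxels."""
    def __init__(self, room_size, resolution):
        # room_size: (x, y, z) extent in meters; resolution: voxel edge length.
        self.resolution = resolution
        dims = tuple(int(round(s / resolution)) for s in room_size)
        self.occupied = np.zeros(dims, dtype=bool)

    def _index(self, coords):
        # Convert metric coordinates to voxel indices.
        return [int(round(c / self.resolution)) for c in coords]

    def add_box_object(self, lo, hi):
        """Mark the voxels inside the axis-aligned box [lo, hi] (meters) occupied."""
        i0, i1 = self._index(lo), self._index(hi)
        self.occupied[i0[0]:i1[0], i0[1]:i1[1], i0[2]:i1[2]] = True

scene = Scene((4.0, 5.0, 2.5), 0.1)                     # 4 x 5 x 2.5 m room, 10 cm voxels
scene.add_box_object((0.5, 0.5, 0.0), (2.5, 1.3, 0.8))  # a couch-sized occupied block
```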

SLIDE 24

Goal

Where would the Human Block fit?

SLIDE 25

Human Scene Interactions

Free Space Constraint: No Intersection between Human Block and Objects

SLIDE 26

Human Scene Interactions

Support Constraint: Presence of Objects for Interaction
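The free-space and support constraints (slides 25 and 26) reduce to two voxel tests, assuming the human block is discretized on the same grid as the scene. This is an illustrative sketch, not the paper's implementation:

```python
import numpy as np

def satisfies_constraints(human_voxels, object_voxels, support_voxels):
    """Check a candidate human-block placement against both constraints.
    All three boolean grids share one shape."""
    # Free-space constraint: no intersection between human block and objects.
    if np.logical_and(human_voxels, object_voxels).any():
        return False
    # Support constraint: the voxels the pose rests on must contain objects
    # (e.g. a seat surface under the pelvis for a sitting pose).
    return bool(object_voxels[support_voxels].all())

# Toy 5x5x5 scene: a seat voxel, a human block above it, support under the pelvis.
objects = np.zeros((5, 5, 5), bool); objects[2, 2, 1] = True
human = np.zeros((5, 5, 5), bool); human[2, 2, 2] = human[2, 2, 3] = True
support = np.zeros((5, 5, 5), bool); support[2, 2, 1] = True
```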

SLIDE 27

Ground-Truth 3D Geometry

Data Source:

Google 3D Warehouse

SLIDE 28

SLIDE 29

Extracting 3D Geometry

  • Estimating 3D Scene Geometry from a single image is an extremely difficult problem.
  • Build on work in 3D Scene Understanding of [Hedau'09] and [Lee'10].

SLIDE 30
SLIDE 31
SLIDE 32
SLIDE 33

Subjective Scene Interpretation

SLIDE 34

Summary


SLIDE 35

The Inverse Problem

SLIDE 36

People Watching: Human Actions as a Cue for Single-View Geometry

David Fouhey, Vincent Delaitre, Abhinav Gupta, Alexei Efros, Ivan Laptev, Josef Sivic ECCV 2012

SLIDE 37

Humans as Active Sensors

Input: Timelapse Output: 3D Understanding

SLIDE 38

Our Approach

(pipeline: Timelapse → Pose Detections)

SLIDE 39

Detecting Human Actions

Train separate detectors for each pose [Yang and Ramanan '11]:

Sitting, Standing, Reaching

SLIDE 40

Our Approach

(pipeline: Timelapse → Pose Detections → Estimate Functional Regions from Poses)

SLIDE 41

From Poses to Functional Regions

Sittable Regions at Pelvic Joint

SLIDE 42

From Poses to Functional Regions

Walkable Regions at Feet

SLIDE 43

Affordance Constraints

Reachable Regions at Hands
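Mapping detected poses to functional regions (slides 41-43) can be sketched as accumulating votes at the relevant joints across timelapse frames: sittable regions at the pelvis, walkable at the feet, reachable at the hands. The joint names, neighborhood size, and vote scheme below are illustrative:

```python
import numpy as np

# Which joints vote for which functional region (per the slides).
JOINT_TO_FUNCTION = {"pelvis": "sittable", "foot": "walkable", "hand": "reachable"}

def functional_region_maps(detections, image_shape, radius=2):
    """Accumulate joint locations from timelapse pose detections into
    per-function vote maps. `detections` is a list of dicts mapping a
    joint name to its (row, col) image position."""
    maps = {f: np.zeros(image_shape) for f in set(JOINT_TO_FUNCTION.values())}
    for pose in detections:
        for joint, (r, c) in pose.items():
            func = JOINT_TO_FUNCTION.get(joint)
            if func is None:
                continue
            # Vote with a small square neighborhood around the joint.
            r0, r1 = max(r - radius, 0), min(r + radius + 1, image_shape[0])
            c0, c1 = max(c - radius, 0), min(c + radius + 1, image_shape[1])
            maps[func][r0:r1, c0:c1] += 1.0
    return maps
```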

SLIDE 44

Our Approach

(pipeline: Timelapse → Pose Detections → Functional Regions; in parallel, 3D Room Hypotheses From Appearance)

SLIDE 45

Our Approach

(pipeline: Timelapse → Pose Detections → Functional Regions → Score 3D Room Hypotheses With Appearances + Affordances, ranking hypotheses #1 … #49)
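Scoring room hypotheses with appearances + affordances can be sketched as a weighted re-ranking. The field names and weight below are illustrative, not the paper's:

```python
def rerank_hypotheses(hypotheses, w_affordance=0.5):
    """Re-score 3D room hypotheses by combining the appearance score each
    hypothesis already carries with how well it agrees with the observed
    functional regions (both scores in [0, 1] here)."""
    scored = [
        (h["appearance_score"] + w_affordance * h["affordance_agreement"], h["name"])
        for h in hypotheses
    ]
    scored.sort(reverse=True)
    return [name for _, name in scored]

# A hypothesis that looks slightly worse on appearance alone can win
# once affordance agreement is taken into account.
hyps = [
    {"name": "hypothesis_1", "appearance_score": 0.70, "affordance_agreement": 0.20},
    {"name": "hypothesis_49", "appearance_score": 0.65, "affordance_agreement": 0.90},
]
```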

SLIDE 46

Our Approach

(pipeline: Timelapse → Pose Detections → Functional Regions)

Estimate Free-Space from Pose Detections
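The free-space step follows from a simple observation: wherever a person was detected, there is no solid object. An illustrative sketch over a floor grid (not the paper's estimator):

```python
import numpy as np

def estimate_free_space(person_masks, grid_shape):
    """Carve free space from a timelapse: any cell a detected person ever
    occupies cannot contain a solid object. `person_masks` is an iterable
    of boolean occupancy grids, one per frame."""
    free = np.zeros(grid_shape, dtype=bool)
    for mask in person_masks:
        free |= mask  # union of everywhere people were observed
    return free

# Two frames of a tiny 3x3 floor grid, with people in different cells.
frame1 = np.array([[1, 0, 0], [0, 0, 0], [0, 0, 0]], dtype=bool)
frame2 = np.array([[0, 0, 0], [0, 1, 0], [0, 0, 0]], dtype=bool)
free = estimate_free_space([frame1, frame2], (3, 3))
```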

SLIDE 47

Results

SLIDE 48

Qualitative Example

SLIDE 49

Qualitative Example

SLIDE 50

Quantitative Results

Location:  Appearance Only | People Only | Appearance + People | Lee et al. '09 | Hedau et al. '09
           64.1%           | 70.4%       | 74.9%               | 70.8%          | 82.5%

Evaluated on room layout estimation over 40 timelapse videos from YouTube; does equivalently or better 93% of the time.

SLIDE 51

Mathieu Aubry (INRIA) Daniel Maturana (CMU) Alexei Efros (UC Berkeley) Bryan Russell (Intel) Josef Sivic (INRIA)

Seeing 3D chairs:

Exemplar part-based 2D-3D alignment using a large dataset of CAD models

CVPR 2014

SLIDE 52

Sit on the chair!

SLIDE 53

Classification

CHAIR

Ex: ImageNet Challenge, Pascal VOC classification.

SLIDE 54

Detection

Ex: Pascal VOC detection.

chair

SLIDE 55

Segmentation

Ex: Pascal VOC segmentation.

SLIDE 56

Our goal

SLIDE 57

1980s: 2D-3D Instance Alignment

[Lowe AI 1987], [Huttenlocher and Ullman IJCV 1990], [Faugeras & Hebert '86], [Grimson & Lozano-Perez '86], …

SLIDE 58

Recent: 3D category recognition

3D DPMs: [Hejrati & Ramanan '12], [Pepik et al. '12], [Zia et al. '13], … Simplified part models: [Xiang & Savarese '12], [Del Pero et al. '13]. Cuboids: [Xiao et al. '12], [Fidler et al. '12]. Blocks world revisited: [Gupta et al. '12]

See also: [Glasner et al. '11], [Fouhey et al. '13], [Satkin & Hebert '13], [Choi et al. '13], [Hejrati and Ramanan '14], [Savarese and Fei-Fei '07], …

SLIDE 59

Approach: data-driven

1394 3D models from the internet

SLIDE 60
SLIDE 61

Difficulty: viewpoint

SLIDE 62

Approach: use 3D models

62 views
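Rendering each CAD model from a fixed set of viewpoints (the slide shows 62 views) amounts to sampling camera angles around the object. A sketch with illustrative counts, not the paper's exact sampling:

```python
import itertools

def sample_viewpoints(n_azimuth=12, elevations=(0.0, 20.0, 40.0)):
    """Enumerate (azimuth, elevation) camera angles in degrees: a ring of
    azimuths around the object at each elevation. Each CAD model would be
    rendered once per viewpoint."""
    azimuths = [360.0 * k / n_azimuth for k in range(n_azimuth)]
    return list(itertools.product(azimuths, elevations))

views = sample_viewpoints()  # 12 azimuths x 3 elevations = 36 views here
```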

SLIDE 63

Style Viewpoint

SLIDE 64

Difficulty: approximate style

SLIDE 65

Difficulty: approximate style

SLIDE 66

Difficulty: approximate style

SLIDE 67

Approach: part-based model

SLIDE 68

Approach overview

3D collection → Render views → Select parts → Match CG to real image → Select the best matches
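The stages above can be sketched as a single loop; every callable here is a placeholder for a real stage (rendering, part selection, CG-to-photo matching):

```python
def seeing_chairs_pipeline(models, image, render, select_parts, match, top_k=5):
    """Sketch of the slide's pipeline: render views of each CAD model,
    select discriminative parts per view, match them against the real
    image, and keep the best-scoring alignments."""
    matches = []
    for model in models:
        for view in render(model):
            # Total part-match score for this rendered view against the image.
            score = sum(match(part, image) for part in select_parts(view))
            matches.append((score, model, view))
    matches.sort(key=lambda m: m[0], reverse=True)
    return matches[:top_k]

# Toy stand-ins: two "models", identity-style stages, scores equal to the part.
result = seeing_chairs_pipeline(
    models=[1, 2], image=None,
    render=lambda m: [10 * m],
    select_parts=lambda v: [v],
    match=lambda part, img: part)
```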

SLIDE 69

Select discriminative parts

SLIDE 70

How to select discriminative parts?

Best exemplar-LDA classifiers [Hariharan et al. 2012], [Gharbi et al. 2012], [Malisiewicz et al. 2011]

SLIDE 71

Approach: CG-to-photograph

Implementation: exemplar-LDA
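Exemplar-LDA trains a linear classifier in closed form: given background (negative) feature statistics, mean mu and covariance Sigma estimated once over many natural-image patches, the detector for a single positive feature x is w = Sigma^-1 (x - mu). A minimal sketch (the function name and ridge term are mine):

```python
import numpy as np

def exemplar_lda(x_pos, mu_bg, sigma_bg):
    """Closed-form exemplar-LDA weights for one positive feature x_pos,
    given background mean mu_bg and covariance sigma_bg:
        w = sigma_bg^-1 (x_pos - mu_bg)
    A small ridge keeps the covariance invertible."""
    d = len(x_pos)
    sigma = sigma_bg + 1e-6 * np.eye(d)
    return np.linalg.solve(sigma, x_pos - mu_bg)

# With zero mean and identity covariance, the weights are just the positive.
w = exemplar_lda(np.array([2.0, 0.0]), np.zeros(2), np.eye(2))
```

The appeal on the slide's scale is that no per-part hard-negative mining is needed: mu and Sigma are shared, so millions of part detectors come almost for free.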

SLIDE 72

How to compare matches?

Patches → Detectors → Matches

SLIDE 73

How to compare matches?

Patches → Detectors → Matches; affine calibration with negative data (see paper for details)
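"Affine calibration with negative data" means fitting, per detector, an affine map s → a·s + b from that detector's scores on negatives, so raw scores from many independent part detectors become comparable. The sketch below uses a z-score-style fit as an illustrative choice; the paper's exact calibration differs (see the paper):

```python
import numpy as np

def affine_calibration(negative_scores):
    """Fit an affine map s -> a*s + b from one detector's scores on
    negative data, so calibrated negatives have zero mean and unit
    variance and scores become comparable across detectors."""
    mu, sd = np.mean(negative_scores), np.std(negative_scores)
    a = 1.0 / sd
    b = -mu / sd
    return a, b

neg = np.array([0.0, 2.0, 4.0])   # raw scores of one detector on negatives
a, b = affine_calibration(neg)
calibrated = a * neg + b          # zero mean, unit variance
```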

SLIDE 74

Example I.

SLIDE 75

Example II.

SLIDE 76

Example III.

SLIDE 77

(figure panels: Input image | DPM output | Our output | 3D models)

SLIDE 78

(figure panels: Input image | DPM output | Our output | 3D models)

SLIDE 79

Human Evaluation

Orientation quality at 25% recall

              Good   Bad
Exemplar-LDA   52%   48%
Ours           90%   10%

SLIDE 80

Human Evaluation

Style consistency at 25% recall

              Exact   Ok   Bad
Exemplar-LDA     3%  31%   66%
Ours            21%  64%   15%

SLIDE 81
SLIDE 82
SLIDE 83

The Language Bottleneck

(diagram: Mental Picture → words → Image)