

SLIDE 1

See, Hear, Move: Towards Embodied Visual Perception

Kristen Grauman, Facebook AI Research and University of Texas at Austin

SLIDE 2

How do recognition systems typically learn today?

boat dog

… …

SLIDE 3

Web photos + recognition

Caltech 101 (2004), Caltech 256 (2006), PASCAL (2007-12), ImageNet (2009), LabelMe (2007), MS COCO (2014), SUN (2010), Places (2014), BSD (2001), Visual Genome (2016)

A “disembodied”, well-curated moment in time

SLIDE 4

Egocentric perceptual experience

A tangle of relevant and irrelevant multi-sensory information


SLIDE 5

Big picture goal: Embodied visual learning

Status quo: Learn from a “disembodied” bag of labeled snapshots.
On the horizon: Visual learning in the context of action, motion, and multi-sensory observations.


SLIDE 6

Big picture goal: Embodied visual learning

Status quo: Learn from a “disembodied” bag of labeled snapshots.
On the horizon: Visual learning in the context of action, motion, and multi-sensory observations.


SLIDE 7

Towards embodied visual learning

  1. Learning from unlabeled video and multiple sensory modalities
  2. Learning policies for how to move for recognition and exploration

SLIDE 8

The kitten carousel experiment

[Held & Hein, 1963]

active kitten passive kitten

Key to perceptual development: self-generated motion + visual feedback

SLIDE 9

Goal: Teach the computer vision system the connection between “how I move” and “how my visual surroundings change.”

Idea: Pair ego-motion motor signals with unlabeled video.

[Jayaraman & Grauman, ICCV 2015, IJCV 2017]

SLIDE 10

Approach: Egomotion equivariance

Learn an equivariant embedding organized by egomotions: pairs of frames related by a similar egomotion (e.g., left turn, right turn, forward) should be related by the same feature transformation.

Training data: unlabeled video + motor signals (a time-aligned motor signal per frame).

[Jayaraman & Grauman, ICCV 2015, IJCV 2017]
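To make the objective concrete, here is a minimal PyTorch sketch of the equivariance idea (my own illustration, not the authors' code; the tiny encoder, feature size, and three discrete egomotion classes are assumptions): frames connected by the same motor command should be connected by the same learned transformation in feature space.

```python
# Minimal sketch (not the authors' code): egomotion-equivariant feature learning.
# Frame pairs related by the same coarse egomotion (e.g. "left turn") should be
# related by the same learned transformation in feature space.
import torch
import torch.nn as nn

class EquivariantEmbedding(nn.Module):
    def __init__(self, feat_dim=128, num_motions=3):
        super().__init__()
        self.encoder = nn.Sequential(          # stand-in for a conv net
            nn.Flatten(), nn.Linear(3 * 32 * 32, 256), nn.ReLU(),
            nn.Linear(256, feat_dim))
        # one learned linear map per discrete egomotion class
        self.motion_maps = nn.ModuleList(
            [nn.Linear(feat_dim, feat_dim) for _ in range(num_motions)])

    def forward(self, frame_t, frame_t1, motion_id):
        z_t = self.encoder(frame_t)
        z_t1 = self.encoder(frame_t1)
        # apply the transform associated with each pair's motor command
        z_pred = torch.stack(
            [self.motion_maps[m](z) for z, m in zip(z_t, motion_id.tolist())])
        return z_pred, z_t1

model = EquivariantEmbedding()
frames_t = torch.randn(8, 3, 32, 32)      # frame at time t
frames_t1 = torch.randn(8, 3, 32, 32)     # frame after the egomotion
motions = torch.randint(0, 3, (8,))       # 0=left turn, 1=right turn, 2=forward
z_pred, z_next = model(frames_t, frames_t1, motions)
loss = nn.functional.mse_loss(z_pred, z_next)   # equivariance objective
loss.backward()
```

In the paper this term is trained on real video/motor-signal pairs and combined with other objectives; the sketch shows only the equivariance term.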

SLIDE 11

Approach: Egomotion equivariance

Learn an equivariant embedding organized by egomotions.

Training data: unlabeled video + motor signals.

[Jayaraman & Grauman, ICCV 2015, IJCV 2017]

SLIDE 12

Impact on recognition

Learn features from unlabeled car video (KITTI; Geiger et al., IJRR '13), then exploit them for static scene classification (SUN, 397 classes; Xiao et al., CVPR '10).

30% accuracy increase when labeled data is scarce

SLIDE 13

From passive, pre-recorded video (with time-aligned motor signals) to moving around to inspect: passive observation → complete, self-generated egomotions.

SLIDE 14

One-shot reconstruction

Viewgrid representation: infer the unseen views from a single observed view.

Key idea: One-shot reconstruction as a proxy task to learn semantic shape features.

[Jayaraman et al., ECCV 2018]
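A minimal sketch of the proxy task, assuming a toy encoder-decoder and a 12-view grid of 32x32 grayscale renderings (all shapes and module names are illustrative, not the published architecture): encode one view into a latent "ShapeCode", decode the full viewgrid, and penalize reconstruction error.

```python
# Minimal sketch (assumed architecture, not the released model): learn a
# "ShapeCode" by predicting the full viewgrid of an object from a single view.
import torch
import torch.nn as nn

class ViewgridNet(nn.Module):
    def __init__(self, code_dim=256, num_views=12, view_hw=32):
        super().__init__()
        self.num_views, self.view_hw = num_views, view_hw
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 16, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(16, 32, 5, stride=2, padding=2), nn.ReLU(),
            nn.Flatten(), nn.Linear(32 * 8 * 8, code_dim))   # the "ShapeCode"
        self.decoder = nn.Linear(code_dim, num_views * view_hw * view_hw)

    def forward(self, one_view):
        code = self.encoder(one_view)
        grid = self.decoder(code).view(-1, self.num_views, self.view_hw, self.view_hw)
        return code, grid

net = ViewgridNet()
view = torch.randn(4, 1, 32, 32)                 # a single observed view
target_grid = torch.randn(4, 12, 32, 32)         # all views of the object (training only)
code, pred_grid = net(view)
loss = nn.functional.mse_loss(pred_grid, target_grid)   # one-shot reconstruction loss
loss.backward()
# `code` can later be reused as a feature for recognition (e.g. a linear classifier on top).
```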

SLIDE 15

One-shot reconstruction

Shape from many views is a geometric problem [Snavely et al., CVPR '06]; shape from one view is a semantic problem [Sinha et al., ICCV '93].

[Jayaraman et al., ECCV 2018]

SLIDE 16

Approach: ShapeCodes

[Jayaraman et al., ECCV 2018]

  • Implicit 3D shape representation
  • No “canonical” azimuth to exploit
  • Category agnostic

Learned ShapeCode embedding

SLIDE 17

ShapeCodes for recognition

[Bar charts: recognition accuracy (%) on ModelNet [Wu et al., 2015] and ShapeNet [Chang et al., 2015], comparing Pixels, Random weights, DrLIM*, Autoencoder**, LSM^, and Ours.]

*Hadsell et al., Dimensionality Reduction by Learning an Invariant Mapping, CVPR 2006
**Masci et al., Stacked Convolutional Auto-Encoders for Hierarchical Feature Extraction, ICANN 2011
^Agrawal, Carreira, Malik, Learning to See by Moving, ICCV 2015

SLIDE 18

Egomotion and implied body pose

Learn relationship between egocentric scene motion and 3D human body pose

[Jiang & Grauman, CVPR 2017]

Input: egocentric video
Output: sequence of 3D joint positions

SLIDE 19

Egomotion and implied body pose

Learn relationship between egocentric scene motion and 3D human body pose

[Jiang & Grauman, CVPR 2017]

Wearable camera video → inferred pose of the camera wearer

SLIDE 20

Implied motion in static images

[Kourtzi & Kanwisher, 2000] Activation in middle temporal / medial superior temporal (MT/MST) cortex by static images with implied motion.

[Stimuli: moving rings, stationary rings, static images with implied motion, static images without implied motion]

SLIDE 21

Im2Flow: Infer the next motion in a static image

Unlabeled video as a rich source of motion experience (e.g., push-ups).

[Gao & Grauman, CVPR 2018]
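A minimal sketch of the idea, assuming a plain convolutional regressor (the actual Im2Flow model is an encoder-decoder trained on flow computed from unlabeled video, with a richer flow encoding): predict a dense 2-channel flow field from a single RGB frame.

```python
# Minimal sketch (simplified stand-in for Im2Flow): regress a dense optical-flow
# field (2 channels: dx, dy) from a single RGB frame.
import torch
import torch.nn as nn

flow_net = nn.Sequential(
    nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 2, 3, padding=1))        # predicted (dx, dy) per pixel

image = torch.randn(2, 3, 64, 64)          # static input frames
true_flow = torch.randn(2, 2, 64, 64)      # flow computed from the *next* video frame
pred_flow = flow_net(image)
loss = nn.functional.mse_loss(pred_flow, true_flow)   # supervision comes free from unlabeled video
loss.backward()
```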

SLIDE 22

Identify static images that are most suggestive of motion or coming events

Im2Flow for “motion potential”

[Gao & Grauman, CVPR 2018]

SLIDE 23

Im2Flow for action recognition in photos

  • Inferred motion from the Im2Flow framework boosts recognition
  • Up to 6% relative gain vs. the appearance stream alone

[Bar chart: accuracy of Motion Stream (Walker et al.), Motion Stream (Ours), Motion Stream (Ground-truth), Appearance Stream, and Appearance + Motion (Ours)]

Two-stream network with RGB and inferred flow

[Gao & Grauman, CVPR 2018]
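A minimal sketch of the two-stream fusion, with toy stand-in networks and an assumed late fusion by summing class scores: one stream sees the RGB photo, the other sees the flow inferred by Im2Flow.

```python
# Minimal two-stream sketch (assumed fusion scheme): sum class scores from an
# appearance stream (RGB) and a motion stream fed with the *inferred* flow.
import torch
import torch.nn as nn

def make_stream(in_channels, num_classes=10):
    return nn.Sequential(
        nn.Conv2d(in_channels, 16, 3, padding=1), nn.ReLU(),
        nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        nn.Linear(16, num_classes))

appearance_stream = make_stream(3)      # takes the RGB photo
motion_stream = make_stream(2)          # takes Im2Flow's predicted flow

rgb = torch.randn(2, 3, 64, 64)
inferred_flow = torch.randn(2, 2, 64, 64)   # output of an Im2Flow-style net
scores = appearance_stream(rgb) + motion_stream(inferred_flow)   # late fusion
action = scores.argmax(dim=1)               # predicted action class per photo
```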

SLIDE 24

Recall: Disembodied visual learning

boat dog

… …

SLIDE 25

Listening to learn

SLIDE 26

Listening to learn

SLIDE 27

Listening to learn

Goal: a repertoire of objects and their sounds (woof, meow, ring, clatter).
Challenge: a single audio channel mixes the sounds of multiple objects.


SLIDE 28

Visually-guided audio source separation

Traditional approach:

  • Detect low-level correlations within a single video
  • Learn from clean single audio source examples

[Darrell et al. 2000; Fisher et al. 2001; Rivet et al. 2007; Barzelay & Schechner 2007; Casanovas et al. 2010; Parekh et al. 2017; Pu et al. 2017; Li et al. 2017]

SLIDE 29

Learning to separate object sounds

Our idea: Leverage visual objects to learn from unlabeled video with multiple audio sources

Unlabeled video → disentangle → object sound models (violin, dog, cat)

[Gao, Feris, & Grauman, ECCV 2018]


SLIDE 30

Our approach: learning

Deep multi-instance multi-label learning (MIML) to disentangle which visual objects make which sounds.

From each unlabeled video: non-negative matrix factorization on the audio yields audio basis vectors, while visual predictions (ResNet-152 objects, e.g., guitar, saxophone) give the top visual detections on the frames; MIML links the two. Output: a group of audio basis vectors per object class.

[Gao, Feris, & Grauman, ECCV 2018]
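A heavily simplified sketch of the training stage (illustrative only; the paper uses a deep MIML network to decide which audio bases belong to which detected object, rather than pooling them all): factor each clip's audio spectrogram with NMF and associate the resulting basis vectors with the object classes the visual model predicts for that clip.

```python
# Minimal sketch of the training stage (heavily simplified): NMF factors each
# clip's spectrogram into basis vectors, and visual predictions for the clip say
# which object classes those bases may belong to.
import numpy as np
from sklearn.decomposition import NMF
from collections import defaultdict

def nmf_bases(spectrogram, n_bases=4):
    """Return spectral basis vectors (freq x n_bases) for one audio clip."""
    model = NMF(n_components=n_bases, init="random", random_state=0, max_iter=300)
    W = model.fit_transform(spectrogram)      # spectrogram ~= W @ model.components_
    return W

object_bases = defaultdict(list)              # object class -> collected basis vectors

# toy "dataset": (magnitude spectrogram, visually detected object classes) per unlabeled clip
clips = [(np.random.rand(257, 100), {"guitar", "violin"}),
         (np.random.rand(257, 100), {"guitar", "piano"})]

for spec, visual_objects in clips:
    W = nmf_bases(spec)
    for obj in visual_objects:                # MIML would decide *which* basis goes
        object_bases[obj].append(W)           # to which object; here we simply pool them

print({k: len(v) for k, v in object_bases.items()})
```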

SLIDE 31

Our approach: learning

[Examples: Guitar + Violin, Guitar + Piano, Cello + Piano, each with its recovered audio bases]

MIML disentangles sounds via visually detected objects.

[Gao, Feris, & Grauman, ECCV 2018]

SLIDE 32

Our approach: inference

Given a novel video, use the discovered object sound models to guide audio source separation.

From the novel video's frames, visual predictions (ResNet-152 objects, e.g., violin, piano) select the relevant bases; the audio basis matrix is initialized with those violin and piano bases, activations are estimated on the audio, and each object's sound (violin sound, piano sound) is recovered: semi-supervised source separation using NMF.
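A minimal sketch of the inference stage under the same simplifications as above: keep the learned per-object basis vectors fixed, estimate only their activations on the novel clip's magnitude spectrogram with multiplicative updates, and reconstruct one spectrogram per visually detected object.

```python
# Minimal sketch of the inference stage (assumed simplification): fix the learned
# per-object bases and estimate only their activations on a novel clip, then
# reconstruct each visually present object's sound.
import numpy as np

def separate(spectrogram, bases_per_object, n_iter=200, eps=1e-9):
    objects = list(bases_per_object)
    W = np.concatenate([bases_per_object[o] for o in objects], axis=1)  # freq x K_total
    H = np.random.rand(W.shape[1], spectrogram.shape[1])                # activations
    for _ in range(n_iter):                       # multiplicative updates with W fixed
        H *= (W.T @ spectrogram) / (W.T @ W @ H + eps)
    sources, start = {}, 0
    for o in objects:                             # per-object reconstruction
        k = bases_per_object[o].shape[1]
        sources[o] = bases_per_object[o] @ H[start:start + k]
        start += k
    return sources

# bases learned in the previous stage; the visual model says a violin and a piano are on screen
bases = {"violin": np.random.rand(257, 4), "piano": np.random.rand(257, 4)}
mix = np.random.rand(257, 100)                    # magnitude spectrogram of the novel clip
separated = separate(mix, bases)
print({k: v.shape for k, v in separated.items()})
```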

SLIDE 33

Train on 100,000 unlabeled multi-source video clips, then separate audio for novel video

Results

Baseline: M. Spiertz, Source-Filter Based Clustering for Monaural Blind Source Separation, Int. Conference on Digital Audio Effects, 2009
[Gao, Feris, & Grauman, ECCV 2018]

SLIDE 34

Train on 100,000 unlabeled multi-source video clips, then separate audio for novel video

Results

[Gao, Feris, & Grauman, ECCV 2018]

SLIDE 35

Train on 100,000 unlabeled multi-source video clips, then separate audio for novel video

Results

Failure cases

[Gao, Feris, & Grauman, ECCV 2018]

SLIDE 36

Results: Separating object sounds

[Bar charts: visually-aided audio source separation (SDR) and visually-aided audio denoising (NSDR)]

Baselines: Lock et al., Annals of Statistics 2013; Spiertz et al., ICDAE 2009; Kidron et al., CVPR 2006; Pu et al., ICASSP 2017

SLIDE 37

Towards embodied visual learning

  1. Learning from unlabeled video and multiple sensory modalities
  2. Learning policies for how to move for recognition and exploration

SLIDE 38

Active perception

Time to revisit active recognition in challenging settings!

Bajcsy 1985, Aloimonos 1988, Ballard 1991, Wilkes 1992, Dickinson 1997, Schiele & Crowley 1998, Tsotsos 2001, Denzler 2002, Soatto 2009, Ramanathan 2011, Borotschnig 2011, …


SLIDE 39

End-to-end active recognition

[The agent actively selects views at T=1, T=2, T=3, then outputs a predicted label]

[Jayaraman and Grauman, ECCV 2016, PAMI 2018]

SLIDE 40

Goal: Learn to “look around”

Example tasks: reconnaissance, search and rescue, recognition.

Can we learn look-around policies for visual agents that are curiosity-driven, exploratory, and generic?

Task predefined vs. task unfolds dynamically.


SLIDE 41

Two scenarios

SLIDE 42

Key idea: Active observation completion

Completion objective: Learn a policy for efficiently inferring (the pixels of) all yet-unseen portions of the environment. The agent must choose where to look before looking there.

Jayaraman and Grauman, CVPR 2018

SLIDE 43

Approach: Active observation completion

[Architecture: encoder → actor → decoder; reconstruction trained with a shifted MSE loss]

Non-myopic: train to target a budget of observation time.

Jayaraman and Grauman, CVPR 2018
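A minimal sketch of the observation-completion loop, with assumed toy shapes and a drastically simplified actor (the paper trains the actor with reinforcement learning; here the argmax view choice is not differentiable, so only the reconstruction pathway receives gradients).

```python
# Minimal sketch of active observation completion (assumed shapes, simplified policy).
import torch
import torch.nn as nn

T, GRID = 4, 8                      # observation budget, number of candidate view cells
glimpse_dim, hid = 16 * 16, 128

encode = nn.Linear(glimpse_dim + GRID, hid)        # glimpse + its location
aggregate = nn.GRUCell(hid, hid)                   # running belief about the scene
actor = nn.Linear(hid, GRID)                       # scores for where to look next
decode = nn.Linear(hid, GRID * glimpse_dim)        # reconstruct all view cells

panorama = torch.rand(1, GRID, glimpse_dim)        # ground-truth views (training only)
belief = torch.zeros(1, hid)
view = torch.randint(0, GRID, (1,))                # random starting view

for t in range(T):
    loc = nn.functional.one_hot(view, GRID).float()
    glimpse = panorama[0, view]                    # "sensor" returns the chosen cell
    belief = aggregate(torch.relu(encode(torch.cat([glimpse, loc], dim=1))), belief)
    view = actor(belief).argmax(dim=1)             # choose where to look next

pred = decode(belief).view(1, GRID, glimpse_dim)
loss = nn.functional.mse_loss(pred, panorama)      # completion objective
loss.backward()
```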

SLIDE 44

Active “look around” results

[Plots: per-pixel MSE (x1000) vs. time on SUN360, ModelNet (seen classes), and ModelNet (unseen classes), comparing 1-view, random, large-action, large-action+, peek-saliency*, and ours]

*Saliency: Harel et al., Graph-Based Visual Saliency, NIPS '07

Learned active look-around policy: quickly grasp the environment, independent of a specific task.

Jayaraman and Grauman, CVPR 2018

SLIDE 45

Active “look around” visualization

Agent’s mental model for 3D object evolves with actively accumulated glimpses

Jayaraman and Grauman, CVPR 2018; Ramakrishnan & Grauman, ECCV 2018

SLIDE 46

Agent’s mental model for 360 scene evolves with actively accumulated glimpses

Active “look around” visualization

Jayaraman and Grauman, CVPR 2018; Ramakrishnan & Grauman, ECCV 2018

SLIDE 47

Egomotion policy transfer

[Diagram: look-around encoder + look-around policy + decoder (unsupervised observation completion) vs. classification encoder + classification policy + classifier, e.g. “beach” (supervised active recognition)]

Plug the observation completion (look-around) policy in for the new task.
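A minimal sketch of the transfer recipe, with stand-in modules (not the published architecture): the unsupervised look-around encoder and policy are frozen, and only a classifier head for the new task is trained on top of the aggregated features.

```python
# Minimal sketch (assumed transfer recipe): freeze the unsupervised look-around
# modules and train only a classifier head for the new task.
import torch
import torch.nn as nn

lookaround_encoder = nn.Linear(256, 128)      # stand-ins for the pretrained modules
lookaround_policy = nn.Linear(128, 8)
classifier = nn.Linear(128, 397)              # e.g. scene categories for the new task

for module in (lookaround_encoder, lookaround_policy):
    for p in module.parameters():
        p.requires_grad = False               # the policy is reused, not re-trained

glimpse = torch.randn(4, 256)
belief = torch.relu(lookaround_encoder(glimpse))
next_view = lookaround_policy(belief).argmax(dim=1)   # where the frozen policy looks next
logits = classifier(belief)                            # only this head gets gradients
loss = nn.functional.cross_entropy(logits, torch.randint(0, 397, (4,)))
loss.backward()
```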

SLIDE 48

Jayaraman and Grauman, CVPR 2018

Egomotion policy transfer

Plug the observation completion policy in for the new task (SUN360 scenes, ModelNet objects).

The unsupervised exploratory policy approaches the accuracy of the supervised task-specific policy!

SLIDE 49

Challenge: Motion policy learning with partial observability

Exploration with limited observability impedes policy learning. Yet during training, the full state may be available.

House3D, Wu et al.


SLIDE 50

Wu et al., 2015; Jayaraman and Grauman, 2016; Ammirato et al., 2017; Jayaraman and Grauman, 2018

Status quo: ignore full observability available at training time

Challenge: Motion policy learning with partial observability

SLIDE 51

Idea: Sidekick policy learning

A sidekick agent with full observability guides the policy towards valuable states (via rewards) during training.

Ramakrishnan & Grauman, ECCV 2018

SLIDE 52

1) Reward-based sidekick

Given the full 360° environment X, identify informative views and shape the reward function: preview and transfer knowledge of the environment.

Ramakrishnan & Grauman, ECCV 2018
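A minimal sketch of the reward-shaping idea, with an assumed scoring rule (score each candidate view by how well the full panorama can be reconstructed from it alone; the actual sidekick uses its own learned scoring): the scores are computed offline with full observability and added to the agent's reward only during training.

```python
# Minimal sketch (assumed scoring rule): a reward-based sidekick sees the full 360
# environment offline, scores how informative each candidate view is, and those
# scores are added to the agent's reward during training only.
import numpy as np

def view_scores(panorama_views, reconstruct):
    """Score each view by how well the whole panorama is reconstructed from it alone."""
    scores = []
    for v in panorama_views:
        err = np.mean((reconstruct(v) - panorama_views) ** 2)   # lower error = more informative
        scores.append(-err)
    scores = np.array(scores)
    return (scores - scores.min()) / (scores.max() - scores.min() + 1e-9)  # normalize to [0, 1]

# toy stand-ins: 8 candidate views, and a "reconstructor" that just tiles the chosen view
views = np.random.rand(8, 16)
sidekick_reward = view_scores(views, lambda v: np.tile(v, (8, 1)))

task_reward = np.zeros(8)                       # sparse task reward from the environment
shaped = task_reward + 0.5 * sidekick_reward    # used only while training the policy
print(shaped.round(2))
```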

SLIDE 53

2) Demonstration-based sidekick

Given the full 360° environment X, generate information-gathering trajectories (selected views, current view, cumulative information) to initially supervise policy learning.

Ramakrishnan & Grauman, ECCV 2018

SLIDE 54

Sidekick results

Accelerate training and obtain better policies

[Plots: reconstruction error vs. training epochs on SUN360 and ModelNet, for REINFORCE and actor-critic agents]

ltla: Jayaraman & Grauman, Learning to Look Around, CVPR 2018; asymm-ac: Pinto et al., Asymmetric Actor Critic, RSS 2018

SLIDE 55

Summary

  • Visual learning benefits from:
    – context of action and multiple senses
    – continuous, unsupervised observations
  • Key ideas:
    – Embodied feature learning via multi-sensory signals
    – Active policies for view selection and camera control

Ruohan Gao, Dinesh Jayaraman, Santhosh Ramakrishnan, Rogerio Feris

SLIDE 56

Papers/code/videos

  • Learning to Separate Object Sounds by Watching Unlabeled Video. R. Gao, R. Feris, and K. Grauman. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, Sept 2018. (Oral) [pdf] [videos]
  • ShapeCodes: Self-Supervised Feature Learning by Lifting Views to Viewgrids. D. Jayaraman, R. Gao, and K. Grauman. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, Sept 2018. [pdf]
  • Sidekick Policy Learning for Active Visual Exploration. S. Ramakrishnan and K. Grauman. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, Sept 2018. [pdf] [supp] [videos/code]
  • End-to-end Policy Learning for Active Visual Categorization. D. Jayaraman and K. Grauman. To appear, Transactions on Pattern Analysis and Machine Intelligence (PAMI), 2018. [pdf]
  • Im2Flow: Motion Hallucination from Static Images for Action Recognition. R. Gao, B. Xiong, and K. Grauman. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, June 2018. (Oral) [pdf] [code] [project page]
  • Learning to Look Around: Intelligently Exploring Unseen Environments for Unknown Tasks. D. Jayaraman and K. Grauman. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, June 2018. [pdf] [animations]
  • Learning Image Representations Tied to Egomotion from Unlabeled Video. D. Jayaraman and K. Grauman. International Journal of Computer Vision (IJCV), Special Issue for Best Papers of ICCV 2015, Mar 2017. [pdf] [preprint] [project page, pretrained models]