

SLIDE 1

Embodied Visual Learning and Recognition

Kristen Grauman Department of Computer Science University of Texas at Austin

Weinberg Symposium on the Shared Frontiers of Artificial Intelligence and Cognitive Science University of Michigan, April 2018

SLIDE 2

Visual recognition

[Image: amusement-park scene at Cedar Point, densely annotated with labels such as sky, water, Ferris wheel, carousel, rides, umbrellas, pedestrians, benches, trees, Lake Erie, and “The Wicked Twister”]

Objects, activities, scenes, locations, text / writing, faces, gestures, motions, emotions…

SLIDE 3

Visual recognition: applications

AI and autonomous robotics; personal photo/video collections; surveillance and security; science and medicine; organizing visual content; gaming, HCI, augmented reality

SLIDE 4

Visual recognition: significant recent progress

Big labeled datasets + deep learning + GPU technology

[Chart: ImageNet top-5 error (%) dropping sharply from 2011 to 2016]

SLIDE 5

How do our systems learn about the visual world today?

[Image: bags of labeled snapshots, e.g. “boat”, “dog”, …]

SLIDE 6

Recognition benchmarks

Caltech 101 (2004), Caltech 256 (2006), PASCAL (2007-12), ImageNet (2009), LabelMe (2007), MS COCO (2014), SUN (2010), Places (2014), BSD (2001), Visual Genome (2016)

A “disembodied”, well-curated moment in time

SLIDE 7

Egocentric perceptual experience

A tangle of relevant and irrelevant multi-sensory information

SLIDE 8

Big picture goal: Embodied visual learning

Status quo: Learn from “disembodied” bag of labeled snapshots. On the horizon: Visual learning in the context of acting and moving in the world.

SLIDE 9

This talk

Towards embodied visual learning

  • 1. Learning from unlabeled video and multiple sensory modalities
  • 2. Learning policies for how to move for recognition and exploration

SLIDE 10

The kitten carousel experiment [Held & Hein, 1963]

[Image: active kitten vs. passive kitten on the carousel apparatus]

Key to perceptual development: self-generated motion + visual feedback

SLIDE 11

Idea: Ego-motion ↔ vision

Goal: Teach a computer vision system the connection between “how I move” and “how my visual surroundings change”.

Training input: unlabeled video + ego-motion motor signals

[Jayaraman & Grauman, ICCV 2015, IJCV 2017]

SLIDE 12

Ego-motion ↔ vision: view prediction

[Figure: given a view and an ego-motion, predict how the scene looks after moving]

SLIDE 13

Approach idea: Ego-motion equivariance

Training data: unlabeled video + motor signals

Learn an equivariant embedding organized by ego-motions: pairs of frames related by a similar ego-motion (e.g., left turn, right turn, forward) should be related by the same feature transformation.

[Jayaraman & Grauman, ICCV 2015, IJCV 2017]
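To make the objective concrete, here is a minimal PyTorch sketch of ego-motion equivariance, assuming motor signals binned into a few discrete motion classes; the architecture, dimensions, and names are illustrative, not the paper's actual model:

    import torch
    import torch.nn as nn

    # Each discretized ego-motion class g gets a learned map M_g that should
    # carry phi(frame_t) to phi(frame_t+1) whenever motion g occurred between
    # the two frames.
    class EquivariantEmbedding(nn.Module):
        def __init__(self, feat_dim=128, num_motions=3):
            super().__init__()
            self.encoder = nn.Sequential(        # stand-in CNN feature extractor
                nn.Conv2d(3, 32, 5, stride=2), nn.ReLU(),
                nn.Conv2d(32, 64, 5, stride=2), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                nn.Linear(64, feat_dim))
            # one linear transformation per ego-motion class
            self.motion_maps = nn.ModuleList(
                [nn.Linear(feat_dim, feat_dim) for _ in range(num_motions)])

        def forward(self, frame_a, frame_b, motion_id):
            za, zb = self.encoder(frame_a), self.encoder(frame_b)
            # apply each pair's motion-specific map to the first frame's feature
            za_moved = torch.stack(
                [self.motion_maps[m](z) for z, m in zip(za, motion_id.tolist())])
            return za_moved, zb

    model = EquivariantEmbedding()
    frame_a = torch.randn(8, 3, 64, 64)    # frames at time t
    frame_b = torch.randn(8, 3, 64, 64)    # frames at time t+1
    motion_id = torch.randint(0, 3, (8,))  # binned motor signal per pair
    za_moved, zb = model(frame_a, frame_b, motion_id)
    loss = nn.functional.mse_loss(za_moved, zb)  # equivariance objective
    loss.backward()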

SLIDE 14

Results: Recognition

Learn features from unlabeled car-mounted video (KITTI) [Geiger et al, IJRR ’13], then exploit those features for static scene classification (SUN, 397 classes) [Xiao et al, CVPR ’10].

30% accuracy increase when labeled data is scarce

SLIDE 15

[Figure: pre-recorded video frames paired with motor signals over time]

Pre-recorded video, comprehensive observation: passive → complete ego-motions

SLIDE 16

One-shot reconstruction

Viewgrid representation: from a single observed view, infer the unseen views.

Key idea: one-shot reconstruction as a proxy task to learn semantic features.

SLIDE 17

One-shot reconstruction

Shape from dense views is a geometric problem [Snavely et al, CVPR ‘06]; shape from one view is a semantic problem [Sinha et al, ICCV ’93].

SLIDE 18

Approach: ShapeCodes [Jayaraman & Grauman, arXiv 2017]

Learned embedding:
  • Implicit 3D shape representation
  • No “canonical” azimuth to exploit
  • Agnostic of category
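A minimal sketch of the ShapeCodes proxy task, under assumed toy dimensions (8 azimuths × 4 elevations, 32×32 views); plain MSE stands in for an alignment-aware reconstruction loss, since with no canonical azimuth the real objective must score the predicted grid up to an azimuthal shift:

    import torch
    import torch.nn as nn

    AZIMUTHS, ELEVATIONS = 8, 4          # assumed viewgrid layout

    # Encode one observed view into a latent "shape code", then decode the
    # full viewgrid of the object from that code.
    class ShapeCodeNet(nn.Module):
        def __init__(self, code_dim=256):
            super().__init__()
            self.encoder = nn.Sequential(
                nn.Conv2d(1, 32, 5, stride=2), nn.ReLU(),
                nn.Conv2d(32, 64, 5, stride=2), nn.ReLU(),
                nn.AdaptiveAvgPool2d(4), nn.Flatten(),
                nn.Linear(64 * 16, code_dim))            # the shape code
            self.decoder = nn.Linear(code_dim, AZIMUTHS * ELEVATIONS * 32 * 32)

        def forward(self, view):
            code = self.encoder(view)
            grid = self.decoder(code)                    # one image per grid cell
            return code, grid.view(-1, AZIMUTHS * ELEVATIONS, 32, 32)

    net = ShapeCodeNet()
    view = torch.randn(16, 1, 64, 64)          # one observed view per object
    code, pred_grid = net(view)
    target_grid = torch.randn(16, AZIMUTHS * ELEVATIONS, 32, 32)  # all views
    loss = nn.functional.mse_loss(pred_grid, target_grid)
    loss.backward()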

SLIDE 19

One-shot reconstruction example [Jayaraman & Grauman, arXiv 2017]

[Figure: observed view alongside the ground-truth and predicted viewgrids]

SLIDE 20

ShapeCodes capture semantics

t-SNE embedding for images of unseen object categories

[Jayaraman & Grauman, arXiv 2017]

SLIDE 21

ShapeCodes for recognition

[Charts: recognition accuracy (%) on ModelNet [Wu et al 2015] and ShapeNet [Chang et al 2015] for Pixels, Random weights, DrLIM*, Autoencoder**, LSM^, and Ours]

*Hadsell et al, Dimensionality Reduction by Learning an Invariant Mapping, CVPR 2006
**Masci et al, Stacked Convolutional Autoencoders for Hierarchical Feature Extraction, ICANN 2011
^Agrawal, Carreira, Malik, Learning to See by Moving, ICCV 2015

SLIDE 22

Ego-motion and implied body pose [Jiang & Grauman, CVPR 2017]

Learn the relationship between egocentric scene motion and 3D human body pose.

Input: egocentric video
Output: sequence of 3D joint positions
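As a rough illustration of this input/output mapping (not the CVPR 2017 architecture), a recurrent net can regress per-frame 3D joints from precomputed scene-motion descriptors; NUM_JOINTS, the descriptor dimension, and the LSTM are all assumptions:

    import torch
    import torch.nn as nn

    NUM_JOINTS = 14          # assumed skeleton size

    # Map a sequence of egocentric scene-motion features (e.g., flattened
    # optical flow or frame-to-frame homography parameters) to a sequence
    # of 3D joint positions for the camera wearer.
    class EgoPoseNet(nn.Module):
        def __init__(self, motion_dim=64, hidden=128):
            super().__init__()
            self.rnn = nn.LSTM(motion_dim, hidden, batch_first=True)
            self.head = nn.Linear(hidden, NUM_JOINTS * 3)

        def forward(self, motion_feats):          # (batch, time, motion_dim)
            h, _ = self.rnn(motion_feats)
            joints = self.head(h)                 # (batch, time, joints*3)
            return joints.view(*motion_feats.shape[:2], NUM_JOINTS, 3)

    net = EgoPoseNet()
    motion_feats = torch.randn(4, 30, 64)   # 30 frames of motion descriptors
    pred = net(motion_feats)                # (4, 30, 14, 3) joint trajectories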

SLIDE 23

Ego-motion and implied body pose [Jiang & Grauman, CVPR 2017]

Learn the relationship between egocentric scene motion and 3D human body pose.

[Video: wearable camera video alongside the inferred pose of the camera wearer]

Videos: http://www.hao-jiang.net/egopose/index.html

SLIDE 24

This talk

Towards embodied visual learning

  • 1. Learning from unlabeled video and multiple sensory modalities
     a) Egomotion / motor signals
     b) Audio signals
  • 2. Learning policies for how to move for recognition and exploration

SLIDE 25

Recall: Disembodied visual learning

[Image: bags of labeled snapshots, e.g. “boat”, “dog”, …]

SLIDE 26

Listening to learn

SLIDE 27

Listening to learn

SLIDE 28

Listening to learn

[Image: objects paired with their sounds: “woof”, “meow”, “ring”, “clatter”]

Goal: a repertoire of objects and their sounds

SLIDE 29

Visually-guided audio source separation

Traditional approach:
  • Detect low-level correlations within a single video
  • Learn from clean, single-audio-source examples

[Darrell et al. 2000; Fisher et al. 2001; Rivet et al. 2007; Barzelay & Schechner 2007; Casanovas et al. 2010; Parekh et al. 2017; Pu et al. 2017; Li et al. 2017]

SLIDE 30

Learning to separate object sounds [Gao, Feris, & Grauman, arXiv 2018]

Our idea: leverage visual objects to learn from unlabeled video with multiple audio sources.

Unlabeled video → disentangle → object sound models (violin, dog, cat, …)

SLIDE 31

Our approach: training

Deep multi-instance multi-label learning (MIML) disentangles which visual objects make which sounds.

On each unlabeled video: non-negative matrix factorization decomposes the audio track into audio basis vectors; a ResNet-152 object network makes visual predictions (e.g., guitar, saxophone) on the visual frames; and MIML associates the top visual detections with the audio bases.

Output: a group of audio basis vectors per object class.
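The audio side of this training step can be sketched with scikit-learn's NMF (a sketch under toy dimensions, not the paper's pipeline); the MIML assignment of basis vectors to object classes is stubbed out with a random placeholder:

    import numpy as np
    from sklearn.decomposition import NMF

    # NMF factorizes a magnitude spectrogram V (freq x time) into basis
    # vectors W (freq x k) and activations H (k x time); the MIML step then
    # assigns each basis vector to one of the visually detected objects.
    rng = np.random.default_rng(0)
    V = np.abs(rng.normal(size=(257, 400)))   # stand-in magnitude spectrogram

    nmf = NMF(n_components=25, init="nndsvda", max_iter=400, random_state=0)
    W = nmf.fit_transform(V)                  # audio basis vectors (257 x 25)
    H = nmf.components_                       # activations over time (25 x 400)

    # Placeholder for MIML: each basis gets a random class here; the real
    # system scores (basis, object) pairs using the top visual detections.
    classes = ["guitar", "saxophone"]
    basis_to_class = rng.integers(0, len(classes), size=W.shape[1])
    per_class_bases = {c: W[:, basis_to_class == i]
                       for i, c in enumerate(classes)}
    print({c: b.shape for c, b in per_class_bases.items()})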

SLIDE 32

Our approach: inference

Given a novel video, use the discovered object sound models to guide audio source separation.

Visual predictions (ResNet-152 objects) on the novel video’s frames select the relevant object sound models (e.g., violin, piano); their bases initialize the audio basis matrix. Semi-supervised source separation using NMF then estimates the activations, yielding the separated violin sound and piano sound.
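A minimal sketch of that semi-supervised separation step: hold the learned per-object bases fixed, run multiplicative NMF updates for the activations only, then reconstruct each source. Shapes and iteration counts are illustrative assumptions:

    import numpy as np

    def separate(V, W, cols_per_class, iters=200, eps=1e-9):
        # estimate activations H for the mixture V with bases W held fixed
        H = np.abs(np.random.default_rng(0).normal(size=(W.shape[1], V.shape[1])))
        for _ in range(iters):   # standard multiplicative update for H
            H *= (W.T @ V) / (W.T @ (W @ H) + eps)
        # reconstruct one spectrogram per class from its own columns of W
        out, start = [], 0
        for n in cols_per_class:
            out.append(W[:, start:start + n] @ H[start:start + n, :])
            start += n
        return out

    rng = np.random.default_rng(1)
    W_violin = np.abs(rng.normal(size=(257, 12)))  # learned violin bases
    W_piano = np.abs(rng.normal(size=(257, 12)))   # learned piano bases
    V_mix = np.abs(rng.normal(size=(257, 400)))    # novel mixture spectrogram

    violin_spec, piano_spec = separate(
        V_mix, np.hstack([W_violin, W_piano]), [12, 12])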

SLIDE 33

Results

Train on 100,000 unlabeled video clips, then separate audio for a novel video.

Baseline: M. Spiertz, Source-Filter Based Clustering for Monaural Blind Source Separation, International Conference on Digital Audio Effects, 2009.

[Gao, Feris, & Grauman, arXiv 2018] Videos: http://vision.cs.utexas.edu/projects/separating_object_sounds/

SLIDE 34

Results: failure cases

Train on 100,000 unlabeled video clips, then separate audio for a novel video.

[Gao, Feris, & Grauman, arXiv 2018] Videos: http://vision.cs.utexas.edu/projects/separating_object_sounds/

SLIDE 35

Results

Train on 100K unlabeled video clips from AudioSet [Gemmeke et al. 2017].

[Charts: visually-aided audio source separation (SDR) and visually-aided audio denoising (NSDR)]

SLIDE 36

This talk

Towards embodied visual learning

  • 1. Learning from unlabeled video and multiple sensory modalities
  • 2. Learning policies for how to move for recognition and exploration

SLIDE 37

Current recognition benchmarks

[Images: scene recognition and object recognition examples]

Passive, disembodied snapshots at test time, too.

SLIDE 38

Moving to recognize

Time to revisit active recognition in challenging settings!

Bajcsy 1985, Aloimonos 1988, Ballard 1991, Wilkes 1992, Dickinson 1997, Schiele & Crowley 1998, Tsotsos 2001, Denzler 2002, Soatto 2009, Ramanathan 2011, Borotschnig 2011, …

SLIDE 39

Moving to recognize

Difficulty: unconstrained visual input

[Images: ImageNet vs. web images]

SLIDE 40

Moving to recognize

Difficulty: unconstrained visual input. Opportunity: the ability to move to change the input.

[Figure: the agent resolves “mug? bowl? pan?” to “mug” by moving]

SLIDE 41

End-to-end active recognition [Jayaraman and Grauman, ECCV 2016]

[Figure: a loop of perception, action selection, and evidence fusion resolves “mug? bowl? pan?” to “mug”]
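The three modules can be sketched as a small recurrent network (illustrative, not the ECCV 2016 architecture): perception encodes each view, a GRU fuses evidence across views, and two linear heads handle action selection and classification:

    import torch
    import torch.nn as nn

    NUM_CLASSES, NUM_ACTIONS = 10, 5

    class ActiveRecognizer(nn.Module):
        def __init__(self, feat=64, hidden=128):
            super().__init__()
            self.perceive = nn.Sequential(            # perception
                nn.Conv2d(3, 16, 5, stride=2), nn.ReLU(),
                nn.AdaptiveAvgPool2d(2), nn.Flatten(), nn.Linear(64, feat))
            self.fuse = nn.GRUCell(feat, hidden)      # evidence fusion
            self.act = nn.Linear(hidden, NUM_ACTIONS) # action selection
            self.cls = nn.Linear(hidden, NUM_CLASSES) # label belief

        def step(self, view, h):
            h = self.fuse(self.perceive(view), h)
            return self.act(h), self.cls(h), h

    model = ActiveRecognizer()
    h = torch.zeros(1, 128)
    view = torch.randn(1, 3, 64, 64)
    for t in range(3):                     # a budget of T glimpses
        action_logits, class_logits, h = model.step(view, h)
        next_action = action_logits.argmax(dim=1)   # greedy for the sketch
        # in the real system, next_action moves the camera to get a new view
        view = torch.randn(1, 3, 64, 64)   # stand-in for the next observation
    print(class_logits.softmax(dim=1))     # label belief after all views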

SLIDE 42

End-to-end active recognition [Jayaraman and Grauman, ECCV 2016]

Settings: look around a scene, manipulate an object, move around an object.

SLIDE 43

End-to-end active recognition [Jayaraman and Grauman, ECCV 2016]

Agents that learn to look around intelligently can recognize things faster.

[Charts: accuracy (%) vs. number of views on SUN 360, ModelNet-10, and GERMS; methods compared: passive neural net, Transinformation [Schiele98], SeqDP [Denzler03], Transinformation+SeqDP, ShapeNets [Wu15], Pairwise [Johns 16], and Ours]

SLIDE 44

End-to-end active recognition: example [Jayaraman and Grauman, ECCV 2016]

[Figure: top-3 guesses and confidences after each view; P(“Church”) and P(“Plaza courtyard”) grow as views accumulate, until the correct label dominates]

SLIDE 45

End-to-end active recognition: example [Jayaraman and Grauman, ECCV 2016]

[Figure: predicted label evolving over views T=1, 2, 3 on the GERMS dataset (Malmir et al. BMVC 2015)]

SLIDE 46

Goal: Learn to “look around”

Example tasks: reconnaissance, search and rescue, recognition.

Can we learn look-around policies for visual agents that are curiosity-driven, exploratory, and generic?

Predefined task vs. task that unfolds dynamically.

SLIDE 47 / SLIDE 48

Key idea: Active observation completion [Jayaraman and Grauman, CVPR 2018]

Completion objective: learn a policy for efficiently inferring (the pixels of) all yet-unseen portions of the environment. The agent must choose where to look before looking there.

SLIDE 49

Approach: Active observation completion [Jayaraman and Grauman, CVPR 2018]

[Figure: encoder → actor → decoder; the decoded model is compared against the visualized ground truth under a shifted MSE loss]

Non-myopic: trained to target a budget of observation time.
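A minimal sketch of the encoder-actor-decoder loop, with plain MSE standing in for the shifted MSE loss; glimpse acquisition is stubbed with random tensors, and all sizes are assumptions rather than the paper's settings:

    import torch
    import torch.nn as nn

    GRID = 8                         # assumed number of candidate view angles

    class LookAround(nn.Module):
        def __init__(self, feat=64, hidden=128, out_pixels=GRID * 32 * 32):
            super().__init__()
            self.encode = nn.Sequential(                 # encoder per glimpse
                nn.Conv2d(3, 16, 5, stride=2), nn.ReLU(),
                nn.AdaptiveAvgPool2d(2), nn.Flatten(), nn.Linear(64, feat))
            self.fuse = nn.GRUCell(feat, hidden)         # aggregate glimpses
            self.actor = nn.Linear(hidden, GRID)         # where to look next
            self.decode = nn.Linear(hidden, out_pixels)  # paint unseen views

        def step(self, glimpse, h):
            h = self.fuse(self.encode(glimpse), h)
            return self.actor(h), self.decode(h), h

    model = LookAround()
    h = torch.zeros(1, 128)
    glimpse = torch.randn(1, 3, 64, 64)
    target = torch.randn(1, GRID * 32 * 32)       # full environment, flattened
    for t in range(4):                            # observation budget T=4
        view_scores, recon, h = model.step(glimpse, h)
        glimpse = torch.randn(1, 3, 64, 64)       # stand-in for the chosen view
    loss = nn.functional.mse_loss(recon, target)  # completion objective
    loss.backward()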

SLIDE 50

Two scenarios

SUN 360 panoramas [Xiao 2012] and ModelNet-10 CAD models [Wu 2015]

SLIDE 51

Active “look around” results [Jayaraman and Grauman, CVPR 2018]

[Charts: per-pixel MSE (x1000) vs. time on SUN360, ModelNet (seen classes), and ModelNet (unseen classes); methods: 1-view, random, large-action, large-action+, peek-saliency*, and Ours]

*Harel et al, Graph-Based Visual Saliency, NIPS ’07

Learned active look-around policy: quickly grasp the environment, independent of a specific task.

SLIDE 52

Active “look around” visualization [Jayaraman and Grauman, CVPR 2018]

The agent’s mental model of a 3D object evolves with actively accumulated glimpses.

SLIDE 53

Active “look around” visualization [Jayaraman and Grauman, CVPR 2018]

The agent’s mental model of a 360° scene evolves with actively accumulated glimpses.

[Figure: complete 360° scene (ground truth) vs. inferred scene; marked cells = observed views]

SLIDE 54

Motion policy transfer

Plug the observation completion policy in for a new task.

Unsupervised observation completion: look-around encoder → look-around policy → decoder.
Supervised recognition [Jayaraman et al, ECCV 16]: classification encoder → classification policy → classifier (“beach”).
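A sketch of the transfer, with stand-in modules (all names and sizes are assumptions): freeze the look-around policy learned from observation completion and train only a new classification head on its fused state:

    import torch
    import torch.nn as nn

    FEAT, HIDDEN, GRID, NUM_CLASSES = 64, 128, 8, 26

    # stand-ins for the pretrained look-around modules
    encode = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, FEAT))
    fuse = nn.GRUCell(FEAT, HIDDEN)
    policy = nn.Linear(HIDDEN, GRID)
    for m in (encode, fuse, policy):
        for p in m.parameters():
            p.requires_grad = False           # transferred, not retrained

    classifier = nn.Linear(HIDDEN, NUM_CLASSES)   # the only trainable part

    h = torch.zeros(1, HIDDEN)
    glimpse = torch.randn(1, 3, 64, 64)
    for t in range(4):
        h = fuse(encode(glimpse), h)
        next_view = policy(h).argmax(dim=1)       # frozen policy picks a view
        glimpse = torch.randn(1, 3, 64, 64)       # stand-in for that view
    loss = nn.functional.cross_entropy(classifier(h), torch.tensor([3]))
    loss.backward()                               # updates the classifier only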

SLIDE 55

Motion policy transfer

Plug the observation completion policy in for the new task: SUN 360 scenes and ModelNet objects.

The unsupervised exploratory policy approaches the accuracy of the supervised task-specific policy!

SLIDE 56

Summary

  • Visual learning benefits from:
    – the context of action and motion in the world
    – continuous unsupervised observations
  • New ideas:
    – embodied feature learning via visual and motor signals
    – learning to separate object sound models from unlabeled video
    – active policies for view selection and camera control

Ruohan Gao, Dinesh Jayaraman, Kristen Grauman (UT Austin)

SLIDE 57

Papers

  • Learning to Separate Object Sounds by Watching Unlabeled Video. R. Gao, R. Feris, and K. Grauman. arXiv:1804.01665, April 2018.
  • Learning to Look Around: Intelligently Exploring Unseen Environments for Unknown Tasks. D. Jayaraman and K. Grauman. CVPR 2018.
  • Seeing Invisible Poses: Estimating 3D Body Pose from Egocentric Video. H. Jiang and K. Grauman. CVPR 2017.
  • Learning Image Representations Tied to Egomotion from Unlabeled Video. D. Jayaraman and K. Grauman. International Journal of Computer Vision (IJCV), Special Issue for Best Papers of ICCV 2015, March 2017.
  • Look-Ahead Before You Leap: End-to-End Active Recognition by Forecasting the Effect of Motion. D. Jayaraman and K. Grauman. ECCV 2016.
  • Unsupervised Learning Through One-Shot Image-Based Shape Reconstruction. D. Jayaraman, R. Gao, and K. Grauman. arXiv 2017.

http://www.cs.utexas.edu/~grauman/research/pubs.html