SLIDE 1 Embodied Visual Learning and Recognition
Kristen Grauman Department of Computer Science University of Texas at Austin
Weinberg Symposium on the Shared Frontiers of Artificial Intelligence and Cognitive Science University of Michigan, April 2018
SLIDE 2 sky water Ferris wheel amusement park Cedar Point 12 E tree tree tree carousel deck people waiting in line ride ride ride umbrellas pedestrians maxair bench tree Lake Erie people sitting on ride
Objects Activities Scenes Locations Text / writing Faces Gestures Motions Emotions…
The Wicked Twister
Visual recognition
SLIDE 3 Visual recognition: applications
AI and autonomous robotics, personal photo/video collections, surveillance and security, science and medicine, organizing visual content, gaming, HCI, augmented reality
SLIDE 4 Visual recognition: significant recent progress
Big labeled datasets Deep learning GPU technology
[Chart: ImageNet top-5 error (%) falling from roughly 25% in 2011 to under 5% by 2016]
SLIDE 5
How do our systems learn about the visual world today?
[Images: labeled snapshots, e.g., "boat", "dog", …]
SLIDE 6 Recognition benchmarks
Caltech 101 (2004), Caltech 256 (2006), PASCAL (2007–12), ImageNet (2009), LabelMe (2007), MS COCO (2014), SUN (2010), Places (2014), BSD (2001), Visual Genome (2016)
A "disembodied" well-curated moment in time
SLIDE 7
Egocentric perceptual experience
A tangle of relevant and irrelevant multi-sensory information
SLIDE 8
Big picture goal: Embodied visual learning
Status quo: Learn from “disembodied” bag of labeled snapshots. On the horizon: Visual learning in the context of acting and moving in the world.
SLIDE 9 This talk
Towards embodied visual learning
1. Learning from unlabeled video and multiple sensory modalities
2. Learning policies for how to move for recognition and exploration
SLIDE 10
The kitten carousel experiment
[Held & Hein, 1963]
[Illustration: active kitten vs. passive kitten]
Key to perceptual development: self-generated motion + visual feedback
SLIDE 11 Idea: Ego-motion ↔ vision
Goal: Teach a computer vision system the connection between "how I move" and "how my visual surroundings change."
Training input: unlabeled video + ego-motion motor signals
[Jayaraman & Grauman, ICCV 2015, IJCV 2017]
SLIDE 12
Ego-motion ↔ vision: view prediction
[Example: given the current view and an ego-motion, predict the view after moving]
SLIDE 13 Approach idea: Ego-motion equivariance
Equivariant embedding: pairs of frames related by similar ego-motion should be related by the same feature transformation.
[Diagram: unlabeled video frames over time, grouped by motor signal: left turn, right turn, forward]
Training data: unlabeled video + motor signals
[Jayaraman & Grauman, ICCV 2015, IJCV 2017]
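To make the equivariance objective concrete, here is a minimal sketch of one way to implement it, assuming PyTorch; the backbone, dimensions, and per-motion linear maps are illustrative stand-ins, not the paper's actual architecture:

```python
# Sketch: learn features f(x) and one linear map M_g per discrete ego-motion g
# so that M_g f(x_t) ~ f(x_{t+1}) for frame pairs (all names illustrative).
import torch
import torch.nn as nn

class EquivariantEmbedding(nn.Module):
    def __init__(self, feat_dim=128, num_motions=3):   # e.g., left, right, forward
        super().__init__()
        self.backbone = nn.Sequential(                  # stand-in feature extractor
            nn.Conv2d(3, 32, 5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4), nn.Flatten(),
            nn.Linear(32 * 16, feat_dim),
        )
        # One learned linear transformation per discrete ego-motion class
        self.motion_maps = nn.ModuleList(
            [nn.Linear(feat_dim, feat_dim) for _ in range(num_motions)]
        )

    def forward(self, frames_t, frames_t1, motion_ids):
        z_t = self.backbone(frames_t)
        z_t1 = self.backbone(frames_t1)
        # Apply the map matching each pair's motor signal
        z_pred = torch.stack(
            [self.motion_maps[g](z) for z, g in zip(z_t, motion_ids)]
        )
        # Equivariance loss: transformed feature should match the next frame's
        return ((z_pred - z_t1) ** 2).sum(dim=1).mean()

# Usage sketch: pairs of frames plus their motor-signal class per pair
model = EquivariantEmbedding()
loss = model(torch.randn(4, 3, 64, 64), torch.randn(4, 3, 64, 64), [0, 1, 2, 0])
```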
SLIDE 14 Results: Recognition
Learn features from unlabeled car video (KITTI) [Geiger et al, IJRR '13]; exploit them for static scene classification (SUN, 397 classes) [Xiao et al, CVPR '10].
30% accuracy increase when labeled data is scarce
SLIDE 15 [Diagram: pre-recorded video over time with motor signals]
From passive, pre-recorded video to comprehensive observation with complete ego-motions
SLIDE 16 One-shot reconstruction
Viewgrid representation: from one observed view, infer the unseen views.
Key idea: One-shot reconstruction as a proxy task to learn semantic features.
SLIDE 17 One-shot reconstruction
Shape from dense views is a geometric problem [Snavely et al, CVPR '06]; shape from one view is a semantic problem [Sinha et al, ICCV '93].
SLIDE 18 Approach: ShapeCodes
[Jayaraman & Grauman, arXiv 2017]
- Implicit 3D shape representation
- No “canonical” azimuth to exploit
- Agnostic of category
Learned embedding
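As a rough illustration of the setup, here is a hedged encoder/decoder sketch: a single view is encoded into a latent "shape code" and decoded into an entire viewgrid, with a loss taken as the minimum over azimuthal shifts since there is no canonical azimuth. This assumes PyTorch; the viewgrid size, resolution, and layer sizes are assumptions for illustration:

```python
# A minimal sketch, assuming PyTorch; sizes are illustrative, not the paper's.
import torch
import torch.nn as nn

N_AZ, N_EL, IMG = 8, 4, 32   # assumed azimuths x elevations, view resolution

class ShapeCode(nn.Module):
    def __init__(self, code_dim=256):
        super().__init__()
        self.encoder = nn.Sequential(   # one observed view -> implicit shape code
            nn.Conv2d(1, 32, 5, 2, 2), nn.ReLU(),
            nn.Conv2d(32, 64, 5, 2, 2), nn.ReLU(),
            nn.Flatten(), nn.Linear(64 * (IMG // 4) ** 2, code_dim),
        )
        # shape code -> pixels of every view in the grid
        self.decoder = nn.Linear(code_dim, N_EL * N_AZ * IMG * IMG)

    def forward(self, view):
        grid = self.decoder(self.encoder(view))
        return grid.view(-1, N_EL, N_AZ, IMG, IMG)

model = ShapeCode()
pred = model(torch.randn(2, 1, IMG, IMG))
target = torch.randn(2, N_EL, N_AZ, IMG, IMG)
# No canonical azimuth to exploit: score the prediction against the best
# circular shift of the target grid along the azimuth axis.
losses = [((pred - target.roll(s, dims=2)) ** 2).flatten(1).mean(1)
          for s in range(N_AZ)]
loss = torch.stack(losses).min(dim=0).values.mean()
```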
SLIDE 19 One-shot reconstruction example
[Figure: observed view, ground-truth viewgrid, and predicted viewgrid]
[Jayaraman & Grauman, arXiv 2017]
SLIDE 20 ShapeCodes capture semantics
t-SNE embedding for images of unseen object categories
[Jayaraman & Grauman, arXiv 2017]
SLIDE 21 [Bar charts: class recognition accuracy (%) on ModelNet and ShapeNet, comparing Pixels, Random wts, DrLIM*, Autoencoder**, LSM^, and Ours]
*Hadsell et al, Dimensionality Reduction by Learning an Invariant Mapping, CVPR 2006. **Masci et al, Stacked Convolutional Auto-Encoders for Hierarchical Feature Extraction, ICANN 2011. ^Agrawal, Carreira, Malik, Learning to See by Moving, ICCV 2015.
ShapeCodes for recognition
[Chang et al 2015] [Wu et al 2015]
SLIDE 22 Ego-motion and implied body pose
Learn relationship between egocentric scene motion and 3D human body pose
[Jiang & Grauman, CVPR 2017]
Input: egocentric video
Output: sequence of 3D joint positions
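As a signature-level illustration of this input/output mapping (not the actual CVPR 2017 model), a hedged PyTorch sketch might look like this; the joint count, feature dimensions, and layers are assumptions:

```python
# Minimal sketch of the task signature: per-frame video features in,
# a sequence of 3D joint positions out (all sizes illustrative).
import torch
import torch.nn as nn

N_JOINTS = 14  # assumed joint count

class EgoPoseNet(nn.Module):
    def __init__(self, feat_dim=512, hidden=256):
        super().__init__()
        # Per-frame features (stand-in for real egocentric video features)
        self.frame_feat = nn.Linear(feat_dim, hidden)
        self.temporal = nn.LSTM(hidden, hidden, batch_first=True)
        self.joints = nn.Linear(hidden, N_JOINTS * 3)  # 3D position per joint

    def forward(self, video_feats):                    # (B, T, feat_dim)
        h, _ = self.temporal(torch.relu(self.frame_feat(video_feats)))
        return self.joints(h).view(*video_feats.shape[:2], N_JOINTS, 3)

poses = EgoPoseNet()(torch.randn(2, 30, 512))          # -> (2, 30, 14, 3)
```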
SLIDE 23 Ego-motion and implied body pose
Learn relationship between egocentric scene motion and 3D human body pose
[Jiang & Grauman, CVPR 2017]
[Video: wearable camera footage alongside the inferred pose of the camera wearer]
Videos: http://www.hao-jiang.net/egopose/index.html
SLIDE 24 Towards embodied visual learning
1. Learning from unlabeled video and multiple sensory modalities
   a) Egomotion / motor signals
   b) Audio signals
2. Learning policies for how to move for recognition and exploration
This talk
SLIDE 25
Recall: Disembodied visual learning
[Images: labeled snapshots, e.g., "boat", "dog", …]
SLIDE 26
Listening to learn
SLIDE 28
Listening to learn
Goal: A repertoire of objects and their sounds
[Illustration: objects with their sounds: "woof", "meow", "ring", "clatter"]
SLIDE 29 Visually-guided audio source separation
Traditional approach:
- Detect low-level correlations within a single video
- Learn from clean single audio source examples
[Darrell et al. 2000; Fisher et al. 2001; Rivet et al. 2007; Barzelay & Schechner 2007; Casanovas et al. 2010; Parekh et al. 2017; Pu et al. 2017; Li et al. 2017]
SLIDE 30 Learning to separate object sounds
Our idea: Leverage visual objects to learn from unlabeled video with multiple audio sources
[Diagram: unlabeled video → disentangle → object sound models: violin, dog, cat]
[Gao, Feris, & Grauman, arXiv 2018]
SLIDE 31 Our approach: training
Deep multi-instance multi-label learning (MIML) to disentangle which visual objects make which sounds.
[Pipeline: unlabeled video → visual frames → visual predictions (ResNet-152) → top visual detections; audio → non-negative matrix factorization → audio basis vectors → MIML]
Output: a group of audio basis vectors per object class (e.g., guitar, saxophone)
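To make the audio side of this pipeline concrete, here is a hedged sketch of the basis-extraction step, using scikit-learn's NMF as a stand-in; the spectrogram shape, component count, and settings are illustrative, not the paper's exact configuration:

```python
# A minimal sketch, assuming scikit-learn; all numbers are illustrative.
import numpy as np
from sklearn.decomposition import NMF

spectrogram = np.random.rand(513, 200)   # (freq bins, time frames), nonnegative
nmf = NMF(n_components=25, init='random', random_state=0, max_iter=500)
bases = nmf.fit_transform(spectrogram)   # (freq bins, components): basis vectors
activations = nmf.components_            # (components, time frames)
# Each column of `bases` is one audio basis vector; the MIML network then
# learns which visual object class (guitar, saxophone, ...) each basis
# vector belongs to, using the video's detected objects as weak labels.
```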
SLIDE 32 Our approach: inference
Given a novel video, use the discovered object sound models to guide audio source separation.
[Pipeline: novel video frames → visual predictions (ResNet-152) detect, e.g., violin and piano → initialize the audio basis matrix with the violin and piano bases → estimate activations → separated violin sound and piano sound]
Semi-supervised source separation using NMF
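A hedged sketch of this inference step, assuming standard multiplicative NMF updates with the learned object bases held fixed (variable names and settings are illustrative):

```python
# Semi-supervised NMF separation sketch: keep the learned per-object bases
# fixed and estimate only the activations for a new mixture.
import numpy as np

def separate(mixture, bases_per_object, n_iter=200, eps=1e-9):
    """mixture: (freq, time) magnitude spectrogram of the novel video's audio;
    bases_per_object: list of (freq, k_i) basis matrices, one per detected object."""
    W = np.hstack(bases_per_object)                    # fixed basis matrix
    H = np.random.rand(W.shape[1], mixture.shape[1])   # activations to estimate
    for _ in range(n_iter):                            # multiplicative update for H
        H *= (W.T @ mixture) / (W.T @ (W @ H) + eps)
    # Reconstruct each object's spectrogram from its own bases and activations
    sources, start = [], 0
    for Wi in bases_per_object:
        Hi = H[start:start + Wi.shape[1]]
        sources.append(Wi @ Hi)
        start += Wi.shape[1]
    return sources   # e.g., [violin_spectrogram, piano_spectrogram]
```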
SLIDE 33 Results
Train on 100,000 unlabeled video clips, then separate audio for a novel video
Baseline: M. Spiertz, Source-Filter Based Clustering for Monaural Blind Source Separation, Intl. Conference on Digital Audio Effects, 2009.
[Gao, Feris, & Grauman, arXiv 2018] Videos: http://vision.cs.utexas.edu/projects/separating_object_sounds/
SLIDE 34 Results: failure cases
Train on 100,000 unlabeled video clips, then separate audio for a novel video
[Gao, Feris, & Grauman, arXiv 2018] Videos: http://vision.cs.utexas.edu/projects/separating_object_sounds/
SLIDE 35 Results
[Charts: visually-aided audio source separation (SDR) and visually-aided audio denoising (NSDR)]
Train on 100K unlabeled video clips from AudioSet [Gemmeke et al. 2017]
SLIDE 36 This talk
Towards embodied visual learning
1. Learning from unlabeled video and multiple sensory modalities
2. Learning policies for how to move for recognition and exploration
SLIDE 37
[Images: scene recognition and object recognition benchmarks]
Current recognition benchmarks
Passive, disembodied snapshots at test time, too
SLIDE 38 Moving to recognize
Time to revisit active recognition in challenging settings!
Bajcsy 1985, Aloimonos 1988, Ballard 1991, Wilkes 1992, Dickinson 1997, Schiele & Crowley 1998, Tsotsos 2001, Denzler 2002, Soatto 2009, Ramanathan 2011, Borotschnig 2011, …
SLIDE 39 Moving to recognize
Difficulty: unconstrained visual input
[Images: curated ImageNet photos vs. unconstrained web images]
SLIDE 40 Moving to recognize
Difficulty: unconstrained visual input. Opportunity: the ability to move to change the input.
[Illustration: an ambiguous view ("mug? bowl? pan?") resolved to "mug" after moving]
SLIDE 41 End-to-end active recognition
[Diagram: perception → action selection → perception → evidence fusion, resolving "mug? bowl? pan?" to "mug"]
[Jayaraman and Grauman, ECCV 2016]
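The loop on this slide can be summarized in code. The following is a hedged PyTorch sketch, where `env` is a hypothetical interface that returns a new view for each chosen motion, and all modules are illustrative stand-ins for the learned perception, action-selection, and evidence-fusion components:

```python
# Sketch of a perceive -> select action -> fuse evidence loop (illustrative).
import torch
import torch.nn as nn

class ActiveRecognizer(nn.Module):
    def __init__(self, n_classes=10, n_actions=9, hidden=256):
        super().__init__()
        # Assumes 32x32 single-channel views for simplicity
        self.perceive = nn.Sequential(nn.Flatten(), nn.Linear(32 * 32, hidden))
        self.fuse = nn.GRUCell(hidden, hidden)     # evidence fusion across views
        self.act = nn.Linear(hidden, n_actions)    # action selection (where to move)
        self.classify = nn.Linear(hidden, n_classes)

    def forward(self, env, n_views=3):
        view, state = env.reset(), None            # `env` is hypothetical
        for _ in range(n_views):
            state = self.fuse(self.perceive(view), state)
            action = torch.distributions.Categorical(
                logits=self.act(state)).sample()   # choose the next motion
            view = env.step(action)                # move, receive a new view
        return self.classify(state)                # label from fused evidence
```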
SLIDE 42 Three settings: look around a scene, manipulate an object, move around an object
[Jayaraman and Grauman, ECCV 2016]
End-to-end active recognition
SLIDE 43 Agents that learn to look around intelligently can recognize things faster.
[Jayaraman and Grauman, ECCV 2016]
End-to-end active recognition
[Charts: recognition accuracy (%) vs. number of views on SUN 360, ModelNet-10, and GERMS, comparing a passive neural net, Transinformation [Schiele98], SeqDP [Denzler03], Transinformation+SeqDP, ShapeNets [Wu15], Pairwise [Johns16], and Ours]
SLIDE 44 End-to-end active recognition: example
[Example panels: as views accumulate, P("Church") rises 0.53 → 5.00 → 37.89 while the top-3 guesses shift from (Forest, Cave, Beach) to (Street, Cave, Plaza courtyard) to (Church, Lobby atrium, Street); P("Plaza courtyard") rises 6.28 → 11.95 → 68.38 while the top-3 guesses shift from (Restaurant, Train interior, Shop) to (Theater, Restaurant, Plaza courtyard) to (Plaza courtyard, Street, Theater)]
[Jayaraman and Grauman, ECCV 2016]
SLIDE 45 End-to-end active recognition: example
[Example: the predicted label evolves over views T=1, T=2, T=3]
GERMS dataset: Malmir et al. BMVC 2015
[Jayaraman and Grauman, ECCV 2016]
SLIDE 46 Goal: Learn to “look around”
Example tasks: reconnaissance, search and rescue, recognition
Can we learn look-around policies for visual agents that are curiosity-driven, exploratory, and generic?
Task predefined vs. task unfolds dynamically
SLIDE 47 Key idea: Active observation completion
Completion objective: learn a policy for efficiently inferring (the pixels of) all yet-unseen portions of the environment. The agent must choose where to look before looking there.
Jayaraman and Grauman, CVPR 2018
SLIDE 49 Approach: Active observation completion
[Architecture: encoder → actor → decoder; the decoded model is visualized against the ground-truth model and trained with a shifted MSE loss]
Non-myopic: train to target a budget of observation time
Jayaraman and Grauman, CVPR 2018
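Here is a hedged sketch of the encoder/actor/decoder loop, assuming PyTorch and a panorama discretized into a grid of glimpses; the "shifted MSE" idea is shown as a minimum over circular shifts, and every size and module here is an illustrative assumption:

```python
# Sketch of active observation completion: glimpse, fuse, choose the next
# location BEFORE seeing it, then decode the full panorama (illustrative).
import torch
import torch.nn as nn

G, P = 8, 16   # assumed number of glimpse locations and glimpse resolution

class LookAround(nn.Module):
    def __init__(self, hidden=256):
        super().__init__()
        self.encoder = nn.Sequential(nn.Flatten(), nn.Linear(P * P, hidden))
        self.fuse = nn.GRUCell(hidden, hidden)        # aggregate glimpses
        self.actor = nn.Linear(hidden, G)             # where to look next
        self.decoder = nn.Linear(hidden, G * P * P)   # infer ALL glimpses

    def rollout(self, pano, budget=4):                # pano: (B, G, P, P)
        b = pano.size(0)
        state = torch.zeros(b, self.fuse.hidden_size)
        loc = torch.zeros(b, dtype=torch.long)        # starting location
        for _ in range(budget):
            glimpse = pano[torch.arange(b), loc]
            state = self.fuse(self.encoder(glimpse), state)
            loc = torch.distributions.Categorical(    # choose where to look
                logits=self.actor(state)).sample()    # before looking there
        pred = self.decoder(state).view_as(pano)
        # "Shifted" MSE: score against the best circular shift of the
        # panorama, since there is no canonical starting azimuth
        losses = [((pred.roll(s, dims=1) - pano) ** 2).flatten(1).mean(1)
                  for s in range(G)]
        return torch.stack(losses).min(dim=0).values.mean()

loss = LookAround().rollout(torch.rand(2, G, P, P))
```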
SLIDE 50 Two scenarios
SUN 360 panoramas [Xiao 2012] and ModelNet-10 CAD models [Wu 2015]
SLIDE 51 Active “look around” results
[Charts: per-pixel MSE (×1000) vs. time on SUN360, ModelNet (seen classes), and ModelNet (unseen classes), comparing 1-view, random, large-action, large-action+, and peek-saliency* against the learned policy]
*Harel et al, Graph based Visual Saliency, NIPS’07
Learned active look-around policy: quickly grasp environment independent of a specific task
Jayaraman and Grauman, CVPR 2018
SLIDE 52 Active “look around” visualization
Jayaraman and Grauman, CVPR 2018
Agent's mental model of a 3D object evolves with actively accumulated glimpses
SLIDE 53 Active "look around" visualization
Agent's mental model of a 360° scene evolves with actively accumulated glimpses
[Visualization: complete 360° scene (ground truth) vs. inferred scene; marked regions = observed views]
Jayaraman and Grauman, CVPR 2018
SLIDE 54 Motion policy transfer
Plug the observation completion policy in for a new task.
[Diagram: unsupervised observation completion (look-around encoder → look-around policy → decoder) vs. supervised recognition (classification encoder → classification policy → classifier → "beach") [Jayaraman et al, ECCV 16]; the look-around policy is plugged into the recognition pipeline]
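Concretely, the transfer amounts to freezing the exploratory policy and attaching a new task head. A hedged sketch, reusing the hypothetical `LookAround` module from the earlier sketch (the real system uses a separate classification encoder; this reuses the look-around encoder only for brevity):

```python
# Sketch of motion policy transfer: the frozen, unsupervised look-around
# policy drives view selection for a recognition head (illustrative glue).
import torch
import torch.nn as nn

def classify_with_transferred_policy(pano, look_around, classifier, budget=4):
    """`classifier` is any head mapping the fused hidden state to class
    scores, e.g., nn.Linear(hidden, n_scene_classes)."""
    b = pano.size(0)
    state = torch.zeros(b, look_around.fuse.hidden_size)
    loc = torch.zeros(b, dtype=torch.long)
    for _ in range(budget):
        glimpse = pano[torch.arange(b), loc]
        state = look_around.fuse(look_around.encoder(glimpse), state)
        with torch.no_grad():                        # exploration policy frozen
            loc = look_around.actor(state).argmax(dim=1)
    return classifier(state)                         # e.g., "beach"
```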
SLIDE 55
Motion policy transfer
Plug the observation completion policy in for a new task: SUN 360 scenes and ModelNet objects.
Unsupervised exploratory policy approaches supervised task-specific policy accuracy!
SLIDE 56 Summary
- Visual learning benefits from:
  – the context of action and motion in the world
  – continuous unsupervised observations
- This talk:
  – embodied feature learning via visual and motor signals
  – learning to separate object sound models from unlabeled video
  – active policies for view selection and camera control
Ruohan Gao Dinesh Jayaraman Kristen Grauman, UT Austin
SLIDE 57 Papers
- Learning to Separate Object Sounds by Watching Unlabeled Video. R. Gao, R. Feris, and K. Grauman. arXiv:1804.01665, April 2018.
- Learning to Look Around: Intelligently Exploring Unseen Environments for Unknown Tasks. D. Jayaraman and K. Grauman. CVPR 2018.
- Seeing Invisible Poses: Estimating 3D Body Pose from Egocentric Video. H. Jiang and K. Grauman. CVPR 2017.
- Learning Image Representations Tied to Egomotion from Unlabeled Video. D. Jayaraman and K. Grauman. International Journal of Computer Vision (IJCV), Special Issue for Best Papers of ICCV 2015, Mar 2017.
- Look-Ahead Before You Leap: End-to-End Active Recognition by Forecasting the Effect of Motion. D. Jayaraman and K. Grauman. ECCV 2016.
- Unsupervised Learning Through One-Shot Image-Based Shape Reconstruction. D. Jayaraman, R. Gao, and K. Grauman. arXiv 2017.
http://www.cs.utexas.edu/~grauman/research/pubs.html