SLIDE 1 Embodied Visual Learning and Recognition
Kristen Grauman Department of Computer Science University of Texas at Austin
Weinberg Symposium on the Shared Frontiers of Artificial Intelligence and Cognitive Science University of Michigan, April 2018
SLIDE 2 sky water Ferris wheel amusement park Cedar Point 12 E tree tree tree carousel deck people waiting in line ride ride ride umbrellas pedestrians maxair bench tree Lake Erie people sitting on ride
Objects Activities Scenes Locations Text / writing Faces Gestures Motions Emotions…
The Wicked Twister
Visual recognition
SLIDE 3 Visual recognition: applications
AI and autonomous robotics, personal photo/video collections, surveillance and security, science and medicine, organizing visual content, gaming, HCI, augmented reality
SLIDE 4 Visual recognition: significant recent progress
Big labeled datasets Deep learning GPU technology
[Chart: ImageNet top-5 error (%) falling from roughly 25% in 2011 to under 5% by 2016]
SLIDE 5
How do our systems learn about the visual world today?
[Images: labeled snapshots, e.g., "boat", "dog", …]
SLIDE 6 Recognition benchmarks
Caltech 101 (2004), Caltech 256 (2006), PASCAL (2007–12), ImageNet (2009), LabelMe (2007), MS COCO (2014), SUN (2010), Places (2014), BSD (2001), Visual Genome (2016)
A "disembodied" well-curated moment in time
SLIDE 7
Egocentric perceptual experience
A tangle of relevant and irrelevant multi-sensory information
SLIDE 8
Big picture goal: Embodied visual learning
Status quo: Learn from “disembodied” bag of labeled snapshots. On the horizon: Visual learning in the context of acting and moving in the world.
SLIDE 9 This talk
Towards embodied visual learning
1. Learning from unlabeled video and multiple sensory modalities
2. Learning policies for how to move for recognition and exploration
SLIDE 10
The kitten carousel experiment
[Held & Hein, 1963]
[Illustration: active kitten vs. passive kitten]
Key to perceptual development: self-generated motion + visual feedback
SLIDE 11 Idea: Ego-motion ↔ vision
Goal: Teach a computer vision system the connection between "how I move" and "how my visual surroundings change."
Training input: unlabeled video + ego-motion motor signals
[Jayaraman & Grauman, ICCV 2015, IJCV 2017]
SLIDE 12
Ego-motion ↔ vision: view prediction
[Example: given the current view and an ego-motion, predict the view after moving]
SLIDE 13 Approach idea: Ego-motion equivariance
Equivariant embedding: pairs of frames related by similar ego-motion should be related by the same feature transformation.
[Diagram: unlabeled video frames over time, grouped by motor signal: left turn, right turn, forward]
Training data: unlabeled video + motor signals
[Jayaraman & Grauman, ICCV 2015, IJCV 2017]
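To make the equivariance objective concrete, here is a minimal sketch of one way to implement it, assuming PyTorch; the backbone, dimensions, and per-motion linear maps are illustrative stand-ins, not the paper's actual architecture:

```python
# Sketch: learn features f(x) and one linear map M_g per discrete ego-motion g
# so that M_g f(x_t) ~ f(x_{t+1}) for frame pairs (all names illustrative).
import torch
import torch.nn as nn

class EquivariantEmbedding(nn.Module):
    def __init__(self, feat_dim=128, num_motions=3):   # e.g., left, right, forward
        super().__init__()
        self.backbone = nn.Sequential(                  # stand-in feature extractor
            nn.Conv2d(3, 32, 5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4), nn.Flatten(),
            nn.Linear(32 * 16, feat_dim),
        )
        # One learned linear transformation per discrete ego-motion class
        self.motion_maps = nn.ModuleList(
            [nn.Linear(feat_dim, feat_dim) for _ in range(num_motions)]
        )

    def forward(self, frames_t, frames_t1, motion_ids):
        z_t = self.backbone(frames_t)
        z_t1 = self.backbone(frames_t1)
        # Apply the map matching each pair's motor signal
        z_pred = torch.stack(
            [self.motion_maps[g](z) for z, g in zip(z_t, motion_ids)]
        )
        # Equivariance loss: transformed feature should match the next frame's
        return ((z_pred - z_t1) ** 2).sum(dim=1).mean()

# Usage sketch: pairs of frames plus their motor-signal class per pair
model = EquivariantEmbedding()
loss = model(torch.randn(4, 3, 64, 64), torch.randn(4, 3, 64, 64), [0, 1, 2, 0])
```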
SLIDE 14 Results: Recognition
Learn features from unlabeled car video (KITTI) [Geiger et al, IJRR '13]; exploit them for static scene classification (SUN, 397 classes) [Xiao et al, CVPR '10].
30% accuracy increase when labeled data is scarce
SLIDE 15 [Diagram: pre-recorded video over time with motor signals]
From passive, pre-recorded video to comprehensive observation with complete ego-motions
SLIDE 16 One-shot reconstruction
Viewgrid representation: from one observed view, infer the unseen views.
Key idea: One-shot reconstruction as a proxy task to learn semantic features.
SLIDE 17 One-shot reconstruction
Shape from dense views is a geometric problem [Snavely et al, CVPR '06]; shape from one view is a semantic problem [Sinha et al, ICCV '93].
SLIDE 18 Approach: ShapeCodes
[Jayaraman & Grauman, arXiv 2017]
- Implicit 3D shape representation
- No “canonical” azimuth to exploit
- Agnostic of category
Learned embedding
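As a rough illustration of the setup, here is a hedged encoder/decoder sketch: a single view is encoded into a latent "shape code" and decoded into an entire viewgrid, with a loss taken as the minimum over azimuthal shifts since there is no canonical azimuth. This assumes PyTorch; the viewgrid size, resolution, and layer sizes are assumptions for illustration:

```python
# A minimal sketch, assuming PyTorch; sizes are illustrative, not the paper's.
import torch
import torch.nn as nn

N_AZ, N_EL, IMG = 8, 4, 32   # assumed azimuths x elevations, view resolution

class ShapeCode(nn.Module):
    def __init__(self, code_dim=256):
        super().__init__()
        self.encoder = nn.Sequential(   # one observed view -> implicit shape code
            nn.Conv2d(1, 32, 5, 2, 2), nn.ReLU(),
            nn.Conv2d(32, 64, 5, 2, 2), nn.ReLU(),
            nn.Flatten(), nn.Linear(64 * (IMG // 4) ** 2, code_dim),
        )
        # shape code -> pixels of every view in the grid
        self.decoder = nn.Linear(code_dim, N_EL * N_AZ * IMG * IMG)

    def forward(self, view):
        grid = self.decoder(self.encoder(view))
        return grid.view(-1, N_EL, N_AZ, IMG, IMG)

model = ShapeCode()
pred = model(torch.randn(2, 1, IMG, IMG))
target = torch.randn(2, N_EL, N_AZ, IMG, IMG)
# No canonical azimuth to exploit: score the prediction against the best
# circular shift of the target grid along the azimuth axis.
losses = [((pred - target.roll(s, dims=2)) ** 2).flatten(1).mean(1)
          for s in range(N_AZ)]
loss = torch.stack(losses).min(dim=0).values.mean()
```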
SLIDE 19 One-shot reconstruction example
[Figure: observed view, ground-truth viewgrid, and predicted viewgrid]
[Jayaraman & Grauman, arXiv 2017]
SLIDE 20 ShapeCodes capture semantics
t-SNE embedding for images of unseen object categories
[Jayaraman & Grauman, arXiv 2017]
SLIDE 21 [Bar charts: class recognition accuracy (%) on ModelNet and ShapeNet, comparing Pixels, Random wts, DrLIM*, Autoencoder**, LSM^, and Ours]
*Hadsell et al, Dimensionality Reduction by Learning an Invariant Mapping, CVPR 2006. **Masci et al, Stacked Convolutional Auto-Encoders for Hierarchical Feature Extraction, ICANN 2011. ^Agrawal, Carreira, Malik, Learning to See by Moving, ICCV 2015.
ShapeCodes for recognition
[Chang et al 2015] [Wu et al 2015]
SLIDE 22 Ego-motion and implied body pose
Learn relationship between egocentric scene motion and 3D human body pose
[Jiang & Grauman, CVPR 2017]
Input: egocentric video
Output: sequence of 3D joint positions
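As a signature-level illustration of this input/output mapping (not the actual CVPR 2017 model), a hedged PyTorch sketch might look like this; the joint count, feature dimensions, and layers are assumptions:

```python
# Minimal sketch of the task signature: per-frame video features in,
# a sequence of 3D joint positions out (all sizes illustrative).
import torch
import torch.nn as nn

N_JOINTS = 14  # assumed joint count

class EgoPoseNet(nn.Module):
    def __init__(self, feat_dim=512, hidden=256):
        super().__init__()
        # Per-frame features (stand-in for real egocentric video features)
        self.frame_feat = nn.Linear(feat_dim, hidden)
        self.temporal = nn.LSTM(hidden, hidden, batch_first=True)
        self.joints = nn.Linear(hidden, N_JOINTS * 3)  # 3D position per joint

    def forward(self, video_feats):                    # (B, T, feat_dim)
        h, _ = self.temporal(torch.relu(self.frame_feat(video_feats)))
        return self.joints(h).view(*video_feats.shape[:2], N_JOINTS, 3)

poses = EgoPoseNet()(torch.randn(2, 30, 512))          # -> (2, 30, 14, 3)
```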
SLIDE 23 Ego-motion and implied body pose
Learn relationship between egocentric scene motion and 3D human body pose
[Jiang & Grauman, CVPR 2017]
[Video: wearable camera footage alongside the inferred pose of the camera wearer]
Videos: http://www.hao-jiang.net/egopose/index.html
SLIDE 24 Towards embodied visual learning
1. Learning from unlabeled video and multiple sensory modalities
   a) Egomotion / motor signals
   b) Audio signals
2. Learning policies for how to move for recognition and exploration
This talk
SLIDE 25
Recall: Disembodied visual learning
[Images: labeled snapshots, e.g., "boat", "dog", …]
SLIDE 26
Listening to learn
SLIDE 28
Listening to learn
Goal: A repertoire of objects and their sounds
[Illustration: objects with their sounds: "woof", "meow", "ring", "clatter"]
SLIDE 29 Visually-guided audio source separation
Traditional approach:
- Detect low-level correlations within a single video
- Learn from clean single audio source examples
[Darrell et al. 2000; Fisher et al. 2001; Rivet et al. 2007; Barzelay & Schechner 2007; Casanovas et al. 2010; Parekh et al. 2017; Pu et al. 2017; Li et al. 2017]
SLIDE 30 Learning to separate object sounds
Our idea: Leverage visual objects to learn from unlabeled video with multiple audio sources
[Diagram: unlabeled video → disentangle → object sound models: violin, dog, cat]
[Gao, Feris, & Grauman, arXiv 2018]
SLIDE 31 Our approach: training
Deep multi-instance multi-label learning (MIML) to disentangle which visual objects make which sounds.
[Pipeline: unlabeled video → visual frames → visual predictions (ResNet-152) → top visual detections; audio → non-negative matrix factorization → audio basis vectors → MIML]
Output: a group of audio basis vectors per object class (e.g., guitar, saxophone)
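To make the audio side of this pipeline concrete, here is a hedged sketch of the basis-extraction step, using scikit-learn's NMF as a stand-in; the spectrogram shape, component count, and settings are illustrative, not the paper's exact configuration:

```python
# A minimal sketch, assuming scikit-learn; all numbers are illustrative.
import numpy as np
from sklearn.decomposition import NMF

spectrogram = np.random.rand(513, 200)   # (freq bins, time frames), nonnegative
nmf = NMF(n_components=25, init='random', random_state=0, max_iter=500)
bases = nmf.fit_transform(spectrogram)   # (freq bins, components): basis vectors
activations = nmf.components_            # (components, time frames)
# Each column of `bases` is one audio basis vector; the MIML network then
# learns which visual object class (guitar, saxophone, ...) each basis
# vector belongs to, using the video's detected objects as weak labels.
```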
SLIDE 32 Our approach: inference
Given a novel video, use the discovered object sound models to guide audio source separation.
[Pipeline: novel video frames → visual predictions (ResNet-152) detect, e.g., violin and piano → initialize the audio basis matrix with the violin and piano bases → estimate activations → separated violin sound and piano sound]
Semi-supervised source separation using NMF
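A hedged sketch of this inference step, assuming standard multiplicative NMF updates with the learned object bases held fixed (variable names and settings are illustrative):

```python
# Semi-supervised NMF separation sketch: keep the learned per-object bases
# fixed and estimate only the activations for a new mixture.
import numpy as np

def separate(mixture, bases_per_object, n_iter=200, eps=1e-9):
    """mixture: (freq, time) magnitude spectrogram of the novel video's audio;
    bases_per_object: list of (freq, k_i) basis matrices, one per detected object."""
    W = np.hstack(bases_per_object)                    # fixed basis matrix
    H = np.random.rand(W.shape[1], mixture.shape[1])   # activations to estimate
    for _ in range(n_iter):                            # multiplicative update for H
        H *= (W.T @ mixture) / (W.T @ (W @ H) + eps)
    # Reconstruct each object's spectrogram from its own bases and activations
    sources, start = [], 0
    for Wi in bases_per_object:
        Hi = H[start:start + Wi.shape[1]]
        sources.append(Wi @ Hi)
        start += Wi.shape[1]
    return sources   # e.g., [violin_spectrogram, piano_spectrogram]
```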
SLIDE 33 Results
Train on 100,000 unlabeled video clips, then separate audio for a novel video
Baseline: M. Spiertz, Source-Filter Based Clustering for Monaural Blind Source Separation, Intl. Conference on Digital Audio Effects, 2009.
[Gao, Feris, & Grauman, arXiv 2018] Videos: http://vision.cs.utexas.edu/projects/separating_object_sounds/
SLIDE 34 Results: failure cases
Train on 100,000 unlabeled video clips, then separate audio for a novel video
[Gao, Feris, & Grauman, arXiv 2018] Videos: http://vision.cs.utexas.edu/projects/separating_object_sounds/
SLIDE 35 Results
[Charts: visually-aided audio source separation (SDR) and visually-aided audio denoising (NSDR)]
Train on 100K unlabeled video clips from AudioSet [Gemmeke et al. 2017]
SLIDE 36 This talk
Towards embodied visual learning
1. Learning from unlabeled video and multiple sensory modalities
2. Learning policies for how to move for recognition and exploration
SLIDE 37
[Images: scene recognition and object recognition benchmarks]
Current recognition benchmarks
Passive, disembodied snapshots at test time, too
SLIDE 38 Moving to recognize
Time to revisit active recognition in challenging settings!
Bajcsy 1985, Aloimonos 1988, Ballard 1991, Wilkes 1992, Dickinson 1997, Schiele & Crowley 1998, Tsotsos 2001, Denzler 2002, Soatto 2009, Ramanathan 2011, Borotschnig 2011, …
SLIDE 39 Moving to recognize
Difficulty: unconstrained visual input
[Images: curated ImageNet photos vs. unconstrained web images]
SLIDE 40 Moving to recognize
Difficulty: unconstrained visual input. Opportunity: the ability to move to change the input.
[Illustration: an ambiguous view ("mug? bowl? pan?") resolved to "mug" after moving]
SLIDE 41 End-to-end active recognition
[Diagram: perception → action selection → perception → evidence fusion, resolving "mug? bowl? pan?" to "mug"]
[Jayaraman and Grauman, ECCV 2016]
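The loop on this slide can be summarized in code. The following is a hedged PyTorch sketch, where `env` is a hypothetical interface that returns a new view for each chosen motion, and all modules are illustrative stand-ins for the learned perception, action-selection, and evidence-fusion components:

```python
# Sketch of a perceive -> select action -> fuse evidence loop (illustrative).
import torch
import torch.nn as nn

class ActiveRecognizer(nn.Module):
    def __init__(self, n_classes=10, n_actions=9, hidden=256):
        super().__init__()
        # Assumes 32x32 single-channel views for simplicity
        self.perceive = nn.Sequential(nn.Flatten(), nn.Linear(32 * 32, hidden))
        self.fuse = nn.GRUCell(hidden, hidden)     # evidence fusion across views
        self.act = nn.Linear(hidden, n_actions)    # action selection (where to move)
        self.classify = nn.Linear(hidden, n_classes)

    def forward(self, env, n_views=3):
        view, state = env.reset(), None            # `env` is hypothetical
        for _ in range(n_views):
            state = self.fuse(self.perceive(view), state)
            action = torch.distributions.Categorical(
                logits=self.act(state)).sample()   # choose the next motion
            view = env.step(action)                # move, receive a new view
        return self.classify(state)                # label from fused evidence
```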
SLIDE 42 Three settings: look around a scene, manipulate an object, move around an object
[Jayaraman and Grauman, ECCV 2016]
End-to-end active recognition
SLIDE 43 Agents that learn to look around intelligently can recognize things faster.
[Jayaraman and Grauman, ECCV 2016]
End-to-end active recognition
[Charts: recognition accuracy (%) vs. number of views on SUN 360, ModelNet-10, and GERMS, comparing a passive neural net, Transinformation [Schiele98], SeqDP [Denzler03], Transinformation+SeqDP, ShapeNets [Wu15], Pairwise [Johns16], and Ours]
SLIDE 44 End-to-end active recognition: example
[Example panels: as views accumulate, P("Church") rises 0.53 → 5.00 → 37.89 while the top-3 guesses shift from (Forest, Cave, Beach) to (Street, Cave, Plaza courtyard) to (Church, Lobby atrium, Street); P("Plaza courtyard") rises 6.28 → 11.95 → 68.38 while the top-3 guesses shift from (Restaurant, Train interior, Shop) to (Theater, Restaurant, Plaza courtyard) to (Plaza courtyard, Street, Theater)]
[Jayaraman and Grauman, ECCV 2016]
SLIDE 45 End-to-end active recognition: example
[Example: the predicted label evolves over views T=1, T=2, T=3]
GERMS dataset: Malmir et al. BMVC 2015
[Jayaraman and Grauman, ECCV 2016]
SLIDE 46 Goal: Learn to “look around”
Example tasks: reconnaissance, search and rescue, recognition
Can we learn look-around policies for visual agents that are curiosity-driven, exploratory, and generic?
Task predefined vs. task unfolds dynamically
SLIDE 47 Key idea: Active observation completion
Completion objective: learn a policy for efficiently inferring (the pixels of) all yet-unseen portions of the environment. The agent must choose where to look before looking there.
Jayaraman and Grauman, CVPR 2018
SLIDE 49 Approach: Active observation completion
[Architecture: encoder → actor → decoder; the decoded model is visualized against the ground-truth model and trained with a shifted MSE loss]
Non-myopic: train to target a budget of observation time
Jayaraman and Grauman, CVPR 2018
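Here is a hedged sketch of the encoder/actor/decoder loop, assuming PyTorch and a panorama discretized into a grid of glimpses; the "shifted MSE" idea is shown as a minimum over circular shifts, and every size and module here is an illustrative assumption:

```python
# Sketch of active observation completion: glimpse, fuse, choose the next
# location BEFORE seeing it, then decode the full panorama (illustrative).
import torch
import torch.nn as nn

G, P = 8, 16   # assumed number of glimpse locations and glimpse resolution

class LookAround(nn.Module):
    def __init__(self, hidden=256):
        super().__init__()
        self.encoder = nn.Sequential(nn.Flatten(), nn.Linear(P * P, hidden))
        self.fuse = nn.GRUCell(hidden, hidden)        # aggregate glimpses
        self.actor = nn.Linear(hidden, G)             # where to look next
        self.decoder = nn.Linear(hidden, G * P * P)   # infer ALL glimpses

    def rollout(self, pano, budget=4):                # pano: (B, G, P, P)
        b = pano.size(0)
        state = torch.zeros(b, self.fuse.hidden_size)
        loc = torch.zeros(b, dtype=torch.long)        # starting location
        for _ in range(budget):
            glimpse = pano[torch.arange(b), loc]
            state = self.fuse(self.encoder(glimpse), state)
            loc = torch.distributions.Categorical(    # choose where to look
                logits=self.actor(state)).sample()    # before looking there
        pred = self.decoder(state).view_as(pano)
        # "Shifted" MSE: score against the best circular shift of the
        # panorama, since there is no canonical starting azimuth
        losses = [((pred.roll(s, dims=1) - pano) ** 2).flatten(1).mean(1)
                  for s in range(G)]
        return torch.stack(losses).min(dim=0).values.mean()

loss = LookAround().rollout(torch.rand(2, G, P, P))
```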
SLIDE 50 Two scenarios
SUN 360 panoramas [Xiao 2012] and ModelNet-10 CAD models [Wu 2015]
SLIDE 51 Active “look around” results
[Charts: per-pixel MSE (×1000) vs. time on SUN360, ModelNet (seen classes), and ModelNet (unseen classes), comparing 1-view, random, large-action, large-action+, and peek-saliency* against the learned policy]
*Harel et al, Graph based Visual Saliency, NIPS’07
Learned active look-around policy: quickly grasp environment independent of a specific task
Jayaraman and Grauman, CVPR 2018
SLIDE 52 Active “look around” visualization
Jayaraman and Grauman, CVPR 2018
Agent's mental model of a 3D object evolves with actively accumulated glimpses
SLIDE 53 Active "look around" visualization
Agent's mental model of a 360° scene evolves with actively accumulated glimpses
[Visualization: complete 360° scene (ground truth) vs. inferred scene; marked regions = observed views]
Jayaraman and Grauman, CVPR 2018
SLIDE 54 Motion policy transfer
Plug the observation completion policy in for a new task.
[Diagram: unsupervised observation completion (look-around encoder → look-around policy → decoder) vs. supervised recognition (classification encoder → classification policy → classifier → "beach") [Jayaraman et al, ECCV 16]; the look-around policy is plugged into the recognition pipeline]
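Concretely, the transfer amounts to freezing the exploratory policy and attaching a new task head. A hedged sketch, reusing the hypothetical `LookAround` module from the earlier sketch (the real system uses a separate classification encoder; this reuses the look-around encoder only for brevity):

```python
# Sketch of motion policy transfer: the frozen, unsupervised look-around
# policy drives view selection for a recognition head (illustrative glue).
import torch
import torch.nn as nn

def classify_with_transferred_policy(pano, look_around, classifier, budget=4):
    """`classifier` is any head mapping the fused hidden state to class
    scores, e.g., nn.Linear(hidden, n_scene_classes)."""
    b = pano.size(0)
    state = torch.zeros(b, look_around.fuse.hidden_size)
    loc = torch.zeros(b, dtype=torch.long)
    for _ in range(budget):
        glimpse = pano[torch.arange(b), loc]
        state = look_around.fuse(look_around.encoder(glimpse), state)
        with torch.no_grad():                        # exploration policy frozen
            loc = look_around.actor(state).argmax(dim=1)
    return classifier(state)                         # e.g., "beach"
```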
SLIDE 55
Motion policy transfer
Plug the observation completion policy in for a new task: SUN 360 scenes and ModelNet objects.
Unsupervised exploratory policy approaches supervised task-specific policy accuracy!
SLIDE 56 Summary
- Visual learning benefits from:
  – the context of action and motion in the world
  – continuous unsupervised observations
- This talk:
  – embodied feature learning via visual and motor signals
  – learning to separate object sound models from unlabeled video
  – active policies for view selection and camera control
Ruohan Gao Dinesh Jayaraman Kristen Grauman, UT Austin
SLIDE 57 Papers
- Learning to Separate Object Sounds by Watching Unlabeled Video. R. Gao, R. Feris, and K. Grauman. arXiv:1804.01665, April 2018.
- Learning to Look Around: Intelligently Exploring Unseen Environments for Unknown Tasks. D. Jayaraman and K. Grauman. CVPR 2018.
- Seeing Invisible Poses: Estimating 3D Body Pose from Egocentric Video. H. Jiang and K. Grauman. CVPR 2017.
- Learning Image Representations Tied to Egomotion from Unlabeled Video. D. Jayaraman and K. Grauman. International Journal of Computer Vision (IJCV), Special Issue for Best Papers of ICCV 2015, Mar 2017.
- Look-Ahead Before You Leap: End-to-End Active Recognition by Forecasting the Effect of Motion. D. Jayaraman and K. Grauman. ECCV 2016.
- Unsupervised Learning Through One-Shot Image-Based Shape Reconstruction. D. Jayaraman, R. Gao, and K. Grauman. arXiv 2017.
http://www.cs.utexas.edu/~grauman/research/pubs.html