SLIDE 1

Learning How to Move and Where to Look from Unlabeled Video

Kristen Grauman Department of Computer Science University of Texas at Austin

SLIDE 2

Visual recognition

Objects, activities, scenes, locations, text / writing, faces, gestures, motions, emotions, …

[Image: Cedar Point amusement park scene with region labels: sky, water, Ferris wheel, trees, carousel, deck, people waiting in line, rides, umbrellas, pedestrians, maxair, bench, Lake Erie, The Wicked Twister, people sitting on a ride]

SLIDE 3

Visual recognition: applications

  • AI and autonomous robotics
  • Personal photo/video collections
  • Surveillance and security
  • Science and medicine
  • Organizing visual content
  • Gaming, HCI, Augmented Reality

SLIDE 4

Significant recent progress in the field

Big labeled datasets + deep learning + GPU technology

[Chart: ImageNet top-5 error (%) falling sharply over successive challenge years]

SLIDE 5

Recognition benchmarks

Caltech 101 (2004), Caltech 256 (2006), PASCAL (2007-12), ImageNet (2009), LabelMe (2007), MS COCO (2014), SUN (2010), Places (2014), BSD (2001), Visual Genome (2016)

SLIDE 6

How do our systems learn about the visual world today?

[Figure: collections of snapshots hand-labeled with categories such as "boat" and "dog", …]

Expensive and restrictive in scope

SLIDE 7

Big picture goal: Embodied visual learning

Status quo: Learn from a "disembodied" bag of labeled snapshots.

Our goal: Visual learning in the context of acting and moving in the world. Inexpensive and unrestricted in scope.

SLIDE 9

Talk overview

  • 1. Learning representations tied to ego-motion
  • 2. Learning representations from unlabeled video
  • 3. Learning how to move and where to look

Towards embodied visual learning

SLIDE 10

The kitten carousel experiment

[Held & Hein, 1963]

Active kitten vs. passive kitten. Key to perceptual development: self-generated motion + visual feedback.

SLIDE 11

Our idea: Ego-motion ↔ vision

Goal: Teach the computer vision system the connection between "how I move" and "how my visual surroundings change."

Unlabeled video + ego-motion motor signals

[Jayaraman & Grauman, ICCV 2015]

SLIDE 12

Ego-motion ↔ vision: view prediction

[Figure: given the current view and an ego-motion, predict how the scene will look after moving]

SLIDE 13

Ego-motion ↔ vision for recognition

Learning this connection requires:

  • Depth, 3D geometry
  • Semantics
  • Context

These cues can be learned without manual labels, and they are also key to recognition. Our approach: unsupervised feature learning using egocentric video + motor signals.

[Jayaraman & Grauman, ICCV 2015]

SLIDE 14

Approach idea: Ego-motion equivariance

Invariant features: unresponsive to some classes of transformations

Simard et al., Tech Report '98; Wiskott et al., Neural Comp '02; Hadsell et al., CVPR '06; Mobahi et al., ICML '09; Zou et al., NIPS '12; Sohn et al., ICML '12; Cadieu et al., Neural Comp '12; Goroshin et al., ICCV '15; Lies et al., PLoS Computational Biology '14; …

SLIDE 15

Approach idea: Ego-motion equivariance

Invariant features: unresponsive to some classes of transformations.

Equivariant features: predictably responsive to some classes of transformations, through simple mappings (e.g., linear) called the "equivariance map."

Invariance discards information; equivariance organizes it.
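In symbols (the notation $z_\theta$ for the learned feature map and $M_g$ for the per-transformation map is assumed here for illustration; the slide shows only the figure), the two properties can be written as:

```latex
% Invariance vs. equivariance of a learned feature map z_\theta under a transformation g.
% (Notation assumed for illustration; M_g is the simple, e.g. linear, "equivariance map".)
\begin{align*}
  \text{invariance:}   \quad & z_\theta(g x) \approx z_\theta(x) \\
  \text{equivariance:} \quad & z_\theta(g x) \approx M_g\, z_\theta(x)
\end{align*}
```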

SLIDE 16

Approach idea: Ego-motion equivariance

Training data: unlabeled video + motor signals over time (e.g., left turn, right turn, forward).

Learn an equivariant embedding organized by ego-motions: pairs of frames related by similar ego-motions should be related by the same feature-space transformation.

[Jayaraman & Grauman, ICCV 2015]

SLIDE 18
Ego-motion equivariant feature learning

Given (unsupervised training): pairs of video frames together with the ego-motion $g$ relating them.

Desired: for all motions $g$ and all images $x$, the learned feature space satisfies
$$z_\theta(g x) \approx M_g\, z_\theta(x).$$

[Jayaraman & Grauman, ICCV 2015]

SLIDE 19
Ego-motion equivariant feature learning

Unsupervised training: equivariance objective on unlabeled frame pairs, $z_\theta(g x) \approx M_g\, z_\theta(x)$, given their ego-motions $g$.

Supervised training: softmax loss on the class labels of the labeled images.

The two objectives are trained jointly.

[Jayaraman & Grauman, ICCV 2015]
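A minimal PyTorch-style sketch of how the unsupervised equivariance term and the supervised softmax term could be trained jointly. All names, the tiny CNN, the MSE form of the equivariance term, and the weighting `lam` are illustrative assumptions; the published method uses its own architecture and a contrastive-style formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EquivariantNet(nn.Module):
    """Feature encoder plus one learned linear map M_g per discrete ego-motion."""
    def __init__(self, feat_dim=64, num_motions=3, num_classes=397):
        super().__init__()
        self.encoder = nn.Sequential(            # small CNN, stand-in for the actual architecture
            nn.Conv2d(3, 32, 5, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, 5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, feat_dim))
        # one equivariance map per motion class (e.g., left turn / right turn / forward)
        self.maps = nn.ModuleList([nn.Linear(feat_dim, feat_dim) for _ in range(num_motions)])
        self.classifier = nn.Linear(feat_dim, num_classes)

    def forward(self, x):
        return self.encoder(x)

def joint_loss(net, frame_a, frame_b, motion_id, labeled_x, labels, lam=0.1):
    """Unsupervised equivariance term + supervised softmax term, trained jointly.
    motion_id is a Python list of ints giving the ego-motion class of each pair."""
    za, zb = net(frame_a), net(frame_b)
    # equivariance: the mapped feature of frame_a should land near the feature of frame_b
    pred_b = torch.stack([net.maps[m](za[i]) for i, m in enumerate(motion_id)])
    equiv = F.mse_loss(pred_b, zb)
    # supervised branch on the (few) labeled images
    logits = net.classifier(net(labeled_x))
    sup = F.cross_entropy(logits, labels)
    return sup + lam * equiv
```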

SLIDE 20

Results: Recognition

Learn features from unlabeled car-mounted video (KITTI); exploit them for static scene classification (SUN, 397 classes).

KITTI: Geiger et al., IJRR '13. SUN: Xiao et al., CVPR '10.

SLIDE 21

[Chart: multi-class recognition accuracy (%) for several unsupervised feature learning methods]

+ Hadsell, Chopra, LeCun, “Dimensionality Reduction by Learning an Invariant Mapping”, CVPR 2006 * Agrawal, Carreira, Malik, “Learning to see by moving”, ICCV 2015

Results: Recognition

Ego-equivariance for unsupervised feature learning

Pre-trained models available

Egomotion-equivariance induces the strongest representations

SUN scenes: 397 multi-class accuracy

SLIDE 22

Talk overview

  • 1. Learning representations tied to ego-motion
  • 2. Learning representations from unlabeled video
  • 3. Learning how to move and where to look

Towards embodied visual learning

SLIDE 23

Learning from arbitrary unlabeled video?

From unlabeled video + ego-motion (Part 1) to arbitrary unlabeled video alone (Part 2).

SLIDE 25

Background: Slow feature analysis

[Wiskott & Sejnowski, 2002]

Figure: Laurenz Wiskott, http://www.scholarpedia.org/article/File:SlowFeatureAnalysis-OptimizationProblem.png

Find functions g(x) that map a quickly varying input signal x(t) to slowly varying features y(t).
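For reference, the standard slow feature analysis optimization problem (the textbook formulation from Wiskott & Sejnowski, not spelled out on the slide) is:

```latex
% Slow feature analysis: for each output y_j(t) = g_j(x(t)), minimize the temporal
% variation subject to zero mean, unit variance, and decorrelation constraints.
\begin{align*}
  \min_{g_j}\ \ & \big\langle \dot{y}_j(t)^2 \big\rangle_t  && \text{(slowness)}\\
  \text{s.t.}\ \ & \langle y_j \rangle_t = 0,\qquad
                   \langle y_j^2 \rangle_t = 1,\qquad
                   \langle y_i\, y_j \rangle_t = 0 \ \ \text{for } i < j
\end{align*}
```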

SLIDE 27

Prior work: Slow feature analysis

Wiskott et al. 2002; Hadsell et al. 2006; Mobahi et al. 2009; Bergstra & Bengio 2009; Goroshin et al. 2013; Wang & Gupta 2015; …

Learn a feature map $z_\theta$ such that temporally adjacent frames map to nearby features: $z_\theta(x_{t+1}) \approx z_\theta(x_t)$ (invariance).

SLIDE 28

Our idea: Steady feature analysis

Learn a feature map $z_\theta$ such that:
  • first-order temporal coherence (invariance): $z_\theta(x_{t+1}) \approx z_\theta(x_t)$
  • higher-order temporal coherence (equivariance): $z_\theta(x_{t+2}) - z_\theta(x_{t+1}) \approx z_\theta(x_{t+1}) - z_\theta(x_t)$

[Jayaraman & Grauman, CVPR 2016]
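A hedged sketch of the two coherence terms computed on a triplet of consecutive frames (the `encoder`, the plain MSE form, and the weight `lam` are placeholders; the CVPR 2016 method also uses contrastive negatives and a margin):

```python
import torch.nn.functional as F

def steady_loss(encoder, x_t, x_t1, x_t2, lam=1.0):
    """First-order (slowness) + second-order (steadiness) temporal coherence
    on three consecutive video frames. Simplified illustration only."""
    z0, z1, z2 = encoder(x_t), encoder(x_t1), encoder(x_t2)
    slowness = F.mse_loss(z1, z0)              # adjacent frames -> nearby features
    steadiness = F.mse_loss(z2 - z1, z1 - z0)  # feature "velocity" stays consistent
    return slowness + lam * steadiness
```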

SLIDE 30

Datasets

Unlabeled video → target task (few labels):
  • Human Motion Database (HMDB) → PASCAL 10 Actions
  • KITTI Video → SUN 397 Scenes
  • NORB → NORB 25 Objects

(32 x 32 or 96 x 96 images)

SLIDE 31

Results: Steady feature analysis

[Chart: multi-class recognition accuracy for steady feature analysis vs. baselines* **]

*Hadsell et al., Dimensionality Reduction by Learning an Invariant Mapping, CVPR '06
**Mobahi et al., Deep Learning from Temporal Coherence in Video, ICML '09

SLIDE 32

Pre-training a representation

Supervised pre-training: labeled images from a related domain → fine-tune → few labeled images for the target task.

Unsupervised "pre-training": unlabeled video → fine-tune → few labeled images for the target task.
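Either pipeline ends the same way: attach a classifier to the pre-trained encoder and fine-tune on the few labeled target images. A hedged sketch (the `encoder.out_dim` attribute, hyperparameters, and data-loader interface are assumptions, not part of the published setup):

```python
import torch.nn as nn
from torch.optim import SGD

def fine_tune(encoder, num_classes, labeled_loader, lr=1e-3, epochs=10):
    """Fine-tune a pre-trained encoder (supervised or unsupervised) on a small
    labeled target set; encoder.out_dim is an assumed attribute giving its feature size."""
    model = nn.Sequential(encoder, nn.Linear(encoder.out_dim, num_classes))
    opt = SGD(model.parameters(), lr=lr, momentum=0.9)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for images, labels in labeled_loader:
            opt.zero_grad()
            loss_fn(model(images), labels).backward()
            opt.step()
    return model
```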

SLIDE 34

Results: Can we learn more from unlabeled video than “related” labeled images?

Pre-train on HMDB (unlabeled video) vs. CIFAR-100 (labeled for other categories); target task: PASCAL (few image labels).

Unsupervised video pre-training does better even than providing 50,000 extra manual labels for an auxiliary classification task.

SLIDE 35

Talk overview

  • 1. Learning representations tied to ego-motion
  • 2. Learning representations from unlabeled video
  • 3. Learning how to move and where to look

Towards embodied visual learning

SLIDE 36

Current recognition benchmarks

Caltech 101 (2004), Caltech 256 (2006), PASCAL (2007-12), ImageNet (2009), LabelMe (2007), MS COCO (2014), SUN (2010), Places (2014), BSD (2001), Visual Genome (2016)

Passive, disembodied snapshots at test time, too

SLIDE 37

[Figure: scene recognition and object recognition examples posed on single snapshots]

Current recognition benchmarks

Passive, disembodied snapshots at test time, too

SLIDE 38

Moving to recognize

Time to revisit active recognition in challenging settings!

Bajcsy 1985; Aloimonos 1988; Ballard 1991; Wilkes 1992; Dickinson 1997; Schiele & Crowley 1998; Tsotsos 2001; Denzler 2002; Soatto 2009; Ramanathan 2011; Borotschnig 2011; …

SLIDE 39

Moving to recognize

Difficulty: unconstrained visual input

[Figure: curated ImageNet Web images vs. the unconstrained views of an embodied agent]

SLIDE 40

Moving to recognize

Difficulty: unconstrained visual input. Opportunity: the ability to move to change the input.

[Figure: an ambiguous view ("mug? bowl? pan?") is resolved to "mug" after moving]

SLIDE 41

Components of active recognition

Perception, action selection, evidence fusion

[Figure: the mug example annotated with the three components]

SLIDE 42

Prior approaches to active recognition

Independent solutions for the three components:
  • Perception: train for 1-view recognition
  • Action selection: navigate to a pre-selected viewpoint; greedily maximize information gain; reinforcement learning
  • Evidence fusion: verification; averaging; Bayes / Naïve Bayes

Wilkes 1992; Dickinson 1997; Schiele 1998; Borotschnig 1998; Paletta 2000; Denzler 2002; Soatto 2009; Aloimonos 2011; Ramanathan 2011; Malmir 2015; Wu 2015; Jayaraman 2015; Johns 2016; …

SLIDE 43

Our idea: end-to-end active recognition

Perception, action selection, and evidence fusion are trained jointly, together with a look-ahead module that forecasts the effects of actions: multi-task training of the active recognition components + look-ahead.

Jayaraman and Grauman, ECCV 2016
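A schematic of the three components wired into a single recurrent model (a sketch under assumptions: the layer sizes, the `env.get_view()` / `env.move()` interface, and the greedy action choice are illustrative; the ECCV 2016 system trains a learned policy together with a look-ahead module rather than the argmax shown here):

```python
import torch
import torch.nn as nn

class ActiveRecognizer(nn.Module):
    """Perception + evidence fusion + action selection as one trainable model."""
    def __init__(self, feat_dim=256, num_actions=8, num_classes=26):
        super().__init__()
        self.perception = nn.Sequential(                      # per-view encoder
            nn.Conv2d(3, 32, 5, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, 5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, feat_dim))
        self.fusion = nn.GRUCell(feat_dim, feat_dim)          # aggregates evidence across views
        self.act_head = nn.Linear(feat_dim, num_actions)      # where to move next
        self.cls_head = nn.Linear(feat_dim, num_classes)      # current category belief

    def forward(self, env, num_views=3):
        h = torch.zeros(1, self.fusion.hidden_size)
        for _ in range(num_views):
            view = env.get_view()                             # 1 x 3 x H x W image tensor (assumed API)
            h = self.fusion(self.perception(view), h)
            action = self.act_head(h).argmax(dim=1)           # greedy here; learned policy in practice
            env.move(action.item())
        return self.cls_head(h)                               # final label prediction
```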

SLIDE 44

Experiments

How to evaluate active recognition? Previously: object instances on turntables [Nene 1996, Schiele 1998, Denzler 2003, Ramanathan 2011, …] and custom robot settings [Malmir 2015].

Jayaraman and Grauman, ECCV 2016

SLIDE 45

Experiments

Datasets: SUN 360 panoramas [Xiao 2012]; GERMS toy manipulation [Malmir 2015]; ModelNet-10 CAD models [Wu 2015].

Jayaraman and Grauman, ECCV 2016

SLIDE 46

End-to-end active recognition: results

[Charts: recognition accuracy (%) vs. number of views on SUN 360, ModelNet-10, and GERMS]

Our approach strongly outperforms traditional active recognition approaches.

Jayaraman and Grauman, ECCV 2016

SLIDE 47

End-to-end active recognition: example

[Example: as views are accumulated, P("Church") rises 0.53 → 5.00 → 37.89 while the top-3 guesses evolve from (Forest, Cave, Beach) to (Street, Cave, Plaza courtyard) to (Church, Lobby atrium, Street); P("Plaza courtyard") rises 6.28 → 11.95 → 68.38, with top-3 guesses going from (Restaurant, Train interior, Shop) to (Theater, Restaurant, Plaza courtyard) to (Plaza courtyard, Street, Theater).]

[Jayaraman and Grauman, ECCV 2016]

SLIDE 48

End-to-end active recognition: example

[Figure: predicted labels at T=1, T=2, T=3 on the GERMS dataset (Malmir et al., BMVC 2015)]

[Jayaraman and Grauman, ECCV 2016]

SLIDE 49

Talk overview

  • 1. Learning representations tied to ego-motion
  • 2. Learning representations from unlabeled video
  • 3. Learning how to move and where to look

Towards embodied visual learning

SLIDE 50

360° cameras and panoramic video

SLIDE 51

Challenge of viewing 360° videos

How to find the right direction to watch?

Control by mouse

SLIDE 52

New problem: Pano2Vid automatic videography

Definition. Input: 360° video. Task: control the virtual camera direction. Output: natural-looking normal-field-of-view (NFOV) video.

[Su et al. ACCV 2016]

SLIDE 54

Our approach – AutoCam

Learn videography tendencies from unlabeled Web videos

  • Diverse capture-worthy content
  • Proper composition

[Su et al. ACCV 2016]

[Figure: the 360° viewing sphere is sampled into spatio-temporal glimpses ("ST-glimpses", roughly 65.5° field of view, at each time step T); each glimpse is scored by how close it looks to human-captured NFOV Web videos ("HumanCam")]
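One way to turn per-glimpse scores into a camera trajectory is a simple dynamic program that trades off capture-worthiness against large jumps between directions. A hedged sketch: the index-distance switch cost and the `glimpse_scores` interface are assumptions, and the actual AutoCam pipeline discretizes directions on the viewing sphere and scores glimpses with a classifier trained to resemble HumanCam video.

```python
import numpy as np

def select_trajectory(glimpse_scores, motion_penalty=1.0):
    """Pick one camera direction per time step from scored ST-glimpses.

    glimpse_scores: (T, D) array, capture-worthiness of each of D candidate
    directions at each of T time steps. Returns a length-T list of direction
    indices maximizing total score minus a penalty for jumping between
    distant directions.
    """
    T, D = glimpse_scores.shape
    # cost of switching from direction j to direction k (simple index distance here)
    switch_cost = motion_penalty * np.abs(np.arange(D)[:, None] - np.arange(D)[None, :])

    best = glimpse_scores[0].copy()              # best total value ending at each direction
    back = np.zeros((T, D), dtype=int)
    for t in range(1, T):
        candidate = best[:, None] - switch_cost  # value of arriving at k from each j
        back[t] = candidate.argmax(axis=0)
        best = candidate.max(axis=0) + glimpse_scores[t]

    # backtrack the highest-value trajectory
    path = [int(best.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```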

SLIDE 55

Example AutoCam Output 1

[Video: input 360° video with the selected camera trajectory, alongside the AutoCam output video]

[Su et al. ACCV 2016]

SLIDE 56

Example AutoCam Output 2

[Video: AutoCam output vs. an eye-level-prior baseline]

[Su et al. ACCV 2016]

SLIDE 57

Example AutoCam Output 3

[Video: input 360° video with camera trajectories, with zooming and without zooming]

[Su et al. ACCV 2016]

SLIDE 58

Results: Quantitative evaluation

Measures: similarity to human-selected camera trajectories, and similarity to user-uploaded standard Web videos. Takeaway: AutoCam creates plausible videos by learning "where to look" from unlabeled video.

[Su et al. ACCV 2016]

SLIDE 59
Next steps

  • Active observations for representation learning
  • Explore varied space of egomotions
  • Multi-sensor active recognition
  • Learning where to look +/- recognition
  • 360° video summaries

SLIDE 60

Summary

  • Visual learning benefits from:
    – context of action and motion in the world
    – continuous unsupervised observations
  • New ideas:
    – "Embodied" feature learning via visual and motor signals
    – Feature learning from unlabeled video via higher-order temporal coherence
    – Active policies for view selection and camera control

Code and pre-trained models available

http://www.cs.utexas.edu/~grauman/research/pubs.html

Ruohan Gao, Yu-Chuan Su, Dinesh Jayaraman