SLIDE 1

Learning Image Representations Tied to Ego‐motion

Jayaraman and Grauman, ICCV 2015

Presentation (paper review) by Hilgad Montelo

University of Texas at Austin, Visual Recognition, March 2016

SLIDE 2

Outline

  • The "Kitten Carousel" Experiment
  • Problem
  • Objective
  • Main Idea
  • Related Work
  • Approach
  • Experiments and Results
  • Conclusions
SLIDE 3

The "Kitten Carousel" Experiment (Held & Hein, 1963)

[Figure: active kitten vs. passive kitten]

Key to perceptual development: self‐generated motion + visual feedback

[Slide credit: Dinesh Jayaraman]

SLIDE 4

Problem

  • Today’s visual recognition algorithms learn from a “disembodied” bag of labeled snapshots.

SLIDE 5

Objective

  • Provide a visual recognition algorithm that learns in the context of acting and moving in the world.

SLIDE 6
Main Idea

  • Associate ego‐motion and vision by teaching the computer vision system the connection:

“how I move” + “how my visual surroundings change”

SLIDE 7

Ego‐motion ↔ vision: view prediction

[Figure: given the current view and an ego‐motion, predict the view after moving]

[Slide credit: Dinesh Jayaraman]

SLIDE 8

Ego‐motion ↔ vision for recognition

  • Learning this connection requires:
  • Depth, 3D geometry
  • Semantics
  • Context
  • These are also key to recognition!
  • Can be learned without manual labels!

Approach: unsupervised feature learning using egocentric video + motor signals

SLIDE 9

Related Works

Visual prediction:
  • Doersch, Gupta, Efros, “… context prediction”, ICCV 2015
  • Oh, Guo, Lee, Lewis, Singh, “Action-conditional video …”, NIPS 2015
  • Kulkarni, Whitney, Kohli, Tenenbaum, “… inverse graphics …”, NIPS 2015
  • Vondrick, Pirsiavash, Torralba, “Anticipating the future …”, arXiv 2015

Video for unsupervised image features:
  • Wang, Gupta, “Unsupervised learning of visual …”, ICCV 2015
  • Goroshin, Bruna, Tompson, Eigen, LeCun, “Unsupervised …”, ICCV 2015

Integrating vision and motion:
  • Agrawal, Carreira, Malik, “Learning to see by moving”, ICCV 2015
  • Watter, Springenberg, Boedecker, Riedmiller, “Embed to control …”, NIPS 2015
  • Levine, Finn, Darrell, Abbeel, “… visuomotor policies”, arXiv 2015
  • Konda, Memisevic, “Learning visual odometry …”, VISAPP 2015

SLIDE 10

Ego‐motion equivariance

  • Invariant features: unresponsive to some classes of transformations
  • Equivariant features: predictably responsive to some classes of transformations, through simple mappings (e.g., linear), called “equivariance maps”
  • Invariance discards information; equivariance organizes it
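In symbols (notation mine, following the paper's setup): with z(·) the learned feature map, g an ego‐motion, and M_g a simple (e.g., linear) equivariance map, the goal is

```latex
% Equivariance: features respond predictably to ego-motion
% z(.) : learned feature map, M_g : "equivariance map" for motion g
z(g \circ x) \approx M_g \, z(x)
\quad \text{for all images } x \text{ and ego-motions } g
```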


SLIDE 11

Equivariant embedding organized by ego‐motions

Pairs of frames related by similar ego‐motion should be related by the same feature transformation.

Training data: unlabeled video + motor signals

[Figure: frames over time (time →), with motor signals grouping pairs as left turn, right turn, or forward, from which the feature transformations are learned]

Source: “Learning image representations equivariant to ego motion”, Jayaraman and Grauman, ICCV 2015

SLIDE 12

Approach

  1. Extract training frame pairs from video
  2. Learn ego‐motion‐equivariant image features
  3. Train on the target recognition task in parallel (a runnable sketch of steps 2–3 follows below)
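A minimal runnable sketch of steps 2–3 (my illustration with toy shapes and data, not the authors' code): a shared feature network z, one linear equivariance map M_g per discovered motion cluster, and a softmax classifier, all trained jointly.

```python
import torch
import torch.nn as nn

# Toy sketch: shared feature net z, one linear equivariance map M_g per
# ego-motion cluster, and a softmax classifier, trained jointly.
D, G, C = 64, 3, 10                       # feature dim, motion clusters, classes
z = nn.Sequential(nn.Flatten(), nn.Linear(32 * 32, D), nn.ReLU())
M = nn.ModuleList(nn.Linear(D, D, bias=False) for _ in range(G))
clf = nn.Linear(D, C)
opt = torch.optim.Adam(
    [*z.parameters(), *M.parameters(), *clf.parameters()], lr=1e-3)
ce = nn.CrossEntropyLoss()
lam = 1.0                                 # supervised/unsupervised trade-off

# Step 1 output (faked here): frame pairs (xi, xj) with motion labels g,
# plus labeled images (x, y) for the target recognition task.
xi, xj = torch.randn(8, 1, 32, 32), torch.randn(8, 1, 32, 32)
g = torch.randint(0, G, (8,))
x, y = torch.randn(8, 1, 32, 32), torch.randint(0, C, (8,))

for step in range(100):
    zi, zj = z(xi), z(xj)
    # Step 2: equivariance -- M_g z(xi) should predict z(xj).
    mapped = torch.stack([M[int(gi)](zi[k]) for k, gi in enumerate(g)])
    loss = ((mapped - zj) ** 2).sum(dim=1).mean()
    # Step 3: recognition on the same feature space, trained in parallel.
    loss = loss + lam * ce(clf(z(x)), y)
    opt.zero_grad()
    loss.backward()
    opt.step()
```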


SLIDE 13

Training frame pair mining

Discovery of ego‐motion clusters

[Figure: motor signals plotted as yaw change vs. forward distance; discovered ego‐motion clusters: forward, right turn, left turn]

[Slide credit: Dinesh Jayaraman]
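One plausible way to discover such clusters (my sketch with toy numbers, not the authors' code): run k-means on each frame pair's motor signal, here (yaw change, forward distance); pairs falling in the same cluster then share a discrete ego‐motion label g.

```python
import numpy as np
from sklearn.cluster import KMeans

# Motor signal for each temporally close frame pair:
# (yaw change in radians, forward distance in meters). Toy values.
motor_signals = np.array([
    [ 0.30, 0.5],   # turning right
    [-0.28, 0.4],   # turning left
    [ 0.01, 2.0],   # driving forward
    [ 0.02, 1.8],
    [ 0.31, 0.6],
    [-0.25, 0.5],
])

# Discover ego-motion clusters (e.g., forward / right turn / left turn).
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(motor_signals)

# Each frame pair now carries a discrete motion label g, which selects
# the equivariance map M_g that the pair will train.
print(kmeans.labels_)
```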

SLIDE 14

Ego‐motion equivariant feature learning

  • Desired: for all motions g and all images x, z(g ∘ x) ≈ M_g z(x) (unsupervised training)
  • Given: softmax loss L(z(x), y) for labeled class y (supervised training)
  • Feature space z and equivariance maps M_g are jointly trained

[Slide credit: Dinesh Jayaraman]
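A schematic form of the joint objective (notation mine, simplified; the paper uses a contrastive variant of the equivariance term): an unsupervised term over mined frame pairs (x_i, x_j) related by motion g, plus the supervised softmax loss over labeled images (x, y), traded off by a weight λ.

```latex
% Joint training objective (schematic; notation mine)
% (x_i, x_j, g): frame pair related by ego-motion cluster g
% (x, y): labeled image for the target recognition task
\min_{\theta,\,\{M_g\}}\;
\sum_{(x_i,\,x_j,\,g)} \bigl\| M_g\, z_\theta(x_i) - z_\theta(x_j) \bigr\|^2
\;+\; \lambda \sum_{(x,\,y)} L_{\mathrm{softmax}}\bigl( z_\theta(x),\, y \bigr)
```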

SLIDE 15

Experiments

  • Validation using 3 public datasets: NORB, KITTI, SUN
  • Comparison with different methods: CLSNET, TEMPORAL, DRLIM

SLIDE 16

Results: Recognition

  • Learn from unlabeled car video (KITTI) [Geiger et al., IJRR 2013]
  • Exploit the features for static scene classification (SUN, 397 classes) [Xiao et al., CVPR 2010]

SLIDE 17

Results: Recognition

KITTI ⟶ SUN: do ego‐motion equivariant features improve recognition? (Also evaluated: KITTI ⟶ KITTI and NORB ⟶ NORB.)

[Chart: 397‐class recognition accuracy (%) with 6 labeled training examples per class; reported values 0.25, 0.70, 1.02, 1.21, 1.58, with the proposed equivariant method highest]

Up to 30% accuracy increase over the state of the art!

*Hadsell et al., “Dimensionality Reduction by Learning an Invariant Mapping” (invariance), CVPR 2006
**Mobahi et al., “Deep Learning from Temporal Coherence in Video”, ICML 2009

SLIDE 18
Results: Active recognition

  • Leverage the proposed equivariant embedding to select the next best view for object recognition (a toy sketch follows below)

[Figure: ambiguous views labeled “cup/bowl/pan?”, resolved after moving: cup vs. frying pan]

[Chart: accuracy (%) on NORB data]

[Slide credit: Dinesh Jayaraman]
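A toy sketch of one way the learned embedding could drive view selection (my illustration; the slide states only the idea): use each equivariance map M_g to predict the feature after candidate motion g, then move where the classifier expects to be most confident.

```python
import torch
import torch.nn as nn

D, G, C = 64, 3, 10                  # feature dim, candidate motions, classes
M = nn.ModuleList(nn.Linear(D, D, bias=False) for _ in range(G))  # learned maps
clf = nn.Linear(D, C)                # learned classifier on the embedding

def next_best_view(z_x: torch.Tensor) -> int:
    """Pick the motion whose predicted post-move view looks most confident."""
    confidences = [clf(M_g(z_x)).softmax(-1).max().item() for M_g in M]
    return max(range(G), key=lambda i: confidences[i])

print(next_best_view(torch.randn(D)))  # index of the chosen ego-motion
```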

SLIDE 19

Conclusion and Future Work

  • The paper introduced a new embodied visual feature learning paradigm.
  • Ego‐motion equivariance boosts performance across multiple challenging recognition tasks.

SLIDE 20

Questions

  • Why train on KITTI rather than on some other domain?
  • Why does incorporating DRLIM improve EQUIV? Are there still temporal coherence properties left to be learned?
  • Is it meaningful to compare EQUIV or EQUIV + DRLIM with the other methods with respect to equivariance error?
