SLIDE 1

Action and Attention in First-Person Vision

Kristen Grauman, Department of Computer Science, University of Texas at Austin

With Dinesh Jayaraman, Yong Jae Lee, Yu-Chuan Su, Bo Xiong, Lu Zheng

SLIDE 2

Steve Mann: ~1990 vs. 2015

SLIDE 3

New era for first-person vision

Augmented reality, health monitoring, science, robotics, law enforcement, life logging

Infant figure from Linda Smith, et al.

SLIDE 4

First person vs. Third person

Traditional third-person view vs. first-person view (UT TEA dataset)

SLIDE 5

First person vs. Third person

Traditional third-person view vs. first-person view
(UT Interaction and JPL First-Person Interaction datasets)

First-person “egocentric” vision:

  • Linked to the ongoing experience of the camera wearer
  • World seen in the context of the camera wearer’s activity and goals

SLIDE 6

Recent egocentric work

  • Activity and object recognition

[Spriggs et al. 2009, Ren & Gu 2010, Fathi et al. 2011, Kitani et al. 2011, Pirsiavash & Ramanan 2012, McCandless & Grauman 2013, Ryoo & Matthies 2013, Poleg et al. 2014, Damen et al. 2014, Behera et al. 2014, Li et al. 2015, Yonetani et al. 2015, …]

  • Gaze and social cues

[Yamada et al. 2011, Fathi et al. 2012, Park et al. 2012, Li et al. 2013, Arev et al. 2014, Leelasawassuk et al. 2015,…]

  • Visualization, stabilization

[Kopf et al. 2014, Poleg et al. 2015]

SLIDE 7

Talk overview

Motivation

Account for the fact that the camera wearer is an active participant in the visual observations received

Ideas

  1. Action: Unsupervised feature learning
     How is visual learning shaped by ego-motion?
  2. Attention: Inferring highlights in video
     How to summarize long egocentric video?

SLIDE 8

Visual recognition

  • Recent major strides in category recognition
  • Facilitated by large labeled datasets

80M Tiny Images [Torralba et al.], ImageNet [Deng et al.], SUN Database [Xiao et al.]

[Papageorgiou & Poggio 1998, Viola & Jones 2001, Dalal & Triggs 2005, Grauman & Darrell 2005, Lazebnik et al. 2006, Felzenszwalb et al. 2008, Krizhevsky et al. 2012, Russakovsky IJCV 2015, …]

SLIDE 9

Problem with today’s visual learning

  • Status quo: learn from a “disembodied” bag of labeled snapshots
  • …yet visual perception develops in the context of acting and moving in the world

SLIDE 10

Key to perceptual development: Self-generated motions + visual feedback

The kitten carousel experiment [Held & Hein, 1963]: active kitten vs. passive kitten

SLIDE 11

Our idea: Feature learning with ego-motion

Goal: Learn the connection between “how I move” ↔ “how visual surroundings change”

Approach: Unsupervised feature learning using motor signals accompanying egocentric video

[Jayaraman & Grauman, ICCV 2015]

SLIDE 12

Key idea: Egomotion equivariance

Invariant features: unresponsive to some classes of transformations

Invariance discards information, whereas equivariance organizes it.

Equivariant features: predictably responsive to some classes of transformations, through simple mappings (e.g., linear), given by an “equivariance map”
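In symbols, a brief sketch of this notion (my paraphrase of the standard formulation; the notation z, g, and M_g is assumed rather than taken from the slide):

```latex
% Equivariance: features of the transformed view follow from the original
% features through a simple (here linear) map M_g, one per ego-motion g.
z(g \cdot x) \;\approx\; M_g\, z(x) \qquad \text{for every view } x \text{ and ego-motion } g
% Invariance is the special case M_g = I, which discards the transformation.
```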

SLIDE 13

Key idea: Egomotion equivariance

Equivariant embedding: organized by egomotions

Training data = unlabeled video + motor signals

Pairs of frames related by a similar ego-motion should be related by the same feature transformation

SLIDE 14

Key idea: Egomotion equivariance

Equivariant embedding: organized by egomotions

Training data = unlabeled video + motor signals

SLIDE 15

Approach

Diagram: observed image pairs + ego-motor signals → deep learning architecture → output embedding

[Jayaraman & Grauman, ICCV 2015]

SLIDE 16

Approach

Diagram: observed image pairs + ego-motor signals → deep learning architecture → output embedding

“Active”: exploit knowledge of observer motion

[Jayaraman & Grauman, ICCV 2015]

SLIDE 17

Learning equivariance

Diagram: unlabeled video frame pairs (with the ego-motion data stream) and class-labeled images feed replicated network layers

[Jayaraman & Grauman, ICCV 2015]

SLIDE 18

Learning equivariance

Diagram: unlabeled video frame pairs (with the ego-motion data stream) and class-labeled images feed replicated network layers

Embedding objective:
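The embedding objective appears as an equation on the original slide. Below is a minimal PyTorch-style sketch of a pairwise equivariance loss in that spirit; it is a paraphrase rather than the paper's exact formulation, and the encoder architecture, layer sizes, and the discretization of ego-motions into a few classes are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EquivariantEmbedding(nn.Module):
    """Toy feature extractor z(.) plus one linear "equivariance map" M_g per
    discretized ego-motion g (a sketch, not the published architecture)."""
    def __init__(self, feat_dim=128, num_motions=4):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(32, 64, 5, stride=2, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4), nn.Flatten(),
            nn.Linear(64 * 16, feat_dim),
        )
        # One equivariance map M_g per ego-motion class
        self.maps = nn.ModuleList(nn.Linear(feat_dim, feat_dim, bias=False)
                                  for _ in range(num_motions))

    def forward(self, x):
        return self.encoder(x)

def equivariance_loss(model, frame_a, frame_b, motion_id):
    """Frame pairs related by ego-motion g should satisfy M_g z(a) ~= z(b)."""
    za, zb = model(frame_a), model(frame_b)
    pred = torch.stack([model.maps[g](za[i]) for i, g in enumerate(motion_id.tolist())])
    return F.mse_loss(pred, zb)

# One unsupervised step on a dummy batch of frame pairs with ego-motion labels;
# the full system would add a supervised softmax loss on class-labeled images.
model = EquivariantEmbedding()
a, b = torch.randn(8, 3, 64, 64), torch.randn(8, 3, 64, 64)
g = torch.randint(0, 4, (8,))
equivariance_loss(model, a, b, g).backward()
```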

[Jayaraman & Grauman, ICCV 2015]

SLIDE 19

Datasets

  • KITTI video [Geiger et al. 2012]: autonomous car platform; egomotions are yaw and forward distance
  • SUN images [Xiao et al. 2010]: large-scale scene classification task
  • NORB images [LeCun et al. 2004]: toy recognition; egomotions are elevation and azimuth

SLIDE 20

Results: Equivariance check

Visualizing how well equivariance is preserved

[Jayaraman & Grauman, ICCV 2015]

Figure: a query pair with a “left” egomotion, its nearest neighbor pair in our feature space (also “left”), and the pixel-space nearest neighbor pair (“zoom”)

SLIDE 21

Results: Equivariance check

Visualizing how well equivariance is preserved

[Jayaraman & Grauman, ICCV 2015]

Figure: query pair, neighbor pair (ours), and pixel-space neighbor pair, each labeled with its egomotion (“right” / “left”)

SLIDE 22

Results: Recognition

Learn from autonomous car video (KITTI); exploit the features for large multi-way scene classification (SUN, 397 classes)

30% accuracy increase for small labeled training sets

[Jayaraman & Grauman, ICCV 2015]

SLIDE 23

Results: Recognition

Do the learned features boost recognition accuracy?

Chart: recognition accuracy with 6 labeled training examples per class, in 25-class and 397-class settings

(Baselines: *Mobahi et al. ICML 2009; **Hadsell et al. CVPR 2006)

SLIDE 24

Results: Active recognition

Leverage the proposed equivariant embedding to predict the next best view for object recognition
[Schiele & Crowley 1998, Dickinson et al. 1997, Soatto 2009, Mishra et al. 2009, …]

Chart: accuracy on the NORB dataset

SLIDE 25

Next steps

  • Dynamic objects
  • Multiple modalities, e.g., depth
  • Active ego-motion planning
  • Tasks aside from recognition
SLIDE 26

Talk overview

Motivation

Account for the fact that the camera wearer is an active participant in the visual observations received

Ideas

  1. Action: Unsupervised feature learning
     How is visual learning shaped by ego-motion?
  2. Attention: Inferring highlights in video
     How to summarize long egocentric video?

SLIDE 27

Goal: Summarize egocentric video

Input: egocentric video of the camera wearer’s day, captured with a wearable camera (e.g., 9:00 am through 2:00 pm)

Output: storyboard (or video skim) summary

SLIDE 28

Potential applications of egocentric video summarization

RHex Hexapedal Robot, Penn's GRASP Laboratory

Law enforcement, memory aid, mobile robot discovery

SLIDE 29

Prior work: Video summarization

  • Largely third-person

– Static cameras, low-level cues informative

  • Consider summarization as a sampling problem

[Wolf 1996, Zhang et al. 1997, Ngo et al. 2003, Goldman et al. 2006, Caspi et al. 2006, Pritch et al. 2007, Laganiere et al. 2008, Liu et al. 2010, Nam & Tewfik 2002, Ellouze et al. 2010,…]

SLIDE 30

What makes egocentric data hard to summarize?

  • Subtle event boundaries
  • Subtle figure/ground
  • Long streams of data

SLIDE 31

Summarizing egocentric video

Key questions

– What objects are important, and how are they linked?
– When is attention heightened?
– Which frames look “intentional”?

SLIDE 32

Goal: Story-driven summarization

Characters and plot ↔ Key objects and influence

[Lu & Grauman, CVPR 2013]

SLIDE 33

Goal: Story-driven summarization

Characters and plot ↔ Key objects and influence

[Lu & Grauman, CVPR 2013]

SLIDE 34

Summarization as subshot selection

Good summary = chain of k selected subshots in which each influences the next via some subset of key objects

Objective terms: influence, importance, diversity (over candidate subshots)

[Lu & Grauman, CVPR 2013]

SLIDE 35

Estimating visual influence

  • Aim to select the k subshots that maximize the influence between objects (on the weakest link)

[Lu & Grauman, CVPR 2013]

SLIDE 36

Estimating visual influence

Bipartite graph: subshots and objects (or words), with a sink node

Captures how reachable subshot j is from subshot i, via any object o

[Lu & Grauman, CVPR 2013]
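As a rough illustration of the selection step (not the published optimization), the sketch below picks a temporal chain of k subshots whose weakest consecutive influence link is maximized, assuming a precomputed pairwise influence matrix; the function and variable names are mine.

```python
import numpy as np

def select_story_chain(influence, k):
    """Pick k subshots in temporal order maximizing the weakest influence link
    between consecutive picks; influence[i, j] is a stand-in for the object-based
    reachability of subshot j from subshot i."""
    n = influence.shape[0]
    best = np.full((k + 1, n), -np.inf)   # best[c, j]: weakest link of the best length-c chain ending at j
    parent = np.full((k + 1, n), -1, dtype=int)
    best[1, :] = np.inf                   # a single subshot has no link yet
    for c in range(2, k + 1):
        for j in range(n):
            for i in range(j):            # only earlier subshots can influence j
                cand = min(best[c - 1, i], influence[i, j])
                if cand > best[c, j]:
                    best[c, j], parent[c, j] = cand, i
    end = int(np.argmax(best[k]))
    score = float(best[k, end])
    chain = [end]
    for c in range(k, 1, -1):             # backtrack through parents
        end = parent[c, end]
        chain.append(end)
    return chain[::-1], score

# Toy usage: 6 subshots, pick a 3-subshot chain
infl = np.random.default_rng(0).random((6, 6))
print(select_story_chain(infl, k=3))
```

The full objective on the preceding slides also weighs importance and diversity of the selected subshots; this sketch isolates only the weakest-link influence term.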

SLIDE 37

Learning object importance

We learn to rate regions by their egocentric importance

Cues: distance to hand, distance to frame center, frequency

[Lee et al. CVPR 2012, IJCV 2015]

SLIDE 38

Learning object importance

We learn to rate regions by their egocentric importance

Features per candidate region:
  • Distance to hand, distance to frame center, frequency
  • Region size, width, height, centroid
  • Candidate region’s appearance and motion
  • Surrounding area’s appearance and motion
  • “Object-like” appearance and motion [Endres et al. ECCV 2010, Lee et al. ICCV 2011]
  • Overlap with face detection

[Lee et al. CVPR 2012, IJCV 2015]
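To make the learning step concrete, here is a hedged sketch in which an off-the-shelf classifier rates candidate regions from a vector of egocentric cues; the feature layout, toy data, and choice of model are placeholders rather than the actual pipeline of Lee et al.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def region_features(dist_to_hand, dist_to_center, frequency, size, objectness, face_overlap):
    """Placeholder cue vector for one candidate region (a real system would also
    include appearance and motion descriptors of the region and its surroundings)."""
    return np.array([dist_to_hand, dist_to_center, frequency, size, objectness, face_overlap])

# Toy training data: rows are regions, label 1 means "important to the camera wearer"
X = np.array([
    region_features(0.1, 0.2, 0.8, 0.30, 0.9, 0.0),
    region_features(0.9, 0.8, 0.1, 0.05, 0.2, 0.0),
    region_features(0.2, 0.1, 0.6, 0.25, 0.8, 1.0),
    region_features(0.8, 0.9, 0.2, 0.10, 0.3, 0.0),
])
y = np.array([1, 0, 1, 0])
clf = LogisticRegression().fit(X, y)

# At test time, score every candidate region in a frame and keep the top-rated ones
new_region = region_features(0.15, 0.25, 0.7, 0.28, 0.85, 0.0).reshape(1, -1)
print("importance score:", clf.predict_proba(new_region)[0, 1])
```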

SLIDE 39

Datasets

  • UT Egocentric (UT Ego) [Lee et al. 2012]: 4 videos, each 3-5 hours long, uncontrolled setting. We use visual words and subshots.
  • Activities of Daily Living (ADL) [Pirsiavash & Ramanan 2012]: 20 videos, each 20-60 minutes, daily activities in the house. We use object bounding boxes and keyframes.

SLIDE 40

Example keyframe summary – UT Ego data

Figure: original video (3 hours) condensed to our 12-frame summary

[Lee et al. CVPR 2012, IJCV 2015]

http://vision.cs.utexas.edu/projects/egocentric/

SLIDE 41

Example keyframe summary – UT Ego data

Alternative methods for comparison: uniform keyframe sampling (12 frames) and [Liu & Kender, 2002] (12 frames)

[Lee et al. CVPR 2012, IJCV 2015]

SLIDE 42

Generating storyboard maps

Augment keyframe summary with geolocations

[Lee et al. CVPR 2012, IJCV 2015]

SLIDE 43

Human subject results: Blind taste test

How often do subjects prefer our summary?

Data                          Vs. Uniform sampling   Vs. Shortest-path   Vs. Object-driven [Lee et al. 2012]
UT Egocentric Dataset         90.0%                  90.9%               81.8%
Activities of Daily Living    75.7%                  94.6%               N/A

34 human subjects, ages 18-60; 12 hours of original video; each comparison done by 5 subjects; 535 tasks in total, 45 hours of subject time

[Lu & Grauman, CVPR 2013]

SLIDE 44

Summarizing egocentric video

Key questions

– What objects are important, and how are they linked?
– When is attention heightened?
– Which frames look “intentional”?

SLIDE 45

Goal: Detect heightened attention

Definition: a time interval where the recorder is attracted by some object(s) and interrupts the ongoing flow of activity to purposefully gather more information about the object(s)

[Su & Grauman, 2015]

SLIDE 46

Temporal Ego-Attention Dataset

  • “Browsing” scenarios, long & natural clips
  • 14 hours of video, 9 recorders
  • Frame-level labels x 10 annotators

14 hours of labeled ego video

[Su & Grauman, 2015]

SLIDE 47

Challenges in temporal attention

  • Interesting things vary in appearance!
  • Attention ≠ stationary
  • High attention intervals vary in length
  • Lack cues of active camera control

[Su & Grauman, 2015]

SLIDE 48

Our approach

Learn motion patterns indicative of attention

[Su & Grauman, 2015]
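A minimal sketch of that idea: classify each frame as high or low attention from simple camera-motion statistics pooled over a temporal window. The flow-magnitude features, window size, and classifier are assumptions for illustration, not the actual method of Su & Grauman.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def motion_descriptor(flow_mag, window=15):
    """Per-frame summary of global flow magnitude over a trailing window (mean, std, min);
    the sketch assumes heightened attention comes with reduced, steadier camera motion."""
    feats = []
    for t in range(len(flow_mag)):
        w = flow_mag[max(0, t - window):t + 1]
        feats.append([np.mean(w), np.std(w), np.min(w)])
    return np.array(feats)

# Toy data: per-frame flow magnitude with an attention interval in the middle
rng = np.random.default_rng(1)
flow = np.concatenate([rng.normal(5, 1, 200),     # walking / browsing
                       rng.normal(1, 0.3, 100),   # pausing to inspect an object
                       rng.normal(5, 1, 200)])
labels = np.concatenate([np.zeros(200), np.ones(100), np.zeros(200)]).astype(int)

X = motion_descriptor(flow)
clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, labels)
print("predicted high-attention frames:", int(clf.predict(X).sum()))
```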

SLIDE 49

Results: detecting temporal attention

Blue = ground truth, red = predicted

[Su & Grauman, 2015]

SLIDE 50
Results: detecting temporal attention

  • 14 hours of video, 9 recorders

[Su & Grauman, 2015]

SLIDE 51

Summarizing egocentric video

Key questions

– What objects are important, and how are they linked?
– When is attention heightened?
– Which frames look “intentional”?

SLIDE 52

Which photos were purposely taken by a human?

Intentional human taken photos Incidental wearable camera photos

[Xiong & Grauman, ECCV 2014]

SLIDE 53

Idea: Detect “snap points”

  • Unsupervised, data-driven approach to detect frames in first-person video that look intentional

Pipeline: web prior → domain-adapted similarity → snap point score

[Xiong & Grauman, ECCV 2014]
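A simplified sketch of the data-driven scoring: rate each egocentric frame by how similar it looks to a bank of intentionally composed web photos. The tiny histogram descriptor and kernel similarity below are illustrative stand-ins for the domain-adapted similarity used in the paper.

```python
import numpy as np

def frame_descriptor(image):
    """Placeholder global descriptor (a small intensity histogram); a real system
    would use a much stronger image feature."""
    hist, _ = np.histogram(image, bins=32, range=(0.0, 1.0), density=True)
    return hist

def snap_point_scores(ego_frames, web_photos, bandwidth=0.5):
    """Score each egocentric frame by its average kernel similarity to web photos;
    higher scores mean the frame looks more "intentional"."""
    web_feats = np.stack([frame_descriptor(p) for p in web_photos])
    scores = []
    for f in ego_frames:
        d = np.linalg.norm(web_feats - frame_descriptor(f), axis=1)
        scores.append(np.mean(np.exp(-(d ** 2) / (2 * bandwidth ** 2))))
    return np.array(scores)

# Toy usage: 5 egocentric frames scored against a prior of 20 web photos (random stand-ins)
rng = np.random.default_rng(2)
print(snap_point_scores(rng.random((5, 64, 64)), rng.random((20, 64, 64))))
```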

SLIDE 54

Example snap point predictions

[Xiong & Grauman, ECCV 2014]

SLIDE 55

Snap point predictions

[Xiong & Grauman, ECCV 2014]

SLIDE 56

Next steps

  • Video summary as an index for search
  • Streaming computation
  • Visualization, display
  • Multiple modalities – e.g., audio

SLIDE 57

Summary

  • New opportunities with “always on” ego cameras
  • Towards active first-person vision:

– Action: “Embodied” feature learning from ego-video using both visual and motor signals
– Attention: Egocentric summarization tools to cope with the deluge of wearable camera data

Dinesh Jayaraman, Yong Jae Lee, Yu-Chuan Su, Bo Xiong, Lu Zheng