SLIDE 1

Following Gaze in Video

  • A. Recasens et al.

Presented by: Keivaun Waugh and Kapil Krishnakumar

SLIDE 2

Background

  • Given a face in one frame, how can we figure out where that person is looking?
  • The target object might not be in the same frame
SLIDE 3

Sample Results

(Figure panels: Input Video, Gaze Density, Gazed Area)

SLIDE 4

Architecture
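As a rough illustration of the multi-pathway design shown here (and referenced in the experiments and conclusions: a saliency pathway, a head-pose/gaze pathway, and a transformation pathway), below is a minimal PyTorch sketch. Every layer shape, module, and the element-wise fusion step are illustrative assumptions, not the paper's exact architecture.

```python
# A rough, illustrative sketch (NOT the authors' exact layers) of a
# three-pathway gaze-following model: saliency over the target frame,
# a gaze pathway over the head crop + head position, and a
# transformation pathway relating source and target frames.
import torch
import torch.nn as nn

class GazeFollowSketch(nn.Module):
    def __init__(self, grid=20):
        super().__init__()
        self.grid = grid
        # Saliency pathway: which regions of the target frame draw gaze?
        self.saliency = nn.Sequential(
            nn.Conv2d(3, 16, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(16, 1, 5, stride=2, padding=2),
            nn.AdaptiveAvgPool2d(grid),
        )
        # Gaze pathway: head crop + head position -> a "look cone" map.
        self.gaze = nn.Sequential(
            nn.Flatten(),
            nn.Linear(3 * 64 * 64 + 2, 256), nn.ReLU(),
            nn.Linear(256, grid * grid),
        )
        # Transformation pathway: how does the source frame map onto
        # the target frame?
        self.transform = nn.Sequential(
            nn.Flatten(),
            nn.Linear(2 * 3 * 64 * 64, 256), nn.ReLU(),
            nn.Linear(256, grid * grid),
        )

    def forward(self, target_frame, source_frame, head_crop, head_pos):
        # All frames assumed to be (B, 3, 64, 64); head_pos is (B, 2).
        sal = self.saliency(target_frame).flatten(1)             # (B, 400)
        cone = self.gaze(torch.cat([head_crop.flatten(1), head_pos], 1))
        warp = self.transform(torch.cat([source_frame, target_frame], 1))
        # Element-wise fusion into a gaze density grid (a simplification
        # of projecting the cone through the estimated transformation).
        return (sal * cone * warp).view(-1, self.grid, self.grid)

# Usage: GazeFollowSketch()(torch.rand(1, 3, 64, 64), torch.rand(1, 3, 64, 64),
#                           torch.rand(1, 3, 64, 64), torch.rand(1, 2))
```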

SLIDE 5

VideoGaze Dataset

  • 160k annotations of video frames from the MovieQA dataset

  • Annotations:

    ○ Source Frame
    ○ Head Location
    ○ Body
    ○ Target Frame (5 per source frame)
      ■ Gaze Location
      ■ Time difference between Source and Target
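To make the schema concrete, here is a hypothetical Python record type mirroring the annotation fields above; the field names and types are illustrative assumptions, not the dataset's actual file format.

```python
# Hypothetical record types mirroring the annotation schema above; field
# names and types are illustrative assumptions, not the dataset's actual
# file format.
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class TargetAnnotation:
    target_frame: int                     # frame index (5 per source frame)
    gaze_location: Tuple[float, float]    # annotated gaze point (x, y)
    time_offset: float                    # time difference from source frame

@dataclass
class SourceAnnotation:
    source_frame: int                                  # annotated frame index
    head_location: Tuple[float, float, float, float]   # head bounding box
    body_location: Tuple[float, float, float, float]   # body bounding box
    targets: List[TargetAnnotation]                    # the 5 target frames
```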

SLIDE 6

Experiments

  • Naive network architecture
    ○ Don't segment the network into different pathways
    ○ Concatenate all inputs and predict directly

  • Replace the transformation pathway with SIFT+RANSAC affine fitting
  • Various neighboring frame prediction windows
  • Examine failure cases

    ○ "Look cone" doesn't take into account the eye position
    ○ Other failures

SLIDE 7

Naive Model

SLIDE 8

Naive Architecture

  • Use fusion of target frame and source frame to predict gaze location

(Diagram: Target Frame and Source Frame through AlexNet, features concatenated, predicting a 20×20 output grid)
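A minimal sketch of how this naive baseline could be wired, assuming a shared AlexNet feature extractor whose source- and target-frame features are concatenated and mapped to a 20×20 gaze grid; the layer sizes are illustrative assumptions.

```python
# Minimal sketch of the naive baseline, assuming a shared AlexNet feature
# extractor whose source- and target-frame features are concatenated and
# mapped to a 20x20 gaze grid; layer sizes are illustrative assumptions.
import torch
import torch.nn as nn
from torchvision import models

class NaiveGazeModel(nn.Module):
    def __init__(self, grid=20):
        super().__init__()
        self.grid = grid
        # AlexNet backbone (load ImageNet weights for a real experiment).
        self.features = models.alexnet(weights=None).features
        self.pool = nn.AdaptiveAvgPool2d((6, 6))
        # Concatenated features from both frames -> flat 20x20 prediction.
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(2 * 256 * 6 * 6, 1024), nn.ReLU(),
            nn.Linear(1024, grid * grid),
        )

    def forward(self, source_frame, target_frame):
        # Frames are (B, 3, 224, 224) image tensors.
        f_src = self.pool(self.features(source_frame))
        f_tgt = self.pool(self.features(target_frame))
        heat = self.head(torch.cat([f_src, f_tgt], dim=1))
        return heat.view(-1, self.grid, self.grid)
```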

SLIDE 9

Alternate Transformation Pathway

SLIDE 10

Architecture

  • Replace deep CNN pathway with traditional SIFT+RANSAC affine warp

(Diagram: SIFT + RANSAC affine warp between source and target frames)
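A sketch of this classical replacement using standard OpenCV building blocks (SIFT keypoints, Lowe's ratio test, RANSAC affine fit); the matching and RANSAC thresholds are assumptions.

```python
# Sketch of the classical alternative: estimate an affine warp between the
# source and target frames with SIFT keypoints + RANSAC, using standard
# OpenCV calls. Thresholds (0.75 ratio, 3.0 px reprojection) are assumptions.
import cv2
import numpy as np

def estimate_affine_warp(source_gray, target_gray):
    """Return a 2x3 affine matrix mapping source -> target, or None."""
    sift = cv2.SIFT_create()
    kp1, des1 = sift.detectAndCompute(source_gray, None)
    kp2, des2 = sift.detectAndCompute(target_gray, None)
    if des1 is None or des2 is None:
        return None  # not enough texture to detect keypoints

    # Match descriptors and keep matches that pass Lowe's ratio test.
    matches = cv2.BFMatcher(cv2.NORM_L2).knnMatch(des1, des2, k=2)
    good = [pair[0] for pair in matches
            if len(pair) == 2 and pair[0].distance < 0.75 * pair[1].distance]
    if len(good) < 3:
        return None  # an affine fit needs at least 3 correspondences

    src_pts = np.float32([kp1[m.queryIdx].pt for m in good])
    dst_pts = np.float32([kp2[m.trainIdx].pt for m in good])
    # RANSAC rejects outlier matches while fitting the affine model.
    warp, _inliers = cv2.estimateAffine2D(
        src_pts, dst_pts, method=cv2.RANSAC, ransacReprojThreshold=3.0)
    return warp
```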

SLIDE 11

Quantitative Results

SLIDE 12

Results

| AUC (higher better) | KL Divergence (lower better) | L2 Dist (lower better) | Description |
| --- | --- | --- | --- |
| 73.7 | 8.048 | 0.225 | Normal model with transformation pathway |
| 60.2 | 6.604 | 0.294 | Normal model with sparse affine |
| 60.2 | 6.6604 | 0.294 | Normal model with dense affine |
| 60.9 | 6.641 | 0.242 | Naive model |
| 56.9 | 28.39 | 0.437 | Random |

SLIDE 13

Qualitative Results

SLIDE 14

Results

(Figure panels: Cropped Head, Full Video, "What I'm looking at")

  • Input video is 150 frames long

SLIDE 15

Results - Search 150 Neighboring Frames

(Figure panels: Original Transformation Pathway, Naive Model)

SLIDE 16

Results - Search 150 Neighboring Frames

(Figure panels: Sparse SIFT Affine Warp, Dense SIFT Affine Warp)

SLIDE 17

Results - Search 25 Neighboring Frames

(Figure panels: Original Transformation Pathway, Naive Model)

SLIDE 18

Results - Search 25 Neighboring Frames

(Figure panels: Sparse SIFT Affine Warp, Dense SIFT Affine Warp)

SLIDE 19

Target in Same Frame

(Figure panels: Original Video, Original Transformation Pathway, Naive Model)

SLIDE 20

Target in Same Frame

(Figure panels: Sparse SIFT Affine Warp, Dense SIFT Affine Warp)

SLIDE 21

Runtimes

  • GTX 1070 and Haswell Core i5
  • Generating results is CPU bound
  • 5-second video with a 150-frame search width
    ○ Deep transformation pathway: 6.5 minutes
    ○ Sparse affine: 10.5 minutes
    ○ Dense affine: 32 minutes

(Chart: CPU Usage 100%, GPU Usage 0% when running the model with the transformation pathway)

SLIDE 22

Failure Cases

(Figure panels: Input Video, Original Transformation Pathway)

SLIDE 23

Failure Cases

(Figure panels: Input Video, Original Transformation Pathway)

SLIDE 24

Conclusions

  • Separating input modalities for Saliency and Head Pose provides significant information to the model.
    ○ Illustrates the importance of a hand-crafted architecture even though features are discovered automatically

  • Head Direction != Eye Direction
  • Frame prediction window selection determines whether a match can be found.