Actions in the Eye: Dynamic Gaze Datasets and Learnt Saliency Models for Visual Recognition



SLIDE 1

Actions in the Eye: Dynamic Gaze Datasets and Learnt Saliency Models for Visual Recognition

Stefan Mathe, Cristian Sminchisescu

Presented by Mit Shah

SLIDE 2

Motivation…

  • Current Computer Vision

○ Annotations subjectively defined
○ What happens at intermediate levels of computation?

SLIDE 3

Motivation…

  • Lack of large-scale datasets that provide recordings of the workings of the human visual system

SLIDE 4

Previous Work...

  • Study of Gaze patterns in Humans


A person browsing reddit with the F-shaped pattern

SLIDE 5

Previous Work...

  • Study of Gaze patterns in Humans

○ Inter-observer consistency

SLIDE 6

Previous Work...

  • Study of Gaze patterns in Humans

○ Inter-observer consistency
○ Bottom-up features

SLIDE 7

Previous Work...

  • Study of Gaze patterns in Humans

○ Inter-observer consistency
○ Bottom-up features
○ Human fixations

SLIDE 8

Previous Work...

  • Study of Gaze patterns in Humans

○ Inter-observer consistency
○ Bottom-up features
○ Human fixations
○ Models of saliency

SLIDE 9

Previous Work...

  • Study of Gaze patterns in Humans

○ Inter-observer consistency
○ Bottom-up features
○ Human fixations
○ Models of saliency
○ Uses of saliency maps


Action recognition, scene classification, object localization

SLIDE 10

Previous Work...

  • Study of Gaze patterns in Humans

○ Inter-observer consistency
○ Bottom-up features
○ Human fixations
○ Models of saliency
○ Uses of saliency maps
○ Previous datasets


At most a few hundred videos, recorded under free-viewing conditions

SLIDE 11

Contributions... (1)

❏ Extended the existing large-scale datasets Hollywood-2 and UCF Sports with human gaze recordings

SLIDE 12

Contributions... (2)

❏ Dynamic consistency and alignment measures


○ AOI Markov dynamics
○ Temporal AOI alignment

SLIDE 13

Contributions... (3)

❏ Trained an end-to-end automatic visual action recognition system

SLIDE 14

Data Collection...

Hollywood-2 Movie Dataset


  • 12 action classes: answering phone, driving a car, eating, fighting, etc.
  • 69 movies, 823/884 train/test split
  • 487k frames, ~20 hours of video
  • Largest and most challenging dataset

SLIDE 15

Data Collection...

UCF Sports Action Dataset


  • 150 videos from television broadcasts
  • 9 sports action classes: diving, golf swinging, kicking, etc.

SLIDE 16

Data Collection...

Extending the two data sets


  • 19 human subjects, divided into 3 tasks: action recognition, context recognition, and free viewing
  • SMI iView X HiSpeed 1250 tower-mounted eye tracker
  • Controlled recording environment and protocol, with many other specifications (timings/durations & breaks)

SLIDE 17

Static & Dynamic Consistency

Action Recognition by Humans

  • Goal & Importance
  • Human errors

○ Co-occurring actions
○ False positives
○ Mislabeled videos


SLIDE 18

Static Consistency Among Subjects

  • How well do the regions fixated by human subjects agree on a frame-by-frame basis?


  • Evaluation Protocol
SLIDE 19

Static Consistency Among Subjects


SLIDE 20


The Influence of Task on Eye Movements


For each subject s in task group SA, derive saliency maps from SA \ {s} and predict the fixations of s, giving nA within-task prediction scores; derive saliency maps from all of SA and predict the fixations of each subject s' in the other task group SB, giving nB cross-task prediction scores. Compare the two sets of scores with an independent 2-sample t-test with unequal variances; the hypothesis that the task does not influence eye movements is retained when the p-value is >= 0.05.
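The comparison above is Welch's test; `scipy.stats.ttest_ind(..., equal_var=False)` computes it directly, but a minimal self-contained sketch (assuming the two groups' prediction scores are given as plain lists of floats) looks like this:

```python
import math

def welch_t_test(a, b):
    """Independent two-sample t-test with unequal variances (Welch's test).

    Returns the t statistic and the degrees of freedom from the
    Welch-Satterthwaite approximation; the p-value is then read off
    a t-distribution with that many degrees of freedom.
    """
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    # Unbiased sample variances
    va = sum((x - ma) ** 2 for x in a) / (na - 1)
    vb = sum((x - mb) ** 2 for x in b) / (nb - 1)
    se2 = va / na + vb / nb  # squared standard error of the mean difference
    t = (ma - mb) / math.sqrt(se2)
    # Welch-Satterthwaite degrees of freedom
    dof = se2 ** 2 / ((va / na) ** 2 / (na - 1) + (vb / nb) ** 2 / (nb - 1))
    return t, dof
```

Here `a` would hold the nA within-task scores and `b` the nB cross-task scores.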

SLIDE 21

The Influence of Task on Eye Movements

Results -


SLIDE 22

Dynamic Consistency Among Subjects

  • Spatial distribution of fixations: highly consistent
  • Is there significant consistency in their temporal order as well?
  • Automatic discovery of AOIs & two metrics:

○ AOI Markov dynamics
○ Temporal AOI alignment

SLIDE 23

Scanpath representation

  • Human fixations are tightly clustered
  • Assign each fixation to the closest AOI
  • Trace the scanpath as a string of AOI labels

SLIDE 24

Automatically Finding AOIs

  • Clustering the fixations of all subjects in a frame


1. Start k-means with 1 cluster and successively increase the number of clusters until the sum of squared errors drops below a threshold.
2. Link centroids from successive frames into tracks.
3. Each resulting track becomes an AOI.
4. Each fixation is assigned to the closest AOI at the time of its creation.
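The per-frame clustering step can be sketched as below, assuming 2-D fixation points and a deterministic k-means seeding for illustration (the paper's exact initialization and threshold are not given here); linking centroids across frames into tracks is omitted:

```python
def dist2(p, q):
    """Squared Euclidean distance between two 2-D points."""
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2

def kmeans(points, k, iters=20):
    """Plain Lloyd's k-means, seeded with the first k points so the
    example is deterministic. Returns (centroids, sum of squared errors)."""
    centroids = [points[i] for i in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda c: dist2(p, centroids[c]))
            clusters[j].append(p)
        for j, cl in enumerate(clusters):
            if cl:  # keep an empty cluster's centroid unchanged
                centroids[j] = (sum(p[0] for p in cl) / len(cl),
                                sum(p[1] for p in cl) / len(cl))
    sse = sum(min(dist2(p, c) for c in centroids) for p in points)
    return centroids, sse

def find_aois(points, sse_threshold):
    """Grow the number of clusters from 1 until the SSE drops below the
    threshold; the final centroids are the frame's AOI candidates."""
    for k in range(1, len(points) + 1):
        centroids, sse = kmeans(points, k)
        if sse <= sse_threshold:
            return centroids
    return list(points)
```

With two well-separated fixation blobs, the loop stops at k = 2 and returns one centroid per blob.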

SLIDE 25

Automatically Finding AOIs


SLIDE 26

AOI Markov Dynamics

  • Modeling transitions of human visual attention between AOIs:


The probability of transitioning to AOI b at time t, given a fixation at AOI a at time t-1, estimated from the human fixation strings f_i.
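These transition probabilities can be estimated by counting AOI bigrams over the subjects' fixation strings; a minimal sketch, assuming each string is a sequence of AOI labels:

```python
from collections import defaultdict

def transition_matrix(fixation_strings):
    """Estimate P(b | a): the probability that attention moves to AOI b
    at time t given it fixated AOI a at time t-1, from a collection of
    per-subject fixation strings (sequences of AOI labels)."""
    counts = defaultdict(lambda: defaultdict(int))
    for f in fixation_strings:
        for a, b in zip(f, f[1:]):  # consecutive (t-1, t) pairs
            counts[a][b] += 1
    # Normalize each row of counts into a probability distribution
    return {a: {b: n / sum(row.values()) for b, n in row.items()}
            for a, row in counts.items()}
```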

SLIDE 27

Temporal AOI Alignment

  • Based on the Longest Common Subsequence (LCS)


  • Able to handle gaps and missing elements
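A standard dynamic-programming LCS over two AOI fixation strings, plus a hypothetical normalized score (the paper's exact normalization may differ from the one assumed here):

```python
def lcs_length(f1, f2):
    """Length of the longest common subsequence of two AOI fixation
    strings. A subsequence tolerates gaps and missing elements, unlike
    a substring, which is what makes it suitable for noisy scanpaths."""
    m, n = len(f1), len(f2)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            dp[i + 1][j + 1] = (dp[i][j] + 1 if f1[i] == f2[j]
                                else max(dp[i][j + 1], dp[i + 1][j]))
    return dp[m][n]

def alignment_score(f1, f2):
    """Normalized temporal alignment in [0, 1] (normalization by the
    longer string is an assumption for illustration)."""
    if not f1 or not f2:
        return 0.0
    return lcs_length(f1, f2) / max(len(f1), len(f2))
```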
SLIDE 28

Evaluation Pipeline


Pipeline stages: interest point operator, descriptor, visual dictionary, classifiers

  • Interest point operator: input a video, output a set of spatio-temporal coordinates
  • Descriptors: space-time generalization of HoG, and MBH computed from optical flow
  • Visual dictionary: cluster descriptors into 4000 visual words using k-means
  • Classifiers: RBF-χ² kernel in a Multiple Kernel Learning (MKL) framework
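The classifier stage's kernel on bag-of-words histograms can be sketched as an RBF-chi-squared kernel (gamma is a free parameter in this sketch; in practice it is often set from the mean chi-squared distance over the training set):

```python
import math

def chi2_rbf_kernel(x, y, gamma=1.0):
    """RBF-chi-squared kernel between two bag-of-words histograms:
    K(x, y) = exp(-gamma * sum_i (x_i - y_i)^2 / (x_i + y_i)),
    skipping bins where both histograms are zero."""
    d = sum((xi - yi) ** 2 / (xi + yi)
            for xi, yi in zip(x, y) if xi + yi > 0)
    return math.exp(-gamma * d)
```

Identical histograms give K = 1, and the value decays toward 0 as the histograms diverge.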

SLIDE 29

Human Fixation Studies

Human vs. Computer Vision Operators

  • Fixations as interest point detector


  • Findings

○ Low correlation between human fixations and computer vision interest point operators
○ Why?

SLIDE 30

Impact of Human Saliency Maps for Computer Visual Action Recognition

Saliency maps encoding only the weak surface structure of fixations (no time ordering) can be used to boost the accuracy of contemporary methods.

SLIDE 31

Saliency Map Prediction


  • Static features
  • Motion features
  • Evaluation: AUC & spatial KL divergence
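The spatial KL divergence between a predicted saliency map and a human fixation map can be sketched as follows (maps as 2-D grids of non-negative values, normalized to distributions; the smoothing constant eps is an assumption to avoid log(0)):

```python
import math

def spatial_kl(p, q, eps=1e-12):
    """KL divergence KL(p || q) between two saliency maps, each given
    as a 2-D grid of non-negative values; both maps are normalized to
    probability distributions before comparison."""
    flat_p = [v for row in p for v in row]
    flat_q = [v for row in q for v in row]
    sp, sq = sum(flat_p), sum(flat_q)
    return sum((pi / sp) * math.log((pi / sp + eps) / (qi / sq + eps))
               for pi, qi in zip(flat_p, flat_q) if pi > 0)
```

The divergence is 0 for identical maps and grows as the predicted map drifts from the human one.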

SLIDE 32


Automatic Visual Action Recognition

SLIDE 33

Conclusions

  • Combining human + computer vision
  • Extending the Hollywood-2 and UCF Sports datasets
  • Evaluating static & dynamic consistency
  • Human fixations -> saliency maps
  • End-to-end action recognition system


SLIDE 34

Thanks!
