SLIDE 1

Social Interactions: A First-Person Perspective.

  • A. Fathi, J. Hodgins, J. Rehg

Presented by Jacob Menashe, November 16, 2012

SLIDES 2–5

Social Interaction Detection

Objective: Detect social interactions from video footage.

◮ Consider faces and attention
◮ Account for temporal context
◮ Analyze first-person movement cues

SLIDE 6

◮ Introduction
◮ Overview
◮ Features
◮ Temporal Context
◮ Experiments

SLIDE 7

Video Example

Label color key:

◮ Red: Dialogue
◮ Yellow: Walking Dialogue
◮ Green: Discussion
◮ Light Blue: Walking Discussion
◮ Dark Blue: Monologue
◮ None: Background

SLIDES 8–15

Features

Features are constructed based on first- and third-person information.

  • 1. Dense optical flow (first-person movement).
  • 2. Face locations (relative to the first person).
  • 3. Attention and roles. For each person x:

◮ Faces looking at x
◮ Whether the first person looks at x
◮ Mutual attention between x and the first person
◮ Number of faces looking at where x is looking
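The attention cues above can be sketched numerically. The snippet below is an illustrative sketch, not the authors' implementation: face positions, per-face unit gaze directions, and the first person's gaze direction are assumed given as 2-D vectors in the egocentric frame, and a dot-product threshold decides whether one party "looks at" another.

```python
import numpy as np

def attention_features(x, faces, gaze, fp_gaze, cos_thresh=0.9):
    """Attention features for person x (illustrative sketch).

    faces:   (N, 2) face positions in the egocentric frame
    gaze:    (N, 2) unit gaze directions, one per face
    fp_gaze: (2,)   the first person's unit gaze direction
    """
    def looks_at(src_pos, src_dir, tgt_pos):
        to_tgt = tgt_pos - src_pos
        n = np.linalg.norm(to_tgt)
        return n > 0 and float(np.dot(src_dir, to_tgt / n)) > cos_thresh

    origin = np.zeros(2)  # the first person sits at the egocentric origin
    others = [i for i in range(len(faces)) if i != x]
    faces_looking_at_x = sum(looks_at(faces[i], gaze[i], faces[x])
                             for i in others)
    fp_looks_at_x = looks_at(origin, fp_gaze, faces[x])
    mutual = fp_looks_at_x and looks_at(faces[x], gaze[x], origin)
    # Faces whose gaze direction roughly agrees with x's gaze direction.
    shared_focus = sum(float(np.dot(gaze[i], gaze[x])) > cos_thresh
                       for i in others)
    return [faces_looking_at_x, int(fp_looks_at_x), int(mutual), shared_focus]
```

Each feature is a small count or indicator, so a fixed-length vector per person falls out directly.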

SLIDE 16

Feature Example
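The dense optical flow feature from the previous slide can be illustrated with a crude numpy-only block-matching estimator. This is a sketch of the idea only; real pipelines use a proper dense flow algorithm such as Farneback's. Each interior pixel is assigned the displacement whose patch matches best between consecutive frames.

```python
import numpy as np

def dense_block_flow(prev, nxt, patch=3, max_disp=2):
    """Per-pixel displacement via exhaustive block matching (toy sketch)."""
    H, W = prev.shape
    r = patch // 2
    m = r + max_disp                      # margin where search fits in-frame
    flow = np.zeros((H, W, 2))
    for y in range(m, H - m):
        for x in range(m, W - m):
            ref = prev[y - r:y + r + 1, x - r:x + r + 1]
            best, best_d = np.inf, (0, 0)
            for dy in range(-max_disp, max_disp + 1):
                for dx in range(-max_disp, max_disp + 1):
                    cand = nxt[y + dy - r:y + dy + r + 1,
                               x + dx - r:x + dx + r + 1]
                    ssd = np.sum((ref - cand) ** 2)  # sum of squared diffs
                    if ssd < best:
                        best, best_d = ssd, (dy, dx)
            flow[y, x] = best_d
    return flow
```

Averaging the flow field over a frame gives a coarse first-person motion cue (e.g., global horizontal flow when the wearer turns their head).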

SLIDES 17–21

Conditional Random Fields

CRFs are described in Lafferty et al. [2001].

◮ Observations and labels form a Markov chain.
◮ Nodes depend on their neighbors.

[Chain diagram: labels y1, y2, y3 over observations x1, x2, x3, with
p(y1 | x1, y2), p(y2 | y1, y3, x2), and p(y3 | y2, x3).]
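As a toy illustration of these conditionals (assumed potentials, unrelated to the paper's learned model), the full conditional p(y | x) of a linear-chain CRF can be computed exactly: score the sequence, then normalize by the partition function obtained with the forward recursion.

```python
import numpy as np

def chain_crf_prob(y, unary, pairwise):
    """p(y | x) for a linear-chain CRF with toy log-potentials.

    unary:    (T, K) per-position label scores (already conditioned on x)
    pairwise: (K, K) transition scores between neighboring labels
    y:        length-T label sequence
    """
    T, K = unary.shape
    score = unary[0, y[0]] + sum(pairwise[y[t - 1], y[t]] + unary[t, y[t]]
                                 for t in range(1, T))
    # Partition function Z via the forward recursion over all sequences.
    alpha = np.exp(unary[0])
    for t in range(1, T):
        alpha = np.exp(unary[t]) * (alpha @ np.exp(pairwise))
    return float(np.exp(score) / alpha.sum())
```

Summing this probability over every possible label sequence returns 1, which is a handy sanity check on the recursion.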

SLIDES 22–25

Hidden Conditional Random Fields

A micro view of the HCRF model as described in Quattoni et al. [2007].

◮ Y is a label for the whole sequence.
◮ xi is a single observation in the sequence.
◮ Each hi is a possible hidden state.

[Micro-view diagram: Y connected to hidden states h1, h2, h3, each connected to the observation xi.]

SLIDES 26–33

Hidden Conditional Random Fields (cont.)

A macro view of the HCRF model as described in Quattoni et al. [2007].

◮ Y is a label for the whole sequence.
◮ Each xi is a single observation in the sequence.
◮ Each hi is the hidden state label assigned to xi.

[Macro-view diagram: Y over the hidden chain h1, h2, h3, over observations x1, x2, x3, with
p(h1 | Y, h2, x1), p(h2 | Y, h1, h3, x2), p(h3 | Y, h2, x3), and p(Y | {hi}) = p(Y | {xi}).]
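A minimal numerical sketch of this readout (toy random parameters, not a trained model): each class Y keeps its own potentials over hidden states, the forward recursion marginalizes over hidden-state paths, and normalizing across classes yields p(Y | x).

```python
import numpy as np

def hcrf_class_probs(x, W_unary, W_pair):
    """p(Y | x) for a toy HCRF (assumed parameters for illustration).

    x:       (T, D) observation features
    W_unary: (C, K, D) observation-to-hidden-state scores per class
    W_pair:  (C, K, K) hidden-state transition scores per class
    """
    C = W_unary.shape[0]
    z = np.empty(C)
    for c in range(C):
        unary = x @ W_unary[c].T          # (T, K) hidden-state scores
        alpha = np.exp(unary[0])
        for t in range(1, len(x)):
            alpha = np.exp(unary[t]) * (alpha @ np.exp(W_pair[c]))
        z[c] = alpha.sum()                # marginalize over hidden paths
    return z / z.sum()                    # p(Y | x) = Z_Y / sum_Y' Z_Y'
```

Because p(Y, h | x) is proportional to the exponentiated joint score, summing out h per class and normalizing over classes is exactly the p(Y | {xi}) readout on the slide.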

SLIDES 34–45

HCRF Example

Suppose we want to find the likelihood of “walking dialogue” (WDlg) vs. “walking discussion” (WDisc).

◮ Each xi is now a feature extracted from video frames.
◮ Each hi is determined from training:

◮ h1: John wants to hear about my weekend.
◮ h2: I’m feeling talkative.
◮ h3: Mary wants to listen to her iPod.

◮ If p(WDlg) > p(WDisc), assign Y = WDlg.

[Diagram: a candidate label (WDlg or WDisc) over the hidden chain h1, h2, h3 and observations x1, x2, x3, with
p(h1 | Y, h2, x1), p(h2 | Y, h1, h3, x2), p(h3 | Y, h2, x3), and
p(WDlg | {hi}) = p(WDlg | {xi}), p(WDisc | {hi}) = p(WDisc | {xi}).]

SLIDE 46

◮ Introduction
◮ Overview
◮ Temporal Context
  ◮ Conditional Random Fields
  ◮ Hidden Conditional Random Fields
  ◮ HCRF Example
◮ Experiments

SLIDE 47

◮ Introduction
◮ Overview
◮ Temporal Context
◮ Experiments
  ◮ Experiment Outline
  ◮ Experiment 1: Video Processing
  ◮ Experiment 2: Caltech Dataset
  ◮ Conclusion

SLIDES 48–56

Experiment Outline

The following experiments are presented:

◮ Video Processing
◮ Caltech image dataset
◮ Adjusted parameters:
  ◮ Iterations
  ◮ Hidden States
  ◮ Optimization Function
  ◮ Clusters
◮ Compared with linear SVM baseline
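For concreteness, a linear SVM baseline of the kind compared against here can be sketched as hinge-loss minimization with subgradient descent. This is an illustrative toy trainer, not the package the presentation actually used.

```python
import numpy as np

def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=500):
    """Toy linear SVM via hinge-loss subgradient descent.

    X: (N, D) features; y: (N,) labels in {-1, +1}. Returns (w, b).
    """
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        margins = y * (X @ w + b)
        viol = margins < 1                 # points violating the margin
        grad_w = lam * w - (y[viol, None] * X[viol]).sum(axis=0) / len(X)
        grad_b = -y[viol].sum() / len(X)
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b
```

Prediction is then sign(w · x + b); any off-the-shelf linear SVM implements the same decision rule with a more careful optimizer.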

SLIDES 57–58

Experiment 1: Video Processing

           Mine                       Theirs
Training   40 intervals               4,000 intervals
Testing    40 intervals               [unspecified]
Task       Dialogue vs Discussion     One vs. All
Features   All Features               Location; First-Person Motion; Attention; All Features

~42 hours of video = 11,340 intervals; at 24 hours of processing per 20 intervals, all 11,340 would take > 18 months.

SLIDE 59

Experiment 1: Video Processing (cont.)

[ROC curves (true positive rate vs. false positive rate): HCRF Dialogue vs Discussion Detection; my results and their results.]
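ROC curves like these come directly from detector scores: sweep a threshold down through the sorted scores and record the true/false positive rates at each cutoff. A minimal sketch (hypothetical scores, not the experiment's data):

```python
import numpy as np

def roc_points(scores, labels):
    """(FPR, TPR) pairs from detector scores and 0/1 ground-truth labels."""
    order = np.argsort(-np.asarray(scores))   # highest score first
    labels = np.asarray(labels)[order]
    tp = np.cumsum(labels)                    # true positives per cutoff
    fp = np.cumsum(1 - labels)                # false positives per cutoff
    return fp / (1 - labels).sum(), tp / labels.sum()
```

Plotting TPR against FPR over all cutoffs produces the curves shown on these slides.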

SLIDES 60–68

Experiment 2: Caltech Dataset

Experiment 2 focuses on the Caltech image dataset.

◮ Multi-class HCRF evaluated
◮ Classes are evaluated in isolation.
◮ Temporal context is simulated with clustering
◮ Initial parameters are based on Fathi et al. [2012]:
  ◮ Hidden States: 5
  ◮ Window Size: 5
  ◮ Max Iterations: 100
  ◮ Optimizer: Broyden–Fletcher–Goldfarb–Shanno (BFGS)
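One plausible way to simulate temporal context by clustering (assumed here for illustration; the slides do not specify the method) is plain k-means over the image features, so that nearby cluster members play the role of temporal neighbors:

```python
import numpy as np

def kmeans(X, k, iters=50, seed=0):
    """Plain k-means clustering (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Assign each point to its nearest center.
        assign = np.argmin(((X[:, None] - centers) ** 2).sum(-1), axis=1)
        # Recompute each center as the mean of its assigned points.
        for j in range(k):
            if np.any(assign == j):
                centers[j] = X[assign == j].mean(axis=0)
    return assign, centers
```

Grouping images by cluster before forming sequences gives the chain models something coherent to smooth over, standing in for real temporal order.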

SLIDE 69

Exp. 2a: Initial Settings

[ROC curves (true positive rate vs. false positive rate): airplanes, cars, faces, motorbikes (HCRF) and SVM (all classes).]

Processing: ~18 minutes, 1 MB

SLIDE 70

Exp. 2a: Initial Settings (cont.)

[Ranked classification results (ranks 1–4) for airplanes, cars, faces, and motorbikes.]

SLIDE 71

Exp. 2b: Low Iterations

[ROC curves: HCRF with 10 iterations; airplanes, cars, faces, motorbikes.]

Processing: ~3 minutes, 1 MB

SLIDE 72

Exp. 2c: Low Hidden States

[ROC curves: HCRF with 1 Hidden State; airplanes, cars, faces, motorbikes.]

Processing: ~2 minutes, 1 MB

SLIDE 73

Exp. 2d: CG Optimizer

[ROC curves: HCRF with Conjugate Gradient Optimizer; airplanes, cars, faces, motorbikes.]

Processing: ~11 minutes, 1 MB

SLIDE 74

Exp. 2e: Increased Iterations

[ROC curves: HCRF with 1000 iterations; airplanes, cars, faces, motorbikes.]

Processing: ~30 minutes, 1 MB

SLIDE 75

Exp. 2f: Increased Hidden States

[ROC curves: HCRF with 15 Hidden States; airplanes, cars, faces, motorbikes.]

Processing: ~1 hour, 3 GB

SLIDE 76

Exp. 2g: Clustering + 15 Hidden States

[ROC curves: HCRF with Clustering and 15 Hidden States; airplanes, cars, faces, motorbikes.]

Processing: ~1 hour 10 minutes, 3 GB

SLIDE 77

Exp. 2g: Clustering + 15 Hidden States (cont.)

[Ranked classification results (ranks 1–4) for airplanes, cars, faces, and motorbikes.]

SLIDE 78

Exp. 2h: Clustering + 20 Hidden States

[ROC curves: HCRF with Clustering and 20 Hidden States; airplanes, cars, faces, motorbikes.]

Processing: ~1 hour 40 minutes, 5 GB

SLIDE 79

Exp. 2i: LDCRF with 20 Hidden States

[ROC curves: LDCRF with Clustering and 20 Hidden States; airplanes, cars, faces, motorbikes.]

Processing: ~5 hours 20 minutes, 5 GB

SLIDE 80

Exp. 2j: CRF with Initial Parameters

[ROC curves: CRF with Clustering; airplanes, cars, faces, motorbikes.]

Processing: ~21 seconds, 1 MB

SLIDE 81

Exp. 2j: CRF with Initial Parameters (cont.)

[Ranked classification results (ranks 1–4) for airplanes, cars, faces, and motorbikes.]

SLIDES 82–84

Overall Results

◮ SVM, CRF, and LDCRF perform best
◮ CRF almost outperforms all, with negligible memory and processing requirements
◮ Hidden states increase accuracy, but at significant memory cost

SLIDE 85

Conclusion

◮ HCRF is accurate, but has a heavy performance cost.
◮ It may be optimal for particular domains.

SLIDE 86

References I

Alireza Fathi, Jessica K. Hodgins, and James M. Rehg. Social interactions: A first-person perspective. In CVPR, pages 1226–1233. IEEE, 2012. ISBN 978-1-4673-1226-4. URL http://dblp.uni-trier.de/db/conf/cvpr/cvpr2012.html#FathiHR12.

John D. Lafferty, Andrew McCallum, and Fernando C. N. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the Eighteenth International Conference on Machine Learning, ICML ’01, pages 282–289, San Francisco, CA, USA, 2001. Morgan Kaufmann Publishers Inc. ISBN 1-55860-778-1. URL http://dl.acm.org/citation.cfm?id=645530.655813.

SLIDE 87

References II

Ariadna Quattoni, Sybor Wang, Louis-Philippe Morency, Michael Collins, and Trevor Darrell. Hidden conditional random fields. IEEE Trans. Pattern Anal. Mach. Intell., 29(10):1848–1852, October 2007. ISSN 0162-8828. doi: 10.1109/TPAMI.2007.1124. URL http://dx.doi.org/10.1109/TPAMI.2007.1124.