SLIDE 1
Social Interactions: A First-Person Perspective
A. Fathi, J. Hodgins, J. Rehg
Presented by Jacob Menashe, November 16, 2012
SLIDES 2-5
Social Interaction Detection
Objective: Detect social interactions from video footage.
◮ Consider faces and attention
◮ Account for temporal context
◮ Analyze first-person movement cues
SLIDE 6
Outline: Introduction, Overview, Features, Temporal Context, Experiments
SLIDE 7
Video Example
Color legend: Red = Dialogue; Yellow = Walking Dialogue; Green = Discussion; Light Blue = Walking Discussion; Dark Blue = Monologue; None = Background. (Link)
SLIDES 8-15
Features
Features are constructed from first- and third-person information:
1. Dense optical flow (first-person movement)
2. Face locations (relative to the first person)
3. Attention and roles. For each person x:
  ◮ Faces looking at x
  ◮ Whether the first person looks at x
  ◮ Mutual attention between x and the first person
  ◮ Number of faces looking where x is looking
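A minimal sketch of how the per-person attention cues above could be computed from per-frame gaze estimates. The `Person` structure and field names are hypothetical illustrations, not the authors' implementation:

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Person:
    """Hypothetical per-frame gaze estimate for one detected face."""
    name: str
    looking_at: Optional[str] = None  # name of the gaze target, if any

def attention_features(x: Person, others: List[Person],
                       first_person_target: Optional[str]) -> dict:
    """The four attention cues for person x in a single frame."""
    faces_at_x = sum(1 for p in others if p.looking_at == x.name)
    first_looks_at_x = first_person_target == x.name
    mutual = first_looks_at_x and x.looking_at == "first_person"
    shared_gaze = sum(1 for p in others
                      if x.looking_at is not None and p.looking_at == x.looking_at)
    return {"faces_looking_at_x": faces_at_x,
            "first_person_looks_at_x": first_looks_at_x,
            "mutual_attention": mutual,
            "faces_sharing_gaze_target": shared_gaze}
```

In the paper these cues are aggregated over a temporal window before classification; this sketch shows only a single frame.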
SLIDE 16
Feature Example
SLIDES 17-21
Conditional Random Fields
CRFs are described in Lafferty et al. [2001].
◮ Observations and labels form a Markov chain.
◮ Nodes depend on their neighbors.
[Chain diagram: labels y1, y2, y3 over observations x1, x2, x3; each label's conditional is p(y1|x1, y2), p(y2|y1, y3, x2), and p(y3|y2, x3)]
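Writing out the distribution the chain above encodes (notation after Lafferty et al. [2001]):

```latex
p(\mathbf{y} \mid \mathbf{x})
  = \frac{1}{Z(\mathbf{x})}
    \exp\!\left( \sum_{t} \sum_{k} \lambda_k \, f_k(y_{t-1}, y_t, \mathbf{x}, t) \right),
\qquad
Z(\mathbf{x})
  = \sum_{\mathbf{y}'} \exp\!\left( \sum_{t} \sum_{k} \lambda_k \, f_k(y'_{t-1}, y'_t, \mathbf{x}, t) \right)
```

Because each feature function touches only adjacent labels, each y_t depends only on its neighbors and the observations, which yields the per-node conditionals shown on the slide.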
SLIDES 22-25
Hidden Conditional Random Fields
A micro view of the HCRF model as described in Quattoni et al. [2007].
◮ Y is a label for the whole sequence.
◮ xi is a single observation in the sequence.
◮ Each hi is a possible hidden state.
[Diagram: Y connected to hidden states h1, h2, h3, each connected to the observation xi]
SLIDES 26-33
Hidden Conditional Random Fields (cont.)
A macro view of the HCRF model as described in Quattoni et al. [2007].
◮ Y is a label for the whole sequence.
◮ Each xi is a single observation in the sequence.
◮ Each hi is the hidden state label assigned to xi.
[Diagram: Y over the hidden chain h1, h2, h3 over observations x1, x2, x3; the conditionals are p(h1|Y, h2, x1), p(h2|Y, h1, h3, x2), p(h3|Y, h2, x3), and p(Y|{hi}) = p(Y|{xi})]
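The relationship p(Y|{hi}) = p(Y|{xi}) on the slide comes from marginalizing out the hidden chain; the HCRF defines (following Quattoni et al. [2007], with potential function Ψ):

```latex
p(Y \mid \mathbf{x}; \theta)
  = \sum_{\mathbf{h}} p(Y, \mathbf{h} \mid \mathbf{x}; \theta)
  = \frac{\sum_{\mathbf{h}} e^{\Psi(Y, \mathbf{h}, \mathbf{x}; \theta)}}
         {\sum_{Y'} \sum_{\mathbf{h}} e^{\Psi(Y', \mathbf{h}, \mathbf{x}; \theta)}}
```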
SLIDES 34-45
HCRF Example
Suppose we want to find the likelihood of "walking dialogue" (WDlg) vs. "walking discussion" (WDisc).
◮ Each xi is now a feature extracted from video frames.
◮ Each hi is determined from training:
  ◮ h1: John wants to hear about my weekend.
  ◮ h2: I'm feeling talkative.
  ◮ h3: Mary wants to listen to her iPod.
◮ If p(WDlg) > p(WDisc), assign Y = WDlg.
[Diagram: the candidate label over the hidden chain h1, h2, h3 over observations x1, x2, x3; the scores p(WDlg|{hi}) = p(WDlg|{xi}) and p(WDisc|{hi}) = p(WDisc|{xi}) are compared]
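The decision rule above can be sketched by brute force: score each candidate label by summing over all hidden-state sequences, then take the larger. The hidden alphabet and weights below are invented for illustration; a real HCRF uses trained parameters and dynamic programming rather than enumeration:

```python
import itertools
import math

HIDDEN_STATES = ("listening", "talking", "distracted")  # toy hidden alphabet

def psi(label, h_seq, x_seq, weights):
    """Potential Psi(Y, h, x): per-frame label/state/observation compatibility
    plus label-conditioned state transitions (missing entries score 0)."""
    score = sum(weights.get((label, h, x), 0.0) for h, x in zip(h_seq, x_seq))
    score += sum(weights.get((label, a, b), 0.0) for a, b in zip(h_seq, h_seq[1:]))
    return score

def label_score(label, x_seq, weights):
    """Unnormalized p(label | x): marginalize over every hidden sequence."""
    return sum(math.exp(psi(label, h, x_seq, weights))
               for h in itertools.product(HIDDEN_STATES, repeat=len(x_seq)))

def classify(x_seq, weights, labels=("WDlg", "WDisc")):
    """Assign Y = WDlg iff its marginalized score beats WDisc's."""
    return max(labels, key=lambda y: label_score(y, x_seq, weights))
```

Enumeration costs |H|^T per label, which is why real implementations use belief propagation over the chain instead.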
SLIDE 46
Outline: Introduction, Overview, Temporal Context (Conditional Random Fields, Hidden Conditional Random Fields, HCRF Example), Experiments
SLIDE 47
Outline: Introduction, Overview, Temporal Context, Experiments (Experiment Outline, Experiment 1: Video Processing, Experiment 2: Caltech Dataset, Conclusion)
SLIDES 48-56
Experiment Outline
The following experiments are presented:
◮ Video processing
◮ Caltech image dataset
◮ Adjusted parameters:
  ◮ Iterations
  ◮ Hidden states
  ◮ Optimization function
  ◮ Clusters
◮ Compared with a linear SVM baseline
SLIDES 57-58
Experiment 1: Video Processing

            Mine                      Theirs
Training    40 intervals              4,000 intervals
Testing     40 intervals              [unspecified]
Task        Dialogue vs. Discussion   One vs. All
Features    All features              Location; First-Person Motion; Attention; All Features

The full dataset (~42 hours) contains 11,340 intervals; at my rate of 20 intervals per 24 hours, processing all of them would take over 18 months.
SLIDE 59
Experiment 1: Video Processing (cont.)
My Results vs. Their Results
[ROC plots: true positive rate vs. false positive rate, "HCRF Dialogue vs Discussion Detection"]
SLIDES 60-68
Experiment 2: Caltech Dataset
Experiment 2 focuses on the Caltech image dataset.
◮ Multi-class HCRF evaluated
◮ Classes are evaluated in isolation.
◮ Temporal context is simulated with clustering
◮ Initial parameters are based on Fathi et al. [2012]:
  ◮ Hidden states: 5
  ◮ Window size: 5
  ◮ Max iterations: 100
  ◮ Optimizer: Broyden-Fletcher-Goldfarb-Shanno (BFGS)
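For reference, the initial settings above as a plain configuration mapping (the key names are mine, not those of any particular CRF toolbox):

```python
# Initial HCRF settings from Fathi et al. [2012]; key names are illustrative.
hcrf_config = {
    "hidden_states": 5,
    "window_size": 5,
    "max_iterations": 100,
    "optimizer": "bfgs",  # Broyden-Fletcher-Goldfarb-Shanno
}
```

The experiments that follow vary one entry at a time against this baseline.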
SLIDE 69
- Exp. 2a: Initial Settings
[ROC plot "test": airplanes, cars, faces, and motorbikes HCRF curves, with an SVM (all) baseline]
Processing: ~18 minutes, 1 MB
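Each of the following slides shows an ROC curve. As a reminder of what is being plotted, here is a minimal sketch (not the evaluation code used in these experiments) that turns classifier scores into (false positive rate, true positive rate) points by sweeping a decision threshold:

```python
def roc_points(scores, labels):
    """labels are booleans (True = positive class). Returns (fpr, tpr)
    pairs obtained by lowering the threshold past each ranked score."""
    ranked = sorted(zip(scores, labels), reverse=True)
    pos = sum(labels)
    neg = len(labels) - pos
    tp = fp = 0
    points = [(0.0, 0.0)]
    for _, is_pos in ranked:
        if is_pos:
            tp += 1
        else:
            fp += 1
        points.append((fp / neg, tp / pos))
    return points
```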
SLIDE 70
- Exp. 2a: Initial Settings (cont.)
[Ranking figure: rank 1-4 class assignments for airplanes, cars, faces, and motorbikes]
SLIDE 71
- Exp. 2b: Low Iterations
[ROC plot "HCRF with 10 iterations": airplanes, cars, faces, and motorbikes HCRF curves]
Processing: ~3 minutes, 1 MB
SLIDE 72
- Exp. 2c: Low Hidden States
[ROC plot "HCRF with 1 Hidden State": airplanes, cars, faces, and motorbikes HCRF curves]
Processing: ~2 minutes, 1 MB
SLIDE 73
- Exp. 2d: Conjugate Gradient Optimizer
[ROC plot "HCRF with Conjugate Gradient Optimizer": airplanes, cars, faces, and motorbikes HCRF curves]
Processing: ~11 minutes, 1 MB
SLIDE 74
- Exp. 2e: Increased Iterations
[ROC plot "HCRF with 1000 iterations": airplanes, cars, faces, and motorbikes HCRF curves]
Processing: ~30 minutes, 1 MB
SLIDE 75
- Exp. 2f: Increased Hidden States
[ROC plot "HCRF with 15 Hidden States": airplanes, cars, faces, and motorbikes HCRF curves]
Processing: ~1 hour, 3 GB
SLIDE 76
- Exp. 2g: Clustering + 15 Hidden States
[ROC plot "HCRF with Clustering and 15 Hidden States": airplanes, cars, faces, and motorbikes HCRF curves]
Processing: ~1 hour 10 minutes, 3 GB
SLIDE 77
- Exp. 2g: Clustering + 15 Hidden States (cont.)
[Ranking figure: rank 1-4 class assignments for airplanes, cars, faces, and motorbikes]
SLIDE 78
- Exp. 2h: Clustering + 20 Hidden States
[ROC plot "HCRF with Clustering and 20 Hidden States": airplanes, cars, faces, and motorbikes HCRF curves]
Processing: ~1 hour 40 minutes, 5 GB
SLIDE 79
- Exp. 2i: LDCRF with 20 Hidden States
[ROC plot "LDCRF with Clustering and 20 Hidden States": airplanes, cars, faces, and motorbikes LDCRF curves]
Processing: ~5 hours 20 minutes, 5 GB
SLIDE 80
- Exp. 2j: CRF with Initial Parameters
[ROC plot "CRF with Clustering": airplanes, cars, faces, and motorbikes CRF curves]
Processing: ~21 seconds, 1 MB
SLIDE 81
- Exp. 2j: CRF with Initial Parameters (cont.)
[Ranking figure: rank 1-4 class assignments for airplanes, cars, faces, and motorbikes]
SLIDES 82-84
Overall Results
◮ SVM, CRF, and LDCRF perform best
◮ CRF nearly outperforms all other models, with negligible memory and processing requirements
◮ Hidden states increase accuracy, but at significant memory cost
SLIDE 85
Conclusion
◮ HCRF is accurate, but has a heavy performance cost.
◮ May be optimal for particular domains.
SLIDE 86
References I
Alireza Fathi, Jessica K. Hodgins, and James M. Rehg. Social interactions: A first-person perspective. In CVPR, pages 1226-1233. IEEE, 2012. ISBN 978-1-4673-1226-4. URL http://dblp.uni-trier.de/db/conf/cvpr/cvpr2012.html#FathiHR12.
John D. Lafferty, Andrew McCallum, and Fernando C. N. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the Eighteenth International Conference on Machine Learning, ICML '01, pages 282-289, San Francisco, CA, USA, 2001. Morgan Kaufmann Publishers Inc. ISBN 1-55860-778-1. URL http://dl.acm.org/citation.cfm?id=645530.655813.
SLIDE 87
References II
Ariadna Quattoni, Sybor Wang, Louis-Philippe Morency, Michael Collins, and Trevor Darrell. Hidden conditional random fields. IEEE Trans. Pattern Anal. Mach. Intell., 29(10):1848-1852, October 2007. ISSN 0162-8828. doi: 10.1109/TPAMI.2007.1124. URL http://dx.doi.org/10.1109/TPAMI.2007.1124.