SLIDE 1 LEARNING TEMPORAL EMBEDDINGS FOR COMPLEX VIDEO ANALYSIS
BY RAMANATHAN, TANG, MORI, AND LI
Chad Voegele
SLIDE 2
?
PROBLEM
What can we learn about videos without supervision?
SLIDE 3
MOTIVATION
... quick fox jumps over dog ...
SLIDE 4
WORD2VEC FOR VIDEOS?
words ≈ frames
sentences ≈ video segments
SLIDE 5 WORD2VEC FOR VIDEOS?
ISSUES
- 1. Frames are not discrete.
- 2. Visual similarity between neighboring frames.
- 3. Representation of context.
SLIDE 6
FRAME EMBEDDING
⟶
SLIDE 7 FRAME EMBEDDING
input ⟶ AlexNet fc7 ⟶ Magic ⟶ ReLU ⟶ LRN
SLIDE 8
EMBEDDING OBJECTIVE
$\mathrm{similarity}(a, b) = \frac{a \cdot b}{\|a\| \|b\|} = a \cdot b$
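The second equality on this slide holds when both vectors have unit length, so cosine similarity reduces to a plain dot product. A minimal sketch, assuming the embeddings are L2-normalized (the helper names `normalize`, `dot`, and `cosine_similarity` are illustrative, not from the talk):

```python
import math

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    """Scale v to unit L2 norm (assumes v is not the zero vector)."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def cosine_similarity(a, b):
    """a . b / (||a|| ||b||), valid for any non-zero vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Toy vectors (made up for illustration).
a = normalize([3.0, 4.0])
b = normalize([4.0, 3.0])

# For unit vectors the two quantities agree up to rounding.
cos_ab = cosine_similarity(a, b)
dot_ab = dot(a, b)
```

For vectors that are not unit length, only `cosine_similarity` is scale-invariant; the dot product grows with the norms.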
SLIDE 9
EMBEDDING OBJECTIVE
$f_{v_j} \cdot h_{v_j} \gg f_- \cdot h_{v_j}$
SLIDE 10 EMBEDDING OBJECTIVE
$\min_{\text{embedding}} \sum_{v \in V} \sum_{v_j \in v} \sum_{v_- \neq v_j} \max\left(0,\ 1 - (f_{v_j} - f_-) \cdot h_{v_j}\right)$
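The objective sums a hinge term, max(0, 1 − (f − f₋) ⋅ h), over every frame, its context vector, and each negative frame. A rough sketch of that sum for a single toy video, assuming the negatives are simply the other frames of the same list (the slide does not fix how negatives are chosen, and the names `hinge_term` and `video_loss` are hypothetical):

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def hinge_term(f_pos, f_neg, h):
    """max(0, 1 - (f_pos - f_neg) . h) for one (frame, negative, context) triple."""
    diff = [p - n for p, n in zip(f_pos, f_neg)]
    return max(0.0, 1.0 - dot(diff, h))

def video_loss(frames, contexts):
    """Sum hinge terms over every frame j and every negative k != j.

    Assumption: negatives are the other frames in `frames`.
    """
    total = 0.0
    for j, (f_j, h_j) in enumerate(zip(frames, contexts)):
        for k, f_neg in enumerate(frames):
            if k != j:
                total += hinge_term(f_j, f_neg, h_j)
    return total

# Toy data (made up): two 2-d frame embeddings sharing one context vector.
frames = [[1.0, 0.0], [0.0, 1.0]]
contexts = [[1.0, 0.0], [1.0, 0.0]]
loss = video_loss(frames, contexts)
```

Frame 0 agrees with its context and contributes nothing; frame 1 is beaten by its negative and contributes the whole loss.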
SLIDE 11
EMBEDDING OBJECTIVE
WANT
$1 - (f_{v_j} - f_-) \cdot h_{v_j} < 0 \iff f_{v_j} \cdot h_{v_j} > 1 + f_- \cdot h_{v_j}$
SLIDE 12 FRAME CONTEXT
$h_{v_j} = \frac{1}{2T} \sum_{t=1}^{T} \left( f_{v_{j-t}} + f_{v_{j+t}} \right)$
$h_{v_j} = \frac{1}{T} \sum_{t=1}^{T} f_{v_{j-t}}$
$h_{v_j} \in \{ f_{v_k} \mid k \neq j \}$
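The three context definitions on this slide are: the mean of the T frames on either side of frame j, the mean of the T preceding frames only, and any single other frame of the video. A small sketch of each, assuming frames near the video boundary are simply rejected (the slide does not say how boundaries are handled; all function names are illustrative):

```python
def mean_vectors(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def context_full(frames, j, T):
    """(1 / 2T) * sum over t = 1..T of (f[j - t] + f[j + t])."""
    if j - T < 0 or j + T >= len(frames):
        raise IndexError("window of size T does not fit around frame j")
    window = [frames[j - t] for t in range(1, T + 1)]
    window += [frames[j + t] for t in range(1, T + 1)]
    return mean_vectors(window)

def context_past(frames, j, T):
    """(1 / T) * sum over t = 1..T of f[j - t], using past frames only."""
    if j - T < 0:
        raise IndexError("not enough past frames before frame j")
    return mean_vectors([frames[j - t] for t in range(1, T + 1)])

def contexts_pairwise(frames, j):
    """Candidate contexts {f[k] : k != j}, one other frame at a time."""
    return [f for k, f in enumerate(frames) if k != j]

# Toy 1-d "embeddings" (made up) so the averages are easy to check by hand.
frames = [[0.0], [1.0], [2.0], [3.0], [4.0]]
full = context_full(frames, j=2, T=2)
past = context_past(frames, j=2, T=2)
pairs = contexts_pairwise(frames, j=2)
```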
SLIDE 13
MULTI-RESOLUTION & NEGATIVES
SLIDE 14 EVENT RETRIEVAL
TASK
$v \to \{ v_j \in V \mid \mathrm{event}(v) = \mathrm{event}(v_j) \}$
METHOD
For each $v_j \in V$,
- 1. Uniformly sample 4 frames from $v_j$.
- 2. Compute and average the frame embeddings.
Then,
$\{ \bar{f}_v \cdot \bar{f}_{v_k} \mid v_k \neq v \}$
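The retrieval method can be sketched as follows: sample 4 evenly spaced frames per video, average their embeddings into one descriptor, then score every other video by the dot product with the query descriptor. This is a sketch under assumptions: the exact sampling positions and the sorting by score are choices made here, and `uniform_sample_indices`, `video_descriptor`, and `retrieve` are hypothetical names.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def mean_vectors(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def uniform_sample_indices(n_frames, n_samples=4):
    """Evenly spaced indices: the centre of each of n_samples equal segments.

    Assumption: this is one of several reasonable ways to sample uniformly.
    """
    return [int((i + 0.5) * n_frames / n_samples) for i in range(n_samples)]

def video_descriptor(frame_embeddings, n_samples=4):
    """Average the embeddings of the sampled frames (video must be non-empty)."""
    idx = uniform_sample_indices(len(frame_embeddings), n_samples)
    return mean_vectors([frame_embeddings[i] for i in idx])

def retrieve(query_id, descriptors):
    """Rank every other video by dot product with the query, highest first."""
    q = descriptors[query_id]
    scored = [(dot(q, d), vid) for vid, d in descriptors.items() if vid != query_id]
    scored.sort(key=lambda s: s[0], reverse=True)
    return [vid for _, vid in scored]

# Toy data (made up): an 8-frame video with 1-d embeddings, and three descriptors.
toy_video = [[2.0 * i] for i in range(8)]
toy_descriptor = video_descriptor(toy_video)
descriptors = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0]}
ranking = retrieve("a", descriptors)
```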
SLIDE 15
EVENT RETRIEVAL
Method                   mAP (%)
Chance                   6.53
Two-stream pre-trained   20.09
fc6                      20.08
fc7                      21.24
Model (no future)        21.30
Model (no hard neg.)     24.22
Model (best)             25.07
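For reference, mean average precision can be computed from ranked retrieval lists roughly as below. This uses the common textbook definition of average precision; the slide does not state which variant was used, so treat it as an assumption. The function names and toy data are illustrative.

```python
def average_precision(ranked_ids, relevant_ids):
    """Mean of precision@k taken at each rank k where a relevant item appears."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, item in enumerate(ranked_ids, start=1):
        if item in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant)

def mean_average_precision(rankings, relevance):
    """Average the per-query AP over all queries in `rankings`."""
    aps = [average_precision(rankings[q], relevance[q]) for q in rankings]
    return sum(aps) / len(aps)

# Toy example (made up): two queries with known relevant videos.
rankings = {"q1": ["a", "b", "c"], "q2": ["x", "y"]}
relevance = {"q1": {"a", "c"}, "q2": {"y"}}
map_score = mean_average_precision(rankings, relevance)
```

Here q1 scores (1/1 + 2/3) / 2 and q2 scores 1/2, so the mean is 2/3.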
SLIDE 16
EVENT RETRIEVAL
SLIDE 17 SAMPLE VIDEOS
Awesome Parkour and Freerunning 20... Skateboarding Montage 2015
SLIDE 18
TEMPORAL ORDER RECOVERY
2 1 4 3 ⟶ 1 2 3 4
SLIDE 19 TEMPORAL ORDER RECOVERY
METHOD
Given $\{ s_{v_j} \mid s_{v_j} \in v_j \}$.
Until done,
- 1. Average the last two frame embeddings.
- 2. Find the next frame as the frame with the highest similarity.
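The greedy loop above can be sketched as follows. It assumes the first two frames of the sequence are already known (the slide does not say how the sequence is seeded) and uses the dot product as the similarity; `recover_order` and `unit` are hypothetical names, and the toy embeddings are unit vectors spaced along an arc.

```python
import math

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def mean_vectors(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def recover_order(embeddings, first, second):
    """Greedily order frame indices, starting from two given frames.

    Assumption: the indices of the first two frames are supplied by the caller.
    """
    order = [first, second]
    remaining = set(range(len(embeddings))) - {first, second}
    while remaining:
        # Step 1: average the embeddings of the last two placed frames.
        context = mean_vectors([embeddings[order[-2]], embeddings[order[-1]]])
        # Step 2: pick the unplaced frame most similar to that average
        # (sorted() makes tie-breaking deterministic).
        best = max(sorted(remaining), key=lambda i: dot(embeddings[i], context))
        order.append(best)
        remaining.remove(best)
    return order

def unit(deg):
    """Unit vector at the given angle in degrees (toy embedding)."""
    r = math.radians(deg)
    return [math.cos(r), math.sin(r)]

# Shuffled toy frames whose true order follows increasing angle: 0, 20, 40, 60.
shuffled = [unit(40), unit(0), unit(60), unit(20)]
order = recover_order(shuffled, first=1, second=3)
```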
SLIDE 20
TEMPORAL ORDER RECOVERY
Method               Kendall Tau
Chance               50
Two-stream           42.05
fc6                  42.43
fc7                  41.67
Model (pairwise)     42.03
Model (no future)    40.91
Model (best)         40.41
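The slide does not define the metric precisely. A chance value of 50 would be consistent with a normalized Kendall tau distance reported as a percentage, in which case lower is better; that reading is an assumption. Under it, the score for one sequence could be computed as in this sketch (`kendall_tau_distance` is an illustrative name):

```python
from itertools import combinations

def kendall_tau_distance(predicted, truth):
    """Fraction of item pairs that the two orderings rank in opposite order.

    0.0 means identical orderings, 1.0 means exactly reversed.
    Assumes both lists contain the same distinct items, at least two of them.
    """
    pos_pred = {item: i for i, item in enumerate(predicted)}
    pos_true = {item: i for i, item in enumerate(truth)}
    pairs = list(combinations(truth, 2))
    discordant = sum(
        1
        for a, b in pairs
        if (pos_pred[a] - pos_pred[b]) * (pos_true[a] - pos_true[b]) < 0
    )
    return discordant / len(pairs)

# Toy example: two adjacent swaps out of six possible pairs.
distance = kendall_tau_distance([2, 1, 4, 3], [1, 2, 3, 4])
```

Multiplying the result by 100 puts it on the same scale as the table, if the assumption above is right.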
SLIDE 21
TEMPORAL ORDERING FOR PHOTOS
SLIDE 22 DISCUSSION
How are long-distance dependencies captured?
Can we estimate the quality of embeddings independent
Hyper-parameter tuning: fps sampling, embedding dimension, negative selection, context representation
SLIDE 23
SOURCES
Groundhog Day, 1993, Columbia Pictures
Word2Vec: An Introduction
Unsupervised Learning of Visual Representations using Videos by Nitish Srivastava
Visualizing Data using t-SNE by van der Maaten
Fox Over Dog Picture
Efficient Estimation of Word Representations in Vector Space by Mikolov