Learning Graph Representations for Video Understanding
Xiaolong Wang
Carnegie Mellon University
Computer Vision
Dog
He et al. Mask R-CNN. ICCV 2017. Güler et al. DensePose: Dense Human Pose Estimation In The Wild. CVPR 2018.
Deep Learning
ImageNet
[Figure: images with labels Mushroom, Dog, Ant, Jelly Fungus, Nest]
Train a Convolutional Neural Network
Russakovsky et al. ImageNet Large Scale Visual Recognition Challenge. 2014.
Convolutional Neural Networks
Figure credit: Van Den Oord et al.
Related Work: Relation Networks
[Santoro et al, 2017]
Related Work: Self-Attention
[Vaswani et al, 2017]
Related Work: Graph Convolution Networks
[Kipf et al, 2017]
This Tutorial: Graph Networks
Video Recognition
[Diagram: video, stacked 3D Conv layers, prediction "Playing Soccer"]
Reasoning for Action Recognition
Long-range explicit reasoning
Non-local Means
[Figure: non-local means: a query patch $p$ and similar patches $p_1$, $p_2$, $p_3$ elsewhere in the image]
Buades et al. A non-local algorithm for image denoising. CVPR, 2005.
Non-local Operator
Operates in feature space; can be embedded into any ConvNet.
$$z_i = \frac{1}{D(y)} \sum_j f(y_i, y_j)\, g(y_j)$$
where $f(y_i, y_j)$ gives the affinity between positions $i$ and $j$, and $g(y_j)$ gives the features.
Non-local Operator
[Diagram: input $y$ of shape $T \times H \times W \times 512$; two 1×1×1 convs $\theta$ and $\phi$ map it to $THW \times 512$ and $512 \times THW$ matrices, whose product is the $THW \times THW$ affinity matrix]
$$z_i = \frac{1}{D(y)} \sum_j f(y_i, y_j)\, g(y_j)$$
Non-local Operator
[Diagram: the $THW \times THW$ affinity matrix is normalized row-wise]
$$z_i = \frac{1}{D(y)} \sum_j f(y_i, y_j)\, g(y_j)$$
$$f(y_i, y_j) = \exp(y_i^\top y_j), \qquad D(y) = \sum_j f(y_i, y_j)$$
$$\frac{f(y_i, y_j)}{D(y)} = \frac{\exp(y_i^\top y_j)}{\sum_j \exp(y_i^\top y_j)}$$
Non-local Operator
[Diagram: a third 1×1×1 conv $g$ maps $y$ to a $THW \times 512$ matrix; the normalized $THW \times THW$ affinity multiplies it to give a $THW \times 512$ output, reshaped back to $T \times H \times W \times 512$]
$$z_i = \frac{1}{D(y)} \sum_j f(y_i, y_j)\, g(y_j)$$
Non-local Operator as A Residual Block
[Diagram: Video, 3D Conv, 3D Conv, Non-local, 3D Conv, Non-local, Action Class]
$$o_i = z_i W + y_i$$
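The block above can be sketched in a few lines of NumPy. This is an illustrative re-implementation, not the authors' code: the 1×1×1 convolutions are written as plain matrix multiplies on flattened space-time features, and all shapes and variable names are assumptions.

```python
import numpy as np

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def nonlocal_block(y, W_theta, W_phi, W_g, W_out):
    """Non-local block on flattened space-time features y: (THW, C).
    Embedded-Gaussian affinity f(y_i, y_j) = exp(theta_i . phi_j),
    normalized by D(y) = sum_j f(y_i, y_j), i.e. a row-wise softmax."""
    theta, phi, g = y @ W_theta, y @ W_phi, y @ W_g   # (THW, C') each
    attn = softmax(theta @ phi.T, axis=-1)            # (THW, THW) affinity
    z = attn @ g                                      # affinity-weighted average
    return z @ W_out + y                              # residual: o_i = z_i W + y_i

rng = np.random.default_rng(0)
THW, C, Cp = 8, 16, 8
y = rng.standard_normal((THW, C))
Ws = [0.1 * rng.standard_normal(s)
      for s in [(C, Cp), (C, Cp), (C, Cp), (Cp, C)]]
out = nonlocal_block(y, *Ws)
print(out.shape)  # (8, 16)
```

Note that with the output weights zeroed, the residual connection makes the block an identity map, which is what lets it be dropped into a pretrained ConvNet without disturbing it.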
Examples
Action Recognition in Daily Lives
Gunnar A. Sigurdsson, Gül Varol, Xiaolong Wang, Ivan Laptev, Ali Farhadi, Abhinav Gupta. Hollywood in Homes: Crowdsourcing Data Collection for Activity Understanding. ECCV 2016.
Charades Dataset: 157 classes, 9.8k videos, 30 seconds per video. We let people upload their own videos!
Action Recognition on Charades
Method, mAP:
3D Conv: 31.8%
3D Conv + Non-local: 33.5%
Opening A Book
The Non-local Block
Opening A Book
Object states change over time; there are human-object and object-object interactions.
[Figure: region pairs (A1, B1) through (A4, B4) across four frames are highly correlated]
Relations between Regions
$$F(y_i, y_j) = \phi(y_i)^\top \phi'(y_j)$$
$$H_{ij} = \frac{\exp F(y_i, y_j)}{\sum_j \exp F(y_i, y_j)}$$
Graph Convolutional Network
$$Z = H Y W$$
[Diagram: matrix product of the adjacency $H$, the region features $Y$, and the weights $W$]
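A single propagation step of this graph convolution is just two matrix multiplies. The sketch below is a minimal NumPy illustration; the adjacency here is a random row-normalized matrix standing in for the learned softmax affinity, and the dimensions are assumptions.

```python
import numpy as np

def gcn_layer(H, Y, W):
    """One propagation step of the graph convolution: Z = H Y W.
    H: (N, N) normalized affinity/adjacency over N region nodes,
    Y: (N, d) region features, W: (d, d_out) learnable weights."""
    return H @ Y @ W

rng = np.random.default_rng(0)
N, d, d_out = 4, 6, 3
Y = rng.standard_normal((N, d))
W = rng.standard_normal((d, d_out))
A = rng.random((N, N))
H = A / A.sum(axis=1, keepdims=True)  # rows sum to 1, like the softmax affinity
Z = gcn_layer(H, Y, W)
print(Z.shape)  # (4, 3)
```

With $H$ equal to the identity, the layer reduces to an ordinary per-node linear map $Y W$; the off-diagonal affinities are what propagate information between regions.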
Graph Convolutional Network
Propagation
Connecting Non-local and GCN
$$o_i = z_i W + y_i$$
$$z_i = \frac{1}{D(y)} \sum_j f(y_i, y_j)\, g(y_j) = \sum_j \frac{f(y_i, y_j)}{\sum_j f(y_i, y_j)}\, g(y_j) = \sum_j H_{ij}\, g(y_j)$$
The Non-local Operator: $o_i = \sum_j H_{ij}\, g(y_j)\, W + y_i$
The Graph Convolution: $Z = H Y W + Y$
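This equivalence can be checked numerically: computing the non-local output position by position and computing the graph-convolution propagation with the softmax-normalized affinity give the same result. A small NumPy check follows; taking $g$ to be a linear map is an assumption for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 5, 4
Y = rng.standard_normal((N, d))
Wg = rng.standard_normal((d, d))   # g(.) as a linear map (an assumption)
gY = Y @ Wg                        # g(y_j) for every position j
f = np.exp(Y @ Y.T)                # f(y_i, y_j) = exp(y_i^T y_j)

# Non-local form, written per position: z_i = sum_j f(y_i,y_j)/D(y) * g(y_j)
z_nonlocal = np.stack([
    sum(f[i, j] / f[i].sum() * gY[j] for j in range(N)) for i in range(N)
])

# Graph-convolution form: Z = H (Y Wg), with H the softmax-normalized affinity
H = f / f.sum(axis=1, keepdims=True)
z_gcn = H @ gY
print(np.allclose(z_nonlocal, z_gcn))  # True
```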
Action Recognition on Charades
Method, mean AP:
3D Conv: 31.8%
3D Conv + Non-local: 33.5%
3D Conv + Region Graph: 36.2% (+4.4%)
Action Recognition on Charades
[Chart: mAP from 30% to 45% on classes that involve objects (yes vs. no), comparing 3D Conv and 3D Conv + Graph]
Action Recognition on Charades
[Chart: mAP from 30% to 45% across pose variance, comparing 3D Conv and 3D Conv + Graph]
Connection to Mean-Shift
The Non-local Operator:
$$z_i = \sum_j \frac{f(y_i, y_j)}{\sum_j f(y_i, y_j)}\, g(y_j)$$
The Mean-Shift Clustering:
$$m(y) = \frac{\sum_{y_j \in N(y)} K(y, y_j)\, y_j}{\sum_{y_j \in N(y)} K(y, y_j)}$$
Converging to the same mean?
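The mean-shift update has the same shape as the non-local average: a kernel-weighted mean of neighbors, normalized by the total kernel weight. A minimal NumPy sketch, assuming a Gaussian kernel over all points (so the neighborhood $N(y)$ is implicit in the kernel's decay) and toy 2-D data:

```python
import numpy as np

def mean_shift_step(y, points, bandwidth=1.0):
    """One mean-shift update m(y): kernel-weighted average of the points,
    sum_j K(y, y_j) y_j divided by sum_j K(y, y_j), Gaussian kernel K."""
    d2 = ((points - y) ** 2).sum(axis=1)
    K = np.exp(-d2 / (2.0 * bandwidth ** 2))
    return (K[:, None] * points).sum(axis=0) / K.sum()

rng = np.random.default_rng(0)
# Two tight clusters, centered near (0, 0) and (5, 5).
pts = np.vstack([rng.normal(0.0, 0.2, (50, 2)),
                 rng.normal(5.0, 0.2, (50, 2))])
y = np.array([4.0, 4.0])
for _ in range(20):
    y = mean_shift_step(y, pts)
print(y)  # converges toward the cluster mean near (5, 5)
```

Iterating the update moves the query to a local mode of the point density, which is the sense in which the slide asks whether the non-local average converges to the same mean.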
Figure credit: https://tw.rpi.edu/web/project/JeffersonProjectAtLakeGeorge/Clustering
Recent Related Work
Actor-Centric Relation Network [Sun et al, 2018] Video Action Transformer Network [Girdhar et al, 2019] Long-Term Feature Banks for Detailed Video Understanding [Wu et al, 2019]
Learning Affinity with Semantic Supervision
Learn Correspondence without Human Supervision
Goal:
The visual world exhibits continuity
Prior Work: Learning from Time
Predict Color in Time [Vondrick et al, 2018]
Inputs Outputs
Predict Pixel in Time [Mathieu et al, 2015] Predict Arrow of Time [Wei et al, 2018]
Using Tracking to Learn Features
CNN CNN
Similarity
Tracking → Similarity [Wang et al, 2015]
Limited by Off-the-shelf Trackers
Similarity requires tracking; tracking requires similarity.
Let's jointly learn both!
Learning to Track
How to obtain supervision?
[Diagram: a tracker $\mathcal{F}$ applied between consecutive frames]
$\mathcal{F}$: a deep tracker
Supervision: Cycle-Consistency in Time
Track backward; then track forward, back to the future.
[Diagram: $\mathcal{F}$ applied backward and then forward along the cycle]
Backpropagation through time along the cycle
Differentiable Tracking
Encoder $\phi$
Patch feature at time $t$: $y_t^p$
Image feature at time $t-1$: $y_{t-1}^I$
[Diagram: the transposed patch features (100 locations) multiply the image features (900 locations) to form a $900 \times 100$ affinity matrix]
Spatial Transformer: cropping
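The core of the tracker is this affinity computation between image-feature locations and patch-feature locations. The NumPy sketch below is an illustration only: the actual method localizes the patch with a spatial transformer that crops the image feature, whereas here that step is replaced by a soft affinity-weighted average; the 900 and 100 location counts are taken from the slide, and all names are assumptions.

```python
import numpy as np

def soft_track(img_feat, patch_feat):
    """Soft-matching sketch of the tracker: compute an affinity between
    every image location and every patch location, normalize over image
    locations, and read out the matched image features.
    img_feat: (900, C) image locations at time t-1;
    patch_feat: (100, C) patch locations at time t."""
    A = img_feat @ patch_feat.T                   # (900, 100) affinity
    A = np.exp(A - A.max(axis=0, keepdims=True))
    A = A / A.sum(axis=0, keepdims=True)          # softmax over image locations
    return A.T @ img_feat                         # (100, C) localized patch

rng = np.random.default_rng(0)
img = rng.standard_normal((900, 16))
patch = rng.standard_normal((100, 16))
prev_patch = soft_track(img, patch)
print(prev_patch.shape)  # (100, 16)
```

Because every output row is a convex combination of image-feature rows, the whole operation is differentiable, which is what allows backpropagation through the tracking step.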
Differentiable Tracking
Encoder $\phi$
Patch feature at time $t$: $y_t^p$
Image feature at time $t-1$: $y_{t-1}^I$
Output patch feature at time $t-1$: $y_{t-1}^p$, obtained by spatial-transformer cropping
Differentiable Tracking
Encoder $\phi$
$$y_{t-1}^p = \mathcal{F}(y_{t-1}^I,\; y_t^p)$$
Recurrent Tracking
[Diagram: starting from the patch feature $y_t^p$, the tracker $\mathcal{F}$ is applied recurrently backward through times $t-1$, $t-2$, $t-3$, then forward again to time $t$, where the cycle-consistency loss $\ell_{cycle}$ is computed]
Cycle-Consistency Loss Function
$$\ell_{cycle} = \left\lVert\, y_t^{s} - y_t^{p} \,\right\rVert_2^2$$
where $y_t^p$ is the starting patch feature and $y_t^s$ is the patch feature re-localized after tracking around the cycle.
[Diagram: $\mathcal{F}$ applied at each step along the cycle]
Multiple Cycles
Sub-cycles: a natural curriculum
Skip Cycles
Skip-cycles: skipping occlusions
Visualization of Training
Test Time: Nearest Neighbors in Feature Space
[Figure: nearest-neighbor matches between frames $t-1$ and $t$]
Instance Mask Tracking
DAVIS Dataset
DAVIS Dataset: Pont-Tuset et al. The 2017 DAVIS Challenge on Video Object Segmentation. 2017.
Pose Keypoint Tracking
JHMDB Dataset
Comparison
Our Correspondence Optical Flow
Pose Keypoint Tracking
JHMDB Dataset
Method, PCK@0.1:
Optical Flow: 45%
Vondrick et al.: 45%
Ours: 58%
Vondrick et al. Tracking Emerges by Colorizing Videos. ECCV 2018.
Texture Tracking
DAVIS Dataset
DAVIS Dataset: Pont-Tuset et al. The 2017 DAVIS Challenge on Video Object Segmentation. 2017.
Semantic Mask Tracking
Video Instance Parsing Dataset
Zhou et al. Adaptive Temporal Encoding Network for Video Instance-level Human Parsing. ACM MM 2018.
Questions?