Learning Graph Representations for Video Understanding
Xiaolong Wang
Carnegie Mellon University
Computer Vision
Dog
He et al. Mask R-CNN. ICCV 2017. Güler et al. DensePose: Dense Human Pose Estimation In The Wild. CVPR 2018.
Deep Learning
ImageNet
[Figure: images with labels Mushroom, Dog, Ant, Jelly Fungus, Nest]
Train a Convolutional Neural Network
Russakovsky et al. ImageNet Large Scale Visual Recognition Challenge. 2014.
Convolutional Neural Networks
Figure credit: Van Den Oord et al.
Related Work: Relation Networks
[Santoro et al, 2017]
Related Work: Self-Attention
[Vaswani et al, 2017]
Related Work: Graph Convolution Networks
[Kipf et al, 2017]
This Tutorial: Graph Networks
Video Recognition
[Diagram: video, stacked 3D Conv layers, prediction "Playing Soccer"]
Reasoning for Action Recognition
Long-range explicit reasoning
Non-local Means
[Figure: non-local means: a query patch $p$ and similar patches $p_1$, $p_2$, $p_3$ elsewhere in the image]
Buades et al. A non-local algorithm for image denoising. CVPR, 2005.
Non-local Operator
Operates in feature space; can be embedded into any ConvNet.
$$z_i = \frac{1}{D(y)} \sum_j f(y_i, y_j)\, g(y_j)$$
where $f(y_i, y_j)$ gives the affinity between positions $i$ and $j$, and $g(y_j)$ gives the features.
Non-local Operator
[Diagram: input $y$ of shape $T \times H \times W \times 512$; two 1×1×1 convs $\theta$ and $\phi$ map it to $THW \times 512$ and $512 \times THW$ matrices, whose product is the $THW \times THW$ affinity matrix]
$$z_i = \frac{1}{D(y)} \sum_j f(y_i, y_j)\, g(y_j)$$
Non-local Operator
[Diagram: the $THW \times THW$ affinity matrix is normalized row-wise]
$$z_i = \frac{1}{D(y)} \sum_j f(y_i, y_j)\, g(y_j)$$
$$f(y_i, y_j) = \exp(y_i^\top y_j), \qquad D(y) = \sum_j f(y_i, y_j)$$
$$\frac{f(y_i, y_j)}{D(y)} = \frac{\exp(y_i^\top y_j)}{\sum_j \exp(y_i^\top y_j)}$$
Non-local Operator
[Diagram: a third 1×1×1 conv $g$ maps $y$ to a $THW \times 512$ matrix; the normalized $THW \times THW$ affinity multiplies it to give a $THW \times 512$ output, reshaped back to $T \times H \times W \times 512$]
$$z_i = \frac{1}{D(y)} \sum_j f(y_i, y_j)\, g(y_j)$$
Non-local Operator as A Residual Block
[Diagram: Video, 3D Conv, 3D Conv, Non-local, 3D Conv, Non-local, Action Class]
$$o_i = z_i W + y_i$$
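The block above can be sketched in a few lines of NumPy. This is an illustrative re-implementation, not the authors' code: the 1×1×1 convolutions are written as plain matrix multiplies on flattened space-time features, and all shapes and variable names are assumptions.

```python
import numpy as np

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def nonlocal_block(y, W_theta, W_phi, W_g, W_out):
    """Non-local block on flattened space-time features y: (THW, C).
    Embedded-Gaussian affinity f(y_i, y_j) = exp(theta_i . phi_j),
    normalized by D(y) = sum_j f(y_i, y_j), i.e. a row-wise softmax."""
    theta, phi, g = y @ W_theta, y @ W_phi, y @ W_g   # (THW, C') each
    attn = softmax(theta @ phi.T, axis=-1)            # (THW, THW) affinity
    z = attn @ g                                      # affinity-weighted average
    return z @ W_out + y                              # residual: o_i = z_i W + y_i

rng = np.random.default_rng(0)
THW, C, Cp = 8, 16, 8
y = rng.standard_normal((THW, C))
Ws = [0.1 * rng.standard_normal(s)
      for s in [(C, Cp), (C, Cp), (C, Cp), (Cp, C)]]
out = nonlocal_block(y, *Ws)
print(out.shape)  # (8, 16)
```

Note that with the output weights zeroed, the residual connection makes the block an identity map, which is what lets it be dropped into a pretrained ConvNet without disturbing it.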
Examples
Action Recognition in Daily Lives
Gunnar A. Sigurdsson, Gül Varol, Xiaolong Wang, Ivan Laptev, Ali Farhadi, Abhinav Gupta. Hollywood in Homes: Crowdsourcing Data Collection for Activity Understanding. ECCV 2016.
Charades Dataset: 157 classes, 9.8k videos, 30 seconds per video. We let people upload their own videos!
Action Recognition on Charades
Method, mAP:
3D Conv: 31.8%
3D Conv + Non-local: 33.5%
Opening A Book
The Non-local Block
Opening A Book
Object states change over time; there are human-object and object-object interactions.
[Figure: region pairs (A1, B1) through (A4, B4) across four frames are highly correlated]
Relations between Regions
$$F(y_i, y_j) = \phi(y_i)^\top \phi'(y_j)$$
$$H_{ij} = \frac{\exp F(y_i, y_j)}{\sum_j \exp F(y_i, y_j)}$$
Graph Convolutional Network
$$Z = H Y W$$
[Diagram: matrix product of the adjacency $H$, the region features $Y$, and the weights $W$]
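A single propagation step of this graph convolution is just two matrix multiplies. The sketch below is a minimal NumPy illustration; the adjacency here is a random row-normalized matrix standing in for the learned softmax affinity, and the dimensions are assumptions.

```python
import numpy as np

def gcn_layer(H, Y, W):
    """One propagation step of the graph convolution: Z = H Y W.
    H: (N, N) normalized affinity/adjacency over N region nodes,
    Y: (N, d) region features, W: (d, d_out) learnable weights."""
    return H @ Y @ W

rng = np.random.default_rng(0)
N, d, d_out = 4, 6, 3
Y = rng.standard_normal((N, d))
W = rng.standard_normal((d, d_out))
A = rng.random((N, N))
H = A / A.sum(axis=1, keepdims=True)  # rows sum to 1, like the softmax affinity
Z = gcn_layer(H, Y, W)
print(Z.shape)  # (4, 3)
```

With $H$ equal to the identity, the layer reduces to an ordinary per-node linear map $Y W$; the off-diagonal affinities are what propagate information between regions.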
Graph Convolutional Network
Propagation
Connecting Non-local and GCN
$$o_i = z_i W + y_i$$
$$z_i = \frac{1}{D(y)} \sum_j f(y_i, y_j)\, g(y_j) = \sum_j \frac{f(y_i, y_j)}{\sum_j f(y_i, y_j)}\, g(y_j) = \sum_j H_{ij}\, g(y_j)$$
The Non-local Operator: $o_i = \sum_j H_{ij}\, g(y_j)\, W + y_i$
The Graph Convolution: $Z = H Y W + Y$
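This equivalence can be checked numerically: computing the non-local output position by position and computing the graph-convolution propagation with the softmax-normalized affinity give the same result. A small NumPy check follows; taking $g$ to be a linear map is an assumption for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 5, 4
Y = rng.standard_normal((N, d))
Wg = rng.standard_normal((d, d))   # g(.) as a linear map (an assumption)
gY = Y @ Wg                        # g(y_j) for every position j
f = np.exp(Y @ Y.T)                # f(y_i, y_j) = exp(y_i^T y_j)

# Non-local form, written per position: z_i = sum_j f(y_i,y_j)/D(y) * g(y_j)
z_nonlocal = np.stack([
    sum(f[i, j] / f[i].sum() * gY[j] for j in range(N)) for i in range(N)
])

# Graph-convolution form: Z = H (Y Wg), with H the softmax-normalized affinity
H = f / f.sum(axis=1, keepdims=True)
z_gcn = H @ gY
print(np.allclose(z_nonlocal, z_gcn))  # True
```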
Action Recognition on Charades
Method, mean AP:
3D Conv: 31.8%
3D Conv + Non-local: 33.5%
3D Conv + Region Graph: 36.2% (+4.4%)
Action Recognition on Charades
[Chart: mAP from 30% to 45% on classes that involve objects (yes vs. no), comparing 3D Conv and 3D Conv + Graph]
Action Recognition on Charades
[Chart: mAP from 30% to 45% across pose variance, comparing 3D Conv and 3D Conv + Graph]
Connection to Mean-Shift
The Non-local Operator:
$$z_i = \sum_j \frac{f(y_i, y_j)}{\sum_j f(y_i, y_j)}\, g(y_j)$$
The Mean-Shift Clustering:
$$m(y) = \frac{\sum_{y_j \in N(y)} K(y, y_j)\, y_j}{\sum_{y_j \in N(y)} K(y, y_j)}$$
Converging to the same mean?
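The mean-shift update has the same shape as the non-local average: a kernel-weighted mean of neighbors, normalized by the total kernel weight. A minimal NumPy sketch, assuming a Gaussian kernel over all points (so the neighborhood $N(y)$ is implicit in the kernel's decay) and toy 2-D data:

```python
import numpy as np

def mean_shift_step(y, points, bandwidth=1.0):
    """One mean-shift update m(y): kernel-weighted average of the points,
    sum_j K(y, y_j) y_j divided by sum_j K(y, y_j), Gaussian kernel K."""
    d2 = ((points - y) ** 2).sum(axis=1)
    K = np.exp(-d2 / (2.0 * bandwidth ** 2))
    return (K[:, None] * points).sum(axis=0) / K.sum()

rng = np.random.default_rng(0)
# Two tight clusters, centered near (0, 0) and (5, 5).
pts = np.vstack([rng.normal(0.0, 0.2, (50, 2)),
                 rng.normal(5.0, 0.2, (50, 2))])
y = np.array([4.0, 4.0])
for _ in range(20):
    y = mean_shift_step(y, pts)
print(y)  # converges toward the cluster mean near (5, 5)
```

Iterating the update moves the query to a local mode of the point density, which is the sense in which the slide asks whether the non-local average converges to the same mean.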
Figure credit: https://tw.rpi.edu/web/project/JeffersonProjectAtLakeGeorge/Clustering
Recent Related Work
Actor-Centric Relation Network [Sun et al, 2018] Video Action Transformer Network [Girdhar et al, 2019] Long-Term Feature Banks for Detailed Video Understanding [Wu et al, 2019]
Learning Affinity with Semantic Supervision
Learn Correspondence without Human Supervision
Goal:
The visual world exhibits continuity
Prior Work: Learning from Time
Predict Color in Time [Vondrick et al, 2018]
Inputs Outputs
Predict Pixel in Time [Mathieu et al, 2015] Predict Arrow of Time [Wei et al, 2018]
Using Tracking to Learn Features
CNN CNN
Similarity
Tracking → Similarity [Wang et al, 2015]
Limited by Off-the-shelf Trackers
Similarity requires tracking; tracking requires similarity.
Let's jointly learn both!
Learning to Track
How to obtain supervision?
[Diagram: a tracker $\mathcal{F}$ applied between consecutive frames]
$\mathcal{F}$: a deep tracker
Supervision: Cycle-Consistency in Time
Track backward; then track forward, back to the future.
[Diagram: $\mathcal{F}$ applied backward and then forward along the cycle]
Backpropagation through time along the cycle
Differentiable Tracking
Encoder $\phi$
Patch feature at time $t$: $y_t^p$
Image feature at time $t-1$: $y_{t-1}^I$
[Diagram: the transposed patch features (100 locations) multiply the image features (900 locations) to form a $900 \times 100$ affinity matrix]
Spatial Transformer: cropping
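The core of the tracker is this affinity computation between image-feature locations and patch-feature locations. The NumPy sketch below is an illustration only: the actual method localizes the patch with a spatial transformer that crops the image feature, whereas here that step is replaced by a soft affinity-weighted average; the 900 and 100 location counts are taken from the slide, and all names are assumptions.

```python
import numpy as np

def soft_track(img_feat, patch_feat):
    """Soft-matching sketch of the tracker: compute an affinity between
    every image location and every patch location, normalize over image
    locations, and read out the matched image features.
    img_feat: (900, C) image locations at time t-1;
    patch_feat: (100, C) patch locations at time t."""
    A = img_feat @ patch_feat.T                   # (900, 100) affinity
    A = np.exp(A - A.max(axis=0, keepdims=True))
    A = A / A.sum(axis=0, keepdims=True)          # softmax over image locations
    return A.T @ img_feat                         # (100, C) localized patch

rng = np.random.default_rng(0)
img = rng.standard_normal((900, 16))
patch = rng.standard_normal((100, 16))
prev_patch = soft_track(img, patch)
print(prev_patch.shape)  # (100, 16)
```

Because every output row is a convex combination of image-feature rows, the whole operation is differentiable, which is what allows backpropagation through the tracking step.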
Differentiable Tracking
Encoder $\phi$
Patch feature at time $t$: $y_t^p$
Image feature at time $t-1$: $y_{t-1}^I$
Output patch feature at time $t-1$: $y_{t-1}^p$, obtained by spatial-transformer cropping
Differentiable Tracking
Encoder $\phi$
$$y_{t-1}^p = \mathcal{F}(y_{t-1}^I,\; y_t^p)$$
Recurrent Tracking
[Diagram: starting from the patch feature $y_t^p$, the tracker $\mathcal{F}$ is applied recurrently backward through times $t-1$, $t-2$, $t-3$, then forward again to time $t$, where the cycle-consistency loss $\ell_{cycle}$ is computed]
Cycle-Consistency Loss Function
$$\ell_{cycle} = \left\lVert\, y_t^{s} - y_t^{p} \,\right\rVert_2^2$$
where $y_t^p$ is the starting patch feature and $y_t^s$ is the patch feature re-localized after tracking around the cycle.
[Diagram: $\mathcal{F}$ applied at each step along the cycle]
Multiple Cycles
Sub-cycles: a natural curriculum
Skip Cycles
Skip-cycles: skipping occlusions
Visualization of Training
Test Time: Nearest Neighbors in Feature Space
[Figure: nearest-neighbor matches between frames $t-1$ and $t$]
Instance Mask Tracking
DAVIS Dataset
DAVIS Dataset: Pont-Tuset et al. The 2017 DAVIS Challenge on Video Object Segmentation. 2017.
Pose Keypoint Tracking
JHMDB Dataset
Comparison
Our Correspondence Optical Flow
Pose Keypoint Tracking
JHMDB Dataset
Method, PCK@0.1:
Optical Flow: 45%
Vondrick et al.: 45%
Ours: 58%
Vondrick et al. Tracking Emerges by Colorizing Videos. ECCV 2018.
Texture Tracking
DAVIS Dataset
DAVIS Dataset: Pont-Tuset et al. The 2017 DAVIS Challenge on Video Object Segmentation. 2017.
Semantic Mask Tracking
Video Instance Parsing Dataset
Zhou et al. Adaptive Temporal Encoding Network for Video Instance-level Human Parsing. ACM MM 2018.
Questions?