SLIDE 1

Learning Graph Representations for Video Understanding

Xiaolong Wang

Carnegie Mellon University

SLIDE 2

Computer Vision

[Figure: Mask R-CNN detections (e.g. "Dog") and DensePose results]

He et al. Mask R-CNN. ICCV 2017. Güler et al. DensePose: Dense Human Pose Estimation In The Wild. CVPR 2018.

SLIDE 3

Deep Learning

[Figure: ImageNet images mapped to labels such as Mushroom, Dog, Ant, Jelly Fungus, Nest]

ImageNet

Train a Convolutional Neural Network

Russakovsky et al. ImageNet Large Scale Visual Recognition Challenge. 2014.

SLIDE 4

Convolutional Neural Networks

Figure credit: Van Den Oord et al.

  • Convolution is a local operation
  • Long-range pairwise relations are not modeled
SLIDE 5

Related Work: Relation Networks

[Santoro et al, 2017]

SLIDE 6

Related Work: Self-Attention

[Vaswani et al, 2017]

SLIDE 7

Related Work: Graph Convolution Networks

[Kipf et al, 2017]

SLIDE 8

This Tutorial

  • Draw connections among different graph/relation networks
  • In the context of video understanding
  • Both supervised and self-supervised methods
SLIDE 9

Video Recognition

[Pipeline: Video β†’ 3D Conv β†’ 3D Conv β†’ 3D Conv β†’ "Playing Soccer"]

SLIDE 10

Reasoning for Action Recognition

  • X. Wang , R. Girshick , A. Gupta, and K. He. Non-local Neural Networks. CVPR 2018.

Long-range explicit reasoning

SLIDE 11

Non-local Means

π‘ž π‘Ÿ2 π‘Ÿ1 π‘Ÿ3

Buades et al. A non-local algorithm for image denoising. CVPR, 2005.

SLIDE 12

Non-local Operator

Operates in feature space; can be embedded into any ConvNet. [Figure: feature map with positions $x_i$, $x_j$]

SLIDE 13

Non-local Operator

𝑦𝑗 π‘¦π‘˜

Affinity Features

𝑧𝑗 = 1 𝐷(𝑦)

βˆ€π‘˜

𝑔 𝑦𝑗, π‘¦π‘˜ 𝑕(π‘¦π‘˜)

slide-14
SLIDE 14

Non-local Operator

[Figure: input $x$ of shape $T \times H \times W \times 512$; 1Γ—1Γ—1 convs $\theta$ and $\phi$ embed $x$, reshaped to $THW \times 512$ and $512 \times THW$; their product is a $THW \times THW$ affinity matrix]

$y_i = \frac{1}{C(x)} \sum_{\forall j} f(x_i, x_j)\, g(x_j)$
SLIDE 15

Non-local Operator

[Figure: same build as Slide 14, computing the $THW \times THW$ affinity matrix]
SLIDE 16

Non-local Operator

[Figure: the $THW \times THW$ affinity matrix is normalized row-wise]

$y_i = \frac{1}{C(x)} \sum_{\forall j} f(x_i, x_j)\, g(x_j)$
SLIDE 17

Non-local Operator

[Figure: affinity computation with row-wise normalization]

$y_i = \frac{1}{C(x)} \sum_{\forall j} f(x_i, x_j)\, g(x_j)$

$f(x_i, x_j) = \exp(x_i^\top x_j), \qquad C(x) = \sum_{\forall j} f(x_i, x_j)$

$\frac{f(x_i, x_j)}{C(x)} = \frac{\exp(x_i^\top x_j)}{\sum_{\forall j} \exp(x_i^\top x_j)}$, a softmax over all positions $j$
SLIDE 18

Non-local Operator

[Figure: a third 1Γ—1Γ—1 conv $g$ embeds $x$ to $THW \times 512$; the normalized $THW \times THW$ affinity multiplies $g(x)$, giving a $THW \times 512$ output]

$y_i = \frac{1}{C(x)} \sum_{\forall j} f(x_i, x_j)\, g(x_j)$
SLIDE 19

Non-local Operator

[Figure: the full non-local operator, with the $\theta$, $\phi$, $g$ branches, the normalized affinity, and the output reshaped back to $T \times H \times W \times 512$]
SLIDE 20

Non-local Operator as a Residual Block

[Pipeline: Video β†’ 3D Conv β†’ Non-local β†’ 3D Conv β†’ Non-local β†’ Action Class]

$z_i = W_z\, y_i + x_i$
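The residual non-local block can be sketched in NumPy. This is a minimal illustration under stated assumptions, not the paper's implementation: the 1Γ—1Γ—1 convolutions become plain matrix multiplies on flattened THWΓ—C features, the embedded-Gaussian affinity is used, and the weight names (W_theta, W_phi, W_g, W_z) are made up for the sketch.

```python
import numpy as np

def softmax(a, axis=-1):
    # numerically stable softmax
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def nonlocal_block(x, W_theta, W_phi, W_g, W_z):
    """x: (THW, C) flattened space-time features."""
    theta, phi, g = x @ W_theta, x @ W_phi, x @ W_g  # the 1x1x1 conv branches
    f = softmax(theta @ phi.T, axis=-1)              # (THW, THW) normalized affinity
    y = f @ g                                        # aggregate features from all positions
    return y @ W_z + x                               # residual: z_i = W_z y_i + x_i

rng = np.random.default_rng(0)
THW, C = 6, 4
x = rng.normal(size=(THW, C))
Ws = [rng.normal(size=(C, C)) * 0.1 for _ in range(4)]
z = nonlocal_block(x, *Ws)
print(z.shape)  # (6, 4)
```

Because the output has the same shape as the input and ends with a residual connection, the block can be dropped between any two layers of an existing ConvNet.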

SLIDE 21

Examples

SLIDE 22

Action Recognition in Daily Lives

Gunnar A. Sigurdsson, Gül Varol, Xiaolong Wang, Ivan Laptev, Ali Farhadi, Abhinav Gupta. Hollywood in Homes: Crowdsourcing Data Collection for Activity Understanding. ECCV 2016.

Charades Dataset: 157 classes, 9.8k videos, 30 s per video. We let people upload their own videos!

SLIDE 23

Action Recognition on Charades

Method               mAP
3D Conv              31.8%
3D Conv + Non-local  33.5%

SLIDE 24

Opening A Book


SLIDE 25

Opening A Book


The Non-local Block

SLIDE 26

Opening A Book

Object states change over time; there are human-object and object-object interactions.

  • X. Wang and A. Gupta. Video as Space-Time Region Graphs. ECCV 2018.
SLIDE 27

Opening A Book

[Figure: region tracklets $A_1 \ldots A_4$ and $B_1 \ldots B_4$ across frames]

Highly Correlated

SLIDE 28

Relations between Regions

SLIDE 29

Relations between Regions

$f(x_i, x_j) = \phi(x_i)^\top \phi'(x_j)$

$G_{ij} = \frac{\exp f(x_i, x_j)}{\sum_{\forall j} \exp f(x_i, x_j)}$

SLIDE 30

Graph Convolutional Network

π‘Ž = π»π‘Œπ‘‹

  • Kipf and Welling. Semi-Supervised Classification with Graph Convolutional Networks. ICLR 2017.

[Figure: $G$ ($N \times N$) Γ— $X$ ($N \times d$) Γ— $W$ ($d \times d$) = $Z$ ($N \times d$)]
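One GCN propagation step, $Z = GXW$, can be sketched as follows, assuming a row-normalized adjacency matrix $G$; the function and variable names are illustrative, not from any library.

```python
import numpy as np

def gcn_layer(G, X, W):
    """One graph-convolution step: Z = G X W.
    G: (N, N) row-normalized adjacency, X: (N, d) node features, W: (d, d) weights."""
    return G @ X @ W

rng = np.random.default_rng(0)
N, d = 5, 3
A = rng.random((N, N))
G = A / A.sum(axis=1, keepdims=True)  # row-normalize so each node averages its neighbors
X = rng.normal(size=(N, d))
W = rng.normal(size=(d, d))
Z = gcn_layer(G, X, W)
print(Z.shape)  # (5, 3)
```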

SLIDE 31

Graph Convolutional Network


Propagation

SLIDE 32

Connecting Non-local and GCN

The Non-local Operator:

$z_i = W_z\, y_i + x_i, \qquad y_i = \frac{1}{C(x)} \sum_{\forall j} f(x_i, x_j)\, g(x_j) = \sum_{\forall j} \frac{f(x_i, x_j)}{\sum_{\forall j} f(x_i, x_j)}\, g(x_j) = \sum_{\forall j} G_{ij}\, g(x_j)$

The Graph Convolution:

$z_i = W_z \sum_{\forall j} G_{ij}\, g(x_j) + x_i \quad\Longleftrightarrow\quad Z = G\, g(X)\, W + X$

SLIDE 33

Action Recognition on Charades

Method                  mean AP
3D Conv                 31.8%
3D Conv + Non-local     33.5%
3D Conv + Region Graph  36.2%  (+4.4% over 3D Conv)

SLIDE 34

Action Recognition on Charades

[Bar chart: mAP (30%–45%) for 3D Conv vs. 3D Conv + Graph, grouped by whether the action involves objects (No / Yes)]

SLIDE 35

Action Recognition on Charades

[Bar chart: mAP (30%–45%) for 3D Conv vs. 3D Conv + Graph across pose variances]

SLIDE 36

Connection to Mean-Shift

The Non-local Operator: $y_i = \sum_{\forall j} \frac{f(x_i, x_j)}{\sum_{\forall j} f(x_i, x_j)}\, g(x_j)$

Mean-Shift Clustering: $m(x) = \frac{\sum_{x_j \in N(x)} K(x, x_j)\, x_j}{\sum_{x_j \in N(x)} K(x, x_j)}$

Converging to the same mean?
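For comparison, one mean-shift update can be sketched as a toy example with a Gaussian kernel; the function name and bandwidth value are assumptions for the sketch.

```python
import numpy as np

def mean_shift_step(x, points, bandwidth=1.0):
    """One mean-shift update: m(x) = sum_j K(x, x_j) x_j / sum_j K(x, x_j)."""
    K = np.exp(-np.sum((points - x) ** 2, axis=1) / (2 * bandwidth ** 2))
    return (K[:, None] * points).sum(axis=0) / K.sum()

rng = np.random.default_rng(0)
# two well-separated clusters
points = np.concatenate([rng.normal(0, 0.1, (20, 2)), rng.normal(5, 0.1, (20, 2))])
x = np.array([0.5, 0.5])
for _ in range(20):
    x = mean_shift_step(x, points)
# x drifts toward the mean of the nearby cluster around (0, 0)
print(x)
```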

https://tw.rpi.edu/web/project/JeffersonProjectAtLakeGeorge/Clustering
SLIDE 37

Recent Related Work

Actor-Centric Relation Network [Sun et al, 2018]
Video Action Transformer Network [Girdhar et al, 2019]
Long-Term Feature Banks for Detailed Video Understanding [Wu et al, 2019]

SLIDE 38

Learning Affinity with Semantic Supervision

SLIDE 39

Goal: Learn Correspondence without Human Supervision

SLIDE 40

The visual world exhibits continuity

SLIDE 41

Prior Work: Learning from Time

Predict Color in Time [Vondrick et al, 2018]
Predict Pixels in Time [Mathieu et al, 2015]
Predict Arrow of Time [Wei et al, 2018]

SLIDE 42

Using Tracking to Learn Features

[Figure: two tracked patches pass through a shared CNN; training maximizes their feature similarity]

Tracking β†’ Similarity [Wang et al, 2015]

SLIDE 43

Using Tracking to Learn Features

[Figure: two tracked patches pass through a shared CNN; training maximizes their feature similarity]

Tracking β†’ Similarity [Wang et al, 2015]

Limited by Off-the-shelf Trackers

SLIDE 44

Similarity requires tracking. Tracking requires similarity.

Let’s jointly learn both!

SLIDE 45

Learning to Track

How to obtain supervision?

[Figure: the tracker β„± applied between consecutive frames]

β„±: a deep tracker

SLIDE 46

Supervision: Cycle-Consistency in Time

Track backwards. Track forwards, back to the future.

[Figure: β„± applied backwards and then forwards along the cycle]

SLIDE 47

Supervision: Cycle-Consistency in Time

Backpropagation through time along the cycle

[Figure: β„± applied backwards and then forwards along the cycle]

SLIDE 48

Differentiable Tracking

Encoder Ο† maps the patch at time $t$ to feature $x_t^p$ and the image at time $t-1$ to feature $x_{t-1}^I$.

[Figure: $x_{t-1}^I$ ($900 \times c$) Γ— $(x_t^p)^\top$ ($c \times 100$) = affinity matrix ($900 \times 100$)]

SLIDE 49

Differentiable Tracking

[Figure: Encoder Ο† computes the patch feature $x_t^p$ at time $t$ and the image feature $x_{t-1}^I$ at time $t-1$; a differentiable spatial transformer ΞΈ crops the image to give the patch feature $x_{t-1}^p$]
SLIDE 50

Differentiable Tracking

$x_{t-1}^p = \mathcal{F}(x_{t-1}^I,\, x_t^p)$

[Figure: the tracker $\mathcal{F}$ combines the encoder Ο†, the affinity computation, and the spatial transformer ΞΈ]
SLIDE 51

Recurrent Tracking

[Figure: applying $\mathcal{F}$ recurrently: track the patch backwards from time $t$ to $t-1$, $t-2$, $t-3$, then forwards again, ending in the cycle-consistency loss $\mathcal{L}_{cycle}$]
SLIDE 52

Cycle-Consistency Loss Function

$\mathcal{L}_{cycle} = \|\, \mathrm{loc}(x_t^p) - \mathrm{loc}(\hat{x}_t^p) \,\|_2^2$

where $\hat{x}_t^p$ is the patch obtained after tracking backwards and then forwards along the cycle with $\mathcal{F}$.

[Figure: the cycle of $\mathcal{F}$ applications from $x_t^p$ back to $\hat{x}_t^p$]
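A toy version of this loss, assuming a loc(Β·) step has already produced 2-D patch coordinates (the helper name and coordinate values are hypothetical):

```python
import numpy as np

def cycle_loss(start_loc, cycled_loc):
    """L_cycle = || loc(x_t^p) - loc(x_hat_t^p) ||_2^2: squared distance between the
    patch's initial location and its location after tracking back and forth."""
    return float(np.sum((np.asarray(start_loc) - np.asarray(cycled_loc)) ** 2))

print(cycle_loss([10.0, 20.0], [10.0, 20.0]))  # perfect cycle -> 0.0
print(cycle_loss([10.0, 20.0], [13.0, 16.0]))  # drift of (3, -4) -> 25.0
```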

SLIDE 53

Multiple Cycles

Sub-cycles: a natural curriculum

SLIDE 54

Skip Cycles

Skip-cycles: skipping occlusions

SLIDE 55

Visualization of Training

SLIDE 56

Test Time: Nearest Neighbors in Feature Space

[Frames $t-1$ and $t$]
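Propagating labels by nearest neighbors in feature space can be sketched as a toy example; in practice the features come from the learned encoder, and the tiny hand-written arrays here are purely illustrative.

```python
import numpy as np

def propagate_labels(feat_prev, labels_prev, feat_cur):
    """Give each position in frame t the label of its nearest neighbor
    (in feature space) among positions in frame t-1."""
    # pairwise squared distances between current and previous features
    d2 = ((feat_cur[:, None, :] - feat_prev[None, :, :]) ** 2).sum(-1)
    return labels_prev[d2.argmin(axis=1)]

feat_prev = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]])
labels_prev = np.array([0, 1, 2])
feat_cur = np.array([[0.1, 0.1], [1.9, 2.1]])
print(propagate_labels(feat_prev, labels_prev, feat_cur))  # [0 2]
```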

SLIDE 57

Test Time: Nearest Neighbors in Feature Space

[Frames $t-1$ and $t$]

SLIDE 58

Instance Mask Tracking

DAVIS Dataset

DAVIS Dataset: Pont-Tuset et al. The 2017 DAVIS Challenge on Video Object Segmentation. 2017.

SLIDE 59

Pose Keypoint Tracking

JHMDB Dataset

SLIDE 60

Comparison

Our Correspondence vs. Optical Flow

SLIDE 61

Pose Keypoint Tracking

JHMDB Dataset

Method           PCK@0.1
Optical Flow     45%
Vondrick et al.  45%
Ours             58%

Vondrick et al. Tracking Emerges by Colorizing Videos. ECCV 2018.

SLIDE 62

Texture Tracking

DAVIS Dataset

DAVIS Dataset: Pont-Tuset et al. The 2017 DAVIS Challenge on Video Object Segmentation. 2017.

SLIDE 63

Semantic Mask Tracking

Video Instance Parsing Dataset

Zhou et al. Adaptive Temporal Encoding Network for Video Instance-level Human Parsing. ACM MM 2018.

SLIDE 64

Questions?