Towards efficient end-to-end architectures for action recognition - PowerPoint PPT Presentation



SLIDE 1

Limin Wang, Computer Vision Laboratory, ETH Zurich

17/4/12 Limin Wang (CVL ETHZ) 1

Towards efficient end-to-end architectures for action recognition and detection in videos

SLIDE 2

1. Overview of action understanding
2. Temporal segment networks
3. Structured segment networks
4. UntrimmedNets
5. Conclusion


Outline

SLIDE 3

Action recognition in videos

1. Action recognition "in the lab": KTH, Weizmann, etc.
2. Action recognition "in TV and movies": UCF Sports, Hollywood, etc.
3. Action recognition "in web videos": HMDB, UCF101, THUMOS, ActivityNet, etc.

Haroon Idrees et al., The THUMOS Challenge on Action Recognition for Videos "in the Wild", in Computer Vision and Image Understanding (CVIU), 2017.

SLIDE 4

How to define action categories

Heilbron F. C. et al. ActivityNet: A Large-Scale Video Benchmark for Human Activity Understanding, in CVPR, 2015.

SLIDE 5

How to define action categories

Heilbron F. C. et al. ActivityNet: A Large-Scale Video Benchmark for Human Activity Understanding, in CVPR, 2015.

SLIDE 6

How to label videos

Heilbron F. C. et al. ActivityNet: A Large-Scale Video Benchmark for Human Activity Understanding, in CVPR, 2015.

SLIDE 7

Dataset overview

SLIDE 8

• Action recognition: classify a short clip or untrimmed video into a pre-defined class.
• Temporal action localization: detect the start and end times of action instances in an untrimmed video.
• Spatial action detection: detect the bounding boxes of actors in trimmed videos.
• Spatio-temporal action detection: combine temporal and spatial localization in untrimmed videos.


Action understanding

SLIDE 9

Action recognition – STIP+HOG/HOF (03, 08)

[1] Ivan Laptev and Tony Lindeberg, Space-time Interest Points, in ICCV, 2003.
[2] Ivan Laptev, Marcin Marszałek, Cordelia Schmid, and Benjamin Rozenfeld, Learning realistic human actions from movies, in CVPR, 2008.

SLIDE 10

Action recognition – Dense Trajectories (11, 13)

[1] Heng Wang, Alexander Klaser, Cordelia Schmid, and Cheng-Lin Liu, Action Recognition by Dense Trajectories, in CVPR, 2011.
[2] Heng Wang and Cordelia Schmid, Action Recognition with Improved Trajectories, in ICCV, 2013.

SLIDE 11

Action recognition – two stream CNN (2014)

Karen Simonyan and Andrew Zisserman, Two-Stream Convolutional Networks for Action Recognition in Videos, in NIPS, 2014.

SLIDE 12

Action recognition – 3D CNN (2015)

Du Tran et al. Learning Spatiotemporal Features with 3D Convolutional Networks, in ICCV, 2015.

SLIDE 13

Action recognition – TDD (2015)

Limin Wang, Yu Qiao, and Xiaoou Tang, Action Recognition with Trajectory-Pooled Deep-Convolutional Descriptors, in CVPR, 2015.

SLIDE 14

1. Overview of action understanding
2. Temporal segment networks
3. Structured segment networks
4. UntrimmedNets
5. Conclusion


Outline

SLIDE 15

• Goal: an end-to-end, video-level architecture.
• Modeling issue: mainstream CNN frameworks focus on appearance and short-term motion.
• Learning issue: current action datasets are relatively small, so training deep CNNs is difficult.


Motivation of TSN

SLIDE 16

Overview of TSN

TSN is a video-level framework based on simple strategies of segment sampling and consensus aggregation.

SLIDE 17

• Our segment sampling builds on the fact that video data are highly redundant.
• It has two properties:
  • Sparse: processing efficiency.
  • Duration-invariant: a video-level framework that models the entire video content.


Segment sampling
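As a rough sketch, the sparse, duration-invariant sampling can be written as follows (a minimal illustration, not the released TSN code; the function name and the `num_segments=3` default are assumptions matching the paper's setting):

```python
import random

def sample_snippets(num_frames, num_segments=3, train=True):
    """Split the video into num_segments equal segments and pick one
    snippet index per segment: random within the segment for training,
    the segment center for testing. Any video length yields the same
    number of snippets (duration invariance)."""
    seg_len = num_frames / num_segments
    indices = []
    for k in range(num_segments):
        start = int(k * seg_len)
        end = max(start, int((k + 1) * seg_len) - 1)
        indices.append(random.randint(start, end) if train else (start + end) // 2)
    return indices
```

Sampling only `num_segments` snippets per video is what keeps the processing cost independent of video length.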

SLIDE 18

• The aggregation function summarizes the predictions of the different snippets into a video-level prediction.
• Simple aggregation functions: mean pooling, max pooling, weighted average.
• Advanced aggregation functions: top-k pooling, attention weighting.


Aggregation Function
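The listed aggregation functions can be sketched in a few lines (an illustration with assumed names, not the paper's implementation):

```python
import numpy as np

def aggregate(snippet_scores, method="mean", k=3):
    """Aggregate per-snippet class scores, shape [num_snippets, num_classes],
    into a single video-level score vector."""
    s = np.asarray(snippet_scores, dtype=float)
    if method == "mean":
        return s.mean(axis=0)
    if method == "max":
        return s.max(axis=0)
    if method == "topk":  # average the k largest scores per class
        return np.sort(s, axis=0)[-k:].mean(axis=0)
    raise ValueError(method)
```

Mean pooling uses every snippet equally, max pooling keys on the single most confident snippet, and top-k pooling interpolates between the two.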

SLIDE 19

Formulation of TSN
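The formula on this slide is an image in the original deck; for reference, the video-level formulation from the TSN paper (ECCV 2016) can be written as:

```latex
\mathrm{TSN}(T_1, T_2, \ldots, T_K) =
  \mathcal{H}\Big(\mathcal{G}\big(\mathcal{F}(T_1; \mathbf{W}),
  \mathcal{F}(T_2; \mathbf{W}), \ldots, \mathcal{F}(T_K; \mathbf{W})\big)\Big)
```

where $T_k$ is the snippet sampled from the $k$-th segment, $\mathcal{F}(\cdot;\mathbf{W})$ is the ConvNet with shared weights $\mathbf{W}$ producing snippet-level class scores, $\mathcal{G}$ is the segmental consensus (aggregation) function, and $\mathcal{H}$ (softmax) yields the video-level prediction.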

SLIDE 20

• The original two-stream CNNs take an RGB image and stacked optical flow fields as input.
• We study two other input modalities:
  • Stacked RGB differences: an approximation of motion information that is efficient to compute.
  • Stacked warped optical flow: removes background (camera) motion.


Input modalities
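A stacked RGB difference is just consecutive-frame subtraction, which is why it is so cheap compared with optical flow (a minimal sketch; the function name is an assumption):

```python
import numpy as np

def rgb_diff_stack(frames):
    """Stacked RGB differences as a cheap motion approximation.
    frames: array [T, H, W, 3] -> differences [T-1, H, W, 3].
    The signed int cast avoids uint8 wrap-around on subtraction."""
    f = np.asarray(frames, dtype=np.int16)
    return f[1:] - f[:-1]
```

Unlike optical flow, this costs one array subtraction per frame pair, so the temporal stream can run at near-RGB speed.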

SLIDE 21

• Cross-modality pre-training: initialize both the spatial and the temporal nets from ImageNet models.
• Partial batch normalization: re-estimate only the parameters of the first BN layer.
• Smaller learning rate: since the nets are pre-trained, fine-tune with a smaller learning rate.
• Data augmentation: corner cropping, scale jittering, and aspect-ratio jittering.
• High dropout ratio: 0.7 for the temporal net and 0.8 for the spatial net.


Good practices

SLIDE 22

Experiment result -- training method

SLIDE 23

Experiment result -- input modality

SLIDE 24

Experiment result -- TSN framework

SLIDE 25

Experiment result -- Comparison with the state of the art

SLIDE 26

ActivityNet Challenge -- 2016

SLIDE 27

Model Visualization

SLIDE 28

1. Overview of action understanding
2. Temporal segment networks
3. Structured segment networks
4. UntrimmedNets
5. Conclusion


Outline

SLIDE 29

Motivation of Structured Segment Network

1. Action detection in untrimmed videos is an important problem.
2. A snippet-level classifier struggles to localize the temporal extent of an action instance accurately. We therefore need context and structure modeling!

SLIDE 30

Structured Segment Network

SLIDE 31

• Given a proposal, we extend its temporal duration by augmentation (context modeling).
• Specifically, for a proposal [s, e] with duration d = e - s, the temporal extension [s', e'] is: s' = s - d/2, e' = e + d/2.


Three Stage Augmentation
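The extension rule above is a one-liner in code (a sketch; the function name and the optional boundary clipping are assumptions, not from the slide):

```python
def extend_proposal(s, e, video_len=None):
    """Extend a proposal [s, e] by half its duration on each side
    (context augmentation): s' = s - d/2, e' = e + d/2."""
    d = e - s
    s2, e2 = s - d / 2.0, e + d / 2.0
    if video_len is not None:  # clip to the video boundaries
        s2, e2 = max(0.0, s2), min(float(video_len), e2)
    return s2, e2
```

The extended span is twice the original duration, giving the classifier context before and after the candidate instance.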

SLIDE 32

• Given a proposal, we use temporal pyramid pooling to summarize its representation (structure modeling).
• Specifically, for a proposal [s, e], the i-th part in the k-th level is pooled over its sub-interval, and the overall representation concatenates the pooled features across all parts and levels. (The pooling equations on this slide are images in the original deck.)


Temporal Pyramid Pooling
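A minimal sketch of the pooling scheme, assuming average pooling within each sub-interval (the equation images are not in the text, so the pooling operator and the `levels=(1, 2)` default here are assumptions):

```python
import numpy as np

def temporal_pyramid_pool(features, levels=(1, 2)):
    """Temporal pyramid pooling over snippet features [T, D]:
    level k splits the proposal into k equal parts, each part is
    average-pooled, and all pooled vectors are concatenated."""
    f = np.asarray(features, dtype=float)
    T = f.shape[0]
    parts = []
    for k in levels:
        bounds = np.linspace(0, T, k + 1).astype(int)
        for i in range(k):
            lo, hi = bounds[i], max(bounds[i] + 1, bounds[i + 1])
            parts.append(f[lo:hi].mean(axis=0))
    return np.concatenate(parts)
```

The coarse level captures the whole proposal while the finer levels preserve the temporal ordering of its parts.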

SLIDE 33

• To model the action classes and the completeness of instances, we design a two-classifier loss.
• The action classifier measures the likelihood of the action class: P(c | p).
• The completeness classifier measures the likelihood that the instance is complete: P(b | c, p).
• A joint loss optimizes the two classifiers together.


Two Classifier Design
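The joint-loss equation on this slide is an image in the original deck. One way to write an objective consistent with the two factors above (the balancing weight $\lambda$ is an assumption here, not from the slide) is:

```latex
\mathcal{L}(c, b; p) \;=\; -\log P(c \mid p) \;-\; \lambda \, \log P(b \mid c, p)
```

i.e., the negative log of the factorized joint likelihood $P(c, b \mid p) = P(c \mid p)\,P(b \mid c, p)$, so a proposal scores highly only when it both looks like the right class and covers a complete instance.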

SLIDE 34

Temporal Region Proposal

Bottom-up proposal generation based on an actionness map.

SLIDE 35

Experiment result (1)

SLIDE 36

Experiment result (2)

SLIDE 37

Experiment result (3)

SLIDE 38

Detection example (1)

SLIDE 39

Detection example (2)

SLIDE 40

1. Overview of action understanding
2. Temporal segment networks
3. Structured segment networks
4. UntrimmedNets
5. Conclusion


Outline

SLIDE 41

Motivation of UntrimmedNet

1. Labeling untrimmed videos is expensive and time-consuming.
2. Temporal annotation is subjective and inconsistent across annotators and datasets.

SLIDE 42

Overview of UntrimmedNet

SLIDE 43

• Uniform sampling: sample clips of a fixed duration uniformly over the video.
• Shot-based sampling: first detect shots based on HOG differences, then sample uniformly within each shot.


Clip Proposal
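The shot-based variant can be sketched as thresholding the distance between consecutive per-frame descriptors (the talk uses HOG; this sketch accepts any per-frame feature vector, and both function names and the threshold are assumptions):

```python
import numpy as np

def shot_boundaries(frame_feats, thresh=0.5):
    """Detect shot boundaries where the distance between consecutive
    per-frame descriptors (e.g. HOG) exceeds a threshold.
    Returns the indices of the first frame of each new shot."""
    f = np.asarray(frame_feats, dtype=float)
    dist = np.linalg.norm(f[1:] - f[:-1], axis=1)
    return [i + 1 for i, d in enumerate(dist) if d > thresh]

def shots_to_clips(boundaries, num_frames):
    """Turn boundary indices into (start, end) shot spans, within which
    uniform clip sampling is then performed."""
    cuts = [0] + list(boundaries) + [num_frames]
    return list(zip(cuts[:-1], cuts[1:]))
```

Each resulting (start, end) span is one shot, and uniform sampling then runs per shot rather than over the whole video.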

SLIDE 44

• Following the TSN framework:
  • Sample a few snippets from each clip.
  • Aggregate the snippet-level predictions with average pooling.
• In practice, we use two-stream input: RGB and optical flow.


Clip-level Representation and Classification

SLIDE 45

• Clip selection aims to pick discriminative clips or rank them with attention weights.
• Two selection methods:
  • Hard selection: top-k pooling over the clip-level predictions.
  • Soft selection: learning attention weights for the different clips.


Clip Selection
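Both selection modes can be sketched as follows (an illustration with assumed names; in the real model the attention logits are learned end-to-end, while here they are passed in):

```python
import numpy as np

def hard_select(clip_scores, k=2):
    """Hard selection: per class, average the top-k clip-level scores."""
    s = np.asarray(clip_scores, dtype=float)
    return np.sort(s, axis=0)[-k:].mean(axis=0)

def soft_select(clip_scores, attention_logits):
    """Soft selection: softmax attention over clips, then a weighted sum
    of the clip-level scores."""
    s = np.asarray(clip_scores, dtype=float)
    a = np.asarray(attention_logits, dtype=float)
    a = np.exp(a - a.max())  # numerically stable softmax
    w = a / a.sum()
    return (w[:, None] * s).sum(axis=0)
```

Hard selection makes a discrete top-k choice, while soft selection keeps the whole pipeline differentiable through the attention weights.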

SLIDE 46

• UntrimmedNet is an end-to-end learning architecture combining three modules: feature extraction, classification, and selection.
• Video-level prediction: a bilinear combination of the classification scores and the selection weights.
• The whole pipeline can be optimized with the standard back-propagation algorithm.


UntrimmedNet

SLIDE 47

• Action recognition:
  • In practice, we sample a single frame (or a stack of 5 optical-flow frames) every 30 frames.
  • The predictions from the sampled frames are aggregated with top-k pooling (k = 20) to yield the final video-level prediction.
• Action detection:
  • We sample frames every 15 frames; for each frame we obtain both prediction scores and attention weights.
  • We remove background by thresholding the attention weights (threshold 0.0001).
  • We produce the final detections by thresholding the classification scores (threshold 0.5).


Weakly supervised action recognition and detection
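The two-stage thresholding for detection can be sketched per sampled frame (a minimal illustration with an assumed function name; real post-processing would also group consecutive detections into temporal intervals):

```python
import numpy as np

def detect(att, scores, att_thresh=1e-4, cls_thresh=0.5):
    """For each sampled frame, first drop background frames whose
    attention weight is below att_thresh, then report (frame, class)
    pairs whose classification score exceeds cls_thresh."""
    dets = []
    for t, (a, s) in enumerate(zip(att, np.asarray(scores, dtype=float))):
        if a < att_thresh:
            continue  # background frame, removed by attention threshold
        for c, v in enumerate(s):
            if v > cls_thresh:
                dets.append((t, c))
    return dets
```

The defaults mirror the thresholds quoted on the slide (0.0001 on attention, 0.5 on classification scores).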

SLIDE 48

Exploration Study

SLIDE 49

Exploration Study

SLIDE 50

Exploration Study

SLIDE 51

Experiment Results -- Action recognition

SLIDE 52

Experiment Results -- Action detection

SLIDE 53

Examples of Attention

SLIDE 54

• Temporal modeling is important for action understanding.
• Segment-based sampling has two merits: temporal modeling and processing efficiency.
• TSN is a general and flexible framework for action recognition.
• SSN extends TSN to action detection with context and structure modeling.
• UntrimmedNet extends TSN to the weakly supervised setting with attention modeling.


Summary

SLIDE 55

[1] L. Wang, Y. Xiong, Z. Wang, Y. Qiao, D. Lin, X. Tang, and L. Van Gool, Temporal Segment Networks: Towards Good Practices for Deep Action Recognition, in ECCV, 2016.
[2] L. Wang, Y. Xiong, D. Lin, and L. Van Gool, UntrimmedNets for Weakly Supervised Action Recognition and Detection, in CVPR, 2017.
[3] Y. Xiong, Y. Zhao, L. Wang, D. Lin, and X. Tang, A Pursuit of Temporal Accuracy in General Activity Detection, arXiv:1703.02716.


Code and Reference

• Temporal segment network: https://github.com/yjxiong/temporal-segment-network
• Structured segment network: https://github.com/yjxiong/action-detection
• UntrimmedNet: https://github.com/wanglimin/UntrimmedNet
• Video Caffe: https://github.com/yjxiong/caffe

SLIDE 56

Collaborators

SLIDE 57

Thank you for your attention!