

  1. Towards efficient end-to-end architectures for action recognition and detection in videos. Limin Wang, Computer Vision Laboratory, ETH Zurich (CVL ETHZ), 17/4/12

  2. Outline § 1. Overview of action understanding § 2. Temporal segment networks § 3. Structured segment networks § 4. UntrimmedNets § 5. Conclusion

  3. Action recognition in videos § 1. Action recognition “in the lab”: KTH, Weizmann, etc. § 2. Action recognition “in TV and movies”: UCF Sports, Hollywood, etc. § 3. Action recognition “in web videos”: HMDB, UCF101, THUMOS, ActivityNet, etc. Haroon Idrees et al., The THUMOS Challenge on Action Recognition for Videos “in the Wild”, in Computer Vision and Image Understanding (CVIU), 2017.

  4. How to define action categories. Heilbron F. C. et al., ActivityNet: A Large-Scale Video Benchmark for Human Activity Understanding, in CVPR, 2015.

  5. How to define action categories (continued). Heilbron F. C. et al., ActivityNet: A Large-Scale Video Benchmark for Human Activity Understanding, in CVPR, 2015.

  6. How to label videos. Heilbron F. C. et al., ActivityNet: A Large-Scale Video Benchmark for Human Activity Understanding, in CVPR, 2015.

  7. Dataset overview

  8. Action understanding § Action Recognition: classify a short clip or untrimmed video into a pre-defined class. § Action Temporal Localization: detect the starting and ending times of action instances in untrimmed videos. § Action Spatial Detection: detect the bounding boxes of actors in trimmed videos. § Action Spatio-Temporal Detection: combine temporal and spatial localization in untrimmed videos.

  9. Action recognition – STIP+HOG/HOF (03, 08). [1] Ivan Laptev and Tony Lindeberg, Space-time Interest Points, in ICCV, 2003. [2] Ivan Laptev, Marcin Marszałek, Cordelia Schmid, Benjamin Rozenfeld, Learning realistic human actions from movies, in CVPR, 2008.

  10. Action recognition – Dense Trajectories (11, 13). [1] Heng Wang, Alexander Klaser, Cordelia Schmid, and Cheng-Lin Liu, Action Recognition by Dense Trajectories, in CVPR, 2011. [2] Heng Wang and Cordelia Schmid, Action Recognition with Improved Trajectories, in ICCV, 2013.

  11. Action recognition – two-stream CNN (2014). Karen Simonyan and Andrew Zisserman, Two-Stream Convolutional Networks for Action Recognition in Videos, in NIPS, 2014.

  12. Action recognition – 3D CNN (2015). Du Tran et al., Learning Spatiotemporal Features with 3D Convolutional Networks, in ICCV, 2015.

  13. Action recognition – TDD (2015). Limin Wang, Yu Qiao, Xiaoou Tang, Action Recognition with Trajectory-Pooled Deep-Convolutional Descriptors, in CVPR, 2015.

  14. Outline § 1. Overview of action understanding § 2. Temporal segment networks § 3. Structured segment networks § 4. UntrimmedNets § 5. Conclusion

  15. Motivation of TSN § Towards an end-to-end, video-level architecture. § Modeling issue: mainstream CNN frameworks focus on appearance and short-term motion. § Learning issue: current action datasets are relatively small, making it hard to train deep CNNs.

  16. Overview of TSN. TSN is a video-level framework based on simple strategies of segment sampling and consensus aggregation.

  17. Segment sampling § Our segment sampling is based on the fact that there is high data redundancy in videos. § Our segment sampling has two properties: § Sparse: processing efficiency. § Duration-invariant: a video-level framework modeling the entire video content.
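As an illustrative sketch (not the authors' code; the function name is mine), sparse, duration-invariant sampling can be written as: split the video into K equal-length segments and randomly draw one snippet index from each, so any video length maps to a fixed number of snippets.

```python
import numpy as np

def sample_snippets(num_frames, num_segments=3, rng=None):
    """Sparse segment sampling: split the video into num_segments
    equal parts and draw one snippet index from each part."""
    rng = rng or np.random.default_rng(0)
    edges = np.linspace(0, num_frames, num_segments + 1).astype(int)
    return [int(rng.integers(edges[i], max(edges[i] + 1, edges[i + 1])))
            for i in range(num_segments)]
```

Because the number of sampled snippets is fixed regardless of video length, the downstream network sees a constant-size input, which is what makes the framework duration-invariant.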

  18. Aggregation Function § The aggregation function summarizes the predictions of different snippets to yield the video-level prediction. § Simple aggregation functions: § Mean pooling, max pooling, weighted average § Advanced aggregation functions: § Top-k pooling, attention weighting
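The simple aggregation functions listed above can be sketched in a few lines of NumPy (illustrative only; the function name and interface are mine):

```python
import numpy as np

def aggregate(snippet_scores, method="mean", weights=None):
    """Combine per-snippet class scores (shape [num_snippets, num_classes])
    into a single video-level prediction."""
    s = np.asarray(snippet_scores, dtype=float)
    if method == "mean":
        return s.mean(axis=0)
    if method == "max":
        return s.max(axis=0)
    if method == "weighted":
        w = np.asarray(weights, dtype=float)
        return (w[:, None] * s).sum(axis=0) / w.sum()
    raise ValueError(method)
```

Mean pooling uses all snippets equally, max pooling keeps only the most confident snippet per class, and the weighted average interpolates between the two.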

  19. Formulation of TSN
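The formulation appears on the slide as a figure; in the TSN paper it is written as

```latex
\mathrm{TSN}(T_1, T_2, \ldots, T_K)
  = \mathcal{H}\!\left(\mathcal{G}\big(\mathcal{F}(T_1; \mathbf{W}),
      \mathcal{F}(T_2; \mathbf{W}), \ldots, \mathcal{F}(T_K; \mathbf{W})\big)\right)
```

where $T_k$ is the snippet sampled from the $k$th segment, $\mathcal{F}$ is the ConvNet with shared weights $\mathbf{W}$ applied to each snippet, $\mathcal{G}$ is the aggregation (consensus) function, and $\mathcal{H}$ is the prediction function (e.g. softmax).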

  20. Input modalities § The original two-stream CNNs take an RGB image and stacked optical flow fields. § We study two other input modalities. § Stacked RGB differences § Approximation of motion information § Efficient to compute § Stacked warped optical flow fields § Remove background (camera) motion
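The RGB-difference modality is cheap enough to sketch directly (an illustration of the idea, not the authors' pipeline): subtracting consecutive frames yields a rough motion signal without computing optical flow.

```python
import numpy as np

def stacked_rgb_diff(frames):
    """Stack consecutive RGB differences as a cheap motion approximation.
    frames: array of shape [T, H, W, 3]; returns shape [T-1, H, W, 3]."""
    f = np.asarray(frames, dtype=np.float32)
    return f[1:] - f[:-1]
```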

  21. Good practices § Cross-modality pre-training: pre-train both spatial and temporal nets with ImageNet models. § Partial batch normalization: only re-estimate the parameters of the first BN layer. § Smaller learning rate: since the networks are pre-trained, use a smaller learning rate for fine-tuning. § Data augmentation: use more augmentation, including corner cropping, scale jittering, and aspect-ratio jittering. § High dropout ratio: 0.7 for the temporal net and 0.8 for the spatial net.

  22. Experiment result -- training method

  23. Experiment result -- input modality

  24. Experiment result -- TSN framework

  25. Experiment result -- comparison with the state of the art

  26. ActivityNet Challenge 2016

  27. Model visualization

  28. Outline § 1. Overview of action understanding § 2. Temporal segment networks § 3. Structured segment networks § 4. UntrimmedNets § 5. Conclusion

  29. Motivation of Structured Segment Networks 1. Action detection in untrimmed videos is an important problem. 2. Snippet-level classifiers struggle to accurately localize the temporal extent of action instances. Context and structure modeling!

  30. Structured Segment Network

  31. Three-Stage Augmentation § Given a proposal, we extend its temporal duration by augmentation (context modeling). § Specifically, for a proposal denoted [s, e] with duration d = e - s, the temporal extension [s', e'] is: s' = s - d/2, e' = e + d/2.
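The extension rule above is a one-liner; a minimal sketch (function name is mine):

```python
def extend_proposal(s, e):
    """Extend a proposal [s, e] by half its duration on each side:
    s' = s - d/2, e' = e + d/2, where d = e - s."""
    d = e - s
    return s - d / 2, e + d / 2
```

The extended interval covers the starting context, the course of the action, and the ending context, which gives the completeness classifier something to look at beyond the proposal itself.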

  32. Temporal Pyramid Pooling § Given a proposal, we use temporal pyramid pooling to summarize its representation (structure modeling). § Specifically, given a proposal denoted [s, e], the features of the ith part in the kth pyramid level are pooled within that part. § The overall representation concatenates the pooled parts across all levels.
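A minimal NumPy sketch of temporal pyramid pooling (illustrative; I assume average pooling within each part, and the function name is mine):

```python
import numpy as np

def temporal_pyramid_pool(features, levels=(1, 2)):
    """At level k, split the snippet features (shape [T, D]) into k
    equal parts, average-pool each part, and concatenate all pooled
    parts into one fixed-length vector."""
    f = np.asarray(features, dtype=float)
    parts = []
    for k in levels:
        edges = np.linspace(0, len(f), k + 1).astype(int)
        for i in range(k):
            seg = f[edges[i]:max(edges[i] + 1, edges[i + 1])]
            parts.append(seg.mean(axis=0))
    return np.concatenate(parts)
```

The output length is fixed (sum of levels times the feature dimension) regardless of the proposal duration, while the multi-level split preserves coarse temporal structure.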

  33. Two-Classifier Design § To model the action classes and completeness of instances, we design a two-classifier loss. § The action class classifier measures the likelihood of the action class: P(c|p). § The completeness classifier measures the likelihood of instance completeness: P(b|c,p). § A joint loss optimizes these two classifiers.
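The joint loss appears on the slide as a figure; a plausible form consistent with the description (the balancing weight $\lambda$ and the indicator restricting the completeness term to non-background classes are my notation) is

```latex
L(c, b; p) = L_{\mathrm{cls}}(c; p)
  + \lambda \cdot \mathbb{1}[c \ge 1] \, L_{\mathrm{comp}}(b; c, p)
```

where $c$ is the predicted action class, $b$ the completeness indicator, and $p$ the proposal.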

  34. Temporal Region Proposal. Bottom-up proposal generation based on an actionness map.
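A toy sketch of bottom-up grouping on an actionness sequence (the real method is richer; this only illustrates the idea, and the function name and threshold are mine): threshold the per-snippet actionness scores and turn each contiguous high-actionness run into a temporal proposal.

```python
import numpy as np

def actionness_proposals(actionness, threshold=0.5):
    """Threshold a per-snippet actionness sequence and group each
    contiguous run of high-actionness snippets into a [start, end)
    temporal region proposal."""
    mask = np.asarray(actionness) >= threshold
    proposals, start = [], None
    for t, m in enumerate(mask):
        if m and start is None:
            start = t
        elif not m and start is not None:
            proposals.append((start, t))
            start = None
    if start is not None:
        proposals.append((start, len(mask)))
    return proposals
```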

  35. Experiment result (1)

  36. Experiment result (2)

  37. Experiment result (3)

  38. Detection example (1)

  39. Detection example (2)

  40. Outline § 1. Overview of action understanding § 2. Temporal segment networks § 3. Structured segment networks § 4. UntrimmedNets § 5. Conclusion

  41. Motivation of UntrimmedNet 1. Labeling untrimmed videos is expensive and time-consuming. 2. Temporal annotation is subjective and not consistent across annotators and datasets.

  42. Overview of UntrimmedNet

  43. Clip Proposal § Uniform sampling § Uniform sampling of clips of fixed duration. § Shot-based sampling § First perform shot detection based on HOG differences. § Then, for each shot, perform uniform sampling.
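The shot-detection step can be sketched abstractly (illustrative only; the slide uses HOG descriptors, but any per-frame descriptor works for the illustration, and the function name and threshold are mine): mark a shot boundary wherever consecutive frame descriptors differ by more than a threshold.

```python
import numpy as np

def shot_boundaries(frame_descriptors, threshold=0.5):
    """Mark a shot boundary at each frame index where the distance
    between consecutive per-frame descriptors exceeds the threshold."""
    d = np.asarray(frame_descriptors, dtype=float)
    dist = np.linalg.norm(d[1:] - d[:-1], axis=1)
    return [i + 1 for i in np.nonzero(dist > threshold)[0]]
```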

  44. Clip-level Representation and Classification § Following the TSN framework: § Sample a few snippets from each clip. § Aggregate snippet-level predictions with average pooling. § In practice, we use two-stream inputs: RGB and optical flow.

  45. Clip Selection § Selection aims to pick discriminative clips or rank them with attention weights. § Two selection methods: § Hard selection: top-k pooling over clip-level predictions. § Soft selection: learning attention weights for different clips.
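The two selection schemes can be sketched in NumPy (an illustration of the mechanism, not the trained model; function names are mine, and in the real network the attention logits are learned rather than supplied):

```python
import numpy as np

def hard_select(clip_scores, k=2):
    """Hard selection: average the top-k clip scores per class."""
    s = np.sort(np.asarray(clip_scores, dtype=float), axis=0)
    return s[-k:].mean(axis=0)

def soft_select(clip_scores, attention_logits):
    """Soft selection: weight clip scores by softmax attention weights."""
    s = np.asarray(clip_scores, dtype=float)
    a = np.exp(np.asarray(attention_logits, dtype=float)
               - np.max(attention_logits))
    w = a / a.sum()
    return (w[:, None] * s).sum(axis=0)
```

Hard selection makes a discrete choice of the most discriminative clips, while soft selection keeps every clip but lets the attention weights down-weight background clips; both yield a video-level score that can be trained with only video-level labels.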
