Temporal Action Detection with Local and Global Context
Tianwei Lin, Baidu VIS
What is Temporal Action Detection (TAD)?
[Figure: image classification (People, Dog) next to video classification (Cricket Bowling)]
- Which action?
What is Temporal Action Detection (TAD)?
[Figure: image object detection (People, Dog) next to video temporal action detection (Cricket Bowling segments separated by Background), with √ and X marks]
- 1. Which action?
- 2. When does each action start/end?
What is Temporal Action Proposal Generation (TAPG)?
[Figure: image object proposal next to video temporal proposal generation (segments separated by Background), with √ and X marks]
- When does each action start/end?
What is a high-quality proposal?
- Flexible temporal duration at a flexible position
- Precisely located temporal boundaries
- Reliable confidence scores for proposals
Related Work
- Anchor-based Approaches
– top-down framework
– global context
– define multi-scale anchors at regular intervals as proposals
– SSAD, SSTAD, CBR, TURN, etc.
- Anchor-free Approaches
– bottom-up framework
– local context
– first evaluate local clues such as boundary probability or actionness, then generate proposals by exploiting these clues
– TAG, BSN, etc.
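As a rough illustration of the anchor-based idea above, the sketch below tiles a sequence with anchors of several scales at regular intervals. The function name `make_anchors`, the scale values, and the stride are illustrative assumptions, not the settings of SSAD, TURN, or any other cited method.

```python
def make_anchors(num_steps, scales=(2, 4, 8), stride=2):
    """Return (start, end) anchors of each scale, centred every `stride` steps.

    Anchors are clipped to [0, num_steps]. All parameter values are
    illustrative assumptions, not taken from any specific method.
    """
    anchors = []
    for center in range(0, num_steps + 1, stride):
        for scale in scales:
            start = max(0.0, center - scale / 2)
            end = min(float(num_steps), center + scale / 2)
            if end > start:
                anchors.append((start, end))
    return anchors


anchors = make_anchors(num_steps=8)
print(len(anchors))  # 15 anchors: 5 centres x 3 scales
```

Because the scales are fixed in advance, an action whose length falls between two scales can only be matched approximately.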
Anchor-based: SSAD
[Figures: anchor mechanism of SSAD; approach overview]
Single Shot Temporal Action Detection. Tianwei Lin, Xu Zhao, Zheng Shou. In ACM Multimedia 2017.
Anchor-based: TURN
Gao J, Yang Z, Chen K, et al. TURN TAP: Temporal unit regression network for temporal action proposals[C]//Proceedings of the IEEE International Conference on Computer Vision. 2017: 3628-3636.
Anchor-based: Strength and Weakness
- Strength
– can efficiently generate multi-scale proposals
– can generate reliable confidence scores for all anchors simultaneously, since it uses more global contextual information
- Weakness
– it is hard to design the default anchor settings
– usually not temporally precise
– not flexible enough to cover various temporal durations
Anchor-free: TAG
Temporal Actionness Grouping (TAG)
Temporal Action Detection with Structured Segment Networks. Yue Zhao, Yuanjun Xiong, Limin Wang, Zhirong Wu, Dahua Lin, Xiaoou Tang. In ICCV 2017.
Local
Anchor-free: BSN
Boundary-Sensitive Network (BSN)
BSN: Boundary sensitive network for temporal action proposal generation. T. Lin, X. Zhao, H. Su, et al. In European Conference on Computer Vision, 2018.
Local then Global
Anchor-free: BSN – temporal evaluation module
Anchor-free: BSN – proposal generation
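A minimal sketch of bottom-up proposal generation, assuming per-position start and end probabilities have already been predicted. The rule for picking candidate boundaries (a local peak, or a value above half the maximum), the product score, and the function names are simplifying assumptions of this sketch.

```python
def candidate_points(probs, rel_threshold=0.5):
    """Indices that are local peaks or exceed rel_threshold * max(probs)."""
    peak = max(probs)
    picked = []
    for t, p in enumerate(probs):
        left = probs[t - 1] if t > 0 else float("-inf")
        right = probs[t + 1] if t + 1 < len(probs) else float("-inf")
        is_local_peak = p > left and p > right
        if is_local_peak or p > rel_threshold * peak:
            picked.append(t)
    return picked


def pair_boundaries(start_probs, end_probs, max_duration):
    """Pair every candidate start with every later candidate end."""
    starts = candidate_points(start_probs)
    ends = candidate_points(end_probs)
    proposals = []
    for s in starts:
        for e in ends:
            if 0 < e - s <= max_duration:
                # Placeholder score: product of the two boundary probabilities.
                proposals.append((s, e, start_probs[s] * end_probs[e]))
    return proposals


start_probs = [0.1, 0.9, 0.2, 0.1, 0.1, 0.1]
end_probs = [0.1, 0.1, 0.1, 0.2, 0.8, 0.1]
proposals = pair_boundaries(start_probs, end_probs, max_duration=4)
print(proposals)
```

Since starts and ends are paired freely, the resulting proposals can take any duration up to `max_duration` at any position.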
Anchor-free: BSN – proposal evaluation module
Anchor-free: BSN – NMS
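For readers unfamiliar with suppression on temporal segments, here is a minimal sketch of Gaussian Soft-NMS over (start, end, score) proposals. It uses the standard Soft-NMS decay; the `sigma` and `top_k` values and the function names are assumptions, and the exact variant and thresholds used by BSN are not shown here.

```python
import math


def temporal_iou(a, b):
    """IoU of two segments given as (start, end, ...)."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def soft_nms(proposals, sigma=0.5, top_k=100):
    """Repeatedly keep the best-scoring proposal and decay overlapping ones."""
    remaining = [list(p) for p in proposals]
    kept = []
    while remaining and len(kept) < top_k:
        best_idx = max(range(len(remaining)), key=lambda k: remaining[k][2])
        best = remaining.pop(best_idx)
        kept.append(tuple(best))
        for p in remaining:
            iou = temporal_iou(best, p)
            p[2] *= math.exp(-(iou ** 2) / sigma)  # Gaussian score decay
    return kept


kept = soft_nms([(0, 10, 0.9), (1, 10, 0.8), (20, 30, 0.7)])
print(kept)
```

In this toy input the heavily overlapping second proposal is not discarded but drops below the non-overlapping third one in the ranking.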
Strength and weakness of BSN
- Strength
– can generate proposals with flexible durations and precise boundaries (locally)
– can generate reliable confidence scores using the BSP feature (globally)
- Weakness
– feature construction and confidence evaluation are performed for each proposal separately, which is inefficient
– the proposal feature is too simple to capture enough temporal context
– it is a multi-stage method rather than a unified framework
How can we improve BSN?
- How can we evaluate confidence for all proposals simultaneously with rich context?
– top-down methods achieve this via the anchor mechanism
– the anchor mechanism is not suitable for bottom-up methods like BSN
Boundary-Matching Network (BMN)
Local and Global
Lin T, Liu X, Li X, et al. BMN: Boundary-matching network for temporal action proposal generation[C]//Proceedings of the IEEE International Conference on Computer Vision. 2019: 3889-3898.
BM confidence map
How can we generate the BM confidence map? - BM mechanism!
- Pipeline:
temporal feature sequence $S_F \in \mathbb{R}^{C \times T}$ → BM feature map $M_F \in \mathbb{R}^{C \times N \times D \times T}$ → BM confidence map $M_C \in \mathbb{R}^{D \times T}$
- Key issue 1: How to generate the BM feature map from the temporal feature sequence?
– target: convert $S_F \in \mathbb{R}^{C \times T}$ to $M_F \in \mathbb{R}^{C \times N \times D \times T}$
– for each proposal ($D \times T$ proposals in total), generate $m^f_{i,j} \in \mathbb{R}^{C \times N}$ by sampling from its temporal scope
– precisely and efficiently
- Difficulties in achieving this feature sampling procedure
– how to sample features at non-integer points? (precisely)
– how to sample features for all proposals simultaneously? (efficiently)
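BM mechanism
- Our Solution:
– dot product between $S_F \in \mathbb{R}^{C \times T}$ and sampling weight $W \in \mathbb{R}^{N \times T \times D \times T}$ in the temporal dimension
– for each proposal $\varphi_{i,j}$, construct $w_{i,j} \in \mathbb{R}^{N \times T}$ by uniformly sampling $N$ points in the temporal region $[t_s - 0.25d,\ t_e + 0.25d]$; for a non-integer sampling point $t_n$, we define $w_{i,j,n} \in \mathbb{R}^{T}$ as:
$$w_{i,j,n}[t] = \begin{cases} 1 - \mathrm{dec}(t_n) & \text{if } t = \mathrm{floor}(t_n) \\ \mathrm{dec}(t_n) & \text{if } t = \mathrm{floor}(t_n) + 1 \\ 0 & \text{otherwise} \end{cases}$$
where $\mathrm{dec}$ and $\mathrm{floor}$ return the fractional and integer parts of $t_n$
– thus, we can get $w_{i,j} \in \mathbb{R}^{N \times T}$ for proposal $\varphi_{i,j}$
– then, take the dot product in the temporal dimension between $S_F$ and $w_{i,j}$ to generate $m^f_{i,j} \in \mathbb{R}^{C \times N}$:
$$m^f_{i,j}[c, n] = \sum_{t=1}^{T} S_F[c, t] \cdot w_{i,j,n}[t]$$
– expand $w_{i,j} \in \mathbb{R}^{N \times T}$ to $W \in \mathbb{R}^{N \times T \times D \times T}$
– dot product between $S_F \in \mathbb{R}^{C \times T}$ and $W \in \mathbb{R}^{N \times T \times D \times T}$ to generate $M_F \in \mathbb{R}^{C \times N \times D \times T}$
– the sampled features of all proposals, with rich context, are generated precisely and efficiently
– this procedure is denoted as the BM layer in our method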
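The BM layer described above can be sketched in NumPy as follows. This is only an illustration under stated assumptions: the function names `bm_sampling_weights` and `bm_layer` are hypothetical, proposal (i, j) is taken to start at position j with duration i + 1, sampling points that fall outside the sequence get zero weight, and proposals that run past the sequence keep all-zero weights.

```python
import numpy as np


def bm_sampling_weights(T, D, N, expand=0.25):
    """Build W of shape (N, T, D, T).

    W[n, t, i, j] is the weight of temporal position t for the n-th
    sampling point of the proposal with duration i + 1 starting at j
    (an indexing convention assumed for this sketch).
    """
    W = np.zeros((N, T, D, T))
    for i in range(D):
        d = i + 1
        for j in range(T):
            t_s, t_e = j, j + d
            if t_e > T:
                continue  # proposal runs past the sequence: leave zeros
            points = np.linspace(t_s - expand * d, t_e + expand * d, N)
            for n, t_n in enumerate(points):
                left = int(np.floor(t_n))
                frac = t_n - left
                # Linear interpolation between the two neighbouring positions.
                if 0 <= left < T:
                    W[n, left, i, j] += 1.0 - frac
                if 0 <= left + 1 < T:
                    W[n, left + 1, i, j] += frac
    return W


def bm_layer(S_F, W):
    """(C, T) x (N, T, D, T) -> (C, N, D, T), contracting the temporal axis."""
    return np.einsum("ct,ntij->cnij", S_F, W)


T, D, N = 8, 4, 5
S_F = np.stack([np.arange(T, dtype=float), np.ones(T)])  # C = 2 toy channels
W = bm_sampling_weights(T, D, N)  # computed once, reused for every video
M_F = bm_layer(S_F, W)
print(M_F.shape)  # (2, 5, 4, 8)
```

Since `W` depends only on T, D, and N, it can be precomputed, so sampling the features of all proposals reduces to a single tensor contraction.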
How can we generate the BM confidence map? - BM mechanism!
- Key issue 2: How to generate the BM confidence map from the BM feature map?
– target: convert $M_F \in \mathbb{R}^{C \times N \times D \times T}$ to $M_C \in \mathbb{R}^{D \times T}$
– a series of 3D and 2D convolution layers
- Key issue 3: What is the label for training the BM confidence map?
– BM label map $G_C \in \mathbb{R}^{D \times T}$, where $g^c_{i,j} \in [0,1]$ is the maximum IoU between a proposal and the ground truths
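A minimal sketch of how such a label map could be computed, assuming proposal (i, j) starts at position j with duration i + 1 and that ground truths are given as (start, end) pairs in the same units. The function name `bm_label_map` and the choice to zero out proposals that run past the sequence are assumptions of this sketch.

```python
import numpy as np


def bm_label_map(T, D, gt_segments):
    """G_C[i, j] = max IoU between proposal (i, j) and any ground-truth segment."""
    starts = np.arange(T, dtype=float)[None, :]            # shape (1, T)
    durations = np.arange(1, D + 1, dtype=float)[:, None]  # shape (D, 1)
    ends = starts + durations                               # shape (D, T)
    label = np.zeros((D, T))
    for g_s, g_e in gt_segments:
        inter = np.clip(np.minimum(ends, g_e) - np.maximum(starts, g_s), 0.0, None)
        union = durations + (g_e - g_s) - inter
        label = np.maximum(label, inter / union)
    label[ends > T] = 0.0  # proposals running past the sequence are treated as invalid
    return label


G_C = bm_label_map(T=8, D=4, gt_segments=[(2, 5)])
print(G_C[2, 2])  # 1.0: the proposal starting at 2 with duration 3 matches exactly
```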
BMN – Network Architecture
BMN – Experimental Results
Qualitative Results
Lessons Learned
- Proposals are very important for accurate localization
- High-quality proposals should have three properties:
– Flexible durations and locations
– Precise temporal boundaries
– Reliable confidence scores
- The Boundary-Matching mechanism we proposed can efficiently and effectively generate high-quality temporal action proposals
Applications
- Video highlight detection
- Dynamic video cover generation
- Fine-grained highlight generation for sports videos, game videos, etc.
Recent Work: CTCN
Li X, Lin T, Liu X, et al. Deep Concept-wise Temporal Convolutional Networks for Action Localization[J]. arXiv preprint arXiv:1908.09442, 2019.
Recent Work: Relation-Aware Pyramid Network
Gao J, Shi Z, Li J, et al. Accurate Temporal Action Proposal Generation with Relation-Aware Pyramid Network[J]. AAAI, 2020.
Recent Work: PGCN
Zeng R, Huang W, Tan M, et al. Graph convolutional networks for temporal action localization[C]//Proceedings of the IEEE International Conference on Computer Vision. 2019: 7094-7103.
Our Work: StNet
- Temporally sampling "super-images"
- 2D-Conv on super-images for local S-T modeling
- Stacked 3D/2D-Conv blocks for global S-T modeling
- Temporal 1D-Xception for long-term dynamic modeling
He D, Zhou Z, Gan C, et al. StNet: Local and global spatial-temporal modeling for action recognition[C]//Proceedings of the AAAI Conference on Artificial Intelligence. 2019, 33: 8401-8408.
Our Work: MARL
Wu W, He D, Tan X, et al. Multi-Agent Reinforcement Learning Based Frame Sampling for Effective Untrimmed Video Recognition[C]//Proceedings of the IEEE International Conference on Computer Vision. 2019: 6222-6231.
Our Work: Label Graph Superimposing
Wang Y, He D, Li F, et al. Multi-Label Classification with Label Graph Superimposing[J]. AAAI, 2020.
Our Work: Dynamic Inference
Wu W, He D, Tan X, et al. Dynamic Inference: A New Approach Toward Efficient Video Action Recognition[J]. arXiv preprint arXiv:2002.03342, 2020.
PaddleVideo
- Action Recognition: TSN/TSM/StNet/Non-Local/NeXtVLAD…
- Action Detection: BSN/BMN/CTCN…
- Video Description: ETS
- Temporal Localization via Language Query: TALL
- https://github.com/PaddlePaddle/models/tree/develop/PaddleCV/PaddleVideo