Tianwei Lin Baidu VIS What is Temporal Action Detection (TAD)? - - PowerPoint PPT Presentation



slide-1
SLIDE 1

Temporal Action Detection with Local and Global Context

Tianwei Lin Baidu VIS

slide-2
SLIDE 2

What is Temporal Action Detection (TAD)?

Image: Classification
People, Dog

Video: Classification
Cricket Bowling

Which action?

slide-3
SLIDE 3

What is Temporal Action Detection (TAD)?

Image: Object Detection
People, Dog

Video: Temporal Action Detection
Timeline labels: Background, Cricket Bowling

  • 1. Which action?
  • 2. When does each action start/end?

slide-4
SLIDE 4

What is Temporal Action Detection (TAD)?

Image: Object Detection
People, Dog

Video: Temporal Action Detection
Timeline labels: Background, Cricket Bowling

  • 1. Which action?
  • 2. When does each action start/end?

slide-5
SLIDE 5

What is Temporal Action Proposal Generation (TAPG)?

Image: Object Proposal

Video: Temporal Proposal Generation
Timeline labels: Background

  • When does each action start/end?

slide-6
SLIDE 6

What is a high-quality proposal?

  • Flexible temporal duration at flexible position
  • Locate temporal boundaries precisely
  • Evaluate reliable confidence scores of proposals
slide-7
SLIDE 7

Related Work

  • Anchor-based Approaches

– top-down framework
– global context
– define multi-scale anchors at regular intervals as proposals
– SSAD, SSTAD, CBR, TURN, etc.

  • Anchor-free Approaches

– bottom-up framework
– local context
– first evaluate local clues such as boundary probability or actionness, then generate proposals by exploiting these local clues
– TAG, BSN, etc.
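To make the anchor-based (top-down) idea concrete, here is a minimal Python sketch of placing multi-scale anchors at regular intervals along a video. The function name `generate_anchors`, the scale set and the stride are illustrative assumptions and are not the settings used by SSAD, TURN or the other methods listed.

```python
def generate_anchors(num_units, scales, stride):
    """Return (start, end) anchors of every scale, centred at regular intervals.

    num_units: length of the video in feature units (hypothetical unit).
    scales:    anchor durations to place at each centre.
    stride:    distance between neighbouring anchor centres.
    """
    anchors = []
    for scale in scales:
        center = stride / 2.0
        while center < num_units:
            # Clip the anchor to the video extent.
            start = max(0.0, center - scale / 2.0)
            end = min(float(num_units), center + scale / 2.0)
            anchors.append((start, end))
            center += stride
    return anchors


anchors = generate_anchors(num_units=16, scales=(2, 4, 8), stride=4)
print(len(anchors), anchors[:4])
```

Because the candidate set is fixed before the video is seen, the durations and positions that can be covered depend entirely on the chosen scales and stride.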

slide-8
SLIDE 8

Anchor-based: SSAD

Anchor Mechanism of SSAD Approach Overview

Single Shot Temporal Action Detection. Tianwei Lin, Xu Zhao, Zheng Shou. In ACM Multimedia 2017.

slide-9
SLIDE 9

Anchor-based: SSAD

Single Shot Temporal Action Detection. Tianwei Lin, Xu Zhao, Zheng Shou. In ACM Multimedia 2017.

slide-10
SLIDE 10

Anchor-based: TURN

Gao J, Yang Z, Chen K, et al. Turn tap: Temporal unit regression network for temporal action proposals[C]//Proceedings of the IEEE International Conference on Computer Vision. 2017: 3628-3636.

slide-11
SLIDE 11

Anchor-based: Strength and Weakness

  • Strength

– can efficiently generate multi-scale proposals
– can generate reliable confidence scores for all anchors simultaneously, since it takes in more global contextual information

  • Weakness

– it is hard to design the default anchor settings
– usually not temporally precise
– not flexible enough to cover various temporal durations

slide-12
SLIDE 12

Anchor-free: TAG

Temporal Actionness Grouping (TAG)

Temporal Action Detection with Structured Segment Networks. Yue Zhao, Yuanjun Xiong, Limin Wang, Zhirong Wu, Dahua Lin, Xiaoou Tang. In ICCV 2017.

Local

slide-13
SLIDE 13

Anchor-free: BSN

Boundary-Sensitive Network (BSN)

BSN: Boundary sensitive network for temporal action proposal generation. T. Lin, X. Zhao, H. Su, et al. In European Conference on Computer Vision, 2018.

Local then Global

slide-14
SLIDE 14

Anchor-free: BSN – temporal evaluation module

slide-15
SLIDE 15

Anchor-free: BSN – proposal generation

slide-16
SLIDE 16

Anchor-free: BSN – proposal generation
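As a rough illustration of bottom-up proposal generation, the sketch below picks candidate boundaries from start and end probability sequences and pairs them into proposals. The selection rule (local peak, or above half of the maximum probability), the function names and the toy probabilities are assumptions for illustration; the slide itself does not spell out the rule.

```python
def candidate_boundaries(probs, ratio=0.5):
    """Indices that are local peaks or exceed ratio * max(probs)."""
    threshold = ratio * max(probs)
    picked = []
    for t, p in enumerate(probs):
        left = probs[t - 1] if t > 0 else float("-inf")
        right = probs[t + 1] if t + 1 < len(probs) else float("-inf")
        is_peak = p > left and p > right
        if is_peak or p > threshold:
            picked.append(t)
    return picked


def pair_proposals(start_probs, end_probs, max_duration):
    """Pair every candidate start with every later candidate end."""
    starts = candidate_boundaries(start_probs)
    ends = candidate_boundaries(end_probs)
    return [(s, e) for s in starts for e in ends if 0 < e - s <= max_duration]


# Toy boundary probabilities (made up for the example).
start_probs = [0.1, 0.9, 0.2, 0.1, 0.1, 0.8, 0.1, 0.1]
end_probs = [0.1, 0.1, 0.1, 0.7, 0.1, 0.1, 0.1, 0.9]
proposals = pair_proposals(start_probs, end_probs, max_duration=6)
print(proposals)
```

Proposals built this way can take any duration and position that the boundary probabilities support, which is the flexibility the anchor-free family is after.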

slide-17
SLIDE 17

Anchor-free: BSN – proposal evaluation module

slide-18
SLIDE 18

Anchor-free: BSN – NMS
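The slide only names the NMS step. As one possible reading, the following sketch applies a Gaussian Soft-NMS to temporal proposals: overlapping proposals have their scores decayed rather than being discarded. The Gaussian decay, `sigma` and `top_k` are assumptions here and may differ from the suppression rule actually used in BSN.

```python
import math


def temporal_iou(a, b):
    """IoU of two temporal segments given as (start, end, ...)."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def soft_nms(proposals, sigma=0.5, top_k=100):
    """proposals: list of (start, end, score); returns them re-ranked."""
    remaining = [list(p) for p in proposals]
    kept = []
    while remaining and len(kept) < top_k:
        # Take the highest-scoring remaining proposal.
        best = max(range(len(remaining)), key=lambda k: remaining[k][2])
        top = remaining.pop(best)
        kept.append(tuple(top))
        # Decay the scores of the others according to their overlap with it.
        for other in remaining:
            iou = temporal_iou(top, other)
            other[2] *= math.exp(-(iou ** 2) / sigma)
    return kept


kept = soft_nms([(0, 10, 0.9), (1, 10, 0.8), (20, 30, 0.7)])
print(kept)
```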

slide-19
SLIDE 19

Strength and weakness of BSN

  • Strength

– can generate proposals with flexible duration and precise boundaries (locally)
– can generate reliable confidence scores using the BSP feature (globally)

  • Weakness

– feature construction and confidence evaluation are conducted for each proposal separately, leading to inefficiency
– the proposal feature is too simple to capture enough temporal context
– it is a multi-stage method rather than a unified framework

slide-20
SLIDE 20

How can we improve BSN?

  • How can we evaluate confidence for all proposals simultaneously with rich context?

– top-down methods achieve this via the anchor mechanism
– the anchor mechanism is not suitable for bottom-up methods like BSN

slide-21
SLIDE 21

Boundary-Matching Network (BMN)

Local and Global

Lin T, Liu X, Li X, et al. Bmn: Boundary-matching network for temporal action proposal generation[C]//Proceedings of the IEEE International Conference on Computer Vision. 2019: 3889-3898.

slide-22
SLIDE 22

BM confidence map

slide-23
SLIDE 23

How can we generate BM confidence map? - BM mechanism!

  • Pipeline:

temporal feature sequence $S_F \in \mathbb{R}^{C \times T}$ → BM feature map $M_F \in \mathbb{R}^{C \times N \times D \times T}$ → BM confidence map $M_C \in \mathbb{R}^{D \times T}$

  • Key issue 1: How to generate BM feature map from temporal feature sequence?

– target: convert $S_F \in \mathbb{R}^{C \times T}$ to $M_F \in \mathbb{R}^{C \times N \times D \times T}$
– for each proposal ($D \times T$ proposals in total), generate $m^f_{i,j} \in \mathbb{R}^{C \times N}$ by sampling from its temporal scope
– precisely and efficiently

  • Difficulties in achieving this feature sampling procedure

– how to sample features at non-integer points? (precisely)
– how to sample features for all proposals simultaneously? (efficiently)

slide-24
SLIDE 24

BM mechanism

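  • Our Solution:

– dot product between $S_F \in \mathbb{R}^{C \times T}$ and sampling weight $W \in \mathbb{R}^{N \times T \times D \times T}$ in the temporal dimension
– for each proposal $\varphi_{i,j}$, construct $w_{i,j} \in \mathbb{R}^{N \times T}$ by uniformly sampling $N$ points in the temporal region $[t_s - 0.25d, t_e + 0.25d]$; for a non-integer sampling point $t_n$, we define $w_{i,j,n} \in \mathbb{R}^{T}$ as:

$$w_{i,j,n}[t] = \begin{cases} 1 - \mathrm{dec}(t_n) & \text{if } t = \lfloor t_n \rfloor \\ \mathrm{dec}(t_n) & \text{if } t = \lfloor t_n \rfloor + 1 \\ 0 & \text{otherwise} \end{cases}$$

where $\mathrm{dec}(t_n)$ is the fractional part of $t_n$
– thus, we can get $w_{i,j} \in \mathbb{R}^{N \times T}$ for proposal $\varphi_{i,j}$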
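The sampling weight for a single proposal can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions: sampling points that fall outside the sequence get zero weight (zero padding), `num_samples` is at least 2, and the function name `sampling_weights` is hypothetical.

```python
import math


def sampling_weights(t_start, t_end, num_samples, num_steps):
    """Build w in R^{N x T} as a list of N rows, each of length T."""
    duration = t_end - t_start
    lo = t_start - 0.25 * duration   # extended region start
    hi = t_end + 0.25 * duration     # extended region end
    weights = []
    for n in range(num_samples):
        # n-th of N uniformly spaced sampling points in [lo, hi].
        t_n = lo + (hi - lo) * n / (num_samples - 1)
        row = [0.0] * num_steps
        base = math.floor(t_n)
        frac = t_n - base
        # Linear interpolation between the two neighbouring integer steps.
        if 0 <= base < num_steps:
            row[base] += 1.0 - frac
        if 0 <= base + 1 < num_steps:
            row[base + 1] += frac
        weights.append(row)
    return weights


# Proposal [2, 4] on a sequence of 8 steps, sampled at 1.5, 3.0 and 4.5.
w = sampling_weights(2.0, 4.0, num_samples=3, num_steps=8)
for row in w:
    print(row)
```

Each row is a fixed interpolation mask, so it depends only on the proposal geometry and can be computed once before training.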

slide-25
SLIDE 25

BM mechanism

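  • Our Solution:

– then, conduct dot product in the temporal dimension between $S_F$ and $w_{i,j}$ to generate $m^f_{i,j} \in \mathbb{R}^{C \times N}$:

$$m^f_{i,j}[c, n] = \sum_{t=1}^{T} S_F[c, t] \cdot w_{i,j}[n, t]$$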

slide-26
SLIDE 26

BM mechanism

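  • Our Solution:

– expand $w_{i,j} \in \mathbb{R}^{N \times T}$ to $W \in \mathbb{R}^{N \times T \times D \times T}$
– dot product between $S_F \in \mathbb{R}^{C \times T}$ and $W \in \mathbb{R}^{N \times T \times D \times T}$ to generate $M_F \in \mathbb{R}^{C \times N \times D \times T}$
– sampling features of all proposals with rich context are generated precisely and efficiently
– this procedure is denoted as the BM layer in our method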
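The whole BM layer can be sketched with nested lists to show how the shapes fit together. The indexing is an assumption made for this example: row index i stands for duration d = i + 1, column index j for start time j, and proposals that run past the end of the sequence are left as zeros. A real implementation would use a tensor library and a single matrix product.

```python
import math


def interp_row(t_n, num_steps):
    """Linear-interpolation mask over T steps for one sampling point."""
    row = [0.0] * num_steps
    base = math.floor(t_n)
    frac = t_n - base
    if 0 <= base < num_steps:
        row[base] += 1.0 - frac
    if 0 <= base + 1 < num_steps:
        row[base + 1] += frac
    return row


def build_sampling_tensor(num_samples, num_steps, max_duration):
    """W[n][t][i][j]: weight of step t for sample n of proposal (i, j)."""
    W = [[[[0.0] * num_steps for _ in range(max_duration)]
          for _ in range(num_steps)] for _ in range(num_samples)]
    for i in range(max_duration):        # duration index, d = i + 1
        for j in range(num_steps):       # start index, t_s = j
            d = i + 1
            if j + d > num_steps:        # runs past the video: keep zeros
                continue
            lo, hi = j - 0.25 * d, j + d + 0.25 * d
            for n in range(num_samples):
                t_n = lo + (hi - lo) * n / (num_samples - 1)
                row = interp_row(t_n, num_steps)
                for t in range(num_steps):
                    W[n][t][i][j] = row[t]
    return W


def bm_layer(S, W):
    """M[c][n][i][j] = sum over t of S[c][t] * W[n][t][i][j]."""
    C, T = len(S), len(S[0])
    N, D = len(W), len(W[0][0])
    return [[[[sum(S[c][t] * W[n][t][i][j] for t in range(T))
               for j in range(T)] for i in range(D)]
             for n in range(N)] for c in range(C)]


# One channel whose value equals the time index, so samples read back t_n.
S = [[float(t) for t in range(8)]]
W = build_sampling_tensor(num_samples=3, num_steps=8, max_duration=4)
M = bm_layer(S, W)
print([M[0][n][1][2] for n in range(3)])
```

Since W does not depend on the input video, it is built once and reused, which is what lets all proposals be sampled in a single product.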

slide-27
SLIDE 27

How can we generate BM confidence map? - BM mechanism!

  • Key issue 2: How to generate BM confidence map from BM feature map?

– Target: convert $M_F \in \mathbb{R}^{C \times N \times D \times T}$ to $M_C \in \mathbb{R}^{D \times T}$
– a series of 3D and 2D convolution layers

  • Key issue 3: What is the label for training BM confidence map?

– BM label map $G_C \in \mathbb{R}^{D \times T}$, where $g^c_{i,j} \in [0,1]$ is the maximum IoU between the proposal and the ground truths
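A BM label map of this kind can be computed as below. This is a sketch under assumptions: row i is taken to mean duration i + 1 and column j start time j, proposals that extend past the sequence keep label 0, and `bm_label_map` is a hypothetical name.

```python
def temporal_iou(a, b):
    """IoU of two temporal segments given as (start, end)."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def bm_label_map(ground_truths, num_steps, max_duration):
    """G[i][j] = max IoU of proposal (start=j, duration=i+1) with any GT."""
    G = [[0.0] * num_steps for _ in range(max_duration)]
    for i in range(max_duration):
        for j in range(num_steps):
            proposal = (j, j + i + 1)
            if proposal[1] > num_steps:
                continue  # proposal runs past the video: label stays 0
            G[i][j] = max(
                (temporal_iou(proposal, gt) for gt in ground_truths),
                default=0.0,
            )
    return G


# One ground-truth action spanning [2, 5] in a sequence of 8 steps.
G = bm_label_map([(2, 5)], num_steps=8, max_duration=4)
for row in G:
    print([round(v, 2) for v in row])
```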

slide-28
SLIDE 28

BMN – Network Architecture

slide-29
SLIDE 29

BMN – Experimental Results

slide-30
SLIDE 30

BMN – Experimental Results

slide-31
SLIDE 31

Qualitative Results

slide-32
SLIDE 32

Qualitative Results

slide-33
SLIDE 33

Lessons Learned

  • Proposals are very important for accurate localization
  • High-quality proposals should have three properties:

– Flexible durations and locations
– Precise temporal boundaries
– Reliable confidence scores

  • The Boundary-Matching mechanism we proposed can efficiently and effectively generate high-quality temporal action proposals

slide-34
SLIDE 34

Applications

  • Video highlight detection
  • Dynamic video cover generation
  • Fine-grained highlight generation for sports videos, game videos, etc.

slide-35
SLIDE 35

Recent Work: CTCN

Li X, Lin T, Liu X, et al. Deep Concept-wise Temporal Convolutional Networks for Action Localization[J]. arXiv preprint arXiv:1908.09442, 2019.

slide-36
SLIDE 36

Recent Work: Relation-Aware Pyramid Network

Gao J, Shi Z, Li J, et al. Accurate Temporal Action Proposal Generation with Relation-Aware Pyramid Network[J]. AAAI, 2020.

slide-37
SLIDE 37

Recent Work: PGCN

Zeng R, Huang W, Tan M, et al. Graph convolutional networks for temporal action localization[C]//Proceedings of the IEEE International Conference on Computer Vision. 2019: 7094-7103.

slide-38
SLIDE 38

Our Work: StNet

  • Temporally sampling “super-images”
  • 2D-Conv on super-images for local S-T modeling
  • Stacked 3D/2D-Conv blocks for global S-T modeling
  • Temporal 1D-Xception for long-term dynamic modeling

He D, Zhou Z, Gan C, et al. Stnet: Local and global spatial-temporal modeling for action recognition[C]//Proceedings of the AAAI Conference on Artificial Intelligence. 2019, 33: 8401-8408.

slide-39
SLIDE 39

Our Work: MARL

Wu W, He D, Tan X, et al. Multi-Agent Reinforcement Learning Based Frame Sampling for Effective Untrimmed Video Recognition[C]//Proceedings of the IEEE International Conference on Computer Vision. 2019: 6222-6231.

slide-40
SLIDE 40

Our Work: Label Graph Superimposing

Wang Y, He D, Li F, et al. Multi-Label Classification with Label Graph Superimposing[J]. AAAI, 2020.

slide-41
SLIDE 41

Our Work: Dynamic Inference

Wu W, He D, Tan X, et al. Dynamic Inference: A New Approach Toward Efficient Video Action Recognition[J]. arXiv preprint arXiv:2002.03342, 2020.

slide-42
SLIDE 42

PaddleVideo

  • Action Recognition: TSN/TSM/StNet/Non-Local/NeXtVLAD…
  • Action Detection: BSN/BMN/CTCN…
  • Video Description: ETS
  • Temporal Localization via Language Query: TALL
  • https://github.com/PaddlePaddle/models/tree/develop/PaddleCV/PaddleVideo
slide-43
SLIDE 43

THANKS