S3D: Single Shot multi-Span Detector via Fully 3D Convolutional Network

Jan 23, 2023

SLIDE 1

S3D: Single Shot multi-Span Detector via Fully 3D Convolutional Network

Da Zhang¹, Xiyang Dai², Xin Wang¹, and Yuan-Fang Wang¹

dazhang@cs.ucsb.edu

¹UC Santa Barbara & ²University of Maryland

SLIDE 2

Task: Temporal Activity Detection

Input: untrimmed videos

1. Localization: when do activities start/end?
2. Classification: what are the activities?

Detection Results

Pole Vault [228.1 - 236.6s]
Pole Vault [242.0 - 247.7s]

SLIDE 3

Related Works

Conventional two-stage approach: Proposal + Classification

1. Temporal Proposal: sliding window, DAP, etc.
2. Activity Classifier: two-stream, C3D, etc.

Output: Pole Vault [228.1 - 236.6s], Pole Vault [242.0 - 247.7s]

S-CNN (CVPR 2016), CDC (CVPR 2017), TSN (ICCV 2017), R-C3D (ICCV 2017), SSN (ICCV 2017)

SLIDE 4

Related Works

Current limitations of the two-stage pipeline (Temporal Proposal, then Activity Classifier):

- Ineffective
- Inefficient

S-CNN (CVPR 2016), CDC (CVPR 2017), TSN (ICCV 2017), R-C3D (ICCV 2017), SSN (ICCV 2017)

SLIDE 5

Motivation

Can we do better? Single-shot, end-to-end:

Pole Vault [228.1 - 236.6s]
Pole Vault [242.0 - 247.7s]

Introducing a novel Single Shot multi-Span Detector (S3D)

SLIDE 6

Motivation

Quick Summary: single-shot, end-to-end

Pole Vault [228.1 - 236.6s]
Pole Vault [242.0 - 247.7s]

- Directly encode the entire input video with Conv3D kernels
- Multi-scale default spans associated with temporal feature maps
- End-to-end trainable, with single forward-pass inference

SLIDE 7

S3D: Input Video

Our model takes the whole video stream as input (L frames at 112 x 112).

SLIDE 8

S3D: Base Feature Layers

We apply the standard C3D network to extract spatial-temporal features.

Input: L frames at 112 x 112
C3D up to Conv5b: temporal length L/8, spatial size 7 x 7

D. Tran, L. Bourdev, R. Fergus, L. Torresani and M. Paluri. Learning spatiotemporal features with 3D convolutional networks. In ICCV, 2015.

SLIDE 9

S3D: Auxiliary Feature Layers

We produce a sequence of feature maps that progressively decrease in temporal dimension.

C3D up to Conv5b: temporal length L/8
Auxiliary Feature Layers: temporal lengths L/16, L/32, L/64, L/128, L/256
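The temporal lengths on this slide follow from repeated halving. A minimal sketch in Python, assuming each auxiliary layer halves the temporal dimension exactly and that L is a multiple of 256; the function name `temporal_lengths` is hypothetical and is not taken from the S3D code:

```python
def temporal_lengths(num_frames):
    """Temporal length of each feature map for an input of num_frames frames.

    The strides 8, 16, ..., 256 are read off the slide (L/8 for Conv5b,
    then L/16 ... L/256 for the auxiliary layers). Exact division is an
    assumption of this sketch.
    """
    strides = [8, 16, 32, 64, 128, 256]
    if num_frames % strides[-1] != 0:
        raise ValueError("this sketch assumes num_frames is a multiple of 256")
    return [num_frames // s for s in strides]


lengths = temporal_lengths(256)
print(lengths)  # [32, 16, 8, 4, 2, 1]
```

With L = 256 the coarsest layer covers the whole input with a single temporal position.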

SLIDE 10

S3D: Multi-scale Default Spans

Multi-scale default spans are associated with each temporal feature map.

[Figure: temporal feature layers with default spans; labels T, 3T/4, T/2, T/4 and T/8, 3T/8, 5T/8, 7T/8]
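One way to read the labels T/8, 3T/8, 5T/8, 7T/8 is as the centers of four equal cells of a temporal feature map covering a video of duration T. The sketch below only illustrates that reading; it assumes default spans are centered on feature-map cells, and `cell_centers` is a hypothetical helper, not part of the S3D code:

```python
from fractions import Fraction


def cell_centers(num_cells):
    """Centers of num_cells equal cells, as fractions of the duration T."""
    return [Fraction(2 * i + 1, 2 * num_cells) for i in range(num_cells)]


centers = cell_centers(4)
print([str(c) for c in centers])  # ['1/8', '3/8', '5/8', '7/8']
```

Under this assumption, a coarser feature map has fewer, more widely spaced centers, so different layers cover different temporal scales.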

SLIDE 11

S3D: Multi-scale Default Spans

Localization and classification results are predicted at each default span.

Loc: localization offsets for the span
Conf: confidence scores for the activity classes plus background

[Figure: temporal feature layers with default spans; labels T, 3T/4, T/2, T/4 and T/8, 3T/8, 5T/8, 7T/8]

SLIDE 12

S3D: Convolutional Predictors

We apply a Conv3D filter on top of each temporal feature layer to produce the results.

3D Max pool, Conv3D: 3x1x1x(4x(K+1+2))

SLIDE 13

S3D: Convolutional Predictors

Predictor on each temporal feature layer:

3D Max pool, Conv3D: 3x1x1x(4x(K+1+2))

- 3x1x1: kernel size
- 4: # of scales
- K+1: classes + BG
- 2: localization offsets
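The output depth 4x(K+1+2) can be checked with a few lines of Python. This is a sketch under stated assumptions: K = 20 matches the 20 THUMOS'14 activities mentioned in the results, while the channel ordering (scale-major, confidences before offsets) and the helper names are hypothetical and chosen only for illustration:

```python
NUM_SCALES = 4   # "# of scales" on the slide
NUM_OFFSETS = 2  # "localization offsets" on the slide


def predictor_channels(num_classes):
    """Output channels of one predictor: 4 x (K + 1 + 2)."""
    return NUM_SCALES * (num_classes + 1 + NUM_OFFSETS)


def split_prediction(vector, num_classes):
    """Split one temporal position's output into per-scale (conf, loc) pairs.

    The channel ordering used here is an assumption, not the authors' layout.
    """
    per_scale = num_classes + 1 + NUM_OFFSETS
    if len(vector) != NUM_SCALES * per_scale:
        raise ValueError("unexpected vector length")
    pairs = []
    for s in range(NUM_SCALES):
        chunk = vector[s * per_scale:(s + 1) * per_scale]
        pairs.append((chunk[:num_classes + 1], chunk[num_classes + 1:]))
    return pairs


K = 20  # number of activity classes on THUMOS'14
print(predictor_channels(K))  # 92
pairs = split_prediction(list(range(92)), K)
print(len(pairs), len(pairs[0][0]), len(pairs[0][1]))  # 4 21 2
```

Each temporal position therefore yields four default-span predictions, each with K+1 confidences and 2 offsets.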

SLIDE 14

Single Shot multi-Span Detector

Training of S3D:

Input Video: 256 frames at 112 x 112

Base Feature Layers: C3D up to Conv5b layer (Conv5: temporal length 32, spatial size 7 x 7)

Auxiliary Temporal Feature Layers: Conv6, Conv7, Conv8, Conv9, Conv10 (temporal lengths 16, 8, 4, 2, 1)

Predictor on each feature layer: 3D Max pool, Conv3D: 3x1x1x(4x(K+1+2))

252 Temporal Spans per Video: (32 + 16 + 8 + 4 + 2 + 1) temporal positions x 4 scales = 252

Temporal NMS

Temporal Activity Detections: activity A, activity B (over Time)

Losses: Smooth L1, Softmax Cross Entropy, Sigmoid Cross Entropy
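The Temporal NMS step in the diagram can be sketched as a greedy suppression over one-dimensional intervals. This is a generic illustration, not the authors' implementation: the IoU threshold of 0.5, the confidence scores, and the function names are assumptions, and only the two Pole Vault intervals come from the slides.

```python
def temporal_iou(a, b):
    """Intersection over union of two (start, end) intervals in seconds."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def temporal_nms(detections, iou_threshold=0.5):
    """Greedy NMS over (start, end, score) tuples of a single class.

    Detections are visited in descending score order; one is kept only if
    its IoU with every already-kept detection is at most the threshold.
    """
    kept = []
    for det in sorted(detections, key=lambda d: d[2], reverse=True):
        if all(temporal_iou(det[:2], k[:2]) <= iou_threshold for k in kept):
            kept.append(det)
    return kept


dets = [
    (228.1, 236.6, 0.9),
    (228.5, 236.0, 0.6),  # overlaps the first heavily, so it is suppressed
    (242.0, 247.7, 0.8),
]
print(temporal_nms(dets))  # [(228.1, 236.6, 0.9), (242.0, 247.7, 0.8)]
```

In a multi-class setting this would typically be run once per activity class.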

SLIDE 15

Quantitative Results

Evaluation: mean Average Precision over 20 activities on THUMOS’14

1271 FPS on a single GTX 1080 Ti GPU
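To put the throughput figure in perspective, the arithmetic below converts it into time per input window and a real-time factor. It assumes a 256-frame input window, as the 256 in the training diagram suggests, and a 30 fps source video; the frame rate is an assumption and is not stated on the slides.

```python
FPS = 1271           # throughput reported on the slide
WINDOW_FRAMES = 256  # input length, read from the training diagram
VIDEO_FPS = 30       # assumed source frame rate (not from the slides)

seconds_per_window = WINDOW_FRAMES / FPS
realtime_factor = FPS / VIDEO_FPS

print(f"{seconds_per_window:.3f} s per 256-frame window")  # 0.201 s per 256-frame window
print(f"{realtime_factor:.1f}x real time at 30 fps")  # 42.4x real time at 30 fps
```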

SLIDE 16

Qualitative Results

THUMOS’14 segment: Pole Vault

SLIDE 17

Qualitative Results

THUMOS’14 segment: Javelin Throw

SLIDE 18

Qualitative Results

THUMOS’14 segment: Shotput

SLIDE 19

Qualitative Results

THUMOS’14 segment: Clean and Jerk

SLIDE 20

Conclusions

Introduced S3D:

- A novel single-shot end-to-end model for Temporal Activity Detection.
- Simple: completely based on Conv3D kernels.
- Strong: state-of-the-art performance on THUMOS’14 benchmark.
- Speed: operates at 1271 FPS on a single GeForce GTX 1080 Ti GPU.

TensorFlow code coming soon at https://github.com/dazhang-cv/S3D

SLIDE 21

Thank you!
