S3D: Single Shot multi-Span Detector via Fully 3D Convolutional Network
Da Zhang1, Xiyang Dai2, Xin Wang1, and Yuan-Fang Wang1
dazhang@cs.ucsb.edu
1UC Santa Barbara & 2University of Maryland
Task: Temporal Activity Detection
Input: untrimmed videos
Detection Results
Pole Vault [228.1 - 236.6s] Pole Vault [242.0 - 247.7s]
Conventional two-stage approach: Proposal + Classification
Temporal Proposal (sliding window, DAP, etc.) -> Activity Classifier (two-stream, C3D, etc.) -> Pole Vault [228.1 - 236.6s], Pole Vault [242.0 - 247.7s]
S-CNN (CVPR 2016), CDC (CVPR 2017), TSN (ICCV 2017), R-C3D (ICCV 2017), SSN (ICCV 2017)
Current limitations:
Can we do better? Single-shot and end-to-end.
Introducing a novel Single Shot multi-Span Detector (S3D)
Quick Summary: single-shot, end-to-end
- Directly encode the entire input video with Conv3D kernels
- Multi-scale default spans associated with temporal feature maps
- End-to-end trainable, with single forward-pass inference
Our model takes the whole video stream as input (L frames at 112x112 resolution).
We apply the standard C3D network (up to Conv5b) to extract spatio-temporal features. The resulting feature map has temporal length L/8 and spatial size 7x7.
Auxiliary Feature Layers: we produce a sequence of feature maps that progressively decrease in temporal dimension (L/16, L/32, L/64, L/128, L/256).
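As a quick check of the temporal sizes above, the following sketch computes the length of each temporal feature map for a given number of input frames. The function name `temporal_lengths` is hypothetical, and L = 256 is an assumption chosen to be consistent with the 32-step Conv5 output shown in the architecture diagram; the divisors 8 through 256 come from the slides.

```python
def temporal_lengths(num_frames, first_stride=8, num_layers=6):
    """Return the temporal length of each feature layer.

    Assumes the base layer reduces time by `first_stride` (L/8) and each
    following layer halves it again (L/16, ..., L/256).
    """
    lengths = []
    stride = first_stride
    for _ in range(num_layers):
        if num_frames % stride != 0:
            raise ValueError(
                f"{num_frames} frames is not divisible by stride {stride}"
            )
        lengths.append(num_frames // stride)
        stride *= 2
    return lengths


lengths = temporal_lengths(256)
print(lengths)  # [32, 16, 8, 4, 2, 1]
```

With 256 input frames, the six layers together contain 63 temporal locations.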
Multi-scale default spans are associated with each temporal feature map.
[Figure: default spans on the temporal feature layers; time axes labeled T/4, T/2, 3T/4, T and T/8, 3T/8, 5T/8, 7T/8]
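One way to picture the default spans is to enumerate them. The sketch below is an illustration under stated assumptions, not the authors' implementation: it places span centers at the middle of each temporal cell (which matches the T/8, 3T/8, 5T/8, 7T/8 labels for a length-4 feature map), and the four `scales` values are hypothetical placeholders, since the slides give only the number of scales.

```python
def default_spans(feature_lengths, scales=(0.5, 1.0, 1.5, 2.0)):
    """Generate (center, width) default spans in normalized video time [0, 1].

    `scales` is a hypothetical choice: each width is scale * cell size.
    Centers sit at the middle of each temporal cell (an assumption).
    """
    spans = []
    for length in feature_lengths:
        cell = 1.0 / length
        for i in range(length):
            center = (i + 0.5) * cell
            for scale in scales:
                spans.append((center, scale * cell))
    return spans


spans = default_spans([32, 16, 8, 4, 2, 1])
print(len(spans))  # 252
```

With four scales and feature maps of length 32, 16, 8, 4, 2 and 1, this yields 4 x 63 = 252 spans, the count reported for one video.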
Localization and classification results are predicted at each default span.
Loc: localization offsets for the span. Conf: confidence scores over the K activity classes plus background.
We apply a Conv3D filter on top of each feature map to produce the results.
3D Max pool, Conv3D: 3x1x1x(4x(K+1+2))
- 3x1x1: kernel size
- 4: # of scales
- K+1: classes + BG
- 2: localization offsets
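To make the output size 4x(K+1+2) concrete, the sketch below computes the number of output channels and splits one temporal location's output vector into per-scale confidence scores and localization offsets. The function names and the channel layout (scale-major, class scores first, then the two offsets) are assumptions for illustration; the slides specify only the total size.

```python
def head_channels(num_classes, num_scales=4, num_offsets=2):
    """Output channels per temporal location: scales * (K + 1 + offsets)."""
    return num_scales * (num_classes + 1 + num_offsets)


def split_prediction(vector, num_classes, num_scales=4, num_offsets=2):
    """Split one location's output into a (conf, loc) pair per scale.

    Assumed layout (hypothetical): for each scale, K + 1 class scores
    (activities plus background) followed by the localization offsets.
    """
    per_scale = num_classes + 1 + num_offsets
    if len(vector) != num_scales * per_scale:
        raise ValueError("vector length does not match the head layout")
    predictions = []
    for s in range(num_scales):
        chunk = vector[s * per_scale:(s + 1) * per_scale]
        conf = chunk[:num_classes + 1]
        loc = chunk[num_classes + 1:]
        predictions.append((conf, loc))
    return predictions


print(head_channels(20))  # 92
```

For the 20 activity classes of THUMOS'14, each temporal location therefore produces 4 x (20 + 1 + 2) = 92 values.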
Training of S3D:
[Architecture figure]
Input video (112x112, 256 frames)
-> Base feature layers: C3D up to Conv5b layer (Conv5 output: 7x7 spatial, 32 temporal steps)
-> Auxiliary temporal feature layers: Conv6 to Conv10 (16, 8, 4, 2, 1 temporal steps)
-> 3D Max pool, Conv3D: 3x1x1x(4x(K+1+2)) applied to the feature layers
-> 252 temporal spans per video
-> Temporal NMS
-> Temporal activity detections (activity A, activity B along the time axis)
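The Temporal NMS step at the end of the pipeline can be sketched as greedy non-maximum suppression over 1D segments. This is a generic illustration rather than the authors' code: the function names, the `(start, end, score)` tuple format, and the 0.4 overlap threshold are assumptions, and in practice the suppression would typically be run separately for each activity class.

```python
def temporal_iou(a, b):
    """Intersection over union of two segments given as (start, end, ...)."""
    intersection = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - intersection
    return intersection / union if union > 0 else 0.0


def temporal_nms(detections, iou_threshold=0.4):
    """Greedy NMS over (start, end, score) detections.

    Keeps the highest-scoring detection, drops any remaining detection that
    overlaps a kept one by more than `iou_threshold`, and repeats.
    """
    kept = []
    for det in sorted(detections, key=lambda d: d[2], reverse=True):
        if all(temporal_iou(det, k) <= iou_threshold for k in kept):
            kept.append(det)
    return kept


detections = [
    (228.1, 236.6, 0.90),
    (228.5, 237.0, 0.80),  # heavily overlaps the first detection
    (242.0, 247.7, 0.85),
]
print(temporal_nms(detections))
# [(228.1, 236.6, 0.9), (242.0, 247.7, 0.85)]
```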
Evaluation: mean Average Precision over 20 activities on THUMOS’14
THUMOS’14 segment: Pole Vault
THUMOS’14 segment: Javelin Throw
THUMOS’14 segment: Shotput
THUMOS’14 segment: Clean and Jerk
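For readers unfamiliar with the metric, the sketch below shows how average precision for a single activity class can be computed from ranked detections and ground-truth segments at a temporal IoU threshold; mAP is the mean of this value over the 20 classes. It is a simplified illustration, not the official THUMOS'14 evaluation code: the function names, the 0.5 threshold, and the greedy matching rule are assumptions.

```python
def temporal_iou(a, b):
    """Intersection over union of two segments given as (start, end, ...)."""
    intersection = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - intersection
    return intersection / union if union > 0 else 0.0


def average_precision(detections, ground_truth, iou_threshold=0.5):
    """Non-interpolated AP for one class.

    detections: list of (start, end, score); ground_truth: list of (start, end).
    Each ground-truth segment can be matched by at most one detection.
    """
    if not ground_truth:
        return 0.0
    matched = set()
    true_positives = 0
    precision_sum = 0.0
    ranked = sorted(detections, key=lambda d: d[2], reverse=True)
    for rank, det in enumerate(ranked, start=1):
        best_iou, best_idx = 0.0, None
        for idx, gt in enumerate(ground_truth):
            if idx in matched:
                continue
            iou = temporal_iou(det, gt)
            if iou > best_iou:
                best_iou, best_idx = iou, idx
        if best_idx is not None and best_iou >= iou_threshold:
            matched.add(best_idx)
            true_positives += 1
            precision_sum += true_positives / rank
    return precision_sum / len(ground_truth)


ground_truth = [(228.1, 236.6), (242.0, 247.7)]
detections = [(228.0, 236.0, 0.9), (100.0, 110.0, 0.8), (242.5, 247.0, 0.7)]
print(round(average_precision(detections, ground_truth), 4))  # 0.8333
```

In this example the first and third detections match the two ground-truth segments, and the second is a false positive, giving (1/1 + 2/3) / 2.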
Introduced S3D:
- A novel single-shot end-to-end model for Temporal Activity Detection.
- Simple: completely based on Conv3D kernels.
- Strong: state-of-the-art performance on THUMOS’14 benchmark.
- Speed: operates at 1271 FPS on a single GeForce GTX 1080 Ti GPU.
TensorFlow code coming soon at https://github.com/dazhang-cv/S3D