S3D: Single Shot multi-Span Detector via Fully 3D Convolutional Network

SLIDE 1

S3D: Single Shot multi-Span Detector via Fully 3D Convolutional Network

Da Zhang1, Xiyang Dai2, Xin Wang1, and Yuan-Fang Wang1

dazhang@cs.ucsb.edu

1UC Santa Barbara & 2University of Maryland

SLIDE 2

Task: Temporal Activity Detection

Input: untrimmed videos

  • 1. Localization: when do activities start/end?
  • 2. Classification: what are the activities?

Detection Results

  • Pole Vault [228.1 - 236.6s]
  • Pole Vault [242.0 - 247.7s]

SLIDE 3

Related Works

Conventional two-stage approach: Proposal + Classification

  • Temporal Proposal: sliding window, DAP, etc.
  • Activity Classifier: two-stream, C3D, etc.

Pipeline: Temporal Proposal → Activity Classifier → detections (Pole Vault [228.1 - 236.6s], Pole Vault [242.0 - 247.7s])

Examples: S-CNN (CVPR 2016), CDC (CVPR 2017), TSN (ICCV 2017), R-C3D (ICCV 2017), SSN (ICCV 2017)

SLIDE 4

Related Works

Current limitations:

  • Ineffective
  • Inefficient

S-CNN (CVPR 2016), CDC (CVPR 2017), TSN (ICCV 2017), R-C3D (ICCV 2017), SSN (ICCV 2017)

SLIDE 5

Motivation

Can we do better?

  • Single-shot
  • End-to-end

Introducing a novel Single Shot multi-Span Detector (S3D)

SLIDE 6

Motivation

Quick Summary: single-shot, end-to-end

  • Directly encode the entire input video with Conv3D kernels
  • Multi-scale default spans associated with temporal feature maps
  • End-to-end trainable, with single forward-pass inference

SLIDE 7

S3D: Input Video

Our model takes the whole video stream as input (L frames).

Figure labels: 112 (frame size), L (number of frames)

SLIDE 8

S3D: Base Feature Layers

We apply the standard C3D network (up to Conv5b) to extract spatial-temporal features.

Figure labels: input 112, L; output L/8, 7

  • D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri. Learning spatiotemporal features with 3D convolutional networks. In ICCV, 2015.

SLIDE 9

S3D: Auxiliary Feature Layers

We produce a sequence of feature maps that progressively decrease in temporal dimension.

  • C3D up to Conv5b: L/8
  • Auxiliary Feature Layers: L/16, L/32, L/64, L/128, L/256
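The temporal lengths on this slide follow a simple halving pattern, which can be sketched as below. This is only an illustration of the arithmetic: the function name `temporal_lengths` and its parameters are hypothetical, and the 256-frame input is an example value, not a requirement stated on the slide.

```python
def temporal_lengths(num_frames, first_stride=8, num_maps=6):
    """Temporal length of each feature map: L/8, L/16, ..., L/256.

    Assumes (hypothetically) that every map halves the temporal length of
    the previous one and that num_frames divides evenly at every level.
    """
    last_stride = first_stride * 2 ** (num_maps - 1)
    if num_frames % last_stride != 0:
        raise ValueError(f"num_frames must be a multiple of {last_stride}")
    return [num_frames // (first_stride * 2 ** i) for i in range(num_maps)]


lengths = temporal_lengths(256)
print(lengths)  # [32, 16, 8, 4, 2, 1]
```

With a 256-frame input, the coarsest map has a single temporal cell, so it covers the whole input at once.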

SLIDE 10

S3D: Multi-scale Default Spans

Multi-scale default spans are associated with each temporal feature map.

Figure labels (Temporal Feature Layers): T, 3T/4, T/2, T/4 and T/8, 3T/8, 5T/8, 7T/8
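One plausible reading of the figure labels is that each cell of a temporal feature map carries four default spans of widths T, 3T/4, T/2 and T/4, centered on the cell (T/8, 3T/8, 5T/8, 7T/8 for a four-cell map). The sketch below follows that reading. The function `default_spans`, its parameters, and the exact width ratios are assumptions made for illustration, not the authors' published settings.

```python
def default_spans(num_cells, base_length=1.0, ratios=(1.0, 0.75, 0.5, 0.25)):
    """Return (center, width) default spans for one temporal feature map.

    Assumption: spans are centered on each cell, and widths are fixed
    fractions of a base length T (base_length). Both are hypothetical.
    """
    spans = []
    for i in range(num_cells):
        center = (i + 0.5) * base_length / num_cells
        for ratio in ratios:
            spans.append((center, ratio * base_length))
    return spans


# Four cells give centers at T/8, 3T/8, 5T/8, 7T/8 (here T = 1.0).
centers = sorted({center for center, _ in default_spans(4)})
print(centers)  # [0.125, 0.375, 0.625, 0.875]

# Feature maps of temporal length 32, 16, 8, 4, 2, 1 with 4 spans per cell.
total = sum(len(default_spans(n)) for n in (32, 16, 8, 4, 2, 1))
print(total)  # 252
```

Under these assumptions the total matches the "252 Temporal Spans per Video" figure shown on the training slide for a 256-frame input.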

SLIDE 11

S3D: Multi-scale Default Spans

Localization and classification results are predicted at each default span.

  • Loc: $\Delta(c_t, w_t)$
  • Conf: $(c_1, c_2, \ldots, c_K, c_{act})$

SLIDE 12

S3D: Convolutional Predictors

We apply a Conv3D filter on top of each feature map to produce the results.

3D Max pool, Conv3D: 3x1x1x(4x(K+1+2))

SLIDE 13

S3D: Convolutional Predictors

3D Max pool, Conv3D: 3x1x1x(4x(K+1+2))

  • 3x1x1: kernel size
  • 4: # of scales
  • K+1: classes + BG
  • 2: localization offsets

Per default span: Loc $\Delta(c_t, w_t)$, Conf $(c_1, c_2, \ldots, c_K, c_{act})$
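The channel count 4x(K+1+2) can be checked with a short sketch. The helper names `predictor_channels` and `split_channels` are hypothetical, and the channel ordering (scale by scale, confidences before offsets) is an assumption chosen for illustration. The slides do not specify the layout.

```python
def predictor_channels(num_classes, num_scales=4, num_offsets=2):
    """Output channels of the predictor: num_scales x (K + 1 + 2)."""
    return num_scales * (num_classes + 1 + num_offsets)


def split_channels(vector, num_classes, num_scales=4, num_offsets=2):
    """Split one temporal position's channel vector into per-scale
    (confidences, offsets) pairs. The ordering is an assumed layout."""
    per_span = num_classes + 1 + num_offsets
    if len(vector) != num_scales * per_span:
        raise ValueError("unexpected channel count")
    pairs = []
    for s in range(num_scales):
        chunk = vector[s * per_span:(s + 1) * per_span]
        pairs.append((chunk[:num_classes + 1], chunk[num_classes + 1:]))
    return pairs


# THUMOS'14 has K = 20 activity classes.
channels = predictor_channels(20)
print(channels)  # 92

pairs = split_channels(list(range(channels)), num_classes=20)
print(len(pairs), len(pairs[0][0]), len(pairs[0][1]))  # 4 21 2
```

For the 20 THUMOS'14 classes this gives 92 output channels per temporal position: four default spans, each with 21 confidences and 2 localization offsets.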

SLIDE 14

Single Shot multi-Span Detector

Training of S3D:

Architecture figure: Input Video (112 x 112 x 256) → Base Feature Layers (C3D up to Conv5b layer; Conv5 output 7 x 7 x 32) → Auxiliary Temporal Feature Layers (Conv6 to Conv10; temporal lengths 16, 8, 4, 2, 1) → 3D Max pool, Conv3D: 3x1x1x(4x(K+1+2)) → 252 Temporal Spans per Video → Temporal NMS → Temporal Activity Detections (activity A, activity B over Time)

Losses: Smooth L1, Softmax Cross Entropy, Sigmoid Cross Entropy
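The Temporal NMS step named on this slide can be illustrated with a standard greedy non-maximum suppression over 1D intervals. This is a generic sketch, not the authors' implementation: the function names, the 0.5 IoU threshold, and the confidence scores in the example are all assumptions. Only the two Pole Vault intervals come from the slides.

```python
def temporal_iou(a, b):
    """Intersection over union of two (start, end) intervals in seconds."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def temporal_nms(detections, iou_threshold=0.5):
    """Greedy NMS over (start, end, score) detections of one class.

    Keeps the highest-scoring detection, drops any remaining detection
    that overlaps a kept one by more than iou_threshold, and repeats.
    """
    kept = []
    for det in sorted(detections, key=lambda d: d[2], reverse=True):
        if all(temporal_iou(det[:2], k[:2]) <= iou_threshold for k in kept):
            kept.append(det)
    return kept


# Scores and the middle detection are made up for illustration.
detections = [
    (228.1, 236.6, 0.9),
    (228.5, 237.0, 0.6),  # near-duplicate of the first detection
    (242.0, 247.7, 0.8),
]
print(temporal_nms(detections))
# [(228.1, 236.6, 0.9), (242.0, 247.7, 0.8)]
```

The near-duplicate is suppressed because its overlap with the top-scoring span is about 0.91, well above the assumed threshold.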

SLIDE 15

Quantitative Results

Evaluation: mean Average Precision over 20 activities on THUMOS’14

1271 FPS on a single GTX 1080 Ti GPU

SLIDE 16

Qualitative Results

THUMOS’14 segment: Pole Vault

SLIDE 17

Qualitative Results

THUMOS’14 segment: Javelin Throw

SLIDE 18

Qualitative Results

THUMOS’14 segment: Shotput

SLIDE 19

Qualitative Results

THUMOS’14 segment: Clean and Jerk

SLIDE 20

Conclusions

Introduced S3D:

  • A novel single-shot end-to-end model for Temporal Activity Detection.
  • Simple: completely based on Conv3D kernels.
  • Strong: state-of-the-art performance on THUMOS’14 benchmark.
  • Speed: operates at 1271 FPS on a single GeForce GTX 1080 Ti GPU.

TensorFlow code coming soon at https://github.com/dazhang-cv/S3D

SLIDE 21

Thank you!
