S3D: Single Shot multi-Span Detector via Fully 3D Convolutional Network

Da Zhang (1), Xiyang Dai (2), Xin Wang (1), and Yuan-Fang Wang (1); contact: dazhang@cs.ucsb.edu. Affiliations: (1) UC Santa Barbara, (2) University of Maryland. Task: temporal activity detection.


  1. S3D: Single Shot multi-Span Detector via Fully 3D Convolutional Network. Da Zhang (1), Xiyang Dai (2), Xin Wang (1), and Yuan-Fang Wang (1); contact: dazhang@cs.ucsb.edu. Affiliations: (1) UC Santa Barbara, (2) University of Maryland.

  2. Task: Temporal Activity Detection. Input: untrimmed videos. (1) Localization: when do activities start and end? (2) Classification: what are the activities? Example detection results: Pole Vault [228.1 - 236.6s], Pole Vault [242.0 - 247.7s].

  3. Related Works. Conventional two-stage approach: proposal + classification. Temporal proposal: sliding window, DAP, etc. Activity classifier: two-stream, C3D, etc. Example output: Pole Vault [228.1 - 236.6s], Pole Vault [242.0 - 247.7s]. Cited methods: S-CNN (CVPR 2016), CDC (CVPR 2017), TSN (ICCV 2017), R-C3D (ICCV 2017), SSN (ICCV 2017).

  4. Related Works. Current limitations of the two-stage pipeline (temporal proposal followed by activity classifier): ineffective and inefficient. Example output: Pole Vault [228.1 - 236.6s], Pole Vault [242.0 - 247.7s]. Cited methods: S-CNN (CVPR 2016), CDC (CVPR 2017), TSN (ICCV 2017), R-C3D (ICCV 2017), SSN (ICCV 2017).

  5. Motivation. Can we do better? Goal: a single-shot, end-to-end detector. Example output: Pole Vault [228.1 - 236.6s], Pole Vault [242.0 - 247.7s]. Introducing a novel Single Shot multi-Span Detector (S3D).

  6. Motivation: Quick Summary. Single-shot, end-to-end. Example output: Pole Vault [228.1 - 236.6s], Pole Vault [242.0 - 247.7s].
     - Directly encode the entire input video with Conv3D kernels.
     - Multi-scale default spans associated with temporal feature maps.
     - End-to-end trainable, with single forward-pass inference.

  7. S3D: Input Video. Our model takes the whole video stream as input (L frames). Figure labels: L (number of frames) and 112.

  8. S3D: Base Feature Layers. C3D up to Conv5b. We apply the standard C3D network to extract spatio-temporal features. Figure labels: L and 112 on the input, L/8 and 7 on the feature map. Reference: D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri. Learning spatiotemporal features with 3D convolutional networks. In ICCV, 2015.
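The L/8 and 7 labels on this slide follow from the pooling layout of the standard C3D network. The sketch below is only shape arithmetic, not the authors' code; it assumes the usual C3D configuration (pool1 is 1x2x2, pool2 through pool4 are 2x2x2, and Conv5b sits before pool5), and the function name `c3d_conv5b_shape` is made up for illustration.

```python
def c3d_conv5b_shape(num_frames, frame_size=112):
    """Return (temporal, height, width) of the Conv5b feature map.

    Assumption: standard C3D pooling, where pool1 only pools spatially
    (1x2x2) and pool2 to pool4 pool in time and space (2x2x2).
    """
    # (temporal stride, spatial stride) for pool1..pool4
    pools = [(1, 2), (2, 2), (2, 2), (2, 2)]
    t, s = num_frames, frame_size
    for t_stride, s_stride in pools:
        t //= t_stride
        s //= s_stride
    return t, s, s


# A 256-frame, 112x112 input (256 is the frame count labeled on slide 14).
print(c3d_conv5b_shape(256))  # (32, 7, 7), i.e. L/8 x 7 x 7
```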

  9. S3D: Auxiliary Feature Layers. On top of C3D (up to Conv5b), we produce a sequence of feature maps that progressively decrease in temporal dimension: L/8, L/16, L/32, L/64, L/128, L/256.
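To make the L/8 through L/256 sequence concrete, here is a small sketch that lists the temporal length of each feature map for a given input length. It assumes each layer simply halves the temporal dimension of the previous one, as the slide's labels suggest; `temporal_lengths` is a hypothetical helper name.

```python
def temporal_lengths(num_frames, first_divisor=8, last_divisor=256):
    """List the temporal length of each feature map, from L/8 down to L/256."""
    lengths = []
    divisor = first_divisor
    while divisor <= last_divisor:
        lengths.append(num_frames // divisor)
        divisor *= 2  # each layer halves the temporal dimension
    return lengths


lengths = temporal_lengths(256)
print(lengths)       # [32, 16, 8, 4, 2, 1]
print(sum(lengths))  # 63 temporal cells across all six feature maps
```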

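  10. S3D: Multi-scale Default Spans. Multi-scale default spans are associated with each temporal feature map. Figure: two temporal feature layers over the timeline [0, T], one with grid marks at 0, T/4, T/2, 3T/4, T and one with grid marks at 0, T/8, T/4, 3T/8, T/2, 5T/8, 3T/4, 7T/8, T.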

  11. S3D: Multi-scale Default Spans. Localization and classification results are predicted at each default span. Loc: 2 localization offsets per span. Conf: K+1 confidence scores per span (K classes plus background). Figure: the same two temporal feature layers over [0, T], with grid marks at 0, T/4, T/2, 3T/4, T and at 0, T/8, T/4, 3T/8, T/2, 5T/8, 3T/4, 7T/8, T.
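One way to picture default spans is as (center, width) pairs placed at every cell of a temporal feature map, several per cell. The sketch below is an illustration under stated assumptions, not the authors' definition: the four `scale_ratios` are hypothetical values (the slides give only the count of scales, not the ratios), and the `decode` step borrows the SSD-style offset parameterization, which the slides do not confirm.

```python
import math


def default_spans(num_cells, scale_ratios=(0.5, 0.75, 1.0, 1.5)):
    """Return (center, width) default spans on a timeline normalized to [0, 1].

    scale_ratios are hypothetical multipliers of the cell width.
    """
    cell_width = 1.0 / num_cells
    spans = []
    for i in range(num_cells):
        center = (i + 0.5) * cell_width  # middle of the i-th cell
        for ratio in scale_ratios:
            spans.append((center, ratio * cell_width))
    return spans


def decode(span, offsets):
    """Turn one default span plus 2 predicted offsets into (start, end).

    Assumed parameterization: center shift scaled by the span width,
    and a log-scale change of the width.
    """
    center, width = span
    d_center, d_width = offsets
    new_center = center + d_center * width
    new_width = width * math.exp(d_width)
    return new_center - new_width / 2, new_center + new_width / 2


spans = default_spans(4)           # the T/4 grid from the figure
print(len(spans))                  # 16 spans: 4 cells x 4 scales
print(decode((0.5, 0.25), (0.0, 0.0)))  # zero offsets leave the span unchanged
```

Multiply the normalized start and end by the video duration T to obtain timestamps in seconds.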

  12. S3D: Convolutional Predictors. We apply a Conv3D filter on top of each temporal feature map to produce the results. Predictor: 3D max pool, then Conv3D: 3x1x1x(4x(K+1+2)).

  13. S3D: Convolutional Predictors. Predictor: 3D max pool, then Conv3D: 3x1x1x(4x(K+1+2)), where 3x1x1 is the kernel size, 4 is the number of scales, K+1 is the number of classes plus background (Conf), and 2 is the number of localization offsets (Loc).
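The predictor's output width and the total number of default spans can be checked with a few lines of arithmetic. This sketch assumes K = 20 (the 20 THUMOS’14 activities mentioned on slide 15) and a 256-frame input, so that the feature maps have temporal lengths 32, 16, 8, 4, 2, and 1; the function names are illustrative only.

```python
def predictor_channels(num_classes, num_scales=4, num_offsets=2):
    """Output channels of the Conv3D predictor: 4 x (K + 1 + 2)."""
    return num_scales * (num_classes + 1 + num_offsets)


def total_default_spans(temporal_lengths, num_scales=4):
    """Default spans per video: one set of scales at every temporal cell."""
    return num_scales * sum(temporal_lengths)


print(predictor_channels(20))                         # 92
print(total_default_spans([32, 16, 8, 4, 2, 1]))      # 252
```

Under these assumptions the count agrees with the "252 Temporal Spans per Video" label on slide 14.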

  14. Single Shot multi-Span Detector (architecture overview). Input video (256 frames, 112x112) → base feature layers (C3D up to Conv5b layer) → auxiliary temporal feature layers (feature maps Conv5 through Conv10, with temporal lengths 32, 16, 8, 4, 2, and 1) → predictors on each feature map (3D max pool, then Conv3D: 3x1x1x(4x(K+1+2))) → 252 temporal spans per video → temporal NMS → temporal activity detections (activity A, activity B). Training of S3D: Smooth L1, Softmax Cross Entropy, and Sigmoid Cross Entropy losses.
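Slide 14 lists temporal NMS as the last step before the final detections. A minimal sketch of greedy temporal non-maximum suppression is given below; it is the generic algorithm, not necessarily the authors' exact variant. The IoU threshold of 0.5 and the confidence scores in the example are hypothetical, while the timestamps reuse the Pole Vault spans from the earlier slides.

```python
def temporal_iou(a, b):
    """Intersection over union of two (start, end, ...) temporal spans."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def temporal_nms(detections, iou_threshold=0.5):
    """Greedy NMS over (start, end, score) tuples of a single class.

    Keeps the highest-scoring span, then drops any remaining span that
    overlaps an already kept span by more than iou_threshold.
    """
    kept = []
    for det in sorted(detections, key=lambda d: d[2], reverse=True):
        if all(temporal_iou(det, k) <= iou_threshold for k in kept):
            kept.append(det)
    return kept


# Scores are made up for illustration.
dets = [(228.1, 236.6, 0.9), (228.5, 237.0, 0.6), (242.0, 247.7, 0.8)]
print(temporal_nms(dets))
# [(228.1, 236.6, 0.9), (242.0, 247.7, 0.8)]
```

In a multi-class setting this would typically be run once per activity class.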

  15. Quantitative Results. Evaluation: mean Average Precision (mAP) over 20 activities on THUMOS’14. Speed: 1271 FPS on a single GTX 1080 Ti GPU.

  16. Qualitative Results THUMOS’14 segment: Pole Vault

  17. Qualitative Results THUMOS’14 segment: Javelin Throw

  18. Qualitative Results THUMOS’14 segment: Shotput

  19. Qualitative Results THUMOS’14 segment: Clean and Jerk

  20. Conclusions. Introduced S3D:
     - A novel single-shot, end-to-end model for temporal activity detection.
     - Simple: completely based on Conv3D kernels.
     - Strong: state-of-the-art performance on the THUMOS’14 benchmark.
     - Speed: operates at 1271 FPS on a single GeForce GTX 1080 Ti GPU.
     TensorFlow code coming soon at https://github.com/dazhang-cv/S3D

  21. Thank you! (The architecture overview from slide 14 is shown again.)
