
AttentionNAS: Spatiotemporal Attention Cell Search for Video Classification - PowerPoint PPT Presentation



  1. AttentionNAS: Spatiotemporal Attention Cell Search for Video Classification Xiaofang Wang, Xuehan Xiong, Maxim Neumann, AJ Piergiovanni, Michael S. Ryoo, Anelia Angelova, Kris M. Kitani, Wei Hua

  2. Convolutional networks are dominant: C3D [ICCV 2015], I3D [CVPR 2017], S3D [ECCV 2018], SlowFast [ICCV 2019]

  3. What’s missing from convolution? • Where to focus in images/videos: the same convolutional kernel is applied at every position • Long-range dependencies: modeled only by large receptive fields. Photo credit: [Convolution arithmetic] [Receptive field arithmetic]

  4. Attention is complementary to convolution • Map-based Attention (CBAM [ECCV 2018]): where to focus; learn a pointwise weighting factor for each position • Dot-product Attention (Attention is All You Need [NeurIPS 2017]): long-range dependencies; compute pairwise similarity between all the positions

  5. Challenge: many design choices need to be determined to apply attention to videos • What is the right dimension to apply attention to videos? Three dimensions in video data: spatial, temporal, or spatiotemporal • How to compose multiple attention operations? Sequential, parallel, or others?

  6. Proposal: automatically search for attention cells in a data-driven manner • Novel Attention Cell Search Space • Efficient Differentiable Search Method [Figure: example cell where the input feeds spatial and temporal attention operations (Op1, Op2, Op3) whose outputs are combined and passed to a sink node]

  7. Attention Cell Search Space • Attention Cell: composed of multiple attention operations (Op1, Op2, Op3, Combine); input shape == output shape, so it can be inserted anywhere in existing backbones • Cell Level Search Space: connectivity between the operations within the cell • Operation Level Search Space: choices to instantiate an individual attention operation

  8. Cell Level Search Space (select the input to each operation) • Input to the 1st operation is fixed to the input of the cell • Input to the k-th operation is a weighted sum of feature maps selected from the cell input and the outputs of earlier operations • Combine (output of the cell): concatenate channels of the operation outputs + CONV
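Below is a minimal PyTorch sketch of this cell-level composition, assuming a fixed number of shape-preserving operations; the class name, the per-pair mixing-weight matrix, and the 1x1x1 projection used for the Combine step are illustrative choices, not the paper's exact definitions.

```python
import torch
import torch.nn as nn

class AttentionCell(nn.Module):
    """Sketch: each operation's input is a weighted sum of earlier feature maps."""
    def __init__(self, channels, ops):
        super().__init__()
        self.ops = nn.ModuleList(ops)               # attention operations Op1..OpN (shape-preserving)
        n = len(ops)
        self.mix = nn.Parameter(torch.zeros(n, n))  # input-selection weights (hypothetical layout)
        self.proj = nn.Conv3d(channels * n, channels, kernel_size=1)  # Combine: concat + CONV

    def forward(self, x):                           # x: (B, C, T, H, W)
        feats = [x]                                 # candidates: cell input + earlier op outputs
        for j, op in enumerate(self.ops):
            if j == 0:
                inp = x                             # 1st operation input fixed to the cell input
            else:
                w = torch.softmax(self.mix[j, : j + 1], dim=0)
                inp = sum(wi * f for wi, f in zip(w, feats))  # weighted sum of selected feature maps
            feats.append(op(inp))
        out = torch.cat(feats[1:], dim=1)           # concatenate channels of the operation outputs
        return self.proj(out)                       # project back so input shape == output shape
```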

  9. Operation Level Search Space • Attention Dimension: 1. Spatial, 2. Temporal, 3. Spatiotemporal • Attention Operation Type: Map-based Attention, Dot-product Attention

  10. Map-based Attention and Dot-product Attention • Map-based Attention (where to focus): learn a pointwise weighting factor for each position • Dot-product Attention (long-range dependencies): compute pairwise similarity between all the positions • Assume attention dimension = temporal
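The two operation types can be sketched as follows for the temporal attention dimension; the spatial pooling into per-frame descriptors, the projection sizes, and the residual connection are assumptions made here for a self-contained example, not the paper's exact operations.

```python
import torch
import torch.nn as nn

class TemporalMapAttention(nn.Module):
    """Map-based: learn a pointwise weighting factor for each temporal position."""
    def __init__(self, channels):
        super().__init__()
        self.fc = nn.Linear(channels, 1)

    def forward(self, x):                              # x: (B, C, T, H, W)
        desc = x.mean(dim=(3, 4)).transpose(1, 2)      # (B, T, C) per-frame descriptor (assumed pooling)
        w = torch.sigmoid(self.fc(desc))               # (B, T, 1) weight per temporal position
        return x * w.transpose(1, 2).unsqueeze(-1).unsqueeze(-1)  # broadcast over C, H, W

class TemporalDotProductAttention(nn.Module):
    """Dot-product: pairwise similarity between all temporal positions."""
    def __init__(self, channels, dim=64):
        super().__init__()
        self.q = nn.Linear(channels, dim)
        self.k = nn.Linear(channels, dim)
        self.v = nn.Linear(channels, channels)

    def forward(self, x):                              # x: (B, C, T, H, W)
        desc = x.mean(dim=(3, 4)).transpose(1, 2)      # (B, T, C)
        scores = self.q(desc) @ self.k(desc).transpose(1, 2) / self.q.out_features ** 0.5
        attn = torch.softmax(scores, dim=-1)           # (B, T, T) pairwise similarities
        out = (attn @ self.v(desc)).transpose(1, 2)    # (B, C, T) re-weighted frame features
        return x + out.unsqueeze(-1).unsqueeze(-1)     # residual broadcast (illustration choice)
```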

  11. Search Space Summary • Attention Dimension: Spatial, Temporal, Spatiotemporal • Attention Operation Type: Map-based attention, Dot-product attention • Activation Function: None, ReLU, Softmax, Sigmoid • Connectivity between Operations: input to each operation, Combine

  12. Insert Attention Cells into Backbone Networks: the attention cell (Op1, Op2, Op3, Combine) is inserted between convolutional layers of the backbone
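As a minimal illustration, a cell with matching input and output shapes can simply be spliced between two stages of a 3D-CNN backbone; `backbone_stage1`, `backbone_stage2`, and `cell` below are placeholder modules, not the paper's insertion points.

```python
import torch.nn as nn

def insert_cell(backbone_stage1: nn.Module, cell: nn.Module, backbone_stage2: nn.Module) -> nn.Sequential:
    # Because the cell preserves the feature-map shape, no other change to the backbone is needed.
    return nn.Sequential(backbone_stage1, cell, backbone_stage2)
```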

  13. Differentiable Formulation of Search Space • Search algorithm: differentiable architecture search • Search cost: equal to the cost of training one network [Supergraph figure: spatial and temporal nodes of map-based and dot-product attention, joined by solid connections (no weights), level connection weights, and sink connection weights into a sink node]

  14. Supergraph and Connection Weights • Supergraph: L levels; each level has N nodes • Node: an attention operation of a pre-defined attention dimension and type • Edges: solid connections (no weights), level connection weights, and sink connection weights into the sink node
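The slides do not show the equations, so the following is a DARTS-style softmax relaxation consistent with the described level and sink connection weights; the symbols $x_{l,i}$, $O_{l,i}$, $\alpha$, and $\beta$ are notation introduced here, not taken from the paper.

```latex
% Input to node i at level l: a softmax over the level connection weights \alpha mixes the
% outputs of the previous level; the sink output mixes all node outputs with a softmax over
% the sink connection weights \beta.
x_{l,i} = \sum_{j=1}^{N} \frac{\exp\!\big(\alpha^{(l)}_{j,i}\big)}{\sum_{j'=1}^{N} \exp\!\big(\alpha^{(l)}_{j',i}\big)}\, O_{l-1,j}\!\big(x_{l-1,j}\big),
\qquad
x_{\mathrm{sink}} = \sum_{l=1}^{L} \sum_{i=1}^{N} \frac{\exp\!\big(\beta_{l,i}\big)}{\sum_{l',i'} \exp\!\big(\beta_{l',i'}\big)}\, O_{l,i}\!\big(x_{l,i}\big).
```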

  15. Differentiable Search • Jointly train the network weights and the connection weights with gradient descent • During the search, the supergraph is inserted between the convolutional layers of the backbone
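A hedged sketch of the joint optimization implied by this slide: a single optimizer updates the network weights and the connection weights together; `supergraph`, `train_loader`, and the SGD hyperparameters are placeholders, not values from the paper.

```python
import torch

def search(supergraph, train_loader, epochs=1, lr=0.1):
    # One optimizer over all parameters: convolutional weights and connection weights.
    opt = torch.optim.SGD(supergraph.parameters(), lr=lr, momentum=0.9)
    loss_fn = torch.nn.CrossEntropyLoss()
    for _ in range(epochs):
        for clips, labels in train_loader:
            opt.zero_grad()
            loss = loss_fn(supergraph(clips), labels)
            loss.backward()          # gradients flow to both kinds of weights
            opt.step()
```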

  16. Attention Cell Design Derivation • How to derive the attention cell design from the learned connection weights? [Supergraph figure with level connection weights and sink connection weights]

  17. Attention Cell Design Derivation • Step 1: choose the top k nodes (e.g., k = 3) based on the sink connection weights

  18. Attention Cell Design Derivation • Step 2: choose the top m (e.g., 2) predecessors of each selected node recursively, based on the level connection weights, until we reach the first level
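The two derivation steps on slides 17 and 18 amount to a greedy selection over the learned connection weights; the sketch below assumes those weights are available as plain dictionaries (`sink_w`, `level_w`), which is an illustrative data layout rather than the paper's implementation.

```python
def derive_cell(sink_w, level_w, k=3, m=2):
    """sink_w: {node: weight};  level_w: {node: {predecessor: weight}}."""
    selected = sorted(sink_w, key=sink_w.get, reverse=True)[:k]   # step 1: top-k nodes by sink weight
    kept, frontier = set(selected), list(selected)
    while frontier:                                               # step 2: recurse toward the first level
        node = frontier.pop()
        preds = level_w.get(node, {})                             # empty at the first level
        for p in sorted(preds, key=preds.get, reverse=True)[:m]:  # top-m predecessors by level weight
            if p not in kept:
                kept.add(p)
                frontier.append(p)
    return kept                                                   # nodes that form the derived cell
```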

  19. Attention Cell Design Derivation [Figure: the derived attention cell, with the selected spatial and temporal operations (map-based and dot-product) fed by the cell input and merged by the Combine node]

  20. Experimental Setup • Backbones: Inception-based I3D [CVPR 2017] and S3D [ECCV 2018]; insert 5 attention cells • Datasets: Kinetics-600 and Moments in Time (MiT)

  21. Comparison with Non-local Blocks

  22. Generalization across Modalities RGB to optical flow

  23. Generalization across Backbones

  24. Generalization across Datasets

  25. Comparison with State-of-the-art

  26. Contributions • Extend NAS beyond discovering convolutional cells to attention cells • Search space for spatiotemporal attention cells • A differentiable formulation of the search space • State-of-the-art performance; outperforms non-local blocks • Strong generalization across modalities, backbones, and datasets • More analysis and visualizations of attention cells available in the paper
