SLIDE 1

AttentionNAS: Spatiotemporal Attention Cell Search for Video Classification

Xiaofang Wang, Xuehan Xiong, Maxim Neumann, AJ Piergiovanni, Michael S. Ryoo, Anelia Angelova, Kris M. Kitani, Wei Hua

SLIDE 2

Convolutional networks are dominant

C3D [ICCV 2015] S3D [ECCV 2018] I3D [CVPR 2017] SlowFast [ICCV 2019]

SLIDE 3

What’s missing from convolution?

  • Where to focus in images/videos
  • Long-range dependencies

The same convolutional kernel is applied at every position. Long-range dependencies are modeled by large receptive fields.

Photo credit: [Convolution arithmetic] [Receptive field arithmetic]

SLIDE 4

Attention is complementary to convolution

  • Map-based Attention (where to focus): learn a pointwise weighting factor for each position. CBAM [ECCV 2018]
  • Dot-product Attention (long-range dependencies): compute pairwise similarity between all the positions. Attention is All You Need [NeurIPS 2017]
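
Both flavors are compact enough to write out. Below is a minimal PyTorch sketch on a flattened feature map of shape (batch, positions, channels); the module names and the single-head simplification are illustrative assumptions, not the cited papers' exact implementations.

```python
import torch
import torch.nn as nn

class MapBasedAttention(nn.Module):
    """Learn a scalar weight per position ("where to focus")."""
    def __init__(self, channels):
        super().__init__()
        self.score = nn.Linear(channels, 1)      # pointwise scoring head

    def forward(self, x):                        # x: (B, N, C)
        w = torch.sigmoid(self.score(x))         # (B, N, 1): one weight per position
        return x * w                             # reweight every position

class DotProductAttention(nn.Module):
    """Pairwise similarity between all positions (long-range dependencies)."""
    def __init__(self, channels):
        super().__init__()
        self.q = nn.Linear(channels, channels)
        self.k = nn.Linear(channels, channels)
        self.v = nn.Linear(channels, channels)

    def forward(self, x):                        # x: (B, N, C)
        q, k, v = self.q(x), self.k(x), self.v(x)
        sim = q @ k.transpose(1, 2) / k.shape[-1] ** 0.5   # (B, N, N) similarities
        return torch.softmax(sim, dim=-1) @ v    # (B, N, C)
```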

SLIDE 5

Many design choices need to be determined to apply attention to videos

Challenge:

  • What is the right dimension to apply attention to videos? Video data offers three candidates: spatial, temporal, or spatiotemporal (see the sketch after this list).
  • How to compose multiple attention operations? Sequential, parallel, or others?

[Diagram: sequential vs. parallel compositions of spatial and temporal attention between input and output]
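
To make the dimension choice concrete, here is a hedged sketch of how each choice flattens a video tensor of shape (B, T, H, W, C) into the "positions" axis that an attention operation acts over; the helper name positions_for is an assumption for illustration.

```python
import torch

def positions_for(x: torch.Tensor, dim: str) -> torch.Tensor:
    """Flatten a video tensor x of shape (B, T, H, W, C) along the chosen dim."""
    B, T, H, W, C = x.shape
    if dim == "spatial":             # attend over the H*W positions of each frame
        return x.reshape(B * T, H * W, C)
    if dim == "temporal":            # attend over the T frames at each location
        return x.permute(0, 2, 3, 1, 4).reshape(B * H * W, T, C)
    if dim == "spatiotemporal":      # attend over all T*H*W positions jointly
        return x.reshape(B, T * H * W, C)
    raise ValueError(f"unknown attention dimension: {dim}")
```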

SLIDE 6

Proposal: automatically search for attention cells in a data-driven manner.

[Proposal figure: a supergraph of spatial and temporal attention operations (Op1, Op2, Op3, Combine) feeding a sink node]

  • Novel Attention Cell Search Space
  • Efficient Differentiable Search Method

SLIDE 7

Attention Cell Search Space

Op1 Op2 Op3 Combine

Attention Cell

  • Composed of multiple attention operations
  • Input shape == output shape; can be inserted anywhere in existing backbones

Search Space

  • Cell Level Search Space: connectivity between the operations within the cell
  • Operation Level Search Space: choices to instantiate an individual attention operation
SLIDE 8

Cell Level Search Space

[Cell diagram: input to the cell → Op1, Op2, Op3 → Combine → output of the cell]

Select input to each operation

  • Input to the 1st operation is fixed to the input of the cell
  • Input to the k-th operation is a weighted sum of selected feature maps from the cell input and the outputs of preceding operations

Combine

  • Concatenate channels + CONV (see the sketch after this list)
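
A minimal sketch of this cell-level wiring, assuming every attention op preserves the (B, C, T, H, W) shape. The class name, the softmax-normalized edge weights, and the 1x1x1 combine conv are illustrative stand-ins for the searched connectivity, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

class AttentionCell(nn.Module):
    def __init__(self, ops, channels):
        super().__init__()
        self.ops = nn.ModuleList(ops)
        # one weight per candidate predecessor of each operation
        self.edge_w = nn.ParameterList(
            nn.Parameter(torch.ones(i + 1)) for i in range(len(ops))
        )
        # Combine: concatenate the ops' channels, then restore them with a conv
        self.combine = nn.Conv3d(channels * len(ops), channels, kernel_size=1)

    def forward(self, x):                        # x: (B, C, T, H, W)
        feats = [x]                              # op 1's input is fixed to the cell input
        for i, op in enumerate(self.ops):
            w = torch.softmax(self.edge_w[i], dim=0)
            inp = sum(wj * fj for wj, fj in zip(w, feats))   # weighted sum of predecessors
            feats.append(op(inp))
        out = self.combine(torch.cat(feats[1:], dim=1))      # concat channels + CONV
        return out                               # same shape as the input
```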
SLIDE 9

Operation Level Search Space

Attention Operation Type

  • Map-based Attention
  • Dot-product Attention

Attention Dimension

  1. Spatial
  2. Temporal
  3. Spatiotemporal
SLIDE 10

Map-based Attention and Dot-product Attention

  • Map-based Attention (where to focus): learn a pointwise weighting factor for each position
  • Dot-product Attention (long-range dependencies): compute pairwise similarity between all the positions

Assume attention dimension = temporal
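
For concreteness, a hedged sketch of the map-based case under this assumption: one weight is learned per frame. The per-frame descriptor via spatial average pooling is an illustrative choice, not the paper's exact formulation; the dot-product counterpart would instead compute pairwise frame similarities over the (B*H*W, T, C) view shown earlier.

```python
import torch
import torch.nn as nn

class TemporalMapAttention(nn.Module):
    """Map-based attention with attention dimension = temporal."""
    def __init__(self, channels):
        super().__init__()
        self.score = nn.Linear(channels, 1)

    def forward(self, x):                              # x: (B, C, T, H, W)
        ctx = x.mean(dim=(3, 4)).transpose(1, 2)       # (B, T, C): per-frame descriptor
        w = torch.sigmoid(self.score(ctx))             # (B, T, 1): one weight per frame
        return x * w.transpose(1, 2)[..., None, None]  # broadcast over H and W
```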

SLIDE 11

Search Space Summary

Attention Dimension

  • Spatial
  • Temporal
  • Spatiotemporal

Attention Operation Type

  • Map-based attention
  • Dot-product attention

Connectivity between Operations

  • Input to each operation

Activation Function

  • None
  • ReLU
  • Softmax
  • Sigmoid

SLIDE 12

Insert Attention Cells into Backbone Networks

[Diagram: an attention cell (Op1, Op2, Op3, Combine) inserted between convolutional layers of the backbone]
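
Because a cell's output shape matches its input shape, insertion is a drop-in composition. A minimal sketch, with stage1, cell, and stage2 as hypothetical placeholders for backbone pieces (a residual connection, as in non-local blocks, would be another common wiring):

```python
import torch.nn as nn

class BackboneWithCell(nn.Module):
    def __init__(self, stage1, cell, stage2):
        super().__init__()
        self.stage1, self.cell, self.stage2 = stage1, cell, stage2

    def forward(self, x):
        x = self.stage1(x)       # earlier convolutional layers
        x = self.cell(x)         # shape-preserving attention cell drops in here
        return self.stage2(x)    # remaining convolutional layers
```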

SLIDE 13

Differentiable Formulation of Search Space

  • Search algorithm: differentiable architecture search
  • Search cost: equal to the cost of training one network

[Supergraph figure: levels of spatial/temporal map-based and dot-product attention nodes, wired by solid connections (no weights), level connection weights, and sink connection weights into a sink node]

SLIDE 14

Supergraph and Connection Weights

[Supergraph figure as above, with legend: solid connections (no weights), level connection weights, sink connection weights]

Supergraph: L levels; each level contains N nodes

Node: an attention operation of a pre-defined attention dimension and type
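
A minimal sketch of a forward pass through such a supergraph, under assumptions: each node's input is a softmax-weighted sum over the previous level's outputs, and the sink node softmax-mixes every node's output. All names and the exact normalization are illustrative, not the paper's verbatim formulation.

```python
import torch
import torch.nn as nn

class Supergraph(nn.Module):
    def __init__(self, levels):                  # levels: list of lists of attention ops
        super().__init__()
        self.levels = nn.ModuleList(nn.ModuleList(ops) for ops in levels)
        # level connection weights: one per (node, predecessor) edge; the first
        # level sees only the input, matching the "solid" (weight-free) edges
        n_prev = [1] + [len(l) for l in levels[:-1]]
        self.level_w = nn.ParameterList(
            nn.Parameter(torch.zeros(len(ops), n)) for ops, n in zip(levels, n_prev)
        )
        # sink connection weights: one per node in the whole graph
        self.sink_w = nn.Parameter(torch.zeros(sum(len(l) for l in levels)))

    def forward(self, x):
        prev, all_out = [x], []
        for ops, w in zip(self.levels, self.level_w):
            mix = torch.softmax(w, dim=1)        # normalize each node's incoming edges
            outs = [op(sum(m * p for m, p in zip(mix[i], prev)))
                    for i, op in enumerate(ops)]
            all_out += outs
            prev = outs
        sink = torch.softmax(self.sink_w, dim=0)
        return sum(s * o for s, o in zip(sink, all_out))   # the sink node's output
```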

SLIDE 15

Differentiable Search

  • Jointly train the network weights and connection weights with gradient descent

[Diagram: the supergraph inserted between convolutional layers of the backbone during the search]
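
A hedged sketch of the joint update; model, loader, and the hyperparameters are placeholders, and the single-optimizer setup simply reflects the "train both sets of weights together" statement above.

```python
import torch
import torch.nn.functional as F

def search_step(model, clips, labels, opt):
    """One joint gradient step on backbone weights AND connection weights."""
    logits = model(clips)             # backbone with supergraph(s) inserted
    loss = F.cross_entropy(logits, labels)
    opt.zero_grad()
    loss.backward()                   # gradients flow into the connection weights too
    opt.step()
    return loss.item()

# Usage sketch: a single optimizer over all parameters, so the search costs
# roughly one ordinary training run.
# opt = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
# for clips, labels in loader:
#     search_step(model, clips, labels, opt)
```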

SLIDE 16

Attention Cell Design Derivation


How to derive the attention cell design from the learned weights?

SLIDE 17

Attention Cell Design Derivation


Choose the top k (e.g., k = 3) nodes based on the sink connection weights

SLIDE 18

Attention Cell Design Derivation


Recursively choose the top j (e.g., j = 2) predecessors of each selected node based on the level connection weights, until we reach the first level (a sketch of the whole procedure follows)
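
Putting slides 17 and 18 together, a hedged sketch of the derivation; the dict-based data layout and the names sink_w / level_w are assumptions for illustration.

```python
def derive_cell(sink_w, level_w, k=3, j=2):
    """sink_w: {node: weight}; level_w: {node: {predecessor: weight}}.

    Keep the top-k nodes by sink connection weight, then recursively keep the
    top-j predecessors of each kept node by level connection weight.
    """
    keep = sorted(sink_w, key=sink_w.get, reverse=True)[:k]    # top-k sink inputs
    cell, frontier = set(keep), list(keep)
    while frontier:
        node = frontier.pop()
        preds = level_w.get(node, {})                          # empty at the first level
        best = sorted(preds, key=preds.get, reverse=True)[:j]  # top-j predecessors
        frontier += [p for p in best if p not in cell]
        cell.update(best)
    return cell   # the nodes retained in the final attention cell
```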

SLIDE 19

Attention Cell Design Derivation

[Derived cell: the retained spatial and temporal attention operations and their connections, ending in a Combine node]

SLIDE 20

Experimental Setup

  • Backbones: Inception-based I3D [CVPR 2017] and S3D [ECCV 2018]
  • Insert 5 cells into each backbone
  • Datasets: Kinetics-600 and Moments in Time (MiT)

SLIDE 21

Comparison with Non-local Blocks

SLIDE 22

Generalization across Modalities

RGB to optical flow

SLIDE 23

Generalization across Backbones

SLIDE 24

Generalization across Datasets

SLIDE 25

Comparison with State-of-the-art

SLIDE 26

Contributions

  • Extend NAS beyond discovering convolutional cells to attention cells
  • Search space for spatiotemporal attention cells
  • A differentiable formulation of the search space
  • State-of-the-art performance; outperforms non-local blocks
  • Strong generalization across modalities, backbones, and datasets
  • More analysis and visualizations of attention cells available in the paper