SLIDE 1

Learning Spatiotemporal Features with 3D Convolutional Networks

Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, Manohar Paluri

Çağdaş BAK 29.03.16

SLIDE 2

Effective Video Descriptor

  • Generic

– Can represent different types of video

  • Compact

– Cheaper to process and store

  • Efficient

– Fast to compute

  • Simple

– Easy to implement

SLIDE 3

3D Convolution and Pooling

  • 3D convolution models temporal information better than 2D convolution.

– 2D CONV : applied only spatially, so temporal information is lost.
– 3D CONV : applied spatio-temporally, so temporal information is preserved.

  • The same holds for pooling.
SLIDE 4

2D Convolution On 1-ch Input

  • Result : 2D Image.
SLIDE 5

2D Convolution On n-ch Input

  • Result : 2D Image.
SLIDE 6

3D Convolution On n-ch Input

  • Result : Volume
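
The difference between the 2D and 3D cases can be checked with simple shape arithmetic. The sketch below is only an illustration: the helper `conv_output_shape`, the 3x3 and 3x3x3 kernel sizes, and the padding of 1 are assumptions chosen for the example, not settings taken from the slides.

```python
def conv_output_shape(in_shape, kernel, stride=1, padding=0):
    """Output size along each axis the kernel slides over: (n + 2p - k) // s + 1."""
    return tuple((n + 2 * padding - k) // stride + 1
                 for n, k in zip(in_shape, kernel))

# Hypothetical clip: 16 frames of 128x171 pixels.
frames, height, width = 16, 128, 171

# 2D convolution: the kernel covers all input channels/frames at once and
# slides only over (height, width), so the result has no time axis.
out_2d = conv_output_shape((height, width), (3, 3), padding=1)

# 3D convolution: the kernel also slides along time, so the result is a volume.
out_3d = conv_output_shape((frames, height, width), (3, 3, 3), padding=1)

print(out_2d)  # (128, 171)
print(out_3d)  # (16, 128, 171)
```

With these assumed settings the 2D result is a single map per filter, while the 3D result keeps all 16 temporal positions.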
SLIDE 7

Identify Best Architecture For 3D ConvNets (On UCF101)

  • Common network settings

– All video frames are resized to 128x171.
– Videos are split into non-overlapping 16-frame clips.
– Input : 3x16x128x171 (channels x frames x height x width).
– 5 convolution and pooling layers
– 2 fully connected layers
– Softmax loss layer to predict action labels
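
The clip-splitting step can be sketched as follows. The function name `split_into_clips` is hypothetical, and dropping trailing frames that do not fill a whole clip is an assumption; the slides do not say how leftover frames are handled.

```python
def split_into_clips(num_frames, clip_len=16):
    """Return (start, end) frame index pairs for non-overlapping clips.

    Assumption: trailing frames that do not fill a whole clip are dropped.
    """
    return [(start, start + clip_len)
            for start in range(0, num_frames - clip_len + 1, clip_len)]

clips = split_into_clips(100)

# Each clip becomes one network input: channels x frames x height x width.
clip_shape = (3, 16, 128, 171)

print(len(clips), clips[0], clips[-1])  # 6 (0, 16) (80, 96)
```

A 100-frame video therefore yields six clips under this assumption, with the last four frames unused.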

SLIDE 8

Identify Best Architecture For 3D ConvNets (On UCF101)

  • Varying Network Architecture

– Homogeneous temporal depth.

  • Depth d = 1, 3, 5, 7 (the same kernel temporal depth in every convolution layer)

– Varying temporal depth.

  • Increasing : 3-3-5-5-7
  • Decreasing : 7-5-5-3-3
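
To get a feel for how these settings differ in size, the sketch below counts convolution weights for each configuration. Each configuration is written with five entries, one per convolution layer, so the decreasing pattern appears as 7-5-5-3-3. The channel widths in `channels` are an assumption made for illustration and are not given on the slides; the 3x3 spatial kernel size is likewise assumed.

```python
# Kernel temporal depth per convolution layer for each configuration.
configs = {
    "depth-1": [1, 1, 1, 1, 1],
    "depth-3": [3, 3, 3, 3, 3],
    "depth-5": [5, 5, 5, 5, 5],
    "depth-7": [7, 7, 7, 7, 7],
    "increasing": [3, 3, 5, 5, 7],
    "decreasing": [7, 5, 5, 3, 3],
}

# Assumed channel widths: RGB input followed by five convolution layers.
channels = [3, 64, 128, 256, 256, 256]

def conv_weights(depths, channels, spatial=3):
    """Count convolution weights (biases ignored) for d x spatial x spatial kernels."""
    return sum(d * spatial * spatial * c_in * c_out
               for d, c_in, c_out in zip(depths, channels, channels[1:]))

for name, depths in configs.items():
    print(name, conv_weights(depths, channels))
```

Under these assumed widths the weight count grows linearly with kernel temporal depth, which is one reason deeper temporal kernels are more expensive.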
SLIDE 9

3D Convolution Kernel Temporal Depth Search

SLIDE 10

Spatiotemporal Feature Learning

  • Best Network Architecture

– 3x3x3 kernels in all convolution layers
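
One property of small kernels is that stacking them widens the temporal context step by step. The sketch below computes how many input frames a single output unit sees after a stack of stride-1 convolutions. It deliberately ignores pooling, which would enlarge the span further, so it is a simplified illustration rather than a description of the actual network.

```python
def temporal_receptive_field(kernel_depths):
    """Frames seen by one output unit after stacked stride-1 convolutions (no pooling)."""
    field = 1
    for depth in kernel_depths:
        field += depth - 1
    return field

print(temporal_receptive_field([3]))              # 3
print(temporal_receptive_field([3, 3]))           # 5
print(temporal_receptive_field([3, 3, 3, 3, 3]))  # 11
```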

SLIDE 11

Spatiotemporal Feature Learning

  • Dataset for training

– Sports 1M Dataset

  • Largest video classification benchmark
  • 1.1 million sports videos
  • 487 categories
SLIDE 12

Sports 1M Classification Results

SLIDE 13

C3D Video Descriptor

  • The C3D model can be used as a feature extractor for various video analysis tasks.

– Action recognition
– Action similarity
– Scene and object recognition

  • The fc6 activations are used as the descriptor

– 4096 dimensions
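
A video usually spans several clips, so the per-clip activations have to be combined into one descriptor. One common way, assumed here and not stated on the slides, is to average the clip features and L2-normalize the result. The function name `video_descriptor` is hypothetical, and the toy 4-dimensional vectors stand in for 4096-dimensional fc6 activations.

```python
import math

def video_descriptor(clip_features):
    """Average per-clip feature vectors, then L2-normalize the mean."""
    dim = len(clip_features[0])
    count = len(clip_features)
    mean = [sum(f[i] for f in clip_features) / count for i in range(dim)]
    norm = math.sqrt(sum(x * x for x in mean))
    # Leave an all-zero mean untouched to avoid dividing by zero.
    return [x / norm for x in mean] if norm > 0 else mean

# Toy stand-ins for the fc6 activations of two clips from the same video.
clip_features = [[1.0, 0.0, 2.0, 0.0],
                 [3.0, 0.0, 2.0, 0.0]]
desc = video_descriptor(clip_features)
print(desc)
```

The resulting vector has unit length, which makes descriptors from videos of different lengths directly comparable.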

SLIDE 14

Action Recognition

  • Dataset : UCF101

– 13,320 videos
– 101 human action categories

SLIDE 15

Action Similarity Labeling

  • Dataset : ASLAN

– 3,631 videos
– 432 action classes
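
Action similarity labeling asks whether two videos show the same action, so it reduces to comparing two descriptors. The sketch below uses cosine similarity with a fixed threshold. Both the choice of cosine similarity and the threshold value are assumptions for illustration, and `same_action` is a hypothetical helper, not the evaluation protocol used in the presentation.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def same_action(a, b, threshold=0.5):
    """Hypothetical decision rule; the threshold is an arbitrary placeholder."""
    return cosine_similarity(a, b) >= threshold

# Toy 3-dimensional stand-ins for video descriptors.
video_a = [1.0, 0.0, 1.0]
video_b = [1.0, 0.0, 0.9]
video_c = [0.0, 1.0, 0.0]

print(same_action(video_a, video_b))  # True
print(same_action(video_a, video_c))  # False
```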

SLIDE 16

Scene and Object Recognition

  • Dataset : YUPENN

– 420 videos
– 14 scene categories

  • Dataset : Maryland

– 130 videos
– 13 scene categories

SLIDE 17

Why C3D Features?

  • Generic
  • Compact
  • Efficient
  • Simple
SLIDE 18

What Does C3D Learn?

SLIDE 19

Useful Links

  • http://vlg.cs.dartmouth.edu/c3d/
  • https://github.com/facebook/C3D