 
              Learning Spatiotemporal Features with 3D Convolutional Networks Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, Manohar Paluri Çağdaş BAK 29.03.16
Effective Video Descriptor • Generic – Can represent different types • Compact – Processing, storage • Efficient – computation • Simple – implementation
3D Convolution and Pooling • 3D Convolution is better than 2D Convolution to model temporal information. – 2D CONV : performed only spatially, lose temporal information. – 3D CONV : performed spatio-temporally, preserve temporal information. • Same phenomena is applicable for pooling.
2D Convolution On 1-ch Input • Result : 2D Image.
2D Convolution On n-ch Input • Result : 2D Image.
3D Convolution On n-ch Input • Result : Volume
Identify Best Architecture For 3D ConvNets (On UCF101) • Common network settings – All video frames resized into 128x171. – Videos are split into non-overlapped 16 frame clip. – Input : 3x16x128x171. – 5 Convolution and Pooling layer – 2 Fully Connected layer – Softmax Loss layer to predict action labels
Identify Best Architecture For 3D ConvNets (On UCF101) • Varying Network Architecture – Homogeneous temporal depth. • Depth –d for 1,3,5,7 – Varying temporal depth. • Increasing : 3-3-5-5-7 • Decreasing : 7-7-5-5-3-3
3D Convolution Kernel Temporal Depth Search
Spatiotemporal Feature Learning • Best Network Architecture – With 3x3x3 kernel
Spatiotemporal Feature Learning • Dataset for training – Sports 1M Dataset • Largest video classification benchmark • 1.1 million sports videos • 487 categories
Sports 1M Classification Results
C3D Video Descriptor • C3D Model can be used as a feature extractor for various video analysis tasks. – Action recognition – Action similarity – Scene and Object recognition • Using with fc6 activations – 4096 dimension
Action Recognition • Dataset : UCF101 – 13.320 video – 101 human action
Action Similarity Labeling • Dataset : ASLAN – 3,631 video – 432 action class
Scene Object Recognition • Dataset : YUPENN – 420 video – 14 scene • Dataset : Maryland – 130 video – 13 scene
Why C3D Features? • Generic • Compact • Efficient • Simple
What Does C3D Learn ?
Useful Links • http://vlg.cs.dartmouth.edu/c3d/ • https://github.com/facebook/C3D
Learning Spatiotemporal Features with 3D Convolutional Networks Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, Manohar Paluri Çağdaş BAK 29.03.16
Recommend
More recommend