Özge Yalçınkaya
Beyond Short Snippets: Deep Networks for Video Classification
Joe Yue-Hei Ng, Matthew Hausknecht, Sudheendra Vijayanarasimhan, Oriol Vinyals, Rajat Monga, George Toderici
✤ Many attempts to apply CNNs to action recognition treat video frames as independent images, using a CNN for frame-level classification
✤ Per-frame predictions are then averaged at the video level
✤ However, complete action information is missing: averaging discards the temporal order of frames
✤ For accurate video classification, learning a global description of the video's temporal evolution is needed
✤ Using an increasing number of frames improves accuracy
✤ Moreover, optical flow images may provide additional motion information
✤ Two approaches are proposed:
➡ Feature Pooling
➡ LSTM
✤ State-of-the-art results are achieved with both
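The per-frame averaging baseline criticized above can be sketched in a few lines of NumPy. This is an illustrative toy, not the paper's code; the frame logits are made up.

```python
import numpy as np

def average_frame_predictions(frame_logits):
    """Naive baseline: softmax each frame's logits independently,
    then average the per-frame class probabilities over the clip."""
    # Numerically stable softmax per frame (rows = frames, cols = classes).
    z = frame_logits - frame_logits.max(axis=1, keepdims=True)
    probs = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    return probs.mean(axis=0)  # video-level class distribution

# Toy example: 4 frames, 3 classes (values are arbitrary).
logits = np.array([[2.0, 0.1, 0.1],
                   [1.5, 0.2, 0.3],
                   [0.1, 0.2, 2.5],
                   [1.8, 0.1, 0.4]])
video_probs = average_frame_predictions(logits)
```

Note that the result is identical for any permutation of the frames, which is exactly the lost-temporal-order problem the paper's two approaches address.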
✤ AlexNet and GoogLeNet are used as the base CNN architectures
✤ Conv Pooling:
➡ Performs max-pooling over the final convolutional-layer features across the video's frames
➡ Feeds the pooled features to the fully connected (FC) layers (yellow)
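A minimal NumPy sketch of the Conv Pooling idea: element-wise max over time on per-frame conv features, followed by one FC layer. The sizes and random weights are assumptions for illustration, not the paper's configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: 30 frames, 512-dim final conv features, 487 classes.
T, D, C = 30, 512, 487
frame_features = rng.standard_normal((T, D))   # per-frame CNN conv outputs
W = rng.standard_normal((D, C)) * 0.01         # FC layer weights
b = np.zeros(C)                                # FC layer bias

# Conv Pooling: element-wise max over time gives an order-invariant
# clip summary, which a fully connected layer maps to class scores.
pooled = frame_features.max(axis=0)            # shape (D,)
logits = pooled @ W + b                        # shape (C,)
```

Because the max is taken feature-wise, the pooled vector keeps the strongest activation of each feature regardless of which frame produced it.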
✤ Late Pooling:
➡ Performs max-pooling (blue) after the FC layers; each frame's features first pass through the FC layers independently
➡ Compared to Conv Pooling, it combines high-level information across frames only at the end
✤ Slow Pooling:
➡ First, max-pooling (blue) is applied over small local windows of frames
➡ Each window is followed by an FC layer with shared weights
➡ A single max-pooling layer then combines the outputs of these FC layers
➡ Groups local temporal features before combining high-level information across the whole video
✤ Local Pooling:
➡ Combines frame-level features locally with max-pooling, without a second max-pooling stage
➡ The softmax (orange) layer is connected to the outputs of the FC layers
✤ Time-Domain Convolution:
➡ An extra time-domain convolutional layer is added before the max-pooling
➡ Max-pooling across frames is then applied to the temporal convolution outputs
➡ Captures local relationships between frames within small temporal windows
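The temporal convolution can be sketched with a single filter sliding over the frame axis. This is a deliberately simplified illustration (one filter, scalar responses); the real layer has many filters, and the sizes below are made up.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical sizes: 30 frames, 64-dim frame features, temporal kernel of 5.
T, D, K = 30, 64, 5
feats = rng.standard_normal((T, D))           # per-frame CNN features
kernel = rng.standard_normal((K, D)) / K      # one shared temporal filter

# Time-domain convolution: slide the kernel over frames (valid positions),
# producing one response per temporal window of K consecutive frames.
responses = np.array([(feats[t:t + K] * kernel).sum()
                      for t in range(T - K + 1)])

# Max-pooling across frames then keeps the strongest window response.
clip_feature = responses.max()
```

Each response mixes information from K neighboring frames, which is what lets this variant model short-range motion before pooling.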
✤ GoogLeNet Conv Pooling:
➡ Max-pooling is applied inside the network
➡ This pooling layer is then connected to the softmax layer
➡ The model is enhanced by adding FC layers
The LSTM takes input from the CNN's final layer at each video frame. A softmax layer predicts the class at each time step.
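The LSTM approach can be sketched as follows: one recurrent step per frame, with a softmax prediction at every step. This is a minimal NumPy toy with made-up sizes and random weights, not the paper's multi-layer model.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

rng = np.random.default_rng(2)

# Hypothetical sizes: 8-dim CNN features, 16 hidden units, 5 classes.
D, H, C = 8, 16, 5
Wx = rng.standard_normal((4 * H, D)) * 0.1    # input-to-gate weights
Wh = rng.standard_normal((4 * H, H)) * 0.1    # hidden-to-gate weights
b = np.zeros(4 * H)
Wy = rng.standard_normal((C, H)) * 0.1        # hidden-to-class weights

h, c = np.zeros(H), np.zeros(H)
frames = rng.standard_normal((30, D))         # per-frame CNN features

per_step_probs = []
for x in frames:                              # one LSTM step per video frame
    gates = Wx @ x + Wh @ h + b
    i, f, o = (sigmoid(gates[k * H:(k + 1) * H]) for k in range(3))
    g = np.tanh(gates[3 * H:])                # candidate cell update
    c = f * c + i * g                         # cell state carries the past
    h = o * np.tanh(c)                        # new hidden state
    per_step_probs.append(softmax(Wy @ h))    # class prediction at this step

per_step_probs = np.array(per_step_probs)     # (30, C), one row per frame
```

Unlike max-pooling, the hidden state makes each prediction depend on the order of the frames seen so far.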
✤ Experiments are done with both AlexNet and GoogLeNet
✤ Parameters are initialized from a pre-trained ImageNet model, then fine-tuned on video data
✤ Single-frame networks are expanded to 30-frame and 120-frame models
✤ Optical flow images are used as an additional input
✤ 1 million YouTube sports videos annotated with 487 classes (Sports-1M)
✤ 1000-3000 videos per class
✤ Optical flow quality varies wildly between videos
✤ The first 5 minutes of each video are sampled to obtain 300 frames
Feature-pooling architecture comparisons
CNN network comparisons
Effect of the number of frames used with GoogLeNet
Optical flow effect
Comparison with the work of Karpathy et al.
✤ 13,320 videos with 101 activity classes
✤ More constrained camera movement; a hand-curated dataset
UCF-101 accuracy results for different frame numbers
State-of-the-art UCF-101 results
✤ They presented two video-classification methods capable of aggregating information across a video's full duration
✤ Feature pooling and LSTM for video classification are both effective and scale to longer videos
✤ Using optical flow is beneficial when its quality is sufficient
✤ State-of-the-art results are obtained on two benchmark datasets
✤ Learning should take place over the entire video rather than over short snippets