SLIDE 1

Özge Yalçınkaya

Beyond Short Snippets: Deep Networks for Video Classification

Joe Yue-Hei Ng, Matthew Hausknecht, Sudheendra Vijayanarasimhan, Oriol Vinyals, Rajat Monga, George Toderici

SLIDE 2

Introduction

✤ Many attempts have been made to apply CNNs to action recognition
✤ Video frames are treated as images, and a CNN provides the per-frame description
✤ Frame-level predictions are averaged to obtain a video-level prediction (see the sketch below)
✤ However, this loses the complete temporal action information
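For context, a minimal sketch of this frame-averaging baseline (PyTorch assumed; the module name, `feature_dim`, and the CNN interface are illustrative, not the authors' code):

```python
import torch
import torch.nn as nn

class FrameAveragingBaseline(nn.Module):
    """Run a 2D CNN on each frame independently and average per-frame scores."""
    def __init__(self, cnn: nn.Module, feature_dim: int, num_classes: int):
        super().__init__()
        self.cnn = cnn                                   # any image CNN returning (N, feature_dim)
        self.classifier = nn.Linear(feature_dim, num_classes)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, time, channels, height, width)
        b, t, c, h, w = frames.shape
        feats = self.cnn(frames.reshape(b * t, c, h, w))          # (B*T, feature_dim)
        scores = self.classifier(feats).reshape(b, t, -1)         # (B, T, num_classes)
        return scores.mean(dim=1)                                 # average over frames
```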

SLIDE 3

Introduction

✤ For accurate video classification, learning a global description of the video's temporal information is important
✤ Using an increasing number of frames improves classification
✤ Moreover, optical flow images may provide additional information

SLIDE 4

Introduction

✤ Two approaches are introduced:
➡ Feature Pooling
➡ LSTM
✤ State-of-the-art performance on Sports-1M and UCF-101
✤ AlexNet and GoogLeNet are used as the underlying CNNs

SLIDE 5

Approach: Feature Pooling Architectures

✤ Conv Pooling:
➡ Performs max-pooling over the final CNN layer's features across the frames (blue)
➡ Feeds the pooled features to the FC layers (yellow); a sketch follows below
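A minimal sketch of Conv Pooling (PyTorch assumed; `ConvPooling`, the FC sizes, and `feat_dim` are illustrative, not the authors' implementation):

```python
import torch
import torch.nn as nn

class ConvPooling(nn.Module):
    """Max-pool the final conv feature maps across time, then apply FC layers."""
    def __init__(self, cnn_trunk: nn.Module, feat_dim: int, num_classes: int):
        super().__init__()
        self.cnn_trunk = cnn_trunk                       # CNN up to its last conv layer
        self.fc = nn.Sequential(
            nn.Flatten(), nn.Linear(feat_dim, 4096), nn.ReLU(),
            nn.Linear(4096, num_classes),
        )

    def forward(self, frames):                           # frames: (B, T, C, H, W)
        b, t = frames.shape[:2]
        maps = self.cnn_trunk(frames.flatten(0, 1))      # (B*T, C', H', W')
        maps = maps.reshape(b, t, *maps.shape[1:])
        pooled = maps.max(dim=1).values                  # element-wise max over time
        return self.fc(pooled)                           # feat_dim = C' * H' * W'
```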

SLIDE 6

Approach: Feature Pooling Architectures

✤ Late Pooling:
➡ Performs max-pooling (blue) after two FC layers (yellow)
➡ Compared to Conv Pooling, it directly combines high-level information (sketch below)
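A hedged sketch of Late Pooling (PyTorch assumed; layer sizes are illustrative): each frame passes through the CNN and two FC layers first, and only then are the per-frame vectors max-pooled across time.

```python
import torch
import torch.nn as nn

class LatePooling(nn.Module):
    """Per-frame CNN + two FC layers, then max-pooling over time."""
    def __init__(self, cnn_trunk: nn.Module, feat_dim: int, num_classes: int):
        super().__init__()
        self.cnn_trunk = cnn_trunk
        self.frame_fc = nn.Sequential(
            nn.Flatten(), nn.Linear(feat_dim, 4096), nn.ReLU(),
            nn.Linear(4096, 4096), nn.ReLU(),
        )
        self.classifier = nn.Linear(4096, num_classes)

    def forward(self, frames):                                    # (B, T, C, H, W)
        b, t = frames.shape[:2]
        x = self.frame_fc(self.cnn_trunk(frames.flatten(0, 1)))   # (B*T, 4096)
        x = x.reshape(b, t, -1).max(dim=1).values                 # pool after the FC layers
        return self.classifier(x)
```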

SLIDE 7

Approach: Feature Pooling Architectures

✤ Slow Pooling:
➡ First, max-pooling (blue) is applied over 10 frames after the CNN (like a size-10 filter)
➡ Each pooled window is followed by an FC layer (yellow)
➡ A single max-pooling then combines the FC outputs
➡ Groups local features before combining high-level information (sketch below)
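A rough sketch of Slow Pooling (PyTorch assumed; for simplicity it uses non-overlapping 10-frame windows, a shared window FC layer, and illustrative layer sizes, which may differ from the original):

```python
import torch
import torch.nn as nn

class SlowPooling(nn.Module):
    """Max-pool over local 10-frame windows, FC per window, then a final max-pool."""
    def __init__(self, cnn_trunk: nn.Module, feat_dim: int, num_classes: int, window: int = 10):
        super().__init__()
        self.cnn_trunk, self.window = cnn_trunk, window
        self.window_fc = nn.Sequential(nn.Linear(feat_dim, 4096), nn.ReLU())
        self.classifier = nn.Linear(4096, num_classes)

    def forward(self, frames):                                    # (B, T, C, H, W), T divisible by window
        b, t = frames.shape[:2]
        feats = self.cnn_trunk(frames.flatten(0, 1)).reshape(b, t, -1)
        wins = feats.reshape(b, t // self.window, self.window, -1)
        local = wins.max(dim=2).values                            # max over each 10-frame window
        per_win = self.window_fc(local)                           # (B, num_windows, 4096)
        video = per_win.max(dim=1).values                         # combine the window outputs
        return self.classifier(video)
```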

SLIDE 8

Approach: Feature Pooling Architectures

✤ Local Pooling:
➡ Combines frame-level features locally, as in Slow Pooling (blue)
➡ The softmax layer (orange) is connected to all FC layers (yellow) for the final prediction (sketch below)
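A hedged sketch of Local Pooling (PyTorch assumed; `num_windows`, the window length, and layer sizes are illustrative): like Slow Pooling, features are pooled over local windows and passed through FC layers, but instead of a second temporal max-pool the softmax classifier sees all per-window FC outputs.

```python
import torch
import torch.nn as nn

class LocalPooling(nn.Module):
    """Local window pooling + per-window FC; the classifier reads every window's output."""
    def __init__(self, cnn_trunk: nn.Module, feat_dim: int, num_classes: int,
                 num_windows: int = 3, window: int = 10):
        super().__init__()
        self.cnn_trunk, self.window = cnn_trunk, window
        self.window_fc = nn.Sequential(nn.Linear(feat_dim, 4096), nn.ReLU())
        self.classifier = nn.Linear(4096 * num_windows, num_classes)

    def forward(self, frames):                                    # (B, T, C, H, W), T == num_windows * window
        b, t = frames.shape[:2]
        feats = self.cnn_trunk(frames.flatten(0, 1)).reshape(b, t, -1)
        wins = feats.reshape(b, t // self.window, self.window, -1).max(dim=2).values
        per_win = self.window_fc(wins)                            # (B, num_windows, 4096)
        return self.classifier(per_win.flatten(1))                # all FC outputs feed the softmax
```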

SLIDE 9

Approach: Feature Pooling Architectures

✤ Time-Domain Convolution:
➡ An extra time-domain convolutional layer (green)
➡ Max-pooling across frames in the temporal domain (blue)
➡ Captures local relationships between frames (sketch below)
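An illustrative sketch of Time-Domain Convolution (PyTorch assumed; the kernel size and layer sizes are placeholders): a 1D convolution over the time axis of frame-level features captures local relationships between neighbouring frames before max-pooling across time.

```python
import torch
import torch.nn as nn

class TimeDomainConvolution(nn.Module):
    """Temporal 1D convolution over frame features, then max-pooling across frames."""
    def __init__(self, cnn_trunk: nn.Module, feat_dim: int, num_classes: int, kernel_size: int = 5):
        super().__init__()
        self.cnn_trunk = cnn_trunk
        self.temporal_conv = nn.Conv1d(feat_dim, feat_dim, kernel_size, padding=kernel_size // 2)
        self.fc = nn.Sequential(nn.Linear(feat_dim, 4096), nn.ReLU(),
                                nn.Linear(4096, num_classes))

    def forward(self, frames):                                    # (B, T, C, H, W)
        b, t = frames.shape[:2]
        feats = self.cnn_trunk(frames.flatten(0, 1)).reshape(b, t, -1)
        x = self.temporal_conv(feats.transpose(1, 2))             # convolve over the time axis
        pooled = x.max(dim=2).values                              # max-pool across frames
        return self.fc(pooled)
```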

SLIDE 10

Approach: Feature Pooling Architectures

✤ GoogLeNet Conv Pooling:
➡ Max-pooling is applied within the network
➡ The pooled layer is then connected to the softmax layer
➡ The model is enhanced by adding FC layers (sketch below)
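A hedged, simplified sketch (torchvision assumed, not the authors' setup): here GoogLeNet's final 1024-dimensional pooled features are used per frame, max-pooled across time, and followed by added FC layers and a 487-way classifier; the original pools inside the network itself.

```python
import torch
import torch.nn as nn
from torchvision.models import googlenet

class GoogLeNetConvPooling(nn.Module):
    """GoogLeNet per-frame features, max-pooled over time, followed by FC layers."""
    def __init__(self, num_classes: int = 487):
        super().__init__()
        backbone = googlenet(weights="IMAGENET1K_V1")
        backbone.fc = nn.Identity()                # keep the 1024-d pooled conv features
        self.backbone = backbone
        self.fc = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU(),
                                nn.Linear(4096, num_classes))

    def forward(self, frames):                     # (B, T, 3, 224, 224)
        b, t = frames.shape[:2]
        feats = self.backbone(frames.flatten(0, 1)).reshape(b, t, -1)
        return self.fc(feats.max(dim=1).values)    # max-pool across frames, then FC layers
```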

SLIDE 11

Approach: LSTM Architecture

SLIDE 12

Approach: LSTM Architecture

The LSTM takes input from the CNN's final layer at each video frame. A softmax layer predicts the class at each time step (sketch below).
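A minimal sketch of the LSTM architecture (PyTorch assumed; the paper stacks five LSTM layers of 512 cells each, and the classifier here stands in for the per-time-step softmax):

```python
import torch
import torch.nn as nn

class LSTMClassifier(nn.Module):
    """Frame-level CNN features feed a stacked LSTM; a class is predicted at every step."""
    def __init__(self, cnn_trunk: nn.Module, feat_dim: int, num_classes: int,
                 hidden: int = 512, layers: int = 5):
        super().__init__()
        self.cnn_trunk = cnn_trunk
        self.lstm = nn.LSTM(feat_dim, hidden, num_layers=layers, batch_first=True)
        self.classifier = nn.Linear(hidden, num_classes)

    def forward(self, frames):                                    # (B, T, C, H, W)
        b, t = frames.shape[:2]
        feats = self.cnn_trunk(frames.flatten(0, 1)).reshape(b, t, -1)
        outputs, _ = self.lstm(feats)                             # (B, T, hidden)
        return self.classifier(outputs)                           # per-time-step class scores
```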

SLIDE 13

Implementation Details

✤ Experiments are done with both AlexNet and GoogLeNet
✤ Parameters are initialized from a pre-trained ImageNet model and fine-tuned on Sports-1M (sketch below)
✤ Single-frame networks are expanded to 30-frame and 120-frame versions
✤ Optical flow images are used as an additional input
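A hedged sketch of the initialization step (torchvision assumed, not the authors' training infrastructure; the learning rate is a placeholder): start from ImageNet-pretrained weights and replace the final classifier for the 487 Sports-1M classes before fine-tuning.

```python
import torch
import torch.nn as nn
from torchvision.models import alexnet

# Load ImageNet-pretrained AlexNet and swap its last layer for 487 Sports-1M classes.
model = alexnet(weights="IMAGENET1K_V1")
model.classifier[-1] = nn.Linear(model.classifier[-1].in_features, 487)

# Fine-tune the whole network (illustrative hyperparameters).
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)
```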

SLIDE 14

Results: Sports-1M

✤ 1 million YouTube sports videos annotated with 487 classes
✤ 1,000-3,000 videos per class
✤ Optical flow quality varies widely between videos
✤ The first 5 minutes of each video are sampled to obtain 300 frames

SLIDE 15

Results: Sports-1M

Feature-pooling architecture comparisons

CNN network comparisons

SLIDE 16

Results: Sports-1M

Effect of the number of frames used in the GoogLeNet model

Effect of optical flow

SLIDE 17

Results: Sports-1M

Comparison with the work of Karpathy et al.

SLIDE 18

Results: UCF-101

✤ 13,320 videos with 101 activity classes
✤ More constrained camera movements; a hand-crafted dataset

UCF-101 accuracy results for different numbers of frames

SLIDE 19

Results: UCF-101

State-of-the-art UCF-101 results

SLIDE 20

Conclusion and Future Work

✤ Two video-classification methods that aggregate frame-level CNN outputs into video-level predictions are presented
✤ Feature pooling and LSTM architectures for video classification are introduced
✤ Using optical flow is beneficial
✤ State-of-the-art results are obtained on two benchmark datasets
✤ Learning should take place over the entire video rather than over short clips