SLIDE 1

Özge Yalçınkaya

Beyond Short Snippets: Deep Networks for Video Classification

Joe Yue-Hei Ng, Matthew Hausknecht, Sudheendra Vijayanarasimhan, Oriol Vinyals, Rajat Monga, George Toderici

SLIDE 2

Introduction

✤ Many attempts have been made to apply CNNs to action recognition
✤ Video frames are treated as images, and a CNN provides the per-frame description
✤ Frame-level predictions are averaged to obtain a video-level prediction (see the sketch below)
✤ However, this loses the complete temporal action information
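For context, a minimal sketch of this frame-averaging baseline (PyTorch assumed; the module name, `feature_dim`, and the CNN interface are illustrative, not the authors' code):

```python
import torch
import torch.nn as nn

class FrameAveragingBaseline(nn.Module):
    """Run a 2D CNN on each frame independently and average per-frame scores."""
    def __init__(self, cnn: nn.Module, feature_dim: int, num_classes: int):
        super().__init__()
        self.cnn = cnn                                   # any image CNN returning (N, feature_dim)
        self.classifier = nn.Linear(feature_dim, num_classes)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, time, channels, height, width)
        b, t, c, h, w = frames.shape
        feats = self.cnn(frames.reshape(b * t, c, h, w))          # (B*T, feature_dim)
        scores = self.classifier(feats).reshape(b, t, -1)         # (B, T, num_classes)
        return scores.mean(dim=1)                                 # average over frames
```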

SLIDE 3

Introduction

✤ For accurate video classification, learning a global description of the video's temporal information is important
✤ Using an increasing number of frames improves classification
✤ Moreover, optical flow images may provide additional information

SLIDE 4

Introduction

✤ Two approaches are introduced:
➡ Feature Pooling
➡ LSTM
✤ State-of-the-art performance on Sports-1M and UCF-101
✤ AlexNet and GoogLeNet are used as the underlying CNNs

SLIDE 5

Approach: Feature Pooling Architectures

✤ Conv Pooling:
➡ Performs max-pooling over the final CNN layer's features across the frames (blue)
➡ Feeds the pooled features to the FC layers (yellow); a sketch follows below
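A minimal sketch of Conv Pooling (PyTorch assumed; `ConvPooling`, the FC sizes, and `feat_dim` are illustrative, not the authors' implementation):

```python
import torch
import torch.nn as nn

class ConvPooling(nn.Module):
    """Max-pool the final conv feature maps across time, then apply FC layers."""
    def __init__(self, cnn_trunk: nn.Module, feat_dim: int, num_classes: int):
        super().__init__()
        self.cnn_trunk = cnn_trunk                       # CNN up to its last conv layer
        self.fc = nn.Sequential(
            nn.Flatten(), nn.Linear(feat_dim, 4096), nn.ReLU(),
            nn.Linear(4096, num_classes),
        )

    def forward(self, frames):                           # frames: (B, T, C, H, W)
        b, t = frames.shape[:2]
        maps = self.cnn_trunk(frames.flatten(0, 1))      # (B*T, C', H', W')
        maps = maps.reshape(b, t, *maps.shape[1:])
        pooled = maps.max(dim=1).values                  # element-wise max over time
        return self.fc(pooled)                           # feat_dim = C' * H' * W'
```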

SLIDE 6

Approach: Feature Pooling Architectures

✤ Late Pooling:
➡ Performs max-pooling (blue) after two FC layers (yellow)
➡ Compared to Conv Pooling, it directly combines high-level information (sketch below)
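A hedged sketch of Late Pooling (PyTorch assumed; layer sizes are illustrative): each frame passes through the CNN and two FC layers first, and only then are the per-frame vectors max-pooled across time.

```python
import torch
import torch.nn as nn

class LatePooling(nn.Module):
    """Per-frame CNN + two FC layers, then max-pooling over time."""
    def __init__(self, cnn_trunk: nn.Module, feat_dim: int, num_classes: int):
        super().__init__()
        self.cnn_trunk = cnn_trunk
        self.frame_fc = nn.Sequential(
            nn.Flatten(), nn.Linear(feat_dim, 4096), nn.ReLU(),
            nn.Linear(4096, 4096), nn.ReLU(),
        )
        self.classifier = nn.Linear(4096, num_classes)

    def forward(self, frames):                                    # (B, T, C, H, W)
        b, t = frames.shape[:2]
        x = self.frame_fc(self.cnn_trunk(frames.flatten(0, 1)))   # (B*T, 4096)
        x = x.reshape(b, t, -1).max(dim=1).values                 # pool after the FC layers
        return self.classifier(x)
```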

SLIDE 7

Approach: Feature Pooling Architectures

✤ Slow Pooling:
➡ First, max-pooling (blue) is applied over 10 frames after the CNN (like a size-10 filter)
➡ Each pooled window is followed by an FC layer (yellow)
➡ A single max-pooling then combines the FC outputs
➡ Groups local features before combining high-level information (sketch below)
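A rough sketch of Slow Pooling (PyTorch assumed; for simplicity it uses non-overlapping 10-frame windows, a shared window FC layer, and illustrative layer sizes, which may differ from the original):

```python
import torch
import torch.nn as nn

class SlowPooling(nn.Module):
    """Max-pool over local 10-frame windows, FC per window, then a final max-pool."""
    def __init__(self, cnn_trunk: nn.Module, feat_dim: int, num_classes: int, window: int = 10):
        super().__init__()
        self.cnn_trunk, self.window = cnn_trunk, window
        self.window_fc = nn.Sequential(nn.Linear(feat_dim, 4096), nn.ReLU())
        self.classifier = nn.Linear(4096, num_classes)

    def forward(self, frames):                                    # (B, T, C, H, W), T divisible by window
        b, t = frames.shape[:2]
        feats = self.cnn_trunk(frames.flatten(0, 1)).reshape(b, t, -1)
        wins = feats.reshape(b, t // self.window, self.window, -1)
        local = wins.max(dim=2).values                            # max over each 10-frame window
        per_win = self.window_fc(local)                           # (B, num_windows, 4096)
        video = per_win.max(dim=1).values                         # combine the window outputs
        return self.classifier(video)
```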

SLIDE 8

Approach: Feature Pooling Architectures

✤ Local Pooling:
➡ Combines frame-level features locally, as in Slow Pooling (blue)
➡ The softmax layer (orange) is connected to all FC layers (yellow) for the final prediction (sketch below)
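A hedged sketch of Local Pooling (PyTorch assumed; `num_windows`, the window length, and layer sizes are illustrative): like Slow Pooling, features are pooled over local windows and passed through FC layers, but instead of a second temporal max-pool the softmax classifier sees all per-window FC outputs.

```python
import torch
import torch.nn as nn

class LocalPooling(nn.Module):
    """Local window pooling + per-window FC; the classifier reads every window's output."""
    def __init__(self, cnn_trunk: nn.Module, feat_dim: int, num_classes: int,
                 num_windows: int = 3, window: int = 10):
        super().__init__()
        self.cnn_trunk, self.window = cnn_trunk, window
        self.window_fc = nn.Sequential(nn.Linear(feat_dim, 4096), nn.ReLU())
        self.classifier = nn.Linear(4096 * num_windows, num_classes)

    def forward(self, frames):                                    # (B, T, C, H, W), T == num_windows * window
        b, t = frames.shape[:2]
        feats = self.cnn_trunk(frames.flatten(0, 1)).reshape(b, t, -1)
        wins = feats.reshape(b, t // self.window, self.window, -1).max(dim=2).values
        per_win = self.window_fc(wins)                            # (B, num_windows, 4096)
        return self.classifier(per_win.flatten(1))                # all FC outputs feed the softmax
```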

SLIDE 9

Approach: Feature Pooling Architectures

✤ Time-Domain Convolution:
➡ An extra time-domain convolutional layer (green)
➡ Max-pooling across frames in the temporal domain (blue)
➡ Captures local relationships between frames (sketch below)
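An illustrative sketch of Time-Domain Convolution (PyTorch assumed; the kernel size and layer sizes are placeholders): a 1D convolution over the time axis of frame-level features captures local relationships between neighbouring frames before max-pooling across time.

```python
import torch
import torch.nn as nn

class TimeDomainConvolution(nn.Module):
    """Temporal 1D convolution over frame features, then max-pooling across frames."""
    def __init__(self, cnn_trunk: nn.Module, feat_dim: int, num_classes: int, kernel_size: int = 5):
        super().__init__()
        self.cnn_trunk = cnn_trunk
        self.temporal_conv = nn.Conv1d(feat_dim, feat_dim, kernel_size, padding=kernel_size // 2)
        self.fc = nn.Sequential(nn.Linear(feat_dim, 4096), nn.ReLU(),
                                nn.Linear(4096, num_classes))

    def forward(self, frames):                                    # (B, T, C, H, W)
        b, t = frames.shape[:2]
        feats = self.cnn_trunk(frames.flatten(0, 1)).reshape(b, t, -1)
        x = self.temporal_conv(feats.transpose(1, 2))             # convolve over the time axis
        pooled = x.max(dim=2).values                              # max-pool across frames
        return self.fc(pooled)
```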

SLIDE 10

Approach: Feature Pooling Architectures

✤ GoogLeNet Conv Pooling:
➡ Max-pooling is applied within the network
➡ The pooled layer is then connected to the softmax layer
➡ The model is enhanced by adding FC layers (sketch below)
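A hedged, simplified sketch (torchvision assumed, not the authors' setup): here GoogLeNet's final 1024-dimensional pooled features are used per frame, max-pooled across time, and followed by added FC layers and a 487-way classifier; the original pools inside the network itself.

```python
import torch
import torch.nn as nn
from torchvision.models import googlenet

class GoogLeNetConvPooling(nn.Module):
    """GoogLeNet per-frame features, max-pooled over time, followed by FC layers."""
    def __init__(self, num_classes: int = 487):
        super().__init__()
        backbone = googlenet(weights="IMAGENET1K_V1")
        backbone.fc = nn.Identity()                # keep the 1024-d pooled conv features
        self.backbone = backbone
        self.fc = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU(),
                                nn.Linear(4096, num_classes))

    def forward(self, frames):                     # (B, T, 3, 224, 224)
        b, t = frames.shape[:2]
        feats = self.backbone(frames.flatten(0, 1)).reshape(b, t, -1)
        return self.fc(feats.max(dim=1).values)    # max-pool across frames, then FC layers
```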

SLIDE 11

Approach: LSTM Architecture

SLIDE 12

Approach: LSTM Architecture

The LSTM takes input from the CNN's final layer at each video frame. A softmax layer predicts the class at each time step (sketch below).
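A minimal sketch of the LSTM architecture (PyTorch assumed; the paper stacks five LSTM layers of 512 cells each, and the classifier here stands in for the per-time-step softmax):

```python
import torch
import torch.nn as nn

class LSTMClassifier(nn.Module):
    """Frame-level CNN features feed a stacked LSTM; a class is predicted at every step."""
    def __init__(self, cnn_trunk: nn.Module, feat_dim: int, num_classes: int,
                 hidden: int = 512, layers: int = 5):
        super().__init__()
        self.cnn_trunk = cnn_trunk
        self.lstm = nn.LSTM(feat_dim, hidden, num_layers=layers, batch_first=True)
        self.classifier = nn.Linear(hidden, num_classes)

    def forward(self, frames):                                    # (B, T, C, H, W)
        b, t = frames.shape[:2]
        feats = self.cnn_trunk(frames.flatten(0, 1)).reshape(b, t, -1)
        outputs, _ = self.lstm(feats)                             # (B, T, hidden)
        return self.classifier(outputs)                           # per-time-step class scores
```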

SLIDE 13

Implementation Details

✤ Experiments are done with both AlexNet and GoogLeNet
✤ Parameters are initialized from a pre-trained ImageNet model and fine-tuned on Sports-1M (sketch below)
✤ Single-frame networks are expanded to 30-frame and 120-frame versions
✤ Optical flow images are used as an additional input
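A hedged sketch of the initialization step (torchvision assumed, not the authors' training infrastructure; the learning rate is a placeholder): start from ImageNet-pretrained weights and replace the final classifier for the 487 Sports-1M classes before fine-tuning.

```python
import torch
import torch.nn as nn
from torchvision.models import alexnet

# Load ImageNet-pretrained AlexNet and swap its last layer for 487 Sports-1M classes.
model = alexnet(weights="IMAGENET1K_V1")
model.classifier[-1] = nn.Linear(model.classifier[-1].in_features, 487)

# Fine-tune the whole network (illustrative hyperparameters).
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)
```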

SLIDE 14

Results: Sports-1M

✤ 1 million YouTube sports videos annotated with 487 classes
✤ 1,000-3,000 videos per class
✤ Optical flow quality varies widely between videos
✤ The first 5 minutes of each video are sampled to obtain 300 frames

SLIDE 15

Results: Sports-1M

Feature-pooling architecture comparisons

CNN network comparisons

SLIDE 16

Results: Sports-1M

Effect of the number of frames used in the GoogLeNet model

Effect of optical flow

SLIDE 17

Results: Sports-1M

Comparison with the work of Karpathy et al.

SLIDE 18

Results: UCF-101

✤ 13,320 videos with 101 activity classes
✤ More constrained camera movements; a hand-crafted dataset

UCF-101 accuracy results for different numbers of frames

SLIDE 19

Results: UCF-101

State-of-the-art UCF-101 results

SLIDE 20

Conclusion and Future Work

✤ Two video-classification methods that aggregate frame-level CNN outputs into video-level predictions are presented
✤ Feature pooling and LSTM architectures for video classification are introduced
✤ Using optical flow is beneficial
✤ State-of-the-art results are obtained on two benchmark datasets
✤ Learning should take place over the entire video rather than over short clips