Segments, Residuals and Embeddings for Few-Example Video Event Detection

May 23, 2023

SLIDE 1

Segments, Residuals and Embeddings for Few-Example Video Event Detection

Dennis Koelma and Cees Snoek
University of Amsterdam, The Netherlands

SLIDE 2

Pipeline 10Ex 2016

Videos -> Frames (sample 2 / sec) -> CNN Inception, ImageNet Shuffle
pool5 -> avg pool -> SVM 10Ex (M1)
prob -> avg pool -> SVM 10Ex (M2)
dense trajectories -> Fisher vector -> SVM 10Ex (M3)
mfcc0, mfcc1, mfcc2 -> Fisher vector -> SVM 10Ex (M4)
Video Story embedding -> SVM 10Ex (M5)

SLIDE 3

Pipeline 10Ex 2017

Videos -> Frames (sample 2 / sec) -> ResNet + ResNeXt, ImageNet Shuffle
pool5 -> difference coding -> SVM 10Ex (M1)
sliding window -> avg pool -> SVM 10Ex (M2)
dense trajectories -> Fisher vector -> SVM 10Ex (M3)
mfcc0, mfcc1, mfcc2 -> Fisher vector -> SVM 10Ex (M4)
Video Story embedding -> SVM 10Ex (M5)

SLIDE 4

CNN Features from 22k ImageNet classes

Use as many classes as possible
Find a balance between the level of abstraction of classes and the number of images in a class

Irrelevant classes: Gametophyte, Siderocyte
Example imbalance: 296 classes with 1 image

SLIDE 5

CNN training on selection out of 22k ImageNet classes

Idea
Increase level of abstraction of classes
Incorporate classes with fewer than 200 samples
Heuristics
Roll, Bind, Promote, Subsample
Result
12,988 classes
13.6M images

Roll
N < 3000: Bind
N > 2000: Subsample
N < 200: Promote

The ImageNet Shuffle: Reorganized Pre-training for Video Event Detection, Pascal Mettes and Dennis Koelma and Cees Snoek, International Conference on Multimedia Retrieval, 2016
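The threshold rules on this slide can be illustrated with a small sketch. It covers only Promote and Subsample, and it assumes that Promote moves the images of a class with fewer than 200 samples into its parent and that Subsample caps a class with more than 2,000 images at 2,000. Roll and Bind are left out. The toy hierarchy, class names and function names are hypothetical; the actual procedure is the one described in the cited paper.

```python
import random

PROMOTE_BELOW = 200      # slide: N < 200 -> Promote
SUBSAMPLE_ABOVE = 2000   # slide: N > 2000 -> Subsample


def promote_small_classes(images, parent):
    """Move the images of small classes into their parent class.

    images: dict mapping class name -> list of image ids
    parent: dict mapping class name -> parent class name (roots are absent)
    """
    result = {name: list(ids) for name, ids in images.items()}
    changed = True
    while changed:  # repeat, because a parent may itself stay too small
        changed = False
        for name in list(result):
            if name in parent and len(result[name]) < PROMOTE_BELOW:
                result.setdefault(parent[name], []).extend(result.pop(name))
                changed = True
    return result


def subsample_large_classes(images, seed=0):
    """Randomly cap every class at SUBSAMPLE_ABOVE images (assumed behaviour)."""
    rng = random.Random(seed)
    return {
        name: rng.sample(ids, SUBSAMPLE_ABOVE) if len(ids) > SUBSAMPLE_ABOVE else list(ids)
        for name, ids in images.items()
    }


# Toy hierarchy with hypothetical class names and image counts.
parent = {"siderocyte": "cell", "erythrocyte": "cell", "dog": "animal"}
images = {
    "siderocyte": ["s0"],
    "erythrocyte": [f"e{i}" for i in range(150)],
    "cell": [f"c{i}" for i in range(100)],
    "dog": [f"d{i}" for i in range(2500)],
}
reorganized = subsample_large_classes(promote_small_classes(images, parent))
sizes = {name: len(ids) for name, ids in reorganized.items()}
print(sizes)
```

In this toy case the two small leaf classes are absorbed by "cell", while "dog" is reduced to the cap.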

SLIDE 6

Feature Difference Coding

K-means clustering (k = 5) on the last fully connected layer before the probability layers (called flatten)
Fisher-like encoding, but sigma is based on the distance of the points assigned to a cluster to its center
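The encoding described above can be sketched as follows. The slide gives no formula, so this is only one plausible reading: frames are hard-assigned to their nearest k-means center, the residuals to that center are summed per cluster, and each cluster is scaled by a single sigma taken as the mean distance of its training points to the center. The function names, the choice of sigma and the normalization by the number of frames are all assumptions, not the authors' exact method.

```python
import math


def nearest_center(x, centers):
    """Index of the center closest to x (hard assignment)."""
    return min(range(len(centers)), key=lambda j: math.dist(x, centers[j]))


def cluster_sigmas(train_points, centers):
    """Assumed sigma: mean distance of the points assigned to a cluster to its center."""
    distances = [[] for _ in centers]
    for x in train_points:
        j = nearest_center(x, centers)
        distances[j].append(math.dist(x, centers[j]))
    # Fall back to 1.0 for empty clusters or zero spread to avoid division by zero.
    return [(sum(ds) / len(ds)) or 1.0 if ds else 1.0 for ds in distances]


def difference_coding(frames, centers, sigmas):
    """Encode a video as concatenated, sigma-scaled residual sums (length k * d)."""
    d = len(centers[0])
    residuals = [[0.0] * d for _ in centers]
    for x in frames:
        j = nearest_center(x, centers)
        for dim in range(d):
            residuals[j][dim] += x[dim] - centers[j][dim]
    code = []
    for residual, sigma in zip(residuals, sigmas):
        code.extend(value / (sigma * len(frames)) for value in residual)
    return code


# Toy example with two centers in two dimensions (the slide uses k = 5).
centers = [[0.0, 0.0], [10.0, 10.0]]
train_points = [[1.0, 0.0], [0.0, 1.0], [10.0, 12.0], [12.0, 10.0]]
sigmas = cluster_sigmas(train_points, centers)

frames = [[2.0, 0.0], [0.0, 2.0], [10.0, 14.0]]
code = difference_coding(frames, centers, sigmas)
print(len(code), code)
```

With k = 5 centers on a d-dimensional flatten layer, the resulting video representation has 5 * d dimensions, compared with d for average pooling.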

[Bar chart: MAP on the 2014 test set (axis 0.290 to 0.350) for flatten-avg and flatten-dc; series: ResNet, ResNeXt, Fusion]

SLIDE 7

Video Story: Embed the story of a video

Joint optimization of W and A to preserve:
Descriptiveness: preserve video descriptions, L(A,S)
Predictability: recognize terms from video content, L(S,W)
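Written out, the two losses can be combined into one objective. The form below is a sketch that assumes squared reconstruction errors for both terms, as in the cited VideoStory paper, with the regularization terms omitted. Here x_i is the video feature, y_i the term vector of the video description, s_i the embedding and N the number of training videos; the exact weighting is an assumption.

```latex
\min_{A,\,W,\,S}\; L(A,S) + L(S,W), \qquad
L(A,S) = \frac{1}{N}\sum_{i=1}^{N} \lVert y_i - A\,s_i \rVert_2^2, \qquad
L(S,W) = \frac{1}{N}\sum_{i=1}^{N} \lVert s_i - W^{\top} x_i \rVert_2^2
```

The first term keeps the embedding descriptive of the terms, and the second keeps it predictable from the video content.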

[Diagram: video feature xi, embedding si and term vector yi (example terms: Bike, Motorcycle, Stunt), connected by the projections W and A]

Videostory: A new multimedia embedding for few-example recognition and translation of events, Amirhossein Habibian and Thomas Mensink and Cees Snoek, Proceedings of the ACM International Conference on Multimedia, 2014

SLIDE 8

VideoStory Embedding as a Feature

[Bar chart: MAP on the 2014 test set (axis 0.300 to 0.335) for flatten-avg and video story; series: ResNet, ResNeXt, Fusion]

SLIDE 9

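Video Story for 0Ex

[Diagram: video feature xi is projected by W into the embedding si and mapped by A to term scores (0.45 bike, 0.30 man). The event description "Attempting a bike trick" is represented as terms (1.0 attempt, 1.0 bike, 1.0 trick). The two are compared by cosine similarity.]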
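The matching step on this slide can be sketched with the numbers it shows. The reading assumed here is that "0.45 bike, 0.30 man" are the term scores predicted for a video and that the event description is turned into a term vector with weight 1.0 per term. The function and variable names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two sparse term vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # no terms on one side: treat as no match
    return dot / (norm_a * norm_b)


# Term scores predicted for a video (values from the slide).
video_terms = {"bike": 0.45, "man": 0.30}
# Event description "Attempting a bike trick" as a term vector (values from the slide).
query_terms = {"attempt": 1.0, "bike": 1.0, "trick": 1.0}

score = cosine_similarity(video_terms, query_terms)
print(round(score, 3))
```

Only "bike" is shared between the two vectors, so the score stays well below 1; ranking all test videos by this score gives a detector without any example video.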

SLIDE 10

Finding Segments to Expand Training Material

[Diagram: a window slides over Example1 and yields the segments Example1_1, Example1_2 and Example1_3, selected by cosine similarity]
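A minimal sketch of the segment selection, assuming that each window is represented by the average of its frame features and is scored by cosine similarity against a reference vector. The slide does not say what the windows are compared with, so the reference (for example an embedded event description), the window size, the stride and all function names are assumptions.

```python
import math


def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def window_features(frames, size, stride):
    """Average-pool frame features over sliding windows."""
    windows = []
    for start in range(0, len(frames) - size + 1, stride):
        chunk = frames[start:start + size]
        windows.append([sum(column) / size for column in zip(*chunk)])
    return windows


def select_segments(frames, reference, size=4, stride=2, top=3):
    """Return (start_frame, similarity) for the windows most similar to the reference."""
    scored = [
        (index * stride, cosine(feature, reference))
        for index, feature in enumerate(window_features(frames, size, stride))
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:top]


# Toy video: the first half shows something unrelated, the second half the event.
frames = [[1.0, 0.0]] * 4 + [[0.0, 1.0]] * 4
reference = [0.0, 1.0]

segments = select_segments(frames, reference, size=4, stride=2, top=2)
print(segments)
```

The selected windows can then be added to the training set as extra positive examples next to the full video.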

SLIDE 11

Window based Features

[Bar chart: MAP on the 2014 test set (axis 0.295 to 0.345) for flatten-avg and flatten-window; series: ResNet, ResNeXt, Fusion]

SLIDE 12

Result Individual Modalities on 2014 Test Set

[Bar chart: MAP (axis 0.05 to 0.4) per modality: flatten-avg, softmax, trajectories, mfcc, video story, flatten-dc, flatten-window; series: ResNet, ResNeXt, Fusion]

R < Rx < F (ResNet < ResNeXt < Fusion)
DC is best
Overfit?
VS > flatten window > avg

SLIDE 13

Fusion Visual Modalities on 2014 Test Set

[Bar chart: MAP (axis 0.315 to 0.360) for ResNet + ResNeXt with VS, DC, Win, DC-VS, DC-Win and VS-DC-Win]

SLIDE 14

Fusion on 2014 Test Set

[Bar chart: MAP (axis 0.315 to 0.360); series: ResNeXt, ResNet+ResNeXt]

AVG2-DT-MFCC-VS: last year
DC: new features, single mod
VS-DC-Win: visual fusion
VS-DC-Win-DT-MFCC: MM fusion
VS-DC-Win-DT-MFCC-AVG2: + avg

SLIDE 15

Computational Efficiency

[Charts: MAP (axis 33.8 to 35.8) against Feature Extraction (axis 50 to 250) and Classification (axis 0.02 to 0.12) for the runs p-visualFusionTwoCNN, c-mmFusionTwoCNN, c-visualFusionOneCNN, c-mmFusionOneCNN and c-visualSingle]

SLIDE 16

Our MED Submission

[Table: runs p-visualFusionTwoCNN, c-mmFusionTwoCNN, c-visualFusionOneCNN, c-mmFusionOneCNN and c-visualSingle; columns Test 2014, PS, AH]

SLIDE 17

All MED Submissions

[Bar chart: all MED submissions by team: MediaMill, TokyoTech, ITICERTH, INF]

SLIDE 18

Conclusions

Visual features are still improving
Fusion still works but other modalities need work
0Ex helps to get more out of your examples

SLIDE 19

Thank You