Segments, Residuals and Embeddings for Few-Example Video Event Detection


slide-1
SLIDE 1

Segments, Residuals and Embeddings for Few-Example Video Event Detection

Dennis Koelma and Cees Snoek University of Amsterdam The Netherlands

slide-2
SLIDE 2

Pipeline 10Ex 2016

Videos → Frames (sample 2 / sec) → CNN Inception (ImageNet Shuffle), plus motion and audio features. Five SVM 10Ex models:

  • M1: pool5, avg pool
  • M2: prob, avg pool
  • M3: dense trajectories, Fisher vector
  • M4: mfcc0, mfcc1, mfcc2, Fisher vector
  • M5: Video Story embedding

slide-3
SLIDE 3

Pipeline 10Ex 2017

Videos → Frames (sample 2 / sec) → ResNet + ResNeXt (ImageNet Shuffle), plus motion and audio features. Five SVM 10Ex models:

  • M1: pool5, difference coding
  • M2: sliding window, avg pool
  • M3: dense trajectories, Fisher vector
  • M4: mfcc0, mfcc1, mfcc2, Fisher vector
  • M5: Video Story embedding

slide-4
SLIDE 4

CNN Features from 22k ImageNet classes

  • Use as many classes as possible
  • Find a balance between the level of abstraction of classes and the number of images in a class

Irrelevant classes (examples): Gametophyte, Siderocyte
Example imbalance: 296 classes with 1 image

slide-5
SLIDE 5

CNN training on selection out of 22k ImageNet classes

  • Idea
      • Increase level of abstraction of classes
      • Incorporate classes with fewer than 200 samples
  • Heuristics (N = number of images in a class)
      • Roll
      • Bind: N < 3000
      • Subsample: N > 2000
      • Promote: N < 200
  • Result
      • 12,988 classes
      • 13.6M images

The ImageNet Shuffle: Reorganized Pre-training for Video Event Detection, Pascal Mettes and Dennis Koelma and Cees Snoek, International Conference on Multimedia Retrieval, 2016
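
The thresholds on this slide can be read as simple rules on the image count N of a class. The following is a minimal sketch of that reading only: the function name is hypothetical, Roll is left out because the slide gives no threshold for it, and what each heuristic does to the class hierarchy (and in which order they are applied) is not specified here.

```python
# Thresholds as listed on the slide; N is the number of images in a class.
PROMOTE_BELOW = 200      # N < 200  : Promote
BIND_BELOW = 3000        # N < 3000 : Bind
SUBSAMPLE_ABOVE = 2000   # N > 2000 : Subsample


def applicable_heuristics(n_images):
    """Return the threshold-based heuristics whose condition holds.

    Hypothetical helper: it only evaluates the three conditions and
    does not model how the heuristics interact on the hierarchy.
    """
    actions = []
    if n_images < PROMOTE_BELOW:
        actions.append("promote")
    if n_images < BIND_BELOW:
        actions.append("bind")
    if n_images > SUBSAMPLE_ABOVE:
        actions.append("subsample")
    return actions


if __name__ == "__main__":
    for n in (1, 2500, 5000):
        print(n, applicable_heuristics(n))
```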

slide-6
SLIDE 6

Feature Difference Coding

  • K-means clustering (k = 5) on last fully connected layer before probability layers (called flatten)
  • Fisher-like encoding, but sigma is based on the distance of the points assigned to a cluster to its center

[Bar chart: MAP on the 2014 test set (axis 0.290 to 0.350), flatten-avg versus flatten-dc, for ResNet, ResNeXt and Fusion.]
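
One possible reading of the difference coding bullets is sketched below. It assumes the cluster centers are already available (for example from k-means with k = 5), takes sigma to be the mean distance of the points assigned to a cluster to its center, and encodes a video as the per-cluster sum of frame residuals divided by sigma. The function names and the normalization by the number of frames are assumptions for illustration, not the formulation used in the talk.

```python
import math


def nearest_center(x, centers):
    """Index of the center closest to x (Euclidean distance)."""
    return min(range(len(centers)), key=lambda k: math.dist(x, centers[k]))


def cluster_sigmas(points, centers):
    """Assumed sigma per cluster: mean distance of assigned points to the center."""
    sums = [0.0] * len(centers)
    counts = [0] * len(centers)
    for x in points:
        k = nearest_center(x, centers)
        sums[k] += math.dist(x, centers[k])
        counts[k] += 1
    # Fall back to 1.0 for empty or degenerate clusters to avoid dividing by zero.
    return [s / n if n and s > 0 else 1.0 for s, n in zip(sums, counts)]


def difference_coding(frames, centers, sigmas):
    """Encode one video as K * D normalized residuals to the cluster centers."""
    dim = len(centers[0])
    code = [0.0] * (len(centers) * dim)
    for x in frames:
        k = nearest_center(x, centers)
        for j in range(dim):
            code[k * dim + j] += (x[j] - centers[k][j]) / sigmas[k]
    return [v / len(frames) for v in code] if frames else code


if __name__ == "__main__":
    centers = [[1.0, 0.0], [100.0, 100.0]]   # toy centers, K = 2
    frames = [[0.0, 0.0], [4.0, 0.0]]        # toy frame features, D = 2
    sigmas = cluster_sigmas(frames, centers)
    print(difference_coding(frames, centers, sigmas))
```

In practice sigma would presumably be estimated once on the data used for clustering rather than per video; the toy example reuses the same two frames only to stay short.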

slide-7
SLIDE 7

Video Story: Embed the story of a video

Joint optimization of W and A to preserve

  • Descriptiveness: preserve video descriptions: L(A,S)
  • Predictability: recognize terms from video content: L(S,W)

[Diagram: video features xi are mapped by W to the embedding si; A relates si to the description terms yi (e.g. Bike, Motorcycle, Stunt).]

Videostory: A new multimedia embedding for few-example recognition and translation of events, Amirhossein Habibian and Thomas Mensink and Cees Snoek, Proceedings of the ACM International Conference on Multimedia, 2014
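
The slide names the two losses but not their form. A plausible way to write the joint objective, assuming squared reconstruction errors and leaving out any regularization terms, is:

```latex
\min_{A,\,W,\,S} \; L(A,S) + L(S,W), \qquad
L(A,S) = \frac{1}{N} \sum_{i=1}^{N} \lVert y_i - A\, s_i \rVert_2^2, \qquad
L(S,W) = \frac{1}{N} \sum_{i=1}^{N} \lVert s_i - W^{\top} x_i \rVert_2^2
```

Here x_i is the video feature, y_i the term vector of its description and s_i the embedding, as in the diagram; N (the number of training videos) and the squared-error form are assumptions of this sketch.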

slide-8
SLIDE 8

VideoStory Embedding as a Feature

[Bar chart: MAP on the 2014 test set (axis 0.300 to 0.335), flatten-avg versus video story, for ResNet, ResNeXt and Fusion.]

slide-9
SLIDE 9

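Video Story for 0Ex

  • Video: xi → W → si (embedding) → A → predicted terms, e.g. 0.45 bike, 0.30 man
  • Event description: "Attempting a bike trick" → terms 1.0 attempt, 1.0 bike, 1.0 trick
  • Cosine similarity between the two term vectors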
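
The matching step on this slide can be illustrated with the numbers it shows. The sketch below assumes both sides are sparse term-weight vectors stored as dictionaries; the function and variable names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two sparse term-weight dictionaries."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Term scores predicted for a video, and the terms of the event description,
# using the values shown on the slide.
video_terms = {"bike": 0.45, "man": 0.30}
event_terms = {"attempt": 1.0, "bike": 1.0, "trick": 1.0}

score = cosine_similarity(video_terms, event_terms)

if __name__ == "__main__":
    print(round(score, 3))
```

Only "bike" is shared between the two vectors, so the score is driven by that single term.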

slide-10
SLIDE 10

Finding Segments to Expand Training Material

[Diagram: a window slides over Example1 and yields the segments Example1_1, Example1_2 and Example1_3, which are scored by cosine similarity.]
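
A minimal sketch of how such segments could be found, assuming each frame has a feature vector, each window is average pooled, and windows are ranked by cosine similarity to a query vector. The window size, stride, number of segments kept and the choice of query vector are all assumptions; the slide does not give them.

```python
import math


def cosine(u, v):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def mean_pool(frames):
    """Average the frame features of one window, dimension by dimension."""
    return [sum(column) / len(frames) for column in zip(*frames)]


def best_segments(frames, query, window=4, stride=2, top_n=3):
    """Return up to top_n (score, start, end) windows, best score first."""
    if not frames:
        return []
    scored = []
    last_start = max(len(frames) - window, 0)
    for start in range(0, last_start + 1, stride):
        segment = frames[start:start + window]
        score = cosine(mean_pool(segment), query)
        scored.append((score, start, start + len(segment)))
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:top_n]


if __name__ == "__main__":
    frames = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
    print(best_segments(frames, query=[0.0, 1.0], window=2, stride=1, top_n=1))
```

The selected windows would then be added as extra training examples next to the full video, which is how the slide title describes their use.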

slide-11
SLIDE 11

Window based Features

[Bar chart: MAP on the 2014 test set (axis 0.295 to 0.345), flatten-avg versus flatten-window, for ResNet, ResNeXt and Fusion.]

slide-12
SLIDE 12

Result Individual Modalities on 2014 Test Set

[Bar chart: MAP (axis 0.05 to 0.4) for flatten-avg, softmax, trajectories, mfcc, video story, flatten-dc and flatten-window, for ResNet, ResNeXt and Fusion.]

  • R < Rx < F
  • DC is best (overfit?)
  • VS > flatten-window > avg
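
All comparisons in this deck are reported as MAP. For reference, the standard definition can be sketched as follows, assuming one ranked list of relevance labels per event and that every positive video appears in the ranking; the function names are hypothetical and this is not the official evaluation tool.

```python
def average_precision(ranked_labels):
    """Average precision of one ranking; labels are truthy for positives."""
    hits = 0
    precision_sum = 0.0
    for rank, is_positive in enumerate(ranked_labels, start=1):
        if is_positive:
            hits += 1
            precision_sum += hits / rank  # precision at this positive
    return precision_sum / hits if hits else 0.0


def mean_average_precision(rankings):
    """Mean of the per-event average precisions."""
    scores = [average_precision(labels) for labels in rankings]
    return sum(scores) / len(scores) if scores else 0.0


if __name__ == "__main__":
    # Two toy events, each with its videos sorted by decreasing score.
    print(mean_average_precision([[1, 0, 1], [0, 1]]))
```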

slide-13
SLIDE 13

Fusion Visual Modalities on 2014 Test Set

[Bar chart: MAP (axis 0.315 to 0.360) for VS, DC, Win, DC-VS, DC-Win and VS-DC-Win, using ResNet + ResNeXt.]

slide-14
SLIDE 14

Fusion on 2014 Test Set

[Bar chart: MAP (axis 0.315 to 0.360) for ResNeXt and ResNet+ResNeXt.]

  • AVG2-DT-MFCC-VS: last year
  • DC: new features, single modality
  • VS-DC-Win: visual fusion
  • VS-DC-Win-DT-MFCC: multimodal (MM) fusion
  • VS-DC-Win-DT-MFCC-AVG2: MM fusion + avg

slide-15
SLIDE 15

Computational Efficiency

[Charts: Feature Extraction (axis 50 to 250), Classification (axis 0.02 to 0.12) and MAP (axis 33.8 to 35.8) for the runs p-visualFusionTwoCNN, c-mmFusionTwoCNN, c-visualFusionOneCNN, c-mmFusionOneCNN and c-visualSingle.]

slide-16
SLIDE 16

Our MED Submission

[Chart: results on Test 2014, PS and AH for the runs p-visualFusionTwoCNN, c-mmFusionTwoCNN, c-visualFusionOneCNN, c-mmFusionOneCNN and c-visualSingle.]

slide-17
SLIDE 17

All MED Submissions

[Charts: all MED submissions on PS (axis 5 to 50) and AH (axis 10 to 80); teams shown: MediaMill, TokyoTech, ITICERTH, INF.]

slide-18
SLIDE 18

Conclusions

  • Visual features are still improving
  • Fusion still works but other modalities need work
  • 0Ex helps to get more out of your examples
slide-19
SLIDE 19

Thank You