

SLIDE 1

Columbia-UCF MED2010: Combining Multiple Modalities, Contextual Concepts, and Temporal Matching

Yu-Gang Jiang1, Xiaohong Zeng1, Guangnan Ye1, Subh Bhattacharya2, Dan Ellis1, Mubarak Shah2, Shih-Fu Chang1

1 Department of EE, Columbia University 2 Department of EECS, University of Central Florida

TRECVID 2010 workshop, NIST, Gaithersburg, MD

SLIDE 2

The target…

Batting a run in · Assembling a shelter · Making a cake

SLIDE 3

Overview: 4 major components & 6 runs

[Pipeline diagram] Feature extraction (SIFT, spatio-temporal interest points, MFCC audio features) → χ² SVM classifiers (Run 6) and EMD-SVM temporal matching (Run 3) → semantic diffusion with contextual detectors over 21 scene, action, and audio concepts (Runs 5, 4, 2) → "batter" detection and re-ranking (Run 1) → event scores for Batting a run in, Assembling a shelter, and Making a cake.

SLIDE 4

Overview: overall performance

[Bar chart: Mean Minimal Normalized Cost for runs r1–r6; lower is better]

Run 1: Run 2 + "Batter" re-ranking
Run 2: Run 3 + scene/audio/action context
Run 3: Run 6 + EMD temporal matching
Run 4: Run 6 + scene/audio/action context
Run 5: Run 6 + scene/audio context
Run 6: baseline classification with 3 features
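Runs are scored by Minimal Normalized Detection Cost. The deck does not restate the formula, but NIST's normalized detection cost combines the miss and false-alarm rates; a minimal sketch, assuming the cost parameters commonly cited for TRECVID MED (C_miss = 80, C_FA = 1, P_target = 0.001 — an assumption, not stated on the slide):

```python
def normalized_detection_cost(p_miss, p_fa,
                              c_miss=80.0, c_fa=1.0, p_target=0.001):
    """Normalized Detection Cost (NDC). The default cost parameters are
    the commonly cited TRECVID MED settings, assumed here."""
    cost = c_miss * p_miss * p_target + c_fa * p_fa * (1.0 - p_target)
    norm = min(c_miss * p_target, c_fa * (1.0 - p_target))
    return cost / norm

# A perfect detector has NDC = 0; the trivial "reject everything"
# detector (p_miss = 1, p_fa = 0) has NDC = 1.
```

With these assumed parameters the metric is proportional to P_miss + 12.49·P_FA, which is why a run can only beat the trivial detector by keeping both error rates low.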

SLIDE 5

Overview: per-event performance

[Three bar charts: per-event Minimal Normalized Cost (MNC) for runs 1–6 — Batting a run in, Assembling a shelter, Making a cake]

Run 1: Run 2 + "Batter" re-ranking
Run 2: Run 3 + scene/audio/action context
Run 3: Run 6 + EMD temporal matching
Run 4: Run 6 + scene/audio/action context
Run 5: Run 6 + scene/audio context
Run 6: baseline classification with 3 features

SLIDE 6

Roadmap > multiple modalities

[Pipeline diagram repeated; highlighted component: multiple feature modalities — SIFT, spatio-temporal interest points, and MFCC audio features]

SLIDE 7

Three Feature Modalities…


  • SIFT (visual)

– D. Lowe, IJCV 04.

  • STIP (visual)

– I. Laptev, IJCV 05.

  • MFCC (audio)

– computed over 16 ms audio frames

SLIDE 8

Bag-of-X Representation

  • X = SIFT or STIP or MFCC
  • Soft weighting (Jiang, Ngo and Yang, ACM CIVR 2007)

Bag-of-SIFT


SLIDE 9

Soft-weighting in Bag-of-X

  • Soft weighting is used for all three Bag-of-X representations
  – Assign each feature to multiple visual words
  – Weights are determined by feature-to-word similarity

Image source: http://www.cs.joensuu.fi/pages/franti/vq/lkm15.gif

Details in: Jiang, Ngo and Yang, ACM CIVR 2007.

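The soft-assignment step can be sketched as follows. The rank weights 1/2^(i−1) follow the CIVR'07 formulation cited above; the brute-force nearest-neighbor search and Euclidean feature-to-word distance are simplifying assumptions for illustration:

```python
import numpy as np

def soft_weighted_bow(descriptors, codebook, k=4):
    """Soft-weighted bag-of-words histogram: each descriptor votes for
    its k nearest visual words, with the contribution of the i-th
    nearest word down-weighted by 1/2**(i-1)."""
    hist = np.zeros(len(codebook))
    for d in descriptors:
        # distance from this descriptor to every visual word
        dists = np.linalg.norm(codebook - d, axis=1)
        nearest = np.argsort(dists)[:k]
        for rank, w in enumerate(nearest):
            hist[w] += 1.0 / (2 ** rank)
    return hist
```

Compared with hard assignment (each descriptor votes for exactly one word), this spreads each vote over several similar words, which softens quantization error at the codebook boundaries.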

SLIDE 10

Results on Dry-run Validation Set


  • Measured by Average Precision (AP)
  • STIP works best for event detection
  • The 3 features are highly complementary!
  • Should be jointly used for multimedia event detection

Feature          Assembling a shelter   Batting a run in   Making a cake   Mean AP
Visual: STIP     0.468                  0.719              0.476           0.554
Visual: SIFT     0.353                  0.787              0.396           0.512
Audio: MFCC      0.249                  0.692              0.270           0.404
STIP+SIFT        0.508                  0.796              0.476           0.593
STIP+SIFT+MFCC   0.533                  0.873              0.493           0.633
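The per-modality classifiers in the pipeline are χ² SVMs. A minimal sketch of the exponential χ² kernel on bag-of-X histograms (the bandwidth ρ is a free parameter; the deck does not state its value):

```python
import numpy as np

def chi2_kernel(X, Y, rho=1.0):
    """Exponential chi-square kernel between histogram rows of X and Y:
    K(x, y) = exp(-rho * sum_i (x_i - y_i)^2 / (x_i + y_i))."""
    K = np.zeros((len(X), len(Y)))
    for i, x in enumerate(X):
        for j, y in enumerate(Y):
            denom = x + y
            denom[denom == 0] = 1.0  # empty bins contribute 0 anyway
            K[i, j] = np.exp(-rho * np.sum((x - y) ** 2 / denom))
    return K
```

The resulting matrix can be fed to an SVM with a precomputed kernel. The deck does not say how the three modalities are fused; averaging the per-modality kernel matrices before training is one common choice.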

SLIDE 11

Roadmap > temporal matching

[Pipeline diagram repeated; highlighted component: EMD-SVM temporal matching (Run 3)]

SLIDE 12

Temporal Matching With EMD Kernel


  • Earth Mover’s Distance (EMD)
  • EMD Kernel: K(P, Q) = exp(−ρ · EMD(P, Q))

[Diagram: optimal flows between the frames of clips P and Q]

  • Y. Rubner, C. Tomasi, L. J. Guibas, “A metric for distributions with applications to image databases”, ICCV, 1998.
  • D. Xu, S.-F. Chang, “Video event recognition using kernel methods with multi-level temporal alignment”, PAMI, 2008.

Given two frame sets P = {(p1, wp1), ..., (pm, wpm)} and Q = {(q1, wq1), ..., (qn, wqn)}, the EMD is computed as EMD(P, Q) = Σi Σj fij dij / Σi Σj fij, where dij is the χ² visual-feature distance between frames pi and qj, and fij (the weight transferred from pi to qj) is optimized by minimizing the overall transportation workload Σi Σj fij dij.

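The EMD above is a transportation problem, so it can be solved as a small linear program. A sketch with scipy (the frame weights and the ground-distance matrix D are inputs; in the deck's setup D would hold the χ² visual-feature distances between keyframes, and ρ is a free kernel parameter):

```python
import numpy as np
from scipy.optimize import linprog

def emd(wp, wq, D):
    """Earth Mover's Distance between two weighted frame sets, given
    the m x n matrix D of pairwise ground distances. Variables f_ij
    are flattened row-major (index i*n + j)."""
    m, n = D.shape
    c = D.ravel()
    A_ub, b_ub = [], []
    for i in range(m):                       # sum_j f_ij <= wp_i
        row = np.zeros(m * n); row[i*n:(i+1)*n] = 1.0
        A_ub.append(row); b_ub.append(wp[i])
    for j in range(n):                       # sum_i f_ij <= wq_j
        col = np.zeros(m * n); col[j::n] = 1.0
        A_ub.append(col); b_ub.append(wq[j])
    total = min(wp.sum(), wq.sum())          # total flow must be moved
    res = linprog(c, A_ub=np.array(A_ub), b_ub=np.array(b_ub),
                  A_eq=np.ones((1, m * n)), b_eq=[total],
                  bounds=(0, None))
    return res.fun / total

def emd_kernel(wp, wq, D, rho=1.0):
    """EMD kernel as on the slide: K(P, Q) = exp(-rho * EMD(P, Q))."""
    return np.exp(-rho * emd(wp, wq, D))
```

This brute-force LP is fine for a handful of keyframes per clip; dedicated EMD solvers are far faster at scale.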

SLIDE 13

Temporal Matching Results

  • EMD is helpful for two events

– results measured by minimal normalized cost (lower is better)

[Bar chart: per-event Minimal Normalized Cost, r6-baseline vs. r3-base+EMD]


5% gain

SLIDE 14

Roadmap > contextual diffusion

[Pipeline diagram repeated; highlighted component: semantic diffusion with contextual detectors over 21 scene, action, and audio concepts (Runs 5, 4, 2)]

SLIDE 15
  • Events generally occur under particular scene settings, with certain audio sounds!

– Understanding context may be helpful for event detection

[Diagram: event context for "Batting a run in" — scene concepts: grass, baseball field, sky; audio concepts: cheering/clapping, comprehensible speech; action concepts: running, walking]

SLIDE 16
  • 21 concepts are defined and annotated over the MED development set.
  • SVM classifiers for concept detection

– STIP for action concepts, SIFT for scene concepts, and MFCC for audio concepts


Contextual Concepts

Human Action Concepts:
  • Person walking
  • Person running
  • Person squatting
  • Person standing up
  • Person making/assembling stuffs with hands (hands visible)
  • Person batting baseball

Scene Concepts:
  • Indoor kitchen
  • Outdoor with grass/trees visible
  • Baseball field
  • Crowd (a group of 3+ people)
  • Cakes (close-up view)
  • Outdoor rural
  • Outdoor urban

Audio Concepts:
  • Indoor quiet
  • Indoor noisy
  • Original audio
  • Dubbed audio
  • Speech comprehensible
  • Music
  • Cheering
  • Clapping

Jingen Liu, Jiebo Luo & Mubarak Shah, "Recognizing Realistic Actions from Videos 'in the Wild'", CVPR 2009.
Shih-Fu Chang et al., "Columbia University/VIREO-CityU/IRIT TRECVID2008 High-Level Feature Extraction and Interactive Video Search", TRECVID Workshop, 2008.

SLIDE 17


Concept Detection: example result

[Example detections: Baseball field, Cakes (close-up view), Crowd (3+ people), Grass/trees, Indoor kitchen]

SLIDE 18


Contextual Diffusion Model

  • Semantic Diffusion [Jiang, Wang, Chang & Ngo, ICCV 2009]

– Semantic graph: nodes are concepts/events; edges represent concept/event correlations
– Graph diffusion: smooths detection scores w.r.t. the correlations

[Graph diagram: the event "Batting a run in" linked to the concepts Baseball field, Running, and Cheering]

Project page and source code:

http://www.ee.columbia.edu/ln/dvmm/researchProjects/MultimediaIndexing/DASD/dasd.htm

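The released DASD code (linked above) performs a gradient-based energy minimization; the sketch below captures only the core smoothing idea with a simpler iterative averaging update (`alpha` and `iters` are hypothetical parameters, not taken from the paper):

```python
import numpy as np

def diffuse_scores(scores, W, alpha=0.2, iters=10):
    """Graph-diffusion sketch: repeatedly pull each concept/event score
    toward the correlation-weighted average of its neighbors' scores.
    scores: (num_videos, num_concepts); W[i, j]: non-negative
    correlation between concepts i and j."""
    # row-normalize the correlation graph (guard against empty rows)
    Wn = W / np.maximum(W.sum(axis=1, keepdims=True), 1e-12)
    s = scores.copy()
    for _ in range(iters):
        # blend each score with the average score of correlated nodes
        s = (1 - alpha) * s + alpha * s @ Wn.T
    return s
```

On the slide's example, a high "Baseball field" or "Cheering" score would pull the "Batting a run in" score upward through their edges, which is exactly the smoothing effect the diffusion is after.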

SLIDE 19

[Bar chart: per-event Minimal Normalized Cost, r3-baseEMD vs. r2-baseEMDSceAudAct]

Contextual Diffusion Results

  • Context is slightly helpful for two events

– results measured by minimal normalized cost (lower is better)


2-3% gain

SLIDE 20

[Bar chart: per-event Average Precision, baseline vs. context diffusion]

Contextual Diffusion Results

  • … but the improvement is much higher when context is perfect (on a validation set)

− results measured by average precision (higher is better)


SLIDE 21

Roadmap > reranking with event-specific object detector

[Pipeline diagram repeated; highlighted component: "batter" detection and re-ranking (Run 1)]

SLIDE 23

Reranking with Event-Specific Object Detector

  • “Batter” detector is trained by AdaBoost framework

[Pipeline: initial ranking → "batter" detection → re-ranking based on the ratio of detected objects]
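The slide only states that re-ranking is based on the ratio of detected objects; one plausible reading, sketched here with a hypothetical mixing weight, blends the initial classifier score with the fraction of sampled frames in which the "batter" detector fires:

```python
def rerank(video_ids, scores, batter_ratio, weight=0.5):
    """Hypothetical re-ranking sketch: blend the initial event score
    with the per-video ratio of frames containing a detected batter,
    then sort descending. `weight` is an assumed mixing parameter."""
    blended = {v: (1 - weight) * scores[v]
                  + weight * batter_ratio.get(v, 0.0)
               for v in video_ids}
    return sorted(video_ids, key=lambda v: blended[v], reverse=True)
```

Videos with no detected batter keep only their classifier score, so an event-specific detector can promote true positives without discarding the baseline ranking entirely.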

SLIDE 24

Lessons learned

Lessons learned:
1. STIP is powerful for event detection.
2. Combining multiple audio-visual features is very effective!
3. Temporal matching with EMD is useful for some events.
4. Diffusion with contextual concepts is promising and deserves deeper research.

Future work:
1. Explore deep joint audio-visual representations, e.g., Audio-Visual Atoms [Jiang et al., ACM MM 2009].
2. Investigate an adaptive method to find the best components for each event.

SLIDE 25

More information at: http://www.ee.columbia.edu/dvmm/
