Recognizing Complex Events in Internet Videos with Audio-Visual Features


  1. Recognizing Complex Events in Internet Videos with Audio-Visual Features
     Yu-Gang Jiang (yjiang@ee.columbia.edu), in collaboration with Xiaohong Zeng¹, Guangnan Ye¹, Subh Bhattacharya², Dan Ellis¹, Mubarak Shah², Shih-Fu Chang¹, Alexander C. Loui³
     ¹Columbia University, ²University of Central Florida, ³Kodak Research Labs

  2. We take photos/videos every day, everywhere...
     [Photo: Barack Obama Rally, Texas, 2008. http://www.paulridenour.com/Obama14.JPG]

  3. Outline
     • A System for Recognizing Events in Internet Videos
       – Best performance in the TRECVID 2010 Multimedia Event Detection task
       – Features, kernels, context, etc.
     • Internet Consumer Video Analysis
       – A benchmark database
       – An evaluation of human and machine performance

  4. Outline
     • A System for Recognizing Events in Internet Videos
       – Best performance in the TRECVID 2010 Multimedia Event Detection task
       – Features, kernels, context, etc.
     • Internet Consumer Video Analysis
       – A benchmark database
       – An evaluation of human and machine performance

  5. The TRECVID Multimedia Event Detection Task
     • Target: find videos containing an event of interest
     • Data: unconstrained Internet videos
       – 1700+ training videos (~50 positives per event); 1700+ test videos
     [Example frames: "Making a cake", "Assembling a shelter", "Batting a run in"]

  6. The system: 3 major components
     [System diagram] Feature extraction (SIFT, spatial-temporal interest points, MFCC audio) → Classifiers (χ² SVM; EMD-SVM temporal matching; semantic diffusion with 21 contextual scene/action/audio concept detectors) → Event scores ("Making a cake", "Assembling a shelter", "Batting a run in")
     Yu-Gang Jiang, Xiaohong Zeng, Guangnan Ye, S. Bhattacharya, Dan Ellis, Mubarak Shah, Shih-Fu Chang, "Columbia-UCF TRECVID2010 Multimedia Event Detection: Combining Multiple Modalities, Contextual Concepts, and Temporal Matching", in TRECVID 2010.

  7. Best performance in the TRECVID 2010 Multimedia Event Detection (MED) task
     Run1: Run2 + "Batter" reranking
     Run2: Run3 + scene/audio/action context
     Run3: Run6 + EMD temporal matching
     Run4: Run6 + scene/audio/action context
     Run5: Run6 + scene/audio context
     Run6: baseline classification with 3 features
     [Bar chart: mean minimal normalized cost (0.00–1.40, lower is better) for runs r1–r6]

  8. Per-event performance
     [Bar charts: minimal normalized cost (MNC) per run for "Batting a run in", "Assembling a shelter", and "Making a cake"]
     Run1: Run2 + "Batter" reranking
     Run2: Run3 + scene/audio/action context
     Run3: Run6 + EMD temporal matching
     Run4: Run6 + scene/audio/action context
     Run5: Run6 + scene/audio context
     Run6: baseline classification with 3 features

  9. Roadmap > audio-visual features
     [System diagram repeated, highlighting the feature-extraction stage: SIFT, spatial-temporal interest points, MFCC audio]

  10. Three audio-visual features
      • SIFT (visual) – D. Lowe, IJCV 04
      • STIP (visual) – I. Laptev, IJCV 05
      • MFCC (audio) – computed over short (16 ms) audio frames
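      For illustration, a minimal sketch of MFCC extraction with librosa; the file name, 16 kHz sampling rate, and hop size are assumptions (the slide only mentions 16 ms frames):

          import librosa

          # Load the soundtrack; at sr=16000, a 16 ms analysis window is 256 samples.
          y, sr = librosa.load("video_audio.wav", sr=16000)  # hypothetical file name
          mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                                      n_fft=256, hop_length=128)
          # mfcc has shape (13, num_frames); the per-frame vectors are later
          # quantized into a bag-of-MFCC histogram, just like SIFT/STIP words.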

  11. Bag-of-X representation (X = SIFT / STIP / MFCC)
      • Soft weighting (Jiang, Ngo and Yang, ACM CIVR 2007) – see the sketch below
      [Figure: example bag-of-SIFT histogram]
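      A minimal sketch of soft-weighted bag-of-words quantization. The 1/2^rank weighting over the top-k nearest codewords follows a commonly cited variant of the CIVR 2007 scheme; the similarity proxy and all data here are illustrative assumptions:

          import numpy as np

          def soft_bow(descriptors, vocab, k=4):
              # descriptors: (num_points, dim) local features (SIFT / STIP / MFCC)
              # vocab: (num_words, dim) codebook, e.g. from k-means
              hist = np.zeros(len(vocab))
              # distances from every descriptor to every codeword
              dists = np.linalg.norm(descriptors[:, None, :] - vocab[None, :, :], axis=2)
              for drow in dists:
                  nearest = np.argsort(drow)[:k]      # top-k nearest codewords
                  for rank, w in enumerate(nearest):
                      sim = 1.0 / (1.0 + drow[w])     # illustrative similarity proxy
                      hist[w] += sim / (2.0 ** rank)  # rank-th nearest weighted by 1/2^rank
              return hist / max(hist.sum(), 1e-12)    # L1-normalized histogram

          rng = np.random.default_rng(0)
          h = soft_bow(rng.random((200, 128)), rng.random((500, 128)))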

  12. Results of audio-visual features
      • Measured by Average Precision (AP)

                          Assembling a shelter   Batting a run in   Making a cake   Mean AP
        Visual STIP              0.468                0.719             0.476        0.554
        Visual SIFT              0.353                0.787             0.396        0.512
        Audio MFCC               0.249                0.692             0.270        0.404
        STIP+SIFT                0.508                0.796             0.476        0.593
        STIP+SIFT+MFCC           0.533                0.873             0.493        0.633

      • STIP works the best for event detection
      • The 3 features are highly complementary!
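      The slide reports combined results but does not spell out the fusion rule; a minimal late-fusion sketch, assuming simple averaging of per-feature classifier scores (the data here is a random placeholder):

          import numpy as np

          rng = np.random.default_rng(0)
          # Placeholder detection scores from the three per-feature classifiers.
          scores = {"stip": rng.random(50), "sift": rng.random(50), "mfcc": rng.random(50)}

          def minmax(s):
              # rescale to [0, 1] so modalities are comparable before averaging
              return (s - s.min()) / (s.max() - s.min() + 1e-12)

          fused = sum(minmax(s) for s in scores.values()) / len(scores)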

  13. Roadmap > temporal matching
      [System diagram repeated, highlighting EMD temporal matching in the classifier stage]

  14. Temporal matching with EMD kernel
      • Earth Mover's Distance (EMD) between two videos P and Q, each a set of clips over time
      [Diagram: clips of P and Q along the time axis, with pairwise distances]
      Given two clip sets P = {(p_1, w_p1), ..., (p_m, w_pm)} and Q = {(q_1, w_q1), ..., (q_n, w_qn)}, the EMD is computed as

          EMD(P, Q) = Σ_i Σ_j f_ij d_ij / Σ_i Σ_j f_ij

      where d_ij is the χ² visual-feature distance between video clips p_i and q_j, and the flow f_ij (weight transferred from p_i to q_j) is optimized by minimizing the overall transportation workload Σ_i Σ_j f_ij d_ij.
      • EMD kernel: K(P, Q) = exp(-ρ · EMD(P, Q))
      Y. Rubner, C. Tomasi, L. J. Guibas, "A metric for distributions with applications to image databases", ICCV, 1998.
      D. Xu, S.-F. Chang, "Video event recognition using kernel methods with multi-level temporal alignment", PAMI, 2008.
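      A minimal sketch of the EMD above as a linear program (using scipy); the clip weights and distance matrix would come from the χ² clip distances, and all names and data here are illustrative:

          import numpy as np
          from scipy.optimize import linprog

          def emd(d, wp, wq):
              # d: (m, n) ground distances (here: chi^2 distances between clips)
              # wp, wq: clip weights of the two videos
              m, n = d.shape
              c = d.ravel()                      # objective: sum_ij f_ij * d_ij
              A_ub = np.zeros((m + n, m * n))
              for i in range(m):                 # outflow of p_i cannot exceed w_p_i
                  A_ub[i, i * n:(i + 1) * n] = 1.0
              for j in range(n):                 # inflow to q_j cannot exceed w_q_j
                  A_ub[m + j, j::n] = 1.0
              b_ub = np.concatenate([wp, wq])
              A_eq = np.ones((1, m * n))         # total flow = min(sum wp, sum wq)
              b_eq = [min(wp.sum(), wq.sum())]
              res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                            bounds=(0, None), method="highs")
              flow = res.x
              return float(flow @ c) / flow.sum()

          def emd_kernel(d, wp, wq, rho=1.0):
              return np.exp(-rho * emd(d, wp, wq))   # K(P,Q) = exp(-rho * EMD(P,Q))

          # Example: videos with 3 and 2 clips, uniform clip weights.
          rng = np.random.default_rng(0)
          K = emd_kernel(rng.random((3, 2)), np.ones(3) / 3, np.ones(2) / 2)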

  15. Temporal matching results
      • EMD is helpful for two events (~5% gain)
        – results measured by minimal normalized cost (lower is better)
      [Bar chart: MNC per event, r6-baseline vs. r3-base+EMD]

  16. Roadmap > contextual diffusion
      [System diagram repeated, highlighting semantic diffusion with the 21 scene/action/audio concepts]

  17. Event context
      • Events generally occur under particular scene settings with certain audio sounds!
        – Understanding contexts may be helpful for event detection
      [Diagram: "Batting a run in" linked to action concepts (running, walking), scene concepts (baseball field, grass, sky), and audio concepts (speech comprehensible, cheering/clapping)]

  18. Contextual concepts
      • 21 concepts are defined and annotated over the TRECVID MED development set.
        Human action concepts: person walking; person running; person squatting; person standing up; person making/assembling stuff with hands (hands visible); person batting baseball
        Scene concepts: indoor kitchen; outdoor with grass/trees visible; baseball field; crowd (a group of 3+ people); cakes (close-up view)
        Audio concepts: outdoor rural; outdoor urban; indoor quiet; indoor noisy; original audio; dubbed audio; speech comprehensible; music; cheering; clapping
      • SVM classifiers for concept detection
        – STIP features for action concepts, SIFT for scene concepts, and MFCC for audio concepts
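      A minimal sketch of one such concept detector, assuming an SVM over a precomputed χ² kernel on bag-of-feature histograms (the data, gamma, and C values are placeholders, not from the slides):

          import numpy as np
          from sklearn.metrics.pairwise import chi2_kernel
          from sklearn.svm import SVC

          rng = np.random.default_rng(0)
          X_train = rng.random((100, 500))    # bag-of-SIFT histograms (placeholder)
          y_train = rng.integers(0, 2, 100)   # labels for one concept, e.g. "baseball field"
          X_test = rng.random((20, 500))

          # chi2_kernel computes exp(-gamma * chi^2(x, y)); train an SVM on it.
          K_train = chi2_kernel(X_train, gamma=0.5)
          clf = SVC(kernel="precomputed", C=10.0).fit(K_train, y_train)

          K_test = chi2_kernel(X_test, X_train, gamma=0.5)
          scores = clf.decision_function(K_test)   # per-video concept detection scores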

  19. Concept detection: example results
      [Example frames for: baseball field, cakes (close-up view), crowd (3+ people), grass/trees, indoor kitchen]

  20. Contextual diffusion model
      • Semantic diffusion [Y.-G. Jiang, J. Wang, S.-F. Chang & C.-W. Ngo, ICCV 2009]
        – Semantic graph: nodes are concepts/events; edges represent concept/event correlation
        – Graph diffusion: smooth detection scores w.r.t. the correlation
      [Diagram: "Batting a run in" (0.5) connected to "Baseball field" (0.9), "Running" (0.8), and "Cheering" (0.7)]
      Project page and source code: http://www.ee.columbia.edu/ln/dvmm/researchProjects/MultimediaIndexing/DASD/dasd.htm
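      A simplified score-smoothing sketch in the spirit of graph diffusion; this is not the exact DASD update from the ICCV 2009 paper, and alpha, the iteration count, and the correlation matrix are assumptions:

          import numpy as np

          def diffuse_scores(scores, W, alpha=0.2, iters=10):
              # scores: (num_videos, num_concepts) raw detector outputs
              # W: (num_concepts, num_concepts) nonnegative concept-correlation matrix
              W = W / (W.sum(axis=1, keepdims=True) + 1e-12)   # row-normalize the graph
              s = scores.copy()
              for _ in range(iters):
                  # blend each score with correlation-weighted scores of related concepts
                  s = (1 - alpha) * s + alpha * s @ W.T
              return s

          rng = np.random.default_rng(0)
          smoothed = diffuse_scores(rng.random((10, 22)), rng.random((22, 22)))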

  21. Contextual diffusion results
      • Context is slightly helpful for two events (~3% gain)
        – results measured by minimal normalized cost (lower is better)
      [Bar chart: MNC per event, r3-base+EMD vs. r2-base+EMD+scene/audio/action context]

  22. Outline
      • A System for Recognizing Events in Internet Videos
        – Best performance in the TRECVID 2010 Multimedia Event Detection task
        – Features, kernels, context, etc.
      • Internet Consumer Video Analysis
        – A benchmark database
        – An evaluation of human and machine performance
      Yu-Gang Jiang, Guangnan Ye, Shih-Fu Chang, Daniel Ellis, Alexander C. Loui, "Consumer Video Understanding: A Benchmark Database and An Evaluation of Human and Machine Performance", in ACM ICMR 2011.

  23. What are consumer videos?
      • Original unedited videos captured by ordinary consumers
        – Interesting and very diverse content
        – Very weakly indexed: on average, 3 tags per consumer video on YouTube, vs. 9 tags for a typical YouTube video
        – Original audio tracks are preserved; good for joint audio-visual analysis

  24. Columbia Consumer Video (CCV) Database – 20 categories:
      Basketball, Baseball, Soccer, Ice Skating, Skiing, Swimming, Biking, Cat, Dog, Bird, Graduation, Birthday Celebration, Wedding Reception, Wedding Ceremony, Wedding Dance, Music Performance, Non-music Performance, Parade, Beach, Playground
