SLIDE 1

Recognizing Complex Events in Internet Videos with Audio-Visual Features

Yu-Gang Jiang

yjiang@ee.columbia.edu

In collaboration with Xiaohong Zeng¹, Guangnan Ye¹, Subh Bhattacharya², Dan Ellis¹, Mubarak Shah², Shih-Fu Chang¹, Alexander C. Loui³
¹Columbia University  ²University of Central Florida  ³Kodak Research Labs

SLIDE 2

We take photos/videos every day, everywhere...

Barack Obama Rally, Texas, 2008. http://www.paulridenour.com/Obama14.JPG

SLIDE 3

Outline

  • A System for Recognizing Events in Internet Videos
    – Best performance in TRECVID 2010 Multimedia Event Detection Task
    – Features, Kernels, Context, etc.
  • Internet Consumer Video Analysis
    – A Benchmark Database
    – An Evaluation of Human & Machine Performance

SLIDE 4

Outline

  • A System for Recognizing Events in Internet Videos
    – Best performance in TRECVID 2010 Multimedia Event Detection Task
    – Features, Kernels, Context, etc.
  • Internet Consumer Video Analysis
    – A Benchmark Database
    – An Evaluation of Human & Machine Performance

SLIDE 5

The TRECVID Multimedia Event Detection Task

[Example videos: Batting a run in, Assembling a shelter, Making a cake]

  • Target: Find videos containing an event of interest
  • Data: unconstrained Internet videos

– 1700+ training videos (~50 positives per event); 1700+ test videos

SLIDE 6

The system: 3 major components

[System diagram: (1) feature extraction (SIFT, spatial-temporal interest points, MFCC audio features, 21 scene/action/audio concepts); (2) χ² SVM and EMD-SVM classifiers; (3) semantic diffusion with contextual detectors; outputs: Batting a run in, Assembling a shelter, Making a cake]

Yu-Gang Jiang, Xiaohong Zeng, Guangnan Ye, S. Bhattacharya, Dan Ellis, Mubarak Shah, Shih-Fu Chang, Columbia-UCF TRECVID2010 Multimedia Event Detection: Combining Multiple Modalities, Contextual Concepts, and Temporal Matching, in TRECVID 2010.

SLIDE 7

Best performance in the TRECVID 2010 Multimedia Event Detection (MED) task

[Bar chart: mean minimal normalized cost (lower is better) for runs r1-r6]

Run 1: Run 2 + "Batter" Reranking
Run 2: Run 3 + Scene/Audio/Action Context
Run 3: Run 6 + EMD Temporal Matching
Run 4: Run 6 + Scene/Audio/Action Context
Run 5: Run 6 + Scene/Audio Context
Run 6: Baseline Classification with 3 features
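For readers unfamiliar with the metric, the normalized detection cost behind these bars generally takes the form below in TRECVID-style evaluations; the cost weights C_Miss, C_FA and the target prior P_T are task-defined constants not shown on the slide, so treat this as a hedged reconstruction of the general formula rather than the exact MED 2010 definition.

```latex
% General form of the normalized detection cost; the minimal cost is its minimum over thresholds.
% C_{Miss}, C_{FA}, P_T are task-defined constants (assumed form, not taken from the slide).
\mathrm{NDC}(\theta) =
  \frac{C_{Miss}\, P_{Miss}(\theta)\, P_T + C_{FA}\, P_{FA}(\theta)\, (1 - P_T)}
       {\min\bigl(C_{Miss}\, P_T,\; C_{FA}\, (1 - P_T)\bigr)},
\qquad
\mathrm{MNC} = \min_{\theta} \mathrm{NDC}(\theta)
```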

SLIDE 8

Per-event performance

[Bar charts: minimal normalized cost (MNC) per run for each event: Batting a run in, Assembling a shelter, Making a cake]

Run 1: Run 2 + "Batter" Reranking
Run 2: Run 3 + Scene/Audio/Action Context
Run 3: Run 6 + EMD Temporal Matching
Run 4: Run 6 + Scene/Audio/Action Context
Run 5: Run 6 + Scene/Audio Context
Run 6: Baseline Classification with 3 features

SLIDE 9

Roadmap > audio-visual features

[System diagram repeated; focus: the three audio-visual features in the feature extraction stage (SIFT, spatial-temporal interest points, MFCC audio features, 21 scene/action/audio concepts)]

SLIDE 10

Three audio-visual features…

  • SIFT (visual)
    – D. Lowe, IJCV 04.
  • STIP (visual)
    – I. Laptev, IJCV 05.
  • MFCC (audio)
    – 16 ms audio frames
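As a rough illustration of the audio channel only, the sketch below extracts frame-level MFCCs with the librosa library. The ~16 ms hop follows the slide; the sample rate, number of coefficients, and the file name are illustrative assumptions, not the authors' exact settings.

```python
import librosa
import numpy as np

def extract_mfcc(audio_path, sr=22050, n_mfcc=13, hop_ms=16):
    """Return an (n_frames, n_mfcc) matrix of MFCCs for one video's audio track."""
    y, sr = librosa.load(audio_path, sr=sr)               # decode to mono at the target rate
    hop = int(sr * hop_ms / 1000)                          # ~16 ms hop, as suggested on the slide
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc, hop_length=hop)
    return mfcc.T                                          # one row per audio frame

# hypothetical file name; in practice the audio is demuxed from each video first
frames = extract_mfcc("some_video_audio.wav")
```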

SLIDE 11

Bag-of-X representation

  • X = SIFT / STIP / MFCC
  • Soft weighting (Jiang, Ngo and Yang, ACM CIVR 2007)

[Illustration: Bag-of-SIFT histogram]
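A minimal sketch of such a soft-weighted bag-of-X histogram, assuming a pre-trained codebook (e.g. k-means centers): each descriptor votes for its k nearest codewords with geometrically decaying weights, which is one common reading of the cited soft-weighting scheme; the codebook size, k, and the decay are illustrative.

```python
import numpy as np

def soft_bow(descriptors, codebook, k=4):
    """Soft-weighted bag-of-words histogram.
    descriptors: (n, d) local features (e.g. SIFT); codebook: (K, d) visual words."""
    K = codebook.shape[0]
    hist = np.zeros(K)
    # squared Euclidean distances between every descriptor and every codeword
    d2 = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    nearest = np.argsort(d2, axis=1)[:, :k]        # k nearest codewords per descriptor
    for rank in range(k):
        weight = 1.0 / (2 ** rank)                 # geometrically decaying vote (assumed decay)
        np.add.at(hist, nearest[:, rank], weight)
    return hist / max(hist.sum(), 1e-12)           # L1-normalize the histogram

# usage: hist = soft_bow(sift_descriptors, kmeans_centers, k=4)
```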

SLIDE 12

Results of audio-visual features


  • Measured by Average Precision (AP)
  • STIP works the best for event detection
  • The 3 features are highly complementary!

Feature           Assembling a shelter   Batting a run in   Making a cake   Mean AP
Visual: STIP      0.468                  0.719              0.476           0.554
Visual: SIFT      0.353                  0.787              0.396           0.512
Audio: MFCC       0.249                  0.692              0.270           0.404
STIP+SIFT         0.508                  0.796              0.476           0.593
STIP+SIFT+MFCC    0.533                  0.873              0.493           0.633
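For reference, the (non-interpolated) average precision for one event can be computed from ranked detection scores as below; the exact evaluation script may differ in minor details.

```python
import numpy as np

def average_precision(scores, labels):
    """Non-interpolated AP for one event: scores (n,), labels (n,) with 1 = positive video."""
    order = np.argsort(-np.asarray(scores, dtype=float))   # rank test videos by detection score
    labels = np.asarray(labels)[order]
    hits = np.cumsum(labels)
    precision_at_k = hits / (np.arange(len(labels)) + 1.0)
    return float((precision_at_k * labels).sum() / max(labels.sum(), 1))

# e.g. average_precision([0.9, 0.2, 0.75], [1, 0, 1]) == 1.0 (both positives ranked on top)
```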

SLIDE 13

Roadmap > temporal matching

[System diagram repeated; focus: temporal matching with the EMD-SVM, alongside the χ² SVM classifiers]

SLIDE 14

Temporal matching with EMD kernel


  • Earth Mover’s Distance (EMD)
  • EMD kernel: K(P, Q) = exp(-ρ · EMD(P, Q))
  • Y. Rubner, C. Tomasi, L. J. Guibas, “A metric for distributions with applications to image databases”, ICCV, 1998.
  • D. Xu, S.-F. Chang, “Video event recognition using kernel methods with multi-level temporal alignment”, PAMI, 2008.

Given two clip sets P = {(p1, wp1), ..., (pm, wpm)} and Q = {(q1, wq1), ..., (qn, wqn)}, the EMD is computed as EMD(P, Q) = Σi Σj fij dij / Σi Σj fij, where dij is the χ² visual-feature distance between video clips pi and qj, and fij (the weight transferred from pi to qj) is optimized by minimizing the overall transportation workload Σi Σj fij dij.

[Illustration: the clips of videos P and Q along the time axis, with pairwise distances and flows between matched clips]
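A rough sketch of this computation, solving the transportation problem with SciPy's linear-programming routine over a χ² cost matrix between per-clip features. The clip weights, the temporal segmentation, and the χ² distance on raw histograms are placeholders; the multi-level temporal alignment of Xu & Chang (PAMI 2008) adds structure not shown here.

```python
import numpy as np
from scipy.optimize import linprog

def chi2_dist(x, y, eps=1e-10):
    """Chi-squared distance between two non-negative feature histograms."""
    return 0.5 * np.sum((x - y) ** 2 / (x + y + eps))

def emd(P_feats, P_w, Q_feats, Q_w):
    """EMD(P, Q) = sum_ij f_ij d_ij / sum_ij f_ij, with the flow f solving a transportation LP."""
    m, n = len(P_feats), len(Q_feats)
    D = np.array([[chi2_dist(p, q) for q in Q_feats] for p in P_feats])   # cost matrix d_ij
    A_ub, b_ub = [], []
    for i in range(m):                              # flow out of clip p_i bounded by its weight
        row = np.zeros(m * n); row[i * n:(i + 1) * n] = 1
        A_ub.append(row); b_ub.append(P_w[i])
    for j in range(n):                              # flow into clip q_j bounded by its weight
        col = np.zeros(m * n); col[j::n] = 1
        A_ub.append(col); b_ub.append(Q_w[j])
    total = min(np.sum(P_w), np.sum(Q_w))           # all transferable weight must be moved
    res = linprog(D.ravel(), A_ub=A_ub, b_ub=b_ub,
                  A_eq=[np.ones(m * n)], b_eq=[total], bounds=(0, None))
    f = res.x
    return float(np.dot(f, D.ravel()) / f.sum())

# kernel between two videos, as on the slide: K = exp(-rho * emd(P_feats, P_w, Q_feats, Q_w))
```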

SLIDE 15

Temporal matching results

  • EMD is helpful for two of the three events

– results measured by minimal normalized cost (lower is better)

[Bar chart: minimal normalized cost per event, r6 baseline vs. r3 baseline + EMD; roughly a 5% gain]

SLIDE 16

Roadmap > contextual diffusion

[System diagram repeated; focus: semantic diffusion with contextual detectors, applied after the χ² SVM / EMD-SVM classification of the audio-visual features]

SLIDE 17
  • Events generally occur under particular scene settings with certain audio sounds!
    – Understanding contexts may be helpful for event detection

Event context

[Illustration for "Batting a run in": scene concepts (grass, baseball field, sky), audio concepts (cheering/clapping, comprehensible speech), action concepts (running, walking)]

SLIDE 18
Contextual concepts

  • 21 concepts are defined and annotated over the TRECVID MED development set
  • SVM classifiers for concept detection
    – STIP for action concepts, SIFT for scene concepts, and MFCC for audio concepts

Human Action Concepts:
  • Person walking
  • Person running
  • Person squatting
  • Person standing up
  • Person making/assembling stuffs with hands (hands visible)
  • Person batting baseball

Scene Concepts:
  • Indoor kitchen
  • Outdoor with grass/trees visible
  • Baseball field
  • Crowd (a group of 3+ people)
  • Cakes (close-up view)
  • Outdoor rural
  • Outdoor urban

Audio Concepts:
  • Indoor quiet
  • Indoor noisy
  • Original audio
  • Dubbed audio
  • Speech comprehensible
  • Music
  • Cheering
  • Clapping
SLIDE 19

Concept detection: example results

[Example frames of detected concepts: Baseball field, Cakes (close-up view), Crowd (3+ people), Grass/trees, Indoor kitchen]

SLIDE 20

Contextual diffusion model

  • Semantic diffusion [Y.-G. Jiang, J. Wang, S.-F. Chang & C.-W. Ngo, ICCV 2009]
    – Semantic graph
      • Nodes are concepts/events
      • Edges represent concept/event correlation
    – Graph diffusion
      • Smooth detection scores w.r.t. the correlation

[Graph illustration: the event "Batting a run in" linked to the concepts "Baseball field", "Running", and "Cheering", with example scores 0.9, 0.8, 0.7, 0.5]

Project page and source code:
http://www.ee.columbia.edu/ln/dvmm/researchProjects/MultimediaIndexing/DASD/dasd.htm
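As an illustration of the idea only (not the exact DASD update from the ICCV 2009 paper), a generic graph-diffusion smoothing of detection scores over a concept/event correlation graph might look like this; the correlation matrix W, the damping factor, and the iteration count are assumptions.

```python
import numpy as np

def diffuse_scores(scores, W, alpha=0.3, n_iter=20):
    """Smooth one video's concept/event detection scores over a correlation graph.
    scores: (C,) raw detector outputs; W: (C, C) non-negative concept correlation matrix."""
    W = W / np.maximum(W.sum(axis=1, keepdims=True), 1e-12)   # row-normalize the edge weights
    s0 = np.asarray(scores, dtype=float)
    s = s0.copy()
    for _ in range(n_iter):
        # pull each score toward its correlated neighbors while staying anchored to the original
        s = (1 - alpha) * s0 + alpha * W.dot(s)
    return s

# e.g. a confident "Baseball field" detection raises a correlated "Batting a run in" score
```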

SLIDE 21

Contextual diffusion results

  • Context is slightly helpful for two events
    – Results measured by minimal normalized cost (lower is better)

[Bar chart: minimal normalized cost per event, r3 (baseline + EMD) vs. r2 (baseline + EMD + scene/audio/action context); roughly a 3% gain]

SLIDE 22

Outline

  • A System for Recognizing Events in Internet Videos
    – Best performance in TRECVID 2010 Multimedia Event Detection Task
    – Features, Kernels, Context, etc.
  • Internet Consumer Video Analysis
    – A Benchmark Database
    – An Evaluation of Human & Machine Performance


Yu-Gang Jiang, Guangnan Ye, Shih-Fu Chang, Daniel Ellis, Alexander C. Loui, Consumer Video Understanding: A Benchmark Database and An Evaluation of Human and Machine Performance, in ACM ICMR 2011.

SLIDE 23
What are Consumer Videos?

  • Original unedited videos captured by ordinary consumers
  • Interesting and very diverse contents
  • Very weakly indexed
    – On average, 3 tags per consumer video vs. 9 tags per YouTube video overall
  • Original audio tracks are preserved; good for joint audio-visual analysis

SLIDE 24

Columbia Consumer Video (CCV) Database

[20 categories: Basketball, Baseball, Soccer, Ice Skating, Skiing, Swimming, Biking, Cat, Dog, Bird, Graduation, Birthday Celebration, Wedding Reception, Wedding Ceremony, Wedding Dance, Music Performance, Non-music Performance, Parade, Beach, Playground]

SLIDE 25

CCV Snapshot

  • # videos: 9,317 (210 hrs in total)
  • Video genre: unedited consumer videos
  • Video source: YouTube.com
  • Average length: 80 seconds
  • # defined categories: 20
  • Annotation method: Amazon Mechanical Turk

[Bar chart: number of videos per category (roughly 100 to 800 each), listed from music perf., non-music perf., dog, swimming, skiing, parade, cat, ice skating, beach, basketball, wedding dance, bird, playground, soccer, birthday, baseball, graduation, biking, wedding reception, to wedding ceremony]

The trick to digging out consumer videos from YouTube: include the default filename prefix of many digital cameras in the query, e.g. "MVI" combined with "parade".

SLIDE 26
Existing Databases?

  • Human Action Recognition
    – KTH & Weizmann (constrained environment), 2004-05
    – Hollywood Database (12 categories, movies), 2008
    – UCF Database (50 categories, YouTube videos), 2010
  • Kodak Consumer Video (25 classes, 1300+ videos), 2007
  • LabelMe Video (many classes, 1300+ videos), 2009
  • TRECVID MED 2010 (3 classes, 3400+ videos), 2010

CCV Database vs. the above: unconstrained YouTube videos, higher-level complex events, more videos & better defined categories, more videos & larger content variations, more videos & categories

SLIDE 27

Crowdsourcing: Amazon Mechanical Turk

  • A web services API that allows developers to easily integrate human intelligence directly into their processing

[Illustration: a requester posts a task ("Is this a 'parade' video? Yes / No") to an Internet-scale workforce in exchange for financial rewards]

SLIDE 28

MTurk: Annotation Interface

[Screenshot of the annotation interface; reward shown: $0.02]

Reliability of labels: each video was assigned to four MTurk workers

SLIDE 29

Human Recognition Performance

  • How to measure human (MTurk worker) recognition accuracy?
    – We manually and carefully labeled 896 videos (golden ground truth!)
    – Consolidation of the 4 sets of labels

[Chart: precision and recall of the consolidated MTurk labels at vote thresholds of 1 to 4 votes]
Plus additional manual filtering of 6 positive sample sets: 94% final precision
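A minimal sketch of the vote-threshold consolidation being evaluated in this chart: each video carries four binary MTurk labels, the consolidated label requires at least t positive votes, and precision/recall are measured against the manually verified ground truth. Variable names and the threshold loop are illustrative, not the authors' code.

```python
import numpy as np

def consolidate(worker_labels, min_votes):
    """worker_labels: (n_videos, 4) binary labels from the four assigned MTurk workers."""
    return (worker_labels.sum(axis=1) >= min_votes).astype(int)

def precision_recall(pred, truth):
    tp = np.sum((pred == 1) & (truth == 1))
    fp = np.sum((pred == 1) & (truth == 0))
    fn = np.sum((pred == 0) & (truth == 1))
    return tp / max(tp + fp, 1), tp / max(tp + fn, 1)

# evaluate each vote threshold against the 896 manually labeled ("golden") videos
# for t in (1, 2, 3, 4):
#     print(t, precision_recall(consolidate(mturk_labels, t), golden_labels))
```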

SLIDE 30

Human Recognition Performance (cont.)

[Chart: precision and recall per MTurk worker, workers sorted by number of submitted HITs (from 1 up to 770)]

SLIDE 31

Machine Recognition System

[Pipeline: feature extraction (SIFT, spatial-temporal interest points, MFCC audio features) → χ² kernel SVM classifiers → average late fusion]

Yu-Gang Jiang, Xiaohong Zeng, Guangnan Ye, Subh Bhattacharya, Dan Ellis, Mubarak Shah, Shih-Fu Chang, Columbia-UCF TRECVID2010 Multimedia Event Detection: Combining Multiple Modalities, Contextual Concepts, and Temporal Matching, NIST TRECVID Workshop, 2010.
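A hedged sketch of this classification stage using scikit-learn: one χ²-kernel SVM per feature channel (bag-of-SIFT, bag-of-STIP, bag-of-MFCC), with the per-channel probability scores averaged for late fusion. The kernel bandwidth, C, and the probability calibration are illustrative choices, not the reported settings.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import chi2_kernel

def train_channel(X_train, y_train, gamma=1.0, C=10.0):
    """One chi-squared-kernel SVM for a single bag-of-words feature channel."""
    K = chi2_kernel(X_train, gamma=gamma)                # precomputed exp(-gamma * chi2) kernel
    clf = SVC(kernel="precomputed", C=C, probability=True).fit(K, y_train)
    return clf, X_train, gamma

def score_channel(model, X_test):
    clf, X_train, gamma = model
    K = chi2_kernel(X_test, X_train, gamma=gamma)        # kernel between test and training videos
    return clf.predict_proba(K)[:, 1]

def fuse(models, test_feats):
    """Average late fusion of the per-channel scores (e.g. SIFT, STIP, MFCC bags-of-words)."""
    return np.mean([score_channel(m, X) for m, X in zip(models, test_feats)], axis=0)
```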

SLIDE 32

Machine Recognition Accuracy

  • Measured by average precision
  • SIFT works the best for event detection
  • The 3 features are highly complementary!

[Bar chart: per-category average precision for Prior, MFCC, STIP, SIFT, SIFT+STIP, and SIFT+STIP+MFCC]

SLIDE 33

Human vs. Machine

  • Human has much better recall, and is much better for non-rigid objects
  • Machine is close to human on top-list precision

[Charts: per-category precision of machine vs. human at 90% recall and at 59% recall]

SLIDE 34

Human vs. Machine: Result Examples

[Example frames for "wedding dance", "soccer", and "cat": true positives found by both human & machine, by human only, or by machine only; false positives found by machine only or by human only (n/a where none)]

SLIDE 35

Summary

  • The combination of the three audio-visual features is key for good video event recognition performance
  • Temporal matching is useful for some complex events
  • Current automatic event recognition methods are not that bad
  • A new dataset (CCV) for consumer video analysis

SLIDE 36

Dataset download

  • Unique YouTube Video IDs
  • Labels
  • Training/Test Partition
  • Three Audio/Visual Features

http://www.ee.columbia.edu/dvmm/CCV/
Fill out this …

SLIDE 37

email: yjiang@ee.columbia.edu
