

SLIDE 1

Columbia-UCF MED2010: Combining Multiple Modalities, Contextual Concepts, and Temporal Matching

Yu-Gang Jiang1, Xiaohong Zeng1, Guangnan Ye1, Subh Bhattacharya2, Dan Ellis1, Mubarak Shah2, Shih-Fu Chang1

1 Department of EE, Columbia University 2 Department of EECS, University of Central Florida

TRECVID 2010 workshop, NIST, Gaithersburg, MD

SLIDE 2

The target…

Batting a run in · Assembling a shelter · Making a cake

SLIDE 3

Overview: 4 major components & 6 runs

[Pipeline diagram] Feature extraction (SIFT, spatio-temporal interest points, MFCC audio features) → χ² SVM classifiers (Run 6) and EMD-SVM temporal matching (Run 3) → semantic diffusion with contextual detectors over 21 scene, action, and audio concepts (Runs 5, 4, 2) → "batter" detection and re-ranking (Run 1) → event scores for Batting a run in, Assembling a shelter, and Making a cake.

SLIDE 4

Overview: overall performance

[Bar chart: Mean Minimal Normalized Cost for runs r1–r6; lower is better]

Run 1: Run 2 + "Batter" re-ranking
Run 2: Run 3 + scene/audio/action context
Run 3: Run 6 + EMD temporal matching
Run 4: Run 6 + scene/audio/action context
Run 5: Run 6 + scene/audio context
Run 6: baseline classification with 3 features
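Runs are scored by Minimal Normalized Detection Cost. The deck does not restate the formula, but NIST's normalized detection cost combines the miss and false-alarm rates; a minimal sketch, assuming the cost parameters commonly cited for TRECVID MED (C_miss = 80, C_FA = 1, P_target = 0.001 — an assumption, not stated on the slide):

```python
def normalized_detection_cost(p_miss, p_fa,
                              c_miss=80.0, c_fa=1.0, p_target=0.001):
    """Normalized Detection Cost (NDC). The default cost parameters are
    the commonly cited TRECVID MED settings, assumed here."""
    cost = c_miss * p_miss * p_target + c_fa * p_fa * (1.0 - p_target)
    norm = min(c_miss * p_target, c_fa * (1.0 - p_target))
    return cost / norm

# A perfect detector has NDC = 0; the trivial "reject everything"
# detector (p_miss = 1, p_fa = 0) has NDC = 1.
```

With these assumed parameters the metric is proportional to P_miss + 12.49·P_FA, which is why a run can only beat the trivial detector by keeping both error rates low.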

SLIDE 5

Overview: per-event performance

[Three bar charts: per-event Minimal Normalized Cost (MNC) for runs 1–6 — Batting a run in, Assembling a shelter, Making a cake]

Run 1: Run 2 + "Batter" re-ranking
Run 2: Run 3 + scene/audio/action context
Run 3: Run 6 + EMD temporal matching
Run 4: Run 6 + scene/audio/action context
Run 5: Run 6 + scene/audio context
Run 6: baseline classification with 3 features

SLIDE 6

Roadmap > multiple modalities

[Pipeline diagram repeated; highlighted component: multiple feature modalities — SIFT, spatio-temporal interest points, and MFCC audio features]

SLIDE 7

Three Feature Modalities…


  • SIFT (visual)

– D. Lowe, IJCV 04.

  • STIP (visual)

– I. Laptev, IJCV 05.

  • MFCC (audio)

– computed over 16 ms audio frames

SLIDE 8

Bag-of-X Representation

  • X = SIFT or STIP or MFCC
  • Soft weighting (Jiang, Ngo and Yang, ACM CIVR 2007)

Bag-of-SIFT


SLIDE 9

Soft-weighting in Bag-of-X

  • Soft weighting is used for all three Bag-of-X representations
  – Assign each feature to multiple visual words
  – Weights are determined by feature-to-word similarity

Image source: http://www.cs.joensuu.fi/pages/franti/vq/lkm15.gif

Details in: Jiang, Ngo and Yang, ACM CIVR 2007.

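The soft-assignment step can be sketched as follows. The rank weights 1/2^(i−1) follow the CIVR'07 formulation cited above; the brute-force nearest-neighbor search and Euclidean feature-to-word distance are simplifying assumptions for illustration:

```python
import numpy as np

def soft_weighted_bow(descriptors, codebook, k=4):
    """Soft-weighted bag-of-words histogram: each descriptor votes for
    its k nearest visual words, with the contribution of the i-th
    nearest word down-weighted by 1/2**(i-1)."""
    hist = np.zeros(len(codebook))
    for d in descriptors:
        # distance from this descriptor to every visual word
        dists = np.linalg.norm(codebook - d, axis=1)
        nearest = np.argsort(dists)[:k]
        for rank, w in enumerate(nearest):
            hist[w] += 1.0 / (2 ** rank)
    return hist
```

Compared with hard assignment (each descriptor votes for exactly one word), this spreads each vote over several similar words, which softens quantization error at the codebook boundaries.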

SLIDE 10

Results on Dry-run Validation Set


  • Measured by Average Precision (AP)
  • STIP works best for event detection
  • The 3 features are highly complementary!
  • Should be jointly used for multimedia event detection

Feature          Assembling a shelter   Batting a run in   Making a cake   Mean AP
Visual: STIP     0.468                  0.719              0.476           0.554
Visual: SIFT     0.353                  0.787              0.396           0.512
Audio: MFCC      0.249                  0.692              0.270           0.404
STIP+SIFT        0.508                  0.796              0.476           0.593
STIP+SIFT+MFCC   0.533                  0.873              0.493           0.633
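The per-modality classifiers in the pipeline are χ² SVMs. A minimal sketch of the exponential χ² kernel on bag-of-X histograms (the bandwidth ρ is a free parameter; the deck does not state its value):

```python
import numpy as np

def chi2_kernel(X, Y, rho=1.0):
    """Exponential chi-square kernel between histogram rows of X and Y:
    K(x, y) = exp(-rho * sum_i (x_i - y_i)^2 / (x_i + y_i))."""
    K = np.zeros((len(X), len(Y)))
    for i, x in enumerate(X):
        for j, y in enumerate(Y):
            denom = x + y
            denom[denom == 0] = 1.0  # empty bins contribute 0 anyway
            K[i, j] = np.exp(-rho * np.sum((x - y) ** 2 / denom))
    return K
```

The resulting matrix can be fed to an SVM with a precomputed kernel. The deck does not say how the three modalities are fused; averaging the per-modality kernel matrices before training is one common choice.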

SLIDE 11

Roadmap > temporal matching

[Pipeline diagram repeated; highlighted component: EMD-SVM temporal matching (Run 3)]

SLIDE 12

Temporal Matching With EMD Kernel


  • Earth Mover’s Distance (EMD)
  • EMD Kernel: K(P, Q) = exp(−ρ · EMD(P, Q))

[Diagram: optimal flows between the frames of clips P and Q]

  • Y. Rubner, C. Tomasi, L. J. Guibas, “A metric for distributions with applications to image databases”, ICCV, 1998.
  • D. Xu, S.-F. Chang, “Video event recognition using kernel methods with multi-level temporal alignment”, PAMI, 2008.

Given two frame sets P = {(p1, wp1), ..., (pm, wpm)} and Q = {(q1, wq1), ..., (qn, wqn)}, the EMD is computed as EMD(P, Q) = Σi Σj fij dij / Σi Σj fij, where dij is the χ² visual-feature distance between frames pi and qj, and fij (the weight transferred from pi to qj) is optimized by minimizing the overall transportation workload Σi Σj fij dij.

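The EMD above is a transportation problem, so it can be solved as a small linear program. A sketch with scipy (the frame weights and the ground-distance matrix D are inputs; in the deck's setup D would hold the χ² visual-feature distances between keyframes, and ρ is a free kernel parameter):

```python
import numpy as np
from scipy.optimize import linprog

def emd(wp, wq, D):
    """Earth Mover's Distance between two weighted frame sets, given
    the m x n matrix D of pairwise ground distances. Variables f_ij
    are flattened row-major (index i*n + j)."""
    m, n = D.shape
    c = D.ravel()
    A_ub, b_ub = [], []
    for i in range(m):                       # sum_j f_ij <= wp_i
        row = np.zeros(m * n); row[i*n:(i+1)*n] = 1.0
        A_ub.append(row); b_ub.append(wp[i])
    for j in range(n):                       # sum_i f_ij <= wq_j
        col = np.zeros(m * n); col[j::n] = 1.0
        A_ub.append(col); b_ub.append(wq[j])
    total = min(wp.sum(), wq.sum())          # total flow must be moved
    res = linprog(c, A_ub=np.array(A_ub), b_ub=np.array(b_ub),
                  A_eq=np.ones((1, m * n)), b_eq=[total],
                  bounds=(0, None))
    return res.fun / total

def emd_kernel(wp, wq, D, rho=1.0):
    """EMD kernel as on the slide: K(P, Q) = exp(-rho * EMD(P, Q))."""
    return np.exp(-rho * emd(wp, wq, D))
```

This brute-force LP is fine for a handful of keyframes per clip; dedicated EMD solvers are far faster at scale.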

SLIDE 13

Temporal Matching Results

  • EMD is helpful for two events

– results measured by minimal normalized cost (lower is better)

[Bar chart: per-event Minimal Normalized Cost, r6-baseline vs. r3-base+EMD]


5% gain

SLIDE 14

Roadmap > contextual diffusion

[Pipeline diagram repeated; highlighted component: semantic diffusion with contextual detectors over 21 scene, action, and audio concepts (Runs 5, 4, 2)]

SLIDE 15
  • Events generally occur under particular scene settings, with certain audio sounds!

– Understanding context may be helpful for event detection

[Diagram: event context for "Batting a run in" — scene concepts: grass, baseball field, sky; audio concepts: cheering/clapping, comprehensible speech; action concepts: running, walking]

SLIDE 16
  • 21 concepts are defined and annotated over the MED development set.
  • SVM classifiers for concept detection

– STIP for action concepts, SIFT for scene concepts, and MFCC for audio concepts


Contextual Concepts

Human Action Concepts:
  • Person walking
  • Person running
  • Person squatting
  • Person standing up
  • Person making/assembling stuffs with hands (hands visible)
  • Person batting baseball

Scene Concepts:
  • Indoor kitchen
  • Outdoor with grass/trees visible
  • Baseball field
  • Crowd (a group of 3+ people)
  • Cakes (close-up view)
  • Outdoor rural
  • Outdoor urban

Audio Concepts:
  • Indoor quiet
  • Indoor noisy
  • Original audio
  • Dubbed audio
  • Speech comprehensible
  • Music
  • Cheering
  • Clapping

Jingen Liu, Jiebo Luo & Mubarak Shah, "Recognizing Realistic Actions from Videos 'in the Wild'", CVPR 2009.
Shih-Fu Chang et al., "Columbia University/VIREO-CityU/IRIT TRECVID2008 High-Level Feature Extraction and Interactive Video Search", TRECVID Workshop, 2008.

SLIDE 17


Concept Detection: example result

[Example detections: Baseball field, Cakes (close-up view), Crowd (3+ people), Grass/trees, Indoor kitchen]

SLIDE 18


Contextual Diffusion Model

  • Semantic Diffusion [Jiang, Wang, Chang & Ngo, ICCV 2009]

– Semantic graph: nodes are concepts/events; edges represent concept/event correlations
– Graph diffusion: smooths detection scores w.r.t. the correlations

[Graph diagram: the event "Batting a run in" linked to the concepts Baseball field, Running, and Cheering]

Project page and source code:

http://www.ee.columbia.edu/ln/dvmm/researchProjects/MultimediaIndexing/DASD/dasd.htm

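The released DASD code (linked above) performs a gradient-based energy minimization; the sketch below captures only the core smoothing idea with a simpler iterative averaging update (`alpha` and `iters` are hypothetical parameters, not taken from the paper):

```python
import numpy as np

def diffuse_scores(scores, W, alpha=0.2, iters=10):
    """Graph-diffusion sketch: repeatedly pull each concept/event score
    toward the correlation-weighted average of its neighbors' scores.
    scores: (num_videos, num_concepts); W[i, j]: non-negative
    correlation between concepts i and j."""
    # row-normalize the correlation graph (guard against empty rows)
    Wn = W / np.maximum(W.sum(axis=1, keepdims=True), 1e-12)
    s = scores.copy()
    for _ in range(iters):
        # blend each score with the average score of correlated nodes
        s = (1 - alpha) * s + alpha * s @ Wn.T
    return s
```

On the slide's example, a high "Baseball field" or "Cheering" score would pull the "Batting a run in" score upward through their edges, which is exactly the smoothing effect the diffusion is after.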

SLIDE 19

[Bar chart: per-event Minimal Normalized Cost, r3-baseEMD vs. r2-baseEMDSceAudAct]

Contextual Diffusion Results

  • Context is slightly helpful for two events

– results measured by minimal normalized cost (lower is better)


2-3% gain

SLIDE 20

[Bar chart: per-event Average Precision, baseline vs. context diffusion]

Contextual Diffusion Results

  • … but the improvement is much higher when context is perfect (on a validation set)

− results measured by average precision (higher is better)


SLIDE 21

Roadmap > reranking with event-specific object detector

[Pipeline diagram repeated; highlighted component: "batter" detection and re-ranking (Run 1)]

SLIDE 23

Reranking with Event-Specific Object Detector

  • “Batter” detector is trained by AdaBoost framework

[Pipeline: initial ranking → "batter" detection → re-ranking based on the ratio of detected objects]
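The slide only states that re-ranking is based on the ratio of detected objects; one plausible reading, sketched here with a hypothetical mixing weight, blends the initial classifier score with the fraction of sampled frames in which the "batter" detector fires:

```python
def rerank(video_ids, scores, batter_ratio, weight=0.5):
    """Hypothetical re-ranking sketch: blend the initial event score
    with the per-video ratio of frames containing a detected batter,
    then sort descending. `weight` is an assumed mixing parameter."""
    blended = {v: (1 - weight) * scores[v]
                  + weight * batter_ratio.get(v, 0.0)
               for v in video_ids}
    return sorted(video_ids, key=lambda v: blended[v], reverse=True)
```

Videos with no detected batter keep only their classifier score, so an event-specific detector can promote true positives without discarding the baseline ranking entirely.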

SLIDE 24

Lessons learned

Lessons learned:
1. STIP is powerful for event detection.
2. Combining multiple audio-visual features is very effective!
3. Temporal matching with EMD is useful for some events.
4. Diffusion with contextual concepts is promising and deserves deeper research.

Future work:
1. Explore deep joint audio-visual representations, e.g., Audio-Visual Atoms [Jiang et al., ACM MM 2009].
2. Investigate an adaptive method to find the best components for each event.

SLIDE 25

More information at: http://www.ee.columbia.edu/dvmm/
