SLIDE 1

Recognizing Complex Events in Internet Videos with Audio-Visual Features

Yu-Gang Jiang

yjiang@ee.columbia.edu

In collaboration with Xiaohong Zeng¹, Guangnan Ye¹, Subh Bhattacharya², Dan Ellis¹, Mubarak Shah², Shih-Fu Chang¹, Alexander C. Loui³
¹Columbia University  ²University of Central Florida  ³Kodak Research Labs

SLIDE 2

We take photos/videos every day, everywhere...

Barack Obama Rally, Texas, 2008. http://www.paulridenour.com/Obama14.JPG

SLIDE 3

Outline

  • A System for Recognizing Events in Internet Videos
    – Best performance in TRECVID 2010 Multimedia Event Detection Task
    – Features, Kernels, Context, etc.
  • Internet Consumer Video Analysis
    – A Benchmark Database
    – An Evaluation of Human & Machine Performance

SLIDE 4

Outline

  • A System for Recognizing Events in Internet Videos
    – Best performance in TRECVID 2010 Multimedia Event Detection Task
    – Features, Kernels, Context, etc.
  • Internet Consumer Video Analysis
    – A Benchmark Database
    – An Evaluation of Human & Machine Performance

SLIDE 5

The TRECVID Multimedia Event Detection Task

[Example videos: Batting a run in, Assembling a shelter, Making a cake]

  • Target: Find videos containing an event of interest
  • Data: unconstrained Internet videos

– 1700+ training videos (~50 positives per event); 1700+ test videos

SLIDE 6

The system: 3 major components

[System diagram: (1) feature extraction (SIFT, spatial-temporal interest points, MFCC audio features, 21 scene/action/audio concepts); (2) χ² SVM and EMD-SVM classifiers; (3) semantic diffusion with contextual detectors; outputs: Batting a run in, Assembling a shelter, Making a cake]

Yu-Gang Jiang, Xiaohong Zeng, Guangnan Ye, S. Bhattacharya, Dan Ellis, Mubarak Shah, Shih-Fu Chang, Columbia-UCF TRECVID2010 Multimedia Event Detection: Combining Multiple Modalities, Contextual Concepts, and Temporal Matching, in TRECVID 2010.

SLIDE 7

Best performance in the TRECVID 2010 Multimedia Event Detection (MED) task

[Bar chart: mean minimal normalized cost (lower is better) for runs r1-r6]

Run 1: Run 2 + "Batter" Reranking
Run 2: Run 3 + Scene/Audio/Action Context
Run 3: Run 6 + EMD Temporal Matching
Run 4: Run 6 + Scene/Audio/Action Context
Run 5: Run 6 + Scene/Audio Context
Run 6: Baseline Classification with 3 features
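For readers unfamiliar with the metric, the normalized detection cost behind these bars generally takes the form below in TRECVID-style evaluations; the cost weights C_Miss, C_FA and the target prior P_T are task-defined constants not shown on the slide, so treat this as a hedged reconstruction of the general formula rather than the exact MED 2010 definition.

```latex
% General form of the normalized detection cost; the minimal cost is its minimum over thresholds.
% C_{Miss}, C_{FA}, P_T are task-defined constants (assumed form, not taken from the slide).
\mathrm{NDC}(\theta) =
  \frac{C_{Miss}\, P_{Miss}(\theta)\, P_T + C_{FA}\, P_{FA}(\theta)\, (1 - P_T)}
       {\min\bigl(C_{Miss}\, P_T,\; C_{FA}\, (1 - P_T)\bigr)},
\qquad
\mathrm{MNC} = \min_{\theta} \mathrm{NDC}(\theta)
```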

SLIDE 8

Per-event performance

[Bar charts: minimal normalized cost (MNC) per run for each event: Batting a run in, Assembling a shelter, Making a cake]

Run 1: Run 2 + "Batter" Reranking
Run 2: Run 3 + Scene/Audio/Action Context
Run 3: Run 6 + EMD Temporal Matching
Run 4: Run 6 + Scene/Audio/Action Context
Run 5: Run 6 + Scene/Audio Context
Run 6: Baseline Classification with 3 features

SLIDE 9

Roadmap > audio-visual features

[System diagram repeated; focus: the three audio-visual features in the feature extraction stage (SIFT, spatial-temporal interest points, MFCC audio features, 21 scene/action/audio concepts)]

SLIDE 10

Three audio-visual features…

  • SIFT (visual)
    – D. Lowe, IJCV 04.
  • STIP (visual)
    – I. Laptev, IJCV 05.
  • MFCC (audio)
    – 16 ms audio frames
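As a rough illustration of the audio channel only, the sketch below extracts frame-level MFCCs with the librosa library. The ~16 ms hop follows the slide; the sample rate, number of coefficients, and the file name are illustrative assumptions, not the authors' exact settings.

```python
import librosa
import numpy as np

def extract_mfcc(audio_path, sr=22050, n_mfcc=13, hop_ms=16):
    """Return an (n_frames, n_mfcc) matrix of MFCCs for one video's audio track."""
    y, sr = librosa.load(audio_path, sr=sr)               # decode to mono at the target rate
    hop = int(sr * hop_ms / 1000)                          # ~16 ms hop, as suggested on the slide
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc, hop_length=hop)
    return mfcc.T                                          # one row per audio frame

# hypothetical file name; in practice the audio is demuxed from each video first
frames = extract_mfcc("some_video_audio.wav")
```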

SLIDE 11

Bag-of-X representation

  • X = SIFT / STIP / MFCC
  • Soft weighting (Jiang, Ngo and Yang, ACM CIVR 2007)

[Illustration: Bag-of-SIFT histogram]
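A minimal sketch of such a soft-weighted bag-of-X histogram, assuming a pre-trained codebook (e.g. k-means centers): each descriptor votes for its k nearest codewords with geometrically decaying weights, which is one common reading of the cited soft-weighting scheme; the codebook size, k, and the decay are illustrative.

```python
import numpy as np

def soft_bow(descriptors, codebook, k=4):
    """Soft-weighted bag-of-words histogram.
    descriptors: (n, d) local features (e.g. SIFT); codebook: (K, d) visual words."""
    K = codebook.shape[0]
    hist = np.zeros(K)
    # squared Euclidean distances between every descriptor and every codeword
    d2 = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    nearest = np.argsort(d2, axis=1)[:, :k]        # k nearest codewords per descriptor
    for rank in range(k):
        weight = 1.0 / (2 ** rank)                 # geometrically decaying vote (assumed decay)
        np.add.at(hist, nearest[:, rank], weight)
    return hist / max(hist.sum(), 1e-12)           # L1-normalize the histogram

# usage: hist = soft_bow(sift_descriptors, kmeans_centers, k=4)
```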

SLIDE 12

Results of audio-visual features


  • Measured by Average Precision (AP)
  • STIP works the best for event detection
  • The 3 features are highly complementary!

Feature           Assembling a shelter   Batting a run in   Making a cake   Mean AP
Visual: STIP      0.468                  0.719              0.476           0.554
Visual: SIFT      0.353                  0.787              0.396           0.512
Audio: MFCC       0.249                  0.692              0.270           0.404
STIP+SIFT         0.508                  0.796              0.476           0.593
STIP+SIFT+MFCC    0.533                  0.873              0.493           0.633
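For reference, the (non-interpolated) average precision for one event can be computed from ranked detection scores as below; the exact evaluation script may differ in minor details.

```python
import numpy as np

def average_precision(scores, labels):
    """Non-interpolated AP for one event: scores (n,), labels (n,) with 1 = positive video."""
    order = np.argsort(-np.asarray(scores, dtype=float))   # rank test videos by detection score
    labels = np.asarray(labels)[order]
    hits = np.cumsum(labels)
    precision_at_k = hits / (np.arange(len(labels)) + 1.0)
    return float((precision_at_k * labels).sum() / max(labels.sum(), 1))

# e.g. average_precision([0.9, 0.2, 0.75], [1, 0, 1]) == 1.0 (both positives ranked on top)
```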

SLIDE 13

Roadmap > temporal matching

[System diagram repeated; focus: temporal matching with the EMD-SVM, alongside the χ² SVM classifiers]

SLIDE 14

Temporal matching with EMD kernel


  • Earth Mover’s Distance (EMD)
  • EMD kernel: K(P, Q) = exp(-ρ · EMD(P, Q))
  • Y. Rubner, C. Tomasi, L. J. Guibas, “A metric for distributions with applications to image databases”, ICCV, 1998.
  • D. Xu, S.-F. Chang, “Video event recognition using kernel methods with multi-level temporal alignment”, PAMI, 2008.

Given two clip sets P = {(p1, wp1), ..., (pm, wpm)} and Q = {(q1, wq1), ..., (qn, wqn)}, the EMD is computed as EMD(P, Q) = Σi Σj fij dij / Σi Σj fij, where dij is the χ² visual-feature distance between video clips pi and qj, and fij (the weight transferred from pi to qj) is optimized by minimizing the overall transportation workload Σi Σj fij dij.

[Illustration: the clips of videos P and Q along the time axis, with pairwise distances and flows between matched clips]
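A rough sketch of this computation, solving the transportation problem with SciPy's linear-programming routine over a χ² cost matrix between per-clip features. The clip weights, the temporal segmentation, and the χ² distance on raw histograms are placeholders; the multi-level temporal alignment of Xu & Chang (PAMI 2008) adds structure not shown here.

```python
import numpy as np
from scipy.optimize import linprog

def chi2_dist(x, y, eps=1e-10):
    """Chi-squared distance between two non-negative feature histograms."""
    return 0.5 * np.sum((x - y) ** 2 / (x + y + eps))

def emd(P_feats, P_w, Q_feats, Q_w):
    """EMD(P, Q) = sum_ij f_ij d_ij / sum_ij f_ij, with the flow f solving a transportation LP."""
    m, n = len(P_feats), len(Q_feats)
    D = np.array([[chi2_dist(p, q) for q in Q_feats] for p in P_feats])   # cost matrix d_ij
    A_ub, b_ub = [], []
    for i in range(m):                              # flow out of clip p_i bounded by its weight
        row = np.zeros(m * n); row[i * n:(i + 1) * n] = 1
        A_ub.append(row); b_ub.append(P_w[i])
    for j in range(n):                              # flow into clip q_j bounded by its weight
        col = np.zeros(m * n); col[j::n] = 1
        A_ub.append(col); b_ub.append(Q_w[j])
    total = min(np.sum(P_w), np.sum(Q_w))           # all transferable weight must be moved
    res = linprog(D.ravel(), A_ub=A_ub, b_ub=b_ub,
                  A_eq=[np.ones(m * n)], b_eq=[total], bounds=(0, None))
    f = res.x
    return float(np.dot(f, D.ravel()) / f.sum())

# kernel between two videos, as on the slide: K = exp(-rho * emd(P_feats, P_w, Q_feats, Q_w))
```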

SLIDE 15

Temporal matching results

  • EMD is helpful for two of the three events

– results measured by minimal normalized cost (lower is better)

[Bar chart: minimal normalized cost per event, r6 baseline vs. r3 baseline + EMD; roughly a 5% gain]

SLIDE 16

Roadmap > contextual diffusion

[System diagram repeated; focus: semantic diffusion with contextual detectors, applied after the χ² SVM / EMD-SVM classification of the audio-visual features]

SLIDE 17
  • Events generally occur under particular scene settings with certain audio sounds!
    – Understanding contexts may be helpful for event detection

Event context

[Illustration for "Batting a run in": scene concepts (grass, baseball field, sky), audio concepts (cheering/clapping, comprehensible speech), action concepts (running, walking)]

SLIDE 18
Contextual concepts

  • 21 concepts are defined and annotated over the TRECVID MED development set
  • SVM classifiers for concept detection
    – STIP for action concepts, SIFT for scene concepts, and MFCC for audio concepts

Human Action Concepts:
  • Person walking
  • Person running
  • Person squatting
  • Person standing up
  • Person making/assembling stuffs with hands (hands visible)
  • Person batting baseball

Scene Concepts:
  • Indoor kitchen
  • Outdoor with grass/trees visible
  • Baseball field
  • Crowd (a group of 3+ people)
  • Cakes (close-up view)
  • Outdoor rural
  • Outdoor urban

Audio Concepts:
  • Indoor quiet
  • Indoor noisy
  • Original audio
  • Dubbed audio
  • Speech comprehensible
  • Music
  • Cheering
  • Clapping
SLIDE 19

Concept detection: example results

[Example frames of detected concepts: Baseball field, Cakes (close-up view), Crowd (3+ people), Grass/trees, Indoor kitchen]

SLIDE 20

Contextual diffusion model

  • Semantic diffusion [Y.-G. Jiang, J. Wang, S.-F. Chang & C.-W. Ngo, ICCV 2009]
    – Semantic graph
      • Nodes are concepts/events
      • Edges represent concept/event correlation
    – Graph diffusion
      • Smooth detection scores w.r.t. the correlation

[Graph illustration: the event "Batting a run in" linked to the concepts "Baseball field", "Running", and "Cheering", with example scores 0.9, 0.8, 0.7, 0.5]

Project page and source code:
http://www.ee.columbia.edu/ln/dvmm/researchProjects/MultimediaIndexing/DASD/dasd.htm
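As an illustration of the idea only (not the exact DASD update from the ICCV 2009 paper), a generic graph-diffusion smoothing of detection scores over a concept/event correlation graph might look like this; the correlation matrix W, the damping factor, and the iteration count are assumptions.

```python
import numpy as np

def diffuse_scores(scores, W, alpha=0.3, n_iter=20):
    """Smooth one video's concept/event detection scores over a correlation graph.
    scores: (C,) raw detector outputs; W: (C, C) non-negative concept correlation matrix."""
    W = W / np.maximum(W.sum(axis=1, keepdims=True), 1e-12)   # row-normalize the edge weights
    s0 = np.asarray(scores, dtype=float)
    s = s0.copy()
    for _ in range(n_iter):
        # pull each score toward its correlated neighbors while staying anchored to the original
        s = (1 - alpha) * s0 + alpha * W.dot(s)
    return s

# e.g. a confident "Baseball field" detection raises a correlated "Batting a run in" score
```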

SLIDE 21

Contextual diffusion results

  • Context is slightly helpful for two events
    – Results measured by minimal normalized cost (lower is better)

[Bar chart: minimal normalized cost per event, r3 (baseline + EMD) vs. r2 (baseline + EMD + scene/audio/action context); roughly a 3% gain]

SLIDE 22

Outline

  • A System for Recognizing Events in Internet Videos
    – Best performance in TRECVID 2010 Multimedia Event Detection Task
    – Features, Kernels, Context, etc.
  • Internet Consumer Video Analysis
    – A Benchmark Database
    – An Evaluation of Human & Machine Performance


Yu-Gang Jiang, Guangnan Ye, Shih-Fu Chang, Daniel Ellis, Alexander C. Loui, Consumer Video Understanding: A Benchmark Database and An Evaluation of Human and Machine Performance, in ACM ICMR 2011.

SLIDE 23
What are Consumer Videos?

  • Original unedited videos captured by ordinary consumers
  • Interesting and very diverse contents
  • Very weakly indexed
    – On average, 3 tags per consumer video vs. 9 tags per YouTube video overall
  • Original audio tracks are preserved; good for joint audio-visual analysis

SLIDE 24

Columbia Consumer Video (CCV) Database

[20 categories: Basketball, Baseball, Soccer, Ice Skating, Skiing, Swimming, Biking, Cat, Dog, Bird, Graduation, Birthday Celebration, Wedding Reception, Wedding Ceremony, Wedding Dance, Music Performance, Non-music Performance, Parade, Beach, Playground]

SLIDE 25

CCV Snapshot

  • # videos: 9,317 (210 hrs in total)
  • Video genre: unedited consumer videos
  • Video source: YouTube.com
  • Average length: 80 seconds
  • # defined categories: 20
  • Annotation method: Amazon Mechanical Turk

[Bar chart: number of videos per category (roughly 100 to 800 each), listed from music perf., non-music perf., dog, swimming, skiing, parade, cat, ice skating, beach, basketball, wedding dance, bird, playground, soccer, birthday, baseball, graduation, biking, wedding reception, to wedding ceremony]

The trick to digging out consumer videos from YouTube: include the default filename prefix of many digital cameras in the query, e.g. "MVI" combined with "parade".

SLIDE 26
Existing Databases?

  • Human Action Recognition
    – KTH & Weizmann (constrained environment), 2004-05
    – Hollywood Database (12 categories, movies), 2008
    – UCF Database (50 categories, YouTube videos), 2010
  • Kodak Consumer Video (25 classes, 1300+ videos), 2007
  • LabelMe Video (many classes, 1300+ videos), 2009
  • TRECVID MED 2010 (3 classes, 3400+ videos), 2010

CCV Database vs. the above: unconstrained YouTube videos, higher-level complex events, more videos & better defined categories, more videos & larger content variations, more videos & categories

SLIDE 27

Crowdsourcing: Amazon Mechanical Turk

  • A web services API that allows developers to easily integrate human intelligence directly into their processing

[Illustration: a requester posts a task ("Is this a 'parade' video? Yes / No") to an Internet-scale workforce in exchange for financial rewards]

SLIDE 28

MTurk: Annotation Interface

[Screenshot of the annotation interface; reward shown: $0.02]

Reliability of labels: each video was assigned to four MTurk workers

SLIDE 29

Human Recognition Performance

  • How to measure human (MTurk worker) recognition accuracy?
    – We manually and carefully labeled 896 videos (golden ground truth!)
    – Consolidation of the 4 sets of labels

[Chart: precision and recall of the consolidated MTurk labels at vote thresholds of 1 to 4 votes]
Plus additional manual filtering of 6 positive sample sets: 94% final precision
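A minimal sketch of the vote-threshold consolidation being evaluated in this chart: each video carries four binary MTurk labels, the consolidated label requires at least t positive votes, and precision/recall are measured against the manually verified ground truth. Variable names and the threshold loop are illustrative, not the authors' code.

```python
import numpy as np

def consolidate(worker_labels, min_votes):
    """worker_labels: (n_videos, 4) binary labels from the four assigned MTurk workers."""
    return (worker_labels.sum(axis=1) >= min_votes).astype(int)

def precision_recall(pred, truth):
    tp = np.sum((pred == 1) & (truth == 1))
    fp = np.sum((pred == 1) & (truth == 0))
    fn = np.sum((pred == 0) & (truth == 1))
    return tp / max(tp + fp, 1), tp / max(tp + fn, 1)

# evaluate each vote threshold against the 896 manually labeled ("golden") videos
# for t in (1, 2, 3, 4):
#     print(t, precision_recall(consolidate(mturk_labels, t), golden_labels))
```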

SLIDE 30

Human Recognition Performance (cont.)

[Chart: precision and recall per MTurk worker, workers sorted by number of submitted HITs (from 1 up to 770)]

SLIDE 31

Machine Recognition System

[Pipeline: feature extraction (SIFT, spatial-temporal interest points, MFCC audio features) → χ² kernel SVM classifiers → average late fusion]

Yu-Gang Jiang, Xiaohong Zeng, Guangnan Ye, Subh Bhattacharya, Dan Ellis, Mubarak Shah, Shih-Fu Chang, Columbia-UCF TRECVID2010 Multimedia Event Detection: Combining Multiple Modalities, Contextual Concepts, and Temporal Matching, NIST TRECVID Workshop, 2010.
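A hedged sketch of this classification stage using scikit-learn: one χ²-kernel SVM per feature channel (bag-of-SIFT, bag-of-STIP, bag-of-MFCC), with the per-channel probability scores averaged for late fusion. The kernel bandwidth, C, and the probability calibration are illustrative choices, not the reported settings.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import chi2_kernel

def train_channel(X_train, y_train, gamma=1.0, C=10.0):
    """One chi-squared-kernel SVM for a single bag-of-words feature channel."""
    K = chi2_kernel(X_train, gamma=gamma)                # precomputed exp(-gamma * chi2) kernel
    clf = SVC(kernel="precomputed", C=C, probability=True).fit(K, y_train)
    return clf, X_train, gamma

def score_channel(model, X_test):
    clf, X_train, gamma = model
    K = chi2_kernel(X_test, X_train, gamma=gamma)        # kernel between test and training videos
    return clf.predict_proba(K)[:, 1]

def fuse(models, test_feats):
    """Average late fusion of the per-channel scores (e.g. SIFT, STIP, MFCC bags-of-words)."""
    return np.mean([score_channel(m, X) for m, X in zip(models, test_feats)], axis=0)
```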

SLIDE 32

Machine Recognition Accuracy

  • Measured by average precision
  • SIFT works the best for event detection
  • The 3 features are highly complementary!

[Bar chart: per-category average precision for Prior, MFCC, STIP, SIFT, SIFT+STIP, and SIFT+STIP+MFCC]

SLIDE 33

Human vs. Machine

  • Human has much better recall, and is much better for non-rigid objects
  • Machine is close to human on top-list precision

[Charts: per-category precision of machine vs. human at 90% recall and at 59% recall]

SLIDE 34

Human vs. Machine: Result Examples

[Example frames for "wedding dance", "soccer", and "cat": true positives found by both human & machine, by human only, or by machine only; false positives found by machine only or by human only (n/a where none)]

SLIDE 35

Summary

  • The combination of the three audio-visual features is key for good video event recognition performance
  • Temporal matching is useful for some complex events
  • Current automatic event recognition methods are not that bad
  • A new dataset (CCV) for consumer video analysis

SLIDE 36

Dataset download

  • Unique YouTube Video IDs
  • Labels
  • Training/Test Partition
  • Three Audio/Visual Features

http://www.ee.columbia.edu/dvmm/CCV/
Fill out this …

SLIDE 37

email: yjiang@ee.columbia.edu
