Consumer Video Understanding: A Benchmark Database + An Evaluation of Human & Machine Performance (PowerPoint Presentation)



SLIDE 1

Consumer Video Understanding
A Benchmark Database + An Evaluation of Human & Machine Performance

Yu-Gang Jiang, Guangnan Ye, Shih-Fu Chang, Daniel Ellis, Alexander C. Loui
Columbia University / Kodak Research

ACM ICMR 2011, Trento, Italy, April 2011

SLIDE 2

We (consumers) take photos and videos every day, everywhere...

Barack Obama Rally, Texas, 2008. http://www.paulridenour.com/Obama14.JPG


SLIDE 3
What are Consumer Videos?

  • Original, unedited videos captured by ordinary consumers
  • Interesting and very diverse content
  • Very weakly indexed
    – 3 tags per consumer video on YouTube, vs. 9 tags that each YouTube video has on average
  • Original audio tracks are preserved; good for joint audio-visual analysis

SLIDE 4

Part I: A Database

Columbia Consumer Video (CCV) Database

SLIDE 5

Columbia Consumer Video (CCV) Database

Basketball, Baseball, Soccer, Ice Skating, Skiing, Swimming, Biking, Cat, Dog, Bird, Graduation, Birthday Celebration, Wedding Reception, Wedding Ceremony, Wedding Dance, Music Performance, Non-music Performance, Parade, Beach, Playground

SLIDE 6

CCV Snapshot

  • # videos: 9,317 (210 hrs in total)
  • video genre: unedited consumer videos
  • video source: YouTube.com
  • average length: 80 seconds
  • # defined categories: 20
  • annotation method: Amazon Mechanical Turk

[Bar chart: number of videos per category (music perf., non-music perf., dog, swimming, skiing, parade, cat, ice skating, beach, basketball, wedding dance, bird, playground, soccer, birthday, baseball, graduation, biking, wedding reception, wedding ceremony); x-axis from 100 to 800 videos]

The trick for digging consumer videos out of YouTube: use the default filename prefix of many digital cameras in the search query, e.g. "MVI and parade".
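A minimal sketch of how such a keyword-based harvest could be scripted with today's YouTube Data API v3 (the client library, the API-key placeholder, and the function name are assumptions; the original 2010 collection simply issued YouTube search queries like the one above):

```python
# Sketch: finding candidate consumer videos by combining the default camera
# filename prefix "MVI" with a category keyword. Assumes the
# google-api-python-client package and a valid YouTube Data API v3 key.
from googleapiclient.discovery import build

API_KEY = "YOUR_API_KEY"  # placeholder, not from the original slides
youtube = build("youtube", "v3", developerKey=API_KEY)

def search_consumer_videos(keyword, max_results=50):
    """Return YouTube video IDs for the query 'MVI <keyword>'."""
    response = youtube.search().list(
        q=f"MVI {keyword}",   # camera filename prefix + category keyword
        part="id",
        type="video",
        maxResults=max_results,
    ).execute()
    return [item["id"]["videoId"] for item in response["items"]]

print(search_consumer_videos("parade"))
```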

SLIDE 7
Existing Databases?

  • Human Action Recognition
    – KTH & Weizmann (constrained environment), 2004-05
    – Hollywood Database (12 categories, movies), 2008
    – UCF Database (50 categories, YouTube videos), 2010
  • Kodak Consumer Video (25 classes, 1,300+ videos), 2007
  • LabelMe Video (many classes, 1,300+ videos), 2009
  • TRECVID MED 2010 (3 classes, 3,400+ videos), 2010

How the CCV Database improves on each (arrow labels on the slide): unconstrained YouTube videos; higher-level complex events; more videos & better-defined categories; more videos & larger content variations; more videos & categories.

SLIDE 8

Crowdsourcing: Amazon Mechanical Turk

  • A web services API that allows developers to easily integrate human intelligence directly into their processing

[Diagram: a requester posts a task ("Is this a 'parade' video?" Yes / No) with a small financial reward, and the Internet-scale MTurk workforce answers it]

SLIDE 9

MTurk: Annotation Interface

[Screenshot: the MTurk annotation interface; each HIT pays $0.02]

Reliability of labels: each video was assigned to four MTurk workers
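A minimal sketch of how one such HIT could be posted programmatically, assuming today's boto3 MTurk client (the original 2010 collection used the MTurk API of that era; the HTML body and durations here are placeholders, while the $0.02 reward and four assignments per video come from the slides):

```python
# Sketch: post one binary labeling question per video as an MTurk HIT,
# requesting four assignments so each video is labeled by four workers.
import boto3

mturk = boto3.client("mturk", region_name="us-east-1")

QUESTION_XML = """<HTMLQuestion xmlns="http://mechanicalturk.amazonaws.com/AWSMechanicalTurkDataSchemas/2011-11-11/HTMLQuestion.xsd">
  <HTMLContent><![CDATA[
    <html><body>
      <p>Is this a "parade" video?</p>
      <!-- embedded video player and a Yes/No form would go here -->
    </body></html>
  ]]></HTMLContent>
  <FrameHeight>600</FrameHeight>
</HTMLQuestion>"""

hit = mturk.create_hit(
    Title="Is this a 'parade' video?",
    Description="Watch a short consumer video and answer Yes or No.",
    Reward="0.02",                    # $0.02 per HIT, as on the slide
    MaxAssignments=4,                 # four workers per video
    AssignmentDurationInSeconds=600,  # placeholder values
    LifetimeInSeconds=86400,
    Question=QUESTION_XML,
)
print(hit["HIT"]["HITId"])
```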

SLIDE 10

Part II: …not Just A Database

An Evaluation of Human & Machine Performance

SLIDE 11

Human Recognition Performance

  • How to measure human (MTurk workers) recognition accuracy?
    – We manually and carefully labeled 896 videos
      • Golden ground truth!
  • Consolidation of the 4 sets of labels (a sketch follows below)

[Chart: precision and recall of the consolidated labels when requiring 1, 2, 3, or 4 positive votes]

Plus additional manual filtering of 6 positive sample sets: 94% final precision
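A minimal sketch of the vote-based consolidation, assuming the raw MTurk answers are available as (video, category, yes/no) triples; the vote threshold is a parameter, matching the 1-4 vote operating points on the chart:

```python
# Sketch: consolidate the 4 MTurk labels per (video, category) pair by voting.
# A pair is kept as positive when at least `min_votes` of the workers said "yes".
from collections import defaultdict

def consolidate(worker_answers, min_votes=3):
    """worker_answers: iterable of (video_id, category, is_yes) triples."""
    votes = defaultdict(int)
    for video_id, category, is_yes in worker_answers:
        votes[(video_id, category)] += int(is_yes)
    return {pair for pair, n_yes in votes.items() if n_yes >= min_votes}

# Example: three of four workers agree, so the pair survives at min_votes=3.
answers = [("v1", "parade", True), ("v1", "parade", True),
           ("v1", "parade", True), ("v1", "parade", False)]
print(consolidate(answers))  # {('v1', 'parade')}
```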

SLIDE 12

Human Recognition Performance (cont.)

[Chart: per-worker precision and recall, with workers sorted by average labeling time per HIT; the time in seconds (7 to 160) is shown on top of the bars]

[Chart: per-worker precision and recall, with workers sorted by number of submitted HITs (1 to 770)]

SLIDE 13

Confusion Matrices

[Confusion matrices: ground-truth labels vs. human recognition]

SLIDE 14

Machine Recognition System

[Pipeline diagram] Feature extraction (SIFT, spatial-temporal interest points, MFCC audio features) → χ2 kernel SVM classifiers → average late fusion

Yu-Gang Jiang, Xiaohong Zeng, Guangnan Ye, Subh Bhattacharya, Dan Ellis, Mubarak Shah, Shih-Fu Chang, Columbia-UCF TRECVID2010 Multimedia Event Detection: Combining Multiple Modalities, Contextual Concepts, and Temporal Matching, NIST TRECVID Workshop, 2010.
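A minimal sketch of the classification stage described above, assuming the per-feature bag-of-words histograms are already computed: one χ2-kernel SVM per feature (SIFT, STIP, MFCC) and a simple average of the per-feature scores as the late fusion (function and variable names are illustrative, not from the paper):

```python
# Sketch: chi-square kernel SVMs, one per feature, + average late fusion.
import numpy as np
from sklearn.metrics.pairwise import chi2_kernel
from sklearn.svm import SVC

def train_and_score(train_hists, train_labels, test_hists, gamma=1.0):
    """Train a chi2-kernel SVM on one feature's histograms; return test scores."""
    K_train = chi2_kernel(train_hists, gamma=gamma)
    K_test = chi2_kernel(test_hists, train_hists, gamma=gamma)
    clf = SVC(kernel="precomputed", probability=True)
    clf.fit(K_train, train_labels)
    return clf.predict_proba(K_test)[:, 1]

def average_late_fusion(per_feature_train, train_labels, per_feature_test):
    """per_feature_*: lists of histogram matrices, e.g. [SIFT, STIP, MFCC]."""
    scores = [train_and_score(tr, train_labels, te)
              for tr, te in zip(per_feature_train, per_feature_test)]
    return np.mean(scores, axis=0)  # averaged detection score per test video
```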

SLIDE 15

Best Performance in TRECVID-2010

Multimedia event detection (MED) task

[Bar chart: mean minimal normalized cost for the six submitted runs (lower is better)]

  • Run 1: Run 2 + "Batter" Reranking
  • Run 2: Run 3 + Scene/Audio/Action Context
  • Run 3: Run 6 + EMD Temporal Matching
  • Run 4: Run 6 + Scene/Audio/Action Context
  • Run 5: Run 6 + Scene/Audio Context
  • Run 6: Baseline Classification with 3 features

SLIDE 16

Three Audio-Visual Features…

  • SIFT (visual)
    – D. Lowe, IJCV '04
  • STIP (visual)
    – I. Laptev, IJCV '05
  • MFCC (audio)

[Diagram: MFCCs computed over short 16 ms audio frames]
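A minimal sketch of extracting two of these three features with common open-source tools (OpenCV for SIFT, librosa for MFCC); STIP descriptors would typically come from Laptev's stip detector binary, so they are not shown. The frame parameters here are library defaults rather than the exact settings used for CCV:

```python
# Sketch: SIFT keyframe descriptors and MFCC audio features.
import cv2        # requires OpenCV >= 4.4 for SIFT_create
import librosa

def sift_descriptors(frame_bgr):
    """Return 128-D SIFT descriptors for one keyframe."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    sift = cv2.SIFT_create()
    _keypoints, descriptors = sift.detectAndCompute(gray, None)
    return descriptors

def mfcc_features(audio_path, n_mfcc=13):
    """Return MFCCs computed over short frames of the original audio track."""
    y, sr = librosa.load(audio_path, sr=None)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).T  # (frames, n_mfcc)
```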

SLIDE 17

Bag-of-X Representation

  • X = SIFT / STIP / MFCC
  • Soft weighting (Jiang, Ngo and Yang, ACM CIVR 2007)

[Diagram: Bag-of-SIFT representation]

Bag of audio words / bag of frames: K. Lee and D. Ellis, Audio-Based Semantic Concept Classification for Consumer Video, IEEE Trans on Audio, Speech, and Language Processing, 2010
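A minimal sketch of building the Bag-of-X histogram with a simplified soft-weighting scheme: each descriptor votes for its k nearest codewords with a weight that halves at every rank, which approximates (but does not exactly reproduce) the soft weighting of Jiang, Ngo and Yang (CIVR 2007):

```python
# Sketch: soft-weighted bag-of-words histogram over a learned codebook.
import numpy as np

def soft_bow(descriptors, codebook, k=4):
    """descriptors: (n, d) array; codebook: (vocab_size, d) array of visual/audio words."""
    hist = np.zeros(len(codebook))
    for d in descriptors:
        dists = np.linalg.norm(codebook - d, axis=1)
        for rank, word in enumerate(np.argsort(dists)[:k]):
            hist[word] += 1.0 / (2 ** rank)   # nearest word gets the largest vote
    return hist / max(hist.sum(), 1e-12)      # L1-normalize the histogram
```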

SLIDE 18

Machine Recognition Accuracy

[Bar chart: per-category average precision for the Prior, MFCC, STIP, SIFT, SIFT+STIP, and SIFT+STIP+MFCC runs]

  • Measured by average precision (a sketch of the computation follows below)
  • SIFT works best for event detection
  • The 3 features are highly complementary!
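A minimal sketch of the evaluation measure, assuming per-category binary ground truth and fused detection scores (the mean over the 20 categories is the headline number on the chart):

```python
# Sketch: per-category average precision and its mean (mAP) with scikit-learn.
import numpy as np
from sklearn.metrics import average_precision_score

def mean_average_precision(y_true, y_score):
    """y_true, y_score: (n_videos, n_categories) arrays of {0,1} labels and scores."""
    aps = [average_precision_score(y_true[:, c], y_score[:, c])
           for c in range(y_true.shape[1])]
    return float(np.mean(aps)), aps
```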


SLIDE 19

Human vs. Machine

  • Humans have much better recall, and are much better on non-rigid objects
  • Machine precision on the top-ranked results is close to human precision

[Bar charts: per-category machine vs. human precision, measured at 90% recall and at 59% recall]
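A minimal sketch of the operating-point comparison used on this slide: precision measured at a fixed recall level (the slide uses 90% and 59% recall), computed here from a precision-recall curve with scikit-learn (the helper name is illustrative):

```python
# Sketch: precision at a fixed recall operating point.
from sklearn.metrics import precision_recall_curve

def precision_at_recall(y_true, y_score, target_recall=0.9):
    """Best precision among thresholds whose recall reaches target_recall."""
    precision, recall, _thresholds = precision_recall_curve(y_true, y_score)
    feasible = precision[recall >= target_recall]
    return float(feasible.max()) if feasible.size else 0.0
```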

SLIDE 20

Human vs. Machine: Confusion Matrices

[Confusion matrices: human recognition vs. machine recognition]

SLIDE 21

Human vs. Machine: Result Examples

[Example grid for three categories, with human vs. machine precision in parentheses: wedding dance (93.3% vs. 92.9%), soccer (87.5% vs. 53.8%), cat (93.5% vs. 46.8%). Rows show true positives found by both human & machine, by human only, or by machine only, and false positives found by machine only or by human only.]

SLIDE 22

Download

  • Unique YouTube video IDs
  • Labels
  • Training/test partition
  • Three audio/visual features

http://www.ee.columbia.edu/dvmm/CCV/ (fill out this …)

SLIDE 23

Thank you!