Consumer Video Understanding
Yu-Gang Jiang, Guangnan Ye, Shih-Fu Chang, Daniel Ellis, Alexander C. Loui
Columbia University, Kodak Research
A Benchmark Database + An Evaluation of Human & Machine Performance
ACM ICMR 2011, Trento, Italy, April 2011
Barack Obama Rally, Texas, 2008. http://www.paulridenour.com/Obama14.JPG
consumers
YouTube video has on average
visual joint analysis
Columbia Consumer Video (CCV) Database
Basketball, Baseball, Soccer, Ice Skating, Skiing, Swimming, Biking, Cat, Dog, Bird, Graduation, Birthday Celebration, Wedding Reception, Wedding Ceremony, Wedding Dance, Music Performance, Non-music Performance, Parade, Beach, Playground
– (210 hrs in total)
– unedited consumer videos
– YouTube.com
– 80 seconds
– 20
– Amazon Mechanical Turk
[Figure: bar chart over the 20 categories, axis ticks from 100 to 800. Categories as plotted: music perf., non-music perf., dog, swimming, skiing, parade, cat, ice skating, beach, basketball, wedding dance, bird, playground, soccer, birthday, baseball, graduation, biking, wedding reception, wedding ceremony]
Trick for finding consumer videos on YouTube: search with the default filename prefix used by many digital cameras, e.g. the query “MVI and parade”.
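A minimal sketch of how such queries could be generated per category. The helper name `build_queries` and the prefix list are hypothetical; only the prefix "MVI" and the example query come from the slide, and no actual search API is called here.

```python
# Hypothetical helper: pair a camera filename prefix with each category keyword.
# Only "MVI" is named on the slide; other prefixes would be an assumption.
CAMERA_PREFIXES = ["MVI"]


def build_queries(categories, prefixes=CAMERA_PREFIXES):
    """Return one search string per (prefix, category) pair."""
    return [
        f"{prefix} and {category}"
        for prefix in prefixes
        for category in categories
    ]


if __name__ == "__main__":
    for query in build_queries(["parade", "beach", "playground"]):
        print(query)
```

The resulting strings would still need to be submitted to a video search interface and the hits reviewed by annotators.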
– KTH & Weizmann
– Hollywood Database
– UCF Database
CCV Database
Unconstrained YouTube videos; higher-level complex events; more videos & better defined categories; more videos & larger content variations; more videos & categories
human intelligence directly into their processing
Task
Is this a “parade” video?
$?.??
Internet-scale workforce
What can I do for you?
financial rewards
Reliability of Labels: each video was assigned to four MTurk workers
An Evaluation of Human & Machine Performance
recognition accuracy?
– We manually and carefully labeled 896 videos
[Figure: precision and recall of the consolidated labels at the 1-vote, 2-votes, 3-votes, and 4-votes thresholds; axis from 0.2 to 1]
Plus additional manual filtering of 6 positive sample sets: 94% final precision
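The vote thresholds on this slide can be illustrated with a small sketch: a video is kept as positive when at least k of its four workers voted yes, and the result is scored against manually verified labels. The function names and the toy data are assumptions for illustration, not the authors' tooling or numbers.

```python
def consolidate(votes_per_video, min_votes):
    """Mark a video positive when at least `min_votes` workers said yes."""
    return {vid: sum(votes) >= min_votes for vid, votes in votes_per_video.items()}


def precision_recall(predicted, truth):
    """Precision and recall of predicted positives against ground truth."""
    tp = sum(1 for v in predicted if predicted[v] and truth[v])
    fp = sum(1 for v in predicted if predicted[v] and not truth[v])
    fn = sum(1 for v in predicted if not predicted[v] and truth[v])
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Toy data (made up): four yes/no votes per video, plus a verified label.
votes = {"a": [1, 1, 1, 1], "b": [1, 1, 0, 0], "c": [1, 0, 0, 0], "d": [0, 0, 0, 0]}
truth = {"a": True, "b": True, "c": False, "d": False}

if __name__ == "__main__":
    for k in range(1, 5):
        p, r = precision_recall(consolidate(votes, k), truth)
        print(f"{k}-vote threshold: precision={p:.2f} recall={r:.2f}")
```

Raising the threshold trades recall for precision, which is the trade-off the chart on this slide plots.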
[Figure: precision and recall per worker, workers sorted by average labeling time per HIT; axis from 0.2 to 1. Time is shown in seconds on top of the bars: 7, 10, 13, 14, 14, 16, 16, 17, 20, 20, 21, 23, 25, 31, 36, 37, 40, 76, 95, 160]
[Figure: precision and recall per worker, workers sorted by # of submitted HITs; axis from 0.2 to 1. Values shown with the bars: 1, 1, 1, 2, 3, 3, 3, 3, 4, 5, 17, 25, 27, 36, 77, 248, 255, 446, 694, 770]
Ground-truth Labels Human Recognition
Feature extraction
SIFT Spatial-temporal interest points MFCC audio feature
Average Late Fusion
χ2 kernel SVM Classifier
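A minimal sketch of two pieces of this pipeline: a χ2 kernel between feature histograms and average late fusion of per-feature classifier scores. The exponential form of the kernel and the `gamma` parameter are assumptions (the slide only says "χ2 kernel"), the SVM training itself is omitted, and all function names are illustrative.

```python
import math


def chi2_distance(x, y):
    """Chi-square distance between two non-negative histograms."""
    return sum((a - b) ** 2 / (a + b) for a, b in zip(x, y) if a + b > 0)


def chi2_kernel(x, y, gamma=1.0):
    """Exponential chi-square kernel: exp(-gamma * chi2_distance).

    The exponential form and gamma are assumptions, not taken from the slides.
    """
    return math.exp(-gamma * chi2_distance(x, y))


def average_late_fusion(scores_per_feature):
    """Average the per-feature classifier scores video by video.

    `scores_per_feature` holds one score list per feature (e.g. SIFT, STIP,
    MFCC), all in the same video order.
    """
    n_features = len(scores_per_feature)
    return [sum(column) / n_features for column in zip(*scores_per_feature)]


if __name__ == "__main__":
    print(chi2_kernel([0.5, 0.5], [0.5, 0.5]))
    print(chi2_kernel([1.0, 0.0], [0.0, 1.0]))
    print(average_late_fusion([[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]))
```

In a full system each feature would get its own kernel matrix and its own SVM, and only the output scores would be averaged.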
Yu-Gang Jiang, Xiaohong Zeng, Guangnan Ye, Subh Bhattacharya, Dan Ellis, Mubarak Shah, Shih-Fu Chang, Columbia-UCF TRECVID2010 Multimedia Event Detection: Combining Multiple Modalities, Contextual Concepts, and Temporal Matching, NIST TRECVID Workshop, 2010.
Multimedia event detection (MED) task
[Figure: Mean Minimal Normalized Cost for runs r1 to r6; axis from 0.00 to 1.40]
Run1: Run2 + “Batter” Reranking
Run2: Run3 + Scene/Audio/Action Context
Run3: Run6 + EMD Temporal Matching
Run4: Run6 + Scene/Audio/Action Context
Run5: Run6 + Scene/Audio Context
Run6: Baseline Classification with 3 features
– D. Lowe, IJCV ’04
– I. Laptev, IJCV ’05
…
16ms 16ms
Bag-of-SIFT
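A minimal sketch of a bag-of-words representation: each local descriptor is assigned to its nearest codeword and the assignments are counted into a normalized histogram. Hard assignment, the L1 normalization, and the tiny 2-D toy codebook are assumptions for illustration; the slides do not state the quantization scheme or codebook size.

```python
def nearest_codeword(descriptor, codebook):
    """Index of the codeword with the smallest squared Euclidean distance."""
    def sqdist(codeword):
        return sum((d - w) ** 2 for d, w in zip(descriptor, codeword))

    return min(range(len(codebook)), key=lambda i: sqdist(codebook[i]))


def bag_of_words(descriptors, codebook):
    """L1-normalized histogram of hard codeword assignments."""
    hist = [0.0] * len(codebook)
    for descriptor in descriptors:
        hist[nearest_codeword(descriptor, codebook)] += 1.0
    total = sum(hist)
    return [h / total for h in hist] if total else hist


# Toy 2-D example (real SIFT descriptors are 128-dimensional).
codebook = [[0.0, 0.0], [1.0, 1.0]]
descriptors = [[0.1, 0.0], [0.9, 1.0], [1.0, 0.8], [0.2, 0.1]]

if __name__ == "__main__":
    print(bag_of_words(descriptors, codebook))
```

The same counting step applies to other local features once a codebook for that feature is available.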
Bag of audio words / bag of frames: K. Lee and D. Ellis, Audio-Based Semantic Concept Classification for Consumer Video, IEEE Trans on Audio, Speech, and Language Processing, 2010
[Figure: average precision for Prior, MFCC, STIP, SIFT, SIFT+STIP, and SIFT+STIP+MFCC; axis from 0.1 to 0.9]
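For reference, a minimal sketch of how average precision can be computed from a ranked list of classifier scores. This is the common non-interpolated definition; whether the benchmark uses exactly this variant is an assumption, and the toy scores are made up.

```python
def average_precision(scores, labels):
    """Non-interpolated average precision of a score-ranked list.

    `labels` holds 1 for positive videos and 0 for negatives.
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, (_, is_positive) in enumerate(ranked, start=1):
        if is_positive:
            hits += 1
            precision_sum += hits / rank  # precision at this positive's rank
    return precision_sum / hits if hits else 0.0


if __name__ == "__main__":
    print(average_precision([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 0]))
```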
[Figure: machine vs. human precision in two panels, Precision @90% recall and Precision @59% recall; axis from 0.2 to 1]
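A minimal sketch of measuring precision at a fixed recall level from ranked scores, as in the two panels above. The function name, the tie handling (plain sort by score), and the toy data are assumptions for illustration.

```python
def precision_at_recall(scores, labels, target_recall):
    """Precision at the shortest ranked prefix whose recall reaches the target.

    `labels` holds 1 for positives and 0 for negatives; `target_recall` is a
    fraction in (0, 1].
    """
    n_positive = sum(labels)
    if n_positive == 0:
        raise ValueError("no positive labels")
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    for rank, (_, is_positive) in enumerate(ranked, start=1):
        hits += is_positive
        if hits / n_positive >= target_recall:
            return hits / rank
    raise ValueError("target_recall must be at most 1")


# Toy data (made up): five ranked videos, three of them positive.
scores = [0.9, 0.8, 0.7, 0.6, 0.5]
labels = [1, 0, 1, 1, 0]

if __name__ == "__main__":
    for target in (0.59, 0.90):
        print(target, precision_at_recall(scores, labels, target))
```

Fixing the recall level puts a ranked machine output and a set of yes/no human judgments on a comparable footing.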
Human Recognition Machine Recognition
wedding dance (93.3% vs. 92.9%); soccer (87.5% vs. 53.8%); cat (93.5% vs. 46.8%)
[Figure: example true positives and false positives, grouped as found by human & machine, found by human only, and found by machine only; some cells are n/a]
http://www.ee.columbia.edu/dvmm/CCV/ Fill out this …