

  1. Consumer Video Understanding: A Benchmark Database + An Evaluation of Human & Machine Performance
     Yu-Gang Jiang, Guangnan Ye, Shih-Fu Chang, Daniel Ellis (Columbia University), Alexander C. Loui (Kodak Research)
     ACM ICMR 2011, Trento, Italy, April 2011

  2. We (consumers) take photos and videos every day, everywhere...
     Image: Barack Obama rally, Texas, 2008. http://www.paulridenour.com/Obama14.JPG

  3. What are Consumer Videos?
     • Original unedited videos captured by ordinary consumers
       – Interesting and very diverse contents
       – Very weakly indexed: about 3 tags per consumer video on YouTube, vs. 9 tags that each YouTube video has on average
       – Original audio tracks are preserved; good for audio-visual joint analysis
       – …

  4. Part I: A Database
     Columbia Consumer Video (CCV) Database

  5. Columbia Consumer Video (CCV) Database: the 20 categories
     Basketball, Baseball, Soccer, Ice Skating, Skiing, Swimming, Biking, Cat, Dog, Bird, Graduation, Birthday Celebration, Wedding Reception, Wedding Ceremony, Wedding Dance, Music Performance, Non-music Performance, Parade, Beach, Playground

  6. CCV Snapshot
     • # videos: 9,317 (210 hrs in total)
     • Video genre: unedited consumer videos
     • Video source: YouTube.com
     • Average length: 80 seconds
     • # defined categories: 20
     • Annotation method: Amazon Mechanical Turk
     [Bar chart: number of videos per category, roughly 100 to 800 each, over the 20 categories.]
     The trick for digging out consumer videos from YouTube: use the default filename prefix of many digital cameras in the query, e.g. “MVI and parade”.

  7. Existing Databases vs. the CCV Database
     • Human action recognition: KTH & Weizmann (constrained environment), 2004-05
       – CCV: unconstrained YouTube videos
     • Hollywood Database (12 categories, movies), 2008, and UCF Database (50 categories, YouTube videos), 2010
       – CCV: higher-level complex events
     • Kodak Consumer Video (25 classes, 1,300+ videos), 2007
       – CCV: more videos & better-defined categories
     • LabelMe Video (many classes, 1,300+ videos), 2009
       – CCV: more videos & larger content variations
     • TRECVID MED 2010 (3 classes, 3,400+ videos), 2010
       – CCV: more videos & categories

  8. Crowdsourcing: Amazon Mechanical Turk
     • A web services API that allows developers to easily integrate human intelligence directly into their processing
     • Example task: Is this a “parade” video? o Yes o No
     • Workflow: task + financial reward ($?.??) → Internet-scale workforce

  9. MTurk: Annotation Interface
     • Reward: $0.02 per HIT
     • Reliability of labels: each video was assigned to four MTurk workers

  10. Part II: …not Just A Database
      An Evaluation of Human & Machine Performance

  11. Human Recognition Performance
      • How to measure human (MTurk worker) recognition accuracy?
        – We manually and carefully labeled 896 videos: golden ground truth!
      • Consolidation of the 4 sets of labels
        [Bar chart: precision and recall of the consolidated labels under 1-vote, 2-vote, 3-vote, and 4-vote thresholds.]
      • Plus additional manual filtering of 6 positive sample sets: 94% final precision
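
      As a minimal sketch of how such a vote-threshold consolidation could be scored against the gold labels (the array shapes, toy labels, and helper names are illustrative, not the authors' code):

      ```python
      import numpy as np

      def consolidate_votes(worker_labels, min_votes):
          """Mark a video positive if at least `min_votes` of the
          4 MTurk workers labeled it positive.

          worker_labels: (num_workers, num_videos) array of 0/1 labels.
          """
          return (worker_labels.sum(axis=0) >= min_votes).astype(int)

      def precision_recall(pred, gold):
          tp = np.sum((pred == 1) & (gold == 1))
          return tp / max(pred.sum(), 1), tp / max(gold.sum(), 1)

      # Toy example: 4 workers, 6 videos, plus a manually verified gold set.
      labels = np.array([[1, 0, 1, 1, 0, 1],
                         [1, 0, 1, 0, 0, 1],
                         [0, 0, 1, 1, 0, 1],
                         [1, 1, 1, 0, 0, 0]])
      gold = np.array([1, 0, 1, 1, 0, 1])

      for k in range(1, 5):  # the slide's 1-vote ... 4-votes thresholds
          p, r = precision_recall(consolidate_votes(labels, k), gold)
          print(f"{k}-votes: precision={p:.2f} recall={r:.2f}")
      ```

      Raising the vote threshold trades recall for precision, which is the trend the slide's chart reports.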

  12. Human Recognition Performance (cont.)
      [Two bar charts: per-worker precision and recall, (a) with workers sorted by number of submitted HITs, (b) with workers sorted by average labeling time per HIT; labeling time in seconds is shown on top of the bars.]

  13. Confusion Matrices
      [Confusion matrices: ground-truth labels vs. human recognition.]

  14. Machine Recognition System
      Pipeline: feature extraction (SIFT, spatial-temporal interest points, MFCC audio feature) → average χ² kernel SVM classifier → late fusion
      Yu-Gang Jiang, Xiaohong Zeng, Guangnan Ye, Subh Bhattacharya, Dan Ellis, Mubarak Shah, Shih-Fu Chang, “Columbia-UCF TRECVID2010 Multimedia Event Detection: Combining Multiple Modalities, Contextual Concepts, and Temporal Matching”, NIST TRECVID Workshop, 2010.
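
      The pipeline can be sketched roughly as below. This is a minimal illustration under stated assumptions, not the authors' code: it assumes exponentiated χ² kernels, one per modality, combined by simple averaging before a precomputed-kernel SVM (one reading of "average χ² kernel" plus fusion); the mean-distance kernel width is a common heuristic, and all data here is random.

      ```python
      import numpy as np
      from sklearn.svm import SVC

      def chi2_dist(X, Y):
          """Pairwise chi-square distances: d(x,y) = sum_i (x_i-y_i)^2/(x_i+y_i)."""
          return np.array([[np.sum((x - y) ** 2 / (x + y + 1e-10)) for y in Y]
                           for x in X])

      def fused_chi2_kernels(train_feats, test_feats):
          """Average the per-modality exponentiated chi-square kernels
          (e.g. one each for bag-of-SIFT, bag-of-STIP, bag-of-MFCC)."""
          K_tr, K_te = [], []
          for X_tr, X_te in zip(train_feats, test_feats):
              D = chi2_dist(X_tr, X_tr)
              gamma = D.mean()  # common kernel-width heuristic
              K_tr.append(np.exp(-D / gamma))
              K_te.append(np.exp(-chi2_dist(X_te, X_tr) / gamma))
          return np.mean(K_tr, axis=0), np.mean(K_te, axis=0)

      # Toy example: 3 modalities of random L1-normalized histograms.
      rng = np.random.default_rng(0)
      train_feats = [rng.random((40, 100)) for _ in range(3)]
      test_feats = [rng.random((10, 100)) for _ in range(3)]
      train_feats = [X / X.sum(1, keepdims=True) for X in train_feats]
      test_feats = [X / X.sum(1, keepdims=True) for X in test_feats]
      y = rng.integers(0, 2, 40)  # binary event labels

      K_tr, K_te = fused_chi2_kernels(train_feats, test_feats)
      clf = SVC(kernel="precomputed", C=1.0).fit(K_tr, y)
      scores = clf.decision_function(K_te)  # per-video detection scores
      ```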

  15. Best Performance in the TRECVID-2010 Multimedia Event Detection (MED) Task
      • Run 1: Run 2 + “Batter” reranking
      • Run 2: Run 3 + scene/audio/action context
      • Run 3: Run 6 + EMD temporal matching
      • Run 4: Run 6 + scene/audio/action context
      • Run 5: Run 6 + scene/audio context
      • Run 6: baseline classification with 3 features
      [Bar chart: mean minimal normalized cost (lower is better, scale roughly 0 to 1.40) for runs r1 through r6.]

  16. Three Audio-Visual Features
      • SIFT (visual): D. Lowe, IJCV ’04
      • STIP (visual): I. Laptev, IJCV ’05
      • MFCC (audio): extracted over consecutive 16 ms audio frames

  17. Bag-of-X Representation (X = SIFT / STIP / MFCC)
      • Soft weighting (Jiang, Ngo and Yang, ACM CIVR 2007)
      • [Diagram: bag-of-SIFT construction.]
      • Bag of audio words / bag of frames: K. Lee and D. Ellis, “Audio-Based Semantic Concept Classification for Consumer Video”, IEEE Trans. on Audio, Speech, and Language Processing, 2010
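
      A rough sketch of the soft-weighting idea from the cited CIVR 2007 paper: instead of hard-assigning each local descriptor to its single nearest codeword, each descriptor votes for its top-K nearest codewords with geometrically decaying weights. The 1/2^(i-1) decay and K=4 follow my reading of that paper; treat the details as assumptions.

      ```python
      import numpy as np

      def soft_bow(descriptors, codebook, top_k=4):
          """Soft-weighted bag-of-words histogram.

          Each descriptor votes for its top_k nearest codewords; the i-th
          nearest word receives weight 1 / 2**(i-1), i = 1..top_k.

          descriptors: (n, d) local features (SIFT/STIP/MFCC vectors).
          codebook:    (vocab_size, d) cluster centers, e.g. from k-means.
          """
          # Euclidean distance from every descriptor to every codeword.
          dists = np.linalg.norm(
              descriptors[:, None, :] - codebook[None, :, :], axis=2)
          nearest = np.argsort(dists, axis=1)[:, :top_k]  # (n, top_k)
          hist = np.zeros(len(codebook))
          for rank in range(top_k):
              # Accumulate 1/2**rank into each descriptor's rank-th word.
              np.add.at(hist, nearest[:, rank], 1.0 / 2 ** rank)
          return hist / max(hist.sum(), 1e-12)  # L1-normalize
      ```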

  18. Machine Recognition Accuracy
      • Measured by average precision
      • SIFT works the best for event detection
      • The 3 features are highly complementary!
      [Bar chart: per-category average precision (scale 0 to 0.9) for Prior, MFCC, STIP, SIFT, SIFT+STIP, and SIFT+STIP+MFCC.]
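
      For reference, non-interpolated average precision over a ranked detection list can be computed as below (a standard definition, not code from the paper; it assumes at least one positive label):

      ```python
      import numpy as np

      def average_precision(scores, labels):
          """Mean of precision@k over the ranks k where a positive occurs."""
          order = np.argsort(-np.asarray(scores))      # rank by descending score
          labels = np.asarray(labels)[order]
          hits = np.cumsum(labels)                     # positives seen at each rank
          precisions = hits / np.arange(1, len(labels) + 1)
          return precisions[labels == 1].mean()

      # Toy ranked list (1 = relevant): AP = (1/1 + 2/3) / 2 = 0.833...
      print(average_precision([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 0]))
      ```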

  19. Human vs. Machine
      • Humans have much better recall, and are much better on non-rigid objects
      • The machine is close to humans on top-of-the-list precision
      [Bar charts: precision at 90% recall and precision at 59% recall, machine vs. human.]
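
      The precision-at-fixed-recall numbers in those charts can be reproduced from ranked scores along these lines (again an illustrative helper, not the authors' evaluation code):

      ```python
      import numpy as np

      def precision_at_recall(scores, labels, target_recall):
          """Precision at the shallowest rank where recall first reaches target_recall."""
          order = np.argsort(-np.asarray(scores))
          labels = np.asarray(labels)[order]
          hits = np.cumsum(labels)
          recalls = hits / max(labels.sum(), 1)
          # First rank whose recall reaches the target (clamped to the list end).
          k = min(np.searchsorted(recalls, target_recall), len(labels) - 1)
          return hits[k] / (k + 1)
      ```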

  20. Human vs. Machine: Confusion Matrices
      [Confusion matrices: human recognition vs. machine recognition.]

  21. Human vs. Machine: Result Examples
      [Example grid: true positives found by both human & machine, by human only, and by machine only; false positives found by human only and by machine only.]
      • Wedding dance: 93.3% (human) vs. 92.9% (machine)
      • Soccer: 87.5% vs. 53.8%
      • Cat: 93.5% vs. 46.8%

  22. Download
      • Unique YouTube video IDs
      • Labels
      • Training/test partition
      • Three audio/visual features
      http://www.ee.columbia.edu/dvmm/CCV/
      Fill out this …

  23. Thank you!
