Commercial Detection in Heterogeneous Video Streams Using Fused Multi-Modal and Temporal Features

Commercial Detection in Heterogeneous Video Streams Using Fused Multi-Modal and Temporal Features. Masami Mizutani (Fujitsu Labs. Ltd.), Shahram Ebadollahi (Columbia University), Shih-Fu Chang (Columbia University). IEEE ICASSP 2005, Philadelphia.


  1. Commercial Detection in Heterogeneous Video Streams Using Fused Multi-Modal and Temporal Features
     Masami Mizutani (Fujitsu Labs. Ltd.), Shahram Ebadollahi (Columbia University), Shih-Fu Chang (Columbia University)
     IEEE ICASSP 2005, Philadelphia, March 22, 2005

  2. Outline
     • Motivation & Previous Work
     • Our Proposed Method
         • Approach
         • Local and Global Features for Commercial Detection
         • Fusion
     • Experiment & Result
     • Conclusion

  3. Motivation
     • CM (commercial) detection: find CM and PG (program) boundaries in broadcast material
     • Applications:
         • CM skip capability on digital PVRs
         • Collecting CMs for marketing use
         • Preprocessing for further content analysis of PG, etc.
     What's the state of the art?

  4. Previous Work
     • Dublin City University (DCU) group ('01) [Marlow01]
         • Heuristics based on blank and silence detectors
     • Philips Research ('03) [Dimitrova03]
         • Use visual features (blank, scene change rate, text box location) from MPEG streams
         • Optimize the detection thresholds using a Genetic Algorithm
     • Carnegie Mellon University group ('04) [Hauptmann04]
         • Did not use the blank feature; focus on color and audio
         • Identical CMs are broadcast many times
         • Find repetitious video segments as CM candidates in video streams using SVMs in a hierarchical style

  5. Previous Work (cont'd)
     • Reasonable performance, but the test data are limited and vary between studies.
     • Blank frames are proven to be a powerful cue, but are not always present.
     • CMs are not repetitious in a heterogeneous data set.
     • We build a systematic method to fuse diverse features, including blank.
     • We validate the results using a large, diverse data set.

     Method       Accuracy (F1 %)   # Programs (# Genres)   Total data / CM   Fusion method
     DCU01        92                10 (a few?)             3.5h / 0.4h       Heuristics
     Philips03    89                24 (6 genres)           12h / 2.5h        Genetic Algorithm
     CMU04        91                10 (only news)          5h / 1.2h         Hierarchical SVMs
     Our Method   92                49 (6 genres)           36h / 9h          SVM + Duration HMM

  6. Our Approach
     • Classification problem over detected scene change points
         • A scene change detector works well on CM/PG boundaries (mostly hard cuts or fade in/out).
     • Use the pattern of multi-modal features in local windows located at scene change points.
         • 15 sec window: half the length of most CM clips
         • 120 sec window: for capturing the start/end of clips having blanks
     [Figure: a PG / CM / PG timeline with a 15 sec window and a 120 sec window located at a scene change, and the features extracted there: Blank Frame Rate (1 bin), Scene Change Rate (1 bin), Overlay Text Location (256 bins, 16x16), Audio (4 bins), Color (12 bins)]
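
The windowed counts described on this slide can be sketched as follows. This is only an illustration: the slide does not say how the windows are positioned, so the sketch assumes each window is centered on the scene change point, that event times are given in seconds, and that they are sorted. The function name `count_in_window` and the sample times are hypothetical.

```python
from bisect import bisect_left, bisect_right

def count_in_window(event_times, center, width):
    """Count events whose time lies in [center - width/2, center + width/2].

    event_times must be sorted in ascending order (seconds).
    Centering the window on the scene change is an assumption of this sketch.
    """
    lo = center - width / 2.0
    hi = center + width / 2.0
    return bisect_right(event_times, hi) - bisect_left(event_times, lo)

# Hypothetical detector outputs (seconds from the start of the stream).
scene_changes = [1.0, 5.0, 9.0, 12.0, 20.0, 70.0]
blank_frames = [9.5, 39.8, 69.9]

sc_rate = count_in_window(scene_changes, center=10.0, width=15.0)   # 15 sec window
bf_rate = count_in_window(blank_frames, center=10.0, width=120.0)   # 120 sec window
```

Sorting once and using binary search keeps each query logarithmic in the number of events, which matters when every scene change point in a long stream is a candidate.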

  7. Our Approach (cont'd)
     • Use not only local features but also a global temporal feature
         • CM and PG are interleaved in each program
         • The density and locations of CMs in the entire program stream depend on genre and broadcast source
     [Figure: a timeline of alternating PG and CM segments with boundary times $t_{i-1}$, $t_i$, $t_{i+1}$, $t_{i+2}$]
     [Figure: Example of distributions of inter-arrival time of CM segments. Panels: (a) All genres, (b) Sports, (c) Movie; axes: likelihood vs. t. Annotation: CMs arrive more quickly in sports than in movies.]

  8. Problem Formulation
     • Define two hidden states (CM, PG) at scene change points
     • Model them as a Markov chain with:
         • Duration feature: the duration of stay in a state
         • Fused local features: the observed content features in a state
     • Detection of CM/PG boundaries is formulated as the problem of inferring the optimal state sequence with the Duration Viterbi algorithm
     [Figure: a CM / PG / CM state sequence over scene change points along time t, annotated with the stay durations $d^{(CM)}$, $d^{(PG)}$ and the observations $f^{(CM)}$, $f^{(PG)}$. f: fused local features; d: duration of stay]
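
A minimal sketch of duration-based Viterbi decoding over scene change points is given below. It is not the authors' implementation. It assumes that the two states strictly alternate, that the fused posterior of CM at a scene change point scores the interval up to the next point, and that the stream starts and ends at a segment boundary. All names (`duration_viterbi`, `uniform_log_pdf`) and the toy numbers are hypothetical, and uniform duration models stand in for both states purely to keep the example short.

```python
import math

NEG_INF = float("-inf")

def uniform_log_pdf(lo, hi):
    """Log-density of a uniform duration model on [lo, hi] seconds."""
    def log_pdf(d):
        return -math.log(hi - lo) if lo <= d <= hi else NEG_INF
    return log_pdf

def duration_viterbi(times, p_cm, log_dur):
    """Best alternating CM/PG segmentation of the stream.

    times   : sorted scene change times; segment boundaries are chosen among them
    p_cm    : p_cm[m] is the fused posterior of CM for the interval
              [times[m], times[m+1]), so len(p_cm) == len(times) - 1
    log_dur : dict mapping state -> function(duration) -> log-density
    Returns a list of (start_time, end_time, state) tuples.
    """
    n = len(times)
    states = ("CM", "PG")
    other = {"CM": "PG", "PG": "CM"}
    # prefix[s][m] = sum of per-interval log scores of state s over intervals 0..m-1
    prefix = {}
    for s in states:
        acc = [0.0]
        for p in p_cm:
            acc.append(acc[-1] + math.log(p if s == "CM" else 1.0 - p))
        prefix[s] = acc
    # best[j][s]: best score of a segmentation of [times[0], times[j]]
    # whose last segment is in state s and ends at times[j].
    best = [{s: NEG_INF for s in states} for _ in range(n)]
    back = [{s: None for s in states} for _ in range(n)]
    for j in range(1, n):
        for s in states:
            for i in range(j):
                prev = 0.0 if i == 0 else best[i][other[s]]
                score = (prev + log_dur[s](times[j] - times[i])
                         + prefix[s][j] - prefix[s][i])
                if score > best[j][s]:
                    best[j][s], back[j][s] = score, i
    s = max(states, key=lambda st: best[n - 1][st])
    if best[n - 1][s] == NEG_INF:
        raise ValueError("no segmentation satisfies the duration models")
    # Backtrack from the end of the stream.
    segments, j = [], n - 1
    while j > 0:
        i = back[j][s]
        segments.append((times[i], times[j], s))
        j, s = i, other[s]
    return segments[::-1]

# Toy stream: a 90 sec commercial break inside a program.
times = [0, 300, 330, 360, 390, 700]
p_cm = [0.1, 0.9, 0.8, 0.9, 0.1]
log_dur = {"CM": uniform_log_pdf(15, 120), "PG": uniform_log_pdf(60, 600)}
segments = duration_viterbi(times, p_cm, log_dur)
```

The search is quadratic in the number of scene change points; bounding the maximum segment duration, as the duration models already do, would allow the inner loop to be cut short.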

  9. Modeling Duration of Stay
     • Duration of PG: Erlang mixture model
         • Erlang is better suited to fitting positive-valued samples [Vasconcelos00].
         • The mixture model is for fitting various genres.
         • The goodness of fit is confirmed by the Kolmogorov-Smirnov test.
     • Duration of CM: a uniform distribution
     • The models are bounded by their max and min in the training data.
     • The normalized actual duration of stay is considered.
     [Figure: P vs. normalized duration d (0 to 1). Duration of CM: uniform with height $1/(\max_{CM} - \min_{CM})$ between $\min_{CM}$ and $\max_{CM}$. Duration of PG: a density between $\min_{PG}$ and $\max_{PG}$.]
     Now, let's see feature extraction and fusion …
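
The two duration models named on this slide can be written down directly. The sketch below uses the standard Erlang density $f(d; k, \lambda) = \lambda^k d^{k-1} e^{-\lambda d} / (k-1)!$ and a weighted sum of such densities. The function names and the mixture parameters are made up for illustration, and the sketch ignores the min/max bounding and the normalization of durations mentioned on the slide.

```python
import math

def erlang_pdf(d, k, lam):
    """Erlang density with integer shape k >= 1 and rate lam > 0."""
    if d < 0:
        return 0.0
    return lam ** k * d ** (k - 1) * math.exp(-lam * d) / math.factorial(k - 1)

def erlang_mixture_pdf(d, components):
    """Mixture density; components is a list of (weight, k, lam), weights sum to 1."""
    return sum(w * erlang_pdf(d, k, lam) for w, k, lam in components)

def uniform_pdf(d, lo, hi):
    """Uniform density on [lo, hi], used here for the CM duration."""
    return 1.0 / (hi - lo) if lo <= d <= hi else 0.0

# Hypothetical two-component mixture for PG durations (e.g. two genres).
pg_components = [(0.6, 2, 10.0), (0.4, 5, 10.0)]
pg_density_at_03 = erlang_mixture_pdf(0.3, pg_components)
cm_density = uniform_pdf(30.0, 15.0, 120.0)
```

In practice the mixture weights, shapes, and rates would be fitted to training durations (for example with EM); that fitting step is outside this sketch.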

  10. Feature Extraction: Scene Change, Blank and Overlay Text
     • Use a scene change (SC) detector [Zhong02] and a simple blank frame (BF) detector
         • Features: # of SCs in 15 sec and # of BFs in 120 sec
     [Figure: a timeline t with a 15 sec window counting scene changes and a 120 sec window counting blank frames]
     • Use an overlay text location detector based on motion vectors and texture energy [Zhang03]
         • Detection results from every 5 frames are mapped onto a 2D grid of 16x16 bins (16 = 352 pix / 22 horizontally, 16 = 240 pix / 15 vertically)
         • Feature: location and frequency of overlay text appearing in 15 sec (256 bins)
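
One way to read the 16x16 grid mapping is sketched below: each detected text region increments every grid cell it touches, and the flattened grid is the 256-bin feature. The slide does not specify the detector's output format, so the bounding-box representation `(x0, y0, x1, y1)`, the function name, and the sample boxes are all assumptions.

```python
def text_grid_histogram(boxes, frame_w=352, frame_h=240, grid=16):
    """Accumulate text boxes onto a grid x grid layout, flattened row-major.

    boxes: iterable of (x0, y0, x1, y1) pixel rectangles, with x1 and y1 exclusive.
    With 352x240 frames each cell is 22 pixels wide and 15 pixels high.
    """
    cell_w = frame_w // grid
    cell_h = frame_h // grid
    hist = [0] * (grid * grid)
    for x0, y0, x1, y1 in boxes:
        first_col = x0 // cell_w
        last_col = min(grid - 1, (x1 - 1) // cell_w)
        first_row = y0 // cell_h
        last_row = min(grid - 1, (y1 - 1) // cell_h)
        for row in range(first_row, last_row + 1):
            for col in range(first_col, last_col + 1):
                hist[row * grid + col] += 1
    return hist

# Hypothetical detections collected over a 15 sec window.
boxes = [(0, 0, 44, 15), (330, 225, 352, 240)]
hist = text_grid_histogram(boxes)
```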

  11. Feature Extraction: Audio & Color
     • Audio (4 bins): use an HMM-based classifier using MFCC
         • Each 1 sec unit of audio is classified as one of {silence, speech, music, music/speech}
         • Feature: the counts of each class in 15 sec
     [Figure: a timeline t with a 15 sec window of 1 sec units at a scene change, and a 4-bin count histogram]
     • Color (12 bins): use the histogram of 12 predetermined palette colors of the shots in 15 sec [Wei04]
         • The palette color of each shot is determined from the 3 dominant colors of its keyframe.
         • The 12 palette colors divide the L*u*v space equally.
     [Figure: a timeline t with a 15 sec window of shot units at a scene change, and a 12-bin count histogram]

  12. Fuse Multi-Modal Features
     • Fuse into a single posterior probability in a late-fusion style (2 steps), due to the great diversity of the features
     • Use a local two-class (CM/PG) classifier for each modality
     • Find the posterior of CM using Bayes rule and a sigmoid function [Platt99]
     • Another SVM fuses the posteriors and finds the final posterior of CM

     Classifier #1: SC Rate (1 bin), Poisson, ML
     Classifier #2: BF Rate (1 bin), Poisson, ML
     Classifier #3: Audio (4 bins), SVM w/ RBF
     Classifier #4: Color (12 bins), SVM w/ RBF
     Classifier #5: Overlay Text (256 bins), SVM w/ RBF
     Fusion classifier: SVM w/ RBF, producing a fused feature that is fed to the Markov chain

     Conversion to a posterior:
       Bayes rule for ML: $P(CM \mid o) = \dfrac{1}{1 + \dfrac{P(o \mid PG)\,P(PG)}{P(o \mid CM)\,P(CM)}}$
       Sigmoid function for SVM: $P(CM \mid o) \approx f(x) = \dfrac{1}{1 + e^{\alpha x + \beta}}$
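
The two posterior conversions on this slide can be sketched in a few lines. The Poisson likelihood follows the "(Poisson, ML)" label of classifiers #1 and #2; the rates, prior, and sigmoid parameters below are invented for illustration, and x is assumed to be the raw SVM decision value.

```python
import math

def poisson_pmf(count, rate):
    """Poisson probability of observing `count` events at the given rate."""
    return rate ** count * math.exp(-rate) / math.factorial(count)

def posterior_cm_bayes(count, rate_cm, rate_pg, prior_cm):
    """P(CM | o) = 1 / (1 + P(o|PG) P(PG) / (P(o|CM) P(CM)))."""
    pg_term = poisson_pmf(count, rate_pg) * (1.0 - prior_cm)
    cm_term = poisson_pmf(count, rate_cm) * prior_cm
    return 1.0 / (1.0 + pg_term / cm_term)

def posterior_cm_sigmoid(x, alpha, beta):
    """P(CM | o) ~ f(x) = 1 / (1 + exp(alpha * x + beta))."""
    return 1.0 / (1.0 + math.exp(alpha * x + beta))

# Hypothetical parameters: CMs average 6 scene changes per window, PG averages 2.
p_from_counts = posterior_cm_bayes(count=8, rate_cm=6.0, rate_pg=2.0, prior_cm=0.5)
p_from_svm = posterior_cm_sigmoid(x=1.0, alpha=-2.0, beta=0.0)
```

With this sign convention a negative alpha makes the posterior grow with the decision value; alpha and beta would normally be fitted on held-out data.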

  13. Experimental Data Set
     • Heterogeneous data set:
         • 49 programs from 6 US local/national channels
         • 6 genres: News, Drama, Animation, Entertainment, Movie, Sports
         • 36 hrs in total, including 9 hrs of commercials
     • Starts of CM and PG are labeled manually
     • 3-fold cross validation (training, validation, testing)

     [Table: recorded programs per channel and date, listed in broadcast order]
     Evening, half-hour slots from 6:00PM to 11:30PM:
       WB11 (Fri. 3/12/04): DRAMA (SitCom) x8, INFO (D/N), DRAMA (SitCom) x2
       UPN9 (Sat. 3/13/04): DRAMA (SitCom) x4, MOVIE, INFO (D/N), DRAMA (SitCom), ENT (Gossip)
       FOX5 (Sun. 3/14/04): INFO, ANIME, ANIME, DRAMA, ANIME, DRAMA, DRAMA, DRAMA, INFO, DRAMA, DRAMA; sub-genres shown: D/N, SitCom, SitCom, SitCom, Daily News / Sports News, SitCom, SitCom
       NBC (Tue. 3/16/04): INFO, INFO, INFO, INFO, DRAMA x5, INFO, ENT; sub-genres shown: D/N, Politics/National, Others, Others, SitCom, SitCom, SitCom, D/N, Talk Show
     Afternoon, half-hour slots from 12:00PM to 5:30PM:
       ABC7 (Mon. 3/15/04): INFO (D/N), ENT (Quiz), DRAMA x3, ENT (Talk show), INFO (D/N)
       CBS2 (Thurs. 3/18/04): INFO, SPORTS EVENT (Basketball Tournament), INFO (D/N)

  14. Performance Metric
     • F1 [Dimitrova03] for counting correctly classified boundaries
         • Each scene change point is a candidate, with a label of positive (CM) or negative (PG).
         • Higher is better, but it cannot deal with short errors.
       $F1 = 2PR / (P + R)$
       $R = TP / (TP + FN)$ … Recall
       $P = TP / (TP + FP)$ … Precision
     [Figure: Ground Truth and Detection Result timelines (PG / CM / PG) over time t; comparing them gives, in order, the regions TN, FP, TP, FN, TN]
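
As a small worked example of these formulas, the sketch below counts TP, FP, and FN over per-scene-change labels and computes precision, recall, and F1. The label sequences mirror the TN, FP, TP, FN, TN pattern of the figure; the function name is hypothetical.

```python
def precision_recall_f1(truth, pred, positive="CM"):
    """Precision, recall, and F1 over paired label sequences."""
    tp = sum(1 for t, p in zip(truth, pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(truth, pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(truth, pred) if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    return precision, recall, 2.0 * precision * recall / (precision + recall)

truth = ["PG", "PG", "CM", "CM", "PG"]
pred = ["PG", "CM", "CM", "PG", "PG"]   # TN, FP, TP, FN, TN
precision, recall, f1 = precision_recall_f1(truth, pred)
```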

  15. Performance Metric (cont'd)
     • WindowDiff [Pevzner02] to measure discrepancies between the ground truth (ref.) and the detection result (hyp.)
         • Widely used for text segmentation.
         • Lower is better.
       $WD(ref, hyp) = \dfrac{1}{N-k} \sum_{i=1}^{N-k} \left( \left| b(ref_i, ref_{i+k}) - b(hyp_i, hyp_{i+k}) \right| > 0 \right)$
       $N$: # of shots in the entire stream
       $k$: avg. number of shots in PG and CM segments
       $b(i, j)$: # of PG and CM boundaries between positions i and j
     [Figure: Ref and Hyp sequences of N shots with a window spanning positions i to i+k; legend: scene change shot, PG/CM boundary]
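
The WindowDiff formula above can be sketched over per-shot label sequences as follows. A boundary is taken to lie between two consecutive shots whose labels differ, and k is passed in explicitly rather than derived from the reference; both choices, the function names, and the sample sequences are assumptions of this sketch.

```python
def count_boundaries(labels, i, j):
    """Number of PG/CM boundaries between shot positions i and j (i < j)."""
    return sum(1 for m in range(i, j) if labels[m] != labels[m + 1])

def window_diff(ref, hyp, k):
    """Fraction of length-k windows in which ref and hyp disagree on the boundary count."""
    n = len(ref)
    if len(hyp) != n or not 0 < k < n:
        raise ValueError("ref and hyp must have equal length and 0 < k < len(ref)")
    errors = sum(
        1 for i in range(n - k)
        if count_boundaries(ref, i, i + k) != count_boundaries(hyp, i, i + k)
    )
    return errors / (n - k)

ref = ["PG", "PG", "PG", "CM", "CM", "CM", "PG", "PG", "PG", "PG"]
hyp = ["PG", "PG", "PG", "PG", "CM", "CM", "PG", "PG", "PG", "PG"]
wd = window_diff(ref, hyp, k=3)
```

Here the hypothesis misplaces the first boundary by one shot, so only the windows that contain one version of that boundary but not the other are penalized, which is how the metric registers short errors.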
