Commercial Detection in Heterogeneous Video Streams Using Fused Multi-Modal and Temporal Features
Masami Mizutani (Fujitsu Laboratories Ltd.), Shahram Ebadollahi (Columbia University), Shih-Fu Chang (Columbia University)
IEEE ICASSP 2005, Philadelphia
Outline
Motivation & Previous Work
Our Proposed Method
  Approach
  Local and Global Features for Commercial Detection
  Fusion
Experiment & Results
Conclusion
Motivation
CM (commercial) detection
Find CM and PG (program) boundaries in broadcast material
Applications:
CM-skip capability on digital PVRs, collecting CMs for marketing use, preprocessing for further content analysis of PGs, etc.
What’s the state of the art?
Previous Work
Dublin City University Group ('01) [Marlow01]
Heuristics using blank and silence detectors
Philips Research ('03) [Dimitrova03]
Use visual features (blank, scene change rate, text box location) from MPEG streams
Optimize the detection thresholds using a Genetic Algorithm
Carnegie Mellon University Group ('04) [Hauptmann04]
Did not use the blank feature; focused on color and audio
Identical CMs are broadcast many times
Find repetitious video segments in the stream as CM candidates, using SVMs in a hierarchical style
Previous Work (cont’d)
Reasonable performance, but the test data are limited and varied
Blank frames are proven powerful, but are not always present
CMs are not repetitious in a heterogeneous data set
Method                     Accuracy (F1 %)   # Programs (# Genres)   Total data / CM   Fusion Method
DCU01                      92                10 (a few?)             3.5h / 0.4h       Heuristics
Philips03                  89                24 (6 genres)           12h / 2.5h        Genetic Algorithm
CMU04                      91                10 (only news)          5h / 1.2h         Hierarchical SVMs
Our Method                 92                49 (6 genres)           36h / 9h          SVM + Duration HMM
We build a systematic method to fuse diverse features, including blank
We validate the results using a large, diverse data set
Our Approach
Classification problem of detected scene change points
The scene change detector works well at CM/PG boundaries (mostly hard cuts or fade in/out)
Use the pattern of multi-modal features in local windows located at scene change points (a counting sketch follows the figure below)
15-sec window: half the length of most CM clips
120-sec window: for capturing the starts/ends of clips having blanks
[Figure: Local feature windows at a scene change point on a PG-CM-PG timeline. A 15-sec window and a 120-sec window are placed at each scene change; extracted features: Blank Frame Rate (1 bin), Scene Change Rate (1 bin), Audio (4 bins), Color (12 bins), Overlay Text Location (16x16 = 256 bins).]
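A minimal sketch of the window-based counting, assuming windows centered at the scene change point (the paper's exact window placement may differ); the function names and timestamps below are hypothetical:

```python
import bisect

def count_in_window(event_times, center, half_width):
    """Count events (e.g., blank frames or scene changes) inside
    the window [center - half_width, center + half_width]."""
    lo = bisect.bisect_left(event_times, center - half_width)
    hi = bisect.bisect_right(event_times, center + half_width)
    return hi - lo

# Hypothetical event lists (timestamps in seconds, sorted).
scene_changes = [2.0, 14.5, 30.1, 31.0, 45.2, 90.7]
blank_frames  = [29.9, 30.05, 60.3, 121.0]

# Local features attached to the scene change at t = 30.1 s:
t = 30.1
sc_rate = count_in_window(scene_changes, t, 7.5)    # 15-sec window
bf_rate = count_in_window(blank_frames,  t, 60.0)   # 120-sec window
print(sc_rate, bf_rate)
```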
Our Approach (cont’d)
Use not only local features but also a global temporal feature
CM and PG are interleaved in each program
The density and locations of CMs in the entire program stream depend on genres and broadcast sources
[Figure: Timeline of interleaved CM and PG segments with scene change times t_{i-1}, t_i, t_{i+1}, t_{i+2}, and example likelihood distributions of the inter-arrival time of CM segments for (a) all genres, (b) sports, (c) movies. CMs recur more quickly in sports than in movies.]
Problem Formulation
Define two hidden states (CM, PG) at scene change points
Model them as Markov Chain with:
Duration feature: duration of stay in a state
Fused local features: observed content features at a state
Detection of CM/PG boundaries is formulated as the problem of inferring the optimal state sequence with a Duration Viterbi algorithm (a decoding sketch follows the figure below)
[Figure: Two-state Markov chain over scene change points with states CM and PG, duration models d(CM) and d(PG), and fused local features f observed at each scene change.]
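A simplified sketch of duration-based (segmental) Viterbi decoding over scene change points; this illustrative two-state decoder and its inputs are hypothetical, not the authors' implementation:

```python
import math

def duration_viterbi(obs_loglik, dur_loglik, max_dur):
    """Segmental (duration) Viterbi over scene change points.

    obs_loglik[t][s]: log-likelihood of the fused feature at point t under state s
    dur_loglik(s, d): log-likelihood of staying d consecutive points in state s
    States: 0 = PG, 1 = CM (hypothetical encoding).
    """
    T = len(obs_loglik)
    best = [[-math.inf, -math.inf] for _ in range(T + 1)]
    back = [[None, None] for _ in range(T + 1)]   # (segment start, previous state)
    best[0] = [0.0, 0.0]                          # empty prefix may precede either state
    for t in range(1, T + 1):
        for s in (0, 1):
            for d in range(1, min(max_dur, t) + 1):
                emit = sum(obs_loglik[i][s] for i in range(t - d, t))
                score = best[t - d][1 - s] + dur_loglik(s, d) + emit
                if score > best[t][s]:
                    best[t][s] = score
                    back[t][s] = (t - d, 1 - s)
    # Backtrack the segment labels.
    s = 0 if best[T][0] >= best[T][1] else 1
    labels, t = ["PG"] * T, T
    while t > 0:
        start, prev_s = back[t][s]
        labels[start:t] = (["PG"] if s == 0 else ["CM"]) * (t - start)
        t, s = start, prev_s
    return labels

# Tiny usage example with made-up log-likelihoods and a toy duration penalty:
obs = [[-0.2, -1.8], [-0.3, -1.5], [-1.6, -0.4], [-1.7, -0.3], [-0.2, -1.9]]
print(duration_viterbi(obs, lambda s, d: -0.5 * d, max_dur=3))
```

Bounding the segment length by max_dur and alternating states between adjacent segments keeps the decoding tractable while letting the duration model score each stay as a whole.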
Modeling Duration of Stay
Duration of PG: Erlang Mixture Model
Erlang is better for fitting positive samples [Vasconcelos00]
The mixture model is for fitting various genres; the fit is confirmed by a Kolmogorov-Smirnov test
Duration of CM: a uniform distribution
Both models are bounded by the max & min durations in the training data
The normalized actual duration of stay is considered (a density sketch follows the figure below)
Now, let’s see feature extraction and fusion …
[Figure: Duration models. PG duration: Erlang mixture supported on [minPG, maxPG]; CM duration: uniform density 1/(maxCM - minCM) on [minCM, maxCM].]
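A minimal sketch of evaluating these duration densities; the mixture parameters and bounds are made up, and the sketch skips renormalizing the truncated Erlang mixture:

```python
import math

def erlang_pdf(x, k, lam):
    """Erlang density with integer shape k and rate lam."""
    return lam**k * x**(k - 1) * math.exp(-lam * x) / math.factorial(k - 1)

def pg_duration_likelihood(d, components, d_min, d_max):
    """Erlang mixture for PG durations, bounded to the training range.
    components: list of (weight, shape, rate) tuples (hypothetical values)."""
    if not (d_min <= d <= d_max):
        return 0.0
    return sum(w * erlang_pdf(d, k, lam) for w, k, lam in components)

def cm_duration_likelihood(d, d_min, d_max):
    """Uniform density for CM durations over the training range."""
    return 1.0 / (d_max - d_min) if d_min <= d <= d_max else 0.0

# Example with made-up parameters (durations in seconds):
pg_mix = [(0.6, 3, 0.01), (0.4, 5, 0.005)]
print(pg_duration_likelihood(600.0, pg_mix, 120.0, 3000.0))
print(cm_duration_likelihood(30.0, 10.0, 240.0))
```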
Feature Extraction: Scene Change, Blank and Overlay Text
Use a scene change (SC) detector [Zhong02] and a simple blank frame (BF) detector
# of SCs in 15 sec and # of BFs in 120 sec
Use an overlay text location detector based on motion vectors and texture energy [Zhang03]
Detection results from every 5th frame are mapped onto a 2D grid (16x16 bins), capturing the location and frequency of overlay texts appearing in 15 sec (a binning sketch follows the figure below)
[Figure: # of SCs counted in the 15-sec window and # of BFs in the 120-sec window around a scene change; overlay text detections mapped onto a 16x16 grid (16 = 352 pix / 22, 16 = 240 pix / 15), i.e., 256 bins.]
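A minimal sketch of the 16x16 spatial binning; the box format is hypothetical, and binning each detection by its box center is a simplification (the detector may instead mark every cell a box covers):

```python
def text_location_histogram(text_boxes, frame_w=352, frame_h=240, grid=16):
    """Accumulate overlay-text detections into a 16x16 spatial histogram
    (256 bins). Each box is (x, y, w, h) in pixels (hypothetical format)."""
    cell_w, cell_h = frame_w / grid, frame_h / grid   # ~22 x 15 pixels per cell
    hist = [0] * (grid * grid)
    for x, y, w, h in text_boxes:
        cx, cy = x + w / 2.0, y + h / 2.0             # box center
        col = min(int(cx // cell_w), grid - 1)
        row = min(int(cy // cell_h), grid - 1)
        hist[row * grid + col] += 1
    return hist

# Detections sampled every 5 frames within the 15-sec window (made-up boxes).
boxes = [(10, 200, 120, 20), (12, 202, 118, 20), (200, 10, 80, 18)]
print(sum(text_location_histogram(boxes)))  # 3 detections binned
```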
Feature Extraction: Audio & Color
Audio (4 bins): use an HMM-based classifier on MFCC features
Each 1 sec of audio is classified into {silence, speech, music, music/speech}; count each class within 15 sec
Color (12 bins): use the histogram of 12 predetermined palette colors over the shots in 15 sec [Wei04]
The palette color of each shot is determined from the 3 dominant colors of its keyframe
[Figure: Audio class counts per 1-sec unit and palette color counts per shot within the 15-sec window. The 12 palette colors equally divide the L*u*v color space.]
Fuse Multi-Modal Features
[Figure: Late fusion architecture. BF Rate (1 bin) and SC Rate (1 bin) feed Classifiers #1 and #2 (Poisson, ML); Audio (4 bins), Color (12 bins), and Overlay Text (256 bins) feed Classifiers #3-#5 (SVM w/ RBF); a final SVM (w/ RBF) fuses their outputs.]
Fuse into a single posterior probability in a late fusion (2-step) style, due to the great diversity of the features
Use a local two-class (CM/PG) classifier for each modality
Find the posterior of CM using a sigmoid function for the SVM outputs [Platt99] and Bayes rule for the ML classifiers (a conversion and fusion sketch follows the equations below)
Another SVM fuses the posteriors and finds the final posterior of CM; this fused feature is fed to the Markov chain
Sigmoid function for SVM outputs:
$P(\mathrm{CM} \mid x) \approx f(x) = \frac{1}{1 + e^{\alpha x + \beta}}$

Bayes rule for the ML classifiers:
$P(\mathrm{CM} \mid o) = \frac{1}{1 + \frac{P(o \mid \mathrm{PG})\,P(\mathrm{PG})}{P(o \mid \mathrm{CM})\,P(\mathrm{CM})}}$
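A rough illustration of the two conversions above and the second-stage fusion; the parameter values, priors, and the `fusion_svm` object are hypothetical, and in the real system alpha, beta and the fusion SVM are fitted on validation data:

```python
import math

def svm_posterior(score, alpha, beta):
    """Platt-style sigmoid mapping an SVM score to P(CM | x).
    alpha, beta are fitted on held-out data (values here are made up)."""
    return 1.0 / (1.0 + math.exp(alpha * score + beta))

def ml_posterior(lik_cm, lik_pg, prior_cm=0.5):
    """Bayes rule turning class likelihoods (e.g., Poisson) into P(CM | o)."""
    prior_pg = 1.0 - prior_cm
    return 1.0 / (1.0 + (lik_pg * prior_pg) / (lik_cm * prior_cm))

# The per-modality posteriors form the input vector of the fusion SVM:
posteriors = [
    ml_posterior(0.12, 0.03),          # blank-frame rate (Poisson likelihoods)
    ml_posterior(0.20, 0.15),          # scene-change rate (Poisson likelihoods)
    svm_posterior(-0.8, -1.7, 0.1),    # audio SVM score
    svm_posterior(0.3, -1.7, 0.1),     # color SVM score
    svm_posterior(1.1, -1.7, 0.1),     # overlay-text SVM score
]
print(posteriors)
# fused = fusion_svm.predict_proba([posteriors])  # hypothetical second-stage SVM
```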
Experimental Data Set
Heterogeneous data set:
49 programs from 6 US local/national channels
Including 6 genres: News, Drama, Animation, Entertainment, Movie, Sports
36 hours in total, including 9 hours of commercials
Starts of CM and PG segments are labeled manually
3-fold cross validation (training, validation, testing)
[Table: Recording schedule of the 49 programs. Evening slots (6:00PM-11:30PM): WB11 (Fri 3/12/04), UPN9 (Sat 3/13/04), FOX5 (Sun 3/14/04), NBC (Tue 3/16/04); afternoon slots (12:00PM-5:30PM): ABC7 (Mon 3/15/04), CBS2 (Thu 3/18/04). Half-hour slots span INFO (daily news, politics, sports news), DRAMA (mostly sitcoms), MOVIE, ANIME, ENT (quiz, gossip, talk shows), and a SPORTS event (basketball tournament).]
Performance Metric
F1 [Dimitrova03] for counting correctly classified boundaries (a small computation example follows the formulas below)
Each scene change point is a candidate, labeled positive (CM) or negative (PG)
Higher is better, but it cannot account for short errors
[Figure: Ground truth vs. detection result timelines, with scene change points scored as TN, FP, TP, FN.]
$R = \frac{TP}{TP + FN}$ (Recall),  $P = \frac{TP}{TP + FP}$ (Precision),  $F_1 = \frac{2PR}{P + R}$
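A minimal sketch of the metric computation over the scene-change candidates (the counts are made up):

```python
def f1_from_counts(tp, fp, fn):
    """Precision, recall, and F1 over scene-change points labeled CM (positive)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Example counts: 94 true positives, 15 false positives, 6 misses.
print(f1_from_counts(94, 15, 6))
```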
Performance Metric (cont’d)
WindowDiff [Pevzner02] measures discrepancies between the ground truth (ref) and the detection result (hyp); a computation sketch follows the formula below
Widely used for text segmentation. Lower is better.
$WD(ref, hyp) = \frac{1}{N - k} \sum_{i=1}^{N-k} \mathbf{1}\big( \lvert b(ref_i, ref_{i+k}) - b(hyp_i, hyp_{i+k}) \rvert > 0 \big)$

N: # of shots in the entire stream; k: avg. number of shots in PG and CM segments; b(i, j): # of PG/CM boundaries between positions i and j

[Figure: Reference (Ref) and hypothesis (Hyp) sequences of scene change shots with PG/CM boundaries, compared over a sliding window of k shots.]
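A minimal sketch of WindowDiff over boundary-indicator sequences; the exact convention for counting boundaries between positions i and i+k may differ slightly from the paper:

```python
def window_diff(ref_bounds, hyp_bounds, n_shots, k):
    """WindowDiff between reference and hypothesis boundary indicators.
    ref_bounds / hyp_bounds: length-n_shots lists with 1 where a PG/CM
    boundary occurs; k: average segment length in shots."""
    errors = 0
    for i in range(n_shots - k):
        b_ref = sum(ref_bounds[i:i + k])   # boundaries in the window starting at i
        b_hyp = sum(hyp_bounds[i:i + k])
        if b_ref != b_hyp:
            errors += 1
    return errors / (n_shots - k)

# Toy example: one shifted boundary in a 12-shot stream, window of 3 shots.
ref = [0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0]
hyp = [0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0]
print(window_diff(ref, hyp, len(ref), 3))
```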
Experimental Result
Shows the importance of the duration feature
Point Decision: using all features is best
But it produces many short erroneous fragmentations (see the WD rates)
Viterbi: drastically reduces the fragmentation errors
Duration Viterbi:
With all features, achieves F1 = 92%
Even without the blank feature, works very well (F1 = 85%, WD = 25%)
                           a) Only blank feature        b) Other features            c) All features
Metric                     F1 (R, P)          WD        F1 (R, P)          WD        F1 (R, P)          WD
1) Point Decision          87% (94%, 81%)     23%       80% (85%, 75%)     71%       90% (94%, 86%)     38%
2) Viterbi + (1)           87% (95%, 81%)     22%       84% (89%, 79%)     31%       91% (96%, 87%)     16%
3) Duration Viterbi + (1)  89% (95%, 83%)     21%       85% (90%, 80%)     25%       92% (95%, 88%)     15%
- Table. Average performance of 49 streams
Experimental Result (cont’d)
Shows robustness on the heterogeneous data set
Blank, although powerful, is not always available
Relying on only one strong feature is not good (see the worst-case performance of 50% below)
Robust fusion of all features is important, including the temporal duration modeling
[Figure: Histogram of per-stream accuracy, F1 (%) vs. # of streams. Green: only blank feature (unreliable, with worst cases near 50%); Yellow: other features; Red: all features.]
- Figure. The distribution of detection accuracy (F1) on the 49 streams when using Duration Viterbi
Example of detection result
[Figure: Example detection result on a 30-min news program (CBS2, test pg id = 1, frames 61298 to 64878), RED: PG, BLUE: CM, all using Duration Viterbi. Panels: ground truth (GT) with locations of blank frames; a) only blank; b) other features without blank; c) all features including blank. Annotations: short distance between blanks; a PG segment found without blank; boundary ambiguity due to Viterbi processing, so the detected start of CM is delayed; more temporal smoothing is needed.]
Conclusion
Our proposed method:
A systematic late fusion method based on a Markov model, combining local features and a global duration feature
Experimental results on heterogeneous data show:
Effectiveness of the duration feature and the multi-modal feature fusion
Robustness of our method even when the blank feature is not present
Future work:
Incorporation of a larger feature pool
Comparison with prior work under the same conditions and data set
End
Thank you very much! Any questions?
References
[Marlow01] "Audio and Video Processing for Automatic TV Advertisement Detection", ISSC, 2001
[Dimitrova03] "Evolvable Visual Commercial Detector", CVPR, 2003
[Hauptmann04] "Comparison and Combination of Two Novel Commercial Detection Methods", ICME, 2004
[Vasconcelos00] "Statistical Models of Video Structure for Content Analysis and Characterization", IEEE Trans. on Image Processing, 2000
[Zhong02] "Segmentation, Index and Summarization of Digital Video Content", Ph.D. Thesis, Columbia Univ., 2002
[Zhang03] "Accurate Overlay Text Extraction for Digital Video Analysis", ITRE, 2003
[Platt99] "Probabilistic Outputs for Support Vector Machines and Comparison to Regularized Likelihood Methods", Advances in Large Margin Classifiers, MIT Press, 1999
[Pevzner02] "A Critique and Improvement of an Evaluation Metric for Text Segmentation", Computational Linguistics, 2002
[Wei04] "Color-Mood Analysis of Films Based on Syntactic and Psychological Models", ICME, 2004