Commercial Detection in Heterogeneous Video Streams Using Fused Multi-Modal and Temporal Features
Masami Mizutani (Fujitsu Labs. Ltd.), Shahram Ebadollahi (Columbia University), Shih-Fu Chang (Columbia University)
IEEE ICASSP 2005, Philadelphia


SLIDE 1

Commercial Detection in Heterogeneous Video Streams Using Fused Multi-Modal and Temporal Features

Masami Mizutani (Fujitsu Labs. Ltd.), Shahram Ebadollahi (Columbia University), Shih-Fu Chang (Columbia University)
IEEE ICASSP 2005, Philadelphia, March 22, 2005

SLIDE 2

Outline

- Motivation & Previous Work
- Our Proposed Method
  - Approach
  - Local and Global Features for Commercial Detection
  - Fusion
- Experiment & Results
- Conclusion

SLIDE 3

Motivation

CM (commercial) detection:
- Find CM and PG (program) boundaries in broadcast material

Applications:
- CM-skip capability on digital PVRs
- Collecting CMs for marketing use
- Preprocessing for further content analysis of PGs, etc.

What’s the state of the art?

SLIDE 4

Previous Work

Dublin City University Group ('01) [Marlow01]
- Heuristics using blank-frame and silence detectors

Philips Research ('03) [Dimitrova03]
- Uses visual features (blank frames, scene-change rate, text-box location) extracted from MPEG streams
- Optimizes the detection thresholds with a Genetic Algorithm

Carnegie Mellon University Group ('04) [Hauptmann04]
- Does not use the blank feature; focuses on color and audio
- Since identical CMs are broadcast many times, finds repeated video segments in the stream as CM candidates
- Classifies the candidates using SVMs in a hierarchical style

SLIDE 5

Previous Work (cont’d)

- Reasonable performance, but the test data is limited and varied
- Blank frames are proven powerful, but not always present
- CMs are not repetitious in a heterogeneous data set

Method        Accuracy (F1 %)   # Programs (# Genres)   Total data / CM   Fusion Method
DCU01         92                10 (a few?)             3.5h / 0.4h       Heuristics
Philips03     89                24 (6 genres)           12h / 2.5h        Genetic Algorithm
CMU04         91                10 (only news)          5h / 1.2h         Hierarchical SVMs
Our Method    92                49 (6 genres)           36h / 9h          SVM + Duration HMM

- We build a systematic method to fuse diverse features, including blank frames
- We validate the results on a large, diverse data set

SLIDE 6

Our Approach

- Cast as a classification problem over detected scene-change points
  - The scene-change detector works well at CM/PG boundaries (mostly hard cuts or fades in/out)
- Use the pattern of multi-modal features in local windows located at the scene-change points
  - 15 sec window: half the length of most CM clips
  - 120 sec window: captures the start/end of clips having blank frames

[Figure: local feature windows around a scene-change point on a PG/CM/PG timeline — Blank Frame Rate (1 bin) over the 120 sec window; Scene Change Rate (1 bin), Audio (4 bins), Color (12 bins), and Overlay Text Location (16x16 = 256 bins) over the 15 sec window]
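As a concrete illustration of the windowed counting above, here is a minimal sketch (not the authors' code) of how the 1-bin rate features could be computed at a scene-change point; the function and variable names (`count_in_window`, `local_window_features`, `sc_times`, `bf_times`) are hypothetical.

```python
from bisect import bisect_left, bisect_right

def count_in_window(event_times, center, half_width):
    """Count events whose timestamps fall inside [center - half_width, center + half_width].

    event_times must be sorted in ascending order."""
    lo = bisect_left(event_times, center - half_width)
    hi = bisect_right(event_times, center + half_width)
    return hi - lo

def local_window_features(sc_times, bf_times, t):
    """Local rate features at scene-change time t: scene-change count in a
    15 sec window and blank-frame count in a 120 sec window."""
    return {
        "sc_rate": count_in_window(sc_times, t, 7.5),   # 15 sec window
        "bf_rate": count_in_window(bf_times, t, 60.0),  # 120 sec window
    }
```

Binary search keeps each lookup logarithmic, which matters when features are extracted at every scene change of a 36-hour corpus.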

SLIDE 7

Our Approach (cont’d)

- Use not only local features but also a global temporal feature
  - CM and PG segments are interleaved in each program
  - The density and locations of CMs in the entire program stream depend on the genre and the broadcast source

[Figure: a CM/PG-interleaved timeline with CM inter-arrival times t_{i-1}, t_i, t_{i+1}, t_{i+2}, and example likelihood distributions of the inter-arrival time of CM segments for (a) all genres, (b) sports, and (c) movies — CMs recur more quickly in sports than in movies]

SLIDE 8

Problem Formulation

- Define two hidden states (CM, PG) at the scene-change points
- Model them as a Markov chain with:
  - A duration feature: the duration of stay in a state
  - Fused local features: the content features observed at a state
- Detection of CM/PG boundaries is then formulated as inferring the optimal state sequence with the Duration Viterbi algorithm

[Figure: a two-state Markov chain over the scene changes, with duration models d(CM) and d(PG) and fused local features f emitted in each state along the PG/CM timeline]
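The decoding step can be sketched as an explicit-duration (segmental) Viterbi over the two states. This is a simplified illustration under assumed interfaces, not the authors' implementation: `log_obs[t][s]` is a per-point log-likelihood from the fused local features, `log_dur[s]` is a log duration prior (Erlang mixture for PG, uniform for CM in the paper), and `max_dur` bounds the segment length in scene-change points.

```python
def duration_viterbi(log_obs, log_dur, max_dur):
    """Segmental Viterbi over two alternating states (0 = PG, 1 = CM).

    log_obs[t][s]: log-likelihood of state s at scene-change point t.
    log_dur[s](d): log-probability that a stay in state s spans d points.
    Returns the best state label for every point."""
    T = len(log_obs)
    NEG = float("-inf")
    # best[t][s]: best score of labeling points 0..t-1 with a segment of state s ending at t-1
    best = [[0.0, 0.0]] + [[NEG, NEG] for _ in range(T)]
    back = [[None, None] for _ in range(T + 1)]
    for t in range(1, T + 1):
        for s in (0, 1):
            for d in range(1, min(max_dur, t) + 1):
                # the previous segment (if any) must be the other state
                score = (best[t - d][1 - s]
                         + log_dur[s](d)
                         + sum(log_obs[i][s] for i in range(t - d, t)))
                if score > best[t][s]:
                    best[t][s] = score
                    back[t][s] = d
    # backtrack segment by segment
    s = 0 if best[T][0] >= best[T][1] else 1
    t, labels = T, []
    while t > 0:
        d = back[t][s]
        labels[:0] = [s] * d
        t -= d
        s = 1 - s
    return labels
```

The inner loop over `d` is what distinguishes this from plain Viterbi: the duration prior scores a whole segment at once instead of a per-step self-transition, which is how the Erlang/uniform duration models enter the decoding.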

SLIDE 9

Modeling Duration of Stay

- Duration of PG: an Erlang mixture model
  - The Erlang distribution fits positive-valued samples well [Vasconcelos00]
  - The mixture accommodates the variety of genres
  - The fit is confirmed by a Kolmogorov-Smirnov test
- Duration of CM: a uniform distribution
- Both models are bounded by the max & min durations observed in the training data
- The normalized actual duration of stay is considered

Now, let’s see feature extraction and fusion …

[Figure: duration densities P(d) — the PG duration modeled by an Erlang mixture on [minPG, maxPG]; the CM duration uniform with height 1/(maxCM - minCM) on [minCM, maxCM]]
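For reference, the two duration densities above can be written down directly. The sketch below uses the standard Erlang density with integer shape k and rate lam; the function names and the mixture parameterization are assumptions for illustration, not taken from the paper.

```python
import math

def erlang_pdf(x, k, lam):
    """Erlang density with integer shape k >= 1 and rate lam > 0 (support x > 0)."""
    if x <= 0:
        return 0.0
    return (lam ** k) * (x ** (k - 1)) * math.exp(-lam * x) / math.factorial(k - 1)

def erlang_mixture_pdf(x, components):
    """Mixture of Erlangs: components is a list of (weight, k, lam) triples
    whose weights sum to 1. Models the PG duration across genres."""
    return sum(w * erlang_pdf(x, k, lam) for (w, k, lam) in components)

def uniform_duration_pdf(x, d_min, d_max):
    """Uniform CM-duration model bounded by the training min/max."""
    return 1.0 / (d_max - d_min) if d_min <= x <= d_max else 0.0
```

With k = 1 the Erlang reduces to the exponential distribution, so the mixture strictly generalizes the simpler positive-duration models.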

SLIDE 10

Feature Extraction: Scene Change, Blank and Overlay Text

- Use a scene-change (SC) detector [Zhong02] and a simple blank-frame (BF) detector
  - Features: # of SCs in the 15 sec window and # of BFs in the 120 sec window
- Use an overlay-text location detector based on motion vectors and texture energy [Zhang03]
  - Detection results from every 5th frame are mapped onto a 2D grid of 16x16 bins (16 = 352 pix / 22, 16 = 240 pix / 15)
  - Captures the location and frequency of overlay text appearing in the 15 sec window

[Figure: SC counts in the 15 sec window and BF counts in the 120 sec window along the timeline t, and the 16x16 = 256-bin overlay-text grid]
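The 16x16 mapping can be illustrated in a few lines. The detector itself is from [Zhang03]; `text_grid_histogram` and its input format (pixel coordinates of detected text) are hypothetical, shown only to make the 256-bin layout concrete.

```python
def text_grid_histogram(detections, width=352, height=240, bins=16):
    """Accumulate overlay-text detections, given as (x, y) pixel coordinates,
    onto a bins x bins grid, flattened to a bins*bins histogram.

    With the slide's 352x240 frames and 16 bins, each cell is 22x15 pixels."""
    hist = [0] * (bins * bins)
    for (x, y) in detections:
        col = min(int(x * bins / width), bins - 1)   # 0..15 across the frame
        row = min(int(y * bins / height), bins - 1)  # 0..15 down the frame
        hist[row * bins + col] += 1
    return hist
```

The `min(..., bins - 1)` clamp keeps detections on the right/bottom frame edge inside the last cell.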

SLIDE 11

Feature Extraction: Audio & Color

- Audio (4 bins): an HMM-based classifier over MFCC features
  - Each 1 sec unit of audio is labeled as one of {silence, speech, music, music/speech}
  - Feature: the count of each class within the 15 sec window
- Color (12 bins): a histogram of 12 predetermined palette colors over the shots in the 15 sec window [Wei04]
  - The palette color of each shot is determined from the 3 dominant colors of its keyframe
  - The 12 palette colors equally divide L*u*v* space

[Figure: per-second audio-class counts and per-shot palette-color counts accumulated over the 15 sec window following a scene change]

SLIDE 12

Fuse Multi-Modal Features

[Diagram: one local classifier per modality — BF Rate (1 bin) and SC Rate (1 bin) each feed a Poisson (ML) classifier; Audio (4 bins), Color (12 bins), and Overlay Text (256 bins) each feed an SVM with an RBF kernel; a final SVM (RBF) fuses the five outputs]

- Fuse into a single posterior probability in a late-fusion (2-step) style, due to the great diversity of the features
  - Use a local two-class (CM/PG) classifier for each modality
  - Convert each output to a posterior of CM, using Bayes rule for the ML classifiers and a sigmoid function for the SVMs [Platt99]
  - Another SVM fuses the posteriors into the final posterior of CM — the fused feature fed to the Markov chain

Sigmoid function for an SVM output f(x):

    P(CM | x) ≈ 1 / (1 + e^(α f(x) + β))

Bayes rule for an ML classifier with observation o:

    P(CM | o) = 1 / (1 + P(o | PG) P(PG) / (P(o | CM) P(CM)))
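The two conversions can be sketched directly from the formulas above. Function and argument names are illustrative; the sigmoid parameters alpha and beta would be fitted on held-out data as in [Platt99].

```python
import math

def svm_posterior(margin, alpha, beta):
    """Platt-style sigmoid: map an SVM decision value f(x) to P(CM | x).

    alpha is typically negative so that larger margins toward the CM class
    give larger posteriors."""
    return 1.0 / (1.0 + math.exp(alpha * margin + beta))

def ml_posterior(lik_cm, lik_pg, prior_cm=0.5):
    """Bayes rule: posterior of CM from the per-class likelihoods
    P(o | CM), P(o | PG) and a prior P(CM)."""
    prior_pg = 1.0 - prior_cm
    return 1.0 / (1.0 + (lik_pg * prior_pg) / (lik_cm * prior_cm))
```

Because every modality is mapped to the same [0, 1] posterior scale, the fusing SVM sees commensurable inputs despite the 1-bin to 256-bin spread of the raw features.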

SLIDE 13

Experimental Data Set

- Heterogeneous data set:
  - 49 programs from 6 US local/national channels
  - 6 genres: News, Drama, Animation, Entertainment, Movie, Sports
  - 36 hours in total, including 9 hours of commercials
- The starts of CM and PG segments are labeled manually
- 3-fold cross validation (training, validation, testing)

[Table: recording schedule by channel and half-hour slot — WB11 (Fri 3/12/04), UPN9 (Sat 3/13/04), FOX5 (Sun 3/14/04), and NBC (Tue 3/16/04) from 6:00PM to 11:30PM, and ABC7 (Mon 3/15/04) and CBS2 (Thu 3/18/04) from 12:00PM to 5:30PM; genres include INFO (daily news, politics, sports news), DRAMA (sitcoms), MOVIE, ANIME, ENT (quiz, gossip, talk shows), and a SPORTS EVENT (basketball tournament)]

SLIDE 14

Performance Metric

- F1 [Dimitrova03] counts correctly classified boundaries
  - Each scene-change point is a candidate, labeled positive (CM) or negative (PG)
  - Higher is better, but F1 cannot account for short errors

[Figure: ground truth vs. detection result along the timeline t, with TN, FP, TP, FN regions where the CM/PG labels agree or disagree]

    R  = TP / (TP + FN)   ...Recall
    P  = TP / (TP + FP)   ...Precision
    F1 = 2PR / (P + R)
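The metric is a direct computation from the boundary counts; a trivial sketch with hypothetical argument names:

```python
def precision_recall_f1(tp, fp, fn):
    """Boundary-level precision P = TP/(TP+FP), recall R = TP/(TP+FN),
    and their harmonic mean F1 = 2PR/(P+R)."""
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    f1 = 2 * p * r / (p + r)
    return p, r, f1
```

Note that true negatives never enter the formula, which is why F1 suits this task: correctly labeled PG points vastly outnumber boundaries and would otherwise dominate an accuracy score.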

SLIDE 15

Performance Metric (cont’d)

- WindowDiff [Pevzner02] measures the discrepancies between the ground truth (ref) and the detection result (hyp)
  - Widely used for text segmentation; lower is better

    WD(ref, hyp) = 1/(N - k) * Σ_{i=1..N-k} 1( |b(ref_i, ref_{i+k}) - b(hyp_i, hyp_{i+k})| > 0 )

    N: # of shots in the entire stream
    k: avg. number of shots in PG and CM segments
    b(i, j): # of PG and CM boundaries between positions i and j

[Figure: ref and hyp shot sequences with scene-change shots and PG/CM boundaries, compared within a sliding window of k shots]
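The definition translates directly into code. This minimal sketch assumes ref and hyp are given as 0/1 lists marking whether a PG/CM boundary follows each of the N shots — a representation choice for the example, not taken from the slide.

```python
def window_diff(ref, hyp, k):
    """WindowDiff of Pevzner & Hearst: the fraction of length-k windows in
    which ref and hyp disagree on the number of boundaries they contain.

    ref, hyp: equal-length 0/1 boundary-indicator lists over the N shots."""
    n = len(ref)
    assert len(hyp) == n and 0 < k < n
    errors = sum(
        1
        for i in range(n - k)
        if sum(ref[i:i + k]) != sum(hyp[i:i + k])
    )
    return errors / (n - k)
```

Unlike F1, a slightly misplaced boundary is only penalized in the few windows that straddle it, so short localization errors cost proportionally less than missed or spurious boundaries.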

SLIDE 16

Experimental Result

- Shows the importance of the duration features
- Point decision: using all features is best, but yields many short erroneous fragmentations (see the WD rates)
- Viterbi: drastically reduces the fragmentation errors
- Duration Viterbi:
  - With all features, achieves F1 = 92%
  - Even without the blank feature, works very well (F1 = 85%, WD = 25%)

Table. Average performance over the 49 streams:

                           a) Only blank feature     b) Other features         c) All features
                           F1 (R, P)        WD       F1 (R, P)        WD       F1 (R, P)        WD
1) Point Decision          87% (94%, 81%)   23%      80% (85%, 75%)   71%      90% (94%, 86%)   38%
2) Viterbi + (1)           87% (95%, 81%)   22%      84% (89%, 79%)   31%      91% (96%, 87%)   16%
3) Duration Viterbi + (1)  89% (95%, 83%)   21%      85% (90%, 80%)   25%      92% (95%, 88%)   15%
SLIDE 17

Experimental Result (cont’d)

- Shows the robustness on the heterogeneous data set
- Blank, although powerful, is not always available
- Relying on only one strong feature is unreliable (see the worst-case performance of 50% below)
- Robust fusion of all the features, including the temporal duration modeling, is important

[Figure: the distribution of detection accuracy (F1) over the 49 streams when using Duration Viterbi — green: only the blank feature (unreliable, with worst cases near F1 = 50%), yellow: other features, red: all features; x-axis F1 (%), y-axis # of streams]

SLIDE 18

Example of detection result

A 30 min news program (CBS2; RED: PG, BLUE: CM), using Duration Viterbi in all cases.

[Figure: four aligned timelines over frames ~6.1-6.5 x 10^5 — the ground truth (GT) vs. detections using a) only the blank feature, b) the other features without blank, and c) all features including blank. Annotations: the locations of blank frames; a short distance between blanks; a PG found without blank; boundary ambiguity due to the Viterbi processing; in all cases the detected start of CM is delayed — more temporal smoothing is needed]

SLIDE 19

Conclusion

Our proposed method:
- A systematic late-fusion method based on a Markov model, combining local features with a global duration feature

Experimental results on heterogeneous data show:
- The effectiveness of the duration feature and of the multi-modal feature fusion
- The robustness of our method even when the blank feature is not present

Future work:
- Incorporating a larger feature pool
- Comparing with prior work under the same conditions and data set

SLIDE 20

End

Thank you very much! Any questions?

SLIDE 21

References

[Marlow01] "Audio and Video Processing for Automatic TV Advertisement Detection", ISSC 2001
[Dimitrova03] "Evolvable Visual Commercial Detector", CVPR 2003
[Hauptmann04] "Comparison and Combination of Two Novel Commercial Detection Methods", ICME 2004
[Vasconcelos00] "Statistical Models of Video Structure for Content Analysis and Characterization", IEEE Trans. on Image Processing, 2000
[Zhong02] "Segmentation, Index and Summarization of Digital Video Content", Ph.D. Thesis, Columbia Univ., 2002
[Zhang03] "Accurate Overlay Text Extraction for Digital Video Analysis", ITRE 2003
[Platt99] "Probabilistic Outputs for Support Vector Machines and Comparisons to Regularized Likelihood Methods", Advances in Large Margin Classifiers, MIT Press, 1999
[Pevzner02] "A Critique and Improvement of an Evaluation Metric for Text Segmentation", Computational Linguistics, 2002
[Wei04] "Color-Mood Analysis of Films Based on Syntactic and Psychological Models", ICME 2004