
SLIDE 1
  • Winston H.-M. Hsu -

digital video | multimedia lab

COLUMBIA UNIVERSITY IN THE CITY OF NEW YORK

Discovery and Fusion of Salient Multi-modal Features towards News Story Segmentation

  • @TRECVID 2003 Workshop

Winston Hsu¹, Shih-Fu Chang¹, Lyndon Kennedy¹, Chih-wei Huang¹, Ching-Yung Lin², and Giridharan Iyengar³

11/17/2003

¹Dept. of Electrical Engineering, Columbia University, New York, NY
²IBM T. J. Watson Research Center, Hawthorne, NY
³IBM T. J. Watson Research Center, Yorktown Heights, NY

SLIDE 2

Story definition (from LDC)

A news story is defined as a segment of a news broadcast with a coherent news focus which contains at least two independent, declarative clauses.

  • Miscellaneous segments include commercials, reporter chitchat, station identifications, public service announcements, long musical interludes (>9 sec), etc.

Story Segmentation

[Figure: a broadcast timeline labeled with News (N) and Misc. (M) segments]

SLIDE 3

Challenging problems due to diverse syntax

[Samples: music/animation openings, visual anchors, sports stories]

* Visual anchors alone account for only 67% (ABC) and 51% (CNN) of stories:

| Modalities  | Set | P    | R    | F1   |
|-------------|-----|------|------|------|
| Anchor face | ABC | 0.67 | 0.67 | 0.67 |
| Anchor face | CNN | 0.80 | 0.38 | 0.51 |

SLIDE 4

Our Goal

  • A robust statistical framework to fuse diverse features from different modalities
  • A unified framework that can be adapted to different news video sources
  • Automatically generates customized models (parameters) for the CNN and ABC channels within the same framework
  • An efficient mechanism for inducing dominant features for any specific domain
  • Allows us to handle large pools of features smoothly
  • Allows us to incorporate computationally noisy feature detectors

More information: "Discovery and Fusion of Salient Multi-modal Features towards News Story Segmentation," invited talk, SPIE/Electronic Imaging 2004, San Jose, Jan. 18-22.

SLIDE 5

Need for Multi-modal Fusion

Issue: is there a story boundary at the candidate point?

Use the perceptual multi-modal features computed from surrounding windows to infer the decision: from observation x_k, estimate the posterior probability q(b|x_k).

  • an anchor face?
  • motion energy changes?
  • a commercial starts in 15 sec.?
  • a change from music to speech?
  • a significant pause occurs?
  • a speech segment just starts?
  • {cue phrase}_i appears? {cue phrase}_j appears?

[Figure: timeline with candidate points t_{k-1}, t_k, t_{k+1} and surrounding windows B_p, B_c, B_n]

SLIDE 6

Our Proposed Framework – Exponential Model w/ Perceptual Binary Features

[Figure: training samples as a matrix of binary features f_i(x, b) with boundary labels b]

  q_λ(b|x) = (1/Z_λ(x)) · exp( Σ_i λ_i · f_i(x, b) ),   f_i(x, b) ∈ {0, 1},  b ∈ {0, 1}

Raw features: face, motion, significant pause, speech segment, commercial, text segmentation score, …

* Use supervised learning to find the optimal exponential weight λ_i for each binary feature f_i.
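The posterior above can be evaluated directly. The following sketch is our illustration, not the authors' code (the talk's implementation was in Matlab, and the function name `q_posterior` is ours); it computes q_λ(b|x) for binary labels and binary features:

```python
import math

def q_posterior(x, lambdas, features):
    """Posterior q_lambda(b|x) of the exponential model for b in {0, 1}.

    lambdas  : list of exponential weights lambda_i
    features : list of binary feature functions f_i(x, b) -> {0, 1}
    """
    # unnormalized score exp(sum_i lambda_i * f_i(x, b)) for each label b
    score = {b: math.exp(sum(l * f(x, b) for l, f in zip(lambdas, features)))
             for b in (0, 1)}
    z = score[0] + score[1]  # partition function Z_lambda(x)
    return {b: s / z for b, s in score.items()}
```

With a single feature f(x, b) = b and weight λ = log 3, this yields q(1|x) = 3/(1+3) = 0.75.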

SLIDE 7

Parameter Estimation

  • Estimate λ from the training set T = {(x_k, b_k)} by minimizing the Kullback-Leibler divergence D(p̃ ‖ q_λ) between the empirical distribution p̃ and the estimated model q_λ, defined as

  D(p̃ ‖ q_λ) = Σ_{x,b} p̃(x,b) · log [ p̃(b|x) / q_λ(b|x) ] = −Σ_{x,b} p̃(x,b) · log q_λ(b|x) + constant(p̃)

  • Equivalently, maximize the log-likelihood (whose maximum approaches the extreme value "0"):

  L(q_λ) ≡ Σ_{x,b} p̃(x,b) · log q_λ(b|x)

  • Iteratively find λ: update λ_i′ = λ_i + Δλ_i with

  Δλ_i = (1/M) · log [ Σ_{x,b} p̃(x,b) · f_i(x,b) / Σ_{x,b} p̃(x) · q_λ(b|x) · f_i(x,b) ]

  • Because the objective function is convex, the iterative process is guaranteed to converge to the global optimum.
  • Our Matlab implementation converges efficiently in ~30 minutes with 30 features and 11,705 training samples.
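The iterative update above is Generalized Iterative Scaling. Below is a minimal self-contained sketch under stated assumptions: function names are ours, labels are binary, each feature fires on at least one training sample, and M upper-bounds Σ_i f_i(x, b):

```python
import math

def q_posterior(x, lambdas, features):
    """q_lambda(b|x) = exp(sum_i lambda_i * f_i(x, b)) / Z_lambda(x)."""
    s = {b: math.exp(sum(l * f(x, b) for l, f in zip(lambdas, features)))
         for b in (0, 1)}
    z = s[0] + s[1]
    return {b: v / z for b, v in s.items()}

def gis_step(samples, lambdas, features, M):
    """One Generalized Iterative Scaling update:
    delta_i = (1/M) * log( E_p~[f_i] / E_q[f_i] ),  lambda_i' = lambda_i + delta_i.

    samples approximates the empirical distribution p~ as a list of (x, b) pairs.
    Assumes every feature fires on some sample (so both expectations are > 0).
    """
    n = len(samples)
    updated = []
    for lam, f in zip(lambdas, features):
        emp = sum(f(x, b) for x, b in samples) / n            # E_p~[f_i]
        mod = sum(q_posterior(x, lambdas, features)[b] * f(x, b)
                  for x, _ in samples for b in (0, 1)) / n    # E_q[f_i]
        updated.append(lam + math.log(emp / mod) / M)
    return updated
```

Repeated calls converge toward the maximum-likelihood weights; e.g. with one feature f(x, b) = b and 75% positive samples, λ converges to log 3 so that q(1|x) = 0.75.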
SLIDE 8

Feature Selection

Input: a collection of candidate features, training samples, and the desired model size
Output: the selected features and their corresponding exponential weights

  • The current model q is augmented with a candidate feature h with weight α:

  q_{α,h}(b|x) = q(b|x) · e^{α·h(x,b)} / Z_α(x)

  • At each iteration, select the candidate that improves the current model the most:

  h* = arg max_{h∈C} sup_α [ D(p̃ ‖ q) − D(p̃ ‖ q_{α,h}) ] = arg max_{h∈C} sup_α [ L(q_{α,h}) − L(q) ]

  (reducing divergence = increasing log-likelihood)
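One way to realize this greedy induction step is to fit each candidate's weight α on top of the frozen model q and keep the candidate with the largest log-likelihood gain. A sketch under those assumptions (names are ours; the 1-D search over α here uses plain gradient ascent, one of several valid choices):

```python
import math

def gain_and_alpha(samples, q_base, h, steps=200, lr=0.5):
    """Approximate sup_alpha [L(q_{alpha,h}) - L(q)] for one candidate feature h.

    q_base : x -> {b: q(b|x)}, the current (frozen) model
    h      : binary feature h(x, b) -> {0, 1}
    alpha is fit by gradient ascent; the gradient of the average
    log-likelihood is E_p~[h] - E_{q_alpha}[h].
    """
    n = len(samples)

    def post(x, alpha):
        # q_{alpha,h}(b|x) = q(b|x) * exp(alpha * h(x, b)) / Z_alpha(x)
        s = {b: q_base(x)[b] * math.exp(alpha * h(x, b)) for b in (0, 1)}
        z = s[0] + s[1]
        return {b: v / z for b, v in s.items()}

    def avg_loglik(alpha):
        return sum(math.log(post(x, alpha)[b]) for x, b in samples) / n

    alpha = 0.0
    for _ in range(steps):
        grad = sum(h(x, b) - sum(post(x, alpha)[bb] * h(x, bb) for bb in (0, 1))
                   for x, b in samples) / n
        alpha += lr * grad
    return avg_loglik(alpha) - avg_loglik(0.0), alpha

def select_feature(samples, q_base, candidates):
    """Greedy induction step: return the candidate with the largest gain."""
    scored = [(gain_and_alpha(samples, q_base, c), c) for c in candidates]
    (gain, alpha), best = max(scored, key=lambda t: t[0][0])
    return best, alpha, gain
```

An uninformative candidate (h ≡ 0) gets zero gain, so an informative one is always preferred.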

SLIDE 9

[Figure: raw-feature tracks around a candidate point: sigpas, music, commercial, text seg. score, face, shot, motion]

Examples of Raw Features

| Modality | Raw feature                 | Time index | Value   |
|----------|-----------------------------|------------|---------|
| Video    | shot boundary               | point      | boolean |
| Video    | face                        | segment    | real    |
| Video    | motion                      | segment    | real    |
| Audio    | pause                       | point      | real    |
| Audio    | pitch jump                  | point      | real    |
| Audio    | significant pause           | point      | real    |
| Audio    | music/speech discrimination | segment    | boolean |
| Audio    | speech segment/rapidity     | segment    | real    |
| Text     | ASR cue terms               | point      | boolean |
| Text     | V-OCR cue terms             | point      | boolean |
| Text     | text segmentation score     | point      | real    |
| Misc.    | commercial                  | segment    | boolean |
| Misc.    | sports                      | segment    | boolean |
| Misc.    | combinatorial               | point      | boolean |

Features exist at different time scales and at asynchronous points relative to the candidate point. We need a unified wrapper to convert them into a consistent representation and imitate human perception.

SLIDE 10

Feature Wrapper

[Diagram: the Feature Library supplies raw features {f_i^r(t)} and their deltas Δf_i^r; the Feature Wrapper F(f_i^r, t, dt, v, B) converts them into binary features {g_j}; the Maximum Entropy model selects features {h_k; λ_k} and outputs q(b|·)]

  • raw features: f_i^r
  • candidate point: t_k
  • delta interval: dt (delta operation)
  • binarization levels: v (binarization thresholds)
  • observation windows: B (nested windows B1, B2, B3 around the candidate point)
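A toy version of such a wrapper can be sketched as below. This is our illustration of the idea, not the talk's implementation: each binary feature tests whether a raw feature's peak inside an observation window exceeds a binarization threshold (one concrete choice), and the delta operation dt is omitted for brevity:

```python
def wrap_feature(raw, t_k, windows, thresholds):
    """Turn one raw feature stream into binary features around a candidate point.

    raw        : list of (time, value) samples of a raw feature f_i^r(t)
    t_k        : candidate point (seconds)
    windows    : list of (start_offset, end_offset) observation windows B
    thresholds : binarization levels v
    Returns one {0, 1} feature per (window, threshold) pair, set to 1 when
    the window's maximum raw value exceeds the threshold.
    """
    feats = []
    for lo, hi in windows:
        in_win = [v for t, v in raw if t_k + lo <= t < t_k + hi]
        peak = max(in_win, default=float("-inf"))  # empty window -> never fires
        for v in thresholds:
            feats.append(1 if peak > v else 0)
    return feats
```

For example, a pitch-jump value of 0.9 just after the candidate point fires the feature for the following window but not the preceding one.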

SLIDE 11

Selected Features (from CNN)

* The first 10 "A+V" features automatically discovered for the CNN channel:

| No | Raw feature set | λ | Interpretation | Gain |
|----|-----------------|--------|----------------|--------|
| 1  | Anchor face | 0.4771 | An anchor face segment just starts after the boundary point. | 0.3879 |
| 2  | Significant pause & non-commercial | 0.7471 | A significant pause within the non-commercial section appears in the surrounding observation window. | 0.0160 |
| 3  | Pause | 0.2434 | An audio pause with duration larger than 2.0 seconds appears after the boundary point. | 0.0058 |
| 4  | Significant pause | 0.7947 | The surrounding observation window has a significant pause with pitch-jump intensity larger than the normalized pitch threshold 1.0 and pause duration larger than 0.5 second. | 0.0024 |
| 5  | Speech segment | 0.3566 | A speech segment before the candidate point. | 0.0019 |
| 6  | Speech segment | 0.3734 | A speech segment starts in the surrounding observation window. | 0.0015 |
| 7  | Commercial | 1.0782 | A commercial starts in 15 to 20 seconds after the candidate point. | 0.0015 |
| 8  | Speech segment | 0.4127 | A speech segment ends after the candidate point. | 0.0022 |
| 9  | Anchor face | 0.7251 | An anchor face segment occupies at least 10% of the next window. | 0.0016 |
| 10 | Pause | 0.0939 | The surrounding observation window has a pause with duration larger than 0.25 second. | 0.0008 |

SLIDE 12

Significant Pause


| Set | ε   | Uniform P | Uniform R | Uniform F1 | Sig. pause P | Sig. pause R | Sig. pause F1 |
|-----|-----|------|------|------|------|------|------|
| ABC | 5.0 | 0.10 | 0.22 | 0.14 | 0.20 | 0.38 | 0.26 |
| ABC | 2.5 | 0.10 | 0.22 | 0.14 | 0.16 | 0.34 | 0.22 |
| CNN | 5.0 | 0.20 | 0.24 | 0.22 | 0.40 | 0.45 | 0.42 |
| CNN | 2.5 | 0.20 | 0.24 | 0.22 | 0.37 | 0.43 | 0.39 |

(ε: boundary-matching tolerance in seconds)

SLIDE 13

Precision vs. Recall Curves

[Figure: precision vs. recall curves for A+V and A+V+T against the anchor-face baseline, on ABC and CNN]

  • A single P/R point is not sufficient for assessment -> need more samples along the P/R curves
  • The improvement from multi-modal fusion is significant
  • CNN improves more in the high-recall area
  • ABC improves more in the high-precision area

| Modalities     | ABC         | CNN         |
|----------------|-------------|-------------|
| A+V (P, R)     | 0.69, 0.76  | 0.63, 0.73  |
| A+V (BM, F1)   | 0.71        | 0.69        |
| A+V+T (BM, F1) | 0.74        | 0.67        |

  F1 = 2 · P · R / (P + R)

  decision boundary: q(b|·) > b_m
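The F1 measure used throughout the tables is the harmonic mean of precision and recall; e.g. the ABC anchor-face row (P = R = 0.67) gives F1 = 0.67:

```python
def f1(p, r):
    """F1 = 2*P*R / (P + R): harmonic mean of precision and recall."""
    return 2.0 * p * r / (p + r) if p + r > 0 else 0.0
```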

SLIDE 14

Story Typing

| Modalities | Set | P    | R    | F1   |
|------------|-----|------|------|------|
| A+V        | ABC | 0.93 | 0.92 | 0.92 |
| A+V        | CNN | 0.92 | 0.90 | 0.91 |
| A+V+T      | ABC | 0.89 | 0.94 | 0.91 |
| A+V+T      | CNN | 0.91 | 0.90 | 0.90 |

[Diagram: keyframes of a test video -> match filter (with templates) -> binary news/non-news decision -> median filters -> news detection result]

  • A story segment S_i is assigned type "News" if its overlap with the non-commercial segments S_J exceeds a threshold ε_t:

  type(S_i) = N  if  |S_i ∩ S_J| / |S_i| > ε_t

  • The news/non-news decision sequence A is smoothed with median-filter implementations of morphological OPEN and CLOSE:

  A′ = (A ∘ M_T) ● M_T,  M_T[n] = u[n] − u[n − T],  T = 450 (frames)

  (∘: morphological OPEN; ●: morphological CLOSE)
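The OPEN-then-CLOSE smoothing with a flat length-T structuring element can be sketched as below. Helper names are ours, and the slide realizes these operators with median filters; T = 450 frames in the talk, shortened here for the example:

```python
def _window(sig, i, T):
    # symmetric window of width ~T centered at i (truncated at the edges)
    return sig[max(0, i - T // 2): i + T // 2 + 1]

def erode(sig, T):
    """Binary erosion: 1 only where the whole window is 1."""
    return [int(all(_window(sig, i, T))) for i in range(len(sig))]

def dilate(sig, T):
    """Binary dilation: 1 wherever the window contains a 1."""
    return [int(any(_window(sig, i, T))) for i in range(len(sig))]

def smooth(sig, T):
    """OPEN (erode then dilate) removes news runs shorter than ~T frames,
    then CLOSE (dilate then erode) fills non-news gaps shorter than ~T frames."""
    opened = dilate(erode(sig, T), T)
    return erode(dilate(opened, T), T)
```

A 2-frame news blip inside a long non-news run is removed, and a 2-frame gap inside a long news run is filled.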

SLIDE 15

Summary

  • We have developed a statistical framework that can be systematically applied to diverse news video sources
  • The results are promising and show clear multi-modal improvement
  • The same framework can flexibly select dominant features from any modality
  • The performance leaves room for further research
  • How do we go beyond 75% and reach 90%?
  • Evaluation metrics should include complete P/R curves
  • Future work
  • Address imbalanced data distributions
  • Explore temporal dynamics in stories
  • Expand the feature pool, e.g., speech phoneme rapidity, video OCR, and high-level concept detection

SLIDE 16

Acknowledgements

Thanks to
  • Martin Franz of IBM Research for providing an ASR-only story segmentation system
  • Dongqing Zhang of Columbia University for providing the geometric active contour face detection system
  • The TRECVID 2003 organizing team for providing the evaluation platform and precious video corpora

SLIDE 17

Q & A Thank You!!

*More information: "Discovery and Fusion of Salient Multi-modal Features towards News Story Segmentation," invited talk, SPIE/Electronic Imaging 2004, San Jose, Jan. 18-22.