

SLIDE 1

digital video | multimedia lab

Adaptive Feature Discovery for TRECVID Broadcast News Video Story Segmentation

@TRECVID Workshop 2004, Nov. 15-16

Winston Hsu 1, Lyndon Kennedy 1, Shih-Fu Chang 1, Martin Franz 3, John Smith 2, Giridharan Iyengar 3

http://www.ee.columbia.edu/~winston

  • 1 Dept. of Electrical Engineering, Columbia University, New York, NY
  • 2 IBM T. J. Watson Research Center, Hawthorne, NY
  • 3 IBM T. J. Watson Research Center, Yorktown Heights, NY

SLIDE 2

trecvid workshop, 11/15/2004

Outline

Features and Fusion Strategies

Multi-modal features at different observation windows

(e.g., prosody, visual cues, text)

Fusion with Support Vector Machines

New focus in 2004:

Automatic Visual Cue Cluster Construction (VC3 framework)

Ability to handle diverse production events

Thorough error analysis for different genres

Brief comparison with last year's results

SLIDE 3

Story Segmentation Model

  • Determine the candidate points
  • union of pauses and shot boundaries with fuzzy window 2.5 sec
SLIDE 4

Story Segmentation Model

  • Determine the candidate points
  • union of pauses and shot boundaries with fuzzy window 2.5 sec
  • Extract and aggregate relevant features from surrounding windows
  • take into account asynchronous multi-modal features; e.g., text, audio
SLIDE 5

Story Segmentation Model

  • Determine the candidate points
  • union of pauses and shot boundaries with fuzzy window 2.5 sec
  • Extract and aggregate relevant features from surrounding windows
  • take into account asynchronous multi-modal features; e.g., text, audio
  • Classify the candidate points as “boundary” or “non-boundary”
  • SVMs with RBF kernels
  • Post-processing

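The candidate-point step above can be sketched in Python. This is a minimal illustration of the stated rule only, not the authors' implementation; merging points that fall within the fuzzy window of an already-kept point is an assumption about how the union is deduplicated.

```python
# Sketch of candidate-point determination: union of pause and
# shot-boundary times, with nearby points collapsed using the
# 2.5-second fuzzy window from the slide. (Illustrative only.)

FUZZY_WINDOW = 2.5  # seconds


def candidate_points(pauses, shot_boundaries, window=FUZZY_WINDOW):
    """Merged, sorted candidate boundary times in seconds."""
    merged = []
    for t in sorted(pauses + shot_boundaries):
        # keep a point only if it is farther than the fuzzy window
        # from the last point we kept
        if not merged or t - merged[-1] > window:
            merged.append(t)
    return merged


print(candidate_points([0.0, 10.2, 31.0], [1.5, 10.9, 60.0]))
# -> [0.0, 10.2, 31.0, 60.0]
```

Each surviving time would then be classified as boundary/non-boundary by the SVM described on this slide.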

SLIDE 6

Raw Multi-Modal Features

Raw Features             Dim.    Modality
text story seg. scores   1       Text
pause                    1       Audio
prosody features         30      Audio
speaker change           1       Audio
speech rapidity          1       Audio
visual cue clusters      15~40   Visual
commercial               2       Visual
motion                   2       Visual

* before taking into account different observation windows

SLIDE 7

Visual Cue Cluster Construction (VC3)

Motivation

News channels have distinct visual production events, which vary across channels and over time and are statistically relevant to story boundaries

The usual approach: manually enumerate the production events by inspection, then train classifiers for them

e.g., ANCHOR, STUDIO, WEATHER, CNN_HEADLINE, …, etc.

Problem -> this does not scale when deploying on multiple channels of multiple countries …

We want a systematic framework to discover "visual cue clusters"

Analogous to text -> cue words or cue word clusters

Automatically, rather than by human inspection

Avoids time-consuming news production annotations

via Information Bottleneck Clustering!

SLIDE 8

VC3: the Information Bottleneck Principle

Cluster the feature space X into clusters C while still trying to preserve the mutual information I(C; Y) with the label space Y; i.e., minimize I(X; C) - β·I(C; Y)

If β -> ∞, a hard partitioning; we only care about maximizing I(C; Y), i.e., minimizing the information loss I(X; Y) - I(C; Y)

SLIDE 9

VC3 Overview: a Simple Example

SLIDE 10

VC3 Overview: a Simple Example

c1 c2 c3

  • Items (features) in the same cluster tend to have similar probability distributions over the event labels Y -> semantic consistency!!

  • MI contributions from different clusters -> feature selection

c1 c2 c3

SLIDE 11

VC3 Overview: Joint Probability Approximation

For IB clustering, we essentially need the joint probability p(x, y)

However, video features are not discrete but continuous!

Approximate the joint probability via kernel density estimation from the existing feature observations

Embed prior knowledge in the kernel functions and the (D-dimensional) kernel bandwidth

Gaussian kernel (diagonal) with a specific kernel bandwidth

  • observed event probability conditioned on the feature

Raw features: autocorrelogram, color moments, and Gabor texture
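A toy sketch of the kernel density approximation just described: estimate p(y | x) for a continuous feature from labeled observations using a diagonal Gaussian kernel. The bandwidths and the synthetic data are illustrative, not from the paper.

```python
# Kernel-smoothed estimate of p(y | x) from continuous observations,
# using a per-dimension (diagonal) Gaussian kernel. (Toy sketch.)
import numpy as np


def p_y_given_x(x, X, y, bandwidth):
    """Label probabilities at query x.

    X: (N, D) observed features, y: (N,) binary labels,
    bandwidth: (D,) per-dimension Gaussian sigma.
    """
    d = (x - X) / bandwidth                    # (N, D) scaled offsets
    k = np.exp(-0.5 * (d ** 2).sum(axis=1))   # Gaussian kernel weights
    w = k / k.sum()                            # normalized weights
    p1 = float((w * y).sum())                  # p(y = 1 | x)
    return np.array([1.0 - p1, p1])


rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 3)), rng.normal(4, 1, (50, 3))])
y = np.array([0] * 50 + [1] * 50)
print(p_y_given_x(np.full(3, 4.0), X, y, bandwidth=np.ones(3)))
```

A query near the label-1 cloud gets p(y=1 | x) close to 1, which is the behavior the IB clustering consumes.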

SLIDE 12

VC3 Overview: Cluster Examples-I

ABC VCs for story seg.

cluster selection/feature reduction!!

SLIDE 13

VC3 Overview: Cluster Examples-II

CNN VCs for story seg.

SLIDE 14

VC3 Overview: Cluster Examples-III

CNN VCs for text association

POINT, WIN, PLAY, MICHAEL, GAME, …

POINT, DOLLAR, PERCENT, WORLD, DOW, NASDAQ, STREET

SPORT, HEADLINE, JAMES, GAMES, …

PRESIDENT, CLINTON, WHITE, DOLLAR, LEWINSKY, HOUSE, …

TEMPERATURE, SHOWER, RAIN, THUNDERSTORM, PRESSURE, …

SLIDE 15

VC3 Overview: Feature Projection

In feature extraction, project an image onto the induced cue clusters by calculating the membership probabilities

K-dim. VC Features
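The projection step can be sketched as follows. This is a hypothetical simplification: the clusters are represented only by centroids with a Gaussian affinity, whereas the actual method derives memberships from the kernel-density model of the induced clusters.

```python
# Map a raw image feature vector to a K-dim vector of soft
# membership probabilities over K visual cue clusters.
# (Hypothetical sketch: centroids + Gaussian affinity stand in
# for the induced cluster densities.)
import numpy as np


def vc_features(x, centroids, sigma=1.0):
    """K-dim soft membership of feature x over K cluster centroids."""
    d2 = ((x - centroids) ** 2).sum(axis=1)   # squared distances
    w = np.exp(-0.5 * d2 / sigma ** 2)        # Gaussian affinities
    return w / w.sum()                        # normalize to p(c | x)


centroids = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])  # K = 3
print(vc_features(np.array([0.9, 0.1]), centroids))
```

The resulting K-dim vector is the "VC feature" fed to the SVM alongside the other modalities.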

SLIDE 16

Performance Overview (A+V, Validation Set)

[Charts: A+V results for ABC and for CNN]

SLIDE 17

[Chart: per-story-type ratio (%) for the Overall, ME, VCs, and A+V feature sets; values range 0.0-35.0]

Story types: anch. led, 2nd anch., cont. shrt bref., weather, prev->comm, sprt->comm, in anch., sprt bref., msc/anim

  • Annotate 749 stories into 9 types from 22 CNN videos
  • Fixed 0.71 precision; VC(*) evaluated at shot boundaries ONLY
SLIDE 18

Performance Overview (A+V+T, Validation Set)

[Diagram: fusion orderings over the V, A, and T modalities]

SVM fusion over-fits the training set!!

Revised A+V+T fusion approach

SLIDE 19

TRECVID 2004 Story Segmentation NIST Submission

[Chart: F1 (0.00-0.80) of the 10 Columbia_IBM runs (dT, mT, AV_efc+efc, AV_efc+ec, AV_fc+fc, AVmT, AVmT_fc+fc, AVdT_fc+c, AVdT_fc+fc, AVmT_fc_c) vs. best_of_others; selected values: 0.61, 0.57, 0.69, 0.65]

TRECVID 2004 test results

  • 10 Columbia_IBM submissions
  • Significant degradation (10%) compared with our two validation sets (A+V, A+V+T: 0.72+)
  • Probably due to: (1) visual patterns or raw features changed substantially in the test set; (2) the fusion strategy; (3) the selection of the decision threshold

SLIDE 20

Summary

Developed a novel information-theoretic framework to discover visual cue clusters

  • automatically adapts to the diverse production events of different channels
  • avoids manual specification/annotation of salient visual cues

Results confirm the effectiveness of VCs on the validation set

But the performance degrades in the test set due to time gap

Multi-modal fusion

Fusion of A and V yields a significant improvement

Fusion of AV and T improves performance on ABC only

Fusion strategies are critical: simultaneous fusion is better

Major remaining errors

Short sports briefings

Suggest merging them into a continuous story in the ground truth

SLIDE 21

< the end; thanks >