[PPT] - CDVP & TRECVID-2003 News Story Segmentation Task Csaba Czirjek, PowerPoint Presentation

SLIDE 1

TREC-2003 (Neil O’Hare)

Center for Digital Video Processing

CDVP & TRECVID-2003

News Story Segmentation Task

Csaba Czirjek, Gareth J.F. Jones, Seán Marlow, Noel Murphy, Noel E. O’Connor, Neil O’Hare, Alan F. Smeaton

SLIDE 2

Structure of a News Broadcast

We assume that stories are delimited by shots of the anchorperson

Features of anchor shots:

– All anchor shots within a broadcast are taken from the same camera setup
– They are filmed with a static camera, with little object motion
– Anchor shots in a single broadcast are visually similar to each other

SLIDE 4

Structure of a News Broadcast

[Figure: structure of a news broadcast. Legend: Anchorperson Shots, News Report Shots, Commercial Break]

SLIDE 5

System Overview

We use the TRECVID 2003 common shot boundaries provided by CLIPS-IMAG

Extracted features are combined to detect anchor shots

Story boundaries are logged at the start of anchor shots

The aim is to extract features that are robust to changes across broadcasters (e.g. faces, motion, shot length)

This would give a generic news segmentation system

SLIDE 6

System Overview

[Diagram: 30 Minute News Program -> Shot Boundary Detection (donated by CLIPS-IMAG) -> Shot Level Feature Extraction (Shot Clustering, Face Detection, Motion Activity Analysis x 2, Shot Length, Text Segmentation (donated by StreamSage)) -> Support Vector Machine -> News Story Detection -> News Stories]

SLIDE 7

Feature Extraction 1 - Shot Clustering

Shots are clustered based on visual similarity (colour histogram)

Anchor shots are grouped together

Anchor clusters are identified using heuristics:

– They tend to be dispersed throughout the broadcast
– Their average shot length is longer than that of other clusters
– Anchor shots are very similar to each other: they form ‘tighter’ clusters
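The three anchor-cluster heuristics can be illustrated with a minimal Python sketch. It assumes the clustering step has already grouped the shots, and that each shot is a tuple of (position in the broadcast, length in frames, normalised colour histogram). The L1 histogram distance, the function names and all thresholds are illustrative assumptions, not values taken from the slides.

```python
from itertools import combinations
from statistics import mean


def l1_distance(h1, h2):
    """L1 distance between two normalised colour histograms."""
    return sum(abs(a - b) for a, b in zip(h1, h2))


def cluster_stats(cluster, total_shots):
    """Summarise one cluster; each shot is (index, length_in_frames, histogram)."""
    indices = [s[0] for s in cluster]
    pairs = list(combinations([s[2] for s in cluster], 2))
    return {
        # Fraction of the broadcast spanned by the cluster (dispersion).
        "dispersion": (max(indices) - min(indices)) / max(total_shots - 1, 1),
        "mean_length": mean(s[1] for s in cluster),
        # Mean pairwise distance; smaller means a 'tighter' cluster.
        "spread": mean(l1_distance(a, b) for a, b in pairs) if pairs else 0.0,
    }


def looks_like_anchor(stats, min_dispersion=0.5, min_length=150, max_spread=0.2):
    """Apply the three heuristics with illustrative (assumed) thresholds."""
    return (stats["dispersion"] >= min_dispersion
            and stats["mean_length"] >= min_length
            and stats["spread"] <= max_spread)


# Toy broadcast of 10 shots: shots 0, 4 and 9 share nearly the same histogram.
anchor = [(0, 300, [0.5, 0.3, 0.2]), (4, 280, [0.5, 0.3, 0.2]),
          (9, 320, [0.48, 0.32, 0.2])]
report = [(1, 60, [0.1, 0.2, 0.7]), (2, 90, [0.6, 0.1, 0.3])]

anchor_stats = cluster_stats(anchor, total_shots=10)
report_stats = cluster_stats(report, total_shots=10)
print(looks_like_anchor(anchor_stats), looks_like_anchor(report_stats))
```

In this toy data the first cluster spans the whole broadcast, has long shots and near-identical histograms, so it passes all three checks; the second cluster fails on dispersion and length.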

SLIDE 8

Feature Extraction 2 - Face Detection

A coarse-to-fine approach is used to extract candidate regions:

– Skin-like pixels are identified based on colour
– Morphological filtering is used to obtain smoothed areas of connected pixels
– Shape and size heuristics remove unlikely candidate face regions

Candidates are passed to a Principal Component Analysis (PCA) module for final classification

Every 12th frame (I-frames) is used for processing

SLIDE 9

Face Detection

[Figure: face detection pipeline. Original video file -> every 12th frame -> skin filtering + morphological adjustment -> filtered image -> size/shape heuristics -> image after applying the heuristics -> PCA (using a Face Database) -> detected faces with confidence scores (examples shown: 0.7, 0.5, 0.8, 0.2)]

SLIDE 10

Feature Extraction 3 - Activity Measure

Motion activity analysis is based on MPEG-1 motion vectors

Every P-frame is analysed

We count the number of zero-length motion vectors in a P-frame (excluding I-blocks)

Activity measure = (No. of zero-length vectors) / (Total No. of macroblocks)
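As a minimal sketch of this ratio in Python, assume a P-frame is given as a list with one entry per macroblock: a (dx, dy) motion vector, or None for an intra-coded macroblock (I-block). This representation and the function name are assumptions made for illustration; the slides do not describe how the MPEG-1 stream is parsed.

```python
def activity_measure(macroblocks):
    """Ratio of zero-length motion vectors to all macroblocks in a P-frame.

    Each entry is a (dx, dy) motion vector, or None for an intra-coded
    macroblock (I-block), which carries no motion vector and is therefore
    not counted as zero-length.
    """
    if not macroblocks:
        raise ValueError("frame has no macroblocks")
    zero_vectors = sum(1 for mv in macroblocks if mv == (0, 0))
    return zero_vectors / len(macroblocks)


# Toy P-frame with 8 macroblocks: 5 static, 2 moving, 1 intra-coded.
p_frame = [(0, 0), (0, 0), (3, -1), (0, 0), None, (0, 0), (1, 2), (0, 0)]
print(activity_measure(p_frame))  # 5 / 8 = 0.625
```

Note that with this definition a higher value means less motion, since it counts the macroblocks that did not move.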

SLIDE 11

Feature Extraction 3 - Activity Measure

Two separate shot-level measures are used:

– The least active P-frame is used to represent the shot
– All motion vectors across a shot are added to form a cumulative motion vector. The activity measure is then calculated using the cumulative motion vector

[Figure: grids of macroblock motion vectors for frame a and frame b are added element-wise to give the cumulative frame: frame a + frame b]
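The two shot-level measures can be sketched in Python as follows. The sketch assumes every macroblock carries a motion vector (I-blocks are ignored for brevity), that all P-frames in a shot have the same macroblock layout, and that "least active" means the frame with the highest proportion of zero-length vectors. The function names and toy values are illustrative assumptions.

```python
def zero_ratio(vectors):
    """Fraction of motion vectors that are exactly (0, 0)."""
    return sum(1 for v in vectors if v == (0, 0)) / len(vectors)


def least_active_frame_measure(frames):
    """Measure 1: the P-frame with the most zero-length vectors represents the shot."""
    return max(zero_ratio(frame) for frame in frames)


def cumulative_measure(frames):
    """Measure 2: add vectors position by position, then score the summed frame."""
    cumulative = [(sum(v[0] for v in column), sum(v[1] for v in column))
                  for column in zip(*frames)]
    return zero_ratio(cumulative)


# Two toy P-frames of four macroblocks each (values are made up).
frame_a = [(0, 0), (0, 0), (3, 5), (1, 0)]
frame_b = [(0, 0), (2, 1), (2, 4), (0, 1)]
shot = [frame_a, frame_b]

print(least_active_frame_measure(shot))  # frame_a has 2 of 4 zero vectors: 0.5
print(cumulative_measure(shot))          # only the first position stays (0, 0): 0.25
```

The cumulative measure only counts a macroblock position as static if it shows no net motion over the whole shot, so it is stricter than looking at the single quietest frame.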

SLIDE 12

Feature Extraction 4 - Shot Length

Shot length used as a feature
Measured in frames

SLIDE 13

Feature Extraction 5 - Text Analysis

To allow us to complete the required runs, we used text analysis provided by StreamSage

StreamSage text output is used as a binary feature

SLIDE 14

Combination of Features - SVM

Extracted features are combined using a Support Vector Machine

Trained on 10 hours of the TRECVID 2003 development set (5 CNN, 5 ABC)

The resulting SVM classifier detects anchor shots

Story boundaries are logged at the beginning of anchor shots
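The flow from per-shot features to story boundaries can be sketched in Python. The feature layout below (how the clustering and face-detection outputs are encoded, and the field names) is an assumption, and `toy_classifier` is a hand-written stand-in for the trained SVM, not the classifier used in the submitted runs.

```python
from dataclasses import dataclass, astuple


@dataclass
class ShotFeatures:
    in_anchor_cluster: float    # 1.0 if the shot fell in a candidate anchor cluster
    face_confidence: float      # best face-detection score in the shot
    least_active_frame: float   # first motion activity measure
    cumulative_activity: float  # second motion activity measure
    length_frames: float        # shot length in frames
    text_boundary: float        # 1.0 if the text analysis flags the shot


def story_boundaries(shots, is_anchor):
    """Log a boundary at the start frame of each shot the classifier accepts.

    `shots` is a list of (start_frame, ShotFeatures); `is_anchor` is any
    callable mapping a feature tuple to True/False, e.g. a trained SVM.
    """
    return [start for start, feats in shots if is_anchor(astuple(feats))]


def toy_classifier(x):
    """Stand-in for the trained SVM: an arbitrary hand-written rule."""
    return x[0] == 1.0 and x[1] > 0.5


shots = [
    (0,   ShotFeatures(1.0, 0.8, 0.9, 0.7, 300, 1.0)),
    (300, ShotFeatures(0.0, 0.2, 0.3, 0.1, 90, 0.0)),
    (390, ShotFeatures(1.0, 0.7, 0.9, 0.8, 280, 0.0)),
]
print(story_boundaries(shots, toy_classifier))  # [0, 390]
```

Passing the classifier in as a callable keeps the boundary-logging step independent of whichever SVM implementation produces the anchor decision.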

SLIDE 15

Submitted Runs

3 Required Runs

– A/V only system - generic system for ABC and CNN (DCU03_REQ_AV)
– A/V + text - generic system for ABC and CNN (DCU03_REQ_AV_TEXT)
– Text only - text analysis provided by StreamSage (DCU03_REQ_TEXT_ONLY)

2 Additional Optional Runs

– Specialised systems for ABC and CNN. Separate SVMs for each broadcaster (DCU03_OPT_AV)
– Clustering algorithm in isolation (DCU03_OPT_CLUSTER)

SLIDE 16

DCU Results

System ID              Recall   Precision
DCU03_REQ_AV           0.328    0.409
DCU03_REQ_AV_TEXT      0.294    0.453
DCU03_REQ_TEXT_ONLY    0.049    0.208
DCU03_OPT_AV           0.313    0.453
DCU03_OPT_CLUSTER      0.364    0.304
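One way to compare the runs with a single number is the F1 score, the harmonic mean of recall and precision. The slides report only recall and precision, so using F1 as the ranking criterion is an assumption made here for illustration; the figures are copied from the results above.

```python
# (recall, precision) for each submitted run, as reported in the results.
results = {
    "DCU03_REQ_AV": (0.328, 0.409),
    "DCU03_REQ_AV_TEXT": (0.294, 0.453),
    "DCU03_REQ_TEXT_ONLY": (0.049, 0.208),
    "DCU03_OPT_AV": (0.313, 0.453),
    "DCU03_OPT_CLUSTER": (0.364, 0.304),
}


def f1(recall, precision):
    """Harmonic mean of recall and precision."""
    return 2 * recall * precision / (recall + precision)


scores = {run: f1(r, p) for run, (r, p) in results.items()}
for run, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{run:22s} {score:.3f}")

best = max(scores, key=scores.get)
```

Under this criterion DCU03_OPT_AV ranks first (about 0.370), with the generic DCU03_REQ_AV close behind (about 0.364).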

SLIDE 17

Overall Results - All Groups

[Figure: recall/precision plot (both axes from 0.1 to 1) of the runs submitted by all groups: DCU, Fudan, IBM, kddi, NUS, StreamSage, UCF, Iowa]

SLIDE 18

Conclusions

Best results from the specialised system (DCU03_OPT_AV)

The generic system is not far behind

Extracted features are robust across broadcasters

Combined results improve precision, with a small loss in recall compared to clustering alone

SLIDE 19

Thank You