CMU @ TRECVID Event Detection - Ming-yu Chen & Alex Hauptmann



SLIDE 1

CMU @ TRECVID Event Detection @

Ming-yu Chen & Alex Hauptmann

School of Computer Science, Carnegie Mellon University

SLIDE 2

CMU @ TRECVID 2009 Event Detection

CMU submitted runs for all 10 event detection tasks.

Part-based generic approach

  • Local features extracted from videos
  • Local features describe both appearance and motion

Bag-of-words features represent video content

  • Robust to action deformation, occlusion and illumination

Sliding window detection approach

  • Extends the part-based method to detection tasks
  • False alarm reduction is a critical task

SLIDE 3

System overview

SLIDE 4

MoSIFT – feature detection

MoSIFT detects spatial interest points at multiple scales

  • Local maximum of Difference of Gaussian (DoG)

MoSIFT computes optical flow to detect moving areas

  • MoSIFT detects video interest areas by local maxima of DoG combined with optical flow
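The detection rule above (DoG local maxima kept only in moving areas) can be sketched in NumPy. This is a minimal single-scale illustration, not CMU's implementation: the thresholds are invented, and `flow_mag` is assumed to be a precomputed optical-flow magnitude map.

```python
import numpy as np

def gaussian_blur(img, sigma):
    # Separable Gaussian filter built from two 1-D convolutions.
    radius = int(3 * sigma)
    x = np.arange(-radius, radius + 1)
    k = np.exp(-x ** 2 / (2 * sigma ** 2))
    k /= k.sum()
    rows = np.apply_along_axis(lambda r: np.convolve(r, k, mode="same"), 1, img)
    return np.apply_along_axis(lambda c: np.convolve(c, k, mode="same"), 0, rows)

def mosift_candidates(frame, flow_mag, sigma=1.0, k=1.6,
                      dog_thresh=0.01, flow_thresh=0.5):
    # DoG response at one scale (a full implementation uses a scale pyramid).
    dog = gaussian_blur(frame, sigma) - gaussian_blur(frame, k * sigma)
    # 8-neighbour local-maximum test via padded, shifted comparisons.
    p = np.pad(dog, 1, mode="constant", constant_values=-np.inf)
    h, w = dog.shape
    is_max = np.ones(dog.shape, dtype=bool)
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            if dr == 0 and dc == 0:
                continue
            is_max &= dog > p[1 + dr:1 + dr + h, 1 + dc:1 + dc + w]
    # Keep DoG maxima only where there is enough motion.
    keep = is_max & (dog > dog_thresh) & (flow_mag > flow_thresh)
    return list(zip(*np.nonzero(keep)))
```

In practice the DoG pyramid and the optical flow would come from an optimized library; the point here is only the joint "appearance maximum AND motion" gating.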


SLIDE 5

MoSIFT – feature description

Descriptor of shape

  • Histogram of Gradient (HoG)
  • Aggregate neighboring areas into a 4x4 grid; each grid cell is described by 8 orientations
  • 4x4x8 = 128-dimensional vector describes the shape of interest areas

Descriptor of motion

  • Histogram of Optical Flow (HoF); the same format as HoG
  • 128-dimensional vector describes the motion of interest areas

256-dimensional vectors serve as the final feature descriptors
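The 4x4x8 layout above can be illustrated with a small sketch. This is not the authors' code: the gradient of a frame difference used for the motion half is a crude stand-in for real optical flow, and SIFT-style interpolation and normalization details are omitted.

```python
import numpy as np

def grid_orientation_histogram(dx, dy, grid=4, bins=8):
    # 4x4 spatial grid x 8 orientation bins = 128-dim descriptor of a patch.
    mag = np.hypot(dx, dy)
    ang = np.arctan2(dy, dx)                       # angles in [-pi, pi]
    bin_idx = ((ang + np.pi) / (2 * np.pi) * bins).astype(int) % bins
    h, w = dx.shape
    gy, gx = h // grid, w // grid
    desc = np.zeros((grid, grid, bins))
    for i in range(grid):
        for j in range(grid):
            cell = (slice(i * gy, (i + 1) * gy), slice(j * gx, (j + 1) * gx))
            for b, m in zip(bin_idx[cell].ravel(), mag[cell].ravel()):
                desc[i, j, b] += m                 # magnitude-weighted votes
    return desc.ravel()                            # 4*4*8 = 128 values

def mosift_descriptor(patch_prev, patch_next):
    # Shape half: HoG over image gradients of the current patch.
    gy, gx = np.gradient(patch_prev)
    shape_desc = grid_orientation_histogram(gx, gy)
    # Motion half: HoF in the same 4x4x8 layout; gradients of a frame
    # difference stand in for real optical flow in this sketch.
    fy, fx = np.gradient(patch_next - patch_prev)
    motion_desc = grid_orientation_histogram(fx, fy)
    return np.concatenate([shape_desc, motion_desc])  # 128 + 128 = 256 dims
```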


SLIDE 6

Event detection

The k-means clustering algorithm is applied to quantize feature points extracted from videos

  • K is chosen by cross-validation

A video codebook is built from the clustering result

  • A visual code is a category of similar video interest points

Bag-of-words (BoW) features are constructed for each video sequence

  • Soft weighting is used to construct the BoW feature

Event models are trained by Support Vector Machine (SVM)

  • χ² (chi-square) kernel is applied

A sliding window approach creates video sequences in both training and testing sets
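A minimal sketch of the soft-weighted BoW construction and the χ² kernel, assuming a 1/2^rank soft-assignment weighting over the nearest visual words (a common choice; the deck does not specify the exact weights):

```python
import numpy as np

def soft_bow(descriptors, codebook, n_neighbors=4):
    # Each descriptor votes for its nearest visual words, weighted by
    # 1/2**rank (assumed scheme), then the histogram is L1-normalized.
    hist = np.zeros(len(codebook))
    for d in descriptors:
        dists = np.linalg.norm(codebook - d, axis=1)
        for rank, idx in enumerate(np.argsort(dists)[:n_neighbors]):
            hist[idx] += 1.0 / (2 ** rank)
    s = hist.sum()
    return hist / s if s > 0 else hist

def chi_square_kernel(x, y, gamma=1.0):
    # One common exponential chi-square kernel form for histogram features.
    eps = 1e-10
    chi2 = np.sum((x - y) ** 2 / (x + y + eps))
    return np.exp(-gamma * chi2)
```

The kernel values can be precomputed into a Gram matrix and passed to any SVM library that accepts custom kernels.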


SLIDE 7

Evaluation metric - DCR

Normalized Detection Cost Rate (NDCR) is used to evaluate performance.

DetectionCost(S, E) = Cost_Miss · P_Miss(S, E) + Cost_FA · R_FA(S, E)

  • P_Miss ranges over [0, 1] while R_FA ranges over [0, ∞), so the cost strongly penalizes false alarms
  • NDCR does not reward detecting more positive examples as much as it rewards reducing false alarms
  • Reducing false alarms is therefore extremely important to improve NDCR scores
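As a worked example, the detection cost can be computed directly. In this sketch the constants (Cost_Miss = 10, Cost_FA = 1, R_Target = 20 events/hour) are assumed TRECVID-style defaults, not values stated in this deck, and the normalization step is likewise an assumption.

```python
def detection_cost(p_miss, r_fa, cost_miss=10.0, cost_fa=1.0, r_target=20.0):
    # Raw cost as in the slide's formula: miss probability and false-alarm
    # rate, each weighted by its cost constant.
    dcr = cost_miss * p_miss + cost_fa * r_fa
    # Assumed normalization: beta trades false alarms against misses.
    beta = cost_fa / (cost_miss * r_target)
    ndcr = p_miss + beta * r_fa
    return dcr, ndcr
```

Because R_FA is an unbounded rate while P_Miss is capped at 1, even a small beta lets false alarms dominate the score, which is why the slides stress false alarm reduction.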


SLIDE 8

False alarm reduction

A cascade architecture is widely used to reduce false alarms in detection tasks. We applied the cascade idea in the test phase to reduce false alarms.

  • Two positively biased classifiers are built (due to computation cost; this can extend to more layers)
  • Windows that pass both classifiers are predicted as positive

[Diagram: all windows → M1 → M2; windows passing both classifiers (T) become detected windows, windows rejected by either (F) become rejected windows]
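The two-stage test-time cascade can be sketched generically; the classifier callables below are placeholders for the trained, positively biased SVMs:

```python
def cascade_predict(windows, stage1, stage2):
    # A window is reported as a detection only if BOTH classifiers accept
    # it; anything rejected at stage 1 never reaches stage 2.
    survivors = [w for w in windows if stage1(w)]   # M1 filters cheaply
    detected = [w for w in survivors if stage2(w)]  # M2 confirms
    return detected
```

Each stage is biased toward positives, so the cascade trims false alarms while rarely discarding true detections.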

SLIDE 9

False alarm reduction (Cont.)

Lesson from last year: the multi-scale sliding window approach produces a lot of false alarms

  • We do not apply multi-scale windows this year

Instead of several short positive predictions, we aggregate consecutive positive predictions into one long positive segment

  • Reduces the number of positive predictions

Performance improves 80% with the cascade algorithm, and a further 40% by concatenating short predictions into long predictions
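Aggregating consecutive positive windows into long segments might look like the following; the 25-frame window length matches the system setup described later in the deck, but the exact merge rule is an assumption:

```python
def merge_positive_windows(window_starts, window_len=25):
    # Merge overlapping/consecutive positive windows (given by start frame)
    # into long (start, end) segments.
    segments = []
    for start in sorted(window_starts):
        if segments and start <= segments[-1][1]:
            # Window begins before the current segment ends: extend it.
            segments[-1][1] = max(segments[-1][1], start + window_len)
        else:
            segments.append([start, start + window_len])
    return [tuple(s) for s in segments]
```

A run of short positives at frames 0, 5, 10 thus becomes one segment instead of three separate detections, directly reducing the false alarm count.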


SLIDE 10

System setup

MoSIFT features are extracted at 3 different scales every 5 frames

  • approximately 2160 hours for a single core to extract MoSIFT features

A sliding window (25 frames) slides every 5 frames

1000 video codes

Soft-weighted BoW feature representation (4 nearest clusters)

One-against-all SVM model for each action of each camera view

  • 50 models are built (10 actions * 5 camera views)


SLIDE 11

Performance comparison

[Bar chart: DCR per action (CellToEar, ElevatorNoEntry, Embrace, ObjectPut, OpposingFlow, PeopleMeet, PeopleSplitUp, PersonRuns, Pointing, TakePicture), comparing CMU against the median and the best system; y-axis: DCR, roughly 0.2 to 1.4]

SLIDE 12

Correct detection comparison

[Bar chart: number of correct detections per action (the same ten actions), comparing CMU against the median and the maximum; y-axis: roughly 100 to 700]

SLIDE 13

Performance (2008 vs. 2009)


SLIDE 14

High-level feature extraction

Motion-related high-level features

  • 7 motion-related concepts
  • Airplane flying, Person playing soccer, Hand, Person playing a musical instrument, Person riding a bicycle, Person eating, People dancing

MAP by team: MM 0.24, PKU 0.21, TITG 0.20, CMU 0.18, FTRD 0.18, VIREO 0.18, Eurecom 0.18


SLIDE 15

Conclusion & future work

Conclusion:

  • A generic approach to detect events
  • MoSIFT features capture both shape and motion information

  • Performs robustly across all tasks
  • False alarm reduction is critical to improving DCR

Future work:

  • The approach cannot localize where the action occurs
  • The approach can be further fused with people tracking and global features
  • The bag-of-words representation lacks spatial constraints
