
SLIDE 1

CMU-IBM-NUS@TRECVID 2012: Surveillance Event Detection (SED)

Yang Cai †*, Qiang Chen ‡§*, Lisa Brown ‡, Ankur Datta ‡, Quanfu Fan ‡, Rogerio Feris ‡, Shuicheng Yan §, Alex Hauptmann †, Sharath Pankanti ‡

† Carnegie Mellon University   ‡ IBM Research   § National University of Singapore

*Equal contributions by co-authors.

SLIDE 2

Outline

  • Retrospective Event Detection
    – System Overview
    – Fisher Vector Coding for Event Representation
    – Performance Evaluation
  • Interactive Event Detection
    – Detection Results Visualization
      • Event-specific Results Visualization
    – User Feedback Utilization
      • Temporal Locality Based Search
    – Performance Evaluation


SLIDE 4

System Overview

Offline Training:
  Training Sequences 1…n → Sliding Window → Extract MoSIFT features
  → Fisher Vector coding → Linear SVM training (with hard-sample mining) → Model

Online Testing:
  Testing Sequence → Sliding Window → Extract MoSIFT features
  → Fisher Vector coding → Classification → Non-maximum Suppression → Detection Result
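A minimal sketch of this pipeline in Python, with synthetic arrays standing in for real video; `extract_mosift` and `encode_fisher_vector` are hypothetical stand-ins for the authors' feature extraction and encoding, not their actual code:

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)

def extract_mosift(window):
    """Hypothetical stand-in: real MoSIFT yields interest-point descriptors."""
    return rng.normal(size=(50, 32))              # 50 descriptors, 32-D

def encode_fisher_vector(descriptors):
    """Placeholder pooling; see the FV sketch on Slide 6."""
    return descriptors.mean(axis=0)

def sliding_windows(num_frames, size=60, step=15):
    for start in range(0, num_frames - size + 1, step):
        yield (start, start + size)

# Offline training: encode labeled windows, train a linear SVM.
windows = list(sliding_windows(600))
X = np.stack([encode_fisher_vector(extract_mosift(w)) for w in windows])
y = rng.integers(0, 2, size=len(X))               # synthetic event labels
model = LinearSVC().fit(X, y)
# Hard-sample mining would re-train with the highest-scoring negatives.

# Online testing: score each window, then temporal non-maximum suppression.
scores = model.decision_function(X)

def temporal_nms(windows, scores, overlap=0.5):
    """Greedily keep the best-scoring window, drop overlapping neighbors."""
    order = np.argsort(scores)[::-1]
    keep = []
    for i in order:
        s_i, e_i = windows[i]
        if all(min(e_i, windows[j][1]) - max(s_i, windows[j][0])
               < overlap * (e_i - s_i) for j in keep):
            keep.append(i)
    return [windows[i] for i in keep]

detections = temporal_nms(windows, scores)
```

Note that training and testing share the same sliding-window encoder; only the final steps (classification and suppression) differ.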


SLIDE 6

Event Representation

  • Fisher Vector (FV) Coding [1]:

    – A GMM is learned to model the distribution of MoSIFT features.
    – For each feature point in a detection window, the gradients with respect to the means and standard deviations of the GMM are calculated.
    – The FV is the concatenation of the two gradients, averaged over all features in the detection window (a sketch follows below).
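A compact sketch of this computation, following [1] but omitting the power and L2 normalization steps, using a diagonal-covariance GMM fitted with scikit-learn:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fisher_vector(descriptors, gmm):
    """Concatenate per-Gaussian gradients w.r.t. means and standard
    deviations, averaged over all descriptors in a detection window."""
    N, D = descriptors.shape
    gamma = gmm.predict_proba(descriptors)            # (N, K) posteriors
    w, mu = gmm.weights_, gmm.means_                  # (K,), (K, D)
    sigma = np.sqrt(gmm.covariances_)                 # (K, D), diagonal cov
    diff = (descriptors[:, None, :] - mu) / sigma     # (N, K, D)
    g_mu = (gamma[..., None] * diff).sum(0) / (N * np.sqrt(w)[:, None])
    g_sig = (gamma[..., None] * (diff**2 - 1)).sum(0) / (N * np.sqrt(2 * w)[:, None])
    return np.concatenate([g_mu.ravel(), g_sig.ravel()])   # length 2*K*D

# Usage: fit the GMM on training descriptors, then encode each window.
train_desc = np.random.default_rng(0).normal(size=(1000, 32))
gmm = GaussianMixture(n_components=8, covariance_type="diag").fit(train_desc)
fv = fisher_vector(train_desc[:50], gmm)              # one window's FV
```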

  • Fisher Vector (FV) vs. Bag-of-Words (BoW) [2]:
    – BoW only counts the local descriptors assigned to each visual word, while FV also encodes higher-order statistics.
    – For a given representation dimensionality, FV is faster to compute than BoW, since far fewer Gaussians are needed than visual words.

[1] F. Perronnin, J. Sánchez, and T. Mensink. Improving the Fisher kernel for large-scale image classification. In ECCV, 2010.
[2] F. Perronnin and H. Jégou. Tutorial on Large-Scale Visual Recognition. CVPR, 2012.


SLIDE 7

Performance Evaluation

Primary Run Results (lower DCR is better):

Event          CMU-IBM_FV2012    Others’ Best 2012   CMU_BoW2011
               ActDCR  MinDCR    ActDCR  MinDCR      ActDCR  MinDCR
CellToEar      1.0007  1.0003    1.0040  0.9814      1.0365  1.0003
Embrace        0.8000  0.7794    0.8247  0.8240      0.8840  0.8658
ObjectPut      1.0040  0.9994    0.9983  0.9983      1.0171  1.0003
PeopleMeet     1.0361  0.9490    0.9799  0.9777      1.0100  0.9724
PeopleSplitUp  0.8433  0.7882    0.9843  0.9787      1.0217  1.0003
PersonRuns     0.8346  0.7872    0.9702  0.9623      0.8924  0.8370
Pointing       1.0175  0.9921    0.9813  0.9770      1.5186  1.0001
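For context, DCR is the TRECVID SED detection cost rate; a system that outputs nothing scores 1.0. A sketch of the metric, assuming the published SED parameters (CostMiss = 10, CostFA = 1, RTarget = 20 events/hour, hence beta = 0.005):

```python
# Sketch of the TRECVID SED DCR metric; the parameter values are the
# published SED settings (an assumption, not stated in this deck).
def dcr(n_missed, n_true, n_false_alarms, hours_of_video, beta=0.005):
    p_miss = n_missed / n_true              # fraction of true events missed
    r_fa = n_false_alarms / hours_of_video  # false alarms per hour
    return p_miss + beta * r_fa             # 1.0 == "detect nothing" baseline

# ActDCR evaluates the run at its submitted decision threshold; MinDCR is
# the minimum of dcr() over all thresholds on the detection scores.
print(dcr(n_missed=80, n_true=100, n_false_alarms=120, hours_of_video=15))
```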


SLIDE 9

Performance Evaluation

  • Compared to the other teams’ results this year (Others’ Best 2012):
    – Our system performs better on 4 of 7 events (actual/minimum DCR of the primary run).
  • Compared to our BoW-based system from last year (CMU_BoW2011):
    – This year’s system improves on 6 of 7 events (actual/minimum DCR of the primary run).

(See the table on Slide 7.)

SLIDE 10

Outline (recap): next, Interactive Event Detection
  – Detection Results Visualization (event-specific results visualization)
  – User Feedback Utilization (temporal locality based search)
  – Performance Evaluation

SLIDE 11

Detection Results Visualization

  • Problem:
    – Without a good visualization method, user-system interaction can be very ineffective and inefficient.
    – E.g., a user may need several minutes to judge whether a single detection is a true positive or a false alarm.

Is this a “CellToEar”?

SLIDE 12

Detection Results Visualization

  • Objective:

– To find visualization methods that enable users to accurately and quickly understand detection results.


SLIDE 15

Event-specific Results Visualization

Events: PersonRuns

[Six candidate detections, labeled (A)–(F)]

Which are true positives (PersonRuns)?


SLIDE 17

Event-specific Results Visualization

Events: Pointing

[Six candidate detections, labeled (A)–(F)]

Which are true positives (Pointing)?


SLIDE 19

Event-specific Detection Visualization

Events: PeopleSplitUp

[Two detection results shown in isolation]

Are they “PeopleSplitUp”? Probably…

SLIDE 20

Event-specific Results Visualization

Events: PeopleSplitUp

[The same two detection results, now shown with surrounding context units on either side]

SLIDE 21

Event-specific Results Visualization

  • Different events are visualized using different schemes (see the configuration sketch below):
    – Many low-resolution units:
      • Place multiple low-resolution units on one screen.
      • For events that can be captured at a glance, e.g. “PersonRuns”.
    – Few high-resolution units:
      • Place a few high-resolution units on one screen.
      • For events that require careful checking, e.g. “CellToEar”, “ObjectPut”, “Pointing”.
    – Contextual units:
      • Add context next to detections.
      • For group events with multiple phases, e.g. “PeopleSplitUp”, “PeopleMeet”, “Embrace”.

[Figure: example screens for the three schemes: many low-resolution units, few high-resolution units, contextual units]
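One way the event-to-scheme mapping above could be expressed in code; the grid sizes and flag names are illustrative assumptions, not the interface’s actual configuration:

```python
# Illustrative mapping of events to visualization schemes; grid shapes
# and show_context are assumed values, not the real interface settings.
VIS_SCHEMES = {
    "many_low_res": dict(grid=(4, 6), show_context=False),  # glanceable events
    "few_high_res": dict(grid=(1, 2), show_context=False),  # careful checking
    "contextual":   dict(grid=(2, 2), show_context=True),   # multi-phase events
}
EVENT_TO_SCHEME = {
    "PersonRuns": "many_low_res",
    "CellToEar": "few_high_res", "ObjectPut": "few_high_res",
    "Pointing": "few_high_res",
    "PeopleSplitUp": "contextual", "PeopleMeet": "contextual",
    "Embrace": "contextual",
}

layout = VIS_SCHEMES[EVENT_TO_SCHEME["PersonRuns"]]   # -> 4x6 grid, no context
```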

SLIDE 22

User Feedback Utilization

  • Problem:
    – Without feedback utilization, the interaction amounts to nothing more than removing false alarms.
  • Objective:
    – To also reduce missed detections efficiently by leveraging user feedback.

SLIDE 23

An Observation

  • Event instances have a temporally clustered distribution (temporal locality):
    – We calculated the interval between consecutive events of the same class in the development data.
    – For some events (e.g. “Pointing”, “ObjectPut”, “Embrace”, “PersonRuns”), most intervals are very small (< 200 frames, about 8 seconds).

[Figure: histogram of intervals (frames) between consecutive same-class events, binned in 100-frame steps from 0 to 2000, for PersonRuns, CellToEar, ObjectPut, PeopleMeet, PeopleSplitUp, Embrace, and Pointing]
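The statistic behind the figure can be reproduced as below; the annotation list here is a made-up example, not the actual TRECVID ground truth:

```python
# Gap (in frames) between consecutive annotated instances of each class;
# the annotations list is synthetic, for illustration only.
from collections import defaultdict
import numpy as np

annotations = [("Pointing", 120), ("Pointing", 210), ("ObjectPut", 400),
               ("Pointing", 260), ("ObjectPut", 530)]   # (class, start frame)

starts = defaultdict(list)
for event, frame in annotations:
    starts[event].append(frame)

for event, frames in starts.items():
    gaps = np.diff(sorted(frames))
    small = np.mean(gaps < 200)       # share of gaps under 200 frames (~8 s)
    print(f"{event}: {small:.0%} of consecutive pairs within 200 frames")
```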

SLIDE 24

Temporal Locality Based Search

  • What does the observation tell us?
    – If we observe one positive somewhere, we are likely to find another positive nearby.
  • Temporal locality based search:
    – After receiving a positive feedback from the user, the system returns a set of neighbors that lie close in time to that positive. The user can then quickly go through these neighbors to find potential missed detections (a sketch follows below).
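A sketch of this search step; the 200-frame radius mirrors the interval statistic above, and the candidate-list layout is an assumption rather than the authors’ implementation:

```python
# Temporal-locality search: once the user confirms a true positive,
# surface the temporally nearest candidate windows for quick review.
def temporal_neighbors(confirmed_frame, candidates, radius=200):
    """Return candidates within `radius` frames of a confirmed positive,
    nearest first, so likely missed detections are reviewed quickly."""
    nearby = [c for c in candidates
              if abs(c["frame"] - confirmed_frame) <= radius]
    return sorted(nearby, key=lambda c: abs(c["frame"] - confirmed_frame))

candidates = [{"frame": 310, "score": 0.4}, {"frame": 980, "score": 0.7},
              {"frame": 150, "score": 0.2}]
print(temporal_neighbors(confirmed_frame=250, candidates=candidates))
# -> the windows at frames 310 and 150; frame 980 is too far away
```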

SLIDE 25

Performance Evaluation

Actual DCR (lower is better)

Development Set (Training: Dev08, Testing: Eval08, Wall time: 5 min):

Event          Retro    Naive    ESpecVis  ESpecVis+TLSearch
CellToEar      1.0008   1.0014   1.0008    1.0009
Embrace        0.9519   0.9547   0.9344    0.9115
ObjectPut      1.0033   1.0026   1.0024    1.0023
PeopleMeet     0.9381   0.9338   0.9334    0.9361
PeopleSplitUp  0.8972   0.9416   0.8890    0.8863
PersonRuns     0.7610   0.7528   0.7511    0.7366
Pointing       1.0168   1.0109   1.0134    1.0084

Evaluation Set (Primary Run):

Event          Retro    ESpecVis+TLSearch
CellToEar      1.0007   1.0090
Embrace        0.8000   0.6696
ObjectPut      1.0040   1.0064
PeopleMeet     1.0361   0.9786
PeopleSplitUp  0.8433   0.8177
PersonRuns     0.8346   0.6445
Pointing       1.0175   0.9854

  • Retro: the retrospective event detection system output, using the Fisher Vector method.
  • Naive: the baseline interactive method, which linearly scans the system output using only the “many low-resolution units” visualization for all events.
  • ESpecVis: linearly scans the system output with event-specific visualization.
  • ESpecVis+TLSearch: scans the system output with both event-specific visualization and temporal locality search.


SLIDE 30

Conclusions

  • Retrospective System:
    – Fisher Vector coding significantly improves detection performance (DCR) on some events, e.g. “PersonRuns”, “Embrace”, “PeopleSplitUp”.
    – Performance on “CellToEar”, “Pointing”, and “ObjectPut” is still poor.
  • Interactive System:
    – An event-specific scheme should be used for detection results visualization.
    – Temporal locality search can improve performance for events with good temporal locality and reasonable detection accuracy.

SLIDE 31

Future Work

  • Retrospective System:
    – “Interaction-oriented” detection methods that aim to facilitate user interaction need to be studied, e.g. spatial localization of events.
  • Interactive System:
    – Better visualization techniques need to be developed for difficult events, e.g. “CellToEar”, “ObjectPut”.
    – More user-feedback utilization methods need to be studied.

SLIDE 32

Thanks!