Event Detection in Airport Surveillance: The TRECVid 2008 Evaluation


SLIDE 1

Event Detection in Airport Surveillance

The TRECVid 2008 Evaluation

Jerome Ajot, Jonathan Fiscus, John Garofolo, Martial Michel, Paul Over, Travis Rose, Mehmet Yilmaz (NIST); Heather Simpson, Stephanie Strassel (LDC)

Video Analysis Content Extraction

SLIDE 2

Outline

  • Motivation
  • Evaluation process
  • Data
  • Task definitions


  • Events
  • Annotation process
  • Scoring
  • Adjudication
  • Conclusion & Future work
SLIDE 3

Motivation

  • Problem: automatic detection of observable events in surveillance video

  • Challenges:

– requires application of several Computer Vision techniques

  • segmentation, person detection/tracking, object recognition, feature extraction, etc.

– involves subtleties that are readily understood by humans but difficult to encode for machine learning approaches
– can be complicated by clutter in the environment, lighting, camera placement, traffic, etc.

SLIDE 4

NIST Evaluation Process

[Process diagram] Determine Program Requirements · Assess required/existing resources · Develop detailed plans with researcher input · Evaluation Plan (Task Definitions, Protocols/Metrics, Rollout Schedule) · Data Identification · Evaluation Resources (Training Data, Development Data, Evaluation Data, Ground Truth and other metadata, Scoring and Truthing Tools) · Dry-Run shakedown · Formal Evaluation · Technical Workshops and reports · Recommendations

Choosing the right task and metric is key.

SLIDE 5

UK Home Office London Gatwick Airport Data

  • The Home Office collected two parallel surveillance camera datasets

– 1 for their multi-camera tracking evaluation
– 1 for our event detection evaluation

  • 100-hour event detection dataset

– 10 data collection sessions × 2 hours per session × 5 cameras per session (= 100 camera-hours)

  • Camera views

– Elevator close-up
– 4 high-traffic areas
– Controlled access door

  • Camera view features

– Some overlapping views
– Areas with low pixels on target

SLIDE 6

TRECVid Retrospective Event Detection

  • Task:

– Given a definition of an observable event involving humans, detect all occurrences of the event in airport surveillance video
– Identify each event observation by:

  • The temporal extent
  • A detection score indicating the strength of evidence
  • A binary decision on the detection score, optimizing performance for a surrogate application

SLIDE 7

TRECVid Freestyle Analysis

  • Goal is to support innovation in ways not anticipated by the retrospective task

  • Freestyle task includes:

– rationale
– clear definition of the task
– performance measures
– reference annotations
– baseline system implementation

SLIDE 8

Technology Readiness Discussion Results

[Bar chart] Benchmark detection accuracy across a variety of low-occurrence events. For each candidate event (SitDown, StandUp, ObjectPut, ReverseDirection, ObjectGet, ChildWalking, PersonLoiters, LargeLuggage, PersonRuns, VestAppears, OpenCloseDoor, CellToEar, ObjectGive, Embrace, PeopleMeet, PeopleSplitUp, Pointing, ElevatorNoEntry, UseATM, OpposingFlow), bars show the fraction of the 13 participants rating expected accuracy as "Zero accuracy, not feasible for 2+ years", "Zero accuracy, not feasible next year", "Low", "Low-Medium", "Medium-High", or "High accuracy". The events selected for 2008 are marked.

SLIDE 9

Event Annotation Guidelines

  • Jointly developed by:

– NIST, Linguistic Data Consortium (LDC), Computer Vision Community

  • Rules help users identify event observations

– Reasonable Interpretation (RI) Rule

  • If, according to a reasonable interpretation of the video, the event must have occurred, then it is a taggable event

– Start/Stop times for occlusion

  • Observations with “occluded start times” begin with the occlusion or frame boundary
  • Observations with “occluded end times” end with the occlusion or frame boundary
  • Frame boundaries are occlusions, but the existence of the event still follows the RI Rule

  • Event Definitions left minimal to capture human intuitions

– Contrast with highly defined annotation tasks such as ACE

SLIDE 10

Annotator Training

  • Training session with lead annotator to introduce task and guidelines

  • Complete 1-3 practice files

– Tool functionality
– Data and camera views
– Annotation decisions and rules of thumb

  • Regular team meetings for ongoing training
  • Annotator mailing list to resolve challenging examples

– Usually a matter of reinforcing basic principles
– “How would you describe this event to someone else?”

  • Decisions logged to LDC wiki for annotator reference
  • NIST input sought on issues that could not be resolved locally

SLIDE 11

Annotation Tool and Data Processing

  • Annotation Tool

– ViPER GT, developed by UMD (now AMA)

  • http://viper-toolkit.sourceforge.net/

– NIST and LDC adapted tool for workflow system compatibility

  • Data Pre-processing

– OS limitations required conversion from MPEG to JPEG (a conversion sketch follows this slide)

  • 1 JPEG image for each frame

– For each video clip assigned to annotators

  • Divided JPEGs into framespan directories
  • Created .info file specifying order of JPEGs
  • Created ViPER XML file (XGTF) with pointer to .info file

– Default ViPER playback rate = about 25 frames (JPEGs)/second
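A minimal sketch of this pre-processing step, assuming ffmpeg is available for the MPEG-to-JPEG conversion and assuming the .info file is simply an ordered list of frame images; the directory layout and file naming are illustrative, not the actual NIST/LDC workflow code.

```python
import subprocess
from pathlib import Path

def preprocess_clip(mpeg_path: str, out_dir: str) -> None:
    """Convert an MPEG clip into per-frame JPEGs and write an ordered .info list."""
    frames_dir = Path(out_dir) / "frames"
    frames_dir.mkdir(parents=True, exist_ok=True)

    # Decode every frame to a numbered JPEG (playback is about 25 JPEGs/second).
    subprocess.run(
        ["ffmpeg", "-i", mpeg_path, str(frames_dir / "%06d.jpg")],
        check=True,
    )

    # Record the playback order of the JPEGs; the ViPER XGTF file points at this list.
    # The exact .info format expected by the adapted ViPER tool is an assumption here.
    jpegs = sorted(frames_dir.glob("*.jpg"))
    (Path(out_dir) / "clip.info").write_text("\n".join(p.name for p in jpegs) + "\n")
```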

SLIDE 12

Annotation Workflow Design

  • Pilot study to determine optimal balance of clip duration and number of events per work session

  • Source data divided into 5m 10s clips (see the clip-splitting sketch at the end of this slide)

– 10s = 5s of overlap with each of the preceding and following clips

  • Events divided into 2 sets of 5

– Set 1: PersonRuns, CellToEar, ObjectPut, Pointing, ElevatorNoEntry
– Set 2: PeopleMeet, PeopleSplitUp, Embrace, OpposingFlow, TakePicture

  • For each assigned clip + event set, detect any event occurrence and label its temporal extent
  • 5% of the devtest set dually annotated (double-blind) to establish baseline IAA and permit consistency analysis
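A minimal sketch of the clip segmentation described above, assuming times in seconds; the boundary handling at the start and end of a session is an assumption, not the actual LDC segmentation code.

```python
def clip_spans(total_secs: float, core: float = 300.0, overlap: float = 5.0):
    """Return (start, end) spans: 5-minute cores padded with 5 s of overlap,
    so interior clips are 5 m 10 s long and neighbouring clips share 10 s of video."""
    spans = []
    core_start = 0.0
    while core_start < total_secs:
        core_end = min(core_start + core, total_secs)
        spans.append((max(0.0, core_start - overlap),
                      min(total_secs, core_end + overlap)))
        core_start = core_end
    return spans

# Example: a 2-hour collection session yields 24 clips.
print(len(clip_spans(2 * 60 * 60)))  # 24
```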

SLIDE 13

Visualization of Annotation Workflow

[Diagram] The source video is divided into 5-minute clips with 10-second overlaps. Annotators A1 and A2 work on Event Set 1 (events E1-E5); annotators A3 and A4 work on Event Set 2 (events E6-E10).

SLIDE 14

Annotation Rates

  • Average 10-15 × real time

– i.e. 50-75 mins per 5-minute clip, with 5 events under consideration per clip

  • Annotation rates heavily conditioned by camera view
SLIDE 15

Annotation Rates

  • Average 6-9 × real time (10-15 × real time including upper outliers)

– i.e. 31-46.5 mins per 5-minute clip, with 5 events under consideration per clip

  • Annotation rates heavily conditioned by camera view
SLIDE 16

Annotation Challenges

  • Ambiguity of guidelines

– Loosely defined guidelines tap into human intuition instead of forcing real-world data into artificial categories
– But human intuitions often differ on borderline cases
– Lack of specification can also lead to incorrect interpretation

  • Too broad (e.g. baby as object in ObjectPut)
  • Too strict (e.g. person walking ahead of group as PeopleSplitUp)
  • Ambiguity and complexity of data

– Video quality leads to missed events and ambiguous event instances

  • Gesturing or pointing? ObjectPut or picking up an object? CellToEar or fixing hair?

  • Human factors

– Annotator fatigue a real issue for this task

  • Technical issues
SLIDE 17

Example Observations

[Images] Easy-to-find and hard-to-find example observations for the Pointing and Embrace events.

SLIDE 18

Table of Participants Vs Events

Events (columns): CellToEar, ElevatorNoEntry, Embrace, ObjectPut, OpposingFlow, PeopleMeet, PeopleSplitUp, PersonRuns, Pointing, TakePicture

  • 16 Sites
  • 72 Event Runs

Events entered per site (count of X marks): AIT 3, BUT 4, CMU 10, DCU 5, FD 3, IFP-UIUC-NEC 10, Intuvision 3, MCG-ICT-CAS 7, NHKSTRL 3, QMUL-ACTIVA 3, SJTU 5, THU-MNL 3, TokyoTech 3, Toshiba 3, UAM 3, UCF 4

Total row (runs per event): CellToEar 3, ElevatorNoEntry 11, Embrace 4, ObjectPut 5, OpposingFlow 15, PeopleMeet 6, PeopleSplitUp 4, PersonRuns 15, Pointing 3, TakePicture 6

SLIDE 19

Rates of Event Observations

Development vs. Evaluation Data

[Bar chart] RateTarget (observations/hour, roughly 5-50) per event in the Dev 08 vs. Eval 08 data. A single RTarget (20) was chosen for the evaluation.

SLIDE 20

Evaluation Protocol Synopsis

  • NIST used the Framework for Detection Evaluation (F4DE) Toolkit

– Available for download on the Event Detection web site

  • Events are independent for evaluation purposes
  • Two-step evaluation process

– System observations are “aligned” to reference observations
– Detection performance is a tradeoff between missed detections and false alarms

  • Two methods of evaluating performance

– Decision Error Tradeoff curves graphically depict performance
– A “Surrogate Application”: Normalized Detection Cost Rate (NDCR)

  • A priori application requirements are unknown
  • Optimization to be achieved using a “System Value Function”

SLIDE 21

Temporal Alignment for Detection in Streaming Media

[Diagram] Reference observations and system observations shown along a time line.

  • Alignment is the Hungarian solution to bipartite graph matching (sketched below)
  • Mapping alignment rules

– Mid point of the system observation within Δt of the reference extent
– Temporal congruence and decision scores give preference to overlapping events
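A minimal sketch of this alignment step, assuming scipy's Hungarian (linear sum assignment) solver; the feasibility test and the overlap-based preference below are simplified stand-ins for the actual F4DE kernel functions, and the decision-score preference is omitted for brevity.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def align(ref, sys, delta_t=0.5):
    """ref, sys: lists of (start, end) observation extents in seconds.

    Returns (ref_index, sys_index) pairs. A pairing is feasible when the
    system observation's midpoint falls within delta_t of the reference
    extent; among feasible pairings, more temporal overlap is preferred.
    """
    BIG = 1e6  # cost assigned to infeasible pairings
    cost = np.full((len(ref), len(sys)), BIG)
    for i, (rs, re) in enumerate(ref):
        for j, (ss, se) in enumerate(sys):
            mid = 0.5 * (ss + se)
            if rs - delta_t <= mid <= re + delta_t:            # feasibility rule
                overlap = max(0.0, min(re, se) - max(rs, ss))  # temporal congruence
                cost[i, j] = -overlap                          # prefer more overlap
    rows, cols = linear_sum_assignment(cost)
    return [(i, j) for i, j in zip(rows, cols) if cost[i, j] < BIG]
```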
SLIDE 22

Decision Error Tradeoff Curves: PMiss vs. RateFA

[Histogram] Decision score histogram: count of observations vs. decision score (full distribution).

SLIDE 23

Decision Error Tradeoff Curves: PMiss vs. RateFA

[Histogram] Decision score histogram separated with respect to the reference annotations: incorrect system observations (non-targets) vs. true observations (targets), with a decision threshold Θ.

For a threshold θ:

  RateFA(θ) = #FalseAlarms(θ) / SignalDuration
  PMiss(θ) = #MissedObservations(θ) / #TrueObservations

Normalizing by the number of non-observations is impossible for streaming detection evaluations, so a false alarm rate is used in place of a false alarm probability.

SLIDE 24

Decision Error Tradeoff Curves: PMiss vs. RateFA

[Histogram] The same separated decision score histogram, with the threshold Θ swept across the system decision scores.

Compute RateFA(θ) and PMiss(θ) for all Θ, giving one DET point (RateFA(θ), PMiss(θ)) per threshold:

  Minimum NDCR = min over θ of [ PMiss(θ) + (CostFA / (CostMiss · RTarget)) · RateFA(θ) ]

SLIDE 25

Decision Error Tradeoff Curves: Actual vs. Minimum NDCR

[Histogram] The same decision score histogram, now split at the system's actual decision threshold: system observations with a /YES/ decision vs. system observations with a /NO/ decision.

Event detection constants: CostMiss = 10, CostFA = 1, RTarget = 20

  Minimum NDCR = min over θ of [ PMiss(θ) + (CostFA / (CostMiss · RTarget)) · RateFA(θ) ]
  Actual NDCR = PMiss(actual decisions) + (CostFA / (CostMiss · RTarget)) · RateFA(actual decisions)
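A minimal sketch of the NDCR computation under the formulas above, assuming system observations have already been aligned to the reference as on the earlier slide; the function and variable names are illustrative, not the F4DE implementation.

```python
COST_MISS, COST_FA, R_TARGET = 10.0, 1.0, 20.0
BETA = COST_FA / (COST_MISS * R_TARGET)  # = 0.005

def ndcr(p_miss: float, rate_fa: float) -> float:
    """Normalized Detection Cost Rate at one operating point."""
    return p_miss + BETA * rate_fa

def det_points(matched_scores, fa_scores, n_true, signal_hours):
    """Sweep the decision threshold over all observed scores.

    matched_scores: scores of system observations aligned to a reference event
    fa_scores:      scores of unaligned (false alarm) system observations
    n_true:         number of true (reference) observations
    signal_hours:   duration of the evaluated video, in hours
    """
    points = []
    for theta in sorted(set(matched_scores) | set(fa_scores)):
        hits = sum(s >= theta for s in matched_scores)
        fas = sum(s >= theta for s in fa_scores)
        p_miss = (n_true - hits) / n_true
        rate_fa = fas / signal_hours
        points.append((theta, p_miss, rate_fa, ndcr(p_miss, rate_fa)))
    return points

# Minimum NDCR is the smallest NDCR over the sweep; Actual NDCR uses only the
# observations the system marked with a YES decision instead of a threshold.
```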

SLIDE 26

PersonRuns Event

Best Submission per Site

[Bar chart] Minimum NDCR and Actual NDCR (0 to 1.5) for the best PersonRuns submission from each site: SJTU_1-p-baseline_1, QMUL-ACTIVA_3-p-…, NHKSTRL_4-p-NHK-…, MCG-ICT-CAS_2-p-…, IFP-UIUC-NEC_3-p-1_3, FD_1-p-base_1, DCU_1-p-DCUSystem_1, CMU_11-p-VCUBE_1, BUT_2-p-butsys_1, AIT_1-p-baseline_1, UCF_1-p-UCF08_1, UAM_1-p-baseline_1, Toshiba_2-p-baseline_1, TokyoTech_3-p-EVAL_1, THU-MNL_2-c-…

SLIDE 27

Estimating Human Error Rates: 6-Way Annotation Study

  • LDC created 6 independent annotations for each excerpt

  • Caveats of the experiment

– Not balanced by events
– Not balanced by annotators

  • Blindly merge all annotations

– Use evaluation code to iteratively merge annotations
– Commonly detected observations counted once

  • Analysis:

– Curves follow published studies on finding software bugs*
– Curves suggest more annotation is needed for some events, but false alarms haven’t been accounted for
– LDC reviewed all observed events (100% adjudication)

* Nielsen and Landauer: “A Mathematical Model of Finding Usability Problems”
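For context, the discovery model those studies use (the slide cites it but does not give the formula, so this is added background rather than the authors' own equation) predicts the number of distinct observations found after merging i independent annotators, assuming each of N true observations is found by any single annotator with probability p:

  $\mathrm{Found}(i) = N\,\bigl(1 - (1 - p)^{i}\bigr)$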

SLIDE 28

Estimating Human Error Rates: Humans vs. 6-Way Adjudicated References

[Scatter plot] Human PMiss (0.2-1.2) vs. RateFA (5-30 false alarms per hour) measured against the 6-way adjudicated references, by event: CellToEar, Embrace, ObjectPut, OpposingFlow, PeopleMeet, PeopleSplitUp, PersonRuns, Pointing, TakePicture. Circle area is the duration of video annotated, from 5 to 50 minutes.

SLIDE 29

PersonRuns Event

Best Submission per Site with Human Error Estimates

[DET plot] Circle area is the duration of video annotated (up to 50 minutes).
SLIDE 30

Random DET Curves for Streaming Detection Evaluations

  • Parametric random curves are not possible

– Due to un-countable non-target trials
– Monte Carlo simulation is a feasible method

  • Monte Carlo Random DET Curves

– Two factors influence a random system

  • RTarget (primary effect)
  • Observation duration statistics (secondary effect)

– Distribution measurements: mean, standard deviation, etc.
– Test set size computation (Rule of 30 @ 40% PMiss)

  • #Hours = 30 errs / 0.4 (PMiss) / RTarget

– Our procedure (a sketch follows this slide):

1. Measure RTarget and the mean duration of observations in the eval set
2. Construct 50 pairs of a random test set and system output, with decision scores drawn from a uniform random distribution, at 1000 system observations/hour
3. Compute a ref/sys pair-averaged DET curve
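A minimal sketch of that procedure, assuming numpy; the exponential duration model and the score range are assumptions, and the example values are the PersonRuns parameters from the next slide, not output of the actual NIST simulation.

```python
import numpy as np

def random_pair(hours, r_target, mean_dur, sys_rate=1000.0, rng=None):
    """Generate one random (reference, system) pair for a Monte Carlo DET curve.

    hours:    test set duration in hours
    r_target: true observation rate (observations/hour), e.g. 6.36 for PersonRuns
    mean_dur: mean observation duration in seconds, e.g. 3.25
    sys_rate: random system output rate (observations/hour), 1000 per the slide
    """
    if rng is None:
        rng = np.random.default_rng()
    total_s = hours * 3600.0

    def spans(rate_per_hour):
        starts = np.sort(rng.uniform(0.0, total_s, int(rate_per_hour * hours)))
        durs = rng.exponential(mean_dur, starts.size)  # duration model is an assumption
        return list(zip(starts, starts + durs))

    ref = spans(r_target)
    sys = spans(sys_rate)
    scores = rng.uniform(0.0, 1.0, len(sys))  # uniform random decision scores
    return ref, sys, scores

# Rule-of-30 test set size at 40% PMiss: hours = 30 / 0.4 / RTarget
print(30 / 0.4 / 6.36)  # about 11.8 hours for PersonRuns, matching TestDur = 12H
```

Repeating the pairing 50 times and averaging the per-pair DET curves (via the alignment and threshold sweep sketched earlier) gives the random baseline curve.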

SLIDE 31

PersonRuns Event

Best Submission per Site with Human Error Estimates and Random Curves

  • Random system: RTarget = 6.36, MeanDur = 3.25 s, TestDur = 12 H
SLIDE 32

PeopleMeet Event

Best Submission per Site

[Bar chart] Minimum NDCR and Actual NDCR (0 to 1.5) for the best PeopleMeet submission from each site: UAM_1-p-baseline_1, TokyoTech_3-p-EVAL_1, SJTU_1-p-baseline_1, MCG-ICT-CAS_2-p-baseline_1, DCU_1-p-DCUSystem_1, CMU_11-p-VCUBE_1

  • Random system: RTarget = 23.46, MeanDur = 129.4 s, TestDur = 3 H
SLIDE 33

PeopleSplitUp Event

Best Submission per Site

[Bar chart] Minimum NDCR and Actual NDCR (0 to 1.5) for the best PeopleSplitUp submission from each site: UAM_1-p-baseline_1, TokyoTech_3-p-EVAL_1, MCG-ICT-CAS_2-p-baseline_1, CMU_11-p-VCUBE_1

  • Random system: RTarget = 13.2, MeanDur = 11.66 s, TestDur = 6 H
SLIDE 34

Opposing Flow Event

Best Submission per Site

[Bar chart] Minimum NDCR and Actual NDCR (0 to 1.5) for the best OpposingFlow submission from each site: UCF_1-p-UCF08_1, Toshiba_2-p-baseline_1, SJTU_1-p-baseline_1, NHKSTRL_4-p-NHK-SYS1_2, Intuvision_2-p-zipsub_1, FD_1-p-base_1, CMU_11-p-VCUBE_1, AIT_1-p-baseline_1

  • Too few human annotations
  • No random system
  • RTarget = 0.23, MeanDur = 2.38 s, TestDur = 326 H

SLIDE 35

Elevator No Entry Event

Best Submission per Site

[Bar chart] Minimum NDCR and Actual NDCR (0 to 1.5) for the best ElevatorNoEntry submission from each site: Toshiba_2-p-baseline_1, SJTU_1-p-baseline_1, QMUL-ACTIVA_3-p-baseline_1, NHKSTRL_4-p-NHK-SYS1_3, MCG-ICT-CAS_2-p-Run2_1, Intuvision_2-p-zipsub_1, IFP-UIUC-NEC_3-p-2_1, DCU_1-p-DCUSystem_1, CMU_11-p-VCUBE_1, BUT_2-p-butsys_1, AIT_1-p-baseline_1

  • No human performance available
  • No random system
  • RTarget = 0.11, MeanDur = 11.5 s, TestDur = 642 H

SLIDE 36

Object Put Event

Best Submission per Site

[Bar chart] Minimum NDCR and Actual NDCR (0 to 1.5) for the best ObjectPut submission from each site: UCF_1-p-UCF08_1, UAM_1-p-baseline_1, IFP-UIUC-NEC_3-p-1_3, CMU_11-p-VCUBE_1, BUT_2-p-butsys_1

  • Random system: RTarget = 38.5, MeanDur = 1.08 s, TestDur = 1 H
SLIDE 37

Embrace Event

Best Submission per Site

[Bar chart] Minimum NDCR and Actual NDCR (0 to 1.5) for the best Embrace submission from each site: MCG-ICT-CAS_2-p-baseline_1, IFP-UIUC-NEC_3-p-1_3, DCU_1-p-DCUSystem_1, CMU_11-p-VCUBE_1

  • Random system: RTarget = 8.09, MeanDur = 5.2 s, TestDur = 9 H
SLIDE 38

CellToEar Event

Best Submission per Site

[Bar chart] Minimum NDCR and Actual NDCR (0 to 1.5) for the best CellToEar submission from each site: THU-MNL_2-c-contrast_2, IFP-UIUC-NEC_3-p-2_1, CMU_11-p-VCUBE_1

  • Random system: RTarget = 7.15, MeanDur = 27 s, TestDur = 5 H
SLIDE 39

Pointing Event

Best Submission per Site

[Bar chart] Minimum NDCR and Actual NDCR (0 to 1.5) for the best Pointing submission from each site: SJTU_1-p-baseline_1, IFP-UIUC-NEC_3-p-1_3, CMU_11-p-VCUBE_1

  • Random system: RTarget = 45.47, MeanDur = 1.43 s, TestDur = 2 H
SLIDE 40

TakePicture Event

Best Submission per Site

[Bar chart] Minimum NDCR and Actual NDCR (0 to 1.5) for the best TakePicture submission from each site: UCF_1-p-UCF08_1, Intuvision_2-p-zipsub_1, IFP-UIUC-NEC_3-p-1_3, FD_1-p-base_1, CMU_11-p-VCUBE_1

  • Random system: RTarget = 0.44, MeanDur = 9.34 s, TestDur = 170 H
SLIDE 41

Best Run: All Events

[Chart] Best run for each event, plotted by the ordinal position of the selected events from the Technology Readiness Discussion.

SLIDE 42

Adjudication Summary

  • Dual annotation studies indicated a low recall rate for human annotators

– NIST and LDC designed a system-mediated adjudication framework focused on improving recall

  • Adjudication process for streaming detection (a prioritization sketch follows this slide)

– Merge system false alarms to develop a prioritized list of excerpts to review:

  • Take into account existing annotations
  • Take into account temporally overlapping annotations

– Review the top 100 false alarm excerpts, sorted by

  • Inter-system agreement
  • Average decision score
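A minimal sketch of that prioritization, assuming each system's false alarms are (start, end, score) tuples; the way overlapping false alarms are merged into excerpts and the exact sort order are assumptions, not the NIST/LDC adjudication code.

```python
def prioritize_excerpts(fa_by_system, top_n=100):
    """Cluster temporally overlapping false alarms across systems, then rank
    excerpts by inter-system agreement and average decision score."""
    # Flatten to (start, end, score, system) and sort by start time.
    obs = sorted(
        (s, e, sc, sys_id)
        for sys_id, fas in fa_by_system.items()
        for (s, e, sc) in fas
    )
    excerpts, current = [], None
    for s, e, sc, sys_id in obs:
        if current and s <= current["end"]:            # overlaps the open excerpt
            current["end"] = max(current["end"], e)
            current["scores"].append(sc)
            current["systems"].add(sys_id)
        else:                                          # start a new excerpt
            current = {"start": s, "end": e, "scores": [sc], "systems": {sys_id}}
            excerpts.append(current)
    # Higher inter-system agreement first, then higher average score.
    excerpts.sort(
        key=lambda x: (len(x["systems"]), sum(x["scores"]) / len(x["scores"])),
        reverse=True,
    )
    return excerpts[:top_n]
```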
slide-43
SLIDE 43

Effect of Adjudication

[Two charts] On Annotations: number of new event observations found after reviewing the top 100 false-alarm excerpts, per event (CellToEar, ElevatorNoEntry, Embrace, ObjectPut, OpposingFlow, PeopleMeet, PeopleSplitUp, PersonRuns, Pointing, TakePicture). On System Scores: MinNDCR(post-adjudication) − MinNDCR(pre-adjudication) per event run (x-axis: number of event runs); values below zero indicate better scores after adjudication, values above zero worse scores.

SLIDE 44

Conclusions

  • Detecting events in high volumes of found data is feasible

– 16 sites completed the evaluation
– Human annotation performance indicates the task has a high degree of difficulty
– A 50-hour test set is insufficient for low-frequency events, but 12 hours is sufficient for most events