Unsupervised scene adaptation for faster multi-scale pedestrian detection


SLIDE 1

Unsupervised scene adaptation for faster multi-scale pedestrian detection

Speaker: Federico Bartoli¹
Giuseppe Lisanti¹, Svebor Karaman¹, Andrew D. Bagdanov² and Alberto Del Bimbo¹

¹ MICC (Media Integration and Communication Center) - University of Florence, Italy

{firstname.lastname}@unifi.it

² CVC (Computer Vision Center) - Autonomous University of Barcelona, Spain

bagdanov@cvc.uab.es

Federico Bartoli (Unifi::Micc) Faster Multi-Scale Pedestrian Detection 28 August 2014 1/ 18

SLIDE 2

Real-time Pedestrian Detection

Application contexts

1. Video surveillance
2. Tracking
3. People re-identification
4. Action recognition

Main critical factors

1. Changes of scale and strong viewpoint dependency
   ◮ Different target locations can produce large scale changes
   ◮ Loss of scene depth information in the image
2. Variability
   ◮ Different person poses (e.g. front or side view)
   ◮ Changes in illumination intensity
3. Scene complexity
   ◮ Indoor or outdoor
   ◮ Clutter, crowds and partial occlusion

SLIDE 3

Standard execution pipeline of a multi-scale pedestrian detector

Four principal phases, each performing a specific task:

1. Feature extraction on the image pyramid
2. Detection window proposal: sparse or dense sampling
3. Classification: boosting, SVM
4. Non-maximum suppression

[Figure: pipeline diagram. Image → pyramid of images → pyramid of features → detection window proposal → classifier → non-maximum suppression. Inside the classifier, each detection window passes through stages 1, 2, ..., N: a window that fails a stage (F) is rejected, and a window that passes every stage (T) is kept as a positive.]

SLIDE 4

Standard execution pipeline of a multi-scale pedestrian detector

Four principal phases. Main bottlenecks, with the speed-up considered for each:

1. Feature extraction on the image pyramid: channel features [Dollar’14]
2. Detection window proposal (sparse or dense sampling): scene-adapted detection window proposal
3. Classification (boosting, SVM): soft cascade approximation
4. Non-maximum suppression


SLIDE 5

Faster multi-scale pedestrian detection

Question

How can we increase the speed of a pre-trained pedestrian detector on a given scene?

Proposed framework

◮ Speeds up the detection process of a soft cascade pedestrian detector
◮ Requires no a priori information about the scene
◮ All learning is done by mining statistics about the detector operating on the scene
◮ Exploits only ROS (Region of Support) information to build the models

Strategies:

◮ Linear Cascade Approximation: acts on the classifier domain; for each sample it estimates a final score without computing all stages
◮ Generative model for candidate window proposal: acts on the pyramid domain, modelling the scene-dependent statistics of detection windows in terms of both location and scale

The result is a significant reduction in the total number of stage evaluations required in the soft cascade detection process.

SLIDE 6

Linear Cascade Approximation

Soft Cascade Architecture

Let $x \in \mathbb{R}^D$ be a sample to evaluate and $Y \in \{-1, 1\}$ its class label.

◮ Classifier: $H(x) = \sum_{k=1}^{T} f_k(x)$, where $f_k : \mathbb{R}^D \to \mathbb{R}$ is a stage computation
◮ Partial score: $H_t(x) = \sum_{k=1}^{t} f_k(x)$, the sum of the first $t$ stage scores
◮ $x$ is classified positive ($Y = 1$) $\iff \Psi(H_t(x), \theta_t) \ge 0 \;\; \forall t \in [1, T]$, where $\Psi$ is a stopping criterion and $\{\theta_t\}$ are the per-stage rejection thresholds

Linear Cascade Approximation

Objective: for a given test sample $x$, consider only a reduced number $t < T$ of stages of $H(x)$ when assigning a score to a detection window.

Find $\tilde{H}_{t \to T}(x) \in \mathbb{R}$ that estimates $H$ using only the first $t$ stages of the soft cascade, such that:

$H(x) \simeq \tilde{H}_{t \to T}(x) \quad \forall x \in P(I)$
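A minimal Python sketch of the soft cascade evaluation defined on this slide. It assumes the stopping criterion is $\Psi(H_t, \theta_t) = H_t - \theta_t$, a common choice that the slide does not state; the names `soft_cascade`, `stages` and `thresholds` are hypothetical and chosen only for illustration.

```python
from typing import Callable, Optional, Sequence, Tuple

Stage = Callable[[Sequence[float]], float]

def soft_cascade(x: Sequence[float],
                 stages: Sequence[Stage],
                 thresholds: Sequence[float]) -> Tuple[Optional[float], int]:
    """Return (H(x), T) if x passes every stage, else (None, stages evaluated)."""
    partial = 0.0
    for t, (f, theta) in enumerate(zip(stages, thresholds), start=1):
        partial += f(x)              # H_t(x) = H_{t-1}(x) + f_t(x)
        if partial - theta < 0:      # assumed Psi(H_t, theta_t) = H_t - theta_t
            return None, t           # rejected early at stage t
    return partial, len(stages)

# Toy cascade: every stage adds the first feature of x.
stages = [lambda x: x[0]] * 4
thresholds = [-1.0, -1.0, 0.0, 0.0]
accepted = soft_cascade([0.5], stages, thresholds)
rejected = soft_cascade([-0.6], stages, thresholds)
```

The second return value is the number of stages actually evaluated, which is the quantity the approximation strategies in the following slides aim to reduce.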

SLIDE 7

Linear Cascade Approximation

Example: average positive traces extracted from a soft cascade of 1024 stages on the Oxford dataset. Traces are colored according to their pyramid level.

[Figure: partial score (y-axis) versus stage (x-axis, up to 1024), one trace per pyramid level, levels 0 to 23.]

SLIDE 8

Linear Cascade Approximation

Strategy

◮ Group all traces by their pyramid level
◮ Use linear regression to estimate the parameters (slope and intercept) for the interpolation
◮ Compute the average trace for each group

The final score approximation takes the following form:

$\tilde{H}_{t \to T}(x) = \bar{w}_l \cdot (T - t) + H_t(x) + \bar{\epsilon}_l$

where:

◮ $l$: level of $x$
◮ $\bar{w}_l \equiv E[\{w_l^i\}]$ are the average trace parameters for level $l$:
  ◮ $w_l^i = \arg\min_w \| S^T w - h_{t \to T}(x^{(i)}) \|$
  ◮ $w \in \mathbb{R}^2$, $w = \begin{bmatrix} w_0 \\ w_1 \end{bmatrix}$, with $w_0$ the intercept and $w_1$ the slope
  ◮ $S = \begin{bmatrix} 1 & 1 & 1 & \cdots & 1 \\ t & t + \Delta & t + 2\Delta & \cdots & T \end{bmatrix}$
  ◮ $h_{t \to T}^T(x^{(i)}) = \begin{bmatrix} H_t(x^{(i)}) & H_{t+\Delta}(x^{(i)}) & \cdots & H_T(x^{(i)}) \end{bmatrix}$
  ◮ $\Delta$: sampling step for the stages used in the regression
◮ $\bar{\epsilon}_l$: average interpolation error at stage $T$
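The per-level regression can be sketched in plain Python as follows. This is an illustration under assumptions, not the authors' implementation: it reads the approximation formula as using only the average slope to extrapolate from stage $t$ to stage $T$, and the function names (`fit_line`, `fit_level`, `approximate_final_score`) are hypothetical.

```python
from statistics import mean

def fit_line(xs, ys):
    """Least-squares intercept w0 and slope w1 of ys against xs."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    w1 = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return my - w1 * mx, w1

def fit_level(traces, t, delta):
    """Average trace parameters and mean error at stage T for one pyramid level.

    traces[i][k - 1] holds the partial score H_k(x_i) for k = 1..T.
    """
    T = len(traces[0])
    stages = list(range(t, T + 1, delta))          # t, t + delta, ..., up to T
    fits = [fit_line(stages, [tr[k - 1] for k in stages]) for tr in traces]
    w0_bar = mean(w0 for w0, _ in fits)
    w1_bar = mean(w1 for _, w1 in fits)
    # Mean residual of the slope-only extrapolation at the last stage T.
    eps_bar = mean(tr[-1] - (tr[t - 1] + w1_bar * (T - t)) for tr in traces)
    return (w0_bar, w1_bar), eps_bar

def approximate_final_score(h_t, t, T, w_bar, eps_bar):
    """Assumed reading: H~_{t->T}(x) = slope * (T - t) + H_t(x) + eps."""
    return w_bar[1] * (T - t) + h_t + eps_bar

# Two synthetic traces that grow linearly with the stage index.
T = 1024
traces = [[a + 0.125 * k for k in range(1, T + 1)] for a in (1.0, 3.0)]
w_bar, eps_bar = fit_level(traces, t=256, delta=64)
estimate = approximate_final_score(traces[0][255], 256, T, w_bar, eps_bar)
```

On perfectly linear traces the estimate matches the true final score; real traces deviate from a line, which is what the average error term is meant to absorb.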

SLIDE 9

Generative Model for candidate window proposal

Observations:

◮ The presence and scale of targets depend strongly on the geometry of the scene
◮ In any sub-region of the frame, only detection windows in a limited scale range can be detected
◮ Evaluating all possible scales in all sub-regions of the image is wasteful

Idea

Only evaluate detection windows with a high likelihood of being a local maximum, given the geometric and scale statistics of the scene.

[Figure: two panels, "Sliding windows" and "Candidate Window Proposal".]

SLIDE 10

Generative Model for candidate window proposal

1 Leveraging Region of Support (ROS) information:

The ROS is indicative of both the detector precision and the scene geometry:

◮ The cardinality of each ROS is a good indicator of a true positive: objects with a low rank are often false positives.
◮ The location and scale of strong detections ("strongs") can be used to learn a model that describes the geometry and perspective of the scene.

ROS information is discriminative and can be extracted at no additional cost during the non-maximum suppression process.

[Figure: some strongs (and their ROS) from a soft cascade classifier on a frame from Oxford.]
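To illustrate how the ROS can be collected during non-maximum suppression, here is a hedged Python sketch. It assumes a greedy, IoU-based NMS with a 0.5 overlap threshold; the slides do not specify the NMS variant, and `iou` and `nms_with_ros` are hypothetical helper names.

```python
def iou(a, b):
    """Intersection over union of two boxes (x0, y0, x1, y1)."""
    iw = min(a[2], b[2]) - max(a[0], b[0])
    ih = min(a[3], b[3]) - max(a[1], b[1])
    if iw <= 0 or ih <= 0:
        return 0.0
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms_with_ros(detections, overlap=0.5):
    """Greedy NMS over (box, score) pairs; returns a list of (strong, ros) pairs."""
    remaining = sorted(detections, key=lambda d: d[1], reverse=True)
    result = []
    while remaining:
        strong, rest = remaining[0], remaining[1:]
        # Windows suppressed by this strong detection form its region of support.
        ros = [d for d in rest if iou(strong[0], d[0]) >= overlap]
        remaining = [d for d in rest if iou(strong[0], d[0]) < overlap]
        result.append((strong, ros))
    return result

detections = [((0, 0, 10, 20), 0.9), ((1, 0, 11, 20), 0.8),
              ((0, 1, 10, 21), 0.7), ((50, 50, 60, 70), 0.3)]
strongs = nms_with_ros(detections)
ros_sizes = [len(ros) for _, ros in strongs]
```

In this toy input the isolated low-score window ends up with an empty ROS, which is the kind of detection the slide describes as often being a false positive.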

SLIDE 11

Generative Model for candidate window proposal

2 Scene Model ($M_n$)

[Figure: "Grid" and "Training Set Extraction" panels; plot axes labelled Level and |ROS|.]

$M_n = (G_n, \{\tilde{H}_b^l\}, \{\mu_b^l, \Sigma_b^l\}, \{E_b\})$

where:

◮ $n$: grid of $n^2$ blocks
◮ $1 \le l \le L$: pyramid levels
◮ $\tilde{H}_b^l$: $H_b^l$ normalized over all levels $l$ in block $b$
◮ $E_b = \dfrac{\sum_{l=1}^{L} H_b^l}{\sum_{\tilde{b} \in G_n} \sum_{l=1}^{L} H_{\tilde{b}}^l}$

Observations:

◮ Training of the model is weakly supervised
◮ The search is differentiated according to the sub-region of the frame (block)
◮ Detection windows are generated based on spatial position ($\{\mu_b^l\}$, $\{\Sigma_b^l\}$), scale ($\{H_b\}$) and energy ($\{E_b\}$)
◮ No calibration is needed
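A small Python sketch of the histogram and energy terms of the scene model. It assumes that $H_b^l$ counts the strong detections whose centre falls in block $b$ at pyramid level $l$; the slide does not define $H_b^l$ explicitly, so this reading and the function name `build_scene_model` are assumptions. The Gaussian parameters $\mu_b^l$ and $\Sigma_b^l$ are left out for brevity.

```python
from collections import defaultdict

def build_scene_model(strongs, frame_size, n, levels):
    """strongs: iterable of (x, y, level) for detections kept after NMS.

    Returns (normalized, energy), both keyed by block (row, col).
    Blocks that received no detections are simply absent from both dicts.
    """
    width, height = frame_size
    hist = defaultdict(lambda: [0.0] * levels)     # assumed H_b^l: counts per level
    for x, y, level in strongs:
        col = min(int(x * n / width), n - 1)
        row = min(int(y * n / height), n - 1)
        hist[(row, col)][level] += 1.0
    total = sum(sum(h) for h in hist.values())
    # H~_b^l: share of each level inside block b.
    normalized = {b: [v / sum(h) for v in h] for b, h in hist.items()}
    # E_b: share of block b in the whole grid.
    energy = {b: sum(h) / total for b, h in hist.items()}
    return normalized, energy

strongs = [(100, 100, 0), (120, 90, 0), (110, 95, 1), (500, 400, 3)]
normalized, energy = build_scene_model(strongs, (640, 480), n=2, levels=4)
```

With these four toy detections on a 2 × 2 grid, three quarters of the energy goes to the top-left block and one quarter to the bottom-right block.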

SLIDE 12

Generative Model for candidate window proposal

3 Candidate windows proposal at detection time

Algorithm

For each block $b$ and scale $l$ of the image pyramid $P(I)$:

◮ Compute the total number of detection windows to generate: $N = \gamma\,|P(I)|\,E_b\,H_b^l$
◮ If there is not enough information ($H_b^l < \tau$) $\Rightarrow$ uniform extraction in the block region
◮ Else randomly sample from the normal distribution $\mathcal{N}(\mu_b^l, \Sigma_b^l)$ with covariance expansion:
  ◮ Round-based strategy
  ◮ In each round the covariance matrix is expanded by a factor (using the $\chi^2_\alpha$ distribution)
  ◮ Iterate until the total number of detection windows obtained is approximately $N$
  ◮ Reduce duplicate samples

Parameter $\gamma \in [0, 1]$:

◮ Proportion of the detection windows of a pyramid to be evaluated
◮ An estimate of the final speedup we want from the resulting detector
◮ Tradeoff between speed ($\gamma \to 0$) and accuracy ($\gamma \to 1$) of the detector
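The proposal step for a single block and level might look like the Python sketch below. Several details are assumptions made for illustration: the expansion factors are a fixed tuple rather than values derived from the $\chi^2_\alpha$ distribution, window positions are integer pixel centres, and the names `window_budget` and `propose_windows` are hypothetical.

```python
import math
import random

def window_budget(gamma, pyramid_size, e_b, h_bl):
    """N = gamma * |P(I)| * E_b * H_b^l, rounded to a whole number of windows."""
    return round(gamma * pyramid_size * e_b * h_bl)

def propose_windows(n_target, block, mu, cov, support, tau,
                    expansion=(1.0, 2.0, 4.0), seed=0):
    """Return up to n_target unique integer (x, y) centres inside block.

    block is (x0, y0, x1, y1) with x1, y1 exclusive; cov is ((sxx, sxy), (sxy, syy)).
    """
    rng = random.Random(seed)
    x0, y0, x1, y1 = block
    n_target = min(n_target, (x1 - x0) * (y1 - y0))   # cannot exceed block area
    picked = set()
    if support < tau:
        # Not enough evidence for this block and level: sample uniformly.
        while len(picked) < n_target:
            picked.add((rng.randrange(x0, x1), rng.randrange(y0, y1)))
        return sorted(picked)
    (sxx, sxy), (_, syy) = cov
    for factor in expansion:                           # one round per factor
        # Cholesky factor of the expanded 2x2 covariance.
        a = math.sqrt(factor * sxx)
        b = factor * sxy / a
        c = math.sqrt(max(factor * syy - b * b, 0.0))
        for _ in range(n_target):
            u, v = rng.gauss(0, 1), rng.gauss(0, 1)
            x, y = round(mu[0] + a * u), round(mu[1] + b * u + c * v)
            if x0 <= x < x1 and y0 <= y < y1:
                picked.add((x, y))                     # the set drops duplicates
            if len(picked) >= n_target:
                return sorted(picked)
    return sorted(picked)

budget = window_budget(0.25, 100000, 0.75, 0.5)
uniform = propose_windows(5, (0, 0, 10, 10), (5, 5),
                          ((4.0, 0.0), (0.0, 4.0)), support=0.0, tau=1.0)
gaussian = propose_windows(20, (0, 0, 100, 100), (50, 50),
                           ((25.0, 5.0), (5.0, 25.0)), support=10.0, tau=1.0)
```

The Gaussian branch may return slightly fewer than the requested number of windows if all rounds are exhausted, in line with the slide's "approximately $N$".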

SLIDE 13

Test and Results

Baseline: 3 Soft cascade with 1024 stages; image pyramid of 3 octaves with 8 levels each

Features used by stages:

◮ HOG with 6 orientation bins (0° to 360°)
◮ Gradient histogram
◮ LUV color channels

Dataset:

◮ seq. Oxford: Oxford video (3 min) sampled at 1 fps, frames resized to 640 × 480
◮ seq. PETS: 200 images extracted uniformly from PETS (795 frames), resized to 640 × 480

Speed of the proposed framework in terms of stage savings:

$\delta = \dfrac{\sum_{x \in P} [H(x)]}{\sum_{x \in X} \left( \mathbb{1}_{\{c=0\}} [H(x)] + \mathbb{1}_{\{c=1\}} [\tilde{H}_{t \to T}(x)] \right)}$
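A toy Python illustration of the stage-saving measure $\delta$. It assumes that the bracket $[\cdot]$ denotes the number of stages actually evaluated for a window, which the slide does not state explicitly; `stage_speedup` and the stage counts below are made up for the example.

```python
def stage_speedup(full_counts, adapted_counts):
    """delta = stages evaluated by the baseline / stages evaluated after adaptation."""
    return sum(full_counts) / sum(adapted_counts)

# Three windows under the baseline: two are rejected early,
# one positive runs through all 1024 stages.
full = [12, 40, 1024]
# With the linear approximation the positive stops at t = 257.
adapted = [12, 40, 257]
delta = stage_speedup(full, adapted)
```

A value of $\delta$ above 1 means fewer stage evaluations than the baseline; $\delta = 1$ means no saving.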

SLIDE 14

Performance with Linear Cascade Approximation

Datasets considered: seq. Oxford and seq. PETS
7 values for t, sampled uniformly from [64, 961]

[Figure, seq. Oxford: miss rate versus log(false positives per image), one curve per value of t plus the baseline.]

t = 129 [δ = 1.24×], t = 257 [δ = 1.17×], t = 385 [δ = 1.13×], t = 513 [δ = 1.10×], t = 641 [δ = 1.07×], t = 769 [δ = 1.04×], t = 897 [δ = 1.02×]

[Figure, seq. PETS: miss rate versus false positives per image, one curve per value of t plus the baseline.]

t = 129 [δ = 1.38×], t = 257 [δ = 1.28×], t = 385 [δ = 1.21×], t = 513 [δ = 1.16×], t = 641 [δ = 1.11×], t = 769 [δ = 1.07×], t = 897 [δ = 1.03×]

Results:

◮ seq. Oxford: reductions between 1% and 24%, with a maximum accuracy loss below 5%
◮ seq. PETS: comparable savings, with an error of no more than 4% for t > 257

SLIDE 15

Limited savings with Linear Cascade Approximation

Reducing only the number of stages during detection window evaluation is not enough to obtain high savings:

◮ $|X_P| \ll |X_N|$ (two orders of magnitude)
◮ Same cost for evaluating $X_P$ and $X_N$
◮ Maximum savings lower than 50% with respect to the full evaluation of the pyramid

[Figure: δ versus t (t up to 1000, δ between 1 and 1.5) for Oxford and PETS.]

Oxford: #positives without NMS = 429, #detection windows = 285944
PETS: #positives without NMS = 598, #detection windows = 449456

[Figure, seq. Oxford: log(total number of negatives) versus stage.]

SLIDE 16

Performance with Generative Model

Dataset: seq. Oxford
Grid $G_n$ with $n \in \{2, 3, 4, 5, 6\}$
Speedup ∈ [4×, 8×, 16×, 32×, 64×]

[Figure, grid 2 × 2: miss rate versus log(false positives per image), one curve per value of γ plus the baseline.]

γ = 0.25 (4×) [δ = 2.85×], γ = 0.125 (8×) [δ = 4.14×], γ = 0.063 (16×) [δ = 6.69×], γ = 0.031 (32×) [δ = 11.09×], γ = 0.016 (64×) [δ = 19.44×]

[Figure, grid 4 × 4: miss rate versus log(false positives per image), one curve per value of γ plus the baseline.]

γ = 0.25 (4×) [δ = 3.37×], γ = 0.125 (8×) [δ = 4.81×], γ = 0.063 (16×) [δ = 6.6579×], γ = 0.031 (32×) [δ = 10.53×], γ = 0.016 (64×) [δ = 18.79×]

Results:

◮ All configurations give savings greater than 50%
◮ For a grid size of 2 × 2, the minimum and maximum savings are 65% (2.85×) and 95% (19.44×) respectively
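The percentage savings and the speed-up factors δ quoted for the 2 × 2 grid can be related with a short check. The relation saving = 1 − 1/δ is an assumption (the slide gives only the paired numbers), but it reproduces both quoted pairs; `saving_from_speedup` is a hypothetical helper name.

```python
def saving_from_speedup(delta):
    """Fraction of stage evaluations avoided for a speed-up factor delta."""
    return 1.0 - 1.0 / delta

low = saving_from_speedup(2.85)     # gamma = 0.25 on the 2 x 2 grid
high = saving_from_speedup(19.44)   # gamma = 0.016 on the 2 x 2 grid
low_pct, high_pct = round(low * 100), round(high * 100)
```

Under this reading, any δ above 2× corresponds to a saving greater than 50%.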

SLIDE 17

Performance with our Framework

[Figure: four panels of miss rate versus false positives per image, one curve per value of t plus the baseline. The first two panels use a logarithmic false-positive axis, the last two a linear one.]

Panel 1, γ = 0.25 (4×): t = 257 [δ = 3.78×], t = 385 [δ = 3.55×], t = 513 [δ = 3.36×], t = 641 [δ = 3.20×], t = 769 [δ = 3.08×], t = 897 [δ = 3.00×]

Panel 2, γ = 0.0625 (16×): t = 257 [δ = 9.24×], t = 385 [δ = 8.55×], t = 513 [δ = 8.06×], t = 641 [δ = 7.64×], t = 769 [δ = 7.28×], t = 897 [δ = 6.97×]

Panel 3, γ = 0.25 (4×): t = 257 [δ = 5.42×], t = 385 [δ = 4.91×], t = 513 [δ = 4.51×], t = 641 [δ = 4.19×], t = 769 [δ = 3.92×], t = 897 [δ = 3.69×]

Panel 4, γ = 0.0625 (16×): t = 257 [δ = 11.26×], t = 385 [δ = 9.93×], t = 513 [δ = 8.94×], t = 641 [δ = 8.18×], t = 769 [δ = 7.55×], t = 897 [δ = 7.01×]

SLIDE 18

Performance Comparison: Our Framework vs Main Person Detectors

                          Oxford                        PETS
Detectors                 Miss-rate (%)  Savings (δ)    Miss-rate (%)  Savings (δ)
DPM 80                    97             1×             51             1×
Baseline                  99             1×             9              1×
Linear Cascade App.       98.1           1.17×          10.4           1.28×
Candidate Windows Pro.    98.9           11.09×         7.4            3.37×
With both                 98.5           12.73×         11.4           4.19×

SLIDE 19

Conclusions

◮ Approximation strategies for complexity reduction can be applied in both the classifier and the pyramid domains
◮ The maximum saving is obtained by sparsely sampling the detection windows that the classifier evaluates
◮ ROS information proves to be effective data for modelling both the geometry and the statistics of a scene

Our Framework

◮ The proposed strategies are weakly supervised
◮ The large reduction in the number of stages to evaluate allows the detector to run in real time
◮ The framework is very easy to implement and needs no dedicated hardware (e.g. a GPU)