Crowd Counting and Behavior Modeling with Convolutional Neural Networks - PowerPoint PPT Presentation


SLIDE 1

Crowd Counting and Behavior Modeling with Convolutional Neural Networks

Hongsheng Li 李鴻升

1 Dept. of Electronic Engineering, 2 Multimedia Laboratory

The Chinese University of Hong Kong

SLIDE 2

Typical Surveillance Scenario

SLIDE 3

Background Subtraction

[Stauffer and Grimson 1999] [Elgammal et al. 2000] [Zivkovic 2004] [Kim et al. 2005] [Sheikh and Shah 2005]

SLIDE 4

Crowd Tracking

[Lucas and Kanade 1981] [Shi and Tomasi 1994] [Wang et al. 2011]

SLIDE 5

Crowd Motion Analysis

[Ali and Shah 2007] [Amer and Todorovic 2011] [Chang et al. 2011] [Loy et al. 2012] [Pellegrini et al. 2009] [Zhou et al. 2013]

SLIDE 6

Contents

  • Crowd Counting

– Cross‐scene crowd count and density estimation with deep CNN [Zhang et al. CVPR’15]
– Crossing‐line crowd counting with two‐phase deep CNN [Zhao et al. ECCV’16]

  • Crowd Behavior Modeling

– Multi‐person walking path prediction [Yi et al. ECCV’16]

SLIDE 7

Contents

  • Crowd Counting

– Cross‐scene crowd count and density estimation with deep CNN [Zhang et al. CVPR’15]
– Crossing‐line crowd counting with two‐phase deep CNN [Zhao et al. ECCV’16]

  • Crowd Behavior Modeling

– Multi‐person walking path prediction [Yi et al. ECCV’16]

SLIDE 8

Cross‐scene Crowd Counting

  • Problem definition

– Counting the people in the Region‐of‐Interest (ROI, denoted as the blue region)

SLIDE 9

CNN for Crowd Counting

  • Training strategy

– Patch‐based training
– Alternating training with crowd count and crowd density objectives

SLIDE 10

Create Ground Truth Patches (Cont’d)

  • Estimating the perspective map of a scene

– Each scene needs 2‐4 annotations of person height
– Each pixel stores how many meters the current pixel represents
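The perspective-map step can be sketched as follows: with 2‐4 annotated person heights (in pixels) at different image rows, fit a linear model of pixel height versus row, then convert each row to meters-per-pixel using an assumed physical person height. The 1.75 m reference height and the linear fit are illustrative assumptions, not details taken from the slides.

```python
import numpy as np

REF_HEIGHT_M = 1.75  # assumed physical person height (illustrative)

def perspective_map(annotations, n_rows):
    """Fit pixel-height-vs-row linearly from a few (row, pixel_height)
    person annotations; return meters-per-pixel for every image row."""
    rows = np.array([r for r, _ in annotations], dtype=float)
    heights = np.array([h for _, h in annotations], dtype=float)
    a, b = np.polyfit(rows, heights, 1)        # pixel height ~ a*row + b
    pred_heights = a * np.arange(n_rows) + b   # per-row person height in pixels
    return REF_HEIGHT_M / np.maximum(pred_heights, 1e-6)

# People appear taller (more pixels) lower in the frame,
# so meters-per-pixel shrinks toward the bottom of the image.
mpp = perspective_map([(100, 40.0), (300, 80.0)], n_rows=400)
```

Two annotations are the minimum for the fit; the slides recommend 2‐4 per scene.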

SLIDE 11

Create Ground Truth Patches (Cont’d)

  • Convolve the head annotation map with a person‐shape kernel

– The person‐shape kernel should sum to 1
– Crop 3x3‐meter patches
– Normalize the patches to the same size (72x72)
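A minimal sketch of the ground-truth density map: stamping a kernel that sums to 1 at every head annotation is equivalent to convolving the head annotation map with that kernel. A uniform square kernel stands in for the person-shape kernel here, which is an assumption; the key property is that the resulting map integrates to the person count.

```python
import numpy as np

def density_map(shape, heads, ksize=5):
    """Stamp a normalized kernel at each head annotation; equivalent to
    convolving the head annotation map with a kernel that sums to 1."""
    h, w = shape
    kernel = np.full((ksize, ksize), 1.0 / ksize**2)   # stand-in person-shape kernel
    padded = np.zeros((h + ksize - 1, w + ksize - 1))  # pad so border heads fit
    for y, x in heads:
        padded[y:y + ksize, x:x + ksize] += kernel
    r = ksize // 2
    return padded[r:r + h, r:r + w]  # crop back to image size

# Two annotated heads -> density map integrating to 2 (heads near the
# image border would lose a little mass to the crop)
dmap = density_map((50, 50), [(10, 10), (20, 30)])
```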

SLIDE 12

Alternating Training Strategy

  • Train each step until convergence

– Train with pixel‐level density maps and an L2 loss
– Train with crowd counts of patches

SLIDE 13

Finetuning on Unseen Scenes

  • Train on all training scenes
  • For an unseen scene, the trained model might not be suitable for direct deployment
  • Finetune the pre‐trained model on training patches similar to the test patches

SLIDE 14

Training Patch Retrieval

  • Candidate training scene retrieval

– Given a target scene, retrieve training scenes with similar perspective maps (i.e., scenes with similar viewing angles)
– The top‐20 perspective‐map‐similar training scenes are kept

Figure: top‐20 training scenes retrieved for Test Scene 1 and Test Scene 2.

SLIDE 15

Training Patch Retrieval (Cont’d)

  • Candidate training patch retrieval

– Estimate the target scene's density using the pretrained model
– Retrieve training patches that match the target scene's density distribution according to its density histogram

Figure: target scene density distribution vs. training patch density distribution.
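The patch-retrieval step can be sketched as histogram matching over scalar patch densities (people per patch): bin the target scene's estimated densities and sample candidate training patches from each bin in proportion. The binning scheme and proportional per-bin sampling are illustrative assumptions; the slides only specify that retrieved patches should match the target scene's density distribution.

```python
import numpy as np

def retrieve_patches(train_densities, target_densities, n_select, bins=4, seed=0):
    """Pick training-patch indices whose density histogram roughly
    matches the target scene's density histogram."""
    rng = np.random.default_rng(seed)
    edges = np.histogram_bin_edges(
        np.concatenate([train_densities, target_densities]), bins=bins)
    target_hist, _ = np.histogram(target_densities, bins=edges)
    target_frac = target_hist / target_hist.sum()
    train_bins = np.clip(np.digitize(train_densities, edges) - 1, 0, bins - 1)
    chosen = []
    for b in range(bins):
        pool = np.flatnonzero(train_bins == b)
        k = min(int(round(target_frac[b] * n_select)), len(pool))
        if k > 0:  # sample this bin proportionally to the target histogram
            chosen.extend(rng.choice(pool, size=k, replace=False))
    return chosen

train_densities = np.linspace(0.0, 10.0, 100)  # candidate patch densities
target_densities = np.full(50, 1.0)            # target scene: uniformly low density
idx = retrieve_patches(train_densities, target_densities, n_select=10)
```

Here all retrieved patches come from the low-density bin, mirroring the target scene's distribution.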

SLIDE 16

Datasets

  • UCSD [Chan et al. CVPR’08]
  • UCF_CC_50 [Idrees et al. CVPR’13]
  • WorldExpo’10 dataset (with SJTU)

– Train & validation: 1,127 one‐minute video clips of 103 scenes
– Test: 5 one‐hour video clips from 5 scenes

Dataset     # frames      # scenes  Resolution  FPS    # people per frame  # total annotations
UCSD        2,000         1         158 x 238   10     11‐46               49,885
UCF_CC_50   50            50        various     image  94‐4,543            63,974
WorldExpo   4.44 million  108       576 x 720   25     1‐253               199,923

SLIDE 17

Results

SLIDE 18

Results: WorldExpo’10

  • Metric: mean absolute error

Method                  Scene 1  Scene 2  Scene 3  Scene 4  Scene 5  Average
LBP+RR                  13.6     58.9     37.1     21.8     23.4     31.0
Fiaschi et al. ICPR’12  2.2      87.3     22.2     16.4     5.4      26.7
Chen et al. BMVC’12     2.1      55.9     9.6      11.3     3.4      16.5
Crowd CNN               2.0      29.5     9.7      9.3      3.1      10.7
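The mean absolute error used in these comparisons is simply the average absolute difference between predicted and ground-truth counts over the test frames:

```python
import numpy as np

def mae(pred_counts, gt_counts):
    """Mean absolute error between predicted and ground-truth crowd counts."""
    pred = np.asarray(pred_counts, dtype=float)
    gt = np.asarray(gt_counts, dtype=float)
    return float(np.mean(np.abs(pred - gt)))

err = mae([10, 22, 31], [12, 20, 30])  # (2 + 2 + 1) / 3
```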

SLIDE 19

Results: UCSD & UCF_CC_50

Figure: results on UCSD and UCF_CC_50.

SLIDE 20

Contents

  • Crowd Counting

– Cross‐scene crowd count and density estimation with deep CNN [Zhang et al. CVPR’15]
– Crossing‐line crowd counting with two‐phase deep CNN [Zhao et al. ECCV’16]

  • Crowd Behavior Modeling

– Multi‐person walking path prediction [Yi et al. ECCV’16]

SLIDE 21

Crossing‐line Crowd Counting

  • Problem definition

– Count people crossing a Line‐of‐Interest (LOI) in both directions
– Has practical value in intelligent surveillance

SLIDE 22

Temporal Slicing

  • Existing LOI counting methods mostly use temporal slices

SLIDE 23

CNN with Pixel‐level Supervision

  • CNN trained with pixel‐level supervision maps

– Instantaneous crowd counting map, which can be decomposed into
– a crowd density map
– a crowd velocity map

SLIDE 24

Definition of Crowd Counting Map

  • At each location, how many persons have passed this location along the x and y directions at a single time step
  • Crossing‐line counts can be calculated by projecting the values onto the normal direction of the LOI
SLIDE 25

Definition of Crowd Counting Map (Cont’d)

  • The crowd counting map can be decomposed as the elementwise multiplication of the crowd density map and the crowd velocity map
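The decomposition is an elementwise product, sketched below in numpy with assumed shapes (two channels for the x and y directions; the shapes and units are illustrative, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
H, W = 4, 6
density = rng.uniform(0.0, 0.2, size=(H, W))       # persons per pixel
velocity = rng.uniform(-2.0, 2.0, size=(2, H, W))  # (x, y) pixels per frame

# Counting map: persons passing each pixel along x and y in one time step
counting = density[None, :, :] * velocity          # shape (2, H, W)
```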

SLIDE 26

Two‐phase Strategy

  • Two‐phase training strategy

– Phase I: train with density and velocity supervision
– Phase II: train with counting supervision

SLIDE 27

Supervision Maps

Figure: ground‐truth counting, velocity, and density maps vs. the estimated counting, velocity, and density maps.

SLIDE 28

From Instantaneous Counts to LOI Counts

  • Project the x‐ and y‐directional counting values on the LOI onto its normal direction
  • Integrating over all the projected values gives the instantaneous LOI counts in the two directions at time t
  • For a certain period of time T, integrate the instantaneous counts to obtain the final crossing‐line counts within T
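For a vertical LOI the projection and integration reduce to a few lines. The sketch below assumes a counting-map tensor of shape (T, 2, H, W), a vertical line at column x0 (unit normal (1, 0)), and the sign of the projected values separating the two crossing directions; these conventions are assumptions for illustration.

```python
import numpy as np

def loi_counts(counting_maps, x0):
    """counting_maps: (T, 2, H, W), channel 0 = x-direction counting values.
    Vertical LOI at column x0, so projecting onto the unit normal (1, 0)
    keeps only the x component; the sign separates the two directions."""
    proj = counting_maps[:, 0, :, x0]                # (T, H) projected values
    inst_right = np.clip(proj, 0, None).sum(axis=1)  # instantaneous counts, one direction
    inst_left = np.clip(-proj, 0, None).sum(axis=1)  # instantaneous counts, other direction
    return inst_right.sum(), inst_left.sum()         # integrate over the period T

cm = np.zeros((3, 2, 2, 4))   # T=3 frames, 2x4 scene
cm[0, 0, 0, 2] = 0.5          # half a person crossing rightward at frame 0
cm[0, 0, 1, 2] = -0.25        # a quarter person crossing leftward at frame 0
cm[1, 0, 0, 2] = 0.5          # another half person at frame 1
right, left = loi_counts(cm, x0=2)
```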

SLIDE 29

LOI Counting Dataset

  • A new LOI counting dataset
  • Evaluation metric

– Mean Windowed Relative Absolute Errors

SLIDE 30

LOI Counting Dataset

  • A new LOI counting dataset
SLIDE 31

Results

  • Baselines:

– Phase I: no Phase II training; the estimated velocity and density maps are directly multiplied
– Direct‐A: CNN without elementwise multiplication, trained directly with Phase II supervision
– Direct‐B: CNN with elementwise multiplication, trained directly with Phase II supervision
– Two‐separate: two separate CNNs for velocity and density

SLIDE 32

Results

Video: counting results at 2x speed, showing upward and downward counts.

SLIDE 33

Contents

  • Crowd Counting

– Cross‐scene crowd count and density estimation with deep CNN [Zhang et al. CVPR’15]
– Crossing‐line crowd counting with two‐phase deep CNN [Zhao et al. ECCV’16]

  • Crowd Behavior Modeling

– Multi‐person walking path prediction [Yi et al. ECCV’16]

SLIDE 34

Problem Definition

  • Previous five frames as input

BLUE: input locations. GREEN: GT future locations. RED: current locations.

SLIDE 35

Problem Definition (Cont’d)

  • Predict the future five frames

BLUE: input locations. GREEN: GT future locations. RED: current locations.

SLIDE 36

Main Difficulties

  • How to solve the problem with a deep neural network?
  • How to encode pedestrian walking paths as the input of a deep network?
  • How to jointly model the behaviors of all pedestrians in the scene?

SLIDE 37

Method Flowchart

Input pedestrian walking paths -> Encoding -> encoded input displacement volume -> Behavior‐CNN -> encoded output displacement volume -> Decoding -> predicted pedestrian walking paths

SLIDE 38

Behavior Encoding

Each pedestrian's walking path is encoded as a displacement vector, and the displacement vectors of all pedestrians are stacked into the input displacement volume.
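A minimal sketch of this encoding: each pedestrian's past path is reduced to per-frame displacement vectors and written into the channels of a spatial volume at that pedestrian's current location. The channel layout and the use of raw pixel displacements are assumptions for illustration; the actual encoding in the paper may differ.

```python
import numpy as np

def encode_displacement_volume(paths, shape):
    """paths: list of (T, 2) integer (y, x) locations, one array per pedestrian.
    Returns a (2*(T-1), H, W) volume holding each pedestrian's per-frame
    displacement vectors in the channels at their current location."""
    H, W = shape
    T = paths[0].shape[0]
    vol = np.zeros((2 * (T - 1), H, W))
    for p in paths:
        y, x = p[-1]                               # current location
        vol[:, y, x] = np.diff(p, axis=0).ravel()  # (dy, dx) per step
    return vol

# One pedestrian observed at three frames
path = np.array([[0, 0], [1, 1], [2, 3]])
vol = encode_displacement_volume([path], shape=(5, 5))
```

Everywhere no pedestrian is present the volume stays zero, which is what lets a CNN process all pedestrians in the scene jointly.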

SLIDE 39

Behavior‐CNN

Figure: the Behavior‐CNN maps the encoded input displacement volume to the encoded output displacement volume, which is decoded into the predicted walking paths.

SLIDE 40

Datasets

                       Dataset I [*]  Dataset II
Scene type             Indoor         Outdoor
Resolution (pixels)    1,920 x 1,080  1,920 x 1,080
Video duration (s)     4,000          450
Frame rate (fps)       25             25
Annotated pedestrians  5,000          560

  • No. of annotated pedestrians: 12,684 (Dataset I), 797 (Dataset II)

[*] S. Yi, H. Li, and X. Wang. Understanding pedestrian behaviors from stationary crowd groups. In Proc. CVPR, IEEE, 2015.

SLIDE 41

Location Awareness

With the learned location bias map, our Behavior‐CNN can distinguish different locations of the scene. Strong correlations are observed between the predicted walking patterns and the training walking patterns.

SLIDE 42

Learned Feature Filters

The input pedestrian paths can be classified into some rough categories by the filters in conv1. Strong correlations between specific walking patterns and filter response maps can be well observed.

SLIDE 43

Learned Feature Filters (Cont’d)

Filters of conv2 and conv3 generally classify pedestrians into finer and more specific categories. Filters in higher‐level layers generally encode more complex behaviors, e.g., stationary crowds. Strong correlations between specific walking patterns and filter response maps can be well observed.

SLIDE 44

Results: Pedestrian Walking Path Prediction

Method                 Dataset I     Dataset II    Dataset I  Dataset II
                       (Annotation)  (Annotation)  (KLT)      (KLT)
Behavior‐CNN           2.421%        2.348%        2.517%     3.816%
Constant velocity      6.091%        6.468%        5.864%     5.635%
Constant acceleration  9.899%        9.428%        6.619%     7.656%
SVM regression         4.639%        4.276%        5.053%     5.327%
SFM [Helbing’95]       4.280%        5.921%        4.447%     5.044%
LTA [Pellegrini’09]    4.723%        4.571%        4.346%     4.639%
TIM [Cancela’14]       4.075%        4.141%        4.790%     4.790%

Prediction results (MSE) of different methods trained on the annotated pedestrian walking paths or the KLT trajectories on Dataset I and Dataset II. "Annotation": training and evaluation on annotated pedestrian locations.
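The constant-velocity baseline compared against here can be sketched in a few lines: repeat the last observed per-frame displacement for each future frame. The five-frame horizon matches the problem definition; the rest is a generic implementation, not code from the paper.

```python
import numpy as np

def constant_velocity_predict(past, n_future=5):
    """past: (T, 2) observed (y, x) locations for one pedestrian.
    Extrapolates the last per-frame displacement over n_future frames."""
    past = np.asarray(past, dtype=float)
    v = past[-1] - past[-2]                      # last observed displacement
    steps = np.arange(1, n_future + 1)[:, None]  # 1, 2, ..., n_future
    return past[-1] + steps * v                  # (n_future, 2) future locations

pred = constant_velocity_predict([[0, 0], [1, 2]])
```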

SLIDE 45

Receptive Fields

  • Current receptive field size: around 10% of the scene
  • Large enough to capture nearby pedestrians and activities
  • Two alternative structures are designed to decrease the receptive field size
  • Filter size changed from 3x3 to 1x1, with the parameter count fixed
  • Structure simplified from 3conv+pool+3conv+deconv to 3conv+pool+3conv and 3conv+3conv

MSE                             3x3 (ours)  1x1
3conv+pool+3conv+deconv (ours)  2.421%      2.555%
3conv+pool+3conv                2.431%      2.571%
3conv+3conv                     2.468%      2.858%

Prediction results of different filter sizes and net structures on Dataset I.

SLIDE 46

Example Clip

Video: input pedestrian walking paths, with predictions from the proposed Behavior‐CNN, constant velocity, and LTA [Pellegrini et al.]. BLUE dots: input previous locations. GREEN dots: ground‐truth future locations. RED dots: predicted future locations.

SLIDE 47

Example Clip 3

Video: input pedestrian walking paths, with predictions from the proposed Behavior‐CNN, constant velocity, and LTA [Pellegrini et al.]. BLUE dots: input previous locations. GREEN dots: ground‐truth future locations. RED dots: predicted future locations.

SLIDE 48

Results: Predictions as Tracking Prior

The average L2 distance between ground‐truth walking paths and the tracking results of 1,000 pedestrians in Dataset I was used for evaluation.

Method  KLT+Behavior‐CNN  KLT+RFT [Zhou’11]  KLT
Error   83.79             228.33             411.71

Results of pedestrian tracking on Dataset I.

SLIDE 49

Example Pedestrian 1

GREEN dots: ground truth. BLUE dots: RFT [Zhou et al.]. RED dots: proposed.

SLIDE 50

Example Pedestrian 2

GREEN dots: ground truth. BLUE dots: RFT [Zhou et al.]. RED dots: proposed.

SLIDE 51

Conclusions

  • We investigated crowd‐related applications with deep neural networks
  • There still exist many challenges

– Lack of surveillance data
– A lot of human annotation is required
– High accuracy is needed for real‐world applications

SLIDE 52

Thank you! Q&A