Finding People in Images and Videos
Navneet DALAL
GRAVIR, INRIA Rhône-Alpes
Thesis Advisors: Cordelia SCHMID and Bill TRIGGS
17 July, 2006
Institut National Polytechnique de Grenoble

Goals & Applications
Goal: Detect and localise people in images and videos
Applications:
• Images, films & multi-media analysis
• Pedestrian detection for smart cars
• Visual surveillance, behavior analysis

Difficulties
• Wide variety of articulated poses
• Variable appearance and clothing
• Complex backgrounds
• Unconstrained illumination
• Occlusions, different scales
• Video sequences involve motion of the subject, the camera and the objects in the background
Main assumption: upright, fully visible people

Talk Outline
• Overview of detection methodology
• Static images
  • Feature sets
  • Object localisation
  • Extension to other object classes
• Videos
  • Motion features
  • Optical flow estimation
• Part based person detection
• Conclusions and perspectives

Overview of Methodology
Detection phase:
• Build a scale-space pyramid
• Scan image(s) at all scales and locations
• Extract features over windows
• Run linear SVM classifier on all locations (detection window)
• Fuse multiple detections in 3-D position & scale space
• Output object detections with bounding boxes
Focus on building robust feature sets (static & motion)
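The detection phase above can be sketched as a dense multi-scale scan. This is a minimal illustration, not the thesis implementation: the 64 × 128 window, the stride of 8 pixels, the scale ratio of 1.05, the nearest-neighbour resizing and the `score_window` callable (a stand-in for feature extraction followed by the linear SVM) are all assumptions made for the example.

```python
import numpy as np

def pyramid_scales(img_h, img_w, win_h=128, win_w=64, scale_ratio=1.05):
    """Scales s >= 1 at which the detection window still fits in the image."""
    scales = []
    s = 1.0
    while img_h / s >= win_h and img_w / s >= win_w:
        scales.append(s)
        s *= scale_ratio
    return scales

def resize_nearest(img, scale):
    """Downscale a 2-D image by `scale` using nearest-neighbour sampling."""
    h, w = img.shape
    new_h, new_w = int(h / scale), int(w / scale)
    rows = np.minimum((np.arange(new_h) * scale).astype(int), h - 1)
    cols = np.minimum((np.arange(new_w) * scale).astype(int), w - 1)
    return img[rows[:, None], cols[None, :]]

def scan(img, score_window, win_h=128, win_w=64, stride=8, scale_ratio=1.05):
    """Return (x, y, log_scale, score) for every window at every pyramid level.

    `score_window` is a hypothetical callable standing in for
    "extract features, then run the linear SVM".
    """
    detections = []
    for s in pyramid_scales(*img.shape, win_h, win_w, scale_ratio):
        level = resize_nearest(img, s)
        for y in range(0, level.shape[0] - win_h + 1, stride):
            for x in range(0, level.shape[1] - win_w + 1, stride):
                score = score_window(level[y:y + win_h, x:x + win_w])
                # Map the window origin back to original image coordinates.
                detections.append((x * s, y * s, np.log(s), score))
    return detections
```

The returned tuples live in the 3-D position and log-scale space in which the detections are later fused.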

Finding People in Images
N. Dalal and B. Triggs. Histograms of Oriented Gradients for Human Detection. CVPR, 2005

Existing Person Detectors/Feature Sets
Current approaches:
• Haar wavelets + SVM: Papageorgiou & Poggio, 2000; Mohan et al, 2000
• Rectangular differential features + AdaBoost: Viola & Jones, 2001
• Edge templates + nearest neighbour: Gavrila & Philomin, 1999
• Model based methods: Felzenszwalb & Huttenlocher, 2000; Ioffe & Forsyth, 1999
• Other works: Leibe et al, 2005; Mikolajczyk et al, 2004
Orientation histograms:
• Freeman et al, 1996; Lowe, 1999 (SIFT); Belongie et al, 2002 (Shape contexts)

Static Feature Extraction
Processing chain:
• Input image, detection window
• Normalise gamma
• Compute gradients
• Weighted vote in spatial & orientation cells
• Contrast normalise over overlapping spatial blocks of cells
• Collect HOGs over detection window
• Feature vector f = [..., ..., ...]
• Linear SVM
(Figure labels: Cell, Block, Overlap of Blocks)
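The processing chain above can be sketched in a few lines of NumPy. This is a simplified illustration under stated assumptions, not the authors' code: it uses square-root gamma compression, a centred [-1 0 1] mask, 8 × 8 pixel cells, 9 unsigned orientation bins, 2 × 2 cell blocks with a one-cell stride, and hard bin assignment in place of interpolated voting. The function names are hypothetical.

```python
import numpy as np

def hog_cells(window, cell=8, n_bins=9):
    """Per-cell histograms of unsigned gradient orientation, weighted by magnitude."""
    img = np.sqrt(window.astype(float))           # square-root gamma compression
    gx = np.zeros_like(img)
    gy = np.zeros_like(img)
    gx[:, 1:-1] = img[:, 2:] - img[:, :-2]        # centred [-1 0 1] mask, no smoothing
    gy[1:-1, :] = img[2:, :] - img[:-2, :]
    mag = np.hypot(gx, gy)
    ang = np.rad2deg(np.arctan2(gy, gx)) % 180.0  # unsigned orientation in [0, 180)
    bins = np.minimum((ang / (180.0 / n_bins)).astype(int), n_bins - 1)
    n_cy, n_cx = img.shape[0] // cell, img.shape[1] // cell
    hist = np.zeros((n_cy, n_cx, n_bins))
    for cy in range(n_cy):
        for cx in range(n_cx):
            sl = (slice(cy * cell, (cy + 1) * cell),
                  slice(cx * cell, (cx + 1) * cell))
            # Each pixel votes for its orientation bin with its gradient magnitude.
            hist[cy, cx] = np.bincount(bins[sl].ravel(),
                                       weights=mag[sl].ravel(),
                                       minlength=n_bins)
    return hist

def hog_descriptor(window, cell=8, block=2, n_bins=9, eps=1e-3):
    """Concatenate L2-normalised overlapping blocks of cell histograms."""
    hist = hog_cells(window, cell, n_bins)
    n_cy, n_cx, _ = hist.shape
    blocks = []
    for by in range(n_cy - block + 1):            # block stride of one cell
        for bx in range(n_cx - block + 1):
            v = hist[by:by + block, bx:bx + block].ravel()
            blocks.append(v / np.sqrt(np.sum(v ** 2) + eps))
    return np.concatenate(blocks)
```

With these assumed settings a 64 × 128 window yields 7 × 15 blocks of 36 values each, so the feature vector passed to the linear SVM has 3780 entries.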

Overview of Learning Phase
Learning phase:
• Input: annotations on training images
• Create fixed-resolution normalised training image data set
• Encode images into feature spaces
• Learn binary classifier
Retraining:
• Resample negative training images to create hard examples
• Encode images into feature spaces
• Learn binary classifier
Output: object/non-object decision
Retraining reduces false positives by an order of magnitude!
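The retraining step can be sketched as a single round of hard-negative mining. The classifier here is a regularised least-squares fit used purely as a stand-in so the example is self-contained; the talk uses a linear SVM. The function names, the threshold of 0 and the single retraining round are assumptions for illustration.

```python
import numpy as np

def train_linear(X, y, reg=1e-3):
    """Stand-in linear classifier: regularised least squares on labels in {-1, +1}."""
    Xb = np.hstack([X, np.ones((len(X), 1))])     # append a bias column
    A = Xb.T @ Xb + reg * np.eye(Xb.shape[1])
    return np.linalg.solve(A, Xb.T @ y)

def scores(w, X):
    """Signed classifier outputs for the rows of X."""
    return np.hstack([X, np.ones((len(X), 1))]) @ w

def retrain_with_hard_negatives(pos, neg_initial, neg_pool, threshold=0.0):
    """Train, collect false positives from a pool of negatives, then retrain once."""
    X = np.vstack([pos, neg_initial])
    y = np.hstack([np.ones(len(pos)), -np.ones(len(neg_initial))])
    w = train_linear(X, y)
    # Hard examples: negative windows the first classifier scores as positive.
    hard = neg_pool[scores(w, neg_pool) > threshold]
    X2 = np.vstack([X, hard])
    y2 = np.hstack([y, -np.ones(len(hard))])
    return train_linear(X2, y2), len(hard)
```

In practice `neg_pool` would hold feature vectors from an exhaustive scan of the person-free training images, and the second classifier is the one used at detection time.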

HOG Descriptors
Parameters:
• Gradient scale
• Orientation bins
• Percentage of block overlap
Schemes:
• RGB or Lab, colour/gray-space
• Block normalisation:
  L2-norm, $v \leftarrow v / \sqrt{\|v\|_2^2 + \epsilon}$
  or L1-norm, $v \leftarrow v / (\|v\|_1 + \epsilon)$
Block layouts: R-HOG/SIFT (Block, Cell), C-HOG (Center bin)
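The two block normalisation schemes named on this slide can be written directly. This is a small sketch; the function names and the default value of `eps` are assumptions, and `v` stands for the concatenated cell histograms of one block.

```python
import numpy as np

def l2_norm(v, eps=1e-3):
    """L2-norm scheme: divide by the Euclidean length of the block vector."""
    return v / np.sqrt(np.sum(v ** 2) + eps)

def l1_norm(v, eps=1e-3):
    """L1-norm scheme: divide by the sum of absolute values of the block vector."""
    return v / (np.sum(np.abs(v)) + eps)
```

Both schemes make the block invariant to a uniform rescaling of local contrast (up to the small constant `eps`): `l2_norm(10 * v, eps=0)` equals `l2_norm(v, eps=0)`.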

Evaluation Data Sets
           MIT pedestrian database          INRIA person database
Train      509 positive windows             1208 positive windows
           Negative data unavailable        1218 negative images
Test       200 positive windows             566 positive windows
           Negative data unavailable        453 negative images
Overall    709 annotations + reflections    1774 annotations + reflections

Overall Performance
(Plots: MIT pedestrian database; INRIA person database)
• R/C-HOG give near perfect separation on MIT database
• R/C-HOG have 1-2 orders of magnitude fewer false positives than other descriptors

Performance on INRIA Database

Effect of Parameters
Gradient smoothing, σ: reducing gradient scale from 3 to 0 decreases false positives by 10 times
Orientation bins, β: increasing orientation bins from 4 to 9 decreases false positives by 10 times

Normalisation Method & Block Overlap
Normalisation method: strong local normalisation is essential
Block overlap: overlapping blocks improve performance, but descriptor size increases

Effect of Block and Cell Size
Detection window: 64 × 128
Trade-off between need for local spatial invariance and need for finer spatial resolution

Descriptor Cues
(Panels: input example; average gradients; weighted pos wts; weighted neg wts; outside-in weights)
• Most important cues are head, shoulder, leg silhouettes
• Vertical gradients inside a person are counted as negative
• Overlapping blocks just outside the contour are most important

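Multi-Scale Object Localisation
• Multi-scale dense scan of detection window
• Threshold the detection score (bias, clip)
• Each detection $i$ lies at position $(x, y)$ and scale $s$ (in log), with bandwidth
  $H_i = \mathrm{diag}[\exp(s_i)\sigma_x, \exp(s_i)\sigma_y, \sigma_s]$
• Density estimate over the $n$ detections:
  $f(\mathbf{x}) = \sum_{i=1}^{n} w_i \exp\left(-\|H_i^{-1}(\mathbf{x} - \mathbf{x}_i)\|^2 / 2\right)$
• Apply robust mode detection, like mean shift
• Final detections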
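The fusion step can be illustrated with a variable-bandwidth mean shift over detections in (x, y, log-scale) space. This is a sketch under assumptions rather than the thesis code: the bandwidths σ_x = 8, σ_y = 16 and σ_s = 0.5 octave, the convergence tolerance, and the rule for merging nearby modes are illustrative choices, and the function name is hypothetical.

```python
import numpy as np

def mean_shift_modes(points, scores, sigma=(8.0, 16.0, 0.5 * np.log(2.0)),
                     n_iter=200, tol=1e-5):
    """Modes of a weighted, variable-bandwidth Gaussian density over (x, y, s)."""
    points = np.asarray(points, dtype=float)
    w = np.maximum(np.asarray(scores, dtype=float), 0.0)   # hard clipping of scores
    points, w = points[w > 0], w[w > 0]
    sx, sy, ss = sigma
    scale = np.exp(points[:, 2])
    # Per-detection variances: spatial spread grows with the detection scale.
    var = np.stack([(scale * sx) ** 2, (scale * sy) ** 2,
                    np.full(len(points), ss ** 2)], axis=1)
    norm = w / np.sqrt(np.prod(var, axis=1))
    modes = []
    for start in points:
        x = start.copy()
        for _ in range(n_iter):
            d2 = np.sum((x - points) ** 2 / var, axis=1)
            k = norm * np.exp(-0.5 * d2)
            # Bandwidth-weighted mean of the detections (one mean shift step).
            x_new = (np.sum(k[:, None] * points / var, axis=0)
                     / np.sum(k[:, None] / var, axis=0))
            converged = np.linalg.norm(x_new - x) < tol
            x = x_new
            if converged:
                break
        # Keep the mode only if it is not within one bandwidth of an earlier one.
        if all(np.sum((x - m) ** 2 / var.min(axis=0)) > 1.0 for m in modes):
            modes.append(x)
    return np.array(modes)
```

Each returned mode corresponds to one fused detection; its bounding box follows from the window size scaled by exp(s).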

Effect of Spatial Smoothing
• Spatial smoothing: aspect ratio as per window shape, smallest sigma approx. equal to stride/cell size
• Relatively independent of scale smoothing; sigma equal to 0.4 to 0.7 octaves gives good results

Effect of Other Parameters
Different mappings: hard clipping of SVM scores gives better results than simple probabilistic mapping of these scores
Effect of scale-ratio: fine scale sampling helps improve recall

Results Using Static HOG
No temporal smoothing of detections

Conclusions for Static Case
Fine grained features improve performance
Rectify fine gradients, then pool spatially
• No gradient smoothing, [1 0 -1] derivative mask
• Orientation voting into fine bins
• Spatial voting into coarser bins
Use gradient magnitude (no thresholding)
Strong local normalization
Use overlapping blocks
Robust non-maximum suppression
• Fine scale sampling, hard clipping & anisotropic kernel
Human detection rate of 90% at $10^{-4}$ false positives per window
Slower than integral images of Viola & Jones, 2001
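The headline figure above pairs a detection rate with a false-positives-per-window (FPPW) rate. A minimal sketch of reading one such operating point from classifier scores follows; the function name and the thresholding convention (strictly greater than the threshold counts as a detection) are assumptions for illustration.

```python
import numpy as np

def miss_rate_at_fppw(pos_scores, neg_scores, fppw=1e-4):
    """Miss rate at the threshold that admits at most a fraction `fppw` of negative windows."""
    pos_scores = np.asarray(pos_scores, dtype=float)
    neg_sorted = np.sort(np.asarray(neg_scores, dtype=float))[::-1]
    n_allowed = int(np.floor(fppw * len(neg_sorted)))      # false positives allowed
    # Windows scoring strictly above this threshold count as detections.
    threshold = neg_sorted[n_allowed] if n_allowed < len(neg_sorted) else -np.inf
    detected = np.sum(pos_scores > threshold)
    return 1.0 - detected / len(pos_scores)
```

A detection rate of 90% at $10^{-4}$ FPPW corresponds to this function returning 0.1 for `fppw=1e-4`.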

Applications to Other Classes
M. Everingham et al. The 2005 PASCAL Visual Object Classes Challenge. Proceedings of the PASCAL Challenge

Parameter Settings
Most HOG parameters are stable across different classes
Parameters that change:
• Gamma compression
• Normalisation methods
• Signed/un-signed gradients

Results from Pascal VOC 2006
Method                    Person  Car    Motorbike  Bicycle  Bus    Sheep  Horse  Cow    Cat    Dog
Cambridge                 0.030   0.254  0.178      0.249    0.138  0.131  0.091  0.149  0.151  0.118
ENSMP                     -       0.398  -          -        -      -      -      0.159  -      -
HOG                       0.164   0.444  0.390      0.414    0.117  0.251  -      0.212  -      -
Laptev (HOG + AdaBoost)   0.114   -      0.318      0.440    -      -      0.140  0.224  -      -
TUD                       0.074   -      0.153      -        -      -      -      -      -      -
TKK                       0.039   0.222  0.265      0.303    0.169  0.227  0.137  0.252  0.160  0.113
HOG outperformed other methods for 4 out of 10 classes
Its AdaBoost variant outperformed other methods for 2 out of 10 classes

Finding People in Videos
N. Dalal, B. Triggs and C. Schmid. Human Detection Using Oriented Histograms of Flow and Appearance. ECCV, 2006.

Finding People in Videos
Motivation:
• Human motion is very characteristic
Requirements:
• Must work for moving camera and background
• Robust coding of relative motion of human parts
Previous works:
• Viola et al, 2003
• Gavrila et al, 2004
• Efros et al, 2003
Courtesy: R. Blake, Vanderbilt Univ

Handling Camera Motion
Camera motion characterisation:
• Pan and tilt are locally translational
• Rest is depth induced motion parallax
Use local differential of flow:
• Cancels out effects of camera rotation
• Highlights 3D depth boundaries
• Highlights motion boundaries
Robust encoding into oriented histograms:
• Some descriptors focus on capturing motion boundaries
• Others focus on capturing internal motion or relative dynamics of different limbs

Motion HOG Processing Chain
• Input image and consecutive image, detection windows
• Normalise gamma & colour
• Compute optical flow (flow field, magnitude of flow)
• Compute differential flow (differential flow X, differential flow Y)
• Accumulate votes for differential flow orientation over spatial cells
• Normalise contrast within overlapping blocks of cells
• Collect HOGs for all blocks over detection window
(Figure labels: Cell, Block, Overlap of Blocks)

Overview of Feature Extraction
Appearance channel: input image → static HOG encoding
Motion channel: input image and consecutive image(s) → motion HOG encoding
Collect HOGs over detection window → linear SVM → object/non-object decision
Data set:
Train     5 DVDs, 182 shots         5562 positive windows
Test 1    Same 5 DVDs, 50 shots     1704 positive windows
Test 2    6 new DVDs, 128 shots     2700 positive windows

Coding Motion Boundaries
• Treat x, y-flow components as independent images
• Take their local gradients separately, and compute HOGs as in static images
• Motion Boundary Histograms (MBH) encode depth and motion boundaries
(Panels: first frame; second frame; estd. flow; flow mag.; x-flow diff; y-flow diff; avg. x-flow diff; avg. y-flow diff)
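A minimal sketch of the Motion Boundary Histogram idea: each flow component is treated as an image, differentiated, and binned by gradient orientation per cell. The cell size, the 9 unsigned bins, the hard bin assignment and the function names are assumptions for illustration; block normalisation is omitted and the optical flow is taken as given.

```python
import numpy as np

def central_diff(img):
    """Centred [-1 0 1] differences along x and y."""
    gx = np.zeros_like(img)
    gy = np.zeros_like(img)
    gx[:, 1:-1] = img[:, 2:] - img[:, :-2]
    gy[1:-1, :] = img[2:, :] - img[:-2, :]
    return gx, gy

def cell_histograms(gx, gy, cell=8, n_bins=9):
    """Magnitude-weighted orientation histogram for each cell."""
    mag = np.hypot(gx, gy)
    ang = np.rad2deg(np.arctan2(gy, gx)) % 180.0
    bins = np.minimum((ang / (180.0 / n_bins)).astype(int), n_bins - 1)
    n_cy, n_cx = gx.shape[0] // cell, gx.shape[1] // cell
    hist = np.zeros((n_cy, n_cx, n_bins))
    for cy in range(n_cy):
        for cx in range(n_cx):
            sl = (slice(cy * cell, (cy + 1) * cell),
                  slice(cx * cell, (cx + 1) * cell))
            hist[cy, cx] = np.bincount(bins[sl].ravel(),
                                       weights=mag[sl].ravel(),
                                       minlength=n_bins)
    return hist

def mbh(flow_x, flow_y, cell=8, n_bins=9):
    """Concatenate orientation histograms of the x-flow and y-flow gradients."""
    hx = cell_histograms(*central_diff(flow_x), cell, n_bins)
    hy = cell_histograms(*central_diff(flow_y), cell, n_bins)
    return np.concatenate([hx.ravel(), hy.ravel()])
```

A uniformly translating flow field, such as that produced locally by a camera pan, has zero gradient and therefore contributes nothing to the descriptor; only flow discontinuities produce votes.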

Coding Internal Dynamics
Ideally compute relative displacements of different limbs
• Requires reliable part detectors
Parts are relatively localised in our detection windows
• Allows different coding schemes based on fixed spatial differences
Internal Motion Histograms (IMH) encode relative dynamics of different regions
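One way to picture a coding scheme based on fixed spatial differences is to subtract the flow at each pixel from the flow at a fixed offset and histogram the direction of that relative motion. This is a simplified sketch, not one of the specific IMH variants from the thesis; the offset of 8 pixels, the 9 signed bins over 360° and the function name are assumptions.

```python
import numpy as np

def imh(flow_x, flow_y, offset=(0, 8), cell=8, n_bins=9):
    """Histogram the flow difference between each pixel and its neighbour at `offset`."""
    dy, dx = offset                       # non-negative (rows, cols) displacement
    h, w = flow_x.shape
    rx = flow_x[dy:, dx:] - flow_x[:h - dy, :w - dx]   # relative x-flow
    ry = flow_y[dy:, dx:] - flow_y[:h - dy, :w - dx]   # relative y-flow
    mag = np.hypot(rx, ry)
    ang = np.rad2deg(np.arctan2(ry, rx)) % 360.0       # signed direction of relative motion
    bins = np.minimum((ang / (360.0 / n_bins)).astype(int), n_bins - 1)
    n_cy, n_cx = rx.shape[0] // cell, rx.shape[1] // cell
    hist = np.zeros((n_cy, n_cx, n_bins))
    rows, cols = np.indices(rx.shape)
    inside = (rows < n_cy * cell) & (cols < n_cx * cell)
    # Accumulate each pixel's relative-motion magnitude into its cell and bin.
    np.add.at(hist, (rows[inside] // cell, cols[inside] // cell, bins[inside]),
              mag[inside])
    return hist
```

Two regions moving in opposite directions produce votes only where the offset straddles their boundary, while a rigidly translating window produces none.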
