Human Detection
Greg Mori
CMPT 888
Outline
- Human detection in images
– Histograms of Oriented Gradients (HOG)
- Dalal and Triggs CVPR 2005
– Latent SVM (L‐SVM)
- Part‐based model
- Felzenszwalb et al. CVPR 2008
- Human detection in videos
– Cascade of boosted classifiers
- Viola et al. ICCV 2003
– Motion HOG
- Dalal et al. ECCV 2006
HISTOGRAMS OF ORIENTED GRADIENTS FOR HUMAN DETECTION
Slides from Navneet Dalal
Goals & Applications
- Goal: Detect and localise people in images and videos
- Applications:
– Images, films & multi-media analysis
– Pedestrian detection for smart cars
– Visual surveillance, behavior analysis
Difficulties
- Wide variety of articulated poses
- Variable appearance and clothing
- Complex backgrounds
- Unconstrained illumination
- Occlusions, different scales
- Video sequences involve motion of the subject, the camera and the objects in the background
- Main assumption: upright, fully visible people
Static Feature Extraction
- Input image
- Normalise gamma
- Compute gradients
- Weighted vote in spatial & orientation cells
- Contrast normalise over overlapping spatial cells
- Collect HOGs over detection window
- Linear SVM
Feature vector f = [..., ..., ...]
[Figure labels: Detection window, Block, Cell, Overlap of blocks]
N. Dalal and B. Triggs, Histograms of Oriented Gradients for Human Detection, CVPR 2005
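The "compute gradients" and "weighted vote" steps of the static feature extraction chain can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation: the function name `hog_cells`, the 8-pixel cells, the 9 unsigned orientation bins and the hard (non-interpolated) binning are all assumptions made for the example.

```python
import numpy as np

def hog_cells(image, cell=8, bins=9):
    """Per-cell histograms of unsigned gradient orientation (0 to 180 degrees).

    Assumed simplifications: grayscale input, centred [-1, 0, 1] derivative
    mask, hard assignment of each pixel to a single orientation bin.
    """
    img = image.astype(np.float64)
    gx = np.zeros_like(img)
    gy = np.zeros_like(img)
    gx[:, 1:-1] = img[:, 2:] - img[:, :-2]   # horizontal derivative
    gy[1:-1, :] = img[2:, :] - img[:-2, :]   # vertical derivative
    mag = np.hypot(gx, gy)                   # vote weight = gradient magnitude
    ang = np.degrees(np.arctan2(gy, gx)) % 180.0
    idx = np.minimum((ang / (180.0 / bins)).astype(int), bins - 1)

    rows, cols = img.shape[0] // cell, img.shape[1] // cell
    hist = np.zeros((rows, cols, bins))
    for r in range(rows):
        for c in range(cols):
            sl = (slice(r * cell, (r + 1) * cell),
                  slice(c * cell, (c + 1) * cell))
            hist[r, c] = np.bincount(idx[sl].ravel(),
                                     weights=mag[sl].ravel(),
                                     minlength=bins)
    return hist
```

For a 64 × 128 detection window this yields an 8 × 16 grid of cells, which the later steps group into overlapping blocks and normalise.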
Overview of Learning Phase
- Input: Annotations on training images
- Create fixed-resolution normalised training image data set
- Encode images into feature spaces
- Learn binary classifier
- Resample negative training images to create hard examples
- Encode images into feature spaces
- Learn binary classifier
- Object/Non-object decision
Retraining reduces false positives by an order of magnitude!
HOG Descriptors
- Parameters
– Gradient scale
– Orientation bins
– Percentage of block overlap
- Schemes
– RGB or Lab, colour/gray-space
– Block normalisation: L2-norm or L1-norm
[Figure: Cell and Block layouts for R-HOG/SIFT and C-HOG (with center bin)]
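The two block normalisation schemes named on the HOG Descriptors slide can be sketched as below. The formulas used (L2-norm: v / sqrt(||v||_2^2 + eps^2); L1-norm: v / (||v||_1 + eps)) follow the usual definitions from the Dalal and Triggs paper; the function name `normalise_blocks`, the 2 × 2-cell blocks, the one-cell block stride and the value of `eps` are assumptions for illustration.

```python
import numpy as np

def normalise_blocks(cell_hist, block=2, scheme="L2", eps=1e-3):
    """Group cells into overlapping blocks and normalise each block.

    cell_hist: array of shape (rows, cols, bins) of per-cell histograms.
    Blocks are block x block cells and move one cell at a time (assumed).
    """
    rows, cols, _ = cell_hist.shape
    out = []
    for r in range(rows - block + 1):
        for c in range(cols - block + 1):
            v = cell_hist[r:r + block, c:c + block].ravel()
            if scheme == "L2":
                v = v / np.sqrt(np.sum(v ** 2) + eps ** 2)
            elif scheme == "L1":
                v = v / (np.sum(np.abs(v)) + eps)
            else:
                raise ValueError("scheme must be 'L2' or 'L1'")
            out.append(v)
    return np.concatenate(out)
```

With 16 × 8 cells and 9 bins this gives 15 × 7 blocks of 36 values each, i.e. a 3780-dimensional descriptor.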
Evaluation Data Sets
- MIT pedestrian database
– Train: 509 positive windows; negative data unavailable
– Test: 200 positive windows; negative data unavailable
– Overall 709 annotations + reflections
- INRIA person database
– Train: 1208 positive windows, 1218 negative images
– Test: 566 positive windows, 453 negative images
– Overall 1774 annotations + reflections
Overall Performance
[Plots: MIT pedestrian database; INRIA person database]
- R/C-HOG give near perfect separation on MIT database
- Have 1-2 orders of magnitude fewer false positives than other descriptors
Performance on INRIA Database
Effect of Parameters
- Gradient smoothing, σ
– Reducing gradient scale from 3 to 0 decreases false positives by 10 times
- Orientation bins, β
– Increasing orientation bins from 4 to 9 decreases false positives by 10 times
Normalisation Method & Block Overlap
[Plots: Normalisation method; Block overlap]
- Strong local normalisation is essential
- Overlapping blocks improve performance, but descriptor size increases
Effect of Block and Cell Size
- Trade off between need for local spatial invariance and need for finer spatial resolution
[Figure: 64 × 128 detection window]
Descriptor Cues
[Figure panels: Input example, Average gradients, Weighted pos wts, Weighted neg wts, Outside-in weights]
- Most important cues are head, shoulder, leg silhouettes
- Vertical gradients inside a person are counted as negative
- Overlapping blocks just outside the contour are most important
Overview of Methodology
- Focus on building robust feature sets (static & motion)
Detection Phase
- Scan image(s) at all scales and locations (scale-space pyramid, detection window)
- Extract features over windows
- Run linear SVM classifier on all locations
- Fuse multiple detections in 3-D position & scale space
- Object detections with bounding boxes
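The detection phase (scan at all scales and locations, score each window) can be sketched as a plain sliding-window loop. Everything named here is an assumption for illustration: `resize_nearest` is a crude nearest-neighbour resampler standing in for proper image resizing, `score_fn` stands for "extract features and apply the linear SVM", and the stride, scale step and threshold are arbitrary defaults.

```python
import numpy as np

def resize_nearest(img, scale):
    """Nearest-neighbour resampling (stand-in for a real resize routine)."""
    h = max(1, int(round(img.shape[0] * scale)))
    w = max(1, int(round(img.shape[1] * scale)))
    rows = np.minimum((np.arange(h) / scale).astype(int), img.shape[0] - 1)
    cols = np.minimum((np.arange(w) / scale).astype(int), img.shape[1] - 1)
    return img[rows][:, cols]

def scan(img, score_fn, win=(128, 64), stride=8, scale_step=1.2, thresh=0.0):
    """Score every window of a scale-space pyramid.

    Returns (x, y, scale_factor, score) tuples in original image
    coordinates; scale_factor is the window size relative to `win`.
    """
    dets = []
    scale = 1.0
    while True:
        level = resize_nearest(img, scale)
        if level.shape[0] < win[0] or level.shape[1] < win[1]:
            break  # window no longer fits at this pyramid level
        for y in range(0, level.shape[0] - win[0] + 1, stride):
            for x in range(0, level.shape[1] - win[1] + 1, stride):
                s = score_fn(level[y:y + win[0], x:x + win[1]])
                if s > thresh:
                    dets.append((x / scale, y / scale, 1.0 / scale, s))
        scale /= scale_step  # shrink the image, so the window covers more
    return dets
```

The raw detections returned here overlap heavily; fusing them is the job of the mode-detection step.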
Multi-Scale Object Localisation
- Apply robust mode detection, like mean shift
[Figure: Multi-scale dense scan of detection window; Clip detection score (threshold, bias); Final detections]
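Fusing overlapping detections by mode detection can be sketched with a weighted Gaussian mean shift. This is a simplified sketch under stated assumptions, not the exact procedure from the slides: detections are represented as (x, y, log-scale) points, the weights are clipped detection scores, a single fixed bandwidth `sigma` is used per dimension, and `merge_dist` is an arbitrary merging radius.

```python
import numpy as np

def mean_shift_modes(points, weights, sigma, iters=50, tol=1e-6, merge_dist=1.0):
    """Return the modes of a weighted Gaussian kernel density estimate.

    points : (n, d) detections, e.g. columns x, y, log(scale) (assumed layout)
    weights: (n,) positive weights, e.g. max(score, 0) + small constant
    sigma  : (d,) per-dimension bandwidth
    """
    pts = np.asarray(points, dtype=float)
    w = np.asarray(weights, dtype=float)
    sig = np.asarray(sigma, dtype=float)
    modes = pts.copy()                      # start one search at every detection
    for _ in range(iters):
        diff = (modes[:, None, :] - pts[None, :, :]) / sig
        k = w[None, :] * np.exp(-0.5 * (diff ** 2).sum(axis=-1))
        new = (k[:, :, None] * pts[None, :, :]).sum(axis=1) / k.sum(axis=1, keepdims=True)
        converged = np.abs(new - modes).max() < tol
        modes = new
        if converged:
            break
    fused = []                              # merge searches that reached the same mode
    for m in modes:
        if not any(np.linalg.norm((m - f) / sig) < merge_dist for f in fused):
            fused.append(m)
    return np.array(fused)
```

Each returned mode corresponds to one final detection; nearby raw detections collapse onto the same mode.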
Effect of Spatial Smoothing
- Spatial smoothing aspect ratio as per window shape, smallest sigma approx. equal to stride/cell size
- Relatively independent of scale smoothing; a sigma of a fraction of an octave gives good results
Effect of Other Parameters
[Plots: Different mappings; Effect of scale-ratio]
- Hard clipping of SVM scores gives better results than simple probabilistic mapping of these scores
- Fine scale sampling helps improve recall
DETECTING HUMANS USING A PART‐BASED MODEL
Felzenszwalb et al., A Discriminatively Trained, Multiscale, Deformable Part Model, CVPR 2008
Slides from Pedro Felzenszwalb
PASCAL Challenge
- ~10,000 images, with ~25,000 target objects
- Objects from 20 categories (person, car, bicycle, cow, table...)
- Objects are annotated with labeled bounding boxes
Why is it hard?
- Objects in rich categories exhibit significant variability
- Photometric variation
- Viewpoint variation
- Intra-class variability
- Cars come in a variety of shapes (sedan, minivan, etc)
- People wear different clothes and take different poses
We need rich object models
But this leads to difficult matching and training problems
Starting point: sliding window classifiers
Feature vector x = [... , ... , ... , ... ]
- Detect objects by testing each subwindow
- Reduces object detection to binary classification
- Dalal & Triggs: HOG features + linear SVM classifier
- Previous state of the art for detecting people
Histogram of Gradient (HOG) features
- Image is partitioned into 8x8 pixel blocks
- In each block we compute a histogram of gradient orientations
- Invariant to changes in lighting, small deformations, etc.
- Compute features at different resolutions (pyramid)
HOG Filters
- Array of weights for features in subwindow of HOG pyramid
- Score is dot product of filter and feature vector
HOG pyramid H, filter F
- Score of F at position p is $F \cdot \phi(H, p)$
- $\phi(H, p)$ = concatenation of HOG features from the subwindow specified by p
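Scoring a filter at every position amounts to a cross-correlation of the filter with the feature map. The sketch below shows this for a single pyramid level; the name `filter_response` and the (rows, cols, feature-dimension) array layout are assumptions for illustration, and the double loop favours clarity over speed.

```python
import numpy as np

def filter_response(features, filt):
    """Dot product of `filt` with every same-sized subwindow of `features`.

    features: (H, W, D) HOG feature map of one pyramid level
    filt    : (h, w, D) array of filter weights
    Returns an (H - h + 1, W - w + 1) map of scores.
    """
    H, W, _ = features.shape
    h, w, _ = filt.shape
    out = np.empty((H - h + 1, W - w + 1))
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            # concatenated subwindow features, dotted with the filter
            out[y, x] = np.sum(features[y:y + h, x:x + w] * filt)
    return out
```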
Dalal & Triggs: HOG + linear SVMs
Typical form of a model
[Figure: example feature vectors $\phi(H, p)$ and $\phi(H, q)$]
- There is much more background than objects
- Start with random negatives and repeat:
1) Train a model
2) Harvest false positives to define “hard negatives”
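The "start with random negatives and repeat" loop can be sketched as follows. This is an illustrative sketch only: `train_linear` is a tiny sub-gradient trainer standing in for a real linear SVM solver, and the function names, learning rate, number of rounds and initial sample size are assumptions.

```python
import numpy as np

def train_linear(X, y, epochs=200, lr=0.1, C=1.0):
    """Very small sub-gradient trainer for a linear SVM (stand-in solver)."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        viol = y * (X @ w + b) < 1                     # margin violators
        grad_w = w - C * (y[viol, None] * X[viol]).sum(axis=0) / len(y)
        grad_b = -C * y[viol].sum() / len(y)
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

def mine_hard_negatives(pos, neg_pool, n_init=50, rounds=3, seed=0):
    """Train, harvest false positives from the pool, retrain."""
    rng = np.random.default_rng(seed)
    neg = neg_pool[rng.choice(len(neg_pool), size=n_init, replace=False)]
    for _ in range(rounds):
        X = np.vstack([pos, neg])
        y = np.r_[np.ones(len(pos)), -np.ones(len(neg))]
        w, b = train_linear(X, y)                      # 1) train a model
        hard = neg_pool[neg_pool @ w + b > 0]          # 2) false positives
        if len(hard) == 0:
            break
        neg = np.unique(np.vstack([neg, hard]), axis=0)
    return w, b, neg
```

In a real detector the pool is far too large to hold in memory, so false positives are harvested by scanning background images with the current model.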
Overview of our models
- Mixture of deformable part models
- Each component has global template + deformable parts
- Fully trained from bounding boxes alone
2 component bicycle model
[Figure: root filters (coarse resolution), part filters (finer resolution), deformation models]
Each component has a root filter $F_0$ and n part models $(F_i, v_i, d_i)$
Object hypothesis
[Figure: image pyramid, HOG feature pyramid]
Multiscale model captures features at two resolutions
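Score of a hypothesis
- $p_0$: location of root
- $p_1, \dots, p_n$: location of parts
- $z = (p_0, \dots, p_n)$
- Score is sum of filter scores minus deformation costs
$$\mathrm{score}(p_0, \dots, p_n) = \sum_{i=0}^{n} F_i \cdot \phi(H, p_i) - \sum_{i=1}^{n} d_i \cdot (dx_i^2, dy_i^2)$$
– First sum (“data term”): filters $F_i$
– Second sum (“spatial prior”): deformation parameters $d_i$ and displacements $(dx_i, dy_i)$
- Written as a single dot product:
$$\mathrm{score}(z) = \beta \cdot \Psi(H, z)$$
– $\beta$: concatenation of filters and deformation parameters
– $\Psi(H, z)$: concatenation of HOG features and part displacement features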
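Given precomputed filter response maps, the score of one hypothesis is a handful of lookups minus the quadratic deformation costs. The sketch below assumes, following the paper, that part responses live at twice the root resolution and that part i's displacement is measured from the anchor 2·p0 + v_i; the function name and argument layout are illustrative assumptions.

```python
import numpy as np

def hypothesis_score(root_resp, part_resps, p0, parts, anchors, deform):
    """Score of a hypothesis z = (p0, p1, ..., pn).

    root_resp : response map of the root filter (coarse level)
    part_resps: list of part response maps (level at twice the resolution)
    p0        : (x0, y0) root location
    parts     : list of (x, y) part locations
    anchors   : list of (vx, vy) anchor offsets v_i
    deform    : list of (dx_weight, dy_weight) deformation parameters d_i
    """
    x0, y0 = p0
    score = root_resp[y0, x0]                       # data term of the root
    for R, (x, y), (vx, vy), (wx, wy) in zip(part_resps, parts, anchors, deform):
        dx = x - (2 * x0 + vx)                      # displacement from anchor
        dy = y - (2 * y0 + vy)
        score += R[y, x] - (wx * dx * dx + wy * dy * dy)
    return score
```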
Matching
- Define an overall score for each root location
- Based on best placement of parts
- High scoring root locations define detections
- “sliding window approach”
- Efficient computation: dynamic programming + generalized distance transforms (max-convolution)
$$\mathrm{score}(p_0) = \max_{p_1, \dots, p_n} \mathrm{score}(p_0, \dots, p_n)$$
Transformed response (head filter example)
- Response of filter in l-th pyramid level (cross-correlation):
$$R_l(x, y) = F \cdot \phi(H, (x, y, l))$$
- Transformed response:
$$D_l(x, y) = \max_{dx, dy} \left( R_l(x + dx, y + dy) - d_i \cdot (dx^2, dy^2) \right)$$
- Max-convolution, computed in linear time (spreading, local max, etc.)
[Figure: input image; feature map and feature map at twice the resolution; model; response of root filter and responses of part filters; transformed responses; combined score of root locations; color encoding of filter response values]
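The transformed response can be computed directly from its definition by trying every displacement in a bounded neighbourhood. Note that this brute-force sketch is not the linear-time generalized distance transform mentioned on the slide; it is a reference version that is easy to check. The function names and the `radius` bound on displacements are assumptions.

```python
import numpy as np

def _shift_slices(n, d):
    """Destination and source slices so that dst[i] reads src[i + d]."""
    dst = slice(max(0, -d), min(n, n - d))
    src = slice(max(0, d), min(n, n + d))
    return dst, src

def transformed_response(R, d, radius=4):
    """D(x, y) = max over (dx, dy) of R(x+dx, y+dy) - d . (dx^2, dy^2).

    R     : 2-D response map indexed as R[y, x]
    d     : (weight for dx^2, weight for dy^2)
    radius: largest |dx|, |dy| considered (brute force, not linear time)
    """
    H, W = R.shape
    D = np.full((H, W), -np.inf)
    for dy in range(-radius, radius + 1):
        dst_y, src_y = _shift_slices(H, dy)
        for dx in range(-radius, radius + 1):
            dst_x, src_x = _shift_slices(W, dx)
            cost = d[0] * dx * dx + d[1] * dy * dy
            D[dst_y, dst_x] = np.maximum(D[dst_y, dst_x],
                                         R[src_y, src_x] - cost)
    return D
```

A strong response "spreads" to neighbouring locations with a penalty that grows quadratically with distance, which is what lets each root location pick up the best nearby part placement.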
Matching results
(after non-maximum suppression)
~1 second to search all scales
Training
- Training data consists of images with labeled bounding boxes.
- Need to learn the model structure, filters and deformation costs.
Latent SVM (MI-SVM)
- Classifiers that score an example x using
$$f_\beta(x) = \max_{z \in Z(x)} \beta \cdot \Phi(x, z)$$
– $\beta$ are model parameters
– z are latent values
- Training data: $D = ((x_1, y_1), \dots, (x_n, y_n))$ with $y_i \in \{-1, 1\}$
- We would like to find $\beta$ such that $y_i f_\beta(x_i) > 0$
- Minimize
$$L_D(\beta) = \frac{1}{2} \|\beta\|^2 + C \sum_{i=1}^{n} \max(0, 1 - y_i f_\beta(x_i))$$
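The latent SVM score and objective can be evaluated directly when the latent set Z(x) is small enough to enumerate. In the sketch below each example is represented by a matrix whose rows are the feature vectors Φ(x, z), one row per latent value z; this representation and the function names are assumptions for illustration.

```python
import numpy as np

def f_beta(beta, candidates):
    """f_beta(x) = max over z of beta . Phi(x, z); returns (score, best z)."""
    scores = candidates @ beta          # one score per latent value
    z = int(np.argmax(scores))
    return scores[z], z

def latent_svm_objective(beta, examples, labels, C=1.0):
    """L_D(beta) = 0.5 ||beta||^2 + C * sum of hinge losses."""
    hinge = sum(max(0.0, 1.0 - y * f_beta(beta, cand)[0])
                for cand, y in zip(examples, labels))
    return 0.5 * float(beta @ beta) + C * hinge
```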
Semi-convexity
- Maximum of convex functions is convex
- $f_\beta(x) = \max_{z \in Z(x)} \beta \cdot \Phi(x, z)$ is convex in $\beta$
- $\max(0, 1 - y_i f_\beta(x_i))$ is convex for negative examples
$$L_D(\beta) = \frac{1}{2} \|\beta\|^2 + C \sum_{i=1}^{n} \max(0, 1 - y_i f_\beta(x_i))$$
- Convex if latent values for positive examples are fixed
Latent SVM training
- Convex if we fix z for positive examples
- Optimization:
- Initialize $\beta$ and iterate:
- Pick best z for each positive example
- Optimize $\beta$ via gradient descent with data-mining
$$L_D(\beta) = \frac{1}{2} \|\beta\|^2 + C \sum_{i=1}^{n} \max(0, 1 - y_i f_\beta(x_i))$$
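The alternation between picking latent values for positives and optimizing the model can be sketched as below. This is a simplified sketch, not the authors' training code: it uses plain batch sub-gradient descent, keeps all negatives in memory instead of data-mining hard ones, and starts from a zero model. Examples are matrices with one row Φ(x, z) per latent value z, and all names and hyperparameters are assumptions.

```python
import numpy as np

def train_latent_svm(pos, neg, dim, C=1.0, outer=5, inner=200, lr=0.01):
    """Coordinate-descent style training of a latent SVM (toy version).

    pos, neg: lists of (num_latent_values, dim) arrays, one per example
    """
    beta = np.zeros(dim)
    for _ in range(outer):
        # Step 1: fix the best latent value for each positive example.
        fixed = [cand[int(np.argmax(cand @ beta))] for cand in pos]
        # Step 2: with positives fixed the objective is convex in beta.
        for _ in range(inner):
            grad = beta.copy()                     # gradient of 0.5 ||beta||^2
            for phi in fixed:
                if phi @ beta < 1:                 # positive inside the margin
                    grad -= C * phi
            for cand in neg:
                scores = cand @ beta
                k = int(np.argmax(scores))         # highest scoring latent value
                if scores[k] > -1:                 # negative inside the margin
                    grad += C * cand[k]
            beta -= lr * grad
    return beta
```

Negatives keep their max over z inside the loss, which is why only the positives need their latent values fixed.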
Training Models
- Reduce to Latent SVM training problem
- Positive example specifies some z should have high score
- Bounding box defines range of root locations
- Parts can be anywhere
- This defines Z(x)
Background
- Negative example specifies no z should have high score
- One negative example per root location in a background image
- Huge number of negative examples
- Consistent with requiring low false-positive rate
Training algorithm, nested iterations
- Fix “best” latent values for positives
- Harvest high scoring (x,z) pairs from background images
- Update model using gradient descent
- Throw away (x,z) pairs with low score
- Sequence of training rounds
- Train root filters
- Initialize parts from root
- Train final model
Person model
[Figure: root filters (coarse resolution), part filters (finer resolution), deformation models]
Person detections
[Figure: high scoring true positives; high scoring false positives (not enough overlap)]
Quantitative results
- 7 systems competed in the 2008 challenge
- Out of 20 classes we got:
- First place in 7 classes
- Second place in 8 classes
- Some statistics:
- It takes ~2 seconds to evaluate a model in one image
- It takes ~4 hours to train a model
- MUCH faster than most systems.
HUMAN DETECTION IN VIDEO
Motion is Helpful!
- Humans can perceive human figure presence and action in videos
– Even solely from body joint positions
– Even in clutter
- Moving light displays
– Johansson, Perception and Psychophysics 1973
– Ideas used by Song et al. CVIU 2000
CASCADE OF BOOSTED FEATURES FOR DETECTING PEDESTRIANS
Viola, Jones, and Snow, Detecting pedestrians using patterns of motion and appearance, ICCV 2003
Viola‐Jones
- Viola‐Jones face detector
– Viola and Jones CVPR 2001
– Window‐scanning approach
- Two nice ideas
– Define many, efficient‐to‐compute features
- AdaBoost to select good ones from them
– Cascade architecture to quickly eliminate non‐face sub‐windows
AdaBoost Algorithm
- Given a set of “weak learners” $h_i(x) \in \{+1, -1\}$
- Build “strong learner”
$$h(x) = \sum_{t=1}^{T} \alpha_t h_t(x)$$
– Greedy selection of weak learners
– Each iteration, choose best weak learner
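The greedy selection of weak learners can be sketched with standard discrete AdaBoost over threshold stumps. This is a generic textbook formulation, assumed here for illustration; it is not necessarily the exact variant used by Viola and Jones. Each stump tests a single feature against a threshold, and the function names are hypothetical.

```python
import numpy as np

def stump_predict(X, j, thr, pol):
    """Weak learner: +1 if pol * (x_j - thr) >= 0, otherwise -1."""
    return np.where(pol * (X[:, j] - thr) >= 0, 1, -1)

def best_stump(X, y, w):
    """Exhaustively pick the stump with the lowest weighted error."""
    best = (np.inf, 0, 0.0, 1)
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for pol in (1, -1):
                err = w[stump_predict(X, j, thr, pol) != y].sum()
                if err < best[0]:
                    best = (err, j, thr, pol)
    return best

def adaboost(X, y, T=10):
    """Return a list of (alpha_t, feature, threshold, polarity)."""
    w = np.full(len(y), 1.0 / len(y))          # uniform example weights
    model = []
    for _ in range(T):
        err, j, thr, pol = best_stump(X, y, w)
        err = min(max(err, 1e-12), 1 - 1e-12)  # avoid division by zero
        alpha = 0.5 * np.log((1 - err) / err)
        w = w * np.exp(-alpha * y * stump_predict(X, j, thr, pol))
        w /= w.sum()                           # misclassified examples gain weight
        model.append((alpha, j, thr, pol))
    return model

def strong_score(model, X):
    """h(x) = sum over t of alpha_t * h_t(x); classify with its sign."""
    return sum(a * stump_predict(X, j, thr, pol) for a, j, thr, pol in model)
```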
AdaBoost Algorithm
Face Features
- Features: Haar‐like rectangle features
- Each weak learner examines a single feature
Integral Images
- Fast computation of features possible using integral images
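An integral image lets any rectangle sum be read off with four lookups, which is what makes Haar-like rectangle features cheap. The sketch below pads the integral image with a leading row and column of zeros; the function names and the particular two-rectangle feature are assumptions for illustration.

```python
import numpy as np

def integral_image(img):
    """ii[y, x] = sum of img[:y, :x] (zero-padded on the top and left)."""
    ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1))
    ii[1:, 1:] = img.cumsum(axis=0).cumsum(axis=1)
    return ii

def rect_sum(ii, y, x, h, w):
    """Sum of the h x w rectangle with top-left corner (y, x): 4 lookups."""
    return ii[y + h, x + w] - ii[y, x + w] - ii[y + h, x] + ii[y, x]

def haar_two_rect(ii, y, x, h, w):
    """Two-rectangle feature: left h x w block minus right h x w block."""
    return rect_sum(ii, y, x, h, w) - rect_sum(ii, y, x + w, h, w)
```

The cost of a feature is independent of its size, so the same feature can be evaluated at any scale without resizing the image.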
Cascade of Classifiers
- Most image sub‐windows don’t contain a face
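Because most sub-windows are background, a cascade rejects them as early as possible and only spends full effort on promising windows. A minimal sketch of cascade evaluation, with hypothetical names: each stage is a pair of a scoring function and a threshold, ordered from cheapest to most expensive.

```python
def cascade_classify(window, stages):
    """Return True only if the window passes every stage.

    stages: list of (score_fn, threshold) pairs, cheapest stage first.
    Evaluation stops at the first stage whose score falls below its
    threshold, so later (more expensive) stages are never run for
    windows that are rejected early.
    """
    for score_fn, threshold in stages:
        if score_fn(window) < threshold:
            return False
    return True
```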
Learned Classifier
- First two weak learners chosen:
And People?
- Same algorithm, slightly different features
- Diagonal to capture legs
- Frame differencing for motion
MOTION HOG
Dalal, Triggs, and Schmid, Human Detection Using Oriented Histograms of Flow and Appearance, ECCV 2006
Slides from Navneet Dalal
Motion HOG Processing Chain
- Input image, consecutive image
- Normalise gamma & colour
- Compute optical flow (flow field, magnitude of flow)
- Compute differential flow (differential flow X, differential flow Y)
- Accumulate votes for differential flow orientation over spatial cells
- Normalise contrast within overlapping blocks of cells
- Collect HOGs for all blocks over detection window
[Figure labels: Detection windows, Block, Cell, Overlap of blocks]
Overview of Feature Extraction
- Appearance channel: Input image, Static HOG encoding
- Motion channel: Consecutive image(s), Motion HOG encoding
- Collect HOGs over detection window
- Linear SVM
- Object/Non-object decision
Data Set
- Train: 5 DVDs, 182 shots, 5562 positive windows
- Test 1: Same 5 DVDs, 50 shots, 1704 positive windows
- Test 2: 6 new DVDs, 128 shots, 2700 positive windows
Coding Motion Boundaries
[Figure panels: First frame, Second frame, Estd. flow, Flow mag., x-flow diff, y-flow diff, Avg. x-flow diff, Avg. y-flow diff]
- Treat x, y-flow components as independent images
- Take their local gradients separately, and compute HOGs as in static images
- Motion Boundary Histograms (MBH) encode depth and motion boundaries
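The idea of treating each flow component as an independent image and histogramming its gradients can be sketched for a single patch. The sketch assumes the optical flow has already been estimated elsewhere and is passed in as two arrays; the function names, the 9 unsigned bins and the use of one histogram per whole patch (rather than per cell with block normalisation) are simplifying assumptions.

```python
import numpy as np

def orientation_histogram(gx, gy, bins=9):
    """Magnitude-weighted histogram of unsigned orientation (0 to 180 deg)."""
    mag = np.hypot(gx, gy)
    ang = np.degrees(np.arctan2(gy, gx)) % 180.0
    idx = np.minimum((ang / (180.0 / bins)).astype(int), bins - 1)
    return np.bincount(idx.ravel(), weights=mag.ravel(), minlength=bins)

def mbh_patch(flow_x, flow_y, bins=9):
    """Motion-boundary style descriptor of one patch.

    flow_x, flow_y: the two optical flow components, each treated as an
    independent image whose spatial gradients are histogrammed.
    """
    hists = []
    for comp in (flow_x, flow_y):
        gy, gx = np.gradient(comp.astype(float))   # d/drow, d/dcol
        hists.append(orientation_histogram(gx, gy, bins))
    return np.concatenate(hists)
```

A uniform flow field, such as a pure camera translation, has zero spatial gradient and therefore contributes nothing, while a motion boundary produces strong votes.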
Coding Internal Dynamics
- Ideally compute relative displacements of different limbs
– Requires reliable part detectors
- Parts are relatively localised in our detection windows
– Allows different coding schemes based on fixed spatial differences
- Internal Motion Histograms (IMH) encode relative dynamics of different regions
IMH Continued
- Simple difference
– Take x, y differentials of flow vector images (Ix, Iy)
– Variants may use larger spatial displacements while differencing, e.g. (1 0 0 0 −1)
- Center cell difference
- Wavelet-style cell differences
SUMMARY
Summary
- Large literature on human detection
– These are a few, widely used, examples
- Code is available
– Ask me for reading list of others
- Encode shape and motion
– Gradient filters
– Motion histograms
- Encode spatial variability