Object detection using cascades of boosted classifiers
Javier Ruiz-del-Solar and Rodrigo Verschae
EVIC 2006
December 15th, 2006 Chile
Universidad de Chile Department of Electrical Engineering
General Outline: This tutorial has two parts.
– The problem: to determine whether a given image window belongs to the object class (detection problem).
– Applications: identity verification (hand, face, iris, fingerprint, …), fault detection, skin detection, person detection, car detection, eye detection, object recognition, … (and combinations).
– Views: Frontal, Semi-Frontal, Profile. In some cases objects observed under different views are considered as different objects.
[Diagram: Input Image → Object Detector → Alignment and Pre-Processing]
Sources of variability:
– Intrinsic variability in the objects: presence or absence of structural components, variability among instances of the class, variability of the particular instance.
– Extrinsic variability in the images: illumination, out-of-plane rotation (pose), occlusion, in-plane rotation, scale, capturing device / compression / image quality.
– Some images are only in grey scale, and in others the colors were modified.
– The color changes according to the illumination conditions, the capturing device, etc.
– The background can have similar colors.
– Even using state-of-the-art segmentation algorithms, very good results are only obtained when the working environment is controlled.
– However, color is very useful to reduce the search space, though some objects may be lost.
– In summary, under uncontrolled environments it is even more difficult to detect objects if color is used.
– SVM (Support Vector Machines, Osuna et al. 1997) * – NN (Neural Networks)
– Wavelet-Bayesian (Schneiderman & Kanade 1998, 2000) * – SNoW (Sparse Network of Winnows, Roth et al. 1998) * – FLD (Fisher Linear Discriminant, Yang et al. 2000) – MFA (Mixture of Factor Analyzers, Yang et al. 2000) – Adaboost/Nested-Cascade*
illumination conditions); Yen-Yu Lin et al. 2004 (occlusions)
– Kullback-Leibler boosting (Liu & Shum, 2003) – CFF (Convolutional Face Finder, neural based, Garcia & Delakis, 2004) – Many Others…
– Adaboost/Nested-Cascade * – Wavelet-Bayesian * – CFF – Kullback-Leibler boosting
– The set S, the training set, is used to learn a function f(x) that predicts the value of y from x.
[Figure: example face and non-face training windows. Images from: Ce Liu & Heung-Yeung Shum, 2003]
[Diagram of the detection pipeline: input image → multi-resolution analysis (multi-resolution images) → window extraction → pre-processing → classifier H(x) → face / non-face decision → processing of overlapped detections.]
[Diagram of the training procedure: labeled images containing the object and windows sampled from images not containing the object form the training dataset (object / non-object examples); boosting trains a classifier instance, which is evaluated and used to classify a large set of images containing no faces; wrongly accepted windows are added as new non-object examples (bootstrapping), yielding the final boosted classifier.]
– Bayesian classifier: assign to the window the label for which the class-conditional probability density function (multiplied by the a priori probability) is highest.
The decision is taken with a likelihood ratio test over local features $F_k(x)$ (assuming the features are statistically independent):
$$\frac{P(x \mid \mathrm{Object})}{P(x \mid \mathrm{nonObject})} = \prod_{k=1}^{K}\frac{P(F_k(x) \mid \mathrm{Object})}{P(F_k(x) \mid \mathrm{nonObject})} \;\geq\; \lambda$$
where $F_k(x)$ is the k-th feature computed on the window $x$ and $\lambda$ is the decision threshold.
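The following is a minimal numerical sketch of this decision rule, assuming the per-feature likelihoods have already been estimated as histograms; the tables and the threshold below are illustrative values only, not part of the tutorial:

```python
import numpy as np

def likelihood_ratio_decision(feature_indices, p_feat_obj, p_feat_nonobj, lam=1.0):
    """Classify a window from the quantized values of its K features.

    feature_indices : length-K array, the bin index of each feature F_k(x)
    p_feat_obj      : (K, n_bins) table, P(F_k = bin | Object)
    p_feat_nonobj   : (K, n_bins) table, P(F_k = bin | nonObject)
    """
    k = np.arange(len(feature_indices))
    # Product of per-feature likelihood ratios (sum of log-ratios for stability).
    log_ratio = np.sum(np.log(p_feat_obj[k, feature_indices])
                       - np.log(p_feat_nonobj[k, feature_indices]))
    return log_ratio >= np.log(lam)

# Toy example with K = 2 features quantized into 3 bins each (illustrative numbers).
p_obj = np.array([[0.7, 0.2, 0.1],
                  [0.5, 0.3, 0.2]])
p_non = np.array([[0.2, 0.3, 0.5],
                  [0.3, 0.3, 0.4]])
print(likelihood_ratio_decision(np.array([0, 0]), p_obj, p_non))  # True: looks like the object
```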
· The idea is to determine a hyperplane that separates the two classes.
· Margin of a given sample: its distance to the decision surface (hyperplane).
· The optimal hyperplane is the one that maximizes the margin of the closest samples (for both classes).
· The normal vector of the plane, $\mathbf{w}$, is defined so that for the two classes (faces / non-faces):
$$\mathbf{w}\cdot\boldsymbol{\delta} + b > 0 \;\Rightarrow\; \boldsymbol{\delta}\in\Omega_I$$
· Then, the value given by the classifier is:
$$S(\boldsymbol{\delta}) = \mathbf{w}\cdot\boldsymbol{\delta} + b$$
· The samples can be mapped into a higher-dimensional feature space F using the function $\Phi: \mathbb{R}^N \rightarrow F$.
· "KERNEL TRICK": replace the inner product $\mathbf{x}^{T}\mathbf{y}$ by a kernel $K(\mathbf{x},\mathbf{y})$.
· If $K(\mathbf{x},\mathbf{y})$ satisfies the Mercer conditions, then the following expansion exists:
$$K(\mathbf{x},\mathbf{y}) = \sum_i \phi_i(\mathbf{x})\,\phi_i(\mathbf{y}) = \Phi(\mathbf{x})^{T}\Phi(\mathbf{y})$$
  which is equivalent to performing an inner product of the mapped vectors in F.
· Examples:
  · Polynomial: $K(\mathbf{x},\mathbf{y}) = (\mathbf{x}^{T}\mathbf{y} + 1)^{d}$
  · RBF: $K(\mathbf{x},\mathbf{y}) = e^{-\frac{\|\mathbf{x}-\mathbf{y}\|^{2}}{2\sigma^{2}}}$
  · Sigmoid: $K(\mathbf{x},\mathbf{y}) = \tanh(k\,\mathbf{x}^{T}\mathbf{y} - \theta)$
· The output given by the classifier is:
$$S(\boldsymbol{\delta}) = \sum_{i\,\in\,\text{Support Vectors}} \alpha_i\, y_i\, K(\boldsymbol{\delta}_i, \boldsymbol{\delta}) + b$$
  · $y_i$: labels (+1: faces, -1: non-faces)
  · $\boldsymbol{\delta}_i$: projected difference (training example); $\boldsymbol{\delta}$: new projected difference
· SVM main idea: the optimal decision surface (hyperplane) is the one that is farthest from the most difficult examples (the support vectors).
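Below is a small sketch of this kernel decision function with an RBF kernel; the support vectors, labels, multipliers and bias are made-up toy values for illustration, not a trained model:

```python
import numpy as np

def rbf_kernel(x, y, sigma=1.0):
    # K(x, y) = exp(-||x - y||^2 / (2 sigma^2))
    return np.exp(-np.sum((x - y) ** 2) / (2.0 * sigma ** 2))

def svm_output(x, support_vectors, labels, alphas, b, sigma=1.0):
    # S(x) = sum_i alpha_i * y_i * K(x_i, x) + b, summed over the support vectors.
    return sum(a * y * rbf_kernel(sv, x, sigma)
               for sv, y, a in zip(support_vectors, labels, alphas)) + b

# Toy "model": two support vectors in 2-D (illustrative values only).
svs    = np.array([[1.0, 1.0], [-1.0, -1.0]])
labels = np.array([+1, -1])      # +1: face, -1: non-face
alphas = np.array([0.8, 0.8])
b = 0.0

x_new = np.array([0.9, 1.2])
print("S(x) =", svm_output(x_new, svs, labels, alphas, b))
print("class:", +1 if svm_output(x_new, svs, labels, alphas, b) > 0 else -1)
```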
SNoW (Sparse Network of Winnows):
– Each feature takes a single value (from all possible values); the input is represented as a binary sparse vector.
– Training (general diagram for k classes): Winnow-style updates; if the target is x = -1 and the prediction is positive, the weights of the active features are demoted, and if x = +1 and the prediction is negative, they are promoted.
Adaboost training loop (diagram):
– Training examples: m1 positives and m2 negatives.
– Initial weight of the examples: $D_0(i) = 1/m_1$ for positives, $D_0(i) = 1/m_2$ for negatives.
– For t = 1, …, T (T: number of weak classifiers to be selected):
  1. Weight normalization: $Z_t = \sum_{j=1}^{m_1+m_2} D_t(j)$ (normalization factor), $D_t(j) \leftarrow D_t(j)/Z_t$.
  2. Training of the base classifiers $g_1, \ldots, g_j, \ldots$ using $D_t$; select the best $g_j(D_t)$ and define $h_t = g_j(D_t)$.
  3. Select $\alpha_t$, the weight of the classifier $h_t$.
  4. Update the weights: $D_{t+1}(j) = D_t(j)\,e^{-\alpha_t y_j h_t(x_j)}$.
– Final classifier: $H(x) = \mathrm{sign}\!\left(\sum_{t=1}^{T}\alpha_t h_t(x)\right)$
(A compact sketch of this loop is given below.)
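The following is a compact sketch of discrete Adaboost with decision stumps as base classifiers; it is a simplified illustration of the loop above (stump search, error-based choice of alpha), not the exact training code used in the tutorial:

```python
import numpy as np

def train_stump(X, y, D):
    """Pick the (feature, threshold, polarity) stump with lowest weighted error."""
    best = None
    for f in range(X.shape[1]):
        for thr in np.unique(X[:, f]):
            for pol in (+1, -1):
                pred = np.where(pol * (X[:, f] - thr) > 0, 1, -1)
                err = np.sum(D[pred != y])
                if best is None or err < best[0]:
                    best = (err, f, thr, pol)
    return best  # (weighted error, feature, threshold, polarity)

def adaboost(X, y, T=10):
    m = len(y)
    D = np.full(m, 1.0 / m)            # initial example weights
    stumps, alphas = [], []
    for _ in range(T):
        D = D / D.sum()                # weight normalization (Z_t)
        err, f, thr, pol = train_stump(X, y, D)
        err = np.clip(err, 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)    # weight of h_t
        pred = np.where(pol * (X[:, f] - thr) > 0, 1, -1)
        D = D * np.exp(-alpha * y * pred)        # weight update
        stumps.append((f, thr, pol))
        alphas.append(alpha)
    def H(Xq):  # final classifier: sign of the weighted sum of weak classifiers
        s = np.zeros(len(Xq))
        for (f, thr, pol), a in zip(stumps, alphas):
            s += a * np.where(pol * (Xq[:, f] - thr) > 0, 1, -1)
        return np.sign(s)
    return H

# Toy 1-D example: positives above 0.5, negatives below.
X = np.array([[0.1], [0.2], [0.4], [0.6], [0.8], [0.9]])
y = np.array([-1, -1, -1, 1, 1, 1])
H = adaboost(X, y, T=5)
print(H(X))  # all six training points classified correctly
```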
[Diagram repeated: training procedure (labeling, window sampling, boosting, bootstrapping), together with example face / non-face windows; images from Ce Liu & Heung-Yeung Shum, 2003.]
– Labeling: the object examples need to be annotated.
– Increasing the training set: generate variations (e.g., in illumination).
– Annotation: annotate the reference points.
– Generation of variants of the annotated faces: mirroring, translations, scale changes, rotations, adding noise, etc.
– Non-face examples: gather images containing no faces and sample windows from these images.
[Diagram repeated: training procedure, now highlighting the bootstrapping loop over images not containing the object.]
– Bootstrapping: collect non-object windows that were wrongly classified (as faces) by the trained classifier and then add them to the training set.
– These new examples lie near the "border" of the two classes.
Objective: to sample the non-face class at the boundary with the face class.
Procedure (a code sketch is given below):
1. Train the classifier.
2. Collect new non-face windows that were wrongly classified as faces by the recently trained classifier.
3. Add these windows to the training set.
4. Go to (1) unless one of the following stopping conditions holds:
   a. The false positive rate is low enough.
   b. There is no further improvement in the performance.
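A minimal sketch of this bootstrapping loop is shown below; the helpers train, false_positive_rate and sample_windows are hypothetical placeholders for whatever training, evaluation and sampling code is actually used:

```python
def bootstrap(train, false_positive_rate, sample_windows, faces, nonface_images,
              target_fpr=0.01, max_rounds=10):
    """Iteratively enlarge the non-face set with hard false positives.

    'train', 'false_positive_rate' and 'sample_windows' are assumed helpers."""
    nonfaces = sample_windows(nonface_images, n=len(faces))   # initial random windows
    clf, prev_fpr = None, 1.0
    for _ in range(max_rounds):
        clf = train(faces, nonfaces)                          # 1. train the classifier
        fpr = false_positive_rate(clf, nonface_images)
        if fpr <= target_fpr or fpr >= prev_fpr:              # 4a / 4b stopping conditions
            break
        prev_fpr = fpr
        # 2./3. collect windows wrongly classified as faces and add them to the set
        hard = [w for w in sample_windows(nonface_images, n=10000) if clf(w) == +1]
        nonfaces += hard
    return clf
```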
[Figure: distribution of the non-face examples before and after the bootstrapping. Sung and Poggio, 1996]
– True Positive Rate (TPR): fraction of the positives that are correctly classified.
– True Negative Rate (TNR): fraction of the negatives that are correctly classified.
– False Positive Rate (FPR): fraction of the negatives that are wrongly classified as positives.
– False Negative Rate (FNR): fraction of the positives that are wrongly classified as negatives.
TPR + FNR = 1, TNR + FPR = 1
– The ROC (Receiver Operating Characteristic) curve characterizes the ability of the classifier for differentiating the two classes.
– It is determined by the outputs for the two classes (the classifier output distribution for the two classes).
– When different thresholds are applied to the classifier output, different operation points (TPR, FPR) are obtained.
– In the example, the classifier output for each of the classes follows a Gaussian distribution (a small numeric sketch of the operation points is given after the image credit below).
Images from: http://www.anaesthetist.com/mnm/stats/roc/
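The following short sketch shows how the operation points of a ROC curve can be obtained by sweeping a threshold over the classifier outputs; the Gaussian class-conditional scores mirror the example above and the parameters are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
# Classifier outputs for the two classes, as in the Gaussian example.
scores_pos = rng.normal(loc=1.0, scale=1.0, size=1000)   # objects
scores_neg = rng.normal(loc=-1.0, scale=1.0, size=1000)  # non-objects

for thr in (-1.0, 0.0, 1.0):
    tpr = np.mean(scores_pos >= thr)   # true positive rate at this threshold
    fpr = np.mean(scores_neg >= thr)   # false positive rate at this threshold
    print(f"threshold {thr:+.1f}: TPR = {tpr:.3f}, FPR = {fpr:.3f}")
```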
[Diagram of the cascade detection pipeline: input images → multi-resolution analysis (image pyramid) → windows to be processed V1 … Vn → pre-processing → cascade classifier H(x) → object / non-object → processing of overlapped windows.]
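Below is a schematic sketch of the multi-resolution window extraction implied by the diagram, using plain numpy and nearest-neighbour downscaling; classify_window stands in for the cascade classifier and all parameters are illustrative:

```python
import numpy as np

def pyramid(image, scale=1.25, min_size=24):
    """Yield progressively downscaled copies of the image (nearest-neighbour)."""
    img = image
    while min(img.shape) >= min_size:
        yield img
        h, w = img.shape
        nh, nw = int(h / scale), int(w / scale)
        rows = (np.arange(nh) * scale).astype(int)
        cols = (np.arange(nw) * scale).astype(int)
        img = img[np.ix_(rows, cols)]

def sliding_windows(image, size=24, step=4):
    """Yield (x, y, window) for every window of the given size."""
    h, w = image.shape
    for y in range(0, h - size + 1, step):
        for x in range(0, w - size + 1, step):
            yield x, y, image[y:y + size, x:x + size]

def detect(image, classify_window):
    """Run a window classifier over every scale; return detections in original coords."""
    detections = []
    for level, img in enumerate(pyramid(image)):
        scale_acc = image.shape[0] / img.shape[0]
        for x, y, win in sliding_windows(img):
            if classify_window(win):
                detections.append((int(x * scale_acc), int(y * scale_acc), level))
    return detections

# Example with a dummy classifier that fires on bright windows.
img = np.zeros((96, 96)); img[30:60, 30:60] = 1.0
print(len(detect(img, lambda w: w.mean() > 0.9)))
```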
– Jones M., Viola P., “Fast Multi-view Face Detection”, CVPR, June, 2003. – Wu B., et al, “Fast Rotation Invariant Multi-view Face Detection based on Real Adaboost”, FG 2004
– Schneiderman H., “Feature-Centric Evaluation for Efficient Cascaded Object Detection”, CVPR 2004
– Viola P., Jones M., Snow D., “Detecting Pedestrians Using Patterns of Motion and Appearance”, ICCV 2003
– Koelsch M. ,Turk M., “Robust Hand Detection”, FG2004
– Torralba A., Murphy K., and Freeman W., "Sharing Visual Features for Multiclass and Multiview Object Detection", AI Memo 2004-008, MIT Computer Science and Artificial Intelligence Laboratory, April 2004.
– A window V is classified as object only if it is accepted by all the filters of the cascade; any filter can reject it as non-object. [Viola & Jones 2001]
[Diagram of the cascade: Filter 1 → … → Filter i → … → Filter n; each filter may reject the processing window V as non-object.]
Design criteria for the cascade:
– False Positive Rate
– Detection Rate
– Processing speed
For a cascade of N layers with per-layer false positive rates $f_t$ and detection rates $d_t$:
$$f_{total} = \prod_{t=1}^{N} f_t, \qquad d_{total} = \prod_{t=1}^{N} d_t$$
$$\text{if } f_t = 0.5\ \forall t \text{ and } N = 20 \Rightarrow f_{total} \approx 1\times 10^{-6}$$
$$\text{if } d_t = 0.999\ \forall t \text{ and } N = 20 \Rightarrow d_{total} \approx 0.9802$$
$$\text{if } d_t = 0.995\ \forall t \text{ and } N = 20 \Rightarrow d_{total} \approx 0.9046$$
Expected number of features evaluated per window:
$$nf_{total} = nf_1 + \sum_{i=2}^{N} nf_i \prod_{j=1}^{i-1} f_j$$
(A numeric check of these products is given below.)
[Diagram repeated: cascade of filters; a processing window V that passes all filters is declared object, and any filter can reject it as non-object. Viola & Jones 2001]
– In an image there are many more non-object windows than object windows.
– Therefore the average processing time per window is dominated by the processing time of the non-object windows.
– Many non-object windows are "very different" from object windows, i.e. it is easy to classify them as non-objects.
– Most windows will be classified as non-object by the first filters (also called stages or layers) of the cascade.
– Therefore filter i should be evaluated faster than filter i+1.
– The first and second filters are the most important ones for the processing speed.
Cascade training algorithm (a code sketch is given below):
Inputs:
– f: maximum allowed false positive rate per layer
– d: minimum allowed detection rate per layer
– Ftotal: target final false positive rate
– P: training set of objects; N: training set of non-objects
Notation: Hi: layer i of the cascade; H: final classifier.
Initialization: Fall = 1.0, Dall = 1.0, ND = 0, i = 0
While Fall > Ftotal:
  i = i + 1
  Train Hi such that fi ≤ f, di ≥ d
  Update the detection and false positive rates of the cascade: Fall = Fall * fi, Dall = Dall * di
  Add the new layer to the cascade: H = H Θ Hi
  Re-sample the set of non-objects, N, using H()
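A compact sketch of this training loop is shown below; train_layer, eval_rates and resample_negatives are hypothetical placeholders for the actual boosting, evaluation and bootstrapping code:

```python
def train_cascade(train_layer, eval_rates, resample_negatives,
                  positives, negatives, f_max=0.5, d_min=0.995, F_target=1e-6):
    """Grow cascade layers until the overall false positive rate is low enough.

    'train_layer', 'eval_rates' and 'resample_negatives' are assumed helpers."""
    cascade = []
    F_all, D_all = 1.0, 1.0
    while F_all > F_target and negatives:
        # Train one boosted layer so that f_i <= f_max and d_i >= d_min.
        layer = train_layer(positives, negatives, f_max, d_min)
        f_i, d_i = eval_rates(layer, positives, negatives)
        F_all *= f_i           # overall false positive rate of the cascade
        D_all *= d_i           # overall detection rate of the cascade
        cascade.append(layer)  # H = H (+) H_i
        # Keep only negatives still (wrongly) accepted by the current cascade.
        negatives = resample_negatives(cascade, negatives)
    return cascade, F_all, D_all
```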
[Diagram of the cascade training loop: get the non-object training set → start training a new layer → train the new layer → get the new rates → add the new layer to the cascade.]
– Nested cascade: the confidence computed by each layer is passed on to the next layer of the cascade.
– Sharing information between the layers of the cascade yields a more compact and robust cascade.
[Diagram of a nested cascade: Filter 1 → … → Filter i → … → Filter n; V: processing window; Ci: confidence given by filter i of the cascade, fed into filter i+1; rejected windows are non-face, windows accepted by all filters are face. Bo Wu et al. 2004]
– Train the same base classifier several times.
– In each iteration a different training set (or distribution) is used, obtaining a weak classifier (also called a hypothesis or an instance of the base classifier).
– The obtained weak classifiers are combined to form the final classifier (also called the robust classifier).
Key questions:
– How to choose the distribution / training set in each iteration?
– How to combine the obtained classifiers?
Different algorithms answer these questions in different ways:
– Bagging (Bootstrap aggregation) – AdaBoost (Adaptive Boosting) – Cross-validated committees – LogitBoost – etc.
Discrete Adaboost: [Y. Freund and R. Schapire 1995]. Real Adaboost, also multi-class: [R. Schapire and Y. Singer 1999]. First uses of Adaboost in face detection: [Viola & Jones 2001] and [Schneiderman 2000].
[Diagram repeated: boosting + bootstrapping training procedure; annotated object examples and windows sampled from images not containing the objects form the training set; each boosting iteration yields an instance h_i of the base classifier; evaluation and classification over a large set of images not containing the objects provides new non-object examples (bootstrapping); the instances are combined into the final classifier H(h1, h2, …, hn).]
– Bagging (Bootstrap aggregation) – Cross-validated committees – Discrete AdaBoost (Adaptive Boosting) – Real AdaBoost – Arc-gv – LogitBoost – …
– Base (weak) classifiers: any classifier that can be trained on the weighted training data, for example:
– decision trees – neural networks – decision stumps
→ every weak learner is trained with a different “adapted” distribution.
→ the algorithm focuses on hard examples
→ greedy search
– Each example of the training set has an associated weight:
$$\{(x_i, y_i, D(i))\}_{i=1,\ldots,m}, \qquad x_i \in X,\; y_i \in \{-1,+1\}$$
– An initial distribution of weights, $D_0$, is defined.
– In each iteration the weights are updated according to the errors of the previously trained classifiers.
$$H(x) = \mathrm{sign}\!\left(\sum_{t=1}^{T}\alpha_t\, h_t(x)\right)$$
– SVM – Neural Networks – Adaboost
In each iteration the weak classifier is chosen from the family $G$ of base classifiers using the current distribution $D_t$:
$$h_t = \arg\min_{g\,\in\,G}\ \mathrm{Function}\big(g;\ D_t, (x_1,y_1),\ldots,(x_m,y_m)\big)$$
where Function measures the weighted error of $g$ under $D_t$.
(The Adaboost training loop shown earlier applies here: normalize the weights, train the base classifiers and select the best one as $h_t = g_j(D_t)$, choose $\alpha_t$, update $D_{t+1}(j) = D_t(j)\,e^{-\alpha_t y_j h_t(x_j)}$, and output $H(x) = \mathrm{sign}(\sum_{t=1}^{T}\alpha_t h_t(x))$.)
The training error is bounded by the product of the normalization factors:
$$\frac{1}{m}\left|\{i : H(x_i)\neq y_i\}\right| \;\leq\; \prod_{t=1}^{T} Z_t, \qquad Z_t = \sum_i D_t(i)\, e^{-\alpha_t y_i h_t(x_i)}$$
with
$$\alpha_t = \frac{1}{2}\ln\frac{1+r_t}{1-r_t}, \qquad h_t \in [-1,+1], \qquad r_t = \sum_i D_t(i)\, y_i\, h_t(x_i)$$
– The training error diminishes very fast when adding new weak classifiers (Real Adaboost, bounded case; see below).
– The final classifiers usually do not overfit (low out-of-training-set error) even when the training error is zero and we continue to add weak classifiers to the final robust classifier (empirical result).
For the bounded case this gives
$$\frac{1}{m}\left|\{i : H(x_i)\neq y_i\}\right| \;\leq\; \prod_{t=1}^{T} \sqrt{1 - r_t^{2}}$$
since, unrolling the weight update,
$$Z_t = \sum_i D_t(i)\, e^{-\alpha_t y_i h_t(x_i)} = \sum_i \frac{e^{-y_i f_{t-1}(x_i)}\, e^{-\alpha_t y_i h_t(x_i)}}{\prod_{j<t} Z_j} = \sum_i \frac{e^{-y_i f_t(x_i)}}{\prod_{j<t} Z_j}, \qquad f_t(x) = \sum_{j\leq t}\alpha_j h_j(x)$$
and the final classifier is
$$H(x) = \mathrm{sign}\!\left(\sum_{t=1}^{T}\alpha_t h_t(x)\right) = \mathrm{sign}\big(f(x)\big)$$
– In this way Adaboost maximizes the margin of the training examples,
$$\mathrm{margin}(x_i, y_i) = y_i f(x_i) = y_i \sum_{t=1}^{T}\alpha_t h_t(x_i),$$
and at the same time minimizes a bound of the training error.
– Generalization error ($S = \{(x_1,y_1),\ldots,(x_m,y_m)\}$ is chosen according to a distribution $P$):
$$\Pr_{(x,y)\sim P}\big[H(x)\neq y\big] \;=\; \Pr_{(x,y)\sim P}\big[y f(x)\leq 0\big]$$
– With high probability:
$$\Pr_{(x,y)\sim P}\big[H(x)\neq y\big] \;\leq\; \Pr_{(x,y)\sim S}\big[y f(x)\leq \theta\big] + \tilde{O}\!\left(\sqrt{\frac{d\,\log(m/d)}{m\,\theta^{2}}}\right)$$
– (There is a "similar" bound for Neural Networks.)
[Schapire, Freund, Bartlett and Lee, 1998]
[Figure: Adaboost and margins, a real example. Left: training and test errors; right: accumulated margins for 5, 100 and 1000 iterations.]
Bound on the empirical margin distribution:
$$\Pr_{(x,y)\sim S}\big[y f(x)\leq\theta\big] \;\leq\; \prod_{t=1}^{T} 2\sqrt{\varepsilon_t^{\,1-\theta}\,(1-\varepsilon_t)^{\,1+\theta}}$$
If $\varepsilon_t = \tfrac{1}{2}-\gamma_t$ with $\gamma_t > \gamma$ and $\theta < \gamma$, then
$$\Pr_{(x,y)\sim S}\big[y f(x)\leq\theta\big] \;\leq\; \left((1-2\gamma)^{1-\theta}(1+2\gamma)^{1+\theta}\right)^{T/2} = \exp(-O(T)),$$
so in the generalization bound
$$\Pr_{(x,y)\sim P}\big[H(x)\neq y\big] \;\leq\; \Pr_{(x,y)\sim S}\big[y f(x)\leq \theta\big] + \tilde{O}\!\left(\sqrt{\frac{d\,\log(m/d)}{m\,\theta^{2}}}\right)$$
the empirical margin term vanishes exponentially fast with the number of iterations T.
In the bound above:
– $d = \mathrm{VCdim}(G)$, the Vapnik-Chervonenkis dimension of $G$; $d$ measures the "complexity" of the family of functions $G$.
– $m$ is the number of training samples.
– The bound does not depend on $T$, the number of weak classifiers!
– (There is a "similar" bound for Neural Networks.)
– Risk: $R(f) = \Pr\big(\mathrm{sign}(f(X)) \neq Y\big)$
– Bayes risk: $R^{*} = \inf_{g\in G} R(g)$
– Consistency: a learning method is universally consistent if, for all distributions $P$, $R(f_m) \rightarrow R^{*}$ a.s. as $m \rightarrow \infty$.
– Under some assumptions Adaboost is consistent, including $\mathrm{VCdim}(G) < \infty$ and a number of iterations $t_m = O(m^{\nu})$ with $\nu < 1$. (SVMs are also consistent.)
[Bartlett 2006]
Domain-partitioning weak classifiers [R. Schapire and Y. Singer 1999]:
– The feature domain F is partitioned into disjoint blocks $F_1, \ldots, F_n$ that cover the whole domain.
– The weak classifier gives a constant prediction $c_j$ on each block (so it can be implemented as a Look-Up Table):
$$h(f(x)) = c_j, \qquad f(x)\in F_j$$
– With the per-block class weights
$$W_b^{\,j} = \sum_{i\,:\,f(x_i)\in F_j\ \wedge\ y_i = b} D(i), \qquad b\in\{+1,-1\},$$
the prediction of each block is
$$c_j = \frac{1}{2}\ln\frac{W_{+1}^{\,j}}{W_{-1}^{\,j}}$$
(A small code sketch of such a LUT weak classifier follows.)
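Below is a minimal sketch of such a domain-partitioning (LUT) weak classifier, assuming the feature domain is split into equal-width bins; the binning scheme and the smoothing constant are illustrative choices, not necessarily the tutorial's implementation:

```python
import numpy as np

def train_lut_weak_classifier(f_values, y, D, n_bins=8, eps=1e-6):
    """Build c_j = 0.5 * ln(W_+^j / W_-^j) for each bin of the feature domain."""
    edges = np.linspace(f_values.min(), f_values.max(), n_bins + 1)
    bins = np.clip(np.digitize(f_values, edges[1:-1]), 0, n_bins - 1)
    w_pos = np.zeros(n_bins)
    w_neg = np.zeros(n_bins)
    for j, yi, d in zip(bins, y, D):
        if yi > 0:
            w_pos[j] += d      # W_{+1}^j: weight of positives falling in bin j
        else:
            w_neg[j] += d      # W_{-1}^j: weight of negatives falling in bin j
    c = 0.5 * np.log((w_pos + eps) / (w_neg + eps))   # smoothed LUT predictions
    return edges, c

def lut_predict(f_value, edges, c):
    j = np.clip(np.digitize(f_value, edges[1:-1]), 0, len(c) - 1)
    return c[j]

# Toy example: positives tend to have larger feature values (illustrative data).
f = np.array([0.1, 0.2, 0.3, 0.7, 0.8, 0.9])
y = np.array([-1, -1, -1, +1, +1, +1])
D = np.full(6, 1 / 6)
edges, c = train_lut_weak_classifier(f, y, D)
print(lut_predict(0.85, edges, c) > 0)   # True: positive confidence for the object class
```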
– With Adaboost the features (and the weak classifiers) can be selected during training.
– How to partition the feature domain?
Rectangular features and the integral image [Viola & Jones, 2001]:
– Integral image: the value at (x, y) is the sum of the pixels above and to the left,
$$II(x,y) = \sum_{x'\leq x,\ y'\leq y} I(x', y')$$
– The sum of the pixels inside rectangle D can then be computed with only four references (1, 2, 3, 4 are the corners of the rectangles A, B, C, D):
$$D = (A+B+C+D) - (A+B) - (A+C) + A = II(4) - II(2) - II(3) + II(1)$$
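A small numpy sketch of the integral image and the four-reference rectangle sum; the zero-padded first row and column are a convenience choice of this sketch, not necessarily the original implementation:

```python
import numpy as np

def integral_image(img):
    """II[y, x] = sum of img[:y, :x]; padded with a zero row and column."""
    ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1))
    ii[1:, 1:] = np.cumsum(np.cumsum(img, axis=0), axis=1)
    return ii

def rect_sum(ii, x, y, w, h):
    """Sum of img[y:y+h, x:x+w] using only four lookups in the integral image."""
    return ii[y + h, x + w] - ii[y, x + w] - ii[y + h, x] + ii[y, x]

img = np.arange(16, dtype=float).reshape(4, 4)
ii = integral_image(img)
print(rect_sum(ii, x=1, y=1, w=2, h=2))   # 5 + 6 + 9 + 10 = 30
print(img[1:3, 1:3].sum())                # same value, computed directly
```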
mLBP / Census Transform features [Fröba et al. 2004]:
– Take a 3x3 neighborhood and compare the pixel in the middle against its neighbors.
– Example: for the 3x3 patch
  90 45 50
   5 50 75
  89 70 36
  the comparisons against the center give the bits 1 0 1 / 0 x 1 / 1 1 0, i.e. Feature value = 10101110 (8 bits).
– Formal definition:
  – 3x3 neighborhood N'(x); mean intensity: $\bar{I}(x) = \frac{1}{9}\big(I(x) + \sum_{y\in N'} I(y)\big)$
  – A comparison function and the concatenation of its outputs over the neighborhood define the feature.
  – Census Transform: comparison of the 8 neighbors against the center pixel (8 bits).
  – Modified Census Transform: comparison of all 9 pixels against the neighborhood mean $\bar{I}(x)$ (9 bits).
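A tiny sketch of the census transform of a single 3x3 neighborhood, reproducing the example above; the raster bit ordering and the >= comparison are assumptions consistent with the stated feature value:

```python
import numpy as np

def census_transform_3x3(patch):
    """Census transform: compare the 8 neighbors against the center pixel."""
    center = patch[1, 1]
    bits = (patch >= center).astype(int).flatten()
    bits = np.delete(bits, 4)                 # drop the center position
    return "".join(map(str, bits))

def modified_census_transform_3x3(patch):
    """Modified census transform: compare all 9 pixels against the 3x3 mean."""
    mean = patch.mean()
    return "".join(map(str, (patch > mean).astype(int).flatten()))

patch = np.array([[90, 45, 50],
                  [ 5, 50, 75],
                  [89, 70, 36]])
print(census_transform_3x3(patch))            # 10101110  (8 bits, as in the example)
print(modified_census_transform_3x3(patch))   # 9 bits
```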
– The confidence computed by each layer is reused by the next layer of the cascade.
– Sharing information between the layers of the cascade yields a more compact and robust cascade.
[Diagram of the nested cascade: Filter 1 → … → Filter i → … → Filter n; V: processing window; Ci: confidence given by filter i of the cascade and passed to filter i+1; rejected windows are non-face, windows accepted by all filters are face.]
[Bo WU et al. 2004]
[Diagram of the nested cascade structure: features build weak classifiers, weak classifiers build nested (strong) classifiers, which form the robust cascade H1(·), H2(·), H3(·), …, Hn(·); each stage either rejects the window V as non-object or passes it on together with its confidence conf1(v), conf2(v), …; a window accepted by all stages is declared object.]
The nested strong classifiers are defined as:
$$H_1(x) = \sum_{t=1}^{T_1} h_t^{1}(x), \qquad H_i(x) = H_{i-1}(x) + \sum_{t=1}^{T_i} h_t^{i}(x), \quad i > 1$$
[Bo WU et al. 2004]
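Below is a compact sketch of how such a nested cascade could be evaluated, carrying the confidence of the previous layer into the next one; the data layout (lists of weak-classifier callables plus a rejection threshold per layer) is an illustrative assumption:

```python
def evaluate_nested_cascade(window, layers):
    """layers: list of (weak_classifiers, threshold); weak_classifiers are callables.

    Each layer's confidence H_i(x) starts from the previous layer's confidence
    (H_i = H_{i-1} + sum of its own weak classifiers), so information is shared.
    """
    confidence = 0.0
    for weak_classifiers, threshold in layers:
        confidence += sum(h(window) for h in weak_classifiers)  # H_i(x)
        if confidence < threshold:
            return False, confidence       # rejected as non-object by this layer
    return True, confidence                # accepted by all layers: object

# Toy example with two layers of trivial weak classifiers on a scalar "window".
layers = [
    ([lambda w: 0.6 if w > 0.5 else -0.6], 0.0),
    ([lambda w: 0.4 if w > 0.7 else -0.4,
      lambda w: 0.2 if w > 0.9 else -0.2], 0.5),
]
print(evaluate_nested_cascade(0.95, layers))   # (True, 1.2)
print(evaluate_nested_cascade(0.60, layers))   # rejected by the second layer
```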
Nested Cascades and Adaboost, training procedure:
Inputs:
– f: maximum allowed false positive rate per layer; d: minimum allowed detection rate per layer
– Ftotal: target final false positive rate
– P: training set of faces; N: training set of non-faces
Notation: Hi: layer i of the cascade; H: final classifier.
Initialization: Fall = 1.0, Dall = 1.0, ND = 0, i = 0
While Fall > Ftotal:
  i = i + 1
  Train the new layer:
    – If i = 1: use Adaboost to train Hi such that fi ≤ f, di ≥ d.
    – If i > 1: start Hi from the previous layer (nesting), update the weights Di using Hi, and use Adaboost to add weak classifiers to Hi until fi ≤ f, di ≥ d.
  Get the new rates: Fall = Fall * fi, Dall = Dall * di
  Add the new layer Hi to the cascade: H = H Θ Hi
  Re-sample the set of non-faces, N, using H
[Diagram of the training loop: get the non-object training set → train the new layer → get the new rates → add the new layer to the cascade.]
Liu et al. 2004 Bo WU et al. 2004
Bo WU 92.1 90.1 86.6 77.3 Ours
[Diagram of the full system: Input Image → Face Detection → Face Windows → Eyes Detection → Eyes Coordinates → Gender Classification → Gender.]
– ~90% face detection / ~0.2 FP per image (CMU-MIT) – ~80% correct gender recognition – ~99% eye detection with d_error < 5% d_eyes
– Trained using Adaboost (domain-partitioning).
– Training set:
layer of the cascade)
– Training time: 12 hours.
– About 200 ms for a 320x240 pixel image (≈ 5 frames per second) – About 1000 ms for a 640x480 pixel image (≈ 1 frame per second)
– CMU-MIT db:
– Localization Rate (When only one face is searched): ~95-99%
– Features based on rectangular features (first layers) and mLBP (last layers).
– Hierarchical grid using a coarse-to-fine procedure in image space
Demo
– Statistical learning:
– Training/testing Issues
– Cascade classifiers
appearance of the classes: Cascade Classifier
– Adaboost:
– Yang M.-H., Kriegman D., Ahuja N., "Detecting Faces in Images: A Survey", IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), vol. 24, no. 1, pp. 34-58, 2002.
– Schapire R., Singer Y., "Improved Boosting Algorithms Using Confidence-rated Predictions", Machine Learning, 37(3):297-336, 1999.
– Viola P., Jones M., "Fast and Robust Classification using Asymmetric AdaBoost and a Detector Cascade", Advances in Neural Information Processing Systems 14, December 2001.
Detection Based on Real Adaboost. Int. Conf. on Face and Gesture Recognition – FG 2004, 79 – 84, Seoul, Korea, May 2004, IEEE Press.
– Fröba B., Ernst A., "Face Detection with the Modified Census Transform", Int. Conf. on Face and Gesture Recognition – FG 2004, 91-96, Seoul, Korea, May 2004, IEEE Press.
– Garcia C., Delakis M., "Convolutional Face Finder: A Neural Architecture for Fast and Robust Face Detection", IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 26, no. 11, pp. 1408-1423, November 2004.
Gabor Filter Features, Int. Conf. on Face and Gesture Recognition – FG 2004, Seoul, Korea, May 2004, IEEE Press.
2004.
– Varying pose:
– In-plane Rotation: (Jones & Viola, 2003)
– Occlusion: (Fröba et al., 2004)
– Expression:
http://home.t-online.de/home/Robert.Frischholz/face.htm
http://www.ri.cmu.edu/labs/lab_51.html
http://cbcl.mit.edu/cbcl/software-datasets/FaceData2.html
http://vision.ai.uiuc.edu/mhyang/face-detection-survey.html
– Garcia & Delakis: http://www.csd.uoc.gr/~cgarcia/FaceDetectDemo.html – Schneiderman & Kanade: http://www.vasc.ri.cmu.edu/cgi-bin/demos/findface.cgi – WebFaces: http://www.cwr.cl/webfaces
– Standard (de facto):
– Others:
persons)
– Profile:
– In-plane rotation:
– Rowley:
– Viola & Jones: 24x24 pixels
– MIT (CBCL FACE DATABASE #1): 19x19 pixels
– Face Recognition datasets: is it fair to use them?
– …