Combining extreme value theory and machine learning for novelty detection
Luca Steyn
Two topics: extreme value theory and novelty detection.

INTRODUCTION

A new idea for multivariate extreme value theory and multivariate anomaly detection.
Novelty detection identifies when new observations differ from what is expected as normal behaviour. The deviation can be in any direction (positive or negative).

Goal: detect novel observations.

A common approach is to threshold a distribution representing the normal state of the system. (Is this a bad thing?) Novel observations differ to some extent from the observations in the normal class. Many algorithms for novelty detection have been proposed.
Denote the probability density function (pdf) of $X \in \mathbb{R}^p$ by $f(x) = \frac{d}{dx}F(x)$. An observation is regarded as normal when its density is large, i.e. $f(x) \geq t$. Define the normal region

$$S = \{x : f(x) \geq t\}, \qquad F(t) = \int_S f(x)\,dx = 0.9, \text{ say.}$$

A new observation $x^{*}$ is flagged as novel if $f(x^{*}) < t$.
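As a rough illustration (my own sketch, not from the slides), this rule can be implemented with any density estimate; the code below uses a kernel density estimate on synthetic data, with the threshold set at 90% coverage as above.

```python
import numpy as np
from scipy.stats import gaussian_kde

# Illustrative sketch (synthetic data): threshold a density estimate so that
# the normal region S = {x : f(x) >= t} covers about 90% of the normal data.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2))          # training data from the normal class
kde = gaussian_kde(X.T)                 # density estimate f
t = np.quantile(kde(X.T), 0.10)         # ~90% of training points have f(x) >= t

x_star = np.array([[4.0, 4.0]])         # a new observation
print("novel:", kde(x_star.T)[0] < t)   # flag as novel if f(x*) < t
```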
But how much certainty do we have that this observation is anomalous? Extreme value theory gives a way to quantify the probability that an observation is anomalous.
Let $X_1, X_2, X_3, \ldots$ be independent and identically distributed (iid) random variables and let $M_n = \max_{i=1,\ldots,n} X_i$. If sequences of constants $a_n > 0$ and $b_n$ exist such that

$$P\!\left(a_n^{-1}(M_n - b_n) \leq x\right) \to G(x) \quad \text{as } n \to \infty,$$

then $G(x)$ is necessarily the Generalized Extreme Value (GEV) distribution.
The GEV therefore arises for maxima in any iid setting (in the limit). Its three types correspond to the Pareto, Gumbel and (extremal) Weibull domains of attraction.
$$G(x) = \begin{cases} \exp\!\left\{-(1+\gamma x)^{-1/\gamma}\right\}, & \gamma \neq 0,\; 1+\gamma x > 0, \\ \exp\!\left\{-\exp(-x)\right\}, & \gamma = 0,\; x \in \mathbb{R}. \end{cases}$$

Results for minima follow from $\min X_i = -\max(-X_i)$.
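A quick numerical check of the theorem (my own sketch, not from the slides): simulate block maxima and fit a GEV with scipy. Note that scipy's `genextreme` uses the shape $c = -\gamma$ relative to the parametrisation above.

```python
import numpy as np
from scipy.stats import genextreme

# Sketch: block maxima M_n of iid exponential data, fitted with a GEV.
rng = np.random.default_rng(1)
maxima = rng.exponential(size=(5000, 100)).max(axis=1)  # 5000 maxima, n = 100

c, loc, scale = genextreme.fit(maxima)  # scipy shape c corresponds to -gamma
print(f"gamma_hat = {-c:.3f}")          # exponential tail: gamma = 0 (Gumbel)
```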
Furthermore, $F$ is in the domain of attraction of the GEV if and only if there exists an auxiliary function $b(\cdot)$ such that, for all $x$ with $1 + \gamma x > 0$,

$$\frac{1 - F(y + x\,b(y))}{1 - F(y)} \to (1+\gamma x)^{-1/\gamma} \quad \text{as } y \to \infty.$$

Writing $u = y + x\,b(y)$, this says that we can choose a high enough threshold such that the exceedances are approximately generalised Pareto (GP) distributed. Hence, for a large threshold $t$ and $Z = X - t$,

$$P(Z > z \mid X > t) \approx \left(1 + \frac{\gamma z}{b(t)}\right)^{-1/\gamma}.$$
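A peaks-over-threshold fit can be sketched as follows (my own illustration on synthetic heavy-tailed data, with the threshold at the 95% quantile; scipy's `genpareto` shape equals $\gamma$).

```python
import numpy as np
from scipy.stats import genpareto

# Sketch: fit a GP distribution to threshold exceedances Z = X - t.
rng = np.random.default_rng(2)
x = rng.standard_t(df=4, size=20000)    # heavy-tailed sample (true gamma = 1/4)
t = np.quantile(x, 0.95)                # a high threshold
z = x[x > t] - t                        # exceedances

gamma_hat, _, sigma_hat = genpareto.fit(z, floc=0.0)
print(f"gamma_hat = {gamma_hat:.3f}, sigma_hat = {sigma_hat:.3f}")
```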
In the multivariate case the density may be multimodal. Hence, one needs a method that transforms the data to overcome these issues. The transformation used here maps each observation to its probability density.
Let $Y = f(X)$ and consider the minimum density over the sample,

$$E_n = \min_{i \in \{1,\ldots,n\}} f(X_i) \equiv \min_i Y_i.$$

For $X \sim N(\mu, \Sigma)$, the distribution of the minimum density is approximately

$$P(E_n \leq y) \approx 1 - \exp\!\left(-y/a_n\right) \equiv \text{Weibull-type GEV}.$$

Furthermore, we can choose $a_n = G_d^{-1}(n^{-1})$, where $G_d(y)$ is the known distribution of $Y = f(X)$.
The probability that a new observation $x^{*}$ is novel is given by the probability that the density estimate at this observation, $y^{*} = f(x^{*})$, is less than the minimum probability density, i.e.:

$$P(x^{*} \text{ is novel}) = P(E_n > y^{*}) \approx \exp\!\left(-y^{*}/a_n\right).$$
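Under the Gaussian assumption the constant $a_n$ can also be approximated by Monte Carlo. The sketch below (my own, on simulated data) estimates $G_d$ empirically instead of using a closed form.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Sketch: novelty probability via the minimum-density argument for a fitted
# Gaussian; a_n = G_d^{-1}(1/n) is approximated by Monte Carlo.
rng = np.random.default_rng(3)
X = rng.normal(size=(1000, 2))          # training data
f = multivariate_normal(X.mean(axis=0), np.cov(X.T))

y_sim = f.pdf(f.rvs(size=100_000, random_state=4))  # sample of Y = f(X)
a_n = np.quantile(y_sim, 1.0 / len(X))  # a_n = G_d^{-1}(1/n), empirically

x_star = np.array([3.5, -3.5])
p_novel = np.exp(-f.pdf(x_star) / a_n)  # P(E_n > y*) ~ exp(-y*/a_n)
print(f"P(x* is novel) ~ {p_novel:.3f}")
```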
Problem: the Gaussian assumption is too strict. Extension: model the density with a Gaussian mixture model (GMM). There is no closed-form expression of parameter estimates, but the GMM minimum density is still in the domain of attraction of the Weibull-type GEV, and its parameters can be estimated by maximum likelihood.
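A GMM fit with BIC-based model selection might look as follows (a sketch using sklearn and synthetic data; the slides do not specify the implementation).

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Sketch: fit GMMs with 1..max_components components and keep the lowest BIC.
def fit_gmm_bic(X, max_components=10):
    fits = [GaussianMixture(n_components=k, random_state=0).fit(X)
            for k in range(1, max_components + 1)]
    return min(fits, key=lambda m: m.bic(X))

rng = np.random.default_rng(5)
X = np.vstack([rng.normal(-2, 1, (500, 2)), rng.normal(2, 1, (500, 2))])
gmm = fit_gmm_bic(X)
E_n = np.exp(gmm.score_samples(X)).min()  # minimum density E_n over the sample
print(gmm.n_components, E_n)
```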
[Figure: Weibull density of the GMM minimum density.]
Example: banknote authentication. The variables are the variance, skewness, kurtosis and entropy of the wavelet-transformed image. The GMM is fitted to the genuine banknotes in the training data, with the number of components chosen by the BIC criterion. Forged banknotes are then detected using the Weibull GEV of the minimum density. Only one genuine banknote is misclassified as forged on the test set, and all fake banknotes are detected.
Response    Predicted Real    Predicted Forged
Real        162               1
Forged      0                 609
Next, consider classification under the assumption that not all classes are known at training. This combines classification with novelty detection: test observations may come from classes not seen at training. The idea is to rescale the predicted class probabilities to account for other classes, so that an observation can be identified as coming from a new class not seen at training.
Consider a model that produces $P(Y = k \mid x)$, $k = 1, 2, \ldots, K$. For each class $k$:

- collect the correctly classified training data $x_{jk}$, i.e. the observations with $\hat{y} = k$, $j = 1, \ldots, n_k$;
- compute the class mean $\mu_k = \operatorname{mean}_j(x_{jk})$ and the distances $d_{jk} = \lVert x_{jk} - \mu_k \rVert$;
- fit a GP distribution to the exceedances of these distances above a threshold $t_k$.

The probability that an observation is not novel with respect to class $k$ is then $P(Z_k > z \mid D_k > t_k)$, where $Z_k = D_k - t_k$ and $D_k = \lVert X - \mu_k \rVert$. Notice that a per-class estimation strategy is followed.
Update probabilities: we update each class probability with

$$P_{\text{new}}(Y = k \mid X = x^{*}) = P(Y = k,\; Z_k > z^{*} \mid X = x^{*}) = P(Y = k \mid X = x^{*}) \cdot P(Z_k > z^{*} \mid Y = k, X = x^{*}) \approx P(Y = k \mid X = x^{*}) \cdot \left(1 + \frac{\gamma_k z^{*}}{\sigma_k}\right)^{-1/\gamma_k}.$$

The probability that an observation is from none of the classes is then

$$P(Y \text{ novel}) = 1 - \sum_{k} P_{\text{new}}(Y = k \mid X = x^{*}).$$

Classify as the class with maximum probability.
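The per-class update above could be implemented along these lines (my sketch; `feats`, `y_hat` and `probs` are assumed arrays of features, correct predictions, and model probabilities).

```python
import numpy as np
from scipy.stats import genpareto

# Sketch of the per-class GP rescaling of class probabilities.
def fit_class_tails(feats, y_hat, q=0.9):
    tails = {}
    for k in np.unique(y_hat):
        xk = feats[y_hat == k]                 # correctly classified class k
        mu_k = xk.mean(axis=0)                 # class mean
        d = np.linalg.norm(xk - mu_k, axis=1)  # distances d_jk
        t_k = np.quantile(d, q)                # per-class threshold
        gamma, _, sigma = genpareto.fit(d[d > t_k] - t_k, floc=0.0)
        tails[k] = (mu_k, t_k, gamma, sigma)
    return tails

def rescale_probs(probs, x, tails):
    new = {}
    for k, (mu_k, t_k, gamma, sigma) in tails.items():
        z = max(np.linalg.norm(x - mu_k) - t_k, 0.0)
        base = max(1.0 + gamma * z / sigma, 0.0)
        surv = base ** (-1.0 / gamma)          # GP survival (assumes gamma != 0)
        new[k] = probs[k] * surv               # P_new(Y = k | x)
    new["novel"] = 1.0 - sum(new.values())     # none of the known classes
    return new
```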
Approach:
- Image classification data from Kaggle.
- Train a CNN on the known classes.
- Estimate the per-class GP distributions from each class's correctly classified training data.
Training data:

Class          0     1     2     3     4     5     6     7
Observations   3285  3728  3382  3496  3243  3054  3312  3501
CNN model:
- convolutional layers followed by fully connected layers;
- class probabilities produced by the final fully connected layer;
- the correctly classified training data is collected for each class, and each dataset contains the output of the ultimate hidden layer.
Training results: misclassification error: 0.156%
[Training confusion matrix: diagonal counts 3284, 3711, 3379, 3493, 3240, 3046, 3310, 3496; all off-diagonal counts are single digits.]
Estimate the GP distribution for rescaling. For each class: compute the distances of the correctly classified training data to the class mean and fit a GP distribution to the exceedances above the threshold.
Rescale the class probabilities of the test set: for each test observation, compute its distance to each class mean from the training data and use the fitted GP survival function to rescale each class probability.
Results on the testing set:
Test error: 5.91%
Test error without rescaling: 20.08%
[Test confusion matrix: per-class correct counts 834, 918, 760, 794, 706, 682, 791, 869; most errors are observations assigned to the unknown class (2, 6, 8, 23, 19, 14, 8, 22 per class); of the unknown-class observations, 1548 are correctly flagged and 13, 37, 33, 61, 123, 59, 33, 31 are assigned to the known classes.]
Perhaps a better model: use the predicted probability that an observation is from the corresponding class, rather than the distance to the class mean. For each class: fit a GP distribution to the correctly classified training data's class probabilities below a threshold.
Test error: 5.69%
[Test confusion matrix for this model: per-class correct counts 842, 930, 781, 768, 756, 700, 811, 846; unknown-column counts 8, 8, 10, 33, 11, 28, 14, 52 per class; of the unknown-class observations, 1486 are correctly flagged and 5, 25, 13, 87, 71, 41, 13, 54 are assigned to the known classes.]
Question: can a regression tree be used for density estimation? If so, can we use this model to detect anomalous observations?

Main problem: we need a valid splitting criterion to determine the optimal split recursively. Criminisi, Shotton & Konukoglu (2011) proposed the information gain based on the continuous entropy of the multivariate Gaussian.
Consider splitting the root node into two decision nodes. Denote the set of training observations by $S$ and let the left and right decision nodes be denoted by $S_L$ and $S_R$, respectively. The information gain of this split (for the multivariate Gaussian) is then

$$I = |S| \ln\lvert\Sigma(S)\rvert - |S_L| \ln\lvert\Sigma(S_L)\rvert - |S_R| \ln\lvert\Sigma(S_R)\rvert.$$

This splitting criterion is used with recursive binary partitioning to grow a density tree.
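The splitting criterion is straightforward to compute; a minimal sketch (my own), with the covariance regularised for small nodes:

```python
import numpy as np

# Sketch: Gaussian information gain for a candidate split S -> (S_L, S_R).
def logdet_cov(S, eps=1e-6):
    cov = np.cov(S.T) + eps * np.eye(S.shape[1])  # regularised covariance
    return np.linalg.slogdet(cov)[1]

def info_gain(S, S_left, S_right):
    # I = |S| ln|Sigma(S)| - |S_L| ln|Sigma(S_L)| - |S_R| ln|Sigma(S_R)|
    return (len(S) * logdet_cov(S)
            - len(S_left) * logdet_cov(S_left)
            - len(S_right) * logdet_cov(S_right))
```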
The density estimate is obtained from the Gaussian distribution in each terminal node. Let the leaf reached by an input $x$ be denoted by $l(x)$. Then, the probability density at the input $x$ is given by

$$f(x) = \frac{1}{K}\, \pi_{l(x)}\, \phi\!\left(x;\, \mu_{l(x)}, \Sigma_{l(x)}\right),$$

where $K$ is a normalising constant, $\pi_{l(x)}$ is the proportion of training observations in leaf $l(x)$, and $\phi(\cdot)$ is the multivariate Gaussian density.
The normalising constant is given as

$$K = \sum_{l} \int_{x \in l} \pi_l\, \phi(x;\, \mu_l, \Sigma_l)\, dx.$$

For each leaf, the density estimates are used to estimate the GP distribution associated with the peaks below a small threshold. This distribution is then used to detect if a new observation is novel.
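Evaluating the leaf density then amounts to a lookup plus one Gaussian evaluation. In this sketch, `leaves` and `find_leaf` are hypothetical stand-ins for a fitted tree.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Sketch: f(x) = pi_l * phi(x; mu_l, Sigma_l) / K for the leaf l = l(x).
# `leaves` maps a leaf id to (pi_l, mu_l, Sigma_l); `find_leaf` returns l(x).
def tree_density(x, leaves, find_leaf, K):
    pi_l, mu_l, Sigma_l = leaves[find_leaf(x)]
    return pi_l * multivariate_normal(mu_l, Sigma_l).pdf(x) / K
```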
[Figure: example density tree with splits $X_2 \leq -3.2$ and $X_1 \leq 2.8$, giving leaf densities $f_1(x)$, $f_2(x)$ and $f_3(x)$.]
In this example the normalising constant is

$$K = \sum_{l} \int_{x \in l} \pi_l\, \phi(x;\, \mu_l, \Sigma_l)\, dx \approx 0.988.$$

Consequently, the density in each region is

$$f(x) = \frac{\pi_{l(x)}\, \phi\!\left(x;\, \mu_{l(x)}, \Sigma_{l(x)}\right)}{0.988}.$$
The GP tail of the leaf densities can then be used to decide whether a new observation comes from a previously sampled class or should be flagged as novel. In summary, combining extreme value theory with machine learning models provides a principled way to detect new classes, linking ideas from statistics and computer science.