SLIDE 1

Combining extreme value theory and machine learning for novelty detection

Luca Steyn

SLIDE 2

INTRODUCTION

  • Two topics: extreme value theory and novelty detection
  • A new idea for multivariate extreme value theory and multivariate anomaly detection
  • Brings together research from statistics and computer science

SLIDE 3

What is novelty detection?

  • Novelty detection is the process of identifying when new observations differ from what is expected as normal behaviour.
  • It is a classification problem, i.e. normal or anomalous (positive or negative).
  • Conventional classification algorithms fail to detect novel observations.
  • Instead, use a one-class classification approach: threshold a distribution representing the normal state of the system. (Is this a bad thing?)
  • Assumption: novel observations are scarce and differ to some extent from the observations in the normal class.

SLIDE 4

Methods to perform novelty detection

Many algorithms for novelty detection have been proposed. Broad approaches are:

  • Distance-based: e.g. modified kNN algorithms
  • Domain-based: e.g. one-class support vector machines
  • Reconstruction-based: e.g. neural networks or PCA
  • Probabilistic: e.g. density estimation and thresholding
SLIDE 5

A probabilistic approach

  • Let $X \in \mathbb{R}^p$ and denote the probability density function (pdf) by $f(x) = \frac{d}{dx} F(x)$.
  • Choose a threshold $t$ such that $F(t) = \int_{S} f(x)\,dx$ is large, i.e. $F(t) = 0.9$, where $S = \{x : f(x) \ge t\}$.
  • Then, a new observation $x^*$ is novel if $f(x^*) < t$.
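The thresholding rule can be sketched numerically. This is a minimal sketch, assuming a one-dimensional Gaussian normal class; the single-Gaussian density and the 0.95 percentile used to set the threshold are illustrative choices, not from the slides:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
normal_data = rng.normal(0.0, 1.0, size=5000)  # observations of "normal" behaviour

# Fit a density model for the normal class (a single Gaussian here;
# in practice f could be a KDE or mixture estimate).
f = stats.norm(loc=normal_data.mean(), scale=normal_data.std())

# Pick the threshold t as the density value at the 95th percentile, so the
# high-density region S = {x : f(x) >= t} carries roughly 90% probability.
t = f.pdf(f.ppf(0.95))

def is_novel(x):
    # Novel if the density at x falls below the threshold t.
    return f.pdf(x) < t

print(is_novel(0.1), is_novel(4.0))  # prints: False True
```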

SLIDE 6

A probabilistic approach

SLIDE 7

A probabilistic approach

  • If a new observation is below the threshold, how much certainty do we have that this observation is anomalous?
  • Extreme value theory estimates a probability that an observation is anomalous.

SLIDE 8

Extreme value theory: Fisher-Tippett theorem

  • Let $X_1, X_2, X_3, \dots$ be a sequence of independent and identically distributed (iid) random variables and let $M_n = \max\{X_1, \dots, X_n\}$. If sequences of constants $\{a_n > 0\}$ and $\{b_n\}$ exist such that $P\left(a_n^{-1}(M_n - b_n) \le x\right) \to G(x)$ as $n \to \infty$, then $G(x)$ is necessarily the Generalized Extreme Value (GEV) distribution.
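The theorem can be checked by simulation. A sketch, assuming iid Exp(1) variables, for which the normalising constants $a_n = 1$ and $b_n = \ln n$ give the Gumbel ($\gamma = 0$) limit; the block size and replication count are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(1)
n, reps = 500, 5000

# Normalised block maxima of iid Exp(1) samples: M_n - ln(n).
maxima = rng.exponential(1.0, size=(reps, n)).max(axis=1)
z = maxima - np.log(n)

# The empirical CDF should be close to the Gumbel limit G(x) = exp(-exp(-x)).
for x in (-1.0, 0.0, 2.0):
    print(f"x={x:+.1f}  empirical={(z <= x).mean():.3f}  Gumbel={np.exp(-np.exp(-x)):.3f}")
```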

SLIDE 9

Extreme value theory: Fisher-Tippett theorem

  • The GEV distribution is given by
$$G(x) = \begin{cases} \exp\left\{-(1+\gamma x)^{-1/\gamma}\right\}, & \gamma \ne 0,\ 1+\gamma x > 0, \\ \exp\left\{-\exp(-x)\right\}, & \gamma = 0,\ x \in \mathbb{R}. \end{cases}$$
  • Move from a non-parametric to a parametric setting (in the limit).
  • Three types of GEV distributions: Fréchet-Pareto, Gumbel, (extremal) Weibull.
  • Note: $\min(X) = -\max(-X)$.

SLIDE 10

Extreme value theory: Pickands-Balkema-de Haan theorem

  • The distribution $F$ is in the domain of attraction of the GEV distribution if and only if, for some auxiliary function $b(\cdot)$ and for all $x$ with $1 + \gamma x > 0$,
$$\frac{1 - F\left(y + b(y)x\right)}{1 - F(y)} \to (1 + \gamma x)^{-1/\gamma} \quad \text{as } y \to \infty.$$
Furthermore, $\frac{b\left(y + b(y)x\right)}{b(y)} \to 1 + \gamma x$.

SLIDE 11

Extreme value theory: Pickands-Balkema-de Haan theorem

  • Essentially, this theorem states that there exists a high enough threshold $t$ such that the exceedances $Z = X - t$ are approximately generalised Pareto (GP) distributed. Hence, for a large threshold $t$,
$$P(Z > z \mid X > t) \approx \left(1 + \gamma \frac{z}{b(t)}\right)^{-1/\gamma}.$$

SLIDE 12

Example: Uniform distribution

SLIDE 13

Other problems with EVT

  • The problem is multivariate.
  • The distribution under normal conditions is multimodal.

Hence, one needs a method that transforms the data to overcome these issues.

SLIDE 14

An approach based on minimum probability density

  • Redefine extreme value theory in terms of minimum probability density.
  • Let $E_n = \operatorname{argmin}_{X_i;\, i = 1, \dots, n} f(X_i)$, such that $f(E_n) = \min_i f(X_i) \equiv \min_i Y_i$, where $Y = f(X)$.
  • Assume $X \sim N(\mu, \Sigma)$.
  • It can be shown that $P\left(f(E_n) \le y\right) \approx 1 - \exp\left\{-a_n y\right\}$, a Weibull-type GEV.
  • Furthermore, we can choose $a_n = G_d^{-1}(1/n)$, where $G_d$ is the known distribution of $Y = f(X)$.

SLIDE 15

An approach based on minimum probability density

  • Hence, the probability that a new observation $x^*$ is novel is given by the probability that the density estimate at this observation, $y^* = f(x^*)$, is less than the distribution of minimum probability density, i.e.:
$$P(x^* \text{ is novel}) = P\left(f(E_n) > y^*\right) \approx \exp\left\{-a_n y^*\right\}.$$
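A numerical sketch of this novelty probability, assuming a bivariate Gaussian normal class. Note the constant `a_n` below is a crude plug-in calibrated on the observed minimum training density, not the closed-form $G_d$-based choice from the slides:

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(3)
f = multivariate_normal(mean=np.zeros(2), cov=np.eye(2))
X = rng.multivariate_normal(np.zeros(2), np.eye(2), size=2000)

# Crude plug-in for a_n: make the observed minimum training density the
# median of the limiting distribution P(f(E_n) <= y) ~ 1 - exp(-a_n * y).
y_min = f.pdf(X).min()
a_n = -np.log(0.5) / y_min

def p_novel(x_new):
    # P(x* is novel) = P(f(E_n) > y*) ~ exp(-a_n * y*), with y* = f(x*).
    return np.exp(-a_n * f.pdf(x_new))

print(p_novel([0.0, 0.0]), p_novel([5.0, 5.0]))
```

A point near the mode gets a novelty probability near 0, while a point far into the tail gets a probability near 1.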

SLIDE 16

An approach based on minimum probability density

SLIDE 17

An approach based on minimum probability density

Problem: Gaussian assumption is too strict.

SLIDE 18

An approach based on minimum probability density

  • The Gaussian assumption leads to an analytical expression for the parameter estimates.
  • The minimum of a GMM density is bounded below at zero.
  • Hence, the density of a GMM is in the domain of attraction of the Weibull-type GEV.
  • However, the parameters must be estimated via maximum likelihood.

SLIDE 19

An approach based on minimum probability density

Weibull density of GMM minimum density:

SLIDE 20

An approach based on minimum probability density

Weibull density of GMM minimum density:

SLIDE 21

An approach based on minimum probability density

Weibull density of GMM minimum density:

SLIDE 22

Banknote authentication example

  • Dataset: wavelet transform of banknotes – variables are the variance, skewness, kurtosis and entropy of the wavelet-transformed image.
  • There are 600 real banknotes in the training data.
  • There are 162 real and 610 forged banknotes in the test set.

SLIDE 23

Banknote authentication example

  • Select the number of components in the GMM with the BIC criterion.
  • The optimum was 5 Gaussian components.
  • Estimate the distribution of the minimum density of real banknotes using the Weibull GEV of the minimum density.
  • Use this distribution to determine the probability of a forged banknote on the test set.
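BIC-based selection of the number of GMM components can be sketched with scikit-learn. The 1-D, two-mode toy data below is an illustrative stand-in for the four banknote features:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(4)
# Toy stand-in for the "real banknote" features: a clear two-component mixture.
X = np.concatenate([rng.normal(-3, 1, 300), rng.normal(3, 1, 300)]).reshape(-1, 1)

# Fit GMMs with 1..5 components and keep the one minimising the BIC.
bics = {k: GaussianMixture(n_components=k, random_state=0).fit(X).bic(X)
        for k in range(1, 6)}
best_k = min(bics, key=bics.get)
print(best_k)  # prints: 2
```

On the banknote data the same loop, run on the four real-note features, selected 5 components.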

SLIDE 24

Banknote authentication example

  • Results:

    Response | Predicted real | Predicted forged
    Real     | 162            | 0
    Forged   | 1              | 609

  • Clearly, the model does very well in detecting fake banknotes.
  • However, this is very easy data.

SLIDE 25

Supervised novelty detection and open-set recognition

  • Open-set recognition: perform classification under the assumption that not all classes are known at training.
  • Use extreme value theory to detect new classes.
  • Similar concepts are used for supervised novelty detection.

SLIDE 26

A new approach based on the GP distribution

  • Problem: the testing set possibly contains classes not seen at training.
  • Use a supervised model to classify known classes.
  • Use extreme value theory to adjust the predicted probabilities to account for other classes.
  • Estimate the probability that an observation is from a new class not seen at training.

SLIDE 27

A new approach based on the GP distribution

Consider a model that produces $P(Y = k \mid x)$, $k = 1, 2, \dots, K$. For each class:

  1. Find the correctly classified training data $\{x_{jk} : \hat{y}_j = k\}$, $j = 1, \dots, n_k$.
  2. Let $\mu_k = \text{mean}(x_{jk})$ and compute the distances $d_{jk} = \|x_{jk} - \mu_k\|$.
  3. Fit a GP distribution to the exceedances $Z_k = D - t_k$ above a threshold $t_k$.

The probability that an observation is not novel with respect to class $k$ is $P(Z_k > z \mid D > t_k)$, where $Z_k = D - t_k$ and $D = \|X - \mu_k\|$. Notice that a per-class estimation strategy is followed.

SLIDE 28

A new approach based on the GP distribution

Update probabilities: we update each class probability with
$$\begin{aligned} P_{new}(Y = k \mid X = x) &= P\left(\{Y = k\} \cap \{Z_k > z_k\} \mid X = x\right) \\ &= P(Y = k \mid X = x) \cdot P(Z_k > z_k \mid Y = k, X = x) \\ &\approx P(Y = k \mid X = x) \cdot \left(1 + \gamma_k \frac{z_k}{\sigma_k}\right)^{-1/\gamma_k}. \end{aligned}$$
The probability that an observation is from none of the classes is then
$$P(Y \text{ is novel}) = 1 - \sum_{k=1}^{K} P_{new}(Y = k \mid X = x).$$
Classify as the class with maximum probability.
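The per-class recipe above can be sketched end to end. The toy two-class features, the 90% threshold and the Euclidean (rather than Mahalanobis) distance are illustrative assumptions:

```python
import numpy as np
from scipy.stats import genpareto

rng = np.random.default_rng(5)
# Toy features for two well-separated known classes.
train = {0: rng.normal(0, 1, size=(2000, 2)), 1: rng.normal(8, 1, size=(2000, 2))}

models = {}
for k, Xk in train.items():
    mu = Xk.mean(axis=0)                       # class centre mu_k
    d = np.linalg.norm(Xk - mu, axis=1)        # distances to the centre
    t = np.quantile(d, 0.9)                    # high threshold t_k
    gamma, _, sigma = genpareto.fit(d[d > t] - t, floc=0.0)
    models[k] = (mu, t, gamma, sigma)

def p_not_novel(x, k):
    # GP survival probability of the exceedance z = d - t_k (1 below the threshold).
    mu, t, gamma, sigma = models[k]
    z = np.linalg.norm(np.asarray(x) - mu) - t
    return 1.0 if z <= 0 else float(genpareto.sf(z, gamma, loc=0.0, scale=sigma))

print(p_not_novel([0.2, -0.1], 0))   # typical class-0 point -> near 1
print(p_not_novel([4.0, 4.0], 0))    # far from both classes -> near 0
```

Multiplying this survival probability into the classifier's class probability gives the rescaled $P_{new}$ of the update formula.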

SLIDE 29

Handwritten digits example

Approach:

  • Images of handwritten digits downloaded from Kaggle.
  • Use 0 to 7 as known classes in the training data.
  • Use 0 to 9 in the testing data, i.e. 8 and 9 are new classes.
  • Fit a CNN on the training data and find the correctly classified training data.
  • Extract the activations in the final hidden layer for each class's correctly classified training data.
  • Use these features to estimate the probability that an observation is from a new class.
SLIDE 30

Handwritten digits example

Training data:

    Class        | 0    | 1    | 2    | 3    | 4    | 5    | 6    | 7
    Observations | 3285 | 3728 | 3382 | 3496 | 3243 | 3054 | 3312 | 3501

SLIDE 31

Handwritten digits example

CNN model:

  • Two convolutional layers and four fully connected layers.
  • ReLU activations on all hidden layers.
  • Softmax activation on the output layer.
  • Extract the correctly classified training data at the final fully connected layer.
  • Split the data by class, i.e. one dataset of correctly classified training data for each class. Each dataset contains the output of the final hidden layer.

SLIDE 32

Handwritten digits example

Training results: misclassification error 0.156%. The confusion matrix is diagonally dominant: the per-class correct counts are 3284, 3711, 3379, 3493, 3240, 3046, 3310 and 3496 for classes 0 to 7, with 42 of the 27 001 training observations misclassified.

SLIDE 33

Handwritten digits example

Training results:

SLIDE 34

Handwritten digits example

Training results:

SLIDE 35

Handwritten digits example

Estimate the GP distribution for rescaling. For each class:

  • Use the correctly classified training data.
  • Find the Mahalanobis distance for each observation.
  • Select a high threshold.
  • Estimate the GP distribution of the exceedances above the threshold.

SLIDE 36

Handwritten digits example

Estimate GP distribution for rescaling:

SLIDE 37

Handwritten digits example

Estimate GP distribution for rescaling:

SLIDE 38

Handwritten digits example

Rescale the class probabilities of the test set:

  • Extract the activations at the final hidden layer of the test data.
  • Find the model predictions.
  • For the predicted class, use that class's GP distribution to rescale the probability.
  • Classify to the class with the maximum class probability.

SLIDE 39

Handwritten digits example

Results on the testing set: test error 5.91% (20.08% without rescaling).

  • Class 0: 834 correct, 2 flagged unknown
  • Class 1: 918 correct, 6 unknown
  • Class 2: 760 correct, 8 unknown
  • Class 3: 794 correct, 23 unknown
  • Class 4: 706 correct, 19 unknown
  • Class 5: 682 correct, 1 misclassified, 14 unknown
  • Class 6: 791 correct, 8 unknown
  • Class 7: 869 correct, 3 misclassified, 22 unknown
  • True unknowns (8s and 9s): 1548 correctly flagged; 390 misclassified as digits 0–7 (13, 37, 33, 61, 123, 59, 33 and 31 respectively)

SLIDE 40

Handwritten digits example

Perhaps a better model:

SLIDE 41

Handwritten digits example

Perhaps a better model:

  • Each activation in the final layer is large when an observation is from the corresponding class.

For each class:

  • Find the output at the corresponding node for the correctly classified training data.
  • Fit a GP distribution to the peaks below a small threshold.
  • Use this probability to rescale.
SLIDE 42

Handwritten digits example

Perhaps a better model: test error 5.69%.

  • Class 0: 842 correct, 8 flagged unknown
  • Class 1: 930 correct, 8 unknown
  • Class 2: 781 correct, 2 misclassified, 10 unknown
  • Class 3: 768 correct, 33 unknown
  • Class 4: 756 correct, 11 unknown
  • Class 5: 700 correct, 1 misclassified, 28 unknown
  • Class 6: 811 correct, 14 unknown
  • Class 7: 846 correct, 2 misclassified, 52 unknown
  • True unknowns (8s and 9s): 1486 correctly flagged; 309 misclassified as digits 0–7 (5, 25, 13, 87, 71, 41, 13 and 54 respectively)

SLIDE 43

An idea for random forests

Question: can a regression tree be used for density estimation? If so, can we use this model to detect anomalous observations?

Main problem: we need a valid splitting criterion to determine the optimal split recursively. Criminisi, Shotton & Konukoglu (2011) proposed the information gain with the continuous entropy of a multivariate Gaussian.
SLIDE 44

An idea for random forests

Consider splitting the root node into two decision nodes. Let the data in the root node be denoted by the set $S$ and let the left and right decision nodes be denoted by $S_L$ and $S_R$, respectively. The information gain of this split (for the multivariate Gaussian) is then
$$I = \ln\left|\Sigma(S)\right| - \frac{|S_L|}{|S|}\ln\left|\Sigma(S_L)\right| - \frac{|S_R|}{|S|}\ln\left|\Sigma(S_R)\right|.$$
This splitting criterion is used with recursive binary partitioning to grow a density tree.
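The gain can be computed directly from sample covariances. A sketch; the two-blob data and the axis-aligned candidate splits are illustrative:

```python
import numpy as np

def ln_det_cov(S):
    # ln|Sigma| for the sample covariance of the points in S.
    sign, logdet = np.linalg.slogdet(np.atleast_2d(np.cov(S, rowvar=False)))
    return logdet

def information_gain(S, left_mask):
    # I = ln|Sigma(S)| - sum over children c of (|S_c|/|S|) ln|Sigma(S_c)|.
    L, R = S[left_mask], S[~left_mask]
    return (ln_det_cov(S)
            - len(L) / len(S) * ln_det_cov(L)
            - len(R) / len(S) * ln_det_cov(R))

rng = np.random.default_rng(6)
# Two Gaussian blobs separated along the first axis only.
S = rng.normal(0.0, 1.0, size=(400, 2))
S[:200, 0] -= 4.0
S[200:, 0] += 4.0

good = information_gain(S, S[:, 0] <= 0)   # split between the blobs
bad = information_gain(S, S[:, 1] <= 0)    # split on the other axis
print(good, bad)
```

The split that separates the two modes gives a much larger gain, which is exactly what the recursive partitioning exploits.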

SLIDE 45

An idea for random forests

The density estimate is obtained from the Gaussian distribution in each terminal node. Let the leaf reached by an input $x$ be denoted by $l(x)$. Then, the probability density at the input $x$ is given by
$$f(x) = \frac{\pi_{l(x)}}{K}\,\phi\left(x;\, \mu_{l(x)}, \Sigma_{l(x)}\right),$$
where $K$ is a normalising constant, $\pi_{l(x)}$ is the proportion of observations in that node and $\phi(\cdot)$ is the multivariate Gaussian density.

SLIDE 46

An idea for random forests

The normalising constant is given as
$$K = \sum_{l} \pi_l \int_{l} \phi\left(x;\, \mu_l, \Sigma_l\right) dx.$$
For each leaf, the density estimates are used to estimate the GP distribution associated with the peaks below a small threshold. This distribution is then used to detect whether a new observation is anomalous.

SLIDE 47

An idea for random forests: An example

SLIDE 48

An idea for random forests: An example

The example tree splits on $X_2 \le -3.2$ and $X_1 \le 2.8$, giving three leaf densities $f_1(x)$, $f_2(x)$ and $f_3(x)$.

SLIDE 49

An idea for random forests: An example

Here the normalising constant is
$$K = \sum_{l} \pi_l \int_{l} \phi\left(x;\, \mu_l, \Sigma_l\right) dx \approx 0.988.$$
Consequently, the density in each region is
$$f(x) = \frac{\pi_{l(x)}}{0.988}\,\phi\left(x;\, \mu_{l(x)}, \Sigma_{l(x)}\right).$$
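The normalisation step can be checked numerically in one dimension. A sketch, assuming a single split at $x = 0$ and Gaussian leaves; all values here are illustrative, not the 0.988 from the slides:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(7)
x = np.concatenate([rng.normal(-2, 0.5, 400), rng.normal(2, 0.5, 600)])

# One split at 0 -> two leaves, each with weight pi_l and a fitted Gaussian.
leaves = []
for lo, hi in [(-np.inf, 0.0), (0.0, np.inf)]:
    pts = x[(x > lo) & (x <= hi)]
    leaves.append((len(pts) / len(x), norm(pts.mean(), pts.std()), lo, hi))

# K = sum_l pi_l * (mass of phi_l inside leaf l).
K = sum(pi * (g.cdf(hi) - g.cdf(lo)) for pi, g, lo, hi in leaves)

def density(v):
    # Piecewise density: pi_l * phi_l(v) / K inside leaf l, 0 elsewhere.
    for pi, g, lo, hi in leaves:
        if lo < v <= hi:
            return pi * g.pdf(v) / K
    return 0.0

# The normalised density should integrate to ~1.
grid = np.linspace(-6.0, 6.0, 4001)
total = sum(density(v) for v in grid) * (grid[1] - grid[0])
print(K, total)
```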

SLIDE 50

Conclusion

  • New idea for novelty detection.
  • Unsupervised: estimate a density or similarity measure, then use EVT to detect anomalies.
  • Supervised: estimate probabilities/scores for each well-sampled class, then use EVT to rescale the probabilities/scores to detect new classes.
  • A new connection between theoretical statistics and computer science.
  • Thanks for listening!