Combining extreme value theory and machine learning for novelty detection
Luca Steyn


  1. Combining extreme value theory and machine learning for novelty detection (Luca Steyn)

  2. INTRODUCTION
     • Two topics: extreme value theory and novelty detection.
     • A new idea for multivariate extreme value theory and multivariate anomaly detection.
     • Brings together research from Statistics and Computer Science.

  3. What is novelty detection?
     • Novelty detection is the process of identifying when new observations differ from what is expected as normal behaviour.
     • It is a classification problem, i.e. normal or anomalous (positive or negative).
     • Conventional classification algorithms fail to detect novel observations.
     • Instead, use a one-class classification approach ⇒ threshold a distribution representing the normal state of the system. (Is this a bad thing?)
     • Assumption: novel observations are scarce and differ to some extent from the observations in the normal class.

  4. Methods to perform novelty detection
     Many algorithms for novelty detection have been proposed. Broad approaches are:
     • A distance-based approach - modified KNN algorithms
     • A domain-based approach - one-class support vector machines
     • A reconstruction-based approach - neural networks or PCA
     • A probabilistic approach - density estimation and thresholding

  5. A probabilistic approach
     • Let X ∈ ℝ^p and denote the probability density function (pdf) by f(x), so that f(x) = dF(x)/dx.
     • Choose a threshold t such that F(t) = ∫_S f(x) dx is large, i.e. F(t) ≥ 0.9, where S = {x : f(x) ≥ t}.
     • Then, a new observation x* is novel if f(x*) < t.
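
A minimal sketch of this probabilistic approach (not part of the slides; the Gaussian kernel density estimate, the variable names and the 90% coverage choice are my own illustrative assumptions): estimate f from the normal-class data, take t as a low quantile of the fitted density values, and flag a new point as novel when its density falls below t.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
X_normal = rng.normal(size=(1000, 2))      # training data from the "normal" state

kde = gaussian_kde(X_normal.T)             # density estimate f-hat (expects shape (d, n))
dens_train = kde(X_normal.T)               # f-hat evaluated at the training points

# Choose t so that roughly 90% of the normal data lies in the region {x : f(x) >= t}
t = np.quantile(dens_train, 0.10)

def is_novel(x_new):
    """Flag a new observation as novel when its estimated density falls below t."""
    return kde(np.atleast_2d(x_new).T) < t

print(is_novel([0.1, -0.2]))   # likely False: near the bulk of the data
print(is_novel([6.0, 6.0]))    # likely True: far out in the tail
```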

  6. A probabilistic approach

  7. A probabilistic approach
     • If a new observation is below the threshold, how much certainty do we have that this observation is anomalous?
     • Extreme value theory estimates a probability that an observation is anomalous.

  8. Extreme value theory: Fisher-Tippett theorem
     • Let {X_1, X_2, X_3, ...} be a sequence of independent and identically distributed (iid) random variables and let M_n = max{X_i : i = 1, ..., n}.
     • If sequences of constants {a_n > 0} and {b_n} exist such that P((M_n − b_n)/a_n ≤ x) → G(x) as n → ∞, then G(x) is necessarily the Generalized Extreme Value (GEV) distribution.
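
A small simulation (my own illustration, not from the slides) shows the theorem in action: block maxima of iid samples are well described by a GEV distribution. Scipy parametrises the GEV as genextreme, whose shape parameter c corresponds to −γ in the notation used later; the block size and sample distribution below are arbitrary choices.

```python
import numpy as np
from scipy.stats import genextreme

rng = np.random.default_rng(1)

# 2000 blocks of n = 365 iid exponential variables; keep each block maximum M_n
block_maxima = rng.exponential(size=(2000, 365)).max(axis=1)

# Fit the GEV distribution to the block maxima by maximum likelihood
c, loc, scale = genextreme.fit(block_maxima)
print(c, loc, scale)   # c near 0: the exponential lies in the Gumbel domain of attraction

# Estimated probability that a future block maximum exceeds a high level
print(genextreme.sf(10.0, c, loc=loc, scale=scale))
```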

  9. Extreme value theory: Fisher-Tippett theorem
     • The GEV distribution is given by
         G_γ(x) = exp{ −(1 + γx)^(−1/γ) }   for γ ≠ 0, 1 + γx > 0,
         G_γ(x) = exp{ −exp(−x) }           for γ = 0, x ∈ ℝ.
     • Move from a non-parametric to a parametric setting (in the limit).
     • Three types of GEV distributions: Fréchet-Pareto, Gumbel, (extremal) Weibull.
     • Note: min(X) = −max(−X).

  10. Extreme value theory: Pickands-Balkema-de Haan theorem
     • The distribution F is in the domain of attraction of the GEV distribution if and only if, for some auxiliary function b(·) and for all x with 1 + γx > 0,
         (1 − F(y + b(y)x)) / (1 − F(y)) → (1 + γx)^(−1/γ)   as y → ∞.
     • Furthermore,
         b(y + b(y)x) / b(y) → u = 1 + γx.

  11. Extreme value theory: Pickands-Balkema-de Haan theorem
     • Essentially, this theorem states that there exists a high enough threshold t such that the exceedances Z = X − t are approximately generalised Pareto (GP) distributed.
     • Hence, for a large threshold t,
         P(Z > z | X > t) ≈ (1 + γ z / b(t))^(−1/γ).
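
A peaks-over-threshold sketch of this result (my own illustration; the data, the 99% threshold and the variable names are arbitrary assumptions): pick a high threshold t, fit a generalised Pareto distribution to the exceedances Z = X − t, and use its survival function as P(Z > z | X > t). Scipy's genpareto shape parameter c plays the role of γ.

```python
import numpy as np
from scipy.stats import genpareto

rng = np.random.default_rng(2)
x = rng.standard_t(df=4, size=100_000)    # heavy-tailed sample

t = np.quantile(x, 0.99)                  # a "large" threshold
z = x[x > t] - t                          # exceedances Z = X - t

# Fit the GP distribution to the exceedances (location fixed at 0)
c, loc, scale = genpareto.fit(z, floc=0)

# Approximate P(Z > z | X > t) from the fitted tail model
print(genpareto.sf(1.5, c, loc=0, scale=scale))
```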

  12. Example: Uniform distribution

  13. Other problems with EVT
     • The problem is multivariate.
     • The distribution under normal conditions is multimodal.
     Hence, one needs a method that transforms the data to overcome these issues.

  14. An approach based on minimum probability density
     • Redefine extreme value theory in terms of the minimum probability density.
     • Let E_n = argmin_{X_i, i = 1, ..., n} f(X_i), so that f(E_n) = min_i f(X_i) ≡ min_i Y_i with Y_i = f(X_i).
     • Assume X ~ N(µ, Σ).
     • It can be shown that
         P(f(E_n) ≤ y) ≈ 1 − exp{ −a_n^(−1) y },
       a Weibull-type GEV.
     • Furthermore, a_n can be chosen in terms of G_d, the known distribution of Y = f(X).
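
A rough illustration of the idea (my own sketch, not the slides' derivation; it approximates the distribution of the minimum density value by Monte Carlo rather than by the Weibull-type GEV limit): fit a Gaussian to the normal data, simulate the minimum density of n draws many times, and compare f(x*) against that simulated distribution.

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(3)
X = rng.normal(size=(500, 2))                      # observed "normal" data, n = 500
n, d = X.shape

mu, Sigma = X.mean(axis=0), np.cov(X, rowvar=False)
f = multivariate_normal(mean=mu, cov=Sigma)        # fitted Gaussian density

# Monte Carlo approximation of the distribution of f(E_n) = min_i f(X_i)
min_density = np.array([
    f.pdf(f.rvs(size=n, random_state=rng)).min() for _ in range(2000)
])

def prob_novel(x_new):
    """P(x* is novel) ~= P(f(E_n) > f(x*)): how often the minimum density of
    n normal points still exceeds the density at the new observation."""
    y_star = f.pdf(np.asarray(x_new))
    return np.mean(min_density > y_star)

print(prob_novel([0.0, 0.0]))   # near 0: clearly a normal point
print(prob_novel([4.5, 4.5]))   # near 1: likely novel
```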

  15. An approach based on minimum probability density
     • Hence, the probability that a new observation x* is novel is given by the probability that the density estimate at this observation, y* = f(x*), is less than the minimum probability density, i.e.
         P(x* is novel) = P(f(E_n) > y*) ≈ exp{ −a_n^(−1) y* }.

  16. An approach based on minimum probability density

  17. An approach based on minimum probability density
     • Problem: the Gaussian assumption is too strict.

  18. An approach based on minimum probability density
     • The Gaussian assumption leads to analytical expressions for the parameter estimates.
     • The minimum of a Gaussian mixture model (GMM) density is bounded below at zero.
     • Hence, the density of a GMM is in the domain of attraction of the Weibull-type GEV.
     • However, the parameters must be estimated via maximum likelihood.
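
A minimal sketch of the GMM step (my own illustration using scikit-learn; the component count and data below are placeholders): fit the mixture by maximum likelihood via EM and evaluate its density, which is what the minimum-density argument is then applied to.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(4)
# A bimodal "normal" class: two well-separated clusters
X = np.vstack([rng.normal(0, 1, size=(300, 2)), rng.normal(6, 1, size=(300, 2))])

# Maximum-likelihood (EM) fit of a Gaussian mixture model
gmm = GaussianMixture(n_components=2, covariance_type="full", random_state=0).fit(X)

# Density values of the training data under the fitted mixture
train_density = np.exp(gmm.score_samples(X))
print(train_density.min(), train_density.max())
```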

  19. An approach based on minimum probability density
     Weibull density of the GMM minimum density.

  20. An approach based on minimum probability density
     Weibull density of the GMM minimum density.

  21. An approach based on minimum probability density
     Weibull density of the GMM minimum density.

  22. Banknote authentication example
     • Dataset: wavelet transform of banknote images; the variables are the variance, skewness, kurtosis and entropy of the wavelet-transformed image.
     • There are 600 real banknotes in the training data.
     • There are 162 real and 610 forged banknotes in the test set.

  23. Banknote authentication example
     • Select the number of components in the GMM with the BIC criterion; the optimum was 5 Gaussian components.
     • Estimate the distribution of the minimum density of real banknotes using the Weibull-type GEV of the minimum density.
     • Use this distribution to determine the probability that each banknote in the test set is forged.
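
The BIC-based model selection can be sketched as follows (my own illustration; `X_train` stands for the 600 real banknotes and the search range is a placeholder, not stated on the slides):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def select_gmm_by_bic(X_train, max_components=10, random_state=0):
    """Fit GMMs with 1..max_components components and keep the one
    with the lowest BIC, as done for the banknote training data."""
    best_model, best_bic = None, np.inf
    for k in range(1, max_components + 1):
        gmm = GaussianMixture(n_components=k, random_state=random_state).fit(X_train)
        bic = gmm.bic(X_train)
        if bic < best_bic:
            best_model, best_bic = gmm, bic
    return best_model

# gmm = select_gmm_by_bic(X_train)   # on the banknote features this selected 5 components
```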

  24. Banknote authentication example
     • Results (test-set confusion matrix):

       Predicted \ Response | Real | Forged
       Real                 |  162 |      1
       Forged               |    0 |    609

     • Clearly, the model does very well in detecting fake banknotes.
     • However, this is very easy data.

  25. Supervised novelty detection and open-set recognition
     • Open-set recognition: perform classification under the assumption that not all classes are known at training time.
     • Use extreme value theory to detect new classes.
     • Similar concepts are used for supervised novelty detection.

  26. A new approach based on the GP distribution
     • Problem: the test set possibly contains classes not seen at training.
     • Use a supervised model to classify the known classes.
     • Use extreme value theory to adjust the predicted probabilities to account for other classes.
     • Estimate the probability that an observation is from a new class not seen at training.

  27. A new approach based on the GP distribution
     Consider a model that produces P(Y = k | x), k = 1, 2, ..., K. For each class k:
     1. Find the correctly classified training data x_jk, i.e. those with ŷ_j = k, j = 1, ..., n_k.
     2. Let µ_k = mean(x_jk) and compute the distances d_jk = ‖x_jk − µ_k‖.
     3. Fit a GP distribution to the exceedances Z_jk = d_jk − t_k above a threshold t_k.
     The probability that an observation x is not novel with respect to class k is P(Z_k > z | D_k > t_k), where Z_k = D_k − t_k and D_k = ‖x − µ_k‖.
     Notice that a per-class estimation strategy is followed.
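
A sketch of this per-class procedure (my own reading of the steps above; the Euclidean distance, the 90% threshold quantile and the feature matrix are illustrative assumptions, not choices stated on the slides):

```python
import numpy as np
from scipy.stats import genpareto

def fit_class_tail_models(features, y_true, y_pred, threshold_q=0.90):
    """For each class k: keep the correctly classified points, compute distances to
    the class mean, and fit a GP distribution to exceedances above a threshold."""
    models = {}
    for k in np.unique(y_true):
        x_k = features[(y_true == k) & (y_pred == k)]      # correctly classified class-k data
        mu_k = x_k.mean(axis=0)
        d_k = np.linalg.norm(x_k - mu_k, axis=1)            # distances d_jk
        t_k = np.quantile(d_k, threshold_q)                 # threshold t_k
        z_k = d_k[d_k > t_k] - t_k                          # exceedances Z_jk = d_jk - t_k
        gamma_k, _, sigma_k = genpareto.fit(z_k, floc=0)    # GP shape and scale
        models[k] = {"mu": mu_k, "t": t_k, "gamma": gamma_k, "sigma": sigma_k}
    return models
```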

  28. A new approach based on the GP distribution
     Update probabilities: we update each class probability with
         P^new(Y = k | X = x*) = P({Y = k} ∩ {Z_k > z_k} | X = x*)
                               = P(Y = k | X = x*) · P(Z_k > z_k | Y = k, X = x*)
                               ≈ P(Y = k | X = x*) · (1 + γ_k z_k / σ_k)^(−1/γ_k).
     The probability that an observation is from none of the classes is then
         P(Y = novel) = 1 − Σ_k P^new(Y = k | X = x*).
     Classify as the class with maximum probability.
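
Continuing the sketch above (same illustrative assumptions; `clf_probs` stands for the classifier's P(Y = k | x*) and `models` comes from the previous function):

```python
import numpy as np
from scipy.stats import genpareto

def adjusted_probabilities(x_new, clf_probs, models):
    """Shrink each class probability by the GP tail probability of the observed
    distance; the leftover mass is the probability of a new, unseen class."""
    p_new = {}
    for k, m in models.items():
        d_k = np.linalg.norm(x_new - m["mu"])
        z_k = max(d_k - m["t"], 0.0)                        # exceedance above t_k
        tail = genpareto.sf(z_k, m["gamma"], loc=0, scale=m["sigma"])
        p_new[k] = clf_probs[k] * tail                      # P^new(Y = k | X = x*)
    p_novel = 1.0 - sum(p_new.values())                     # probability of an unseen class
    return p_new, p_novel
```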

  29. Handwritten digits example
     Approach:
     • Images of handwritten digits downloaded from Kaggle.
     • Use digits 0 to 7 as the known classes in the training data.
     • Use digits 0 to 9 in the test data, i.e. 8 and 9 are new classes.
     • Fit a CNN on the training data and find the correctly classified training data.
     • Extract the activations in the final hidden layer for each class's correctly classified training data.
     • Use these features to estimate the probability that an observation is from a new class.
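
The feature-extraction step could look like this (my own illustration with Keras; `cnn`, `x_train` and `y_train` are assumed to exist, and the penultimate layer is taken to be the final hidden layer mentioned on the slide):

```python
import numpy as np
from tensorflow import keras

def extract_hidden_features(cnn, x, y_true):
    """Return final-hidden-layer activations for the correctly classified images."""
    y_pred = cnn.predict(x).argmax(axis=1)
    correct = y_pred == y_true
    feature_model = keras.Model(inputs=cnn.inputs, outputs=cnn.layers[-2].output)
    return feature_model.predict(x[correct]), y_true[correct]

# features, labels = extract_hidden_features(cnn, x_train, y_train)
# These features then feed the per-class GP models from the previous slides.
```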

  30. Handwritten digits example
     Training data:

       Class        |    0 |    1 |    2 |    3 |    4 |    5 |    6 |    7
       Observations | 3285 | 3728 | 3382 | 3496 | 3243 | 3054 | 3312 | 3501
