Scalable Anomaly Detection with Spark and SOS, Strata NYC





slide-1
SLIDE 1

Scalable Anomaly Detection with Spark and SOS

Strata NYC September 26, 2019

slide-2
SLIDE 2

Hi there, my name is Jeroen Janssens

slide-3
SLIDE 3

Today

  • SOS, World!
  • Anomalies and outliers
  • Evaluating outlier-selection algorithms
  • Various approaches to outlier selection
  • Stochastic Outlier Selection
  • Conclusion
slide-4
SLIDE 4

SOS, World!

01-sos-world.ipynb

slide-5
SLIDE 5

Implementations of SOS

  • Python: http://bit.ly/sos-python
  • Spark: http://bit.ly/sos-spark
  • R: http://bit.ly/sos-r
  • Flink: http://bit.ly/sos-flink
slide-6
SLIDE 6

Anomalies and outliers

slide-7
SLIDE 7

An anomaly is an observation or event that deviates qualitatively from what is considered to be normal, according to a domain expert.

slide-8
SLIDE 8

Detecting anomalies is important

  • Expensive
  • Dangerous
  • Mess up your model
slide-9
SLIDE 9

Human anomaly detection may suffer from

  • Fatigue
  • Information overload
  • Emotional bias
slide-10
SLIDE 10

Feature-vector representation

slide-11
SLIDE 11

Dissimilarity-matrix representation

slide-12
SLIDE 12

From anomaly to outlier

slide-13
SLIDE 13

An outlier is a data point that deviates quantitatively from the majority of the data points, according to an outlier-selection algorithm.
slide-14
SLIDE 14

The symbiotic relationship between the domain expert and the algorithm

slide-15
SLIDE 15

Data flow diagram

Data flow diagram illustrating the relationship between the domain expert (square) and the outlier-selection algorithm (top circle).
slide-16
SLIDE 16

Six Euler diagrams (1/2)

slide-17
SLIDE 17

Six Euler diagrams (2/2)

slide-18
SLIDE 18

Evaluating outlier-selection algorithms

slide-19
SLIDE 19

Confusion matrix

Computer says no.

slide-20
SLIDE 20

Four possible outcomes
slide-21
SLIDE 21

Evaluation

Illustration of relabelling a multi-class data set into multiple one-class data sets.
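
One way to read this relabelling, sketched in Python: each class in turn is treated as the normal class and every other class as anomalous. That reading, the helper name one_class_datasets, and the toy labels are assumptions for illustration; the slide itself only shows the picture.

```python
def one_class_datasets(labels):
    """Map each class to a list of flags: True where a point is NOT of that class."""
    datasets = {}
    for normal_class in sorted(set(labels)):
        # Points of the chosen class count as normal, all others as anomalous.
        datasets[normal_class] = [label != normal_class for label in labels]
    return datasets

# Toy labels (made up): a three-class data set yields three one-class data sets.
labels = ["a", "a", "b", "c"]
datasets = one_class_datasets(labels)
```

A data set with C classes gives C evaluation tasks this way, so one labelled data set can be reused several times.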

slide-22
SLIDE 22

Anomalies are rare

In order to evaluate the algorithm we simulate anomalies to be rare. Banana for scale.
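
Simulating rarity could look roughly like the sketch below: keep every normal point and only a random subset of the anomalous ones. The function name subsample_anomalies, the max_fraction parameter, and the 25% figure are hypothetical choices, not values from the talk.

```python
import random

def subsample_anomalies(is_anomaly, max_fraction, seed=0):
    """Return sorted indices to keep so anomalies form at most max_fraction of them.

    Assumes 0 <= max_fraction < 1.
    """
    rng = random.Random(seed)
    normal = [i for i, flag in enumerate(is_anomaly) if not flag]
    anomalous = [i for i, flag in enumerate(is_anomaly) if flag]
    # k anomalies among n normal points: k / (n + k) <= f  =>  k <= f * n / (1 - f)
    limit = int(max_fraction * len(normal) / (1 - max_fraction))
    kept_anomalies = rng.sample(anomalous, min(len(anomalous), limit))
    return sorted(normal + kept_anomalies)

# Made-up example: 30 normal points and 60 anomalous ones.
is_anomaly = [False] * 30 + [True] * 60
kept = subsample_anomalies(is_anomaly, max_fraction=0.25)
```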

slide-23
SLIDE 23

Outlier scores

The dashed line indicates the threshold chosen by the domain expert.
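
Once a threshold is fixed, every data point falls into one of the four outcomes of the confusion matrix. A minimal sketch, assuming points with a score at or above the threshold are flagged; the names "miss" and "correct rejection" and the example scores are assumptions added here ("hit" and "false alarm" follow the deck's own wording).

```python
def four_outcomes(scores, is_anomaly, threshold):
    """Count the four outcomes when scores at or above `threshold` are flagged."""
    counts = {"hit": 0, "false alarm": 0, "miss": 0, "correct rejection": 0}
    for score, anomalous in zip(scores, is_anomaly):
        flagged = score >= threshold
        if flagged and anomalous:
            counts["hit"] += 1
        elif flagged:
            counts["false alarm"] += 1
        elif anomalous:
            counts["miss"] += 1
        else:
            counts["correct rejection"] += 1
    return counts

# Made-up outlier scores and expert labels, for illustration only.
scores = [0.9, 0.8, 0.3, 0.2, 0.1]
is_anomaly = [True, False, True, False, False]
counts = four_outcomes(scores, is_anomaly, threshold=0.5)
```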

slide-24
SLIDE 24

ROC curve

An ROC curve plots the hit rate against the false alarm rate for all possible thresholds.
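
The curve can be traced by sweeping the threshold over the observed scores. The sketch below is one plain-Python way to do that, with the area under the curve computed by the trapezoidal rule; the function names and the toy data are assumptions, and it expects at least one anomalous and one normal point.

```python
def roc_points(scores, is_anomaly):
    """Return (false alarm rate, hit rate) pairs, one per distinct threshold."""
    n_anomalous = sum(is_anomaly)
    n_normal = len(is_anomaly) - n_anomalous
    points = [(0.0, 0.0)]  # threshold above every score: nothing is flagged
    for threshold in sorted(set(scores), reverse=True):
        flagged = [score >= threshold for score in scores]
        hits = sum(1 for f, a in zip(flagged, is_anomaly) if f and a)
        false_alarms = sum(1 for f, a in zip(flagged, is_anomaly) if f and not a)
        points.append((false_alarms / n_normal, hits / n_anomalous))
    return points

def area_under_curve(points):
    """Trapezoidal area under a curve given as (x, y) points sorted by x."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Made-up scores and labels.
scores = [0.9, 0.8, 0.3, 0.2, 0.1]
is_anomaly = [True, False, True, False, False]
points = roc_points(scores, is_anomaly)
auc = area_under_curve(points)
```

An area of 1 means every anomaly scores above every normal point; 0.5 is what random scores would give on average.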

slide-25
SLIDE 25

Various approaches to outlier selection

slide-26
SLIDE 26

Distribution-based outlier selection
slide-27
SLIDE 27

Distance-based outlier selection

Size does matter

slide-28
SLIDE 28

Density-based outlier selection
slide-29
SLIDE 29

Stochastic Outlier Selection

slide-30
SLIDE 30

Stochastic Outlier Selection

  • Unsupervised outlier selection algorithm
  • Employs concept of affinity (inspired by t-SNE)
  • One parameter: perplexity
  • Computes outlier probabilities
slide-31
SLIDE 31

t-Distributed Stochastic Neighbor Embedding (t-SNE; Van der Maaten and Hinton) employs affinity to perform dimensionality reduction.

slide-32
SLIDE 32

A data point is selected as an outlier when all the other data points have insufficient affinity with it.

slide-33
SLIDE 33

From input to output

slide-34
SLIDE 34

From feature matrix to dissimilarity matrix
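
As a rough sketch of this step, the dissimilarity matrix can be filled with pairwise distances between feature vectors. Euclidean distance is assumed here as the dissimilarity measure; the slide does not name one, and any other measure could be swapped in.

```python
import math

def dissimilarity_matrix(features):
    """Pairwise Euclidean distances between the rows of a feature matrix."""
    return [[math.dist(a, b) for b in features] for a in features]

# Three made-up two-dimensional feature vectors.
features = [[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]]
D = dissimilarity_matrix(features)
```

The result is symmetric with zeros on the diagonal, and its size grows quadratically with the number of data points, which is where a distributed implementation becomes useful.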

slide-35
SLIDE 35

From input to output

slide-36
SLIDE 36

Smooth neighborhoods

slide-37
SLIDE 37

Affinity between data points
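
A minimal sketch of the affinity computation, assuming the Gaussian form a_ij = exp(-d_ij^2 / (2 * variance_i)) with a point having no affinity with itself. The formula is not spelled out on the slide, so treat it and the function name as assumptions.

```python
import math

def affinity_matrix(dissimilarity, variances):
    """Affinity that point i has with point j, using one variance per point i."""
    n = len(dissimilarity)
    return [
        [0.0 if i == j
         else math.exp(-dissimilarity[i][j] ** 2 / (2 * variances[i]))
         for j in range(n)]
        for i in range(n)
    ]

# Two made-up points at distance 1, each with variance 0.5.
D = [[0.0, 1.0], [1.0, 0.0]]
A = affinity_matrix(D, variances=[0.5, 0.5])
```

Because each point has its own variance, the affinity matrix is generally not symmetric even though the dissimilarity matrix is.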

slide-38
SLIDE 38

From input to output

slide-39
SLIDE 39

From affinity to binding probability

The binding matrix B is obtained by normalising each row in the affinity matrix A.
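
The row normalisation can be sketched in a few lines of Python. The affinity values are made up, and the sketch assumes every row has at least one positive affinity.

```python
def binding_matrix(affinity):
    """Divide each row of the affinity matrix by its sum so it sums to one."""
    binding = []
    for row in affinity:
        total = sum(row)
        binding.append([a / total for a in row])
    return binding

# Made-up affinity matrix with a zero diagonal.
A = [[0.0, 1.0, 3.0],
     [2.0, 0.0, 2.0],
     [1.0, 1.0, 0.0]]
B = binding_matrix(A)
```

Each row of B is then a probability distribution over the other data points.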

slide-40
SLIDE 40

Binding probabilities form a graph

slide-41
SLIDE 41

Binding probabilities form a graph

slide-42
SLIDE 42

Stochastic Neighbor Graph

A data point belongs to the outlier class when it is not selected by any other data point.
slide-43
SLIDE 43

Three SNGs

The three SNGs Ga, Gb, and Gc are sampled from the discrete probability distribution P(G).

slide-44
SLIDE 44

Set of all SNGs

slide-45
SLIDE 45

Approximating outlier probabilities by sampling SNGs
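
The sampling idea could be sketched as follows: in each sampled graph, every point binds to one other point according to its row of the binding matrix, and a point counts as an outlier in that graph when nobody bound to it. The assumption that points choose independently, the function name, and the example matrix are illustrative and not taken from the talk's notebooks.

```python
import random

def sample_outlier_probabilities(binding, n_graphs=10000, seed=0):
    """Estimate outlier probabilities as the fraction of sampled graphs
    in which a point is selected by no other point."""
    rng = random.Random(seed)
    n = len(binding)
    outlier_counts = [0] * n
    for _ in range(n_graphs):
        selected = set()
        for i in range(n):
            # Point i binds to exactly one point j, drawn with weight binding[i][j].
            (j,) = rng.choices(range(n), weights=binding[i])
            selected.add(j)
        for i in range(n):
            if i not in selected:
                outlier_counts[i] += 1
    return [count / n_graphs for count in outlier_counts]

# Made-up binding matrix (rows sum to one, zero diagonal).
B = [[0.0, 0.25, 0.75],
     [0.5, 0.0, 0.5],
     [0.5, 0.5, 0.0]]
estimates = sample_outlier_probabilities(B)
```

The estimates get closer to the exact probabilities as n_graphs grows, at the cost of more computation.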

slide-46
SLIDE 46

Demo: Sampling SNGs in CoffeeScript and D3

http://bit.ly/sos-d3

slide-47
SLIDE 47

Computing outlier probabilities through marginalisation

slide-48
SLIDE 48

Computing outlier probabilities in closed form
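
If each point picks its neighbour independently, the probability that point i is picked by nobody is the product over all other points j of (1 - b_ji). The sketch below computes that product directly; the independence assumption and the example matrix are mine, so read it as an illustration rather than the reference implementation.

```python
import math

def outlier_probabilities(binding):
    """Probability that each point is selected by none of the other points."""
    n = len(binding)
    return [
        math.prod(1 - binding[j][i] for j in range(n) if j != i)
        for i in range(n)
    ]

# Made-up binding matrix (rows sum to one, zero diagonal).
B = [[0.0, 0.25, 0.75],
     [0.5, 0.0, 0.5],
     [0.5, 0.5, 0.0]]
probabilities = outlier_probabilities(B)
```

No sampling is needed: one pass over the columns of the binding matrix gives the exact values.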

slide-49
SLIDE 49

Proof!

slide-50
SLIDE 50

Selecting outliers

slide-51
SLIDE 51

Adaptive variances via the perplexity parameter

slide-52
SLIDE 52

Continuous binary search
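
A hedged sketch of how such a search might work for a single data point: adjust beta = 1 / (2 * variance) until the perplexity of that point's binding distribution matches the requested value. Perplexity is taken here as 2 raised to the Shannon entropy in bits, as in t-SNE; the function names, the tolerance, and the doubling strategy for the upper bound are assumptions.

```python
import math

def perplexity(probabilities):
    """Two raised to the Shannon entropy (in bits) of a distribution."""
    entropy = -sum(p * math.log2(p) for p in probabilities if p > 0)
    return 2 ** entropy

def binding_row(distances, beta):
    """Binding probabilities of one point, given its distances to all OTHER points."""
    affinities = [math.exp(-beta * d ** 2) for d in distances]
    total = sum(affinities)
    return [a / total for a in affinities]

def find_beta(distances, target, tolerance=1e-5, max_iterations=200):
    """Binary search for the beta whose binding row has perplexity `target`."""
    low, high, beta = 0.0, None, 1.0
    for _ in range(max_iterations):
        current = perplexity(binding_row(distances, beta))
        if abs(current - target) < tolerance:
            break
        if current > target:
            # Neighbourhood too wide: raise beta (smaller variance).
            low = beta
            beta = beta * 2 if high is None else (beta + high) / 2
        else:
            # Neighbourhood too narrow: lower beta (larger variance).
            high = beta
            beta = (low + beta) / 2
    return beta

# Made-up distances from one point to its four neighbours.
distances = [1.0, 2.0, 3.0, 4.0]
beta = find_beta(distances, target=2.0)
```

Running this per data point gives each point its own variance, so points in dense and sparse regions end up with the same effective number of neighbours.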

slide-53
SLIDE 53

Perplexity influences outlier probabilities

slide-54
SLIDE 54

Evaluation and comparison

slide-55
SLIDE 55

Outlier-score plots

slide-56
SLIDE 56

Real-world datasets

slide-57
SLIDE 57

Synthetic datasets

slide-58
SLIDE 58

Synthetic datasets

slide-59
SLIDE 59

Synthetic datasets

slide-60
SLIDE 60

Synthetic datasets

slide-61
SLIDE 61

SOS performs significantly better

slide-62
SLIDE 62

Spark implementation of SOS

slide-63
SLIDE 63

Spark implementation of SOS

  • Developed by Fokko Driesprong
  • Works with DataFrame API
  • Available on GitHub
  • Plan is to make it part of MLlib
slide-64
SLIDE 64

SOS on PySpark

92-pyspark-sos.ipynb

slide-65
SLIDE 65

Summary

  • Outlier selection can support the detection of anomalies
  • SOS is an intuitive and probabilistic algorithm to select outliers
  • SOS performs very well
  • No free lunch
slide-66
SLIDE 66

Thank you! Here are some links

  • Blog: http://bit.ly/sos-blog
  • D3 Demo: http://bit.ly/sos-d3
  • Python implementation: http://bit.ly/sos-python
  • Spark implementation: http://bit.ly/sos-spark
  • R implementation: http://bit.ly/sos-r
  • Flink implementation: http://bit.ly/sos-flink