SLIDE 1

www.tugraz.at

Anomalies in Data

SCIENCE PASSION TECHNOLOGY

Maximilian Toller KDDM2

Maximilian Toller, Know-Center KDDM2 1

SLIDE 2

Anomalies in Data

Recall from earlier

SLIDE 3

What are Outliers?

A recap from KDDM1

SLIDE 4

What are Outliers?

Definitions

An observation that appears to deviate markedly from other members of the sample in which it occurs. (Grubbs, 1969)
An observation (or subset of observations) which appears to be inconsistent with the remainder of that set of data. (Barnett and Lewis, 1974)
An observation which deviates so much from other observations as to arouse suspicions that it was generated by a different mechanism. (Hawkins, 1980)

SLIDE 5

What are Outliers?

Examples (easy): Inliers; Outliers (Grubbs, Barnett); Outliers (Grubbs, Barnett, Hawkins)

[Scatter plot: X vs Y]

SLIDE 7

What are Outliers?

Examples (more difficult)

[Two scatter plots: x vs y]

SLIDE 9

What are Outliers?

Examples (more difficult)

[Two scatter plots: x vs y]

SLIDE 10

What are Outliers?

Methods: Preview

There are many outlier detection methods:
Local outlier factor
Angle-based outlier degree
Artificial neural networks
...

Why are there so many?

SLIDE 11

What are Anomalies?

SLIDE 12

What are Anomalies?

Difference from Outliers

In the literature, outlier and anomaly are used interchangeably.
For both, only vague and very similar definitions exist.
However, the terms have different origins and different typical use:

Outliers typically...
... are motivated by statistics.
... are unusual data.
... are investigated by traditional researchers and statisticians.

Anomalies typically...
... require context.
... are abnormal events.
... are investigated by data analysts and data scientists.

SLIDE 13

What are Anomalies?

Example: Credit card fraud

Billions of dollars lost every year
Fraudulent transactions often differ significantly from normal ones
Difficult to disguise fraud such that it is not visible on any scale

SLIDE 14

What are Anomalies?

Example: Cancer

One of the most common causes of human death
A disease with abnormal cell growth
Cancer has an abnormal gene expression signature

(Quinn et al., 2019)

SLIDE 15

What are Anomalies?

The role of context

Abnormality is context-dependent.

Discordant data problem (credit card fraud example):
Many normal observations
Rare outlying data

Anomaly class problem (cancer example):
Normal data class
Anomaly classes

Can data define abnormality?

SLIDE 16

Unlikely, Discordant and Contaminated Data

How to interpret suspicious data

SLIDE 17

Unlikely, Discordant and Contaminated Data

The Case of Hadlum vs Hadlum

Mr Hadlum accuses Mrs Hadlum of adultery.
Sole evidence: birth of a child 349 days after Mr Hadlum left the country.
Average human gestation period: 280 days.

(Barnett and Lewis, 1974)

SLIDE 18

Unlikely, Discordant and Contaminated Data

The Case of Hadlum vs Hadlum

Mr Hadlum conjectured a different distribution (red).
The judges did not find Mrs Hadlum guilty, since 349 days is unlikely, but not impossible (blue).
(Modern research showed that more than 340 days is impossible.)

(Zimek and Filzmoser, 2018)

SLIDE 19

Unlikely, Discordant and Contaminated Data

The Antarctic Ozone Hole

The ozone layer protects Earth from solar radiation.
Damaged by human emissions of chlorofluorocarbons.
High depletion (a "hole") above the poles.

https://de.wikipedia.org/wiki/Datei:Ozone_layer.jpg

SLIDE 20

Unlikely, Discordant and Contaminated Data

The (Ant)Arctic Ozone Hole

Farman et al. (1985) discovered the hole in a field study.
The authors were hesitant to publish: Nimbus satellite data showed no drop.
Problem: largely deviating values had been discarded as measurement errors.

NASA/JPL-Caltech

SLIDE 21

Unlikely, Discordant and Contaminated Data

Definition

Unlikely data: the position of the judges ("random drop of ozone not caused by humans"). Data unlikely but still normal. No correction. Action: none.

Discordant data: the position of Mr Hadlum; the ozone field study by Farman et al. (1985). Data too unlikely to be normal. Correction of the model. Action: investigate.

Contamination: "Wrong day of birth?"; the satellite measurement error. Data incorrect or misleading. Correction of the data. Action: remove.

SLIDE 22

Unlikely, Discordant and Contaminated Data

Implications

It is hard to classify data as unlikely, discordant or contaminated:
No universal decision criterion
Domain knowledge as remedy
Ultimately subjective

SLIDE 23

Unlikely, Discordant and Contaminated Data

Strategies

  • 1. Try to ignore anomalies (Not interesting)
  • 2. Find anomalies for investigation or removal (Interesting)

SLIDE 24

Robust Statistics

Data Analysis in the Presence of Anomalies

SLIDE 25

Robust Statistics

Introduction I

Setting:
Potentially contaminated dataset; the majority is uncontaminated
Cannot find or remove the contamination (e.g. inserted by an attacker)
Task: analyze the data in spite of contamination; understand what is normal

SLIDE 26

Robust Statistics

Introduction II

Challenges:
No prior information about the data
Contamination may be arbitrarily "bad" (adversarial)

Question: Which methods are suitable?

SLIDE 27

Robust Statistics

Example: Mean and variance

Two common estimators:
Sample mean: x̄ = (1/n) Σ_{j=1}^n x_j
Sample variance: σ̂²_x = (1/(n−1)) Σ_{j=1}^n (x_j − x̄)²

Mean and variance are influenced by contamination:
Original x = [1, 3, 2, 1, 9, 2, 3, 2, 3, 2, 2, 1]: x̄ ≈ 2.58, σ̂²_x ≈ 4.63
Clean y = [1, 3, 2, 1, 2, 3, 2, 3, 2, 2, 1]: ȳ = 2, σ̂²_y = 0.6
SLIDE 32

Robust Statistics

Example: Mean and variance

What happens when an attacker corrupts the data unfavorably?

Attack #1: a1 = [1, 3, 2, 1, 900, 2, 3, 2, 3, 2, 2, 1]: ā1 ≈ 76.83, σ̂²_a1 ≈ 67200.88
Attack #2: a2 = [1, 3, 2, 1, 900000000, 2, 3, 2, 3, 2, 2, 1]: ā2 ≈ 7.5 × 10⁷, σ̂²_a2 ≈ 6.75 × 10¹⁶
Attack #3: a3 = [1, 3, 2, 1, ∞, 2, 3, 2, 3, 2, 2, 1]: ā3 = ∞, σ̂²_a3 = ∞

→ Mean and variance are not robust.
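These numbers are easy to reproduce; a minimal sketch in Python (standard library only) of how a single corrupted observation drags both estimators:

```python
from statistics import mean, variance  # variance() is the sample variance (n-1 denominator)

x  = [1, 3, 2, 1, 9, 2, 3, 2, 3, 2, 2, 1]    # original data
a1 = [1, 3, 2, 1, 900, 2, 3, 2, 3, 2, 2, 1]  # attack #1: one corrupted value

print(round(mean(x), 2), round(variance(x), 2))    # → 2.58 4.63
print(round(mean(a1), 2), round(variance(a1), 2))  # → 76.83 67200.88
```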

SLIDE 33

Robust Statistics

Example: Median and MAD

Two different estimators:
Median m(X): any real number satisfying P(X ≤ m(X)) ≥ 0.5 and P(X ≥ m(X)) ≥ 0.5.
For finite sorted data x = [x1, . . . , xn]: m(x) = (x_⌊(n+1)/2⌋ + x_⌈(n+1)/2⌉) / 2 (the middle value)

Median Absolute Deviation (MAD): ζ(x) = m(|x − m(x)|)

SLIDE 39

Robust Statistics

Median and MAD are less influenced by contamination:

a1 = [1, 3, 2, 1, 900, 2, 3, 2, 3, 2, 2, 1]: m(a1) = 2, ζ(a1) = 1
a2 = [1, 3, 2, 1, ∞, 2, 3, 2, 3, 2, 2, 1]: m(a2) = 2, ζ(a2) = 1
a3 = [∞, 3, 2, ∞, ∞, 2, ∞, 2, 3, 2, 2, ∞]: m(a3) = 3, ζ(a3) = 1
a4 = [∞, ∞, 2, ∞, ∞, 2, ∞, 2, ∞, 2, 2, ∞]: m(a4) = ∞, ζ(a4) = ∞

→ Median and MAD are robust estimators of central tendency and dispersion
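A quick sanity check of these values, sketched in Python with only the standard library (math.inf stands in for ∞):

```python
import math
from statistics import median

def mad(x):
    """Median absolute deviation: zeta(x) = m(|x - m(x)|)."""
    m = median(x)
    return median(abs(v - m) for v in x)

a1 = [1, 3, 2, 1, 900, 2, 3, 2, 3, 2, 2, 1]
a3 = [math.inf, 3, 2, math.inf, math.inf, 2, math.inf, 2, 3, 2, 2, math.inf]

print(median(a1), mad(a1))  # → 2.0 1.0
print(median(a3), mad(a3))  # → 3.0 1.0
```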

SLIDE 40

Robust Statistics

Definition: A statistic T(·) maps data to a single value, i.e. T : Rⁿ → R. Examples: mean, minimum, χ² tests, ...

Robust Statistics = Robust + T(·)

Definition: A statistic T(·) is robust if it behaves favorably as the data it is computed on increasingly deviates from the assumptions made by T(·).

SLIDE 41

Robust Statistics

About mean and variance I

What is estimated by the sample mean x̄ = μ̂_X = (1/n) Σ_{j=1}^n x_j and the sample variance σ̂²_X = (1/(n−1)) Σ_{j=1}^n (x_j − x̄)²?

By the strong law of large numbers (L.L.N.):
x̄ → μ_X = E[X] almost surely (n → ∞), and σ̂_X → σ_X (n → ∞).

SLIDE 42

Robust Statistics

About mean and variance II

The strong L.L.N. assumes x ~ D(·) i.i.d.
Anomalies typically follow a different distribution.
A single anomaly might break the i.i.d. assumption.
x̄ and σ̂_X become biased towards the anomaly.

SLIDE 43

Robust Statistics

Bias

Mean x̄ and median m(x) are affected differently by contamination
→ different amounts of contamination are needed to bias them.

A single corrupted observation will add bias to x̄.
At least n/2 corrupted observations are needed to bias m(x).

Question: How do we measure the impact of contamination on bias?

SLIDE 44

Robust Statistics

Breakdown point I

Definition: Let Tn(·) be an estimator of θ with Tn(x_n) = θ̂, and let 0 < k < n observations in x_n be contaminated to an arbitrary value. The breakdown point β⋆ of Tn is the smallest contamination fraction for which the bias becomes worst possible:

β⋆_Tn = min { k/n : |E[θ̂] − θ| = sup b(Tn, θ) }

SLIDE 45

Robust Statistics

Breakdown point II

In simple terms: the smallest fraction of corrupted observations that Tn cannot handle.

Assess robustness:

  • 1. Corrupt an observation
  • 2. Check the bias
  • 3. Repeat until the worst possible output is reached

SLIDE 46

Robust Statistics

Breakdown Point: Example

Some breakdown points:
Mean: β⋆_x̄ = 1/n
IQR: β⋆_I ≈ 1/4
Median: β⋆_m ≈ 1/2
Perceptron: β⋆_p = 1/n

Easy to test on a small dataset:

  • 1. Contaminate a few observations
  • 2. See how the statistic/algorithm behaves
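The two-step test above can be sketched directly in Python (standard library; the corrupt helper is a hypothetical name introduced here):

```python
from statistics import mean, median

data = [1, 3, 2, 1, 2, 3, 2, 3, 2, 2, 1, 2]  # small clean dataset, n = 12

def corrupt(x, k, value=10**9):
    """Replace the first k observations with an arbitrarily bad value."""
    return [value] * k + x[k:]

# The mean breaks after a single corrupted observation (breakdown point 1/n) ...
print(mean(corrupt(data, 1)), median(corrupt(data, 1)))
# ... while the median only breaks once about n/2 observations are corrupted.
print(median(corrupt(data, 6)))
```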

SLIDE 47

Robust Statistics

Recap of last few slides I

Robustness is about deviations from assumptions.
Every meaningful statistic/algorithm T(·) assumes something (no-free-lunch theorems).
Robust methods are consistent and become only slowly biased towards contamination.
Robustness can be measured with the (asymptotic) breakdown point.

SLIDE 48

Robust Statistics

Recap of last few slides II Want to test if T (·) is robust?

  • 1. Find dataset X where assumptions of T (·) hold
  • 2. Compute T (X)
  • 3. Contaminate X to X ′ so that assumptions of T (·) are violated
  • 4. Compute T (X ′)

SLIDE 49

Robust Statistics

Final Remark: Efficiency

Robust methods are needed when there are anomalies in the data.
Robustness alone is not enough: T(·) also needs to be good at estimating θ → statistical efficiency.

SLIDE 50

Anomaly Detection

SLIDE 51

Anomaly Detection

Introduction

There are many "anomaly detection" methods:
Density-based techniques
One-class support-vector machines
Artificial neural networks
...

Why are there so many? Performance depends largely on the dataset. (Why?) There are many types of anomalies, and different settings require different methods.

SLIDE 52

Anomaly Detection

Objective

Apparent goal: detect when something unexpected/abnormal happens.
What data is available? The given data might contain very many anomalies ... or none.

→ True goal: Need to learn what is normal

Normality is typically defined by the problem context, not by data

SLIDE 54

Anomaly Detection

A classical pitfall

[Plots omitted; axis ranges 0–8×10⁸ and 10–25]

SLIDE 55

Anomaly Detection

How can we learn what is normal?

Expert-based (traditional)

  • 1. Acquire domain expertise
  • 2. Analyze data and formulate rules
  • 3. Test rules

Model-driven (traditional statistics)

  • 1. Understand problem
  • 2. Make assumptions and model
  • 3. Compare model with data

Data-driven (data science)

  • 1. Analyze data
  • 2. Derive model from data & problem understanding
  • 3. Search deviations from model in data

SLIDE 56

Anomaly Detection

How can we learn from data what is normal? I

  • 1. Labeled data with normal and anomalous records

Goal: Learn to detect labeled anomalies → reduction to a classification problem.
+ Super easy compared to other settings!
– What about new anomalies?

SLIDE 57

Anomaly Detection

How can we learn from data what is normal? II

  • 2. Labeled data with only normal records (and maybe unlabeled data)

Goal: Learn the boundaries of what is normal; no assumptions made about anomalies.
+ Best setting for successful anomaly detection!
– Setting very rare

SLIDE 58

Anomaly Detection

How can we learn from data what is normal? III

  • 3. Unlabeled data

Goal: Find deviating data; hard to learn what is normal.
+ Most common practical setting
– Impossible to truly solve (needs strong assumptions)

SLIDE 59

Anomaly Detection

Overview: Settings and Methods

  • 1. Fully labeled data → Supervised anomaly detection
  • 2. Labeled normal data → Unsupervised anomaly detection
  • 3. Unlabeled data → Method-based anomaly detection

SLIDE 60

Setting: Fully Labeled Data

SLIDE 61

Setting: Fully Labeled Data

Overview I

Setting: labeled training set; learn to classify normal and abnormal data
→ Classification problem

Examples:
Distinguish between normal cell growth and cancer
Recognize attack signatures in normal web traffic

SLIDE 62

Setting: Fully Labeled Data

Overview II

Suggested approach: supervised learning
Statistical regression methods
Support vector machines
Classical neural networks
Deep neural networks
...

SLIDE 63

Setting: Fully Labeled Data

Method 1.1: K-nearest neighbor classification

The class of a query point is the majority class among its k nearest neighbors
→ assumes anomalies are close to each other.

Critical component: the distance function
Euclidean distance
Mahalanobis distance
...
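With fully labeled data this reduces to ordinary k-NN classification; a minimal sketch, assuming scikit-learn is available and using made-up toy values:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical labeled data: 0 = normal, 1 = anomaly
X = np.array([[1.0], [1.2], [0.9], [1.1], [8.0], [8.3], [7.9]])
y = np.array([0, 0, 0, 0, 1, 1, 1])

clf = KNeighborsClassifier(n_neighbors=3, metric="euclidean")
clf.fit(X, y)

print(clf.predict([[1.05], [8.1]]))  # → [0 1]
```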

By Antti Ajanki AnAj - Own work, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=2170282

SLIDE 64

Setting: Fully Labeled Data

Method 1.2: Support Vector Machines

Construct a hyperplane that separates the classes.
To solve nonlinear problems, an extension is needed: kernels
Polynomial
Radial basis function
Hyperbolic tangent
...
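A sketch, again assuming scikit-learn, with an RBF kernel so the decision boundary need not be linear (the data and parameters are illustrative):

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical 2-D labeled data: an anomaly cluster away from the normal cluster
rng = np.random.default_rng(0)
X_normal = rng.normal(0.0, 0.5, size=(40, 2))
X_anom = rng.normal(0.0, 0.5, size=(10, 2)) + 4.0
X = np.vstack([X_normal, X_anom])
y = np.array([0] * 40 + [1] * 10)

clf = SVC(kernel="rbf", gamma="scale")
clf.fit(X, y)

print(clf.predict([[0.1, -0.2], [4.2, 3.8]]))  # → [0 1]
```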

By Larhmam - Own work, CC BY-SA 3.0, https://commons.wikimedia.org/wiki/File:SVM_margin.png

SLIDE 65

Setting: Fully Labeled Data

Problems I

While supervised methods learn to classify data as normal or anomalous...
... they do not learn what is normal.
Only the border between seen anomalies and normal data is learned.
Unseen anomalies are not considered.

SLIDE 66

Setting: Fully Labeled Data

Problems II

Only applicable when all possible types of anomalies are known. Examples:
Detect cheating at simple gambling → always unusually high winnings
Classical (naive) anti-virus approaches → learn attack signatures

SLIDE 67

Setting: Labeled Normal Data

SLIDE 68

Setting: Labeled Normal Data

Overview I

Setting: dataset with only normal data
Learn what is normal
Decide how likely unlabeled data are normal

SLIDE 69

Setting: Labeled Normal Data

Overview II

This is the most promising setting!
Not restricted to certain anomaly types
Ideal for handling new anomalies
But: labeled normal data are rare in practice

Suggested approach: unsupervised learning

SLIDE 70

Setting: Labeled Normal Data

Method 2.1: Multivariate kernel density estimation

Estimate probability density functions; assigns probabilities to the entire space.
Assumption: unlikely = anomalous
Needs a good kernel function

Duong, Tarn. "ks: Kernel density estimation and kernel discriminant analysis for multivariate data in R." Journal of Statistical Software 21.7 (2007): 1-16.
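A sketch with scikit-learn's KernelDensity (the Gaussian kernel, bandwidth, and 1% density threshold are assumptions): fit the density on normal data only, then flag low-density queries.

```python
import numpy as np
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(0)
X_normal = rng.normal(0.0, 1.0, size=(200, 2))  # training set: normal data only

kde = KernelDensity(kernel="gaussian", bandwidth=0.5).fit(X_normal)

queries = np.array([[0.0, 0.0], [6.0, 6.0]])
log_dens = kde.score_samples(queries)  # log-density of each query

# Flag queries less likely than 99% of the normal training data
threshold = np.quantile(kde.score_samples(X_normal), 0.01)
print(list(log_dens < threshold))  # → [False, True]
```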

SLIDE 71

Setting: Labeled Normal Data

Method 2.2: One-class support vector machines

Planar approach: hyperplane between the data and the origin; maximize the distance.
Spherical approach (support vector data descriptors): hypersphere around the data; minimize the volume.
Needs a good kernel function

Muñoz-Marí, Jordi, et al. "Semisupervised one-class support vector machines for classification of remote sensing data." IEEE Transactions on Geoscience and Remote Sensing 48.8 (2010): 3188-3197.
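The planar variant is available as OneClassSVM in scikit-learn; a sketch with illustrative parameter values:

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X_normal = rng.normal(0.0, 1.0, size=(200, 2))  # only normal data for training

# nu bounds the fraction of training points allowed outside the boundary
oc = OneClassSVM(kernel="rbf", gamma="scale", nu=0.05).fit(X_normal)

print(oc.predict([[0.0, 0.0], [6.0, 6.0]]))  # +1 = normal, -1 = anomaly
```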

SLIDE 72

Setting: Labeled Normal Data

Method 2.3: Autoencoders

Learn to replicate the data; collect the reconstruction error for unlabeled queries.
Low error: normal
High error: anomaly
Important: needs a large training data set!
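The deck does not fix a framework; as a stand-in sketch, scikit-learn's MLPRegressor trained to reproduce its own input acts as a tiny autoencoder (the architecture, data, and error comparison are all illustrative assumptions, not the method as deployed in practice):

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
# Normal data: a tight 4-D cluster around a hypothetical operating point
X_normal = rng.normal(0.0, 0.1, size=(500, 4)) + [1.0, 2.0, 3.0, 4.0]

# A 2-unit bottleneck forces a compressed representation of the 4-D inputs
ae = MLPRegressor(hidden_layer_sizes=(2,), max_iter=5000, random_state=0)
ae.fit(X_normal, X_normal)  # target = input: learn to replicate the data

def reconstruction_error(x):
    return float(np.mean((ae.predict(x) - x) ** 2))

err_normal = reconstruction_error(np.array([[1.0, 2.0, 3.0, 4.0]]))
err_anomaly = reconstruction_error(np.array([[10.0, -5.0, 0.0, 20.0]]))
print(err_normal < err_anomaly)  # the anomaly reconstructs much worse
```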

By Chervinskii - Own work, CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=45555552

SLIDE 73

Setting: Unlabeled Data

SLIDE 74

Setting: Unlabeled Data

Overview I

Setting: unlabeled dataset, no context information available, limited domain expertise
→ Worst scenario

How to distinguish between normal and anomalous? No method for learning normality.
How can detection results be evaluated?

SLIDE 75

Setting: Unlabeled Data

Overview II

Solution: make assumptions
No learning without assumptions (no-free-lunch theorems)
Assume that outliers according to method Y are anomalies
Important: use simple detection methods!

SLIDE 76

Setting: Unlabeled Data

Method 3.1: Local outlier probability

Local Outlier Factor: estimate the local density; low local density → anomaly. But how to interpret the deviation?
Local Outlier Probability: estimate the local density, then estimate an outlier probability.
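LoOP itself is not in scikit-learn; as a hedged stand-in, the closely related Local Outlier Factor is, and its scores play the same role without the probability calibration:

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.5, size=(100, 2)),
               [[5.0, 5.0]]])  # one obvious anomaly appended

lof = LocalOutlierFactor(n_neighbors=20)
labels = lof.fit_predict(X)  # -1 = outlier, +1 = inlier

print(labels[-1])  # the appended point is flagged as -1
```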

Kriegel, Hans-Peter, et al. "LoOP: local outlier probabilities." Proceedings of the 18th ACM Conference on Information and Knowledge Management. ACM, 2009.

SLIDE 77

Setting: Unlabeled Data

Method 3.2: Isolation forest

Isolation tree:

  • 1. Randomly split the data with a hyperplane
  • 2. Repeat until every point is isolated
  • 3. Evaluate the number of partitions

Few partitions to isolate → anomaly
Many partitions to isolate → inlier

Isolation Forest:

  • 1. Grow many isolation trees
  • 2. Compare the trees
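The procedure above can be sketched with scikit-learn's IsolationForest (toy data; parameters are illustrative):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.5, size=(100, 2)),
               [[6.0, 6.0]]])  # one far-away point

forest = IsolationForest(n_estimators=100, random_state=0).fit(X)
labels = forest.predict(X)  # -1 = anomaly (isolated in few splits), +1 = inlier

print(labels[-1])  # → -1
```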

Liu, Fei Tony, Kai Ming Ting, and Zhi-Hua Zhou. "Isolation forest." 2008 Eighth IEEE International Conference on Data Mining. IEEE, 2008.

SLIDE 78

Setting: Unlabeled Data

Method 3.3: DBSCAN

Cluster data according to density:

  • 1. Compute k-NN distances
  • 2. Check which data have many neighbors
  • 3. Connect "dense" data
  • 4. Points in no cluster are anomalies

Returns a clustering and anomalies
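A sketch with scikit-learn's DBSCAN, where label -1 marks noise points, i.e. the anomalies (eps and min_samples are illustrative):

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.3, size=(50, 2)),   # dense cluster A
               rng.normal(5.0, 0.3, size=(50, 2)),   # dense cluster B
               [[2.5, 2.5]]])                        # isolated point between them

db = DBSCAN(eps=0.5, min_samples=5).fit(X)
print(db.labels_[-1])  # → -1 (noise = anomaly)
```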

By Chire - Own work, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=17045963

SLIDE 79

Final Remarks

SLIDE 80

Final Remarks

Tools

Robust statistics:
https://cran.r-project.org/web/views/Robust.html
https://www.iumsp.ch/en/software/robust-statistics
AstroPy

Anomaly detection:
DDoutlier
ELKI
anomaly (R package)
scikit-learn
Tensorflow, Keras

SLIDE 81

Final Remarks

Further Reading

Chandola, Varun, Arindam Banerjee, and Vipin Kumar. "Anomaly detection: A survey." ACM Computing Surveys (CSUR) 41.3 (2009): 15.
Zimek, Arthur, and Peter Filzmoser. "There and back again: Outlier detection between statistical reasoning and data mining algorithms." Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 8.6 (2018): e1280.
Campos, Guilherme O., et al. "On the evaluation of unsupervised outlier detection: measures, datasets, and an empirical study." Data Mining and Knowledge Discovery 30.4 (2016): 891-927.
Görnitz, Nico, et al. "Toward supervised anomaly detection." Journal of Artificial Intelligence Research 46 (2013): 235-262.

SLIDE 82

Final Remarks

The End

Thank you for your attention!

SLIDE 83

Final Remarks

References I

Barnett, V. and Lewis, T. (1974). Outliers in statistical data. Wiley.
Farman, J. C., Gardiner, B. G., and Shanklin, J. D. (1985). Large losses of total ozone in Antarctica reveal seasonal ClOx/NOx interaction. Nature, 315(6016):207.
Grubbs, F. E. (1969). Procedures for detecting outlying observations in samples. Technometrics, 11(1):1-21.
Hawkins, D. M. (1980). Identification of outliers, volume 11. Springer.

SLIDE 84

Final Remarks

References II

Quinn, T. P., Nguyen, T., Lee, S. C., and Venkatesh, S. (2019). Cancer as a tissue anomaly: classifying tumor transcriptomes based only on healthy data. Frontiers in Genetics, 10.
Zimek, A. and Filzmoser, P. (2018). There and back again: Outlier detection between statistical reasoning and data mining algorithms. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 8(6):e1280.
