IRDM ‘15/16
Jilles Vreeken
Chapter 9: Outlier Analysis
8 Dec 2015
IRDM Chapter 9, overview:
1. Basics & Motivation
2. Extreme Value Analysis
3. Probabilistic Methods
4. Cluster-based Methods
5. Distance-based Methods
You’ll find this covered in: Aggarwal, Ch. 8, 9
When: …th 2015, from 14:15 to 15:25
Where: Günter-Hotz-Hörsaal (E2 2)
Material: Patterns, Clusters, and Classification
You are allowed to bring one (1) sheet of A4 paper with handwritten or printed notes on both sides. No other material (notes, books, course materials) or devices (calculator, notebook, cell phone, spoon, etc.) is allowed. Bring an ID; either your UdS card, or a passport.
Basics & Motivation (Aggarwal Ch. 8.1)
An outlier is a data point that is very different from most of the data
the standard definition is by Hawkins
“An outlier is an observation which deviates so much from the other observations as to arouse suspicion it was generated by a different mechanism”
[Figure: scatter plot over two features, with points annotated as ‘outlier’, ‘inlier’, and ‘outlier, maybe’.]
Outliers are also known as anomalies, abnormalities, discordants, or deviants.
Outlier analysis is a key area of data mining. Unlike pattern mining, clustering, and classification, it aims to describe what is not normal. Applications are many:
data cleaning, fraud detection, intrusion detection, rare disease detection, predictive maintenance
Outliers are not noise
noise is uninteresting, outliers are interesting; noise is random, outliers aren’t
Outliers are generated by a different process
e.g. Lionel Messi, credit card fraudsters, or rare disease patients
we have too little data to infer that process exactly
detected outliers help us to better understand the data
Many, many different outlier detection methods exist
many different methods are needed
e.g. for continuous vs. discrete data; for tables, sequences, graphs
The key problem, and why outlier analysis is interesting: beforehand, we do not know what we are looking for
what is weird? what is normal?
Global outliers
an object that deviates from the rest of the data set
main issue: finding a good measure of deviation
Local outliers
an object that deviates from a selected context, e.g. differs strongly from its neighboring objects
main issue: how to define the local context?
Collective outliers
a subset of objects that collectively deviates from the data or context, e.g. in intrusion detection
main issue: the combinatorial number of possible sets of objects
[Figure: example data set with a global outlier, a local outlier, and a collective outlier marked.]
Most outlier analysis methods give a real-valued score. How to decide whether a point is worth looking at?
we set a threshold, or look at the top-k; there is no best answer, it depends on the situation
How to evaluate?
very, very difficult: is there a ‘true’ outlier ranking? how bad is it to miss one, or to report two too many?
Given sufficient data, we can construct a classifier
and then simply use it to predict how outlying an object is; typically this does not fly in practice
Problem 1: Insufficient training data
outliers are rare
we can boost (resample) a training set from a small set of known outliers
we can train on artificial samples
Problem 2: Recall
recall is more important than accuracy
we want to catch them all
Extreme Value Analysis (Aggarwal Ch. 8.2)
The traditional statistical approach to identifying outliers is extreme value analysis: those points x ∈ D that lie in the statistical tails of the distribution
it only identifies very specific outliers
For example, for {1,3,3,3,50,97,97,97,100}
extreme values are 1 and 100, although 50 is the most isolated
Tails are naturally defined for univariate distributions
defining the multivariate tail area of a distribution is more tricky
[Figure: scatter plot over two features, showing an extreme value as well as an outlier that is not an extreme value.]
There is a strong relation to statistical tail confidence tests. Assume a distribution, and consider the probability density function f_X(x) for attribute X
the lower tail are those values x < l for which f_X(x) < ε, for some density threshold ε
the upper tail are those values x > u for which f_X(x) < ε
Not all distributions have two tails
exponential distributions, for example
For example, for a Gaussian,
f_X(x) = 1 / (σ·√(2π)) · exp( −(x − μ)² / (2σ²) )
with sufficient data we can estimate μ and σ with high accuracy.
We can then compute z-scores, z_i = (x_i − μ) / σ
large positive values correspond to the upper tail, large negative values to the lower tail
We can write the pdf in terms of z-scores as
f_X(z_i) = 1 / (σ·√(2π)) · exp( −z_i² / 2 )
and read off tail probabilities from the cumulative normal distribution
as a rule of thumb, z-scores with absolute values larger than 3 are extreme
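As a minimal sketch of this z-score analysis (assuming a 1-d numpy array and the |z| > 3 rule of thumb from above; the function name and threshold parameter are illustrative, not from the lecture):

```python
import numpy as np

def z_score_outliers(values, threshold=3.0):
    """Return z-scores and a mask of extreme values under a Gaussian assumption."""
    mu = values.mean()
    sigma = values.std()
    z = (values - mu) / sigma            # z_i = (x_i - mu) / sigma
    return z, np.abs(z) > threshold      # scores, and which points count as 'extreme'

# the univariate example from a few slides back
values = np.array([1, 3, 3, 3, 50, 97, 97, 97, 100], dtype=float)
z, extreme = z_score_outliers(values)
```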
The main idea of depth-based methods is that the convex hull of a set of data points represents the pareto-optimal extremes of the set
find the convex hull, and assign depth k to all x ∈ hull(D); remove hull(D) from D, increase k, and repeat until D is empty
The depth k identifies how extreme a point is: the smaller the depth, the more extreme
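A minimal sketch of this hull-peeling procedure, assuming low-dimensional numerical data in a numpy array; scipy's ConvexHull is used for the hull computation, and degenerate leftovers (e.g. collinear points) are simply assigned the final depth:

```python
import numpy as np
from scipy.spatial import ConvexHull

def hull_depths(X):
    """Assign each point the iteration in which it is peeled off the convex hull."""
    depths = np.zeros(len(X), dtype=int)
    remaining = np.arange(len(X))            # indices of points not yet assigned
    k = 1
    while len(remaining) > X.shape[1]:       # ConvexHull needs more than d points
        try:
            hull = ConvexHull(X[remaining])
        except Exception:                    # degenerate (e.g. collinear) leftovers
            break
        on_hull = remaining[hull.vertices]
        depths[on_hull] = k
        remaining = np.setdiff1d(remaining, on_hull)
        k += 1
    depths[remaining] = k                    # whatever is left gets the deepest level
    return depths                            # small depth = extreme, large depth = deep inside
```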
[Figure: convex-hull peeling of an example point set, with the successive hulls labeled depth 1 through depth 4.]
The depth of a point identifies how extreme it is. Depth-based methods are, however, very sensitive to dimensionality
recall how points are typically distributed over the hull of a hypersphere in high dimensions
the computational complexity of computing convex hulls also grows quickly with dimensionality
We can also define tails for multivariate distributions
areas of extreme values with probability density less than some threshold
More complicated than univariate
and it only works for unimodal distributions with a single peak
For a multivariate Gaussian, we have the density
f(x) = 1 / ( √|Σ| · (2π)^(d/2) ) · exp( −½ · (x − μ) Σ⁻¹ (x − μ)ᵀ )
where Σ is the d-by-d covariance matrix, and |Σ| is its determinant
The exponent resembles the Mahalanobis distance…
Mahalanobis distance is defined as
Maha(x, μ, Σ) = √( (x − μ) Σ⁻¹ (x − μ)ᵀ )
where Σ is a d-by-d covariance matrix, and μ a mean vector
Essentially, it is the Euclidean distance after applying PCA and dividing by the standard deviation along each component
very useful in practice
e.g. for the example on the left, Maha(c, μ, Σ) > Maha(b, μ, Σ)
(Mahalanobis, 1936)
[Figure: scatter plot over two features showing an elongated point cloud, its centre, and two points b and c.]
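A minimal sketch of computing this global Mahalanobis score for every row of a data matrix; using the pseudo-inverse is an assumption for robustness against (near-)singular covariance matrices:

```python
import numpy as np

def mahalanobis_scores(X):
    """Mahalanobis distance of every row of X to the overall mean."""
    mu = X.mean(axis=0)
    Sigma_inv = np.linalg.pinv(np.cov(X, rowvar=False))
    diff = X - mu
    # (x - mu) Sigma^-1 (x - mu)^T, evaluated row-wise
    sq = np.einsum('ij,jk,ik->i', diff, Sigma_inv, diff)
    return np.sqrt(sq)
```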
Written in terms of the Mahalanobis distance, the exponent of the multivariate Gaussian density is half the squared Mahalanobis distance,
f(x) = 1 / ( √|Σ| · (2π)^(d/2) ) · exp( −½ · Maha(x, μ, Σ)² )
for the probability density to fall below a threshold, the Mahalanobis distance hence needs to be larger than a corresponding threshold.
Mahalanobis distance to the mean is an extremity score
larger values imply more extreme behavior
The probability of being extreme may be more insightful
how to model?
Mahalanobis distance considers axis-rotated and scaled data
each component along the principal components can be modeled as an independent standard Gaussian, which means the squared distance can be modeled by a χ² distribution with d degrees of freedom
points whose squared Mahalanobis distance lies far in the tail of this cumulative χ² distribution are potential outliers
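A minimal sketch of turning these distances into probabilities, assuming the Gaussian model above and reusing the mahalanobis_scores sketch: the squared Mahalanobis distance under a d-dimensional Gaussian follows a χ² distribution with d degrees of freedom, so its cumulative probability tells how far out in the tail a point lies:

```python
from scipy.stats import chi2

def extremity_probabilities(X, maha):
    """Cumulative chi^2 probability of each squared Mahalanobis distance."""
    d = X.shape[1]
    return chi2.cdf(maha ** 2, df=d)   # values close to 1 sit far out in the tail

# probs = extremity_probabilities(X, mahalanobis_scores(X))
```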
Extreme value analysis is a rather basic technique.
it only works when the data has a single-peaked distribution, and requires assuming a distribution (e.g. Gaussian)
Depth-based methods are very brittle in practice
do not scale well with dimensionality
Probabilistic Methods (Aggarwal Ch. 8.3)
Mahalanobis distance works well if there is a single peak
what if there are multiple?
We can generalise to multiple distributions using mixture modelling
to this end, we’ll re-employ EM clustering.
We assume the data was generated by a mixture of k distributions 𝒢₁ … 𝒢_k, and that the generation process was
1. select a mixture component 𝒢_j with prior probability a_j, where j ∈ {1 … k}
2. generate a data point from 𝒢_j
The probability of point x being generated by model ℳ is
p(x ∣ ℳ) = Σ_{j=1}^{k} a_j · f_j(x)
where f_j is the density of component 𝒢_j
outliers will naturally have low fit probabilities
[Figure: scatter plot over two features with two mixture components; two isolated points have a very low fit (outliers), whereas a point inside a component has a high fit.]
To find the parameters of ℳ, we need to optimise the log-likelihood of the data,
log p(D ∣ ℳ) = Σ_{x ∈ D} log p(x ∣ ℳ)
such that it is maximised
this we do using EM (see lecture V-1)
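A minimal sketch of mixture-based outlier scoring with scikit-learn's GaussianMixture, which runs EM internally; the number of components k (and using Gaussian components at all) are assumptions you must make, exactly as discussed on the surrounding slides:

```python
from sklearn.mixture import GaussianMixture

def mixture_outlier_scores(X, k=3):
    """Negative log-likelihood under a k-component Gaussian mixture fit with EM."""
    gmm = GaussianMixture(n_components=k, covariance_type='full').fit(X)
    log_fit = gmm.score_samples(X)     # log p(x | M) for every point
    return -log_fit                    # low fit probability = high outlier score
```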
Mixture modelling works very well
when we know the family of distributions of the components
when we have sufficient data to estimate their parameters
and it allows us to include background knowledge, e.g. correlations
In practice, however…
we do not know the number of components
we do not know the distributions
we do not have sufficient data
Due to overfitting we are likely to miss outliers
Cluster-based Methods (Aggarwal Ch. 8.4)
In both the probabilistic and the cluster-based approach we define outliers as points that deviate from the norm.
In the probabilistic approach the norm is a distribution
points with a low fit are outliers
In the cluster-based approach the norm is a clustering
points far away from these clusters are outliers
A simplistic approach is to say that every point that does not belong to a cluster is an outlier
many clustering algorithms claim to find outliers as a side-product; data points on the boundaries of clusters, however, are not real outliers
[Figure: example data set containing a small, isolated cluster of outliers, a ‘freak quadruplet’.]
The simple cluster-based approach to outlier detection
1. cluster your data
2. the distance of a point x to its closest centroid is its outlier score
Raw distances can be deceiving
what if clusters are of different density? what if clusters are of different shape?
We need a score that takes the context into account
Mahalanobis distance does consider shape and density
it is a global score, for single-peaked, unimodal distributions
We can, however, define a local Mahalanobis score
compute the mean vector μ_r and covariance matrix Σ_r per cluster C_r ∈ 𝒞
entry (i, j) of Σ_r is the covariance between dimensions i and j within cluster C_r
Maha(x, μ_r, Σ_r) = √( (x − μ_r) Σ_r⁻¹ (x − μ_r)ᵀ )
we can directly use it as an outlier score, higher is weirder
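A minimal sketch of this cluster-based local Mahalanobis score; using k-means for the clustering and a pseudo-inverse for the per-cluster covariance are assumptions for illustration, and very small clusters will give unreliable estimates:

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_mahalanobis_scores(X, n_clusters=5):
    """Mahalanobis distance of each point to the mean of its own cluster."""
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(X)
    scores = np.empty(len(X))
    for c in range(n_clusters):
        members = X[labels == c]
        mu = members.mean(axis=0)
        Sigma_inv = np.linalg.pinv(np.cov(members, rowvar=False))
        diff = members - mu
        scores[labels == c] = np.sqrt(np.einsum('ij,jk,ik->i', diff, Sigma_inv, diff))
    return scores   # higher is weirder, comparable across clusters of different shape
```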
Cluster-based anomaly detection makes intuitive sense
it works decently in practice, even when not tailored for outliers; it can detect small clusters of outliers
Noise is a big problem
clustering techniques do not distinguish between ambient noise and isolated points
neither appears in a cluster, so both are reported as outliers; neither global nor local distances to centroids help
Distance-based Methods (Aggarwal Ch. 8.5)
We can identify outliers instance-based, as opposed to model-based, by using a distance measure.
“The distance-based outlier score of an object x is its distance to its k-th nearest neighbor.”
In practice
you choose a (meaningful) distance measure dist
you choose a (low) number of neighbors k
object x is not part of its own k-nearest neighbors
this avoids scores of 0 for k = 1
Distance-based methods
finer granularity than clustering or model-based methods
Let V_k(x) be the distance of x to its k-th nearest neighbor,
V_k(x) = dist(x, y), where y is the k-th nearest neighbor of x
Naively computing V_k(x) takes O(n)
for all data points x ∈ D the cost is O(n²), which is infeasible for large D
We can speed this up by indexing
but for high-dimensional data the effectiveness of indexes degrades
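A minimal sketch of the exact k-NN outlier score using scikit-learn's NearestNeighbors index (Euclidean by default); querying k+1 neighbours and dropping the first column excludes each point from its own neighbourhood, as required above:

```python
from sklearn.neighbors import NearestNeighbors

def knn_outlier_scores(X, k=5):
    """Distance of each point to its k-th nearest neighbor, excluding the point itself."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    dists, _ = nn.kneighbors(X)    # column 0 is the point itself, at distance 0
    return dists[:, k]
```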
First, we choose a sample S of s objects from D, with r < s ≪ n
we compute all pairwise distances between S and D
this costs O(s·n) ≪ O(n²)
We then have the exact score V_k(x) for every object x ∈ S
We have a lower bound L on the top-r outlier score
the r-th largest exact score within the sample, in pseudo-math: L = sort_desc( { V_k(x) ∣ x ∈ S } )_r
any object with a lower score will not be in the top-r for D
We have an upper bound for every other object
the k-nearest-neighbor distance of object x ∈ D − S to the objects in S, or, in pseudo-math: U_k(x) = sort_asc( { dist(x, y) ∣ y ∈ S } )_k
Usually, we only want the top-r most outlying objects
these we can compute much more efficiently, using two tricks
Trick 1: compute lower and upper bounds by sampling
compute the full score for a sample of s > r objects
this gives a lower bound on the score of the r-th strongest outlier, and an upper bound on the score of every object not in the sample
Trick 2: early termination
compute the full score only for candidates that can still beat the r-th best score
No x ∈ D with upper bound U_k(x) < L will be in the top-r
we do not have to compute its distances to D − S
we only have to compute full scores for the candidate set R = { x ∈ D − S ∣ U_k(x) > L } ⊆ D − S
The top-r ranked outliers in R ∪ S are returned as the final output. In practice, as |R ∪ S| ≪ |D|, this saves a lot of computation
especially if D is clustered, and especially if we chose S wisely/luckily
at least one point per cluster, and some points in the sparse regions
How would you choose S?
We can do better, however. While computing the scores for x ∈ R, every time we discover an object with V_k(x) > L we should update L
meaning, we can prune further candidates from R
For every x ∈ R we start with the upper bound U_k(x)
initially based on the distances to S, but while computing the distances to D − S, we keep updating it
once U_k(x) drops below L we can terminate early
Algorithm TOP-r-kNN-OUTLIERS(data D, distance dist, sample size s)
  S ← sample(D, s)
  compute all distances between S and D
  L ← the r-th largest exact score V_k(x) over x ∈ S
  R ← { x ∈ D − S ∣ U_k(x) > L }
  for each x ∈ R do
    for each y ∈ D − S do
      update the current k-nearest-neighbor distance estimate V_k(x) by computing the distance of y to x
      if V_k(x) ≤ L then break
    if V_k(x) > L then
      include x in the current r best outliers
      update L to the new r-th best outlier score
  return the top-r outliers from S ∪ R
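A minimal sketch of this sampling-plus-early-termination scheme in Python, mirroring the pseudocode above; Euclidean distance, s ≥ max(k, r), and all variable and function names are assumptions for illustration, not a reference implementation:

```python
import numpy as np

def topr_knn_outliers(X, k=5, r=10, s=100, seed=0):
    """Top-r distance-based outliers via sampling and early termination."""
    rng = np.random.default_rng(seed)
    n = len(X)
    samp = rng.choice(n, size=s, replace=False)
    rest = np.setdiff1d(np.arange(n), samp)

    # distances between the sample and all points: cost O(s * n)
    d_samp_all = np.linalg.norm(X[samp][:, None, :] - X[None, :, :], axis=2)

    # exact scores V_k for sampled points (column 0 of each sorted row is the point itself)
    exact = np.sort(d_samp_all, axis=1)[:, k]
    # upper bounds U_k for the rest: k-th NN distance measured within the sample only
    d_rest_samp = d_samp_all[:, rest].T
    upper = np.sort(d_rest_samp, axis=1)[:, k - 1]

    top = sorted(zip(exact, samp), reverse=True)[:r]   # current best (score, index) pairs
    L = top[-1][0]                                     # lower bound on the top-r score

    for pos, i in enumerate(rest):
        if upper[pos] <= L:
            continue                                   # pruned without any extra work
        knn = np.sort(d_rest_samp[pos])[:k]            # running k smallest distances seen
        for j in rest:
            if j == i:
                continue
            dij = np.linalg.norm(X[i] - X[j])
            if dij < knn[-1]:
                knn[-1] = dij
                knn.sort()
                if knn[-1] <= L:
                    break                              # early termination: cannot make top-r
        if knn[-1] > L:
            top = sorted(top + [(knn[-1], i)], reverse=True)[:r]
            L = top[-1][0]                             # tighten the bound, prune more later
    return top                                         # (score, index), most outlying first
```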
Raw distance measures don’t always identify outliers well
they do not measure the intrinsic distances
e.g. Euclidean distance does not, but neither does Mahalanobis
The Local Outlier Factor algorithm (LOF) is one of the most used* local outlier detection techniques
* or at least, most often compared against and beaten by more modern methods
[Figure: example data with clusters of different density and different shapes; points related to a sparse cluster and points related to a long, elongated cluster are marked.]
In LOF we consider our data locally. That is, for a point x we primarily work with its k-nearest neighborhood. Let N_k(x) be the set of objects that are at most as far away as the k-th nearest neighbor of x,
N_k(x) = { y ∈ D ∣ dist(x, y) ≤ V_k(x) }
Usually N_k(x) will contain k points, sometimes more (in case of ties).
When a point y is in a dense area of the data, V_k(y) will be low, i.e. there are many points close to it. When two points x and y are in each other’s k-nearest neighborhoods, dist(x, y) ≤ min( V_k(x), V_k(y) ).
We can measure how outlying an object x is with regard to an object y by considering the reachability distance between x and y,
R_k(x, y) = max( dist(x, y), V_k(y) )
when x is not in the k-nearest neighborhood of y, this is simply dist(x, y); when x is in the k-nearest neighborhood of y, it is V_k(y). Here k essentially acts as a data-driven smoothing parameter.
We then compute the average reachability distance between x and its k-nearest neighbors,
AR_k(x) = mean_{y ∈ N_k(x)} R_k(x, y)
which will be maximal when the nearest neighbors of x are at the edge of a dense cluster
Now, finally, given a database D, a distance measure dist, and a number of neighbors k, we define the local outlying factor of a point x as
LOF_k(x) = mean_{y ∈ N_k(x)} AR_k(x) / AR_k(y)
For objects inside a cluster, it will take a value close to 1, regardless of the density of that cluster.
For outliers, LOF_k(x) ≫ 1: because x is not in the nearest neighborhoods of its own nearest neighbors, the denominator will be much smaller than the numerator
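A minimal sketch using scikit-learn's LocalOutlierFactor, which implements this reachability-based LOF score; n_neighbors plays the role of k, and its value is an assumption you must choose:

```python
from sklearn.neighbors import LocalOutlierFactor

def lof_scores(X, k=20):
    """LOF scores for the points in X; values well above 1 indicate outliers."""
    lof = LocalOutlierFactor(n_neighbors=k)
    lof.fit(X)
    # scikit-learn stores the negated LOF, so flip the sign to match the definition above
    return -lof.negative_outlier_factor_
```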
LOF works well in practice
even with raw (Euclidean) distance measures, and regardless of the number and density of clusters
Why?
because of the relative normalisation in the denominator it considers local information and can adapt to the local density
LOF is not perfect
it is O(n²) for high-dimensional data, O(n log n) when we can use an index
many variants exist for different cluster shapes
[Figure: LOF scores on the example data; a point related to the sparse cluster is flagged as an outlier by LOF.]
Euclidean distance-based neighborhoods have a bias towards spherical clusters
single-link clustering does not have this disadvantage
We can fix this by defining N_k(x) using single-link
start with N_1(x) = {x}, and then iteratively add the point that is closest to any point already in the neighborhood,
N_{k+1}(x) = N_k(x) ∪ { argmin_{y ∈ D, y ∉ N_k(x)} min_{z ∈ N_k(x)} dist(y, z) }
We can also again employ the local Mahalanobis distance
simply compute Maha(·) over N_k(x), i.e. LM_k(x) = Maha(x, μ_k, Σ_k), with μ_k and Σ_k the mean and covariance of N_k(x)
this tells how extreme a point x is with regard to its local neighborhood
no need to normalise, as Maha does that behind the scenes!
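A minimal sketch of this local Mahalanobis variant, assuming a plain Euclidean k-NN neighborhood (rather than the single-link construction) and a pseudo-inverse for numerical robustness:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def local_mahalanobis_scores(X, k=20):
    """Mahalanobis distance of each point to the mean and covariance of its k-NN neighborhood."""
    _, idx = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)
    scores = np.empty(len(X))
    for i, neigh in enumerate(idx[:, 1:]):            # drop the point itself
        mu = X[neigh].mean(axis=0)
        Sigma_inv = np.linalg.pinv(np.cov(X[neigh], rowvar=False))
        diff = X[i] - mu
        scores[i] = np.sqrt(diff @ Sigma_inv @ diff)
    return scores
```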
[Figure: comparison of LOF and local Mahalanobis scores on the example data; points related to the sparse cluster and points related to the long cluster are highlighted.]
Outliers are generated by a different process
not noise, but ‘nuggets of knowledge’, identifying exceptions in your data
Discovering outliers is non-trivial
it reduces to the core question of data mining: what is normal?
We have seen four different classic approaches
extreme value analysis, probabilistic, cluster-based, and distance-based methods
Discovering outliers in complex data is very challenging
what does outlying mean in high-dimensional data?
what does outlying mean in a graph?