IRDM ‘15/16
Jilles Vreeken
Chapter 9: Outlier Analysis
8 Dec 2015
IRDM Chapter 9, overview:
1. Basics & Motivation
2. Extreme Value Analysis
3. Probabilistic Methods
4. Cluster-based Methods
5. Distance-based Methods
You’ll find this covered in: Aggarwal, Ch. 8, 9
When: …th 2015, from 14:15 to 15:25
Where: Günter-Hotz-Hörsaal (E2 2)
Material: Patterns, Clusters, and Classification
You are allowed to bring one (1) sheet of A4 paper with handwritten or printed notes on both sides. No other material (notes, books, course materials) or devices (calculator, notebook, cell phone, spoon, etc.) is allowed. Bring an ID; either your UdS card, or a passport.
Basics & Motivation (Aggarwal Ch. 8.1)
An outlier is a data point that is very different from most of the data
the standard definition is by Hawkins
“An outlier is an observation which deviates so much from the other observations as to arouse suspicion it was generated by a different mechanism”
[Figure: scatter plot over two features, with points annotated as ‘outlier’, ‘inlier’, and ‘outlier, maybe’.]
Outliers are also known as anomalies, abnormalities, discordants, or deviants.
Outlier analysis is a key area of data mining. Unlike pattern mining, clustering, and classification, it aims to describe what is not normal. Applications are many:
data cleaning, fraud detection, intrusion detection, rare disease detection, predictive maintenance
Outliers are not noise
noise is uninteresting, outliers are interesting; noise is random, outliers aren’t
Outliers are generated by a different process
e.g. Lionel Messi, credit card fraudsters, or rare disease patients
we have too little data to infer that process exactly
detected outliers help us to better understand the data
Many, many different outlier detection methods exist
many different methods are needed
e.g. for continuous vs. discrete data; for tables, sequences, graphs
The key problem, and why outlier analysis is interesting: beforehand, we do not know what we are looking for
what is weird? what is normal?
Global outliers
an object that deviates from the rest of the data set
main issue: finding a good measure of deviation
Local outliers
an object that deviates from a selected context, e.g. differs strongly from its neighboring objects
main issue: how to define the local context?
Collective outliers
a subset of objects that collectively deviates from the data or context, e.g. in intrusion detection
main issue: the combinatorial number of possible sets of objects
[Figure: example data set with a global outlier, a local outlier, and a collective outlier marked.]
Most outlier analysis methods give a real-valued score. How to decide whether a point is worth looking at?
we set a threshold, or look at the top-k; there is no best answer, it depends on the situation
How to evaluate?
very, very difficult: is there a ‘true’ outlier ranking? how bad is it to miss one, or to report two too many?
Given sufficient data, we can construct a classifier
and then simply use it to predict how outlying an object is; typically this does not fly in practice
Problem 1: Insufficient training data
outliers are rare
we can boost (resample) a training set from a small set of known outliers
we can train on artificial samples
Problem 2: Recall
recall is more important than accuracy
we want to catch them all
Extreme Value Analysis (Aggarwal Ch. 8.2)
The traditional statistical approach to identifying outliers is extreme value analysis: those points x ∈ D that lie in the statistical tails of the distribution
it only identifies very specific outliers
For example, for {1,3,3,3,50,97,97,97,100}
extreme values are 1 and 100, although 50 is the most isolated
Tails are naturally defined for univariate distributions
defining the multivariate tail area of a distribution is more tricky
[Figure: scatter plot over two features, showing an extreme value as well as an outlier that is not an extreme value.]
There is a strong relation to statistical tail confidence tests. Assume a distribution, and consider the probability density function f_X(x) for attribute X
the lower tail are those values x < l for which f_X(x) < ε, for some density threshold ε
the upper tail are those values x > u for which f_X(x) < ε
Not all distributions have two tails
exponential distributions, for example
For example, for a Gaussian,
f_X(x) = 1 / (σ·√(2π)) · exp( −(x − μ)² / (2σ²) )
with sufficient data we can estimate μ and σ with high accuracy.
We can then compute z-scores, z_i = (x_i − μ) / σ
large positive values correspond to the upper tail, large negative values to the lower tail
We can write the pdf in terms of z-scores as
f_X(z_i) = 1 / (σ·√(2π)) · exp( −z_i² / 2 )
and read off tail probabilities from the cumulative normal distribution
as a rule of thumb, z-scores with absolute values larger than 3 are extreme
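As a minimal sketch of this z-score analysis (assuming a 1-d numpy array and the |z| > 3 rule of thumb from above; the function name and threshold parameter are illustrative, not from the lecture):

```python
import numpy as np

def z_score_outliers(values, threshold=3.0):
    """Return z-scores and a mask of extreme values under a Gaussian assumption."""
    mu = values.mean()
    sigma = values.std()
    z = (values - mu) / sigma            # z_i = (x_i - mu) / sigma
    return z, np.abs(z) > threshold      # scores, and which points count as 'extreme'

# the univariate example from a few slides back
values = np.array([1, 3, 3, 3, 50, 97, 97, 97, 100], dtype=float)
z, extreme = z_score_outliers(values)
```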
The main idea of depth-based methods is that the convex hull of a set of data points represents the pareto-optimal extremes of the set
find the convex hull, and assign depth k to all x ∈ hull(D); remove hull(D) from D, increase k, and repeat until D is empty
The depth k identifies how extreme a point is: the smaller the depth, the more extreme
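A minimal sketch of this hull-peeling procedure, assuming low-dimensional numerical data in a numpy array; scipy's ConvexHull is used for the hull computation, and degenerate leftovers (e.g. collinear points) are simply assigned the final depth:

```python
import numpy as np
from scipy.spatial import ConvexHull

def hull_depths(X):
    """Assign each point the iteration in which it is peeled off the convex hull."""
    depths = np.zeros(len(X), dtype=int)
    remaining = np.arange(len(X))            # indices of points not yet assigned
    k = 1
    while len(remaining) > X.shape[1]:       # ConvexHull needs more than d points
        try:
            hull = ConvexHull(X[remaining])
        except Exception:                    # degenerate (e.g. collinear) leftovers
            break
        on_hull = remaining[hull.vertices]
        depths[on_hull] = k
        remaining = np.setdiff1d(remaining, on_hull)
        k += 1
    depths[remaining] = k                    # whatever is left gets the deepest level
    return depths                            # small depth = extreme, large depth = deep inside
```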
[Figure: convex-hull peeling of an example point set, with the successive hulls labeled depth 1 through depth 4.]
The depth of a point identifies how extreme it is. Depth-based methods are, however, very sensitive to dimensionality
recall how points are typically distributed over the hull of a hypersphere in high dimensions
the computational complexity of computing convex hulls also grows quickly with dimensionality
We can also define tails for multivariate distributions
areas of extreme values with probability density less than some threshold
More complicated than univariate
and it only works for unimodal distributions with a single peak
For a multivariate Gaussian, we have the density
f(x) = 1 / ( √|Σ| · (2π)^(d/2) ) · exp( −½ · (x − μ) Σ⁻¹ (x − μ)ᵀ )
where Σ is the d-by-d covariance matrix, and |Σ| is its determinant
The exponent resembles the Mahalanobis distance…
Mahalanobis distance is defined as
Maha(x, μ, Σ) = √( (x − μ) Σ⁻¹ (x − μ)ᵀ )
where Σ is a d-by-d covariance matrix, and μ a mean vector
Essentially, it is the Euclidean distance after applying PCA and dividing by the standard deviation along each component
very useful in practice
e.g. for the example on the left, Maha(c, μ, Σ) > Maha(b, μ, Σ)
(Mahalanobis, 1936)
[Figure: scatter plot over two features showing an elongated point cloud, its centre, and two points b and c.]
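A minimal sketch of computing this global Mahalanobis score for every row of a data matrix; using the pseudo-inverse is an assumption for robustness against (near-)singular covariance matrices:

```python
import numpy as np

def mahalanobis_scores(X):
    """Mahalanobis distance of every row of X to the overall mean."""
    mu = X.mean(axis=0)
    Sigma_inv = np.linalg.pinv(np.cov(X, rowvar=False))
    diff = X - mu
    # (x - mu) Sigma^-1 (x - mu)^T, evaluated row-wise
    sq = np.einsum('ij,jk,ik->i', diff, Sigma_inv, diff)
    return np.sqrt(sq)
```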
Written in terms of the Mahalanobis distance, the exponent of the multivariate Gaussian density is half the squared Mahalanobis distance,
f(x) = 1 / ( √|Σ| · (2π)^(d/2) ) · exp( −½ · Maha(x, μ, Σ)² )
for the probability density to fall below a threshold, the Mahalanobis distance hence needs to be larger than a corresponding threshold.
Mahalanobis distance to the mean is an extremity score
larger values imply more extreme behavior
The probability of being extreme may be more insightful
how to model?
Mahalanobis distance considers axis-rotated and scaled data
each component along the principal components can be modeled as an independent standard Gaussian, which means the squared distance can be modeled by a χ² distribution with d degrees of freedom
points whose squared Mahalanobis distance lies far in the tail of this cumulative χ² distribution are potential outliers
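A minimal sketch of turning these distances into probabilities, assuming the Gaussian model above and reusing the mahalanobis_scores sketch: the squared Mahalanobis distance under a d-dimensional Gaussian follows a χ² distribution with d degrees of freedom, so its cumulative probability tells how far out in the tail a point lies:

```python
from scipy.stats import chi2

def extremity_probabilities(X, maha):
    """Cumulative chi^2 probability of each squared Mahalanobis distance."""
    d = X.shape[1]
    return chi2.cdf(maha ** 2, df=d)   # values close to 1 sit far out in the tail

# probs = extremity_probabilities(X, mahalanobis_scores(X))
```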
Extreme value analysis is a rather basic technique.
it only works when the data has a single-peaked distribution, and requires assuming a distribution (e.g. Gaussian)
Depth-based methods are very brittle in practice
do not scale well with dimensionality
Probabilistic Methods (Aggarwal Ch. 8.3)
Mahalanobis distance works well if there is a single peak
what if there are multiple?
We can generalise to multiple distributions using mixture modelling
to this end, we’ll re-employ EM clustering.
We assume the data was generated by a mixture of k distributions 𝒢₁ … 𝒢_k, and that the generation process was
1. select a mixture component 𝒢_j with prior probability a_j, where j ∈ {1 … k}
2. generate a data point from 𝒢_j
The probability of point x being generated by model ℳ is
p(x ∣ ℳ) = Σ_{j=1}^{k} a_j · f_j(x)
where f_j is the density of component 𝒢_j
outliers will naturally have low fit probabilities
[Figure: scatter plot over two features with two mixture components; two isolated points have a very low fit (outliers), whereas a point inside a component has a high fit.]
To find the parameters of ℳ, we need to optimise the log-likelihood of the data,
log p(D ∣ ℳ) = Σ_{x ∈ D} log p(x ∣ ℳ)
such that it is maximised
this we do using EM (see lecture V-1)
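A minimal sketch of mixture-based outlier scoring with scikit-learn's GaussianMixture, which runs EM internally; the number of components k (and using Gaussian components at all) are assumptions you must make, exactly as discussed on the surrounding slides:

```python
from sklearn.mixture import GaussianMixture

def mixture_outlier_scores(X, k=3):
    """Negative log-likelihood under a k-component Gaussian mixture fit with EM."""
    gmm = GaussianMixture(n_components=k, covariance_type='full').fit(X)
    log_fit = gmm.score_samples(X)     # log p(x | M) for every point
    return -log_fit                    # low fit probability = high outlier score
```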
Mixture modelling works very well
when we know the family of distributions of the components
when we have sufficient data to estimate their parameters
and it allows us to include background knowledge, e.g. correlations
In practice, however…
we do not know the number of components
we do not know the distributions
we do not have sufficient data
Due to overfitting we are likely to miss outliers
Cluster-based Methods (Aggarwal Ch. 8.4)
In both the probabilistic and the cluster-based approach we define outliers as points that deviate from the norm.
In the probabilistic approach the norm is a distribution
points with a low fit are outliers
In the cluster-based approach the norm is a clustering
points far away from these clusters are outliers
A simplistic approach is to say that every point that does not belong to a cluster is an outlier
many clustering algorithms claim to find outliers as a side-product; data points on the boundaries of clusters, however, are not real outliers
[Figure: example data set containing a small, isolated cluster of outliers, a ‘freak quadruplet’.]
The simple cluster-based approach to outlier detection
1. cluster your data
2. the distance of a point x to its closest centroid is its outlier score
Raw distances can be deceiving
what if clusters are of different density? what if clusters are of different shape?
We need a score that takes the context into account
Mahalanobis distance does consider shape and density
it is a global score, for single-peaked, unimodal distributions
We can, however, define a local Mahalanobis score
compute the mean vector μ_r and covariance matrix Σ_r per cluster C_r ∈ 𝒞
entry (i, j) of Σ_r is the covariance between dimensions i and j within cluster C_r
Maha(x, μ_r, Σ_r) = √( (x − μ_r) Σ_r⁻¹ (x − μ_r)ᵀ )
we can directly use it as an outlier score, higher is weirder
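A minimal sketch of this cluster-based local Mahalanobis score; using k-means for the clustering and a pseudo-inverse for the per-cluster covariance are assumptions for illustration, and very small clusters will give unreliable estimates:

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_mahalanobis_scores(X, n_clusters=5):
    """Mahalanobis distance of each point to the mean of its own cluster."""
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(X)
    scores = np.empty(len(X))
    for c in range(n_clusters):
        members = X[labels == c]
        mu = members.mean(axis=0)
        Sigma_inv = np.linalg.pinv(np.cov(members, rowvar=False))
        diff = members - mu
        scores[labels == c] = np.sqrt(np.einsum('ij,jk,ik->i', diff, Sigma_inv, diff))
    return scores   # higher is weirder, comparable across clusters of different shape
```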
Cluster-based anomaly detection makes intuitive sense
it works decently in practice, even when not tailored for outliers; it can detect small clusters of outliers
Noise is a big problem
clustering techniques do not distinguish between ambient noise and isolated points
neither appears in a cluster, so both are reported as outliers; neither global nor local distances to centroids help
Distance-based Methods (Aggarwal Ch. 8.5)
We can identify outliers instance-based, as opposed to model-based, by using a distance measure.
“The distance-based outlier score of an object x is its distance to its k-th nearest neighbor.”
In practice
you choose a (meaningful) distance measure dist
you choose a (low) number of neighbors k
object x is not part of its own k-nearest neighbors
this avoids scores of 0 for k = 1
Distance-based methods
finer granularity than clustering or model-based methods
Let V_k(x) be the distance of x to its k-th nearest neighbor,
V_k(x) = dist(x, y), where y is the k-th nearest neighbor of x
Naively computing V_k(x) takes O(n)
for all data points x ∈ D the cost is O(n²), which is infeasible for large D
We can speed this up by indexing
but for high-dimensional data the effectiveness of indexes degrades
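A minimal sketch of the exact k-NN outlier score using scikit-learn's NearestNeighbors index (Euclidean by default); querying k+1 neighbours and dropping the first column excludes each point from its own neighbourhood, as required above:

```python
from sklearn.neighbors import NearestNeighbors

def knn_outlier_scores(X, k=5):
    """Distance of each point to its k-th nearest neighbor, excluding the point itself."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    dists, _ = nn.kneighbors(X)    # column 0 is the point itself, at distance 0
    return dists[:, k]
```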
First, we choose a sample S of s objects from D, with r < s ≪ n
we compute all pairwise distances between S and D
this costs O(s·n) ≪ O(n²)
We then have the exact score V_k(x) for every object x ∈ S
We have a lower bound L on the top-r outlier score
the r-th largest exact score within the sample, in pseudo-math: L = sort_desc( { V_k(x) ∣ x ∈ S } )_r
any object with a lower score will not be in the top-r for D
We have an upper bound for every other object
the k-nearest-neighbor distance of object x ∈ D − S to the objects in S, or, in pseudo-math: U_k(x) = sort_asc( { dist(x, y) ∣ y ∈ S } )_k
Usually, we only want the top-r most outlying objects
these we can compute much more efficiently, using two tricks
Trick 1: compute lower and upper bounds by sampling
compute the full score for a sample of s > r objects
this gives a lower bound on the score of the r-th strongest outlier, and an upper bound on the score of every object not in the sample
Trick 2: early termination
compute the full score only for candidates that can still beat the r-th best score
No x ∈ D with upper bound U_k(x) < L will be in the top-r
we do not have to compute its distances to D − S
we only have to compute full scores for the candidate set R = { x ∈ D − S ∣ U_k(x) > L } ⊆ D − S
The top-r ranked outliers in R ∪ S are returned as the final output. In practice, as |R ∪ S| ≪ |D|, this saves a lot of computation
especially if D is clustered, and especially if we chose S wisely/luckily
at least one point per cluster, and some points in the sparse regions
How would you choose S?
We can do better, however. While computing the scores for x ∈ R, every time we discover an object with V_k(x) > L we should update L
meaning, we can prune further candidates from R
For every x ∈ R we start with the upper bound U_k(x)
initially based on the distances to S, but while computing the distances to D − S, we keep updating it
once U_k(x) drops below L we can terminate early
Algorithm TOP-r-kNN-OUTLIERS(data D, distance dist, sample size s)
  S ← sample(D, s)
  compute all distances between S and D
  L ← the r-th largest exact score V_k(x) over x ∈ S
  R ← { x ∈ D − S ∣ U_k(x) > L }
  for each x ∈ R do
    for each y ∈ D − S do
      update the current k-nearest-neighbor distance estimate V_k(x) by computing the distance of y to x
      if V_k(x) ≤ L then break
    if V_k(x) > L then
      include x in the current r best outliers
      update L to the new r-th best outlier score
  return the top-r outliers from S ∪ R
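A minimal sketch of this sampling-plus-early-termination scheme in Python, mirroring the pseudocode above; Euclidean distance, s ≥ max(k, r), and all variable and function names are assumptions for illustration, not a reference implementation:

```python
import numpy as np

def topr_knn_outliers(X, k=5, r=10, s=100, seed=0):
    """Top-r distance-based outliers via sampling and early termination."""
    rng = np.random.default_rng(seed)
    n = len(X)
    samp = rng.choice(n, size=s, replace=False)
    rest = np.setdiff1d(np.arange(n), samp)

    # distances between the sample and all points: cost O(s * n)
    d_samp_all = np.linalg.norm(X[samp][:, None, :] - X[None, :, :], axis=2)

    # exact scores V_k for sampled points (column 0 of each sorted row is the point itself)
    exact = np.sort(d_samp_all, axis=1)[:, k]
    # upper bounds U_k for the rest: k-th NN distance measured within the sample only
    d_rest_samp = d_samp_all[:, rest].T
    upper = np.sort(d_rest_samp, axis=1)[:, k - 1]

    top = sorted(zip(exact, samp), reverse=True)[:r]   # current best (score, index) pairs
    L = top[-1][0]                                     # lower bound on the top-r score

    for pos, i in enumerate(rest):
        if upper[pos] <= L:
            continue                                   # pruned without any extra work
        knn = np.sort(d_rest_samp[pos])[:k]            # running k smallest distances seen
        for j in rest:
            if j == i:
                continue
            dij = np.linalg.norm(X[i] - X[j])
            if dij < knn[-1]:
                knn[-1] = dij
                knn.sort()
                if knn[-1] <= L:
                    break                              # early termination: cannot make top-r
        if knn[-1] > L:
            top = sorted(top + [(knn[-1], i)], reverse=True)[:r]
            L = top[-1][0]                             # tighten the bound, prune more later
    return top                                         # (score, index), most outlying first
```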
Raw distance measures don’t always identify outliers well
they do not measure the intrinsic distances
e.g. Euclidean distance does not, but neither does Mahalanobis
The Local Outlier Factor algorithm (LOF) is one of the most used* local outlier detection techniques
* or at least, most often compared against and beaten by more modern methods
[Figure: example data with clusters of different density and different shapes; points related to a sparse cluster and points related to a long, elongated cluster are marked.]
In LOF we consider our data locally. That is, for a point x we primarily work with its k-nearest neighborhood. Let N_k(x) be the set of objects that are at most as far away as the k-th nearest neighbor of x,
N_k(x) = { y ∈ D ∣ dist(x, y) ≤ V_k(x) }
Usually N_k(x) will contain k points, sometimes more (in case of ties).
When a point y is in a dense area of the data, V_k(y) will be low, i.e. there are many points close to it. When two points x and y are in each other’s k-nearest neighborhoods, dist(x, y) ≤ min( V_k(x), V_k(y) ).
We can measure how outlying an object x is with regard to an object y by considering the reachability distance between x and y,
R_k(x, y) = max( dist(x, y), V_k(y) )
when x is not in the k-nearest neighborhood of y, this is simply dist(x, y); when x is in the k-nearest neighborhood of y, it is V_k(y). Here k essentially acts as a data-driven smoothing parameter.
We then compute the average reachability distance between x and its k-nearest neighbors,
AR_k(x) = mean_{y ∈ N_k(x)} R_k(x, y)
which will be maximal when the nearest neighbors of x are at the edge of a dense cluster
Now, finally, given a database D, a distance measure dist, and a number of neighbors k, we define the local outlying factor of a point x as
LOF_k(x) = mean_{y ∈ N_k(x)} AR_k(x) / AR_k(y)
For objects inside a cluster, it will take a value close to 1, regardless of the density of that cluster.
For outliers, LOF_k(x) ≫ 1: because x is not in the nearest neighborhoods of its own nearest neighbors, the denominator will be much smaller than the numerator
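A minimal sketch using scikit-learn's LocalOutlierFactor, which implements this reachability-based LOF score; n_neighbors plays the role of k, and its value is an assumption you must choose:

```python
from sklearn.neighbors import LocalOutlierFactor

def lof_scores(X, k=20):
    """LOF scores for the points in X; values well above 1 indicate outliers."""
    lof = LocalOutlierFactor(n_neighbors=k)
    lof.fit(X)
    # scikit-learn stores the negated LOF, so flip the sign to match the definition above
    return -lof.negative_outlier_factor_
```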
LOF works well in practice
even with raw (Euclidean) distance measures, and regardless of the number and density of clusters
Why?
because of the relative normalisation in the denominator it considers local information and can adapt to the local density
LOF is not perfect
it is O(n²) for high-dimensional data, O(n log n) when we can use an index
many variants exist for different cluster shapes
[Figure: LOF scores on the example data; a point related to the sparse cluster is flagged as an outlier by LOF.]
Euclidean distance-based neighborhoods have a bias towards spherical clusters
single-link clustering does not have this disadvantage
We can fix this by defining N_k(x) using single-link
start with N_1(x) = {x}, and then iteratively add the point that is closest to any point already in the neighborhood,
N_{k+1}(x) = N_k(x) ∪ { argmin_{y ∈ D, y ∉ N_k(x)} min_{z ∈ N_k(x)} dist(y, z) }
We can also again employ the local Mahalanobis distance
simply compute Maha(·) over N_k(x), i.e. LM_k(x) = Maha(x, μ_k, Σ_k), with μ_k and Σ_k the mean and covariance of N_k(x)
this tells how extreme a point x is with regard to its local neighborhood
no need to normalise, as Maha does that behind the scenes!
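A minimal sketch of this local Mahalanobis variant, assuming a plain Euclidean k-NN neighborhood (rather than the single-link construction) and a pseudo-inverse for numerical robustness:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def local_mahalanobis_scores(X, k=20):
    """Mahalanobis distance of each point to the mean and covariance of its k-NN neighborhood."""
    _, idx = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)
    scores = np.empty(len(X))
    for i, neigh in enumerate(idx[:, 1:]):            # drop the point itself
        mu = X[neigh].mean(axis=0)
        Sigma_inv = np.linalg.pinv(np.cov(X[neigh], rowvar=False))
        diff = X[i] - mu
        scores[i] = np.sqrt(diff @ Sigma_inv @ diff)
    return scores
```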
[Figure: comparison of LOF and local Mahalanobis scores on the example data; points related to the sparse cluster and points related to the long cluster are highlighted.]
Outliers are generated by a different process
not noise, but ‘nuggets of knowledge’, identifying exceptions in your data
Discovering outliers is non-trivial
it reduces to the core question of data mining: what is normal?
We have seen four different classic approaches
extreme value analysis, probabilistic, cluster-based, and distance-based methods
Discovering outliers in complex data is very challenging
what does outlying mean in high-dimensional data?
what does outlying mean in a graph?