Anomaly Detection Qi Liu University of Science and Technology of - - PowerPoint PPT Presentation
Anomaly Detection
Qi Liu, University of Science and Technology of China
qiliuql@ustc.edu.cn
Data Mining Tasks …
Data
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
11   No      Married         60K             No
12   Yes     Divorced        220K            No
13   No      Single          85K             Yes
14   No      Married         75K             No
15   No      Single          90K             Yes
Anomaly/Outlier Detection
- What are anomalies/outliers?
  - The set of data points that are considerably different than the remainder of the data
- Natural implication is that anomalies are relatively rare
  - One in a thousand occurs often if you have lots of data
  - Context is important, e.g., freezing temps in July
- Can be important or a nuisance
  - 10 foot tall 2 year old
  - Unusually high blood pressure
Importance of Anomaly Detection
Ozone Depletion History
- In 1985 three researchers (Farman, Gardiner and Shanklin) were puzzled by data gathered by the British Antarctic Survey showing that ozone levels for Antarctica had dropped 10% below normal levels
- Why did the Nimbus 7 satellite, which had instruments aboard for recording ozone levels, not record similarly low ozone concentrations?
- The ozone concentrations recorded by the satellite were so low they were being treated as outliers by a computer program and discarded!
Sources:
http://exploringdata.cqu.edu.au/ozone.html
http://www.epa.gov/ozone/science/hole/size.html
Causes of Anomalies
- Data from different classes
  - Measuring the weights of oranges, but a few grapefruit are mixed in
- Natural variation
  - Unusually tall people
- Data errors
  - 200 pound 2 year old
Distinction Between Noise and Anomalies
- Noise is erroneous, perhaps random, values or contaminating objects
  - Weight recorded incorrectly
  - Grapefruit mixed in with the oranges
- Noise doesn’t necessarily produce unusual values or objects
- Noise is not interesting
- Anomalies may be interesting if they are not a result of noise
- Noise and anomalies are related but distinct concepts
General Issues: Number of Attributes
- Many anomalies are defined in terms of a single attribute
  - Height
  - Shape
  - Color
- Can be hard to find an anomaly using all attributes
  - Noisy or irrelevant attributes
  - Object is only anomalous with respect to some attributes
- However, an object may not be anomalous in any one attribute
General Issues: Anomaly Scoring
- Many anomaly detection techniques provide only a binary categorization
  - An object is an anomaly or it isn’t
  - This is especially true of classification‐based approaches
- Other approaches assign a score to all points
  - This score measures the degree to which an object is an anomaly
  - This allows objects to be ranked
- In the end, you often need a binary decision
  - Should this credit card transaction be flagged?
  - Still useful to have a score
- How many anomalies are there?
Other Issues for Anomaly Detection
- Find all anomalies at once or one at a time
  - Swamping
  - Masking
- Evaluation
  - How do you measure performance?
  - Supervised vs. unsupervised situations
- Efficiency
- Context
  - Professional basketball team
Variants of Anomaly Detection Problems
- Given a data set D, find all data points x ∈ D with anomaly scores greater than some threshold t
- Given a data set D, find all data points x ∈ D having the top‐n largest anomaly scores
- Given a data set D, containing mostly normal (but unlabeled) data points, and a test point x, compute the anomaly score of x with respect to D
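The first two variants differ only in how the scores are turned into a decision. A minimal sketch, assuming the anomaly scores have already been computed by some detector; the function names `above_threshold` and `top_n` are illustrative and not from the slides:

```python
def above_threshold(scores, t):
    """Indices of all points whose anomaly score is greater than threshold t."""
    return [i for i, s in enumerate(scores) if s > t]


def top_n(scores, n):
    """Indices of the n points with the largest anomaly scores, highest first."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:n]


# Hypothetical scores for five points.
scores = [0.1, 0.9, 0.3, 2.5, 0.2]
print(above_threshold(scores, 0.5))  # indices with score > 0.5
print(top_n(scores, 2))              # indices of the two highest scores
```

The threshold variant needs a meaningful scale for the scores, while the top-n variant only needs a ranking.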
Model‐Based Anomaly Detection
- Build a model for the data and see
- Unsupervised
  - Anomalies are those points that don’t fit well
  - Anomalies are those points that distort the model
  - Examples:
    - Statistical distribution
    - Clusters
    - Regression
    - Geometric
    - Graph
- Supervised
  - Anomalies are regarded as a rare class
  - Need to have training data
Additional Anomaly Detection Techniques
- Proximity‐based
  - Anomalies are points far away from other points
  - Can detect this graphically in some cases
- Density‐based
  - Low density points are outliers
- Pattern matching
  - Create profiles or templates of atypical but important events or objects
  - Algorithms to detect these patterns are usually simple and efficient
Graphical Approaches
- Boxplots or scatter plots
- Limitations
  - Not automatic
  - Subjective
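A boxplot can be made less subjective by applying its whisker rule numerically. The sketch below assumes the common convention of flagging values more than 1.5 interquartile ranges outside the quartiles; the slides do not state this rule, and the function name `boxplot_outliers` and the `whisker` parameter are illustrative.

```python
from statistics import quantiles


def boxplot_outliers(values, whisker=1.5):
    """Values lying outside the boxplot whiskers (assumed 1.5 * IQR rule)."""
    q1, _, q3 = quantiles(values, n=4)  # quartiles, default 'exclusive' method
    iqr = q3 - q1
    lower, upper = q1 - whisker * iqr, q3 + whisker * iqr
    return [v for v in values if v < lower or v > upper]


print(boxplot_outliers([1, 2, 3, 4, 5, 6, 7, 8, 9, 100]))
```

The choice of 1.5 is itself a parameter, so some subjectivity remains.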
Convex Hull Method
- Extreme points are assumed to be outliers
- Use convex hull method to detect extreme values
- What if the outlier occurs in the middle of the data?
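A minimal sketch of finding the extreme points of a two-dimensional data set, assuming Andrew's monotone chain algorithm as the hull method (the slides do not name one). The function names are illustrative.

```python
def cross(o, a, b):
    """Z-component of the cross product of vectors o->a and o->b."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points):
    """Hull vertices in counter-clockwise order (collinear points dropped)."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    # The last point of each chain is the first point of the other.
    return lower[:-1] + upper[:-1]


points = [(0, 0), (4, 0), (4, 4), (0, 4), (2, 2), (1, 3)]
print(convex_hull(points))  # only the four corners are extreme points
```

The interior points (2, 2) and (1, 3) can never be reported, which is exactly the limitation the last bullet raises.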
Statistical Approaches
- Probabilistic definition of an outlier: An outlier is an object that has a low probability with respect to a probability distribution model of the data.
- Usually assume a parametric model describing the distribution of the data (e.g., normal distribution)
- Apply a statistical test that depends on
  - Data distribution
  - Parameters of distribution (e.g., mean, variance)
  - Number of expected outliers (confidence limit)
- Issues
  - Identifying the distribution of a data set
    - Heavy tailed distribution
  - Number of attributes
  - Is the data a mixture of distributions?
Normal Distributions
[Figures: one-dimensional Gaussian; two-dimensional Gaussian over axes x and y, shaded by probability density]
Grubbs’ Test
- Detect outliers in univariate data
- Assume data comes from normal distribution
- Detects one outlier at a time, remove the outlier, and repeat
  - H0: There is no outlier in data
  - HA: There is at least one outlier
- Grubbs’ test statistic:
$$G = \frac{\max |X - \bar{X}|}{s}$$
- Reject H0 if:
$$G > \frac{N-1}{\sqrt{N}} \sqrt{\frac{t^2_{(\alpha/N,\,N-2)}}{N - 2 + t^2_{(\alpha/N,\,N-2)}}}$$
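A minimal sketch of one round of the test, assuming s is the sample standard deviation (N - 1 in the denominator). The critical value of the t distribution with N - 2 degrees of freedom at significance α/N is not computed here; the caller passes it in as the hypothetical parameter `t_crit`, for example from a statistical table. Function names are illustrative.

```python
import math


def grubbs_statistic(xs):
    """G = max |X - mean| / s, with s the sample standard deviation."""
    n = len(xs)
    mean = sum(xs) / n
    s = math.sqrt(sum((x - mean) ** 2 for x in xs) / (n - 1))
    return max(abs(x - mean) for x in xs) / s


def grubbs_critical(n, t_crit):
    """Rejection bound for G given the t critical value (supplied by caller)."""
    t2 = t_crit ** 2
    return (n - 1) / math.sqrt(n) * math.sqrt(t2 / (n - 2 + t2))


def has_outlier(xs, t_crit):
    """True if H0 (no outlier) is rejected for this sample."""
    return grubbs_statistic(xs) > grubbs_critical(len(xs), t_crit)
```

To follow the slide's procedure, remove the point farthest from the mean whenever `has_outlier` returns True and repeat with the critical value for the reduced N.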
Statistical‐based – Likelihood Approach
- Assume the data set D contains samples from a mixture of two probability distributions:
  - M (majority distribution)
  - A (anomalous distribution)
- General Approach:
  - Initially, assume all the data points belong to M
  - Let Lt(D) be the log likelihood of D at time t
  - For each point xt that belongs to M, move it to A
    - Let Lt+1 (D) be the new log likelihood.
    - Compute the difference, Δ = Lt(D) – Lt+1 (D)
    - If Δ > c (some threshold), then xt is declared as an anomaly and moved permanently from M to A
Statistical‐based – Likelihood Approach
- Data distribution, D = (1 – λ) M + λ A
- M is a probability distribution estimated from data
  - Can be based on any modeling method (naïve Bayes, maximum entropy, etc.)
- A is initially assumed to be uniform distribution
- Likelihood at time t:
$$L_t(D) = \prod_{i=1}^{N} P_D(x_i) = \left( (1-\lambda)^{|M_t|} \prod_{x_i \in M_t} P_{M_t}(x_i) \right) \left( \lambda^{|A_t|} \prod_{x_i \in A_t} P_{A_t}(x_i) \right)$$
$$LL_t(D) = |M_t| \log(1-\lambda) + \sum_{x_i \in M_t} \log P_{M_t}(x_i) + |A_t| \log \lambda + \sum_{x_i \in A_t} \log P_{A_t}(x_i)$$
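A minimal one-dimensional sketch of the likelihood approach, under several assumptions that are not in the slides: M is modeled as a Gaussian re-estimated from its current members, A is uniform over the observed data range, and the change is measured as the gain LL_{t+1} - LL_t, so a point is moved when placing it in A raises the log likelihood by more than c. The values of `lam` and `c` are arbitrary illustrations.

```python
import math


def gaussian_logpdf(x, mu, sigma):
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)


def log_likelihood(M, A, lam, lo, hi):
    """LL = |M| log(1-lam) + sum log P_M(x) + |A| log(lam) + sum log P_A(x)."""
    mu = sum(M) / len(M)
    sigma = math.sqrt(sum((x - mu) ** 2 for x in M) / len(M)) or 1e-9
    ll = len(M) * math.log(1 - lam) + sum(gaussian_logpdf(x, mu, sigma) for x in M)
    if A:
        # Assumed uniform density 1 / (hi - lo) for every point in A.
        ll += len(A) * (math.log(lam) + math.log(1.0 / (hi - lo)))
    return ll


def likelihood_anomalies(data, lam=0.05, c=1.0):
    lo, hi = min(data), max(data)
    M, A = list(data), []
    for x in list(data):
        ll_t = log_likelihood(M, A, lam, lo, hi)
        M_next = list(M)
        M_next.remove(x)  # tentatively move x from M to A
        ll_next = log_likelihood(M_next, A + [x], lam, lo, hi)
        if ll_next - ll_t > c:  # assumed sign convention: gain in LL
            M, A = M_next, A + [x]
    return A


data = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 9.7, 10.3, 25.0]
print(likelihood_anomalies(data))
```

Because M is re-estimated after each move, removing one extreme value tightens the Gaussian, which is why points are tested one at a time.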
Strengths/Weaknesses of Statistical Approaches
- Firm mathematical foundation
- Can be very efficient
- Good results if distribution is known
- In many cases, data distribution may not be known
- For high dimensional data, it may be difficult to estimate the true distribution
- Anomalies can distort the parameters of the distribution
Distance‐Based Approaches
- Several different techniques
- An object is an outlier if a specified fraction of the objects is more than a specified distance away (Knorr, Ng 1998)
  - Some statistical definitions are special cases of this
- The outlier score of an object is the distance to its kth nearest neighbor
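A minimal sketch of the kth-nearest-neighbor outlier score, assuming Euclidean distance and a brute-force search over all pairs; the function name `knn_outlier_scores` is illustrative.

```python
import math


def knn_outlier_scores(points, k):
    """Outlier score of each point: distance to its kth nearest neighbor."""
    scores = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(dists[k - 1])
    return scores


points = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5)]
scores = knn_outlier_scores(points, k=1)
print(scores.index(max(scores)))  # index of the most isolated point
```

Comparing every pair of points is what makes the method quadratic in the number of points.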
One Nearest Neighbor ‐ One Outlier
[Figure: data points shaded by outlier score; point D labeled]
One Nearest Neighbor ‐ Two Outliers
[Figure: data points shaded by outlier score; point D labeled]
Five Nearest Neighbors ‐ Small Cluster
[Figure: data points shaded by outlier score; point D labeled]
Five Nearest Neighbors ‐ Differing Density
[Figure: data points shaded by outlier score; point D labeled]
Strengths/Weaknesses of Distance‐Based Approaches
- Simple
- Expensive – O(n^2)
- Sensitive to parameters
- Sensitive to variations in density
- Distance becomes less meaningful in high‐dimensional space
Density‐Based Approaches
- Density‐based Outlier: The outlier score of an object is the inverse of the density around the object.
  - Can be defined in terms of the k nearest neighbors
  - One definition: Inverse of distance to kth neighbor
  - Another definition: Inverse of the average distance to k neighbors
  - DBSCAN definition
- If there are regions of different density, this approach can have problems
Relative Density
- Consider the density of a point relative to that of its k nearest neighbors
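A minimal sketch of a relative-density score, under stated assumptions: density is the inverse of the average distance to the k nearest neighbors, and the score is the average density of those neighbors divided by the point's own density, so larger values mean more anomalous. This is a simplified ratio, not the full LOF definition, and it assumes no duplicate points (a zero average distance would divide by zero). Function names are illustrative.

```python
import math


def knn(points, i, k):
    """Indices of the k nearest neighbors of points[i], excluding itself."""
    others = (j for j in range(len(points)) if j != i)
    return sorted(others, key=lambda j: math.dist(points[i], points[j]))[:k]


def density(points, i, k):
    """Inverse of the average distance from points[i] to its k neighbors."""
    nbrs = knn(points, i, k)
    return k / sum(math.dist(points[i], points[j]) for j in nbrs)


def relative_density_score(points, i, k):
    """Average neighbor density divided by the point's own density."""
    nbrs = knn(points, i, k)
    neighbor_avg = sum(density(points, j, k) for j in nbrs) / k
    return neighbor_avg / density(points, i, k)


points = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5)]
scores = [relative_density_score(points, i, 2) for i in range(len(points))]
print(scores)  # points inside the cluster score about 1; (5, 5) scores higher
```

Dividing by the neighbors' density makes the score comparable across regions of different density, which a plain inverse-density score is not.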
Relative Density Outlier Scores
[Figure: data points shaded by outlier score; labeled points A, C, D; annotated scores 6.85, 1.40, 1.33]
Density‐based: LOF approach
- For each point, compute the density of its local neighborhood
- Compute local outlier factor (LOF) of a sample p as the average of the ratios of the density of sample p and the density of its nearest neighbors
- Outliers are points with largest LOF value
- In the NN approach, p2 is not considered as outlier, while the LOF approach finds both p1 and p2 as outliers
[Figure: points p1 and p2, each marked with ×]
Strengths/Weaknesses of Density‐Based Approaches
- Simple
- Expensive – O(n^2)
- Sensitive to parameters
- Density becomes less meaningful in high‐dimensional space
Clustering‐Based Approaches
- Clustering‐based Outlier: An object is a cluster‐based outlier if it does not strongly belong to any cluster
  - For prototype‐based clusters, an object is an outlier if it is not close enough to a cluster center
  - For density‐based clusters, an object is an outlier if its density is too low
  - For graph‐based clusters, an object is an outlier if it is not well connected
- Other issues include the impact of outliers on the clusters and the number of clusters
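A minimal sketch of the prototype-based case, assuming the cluster centers are already known (for example from a clustering run outside this sketch). It returns both the distance to the closest centroid and a relative distance; the relative version divides by the median distance of the points assigned to the same centroid, which is an assumption about how "relative" is defined. The function name `centroid_scores` is illustrative.

```python
import math
from statistics import median


def centroid_scores(points, centroids):
    """Return (absolute, relative) distances to each point's closest centroid."""
    assigned = []
    for p in points:
        d, idx = min((math.dist(p, c), i) for i, c in enumerate(centroids))
        assigned.append((d, idx))
    # Median distance within each cluster, used to normalize (assumption).
    med = {}
    for i in range(len(centroids)):
        ds = [d for d, idx in assigned if idx == i]
        if ds:
            med[i] = median(ds)
    absolute = [d for d, _ in assigned]
    relative = [d / med[idx] if med[idx] > 0 else 0.0 for d, idx in assigned]
    return absolute, relative


# A loose cluster around (1, 1) and a compact one around (10.1, 10.1).
points = [(0, 0), (0, 2), (2, 0), (2, 2), (1, 6),
          (10, 10), (10, 10.2), (10.2, 10), (10.2, 10.2), (10.1, 10.6)]
absolute, relative = centroid_scores(points, [(1, 1), (10.1, 10.1)])
print(absolute[4], absolute[9])  # raw distances differ by a factor of ten
print(relative[4], relative[9])  # relative distances are about equal
```

The relative score treats a point just outside a compact cluster like a point well outside a loose one, which the raw distance does not.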
Distance of Points from Closest Centroids
[Figure: data points shaded by outlier score; labeled points C: 4.6, D: 0.17, A: 1.2]
Relative Distance of Points from Closest Centroid
[Figure: data points shaded by outlier score; labeled points C: 76.9, D: 15.0, A: 13.1]
Strengths/Weaknesses of Cluster‐Based Approaches
- Simple
- Many clustering techniques can be used
- Can be difficult to decide on a clustering technique
- Can be difficult to decide on number of clusters
- Outliers can distort the clusters
Co‐anomaly Event Detection in Multiple Temperature Series